Step 5: Implementation

Your fix is the smallest diff that makes the failing test pass without breaking any other test. Period. Anything else — a refactor you noticed, a TODO you want to address, a better name for a field — belongs in a separate JIRA, not this PR.

Committer reviewers' single biggest objection to first-time contributors is scope creep. The second is API hygiene. This chapter is about both.


Minimum-Diff Principle

The fix should change as few lines as possible while addressing the root cause identified in Step 4. Everything that survives compilation but is not strictly required to fix the bug is review surface area. Review surface area is the enemy of "merged this week."

Too much

-  public void handleVertexCompleted(VertexEvent event) {
-    if (recoveryData != null) {
-      handleRecovery(event);
-      return;
-    }
-    completedTaskCount++;
-    if (completedTaskCount == numTasks) {
-      transitionToSucceeded();
-    }
-  }
+  // Refactored to use stream API for clarity
+  public void handleVertexCompleted(final VertexEvent event) {
+    Optional.ofNullable(recoveryData)
+      .filter(rd -> isReplayingRecovery())
+      .ifPresentOrElse(
+          rd -> handleRecovery(event),
+          () -> {
+            this.completedTaskCount = this.completedTaskCount + 1;
+            this.maybeTransitionToSucceeded();
+          });
+  }
+
+  private void maybeTransitionToSucceeded() {
+    if (completedTaskCount == numTasks) {
+      transitionToSucceeded();
+    }
+  }

This will be rejected. You changed five things (stream API, final keyword, method extraction, control-flow shape, formatting). A committer cannot tell which change is the actual fix without re-deriving the root cause from scratch.

Just right

   public void handleVertexCompleted(VertexEvent event) {
-    if (recoveryData != null) {
+    if (recoveryData != null && isReplayingRecovery()) {
       handleRecovery(event);
       return;
     }
     completedTaskCount++;
     if (completedTaskCount == numTasks) {
       transitionToSucceeded();
     }
   }

One line. The change matches the root-cause statement verbatim. A reviewer reads it, opens the root-cause doc, agrees in 30 seconds.

The extracted helper, the final keyword, the stream rewrite — all may be good ideas. File them as separate JIRAs after this lands.

The Boy Scout rule does NOT apply

In a green-field project, "leave the campground cleaner than you found it" is fine. In Apache project review, drive-by cleanups block your fix because they expand the review and trigger objections you do not need to deal with to land the actual bug fix. Resist the urge.


Where Does the Fix Go? A Decision Tree

Is the bug a check that should have rejected an input but didn't?
    -> Guard condition (likely in a setter or builder).
       Example: TezConfiguration.validate(), DAG.verify().

Is the bug a wrong state machine transition?
    -> State-machine transition table edit.
       Look for stateMachineFactory.addTransition() in the affected *Impl class.
       The fix is usually adding/removing a transition or changing its target state.

Is the bug a config key being read at the wrong place or with the wrong default?
    -> Config validation in the constructor of the class that reads it.
       Or a fix to where conf.get() / conf.getInt() is called.

Is the bug a logic error in business code (wrong arithmetic, wrong comparator,
missing close())?
    -> Logic bug. Fix is local to the offending method.
       Add a test that asserts the corrected behavior.

Is the bug a race?
    -> First, prove it is actually a race with DrainDispatcher. Most "races"
       turn out to be logic bugs that *look* race-y because event ordering
       is non-obvious.
    -> If genuinely a race: usually a missing dispatcher.await, a missing
       volatile, or a transition guard that isn't atomic with a counter
       increment. Synchronize the smallest critical section.

Is the bug a memory issue (OOM, off-heap leak)?
    -> Almost never in scope for a first Capstone. Pause and consult a committer.

Configuration Keys: The Right Way

You will be tempted to "add a knob" — a new tez.foo.bar flag that defaults to the old (buggy) behavior, lets users opt in to the fix. Resist. Knobs are an admission that you don't trust your fix. If your fix is correct, it should be the new default; if it isn't, fix the fix, not the user's configuration burden.

When a knob IS justified:

  • The fix changes a performance-sensitive default that may regress some users.
  • The fix changes user-visible output format (release-note required).
  • The fix is gated on a long-deprecation window and the old behavior must remain available for one or two releases.

When you DO add a key, conform to Tez convention. Read:

grep -n "TEZ_AM\|TEZ_TASK\|TEZ_RUNTIME" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java \
  | head -40

You will see the pattern:

/**
 * Maximum number of times an AM can attempt to launch a task before failing
 * the task.
 * <p>
 * Default: {@link #TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT}.
 *
 * @since 0.9.0
 */
@ConfigurationScope(Scope.AM)
@ConfigurationProperty(type = "integer")
public static final String TEZ_AM_TASK_MAX_FAILED_ATTEMPTS =
    TEZ_AM_PREFIX + "task.max.failed.attempts";
public static final int TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT = 4;

Mandatory elements for any new key:

  • Javadoc that explains what the knob does and when to change it.
  • @since X.Y.Z matching the next release version.
  • @ConfigurationScope (AM, VERTEX, TASK, CLIENT).
  • @ConfigurationProperty(type = "integer" / "long" / "boolean" / "string").
  • A _DEFAULT constant alongside.
  • Use the right prefix constant (TEZ_AM_PREFIX, TEZ_RUNTIME_PREFIX, etc.).
  • Add to tez-api/src/main/resources/META-INF/services/... if the doc-gen needs to pick it up (check existing keys to see if their config-doc generator catches up automatically or needs manual entries).

A new key that violates any of these will fail review.


Tez Coding Style

Read the existing class you are editing. Match its style exactly. The project-wide rules below are necessary but not sufficient — the file-local conventions matter just as much.

Logging

Always slf4j, never log4j directly, never System.out:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOG = LoggerFactory.getLogger(VertexImpl.class);

LOG.info("Vertex {} transitioned from {} to {} on event {}",
    getName(), oldState, newState, event.getType());

Use {} parameterization, never string concatenation in log args. Use the exception form LOG.error("Failed to schedule task {}", taskId, ex) rather than concatenating ex.toString().

Preconditions

Tez uses Guava Preconditions heavily. Use it for invariants and argument checks:

import com.google.common.base.Preconditions;

Preconditions.checkNotNull(event, "event must not be null");
Preconditions.checkArgument(parallelism > 0,
    "parallelism must be positive, got %s for vertex %s", parallelism, vertexName);
Preconditions.checkState(getState() == VertexState.RUNNING,
    "Vertex %s must be RUNNING to receive %s, was %s",
    getName(), event.getType(), getState());

The variadic %s form is preferable to string concatenation because it is free when the check passes.

Exception messages

Always include the context: which vertex, which task ID, which state, which event. Diagnosing a Tez bug from a stack trace alone is hard enough; an exception message that just says "invalid state" is hostile.

Bad:

throw new IllegalStateException("invalid state");

Good:

throw new IllegalStateException(String.format(
    "Vertex %s received event %s in state %s, which is not legal. "
        + "Expected one of [RUNNING, INITED].",
    getName(), event.getType(), getState()));

Forbidden

  • System.out.println / System.err.println (use LOG).
  • e.printStackTrace() (use LOG.error("...", e)).
  • Thread.sleep in production code unless you have a // TEZ-NNNN: justification comment AND a committer agreed in review.
  • New synchronized methods on hot paths — discuss in the JIRA before adding.
  • Adding new dependencies to pom.xml without discussion. This is a major re-review trigger.

Imports

  • No wildcard imports (import foo.bar.*;). The project's checkstyle catches these and you will fail precommit.
  • Group order: java, javax, org, com, third-party, project. Most IDEs handle this automatically.

Tests

Discussed fully in Step 6, but: every fix must come with at least one test that fails on master and passes with your fix. No test, no merge.


Building Incrementally

Do not try to write the whole fix and run the whole test suite. That feedback loop is too slow. Instead:

# Tight loop: compile + run only the changed module's affected test.
mvn install -DskipTests -pl tez-api,tez-common -am -q && \
  mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q

# When that goes green, broaden:
mvn test -pl tez-dag -Dtest='TestVertex*' -q

# Finally, full module:
mvn test -pl tez-dag -q

If your fix touches tez-api, you have to rebuild every downstream module. The -am flag is your friend — "also make" upstream deps.


When You Get Stuck

Hard rule: if you have not made forward progress in three sessions, post on the JIRA. Format:

Status update: I have the repro from Step 2 passing/failing as expected. My
working hypothesis is <one-sentence>. I have tried:

1. <approach A> — does not work because <observed result>.
2. <approach B> — does not work because <observed result>.

I am unsure whether to (a) <option a> or (b) <option b>. The constraint I am
trying to satisfy is <invariant>. If anyone has context on whether <approach C>
was considered for a related JIRA, please share.

Reproducer is at <link to gist or branch>.

This is not failure. This is community engagement done right. Committers respect contributors who ask sharp questions with context attached. They ignore contributors who ask "any update?" or "can you help?"


Validation / Self-check

Before advancing to Step 6:

  1. Your fix is committed to your branch as a single commit with the title TEZ-NNNN: <short summary> and a body that references the root-cause document.
  2. git diff origin/master --stat shows the smallest plausible diff (single digit files changed, double-digit lines at most for a typical bug fix).
  3. The diff contains zero unrelated changes (no formatting-only changes, no import reordering not caused by your edit, no Javadoc cleanups in methods you didn't touch).
  4. mvn install -DskipTests -pl <changed-module> -am -q succeeds.
  5. The Step 2 reproducer test now passes (you'll generalize the test in Step 6 — the repro itself is still the gating signal).
  6. If you added a TezConfiguration key, it has all required annotations, Javadoc, _DEFAULT constant, and @since tag.
  7. You have re-read your diff line by line and convinced yourself every line change is required by the root cause. Strike anything that isn't.