Step 6: Testing

Your reproducer from Step 2 is the minimum — it proves the bug existed. The tests in this step prove that the fix is correct, that it stays correct, and that the next person who edits this code path will notice if they break it again. A good test suite is the most durable artifact you ship.

Two kinds of tests are required: unit tests using a controlled dispatcher (fast, deterministic, surgical) and at least one integration test on MiniTezCluster (slow, realistic, end-to-end). Both. Always both.


Unit Tests with DrainDispatcher

The single most important Tez test pattern is synchronous, deterministic state-machine testing. Read the canonical examples top to bottom before you write your own:

~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java

Each is 1000+ lines. They are not light reading. They are also the only authoritative source on what is and isn't testable at the unit layer.

What DrainDispatcher Does

DrainDispatcher is Hadoop's test dispatcher (org.apache.hadoop.yarn.event.DrainDispatcher, from hadoop-yarn-common's test sources). It extends AsyncDispatcher: an event you hand to its event handler is queued and processed on the dispatcher's own thread. await() blocks the calling thread until that queue has fully drained, so every handler has run by the time await() returns. This gives you two superpowers:

  1. Deterministic event ordering. Events are handled one at a time in queue order, so if you dispatch A, dispatch B, then await, A's handler completed before B's started and both finished before your assertions run.
  2. No races between the test thread and the dispatcher thread. Bugs reproduce on every machine, not just under contention.
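
To make the dispatch-then-await contract concrete, here is a toy dispatcher in plain Java. It is a sketch for illustration only and is not Hadoop's DrainDispatcher: the class name ToyDrainDispatcher, its register/dispatch/await methods, and the choice to drain on the calling thread are all assumptions of this example. What it shows is the property the tests rely on: nothing is handled until await(), handlers run one at a time in queue order, and events queued by a handler are also drained before await() returns.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** Toy model of the dispatch-then-await contract. NOT Hadoop's DrainDispatcher. */
final class ToyDrainDispatcher<E> {
  private final Queue<E> queue = new ArrayDeque<>();
  private Consumer<E> handler = event -> { };

  /** Sets the single handler that receives every event. */
  void register(Consumer<E> handler) {
    this.handler = handler;
  }

  /** Enqueues only; no handler runs yet. */
  void dispatch(E event) {
    queue.add(event);
  }

  /** Runs handlers in FIFO order until the queue is empty, including follow-up events. */
  void await() {
    E event;
    while ((event = queue.poll()) != null) {
      handler.accept(event);
    }
  }

  /** Number of events still waiting to be handled. */
  int pending() {
    return queue.size();
  }
}

final class ToyDrainDispatcherDemo {
  /** Dispatches A then B, where handling A enqueues a follow-up, and returns the handling order. */
  static List<String> run() {
    List<String> handled = new ArrayList<>();
    ToyDrainDispatcher<String> dispatcher = new ToyDrainDispatcher<>();
    dispatcher.register(event -> {
      handled.add(event);
      if (event.equals("A")) {
        dispatcher.dispatch("A-followup"); // a handler may queue more work
      }
    });
    dispatcher.dispatch("A");
    dispatcher.dispatch("B");
    dispatcher.await(); // after this line, no event is left unhandled
    return handled;
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints [A, B, A-followup]
  }
}
```

The follow-up event lands behind B because it is queued while A is being handled. That ordering is exactly what a state-machine test has to account for when one transition emits further events.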

State-Transition Test Pattern

The template every state-machine unit test follows:

@Test
public void testV_TASK_COMPLETED_inRunningWithRecovery() throws Exception {
  // 1. Arrange: drive the SUT to the state under test.
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
  dispatcher.await();
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_START));
  dispatcher.await();
  assertEquals(VertexState.RUNNING, vertex.getState());
  // (Complete every task except the last here, so the event under test is the
  // one that finishes the vertex.)

  // 2. Set up the precondition that triggers the bug.
  // setRecoveryData, mockRecoveryData and getRecoveryHandlerCalled are
  // placeholders for whatever hooks your fix exposes.
  vertex.setRecoveryData(mockRecoveryData());

  // 3. Act: fire the event under test.
  TezTaskID lastTaskId = vertex.getTask(vertex.getTotalTasks() - 1).getTaskId();
  vertex.handle(new VertexEventTaskCompleted(lastTaskId, TaskState.SUCCEEDED));
  dispatcher.await();

  // 4. Assert: the new state and any side-effect counters.
  assertEquals(VertexState.SUCCEEDED, vertex.getState());
  assertEquals(vertex.getTotalTasks(), vertex.getCompletedTasks());
  assertFalse("vertex must not call handleRecovery when not actually replaying",
      vertex.getRecoveryHandlerCalled());
}

The sections — arrange, set precondition, act, assert — should always be visible. Reviewers skim for that shape. Hidden setup inside helpers makes the test harder to debug when it fails on a future change.

Build a Negative Test Too

You proved the bug is fixed. Now prove the non-buggy path still works:

@Test
public void testV_TASK_COMPLETED_inRunningWithoutRecovery() throws Exception {
  // Same arrange/state machinery, but recoveryData stays null.
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
  dispatcher.await();
  // ...
  TezTaskID lastTaskId = vertex.getTask(vertex.getTotalTasks() - 1).getTaskId();
  vertex.handle(new VertexEventTaskCompleted(lastTaskId, TaskState.SUCCEEDED));
  dispatcher.await();
  // Without recovery data, the existing transition behavior is unchanged.
  assertEquals(VertexState.SUCCEEDED, vertex.getState());
}

The negative test catches the regression where someone "fixes" your fix by removing the recovery branch entirely.

Test Both Branches of Every Guard You Added

If your fix is:

if (recoveryData != null && isReplayingRecovery()) { ... }

You owe four tests, one per combination:

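recoveryData == null | isReplayingRecovery() returns | Expected branch
---------------------+-------------------------------+--------------------------------------------------------
true                 | n/a (short-circuited)         | non-recovery path
false                | true                          | recovery path
false                | false                         | non-recovery path (this is the bug fix)
true                 | true                          | non-recovery path (impossible? assert it cannot happen)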

The last row is the kind of test that catches a future refactor where someone deletes the short-circuit.
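
As a minimal sketch of how these combinations can be pinned down, the example below mirrors the shape of the guard in plain Java and counts how often the second operand is evaluated. GuardMatrix, branchFor, and counting are hypothetical names made up for this illustration; in a real test the guard lives in your production class and the assertions go through its public behavior.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a guard of the form: recoveryData != null && isReplayingRecovery(). */
final class GuardMatrix {

  /** Returns which branch the guard selects for the given inputs. */
  static String branchFor(Object recoveryData, BooleanSupplier isReplayingRecovery) {
    if (recoveryData != null && isReplayingRecovery.getAsBoolean()) {
      return "recovery";
    }
    return "non-recovery";
  }

  /** Wraps a fixed answer and counts how many times it was asked for. */
  static BooleanSupplier counting(boolean answer, AtomicInteger calls) {
    return () -> {
      calls.incrementAndGet();
      return answer;
    };
  }

  public static void main(String[] args) {
    Object data = new Object(); // placeholder for real recovery data
    AtomicInteger calls = new AtomicInteger();

    // recoveryData == null: the second operand must never be evaluated.
    System.out.println(branchFor(null, counting(true, calls)) + " calls=" + calls.get());

    // recoveryData present: the second operand decides the branch.
    System.out.println(branchFor(data, counting(true, calls)));
    System.out.println(branchFor(data, counting(false, calls)));
  }
}
```

Asserting that the call count stays at zero when recoveryData is null is what turns the short-circuit into tested behavior, so deleting it in a refactor fails a test.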


MockAppContext, MockHistoryEventHandler, and friends

Building a VertexImpl in a unit test requires a small zoo of collaborators (an AppContext, an event handler, an EdgeManager, etc.). Don't try to build them all from scratch — copy the helpers from TestVertexImpl.

grep -nE "private.*setUp\(|class Mock|createVertex\(" \
  tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java \
  | head -30

You'll see helper methods like createVertex(...), createDAG(...), and inner MockHistoryEventHandler. Use them as a template; do not duplicate them in your own test if you can extend the existing test class with a new @Test method.


Integration Tests with MiniTezCluster

Unit tests prove the fix works in isolation. Integration tests prove it works when wired up to a real YARN cluster (in-process, but real). For correctness bugs and shuffle bugs, this is non-negotiable.

Canonical example:

~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestOrderedWordCount.java

Read its setUp / tearDown carefully. The pattern:

private static MiniTezCluster mrrTezCluster;
private static Path TEST_ROOT_DIR;

@BeforeClass
public static void setup() throws IOException {
  Configuration conf = new Configuration();
  TEST_ROOT_DIR = new Path("target", TestYourFix.class.getName() + "-tmpDir");
  mrrTezCluster = new MiniTezCluster(TestYourFix.class.getSimpleName(),
      /*numNodeManagers=*/ 1, /*numLocalDirs=*/ 1, /*numLogDirs=*/ 1);
  mrrTezCluster.init(conf);
  mrrTezCluster.start();
}

@AfterClass
public static void tearDown() {
  if (mrrTezCluster != null) {
    mrrTezCluster.stop();
    mrrTezCluster = null;
  }
}

@Test(timeout = 180_000)
public void testTezNNNNFixEndToEnd() throws Exception {
  TezConfiguration tezConf = new TezConfiguration(mrrTezCluster.getConfig());
  DAG dag = buildDAGThatExercisesFix();

  TezClient tezClient = TezClient.create("test-tez-NNNN", tezConf);
  tezClient.start();
  try {
    DAGClient dagClient = tezClient.submitDAG(dag);
    DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(
        EnumSet.of(StatusGetOpts.GET_COUNTERS));

    assertEquals(DAGStatus.State.SUCCEEDED, status.getState());

    // The actual assertion — what proves the fix works end-to-end:
    long counterVal = status.getDAGCounters()
        .findCounter(YourCounterGroup.class.getName(), "ExpectedCounter")
        .getValue();
    assertEquals(20L, counterVal);
  } finally {
    tezClient.stop();
  }
}

awaitVertexState and the Deterministic Polling Pattern

MiniTezCluster tests look async (real cluster, real time) but you can still write deterministic assertions. Use the await* helpers in the tez-tests test utility classes:

grep -rn "awaitVertexState\|awaitDAGCompletion\|awaitTaskAttempt" \
  ~/tez-src/tez-tests/src/test/java/

Pattern:

TestTezUtils.awaitVertexState(dagClient, "v1", VertexStatus.State.SUCCEEDED, 60_000);

This polls with backoff up to the timeout. It never returns early on a spurious signal and never sleeps a fixed wallclock duration.


Determinism Rules

Hard rules. Violating any of them gets your PR sent back.

Rule                                  | Bad                                                          | Good
--------------------------------------+--------------------------------------------------------------+--------------------------------------------------
No Thread.sleep                       | Thread.sleep(500)                                            | dispatcher.await() or awaitVertexState(...)
No wallclock waits                    | while (!done && System.currentTimeMillis() < deadline) {...} | latch.await(60, SECONDS) driven by event callback
No Random without seed                | new Random()                                                 | new Random(42L)
No timezone-dependent assertion       | assertEquals("2024-...", LocalDate.now())                    | inject Clock
No order-dependent assertion on a Set | assertEquals(List.of("a","b"), new HashSet<>(...))           | sort first or use containsInAnyOrder
Tests must clean up tmpdirs           | leaving target/...-tmpDir between runs                       | @After removes it or uses unique nanoTime() path
No global mutable state               | static int counter = 0; shared across tests                  | per-test instance state

Tez has shipped many flaky-test fixes. Read a few of them:

cd ~/tez-src
git log --oneline --grep="flaky\|intermittent" | head -20
git show <flaky-fix-sha>

Notice the pattern: most flaky fixes replace a Thread.sleep with an event-driven await, or replace a counter assertion with a state assertion.
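
Three of the rules above can be sketched in a few lines of plain Java. This is an illustration under assumptions, not code from Tez: DeterminismSketch and its method names are hypothetical, and the worker thread stands in for whatever event callback your test actually hooks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helpers illustrating three of the determinism rules. */
final class DeterminismSketch {

  /** Event-driven wait: returns as soon as the callback fires, with no fixed sleep. */
  static boolean waitForCallback() {
    CountDownLatch done = new CountDownLatch(1);
    Thread worker = new Thread(done::countDown); // stand-in for an event callback
    worker.start();
    try {
      // The timeout is only a safety net for a hung test, not the expected wait.
      return done.await(60, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Seeded Random: the same seed yields the same sequence on every run. */
  static int firstDraw(long seed) {
    return new Random(seed).nextInt(1000);
  }

  /** Order-independent comparison: sort a copy of the set before comparing it to a list. */
  static List<String> sorted(Set<String> values) {
    List<String> copy = new ArrayList<>(values);
    Collections.sort(copy);
    return copy;
  }
}
```

In a test, assertTrue(DeterminismSketch.waitForCallback()) fails only if the callback never arrives within the safety timeout, and assertEquals(List.of("a", "b"), DeterminismSketch.sorted(actualSet)) no longer depends on the set's iteration order.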


Coverage Target

You do not need 100% line coverage on the file you touched. You do need ~80% coverage on the lines you changed, plus tests that exercise every new branch (true and false sides).

Spot-check coverage:

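# prepare-agent must come before the test phase so the JaCoCo agent is
# attached to the test JVM; report runs afterwards on the collected data.
mvn -pl tez-dag \
  org.jacoco:jacoco-maven-plugin:prepare-agent \
  test -Dtest='TestVertexImpl*' \
  org.jacoco:jacoco-maven-plugin:report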

# Open tez-dag/target/site/jacoco/index.html

If your changed lines show red, add a test before pushing.


A Complete Test That Fails on Master, Passes With Fix

The deliverable for this step is a test (typically two or three @Test methods on the same class) that:

  1. Fails on a clean checkout of origin/master — assertion error, not a compilation error, not a setup error.
  2. Passes when run against your fix branch.
  3. Runs in under 10 seconds for unit tests, under 3 minutes for integration tests.
  4. Has zero flakes in 10 consecutive runs.

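Verify the third and fourth. The time output covers the whole Maven invocation, so it overstates the test's own runtime; the per-test time is in target/surefire-reports.

for i in {1..10}; do
  echo "=== Run $i ==="
  time mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNN -q || break
done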

If even one run fails, you have a flaky test. Fix it before pushing. A flaky test you ship is technical debt every other contributor will pay.


Test Naming

Tez convention:

  • Unit test file: Test<ClassUnderTest>.java lives in <module>/src/test/java/<package>/. If TestVertexImpl.java already exists, add a new @Test method there rather than a new file.
  • Test method: test<Method>_<Condition>_<ExpectedResult> or test<Scenario>_<ExpectedBehavior>.
  • Bad: testFoo, testBug, testCase1.
  • Good: testV_TASK_COMPLETED_inRunningWithRecoveryData_doesNotShortCircuit.

The verbose name is the test's documentation. Future-you reading the failure output of CI will be glad for the verbosity.


Validation / Self-check

Before advancing to Step 7:

  1. At least two @Test methods exist that fail on origin/master and pass on your branch.
  2. At least one of them uses DrainDispatcher for deterministic event ordering (or has a documented reason it doesn't — pure unit, no events).
  3. At least one integration test on MiniTezCluster is present if your fix affects end-to-end behavior (correctness, shuffle, scheduling).
  4. Ten consecutive runs of your tests are all green.
  5. Every new conditional branch in your production code has at least one test that exercises each side.
  6. No Thread.sleep, no wallclock waits, no unseeded Random, no order-dependent assertions on unordered collections.
  7. mvn test -pl <module> runs your tests in under the budget (10s unit, 3min integration).