Step 6: Testing
Your reproducer from Step 2 is the minimum — it proves the bug existed. The tests in this step prove that the fix is correct, that it stays correct, and that the next person who edits this code path will notice if they break it again. A good test suite is the most durable artifact you ship.
Two kinds of tests are required. Unit tests using a controlled dispatcher
(fast, deterministic, surgical) and at least one integration test on
MiniTezCluster (slow, realistic, end-to-end). Both. Always both.
Unit Tests with DrainDispatcher
The single most important Tez test pattern: synchronous, deterministic state- machine testing. Read the canonical example top to bottom before you write your own:
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
Each is 1000+ lines. They are not light reading. They are also the only authoritative source on what is and isn't testable at the unit layer.
What DrainDispatcher Does
DrainDispatcher is Hadoop's synchronous testing dispatcher (from
hadoop-yarn-common). When you dispatch() an event into it, the event sits
in a queue. When you call await(), the queue drains synchronously on the
calling thread — every handler runs before await() returns. This gives you
two superpowers:
- Deterministic event ordering. You can dispatch A, dispatch B, await — and you know A's handler completed before B's started.
- No real threading. Bugs reproduce on every machine, not just under contention.
State-Transition Test Pattern
The template every state-machine unit test follows:
@Test
public void testV_TASK_COMPLETED_inRunningWithRecovery() throws Exception {
// 1. Arrange: drive the SUT to the state under test.
vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
dispatcher.await();
vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_START));
dispatcher.await();
assertEquals(VertexState.RUNNING, vertex.getState());
// 2. Set up the precondition that triggers the bug.
vertex.setRecoveryData(mockRecoveryData());
// 3. Act: fire the event under test.
TezTaskID lastTaskId = vertex.getTask(vertex.getNumTasks() - 1).getTaskId();
vertex.handle(new VertexEventTaskCompleted(lastTaskId, TaskState.SUCCEEDED));
dispatcher.await();
// 4. Assert: the new state and any side-effect counters.
assertEquals(VertexState.SUCCEEDED, vertex.getState());
assertEquals(vertex.getNumTasks(), vertex.getCompletedTaskCount());
assertFalse("vertex must not call handleRecovery when not actually replaying",
vertex.getRecoveryHandlerCalled());
}
The sections — arrange, set precondition, act, assert — should always be visible. Reviewers skim for that shape. Hidden setup inside helpers makes the test harder to debug when it fails on a future change.
Build a Negative Test Too
You proved the bug is fixed. Now prove the non-buggy path still works:
@Test
public void testV_TASK_COMPLETED_inRunningWithoutRecovery() throws Exception {
// Same arrange/state machinery, but recoveryData stays null.
vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
dispatcher.await();
// ...
TezTaskID lastTaskId = vertex.getTask(vertex.getNumTasks() - 1).getTaskId();
vertex.handle(new VertexEventTaskCompleted(lastTaskId, TaskState.SUCCEEDED));
dispatcher.await();
// Without recovery data, the existing transition behavior is unchanged.
assertEquals(VertexState.SUCCEEDED, vertex.getState());
}
The negative test catches the regression where someone "fixes" your fix by removing the recovery branch entirely.
Test Both Branches of Every Guard You Added
If your fix is:
if (recoveryData != null && isReplayingRecovery()) { ... }
You owe four tests, one per combination:
recoveryData == null | isReplayingRecovery() returns | Expected branch |
|---|---|---|
| true | n/a (short-circuited) | non-recovery path |
| false | true | recovery path |
| false | false | non-recovery path (this is the bug fix) |
| true | true | non-recovery path (impossible? assert it cannot happen) |
The last row is the kind of test that catches a future refactor where someone deletes the short-circuit.
MockAppContext, MockHistoryEventHandler, and friends
Building a VertexImpl in a unit test requires a small zoo of collaborators
(an AppContext, an event handler, an EdgeManager, etc.). Don't try to
build them all from scratch — copy the helpers from TestVertexImpl.
grep -nE "private.*setUp\(|class Mock|createVertex\(" \
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java \
| head -30
You'll see helper methods like createVertex(...), createDAG(...), and
inner MockHistoryEventHandler. Use them as a template; do not duplicate them
in your own test if you can extend the existing test class with a new
@Test method.
Integration Tests with MiniTezCluster
Unit tests prove the fix works in isolation. Integration tests prove it works when wired up to a real YARN cluster (in-process, but real). For correctness bugs and shuffle bugs, this is non-negotiable.
Canonical example:
~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestOrderedWordCount.java
Read its setUp / tearDown carefully. The pattern:
private static MiniTezCluster mrrTezCluster;
private static Path TEST_ROOT_DIR;
@BeforeClass
public static void setup() throws IOException {
Configuration conf = new Configuration();
TEST_ROOT_DIR = new Path("target", TestYourFix.class.getName() + "-tmpDir");
mrrTezCluster = new MiniTezCluster(TestYourFix.class.getSimpleName(),
/*numNodeManagers=*/ 1, /*numLocalDirs=*/ 1, /*numLogDirs=*/ 1);
mrrTezCluster.init(conf);
mrrTezCluster.start();
}
@AfterClass
public static void tearDown() {
if (mrrTezCluster != null) {
mrrTezCluster.stop();
mrrTezCluster = null;
}
}
@Test(timeout = 180_000)
public void testTezNNNNFixEndToEnd() throws Exception {
TezConfiguration tezConf = new TezConfiguration(mrrTezCluster.getConfig());
DAG dag = buildDAGThatExercisesFix();
TezClient tezClient = TezClient.create("test-tez-NNNN", tezConf);
tezClient.start();
try {
DAGClient dagClient = tezClient.submitDAG(dag);
DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(
EnumSet.of(StatusGetOpts.GET_COUNTERS));
assertEquals(DAGStatus.State.SUCCEEDED, status.getState());
// The actual assertion — what proves the fix works end-to-end:
long counterVal = status.getDAGCounters()
.findCounter(YourCounterGroup.class.getName(), "ExpectedCounter")
.getValue();
assertEquals(20L, counterVal);
} finally {
tezClient.stop();
}
}
awaitVertexState and the Deterministic Polling Pattern
MiniTezCluster tests look async (real cluster, real time) but you can still
write deterministic assertions. Use the await* helpers in the tez-tests
test utility classes:
grep -rn "awaitVertexState\|awaitDAGCompletion\|awaitTaskAttempt" \
~/tez-src/tez-tests/src/test/java/
Pattern:
TestTezUtils.awaitVertexState(dagClient, "v1", VertexStatus.State.SUCCEEDED, 60_000);
This polls with backoff up to the timeout. It never returns early on a spurious signal and never sleeps a fixed wallclock duration.
Determinism Rules
Hard rules. Violating any of them gets your PR sent back.
| Rule | Bad | Good |
|---|---|---|
No Thread.sleep | Thread.sleep(500) | dispatcher.await() or awaitVertexState(...) |
| No wallclock waits | while (!done && System.currentTimeMillis() < deadline) {...} | latch.await(60, SECONDS) driven by event callback |
No Random without seed | new Random() | new Random(42L) |
| No timezone-dependent assertion | assertEquals("2024-...", LocalDate.now()) | inject Clock |
| No order-dependent assertion on a Set | assertEquals(List.of("a","b"), new HashSet<>(...)) | sort first or use containsInAnyOrder |
| Tests must clean up tmpdirs | leaving target/...-tmpDir between runs | @After removes it or uses unique nanoTime() path |
| No global mutable state | static int counter = 0; shared across tests | per-test instance state |
Tez has shipped many flaky-test fixes. Read a few of them:
cd ~/tez-src
git log --oneline --grep="flaky\|intermittent" | head -20
git show <flaky-fix-sha>
Notice the pattern — most flaky fixes are replacing a Thread.sleep with
an event-driven await, or replacing a counter assertion with a state
assertion.
Coverage Target
You do not need 100% line coverage on the file you touched. You do need ~80% coverage on the lines you changed, plus tests that exercise every new branch (true and false sides).
Spot-check coverage:
mvn test -pl tez-dag -Dtest='TestVertexImpl*' \
org.jacoco:jacoco-maven-plugin:prepare-agent \
org.jacoco:jacoco-maven-plugin:report
# Open tez-dag/target/site/jacoco/index.html
If your changed lines show red, add a test before pushing.
A Complete Test That Fails on Master, Passes With Fix
The deliverable for this step is a test (typically two or three @Test
methods on the same class) that:
- Fails on a clean checkout of
origin/master— assertion error, not a compilation error, not a setup error. - Passes when run against your fix branch.
- Runs in under 10 seconds for unit tests, under 3 minutes for integration tests.
- Has zero flakes in 10 consecutive runs.
Verify the third and fourth:
for i in {1..10}; do
echo "=== Run $i ==="
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNN -q || break
done
If even one run fails, you have a flaky test. Fix it before pushing. A flaky test you ship is technical debt every other contributor will pay.
Test Naming
Tez convention:
- Unit test file:
Test<ClassUnderTest>.javalives in<module>/src/test/java/<package>/. IfTestVertexImpl.javaalready exists, add a new@Testmethod there rather than a new file. - Test method:
test<Method>_<Condition>_<ExpectedResult>ortest<Scenario>_<ExpectedBehavior>. - Bad:
testFoo,testBug,testCase1. - Good:
testV_TASK_COMPLETED_inRunningWithRecoveryData_doesNotShortCircuit.
The verbose name is the test's documentation. Future-you reading the failure output of CI will be glad for the verbosity.
Validation / Self-check
Before advancing to Step 7:
- At least two
@Testmethods exist that fail onorigin/masterand pass on your branch. - At least one of them uses
DrainDispatcherfor deterministic event ordering (or has a documented reason it doesn't — pure unit, no events). - At least one integration test on
MiniTezClusteris present if your fix affects end-to-end behavior (correctness, shuffle, scheduling). - Ten consecutive runs of your tests are all green.
- Every new conditional branch in your production code has at least one test that exercises each side.
- No
Thread.sleep, no wallclock waits, no unseededRandom, no order-dependent assertions on unordered collections. mvn test -pl <module>runs your tests in under the budget (10s unit, 3min integration).