Lab 5.4 — Fix It: Un-Ignore a Flaky Test in TestVertexImpl

Lab type: Fix-It — flaky test investigation and repair
Estimated time: 90 min
Tez module: tez-dag
Key class: TestVertexImpl


Overview

Large Java projects accumulate @Ignored tests that were disabled because they were "flaky" — meaning they passed sometimes and failed other times. A flaky test is almost always a symptom of a real bug: a race condition, an incorrect assertion, or missing test isolation.

In this lab you will:

  1. Find an @Ignored test in TestVertexImpl
  2. Un-ignore it and run it 10 times to characterize the failure
  3. Identify the root cause
  4. Apply the minimum fix
  5. Verify the test passes reliably

Step 1 — Find the Ignored Tests

grep -n "@Ignore\|@Disabled" \
  ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java

Also search across all tez-dag tests:

grep -rn "@Ignore\|@Disabled" ~/tez-src/tez-dag/src/test/java/ | \
  grep -v "^Binary" | head -30

Step 2 — Pick a Target

Select one ignored test. Prefer tests that have a comment explaining why they were ignored — these are the most educational.

Record:

  1. The test method name
  2. The reason given in the @Ignore annotation or nearby comment
  3. Which state transition or feature it is testing

Step 3 — Un-Ignore and Run

Remove the @Ignore annotation. Run the test 10 times:

for i in $(seq 1 10); do
  mvn test -pl tez-dag -Dtest=TestVertexImpl#yourTestName -q 2>&1 | \
    grep -E "PASS|FAIL|ERROR|Tests run" | tail -1
done

Record the pass/fail pattern. Is it:

  • Always failing (deterministic bug)
  • Randomly failing (race condition or timing sensitivity)
  • Always passing (was it already fixed in this version?)

Step 4 — Diagnose the Failure

Read the test carefully. Common flaky-test patterns in Tez state machine tests:

PatternSymptomFix
AsyncDispatcher not drained before assertionAssertion fires before event is processedUse DrainDispatcher instead
Mock returns null for a method that returns a listNullPointerException in production codeStub with Collections.emptyList()
Thread.sleep(N) instead of proper synchronizationFails on slow CI machinesReplace with waitFor() or DrainDispatcher
Leaked state from another testFirst run passes, second failsVerify @Before / @After cleans up completely

Identify which pattern applies.


Step 5 — Apply the Fix

Apply the minimum fix. Options:

Option A — Replace AsyncDispatcher with DrainDispatcher

// Before (flaky):
AsyncDispatcher dispatcher = new AsyncDispatcher();

// After (deterministic):
DrainDispatcher dispatcher = new DrainDispatcher();
dispatcher.register(VertexEventType.class, vertex);
dispatcher.init(conf);
dispatcher.start();
// ... fire events ...
dispatcher.await(); // blocks until queue is empty

Option B — Add missing stub

when(mockContext.getSomeList()).thenReturn(Collections.emptyList());

Option C — Fix assertion order

Move assertions AFTER the dispatcher.await() call.


Step 6 — Verify Reliability

Run the test 20 times:

for i in $(seq 1 20); do
  mvn test -pl tez-dag -Dtest=TestVertexImpl#yourTestName -q 2>&1 | \
    grep -E "Tests run" | tail -1
done

All 20 runs must pass.


Step 7 — Run the Full Suite

mvn test -pl tez-dag -q 2>&1 | tail -10

All existing tests must pass.


Step 8 — Format the Patch and Write the JIRA

cd ~/tez-src
git diff > /tmp/TEZ-FLAKY.001.patch
Summary: TestVertexImpl#[testName] is flaky due to [root cause]

Description:
  TestVertexImpl#[testName] was marked @Ignore with the note "[original reason]".
  Investigation shows the root cause is [description].

  The fix [removes AsyncDispatcher / adds missing stub / fixes assertion order],
  making the test deterministic.

  Ran the test 20 times with the fix applied — all passed.

Priority: Minor
Component: tez-dag

Deeper Understanding

#Question
1What is the difference between AsyncDispatcher and DrainDispatcher? Where is DrainDispatcher defined?
2Why is a flaky test arguably worse than no test? What does it do to CI reliability?
3Tez's StateMachineFactory is modeled after Hadoop's. Does Hadoop's TestStateMachine use DrainDispatcher or AsyncDispatcher in its tests?
4Some Tez flaky tests are caused by System.currentTimeMillis() being called in a tight loop and the assertion depending on a specific elapsed time. How would you make such a test deterministic?