Lab 5.4 — Fix It: Un-Ignore a Flaky Test in TestVertexImpl
Lab type: Fix-It — flaky test investigation and repair
Estimated time: 90 min
Tez module: tez-dag
Key class: TestVertexImpl
Overview
Large Java projects accumulate @Ignored tests that were disabled because they
were "flaky" — meaning they passed sometimes and failed other times. A flaky
test is almost always a symptom of a real bug: a race condition, an incorrect
assertion, or missing test isolation.
In this lab you will:
- Find an
@Ignored test inTestVertexImpl - Un-ignore it and run it 10 times to characterize the failure
- Identify the root cause
- Apply the minimum fix
- Verify the test passes reliably
Step 1 — Find the Ignored Tests
grep -n "@Ignore\|@Disabled" \
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
Also search across all tez-dag tests:
grep -rn "@Ignore\|@Disabled" ~/tez-src/tez-dag/src/test/java/ | \
grep -v "^Binary" | head -30
Step 2 — Pick a Target
Select one ignored test. Prefer tests that have a comment explaining why they were ignored — these are the most educational.
Record:
- The test method name
- The reason given in the
@Ignoreannotation or nearby comment - Which state transition or feature it is testing
Step 3 — Un-Ignore and Run
Remove the @Ignore annotation. Run the test 10 times:
for i in $(seq 1 10); do
mvn test -pl tez-dag -Dtest=TestVertexImpl#yourTestName -q 2>&1 | \
grep -E "PASS|FAIL|ERROR|Tests run" | tail -1
done
Record the pass/fail pattern. Is it:
- Always failing (deterministic bug)
- Randomly failing (race condition or timing sensitivity)
- Always passing (was it already fixed in this version?)
Step 4 — Diagnose the Failure
Read the test carefully. Common flaky-test patterns in Tez state machine tests:
| Pattern | Symptom | Fix |
|---|---|---|
AsyncDispatcher not drained before assertion | Assertion fires before event is processed | Use DrainDispatcher instead |
| Mock returns null for a method that returns a list | NullPointerException in production code | Stub with Collections.emptyList() |
Thread.sleep(N) instead of proper synchronization | Fails on slow CI machines | Replace with waitFor() or DrainDispatcher |
| Leaked state from another test | First run passes, second fails | Verify @Before / @After cleans up completely |
Identify which pattern applies.
Step 5 — Apply the Fix
Apply the minimum fix. Options:
Option A — Replace AsyncDispatcher with DrainDispatcher
// Before (flaky):
AsyncDispatcher dispatcher = new AsyncDispatcher();
// After (deterministic):
DrainDispatcher dispatcher = new DrainDispatcher();
dispatcher.register(VertexEventType.class, vertex);
dispatcher.init(conf);
dispatcher.start();
// ... fire events ...
dispatcher.await(); // blocks until queue is empty
Option B — Add missing stub
when(mockContext.getSomeList()).thenReturn(Collections.emptyList());
Option C — Fix assertion order
Move assertions AFTER the dispatcher.await() call.
Step 6 — Verify Reliability
Run the test 20 times:
for i in $(seq 1 20); do
mvn test -pl tez-dag -Dtest=TestVertexImpl#yourTestName -q 2>&1 | \
grep -E "Tests run" | tail -1
done
All 20 runs must pass.
Step 7 — Run the Full Suite
mvn test -pl tez-dag -q 2>&1 | tail -10
All existing tests must pass.
Step 8 — Format the Patch and Write the JIRA
cd ~/tez-src
git diff > /tmp/TEZ-FLAKY.001.patch
Summary: TestVertexImpl#[testName] is flaky due to [root cause]
Description:
TestVertexImpl#[testName] was marked @Ignore with the note "[original reason]".
Investigation shows the root cause is [description].
The fix [removes AsyncDispatcher / adds missing stub / fixes assertion order],
making the test deterministic.
Ran the test 20 times with the fix applied — all passed.
Priority: Minor
Component: tez-dag
Deeper Understanding
| # | Question |
|---|---|
| 1 | What is the difference between AsyncDispatcher and DrainDispatcher? Where is DrainDispatcher defined? |
| 2 | Why is a flaky test arguably worse than no test? What does it do to CI reliability? |
| 3 | Tez's StateMachineFactory is modeled after Hadoop's. Does Hadoop's TestStateMachine use DrainDispatcher or AsyncDispatcher in its tests? |
| 4 | Some Tez flaky tests are caused by System.currentTimeMillis() being called in a tight loop and the assertion depending on a specific elapsed time. How would you make such a test deterministic? |