Stage 9 — Flaky Tests
What this stage teaches
Stage 9 is the unglamorous-but-essential stage. You learn:
- The Tez flake taxonomy:
Thread.sleepraces, undrainedAsyncDispatcher,MiniTezClusterport collisions, and@Test(timeout=...)budgets that were too tight for slow CI. - How to distinguish a flake (passes locally, fails on Jenkins 1-in-30 runs) from a real intermittent bug (manifests in production under load). Flakes are tests; intermittent bugs are not.
- The
DrainDispatcher.await()refactor: how to convert a sleep-based synchronisation to an event-drain-based one. - The
@RuleandTestNamepatterns for diagnosing which test in a suite leaks state into the next. - When a flake fix is also a production code fix (the test was right; the code had a race).
Patches are 20–150 lines per test. They rarely change production code. The ones that do warrant a Stage 4–6 ticket in addition to the test fix.
JIRA filter to find candidates
project = TEZ
AND (text ~ "flaky" OR text ~ "intermittent" OR labels = "flaky-test")
AND resolution = Unresolved
ORDER BY updated DESC
A second source: Jenkins precommit history. Pick any open JIRA, find its Jenkins URL in the comments, click through to recent runs, look for tests that failed in one run and passed in the next on the same patch. Those tests are flake candidates regardless of whether a JIRA already exists.
A third source: your own mvn test output. Run any tez-dag test suite three
times in a row:
cd ~/tez-src
for i in 1 2 3; do
mvn -pl tez-dag test -Dtest=TestVertexImpl -q 2>&1 | tail -5
done
Any failure in the three-pass that doesn't repeat is a flake to investigate.
The Tez flake taxonomy
1. Thread.sleep races
The most common shape:
worker.submitJob(j);
Thread.sleep(500); // "wait for it to start"
assertTrue(worker.isJobRunning(j));
On a slow CI box, 500ms may not be enough. On a fast box, the job may have completed before the assertion. Both fail.
The fix is a poll with timeout:
worker.submitJob(j);
TestUtils.waitFor(() -> worker.isJobRunning(j), /*pollMs*/50, /*timeoutMs*/30_000);
assertTrue(worker.isJobRunning(j));
If TestUtils.waitFor does not exist in the module, copy the pattern from
org.apache.tez.test.GenericCounter or write one yourself in three lines.
2. Undrained AsyncDispatcher
The dispatcher is event-driven. A test that fires an event and immediately asserts on state will see the pre-event state half the time.
The fix is DrainDispatcher.await():
cd ~/tez-src
grep -rn "class DrainDispatcher" tez-common/src/main/java tez-dag/src/test
Find the canonical class. The refactor:
- dispatcher.getEventHandler().handle(new VertexEvent(vid, VertexEventType.V_INIT));
- Thread.sleep(200);
- assertEquals(VertexState.INITED, vertex.getState());
+ dispatcher.getEventHandler().handle(new VertexEvent(vid, VertexEventType.V_INIT));
+ dispatcher.await();
+ assertEquals(VertexState.INITED, vertex.getState());
The contract: await() returns when the event queue is empty and the last
event has been fully handled (including any subsequent events the handler
itself emitted). If the test still flakes after this refactor, the handler is
emitting events to a different dispatcher (e.g. a child component has its
own). Find it and drain that one too.
3. MiniTezCluster port collisions
The default MiniTezCluster binds a fixed RM port. Two suites running in
parallel on the same machine collide. The fix is per-suite port randomisation:
- tezCluster = new MiniTezCluster("test", 1, 1, 1);
+ tezCluster = new MiniTezCluster(TestName.getMethodName(), 1, 1, 1);
+ Configuration conf = new Configuration();
+ conf.setInt(YarnConfiguration.RM_PORT, 0); // 0 = OS-assigned
+ conf.setInt(YarnConfiguration.RM_SCHEDULER_PORT, 0);
+ conf.setInt(YarnConfiguration.RM_RESOURCE_TRACKER_PORT, 0);
+ tezCluster.init(conf);
The 0 port tells the OS to assign an unused port. Then read the actual port
from the cluster after start:
int amrmPort = tezCluster.getConfig().getInt(YarnConfiguration.RM_PORT, -1);
4. @Test(timeout=...) too tight
A test with @Test(timeout=1000) may pass on a developer's M3 Pro and fail on
a contention-laden Jenkins agent. Raise the timeout to a value that comfortably
covers the slow CI but is still bounded:
- @Test(timeout = 1000)
+ @Test(timeout = 30_000)
public void testInitTransitionRunsOnce() { ... }
The Tez convention: 30s for unit tests, 300s for MiniTezCluster tests.
Never @Test(timeout = 0) — a hung test will block CI for hours.
Walked example — TestShuffleManager flake
Symptom: testReadErrorReportDebounce fails 1-in-12 runs on Jenkins with:
expected:<1> but was:<2>
i.e. the verify on inputContext.sendEvents saw two calls when one was
expected.
Step 1 — Reproduce locally
cd ~/tez-src
for i in $(seq 1 50); do
mvn -pl tez-runtime-library test \
-Dtest=TestShuffleManager#testReadErrorReportDebounce \
-q 2>&1 | tail -3
done | grep -c "FAILED"
A local reproduction at 1/50 frequency is good enough to start.
Step 2 — Diagnose
Read the test. The pattern:
sm.reportReadError(src, new IOException("first"));
sm.reportReadError(src, new IOException("second"));
verify(inputContext, times(1)).sendEvents(anyList());
reportReadError may dispatch to an internal executor. The verify runs before
the executor has serviced the call. The Mockito verify sees only the
synchronous call most of the time; the async one fires 1-in-12.
Step 3 — Fix
Replace verify with a timeout-bounded verify:
- verify(inputContext, times(1)).sendEvents(anyList());
+ verify(inputContext, timeout(5_000).times(1)).sendEvents(anyList());
Mockito.timeout(ms) polls until the expected interactions match, then
asserts the count. The test now waits up to 5 seconds before failing.
A bigger refactor (preferred): inject a deterministic executor:
ShuffleManager sm = createShuffleManager(conf, new DirectExecutor());
where DirectExecutor is a java.util.concurrent.Executor whose execute
runs synchronously on the caller thread. Now there is no race, and the
original verify(..., times(1)) is correct.
The reviewer rule: prefer the deterministic executor refactor over
Mockito.timeout. The timeout-based fix masks future races; the deterministic
fix eliminates them.
Step 4 — Confirm the fix
Run the loop again:
for i in $(seq 1 200); do
mvn -pl tez-runtime-library test \
-Dtest=TestShuffleManager#testReadErrorReportDebounce -q 2>&1 | tail -3
done | grep -c "FAILED"
200 runs, zero failures, is the bar. Don't ship a flake fix you have not stress-tested.
When a flake is a real bug
Sometimes a test flakes because the production code has a race. If the "obvious" flake fix is to insert a sleep or relax an assertion, stop and ask: could a production caller exercise the same race?
Example: VertexImpl.handle returning before all event-emission side effects
complete. The flaky test fixes itself by dispatcher.await(), but a
production caller doing the same sequence sees a partially-applied state.
That is a Stage 4 bug, not a Stage 9 bug.
The decision rule:
- The test races against an internal event queue → flake fix.
- The test races against a public contract method → file a real bug.
Pitfalls
- Don't
@Ignorea flake to "fix" CI. The next contributor will silently remove the@Ignoreand re-introduce the flake. File a real ticket with a written analysis even if you don't fix it. - Don't bump the
@Test(timeout)without reasoning. A 30s timeout is evidence the test does real work; a 30000s timeout is evidence the test is broken. - Don't replace
assertEqualswithassertTrue(... contains ...)to silence a flake. That weakens the assertion permanently and hides the underlying race. - Don't refactor a test class wholesale in a flake patch. Fix the one test. If the class needs a wholesale refactor, file a separate JIRA.
- Don't use
Thread.yield()to fix a race. It is not a guarantee; it is a hint. Always use a real synchronisation primitive (CountDownLatch,dispatcher.await(),Future.get()). - Don't catch
InterruptedExceptionand ignore it. The Tez convention isThread.currentThread().interrupt(); throw new ...so the interrupt status propagates.
Exit criteria — when you're ready for the next stage
Move to Stage 10 when:
- You have de-flaked at least three tests with confirmed 200-run stability.
- You have caught at least one real production race that was masquerading as a flake.
- You can name the three flake patterns by heart (sleep races, undrained dispatcher, port collisions, tight timeouts).
- A reviewer has accepted your deterministic-executor refactor as the
preferred pattern over
Mockito.timeout.
Stage 10 turns the focus to performance regressions.