Stage 9 — Flaky Tests

What this stage teaches

Stage 9 is the unglamorous-but-essential stage. You learn:

The Tez flake taxonomy: Thread.sleep races, undrained AsyncDispatcher, MiniTezCluster port collisions, and @Test(timeout=...) budgets that were too tight for slow CI.
How to distinguish a flake (passes locally, fails on Jenkins 1-in-30 runs) from a real intermittent bug (manifests in production under load). A flake is a defect in the test; an intermittent bug is a defect in the product.
The DrainDispatcher.await() refactor: how to convert a sleep-based synchronisation to an event-drain-based one.
The @Rule and TestName patterns for diagnosing which test in a suite leaks state into the next.
When a flake fix is also a production code fix (the test was right; the code had a race).

Patches are 20–150 lines per test. They rarely change production code. The ones that do warrant a Stage 4–6 ticket in addition to the test fix.

JIRA filter to find candidates

project = TEZ
  AND (text ~ "flaky" OR text ~ "intermittent" OR labels = "flaky-test")
  AND resolution = Unresolved
ORDER BY updated DESC

A second source: Jenkins precommit history. Pick any open JIRA, find its Jenkins URL in the comments, click through to recent runs, look for tests that failed in one run and passed in the next on the same patch. Those tests are flake candidates regardless of whether a JIRA already exists.

A third source: your own mvn test output. Run any tez-dag test suite three times in a row:

cd ~/tez-src
for i in 1 2 3; do
  mvn -pl tez-dag test -Dtest=TestVertexImpl -q 2>&1 | tail -5
done

Any failure that does not repeat across the three passes is a flake to investigate.

The Tez flake taxonomy

1. `Thread.sleep` races

The most common shape:

worker.submitJob(j);
Thread.sleep(500);                 // "wait for it to start"
assertTrue(worker.isJobRunning(j));

On a slow CI box, 500 ms may not be enough for the job to start. On a fast box, the job may already have completed by the time the assertion runs. Either way the test fails.

The fix is a poll with timeout:

worker.submitJob(j);
TestUtils.waitFor(() -> worker.isJobRunning(j), /*pollMs*/50, /*timeoutMs*/30_000);
assertTrue(worker.isJobRunning(j));

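If TestUtils.waitFor does not exist in the module, write an equivalent helper yourself: a short loop that checks the condition, sleeps for the poll interval, and throws once the timeout has elapsed. Where Hadoop's test utilities are already on the test classpath, org.apache.hadoop.test.GenericTestUtils.waitFor follows the same pattern.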
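A minimal sketch of such a poll-with-timeout helper, using only the JDK. The class name WaitForSketch and its signature are illustrative assumptions, not an existing Tez or Hadoop API.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not an existing Tez class. */
final class WaitForSketch {
  private WaitForSketch() {}

  /**
   * Polls the condition every pollMs milliseconds until it returns true.
   * Throws TimeoutException if it is still false after timeoutMs milliseconds.
   */
  static void waitFor(BooleanSupplier condition, long pollMs, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      // Overflow-safe comparison against the monotonic clock.
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(pollMs);
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.currentTimeMillis();
    // The condition becomes true roughly 200 ms after start.
    waitFor(() -> System.currentTimeMillis() - start >= 200, 50, 30_000);
    System.out.println("condition met");
  }
}
```

The helper returns as soon as the condition holds, so a fast machine pays almost nothing and a slow one gets the full timeout budget.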

2. Undrained `AsyncDispatcher`

The dispatcher is event-driven. A test that fires an event and immediately asserts on state races the dispatcher thread and will sometimes see the pre-event state.

The fix is DrainDispatcher.await(). First locate the class in your checkout:

cd ~/tez-src
grep -rn "class DrainDispatcher" tez-common/src/main/java tez-dag/src/test

Then apply the refactor to the test:

-    dispatcher.getEventHandler().handle(new VertexEvent(vid, VertexEventType.V_INIT));
-    Thread.sleep(200);
-    assertEquals(VertexState.INITED, vertex.getState());
+    dispatcher.getEventHandler().handle(new VertexEvent(vid, VertexEventType.V_INIT));
+    dispatcher.await();
+    assertEquals(VertexState.INITED, vertex.getState());

The contract: await() returns when the event queue is empty and the last event has been fully handled (including any subsequent events the handler itself emitted). If the test still flakes after this refactor, the handler is emitting events to a different dispatcher (e.g. a child component has its own). Find it and drain that one too.
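To make the drain contract concrete, here is a toy dispatcher in plain Java. It is not Tez's DrainDispatcher; the class ToyDrainDispatcher and its methods are assumptions made for illustration. It counts events that are queued or in flight, and await() returns only when that count reaches zero, which includes follow-up events emitted by the handler.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Toy dispatcher for illustration only; it is not Tez's DrainDispatcher. */
final class ToyDrainDispatcher<E> {
  private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
  private final Object lock = new Object();
  private int pending;  // events queued or currently being handled; guarded by lock
  private final Thread worker;

  ToyDrainDispatcher(Consumer<E> handler) {
    worker = new Thread(() -> {
      try {
        while (true) {
          E event = queue.take();
          try {
            handler.accept(event);  // the handler may call handle() again
          } finally {
            // Decrement only after the handler has finished, so follow-up
            // events it emitted are already counted in pending.
            synchronized (lock) {
              pending--;
              lock.notifyAll();
            }
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // stop() was called; exit the loop
      }
    }, "toy-dispatcher");
    worker.setDaemon(true);
    worker.start();
  }

  /** Enqueues an event; returns immediately, before the event is handled. */
  void handle(E event) {
    synchronized (lock) {
      pending++;
    }
    queue.add(event);
  }

  /** Blocks until every queued and in-flight event has been fully handled. */
  void await() throws InterruptedException {
    synchronized (lock) {
      while (pending > 0) {
        lock.wait();
      }
    }
  }

  void stop() {
    worker.interrupt();
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicReference<ToyDrainDispatcher<String>> self = new AtomicReference<>();
    ToyDrainDispatcher<String> d = new ToyDrainDispatcher<String>(event -> {
      System.out.println("handled " + event);
      if (event.equals("V_INIT")) {
        self.get().handle("V_INITED");  // follow-up event emitted by the handler
      }
    });
    self.set(d);
    d.handle("V_INIT");
    d.await();  // returns only after V_INIT and V_INITED are both handled
    d.stop();
  }
}
```

The sketch also shows why a second dispatcher breaks the guarantee: events handed to another queue are never counted in this instance's pending total, so its await() cannot wait for them.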

3. `MiniTezCluster` port collisions

When MiniTezCluster binds fixed RM ports, two suites running in parallel on the same machine collide. The fix is per-suite port randomisation (testName below is a JUnit TestName field annotated with @Rule):

-    tezCluster = new MiniTezCluster("test", 1, 1, 1);
+    tezCluster = new MiniTezCluster(testName.getMethodName(), 1, 1, 1);
+    Configuration conf = new Configuration();
+    conf.set(YarnConfiguration.RM_ADDRESS, "localhost:0");  // port 0 = OS-assigned
+    conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS, "localhost:0");
+    conf.set(YarnConfiguration.RM_RESOURCE_TRACKER_ADDRESS, "localhost:0");
+    tezCluster.init(conf);

Port 0 tells the OS to assign an unused port. Then read the actual address from the cluster configuration after start:

String rmAddress = tezCluster.getConfig().get(YarnConfiguration.RM_ADDRESS);  // host:port with the real port
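The port-0 behaviour is an operating-system feature and can be seen without any cluster. The following sketch uses plain JDK sockets; the class name EphemeralPortSketch is made up for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Illustrative only: shows OS-assigned ports with plain JDK sockets. */
final class EphemeralPortSketch {
  private EphemeralPortSketch() {}

  /** Binds a listening socket on the loopback interface to an OS-assigned port. */
  static ServerSocket bindEphemeral() throws IOException {
    // Port 0 asks the OS for any free port; 50 is the accept backlog.
    return new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
  }

  public static void main(String[] args) throws IOException {
    try (ServerSocket first = bindEphemeral();
         ServerSocket second = bindEphemeral()) {
      // The real port is only known after the bind, so it must be read back.
      System.out.println("first bound to  " + first.getLocalPort());
      System.out.println("second bound to " + second.getLocalPort());
    }
  }
}
```

Two sockets open at the same time always get different ports, which is the property parallel suites need. The requested value (0) is never the port actually in use, which is why the test has to read the address back after start.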

4. `@Test(timeout=...)` too tight

A test with @Test(timeout=1000) may pass on a developer's M3 Pro and fail on a contention-laden Jenkins agent. Raise the timeout to a value that comfortably covers the slow CI but is still bounded:

-  @Test(timeout = 1000)
+  @Test(timeout = 30_000)
   public void testInitTransitionRunsOnce() { ... }

The Tez convention: 30s for unit tests, 300s for MiniTezCluster tests. Never leave a test at @Test(timeout = 0), which is the JUnit 4 default and means no timeout at all: a hung test will block CI for hours.

Walked example — `TestShuffleManager` flake

Symptom: testReadErrorReportDebounce fails 1-in-12 runs on Jenkins with:

expected:<1> but was:<2>

i.e. the verify on inputContext.sendEvents saw two calls when one was expected.

Step 1 — Reproduce locally

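cd ~/tez-src
# Count failures by mvn's exit status; truncated -q output is unreliable to grep.
fails=0
for i in $(seq 1 50); do
  mvn -pl tez-runtime-library test \
    -Dtest=TestShuffleManager#testReadErrorReportDebounce \
    -q >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "failures: $fails / 50"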

A local reproduction at 1/50 frequency is good enough to start.

Step 2 — Diagnose

Read the test. The pattern:

sm.reportReadError(src, new IOException("first"));
sm.reportReadError(src, new IOException("second"));
verify(inputContext, times(1)).sendEvents(anyList());

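reportReadError may dispatch to an internal executor, so the number of sendEvents calls that verify sees depends on how far the executor has progressed. On most runs verify executes before the executor has serviced the call and sees only the synchronous one; about 1 run in 12 the asynchronous call lands first and verify sees two.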

Step 3 — Fix

Replace verify with a timeout-bounded verify:

-    verify(inputContext, times(1)).sendEvents(anyList());
+    verify(inputContext, timeout(5_000).times(1)).sendEvents(anyList());

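Mockito.timeout(ms) re-checks the verification until it passes or the timeout elapses, and returns as soon as the expected count is observed. The test now waits up to 5 seconds before failing. It does not wait for late calls: to assert that no further call arrives, Mockito.after(ms) waits the full period before checking.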

A bigger refactor (preferred): inject a deterministic executor:

ShuffleManager sm = createShuffleManager(conf, new DirectExecutor());

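where DirectExecutor is a java.util.concurrent.Executor whose execute runs synchronously on the caller thread. Now there is no race: every submitted task has completed by the time reportReadError returns, so the original verify(..., times(1)) checks a count that is the same on every run.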

The reviewer rule: prefer the deterministic executor refactor over Mockito.timeout. The timeout-based fix masks future races; the deterministic fix eliminates them.
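A sketch of the deterministic executor, assuming the component under test accepts an Executor through its constructor. DirectExecutor is written out here in full; ReporterSketch is a hypothetical stand-in for the component and is not Tez code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

/** Runs each task immediately on the calling thread. */
final class DirectExecutor implements Executor {
  @Override
  public void execute(Runnable command) {
    command.run();
  }
}

/** Hypothetical stand-in for a component that reports through an executor. */
final class ReporterSketch {
  private final Executor executor;
  final List<String> sent = new ArrayList<>();

  ReporterSketch(Executor executor) {
    this.executor = executor;
  }

  /** Hands the report to the executor; with a thread pool this would be asynchronous. */
  void report(String message) {
    executor.execute(() -> sent.add(message));
  }

  public static void main(String[] args) {
    ReporterSketch reporter = new ReporterSketch(new DirectExecutor());
    reporter.report("first");
    // With DirectExecutor the side effect is already visible here; no waiting is needed.
    System.out.println(reporter.sent);
  }
}
```

Because the task runs before execute returns, a test can assert on the result in the very next statement. Production code keeps its real executor; only the test wiring changes.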

Step 4 — Confirm the fix

Run the loop again:

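# Count failures by mvn's exit status; truncated -q output is unreliable to grep.
fails=0
for i in $(seq 1 200); do
  mvn -pl tez-runtime-library test \
    -Dtest=TestShuffleManager#testReadErrorReportDebounce -q >/dev/null 2>&1 \
    || fails=$((fails + 1))
done
echo "failures: $fails / 200"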

200 runs, zero failures, is the bar. Don't ship a flake fix you have not stress-tested.
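The 200-run bar can be motivated with a little arithmetic. Assuming runs fail independently with a fixed probability (a simplification, since CI load is correlated), the chance that an unfixed flake passes every run is (1 - p)^n. The class below is an illustrative sketch, not part of Tez.

```java
/** Illustrative arithmetic: how likely is an unfixed flake to survive n clean runs? */
final class FlakeSurvivalSketch {
  private FlakeSurvivalSketch() {}

  /** Probability that a test failing independently with the given rate passes all runs. */
  static double probabilityAllPass(double failureRate, int runs) {
    return Math.pow(1.0 - failureRate, runs);
  }

  public static void main(String[] args) {
    double jenkinsRate = 1.0 / 12.0;  // the 1-in-12 rate from the symptom
    double localRate = 1.0 / 50.0;    // the 1-in-50 local reproduction rate
    for (int runs : new int[] {3, 50, 200}) {
      System.out.printf("%3d runs: p=1/12 -> %.3g, p=1/50 -> %.3g%n",
          runs,
          probabilityAllPass(jenkinsRate, runs),
          probabilityAllPass(localRate, runs));
    }
  }
}
```

Under that assumption, a 1-in-12 flake has a negligible chance of surviving 200 runs, while a 1-in-50 flake still slips through a little under 2% of the time. Three clean runs prove very little in either case.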

When a flake is a real bug

Sometimes a test flakes because the production code has a race. If the "obvious" flake fix is to insert a sleep or relax an assertion, stop and ask: could a production caller exercise the same race?

Example: VertexImpl.handle returning before all event-emission side effects complete. The flaky test is fixed by adding dispatcher.await(), but a production caller doing the same sequence sees a partially-applied state. That is a Stage 4 bug, not a Stage 9 bug.

The decision rule:

The test races against an internal event queue → flake fix.
The test races against a public contract method → file a real bug.

Pitfalls

Don't @Ignore a flake to "fix" CI. The next contributor will silently remove the @Ignore and re-introduce the flake. File a real ticket with a written analysis even if you don't fix it.
Don't bump the @Test(timeout) without reasoning. A 30s timeout is evidence the test does real work; a 30000s timeout is evidence the test is broken.
Don't replace assertEquals with assertTrue(... contains ...) to silence a flake. That weakens the assertion permanently and hides the underlying race.
Don't refactor a test class wholesale in a flake patch. Fix the one test. If the class needs a wholesale refactor, file a separate JIRA.
Don't use Thread.yield() to fix a race. It is not a guarantee; it is a hint. Always use a real synchronisation primitive (CountDownLatch, dispatcher.await(), Future.get()).
Don't catch InterruptedException and ignore it. The Tez convention is Thread.currentThread().interrupt(); throw new ... so the interrupt status propagates.
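The last two pitfalls can be sketched together in plain Java: a CountDownLatch as the real synchronisation primitive, and an InterruptedException handler that restores the interrupt status before rethrowing. The class and method names are illustrative assumptions, and IllegalStateException stands in for whatever exception type the surrounding code would throw.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Illustrative only: wait for a worker with a latch instead of Thread.yield(). */
final class LatchSketch {
  private LatchSketch() {}

  /**
   * Starts a worker thread and blocks until it signals that it is running.
   * Returns false if the signal did not arrive within timeoutMs milliseconds.
   */
  static boolean startAndAwait(Runnable work, long timeoutMs) {
    CountDownLatch started = new CountDownLatch(1);
    Thread worker = new Thread(() -> {
      started.countDown();  // signal: the worker thread is running
      work.run();
    });
    worker.setDaemon(true);
    worker.start();
    try {
      return started.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      // Restore the interrupt status so callers further up can still observe it.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for worker", e);
    }
  }

  public static void main(String[] args) {
    boolean started = startAndAwait(() -> System.out.println("worker running"), 5_000);
    System.out.println("started within timeout: " + started);
  }
}
```

The latch gives a guarantee that Thread.yield() cannot: await returns true only after countDown has actually been called.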

Exit criteria — when you're ready for the next stage

Move to Stage 10 when:

You have de-flaked at least three tests with confirmed 200-run stability.
You have caught at least one real production race that was masquerading as a flake.
You can name the four flake patterns by heart (sleep races, undrained dispatcher, port collisions, tight timeouts).
A reviewer has accepted your deterministic-executor refactor as the preferred pattern over Mockito.timeout.

Stage 10 turns the focus to performance regressions.
