Testing Framework

Tez ships three tiers of tests, each with a different cost/coverage tradeoff. Knowing which tier to use for a given change — and which patterns are considered idiomatic — is the difference between a patch that lands and one that sits in review.


Three tiers

| Tier | Module | Boots | Run cost | Use for |
|---|---|---|---|---|
| Unit | each module's src/test/java | nothing real; pure mocks + dispatcher | seconds | State-machine transitions, parsers, helper classes |
| Mini-cluster | tez-tests/src/test/java | MiniTezCluster (MiniYARNCluster + Tez session) | seconds to minutes | End-to-end DAGs in a JVM |
| Full cluster | external | real YARN cluster | minutes | Release validation, perf tests |

find . -name "MiniTezCluster.java"
wc -l $(find . -name "MiniTezCluster.java")

Unit testing state machines

The dominant pattern for Tez unit tests is arrange-state, send-event, drain-dispatcher, assert. Reference: TestVertexImpl, TestTaskImpl, TestTaskAttemptImpl, TestDAGImpl.

find tez-dag/src/test/java -name "TestVertexImpl.java"
wc -l $(find tez-dag/src/test/java -name "TestVertexImpl.java")
grep -n "DrainDispatcher\|MockVertex\|MockDAG\|setupVertices\|dispatcher.await" \
  $(find tez-dag/src/test/java -name "TestVertexImpl.java") | head -20

Building blocks

| Class | Purpose |
|---|---|
| DrainDispatcher | Event dispatcher for tests; await() blocks until the event queue has drained |
| MockVertex, MockDAG, MockTask, etc. | Lightweight stand-ins that satisfy the Vertex, DAG, Task, etc. interfaces |
| MockClock | Controllable clock for time-dependent transitions |
| MockHistoryEventHandler | Captures recovery / ATS events for assertion |
| Mockito (mock, when, verify) | Mocks for collaborators (TaskSchedulerManager, etc.) |

Recipe

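@Test
public void testVertexInitsAfterAllInputsReady() throws Exception {
  // 1. Arrange: start the dispatcher and build the vertex around mocks.
  DrainDispatcher dispatcher = new DrainDispatcher();
  dispatcher.init(new Configuration());
  dispatcher.start();

  // For the verify() in step 4 to mean anything, sched must be reachable
  // from the vertex, typically by stubbing it on mockAppContext.
  TaskSchedulerManager sched = mock(TaskSchedulerManager.class);
  DAG dag = mock(DAG.class);
  // TezDAGID.getInstance takes an ApplicationId, not an attempt id.
  when(dag.getID()).thenReturn(
      TezDAGID.getInstance(appAttemptId.getApplicationId(), 1));

  VertexImpl v = new VertexImpl(vertexId, plan, name, conf,
      dispatcher.getEventHandler(),
      mock(TaskCommunicatorManagerInterface.class),
      mockClock, taskHeartbeatHandler, mockAppContext,
      VertexLocationHint.create(...), dispatcher,
      mockVertexManager, ...);
  // Route vertex events to the vertex under test; without a registered
  // handler the dispatcher cannot deliver VertexEventType events.
  dispatcher.register(VertexEventType.class, v);

  // 2. Act: send the event through the dispatcher.
  dispatcher.getEventHandler().handle(
      new VertexEvent(vertexId, VertexEventType.V_INIT));

  // 3. Drain: block until the dispatcher has handled everything queued.
  dispatcher.await();

  // 4. Assert on the resulting state and on collaborator interactions.
  assertEquals(VertexState.INITED, v.getState());
  verify(sched, never()).taskAllocated(any(), any(), any());
}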

Key idioms:

  • Never call Thread.sleep to wait for a transition. Call dispatcher.await() instead.
  • Never assume event ordering unless you've sent events sequentially through the same dispatcher.
  • Mock the AppContext aggressively. It's the god-object; mocking it lets each test isolate exactly the collaborators it cares about.
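
The send-event, drain, assert cycle can be illustrated without any Tez or Hadoop classes. The following is a minimal sketch, not the real DrainDispatcher: ToyDrainDispatcher, its String events, and the demo() method are hypothetical names invented for illustration. It assumes a single worker thread and shows why await() gives a deterministic point at which to assert, where Thread.sleep would only guess.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Toy stand-in for a draining dispatcher; NOT the real DrainDispatcher. */
class ToyDrainDispatcher {
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final Consumer<String> handler;
  private final Object lock = new Object();
  private int pending = 0; // events posted but not yet fully handled

  ToyDrainDispatcher(Consumer<String> handler) {
    this.handler = handler;
    Thread worker = new Thread(this::loop, "toy-dispatcher");
    worker.setDaemon(true); // do not keep the JVM alive after the test
    worker.start();
  }

  // Single worker thread: events are handled in the order they were queued.
  private void loop() {
    try {
      while (true) {
        String event = queue.take();
        try {
          handler.accept(event);
        } finally {
          synchronized (lock) {
            pending--;
            lock.notifyAll(); // wake any thread blocked in await()
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Posts an event asynchronously; returns before it is handled. */
  void handle(String event) {
    synchronized (lock) {
      pending++;
    }
    queue.add(event);
  }

  /** Blocks until every posted event has been handled. */
  void await() throws InterruptedException {
    synchronized (lock) {
      while (pending > 0) {
        lock.wait();
      }
    }
  }

  /** Send two events from one thread, drain, then return what was handled. */
  static List<String> demo() throws InterruptedException {
    List<String> seen = Collections.synchronizedList(new ArrayList<>());
    ToyDrainDispatcher dispatcher = new ToyDrainDispatcher(seen::add);
    dispatcher.handle("V_INIT");
    dispatcher.handle("V_START");
    dispatcher.await(); // deterministic: no sleep, no polling
    return seen;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(demo()); // prints [V_INIT, V_START]
  }
}
```

Because both events are posted sequentially from one thread into one queue, their handling order is fixed, which is the only situation in which the second idiom above permits assuming an order.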

MiniTezCluster tests

find tez-tests/src/test/java -name "TestOrderedWordCount.java" \
                              -o -name "TestMRRJobsDAGApi.java" \
                              -o -name "TestExtServicesWithLocalMode.java" | head
wc -l $(find tez-tests/src/test/java -name "TestOrderedWordCount.java")

MiniTezCluster boots:

  • A MiniYARNCluster (in-process RM + N NMs).
  • A MiniDFSCluster (in-process NN + DNs) — optional.
  • A TezClient configured against the mini cluster.

grep -n "MiniTezCluster\|new MiniYARNCluster\|setup\|tearDown" \
  $(find tez-tests/src/test/java -name "TestOrderedWordCount.java")

Lifecycle

flowchart TD
  setUp[BeforeClass: setup] --> mini[Start MiniTezCluster]
  mini --> tez[Create TezClient]
  tez --> test1[Test: build DAG]
  test1 --> submit[submitDAG]
  submit --> wait[waitForCompletion]
  wait --> assert[Assert DAGStatus + counters]
  assert --> tear[AfterClass: tearDown]
  tear --> stop[Stop TezClient + cluster]

Common shape

@BeforeClass
public static void setup() throws Exception {
  conf = new Configuration();
  conf.setInt(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS, 100);
  miniTezCluster = new MiniTezCluster("name", 1, 1, 1);
  miniTezCluster.init(conf);
  miniTezCluster.start();
  TezConfiguration tezConf = new TezConfiguration(miniTezCluster.getConfig());
  tezClient = TezClient.create("test", tezConf);
  tezClient.start();
}

@AfterClass
public static void tearDown() throws Exception {
  tezClient.stop();
  miniTezCluster.stop();
}

@Test(timeout = 60_000)
public void testWordCount() throws Exception {
  DAG dag = buildWordCountDAG();
  DAGClient client = tezClient.submitDAG(dag);
  DAGStatus status = client.waitForCompletionWithStatusUpdates(EnumSet.of(StatusGetOpts.GET_COUNTERS));
  assertEquals(DAGStatus.State.SUCCEEDED, status.getState());
  assertEquals(EXPECTED_ROW_COUNT,
      status.getDAGCounters().findCounter(TaskCounter.OUTPUT_RECORDS).getValue());
}

@Test(timeout = ...) is mandatory

A mini-cluster test that hangs blocks the whole CI build, so every MiniTezCluster test must declare a JUnit timeout, typically in the 60-300 second range.


Local mode for tests

Local mode is faster than MiniTezCluster: it needs no YARN and no DFS, and everything runs in-process.

grep -rn "TEZ_LOCAL_MODE\|setLocalMode\|tez.local.mode" \
  tez-tests/src/test/java tez-runtime-library/src/test/java | head

Used for:

  • Unit-style integration tests where YARN isn't relevant.
  • Examples / smoke tests in tez-examples.
  • Quick repro of runtime issues — see the local mode deep dive.
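
As a rough sketch of what switching a test to local mode involves, the settings below are the ones usually associated with Tez local mode. The class LocalModeSettings and its overrides() method are hypothetical; a real test would more likely set these on a TezConfiguration through constants such as TezConfiguration.TEZ_LOCAL_MODE. The property names are given as I understand them and should be checked against the Tez version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper listing the overrides a local-mode test typically applies. */
class LocalModeSettings {

  /** Returns property overrides to copy into the test's configuration. */
  static Map<String, String> overrides() {
    Map<String, String> settings = new LinkedHashMap<>();
    // Run the AM and tasks inside the client JVM instead of on YARN.
    settings.put("tez.local.mode", "true");
    // Use the local file system rather than a DFS.
    settings.put("fs.defaultFS", "file:///");
    // Assumed companion setting: let shuffle read local files directly.
    settings.put("tez.runtime.optimize.local.fetch", "true");
    return settings;
  }

  public static void main(String[] args) {
    overrides().forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```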

Patterns: do and don't

Do

grep -rn "DrainDispatcher\|await()" tez-dag/src/test/java | wc -l

  • Send all setup events synchronously, then call dispatcher.await().
  • Use MockClock and advance it explicitly.
  • Capture emitted events with a custom handler and assert on the collection.
  • Use @Before / @After to reset shared dispatcher and mocks.
  • Mock external collaborators (TaskScheduler, ContainerLauncher, NMClient); never instantiate the real ones in unit tests.
  • Bound parallelism in mini-cluster tests (numNodeManagers=1 is usually fine).
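
Two of the items above, advancing a mock clock explicitly and asserting on a captured collection of events, combine naturally. The sketch below uses only the JDK: ManualClock and HeartbeatMonitor are hypothetical classes invented for illustration and are not Tez's MockClock or any Tez component. It shows a timeout being tested to the millisecond without any real waiting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ManualClockDemo {

  /** Hypothetical controllable clock: time moves only when the test says so. */
  static final class ManualClock {
    private long nowMs;
    ManualClock(long startMs) { this.nowMs = startMs; }
    long getTime() { return nowMs; }
    void advance(long deltaMs) { nowMs += deltaMs; }
  }

  /** Hypothetical component under test: reports a timeout via an event sink. */
  static final class HeartbeatMonitor {
    private final ManualClock clock;
    private final long timeoutMs;
    private final Consumer<String> eventSink;
    private long lastHeartbeatMs;

    HeartbeatMonitor(ManualClock clock, long timeoutMs, Consumer<String> eventSink) {
      this.clock = clock;
      this.timeoutMs = timeoutMs;
      this.eventSink = eventSink;
      this.lastHeartbeatMs = clock.getTime();
    }

    void heartbeat() { lastHeartbeatMs = clock.getTime(); }

    /** Emits an event if more than timeoutMs has passed since the last heartbeat. */
    void check() {
      if (clock.getTime() - lastHeartbeatMs > timeoutMs) {
        eventSink.accept("ATTEMPT_TIMED_OUT");
      }
    }
  }

  static List<String> demo() {
    List<String> captured = new ArrayList<>(); // the capturing handler
    ManualClock clock = new ManualClock(0);
    HeartbeatMonitor monitor = new HeartbeatMonitor(clock, 5_000, captured::add);

    monitor.heartbeat();
    clock.advance(4_999);
    monitor.check();      // still inside the timeout: nothing emitted
    clock.advance(2);
    monitor.check();      // 5001 ms since heartbeat: one event emitted
    return captured;
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [ATTEMPT_TIMED_OUT]
  }
}
```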

Don't

| Anti-pattern | Why it bites |
|---|---|
| Thread.sleep(N) to wait for state | Flaky; transition time depends on machine load. |
| while (vertex.getState() != X) busy loop | Same flakiness, plus it burns CPU. |
| Assuming e1 is handled before e2 when both are posted asynchronously | The dispatcher handles events in the order they arrive in its queue, which need not match the order in which they were posted. |
| Static state across tests | Test execution order is not guaranteed; leaked static state corrupts later tests. |
| Real network calls in unit tests | Slow, flaky, often forbidden in CI sandboxes. |
| System.exit from tested code paths | Kills the JVM running the test runner. |

CI / build integration

head -100 pom.xml
grep -n "surefire\|failsafe" pom.xml

| Maven plugin | Runs | Default include patterns |
|---|---|---|
| maven-surefire-plugin | unit tests under src/test/java | Test*.java, *Test.java, *Tests.java, *TestCase.java |
| maven-failsafe-plugin | integration tests | IT*.java, *IT.java, *ITCase.java |

Tez puts MiniTezCluster tests under surefire as well (no separation), which is one reason mvn test is slow. Run a single test:

mvn test -pl tez-dag -Dtest=TestVertexImpl
mvn test -pl tez-tests -Dtest=TestOrderedWordCount#testWordCount

Test-only utilities

find . -path "*/src/main/java/*" -name "Test*.java" | head
find . -path "*/src/test/java/*" -name "Mock*.java" | head

Helpful classes (some live under src/main so they're reusable downstream):

| Class | Module | Purpose |
|---|---|---|
| MiniTezCluster | tez-tests | Bootstrap an in-process cluster |
| TezClientForTest | tez-api | Subclass exposing internals |
| MockDAG, MockVertex, MockTask | tez-dag test sources | Plain-old objects implementing state-machine interfaces |
| TestProcessor, TestInput, TestOutput | tez-tests | No-op IPOs for plan plumbing tests |
| DrainDispatcher | hadoop-yarn-common (depended upon) | Dispatcher with await() |

Reading exercise

  1. cat $(find tez-dag/src/test/java -name "TestVertexImpl.java") | head -150 — read the setup + first test.
  2. grep -n "@Test" $(find tez-dag/src/test/java -name "TestTaskImpl.java" -o -name "TestVertexImpl.java" -o -name "TestDAGImpl.java") | wc -l — get a sense of the test surface.
  3. cat $(find tez-tests/src/test/java -name "TestOrderedWordCount.java") | head -200 — see a real MiniTezCluster test.
  4. grep -rn "dispatcher.await\|DrainDispatcher" tez-dag/src/test/java | wc -l — confirm the pattern is universal.
  5. grep -rn "Thread.sleep" tez-dag/src/test/java | head — find any stragglers using the anti-pattern; understand why each one is there (usually waiting on real OS state, e.g. a port).
  6. mvn -pl tez-dag test -Dtest=TestVertexImpl -DfailIfNoTests=false — run one and read the output structure.

Common bugs and symptoms

| Symptom | Likely cause |
|---|---|
| Test passes locally, flakes in CI | Thread.sleep waiting for a transition; replace with dispatcher.await(). |
| MiniTezCluster test hangs forever | Missing @Test(timeout = …), so a test bug that keeps the AM from finishing is never cut short. |
| BindException in mini-cluster | A previous test didn't call stop(); ports leaked. |
| State machine throws InvalidStateTransitionException in test | Test sent an event in the wrong state; check the arrange step. |
| getDAG() returns null on an object built around a mocked AppContext | Forgot to stub when(appContext.getCurrentDAG()). |
| OutOfMemoryError: Java heap space in surefire | The forked test JVM runs out of heap; tune surefire's argLine (e.g. -Xmx1g) in the pom. |
| Test depends on a counter being non-zero, but it's zero | Counter is incremented in a code path the mock skipped; verify the code under test actually ran. |
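
The invalid-transition symptom is easier to recognize with a small model in hand. The following toy state machine is a sketch only: ToyVertexStateMachine, its reduced state and event sets, and the use of IllegalStateException in place of InvalidStateTransitionException are all simplifying assumptions. It shows that an event sent before the arrange step has brought the object into the expected state is rejected outright.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy transition table; the states and events are a simplified illustration. */
class ToyVertexStateMachine {
  enum State { NEW, INITED, RUNNING, SUCCEEDED }
  enum Event { V_INIT, V_START, V_COMPLETED }

  private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
  private State current = State.NEW;

  ToyVertexStateMachine() {
    allow(State.NEW, Event.V_INIT, State.INITED);
    allow(State.INITED, Event.V_START, State.RUNNING);
    allow(State.RUNNING, Event.V_COMPLETED, State.SUCCEEDED);
  }

  // Registers one legal transition: in state 'from', event 'on' leads to 'to'.
  private void allow(State from, Event on, State to) {
    table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
  }

  State getState() { return current; }

  /** Applies the event, or throws if no transition is registered for it. */
  State handle(Event event) {
    Map<Event, State> row = table.get(current);
    State next = (row == null) ? null : row.get(event);
    if (next == null) {
      throw new IllegalStateException(
          "Invalid event " + event + " in state " + current);
    }
    current = next;
    return current;
  }

  public static void main(String[] args) {
    ToyVertexStateMachine arranged = new ToyVertexStateMachine();
    arranged.handle(Event.V_INIT);                     // arrange: reach INITED
    System.out.println(arranged.handle(Event.V_START)); // prints RUNNING

    try {
      new ToyVertexStateMachine().handle(Event.V_START); // arrange step skipped
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // prints Invalid event V_START in state NEW
    }
  }
}
```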

Validation: prove you understand this

  1. Outline the four-step recipe for a state-machine unit test, with the exact call to drain the dispatcher.
  2. Name three classes from tez-dag/src/test/java that implement the Mock* pattern and what each replaces.
  3. Explain why Thread.sleep is an anti-pattern in Tez tests and what the correct alternative is for time-dependent transitions.
  4. Given a hang in TestVertexImpl#testTaskKill, list the first three diagnostics you'd inspect (no debugger).
  5. Describe the difference between a MiniTezCluster test and a local-mode test, and give one scenario where each is the correct choice.