Testing Framework
Tez ships three tiers of tests, each with a different cost/coverage tradeoff. Knowing which tier to use for a given change — and which patterns are considered idiomatic — is the difference between a patch that lands and one that sits in review.
Three tiers
| Tier | Module | Boots... | Run cost | Use for |
|---|---|---|---|---|
| Unit | each module's src/test/java | nothing real; pure mocks + dispatcher | seconds | State-machine transitions, parsers, helper classes |
| Mini-cluster | tez-tests/src/test/java | MiniTezCluster (MiniYARNCluster + Tez session) | seconds-to-minutes | End-to-end DAGs in a JVM |
| Full cluster | external | real YARN cluster | minutes | Release validation, perf tests |
find . -name "MiniTezCluster.java"
wc -l $(find . -name "MiniTezCluster.java")
Unit testing state machines
The dominant pattern for Tez unit tests is arrange-state, send-event,
drain-dispatcher, assert. Reference: TestVertexImpl, TestTaskImpl,
TestTaskAttemptImpl, TestDAGImpl.
find tez-dag/src/test/java -name "TestVertexImpl.java"
wc -l $(find tez-dag/src/test/java -name "TestVertexImpl.java")
grep -n "DrainDispatcher\|MockVertex\|MockDAG\|setupVertices\|dispatcher.await" \
$(find tez-dag/src/test/java -name "TestVertexImpl.java") | head -20
Building blocks
| Class | Purpose |
|---|---|
DrainDispatcher | Synchronous-ish event dispatcher; await() blocks until queue drains |
MockVertex, MockDAG, MockTask, etc | Lightweight stand-ins that satisfy Vertex etc interfaces |
MockClock | Controllable clock for time-dependent transitions |
MockHistoryEventHandler | Captures recovery / ATS events for assertion |
Mockito (mock, when, verify) | Mocks for collaborators (TaskSchedulerManager, etc) |
Recipe
@Test
public void testVertexInitsAfterAllInputsReady() throws Exception {
// 1. Arrange
DrainDispatcher dispatcher = new DrainDispatcher();
dispatcher.init(new Configuration());
dispatcher.start();
TaskSchedulerManager sched = mock(TaskSchedulerManager.class);
DAG dag = mock(DAG.class);
when(dag.getID()).thenReturn(TezDAGID.getInstance(appAttemptId, 1));
VertexImpl v = new VertexImpl(vertexId, plan, name, conf,
dispatcher.getEventHandler(),
mock(TaskCommunicatorManagerInterface.class),
mockClock, taskHeartbeatHandler, mockAppContext,
VertexLocationHint.create(...), dispatcher,
mockVertexManager, ...);
// 2. Act
dispatcher.getEventHandler().handle(
new VertexEvent(vertexId, VertexEventType.V_INIT));
dispatcher.await();
// 3. Assert
assertEquals(VertexState.INITED, v.getState());
verify(sched, never()).taskAllocated(any(), any(), any());
}
Key idioms:
- Never call
Thread.sleep. Alwaysdispatcher.await(). - Never assume event ordering unless you've sent events sequentially through the same dispatcher.
- Mock the AppContext aggressively. It's the god-object; mocking it lets each test isolate exactly the collaborators it cares about.
MiniTezCluster tests
find tez-tests/src/test/java -name "TestOrderedWordCount.java" \
-o -name "TestMRRJobsDAGApi.java" \
-o -name "TestExtServicesWithLocalMode.java" | head
wc -l $(find tez-tests/src/test/java -name "TestOrderedWordCount.java")
MiniTezCluster boots:
- A
MiniYARNCluster(in-process RM + N NMs). - A
MiniDFSCluster(in-process NN + DNs) — optional. - A
TezClientconfigured against the mini cluster.
grep -n "MiniTezCluster\|new MiniYARNCluster\|setup\|tearDown" \
$(find tez-tests/src/test/java -name "TestOrderedWordCount.java")
Lifecycle
flowchart TD
setUp[BeforeClass: setup] --> mini[Start MiniTezCluster]
mini --> tez[Create TezClient]
test1[Test: build DAG] --> submit[submitDAG]
submit --> wait[waitForCompletion]
wait --> assert[Assert DAGStatus + counters]
tear[AfterClass: tearDown] --> stop[Stop TezClient + cluster]
Common shape
@BeforeClass
public static void setup() throws Exception {
conf = new Configuration();
conf.setInt(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS, 100);
miniTezCluster = new MiniTezCluster("name", 1, 1, 1);
miniTezCluster.init(conf);
miniTezCluster.start();
TezConfiguration tezConf = new TezConfiguration(miniTezCluster.getConfig());
tezClient = TezClient.create("test", tezConf);
tezClient.start();
}
@AfterClass
public static void tearDown() throws Exception {
tezClient.stop();
miniTezCluster.stop();
}
@Test(timeout = 60_000)
public void testWordCount() throws Exception {
DAG dag = buildWordCountDAG();
DAGClient client = tezClient.submitDAG(dag);
DAGStatus status = client.waitForCompletionWithStatusUpdates(EnumSet.of(StatusGetOpts.GET_COUNTERS));
assertEquals(DAGStatus.State.SUCCEEDED, status.getState());
assertEquals(EXPECTED_ROW_COUNT,
status.getDAGCounters().findCounter(TaskCounter.OUTPUT_RECORDS).getValue());
}
@Test(timeout = ...) is mandatory
A mini-cluster test that hangs blocks the whole CI build. Every
MiniTezCluster test has a JUnit timeout in the 60-300 second range.
Local mode for tests
Faster than MiniTezCluster: no YARN, no DFS, everything in-process.
grep -rn "TEZ_LOCAL_MODE\|setLocalMode\|tez.local.mode" \
tez-tests/src/test/java tez-runtime-library/src/test/java | head
Used for:
- Unit-style integration tests where YARN isn't relevant.
- Examples / smoke tests in
tez-examples. - Quick repro of runtime issues — see the local mode deep dive.
Patterns: do and don't
Do
grep -rn "DrainDispatcher\|await()" tez-dag/src/test/java | wc -l
- Send all setup events synchronously, then call
dispatcher.await(). - Use
MockClockand advance it explicitly. - Capture emitted events with a custom handler and assert on the collection.
- Use
@Before/@Afterto reset shared dispatcher and mocks. - Mock external collaborators (
TaskScheduler,ContainerLauncher,NMClient); never instantiate the real ones in unit tests. - Bound parallelism in mini-cluster tests (
numNodeManagers=1is usually fine).
Don't
| Anti-pattern | Why it bites |
|---|---|
Thread.sleep(N) to wait for state | Flake city; transition time depends on machine load. |
while (vertex.getState() != X) busy loop | Same flake, plus burns CPU. |
Assume e1 happens before e2 when both posted async | Dispatcher orders by arrival, not posting. |
| Static state across tests | Tests run in some JVM order; static leaks corrupt later tests. |
| Real network calls in unit tests | Slow, flaky, often forbidden in CI sandboxes. |
System.exit from tested code paths | Kills the JVM running the test runner. |
CI / build integration
cat pom.xml | head -100
grep -n "surefire\|failsafe" pom.xml
| Maven plugin | Runs | Default scope |
|---|---|---|
maven-surefire-plugin | unit tests under src/test/java | Test*.java, *Test.java, *Tests.java, *TestCase.java |
maven-failsafe-plugin | integration tests | IT*.java, *IT.java, *ITCase.java |
Tez puts MiniTezCluster tests under surefire as well (no separation),
which is one reason mvn test is slow. Run a single test:
mvn test -pl tez-dag -Dtest=TestVertexImpl
mvn test -pl tez-tests -Dtest=TestOrderedWordCount#testWordCount
Test-only utilities
find . -path "*/src/main/java/*" -name "Test*.java" | head
find . -path "*/src/test/java/*" -name "Mock*.java" | head
Helpful classes (some live under src/main so they're reusable downstream):
| Class | Module | Purpose |
|---|---|---|
MiniTezCluster | tez-tests | Bootstrap an in-process cluster |
TezClientForTest | tez-api | Subclass exposing internals |
MockDAG, MockVertex, MockTask | tez-dag test sources | Plain-old objects implementing state-machine interfaces |
TestProcessor, TestInput, TestOutput | tez-tests | No-op IPOs for plan plumbing tests |
DrainDispatcher | hadoop-yarn-common (depended upon) | Dispatcher with await() |
Reading exercise
cat $(find tez-dag/src/test/java -name "TestVertexImpl.java") | head -150— read the setup + first test.grep -n "@Test" $(find tez-dag/src/test/java -name "TestTaskImpl.java" \ -o -name "TestVertexImpl.java" \ -o -name "TestDAGImpl.java") | wc -l— get a sense of the test surface.cat $(find tez-tests/src/test/java -name "TestOrderedWordCount.java") | head -200— see a real MiniTezCluster test.grep -rn "dispatcher.await\|DrainDispatcher" tez-dag/src/test/java | wc -l— confirm the pattern is universal.grep -rn "Thread.sleep" tez-dag/src/test/java | head— find any stragglers using the anti-pattern; understand why each one is there (usually waiting on real OS state, e.g. a port).mvn -pl tez-dag test -Dtest=TestVertexImpl -DfailIfNoTests=false— run one and read the output structure.
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| Test passes locally, flakes in CI | Thread.sleep waiting for transition; replace with dispatcher.await(). |
MiniTezCluster test hangs forever | Missing @Test(timeout = …); AM never finishes due to test bug. |
BindException in mini-cluster | Previous test didn't stop(); ports leaked. |
State machine throws InvalidStateTransitionException in test | Test sent event in wrong state; check arrange step. |
Mock returns null from getDAG() | Forgot to stub when(appContext.getCurrentDAG()). |
OutOfMemoryError: Java heap space in surefire | Each test forking JVM holds too much; tune argLine=-Xmx1g in pom. |
| Test depends on counter being non-zero, but it's zero | Counter incremented in code path the mock skipped; verify the code under test actually ran. |
Validation: prove you understand this
- Outline the four-step recipe for a state-machine unit test, with the exact call to drain the dispatcher.
- Name three classes from
tez-dag/src/test/javathat implement theMock*pattern and what each replaces. - Explain why
Thread.sleepis an anti-pattern in Tez tests and what the correct alternative is for time-dependent transitions. - Given a hang in
TestVertexImpl#testTaskKill, list the first three diagnostics you'd inspect (no debugger). - Describe the difference between a
MiniTezClustertest and a local-mode test, and give one scenario where each is the correct choice.