Lab 5.1 — Explore `MiniTezCluster` and `TestOrderedWordCount`

Lab type: Read & Run
Estimated time: 90 min
Tez module: tez-tests
Key class: org.apache.tez.test.TestOrderedWordCount

Overview

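`MiniTezCluster` runs a YARN ResourceManager and one or more NodeManagers in-process, and tests typically pair it with Hadoop's `MiniDFSCluster` for an in-process HDFS NameNode and DataNode. The Tez ApplicationMaster is then launched by the mini cluster like any other YARN application. Together they let you submit real DAGs from a JUnit test with no external infrastructure.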

`TestOrderedWordCount` is the canonical example: it submits a multi-stage word-count DAG (tokenize → partition → sort → count) and asserts correct output.

Step 1 — Locate the Files

find ~/tez-src -name "MiniTezCluster.java" | head -5
find ~/tez-src -name "TestOrderedWordCount.java" | head -5
find ~/tez-src -name "MiniTezClusterWithTez.java" | head -5

Step 2 — Read `MiniTezCluster.java`

Open `MiniTezCluster.java` and answer:

#	Question
1	What superclass does `MiniTezCluster` extend? What Hadoop class sets up the in-process YARN cluster?
2	Where is `TezConfiguration` created and how is it modified to use the in-process services?
3	What is the purpose of the `serviceStart()` method? What does it start?
4	After `serviceStop()`, can you call `serviceStart()` again on the same instance? Why or why not?
5	Where does MiniTezCluster write its temporary data (HDFS files, YARN work dirs)? How would a test clean this up?
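
For a first pass at questions 1, 3, and 4, it can help to list where the class declaration and the lifecycle methods sit before reading them in full. The helper below is a minimal sketch: `outline_class` is a hypothetical name, and the pattern assumes the file uses the Hadoop service method names `serviceInit`, `serviceStart`, and `serviceStop` (only the last two are mentioned in the questions above).

```shell
# Print line-numbered hits for the class declaration and the service
# lifecycle methods in a Java source file.
# Assumption: the file uses the names serviceInit/serviceStart/serviceStop.
outline_class() {
  grep -nE 'class [A-Za-z_]+ extends|serviceInit\(|serviceStart\(|serviceStop\(' "$1"
}
```

Pass it the path printed by the `find` command in Step 1, then jump to the reported line numbers in your editor.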

Step 3 — Read `TestOrderedWordCount.java`

Work through the test lifecycle:

3a — `@BeforeClass setUpClass()`

#	Question
1	How many NodeManagers does the test cluster start with?
2	After `miniTezCluster.start()`, what call copies the Tez auxiliary service config?
3	Where are test input files created — on HDFS or local FS?
4	Is a new `TezClient` created per test or per class?

3b — `@Test testOrderedWordCount()`

#	Question
1	Trace the method calls from `TezClient.submitDAG()` to when the test receives the final `DAGStatus`.
2	What does the assertion verify — DAG state, output correctness, or counter values?
3	If you wanted to assert on a specific counter (e.g. `TaskCounter.INPUT_RECORDS_PROCESSED`), where in the test would you add that assertion?

3c — `@AfterClass tearDownClass()`

#	Question
1	What is the order of shutdown calls? Does the `TezClient` stop before or after the cluster?
2	Does the test delete the HDFS working directory? Should it?

Step 4 — Run the Test

cd ~/tez-src
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -20

Expected:

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

If you see `Unable to find class: org.apache.tez.test.TestOrderedWordCount`, ensure `mvn install -DskipTests` completed successfully for all modules.
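
If the summary line does not appear in the console output, the same result can usually be read from the Surefire report files. This is a sketch under stated assumptions: it uses the default Surefire report location, `tez-tests/target/surefire-reports`, and `surefire_summary` is a hypothetical helper name.

```shell
# Print the "Tests run:" summary line from every Surefire text report.
# Assumption: reports are in the default directory unless one is passed in.
surefire_summary() {
  grep -h '^Tests run:' "${1:-tez-tests/target/surefire-reports}"/*.txt 2>/dev/null
}
```

Run `surefire_summary` from `~/tez-src` after the test finishes. No output means no report files were found in that directory.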

Step 5 — Measure the Overhead

Time the test:

time mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -3

Record how long it takes. Then answer:

Is the bottleneck cluster startup, DAG execution, or cluster shutdown? (Hint: add `-Dorg.apache.tez.test.MiniTezCluster.log.level=DEBUG` and look at the timestamps.)
Why is `@BeforeClass` used instead of `@Before`? What is the performance difference?
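
To act on the timestamp hint, one option is to capture the run's output to a file and look for the largest pause between consecutive log lines. The sketch below assumes log lines begin with `YYYY-MM-DD HH:MM:SS,mmm` (a common log4j layout; check what your build actually emits) and that the run does not cross midnight. `largest_log_gap` is a hypothetical helper name.

```shell
# Print the largest gap, in milliseconds, between consecutive timestamped
# log lines, followed by the line that ended the gap.
# Assumptions: field 2 of each log line looks like HH:MM:SS,mmm and the
# log does not cross midnight. Lines without a timestamp are skipped.
largest_log_gap() {
  awk '
    $2 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$/ {
      split($2, t, "[:,]")
      now = ((t[1] * 60 + t[2]) * 60 + t[3]) * 1000 + t[4]
      if (seen && now - prev > max) { max = now - prev; line = $0 }
      prev = now
      seen = 1
    }
    END { if (seen) printf "%d ms before: %s\n", max, line }
  ' "$1"
}
```

For example, redirect the Maven output to `run.log` and call `largest_log_gap run.log`. A large gap that ends at a line logged by cluster startup or shutdown code points at that phase rather than at DAG execution.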

Step 6 — Find More Integration Tests

find ~/tez-src/tez-tests -name "Test*.java" | xargs grep -l "MiniTezCluster" | head -10

Pick one that is NOT `TestOrderedWordCount`. Read its `@BeforeClass` method and one `@Test` method. Answer:

What scenario does this test cover that TestOrderedWordCount does not?
Does it create its own `MiniTezCluster` instance, or share one across multiple test classes? If it is shared, how?
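
When choosing a test to read, it may help to rank the candidates by size. The sketch below lists the matching files, skips `TestOrderedWordCount.java`, and prefixes each path with the number of lines that contain `@Test`. `list_cluster_tests` is a hypothetical helper name, and the count is only a rough proxy for the number of test methods.

```shell
# List Test*.java files under a directory that mention MiniTezCluster,
# excluding TestOrderedWordCount.java, as "<count><TAB><path>" where
# <count> is the number of lines containing "@Test". Largest count first.
list_cluster_tests() {
  find "$1" -name 'Test*.java' -exec grep -l 'MiniTezCluster' {} + |
    grep -v 'TestOrderedWordCount\.java$' |
    while IFS= read -r f; do
      printf '%s\t%s\n' "$(grep -c '@Test' "$f")" "$f"
    done | sort -rn
}
```

Call it as `list_cluster_tests ~/tez-src/tez-tests`. A file with only a few `@Test` lines is usually the quickest to read end to end.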

Step 7 — Source Connection Table

Class used in this lab	Tez source file (relative to repo root)
`MiniTezCluster`
`TezClient`
`TezConfiguration`
`DAGStatus`
`MiniDFSCluster` (Hadoop helper)
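
One way to fill in the second column is to search the checkout for each class name. This is a minimal sketch; `source_paths` is a hypothetical helper name, and it assumes each class lives in a file named after it.

```shell
# For each class name, print "<Class><TAB><path relative to root>" for
# every <Class>.java found under the given root directory.
# A class that prints nothing has no source file in this checkout.
source_paths() {
  root=$1
  shift
  for cls in "$@"; do
    (cd "$root" && find . -name "$cls.java") |
      while IFS= read -r p; do
        printf '%s\t%s\n' "$cls" "${p#./}"
      done
  done
}
```

For example: `source_paths ~/tez-src MiniTezCluster TezClient TezConfiguration DAGStatus MiniDFSCluster`. If a class produces no line, it most likely comes from a dependency rather than the Tez repository, which is worth recording in the table as well.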

Step 8 — JIRA Research

Search:

project = TEZ AND component = "tez-tests" AND resolution = Fixed ORDER BY updated DESC

Find a recent test-improvement JIRA.

What was added or fixed?
Does the patch add a new test, modify an existing test, or fix a flaky test?
