Lab 5.1 — Explore MiniTezCluster and TestOrderedWordCount
Lab type: Read & Run
Estimated time: 90 min
Tez module: tez-tests
Key class: org.apache.tez.test.TestOrderedWordCount
Overview
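MiniTezCluster spins up an in-process YARN ResourceManager and one or more
NodeManagers configured for Tez. Tests usually pair it with Hadoop's
MiniDFSCluster, which provides an in-process HDFS NameNode and DataNode. The
Tez ApplicationMaster and its tasks then run as containers on this mini
cluster, so you can submit real DAGs from a JUnit test with no external
infrastructure.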
TestOrderedWordCount is the canonical example: it submits a multi-stage
word-count DAG (tokenize → sum per-word counts → sort by count) and asserts
that the output is correct.
Step 1 — Locate the Files
find ~/tez-src -name "MiniTezCluster.java" | head -5
find ~/tez-src -name "TestOrderedWordCount.java" | head -5
find ~/tez-src -name "MiniTezClusterWithTez.java" | head -5
Step 2 — Read MiniTezCluster.java
Open MiniTezCluster.java and answer:
| # | Question |
|---|---|
| 1 | What superclass does MiniTezCluster extend? What Hadoop class sets up the in-process YARN cluster? |
| 2 | Where is TezConfiguration created and how is it modified to use the in-process services? |
| 3 | What is the purpose of the serviceStart() method? What does it start? |
| 4 | After serviceStop(), can you call serviceStart() again on the same instance? Why or why not? |
| 5 | Where does MiniTezCluster write its temporary data (HDFS files, YARN work dirs)? How would a test clean this up? |
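
Before answering, it can help to get a quick outline of the file. The sketch below prints the class declaration and the Hadoop service lifecycle methods with their line numbers. The function name outline_service is made up for this lab, and the pattern assumes the lifecycle methods are declared as void serviceInit/serviceStart/serviceStop, so treat the output as a starting point rather than a complete index.

```shell
# outline_service FILE
# Print, with line numbers, the class declaration line and any
# serviceInit/serviceStart/serviceStop method declarations in FILE.
# (Hypothetical helper for this lab; not part of Tez or Hadoop.)
outline_service() {
  grep -n -E \
    'class[[:space:]]+[A-Za-z_][A-Za-z0-9_]*[[:space:]]*(extends|implements|[{])|void[[:space:]]+service(Init|Start|Stop)[[:space:]]*[(]' \
    "$1"
}

# Usage (path is whatever Step 1 printed):
#   outline_service path/to/MiniTezCluster.java
```

Calls such as super.serviceStart() are deliberately not matched, so only declarations show up.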
Step 3 — Read TestOrderedWordCount.java
Work through the test lifecycle:
3a — @BeforeClass setUpClass()
| # | Question |
|---|---|
| 1 | How many NodeManagers does the test cluster start with? |
| 2 | After miniTezCluster.start(), what call copies the Tez auxiliary service config? |
| 3 | Where are test input files created — on HDFS or local FS? |
| 4 | Is a new TezClient created per test or per class? |
3b — @Test testOrderedWordCount()
| # | Question |
|---|---|
| 1 | Trace the method calls from TezClient.submitDAG() to when the test receives the final DAGStatus. |
| 2 | What does the assertion verify — DAG state, output correctness, or counter values? |
| 3 | If you wanted to assert on a specific counter (e.g. TaskCounter.INPUT_RECORDS_PROCESSED), where in the test would you add that assertion? |
3c — @AfterClass tearDownClass()
| # | Question |
|---|---|
| 1 | What is the order of shutdown calls? Does the TezClient stop before or after the cluster? |
| 2 | Does the test delete the HDFS working directory? Should it? |
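
To find the lifecycle methods referred to in 3a to 3c quickly, the following sketch lists each JUnit 4 annotation together with the method signature that follows it. The helper name list_junit_hooks is hypothetical, and it assumes the annotation sits on its own line directly above the method, which is common but not guaranteed.

```shell
# list_junit_hooks FILE
# For each @BeforeClass/@AfterClass/@Before/@After/@Test annotation in
# FILE, print "<annotation line>: <annotation> -> <method signature>".
# Assumes the signature is the first later line that contains "(" and
# does not itself start with "@". (Hypothetical helper for this lab.)
list_junit_hooks() {
  awk '
    /^[ \t]*@(BeforeClass|AfterClass|Before|After|Test)([^A-Za-z]|$)/ {
      tag = $1; sub(/\(.*/, "", tag)   # drop arguments such as (timeout = ...)
      at = NR
      next
    }
    tag != "" && /\(/ && $1 !~ /^@/ {
      sig = $0
      sub(/^[ \t]+/, "", sig)          # trim indentation
      sub(/[ \t]*[{]?[ \t]*$/, "", sig) # trim trailing " {"
      printf "%d: %s -> %s\n", at, tag, sig
      tag = ""
    }
  ' "$1"
}

# Usage:
#   list_junit_hooks path/to/TestOrderedWordCount.java
```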
Step 4 — Run the Test
cd ~/tez-src
mvn test -pl tez-tests -Dtest=TestOrderedWordCount 2>&1 | tail -20
Expected:
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
If you see Unable to find class: org.apache.tez.test.TestOrderedWordCount,
ensure mvn install -DskipTests completed successfully for all modules.
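If you would rather check the result mechanically than by eye, a small filter can do it. This is a minimal sketch: surefire_ok is a hypothetical name, and it only looks for the "Tests run:" summary format shown above.

```shell
# surefire_ok: read Maven or Surefire report text on stdin. Succeeds
# only if at least one "Tests run:" summary line is present and every
# such line reports zero failures and zero errors.
# (Hypothetical helper for this lab.)
surefire_ok() {
  awk '
    /Tests run: [0-9]+, Failures: [0-9]+, Errors: [0-9]+/ {
      seen = 1
      if ($0 !~ /Failures: 0, Errors: 0([^0-9]|$)/) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# Usage, reading Surefire's default plain-text reports:
#   cat tez-tests/target/surefire-reports/*.txt | surefire_ok && echo PASS
```

Requiring at least one summary line means a run that executed no tests at all is reported as a failure instead of a silent pass.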
Step 5 — Measure the Overhead
Time the test:
time mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -3
Record how long it takes. Then answer:
- Is the bottleneck cluster startup, DAG execution, or cluster shutdown?
  (Hint: add -Dorg.apache.tez.test.MiniTezCluster.log.level=DEBUG and look at the timestamps.)
- Why is @BeforeClass used instead of @Before? What is the performance difference?
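Reading timestamps by eye gets tedious on a long log. The sketch below prints the largest pauses between consecutive log lines, which usually points at the slow phase. It assumes each log line starts with a date followed by a time in HH:MM:SS,mmm form (the common log4j layout) and that the run does not cross midnight. The name log_gaps is hypothetical.

```shell
# log_gaps [MIN_MS]: read log lines on stdin and print
# "<gap in ms> <line>" for every timestamped line that arrives at
# least MIN_MS (default 1000) after the previous timestamped line.
# Assumes lines look like "YYYY-MM-DD HH:MM:SS,mmm LEVEL ...".
# (Hypothetical helper for this lab.)
log_gaps() {
  awk -v min="${1:-1000}" '
    $2 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$/ {
      split($2, t, /[:,]/)                       # hh, mm, ss, ms
      now = ((t[1] * 60 + t[2]) * 60 + t[3]) * 1000 + t[4]
      if (have && now - prev >= min) printf "%d %s\n", now - prev, $0
      prev = now; have = 1
    }
  '
}

# Usage: show the ten longest pauses of two seconds or more
#   log_gaps 2000 < test.log | sort -rn | head
```

Each printed line is the first message after a pause, so the work that took the time happened just before it in the log.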
Step 6 — Find More Integration Tests
find ~/tez-src/tez-tests -name "Test*.java" | xargs grep -l "MiniTezCluster" | head -10
Pick one that is NOT TestOrderedWordCount. Read its @BeforeClass and one
@Test method. Answer:
- What scenario does this test cover that TestOrderedWordCount does not?
- Does it use a separate MiniTezCluster instance, or the same one reused across multiple test classes? How?
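To help choose a candidate, the following sketch extends the find command above: it drops TestOrderedWordCount and prefixes each remaining file with the number of lines containing @Test, so the larger suites sort to the top. The function name other_cluster_tests is hypothetical.

```shell
# other_cluster_tests DIR: list Test*.java files under DIR that mention
# MiniTezCluster, excluding TestOrderedWordCount.java. Each path is
# prefixed with the count of lines containing "@Test", largest first.
# (Hypothetical helper for this lab.)
other_cluster_tests() {
  find "$1" -name 'Test*.java' ! -name 'TestOrderedWordCount.java' \
       -exec grep -l 'MiniTezCluster' {} + |
  while IFS= read -r f; do
    printf '%s %s\n' "$(grep -c '@Test' "$f")" "$f"
  done | sort -rn
}

# Usage:
#   other_cluster_tests ~/tez-src/tez-tests | head -10
```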
Step 7 — Source Connection Table
| Class used in this lab | Tez source file (relative to repo root) |
|---|---|
| MiniTezCluster | |
| TezClient | |
| TezConfiguration | |
| DAGStatus | |
| MiniDFSCluster (Hadoop helper) | |
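
The table can be filled in by hand with find, or drafted with a small helper like the one below. It is only a sketch: source_rows is a hypothetical name, it takes the first match when several files share a class name, and ROOT should be given without a trailing slash.

```shell
# source_rows ROOT CLASS...: print one Markdown table row per class,
# giving the path of CLASS.java relative to ROOT, or "(not found)"
# when no such file exists under ROOT.
# (Hypothetical helper for this lab.)
source_rows() {
  root=$1; shift
  for cls in "$@"; do
    path=$(find "$root" -name "$cls.java" | sort | head -n 1)
    if [ -n "$path" ]; then
      printf '| %s | %s |\n' "$cls" "${path#"$root"/}"
    else
      printf '| %s | (not found) |\n' "$cls"
    fi
  done
}

# Usage:
#   source_rows ~/tez-src MiniTezCluster TezClient TezConfiguration DAGStatus MiniDFSCluster
```

A class reported as not found is not part of the Tez source tree and presumably comes from a dependency, which is itself worth noting in the table.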
Step 8 — JIRA Research
Search:
project = TEZ AND component = "tez-tests" AND resolution = Fixed ORDER BY updated DESC
Find a recent test-improvement JIRA.
- What was added or fixed?
- Does the patch include a new test, an existing test modification, or a flaky-test fix?