Lab 5.1 — Explore MiniTezCluster and TestOrderedWordCount

Lab type: Read & Run
Estimated time: 90 min
Tez module: tez-tests
Key class: org.apache.tez.test.TestOrderedWordCount


Overview

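MiniTezCluster spins up an in-process YARN cluster (a ResourceManager and one or more NodeManagers) preconfigured for Tez. Tests typically pair it with Hadoop's MiniDFSCluster, which provides an in-process HDFS NameNode and DataNode, so the whole stack runs on one machine. The Tez ApplicationMaster is launched on that cluster when a DAG or session is submitted. This lets you submit real DAGs from a JUnit test with no external infrastructure.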

TestOrderedWordCount is the canonical example: it submits a multi-stage word-count DAG (tokenize → partition → sort → count) and asserts correct output.
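
For reference, the result such a DAG is expected to produce can be sketched in plain Java without Tez. This is only an in-memory model of the tokenize, count, and sort stages. It assumes whitespace tokenization and ascending order by count with ties broken alphabetically; the class and method names are hypothetical, and the real test's tokenization or ordering may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of an ordered word count; not Tez code.
final class OrderedWordCountSketch {

    // Tokenize on whitespace, sum occurrences per word, then order by count.
    static Map<String, Integer> orderedWordCount(List<String> lines) {
        // Tokenize and aggregate; TreeMap keeps the words in alphabetical order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        // Sort by count ascending; the stable sort preserves alphabetical order for ties.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        Map<String, Integer> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            ordered.put(e.getKey(), e.getValue());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Prints {c=1, a=2, b=3}
        System.out.println(orderedWordCount(Arrays.asList("b a b", "c b a")));
    }
}
```

Comparing a model like this against the test's assertions is one way to decide what "correct output" means for the DAG.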


Step 1 — Locate the Files

find ~/tez-src -name "MiniTezCluster.java" | head -5
find ~/tez-src -name "TestOrderedWordCount.java" | head -5
find ~/tez-src -name "MiniTezClusterWithTez.java" | head -5

Step 2 — Read MiniTezCluster.java

Open MiniTezCluster.java and answer:

  1. What superclass does MiniTezCluster extend? What Hadoop class sets up the in-process YARN cluster?
  2. Where is TezConfiguration created and how is it modified to use the in-process services?
  3. What is the purpose of the serviceStart() method? What does it start?
  4. After serviceStop(), can you call serviceStart() again on the same instance? Why or why not?
  5. Where does MiniTezCluster write its temporary data (HDFS files, YARN work dirs)? How would a test clean this up?
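
For question 5, one common way for a test to clean up local working directories is a recursive delete in a teardown hook. The sketch below uses only java.nio.file. The helper class and the directory names are hypothetical; where MiniTezCluster actually writes its data is something to confirm from the source.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical cleanup helper; not part of Tez or Hadoop.
final class TestDirCleanup {

    // Delete a directory tree, children before parents. A missing directory is a no-op.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order lists the deepest paths first, so directories are empty when deleted.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder layout standing in for a cluster work directory.
        Path work = Files.createTempDirectory("mini-cluster-sketch");
        Files.createDirectories(work.resolve("yarn").resolve("local"));
        Files.write(work.resolve("yarn").resolve("local").resolve("container.log"), "demo".getBytes());
        deleteRecursively(work);
        System.out.println("exists after cleanup: " + Files.exists(work));
    }
}
```

A helper like this only covers the local file system; data written to HDFS would need the corresponding FileSystem call instead.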

Step 3 — Read TestOrderedWordCount.java

Work through the test lifecycle:

3a — @BeforeClass setUpClass()

  1. How many NodeManagers does the test cluster start with?
  2. After miniTezCluster.start(), what call copies the Tez auxiliary service config?
  3. Where are test input files created — on HDFS or local FS?
  4. Is a new TezClient created per test or per class?

3b — @Test testOrderedWordCount()

  1. Trace the method calls from TezClient.submitDAG() to when the test receives the final DAGStatus.
  2. What does the assertion verify — DAG state, output correctness, or counter values?
  3. If you wanted to assert on a specific counter (e.g. TaskCounter.INPUT_RECORDS_PROCESSED), where in the test would you add that assertion?
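
When tracing question 1, it can help to keep a generic submit-then-poll shape in mind. The sketch below is a stand-in model only: StatusSource is a hypothetical interface playing the role of the handle returned by a submission, and the State enum is a simplified set of states chosen for readability. It does not use or reproduce the Tez API, so compare it against what the test actually calls.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.Set;

// Stand-in model of "submit, then poll until a terminal state"; not the Tez API.
final class PollUntilDoneSketch {

    enum State { SUBMITTED, RUNNING, SUCCEEDED, FAILED, KILLED }

    static final Set<State> TERMINAL = EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);

    // Hypothetical handle that reports the current state of a submitted job.
    interface StatusSource {
        State currentState();
    }

    // Poll until a terminal state is reported, or fail once maxPolls is exhausted.
    static State waitForCompletion(StatusSource source, int maxPolls, long sleepMillis)
            throws InterruptedException {
        State last = null;
        for (int i = 0; i < maxPolls; i++) {
            last = source.currentState();
            if (TERMINAL.contains(last)) {
                return last;
            }
            Thread.sleep(sleepMillis);
        }
        throw new IllegalStateException("no terminal state after " + maxPolls + " polls; last=" + last);
    }

    public static void main(String[] args) throws InterruptedException {
        // Scripted sequence of states standing in for a real cluster.
        Iterator<State> script =
                Arrays.asList(State.SUBMITTED, State.RUNNING, State.SUCCEEDED).iterator();
        State finalState = waitForCompletion(script::next, 10, 1L);
        System.out.println("final state: " + finalState);
    }
}
```

In a test, an assertion on the final state or on a counter would naturally sit right after the call that returns the terminal status.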

3c — @AfterClass tearDownClass()

  1. What is the order of shutdown calls? Does the TezClient stop before or after the cluster?
  2. Does the test delete the HDFS working directory? Should it?

Step 4 — Run the Test

cd ~/tez-src
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -20

Expected:

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

If you see "Unable to find class: org.apache.tez.test.TestOrderedWordCount", check that "mvn install -DskipTests" completed successfully for all modules.
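
If you script this check, the summary line can be parsed mechanically. A minimal sketch, assuming the summary keeps exactly the format of the expected line in this step; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for checking a "Tests run: ..." summary line.
final class SurefireSummary {

    private static final Pattern SUMMARY = Pattern.compile(
            "Tests run: (\\d+), Failures: (\\d+), Errors: (\\d+), Skipped: (\\d+)");

    // True only if at least one test ran and none failed, errored, or were skipped.
    static boolean isCleanRun(String line) {
        Matcher m = SUMMARY.matcher(line);
        if (!m.find()) {
            return false;
        }
        int run = Integer.parseInt(m.group(1));
        int failures = Integer.parseInt(m.group(2));
        int errors = Integer.parseInt(m.group(3));
        int skipped = Integer.parseInt(m.group(4));
        return run > 0 && failures == 0 && errors == 0 && skipped == 0;
    }

    public static void main(String[] args) {
        // Prints true
        System.out.println(isCleanRun("Tests run: 1, Failures: 0, Errors: 0, Skipped: 0"));
    }
}
```

Requiring at least one executed test guards against a run where the test filter matched nothing.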


Step 5 — Measure the Overhead

Time the test:

time mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -3

Record how long it takes. Then answer:

  1. Is the bottleneck cluster startup, DAG execution, or cluster shutdown? (Hint: add -Dorg.apache.tez.test.MiniTezCluster.log.level=DEBUG and look at the timestamps.)
  2. Why is @BeforeClass used instead of @Before? What is the performance difference?
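
The cost difference in question 2 is simple arithmetic. A minimal sketch with made-up placeholder timings (not measurements): if cluster startup and shutdown cost a fixed amount, a per-class fixture pays that cost once, while a per-test fixture pays it for every test method. The class and method names are hypothetical.

```java
// Back-of-the-envelope model; all timings are hypothetical placeholders.
final class SetupCostModel {

    // Fixture built once per class: one setup and teardown plus every test body.
    static long perClassMillis(long setupMillis, long teardownMillis, long testMillis, int tests) {
        return setupMillis + teardownMillis + testMillis * tests;
    }

    // Fixture rebuilt for every test method.
    static long perTestMillis(long setupMillis, long teardownMillis, long testMillis, int tests) {
        return (setupMillis + teardownMillis + testMillis) * tests;
    }

    public static void main(String[] args) {
        long setup = 20_000, teardown = 5_000, test = 10_000; // placeholder values, in ms
        int tests = 4;
        System.out.println("per class: " + perClassMillis(setup, teardown, test, tests) + " ms");
        System.out.println("per test:  " + perTestMillis(setup, teardown, test, tests) + " ms");
    }
}
```

With these placeholder numbers the per-class fixture totals 65000 ms against 140000 ms for the per-test fixture; substitute the timings you recorded to see the real gap.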

Step 6 — Find More Integration Tests

find ~/tez-src/tez-tests -name "Test*.java" | xargs grep -l "MiniTezCluster" | head -10

Pick one that is NOT TestOrderedWordCount. Read its @BeforeClass and one @Test method. Answer:

  1. What scenario does this test cover that TestOrderedWordCount does not?
  2. Does it use a separate MiniTezCluster instance, or the same one reused across multiple test classes? How?

Step 7 — Source Connection Table

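Class used in this lab            Tez source file (relative to repo root)
MiniTezCluster                    ____________________
TezClient                         ____________________
TezConfiguration                  ____________________
DAGStatus                         ____________________
MiniDFSCluster (Hadoop helper)    ____________________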

Step 8 — JIRA Research

Search:

project = TEZ AND component = "tez-tests" AND resolution = Fixed ORDER BY updated DESC

Find a recent test-improvement JIRA, then answer:

  1. What was added or fixed?
  2. Does the patch include a new test, an existing test modification, or a flaky-test fix?