Step 2: Reproduction
You do not have a bug until you have a failing test. Stack traces in JIRA comments are circumstantial evidence; a deterministic, automated reproducer is proof. Until you have one, every hypothesis in Step 4 is unverifiable and every "fix" in Step 5 is theater.
Goal of this step: a JUnit test that fails on a clean checkout of apache/tez:master
without your patch, in under two minutes, on five out of five runs.
Where Reproducers Live
MiniTezCluster is the Tez-specific harness that boots an in-process YARN
cluster plus a DAGAppMaster against the local filesystem. It is the closest
thing to a real deployment that you can debug from your IDE.
find ~/tez-src/tez-tests -name "MiniTezCluster.java"
# tez-tests/src/test/java/org/apache/tez/test/MiniTezCluster.java
Read it first, then read one consumer:
grep -n "MiniTezCluster" \
~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestTezJobs.java
grep -n "MiniTezCluster" \
~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestOrderedWordCount.java
TestTezJobs is the canonical "wire up a real cluster, submit a small DAG, assert
on the output" example. TestOrderedWordCount is the lighter-weight end-to-end
sanity check.
For pure unit-level reproducers (no YARN, no shuffle), use the patterns in:
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java
These use DrainDispatcher (a synchronous dispatcher that lets you control event
ordering deterministically) — see Step 6 for the full pattern.
Three Reproducer Templates
Pick the template that matches your issue type.
Template A: Race-Condition Reproducer (state-machine level)
When the bug is "two events arrive in an unexpected order and the state machine
NPEs / wedges / drops a task," you need DrainDispatcher plus controlled event
ordering. No MiniTezCluster.
package org.apache.tez.dag.app.dag.impl;
import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.tez.dag.app.AppContext;
import org.apache.tez.dag.app.dag.event.VertexEventTaskCompleted;
import org.apache.tez.dag.app.dag.event.VertexEventSourceTaskAttemptCompleted;
import org.apache.tez.dag.records.TezTaskID;
import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertEquals;
public class TestVertexImplTezNNNNRepro {
private DrainDispatcher dispatcher;
private VertexImpl vertex;
private AppContext appContext;
@Before
public void setUp() {
dispatcher = new DrainDispatcher();
dispatcher.register(VertexEventType.class, vertexEventHandler());
dispatcher.start();
// Use the same factory as TestVertexImpl. Read its setUp() carefully.
appContext = MockAppContext.create();
vertex = createVertex(appContext, dispatcher);
vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
dispatcher.await();
}
@Test
public void reproTaskCompletionBeforeRouteEvent() throws Exception {
// 1. Drive vertex to RUNNING.
vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_START));
dispatcher.await();
assertEquals(VertexState.RUNNING, vertex.getState());
// 2. Inject a task completion BEFORE the V_ROUTE_EVENT that the bug requires
// has been processed. This is the race window from the JIRA.
TezTaskID t0 = vertex.getTask(0).getTaskId();
vertex.handle(new VertexEventTaskCompleted(t0, TaskState.SUCCEEDED));
// Do NOT call dispatcher.await() yet — interleave a second event.
vertex.handle(new VertexEventSourceTaskAttemptCompleted(...));
dispatcher.await();
// 3. Assertion that fails on master, passes with fix.
assertEquals(VertexState.SUCCEEDED, vertex.getState());
// ^^^^^^^^^^^ on master this is FAILED due to the race
}
}
Key principles:
- Drive the state machine by handing events to
vertex.handle()directly, not by going through a scheduler. - Use
dispatcher.await()to deterministically drain the queue between phases. - The failing assertion is on a
getState()or counter, not on log output.
Template B: Configuration / Validation Reproducer
When the bug is "setting tez.foo=bar is silently ignored / produces wrong
behavior," reproduce at the API layer.
@Test
public void testConfigKeyHonored() throws Exception {
TezConfiguration conf = new TezConfiguration();
conf.set(TezConfiguration.TEZ_AM_FOO_BAR, "42");
DAG dag = DAG.create("test-dag");
Vertex v = Vertex.create("v1", ProcessorDescriptor.create(NoOpProcessor.class.getName()), 4);
dag.addVertex(v);
// The component under test reads conf — instantiate it directly.
FooComponent foo = new FooComponent(conf);
assertEquals(42, foo.getEffectiveValue());
// ^^ on master this is the default (e.g. 100) because conf is ignored
}
No cluster, no DAG submission. Just instantiate the class that reads the config
and assert the effective value. The fix usually changes one conf.get() call.
Template C: Shuffle / Correctness Reproducer
When the bug is "output is wrong" (missing rows, duplicated rows, partial sort),
you need MiniTezCluster and a small DAG with deterministic input.
public class TestShuffleCorrectnessTezNNNN {
private static MiniTezCluster mrrTezCluster;
private static FileSystem fs;
@BeforeClass
public static void setup() throws Exception {
Configuration conf = new Configuration();
fs = FileSystem.getLocal(conf);
mrrTezCluster = new MiniTezCluster("TestShuffleRepro", 1, 1, 1);
mrrTezCluster.init(conf);
mrrTezCluster.start();
}
@AfterClass
public static void cleanup() throws Exception {
if (mrrTezCluster != null) mrrTezCluster.stop();
}
@Test(timeout = 120_000)
public void reproPartitionedOutputMissingRows() throws Exception {
Path inputDir = new Path("/tmp/repro-input-" + System.nanoTime());
Path outputDir = new Path("/tmp/repro-output-" + System.nanoTime());
writeKnownInput(fs, inputDir, /*rows=*/ 10_000);
TezConfiguration tezConf = new TezConfiguration(mrrTezCluster.getConfig());
DAG dag = buildTwoVertexDAG(inputDir, outputDir);
TezClient client = TezClient.create("repro", tezConf);
client.start();
try {
DAGClient dagClient = client.submitDAG(dag);
DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(null);
assertEquals(DAGStatus.State.SUCCEEDED, status.getState());
long outputRowCount = countRows(fs, outputDir);
// On master this is 9_973 (27 rows lost in shuffle). With fix: 10_000.
assertEquals(10_000L, outputRowCount);
} finally {
client.stop();
}
}
}
Build with deterministic input (fixed seed if random) so the missing-row count is reproducible across runs.
Logging: See What the State Machine Is Actually Doing
A reproducer without logs is half a reproducer. You will spend Step 4 staring at these logs.
Drop this into your test resources at src/test/resources/log4j.properties
(or log4j2.properties for newer modules — check which the module uses):
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss.SSS} %-5p [%t] %c{1}: %m%n
# Tez AM internals — the state-machine event log lives here
log4j.logger.org.apache.tez.dag.app.DAGAppMaster=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.DAGImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.VertexImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.TaskImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.TaskAttemptImpl=DEBUG
# Async dispatcher event flow
log4j.logger.org.apache.tez.dag.app.AsyncDispatcher=DEBUG
# Runtime task lifecycle
log4j.logger.org.apache.tez.runtime.task=DEBUG
log4j.logger.org.apache.tez.runtime.LogicalIOProcessorRuntimeTask=DEBUG
# Shuffle internals
log4j.logger.org.apache.tez.runtime.library.common.shuffle=DEBUG
log4j.logger.org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager=DEBUG
log4j.logger.org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher=DEBUG
# Scheduler
log4j.logger.org.apache.tez.dag.app.rm.TaskSchedulerManager=DEBUG
log4j.logger.org.apache.tez.dag.app.rm.YarnTaskSchedulerService=DEBUG
The two most useful patterns to grep for in the output:
grep -E "VertexImpl|TaskImpl|TaskAttemptImpl" target/surefire-reports/*.txt \
| grep -E "state|State|Event|EVENT"
That gives you the state-transition trace, which is what you'll diagram in Step 3.
Capturing container logs from MiniTezCluster
MiniTezCluster writes container logs (where your tasks' stderr/stdout end up)
under the surefire working directory:
<module>/target/<test-class>-tmpDir/<application-id>/container-logs/
Or, in newer YARN versions:
<module>/target/MiniMRYarnCluster-localDir-nm-X_Y/usercache/<user>/appcache/<app>/container_*/
Find them with:
find ~/tez-src/tez-tests/target -name "syslog" -path "*container*" -mmin -30
Read syslog (TaskAttempt logs) and stderr (uncaught exceptions). The
prelaunch.out and directory.info files explain what was actually launched.
Verify Determinism
Five runs. If even one is green, your reproducer is not deterministic yet — it is a coin flip you happen to have caught. Fix the race window before declaring victory.
cd ~/tez-src
for i in 1 2 3 4 5; do
echo "=== Run $i ==="
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q 2>&1 \
| tail -20
done
Expected output: five FAILs with the same assertion failure on the same line.
If you see 4 FAIL / 1 PASS:
- Add a
Thread.sleepis the wrong answer. (Reread Step 6.) - Insert an explicit event ordering: drain the dispatcher between every event,
inject the conflicting events as a
Futureyou control. - Use
CountDownLatchto gate the producer thread until the consumer is at a known state.
If you cannot get to 5/5 fails, the bug may genuinely depend on external timing
(network, GC). In that case, escalate to a stress-test pattern: run the inner
test body 100x in a @RepeatedTest and assert that the failure rate is >50%.
Less ideal but acceptable for some shuffle race bugs.
Validation / Self-check
By the end of Step 2 you must have:
- A new test file under
<module>/src/test/java/...namedTest<Component>Tez<NNNN>Repro.java(theReprosuffix is for your workflow; you'll rename it to a real test name in Step 6). - The test fails on a clean
~/tez-src/atmasterwith an assertion error (not a setup error, not a timeout — an assertion error). - Five consecutive runs produce the same failure on the same line.
- The failure happens in under 120 seconds per run.
- A
log4j.propertiessnippet insrc/test/resources/enabling debug logging on the relevant Tez packages. - A captured log excerpt (paste into
capstone-work/repro-logs.txt) showing the state-machine trace at the moment of failure. - A one-paragraph description of the failure mode in your own words, saved
to
capstone-work/repro-summary.md. You will refine this into the root-cause document in Step 4.