Step 2: Reproduction

You do not have a bug until you have a failing test. Stack traces in JIRA comments are circumstantial evidence; a deterministic, automated reproducer is proof. Until you have one, every hypothesis in Step 4 is unverifiable and every "fix" in Step 5 is theater.

Goal of this step: a JUnit test that fails on a clean checkout of apache/tez:master without your patch, in under two minutes, on five out of five runs.


Where Reproducers Live

MiniTezCluster is the Tez-specific harness that boots an in-process YARN cluster plus a DAGAppMaster against the local filesystem. It is the closest thing to a real deployment that you can debug from your IDE.

find ~/tez-src/tez-tests -name "MiniTezCluster.java"
# tez-tests/src/test/java/org/apache/tez/test/MiniTezCluster.java

Read it first, then read one consumer:

grep -n "MiniTezCluster" \
  ~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestTezJobs.java
grep -n "MiniTezCluster" \
  ~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestOrderedWordCount.java

TestTezJobs is the canonical "wire up a real cluster, submit a small DAG, assert on the output" example. TestOrderedWordCount is the lighter-weight end-to-end sanity check.

For pure unit-level reproducers (no YARN, no shuffle), use the patterns in:

~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java

These use DrainDispatcher (a synchronous dispatcher that lets you control event ordering deterministically) — see Step 6 for the full pattern.


Three Reproducer Templates

Pick the template that matches your issue type.

Template A: Race-Condition Reproducer (state-machine level)

When the bug is "two events arrive in an unexpected order and the state machine NPEs / wedges / drops a task," you need DrainDispatcher plus controlled event ordering. No MiniTezCluster.

package org.apache.tez.dag.app.dag.impl;

import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.tez.dag.app.AppContext;
import org.apache.tez.dag.app.dag.event.VertexEventTaskCompleted;
import org.apache.tez.dag.app.dag.event.VertexEventSourceTaskAttemptCompleted;
import org.apache.tez.dag.records.TezTaskID;
import org.junit.Before;
import org.junit.Test;

import static org.junit.Assert.assertEquals;

public class TestVertexImplTezNNNNRepro {

  private DrainDispatcher dispatcher;
  private VertexImpl vertex;
  private AppContext appContext;

  @Before
  public void setUp() {
    dispatcher = new DrainDispatcher();
    dispatcher.register(VertexEventType.class, vertexEventHandler());
    dispatcher.start();
    // Use the same factory as TestVertexImpl. Read its setUp() carefully.
    appContext = MockAppContext.create();
    vertex = createVertex(appContext, dispatcher);
    vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
    dispatcher.await();
  }

  @Test
  public void reproTaskCompletionBeforeRouteEvent() throws Exception {
    // 1. Drive vertex to RUNNING.
    vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_START));
    dispatcher.await();
    assertEquals(VertexState.RUNNING, vertex.getState());

    // 2. Inject a task completion BEFORE the V_ROUTE_EVENT that the bug requires
    //    has been processed. This is the race window from the JIRA.
    TezTaskID t0 = vertex.getTask(0).getTaskId();
    vertex.handle(new VertexEventTaskCompleted(t0, TaskState.SUCCEEDED));

    // Do NOT call dispatcher.await() yet — interleave a second event.
    vertex.handle(new VertexEventSourceTaskAttemptCompleted(...));

    dispatcher.await();

    // 3. Assertion that fails on master, passes with fix.
    assertEquals(VertexState.SUCCEEDED, vertex.getState());
    //                     ^^^^^^^^^^^ on master this is FAILED due to the race
  }
}

Key principles:

  • Drive the state machine by handing events to vertex.handle() directly, not by going through a scheduler.
  • Use dispatcher.await() to deterministically drain the queue between phases.
  • The failing assertion is on a getState() or counter, not on log output.

Template B: Configuration / Validation Reproducer

When the bug is "setting tez.foo=bar is silently ignored / produces wrong behavior," reproduce at the API layer.

@Test
public void testConfigKeyHonored() throws Exception {
  TezConfiguration conf = new TezConfiguration();
  conf.set(TezConfiguration.TEZ_AM_FOO_BAR, "42");

  DAG dag = DAG.create("test-dag");
  Vertex v = Vertex.create("v1", ProcessorDescriptor.create(NoOpProcessor.class.getName()), 4);
  dag.addVertex(v);

  // The component under test reads conf — instantiate it directly.
  FooComponent foo = new FooComponent(conf);
  assertEquals(42, foo.getEffectiveValue());
  //                ^^ on master this is the default (e.g. 100) because conf is ignored
}

No cluster, no DAG submission. Just instantiate the class that reads the config and assert the effective value. The fix usually changes one conf.get() call.

Template C: Shuffle / Correctness Reproducer

When the bug is "output is wrong" (missing rows, duplicated rows, partial sort), you need MiniTezCluster and a small DAG with deterministic input.

public class TestShuffleCorrectnessTezNNNN {

  private static MiniTezCluster mrrTezCluster;
  private static FileSystem fs;

  @BeforeClass
  public static void setup() throws Exception {
    Configuration conf = new Configuration();
    fs = FileSystem.getLocal(conf);
    mrrTezCluster = new MiniTezCluster("TestShuffleRepro", 1, 1, 1);
    mrrTezCluster.init(conf);
    mrrTezCluster.start();
  }

  @AfterClass
  public static void cleanup() throws Exception {
    if (mrrTezCluster != null) mrrTezCluster.stop();
  }

  @Test(timeout = 120_000)
  public void reproPartitionedOutputMissingRows() throws Exception {
    Path inputDir = new Path("/tmp/repro-input-" + System.nanoTime());
    Path outputDir = new Path("/tmp/repro-output-" + System.nanoTime());
    writeKnownInput(fs, inputDir, /*rows=*/ 10_000);

    TezConfiguration tezConf = new TezConfiguration(mrrTezCluster.getConfig());
    DAG dag = buildTwoVertexDAG(inputDir, outputDir);

    TezClient client = TezClient.create("repro", tezConf);
    client.start();
    try {
      DAGClient dagClient = client.submitDAG(dag);
      DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(null);
      assertEquals(DAGStatus.State.SUCCEEDED, status.getState());

      long outputRowCount = countRows(fs, outputDir);
      // On master this is 9_973 (27 rows lost in shuffle). With fix: 10_000.
      assertEquals(10_000L, outputRowCount);
    } finally {
      client.stop();
    }
  }
}

Build with deterministic input (fixed seed if random) so the missing-row count is reproducible across runs.


Logging: See What the State Machine Is Actually Doing

A reproducer without logs is half a reproducer. You will spend Step 4 staring at these logs.

Drop this into your test resources at src/test/resources/log4j.properties (or log4j2.properties for newer modules — check which the module uses):

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss.SSS} %-5p [%t] %c{1}: %m%n

# Tez AM internals — the state-machine event log lives here
log4j.logger.org.apache.tez.dag.app.DAGAppMaster=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.DAGImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.VertexImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.TaskImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.TaskAttemptImpl=DEBUG

# Async dispatcher event flow
log4j.logger.org.apache.tez.dag.app.AsyncDispatcher=DEBUG

# Runtime task lifecycle
log4j.logger.org.apache.tez.runtime.task=DEBUG
log4j.logger.org.apache.tez.runtime.LogicalIOProcessorRuntimeTask=DEBUG

# Shuffle internals
log4j.logger.org.apache.tez.runtime.library.common.shuffle=DEBUG
log4j.logger.org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager=DEBUG
log4j.logger.org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher=DEBUG

# Scheduler
log4j.logger.org.apache.tez.dag.app.rm.TaskSchedulerManager=DEBUG
log4j.logger.org.apache.tez.dag.app.rm.YarnTaskSchedulerService=DEBUG

The two most useful patterns to grep for in the output:

grep -E "VertexImpl|TaskImpl|TaskAttemptImpl" target/surefire-reports/*.txt \
  | grep -E "state|State|Event|EVENT"

That gives you the state-transition trace, which is what you'll diagram in Step 3.

Capturing container logs from MiniTezCluster

MiniTezCluster writes container logs (where your tasks' stderr/stdout end up) under the surefire working directory:

<module>/target/<test-class>-tmpDir/<application-id>/container-logs/

Or, in newer YARN versions:

<module>/target/MiniMRYarnCluster-localDir-nm-X_Y/usercache/<user>/appcache/<app>/container_*/

Find them with:

find ~/tez-src/tez-tests/target -name "syslog" -path "*container*" -mmin -30

Read syslog (TaskAttempt logs) and stderr (uncaught exceptions). The prelaunch.out and directory.info files explain what was actually launched.


Verify Determinism

Five runs. If even one is green, your reproducer is not deterministic yet — it is a coin flip you happen to have caught. Fix the race window before declaring victory.

cd ~/tez-src
for i in 1 2 3 4 5; do
  echo "=== Run $i ==="
  mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q 2>&1 \
    | tail -20
done

Expected output: five FAILs with the same assertion failure on the same line.

If you see 4 FAIL / 1 PASS:

  • Add a Thread.sleep is the wrong answer. (Reread Step 6.)
  • Insert an explicit event ordering: drain the dispatcher between every event, inject the conflicting events as a Future you control.
  • Use CountDownLatch to gate the producer thread until the consumer is at a known state.

If you cannot get to 5/5 fails, the bug may genuinely depend on external timing (network, GC). In that case, escalate to a stress-test pattern: run the inner test body 100x in a @RepeatedTest and assert that the failure rate is >50%. Less ideal but acceptable for some shuffle race bugs.


Validation / Self-check

By the end of Step 2 you must have:

  1. A new test file under <module>/src/test/java/... named Test<Component>Tez<NNNN>Repro.java (the Repro suffix is for your workflow; you'll rename it to a real test name in Step 6).
  2. The test fails on a clean ~/tez-src/ at master with an assertion error (not a setup error, not a timeout — an assertion error).
  3. Five consecutive runs produce the same failure on the same line.
  4. The failure happens in under 120 seconds per run.
  5. A log4j.properties snippet in src/test/resources/ enabling debug logging on the relevant Tez packages.
  6. A captured log excerpt (paste into capstone-work/repro-logs.txt) showing the state-machine trace at the moment of failure.
  7. A one-paragraph description of the failure mode in your own words, saved to capstone-work/repro-summary.md. You will refine this into the root-cause document in Step 4.