Step 3: Execution Path Analysis

You have a failing test. Now you map the path the request takes from the call to TezClient.submitDAG() through every event, dispatcher hop, and state transition until the failure manifests. This map is the foundation for every hypothesis in Step 4. A wrong map produces a wrong root cause.

Budget: 2–4 evenings. The work is reading code, grep, and drawing.


The Canonical Submit Path

Every DAG that fails went through this skeleton path before it failed. Memorize it; you will use it as the reference axis when you sketch where your particular failure deviates.

TezClient.submitDAG(DAG)
    [tez-api/src/main/java/org/apache/tez/client/TezClient.java]
        |
        v
TezClient.submitDAGSession() or submitDAGApplication()
        |  (session vs. non-session — see TezClient.java for branch)
        v
DAGClientHandler.submitDAG(...)
    [tez-dag/src/main/java/org/apache/tez/dag/api/client/DAGClientHandler.java]
        |
        v
DAGAppMaster.submitDAGToAppMaster(...)
    [tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java]
        |
        v
DAGAppMaster.startDAG(...)
        |  - builds DAGImpl
        |  - emits DAGEventType.DAG_INIT
        v
AsyncDispatcher.dispatch(DAGEvent)
    [tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java]
    (Tez ships its own AsyncDispatcher, modeled on the one in Hadoop's
     hadoop-yarn-common; if the path differs in your checkout, locate it
     with: find . -name AsyncDispatcher.java)
        |
        v
DAGImpl.handle(DAGEvent)
    [tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java]
        |  state NEW --DAG_INIT--> INITED
        |  emits DAGEventType.DAG_START
        v
DAGImpl on DAG_START
        |  state INITED --DAG_START--> RUNNING
        |  for each Vertex: emits VertexEvent V_INIT
        v
VertexImpl.handle(VertexEventType.V_INIT)
    [tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java]
        |  state NEW --V_INIT--> INITIALIZING
        |  invokes VertexManagerPlugin.initialize()
        |  on success emits V_INITED
        v
VertexImpl on V_INITED, then on V_START
        |  state INITIALIZING --V_INITED--> INITED
        |  state INITED --V_START--> RUNNING
        |  schedules tasks via TaskImpl events (T_SCHEDULE)
        v
TaskImpl.handle(T_SCHEDULE)
    [tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java]
        |  state NEW --T_SCHEDULE--> SCHEDULED
        |  spawns a TaskAttemptImpl, emits TA_SCHEDULE
        v
TaskAttemptImpl.handle(TA_SCHEDULE)
    [tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java]
        |  state NEW --TA_SCHEDULE--> START_WAIT
        |  requests container from TaskSchedulerManager
        v
TaskSchedulerManager / YarnTaskSchedulerService
    [tez-dag/src/main/java/org/apache/tez/dag/app/rm/]
        |  assigns container, emits TA_CONTAINER_LAUNCHED
        v
TaskAttemptImpl receives TA_CONTAINER_LAUNCHED
        |  state START_WAIT --TA_CONTAINER_LAUNCHED--> RUNNING
        |  the container is now actually running our task
        v
[ container process boots ]
TezTaskRunner2.run()
    [tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java]
        |
        v
TezChild / TaskRunner instantiates LogicalIOProcessorRuntimeTask
        |
        v
LogicalIOProcessorRuntimeTask.run()
    [tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java]
        |  initializes Inputs, Outputs, Processor
        |  calls Processor.run(inputs, outputs)
        v
[ user code runs — e.g. OrderedWordCount or your DAG's processor ]
        |
        v
heartbeat -> TaskAttemptListener -> TaskAttemptImpl TA_DONE / TA_FAILED

That is the skeleton. Your job in this step is to find the segment where your failure occurs and draw it with line numbers.
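
If "dispatcher hop" is still abstract, the toy below may help. It is a minimal sketch, not Tez's AsyncDispatcher: ToyDispatcher, send, and drain are hypothetical names, events are plain strings, and an explicit loop stands in for the dispatcher thread. What it shows is the one property that matters when you read the skeleton: sending an event only enqueues it, so a handler's follow-up events run after that handler has returned, not nested inside it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy event queue; NOT Tez's AsyncDispatcher. All names are hypothetical. */
final class ToyDispatcher {
    private final Queue<String> queue = new ArrayDeque<>();
    private final List<String> trace = new ArrayList<>();

    /** Enqueue only: the caller returns before any handler runs. */
    void send(String event) {
        queue.add(event);
    }

    /** Stand-in for the dispatcher thread's loop: one event at a time, FIFO. */
    List<String> drain() {
        while (!queue.isEmpty()) {
            dispatch(queue.remove());
        }
        return trace;
    }

    private void dispatch(String event) {
        trace.add("enter " + event);
        // A handler reacts by sending follow-up events. They queue up behind
        // whatever is already waiting instead of running inside this call.
        if (event.equals("DAG_START")) {
            send("V_INIT(v1)");
            send("V_INIT(v2)");
        } else if (event.startsWith("V_INIT(")) {
            send(event.replace("V_INIT(", "V_INITED("));
        }
        trace.add("exit " + event);
    }

    public static void main(String[] args) {
        ToyDispatcher d = new ToyDispatcher();
        d.send("DAG_START");
        d.drain().forEach(System.out::println);
    }
}
```

Every "enter" line is followed immediately by its own "exit" line, and both V_INIT events are handled before either V_INITED. The practical consequence for your map: when a handler fails, the code that sent the event is usually not on the stack trace, so you have to find the sender by reading, not by looking up the stack.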


Run These Greps

These greps locate the actual file paths and method bodies on your local clone. Run them in ~/tez-src/. Each one gives you a line number to open.

# Entry: submitDAG
grep -n "public.*submitDAG" \
  tez-api/src/main/java/org/apache/tez/client/TezClient.java

# Server-side intake
grep -n "submitDAG\|startDAG" \
  tez-dag/src/main/java/org/apache/tez/dag/api/client/DAGClientHandler.java \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java

# DAGImpl handlers
grep -nE "addTransition|stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -40

# VertexImpl state machine
grep -nE "addTransition|stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -60

# TaskImpl state machine
grep -nE "addTransition|stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head -60

# TaskAttemptImpl state machine
grep -nE "addTransition|stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head -80

# Dispatcher (locate the file first, then grep it)
find . -name AsyncDispatcher.java -path "*/src/main/*" \
  | xargs grep -n "class AsyncDispatcher\|dispatch\b"

# Runtime task entry
grep -nE "public .* run\(|class TezTaskRunner2" \
  tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java

grep -n "public void run\|initialize\|class LogicalIOProcessorRuntimeTask" \
  tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java

Open each line in your editor. Read the transition table. Note which event you care about and which state(s) it is legal in.
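
To make "which state(s) it is legal in" concrete, here is a minimal sketch of a transition table. It is a toy, not Tez's or Hadoop's state machine factory: ToyTransitionTable, next, and statesAccepting are hypothetical names, and the handful of transitions registered in main are illustrative, not copied from VertexImpl. The (from, to, event) argument order is chosen to resemble the addTransition calls your grep turns up, so the real table should read the same way.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Toy transition table; NOT Tez's state machine. All names are hypothetical. */
final class ToyTransitionTable {
    enum State { NEW, INITIALIZING, INITED, RUNNING, FAILED }
    enum Event { V_INIT, V_INITED, V_START, V_INTERNAL_ERROR }

    // (current state, event) -> next state. A missing entry means "illegal here".
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);

    ToyTransitionTable addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
        return this;
    }

    /** Returns the next state, or throws if the event is not registered for that state. */
    State next(State current, Event event) {
        Map<Event, State> row = table.get(current);
        State to = (row == null) ? null : row.get(event);
        if (to == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        return to;
    }

    /** Answers the question from the text: in which states is this event legal? */
    Set<State> statesAccepting(Event event) {
        Set<State> result = EnumSet.noneOf(State.class);
        for (Map.Entry<State, Map<Event, State>> row : table.entrySet()) {
            if (row.getValue().containsKey(event)) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    static ToyTransitionTable sample() {
        return new ToyTransitionTable()
            .addTransition(State.NEW, State.INITIALIZING, Event.V_INIT)
            .addTransition(State.INITIALIZING, State.INITED, Event.V_INITED)
            .addTransition(State.INITED, State.RUNNING, Event.V_START)
            .addTransition(State.INITIALIZING, State.FAILED, Event.V_INTERNAL_ERROR)
            .addTransition(State.RUNNING, State.FAILED, Event.V_INTERNAL_ERROR);
    }

    public static void main(String[] args) {
        ToyTransitionTable t = sample();
        System.out.println(t.statesAccepting(Event.V_INTERNAL_ERROR));
        System.out.println(t.next(State.NEW, Event.V_INIT));
    }
}
```

When you read the real table, do the same lookup by hand: pick your event, list every row that accepts it, and write down the states that do not. An event arriving in one of those missing states is a common shape for the bugs this step is meant to find.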


Locate Your Specific Failure Segment

The skeleton is the highway; your bug is at one specific exit. Use these heuristics:

Symptom in repro logs                               Likely segment
--------------------------------------------------  ----------------------------------------------------
VertexImpl ... transitioned from RUNNING to FAILED  VertexImpl state machine: transition on V_TASK_RESCHEDULED or V_INTERNAL_ERROR
TaskAttemptImpl ... NPE                             TaskAttemptImpl event handlers; check container-launched and TA_DONE paths
NPE in AsyncDispatcher.dispatch                     Race between dispatcher start/stop and event submission
ShuffleManager: too many fetch failures             Fetcher retry/timeout; ShuffleManager.fetchFailure()
IFile checksum mismatch                             IFile.Writer/Reader; check spill+merge
OutOfMemory ... GROUP_COMPARATOR                    MergeManager memory math; ifile spill thresholds
Container released before TA_DONE                   TaskSchedulerManager reuse path; check container release races
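
If your repro log is long, you can turn these heuristics into a first-pass filter. The sketch below is an assumption-heavy illustration: SymptomTriage and segmentFor are hypothetical names, and the regular expressions are loose paraphrases of a few symptoms listed above, not exact Tez log formats. Adjust the patterns to the wording your logs actually use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical first-pass triage: maps a log line to a likely code segment. */
final class SymptomTriage {
    // Insertion-ordered so that earlier, more specific rules win.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

    static {
        rule("VertexImpl.*transitioned from RUNNING to FAILED", "VertexImpl state machine");
        rule("TaskAttemptImpl.*(NPE|NullPointerException)", "TaskAttemptImpl event handlers");
        rule("too many fetch failures", "Fetcher retry/timeout");
        rule("checksum mismatch", "IFile.Writer/Reader, spill+merge");
    }

    private static void rule(String regex, String segment) {
        RULES.put(Pattern.compile(regex, Pattern.CASE_INSENSITIVE), segment);
    }

    /** Returns the segment for the first matching rule, or "unclassified". */
    static String segmentFor(String logLine) {
        for (Map.Entry<Pattern, String> e : RULES.entrySet()) {
            if (e.getKey().matcher(logLine).find()) {
                return e.getValue();
            }
        }
        return "unclassified";
    }

    public static void main(String[] args) {
        System.out.println(segmentFor("ShuffleManager: too many fetch failures"));
        System.out.println(segmentFor("some unrelated INFO line"));
    }
}
```

Treat the result as a pointer to where to start reading, not as a diagnosis.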

Once you know your segment, draw it.


Build the Path Diagram

Two formats. Do both — they validate each other.

Text-arrow form (paste into the root-cause doc)

Use this in JIRA comments and PR descriptions. It survives any rendering.

TezClient.submitDAG (TezClient.java:485)
  -> DAGClientHandler.submitDAG (DAGClientHandler.java:152)
  -> DAGAppMaster.startDAG (DAGAppMaster.java:1234)
  -> DAGImpl NEW --DAG_INIT--> INITED (DAGImpl.java:340)
  -> DAGImpl INITED --DAG_START--> RUNNING (DAGImpl.java:380)
  -> VertexImpl v1 NEW --V_INIT--> INITIALIZING (VertexImpl.java:1820)
  -> VertexImpl v1 INITIALIZING --V_INITED--> INITED (VertexImpl.java:1856)
  -> VertexImpl v1 INITED --V_START--> RUNNING (VertexImpl.java:1901)
  -> [21 TaskImpl T_SCHEDULE events fired]
  -> TaskImpl t0 NEW --T_SCHEDULE--> SCHEDULED (TaskImpl.java:412)
  -> TaskAttemptImpl t0.0 NEW --TA_SCHEDULE--> START_WAIT (TaskAttemptImpl.java:560)
  -> [container assigned]
  -> TaskAttemptImpl t0.0 START_WAIT --TA_CONTAINER_LAUNCHED--> RUNNING (...:610)
  -> [container starts LogicalIOProcessorRuntimeTask]
  -> ShuffleManager.run starts fetcher loop
  -> Fetcher.fetchNext throws IOException (Fetcher.java:289)  <-- FAILURE HERE
  -> ShuffleManager.fetchFailure -> InputReadErrorEvent
  -> TaskAttemptImpl t0.0 RUNNING --TA_FAILED--> FAILED

Cite real line numbers from your checkout. Future-you will thank you.
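
A missing citation is easy to overlook in a long path. The following is a minimal sketch of a checker, assuming the conventions used in the text-arrow form above: a state transition line contains "-->", and a citation looks like "(SomeFile.java:123)". CitationCheck and uncited are hypothetical names; nothing here is part of Tez.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: flags transition lines that lack a file:line citation. */
final class CitationCheck {
    // Matches a citation such as "(VertexImpl.java:1820)" anywhere on the line.
    private static final Pattern CITED = Pattern.compile("\\(\\w+\\.java:\\d+\\)");

    /** Returns the trimmed text of every transition line that has no citation. */
    static List<String> uncited(List<String> pathLines) {
        List<String> missing = new ArrayList<>();
        for (String line : pathLines) {
            boolean isTransition = line.contains("-->");
            if (isTransition && !CITED.matcher(line).find()) {
                missing.add(line.trim());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> path = List.of(
            "  -> VertexImpl v1 NEW --V_INIT--> INITIALIZING (VertexImpl.java:1820)",
            "  -> [container assigned]",
            "  -> TaskAttemptImpl t0.0 RUNNING --TA_FAILED--> FAILED");
        // Only the last line is reported: it is a transition without a citation.
        uncited(path).forEach(System.out::println);
    }
}
```

Bracketed notes such as "[container assigned]" are skipped because they contain no transition arrow. An abbreviated citation such as "(...:610)" is reported as missing, which is the intended behavior: spell out the file name.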

Mermaid diagram (for the write-up and PR)

sequenceDiagram
    participant C as Client
    participant AM as DAGAppMaster
    participant D as DAGImpl
    participant V as VertexImpl v1
    participant T as TaskImpl t0
    participant TA as TaskAttempt t0.0
    participant SM as ShuffleManager
    participant F as Fetcher

    C->>AM: submitDAG
    AM->>D: DAG_INIT
    D->>D: NEW -> INITED
    AM->>D: DAG_START
    D->>V: V_INIT
    V->>V: NEW -> INITIALIZING -> INITED
    D->>V: V_START
    V->>T: T_SCHEDULE
    T->>TA: TA_SCHEDULE
    TA->>TA: NEW -> START_WAIT
    Note over TA: container assigned + launched
    TA->>TA: START_WAIT -> RUNNING
    TA->>SM: shuffle starts
    SM->>F: fetchNext
    F-->>SM: IOException
    SM->>TA: InputReadErrorEvent (TA_FAILED)
    TA->>TA: RUNNING -> FAILED

Both diagrams say the same thing. Together they make review with a committer easier, because they show that you read the code instead of paraphrasing the JIRA.


Verify Empirically with Temporary LOG.info() Probes

The map is a hypothesis. Confirm it with probes. Add temporary logging at the points you think your event traverses. Pattern:

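// In VertexImpl.java. The class normally already declares a static LOG field;
// add the declaration below only if the class you are probing lacks one
// (imports: org.slf4j.Logger, org.slf4j.LoggerFactory).
private static final Logger LOG = LoggerFactory.getLogger(VertexImpl.class);

// Inside the handler you suspect. Replace NNNN with your JIRA number.
LOG.info("PROBE-TEZNNNN: V_INIT entered for vertex={} state={}",
    getName(), getState());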

Rules for probes:

  • Prefix every probe with PROBE-TEZ<NNNN> so you can grep them in one pass and delete in one pass.
  • Use LOG.info not LOG.debug so they appear without changing log config.
  • Include the field values you care about (state, event type, IDs).
  • Never commit probes. They are scaffolding for Step 4.

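After re-running your test, filter the console output for your probes (if your build sets Surefire's redirectTestOutputToFile, the lines land in tez-dag/target/surefire-reports/ instead, so grep there):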

mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q 2>&1 \
  | grep "PROBE-TEZNNNN" | tee /tmp/probe-trace.txt

Compare the probe trace to your diagram. Discrepancies are the most valuable output of this whole step — they are exactly where your mental model differs from the code.

Common discrepancies to watch for:

  • "I thought this handler ran once. It ran three times." (Re-entrancy bug.)
  • "I thought events arrived in order A,B,C. They arrived B,A,C." (Async dispatch reordering.)
  • "I thought the vertex was in RUNNING. It was in INITED." (Wrong assumption about state at the time of the event.)

When a probe surprises you, do not delete the probe. Lean in. That is the shortest path to root cause.
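
Comparing the trace to the diagram by eye works for ten lines and fails for two hundred. Below is a minimal sketch of a comparison helper, assuming your probes follow the pattern shown earlier ("PROBE-TEZNNNN: <EVENT> ...", with the event name as the first word after the tag). ProbeDiff, events, and firstDivergence are hypothetical names, and the sample log lines in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds where a probe trace departs from the expected path. */
final class ProbeDiff {
    /** Extracts the first word after "<tag>: " from every line that carries the tag. */
    static List<String> events(List<String> logLines, String tag) {
        List<String> out = new ArrayList<>();
        String marker = tag + ": ";
        for (String line : logLines) {
            int at = line.indexOf(marker);
            if (at < 0) {
                continue; // not a probe line
            }
            String rest = line.substring(at + marker.length()).trim();
            if (!rest.isEmpty()) {
                out.add(rest.split("\\s+")[0]);
            }
        }
        return out;
    }

    /** Index of the first differing step, or -1 if the sequences are identical. */
    static int firstDivergence(List<String> expected, List<String> observed) {
        int common = Math.min(expected.size(), observed.size());
        for (int i = 0; i < common; i++) {
            if (!expected.get(i).equals(observed.get(i))) {
                return i;
            }
        }
        return expected.size() == observed.size() ? -1 : common;
    }

    public static void main(String[] args) {
        List<String> expected = List.of("V_INIT", "V_INITED", "V_START");
        List<String> log = List.of(
            "INFO PROBE-TEZNNNN: V_INIT entered for vertex=v1 state=NEW",
            "INFO an unrelated line",
            "INFO PROBE-TEZNNNN: V_START entered for vertex=v1 state=INITIALIZING");
        List<String> observed = events(log, "PROBE-TEZNNNN");
        int at = firstDivergence(expected, observed);
        System.out.println(at < 0 ? "trace matches diagram" : "first divergence at step " + at);
    }
}
```

The index it reports is zero-based and points at the first step of your diagram that the code did not follow. Start reading there.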


Output

Your Step 3 deliverables live in capstone-work/execution-path/:

  • path-skeleton.md — text-arrow form with line numbers.
  • path.mmd — the mermaid source.
  • probe-trace.txt — grep output from the probe run.
  • notes.md — three to five surprises you found while reading.

Validation / Self-check

Before you advance to Step 4, you must:

  1. Be able to name, from memory, every state transition between TezClient.submitDAG() and your failure point.
  2. Have file:line citations for every transition in your diagram, against your ~/tez-src/ HEAD.
  3. Have run the repro with PROBE-TEZ<NNNN> log statements and confirmed the sequence matches your diagram (or, more usefully, noted where it diverges).
  4. Have removed every probe from your working tree before any commit (git diff should not contain "PROBE-").
  5. Have at least one "surprise" noted in notes.md — if you have zero, you did not look hard enough.
  6. Be able to answer: "Which event, in which state, on which class, fires the handler that produces the failure?" in one sentence.
  7. Have the mermaid diagram render without syntax errors (mdbook serve your capstone-work folder, or paste into mermaid.live).