Step 3: Execution Path Analysis
You have a failing test. Now you map the path the request takes from the call to
TezClient.submitDAG() through every event, dispatcher hop, and state
transition until the failure manifests. This map is the foundation for every
hypothesis in Step 4. A wrong map produces a wrong root cause.
Budget: 2–4 evenings. The work is reading code, running grep, and drawing.
The Canonical Submit Path
Every DAG that fails went through this skeleton path before it failed. Memorize it; you will use it as the reference axis when you sketch where your particular failure deviates.
TezClient.submitDAG(DAG)
[tez-api/src/main/java/org/apache/tez/client/TezClient.java]
|
v
TezClient.submitDAGSession() or submitDAGApplication()
| (session vs. non-session — see TezClient.java for branch)
v
DAGClientHandler.submitDAG(...)
[tez-dag/src/main/java/org/apache/tez/dag/api/client/DAGClientHandler.java]
|
v
DAGAppMaster.submitDAGToAppMaster(...)
[tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java]
|
v
DAGAppMaster.startDAG(...)
| - builds DAGImpl
| - emits DAGEventType.DAG_INIT
v
AsyncDispatcher.dispatch(DAGEvent)
[tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java]
(Tez keeps its own AsyncDispatcher rather than using the class from
hadoop-yarn-common; if the path differs in your checkout, locate it with
find . -name AsyncDispatcher.java)
|
v
DAGImpl.handle(DAGEvent)
[tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java]
| state NEW --DAG_INIT--> INITED
| next event: DAGEventType.DAG_START (sent by DAGAppMaster right after DAG_INIT)
v
DAGImpl on DAG_START
| state INITED --DAG_START--> RUNNING
| for each Vertex: emits VertexEvent V_INIT
v
VertexImpl.handle(VertexEventType.V_INIT)
[tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java]
| state NEW --V_INIT--> INITIALIZING
| invokes VertexManagerPlugin.initialize()
| on success emits V_INITED
v
VertexImpl on V_INITED -> on V_START
| state INITED --V_START--> RUNNING
| schedules tasks via TaskImpl events (T_SCHEDULE)
v
TaskImpl.handle(T_SCHEDULE)
[tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java]
| state NEW --T_SCHEDULE--> SCHEDULED
| spawns a TaskAttemptImpl, emits TA_SCHEDULE
v
TaskAttemptImpl.handle(TA_SCHEDULE)
[tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java]
| state NEW --TA_SCHEDULE--> START_WAIT
| requests container from TaskSchedulerManager
v
TaskSchedulerManager / YarnTaskSchedulerService
[tez-dag/src/main/java/org/apache/tez/dag/app/rm/]
| assigns container, emits TA_CONTAINER_LAUNCHED
v
TaskAttemptImpl receives TA_CONTAINER_LAUNCHED
| state START_WAIT --TA_CONTAINER_LAUNCHED--> RUNNING
| the container is now actually running our task
v
[ container process boots ]
TezTaskRunner2.run()
[tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java]
|
v
TezChild / TaskRunner instantiates LogicalIOProcessorRuntimeTask
|
v
LogicalIOProcessorRuntimeTask.run()
[tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java]
| initializes Inputs, Outputs, Processor
| calls Processor.run(inputs, outputs)
v
[ user code runs — e.g. OrderedWordCount or your DAG's processor ]
|
v
heartbeat -> TaskAttemptListener -> TaskAttemptImpl TA_DONE / TA_FAILED
That is the skeleton. Your job in this step is to find the segment where your failure occurs and draw it with line numbers.
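Each arrow in the skeleton that crosses a dispatcher is a queue hop, not a direct method call. The sketch below is a toy model of that idea, not Tez code: the class ToyDispatcher and its string events are hypothetical, and it drains the queue synchronously so the output is deterministic, whereas the real dispatcher hands events to its own thread. It shows why the follow-up event for v1 is handled after V_INIT for v2, not immediately after V_INIT for v1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of dispatcher hops; NOT Tez code. Event names follow the skeleton. */
final class ToyDispatcher {
  private final Queue<String> queue = new ArrayDeque<>();
  private final List<String> handled = new ArrayList<>();

  /** Enqueue only; nothing is handled until drain() reaches this event. */
  void post(String event) {
    queue.add(event);
  }

  /** Handles events in FIFO order; a handler may post follow-up events. */
  List<String> drain() {
    String event;
    while ((event = queue.poll()) != null) {
      handled.add(event);
      if (event.equals("DAG_START")) {
        // The DAG fans out one V_INIT per vertex (two hypothetical vertices here).
        post("V_INIT:v1");
        post("V_INIT:v2");
      } else if (event.startsWith("V_INIT:")) {
        // A vertex that finished initializing queues its own follow-up event.
        post("V_INITED:" + event.substring("V_INIT:".length()));
      }
    }
    return handled;
  }

  public static void main(String[] args) {
    ToyDispatcher d = new ToyDispatcher();
    d.post("DAG_INIT");
    d.post("DAG_START");
    // Prints [DAG_INIT, DAG_START, V_INIT:v1, V_INIT:v2, V_INITED:v1, V_INITED:v2]
    System.out.println(d.drain());
  }
}
```

When you read real logs, expect the same interleaving: a handler's follow-up events land behind whatever was already queued.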
Run These Greps
These greps locate the actual file paths and method bodies on your local clone.
Run them in ~/tez-src/. Each one gives you a line number to open.
# Entry: submitDAG
grep -n "public.*submitDAG" \
tez-api/src/main/java/org/apache/tez/client/TezClient.java
# Server-side intake
grep -n "submitDAG\|startDAG" \
tez-dag/src/main/java/org/apache/tez/dag/api/client/DAGClientHandler.java \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
# DAGImpl handlers
grep -nE "addTransition|stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -40
# VertexImpl state machine
grep -nE "addTransition|stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -60
# TaskImpl state machine
grep -nE "addTransition|stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head -60
# TaskAttemptImpl state machine
grep -nE "addTransition|stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head -80
# Dispatcher
grep -n "class AsyncDispatcher\|dispatch\b" \
tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java
# Runtime task entry
grep -n "public void run\|class TezTaskRunner2" \
tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java
grep -n "public void run\|initialize\|class LogicalIOProcessorRuntimeTask" \
tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java
Open each line in your editor. Read the transition table. Note which event you care about and which state(s) it is legal in.
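To make "which state(s) it is legal in" concrete, here is a minimal sketch of a transition table as a lookup. It is a toy, not Tez code: ToyVertexTransitions, its enums, and its addTransition helper are hypothetical and only imitate the shape of the addTransition calls the greps surface. The three rows are the vertex transitions from the skeleton above.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy transition table; NOT Tez code. Rows mirror the skeleton's vertex steps. */
final class ToyVertexTransitions {
  enum State { NEW, INITIALIZING, INITED, RUNNING }
  enum Event { V_INIT, V_INITED, V_START }

  // from-state -> (event -> to-state)
  private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

  private static void addTransition(State from, Event on, State to) {
    TABLE.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
  }

  static {
    addTransition(State.NEW, Event.V_INIT, State.INITIALIZING);
    addTransition(State.INITIALIZING, Event.V_INITED, State.INITED);
    addTransition(State.INITED, Event.V_START, State.RUNNING);
  }

  /** Returns the next state, or null when the event is not legal in that state. */
  static State next(State from, Event on) {
    Map<Event, State> row = TABLE.get(from);
    return row == null ? null : row.get(on);
  }

  public static void main(String[] args) {
    System.out.println(next(State.NEW, Event.V_INIT));   // prints INITIALIZING
    System.out.println(next(State.NEW, Event.V_START));  // prints null: not legal in NEW
  }
}
```

The question to ask of the real table is the same one next() answers: for the event you care about, which from-states have a row, and what happens when it arrives in a state that has none.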
Locate Your Specific Failure Segment
The skeleton is the highway; your bug is at one specific exit. Use these heuristics:
| Symptom in repro logs | Likely segment |
|---|---|
| VertexImpl ... transitioned from RUNNING to FAILED | VertexImpl state machine: transition on V_TASK_RESCHEDULED or V_INTERNAL_ERROR |
| TaskAttemptImpl ... NPE | TaskAttemptImpl event handlers; check container-launched and TA_DONE paths |
| NPE in AsyncDispatcher.dispatch | Race between dispatcher start/stop and event submission |
| ShuffleManager: too many fetch failures | Fetcher retry/timeout; ShuffleManager.fetchFailure() |
| IFile checksum mismatch | IFile.Writer/Reader; check spill+merge |
| OutOfMemory ... GROUP_COMPARATOR | MergeManager memory math; ifile spill thresholds |
| Container released before TA_DONE | TaskSchedulerManager reuse path; check container release races |
Once you know your segment, draw it.
Build the Path Diagram
Two formats. Do both — they validate each other.
Text-arrow form (paste into the root-cause doc)
Use this in JIRA comments and PR descriptions. It survives any rendering.
TezClient.submitDAG (TezClient.java:485)
-> DAGClientHandler.submitDAG (DAGClientHandler.java:152)
-> DAGAppMaster.startDAG (DAGAppMaster.java:1234)
-> DAGImpl NEW --DAG_INIT--> INITED (DAGImpl.java:340)
-> DAGImpl INITED --DAG_START--> RUNNING (DAGImpl.java:380)
-> VertexImpl v1 NEW --V_INIT--> INITIALIZING (VertexImpl.java:1820)
-> VertexImpl v1 INITIALIZING --V_INITED--> INITED (VertexImpl.java:1856)
-> VertexImpl v1 INITED --V_START--> RUNNING (VertexImpl.java:1901)
-> [21 TaskImpl T_SCHEDULE events fired]
-> TaskImpl t0 NEW --T_SCHEDULE--> SCHEDULED (TaskImpl.java:412)
-> TaskAttemptImpl t0.0 NEW --TA_SCHEDULE--> START_WAIT (TaskAttemptImpl.java:560)
-> [container assigned]
-> TaskAttemptImpl t0.0 START_WAIT --TA_CONTAINER_LAUNCHED--> RUNNING (...:610)
-> [container starts LogicalIOProcessorRuntimeTask]
-> ShuffleManager.run starts fetcher loop
-> Fetcher.fetchNext throws IOException (Fetcher.java:289) <-- FAILURE HERE
-> ShuffleManager.fetchFailure -> InputReadErrorEvent
-> TaskAttemptImpl t0.0 RUNNING --TA_FAILED--> FAILED
Cite real line numbers from your checkout. Future-you will thank you.
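A small lint can catch transitions you forgot to cite. The sketch below is a hypothetical helper, not part of Tez or of this course's tooling: it assumes transitions are written as --EVENT--> and citations as (File.java:123), as in the text-arrow form above, and it reports transition lines that lack a citation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical lint for the text-arrow form; assumes the notation used above. */
final class CitationLint {
  private static final Pattern TRANSITION = Pattern.compile("--\\w+-->");
  private static final Pattern CITATION = Pattern.compile("\\(\\w+\\.java:\\d+\\)");

  /** Returns the trimmed transition lines that carry no (File.java:line) citation. */
  static List<String> uncited(List<String> lines) {
    List<String> missing = new ArrayList<>();
    for (String line : lines) {
      if (TRANSITION.matcher(line).find() && !CITATION.matcher(line).find()) {
        missing.add(line.trim());
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    List<String> path = Arrays.asList(
        "-> DAGImpl NEW --DAG_INIT--> INITED (DAGImpl.java:340)",
        "-> [container assigned]",
        "-> TaskAttemptImpl t0.0 RUNNING --TA_FAILED--> FAILED");
    // Only the last line is reported: it is a transition with no citation.
    System.out.println(uncited(path));
  }
}
```

Bracketed narrative lines such as [container assigned] are ignored on purpose; only state transitions are expected to carry a citation.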
Mermaid diagram (for the write-up and PR)
sequenceDiagram
participant C as Client
participant AM as DAGAppMaster
participant D as DAGImpl
participant V as VertexImpl v1
participant T as TaskImpl t0
participant TA as TaskAttempt t0.0
participant SM as ShuffleManager
participant F as Fetcher
C->>AM: submitDAG
AM->>D: DAG_INIT
D->>D: NEW -> INITED
AM->>D: DAG_START
D->>V: V_INIT
V->>V: NEW -> INITIALIZING -> INITED
D->>V: V_START
V->>T: T_SCHEDULE
T->>TA: TA_SCHEDULE
TA->>TA: NEW -> START_WAIT
Note over TA: container assigned + launched
TA->>TA: START_WAIT -> RUNNING
TA->>SM: shuffle starts
SM->>F: fetchNext
F-->>SM: IOException
SM->>TA: InputReadErrorEvent (TA_FAILED)
TA->>TA: RUNNING -> FAILED
Both diagrams say the same thing. Together they make committer review go more smoothly, because they show that you read the code rather than paraphrased the JIRA.
Verify Empirically with Temporary LOG.info() Probes
The map is a hypothesis. Confirm it with probes. Add temporary logging at the points you think your event traverses. Pattern:
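// In VertexImpl.java, inside the handler you suspect.
// Reuse the class's existing LOG field; a private static field cannot be
// declared inside a method. In a static transition class, call the getters
// on the VertexImpl argument instead.
// The prefix is a literal so that grep finds it in both the source and the log.
LOG.info("PROBE-TEZNNNN: V_INIT entered for vertex={} state={}",
    getName(), getState());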
Rules for probes:
- Prefix every probe with PROBE-TEZ<NNNN> so you can grep them in one pass and delete in one pass.
- Use LOG.info, not LOG.debug, so they appear without changing log config.
- Include the field values you care about (state, event type, IDs).
- Never commit probes. They are scaffolding for Step 4.
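One way to enforce the last rule is to scan a diff for added lines that still carry the probe prefix. The sketch below is a hypothetical helper, not part of any Tez tooling: it assumes you pass it the text of a unified diff (for example, captured from git diff) and that every probe contains the substring "PROBE-".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-commit check: finds probe lines that a diff would add. */
final class ProbeLeftovers {
  /** Returns added lines (without the leading '+') that contain "PROBE-". */
  static List<String> addedProbeLines(String unifiedDiff) {
    List<String> hits = new ArrayList<>();
    for (String line : unifiedDiff.split("\\R")) {
      // '+' marks an added line; "+++" is the new-file header, not content.
      boolean added = line.startsWith("+") && !line.startsWith("+++");
      if (added && line.contains("PROBE-")) {
        hits.add(line.substring(1).trim());
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    String diff = String.join("\n",
        "+++ b/VertexImpl.java",
        "+    LOG.info(\"PROBE-TEZNNNN: V_INIT entered\");",
        "-    LOG.info(\"PROBE-TEZNNNN: removed probe\");",
        "     int unchanged = 0;");
    // Reports only the added probe line; the removed one is already gone.
    System.out.println(addedProbeLines(diff));
  }
}
```

An empty result means the working tree is clean of probes as far as this check can tell.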
After re-running your test:
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q 2>&1 \
| grep "PROBE-TEZNNNN" | tee /tmp/probe-trace.txt
Compare the probe trace to your diagram. Discrepancies are the most valuable output of this whole step — they are exactly where your mental model differs from the code.
Common discrepancies to watch for:
- "I thought this handler ran once. It ran three times." (Re-entrancy bug.)
- "I thought events arrived in order A,B,C. They arrived B,A,C." (Async dispatch reordering.)
- "I thought the vertex was in RUNNING. It was in INITED." (Wrong assumption about state at the time of the event.)
When a probe surprises you, do not delete the probe. Lean in. That is the shortest path to root cause.
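Comparing the trace to the diagram by eye gets error-prone once the trace is long. The sketch below is one hedged way to mechanize it: ProbeTraceDiff is a hypothetical helper, and it assumes you have already reduced both the diagram and the probe trace to ordered lists of event labels (how you extract those labels from probe-trace.txt is up to you).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: finds where an observed event order leaves the expected one. */
final class ProbeTraceDiff {
  /** Index of the first position where the sequences differ, or -1 if identical. */
  static int firstDivergence(List<String> expected, List<String> observed) {
    int common = Math.min(expected.size(), observed.size());
    for (int i = 0; i < common; i++) {
      if (!expected.get(i).equals(observed.get(i))) {
        return i;
      }
    }
    // Same prefix: either identical, or one trace ends early at index 'common'.
    return expected.size() == observed.size() ? -1 : common;
  }

  public static void main(String[] args) {
    List<String> expected = Arrays.asList("V_INIT", "V_INITED", "V_START");
    List<String> observed = Arrays.asList("V_INIT", "V_START", "V_INITED");
    System.out.println(firstDivergence(expected, observed)); // prints 1
  }
}
```

The returned index is where to start reading: everything before it matched your mental model, and the event at that position is the first surprise.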
Output
Your Step 3 deliverables live in capstone-work/execution-path/:
- path-skeleton.md: text-arrow form with line numbers.
- path.mmd: the mermaid source.
- probe-trace.txt: grep output from the probe run (copy it from /tmp/probe-trace.txt).
- notes.md: three to five surprises you found while reading.
Validation / Self-check
Before you advance to Step 4, you must:
- Be able to name, from memory, every state transition between TezClient.submitDAG() and your failure point.
- Have file:line citations for every transition in your diagram, against your ~/tez-src/ HEAD.
- Have run the repro with PROBE-TEZ<NNNN> log statements and confirmed the sequence matches your diagram (or, more usefully, noted where it diverges).
- Have removed every probe from your working tree before any commit (git diff should not contain "PROBE-").
- Have at least one "surprise" noted in notes.md; if you have zero, you did not look hard enough.
- Be able to answer: "Which event, in which state, on which class, fires the handler that produces the failure?" in one sentence.
- Have the mermaid diagram render without syntax errors (mdbook serve your capstone-work folder, or paste into mermaid.live).