Lab 4.1: Read the VertexImpl State Machine
Background
VertexImpl.java is the most complex class in Apache Tez. It is approximately 6,000 lines
long and contains the complete state machine for vertex execution, including initialization,
scheduling, task completion handling, failure handling, speculative execution, and AM
recovery. Reading it systematically — rather than linearly — is the skill this lab builds.
The output of this lab is a complete state transition table that you have produced from the source code, without reference to any external documentation.
How to Read a Large State Machine Class
Do not read VertexImpl.java from top to bottom. Instead:
- Start with the StateMachineFactory declaration (search for stateMachineFactory =)
- Extract all addTransition calls: this gives you the complete transition table
- For each transition, find the handler class: the inner class that implements SingleArcTransition or MultipleArcTransition
- Read each handler's transition() method: this is the actual state machine logic
- Trace inter-state-machine events: where does the handler post events to other state machines?
Step-by-Step Tasks
Step 1: Find the StateMachineFactory
grep -n "stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -5
Note the line number. The factory declaration starts there and continues for hundreds of lines. Read the entire factory definition — do not skip any transitions.
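To make sure you read the whole definition and nothing beyond it, it helps to know where the factory chain ends. The sketch below prints the start and end line of the chain. It assumes the chain begins at the first "new StateMachineFactory" expression and ends at the installTopology() call that normally terminates a Hadoop YARN StateMachineFactory chain; the function name factory_range is made up for this lab.

```shell
# Hypothetical helper: print "START,END" line numbers of the factory chain.
# Assumption: the chain starts at the first "new StateMachineFactory"
# and ends at the first following line that mentions installTopology.
factory_range() {
  awk '
    !start && /new StateMachineFactory/ { start = NR }
    start && /installTopology/          { print start "," NR; exit }
  ' "$1"
}
```

The output can be passed straight to sed to print only the factory, for example sed -n "$(factory_range VertexImpl.java)p" VertexImpl.java (with the full path from the grep above).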
Step 2: Count All States and Transitions
# List the distinct source states referenced in addTransition (append "| wc -l" to count them)
grep "addTransition(VertexState\." \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| sed 's/.*addTransition(VertexState\.\([A-Z_]*\).*/\1/' \
| sort -u
# Count total transitions
grep -c "addTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
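Record your numbers. The distinct source states are limited to the values of the VertexState enum, so expect on the order of ten of them rather than dozens; the transition count should be much larger (more than 80).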
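A line-based grep only sees a source state when it sits on the same line as addTransition. If the source file breaks some calls so that the first argument starts on the following line (an assumption worth checking in your checkout), the list above will be incomplete. The following sketch counts source states and transitions while tolerating that layout; count_states_and_transitions is a hypothetical name.

```shell
# Hypothetical helper: count distinct source states and total transitions.
# The first VertexState.X token at or after each ".addTransition" is
# treated as that call's source state, even if it is on a later line.
count_states_and_transitions() {
  awk '
    /\.addTransition/ { pending = 1; total++ }
    pending && match($0, /VertexState\.[A-Z0-9_]+/) {
      # "VertexState." is 12 characters long; keep only the state name.
      seen[substr($0, RSTART + 12, RLENGTH - 12)] = 1
      pending = 0
    }
    END {
      n = 0
      for (s in seen) n++
      print n " source states, " (total + 0) " transitions"
    }
  ' "$1"
}
```

Compare its output with the two grep results; a difference points at calls whose arguments are split across lines.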
Step 3: Build the Transition Table
For each line in the StateMachineFactory, extract:
- Source state
- Event type
- Destination state(s)
- Handler class name
Begin with the transitions from NEW:
# Find all transitions FROM NEW, including their continuation lines
awk '/addTransition\(VertexState\.NEW/ {p=1; print; next} /\.addTransition/ {p=0} p' \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| head -20
Then from INITIALIZING:
grep -A4 "addTransition(VertexState\.INITIALIZING" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -40
Build a table with columns: Source State | Event | Destination | Handler.
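A first draft of the table can be generated mechanically and then corrected by hand. In Hadoop YARN's StateMachineFactory, addTransition takes its arguments in the order source state, destination state (or a set of destination states), event type (or a set of event types), and optionally a handler. The sketch below relies on that order: it joins each call into one string, treats the first VertexState token as the source and any further ones as destinations, and reports the class after "new" as the handler. The name transition_table is hypothetical, and handlers passed as shared constants rather than created with "new" are shown as "-".

```shell
# Hypothetical helper: print "source | event | destination | handler"
# for every addTransition call in the given file.
transition_table() {
  awk '
    # Print one row for the call accumulated in buf, then reset buf.
    function emit(   rest, tok, src, dst, evt, hnd) {
      if (buf == "") return
      rest = buf
      while (match(rest, /Vertex(State|EventType)\.[A-Z0-9_]+/)) {
        tok = substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
        if (tok ~ /^VertexEventType\./) {
          sub(/^VertexEventType\./, "", tok)
          evt = (evt == "" ? tok : evt "," tok)
        } else {
          sub(/^VertexState\./, "", tok)
          if (src == "") src = tok
          else dst = (dst == "" ? tok : dst "," tok)
        }
      }
      hnd = "-"
      if (match(buf, /new [A-Za-z0-9_]+[ ]*\(/)) {
        hnd = substr(buf, RSTART + 4, RLENGTH - 4)
        sub(/[ ]*\($/, "", hnd)
      }
      print src " | " evt " | " dst " | " hnd
      buf = ""
    }
    /^[ \t]*\/\//     { next }                      # skip comment lines
    /\.addTransition/ { emit(); buf = $0; next }    # a new call begins
    /installTopology/ { emit(); next }              # end of the chain
    buf != ""         { buf = buf " " $0 }          # continuation line
    END               { emit() }
  ' "$1"
}
```

Treat the result as a starting point only: check every row against the source, especially rows with several destinations or a "-" handler.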
Step 4: Trace the Happy Path
The happy path for a vertex with no source edges (a root vertex, e.g., Tokenizer):
NEW
V_INIT → INITIALIZING (InitTransition)
V_INIT_DONE → INITED (InitedTransition — if no root input initializers)
V_START → RUNNING (StartTransition)
[VertexManager schedules tasks]
[All tasks complete successfully]
V_TASK_COMPLETED (final task) → COMMITTING (TaskCompletedTransition)
V_COMMIT_COMPLETED → SUCCEEDED (CommitCompletedTransition)
For each transition in the happy path, find the handler class and answer:
InitTransition.transition():
grep -n "class InitTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
- What does InitTransition do when there are no RootInputInitializers?
- Does it immediately post V_INIT_DONE, or is there an intermediate step?
InitedTransition.transition() (or whatever class handles V_INIT_DONE):
- When does INITIALIZING go to INITED vs going directly to RUNNING?
- What is the condition that allows immediate transition to RUNNING?
StartTransition.transition():
- What method on VertexManager is called here?
- Does this method block or is it asynchronous?
TaskCompletedTransition.transition():
- How does it track whether all tasks have completed?
- What is numSuccessSourceAttemptCompletions?
- At what point does it decide the vertex can move to COMMITTING?
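Answering these questions is easier when a handler can be read in isolation, with line numbers for your file:line references. The sketch below prints one inner class by counting braces from its declaration to the matching close. The name show_handler is made up, and the brace counting is naive: braces inside string literals or comments would throw it off.

```shell
# Hypothetical helper: print the body of one inner class with line numbers.
# Usage: show_handler FILE CLASSNAME
show_handler() {
  awk -v cls="$2" '
    # Start at the declaration line of the requested class.
    !inside && $0 ~ ("class " cls "[ {]") { inside = 1 }
    inside {
      print NR ": " $0
      # gsub returns the number of matches; the line itself is unchanged.
      opens  += gsub(/[{]/, "{")
      closes += gsub(/[}]/, "}")
      if (opens > 0 && opens == closes) exit
    }
  ' "$1"
}
```

For example, show_handler with the VertexImpl.java path and InitTransition prints that class from its declaration to its closing brace.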
Step 5: Trace the Failure Path
A task fails. The event chain:
TaskAttemptImpl: RUNNING → FAILED (sends T_ATTEMPT_FAILED to TaskImpl)
TaskImpl: RUNNING → FAILED (if retry limit exceeded; sends V_TASK_COMPLETED{FAILED})
VertexImpl: RUNNING → ?
Find the handler for V_TASK_COMPLETED when the task is FAILED:
# TaskCompletedTransition handles both success and failure
grep -n "TaskCompletedTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -10
Answer:
- What field tracks the number of failed tasks?
- What is the condition that causes the vertex to transition to FAILED?
- What event does VertexImpl send to DAGImpl when it fails?
- Does DAGImpl fail immediately when a vertex fails, or does it try to continue?
# Find how DAGImpl handles vertex failure
grep -n "DAG_VERTEX_COMPLETED\|vertexFailed\|VERTEX_FAILED" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -15
Step 6: Find the RECOVERING States
grep "RECOVERING" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| grep "VertexState\." | head -20
Answer:
- How many RECOVERING_* states exist?
- What event exits the RECOVERING state?
- What class handles recovery completion?
Step 7: Find All @Ignored Tests in TestVertexImpl
grep -n "@Ignore" \
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
For each @Ignored test:
- Read the comment explaining why it is ignored
- Determine if the bug has been fixed (search JIRA for the referenced issue number)
- If the fix exists, the test can likely be re-enabled — this is a contributor opportunity
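The grep above shows only the annotation lines. The sketch below pairs each @Ignore with the test method that follows it and with the first JIRA key it finds between the annotation and the method. It assumes issue keys look like TEZ-1234 and that the method is declared with a void return type; ignored_tests is a hypothetical name.

```shell
# Hypothetical helper: list "LINE: methodName (JIRA key or -)" for each
# @Ignore-annotated test method in the given file.
ignored_tests() {
  awk '
    /@Ignore/ {
      pending = 1
      jira = "-"
      if (match($0, /TEZ-[0-9]+/)) jira = substr($0, RSTART, RLENGTH)
      next
    }
    # Pick up a JIRA key from comment lines between annotation and method.
    pending && jira == "-" && match($0, /TEZ-[0-9]+/) {
      jira = substr($0, RSTART, RLENGTH)
    }
    # The next method declaration belongs to the pending @Ignore.
    pending && match($0, /void [A-Za-z0-9_]+/) {
      print NR ": " substr($0, RSTART + 5, RLENGTH - 5) " (" jira ")"
      pending = 0
    }
  ' "$1"
}
```

A "-" in the last column means no key was found near the annotation, so the reason for ignoring the test has to be traced through the file's history instead.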
Step 8: Find a Transition with No Test Coverage
Pick three transition handler classes from your transition table. For each, check if
TestVertexImpl has a test that exercises that handler:
# Example: does TestVertexImpl test TaskCompletedTransition?
grep -n "TaskCompletedTransition\|taskCompletedTransition" \
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java | head -5
# If none found, search for tests that trigger V_TASK_COMPLETED
grep -n "V_TASK_COMPLETED\|VertexEventType.V_TASK_COMPLETED" \
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java | head -10
Identify one transition that appears to have insufficient test coverage and document it.
This is a potential Test JIRA issue you could file and fix.
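To check several handlers or event types in one pass, the two greps above can be wrapped in a loop. The sketch below prints, for each search string, how many lines of the test file mention it; mention_counts is a made-up name, and the strings are matched literally.

```shell
# Hypothetical helper: for each string, count the lines in FILE that contain it.
# Usage: mention_counts FILE STRING...
mention_counts() {
  test_file="$1"
  shift
  for pattern in "$@"; do
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    n=$(grep -c -F -- "$pattern" "$test_file" || true)
    echo "$pattern $n"
  done
}
```

A count of zero for both a handler name and its event type is only a hint: the transition may still be exercised indirectly, so confirm by reading the tests before documenting a gap.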
Deliverable: Your Transition Table
Produce a table in this format (populate all rows from code):
| Source State | Event Type | Destination | Handler Class |
|---|---|---|---|
| NEW | V_INIT | INITIALIZING | InitTransition |
| INITIALIZING | V_INIT_DONE | INITED / FAILED | InitedTransition |
| INITED | V_START | RUNNING | StartTransition |
| RUNNING | V_TASK_COMPLETED | RUNNING/SUCCEEDED | TaskCompletedTransition |
| ... | ... | ... | ... |
Your table should have at least 30 rows (covering the main execution paths). Recovery states are optional for this level.
Expected Output
- A complete (or near-complete) state transition table for VertexImpl
- Answers to all questions in Steps 4–6 with file:line references
- List of @Ignored tests with your assessment of whether they could be re-enabled
- One transition identified as having insufficient test coverage
Stretch Goals
- Produce the same transition table for TaskImpl and TaskAttemptImpl. Compare their complexity (number of states and transitions) to VertexImpl.
- Find all places where VertexImpl calls eventHandler.handle() to post an event to another state machine. What are the target state machines and what event types are used?
  grep -n "eventHandler.handle" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | grep -v "VertexEvent" | head -20
- Find the V_PARALLELISM_UPDATED transition: what does it do, and why is it one of the most bug-prone transitions in the state machine?
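For the first stretch goal, a quick size comparison can precede the full tables. The sketch below counts the lines that mention .addTransition in each file it is given, as a rough proxy for the number of transitions; it says nothing about the number of states. The name compare_machines is hypothetical, and the file paths are yours to supply.

```shell
# Hypothetical helper: print "ClassName<TAB>count" of .addTransition lines
# for each Java source file passed as an argument.
compare_machines() {
  for f in "$@"; do
    # grep -c exits non-zero on zero matches; keep going in that case.
    n=$(grep -c '\.addTransition' "$f" || true)
    printf '%s\t%s\n' "$(basename "$f" .java)" "$n"
  done
}
```

Pass it the paths of VertexImpl.java, TaskImpl.java and TaskAttemptImpl.java once you have located them in your checkout.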