Lab 4.1: Read the VertexImpl State Machine

Background

VertexImpl.java is one of the most complex classes in Apache Tez. It is approximately 6,000 lines long and contains the complete state machine for vertex execution, including initialization, scheduling, task completion handling, failure handling, speculative execution, and AM recovery. Reading it systematically, rather than linearly, is the skill this lab builds.

The output of this lab is a complete state transition table that you have produced from the source code, without reference to any external documentation.


How to Read a Large State Machine Class

Do not read VertexImpl.java from top to bottom. Instead:

  1. Start with the StateMachineFactory declaration (search for stateMachineFactory =)
  2. Extract all addTransition calls — this gives you the complete transition table
  3. For each transition, find the handler class — the inner class that implements SingleArcTransition or MultipleArcTransition
  4. Read each handler's transition() method — this is the actual state machine logic
  5. Trace inter-state-machine events — where does the handler post events to other state machines?

Step-by-Step Tasks

Step 1: Find the StateMachineFactory

grep -n "stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -5

Note the line number. The factory declaration starts there and continues for hundreds of lines. Read the entire factory definition — do not skip any transitions.

Step 2: Count All States and Transitions

# List the distinct source states referenced in addTransition (append "| wc -l" to count them)
grep "addTransition(VertexState\." \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | sed 's/.*addTransition(VertexState\.\([A-Z_]*\).*/\1/' \
  | sort -u

# Count total transitions
grep -c "addTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

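Record your numbers. The source states are bounded by the VertexState enum, so expect on the order of ten distinct states and a much larger number of transitions. Exact counts vary between Tez releases, and the first command misses any addTransition call whose first argument is wrapped onto the following line.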
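As a cross-check on the two counts, here is a minimal sketch that strips all whitespace before matching, so calls wrapped across lines are counted too. The function name count_by_source is made up for this lab, and the sketch assumes every source state is written as a literal VertexState.X.

```shell
# count_by_source FILE
# Prints one "STATE COUNT" line per source state used in addTransition.
# Assumption: the first argument is always a literal VertexState.<NAME>.
count_by_source() {
  tr -d '[:space:]' < "$1" \
    | grep -o 'addTransition(VertexState\.[A-Z_]*' \
    | sed 's/.*VertexState\.//' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'   # normalise the padded "uniq -c" output
}
```

Run it as count_by_source tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java. The number of output lines is the distinct-state count and the sum of the second column is the transition count. Calls that appear inside comments are counted as well, so treat the totals as an upper bound.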

Step 3: Build the Transition Table

For each line in the StateMachineFactory, extract:

  • Source state
  • Event type
  • Destination state(s)
  • Handler class name

Begin with the transitions from NEW:

# Find all transitions FROM NEW
grep -A4 "addTransition(VertexState\.NEW" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -40

Then from INITIALIZING:

grep -A4 "addTransition(VertexState\.INITIALIZING" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -40

Build a table with columns: Source State | Event | Destination | Handler.
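Extracting the four columns by hand is error-prone, so a rough first draft can be generated and then corrected against the source. The sketch below is an assumption-laden helper, not part of Tez: transition_table is a name invented for this lab, and it assumes arguments are written as VertexState.X, VertexEventType.Y, EnumSet.of(...), and new Handler(). It collapses whitespace, splits on addTransition(, and separates the arguments at top-level commas.

```shell
# transition_table FILE
# Prints "| SOURCE | EVENT | DESTINATION | HANDLER |" for each addTransition call.
# A call without a handler argument gets "-" in the last column.
transition_table() {
  tr -s '[:space:]' ' ' < "$1" | awk -F '\n' '
    # Reduce one argument to bare names: NEW, V_INIT, A/B, InitTransition.
    function clean(s) {
      sub(/^ +/, "", s); sub(/ +$/, "", s)
      sub(/^EnumSet\.of\(/, "", s)           # unwrap EnumSet.of(...)
      sub(/^new +/, "", s)                   # constructor call -> class name
      sub(/[()].*$/, "", s)                  # drop trailing "(...)" or ")"
      gsub(/Vertex(State|EventType)\./, "", s)
      gsub(/ *, */, "/", s)                  # several states or events -> A/B
      return s
    }
    {
      n = split($0, calls, /addTransition *\(/)
      for (i = 2; i <= n; i++) {
        depth = 0; k = 1; arg[1] = ""
        len = length(calls[i])
        for (j = 1; j <= len; j++) {
          c = substr(calls[i], j, 1)
          if (c == "(") depth++
          else if (c == ")") { if (depth == 0) break; depth-- }
          else if (c == "," && depth == 0) { k++; arg[k] = ""; continue }
          arg[k] = arg[k] c
        }
        if (k < 3) continue                  # not a recognisable call
        handler = (k >= 4) ? clean(arg[4]) : "-"
        printf "| %s | %s | %s | %s |\n", clean(arg[1]), clean(arg[3]), clean(arg[2]), handler
      }
    }'
}
```

Treat the output as a draft only. Comments placed inside a call, or the text addTransition( appearing in a comment, will produce wrong rows, so every row still needs to be checked against the factory definition.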

Step 4: Trace the Happy Path

The happy path for a root vertex, that is, a vertex with no source edges (for example, Tokenizer), is sketched below. Treat the event and handler names as a starting point and confirm each one against the factory:

NEW
  V_INIT → INITIALIZING (InitTransition)
  V_INIT_DONE → INITED (InitedTransition, if there are no root input initializers)
  V_START → RUNNING (StartTransition)
    [VertexManager schedules tasks]
    [All tasks complete successfully]
  V_TASK_COMPLETED (final task) → COMMITTING (TaskCompletedTransition; a vertex with nothing to commit may go straight to SUCCEEDED)
  V_COMMIT_COMPLETED → SUCCEEDED (CommitCompletedTransition)

For each transition in the happy path, find the handler class and answer:

InitTransition.transition():

grep -n "class InitTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

  • What does InitTransition do when there are no RootInputInitializers?
  • Does it immediately post V_INIT_DONE, or is there an intermediate step?

InitedTransition.transition() (or whatever class handles V_INIT_DONE):

  • When does INITIALIZING go to INITED vs going directly to RUNNING?
  • What is the condition that allows immediate transition to RUNNING?

StartTransition.transition():

  • What method on VertexManager is called here?
  • Does this method block or is it asynchronous?

TaskCompletedTransition.transition():

  • How does it track whether all tasks have completed?
  • What is numSuccessSourceAttemptCompletions?
  • At what point does it decide the vertex can move to COMMITTING?
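Each handler in this step is an inner class, so it helps to print just that class rather than scroll through the file. The helper below is a minimal sketch (show_class is a name invented for this lab): it starts at the class declaration and stops when the braces balance, which assumes no unbalanced braces inside string literals or comments.

```shell
# show_class FILE CLASSNAME
# Prints the source of one (inner) class, each line prefixed with its line number.
show_class() {
  awk -v name="$2" '
    # Start at the declaration; the suffix keeps "Foo" from matching "FooBar".
    !inside && $0 ~ ("class " name "([^A-Za-z0-9_]|$)") { inside = 1 }
    inside {
      printf "%d: %s\n", NR, $0
      opens  += gsub(/[{]/, "{")   # gsub returns the number of matches
      closes += gsub(/[}]/, "}")
      if (opens > 0 && opens == closes) exit
    }
  ' "$1"
}
```

For example, show_class tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java InitTransition prints the handler with line numbers, which can be cited directly as file:line references in your answers.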

Step 5: Trace the Failure Path

A task fails. The event chain:

TaskAttemptImpl: RUNNING → FAILED (sends T_ATTEMPT_FAILED to TaskImpl)
  TaskImpl: RUNNING → FAILED (if retry limit exceeded; sends V_TASK_COMPLETED{FAILED})
    VertexImpl: RUNNING → ?

Find the handler for V_TASK_COMPLETED when the task is FAILED:

# TaskCompletedTransition handles both success and failure
grep -n "TaskCompletedTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -10

Answer:

  1. What field tracks the number of failed tasks?
  2. What is the condition that causes the vertex to transition to FAILED?
  3. What event does VertexImpl send to DAGImpl when it fails?
  4. Does DAGImpl fail immediately when a vertex fails, or does it try to continue?

# Find how DAGImpl handles vertex failure
grep -n "DAG_VERTEX_COMPLETED\|vertexFailed\|VERTEX_FAILED" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -15

Step 6: Find the RECOVERING States

grep "RECOVERING" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | grep "VertexState\." | head -20

Answer:

  1. How many RECOVERING_* states exist?
  2. What event exits the RECOVERING state?
  3. What class handles recovery completion?

Step 7: Find All @Ignored Tests in TestVertexImpl

grep -n "@Ignore" \
  tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java

For each test annotated with @Ignore:

  1. Read the comment explaining why it is ignored
  2. Determine whether the bug has been fixed (search JIRA for the referenced issue number)
  3. If the fix exists, the test can likely be re-enabled, which makes it a good contributor opportunity
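The plain grep shows only the annotation lines, not which test each one belongs to. The sketch below pairs every @Ignore with the next test method and picks up a TEZ-nnnn issue id if one appears on the annotation line. The function name ignored_tests is invented for this lab, and the sketch assumes test methods are declared as void methods.

```shell
# ignored_tests FILE
# Prints "LINE<TAB>METHOD<TAB>ISSUE" for each @Ignore annotation.
# ISSUE is "?" when no TEZ-nnnn id is found on the annotation line.
ignored_tests() {
  awk '
    /@Ignore/ {
      pending = NR
      jira = match($0, /TEZ-[0-9]+/) ? substr($0, RSTART, RLENGTH) : "?"
      next
    }
    # The first void method after a pending annotation is the ignored test.
    pending && /void +[A-Za-z0-9_]+ *\(/ {
      match($0, /void +[A-Za-z0-9_]+/)
      name = substr($0, RSTART, RLENGTH)
      sub(/^void +/, "", name)
      print pending "\t" name "\t" jira
      pending = 0
    }
  ' "$1"
}
```

A "?" in the last column does not mean there is no issue. The reason may be given in a comment above the annotation, so read the surrounding lines before concluding anything.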

Step 8: Find a Transition with No Test Coverage

Pick three transition handler classes from your transition table. For each, check if TestVertexImpl has a test that exercises that handler:

# Example: does TestVertexImpl test TaskCompletedTransition?
grep -n "TaskCompletedTransition\|taskCompletedTransition" \
  tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java | head -5

# If none found, search for tests that trigger V_TASK_COMPLETED
grep -n "V_TASK_COMPLETED\|VertexEventType.V_TASK_COMPLETED" \
  tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java | head -10

Identify one transition that appears to have insufficient test coverage and document it. This is a potential Test JIRA issue you could file and fix.
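To screen many handlers at once, the per-handler grep can be wrapped in a loop. This is a minimal sketch with an invented name, handler_mentions; it counts lines in the test file that mention each handler as a whole word.

```shell
# handler_mentions TEST_FILE HANDLER...
# Prints "HANDLER COUNT", where COUNT is the number of lines mentioning HANDLER.
handler_mentions() {
  test_file=$1
  shift
  for handler in "$@"; do
    # grep -c prints 0 but exits non-zero when nothing matches; ignore the status.
    count=$(grep -c -w -- "$handler" "$test_file" || true)
    printf '%s %s\n' "$handler" "$count"
  done
}
```

A count of 0 is only a hint. Tests usually drive the state machine by dispatching events rather than naming handler classes, so follow up with the event-type search above before declaring a transition untested.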


Deliverable: Your Transition Table

Produce a table in this format (populate all rows from code):

| Source State      | Event Type          | Destination       | Handler Class            |
|---|---|---|---|
| NEW               | V_INIT              | INITIALIZING      | InitTransition           |
| INITIALIZING      | V_INIT_DONE         | INITED / FAILED   | InitedTransition         |
| INITED            | V_START             | RUNNING           | StartTransition          |
| RUNNING           | V_TASK_COMPLETED    | RUNNING / COMMITTING / SUCCEEDED | TaskCompletedTransition  |
| ...               | ...                 | ...               | ...                      |

Your table should have at least 30 rows (covering the main execution paths). Recovery states are optional for this level.


Expected Output

  1. A complete (or near-complete) state transition table for VertexImpl
  2. Answers to all questions in Steps 4–6 with file:line references
  3. List of @Ignored tests with your assessment of whether they could be re-enabled
  4. One transition identified as having insufficient test coverage

Stretch Goals

  1. Produce the same transition table for TaskImpl and TaskAttemptImpl. Compare their complexity (number of states and transitions) to VertexImpl.

  2. Find all places where VertexImpl calls eventHandler.handle() to post an event to another state machine. What are the target state machines and what event types are used?

    grep -n "eventHandler.handle" \
      tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
      | grep -v "VertexEvent" | head -20
    
  3. Find the V_PARALLELISM_UPDATED transition — what does it do, and why is it one of the most bug-prone transitions in the state machine?
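For stretch goal 2, a summary per event class is easier to reason about than raw grep hits. The sketch below (posted_events is an invented name) collapses whitespace so that wrapped calls are seen, then tallies the class constructed directly inside eventHandler.handle(...).

```shell
# posted_events FILE
# Prints "EVENT_CLASS COUNT" for events constructed inline in eventHandler.handle(new ...).
posted_events() {
  tr -s '[:space:]' ' ' < "$1" \
    | grep -o 'eventHandler\.handle( *new [A-Za-z0-9_.]*' \
    | sed 's/.*new //' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'
}
```

Events that are built into a local variable first and then passed to handle() are not captured, so the list is a lower bound on what VertexImpl posts.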