VertexImpl Lifecycle

VertexImpl is the AM-side representation of a single Vertex in a running DAG. Its lifecycle is driven by a Hadoop YARN state machine (built with StateMachineFactory) with about a dozen states (11 are listed below) and dozens of events. This chapter walks the happy path (NEW → SUCCEEDED), the major failure and kill paths, and the rules that govern transitions.

After this chapter you should be able to draw the state machine on a whiteboard and predict every state transition for any event in any state.


File

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

This is one of the largest files in Tez (typically 4000+ lines). Skim once top-to-bottom, then read the stateMachineFactory block carefully.

grep -n "stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

The factory is a single chained builder defined near the top of the file; it typically falls somewhere within lines 200–600, depending on version.
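To make the "chained builder" shape concrete before reading the real block, here is a minimal, self-contained sketch of a table-driven state machine. MiniStateMachineFactory, DemoState and DemoEvent are hypothetical names invented for illustration; this is not the Hadoop StateMachineFactory API, and the argument order (from, to, event) is only assumed to resemble what the grep above will show.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative only: a tiny transition table with a chained builder.
// MiniStateMachineFactory is a hypothetical class, not part of Tez or Hadoop.
final class MiniStateMachineFactory<S extends Enum<S>, E extends Enum<E>> {
  private final Map<S, Map<E, S>> table;
  private final Class<E> eventType;

  MiniStateMachineFactory(Class<S> stateType, Class<E> eventType) {
    this.table = new EnumMap<>(stateType);
    this.eventType = eventType;
  }

  // Registers one arc and returns this, so registrations can be chained.
  MiniStateMachineFactory<S, E> addTransition(S from, S to, E event) {
    table.computeIfAbsent(from, k -> new EnumMap<>(eventType)).put(event, to);
    return this;
  }

  // Looks up the next state; an unregistered (state, event) pair is an error.
  S next(S current, E event) {
    Map<E, S> arcs = table.get(current);
    if (arcs == null || !arcs.containsKey(event)) {
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    return arcs.get(event);
  }
}

enum DemoState { NEW, INITIALIZING, INITED, RUNNING }

enum DemoEvent { V_INIT, V_INIT_COMPLETED, V_START }

final class MiniStateMachineDemo {
  static final MiniStateMachineFactory<DemoState, DemoEvent> FACTORY =
      new MiniStateMachineFactory<DemoState, DemoEvent>(DemoState.class, DemoEvent.class)
          .addTransition(DemoState.NEW, DemoState.INITIALIZING, DemoEvent.V_INIT)
          .addTransition(DemoState.INITIALIZING, DemoState.INITED, DemoEvent.V_INIT_COMPLETED)
          .addTransition(DemoState.INITED, DemoState.RUNNING, DemoEvent.V_START);

  public static void main(String[] args) {
    DemoState s = DemoState.NEW;
    s = FACTORY.next(s, DemoEvent.V_INIT);
    s = FACTORY.next(s, DemoEvent.V_INIT_COMPLETED);
    s = FACTORY.next(s, DemoEvent.V_START);
    System.out.println(s); // prints RUNNING
  }
}
```

The point of the table-driven design is that every legal (state, event) pair is declared in one place, so a missing pair fails loudly instead of being silently mishandled.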


The states

grep -n "VertexState\." tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
# or find the enum declaration (the AM-internal VertexState in tez-dag,
# not the public org.apache.tez.dag.api.event.VertexState in tez-api)
grep -n "public enum\|enum VertexState" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/VertexState.java \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

The full state set (names exact as of 0.10.x):

| State | Meaning |
|---|---|
| NEW | Just constructed; no events seen |
| INITIALIZING | Inputs being initialized (e.g., split generation) |
| INITED | Ready to run; awaiting V_START |
| RUNNING | Tasks executing |
| COMMITTING | All tasks succeeded; outputs being committed |
| SUCCEEDED | Terminal: all good |
| TERMINATING | Failure/kill in progress; awaiting task drain |
| KILLED | Terminal: killed externally |
| FAILED | Terminal: failed (own fault) |
| ERROR | Terminal: AM internal error |
| RECOVERING | (Recovery only) replaying events into this vertex |

State × event matrix (happy path)

| State | Event | Next state | Action |
|---|---|---|---|
| NEW | V_INIT | INITIALIZING | construct inputs, kick off InputInitializers |
| INITIALIZING | V_ROOT_INPUT_INITIALIZED | INITIALIZING | accumulate events; if all done → INITED |
| INITIALIZING | V_ROOT_INPUT_FAILED | TERMINATING | bubble failure |
| INITIALIZING | V_INIT_COMPLETED | INITED | finalize parallelism if not set |
| INITED | V_START | RUNNING | schedule tasks via VertexManagerPlugin |
| RUNNING | V_TASK_COMPLETED (success) | RUNNING | bump counter; if all done → COMMITTING |
| RUNNING | V_TASK_COMPLETED (final fail) | TERMINATING | initiate cleanup |
| RUNNING | V_TASK_RESCHEDULED | RUNNING | rerun a task |
| COMMITTING | V_COMMIT_COMPLETED | SUCCEEDED | publish history |
| COMMITTING | V_COMMIT_FAILED | TERMINATING | rerun or fail |
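Several RUNNING rows above have a next state that depends on bookkeeping rather than on the event alone. The sketch below models only those rows. RunningVertexSketch and its method names are hypothetical, and the assumption that a reschedule decrements the success counter (because a previously succeeded task has to run again) is my reading of "rerun a task", not something confirmed by the source.

```java
// Illustrative sketch of the RUNNING rows in the matrix above.
// All names are hypothetical; this is not VertexImpl code.
final class RunningVertexSketch {
  enum State { RUNNING, COMMITTING, TERMINATING }

  private final int totalTasks;
  private int succeededTasks = 0;
  private State state = State.RUNNING;

  RunningVertexSketch(int totalTasks) {
    if (totalTasks <= 0) {
      throw new IllegalArgumentException("totalTasks must be positive");
    }
    this.totalTasks = totalTasks;
  }

  // Models V_TASK_COMPLETED (success): bump the counter; if all done, move on.
  State onTaskSucceeded() {
    requireRunning();
    succeededTasks++;
    if (succeededTasks == totalTasks) {
      state = State.COMMITTING;
    }
    return state;
  }

  // Models V_TASK_COMPLETED (final fail): the vertex starts cleaning up.
  State onTaskFailedFinal() {
    requireRunning();
    state = State.TERMINATING;
    return state;
  }

  // Models V_TASK_RESCHEDULED. Assumption: a succeeded task must rerun,
  // so it no longer counts toward completion.
  State onTaskRescheduled() {
    requireRunning();
    if (succeededTasks > 0) {
      succeededTasks--;
    }
    return state;
  }

  private void requireRunning() {
    if (state != State.RUNNING) {
      throw new IllegalStateException("event not handled in " + state);
    }
  }
}
```

With two tasks, the sequence succeed, reschedule, succeed, succeed ends in COMMITTING only on the last call, which is the "same event, different next state" behavior the matrix compresses into one row.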

For the complete matrix, count the addTransition(...) calls:

grep -c "addTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

There are usually >100 transitions registered. Many carry a short comment, sometimes citing the bug or JIRA that motivated them; read those comments.


Failure path walk

stateDiagram-v2
    [*] --> NEW
    NEW --> INITIALIZING: V_INIT
    INITIALIZING --> INITED: V_INIT_COMPLETED
    INITIALIZING --> TERMINATING: V_ROOT_INPUT_FAILED
    INITED --> RUNNING: V_START
    RUNNING --> COMMITTING: all tasks SUCCEEDED
    RUNNING --> TERMINATING: any task FAILED beyond max-attempts
    RUNNING --> TERMINATING: V_TERMINATE
    COMMITTING --> SUCCEEDED: V_COMMIT_COMPLETED
    COMMITTING --> TERMINATING: V_COMMIT_FAILED
    TERMINATING --> FAILED
    TERMINATING --> KILLED
    SUCCEEDED --> [*]
    FAILED --> [*]
    KILLED --> [*]

TERMINATING exists because a vertex cannot jump straight to FAILED or KILLED: it must first kill all running tasks and clean up its outputs. The vertex leaves TERMINATING for a terminal state once the number of tasks that have not yet finished reaches zero.
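The drain described above can be sketched as a counter plus a remembered reason. TerminatingVertexSketch and the Cause enum are hypothetical, and the mapping from cause to terminal state is an assumption based only on the state table's wording ("killed externally" versus "failed (own fault)"); verify the real rule against the source.

```java
// Illustrative sketch of the TERMINATING drain. Names are hypothetical.
final class TerminatingVertexSketch {
  enum State { RUNNING, TERMINATING, FAILED, KILLED }

  // Hypothetical stand-in for whatever termination reason the vertex records.
  enum Cause { OWN_TASK_FAILURE, EXTERNAL_KILL }

  private State state = State.RUNNING;
  private Cause cause;
  private int unfinishedTasks;

  TerminatingVertexSketch(int unfinishedTasks) {
    if (unfinishedTasks < 0) {
      throw new IllegalArgumentException("unfinishedTasks must be >= 0");
    }
    this.unfinishedTasks = unfinishedTasks;
  }

  // Enter TERMINATING and remember why; finish at once if nothing is running.
  State terminate(Cause why) {
    if (state != State.RUNNING) {
      return state; // the first recorded cause wins in this sketch
    }
    cause = why;
    state = State.TERMINATING;
    return maybeFinish();
  }

  // Called each time a task reports that it has stopped.
  State onTaskDrained() {
    if (unfinishedTasks > 0) {
      unfinishedTasks--;
    }
    return maybeFinish();
  }

  // Assumption: an external kill ends in KILLED, anything else in FAILED.
  private State maybeFinish() {
    if (state == State.TERMINATING && unfinishedTasks == 0) {
      state = (cause == Cause.EXTERNAL_KILL) ? State.KILLED : State.FAILED;
    }
    return state;
  }
}
```

Keeping the cause separate from the state is what lets a single TERMINATING state serve both the failure path and the kill path.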


Vertex initialization in detail

V_INIT is the most complex transition. The handler must:

  1. Construct each root InputDescriptor and call its InputInitializer.
  2. If parallelism is -1, defer task creation until either the VertexManagerPlugin calls reconfigureVertex(...) or the root inputs report concrete counts.
  3. Construct downstream Edge objects (the AM-side Edge, not the tez-api one) and bind their EdgeManagers.
  4. Set up the VertexManagerPlugin so that its onVertexStarted callback can be delivered later (the callback fires on V_START, not V_INIT).
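Step 2 is the one that most often confuses readers, so here is a minimal sketch of deferred task creation. DeferredParallelismSketch and onParallelismKnown are hypothetical names; onParallelismKnown stands in for either source of a concrete count (a reconfigureVertex(...) call or the root inputs reporting), and rejecting a second call after tasks exist is a simplification chosen for the sketch, not a claim about Tez.

```java
// Illustrative sketch of "parallelism = -1 means decide later".
// Names are hypothetical; this is not VertexImpl code.
final class DeferredParallelismSketch {
  private int numTasks;
  private boolean tasksCreated = false;

  DeferredParallelismSketch(int declaredParallelism) {
    if (declaredParallelism < -1) {
      throw new IllegalArgumentException("parallelism must be -1 or >= 0");
    }
    this.numTasks = declaredParallelism;
    if (declaredParallelism >= 0) {
      createTasks(); // the count is already known, nothing to defer
    }
  }

  // Stand-in for whichever party supplies the concrete count.
  void onParallelismKnown(int parallelism) {
    if (tasksCreated) {
      throw new IllegalStateException("tasks already created");
    }
    if (parallelism < 0) {
      throw new IllegalArgumentException("parallelism must be >= 0");
    }
    numTasks = parallelism;
    createTasks();
  }

  private void createTasks() {
    // A real implementation would build one task object per index here.
    tasksCreated = true;
  }

  boolean hasTasks() {
    return tasksCreated;
  }

  int numTasks() {
    return numTasks;
  }
}
```

A vertex constructed with -1 reports hasTasks() == false until onParallelismKnown(n) is called, which mirrors why the vertex cannot finish initializing until someone supplies a count.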

Read the body:

grep -n "InitTransition\|RootInputInitTransition\|RECOVERING" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -20

The commit path

A vertex with a DataSink that carries an OutputCommitter must run a commit phase after all tasks succeed. The commit:

  • Runs on the AM (not in tasks).
  • May fail, which moves the vertex to TERMINATING (V_COMMIT_FAILED).
  • Holds the vertex in COMMITTING for the duration.

Vertex-group commit (when multiple vertices write to a shared VertexGroup) is coordinated by DAGImpl; individual VertexImpls just signal that they are ready to commit.

grep -n "CommittingTransition\|commitOutput\|OutputCommitter" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
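The commit behavior described in this section can be sketched as "run the committer off the event-handling thread, then feed the outcome back as an event". CommitPhaseSketch and its methods are hypothetical, and handing the work to an executor is an assumption about how the AM avoids blocking; check the real code with the grep above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Illustrative sketch of the COMMITTING state. Names are hypothetical.
final class CommitPhaseSketch {
  enum State { COMMITTING, SUCCEEDED, TERMINATING }

  enum Event { V_COMMIT_COMPLETED, V_COMMIT_FAILED }

  private State state = State.COMMITTING;

  // Runs the committer on the given pool and converts its outcome into an
  // event. The vertex stays in COMMITTING until that event is handled.
  CompletableFuture<State> startCommit(Runnable committer, ExecutorService pool) {
    return CompletableFuture.runAsync(committer, pool)
        .handle((ignored, error) ->
            error == null ? Event.V_COMMIT_COMPLETED : Event.V_COMMIT_FAILED)
        .thenApply(this::onEvent);
  }

  // Applies the two COMMITTING rows of the state-by-event matrix.
  synchronized State onEvent(Event event) {
    if (state != State.COMMITTING) {
      return state; // late or duplicate events are ignored in this sketch
    }
    state = (event == Event.V_COMMIT_COMPLETED) ? State.SUCCEEDED : State.TERMINATING;
    return state;
  }

  synchronized State state() {
    return state;
  }
}
```

A committer that returns normally yields SUCCEEDED and one that throws yields TERMINATING; a committer that never returns leaves the future incomplete, which is the "stuck in COMMITTING" symptom in miniature.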

Reading exercise

# State machine block (adjust the range to where stateMachineFactory starts and ends in your version)
sed -n '200,600p' tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

# Count transitions
grep -c "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

# Find every event that can take the vertex to FAILED
grep -n "VertexState.FAILED" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

# List the transition classes, then locate InitTransition among them
grep -n "private.*class.*Transition\b" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

Answer:

  1. List five events that can take the vertex from RUNNING to TERMINATING.
  2. What determines the final state (FAILED vs KILLED) once TERMINATING completes?
  3. Why is INITED distinct from RUNNING — what does V_START actually trigger?
  4. How is parallelism set when a vertex starts with parallelism = -1?
  5. What happens to in-flight tasks when a vertex transitions to TERMINATING?
  6. Why does the state machine have a separate COMMITTING state instead of committing inside RUNNING?

Common bugs and symptoms

| Symptom | Root cause | Where to look |
|---|---|---|
| InvalidStateTransitonException: Invalid event V_TASK_COMPLETED at SUCCEEDED | A late task completion event arrived after vertex completed (race) | Check task retry logic; add a no-op transition |
| Vertex stuck in INITIALIZING forever | Root input initializer never emitted events | Check InputInitializerEvents in log; cross-check initializer impl |
| Vertex transitions to FAILED but the failing task was killed externally | Bug in TaskAttemptImpl setting the wrong termination cause | See task-attempt-lifecycle.md |
| All tasks succeed but vertex stays in COMMITTING | Output committer hangs | Check committer for synchronous slow I/O; consider async |
| Recovery replays into RUNNING but tasks aren't relaunched | Missing recovery event for in-flight tasks | Look for VertexTaskStartEvent gaps in recovery log |
| V_KILL causes vertex to stay in TERMINATING with one task lingering | Container heartbeat timeout > kill deadline | Tune tez.task.timeout-ms |
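The first row's remedy, "add a no-op transition", amounts to declaring that an event is expected and harmless in a given state. A minimal sketch of the idea follows; LateEventSketch is a hypothetical class, only the SUCCEEDED state is modeled, and the choice of which events to ignore is an assumption for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch of a no-op transition for late events.
// Names are hypothetical; this is not VertexImpl code.
final class LateEventSketch {
  enum State { RUNNING, SUCCEEDED }

  enum Event { V_TASK_COMPLETED, V_START }

  // Events that may legitimately arrive after the vertex has finished.
  private static final Set<Event> IGNORED_WHEN_SUCCEEDED =
      EnumSet.of(Event.V_TASK_COMPLETED);

  static State handle(State current, Event event) {
    if (current == State.SUCCEEDED) {
      if (IGNORED_WHEN_SUCCEEDED.contains(event)) {
        return State.SUCCEEDED; // no-op: stay put, do nothing
      }
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    return current; // transitions for other states are elided in this sketch
  }
}
```

Listing ignorable events explicitly keeps the strictness of the state machine for everything else: a V_START at SUCCEEDED still fails loudly, while a late V_TASK_COMPLETED is absorbed.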

Validation: prove you understand this

  1. From memory, list all 10–11 VertexState values with a one-line meaning.
  2. Without running code, predict the next state for: (NEW, V_TERMINATE), (INITIALIZING, V_TERMINATE), (RUNNING, V_TASK_RESCHEDULED), (COMMITTING, V_TASK_RESCHEDULED). Verify against the source.
  3. Find the JIRA reference next to one transition you don't understand; read the JIRA; come back and explain why the transition exists.
  4. Write a unit test that drives a VertexImpl from NEW to SUCCEEDED using DrainDispatcher. (Use TestVertexImpl as a template.)
  5. Modify VertexImpl to add a no-op transition for some (state, event) pair currently absent; update TestVertexImpl in the same patch. Compile.