Level 3: Tez Architecture
This level gives you a working mental model of how all Tez components fit together. After completing it you will be able to trace any execution path — from API call to task output — through the code without getting lost. Architecture knowledge is what separates a contributor who fixes isolated bugs from one who can design improvements.
Learning Objectives
By the end of Level 3 you must be able to:
- Draw the Tez component topology from memory (Client → AM → RM → NM → Container)
- Trace a TezClient.submitDAG() call through four class boundaries to the first vertex start
- Explain the role of each of the four state machines and how they interact
- Describe what happens on each of the three communication channels between components
- Explain the Input-Processor-Output (IPO) model and how it relates to DAG edges
- Identify which Protocol Buffer message type carries a given piece of information
Component Topology
┌─────────────────────────────────────────────────────────────────────┐
│ Client JVM │
│ ┌─────────────┐ │
│ │ TezClient │──── submitDAG() ────────────────────────────────┐ │
│ └─────────────┘ │ │
└──────────────────────────────────────────────────────────────────┼──┘
│ DAGPlan (protobuf)
▼
┌─────────────────────────────────────────────────────────────────────┐
│ YARN ResourceManager │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ ApplicationMaster container (DAGAppMaster) │ │
│ │ │ │
│ │ ┌───────────┐ ┌────────────┐ ┌──────────┐ ┌─────────┐ │ │
│ │ │ DAGImpl │→ │ VertexImpl │→ │ TaskImpl │→ │ TaskAttemptImpl│ │
│ │ └───────────┘ └────────────┘ └──────────┘ └─────────┘ │ │
│ │ │ │ │ │
│ │ └──── events ──┘ │ │
│ │ │ │
│ │ ContainerLauncher ─── launches ──────────────────────────┐ │ │
│ └─────────────────────────────────────────────────────────┬─┘ │ │
└────────────────────────────────────────────────────────────┼───┼───┘
│ │
container req │ │ container
▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ YARN NodeManagers (one per worker node) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ TezChild (task container JVM) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ LogicalIOProcessorRuntimeTask │ │ │
│ │ │ Input(s) ─── Processor ─── Output(s) │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Communication Channels
| Channel | From → To | What travels |
|---|---|---|
| Client → AM | TezClient → DAGClientAMProtocol (IPC) | DAGPlan protobuf, GetDAGStatusRequest |
| AM → RM | RMCommunicator → YARN RM | Container requests, heartbeats, AM completion |
| AM → NM | ContainerLauncher → YARN NM | Container launch context, env, classpath, command |
| AM ↔ Container | TaskCommunicatorManager → TezTaskUmbilicalProtocol (IPC) | Task assignment, task status, event routing |
The Four State Machines
Tez execution is modeled as four nested state machines. Each tracks a specific level of granularity and sends events to the others.
DAGImpl State Machine
| State | Description |
|---|---|
| NEW | DAG created, not yet initialized |
| INITED | All vertices initialized, ready to start |
| RUNNING | At least one vertex is running |
| SUCCEEDED | All vertices succeeded |
| FAILED | At least one vertex failed (unrecoverable) |
| KILLED | AM received a kill request |
| ERROR | Internal AM error |
Key transition: NEW → INITED triggers VertexInitializedEvent for each vertex.
VertexImpl State Machine
This is the most complex state machine in Tez, with roughly 30 states and more than 80 transitions.
Core states (simplified):
| State | Description |
|---|---|
| NEW | Vertex created, not yet initialized |
| INITIALIZING | Waiting for inputs and vertex managers to initialize |
| INITED | Ready to schedule tasks |
| RUNNING | At least one task is running |
| COMMITTING | All tasks done, running output committers |
| SUCCEEDED | All tasks succeeded, all outputs committed |
| FAILED | Unrecoverable failure |
| RECOVERING | AM restarted, recovering state from history |
The VertexImpl state machine is defined by the StateMachineFactory at the top of
VertexImpl.java. Reading the factory definition gives you the complete transition table.
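To make the idea of a factory-declared transition table concrete, here is a minimal sketch in plain Java. It is a simplified model and not the Tez or Hadoop implementation: the class MiniVertexStateMachine, its reduced set of states, and the event names V_READY, V_ALL_TASKS_DONE, and V_ERROR are assumptions made up for illustration. Only V_INIT and V_START come from this guide.

```java
import java.util.EnumMap;
import java.util.Map;

/** Simplified, hypothetical model of a table-driven state machine; not Tez code. */
class MiniVertexStateMachine {
    enum State { NEW, INITIALIZING, INITED, RUNNING, SUCCEEDED, FAILED }

    // V_INIT and V_START appear in this guide; the other names are invented here.
    enum Event { V_INIT, V_READY, V_START, V_ALL_TASKS_DONE, V_ERROR }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    /** Registers one arc, loosely mirroring the shape of an addTransition() call. */
    MiniVertexStateMachine addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
        return this;
    }

    /** Applies an event and returns the new state; unregistered arcs are rejected. */
    State handle(Event event) {
        Map<Event, State> arcs = table.get(current);
        State next = (arcs == null) ? null : arcs.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    /** Happy path plus a single failure arc; the real table is far larger. */
    static MiniVertexStateMachine happyPath() {
        return new MiniVertexStateMachine()
            .addTransition(State.NEW, State.INITIALIZING, Event.V_INIT)
            .addTransition(State.INITIALIZING, State.INITED, Event.V_READY)
            .addTransition(State.INITED, State.RUNNING, Event.V_START)
            .addTransition(State.RUNNING, State.SUCCEEDED, Event.V_ALL_TASKS_DONE)
            .addTransition(State.RUNNING, State.FAILED, Event.V_ERROR);
    }
}
```

Because every legal arc is declared in one place, the chain of addTransition() calls doubles as the transition table. This is why reading the factory declaration first is more productive than reading the handlers.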
TaskImpl State Machine
Each vertex has N tasks (parallelism = N). TaskImpl tracks one task across its attempts.
| State | Description |
|---|---|
| NEW → SCHEDULED | Task created and placed in the scheduler queue |
| RUNNING | At least one attempt is running |
| SUCCEEDED | One attempt succeeded |
| FAILED | All attempts exhausted |
| KILLED | Task explicitly killed (e.g., pre-emption) |
TaskImpl manages the attempt retry logic: if attempt 1 fails, TaskImpl decides
whether to launch attempt 2 based on the failure mode and retry count configuration.
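The retry decision described above can be sketched as a small counter with a budget. This is a hypothetical model, not the TaskImpl code: the class TaskRetrySketch, the FailureKind enum, and the two-way split into retriable and fatal failures are assumptions for illustration. The maxFailedAttempts parameter stands in for whatever retry limit the configuration supplies.

```java
/** Hypothetical sketch of a task-level retry decision; not the TaskImpl implementation. */
class TaskRetrySketch {
    /** Assumed two-way classification of failure modes. */
    enum FailureKind { RETRIABLE, FATAL }

    private final int maxFailedAttempts; // assumed to come from configuration
    private int failedAttempts = 0;

    TaskRetrySketch(int maxFailedAttempts) {
        if (maxFailedAttempts < 1) {
            throw new IllegalArgumentException("maxFailedAttempts must be >= 1");
        }
        this.maxFailedAttempts = maxFailedAttempts;
    }

    /** Records one failed attempt; returns true if a new attempt should be launched. */
    boolean onAttemptFailed(FailureKind kind) {
        failedAttempts++;
        if (kind == FailureKind.FATAL) {
            return false; // the failure mode rules out a retry
        }
        return failedAttempts < maxFailedAttempts; // retry budget not yet exhausted
    }

    int failedAttempts() {
        return failedAttempts;
    }
}
```

With a budget of 3, the first two retriable failures each trigger a new attempt and the third does not, at which point the task would move to FAILED.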
TaskAttemptImpl State Machine
One actual container execution of a task.
| State | Description |
|---|---|
| NEW | Attempt created, awaiting container assignment |
| ASSIGNED | Container assigned by the scheduler |
| RUNNING | Container launched, task code executing |
| SUCCESS_FINISHING_CONTAINER | Task reported success, container cleanup in progress |
| SUCCEEDED | Attempt completed successfully |
| FAILED | Attempt failed (may or may not trigger task retry) |
| KILLED | Attempt pre-empted or killed by AM |
Event System
State machine transitions are driven by events. The event bus (AsyncDispatcher) routes
events from producers to the correct state machine.
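The following sketch shows the queue-and-worker pattern behind an asynchronous event bus. It is a minimal model under stated assumptions, not the AsyncDispatcher source: the class MiniDispatcher and its start(), post(), and stop() methods are hypothetical, and events are plain strings rather than typed event objects.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Minimal single-threaded event bus; a hypothetical model, not AsyncDispatcher itself. */
class MiniDispatcher {
    private static final Object STOP = new Object(); // sentinel that ends the loop
    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private final Consumer<String> handler;
    private final Thread worker = new Thread(this::drain, "mini-dispatcher");

    MiniDispatcher(Consumer<String> handler) {
        this.handler = handler;
    }

    void start() {
        worker.start();
    }

    /** Enqueues the event and returns at once; no handler has run yet. */
    void post(String event) {
        queue.add(event);
    }

    /** Dispatcher thread: takes events in FIFO order and invokes the handler. */
    private void drain() {
        try {
            for (Object e = queue.take(); e != STOP; e = queue.take()) {
                handler.accept((String) e);
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    /** Processes everything already queued, then stops the dispatcher thread. */
    void stop() {
        queue.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that post() only enqueues. The handler, and with it any state transition, runs later on the dispatcher thread, so code that posts an event cannot assume the target state machine has already moved.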
Key Event Types
| Event Type | Producer | Consumer |
|---|---|---|
| DAGEventType.DAG_INIT | DAGAppMaster | DAGImpl |
| VertexEventType.V_INIT | DAGImpl | VertexImpl |
| VertexEventType.V_START | DAGImpl | VertexImpl |
| TaskEventType.T_SCHEDULE | VertexImpl | TaskImpl |
| TaskAttemptEventType.TA_ASSIGNED | TaskScheduler | TaskAttemptImpl |
| TaskAttemptEventType.TA_DONE | TezTaskUmbilicalProtocol (container callback) | TaskAttemptImpl |
| VertexEventType.V_TASK_COMPLETED | TaskImpl | VertexImpl |
| DAGEventType.DAG_VERTEX_COMPLETED | VertexImpl | DAGImpl |
The event flow for a normal task success:
Container reports TA_DONE
→ TaskAttemptImpl: RUNNING → SUCCEEDED
→ sends T_ATTEMPT_SUCCEEDED to TaskImpl
→ TaskImpl: RUNNING → SUCCEEDED
→ sends V_TASK_COMPLETED to VertexImpl
→ VertexImpl checks: all tasks done?
→ if yes: sends DAG_VERTEX_COMPLETED to DAGImpl
→ DAGImpl checks: all vertices done?
→ if yes: DAG transitions to SUCCEEDED
Every state transition in this chain corresponds to a log line you will see in the AM logs.
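The cascade above amounts to two levels of countdown: tasks remaining per vertex, and vertices remaining per DAG. The sketch below models only that bookkeeping. It is hypothetical and not Tez code: the class CompletionCascadeSketch is invented here, it assumes every vertex has at least one task, and it assumes each task reports success exactly once.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bookkeeping model of the success cascade; not Tez code. */
class CompletionCascadeSketch {
    private final int[] tasksRemaining; // indexed by vertex
    private int verticesRemaining;
    private final List<String> trace = new ArrayList<>();

    /** One argument per vertex, giving that vertex's task count (assumed >= 1). */
    CompletionCascadeSketch(int... tasksPerVertex) {
        this.tasksRemaining = tasksPerVertex.clone();
        this.verticesRemaining = tasksPerVertex.length;
    }

    /** Call once per task whose attempt reported success. */
    void onAttemptSucceeded(int vertex) {
        trace.add("T_ATTEMPT_SUCCEEDED vertex=" + vertex); // attempt -> task
        trace.add("V_TASK_COMPLETED vertex=" + vertex);    // task -> vertex
        tasksRemaining[vertex]--;
        if (tasksRemaining[vertex] == 0) {                  // all tasks done?
            trace.add("DAG_VERTEX_COMPLETED vertex=" + vertex); // vertex -> DAG
            verticesRemaining--;
        }
    }

    /** True once every vertex has reported completion. */
    boolean dagSucceeded() {
        return verticesRemaining == 0;
    }

    List<String> trace() {
        return trace;
    }
}
```

For a DAG with two vertices of 2 and 1 tasks, the DAG succeeds only after the third success report, and the recorded trace has the same shape as the event chain you would look for in the AM logs.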
Protocol Buffers
All cross-process data in Tez is serialized with Protocol Buffers (newer releases build against the protobuf 3.x library).
| Proto file | Location | Key messages |
|---|---|---|
| DAGApiRecords.proto | tez-api/src/main/proto/ | DAGPlan, VertexPlan, EdgePlan |
| DAGIo.proto | tez-api/src/main/proto/ | RootInputLeafOutputProto, EntityDescriptorProto |
| HistoryProtos.proto | tez-dag/src/main/proto/ | All timeline/history event types |
| Events.proto | tez-runtime-internals/src/main/proto/ | Task-level events (DataMovementEvent, etc.) |
The DAGPlan message is what TezClient sends to the AM. It contains the complete
description of the DAG: vertices, edges, processor descriptors, I/O configurations, and
edge properties. It is generated from the DAG API object.
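// On the AM side, the submitDAG RPC handler reads the plan out of the request
// (the response returned to the client carries only the DAG ID string):
DAGPlan dagPlan = submitDAGRequest.getDAGPlan();
// The AM then builds a DAGImpl, with its vertices and edges, from this plan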
Input-Processor-Output (IPO) Model
Each task runs a single AbstractProcessor. The processor has access to named Input and
Output instances, which are determined by the edges in the DAG.
┌──────────────────────────────────────────────────────────────────┐
│ Task container │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ LogicalIOProcessorRuntimeTask │ │
│ │ │ │
│ │ Inputs: Outputs: │ │
│ │ ┌──────────────────┐ ┌──────────────────────┐ │ │
│ │ │ OrderedGrouped │ │ OrderedPartitioned │ │ │
│ │ │ KVInput │──┐ ┌──│ KVOutput │ │ │
│ │ └──────────────────┘ │ │ └──────────────────────┘ │ │
│ │ ▼ │ │ │
│ │ ┌───────────────┐ │ │
│ │ │ MyProcessor │ │ │
│ │ │ extends │ │ │
│ │ │ AbstractProcessor │ │
│ │ └───────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
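As a rough illustration of the IPO contract, the sketch below wires one named input and one named output to a processor. It is a simplified model, not the Tez runtime API: the interfaces Input, Output, and Processor, the class UppercaseProcessor, and the vertex names "upstream" and "downstream" are all hypothetical stand-ins. The one idea carried over is that a processor receives its inputs and outputs as maps keyed by name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, dependency-free model of the Input-Processor-Output contract. */
class IpoSketch {
    interface Input { Iterable<String> read(); }
    interface Output { void write(String record); }
    interface Processor {
        void run(Map<String, Input> inputs, Map<String, Output> outputs);
    }

    /** Reads every record from one named input and writes it to one named output. */
    static class UppercaseProcessor implements Processor {
        @Override
        public void run(Map<String, Input> inputs, Map<String, Output> outputs) {
            Output out = outputs.get("downstream");
            for (String record : inputs.get("upstream").read()) {
                out.write(record.toUpperCase(Locale.ROOT));
            }
        }
    }

    /** Plays the role of the task runner: wires up I/O, then runs the processor. */
    static List<String> runTask(List<String> records) {
        List<String> sink = new ArrayList<>();
        Map<String, Input> inputs = new HashMap<>();
        inputs.put("upstream", () -> records);   // one input per incoming edge
        Map<String, Output> outputs = new HashMap<>();
        outputs.put("downstream", sink::add);    // one output per outgoing edge
        new UppercaseProcessor().run(inputs, outputs);
        return sink;
    }
}
```

The processor never learns how records arrive or where they go. Swapping the edge type changes which Input and Output implementations are placed in the maps, not the processor code.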
Edge Property Types
The EdgeProperty in the DAG API determines what I/O classes are used between two vertices.
| DataMovementType | Meaning | Default I/O pair |
|---|---|---|
| SCATTER_GATHER | Partitioned, sorted shuffle | OrderedPartitionedKVOutput → OrderedGroupedKVInput |
| BROADCAST | All output sent to all downstream tasks | UnorderedKVOutput → UnorderedKVInput |
| ONE_TO_ONE | Task i → Task i, no shuffle | UnorderedKVOutput → UnorderedKVInput |
| CUSTOM | User-defined routing | User-provided EdgeManagerPlugin |
SCATTER_GATHER corresponds to the classic MapReduce shuffle. BROADCAST is used for joins
where one side is small enough to replicate to all tasks.
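The routing rule behind each movement type can be written as one small function. This is a sketch of the semantics in the table, not the Tez edge manager code: the class EdgeRoutingSketch and the destinations() helper are hypothetical, and the enum here is a local copy rather than the Tez type. It assumes that under SCATTER_GATHER partition p is consumed by downstream task p.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of which downstream tasks receive a piece of output. */
class EdgeRoutingSketch {
    enum DataMovementType { SCATTER_GATHER, BROADCAST, ONE_TO_ONE, CUSTOM }

    /**
     * Returns the downstream task indices that consume the given partition
     * produced by the given source task.
     */
    static List<Integer> destinations(DataMovementType type, int sourceTask,
                                      int partition, int numDestTasks) {
        switch (type) {
            case SCATTER_GATHER:
                // Partition p of every source task goes to downstream task p.
                return Collections.singletonList(partition);
            case ONE_TO_ONE:
                // Task i feeds task i; no shuffle.
                return Collections.singletonList(sourceTask);
            case BROADCAST: {
                // Every downstream task receives the full output.
                List<Integer> all = new ArrayList<>();
                for (int i = 0; i < numDestTasks; i++) {
                    all.add(i);
                }
                return all;
            }
            default:
                throw new UnsupportedOperationException(
                    "CUSTOM routing is defined by the user's plugin");
        }
    }
}
```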
DataMovementEvent
When a task output is ready, it sends a DataMovementEvent through the umbilical to the AM.
The AM routes it to the downstream tasks so their input knows which partition to fetch.
This event routing is the mechanism by which OrderedGroupedKVInput discovers where each
upstream partition is located — it receives DataMovementEvents from the AM containing
the shuffle server address and partition index.
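The discovery step described above can be modeled as a downstream input that collects one location per upstream task. This is a hypothetical sketch, not the Tez classes: MovementEvent flattens the information into plain fields (an assumption about layout made only for readability), and PartitionInput is an invented stand-in for a shuffle input that owns a single partition.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of how an input learns where its partition lives. */
class ShuffleDiscoverySketch {
    /** Flattened stand-in for the information a data movement event carries. */
    static final class MovementEvent {
        final int sourceTask;
        final int partition;
        final String host;
        final int port;

        MovementEvent(int sourceTask, int partition, String host, int port) {
            this.sourceTask = sourceTask;
            this.partition = partition;
            this.host = host;
            this.port = port;
        }
    }

    /** Downstream input that owns one partition and records where to fetch it. */
    static final class PartitionInput {
        private final int partition;
        private final int numSourceTasks;
        private final Map<Integer, String> locations = new TreeMap<>();

        PartitionInput(int partition, int numSourceTasks) {
            this.partition = partition;
            this.numSourceTasks = numSourceTasks;
        }

        /** Called for each routed event; events for other partitions are ignored. */
        void onEvent(MovementEvent e) {
            if (e.partition == partition) {
                locations.put(e.sourceTask, e.host + ":" + e.port);
            }
        }

        /** True once every upstream task has announced a location. */
        boolean allSourcesKnown() {
            return locations.size() == numSourceTasks;
        }

        Map<Integer, String> locations() {
            return locations;
        }
    }
}
```

The input cannot start a complete fetch plan until it has heard from every upstream task, which is why a lost or misrouted event shows up as a downstream task that waits indefinitely.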
Required Reading
| # | Resource | What to extract |
|---|---|---|
| 1 | tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | The StateMachineFactory declaration — read all addTransition() calls |
| 2 | tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | The createDag() method — how DAGPlan becomes state machine objects |
| 3 | tez-api/src/main/proto/DAGApiRecords.proto | DAGPlan and VertexPlan message definitions |
| 4 | tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | The serviceStart() method — component initialization order |
| 5 | tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java | How inputs, processors, and outputs are initialized in a container |
Key Classes Quick Reference
| Class | Module | Role |
|---|---|---|
| DAGAppMaster | tez-dag | AM main class; manages all components; starts the event dispatcher |
| DAGImpl | tez-dag | DAG state machine; tracks vertex completion; manages history |
| VertexImpl | tez-dag | Vertex state machine; manages task scheduling; calls VertexManager |
| TaskImpl | tez-dag | Task state machine; manages attempt lifecycle and retry logic |
| TaskAttemptImpl | tez-dag | TaskAttempt state machine; coordinates container assignment |
| AsyncDispatcher | tez-dag (via Hadoop) | Event bus; routes events to state machines asynchronously |
| TezTaskUmbilicalProtocol | tez-runtime-internals | IPC interface between container and AM |
| TezChild | tez-runtime-internals | Container main class; receives task assignment; runs the task |
| LogicalIOProcessorRuntimeTask | tez-runtime-internals | In-container task runner; sets up IPO |
| TezClient | tez-api | Client API; creates TezSession; submits DAGs |
JIRA Categories for Level 3
Having read the architecture, you can now evaluate:
- Architecture improvement JIRAs — proposals to change how components interact
- State machine correctness bugs — transitions that lead to wrong states
- Event routing issues — events that are lost or sent to wrong consumers
- Container reuse improvements — how tasks are assigned to existing containers
You are still not ready to submit fixes for state machine bugs — those require Level 4. But you can now read these issues intelligently and leave informed comments.
Deliverables
- Draw the component topology diagram from memory (no looking)
- Trace TezClient.submitDAG() to the VertexImpl V_START event through class names
- Identify the state machines and their event types from code (not from this page)
- Explain in your own words what DataMovementEvent does and why it exists
- Lab 3.1 completed: DAG submission trace documented
- Lab 3.2 completed: IPO abstraction walkthrough complete
Common Mistakes
| Mistake | Impact | Correct understanding |
|---|---|---|
| Thinking the AM runs tasks directly | Leads to wrong mental model of container lifecycle | Tasks run in separate JVMs (containers); AM only schedules and monitors |
| Confusing VertexImpl with Vertex (API) | Vertex is the builder; VertexImpl is the runtime state machine | They are in different modules (tez-api vs tez-dag) |
| Thinking AsyncDispatcher is synchronous | Events are queued; transitions happen on the dispatcher thread | Never assume a transition is immediate after an event is posted |
| Reading VertexImpl top-to-bottom | The class is 6000+ lines; reading linearly is unproductive | Start with the StateMachineFactory declaration, then follow individual transitions |