Level 3: Tez Architecture

This level gives you a working mental model of how all Tez components fit together. After completing it you will be able to trace any execution path — from API call to task output — through the code without getting lost. Architecture knowledge is what separates a contributor who fixes isolated bugs from one who can design improvements.


Learning Objectives

By the end of Level 3 you must be able to:

  1. Draw the Tez component topology from memory (Client → AM → RM → NM → Container)
  2. Trace a DAG.submit() call through four class boundaries to the first vertex start
  3. Explain the role of each of the four state machines and how they interact
  4. Describe what happens on each of the three communication channels between components
  5. Explain the Input-Processor-Output (IPO) model and how it relates to DAG edges
  6. Identify which Protocol Buffer message type carries a given piece of information

Component Topology

┌─────────────────────────────────────────────────────────────────────┐
│  Client JVM                                                         │
│  ┌─────────────┐                                                    │
│  │  TezClient  │──── submitDAG() ────────────────────────────────┐  │
│  └─────────────┘                                                 │  │
└──────────────────────────────────────────────────────────────────┼──┘
                                                                   │ DAGPlan (protobuf)
                                                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│  YARN ResourceManager                                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  ApplicationMaster container (DAGAppMaster)                  │   │
│  │                                                              │   │
│  │  ┌───────────┐  ┌────────────┐  ┌──────────┐  ┌─────────┐  │   │
│  │  │  DAGImpl  │→ │ VertexImpl │→ │ TaskImpl │→ │ TaskAttemptImpl│  │
│  │  └───────────┘  └────────────┘  └──────────┘  └─────────┘  │   │
│  │         │              │                                     │   │
│  │         └──── events ──┘                                    │   │
│  │                                                              │   │
│  │  ContainerLauncher ─── launches ──────────────────────────┐ │   │
│  └─────────────────────────────────────────────────────────┬─┘ │   │
└────────────────────────────────────────────────────────────┼───┼───┘
                                                             │   │
                                              container req  │   │ container
                                                             ▼   ▼
┌─────────────────────────────────────────────────────────────────────┐
│  YARN NodeManagers (one per worker node)                            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  TezChild (task container JVM)                                │  │
│  │  ┌────────────────────────────────────────────────────────┐  │  │
│  │  │  LogicalIOProcessorRuntimeTask                         │  │  │
│  │  │   Input(s) ─── Processor ─── Output(s)                │  │  │
│  │  └────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Communication Channels

ChannelFrom → ToWhat travels
Client → AMTezClientDAGClientAMProtocol (IPC)DAGPlan protobuf, GetDAGStatusRequest
AM → RMRMCommunicator → YARN RMContainer requests, heartbeats, AM completion
AM → NMContainerLauncher → YARN NMContainer launch context, env, classpath, command
AM ↔ ContainerTaskCommunicatorManagerTezTaskUmbilicalProtocol (IPC)Task assignment, task status, event routing

The Four State Machines

Tez execution is modeled as four nested state machines. Each tracks a specific level of granularity and sends events to the others.

DAGImpl State Machine

StateDescription
NEWDAG created, not yet initialized
INITEDAll vertices initialized, ready to start
RUNNINGAt least one vertex is running
SUCCEEDEDAll vertices succeeded
FAILEDAt least one vertex failed (unrecoverable)
KILLEDAM received a kill request
ERRORInternal AM error

Key transition: NEW → INITED triggers VertexInitializedEvent for each vertex.

VertexImpl State Machine

The most complex state machine in Tez. Has ~30 states and 80+ transitions.

Core states (simplified):

StateDescription
NEWVertex created, not yet initialized
INITIALIZINGWaiting for inputs and vertex managers to initialize
INITEDReady to schedule tasks
RUNNINGAt least one task is running
COMMITTINGAll tasks done, running output committers
SUCCEEDEDAll tasks succeeded, all outputs committed
FAILEDUnrecoverable failure
RECOVERINGAM restarted, recovering state from history

The VertexImpl state machine is defined by the StateMachineFactory at the top of VertexImpl.java. Reading the factory definition gives you the complete transition table.

TaskImpl State Machine

Each vertex has N tasks (parallelism = N). TaskImpl tracks one task across its attempts.

StateDescription
NEWSCHEDULEDTask created and placed in the scheduler queue
RUNNINGAt least one attempt is running
SUCCEEDEDOne attempt succeeded
FAILEDAll attempts exhausted
KILLEDTask explicitly killed (e.g., pre-emption)

TaskImpl manages the attempt retry logic: if attempt 1 fails, TaskImpl decides whether to launch attempt 2 based on the failure mode and retry count configuration.

TaskAttemptImpl State Machine

One actual container execution of a task.

StateDescription
NEWAttempt created, awaiting container assignment
ASSIGNEDContainer assigned by the scheduler
RUNNINGContainer launched, task code executing
SUCCESS_FINISHING_CONTAINERTask reported success, container cleanup in progress
SUCCEEDEDAttempt completed successfully
FAILEDAttempt failed (may or may not trigger task retry)
KILLEDAttempt pre-empted or killed by AM

Event System

State machine transitions are driven by events. The event bus (AsyncDispatcher) routes events from producers to the correct state machine.

Key Event Types

Event TypeProducerConsumer
DAGEventType.DAG_INITDAGAppMasterDAGImpl
VertexEventType.V_INITDAGImplVertexImpl
VertexEventType.V_STARTDAGImplVertexImpl
TaskEventType.T_SCHEDULEVertexImplTaskImpl
TaskAttemptEventType.TA_ASSIGNEDTaskSchedulerTaskAttemptImpl
TaskAttemptEventType.TA_DONETezTaskUmbilicalProtocol (container callback)TaskAttemptImpl
VertexEventType.V_TASK_COMPLETEDTaskImplVertexImpl
DAGEventType.DAG_VERTEX_COMPLETEDVertexImplDAGImpl

The event flow for a normal task success:

Container reports TA_DONE
  → TaskAttemptImpl: RUNNING → SUCCEEDED
  → sends T_ATTEMPT_SUCCEEDED to TaskImpl
    → TaskImpl: RUNNING → SUCCEEDED
    → sends V_TASK_COMPLETED to VertexImpl
      → VertexImpl checks: all tasks done?
        → if yes: sends DAG_VERTEX_COMPLETED to DAGImpl
          → DAGImpl checks: all vertices done?
            → if yes: DAG transitions to SUCCEEDED

Every state transition in this chain corresponds to a log line you will see in the AM logs.


Protocol Buffers

All cross-process data in Tez is serialized with Protocol Buffers (proto3 in newer versions).

Proto fileLocationKey messages
DAGApiRecords.prototez-api/src/main/proto/DAGPlan, VertexPlan, EdgePlan
DAGIo.prototez-api/src/main/proto/RootInputLeafOutputProto, EntityDescriptorProto
HistoryProtos.prototez-dag/src/main/proto/All timeline/history event types
Events.prototez-runtime-internals/src/main/proto/Task-level events (DataMovementEvent, etc.)

The DAGPlan message is what TezClient sends to the AM. It contains the complete description of the DAG: vertices, edges, processor descriptors, I/O configurations, and edge properties. It is generated from the DAG API object.

// In DAGImpl.java, the plan is received and deserialized:
DAGPlan dagPlan = clientAMProtocol.submitDAG(submitDAGRequest).getDagId();
// Plan is then converted to DAGImpl state

Input-Processor-Output (IPO) Model

Each task runs a single AbstractProcessor. The processor has access to named Input and Output instances, which are determined by the edges in the DAG.

┌──────────────────────────────────────────────────────────────────┐
│  Task container                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  LogicalIOProcessorRuntimeTask                          │    │
│  │                                                         │    │
│  │  Inputs:                    Outputs:                    │    │
│  │  ┌──────────────────┐       ┌──────────────────────┐   │    │
│  │  │ OrderedGrouped   │       │ OrderedPartitioned   │   │    │
│  │  │ KVInput          │──┐ ┌──│ KVOutput             │   │    │
│  │  └──────────────────┘  │ │  └──────────────────────┘   │    │
│  │                        ▼ │                              │    │
│  │               ┌───────────────┐                         │    │
│  │               │  MyProcessor  │                         │    │
│  │               │  extends      │                         │    │
│  │               │  AbstractProcessor                      │    │
│  │               └───────────────┘                         │    │
│  └─────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘

Edge Property Types

The EdgeProperty in the DAG API determines what I/O classes are used between two vertices.

DataMovementTypeMeaningDefault I/O pair
SCATTER_GATHERPartitioned, sorted shuffleOrderedPartitionedKVOutputOrderedGroupedKVInput
BROADCASTAll output sent to all downstream tasksUnorderedKVOutputUnorderedKVInput
ONE_TO_ONETask i → Task i, no shuffleUnorderedKVOutputUnorderedKVInput
CUSTOMUser-defined routingUser-provided EdgeManagerPlugin

SCATTER_GATHER corresponds to the classic MapReduce shuffle. BROADCAST is used for joins where one side is small enough to replicate to all tasks.

DataMovementEvent

When a task output is ready, it sends a DataMovementEvent through the umbilical to the AM. The AM routes it to the downstream tasks so their input knows which partition to fetch.

This event routing is the mechanism by which OrderedGroupedKVInput discovers where each upstream partition is located — it receives DataMovementEvents from the AM containing the shuffle server address and partition index.


Required Reading

#ResourceWhat to extract
1tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.javaThe StateMachineFactory declaration — read all addTransition() calls
2tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.javaThe createDag() method — how DAGPlan becomes state machine objects
3tez-api/src/main/proto/DAGApiRecords.protoDAGPlan and VertexPlan message definitions
4tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.javaThe serviceStart() method — component initialization order
5tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.javaHow inputs, processors, and outputs are initialized in a container

Key Classes Quick Reference

ClassModuleRole
DAGAppMastertez-dagAM main class; manages all components; starts the event dispatcher
DAGImpltez-dagDAG state machine; tracks vertex completion; manages history
VertexImpltez-dagVertex state machine; manages task scheduling; calls VertexManager
TaskImpltez-dagTask state machine; manages attempt lifecycle and retry logic
TaskAttemptImpltez-dagTaskAttempt state machine; coordinates container assignment
AsyncDispatchertez-dag (via Hadoop)Event bus; routes events to state machines asynchronously
TezTaskUmbilicalProtocoltez-runtime-internalsIPC interface between container and AM
TezChildtez-dagContainer main class; receives task assignment; runs the task
LogicalIOProcessorRuntimeTasktez-runtime-internalsIn-container task runner; sets up IPO
TezClienttez-apiClient API; creates TezSession; submits DAGs

JIRA Categories for Level 3

Having read the architecture, you can now evaluate:

  • Architecture improvement JIRAs — proposals to change how components interact
  • State machine correctness bugs — transitions that lead to wrong states
  • Event routing issues — events that are lost or sent to wrong consumers
  • Container reuse improvements — how tasks are assigned to existing containers

You are still not ready to submit fixes for state machine bugs — those require Level 4. But you can now read these issues intelligently and leave informed comments.


Deliverables

  • Draw the component topology diagram from memory (no looking)
  • Trace TezClient.submitDAG() to VertexImpl V_START event through class names
  • Identify the state machines and their event types from code (not from this page)
  • Explain in your own words what DataMovementEvent does and why it exists
  • Lab 3.1 completed: DAG submission trace documented
  • Lab 3.2 completed: IPO abstraction walkthrough complete

Common Mistakes

MistakeImpactCorrect understanding
Thinking the AM runs tasks directlyLeads to wrong mental model of container lifecycleTasks run in separate JVMs (containers); AM only schedules and monitors
Confusing VertexImpl with Vertex (API)Vertex is the builder; VertexImpl is the runtime state machineThey are in different modules (tez-api vs tez-dag)
Thinking AsyncDispatcher is synchronousEvents are queued; transitions happen on the dispatcher threadNever assume a transition is immediate after an event is posted
Reading VertexImpl top-to-bottomThe class is 6000+ lines; reading linearly is unproductiveStart with the StateMachineFactory declaration, then follow individual transitions