Deep Dives: Reading Order
This directory contains 21 deep-dive chapters, the reference material behind the Level curriculum. Each chapter covers one topic end to end, but most assume a handful of earlier ones. Read them in the order below the first time through; after that, use this page as a lookup index.
The chapters are grouped by subsystem. Each table lists, per chapter:
- File: the chapter's file name.
- Summary: one line on what you should walk away knowing.
- Consumed by: which Levels and Labs depend on it.
Group 1 — The DAG Model and the Client
These four chapters define "what is a Tez job" before any execution machinery exists.
| # | File | Summary | Consumed by |
|---|---|---|---|
| 1 | dag-model.md | DAG/Vertex/Edge as immutable plan; DAGPlan protobuf; validation rules | Level 1 (all labs); Level 2 lab 2.1 |
| 2 | logical-physical.md | How the logical DAG becomes a physical execution plan with concrete parallelism | Level 4 lab 4.2; Level 5 lab 5.1 |
| 3 | tez-client.md | Client-side bring-up: session mode, local resources, AM start, submission RPC | Level 3 lab 3.1; Level 7 lab 7.1 |
| 4 | dag-client.md | Status polling, kill, error reporting; RPC vs ATS backends | Level 3 lab 3.1; Level 8 lab 8.1 |
Start here. Without the DAG model in your head, every later chapter feels like trivia.
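To make "DAG as immutable plan" concrete before opening chapter 1, here is a minimal, language-neutral sketch in Python. It is not the Tez API: the `Vertex`, `Edge`, and `Dag` classes, the `parallelism` field, and the `validate` function are hypothetical names invented for illustration, and the checks shown (unique vertex names, edges that reference known vertices, no cycles) are assumed examples of validation rules rather than the list in dag-model.md.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Vertex:
    name: str
    parallelism: int  # hypothetical field: number of tasks for this vertex


@dataclass(frozen=True)
class Edge:
    source: str  # name of the producing vertex
    target: str  # name of the consuming vertex


@dataclass(frozen=True)
class Dag:
    name: str
    vertices: Tuple[Vertex, ...]
    edges: Tuple[Edge, ...]


def validate(dag: Dag) -> List[str]:
    """Return vertex names in topological order, or raise ValueError."""
    names = [v.name for v in dag.vertices]
    if len(set(names)) != len(names):
        raise ValueError("duplicate vertex name")

    indegree: Dict[str, int] = {n: 0 for n in names}
    downstream: Dict[str, List[str]] = {n: [] for n in names}
    for e in dag.edges:
        if e.source not in indegree or e.target not in indegree:
            raise ValueError(f"edge {e.source}->{e.target} references an unknown vertex")
        downstream[e.source].append(e.target)
        indegree[e.target] += 1

    # Kahn's algorithm: repeatedly emit vertices with no unprocessed inputs.
    ready = [n for n in names if indegree[n] == 0]
    order: List[str] = []
    while ready:
        current = ready.pop(0)
        order.append(current)
        for nxt in downstream[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Any vertex left unemitted sits on a cycle.
    if len(order) != len(names):
        raise ValueError("cycle detected")
    return order
```

For a two-vertex plan with one edge from `tokenizer` to `summer`, `validate` returns `["tokenizer", "summer"]`. Because the dataclasses are frozen, assigning to a field after construction raises an error, which is the property "immutable plan" refers to in this sketch.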
Group 2 — AM Lifecycle and Dispatch
| # | File | Summary | Consumed by |
|---|---|---|---|
| 5 | dag-app-master.md | AM as YARN application; dispatchers, heartbeats, recovery | Level 3 lab 3.2; Level 8 lab 8.2 |
| 6 | state-machines.md | Hadoop StateMachineFactory API; dispatcher invariants; tests | Level 4 labs 4.1, 4.3, 4.4 |
| 7 | event-routing.md | The event hierarchy; "events are the only mutation API" rule | Level 4 (all labs) |
These chapters explain how the AM mutates state. They must precede the per-entity lifecycle chapters that follow.
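The "events are the only mutation API" rule is easier to hold onto with a toy model. The sketch below is a deliberately reduced illustration in Python, not Hadoop's StateMachineFactory and not Tez code: the `State` and `EventType` enums, the `TRANSITIONS` table, and the `Entity` and `Dispatcher` classes are all hypothetical, and the state set is far smaller than any real lifecycle in chapters 8 to 10.

```python
from enum import Enum, auto
from queue import SimpleQueue
from typing import Dict, Tuple


class State(Enum):
    NEW = auto()
    RUNNING = auto()
    SUCCEEDED = auto()
    FAILED = auto()


class EventType(Enum):
    START = auto()
    COMPLETED = auto()
    ERROR = auto()


# Transition table: (current state, event type) -> next state.
TRANSITIONS: Dict[Tuple[State, EventType], State] = {
    (State.NEW, EventType.START): State.RUNNING,
    (State.RUNNING, EventType.COMPLETED): State.SUCCEEDED,
    (State.RUNNING, EventType.ERROR): State.FAILED,
}


class Entity:
    """Hypothetical entity whose state changes only inside handle()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._state = State.NEW

    @property
    def state(self) -> State:
        return self._state  # read-only view; there is no setter

    def handle(self, event_type: EventType) -> None:
        key = (self._state, event_type)
        if key not in TRANSITIONS:
            raise ValueError(
                f"invalid event {event_type.name} in state {self._state.name}"
            )
        self._state = TRANSITIONS[key]


class Dispatcher:
    """One queue drained by one caller, so transitions never interleave."""

    def __init__(self) -> None:
        self._queue: SimpleQueue = SimpleQueue()

    def post(self, entity: Entity, event_type: EventType) -> None:
        self._queue.put((entity, event_type))

    def drain(self) -> None:
        while not self._queue.empty():
            entity, event_type = self._queue.get()
            entity.handle(event_type)
```

Posting an event does nothing until the dispatcher drains its queue, and an event that has no entry in the table for the current state is rejected instead of silently changing state. Those two properties are the ones to look for when reading the real dispatcher and state-machine chapters.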
Group 3 — Per-Entity Lifecycle
| # | File | Summary | Consumed by |
|---|---|---|---|
| 8 | vertex-lifecycle.md | VertexImpl state machine: NEW → SUCCEEDED, plus failure/kill paths | Level 4 lab 4.2 |
| 9 | task-lifecycle.md | TaskImpl state machine; speculation; max-failed-attempts | Level 4 lab 4.3 |
| 10 | task-attempt-lifecycle.md | TaskAttemptImpl state machine; container assignment; termination causes | Level 4 lab 4.4; Level 8 lab 8.2 |
Read chapters 8, 9, and 10 in that order. Each refers back to events from chapter 7 and state-machine primitives from chapter 6.
Group 4 — Input/Processor/Output
| # | File | Summary | Consumed by |
|---|---|---|---|
| 11 | ipo-abstractions.md | LogicalInput/LogicalOutput/Processor; lifecycle methods; merged inputs | Level 5 lab 5.1; Level 7 lab 7.1 |
| 12 | tez-runtime.md | TezTaskRunner2, LogicalIOProcessorRuntimeTask, the umbilical | Level 5 lab 5.1 |
These chapters cover code in the tez-runtime-internals and tez-runtime-library modules, which runs inside the JVM where the task actually executes.
Group 5 — Shuffle, Sort, and Counters
| # | File | Summary | Consumed by |
|---|---|---|---|
| 13 | shuffle-sort.md | Sorter implementations, IFile, ShuffleManager, Fetcher, MergeManager | Level 5 labs 5.2, 5.3 |
| 14 | counters-diagnostics.md | TezCounters, framework counters, custom counters, ATS publication | Level 8 lab 8.1 |
Do not attempt to debug shuffle issues in production if you have skipped chapter 13. Read it in full before opening a fetcher-related JIRA.
Group 6 — Scheduling and Resources
| # | File | Summary | Consumed by |
|---|---|---|---|
| 15 | scheduler.md | TaskSchedulerManager, YarnTaskSchedulerService, AMRM heartbeats | Level 6 lab 6.2 |
| 16 | container-reuse.md | AMContainerImpl lifecycle; reuse policy; idle timeouts | Level 6 labs 6.1, 6.2 |
| 17 | yarn-integration.md | YARN tokens, AMRM client, app master failover, log aggregation | Level 6 lab 6.2 |
Group 7 — Modes and Integrations
| # | File | Summary | Consumed by |
|---|---|---|---|
| 18 | local-mode.md | LocalContainerLauncher, debugging without YARN | Level 2 labs |
| 19 | hive-integration.md | Hive TezTask, edge usage, DynamicPartitionPruning, ATS spans | Level 7 (Hive labs h1–h6) |
Group 8 — Failure, Recovery, and Testing
| # | File | Summary | Consumed by |
|---|---|---|---|
| 20 | failure-handling.md | Task retry, vertex rerun, AM restart, recovery records | Level 8 lab 8.2 |
| 21 | testing-framework.md | MiniTezCluster, MockContainerLauncher, DrainDispatcher, fault injection | Level 2 labs; Level 4 labs |
A note on order vs index
The deep-dives are an index: they exist to be looked up later. The first read should follow the group order above. When you return to fix a bug, jump directly to the most relevant chapter and follow the cross-references inside it.
Every chapter ends with "Validation: prove you understand this section." Treat that as the gate before declaring the chapter "read."