Open-Source Engineer & Contributor
A collection of deep, implementation-level curricula for engineers who want to contribute seriously to major open-source projects — not just fix typos, but build the kind of sustained understanding that leads to committer status.
Each curriculum is designed around how the project is actually developed, tested, reviewed, and maintained by its core contributors. Labs reference real source code, real issue trackers, and real contribution workflows.
Curricula
| Project | Focus | Status |
|---|---|---|
| Apache Tez | DAG execution engine on YARN — used by Hive, Pig, and custom batch pipelines | Active |
| Apache Kafka | Distributed log — producers, consumers, brokers, replication, Streams API | Planned |
| Apache Flink | Streaming and batch — state machines, checkpointing, watermarks, operators | Planned |
| Apache Spark | Unified analytics — scheduler, shuffle, RDD lineage, SQL planning | Planned |
| Apache Hadoop | HDFS, YARN, MapReduce — the foundation layer for everything above | Planned |
How to Use This Book
Each curriculum is self-contained. Start at the curriculum's Introduction page and work through its levels sequentially. Levels build on each other — skipping levels skips foundations that later labs depend on.
What you will need for any curriculum:
- 3+ years of Java (or the project's primary language) on production-grade codebases
- Comfort reading large, unfamiliar codebases without a guide
- Git, a build tool (Maven / Gradle / sbt), and an IDE (IntelliJ recommended)
- Patience: the path from contributor to committer is measured in months to years
Select a curriculum from the table above or from the sidebar to begin.
Apache Tez Open-Source Contributor Curriculum
Welcome to the Apache Tez Open-Source Contributor Curriculum — a complete, implementation-heavy roadmap for engineers who want to become serious Apache Tez contributors and eventually operate at the level of a core contributor, committer, or PMC-aware engineer.
What This Curriculum Is
This is not a tutorial. It is a structured engineering apprenticeship built around how Apache Tez is actually developed, tested, reviewed, and maintained by its committers and PMC members.
Every level is tied to real Apache Tez source code, real JIRA issue patterns, real test infrastructure, and real contribution workflows. The labs mirror the work an Apache Tez committer actually does — reading state machine code, tracing DAG execution paths, debugging shuffle failures, reproducing reported issues, and preparing patches for community review.
The curriculum will not hold your hand. It will point you at the right parts of the codebase, give you the right questions to ask, and push you to develop the muscle memory of someone who works at this level habitually.
Who This Is For
This curriculum is designed for strong backend and distributed systems engineers who:
- Have 3+ years of Java development experience (Maven-based projects)
- Are familiar with Hadoop, YARN, or MapReduce at a conceptual level
- Understand distributed systems fundamentals: scheduling, fault tolerance, partitioning, shuffle
- Want to contribute to Apache open-source at a serious level — not just fix typos
You should be comfortable with:
- Reading large, unfamiliar Java codebases without a guide
- git workflows, reading diffs, working with patch-based reviews
- The Hadoop ecosystem at a high level: YARN, HDFS, MapReduce, Hive
- Distributed execution concepts: task graphs, data movement, speculative execution
What You Will Be Able to Do
After completing this curriculum, you will be able to:
| Capability | Description |
|---|---|
| Build and test | Build Apache Tez from source, run unit and integration tests, run DAGs locally |
| Navigate the codebase | Find any class, understand its role, trace execution across module boundaries |
| Understand DAG execution | Follow a DAG from client submission through AM scheduling to task completion |
| Debug failures | Diagnose failed task attempts, hung DAGs, shuffle errors, and YARN allocation failures |
| Trace state machines | Read and reason about DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl state machines |
| Contribute patches | Reproduce issues, fix bugs, write tests, prepare high-quality patches |
| Engage the community | Interact productively on JIRA and mailing lists |
| Understand Hive integration | Trace a SQL query through Hive planning to a Tez DAG execution |
| Think like a committer | Reason about compatibility, test stability, performance, and release impact |
How to Use This Curriculum
Work through the 9 levels sequentially. Do not skip levels. Each level builds directly on the previous one, and the labs depend on the conceptual foundations laid earlier.
| Level | Title | Core Focus |
|---|---|---|
| 1 | Hadoop and Tez Foundation | Build, test, first DAG, Hadoop ecosystem |
| 2 | Apache Contributor Onboarding | Workflow, patches, JIRA, mailing lists |
| 3 | Tez Architecture | DAG model, TezClient, DAGAppMaster, key subsystems |
| 4 | DAG Execution Internals | State machines, vertex/task/attempt lifecycle, events |
| 5 | Testing and Debugging | Test infra, mini-cluster, debugging failed tasks |
| 6 | Hive/Tez Integration | SQL-to-DAG, Hive integration, cross-project bugs |
| 7 | Runtime and Shuffle | TezRuntime, I/O abstractions, shuffle and sort |
| 8 | Real Issue Contribution | JIRA reproduction, root cause analysis, real patches |
| 9 | Advanced Committer / PMC | Performance, backward compatibility, release practices |
Beyond the 9 levels, the curriculum includes five additional sections:
| Section | Purpose |
|---|---|
| Contributor Mindset | How to think, behave, and grow as an Apache contributor |
| Issue Roadmap | Staged progression from beginner-friendly to release-blocking issues |
| Internals Deep Dives | 21 focused deep dives, each with a mini-lab |
| Hive-on-Tez Labs | Cross-project debugging, SQL-to-DAG tracing, integration bugs |
| Release, Review, and PMC Practices | Apache governance, voting, licensing, release management |
The curriculum closes with a Capstone Project — a full contribution cycle from issue reproduction to merged patch and engineering write-up.
Required Tools
Before starting Level 1, ensure you have the following installed and working:
- Java 8 or Java 11 (OpenJDK recommended; match the Tez branch target)
- Apache Maven 3.6.3 or newer
- Git 2.x
- IntelliJ IDEA (strongly recommended) or Eclipse with M2E
- Docker (optional; useful for containerized mini-cluster environments)
You will also need:
- A clone of the Apache Tez repository (GitHub mirror of the Apache GitBox repo)
- A clone of the Apache Hadoop repository (for YARN API context and integration reference)
- An account on Apache JIRA (free to create)
- Subscription to the Apache Tez mailing lists:
  - dev@tez.apache.org: development discussion (required)
  - issues@tez.apache.org: JIRA notifications (optional but useful)
Note on Java version: Apache Tez's master branch targets Java 8 as the minimum. Some newer branches may require Java 11. Always check the pom.xml at the root of the branch you are working on.
Apache Tez at a Glance
Apache Tez is a general-purpose DAG execution engine built on top of Apache YARN. It has been available as an execution engine for Apache Hive since Hive 0.13 and is Hive's primary engine today. It is also used by other Hadoop ecosystem projects including Pig, Cascading, and Spark (historically).
Why Tez Exists
MapReduce forces every computation into a Map → Shuffle → Reduce pattern. Complex analytical queries (like multi-join SQL) require chaining many MapReduce jobs, with intermediate results written to HDFS between each stage. This is slow and wasteful.
Tez allows arbitrary directed acyclic graphs (DAGs) of computation where:
- Vertices represent computation stages
- Edges represent data movement between stages
- Container reuse eliminates JVM startup overhead between tasks
- Data can be pipelined between tasks without HDFS materialization
- The same container can run multiple task types
This makes Tez significantly faster than MapReduce for multi-stage queries.
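The vertex-and-edge model above can be sketched in a few lines of plain Java. This is a toy illustration only: `ToyDag` and its methods are hypothetical names, not Tez API classes, and the vertex names are made up. It shows the one property that matters here: a vertex can run once all of its inputs are available, which is a topological order of the graph.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are NOT Tez API classes.
final class ToyDag {
    // vertex name -> names of downstream vertices
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    ToyDag addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        edges.get(from).add(to);
        return this;
    }

    // Kahn's algorithm: a vertex is emitted once all of its inputs have been emitted.
    List<String> executionOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String v : edges.keySet()) {
            pendingInputs.put(v, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String t : targets) {
                pendingInputs.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        pendingInputs.forEach((v, n) -> { if (n == 0) ready.add(v); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.poll();
            order.add(v);
            for (String t : edges.get(v)) {
                if (pendingInputs.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != edges.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        ToyDag dag = new ToyDag()
            .addEdge("scan_orders", "join")
            .addEdge("scan_customers", "join")
            .addEdge("join", "aggregate");
        System.out.println(dag.executionOrder());
    }
}
```

In MapReduce the same three-stage computation would be separate jobs with HDFS writes between them; here it is one graph that a single scheduler can reason about.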
Key Modules
You will spend the majority of your time in these modules:
| Module | Path | Description |
|---|---|---|
| tez-api | tez-api/ | Public API: DAG, Vertex, Edge, TezClient, DAGClient |
| tez-dag | tez-dag/ | Core execution engine: AM, state machines, scheduling |
| tez-runtime-library | tez-runtime-library/ | Input/Output/Processor implementations, shuffle |
| tez-mapreduce | tez-mapreduce/ | MapReduce compatibility layer (MRInput, MROutput) |
| tez-runtime-internals | tez-runtime-internals/ | Task execution framework, container management |
| tez-tests | tez-tests/ | Integration tests and system-level tests |
| tez-tools | tez-tools/ | Utility tools (DAG recovery, history parsing) |
| tez-plugins | tez-plugins/ | Optional plugins (history logging, YARN timeline server integration) |
Key Classes (High-Level Preview)
| Class | Module | Role |
|---|---|---|
| TezClient | tez-api | Entry point for DAG submission from a client |
| DAGClient | tez-api | Handle for monitoring a submitted DAG |
| DAG | tez-api | DAG definition: vertices + edges |
| Vertex | tez-api | Vertex definition: processor + parallelism |
| DAGAppMaster | tez-dag | ApplicationMaster that orchestrates DAG execution |
| DAGImpl | tez-dag | State machine: models DAG lifecycle |
| VertexImpl | tez-dag | State machine: models vertex lifecycle |
| TaskImpl | tez-dag | State machine: models task lifecycle |
| TaskAttemptImpl | tez-dag | State machine: models a single task attempt |
| TaskCommunicatorManager | tez-dag | Manages communication between AM and task containers |
| TezTaskRunner2 | tez-runtime-internals | Runs a task inside a container |
| LogicalIOProcessorRuntimeTask | tez-runtime-internals | Wires up I/O processors inside a task |
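Four of the classes above are described as state machines. Before reading them, it can help to see the pattern in miniature. The sketch below is a hypothetical, heavily simplified attempt lifecycle: the state and event names are invented for illustration and do not match the real `TaskAttemptImpl` enums, and Tez itself builds its machines with Hadoop YARN's `StateMachineFactory` rather than a hand-rolled map.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

// Toy model only: states and events are simplified and hypothetical,
// not the real TaskAttemptImpl state/event enums.
final class ToyAttemptStateMachine {
    enum State { NEW, RUNNING, SUCCEEDED, FAILED, KILLED }
    enum Event { START, DONE, ERROR, KILL }

    // Transition table: (current state, event) -> next state.
    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);
    static {
        add(State.NEW, Event.START, State.RUNNING);
        add(State.NEW, Event.KILL, State.KILLED);
        add(State.RUNNING, Event.DONE, State.SUCCEEDED);
        add(State.RUNNING, Event.ERROR, State.FAILED);
        add(State.RUNNING, Event.KILL, State.KILLED);
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    }

    private State current = State.NEW;

    State current() {
        return current;
    }

    // Applies one event; an event with no registered transition is rejected.
    State handle(Event event) {
        Map<Event, State> allowed =
            TRANSITIONS.getOrDefault(current, Collections.<Event, State>emptyMap());
        State next = allowed.get(event);
        if (next == null) {
            throw new IllegalStateException("invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        ToyAttemptStateMachine attempt = new ToyAttemptStateMachine();
        attempt.handle(Event.START);
        attempt.handle(Event.DONE);
        System.out.println(attempt.current());
    }
}
```

The useful habit this builds: when a DAG hangs or fails, the first question is which (state, event) pair was hit and whether a transition is registered for it.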
Apache Tez Community
Apache Tez is a mature project with an active but selective community. The codebase reflects years of careful design decisions, many of which are documented in JIRA issues, design documents, and mailing list threads rather than in code comments.
What the community values:
- Patches that include tests
- Issues that include a clear reproduction case
- Comments that demonstrate you have read the existing code
- Contributors who engage respectfully and patiently
- Sustained contribution over time, not one-off patches
The path from contributor to committer is measured in years, not weeks. That is intentional. The Apache meritocracy rewards sustained, high-quality contribution — not volume of patches.
This curriculum will help you build the habits and depth of understanding that make that path realistic.
Begin with Level 1: Hadoop and Tez Foundation.
Overview & Prerequisites
Status: Full content coming in Phase 11.
This section covers the complete prerequisites checklist, environment setup guide, and how to navigate the curriculum effectively.
Topics covered:
- Detailed environment setup (Java, Maven, Git, IDE configuration)
- Cloning and verifying the Apache Tez and Hadoop repositories
- Subscribing to Apache Tez mailing lists
- Setting up an Apache JIRA account
- How to navigate each curriculum level
- How to use the labs
Tez Warm-Up: From Data Engineer to Source Contributor
Before you read a single line of VertexImpl.java, you need to have sat in the seat of the
person whose workload Tez is serving. The engineers who built Tez's state machines, container
reuse logic, and shuffle pipelines were solving specific, painful problems that showed up in
production Hive and Pig workloads every day. If you skip that context and go straight to the
source code, you will memorize class names without understanding why the design exists.
This chapter is the missing first mile. You will run Tez from the outside — as a data engineer would — across a series of practical scenarios covering different data shapes, query patterns, and ecosystem integrations. After each scenario, the chapter maps what you observed back to the source code structures that own it. By the end, you will have a mental model that makes every internal class feel like an old acquaintance rather than an alien term.
What Tez Actually Is (Two Sentences)
Apache Tez is a general-purpose DAG execution engine that runs on Apache YARN. It does not execute SQL or process files itself — it provides a runtime that other systems (Hive, Pig, custom applications) compile their work into, and then Tez runs that compiled work as a directed acyclic graph of parallel tasks.
Everything else — SQL parsing, query planning, physical operators, file format codecs — belongs to the caller. Tez sees vertex descriptors, edge properties, and processor classes. That boundary is what you need to hold clearly in mind throughout the curriculum.
Where Tez Sits in the Data Engineering Spectrum
┌─────────────────────────────────────────────────────────────────────────────┐
│ Data Engineering Tool Spectrum │
│ │
│ Batch ◄───────────────────────────────────────────────► Streaming │
│ │
│ MapReduce Tez Spark Flink Kafka Streams │
│ (2004) (2013) (2014) (2014) (2016) │
│ pure batch batch+ micro-batch true stream native stream │
│ pipelined & batch & batch │
│ │
│ ────────────────────────────────────────────────────────────────────── │
│ Ingest Layer: Flume → Kafka → Flink/Kafka Streams │
│ Storage Layer: HDFS / S3 / ORC / Parquet / Iceberg / Delta Lake │
│ Query Layer: Hive (Tez), Presto, Trino, Spark SQL, Flink SQL │
└─────────────────────────────────────────────────────────────────────────────┘
Tez vs. MapReduce
MapReduce forces every computation into map → shuffle → reduce. A five-join SQL query becomes
five chained MapReduce jobs with HDFS materializations between each. Tez expresses that same
query as one DAG, pipelines intermediate data between vertices without HDFS writes, and reuses
JVMs across tasks. Typical improvement: 2–5x on complex queries, 10x+ on workflows that would
have been five MR jobs.
Tez vs. Spark
| Dimension | Tez | Spark |
|---|---|---|
| Primary use case | Hive SQL (on YARN/HDFS ecosystems) | General batch + ML + streaming |
| Execution model | YARN-native, container reuse | Driver + executor (YARN or Kubernetes) |
| In-memory caching | No (disk-backed shuffle) | RDD/DataFrame caching (explicit) |
| Streaming | Not native | Structured Streaming (micro-batch) |
| Deployment | YARN only | YARN, Kubernetes, standalone |
| Hive integration | Deep (Hive's primary engine) | Separate (Hive-on-Spark is less common) |
| Community | Apache Tez (focused on Hive use case) | Apache Spark (broad general use) |
When you are on a Hadoop/YARN cluster where Hive is the primary SQL layer, Tez is the right choice. Spark is a better fit for Python/Scala workloads, ML pipelines, or when you need in-memory caching across multiple queries.
Tez vs. Flink
Flink is a streaming-first engine that also handles batch. Tez is a batch-first engine that handles simple pipelines. The key structural difference: Flink maintains persistent operator state across windows and checkpoints; Tez vertices are stateless per-task (state is external: HDFS, HBase). If you are building event-time windowed aggregations or exactly-once stream processing, you want Flink. If you are running nightly ETL on HDFS data via Hive, Tez is the right tool and Flink would be overengineered for the job.
Tez vs. Flume (Ingest)
Flume is not a computation engine — it is a log/event ingestion agent that moves data from sources (web servers, syslog, Kafka) to sinks (HDFS, Kafka, HBase). The typical pipeline is:
Application Logs → Flume Agent → HDFS (ORC/Parquet files) → Hive table → Tez query
Flume and Tez are not competitors; they are peers in the same pipeline. Tez reads the data that Flume (or Kafka, or Sqoop) landed on HDFS. Knowing this boundary matters when you encounter a data quality bug: is it in the ingest (Flume), the storage format (ORC serialization), or the compute layer (Tez/Hive)?
Data Formats in the Tez Ecosystem
Tez itself is format-agnostic. It does not read or write ORC, Parquet, or Iceberg directly.
Tez sees InputDescriptor and OutputDescriptor objects — the actual codec lives in the class
pointed to by those descriptors. The format lives in the tez-mapreduce compatibility layer
(MRInput, MROutput) or in Hive's vectorized readers.
ORC (Optimized Row Columnar)
ORC is Hive's native format. When you INSERT INTO an ORC table and query it via Hive-on-Tez:
- The input split is an OrcSplit generated by OrcInputFormat in the Hive ORC library.
- Tez receives that split as a DataSourceDescriptor in the DAGPlan.
- MRInput wraps OrcInputFormat.createRecordReader(), feeding vectorized row batches to Hive's MapOperator.
- The key Tez entry point is MRInputLegacy.createReaderInternal() in tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInputLegacy.java.
ORC's predicate pushdown (column pruning, row group skipping) happens before Tez sees the
data — entirely inside OrcInputFormat. If a Hive-on-Tez query reads 10 billion rows instead of
the 100K it should (wrong predicate pushdown), the bug is in ORC/Hive, not in Tez.
Parquet
Parquet is the other dominant columnar format, more common in cross-ecosystem pipelines (Spark + Hive interop). With Hive-on-Tez reading Parquet:
- ParquetInputFormat generates ParquetInputSplit objects.
- Tez receives those as DataSourceDescriptor entries.
- Vectorization depth varies: ORC vectorization is deeper in Hive; Parquet goes through an additional row-column translation layer.
From a Tez contributor's standpoint, Parquet vs. ORC differences show up mainly in:
- Split size calculations affecting vertex parallelism (how many map tasks Tez schedules)
- Record skew when one Parquet file is much larger than others
Iceberg
Apache Iceberg is a table format (not a file format). It stores data in Parquet or ORC files
but adds a metadata layer for ACID semantics, time travel, and hidden partitioning. Hive + Tez
reads Iceberg via IcebergInputFormat (from the Iceberg Hive runtime JAR).
From Tez's view, Iceberg is yet another InputFormat. The novel behavior is:
- Iceberg's snapshot-based read means splits can come from multiple physical locations.
- Iceberg's PlanningUtil generates splits that can be much more numerous than traditional partition-based splits; this affects Tez vertex parallelism significantly.
- Time-travel queries (SELECT ... FOR SYSTEM_TIME AS OF ...) generate a different split list at query compile time, which Hive encodes into the DAGPlan before Tez sees it.
Key insight for contributors: Tez bugs triggered by Iceberg tables are almost always about
parallelism (too many small tasks, too few tasks for large snapshots) or about the
DataSourceDescriptor encoding. The actual file reading is not Tez's responsibility.
Scenario 1: Classic Batch ETL — Aggregation Over a Large Table
What the data engineer does:
-- Run in Hive CLI connected to a cluster with Hive-on-Tez enabled
SET hive.execution.engine=tez;
CREATE TABLE daily_sales (
event_date STRING,
product_id BIGINT,
region STRING,
revenue DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZSTD");
-- Query: daily revenue by region, last 90 days
SELECT
event_date,
region,
SUM(revenue) AS total_revenue,
COUNT(*) AS transaction_count
FROM daily_sales
WHERE event_date >= '2026-03-01'
GROUP BY event_date, region
ORDER BY event_date, region;
What Tez does under the hood:
- Hive compiles the query to MapWork (map-side partial aggregation) + ReduceWork (global aggregation + sort).
- DagUtils.createVertex() in Hive creates two Tez Vertex objects: Map 1 and Reducer 2.
- The edge between them is SCATTER_GATHER (partitioned shuffle by GROUP BY key hash).
- ShuffleVertexManager auto-parallelism kicks in: it monitors how much data map tasks produce, then dynamically reduces the reducer count if data is smaller than expected (config: tez.shuffle-vertex-manager.desired-task-input-size).
- Map tasks run MapProcessor → HashTableContainer (partial agg) → OrderedPartitionedKVOutput (partitioned, sorted).
- Reducer tasks run ReduceProcessor → OrderedGroupedKVInput (merge shuffle inputs) → PTFOperator (for ORDER BY) → FileSinkOperator → ORC writer.
Dataset characteristics and edge behaviors:
| Dataset characteristic | Tez behavior | Source class to read |
|---|---|---|
| 1 small file (< 1 block) | 1 map task, ShuffleVertexManager sets 1 reducer | ShuffleVertexManager.java |
| 1,000 files, uniform size | Parallelism = file count (MR split logic) | MRInputLegacy.java split sizing |
| 1 file, 10 GB, no ORC splits | 1 map task (cannot split non-splittable format) | OrcInputFormat.isSplittable() |
| WHERE predicate on partitioned column | Hive partition pruning, fewer splits passed to Tez | Hive PartitionPruner, not Tez |
| WHERE filters out all rows | 0 output bytes from map, ShuffleVertexManager → 1 reducer | ShuffleVertexManager.onSourceTaskCompleted() |
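The auto-parallelism rows in the table above follow from simple arithmetic. The sketch below is a deliberately simplified model, not the actual ShuffleVertexManager algorithm (which applies additional rules, such as waiting for a fraction of source tasks to finish). The class name, method name, and the 100 MiB desired input size are assumptions made for illustration.

```java
// Simplified sketch: the real ShuffleVertexManager applies more rules than this shows.
final class AutoParallelismSketch {
    // Returns the reducer count after shrinking; never grows beyond the configured count.
    static int reducersFor(long mapOutputBytes, long desiredTaskInputBytes, int configuredReducers) {
        if (mapOutputBytes < 0 || desiredTaskInputBytes <= 0 || configuredReducers <= 0) {
            throw new IllegalArgumentException("sizes and counts must be positive");
        }
        // Ceiling division: enough reducers so each handles at most the desired input size.
        long needed = (mapOutputBytes + desiredTaskInputBytes - 1) / desiredTaskInputBytes;
        // At least one reducer, at most what the plan originally asked for.
        long clamped = Math.max(1, Math.min(needed, configuredReducers));
        return (int) clamped;
    }

    public static void main(String[] args) {
        long desired = 100L * 1024 * 1024; // assumed 100 MiB per reducer
        System.out.println(reducersFor(0, desired, 200));                  // all rows filtered out
        System.out.println(reducersFor(350L * 1024 * 1024, desired, 200)); // small map output
        System.out.println(reducersFor(1L << 40, desired, 200));           // 1 TiB of map output
    }
}
```

Under these assumptions, zero bytes of map output collapses to a single reducer, which matches the "WHERE filters out all rows" row, and very large output is capped at the configured reducer count because this model only ever shrinks parallelism.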
Bridge to source code:
cd ~/tez-src
# ShuffleVertexManager: the most important vertex manager for map-reduce style DAGs.
# Note that it lives in tez-runtime-library, not tez-dag.
find . -name "ShuffleVertexManager.java"
# tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java
# Auto-parallelism: how ShuffleVertexManager decides to reduce the number of reducers
grep -rn "computeParallelism\|desiredTaskInputSize\|onSourceTaskCompleted" \
  tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ | head -30
# The edge between Map 1 and Reducer 2 is SCATTER_GATHER — EdgeProperty documentation
grep -n "SCATTER_GATHER" \
tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
Scenario 2: Multi-Table Join — The Real Workload Tez Was Built For
What the data engineer does:
SET hive.execution.engine=tez;
SET hive.auto.convert.join=true; -- enable map-side (broadcast) joins
SET hive.mapjoin.smalltable.filesize=25000000; -- 25 MB threshold
SELECT
o.order_id,
c.customer_name,
p.product_name,
o.quantity * p.unit_price AS line_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date = '2026-05-31'
AND c.region = 'US-WEST';
What Tez does:
Hive's query planner analyzes table sizes:
- customers and products are small (< 25 MB) → broadcast join (MapJoin)
- orders is large (> 25 MB) → the probe side, streamed through the map tasks (it goes through a SCATTER_GATHER shuffle only when a downstream reducer is needed)
The resulting DAG has:
- Map 1: reads orders, builds hash tables from the customers and products small tables, emits matching rows. Small tables arrive via a BROADCAST edge: every map task gets the full small table. This is different from ONE_TO_ONE, where source task i feeds only destination task i.
- Optionally a Reducer 2 if there's a DISTINCT or ORDER BY.
VertexGroup for shared outputs: Hive uses VertexGroup to treat several vertices as one
logical source, for example when the branches of a UNION ALL feed the same downstream consumer
or data sink. A group is created with DAG.createVertexGroup() and can then be used as the
source of a GroupInputEdge.
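Tez's DataMovementType values (ONE_TO_ONE, BROADCAST, SCATTER_GATHER) are easy to confuse, so a toy routing table helps. The sketch below is not the EdgeManager code Tez runs; `RoutingSketch` and `delivered` are hypothetical names, and the strings it returns are just labels for "what does each destination task receive from one source task".

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: mirrors the meaning of the three data movement types,
// but is not the EdgeManager code Tez actually runs.
final class RoutingSketch {
    enum Movement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // For one source task, describes what each destination task receives from it.
    static List<String> delivered(Movement type, int sourceTask, int numDestTasks) {
        List<String> perDestination = new ArrayList<>();
        for (int dest = 0; dest < numDestTasks; dest++) {
            switch (type) {
                case ONE_TO_ONE:
                    // Source task i feeds only destination task i.
                    perDestination.add(dest == sourceTask ? "full" : "none");
                    break;
                case BROADCAST:
                    // Every destination task receives the source task's entire output.
                    perDestination.add("full");
                    break;
                case SCATTER_GATHER:
                    // Every destination task receives its own partition of the output.
                    perDestination.add("partition " + dest);
                    break;
                default:
                    throw new IllegalArgumentException("unknown type: " + type);
            }
        }
        return perDestination;
    }

    public static void main(String[] args) {
        for (Movement m : Movement.values()) {
            System.out.println(m + " from source task 1 -> " + delivered(m, 1, 3));
        }
    }
}
```

A map join's small table corresponds to the BROADCAST row: every consumer sees the whole table, which is why its size has to fit in each container's memory.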
Dataset edge cases for joins:
| Scenario | What goes wrong | Where to look |
|---|---|---|
| customers grows from 20 MB to 30 MB | Map join threshold exceeded, query switches to shuffle join; slower | Hive CommonJoinResolver, not Tez |
| orders has extreme key skew (one customer_id has 90% of rows) | One reducer gets 90% of data; task timeout | SkewedJoin hint in Hive; Tez sees it as one overloaded reducer |
| Broadcast table > YARN container heap | OOM in map task | Container memory: tez.task.resource.memory.mb |
| Right side of join returns 0 rows | Map tasks emit 0 output; downstream vertex immediately succeeds | VertexImpl.checkTasksForCompletion() |
Bridge to source code:
# BROADCAST edge: how every map task gets all small-table data
grep -rn "BROADCAST\|BroadcastEdgeManager" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ | grep -v Test | head -20
# VertexGroup: where groups are defined and created in the public API
grep -n "class VertexGroup\|createVertexGroup" \
  tez-api/src/main/java/org/apache/tez/dag/api/VertexGroup.java \
  tez-api/src/main/java/org/apache/tez/dag/api/DAG.java | head -15
# How the DAGAppMaster sees both edges from the same vertex
grep -n "vertexGroup\|groupInput" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -20
Scenario 3: Direct Tez API — No Hive
Not all Tez workloads go through Hive. Custom data pipelines, internal batch frameworks, and
migration tools often build Tez DAGs directly. The canonical example is OrderedWordCount in
tez-examples/.
// Simplified from tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java
TezClient tezClient = TezClient.create("OrderedWordCount", tezConf);
tezClient.start();
DAG dag = DAG.create("OrderedWordCount");
// Vertex 1: Tokenize words from input files
Vertex tokenizerVertex = Vertex.create(
"Tokenizer",
ProcessorDescriptor.create(TokenProcessor.class.getName()),
numMapTasks,
MRHelpers.getMapResource(conf));
tokenizerVertex.addDataSource(
"Input",
MRInput.createConfigBuilder(conf, TextInputFormat.class, inputPath).build());
// Vertex 2: sum the counts for each word (receives input sorted and grouped by key)
Vertex sumVertex = Vertex.create(
"Sorter",
ProcessorDescriptor.create(SumProcessor.class.getName()),
numReduceTasks,
MRHelpers.getReduceResource(conf));
sumVertex.addDataSink(
"Output",
MROutput.createConfigBuilder(conf, TextOutputFormat.class, outputPath).build());
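// Edge: SCATTER_GATHER, sorted by word key.
// OrderedPartitionedKVEdgeConfig builds the matching output/input descriptors.
// createDefaultEdgeProperty() is shorthand for
// EdgeProperty.create(SCATTER_GATHER, PERSISTED, SEQUENTIAL, outputDesc, inputDesc),
// where the descriptors point at OrderedPartitionedKVOutput and OrderedGroupedKVInput.
OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
    .newBuilder(Text.class.getName(), IntWritable.class.getName(),
        HashPartitioner.class.getName())
    .setFromConfiguration(tezConf)
    .build();
dag.addVertex(tokenizerVertex)
    .addVertex(sumVertex)
    .addEdge(Edge.create(tokenizerVertex, sumVertex, edgeConf.createDefaultEdgeProperty()));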
DAGClient dagClient = tezClient.submitDAG(dag);
DAGStatus status = dagClient.waitForCompletion();
What this teaches about Tez's structure:
- Vertex.create(name, processorDescriptor, parallelism, resource): the four primitives of a vertex: name, code to run, how many copies, how much resource.
- EdgeProperty.create(movementType, sourceType, schedulingType, outputDesc, inputDesc): edge properties completely specify how data moves.
- MRInput/MROutput bridge the gap between legacy Hadoop InputFormat/OutputFormat and Tez's native I/O descriptors.
Bridge to source code:
# Read OrderedWordCount to understand the complete DAG lifecycle from a client
cat tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java
# Follow TezClient.submitDAG() into the AM
grep -n "public.*submitDAG" \
  tez-api/src/main/java/org/apache/tez/client/TezClient.java
# EdgeProperty — the central struct that determines routing
cat tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
Scenario 4: Pipelined Execution — Where Tez Approaches Flink
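Tez's EdgeProperty expresses pipelined execution through CONCURRENT scheduling (vs. SEQUENTIAL).
With a concurrently scheduled edge, downstream tasks can start before all upstream tasks
complete, so the data flows like a stream within the DAG. Support for this combination varies
by Tez version, so check the Javadoc in EdgeProperty.java on your branch.
EdgeProperty pipelinedEdge = EdgeProperty.create(
    DataMovementType.SCATTER_GATHER,
    DataSourceType.EPHEMERAL,   // <-- output is streamed rather than persisted
                                //     (DataSourceType has PERSISTED, PERSISTED_RELIABLE, EPHEMERAL)
    SchedulingType.CONCURRENT,  // <-- downstream starts immediately
    outputDescriptor,
    inputDescriptor);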
This is used by Hive for query pipelining in long-running SELECT ... INSERT chains. The
downstream vertex starts consuming partial output from the upstream before it finishes, reducing
end-to-end latency for multi-stage queries.
When pipelining causes problems:
| Problem | Symptom | Root class |
|---|---|---|
| Upstream task fails mid-stream | Downstream task has consumed partial data → must be killed and retried with upstream | TaskAttemptImpl.FAILED_TRANSITION |
| Downstream cannot consume fast enough | Back-pressure: upstream pauses on write() | OrderedPartitionedKVOutput.sendingThreadShouldRun |
| Memory overflow in pipelined buffer | OutOfMemoryError in fetcher threads | MergeManager in-memory limit |
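The back-pressure row in the table above describes a general producer/consumer effect that can be reproduced without Tez. The sketch below is a toy: a bounded in-memory queue stands in for the buffer between an upstream writer and a downstream reader, and the capacity of 4 is arbitrary. `BackPressureSketch` is a hypothetical name, and Tez's real shuffle path is far more involved.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model only: a bounded buffer between a producer and a consumer.
final class BackPressureSketch {
    // Tries to write `records` items without blocking; returns how many were accepted
    // before the buffer filled up.
    static int writeWithoutBlocking(BlockingQueue<Integer> buffer, int records) {
        int accepted = 0;
        for (int i = 0; i < records; i++) {
            if (!buffer.offer(i)) {
                break; // buffer full: a blocking writer would pause here
            }
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(4);
        // Consumer is idle: only as many records as the buffer holds are accepted.
        System.out.println(writeWithoutBlocking(buffer, 10));
        // Consumer drains one record, freeing exactly one slot for the producer.
        buffer.poll();
        System.out.println(writeWithoutBlocking(buffer, 10));
    }
}
```

The producer's throughput ends up bounded by the consumer's drain rate, which is the symptom to look for when an upstream task appears "stuck" on a write.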
Bridge to source code:
grep -n "EPHEMERAL\|PERSISTED\|CONCURRENT" \
  tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
grep -n "SchedulingType.CONCURRENT" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -10
Dataset Scenarios for Testing Edge Cases
When you are writing a repro test case or validating a fix, the dataset you choose determines which code paths you exercise. Use these as your starting templates.
Dataset 1: The Empty Partition
// Generate test data where one reduce partition has 0 records
// Triggers ShuffleVertexManager.onSourceTaskCompleted() with 0-byte output
private static final int NUM_PARTITIONS = 10;
private static final int RECORDS_PER_PARTITION = 100;
// Force all records into partitions 0–8, leave partition 9 empty
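// Mask the sign bit first: hashCode() can be negative, which would yield a negative partition
int partition = (key.hashCode() & Integer.MAX_VALUE) % (NUM_PARTITIONS - 1); // never 9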
What this tests: ShuffleVertexManager must handle a vertex where some reducer partitions
receive zero input. Before TEZ-3247, this caused reducers to hang waiting for shuffle data
that would never arrive.
# Test class that covers empty-partition behavior
grep -rn "emptyPartition\|zeroInput\|emptyInput" tez-tests/src/test/ | head -10
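A self-contained version of the empty-partition generator makes it easy to confirm the dataset really has the shape you intend before feeding it to a test. This is a sketch under stated assumptions: `EmptyPartitionSketch` and its methods are hypothetical helpers, and the key format "key_N" is made up. The sign-bit mask matters because `String.hashCode()` can be negative, and Java's `%` keeps the sign of the dividend.

```java
import java.util.Arrays;

// Hypothetical helper for building a repro dataset; not part of Tez.
final class EmptyPartitionSketch {
    static final int NUM_PARTITIONS = 10;

    // Maps a key to partitions 0..8 only, so partition 9 never receives a record.
    static int partitionFor(String key) {
        // Mask the sign bit so the result of % is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % (NUM_PARTITIONS - 1);
    }

    // Counts how many of the generated keys land in each partition.
    static int[] countPerPartition(int records) {
        int[] counts = new int[NUM_PARTITIONS];
        for (int i = 0; i < records; i++) {
            counts[partitionFor("key_" + i)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // The last slot (partition 9) stays at 0 by construction.
        System.out.println(Arrays.toString(countPerPartition(1000)));
    }
}
```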
Dataset 2: Extreme Key Skew
// One key accounts for 95% of records
for (int i = 0; i < 1_000_000; i++) {
String key = (i < 950_000) ? "hot_key" : "key_" + i;
writer.write(new Text(key), new IntWritable(1));
}
What this tests: The reducer that receives hot_key gets ~950,000 records while the remaining
~50,000 records are spread across the other reducers. This exposes:
- Speculative execution decisions in LegacySpeculator
- Container reuse after the skewed reducer finishes last
- Per-vertex timing in VertexImpl.checkTasksForCompletion()
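To see how lopsided this dataset is for a given reducer count, the distribution can be computed directly. The sketch below assumes plain hash partitioning with a sign-bit mask; `SkewSketch` and `reducerLoads` are hypothetical names, and the choice of 10 reducers is arbitrary.

```java
// Hypothetical helper for sizing a skew repro; not part of Tez.
final class SkewSketch {
    // Records received by each reducer when the skewed keys are hash-partitioned.
    static long[] reducerLoads(int numReducers) {
        long[] loads = new long[numReducers];
        for (int i = 0; i < 1_000_000; i++) {
            String key = (i < 950_000) ? "hot_key" : "key_" + i;
            loads[(key.hashCode() & Integer.MAX_VALUE) % numReducers]++;
        }
        return loads;
    }

    public static void main(String[] args) {
        long[] loads = reducerLoads(10);
        long max = 0;
        long total = 0;
        for (long load : loads) {
            max = Math.max(max, load);
            total += load;
        }
        // Skew factor: busiest reducer relative to a perfectly even share.
        double evenShare = total / (double) loads.length;
        System.out.printf("max=%d, skew factor=%.1f%n", max, max / evenShare);
    }
}
```

A skew factor near the reducer count means one task does almost all the work, so adding reducers will not shorten the vertex.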
Dataset 3: Zero-Row Input
// Empty input — 0 files, 0 records
// The DAG should complete SUCCEEDED with 0 output, not hang
String inputPath = "/tmp/empty_dir_" + UUID.randomUUID();
fs.mkdirs(new Path(inputPath)); // create directory but put no files in it
What this tests: VertexImpl must handle the case where MRInput generates 0 splits.
A vertex with 0 input splits sets its parallelism to 0, transitions immediately to
V_SUCCEEDED without scheduling any tasks. This has historically been a source of
NullPointerException bugs when downstream vertices assume at least one upstream task ran.
grep -n "setParallelism.*0\|numTasks.*0\|zeroTasks\|numSourceTasks.*0" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -15
Dataset 4: Very Wide Rows (Many Columns)
// 1,000 columns per row — stresses IFile serialization and spill logic
StringBuilder sb = new StringBuilder();
for (int col = 0; col < 1000; col++) {
sb.append("column_").append(col).append("=").append("value_").append(col).append("\t");
}
writer.write(new Text("key"), new Text(sb.toString())); // ~20 KB per record
What this tests: PipelinedSorter and DefaultSorter spill thresholds. With roughly 20 KB
per record, even a modest sort buffer fills quickly. This exercises the spill path in
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java
and exposes off-by-one bugs in the IFile index writer.
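When sizing a spill test it is worth measuring the record rather than estimating it. The sketch below rebuilds the same 1,000-column value and reports its UTF-8 size, then divides an assumed 100 MiB sort buffer by that size. `WideRowSketch` is a hypothetical helper, the buffer size is an assumption, and the figure ignores key bytes and serialization overhead, so treat it as an upper bound on records per buffer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for sizing a wide-row repro; not part of Tez.
final class WideRowSketch {
    // Builds the same tab-separated value as the dataset generator.
    static String buildRow(int columns) {
        StringBuilder sb = new StringBuilder();
        for (int col = 0; col < columns; col++) {
            sb.append("column_").append(col).append("=").append("value_").append(col).append("\t");
        }
        return sb.toString();
    }

    // Size of the value in bytes, before any serialization overhead.
    static int recordBytes(int columns) {
        return buildRow(columns).getBytes(StandardCharsets.UTF_8).length;
    }

    // Upper bound on how many such records fit in a buffer of the given size.
    static long recordsThatFit(long bufferBytes, int recordBytes) {
        return bufferBytes / recordBytes;
    }

    public static void main(String[] args) {
        int bytes = recordBytes(1000);
        long sortBuffer = 100L * 1024 * 1024; // assumed 100 MiB sort buffer
        System.out.println(bytes + " bytes per record");
        System.out.println(recordsThatFit(sortBuffer, bytes) + " records before the buffer is full");
    }
}
```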
Dataset 5: Many Small Files (HDFS Small-File Problem)
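# Generate 50,000 files of 1 KB each, a classic HDFS anti-pattern
mkdir -p /tmp/smallfiles   # the redirect below fails if the directory is missing
for i in $(seq 1 50000); do
echo "record_$i value_$i" > /tmp/smallfiles/file_$i.txt
done
hadoop fs -mkdir -p /data/input
hadoop fs -put /tmp/smallfiles /data/input/smallfiles/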
What this tests: Split generation produces 50,000 map tasks. This is a realistic workload that stresses:
- TaskSchedulerManager task queue management
- Container reuse logic (50,000 tasks; reuse is essential for performance)
- DAGAppMaster AMRM heartbeat frequency under high task count
# Container reuse configuration
grep -rn "heldContainer\|releaseTimeout\|IDLE_RELEASE_TIMEOUT" \
  tez-dag/src/main/java/org/apache/tez/dag/app/rm/ \
  | head -20
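The task count for this dataset follows from how splits are counted. The sketch below shows two numbers: the ungrouped split count (one split per block-sized chunk, at least one per non-empty file) and a crude grouped count. Tez can group small splits into fewer tasks, but the grouping rule here is a deliberately simplified stand-in, and `SplitCountSketch`, the 128 MiB split size, and the 16 MiB target are all assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical helper for reasoning about task counts; not Tez's split logic.
final class SplitCountSketch {
    // One split per splitSize-sized chunk, and at least one split per non-empty file.
    static long ungroupedSplits(long[] fileSizes, long splitSize) {
        long splits = 0;
        for (long size : fileSizes) {
            if (size > 0) {
                splits += (size + splitSize - 1) / splitSize; // ceiling division
            }
        }
        return splits;
    }

    // Crude grouping model: pack data so each task reads about targetBytes.
    static long groupedTasks(long[] fileSizes, long targetBytes) {
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        return Math.max(1, (total + targetBytes - 1) / targetBytes);
    }

    public static void main(String[] args) {
        long[] files = new long[50_000];
        Arrays.fill(files, 1024L); // 50,000 files of 1 KB each
        System.out.println(ungroupedSplits(files, 128L * 1024 * 1024)); // one task per file
        System.out.println(groupedTasks(files, 16L * 1024 * 1024));     // a handful of tasks
    }
}
```

The gap between the two numbers is the reason the small-file dataset is a good stress test: without grouping, scheduling overhead dominates the actual work.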
Dataset 6: Nested Structs (Complex Types)
-- ORC table with nested complex types
CREATE TABLE events (
event_id BIGINT,
metadata STRUCT<
user_id: BIGINT,
session_id: STRING,
properties: MAP<STRING, STRING>,
tags: ARRAY<STRING>
>,
event_ts BIGINT -- renamed from "timestamp", which is a reserved keyword in recent Hive versions
) STORED AS ORC;
What this tests: ORC vectorized reader deserialization of STRUCT, MAP, and ARRAY
types. These values are deserialized into Hive's OrcStruct/OrcMap/OrcList classes
before being passed through MRInput to the MapOperator. If the column count or type
tree in the ORC file differs from what the Hive schema says, ORC schema evolution rules
apply, and the resulting mismatches can look like Tez data corruption even though Tez
moved the data correctly.
Dataset 7: Partitioned Iceberg Table (Snapshot Isolation)
# Use PyIceberg to create an Iceberg table history with multiple snapshots
from pyiceberg.catalog import load_catalog
catalog = load_catalog("hive_catalog", **{"uri": "thrift://hive-metastore:9083"})
table = catalog.load_table("db.events_iceberg")
# Write 3 snapshots representing 3 days of appends.
# generate_day_data() is a placeholder you supply; it must return a pyarrow.Table.
for day in range(3):
    df = generate_day_data(day)
    table.append(df)
# Now query with time travel. hive_execute() is a placeholder for your Hive client.
# Hive generates a DAGPlan that reads snapshot 1.
hive_execute("""
SELECT COUNT(*) FROM db.events_iceberg
FOR SYSTEM_TIME AS OF '2026-05-29 00:00:00'
""")
What this tests: Iceberg's IcebergInputFormat generates a split list that differs per
snapshot. The DataSourceDescriptor passed to Tez encodes the snapshot ID. If Hive
resolves the wrong snapshot, Tez faithfully executes it — the bug is in the DagUtils
snapshot resolution in Hive, not in Tez. But the symptom (wrong row count) looks like
a Tez data bug.
Running Tez End-to-End: The Local Developer Loop
Before writing source code, every Tez contributor should be able to run this loop. Once the initial build is cached, it should take under 10 minutes:
# 1. Clone and build
git clone https://github.com/apache/tez.git ~/tez-src
cd ~/tez-src
mvn clean install -DskipTests -Pdist -q # ~8–12 min cold, 3–4 min warm
# 2. Run the canonical integration test that exercises the full stack
mvn test -pl tez-tests \
-Dtest=TestOrderedWordCount \
-DfailIfNoTests=false 2>&1 | tail -30
# 3. Run a single unit test (fast feedback loop — use this constantly)
mvn test -pl tez-dag \
-Dtest=TestVertexImpl#testVertexSucceededSpeculation \
-DfailIfNoTests=false 2>&1 | tail -20
# 4. Run OrderedWordCount in local mode (no YARN cluster required)
hadoop jar tez-examples/target/tez-examples-*.jar orderedwordcount \
-D tez.local.mode=true \
/path/to/input /tmp/tez-output-$(date +%s)
# 5. Verify output
hadoop fs -cat /tmp/tez-output-*/part-* | sort | head -20
The TestOrderedWordCount test is your baseline health check. If it passes, the full
end-to-end stack (TezClient → DAGAppMaster → VertexImpl → shuffle → MRInput/MROutput)
is working. If it fails, something fundamental is broken and you need to fix that before
touching anything else.
The Bridge: User Scenario → Source Code
Every scenario above maps to a specific source subsystem. Use this table whenever you see a runtime behavior and want to find the code responsible:
| Observed behavior | Source location |
|---|---|
| Map task count equals file count | tez-mapreduce/.../MRInputLegacy.createSplitsProto() |
| Reducer count auto-adjusted down | ShuffleVertexManager.computeParallelism() |
| DAG completes even with 0-row input | VertexImpl.scheduleTasks() (0-task vertex path) |
| Broadcast join: small table to all maps | BroadcastEdgeManager + BROADCAST edge |
| Container reused between tasks | AMContainerImpl.assignContainer() + HeldContainer |
| Task retried after failure | TaskAttemptImpl → TaskImpl.handleTaskAttemptFailed() |
| OOM in shuffle fetch | MergeManager.memoryAvailable / Fetcher.copyFromHost() |
| Hung vertex with tasks still RUNNING | VertexImpl.checkTasksForCompletion() not triggered |
| Wrong output record count | Check OrcInputFormat predicate pushdown first, then Tez |
| Slow single reducer (skew) | LegacySpeculator slow-task detection → speculative attempt |
| Pipelined task killed on upstream failure | TaskAttemptImpl.FAILED_TRANSITION cascades |
What to Verify Before Starting Level 1
Run through this checklist once. It takes 30–45 minutes and proves your environment is solid.
# Environment check
java -version # must be Java 8 or Java 11
mvn -version # must be 3.6.3+
git --version # must be 2.x
# Clone and build
git clone https://github.com/apache/tez.git ~/tez-src
cd ~/tez-src
mvn clean install -DskipTests -Pdist 2>&1 | tail -10
# Confirm build artifacts exist
ls tez-dist/target/tez-*.tar.gz # should exist
ls tez-examples/target/tez-examples-*.jar
# Run the unit test suite in the two most important modules
mvn test -pl tez-dag -DfailIfNoTests=false 2>&1 | grep -E "Tests run:|FAIL|ERROR" | tail -5
mvn test -pl tez-api -DfailIfNoTests=false 2>&1 | grep -E "Tests run:|FAIL|ERROR" | tail -5
# Run the critical end-to-end test
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -DfailIfNoTests=false 2>&1 | tail -10
# All lines should read "Tests run: N, Failures: 0, Errors: 0"
If any of these fail before you have modified a single line of code, stop and fix your
environment. Do not proceed into Level 1 with a broken baseline. A broken baseline means
every subsequent mvn test will produce false failures that obscure the real work.
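The "Failures: 0, Errors: 0" check can also be automated. The sketch below scans captured Maven output for Surefire-style summary lines and reports whether the baseline is clean. The class BaselineCheck is hypothetical, the sample lines in main are made up for illustration rather than copied from a real Tez build, and the regex assumes the summary format quoted in the checklist comment.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decides whether captured "mvn test" output shows a clean baseline.
class BaselineCheck {
    private static final Pattern SUMMARY =
        Pattern.compile("Tests run: (\\d+), Failures: (\\d+), Errors: (\\d+)");

    // True only for a summary line that ran at least one test with no failures or errors.
    static boolean isClean(String line) {
        Matcher m = SUMMARY.matcher(line);
        if (!m.find()) {
            return false;
        }
        return Integer.parseInt(m.group(1)) > 0
            && Integer.parseInt(m.group(2)) == 0
            && Integer.parseInt(m.group(3)) == 0;
    }

    // Healthy means: at least one summary line exists and every summary line is clean.
    static boolean baselineHealthy(List<String> lines) {
        boolean sawSummary = false;
        for (String line : lines) {
            if (SUMMARY.matcher(line).find()) {
                sawSummary = true;
                if (!isClean(line)) {
                    return false;
                }
            }
        }
        return sawSummary;
    }

    public static void main(String[] args) {
        // Made-up sample output, for illustration only.
        List<String> sample = Arrays.asList(
            "[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0",
            "[INFO] BUILD SUCCESS");
        System.out.println("baseline healthy: " + baselineHealthy(sample));
    }
}
```

Requiring at least one executed test is a deliberate choice: with -DfailIfNoTests=false, a mistyped test name produces a passing build that ran nothing.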
Continue to Overview & Prerequisites or jump directly to Level 1: Hadoop and Tez Foundation.
16-Week Plan: From Curious Reader to Tez Committer Candidate
This is a 16-week, ~10-hour-per-week plan that maps the curriculum (Levels 1–9 plus a 2-week capstone) onto a calendar. Each week states:
- Reading — concrete Tez source files. Open them; do not just skim diagrams.
- Hands-on — what you must build/run on your machine.
- JIRA practice queries — searches that surface real, beginner-appropriate issues.
- Labs — the curriculum labs you must complete.
- Exit checkpoint — concrete deliverables. If you cannot produce them, repeat the week.
The plan assumes you have ~/tez-src checked out, tez-tests/ building with
mvn -DskipTests install, and a working Java 8+/Maven 3.6+ environment.
Weeks 1–2: Level 1 — Orientation and First DAG
Week 1 — The DAG model and the client API
Reading
- tez-api/src/main/java/org/apache/tez/dag/api/DAG.java (entire file; ~600 lines)
- tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
- tez-api/src/main/java/org/apache/tez/dag/api/Edge.java
- tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
- tez-api/src/main/proto/DAGApiRecords.proto: focus on DAGPlan, VertexPlan, EdgePlan, EdgeProperty.
Hands-on
- Build Tez from source: mvn clean install -DskipTests -Phadoop28.
- Run OrderedWordCount against a local file using MiniTezCluster (see tez-tests/src/test/java/org/apache/tez/test/TestTezJobs.java).
- Inspect the generated DAGPlan: print it with dag.createDag(...).toString().
JIRA practice queries
project = TEZ AND status in (Open, "In Progress") AND labels = newbie
project = TEZ AND component = tez-api AND fixVersion is empty AND priority in (Trivial, Minor)
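Practice queries like these can be turned into links you can bookmark or share. The sketch below URL-encodes a JQL string into a search link for the ASF JIRA instance. The class JqlLink is hypothetical, and the /issues/?jql= search path is an assumption about the JIRA web UI that you should confirm in a browser.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: turns a JQL query into a JIRA search link.
class JqlLink {
    // Assumed search endpoint of the ASF JIRA web UI.
    private static final String BASE = "https://issues.apache.org/jira/issues/?jql=";

    static String searchUrl(String jql) {
        try {
            // URLEncoder applies form encoding: spaces become '+', '=' becomes %3D.
            return BASE + URLEncoder.encode(jql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl(
            "project = TEZ AND status in (Open, \"In Progress\") AND labels = newbie"));
    }
}
```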
Labs
- Lab 1.1: Trace a WordCount end-to-end.
- Lab 1.2: Modify the DAG: add a second mapper vertex.
Exit checkpoint
- You can name every required argument to DAG.create(), Vertex.create(), Edge.create(), and EdgeProperty.create().
- You can diagram the WordCount DAG without looking.
- You have one JIRA ticket open in a browser tab that you've read end-to-end (description + every comment).
Week 2 — Edges in depth
Reading
- tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java: all three enums (DataMovementType, DataSourceType, SchedulingType).
- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/EdgeManager*.java: five built-in edge managers.
- tez-api/src/main/java/org/apache/tez/dag/api/InputDescriptor.java, OutputDescriptor.java, ProcessorDescriptor.java.
Hands-on
- Build the same WordCount with BROADCAST instead of SCATTER_GATHER for the edge. Observe the failure mode and explain it.
- Write a 3-vertex DAG (A -> B -> C) where A->B is ONE_TO_ONE and B->C is SCATTER_GATHER. Run it; confirm parallelism rules from the source.
JIRA practice queries
project = TEZ AND text ~ "EdgeManager" AND resolution = Unresolved
project = TEZ AND text ~ "broadcast" AND status = Resolved ORDER BY created DESC
Labs
- Lab 1.3 — Edge type matrix experiment.
Exit checkpoint
- Edge type matrix (movement × scheduling × source) drawn from memory.
- You can predict, given edge properties, which EdgeManager impl will be picked.
- One short forum/dev-list email you drafted (do not send) summarizing your reading of an EdgeManager file.
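One way to start the edge type matrix is to generate its empty rows and fill in the observed behavior as you run experiments. The sketch below enumerates every movement × source × scheduling combination. The enums are local stand-ins that mirror the constant names in EdgeProperty.java so the example compiles without Tez on the classpath; check them against your checkout, since the class EdgeMatrix and the row layout are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical worksheet generator; the enums below are local stand-ins, not the Tez classes.
class EdgeMatrix {
    enum DataMovementType { ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM }
    enum DataSourceType { PERSISTED, PERSISTED_RELIABLE, EPHEMERAL }
    enum SchedulingType { SEQUENTIAL, CONCURRENT }

    // One row per combination, with a blank column to record what you observed.
    static List<String> rows() {
        List<String> rows = new ArrayList<String>();
        for (DataMovementType movement : DataMovementType.values()) {
            for (DataSourceType source : DataSourceType.values()) {
                for (SchedulingType scheduling : SchedulingType.values()) {
                    rows.add(movement + " | " + source + " | " + scheduling + " | (observed: ?)");
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println("movement | source | scheduling | notes");
        for (String row : rows()) {
            System.out.println(row);
        }
    }
}
```

Not every combination is necessarily supported by the engine; recording which ones fail, and where in the source they are rejected, is part of the exercise.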
Weeks 3–4: Level 2 — Build, run, and read tests
Week 3 — Tez build system and module layout
Reading
- pom.xml (root), tez-api/pom.xml, tez-dag/pom.xml.
- BUILDING.txt.
- tez-tests/src/test/java/org/apache/tez/test/MiniTezCluster.java: entry point for nearly every integration test.
Hands-on
- Run mvn -pl tez-dag test -Dtest=TestVertexImpl#testBasicVertexCompletion.
- Run mvn -pl tez-tests test -Dtest=TestTezJobs#testWordCount.
- Profile a build: mvn -DskipTests install -X 2>&1 | grep "Building\|BUILD".
JIRA practice queries
project = TEZ AND component = build AND status = Open
project = TEZ AND text ~ "MiniTezCluster" AND resolution = Unresolved
Labs
- Lab 2.1: Build Tez and run all tez-api tests.
- Lab 2.2: Add a no-op test to tez-dag and run it via Maven.
Exit checkpoint
- You can explain why tez-dag depends on tez-api but not vice versa.
- You know the difference between tez-runtime-internals and tez-runtime-library.
- You can run a single test via Maven without consulting any docs.
Week 4 — Tests as documentation
Reading
- tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java (~5000 lines; pick the top 10 test methods).
- tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestDAGImpl.java.
- tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java.
Hands-on
- Pick one test method in TestVertexImpl; rewrite it from scratch in your notebook, then diff against the original.
- Add an assertion that fails; observe the message; fix it.
JIRA practice queries
project = TEZ AND text ~ "flaky" AND status in (Open, "In Progress")
project = TEZ AND text ~ "TestVertexImpl" AND resolution = Unresolved
Labs
- Lab 2.3: Read TestVertexImpl#testKilledTasksHandling and explain every line.
Exit checkpoint
- You can write a test that constructs a VertexImpl directly (without MiniTezCluster).
- You understand the DrainDispatcher pattern (see state-machines.md).
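The idea behind the DrainDispatcher pattern can be sketched without any Hadoop or Tez dependency: events are handled on one background thread, and the test thread calls await() to block until every dispatched event has been processed, so assertions see a settled state. The class ToyDrainDispatcher below is a hypothetical stand-in written for illustration; it is not the real DrainDispatcher and does not reproduce its implementation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical stand-in for illustration; not the dispatcher used by the Tez tests.
class ToyDrainDispatcher<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<E>();
    private final Consumer<E> handler;
    private final Object lock = new Object();
    private int pending = 0; // events dispatched but not yet fully handled

    ToyDrainDispatcher(Consumer<E> handler) {
        this.handler = handler;
        Thread worker = new Thread(this::loop, "toy-dispatcher");
        worker.setDaemon(true); // do not keep the JVM alive for the sketch
        worker.start();
    }

    // Called from any thread; the event is handled later on the single worker thread.
    void dispatch(E event) {
        synchronized (lock) {
            pending++;
        }
        queue.add(event);
    }

    // Single consumer: handlers never run concurrently with each other.
    // A handler that throws ends the loop; a real dispatcher would report that.
    private void loop() {
        try {
            while (true) {
                E event = queue.take();
                try {
                    handler.accept(event);
                } finally {
                    synchronized (lock) {
                        pending--;
                        lock.notifyAll();
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Blocks until every event dispatched so far has been handled.
    void await() {
        synchronized (lock) {
            try {
                while (pending > 0) {
                    lock.wait();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while draining", e);
            }
        }
    }

    public static void main(String[] args) {
        StringBuffer log = new StringBuffer();
        ToyDrainDispatcher<String> dispatcher = new ToyDrainDispatcher<String>(log::append);
        dispatcher.dispatch("V_INIT ");
        dispatcher.dispatch("V_START");
        dispatcher.await(); // without this, the next line could race with the worker
        System.out.println(log);
    }
}
```

In a test, the shape is always the same: dispatch events, await, then assert on state.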
Weeks 5–6: Level 3 — Submission and AM lifecycle
Week 5 — TezClient and submission
Reading
- tez-api/src/main/java/org/apache/tez/client/TezClient.java.
- tez-api/src/main/java/org/apache/tez/client/TezClientUtils.java.
- tez-api/src/main/java/org/apache/tez/client/TezSessionImpl.java.
Hands-on
- Write a small Java program that uses TezClient directly (no MR shim) to submit a DAG to MiniTezCluster.
- Use both session and non-session modes; measure the second-DAG latency difference.
JIRA practice queries
project = TEZ AND component = "tez-api" AND text ~ "TezClient" AND status = Open
Labs
- Lab 3.1 — Build a custom client that submits two DAGs in one session.
Exit checkpoint
- You can list every method that talks to the AM over RPC (grep for dagAMProtocol in TezClient.java).
- You can name the three local resources that TezClientUtils uploads.
Week 6 — DAGAppMaster bring-up
Reading
- tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java: focus on serviceInit, serviceStart, dispatcher registration.
- tez-dag/src/main/java/org/apache/tez/dag/app/TaskCommunicatorManager.java.
- tez-dag/src/main/java/org/apache/tez/dag/app/launcher/ContainerLauncher*.java.
Hands-on
- Run a DAG against MiniTezCluster with AM logs at DEBUG. Identify the line in DAGAppMaster.java that emits the first "Created DAG" log line.
Labs
- Lab 3.2 — Map an AM log line to source code (Lab in Level 3).
Exit checkpoint
- You can list the AsyncDispatcher event-handler registrations in DAGAppMaster in order.
- You can walk the path from TezClient.submitDAG() to DAGImpl being instantiated inside the AM.
Weeks 7–9: Level 4 — Vertex internals and state machines
Week 7 — State machine library
Reading
- hadoop-yarn-common StateMachineFactory source (you'll need to fetch Hadoop source separately).
- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java: read only the stateMachineFactory block first (~200 lines near the top).
Hands-on
- Write a toy StateMachineFactory for a Light (OFF, ON, BROKEN) in a scratch project.
Labs
- Lab 4.1 — State-machine introduction.
Exit checkpoint
- You can explain SingleArcTransition vs MultipleArcTransition without notes.
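Before wiring up the real StateMachineFactory, it can help to see the two transition styles in a dependency-free toy. In the sketch below, a single-arc transition always lands in one fixed state, while a multiple-arc transition runs a hook that picks the target from several candidates. The class LightMachine, its events, and the 250-volt threshold are invented for illustration, and IllegalStateException stands in for the library's invalid-transition exception.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical toy; it mimics the table-driven idea, not the Hadoop StateMachineFactory API.
class LightMachine {
    enum State { OFF, ON, BROKEN }
    enum Event { SWITCH_ON, SWITCH_OFF, SURGE }

    // Transition table: current state -> event -> hook that returns the next state.
    private final Map<State, Map<Event, IntFunction<State>>> table =
        new EnumMap<State, Map<Event, IntFunction<State>>>(State.class);
    private State current = State.OFF;

    LightMachine() {
        // Single-arc style: the target state is fixed.
        addTransition(State.OFF, Event.SWITCH_ON, volts -> State.ON);
        addTransition(State.ON, Event.SWITCH_OFF, volts -> State.OFF);
        // Multiple-arc style: the hook inspects the event payload and chooses ON or BROKEN.
        addTransition(State.ON, Event.SURGE, volts -> volts > 250 ? State.BROKEN : State.ON);
    }

    private void addTransition(State from, Event on, IntFunction<State> hook) {
        Map<Event, IntFunction<State>> row = table.get(from);
        if (row == null) {
            row = new EnumMap<Event, IntFunction<State>>(Event.class);
            table.put(from, row);
        }
        row.put(on, hook);
    }

    // Applies one event; an unregistered (state, event) pair is rejected.
    State handle(Event event, int volts) {
        Map<Event, IntFunction<State>> row = table.get(current);
        IntFunction<State> hook = (row == null) ? null : row.get(event);
        if (hook == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = hook.apply(volts);
        return current;
    }

    public static void main(String[] args) {
        LightMachine light = new LightMachine();
        System.out.println(light.handle(Event.SWITCH_ON, 0));
        System.out.println(light.handle(Event.SURGE, 300));
    }
}
```

Note that BROKEN has no outgoing transitions, so any event delivered there throws. That is the same failure you will see in Tez when an event arrives in a state with no registered transition.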
Week 8 — VertexManager plugins
Reading
- tez-api/src/main/java/org/apache/tez/dag/api/VertexManagerPlugin.java, VertexManagerPluginContext.java.
- tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java.
Labs
- Lab 4.2 — VertexManager deep dive (the depth-bar lab).
Exit checkpoint
- A working CountingVertexManager with passing unit test, as specified in Lab 4.2.
Week 9 — Task and TaskAttempt
Reading
- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java.
- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java.
Labs
- Lab 4.3 — Task lifecycle walk.
- Lab 4.4 — TaskAttempt termination causes.
Exit checkpoint
- You can draw the TaskAttempt state machine from memory.
- You can list every TaskAttemptTerminationCause and what produces it.
Weeks 10–11: Level 5 — Runtime, IPO, and shuffle
Week 10 — Runtime task execution
Reading
- tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java.
- tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java.
Labs
- Lab 5.1 — Trace a task from container start to processor exit.
Exit checkpoint
- You can list every umbilical call a task makes during its lifetime (grep umbilical in tez-runtime-internals).
Week 11 — Shuffle and merge
Reading
- tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java.
- tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java.
- tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java.
Labs
- Lab 5.2: Spilled output inspection on MiniTezCluster.
- Lab 5.3: Force a fetch failure.
Exit checkpoint
- You can explain IFile framing in two paragraphs.
- You can name the three sorter implementations and when each is used.
Week 12: Level 6 — Scheduling and container reuse
Reading
- tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java.
- tez-dag/src/main/java/org/apache/tez/dag/app/rm/TaskSchedulerManager.java.
- tez-dag/src/main/java/org/apache/tez/dag/app/rm/container/AMContainerImpl.java.
JIRA practice queries
project = TEZ AND text ~ "container reuse" AND status in (Open, "In Progress")
Labs
- Lab 6.1 — Disable container reuse; measure latency cost.
- Lab 6.2: Read and explain tez.am.container.reuse.* configs.
Exit checkpoint
- You can list the four conditions under which a container is not reused.
Week 13: Level 7 — MapReduce compatibility and integrations
Reading
- tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInput.java.
- tez-mapreduce/src/main/java/org/apache/tez/mapreduce/output/MROutput.java.
- tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/map/MapProcessor.java.
Labs
- Lab 7.1: Submit a vanilla MR job via Tez (tez.lib.uris mode).
Exit checkpoint
- You can write a one-page essay on "what MRInput does that a plain LogicalInput does not."
Week 14: Level 8 — Production diagnostics
Reading
- tez-api/src/main/java/org/apache/tez/common/counters/TezCounters.java.
- tez-dag/src/main/java/org/apache/tez/dag/history/HistoryEventHandler.java.
- tez-plugins/tez-yarn-timeline-history/.
Labs
- Lab 8.1 — Read a real ATS event dump.
- Lab 8.2 — Trace a failure through the AM log + ATS + counters.
Exit checkpoint
- You can answer: "Why did vertex X fail?" given only an AM log and ATS dump.
Weeks 15–16: Capstone
Follow capstone/index.md start-to-finish:
- Issue selection (week 15, day 1–2).
- Reproduction → root cause (week 15, day 3–7).
- Implementation + tests (week 16, day 1–4).
- Patch submission + write-up (week 16, day 5–7).
Exit checkpoint
- A real patch attached to a real JIRA, with passing tests and a clear summary.
- A 1500–3000 word public write-up of the experience.
How to use this plan when you fall behind
- If you finish a week's reading but cannot pass the exit checkpoint, repeat the week. Do not advance.
- If a JIRA query returns no results, change the query. The dev community moves; labels and components shift.
- Skip a Level only if you can pass all exit checkpoints from previous Levels in one sitting.
Milestones: M1 Through M9
Milestones are the "what does mastery look like at this stage" checkpoints. Each milestone has:
- Expected completion — a calendar guideline.
- Skills you must demonstrate — 5–8 concrete abilities.
- Self-check questions — answer them out loud, without notes.
- 20-point rubric — five criteria, four points each.
- Pass threshold — minimum total to advance.
- Move to the next level when — the binary gate.
Pass thresholds are deliberately high. The point is competence, not throughput.
M1 — Orientation (end of Week 2)
You can read the Tez DAG API and explain what every method on DAG, Vertex,
and Edge does.
Skills
- Write a 3-vertex DAG end-to-end without consulting docs.
- Explain the three enums on EdgeProperty and pick the correct one for a given problem.
- Name the protobuf message that represents a DAG on the wire.
- Predict which built-in EdgeManager implementation will be selected for a given edge.
- Locate any class in the tez-api module by name within 30 seconds.
Self-check questions
- What is the difference between DataSourceDescriptor and a runtime Input?
- Why is DAG.verify() called before submission?
- Which class produces the protobuf DAGPlan?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| API fluency | Can name classes | Can describe responsibilities | Can write code from memory | Can predict behavior |
| Edge model | Confused | Knows enums | Picks correct edge type | Predicts EdgeManager impl |
| Reading speed | >5 min/file | ~3 min/file | ~1 min/file | scanning fluently |
| Mental model | Vague | Sketches DAG | Sketches DAG + edge types | Sketches DAG + edge types + plan flow |
| Communication | Cannot explain | Explains with notes | Explains without notes | Teaches another |
Pass threshold: 14/20, with no criterion below 2.
Move to Level 2 when: you can draft a new DAG class in 10 minutes from a verbal problem statement, on a whiteboard.
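The pass rule combines a total with a per-criterion floor, and the floor is easy to overlook when self-scoring. The sketch below applies both checks to a set of rubric scores. The class RubricScore and its method names are hypothetical. The threshold and floor are parameters because the milestones use different values.

```java
// Hypothetical self-assessment helper; scores are the 1-4 values from a milestone rubric.
class RubricScore {
    // Sums the scores, rejecting anything outside the rubric's 1-4 scale.
    static int total(int[] scores) {
        int sum = 0;
        for (int score : scores) {
            if (score < 1 || score > 4) {
                throw new IllegalArgumentException("score out of range 1-4: " + score);
            }
            sum += score;
        }
        return sum;
    }

    // Passes only if the total meets the threshold and no criterion is below the floor.
    static boolean passes(int[] scores, int threshold, int minPerCriterion) {
        for (int score : scores) {
            if (score < minPerCriterion) {
                return false;
            }
        }
        return total(scores) >= threshold;
    }

    public static void main(String[] args) {
        int[] m1 = {4, 3, 3, 2, 2}; // example self-assessment for M1
        System.out.println("total: " + total(m1) + ", passes M1: " + passes(m1, 14, 2));
    }
}
```

For milestones that state no floor, pass 1 as the minimum so that only the total is enforced.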
M2 — Build and Test Literacy (end of Week 4)
You can navigate the codebase, build it, and run any test by name.
Skills
- Run a single test in any module via mvn -pl <module> test -Dtest=Class#method.
- Add a new test file to tez-dag and have it picked up by Maven.
- Read TestVertexImpl and explain at least 10 individual test methods.
- Identify the module of a class given just its FQN (e.g., o.a.t.dag.app... → tez-dag).
- Build Tez from a clean checkout in under 5 minutes (with cached deps).
- Distinguish unit tests from MiniTezCluster-backed integration tests.
Self-check questions
- Why does tez-dag depend on tez-api and not the reverse?
- What is DrainDispatcher and why do tests use it?
- Where do MiniTezCluster tests live and what classpath do they need?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Build mastery | mvn install works | Can skip tests, profiles | Knows module deps | Diagnoses build failures |
| Test execution | Runs all tests | Runs a class | Runs a method | Runs cross-module |
| Test reading | Skims | Understands assertions | Understands setup | Recreates from scratch |
| Module map | Knows names | Knows top-level deps | Knows transitive deps | Diagnoses cycles |
| Tooling | IDE-only | CLI + IDE | CLI primary | CLI + scripting |
Pass threshold: 14/20.
Move to Level 3 when: you can clone Tez on a fresh laptop, build it, and run
a TestVertexImpl method by name within 15 minutes.
M3 — Submission and AM Bring-up (end of Week 6)
You can trace a DAG from TezClient.submitDAG() to DAGImpl.handle(...) inside
the AM.
Skills
- List the three local resources TezClientUtils uploads.
- Explain session vs non-session mode and the AM keep-alive mechanism.
- Name every AsyncDispatcher event-handler registered in DAGAppMaster.
- Locate the line of code where DAGImpl is constructed inside the AM.
- Read AM logs at DEBUG and map lines to source positions.
- Run MiniTezCluster in your tests and inspect AM logs.
Self-check questions
- What RPC does TezClient use to submit a DAG? Which protocol class?
- How does the AM stay alive between DAGs in a session?
- What happens if the AM dies during a DAG run with recovery disabled?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Submission path | Vague | Knows TezClient API | Knows RPC | Knows full byte path |
| AM bring-up | Cannot describe | Names dispatcher | Names handlers | Walks serviceInit |
| Session model | Confused | Knows the flag | Knows keep-alive | Knows timeouts |
| Log reading | Greps blindly | Greps with intent | Maps to code | Predicts log line |
| Recovery | Unknown | Aware | Knows config keys | Knows record format |
Pass threshold: 14/20.
Move to Level 4 when: you can answer "where in the AM does my DAG show up?" with a file:line citation.
M4 — State Machines and VertexManager (end of Week 9)
You can read and modify the vertex/task/attempt state machines.
Skills
- Write a small StateMachineFactory-based state machine from scratch.
- Add a transition to VertexImpl.stateMachineFactory and update tests in the same patch.
- Implement a custom VertexManagerPlugin with a unit test.
- Diagnose an InvalidStateTransitonException from a stack trace.
- Distinguish SingleArcTransition from MultipleArcTransition.
- Explain the dispatcher single-threading invariant.
Self-check questions
- Why must state-machine code be single-threaded? What breaks if not?
- What happens if you forget to register a transition for an event in a state?
- How does ShuffleVertexManager implement slow-start?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| State machine | Knows it exists | Can read transitions | Can add transition | Can refactor safely |
| Test discipline | None | Adds happy path | Adds happy + sad | Updates per transition |
| VertexManager | Knows interface | Implements minimal | Implements custom | Implements + tests |
| Concurrency | Confused | Knows the rule | Knows why | Can audit a PR |
| Debugging | Reads stack | Maps to source | Reproduces locally | Writes regression test |
Pass threshold: 16/20 — this is the first hard gate.
Move to Level 5 when: you have submitted (or at minimum drafted) a state machine change that compiles, with a passing test.
M5 — Runtime and Shuffle (end of Week 11)
You can read the runtime data path and explain spill, merge, and fetch.
Skills
- Walk a single task's lifecycle: container start → processor.run() → output close.
- Explain IFile framing and the difference between V1 and V2.
- Distinguish DefaultSorter, PipelinedSorter, and unordered output.
- Diagnose a fetcher failure from logs.
- Read ShuffleManager and explain its scheduling of fetchers.
- Explain combiners and where they run in the pipeline.
Self-check questions
- What umbilical RPCs does a task make during its run?
- Where is the spill threshold checked?
- What triggers a FAILED_FETCH event upstream?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Runtime path | Names classes | Walks happy path | Walks failure paths | Walks edge cases |
| IFile | Knows format | Reads with hexdump | Modifies safely | Diagnoses corruption |
| Sorter | Names them | Knows tradeoffs | Picks for workload | Tunes configs |
| Shuffle | Vague | Knows pull model | Knows scheduling | Knows backoff |
| Combiner | Aware | Knows when run | Implements one | Debugs incorrect output |
Pass threshold: 15/20.
Move to Level 6 when: you can intentionally produce a fetcher failure on
MiniTezCluster and explain every log line.
M6 — Scheduling and Container Reuse (end of Week 12)
You understand how Tez decides where tasks run.
Skills
- Read YarnTaskSchedulerService and explain its scheduling loop.
- List the conditions under which a container is/is not reused.
- Explain affinity, locality, and racks.
- Tune tez.am.container.reuse.* for a given workload.
- Diagnose "stuck" scheduling.
Self-check questions
- Why does Tez prefer to reuse containers over requesting new ones?
- What happens if tez.am.container.idle-release-timeout-min.millis is too low?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Reuse model | Aware | Knows conditions | Knows configs | Tunes for workload |
| Scheduling | Black box | Reads main loop | Reads matching | Reads + modifies |
| Locality | Aware | Knows hints | Knows fallback | Knows rack policy |
| Diagnostics | Guess-and-check | Reads AM logs | Reads + maps to code | Adds counters |
| YARN integration | Aware | Knows AMRM | Knows tokens | Knows failover |
Pass threshold: 14/20.
Move to Level 7 when: you can explain why container reuse is on by default and pick five workloads where you would tune it.
M7 — Integrations (end of Week 13)
You can read and modify the MapReduce shim and explain Hive-on-Tez at a high level.
Skills
- Write a DAG that uses MRInput reading from HDFS.
- Explain MROutput commit semantics.
- Sketch how Hive's TezTask builds a DAG.
- Identify which features Hive uses (custom edges, manager plugins, dynamic reconfig).
Self-check questions
- What does MROutput.commit() do, and what guarantees does it offer?
- Why does Hive use ROOT_INPUT_INITIALIZER_FAILED heavily in its bug fixes?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| MR shim | Knows existence | Reads MRInput | Reads + uses | Modifies safely |
| Commit | Aware | Knows semantics | Knows failure modes | Knows speculative cleanup |
| Hive lens | Aware | Reads TezTask | Reads + maps | Diagnoses cross-project bug |
| Cross-project | Confused | Knows boundaries | Picks the right list | Files bug correctly |
Pass threshold: 12/16 (only 4 criteria here).
Move to Level 8 when: you can read a Hive query plan and predict its DAG.
M8 — Production Diagnostics (end of Week 14)
You can debug a real Tez job failure given logs and an ATS dump.
Skills
- Read a Tez counters dump and find a bottleneck.
- Find a VertexImpl failure cause from AM logs in <5 minutes.
- Read ATS events and reconstruct a DAG timeline.
- Identify a stuck task vs a slow task vs a failed task from counters.
- Build a one-pager triage runbook for your team.
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Counters | Knows existence | Reads | Interprets | Tunes |
| Log triage | Greps | Maps to code | Maps to state | Predicts next event |
| ATS | Aware | Queries | Reads events | Cross-checks vs AM log |
| Runbook | None | Draft | Reviewed | Shipped to team |
| Speed | >30 min | ~15 min | <10 min | <5 min |
Pass threshold: 16/20.
Move to capstone when: you've helped someone (on chat, dev list, or internally) debug a real Tez issue successfully.
M9 — Capstone (end of Week 16)
You've shipped a patch.
Skills
- Selected an appropriate issue.
- Reproduced and root-caused.
- Implemented a fix with tests.
- Submitted a patch in the project's accepted format.
- Responded to at least one round of review feedback.
Rubric (20 points)
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Issue selection | Random | Scoped | Justified | Aligned to roadmap |
| Reproduction | None | Manual | Scripted | Added as a test |
| Root cause | Speculative | Localized | Cited | Explained in JIRA |
| Implementation | Compiles | Tests pass | Idiomatic | Minimal & focused |
| Submission | None | Draft | Submitted | Reviewed |
Pass threshold: 16/20, and the patch must compile and pass mvn verify on
the affected module.
Global Rubric (committer-readiness)
Use this every quarter, regardless of level, to self-assess.
| Dimension | 1 (Beginner) | 2 (Apprentice) | 3 (Practitioner) | 4 (Committer-ready) |
|---|---|---|---|---|
| Code | Reads | Modifies | Designs subsystem | Reviews others' changes |
| Testing | Runs tests | Adds tests | Writes regression suites | Drives test infra |
| Docs | Reads | Edits | Writes user-facing | Owns module-level docs |
| Integration | Single module | Cross-module | Cross-project (Hive) | Drives release decisions |
A committer-track contributor should be at level 3 on all four dimensions and level 4 on at least one. Aim for 3/3/3/3 → 4/3/3/4 by month 12 of focused contribution.
Level 1: Hadoop and Tez Foundation
This level establishes the technical baseline every subsequent level depends on. You will understand where Tez fits in the Hadoop ecosystem, successfully build the project from source, run the test suite, and execute your first Tez DAG in local mode.
Learning Objectives
By the end of Level 1 you must be able to:
- Explain where Apache Tez sits in the Hadoop ecosystem and why it exists
- Build Apache Tez from source using Maven, with and without tests
- Execute unit tests scoped to a single module and interpret the results
- Run a simple Tez DAG in local mode without a YARN cluster
- Locate any class mentioned in Levels 2–9 without using a search engine
- Articulate the difference between a MapReduce job and a Tez DAG at the execution model level
- Read TezConfiguration.java and find any configuration key by category
The Hadoop Ecosystem Context
Apache Tez lives inside the Hadoop ecosystem. Before touching a line of Tez code, build an accurate mental model of the stack:
┌─────────────────────────────────────────────────────┐
│ Apache Hive / Apache Pig / Cascading │ ← Query / scripting layer
├─────────────────────────────────────────────────────┤
│ Apache Tez │ ← DAG execution engine
├─────────────────────────────────────────────────────┤
│ Apache YARN │ ← Cluster resource management
├─────────────────────────────────────────────────────┤
│ Apache HDFS │ ← Distributed storage
└─────────────────────────────────────────────────────┘
YARN (Yet Another Resource Negotiator) manages cluster resources. It runs an
ApplicationMaster (AM) per application, allocates containers, and monitors health. Tez's
DAGAppMaster IS a YARN ApplicationMaster.
HDFS stores input, output, and recovery data. Tez keeps intermediate data on local disk or in memory and serves it to downstream tasks through shuffle; HDFS is used for the AM recovery log and for final output.
The Tez client submits a DAGAppMaster to YARN, and the AM then requests containers for task
execution. Tasks read inputs, execute processors, and write outputs, either to downstream
tasks via shuffle or to HDFS for final output.
MapReduce vs. Tez
| Aspect | MapReduce | Apache Tez |
|---|---|---|
| Execution model | Fixed: Map → Shuffle → Reduce | Arbitrary DAG of vertices |
| Multi-stage queries | Chain of separate MR jobs | Single DAG |
| Inter-stage data | Always written to HDFS | Pipelined or local disk |
| JVM startup | New JVM per task | Container reuse across tasks |
| Vertex types | Two (Map, Reduce) | Unlimited |
| Speculative execution | Yes | Yes (configurable per vertex) |
| Session support | No | Yes — TezClient session mode |
For a 10-stage Hive aggregation query, MapReduce requires 10 separate MR jobs with HDFS writes between every stage. Tez runs the same query as a single DAG — no HDFS round-trips between stages, containers reused across task waves, and pipeline-style data movement between compatible vertices.
Required Reading
Complete in this order before starting the labs:
| # | Resource | What to extract |
|---|---|---|
| 1 | README.md in the Tez repo root | Build commands, module overview |
| 2 | Tez architecture document | Original design intent, DAG model rationale |
| 3 | YARN Architecture | Container lifecycle, AM responsibilities |
| 4 | tez-api/src/main/java/org/apache/tez/client/TezClient.java | Class-level Javadoc only: understand session vs. non-session |
| 5 | tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | Skim all keys — understand the category groupings |
| 6 | tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java | End-to-end DAG construction and submission |
Note on reading strategy: In a mature Apache codebase, Javadoc is often the best documentation that exists. Class-level Javadoc on public API classes reflects decisions debated and agreed upon by committers. Read it seriously.
Source Code Areas to Inspect
Read these files before and after the labs. You are not modifying anything yet.
tez-api — Public API
| File | Why |
|---|---|
| client/TezClient.java | Entry point for all DAG submissions. Read create(), start(), submitDAG(). |
| dag/api/DAG.java | DAG construction API. Note addVertex(), addEdge(), addTaskLocalFiles(). |
| dag/api/Vertex.java | Vertex definition. Understand ProcessorDescriptor, parallelism, and VertexManagerPlugin. |
| dag/api/Edge.java | Edge definition. Understand EdgeProperty and DataMovementType. |
| dag/api/client/DAGClient.java | DAG monitoring. Understand getDAGStatus() and progress tracking. |
| dag/api/TezConfiguration.java | All Tez configuration keys. Every key is documented. |
| dag/api/EdgeProperty.java | Data movement type and scheduling type for edges. Fundamental to DAG design. |
tez-dag — Core Execution Engine
| File | Why |
|---|---|
| app/DAGAppMaster.java | The YARN ApplicationMaster. First read: just init() and start(). It is 5000+ lines. |
| app/dag/impl/DAGImpl.java | DAG state machine. Read the state/transition enum declarations at the top. |
| app/dag/impl/VertexImpl.java | Most complex class in the project. First read: state enum + handle() only. |
| app/dag/impl/TaskImpl.java | Task state machine. More tractable than VertexImpl. Read fully. |
| app/dag/impl/TaskAttemptImpl.java | TaskAttempt state machine. Read fully. |
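The descriptions in this table point at a recurring shape: an enum of states, an enum of events, and a handle() entry point that looks up the next state. The sketch below shows that shape in plain Java so the real classes are easier to scan on a first read. It is a toy under stated assumptions: the state and event names are hypothetical, far fewer than the real classes declare, and the EnumMap lookup is only a stand-in for the project's own transition machinery.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

// Toy illustration only: hypothetical states and events, not the ones Tez declares.
final class ToyTaskStateMachine {
    enum State { NEW, SCHEDULED, RUNNING, SUCCEEDED, FAILED }
    enum Event { SCHEDULE, LAUNCH, COMPLETE, FAIL }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<State, Map<Event, State>>(State.class);

    static {
        add(State.NEW, Event.SCHEDULE, State.SCHEDULED);
        add(State.SCHEDULED, Event.LAUNCH, State.RUNNING);
        add(State.SCHEDULED, Event.FAIL, State.FAILED);
        add(State.RUNNING, Event.COMPLETE, State.SUCCEEDED);
        add(State.RUNNING, Event.FAIL, State.FAILED);
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    private State current = State.NEW;

    State getState() {
        return current;
    }

    // Single entry point for events; rejects any event the table does not allow.
    State handle(Event event) {
        Map<Event, State> row =
            TRANSITIONS.getOrDefault(current, Collections.<Event, State>emptyMap());
        State next = row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        ToyTaskStateMachine task = new ToyTaskStateMachine();
        task.handle(Event.SCHEDULE);
        task.handle(Event.LAUNCH);
        System.out.println(task.handle(Event.COMPLETE)); // prints SUCCEEDED
    }
}
```

When reading the real classes, the useful question is the same one this toy answers: for a given state, which events are legal, and where does each one lead?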
tez-runtime-library — I/O Implementations
| File | Why |
|---|---|
| runtime/library/input/OrderedGroupedKVInput.java | Standard sorted shuffle input. Used by most Hive reduce operations. |
| runtime/library/output/OrderedPartitionedKVOutput.java | Standard sorted shuffle output. Paired with the above. |
| runtime/library/input/UnorderedKVInput.java | Broadcast input; data is not sorted. |
tez-examples — Reference Implementations
| File | Why |
|---|---|
| examples/OrderedWordCount.java | The canonical Tez DAG example. Read this completely. |
| examples/IntersectExample.java | Shows a 3-vertex DAG with a broadcast edge. |
Key Classes Quick Reference
| Class | Module | Package | Role |
|---|---|---|---|
| TezClient | tez-api | org.apache.tez.dag.api | Creates sessions, submits DAGs |
| DAG | tez-api | org.apache.tez.dag.api | Defines the computation graph |
| Vertex | tez-api | org.apache.tez.dag.api | One processing stage |
| Edge | tez-api | org.apache.tez.dag.api | Data connection between vertices |
| EdgeProperty | tez-api | org.apache.tez.dag.api | Data movement + scheduling type |
| ProcessorDescriptor | tez-api | org.apache.tez.dag.api | Which Processor class runs in a vertex |
| TezConfiguration | tez-api | org.apache.tez.dag.api | All Tez configuration keys |
| DAGAppMaster | tez-dag | org.apache.tez.dag.app | YARN ApplicationMaster |
| DAGImpl | tez-dag | org.apache.tez.dag.app.dag.impl | DAG state machine |
| VertexImpl | tez-dag | org.apache.tez.dag.app.dag.impl | Vertex state machine |
| TaskImpl | tez-dag | org.apache.tez.dag.app.dag.impl | Task state machine |
| TaskAttemptImpl | tez-dag | org.apache.tez.dag.app.dag.impl | TaskAttempt state machine |
| TezTaskRunner2 | tez-runtime-internals | org.apache.tez.runtime | Runs a task inside a container |
| OrderedWordCount | tez-examples | org.apache.tez.examples | Canonical DAG example |
JIRA Issue Categories for Level 1 Contributors
At this stage, focus exclusively on:
- Documentation: Javadoc typos, outdated parameter descriptions, missing @param or @return annotations, broken links in comments
- Test improvements: adding missing assertions to existing tests, improving test method naming, removing dead code from test classes
- Checkstyle violations: unused imports, line length violations, missing final keywords
How to find these:
- Go to Apache Tez JIRA
- Search: project = TEZ AND labels = "newbie" AND resolution = Unresolved
- Also scan: project = TEZ AND component = "Documentation" AND resolution = Unresolved
- Look at recently closed "trivial" issues to understand the standard for accepted patches
Warning: Do not pick up a JIRA issue and immediately upload a patch. Read all existing comments. If there is an active discussion or existing assignee, move on. Leave a comment saying you are investigating before you claim an issue.
Deliverables
You must demonstrate all of the following before advancing to Level 2:
- Successful mvn install -DskipTests output with no build failures
- At least one unit test class run successfully (e.g., TestDAGImpl)
- Successful local DAG execution ending with the status SUCCEEDED
- Ability to locate DAGAppMaster, TezClient, and OrderedGroupedKVInput from memory
- Written explanation (2–3 sentences) of why a Tez DAG is faster than chained MapReduce
- Written explanation of the difference between a YARN container and a Tez task
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Building with Java 17 against master | Compile errors or compatibility failures | Use Java 8 or Java 11; check <maven.compiler.source> in root pom.xml |
| Running mvn test on the full repository | Hours-long run including integration tests | Use -pl tez-dag -am to scope to one module |
| Ignoring TezConfiguration.java | Confusion about configuration keys throughout all levels | Skim the entire file; every key is documented |
| Skipping the YARN architecture doc | Confusion about what Tez owns vs. what YARN owns | YARN understanding is required from Level 3 onward |
| Trying to understand all of DAGAppMaster at once | Overwhelm (5000+ lines) | First pass: read only init() and start() |
| Reading Tez code without running it | Abstract understanding that does not transfer to debugging | Always run the code after reading it |
| Picking a JIRA issue without reading existing comments | Duplicate work; community friction | Read all comments; check assignee; leave a note before claiming |
How to Verify Success
# 1. Full build without tests
cd /path/to/tez
mvn install -DskipTests -q && echo "BUILD OK"
# 2. Unit test from tez-dag
mvn test -pl tez-dag -am -Dtest=TestDAGImpl -q
# 3. Local DAG run (from Lab 1.3)
# Expected final output line:
# DAG: [OrderedWordCount] finished with status: [SUCCEEDED]
Patch Profile: Level 1 Graduate
| Patch type | Example | Test requirement |
|---|---|---|
| Javadoc fix | Correcting a wrong @param description in TezClient | None — documentation only |
| Dead import removal | Remove unused import statement flagged by checkstyle | Run mvn checkstyle:check -pl <module> |
| Test assertion improvement | Add assertEquals to an existing test that only checks for no-exception | Run the test class |
| README update | Fix a broken Maven command in the build instructions | Manual verification |
You are not ready to submit: bug fixes in state machines, new features, performance patches, or changes to the shuffle path. Those require Levels 3–7.
Lab 1.1: Build Apache Tez from Source
Background
Apache Tez is a multi-module Maven project. Building from source is the mandatory first step for any contributor — you need the ability to make code changes, rebuild specific modules, and run tests against your local changes. This lab walks through the full build, from cloning to verifying artifacts.
Why This Lab Matters for Contributors
- You cannot submit a credible patch without first verifying it builds cleanly
- Knowing which Maven flags control which modules saves hours during development
- Understanding the build structure helps you scope test runs efficiently
- Build failures are sometimes real bugs — knowing a clean build baseline lets you detect regressions
Prerequisites
Verify before starting:
java -version # Must be Java 8 or Java 11
mvn -version # Must be Maven 3.6.3 or newer
git --version # Must be 2.x
Disk space: at least 10 GB free. The full build with tests generates large artifacts.
Memory: at least 8 GB RAM. The tez-dag unit tests can spike to 4 GB during parallel runs.
Step-by-Step Tasks
Step 1: Clone the Repository
git clone https://github.com/apache/tez.git
cd tez
The GitHub repository at https://github.com/apache/tez is a mirror of the canonical
Apache GitBox repository. For contribution purposes (submitting patches via JIRA), the
GitHub mirror is acceptable for development. The patch will be attached to the JIRA issue
rather than sent as a GitHub PR — this is Apache's traditional workflow.
Verify the remote:
git remote -v
# origin https://github.com/apache/tez.git (fetch)
# origin https://github.com/apache/tez.git (push)
Step 2: Inspect the Branch Structure
git branch -r | grep -v HEAD | sort
You will see branches like:
- origin/master: development trunk
- origin/branch-0.10: stable release branch
- origin/branch-0.9: older stable branch
For contributor work, use master unless you are reproducing an issue specific to a
release branch. Bug fixes for release branches are typically backported from master.
Check the current Hadoop dependency in pom.xml:
grep -m1 "hadoop.version" pom.xml
This tells you which Hadoop version Tez is built against. The default Hadoop version target controls which APIs are available.
Step 3: Full Build (Skip Tests)
mvn install -DskipTests -q
Expected duration: 5–15 minutes depending on hardware and Maven cache state.
The first run downloads all dependencies. With a warm Maven cache (~/.m2/repository),
subsequent builds skip the downloads and finish considerably faster.
What -DskipTests does:
Skips execution of the tests. Test classes are still compiled, so a compile error in test
code still fails the build; use -Dmaven.test.skip=true to skip test compilation as well.
Use -DskipTests for iterative development when you are not changing test code.
What -q does:
Suppresses INFO-level Maven output. Remove -q if you need to debug build failures.
When the build completes, you will see:
[INFO] BUILD SUCCESS
[INFO] Total time: X min Y s
If you see BUILD FAILURE, go to the Troubleshooting section below.
Step 4: Verify Build Artifacts
After a successful build, key JARs exist in each module's target/ directory:
find . -name "tez-dag-*.jar" -not -path "*/test-*" | grep -v sources
# Expected: ./tez-dag/target/tez-dag-<version>.jar
find . -name "tez-api-*.jar" -not -path "*/test-*" | grep -v sources
# Expected: ./tez-api/target/tez-api-<version>.jar
The assembled distribution tarball is built by a separate command:
mvn package -DskipTests -Pdist -q
ls tez-dist/target/*.tar.gz
This produces the full binary distribution used by HDP and other distributions.
Step 5: Build a Single Module
During development you will almost always build a single module to save time:
# Build only tez-dag and its dependencies
mvn install -DskipTests -pl tez-dag -am -q
# Build only tez-api (no dependencies needed — it has none in Tez)
mvn install -DskipTests -pl tez-api -q
-pl specifies the module path. -am (also-make) builds all upstream dependencies first.
This is the command you will run hundreds of times during contributor work.
Step 6: Configure IntelliJ IDEA
IntelliJ handles Maven multi-module projects natively.
- File → Open → select the tez/ directory (the one containing pom.xml)
- IntelliJ detects the Maven project and imports all modules
- When prompted, select the JDK that matches the build (Java 8 or Java 11)
- Wait for the initial index build to complete (2–5 minutes)
Verify the import worked:
- Open tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
- Ctrl+Click on any class reference; it should navigate correctly
- Open Find Class (Cmd+O / Ctrl+N) and search TestDAGImpl; it should find the test
Enable checkstyle integration:
- Install the CheckStyle-IDEA plugin (Settings → Plugins)
- Configure it to use src/config/checkstyle.xml in the Tez repo root
- This gives you real-time checkstyle feedback as you edit
Implementation Requirements
This lab has no code to implement. Deliverables are:
- A successful mvn install -DskipTests run (screenshot or terminal output)
- Identification of the Hadoop version Tez is built against
- Location of the tez-dag-<version>.jar artifact
- A working IntelliJ project that resolves all imports
Troubleshooting Common Build Failures
"Source/Target Java version mismatch"
error: Source option X is no longer supported. Use Y or later.
Cause: Your JAVA_HOME or java in PATH is the wrong version.
Fix:
export JAVA_HOME=$(/usr/libexec/java_home -v 11) # macOS
export PATH=$JAVA_HOME/bin:$PATH
java -version # verify
mvn install -DskipTests -q
"Cannot resolve dependency: org.apache.hadoop:..."
Cause: The required Hadoop version is not in Maven Central or your local cache.
Fix: Ensure Maven Central is reachable. If building offline, use an internal repository
mirror. On a clean machine with network access this should not occur.
"Killed" or "Out of Memory"
Cause: The JVM running Maven runs out of heap, or the operating system kills it for using too much memory.
Fix:
export MAVEN_OPTS="-Xmx4g"
mvn install -DskipTests -q
"ERROR: Failed to execute goal ... tez-tests"
Cause: The tez-tests module requires specific integration test infrastructure.
Fix: Build only the modules you need:
mvn install -DskipTests -pl tez-api,tez-dag,tez-runtime-library,tez-examples -am -q
Expected Output
[INFO] Reactor Summary:
[INFO] Apache Tez ......................................... SUCCESS [ 2.345 s]
[INFO] tez-api ............................................ SUCCESS [ 15.678 s]
[INFO] tez-dag ............................................ SUCCESS [ 45.123 s]
[INFO] tez-runtime-internals .............................. SUCCESS [ 12.456 s]
[INFO] tez-runtime-library ................................ SUCCESS [ 18.789 s]
[INFO] tez-mapreduce ...................................... SUCCESS [ 8.012 s]
[INFO] tez-examples ....................................... SUCCESS [ 5.234 s]
...
[INFO] BUILD SUCCESS
Stretch Goals
- Build against a specific Hadoop version by overriding the hadoop.version property:
  mvn install -DskipTests -Dhadoop.version=3.3.6 -q
- Inspect the effective POM for tez-dag to see all inherited dependency versions:
  mvn help:effective-pom -pl tez-dag | grep -A3 "dependency>"
- Identify which modules depend on tez-api by inspecting all pom.xml files:
  grep -r "tez-api" */pom.xml | grep "artifactId"
Related Real-World Issue Types
- Build breakage issues (e.g., dependency version conflicts) — you can observe but not fix at Level 1
- Java version compatibility issues — important context when reading bug reports
Lab 1.2: Run Unit and Integration Tests
Background
Apache Tez has a well-structured test suite that spans unit tests, module-level integration
tests, and full cluster integration tests using MiniTezCluster. Understanding how to run
specific tests, read failures, and scope test execution is essential for contributor work —
your patch must include a passing test run before upload.
Why This Lab Matters for Contributors
- You must run tests before submitting any patch
- Being able to run a single test class in seconds makes iteration fast
- Understanding test failure output is the first step to debugging
- Many flaky tests are contributor opportunities once you understand how tests work
How Tez Tests Are Organized
Tez tests fall into three categories:
| Category | Location | Runs with | Scope |
|---|---|---|---|
| Unit tests | src/test/java/ in each module | mvn test -pl <module> | Fast, no cluster |
| Module integration tests | tez-tests/src/test/java/ | mvn test -pl tez-tests | Requires MiniTezCluster |
| System tests | Manual / CI scripts | Requires full cluster | Not run locally |
For Level 1–3 work, focus exclusively on unit tests.
Key unit test classes in tez-dag (path: tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/):
| Test Class | What it Tests |
|---|---|
| TestDAGImpl | DAGImpl state machine transitions, initialization, completion |
| TestVertexImpl | VertexImpl state machine; the most complex test class in the project |
| TestTaskImpl | TaskImpl state machine transitions |
| TestTaskAttemptImpl | TaskAttemptImpl state transitions, speculation, failure handling |
Supporting test infrastructure in tez-dag/src/test/java/org/apache/tez/dag/app/:
| Class | Role |
|---|---|
| MockDAGAppMaster | A reduced AM for unit testing; no YARN connection needed |
| MockAppContext | Mock AppContext that provides state to state machine tests |
| MockHistoryEventHandler | No-op history handler for tests that don't test history |
Step-by-Step Tasks
Step 1: Run All Unit Tests in tez-dag
cd /path/to/tez
mvn test -pl tez-dag -am -q 2>&1 | tail -30
Expected duration: 3–8 minutes depending on hardware.
Expected completion:
[INFO] Tests run: NNNN, Failures: 0, Errors: 0, Skipped: NN
[INFO] BUILD SUCCESS
Some tests are marked @Ignore or skipped due to environment constraints, so a non-zero Skipped count is normal.
Step 2: Run a Single Test Class
mvn test -pl tez-dag -am -Dtest=TestDAGImpl -q
Expected output (last few lines):
[INFO] Tests run: 42, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: X.XXX s
[INFO] BUILD SUCCESS
If a test fails, you will see:
[ERROR] Tests run: 42, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: X.XXX s
[ERROR] testDAGCreation(org.apache.tez.dag.app.dag.impl.TestDAGImpl): expected:<...> but was:<...>
Step 3: Run a Single Test Method
mvn test -pl tez-dag -am -Dtest=TestDAGImpl#testDAGCreation -q
This is the command you will use most often: run exactly one test after a code change to verify your fix.
Step 4: Read the Surefire Report
Maven writes detailed test results to:
tez-dag/target/surefire-reports/
For a failing test, read the .txt file for the test class:
cat tez-dag/target/surefire-reports/org.apache.tez.dag.app.dag.impl.TestDAGImpl.txt
This contains the full stack trace, which is often more informative than the Maven console output.
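When checking many report files, it can help to pull the counts out of the summary line programmatically. The following is a small sketch in plain Java, assuming the summary line has the "Tests run: N, Failures: N, Errors: N, Skipped: N" shape shown in the console output above; the class and method names are hypothetical helpers, not part of Surefire or Tez.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading a Surefire-style summary line.
final class SurefireSummarySketch {
    private static final Pattern SUMMARY = Pattern.compile(
        "Tests run: (\\d+), Failures: (\\d+), Errors: (\\d+), Skipped: (\\d+)");

    // Returns {run, failures, errors, skipped}, or null if the line is not a summary line.
    static int[] parse(String line) {
        Matcher m = SUMMARY.matcher(line);
        if (!m.find()) {
            return null;
        }
        int[] counts = new int[4];
        for (int i = 0; i < counts.length; i++) {
            counts[i] = Integer.parseInt(m.group(i + 1));
        }
        return counts;
    }

    // A run is clean when the line parses and reports no failures and no errors.
    static boolean isClean(String line) {
        int[] counts = parse(line);
        return counts != null && counts[1] == 0 && counts[2] == 0;
    }

    public static void main(String[] args) {
        String failing = "[ERROR] Tests run: 42, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.234 s";
        System.out.println(isClean(failing)); // prints false
    }
}
```

Skipped tests are deliberately not treated as a problem here, matching the earlier note that a non-zero Skipped count is normal.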
Step 5: Run Tests in tez-api
mvn test -pl tez-api -q
tez-api tests are faster and simpler. Key test classes:
| Test Class | What it Tests |
|---|---|
| TestDAG | DAG API construction, validation, serialization |
| TestVertex | Vertex API construction and edge validation |
| TestTezClient | TezClient initialization and session management |
| TestAMControl | AM communication protocol |
Step 6: Run Tests in tez-runtime-library
mvn test -pl tez-runtime-library -am -q
This includes shuffle and I/O tests. Expected duration: 5–10 minutes.
Key test classes:
| Test Class | What it Tests |
|---|---|
| TestOrderedPartitionedKVWriter | Sorted KV output serialization |
| TestFetcher | Shuffle fetch logic |
| TestShuffleScheduler | Fetch scheduling and retry |
| TestTezMerger | Sort-merge implementation |
Step 7: Understand a Test Failure
Intentionally break a test to understand failure output:
- Open tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java
- Find the getTotalVertices() method
- Replace the method body with return 0; (adding it as an extra first line would leave unreachable code and fail compilation)
- Run mvn test -pl tez-dag -am -Dtest=TestDAGImpl -q
- Read the failure output in both the console and the surefire report
- Revert the change with git checkout tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java
This exercise makes test failure output familiar before you encounter a real failure.
Debugging Test Failures
Adding Log Output
Tez uses SLF4J + Log4j. To enable debug-level logging during a test run:
mvn test -pl tez-dag -am -Dtest=TestDAGImpl \
-Dlog4j.configuration=file:src/test/resources/log4j.properties \
-Dlog4j.logger.org.apache.tez=DEBUG
Running Tests with Remote Debug (IntelliJ)
To attach a debugger to a Maven test run:
mvn test -pl tez-dag -am -Dtest=TestDAGImpl \
-Dmaven.surefire.debug="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
In IntelliJ: create a Remote JVM Debug run configuration for localhost, port 5005, and start it. The test JVM pauses until the debugger connects.
Testing Checklist
Before submitting any patch:
- Run mvn test -pl <changed-module> -am with zero failures
- If adding a new test: mvn test -pl <module> -am -Dtest=<YourNewTest> passes
- Run mvn checkstyle:check -pl <changed-module> with zero violations
- If the change touches shuffle or I/O: run mvn test -pl tez-runtime-library -am
Expected Output
A clean test run for TestDAGImpl:
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.tez.dag.app.dag.impl.TestDAGImpl
[INFO] Tests run: 42, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.345 s
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 42, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] BUILD SUCCESS
Stretch Goals
- Find all test classes in tez-dag that test the VertexImpl state machine:
  find tez-dag/src/test -name "*.java" | xargs grep -l "VertexImpl"
- Count the total number of test methods in TestVertexImpl:
  grep -c "@Test" tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
- Identify which test classes take the longest to run by examining the elapsed times in the surefire reports:
  grep "Time elapsed" tez-dag/target/surefire-reports/*.txt | awk -F'Time elapsed: ' '{print $2 "\t" $1}' | sort -rn | head -10
- Find tests that use MockDAGAppMaster to understand the test infrastructure pattern:
  grep -rl "MockDAGAppMaster" tez-dag/src/test/
Related Real-World Issue Types
- Flaky tests (timing-dependent, environment-dependent) — a major contributor opportunity
- Tests that don't assert anything meaningful — test quality improvements
- Missing test coverage for error paths — discoverable by reading state machine code
Lab 1.3: Run a Simple Tez DAG Locally
Background
Apache Tez supports a local mode that runs the entire DAG execution inside a single JVM without YARN or HDFS. This is the primary environment for rapid development and testing. Understanding how to run a DAG in local mode is essential before attempting cluster testing.
The tez-examples module contains reference DAG implementations. OrderedWordCount is the
canonical example: it reads text, counts word occurrences, and sorts by frequency. It
demonstrates the complete Tez DAG API: TezClient, DAG, Vertex, Edge, and I/O processors.
Why This Lab Matters for Contributors
- Local mode is how you verify behavior changes without a cluster
- All integration test work in tez-tests builds on the same local mode infrastructure
- Understanding how a real DAG is constructed gives concrete context for reading state machine code
- Every DAG execution produces log output that teaches you about the AM lifecycle
Understanding Tez Local Mode
Tez local mode is enabled by setting tez.local.mode=true in the TezConfiguration. When
this is set:
- No YARN cluster is contacted
- No containers are launched — task execution happens in threads within the same JVM
- LocalClient (in tez-dag, package org.apache.tez.client) runs the DAGAppMaster in a thread inside the client JVM instead of submitting it to YARN
- HDFS is replaced by the local filesystem (configurable)
Key configuration for local mode:
TezConfiguration tezConf = new TezConfiguration();
tezConf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, true);
// Use local filesystem instead of HDFS
tezConf.set("fs.defaultFS", "file:///");
tezConf.setBoolean("tez.local.mode.without.network", true);
Anatomy of OrderedWordCount
Before running the example, read
tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java.
The DAG structure:
[Tokenizer Vertex]
|
| (SCATTER_GATHER edge — partitioned by hash, sorted)
v
[SumReducer Vertex]
|
| (SCATTER_GATHER edge — partitioned by value for sort)
v
[Sorter Vertex] → HDFS output
Tokenizer: Reads input text lines, splits into words, emits (word, 1) pairs.
Processor class: TokenProcessor (inner class in OrderedWordCount)
SumReducer: Receives (word, [1, 1, 1, ...]) groups, sums counts, emits (word, count).
Processor class: SumProcessor (inner class in OrderedWordCount)
Sorter: Receives (count, word) pairs, with key and value swapped relative to the previous stage so that the count is the sort key, and emits the sorted output.
Processor class: NoOpSorter — uses OrderedGroupedKVInput to do the sort during shuffle
The key insight: Tez uses edge properties and I/O processor configuration to control the
sort and partition behavior. The Sorter vertex does not sort — the shuffle/merge into
OrderedGroupedKVInput does the sorting.
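The division of labor described above can be simulated in plain Java without any Tez classes. In this sketch a TreeMap stands in for the sorted shuffle, so the final stage only writes what it is handed. The class and method names are hypothetical, and the descending comparator is an assumption chosen to match the frequency-descending output this lab expects; the order of words that share a count is alphabetical here and may differ in a real run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java simulation of the three stages; no Tez classes are used.
final class WordCountStagesSketch {

    // Stage 1 (Tokenizer): split lines into words; each word stands for a (word, 1) pair.
    static List<String> tokenize(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
        }
        return words;
    }

    // Stage 2 (SumReducer): sum the 1s for each word.
    static Map<String, Integer> sum(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Stand-in for the second shuffle: records keyed by count arrive already ordered.
    // Descending order is an assumption made to match the lab's expected output.
    static TreeMap<Integer, TreeSet<String>> shuffleByCount(Map<String, Integer> counts) {
        TreeMap<Integer, TreeSet<String>> sorted =
            new TreeMap<>(Comparator.<Integer>reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sorted.computeIfAbsent(e.getValue(), k -> new TreeSet<String>()).add(e.getKey());
        }
        return sorted;
    }

    // Stage 3 (Sorter): performs no sorting; it only writes records in arrival order.
    static List<String> emit(TreeMap<Integer, TreeSet<String>> sortedInput) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, TreeSet<String>> e : sortedInput.entrySet()) {
            for (String word : e.getValue()) {
                out.add(word + "\t" + e.getKey());
            }
        }
        return out;
    }

    static List<String> run(List<String> lines) {
        return emit(shuffleByCount(sum(tokenize(lines))));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "the quick brown fox jumps over the lazy dog",
            "the dog barked at the fox",
            "quick brown dog");
        run(lines).forEach(System.out::println);
    }
}
```

Note that emit() contains no comparison logic at all, which is the point: the ordering is a property of the data structure feeding it, just as the Sorter vertex relies on its input.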
Step-by-Step Tasks
Step 1: Prepare Sample Input
mkdir -p /tmp/tez-lab/input
cat > /tmp/tez-lab/input/words.txt << 'EOF'
the quick brown fox jumps over the lazy dog
the dog barked at the fox
quick brown dog
EOF
Step 2: Build tez-examples
cd /path/to/tez
mvn package -DskipTests -pl tez-examples -am -q
Locate the examples JAR:
ls tez-examples/target/tez-examples-*.jar | grep -v sources | grep -v tests
Step 3: Run OrderedWordCount in Local Mode
The example is run as a standard Java main class:
# Set classpath to include Tez JARs.
# Java expands a trailing /* to every JAR in that directory. A pattern such as
# tez-api-*.jar is not expanded inside a quoted classpath, so use target/* instead.
TEZ_HOME=/path/to/tez
CLASSPATH=\
$TEZ_HOME/tez-examples/target/*:\
$TEZ_HOME/tez-api/target/*:\
$TEZ_HOME/tez-dag/target/*:\
$TEZ_HOME/tez-runtime-library/target/*:\
$TEZ_HOME/tez-runtime-internals/target/*:\
$TEZ_HOME/tez-mapreduce/target/*:\
$TEZ_HOME/tez-common/target/*
# Add Hadoop JARs (required for FileSystem, Configuration, etc.)
# If Hadoop is installed:
CLASSPATH=$CLASSPATH:$(hadoop classpath)
# If not, add from Maven local cache manually
java -cp "$CLASSPATH" \
org.apache.tez.examples.OrderedWordCount \
/tmp/tez-lab/input \
/tmp/tez-lab/output \
1
Tip: The easiest way to handle classpaths during development is to use Maven's exec:java goal or to build a fat JAR using the shade plugin. The tez-dist assembly includes all JARs and the bin/ scripts handle classpath setup.
Step 4: Run with a Helper Script or Explicit Local-Mode Options (simpler)
If you have Hadoop installed and HADOOP_HOME set, use the Tez examples shell script:
cd $TEZ_HOME
bin/tez-examples.sh OrderedWordCount \
/tmp/tez-lab/input \
/tmp/tez-lab/output \
1
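Or pass the local-mode settings as Hadoop generic options. They go after the class name, where ToolRunner parses them into the Configuration; a -D placed before -cp only sets a JVM system property, which Configuration does not read as a config value:
java -cp "$CLASSPATH" \
org.apache.tez.examples.OrderedWordCount \
-Dtez.local.mode=true \
-Dfs.defaultFS=file:/// \
/tmp/tez-lab/input \
/tmp/tez-lab/output \
1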
Step 5: Verify Output
cat /tmp/tez-lab/output/part-*
Expected output (sorted by frequency descending):
the 4
dog 3
fox 2
quick 2
brown 2
...
Step 6: Read the Execution Log
Examine the log output from the run. Key lines to understand:
INFO TezClient: Submitting DAG to YARN, queueName=...
INFO DAGAppMaster: Running DAG: [OrderedWordCount]
INFO VertexImpl: Vertex: [Tokenizer] initialized
INFO VertexImpl: Vertex: [Tokenizer] started
INFO DAGImpl: DAG: [OrderedWordCount] finished with status: [SUCCEEDED]
These lines correspond directly to state machine transitions you will study in Level 4. For each log line, identify the state transition it represents.
Implementation Requirements
Modify OrderedWordCount to add a fourth vertex that filters out words with count < 2:
- Add a new Vertex named "Filter" after SumReducer and before Sorter
- Write a minimal FilterProcessor extends AbstractProcessor:
  - In run(): iterate the input, skip pairs where the count value < 2, forward the rest
- Add the edges SumReducer → Filter and Filter → Sorter
- Run the modified DAG and verify that single-occurrence words are removed from output
This exercise teaches you:
- How to add a vertex to an existing DAG
- How to write a minimal Processor implementation
- How edges connect processors
Do not overthink the implementation — the processor body is ~20 lines.
Debugging Checklist
If the DAG fails with DAG status: FAILED:
- Read the log for ERROR lines; they contain the failure reason and task attempt ID
- Check the DAGAppMaster log for VertexImpl: Vertex [...] failed
- The error message will include the class and method where the exception occurred
- Common causes:
  - Classpath missing a required JAR (NoClassDefFoundError)
  - Output directory already exists (FileAlreadyExistsException)
  - Wrong input path (FileNotFoundException)
Clean output directory before re-running:
rm -rf /tmp/tez-lab/output
Expected Output
A successful run ends with:
INFO DAGImpl: DAG: [OrderedWordCount] finished with status: [SUCCEEDED]
INFO TezClient: Shutting down TezSession...
Stretch Goals
- Enable INFO-level logging for org.apache.tez.dag.app.dag.impl and observe vertex state transitions in the console output during the DAG run.
- Modify the DAG to use UnorderedKVInput / UnorderedKVOutput instead of the ordered pair for the first edge. Observe the difference in output ordering.
- Change the parallelism of the Sorter vertex to 2 and observe the output directory structure (2 part files instead of 1).
- Add a timer around the TezClient.submitDAG() → DAGClient.waitForCompletion() block and measure execution time for different input sizes.
Related Real-World Issue Types
- Local mode-specific bugs (different from cluster mode) — contributor opportunity
- DAG API usability issues — often exposed by example code
- Local mode configuration issues — often reported by new users
Lab 1.4: Project — Number Pipeline DAG
What You Are Building
A self-contained, runnable Java project that builds and executes a 3-vertex Tez DAG entirely in local mode — no YARN cluster, no HDFS, no Docker required.
Generator (2 tasks)
│ SCATTER_GATHER shuffle
▼
Multiplier (2 tasks) [value * 2]
│ SCATTER_GATHER shuffle
▼
Sink (1 task) [sum → counter]
Numbers 0–99 flow through the pipeline. The expected final sum is:
sum(0..99) * 2 = 4950 * 2 = 9900.
This pipeline intentionally mirrors the structure of Apache Tez's own
OrderedWordCount example but with an integer domain so the math is verifiable
without a corpus.
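The expected sum can be checked with a plain-Java model of the same flow before any Tez code is involved. This is a sketch under stated assumptions: the class and method names are hypothetical, the even split of the range across generator tasks is an illustrative guess at how the work might be divided, and no shuffle or Tez API is modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the Generator -> Multiplier -> Sink flow; no Tez classes.
final class NumberPipelineSketch {

    // Each generator task emits its own slice of 0..total-1.
    // The even split by task index is an assumption for illustration.
    static List<Integer> generate(int taskIndex, int numTasks, int total) {
        int perTask = total / numTasks;
        int start = taskIndex * perTask;
        int end = (taskIndex == numTasks - 1) ? total : start + perTask;
        List<Integer> out = new ArrayList<>();
        for (int n = start; n < end; n++) {
            out.add(n);
        }
        return out;
    }

    // Multiplier: scale every value by the given factor.
    static List<Integer> multiply(List<Integer> values, int factor) {
        List<Integer> out = new ArrayList<>(values.size());
        for (int v : values) {
            out.add(v * factor);
        }
        return out;
    }

    // Sink: sum everything it receives.
    static long sink(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static long run(int generatorTasks, int total, int factor) {
        List<Integer> all = new ArrayList<>();
        for (int t = 0; t < generatorTasks; t++) {
            all.addAll(generate(t, generatorTasks, total));
        }
        return sink(multiply(all, factor));
    }

    public static void main(String[] args) {
        System.out.println(run(2, 100, 2)); // prints 9900
    }
}
```

The result does not depend on how many generator tasks share the range, which is worth keeping in mind when the parallelism is changed later in the lab.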
Project Location
book/projects/
├── pom.xml ← parent; sets Tez + Hadoop versions
└── level-1-number-pipeline/
├── pom.xml
└── src/main/java/org/apache/tez/learning/l1/
├── GeneratorProcessor.java ← no inputs; emits integers
├── MultiplierProcessor.java ← one input, one output; value * 2
├── SinkProcessor.java ← sums values; publishes counter
├── FilterProcessor.java ← exercise stub (incomplete)
└── NumberPipelineDAG.java ← main class; configures + submits DAG
Prerequisites
- Completed Lab 1.1 (Apache Tez built from source with mvn install -DskipTests)
- Java 8+ on $PATH
- Maven 3.6+ on $PATH
Step 1: Set the Tez Version
The parent pom.xml needs to reference the exact version that mvn install
installed into your local ~/.m2 repository. Find it:
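# Inside your apache/tez clone. This prints the project version itself;
# the first <version> tag in pom.xml may belong to the parent POM instead.
mvn help:evaluate -Dexpression=project.version -q -DforceStdout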
Open book/projects/pom.xml and set <tez.version> to match:
<tez.version>0.10.3-SNAPSHOT</tez.version> <!-- adjust to your build -->
Step 2: Compile
cd /path/to/opensource-engineer-and-contributor/apache-tez/book/projects
# Build only the level-1 module (fast; skips the other modules)
mvn -pl level-1-number-pipeline package -q
You should see no errors. The fat JAR is at:
level-1-number-pipeline/target/level-1-number-pipeline-1.0-SNAPSHOT-jar-with-dependencies.jar
If you see Could not resolve dependency org.apache.tez:tez-api:
- Verify that tez.version matches the version in ~/.m2/repository/org/apache/tez/tez-api/
- Re-run mvn install -DskipTests in your Tez clone
Step 3: Run
java -jar level-1-number-pipeline/target/level-1-number-pipeline-1.0-SNAPSHOT-jar-with-dependencies.jar
Expected output (log lines abbreviated):
TezClient started (local mode).
Submitting DAG...
[SinkProcessor] task=0 partialSum=9900
=== NumberPipeline Result ===
Expected : 9900
Actual : 9900
Result : PASS
Note: You will see a large number of INFO log lines from the Tez framework. This is normal for local mode. The important lines are the ones from [SinkProcessor] and the final === Result === block.
Step 4: Read Every Source File
Before modifying anything, read each Java file carefully.
GeneratorProcessor.java
Key questions:
- Which Tez interface does it implement?
- Why is output.start() called before getWriter()? What happens if you remove it?
- How does the processor know which range of numbers to generate? What Tez API provides this?
- The key and value written are both the same integer n. Why? When would you want them to differ?
MultiplierProcessor.java
Key questions:
- OrderedGroupedKVInput vs OrderedPartitionedKVOutput: which side is the input and which is the output? Why are they named differently?
- Both input.start() and output.start() are called. What does input.start() actually trigger? (Hint: look at OrderedGroupedKVInput.start() in the Tez source.)
- FACTOR = 2 is hardcoded. The Javadoc explains how to pass it via UserPayload. What is the size in bytes of an int encoded in a ByteBuffer?
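For the payload question, the byte-level round trip can be tried in isolation with java.nio alone. The sketch below only covers encoding an int into a byte array and reading it back; wrapping the bytes in a UserPayload and attaching them to a descriptor is left out, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Round trip of a single int through a byte array, as a payload would carry it.
final class IntPayloadSketch {

    // Encode: a ByteBuffer sized for exactly one int (big-endian by default).
    static byte[] encode(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    // Decode: wrap the bytes and read the int back from position 0.
    static int decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        byte[] payload = encode(2);
        System.out.println(payload.length + " bytes, decoded=" + decode(payload));
        // prints: 4 bytes, decoded=2
    }
}
```

The decode side has the same shape as the ByteBuffer.wrap(bytes).getInt() call used when a processor reads its configuration in initialize().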
SinkProcessor.java
Key questions:
- What is the type of getContext().getCounters()?
- findCounter(group, name): what happens if the counter doesn't exist yet when first called?
- There is only one Sink task (parallelism=1). If you changed it to 2, would the counter still be correct? Why?
NumberPipelineDAG.java
Key questions:
- What does tez.local.mode=true actually change about task execution?
- OrderedPartitionedKVEdgeConfig.newBuilder(keyClass, valueClass, partitionerClass): what is HashPartitioner doing here, and where does the partition count come from?
- dagClient.waitForCompletion(): does this block on the calling thread, or is it async?
- EnumSet.of(StatusGetOpts.GET_COUNTERS): why is this extra call needed? Why aren't counters always included in DAGStatus?
Step 5: Break It and Understand It
Make each change, run the JAR, observe the failure, then revert.
Break 1: Remove output.start()
In GeneratorProcessor.run(), comment out logicalOutput.start().
Expected: NullPointerException or IllegalStateException from the Tez runtime when
getWriter() is called on an uninitialized output.
Why this matters: Tez I/O objects are lazily initialized. The start() method triggers
buffer allocation, sort buffer setup, and (for inputs) the shuffle fetch. Forgetting
start() is a common first patch mistake.
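The failure mode can be modelled without Tez at all. The sketch below is a toy stand-in, not Tez code: ToyOutput and its methods are hypothetical names, and the real runtime may surface a different exception type. It only shows why an object whose buffers are allocated in start() fails when a write arrives first.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an output that must be started before use; not Tez code. */
class LazyOutputDemo {

  /** Hypothetical stand-in for a Tez output whose buffer is allocated lazily. */
  static final class ToyOutput {
    private List<int[]> buffer; // stays null until start() runs

    void start() {
      if (buffer == null) {
        buffer = new ArrayList<>();
      }
    }

    void write(int key, int value) {
      if (buffer == null) {
        throw new IllegalStateException("write() called before start()");
      }
      buffer.add(new int[] {key, value});
    }

    int size() {
      return buffer == null ? 0 : buffer.size();
    }
  }

  /** Returns the simple name of the exception raised when start() is skipped. */
  static String failureWithoutStart() {
    ToyOutput out = new ToyOutput();
    try {
      out.write(1, 1);
      return "none";
    } catch (IllegalStateException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println("without start(): " + failureWithoutStart());
    ToyOutput out = new ToyOutput();
    out.start();
    out.write(1, 1);
    System.out.println("with start(): records=" + out.size());
  }
}
```

When you run the real experiment, compare the exception you actually get with this model and note where in the Tez stack trace the uninitialized state is detected.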
Break 2: Set the wrong parallelism
Change sink parallelism from 1 to 3, run again.
Observe: does the result change? Is it still 9900? Why or why not?
Expected: the total counter is still 9900, because each Sink task emits a partial sum and the AM aggregates counters across all tasks automatically.
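The arithmetic behind that expectation can be checked in isolation. This is a simplified model, not the Tez counter API: it assumes the generator emits 0 to 99, the multiplier doubles each value, and records are routed to sink tasks by a hash of the value (the routing rule is an assumption made for illustration). Summing the per-task partial sums plays the role of AM-side counter aggregation.

```java
/** Simplified model of per-task partial sums; not the Tez counter API. */
class CounterAggregationDemo {

  /** Splits the doubled values across numSinkTasks and adds up the partial sums. */
  static long aggregatedTotal(int numSinkTasks) {
    long[] partial = new long[numSinkTasks];
    for (int n = 0; n < 100; n++) {
      int value = n * 2;
      // Assumed routing rule: non-negative hash modulo the task count.
      int task = (value & Integer.MAX_VALUE) % numSinkTasks;
      partial[task] += value;
    }
    long total = 0;
    for (long p : partial) {
      total += p;
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println("1 sink task : " + aggregatedTotal(1));
    System.out.println("3 sink tasks: " + aggregatedTotal(3));
  }
}
```

Both lines print 9900: how the records are split changes each partial sum but not their total.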
Break 3: Use a fixed key in the Generator
Change writer.write(new IntWritable(n), new IntWritable(n)) to
writer.write(new IntWritable(0), new IntWritable(n)) (fixed key = 0).
Expected: all values route to the same Multiplier task (the one that owns partition 0).
The other Multiplier task gets no work. The result is still 9900 (correct) but the work
distribution is skewed. You can verify this by adding a counter in MultiplierProcessor
that tracks how many records each task processed.
Why this matters: key-skew (many records with the same key) is one of the most common Tez/MapReduce performance problems. This exercise makes it visible.
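The skew is easy to see in a small model. The sketch below assumes a hash partitioner of the form (hash & Integer.MAX_VALUE) % numPartitions, with the integer key used as its own hash; it is an illustration of the routing rule, not the Tez or Hadoop class itself.

```java
import java.util.Arrays;

/** Models how keys map to partitions; an illustration, not the real partitioner class. */
class KeySkewDemo {

  static int partitionFor(int key, int numPartitions) {
    return (key & Integer.MAX_VALUE) % numPartitions;
  }

  /** Counts records per partition for 100 generated records. */
  static int[] recordsPerPartition(boolean fixedKey, int numPartitions) {
    int[] counts = new int[numPartitions];
    for (int n = 0; n < 100; n++) {
      int key = fixedKey ? 0 : n; // Break 3 replaces the key n with the constant 0
      counts[partitionFor(key, numPartitions)]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println("distinct keys: " + Arrays.toString(recordsPerPartition(false, 2)));
    System.out.println("fixed key    : " + Arrays.toString(recordsPerPartition(true, 2)));
  }
}
```

With distinct keys the two partitions receive 50 records each; with the fixed key, partition 0 receives all 100 and partition 1 receives none.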
Step 6: Add a FilterProcessor (Exercise)
Open FilterProcessor.java. This is the skeleton for your exercise.
Your task: Insert a FilterProcessor between Multiplier and Sink that drops all
values not divisible by 4, then verify the new expected sum.
Step 6a: Implement FilterProcessor
- Add a private int threshold field.
- In initialize(), read the threshold from UserPayload:
  byte[] bytes = getContext().getUserPayload().deepCopyAsArray();
  this.threshold = ByteBuffer.wrap(bytes).getInt();
- In run(), replace if (true) with if (value.get() % threshold == 0).
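The payload encoding used here can be exercised with the JDK alone. The sketch below uses only java.nio.ByteBuffer; the class and method names are made up for the example. It shows that an int occupies 4 bytes and that the buffer must be flipped before its contents are read back.

```java
import java.nio.ByteBuffer;

/** Round-trips an int through a byte array, as a payload would carry it. */
class IntPayloadDemo {

  static byte[] encode(int value) {
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES);
    buf.putInt(value);
    buf.flip(); // reset position to 0 so the written bytes become readable
    byte[] bytes = new byte[buf.remaining()];
    buf.get(bytes);
    return bytes;
  }

  static int decode(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getInt();
  }

  public static void main(String[] args) {
    byte[] payload = encode(4);
    System.out.println("payload length = " + payload.length);
    System.out.println("decoded value  = " + decode(payload));
  }
}
```

Calling flip() as a separate statement sidesteps the Java 8 detail that flip() is declared to return Buffer rather than ByteBuffer.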
Step 6b: Update NumberPipelineDAG.buildDAG()
Vertex filter = Vertex.create("filter",
    ProcessorDescriptor.create(FilterProcessor.class.getName())
        .setUserPayload(UserPayload.create(
            // The cast keeps this compiling on Java 8, where flip() returns Buffer
            (ByteBuffer) ByteBuffer.allocate(4).putInt(4).flip())), // threshold=4
    2); // same parallelism as multiplier
// Register the new vertex with the DAG alongside the existing ones (dag.addVertex(filter))
// New edge chain: generator → multiplier → filter → sink
    .addEdge(Edge.create(generator, multiplier, edgeConf.createDefaultEdgeProperty()))
    .addEdge(Edge.create(multiplier, filter, edgeConf.createDefaultEdgeProperty()))
    .addEdge(Edge.create(filter, sink, edgeConf.createDefaultEdgeProperty()));
Step 6c: Calculate the new expected sum
After multiplying by 2, the values are: 0, 2, 4, 6, 8, …, 198. After filtering (keep only values divisible by 4): 0, 4, 8, 12, …, 196. Sum of {0, 4, 8, …, 196} = 4 * sum(0, 1, 2, …, 49) = 4 * (49*50/2) = 4 * 1225 = 4900.
Update NumberPipelineDAG.expectedSum() to return 4900 and verify PASS.
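Before changing expectedSum(), it is worth confirming the arithmetic by brute force. The sketch below assumes the generator emits the integers 0 to 99 (consistent with the unfiltered result of 9900); the class and method names are made up for the check.

```java
/** Brute-force check of the expected sums used in this lab. */
class ExpectedSumDemo {

  /** Sums n * factor for n in [0, count), keeping only multiples of divisor. */
  static long expectedSum(int count, int factor, int divisor) {
    long sum = 0;
    for (int n = 0; n < count; n++) {
      int value = n * factor;
      if (value % divisor == 0) {
        sum += value;
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println("no filter    : " + expectedSum(100, 2, 1));
    System.out.println("divisible by 4: " + expectedSum(100, 2, 4));
  }
}
```

The first line prints 9900 and the second 4900, matching the closed-form calculation above.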
Step 7: Connect This to the Tez Source
Every class you used in this project maps to a real Tez module.
| Class | Module | Source path |
|---|---|---|
| AbstractLogicalIOProcessor | tez-api | tez-api/src/main/java/org/apache/tez/runtime/api/AbstractLogicalIOProcessor.java |
| OrderedPartitionedKVOutput | tez-runtime-library | tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java |
| OrderedGroupedKVInput | tez-runtime-library | tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java |
| OrderedPartitionedKVEdgeConfig | tez-runtime-library | tez-runtime-library/src/main/java/org/apache/tez/runtime/library/conf/OrderedPartitionedKVEdgeConfig.java |
| TezClient | tez-api | tez-api/src/main/java/org/apache/tez/client/TezClient.java |
| TezConfiguration | tez-api | tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java |
After running the pipeline successfully, open each source file above. For each one:
- Find the method you called
- Read its implementation — what does it actually do?
- Find the unit test class for that file (usually in src/test/java/ under the same package)
Step 8: Find Related JIRA Issues
This pipeline uses OrderedPartitionedKVOutput. Search the Tez JIRA for issues in this
component to find real bugs and improvements you could work on:
project = TEZ AND component = "runtime-library" AND status in (Open, Patch Available)
ORDER BY priority DESC
Also search specifically:
text ~ "OrderedPartitionedKVOutput" AND status in (Open, "Patch Available")
For each open issue you find, ask yourself:
- Do you understand what the bug description is saying?
- Can you locate the relevant code in the source?
- Is there a failing test, or do you need to write one?
Expected Deliverables
- Project compiles without errors
- Running the JAR prints PASS with result 9900
- You can answer all questions in Step 4 (with file:line references to the source)
- You have run all three "Break It" experiments and understand each failure
- FilterProcessor is implemented and the pipeline prints PASS with result 4900
- You have opened all 6 source files from the "Connect to Source" table
- You have found at least 2 open JIRA issues in the runtime-library component
Level 2: Apache Contributor Onboarding
This level teaches you how the Apache open-source contribution machine works — not in the abstract, but in the specific context of Apache Tez. You will set up your tooling, understand the community structure, learn the patch workflow, and submit your first meaningful change.
Learning Objectives
By the end of Level 2 you must be able to:
- Subscribe to dev@tez.apache.org and read a week's worth of threads
- Navigate Apache Tez JIRA to find and evaluate open issues
- Describe the full lifecycle of a patch: from JIRA issue to committed code
- Generate a unified diff patch from a Git branch
- Run Apache checkstyle and resolve all violations before submitting a patch
- Write a JIRA comment that adds technical value
- Find any class in the Tez repository in under 30 seconds
Apache Open-Source Contribution Fundamentals
Apache projects operate differently from GitHub-native open-source projects. The primary communication channels are mailing lists, not GitHub issues or Slack. Patches are attached to JIRA issues, not submitted as GitHub pull requests (though GitHub PRs may be used as a convenience in some projects — Tez still prefers JIRA-based workflow).
The Contribution Hierarchy
PMC (Project Management Committee)
└─ Committers (can commit directly)
└─ Contributors (submit patches via JIRA)
└─ Everyone else (can file issues, ask questions)
Becoming a contributor means submitting patches. Becoming a committer means sustained, high-quality contributions over time that earn the trust of existing committers.
The Patch Lifecycle
1. Find or file a JIRA issue
2. Leave a comment: "I'm looking into this"
3. Make changes on a local branch
4. Run: mvn test -pl <module> -am (must pass)
5. Run: mvn checkstyle:check -pl <module> (must pass)
6. Generate a patch: git diff origin/master > TEZ-NNNN.patch
7. Attach the patch to the JIRA issue
8. Set JIRA status to "Patch Available"
9. Wait for review — a committer will comment or set "Reviewed" or "Not a bug"
10. Address feedback → upload v2 patch → repeat
11. Committer commits the patch (you cannot commit yourself until you are a committer)
Required Reading
| # | Resource | What to extract |
|---|---|---|
| 1 | Apache Tez Contributing | The official contribution guide |
| 2 | Apache JIRA for Tez | Browse recent issues to understand what active work looks like |
| 3 | dev@tez.apache.org archives | Read 2 weeks of mailing list threads at https://lists.apache.org/list.html?dev@tez.apache.org |
| 4 | src/config/checkstyle.xml in the Tez repo | What style rules are enforced |
| 5 | Apache How It Works | Meritocracy, governance, why Apache operates the way it does |
| 6 | Any 3 recently closed Tez patches | Read the JIRA comment thread — observe how committers give feedback |
Source Code Areas to Inspect
| File | Why |
|---|---|
| pom.xml (root) | Module structure, dependency management, build profiles |
| tez-dag/pom.xml | Module-level dependency declarations |
| src/config/checkstyle.xml | Style rules enforced on every patch |
| src/config/checkstyle-suppressions.xml | Suppressions: which files are exempt and why |
| .gitignore | What is excluded from version control |
| Any recently committed file | Read the commit message format |
Apache Tez JIRA Structure
Issue Types You Will Encounter
| Type | Description |
|---|---|
| Bug | A defect in behavior |
| Improvement | An enhancement to existing functionality |
| New Feature | Something that does not exist yet |
| Task | Non-code work (documentation, release, etc.) |
| Sub-task | Part of a larger issue |
| Test | Adding or fixing a test |
Priority Levels
| Priority | Meaning |
|---|---|
| Blocker | Prevents a release |
| Critical | Significant data loss or correctness risk |
| Major | Important but not release-blocking |
| Minor | Small issue or improvement |
| Trivial | Typo, cosmetic, minor cleanup |
For Level 2 contributors: Only work on Minor and Trivial issues. Do not pick up Major or higher issues until you have at least 3 accepted patches in the project.
Component Labels
JIRA issues are labeled by component. The most relevant for early contributors:
| Component | What it covers |
|---|---|
| Tez-DAG | DAG execution, AM, state machines |
| Tez-Runtime | I/O library, shuffle |
| Tez-API | Public API; high stability required |
| Documentation | Docs, Javadoc, website |
| Tests | Test additions and fixes |
Mailing List Etiquette
How to Subscribe
# Send an empty email to:
dev-subscribe@tez.apache.org
# You will receive a confirmation email — reply to it
What to Read First
Do not post until you have read at least two weeks of threads. Understand:
- What issues are currently being discussed
- How committers respond to patches
- The tone and technical depth expected
- What questions get quick responses vs. what gets ignored
How to Ask a Question
Good question format:
Subject: [QUESTION] Understanding VertexImpl initialization flow
Hi dev@,
I'm trying to understand the initialization sequence in VertexImpl.
Specifically, I'm looking at the transition from INITIALIZING to INITED
in VertexImpl.java around line 1234.
The code calls rootInputInitializer() before transitioning, but I'm unclear
on what happens if an initializer throws an unchecked exception.
I've read the JIRA issue TEZ-XXXX and the associated commit, but I still
have this question. Can anyone point me to the relevant code path?
Thanks,
[Your name]
What makes this question good:
- Specific class and approximate line number
- State machine terminology used correctly
- References prior research
- Concrete question, not "how does Tez work?"
What makes a question bad:
- "How do I contribute?" — this is answered in the contributing guide
- "Can you explain how shuffle works?" — too broad; you should read the code first
- Posting before subscribing and reading archives
Apache Checkstyle
Tez enforces checkstyle on every patch. A patch that fails checkstyle will not be committed.
Running Checkstyle
# Check a specific module
mvn checkstyle:check -pl tez-dag
# Check all modules (slow)
mvn checkstyle:check
# Check and see violations inline
mvn checkstyle:checkstyle -pl tez-dag
open tez-dag/target/checkstyle-result.xml
Common Violations
| Violation | Cause | Fix |
|---|---|---|
| UnusedImports | Import statement for an unused class | Remove the import |
| LineLength | Line exceeds 100 characters | Break the line |
| WhitespaceAround | Missing space around operator | Add space |
| LeftCurly | { on wrong line | Move to end of previous line |
| JavadocMethod | Public method missing Javadoc | Add /** ... */ block |
| FinalClass | Utility class not declared final | Add final modifier |
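As a concrete reference point, here is a small class written to avoid each violation in the table. It is a hypothetical example (VertexNameUtil is not part of Tez), and whether it passes depends on the project's actual checkstyle.xml, so treat it as an illustration of the rules rather than a guaranteed-clean file.

```java
import java.util.Locale; // used below, so UnusedImports does not fire

/**
 * Hypothetical utility class, written only to illustrate the rules in the
 * table above. It is not part of Tez.
 */
final class VertexNameUtil { // FinalClass: utility class is final; LeftCurly: brace ends the line

  /** Prefix added to every display name. */
  private static final String PREFIX = "vertex:";

  private VertexNameUtil() {
    // Utility class: no instances.
  }

  /**
   * Builds a display name for a vertex.
   *
   * @param rawName the user-supplied vertex name; must not be null
   * @return the prefixed, trimmed, lower-cased name
   */
  static String displayName(String rawName) {
    // WhitespaceAround: spaces on both sides of the + operator
    return PREFIX + rawName.trim().toLowerCase(Locale.ROOT);
  }

  /**
   * Prints one sample display name.
   *
   * @param args unused
   */
  public static void main(String[] args) {
    System.out.println(displayName("  Sink "));
  }
}
```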
JIRA Issue Categories for Level 2 Contributors
In addition to Level 1 categories, you can now attempt:
- Test improvements: adding tests for uncovered paths you identify from reading the code
- Logging improvements: adding LOG.debug() statements that would help diagnose issues
- Checkstyle fixes: especially in modules you have been reading
Discipline: The quality of your first 5 patches determines how quickly you build credibility in the community. A patch with a checkstyle violation, compilation error, or test failure will be rejected immediately. Every patch must be verified locally before upload.
Deliverables
- Subscribed to dev@tez.apache.org and can describe two active discussions
- Apache JIRA account created
- One JIRA issue identified, studied, and commented on (even if not yet working on it)
- Lab 2.1 completed: module-by-module walkthrough documented
- Lab 2.2 completed: patch generated, checkstyle passing, JIRA description written
- Understanding of the difference between a Minor and a Trivial issue
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Opening a GitHub PR instead of attaching a patch to JIRA | PR will likely be ignored or closed | Use JIRA; attach a .patch file |
| Submitting a patch that changes formatting in unrelated lines | Noise in the diff; committers reject it | Change only the lines you meant to change |
| Claiming an issue without leaving a JIRA comment | Another contributor may do the same work | Comment "I am investigating this" before starting |
| Submitting a patch without running tests | Immediate rejection | Test everything locally first |
| Writing a JIRA comment that just says "fix attached" | Unhelpful; committers will ask for explanation | Explain what was wrong and what the fix does |
| Using git commit -m "fix" | Unprofessional commit message | Format: TEZ-NNNN. Short description of change. |
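The commit message format in the last row is regular enough to check mechanically. The sketch below is a hypothetical helper, not a tool shipped with Tez; the regular expression encodes only the "TEZ-NNNN. Short description" shape described in the table.

```java
import java.util.regex.Pattern;

/** Hypothetical checker for the "TEZ-NNNN. Short description." format. */
class CommitMessageCheck {

  // "TEZ-", one or more digits, a period, a space, then a non-empty description.
  private static final Pattern FORMAT = Pattern.compile("^TEZ-\\d+\\. \\S.*$");

  static boolean isWellFormed(String firstLine) {
    return firstLine != null && FORMAT.matcher(firstLine).matches();
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("fix"));
    System.out.println(isWellFormed("TEZ-1234. Improve Javadoc for Vertex.addDataSink()"));
  }
}
```

The first call prints false and the second prints true. A check like this could run from a local commit-msg hook, though that wiring is left to you.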
How to Verify Success
# Your patch generates cleanly
git diff origin/master > /tmp/TEZ-NNNN.001.patch
head -20 /tmp/TEZ-NNNN.001.patch # should show only your intended changes
# Checkstyle passes on the module you changed
mvn checkstyle:check -pl <changed-module>
# Tests pass
mvn test -pl <changed-module> -am -Dtest=<RelevantTestClass>
Patch Profile: Level 2 Graduate
| Patch type | Example | Test requirement |
|---|---|---|
| Javadoc improvement | Add missing @throws annotation to a method | None |
| Log statement improvement | Add context to an existing LOG.warn that is unhelpful | Run the affected test class |
| Checkstyle fix | Fix unused import across multiple files in one module | Run mvn checkstyle:check -pl <module> |
| Test comment improvement | Add test setup comments explaining what MockAppContext does | Run the test class |
You are not ready to submit: behavioral code changes, new features, bug fixes in state machines or shuffle. Continue to Level 3.
Lab 2.1: Navigate the Repository Structure
Background
Before writing a single line of code, a new contributor must be able to navigate the repository with the same fluency as a committer. This lab builds that fluency by walking you through every module, understanding the Maven multi-module structure, and being able to locate any class in under 30 seconds.
Repository Root Layout
apache/tez/
├── pom.xml # Root POM — module declarations, dep management
├── tez-api/ # Public client API
├── tez-common/ # Utilities shared across modules
├── tez-dag/ # DAG AppMaster — the core of Tez
├── tez-examples/ # Example DAG implementations
├── tez-ext-service-tests/ # External service integration tests
├── tez-mapreduce/ # MapReduce compatibility layer
├── tez-plugins/ # Optional plugins (ATSv2, etc.)
├── tez-runtime-internals/ # Internal runtime interfaces
├── tez-runtime-library/ # I/O processors, shuffle
├── tez-tests/ # Integration test suite
├── tez-tools/ # Performance analysis utilities
├── src/
│ └── config/
│ ├── checkstyle.xml # Style enforcement rules
│ └── checkstyle-suppressions.xml
└── CHANGES.txt # Release changelog
Module-by-Module Walkthrough
tez-api — The Public Contract
Everything in tez-api is part of the public API that application developers use. Changes here
must be backward-compatible or explicitly versioned. This is the highest-stability module.
Key packages:
| Package | Contents |
|---|---|
| org.apache.tez.client | TezClient: session management and DAG submission |
| org.apache.tez.dag.api | DAG, Vertex, Edge, TezConfiguration |
| org.apache.tez.dag.api.client | DAGClient, DAGStatus: monitoring and control |
| org.apache.tez.dag.api.event | Events emitted by the AM to task processors |
| org.apache.tez.dag.api.records | Protocol Buffer message classes (generated) |
| org.apache.tez.runtime.api | AbstractLogicalIOProcessor, Input, Output interfaces |
Exercise:
# Count Java source files in tez-api (a rough proxy for the API surface)
find tez-api/src/main/java -name "*.java" | wc -l
# Find all classes that extend AbstractLogicalIOProcessor
grep -rl "extends AbstractLogicalIOProcessor" tez-runtime-library/src/
tez-dag — The Application Master
This is the largest and most complex module. It implements the DAG AppMaster that runs in a YARN container and orchestrates vertex and task execution.
Key packages:
| Package | Contents |
|---|---|
| org.apache.tez.dag.app | DAGAppMaster: the main AM class |
| org.apache.tez.dag.app.dag | DAG, Vertex, Task, TaskAttempt state machine interfaces |
| org.apache.tez.dag.app.dag.impl | DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl |
| org.apache.tez.dag.app.rm | YARN resource management integration |
| org.apache.tez.dag.app.launcher | Container launch logic |
| org.apache.tez.dag.app.web | AM web UI servlets |
| org.apache.tez.dag.history | Timeline history event handling |
Exercise:
# Count lines in DAGImpl (the most complex class)
wc -l tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java
# Count state machine transitions in VertexImpl
grep "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | wc -l
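Each addTransition call you just counted registers one (state, event) to next-state rule. The sketch below is a deliberately tiny model of that idea, with made-up state and event names; it is not the Hadoop StateMachineFactory API that VertexImpl uses, and it omits transition hooks entirely.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy transition table; state and event names are illustrative, not Tez's. */
class ToyVertexStateMachine {
  enum State { NEW, INITIALIZING, INITED, RUNNING, SUCCEEDED }
  enum Event { INIT, INPUTS_READY, START, ALL_TASKS_DONE }

  private final Map<State, Map<Event, State>> transitions = new EnumMap<>(State.class);
  private State current = State.NEW;

  /** Registers one legal transition and returns this for chaining. */
  ToyVertexStateMachine addTransition(State from, State to, Event on) {
    transitions.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    return this;
  }

  /** Applies an event, rejecting any (state, event) pair that was never registered. */
  State handle(Event event) {
    Map<Event, State> row = transitions.get(current);
    State next = (row == null) ? null : row.get(event);
    if (next == null) {
      throw new IllegalStateException("Invalid event " + event + " in state " + current);
    }
    current = next;
    return current;
  }

  static ToyVertexStateMachine create() {
    return new ToyVertexStateMachine()
        .addTransition(State.NEW, State.INITIALIZING, Event.INIT)
        .addTransition(State.INITIALIZING, State.INITED, Event.INPUTS_READY)
        .addTransition(State.INITED, State.RUNNING, Event.START)
        .addTransition(State.RUNNING, State.SUCCEEDED, Event.ALL_TASKS_DONE);
  }

  public static void main(String[] args) {
    ToyVertexStateMachine vertex = create();
    System.out.println(vertex.handle(Event.INIT));
    System.out.println(vertex.handle(Event.INPUTS_READY));
    try {
      vertex.handle(Event.ALL_TASKS_DONE); // not legal from INITED
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

When you read VertexImpl, look for the same three ingredients: the starting state, the table of registered transitions, and what happens when an event arrives that has no registered transition.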
tez-runtime-library — I/O and Shuffle
The I/O module implements the actual data reading/writing done inside task containers. Shuffle happens here.
Key packages:
| Package | Contents |
|---|---|
| org.apache.tez.runtime.library.input | OrderedGroupedKVInput, UnorderedKVInput, etc. |
| org.apache.tez.runtime.library.output | OrderedPartitionedKVOutput, UnorderedKVOutput, etc. |
| org.apache.tez.runtime.library.common.shuffle | Shuffle fetch infrastructure |
| org.apache.tez.runtime.library.common.sort | External sort implementation |
| org.apache.tez.runtime.library.common.writers | Spilling KV writers |
Exercise:
# Find all Input implementations
find tez-runtime-library/src/main/java -name "*Input*.java" | grep -v test
# Find the shuffle Fetcher
find tez-runtime-library/src/main/java -name "Fetcher.java"
wc -l $(find tez-runtime-library/src/main/java -name "Fetcher.java")
tez-common — Shared Utilities
Contains utilities used by multiple modules that do not fit in tez-api:
- TezUtils: configuration serialization/deserialization
- TezTaskID, TezVertexID, TezDAGID: ID types
- ReflectionUtils: Tez-specific reflection helpers
- VersionUtils: version compatibility checks
tez-mapreduce — MapReduce Compatibility
Allows MapReduce jobs to run on Tez without code changes. Contains MRInput, MROutput,
and the mapper/reducer wrapping infrastructure.
tez-examples — Reference Implementations
Four example DAGs:
| Class | What it demonstrates |
|---|---|
| OrderedWordCount | 3-vertex pipeline, ordered shuffle, sort by value |
| IntersectExample | 2-way join using broadcast edge |
| JoinDataGen | Data generation for the join example |
| FilterLinesByWord | Simple filter with configurable parallelism |
tez-tests — Integration Test Suite
Contains tests that run against MiniTezCluster — a full in-process Tez + YARN + HDFS cluster.
These tests are slow (minutes each) but provide end-to-end coverage.
Key test class: TestMiniTezSessionWithLocalMode — runs example DAGs in local mode.
Maven Structure Deep Dive
Root pom.xml
Read the root pom.xml to understand:
- Module declarations (<modules> section): the build order
- Dependency management (<dependencyManagement>): canonical versions for all deps
- Plugin management (<pluginManagement>): canonical plugin configurations
- Build profiles: hadoop-2 vs hadoop-3, dist profile for assembly
Exercise:
# What Hadoop version does Tez build against by default?
grep -A2 "hadoop.version" pom.xml | head -5
# What Java version is required?
grep "maven.compiler" pom.xml
# Roughly how many artifacts (dependencies and plugins) does the root pom reference?
grep "<artifactId>" pom.xml | wc -l
Module pom.xml Structure
Each module follows the same pattern:
<parent>
<groupId>org.apache.tez</groupId>
<artifactId>tez</artifactId>
<version>0.10.x-SNAPSHOT</version>
</parent>
<artifactId>tez-dag</artifactId>
<name>Tez DAG</name>
<dependencies>
<!-- Module-specific dependencies -->
</dependencies>
Modules declare their inter-dependencies explicitly. This is how Maven knows the build order.
Exercise:
# What modules does tez-dag depend on?
grep -A3 "<dependency>" tez-dag/pom.xml | grep "tez-" | grep "artifactId"
# What does tez-runtime-library depend on?
grep -A3 "<dependency>" tez-runtime-library/pom.xml | grep "tez-" | grep "artifactId"
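To see how declared inter-module dependencies determine a build order, the sketch below runs a topological sort (Kahn's algorithm) over a small dependency map. The edges in main are illustrative assumptions, not a transcription of the Tez POMs; replace them with what the grep commands above actually report.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Derives a build order from module dependencies using Kahn's algorithm. */
class BuildOrderDemo {

  /** Every module, including leaf modules, must appear as a key of dependsOn. */
  static List<String> buildOrder(Map<String, List<String>> dependsOn) {
    Map<String, Integer> unmet = new LinkedHashMap<>();
    Map<String, List<String>> dependents = new HashMap<>();
    for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
      unmet.put(e.getKey(), e.getValue().size());
      for (String dep : e.getValue()) {
        dependents.computeIfAbsent(dep, k -> new ArrayList<String>()).add(e.getKey());
      }
    }
    Deque<String> ready = new ArrayDeque<>();
    for (Map.Entry<String, Integer> e : unmet.entrySet()) {
      if (e.getValue() == 0) {
        ready.add(e.getKey());
      }
    }
    List<String> order = new ArrayList<>();
    while (!ready.isEmpty()) {
      String module = ready.remove();
      order.add(module);
      for (String d : dependents.getOrDefault(module, Collections.<String>emptyList())) {
        if (unmet.merge(d, -1, Integer::sum) == 0) {
          ready.add(d); // all of d's dependencies are now built
        }
      }
    }
    if (order.size() != unmet.size()) {
      throw new IllegalStateException("Dependency cycle or undeclared module");
    }
    return order;
  }

  /** Illustrative edges only; check them against the real pom.xml files. */
  static Map<String, List<String>> sampleModules() {
    Map<String, List<String>> deps = new LinkedHashMap<>();
    deps.put("tez-dag", Arrays.asList("tez-api", "tez-common", "tez-runtime-library"));
    deps.put("tez-runtime-library", Arrays.asList("tez-api", "tez-common"));
    deps.put("tez-common", Arrays.asList("tez-api"));
    deps.put("tez-api", Collections.<String>emptyList());
    return deps;
  }

  public static void main(String[] args) {
    System.out.println(buildOrder(sampleModules()));
  }
}
```

Under these assumed edges the only valid order is tez-api, tez-common, tez-runtime-library, tez-dag, regardless of the order in which the modules are listed.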
Finding Classes Quickly
By Name
find . -name "VertexImpl.java"
find . -name "Fetcher.java"
find . -name "TestDAGImpl.java"
By Content
# Find the class that defines TEZ_LOCAL_MODE
grep -rl "TEZ_LOCAL_MODE" --include="*.java" .
# Find all non-test classes that use StateMachineFactory
grep -rl "StateMachineFactory" --include="*.java" . | grep -v test
In IntelliJ
- Navigate to class: ⌘ O (macOS). Type the class name; wildcards are supported
- Navigate to file: ⌘ ⇧ O. Type the file name
- Find usages: ⌥ F7. Shows all places a class/method is used
- Go to implementation: ⌘ ⌥ B. Jumps from interface to implementation
Navigation Checklist
After completing this lab, time yourself on each:
| Task | Target time |
|---|---|
| Find DAGImpl.java | < 10 seconds |
| Find TezConfiguration.TEZ_LOCAL_MODE declaration | < 20 seconds |
| Find all tests for VertexImpl | < 30 seconds |
| Identify which module handles shuffle fetch retry | < 60 seconds |
| Find the class that submits a DAG from client to AM | < 60 seconds |
If any take longer, repeat the exercises in this lab.
Expected Output
By end of this lab you should have notes documenting:
- The line count of VertexImpl.java and DAGImpl.java
- The number of state machine transitions in VertexImpl
- The names of all 4 example DAG classes
- The Hadoop version Tez builds against
- Which module handles shuffle (your own words, not copy-pasted)
Stretch Goals
- Generate the full module dependency graph:
  mvn dependency:tree -pl tez-dag -am | grep -E '[+\\]- ' | head -30
- Find all Protocol Buffer definition files (.proto):
  find . -name "*.proto" | sort
  For each, identify which module it belongs to and what messages it defines.
- Read tez-api/src/main/proto/DAGApiRecords.proto completely. Identify which messages correspond to Java classes you have already read.
Lab 2.2: Prepare a Patch Using Apache Practices
Background
A "patch" in Apache open-source culture means a unified diff file attached to a JIRA issue. This lab walks you through the complete workflow: finding a safe change to make, preparing the patch, verifying it, and writing the JIRA description.
This lab uses a real but trivial change as the vehicle — a Javadoc improvement in
tez-api. Trivial changes are intentional: the goal is to master the workflow, not to write
impressive code.
The Apache Git Patch Workflow
Apache Tez development uses a linear history on the master branch (some other Apache projects
call their main branch trunk). The standard contributor workflow:
origin/master (read-only for non-committers)
|
↓ checkout
local/master
|
↓ branch
local/TEZ-NNNN
|
↓ make changes
↓ mvn test (pass)
↓ mvn checkstyle:check (pass)
↓ git diff origin/master > TEZ-NNNN.001.patch
|
→ Attach to JIRA
You never push your branch to Apache. You generate a diff and attach it.
Step-by-Step Tasks
Step 1: Set Up Your Working Branch
cd /path/to/tez
# Always start from a clean, up-to-date master
git fetch origin
git checkout master
git merge origin/master
# Create a branch named after the JIRA issue you are working on
# Use TEZ-0000 as a placeholder for this lab
git checkout -b TEZ-0000-javadoc-tezvertex
Verify you are on the new branch:
git branch
# * TEZ-0000-javadoc-tezvertex
# master
Step 2: Find a Target for Your Change
Open tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java.
Look for public methods that:
- Have no Javadoc, or
- Have a @param tag with a placeholder description such as TODO, or
- Return a value but have no @return tag
A useful starting point:
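# Find methods with empty or missing Javadoc in tez-api.
# -d keeps the generated HTML out of the source tree; errors about unresolved
# dependencies can be ignored for this purpose.
javadoc -private -Xdoclint:all -d /tmp/tez-api-javadoc \
  -sourcepath tez-api/src/main/java \
  org.apache.tez.dag.api 2>&1 | grep "no comment"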
Or manually: open Vertex.java in IntelliJ, look at the addDataSink() method. If it lacks
a @param description for dataSink, that is your target.
Step 3: Make the Change
Add or improve the Javadoc for the method you identified. Follow this format exactly:
/**
 * Adds a data sink to this vertex. The sink will receive the output
 * of this vertex after all tasks complete.
 *
 * @param outputName
 *          the name used to identify this sink in the DAG; must be unique
 *          within this vertex
 * @param dataSink
 *          the {@link DataSinkDescriptor} defining the sink type and
 *          configuration
 * @return this {@link Vertex} instance (for method chaining)
 * @throws IllegalStateException if the vertex has already been added to a {@link DAG}
 */
public Vertex addDataSink(String outputName, DataSinkDescriptor dataSink) {
Rules for Apache Javadoc style:
- First sentence is a brief third-person description with no subject ("Adds a…", not "This method adds a…")
- Multi-line @param descriptions indent the continuation by 10 spaces (2 more than @param)
- Use {@link ClassName} for all class references
- Use {@code value} for code literals and parameter names in prose
Step 4: Verify Compilation
mvn compile -pl tez-api -q
Expected: the command prints nothing and exits with status 0. The -q flag suppresses the BUILD SUCCESS banner and shows only errors.
Step 5: Run Checkstyle
mvn checkstyle:check -pl tez-api
Expected: BUILD SUCCESS. If there are violations, fix them before continuing.
Common Javadoc-specific violations:
- JavadocStyle: Javadoc comment does not end with a period
- JavadocMethod: @param or @return tag is missing
- JavadocVariable: public field missing Javadoc
Step 6: Run the Relevant Tests
mvn test -pl tez-api -q
Expected: no error output and exit status 0 (-q hides the BUILD SUCCESS banner). Even a pure
Javadoc change requires a test run, because checkstyle runs as part of the test phase in some configurations.
Step 7: Generate the Patch
# Verify what you changed
git diff
# The diff should show only the lines you intentionally changed
# No whitespace changes, no unrelated files
# Generate the patch file
git diff origin/master > /tmp/TEZ-0000.001.patch
# Inspect it
cat /tmp/TEZ-0000.001.patch
The patch file should:
- Start with diff --git a/tez-api/...
- Show exactly the lines you added/removed (prefixed with +/-)
- Contain no changes to files you did not intend to modify
If the patch is longer than expected, run git status to find unexpected changes and
use git checkout -- <file> to revert them.
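The "only intended files" check can also be done mechanically. The sketch below is a hypothetical helper that scans a patch for diff --git headers and collects the paths they name; the sample patch text is invented for the example and the parsing is intentionally naive (it assumes paths contain no " b/" substring).

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Lists the files a unified diff touches by reading its "diff --git" headers. */
class PatchFilesDemo {

  static Set<String> touchedFiles(String patchText) {
    Set<String> files = new LinkedHashSet<>();
    for (String line : patchText.split("\n")) {
      if (line.startsWith("diff --git a/")) {
        // Header shape: diff --git a/<path> b/<path>
        int bIndex = line.indexOf(" b/");
        if (bIndex > 0) {
          files.add(line.substring(bIndex + 3));
        }
      }
    }
    return files;
  }

  /** Invented sample patch with a single touched file. */
  static String samplePatch() {
    return String.join("\n",
        "diff --git a/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java"
            + " b/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java",
        "--- a/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java",
        "+++ b/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java",
        "@@ -1,3 +1,4 @@",
        "+   * @param dataSink the sink descriptor",
        "");
  }

  public static void main(String[] args) {
    System.out.println(touchedFiles(samplePatch()));
  }
}
```

For this lab the set should contain exactly one entry, Vertex.java under tez-api. Anything else in the set is a file to revert.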
Step 8: Write the JIRA Description
For the JIRA issue you would create for this patch, write:
Summary line format:
TEZ-0000. Improve Javadoc for Vertex.addDataSink()
Description format:
Problem:
The addDataSink() method in Vertex.java has no @param documentation for the
'dataSink' parameter. This makes it harder for new users to understand the
expected input without reading the implementation.
Fix:
Add complete @param, @return, and @throws Javadoc for addDataSink().
Testing:
mvn test -pl tez-api (all existing tests pass)
mvn checkstyle:check -pl tez-api (no violations)
Step 9: Review the Patch as a Committer Would
Before attaching a patch, ask yourself:
- Does the patch contain only the changes described in the JIRA description?
- Does it pass mvn test -pl <module> locally?
- Does it pass mvn checkstyle:check -pl <module>?
- Is the commit message format correct? (TEZ-NNNN. Short description.)
- Is there a clear explanation in the JIRA description of what was wrong and what was fixed?
If any answer is "no", fix it before uploading.
Common Mistakes
| Mistake | How to detect | Fix |
|---|---|---|
| Patch includes unrelated formatting changes | git diff shows hundreds of lines | git checkout -- <unintended-file> |
| Patch modifies generated code | Proto-generated files in the diff | Revert generated files; only change source |
| Branch is based on a stale master | After git fetch, git diff origin/master shows changes you did not make | Rebase your branch onto current master |
| Checkstyle violation in unchanged line | mvn checkstyle:check fails in a line you did not write | You must fix it anyway — it is in your patch |
| Test fails on unrelated module | Running all tests surfaces a pre-existing failure | Confirm by running on a clean checkout; note the existing failure in JIRA |
JIRA Status Workflow
After attaching your patch:
- Set the JIRA status to "Patch Available"
- Add a comment: "Patch attached. Tested with mvn test -pl tez-api and mvn checkstyle:check -pl tez-api, both pass."
- Wait for a committer to review; do not ping on the mailing list immediately
- If no response in 2 weeks, it is acceptable to send one polite reminder to dev@tez.apache.org:

  Subject: [REMINDER] TEZ-NNNN patch available for review

  Hi dev@,

  Friendly reminder that TEZ-NNNN has a patch attached. Any feedback welcome.
  https://issues.apache.org/jira/browse/TEZ-NNNN

  Thanks
Expected Output
At the end of this lab you have:
- A local branch TEZ-0000-javadoc-tezvertex with a Javadoc change
- A passing test run: mvn test -pl tez-api
- A passing checkstyle run: mvn checkstyle:check -pl tez-api
- A patch file at /tmp/TEZ-0000.001.patch with only the intended diff
- A written JIRA description (even if not submitted) in the format above
Stretch Goals
- Find a real Minor or Trivial open issue in Apache Tez JIRA that has been open for more than 6 months with no patch. Leave a JIRA comment expressing interest.
- Attempt the same patch workflow with a real issue:
  - Use git checkout -b TEZ-<real-number>-<short-description> for the branch name
  - Use the real JIRA number in the patch filename: TEZ-NNNN.001.patch
- Read three recently committed Tez patches by browsing JIRA issues with status "Resolved". For each, read the complete comment thread to understand the feedback cycle and how many patch revisions were required.
- Generate a git log view that shows only your branch's commits:
  git log origin/master..HEAD --oneline
  This is what a committer sees when reviewing your work.
Lab 2.3 — Fix It: NullPointerException in TezTaskAttemptID.fromString
Lab type: Fix-It — reproduce → locate → write failing test → patch → verify → format patch
Estimated time: 90–120 min
Tez component: tez-common → org.apache.tez.dag.records.TezTaskAttemptID
Background
TezTaskAttemptID is the primary key that links a task attempt to its vertex, DAG, and
application. Its static fromString method parses a serialised ID like:
attempt_1609459200000_0001_1_00_000000_0
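In the Tez codebase the parse path for certain malformed inputs has historically
thrown an unguarded NullPointerException rather than a descriptive
IllegalArgumentException. The usual culprits are call sites that skip validation:
calling split() on a null string throws NullPointerException, and indexing past the end
of the array returned by String.split() throws ArrayIndexOutOfBoundsException.
(Integer.parseInt() already throws NumberFormatException, a subclass of
IllegalArgumentException, for non-numeric tokens.)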
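The three failure modes of an unvalidated parse can be reproduced with the JDK alone. The sketch below is a hypothetical naive parser, not the Tez implementation of fromString; it assumes the attempt number is the seventh underscore-separated token, as in the sample ID above.

```java
/** Shows which exception an unvalidated parse throws for each kind of bad input. */
class IdParseFailureDemo {

  /** Naive parse with no validation; hypothetical, not the Tez implementation. */
  static int naiveAttemptNumber(String id) {
    String[] parts = id.split("_");    // NullPointerException if id is null
    return Integer.parseInt(parts[6]); // ArrayIndexOutOfBoundsException if too few parts
  }

  /** Returns the simple name of the exception thrown, or "none". */
  static String failureFor(String id) {
    try {
      naiveAttemptNumber(id);
      return "none";
    } catch (RuntimeException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(failureFor("attempt_1609459200000_0001_1_00_000000_0"));
    System.out.println(failureFor(null));
    System.out.println(failureFor("attempt_1609459200000_0001_1"));
    System.out.println(failureFor("attempt_1609459200000_0001_1_00_000000_x"));
  }
}
```

The four lines print none, NullPointerException, ArrayIndexOutOfBoundsException, and NumberFormatException. Only the last is already an IllegalArgumentException; the first two failure cases are the ones a guard has to convert.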
This lab walks the complete Apache contribution workflow for such a bug:
- Reproduce the crash in a test
- Read the source to understand why it crashes
- Apply the minimal fix
- Verify tests pass and checkstyle is clean
- Produce a .patch file ready for JIRA upload
Step 1 — Locate the Source File
cd ~/tez-src # your local Tez clone from Lab 1.1
find . -name "TezTaskAttemptID.java" | head -5
Expected path:
./tez-common/src/main/java/org/apache/tez/dag/records/TezTaskAttemptID.java
Open the file and read the fromString method in full.
Questions to answer before continuing
| # | Question |
|---|---|
| 1 | What does fromString call first — TezDAGID.fromString, TezVertexID.fromString, or does it parse raw tokens? |
| 2 | What happens if the input string is null? Is there an explicit null guard? |
| 3 | What exception type does the method declare in its signature (throws clause)? |
| 4 | Find the split("_") call(s). If the split produces fewer parts than expected, what line would throw? |
| 5 | Is there a sibling method toString()? What is the canonical string format it produces? |
Step 2 — Find the Existing Tests
find . -name "TestTezTaskAttemptID.java" | head -5
Expected path:
./tez-common/src/test/java/org/apache/tez/dag/records/TestTezTaskAttemptID.java
Open it.
Questions
| # | Question |
|---|---|
| 1 | How many fromString test cases already exist? |
| 2 | Is there a test for a null input? |
| 3 | Is there a test for a string with too few underscore-separated parts? |
| 4 | What assertion style does the file use — JUnit 4 @Test(expected=...) or try/catch? |
Step 3 — Reproduce the Bug
Add the following test to TestTezTaskAttemptID.java inside the existing test class.
Do not modify the test — the goal is to make it pass, not work around it.
@Test(expected = IllegalArgumentException.class)
public void testFromStringNullInput() {
  TezTaskAttemptID.fromString(null);
}

@Test(expected = IllegalArgumentException.class)
public void testFromStringTooFewParts() {
  // Fewer underscore-separated tokens than the format requires
  TezTaskAttemptID.fromString("attempt_1609459200000_0001_1");
}
Run the tests:
cd tez-common
mvn test -pl . -Dtest=TestTezTaskAttemptID -q 2>&1 | tail -30
Expected result: Both new tests FAIL (the method throws NullPointerException
or ArrayIndexOutOfBoundsException, not IllegalArgumentException).
Record the exact exception and stack-trace line. You will need this for the JIRA description later.
Step 4 — Apply the Fix
Open TezTaskAttemptID.java and apply a minimal patch to the fromString method.
Rules for a minimal patch
- Add a null-check at the very top of fromString; throw IllegalArgumentException with a clear message
- Add a length-check on the parsed tokens before subscripting the array
- Do not reformat unrelated lines (this produces noisy diffs that fail checkstyle review)
- Do not change method signatures or visibility
Hint — guard pattern used elsewhere in the same class
Search the file for how other fromString variants guard their input:
grep -n "IllegalArgumentException" TezTaskAttemptID.java
Use the same pattern and message style.
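If the grep turns up nothing to copy, the general shape of such a guard can be sketched on a toy class. ToyAttemptId, its constants, and its messages are hypothetical and chosen for illustration; your actual patch should follow whatever conventions you find in the Tez file.

```java
// Hypothetical toy ID class showing the guard shape; names and messages are illustrative.
final class ToyAttemptId {
    private static final String PREFIX = "attempt";
    private static final int EXPECTED_PARTS = 7;

    final int attemptNumber;

    private ToyAttemptId(int attemptNumber) {
        this.attemptNumber = attemptNumber;
    }

    static ToyAttemptId fromString(String s) {
        // Guard 1: reject null before any method is called on the input.
        if (s == null) {
            throw new IllegalArgumentException("Attempt ID string must not be null");
        }
        String[] parts = s.split("_");
        // Guard 2: check the token count (and prefix) before indexing into the array.
        if (parts.length != EXPECTED_PARTS || !PREFIX.equals(parts[0])) {
            throw new IllegalArgumentException("Malformed attempt ID: " + s);
        }
        // NumberFormatException already extends IllegalArgumentException.
        return new ToyAttemptId(Integer.parseInt(parts[6]));
    }
}
```

Both guards run before the first array access, so every malformed input fails with the same exception type and a message that names the offending string.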
Step 5 — Verify the Fix
cd ~/tez-src # back to the repo root; Step 3 left you inside tez-common
# All TezTaskAttemptID tests must pass
mvn test -pl tez-common -Dtest=TestTezTaskAttemptID -q
# Full tez-common test suite (regression guard)
mvn test -pl tez-common -q 2>&1 | tail -20
# Checkstyle must be clean
mvn checkstyle:check -pl tez-common -q 2>&1 | grep -E "ERROR|WARNING|violation" | head -20
All three commands must produce zero errors.
Step 6 — Understand the Checkstyle Rules
cat tez-common/src/main/checkstyle/tez-checkstyle.xml | grep -A2 "LineLength"
cat tez-common/src/main/checkstyle/tez-checkstyle.xml | grep -A2 "Javadoc"
Questions
| # | Question |
|---|---|
| 1 | What is the maximum line length enforced? |
| 2 | Does the project require Javadoc on all public methods, or only some? |
| 3 | What import ordering rule is in effect — alphabetical, grouped, or none? |
Step 7 — Format the Patch File
Apache Tez uses the unified diff format. From the repo root:
cd ~/tez-src
git diff > /tmp/TEZ-XXXX.001.patch
Inspect the patch:
cat /tmp/TEZ-XXXX.001.patch
Checklist before uploading to JIRA
- Patch header shows the correct file path relative to the repo root
- Only TezTaskAttemptID.java and TestTezTaskAttemptID.java are modified
- No trailing whitespace on any added line (this command should print nothing):
  grep -nE '^\+.*[[:space:]]+$' /tmp/TEZ-XXXX.001.patch
- Patch applies cleanly to a fresh checkout:
  git apply --check /tmp/TEZ-XXXX.001.patch
- mvn test -pl tez-common still passes after git apply
Step 8 — Write the JIRA Description
Draft a JIRA ticket description following the Apache Tez convention:
Summary: TezTaskAttemptID.fromString throws NPE/AIOOBE on malformed input
instead of IllegalArgumentException
Description:
TezTaskAttemptID.fromString does not validate its input before parsing.
Passing null or a string with fewer than N underscore-separated parts
causes an unhandled NullPointerException (null path) or
ArrayIndexOutOfBoundsException (short-string path) instead of
the expected IllegalArgumentException.
Steps to reproduce:
TezTaskAttemptID.fromString(null);
→ NullPointerException at TezTaskAttemptID.java:NN
TezTaskAttemptID.fromString("attempt_1609459200000_0001_1");
→ ArrayIndexOutOfBoundsException at TezTaskAttemptID.java:NN
Fix: add explicit null guard + array-length guard at the top of fromString.
Priority: Minor
Component: tez-common
Replace NN with the actual line numbers from your stack traces in Step 3.
Step 9 — Connect the Concepts
| Concept | Where to find it in the codebase |
|---|---|
| TezTaskAttemptID | tez-common/src/main/java/.../TezTaskAttemptID.java |
| TezID base class | Same package (TezID.java) |
| All fromString sibling methods | TezDAGID, TezVertexID, TezTaskID (same package) |
| Checkstyle config | tez-common/src/main/checkstyle/tez-checkstyle.xml |
| Example past fix (similar pattern) | Search JIRA for TEZ- + IllegalArgumentException + fromString |
Reflection
- Why should library code throw IllegalArgumentException rather than letting a NullPointerException propagate?
- What does the Apache contribution guide say about test coverage for bug fixes? (Hint: see CONTRIBUTING.md or the Apache Tez wiki; every bug fix must include a reproducing test.)
- How does the Tez fromString guard pattern compare to the one in Hadoop's TaskAttemptID.forName?
- Could this same class of bug exist in TezDAGID.fromString or TezVertexID.fromString? Check both files and note your findings.
Lab 2.4 — Review It: Spot the Flaws in TEZ-FAKE001.001.patch
Lab type: Review-It — read a synthetic patch, find every flaw, explain the impact, propose fixes
Estimated time: 60–90 min
Tez component: tez-dag → org.apache.tez.dag.app.dag.impl.TaskImpl
Context
You are a Tez committer reviewing a patch uploaded to JIRA. The contributor claims
the patch fixes a race condition where TaskImpl.getCounters() returns null when
called before any task attempt has completed.
Your job is to review the patch before it merges. There are exactly 5 intentional flaws hidden in the diff below. Find them all.
The Synthetic Patch
diff --git a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
index a1b2c3d..e4f5a6b 100644
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
@@ -214,6 +214,8 @@ public class TaskImpl implements Task, EventHandler<TaskEvent> {
+ import org.apache.tez.common.counters.TezCounters;
+
public synchronized TezCounters getCounters() {
TezCounters counters = null;
if (successfulAttempt != null) {
@@ -221,7 +223,7 @@ public class TaskImpl implements Task, EventHandler<TaskEvent> {
counters = successfulAttempt.getCounters();
} else {
counters = attemptList.stream()
- .filter(a -> a.getState() == TaskAttemptState.SUCCEEDED)
+ .filter(a -> a.getState() == TaskAttemptState.RUNNING)
.findFirst()
.map(TaskAttemptImpl::getCounters)
.orElse(null);
@@ -231,6 +233,14 @@ public class TaskImpl implements Task, EventHandler<TaskEvent> {
return counters;
}
+ /**
+ * Returns the counter for this task, or a new empty TezCounters object
+ * if no counters are available yet.
+ *
+ * @return counters, never null
+ */
+ public synchronized TezCounters getCountersOrEmpty() {
+ TezCounters c = getCounters();
+ return c == null ? new TezCounters() : c;
+ }
+
diff --git a/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java b/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
index b7c8d9e..f0a1b2c 100644
--- a/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
+++ b/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
@@ -891,6 +891,18 @@ public class TestTaskImpl {
+ @Test
+ public void testGetCountersBeforeAnyAttempt() {
+ // No attempts started; counters should not be null
+ initTask();
+ TezCounters result = task.getCounters();
+ assertNotNull("getCounters() must not return null", result);
+ }
+
+ @Test
+ public void testGetCountersOrEmptyReturnsSameObjectEachTime() {
+ initTask();
+ TezCounters first = task.getCountersOrEmpty();
+ TezCounters second = task.getCountersOrEmpty();
+ assertSame("Must return same instance", first, second);
+ }
+
Your Task
For each flaw you find, fill in the table:
| # | File | Line / hunk | Flaw description | Why it matters | Suggested fix |
|---|---|---|---|---|---|
| 1 | | | | | |
| 2 | | | | | |
| 3 | | | | | |
| 4 | | | | | |
| 5 | | | | | |
Guided Questions
Work through these questions one by one. Each one points at a different flaw.
Question 1 — Import placement
Look at where the import statement was added:
+ import org.apache.tez.common.counters.TezCounters;
+
public synchronized TezCounters getCounters() {
- Is this a valid location for a Java import declaration?
- What would happen at compile time if this diff were applied as-is?
- Where should imports go in a Java file?
- Lookup: does TaskImpl.java already import TezCounters at the top?
  (grep "import.*TezCounters" tez-dag/src/main/java/.../TaskImpl.java)
What is the flaw?
Question 2 — The filter predicate
The patch changes the fallback stream filter from:
.filter(a -> a.getState() == TaskAttemptState.SUCCEEDED)
to:
.filter(a -> a.getState() == TaskAttemptState.RUNNING)
- Re-read the JIRA description: the reporter says getCounters() returns null when called before any attempt has completed.
- Does filtering for RUNNING attempts fix that?
- What does it mean to read counters from a RUNNING attempt vs a SUCCEEDED one?
- Are the counters of a still-running attempt considered final/reliable?
What is the flaw? What should the filter be?
Question 3 — The new test testGetCountersBeforeAnyAttempt
Read the test body carefully:
TezCounters result = task.getCounters();
assertNotNull("getCounters() must not return null", result);
- The test asserts that getCounters() is not null when no attempt has started.
- But the patch does not change getCounters() to return an empty object; it adds a separate method, getCountersOrEmpty(), for that.
- When successfulAttempt is null and attemptList is empty, what does getCounters() actually return?
- Will this test pass or fail against the patched code?
What is the flaw?
Question 4 — The new test testGetCountersOrEmptyReturnsSameObjectEachTime
assertSame("Must return same instance", first, second);
getCountersOrEmpty() is implemented as:
return c == null ? new TezCounters() : c;
- Each call creates a new TezCounters() when c is null.
- Does the assertSame assertion match the implementation?
- Is assertSame testing a documented contract, or is it over-specifying an implementation detail?
- What assertion would actually verify the intended contract ("not null")?
What is the flaw?
Question 5 — The JIRA description says the fix is needed, but…
Re-read the patch one final time. The root cause (as stated in the JIRA) is that
getCounters() can return null. The correct caller-safe fix for most Tez callers
would be to make getCounters() itself never return null (return empty TezCounters as
the contract).
Instead the patch adds getCountersOrEmpty() as a new method — but leaves the old
getCounters() method returning null.
- Every existing caller of getCounters() still gets null.
- The Tez codebase uses getCounters() in aggregation loops that iterate counters: counters.incrAllCounters(taskCounters). Passing null there throws NPE.
- How many callers of getCounters() exist in tez-dag?
  grep -rn "\.getCounters()" tez-dag/src/main/ | grep -v "//.*getCounters" | wc -l
- Does the patch actually fix the original bug?
What is the flaw?
Answer Key (Read After You've Filled the Table)
Reveal answers
| # | Flaw | Impact | Fix |
|---|---|---|---|
| 1 | import statement placed inside the class body (after the opening {) | Compile error — Java imports must precede the class declaration | Remove the import; TezCounters is already imported at the top of TaskImpl.java |
| 2 | Filter changed to RUNNING instead of keeping SUCCEEDED; a running attempt's counters are partial and unstable | Returns wrong/partial data; counter values change as the attempt progresses | Revert to SUCCEEDED filter; the real fix is to handle the "no succeeded attempt yet" case separately (return null or empty) |
| 3 | testGetCountersBeforeAnyAttempt asserts assertNotNull on getCounters() which still returns null when no attempt has completed | Test will fail on the patched code — the patch doesn't make getCounters() non-null | Test should call getCountersOrEmpty() or the assertion should accept null and document the contract |
| 4 | assertSame requires the same object reference but getCountersOrEmpty() creates a new TezCounters() each time null is returned | Test fails on every call where no successful attempt exists | Use assertNotNull to verify the non-null contract; don't assert reference identity |
| 5 | The patch adds getCountersOrEmpty() but doesn't fix the root cause — getCounters() still returns null; all existing callers are still broken | Downstream NPEs in counter aggregation loops are not fixed | Change getCounters() itself to return new TezCounters() instead of null, or add a null-guard in every caller; document the chosen contract |
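One way to picture the fix suggested for flaw 5 is the simplified model below. ToyTask, ToyAttempt, ToyCounters and ToyAttemptState are hypothetical classes invented for this sketch; the real TaskImpl has more state and different types, so treat this as an illustration of the contract rather than a drop-in patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the suggested contract; not the real TaskImpl.
class ToyCounters {
    final Map<String, Long> values = new HashMap<>();
}

enum ToyAttemptState { RUNNING, SUCCEEDED, FAILED }

class ToyAttempt {
    final ToyAttemptState state;
    final ToyCounters counters;

    ToyAttempt(ToyAttemptState state, ToyCounters counters) {
        this.state = state;
        this.counters = counters;
    }
}

class ToyTask {
    private final List<ToyAttempt> attempts = new ArrayList<>();

    synchronized void addAttempt(ToyAttempt attempt) {
        attempts.add(attempt);
    }

    // Contract: never returns null. Only a SUCCEEDED attempt's counters count as final;
    // with no such attempt the caller gets a fresh, empty ToyCounters.
    synchronized ToyCounters getCounters() {
        return attempts.stream()
            .filter(a -> a.state == ToyAttemptState.SUCCEEDED)
            .findFirst()
            .map(a -> a.counters)
            .orElseGet(ToyCounters::new);
    }
}
```

Because the empty fallback is a new object on each call, a test of this contract should assert non-null (and emptiness), not reference identity, which is the point behind flaw 4.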
Reflection
- A patch that adds a new method instead of fixing the old one is sometimes called an "additive workaround." When is that acceptable? When is it wrong?
- The Apache Tez review process requires that every patch include a test that would have failed before the fix and passes after. Does this patch satisfy that requirement? Why or why not?
- If you were the committer, what feedback would you leave on JIRA? Write two or three sentences in the style of a real review comment (constructive, specific, pointing at the line).
- Look up a real Tez JIRA review thread (search issues.apache.org/jira for project = TEZ AND labels = patch-available AND resolution = Fixed). Find one comment where a committer asked for a test change. What did they say?
Level 3: Tez Architecture
This level gives you a working mental model of how all Tez components fit together. After completing it you will be able to trace any execution path — from API call to task output — through the code without getting lost. Architecture knowledge is what separates a contributor who fixes isolated bugs from one who can design improvements.
Learning Objectives
By the end of Level 3 you must be able to:
- Draw the Tez component topology from memory (Client → AM → RM → NM → Container)
- Trace a TezClient.submitDAG() call through four class boundaries to the first vertex start
- Explain the role of each of the four state machines and how they interact
- Describe what happens on each of the four communication channels between components
- Explain the Input-Processor-Output (IPO) model and how it relates to DAG edges
- Identify which Protocol Buffer message type carries a given piece of information
Component Topology
┌─────────────────────────────────────────────────────────────────────┐
│  Client JVM                                                         │
│  ┌─────────────┐                                                    │
│  │  TezClient  │──── submitDAG() ────────────────────────────────┐  │
│  └─────────────┘                                                 │  │
└──────────────────────────────────────────────────────────────────┼──┘
                                                                   │ DAGPlan (protobuf)
                                                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│  YARN cluster (AM container allocated by the ResourceManager)       │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  ApplicationMaster container (DAGAppMaster)                   │  │
│  │                                                               │  │
│  │ ┌─────────┐  ┌────────────┐  ┌──────────┐  ┌─────────────────┐│  │
│  │ │ DAGImpl │→ │ VertexImpl │→ │ TaskImpl │→ │ TaskAttemptImpl ││  │
│  │ └─────────┘  └────────────┘  └──────────┘  └─────────────────┘│  │
│  │      │             │                                          │  │
│  │      └── events ───┘                                          │  │
│  │                                                               │  │
│  │  ContainerLauncher ─── launches ─────────────────────────┐    │  │
│  └──────────────────────────────────────────────────────────┼────┘  │
└─────────────────────────────────────────────────────────────┼───────┘
                                                              │ container launch
                                                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│  YARN NodeManagers (one per worker node)                            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  TezChild (task container JVM)                                │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │  LogicalIOProcessorRuntimeTask                          │  │  │
│  │  │    Input(s) ─── Processor ─── Output(s)                 │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
Communication Channels
| Channel | From → To | What travels |
|---|---|---|
| Client → AM | TezClient → DAGClientAMProtocol (IPC) | DAGPlan protobuf, GetDAGStatusRequest |
| AM → RM | RMCommunicator → YARN RM | Container requests, heartbeats, AM completion |
| AM → NM | ContainerLauncher → YARN NM | Container launch context, env, classpath, command |
| AM ↔ Container | TaskCommunicatorManager → TezTaskUmbilicalProtocol (IPC) | Task assignment, task status, event routing |
The Four State Machines
Tez execution is modeled as four nested state machines. Each tracks a specific level of granularity and sends events to the others.
DAGImpl State Machine
| State | Description |
|---|---|
| NEW | DAG created, not yet initialized |
| INITED | All vertices initialized, ready to start |
| RUNNING | At least one vertex is running |
| SUCCEEDED | All vertices succeeded |
| FAILED | At least one vertex failed (unrecoverable) |
| KILLED | AM received a kill request |
| ERROR | Internal AM error |
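Key transition: NEW → INITED builds the VertexImpl objects from the DAGPlan; the vertices are then driven by V_INIT events (confirm the exact trigger point by reading the transitions in DAGImpl).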
VertexImpl State Machine
The most complex state machine in Tez. Has ~30 states and 80+ transitions.
Core states (simplified):
| State | Description |
|---|---|
| NEW | Vertex created, not yet initialized |
| INITIALIZING | Waiting for inputs and vertex managers to initialize |
| INITED | Ready to schedule tasks |
| RUNNING | At least one task is running |
| COMMITTING | All tasks done, running output committers |
| SUCCEEDED | All tasks succeeded, all outputs committed |
| FAILED | Unrecoverable failure |
| RECOVERING | AM restarted, recovering state from history |
The VertexImpl state machine is defined by the StateMachineFactory at the top of
VertexImpl.java. Reading the factory definition gives you the complete transition table.
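The idea of a factory-declared transition table can be sketched in a few lines of plain Java. ToyStateMachine below is a hypothetical miniature, not the Hadoop StateMachineFactory API; the states come from the simplified table above, and the V_READY and V_ALL_TASKS_DONE events are invented placeholders.

```java
import java.util.EnumMap;
import java.util.Map;

enum VState { NEW, INITIALIZING, INITED, RUNNING, SUCCEEDED }

// V_INIT and V_START appear in this page; V_READY and V_ALL_TASKS_DONE are hypothetical.
enum VEvent { V_INIT, V_READY, V_START, V_ALL_TASKS_DONE }

// Hypothetical miniature of a table-driven state machine.
class ToyStateMachine {
    private final Map<VState, Map<VEvent, VState>> table = new EnumMap<>(VState.class);
    private VState current = VState.NEW;

    // Declares one legal transition; returns this so declarations can be chained.
    ToyStateMachine addTransition(VState from, VEvent on, VState to) {
        table.computeIfAbsent(from, k -> new EnumMap<>(VEvent.class)).put(on, to);
        return this;
    }

    // Applies an event; anything not declared in the table is rejected.
    VState handle(VEvent event) {
        Map<VEvent, VState> row = table.get(current);
        VState next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    static ToyStateMachine vertexLike() {
        return new ToyStateMachine()
            .addTransition(VState.NEW, VEvent.V_INIT, VState.INITIALIZING)
            .addTransition(VState.INITIALIZING, VEvent.V_READY, VState.INITED)
            .addTransition(VState.INITED, VEvent.V_START, VState.RUNNING)
            .addTransition(VState.RUNNING, VEvent.V_ALL_TASKS_DONE, VState.SUCCEEDED);
    }
}
```

Reading the chained addTransition calls top to bottom gives the whole legal-transition table at a glance, which is why the factory declaration is the best entry point into a large state-machine class.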
TaskImpl State Machine
Each vertex has N tasks (parallelism = N). TaskImpl tracks one task across its attempts.
| State | Description |
|---|---|
| NEW → SCHEDULED | Task created and placed in the scheduler queue |
| RUNNING | At least one attempt is running |
| SUCCEEDED | One attempt succeeded |
| FAILED | All attempts exhausted |
| KILLED | Task explicitly killed (e.g., pre-emption) |
TaskImpl manages the attempt retry logic: if attempt 1 fails, TaskImpl decides
whether to launch attempt 2 based on the failure mode and retry count configuration.
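As a rough illustration of that decision, assume a single configured limit on failed attempts and a flag marking failures that must not be retried. ToyRetryPolicy and both parameter names are hypothetical; the real logic lives in TaskImpl's transitions and reads its limit from the Tez configuration.

```java
// Hypothetical sketch of a retry decision; not the real TaskImpl logic.
class ToyRetryPolicy {
    // Stand-in for the configured maximum number of failed attempts per task.
    private final int maxFailedAttempts;

    ToyRetryPolicy(int maxFailedAttempts) {
        this.maxFailedAttempts = maxFailedAttempts;
    }

    // A new attempt is launched only if the failure is retryable
    // and the failure budget is not yet used up.
    boolean shouldLaunchNewAttempt(int failedAttemptsSoFar, boolean fatalFailure) {
        return !fatalFailure && failedAttemptsSoFar < maxFailedAttempts;
    }
}
```

With a limit of 4, the first three failures each trigger a new attempt and the fourth moves the task to FAILED, matching the "all attempts exhausted" row in the table.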
TaskAttemptImpl State Machine
One actual container execution of a task.
| State | Description |
|---|---|
| NEW | Attempt created, awaiting container assignment |
| ASSIGNED | Container assigned by the scheduler |
| RUNNING | Container launched, task code executing |
| SUCCESS_FINISHING_CONTAINER | Task reported success, container cleanup in progress |
| SUCCEEDED | Attempt completed successfully |
| FAILED | Attempt failed (may or may not trigger task retry) |
| KILLED | Attempt pre-empted or killed by AM |
Event System
State machine transitions are driven by events. The event bus (AsyncDispatcher) routes
events from producers to the correct state machine.
Key Event Types
| Event Type | Producer | Consumer |
|---|---|---|
| DAGEventType.DAG_INIT | DAGAppMaster | DAGImpl |
| VertexEventType.V_INIT | DAGImpl | VertexImpl |
| VertexEventType.V_START | DAGImpl | VertexImpl |
| TaskEventType.T_SCHEDULE | VertexImpl | TaskImpl |
| TaskAttemptEventType.TA_ASSIGNED | TaskScheduler | TaskAttemptImpl |
| TaskAttemptEventType.TA_DONE | TezTaskUmbilicalProtocol (container callback) | TaskAttemptImpl |
| VertexEventType.V_TASK_COMPLETED | TaskImpl | VertexImpl |
| DAGEventType.DAG_VERTEX_COMPLETED | VertexImpl | DAGImpl |
The event flow for a normal task success:
Container reports TA_DONE
  → TaskAttemptImpl: RUNNING → SUCCEEDED
  → sends T_ATTEMPT_SUCCEEDED to TaskImpl
    → TaskImpl: RUNNING → SUCCEEDED
    → sends V_TASK_COMPLETED to VertexImpl
      → VertexImpl checks: all tasks done?
      → if yes: sends DAG_VERTEX_COMPLETED to DAGImpl
        → DAGImpl checks: all vertices done?
        → if yes: DAG transitions to SUCCEEDED
Every state transition in this chain corresponds to a log line you will see in the AM logs.
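The chain above can be mimicked with a toy event bus to show the key property: posting an event only enqueues it, and handlers run later when the queue is drained. ToyDispatcher is a hypothetical single-threaded sketch (the real AsyncDispatcher drains its queue on a dedicated thread), and it assumes one task and one vertex so that every "all done?" check passes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical single-threaded event bus, for illustration only.
class ToyDispatcher {
    private final Queue<String[]> queue = new ArrayDeque<>();
    private final Map<String, Consumer<String>> handlers = new HashMap<>();
    final List<String> log = new ArrayList<>();

    void register(String target, Consumer<String> handler) {
        handlers.put(target, handler);
    }

    // Posting only enqueues; no handler runs here.
    void post(String target, String eventType) {
        queue.add(new String[] {target, eventType});
    }

    // Delivers queued events in FIFO order, including events posted by handlers.
    void drain() {
        String[] e;
        while ((e = queue.poll()) != null) {
            log.add(e[0] + " <- " + e[1]);
            handlers.get(e[0]).accept(e[1]);
        }
    }

    // Wires the success chain from the text (one task, one vertex assumed).
    static ToyDispatcher successChain() {
        ToyDispatcher d = new ToyDispatcher();
        d.register("TaskAttemptImpl", ev -> d.post("TaskImpl", "T_ATTEMPT_SUCCEEDED"));
        d.register("TaskImpl", ev -> d.post("VertexImpl", "V_TASK_COMPLETED"));
        d.register("VertexImpl", ev -> d.post("DAGImpl", "DAG_VERTEX_COMPLETED"));
        d.register("DAGImpl", ev -> { });
        return d;
    }

    public static void main(String[] args) {
        ToyDispatcher d = successChain();
        d.post("TaskAttemptImpl", "TA_DONE");
        System.out.println("delivered before drain: " + d.log.size());
        d.drain();
        d.log.forEach(System.out::println);
    }
}
```

Nothing is delivered between post() and drain(), which is the behaviour to keep in mind when reading AM logs: the line that posts an event and the line that records the resulting transition can be far apart.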
Protocol Buffers
All cross-process data in Tez is serialized with Protocol Buffers (proto3 in newer versions).
| Proto file | Location | Key messages |
|---|---|---|
| DAGApiRecords.proto | tez-api/src/main/proto/ | DAGPlan, VertexPlan, EdgePlan |
| DAGIo.proto | tez-api/src/main/proto/ | RootInputLeafOutputProto, EntityDescriptorProto |
| HistoryProtos.proto | tez-dag/src/main/proto/ | All timeline/history event types |
| Events.proto | tez-runtime-internals/src/main/proto/ | Task-level events (DataMovementEvent, etc.) |
The DAGPlan message is what TezClient sends to the AM. It contains the complete
description of the DAG: vertices, edges, processor descriptors, I/O configurations, and
edge properties. It is generated from the DAG API object.
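// Simplified sketch, not verbatim Tez code.
// Client side: the IPC call returns the new DAG's ID, not the plan itself.
String dagId = clientAMProtocol.submitDAG(submitDAGRequest).getDagId();
// AM side: the handler reads the DAGPlan out of the request (getter name assumed
// from the generated protobuf class); the plan is then converted to DAGImpl state.
DAGPlan dagPlan = submitDAGRequest.getDAGPlan();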
Input-Processor-Output (IPO) Model
Each task runs a single AbstractProcessor. The processor has access to named Input and
Output instances, which are determined by the edges in the DAG.
┌──────────────────────────────────────────────────────────────────┐
│  Task container                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  LogicalIOProcessorRuntimeTask                             │  │
│  │                                                            │  │
│  │  Inputs:                         Outputs:                  │  │
│  │  ┌──────────────────┐            ┌──────────────────────┐  │  │
│  │  │ OrderedGrouped   │            │ OrderedPartitioned   │  │  │
│  │  │ KVInput          │──┐      ┌──│ KVOutput             │  │  │
│  │  └──────────────────┘  │      │  └──────────────────────┘  │  │
│  │                        ▼      │                            │  │
│  │                 ┌─────────────┴─────┐                      │  │
│  │                 │ MyProcessor       │                      │  │
│  │                 │ extends           │                      │  │
│  │                 │ AbstractProcessor │                      │  │
│  │                 └───────────────────┘                      │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
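The shape of the IPO contract can be sketched with three tiny interfaces. ToyInput, ToyOutput and ToyProcessor are hypothetical simplifications (the real Tez types expose readers and writers rather than plain lists), and the map keys "producer" and "consumer" stand in for the names of the vertices on the other end of each edge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, simplified interfaces; not the Tez runtime API.
interface ToyInput { List<String> read(); }
interface ToyOutput { void write(String record); }
interface ToyProcessor {
    void run(Map<String, ToyInput> inputs, Map<String, ToyOutput> outputs);
}

// The processor only sees named inputs and outputs; it does not know how data moves.
class UpperCaseProcessor implements ToyProcessor {
    @Override
    public void run(Map<String, ToyInput> inputs, Map<String, ToyOutput> outputs) {
        ToyOutput out = outputs.get("consumer");
        for (String record : inputs.get("producer").read()) {
            out.write(record.toUpperCase(Locale.ROOT));
        }
    }
}

class IpoDemo {
    // Plays the role of the in-container task runner: builds the I/O maps, then runs the processor.
    static List<String> runOnce(List<String> records) {
        List<String> sink = new ArrayList<>();
        Map<String, ToyInput> inputs = new HashMap<>();
        inputs.put("producer", () -> records);
        Map<String, ToyOutput> outputs = new HashMap<>();
        outputs.put("consumer", sink::add);
        new UpperCaseProcessor().run(inputs, outputs);
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(runOnce(Arrays.asList("a", "b")));
    }
}
```

Swapping the ToyInput or ToyOutput implementation changes how data arrives or leaves without touching the processor, which is the separation the edge configuration relies on.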
Edge Property Types
The EdgeProperty in the DAG API determines what I/O classes are used between two vertices.
| DataMovementType | Meaning | Default I/O pair |
|---|---|---|
| SCATTER_GATHER | Partitioned, sorted shuffle | OrderedPartitionedKVOutput → OrderedGroupedKVInput |
| BROADCAST | All output sent to all downstream tasks | UnorderedKVOutput → UnorderedKVInput |
| ONE_TO_ONE | Task i → Task i, no shuffle | UnorderedKVOutput → UnorderedKVInput |
| CUSTOM | User-defined routing | User-provided EdgeManagerPlugin |
SCATTER_GATHER corresponds to the classic MapReduce shuffle. BROADCAST is used for joins
where one side is small enough to replicate to all tasks.
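The routing difference between the built-in movement types can be reduced to one question: which downstream tasks consume the output of a given upstream task? The sketch below answers it for the three non-custom types. ToyMovement and ToyRouting are hypothetical names, and the sketch ignores partition contents (under SCATTER_GATHER each consumer reads only its own partition, under BROADCAST it reads everything).

```java
import java.util.ArrayList;
import java.util.List;

enum ToyMovement { SCATTER_GATHER, BROADCAST, ONE_TO_ONE }

// Hypothetical routing sketch; real routing is done by Tez edge managers.
class ToyRouting {
    // Returns the indices of the downstream tasks that read from upstream task sourceTask.
    static List<Integer> consumers(ToyMovement type, int sourceTask, int numDestTasks) {
        List<Integer> dests = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // Task i feeds task i, so both vertices need matching parallelism.
                if (sourceTask < 0 || sourceTask >= numDestTasks) {
                    throw new IllegalArgumentException("No matching destination for task " + sourceTask);
                }
                dests.add(sourceTask);
                break;
            case SCATTER_GATHER: // every consumer reads one partition from every producer
            case BROADCAST:      // every consumer reads the full output of every producer
                for (int i = 0; i < numDestTasks; i++) {
                    dests.add(i);
                }
                break;
        }
        return dests;
    }
}
```

For example, consumers(ToyMovement.ONE_TO_ONE, 2, 4) contains only task 2, while the other two types return all four downstream tasks.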
DataMovementEvent
When a task output is ready, it sends a DataMovementEvent through the umbilical to the AM.
The AM routes it to the downstream tasks so their input knows which partition to fetch.
This event routing is the mechanism by which OrderedGroupedKVInput discovers where each
upstream partition is located — it receives DataMovementEvents from the AM containing
the shuffle server address and partition index.
Required Reading
| # | Resource | What to extract |
|---|---|---|
| 1 | tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | The StateMachineFactory declaration — read all addTransition() calls |
| 2 | tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | The InitTransition class: how DAGPlan becomes state machine objects (DAGImpl itself is constructed by the createDag() call in DAGAppMaster) |
| 3 | tez-api/src/main/proto/DAGApiRecords.proto | DAGPlan and VertexPlan message definitions |
| 4 | tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | The serviceStart() method — component initialization order |
| 5 | tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java | How inputs, processors, and outputs are initialized in a container |
Key Classes Quick Reference
| Class | Module | Role |
|---|---|---|
| DAGAppMaster | tez-dag | AM main class; manages all components; starts the event dispatcher |
| DAGImpl | tez-dag | DAG state machine; tracks vertex completion; manages history |
| VertexImpl | tez-dag | Vertex state machine; manages task scheduling; calls VertexManager |
| TaskImpl | tez-dag | Task state machine; manages attempt lifecycle and retry logic |
| TaskAttemptImpl | tez-dag | TaskAttempt state machine; coordinates container assignment |
| AsyncDispatcher | tez-dag (via Hadoop) | Event bus; routes events to state machines asynchronously |
| TezTaskUmbilicalProtocol | tez-runtime-internals | IPC interface between container and AM |
| TezChild | tez-runtime-internals | Container main class; receives task assignment; runs the task |
| LogicalIOProcessorRuntimeTask | tez-runtime-internals | In-container task runner; sets up IPO |
| TezClient | tez-api | Client API; creates TezSession; submits DAGs |
JIRA Categories for Level 3
Having read the architecture, you can now evaluate:
- Architecture improvement JIRAs — proposals to change how components interact
- State machine correctness bugs — transitions that lead to wrong states
- Event routing issues — events that are lost or sent to wrong consumers
- Container reuse improvements — how tasks are assigned to existing containers
You are still not ready to submit fixes for state machine bugs — those require Level 4. But you can now read these issues intelligently and leave informed comments.
Deliverables
- Draw the component topology diagram from memory (no looking)
- Trace TezClient.submitDAG() to the VertexImpl V_START event through class names
- Identify the state machines and their event types from code (not from this page)
- Explain in your own words what DataMovementEvent does and why it exists
- Lab 3.1 completed: DAG submission trace documented
- Lab 3.2 completed: IPO abstraction walkthrough complete
Common Mistakes
| Mistake | Impact | Correct understanding |
|---|---|---|
| Thinking the AM runs tasks directly | Leads to wrong mental model of container lifecycle | Tasks run in separate JVMs (containers); AM only schedules and monitors |
| Confusing VertexImpl with Vertex (API) | Vertex is the builder; VertexImpl is the runtime state machine | They are in different modules (tez-api vs tez-dag) |
| Thinking AsyncDispatcher is synchronous | Events are queued; transitions happen on the dispatcher thread | Never assume a transition is immediate after an event is posted |
| Reading VertexImpl top-to-bottom | The class is 6000+ lines; reading linearly is unproductive | Start with the StateMachineFactory declaration, then follow individual transitions |
Lab 3.1: Trace a DAG Submission End-to-End
Background
A DAG goes from a Java object constructed with the API to running tasks in containers through a sequence of method calls, IPC calls, and event posts that spans six class boundaries and three JVMs. This lab asks you to trace that path precisely — class name, method name, and the data that crosses each boundary.
Being able to reconstruct this trace from code (not from documentation) is the skill. That
means reading DAGAppMaster.java, DAGImpl.java, VertexImpl.java, and
TezChild.java and following the chain yourself.
The Six Class Boundaries
[1] TezClient.submitDAG(dag)
│
│ DAGClientAMProtocol (IPC) — carries: SubmitDAGRequest{DAGPlan}
▼
[2] DAGClientHandler.submitDAG(request) [in DAGAppMaster]
│
│ posts: DAGAppMasterEvent(NEW_DAG_SUBMITTED)
▼
[3] DAGAppMaster.handle(event)
│
│ calls createDag(dagPlan) → new DAGImpl(...)
│ posts: DAGEvent(DAG_INIT)
▼
[4] DAGImpl.handle(DAGEvent{DAG_INIT})
│
│ InitTransition: initializes all VertexImpl objects
│ posts: VertexEvent(V_INIT) for each vertex
▼
[5] VertexImpl.handle(VertexEvent{V_INIT})
│
│ InitTransition: sets up tasks, calls VertexManager
│ posts: VertexEvent(V_START) when ready
│ posts: TaskEvent(T_SCHEDULE) for each task
▼
[6] TaskImpl → TaskAttemptImpl → ContainerLauncher → NM
│
│ NM starts container JVM: TezChild.main()
▼
[Container JVM] TezChild receives task assignment via TezTaskUmbilicalProtocol
│
▼
LogicalIOProcessorRuntimeTask.run() — Processor.run() called
Step-by-Step Tasks
Step 1: Find the Entry Point in TezClient
Open tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java.
Find the submitDAG(DAG dag) method. Answer:
- What is the name of the IPC protocol interface used to communicate with the AM?
- What does TezClient do if it does not yet have an AM to talk to (session not started)?
- What method on the DAG object serializes it to a DAGPlan protobuf?
- What request object wraps the DAGPlan before it is sent over IPC?
# Find the IPC protocol interface
grep -n "Protocol" tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java | head -10
# Find DAGPlan construction
grep -n "DAGPlan\|createDag\|getPlan" tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java
Step 2: Find the AM-side IPC Handler
The AM exposes the DAGClientAMProtocol interface. The implementation is in DAGAppMaster.
# Find the implementation of submitDAG on the AM side
grep -rn "submitDAG" tez-dag/src/main/java/org/apache/tez/dag/app/ | grep -v test
Open the handler class. Answer:
- What is the exact class name that implements DAGClientAMProtocol?
- What event type does it post to the AsyncDispatcher after receiving the DAGPlan?
- Does the submitDAG call on the AM side block until the DAG completes, or does it return immediately?
Step 3: Trace DAGAppMaster Initialization
Open tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java.
Find the serviceStart() method. Read the component initialization order:
- List the components initialized in serviceStart() in order
- Find where AsyncDispatcher is created and started
- Find where the DAGEventDispatcher (the component that routes DAGEvents to DAGImpl) is registered
# Find component initialization
grep -n "addService\|serviceStart\|startService" \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head -20
Step 4: Read the DAGImpl Init Transition
Open tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java.
Find the StateMachineFactory definition. Locate the transition for DAGEventType.DAG_INIT.
The transition handler class is InitTransition. Find it in the same file.
Answer:
- What does InitTransition.transition() do with each vertex in the DAG?
- After initializing vertices, what event does DAGImpl post?
- Under what condition does the init transition immediately move to RUNNING vs waiting?
# Find the init transition
grep -n "InitTransition\|DAG_INIT" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -20
Step 5: Read the VertexImpl Init Transition
Open tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java.
Find the transition out of NEW on event V_INIT (a vertex is still in NEW when the event arrives; INITIALIZING is one of the states it can move to). The handler is InitTransition
(a different class from the one in DAGImpl).
Answer:
- What is the VertexManager and when is it invoked during initialization?
- How does VertexImpl know how many tasks to create (the parallelism)?
- What event does VertexImpl send to DAGImpl when initialization completes?
# Find vertex init transition
grep -n "V_INIT\|InitTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -20
Step 6: Trace the Container Launch
After tasks are scheduled, TaskAttemptImpl requests a container from the TaskScheduler.
When a container is assigned, ContainerLauncher builds the launch context.
# Find the container launch command construction
grep -rn "containerLaunchContext\|getContainerLaunchContext\|vargs" \
tez-dag/src/main/java/org/apache/tez/dag/app/launcher/ | grep -v test | head -10
Answer:
- What is the main class of the container JVM? (The class with main() that YARN launches)
- What information is passed to TezChild via system properties vs environment variables?
- How does TezChild know which task to run when it starts?
Step 7: Read TezChild.main()
Open tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezChild.java. In current releases TezChild lives in the runtime internals module, not under tez-dag; if the path differs in your checkout, use the find command below.
Find the main() method and the run() loop.
Answer:
- What IPC interface does TezChild use to communicate with the AM?
- What does TezChild do when it receives a TaskSpec from the AM?
- What class is instantiated to actually run the processor?
# Find TezChild (search from the repo root)
find . -name "TezChild.java"
wc -l $(find . -name "TezChild.java")
Complete the Trace Table
Fill in this table by reading the code (not from this page or any other documentation):
| Step | Class | Method | Data / Event |
|---|---|---|---|
| 1 | TezClient | submitDAG() | Sends SubmitDAGRequest{DAGPlan} via IPC |
| 2 | ? | submitDAG() | Posts event ??? |
| 3 | DAGAppMaster | handle() | Creates DAGImpl, posts DAGEvent{DAG_INIT} |
| 4 | DAGImpl | InitTransition.transition() | Posts VertexEvent{V_INIT} for each vertex |
| 5 | VertexImpl | InitTransition.transition() | Posts TaskEvent{T_SCHEDULE} for each task |
| 6 | TaskAttemptImpl | ? | Requests container from RM via TaskScheduler |
| 7 | ContainerLauncher | ? | Launches container JVM with TezChild as main class |
| 8 | TezChild | run() | Receives task spec, starts processor |
| 9 | LogicalIOProcessorRuntimeTask | run() | Calls Processor.run() |
Fill in the ? cells from the actual code. Each cell should contain the real method name.
Expected Output
A completed trace table with all cells filled from code, not from documentation. Each answer should be verifiable by pointing to a specific line in a specific file.
Example format for your notes:
Step 2: DAGClientHandler.submitDAG()
in: tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
line: ~1234
posts: DAGAppMasterEvent(NEW_DAG_SUBMITTED)
Stretch Goals
- Find the AsyncDispatcher queue size configuration. What happens if the queue fills up?
  grep -rn "AsyncDispatcher\|dispatcher.queue" \
    tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head -10
- Find where the AM is told to exit when the DAG completes:
  grep -n "stop\|shutdown\|exit" \
    tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | grep -i "succeeded\|complete"
- Trace what happens to a TA_DONE event from TezChild back to DAGImpl:
  - TezChild calls a method on the umbilical
  - The AM receives it and posts a TaskAttemptEvent
  - TaskAttemptImpl transitions to SUCCEEDED
  - The chain continues up to DAGImpl
  Identify every class and event in this reverse chain.
Lab 3.2: Understand the IPO Abstraction
Background
Every task in Tez runs a Processor that reads from one or more Input objects and writes
to one or more Output objects. This Input-Processor-Output (IPO) model is the fundamental
abstraction for how data moves through a DAG. Edge properties in the API (EdgeProperty,
DataMovementType) determine which I/O classes are instantiated in the container.
This lab walks through the IPO model from the API layer to the runtime, tracing how an
ORDERED_PARTITIONED_KV_OUTPUT configuration becomes actual bytes in a shuffle buffer.
The IPO Interface Hierarchy
tez-runtime-api (in tez-api module):
AbstractLogicalInput
└── AbstractInput
AbstractLogicalOutput
└── AbstractOutput
AbstractProcessor
tez-runtime-library (implementations):
OrderedPartitionedKVOutput extends AbstractLogicalOutput
OrderedGroupedKVInput extends AbstractLogicalInput
UnorderedKVOutput extends AbstractLogicalOutput
UnorderedKVInput extends AbstractLogicalInput
UnorderedPartitionedKVOutput extends AbstractLogicalOutput
BroadcastKVInput extends AbstractLogicalInput (alias for UnorderedKVInput)
The key interface chain:
AbstractLogicalOutput.initialize() → called by LogicalIOProcessorRuntimeTask
AbstractLogicalOutput.start() → called when the processor is started
AbstractLogicalOutput.getWriter() → returns KeyValueWriter for the processor to use
AbstractLogicalOutput.commit() → called after processor.run() completes
AbstractLogicalOutput.close() → cleanup
Step-by-Step Tasks
Step 1: Read the AbstractLogicalOutput Interface
Open tez-api/src/main/java/org/apache/tez/runtime/api/AbstractLogicalOutput.java (the runtime API classes ship in the tez-api module, as noted in the hierarchy above).
Answer:
- What is the purpose of the initialize() method? What does it return?
- What is the difference between start() and initialize()? Why are they separate?
- What method does a Processor call to get a writer to write records?
Step 2: Trace OrderedPartitionedKVOutput.initialize()
Open tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java.
Find the initialize() method.
Answer:
- What configuration key controls the buffer size for sorting?
- What class is created in initialize() to handle the actual sort-and-spill?
- How is the Partitioner class determined at runtime?
# Find sort buffer configuration
grep -n "SORT_MB\|sortmb\|buffer" \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java \
| head -10
# Find the writer/sorter creation
grep -n "new.*Writer\|new.*Sorter\|ExternalSorter" \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java
Step 3: Trace the Write Path
When a processor calls writer.write(key, value), the data goes:
KeyValueWriter.write(key, value)
→ ExternalSorter.collect(key, value, partition)
→ SpillThread triggers when buffer is full
→ IFile.Writer writes sorted partition to local disk
→ On close(): merges all spills into final output file
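The buffer, sort, spill, merge sequence above can be pictured with a small self-contained sketch. This is not Tez's ExternalSorter: the class SpillSketch, its Rec type, the record-count capacity, and the in-memory lists standing in for spill files are all hypothetical, chosen only to make the control flow visible before you read the real code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of buffer -> sort -> spill -> merge. Not Tez's ExternalSorter.
class SpillSketch {
    static final class Rec {
        final int partition;
        final String key;
        final int value;
        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Order by partition first, then by key inside each partition.
    private static final Comparator<Rec> ORDER = (a, b) -> {
        int c = Integer.compare(a.partition, b.partition);
        return c != 0 ? c : a.key.compareTo(b.key);
    };

    private final int capacity;        // records held before a spill (stand-in for a byte budget)
    private final int numPartitions;
    private final List<Rec> buffer = new ArrayList<>();
    private final List<List<Rec>> spills = new ArrayList<>();  // stand-in for spill files

    SpillSketch(int capacity, int numPartitions) {
        this.capacity = capacity;
        this.numPartitions = numPartitions;
    }

    // Assign a partition by key hash, buffer the record, spill when the buffer is full.
    void collect(String key, int value) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        buffer.add(new Rec(partition, key, value));
        if (buffer.size() >= capacity) {
            spill();
        }
    }

    // Sort the buffered records and set them aside as one sorted run.
    private void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<Rec> run = new ArrayList<>(buffer);
        run.sort(ORDER);
        spills.add(run);
        buffer.clear();
    }

    // Flush the last partial buffer, then combine all runs into one sorted output.
    List<Rec> close() {
        spill();
        List<Rec> merged = new ArrayList<>();
        for (List<Rec> run : spills) {
            merged.addAll(run);
        }
        merged.sort(ORDER);  // a real sorter would stream a k-way merge instead of re-sorting
        return merged;
    }

    int spillCount() {
        return spills.size();
    }

    public static void main(String[] args) {
        SpillSketch sorter = new SpillSketch(2, 2);
        sorter.collect("b", 1);
        sorter.collect("a", 2);
        sorter.collect("c", 3);
        for (Rec r : sorter.close()) {
            System.out.println(r.partition + " " + r.key + " " + r.value);
        }
    }
}
```

When you answer the questions below, note where the real implementation differs from this sketch: it budgets bytes rather than records and writes each spill to local disk.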
Find the ExternalSorter class:
find tez-runtime-library/src/main/java -name "ExternalSorter.java"
Answer:
- What data structure holds records before they are spilled?
- What algorithm is used to sort records in the buffer?
- How is the sort key computed for (K, V) pairs with a custom Comparator?
Step 4: Read OrderedGroupedKVInput.initialize()
Open tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java.
Find initialize().
Answer:
- What class handles the shuffle (fetching data from remote nodes)?
- How does the input know which upstream tasks it needs to fetch from?
- What event type does the input consume to discover shuffle locations?
grep -n "Shuffle\|ShuffleManager\|DataMovementEvent" \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java \
| head -15
Step 5: Trace the Read Path
When a processor calls keyValueReader.next(), the data flow is:
KeyValueReader.next()
→ MergedKeyValueIterator.next() [merging multiple sorted partitions]
→ TezRawKeyValueIterator [from TezMerger]
→ IFile.Reader reads from local merged file
But before the merge can happen, the shuffle must fetch data:
DataMovementEvent arrives (from AM, routed from upstream task)
→ ShuffleManager records: "partition P is at host H:port/path"
→ Fetcher.fetch() downloads the partition file
→ stores locally
→ When all partitions fetched: MergeManager merges them
→ final sorted output available for KeyValueReader
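The merge step in the flow above is a k-way merge over already-sorted runs. The sketch below shows that technique with a priority queue. It is an illustration only: MergeSketch and Cursor are hypothetical names, the runs are in-memory integer lists rather than fetched IFile segments, and nothing here is taken from TezMerger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of merging sorted runs. Not TezMerger.
class MergeSketch {
    // One cursor per sorted run; head is the next value that run will yield.
    private static final class Cursor {
        final Iterator<Integer> it;
        int head;
        Cursor(Iterator<Integer> it) {
            this.it = it;
            this.head = it.next();
        }
    }

    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        // The heap always exposes the run whose next value is smallest.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<Cursor>((a, b) -> Integer.compare(a.head, b.head));
        for (List<Integer> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.it.hasNext()) {
                c.head = c.it.next();
                heap.add(c);  // re-insert so the heap re-orders on the new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3));
        System.out.println(merge(runs));  // prints [1, 2, 3, 4, 9]
    }
}
```

The heap holds at most one entry per run, so memory use depends on the number of runs rather than the number of records. Keep that property in mind when you read how the real merge limits the number of segments it opens at once.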
Find the Fetcher class:
find tez-runtime-library/src/main/java -name "Fetcher.java"
wc -l $(find tez-runtime-library/src/main/java -name "Fetcher.java")
Answer:
- What HTTP endpoint does Fetcher call to retrieve partition data?
- What does Fetcher do if the HTTP request fails?
- How many simultaneous fetch connections does Fetcher allow by default?
Step 6: Understand DataMovementEvent Routing
The DataMovementEvent is what connects output to input. When a task completes its output:
- The ShuffleHandler (shuffle server) registers the output location
- The task sends a DataMovementEvent via the TezTaskUmbilicalProtocol to the AM
- The AM routes the event to the downstream tasks that need it
- The downstream input receives it and knows to fetch from that location
Find the DataMovementEvent class:
find . -name "DataMovementEvent.java" | grep -v test
Answer:
- What fields does DataMovementEvent carry?
- Why is the payload (userPayload) a byte array and not a typed field?
- How does the AM know which downstream tasks to route the event to?
# Find the AM-side routing logic
grep -rn "DataMovementEvent\|routeEvent" \
tez-dag/src/main/java/org/apache/tez/dag/app/ --include="*.java" | grep -v test | head -15
Step 7: Edge Properties → I/O Classes
The EdgeProperty object in the API specifies which I/O classes to use. Trace how
EdgeProperty becomes actual I/O class instantiation in the container.
Starting point:
# Find EdgeProperty
find tez-api/src/main/java -name "EdgeProperty.java"
Then trace:
# How does VertexImpl use EdgeProperty to configure I/O for a vertex?
grep -n "EdgeProperty\|getInputDescriptor\|getOutputDescriptor" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -15
Answer:
- What field in EdgeProperty specifies the Input class for the destination vertex?
- What field specifies the Output class for the source vertex?
- How is the class name passed to the container so it can instantiate the correct I/O class?
Build the IPO Map
For OrderedWordCount, fill in this table by reading the code:
| Edge | Source vertex | Dest vertex | Output class | Input class | DataMovementType |
|---|---|---|---|---|---|
| Tokenizer → SumReducer | Tokenizer | SumReducer | ? | ? | ? |
| SumReducer → Sorter | SumReducer | Sorter | ? | ? | ? |
Read tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java to fill in
the ? cells.
grep -n "EdgeProperty\|KVOutput\|KVInput\|DataMovementType" \
tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java
Expected Output
By the end of this lab you have:
- The IPO map table for OrderedWordCount completed
- An answer for each step question (from code, not from documentation)
- Understanding of what DataMovementEvent carries and why it exists
- Knowledge of which configuration key controls sort buffer size
Stretch Goals
- Find the shuffle HTTP server that serves partition data to Fetcher:
  find . -name "ShuffleHandler.java" | grep -v test
  What HTTP framework does it use? What is the URL pattern for fetching a partition?
- Trace what happens when Fetcher receives corrupted data (a checksum mismatch). Does the task fail immediately? Or does it retry from a different source?
- Find the EdgeManagerPlugin interface and read its contract:
  find tez-api/src/main/java -name "EdgeManagerPlugin.java"
  What three methods must a custom edge manager implement, and what do they do? Why would you use a custom edge manager instead of SCATTER_GATHER?
- Look at IntersectExample.java in tez-examples. It uses BROADCAST for one edge. Explain why: what is the semantic meaning of broadcasting in a join operation?
Lab 3.3 — Build It: Multi-Input Union DAG
Lab type: Build It — real Maven project, compilable Java, run + break + fix cycle
Estimated time: 90–120 min
Maven module: book/projects/level-3-multi-input
Main class: org.apache.tez.learning.l3.MultiInputDAG
What You Will Build
EvenNumberSource(1) ──even-edge──┐
├─▶ MultiInputUnionProcessor(1) ──▶ UnionSinkProcessor(1)
OddNumberSource(1) ──odd-edge───┘
Two source vertices emit separate streams of integers (even: 0,2,4,…,98 and odd: 1,3,5,…,99). A middle vertex receives both streams through two named input edges, unions them, and forwards everything to a terminal sink. The sink sums all values and publishes the result via a Tez counter.
Expected output: TotalSum=4950 PASS
This is the smallest possible Tez program with a multi-input vertex — the same structural pattern used by every Tez join, union, and co-group operation.
Step 1 — Set the Tez Version
Open book/projects/pom.xml and confirm <tez.version> matches your local build:
cd ~/tez-src # your Tez clone from Lab 1.1
git log --oneline -1 | head -c 60
mvn help:evaluate -Dexpression=project.version -q -DforceStdout 2>/dev/null
If the version printed differs from what is in the POM, update <tez.version> before
continuing.
Step 2 — Compile and Run the Unit Tests
cd /path/to/apache-tez/book/projects
# Compile and test the new module only
mvn -pl level-3-multi-input test
You should see:
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0
Read every test in TestMultiInputProcessors.java before moving on.
Questions
| # | Question |
|---|---|
| 1 | testEvenAndOddRangesNoOverlapNoGap simulates both sources using a boolean[]. Why is this a more rigorous check than just verifying the counts? |
| 2 | testEdgeNameConstants tests string literals. What real bug would be caught if a developer renamed the constant but not the string in buildDAG()? |
| 3 | testExpectedSum hardcodes 4950L. Could you make this test fail by changing only EvenNumberSource.COUNT? What would change? |
Step 3 — Build the Fat JAR and Run the DAG
mvn -pl level-3-multi-input package -q
java -jar level-3-multi-input/target/level-3-multi-input-1.0-SNAPSHOT-jar-with-dependencies.jar
Expected final line:
[MultiInputDAG] TotalSum=4950 expected=4950 PASS
If you see FAIL, the counter value is wrong — note the actual value before
proceeding to the debugging exercises.
Step 4 — Read Every Source File
Work through each file in src/main/java/org/apache/tez/learning/l3/.
EvenNumberSource.java
| # | Question |
|---|---|
| 1 | What Tez base class does it extend? What does that class provide? |
| 2 | Why does run() call output.start() before getWriter()? What happens if you skip it? (Break It experiment below) |
| 3 | The output is retrieved by getOutputs().values().iterator().next(). What would break if this vertex had two outputs? |
| 4 | Why are key and value declared once outside the loop rather than inside it? What allocation cost would the inner-loop placement cause? |
MultiInputUnionProcessor.java
| # | Question |
|---|---|
| 1 | Inputs are retrieved by string name: inputs.get(EVEN_EDGE). Where are these names assigned? Trace the call to setDestinationEdgeName in MultiInputDAG.java. |
| 2 | Both inputs are started before either reader is obtained. Could you start them one at a time (start even → read even → start odd → read odd)? What would happen? |
| 3 | After draining the even input, the odd input's reader is obtained separately. Is there a scenario where odd records arrive before all even records have been read? How does Tez's buffering handle this? |
| 4 | The processor forwards records unchanged (key=value=integer). What change to run() would be needed to emit only distinct values if both sources could produce duplicates? |
MultiInputDAG.java
| # | Question |
|---|---|
| 1 | Both evenEdge and oddEdge use edgeCfg.createDefaultEdgeProperty(). Could you use different edge configs for the two sources? When would that be necessary? |
| 2 | Edge.setDestinationEdgeName(...) names the edge as seen by the destination vertex. Does the source vertex also see this name? Check by reading the Edge API. |
| 3 | The DAG has 4 vertices. Draw the dependency graph. Which vertices can run in parallel? |
| 4 | waitForCompletion(EnumSet.of(StatusGetOpts.GET_COUNTERS)) — what does GET_COUNTERS do? What would status.getDAGCounters() return if this option were omitted? |
Step 5 — Break It: Three Experiments
Perform each experiment, observe the failure, then revert before the next one.
Experiment A — Swap the edge names
In MultiInputDAG.buildDAG(), swap EVEN_EDGE and ODD_EDGE:
.setDestinationEdgeName(MultiInputUnionProcessor.ODD_EDGE) // was EVEN_EDGE
// ...
.setDestinationEdgeName(MultiInputUnionProcessor.EVEN_EDGE) // was ODD_EDGE
Rebuild and run.
- Does the DAG succeed or fail?
- Is the sum still 4950?
- Why does swapping the names not cause a failure here, but would cause a failure in a join operation where the left and right inputs have different schemas?
Experiment B — Remove one start() call
In MultiInputUnionProcessor.run(), remove evenInput.start().
Rebuild and run.
- What exception is thrown? On which line?
- Search the Tez source for the method that throws this exception. What is the guard condition?
Experiment C — Make one source emit duplicates
In EvenNumberSource.run(), change int n = i * 2 to int n = 0 (every write uses key=0).
Rebuild and run.
- What is the counter value now?
- Is the DAG PASS or FAIL?
- What does this reveal about how OrderedGroupedKVInput handles duplicate keys when the value type is IntWritable?
Step 6 — Implement a FilterUnionProcessor
Create a new file in the same package:
FilterUnionProcessor.java
Specification:
- Extends AbstractLogicalIOProcessor
- Has the same two named inputs as MultiInputUnionProcessor
- Accepts a threshold via UserPayload (key "threshold", default 50)
- Reads from both inputs; only forwards values >= threshold
- Increments counter UnionPipeline/FilteredCount for each record dropped
Wire it into the DAG as a replacement for MultiInputUnionProcessor:
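// Requires java.nio.ByteBuffer and java.nio.charset.StandardCharsets.
Vertex filter = Vertex.create(
    "FilterUnion",
    ProcessorDescriptor.create(FilterUnionProcessor.class.getName())
        .setUserPayload(UserPayload.create(
            // Encode explicitly as UTF-8 so the processor can decode it on any platform.
            ByteBuffer.wrap("threshold=50".getBytes(StandardCharsets.UTF_8)))),
    1);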
Expected result: With threshold=50, values 0–49 are dropped, values 50–99 are
forwarded. Sum at sink = 50+51+…+99 = 3725. FilteredCount = 50.
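Before wiring the processor into Tez, it can help to check the payload parsing and the expected numbers in isolation. The sketch below uses no Tez classes. FilterSketch, parseThreshold, and simulate are hypothetical helper names, and the "threshold=N" text format is an assumption based on the wiring snippet above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper logic for the FilterUnionProcessor exercise; no Tez classes involved.
class FilterSketch {
    static final int DEFAULT_THRESHOLD = 50;

    // Parses a payload such as "threshold=50"; falls back to the default
    // when the payload is missing or malformed.
    static int parseThreshold(byte[] payload) {
        if (payload == null) {
            return DEFAULT_THRESHOLD;
        }
        String text = new String(payload, StandardCharsets.UTF_8).trim();
        String prefix = "threshold=";
        if (!text.startsWith(prefix)) {
            return DEFAULT_THRESHOLD;
        }
        try {
            return Integer.parseInt(text.substring(prefix.length()).trim());
        } catch (NumberFormatException e) {
            return DEFAULT_THRESHOLD;
        }
    }

    // Returns {sumForwarded, droppedCount} for the values 0..99 that the
    // two sources emit between them.
    static long[] simulate(int threshold) {
        long sum = 0;
        long dropped = 0;
        for (int n = 0; n < 100; n++) {
            if (n >= threshold) {
                sum += n;
            } else {
                dropped++;
            }
        }
        return new long[] {sum, dropped};
    }

    public static void main(String[] args) {
        byte[] payload = "threshold=50".getBytes(StandardCharsets.UTF_8);
        long[] result = simulate(parseThreshold(payload));
        System.out.println("sum=" + result[0] + " filtered=" + result[1]);  // prints sum=3725 filtered=50
    }
}
```

In the real processor the bytes would come from the processor context's user payload rather than a local array. Keeping the parsing in a static helper like this also makes it easy to cover in the module's unit tests.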
Step 7 — Tez Source Connection Table
For each class below, locate the corresponding source file in your Tez clone and record the path:
| Class used in this project | Tez source file (relative to repo root) |
|---|---|
| OrderedPartitionedKVOutput | |
| OrderedGroupedKVInput | |
| OrderedPartitionedKVEdgeConfig | |
| HashPartitioner | |
| Edge.setDestinationEdgeName | |
Step 8 — Connect to Real Tez Data Flows
Open tez-examples/src/main/java/org/apache/tez/examples/JoinDataGen.java or
OrderedWordCount.java in the Tez source tree.
- Find a DAG in the examples that has more than 2 vertices.
- Draw its topology as an ASCII diagram.
- Identify which vertex is the "union-like" vertex (if any) that receives edges from multiple sources.
- Compare its processor class to MultiInputUnionProcessor: what is similar, what is different?
Step 9 — JIRA Research
Search issues.apache.org/jira for:
project = TEZ AND text ~ "multi-input" AND resolution = Fixed
Find one resolved issue involving multiple inputs to a single vertex.
- What was the bug?
- Which class was modified?
- Was a test added? If so, what does it test?
Level 4: DAG State Machine Internals
This level takes you inside VertexImpl — the most complex class in the Tez codebase. You
will read the full state machine, understand every major state and the conditions that drive
transitions, learn how VertexManager plugs in to control scheduling, and understand how
speculative execution works. After this level you are capable of diagnosing vertex-level
failures from AM log output and writing patches to the state machine.
Learning Objectives
By the end of Level 4 you must be able to:
- Read a StateMachineFactory definition and produce a transition table from code
- Explain the full INITIALIZING → INITED → RUNNING → SUCCEEDED path with all preconditions
- Describe what VertexManager does and when it is invoked
- Explain the difference between ImmediateStartVertexManager and ShuffleVertexManager
- Describe the speculative execution trigger conditions and what it causes
- Trace a vertex failure from first task failure to DAGImpl receiving V_COMPLETED
- Explain what vertex groups are and why they exist
Reading a StateMachineFactory
Tez uses Hadoop's StateMachineFactory (org.apache.hadoop.yarn.state.StateMachineFactory, shipped in hadoop-yarn-common). The pattern:
private static final StateMachineFactory<VertexImpl, VertexState, VertexEventType, VertexEvent>
    stateMachineFactory =
        new StateMachineFactory<>(VertexState.NEW)
            // From NEW
            .addTransition(VertexState.NEW,
                VertexState.INITIALIZING,
                VertexEventType.V_INIT,
                new InitTransition())
            // From INITIALIZING
            .addTransition(VertexState.INITIALIZING,
                EnumSet.of(VertexState.INITED, VertexState.FAILED),
                VertexEventType.V_INIT_DONE,
                new InitedTransition())
            // ... remaining transitions elided ...
            .installTopology();
Reading rules
- First argument: the source state (where we are now)
- Second argument: the destination state(s). If it is an EnumSet, the transition handler decides which destination to return.
- Third argument: the event type that triggers this transition
- Fourth argument: the SingleArcTransition or MultipleArcTransition handler
A SingleArcTransition always goes to the same destination state. Its transition() method
returns void.
A MultipleArcTransition can go to different states. Its transition() method returns the
next VertexState.
When you see an EnumSet as the second argument, look for a MultipleArcTransition
implementation — the logic inside that class decides which state to move to.
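The difference between the two handler shapes can be shown with a small model that has no Hadoop dependency. Everything in it is hypothetical (ArcSketch, SingleArc, MultiArc, fireSingle, fireMultiple); it mirrors only the shape described above, where a single-arc handler returns nothing and a multiple-arc handler returns the next state, which must be one of the declared destinations.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical mini-model of the two handler shapes. Not Hadoop's StateMachineFactory.
class ArcSketch {
    enum State { NEW, INITIALIZING, INITED, FAILED }

    // Single arc: the destination is fixed at registration time, so the handler returns nothing.
    interface SingleArc {
        void transition(StringBuilder log);
    }

    // Multiple arc: the handler decides the destination and returns it.
    interface MultiArc {
        State transition(StringBuilder log);
    }

    static State fireSingle(State fixedDestination, SingleArc handler, StringBuilder log) {
        handler.transition(log);
        return fixedDestination;
    }

    static State fireMultiple(Set<State> declared, MultiArc handler, StringBuilder log) {
        State next = handler.transition(log);
        if (!declared.contains(next)) {
            // The handler may only return a state listed in the EnumSet.
            throw new IllegalStateException("Handler returned undeclared state " + next);
        }
        return next;
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        State s = fireSingle(State.INITIALIZING, l -> l.append("init;"), log);
        System.out.println(s);
        boolean initOk = true;  // stand-in for whatever the real handler checks
        s = fireMultiple(EnumSet.of(State.INITED, State.FAILED),
            l -> initOk ? State.INITED : State.FAILED, log);
        System.out.println(s);
    }
}
```

When you build your transition table, this is why an EnumSet row needs a note about the condition that selects each destination, while a single-destination row does not.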
How to extract the full transition table
# Count the addTransition calls in VertexImpl
grep -n "addTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| wc -l
# Print them all
grep -n "addTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
The Full Vertex State Machine
NEW → INITIALIZING (event: V_INIT)
Triggered by DAGImpl.InitTransition when the DAG is initializing.
Handler: InitTransition
What happens:
- VertexImpl sets up inputsWithInitializers: inputs that require a RootInputInitializer
- Registers event handlers for root input initializer completion events
- If there are no root input initializers, immediately posts V_INIT_DONE (transitions to INITED in the same logical step)
Precondition for INITIALIZING → INITED:
- All RootInputInitializers have reported completion
- VertexManager.initialize() has completed
INITED → RUNNING (event: V_START)
Triggered by DAGImpl when all source vertices of this vertex have started or when the
vertex has no source edges (it is a root vertex).
Handler: StartTransition
What happens:
- Calls vertexManager.onVertexStarted()
- The VertexManager decides when to schedule tasks
Important: the V_START event does not directly schedule tasks. The VertexManager plugin
does, by calling back into its context: VertexManagerPluginContext.scheduleVertexTasks(), or scheduleTasks() in newer releases.
RUNNING task completion handling
Each task completion (success or failure) generates a V_TASK_COMPLETED event.
The TaskCompletedTransition handler:
- Increments the succeeded/failed task counter
- Checks if all tasks are done → if yes, triggers V_COMPLETE_EVENT
- Checks speculative execution conditions
- Checks if failure count exceeds tolerable failures threshold
Key configuration: tez.vertex.failure-tasks-percent.to-fail-vertex, the percentage of task
failures that causes the entire vertex to fail. Default: 0 (any failure fails the vertex).
Setting it above 0 enables partial failure tolerance.
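The threshold arithmetic can be sketched as follows. This is a guess at the semantics, not the VertexImpl code: FailureToleranceSketch and vertexShouldFail are hypothetical names, and the choice that the vertex fails only when the failed percentage strictly exceeds the configured value is an assumption you should check against the source.

```java
// Hypothetical sketch of a percent-based failure threshold. Not VertexImpl's logic.
class FailureToleranceSketch {
    // Returns true when the share of failed tasks strictly exceeds tolerablePercent.
    // With tolerablePercent == 0, a single failed task is enough to fail the vertex.
    static boolean vertexShouldFail(int failedTasks, int totalTasks, int tolerablePercent) {
        if (failedTasks <= 0 || totalTasks <= 0) {
            return false;
        }
        // Compare failed/total > percent/100 without floating point.
        return failedTasks * 100L > (long) tolerablePercent * totalTasks;
    }

    public static void main(String[] args) {
        System.out.println(vertexShouldFail(1, 20, 0));   // one failure, no tolerance
        System.out.println(vertexShouldFail(2, 20, 10));  // exactly 10 percent failed
        System.out.println(vertexShouldFail(3, 20, 10));  // 15 percent failed
    }
}
```

Whether the real comparison is strict or inclusive at the boundary is exactly the kind of detail to confirm in TaskCompletedTransition.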
RUNNING → COMMITTING (all tasks succeeded)
Before a vertex is marked SUCCEEDED, its output committers run.
Handler: VertexCommitCallback
What happens:
- OutputCommitter.commitOutput() is called for each output with a committer
- Commit is atomic: either all outputs commit or the vertex fails
- The AM must not fail between task completion and output commit (AM recovery handles this)
RUNNING → FAILED
Triggers:
- A task exceeds the failure threshold (V_TASK_COMPLETED with failure)
- A container dies without a task completion report
- VertexManager reports an error
- A downstream vertex fails and error propagation is configured
RECOVERING states
When the AM restarts (e.g., due to a node failure), VertexImpl enters RECOVERING states.
Recovery replays the history events that the previous AM attempt logged through the recovery
service (stored on the distributed filesystem, not in the YARN timeline service) to reconstruct
which tasks completed before the AM died, avoiding re-running already-succeeded tasks.
This is the most complex part of VertexImpl. Recovery bugs are a major category of
contributor-fixable issues.
VertexManager
VertexManager is the plugin interface that controls task scheduling within a vertex.
It sits between the AM framework and the actual task scheduler.
Interface (simplified)
public abstract class VertexManagerPlugin {
  // Called when vertex is initialized; plugin configures itself
  public abstract void initialize() throws Exception;
  // Called when V_START event fires; plugin decides when to schedule tasks
  public abstract void onVertexStarted(List<TaskAttemptIdentifier> completions)
      throws Exception;
  // Called each time a source vertex task completes
  // Plugin uses this to update scheduling decisions (for slow-start)
  public abstract void onSourceTaskCompleted(TaskAttemptIdentifier completedSrcTaskAttempt)
      throws Exception;
  // Called when a VertexManagerEvent sent to this vertex arrives
  // (for example, task output statistics that feed auto-parallelism)
  public abstract void onVertexManagerEventReceived(VertexManagerEvent vmEvent)
      throws Exception;
}
ImmediateStartVertexManager
The default for root vertices and vertices with no special scheduling requirements.
Behavior:
- Schedules all tasks immediately when onVertexStarted() is called
- Does not wait for any source task completion
- Used by: Tokenizer vertex in OrderedWordCount
ShuffleVertexManager
Used for vertices that receive SCATTER_GATHER input from a source vertex.
Behavior:
- Implements slow start: waits until a configurable fraction of source tasks have completed before scheduling downstream tasks
- Configuration keys: tez.shuffle-vertex-manager.min-src-fraction (default 0.25) and tez.shuffle-vertex-manager.max-src-fraction (default 0.75)
- Implements auto-parallelism: can reduce the downstream vertex's parallelism based on the actual size of shuffle data
- When auto-parallelism reduces parallelism, it calls context.reconfigureVertex(), which posts a V_PARALLELISM_UPDATED event to VertexImpl
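One way to picture slow start is as a ramp between the two fractions: nothing is scheduled below the minimum, everything at or above the maximum, and a proportional share in between. The sketch below implements that ramp. It is an assumption about the general shape, not ShuffleVertexManager's code; SlowStartSketch and tasksToSchedule are hypothetical names, and the real class may round differently or apply extra rules.

```java
// Hypothetical sketch of slow-start arithmetic. Not ShuffleVertexManager's actual code.
class SlowStartSketch {
    // How many of the pending downstream tasks may be scheduled, given source progress.
    static int tasksToSchedule(int completedSrc, int totalSrc, int pendingTasks,
                               double minFraction, double maxFraction) {
        if (totalSrc <= 0) {
            return pendingTasks;  // no source tasks to wait for
        }
        double done = (double) completedSrc / totalSrc;
        if (done < minFraction) {
            return 0;             // below the slow-start floor
        }
        if (done >= maxFraction || maxFraction <= minFraction) {
            return pendingTasks;  // past the ceiling, or no ramp configured
        }
        // Linear ramp between the two fractions.
        double share = (done - minFraction) / (maxFraction - minFraction);
        return (int) Math.floor(share * pendingTasks);
    }

    public static void main(String[] args) {
        for (int completed = 0; completed <= 10; completed += 2) {
            System.out.println(completed + "/10 source tasks done -> schedule "
                + tasksToSchedule(completed, 10, 8, 0.25, 0.75) + " of 8");
        }
    }
}
```

Setting both fractions to the same value collapses the ramp into a single threshold, which is a useful configuration to try when you are isolating a slow-start bug.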
Why VertexManager Matters for Contributors
Auto-parallelism and slow-start bugs are a major category of Tez issues. The interaction
between ShuffleVertexManager and VertexImpl involves:
- Parallelism changes after task scheduling
- Race conditions between task completion events and parallelism updates
- Recovery of vertices that had parallelism changed before AM death
Speculative Execution
Speculative execution launches a duplicate task attempt when the original attempt is slow.
Trigger conditions
VertexImpl checks speculation conditions in TaskCompletedTransition and on a periodic
timer:
- At least one task has completed (we have a baseline for "normal" task duration)
- The running attempt has been running longer than speculative_threshold * median_time
- The running attempt's progress is lower than expected for its elapsed time
Configuration:
tez.am.speculation.enabled = true (default: false)
tez.am.speculation.interval-ms = 5000 (check interval)
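The three trigger conditions listed above can be combined into a single predicate, sketched below. The names (SpeculationSketch, shouldSpeculate) and the way expected progress is estimated from elapsed time over median time are hypothetical simplifications; Tez's speculator uses its own runtime estimators, which you should read in the source.

```java
// Hypothetical sketch of the three trigger conditions. Not Tez's speculator.
class SpeculationSketch {
    static boolean shouldSpeculate(int completedTasks, long medianMs, long elapsedMs,
                                   double progress, double threshold) {
        // 1. Need at least one completed task as a baseline for normal duration.
        if (completedTasks < 1 || medianMs <= 0) {
            return false;
        }
        // 2. The attempt must have run longer than threshold * median.
        if (elapsedMs <= threshold * medianMs) {
            return false;
        }
        // 3. Its progress must lag what the elapsed time would suggest.
        double expected = Math.min(1.0, (double) elapsedMs / medianMs);
        return progress < expected;
    }

    public static void main(String[] args) {
        System.out.println(shouldSpeculate(0, 1000, 5000, 0.1, 1.5));  // no baseline yet
        System.out.println(shouldSpeculate(3, 1000, 2000, 0.4, 1.5));  // slow and behind
        System.out.println(shouldSpeculate(3, 1000, 1200, 0.4, 1.5));  // not slow enough
    }
}
```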
What happens
- VertexImpl posts a TaskEventType.T_ADD_SPEC_ATTEMPT event to TaskImpl
- TaskImpl creates a new TaskAttemptImpl
- Both attempts run concurrently
- The first to succeed wins; the other is killed
- The winning attempt's output is committed; the losing attempt's output is discarded
Interaction with ShuffleVertexManager
If a speculative attempt completes, ShuffleVertexManager receives an onSourceTaskCompleted
callback for the winning attempt. It must de-duplicate: the task's output should only be
counted once regardless of which attempt succeeded.
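The de-duplication requirement amounts to keying completions by task rather than by attempt. A minimal sketch, with hypothetical names (CompletionDedupSketch and its methods) and integer indexes standing in for Tez's identifier types:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of per-task de-duplication. Not ShuffleVertexManager's bookkeeping.
class CompletionDedupSketch {
    private final Set<Integer> completedTaskIndexes = new HashSet<>();

    // Returns true only the first time a given task index is reported complete.
    // The attempt number is deliberately ignored: two attempts of one task count once.
    boolean onSourceTaskCompleted(int taskIndex, int attemptNumber) {
        return completedTaskIndexes.add(taskIndex);
    }

    int completedTasks() {
        return completedTaskIndexes.size();
    }

    public static void main(String[] args) {
        CompletionDedupSketch tracker = new CompletionDedupSketch();
        System.out.println(tracker.onSourceTaskCompleted(7, 0));  // original attempt
        System.out.println(tracker.onSourceTaskCompleted(7, 1));  // speculative attempt, same task
        System.out.println(tracker.completedTasks());
    }
}
```

If the count were keyed by attempt instead, a task with a speculative attempt could push the completed fraction past the slow-start threshold too early. That is the class of bug to look for when reading the real bookkeeping.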
Vertex Groups
Vertex groups (VertexGroup in the API) allow multiple vertices to be treated as a single
logical vertex for downstream consumption.
Use case: merging the output of multiple Map vertices before a single Reduce vertex, without
an intermediate shuffle. This is used in the Hive UnionAll operator implementation.
Key classes:
- VertexGroup API: tez-api/src/main/java/org/apache/tez/dag/api/VertexGroup.java
- GroupInputEdge: an edge from a VertexGroup to a regular vertex
- The downstream vertex sees a single MergedLogicalInput that combines all group members
Key Classes for This Level
| Class | Path | Focus |
|---|---|---|
| VertexImpl | tez-dag/.../dag/impl/VertexImpl.java | The entire state machine; 6000+ lines |
| ShuffleVertexManager | tez-runtime-library/.../vertexmanager/ShuffleVertexManager.java | Slow start and auto-parallelism |
| ImmediateStartVertexManager | tez-dag/.../dag/impl/ImmediateStartVertexManager.java | Simple baseline |
| VertexManagerPlugin | tez-api/.../VertexManagerPlugin.java | The interface |
| VertexManagerPluginContext | tez-api/.../VertexManagerPluginContext.java | What the plugin can call back into |
| TaskImpl | tez-dag/.../dag/impl/TaskImpl.java | Manages attempt lifecycle |
# Find the VertexManager implementations (they are spread across modules, so search from the repo root)
find . -path "*/src/main/java/*" -name "*VertexManager*.java"
JIRA Categories for Level 4 Contributors
You are now ready to investigate and submit patches for:
- Vertex failure handling bugs: incorrect state transitions, wrong error messages
- VertexManager logic bugs: slow-start fraction calculation, auto-parallelism edge cases
- Recovery bugs: vertices that fail to recover correctly after AM restart
- Speculation bugs: duplicate completions, wrong trigger conditions
- Test improvements: TestVertexImpl has hundreds of tests; adding coverage for edge cases
Approach:
- Find a TestVertexImpl test that is @Ignored and read the comment explaining why
- If the bug is fixed, the @Ignore can be removed (a trivial but real contribution)
- Or find a state machine transition that has no test coverage (grep for the transition, then grep for the handler class name in test files)
Deliverables
- Extract the complete VertexImpl state transition table (all source states, event types, destination states) from the code
- Explain ShuffleVertexManager slow-start in your own words, with the relevant config keys
- Trace a vertex failure through TaskImpl → VertexImpl → DAGImpl using event type names
- Identify one @Ignored test in TestVertexImpl and read why it is ignored
- Lab 4.1 completed: full state machine map documented
- Lab 4.2 completed: VertexManager walkthrough complete
Common Mistakes
| Mistake | Impact | Correct understanding |
|---|---|---|
| Assuming V_START schedules tasks | Code changes that bypass VertexManager break auto-parallelism | V_START calls VertexManager.onVertexStarted(); the manager schedules |
| Ignoring RECOVERING states | Patches that forget about recovery cause AM restart failures | Every new state or transition must handle the RECOVERING_* path |
| Confusing TaskImpl failure handling with VertexImpl | Retry logic is in TaskImpl; failure threshold is in VertexImpl | Read both classes before touching failure handling code |
| Reading VertexImpl in isolation | Many transitions involve callbacks to DAGImpl | Always trace events both ways: into the state machine AND back out |
Lab 4.1: Read the VertexImpl State Machine
Background
VertexImpl.java is the most complex class in Apache Tez. It is approximately 6,000 lines
long and contains the complete state machine for vertex execution, including initialization,
scheduling, task completion handling, failure handling, speculative execution, and AM
recovery. Reading it systematically — rather than linearly — is the skill this lab builds.
The output of this lab is a complete state transition table that you have produced from the source code, without reference to any external documentation.
How to Read a Large State Machine Class
Do not read VertexImpl.java from top to bottom. Instead:
- Start with the StateMachineFactory declaration (search for stateMachineFactory =)
- Extract all addTransition calls: this gives you the complete transition table
- For each transition, find the handler class: the inner class that implements SingleArcTransition or MultipleArcTransition
- Read each handler's transition() method: this is the actual state machine logic
- Trace inter-state-machine events: where does the handler post events to other state machines?
Step-by-Step Tasks
Step 1: Find the StateMachineFactory
grep -n "stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -5
Note the line number. The factory declaration starts there and continues for hundreds of lines. Read the entire factory definition — do not skip any transitions.
Step 2: Count All States and Transitions
# List distinct source states referenced in addTransition (append "| wc -l" to count them)
grep "addTransition(VertexState\." \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| sed 's/.*addTransition(VertexState\.\([A-Z_]*\).*/\1/' \
| sort -u
# Count total transitions
grep -c "addTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
Record your numbers. The set of source states is bounded by the VertexState enum, so expect far fewer distinct states than transitions; the exact counts vary between Tez versions.
Step 3: Build the Transition Table
For each line in the StateMachineFactory, extract:
- Source state
- Event type
- Destination state(s)
- Handler class name
Begin with the transitions from NEW:
# Find all transitions FROM NEW (-A4 also prints the argument lines that follow each call)
grep -n -A4 "addTransition(VertexState\.NEW" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| head -20
Then from INITIALIZING:
grep -A4 "addTransition(VertexState\.INITIALIZING" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -40
Build a table with columns: Source State | Event | Destination | Handler.
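If the grep output becomes tedious to transcribe, the extraction can be scripted. The sketch below is a rough illustration under stated assumptions: it assumes the calls are written as .addTransition(source, destination, event, handler), it ignores comments and string literals that contain parentheses or commas, and the TransitionExtractor class and the sample text are hypothetical, not part of Tez.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns addTransition(...) calls in source text into table rows.
final class TransitionExtractor {
    // Splits the text between the call's outer parentheses on top-level commas only.
    static List<String> topLevelArgs(String call) {
        List<String> args = new ArrayList<>();
        int start = call.indexOf('(');
        if (start < 0) return args;
        int depth = 0;
        StringBuilder cur = new StringBuilder();
        for (int i = start + 1; i < call.length(); i++) {
            char c = call.charAt(i);
            if (c == '(') depth++;
            if (c == ')') {
                if (depth == 0) break;   // closing parenthesis of addTransition itself
                depth--;
            }
            if (c == ',' && depth == 0) {
                args.add(cur.toString().trim());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) args.add(cur.toString().trim());
        return args;
    }

    // Emits rows in the order: Source State | Event | Destination | Handler.
    static List<String> rows(String source) {
        List<String> rows = new ArrayList<>();
        String flat = source.replaceAll("\\s+", " ");   // join calls split across lines
        String[] chunks = flat.split("\\.addTransition");
        for (int i = 1; i < chunks.length; i++) {
            List<String> a = topLevelArgs(chunks[i]);
            if (a.size() < 3) continue;                  // not a transition declaration
            String handler = a.size() > 3 ? a.get(3) : "(none)";
            rows.add(String.join(" | ", a.get(0), a.get(2), a.get(1), handler));
        }
        return rows;
    }

    public static void main(String[] args) {
        String sample =
            "factory\n"
            + "  .addTransition\n"
            + "      (VertexState.NEW,\n"
            + "          EnumSet.of(VertexState.INITED, VertexState.FAILED),\n"
            + "          VertexEventType.V_INIT, new InitTransition())\n"
            + "  .addTransition(VertexState.INITED, VertexState.RUNNING,\n"
            + "      VertexEventType.V_START, new StartTransition())\n";
        rows(sample).forEach(System.out::println);
    }
}
```

Treat the output as a first draft of the table only; you still need to open each handler to see which destination a multi-arc transition actually chooses.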
Step 4: Trace the Happy Path
The happy path for a vertex with no source edges (a root vertex, e.g., Tokenizer):
NEW
V_INIT → INITIALIZING (InitTransition)
V_INIT_DONE → INITED (InitedTransition — if no root input initializers)
V_START → RUNNING (StartTransition)
[VertexManager schedules tasks]
[All tasks complete successfully]
V_TASK_COMPLETED (final task) → COMMITTING (TaskCompletedTransition)
V_COMMIT_COMPLETED → SUCCEEDED (CommitCompletedTransition)
For each transition in the happy path, find the handler class and answer:
InitTransition.transition():
grep -n "class InitTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
- What does InitTransition do when there are no RootInputInitializers?
- Does it immediately post V_INIT_DONE, or is there an intermediate step?
InitedTransition.transition() (or whatever class handles V_INIT_DONE):
- When does INITIALIZING go to INITED vs going directly to RUNNING?
- What is the condition that allows immediate transition to RUNNING?
StartTransition.transition():
- What method on VertexManager is called here?
- Does this method block or is it asynchronous?
TaskCompletedTransition.transition():
- How does it track whether all tasks have completed?
- What is numSuccessSourceAttemptCompletions?
- At what point does it decide the vertex can move to COMMITTING?
Step 5: Trace the Failure Path
A task fails. The event chain:
TaskAttemptImpl: RUNNING → FAILED (sends T_ATTEMPT_FAILED to TaskImpl)
TaskImpl: RUNNING → FAILED (if retry limit exceeded; sends V_TASK_COMPLETED{FAILED})
VertexImpl: RUNNING → ?
Find the handler for V_TASK_COMPLETED when the task is FAILED:
# TaskCompletedTransition handles both success and failure
grep -n "TaskCompletedTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -10
Answer:
- What field tracks the number of failed tasks?
- What is the condition that causes the vertex to transition to FAILED?
- What event does VertexImpl send to DAGImpl when it fails?
- Does DAGImpl fail immediately when a vertex fails, or does it try to continue?
# Find how DAGImpl handles vertex failure
grep -n "DAG_VERTEX_COMPLETED\|vertexFailed\|VERTEX_FAILED" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -15
Step 6: Find the RECOVERING States
grep "RECOVERING" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| grep "VertexState\." | head -20
Answer:
- How many RECOVERING_* states exist?
- What event exits the RECOVERING state?
- What class handles recovery completion?
Step 7: Find All @Ignored Tests in TestVertexImpl
grep -n "@Ignore" \
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
For each @Ignored test:
- Read the comment explaining why it is ignored
- Determine if the bug has been fixed (search JIRA for the referenced issue number)
- If the fix exists, the test can likely be re-enabled — this is a contributor opportunity
Step 8: Find a Transition with No Test Coverage
Pick three transition handler classes from your transition table. For each, check if
TestVertexImpl has a test that exercises that handler:
# Example: does TestVertexImpl test TaskCompletedTransition?
grep -n "TaskCompletedTransition\|taskCompletedTransition" \
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java | head -5
# If none found, search for tests that trigger V_TASK_COMPLETED
grep -n "V_TASK_COMPLETED\|VertexEventType.V_TASK_COMPLETED" \
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java | head -10
Identify one transition that appears to have insufficient test coverage and document it.
This is a potential Test JIRA issue you could file and fix.
Deliverable: Your Transition Table
Produce a table in this format (populate all rows from code):
| Source State | Event Type | Destination | Handler Class |
|---|---|---|---|
| NEW | V_INIT | INITIALIZING | InitTransition |
| INITIALIZING | V_INIT_DONE | INITED / FAILED | InitedTransition |
| INITED | V_START | RUNNING | StartTransition |
| RUNNING | V_TASK_COMPLETED | RUNNING/SUCCEEDED | TaskCompletedTransition |
| ... | ... | ... | ... |
Your table should have at least 30 rows (covering the main execution paths). Recovery states are optional for this level.
Expected Output
- A complete (or near-complete) state transition table for VertexImpl
- Answers to all questions in Steps 4–6 with file:line references
- List of @Ignored tests with your assessment of whether they could be re-enabled
- One transition identified as having insufficient test coverage
Stretch Goals
- Produce the same transition table for TaskImpl and TaskAttemptImpl. Compare their complexity (number of states and transitions) to VertexImpl.
- Find all places where VertexImpl calls eventHandler.handle() to post an event to another state machine. What are the target state machines and what event types are used?
grep -n "eventHandler.handle" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| grep -v "VertexEvent" | head -20
- Find the V_PARALLELISM_UPDATED transition: what does it do, and why is it one of the most bug-prone transitions in the state machine?
Lab 4.2: VertexManager Deep Dive
Background
VertexManager is the hook that makes Tez more than just a DAG scheduler. By plugging in
a custom VertexManagerPlugin, applications can implement dynamic parallelism, slow start,
skew handling, and custom task scheduling — without modifying the core AM.
This lab walks through the two built-in VertexManager implementations, explains their
behaviors via code reading, and ends with a minimal custom VertexManagerPlugin that you
write and unit-test.
The VertexManagerPlugin Contract
Full interface:
tez-api/src/main/java/org/apache/tez/dag/api/VertexManagerPlugin.java
public abstract class VertexManagerPlugin {
  private final VertexManagerPluginContext context;

  // The AM passes the context to the constructor when it instantiates the plugin
  public VertexManagerPlugin(VertexManagerPluginContext context) { ... }

  // Plugin code calls back into the AM through this accessor
  public final VertexManagerPluginContext getContext() { ... }

  // Lifecycle callbacks. Which of these are abstract and which have default
  // bodies differs between Tez versions; confirm against your checkout.
  public abstract void initialize() throws Exception;
  public abstract void onVertexStarted(List<TaskAttemptIdentifier> completions)
      throws Exception;
  public abstract void onSourceTaskCompleted(
      TaskAttemptIdentifier completedSrcTaskAttempt) throws Exception;
  public abstract void onVertexManagerEventReceived(
      VertexManagerEvent vmEvent) throws Exception;

  // Called when a root input's initializer has produced its events (root inputs only):
  public abstract void onRootVertexInitialized(String inputName,
      InputDescriptor inputDescriptor, List<Event> events) throws Exception;
}
VertexManagerPluginContext — what the plugin can call back into
find tez-api/src/main/java -name "VertexManagerPluginContext.java"
cat $(find tez-api/src/main/java -name "VertexManagerPluginContext.java")
Key methods on the context:
| Method | What it does |
|---|---|
| scheduleVertexTasks(List<TaskWithLocationHint>) | Schedules the given tasks for execution |
| reconfigureVertex(int parallelism, VertexLocationHint, Map<String,EdgeProperty>) | Changes parallelism and/or edge properties at runtime |
| getVertexNumTasks(String vertexName) | Returns the current parallelism of a named vertex |
| getCurrentParallelism() | Returns this vertex's current parallelism |
| getInputVertexEdgeProperties() | Returns the EdgeProperty for each input edge |
| sendEventToProcessor(List<Event>, String, int) | Sends a VertexManagerEvent to a task |
Reading ImmediateStartVertexManager
find tez-dag/src/main/java -name "ImmediateStartVertexManager.java"
cat $(find tez-dag/src/main/java -name "ImmediateStartVertexManager.java")
Answer these questions from the code:
- In initialize(): does ImmediateStartVertexManager do anything? If not, why does it exist?
- In onVertexStarted(): does it schedule tasks immediately or wait for anything?
- What TaskWithLocationHint does it create for each task? Does it provide any location hints?
- Does it implement onSourceTaskCompleted()? If so, what does it do?
Expected finding: ImmediateStartVertexManager is intentionally minimal. Its purpose is to
provide a named, testable implementation that schedules all tasks immediately with no
location hints. It is the baseline from which ShuffleVertexManager diverges.
Reading ShuffleVertexManager
# ShuffleVertexManager lives in tez-runtime-library rather than tez-dag, so search the whole tree
find . -name "ShuffleVertexManager.java"
wc -l $(find . -name "ShuffleVertexManager.java")
Slow Start
Find the slow-start logic in onSourceTaskCompleted().
grep -n "minFraction\|maxFraction\|min-src-fraction\|completedSourceTasks\|pendingTasksToSchedule" \
$(find . -name "ShuffleVertexManager.java") | head -20
Answer:
- What is the variable that tracks how many source tasks have completed?
- At what fraction does ShuffleVertexManager start scheduling tasks?
- What is the formula: at fraction F between minFraction and maxFraction, what percentage of downstream tasks are scheduled?
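Before reading the source, it helps to have a candidate formula to compare against. The sketch below assumes a linear ramp between the two fractions, which is one plausible reading of slow start and should be checked against what ShuffleVertexManager actually does; the SlowStartSketch class and its method names are hypothetical.

```java
// Hypothetical sketch of a slow-start ramp; verify the real formula in the Tez source.
final class SlowStartSketch {
    // Fraction of downstream tasks eligible to run, given the fraction of
    // source tasks completed. Assumes a linear ramp between min and max.
    static double fractionToSchedule(double completed, double min, double max) {
        if (completed < min) return 0.0;                    // below the threshold: wait
        if (max <= min || completed >= max) return 1.0;     // at or past max: everything
        return (completed - min) / (max - min);             // in between: linear
    }

    // Number of downstream tasks that may be scheduled so far.
    static int tasksToSchedule(int totalTasks, int completedSrc, int totalSrc,
                               double min, double max) {
        if (totalSrc == 0) return totalTasks;               // nothing to wait for
        double completed = (double) completedSrc / totalSrc;
        return (int) (totalTasks * fractionToSchedule(completed, min, max));
    }

    public static void main(String[] args) {
        // 50 of 100 source tasks done, ramp from 0.25 to 0.75, 100 downstream tasks.
        System.out.println(tasksToSchedule(100, 50, 100, 0.25, 0.75)); // 50
    }
}
```

Note the explicit totalSrc == 0 branch: any formula that divides by the source task count needs a decision about the empty case, which is the theme of Lab 4.4.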
Auto-Parallelism
Find the auto-parallelism logic:
grep -n "reconfigureVertex\|numBipartiteSourceTasks\|desiredTaskInputSize\|targetParallelism" \
$(find . -name "ShuffleVertexManager.java") | head -20
Answer:
- What configuration key enables auto-parallelism?
- What information does ShuffleVertexManager use to compute the optimal parallelism?
- When is context.reconfigureVertex() called?
- What is the minimum parallelism ShuffleVertexManager will ever set (the floor)?
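As a reference point for these questions, here is a rough sketch of how a size-based parallelism decision can be computed. It assumes the decision only ever shrinks parallelism, uses ceiling division by a target input size per task, and extrapolates total output from the tasks that have reported so far. All names (AutoParallelismSketch, desiredBytesPerTask, minParallelism) are hypothetical; compare the arithmetic with what you find in the source rather than treating it as the Tez implementation.

```java
// Hypothetical sketch of a size-based parallelism decision; not the Tez implementation.
final class AutoParallelismSketch {
    // Ceiling division of the projected input size by the desired bytes per task,
    // clamped to the range [minParallelism, currentParallelism].
    static int desiredParallelism(long projectedInputBytes, long desiredBytesPerTask,
                                  int currentParallelism, int minParallelism) {
        if (desiredBytesPerTask <= 0) {
            throw new IllegalArgumentException("desiredBytesPerTask must be positive");
        }
        long needed = (projectedInputBytes + desiredBytesPerTask - 1) / desiredBytesPerTask;
        long clamped = Math.max(minParallelism, Math.min(currentParallelism, needed));
        return (int) clamped;
    }

    // Extrapolates total output from the source tasks that have reported so far.
    static long projectTotal(long observedBytes, int reportedTasks, int totalTasks) {
        if (reportedTasks == 0) return 0L;   // no data yet; avoids divide-by-zero
        return observedBytes * totalTasks / reportedTasks;
    }

    public static void main(String[] args) {
        long projected = projectTotal(400L, 4, 10);                     // 1000
        System.out.println(desiredParallelism(projected, 100L, 50, 1)); // 10
    }
}
```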
VertexManagerEvent handling
When auto-parallelism is enabled, each upstream task sends a VertexManagerEvent to the
downstream VertexManagerPlugin containing statistics about its output (byte count,
record count, partition sizes).
grep -n "VertexManagerEvent\|onVertexManagerEventReceived\|vmEvent" \
$(find . -name "ShuffleVertexManager.java") | head -15
Answer:
- What protobuf message is decoded from the event payload?
- What statistic is accumulated across all events?
- How does ShuffleVertexManager use the accumulated statistics to decide on new parallelism?
Write a Minimal Custom VertexManager
Create a CountingVertexManager that:
- Schedules 50% of tasks immediately when onVertexStarted() is called
- Schedules the remaining tasks when all source tasks have completed
- Logs the number of scheduled tasks at each scheduling call
This is the core pattern of slow-start, stripped to its minimum.
Implementation skeleton
package org.apache.tez.dag.library.vertexmanager;

import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.VertexManagerPlugin;
import org.apache.tez.dag.api.VertexManagerPluginContext;
import org.apache.tez.runtime.api.Event;
import org.apache.tez.runtime.api.TaskAttemptIdentifier;
import org.apache.tez.runtime.api.events.VertexManagerEvent;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.ArrayList;
import java.util.List;

public class CountingVertexManager extends VertexManagerPlugin {
  private static final Logger LOG =
      LoggerFactory.getLogger(CountingVertexManager.class);
  private int totalSourceTasks = 0;
  private int completedSourceTasks = 0;
  private boolean secondBatchScheduled = false;
  private int totalTasksToSchedule = 0;

  // The AM creates the plugin reflectively and hands it the context here
  public CountingVertexManager(VertexManagerPluginContext context) {
    super(context);
  }

  @Override
  public void initialize() {
    // Verify this accessor exists in your API version; an alternative is
    // getContext().getVertexNumTasks(getContext().getVertexName())
    totalTasksToSchedule = getContext().getCurrentParallelism();
    // Count source tasks across all input vertices
    for (String inputVertex : getContext().getInputVertexEdgeProperties().keySet()) {
      totalSourceTasks += getContext().getVertexNumTasks(inputVertex);
    }
  }

  @Override
  public void onVertexStarted(List<TaskAttemptIdentifier> completions) {
    // Schedule first 50% (integer division: an odd count puts the extra task in the second batch)
    int firstBatch = totalTasksToSchedule / 2;
    List<VertexManagerPluginContext.ScheduleTaskRequest> toSchedule = new ArrayList<>();
    for (int i = 0; i < firstBatch; i++) {
      toSchedule.add(VertexManagerPluginContext.ScheduleTaskRequest.create(i, null));
    }
    LOG.info("CountingVertexManager: scheduling first batch of {} tasks", firstBatch);
    getContext().scheduleTasks(toSchedule);
  }

  @Override
  public void onSourceTaskCompleted(TaskAttemptIdentifier completedSrcTaskAttempt) {
    completedSourceTasks++;
    if (!secondBatchScheduled && completedSourceTasks >= totalSourceTasks) {
      // Schedule remaining 50%
      int firstBatch = totalTasksToSchedule / 2;
      List<VertexManagerPluginContext.ScheduleTaskRequest> toSchedule = new ArrayList<>();
      for (int i = firstBatch; i < totalTasksToSchedule; i++) {
        toSchedule.add(VertexManagerPluginContext.ScheduleTaskRequest.create(i, null));
      }
      LOG.info("CountingVertexManager: scheduling second batch of {} tasks",
          toSchedule.size());
      getContext().scheduleTasks(toSchedule);
      secondBatchScheduled = true;
    }
  }

  @Override
  public void onVertexManagerEventReceived(VertexManagerEvent vmEvent) {
    // No-op: we don't need statistics for this simple implementation
  }

  @Override
  public void onRootVertexInitialized(String inputName,
      InputDescriptor inputDescriptor, List<Event> events) {
    // No-op: this plugin is meant for vertices fed by other vertices, not root inputs
  }
}
Implementation tasks
- Identify the correct API method: getContext().scheduleTasks() vs getContext().scheduleVertexTasks(). Check which one exists in your version of the API.
- Write a unit test using MockVertexManagerPluginContext (if it exists) or a mock:
  - Initialize the manager with parallelism = 10 and 4 source tasks
  - Call onVertexStarted(): verify 5 tasks are scheduled
  - Call onSourceTaskCompleted() 4 times: verify remaining 5 tasks are scheduled on the 4th call
  - Verify secondBatchScheduled is true after
- Register the CountingVertexManager in a DAG:
Vertex reduceVertex = Vertex.create("reducer",
    ProcessorDescriptor.create(MyReducer.class.getName()), 10);
reduceVertex.setVertexManagerPlugin(
    VertexManagerPluginDescriptor.create(CountingVertexManager.class.getName()));
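The unit-test scenario in the tasks above can be rehearsed without any Tez dependency. The sketch below isolates the half-then-rest scheduling logic behind a plain Consumer that stands in for the plugin context; HalfThenRestSketch and its method names are hypothetical and exist only to let you check the expected call sequence before you write the real test against a mocked VertexManagerPluginContext.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, Tez-free model of the CountingVertexManager scheduling logic.
final class HalfThenRestSketch {
    private final int totalTasks;
    private final int totalSourceTasks;
    private final Consumer<List<Integer>> scheduler;  // stands in for the plugin context
    private int completedSourceTasks = 0;
    private boolean secondBatchScheduled = false;

    HalfThenRestSketch(int totalTasks, int totalSourceTasks,
                       Consumer<List<Integer>> scheduler) {
        this.totalTasks = totalTasks;
        this.totalSourceTasks = totalSourceTasks;
        this.scheduler = scheduler;
    }

    // First batch: task indices [0, totalTasks / 2).
    void onVertexStarted() {
        schedule(0, totalTasks / 2);
    }

    // Second batch: the remaining indices, once every source task has completed.
    void onSourceTaskCompleted() {
        completedSourceTasks++;
        if (!secondBatchScheduled && completedSourceTasks >= totalSourceTasks) {
            schedule(totalTasks / 2, totalTasks);
            secondBatchScheduled = true;
        }
    }

    boolean isSecondBatchScheduled() {
        return secondBatchScheduled;
    }

    private void schedule(int from, int to) {
        List<Integer> batch = new ArrayList<>();
        for (int i = from; i < to; i++) batch.add(i);
        scheduler.accept(batch);
    }

    public static void main(String[] args) {
        List<List<Integer>> calls = new ArrayList<>();
        HalfThenRestSketch m = new HalfThenRestSketch(10, 4, calls::add);
        m.onVertexStarted();
        for (int i = 0; i < 4; i++) m.onSourceTaskCompleted();
        System.out.println(calls); // [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
    }
}
```

One thing this model makes visible: a vertex with zero source tasks never receives onSourceTaskCompleted(), so its second batch is never scheduled. Decide how your real plugin should handle that case.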
Finding the VertexManager Test Utilities
# Find mock context for testing
find tez-dag/src/test -name "*Mock*Vertex*" -o -name "*VertexManager*Test*" | grep -v ".class"
# Find TestShuffleVertexManager
find . -name "TestShuffleVertexManager.java" | grep test
Read TestShuffleVertexManager.java to understand how VertexManager tests are structured.
The test creates a mock context, calls lifecycle methods in order, and asserts which tasks
were scheduled.
Expected Output
- Answers to all questions in the ImmediateStartVertexManager and ShuffleVertexManager sections, with file:line references
- A working CountingVertexManager implementation that compiles
- A unit test that passes for the two scheduling scenarios
Stretch Goals
- Read CartesianProductVertexManager, the most complex VertexManager:
find . -name "CartesianProductVertexManager.java"
What computation does it coordinate? When is it used?
- Find a ShuffleVertexManager-related JIRA (search for "ShuffleVertexManager" in JIRA). Read the issue description and the patch. What invariant was violated?
- Implement a NoOpVertexManager that schedules no tasks (for testing DAG failure paths). Use it in a test DAG and verify the vertex fails with FAILED status after the timeout.
Lab 4.3 — Build It: WavingVertexManager
Lab type: Build It — VertexManagerPlugin with full JUnit + Mockito test suite
Estimated time: 120–150 min
Maven module: book/projects/level-4-waving-manager
Key class: org.apache.tez.learning.l4.WavingVertexManager
What You Will Build
A VertexManagerPlugin that schedules tasks in configurable waves:
- Wave 0: tasks 0 to waveSize-1
- Wave 1: tasks waveSize to 2×waveSize-1
- Wave N: starts only when all tasks in wave N-1 have succeeded
Wave size is read from UserPayload as "waveSize=N".
Default: WavingVertexManager.DEFAULT_WAVE_SIZE = 2.
This is a minimal but complete VertexManagerPlugin — the same architectural
pattern used by ImmediateStartVertexManager, ShuffleVertexManager, and the
VertexManagerPlugin inside every Hive-on-Tez reduce vertex.
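The wave rule described above can be modelled in a few lines before you open the project code. This is a simplified sketch under stated assumptions, not the WavingVertexManager source: the WaveSketch class and its method names are hypothetical, it tracks only successes, and it returns the wave instead of calling a context.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical, Tez-free model of wave-based scheduling.
final class WaveSketch {
    private final int totalTasks;
    private final int waveSize;
    private final BitSet scheduled = new BitSet();
    private final BitSet finished = new BitSet();
    private int nextTask = 0;

    WaveSketch(int totalTasks, int waveSize) {
        if (waveSize <= 0) throw new IllegalArgumentException("waveSize must be positive");
        this.totalTasks = totalTasks;
        this.waveSize = waveSize;
    }

    // Returns the task indices of the next wave (empty once every task is scheduled).
    List<Integer> scheduleNextWave() {
        List<Integer> wave = new ArrayList<>();
        while (nextTask < totalTasks && wave.size() < waveSize) {
            scheduled.set(nextTask);
            wave.add(nextTask++);
        }
        return wave;
    }

    // Records a success; returns the next wave only if no scheduled task is outstanding.
    List<Integer> onTaskSucceeded(int taskIndex) {
        finished.set(taskIndex);
        BitSet outstanding = (BitSet) scheduled.clone();  // copy so scheduled stays intact
        outstanding.andNot(finished);
        return outstanding.isEmpty() ? scheduleNextWave() : new ArrayList<>();
    }

    public static void main(String[] args) {
        WaveSketch w = new WaveSketch(5, 2);
        System.out.println(w.scheduleNextWave());  // [0, 1]
        System.out.println(w.onTaskSucceeded(0));  // []
        System.out.println(w.onTaskSucceeded(1));  // [2, 3]
    }
}
```

With 5 tasks and a wave size of 2, the final wave holds a single task; the nextTask < totalTasks condition is what ends that loop early.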
Step 1 — Understand the VertexManagerPlugin Contract
Before reading any code, open the Tez source:
find ~/tez-src -name "VertexManagerPlugin.java" | head -3
find ~/tez-src -name "VertexManagerPluginContext.java" | head -3
find ~/tez-src -name "ImmediateStartVertexManager.java" | head -3
Read all three files completely. Then answer:
| # | Question |
|---|---|
| 1 | What are all the lifecycle callback methods in VertexManagerPlugin? List them. |
| 2 | When does the Tez AM call initialize()? Can you call scheduleVertexTasks() from inside initialize()? |
| 3 | What does VertexManagerPluginContext.scheduleVertexTasks(List<ScheduleTaskRequest>) actually do to the DAG execution engine? |
| 4 | ImmediateStartVertexManager.onVertexStarted() calls scheduleAllTasks(). Does it call scheduleVertexTasks once (all tasks in one list) or once per task? Why does that matter for performance? |
| 5 | What is the purpose of VertexManagerPluginContext.reconfigureVertex()? Does WavingVertexManager use it? |
Step 2 — Compile and Run the Tests
cd /path/to/apache-tez/book/projects
mvn -pl level-4-waving-manager test
Expected:
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0
Step 3 — Read the Source Code
Open WavingVertexManager.java and work through every section.
initialize()
| # | Question |
|---|---|
| 1 | The payload is parsed as "waveSize=N". Where in a real DAG would you set this payload? (Hint: VertexManagerPluginDescriptor.setUserPayload() in DAG.create()) |
| 2 | Why does initialize() store totalTasks from the context rather than accepting it as a constructor argument? |
| 3 | If the user sets waveSize=1000 but there are only 5 tasks, what happens? Is there a bug? |
| 4 | Why are scheduled and waveFinished BitSets rather than List<Integer>? What is the time complexity of BitSet.andNot()? |
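To make questions 1 and 3 concrete, here is one way the "waveSize=N" payload could be parsed defensively. It is a sketch only: the WaveSizePayload class is hypothetical, the payload is modelled as a raw ByteBuffer rather than a Tez UserPayload, and clamping an oversized value to the task count is a design choice of this sketch, not necessarily what the project code does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical parser for a "waveSize=N" payload; not the project's implementation.
final class WaveSizePayload {
    static final int DEFAULT_WAVE_SIZE = 2;

    // Returns a wave size in [1, max(totalTasks, 1)], falling back to the default
    // when the payload is missing, malformed, or not positive.
    static int parse(ByteBuffer payload, int totalTasks) {
        int size = DEFAULT_WAVE_SIZE;
        if (payload != null && payload.hasRemaining()) {
            // duplicate() leaves the caller's buffer position untouched
            String text = StandardCharsets.UTF_8.decode(payload.duplicate()).toString().trim();
            if (text.startsWith("waveSize=")) {
                try {
                    size = Integer.parseInt(text.substring("waveSize=".length()).trim());
                } catch (NumberFormatException e) {
                    size = DEFAULT_WAVE_SIZE;
                }
            }
        }
        if (size < 1) size = DEFAULT_WAVE_SIZE;
        return Math.min(size, Math.max(totalTasks, 1));
    }

    public static void main(String[] args) {
        ByteBuffer big = ByteBuffer.wrap("waveSize=1000".getBytes(StandardCharsets.UTF_8));
        System.out.println(parse(big, 5));   // 5
        System.out.println(parse(null, 5));  // 2
    }
}
```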
onVertexStarted()
| # | Question |
|---|---|
| 1 | The completions map passed to onVertexStarted is ignored. Under what condition would a real plugin need to process it? |
| 2 | Why is scheduleNextWave() called here and not from initialize()? |
onTaskAttemptCompleted()
| # | Question |
|---|---|
| 1 | Failed attempts are silently ignored (if (!successful) return). What should a production plugin do instead? |
| 2 | checkAndScheduleNextWave() clones scheduled to avoid mutating it. What subtle bug would occur without the clone? |
| 3 | Trace through the state machine for 4 tasks, waveSize=2. Draw the state of scheduled and waveFinished after each callback. |
scheduleNextWave()
| # | Question |
|---|---|
| 1 | The while loop has two conditions: nextTaskToSchedule < totalTasks AND count < waveSize. Which terminates the loop for the last wave if the number of tasks is not a multiple of waveSize? |
| 2 | The scheduled.get(idx) guard protects against double-scheduling. In what scenario could idx already be set? (Hint: look at the testTaskNotScheduledTwice test.) |
Step 4 — Read the Test Suite
Open TestWavingVertexManager.java. For each test, before reading the assertions:
- Read the test name
- Predict what the test will assert
- Then read the actual assertions and compare to your prediction
Pay particular attention to how Mockito is used:
| Mockito call | What it does |
|---|---|
| mock(VertexManagerPluginContext.class) | Creates a fake context that records all calls |
| when(ctx.getVertexNumTasks(...)).thenReturn(6) | Stubs a specific return value |
| verify(ctx, times(2)).scheduleVertexTasks(anyList()) | Asserts the method was called exactly twice |
| ArgumentCaptor.forClass(List.class) | Captures the actual argument for deep inspection |
Questions
| # | Question |
|---|---|
| 1 | testThreeWavesForSixTasks is an integration test of the entire scheduling lifecycle. Which individual unit tests cover the sub-cases that this test depends on? |
| 2 | testPartialWave0DoesNotTriggerWave1 verifies the negative case (wave NOT triggered). How does verify(times(1)) prove this? Could you use verifyNoMoreInteractions() instead? |
| 3 | The test class has a @Before setUp() method. What happens if you remove it and inline mockContext = mock(...) into each test instead? |
Step 5 — Break It: Three Experiments
Experiment A — Remove the if (!successful) return guard
Delete the early-return in onTaskAttemptCompleted. Run:
mvn -pl level-4-waving-manager test -Dtest=TestWavingVertexManager#testFailedAttemptDoesNotAdvanceWave
- Which test fails?
- What is the actual vs. expected scheduleVertexTasks call count?
- Why does treating failures as successes cause premature wave advancement?
Experiment B — Remove the BitSet.clone() in checkAndScheduleNextWave
Change:
BitSet scheduledCopy = (BitSet) scheduled.clone();
scheduledCopy.andNot(waveFinished);
to:
scheduled.andNot(waveFinished);
Run the full test suite.
- Which tests fail?
- What data corruption does this mutation cause? Trace through testThreeWavesForSixTasks manually.
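The mechanics behind Experiment B can be seen in isolation with plain java.util.BitSet: andNot() modifies the receiver in place, so calling it on the live set erases the record of which tasks were ever scheduled. The BitSetCloneDemo class below is a hypothetical stand-alone demonstration, not code from the project.

```java
import java.util.BitSet;

// Hypothetical demonstration of why the clone matters before andNot().
final class BitSetCloneDemo {
    // Computes scheduled-but-not-finished without touching either argument.
    static BitSet outstandingWithClone(BitSet scheduled, BitSet finished) {
        BitSet copy = (BitSet) scheduled.clone();
        copy.andNot(finished);
        return copy;
    }

    public static void main(String[] args) {
        BitSet scheduled = new BitSet();
        scheduled.set(0, 4);             // tasks 0..3 scheduled
        BitSet finished = new BitSet();
        finished.set(0);
        finished.set(1);

        BitSet outstanding = outstandingWithClone(scheduled, finished);
        System.out.println(outstanding); // {2, 3}
        System.out.println(scheduled);   // {0, 1, 2, 3}

        scheduled.andNot(finished);      // the in-place variant from Experiment B
        System.out.println(scheduled);   // {2, 3}
    }
}
```

After the in-place call, tasks 0 and 1 no longer appear as scheduled, so any guard of the form scheduled.get(idx) would let them be scheduled a second time.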
Experiment C — Change count < waveSize to count <= waveSize
In scheduleNextWave(), change the loop condition.
- How many tasks does wave 0 now schedule?
- Which test catches this?
Step 6 — Add a New Feature: onVertexManagerEventReceived
The real ShuffleVertexManager uses onVertexManagerEventReceived to receive
partition statistics from map tasks. Add support for a simple variant:
Implement the callback. It receives one event per call, matching the contract shown in Lab 4.2:
@Override
public void onVertexManagerEventReceived(
    VertexManagerEvent vmEvent) throws Exception {
  // If the event's user payload contains "skip=true", mark
  // that task as finished so it does not block wave advancement.
  // TODO: parse UserPayload for "skip=true"; if present, call
  // onTaskAttemptCompleted(taskIndex, true) to release the wave
}
Write a test for this method:
@Test
public void testSkipEventReleasesWave() {
// set up 4 tasks, wave size 2
// trigger onVertexStarted (wave 0: tasks 0,1)
// send a VertexManagerEvent for task 0 with payload "skip=true"
// verify task 0 is treated as done for wave-completion purposes
}
Step 7 — Tez Source Connection Table
| Class used in this project | Tez source file |
|---|---|
| VertexManagerPlugin | |
| VertexManagerPluginContext | |
| ScheduleTaskRequest | |
| ImmediateStartVertexManager | |
| ShuffleVertexManager | |
Step 8 — ShuffleVertexManager Deep Dive
Open ShuffleVertexManager.java in the Tez source:
find ~/tez-src -name "ShuffleVertexManager.java"
- Read onVertexStarted(). Does it schedule tasks immediately like WavingVertexManager, or does it wait? What does it wait for?
- Find the slowStartFraction field. How does it determine when to start scheduling?
- Find where reconfigureVertex() is called. What does it change about the vertex?
- How does ShuffleVertexManager prevent double-scheduling? Compare its guard to the scheduled BitSet in WavingVertexManager.
- ShuffleVertexManager has ~700 lines. Identify the 5 most important methods (the ones that contain the core scheduling logic) and list them.
Step 9 — JIRA Research: VertexManager Bugs
Search:
project = TEZ AND component = "tez-dag" AND text ~ "VertexManager" AND resolution = Fixed
Find one resolved issue where a VertexManagerPlugin had a scheduling bug.
- What was the bug? (Race condition? Double scheduling? Wrong wave boundary?)
- What was the fix?
- Was a test added? What does it mock?
Lab 4.4 — Fix It: Null Dereference in ShuffleVertexManager on Zero-Partition Source
Lab type: Fix-It — reproduce → locate → write failing test → patch → verify → format patch
Estimated time: 120–150 min
Tez component: tez-runtime-library → org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager
Background
ShuffleVertexManager uses partition statistics sent by map tasks to decide
when to start reduce tasks (slow-start) and how many reducers to run
(auto-parallelism). It processes these statistics via
onVertexManagerEventReceived().
A long-standing bug category in this path: when a source vertex has zero
output partitions (all records were filtered, or the vertex ran with zero
tasks), the plugin can receive a VertexManagerEvent
whose payload encodes 0 partitions. In several versions of Tez, this caused
a NullPointerException or ArithmeticException (divide by zero) deep in the
statistics-processing path — the code assumed at least one partition existed.
This lab reproduces the bug pattern in a unit test, locates the exact guard that is missing, applies the fix, and formats a patch ready for submission.
Step 1 — Locate the Source File
find ~/tez-src -name "ShuffleVertexManager.java" | head -5
Expected (a path ending in):
tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java
Also locate the test file:
find ~/tez-src -name "TestShuffleVertexManager.java" | head -5
Step 2 — Read the Statistics Path
In ShuffleVertexManager.java, find the method that processes
VertexManagerEvent payloads. It will have a call to
ShuffleVertexManagerBase.parseStatsHeader() or similar, and will work with
numPartitions or partitionCount.
Trace the complete call chain from onVertexManagerEventReceived() to the
line that first uses the partition count arithmetically.
Questions
| # | Question |
|---|---|
| 1 | What is the name of the proto-based payload class that encodes partition statistics? |
| 2 | Which method extracts the partition count from the payload? |
| 3 | On what line does the first arithmetic operation involving the partition count occur? |
| 4 | Is there a null-check or zero-check before that line? |
| 5 | What exception would result if partitionCount == 0 at that line? |
Step 3 — Find the Existing Test
find ~/tez-src -name "TestShuffleVertexManager.java"
Open it and search for any test that covers the zero-partition case:
grep -n "zero\|0.*partition\|partition.*0" TestShuffleVertexManager.java -i | head -20
Note: in most Tez versions there is no such test — that is the gap you will fill.
Step 4 — Write the Reproducing Test
Add the following test to TestShuffleVertexManager.java. The exact helper
methods depend on the version you have; adapt the setup pattern from the
nearest existing test (look for testAutoParallelism or testSlowStart).
@Test(expected = Exception.class) // replace Exception with the specific type you observe
public void testZeroPartitionSourceDoesNotCrash() throws Exception {
// TODO: set up a ShuffleVertexManager with auto-parallelism enabled
// TODO: send a VertexManagerEvent with numPartitions = 0
// TODO: call onVertexManagerEventReceived with that event
// The call should NOT throw — once fixed.
// Mark expected = Exception.class so the test initially *passes*
// when the bug exists (the code throws), then change to asserting
// no throw after the fix is applied.
}
Run:
cd ~/tez-src
mvn test -pl tez-runtime-library -Dtest=TestShuffleVertexManager#testZeroPartitionSourceDoesNotCrash -q 2>&1 | tail -30
Record: which exception is thrown and on which line.
Step 5 — Apply the Fix
In ShuffleVertexManager.java, add a guard at the point identified in Step 2.
Rules
- The guard must be minimal: either if (partitionCount == 0) { return; } to skip the event, or if (partitionCount == 0) { partitionCount = 1; } to normalise (choose the semantically correct one: which is safer for scheduling?)
- Do not reformat surrounding code
- Do not change method signatures
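The shape of the guard can be tried out away from the Tez code first. The sketch below is a generic illustration of the bug pattern, not the actual ShuffleVertexManager logic: PartitionStatsSketch and both method names are hypothetical, and it uses the "skip" flavour of the guard by returning 0 for an empty report.

```java
// Hypothetical illustration of a zero-partition guard; not the Tez statistics code.
final class PartitionStatsSketch {
    // Average bytes per partition; returns 0 instead of dividing by zero.
    static long averagePartitionBytes(long[] partitionSizes) {
        if (partitionSizes == null || partitionSizes.length == 0) {
            return 0L;   // the guard: a zero-partition report carries no size information
        }
        long total = 0L;
        for (long size : partitionSizes) total += size;
        return total / partitionSizes.length;
    }

    // The same computation without the guard, to show the failure mode.
    static long averageUnguarded(long[] partitionSizes) {
        long total = 0L;
        for (long size : partitionSizes) total += size;
        return total / partitionSizes.length;  // ArithmeticException when length == 0
    }

    public static void main(String[] args) {
        System.out.println(averagePartitionBytes(new long[] {10L, 30L})); // 20
        System.out.println(averagePartitionBytes(new long[0]));           // 0
    }
}
```

Note that the unguarded version fails in two different ways: an empty array gives ArithmeticException, while a null array gives NullPointerException. Record which of the two you actually observe in Step 4.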
Step 6 — Update the Test
Now that the fix is applied, update the test:
@Test
public void testZeroPartitionSourceDoesNotCrash() throws Exception {
// Same setup as before
// This time assert NO exception is thrown
// Optionally assert that scheduling state is unchanged
}
Run the full test suite of the module that contains ShuffleVertexManager (tez-runtime-library):
mvn test -pl tez-runtime-library -q 2>&1 | tail -20
All tests must pass.
Step 7 — Checkstyle
mvn checkstyle:check -pl tez-runtime-library -q 2>&1 | grep -E "ERROR|WARNING|violation" | head -20
Zero violations required.
Step 8 — Format the Patch
cd ~/tez-src
git diff > /tmp/TEZ-ZEROPART.001.patch
cat /tmp/TEZ-ZEROPART.001.patch
Checklist:
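- Only ShuffleVertexManager.java and TestShuffleVertexManager.java modified
- No trailing whitespace on added lines (this should print nothing): grep -nP '^\+.*\s+$' /tmp/TEZ-ZEROPART.001.patch
- Patch applies cleanly to a clean checkout (stash or reset your working tree first, because the changes are already present there): git apply --check /tmp/TEZ-ZEROPART.001.patch
- All tests pass after git apply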
Step 9 — Write the JIRA Description
Summary: ShuffleVertexManager throws [ExceptionType] when source vertex
has zero output partitions
Description:
When a source vertex completes with zero output partitions (all records
filtered or vertex ran zero tasks), ShuffleVertexManager.onVertexManagerEventReceived
receives a VertexManagerEvent with partitionCount=0. The statistics
processing path performs arithmetic on this value without a zero guard,
causing [ExceptionType] at [ClassName].java:[line].
Steps to reproduce:
See attached TestShuffleVertexManager#testZeroPartitionSourceDoesNotCrash.
Fix:
Add a zero-partition guard at [method name], line [N].
Skip or normalise the event when partitionCount == 0.
Priority: Major
Component: tez-dag
Affects Version: 0.10.x
Step 10 — Deeper Understanding
After completing the fix, answer these questions by reading ShuffleVertexManager.java:
| # | Question |
|---|---|
| 1 | What is the slowStartMinFraction and slowStartMaxFraction used for? At what point in the scheduling lifecycle are they checked? |
| 2 | When does ShuffleVertexManager call reconfigureVertex()? What does it change? |
| 3 | What data structure accumulates partition statistics across multiple VertexManagerEvent calls? Why accumulate rather than process each event independently? |
| 4 | The test class uses mock(VertexManagerPluginContext.class). Compare this to TestWavingVertexManager — what additional interactions does ShuffleVertexManager have with the context that WavingVertexManager does not? |
| 5 | Search for all places in ShuffleVertexManager where a divide-by-zero could theoretically occur. List them. |
Level 5: Testing Infrastructure
Apache Tez has one of the most complete test suites in the Hadoop ecosystem:
thousands of unit tests, a MiniTezCluster integration harness, and a
TestOrderedWordCount end-to-end reference. At this level you will move from
reading tests to writing them — adding missing coverage to TestVertexImpl,
submitting a real DAG against MiniTezCluster, and finding and fixing a flaky
test.
Why testing matters for contributors
Every Tez patch must include either (a) a new test that fails without the patch and passes with it, or (b) a clear justification in the JIRA for why a test is not needed. Committers will block patches that regress existing tests or that add unverified logic.
What this level covers
| Topic | Where |
|---|---|
| MiniTezCluster setup/teardown lifecycle | Lab 5.1 |
| TestOrderedWordCount as the canonical integration test template | Lab 5.1 |
| Adding a missing TestVertexImpl transition test | Lab 5.2 |
| Writing a full mini-cluster integration test for your own DAG | Lab 5.3 |
| Identifying, reproducing, and fixing a flaky test | Lab 5.4 |
Prerequisites
- Level 4 complete (you understand the VertexImpl state machine and VertexManagerPlugin)
- Tez source checked out and mvn install -DskipTests succeeded
Test categories and Maven commands
| Category | What it tests | Command |
|---|---|---|
| Unit | Single class in isolation with mocks | mvn test -pl tez-dag -Dtest=TestVertexImpl |
| Mini-cluster integration | Full AM + YARN + HDFS in-process | mvn test -pl tez-tests -Dtest=TestOrderedWordCount |
| System | Real cluster (CI only) | Not run locally |
Key test classes
| Class | Module | What it covers |
|---|---|---|
| TestVertexImpl | tez-dag | VertexImpl state machine, transitions, vertex recovery |
| TestDAGImpl | tez-dag | DAGImpl state machine, DAG-level events |
| TestTaskImpl | tez-dag | TaskImpl scheduling, speculation, counters |
| TestTaskAttemptImpl | tez-dag | TaskAttemptImpl state transitions |
| TestOrderedWordCount | tez-tests | End-to-end DAG submission against MiniTezCluster |
| TestMiniTezClusterWithTez | tez-tests | Multi-DAG runs, recovery, kill scenarios |
Expected outcome
By the end of this level you will have:
- Run a DAG against MiniTezCluster inside a JUnit test
- Added a missing state-machine transition test to TestVertexImpl
- Identified and fixed a flaky test (or documented why it flakes)
Lab 5.1 — Explore MiniTezCluster and TestOrderedWordCount
Lab type: Read & Run
Estimated time: 90 min
Tez module: tez-tests
Key class: org.apache.tez.test.TestOrderedWordCount
Overview
MiniTezCluster spins up an in-process YARN ResourceManager, NodeManager, HDFS
NameNode, and DataNode, plus the Tez ApplicationMaster — all inside a single
JVM. This lets you submit real DAGs in a JUnit test with no external
infrastructure.
TestOrderedWordCount is the canonical example: it submits a multi-stage
word-count DAG (tokenize → partition → sort → count) and asserts correct output.
Step 1 — Locate the Files
find ~/tez-src -name "MiniTezCluster.java" | head -5
find ~/tez-src -name "TestOrderedWordCount.java" | head -5
find ~/tez-src -name "MiniTezClusterWithTez.java" | head -5
Step 2 — Read MiniTezCluster.java
Open MiniTezCluster.java and answer:
| # | Question |
|---|---|
| 1 | What superclass does MiniTezCluster extend? What Hadoop class sets up the in-process YARN cluster? |
| 2 | Where is TezConfiguration created and how is it modified to use the in-process services? |
| 3 | What is the purpose of the serviceStart() method? What does it start? |
| 4 | After serviceStop(), can you call serviceStart() again on the same instance? Why or why not? |
| 5 | Where does MiniTezCluster write its temporary data (HDFS files, YARN work dirs)? How would a test clean this up? |
Step 3 — Read TestOrderedWordCount.java
Work through the test lifecycle:
3a — @BeforeClass setUpClass()
| # | Question |
|---|---|
| 1 | How many NodeManagers does the test cluster start with? |
| 2 | After miniTezCluster.start(), what call copies the Tez auxiliary service config? |
| 3 | Where are test input files created — on HDFS or local FS? |
| 4 | Is a new TezClient created per test or per class? |
3b — @Test testOrderedWordCount()
| # | Question |
|---|---|
| 1 | Trace the method calls from TezClient.submitDAG() to when the test receives the final DAGStatus. |
| 2 | What does the assertion verify — DAG state, output correctness, or counter values? |
| 3 | If you wanted to assert on a specific counter (e.g. TaskCounter.INPUT_RECORDS_PROCESSED), where in the test would you add that assertion? |
3c — @AfterClass tearDownClass()
| # | Question |
|---|---|
| 1 | What is the order of shutdown calls? Does the TezClient stop before or after the cluster? |
| 2 | Does the test delete the HDFS working directory? Should it? |
Step 4 — Run the Test
cd ~/tez-src
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -20
Expected:
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
If you see Unable to find class: org.apache.tez.test.TestOrderedWordCount,
ensure mvn install -DskipTests completed successfully for all modules.
Step 5 — Measure the Overhead
Time the test:
time mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -3
Record how long it takes. Then answer:
- Is the bottleneck cluster startup, DAG execution, or cluster shutdown? (Hint: add -Dorg.apache.tez.test.MiniTezCluster.log.level=DEBUG and look at the timestamps.)
- Why is @BeforeClass used instead of @Before? What is the performance difference?
Step 6 — Find More Integration Tests
find ~/tez-src/tez-tests -name "Test*.java" | xargs grep -l "MiniTezCluster" | head -10
Pick one that is NOT TestOrderedWordCount. Read its @BeforeClass and one
@Test method. Answer:
- What scenario does this test cover that TestOrderedWordCount does not?
- Does it use a separate MiniTezCluster instance, or the same one reused across multiple test classes? How?
Step 7 — Source Connection Table
| Class used in this lab | Tez source file (relative to repo root) |
|---|---|
| MiniTezCluster | |
| TezClient | |
| TezConfiguration | |
| DAGStatus | |
| MiniDFSCluster (Hadoop helper) | |
Step 8 — JIRA Research
Search:
project = TEZ AND component = "tez-tests" AND resolution = Fixed ORDER BY updated DESC
Find a recent test-improvement JIRA.
- What was added or fixed?
- Does the patch include a new test, an existing test modification, or a flaky-test fix?
Lab 5.2 — Add a Missing TestVertexImpl Transition Test
Lab type: Fix-It (test improvement)
Estimated time: 90 min
Tez module: tez-dag
Key class: org.apache.tez.dag.app.dag.impl.TestVertexImpl
Overview
TestVertexImpl covers the VertexImpl state machine but no test suite is ever
complete. In this lab you will:
- Read the state machine definition
- Identify an untested transition
- Write a JUnit test that exercises that transition
- Verify it fails without the expected assertions and passes with them
This is the canonical entry point for new Tez contributors — many accepted patches are "add test coverage for transition X".
Step 1 — Locate the State Machine Definition
find ~/tez-src -name "VertexImpl.java" | head -3
grep -n "StateMachineFactory\|addTransition" \
~/tez-src/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
| head -50
The state machine is built with StateMachineFactory<VertexImpl, VertexState, VertexEventType, VertexEvent>. Each addTransition() call defines:
- current state
- event type
- next state
- transition action
Step 2 — Read TestVertexImpl.java
wc -l ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
It is large (~5,000 lines). You do not need to read it all. Instead:
grep -n "public void test" \
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java \
| head -60
List all test method names.
Step 3 — Find an Untested Transition
Compare the transitions in VertexImpl.java to the tests in TestVertexImpl.java.
Strategy:
- List all addTransition calls with grep -n "addTransition" VertexImpl.java
- For each transition, search TestVertexImpl.java for a test that covers the (fromState, eventType) pair
- Find one that is missing
Hint: look at transitions from INITED state. Some transitions from INITED
triggered by rare events (e.g. VERTEX_FAILED before a task is scheduled) are
often not explicitly tested.
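The comparison in the strategy above is a set difference over (fromState, eventType) pairs. A minimal plain-Java sketch of that bookkeeping follows, assuming you have already collected the two sets by hand or with grep. The class name, the enums, and the pairs listed are a hypothetical subset chosen for illustration, not the real VertexImpl transition table.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, simplified stand-ins; the real Tez enums have many more values.
final class TransitionCoverage {
    enum State { NEW, INITED, RUNNING }
    enum Event { V_INIT, V_START, V_TERMINATE }

    /** Encodes a (fromState, eventType) pair as a comparable string key. */
    static String pair(State from, Event event) {
        return from + "/" + event;
    }

    /** Returns the declared pairs that no test covers, in declaration order. */
    static Set<String> uncovered(Set<String> declared, Set<String> covered) {
        Set<String> missing = new LinkedHashSet<>(declared);
        missing.removeAll(covered);
        return missing;
    }

    public static void main(String[] args) {
        // Pairs you found via grep on addTransition (illustrative values only).
        Set<String> declared = new LinkedHashSet<>();
        declared.add(pair(State.NEW, Event.V_INIT));
        declared.add(pair(State.INITED, Event.V_START));
        declared.add(pair(State.INITED, Event.V_TERMINATE));

        // Pairs you found exercised by existing tests (illustrative values only).
        Set<String> covered = new LinkedHashSet<>();
        covered.add(pair(State.NEW, Event.V_INIT));
        covered.add(pair(State.INITED, Event.V_START));

        System.out.println(uncovered(declared, covered)); // prints [INITED/V_TERMINATE]
    }
}
```

Keeping the pairs in a small table like this also gives you the evidence for the JIRA description: the uncovered pair is exactly the gap your new test closes.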
Step 4 — Write the Test
Add a new test to TestVertexImpl.java. Follow the exact style of the
surrounding tests:
@Test(timeout = 5000)
public void testVertexFailed_FromInitedState() {
// TODO: initialize a vertex to INITED state using the existing test helpers
// then send a VERTEX_FAILED event
// assert the vertex transitions to ERROR or FAILED state
// assert any cleanup callbacks were invoked
}
Pattern to follow:
- Look for an existing test that puts the vertex in the state you need (e.g. testVertexWithInitializer reaches RUNNING; look for a simpler path)
- Use dispatcher.getEventHandler().handle(new VertexEventXxx(...)) to fire events
- Use vertex.getState() to assert the resulting state
Step 5 — Run the New Test
cd ~/tez-src
mvn test -pl tez-dag \
-Dtest=TestVertexImpl#testVertexFailed_FromInitedState -q 2>&1 | tail -20
Step 6 — Run the Full Test Class
mvn test -pl tez-dag -Dtest=TestVertexImpl -q 2>&1 | tail -10
All existing tests must still pass.
Step 7 — Write the Patch and JIRA Description
cd ~/tez-src
git diff > /tmp/TEZ-VERTEXTEST.001.patch
cat /tmp/TEZ-VERTEXTEST.001.patch
Draft JIRA:
Summary: TestVertexImpl is missing coverage for VERTEX_FAILED from INITED state
Description:
The VertexImpl state machine defines a transition (INITED, VERTEX_FAILED)
but TestVertexImpl has no test that fires this event path. This patch adds
TestVertexImpl#testVertexFailed_FromInitedState to cover the gap.
Priority: Minor
Component: tez-dag
Deeper Understanding
| # | Question |
|---|---|
| 1 | What is the difference between VertexState.FAILED and VertexState.ERROR? When does the AM choose each? |
| 2 | TestVertexImpl uses a mock AppContext. What methods on AppContext does VertexImpl call most frequently? (grep for appContext.) |
| 3 | What is DrainDispatcher and why is it used in tests instead of AsyncDispatcher? |
| 4 | Some tests set a Clock mock. Why would a state machine test need to control time? |
Lab 5.3 — Build It: Integration Test with MiniTezCluster
Lab type: Build It — Maven module with a real mini-cluster integration test
Estimated time: 150 min
Maven module: book/projects/level-5-integration-test
Key class: org.apache.tez.learning.l5.TestNumberPipelineWithMiniCluster
What You Will Build
A JUnit integration test that:
- Starts MiniTezCluster in @BeforeClass
- Submits the Level 1 NumberPipelineDAG (reused from level-1-number-pipeline)
- Waits for the DAG to complete
- Reads back the counter NumberPipeline/TotalSum and asserts it equals 4950
- Stops the cluster in @AfterClass
This is the same pattern used by TestOrderedWordCount — you are building the
exact kind of test that Tez committers write for new DAG features.
Step 1 — Create the Maven Module
book/projects/level-5-integration-test/
pom.xml
src/test/java/org/apache/tez/learning/l5/
TestNumberPipelineWithMiniCluster.java
The module is a test-only module (no src/main/). It depends on:
- org.apache.tez.learning:level-1-number-pipeline:1.0-SNAPSHOT (your DAG)
- org.apache.tez:tez-tests (for MiniTezCluster)
- JUnit 4.13.2
- org.apache.hadoop:hadoop-minicluster
<dependency>
<groupId>org.apache.tez</groupId>
<artifactId>tez-tests</artifactId>
<version>${tez.version}</version>
<classifier>tests</classifier>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-minicluster</artifactId>
<version>${hadoop.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.tez.learning</groupId>
<artifactId>level-1-number-pipeline</artifactId>
<version>1.0-SNAPSHOT</version>
<scope>test</scope>
</dependency>
Add level-5-integration-test to the parent pom.xml modules list.
Step 2 — Write TestNumberPipelineWithMiniCluster.java
Skeleton:
package org.apache.tez.learning.l5;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.tez.client.TezClient;
import org.apache.tez.common.counters.TezCounters;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.client.DAGClient;
import org.apache.tez.dag.client.DAGStatus;
import org.apache.tez.learning.l1.NumberPipelineDAG;
import org.apache.tez.test.MiniTezCluster;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import static org.junit.Assert.*;
public class TestNumberPipelineWithMiniCluster {
private static MiniTezCluster miniTezCluster;
private static TezClient tezClient;
private static TezConfiguration tezConf;
@BeforeClass
public static void setUpClass() throws Exception {
// Start MiniTezCluster with 1 NodeManager
miniTezCluster = new MiniTezCluster(
TestNumberPipelineWithMiniCluster.class.getName(), 1, 1, 1);
Configuration conf = new Configuration();
miniTezCluster.init(conf);
miniTezCluster.start();
tezConf = new TezConfiguration(miniTezCluster.getConfig());
tezConf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, false);
tezClient = TezClient.create(
"TestNumberPipelineClient", tezConf);
tezClient.start();
}
@AfterClass
public static void tearDownClass() throws Exception {
if (tezClient != null) {
tezClient.stop();
}
if (miniTezCluster != null) {
miniTezCluster.stop();
}
}
@Test(timeout = 120_000)
public void testNumberPipelineTotalSum() throws Exception {
// Build the Level 1 DAG with the mini-cluster configuration (local mode is disabled in setUpClass)
DAG dag = NumberPipelineDAG.buildDAG(tezConf);
DAGClient dagClient = tezClient.submitDAG(dag);
DAGStatus dagStatus = dagClient.waitForCompletion();
assertEquals("DAG must succeed",
DAGStatus.State.SUCCEEDED, dagStatus.getState());
TezCounters counters = dagStatus.getDAGCounters();
assertNotNull("Counters must be present", counters);
long totalSum = counters
.getGroup("NumberPipeline")
.findCounter("TotalSum")
.getValue();
assertEquals("TotalSum for 0..99 must equal 4950", 4950L, totalSum);
}
}
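The value 4950 asserted in the skeleton is the sum of the integers 0 through 99. The following plain-Java check of that arithmetic is independent of Tez; the class and method names are for illustration only.

```java
import java.util.stream.IntStream;

// Illustrative helper; not part of the lab project.
final class ExpectedTotalSum {
    /** Sum of 0..n-1 using the closed form n(n-1)/2. */
    static long closedForm(int n) {
        return (long) n * (n - 1) / 2;
    }

    /** The same sum computed by brute force, as a cross-check. */
    static long bruteForce(int n) {
        return IntStream.range(0, n).asLongStream().sum();
    }

    public static void main(String[] args) {
        System.out.println(closedForm(100)); // prints 4950
        System.out.println(bruteForce(100)); // prints 4950
    }
}
```

If your Level 1 pipeline transforms the values before summing (for example, by doubling them), the expected counter value changes accordingly, so derive the constant from the pipeline you actually built.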
Adapting NumberPipelineDAG: the Level 1 project is designed for local mode (TezConfiguration.TEZ_LOCAL_MODE = true). You will need to either (a) add a static buildDAG(TezConfiguration conf) factory method that accepts an external config, or (b) create a subclass that overrides the DAG construction to accept an injected config. Choose (a).
Step 3 — Verify the Build
cd book/projects
mvn -pl level-1-number-pipeline install -DskipTests -q
mvn -pl level-5-integration-test test -q 2>&1 | tail -20
Expected:
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
Step 4 — Deep Questions
| # | Question |
|---|---|
| 1 | Why does the test use dagStatus.getDAGCounters() instead of dagClient.getDAGStatus(EnumSet.of(StatusGetOpts.GET_COUNTERS))? Are they equivalent? |
| 2 | The timeout is 120_000 ms. Why does a simple 100-integer DAG need 2 minutes? |
| 3 | If the DAG fails, dagStatus.getState() returns FAILED and the assertion fires. How would you get the failure reason from dagStatus? |
| 4 | @BeforeClass uses static fields. What happens if two test classes in the same JVM both start MiniTezCluster? How does TestOrderedWordCount handle this? |
| 5 | The counter group is "NumberPipeline" and the counter name is "TotalSum". If you mistype the group name, what does getGroup() return? Does the assertion fail gracefully? |
Step 5 — Experiment: Add a Second Assertion
After verifying TotalSum, add an assertion on the number of input records processed:
long inputRecords = counters
.findCounter(TaskCounter.INPUT_RECORDS_PROCESSED)
.getValue();
// How many input records do you expect?
assertEquals(???, inputRecords);
Think about the DAG topology:
- Source vertex: 1 task, emits 100 integers
- Sink vertex: 1 task, reads 100 records
What value do you expect for INPUT_RECORDS_PROCESSED across both vertices?
Step 6 — Tez Source Connection Table
| Class used in this lab | Tez source file |
|---|---|
| MiniTezCluster | |
| TezClient | |
| DAGClient | |
| DAGStatus | |
| TezCounters | |
Lab 5.4 — Fix It: Un-Ignore a Flaky Test in TestVertexImpl
Lab type: Fix-It — flaky test investigation and repair
Estimated time: 90 min
Tez module: tez-dag
Key class: TestVertexImpl
Overview
Large Java projects accumulate tests marked @Ignore that were disabled because they
were "flaky": they passed on some runs and failed on others. A flaky
test is almost always a symptom of a real bug: a race condition, an incorrect
assertion, or missing test isolation.
In this lab you will:
- Find an @Ignore-annotated test in TestVertexImpl
- Un-ignore it and run it 10 times to characterize the failure
- Identify the root cause
- Apply the minimum fix
- Verify the test passes reliably
Step 1 — Find the Ignored Tests
grep -n "@Ignore\|@Disabled" \
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
Also search across all tez-dag tests:
grep -rn "@Ignore\|@Disabled" ~/tez-src/tez-dag/src/test/java/ | \
grep -v "^Binary" | head -30
Step 2 — Pick a Target
Select one ignored test. Prefer tests that have a comment explaining why they were ignored — these are the most educational.
Record:
- The test method name
- The reason given in the @Ignore annotation or nearby comment
- Which state transition or feature it is testing
Step 3 — Un-Ignore and Run
Remove the @Ignore annotation. Run the test 10 times:
for i in $(seq 1 10); do
mvn test -pl tez-dag -Dtest=TestVertexImpl#yourTestName -q 2>&1 | \
grep -E "PASS|FAIL|ERROR|Tests run" | tail -1
done
Record the pass/fail pattern. Is it:
- Always failing (deterministic bug)
- Randomly failing (race condition or timing sensitivity)
- Always passing (was it already fixed in this version?)
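The three outcomes above can be stated as a small classification over the recorded run results. The sketch below is plain Java with hypothetical names; it only tallies results you have already collected and does not run Maven.

```java
// Hypothetical helper for summarizing repeated test runs.
final class FlakinessClassifier {
    enum Verdict { ALWAYS_FAILING, INTERMITTENT, ALWAYS_PASSING }

    /** Classifies a series of run outcomes, where true means the run passed. */
    static Verdict classify(boolean[] passed) {
        if (passed.length == 0) {
            throw new IllegalArgumentException("need at least one run");
        }
        int passes = 0;
        for (boolean p : passed) {
            if (p) {
                passes++;
            }
        }
        if (passes == 0) {
            return Verdict.ALWAYS_FAILING;
        }
        if (passes == passed.length) {
            return Verdict.ALWAYS_PASSING;
        }
        return Verdict.INTERMITTENT;
    }

    public static void main(String[] args) {
        boolean[] runs = {true, true, false, true, true, true, true, false, true, true};
        System.out.println(classify(runs)); // prints INTERMITTENT
    }
}
```

Note that ALWAYS_PASSING over 10 runs is weak evidence: a test that fails once in 50 runs will often look clean in a sample this small, which is why the verification step later uses more runs.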
Step 4 — Diagnose the Failure
Read the test carefully. Common flaky-test patterns in Tez state machine tests:
| Pattern | Symptom | Fix |
|---|---|---|
| AsyncDispatcher not drained before assertion | Assertion fires before event is processed | Use DrainDispatcher instead |
| Mock returns null for a method that returns a list | NullPointerException in production code | Stub with Collections.emptyList() |
| Thread.sleep(N) instead of proper synchronization | Fails on slow CI machines | Replace with waitFor() or DrainDispatcher |
| Leaked state from another test | First run passes, second fails | Verify @Before / @After cleans up completely |
Identify which pattern applies.
Step 5 — Apply the Fix
Apply the minimum fix. Options:
Option A — Replace AsyncDispatcher with DrainDispatcher
// Before (flaky):
AsyncDispatcher dispatcher = new AsyncDispatcher();
// After (deterministic):
DrainDispatcher dispatcher = new DrainDispatcher();
dispatcher.register(VertexEventType.class, vertex);
dispatcher.init(conf);
dispatcher.start();
// ... fire events ...
dispatcher.await(); // blocks until queue is empty
Option B — Add missing stub
when(mockContext.getSomeList()).thenReturn(Collections.emptyList());
Option C — Fix assertion order
Move assertions AFTER the dispatcher.await() call.
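To see why draining matters, here is a self-contained miniature of an asynchronous dispatcher in plain Java. It is a hypothetical stand-in written for this lab, not Tez's DrainDispatcher: events run on a single worker thread, and await() blocks until everything queued before it has been handled.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical miniature of an async event dispatcher; not the Tez class.
final class MiniDrainDispatcher {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues an event for the worker thread and returns immediately. */
    void dispatch(Runnable event) {
        worker.execute(event);
    }

    /** Blocks until every event queued before this call has been handled. */
    void await() throws Exception {
        // The single worker runs tasks in FIFO order, so once this marker
        // task completes, all earlier events have completed too.
        worker.submit(() -> { }).get(10, TimeUnit.SECONDS);
    }

    void stop() {
        worker.shutdown();
    }

    /** Dispatches n events, drains the queue, and returns how many were handled. */
    static int handledAfterDrain(int n) throws Exception {
        MiniDrainDispatcher dispatcher = new MiniDrainDispatcher();
        AtomicInteger handled = new AtomicInteger();
        try {
            for (int i = 0; i < n; i++) {
                dispatcher.dispatch(handled::incrementAndGet);
            }
            // Reading handled.get() at this point would race with the worker.
            dispatcher.await();
            return handled.get();
        } finally {
            dispatcher.stop();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handledAfterDrain(1000)); // prints 1000
    }
}
```

An assertion placed before the await() call may pass on a fast machine and fail on a loaded CI host, which is the signature of the first pattern in the table in Step 4.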
Step 6 — Verify Reliability
Run the test 20 times:
for i in $(seq 1 20); do
mvn test -pl tez-dag -Dtest=TestVertexImpl#yourTestName -q 2>&1 | \
grep -E "Tests run" | tail -1
done
All 20 runs must pass.
Step 7 — Run the Full Suite
mvn test -pl tez-dag -q 2>&1 | tail -10
All existing tests must pass.
Step 8 — Format the Patch and Write the JIRA
cd ~/tez-src
git diff > /tmp/TEZ-FLAKY.001.patch
Summary: TestVertexImpl#[testName] is flaky due to [root cause]
Description:
TestVertexImpl#[testName] was marked @Ignore with the note "[original reason]".
Investigation shows the root cause is [description].
The fix [removes AsyncDispatcher / adds missing stub / fixes assertion order],
making the test deterministic.
Ran the test 20 times with the fix applied — all passed.
Priority: Minor
Component: tez-dag
Deeper Understanding
| # | Question |
|---|---|
| 1 | What is the difference between AsyncDispatcher and DrainDispatcher? Where is DrainDispatcher defined? |
| 2 | Why is a flaky test arguably worse than no test? What does it do to CI reliability? |
| 3 | Tez's StateMachineFactory is modeled after Hadoop's. Does Hadoop's TestStateMachine use DrainDispatcher or AsyncDispatcher in its tests? |
| 4 | Some Tez flaky tests are caused by System.currentTimeMillis() being called in a tight loop and the assertion depending on a specific elapsed time. How would you make such a test deterministic? |
Level 6: Hive/Tez Integration
Hive-on-Tez is the largest consumer of the Tez API. Understanding how Hive translates SQL into a Tez DAG — and what can go wrong — is essential for any contributor who wants to fix real production bugs.
What Hive does with Tez
Every Hive query that runs on Tez goes through this pipeline:
SQL → Hive AST → Operator tree → MapWork/ReduceWork tasks
→ TezWork → Tez DAG (vertices + edges + VertexManagerPlugins)
→ TezClient.submitDAG()
The translation layer lives in the hive-exec module, specifically
TezWork, DagUtils, and TezTask.
Why Tez contributors must understand Hive
- Most real Tez bugs are first reported from Hive (a slow query, a failing shuffle, a counter discrepancy)
- ShuffleVertexManager was built specifically for the Hive reduce pattern
- Hive adds many VertexManagerEvent payloads that Tez must handle correctly
- Compatibility issues between Hive versions and Tez versions are common release blockers
What this level covers
| Topic | Lab |
|---|---|
| Trace a Hive SQL query to the generated Tez DAG | Lab 6.1 |
| Read DagUtils and understand vertex/edge configuration | Lab 6.1 |
| Debug a failing Hive-on-Tez query (task diagnostics, AM logs) | Lab 6.2 |
| Fix a Hive-Tez compatibility issue via a Tez patch | Lab 6.2 |
Prerequisites
- Level 5 complete (you can submit and debug a Tez DAG)
- Optional but helpful: basic SQL knowledge
- Optional: Hive source checked out alongside Tez
Key classes
| Class | Where | What it does |
|---|---|---|
| TezWork | hive-exec | Container for all Tez DAG specifications |
| DagUtils | hive-exec | Builds Tez DAG from TezWork |
| TezTask | hive-exec | Executes a TezWork via TezClient |
| ShuffleVertexManager | tez-runtime-library | Manages reduce-vertex scheduling |
| OrderedPartitionedKVOutput | tez-runtime-library | Default Hive reduce output |
Lab 6.1 — Trace a Hive SQL Query to the Generated Tez DAG
Lab type: Read & Research
Estimated time: 120 min
Key classes: DagUtils, TezWork, TezTask (all in Hive)
Overview
When you run SELECT a, COUNT(*) FROM t GROUP BY a on a Hive-on-Tez cluster,
Hive builds a TezWork object (a description of what the DAG should look like)
and hands it to DagUtils.createDag(). That method creates the actual Tez
DAG, vertices, edges, and VertexManagerPluginDescriptors.
In this lab you will trace this path end-to-end.
Step 1 — Check Out Hive Source (Optional)
If you want a local copy of the Hive source:
git clone https://github.com/apache/hive.git ~/hive-src --depth=1
find ~/hive-src -name "DagUtils.java" | head -3
find ~/hive-src -name "TezWork.java" | head -3
find ~/hive-src -name "TezTask.java" | head -3
If you do not have Hive source, you can read these classes on GitHub:
- ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
- ql/src/java/org/apache/hadoop/hive/ql/plan/TezWork.java
- ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
Step 2 — Read TezWork.java
TezWork is a directed graph of BaseWork nodes. Answer:
| # | Question |
|---|---|
| 1 | What are the two main subclasses of BaseWork that represent map and reduce phases? |
| 2 | How does TezWork represent edges between vertices? What class holds edge configuration? |
| 3 | Where does TezWork store the VertexManagerPluginDescriptor? |
| 4 | A GROUP BY query produces how many BaseWork nodes? Draw the graph. |
Step 3 — Read DagUtils.createDag()
This is the core translation method. It iterates over TezWork and calls
createVertex() and createEdge().
| # | Question |
|---|---|
| 1 | What Tez EdgeProperty.DataMovementType does Hive use for a reduce shuffle? Where is this set? |
| 2 | What VertexManagerPlugin does Hive attach to reduce vertices? Is this set unconditionally or based on a configuration flag? |
| 3 | What is auto-parallelism in this context? How does Hive enable it? |
| 4 | What UserPayload does Hive pass to ShuffleVertexManager? Specifically: what are the values of minFraction and maxFraction? |
Step 4 — Read TezTask.execute()
This method submits the DAG and waits for completion.
| # | Question |
|---|---|
| 1 | Does TezTask create a new TezClient per query, or reuse one per session? |
| 2 | How does TezTask wait for DAG completion? Which Tez API does it poll? |
| 3 | When a Hive query fails, what information does TezTask extract from the DAGStatus to show the user? |
| 4 | TezTask updates Hive counters from Tez counters. What is the counter group mapping? |
Step 5 — Tez Counterpart: ShuffleVertexManager
Open ShuffleVertexManager.java in your Tez source. Cross-reference with
what you learned from DagUtils.java:
- The minFraction/maxFraction payload you found in Step 3 is parsed by which method in ShuffleVertexManager?
- When Hive enables auto-parallelism, what happens inside ShuffleVertexManager that does NOT happen when it is disabled?
- Where does ShuffleVertexManager call context.reconfigureVertex()? What does reconfigureVertex do to the number of reducer tasks?
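As a mental model while you read the source, the two fractions can be thought of as the ends of a ramp: no reducers are scheduled until the completed share of source tasks reaches minFraction, and all are eligible once it reaches maxFraction. The sketch below is a simplified assumption about that behavior written in plain Java with hypothetical names. It is not the ShuffleVertexManager code, so check each detail against the real implementation.

```java
// Simplified model of slow-start scheduling; an assumption, not Tez source.
final class SlowStartSketch {
    /**
     * Share of reducer tasks eligible for scheduling, given the share of
     * source tasks completed: 0 below min, 1 at or above max, linear between.
     */
    static double fractionToSchedule(double completed, double min, double max) {
        if (completed < min) {
            return 0.0;
        }
        if (max <= min || completed >= max) {
            return 1.0;
        }
        return (completed - min) / (max - min);
    }

    /** Number of reducer tasks eligible, rounded up to a whole task. */
    static int tasksToSchedule(int totalTasks, double completed, double min, double max) {
        return (int) Math.ceil(totalTasks * fractionToSchedule(completed, min, max));
    }

    public static void main(String[] args) {
        System.out.println(tasksToSchedule(10, 0.10, 0.25, 0.75)); // prints 0
        System.out.println(tasksToSchedule(10, 0.50, 0.25, 0.75)); // prints 5
        System.out.println(tasksToSchedule(10, 0.90, 0.25, 0.75)); // prints 10
    }
}
```

When you trace the real class, note where it differs from this model, for example in how it treats multiple source vertices or tasks that are already scheduled.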
Step 6 — End-to-End Mental Model
Draw (on paper or in a text diagram) the full path for:
SELECT dept, COUNT(*) FROM employees GROUP BY dept
Show:
- Hive logical plan nodes
- TezWork graph (label each BaseWork)
- Tez DAG (label each vertex, edge type, VertexManagerPlugin)
- Which Tez APIs TezTask calls
Step 7 — JIRA Research: Hive/Tez Compatibility
Search:
project = TEZ AND text ~ "hive" AND resolution = Fixed ORDER BY updated DESC
Find one issue where a Tez change broke Hive or where a Hive bug exposed a Tez issue.
- What was the incompatibility?
- Was the fix in Tez or Hive (or both)?
- Did the patch include a test? If so, where?
Lab 6.2 — Debug a Failed Hive-on-Tez Query
Lab type: Fix-It (diagnostics + root-cause analysis)
Estimated time: 120 min
Overview
A Hive-on-Tez query failure can originate from:
- Tez DAG layer — vertex scheduling error, shuffle failure, OOM
- Hive operator layer — deserialization error, UDF crash, wrong SerDe
- Infra layer — YARN container killed, HDFS quota exceeded, network timeout
In this lab you will work through a systematic diagnostic process and trace a simulated failure back to its Tez-layer root cause.
Scenario
A Hive query:
SELECT k, SUM(v) FROM large_table GROUP BY k;
fails with:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1700000000000_0001_1_01,
diagnostics=[Task failed, taskId=task_1700000000000_0001_1_01_000000,
diagnostics=[TaskAttempt 0 failed, info=[Container container_... exited
with exitCode: -104]]
Exit code -104 means YARN killed the container for exceeding its physical memory limit.
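When triaging many failures it helps to pull the exit code out of the diagnostics string mechanically. The following plain-Java sketch does that with a regular expression and maps a few codes to names. The numeric values are copied from Hadoop YARN's ContainerExitStatus constants as commonly documented; treat them as an assumption and verify them against the Hadoop version on your cluster. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical triage helper; constant values assumed from YARN's ContainerExitStatus.
final class ContainerExitDiagnostics {
    private static final Pattern EXIT_CODE = Pattern.compile("exitCode:\\s*(-?\\d+)");

    /** Extracts the first "exitCode: N" value from a diagnostics string, or null if absent. */
    static Integer parseExitCode(String diagnostics) {
        Matcher m = EXIT_CODE.matcher(diagnostics);
        return m.find() ? Integer.valueOf(m.group(1)) : null;
    }

    /** Maps a container exit code to a symbolic name for a handful of common cases. */
    static String describe(int exitCode) {
        switch (exitCode) {
            case -100: return "ABORTED";
            case -101: return "DISKS_FAILED";
            case -102: return "PREEMPTED";
            case -103: return "KILLED_EXCEEDED_VMEM";
            case -104: return "KILLED_EXCEEDED_PMEM";
            default:   return "UNKNOWN(" + exitCode + ")";
        }
    }

    public static void main(String[] args) {
        String diagnostics = "TaskAttempt 0 failed, info=[Container container_... exited"
            + " with exitCode: -104]]";
        Integer code = parseExitCode(diagnostics);
        System.out.println(code);           // prints -104
        System.out.println(describe(code)); // prints KILLED_EXCEEDED_PMEM
    }
}
```

Distinguishing a memory kill from a preemption matters for the fix: the first calls for a larger container or a smaller working set, while the second is a scheduling decision that a retry normally resolves.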
Step 1 — Identify the Layer
| # | Question |
|---|---|
| 1 | Is exit code -104 a Tez error or a YARN error? Where is this code defined? |
| 2 | Which vertex failed — the map or the reduce? How do you know from the diagnostic message? |
| 3 | What Tez API would you call (in Java) to retrieve these diagnostics programmatically? |
| 4 | The error says "TaskAttempt 0 failed". Does this mean no retries happened, or that all retries were exhausted? |
Step 2 — Locate the Logs
In a real cluster:
# Get the AM logs
yarn logs -applicationId application_1700000000000_0001 \
-log_files syslog | grep -A 20 "Reducer 2"
# Get the container logs
yarn logs -applicationId application_1700000000000_0001 \
-containerId container_... | head -200
Questions:
- In the AM logs, what Tez class emits the Task failed message? (Hint: grep for TaskImpl or VertexImpl in the log.)
- The container log has a Java OOM or GC log. Where in TaskAttemptImpl does the container exit code get translated to a TaskAttemptEvent?
Step 3 — Identify the Tez Configuration Fix
The reduce vertex ran out of memory. The relevant configuration:
| Config key | Default | Description |
|---|---|---|
| tez.am.resource.memory.mb | 1024 | AM container memory |
| tez.task.resource.memory.mb | 1024 | Task container memory |
| hive.tez.container.size | -1 (inherits from mapred) | Hive override for Tez task memory |
| hive.auto.convert.join.noconditionaltask.size | 10MB | In-memory join threshold |
- Which config key should be increased to fix the OOM?
- Is this a Tez config or a Hive config? Which system applies it?
- Find where tez.task.resource.memory.mb is read in Tez source. In which class and method?
Step 4 — Tez Source Reading: Container Exit Code Handling
Find where Tez handles non-zero container exit codes:
grep -rn "exitCode\|EXIT_CODE\|ContainerExitStatus" \
~/tez-src/tez-dag/src/main/java/ | grep -v "test" | head -30
Answer:
- What class translates the YARN container exit code into a TaskAttemptEvent?
- Is -104 (KILLED_EXCEEDED_PMEM) treated differently from -102 (PREEMPTED) or -100 (ABORTED)?
- Does Tez retry a preempted task? What configuration controls the max retries?
Step 5 — Simulate the Fix
In a real system you would increase tez.task.resource.memory.mb and rerun.
If you do not have a Hive cluster available, find the test in TestTaskAttemptImpl.java that covers container preemption instead:
grep -n "preempt\|PREEMPT\|exitCode" \
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttemptImpl.java \
| head -20
Read the test. Answer:
- How does the test simulate a container exit with a non-zero exit code?
- What state does TaskAttemptImpl transition to on preemption?
- Is there a test for the full retry-until-max-attempts path?
Step 6 — Write a Diagnostic Runbook Entry
Write 5–8 bullet points as a "runbook entry" for this class of failure:
## Hive-on-Tez: Reducer OOM (exit code -104)
**Symptoms:** ...
**Root cause:** ...
**Diagnostic steps:** ...
**Fix:** ...
**Tez classes involved:** ...
**Relevant configuration:** ...
This is the kind of documentation that Tez PMC members write for operators.
JIRA Research
Search for Tez issues related to container OOM or preemption handling:
project = TEZ AND text ~ "preempt OR oom OR out of memory" AND resolution = Fixed
Find one. Read the patch. Was the fix in TaskAttemptImpl, in configuration
defaults, or in a different class?
Level 7: Runtime and Shuffle
The Tez shuffle layer is the most performance-critical and most bug-prone part of the runtime. Understanding it is required for diagnosing slow queries, data-skew issues, and shuffle fetch failures.
How shuffle works in Tez
Map task → OrderedPartitionedKVOutput → TezIndexRecord (index + data files)
↓
ShuffleHandler (HTTP server in NM)
↓
Reduce task → OrderedGroupedKVInput ← shuffle fetcher threads
↓
merge + sort → processor
Key insight: unlike Hadoop MapReduce's ShuffleConsumerPlugin, Tez's shuffle
is split into framework code (tez-runtime-library) and user code
(Processor). The processor never sees unsorted records — sorting happens
in the runtime layer.
What this level covers
| Topic | Lab |
|---|---|
| Trace shuffle fetch failure from AM logs to root cause | Lab 7.1 |
| Modify a processor that consumes OrderedGroupedKVInput | Lab 7.2 |
Key classes
| Class | Where | What it does |
|---|---|---|
| OrderedPartitionedKVOutput | tez-runtime-library | Map output: partition + sort + spill |
| OrderedGroupedKVInput | tez-runtime-library | Reduce input: fetch + merge + sort |
| ShuffleFetch / Fetcher | tez-runtime-library | HTTP fetch from ShuffleHandler |
| MergeManager | tez-runtime-library | In-memory and on-disk merge |
| ShuffleHandler | tez-shuffle | Netty HTTP server serving map output |
| TezIndexRecord | tez-runtime-library | Per-partition offset+length in output file |
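The role of a per-partition index is easiest to see in miniature. In the plain-Java sketch below, all partitions are written into one data buffer and an index of (offset, length) entries lets a reader return exactly one partition's bytes. The class and field names are hypothetical and the real TezIndexRecord carries more information than this; the sketch only illustrates the offset-plus-length idea from the table.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical miniature of a per-partition index; not the real TezIndexRecord.
final class PartitionIndexSketch {
    /** Where one partition's bytes start in the data buffer, and how many there are. */
    static final class IndexEntry {
        final int offset;
        final int length;

        IndexEntry(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Records the (offset, length) slice each partition occupies once concatenated. */
    static IndexEntry[] buildIndex(byte[][] partitions) {
        IndexEntry[] index = new IndexEntry[partitions.length];
        int offset = 0;
        for (int i = 0; i < partitions.length; i++) {
            index[i] = new IndexEntry(offset, partitions[i].length);
            offset += partitions[i].length;
        }
        return index;
    }

    /** Concatenates all partitions into a single data buffer, in partition order. */
    static byte[] concat(byte[][] partitions) {
        int total = 0;
        for (byte[] p : partitions) {
            total += p.length;
        }
        byte[] data = new byte[total];
        int offset = 0;
        for (byte[] p : partitions) {
            System.arraycopy(p, 0, data, offset, p.length);
            offset += p.length;
        }
        return data;
    }

    /** Returns only the bytes of one partition, as a server answering one reducer would. */
    static byte[] readPartition(byte[] data, IndexEntry[] index, int partition) {
        IndexEntry entry = index[partition];
        return Arrays.copyOfRange(data, entry.offset, entry.offset + entry.length);
    }

    public static void main(String[] args) {
        byte[][] partitions = {
            "aa".getBytes(StandardCharsets.UTF_8),
            "bbb".getBytes(StandardCharsets.UTF_8),
            "c".getBytes(StandardCharsets.UTF_8)
        };
        byte[] data = concat(partitions);
        IndexEntry[] index = buildIndex(partitions);
        System.out.println(new String(readPartition(data, index, 1), StandardCharsets.UTF_8)); // prints bbb
    }
}
```

This layout is why a reducer can fetch its slice of a map output without the server scanning the whole file: the index lookup is constant-time and the read is a single contiguous range.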
Lab 7.1 — Debug Shuffle Behavior
Lab type: Read & Research
Estimated time: 120 min
Tez module: tez-runtime-library
Overview
Shuffle failures are the most common source of Tez bug reports. They manifest
as FetchFailure events, IOException during map-output reads, or hung reduce
tasks. In this lab you will trace the complete shuffle path from log line to
source code.
Step 1 — Locate the Core Classes
find ~/tez-src/tez-runtime-library -name "*.java" | xargs grep -l "FetchFailure\|Fetcher\|ShuffleHandler" | head -10
find ~/tez-src/tez-shuffle -name "*.java" | head -10
Step 2 — Read the Shuffle Fetch Path
Open Fetcher.java (in tez-runtime-library) and trace the fetch loop:
| # | Question |
|---|---|
| 1 | What HTTP method does the Fetcher use to request map output? GET or POST? |
| 2 | What is the URL format it sends to ShuffleHandler? What parameters does it include? |
| 3 | If the HTTP response code is 404, what does the Fetcher do? (Fail immediately? Retry? Report back to the InputManager?) |
| 4 | What does the Fetcher do when it detects data corruption (checksum mismatch)? Which class handles checksum verification? |
| 5 | How many concurrent fetcher threads does a reduce task run? What configuration key controls this? |
Step 3 — Read the FetchFailure Event Path
When a fetch fails, an event travels up to the AM:
grep -rn "FetchFailure\|FETCH_FAILURE" ~/tez-src/tez-dag/src/main/java/ | \
grep -v "test" | grep ".java:" | head -20
Trace: where does the FetchFailure event originate, and what state transition
does it trigger in TaskAttemptImpl?
| # | Question |
|---|---|
| 1 | What is the name of the event class that carries the fetch-failure information to the AM? |
| 2 | In TaskAttemptImpl, what state does the task transition to when it receives a fetch failure? |
| 3 | Does a single fetch failure kill the task, or does Tez retry? What configuration controls max fetch retries? |
| 4 | What happens to the source task attempt (the map) when its output cannot be fetched? Is it re-run? |
Step 4 — Read ShuffleHandler
Open ShuffleHandler.java in tez-shuffle:
| # | Question |
|---|---|
| 1 | What Netty class does ShuffleHandler extend? |
| 2 | How does ShuffleHandler authenticate that a requester is authorized to fetch map output? (Hint: look for TOKEN or JobTokenSecretManager.) |
| 3 | Where does ShuffleHandler read the index file? What class represents the index? |
| 4 | If the NM restarts while a reduce is fetching, what happens to in-flight fetch requests? |
Step 5 — Read the Spill Path
Open DefaultSorter.java or PipelinedSorter.java in tez-runtime-library:
- At what memory threshold does a spill occur?
- How many spill files can accumulate before a merge is triggered?
- After a spill, where is the index written?
Step 6 — Common Shuffle Bug Patterns
For each pattern below, identify the relevant Tez class and the configuration that can mitigate it:
| Pattern | Class | Config key |
|---|---|---|
| Slow fetch due to too few fetcher threads | ||
| OOM in reducer due to large in-memory merge buffer | ||
| Fetch failure due to ShuffleHandler authentication timeout | ||
| Data skew: one reducer processes 100× more data than others | ||
Step 7 — JIRA Research
Search:
project = TEZ AND component = "tez-runtime-library" AND resolution = Fixed ORDER BY updated DESC
Find a recently fixed shuffle or sort bug. Read the patch:
- What was the bug?
- Was it in Fetcher, DefaultSorter, MergeManager, or ShuffleHandler?
- Was a test added? What does it mock or simulate?
Lab 7.2 — Modify a Processor: Add Deduplication to UnionSinkProcessor
Lab type: Fix-It / Extend
Estimated time: 90 min
Maven module: book/projects/level-3-multi-input
Overview
UnionSinkProcessor from Level 3 sums all values it receives. In this lab
you will extend it to deduplicate records by key before summing — only
the first record for each key is counted.
This exercise teaches:
- How to modify a Processor that uses OrderedGroupedKVInput
- How counters interact with deduplication logic
- How to write a unit test for processor logic using mocks
Step 1 — Understand the Current Behavior
The current UnionSinkProcessor (Level 3) receives (Integer key, Integer value)
pairs and sums all values. For the test input (0..99 integers), expected sum is 4950.
Open UnionSinkProcessor.java and answer:
- How does it iterate over input records?
- Where does it write the counter?
- What happens if the same key appears twice (e.g. key=5, value=5 arriving from both the even source and the odd source)? Can that actually happen? Check EvenNumberSource and OddNumberSource.
Step 2 — Add a Deduplicating Variant
Create DeduplicatingUnionSinkProcessor.java in the same package. It should:
- Maintain a Set<Integer> of seen keys
- For each (key, value) pair from the input: if key is new, add to set and add value to the sum; otherwise skip
- Publish the same UnionPipeline/TotalSum counter
- Also publish a new counter UnionPipeline/DuplicatesSkipped
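The core of that logic can be sketched in plain Java, independent of any Tez type. This is a minimal illustration under stated assumptions, not the lab solution: `DedupSketch`, `Totals`, `rec`, and `sumFirstPerKey` are hypothetical names, and the real processor would read records from its input reader and publish the two totals as counters instead of returning them.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch of the deduplicating sum; no Tez types are involved. */
final class DedupSketch {
    private DedupSketch() {}

    /** The two numbers the processor would publish as counters. */
    static final class Totals {
        final long totalSum;
        final long duplicatesSkipped;

        Totals(long totalSum, long duplicatesSkipped) {
            this.totalSum = totalSum;
            this.duplicatesSkipped = duplicatesSkipped;
        }
    }

    /** Convenience factory for a (key, value) record (hypothetical helper). */
    static Map.Entry<Integer, Integer> rec(int key, int value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    /** Sums the value of the first record seen for each key and counts the rest as skipped. */
    static Totals sumFirstPerKey(Iterable<? extends Map.Entry<Integer, Integer>> records) {
        Set<Integer> seen = new HashSet<>();
        long sum = 0;
        long skipped = 0;
        for (Map.Entry<Integer, Integer> record : records) {
            // Set.add returns false when the key was already present.
            if (seen.add(record.getKey())) {
                sum += record.getValue();
            } else {
                skipped++;
            }
        }
        return new Totals(sum, skipped);
    }
}
```

Keeping the arithmetic in a side-effect-free method like this makes it easy to test without mocking any framework context.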
Step 3 — Write a Unit Test
Create TestDeduplicatingUnionSinkProcessor.java. Use the Mockito pattern
from TestMultiInputProcessors:
@Test
public void testDuplicateKeyIsSkippedOnce() {
// Create a mock input that returns (key=1, value=10) twice
// and (key=2, value=20) once
// Expected TotalSum: 10 + 20 = 30
// Expected DuplicatesSkipped: 1
}
@Test
public void testAllUniqueKeys() {
// No duplicates: result must equal non-deduplicating sum
}
Step 4 — Run the Tests
cd book/projects
mvn -pl level-3-multi-input test -q 2>&1 | tail -10
Step 5 — Questions
| # | Question |
|---|---|
| 1 | If your deduplication Set grows very large (millions of keys), what would happen to the task JVM heap? |
| 2 | The input is already sorted by key (because OrderedGroupedKVInput sorts). Could you use this property to deduplicate without a Set? Rewrite DeduplicatingUnionSinkProcessor to use O(1) memory. |
| 3 | Your new counter UnionPipeline/DuplicatesSkipped — where in the Tez framework does it get propagated to the AM and eventually to DAGStatus.getDAGCounters()? |
Level 8: Real Issue Contribution
This level is the transition from learner to contributor. You will pick a real open JIRA issue, reproduce it, write a patch, and go through the Apache contribution process from start to submission.
The Apache contribution loop
1. Pick an issue (JIRA) → identify something you can fix
2. Understand the context → read related code, existing tests, comments
3. Reproduce the bug → write a failing test or reproduce steps
4. Implement the fix → minimum change that passes all tests
5. Format the patch → `git diff > TEZ-NNNN.001.patch`
6. Upload to JIRA → attach the patch, set status to "Patch Available"
7. Respond to review comments → iterate, upload TEZ-NNNN.002.patch etc.
8. Patch committed → a committer votes +1 and commits
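The patch naming scheme used in steps 5 and 7 (issue key, then a zero-padded iteration number) is easy to get wrong by hand. A small sketch of the convention, assuming a hypothetical helper `patchFileName` and a made-up issue number:

```java
import java.util.Locale;

/** Sketch of the TEZ-NNNN.00X.patch naming convention; the helper name is hypothetical. */
final class PatchNames {
    private PatchNames() {}

    /** Builds a name such as TEZ-1234.001.patch for the given issue number and iteration. */
    static String patchFileName(int issueNumber, int iteration) {
        if (issueNumber <= 0 || iteration <= 0) {
            throw new IllegalArgumentException("issue number and iteration must be positive");
        }
        // %03d zero-pads the iteration so attachments sort in upload order.
        return String.format(Locale.ROOT, "TEZ-%d.%03d.patch", issueNumber, iteration);
    }

    public static void main(String[] args) {
        System.out.println(patchFileName(1234, 1));
        System.out.println(patchFileName(1234, 2));
    }
}
```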
Choosing the right issue
Good first contributions:
| Type | Difficulty | Acceptance rate |
|---|---|---|
| Missing test coverage | Low | High |
| Wrong error message | Low | High |
| Javadoc improvement | Low | High |
| Logging improvement | Low | High |
| NPE in edge case | Medium | High |
| Performance regression (small) | Medium | Medium |
| New feature | High | Low (needs design discussion first) |
Rule: Start with "Minor" or "Trivial" priority JIRAs. Do not attempt "Blocker" or "Critical" until you have 3+ committed patches.
What this level covers
| Topic | Lab |
|---|---|
| Find and reproduce a real open JIRA issue | Lab 8.1 |
| Implement a fix, write the test, format the patch | Lab 8.2 |
| Write better error messages for failed DAGs | Lab 8.3 |
Lab 8.1 — Find and Reproduce a Real JIRA Issue
Lab type: Research & Reproduce
Estimated time: 2–4 hours (actual time varies by issue)
Step 1 — Find a Good Candidate
Go to: https://issues.apache.org/jira/projects/TEZ
Filter:
- Status: Open
- Priority: Minor or Trivial
- Component: tez-dag or tez-runtime-library
- Resolution: Unresolved
Look for issues with:
- A small reproduction case described in comments
- No existing "Patch Available" attachment
- Last comment less than 1 year old
Step 2 — Read Everything
For your chosen issue, read:
- The original description
- Every comment (some comments contain critical reproduction steps)
- Any attached patches (even if they were rejected — understand why)
- Related issues in the "is blocked by" / "depends on" links
Answer for your issue:
| # | Question |
|---|---|
| 1 | What is the exact symptom? (Exception? Wrong result? Performance regression?) |
| 2 | Which Tez class is implicated? Which method? |
| 3 | Under what conditions does the bug occur? |
| 4 | Is there a unit test that would catch this if it existed? |
Step 3 — Reproduce the Bug
For a unit-test-reproducible bug:
cd ~/tez-src
# Write a test that fails
mvn test -pl tez-dag -Dtest=TestVertexImpl#testMyReproduction -q 2>&1 | tail -20
For a configuration-dependent bug, write a minimal local-mode DAG that triggers it.
Record:
- The exact exception and stack trace
- Which class and line number triggers it
- Whether it is deterministic or intermittent
Step 4 — Map the Root Cause
Trace from the symptom to the line of code that is wrong:
- Start with the exception message
- Find the throw site in source code
- Walk backwards through the call stack
- Identify the single line that is wrong (the real fix site is often 10 lines above the throw site)
Step 5 — Verify Your Understanding
Post a comment on the JIRA (be professional and concise):
I was able to reproduce this issue on Tez trunk (commit <hash>) with the
following minimal test case:
[paste test code or reproduction steps]
The root cause appears to be [one sentence description] at
[ClassName.java:line]. I am working on a patch.
This establishes you as working on the issue and prevents duplicate work.
Questions
- How long did it take you to go from "reading the JIRA" to "reproducing the bug"?
- Was the root cause where you expected it based on the stack trace, or did you have to trace further?
- Is there a comment in the code near the bug site that explains the intended behavior? Was the comment wrong?
Lab 8.2 — Implement the Fix, Write the Test, Format the Patch
Lab type: Fix-It (real JIRA)
Estimated time: 2–6 hours
Step 1 — Implement the Minimum Fix
Rules:
- Change only what is necessary to fix the bug
- Do not reformat surrounding code
- Do not add unrelated improvements
- Do not add comments unless they explain the fix
- If the fix requires changes in multiple files, make all changes in one commit
Step 2 — Write the Test
Every Tez patch must include a test that:
- Fails on the original code (without the fix)
- Passes on the patched code
The test must be in the same test class as existing tests for the modified class.
Test quality checklist:
- Test name clearly describes what it is testing
- @Test(timeout = 5000) annotation (prevents hung tests from blocking CI)
- No Thread.sleep() (use DrainDispatcher.await() or CountDownLatch instead)
- Assertion messages explain what was expected vs. what was found
- No hardcoded absolute paths or ports
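The "no Thread.sleep()" rule is the one new contributors most often trip over, so here is a minimal sketch of the latch-based alternative using only the JDK. The class and method names are hypothetical, and the executor stands in for whatever asynchronous component the real test drives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: wait for asynchronous work with a bounded latch instead of a fixed sleep. */
final class LatchInsteadOfSleep {
    private LatchInsteadOfSleep() {}

    /** Runs one asynchronous "handler" and returns how many times it ran. */
    static int runAndAwait(long timeoutMillis) {
        AtomicInteger handled = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            worker.execute(() -> {
                handled.incrementAndGet(); // stands in for the event handler under test
                done.countDown();          // signal completion explicitly
            });
            // Returns as soon as the handler finishes; fails loudly if it never does.
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new AssertionError("handler did not run within " + timeoutMillis + " ms");
            }
            return handled.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for handler", e);
        } finally {
            worker.shutdownNow();
        }
    }
}
```

Unlike a sleep, the wait ends the moment the work completes, and a hang turns into a clear failure message rather than a flaky assertion.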
Step 3 — Run the Full Test Module
cd ~/tez-src
mvn test -pl tez-dag -q 2>&1 | tail -10
All tests must pass.
Step 4 — Run Checkstyle
mvn checkstyle:check -pl tez-dag -q 2>&1 | grep -E "ERROR|violation" | head -20
Zero violations required.
Step 5 — Format the Patch
cd ~/tez-src
git diff HEAD > /tmp/TEZ-NNNN.001.patch
Verify:
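# No trailing whitespace on added lines (blank context lines legitimately consist of a single space)
grep -nE '^\+.*[[:space:]]+$' /tmp/TEZ-NNNN.001.patch
# Patch is consistent with the tree; --reverse is needed because the change is already applied here
git apply --check --reverse /tmp/TEZ-NNNN.001.patch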
Step 6 — Upload to JIRA
- Open your JIRA issue
- Click "Attach File" and upload /tmp/TEZ-NNNN.001.patch
- Set the "Patch Available" flag (the checkbox in the issue screen, NOT the workflow button)
- Update the description or add a comment:
Attaching patch TEZ-NNNN.001.patch
Changes:
- [ClassName.java]: [one-line description of the fix]
- [TestClassName.java]: [one-line description of the new test]
The test fails on unpatched code and passes with the fix applied.
After You Submit
You will typically receive review feedback within a few days to a few weeks. Common feedback categories:
| Feedback | Meaning | Your response |
|---|---|---|
| "Can you add a test?" | Test is missing | Add test, re-upload |
| "This is too broad" | Change is larger than needed | Narrow scope, re-upload |
| "Style nit: …" | Checkstyle or code style | Fix, re-upload |
| "+1" | Committer approves | Wait for commit, or ask "Is this ready to commit?" |
| "-1" | Hard block | Address all -1 comments before re-uploading |
Lab 8.3 — Improve Error Messages for Failed DAGs
Lab type: Fix-It (error message quality)
Estimated time: 90 min
Overview
Poor error messages are one of the most common complaints from Tez users. "Container exited with a non-zero exit code" tells an operator almost nothing. This lab focuses on finding and improving a diagnostic message in the Tez AM.
Step 1 — Find Weak Error Messages
Search for generic or unhelpful diagnostics:
grep -rn '"Container exited\|"Task failed\|"Vertex failed\|unknown error' \
~/tez-src/tez-dag/src/main/java/ | grep -v test | head -20
Also look for messages that use string concatenation on a potentially-null object:
grep -rn 'diagnostics.*\+.*null\|null.*\+.*diagnostics' \
~/tez-src/tez-dag/src/main/java/ | head -20
Step 2 — Pick a Target
Select one diagnostic message that you can improve. Good candidates:
- A message that says "failed" without explaining why
- A message that could NPE if a field is null
- A message that uses a raw integer code without a human-readable explanation
Step 3 — Understand the Context
For your chosen message:
- What class emits it?
- What state transition triggers it?
- What information is available at that point (in the method parameters or fields) that could be added to the message?
Step 4 — Improve the Message
Example improvement:
// Before (unhelpful):
diagnostics.add("Container " + containerId + " failed");
// After (actionable). describeExitStatus(...) is a hypothetical helper that maps
// ContainerExitStatus constants to text; ContainerExitStatus itself only defines the codes.
diagnostics.add(String.format(
    "Container %s failed with exit code %d (%s). " +
    "Check container logs at: %s",
    containerId,
    exitCode,
    describeExitStatus(exitCode),
    logURL));
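A self-contained sketch of how such a message could be assembled, including the null handling that Step 2 calls out. Everything here is an assumption made for illustration: `DiagnosticSketch`, `describeExitStatus`, and `containerFailed` are hypothetical names, and the three negative codes are assumed to mirror the constants in YARN's `ContainerExitStatus`; real code should reference those constants rather than redefine them.

```java
import java.util.Locale;
import java.util.Objects;

/** Sketch of an actionable, null-safe container failure message. */
final class DiagnosticSketch {
    private DiagnosticSketch() {}

    // Assumed to mirror org.apache.hadoop.yarn.api.records.ContainerExitStatus.
    static final int ABORTED = -100;
    static final int DISKS_FAILED = -101;
    static final int PREEMPTED = -102;

    /** Hypothetical helper: maps an exit status to a short human-readable reason. */
    static String describeExitStatus(int exitCode) {
        switch (exitCode) {
            case 0:
                return "success";
            case ABORTED:
                return "aborted by the framework";
            case DISKS_FAILED:
                return "local disks failed";
            case PREEMPTED:
                return "preempted by the scheduler";
            default:
                return exitCode > 0 ? "process exited with an error" : "unrecognized status";
        }
    }

    /** Builds the diagnostic; missing fields become explicit placeholders instead of "null". */
    static String containerFailed(Object containerId, int exitCode, String logUrl) {
        return String.format(Locale.ROOT,
            "Container %s failed with exit code %d (%s). Check container logs at: %s",
            Objects.toString(containerId, "<unknown container>"),
            exitCode,
            describeExitStatus(exitCode),
            Objects.toString(logUrl, "<log URL unavailable>"));
    }
}
```

The placeholders are a design choice: an operator reading "<log URL unavailable>" knows the field was missing, whereas a bare "null" looks like a bug in the message itself.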
Step 5 — Write a Test for the New Message
The test should verify that:
- The improved message appears in TaskAttemptImpl.getDiagnostics() or VertexImpl.getDiagnostics() after the relevant failure event
- It contains the expected key fields (exit code, container ID, etc.)
Pattern:
@Test
public void testDiagnosticsContainsExitCode() {
// ... set up failing task attempt with specific exit code ...
List<String> diags = taskAttempt.getDiagnostics();
assertTrue("Diagnostics should contain exit code",
    diags.stream().anyMatch(d -> d.contains("exit code 123")));
}
Step 6 — Format Patch and JIRA
git diff > /tmp/TEZ-ERRORMSG.001.patch
JIRA title pattern: [tez-dag] Improve error message for [specific failure scenario]
Reflection Questions
- What makes a good diagnostic message? List 4 properties.
- Why do projects accumulate bad error messages over time? (Hint: think about who writes the code vs. who runs it.)
- Find a Tez JIRA where the only change was improving a log or diagnostic message. Was the patch accepted? How long did the review take?
Level 9: Advanced Committer / PMC-Level Contributor
At this level you move beyond fixing bugs into shaping the project: writing performance-critical tests, analyzing regressions, participating in design discussions, and understanding how Apache governance works.
The committer path
Contributor → trusted contributor (10+ accepted patches)
→ committer candidate (PMC votes)
→ committer (can merge patches)
→ PMC member (vote on releases and project direction)
Becoming a committer is about demonstrated judgment — not just writing correct code, but consistently:
- Choosing the minimum-impact fix over the clever refactor
- Writing tests that catch real bugs, not just satisfy coverage metrics
- Reviewing others' patches with constructive, specific feedback
- Following up on issues you reported or started
What this level covers
| Topic | Lab |
|---|---|
| Write comprehensive scheduler behavior tests | Lab 9.1 |
| Analyze and quantify a performance regression | Lab 9.2 |
Lab 9.1 — Write Tests for Scheduler Behavior
Lab type: Build It — comprehensive test coverage
Estimated time: 3–4 hours
Tez module: tez-dag
Overview
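The Tez task scheduler layer (TaskSchedulerEventHandler, named TaskSchedulerManager in newer
releases, together with scheduler implementations such as YarnTaskSchedulerService and
LocalTaskSchedulerService) manages how containers are requested from YARN and how pending
tasks are assigned to available containers.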
This is one of the least-tested areas of Tez. Well-written scheduler tests are highly valued by committers.
Step 1 — Understand the Scheduler Interface
find ~/tez-src -name "TaskScheduler.java" | head -3
find ~/tez-src -name "TaskSchedulerEventHandler.java" | head -3
find ~/tez-src -name "TestTaskScheduler*.java" | head -10
Open the scheduler interface and answer:
| # | Question |
|---|---|
| 1 | What events does TaskSchedulerEventHandler process? List all event types. |
| 2 | When a container becomes available, what is the algorithm for choosing which task to assign to it? |
| 3 | When Tez requests a container from YARN, what resource profile does it request? (CPU + memory?) |
| 4 | If YARN preempts a container, what does the scheduler do to the task that was running in it? |
Step 2 — Identify Missing Coverage
grep -n "public void test" \
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/rm/TestTaskSchedulerEventHandler.java \
| head -30
Find 3 scenarios that are NOT covered by existing tests. Good candidates:
- Container allocation after task is cancelled (race condition scenario)
- Scheduling under resource pressure (all containers allocated, new task arrives)
- Task scheduled to a blacklisted node
Step 3 — Write 3 New Tests
For each missing scenario, write a test following the pattern of the existing tests. Each test must:
- Set up the scheduler with a mock RMCommunicator and DAGAppMaster
- Drive a sequence of events
- Assert on the scheduler's resulting state and on calls made to the mock YARN RM
@Test(timeout = 5000)
public void testTaskScheduledAfterContainerPreempted() {
// TODO: set up scheduler with 1 running container
// TODO: simulate YARN preemption of that container
// TODO: verify the task is re-queued (not dropped)
// TODO: simulate new container allocation
// TODO: verify the task is re-scheduled to the new container
}
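The "drive events, then assert on state" shape of such a test can be practised on a toy model before touching the real scheduler. The class below is deliberately NOT a Tez class: `ToyScheduler` and its methods are hypothetical, and its policy (re-queue a preempted task at the front of the queue) is an assumption chosen for the illustration, not a statement about what Tez does.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy model, not a Tez class: just enough state to show the event-then-assert shape. */
final class ToyScheduler {
    private final Deque<String> pendingTasks = new ArrayDeque<>();
    private final Map<String, String> taskByContainer = new HashMap<>();

    /** A new task arrives and waits for a container. */
    void onTaskRequested(String taskId) {
        pendingTasks.addLast(taskId);
    }

    /** Assigns the oldest pending task, or returns null if nothing is waiting. */
    String onContainerAllocated(String containerId) {
        String task = pendingTasks.pollFirst();
        if (task != null) {
            taskByContainer.put(containerId, task);
        }
        return task;
    }

    /** In this toy model, preemption re-queues the running task instead of dropping it. */
    void onContainerPreempted(String containerId) {
        String task = taskByContainer.remove(containerId);
        if (task != null) {
            pendingTasks.addFirst(task);
        }
    }

    int pendingCount() {
        return pendingTasks.size();
    }

    String taskOn(String containerId) {
        return taskByContainer.get(containerId);
    }
}
```

A test against this model follows the same five steps as the skeleton: request a task, allocate a container, preempt it, check the task is pending again, then allocate a second container and check the assignment.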
Step 4 — Run and Verify
mvn test -pl tez-dag -Dtest=TestTaskSchedulerEventHandler -q 2>&1 | tail -10
Step 5 — Reflection
| # | Question |
|---|---|
| 1 | The test uses mocks for YARN and the DAGAppMaster. What real behavior is NOT exercised by this approach? |
| 2 | A scheduler has inherently concurrent behavior. How do the existing tests handle thread safety? |
| 3 | If you were to write an integration test for the scheduler (using MiniTezCluster), what would be harder to set up than in a unit test? What would be easier to assert? |
Lab 9.2 — Analyze a Performance Regression
Lab type: Research & Benchmark
Estimated time: 3–4 hours
Overview
Performance regressions are among the most impactful bugs in Tez — a 10% slowdown in shuffle can translate to significant cost at scale. But they are also the hardest to reproduce and fix.
In this lab you will:
- Identify a performance-sensitive code path
- Write a micro-benchmark using JMH
- Compare two implementations and quantify the difference
- Write a JIRA with a clear, reproducible performance report
Step 1 — Identify a Hot Path
The most performance-critical paths in Tez:
| Path | Class | Why it matters |
|---|---|---|
| Record serialization | TezSerializer, WritableSerialization | Called once per record |
| Sort buffer writes | DefaultSorter.collect() | Called once per output record |
| Shuffle URL construction | Fetcher.getFetchList() | Called per fetch request |
| Counter increment | TezCounter.increment() | Called very frequently |
| BitSet operations | VertexManagerPlugin.onSourceTaskCompleted() | Called per task completion |
Step 2 — Add JMH Dependencies to the Maven Build
For a quick JMH benchmark within the project:
<!-- Add to level-4-waving-manager/pom.xml if you want to benchmark BitSet -->
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-core</artifactId>
<version>1.37</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-generator-annprocess</artifactId>
<version>1.37</version>
<scope>test</scope>
</dependency>
Step 3 — Write the Benchmark
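Example: compare BitSet.andNot() on a clone vs Set.containsAll() for the same "are all scheduled tasks finished?" check. createBigBitSet and createBigSet are helpers you write yourself; for cleaner numbers, build the inputs in a @State class with a @Setup method so that allocation is not part of the measurement: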
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void benchmarkBitSetAndNot(Blackhole bh) {
BitSet scheduled = createBigBitSet(1000);
BitSet finished = createBigBitSet(500);
BitSet copy = (BitSet) scheduled.clone();
copy.andNot(finished);
bh.consume(copy.isEmpty());
}
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void benchmarkManualIteration(Blackhole bh) {
Set<Integer> scheduled = createBigSet(1000);
Set<Integer> finished = createBigSet(500);
boolean allDone = finished.containsAll(scheduled);
bh.consume(allDone);
}
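Before timing two implementations it is worth confirming that they answer the same question. The sketch below does that for the completion check using only the JDK. The class and helper names are hypothetical, and `firstN` is only one plausible reading of what `createBigBitSet(n)` produces (bits 0 to n-1 set).

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Sketch: the BitSet and Set forms of "is every scheduled task finished?". */
final class CompletionCheck {
    private CompletionCheck() {}

    /** Hypothetical stand-in for createBigBitSet: sets bits 0..n-1. */
    static BitSet firstN(int n) {
        BitSet bits = new BitSet(n);
        bits.set(0, n);
        return bits;
    }

    /** Hypothetical stand-in for createBigSet: contains 0..n-1. */
    static Set<Integer> firstNAsSet(int n) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            set.add(i);
        }
        return set;
    }

    /** BitSet form: clear every finished bit from a copy and see whether anything remains. */
    static boolean allDone(BitSet scheduled, BitSet finished) {
        BitSet remaining = (BitSet) scheduled.clone(); // andNot mutates, so work on a copy
        remaining.andNot(finished);
        return remaining.isEmpty();
    }

    /** Set form: every scheduled id must be present in the finished set. */
    static boolean allDone(Set<Integer> scheduled, Set<Integer> finished) {
        return finished.containsAll(scheduled);
    }
}
```

With 1000 scheduled and 500 finished tasks both forms report "not done", which also means the benchmark inputs above never exercise the all-finished case; consider measuring both.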
Step 4 — Run and Analyze
cd book/projects
mvn -pl level-4-waving-manager test -Dtest=WavingBenchmark -q 2>&1 | tail -30
Record:
- Mean time per operation (nanoseconds)
- Confidence intervals
- Winner
Step 5 — Write the JIRA Performance Report
Summary: [ClassName] uses element-by-element Set.containsAll() where word-at-a-time BitSet.andNot() is available
Description:
Micro-benchmark comparison of BitSet.andNot() vs Set.containsAll() for
wave-completion detection in WavingVertexManager (and by extension any
similar VertexManagerPlugin).
Results (1000 tasks, 500 completed, JDK 11, M1 MacBook):
BitSet.andNot(): X ± Y ns/op
Set.containsAll(): X ± Y ns/op
Speedup: Nx
For large DAGs with thousands of tasks, this difference compounds
significantly over the lifetime of the DAG.
Patch: Switch from HashSet to BitSet in [ClassName].
Priority: Minor
Component: tez-dag
Reflection
| # | Question |
|---|---|
| 1 | At what scale (number of tasks per DAG) would the BitSet optimization matter in practice? At 10 tasks? 10,000? |
| 2 | JMH benchmarks measure a single code path in isolation. What real-world factors could make the benchmark results misleading? |
| 3 | Performance patches are often held to a higher standard of review than correctness patches. Why? |
Contributor Mindset
This section is the "soft skills with hard edges" half of the curriculum. The technical chapters teach you how Tez works; this section teaches you how the Apache Tez project works — how decisions are made, how patches are accepted, how trust is earned, and how a contributor becomes a committer.
These are not optional skills. A technically excellent patch with poor process around it will sit on JIRA for months. A modest patch with clean process gets reviewed and committed.
Reading Order
The chapters are ordered to mirror the actual arc of a new contributor.
| # | Chapter | What it answers |
|---|---|---|
| 1 | Reading the Codebase | How do I navigate ~200k LOC without drowning? |
| 2 | Design via JIRA | Where does design happen in Apache projects? |
| 3 | Community Interaction | How do I talk to dev@ and JIRA without burning trust? |
| 4 | Patch Quality | What does a committer-ready patch look like? |
| 5 | Responding to Feedback | How do I handle review comments well? |
| 6 | Compatibility | What can I change without breaking users? |
| 7 | Meritocracy | How does someone become a committer or PMC member? |
Chapters 1–2 are pre-work — read them before opening any JIRA. Chapters 3–5 are operational — read them before submitting your first patch. Chapters 6–7 are strategic — read them when you start thinking beyond a single patch.
How This Complements the Technical Labs
The labs in Levels 1–9 build engineering competence inside the Tez codebase. This section builds the project-level competence needed to ship that work into Apache Tez itself.
The relationship is concrete:
| Technical chapter | Mindset chapter that pairs with it |
|---|---|
| Level 2 Lab 2: Prepare a Patch | Patch Quality |
| Level 3 deep dives on AM internals | Reading the Codebase |
| Level 5 Tez/Hive integration | Compatibility |
| Level 7 protocol & wire format | Compatibility |
| Capstone project (capstone/) | All seven mindset chapters |
If you are doing the Capstone, you should have read all seven chapters in this section by the time you reach Step 8 (the patch).
What This Section Is Not
It is not generic open-source advice. Every claim, template, and procedure here is grounded in:
- The Apache Software Foundation Way
- The Apache Tez JIRA project (TEZ)
- The dev@tez.apache.org mailing-list archive
- The tez-tools/src/main/resources/tez/checkstyle.xml and other in-repo policy files
- The @InterfaceAudience / @InterfaceStability annotations in tez-api
Where a chapter generalises, it labels the generalisation. Where it states a Tez-specific rule, it cites the in-repo file or the JIRA where the rule was set.
Prerequisites
Before this section is useful you must have:
- A local clone of Tez at ~/tez-src (git clone https://github.com/apache/tez.git)
- A JIRA account at https://issues.apache.org/jira/
- A subscription to dev@tez.apache.org (send empty mail to dev-subscribe@tez.apache.org)
- An ASF ID is not required; that comes later, with committership.
Validation for the Section
You have absorbed this section when you can:
- Find any feature in Tez within 10 minutes by tracing from TezClient or DAGAppMaster.
- Write a JIRA description that a committer can act on without follow-up questions.
- Produce a patch that passes mvn checkstyle:check and mvn test in changed modules on the first try.
- Read a @InterfaceAudience annotation and predict what you may and may not change.
- Explain to a colleague the difference between contributor, committer, and PMC.
The next chapter — Reading the Codebase — gives you the navigation strategy you will use through everything that follows.
Reading a 200k+ LOC Apache Codebase
Apache Tez is roughly 200,000 lines of Java across 15+ Maven modules. No single human holds it all in their head — not even the most senior committers. The skill is not memory; it is navigation. This chapter gives you the strategies committers actually use.
Module Map First
Before reading any code, learn the module shape. Run this once and pin the output:
cd ~/tez-src
find . -maxdepth 2 -name pom.xml | sort
The modules that matter for ~90% of work:
| Module | What lives there | When you read it |
|---|---|---|
| tez-api | Public API: TezClient, DAG, Vertex, Edge, *Descriptor | Always start here |
| tez-common | Shared utilities, TezConfiguration, counters | Tracing configs |
| tez-runtime-internals | Task runtime, LogicalIOProcessorRuntimeTask | Following a task |
| tez-runtime-library | OrderedPartitionedKVOutput, shuffle inputs | I/O contracts |
| tez-dag | DAGAppMaster, schedulers, state machines | AM-side bugs |
| tez-mapreduce | MR compat: MRInput, MROutput | MR-on-Tez |
| tez-tests | MiniTezCluster, TestOrderedWordCount | Integration tests |
| tez-tools | Checkstyle config, swimlanes, analyzer | Process tooling |
Tez follows the Hadoop convention: code lives in <module>/src/main/java, tests in
<module>/src/test/java. Protobufs live in <module>/src/main/proto.
Strategy 1: Start From the Public API, Trace Inward
Every Tez user program goes through tez-api. That makes it the only mandatory entry
point. The reading order:
tez-api (what users see)
↓
tez-dag (what the AM does with it)
↓
tez-runtime-internals (what tasks do)
↓
tez-runtime-library (the I/Os tasks use)
Trace example — "where does parallelism come from?":
cd ~/tez-src
grep -rn "setParallelism" tez-api/src/main/java | head
grep -rn "setParallelism\|reconfigureVertex" tez-dag/src/main/java | head
You will find Vertex.setParallelism(int) in tez-api and follow it to
VertexImpl.setParallelism in tez-dag. That arc — API → impl — is the canonical pattern
for reading Tez.
Strategy 2: Protobufs Are the Source of Truth for Anything Serialized
Anything that crosses a process boundary (client → AM, AM → container, AM → history) is defined in protobuf. The protos are the contract; the Java is the implementation.
find ~/tez-src -name "*.proto" | sort
The four protos to internalise:
| Proto | Role |
|---|---|
| tez-api/src/main/proto/DAGApiRecords.proto | DAGPlan, VertexPlan, EdgePlan — the DAG on the wire |
| tez-api/src/main/proto/Events.proto | The event types that flow on the dispatcher |
| tez-common/src/main/proto/TezCommonProtos.proto | Counters, plugin descriptors |
| tez-dag/src/main/proto/DAGProtos.proto | AM-internal records |
When you see a class named *Proto (e.g. DAGProtos.DAGPlan) the generated code lives in
target/generated-sources/ after a build. Don't read the generated code; read the .proto.
Practical rule: if you are changing a field that appears in a proto, you are changing wire compatibility. See Compatibility.
Strategy 3: IDE Call Hierarchy + git log -S
Two tools, used together, replace 80% of speculative reading.
Call hierarchy (IntelliJ: Ctrl-Alt-H, Eclipse: Ctrl-Alt-H) answers "who calls
this?". Use it on entry points like TezClient.submitDAG to find every call site in
tests and examples.
git log -S answers "when and why did this code appear?".
cd ~/tez-src
git log -S "reconfigureVertex" --oneline -- tez-dag/
git log -S "reconfigureVertex" --oneline -- tez-api/
Pick the oldest commit referenced and read its JIRA:
git show <sha> | head -30
# Look for "TEZ-NNNN" in the commit message
That JIRA is the design discussion. It is more valuable than the code.
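If you want to collect issue keys from many commit messages at once, the lookup is a one-line regular expression. A minimal sketch, assuming a hypothetical `JiraIds.extract` helper; the sample message in the usage note is invented for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pull TEZ-NNNN issue keys out of free text such as a commit message. */
final class JiraIds {
    private JiraIds() {}

    // Word boundaries keep the pattern from matching inside longer tokens.
    private static final Pattern TEZ_ID = Pattern.compile("\\bTEZ-[0-9]+\\b");

    /** Returns the distinct issue keys in order of first appearance. */
    static Set<String> extract(String text) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher matcher = TEZ_ID.matcher(text);
        while (matcher.find()) {
            ids.add(matcher.group());
        }
        return ids;
    }
}
```

For example, `JiraIds.extract("TEZ-3045. Fix NPE (see also TEZ-1597 and TEZ-3045)")` yields the two keys TEZ-3045 and TEZ-1597, with the duplicate dropped.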
Strategy 4: Tests Are Executable Spec
The Tez test suite is the cheapest way to learn what a class does. For any class
Foo.java, look for TestFoo.java:
find ~/tez-src -name "TestVertexImpl.java"
find ~/tez-src -name "TestDAGImpl.java"
find ~/tez-src -name "TestShuffleVertexManager.java"
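Combined with the src/main/java and src/test/java layout, the TestFoo naming convention means the expected test location can be derived mechanically from a source path. A small sketch with a hypothetical helper name; it assumes forward-slash paths relative to the repository root and does not check that the test file actually exists.

```java
/** Sketch: derive the conventional test path for a main source file. */
final class TestPathSketch {
    private TestPathSketch() {}

    private static final String MAIN_MARKER = "/src/main/java/";

    /** Maps <module>/src/main/java/<pkg>/Foo.java to <module>/src/test/java/<pkg>/TestFoo.java. */
    static String testPathFor(String mainSourcePath) {
        int at = mainSourcePath.indexOf(MAIN_MARKER);
        if (at < 0 || !mainSourcePath.endsWith(".java")) {
            throw new IllegalArgumentException("not a main Java source path: " + mainSourcePath);
        }
        int lastSlash = mainSourcePath.lastIndexOf('/');
        String module = mainSourcePath.substring(0, at);
        String packageDir = mainSourcePath.substring(at + MAIN_MARKER.length(), lastSlash + 1);
        String className = mainSourcePath.substring(lastSlash + 1);
        return module + "/src/test/java/" + packageDir + "Test" + className;
    }

    public static void main(String[] args) {
        System.out.println(testPathFor(
            "tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java"));
    }
}
```

When the derived file is missing, that absence is itself useful information: the class may be covered by a differently named test, or not at all.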
The test names alone form a behavior spec:
grep " public void test" $(find ~/tez-src -name TestVertexImpl.java)
For runtime behavior, integration tests in tez-tests/ are the gold:
ls ~/tez-src/tez-tests/src/test/java/org/apache/tez/test/
TestTezJobs.java and TestExceptionPropagation.java walk full DAGs end-to-end on a
MiniTezCluster. Read them before guessing how a feature behaves at runtime.
Strategy 5: Keep a Reading Log
Committers have working memory of the codebase because they wrote a lot of it. You don't. Compensate with notes. Keep one file:
mkdir -p ~/tez-notes
cat > ~/tez-notes/reading-log.md <<'EOF'
# Tez Reading Log
## YYYY-MM-DD — DAG submission path
- TezClient.submitDAG(DAG) in tez-api builds DAGPlan
- → DAGClientAMProtocolBlockingPB.submitDAG (RPC)
- → DAGAppMaster.submitDAGToAppMaster
- → DAGAppMaster.startDAG → AsyncDispatcher.getEventHandler().handle(DAGEventType.DAG_INIT)
## YYYY-MM-DD — Vertex parallelism reconfiguration
- VertexManagerPlugin.context.reconfigureVertex(...)
...
EOF
Re-reading three months later, the log is gold. Without it, you re-trace the same path.
Worked Exercise: TezClient.submitDAG → AsyncDispatcher
Goal: in 90 minutes, trace the path from a user calling tezClient.submitDAG(dag) to the
event landing on the DAGAppMaster async dispatcher.
Step 1 (15 min) — Find the entry
cd ~/tez-src
find tez-api/src/main/java -name "TezClient.java"
grep -n "public DAGClient submitDAG" $(find tez-api/src/main/java -name TezClient.java)
You will find an overload that takes DAG dag. Read its body. Note that it does two
things: builds a DAGPlan from the DAG, then sends it via an RPC stub.
Step 2 (20 min) — Identify the RPC
grep -rn "submitDAG" tez-api/src/main/proto/
Find DAGClientAMProtocol.proto. The SubmitDAGRequestProto carries the DAGPlan. The
generated stub is DAGClientAMProtocolBlockingPB. The server side implements it in
tez-dag.
grep -rn "implements DAGClientAMProtocolBlockingPB\|extends DAGClientAMProtocolBlockingPB" tez-dag/src/main/java
You will land in DAGClientAMProtocolBlockingPBServerImpl, which delegates each call to DAGClientHandler (both in tez-dag).
Step 3 (20 min) — Server-side handling
grep -n "submitDAG" $(find tez-dag/src/main/java -name "DAGClientHandler.java")
Follow submitDAG → DAGAppMaster.submitDAGToAppMaster → DAGAppMaster.startDAG. Inside
startDAG, you will see a DAG dag = createDAG(dagPlan) and then an event dispatched
through dispatcher.getEventHandler().handle(...).
Step 4 (20 min) — The dispatcher
find tez-dag/src/main/java -name "DAGAppMaster.java"
grep -n "AsyncDispatcher\|dispatcher" $(find tez-dag/src/main/java -name DAGAppMaster.java) | head
Find where dispatcher is instantiated and where event handlers are registered. The
handler for DAGEventType is the DAGImpl's state machine.
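The register-then-dispatch pattern you are tracing can be modelled in a few lines. The sketch below is a toy, single-threaded model and not Hadoop's AsyncDispatcher: the class names are hypothetical, events are represented by bare enum constants rather than event objects with a getType() method, and delivery happens in an explicit `drain()` call instead of on a background thread.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

/** Toy event types, named after the real ones only for readability. */
enum ToyDagEventType { DAG_INIT, DAG_START }

/** Toy single-threaded model of register-then-dispatch; not Hadoop's AsyncDispatcher. */
final class ToyDispatcher {
    private final Map<Class<?>, Consumer<Enum<?>>> handlers = new HashMap<>();
    private final Queue<Enum<?>> queue = new ArrayDeque<>();

    /** One handler per event-type enum class, as in the registration step. */
    void register(Class<? extends Enum<?>> eventTypeClass, Consumer<Enum<?>> handler) {
        handlers.put(eventTypeClass, handler);
    }

    /** Callers only enqueue; nothing is delivered on the caller's stack. */
    void handle(Enum<?> event) {
        queue.add(event);
    }

    /** Delivers queued events in order and returns how many were delivered. */
    int drain() {
        int delivered = 0;
        Enum<?> event;
        while ((event = queue.poll()) != null) {
            Consumer<Enum<?>> handler = handlers.get(event.getDeclaringClass());
            if (handler == null) {
                throw new IllegalStateException("no handler registered for " + event);
            }
            handler.accept(event);
            delivered++;
        }
        return delivered;
    }
}
```

The property worth noticing is the decoupling: the code that raises an event never calls the state machine directly, which is why stack traces in the AM rarely show the original caller.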
Step 5 (15 min) — Record it
Open your reading log and write the four-line summary. Cite the file and line for each hop.
Validation Artifacts
After this chapter you should produce and keep:
- A ~/tez-notes/module-map.md with one sentence per module.
- A ~/tez-notes/reading-log.md with the submitDAG trace from the exercise above.
- A grep-able list of the four protos and what each one defines.
- One git log -S command and the JIRA it surfaced, saved to the log.
When you can do the exercise without checking this page, you have the navigation skill. The next chapter — Design via JIRA — tells you where the design decisions behind that code actually lived.
Design via JIRA, Not PRs
Apache projects design in the open. In Tez, "the open" is the
TEZ JIRA project and the
dev@tez.apache.org mailing list — not the GitHub PR.
A PR with a "see what you think" attitude and no JIRA attached will be ignored. A JIRA with a clear problem statement and rough design will get responses within days, often from people who never read the PR. This chapter is about why, and how to use that system.
Why Not Just PRs?
GitHub PRs at Apache are mirrors of patches. They are convenient for diff viewing, but they are not the system of record. The system of record is:
| Artifact | System | Why there |
|---|---|---|
| Bug report, problem statement | JIRA | Searchable, citeable forever |
| Design discussion | JIRA + dev@ | Archived by the ASF, public |
| Patch / code review | JIRA attachment or PR linked from JIRA | Reviewed under ASF ICLA |
| Vote on release / committer | dev@ / private@ | Required by ASF policy |
| The final code | git | The result, not the discussion |
If a discussion happens only on a PR and the PR is later force-closed or the repo moves, the rationale evaporates. JIRA + mailing list don't move.
Concrete consequence: when you read code in tez-dag/ and ask "why?", the answer is
almost certainly in a JIRA referenced from the commit message — see
Reading the Codebase, Strategy 3.
The TEZ JIRA Workflow
A Tez JIRA moves through these statuses:
Open → In Progress → Patch Available → Resolved → Closed
↘ Reopened
Triggers:
| Transition | Triggered by | Means |
|---|---|---|
| Open → In Progress | Assignee starts work | Don't duplicate this |
| In Progress → Patch Available | Patch (or PR) is ready for review | Reviewers, please look |
| Patch Available → Resolved | Committer commits it | Done in trunk |
| Resolved → Closed | Release ships containing the fix | Done for users |
| Resolved → Reopened | Bug returns or revert needed | Re-do |
You set the first two transitions yourself: Open → In Progress when you start work, and In Progress → Patch Available when the patch is ready for review. Every other transition in this table requires a committer.
Reading Old JIRAs for Context
The single highest-leverage Tez skill is reading old JIRAs. Conventions:
- Issues are referenced as `TEZ-NNNN` in commit messages and source comments. You will see `// see TEZ-3045` or `// TEZ-1597` peppered through the code.
- Look them up at `https://issues.apache.org/jira/browse/TEZ-NNNN`.
- The "Activity" tab shows the design conversation. The "Attachments" tab shows the patch iterations (`TEZ-NNNN.001.patch`, `TEZ-NNNN.002.patch`, ...).
Try this now:
cd ~/tez-src
git log --all --oneline | grep -oE "TEZ-[0-9]+" | sort -u | tail -20
Pick one and open it in a browser. Read the description, the comments, and the patch iterations. You will see the design happen: alternatives considered, rejected, refined. This is more useful than any architecture document because it shows reasoning, not conclusions.
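To go from JIRA ids to pages you can open, a minimal sketch follows. The helper name `jira_urls` is made up for illustration; the URL pattern is the one given above, and only `grep`, `sort` and `sed` are assumed.

```shell
# jira_urls: read any text on stdin and print one JIRA URL per distinct
# TEZ id found in it. (Hypothetical helper name, not part of Tez.)
jira_urls() {
  grep -oE 'TEZ-[0-9]+' | sort -u | sed 's|^|https://issues.apache.org/jira/browse/|'
}

# Example: links for every JIRA mentioned in the last 200 commit subjects.
# git -C ~/tez-src log -200 --format=%s | jira_urls
```

Because it reads stdin, the same helper works on `git log` output, on a source file, or on a saved mail thread.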
When to Open a JIRA Yourself
You open a JIRA before writing the patch when any of the following is true:
| Situation | Open JIRA? |
|---|---|
| Typo in Javadoc or log message | Yes (small, but track it) |
| One-line bug fix with obvious cause | Yes |
| Multi-file refactor | Yes, with a brief design |
| New public API | Yes, mandatory, with [DISCUSS] on dev@ first |
| New configuration key | Yes |
| Performance change with measurable impact | Yes, with benchmark plan |
| Anything touching `DAGPlan` proto | Yes, with compatibility note |
You do not need a JIRA to:
- Ask a question on `dev@` or `user@`
- File a documentation question
- Patch a private fork
The JIRA Description Skeleton
A Tez JIRA description that committers can act on contains, in order:
## Problem
(Two to four sentences. What is wrong. Who hits it.)
## Reproduction
(Steps to reproduce, or a code sample. If a test reproduces it, name the test class.)
## Root Cause
(One paragraph. Cite file and method.)
## Proposed Fix
(One paragraph. What you intend to do. Mention any alternatives considered.)
## Compatibility
(One sentence. Wire compat? API change? Config rename? "None." is a valid answer.)
## Test Plan
(One paragraph. Which tests pass after the change. Any new test added.)
A trivial bug fix may collapse Compatibility and Test Plan to one line each. A new API must expand them.
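Before pasting a draft into JIRA, a mechanical check that no section was forgotten can help. The sketch below assumes the draft is a local Markdown file that uses exactly the six `##` headings of the skeleton; the function name `check_jira_draft` is hypothetical.

```shell
# check_jira_draft FILE: succeed only if FILE contains the six section
# headings of the skeleton, each on its own line and in the expected order.
# (Hypothetical helper; heading names are taken from the skeleton above.)
check_jira_draft() (
  last=0
  for heading in "Problem" "Reproduction" "Root Cause" "Proposed Fix" "Compatibility" "Test Plan"; do
    # Line number of the first exact match; empty if the heading is missing.
    line=$(grep -n -x -F "## $heading" "$1" | head -n 1 | cut -d: -f1)
    if [ -z "$line" ]; then
      echo "missing section: $heading" >&2
      exit 1
    fi
    if [ "$line" -le "$last" ]; then
      echo "section out of order: $heading" >&2
      exit 1
    fi
    last=$line
  done
)

# Example: check_jira_draft ~/tez-notes/draft-jira.md
```

It checks structure only; whether the Root Cause actually cites a file and method is still up to you.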
Design Doc on a JIRA — Skeleton
For anything larger than a single-file fix, attach a design doc (Markdown or PDF) to the JIRA. The skeleton:
# TEZ-NNNN: <short title>
## 1. Problem
What is wrong today. Who is affected. Why "do nothing" is not acceptable.
## 2. Goals
Bulleted, testable. "DAGPlan submission survives a 10 MB plan without OOM."
## 3. Non-Goals
What this design explicitly will not address. Prevents scope creep.
## 4. Alternatives Considered
- Option A: <description>. Pros / Cons. Why rejected.
- Option B: <description>. Pros / Cons. Why rejected.
- Option C (chosen): <description>. Pros / Cons.
## 5. Chosen Approach
Architecture sketch. Mermaid or ASCII. Cite files that will change.
## 6. Compatibility
- Wire compat: <change to any proto? backward compatible?>
- API compat: <InterfaceAudience.Public touched? deprecation plan?>
- Config compat: <new keys? renamed keys? default change?>
## 7. Test Plan
- Unit tests: which classes
- Integration: MiniTezCluster scenarios
- Manual: any out-of-suite verification
## 8. Rollout
- Default off? On? Feature flag name?
- Migration steps for existing users.
Attach as TEZ-NNNN-design.md or TEZ-NNNN-design.pdf. Announce it on dev@ with
subject [DISCUSS] TEZ-NNNN: <short title> and a link.
Expect 1–2 weeks of asynchronous discussion before consensus. Do not start patching until the design is at least loosely agreed — patches without design buy-in get rejected.
"See TEZ-NNNN" — The Codebase Convention
Search the Tez source for back-references:
cd ~/tez-src
grep -rn "TEZ-[0-9]" tez-dag/src/main/java | head -20
Every such reference is a permanent link from the code to a design conversation. When you
add a non-obvious workaround, you do the same — leave a // TEZ-NNNN: <one line why> so
the next reader can find your reasoning.
When the Design Lives on dev@ Only
Some discussions never reach JIRA — release planning, branch policy, build infrastructure.
Those live on dev@tez.apache.org only. Archive:
https://lists.apache.org/list.html?dev@tez.apache.org
Search by subject prefix:
| Prefix | Means |
|---|---|
| `[DISCUSS]` | Open question, no decision sought yet |
| `[PROPOSAL]` | Concrete proposal, feedback wanted |
| `[VOTE]` | Decision being made; 72h window |
| `[ANNOUNCE]` | One-way: release, new committer |
| `[NOTICE]` | One-way: infrastructure change |
Subscribing: send an empty mail to dev-subscribe@tez.apache.org.
Validation Artifacts
After this chapter you should be able to produce:
- The URLs of three different `TEZ-NNNN` JIRAs cited from the Tez source, and a one-line summary of what each one is about.
- A draft JIRA description (in a local file `~/tez-notes/draft-jira.md`) for a bug or improvement you have noticed, following the skeleton above.
- A subscription confirmation to `dev@tez.apache.org`.
- One archived `[DISCUSS]` thread URL relevant to a Tez area you care about.
The next chapter — Community Interaction — covers how to
actually post on dev@ and behave on JIRA without burning trust on day one.
Community Interaction
This chapter covers the operational mechanics of communicating with the Apache Tez
community — dev@tez.apache.org, JIRA, and the project's chat presence. Most of the
"rules" below are not Tez rules; they are Apache-wide conventions that 25 years of mailing
lists have settled into. Violating them is not a hanging offence, but it does mark you as
new and costs you a small amount of credibility you have not yet earned.
The Lists
Tez has the standard ASF list set:
| List | Purpose | Who reads |
|---|---|---|
| `dev@tez.apache.org` | Development discussion, design, votes | Contributors, committers, PMC |
| `user@tez.apache.org` | Usage questions, "how do I" | Users, some committers |
| `commits@tez.apache.org` | Auto-mailed commit notifications | Mostly bots; subscribe to follow trunk |
| `issues@tez.apache.org` | Auto-mailed JIRA notifications | Bots, some committers |
| `private@tez.apache.org` | PMC-only (new-committer votes, security) | PMC only |
Subscribe to a list by sending an empty mail to <list>-subscribe@tez.apache.org. Confirm
the reply. Unsubscribe via <list>-unsubscribe@tez.apache.org.
Default for new contributors: subscribe to dev@ and user@. Add issues@ once you are
actively tracking JIRAs.
Mail Etiquette: Subject Prefixes
Subject lines on dev@ use ASCII-bracketed prefixes so subscribers can filter. Use them.
| Prefix | When |
|---|---|
| `[DISCUSS]` | Open-ended question or design idea, no vote yet |
| `[PROPOSAL]` | Concrete proposal seeking comment |
| `[VOTE]` | Vote in progress; body has voting rules |
| `[VOTE][RESULT]` | Closing a vote; tallies the result |
| `[ANNOUNCE]` | One-way announcement (release, new committer) |
| `[NOTICE]` | Infrastructure / branch / policy change |
| `[jira] [Created]` etc. | Auto-prefixed by the JIRA bot; don't compose these |
For a JIRA-related question, the subject is usually Re: [jira] [Created] (TEZ-NNNN) <title>
— a reply to the bot mail.
Examples of good subjects:
- `[DISCUSS] Promoting MROutput#getDelegationToken to @Public`
- `[PROPOSAL] TEZ-4321: Caching DAG plans across submissions`
- `[VOTE] Apache Tez 0.10.4 RC1`
- `[ANNOUNCE] New Tez committer: NAME`
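If you want a draft subject checked before sending, the prefix convention is easy to encode. This is only a sketch: the function name `subject_prefix` is hypothetical, and it recognises just the hand-written prefixes from the table (not the bot-generated `[jira]` ones).

```shell
# subject_prefix SUBJECT: print the leading bracketed prefix of a dev@
# subject line; fail (exit 1) if it carries none of the prefixes in the
# table above. (Hypothetical helper name.)
subject_prefix() {
  case "$1" in
    "[VOTE][RESULT] "*) echo "[VOTE][RESULT]" ;;   # must come before [VOTE]
    "[DISCUSS] "*)      echo "[DISCUSS]" ;;
    "[PROPOSAL] "*)     echo "[PROPOSAL]" ;;
    "[VOTE] "*)         echo "[VOTE]" ;;
    "[ANNOUNCE] "*)     echo "[ANNOUNCE]" ;;
    "[NOTICE] "*)       echo "[NOTICE]" ;;
    *) return 1 ;;
  esac
}

# Example:
# subject_prefix "[VOTE] Apache Tez 0.10.4 RC1" || echo "add a prefix first"
```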
Mail Etiquette: Formatting
The ASF lists are plaintext-first. The hard rules:
- Plain text only. No HTML, no rich text. Most clients have a "Send as plain text" toggle; set it as the default for `*@apache.org` recipients.
- Inline reply, not top-post. Quote the relevant lines, reply below each.
- Wrap at ~78 columns. Long unbroken lines render badly in archives.
- Sign off. First name or first + last; not your full corporate signature block.
- No attachments over a few KB. Patches go on JIRA, not the list.
- No images. Diagrams as ASCII or as links to images hosted elsewhere.
A good dev@ reply looks like:
On Tue, May 7, 2024 at 10:14 AM, Foo Bar <foo@example.com> wrote:
> I think we should change the default of tez.am.resource.memory.mb
> from 1024 to 2048 to handle large DAGs better.
Agreed for large DAGs, but 2048 doubles the AM footprint for everyone
running small jobs (most CI users). Could we instead size it based on
DAGPlan size, falling back to 1024? Sketch:
am_mem_mb = max(1024, dagPlanBytes / 1024 * 4)
I can prototype on TEZ-4XXX if there's interest.
--
Jane
What it doesn't have: HTML, a corporate disclaimer, a 2 MB inline screenshot, "+1" with no context, or "any updates?" with no quoted reference.
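Two of the formatting rules are mechanical enough to check on a saved draft. The sketch below assumes the draft is a plain-text file; `lint_mail` is a hypothetical name, and the HTML test only looks for a handful of common tags, so treat a clean result as a hint, not proof.

```shell
# lint_mail FILE: report lines that break two of the rules above:
# longer than 78 columns, or containing an obvious HTML tag.
# Exits 1 if anything was reported. (Hypothetical helper name.)
lint_mail() {
  awk '
    BEGIN { bad = 0 }
    length($0) > 78 {
      printf "%d: longer than 78 columns\n", NR; bad = 1
    }
    tolower($0) ~ /<\/?(html|body|div|p|br|span|font)( |>)/ {
      printf "%d: looks like HTML\n", NR; bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Example: lint_mail ~/tez-notes/reply-draft.txt
```

Quoted lines and addresses such as `<foo@example.com>` pass, since neither matches the tag list.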
JIRA Etiquette
JIRA is the system of record for code-touching work. The mores:
Don't reassign
The Assignee field belongs to whoever is doing the work. If a JIRA is assigned to
someone else, do not reassign it to yourself, even if it's been idle for a year. Comment
first:
Hi @ASSIGNEE, I'd like to pick this up if you're not actively working on it. Happy to hand back if you have an in-flight patch. If I don't hear back in a week I'll assign to myself.
After a week of silence, then take it.
Ask before claiming high-traffic JIRAs
For high-visibility issues (release blockers, anything with multiple watchers), comment "I'll take a look at this" before you set yourself as assignee. This prevents two people working on the same fix.
"Patch Available" semantics
Setting status to Patch Available is a signal that means:
- A patch (or PR linked from the JIRA) is attached
- It applies cleanly to the current trunk
- The author believes tests pass locally
- The author is requesting review
It does not mean "I am still iterating." If you upload a draft, leave the status as In Progress and say so in a comment.
Status flow you control vs. don't
| You may set | Means |
|---|---|
| Open → In Progress | Starting work |
| In Progress → Patch Available | Ready for review |
| Patch Available → In Progress | Reopening to revise after feedback |
| Comment with new patch | Iteration |
| Committer-only | Means |
|---|---|
| Patch Available → Resolved | Committed |
| Resolved → Closed | Released |
| Any → Reopened | Bug returned |
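The two tables amount to a small lookup from transition to role. A sketch of that lookup, with status names spelled as in the tables and a hypothetical function name `who_may_set`:

```shell
# who_may_set FROM TO: print "contributor" or "committer" for the
# transitions listed in the two tables above; fail for anything else.
# (Hypothetical helper; it mirrors the tables, not JIRA's permission scheme.)
who_may_set() {
  case "$1 -> $2" in
    "Open -> In Progress"|"In Progress -> Patch Available"|"Patch Available -> In Progress")
      echo contributor ;;
    "Patch Available -> Resolved"|"Resolved -> Closed")
      echo committer ;;
    *" -> Reopened")
      echo committer ;;
    *)
      return 1 ;;
  esac
}

# Example: who_may_set "In Progress" "Patch Available"
```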
Patch naming convention
Patches attached to JIRA use the convention TEZ-NNNN.NNN.patch:
TEZ-4321.001.patch <- first iteration
TEZ-4321.002.patch <- after first review round
TEZ-4321.003.patch <- after second review round
Branch-specific patches add a branch suffix:
TEZ-4321.branch-0.10.001.patch
Old patches stay attached — never delete them. The history is part of the review record.
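Picking the next iteration number by hand is easy to get wrong once a few patches pile up. A small sketch, assuming you keep the earlier patch files in one local directory; `next_patch_name` is a hypothetical helper and it deliberately ignores branch-specific patches.

```shell
# next_patch_name JIRA [DIR]: print the file name for the next patch
# iteration of JIRA, based on the JIRA.NNN.patch files already in DIR
# (default: current directory). Branch-suffixed patches are not counted.
next_patch_name() (
  jira=$1
  dir=${2:-.}
  max=0
  for f in "$dir/$jira".[0-9][0-9][0-9].patch; do
    [ -e "$f" ] || continue          # the glob matched nothing
    n=${f%.patch}                    # strip ".patch"
    n=${n##*.}                       # keep the three digits
    # awk compares numerically, so "008" is 8 rather than invalid octal.
    max=$(awk -v a="$max" -v b="$n" 'BEGIN { print ((b + 0 > a + 0) ? b + 0 : a + 0) }')
  done
  awk -v j="$jira" -v m="$max" 'BEGIN { printf "%s.%03d.patch\n", j, m + 1 }'
)

# Example: git diff origin/master > "$(next_patch_name TEZ-4321 ~/tez-notes/patches)"
```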
Where the Tez Community Currently Lives
Tez does not have an official Slack or Discord. The active channels are:
| Channel | Use |
|---|---|
| `dev@tez.apache.org` | Primary, for all dev discussion |
| `user@tez.apache.org` | Usage questions |
| JIRA | Per-issue discussion |
| ASF Slack (`the-asf.slack.com`), `#tez` if it exists | Informal, ephemeral |
Do not assume a #tez Slack channel exists; check before pointing anyone to it. The mailing list is the
official channel and is where decisions are made and archived. Slack/IRC is at most a
hallway conversation that must be summarised back to the list.
Sister projects you may need to follow because Tez integrates with them:
- `dev@hive.apache.org`: Hive on Tez execution issues
- `dev@hadoop.apache.org`: YARN / HDFS compatibility
- `dev@pig.apache.org`: Pig on Tez (mostly inactive but exists)
Self-Introduction Template
A first post to dev@tez.apache.org after subscribing is optional but helpful. Keep it
short:
Subject: [DISCUSS] Introduction and intent to contribute
Hi all,
I'm <first> <last>, a <role> at <company / "independent">. I've been
using Tez via Hive in production for ~<N> months and have been
reading the codebase to understand <component / area>.
I'm interested in contributing in the area of <one or two concrete
areas, e.g. "shuffle reliability" or "AM logging">. I've worked
through Levels 1-4 of the open-source-engineer curriculum and have
TEZ-NNNN (small Javadoc fix) ready as my first patch.
Happy for any pointers on first issues to tackle.
Thanks,
<First>
What this does:
- Signals you've done homework (not asking "how do I start?")
- Names a concrete area so committers can match you to mentors
- References a tiny first patch, so you've already shown you understand the workflow
What to avoid:
- "I'd like to contribute, please assign me a task" (no committer will do this for you)
- A list of grand redesigns
- A corporate signature block
Asking a Question on user@ Well
The format that gets answers:
Subject: Tez 0.10.x: AM OOMing on submission of 200-vertex DAG
Versions:
Tez 0.10.3
Hadoop 3.3.6
Hive 3.1.3
JDK 11
Symptom:
TezClient.submitDAG throws OOM after ~12 seconds. AM log attached
shows GC overhead limit exceeded inside DAGImpl.init.
Reproduction:
- submit DAG with 200 vertices, each with 5 inputs
- tez.am.resource.memory.mb = 1024 (default)
What I tried:
- bumping to 2048 — works
- reducing parallelism — works around but unwanted
Question:
Is there a known scaling limit for DAGPlan size with default AM
memory? Should the AM default scale with DAGPlan size?
Logs / DAG: <link to gist or paste in JIRA>
It gives versions, symptom, reproduction, what was already tried, and a focused question. A question that omits any of these gets a "please provide more info" reply, costing a round-trip day.
Validation Artifacts
After this chapter you should have, on disk and in the public archive:
- A subscription confirmation to `dev@tez.apache.org` and `user@tez.apache.org`.
- A self-introduction email posted to `dev@`, with archive URL saved.
- One inline (not top-posted) reply to an existing `dev@` thread.
- A JIRA filed (status Open) describing a real issue you've noticed.
- A `~/tez-notes/etiquette.md` cheatsheet with the subject prefixes table.
The next chapter — Patch Quality — is what your first attached patch needs to look like.
Patch Quality
A "patch" in Apache parlance is a unified diff attached to a JIRA (or, more recently, a GitHub PR linked from a JIRA). This chapter tells you what a committer is looking for when they open it for the first time. Internalising these expectations is the difference between a patch that gets committed in two review rounds and one that dies after a "please rebase" comment in month three.
What Committers Look For — In Reading Order
A committer reviewing your patch does, roughly, this:
1. Read JIRA description. (30 sec)
2. Open the patch, skim the diff stat. (30 sec)
3. Look at tests. (2 min)
4. Look at the implementation. (5 min)
5. Run mvn install / mvn test. (background)
6. Comment. (variable)
Notice tests come before implementation. If the test diff is empty or weak, the implementation is read with suspicion. If the test diff is strong and minimal, the implementation is read with trust.
Rule 1: Minimum Diff
The single rule that most distinguishes a strong patch from a weak one. The diff should contain only the changes that the JIRA describes. Not:
- A whitespace cleanup of the surrounding method
- A rename of an unrelated variable you didn't like
- An import reorder by your IDE
- A bumped dependency version "while you were here"
- A reformatted block
Every line you change costs the reviewer attention. Lines that don't serve the JIRA are a tax on the review.
Check before submitting:
cd ~/tez-src
git diff --stat origin/master
git diff origin/master | head -50
If git diff --stat shows changes in files unrelated to the JIRA, revert them:
git checkout origin/master -- path/to/unrelated/file
Rule 2: No Unrelated Changes
The corollary to Rule 1. Even within a touched file, do not bundle unrelated improvements. If you notice a separate bug while fixing your bug:
# don't fix it here. Open a separate JIRA:
echo "Noticed: VertexImpl.java:842 catches Exception too broadly" >> ~/tez-notes/queue.md
File a follow-up JIRA at the end of the week. Two small patches beat one mixed patch every time.
Rule 3: Apache Commit Message Format
The exact format used in git log for committed Tez changes:
TEZ-NNNN: <short imperative summary, under 72 chars>. (<contributor-name> via <committer-name>)
Verify with:
cd ~/tez-src
git log --oneline -20
You will see lines like:
abc1234 TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gunther)
def5678 TEZ-4322: Add MR compat test for vectorized output. (John Smith via gopalv)
When you submit, your commit message has the contributor side only:
TEZ-4321: Fix NPE in VertexImpl.recover when no inputs.
The committer appends (Jane Doe via <committer>) at commit time. Don't pre-fill it.
The summary line rules:
- Imperative mood: "Fix", "Add", "Remove", "Refactor" — not "Fixed", "Adding".
- Under 72 characters.
- Ends with a period.
- No trailing whitespace.
If the change needs more explanation, leave one blank line and add a body wrapped at 72 columns:
TEZ-4321: Fix NPE in VertexImpl.recover when no inputs.
When a vertex has no Inputs (a root data-source vertex with no
upstream edges), VertexImpl.recover called .iterator() on a null
inputs collection. The fix initialises inputs to an empty list in
the recover path.
Adds TestVertexImpl.testRecoverNoInputs covering the case.
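The mechanical parts of the summary-line rules can be checked before you commit. The sketch below is an assumption-laden helper (`check_summary` is a made-up name): it applies the 72-character limit to the whole line including the `TEZ-NNNN:` prefix, and it cannot judge imperative mood.

```shell
# check_summary LINE: succeed if LINE starts with "TEZ-NNNN: ", ends with a
# period, has no trailing whitespace, and is under 72 characters overall.
# (Hypothetical helper; imperative mood is not checked.)
check_summary() {
  # Prefix, at least some text, and a final period right before end of line.
  printf '%s\n' "$1" | grep -Eq '^TEZ-[0-9]+: [^[:space:]].*[^[:space:]]\.$' || return 1
  # Length of the whole line, prefix included.
  [ "${#1}" -lt 72 ] || return 1
  return 0
}

# Example:
# check_summary "$(git log -1 --format=%s)" || echo "fix the summary line"
```

A line that already carries the `(name via committer)` suffix fails the check because it no longer ends with a period, which matches the advice not to pre-fill it.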
Rule 4: Tests for Behavior Changes
Any behavior change must come with a test. This includes bug fixes — the test should fail before your fix and pass after. Verify:
cd ~/tez-src
# stash your fix
git stash
# run the new test
mvn test -pl tez-dag -Dtest=TestVertexImpl#testRecoverNoInputs
# it should fail
git stash pop
mvn test -pl tez-dag -Dtest=TestVertexImpl#testRecoverNoInputs
# now it should pass
If the new test passes even without the fix applied, it doesn't actually exercise the bug.
Exceptions where a test is not required:
| Change type | Test needed? |
|---|---|
| Javadoc fix | No |
| Log message string change | No |
| Comment / formatting (rare; should be its own patch) | No |
| Build / Maven config change | Usually no, but justify |
| Behavior change | Yes, always |
Rule 5: No Whitespace Churn
Whitespace-only diff lines are noise. IDEs love to insert them — turn off "format on
save" for tez-src, or restrict it to lines you edited.
Detect before submitting:
cd ~/tez-src
git diff -w origin/master --stat
git diff origin/master --stat
If the second shows many more changed files than the first, you have whitespace churn. Either clean it up or, if it's pervasive, configure your editor and re-do the change.
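A related check works on the patch file itself: look for added lines that end in spaces or tabs. Git offers `git diff --check` for this on a working tree; the sketch below does a simpler version on a saved unified diff, and the name `trailing_ws_lines` is hypothetical.

```shell
# trailing_ws_lines PATCH: print the added lines of a unified diff that end
# in whitespace, each prefixed with its line number inside the patch file.
# Like grep, it exits 0 when it finds something and 1 when the patch is clean.
trailing_ws_lines() {
  grep -nE '^\+.*[[:space:]]$' "$1"
}

# Example:
# trailing_ws_lines TEZ-NNNN.001.patch && echo "clean these up before attaching"
```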
Rule 6: Javadoc for @Public API
If you add or modify a method on a class annotated @InterfaceAudience.Public, it needs
javadoc. The check:
cd ~/tez-src
grep -rlE "@(InterfaceAudience\.)?Public" tez-api/src/main/java | head
For each such class, every public method has Javadoc with at least:
- One-sentence summary
- `@param` for each parameter
- `@return` for non-void
- `@throws` for any non-`RuntimeException` declared exception
If your patch adds a new public method without Javadoc, expect the first review comment to ask for it.
Rule 7: @InterfaceAudience and @InterfaceStability Annotations
Every public-ish class in tez-api is annotated. Example from Vertex.java:
@Public
@Evolving
public class Vertex {
...
}
The grid:
| | `@Stable` | `@Evolving` | `@Unstable` |
|---|---|---|---|
| `@Public` | Compat guaranteed across minor versions | May change between minor versions with warning | May change between any release |
| `@LimitedPrivate({"Hive"})` | Stable for named projects | Evolving for named projects | Unstable, named projects only |
| `@Private` | Internal; do not depend on | Internal | Internal |
When you add a new class to tez-api, you must annotate it. The annotations live in
tez-api/src/main/java/org/apache/hadoop/classification/. When in doubt, default to:
@Public
@Unstable
so users see the class but know not to depend on its shape yet.
Rule 8: Pre-Submit Checklist
Before you upload TEZ-NNNN.001.patch, run each of these and have all pass.
cd ~/tez-src
# 1. Full compile, all modules, no tests.
mvn install -DskipTests
# 2. Checkstyle. Tez uses the config in tez-tools/.
mvn checkstyle:check
# 3. Tests in modules you changed.
# For tez-dag, tez-api, etc.:
mvn test -pl tez-dag
mvn test -pl tez-api
# 4. A representative integration test.
mvn test -pl tez-tests -Dtest=TestOrderedWordCount
# 5. Patch applies cleanly to current master.
git fetch origin
git rebase origin/master
git diff origin/master > /tmp/TEZ-NNNN.001.patch
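# The working tree already contains the change, so check against a pristine origin/master.
git worktree add --detach /tmp/tez-check origin/master
git -C /tmp/tez-check apply --check /tmp/TEZ-NNNN.001.patch
git worktree remove /tmp/tez-check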
If any step fails, fix and re-run. Submit only when all pass.
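One way to turn the checklist into a fail-fast script is a tiny runner that executes commands in order and stops at the first failure. This is a sketch, not the project's tooling: `run_steps` is a hypothetical name, and the commands fed to it are simply the ones listed above.

```shell
# run_steps: read one command per line on stdin, run them in order, and
# stop at the first one that fails. (Hypothetical helper for a local
# ~/tez-notes/precommit.sh; not part of the Tez build.)
run_steps() {
  step=0
  while IFS= read -r cmd; do
    [ -n "$cmd" ] || continue            # skip blank lines
    step=$((step + 1))
    echo "[$step] $cmd"
    # </dev/null keeps the command from consuming the remaining steps.
    if ! sh -c "$cmd" </dev/null; then
      echo "FAILED at step $step: $cmd" >&2
      return 1
    fi
  done
  echo "all $step steps passed"
}

# Usage, from the root of the Tez checkout:
# run_steps <<'EOF'
# mvn install -DskipTests
# mvn checkstyle:check
# mvn test -pl tez-dag
# mvn test -pl tez-tests -Dtest=TestOrderedWordCount
# EOF
```

Stopping at the first failure keeps the output short and avoids running the slow integration test on a tree that does not even compile.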
Rule 9: Patch Generation
Generate the patch from a clean rebase against origin/master:
cd ~/tez-src
git fetch origin
git rebase origin/master # resolves conflicts now, not at commit time
git diff origin/master --no-color --unified=5 > TEZ-NNNN.001.patch
The --unified=5 gives reviewers 5 lines of context instead of the default 3. This is a
small kindness that makes review materially easier.
Inspect the patch before attaching:
wc -l TEZ-NNNN.001.patch # how big is it?
head -30 TEZ-NNNN.001.patch # right files?
grep -c "^+" TEZ-NNNN.001.patch # added lines (also counts each "+++" file header)
grep -c "^-" TEZ-NNNN.001.patch # removed lines (also counts each "---" file header)
A patch of 50–300 lines is comfortable for a single review round. A patch over 1000 lines will sit unreviewed until you split it.
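For an exact added/removed count that leaves out the per-file header lines, a short awk pass over the patch is enough. The sketch assumes a patch produced by `git diff` (each file section starts with a `diff --git` line); `patch_stats` is a hypothetical name.

```shell
# patch_stats PATCH: print "+A/-R", the number of added and removed lines
# in a git-format unified diff. Only lines inside hunks are counted, so the
# "--- a/..." and "+++ b/..." file headers are excluded.
patch_stats() {
  awk '
    /^diff --git / { in_hunk = 0; next }   # a new file section starts
    /^@@ /         { in_hunk = 1; next }   # hunk header: hunk body follows
    in_hunk && /^\+/ { added++ }
    in_hunk && /^-/  { removed++ }
    END { printf "+%d/-%d\n", added, removed }
  ' "$1"
}

# Example: patch_stats TEZ-NNNN.001.patch
```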
Worked Example — A Minimal Trivial Patch
A real-shape patch for a Javadoc fix on Vertex.java:
diff --git a/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java b/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
index abcdef1..1234567 100644
--- a/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
@@ -180,7 +180,11 @@ public class Vertex {
}
/**
- * Set the parallelism.
+ * Set the parallelism (number of tasks) for this Vertex.
+ *
+ * @param parallelism the number of tasks. Must be > 0 unless
+ * {@link #setVertexManagerPlugin} configures a dynamic plugin.
+ * @return this Vertex, for chaining.
*/
public Vertex setParallelism(int parallelism) {
That's the entire patch: one hunk, five lines added and one removed (+5/-1). No test (Javadoc only). It passes
`mvn checkstyle:check` and `mvn install -DskipTests`, and the JIRA is `TEZ-NNNN: Improve Javadoc on Vertex#setParallelism`.
Anti-Patterns
What committers flag immediately:
| Anti-pattern | Why it's flagged |
|---|---|
| Reformat of an entire file | Hides the real change |
| `// TODO: refactor` comment added | Should be a separate JIRA |
| `System.out.println` left in | Use `LOG`, never `System.out` |
| `e.printStackTrace()` | Use `LOG.warn(msg, e)` |
| Catching `Exception` and swallowing everything | Catch specific or rethrow |
| New configuration key with no `@Public` annotation | Won't be honored as stable |
| New method with `throws Exception` | Use specific exceptions |
| Test that always passes (no assertion) | Useless |
| Test depending on wall-clock timing | Flaky |
| `@Ignore` added to silence a failing test | Fix it or revert |
Validation Artifacts
After this chapter you should have:
- A `~/tez-notes/precommit.sh` script running the five pre-submit steps from Rule 8.
- One actual patch file `TEZ-NNNN.001.patch` on disk, even if you haven't uploaded it.
- A `~/tez-notes/patch-checklist.md` cheatsheet from Rule 8.
- Knowledge of the `@InterfaceAudience` / `@InterfaceStability` matrix.
The next chapter — Responding to Feedback — covers what happens after you press "Attach".
Responding to Feedback
Your patch is attached. A committer comments. What happens next is the most underrated skill in open source: turning review comments into a committed patch without burning the reviewer's patience or your own. This chapter is the playbook.
The Asynchronous Reality
Apache review is asynchronous and bursty. The committer who reviews your patch may:
- Be in a different time zone (most likely)
- Be reviewing on weekends or commute time
- Have other patches queued
- Be the only person in the world who deeply knows the file you touched
Practical consequences:
| Reality | What it means for you |
|---|---|
| Reviews come in bursts, not steady drip | Respond within 24–48h of the burst, then wait |
| Patches sit for weeks between rounds | Keep a ~/tez-notes/in-flight.md list |
| Same committer often reviews 2–3 of your patches in one sitting | Have all of them ready |
| A committer may never come back | Polite ping on JIRA at 2 weeks, dev@ at 4 |
Set the expectation early — both for yourself and for reviewers — that a non-trivial patch takes 3–6 weeks from first attach to commit. Optimise for round-trip count, not round-trip duration.
Address Each Comment Explicitly
Reviewers leave per-line comments on a patch (on JIRA in older Tez, on a PR in newer). Each comment needs an explicit response. Not implicit. The committer should not have to diff your old and new patches to figure out which feedback you took.
The pattern:
Reviewer: L243: This catches Exception too broadly. Tighten to IOException.
You (in JIRA comment when attaching .002):
Addressed in .002:
- L243: tightened to IOException; rethrowing wrapped TezException as before.
- L301: added the missing null check you mentioned.
- L427: pushed back; see explanation below.
This three-line response is more valuable than a perfect patch with no commentary. It shows you read every comment and decided about each one.
Don't Argue Without Evidence
When a committer says "this is wrong" and you disagree, the natural reflex is to defend. The Apache-effective reflex is to provide evidence.
Bad:
I don't think changing this would help.
Good:
I tried the suggested approach in a local branch. It causes `TestVertexImpl#testRecover` to fail because REASON. Output:
java.lang.AssertionError: expected 3 attempts, got 2
at ...
Suggesting we keep the current approach with the additional comment you also asked for.
Three rules for pushback:
- Always try the alternative first. Often the committer is right and you didn't see it.
- Quote the failing test or benchmark. Numbers and stack traces close arguments.
- Offer the smallest possible compromise. "Keep current behavior but add the comment you asked for" is much easier to accept than "no."
When to Push Back
You should push back when:
- The committer's suggestion would break a documented behavior of a `@Public` API.
- The committer's suggestion contradicts another committer's suggestion (cite the other).
- The committer's suggestion expands scope beyond the JIRA (offer to file a follow-up).
- You have a measurement (perf, memory) that contradicts the suggestion.
You should not push back when:
- It's a style preference and you don't strongly care. Take it; save your capital.
- It's a test-coverage ask. Add the test.
- It's a "split this into two patches" ask. Split it.
- It's "rename this method." Rename it.
The principle: defend the substance of the patch, never the shape.
When to Abandon
Most patches that get abandoned should not have been opened in the first place. But some get abandoned mid-review and that's the right call. Signals:
| Signal | Right action |
|---|---|
| Two committers disagree on the approach, irreconcilable | Wait for them to resolve on dev@; don't ping-pong patches |
| The JIRA is rejected as "won't fix" after design discussion | Close the JIRA, archive the patch locally, move on |
| The required change is much larger than you estimated and you can't commit the time | Comment honestly, unassign yourself, leave the JIRA open |
| The codebase has changed significantly and a complete rewrite is needed | Comment, unassign, leave for someone else |
Abandoning is a respectable outcome. Ghosting a patch is not. If you can't continue, say so on the JIRA in one sentence:
Stepping away from this; my time has been redirected. Unassigning so someone else can pick it up. Latest patch (.003) is a good starting point but needs the test reviewer @NAME asked for.
Post a New Patch with a Clear Delta
When you upload TEZ-NNNN.002.patch, leave a JIRA comment that lists the deltas from
.001:
Posted .002. Delta from .001:
- L243: tightened catch to IOException, per @<reviewer>.
- L301: added null check, per @<reviewer>.
- L427: kept current logic; rationale above.
- Added testRecoverNoInputs in TestVertexImpl.
mvn install -DskipTests, mvn checkstyle:check, mvn test -pl tez-dag all pass.
Why this matters:
- Reviewer can re-review by diffing the delta, not the full patch.
- Future readers of the JIRA see the iteration history at the JIRA level, not just in git.
- It demonstrates the patch had real iteration, not a vibes-based "I changed some stuff."
Diff your own patches locally:
diff -u TEZ-NNNN.001.patch TEZ-NNNN.002.patch | less
Thank the Reviewer
After commit, comment on the JIRA:
Thanks @COMMITTER for the review and commit. Thanks @OTHER-REVIEWERS for the feedback.
This is not perfunctory. Apache is a long game. The committer who reviewed your first patch is likely to review your tenth. They are humans investing volunteer attention.
Acknowledgement also matters at the project level — it shows other onlookers that the project's reviewers are responsive, which makes the next contributor more likely to attempt a patch.
The Shepherd Committer
For non-trivial JIRAs, especially design-heavy ones, one committer often becomes the "shepherd" — the de facto reviewer and merge-committer. The relationship:
| Their role | Your role |
|---|---|
| Reviews each patch iteration | Addresses comments promptly |
| Surfaces concerns from other committers | Treats them as that committer's concerns, not the shepherd's |
| Commits the final patch | Provides commit message text |
| May ask for sub-JIRAs | Files them, links them |
| Champions the design on `dev@` if questioned | Provides ammunition (numbers, tests) |
Spotting a shepherd: after 2–3 review rounds with the same committer, they're shepherding. Direct future questions on the JIRA to them ("@COMMITTER, would you prefer A or B for the rename?"). Don't ping multiple committers in parallel; that fragments attention.
When to Ping
JIRA pings have a half-life. Use them sparingly.
| Wait time since last activity | Action |
|---|---|
| < 1 week | Don't ping. Reviewers are busy. |
| 1–2 weeks | Comment on JIRA: "Friendly ping — anything blocking on my side?" |
| 2–4 weeks | Re-ping on JIRA, cc'ing any prior reviewer by @-mention. |
| > 4 weeks | Mention on dev@ in a [DISCUSS] thread: "TEZ-NNNN has been quiet for a month, anyone willing to take another look?" |
What kills a patch dead: pinging weekly or daily. After two such pings, reviewers deprioritise the patch out of self-defence. Don't.
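The schedule is simple enough to encode next to your in-flight list. In this sketch the exact day boundaries (7, 14 and 28) are an assumption, since the table only gives week ranges, and `ping_action` is a hypothetical name.

```shell
# ping_action DAYS: map the number of days since the last JIRA activity to
# the action from the table above. Boundary days are an assumption.
ping_action() {
  if   [ "$1" -lt 7 ];  then echo "wait"
  elif [ "$1" -lt 14 ]; then echo "comment on JIRA"
  elif [ "$1" -le 28 ]; then echo "re-ping on JIRA, @-mention prior reviewers"
  else                       echo "raise on dev@ in a [DISCUSS] thread"
  fi
}

# Example: ping_action 16
```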
Worked Example — A Full Round-Trip
JIRA: TEZ-4321, "Fix NPE in VertexImpl.recover when no inputs."
Day 0: You attach TEZ-4321.001.patch, set status to Patch Available.
Day 4: Committer @gunther comments:
L88: prefer Collections.emptyList() over new ArrayList<>()
L92: add test for the no-inputs case
L94: should we also handle no-outputs symmetrically?
Day 5: You reply on JIRA:
- L88: agreed, will fix.
- L92: agreed, adding TestVertexImpl#testRecoverNoInputs.
- L94: noticed but out of scope for this JIRA. Filed TEZ-4329 for follow-up.
Day 5: You attach TEZ-4321.002.patch and a delta-summary comment.
Day 9: @gunther comments: "+1 LGTM"
Day 10: @gunther commits as
"TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gunther)"
and sets status to Resolved.
Day 10: You comment:
"Thanks @gunther. Working on TEZ-4329 next."
10 days, 2 patch rounds, 1 follow-up JIRA filed, 0 arguments. That is a healthy review.
When Feedback Comes from a Non-Committer
Non-committers can review too. Their +1 is non-binding (only committers' votes
count for commit), but their feedback is often substantively excellent — they may know
the area better than the committer who eventually commits.
Treat non-committer feedback exactly like committer feedback: address each comment,
explain, iterate. Two non-binding +1s also signal to a committer that the patch is
ready to consider, accelerating attention.
Validation Artifacts
After this chapter you should have:
- A `~/tez-notes/in-flight.md` listing any JIRA you currently have a patch on, with the date of last activity.
- A template for the "delta from previous patch" comment, saved as `~/tez-notes/delta-template.md`.
- Internalised the four-tier ping schedule.
- The reflex to thank the committer after merge.
The next chapter — Compatibility — is the technical knowledge you need so reviewers don't have to teach you compatibility rules during review.
Compatibility
Tez is a library that ships into long-lived production clusters running Hive, Pig, and custom DAG applications. A compatibility break in Tez ripples out to every downstream project that depends on it. This chapter covers the operational knowledge of what you may and may not change without breaking users.
The Three Compatibility Surfaces
Tez has three distinct compatibility surfaces, each with different rules:
| Surface | What it covers | Where defined |
|---|---|---|
| API compatibility | Source/binary compat of Java classes | @InterfaceAudience/@InterfaceStability annotations in tez-api |
| Wire compatibility | Serialised messages over the network | protobufs in */src/main/proto/ |
| Configuration compatibility | Config keys and default values | TezConfiguration constants in tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java |
A single patch may touch zero, one, two, or all three. Knowing which surface you're touching tells you which rules apply.
API Compatibility — The Annotation Grid
Every class in tez-api is (or should be) annotated. The two-axis grid:
| | @Stable | @Evolving | @Unstable |
|---|---|---|---|
| @Public | Compat across minor versions. Major bump to change. | May change across minor versions with deprecation. | May change across any release. |
| @LimitedPrivate({"Hive"}) | Stable for named projects only (e.g. Hive). | Evolving for named projects. | Unstable, named projects only. |
| @Private | Internal. No external compat. | Internal. | Internal. |
The annotations live at tez-api/src/main/java/org/apache/hadoop/classification/:
ls ~/tez-src/tez-api/src/main/java/org/apache/hadoop/classification/
# InterfaceAudience.java
# InterfaceStability.java
Verify a class:
grep -B2 "^public class Vertex" ~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
You will see:
@Public
@Evolving
public class Vertex {
That tells you: external users may write code against Vertex, but the class may evolve
between minor versions. You may add methods. You should not remove or change the signature
of an existing method without deprecation.
What You Can and Can't Change
The decision matrix for modifying an existing public method:
| Change | @Public @Stable | @Public @Evolving | @Public @Unstable | @Private |
|---|---|---|---|---|
| Add new method to class | OK | OK | OK | OK |
| Add overload (different signature) | OK | OK | OK | OK |
| Add optional parameter (new overload) | OK | OK | OK | OK |
| Rename method | Major version only | Deprecate first | OK with note in CHANGES.txt | OK |
| Change parameter type | Major version only | Deprecate + add new | OK | OK |
| Change return type (widening) | Major version only | OK with note | OK | OK |
| Change return type (narrowing) | Major version only | Major version only | OK | OK |
| Remove method | Major version only | Major after 1 minor deprecation | OK with note | OK |
| Change method behavior (same signature) | Avoid; needs dev@ discussion | Note in CHANGES.txt | OK | OK |
The default rule for @Public @Stable: assume you can't change it. To change it, you
need dev@ agreement first.
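When triaging a patch it can help to have the matrix as a lookup rather than a table to re-read. The sketch below restates four of the rows above; CompatMatrix, its enums, and the method names are hypothetical and exist only for illustration, not in Tez.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical lookup restating part of the decision matrix; not a Tez class. */
class CompatMatrix {
  enum Surface { PUBLIC_STABLE, PUBLIC_EVOLVING, PUBLIC_UNSTABLE, PRIVATE }
  enum Change { ADD_METHOD, CHANGE_PARAM_TYPE, NARROW_RETURN_TYPE, REMOVE_METHOD }

  private static final Map<Change, Map<Surface, String>> RULES = new EnumMap<>(Change.class);

  // One call per table row; arguments follow the table's column order.
  private static void row(Change change, String stable, String evolving,
                          String unstable, String priv) {
    Map<Surface, String> cells = new EnumMap<>(Surface.class);
    cells.put(Surface.PUBLIC_STABLE, stable);
    cells.put(Surface.PUBLIC_EVOLVING, evolving);
    cells.put(Surface.PUBLIC_UNSTABLE, unstable);
    cells.put(Surface.PRIVATE, priv);
    RULES.put(change, cells);
  }

  static {
    row(Change.ADD_METHOD, "OK", "OK", "OK", "OK");
    row(Change.CHANGE_PARAM_TYPE, "Major version only", "Deprecate + add new", "OK", "OK");
    row(Change.NARROW_RETURN_TYPE, "Major version only", "Major version only", "OK", "OK");
    row(Change.REMOVE_METHOD, "Major version only", "Major after 1 minor deprecation",
        "OK with note", "OK");
  }

  /** Returns the table cell for a change applied to a class with the given annotations. */
  static String verdict(Change change, Surface surface) {
    return RULES.get(change).get(surface);
  }

  public static void main(String[] args) {
    System.out.println(verdict(Change.REMOVE_METHOD, Surface.PUBLIC_EVOLVING));
  }
}
```

The value of writing it down this way is mostly the forcing function: you have to name the surface and the change before you get an answer, which is the same question a reviewer will ask.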
Deprecation Procedure
When deprecating a @Public @Evolving method:
/**
* @deprecated Since 0.10.5, use {@link #setParallelism(int, VertexLocationHint)} instead.
* This method will be removed in 0.12.0.
*/
@Deprecated
public Vertex setParallelism(int parallelism) {
return setParallelism(parallelism, null);
}
Three required elements:
- @Deprecated annotation on the method.
- @deprecated Javadoc tag explaining what to use instead.
- A target removal version. Vague "may be removed" deprecations live forever.
Add a note to CHANGES.txt:
DEPRECATIONS:
TEZ-NNNN: Vertex.setParallelism(int) is deprecated; use setParallelism(int, VertexLocationHint).
Will be removed in 0.12.0.
Wire Compatibility — Protobufs
The DAGPlan protobuf is the most compatibility-sensitive file in Tez. It is the
serialised contract between:
- The Tez client (often inside Hive, Pig, or user code) and the AM
- The AM and history (ATSHistoryLoggingService)
- The AM and the recovery file
A DAGPlan written by a 0.10.3 client must be readable by a 0.10.5 AM. A DAGPlan
stored in a recovery file months ago must still be readable by today's AM.
The protobuf compatibility rules (protobuf 2.5 semantics, which Tez still uses for historic reasons):
| Change to a .proto | Wire compat impact |
|---|---|
| Add a new optional field with default | Forward + backward compatible |
| Add a new repeated field | Forward + backward compatible |
| Add a new required field | BREAKS: new readers reject messages from old writers |
| Remove an optional field | Wire-safe if the tag is never reused; breaks readers that rely on the field being set |
| Rename a field (same tag) | OK in wire, breaks source compat |
| Change a field's tag number | BREAKS wire compat |
| Change a field's type | Usually BREAKS |
| Convert optional to repeated | BREAKS |
| Add a new enum value | BREAKS if old readers reject unknowns |
The hard rule for DAGApiRecords.proto:
ls ~/tez-src/tez-api/src/main/proto/
# DAGApiRecords.proto
# DAGClientAMProtocol.proto
# Events.proto
- Never reuse a tag number. Once tag 12 was used, it's used forever.
- Never change a field's type. Even widening (int32 to int64) counts as a break: the bytes still parse, but old readers silently truncate large values.
- Never make an optional field required.
- New fields go at the end with the next free tag number, marked optional.
When adding a new field:
message VertexPlan {
required string name = 1;
optional int32 num_tasks = 2;
...
optional int64 last_modified_time = 11;
+ optional int32 max_attempts = 12;
}
The Java side should treat the new field as "may be absent" forever — old plans don't have it.
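The hard rules are mechanical enough to check in code. Here is a minimal sketch, assuming you have already extracted each message's fields into plain values: ProtoEvolutionCheck and its Field type are hypothetical, it does not parse real .proto files, and because it only compares two snapshots it cannot see a tag that was retired in an earlier version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-snapshot checker for the rules above; it does not parse .proto files. */
class ProtoEvolutionCheck {
  /** One message field; label is "optional", "required", or "repeated". */
  static final class Field {
    final int tag;
    final String label;
    final String type;
    final String name;

    Field(int tag, String label, String type, String name) {
      this.tag = tag;
      this.label = label;
      this.type = type;
      this.name = name;
    }
  }

  /** Lists rule violations when a message evolves from oldFields to newFields. */
  static List<String> violations(List<Field> oldFields, List<Field> newFields) {
    Map<Integer, Field> oldByTag = new HashMap<>();
    int maxOldTag = 0;
    for (Field f : oldFields) {
      oldByTag.put(f.tag, f);
      maxOldTag = Math.max(maxOldTag, f.tag);
    }
    List<String> out = new ArrayList<>();
    for (Field f : newFields) {
      Field prev = oldByTag.get(f.tag);
      if (prev == null) {
        // New tag: it must not be required, and it must sit after every existing tag.
        if (f.label.equals("required")) {
          out.add("tag " + f.tag + ": new field must not be required");
        }
        if (f.tag < maxOldTag) {
          out.add("tag " + f.tag + ": new field must use a tag above " + maxOldTag);
        }
        continue;
      }
      if (!prev.type.equals(f.type)) {
        out.add("tag " + f.tag + ": type changed from " + prev.type + " to " + f.type);
      }
      if (prev.label.equals("optional") && f.label.equals("required")) {
        out.add("tag " + f.tag + ": optional field became required");
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Field> before = Arrays.asList(
        new Field(1, "required", "string", "name"),
        new Field(2, "optional", "int32", "num_tasks"),
        new Field(11, "optional", "int64", "last_modified_time"));
    List<Field> after = new ArrayList<>(before);
    after.add(new Field(12, "optional", "int32", "max_attempts"));
    System.out.println(violations(before, after)); // empty list: the addition is safe
  }
}
```

A check like this is no substitute for the recovery tests, but it catches the cheap mistakes before a reviewer has to.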
Recovery File Compatibility
The AM writes recovery files containing serialised DAGPlan and event records. On restart, the AM reads its own recovery file. A patched AM must therefore be able to read recovery files written before the patch was deployed.
Practical rule: recovery is at least as wire-compat-sensitive as RPC. Treat every
DAGPlan change as a recovery-format change. Tests:
find ~/tez-src -name "TestDAGRecovery*.java"
find ~/tez-src -name "TestRecovery*.java"
If your patch touches a proto, run these tests and add a new case demonstrating old-format recovery still works.
History / ATS Compatibility
The history record format (used by the Tez UI and ATS) is also a wire format:
find ~/tez-src -name "HistoryEvent*.java" | head
find ~/tez-src -name "HistoryEvent*.proto"
A change here breaks Tez UI queries on historical DAGs. The compatibility rule is the
same as for DAGPlan. The reviewer for any history-format patch is typically a Hive
committer who depends on the Tez UI.
Configuration Compatibility
Configuration keys are defined in TezConfiguration:
grep "public static final String TEZ_" \
~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head -30
Each key looks like:
@ConfigurationProperty(type = "integer")
public static final String TEZ_AM_RESOURCE_MEMORY_MB = "tez.am.resource.memory.mb";
public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;
Adding a new key
OK at any time. Add the String constant, the _DEFAULT constant, an @Public /
@Unstable (or @Evolving) annotation if the surrounding class is annotated, and a
javadoc explaining the key and its valid range.
Renaming a key
This requires a deprecation alias. Tez has a deprecation mechanism via Hadoop's
Configuration.addDeprecation. Pattern:
public static final String TEZ_AM_RESOURCE_MEMORY_MB = "tez.am.resource.memory.mb";
// Old key, deprecated since 0.10.5.
public static final String TEZ_AM_RESOURCE_MEMORY_MB_DEPRECATED = "tez.am.memory.mb";
static {
Configuration.addDeprecation(
TEZ_AM_RESOURCE_MEMORY_MB_DEPRECATED,
TEZ_AM_RESOURCE_MEMORY_MB);
}
Old config files using the deprecated name continue to work. Hadoop's Configuration logs a deprecation warning when the old key is used.
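To see what the alias buys users without pulling in Hadoop, here is a stand-alone sketch of the behaviour. AliasedConfig is hypothetical and backed by a plain map; it is not Hadoop's Configuration and does not reproduce its exact semantics. It only shows the two properties that matter: a value set under the old name is visible under the new one, and the warning fires once per deprecated key.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in illustrating key aliasing; not Hadoop's Configuration. */
class AliasedConfig {
  private final Map<String, String> values = new HashMap<>();
  private final Map<String, String> deprecatedToNew = new HashMap<>();
  private final Set<String> warned = new HashSet<>();

  /** Registers oldKey as a deprecated alias of newKey. Call before any set or get. */
  void addDeprecation(String oldKey, String newKey) {
    deprecatedToNew.put(oldKey, newKey);
  }

  void set(String key, String value) {
    values.put(canonical(key), value);
  }

  String get(String key, String defaultValue) {
    return values.getOrDefault(canonical(key), defaultValue);
  }

  /** Number of distinct deprecated keys that have triggered a warning so far. */
  int warningCount() {
    return warned.size();
  }

  // Maps a deprecated key to its replacement, warning the first time each old key is seen.
  private String canonical(String key) {
    String newKey = deprecatedToNew.get(key);
    if (newKey == null) {
      return key;
    }
    if (warned.add(key)) {
      System.err.println(key + " is deprecated; use " + newKey);
    }
    return newKey;
  }

  public static void main(String[] args) {
    AliasedConfig conf = new AliasedConfig();
    conf.addDeprecation("tez.am.memory.mb", "tez.am.resource.memory.mb");
    conf.set("tez.am.memory.mb", "2048"); // old name, as in a legacy config file
    System.out.println(conf.get("tez.am.resource.memory.mb", "1024"));
  }
}
```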
Removing a key
Only at a major version bump, after at least one minor version of deprecation. Document
in CHANGES.txt and the release notes.
Changing a default
Treat as a behavior change. Requires dev@ discussion if the change affects perf or
resource usage. Document the change explicitly:
DEFAULT CHANGES:
TEZ-NNNN: tez.am.resource.memory.mb default changed from 1024 to 1536 to reduce OOMs
on large DAGs. Users with tight container budgets should explicitly set the
old value.
Compatibility Across Tez and Hive/Pig
Tez has cross-project compatibility commitments to Hive and Pig — they bundle Tez and
expect a Tez version bump not to break them. The mechanism is @LimitedPrivate.
grep -rn "@LimitedPrivate" ~/tez-src/tez-api/src/main/java | head
A class annotated @LimitedPrivate({"Hive"}) has API compatibility guaranteed to Hive
only. The Tez side may not break it without first warning dev@hive.apache.org. The
Hive side commits to not relying on anything other than @LimitedPrivate or @Public
APIs.
When you change a @LimitedPrivate({"Hive"}) class:
- Search Hive for usage: grep -rn <ClassName> ~/hive-src/ql/src/
- If Hive uses it, post a heads-up on dev@hive.apache.org referencing the JIRA.
- Consider providing both old and new methods for one Tez minor version.
Validation Artifacts
After this chapter you should have:
- A ~/tez-notes/compat-cheatsheet.md with the API matrix from above.
- A list of every .proto file in tez-api and which compat surface each protects.
- The set of files in tez-api/.../classification/ open in your IDE for reference.
- Knowledge of which Hive classes import from tez-api: grep -rn "import org.apache.tez" ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ | head
- The ability to predict, for any change, which compat surface(s) it touches and what the deprecation timeline would be.
The next chapter — Meritocracy — is the project-level perspective: how Apache Tez decides who gets to make compatibility decisions.
Meritocracy: Contributor → Committer → PMC
The Apache Way uses a specific, technical sense of the word "meritocracy" that is often misread. This chapter explains what it actually means inside Apache Tez, how the path from casual contributor to PMC member works, and what each step really requires.
The Three Roles
| Role | Granted by | What it gives you | What it asks of you |
|---|---|---|---|
| Contributor | Nothing — anyone who contributes is one | JIRA account, ability to submit patches | Nothing formal |
| Committer | Vote on private@tez.apache.org by PMC | Commit access to apache/tez, vote rights on patches (non-binding for releases) | ICLA on file, ongoing engagement |
| PMC member | Vote on private@tez.apache.org by PMC | Binding vote on releases, vote rights on new committers and PMC members, board reporting share | Legal stewardship, release responsibility |
There is no fourth role. Neither "lead contributor" nor "maintainer" is an Apache concept. The "Chair" is a PMC member who reports to the board on the project's behalf; in many projects the role rotates among PMC members.
What "Meritocracy" Actually Means at Apache
Apache uses "meritocracy" in a very specific sense: decisions and elevations are based on accumulated, evidenced contribution to the project — not on title, employer, or personal connections.
That is narrower than the colloquial meaning. It explicitly does not mean:
- "Best engineer wins." Many excellent engineers are not committers because they have not engaged with this specific community.
- "Most patches wins." LOC is not a measure of merit.
- "Paid time on the project wins." Full-time paid Tez work, on its own, does not earn committership. The community must observe the contribution.
- "Smartest design wins arguments." Arguments are won by evidence and consensus, not cleverness.
What it does mean:
- Sustained, visible contribution over months
- Quality demonstrated by patches getting committed with few iterations
- Trust demonstrated by reasonable behavior on JIRA and dev@
- Investment in the project itself, not just in your features
The Path to Committer
The committer vote is private; the criteria are not codified anywhere with bullet points. What committers actually look at, in rough order:
- Patch quality. Have your patches gone in with light review? Have you mastered the workflow in Patch Quality?
- Volume and sustained activity. Not LOC, but consistency. 10 small patches over 6 months is much stronger than 1 huge patch.
- Engagement breadth. Have you reviewed others' patches (with non-binding +1s)? Helped on user@ questions? Filed clean JIRAs?
- Judgement on dev@. Have you participated in design discussions? Were your contributions thoughtful, not just adding noise?
- Area coverage. Have you worked in more than one corner of the codebase, or are you trusted for a deep one? Either can earn the bit.
- Trust. Would the existing committers be comfortable with you committing your own patches?
There is no fixed threshold. Different projects have different bars; Tez is in the middle (not as strict as Hadoop, not as loose as a brand-new TLP).
Typical Trajectory
Month 1-2: First few small patches (Javadoc, log messages, tiny bug fixes).
Some friction in review as you learn conventions.
Month 3-6: More substantive patches. Lower review iteration count.
Reviewing others' patches with non-binding +1.
Month 6-12: Larger patches with design discussion.
Filing follow-up JIRAs after your patches.
Recognised name on dev@.
Month 12+: A PMC member notices and proposes you on private@.
PMC discusses, votes. Vote happens silently.
You receive a private email offering the bit.
The PMC posts an [ANNOUNCE] thread on dev@; you acknowledge there.
The 12-month figure is a median, not a rule. Faster is possible with very sustained engagement; slower is common.
Accepting the Bit
If a PMC member emails you with an offer of committership, the steps:
- Accept privately, via reply to the offer email.
- The PMC raises an [ANNOUNCE] New Tez committer: <name> thread on dev@.
- You acknowledge publicly on the thread.
- ASF Infrastructure provisions your ASF ID (<id>@apache.org).
- You get karma to push to apache/tez.
What changes for you:
- You can commit your own patches. Don't commit your own patches without review for the first few months. The community trust applies to your judgement of others' patches; your own still get reviewed.
- You get a binding +1 vote on commits.
- You get a non-binding +1 on releases (PMC +1 is binding).
- You are now visible as part of the project. Behave accordingly on dev@, JIRA, and conferences.
The Path to PMC
PMC membership is a separate, later, additive step. Committership is necessary but not sufficient. The criteria are even less codified than those for committership:
- Sustained activity as a committer. Months to years post-committer.
- Project-level judgement, not just code. Have you weighed in on release timing, compat questions, community-management issues?
- Willingness to take on release-management or PMC duties. Cutting a release, responding to security reports, mentoring new committers.
- Trust to handle confidential matters: security disclosures arrive on private@tez.apache.org, and PMC members must handle them carefully.
PMC votes are also private. You are notified by email; the public announcement is on
dev@.
What PMC Members Do That Committers Don't
| Duty | Why PMC only |
|---|---|
| Binding +1 on releases | ASF policy: releases are PMC acts |
| Vote on new committers and PMC members | Self-perpetuating governance |
| Receive and process security reports | Confidentiality |
| Approve / sign release artifacts | Legal liability flows through PMC |
| Quarterly board reports | Stewardship to the foundation |
| Trademark guardianship | "Apache Tez" is a Foundation mark |
| Brand decisions (logos, names, conferences) | ASF authorises through PMCs |
Common Misconceptions
"I work on Tez full-time, so I should be a committer."
Paid time is irrelevant. The community can only assess what it can observe — public patches, public reviews, public discussion. Internal company work, no matter how extensive, does not exist from the project's perspective.
If your day job is Tez work, the way to convert that into committership is to do that work in the open: file JIRAs, attach patches, post designs.
"I wrote N lines of code, so I should be a committer."
LOC is not used. A contributor with 200 lines spread across 15 thoughtful patches is strictly stronger than one with 5000 lines in 2 mega-patches. Smaller, frequent, high-quality contributions demonstrate the judgement committership rewards.
"My company has N committers, so we should have the next slot."
Apache projects are explicitly company-independent. Many PMCs have an informal limit on the proportion of committers from any single employer (no more than ~50%) to preserve project independence. Companies do not have slots.
"I was a committer on project X, so I should get the bit here automatically."
You don't. Committership is per-project. Past contribution elsewhere is positive prior evidence but does not substitute for engagement on Tez.
"I have an ASF ICLA on file, so I'm a contributor."
An ICLA is a legal document covering future contributions. It does not make you a contributor; submitting a contribution makes you a contributor. ICLA is necessary for non-trivial contributions to be committed.
"There is a contributor-rank or leaderboard."
There isn't. Apache projects do not maintain rankings, badges, or stars. The closest
thing is the CHANGES.txt file, which records the contributor name on each committed
patch.
What Earns the Bit, Concretely
If you want a checklist, this is roughly it. None are individually required, but most committers tick most boxes by the time they're proposed:
- 10+ patches committed, spanning multiple areas of the code.
- At least one patch with non-trivial design discussion on dev@ or JIRA.
- At least one bug found by you, reproduced by you, fixed by you, tested by you.
- Reviewed at least 5 other contributors' patches with constructive non-binding +1s or -1s.
- Helped answer questions on user@ or in JIRA comments.
- Filed follow-up JIRAs when you noticed adjacent issues.
- Behaved well in every public interaction, including when a patch was rejected.
- Maintained existing patches as the codebase moved under them (rebased, addressed review).
- Sustained over 6+ months, not concentrated in one sprint.
- Not gaming any of the above (committers can tell).
What Earns PMC, Concretely
- Committer for 1–3+ years.
- Demonstrated judgement on dev@ beyond your own patches.
- Have either cut a release or helped with one.
- Have proposed or seconded other committers.
- Have engaged with at least one cross-project compat concern.
- Visible willingness to do PMC work (security, brand, board reports) — not just code.
Validation Artifacts
After this chapter you should have:
- A clear-eyed view of where you currently are on the path.
- A ~/tez-notes/karma.md listing every concrete thing you've done that the community can observe: patches, reviews, JIRA comments, dev@ posts.
- A goal for the next 3 months in terms of contribution shape, not LOC.
- The ability to explain the contributor / committer / PMC distinction to a colleague without using the word "lead."
This chapter closes the Contributor Mindset section. The next major section, Release & PMC Reality, takes you inside the committer and PMC view — what those roles actually look like from inside.
Issue Roadmap — Twelve Stages from Trivial to Release-Blocking
This roadmap is a deliberately ordered ladder of Apache Tez contributions. Each rung trains a specific skill, depends on the rung below it, and ends at a concrete review-ready patch. Skipping rungs is the most common reason contributors stall: a shuffle bug fix without state-machine fluency turns into a six-month patch thread, and a release-blocker triage call without compatibility reflexes turns into a reverted commit.
The stages are calibrated to the Tez 0.10.x codebase on disk at ~/tez-src. JIRA
queries assume https://issues.apache.org/jira/projects/TEZ. Patch discussion happens
on dev@tez.apache.org. Where stages reference real modules they use the exact paths
you will see under ~/tez-src:
tez-api/ public interfaces, descriptors, configuration keys
tez-common/ IDs, util, log helpers, ATS/timeline shared code
tez-dag/ AppMaster: DAGImpl, VertexImpl, TaskImpl, schedulers
tez-runtime-internals/ TezTaskRunner, LogicalIOProcessorRuntimeTask
tez-runtime-library/ ShuffleManager, Fetcher, IFile, MergeManager
tez-mapreduce/ MR-shim inputs/outputs/processors
tez-tests/ MiniTezCluster integration tests
tez-examples/ OrderedWordCount, SimpleSessionExample, etc.
tez-plugins/tez-yarn-timeline-history/ ATS history events
tez-plugins/tez-aux-services/ NM-side ShuffleHandler hook
docs/ User-facing site under src/site/markdown
The Twelve Stages
| # | Stage | Target skill | Prereq | Typical patch size | Review depth |
|---|---|---|---|---|---|
| 1 | Docs & tests | Reading the codebase, JIRA workflow, RAT/checkstyle | none | 1–30 lines | 1 reviewer |
| 2 | Build & logging hygiene | pom dep bands, slf4j idioms, LOG.isDebugEnabled() | 1 | 5–80 lines | 1 reviewer |
| 3 | Error message context | Exception chaining, ID propagation, tez-dag CONTEXT rule | 2 | 20–200 lines | 1–2 reviewers |
| 4 | State machine transitions | StateMachineFactory, InvalidStateTransitonException | 3 | 30–250 lines + test | 2 reviewers, dev@ ping |
| 5 | Scheduler bugs | TaskSchedulerManager, YarnTaskSchedulerService, AMRMClient | 4 | 50–500 lines + MiniCluster test | 2 reviewers |
| 6 | Shuffle & runtime | ShuffleManager, Fetcher, MergeManager, IFile | 5 | 80–600 lines + test | 2 reviewers |
| 7 | Hive-on-Tez compatibility | DAGPlan size, edge property contracts, session reuse | 5 or 6 | varies; often a tez-side + HIVE-side ticket | committers in both projects |
| 8 | YARN integration | AMRMToken, log aggregation, NM aux service, kerberos renewal | 5 | 50–400 lines | 2 reviewers, often YARN-side too |
| 9 | Flaky tests | DrainDispatcher, dispatcher-aware waits, port collisions | 4 | 20–150 lines per test | 1–2 reviewers; sometimes "stamped" |
| 10 | Performance regression | git bisect, async-profiler / JFR, JMH micro | 6 or 8 | 30–300 lines + bench evidence | 2 reviewers, dev@ design ping |
| 11 | Backward compatibility | @InterfaceAudience, @InterfaceStability, protobuf evolution | 4 | small code, long dev@ thread | committers + PMC |
| 12 | Release-blocking | RC voting, -1 binding, security CVE pipeline | committer | varies | PMC + release manager |
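The Prereq column is a small dependency graph, and it can be easier to reason about as data. The sketch below is one way to restate it: StageLadder and unlocked are hypothetical names, "5 or 6" is read as "either one suffices", and Stage 12 is left out because its prerequisite is committer status rather than another stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper that restates the Prereq column; not part of any Tez tooling. */
class StageLadder {
  // Stage -> prerequisite stages. Completing ANY one listed stage is enough ("5 or 6").
  private static final Map<Integer, int[]> PREREQS = new LinkedHashMap<>();

  static {
    PREREQS.put(1, new int[] {});
    PREREQS.put(2, new int[] {1});
    PREREQS.put(3, new int[] {2});
    PREREQS.put(4, new int[] {3});
    PREREQS.put(5, new int[] {4});
    PREREQS.put(6, new int[] {5});
    PREREQS.put(7, new int[] {5, 6});
    PREREQS.put(8, new int[] {5});
    PREREQS.put(9, new int[] {4});
    PREREQS.put(10, new int[] {6, 8});
    PREREQS.put(11, new int[] {4});
    // Stage 12 is left out: its prerequisite is committer status, not another stage.
  }

  /** Stages not yet completed whose prerequisite is satisfied, in table order. */
  static List<Integer> unlocked(Set<Integer> completed) {
    List<Integer> out = new ArrayList<>();
    for (Map.Entry<Integer, int[]> e : PREREQS.entrySet()) {
      if (completed.contains(e.getKey())) {
        continue;
      }
      int[] prereqs = e.getValue();
      boolean ready = prereqs.length == 0;
      for (int p : prereqs) {
        ready |= completed.contains(p);
      }
      if (ready) {
        out.add(e.getKey());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Set<Integer> done = new HashSet<>(Arrays.asList(1, 2, 3, 4));
    System.out.println(unlocked(done)); // prints [5, 9, 11]
  }
}
```

Note what the output says about Stage 4: finishing it opens three rungs at once (schedulers, flaky tests, and compatibility), which is why state-machine fluency is the pivot of the ladder.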
How to Use This Roadmap
Pick a stage honestly
Find your rung by asking what is the largest patch you have shipped:
- Never landed a Tez patch: start at Stage 1.
- Landed a docs patch but never touched Java in tez-dag: Stage 2.
- Comfortable with tez-common Java but never read a state machine: Stage 3.
- Read VertexImpl.stateMachineFactory once and were confused: Stage 4.
- Read it twice and could draw the state graph: Stage 5+.
- Already a Tez committer: jump straight to Stages 10–12 for sharpening.
Do not jump rungs to chase a "cool" bug. A locality miscount in
YarnTaskSchedulerService looks self-contained and isn't — the patch will land on
state-machine transitions you have never edited.
One stage per PR
Resist the urge to fix two things in one patch. Reviewers reject mixed-concern patches almost reflexively. If you find a logging issue while fixing an error message, file a follow-up JIRA and move on. The roadmap rewards small surface area.
Always start with git log and git blame
Before touching a file, find the last 5 commits that modified it:
cd ~/tez-src
git log --oneline -n 5 -- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
git blame -L 1200,1260 tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
The blame output tells you which committer cares about that area. CC them on the JIRA.
Time investment per stage
Calibrated against a working contributor who has the codebase checked out, can build
locally with mvn clean install -DskipTests -Phadoop28, and has filed at least one
JIRA before:
| Stage | First patch | Becoming fluent (5 patches landed) |
|---|---|---|
| 1 | half a day | 1 week |
| 2 | 1 day | 2 weeks |
| 3 | 1–2 days | 1 month |
| 4 | 3–5 days | 2–3 months |
| 5 | 1–2 weeks | 4–6 months |
| 6 | 2–4 weeks | 6 months |
| 7 | weeks per attribution call | a year of cross-project work |
| 8 | 1–3 weeks | 6 months |
| 9 | 1–3 days per flake | ongoing |
| 10 | weeks (perf is bisect-bound) | committer-level skill |
| 11 | weeks (dev@ design cycle) | committer-level skill |
| 12 | PMC-level responsibility | n/a |
Success criterion per stage
Each stage is "complete" for you when:
- Stage 1: one docs and one test patch are committed to master.
- Stage 2: at least two logging or build patches are committed without nits.
- Stage 3: one error-context patch is committed with no reviewer asking "which DAG?"
- Stage 4: one transition fix is committed and has a regression test in TestVertexImpl.
- Stage 5: one scheduler patch is committed with a MiniTezCluster repro test.
- Stage 6: one shuffle-runtime patch is committed with a deterministic repro.
- Stage 7: one cross-project ticket is filed with a written attribution argument.
- Stage 8: one YARN-integration patch is committed with explicit Hadoop-version evidence.
- Stage 9: at least three flaky tests have been de-flaked.
- Stage 10: one perf patch is committed with before/after benchmark numbers.
- Stage 11: one compatibility-sensitive patch is committed with explicit annotations and dev@ sign-off.
- Stage 12: you have helped triage at least one RC vote.
When to ask on dev@
Before writing any code for Stages 4 and above, send a short note to
dev@tez.apache.org:
Subject: [DISCUSS] TEZ-XXXX — proposed approach
I see <symptom> at <file>:<line>. My read is <cause>. I plan to <fix>, with
a regression test in <test>. Would appreciate any context I'm missing before
I post a patch.
Three sentences. No essay. The list will tell you in 24 hours whether you are about to step on someone else's in-flight work.
When the roadmap does not apply
This roadmap is for bug fixes and small features. It is not for:
- New runtime engines or scheduler rewrites — those are Tez Improvement Proposals (TEPs); start a dev@ thread, not a patch.
- Hive query-engine changes that happen to surface in Tez: file on HIVE, not TEZ.
- YARN-side fixes that Tez merely consumes: file on YARN, not TEZ.
Stage 7 teaches the attribution skill that keeps these in the right project.
What to read alongside this roadmap
| Roadmap stage | Companion deep-dive |
|---|---|
| 1–3 | Reading the codebase |
| 4 | State machines, Vertex lifecycle |
| 5 | Scheduler, DAG App Master |
| 6 | Shuffle & sort, Tez runtime |
| 7 | Hive integration |
| 8 | YARN integration |
| 9 | Testing framework |
| 10 | Container reuse, Tez runtime |
| 11 | Compatibility |
| 12 | Release & PMC |
What this roadmap is not
This roadmap is not a tutorial on Apache Tez itself. The deep dives in
../deep-dives/index.md cover the architecture; the
labs in ../level-1/index.md onward cover the
hands-on code reading. The roadmap assumes you can already build Tez from
source, run the unit tests, and stand up a MiniTezCluster end-to-end. If
you cannot, the prerequisite chapter is Level 1, Lab 1.1.
It is also not a generic Apache contribution guide. The Apache "How to Contribute" pages cover the cross-project mechanics (ICLA, JIRA account creation, mailing list etiquette). The roadmap assumes those are done.
Finally, it is not a roadmap for committership. Becoming a Tez committer is a separate path that the PMC manages. The roadmap teaches the skills that, applied consistently over time, make committership a reasonable outcome — but landing patches is necessary, not sufficient.
Reading order
If you read this book front-to-back, you will hit this chapter after the deep dives and before the capstone. That is the intended sequence:
- Read the deep dives to understand the architecture.
- Read this roadmap to understand the contribution ladder.
- Pick a rung and ship a patch.
- Come back to this roadmap when the patch lands, and step up a rung.
- After three or four rungs, attempt the capstone in ../capstone/index.md.
If you are jumping in mid-book, start at the rung that matches your current skill (see "Pick a stage honestly" above) and read the stage's companion deep dive at the same time.
A note on JQL
The JIRA queries in each stage are starting points. The Tez project's
issue labelling has drifted over the years — labels like newbie and
beginner are inconsistently applied. If a filter returns zero results,
broaden it (remove a clause) before assuming the filter is wrong. Each
stage gives at least one fallback grep-based candidate-finding method that
does not depend on labels.
A second JQL tip: pin a "watched issues" filter for the components you care about. Tez has roughly a dozen components in JIRA; you do not need to watch all of them, but watching the two or three closest to your current rung is how you stay current on landed work.
A note on local clone hygiene
Every stage in this roadmap assumes you have a clean checkout at
~/tez-src. "Clean" means:
- git status shows no untracked files outside .gitignore.
- git branch shows you on master (or a topic branch you remember creating).
- mvn clean install -DskipTests -Phadoop28 completes in under two minutes locally.
A messy checkout produces hard-to-reproduce results: a grep that
catches your own WIP, a git bisect that visits commits whose builds
were already broken by an unrelated local change, a mvn test that
passes locally because of a stale ~/.m2 jar.
Refresh on Mondays:
cd ~/tez-src
git checkout master
git pull --ff-only
git clean -fdx
mvn -q clean install -DskipTests -Phadoop28
The git clean -fdx is aggressive — it removes everything not tracked
by git, including IDE artifacts. Keep an .idea/ (or equivalent) backup
elsewhere if you customise it.
How the stages interlock
Each stage builds vocabulary the next stage uses without re-explaining:
- Stage 1 teaches the patch artifact format. Every later stage assumes it.
- Stage 2 teaches the LOG.isDebugEnabled() pattern. Stage 3 builds on it with the CONTEXT rule.
- Stage 3 teaches you to navigate tez-dag. Stage 4 lives in tez-dag/...impl/.
- Stage 4 teaches the state-machine DSL. Stage 5 reads the same DSL in the scheduler.
- Stage 5 teaches MiniTezCluster. Stage 6 leans on it for every shuffle test.
- Stage 6 teaches the runtime contracts. Stage 7 attributes bugs against those contracts to Hive.
- Stage 8 teaches the YARN boundary. Stage 11 references it when discussing compat across Hadoop versions.
- Stage 9 teaches deterministic testing. Stage 10 uses it as the baseline for benchmark stability.
- Stage 10 teaches measurement. Stage 11 uses measurement as evidence for compat decisions.
- Stage 11 teaches the audience/stability matrix. Stage 12 uses it when triaging blockers.
Skipping a stage means skipping a vocabulary. Reviewers will notice.
Now turn the page to Stage 1.
Stage 1 — Docs and Tests
What this stage teaches
Stage 1 is the on-ramp. The skills are deliberately non-technical:
- Navigate the Apache JIRA workflow: claim a ticket, assign it to yourself, attach a patch, set "Patch Available", respond to review.
- Run mvn apache-rat:check and mvn checkstyle:check cleanly.
- Produce a git format-patch artifact that applies on master.
- Wait for a Jenkins precommit run and read its output without panicking.
The contributions themselves are surgical: a docs typo, a missing @since tag, a
@param javadoc that the linter complains about, a LOG.info whose message is
misleading. Nothing in this stage will surprise a reviewer. That is the point: you
are exercising the workflow so the next stages can be about code.
JIRA filter to find candidates
Real JQL you can paste into https://issues.apache.org/jira/issues:
project = TEZ
AND labels in (newbie, beginner, "newbie-friendly", "low-hanging-fruit")
AND resolution = Unresolved
AND (component in (Documentation) OR summary ~ "typo" OR summary ~ "javadoc")
ORDER BY updated DESC
A second filter that often surfaces good Stage 1 work — javadoc that the build already flags:
project = TEZ AND status = Open AND text ~ "javadoc" AND text ~ "missing"
Open three candidates, read each comment thread end to end. Choose one that has no assignee, no patch attached, and was last updated more than three months ago. That is the abandoned-but-still-valid ticket: a perfect Stage 1.
If nothing fits, file your own. Walk the docs/src/site/markdown/ tree and grep for
broken links, stale Hadoop version numbers, and configuration keys removed years ago:
cd ~/tez-src
grep -rn "tez\.am\.task\.max\.failed\.attempts" docs/src/site/markdown/
grep -rn "hadoop-2\.[0-6]" docs/src/site/markdown/
grep -rn "TODO\|FIXME\|XXX" docs/src/site/markdown/
A genuine doc bug found this way is fair game for your first JIRA.
Walked example — TezConfiguration javadoc missing @since
Symptom: a contributor reports on dev@ that TezConfiguration.TEZ_AM_RESOURCE_MEMORY_MB
has no @since tag, so users cannot tell which release introduced the property.
Step 1 — Locate the symbol
cd ~/tez-src
grep -n "TEZ_AM_RESOURCE_MEMORY_MB" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head
Open the file. The relevant block looks roughly like:
@ConfigurationScope(Scope.AM)
public static final String TEZ_AM_RESOURCE_MEMORY_MB =
TEZ_AM_PREFIX + "resource.memory.mb";
public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;
No javadoc, no @since. That is the bug.
Step 2 — Claim the JIRA
On https://issues.apache.org/jira/projects/TEZ:
- Click Create, set Project = TEZ, Issue Type = Improvement.
- Summary: Add @since tags and javadoc for TEZ_AM_RESOURCE_MEMORY_MB family.
- Component: tez-api. Affects Version: 0.10.3. Fix Version: leave blank; the release manager sets it.
- Description: state the symptom, paste the grep above, link the dev@ thread.
- Save, then click Assign to me.
Step 3 — Diff
--- a/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
@@
+ /**
+ * Memory (in MB) requested for the AppMaster container. If the AM is launched
+ * by YARN, this is passed through to {@link
+ * org.apache.hadoop.yarn.api.records.Resource#setMemorySize(long)} on the
+ * {@code ApplicationSubmissionContext}.
+ *
+ * @since 0.5.0
+ */
@ConfigurationScope(Scope.AM)
public static final String TEZ_AM_RESOURCE_MEMORY_MB =
TEZ_AM_PREFIX + "resource.memory.mb";
+ /** Default value of {@link #TEZ_AM_RESOURCE_MEMORY_MB}. @since 0.5.0 */
public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;
Two rules for @since:
- Look at the earliest commit that introduced the symbol, not the current version. Run git log --diff-filter=A -- tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java, then git log -S "TEZ_AM_RESOURCE_MEMORY_MB" -- tez-api/.... Cross-reference the commit hash against the release tags (git tag --contains <hash>).
- Never guess. If you cannot find the release, ask on dev@. A wrong @since is worse than no @since.
Step 4 — Build and lint
cd ~/tez-src
mvn -pl tez-api -am clean install -DskipTests -Phadoop28 -q
mvn -pl tez-api checkstyle:check -q
mvn -pl tez-api apache-rat:check -q
mvn -pl tez-api javadoc:javadoc -q 2>&1 | grep -i "error\|warning" | head
The javadoc target is the slowest gate in Tez. Run it. If it warns about an @link
that no longer resolves, fix that in the same patch — reviewers will ask anyway.
Step 5 — Format and attach the patch
cd ~/tez-src
git add tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
git commit -m "TEZ-XXXX. Add @since tags for TEZ_AM_RESOURCE_MEMORY_MB family"
git format-patch -1 HEAD --stdout > /tmp/TEZ-XXXX.001.patch
The Tez convention is TEZ-XXXX.NNN.patch where NNN starts at 001 and
increments on every reroll. Upload to the JIRA, click "Submit Patch" so the
status flips to Patch Available. Jenkins precommit will pick it up within an
hour and post results.
Step 6 — Respond to review
Almost certain reviewer requests for a docs patch:
- "Add {@value} macros so the default appears inline."
- "Wrap the line at 100 chars."
- "Capitalise the first word of the javadoc sentence."
Reroll as 002, never overwrite the 001 file. Each reroll is an attachment in
JIRA, not a force-push; reviewers compare attachments by name.
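The naming rule above is mechanical enough to sketch in a few lines. The helper below is hypothetical (it is not part of Tez or any Apache tooling) and assumes only the TEZ-XXXX.NNN.patch shape described in this step, with a numeric issue key and a three-digit sequence.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Tez: derives the next reroll file name. */
final class PatchNames {
  // TEZ-<issue number>.<three-digit sequence>.patch, as described in the text.
  private static final Pattern NAME =
      Pattern.compile("(TEZ-\\d+)\\.(\\d{3})\\.patch");

  /** Returns the file name for the next reroll of the given patch. */
  static String nextReroll(String current) {
    Matcher m = NAME.matcher(current);
    if (!m.matches()) {
      throw new IllegalArgumentException(
          "Not a TEZ-XXXX.NNN.patch name: " + current);
    }
    int next = Integer.parseInt(m.group(2)) + 1;
    // Zero-pad to three digits so attachments sort by name in JIRA.
    return String.format(Locale.ROOT, "%s.%03d.patch", m.group(1), next);
  }

  public static void main(String[] args) {
    System.out.println(nextReroll("TEZ-1234.001.patch"));
  }
}
```

The point of the zero padding is that reviewers compare attachments by name, so the sequence has to sort lexicographically.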
Pitfalls
- Don't fix two bugs in one patch. A whitespace cleanup tacked onto a typo fix is the most common reason a Stage 1 patch sits unmerged for months.
- Don't run mvn install without -DskipTests. The full test suite takes well over an hour. For a docs patch you need only the lint targets above.
- Don't squash with git rebase -i master and then submit the output of git diff master. The Apache toolchain expects git format-patch -1 output, and the two are not identical whenever your branch contains merge commits.
- Don't paste the diff into the JIRA description. Attach the .patch file.
- Don't request a reviewer in the JIRA description. Use the Assignee field to assign to yourself and let committers self-select. CC on dev@ if it has been more than two weeks with no review.
- Don't open a GitHub PR instead of a JIRA patch unless the project guide says so. As of 0.10.x, Tez accepts GitHub PRs but the JIRA is still the source of truth and must be referenced in the PR title.
Exit criteria — when you're ready for the next stage
You can move to Stage 2 when:
- You have one merged docs or javadoc patch and one merged test-only patch (typically a missing @Test method or a broken assertion message in tez-tests/).
- You have responded to at least one round of reviewer nits without needing the reviewer to walk you through git format-patch syntax.
- A Jenkins precommit run on your patch no longer makes you nervous, and you can read the report and tell which warnings are pre-existing versus introduced by your change.
- You can recite from memory: "JIRA first, branch from master, one logical change per patch, TEZ-XXXX.NNN.patch naming, attach not paste."
A second walked example — fixing a misleading log message
Symptom: a contributor sees a LOG.info in tez-dag that reads:
LOG.info("Vertex " + vertexName + " has " + numTasks + " tasks");
But it fires every time the vertex is re-initialised, not just on first initialisation. The message implies a one-shot event; operators have complained that they cannot grep the log to find unique vertices.
The diff
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
- LOG.info("Vertex " + vertexName + " has " + numTasks + " tasks");
+ LOG.info("Vertex {} (id={}) initialised with {} tasks (init count={})",
+ vertexName, vertexId, numTasks, ++initCount);
Three changes in one diff:
- The message uses slf4j placeholders.
- The vertex ID is added so operators can correlate with downstream ATS events.
- The init counter makes the "re-initialise" case visible.
This patch is technically a borderline Stage 3 candidate (it adds the vertex ID — see stage-3-error-messages.md). For a first patch, the JIRA description should explicitly say "I am only changing the log message; the init-count field is added but no transition behaviour changes." That framing keeps the patch in Stage 1 scope.
Test
A log-message change usually has no functional test. The reviewer signal
is a manual run of a small OrderedWordCount against MiniTezCluster
with the modified jar, and a grep of the resulting log to confirm the new
format. Document the grep in the JIRA comments:
grep "initialised with" tez-am.log | head
When to file a follow-up
If, while working on a Stage 1 patch, you discover a bigger issue —
suppose the missing javadoc is missing because the configuration key was
silently renamed without an @since in either place — file a follow-up
JIRA in the same component. Do not bundle the bigger fix into your
Stage 1 patch.
Standard wording in your JIRA comments:
While working on TEZ-XXXX I noticed that TEZ_AM_RESOURCE_MEMORY_MB was
renamed from TEZ_AM_MEMORY_MB in 0.7.0 without an @deprecated on the
old key. Filed TEZ-YYYY to track the deprecation cleanup.
This habit — narrow Stage 1 patch + follow-up JIRA — is what reviewers mean when they say "keep patches focused." It is the skill the rest of the roadmap depends on.
Where Stage 1 patches go wrong
The two most common failure modes for a Stage 1 patch:
- Scope creep. The contributor "just fixes" three sibling issues while editing the file. Reviewers ask for a split. The contributor rerolls incompletely. Two months later the patch is abandoned.
- Silent rebase break. The contributor rebases on master, the patch no longer applies cleanly, but they never upload a 002 reroll. The committer sees a stale patch and moves on.
Neither failure is about code. Both are about workflow discipline. Stage 1 exists to drill that discipline before the stakes get higher.
Stage 2 will move you from documentation into code that runs in production AMs.
Stage 2 — Build and Logging Hygiene
What this stage teaches
Stage 2 teaches the smallest patches that touch running production code: build
metadata (pom.xml), logging idioms, and dependency hygiene. You learn:
- How Tez's dependency version bands work and which bumps are safe within a minor line.
- The slf4j-api + log4j (or reload4j) logging stack as wired in tez-common, and the four idioms reviewers actively enforce.
- How to remove deprecated Guava and Hadoop calls without breaking older Hadoop consumers in the supported compatibility band.
- How to triage log-level mismatches: messages logged at INFO that should be DEBUG (and the reverse).
The patches are still small (5–80 lines) and the risk surface is small, but they
go into the AM and the runtime tasks. A LOG.info in ShuffleManager that fires
once per fetch will be seen by every operator running Hive-on-Tez.
JIRA filter to find candidates
project = TEZ
AND resolution = Unresolved
AND (summary ~ "logging" OR summary ~ "deprecated" OR summary ~ "guava"
OR summary ~ "bump" OR summary ~ "upgrade dependency"
OR summary ~ "System.out" OR description ~ "isDebugEnabled")
ORDER BY updated DESC
A second sweep for dependency bumps that the build flags:
project = TEZ AND component in (build) AND status = Open ORDER BY priority DESC
You can also generate candidates by inspecting the resolved dependency tree:
cd ~/tez-src
mvn -pl tez-common dependency:tree -DoutputType=text | grep -E "guava|jackson|netty"
Any line that flags a Guava 12.x in transitive scope is a Stage 2 candidate, because Tez has been on Guava-shaded internals for years.
Walked example A — System.out.println in production code
Symptom: a grep finds three stray System.out.println calls in
tez-runtime-library. They were left over from a debugging session and now show
up in NodeManager stdout logs, polluting operator dashboards.
Step 1 — Find every offender
cd ~/tez-src
grep -rn "System\.out\.println\|System\.err\.println" \
tez-runtime-library/src/main/java tez-runtime-internals/src/main/java tez-dag/src/main/java \
| grep -v "/test/" | grep -v "examples"
Each hit is a separate JIRA candidate (one stage-2 patch per ticket). Pick one, file the JIRA, claim it.
Step 2 — The diff
Suppose the offender is in tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java:
--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java
@@
- System.out.println("Spill " + numSpills + " starting, size=" + buffer.position());
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("Spill {} starting, size={}", numSpills, buffer.position());
+ }
Three rules in one diff:
- Replace System.out with the class's existing slf4j LOG. If the file does not have one, add private static final Logger LOG = LoggerFactory.getLogger(...) at the top.
- Use slf4j {} placeholders, not string concatenation. The placeholder form avoids constructing the message string when the log level is filtered out.
- Wrap the call in LOG.isDebugEnabled() only when the argument list does non-trivial work (a toString() on a large object, a list copy, a .size() on a synchronized collection). Pure references (numbers, already-bound strings) do not need the guard.
The third rule is the one reviewers nitpick most. The placeholder form already
defers toString() on its arguments until the level check has passed, so a guard
around a plain LOG.debug("foo {}", x) where x is an int is unnecessary noise. But this:
LOG.debug("Pending {}", scheduledTasks.toString()); // explicit toString() runs at the call site
does benefit from a guard, because the argument expression is evaluated
before slf4j checks the level. Passing scheduledTasks itself would let slf4j
skip the toString() entirely when DEBUG is off.
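A small stand-in makes the evaluation order concrete. The sketch below is not slf4j; MiniLogger-style behaviour is mimicked by a hypothetical debug method that checks a flag before formatting, which is the only property of a placeholder-style logger the example relies on. It counts how many toString() calls each call style costs while DEBUG is off.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in for a placeholder-style logger; NOT slf4j, it only mimics the level check. */
final class LazyArgDemo {
  static boolean debugEnabled = false;
  static final AtomicInteger toStringCalls = new AtomicInteger();

  /** An object whose toString() we pretend is expensive, so we count the calls. */
  static final class Expensive {
    @Override
    public String toString() {
      toStringCalls.incrementAndGet();
      return "expensive";
    }
  }

  /** Formats the message only when debug is on, like a placeholder-style debug call. */
  static void debug(String format, Object arg) {
    if (debugEnabled) {
      System.out.println(format.replace("{}", String.valueOf(arg)));
    }
  }

  /** Returns {deferred, eager}: toString() calls per call style with debug off. */
  static int[] countCalls() {
    Expensive e = new Expensive();
    debugEnabled = false;

    toStringCalls.set(0);
    debug("Pending {}", e);            // object passed through: never stringified
    int deferred = toStringCalls.get();

    toStringCalls.set(0);
    debug("Pending {}", e.toString()); // evaluated at the call site: always runs
    int eager = toStringCalls.get();

    return new int[] {deferred, eager};
  }

  public static void main(String[] args) {
    int[] c = countCalls();
    System.out.println("deferred=" + c[0] + " eager=" + c[1]);
  }
}
```

Only the second call style pays for work that a disabled level throws away, which is the case an isDebugEnabled() guard exists for.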
Step 3 — Verify the build
mvn -pl tez-runtime-library -am clean install -DskipTests -Phadoop28 -q
mvn -pl tez-runtime-library checkstyle:check -q
There is no easy unit test for "no System.out left behind." The reviewer signal
is a clean grep across the changed file plus a green checkstyle run.
Walked example B — pom.xml dep bump within the compat band
Symptom: jackson-databind 2.12.x has a known CVE; Tez is pinned to 2.12.6 in the
parent POM. The compatibility band for the 0.10.x line allows bumps within the
2.12.* range.
Step 1 — Find the pin
cd ~/tez-src
grep -n "jackson-databind\|jackson.version\|jackson-core" pom.xml
Result, abbreviated:
pom.xml:178: <jackson.version>2.12.6</jackson.version>
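Most jackson artifacts in Tez are governed by ${jackson.version} in the parent
POM. That is the only string you change. Before relying on a single-property bump,
confirm that every artifact governed by the property is actually published at the
target version; four-part micro-patch releases are sometimes published for
jackson-databind alone.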
Step 2 — The diff
--- a/pom.xml
+++ b/pom.xml
@@
- <jackson.version>2.12.6</jackson.version>
+ <jackson.version>2.12.7.1</jackson.version>
That is the entire patch. The harder part is justifying it.
Step 3 — The JIRA description
Summary: Bump jackson-databind from 2.12.6 to 2.12.7.1
Description:
2.12.6 is affected by CVE-YYYY-NNNN. 2.12.7.1 is the latest patch on the 2.12
line and is API-compatible per the jackson maintainers' compat notes. We do not
bump to 2.13 / 2.14 here to keep Hive-on-Tez compatibility unchanged.
Verification:
mvn clean install -DskipTests -Phadoop28
mvn -pl tez-dag test -Dtest=TestDAGImpl
mvn -pl tez-runtime-library test -Dtest=TestShuffleManager
Step 4 — Why "within the compat band" matters
If you bumped to 2.14, you would break Hive 3.x users who ship 2.13. A 2.12.6 → 2.12.7.1 bump is a one-line patch. A 2.12 → 2.14 bump is a six-month compatibility argument and lives in Stage 11. Stay on your rung.
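The "within the band" test can be stated as a one-line comparison. The check below is a hypothetical illustration, not a Tez or Maven API, and it assumes the band is defined as "same major.minor line", which is how this example uses the term.

```java
/** Hypothetical check, not a Tez or Maven API: does a bump stay on one major.minor line? */
final class CompatBand {
  /** True when both versions share the same first two components, e.g. 2.12. */
  static boolean sameMinorLine(String from, String to) {
    String[] a = from.split("\\.");
    String[] b = to.split("\\.");
    if (a.length < 2 || b.length < 2) {
      throw new IllegalArgumentException("Expected at least major.minor");
    }
    return a[0].equals(b[0]) && a[1].equals(b[1]);
  }

  public static void main(String[] args) {
    System.out.println(sameMinorLine("2.12.6", "2.12.7.1")); // Stage 2 territory
    System.out.println(sameMinorLine("2.12.6", "2.14.0"));   // Stage 11 territory
  }
}
```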
Walked example C — log-level mismatch
Symptom: a user reports their NodeManager logs are at 100GB/day. Investigation
shows Fetcher is logging every single shuffle fetch at INFO:
LOG.info("Fetcher " + id + " connecting to " + host + ":" + port);
That message fires per attempt per source per fetch. For a 10k-task vertex it is catastrophic.
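A rough estimate shows why. In the sketch below only the 10k-task figure comes from the text; the number of sources per task, the fetches per source, and the bytes per log line are illustrative assumptions, not measurements from a real cluster.

```java
/** Back-of-envelope estimate; every input except the task count is an assumption. */
final class FetchLogVolume {
  /** Number of INFO lines when one line is logged per fetch. */
  static long lines(long consumerTasks, long sourcesPerTask, long fetchesPerSource) {
    return Math.multiplyExact(
        Math.multiplyExact(consumerTasks, sourcesPerTask), fetchesPerSource);
  }

  /** Total log bytes for the given line count and average line length. */
  static long bytes(long lineCount, long bytesPerLine) {
    return Math.multiplyExact(lineCount, bytesPerLine);
  }

  public static void main(String[] args) {
    long n = lines(10_000, 1_000, 1);   // 1,000 sources per task is assumed
    long b = bytes(n, 100);             // ~100 bytes per line is assumed
    System.out.println(n + " lines, " + b + " bytes");
  }
}
```

Under these assumptions a single vertex produces ten million lines, about a gigabyte, from one log statement.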
Diff
--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
@@
- LOG.info("Fetcher " + id + " connecting to " + host + ":" + port);
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("Fetcher {} connecting to {}:{}", id, host, port);
+ }
Rules for INFO → DEBUG demotions:
- The message fires more than once per task attempt → almost always DEBUG.
- The message fires once per DAG lifecycle event (DAG start, vertex committed, task killed by user) → keep at INFO.
- The message fires per exception → keep at WARN or ERROR per the existing level, never demote silently.
- Never demote a log without dev@ confirmation if the message references a contract event (state transition, container release). Operators rely on those for postmortems.
The Fetcher example is uncontroversial; a LOG.info on every state transition in
VertexImpl is not — that would be Stage 4.
Pitfalls
- Don't introduce a logger dependency change in a logging patch. If the file imports org.apache.commons.logging.Log, do not migrate it to slf4j in this patch. That migration is a separate JIRA and a much larger surface area.
- Don't use Throwable.printStackTrace() even in tests. Reviewers will flag it. Use LOG.error("msg", t) instead.
- Don't bump a dep across a major version line in a Stage 2 patch. That is Stage 11.
- Don't run mvn versions:use-latest-releases and submit the resulting diff. The bump must be justified per artifact with the CVE or the bug being fixed.
- Don't remove deprecated Guava calls by adding new Guava calls. The Tez trajectory is off Guava in public code. Replace Preconditions.checkNotNull with Objects.requireNonNull (JDK 7+), not with a different Guava class.
- Don't add a LOG.debug guard around a string literal. LOG.debug("hello") needs no guard.
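The Guava-to-JDK swap in the list above is a drop-in change, since both calls throw NullPointerException on a null argument. A minimal sketch, using a hypothetical class and field name rather than real Tez code:

```java
import java.util.Objects;

/** Hypothetical class; shows the JDK replacement for Guava's Preconditions.checkNotNull. */
final class NullCheckDemo {
  private final String vertexName;

  NullCheckDemo(String vertexName) {
    // Before: this.vertexName = Preconditions.checkNotNull(vertexName, "...");
    this.vertexName = Objects.requireNonNull(vertexName, "vertexName must not be null");
  }

  String vertexName() {
    return vertexName;
  }

  /** Returns the NPE message for a rejected argument, or null if it was accepted. */
  static String messageFor(String name) {
    try {
      new NullCheckDemo(name);
      return null;
    } catch (NullPointerException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(messageFor(null));
  }
}
```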
Exit criteria — when you're ready for the next stage
Move on when:
- You have shipped at least two logging-cleanup patches and one pom dep bump.
- You can explain, without looking it up, when to add LOG.isDebugEnabled() and when not to.
- You have read tez-common/src/main/java/org/apache/tez/common/CallableWithNdc.java and understand the NDC pattern used to attach DAG/Vertex IDs to log messages. That knowledge is the bridge to Stage 3.
- Your last patch was reviewed and merged without a "split this into two JIRAs" comment.
Stage 3 layers on top: you keep the same surgical patch style, but now you make the content of error messages tell the operator which DAG and which vertex.
Appendix — finding logging hygiene candidates yourself
The JIRA filter at the top of this stage may return zero results during quiet periods. When that happens, you can manufacture candidates yourself with three grep patterns that have a high signal-to-noise ratio.
Pattern A — unguarded string concatenation in LOG.debug
cd ~/tez-src
grep -rn 'LOG\.debug(.*+' --include="*.java" tez-dag tez-runtime-internals \
tez-runtime-library | grep -v isDebugEnabled | head -30
This finds calls of the form LOG.debug("got " + counter) that allocate the
concatenated string unconditionally. Pick one, convert it to {} placeholders (adding an if (LOG.isDebugEnabled()) guard only where the arguments are expensive to compute), attach to a JIRA.
Pattern B — LOG.info calls with high call-site frequency
cd ~/tez-src
grep -rn 'LOG\.info' --include="*.java" tez-runtime-library | wc -l
grep -rn 'LOG\.info' --include="*.java" tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle | head
The shuffle path runs per-fetch — any LOG.info there fires hundreds of
thousands of times per DAG. Most are candidates for demotion to DEBUG.
Pattern C — pom files referencing pinned old versions
cd ~/tez-src
grep -rn "<version>" --include="pom.xml" | grep -E "jackson|commons-|guava|netty" \
| grep -v -- "-test" | head -20
Cross-reference against the latest patch release on the package's GitHub releases page. If your match is two patch versions behind and the changelog mentions a security fix, you have a Stage 2 candidate.
The bar for these "self-found" candidates is the same: file a JIRA before
coding, attach a 001 patch, wait for review.
Stage 3 — Error Messages and Exception Context
What this stage teaches
Stage 3 is the first stage where you change behaviour visible to operators in a production postmortem. You learn:
- The CONTEXT rule for tez-dag: every error raised, logged, or rethrown inside the AppMaster must include the DAG ID, and the vertex/task/attempt ID wherever the call site has them in scope.
- How to chain causes correctly: throw new TezException(msg, cause) instead of throw new TezException(msg) then cause.printStackTrace().
- How to find exception sites that swallow the original cause: a catch (Exception e) followed by throw new RuntimeException("init failed") is the canonical bug.
- How NDC (Nested Diagnostic Context, configured in tez-common) propagates IDs into log messages automatically, and how to add explicit IDs where NDC is not set up.
These patches are 20–200 lines, often single-method changes that touch error paths. The reviewer test is brutal but fair: "If this exception fires in a production AM log, can the on-call engineer identify the DAG, vertex, and task without cross-referencing any other log file?" If the answer is "no," the patch is not done.
JIRA filter to find candidates
project = TEZ
AND resolution = Unresolved
AND (text ~ "uninformative error" OR text ~ "missing context"
OR text ~ "swallowed exception" OR text ~ "no DAG id"
OR text ~ "improve error message"
OR (description ~ "InvalidStateTransitonException" AND text ~ "stack trace"))
ORDER BY updated DESC
A second sweep — find your own candidates by grep:
cd ~/tez-src
# error sites that build a message without an ID
grep -rn 'throw new .*Exception(".*failed' tez-dag/src/main/java \
| grep -v "ID\|Id\|getName" | head -30
# catch sites that drop the cause
grep -rn "catch (.*Exception .*)" tez-dag/src/main/java -A 2 \
| grep -B1 "throw new" | grep -v ", e)" | head -30
The second grep is fuzzy; you will get false positives. But every true positive is a Stage 3 patch.
The CONTEXT rule for tez-dag
Every error inside the AppMaster must include enough state to identify which DAG instance on which AM on which application attempt threw it. The minimum fields, listed in priority order:
- The DAG ID (TezDAGID).
- The Vertex ID (TezVertexID), required if the error is in a vertex context.
- The Task ID (TezTaskID), required if in a task context.
- The Task Attempt ID (TezTaskAttemptID), required if in an attempt context.
- The container ID, required for container-management errors.
Each of these IDs is a stable string (toString() returns the canonical form). They
are present on every relevant impl object in tez-dag:
grep -n "getDagId\|getVertexId\|getTaskId\|getTaskAttemptId" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
If you are editing a method on VertexImpl, you have getVertexId() and
getDagId() in scope. If you do not include them in the error, the patch is
incomplete.
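The priority order above can be sketched as a small message builder. This is an illustration only: the class is hypothetical, the IDs are plain strings standing in for TezDAGID and friends, and the example values are made up rather than real Tez ID formats.

```java
/** Hypothetical builder; plain strings stand in for TezDAGID, TezVertexID, etc. */
final class ErrorContext {
  /**
   * Builds the context suffix for an error message. The DAG ID is mandatory;
   * narrower IDs are appended only when the call site has them in scope.
   */
  static String describe(String dagId, String vertexId, String taskId, String attemptId) {
    if (dagId == null) {
      throw new IllegalArgumentException("DAG ID is mandatory in every AM error");
    }
    StringBuilder sb = new StringBuilder("dag=").append(dagId);
    if (vertexId != null) {
      sb.append(", vertex=").append(vertexId);
    }
    if (taskId != null) {
      sb.append(", task=").append(taskId);
    }
    if (attemptId != null) {
      sb.append(", attempt=").append(attemptId);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Made-up ID values, for illustration only.
    System.out.println("Vertex init failed ["
        + describe("dag_example_1", "vertex_example_1", null, null) + "]");
  }
}
```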
Walked example A — uninformative TezException in VertexImpl.maybeSendConfiguredEvent
Symptom: a user reports their DAG fails with:
2026-04-12 10:14:21,003 ERROR [Dispatcher thread] org.apache.tez.dag.app.dag.impl.VertexImpl:
Vertex init failed
org.apache.tez.dag.api.TezException: init failed
at org.apache.tez.dag.app.dag.impl.VertexImpl.maybeSendConfiguredEvent(VertexImpl.java:NNNN)
That error tells the operator nothing. No DAG ID, no vertex name, no cause.
Step 1 — Find the throw site
cd ~/tez-src
grep -n 'throw new TezException("init failed' \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
Read 20 lines of context around the hit. The method has vertexId,
getDagId(), and getName() all in scope.
Step 2 — The diff
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
- } catch (AMUserCodeException e) {
- throw new TezException("init failed");
- }
+ } catch (AMUserCodeException e) {
+ String msg = String.format(
+ "Vertex %s (%s) of DAG %s failed during configured-event dispatch: %s",
+ getName(), vertexId, getDagId(), e.getMessage());
+ LOG.error(msg, e);
+ throw new TezException(msg, e);
+ }
What changed:
- The message now identifies the vertex name (human-readable), the vertex ID (machine-stable), and the DAG ID.
- The original exception is chained via the two-argument TezException constructor. The full stack trace survives.
- The error is also logged at ERROR with the cause. Belt and braces: some callers swallow the exception silently, and the log line is the only record that survives.
- String.format is used so the placeholders are visually aligned with the field names. Reviewers prefer it over +-concatenation when the message has more than three substitutions.
Step 3 — Regression test
Add to tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java:
@Test(timeout = 5000)
public void testInitFailureMessageIncludesIds() throws Exception {
VertexImpl v = createVertexThatFailsInConfigured(); // existing helper pattern
try {
v.maybeSendConfiguredEvent();
fail("expected TezException");
} catch (TezException e) {
assertTrue("message should contain vertex id",
e.getMessage().contains(v.getVertexId().toString()));
assertTrue("message should contain dag id",
e.getMessage().contains(v.getDagId().toString()));
assertNotNull("cause should be preserved", e.getCause());
}
}
The test asserts on substring presence, not exact string equality. Reviewers reject exact-string assertions because they break the next time the message is rephrased.
Step 4 — Run targeted tests
cd ~/tez-src
mvn -pl tez-dag test -Dtest=TestVertexImpl -q 2>&1 | tail -40
The full TestVertexImpl suite takes 3–5 minutes on a laptop. Run it. A
state-machine-adjacent change always risks breaking a sibling transition.
Walked example B — swallowed cause in DAGAppMaster.startDAG
Find the bug:
cd ~/tez-src
grep -rn "catch (.*Exception" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java \
-A 3 | grep -B1 "throw new" | head -20
Suppose the offender looks like:
try {
initServices();
} catch (Exception e) {
throw new TezUncheckedException("Failed to start AM");
}
The diff:
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
@@
try {
initServices();
} catch (Exception e) {
- throw new TezUncheckedException("Failed to start AM");
+ throw new TezUncheckedException(
+ "Failed to start AM for application " + appAttemptID + ": "
+ + e.getMessage(), e);
}
Two fixes at once: the cause is preserved (the second constructor argument), and
the message now includes the appAttemptID which the surrounding DAGAppMaster
has in scope. This patch is small but high-leverage: the AM startup path is the
single most common place a swallowed cause hides a real configuration bug.
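The difference between the two forms is observable on the exception object itself. The sketch below uses JDK exception types as stand-ins for TezUncheckedException and a made-up attempt ID, so it runs without Tez on the classpath.

```java
import java.io.IOException;

/** JDK-only illustration; IllegalStateException stands in for TezUncheckedException. */
final class CauseChainDemo {
  /** The buggy form: the original exception is dropped. */
  static RuntimeException swallowed(Exception cause) {
    return new IllegalStateException("Failed to start AM");
  }

  /** The fixed form: context in the message, original exception as the cause. */
  static RuntimeException chained(String appAttemptId, Exception cause) {
    return new IllegalStateException(
        "Failed to start AM for application " + appAttemptId + ": "
            + cause.getMessage(), cause);
  }

  public static void main(String[] args) {
    Exception root = new IOException("bad config");
    System.out.println(swallowed(root).getCause());            // null: root is lost
    System.out.println(chained("attempt_example_1", root).getCause());
  }
}
```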
Walked example C — log-only context via NDC
Some hot paths cannot afford a String.format per call. The Tez convention there
is NDC. Look in tez-common/src/main/java/org/apache/tez/common/CallableWithNdc.java:
cat $(find ~/tez-src/tez-common/src/main/java -name "CallableWithNdc.java")
When the dispatcher invokes a vertex transition callback, it pushes the vertex ID
onto the NDC stack. log4j's %x conversion pattern then includes the ID in every log
line for the duration of the call. If you discover a log message in VertexImpl
that lacks the vertex ID, first check whether NDC already provides it via the
log pattern. If yes, the message is fine; if no, add the ID inline. Submitting a
patch that adds an explicit ID where NDC already prints it is a reviewer-rejected
patch.
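The push-call-pop shape is easy to see in a toy version. The class below is a simplified stand-in, not log4j's NDC and not Tez's CallableWithNdc: it keeps a per-thread stack and prefixes messages with only the top entry, whereas a real NDC layout prints the whole stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Toy per-thread context stack; a stand-in for NDC, not the log4j API. */
final class MiniNdc {
  private static final ThreadLocal<Deque<String>> STACK =
      ThreadLocal.withInitial(ArrayDeque::new);

  /** Runs body with id pushed on the context stack, popping it afterwards. */
  static <T> T callWith(String id, Supplier<T> body) {
    STACK.get().push(id);
    try {
      return body.get();
    } finally {
      STACK.get().pop(); // always restore the previous context
    }
  }

  /** Prefixes the message with the current context, if any. */
  static String format(String message) {
    String top = STACK.get().peek();
    return top == null ? message : "[" + top + "] " + message;
  }

  public static void main(String[] args) {
    System.out.println(callWith("vertex_example_1", () -> format("transition started")));
    System.out.println(format("outside any context"));
  }
}
```

Because the ID is attached by the caller, the code inside the callback never has to mention it, which is why adding it inline there would be redundant.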
Pitfalls
- Don't include e.getStackTrace() in your message. The stack trace is what LOG.error(msg, e) is for. Concatenating it into the message turns a one-line log into a 60-line one.
- Don't use e.toString() in messages. Use e.getMessage() so the message stays single-line; the stack trace lives in the chained throwable.
- Don't catch Throwable to add context. Catching Throwable swallows OutOfMemoryError and ThreadDeath. Catch Exception (or the narrowest superclass that fits).
- Don't add context that requires a lock. A getName() call that internally takes the vertex write-lock is a deadlock waiting to happen if the error path itself holds the lock. Always check the lock semantics of the getter you call in an error path.
- Don't change the exception type to add context. throw new TezException is still a TezException after your patch; changing it to TezUncheckedException is a behavior change and not allowed in Stage 3.
- Don't add context that includes user data without redaction. If your error message includes a configuration value, check whether it could contain credentials. The Tez convention is to print the key, not the value, when the key matches .*\.(password|secret|token|credential).
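The redaction rule in the last pitfall can be sketched with the quoted pattern. The class and the example keys below are hypothetical; the regex is copied from the text, and whether Tez applies exactly this expression in code is not verified here.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; the regex is quoted from the text, the keys are made up. */
final class ConfigRedactor {
  private static final Pattern SENSITIVE =
      Pattern.compile(".*\\.(password|secret|token|credential)");

  /** Renders key=value for an error message, hiding values of sensitive keys. */
  static String render(String key, String value) {
    return SENSITIVE.matcher(key).matches()
        ? key + "=<redacted>"
        : key + "=" + value;
  }

  public static void main(String[] args) {
    System.out.println(render("example.store.password", "hunter2"));
    System.out.println(render("tez.am.resource.memory.mb", "1024"));
  }
}
```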
Exit criteria — when you're ready for the next stage
Move to Stage 4 when:
- You have shipped at least one error-context patch in tez-dag and one in tez-runtime-library that includes the DAG and vertex/task IDs.
- A reviewer has accepted your test pattern (substring assertion, no exact-string match) without a comment.
- You can find at least three more candidate error sites in five minutes of grepping without referring to this chapter.
- You have read VertexImpl.maybeSendConfiguredEvent and the surrounding 200 lines without feeling lost. That file is the gateway to Stage 4.
Stage 4 will take you inside the state machines themselves.
Stage 4 — State Machine Transitions
What this stage teaches
Stage 4 is the first stage that requires you to understand the Tez AppMaster, not just navigate it. You learn:
- The StateMachineFactory DSL used in Hadoop / Tez to declare finite state machines. The two canonical instances are VertexImpl.stateMachineFactory and TaskImpl.stateMachineFactory.
- The InvalidStateTransitonException (note the historical typo, "Transiton" rather than "Transition", preserved for API compatibility) that the state machine throws when an event arrives in a state with no registered transition.
- How to add a transition with the right guard, without widening the surface area of the state machine accidentally.
- The hard rule: never widen a transition without a dev@ design discussion. Adding a transition from RUNNING to KILLED on a new event class is a semantic change that may cascade to ATS, the client, and the speculator.
- The TestVertexImpl and TestTaskImpl patterns for asserting that an event in a state produces an expected next state.
Patches are typically 30–250 lines: a transition table entry, a small guard helper, a fired event, and a deterministic regression test.
Reading order before you touch any code
- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java: read the static stateMachineFactory block end to end. It is several hundred lines of .addTransition(...) calls. Diagram it on paper.
- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java: same exercise for tasks.
- tez-common/src/main/java/org/apache/tez/state/StateMachineTez.java: the wrapper Tez puts around the Hadoop state machine.
- The deep dives state-machines and vertex-lifecycle. Do not skip these.
Then, and only then, file a JIRA.
JIRA filter to find candidates
The most fruitful filter:
project = TEZ
AND resolution = Unresolved
AND (text ~ "InvalidStateTransitonException" OR text ~ "Invalid event"
OR text ~ "missing transition" OR description ~ "stateMachineFactory")
ORDER BY updated DESC
A second filter for postmortem-style tickets:
project = TEZ AND status = Open AND component in ("tez-dag")
AND priority in (Major, Critical) AND (text ~ "VertexState" OR text ~ "TaskState")
Most real Stage 4 work comes from operator reports of an AM that crashed with
InvalidStateTransitonException: Invalid event X on Y in state Z. That stack
trace is the smoking gun: state Z received event X and had no registered
handler. The fix is one of:
- Add the transition with a guard (most common).
- Suppress the event in that state because it is a benign late delivery (use addTransition(state, state, event), a self-loop).
- Fix the sender not to emit the event in that state (sometimes the bug is upstream).
Choosing wrong is the most common Stage 4 mistake. Pick the third option (fixing the sender) only if you can prove the event should never have been emitted.
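A toy transition table shows the failure and the self-loop remedy side by side. The class below is a stand-in written for this illustration, not the Hadoop StateMachineFactory API: it has no transition hooks, and it throws IllegalStateException where the real state machine throws InvalidStateTransitonException.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy transition table; a stand-in for StateMachineFactory, not its API. */
final class MiniStateMachine {
  enum State { NEW, INITED, RUNNING }
  enum Event { V_INIT, V_START }

  private final Map<String, State> table = new HashMap<>();
  private State current = State.NEW;

  /** Registers a transition; from == to registers a self-loop. */
  MiniStateMachine addTransition(State from, State to, Event on) {
    table.put(from + "/" + on, to);
    return this;
  }

  /** Applies the event, or fails if the current state has no entry for it. */
  State handle(Event e) {
    State next = table.get(current + "/" + e);
    if (next == null) {
      throw new IllegalStateException("Invalid event: " + e + " at " + current);
    }
    current = next;
    return current;
  }

  /** The two-transition table used by both scenarios below. */
  static MiniStateMachine base() {
    return new MiniStateMachine()
        .addTransition(State.NEW, State.INITED, Event.V_INIT)
        .addTransition(State.INITED, State.RUNNING, Event.V_START);
  }

  /** Delivers V_INIT twice and reports the resulting state or the error. */
  static String deliverDuplicateInit(MiniStateMachine sm) {
    sm.handle(Event.V_INIT);
    try {
      return sm.handle(Event.V_INIT).toString();
    } catch (IllegalStateException ex) {
      return ex.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(deliverDuplicateInit(base()));
    System.out.println(deliverDuplicateInit(
        base().addTransition(State.INITED, State.INITED, Event.V_INIT)));
  }
}
```

The self-loop changes nothing about which states are reachable; it only turns an unregistered event into a no-op, which is why it counts as the narrow fix.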
Walked example — missing V_INIT transition in VertexState.NEW
Symptom: an operator reports a recurring AM crash:
InvalidStateTransitonException: Invalid event: V_INIT at INITED
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(...)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:NNNN)
V_INIT is normally delivered once, while the vertex is in NEW, so a rejected
V_INIT is suspicious. Investigation reveals the transition is registered for the
common path, but a recently-added early-error path emits V_INIT from a
different thread before the main scheduler does, and the second V_INIT arrives
after the vertex has already left NEW.
Step 1 — Read the existing transitions
cd ~/tez-src
grep -n "addTransition(VertexState.NEW" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -20
You will see something like (simplified):
.addTransition(VertexState.NEW, VertexState.INITED,
VertexEventType.V_INIT, new InitTransition())
.addTransition(VertexState.NEW, VertexState.FAILED,
VertexEventType.V_TERMINATE, new TerminateNewVertexTransition())
V_INIT on NEW is registered. So the crash means the vertex was not in
NEW when the second V_INIT arrived — it was somewhere else, perhaps
INITED. Re-grep:
grep -n "addTransition(VertexState.INITED" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | grep "V_INIT"
No hit. That is the bug: V_INIT arriving in INITED is unhandled.
Step 2 — Decide: add, ignore, or fix upstream
V_INIT in INITED is a duplicate event. It is benign (the vertex is already
initialised; the second message is redundant). The correct fix is to ignore
the duplicate — a self-loop. This is the safe, narrow change.
We are not widening behaviour. We are saying: "in INITED, a redundant
V_INIT is a no-op, not a crash."
Step 3 — The diff
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
.addTransition(VertexState.INITED, VertexState.RUNNING,
VertexEventType.V_START, new StartTransition())
+
+ // A duplicate V_INIT can arrive when an early error path fires V_INIT
+ // concurrently with the scheduler. The vertex is already initialised;
+ // ignore the duplicate rather than crashing the AM. See TEZ-XXXX.
+ .addTransition(VertexState.INITED, VertexState.INITED,
+ VertexEventType.V_INIT, VERTEX_STATE_CHANGED_CALLBACK_NOOP)
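Here VERTEX_STATE_CHANGED_CALLBACK_NOOP is either a constant SingleArcTransition
that does nothing (a transition with a single target state takes a
SingleArcTransition, not a MultipleArcTransition) or, more idiomatically, an
instance of a small inner class: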
private static class IgnoreEventTransition
implements SingleArcTransition<VertexImpl, VertexEvent> {
@Override
public void transition(VertexImpl vertex, VertexEvent event) {
LOG.debug("Ignoring duplicate {} on vertex {} in state {}",
event.getType(), vertex.getVertexId(), vertex.getState());
}
}
Two rules in this diff:
- The transition has a comment with the JIRA ID explaining why the self-loop exists. State-machine entries without comments are hard to remove safely two years later.
- The transition logs at DEBUG, not INFO. If the duplicate event is actually a symptom of a larger bug upstream, the debug log is what tells the operator.
Step 4 — Regression test in TestVertexImpl
@Test(timeout = 10000)
public void testDuplicateVInitInInitedIsNoOp() throws Exception {
initAllVertices(VertexState.INITED); // existing helper
VertexImpl v = vertices.get("vertex1");
assertEquals(VertexState.INITED, v.getState());
// Fire a second V_INIT — must not throw, must not change state
v.handle(new VertexEvent(v.getVertexId(), VertexEventType.V_INIT));
dispatcher.await();
assertEquals("duplicate V_INIT should leave INITED unchanged",
VertexState.INITED, v.getState());
}
The test pattern:
- Use the existing initAllVertices(VertexState.INITED) helper. Do not invent your own bootstrap.
- Always call dispatcher.await() after v.handle(...). TestVertexImpl uses DrainDispatcher, which is the only way to make event-driven tests deterministic.
- Assert the post state. Never assert on internal counters unless the transition is supposed to change them.
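The reason await() makes such tests deterministic can be shown with a small stand-in. The sketch below is not Tez's DrainDispatcher; ToyDrainDispatcher and its methods are hypothetical names, and the design (a single worker thread plus a pending counter) is only one plausible way to get drain semantics.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Toy asynchronous dispatcher with a blocking drain; NOT Tez's DrainDispatcher. */
class ToyDrainDispatcher<E> implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final AtomicInteger pending = new AtomicInteger();
    private final Consumer<E> handler;

    ToyDrainDispatcher(Consumer<E> handler) { this.handler = handler; }

    /** Queue an event; it is handled later on the worker thread. */
    void dispatch(E event) {
        pending.incrementAndGet();
        worker.execute(() -> {
            try {
                handler.accept(event);
            } finally {
                pending.decrementAndGet();
            }
        });
    }

    /** Block until every queued event, including ones queued by handlers, is done. */
    void await() {
        try {
            do {
                // FIFO worker: this no-op finishes only after everything queued before it.
                worker.submit(() -> { }).get();
            } while (pending.get() > 0);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    @Override
    public void close() { worker.shutdownNow(); }

    public static void main(String[] args) {
        List<String> seen = new CopyOnWriteArrayList<>();
        try (ToyDrainDispatcher<String> d = new ToyDrainDispatcher<>(seen::add)) {
            d.dispatch("V_INIT");
            d.dispatch("V_START");
            d.await();                 // without this, the assertion below would race
            System.out.println(seen);  // both events have been handled by now
        }
    }
}
```

A test that asserts on state immediately after handle(...) is racing the worker thread; draining first removes the race without any sleep.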
Run it:
cd ~/tez-src
mvn -pl tez-dag test -Dtest=TestVertexImpl#testDuplicateVInitInInitedIsNoOp -q 2>&1 | tail -30
Then run the whole TestVertexImpl suite. A single transition addition has
broken a sibling test more than once in Tez history.
Step 5 — dev@ notification
Before you post the patch:
Subject: [DISCUSS] TEZ-XXXX — add INITED -> INITED self-loop for V_INIT
I have a repro for a recurring AM crash where V_INIT arrives twice. The state
machine currently has no INITED+V_INIT entry. Proposed fix: self-loop with a
debug log. Sender side (early-error path) is left unchanged on the grounds
that defensive handling in the state machine is cheaper than chasing every
sender. Would appreciate a sanity check before I post the patch.
If a committer replies "actually the sender is the bug, fix that instead," you revise your approach. If silence for 48 hours, post the patch.
The "never widen without dev@" rule
What counts as widening:
- Adding a transition from a non-terminal state to a terminal state on a new event. Example: RUNNING -> KILLED on V_USER_REQUEST_FORCE_KILL.
- Adding a transition that changes a previously-rejected event into an accepted one with side effects (counters updated, downstream events emitted).
- Removing a transition.
What is not widening:
- Adding a self-loop that ignores a duplicate event (as above).
- Adding a transition that converts an InvalidStateTransitonException into a controlled ERROR transition, when the event was clearly a fatal-bug signal.
The dev@ rule exists because state machines are observed externally: the AM emits state-changed events to ATS, the client poll loop watches them, the speculator reads them. Adding a transition is an API change for those observers, even if no Java type signature changes.
Pitfalls
- Don't add transitions to fix symptoms. If you see InvalidStateTransitonException and the cause is "the sender shouldn't have emitted that event," fix the sender. Adding a transition to silence the exception hides the real bug.
- Don't forget the regression test. Every transition patch must have a test that fires the event in the state and asserts the result. Tests using DrainDispatcher are the only ones reviewers accept.
- Don't use Mockito.spy on VertexImpl. The state machine has private internal state that spies cannot reach reliably. Use the production class with the test helpers in TestVertexImpl and MockDAGAppMaster.
- Don't change the transition() callback signature. Existing transitions use SingleArcTransition or MultipleArcTransition. Pick the matching one; do not introduce a new interface.
- Don't ignore the typo. InvalidStateTransitonException (no second "i") is the canonical name in Hadoop. If you "fix" the typo in Tez code, you break binary compatibility with downstream callers that catch the exception by name.
- Don't bundle a transition fix with an unrelated cleanup. Reviewers will ask you to split.
Exit criteria — when you're ready for the next stage
Move to Stage 5 when:
- You have shipped one transition fix in VertexImpl or TaskImpl with a passing regression test in the corresponding Test* class.
- You can draw the VertexImpl state diagram from memory (8 states, the main transitions, the terminal set).
- You have read TaskAttemptImpl.stateMachineFactory in full and recognise the similarities and differences to VertexImpl.
- A committer has reviewed your transition patch and accepted the addition without asking for a dev@ design thread, meaning your choice of "ignore vs add vs fix sender" was correct.
Stage 5 takes you out of the AM event loop and into the scheduler.
Stage 5 — Scheduler Bugs
What this stage teaches
Stage 5 takes you out of the per-vertex event loop and into the AM-wide scheduling layer. You learn:
- The split between TaskSchedulerManager (the multi-scheduler dispatch shim) and the concrete YarnTaskSchedulerService (the AMRMClient-backed scheduler used in production), plus the alternative LocalTaskSchedulerService used by local mode and tests.
- How container requests, allocations, and releases flow through AMRMClient, including the heldContainer lifecycle and the canonical leak: a held container that is never returned to YARN after an onError callback fires.
- Locality miscounts: the bookkeeping mistake where a node-local allocation is charged as rack-local in getAvailableContainers, distorting the affinity signal sent back to the AMRM protocol.
- Priority inversion: a high-priority request stuck behind a low-priority pending list because the request was added to the wrong queue.
- Container behaviour across AM failover: when the AM restarts with tez.am.am-rm.heartbeat.interval-ms retries, what should and should not be re-claimed.
- How to write a MiniTezCluster-backed integration test, and when the cheaper AMRMClient stub pattern is sufficient.
Patches are 50–500 lines, often with a non-trivial test that needs
MiniTezCluster or MiniYARNCluster. Reviewers are strict: a scheduler patch
without a deterministic test is rejected on sight.
Reading order
- tez-dag/src/main/java/org/apache/tez/dag/app/rm/TaskSchedulerManager.java
- tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
- tez-dag/src/main/java/org/apache/tez/dag/app/rm/container/AMContainerImpl.java
- tez-dag/src/test/java/org/apache/tez/dag/app/rm/TestTaskSchedulerManager.java
- The deep dive scheduler.
cd ~/tez-src
wc -l tez-dag/src/main/java/org/apache/tez/dag/app/rm/*.java
If YarnTaskSchedulerService.java is over 2000 lines, that is expected.
JIRA filter to find candidates
project = TEZ
AND component in ("tez-dag")
AND resolution = Unresolved
AND (text ~ "container leak" OR text ~ "scheduler" OR text ~ "locality"
OR text ~ "priority" OR text ~ "AMRMClient" OR text ~ "heldContainer"
OR description ~ "onError")
ORDER BY priority DESC, updated DESC
A second filter for AM-failover-related candidates:
project = TEZ AND resolution = Unresolved AND (text ~ "failover" OR text ~ "AM restart")
AND component in ("tez-dag")
Walked example A — heldContainer never released after onError
Symptom: an operator reports their long-running session AM holds onto containers
indefinitely after a transient RM disconnect. yarn application -status shows
allocated containers far above what the running DAG should need.
Step 1 — Locate the leak path
cd ~/tez-src
grep -n "onError\|heldContainer\|releaseContainer" \
tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java | head -30
You find a class field:
private final Map<ContainerId, HeldContainer> heldContainers = new HashMap<>();
and an onError(Throwable t) callback (inherited from AMRMClientAsync.CallbackHandler):
@Override
public void onError(Throwable t) {
LOG.error("AMRMClient error", t);
appContext.getEventHandler().handle(
new DAGAppMasterEventSchedulingServiceError(t));
}
The bug: heldContainers is populated by onContainersAllocated but never
drained in onError. When the AM recovers and the RM reissues the same
container IDs, the map already has stale entries, and the new allocations are
silently dropped (the bookkeeping path checks heldContainers.containsKey(id)).
The containers are effectively leaked.
Step 2 — Diff
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
@@
@Override
public void onError(Throwable t) {
LOG.error("AMRMClient error", t);
+ // Before we tear down, release any containers we still hold. If we don't,
+ // a recovering RM will re-issue the same ContainerIds and the dedup
+ // bookkeeping below will silently drop the new allocations. See TEZ-XXXX.
+ synchronized (heldContainers) {
+ for (HeldContainer hc : heldContainers.values()) {
+ try {
+ amRmClient.releaseAssignedContainer(hc.getContainer().getId());
+ } catch (Exception releaseErr) {
+ LOG.warn("Failed to release {} during onError cleanup: {}",
+ hc.getContainer().getId(), releaseErr.getMessage());
+ }
+ }
+ heldContainers.clear();
+ }
appContext.getEventHandler().handle(
new DAGAppMasterEventSchedulingServiceError(t));
}
Rules in this diff:
- The cleanup runs before the event is dispatched. Once the event fires, the AM may shut down handlers, and any release call would race.
- The cleanup is synchronized on the same monitor that other writers to heldContainers use. Find that monitor first; if there is none, you have a second bug to file separately. Do not introduce a new lock in this patch.
- Each release is wrapped individually. One failure must not prevent the others from being released.
- Logged failures are WARN, not ERROR. The AM is already in an error path; doubling the severity drowns the originating cause.
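The "wrap each release individually, then clear" rule can be exercised without YARN. The sketch below is a stand-in under stated assumptions: ToyHeldContainers is a hypothetical class, container IDs are plain strings, and the Consumer passed to the constructor plays the role of the RM client's release call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy bookkeeping for held containers; the releaser stands in for the RM client. */
class ToyHeldContainers {
    private final Map<String, String> held = new HashMap<>(); // container id -> host
    private final Consumer<String> releaser;

    ToyHeldContainers(Consumer<String> releaser) { this.releaser = releaser; }

    void hold(String containerId, String host) {
        synchronized (held) {
            held.put(containerId, host);
        }
    }

    /** Releases everything still held and returns the ids whose release failed. */
    List<String> releaseAll() {
        List<String> failed = new ArrayList<>();
        synchronized (held) {                 // same monitor the writers use
            for (String id : held.keySet()) {
                try {
                    releaser.accept(id);      // one failure must not stop the rest
                } catch (RuntimeException e) {
                    failed.add(id);
                }
            }
            held.clear();                     // no stale entries survive the error path
        }
        return failed;
    }

    int size() {
        synchronized (held) {
            return held.size();
        }
    }
}
```

With a releaser that throws for one ID, the other IDs are still released and the map ends up empty, which is the property the regression test for the real patch should pin down.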
Step 3 — Test with AMRMClient stub
A full MiniTezCluster test for this is overkill. Stub the client:
@Test(timeout = 10000)
public void testOnErrorReleasesHeldContainers() throws Exception {
AMRMClientAsync<CookieContainerRequest> mockRm =
mock(AMRMClientAsync.class);
YarnTaskSchedulerService scheduler =
new YarnTaskSchedulerService(mockAppCallbackHandler, appContext, mockRm);
scheduler.serviceInit(new Configuration());
scheduler.serviceStart();
// simulate two allocations
Container c1 = newContainer("container_1");
Container c2 = newContainer("container_2");
scheduler.onContainersAllocated(Arrays.asList(c1, c2));
// fire onError
scheduler.onError(new RuntimeException("RM gone"));
// verify both were released
verify(mockRm).releaseAssignedContainer(c1.getId());
verify(mockRm).releaseAssignedContainer(c2.getId());
assertTrue(scheduler.getHeldContainersForTest().isEmpty());
}
The pattern uses Mockito on the AMRM client interface, not on the
YarnTaskSchedulerService itself. getHeldContainersForTest() is a
package-private accessor you add in the same patch with a // VisibleForTesting
comment.
Step 4 — Build, test, sign off
cd ~/tez-src
mvn -pl tez-dag test -Dtest=TestYarnTaskSchedulerService -q 2>&1 | tail -40
mvn -pl tez-tests test -Dtest=TestExternalTezServices -q 2>&1 | tail -10
The integration test (tez-tests) takes 5–10 minutes; skip it on the first
local iteration but run it before the patch submission.
Walked example B — locality miscount
Symptom: a debug log shows node-local: 4, rack-local: 12, off-switch: 0 for a
vertex whose input splits should give 14 node-local containers. The bookkeeping
is off.
Locating the counter
cd ~/tez-src
grep -n "nodeLocal\|rackLocal\|offSwitch" \
tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java | head -20
You find an assignContainer(...) path that compares the allocated host against
the request's preferred host. The bug: the comparison is host.equals(req.host),
but host arrives as the FQDN node-1.cluster.local while req.host is the short
name node-1. The exact-match comparison fails, the allocation is miscounted as
rack-local, and the affinity penalty cascades into the next request.
Diff
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
@@
- if (host.equals(request.getHosts()[0])) {
+ // Hosts may be reported as FQDNs by the RM but as short names by the
+ // caller-supplied hint. Compare on the leading label to keep both forms
+ // equivalent. See TEZ-XXXX.
+ if (hostMatches(host, request.getHosts()[0])) {
nodeLocalCount.incrementAndGet();
} else if (rackOf(host).equals(rackOf(request.getHosts()[0]))) {
rackLocalCount.incrementAndGet();
} else {
offSwitchCount.incrementAndGet();
}
}
+
+ static boolean hostMatches(String a, String b) {
+ if (a == null || b == null) return false;
+ return a.equals(b)
+ || leadingLabel(a).equals(leadingLabel(b));
+ }
+
+ private static String leadingLabel(String h) {
+ int dot = h.indexOf('.');
+ return dot < 0 ? h : h.substring(0, dot);
+ }
The accompanying test asserts the counter under both FQDN and short-name forms.
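A standalone version of the comparison makes it easy to check both forms, and also exposes a limit of the leading-label rule. The class name HostMatch is hypothetical; the two methods mirror the helper shapes in the diff above.

```java
/** Standalone copy of the leading-label comparison, for experimentation only. */
class HostMatch {
    static boolean hostMatches(String a, String b) {
        if (a == null || b == null) {
            return false;
        }
        return a.equals(b) || leadingLabel(a).equals(leadingLabel(b));
    }

    static String leadingLabel(String host) {
        int dot = host.indexOf('.');
        return dot < 0 ? host : host.substring(0, dot);
    }

    public static void main(String[] args) {
        // FQDN vs short name: the case the patch is meant to fix.
        System.out.println(hostMatches("node-1.cluster.local", "node-1"));
        // Different hosts stay different.
        System.out.println(hostMatches("node-1.cluster.local", "node-2"));
        // Caveat: dotted IPv4 literals share the leading label "10".
        System.out.println(hostMatches("10.0.0.1", "10.0.0.2"));
    }
}
```

The third call returns true, which is almost certainly not what a scheduler wants. If hosts can be reported as IP addresses, a real patch would need to exclude IP literals from the leading-label fallback, and the unit test should include that case.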
Walked example C — priority inversion
Symptom: a high-priority request (priority 5, AM speculation) waits indefinitely behind a long queue of priority-10 requests, even though the scheduler has capacity.
Root cause: the request was added to the queue keyed by priority string, not
priority int. "10" sorts before "5" in string ordering, so the lower-priority
queue is served first. The fix is to use an Integer key or a TreeMap with a
numeric comparator. The diff and test follow the same pattern as above; the
file is
tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
near the requestsByPriority field.
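The ordering difference between string keys and integer keys is easy to see in isolation. The sketch below uses only java.util.TreeMap; the class and method names are hypothetical, and it assumes the YARN convention that a lower number means a higher priority.

```java
import java.util.TreeMap;

/** Shows how a sorted map orders priorities when keyed by String vs Integer. */
class PriorityKeyOrdering {
    /** First key when priorities are stored as strings (lexicographic order). */
    static String firstByString(int... priorities) {
        TreeMap<String, Integer> byString = new TreeMap<>();
        for (int p : priorities) {
            byString.put(String.valueOf(p), p);
        }
        return byString.firstKey();
    }

    /** First key when priorities are stored as integers (numeric order). */
    static int firstByInt(int... priorities) {
        TreeMap<Integer, Integer> byInt = new TreeMap<>();
        for (int p : priorities) {
            byInt.put(p, p);
        }
        return byInt.firstKey();
    }

    public static void main(String[] args) {
        System.out.println(firstByString(5, 10)); // prints 10: '1' < '5' character by character
        System.out.println(firstByInt(5, 10));    // prints 5: numeric comparison
    }
}
```

Single-digit priorities never expose the problem, which is why a bug of this shape can survive until someone introduces a two-digit priority.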
MiniTezCluster pattern
For bugs that only manifest end-to-end:
cd ~/tez-src
find tez-tests/src/test/java -name "TestMRRJobsDAGApi.java"
That file is the canonical worked example. The setup pattern:
private static MiniTezCluster tezCluster;
@BeforeClass
public static void setup() throws Exception {
Configuration conf = new Configuration();
tezCluster = new MiniTezCluster("TEZ-XXXX", 1, 1, 1);
tezCluster.init(conf);
tezCluster.start();
}
@AfterClass
public static void teardown() {
if (tezCluster != null) {
tezCluster.stop();
}
}
Tests should:
- Submit a small DAG (an OrderedWordCount derivative is fine).
- Assert on DAGStatus and VertexStatus via the client.
- Set tight tez.am.am-rm.heartbeat.interval-ms and tez.task.am.heartbeat.interval-ms overrides so retries fire quickly.
A MiniTezCluster test takes 30s+ per run; do not add more than one per JIRA.
Pitfalls
- Don't mock the AppContext or the EventHandler if you can avoid it. Scheduler bugs often live in the handoff between scheduler and dispatcher. Mocking the dispatcher hides the bug.
- Don't add Thread.sleep to scheduler tests. Use DrainDispatcher.await() or poll the scheduler's getHeldContainers() view with a timeout.
- Don't introduce a new lock to fix a race. Most scheduler races are fixed by moving an existing line inside an existing synchronized block. Adding a new lock is a Stage 11 patch.
- Don't change the AMRM heartbeat interval to make a test pass. That hides timing bugs that bite in production. Use the existing test helpers that drive the heartbeat synchronously.
- Don't release containers in onContainersCompleted to "be safe". Hadoop's AMRMClient documentation forbids that; the container is already released by the RM, and a second release fires a confusing log line.
- Don't fix a locality miscount by changing the comparison everywhere. The bug is usually a single inconsistency. Pin it down with a focused unit test before broadening the change.
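"Poll with a timeout" differs from a bare sleep in that the test proceeds as soon as the condition holds and fails with a clear signal when it never does. The helper below is a minimal sketch; PollUntil.waitFor is a hypothetical name, not an existing Tez or Hadoop utility.

```java
import java.util.function.BooleanSupplier;

/** Bounded polling helper for tests; a hypothetical utility, not a Tez API. */
class PollUntil {
    /**
     * Re-checks the condition every intervalMs until it holds or timeoutMs elapses.
     * Returns true if the condition held, false on timeout.
     */
    static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;                       // timed out: let the test fail loudly
            }
            try {
                Thread.sleep(intervalMs);           // short, bounded pause between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

A test would wrap the call in an assertion, for example asserting that waitFor(() -> view.isEmpty(), 10, 5000) returns true, where view is whatever read-only accessor the scheduler under test exposes.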
Exit criteria — when you're ready for the next stage
Move to Stage 6 when:
- You have shipped one scheduler patch with a passing MiniTezCluster or AMRM stub regression test.
- You can read YarnTaskSchedulerService.assignContainer without referring to external docs.
- You have written a MiniTezCluster test from scratch and it runs locally in under a minute.
- You can explain the heldContainer lifecycle to another contributor in five sentences.
Stage 6 moves you into the runtime: ShuffleManager, Fetcher, MergeManager.
Stage 6 — Shuffle and Runtime
What this stage teaches
Stage 6 is the runtime stage. You learn:
- The shuffle pipeline: how ShuffleManager schedules Fetcher threads against the upstream task outputs, how MergeManager consolidates fetched segments, and how the result is presented to the downstream processor as a KeyValuesReader.
- The on-disk IFile format and the off-by-one EOF bugs that haunt every serialiser written against it.
- FetchedInput and the in-memory vs on-disk decision: how tez.runtime.shuffle.memory.limit.percent interacts with MergeManager.canShuffleToMemory.
- Fetch-failure retry storms: when a single bad NodeManager triggers cascading fetcher restarts that swamp the AM event queue.
- How to inject deterministic faults using the FaultInjectionFetcher pattern (or, where it does not exist, the equivalent test double).
Patches are 80–600 lines and almost always come with a MiniTezCluster test
because the runtime contracts are too subtle for unit tests alone.
Reading order
- tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java
- tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
- tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/MergeManager.java
- tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
- The deep dive shuffle-sort.
cd ~/tez-src
wc -l tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/*.java
wc -l tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
JIRA filter to find candidates
project = TEZ
AND component in ("tez-runtime-library", "tez-runtime-internals")
AND resolution = Unresolved
AND (text ~ "shuffle" OR text ~ "fetcher" OR text ~ "MergeManager"
OR text ~ "IFile" OR text ~ "FetchedInput" OR text ~ "spill")
ORDER BY priority DESC, updated DESC
A second filter for fetch-failure storms specifically:
project = TEZ AND text ~ "fetch failure" AND text ~ "retry" AND resolution = Unresolved
Walked example A — fetch-failure retry storm
Symptom: a 5k-task vertex runs on a cluster where one NodeManager goes bad.
Within minutes the AM logs are flooded with INPUT_READ_ERROR events. The DAG
eventually succeeds but takes hours instead of minutes. The AM event queue
backs up to 100k+ pending events.
Step 1 — Trace the path
cd ~/tez-src
grep -n "INPUT_READ_ERROR\|reportReadError\|fetchFailures" \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java
You find ShuffleManager.reportReadError(...) which fires a TaskAttemptEvent
to the AM for every failed fetch. With 5k downstream tasks each trying to fetch
from the bad source, the AM receives 5k events per cycle. The AM dedupes by
source attempt, but only after the events are on the queue.
Step 2 — Identify the fix
The minimal fix is client-side debounce: a ShuffleManager should not
re-report the same source attempt failure more than once per
tez.runtime.shuffle.fetch-failure.report.cooldown-ms window. The TEZ
convention is to add the config key with a sensible default
(reportCooldownMs = 5_000).
Step 3 — Diff
--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java
@@
+ private final ConcurrentMap<InputAttemptIdentifier, Long> lastReportedAt =
+ new ConcurrentHashMap<>();
+ private final long reportCooldownMs;
@@
public void reportReadError(InputAttemptIdentifier srcAttempt, IOException e) {
+ long now = clock.getTime();
+ Long prev = lastReportedAt.get(srcAttempt);
+ if (prev != null && now - prev < reportCooldownMs) {
+ if (LOG.isDebugEnabled()) {
+ LOG.debug("Debouncing read-error report for {} (last={}ms ago)",
+ srcAttempt, now - prev);
+ }
+ return;
+ }
+ lastReportedAt.put(srcAttempt, now);
inputContext.sendEvents(Collections.singletonList(
createInputReadErrorEvent(srcAttempt, e)));
}
Add the config key to TezRuntimeConfiguration:
+ public static final String TEZ_RUNTIME_SHUFFLE_FETCH_FAILURE_REPORT_COOLDOWN_MS =
+ TEZ_RUNTIME_PREFIX + "shuffle.fetch-failure.report.cooldown-ms";
+ public static final long
+ TEZ_RUNTIME_SHUFFLE_FETCH_FAILURE_REPORT_COOLDOWN_MS_DEFAULT = 5_000L;
And register it in the same file's tezRuntimeKeys set so the validator does
not reject it.
Step 4 — Test with FaultInjectionFetcher pattern
There is no production FaultInjectionFetcher; the test pattern is to subclass
ShuffleManager and override createFetcher to return a Fetcher that throws
IOException on every call. The repro test sits in
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/TestShuffleManager.java:
@Test(timeout = 10000)
public void testReadErrorReportDebounce() throws Exception {
Clock clock = new ControlledClock();
TezConfiguration conf = new TezConfiguration();
conf.setLong(TEZ_RUNTIME_SHUFFLE_FETCH_FAILURE_REPORT_COOLDOWN_MS, 1000);
ShuffleManager sm = createShuffleManager(conf, clock);
InputAttemptIdentifier src = newInputAttempt(0);
sm.reportReadError(src, new IOException("first"));
sm.reportReadError(src, new IOException("second (debounced)"));
sm.reportReadError(src, new IOException("third (debounced)"));
// Only the first event should reach the inputContext
verify(inputContext, times(1)).sendEvents(anyList());
// Advance the clock past the cooldown
((ControlledClock) clock).setTime(clock.getTime() + 2000);
sm.reportReadError(src, new IOException("after cooldown"));
verify(inputContext, times(2)).sendEvents(anyList());
}
Then add a MiniTezCluster integration test that runs OrderedWordCount with a
fault injected into a single Fetcher, to confirm that the AM event queue stays
bounded.
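The debounce rule and its clock-driven test can also be tried outside Tez. The sketch below is an assumption-laden stand-in: ReportDebouncer is a hypothetical class, the clock is a plain LongSupplier rather than Tez's Clock, and it uses ConcurrentHashMap.compute so that the check and the update happen atomically per key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Per-key debounce with an injectable clock; a sketch, not the Tez patch itself. */
class ReportDebouncer<K> {
    private final ConcurrentMap<K, Long> lastReportedAt = new ConcurrentHashMap<>();
    private final LongSupplier clockMs;
    private final long cooldownMs;

    ReportDebouncer(LongSupplier clockMs, long cooldownMs) {
        this.clockMs = clockMs;
        this.cooldownMs = cooldownMs;
    }

    /** Returns true if a report for this key should be sent now. */
    boolean shouldReport(K key) {
        long now = clockMs.getAsLong();
        boolean[] allowed = {false};
        // compute() runs atomically per key, so two threads cannot both pass the check.
        lastReportedAt.compute(key, (k, prev) -> {
            if (prev != null && now - prev < cooldownMs) {
                return prev;          // still cooling down: keep the old timestamp
            }
            allowed[0] = true;
            return now;               // first report, or cooldown elapsed
        });
        return allowed[0];
    }

    public static void main(String[] args) {
        AtomicLong fakeClock = new AtomicLong(0);
        ReportDebouncer<String> d = new ReportDebouncer<>(fakeClock::get, 1000);
        System.out.println(d.shouldReport("attempt_0")); // true: first report
        System.out.println(d.shouldReport("attempt_0")); // false: inside the cooldown
        fakeClock.addAndGet(2000);
        System.out.println(d.shouldReport("attempt_0")); // true: cooldown elapsed
    }
}
```

One design point worth raising in review: a separate get followed by put leaves a window in which two fetcher threads can both report the same source attempt. Whether that matters depends on how many threads call the reporting method concurrently.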
Walked example B — off-by-one in IFile EOF
Symptom: a reader of IFile-format data occasionally returns one extra
zero-length record at the end of a segment. Downstream processors see a
null/empty key and either throw or silently insert a bogus row.
Step 1 — Locate
cd ~/tez-src
grep -n "EOF_MARKER\|readNextKeyValue\|nextRawKey" \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
Read the Reader.nextRawKey loop and the EOF_MARKER constant. The classic
bug shape: the loop tests bytesRead >= length after a successful read
instead of before, allowing one extra iteration when the segment ends exactly
on a record boundary.
Diff
--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
@@
public boolean nextRawKey(DataInputBuffer key) throws IOException {
- int recordLength = readVInt(dataIn);
- if (recordLength == EOF_MARKER) {
- return false;
- }
+ if (bytesRead >= segmentLength) {
+ return false;
+ }
+ int recordLength = readVInt(dataIn);
+ if (recordLength == EOF_MARKER) {
+ return false;
+ }
...
}
The fix is a single guard clause ahead of the read. The harder part is the test.
Step 2 — Test
Add to tez-runtime-library/src/test/java/org/apache/tez/runtime/library/common/sort/impl/TestIFile.java:
@Test
public void testReaderStopsAtExactSegmentBoundary() throws Exception {
// Write exactly two records, capture the byte length, construct a Reader
// bounded to that byte length, and assert the third nextRawKey() returns
// false without throwing.
Path p = writeRecords(2);
long segLen = fs.getFileStatus(p).getLen();
Reader r = new Reader(conf, fs, p, codec, /*ifileReadAhead*/false, 0, segLen);
assertTrue(r.nextRawKey(keyBuf));
assertTrue(r.nextRawKey(keyBuf));
assertFalse("must not return phantom third record",
r.nextRawKey(keyBuf));
r.close();
}
Run:
mvn -pl tez-runtime-library test -Dtest=TestIFile -q 2>&1 | tail -30
A reviewer will also ask for a check that bytesRead does not advance past
segmentLength on a malformed input — add it.
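The phantom-record shape can be reproduced with a toy format. The sketch below does not use IFile: it assumes records stored as a 4-byte length followed by the payload, with a few zero bytes after the segment standing in for whatever follows it on disk. BoundedRecordReader and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Toy length-prefixed record reader bounded to a segment; NOT the IFile format. */
class BoundedRecordReader {
    private final DataInputStream in;
    private final long segmentLength;
    private final boolean checkBoundary;
    private long bytesRead;

    BoundedRecordReader(byte[] data, long segmentLength, boolean checkBoundary) {
        this.in = new DataInputStream(new ByteArrayInputStream(data));
        this.segmentLength = segmentLength;
        this.checkBoundary = checkBoundary;
    }

    /** Returns the next record, or null when the segment (or the buffer) is exhausted. */
    byte[] next() {
        try {
            if (checkBoundary && bytesRead >= segmentLength) {
                return null;                  // the guard: test the boundary BEFORE reading
            }
            if (in.available() < 4) {
                return null;                  // physical end of the buffer
            }
            int length = in.readInt();
            byte[] record = new byte[length];
            in.readFully(record);
            bytesRead += 4 + length;
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the records, then paddingBytes zero bytes that are NOT part of the segment. */
    static byte[] write(int paddingBytes, String... records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String r : records) {
                byte[] payload = r.getBytes(StandardCharsets.UTF_8);
                out.writeInt(payload.length);
                out.write(payload);
            }
            out.write(new byte[paddingBytes]);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static int count(byte[] data, long segmentLength, boolean checkBoundary) {
        BoundedRecordReader reader = new BoundedRecordReader(data, segmentLength, checkBoundary);
        int n = 0;
        while (reader.next() != null) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] data = write(4, "a", "bb");
        long segmentLength = data.length - 4;  // segment ends exactly on a record boundary
        System.out.println("unguarded: " + count(data, segmentLength, false)); // 3 (phantom)
        System.out.println("guarded:   " + count(data, segmentLength, true));  // 2
    }
}
```

The unguarded reader interprets the four trailing zero bytes as a zero-length record, which is the same symptom as the phantom empty key described above.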
Walked example C — MergeManager unexpected spill
Symptom: a small DAG that fits comfortably in memory still spills to disk.
Investigation: MergeManager.canShuffleToMemory returns false for inputs
smaller than the configured threshold because it compares against the total
memory budget rather than the per-input share.
The bug shape is in MergeManager.canShuffleToMemory(long size) — the
comparison uses usedMemory + size > maxMemory * memoryLimitPercent where it
should be >= plus a fairness check against singleShuffleLimit.
The repro: a tiny OrderedWordCount on MiniTezCluster with
tez.runtime.shuffle.memory.limit.percent=0.95 and a single 100KB input. The
counter MERGED_MAP_OUTPUTS_DISK should be 0 and is not.
The fix and test follow the same pattern as the previous two examples.
Pitfalls
- Don't add Thread.sleep to a shuffle test. Use DrainDispatcher, the ControlledClock pattern, or a CountDownLatch driven by the production callback. Sleep-based shuffle tests are the #1 source of flakes in tez-runtime-library (see Stage 9).
- Don't relax MergeManager thresholds to "fix" a memory error. The thresholds are a contract with the AM scheduler. If MergeManager runs out of memory, the bug is usually upstream: a Fetcher that should have used disk and went to memory.
- Don't add a config key without registering it in tezRuntimeKeys. The runtime validates against an allowlist; an unregistered key is silently ignored.
- Don't fix the IFile reader by widening the boundary check. Boundary bugs in IFile usually have a sibling bug in the writer. Read both before patching either.
- Don't add a Fetcher retry loop that does not respect the AM's already-scheduled retry policy. Two retry loops in series turn a 3x retry into a 9x retry. Confirm via dispatcher trace that the AM is the only retry authority.
- Don't change the on-disk IFile format without bumping IFile.VERSION. That is a Stage 11 patch and requires explicit back-compat shims.
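The 3x-to-9x arithmetic follows directly from nesting one retry loop inside another. The sketch below counts the attempts; RetryMultiplication and retry are hypothetical names, and the always-failing operation stands in for a fetch against a bad node.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Counts how many real attempts happen when two retry loops are stacked. */
class RetryMultiplication {
    /** Runs op up to maxAttempts times; returns true on the first success. */
    static boolean retry(int maxAttempts, BooleanSupplier op) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (op.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    /** Outer loop (the AM's policy) wrapping an inner loop (a fetcher-local retry). */
    static int attemptsWhenNested(int outer, int inner) {
        AtomicInteger attempts = new AtomicInteger();
        retry(outer, () -> retry(inner, () -> {
            attempts.incrementAndGet();
            return false;                      // the fetch always fails
        }));
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println(attemptsWhenNested(3, 1)); // 3: single retry authority
        System.out.println(attemptsWhenNested(3, 3)); // 9: two loops in series
    }
}
```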
Exit criteria — when you're ready for the next stage
Move to Stage 7 when:
- You have shipped one shuffle or runtime patch with a deterministic MiniTezCluster regression test that passes in under two minutes.
- You can recite the relationship between tez.runtime.shuffle.memory.limit.percent, tez.runtime.shuffle.fetch.buffer.percent, and the JVM heap.
- You have read MergeManager.merge() end to end and can explain the on-disk vs in-memory branches.
- A reviewer has accepted your fix without asking "is this the same bug as TEZ-XXXX?", meaning you have learned to grep for prior art before patching.
Stage 7 takes you out of Tez code and into the Hive-on-Tez attribution skill.
Stage 7 — Hive-on-Tez Compatibility
What this stage teaches
Stage 7 is the cross-project stage. You learn:
- The largest consumer of Tez in production is Hive. Bugs that look like Tez bugs are often Hive bugs that surface through Tez, and vice versa.
- The contracts Hive depends on: DAGPlan size limits, edge property serialisation, session reuse via TezSessionPoolManager, and the HiveSplitGenerator event protocol.
- The attribution decision tree: when to file on TEZ, when on HIVE, and when on both with a cross-reference.
- The release-train interplay: Hive 3.x ships a specific Tez version; Hive 4.x ships a different one. A "fix" in Tez master does not automatically reach a Hive user until the next Hive release picks up a Tez release.
- How to write an attribution argument in a JIRA description so committers in both projects agree on ownership before any code is written.
The "patch" deliverable for Stage 7 is often a JIRA, not code. A correct attribution call is the contribution; the code may be one line in each project or zero lines in Tez and a workaround in Hive.
JIRA filter to find candidates
project = TEZ AND text ~ "Hive" AND resolution = Unresolved
ORDER BY updated DESC
Then on the Hive side:
project = HIVE
AND (text ~ "Tez" OR text ~ "TezSession" OR text ~ "DAGPlan"
OR text ~ "VertexManagerPlugin")
AND resolution = Unresolved
ORDER BY updated DESC
Cross-reference: a TEZ- ticket linked to a HIVE- ticket is a Stage 7
opportunity. The contribution is reading both, choosing the owner project, and
writing the attribution.
The attribution decision tree
Given a symptom, walk this tree:
Is the symptom observed in a non-Hive Tez workload?
├── Yes → Tez bug. File on TEZ. Stage 4–6 patch.
└── No (Hive-specific)
    │
    Does the symptom depend on a Hive class on the stack trace?
    ├── Yes (Hive frame is the top user-code frame)
    │   │
    │   Is the Tez API contract being misused by Hive?
    │   ├── Yes → HIVE bug. File on HIVE. Tez may need a clearer
    │   │         contract / better exception message; file
    │   │         a follow-up TEZ ticket.
    │   └── No  → Possibly a Tez API contract gap. File on TEZ
    │             with a Hive repro, link the HIVE ticket.
    │
    └── No (the bug surfaces inside Tez code triggered by Hive's DAG)
        │
        Does the Hive DAG exercise an edge case Tez tests don't cover?
        ├── Yes → Tez bug. File on TEZ. Add a Tez-side test that
        │         reproduces the shape without Hive.
        └── No  → File a `cross-project` ticket on TEZ with a
                  HIVE counter-ticket; sort ownership on dev@.
The tree is not law. It is the start of a dev@ conversation.
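For readers who prefer code to diagrams, the tree can be written as one function. This is only a restatement of the branches above: the enum constants and parameter names are hypothetical labels, not JIRA fields or Tez types.

```java
/** The attribution tree as a pure function; names are illustrative labels only. */
class Attribution {
    enum Outcome {
        TEZ_BUG,                     // file on TEZ, Stage 4-6 patch
        HIVE_BUG_WITH_TEZ_FOLLOW_UP, // file on HIVE, follow-up TEZ ticket for a clearer contract
        TEZ_CONTRACT_GAP,            // file on TEZ with a Hive repro, link the HIVE ticket
        TEZ_BUG_NEEDS_TEZ_SIDE_TEST, // file on TEZ, reproduce the shape without Hive
        CROSS_PROJECT                // TEZ ticket plus HIVE counter-ticket, settle on dev@
    }

    static Outcome decide(boolean seenInNonHiveWorkload,
                          boolean hiveFrameIsTopUserFrame,
                          boolean hiveMisusesTezApi,
                          boolean edgeCaseTezTestsMiss) {
        if (seenInNonHiveWorkload) {
            return Outcome.TEZ_BUG;
        }
        if (hiveFrameIsTopUserFrame) {
            return hiveMisusesTezApi
                ? Outcome.HIVE_BUG_WITH_TEZ_FOLLOW_UP
                : Outcome.TEZ_CONTRACT_GAP;
        }
        return edgeCaseTezTestsMiss
            ? Outcome.TEZ_BUG_NEEDS_TEZ_SIDE_TEST
            : Outcome.CROSS_PROJECT;
    }

    public static void main(String[] args) {
        // Hive-only symptom, Hive frame on top, API used as intended.
        System.out.println(decide(false, true, false, true)); // TEZ_CONTRACT_GAP
    }
}
```

Note that the fourth argument is only consulted when no Hive frame is on top of the stack, exactly as in the diagram.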
Walked example A — DAGPlan size exceeds limit on Hive autogenerated DAG
Symptom: a Hive 3.1 query with a large IN list (10k+ literals) submits a DAG
that fails at TezClient.submitDAG with:
TezException: DAGPlan too large
The Tez default is 64MB on the wire. Hive can in principle stay under it, but
the codegen path for very large IN lists doesn't truncate.
Step 1 — Attribution
Walk the tree:
- Non-Hive workload? No, Hive-specific.
- Hive on stack? Yes, HiveSplitGenerator.
- Is Hive misusing Tez API? No: DAGPlan is exactly the wire format Tez expects; Hive is sending a legitimate but large payload.
- Is this an edge case Tez tests don't cover? Yes: Tez tests submit small DAGPlans.
Conclusion: this is a Tez API contract gap that Hive happens to hit first. The fix is twofold:
- Tez side: make the limit configurable and improve the error message to tell the operator which key to raise. File on TEZ.
- Hive side: paginate the IN list literal codegen. File on HIVE.
The Tez patch is small and lands first.
Step 2 — The Tez-side diff
--- a/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
@@
+ /**
+ * Maximum size (bytes) of the serialised {@link DAGPlan} that the AM
+ * accepts in a single submission. The default of 64MiB is a Hadoop
+ * IPC limit. Operators submitting very large DAGs (typically generated
+ * by upstream query engines) may need to raise this.
+ * @since 0.10.4
+ */
+ public static final String TEZ_DAG_PLAN_MAX_BYTES =
+ TEZ_PREFIX + "dag.plan.max.bytes";
+ public static final int TEZ_DAG_PLAN_MAX_BYTES_DEFAULT = 64 * 1024 * 1024;
And in tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java:
- if (serialised.length > 64 * 1024 * 1024) {
- throw new TezException("DAGPlan too large");
+ int max = conf.getInt(TEZ_DAG_PLAN_MAX_BYTES, TEZ_DAG_PLAN_MAX_BYTES_DEFAULT);
+ if (serialised.length > max) {
+ throw new TezException(String.format(
+ "DAGPlan serialised size %d exceeds limit %d. "
+ + "Raise %s on the submitter and AM, or reduce DAGPlan size "
+ + "(typically by pruning literal lists or split metadata).",
+ serialised.length, max, TEZ_DAG_PLAN_MAX_BYTES));
}
The patch makes the limit explicit, configurable, and self-describing.
Step 3 — The JIRA description (attribution argument)
Summary: Make DAGPlan size limit configurable and self-describing
Description:
Hive's HiveSplitGenerator can generate DAGPlans > 64MiB for queries with
very large IN lists. Currently Tez throws "DAGPlan too large" with no
actionable advice. The Hive side will paginate (HIVE-NNNNN), but Tez
should:
1. Expose tez.dag.plan.max.bytes so operators can raise the cap.
2. Produce an error message that names the key and the cause.
Attribution rationale:
- This is a Tez API contract gap: legitimate DAGPlans should not be
silently rejected with no recourse.
- Hive is the first downstream that hits this; other DAG generators
(Pig-on-Tez, custom DAGs from BI tools) will hit it next.
- HIVE-NNNNN is filed in parallel for the codegen pagination.
Tests:
- TestDAGAppMaster#testDAGPlanSizeLimitConfigurable
- End-to-end repro left to HIVE-NNNNN (Tez has no test that builds a
pathological 64MiB DAGPlan).
This is the cross-project pattern: the TEZ ticket cites HIVE-NNNNN explicitly, states the attribution rationale, and stops short of fixing Hive's behaviour.
Walked example B — edge property mismatch on Hive upgrade
Symptom: after upgrading Hive 3.1 → 3.2, certain queries fail with:
TezException: EdgeProperty mismatch on edge v1->v2: source class
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput
does not match sink class
org.apache.tez.runtime.library.input.UnorderedKVInput
Tez rejects the DAG because the edge wiring is inconsistent.
Attribution: Hive 3.2 emitted a different sink type for that vertex. Tez is behaving correctly — it is enforcing the edge contract. This is a HIVE bug. File on HIVE. The Tez side requires no patch.
The contribution here is the attribution itself plus a Tez-side documentation
note on the validator: "see EdgeProperty.checkCompatible for the rules
enforced." Add a docs patch (Stage 1) if no such note exists.
Walked example C — TezSessionPoolManager reuse leak
Symptom: HiveServer2 uses TezSessionPoolManager to reuse AMs across queries.
A specific Hive query path leaves the session in a state where the next query
sees stale credentials.
Attribution: TezSessionPoolManager is a Hive class (in the Hive repo),
even though it manages TezClient instances. Find it:
grep -rn "class TezSessionPoolManager" ~/hive-src/ql/src/java
The bug is in Hive. The Tez API used (TezClient.start()) is correct.
File on HIVE. The Tez contribution is zero code; it is the attribution call and the explanation in the JIRA comments that prevents the ticket bouncing.
Reading the Hive code path for attribution
Even though you may not commit to Hive, you must be able to read the Hive classes that touch Tez:
- org.apache.hadoop.hive.ql.exec.tez.DagUtils: Hive's DAG builder.
- org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator: Hive's input split generation, called from Tez VertexManagers.
- org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager: session reuse.
- org.apache.hadoop.hive.ql.exec.tez.TezSessionState: per-session state.
Keep a Hive checkout next to your Tez checkout:
git clone https://github.com/apache/hive ~/hive-src
A grep across both:
grep -rn "DAGPlan\|VertexManagerPluginDescriptor" ~/hive-src/ql/src/java | head
is the start of every Stage 7 investigation.
Pitfalls
- Don't fix a Hive bug in Tez. Even if the symptom appears on a Tez stack frame, do not patch Tez to work around an incorrect Hive use of the API. You will trap Tez into supporting buggy clients forever.
- Don't expand a Tez API to "make Hive easier". That is a Stage 11 patch with a dev@ design thread; not a Stage 7 patch.
- Don't assume the Hive committers will read your TEZ ticket. CC the appropriate Hive committers explicitly, or post a short note on dev@hive.apache.org linking the JIRA.
- Don't promise a Tez backport to a specific Hive release. Release alignment is a separate conversation; you control your patch's landing in Tez, not when Hive picks it up.
- Don't file the same bug on both projects without distinguishing the work. TEZ-NNNN should fix the Tez side; HIVE-NNNN should fix the Hive side; each ticket should cross-reference the other and say exactly what code lives in which project.
- Don't break older Hive versions to fix newer ones. A Tez change that raises the minimum required Hive version is a Stage 11 / Stage 12 call.
Exit criteria — when you're ready for the next stage
Move to Stage 8 when:
- You have correctly attributed at least one symptom to HIVE (saving Tez from an incorrect patch) and one to TEZ (with a Hive counter-ticket).
- You have a ~/hive-src checkout next to ~/tez-src and have grepped across both at least three times during real investigation.
- You can describe the lifecycle of a TezSessionState from creation to reuse to teardown in five sentences.
- You have read EdgeProperty.checkCompatible and know which mismatches the Tez validator does and does not flag.
Stage 8 takes you into the YARN integration layer.
Stage 8 — YARN Integration
What this stage teaches
Stage 8 lives at the Tez/YARN boundary. You learn:
- How the Tez AM acquires and renews its AMRMToken, and the canonical bug: long-running session AMs (multi-day Hive sessions) whose AMRMToken expires while the AM is mid-RPC.
- Log aggregation: how Tez's container exit hooks interact with the NM's LogAggregationService. The canonical symptom: missing container logs after AM crash because the AM never told the NM to flush.
- The NM aux service: the Tez ShuffleHandler (or the MR ShuffleHandler when configured) lives in tez-plugins/tez-aux-services. Version mismatches between AM-side tez-runtime-library and NM-side aux service cause shuffle failures with cryptic error messages.
- Kerberos delegation token renewal across DAG lifecycles, especially when multiple DAGs in a session use the same Credentials object.
- TezClient AMRMToken handling: where the token lives in the submitter process versus the AM.
Patches in this stage are 50–400 lines but often require a
Hadoop-version-specific code path, so the tez-plugins/tez-aux-services
profile structure matters more than in other stages.
Reading order
1. tez-api/src/main/java/org/apache/tez/client/TezClient.java
2. tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java: focus on the AMRMToken handling and the credential propagation.
3. tez-plugins/tez-aux-services/src/main/java/org/apache/tez/auxservices/ShuffleHandler.java
4. The deep dive yarn-integration.
cd ~/tez-src
grep -rn "AMRMToken\|getCredentials\|TokenIdentifier" \
tez-api/src/main/java tez-dag/src/main/java | head -30
ls tez-plugins/tez-aux-services/src/main/java/org/apache/tez/auxservices/
JIRA filter to find candidates
project = TEZ
AND component in ("tez-dag", "tez-plugins")
AND resolution = Unresolved
AND (text ~ "AMRMToken" OR text ~ "kerberos" OR text ~ "delegation token"
OR text ~ "log aggregation" OR text ~ "ShuffleHandler"
OR text ~ "aux service" OR description ~ "TokenExpired")
ORDER BY updated DESC
A second filter focused on long-running session bugs:
project = TEZ AND text ~ "session" AND text ~ "expired"
AND resolution = Unresolved
Walked example A — AMRMToken expiry on long DAGs
Symptom: a Hive session AM runs for 36 hours. On hour 24 it starts logging:
SecretManager$InvalidToken: AMRMToken for application appattempt_X has expired.
The AM crashes mid-DAG. The user loses the long-running session and resubmits all in-progress queries.
Step 1 — Trace token lifetime
cd ~/tez-src
grep -n "AMRMToken\|registerApplicationMaster" \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
grep -rn "renewMaxLifetime\|token-max-lifetime" tez-api tez-dag tez-common
YARN's yarn.resourcemanager.am-rm-tokens.master-key-rolling-interval-secs
default is 24h. When the RM rotates the master key, the AM's cached AMRMToken
becomes invalid. The fix is to detect a token-expired exception on the AMRM
heartbeat path and re-acquire the token from the RM (which already exposes
this via the heartbeat response in modern Hadoop versions).
Step 2 — Choose the right Hadoop version
tez-aux-services and tez-dag build against the configured Hadoop profile:
grep -rn "hadoop28\|hadoop29\|hadoop31" pom.xml | head
Token rollover handling differs across Hadoop minor versions. The patch must be a no-op on profiles where the Hadoop client already handles the rollover transparently. Confirm by:
grep -rn "AMRMToken" ~/hadoop-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client | head
If AMRMClientAsyncImpl already loops on token expiry in Hadoop 3.x, your Tez
patch is a Hadoop-2.x-only path guarded by an availability check.
Step 3 — Diff
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
@@
private void heartbeatLoop() {
while (!shutdownRequested) {
try {
AllocateResponse resp = amRmClient.allocate(progress);
+ // Hadoop 2.x clients did not transparently refresh the AMRMToken
+ // on master-key rollover. Detect token-expired and re-acquire.
+ // See TEZ-XXXX.
+ if (resp.getAMRMToken() != null) {
+ UserGroupInformation.getCurrentUser().addToken(
+ ConverterUtils.convertFromYarn(resp.getAMRMToken(), null));
+ }
processAllocations(resp);
} catch (InvalidToken e) {
+ LOG.warn("AMRMToken invalid for {}, attempting re-register", appAttemptID);
+ try {
+ amRmClient.registerApplicationMaster(host, port, trackingUrl);
+ continue;
+ } catch (Exception reErr) {
+ LOG.error("Re-register failed; AM will exit", reErr);
+ throw new TezUncheckedException(reErr);
+ }
} catch (Exception e) {
...
}
}
}
Step 4 — Test
A unit test stubs the AMRMClient to return an InvalidToken once then a
healthy response, and asserts that registerApplicationMaster was called once
and the loop continued. Pattern:
@Test(timeout = 10000)
public void testAmrmTokenReacquiredOnInvalidToken() throws Exception {
AMRMClient mockRm = mock(AMRMClient.class);
when(mockRm.allocate(anyFloat()))
.thenThrow(new InvalidToken("expired"))
.thenReturn(emptyAllocateResponse());
DAGAppMaster am = createTestAM(mockRm);
am.runOneHeartbeatIteration();
verify(mockRm).registerApplicationMaster(anyString(), anyInt(), anyString());
am.runOneHeartbeatIteration();
// second iteration must succeed
}
A MiniYARNCluster test that triggers an actual key rollover is possible but
slow; the unit test above is sufficient for review.
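The fail-once-then-recover shape of that test can be sketched without Mockito or YARN on the classpath. Everything below is a hypothetical stand-in: RmClient, FailOnceClient, and heartbeatOnce are illustrative names, and a plain IOException plays the role of InvalidToken.

```java
import java.io.IOException;

/** Hypothetical stand-ins for the RM client and one heartbeat step; not Tez or YARN classes. */
final class HeartbeatSketch {
    interface RmClient {
        String allocate() throws IOException;   // stands in for AMRMClient.allocate
        void register();                        // stands in for registerApplicationMaster
    }

    /** Scripted stub: the first allocate() fails, every later call succeeds. */
    static final class FailOnceClient implements RmClient {
        int registerCalls = 0;
        private boolean alreadyFailed = false;

        @Override
        public String allocate() throws IOException {
            if (!alreadyFailed) {
                alreadyFailed = true;
                throw new IOException("AMRMToken expired");
            }
            return "ok";
        }

        @Override
        public void register() {
            registerCalls++;
        }
    }

    /** One heartbeat: on a token failure, re-register and return null (no response yet). */
    static String heartbeatOnce(RmClient rm) {
        try {
            return rm.allocate();
        } catch (IOException expired) {
            rm.register();
            return null;
        }
    }
}
```

A hand-written stub like this keeps the call count in a plain field, which is sometimes easier to debug than a mock when the test itself is under suspicion.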
Walked example B — log aggregation race on AM crash
Symptom: an AM crashes (OutOfMemoryError). The cluster operator runs
yarn logs -applicationId ... and gets nothing. The NodeManager's
LogAggregationService reports the logs as never finalised.
Root cause: the JVM crashed before Tez's DAGAppMaster.shutdown() could flag
the logs as aggregation-ready. NM's default is "wait for the AM to mark
finalisation" rather than aggregating on container exit.
The fix
Tez registers a JVM shutdown hook (Runtime.getRuntime().addShutdownHook) that
calls into the YARN LogAggregationContext to force-finalise. The hook must
run before the JVM's normal exit handlers.
cd ~/tez-src
grep -n "addShutdownHook\|LogAggregationContext" \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
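If a shutdown hook is registered but does not survive an OutOfMemoryError, wrap
its body in a defensive try/catch (Throwable). Do not rely on registration
order to sequence it: hooks added through Runtime.addShutdownHook start in an
unspecified order and run concurrently. If the hook must run after other
cleanup, register it through Hadoop's ShutdownHookManager with a lower priority
than the hooks it depends on (higher priorities run first).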
The diff is small; the test is hard. The accepted pattern is a logged-evidence
test: spin up a MiniYARNCluster, submit a DAG, terminate the AM process with
SIGTERM, and assert that the NM log aggregation finalised the logs within a
bounded time. Do not use kill -9 here: SIGKILL stops the JVM without running
shutdown hooks, so it cannot exercise this path.
This test belongs in tez-tests and is slow (~30s).
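The defensive wrapper itself is easy to unit-test in isolation, even though the end-to-end behaviour is not. The following is a minimal sketch; GuardedHook and its method names are hypothetical, and the actual finalisation work is left as an opaque Runnable.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper: wraps shutdown work so no Throwable escapes the hook thread. */
final class GuardedHook {
    /** Returns a Runnable that runs the work and records, rather than rethrows, any failure. */
    static Runnable guarded(Runnable work, AtomicReference<Throwable> failure) {
        return () -> {
            try {
                work.run();
            } catch (Throwable t) {   // deliberate: this also covers OutOfMemoryError
                failure.set(t);
            }
        };
    }

    /** Registers the guarded work with the JVM; the JVM does not order hooks. */
    static Thread register(Runnable work, AtomicReference<Throwable> failure) {
        Thread hook = new Thread(guarded(work, failure), "log-finalise-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Calling guarded(...).run() directly in a unit test checks the catch-all without having to shut down a JVM.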
Walked example C — NM aux service version mismatch
Symptom: a cluster operator deploys Tez 0.10.3 but the NMs still run the Tez 0.10.1 aux service. Shuffle fails with:
IOException: Unknown shuffle handler version: 1
The fix is in tez-aux-services plus a docs note: the aux service
on every NM must match the AM-side tez-runtime-library minor version. The
Tez patch is twofold:
- The aux service must report its version in the protocol handshake.
- The client side must produce a self-describing error message that names the NM, the version it reported, and the version the AM expected.
--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
@@
- if (serverVersion != EXPECTED_SHUFFLE_VERSION) {
- throw new IOException("Unknown shuffle handler version: " + serverVersion);
+ if (serverVersion != EXPECTED_SHUFFLE_VERSION) {
+ throw new IOException(String.format(
+ "Tez shuffle handler version mismatch on %s:%d: server=%d, expected=%d. "
+ + "Likely cause: NodeManager aux-service jar is older than the AM. "
+ + "Ensure tez-aux-services-%s.jar is deployed to every NM.",
+ host, port, serverVersion, EXPECTED_SHUFFLE_VERSION,
+ TezVersionInfo.getVersion()));
}
The patch is one improved error message and one documentation update in
docs/src/site/markdown/install.md.
Pitfalls
- Don't add a new JVM shutdown hook without considering ordering. Java does not guarantee shutdown hook order; if two hooks rely on each other, you must serialise them explicitly.
- Don't catch Throwable outside a shutdown path. Catching Throwable in the heartbeat loop will swallow OutOfMemoryError and leave the AM in an undefined state.
- Don't conflate AMRMToken with delegation tokens. AMRMToken authenticates the AM to the RM; delegation tokens authenticate the AM/tasks to HDFS or other services. Renewal paths and lifetimes are different.
- Don't deploy a fix that requires the operator to redeploy tez-aux-services without saying so in the release notes. Aux service upgrades require an NM restart; that is operationally expensive.
- Don't assume the Hadoop version on disk is the Hadoop version in production. Test against the minimum Hadoop version supported by your Tez release line (see pom.xml profile defs).
- Don't hard-code token renewal intervals. Use the YARN-side configuration keys directly (yarn.resourcemanager.am-rm-tokens.master-key-rolling-interval-secs).
Exit criteria — when you're ready for the next stage
Move to Stage 9 when:
- You have shipped one YARN-integration patch with evidence (in the JIRA description) of which Hadoop minor versions you tested against.
- You can describe the AMRMToken lifecycle in five sentences including the master-key rollover.
- You have read the LogAggregationContext API in the Hadoop source and understand the logIncludePattern / logExcludePattern interplay.
- You have a tez-plugins/tez-aux-services build that runs locally and you understand which NMs need it.
Stage 9 returns to the in-repo skill set with a focus on test stability.
Stage 9 — Flaky Tests
What this stage teaches
Stage 9 is the unglamorous-but-essential stage. You learn:
- The Tez flake taxonomy: Thread.sleep races, undrained AsyncDispatcher, MiniTezCluster port collisions, and @Test(timeout=...) budgets that were too tight for slow CI.
- How to distinguish a flake (passes locally, fails on Jenkins 1-in-30 runs) from a real intermittent bug (manifests in production under load). Flakes are tests; intermittent bugs are not.
- The DrainDispatcher.await() refactor: how to convert a sleep-based synchronisation to an event-drain-based one.
- The @Rule and TestName patterns for diagnosing which test in a suite leaks state into the next.
- When a flake fix is also a production code fix (the test was right; the code had a race).
Patches are 20–150 lines per test. They rarely change production code. The ones that do warrant a Stage 4–6 ticket in addition to the test fix.
JIRA filter to find candidates
project = TEZ
AND (text ~ "flaky" OR text ~ "intermittent" OR labels = "flaky-test")
AND resolution = Unresolved
ORDER BY updated DESC
A second source: Jenkins precommit history. Pick any open JIRA, find its Jenkins URL in the comments, click through to recent runs, look for tests that failed in one run and passed in the next on the same patch. Those tests are flake candidates regardless of whether a JIRA already exists.
A third source: your own mvn test output. Run any tez-dag test suite three
times in a row:
cd ~/tez-src
for i in 1 2 3; do
mvn -pl tez-dag test -Dtest=TestVertexImpl -q 2>&1 | tail -5
done
Any failure in the three-pass that doesn't repeat is a flake to investigate.
The Tez flake taxonomy
1. Thread.sleep races
The most common shape:
worker.submitJob(j);
Thread.sleep(500); // "wait for it to start"
assertTrue(worker.isJobRunning(j));
On a slow CI box, 500ms may not be enough. On a fast box, the job may have completed before the assertion. Both fail.
The fix is a poll with timeout:
worker.submitJob(j);
TestUtils.waitFor(() -> worker.isJobRunning(j), /*pollMs*/50, /*timeoutMs*/30_000);
assertTrue(worker.isJobRunning(j));
If TestUtils.waitFor does not exist in the module, use Hadoop's
org.apache.hadoop.test.GenericTestUtils.waitFor or write one yourself in a few lines.
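A hand-rolled version of the polling helper might look like the sketch below. The class name and the (condition, pollMs, timeoutMs) parameter order simply mirror the call shown above; they are assumptions, not an existing Tez utility.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper; the name and signature mirror the call shown above. */
final class TestUtils {
    /** Polls the condition every pollMs until it holds or timeoutMs has elapsed. */
    static void waitFor(BooleanSupplier condition, long pollMs, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError("condition not met within " + timeoutMs + " ms");
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // keep the interrupt status visible
                throw new AssertionError("interrupted while waiting", e);
            }
        }
    }
}
```

The generous timeout costs nothing on a fast machine because the loop exits as soon as the condition holds; it only matters on a slow one.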
2. Undrained AsyncDispatcher
The dispatcher is event-driven. A test that fires an event and immediately asserts on state will see the pre-event state half the time.
The fix is DrainDispatcher.await():
cd ~/tez-src
grep -rn "class DrainDispatcher" tez-common/src/main/java tez-dag/src/test
Find the canonical class. The refactor:
- dispatcher.getEventHandler().handle(new VertexEvent(vid, VertexEventType.V_INIT));
- Thread.sleep(200);
- assertEquals(VertexState.INITED, vertex.getState());
+ dispatcher.getEventHandler().handle(new VertexEvent(vid, VertexEventType.V_INIT));
+ dispatcher.await();
+ assertEquals(VertexState.INITED, vertex.getState());
The contract: await() returns when the event queue is empty and the last
event has been fully handled (including any subsequent events the handler
itself emitted). If the test still flakes after this refactor, the handler is
emitting events to a different dispatcher (e.g. a child component has its
own). Find it and drain that one too.
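The await() contract described above can be illustrated with a miniature dispatcher that counts events from the moment they are submitted until their handler returns. This is a sketch of the idea only, under the assumption of a single worker thread; MiniDrainDispatcher is a hypothetical class and is not the Tez DrainDispatcher source.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical miniature of a drainable dispatcher; not the Tez implementation. */
final class MiniDrainDispatcher implements AutoCloseable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Object lock = new Object();
    private int pending = 0;                 // events queued or currently being handled
    private final Thread worker;

    MiniDrainDispatcher(Consumer<String> handler) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String event = queue.take();
                    try {
                        handler.accept(event);       // may call handle() for follow-ups
                    } finally {
                        synchronized (lock) {
                            pending--;               // only now is the event "fully handled"
                            lock.notifyAll();
                        }
                    }
                }
            } catch (InterruptedException stop) {
                // close() interrupts the worker; exit quietly
            }
        }, "mini-dispatcher");
        worker.setDaemon(true);
        worker.start();
    }

    void handle(String event) {
        synchronized (lock) {
            pending++;                               // count before enqueueing
        }
        queue.add(event);
    }

    /** Blocks until every event, including follow-ups emitted by handlers, is done. */
    void await() {
        synchronized (lock) {
            while (pending > 0) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while draining", e);
                }
            }
        }
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

Because a follow-up event is counted before its parent's handler returns, the counter cannot reach zero in the middle of a cascade, which is the property a sleep can never give you.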
3. MiniTezCluster port collisions
The default MiniTezCluster binds a fixed RM port. Two suites running in
parallel on the same machine collide. The fix is per-suite port randomisation:
- tezCluster = new MiniTezCluster("test", 1, 1, 1);
+ // testName is a JUnit rule: @Rule public TestName testName = new TestName();
+ tezCluster = new MiniTezCluster(testName.getMethodName(), 1, 1, 1);
+ Configuration conf = new Configuration();
+ conf.set(YarnConfiguration.RM_ADDRESS, "localhost:0"); // port 0 = OS-assigned
+ conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS, "localhost:0");
+ conf.set(YarnConfiguration.RM_RESOURCE_TRACKER_ADDRESS, "localhost:0");
+ tezCluster.init(conf);
The 0 port tells the OS to assign an unused port. Then read the actual port
from the cluster after start:
int rmPort = tezCluster.getConfig().getSocketAddr(YarnConfiguration.RM_ADDRESS,
YarnConfiguration.DEFAULT_RM_ADDRESS, YarnConfiguration.DEFAULT_RM_PORT).getPort();
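The port-0 behaviour is an operating-system feature rather than anything specific to YARN, so it can be demonstrated with plain sockets. The sketch below uses only java.net; the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Demonstrates port 0 semantics with plain sockets; MiniTezCluster is not involved. */
final class EphemeralPorts {
    /** Binds two listeners to port 0 at the same time and returns the ports the OS chose. */
    static int[] bindTwo() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket first = new ServerSocket(0, 50, loopback);
             ServerSocket second = new ServerSocket(0, 50, loopback)) {
            // getLocalPort() reports the real port, not the 0 that was requested.
            return new int[] { first.getLocalPort(), second.getLocalPort() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The two listeners never collide because the kernel picks a free port for each, which is the same reason two suites on one CI agent stop colliding.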
4. @Test(timeout=...) too tight
A test with @Test(timeout=1000) may pass on a developer's M3 Pro and fail on
a contention-laden Jenkins agent. Raise the timeout to a value that comfortably
covers the slow CI but is still bounded:
- @Test(timeout = 1000)
+ @Test(timeout = 30_000)
public void testInitTransitionRunsOnce() { ... }
The Tez convention: 30s for unit tests, 300s for MiniTezCluster tests.
Never use @Test(timeout = 0): in JUnit 4 that disables the timeout, so a hung test will block CI for hours.
Walked example — TestShuffleManager flake
Symptom: testReadErrorReportDebounce fails 1-in-12 runs on Jenkins with:
expected:<1> but was:<2>
i.e. the verify on inputContext.sendEvents saw two calls when one was
expected.
Step 1 — Reproduce locally
cd ~/tez-src
for i in $(seq 1 50); do
mvn -pl tez-runtime-library test \
-Dtest=TestShuffleManager#testReadErrorReportDebounce \
-q >/dev/null 2>&1 || echo "FAILED run $i"
done | grep -c "FAILED"
A local reproduction at 1/50 frequency is good enough to start.
Step 2 — Diagnose
Read the test. The pattern:
sm.reportReadError(src, new IOException("first"));
sm.reportReadError(src, new IOException("second"));
verify(inputContext, times(1)).sendEvents(anyList());
reportReadError may dispatch to an internal executor. The verify runs before
the executor has serviced the call. The Mockito verify sees only the
synchronous call most of the time; the async one fires 1-in-12.
Step 3 — Fix
Replace verify with a timeout-bounded verify:
- verify(inputContext, times(1)).sendEvents(anyList());
+ verify(inputContext, timeout(5_000).times(1)).sendEvents(anyList());
Mockito.timeout(ms) re-runs the verification until it passes or the timeout
elapses, so the test now waits up to 5 seconds before failing. It returns as
soon as the count matches, so it cannot catch an extra call that arrives later.
A bigger refactor (preferred): inject a deterministic executor:
ShuffleManager sm = createShuffleManager(conf, new DirectExecutor());
where DirectExecutor is a java.util.concurrent.Executor whose execute
runs synchronously on the caller thread. Now there is no race, and the
original verify(..., times(1)) is correct.
The reviewer rule: prefer the deterministic executor refactor over
Mockito.timeout. The timeout-based fix masks future races; the deterministic
fix eliminates them.
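The deterministic executor is small enough to write inline in the test module. A minimal sketch follows; it assumes, as the text above does, that the class under test accepts a java.util.concurrent.Executor through some injection point.

```java
import java.util.concurrent.Executor;

/** Runs each task synchronously on the calling thread; intended for tests only. */
final class DirectExecutor implements Executor {
    @Override
    public void execute(Runnable command) {
        command.run();   // no queue, no second thread, so nothing to race against
    }
}
```

With this in place every side effect of the call under test has completed by the time the call returns, so an exact-count verify is deterministic.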
Step 4 — Confirm the fix
Run the loop again:
for i in $(seq 1 200); do
mvn -pl tez-runtime-library test \
-Dtest=TestShuffleManager#testReadErrorReportDebounce \
-q >/dev/null 2>&1 || echo "FAILED run $i"
done | grep -c "FAILED"
200 runs, zero failures, is the bar. Don't ship a flake fix you have not stress-tested.
When a flake is a real bug
Sometimes a test flakes because the production code has a race. If the "obvious" flake fix is to insert a sleep or relax an assertion, stop and ask: could a production caller exercise the same race?
Example: VertexImpl.handle returning before all event-emission side effects
complete. Adding dispatcher.await() makes the test pass, but a production
caller doing the same sequence still sees a partially-applied state.
That is a Stage 4 bug, not a Stage 9 bug.
The decision rule:
- The test races against an internal event queue → flake fix.
- The test races against a public contract method → file a real bug.
Pitfalls
- Don't @Ignore a flake to "fix" CI. The next contributor will silently remove the @Ignore and re-introduce the flake. File a real ticket with a written analysis even if you don't fix it.
- Don't bump the @Test(timeout) without reasoning. A 30s timeout is evidence the test does real work; a 30000s timeout is evidence the test is broken.
- Don't replace assertEquals with assertTrue(... contains ...) to silence a flake. That weakens the assertion permanently and hides the underlying race.
- Don't refactor a test class wholesale in a flake patch. Fix the one test. If the class needs a wholesale refactor, file a separate JIRA.
- Don't use Thread.yield() to fix a race. It is not a guarantee; it is a hint. Always use a real synchronisation primitive (CountDownLatch, dispatcher.await(), Future.get()).
- Don't catch InterruptedException and ignore it. The Tez convention is Thread.currentThread().interrupt(); throw new ... so the interrupt status propagates.
Exit criteria — when you're ready for the next stage
Move to Stage 10 when:
- You have de-flaked at least three tests with confirmed 200-run stability.
- You have caught at least one real production race that was masquerading as a flake.
- You can name the four flake patterns by heart (sleep races, undrained dispatcher, port collisions, tight timeouts).
- A reviewer has accepted your deterministic-executor refactor as the preferred pattern over Mockito.timeout.
Stage 10 turns the focus to performance regressions.
Stage 10 — Performance Regressions
What this stage teaches
Stage 10 is where you stop fixing bugs and start measuring. You learn:
- The Tez perf-regression workflow: identify symptom, git bisect to the culprit commit, profile under load, attribute the cost, ship a fix with before/after numbers.
- Microbenchmarking with tez-examples/OrderedWordCount as the canonical small DAG. When that is too coarse, JMH at the call-site level.
- Profilers: async-profiler for CPU/lock contention, JFR for allocation/GC pressure. When to use which.
- The two perf hotspots most often blamed first: AsyncDispatcher queue contention and IFile record encoding.
- How to file a perf-regression JIRA that committers take seriously: numbers, methodology, reproducibility, and a fix bounded in scope.
Patches are 30–300 lines, always with benchmark evidence in the JIRA. A performance patch without numbers is a no-op.
JIRA filter to find candidates
project = TEZ
AND resolution = Unresolved
AND (text ~ "performance regression" OR text ~ "slow"
OR text ~ "contention" OR text ~ "allocation"
OR labels = "performance")
ORDER BY priority DESC, updated DESC
A second source is the dev@ archive — search for "slowdown" or "regression" in the last six months. Operators often report perf issues without filing a JIRA. The first contribution is filing the JIRA with a repro.
The Tez perf-regression workflow
1. Reproduce the regression with a number
Never start a perf investigation with a vibe. Get a number:
cd ~/tez-src
mvn -pl tez-examples -am clean install -DskipTests -Phadoop28 -q
# Then run OrderedWordCount end-to-end on MiniTezCluster
mvn -pl tez-tests test -Dtest=TestExternalTezServices#testOrderedWordCount -q
For a more isolated benchmark, write a JMH micro:
find ~/tez-src -name "pom.xml" -exec grep -l jmh {} \;
If JMH is not in the test pom, add it scoped to test only — never to
compile.
2. git bisect to the culprit commit
Suppose the regression is "OrderedWordCount on a 10-node MiniTezCluster went from 12s to 19s between 0.10.2 and 0.10.3":
cd ~/tez-src
git bisect start
git bisect bad 0.10.3
git bisect good 0.10.2
# Each step:
mvn clean install -DskipTests -Phadoop28 -q
mvn -pl tez-tests test -Dtest=TestExternalTezServices#testOrderedWordCount -q
# Record the wall time. Then:
git bisect good # or 'git bisect bad'
Twenty commits between two minor releases means at most ceil(log2(20)) = 5 bisect steps.
Bisect to the single commit, then read its diff. Often the commit is innocent
and the regression is in a sibling commit interacting with it; bisect is the
start of the investigation.
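The step count can be checked with a simulated bisect over a linear history. This is a simplified model: commit indices stand in for hashes, the predicate stands in for "build, run the benchmark, compare wall time", and real git bisect on a history with merges may probe in a different order.

```java
import java.util.function.IntPredicate;

/** Simulated bisect over a linear history; commit indices stand in for hashes. */
final class BisectSketch {
    /** Holds the first bad commit index and how many builds were needed to find it. */
    static final class Result {
        final int firstBad;
        final int builds;

        Result(int firstBad, int builds) {
            this.firstBad = firstBad;
            this.builds = builds;
        }
    }

    /**
     * Candidates are commits 0..n-1. Commit n-1 is the known-bad tip, and the
     * commit just before index 0 is known good. isBad must flip from false to
     * true exactly once along the history.
     */
    static Result bisect(int n, IntPredicate isBad) {
        int lo = 0;
        int hi = n - 1;
        int builds = 0;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            builds++;                      // one build-and-measure per probe
            if (isBad.test(mid)) {
                hi = mid;                  // culprit is mid or earlier
            } else {
                lo = mid + 1;              // culprit is after mid
            }
        }
        return new Result(lo, builds);
    }
}
```

For 20 candidates the loop never needs more than 5 probes, whichever commit is the culprit, which is why a full clean build per step is affordable.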
3. Profile under load
Once you suspect a region of code, profile:
# async-profiler: CPU samples
$ASYNC_PROFILER/profiler.sh -d 60 -f /tmp/dag.html -e cpu <AM-pid>
# JFR: GC + allocation
jcmd <AM-pid> JFR.start name=tez duration=60s filename=/tmp/dag.jfr
Profile the AM, not the submitting client. The AM is the long-running process where contention manifests.
For a per-task profile:
// In a one-off test only — never in production code
conf.set(TezConfiguration.TEZ_TASK_LAUNCH_CMD_OPTS,
"-agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=/tmp/task-%p.jfr");
4. Attribute the cost
Read the flame graph. A single fat frame above the noise floor is your target. Most Tez regressions land in one of three buckets:
- Lock contention on AsyncDispatcher.eventQueue or VertexImpl.writeLock.
- Allocation pressure from IFile.Writer or MergeManager building short-lived buffers in a tight loop.
- GC overhead from a long-lived collection that grows unbounded (e.g. a HashMap keyed by TaskAttemptId that is never pruned).
5. Ship a fix with numbers
A Stage 10 JIRA description must include:
Methodology:
- Hardware: 12-core M3 Pro, 36GB RAM.
- Command: mvn -pl tez-tests test -Dtest=...
- Runs: 5 cold, 10 warm, report median + p95.
- Hadoop profile: hadoop28.
Before (TEZ master at <hash>): median 19.0s, p95 22.1s.
After (this patch on top): median 12.4s, p95 13.7s.
Profile evidence: flame graph attached. AsyncDispatcher.handle was 38% CPU
before, 4% after.
A reviewer will ask for the profile artifact. Attach it.
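Reviewers will also want to know how the median and p95 were computed, since percentile definitions differ between tools. One reasonable choice, sketched below, is the nearest-rank method; the class name is illustrative and the definition is an assumption you should state in the JIRA.

```java
import java.util.Arrays;

/** Summary statistics for benchmark wall times; uses the nearest-rank percentile. */
final class RunStats {
    /** Median of the samples; averages the two middle values for an even count. */
    static double median(double[] samples) {
        double[] sorted = sortedCopy(samples);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Nearest-rank percentile: the smallest sample with at least p percent at or below it. */
    static double percentile(double[] samples, double p) {
        double[] sorted = sortedCopy(samples);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    private static double[] sortedCopy(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
        double[] copy = samples.clone();   // leave the caller's array untouched
        Arrays.sort(copy);
        return copy;
    }
}
```

With only ten warm runs the nearest-rank p95 is simply the slowest run, which is worth saying explicitly so nobody reads more precision into it than the sample size supports.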
Walked example A — AsyncDispatcher queue contention
Symptom: AM throughput collapses on DAGs with > 10k tasks. Profile shows 38%
of CPU is in AsyncDispatcher.handle under LinkedBlockingQueue.put.
Step 1 — Diagnose
cd ~/tez-src
grep -n "LinkedBlockingQueue\|eventQueue" \
tez-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
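(The class is technically Hadoop's AsyncDispatcher, but Tez subclasses and
configures it in tez-common.) The queue is multi-producer, single-consumer:
many threads call handle() while one dispatcher thread drains it, so every
producer contends on the same lock. A partitioned queue would spread that contention.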
Step 2 — The fix surface
Two acceptable approaches:
- Sharded dispatcher: partition events by destination ID so each shard has its own queue. Tez has the building blocks but not the wiring; the patch is the wiring.
- Batched event submission: collect events on the producer side and submit in groups, reducing lock acquisitions per task.
Both are large patches. The Stage 10 contribution is one of them, with a clear scope: "sharded dispatcher for vertex events only", not "rewrite AsyncDispatcher".
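The sharding idea can be sketched with one single-threaded executor per shard, routing by a hash of the destination ID. This is a hypothetical illustration of the approach, not the Tez wiring: ShardedDispatcher and its methods are invented names, and string IDs stand in for vertex or task IDs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of sharding by destination ID; not the Tez dispatcher. */
final class ShardedDispatcher {
    private final ExecutorService[] shards;

    ShardedDispatcher(int shardCount) {
        shards = new ExecutorService[shardCount];
        for (int i = 0; i < shardCount; i++) {
            shards[i] = Executors.newSingleThreadExecutor();   // one queue and one thread per shard
        }
    }

    /** The same destination always maps to the same shard, so its events stay ordered. */
    int shardFor(String destinationId) {
        return Math.floorMod(destinationId.hashCode(), shards.length);
    }

    void dispatch(String destinationId, Runnable handler) {
        shards[shardFor(destinationId)].execute(handler);
    }

    /** Stops accepting events and waits for queued ones to finish. */
    boolean shutdownAndAwait(long timeoutMs) {
        for (ExecutorService shard : shards) {
            shard.shutdown();
        }
        try {
            for (ExecutorService shard : shards) {
                if (!shard.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Ordering is preserved per destination but not across shards, so any consumer that depends on a single global event sequence needs a separate answer before this design is viable.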
Step 3 — Numbers
For the sharded-dispatcher patch on a 10k-task OrderedWordCount:
Before: 19.0s median, 22.1s p95.
After: 12.4s median, 13.7s p95.
AsyncDispatcher.handle: 38% → 4% CPU.
These numbers go into the JIRA description, with a flame graph attached.
Step 4 — dev@ design ping
Any Stage 10 patch above ~50 lines deserves a dev@ thread:
Subject: [DISCUSS] TEZ-XXXX — shard AsyncDispatcher by destination type
I have a repro for AM throughput collapse on 10k-task DAGs. Profile attached.
Proposed fix: shard the AsyncDispatcher event queue by destination type
(Vertex / Task / TaskAttempt / Container). Numbers: 19s -> 12s median.
Open questions:
1. Default shard count: I propose 4 with a configurable override.
2. Compat: AsyncDispatcher is org.apache.hadoop, so we shim in tez-common.
3. Tests: TestAsyncDispatcher + the existing scheduler integration tests.
Comments welcome before I post the patch.
If a committer flags an unexpected constraint (e.g. "we cannot shard because ATS event ordering depends on global sequence"), redesign before coding.
Walked example B — IFile record encoding hot path
Symptom: profile shows 22% CPU in IFile.Writer.append under
WritableUtils.writeVInt. Allocation profile shows two byte[] per record.
Diagnose:
cd ~/tez-src
grep -n "writeVInt\|writeVLong\|new byte\[" \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
The hot path allocates a fresh byte[] per record for VInt encoding. The fix
is a reusable scratch buffer per Writer instance:
+ private final byte[] vIntBuf = new byte[9];
+
public void append(DataInputBuffer key, DataInputBuffer value) throws IOException {
- byte[] scratch = new byte[9];
- int n = encodeVInt(key.getLength(), scratch);
- out.write(scratch, 0, n);
+ int n = encodeVInt(key.getLength(), vIntBuf);
+ out.write(vIntBuf, 0, n);
...
}
The patch is six lines. The justification is the JMH micro:
JMH benchmark: IFileWriter.append for 1M small records.
Before: 14.2 us/op, 32B/op allocation.
After: 8.7 us/op, 0B/op allocation.
This is a textbook Stage 10 patch: small, measurable, attributable.
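The reusable-scratch-buffer pattern can be shown end to end in a few lines. Note the assumptions: ScratchWriter is a hypothetical class, and the encoding below is a plain 7-bits-per-byte varint chosen for brevity, not Hadoop's VInt wire format, so the bytes it produces are not IFile-compatible.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical writer; uses a 7-bit varint, not Hadoop's VInt wire format. */
final class ScratchWriter {
    private final byte[] scratch = new byte[5];      // a 32-bit varint needs at most 5 bytes
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    /** Encodes value into buf, low 7 bits first; returns the number of bytes written. */
    static int encodeVarInt(int value, byte[] buf) {
        int n = 0;
        while ((value & ~0x7F) != 0) {
            buf[n++] = (byte) ((value & 0x7F) | 0x80);   // set the continuation bit
            value >>>= 7;
        }
        buf[n++] = (byte) value;
        return n;
    }

    /** Appends a length prefix without allocating: the scratch buffer is reused per call. */
    void appendLength(int length) {
        int n = encodeVarInt(length, scratch);
        out.write(scratch, 0, n);
    }

    byte[] toByteArray() {
        return out.toByteArray();
    }
}
```

The buffer is per-instance state, so this is only safe while a writer is confined to one thread at a time; that constraint belongs in the patch's Javadoc.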
Pitfalls
- Don't ship a perf patch without numbers. Reviewers will reject it. "Looks faster" is not evidence.
- Don't benchmark on the same machine you developed on without warm-up. Always run cold + warm passes; report median + p95.
- Don't compare across different Hadoop profiles. Pick one profile and hold it constant.
- Don't widen the scope of a perf patch mid-review. "I found another hotspot while I was here" → new JIRA.
- Don't use micro-benchmark numbers in isolation. Always show the end-to-end impact too. A 2x improvement in IFile.Writer.append that yields 0.1% end-to-end improvement may not be worth merging.
- Don't git bisect against a tree with unrelated WIP. git bisect is deterministic only against a clean tree.
- Don't profile in production without the operator's consent. Even async-profiler has overhead; the operator should know.
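Reporting "median + p95" is easy to get subtly wrong, so it helps to fix the definition in code. The sketch below assumes the nearest-rank percentile definition (other definitions interpolate), and the class name and sample values are illustrative, not measurements.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for summarising warm-run timings.
final class LatencySummary {
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    // Formats a summary line in the same shape as the numbers quoted in a JIRA.
    static String summarize(double[] warmRunsSeconds) {
        return String.format(Locale.ROOT, "%.1fs median, %.1fs p95",
                percentile(warmRunsSeconds, 50), percentile(warmRunsSeconds, 95));
    }
}
```

With only a handful of runs, p95 collapses to the maximum, which is one reason to state the run count next to the numbers.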
Exit criteria — when you're ready for the next stage
Move to Stage 11 when:
- You have shipped one perf patch with documented before/after numbers and an attached profile.
- You can git bisect 20 commits without referring to documentation.
- You have read at least one async-profiler flame graph for Tez and identified the hotspot without help.
- A committer has accepted your patch's methodology section as sufficient evidence.
Stage 11 takes you into the compatibility contract.
Stage 11 — Backward Compatibility
What this stage teaches
Stage 11 is where every change you make is constrained by what was there before. You learn:
- The Apache @InterfaceAudience and @InterfaceStability annotations and what they obligate you to preserve.
- The Tez API surface: which packages are Public, which are LimitedPrivate("Hive,Pig"), and which are Private. The audience determines the cost of breaking a contract.
- How to evolve a protobuf message without breaking older clients (optional fields, never reuse field numbers, never change a field type).
- The deprecation cycle: how long a deprecated symbol must remain before removal, and what evidence is required to declare it ready for removal.
- How to negotiate the dev@ conversation when a change is technically compatible but operationally disruptive.
The patches in this stage are often small. The thread is long. A compatibility change without a dev@ design thread is a Stage 11 patch that will be reverted.
The annotation taxonomy
Three audience levels:
| Annotation | Meaning | Examples |
|---|---|---|
| @InterfaceAudience.Public | Any external consumer may call this. Removal is a major-version break. | TezClient, DAG, Vertex, Edge, Processor, most of tez-api. |
| @InterfaceAudience.LimitedPrivate({"Hive","Pig"}) | Only the named projects may call this. Coordinate with them before changing. | Some internal-ish tez-api helpers used by Hive's DagUtils. |
| @InterfaceAudience.Private | Internal to Tez. Free to change. | Everything in tez-dag/src/main/java/org/apache/tez/dag/app/.... |
Three stability levels:
| Annotation | Meaning |
|---|---|
| @InterfaceStability.Stable | Compatible across minor versions. Removal requires a major bump. |
| @InterfaceStability.Evolving | May change between minor versions, but deprecation cycle expected. |
| @InterfaceStability.Unstable | Free to break at any time. |
The combined matrix gives nine cells. Most public Tez API is Public + Stable:
the most expensive to change. Most internal Tez API is Private + Unstable:
free to change.
Find the annotations:
cd ~/tez-src
grep -rn "@InterfaceAudience\|@InterfaceStability" tez-api/src/main/java | head -20
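Picking a cell in the audience × stability matrix can also be done programmatically. The sketch below uses a local stand-in annotation (Compat) and sample classes, all hypothetical; it does not use the real Hadoop or Tez annotation types, for which the grep above is the reliable route.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical stand-ins that mirror the audience and stability axes.
final class CompatCellSketch {
    enum Audience { PUBLIC, LIMITED_PRIVATE, PRIVATE }
    enum Stability { STABLE, EVOLVING, UNSTABLE }

    // Runtime retention so the annotation is visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Compat {
        Audience audience();
        Stability stability();
    }

    // Most expensive cell to change.
    @Compat(audience = Audience.PUBLIC, stability = Stability.STABLE)
    static final class SampleClientApi { }

    // Cheapest cell to change.
    @Compat(audience = Audience.PRIVATE, stability = Stability.UNSTABLE)
    static final class SampleAmInternal { }

    // Returns the matrix cell for a class, or "unannotated" if it carries no marker.
    static String cell(Class<?> type) {
        Compat compat = type.getAnnotation(Compat.class);
        if (compat == null) {
            return "unannotated";
        }
        return compat.audience() + " + " + compat.stability();
    }
}
```

An unannotated class is the interesting case in review: it has no declared contract, so the safe reading is to ask on dev@ before treating it as free to change.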
JIRA filter to find candidates
project = TEZ AND resolution = Unresolved
AND (text ~ "deprecate" OR text ~ "compatibility"
OR text ~ "InterfaceAudience" OR text ~ "protobuf"
OR labels = "incompatible")
ORDER BY priority DESC, updated DESC
Walked example A — adding an optional protobuf field
Symptom: Tez wants to add a per-vertex "originating-user-class" string to the DAGPlan so the AM can attribute resource usage. The DAGPlan is wire-serialised to YARN's RM cache, so older AMs must continue to deserialise plans without the new field.
Step 1 — Locate the proto
cd ~/tez-src
find . -name "*.proto" | head
grep -n "message VertexPlan" $(find . -name "*.proto") | head
Read the existing VertexPlan message. Note the highest field number in use
(say, 12). The new field must use a new number, not a recycled one.
Step 2 — The diff
--- a/tez-api/src/main/proto/DAGApiRecords.proto
+++ b/tez-api/src/main/proto/DAGApiRecords.proto
@@
message VertexPlan {
optional string name = 1;
...
optional int32 task_resource_memory_mb = 12;
+ // @since 0.10.4 — optional; old AMs ignore unknown fields.
+ optional string originating_user_class = 13;
}
Three rules:
- The field is optional. Never required: required fields break old readers. Tez uses proto2, where optional is the label to use for any field added after the message first shipped.
- The field number 13 has never been used before. Search the entire git history to confirm:
git log -p -S "= 13" -- tez-api/src/main/proto/DAGApiRecords.proto
- The comment names the introduction release. Future contributors will use it to decide whether the field is safe to assume in their code path.
Step 3 — Producer and consumer sides
The producer in tez-api/src/main/java/org/apache/tez/dag/api/DAG.java
sets the field when known and leaves it unset when not. The consumer in
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
must tolerate the unset case:
+ if (vertexPlan.hasOriginatingUserClass()) {
+ this.originatingUserClass = vertexPlan.getOriginatingUserClass();
+ } else {
+ this.originatingUserClass = null;
+ }
The reviewer will reject any consumer that calls getOriginatingUserClass()
without first calling hasOriginatingUserClass(). Proto2 optional fields
return a default ("" for strings) when unset, which is not the same as
"absent".
Step 4 — Test the back-compat
The test is a serialisation round-trip with an older binary deserialiser:
@Test
public void testOldAMCanDeserialiseNewPlan() throws Exception {
VertexPlan newPlan = VertexPlan.newBuilder()
.setName("v1")
.setOriginatingUserClass("com.example.Job")
.build();
byte[] wire = newPlan.toByteArray();
// NOTE: parsing with the current generated class only proves the round-trip.
// To act as an older AM, parse with a descriptor that lacks field 13
// (for example via DynamicMessage, or the VertexPlan class from an older jar).
VertexPlan parsed = VertexPlan.parseFrom(wire);
assertEquals("v1", parsed.getName());
// With an older descriptor, field 13 lands in getUnknownFields() and is
// ignored by the AM's logic. That is the contract.
}
A real test against an older Tez jar is also valuable; check it in as a resource.
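Why an old reader survives a new field is easier to see at the wire level. The sketch below hand-encodes two length-delimited protobuf fields (tag = field number << 3 | wire type 2) and decodes them with a reader that only knows field 1. It is a teaching aid under assumptions: it handles length-delimited fields only, the class is hypothetical, and real code should use the generated classes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical hand-rolled reader: knows field 1 (name), skips everything else.
final class OldReaderSketch {
    private static final int LENGTH_DELIMITED = 2;

    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static void writeString(ByteArrayOutputStream out, int fieldNumber, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVarint(out, (fieldNumber << 3) | LENGTH_DELIMITED);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    static int readVarint(byte[] buf, int[] pos) {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = buf[pos[0]++] & 0xFF;
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // The "old AM": reads field 1 and steps over any field number it does not know.
    static String readName(byte[] wire) {
        int[] pos = {0};
        String name = null;
        while (pos[0] < wire.length) {
            int tag = readVarint(wire, pos);
            if ((tag & 7) != LENGTH_DELIMITED) {
                throw new IllegalArgumentException("sketch handles length-delimited fields only");
            }
            int length = readVarint(wire, pos);
            if ((tag >>> 3) == 1) {
                name = new String(wire, pos[0], length, StandardCharsets.UTF_8);
            }
            pos[0] += length; // unknown fields are skipped, not rejected
        }
        return name;
    }

    // The "new client": writes field 1 plus the new field 13.
    static byte[] newWriterPlan() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, 1, "v1");
        writeString(out, 13, "com.example.Job");
        return out.toByteArray();
    }
}
```

The skip works because the length prefix tells the reader how far to jump without understanding the payload, which is also why recycling a field number with a different meaning is so dangerous.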
Walked example B — deprecating a public method
Symptom: TezClient.submitDAG(DAG) returns a DAGClient whose getDAGStatus
contract is unclear. A new method submitDAGWithStatus(DAG) returns a
typed future. The old method should be deprecated.
The diff
--- a/tez-api/src/main/java/org/apache/tez/client/TezClient.java
+++ b/tez-api/src/main/java/org/apache/tez/client/TezClient.java
@@
+ /**
+ * @deprecated as of 0.10.4. Use {@link #submitDAGWithStatus(DAG)} which
+ * returns a typed future. This method will be removed in 0.11.0.
+ * See <a href="https://issues.apache.org/jira/browse/TEZ-XXXX">TEZ-XXXX</a>.
+ */
+ @Deprecated
public DAGClient submitDAG(DAG dag) throws ... { ... }
Rules for deprecation:
- The Javadoc names the replacement, the removal version, and the JIRA with the rationale.
- The @Deprecated annotation is on the method, not the class.
- The implementation is unchanged. Deprecation is a docs-and-annotation change; behaviour stays the same so existing callers continue to work.
- Never delete a deprecated method in the same patch. Deprecation and removal are separate releases. The minimum cycle in Tez is one minor release as deprecated, then removal in the next major.
The removal patch goes in only when:
- The deprecation has been in a released version for at least one minor cycle.
- Search of downstream code (Hive, Pig, the Tez examples) confirms no remaining callers.
- A dev@ thread has confirmed removal is acceptable.
Walked example C — changing a LimitedPrivate("Hive") API
Symptom: a LimitedPrivate("Hive") helper in tez-api is mis-named. You
want to rename it.
This is not a free change, despite LimitedPrivate. The audience
("Hive") must be coordinated with. The workflow:
- File the TEZ ticket with the rename proposal.
- Search the Hive source for the existing name; if any caller uses it, write the HIVE-side patch first (deprecation-import shim).
- Add the new name in Tez. Keep the old name as a @Deprecated wrapper for one release.
- Remove the old name in Tez only after Hive has shipped a release that uses the new name.
The contribution often spans two Tez releases and two Hive releases. That is
the cost of LimitedPrivate.
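The "old name as a deprecated wrapper" step of the rename can be sketched in isolation. Everything below is hypothetical (the class, both method names, and the behaviour); the point is only the shape: the old name delegates to the new one so both stay in lockstep for the transition release.

```java
import java.util.Locale;

// Hypothetical helper class undergoing a rename.
final class RenameShimSketch {
    /** New, correctly named helper. */
    public static String normalizeQueueName(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * @deprecated use {@link #normalizeQueueName(String)}; kept for one
     * release so existing downstream callers keep compiling.
     */
    @Deprecated
    public static String normaliseQName(String raw) {
        // Pure delegation: no behaviour of its own that could drift.
        return normalizeQueueName(raw);
    }

    // True when the old and new names return the same result for the input.
    static boolean oldAndNewAgree(String raw) {
        return normaliseQName(raw).equals(normalizeQueueName(raw));
    }

    // Reflection check a unit test could use to assert the annotation is present.
    static boolean isDeprecated(String methodName) {
        try {
            return RenameShimSketch.class
                    .getMethod(methodName, String.class)
                    .isAnnotationPresent(Deprecated.class);
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```

A test like isDeprecated guards against someone dropping the annotation during a refactor while the downstream project still depends on the old name.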
Pitfalls
- Don't reuse a protobuf field number after removing a field. Reserve it with reserved 7; in the proto file. Recycling a number breaks cross-version readers in undetectable ways.
- Don't change the type of a protobuf field. string→bytes looks identical on the wire but is incompatible at parse time. Add a new field with a new number; deprecate the old.
- Don't widen a Private API to Public without a dev@ thread. Once public, you cannot retract.
- Don't remove a @Deprecated method in the same release that introduces the deprecation. That defeats the purpose of deprecation.
- Don't change the default value of a configuration key without a dev@ thread. Default changes are invisible to compile-time checks but catastrophic in production. They are a Stage 12-adjacent change.
- Don't introduce a new Stable annotation lightly. Once Stable, the method is locked for a major-version cycle.
- Don't assume Hadoop's compatibility annotations are identical in meaning. They are similar but have project-specific nuance; read the Tez project's BUILDING.txt and the dev@ archive before relying on them.
Exit criteria — when you're ready for the next stage
Move to Stage 12 when:
- You have shipped one compatibility-sensitive change (a protobuf evolution, a deprecation, or an API rename) with explicit annotations and dev@ sign-off.
- You can recite the audience × stability matrix and pick the correct cell for an arbitrary tez-api class.
- You have written a deprecation Javadoc that named the replacement, the removal version, and the JIRA without being prompted.
- You have read the BUILDING.txt and dev@-archived compatibility guidance for Tez and Hadoop.
Stage 12 is the final stage: release-blocking issues and PMC-level work.
Stage 12 — Release-Blocking Issues
What this stage teaches
Stage 12 is the committer/PMC stage. You learn:
- The four categories of release blockers: data loss, correctness regressions, AM crash, security CVE.
- How to triage a candidate blocker during an RC vote: what evidence is required, who must be CC'd, and what the deadline-pressure tradeoffs are.
- The Apache release process from a committer's seat: building an RC, signing artifacts, calling a [VOTE] thread, the 72-hour rule, and the meaning of +1 binding, -1 binding, +1, and 0 votes.
- The Tez release notes format and what a release blocker contributes to it.
- Security CVE handling: the private security@ list, embargoed disclosure, and the path from private patch to public release.
This is the only stage where you may be voting on someone else's work as much as writing your own. The patch surface is identical to earlier stages; the context in which you act is different.
JIRA filter to find candidates
project = TEZ
AND priority in (Blocker, Critical)
AND resolution = Unresolved
ORDER BY priority DESC, updated DESC
The set is small at any given time. During an RC vote it grows fast.
A second filter for the RC voting period:
project = TEZ AND priority = Blocker AND created > -7d
The four categories of release blockers
1. Data loss
The strictest category. Any code path where a successfully-acknowledged write can be lost, or a successfully-acknowledged read can return wrong data, is a data-loss blocker. Examples in Tez history:
- A MergeManager spill that double-counted records and silently dropped one.
- A Fetcher that ignored a checksum mismatch and returned corrupted bytes to the downstream processor.
- A DAGRecovery path that reconstructed an incorrect parent vertex state after AM restart.
Triage: the JIRA description must contain a deterministic repro that the release manager can run in under five minutes. Without a repro, the issue is not a blocker — it is a "to be investigated" ticket.
2. Correctness regressions
A query that returned correct results in version N-1 returns wrong results in version N. The bar is lower than data loss (the data is still there; the output is wrong) but the triage is the same. A correctness regression that affects a single Hive query path is a blocker.
3. AM crash
Any reproducible InvalidStateTransitonException in master is a blocker
during an RC. Operators expect the AM to survive their workload. An AM
crash on a Hive-emitted DAG that worked in the previous release blocks the
RC even if the DAG itself is "unusual" — the AM must be defensive against
its inputs.
4. Security CVE
A demonstrated CVE in a Tez-owned class is a blocker regardless of whether
it has been exploited. The disclosure path is security@tez.apache.org
first, then the public JIRA only after the fix is ready.
Triage during an RC vote
The RC vote pattern on dev@:
Subject: [VOTE] Release Apache Tez 0.10.4 (RC1)
Hi,
I've prepared the first release candidate for Tez 0.10.4. The artifacts
are at:
https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/
The git tag is:
https://github.com/apache/tez/releases/tag/release-0.10.4-rc1
The release notes are:
CHANGES.txt at the top of the tag.
Please verify the signatures, run the smoke tests, and vote:
[+1] release this RC
[0] no opinion
[-1] do not release (please explain)
The vote is open for 72 hours.
Your job, as a contributor evaluating the RC:
- Verify the artifact:
curl -O https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/apache-tez-0.10.4-src.tar.gz
curl -O https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/apache-tez-0.10.4-src.tar.gz.asc
gpg --verify apache-tez-0.10.4-src.tar.gz.asc apache-tez-0.10.4-src.tar.gz
- Build from source:
tar xf apache-tez-0.10.4-src.tar.gz
cd apache-tez-0.10.4-src
mvn clean install -DskipTests -Phadoop28
- Run a smoke test:
mvn -pl tez-tests test -Dtest=TestExternalTezServices -Phadoop28
- Reply on the vote thread with your evidence.
Vote semantics
| Vote | Meaning |
|---|---|
| +1 binding | PMC member endorses release. At least three are required, and binding +1 votes must outnumber binding -1 votes. |
| +1 | Non-PMC endorses. Counts for momentum, not the binding count. |
| 0 | No opinion. Often used to indicate "I built it, smoke test passed, but I can't speak to my use case." |
| -1 binding | PMC member objects. ASF release votes are decided by majority approval, so a single -1 is not a formal veto, but release managers usually cancel the RC when the objection is well evidenced. |
| -1 | Non-PMC objection. Not binding, but committers will read it. |
A -1 vote must include the reason. "Build failed" is not enough; "build
failed because X test fails reproducibly on Hadoop 3.x profile, evidence at
URL" is.
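The counting rule behind a release vote can be written down in a few lines. This sketch assumes the general ASF majority-approval rule for releases (at least three binding +1 votes, and more binding +1 than binding -1); the class and type names are hypothetical, and a project may layer its own conventions on top.

```java
import java.util.List;

// Hypothetical tally for a [VOTE] thread.
final class ReleaseVoteTally {
    enum Kind { PLUS_ONE, ZERO, MINUS_ONE }

    static final class Vote {
        final Kind kind;
        final boolean binding;

        Vote(Kind kind, boolean binding) {
            this.kind = kind;
            this.binding = binding;
        }
    }

    // Only binding votes count toward the result; non-binding votes are advisory.
    static boolean passes(List<Vote> votes) {
        int bindingPlus = 0;
        int bindingMinus = 0;
        for (Vote vote : votes) {
            if (!vote.binding) {
                continue;
            }
            if (vote.kind == Kind.PLUS_ONE) {
                bindingPlus++;
            } else if (vote.kind == Kind.MINUS_ONE) {
                bindingMinus++;
            }
        }
        return bindingPlus >= 3 && bindingPlus > bindingMinus;
    }
}
```

The arithmetic is the easy part. A well-evidenced non-binding -1 does not change the count, but it often changes what the binding voters do next.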
Walked example — discovering a blocker during RC vote
Symptom: during the 0.10.4 RC1 vote, you run the smoke test and observe a
test failure in TestShuffleManager#testReadErrorReportDebounce that did
not happen in 0.10.3.
Step 1 — Reproduce
cd apache-tez-0.10.4-src
for i in 1 2 3; do
mvn -pl tez-runtime-library test \
-Dtest=TestShuffleManager#testReadErrorReportDebounce -q 2>&1 | tail -5
done
If the failure is 3/3, it is reproducible. If 1/3, it is a flake (Stage 9 issue, not a blocker).
Step 2 — Identify the cause
# Run this in a git clone; the source tarball carries no history.
cd ~/tez-src
git log v0.10.3..release-0.10.4-rc1 -- \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped
You see a commit that changed the debounce window default from 5000ms to 500ms. The test was written against 5000ms; the change silently broke it.
Step 3 — Decide blocker vs not
A failing unit test in an RC is not automatically a blocker. The question is: does the underlying behaviour change affect production?
- If the default change is intentional and the test should be updated → not a blocker. Fix the test in 0.10.4 hotfix or 0.10.5.
- If the default change is unintentional or it breaks production users → blocker. RC1 must be cancelled; RC2 reverts the default change.
For this example, suppose the default change was intentional but the release notes don't mention it. The behaviour change is operator-visible (fetch-failure reports now arrive 10x more often, may overwhelm the AM event queue). That makes it a blocker for a different reason than the test failure: an undocumented behaviour change.
Step 4 — Vote and document
Subject: Re: [VOTE] Release Apache Tez 0.10.4 (RC1)
[-1] non-binding
While building the RC and running the smoke tests, I observed:
TestShuffleManager#testReadErrorReportDebounce fails 3/3 runs.
Root cause: commit <hash> changed the default of
tez.runtime.shuffle.fetch-failure.report.cooldown-ms from 5000 to 500.
This is operator-visible behaviour change not noted in CHANGES.txt.
Recommendation: either revert the default in RC2 with the new default
deferred to 0.11.0, or keep the new default and update CHANGES.txt to
flag the operator impact and update the test.
Filed TEZ-XXXX with the analysis.
The release manager will respond. RC2 will either fix the issue (cancel, rebuild, vote again) or argue why the change is acceptable.
Release notes
The Tez release notes live in CHANGES.txt at the repo root, organised by
release. The format:
Release 0.10.4 - 2026-XX-XX
NEW FEATURES:
TEZ-XXXX. Sharded AsyncDispatcher for high-fanout DAGs. (you)
IMPROVEMENTS:
TEZ-YYYY. Make DAGPlan size limit configurable. (you)
BUG FIXES:
TEZ-ZZZZ. Release held containers on AMRM onError. (you)
INCOMPATIBLE CHANGES:
TEZ-AAAA. Default of tez.runtime.shuffle.fetch-failure.report.cooldown-ms
changed from 5000 to 500. Operators of long-running session AMs
should evaluate AM event-queue capacity. (you)
Every patch that lands during the release cycle gets a line. The release manager assembles the file from the JIRA "Fix Version" field; contributors make the lines short and accurate.
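Keeping entries "short and accurate" is easier with a fixed shape. The helper below is a hypothetical sketch that formats one entry in the pattern shown above (key, summary ending in a full stop, contributor in parentheses); it is not a tool from the Tez build, and the TEZ-0000 key used with it is a placeholder.

```java
// Hypothetical formatter for a single CHANGES.txt entry.
final class ChangesLineSketch {
    static String entry(String jiraKey, String summary, String contributor) {
        if (!jiraKey.matches("TEZ-\\d+")) {
            throw new IllegalArgumentException("expected a key like TEZ-0000, got " + jiraKey);
        }
        String text = summary.trim();
        if (!text.endsWith(".")) {
            text = text + "."; // entries in the sample above end with a full stop
        }
        return jiraKey + ". " + text + " (" + contributor + ")";
    }
}
```

For example, entry("TEZ-0000", "Make DAGPlan size limit configurable", "you") yields a line in the same shape as the IMPROVEMENTS entry above.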
Security CVE pipeline
The path from "I think I found a CVE" to a public release:
- Do not file a public JIRA. Email security@tez.apache.org (the private list, monitored by PMC members).
- Wait for acknowledgement (typically within 48 hours).
- Work with the security responder on a fix privately, in a private branch.
- Once the fix is ready, request a CVE ID via the Apache security team (or MITRE via the responder).
- Build a release that includes the fix.
- Publish the release; then the CVE is disclosed publicly with a JIRA.
The embargo window is typically 30–90 days. Contributors who report through the private channel and respect the embargo are credited in the advisory.
Pitfalls
- Don't +1 a release you have not built and smoke-tested. A +1 carries weight; do not give it as a courtesy.
- Don't -1 without evidence. A well-evidenced -1 usually halts the RC; the bar for evidence is high.
- Don't escalate a Stage 9 flake to a blocker. Reproduce three times before voting.
- Don't disclose a security vulnerability publicly before the embargo expires. Apache projects take this very seriously; a leak can lose you committer status.
- Don't file Priority: Blocker casually. Reserve it for the four categories above. JIRA pollution diminishes the signal.
- Don't merge a "must-have" fix during an active RC vote without cancelling the RC first. Mid-vote merges invalidate the artifact and reset the 72-hour clock.
- Don't assume the release manager will catch your concern silently. Vote on the thread, even if only with a 0 and a comment.
Exit criteria — there is no next stage
Stage 12 is the final rung of this roadmap. The exit criterion is that you continue — you are now operating as a committer-track contributor. The next steps are not stages but ongoing practices:
- Participate in every RC vote with a built artifact and a smoke-test result, even if your vote is only a 0.
- Watch the security@ and dev@ lists daily.
- Mentor a new contributor through Stages 1–4 every year.
- Read every CHANGES.txt diff for every release line you care about.
- Send a quarterly note to dev@ on which areas of the codebase you are willing to review, so contributors know where to ask.
If you have walked all twelve stages, you are the Apache Tez committer the project needed when you started reading this book.
Deep Dives: Reading Order
This directory contains 21 deep-dive chapters. They are the reference material behind the Level curriculum. Each chapter is self-contained, but most chapters depend on a handful of earlier ones. Read in the order below the first time through; thereafter use the index as a lookup.
The chapters are grouped by subsystem. For each chapter we list:
- Title — the file.
- One-line summary — what you should walk away knowing.
- Consumed by — which Levels/Labs depend on it.
Group 1 — The DAG Model and the Client
These four chapters define "what is a Tez job" before any execution machinery exists.
| # | File | Summary | Consumed by |
|---|---|---|---|
| 1 | dag-model.md | DAG/Vertex/Edge as immutable plan; DAGPlan protobuf; validation rules | Level 1 (all labs); Level 2 lab 2.1 |
| 2 | logical-physical.md | How the logical DAG becomes a physical execution plan with concrete parallelism | Level 4 lab 4.2; Level 5 lab 5.1 |
| 3 | tez-client.md | Client-side bring-up: session mode, local resources, AM start, submission RPC | Level 3 lab 3.1; Level 7 lab 7.1 |
| 4 | dag-client.md | Status polling, kill, error reporting; RPC vs ATS backends | Level 3 lab 3.1; Level 8 lab 8.1 |
Start here. Without the DAG model in your head, every later chapter feels like trivia.
Group 2 — AM Lifecycle and Dispatch
| # | File | Summary | Consumed by |
|---|---|---|---|
| 5 | dag-app-master.md | AM as YARN application; dispatchers, heartbeats, recovery | Level 3 lab 3.2; Level 8 lab 8.2 |
| 6 | state-machines.md | Hadoop StateMachineFactory API; dispatcher invariants; tests | Level 4 labs 4.1, 4.3, 4.4 |
| 7 | event-routing.md | The event hierarchy; "events are the only mutation API" rule | Level 4 (all labs) |
These chapters explain how the AM mutates state. They must precede the per-entity lifecycle chapters that follow.
Group 3 — Per-Entity Lifecycle
| # | File | Summary | Consumed by |
|---|---|---|---|
| 8 | vertex-lifecycle.md | VertexImpl state machine: NEW → SUCCEEDED, plus failure/kill paths | Level 4 lab 4.2 |
| 9 | task-lifecycle.md | TaskImpl state machine; speculation; max-failed-attempts | Level 4 lab 4.3 |
| 10 | task-attempt-lifecycle.md | TaskAttemptImpl state machine; container assignment; termination causes | Level 4 lab 4.4; Level 8 lab 8.2 |
Read 8, 9, 10 in this order. Each refers backward to events from chapter 7 and state-machine primitives from chapter 6.
Group 4 — Input/Processor/Output
| # | File | Summary | Consumed by |
|---|---|---|---|
| 11 | ipo-abstractions.md | LogicalInput/LogicalOutput/Processor; lifecycle methods; merged inputs | Level 5 lab 5.1; Level 7 lab 7.1 |
| 12 | tez-runtime.md | TezTaskRunner2, LogicalIOProcessorRuntimeTask, the umbilical | Level 5 lab 5.1 |
These chapters live inside tez-runtime-internals and tez-runtime-library —
the JVM the task actually runs in.
Group 5 — Shuffle, Sort, and Counters
| # | File | Summary | Consumed by |
|---|---|---|---|
| 13 | shuffle-sort.md | Sorter implementations, IFile, ShuffleManager, Fetcher, MergeManager | Level 5 labs 5.2, 5.3 |
| 14 | counters-diagnostics.md | TezCounters, framework counters, custom counters, ATS publication | Level 8 lab 8.1 |
If you skip 13, do not attempt to debug shuffle issues in production. Always read it cold before opening a fetcher-related JIRA.
Group 6 — Scheduling and Resources
| # | File | Summary | Consumed by |
|---|---|---|---|
| 15 | scheduler.md | TaskSchedulerManager, YarnTaskSchedulerService, AMRM heartbeats | Level 6 lab 6.2 |
| 16 | container-reuse.md | AMContainerImpl lifecycle; reuse policy; idle timeouts | Level 6 labs 6.1, 6.2 |
| 17 | yarn-integration.md | YARN tokens, AMRM client, app master failover, log aggregation | Level 6 lab 6.2 |
Group 7 — Modes and Integrations
| # | File | Summary | Consumed by |
|---|---|---|---|
| 18 | local-mode.md | LocalContainerLauncher, debugging without YARN | Level 2 labs |
| 19 | hive-integration.md | Hive TezTask, edge usage, DynamicPartitionPruning, ATS spans | Level 7 (Hive labs h1–h6) |
Group 8 — Failure, Recovery, and Testing
| # | File | Summary | Consumed by |
|---|---|---|---|
| 20 | failure-handling.md | Task retry, vertex rerun, AM restart, recovery records | Level 8 lab 8.2 |
| 21 | testing-framework.md | MiniTezCluster, MockContainerLauncher, DrainDispatcher, fault injection | Level 2 labs; Level 4 labs |
A note on order vs index
The deep-dives are an index — they exist to be looked up later. The first read should follow the table above. But when you return to fix a bug, jump directly to the chapter most relevant and use the cross-references inside it.
Every chapter ends with a Validation: prove you understand this section. Treat that as the gate before declaring the chapter "read."
DAG Model
A Tez DAG is an immutable plan for a distributed computation. This chapter
describes the model classes (DAG, Vertex, Edge, EdgeProperty,
DataSourceDescriptor, DataSinkDescriptor, *Descriptor), the protobuf
representation that crosses the wire, and the validation rules that turn a
"DAG you wrote" into a "DAG the AM will accept."
After this chapter you should be able to write a small DAG by hand, predict
which EdgeManager implementation will be picked for each edge, and find any
classification rule in the source.
The classes you actually call from a client
All of these live in tez-api:
tez-api/src/main/java/org/apache/tez/dag/api/
DAG.java
Vertex.java
Edge.java
EdgeProperty.java
InputDescriptor.java
OutputDescriptor.java
ProcessorDescriptor.java
VertexManagerPluginDescriptor.java
DataSourceDescriptor.java
DataSinkDescriptor.java
EntityDescriptor.java (base class for all *Descriptors)
GroupInputEdge.java (multi-source unioning edge)
VertexGroup.java (group of vertices for grouped commits)
Use this command to inspect the API surface:
grep -n "^public " tez-api/src/main/java/org/apache/tez/dag/api/DAG.java | head -40
Every class above is immutable by convention once handed to TezClient.
You may mutate via the builder methods (addVertex, addEdge, addDataSource)
before submission. After submission the only way to change the plan is via
VertexManagerPlugin callbacks (see vertex-lifecycle.md
and the Level 4 lab on VertexManager).
EdgeProperty — three orthogonal axes
EdgeProperty.create(DataMovementType, DataSourceType, SchedulingType, OutputDescriptor, InputDescriptor)
is the single most important factory method in the API.
grep -n "enum " tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
The three enums:
| Enum | Values | What it controls |
|---|---|---|
| DataMovementType | ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM | How outputs are routed from src to dst tasks |
| DataSourceType | PERSISTED, PERSISTED_RELIABLE, EPHEMERAL | Durability of intermediate data |
| SchedulingType | SEQUENTIAL, CONCURRENT | Whether dst tasks must wait for src to finish |
Edge type matrix (movement × scheduling)
| Movement | Scheduling | Typical use | EdgeManager impl |
|---|---|---|---|
| SCATTER_GATHER | SEQUENTIAL | Map → Reduce shuffle | ScatterGatherEdgeManager (the AM-internal default) |
| ONE_TO_ONE | SEQUENTIAL | Sorted reducer → re-sorter (rare) | OneToOneEdgeManager |
| BROADCAST | SEQUENTIAL | Small-side join broadcast | BroadcastEdgeManager |
| CUSTOM | SEQUENTIAL | Hive cartesian product, custom partitioner | User-supplied EdgeManagerPlugin |
| BROADCAST | CONCURRENT | Streaming push between long-running tasks | BroadcastEdgeManager |
| SCATTER_GATHER | CONCURRENT | (Unusual — generally invalid for shuffles) | — |
Locate the actual EdgeManager implementations:
find tez-dag/src/main/java -name "*EdgeManager*"
Key files (exact names vary slightly by branch):
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/
OneToOneEdgeManagerOnDemand.java
ScatterGatherEdgeManager.java
BroadcastEdgeManager.java
Read the AM-side Edge.java (in tez-dag, not the tez-api Edge) to see how it wires the right manager based on
EdgeProperty:
grep -n "EdgeManager\|edgeManager\|createEdgeManager" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/Edge.java
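The selection described by the matrix can be pictured as a switch on the movement type. This is a sketch under assumptions: the enum is a local stand-in for the tez-api one, the method returns class names as strings taken from the table and file list above, and the real wiring in the AM may differ by branch.

```java
// Hypothetical illustration of "movement type picks the EdgeManager".
final class EdgeManagerChoiceSketch {
    // Local stand-in mirroring the values of the tez-api enum.
    enum DataMovementType { ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM }

    // customPluginClass is only consulted for CUSTOM edges.
    static String managerFor(DataMovementType movement, String customPluginClass) {
        switch (movement) {
            case ONE_TO_ONE:
                return "OneToOneEdgeManager";
            case BROADCAST:
                return "BroadcastEdgeManager";
            case SCATTER_GATHER:
                return "ScatterGatherEdgeManager";
            case CUSTOM:
                if (customPluginClass == null) {
                    throw new IllegalStateException("CUSTOM edge needs a user-supplied EdgeManagerPlugin");
                }
                return customPluginClass;
            default:
                throw new AssertionError("unhandled movement type: " + movement);
        }
    }
}
```

Note that SchedulingType does not appear in the switch: in the matrix above, BROADCAST maps to the same manager under both SEQUENTIAL and CONCURRENT scheduling.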
DataSourceDescriptor vs Input
Beginners frequently confuse these two:
| Concept | Class | Defined in | Lives during |
|---|---|---|---|
| Plan-time root-input definition | DataSourceDescriptor | tez-api | Client + AM (planning) |
| Runtime input attached to a task | Input (interface) | tez-api | Task JVM (execution) |
A DataSourceDescriptor describes "how to materialize splits for this vertex"
(controller class + input descriptor + (optional) initializer). The AM may run
an InputInitializer (e.g., MRInputAMSplitGenerator) to enumerate splits
before the vertex starts. The result of that initialization becomes
InputDataInformationEvents pushed to tasks (see
ipo-abstractions.md and
event-routing.md).
At task time the input class is instantiated from the InputDescriptor and
called with initialize() → start() → getReader() → close(). The task never
sees the DataSourceDescriptor.
The DAGPlan protobuf — the wire format
tez-api/src/main/proto/DAGApiRecords.proto
Inspect:
grep -n "^message " tez-api/src/main/proto/DAGApiRecords.proto
Key messages:
- DAGPlan: root message (name, vertices, edges, plan-level configs, credentials, ACLs).
- VertexPlan: name, processor descriptor, parallelism, location hints, associated edges, root inputs.
- EdgePlan: source/dest vertex names, edge properties, edge manager descriptor.
- TezEntityDescriptorProto: {class_name, user_payload, history_text}, the serialized form of any *Descriptor.
- RootInputLeafOutputProto: the protobuf encoding of DataSourceDescriptor and DataSinkDescriptor.
The conversion from API classes to protobuf happens in:
grep -rn "createDAGPlan\|toProtoFormat" tez-api/src/main/java/org/apache/tez/dag/api/ | head
Specifically DAG.createDag(...) and DagTypeConverters (a kitchen-sink
class of to/from helpers).
Validation — what DAG.verify() checks
grep -n "private void.*verify\|public void verify" \
tez-api/src/main/java/org/apache/tez/dag/api/DAG.java
DAG.verify(restricted=true) enforces, at minimum:
- Name uniqueness: vertex names and DAG name are unique.
- No cycles: DFS over the edge graph; throws IllegalStateException("DAG contains a cycle") if any back-edge is found.
- Parallelism rules:
  - ONE_TO_ONE edges require source.parallelism == dest.parallelism if both are statically set.
  - Vertices with BROADCAST outputs must have a finite parallelism (since each downstream task receives every output).
- Descriptor non-null for required slots (Processor, Output for vertices that produce, Input for vertices that consume).
- No "dangling" data sources: every root input is on a real vertex.
- VertexManagerPlugin specified explicitly for vertices that need dynamic reconfig (else a default is chosen; see vertex-lifecycle.md for the default rules).
Read the body of verify(...) line-by-line; the comments cite the JIRA that
added each check.
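The cycle check is the one rule in this list that is an algorithm rather than a comparison, so a standalone sketch may help before reading the real method. It is a generic depth-first search over vertex names, not the code in DAG.java; the class name and the adjacency-map representation are assumptions.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical DFS cycle check over a vertex-name adjacency map.
final class CycleCheckSketch {
    // Throws if following edges from any vertex leads back to a vertex on the current path.
    static void verifyAcyclic(Map<String, List<String>> edges) {
        Set<String> done = new HashSet<>();   // fully explored, known cycle-free
        Set<String> onPath = new HashSet<>(); // vertices on the current DFS path
        for (String vertex : edges.keySet()) {
            visit(vertex, edges, done, onPath);
        }
    }

    private static void visit(String vertex, Map<String, List<String>> edges,
                              Set<String> done, Set<String> onPath) {
        if (done.contains(vertex)) {
            return;
        }
        if (!onPath.add(vertex)) {
            // Reached a vertex already on the current path: a back-edge.
            throw new IllegalStateException("DAG contains a cycle");
        }
        for (String next : edges.getOrDefault(vertex, Collections.<String>emptyList())) {
            visit(next, edges, done, onPath);
        }
        onPath.remove(vertex);
        done.add(vertex);
    }

    static boolean hasCycle(Map<String, List<String>> edges) {
        try {
            verifyAcyclic(edges);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

Keeping two sets matters: a vertex reached twice through different parents (a diamond) is fine, while a vertex reached again while still on the current path is a back-edge.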
How a DAG becomes a plan, end-to-end
flowchart LR
A[User code: new DAG] --> B[addVertex/addEdge/addDataSource]
B --> C[TezClient.submitDAG]
C --> D[DAG.verify]
D -->|ok| E[DAG.createDag -> DAGPlan proto]
E --> F[RPC DAGClientAMProtocol.submitDAG]
F --> G[DAGAppMaster: DAGImpl init]
G --> H[VertexImpl per VertexPlan]
H --> I[Edge per EdgePlan; EdgeManager selected]
Each arrow has a citation:
- verify: DAG.verify(...).
- createDag: DAG.createDag(BinaryConfig, Credentials, Map<String,LocalResource>, JobTokenSecretManager, boolean tezLrsAsArchive).
- AM-side: DAGImpl.init() and VertexImpl.constructInputDescriptors(), Edge.<init> (in tez-dag, not the tez-api Edge).
Reading exercise
# Top-level surface
sed -n '1,80p' tez-api/src/main/java/org/apache/tez/dag/api/DAG.java
sed -n '1,80p' tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
sed -n '1,80p' tez-api/src/main/java/org/apache/tez/dag/api/Edge.java
# All the places where DAGPlan is constructed
grep -rn "DAGPlan.newBuilder" tez-api/src/main/java | head
# Cycle detection
grep -n "cycle\|cycleFound\|visit" \
tez-api/src/main/java/org/apache/tez/dag/api/DAG.java
Answer:
- What exception class does DAG.verify() throw on a cycle, and what does its message contain that helps a user diagnose the offending vertex?
- Which method on Vertex is used to attach a DataSourceDescriptor? Which to attach a DataSinkDescriptor?
- What is the role of DagTypeConverters and why is it preferred over each class owning its own toProto/fromProto methods?
- When you call Edge.create(srcV, dstV, EdgeProperty.create(...)), where is the resulting Edge registered? On the source vertex? Destination? The DAG itself?
- Suppose you call dag.addVertex(v) twice with the same v instance. What happens, and where in DAG.java is the protection?
- What is the difference between DataSourceType.PERSISTED and DataSourceType.PERSISTED_RELIABLE? Find the consumer (search tez-dag for uses of DataSourceType).
Common bugs and symptoms
| Symptom | Root cause | Where to look |
|---|---|---|
| IllegalStateException: DAG contains a cycle at submission | Accidentally added a back-edge | DAG.verify |
| Vertex starts with parallelism -1 and never runs | setParallelism(-1) and no VertexManagerPlugin to reconfigure | VertexImpl.initialize; check for "parallelism not set" |
| Job hangs with all vertices in INITED | A DataSourceDescriptor has an initializer that never emits events | Search AM log for InputInitializerEvent; cross-reference initializer impl |
| ClassNotFoundException at task start for your Processor | The class is in client classpath but not uploaded as a local resource | TezClient.addAppMasterLocalFiles not called; see tez-client.md |
| EdgeManager mismatch between sides; task hangs reading | Custom EdgeManagerPlugin returns inconsistent partition counts | Always run TestEdgeManagerSelf on your plugin |
| DAGPlan proto exceeds 64 MB | Encoding huge userPayload directly into the plan | Use a side file via LocalResource; payload is byte[] not free-form storage |
Validation: prove you understand this
- Write, on a whiteboard, a 4-vertex DAG with two SCATTER_GATHER edges and one BROADCAST edge. Annotate each edge with its three EdgeProperty enums. Justify each choice.
- Given an edge with (SCATTER_GATHER, PERSISTED, SEQUENTIAL), name the EdgeManager class that will be selected at runtime and the source file where the selection logic lives.
- From memory, list the five required arguments to EdgeProperty.create(...).
- Open DAG.verify() and identify the first five checks. For each, propose a one-line DAG that would fail it.
- In a new method getAllRootInputs(DAG), walk the DAG and return all DataSourceDescriptor objects across all vertices. Compile it; check against DAG.java's own helpers.
TezClient
TezClient is the client-side API: the class your driver code instantiates
to start an AM, submit DAGs, and (optionally) keep the AM alive across DAGs.
This chapter walks bring-up, the session vs non-session distinction, local
resource staging, RPC submission, and ATS hookup.
After this chapter you should be able to point at every line of code that runs
between TezClient.create(...) and the moment a DAG appears inside the AM
ready to be start()ed.
Files to open
tez-api/src/main/java/org/apache/tez/client/
TezClient.java
TezClientUtils.java
TezSessionImpl.java
FrameworkClient.java
TezYarnClient.java (YARN-backed FrameworkClient)
LocalClient.java (in-process FrameworkClient for local mode)
Plus the YARN-AM protocol definition:
tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClient.java
tez-api/src/main/proto/DAGClientAMProtocol.proto
Two modes: session and non-session
The mode is chosen at TezClient.create(...):
TezClient client = TezClient.create(
"MyApp",
tezConf,
isSession /* true = session mode */);
| Property | Non-session | Session |
|---|---|---|
| AM lifetime | Per DAG | Across many DAGs |
| start() semantics | No-op (AM launched at submitDAG) | Launches AM and waits for it to register |
| Allowed DAGs in flight | 1 | 1 (sequential within a session by default) |
| Keep-alive | n/a | tez.session.am.dag.submit.timeout.secs |
| Use case | One-shot jobs (CLI tools, scheduled batch) | Latency-sensitive (Hive, Pig, interactive) |
The AM keep-alive timer is critical. In session mode, after a DAG completes the AM waits for the configured timeout for a new DAG. If none arrives, it shuts down to free YARN resources. Find the timer:
grep -n "AMSessionDAGSubmitTimeout\|dag.submit.timeout" \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
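The keep-alive rule reduces to a small piece of bookkeeping: remember when the session last became idle and compare that against the timeout. This is a minimal sketch with hypothetical names (SessionIdleTimerSketch and its methods); the real DAGAppMaster uses its own timer task, and the clock is passed in here only to keep the example deterministic.

```java
/** Hypothetical sketch of a session idle timeout; not the DAGAppMaster implementation. */
final class SessionIdleTimerSketch {
  private final long timeoutMillis;
  private long idleSinceMillis;
  private boolean dagRunning;

  SessionIdleTimerSketch(long timeoutMillis, long nowMillis) {
    this.timeoutMillis = timeoutMillis;
    this.idleSinceMillis = nowMillis; // the session is idle from the moment the AM is up
  }

  /** A DAG arrived: the idle clock no longer applies. */
  void onDagSubmitted() {
    dagRunning = true;
  }

  /** The DAG finished: restart the idle clock from now. */
  void onDagCompleted(long nowMillis) {
    dagRunning = false;
    idleSinceMillis = nowMillis;
  }

  /** True once the session has been idle for at least the configured timeout. */
  boolean shouldShutDown(long nowMillis) {
    return !dagRunning && nowMillis - idleSinceMillis >= timeoutMillis;
  }

  public static void main(String[] args) {
    SessionIdleTimerSketch timer = new SessionIdleTimerSketch(300_000L, 0L);
    timer.onDagSubmitted();
    System.out.println(timer.shouldShutDown(1_000_000L)); // false: a DAG is running
    timer.onDagCompleted(1_000_000L);
    System.out.println(timer.shouldShutDown(1_100_000L)); // false: idle for 100 s
    System.out.println(timer.shouldShutDown(1_300_000L)); // true: idle for 300 s
  }
}
```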
Bring-up control flow
sequenceDiagram
participant U as User code
participant TC as TezClient
participant TCU as TezClientUtils
participant YC as TezYarnClient
participant RM as YARN RM
participant AM as DAGAppMaster
U->>TC: TezClient.create(name, conf, isSession)
U->>TC: addAppMasterLocalFiles(map)
U->>TC: start()
TC->>TCU: createApplicationSubmissionContext(...)
TCU->>TCU: stage local resources to HDFS
TCU->>TCU: build classpath & env
TC->>YC: submitApplication(appSubmissionContext)
YC->>RM: submitApplication
RM-->>YC: appId
Note over RM,AM: RM launches AM container
AM->>AM: serviceInit, serviceStart
AM-->>TC: AM registers via heartbeat; TC sees RUNNING
U->>TC: submitDAG(dag)
TC->>AM: DAGClientAMProtocol.submitDAG(rpcCall)
AM-->>TC: dagId
TC-->>U: DAGClient
Where each call lives:
- TezClient.start() → TezClientUtils.createFinalConfProtoForApp() → TezClientUtils.createApplicationSubmissionContext() → frameworkClient.submitApplication(...).
- TezClient.submitDAG(dag) → getSessionAMProxy() → dagAMProtocol.submitDAG(submitRequest) (the YARN AM proxy).
grep -n "submitApplication\|submitDAG\|dagAMProtocol" \
tez-api/src/main/java/org/apache/tez/client/TezClient.java
Local resources that TezClientUtils uploads
A YARN container starts with a clean working directory plus whatever local resources the AM submission context declares. For Tez, that includes:
- Tez framework tarball: pointed to by tez.lib.uris (or a local jar list). Contains tez-api.jar, tez-dag.jar, tez-runtime-*.jar, etc.
- User application jars: anything you added via TezClient.addAppMasterLocalFiles(Map<String, LocalResource>) plus addTaskLocalFiles.
- The DAGPlan: not a local resource. It is sent via the submitDAG RPC payload.
Inspect:
grep -n "tez.lib.uris\|TezConfiguration.TEZ_LIB_URIS\|addAppMasterLocalFiles" \
tez-api/src/main/java/org/apache/tez/client/TezClient.java \
tez-api/src/main/java/org/apache/tez/client/TezClientUtils.java
The AMRM token is delivered by YARN when the container starts; Tez does not manage it directly.
The submission RPC
The protocol is defined in:
tez-api/src/main/proto/DAGClientAMProtocol.proto
grep -n "rpc " tez-api/src/main/proto/DAGClientAMProtocol.proto
Key RPCs:
| RPC | What it does |
|---|---|
| submitDAG | Submit a new DAG to a running AM |
| getDAGStatus | Poll status (also used by DAGClient) |
| getVertexStatus | Poll a specific vertex |
| tryKillDAG | Initiate kill |
| shutdownSession | Stop the AM in session mode |
The RPC server lives in the AM (DAGClientHandler and its Protobuf
implementation):
grep -rn "DAGClientAMProtocol\|submitDAG" \
tez-dag/src/main/java/org/apache/tez/dag/api/client/ 2>/dev/null | head
ATS / Timeline Service integration
When tez.history.logging.service.class is set to
ATSHistoryLoggingService (the default in many distros), TezClient does
not publish events itself — the AM does, via the HistoryEventHandler.
However, TezClient does:
- Set tez.history.logging.service.class into the AM env.
- Provide ATS credentials in the application submission context.
Read:
grep -rn "ATSHistoryLoggingService\|YARN_TIMELINE_SERVICE" \
tez-api/src/main/java/org/apache/tez/client/
For the AM-side, see counters-diagnostics.md.
TezSessionImpl vs TezClient
There is a subclass relationship: TezSessionImpl was the older name; modern
Tez uses TezClient with isSession=true, but TezSessionImpl still
appears in some codepaths. The two are largely interchangeable. Inspect both:
grep -n "class TezClient\|class TezSessionImpl" \
tez-api/src/main/java/org/apache/tez/client/*.java
Reading exercise
sed -n '1,120p' tez-api/src/main/java/org/apache/tez/client/TezClient.java
grep -n "submitDAG\b" tez-api/src/main/java/org/apache/tez/client/TezClient.java
grep -n "stopSession\|stop\|close" \
tez-api/src/main/java/org/apache/tez/client/TezClient.java
grep -rn "submitApplication" tez-api/src/main/java/org/apache/tez/client/
Answer:
- What is the difference between TezClient.stop() in session vs non-session mode?
- When TezClient.submitDAG() is called for a DAG that conflicts with one currently running in the session, what happens?
- Find the timeout used while waiting for the AM to reach RUNNING after start(). Which config key controls it?
- What pre-condition does submitDAG enforce on the DAG's vertex names with respect to previously-submitted DAGs in the same session?
- Trace addAppMasterLocalFiles(...) end-to-end. Where do those files end up on HDFS?
- Why is tez.lib.uris sometimes a directory and sometimes a tarball? What does TezClientUtils.setupTezJarsLocalResources do for each case?
Common bugs and symptoms
| Symptom | Root cause | Fix |
|---|---|---|
| AM never reaches RUNNING; client hangs in start() | tez.lib.uris points to a path the NodeManager can't read | Verify HDFS perms; check NM logs |
| submitDAG throws SessionNotRunning | AM died (idle timeout, crash) | Catch, recreate TezClient, resubmit |
| submitDAG blocks forever | Previous DAG still in flight in the session | Don't reuse session for parallel DAGs; or wait |
| IOException: Failed to submit application | RM rejected (queue full, ACL) | Inspect RM logs; verify queue config |
| AM starts but cannot talk back to client | Client behind NAT; AM cannot reach client's RPC server | Use polling-only DAGClient; avoid callbacks |
| Tasks fail with ClassNotFoundException for user code | addTaskLocalFiles not called for that jar | Add jars via both addAppMasterLocalFiles and addTaskLocalFiles if used in tasks |
Validation: prove you understand this
- Write a 30-line Java driver that creates a TezClient in session mode, submits two DAGs back-to-back, prints both DAGClient.getDAGStatus() results, and shuts down cleanly.
- From TezClient.java, list every method that ultimately reaches dagAMProtocol.
- Explain why addAppMasterLocalFiles is a Map<String, LocalResource> and not a List<Path>.
- From the proto file DAGClientAMProtocol.proto, write the exact request message used by submitDAG.
- Reproduce the "AM idle timeout" path on MiniTezCluster: submit one DAG, wait past the configured timeout, attempt a second submit, observe the exception class and message.
DAGClient
DAGClient is the read-only client-side handle to a submitted DAG. It is
returned by TezClient.submitDAG(...) and lives until the DAG completes (or
the user kills it). This chapter covers status polling, the
StatusGetOpts flag, the RPC vs ATS backends, error reporting, and the
contract DAGClient exposes to callers like Hive, Pig, and CLI drivers.
After this chapter you should know which backend a given DAGClient instance
is using, what fields will be populated, and which calls block vs poll.
Files to open
tez-api/src/main/java/org/apache/tez/dag/api/client/
DAGClient.java (abstract base)
DAGStatus.java (the snapshot type)
VertexStatus.java
Progress.java
StatusGetOpts.java (enum: GET_COUNTERS, GET_MEMORY_USAGE)
rpc/
DAGClientRPCImpl.java (talks to the AM via DAGClientAMProtocol)
DAGClientImplLocal.java (in-process; for LocalClient)
registry/ (service discovery if applicable)
ATS-backed variant:
tez-plugins/tez-yarn-timeline-history-with-fs/ or
tez-plugins/tez-yarn-timeline-history/
src/main/java/org/apache/tez/dag/api/client/DAGClientTimelineImpl.java
(Module names vary across versions; locate with find . -name "DAGClientTimelineImpl.java".)
Core API
public abstract class DAGClient implements Closeable {
public abstract String getExecutionContext();
public abstract DAGStatus getDAGStatus(Set<StatusGetOpts> opts) throws ...;
public abstract DAGStatus getDAGStatus(Set<StatusGetOpts> opts, long timeoutMillis) throws ...;
public abstract VertexStatus getVertexStatus(String vertexName, Set<StatusGetOpts> opts) throws ...;
public abstract DAGStatus waitForCompletion() throws ...;
public abstract DAGStatus waitForCompletionWithStatusUpdates(Set<StatusGetOpts> opts) throws ...;
public abstract void tryKillDAG() throws ...;
// ...
}
grep -n "public abstract\|public " \
tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClient.java
DAGStatus — what callers actually consume
grep -n "public " tez-api/src/main/java/org/apache/tez/dag/api/client/DAGStatus.java
Fields you'll see in production triage:
| Field | Populated by | Notes |
|---|---|---|
| state (DAGStatus.State) | Always | SUBMITTED/INITING/RUNNING/SUCCEEDED/FAILED/KILLED/ERROR |
| progress | RPC backend; ATS backend may lag | Progress per vertex + aggregate |
| diagnostics | On terminal states | Newline-joined messages |
| counters | Only if StatusGetOpts.GET_COUNTERS passed | Expensive over RPC |
| memoryUsage | Only if StatusGetOpts.GET_MEMORY_USAGE passed | Aggregated across containers |
Note: state is not the same as VertexStatus.State. Vertex states are
richer (INITED, RUNNING, COMMITTING, SUCCEEDED, etc.) — see
vertex-lifecycle.md. DAG state is a roll-up.
RPC backend: DAGClientRPCImpl
grep -n "DAGClientAMProtocol\|proxy" \
tez-api/src/main/java/org/apache/tez/dag/api/client/rpc/DAGClientRPCImpl.java
Behavior:
- Each getDAGStatus(opts) is a synchronous RPC to the AM.
- Default timeout per call is governed by tez.dag.am.client.am-connect-timeout-secs.
- If GET_COUNTERS is set, the AM serializes the entire TezCounters tree (potentially MBs); avoid in tight loops.
- waitForCompletion() is implemented as a polling loop with backoff. Find the loop:
grep -n "waitForCompletion\|sleep\|poll" \
tez-api/src/main/java/org/apache/tez/dag/api/client/rpc/DAGClientRPCImpl.java
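A polling loop with capped backoff has a simple general shape. The sketch below assumes a hypothetical PollWithBackoffSketch class, a Supplier standing in for the status RPC, and a doubling backoff; the actual sleep values and growth rule in DAGClientRPCImpl may differ, so read the source for the real numbers.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch of a status polling loop with capped backoff; not Tez code. */
final class PollWithBackoffSketch {
  /** Mirrors the DAG state names listed in this chapter. */
  enum State { SUBMITTED, INITING, RUNNING, SUCCEEDED, KILLED, FAILED, ERROR }

  private static final Set<State> TERMINAL =
      EnumSet.of(State.SUCCEEDED, State.KILLED, State.FAILED, State.ERROR);

  /** Polls statusCall until it reports a terminal state, doubling the sleep up to maxSleepMs. */
  static State waitForTerminal(Supplier<State> statusCall, long initialSleepMs, long maxSleepMs) {
    long sleepMs = initialSleepMs;
    while (true) {
      State state = statusCall.get();
      if (TERMINAL.contains(state)) {
        return state;
      }
      try {
        Thread.sleep(sleepMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        throw new IllegalStateException("Interrupted while waiting for completion", e);
      }
      sleepMs = Math.min(maxSleepMs, sleepMs * 2);
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fake status source: RUNNING for the first three polls, then SUCCEEDED.
    State result = waitForTerminal(
        () -> ++calls[0] < 4 ? State.RUNNING : State.SUCCEEDED, 1L, 8L);
    System.out.println(result + " after " + calls[0] + " polls"); // SUCCEEDED after 4 polls
  }
}
```

Capping the sleep bounds how late the caller can learn about completion, which is why the maximum sleep is worth finding in the real loop.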
ATS backend: DAGClientTimelineImpl
When the AM has exited but ATS retains history, status is fetched from the ATS REST API (or RM web UI) instead. This is critical for post-mortem and "why did my job fail" UIs.
Behavior differences from RPC:
- Eventually consistent (ATS publication is async; see counters-diagnostics.md).
- state is the final state recorded; intermediate states between two ATS events are invisible.
- Counters are available if ATSHistoryLoggingService was active and the event made it past the publisher queue.
Search for the fallback path that picks ATS when RPC fails:
grep -rn "DAGClientTimelineImpl\|getDAGAndAMURL\|RPCFailed\|amProxyFailed" \
tez-api/src/main/java/org/apache/tez/dag/api/client/ \
tez-api/src/main/java/org/apache/tez/client/
tryKillDAG() — the only mutation
Although DAGClient is described as a read-only handle, it has exactly one
mutating method: tryKillDAG. It asks the AM to start the kill path, but does
not block until the DAG is dead.
grep -n "tryKillDAG\|killDAG" \
tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClient.java \
tez-api/src/main/java/org/apache/tez/dag/api/client/rpc/DAGClientRPCImpl.java
To wait for the kill to take effect:
client.tryKillDAG();
DAGStatus status = client.waitForCompletion();
// status.state will be KILLED (or whatever it raced to)
Status populate flow
sequenceDiagram
participant U as User code
participant DC as DAGClientRPCImpl
participant AM as DAGAppMaster
participant DH as DAGClientHandler
participant DI as DAGImpl
U->>DC: getDAGStatus(opts)
DC->>AM: RPC: getDAGStatus(dagId, opts)
AM->>DH: dispatch
DH->>DI: dagImpl.getDAGStatus(opts)
DI-->>DH: DAGStatusProto
DH-->>AM: response
AM-->>DC: response bytes
DC-->>U: DAGStatus
The conversion DAGImpl → DAGStatusProto happens in DAGImpl.getDAGStatus()
(in tez-dag). For GET_COUNTERS, the AM walks the counter aggregation
tree — expensive.
Reading exercise
# Surface
sed -n '1,80p' tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClient.java
# State enum
grep -n "public enum State\b" \
tez-api/src/main/java/org/apache/tez/dag/api/client/DAGStatus.java
# RPC polling loop
grep -n "waitForCompletion\|backoff\|sleep" \
tez-api/src/main/java/org/apache/tez/dag/api/client/rpc/DAGClientRPCImpl.java
Answer:
- What is the difference between waitForCompletion() and waitForCompletionWithStatusUpdates(opts)?
- What happens if GET_COUNTERS is requested but the DAG is still INITING?
- List the exact DAGStatus.State enum values and the terminal subset.
- From the polling loop, what is the maximum sleep between polls?
- When tryKillDAG() is called after the DAG already finished, what does the RPC return? Is it an error?
- In DAGClientTimelineImpl, how is the "I don't see a SUCCEEDED event yet" case distinguished from "the DAG is still running"?
Common bugs and symptoms
| Symptom | Root cause | Fix |
|---|---|---|
| waitForCompletion() returns RUNNING forever | AM crashed, RPC keeps timing out | Add timeout; check AM log; fall back to ATS |
| Counters are stale by ~30s | AM aggregation interval | tez.am.aggregate.counters.interval-secs |
| tryKillDAG() returns immediately but DAG keeps running for minutes | Kill is async; tasks must drain | Always follow with waitForCompletion |
| Hive sees DAGStatus.State=ERROR with no diagnostics | AM crashed before publishing | Check NM container log for the AM |
| ATS-backed status missing for a recently completed DAG | ATS publisher queue backed up | Wait; or query ATS REST directly |
| Inconsistent state between RPC and ATS for same DAG | Race during AM shutdown; ATS publishes after final RPC | Trust RPC while AM lives, ATS after |
Validation: prove you understand this
- Write a 20-line program that polls getDAGStatus(GET_COUNTERS) once a second and prints the FILE_BYTES_WRITTEN counter from each snapshot.
- List the StatusGetOpts enum values (check the source; there may be fewer or more than the two named earlier in this chapter) and what each adds to the payload.
- From DAGClient.java, draw the inheritance/factory diagram for how a DAGClient instance is actually constructed (look at TezClient.submitDAG to see which subclass is returned in YARN vs local mode).
- Force the RPC backend to fail and confirm whether (or not) Tez falls back to the ATS backend automatically. Cite the line that performs the fallback.
- Explain why DAGStatus is a snapshot rather than an observable.
DAGAppMaster
DAGAppMaster is Tez's YARN ApplicationMaster: a single JVM, launched by the
YARN ResourceManager, that owns one or more DAGs over its lifetime. This
chapter describes its bring-up, its dispatcher topology, its YARN-facing
heartbeats, and the recovery service that lets it restart after a crash.
After this chapter you should be able to map any AM log line in the first 60
seconds of operation to a method in DAGAppMaster.java.
Files to open
tez-dag/src/main/java/org/apache/tez/dag/app/
DAGAppMaster.java (the AM main class)
TaskCommunicatorManager.java (task umbilical multiplexer)
ContainerHeartbeatHandler.java (container liveness)
rm/
TaskSchedulerManager.java (one per scheduler instance)
YarnTaskSchedulerService.java (the default scheduler impl)
container/
AMContainerImpl.java (container state machine)
launcher/
ContainerLauncherManager.java
DagContainerLauncher.java (varies by version)
LocalContainerLauncher.java (in-process)
recovery/
RecoveryService.java (event log; restart path)
dag/impl/
DAGImpl.java
VertexImpl.java
TaskImpl.java
TaskAttemptImpl.java
Bring-up: serviceInit and serviceStart
DAGAppMaster extends AbstractService. YARN starts it with a main; control
flows:
main()
-> DAGAppMaster.create / new DAGAppMaster(...)
-> init(conf)
-> serviceInit(conf)
- parse appAttemptId
- load credentials
- construct AsyncDispatcher
- construct + register child services: TaskSchedulerManager,
ContainerLauncherManager, TaskCommunicatorManager,
RecoveryService (if enabled), HistoryEventHandler, ATSHook
- register event handlers on the dispatcher
-> start()
-> serviceStart()
- start child services (they each start their own threads)
- if not session mode: handle the inline DAG plan
- if session mode: enter idle loop, wait for submitDAG RPC
Inspect the boundaries:
grep -n "serviceInit\|serviceStart\|serviceStop" \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
The AsyncDispatcher and registered handlers
DAGAppMaster builds one AsyncDispatcher (from hadoop-yarn-common) and
registers a handler per event type. The contract is:
- Each handler runs on a single dispatch thread.
- Handlers must be fast (no blocking I/O); they should mutate state and emit follow-on events.
Find the registrations:
grep -n "dispatcher.register\|register(.*\.class" \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
Typical registrations (names approximate by version):
| Event type | Handler class | Owned subsystem |
|---|---|---|
| DAGEventType | DAGEventDispatcher (forwards to DAGImpl.handle) | DAG lifecycle |
| VertexEventType | VertexEventDispatcher (forwards to VertexImpl.handle) | Vertex lifecycle |
| TaskEventType | TaskEventDispatcher (forwards to TaskImpl.handle) | Task lifecycle |
| TaskAttemptEventType | TaskAttemptEventDispatcher (forwards to TaskAttemptImpl.handle) | Attempt lifecycle |
| AMSchedulerEventType | AMSchedulerEventDispatcher (forwards to TaskSchedulerManager) | Scheduling |
| AMContainerEventType | container event dispatcher | Container state |
| AMNodeEventType | node event dispatcher | Node tracking |
| ContainerLauncherEventType | launcher dispatcher | Launch/stop containers |
| TaskCommunicatorEventType | comms dispatcher | Per-launcher umbilical |
| HistoryEventType | history event dispatcher | ATS/log publication |
| SpeculatorEventType | speculator dispatcher | Speculation (if enabled) |
| DAGAppMasterEventType | AM itself | Lifecycle (e.g., shutdown) |
| RecoveryEventType | recovery dispatcher | Recovery log |
The handlers themselves are inner classes or top-level dispatchers found in:
grep -rn "extends EventHandler\|implements EventHandler" \
tez-dag/src/main/java/org/apache/tez/dag/app/ | head -20
Event flow diagram
flowchart TB
subgraph "Sources of events"
TC[Task heartbeat]
SCH[Scheduler callback]
TL[Container launcher]
UC[User: submitDAG/killDAG]
RC[Recovery on restart]
end
TC --> D
SCH --> D
TL --> D
UC --> D
RC --> D
D[AsyncDispatcher] --> DH[DAGEventDispatcher]
D --> VH[VertexEventDispatcher]
D --> TH[TaskEventDispatcher]
D --> AH[TaskAttemptEventDispatcher]
D --> SH[AMSchedulerEventDispatcher]
D --> HH[HistoryEventDispatcher]
DH --> DI[DAGImpl]
VH --> VI[VertexImpl]
TH --> TI[TaskImpl]
AH --> TAI[TaskAttemptImpl]
SH --> TSM[TaskSchedulerManager]
HH --> HEH[HistoryEventHandler]
Everything flows through D. There is no other way to mutate the state
of a DAG, vertex, task, or attempt. See event-routing.md.
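The routing contract can be seen in miniature with a queue and a handler map keyed by event-type class. This is a simplified sketch with hypothetical names (DispatcherSketch, DagEventType, VertexEventType); unlike the YARN AsyncDispatcher, which runs its own dispatch thread, it drains the queue on the calling thread so that the output is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical miniature of type-routed, single-threaded dispatch; not the YARN class. */
final class DispatcherSketch {
  enum DagEventType { DAG_START }
  enum VertexEventType { V_INIT, V_START }

  private final BlockingQueue<Enum<?>> queue = new LinkedBlockingQueue<>();
  private final Map<Class<?>, Consumer<Enum<?>>> handlers = new HashMap<>();

  /** One handler per event-type enum, as in the registration table above. */
  void register(Class<? extends Enum<?>> eventTypeClass, Consumer<Enum<?>> handler) {
    handlers.put(eventTypeClass, handler);
  }

  /** Producers only enqueue; they never call handlers directly. */
  void post(Enum<?> event) {
    queue.add(event);
  }

  /** Handles events one at a time, in arrival order; handlers may post follow-on events. */
  void drain() {
    Enum<?> event;
    while ((event = queue.poll()) != null) {
      Consumer<Enum<?>> handler = handlers.get(event.getDeclaringClass());
      if (handler == null) {
        throw new IllegalStateException("No handler registered for " + event);
      }
      handler.accept(event);
    }
  }

  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    DispatcherSketch dispatcher = new DispatcherSketch();
    dispatcher.register(DagEventType.class, e -> {
      log.add(e.name());
      dispatcher.post(VertexEventType.V_INIT); // follow-on event, not a direct call
    });
    dispatcher.register(VertexEventType.class, e -> {
      log.add(e.name());
      if (e == VertexEventType.V_INIT) {
        dispatcher.post(VertexEventType.V_START);
      }
    });
    dispatcher.post(DagEventType.DAG_START);
    dispatcher.drain();
    System.out.println(log); // [DAG_START, V_INIT, V_START]
  }
}
```

Because handlers run one at a time, none of them needs a lock around state-machine fields; the cost is that a single slow handler stalls every event queued behind it.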
YARN-facing components
AMRM heartbeat (the resource conversation)
TaskSchedulerManager (and underneath, YarnTaskSchedulerService) maintains
an AMRMClient (from YARN). This heartbeats with the RM at a configurable
interval (tez.am.am-rm.heartbeat.interval-ms.max) carrying:
- ContainerRequests for new tasks.
- ContainerReleases for freed containers.
- Progress percent (visible in yarn application -status).
Responses contain:
- AllocatedContainers (RM granted).
- CompletedContainersStatuses (RM tells us a container died).
grep -n "heartbeat\|AMRMClient\|allocate" \
tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java | head
Container heartbeat (the liveness check)
ContainerHeartbeatHandler tracks the wall time of the last
heartbeat() call from each running container's umbilical. If a container
goes silent past tez.task.timeout-ms, the AM declares the container
unresponsive and kills the attempt.
grep -n "ContainerHeartbeatHandler\|tez.task.timeout" \
tez-dag/src/main/java/org/apache/tez/dag/app/ContainerHeartbeatHandler.java
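The liveness check amounts to a last-seen timestamp per container and a periodic sweep. The sketch below uses hypothetical names (HeartbeatTrackerSketch, register, heartbeat, expired) and takes the clock as a parameter; the real handler runs its own checker thread and emits events rather than returning a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of heartbeat-based liveness tracking; not ContainerHeartbeatHandler. */
final class HeartbeatTrackerSketch {
  private final long timeoutMillis;
  // TreeMap only so that expired() returns ids in a stable, sorted order.
  private final Map<String, Long> lastSeenMillis = new TreeMap<>();

  HeartbeatTrackerSketch(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Start tracking a container from the moment it is launched. */
  void register(String containerId, long nowMillis) {
    lastSeenMillis.put(containerId, nowMillis);
  }

  /** Record a heartbeat; unknown containers are ignored. */
  void heartbeat(String containerId, long nowMillis) {
    lastSeenMillis.computeIfPresent(containerId, (id, previous) -> nowMillis);
  }

  /** Stop tracking a container that finished normally. */
  void unregister(String containerId) {
    lastSeenMillis.remove(containerId);
  }

  /** Containers that have been silent for longer than the timeout. */
  List<String> expired(long nowMillis) {
    List<String> silent = new ArrayList<>();
    for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
      if (nowMillis - entry.getValue() > timeoutMillis) {
        silent.add(entry.getKey());
      }
    }
    return silent;
  }

  public static void main(String[] args) {
    HeartbeatTrackerSketch tracker = new HeartbeatTrackerSketch(300_000L);
    tracker.register("c1", 0L);
    tracker.register("c2", 0L);
    tracker.heartbeat("c1", 200_000L);
    System.out.println(tracker.expired(350_000L)); // [c2]
  }
}
```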
Task umbilical (the per-task RPC server)
TaskCommunicatorManager runs an in-AM RPC server (the umbilical) that tasks
call into for:
- getTask(): pick up assigned task.
- statusUpdate(...) / heartbeat(...): progress and liveness.
- done(...) / fatalError(...): completion.
- outputReady(...) / inputEvents(...): runtime data plane.
The umbilical protocol is TezTaskUmbilicalProtocol:
find . -name "TezTaskUmbilicalProtocol.java"
Recovery: surviving an AM restart
If YARN allows another AM attempt (tez.am.max.app.attempts, capped by
yarn.resourcemanager.am.max-attempts) and recovery is enabled
(tez.dag.recovery.enabled=true), RecoveryService writes a log of
state-changing events to HDFS. On a restart (YARN gives the AM a new
appAttemptId but the same appId), the new AM:
- Reads the recovery log under ${tez.staging-dir}/$appId/recovery/$attemptId/.
- Replays events into DAGImpl, VertexImpl, etc., to rebuild in-memory state up to the last durable point.
- Resumes execution: completed tasks remain completed, in-flight tasks are relaunched.
grep -rn "RecoveryService\|RecoveryEvent\|replayEvents" \
tez-dag/src/main/java/org/apache/tez/dag/app/recovery/ | head
Note: recovery is per-DAG, not per-task. A vertex that was RUNNING
becomes RUNNING again; tasks that completed stay completed; tasks that
were in flight get fresh attempts.
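Replay is a fold over the event log: apply each persisted event in order, then treat anything that did not reach a completed state as needing a fresh attempt. The sketch below is a deliberately reduced model with hypothetical types (RecoveryReplaySketch, LogEntry, TaskState); the real RecoveryService persists serialized history events and rebuilds DAG, vertex and task state machines, not a flat map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of log replay after an AM restart; not RecoveryService. */
final class RecoveryReplaySketch {
  enum TaskState { RUNNING, SUCCEEDED }

  /** One persisted state change. */
  static final class LogEntry {
    final String taskId;
    final TaskState state;

    LogEntry(String taskId, TaskState state) {
      this.taskId = taskId;
      this.state = state;
    }
  }

  /** Applies entries in log order; the last entry for a task wins. */
  static Map<String, TaskState> replay(List<LogEntry> log) {
    Map<String, TaskState> recovered = new TreeMap<>();
    for (LogEntry entry : log) {
      recovered.put(entry.taskId, entry.state);
    }
    return recovered;
  }

  /** Tasks that had not succeeded when the log ended must get fresh attempts. */
  static List<String> tasksToRelaunch(Map<String, TaskState> recovered) {
    List<String> relaunch = new ArrayList<>();
    for (Map.Entry<String, TaskState> entry : recovered.entrySet()) {
      if (entry.getValue() != TaskState.SUCCEEDED) {
        relaunch.add(entry.getKey());
      }
    }
    return relaunch;
  }

  public static void main(String[] args) {
    List<LogEntry> log = List.of(
        new LogEntry("t1", TaskState.RUNNING),
        new LogEntry("t2", TaskState.RUNNING),
        new LogEntry("t1", TaskState.SUCCEEDED));
    Map<String, TaskState> recovered = replay(log);
    System.out.println(recovered);                  // {t1=SUCCEEDED, t2=RUNNING}
    System.out.println(tasksToRelaunch(recovered)); // [t2]
  }
}
```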
Reading exercise
# Bring-up
sed -n '1,200p' tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
grep -n "serviceInit\|serviceStart" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
# Handlers
grep -n "register\b" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head -30
# Session vs non-session control
grep -n "isSession\|sessionMode" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head -20
# Recovery hookup
grep -n "RecoveryService\|recoveryEnabled" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head
Answer:
- In what order are the child services started in serviceStart? Why does order matter?
- List the first three events that flow through the dispatcher when an AM in non-session mode starts.
- What thread does DAGImpl.handle(DAGEvent) execute on? Is it the same thread as VertexImpl.handle(VertexEvent)?
- Where is the appAttemptId > 1 check that decides whether to start fresh or recover?
- What is the difference between DAGAppMaster.shutdown() and DAGAppMaster.serviceStop()?
- Find the line that emits the first "DAGAppMaster started" log statement (or its modern equivalent).
Common bugs and symptoms
| Symptom | Root cause | Where to look |
|---|---|---|
| AM dies immediately with NPE in serviceInit | Missing or wrong tez.lib.uris; jars not found | NM container log; verify HDFS perms |
| AM hangs forever after serviceStart in session mode | No DAGs submitted; tez.session.am.dag.submit.timeout.secs exhausted | Increase timeout; or check why client isn't submitting |
| Tasks all fail with "container lost" after a long GC | AM GC pause exceeded heartbeat budget; RM killed AM | Tune AM heap; reduce dispatcher pressure |
| Recovery replays but stalls in INITING | Recovery log truncated mid-vertex-init | Look for SummaryEventWriter errors in prior attempt |
| Event dispatcher queue grows without bound | A handler is doing blocking I/O on the dispatch thread | Take a thread dump; verify which event is stuck |
| AM exits with ERROR and no DAG transition | An uncaught exception bubbled out of an event handler | grep "Error in dispatcher thread" in AM log |
Validation: prove you understand this
- From memory, list ten event-type→handler registrations in DAGAppMaster.
- Draw the event flow from TezTaskUmbilicalProtocol.heartbeat to TaskAttemptImpl.handle(TA_DONE).
- Reproduce a single-DAG, non-session AM bring-up on MiniTezCluster and identify the log line emitted by each child-service start.
- Read the RecoveryService writer and identify which event types are persisted vs in-memory-only.
- Explain why the dispatcher must be single-threaded and what would break if you parallelized it.
VertexImpl Lifecycle
VertexImpl is the AM-side representation of a single Vertex in a running
DAG. Its lifecycle is a Hadoop state machine with about a dozen states (listed
below) and dozens of events. This chapter walks the happy path (NEW → SUCCEEDED),
the major failure and kill paths, and the rules that govern transitions.
After this chapter you should be able to draw the state machine on a whiteboard and predict every state transition for any event in any state.
File
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
This is one of the largest files in Tez (typically 4000+ lines). Skim once
top-to-bottom, then read the stateMachineFactory block carefully.
grep -n "stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
The factory is a single chained builder defined near the top of the file (roughly 200–600 lines depending on version).
The states
grep -n "VertexState\." tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
# or
grep -n "public enum\|enum VertexState" \
tez-api/src/main/java/org/apache/tez/dag/api/event/VertexState.java \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
The full state set (names exact as of 0.10.x):
| State | Meaning |
|---|---|
| NEW | Just constructed; no events seen |
| INITIALIZING | Inputs being initialized (e.g., split generation) |
| INITED | Ready to run; awaiting V_START |
| RUNNING | Tasks executing |
| COMMITTING | All tasks succeeded; outputs being committed |
| SUCCEEDED | Terminal: all good |
| TERMINATING | Failure/kill in progress; awaiting task drain |
| KILLED | Terminal: killed externally |
| FAILED | Terminal: failed (own fault) |
| ERROR | Terminal: AM internal error |
| RECOVERING | (Recovery only) replaying events into this vertex |
State × event matrix (happy path)
| State | Event | Next state | Action |
|---|---|---|---|
| NEW | V_INIT | INITIALIZING | construct inputs, kick off InputInitializers |
| INITIALIZING | V_ROOT_INPUT_INITIALIZED | INITIALIZING | accumulate events; if all done → INITED |
| INITIALIZING | V_ROOT_INPUT_FAILED | TERMINATING | bubble failure |
| INITIALIZING | V_INIT_COMPLETED | INITED | finalize parallelism if not set |
| INITED | V_START | RUNNING | schedule tasks via VertexManagerPlugin |
| RUNNING | V_TASK_COMPLETED (success) | RUNNING | bump counter; if all done → COMMITTING |
| RUNNING | V_TASK_COMPLETED (final fail) | TERMINATING | initiate cleanup |
| RUNNING | V_TASK_RESCHEDULED | RUNNING | rerun a task |
| COMMITTING | V_COMMIT_COMPLETED | SUCCEEDED | publish history |
| COMMITTING | V_COMMIT_FAILED | TERMINATING | rerun or fail |
For the complete matrix, count the addTransition(...) calls:
grep -c "addTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
There are usually >100 transitions registered. Each carries a one-line comment with the bug or JIRA that motivated it; read those comments.
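A state × event matrix is, mechanically, a nested lookup table, and an event with no row entry is rejected. The sketch below illustrates that idea with plain EnumMaps; it is not Hadoop's StateMachineFactory. The class name is hypothetical, only happy-path rows are registered, and ALL_TASKS_SUCCEEDED is an invented stand-in for the multi-way V_TASK_COMPLETED transition.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical table-driven state machine; not Hadoop's StateMachineFactory. */
final class VertexStateMachineSketch {
  enum State { NEW, INITIALIZING, INITED, RUNNING, COMMITTING, SUCCEEDED, TERMINATING }

  /** ALL_TASKS_SUCCEEDED is invented here to keep every row single-target. */
  enum Event {
    V_INIT, V_INIT_COMPLETED, V_ROOT_INPUT_FAILED, V_START,
    ALL_TASKS_SUCCEEDED, V_COMMIT_COMPLETED, V_COMMIT_FAILED
  }

  private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
  private State current = State.NEW;

  VertexStateMachineSketch() {
    add(State.NEW, Event.V_INIT, State.INITIALIZING);
    add(State.INITIALIZING, Event.V_INIT_COMPLETED, State.INITED);
    add(State.INITIALIZING, Event.V_ROOT_INPUT_FAILED, State.TERMINATING);
    add(State.INITED, Event.V_START, State.RUNNING);
    add(State.RUNNING, Event.ALL_TASKS_SUCCEEDED, State.COMMITTING);
    add(State.COMMITTING, Event.V_COMMIT_COMPLETED, State.SUCCEEDED);
    add(State.COMMITTING, Event.V_COMMIT_FAILED, State.TERMINATING);
  }

  private void add(State from, Event on, State to) {
    table.computeIfAbsent(from, key -> new EnumMap<>(Event.class)).put(on, to);
  }

  State current() {
    return current;
  }

  /** Applies one event, or rejects it if the current state has no row for it. */
  State handle(Event event) {
    Map<Event, State> row = table.get(current);
    State next = (row == null) ? null : row.get(event);
    if (next == null) {
      throw new IllegalStateException("Invalid event " + event + " at " + current);
    }
    current = next;
    return current;
  }

  public static void main(String[] args) {
    VertexStateMachineSketch vertex = new VertexStateMachineSketch();
    vertex.handle(Event.V_INIT);
    vertex.handle(Event.V_INIT_COMPLETED);
    vertex.handle(Event.V_START);
    vertex.handle(Event.ALL_TASKS_SUCCEEDED);
    System.out.println(vertex.handle(Event.V_COMMIT_COMPLETED)); // SUCCEEDED
    try {
      vertex.handle(Event.V_START); // no row for SUCCEEDED, so this is rejected
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // Invalid event V_START at SUCCEEDED
    }
  }
}
```

The rejected-event path is the reason late or duplicate events need explicit no-op rows in a real machine.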
Failure path walk
stateDiagram-v2
[*] --> NEW
NEW --> INITIALIZING: V_INIT
INITIALIZING --> INITED: V_INIT_COMPLETED
INITIALIZING --> TERMINATING: V_ROOT_INPUT_FAILED
INITED --> RUNNING: V_START
RUNNING --> COMMITTING: all tasks SUCCEEDED
RUNNING --> TERMINATING: any task FAILED beyond max-attempts
RUNNING --> TERMINATING: V_TERMINATE
COMMITTING --> SUCCEEDED: V_COMMIT_COMPLETED
COMMITTING --> TERMINATING: V_COMMIT_FAILED
TERMINATING --> FAILED
TERMINATING --> KILLED
SUCCEEDED --> [*]
FAILED --> [*]
KILLED --> [*]
TERMINATING exists because a vertex cannot just jump to FAILED — it must
first kill all running tasks and clean up its outputs. The transition from
TERMINATING to a terminal state happens when the task count reaches zero.
Vertex initialization in detail
V_INIT is the most complex transition. The handler must:
- Construct each root InputDescriptor and call its InputInitializer.
- If parallelism is -1, defer task creation until either the VertexManagerPlugin calls reconfigureVertex(...) or the root inputs report concrete counts.
- Construct downstream Edge objects (the AM-side Edge, not the tez-api one) and bind their EdgeManagers.
- Schedule the VertexManagerPlugin.onVertexStarted callback (it fires on V_START, not V_INIT).
Read the body:
grep -n "InitTransition\|RootInputInitTransition\|RECOVERING" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -20
The commit path
A vertex with a DataSink (an OutputCommitter) must run a commit phase
after all tasks succeed. The commit:
- Runs on the AM (not in tasks).
- May fail and trigger a rerun (V_COMMIT_FAILED → TERMINATING).
- Holds the vertex in COMMITTING for the duration.
Vertex-group commit (when multiple vertices write to a shared VertexGroup)
is coordinated by DAGImpl; individual VertexImpls just signal that they
are ready to commit.
grep -n "CommittingTransition\|commitOutput\|OutputCommitter" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
Reading exercise
# State machine block
sed -n '1,500p' tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
# Count transitions
grep -c "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
# Find every event that can take the vertex to FAILED
grep -n "VertexState.FAILED" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
# Find the InitTransition body
grep -n "private.*class.*Transition\b" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
Answer:
- List five events that can take the vertex from RUNNING to TERMINATING.
- What determines the final state (FAILED vs KILLED) once TERMINATING completes?
- Why is INITED distinct from RUNNING, and what does V_START actually trigger?
- How is parallelism set when a vertex starts with parallelism = -1?
- What happens to in-flight tasks when a vertex transitions to TERMINATING?
- Why does the state machine have a separate COMMITTING state instead of committing inside RUNNING?
Common bugs and symptoms
| Symptom | Root cause | Where to look |
|---|---|---|
| InvalidStateTransitonException: Invalid event V_TASK_COMPLETED at SUCCEEDED | A late task completion event arrived after vertex completed (race) | Check task retry logic; add a no-op transition |
| Vertex stuck in INITIALIZING forever | Root input initializer never emitted events | Check InputInitializerEvents in log; cross-check initializer impl |
| Vertex transitions to FAILED but the failing task was killed externally | Bug in TaskAttemptImpl setting the wrong termination cause | See task-attempt-lifecycle.md |
| All tasks succeed but vertex stays in COMMITTING | Output committer hangs | Check committer for synchronous slow I/O; consider async |
| Recovery replays into RUNNING but tasks aren't relaunched | Missing recovery event for in-flight tasks | Look for VertexTaskStartEvent gaps in recovery log |
| V_KILL causes vertex to stay in TERMINATING with one task lingering | Container heartbeat timeout > kill deadline | Tune tez.task.timeout-ms |
Validation: prove you understand this
- From memory, list all 10–11 VertexState values with a one-line meaning.
- Without running code, predict the next state for: (NEW, V_TERMINATE), (INITIALIZING, V_TERMINATE), (RUNNING, V_TASK_RESCHEDULED), (COMMITTING, V_TASK_RESCHEDULED). Verify against the source.
- Find the JIRA reference next to one transition you don't understand; read the JIRA; come back and explain why the transition exists.
- Write a unit test that drives a VertexImpl from NEW to SUCCEEDED using DrainDispatcher. (Use TestVertexImpl as a template.)
- Modify VertexImpl to add a no-op transition for some (state, event) pair currently absent; update TestVertexImpl in the same patch. Compile.
TaskImpl Lifecycle
TaskImpl is the AM-side representation of one logical task within a vertex.
It is a relatively small state machine, but it owns a critical piece of
policy: which attempt of this task is the "winner." This chapter walks
the states, the attempt management rules, speculation, and the max-failed
threshold that promotes a task to "this whole vertex must fail."
After this chapter you should be able to explain why a task with three failed
attempts may still be RUNNING while another with one failed attempt is
already FAILED.
File
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
Tests:
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
The states
grep -n "TaskState\." tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head
grep -n "public enum TaskState\|enum TaskState" \
tez-api/src/main/java/org/apache/tez/dag/api/event/TaskState.java
| State | Meaning |
|---|---|
| NEW | Constructed; no attempts yet |
| SCHEDULED | First attempt requested from scheduler |
| RUNNING | At least one attempt is RUNNING |
| SUCCEEDED | Terminal: one attempt succeeded; task complete |
| KILLED | Terminal: explicitly killed (vertex termination, user) |
| FAILED | Terminal: max attempts exceeded |
TaskImpl does not have INITIALIZING or TERMINATING — those concerns
belong to the vertex.
State × event matrix
| State | Event | Next state | Action |
|---|---|---|---|
| NEW | T_SCHEDULE | SCHEDULED | create first TaskAttemptImpl, send TA_SCHEDULE |
| SCHEDULED | T_ATTEMPT_LAUNCHED | RUNNING | mark first attempt as running |
| RUNNING | T_ATTEMPT_SUCCEEDED | SUCCEEDED | pick this attempt as the winner; kill others (if speculating) |
| RUNNING | T_ATTEMPT_FAILED | RUNNING (retry) or FAILED (exceeded) | spawn new attempt or terminate |
| RUNNING | T_ATTEMPT_KILLED | RUNNING | no-op unless this was last attempt |
| RUNNING | T_ADD_SPEC_ATTEMPT | RUNNING | spawn a duplicate attempt |
| RUNNING | T_TERMINATE | KILLED | kill all attempts |
| any | T_RECOVER_* | recovered state | replay events |
Count transitions:
grep -c "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
Retry: how max-failed-attempts works
The config:
grep -n "TASK_MAX_FAILED_ATTEMPTS\|tez.am.task.max.failed.attempts" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
Default is 4 in most branches; a task is FAILED only after N attempts
have failed (not been killed).
Failed vs killed distinction:
| Outcome | Counts toward max.failed.attempts? |
|---|---|
| TaskAttempt failed (own crash, processor exception) | yes |
| TaskAttempt killed by speculation (lost the race) | no |
| TaskAttempt killed because vertex terminated | no |
| TaskAttempt killed because container preempted | no |
The classification is owned by TaskAttemptTerminationCause (see
task-attempt-lifecycle.md). TaskImpl.handle
consults the cause when deciding whether to retry or fail.
grep -n "TerminationCause\|isFailureCause" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head
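The counting rule can be sketched as a small budget tracker. Everything here (FailureBudgetSketch, Decision, the boolean parameter) is a hypothetical illustration of the rule in the table above, not TaskImpl's actual fields or methods.

```java
/** Hypothetical sketch of a per-task failure budget; names are illustrative, not TaskImpl's. */
class FailureBudgetSketch {
    enum Decision { RETRY, FAIL_TASK }

    private final int maxFailedAttempts;
    private int failedAttempts;

    FailureBudgetSketch(int maxFailedAttempts) {
        this.maxFailedAttempts = maxFailedAttempts;
    }

    /**
     * Record the end of one attempt.
     *
     * @param countsAsFailure true for a genuine failure; false for a kill
     *                        (speculation loser, preemption, vertex termination)
     */
    Decision onAttemptEnded(boolean countsAsFailure) {
        if (countsAsFailure) {
            failedAttempts++;    // only real failures consume the budget
        }
        return failedAttempts >= maxFailedAttempts ? Decision.FAIL_TASK : Decision.RETRY;
    }

    int getFailedAttempts() {
        return failedAttempts;
    }
}
```

With a budget of 4, three failures plus any number of kills still yield RETRY; the fourth genuine failure yields FAIL_TASK. A task configured with a budget of 1 fails on its first genuine failure, which is one way the scenario from the chapter introduction can arise.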
Speculation
Speculation runs a second copy of a task before the first finishes, hoping the second wins. Implementation:
tez-dag/src/main/java/org/apache/tez/dag/app/dag/speculate/
LegacySpeculator.java
SimpleSpeculator.java (varies by version)
legacy/RuntimeTaskStatsEstimator.java
The speculator emits T_ADD_SPEC_ATTEMPT events into the dispatcher; the
task spawns an additional attempt. The first attempt to succeed wins; the
others are killed with cause TERMINATED_BY_OWNER (or similar). Killed
attempts do not count toward max.failed.attempts.
Enabled by:
grep -n "tez.am.speculation.enabled\|speculation" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head
"Best attempt" selection
When multiple attempts of the same task exist, the first to send
TA_DONE (successful completion) wins. The handler:
- Marks that attempt as the canonical one (cached in TaskImpl).
- Iterates remaining attempts, sending each a kill event.
- Transitions task to SUCCEEDED.
grep -n "successfulAttempt\|setWinnerAttempt\|markSuccessful" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head
Downstream vertices reading from this task's output use the winner's
outputLocationHint for shuffle (see shuffle-sort.md).
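The first-success-wins rule can be sketched as follows. The class and method names are hypothetical and attempt IDs are reduced to integers; the real handler works with TaskAttemptImpl objects and sends kill events through the dispatcher rather than returning a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of winner selection among concurrent attempts of one task. */
class WinnerSelectionSketch {
    private final Set<Integer> liveAttempts = new LinkedHashSet<>();
    private Integer winner;      // null until some attempt succeeds

    void attemptStarted(int attemptId) {
        liveAttempts.add(attemptId);
    }

    /**
     * Record a successful attempt.
     *
     * @return the attempts that should now be killed; empty if a winner already existed
     */
    List<Integer> attemptSucceeded(int attemptId) {
        liveAttempts.remove(attemptId);
        if (winner != null) {
            return new ArrayList<>();        // late success: the winner is not replaced
        }
        winner = attemptId;
        List<Integer> toKill = new ArrayList<>(liveAttempts);
        liveAttempts.clear();                // the caller is responsible for sending the kills
        return toKill;
    }

    Integer getWinner() {
        return winner;
    }
}
```

Returning the kill list instead of killing directly keeps the sketch free of side effects; in an event-driven design the caller would turn each entry into a kill event.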
Reading exercise
# Surface
sed -n '1,120p' tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
# State machine block
grep -n "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head -40
# Retry logic
grep -n "addAttempt\|nextAttemptNumber\|createAttempt" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head
# Speculation hook
grep -rn "T_ADD_SPEC_ATTEMPT" tez-dag/src/main/java/org/apache/tez/dag/app/ | head
Answer:
- What is the precise condition for transitioning from RUNNING to FAILED on a T_ATTEMPT_FAILED event? Cite the line.
- Where is a new TaskAttemptImpl constructed? Is it a public method or private to TaskImpl?
- How does TaskImpl know whether a failed attempt should count toward the failure budget?
- In what state can T_ADD_SPEC_ATTEMPT arrive? What does the handler do?
- Why does TaskImpl not own its own scheduling? Who does?
- When a task succeeds with two parallel attempts, which one becomes the downstream input? How is the loser cleaned up?
Common bugs and symptoms
| Symptom | Root cause | Where to look |
|---|---|---|
| Task retries forever and never fails | max.failed.attempts set absurdly high; or all failures classified as "kill" | Check config; verify TerminationCause for each failure |
| Speculation kills the original just after it succeeds (lost work) | Race on markSuccessful and speculative-attempt kill | Ensure speculator backs off when task is in completing |
| Task SUCCEEDED but a sibling attempt still appears as RUNNING for a long time | Container slow to acknowledge kill | Look at ContainerHeartbeatHandler and TA_KILL_REQUEST |
| Task succeeded reported but downstream cannot fetch outputs | Race between TA_DONE and output ready event | Check ordering of outputReady umbilical calls |
| Recovery brings task back as RUNNING even though it had finished | Missing TaskFinishedEvent in recovery log | Investigate RecoveryService flush boundaries |
Validation: prove you understand this
- Draw the TaskImpl state machine from memory, including all six states.
- From TestTaskImpl, identify a test that drives a task to FAILED. Walk the events it sends.
- List the TaskAttemptTerminationCause values that do not count toward max.failed.attempts. Cite the enum and the consumer.
- Trace, line by line, what TaskImpl does when T_ATTEMPT_SUCCEEDED arrives for the second of two concurrent attempts.
- Modify TaskImpl to log the winner's attempt number explicitly at the INFO level. Run a MiniTezCluster job and observe.
TaskAttemptImpl Lifecycle
TaskAttemptImpl is the AM-side representation of a single execution
attempt of a task. It owns the container assignment, the umbilical, the
output commit decision, and — critically — the TaskAttemptTerminationCause
that drives upstream retry decisions.
After this chapter you should be able to look at any TaskAttemptImpl state
in an AM log and explain (a) what container holds it, (b) which umbilical
calls have or have not landed, and (c) what its termination cause will be if
it dies right now.
File
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java
Tests:
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java
Termination cause enum:
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptTerminationCause.java
The states (typical 0.10.x naming)
| State | Meaning |
|---|---|
| NEW | Constructed; not yet given to scheduler |
| START_WAIT | Request sent to scheduler; awaiting container |
| SUBMITTED | Container allocated; awaiting launch ack (some versions) |
| RUNNING | Container launched; processor executing |
| SUCCEEDED | Terminal: TA_DONE received |
| KILL_IN_PROGRESS | Kill requested; awaiting confirmation |
| KILLED | Terminal: killed before/during execution |
| FAIL_IN_PROGRESS | Failure recognized; cleaning up |
| FAILED | Terminal: failed (counts against max.failed.attempts) |
Exact list varies by branch. Verify:
grep -n "TaskAttemptStateInternal\." \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head
Tez separates the external state (TaskAttemptState in tez-api, the
coarse enum visible to ATS) from the richer internal state machine state.
Mapping:
| Internal | External |
|---|---|
| NEW, START_WAIT, SUBMITTED, RUNNING | STARTING / RUNNING |
| SUCCEEDED | SUCCEEDED |
| KILL_IN_PROGRESS, KILLED | KILLED |
| FAIL_IN_PROGRESS, FAILED | FAILED |
State × event matrix (key transitions)
| State | Event | Next state | Notes |
|---|---|---|---|
| NEW | TA_SCHEDULE | START_WAIT | request container |
| START_WAIT | TA_STARTED | SUBMITTED/RUNNING | container launched |
| START_WAIT | TA_CONTAINER_TERMINATING | KILL_IN_PROGRESS | preemption before launch |
| RUNNING | TA_DONE | SUCCEEDED | done(...) umbilical call |
| RUNNING | TA_FAILED | FAIL_IN_PROGRESS | processor threw |
| RUNNING | TA_TIMED_OUT | FAIL_IN_PROGRESS | heartbeat exceeded tez.task.timeout-ms |
| RUNNING | TA_KILL_REQUEST | KILL_IN_PROGRESS | external kill |
| RUNNING | TA_CONTAINER_TERMINATED | FAIL_IN_PROGRESS / KILL_IN_PROGRESS | NM said container died |
| KILL_IN_PROGRESS | TA_CONTAINER_TERMINATED | KILLED | cleanup done |
| FAIL_IN_PROGRESS | TA_CONTAINER_TERMINATED | FAILED | cleanup done |
grep -c "addTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java
Container assignment
When a TaskAttempt becomes schedulable, the AM:
- Builds a ContainerRequest (resource, priority, locality).
- Hands it to TaskSchedulerManager.allocateTask(...).
- The scheduler (YarnTaskSchedulerService) eventually matches a granted container.
- The match drives an AMSchedulerEventTAEnded/...TALaunchRequest flow that updates the TaskAttemptImpl state.
- ContainerLauncherManager actually starts the JVM via NMClient.
grep -n "allocateTask\|deallocateTask\|AMSchedulerEvent" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head
The container is not assigned at construction; that's why the START_WAIT
state exists. Some configurations short-circuit this via container reuse
(the scheduler offers a free, idle container).
See container-reuse.md and scheduler.md.
Output commit rules (per attempt)
For attempts of vertices with an OutputCommitter:
| Condition | Who commits? |
|---|---|
| Output commits are at the task level (tez.am.commit-all-outputs-on-dag-success=false) | Each TaskAttemptImpl runs commit() from inside the task JVM (via processor) |
| Output commits are at the vertex level (default for MROutput) | Only the AM commits, after all tasks succeed (see vertex-lifecycle.md) |
Losing speculative attempts must not commit. The setOutputCommitted(true)
flag on TaskAttemptImpl records who actually committed. The AM ensures
exactly one attempt of each task has outputCommitted=true.
grep -n "outputCommitted\|commitOutput\|noCommit" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head
TaskAttemptTerminationCause — the policy enum
sed -n '1,200p' \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptTerminationCause.java
Categories (the exact enum is long):
| Cause | Counts as failure? | Typical trigger |
|---|---|---|
| TERMINATED_BY_CLIENT | No | User killed DAG |
| TERMINATED_AT_SHUTDOWN | No | AM shutting down |
| TERMINATED_INEFFECTIVE_SPECULATION | No | Lost the speculation race |
| INTERNAL_PREEMPTION | No | AM preempted it (e.g., for higher-priority work) |
| EXTERNAL_PREEMPTION | No | YARN preempted the container |
| CONTAINER_EXITED | Yes (default) | Container died mid-run |
| NODE_FAILED | Yes | NM died |
| TASK_HEARTBEAT_ERROR | Yes | Heartbeat timeout |
| OUTPUT_LOST | Yes | Downstream reported output gone (rerun) |
| APPLICATION_ERROR | Yes | Processor threw |
TaskImpl uses cause.causesFailure() (or equivalent) to decide whether to
bump the failure counter.
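One way to picture a policy enum of this kind is to attach the "counts as failure" flag to each constant. The sketch below mirrors only the rows of the table above; the real TaskAttemptTerminationCause is longer, and the enum name, the countsAsFailure() accessor, and the failuresIn(...) helper are hypothetical.

```java
import java.util.List;

/** Hypothetical mirror of the table above; the real enum has more values and may differ. */
enum TerminationCauseSketch {
    TERMINATED_BY_CLIENT(false),
    TERMINATED_AT_SHUTDOWN(false),
    TERMINATED_INEFFECTIVE_SPECULATION(false),
    INTERNAL_PREEMPTION(false),
    EXTERNAL_PREEMPTION(false),
    CONTAINER_EXITED(true),
    NODE_FAILED(true),
    TASK_HEARTBEAT_ERROR(true),
    OUTPUT_LOST(true),
    APPLICATION_ERROR(true);

    private final boolean countsAsFailure;

    TerminationCauseSketch(boolean countsAsFailure) {
        this.countsAsFailure = countsAsFailure;
    }

    /** True if an attempt ending with this cause should consume the failure budget. */
    boolean countsAsFailure() {
        return countsAsFailure;
    }

    /** Count how many entries of an attempt history consume the failure budget. */
    static int failuresIn(List<TerminationCauseSketch> history) {
        int failures = 0;
        for (TerminationCauseSketch cause : history) {
            if (cause.countsAsFailure()) {
                failures++;
            }
        }
        return failures;
    }
}
```

Keeping the classification on the enum means the retry decision needs no knowledge of individual causes; a new cause only has to declare which side of the line it falls on.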
Reading exercise
sed -n '1,160p' tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java
grep -n "TaskAttemptStateInternal\." \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head -20
grep -n "TerminationCause" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head -20
# Heartbeat timeout path
grep -n "TA_TIMED_OUT\|heartbeatTimeout" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head
Answer:
- What event arrives when an attempt's container heartbeat times out? What issues it?
- What is the difference between TA_FAILED and TA_CONTAINER_TERMINATED? When does each fire?
- Which TaskAttemptTerminationCause values are not counted toward tez.am.task.max.failed.attempts?
- In what state does an attempt sit during container provisioning?
- What does outputCommitted track, and how is it used by the AM to choose the canonical attempt?
- Why are there separate FAIL_IN_PROGRESS and FAILED states (likewise for kill)?
Common bugs and symptoms
| Symptom | Root cause | Where to look |
|---|---|---|
| Attempt stuck in START_WAIT for minutes | Scheduler can't satisfy locality/resource | TaskSchedulerManager log; relax locality |
| Attempt marked FAILED when container was preempted | TerminationCause set incorrectly | Check the TA_CONTAINER_TERMINATED handler |
| Two attempts both commit outputs (data corruption) | setOutputCommitted race; speculative commit | Run TestSpeculation; ensure committer is idempotent |
| TaskAttempt heartbeat timeout fires even though task was running | AM GC pause; clock skew | Tune AM heap; check NM/AM clock drift |
| Recovery comes back with all attempts FAILED | Recovery log lacks TaskAttemptStartedEvent for last attempt | Force flush before submitting next event |
| KILL_IN_PROGRESS lingers | TA_CONTAINER_TERMINATED never arrives | NM is dead; AM eventually times out container |
Validation: prove you understand this
- Without running code: given an attempt in RUNNING and event TA_CONTAINER_TERMINATED with cause INTERNAL_PREEMPTION, what is the next state and does the failure counter increment?
- From the enum, list every TaskAttemptTerminationCause and tag each "counts" / "does not count".
- Reproduce a heartbeat timeout on MiniTezCluster by suspending a task JVM. Identify the exact log line that transitions the attempt.
- Walk the path from TaskCommunicatorManager.heartbeat returning a LATEST_RESPONSE_TIMEOUT to TaskAttemptImpl.handle(TA_TIMED_OUT).
- Verify that a speculative-loser attempt does not corrupt counters by reading the kill-handler code.
State Machines
Tez's AM uses Hadoop's StateMachineFactory extensively: every long-lived
entity (DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl, container
state objects) is a state machine. This chapter explains the API, the
dispatcher contract that keeps state machines correct, the AsyncDispatcher
vs DrainDispatcher distinction, the common InvalidStateTransitonException
bug class, and the discipline required to add a transition safely.
After this chapter you should be able to write a small state machine from scratch and review a transition-modifying patch.
The API
The factory lives in:
hadoop-yarn-common
org/apache/hadoop/yarn/state/StateMachineFactory.java
org/apache/hadoop/yarn/state/SingleArcTransition.java
org/apache/hadoop/yarn/state/MultipleArcTransition.java
org/apache/hadoop/yarn/state/InvalidStateTransitonException.java
(Yes, the exception is spelled Transiton in the Hadoop source — historical
typo, preserved for compatibility. Greps that look for Transition will
miss it.)
Skeleton:
private static final StateMachineFactory<MyEntity, MyState, MyEvtType, MyEvt>
stateMachineFactory =
new StateMachineFactory<MyEntity, MyState, MyEvtType, MyEvt>(MyState.NEW)
// Single-arc: state, nextState, event, transition
.addTransition(MyState.NEW, MyState.RUNNING,
MyEvtType.START,
new StartTransition())
// Multiple-arc: state, set of possible next states, event, transition
.addTransition(MyState.RUNNING,
EnumSet.of(MyState.SUCCEEDED, MyState.FAILED),
MyEvtType.DONE,
new DoneTransition())
// Self-loop: state, state, event, transition
.addTransition(MyState.RUNNING, MyState.RUNNING,
MyEvtType.HEARTBEAT,
new HeartbeatTransition())
// No-op self-loop with no transition object
.addTransition(MyState.SUCCEEDED, MyState.SUCCEEDED,
EnumSet.of(MyEvtType.HEARTBEAT))
.installTopology();
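installTopology() returns the factory with its transition table built; store it in a static final field and create one state machine per instance with make(this):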
private final StateMachine<MyState, MyEvtType, MyEvt> stateMachine =
stateMachineFactory.make(this);
public void handle(MyEvt event) {
writeLock.lock();
try {
MyState oldState = stateMachine.getCurrentState();
try {
stateMachine.doTransition(event.getType(), event);
} catch (InvalidStateTransitonException e) {
LOG.error("Invalid event " + event.getType() + " at " + oldState);
// typically: re-throw or transition to ERROR
}
} finally {
writeLock.unlock();
}
}
Single-arc vs multiple-arc
| Concept | When to use | Implementation |
|---|---|---|
| SingleArcTransition<OPERAND, EVENT> | The next state is always the same | void transition(OPERAND op, EVENT event) |
| MultipleArcTransition<OPERAND, EVENT, STATE> | Next state depends on event content | STATE transition(OPERAND op, EVENT event) (returns the chosen state) |
You almost always start with SingleArcTransition. Promote to
MultipleArcTransition only when the next state legitimately depends on
runtime data (e.g., "if task count == 0 then SUCCEEDED else RUNNING").
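To make the single-arc versus multiple-arc distinction concrete without pulling in Hadoop, here is a minimal table-driven sketch in plain Java. It is not StateMachineFactory: the class, the Integer payload, and the use of IllegalStateException in place of InvalidStateTransitonException are all assumptions chosen to keep the example self-contained.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

/** Minimal table-driven state machine; a sketch of the idea, not Hadoop's StateMachineFactory. */
class TinyStateMachineSketch {
    enum State { RUNNING, SUCCEEDED, FAILED }
    enum Event { HEARTBEAT, DONE }

    // (state, event) -> function that returns the next state; a missing entry means "invalid".
    private final Map<State, Map<Event, Function<Integer, State>>> table =
        new EnumMap<>(State.class);
    private State current = State.RUNNING;

    TinyStateMachineSketch() {
        // Single-arc: the next state is fixed regardless of the payload.
        add(State.RUNNING, Event.HEARTBEAT, payload -> State.RUNNING);
        // Multiple-arc: the handler inspects the payload and returns the chosen state.
        add(State.RUNNING, Event.DONE,
            exitCode -> exitCode == 0 ? State.SUCCEEDED : State.FAILED);
        // No-op self-loop so a late heartbeat after success is tolerated.
        add(State.SUCCEEDED, Event.HEARTBEAT, payload -> State.SUCCEEDED);
    }

    private void add(State from, Event on, Function<Integer, State> arc) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, Function<Integer, State>>(Event.class))
             .put(on, arc);
    }

    /** Apply one event; throws if no arc is registered for (current state, event). */
    State handle(Event event, int payload) {
        Map<Event, Function<Integer, State>> arcs = table.get(current);
        Function<Integer, State> arc = (arcs == null) ? null : arcs.get(event);
        if (arc == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = arc.apply(payload);
        return current;
    }

    State getState() {
        return current;
    }
}
```

In this sketch a DONE event with payload 0 moves RUNNING to SUCCEEDED, a later HEARTBEAT is absorbed by the no-op arc, and a second DONE throws because no arc is registered for (SUCCEEDED, DONE).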
Dispatcher contract
State machines are not thread-safe by themselves. Tez upholds correctness via the single-dispatcher-thread invariant:
- All events for a DAGAppMaster's state machines flow through one AsyncDispatcher.
- The dispatcher has one thread that pulls events and calls handle(event).
- Therefore handlers run serially; no two handle() calls overlap.
This invariant is the reason VertexImpl.handle can manipulate fields
without synchronization. Break the invariant and you get races no test will
catch consistently.
grep -n "AsyncDispatcher\|GenericEventHandler" \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
AsyncDispatcher vs DrainDispatcher
| Class | Where | Behavior |
|---|---|---|
| AsyncDispatcher | Production | Background thread; events processed asynchronously |
| DrainDispatcher | Tests | Same API; tests call await() to block until queue empty |
Tests use DrainDispatcher so they can assert state after a known set of
events has been processed:
DrainDispatcher dispatcher = new DrainDispatcher();
dispatcher.register(VertexEventType.class, vertexEventHandler);
dispatcher.init(conf);
dispatcher.start();
dispatcher.getEventHandler().handle(new VertexEvent(...));
dispatcher.await(); // blocks until queue empty
assertEquals(VertexState.RUNNING, vertex.getState());
find . -name "DrainDispatcher.java"
grep -rn "new DrainDispatcher" tez-dag/src/test/java | head
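The idea behind a drainable dispatcher can be sketched in plain Java: one worker thread consumes a queue, and await() blocks until every dispatched event has been handled. This is an illustration of the concept only; the class name and its internals are hypothetical and do not reflect how DrainDispatcher is actually implemented.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical single-threaded dispatcher with a test-friendly await(); not Hadoop's class. */
class DrainableDispatcherSketch<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Consumer<E> handler;
    private final Object lock = new Object();
    private int pending;                     // events queued or currently being handled
    private final Thread worker;

    DrainableDispatcherSketch(Consumer<E> handler) {
        this.handler = handler;
        this.worker = new Thread(this::loop, "dispatcher-sketch");
        this.worker.setDaemon(true);
    }

    void start() {
        worker.start();
    }

    void stop() {
        worker.interrupt();
    }

    /** Enqueue an event; safe to call from inside a handler to emit follow-up events. */
    void dispatch(E event) {
        synchronized (lock) {
            pending++;
        }
        queue.add(event);
    }

    /** Block until every dispatched event, including follow-ups, has been handled. */
    void await() {
        synchronized (lock) {
            while (pending > 0) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while draining", e);
                }
            }
        }
    }

    /** The only thread that ever calls the handler, so handlers run serially. */
    private void loop() {
        try {
            while (true) {
                E event = queue.take();
                try {
                    handler.accept(event);
                } finally {
                    synchronized (lock) {
                        pending--;
                        lock.notifyAll();
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // stop() was called; exit quietly
        }
    }
}
```

Because the pending counter is incremented at dispatch time, a handler that emits a follow-up event keeps await() blocked until that follow-up is processed too, which is the property tests rely on when asserting state after a chain of events.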
InvalidStateTransitonException
Thrown when doTransition(type, event) finds no registered handler for the
(currentState, eventType) pair. The exception message has the form:
Invalid event: V_TASK_RESCHEDULED at SUCCEEDED
Common causes:
- A late event arrived after the entity reached a terminal state (race between cancellation and a completion event).
- A new code path emits an event but the receiving state machine forgot to register a handler.
- The event sender misunderstood the protocol.
Fixing one of these almost always requires:
- Adding a (state, eventType, sameState) no-op transition (case 1).
- Adding a real transition (case 2).
- Removing the bogus emit (case 3).
Never silently catch and swallow the exception in production code: it indicates a real protocol violation. Equally, do not let it escape unhandled on the dispatch thread; log it and move the entity to an error state, because a graceful error is a better outcome than a dead dispatcher.
How to add a transition safely
Process every Tez committer follows when modifying a state machine:
- Find the existing transitions for the state: read all addTransition(STATE, ...) lines.
- Identify the gap: confirm the event is not already handled.
- Add the transition in the correct alphabetical/grouping order the file uses.
- Add a unit test to the corresponding Test*Impl class that triggers the new event in the relevant state.
- Update related no-op transitions for terminal states (a new event needs no-op handlers in SUCCEEDED, FAILED, KILLED).
- Run all tests in the module before opening a PR.
The discipline "always update the test in the same patch" is enforced by
reviewers. PRs that change VertexImpl without changes to TestVertexImpl
are typically blocked.
Reading exercise
# Find the factory blocks
grep -n "stateMachineFactory" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/*.java
# Count transitions per entity
for f in tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java; do
echo "$f $(grep -c addTransition $f)"
done
# Look at one transition impl
grep -n "class StartTransition\|class InitTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
Answer:
- Why is the exception named InvalidStateTransitonException (with a typo)? What would happen if you renamed it?
- Which Tez class uses MultipleArcTransition most heavily, and why?
- What does installTopology() return, and why is the factory typically a static final field?
- In TestVertexImpl, find the DrainDispatcher.await() calls. Why are they essential and what failure mode occurs if you forget?
- If two threads call vertexImpl.handle(event) concurrently, bypassing the dispatcher, what specific bug class arises?
- Read one MultipleArcTransition and explain how its return value determines the next state.
Common bugs and symptoms
| Symptom | Root cause | Fix |
|---|---|---|
| InvalidStateTransitonException: Invalid event X at TERMINAL_STATE | Late event after terminal state | Add a no-op transition |
| Test passes locally, fails on CI intermittently | DrainDispatcher.await() missing or called too early | Always call await() between event sends and asserts |
| State machine mutates wrong fields | Transition class accidentally captures outer state | Make transition classes static; pass everything via the event |
| Dispatcher thread deadlocks | Handler is doing blocking I/O on dispatch thread | Move I/O to a worker; emit a follow-up event when done |
| addTransition for a no-op throws compile error | Wrong arity overload | Use the variant with EnumSet<EventType> |
| Adding a transition silently breaks recovery | Recovery replay hits the new event in an old state | Cover the recovery test path; recovery uses the same SM |
Validation: prove you understand this
- Implement a Light state machine with states OFF, ON, BROKEN and events TOGGLE, BREAK. Compile and unit-test.
- Find every SingleArcTransition in VertexImpl that is registered as static final and explain why it is static.
- Take an InvalidStateTransitonException from a real AM log; map it to the exact (state, event) pair and propose either a fix or a JIRA.
- Run TestVertexImpl#testKilledTasksHandling. Identify every DrainDispatcher.await() call and what it guards.
- Add a (SUCCEEDED, T_HEARTBEAT, SUCCEEDED) no-op to TaskImpl and the corresponding test in TestTaskImpl. Ensure all tests pass.
Event Routing
Events are the only sanctioned API for mutating any AM-side entity. This chapter catalogs the event hierarchy, explains the "events are the only mutation API" rule, walks how a single task-completion percolates up to the DAG, and shows where each event is registered and dispatched.
After this chapter you should be able to trace any state transition in the AM back through the chain of events that caused it.
The hierarchy
hadoop-yarn-common
org/apache/hadoop/yarn/event/AbstractEvent<EVT_TYPE>
org/apache/hadoop/yarn/event/EventHandler<E>
org/apache/hadoop/yarn/event/AsyncDispatcher
tez-dag
org/apache/tez/dag/app/dag/event/
DAGEvent (subclasses: DAGEventStart, DAGEventDAGAttemptStarted, ...)
VertexEvent (subclasses: VertexEventTaskCompleted, VertexEventVertexCompleted, ...)
TaskEvent (subclasses: TaskEventTAUpdate, TaskEventTermination, ...)
TaskAttemptEvent (subclasses: TaskAttemptEventStartedRemotely, ...)
AMSchedulerEvent
AMContainerEvent
AMNodeEvent
SpeculatorEvent
...
Hint to grep all event classes:
find tez-dag/src/main/java/org/apache/tez/dag/app -path "*event*" -name "*.java" \
| xargs grep -l "extends AbstractEvent\|extends DAGEvent\|extends VertexEvent" \
| head -30
The AbstractEvent<E> base has two fields: an event type (enum) and a
timestamp. Concrete event classes add payloads (e.g.,
VertexEventTaskCompleted carries the TezTaskID and the
TaskAttemptIdentifier).
The "events are the only mutation API" rule
This rule is the bedrock of correctness:
Any change to the externally observable state of a DAGImpl, VertexImpl, TaskImpl, or TaskAttemptImpl must occur inside a state-machine transition handler, triggered by an event that flowed through the AsyncDispatcher.
Concretely:
- Never call a setter directly on VertexImpl from another thread.
- Never have one entity reach into another and mutate. Send an event.
- The only "side door" is read-only getters (intentionally not synchronized; callers tolerate slight staleness).
Why this rule:
- Concurrency safety: the dispatcher serializes everything. Direct mutation re-introduces races.
- Auditability: events appear in the AM log; field writes do not.
- Recoverability: RecoveryService writes events; replay rebuilds state. Mutations outside events are invisible to recovery.
- Testability: DrainDispatcher controls the world; bypass it and tests become non-deterministic.
A patch that calls a mutator method outside a transition handler is, by convention, immediately rejected.
Bubble-up: a task completion to the DAG
sequenceDiagram
participant TA as TaskAttemptImpl
participant T as TaskImpl
participant V as VertexImpl
participant D as DAGImpl
participant DI as Dispatcher
Note over TA: heartbeat -> done(...) on umbilical
TA->>TA: handle(TA_DONE)
TA-->>DI: emit T_ATTEMPT_SUCCEEDED
DI->>T: handle(T_ATTEMPT_SUCCEEDED)
T->>T: mark winner; check siblings
T-->>DI: emit V_TASK_COMPLETED (success)
DI->>V: handle(V_TASK_COMPLETED)
V->>V: bump succeededTaskCount
alt All tasks done
V-->>DI: emit V_COMMIT_REQUEST (if applicable)
V-->>DI: emit DAG_VERTEX_COMPLETED
DI->>D: handle(DAG_VERTEX_COMPLETED)
D->>D: bump succeededVertexCount
end
Every arrow is a state-machine transition. Every emit is an
eventHandler.handle(...) call inside the transition body.
Find the emit sites:
grep -n "eventHandler.handle\b" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head
grep -n "eventHandler.handle\b" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
Where events are registered
Registrations live in DAGAppMaster.serviceInit (see
dag-app-master.md):
grep -n "dispatcher.register\|register(.*\.class" \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
Each registration maps an event type to a handler. Most handlers are inner
classes that delegate to entity.handle(event):
private class TaskEventDispatcher implements EventHandler<TaskEvent> {
@Override
public void handle(TaskEvent event) {
DAG dag = context.getCurrentDAG();
Task task = dag.getVertex(event.getTaskID().getVertexID())
.getTask(event.getTaskID());
((EventHandler<TaskEvent>) task).handle(event);
}
}
Why the indirection: events carry IDs, not object references. The dispatcher handler does the resolve, then forwards.
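The resolve-then-forward pattern can be reduced to a registry lookup. The sketch below uses hypothetical types (string IDs, TaskSketch, TaskEventSketch) rather than Tez's TezTaskID and TaskEvent, and the choice to drop events for unknown IDs is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of ID-based event routing; types are stand-ins, not Tez classes. */
class IdRoutingSketch {
    /** An event names its target by ID instead of holding an object reference. */
    static final class TaskEventSketch {
        final String taskId;
        final String type;

        TaskEventSketch(String taskId, String type) {
            this.taskId = taskId;
            this.type = type;
        }
    }

    /** Stand-in for an entity that handles its own events. */
    static final class TaskSketch {
        String lastEventType;

        void handle(TaskEventSketch event) {
            lastEventType = event.type;
        }
    }

    private final Map<String, TaskSketch> tasks = new HashMap<>();

    void register(String taskId, TaskSketch task) {
        tasks.put(taskId, task);
    }

    /**
     * Resolve the ID at delivery time, then forward to the entity.
     *
     * @return false if no entity with that ID is registered (the event is dropped)
     */
    boolean route(TaskEventSketch event) {
        TaskSketch target = tasks.get(event.taskId);
        if (target == null) {
            return false;
        }
        target.handle(event);
        return true;
    }
}
```

Resolving at delivery time means an event can be created, queued, or logged without pinning the entity in memory, and a stale ID surfaces as a failed lookup rather than a write to a discarded object.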
Per-entity event types
| Entity | Event type enum | Where emitted |
|---|---|---|
| DAGImpl | DAGEventType | Vertex completions, kill, recovery |
| VertexImpl | VertexEventType | Task completions, manager callbacks, root input events |
| TaskImpl | TaskEventType | Attempt completions, speculation, kill |
| TaskAttemptImpl | TaskAttemptEventType | Container events, umbilical events |
| TaskSchedulerManager | AMSchedulerEventType | New requests, completions, container availability |
| AMContainerImpl | AMContainerEventType | Launch, assignment, completion |
| HistoryEventHandler | HistoryEventType | Any history-loggable change |
Each enum lives next to the event class:
ls tez-dag/src/main/java/org/apache/tez/dag/app/dag/event/*EventType.java
Reading exercise
# Catalog
find tez-dag/src/main/java/org/apache/tez/dag/app -name "*Event.java" \
| head -40
# Find a transition that emits other events
grep -B2 -A15 "class CommitCompletedTransition" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
# Find AMSchedulerEvent emit sites
grep -rn "AMSchedulerEvent" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ | head
# Compare: emit vs direct mutation
grep -n "eventHandler.handle" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | wc -l
Answer:
- Why do events carry IDs (e.g., TezTaskID) rather than object references?
- Find an event that crosses subsystems: e.g., TaskAttemptImpl emitting an AMSchedulerEvent. What is the receiver and what action does it take?
- List the four classes of events that VertexImpl.handle reacts to and the three classes it emits.
- How does the AM ensure ordering when multiple events for the same entity are emitted in quick succession?
- What happens if a transition handler throws an uncaught exception? Which thread catches it?
- Find one event that has no consumers (dead code). If you find one, propose its removal in a JIRA.
Common bugs and symptoms
| Symptom | Root cause | Where to look |
|---|---|---|
| Inconsistent state visible to two getters | Direct mutation outside dispatcher | Audit for setters called from non-handler code |
| Event "lost" — entity never sees it | Forgot to register handler in DAGAppMaster.serviceInit | Add registration; add unit test |
| Replay during recovery diverges from original run | An event was emitted but not recorded (recovery log gap) | RecoveryService writer filter |
| Deadlock when one entity event handler tries to read another entity | Reader path uses a lock held elsewhere | Prefer event-emit over cross-entity reads |
| Test hangs in DrainDispatcher.await() | Transition emitted an event of a type with no handler in test | Register the missing handler (no-op is fine) |
| One subsystem floods the dispatcher | Storm of small events (e.g., per-heartbeat) | Batch in the emitter; or upgrade to a separate dispatcher |
Validation: prove you understand this
- Pick one transition in TaskAttemptImpl and trace every event it emits; for each, name the receiving entity.
- Open DAGAppMaster and list every event type registered, in order.
- Walk a V_KILL from DAGImpl.killDAG down to a TaskAttemptImpl actually shutting down its container.
- Write a unit test that triggers a transition with an event whose payload is malformed; verify the dispatcher logs the error without crashing.
- Explain why moving from AsyncDispatcher to a multi-threaded dispatcher would break Tez and what would have to change to support it.
IPO Abstractions
Input, Processor, Output (collectively "IPO") are the three core
runtime contracts. A Tez task is built from one processor, zero or more
inputs, and zero or more outputs. This chapter walks the abstractions, the
distinction between the LogicalInput/LogicalOutput layer (rich, modern)
and the plain Input/Output layer (used for raw byte pipelines), the
lifecycle methods, merged inputs, root vs intermediate inputs, and the
minimum skeleton needed to write a new input or output.
After this chapter you should be able to read any concrete IPO class in
tez-runtime-library and explain what each lifecycle method is for.
The interfaces
tez-api/src/main/java/org/apache/tez/runtime/api/
Input.java
Output.java
Processor.java
LogicalInput.java (extends Input)
LogicalOutput.java (extends Output)
LogicalIOProcessor.java (extends Processor)
AbstractLogicalInput.java (base class for custom inputs)
AbstractLogicalOutput.java
AbstractLogicalIOProcessor.java
Reader.java (the byte/record stream interface)
Writer.java
MergedLogicalInput.java (combines multiple inputs)
InputContext.java
OutputContext.java
ProcessorContext.java
Event.java (DataMovementEvent, etc.)
grep -n "^public " tez-api/src/main/java/org/apache/tez/runtime/api/LogicalInput.java
Plain Input/Output vs LogicalInput/LogicalOutput
| Layer | Class | Why it exists |
|---|---|---|
| Low-level | Input | Bare contract: provides a Reader |
| Low-level | Output | Bare contract: provides a Writer |
| High-level | LogicalInput | Adds events, lifecycle, knowledge of upstream completion |
| High-level | LogicalOutput | Adds events (to AM and downstream) |
Almost all production inputs/outputs are LogicalInput/LogicalOutput. The
plain layer exists for primitive byte-stream cases (rarely used directly).
Lifecycle methods (LogicalInput)
public abstract class AbstractLogicalInput implements LogicalInput {
// Called by the runtime when the task starts. Setup; no I/O yet.
public abstract List<Event> initialize() throws Exception;
// Called after `initialize` for *all* inputs has completed.
// Begin actively pulling data.
public abstract void start() throws Exception;
// The processor calls this to get a Reader for this input.
public abstract Reader getReader() throws Exception;
// Handle data movement / control events from the AM (e.g., upstream task done).
public abstract void handleEvents(List<Event> inputEvents) throws Exception;
// Final cleanup; close streams; return any final events.
public abstract List<Event> close() throws Exception;
}
Order in a task's life:
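constructor (receives the InputContext) -> initialize -> start -> getReader -> close
handleEvents is not part of this sequence; the runtime calls it whenever events arrive.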
initialize returns events to the AM (for example,
InputInitializerEvents that ask the AM to do more split work). Most
inputs return an empty list.
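The phase ordering above can be sketched with a toy driver. ToyInput and ToyTaskDriver are hypothetical names, not Tez classes, and the sketch leaves out event handling; it only shows that every input is initialized before any input is started.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy input that only records which lifecycle method ran.
final class ToyInput {
    private final String name;
    private final List<String> log;

    ToyInput(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    void initialize() { log.add(name + ".initialize"); }
    void start() { log.add(name + ".start"); }
    void getReader() { log.add(name + ".getReader"); }
    void close() { log.add(name + ".close"); }
}

final class ToyTaskDriver {
    // Runs each phase across all inputs before moving to the next phase.
    static List<String> run(String... inputNames) {
        List<String> log = new ArrayList<>();
        List<ToyInput> inputs = new ArrayList<>();
        for (String n : inputNames) {
            inputs.add(new ToyInput(n, log));
        }
        for (ToyInput in : inputs) in.initialize();
        for (ToyInput in : inputs) in.start();
        for (ToyInput in : inputs) in.getReader();
        for (ToyInput in : inputs) in.close();
        return log;
    }
}
```

In a real task the processor decides when getReader is called, so treat the strict phase loops here as a simplification.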
Lifecycle methods (LogicalOutput)
Mirror of input:
public abstract class AbstractLogicalOutput implements LogicalOutput {
public abstract List<Event> initialize() throws Exception;
public abstract void start() throws Exception;
public abstract Writer getWriter() throws Exception;
public abstract void handleEvents(List<Event> outputEvents) throws Exception;
public abstract List<Event> close() throws Exception;
}
Closing an output is the most consequential call: it flushes pending
data and returns a CompositeDataMovementEvent (or a
VertexManagerEvent) that tells the AM, and through it the downstream
vertices, what this output produced.
Root inputs vs intermediate inputs
| Kind | Source of data | Initializer runs where? |
|---|---|---|
| Root input | External (HDFS, HBase, Kafka) | AM-side: InputInitializer enumerates splits, emits InputDataInformationEvents |
| Intermediate input | Upstream Tez vertex output | No initializer; data arrives via DataMovementEvent from the AM |
MRInput is the canonical root input. Its AM-side initializer
(MRInputAMSplitGenerator) calls InputFormat.getSplits(...) and pushes
the resulting splits to tasks.
Intermediate inputs (e.g., OrderedGroupedKVInput) receive their data
descriptors from the AM via DataMovementEvents — one event per upstream
task completion, carrying the upstream task's location and partition.
MergedLogicalInput
When a vertex has multiple physical inputs that should look like one to the
processor (e.g., a vertex group union), Tez wraps them in a
MergedLogicalInput:
grep -n "MergedLogicalInput\|getInputs\|getReader" \
tez-api/src/main/java/org/apache/tez/runtime/api/MergedLogicalInput.java
The processor calls getReader() once; the merged input combines all
underlying readers. Common subclasses live in tez-runtime-library:
- OrderedGroupedMergedKVInput: merges K/V streams, preserving sort order.
- ConcatenatedMergedKeyValueInput: concatenates the underlying inputs.
Events flowing between AM and task
| Event class | Direction | Carries |
|---|---|---|
| DataMovementEvent | AM → task input | Source task index, source URL/path, partition |
| InputReadErrorEvent | task input → AM | "This source URL is broken, please re-route" |
| CompositeDataMovementEvent | task output → AM (then forwarded) | Bulk version of DataMovementEvent |
| InputDataInformationEvent | AM → task input | Concrete split (root inputs only) |
| InputInitializerEvent | task → AM (initializer) | Custom signal to the initializer |
| VertexManagerEvent | task output → AM (vertex manager) | Stats for auto-parallelism (ShuffleVertexManager) |
ls tez-api/src/main/java/org/apache/tez/runtime/api/events/
Minimal LogicalInput skeleton
package com.example;
import org.apache.tez.runtime.api.*;
import org.apache.tez.runtime.api.events.*;
import java.io.IOException;
import java.util.Collections;
import java.util.List;
public class HelloLogicalInput extends AbstractLogicalInput {
private final List<Event> deferred = new java.util.ArrayList<>();
public HelloLogicalInput(InputContext ctx, int physicalInputCount) {
super(ctx, physicalInputCount);
}
@Override
public List<Event> initialize() throws IOException {
// Allocate resources here. Do not do I/O.
return Collections.emptyList();
}
@Override
public void start() throws IOException {
// Begin background fetch threads if any.
}
@Override
public Reader getReader() throws IOException {
// Return the Reader the processor will consume.
// SimpleStringReader is a hypothetical helper you write yourself; it is not part of Tez.
return new SimpleStringReader("hello");
}
@Override
public void handleEvents(List<Event> events) throws IOException {
// Receive DataMovementEvents from the AM. Build internal routing.
}
@Override
public List<Event> close() throws IOException {
return Collections.emptyList();
}
}
Real implementations to read for reference:
find tez-runtime-library/src/main/java -name "OrderedGrouped*Input*.java"
find tez-runtime-library/src/main/java -name "Unordered*Input.java"
Reading exercise
sed -n '1,140p' tez-api/src/main/java/org/apache/tez/runtime/api/LogicalInput.java
sed -n '1,140p' tez-api/src/main/java/org/apache/tez/runtime/api/LogicalOutput.java
grep -rn "extends AbstractLogicalInput" tez-runtime-library/src/main/java | head
grep -rn "extends AbstractLogicalOutput" tez-runtime-library/src/main/java | head
# Event flow
ls tez-api/src/main/java/org/apache/tez/runtime/api/events/
Answer:
- What is the ordering guarantee between initialize() calls across the multiple inputs/outputs of a task?
- When does start() get called relative to getReader()?
- What's the difference in return semantics between getReader() of a LogicalInput vs. a MergedLogicalInput?
- Find one concrete LogicalOutput; identify what event types its close() returns and what downstream effect each has.
- Why does initialize() return List<Event> instead of void?
- What is the difference between InputInitializerEvent and InputDataInformationEvent? Who emits each?
Common bugs and symptoms
| Symptom | Root cause | Where to look |
|---|---|---|
| Task hangs in getReader() | Input's start() never returned; deadlock with handler | Always make start() non-blocking |
| NullPointerException in handleEvents | Events arrived before initialize(); you're using a field not yet set | Allocate state in initialize() |
| Downstream sees half the data | close() returned Collections.emptyList() when it should have emitted a DataMovementEvent | Always emit completion events |
| Custom input never receives DataMovementEvents | EdgeManager on the upstream side not aware of your partitioning | Check edge property OutputDescriptor matches your InputDescriptor |
| Root input never starts | Initializer's handleInputInitializerEvent not implemented | Provide a default; never silently drop |
| Task succeeds but produces no output | Writer was never flushed (forgot close()) | Verify with IFile size = 0 in logs |
Validation: prove you understand this
- Write a minimal LogicalInput that produces 100 fixed strings via its Reader. Wire it into a one-vertex DAG and run on MiniTezCluster.
- From OrderedGroupedKVInput, identify exactly when handleEvents is called and what it does with each event.
- List every event class in org.apache.tez.runtime.api.events.
- Diagram the events flowing from one upstream task's LogicalOutput.close() to a downstream task's LogicalInput.handleEvents().
- Explain why initialize() is split from start() rather than collapsed into a single method.
Logical vs Physical Plan
Tez exposes two planes to the user and to the runtime:
- Logical plan: what the application author writes (vertices, edges, edge properties). Lives in tez-api. Immutable once submitted (mostly).
- Physical plan: what the AM actually schedules (task instances per vertex, per-edge routing decisions, container assignments). Lives in tez-dag. Mutable at runtime via VertexManager reconfiguration and EdgeManagerPlugin routing.
This chapter walks the boundary between them.
The logical plane
A logical DAG is a DAG object containing Vertex objects connected by Edge
objects, each carrying an EdgeProperty.
ls tez-api/src/main/java/org/apache/tez/dag/api/ | head -30
Key classes:
| Class | File | Purpose |
|---|---|---|
| DAG | tez-api/src/main/java/org/apache/tez/dag/api/DAG.java | The DAG builder. |
| Vertex | tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java | Logical vertex with processor + target parallelism. |
| Edge | tez-api/src/main/java/org/apache/tez/dag/api/Edge.java | Logical edge between two vertices. |
| EdgeProperty | tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java | Routing + scheduling + storage characteristics. |
EdgeProperty — four orthogonal axes
grep -n "enum DataMovementType\|enum DataSourceType\|enum SchedulingType" \
tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
public enum DataMovementType { ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM }
public enum DataSourceType { PERSISTED, PERSISTED_RELIABLE, EPHEMERAL }
public enum SchedulingType { SEQUENTIAL, CONCURRENT }
| Axis | Values | What it controls |
|---|---|---|
| DataMovementType | ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM | How source outputs map to destination inputs. |
| DataSourceType | PERSISTED, PERSISTED_RELIABLE, EPHEMERAL | Whether outputs survive a task failure; affects re-execution policy. |
| SchedulingType | SEQUENTIAL, CONCURRENT | Whether destination can start before source completes (required for pipelined shuffle and broadcast). |
| OutputDescriptor / InputDescriptor | class names + payloads | The IO classes wired on each side of the edge. |
A logical edge says nothing about which destination task index reads from which source task index. That decision belongs to the EdgeManagerPlugin.
The physical plane
When the AM initializes a DAG it builds:
- VertexImpl per logical Vertex (tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java)
- TaskImpl[] per vertex, sized by parallelism (tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java)
- TaskAttemptImpl per attempt of each task (tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java)
- Edge runtime objects with an active EdgeManagerPlugin (tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/Edge.java)
wc -l tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/{VertexImpl,TaskImpl,TaskAttemptImpl,Edge}.java
Mapping logical to physical
flowchart LR
subgraph logical[Logical]
LV1[Vertex A parallelism=3]
LV2[Vertex B parallelism=2]
LV1 -- "EdgeProperty SCATTER_GATHER" --> LV2
end
subgraph physical[Physical]
A0[A.0] --> B0[B.0]
A0 --> B1[B.1]
A1[A.1] --> B0
A1 --> B1
A2[A.2] --> B0
A2 --> B1
end
logical --> physical
Every source attempt produces partitions for every destination task. The
EdgeManager decides which output partition goes to which input.
EdgeManagerPlugin — the routing brain
find tez-api/src/main/java -name "EdgeManagerPlugin.java"
cat tez-api/src/main/java/org/apache/tez/dag/api/EdgeManagerPlugin.java
The contract (paraphrased):
public abstract class EdgeManagerPlugin {
public abstract void routeDataMovementEventToDestination(
DataMovementEvent event,
int srcTaskIndex,
int outputIndex,
Map<Integer, List<Integer>> destTaskAndInputIndices);
public abstract int getNumDestinationConsumerTasks(int srcTaskIndex);
public abstract int getDestinationConsumerTaskNumber(int srcTaskIndex,
int srcOutputIndex);
public abstract int getNumDestinationTaskPhysicalInputs(int destTaskIndex);
public abstract int getNumSourceTaskPhysicalOutputs(int srcTaskIndex);
}
Built-in implementations
find tez-dag/src/main/java -name "*EdgeManager*.java"
| Plugin | DataMovementType | Routing rule |
|---|---|---|
| ScatterGatherEdgeManager | SCATTER_GATHER | Source task i produces N partitions; destination d reads partition d from every source. |
| BroadcastEdgeManager | BROADCAST | Every source output is consumed by every destination task. |
| OneToOneEdgeManager | ONE_TO_ONE | Requires srcParallelism == destParallelism. Source i → destination i. |
| User-supplied | CUSTOM | Anything. Cartesian product, range partitioning, etc. |
Inspecting routing on a live AM
grep -n "EdgeManager\|setCustomEdgeManager" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/Edge.java | head -20
For each destination task, Edge.sendTezEventToDestinationTasks() consults the
plugin to expand source outputs into per-destination input events. The
destination task receives a DataMovementEvent per logical input partition.
SCATTER_GATHER walkthrough
Source: A with parallelism 3; each task emits N = 2 partitions, one per destination task.
Destination: B with parallelism 2.
cat tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ScatterGatherEdgeManager.java
For source task A.1 emitting partitions [0, 1]:
| Call | Returns |
|---|---|
| getNumSourceTaskPhysicalOutputs(1) | 2 (= destination parallelism) |
| getNumDestinationTaskPhysicalInputs(0) | 3 (= source parallelism) |
| getNumDestinationConsumerTasks(1) | 2 |
| routeDataMovementEventToDestination(event, 1, 0, out) | out = { 0 -> [1] } |
| routeDataMovementEventToDestination(event, 1, 1, out) | out = { 1 -> [1] } |
Invariant: numSrcOutputs == destParallelism, numDestInputs == srcParallelism.
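The table can be reproduced with a toy router. ToyScatterGatherRouter is a hypothetical class written for this walkthrough, not the Tez ScatterGatherEdgeManager; it encodes only the rule that partition p of source task s lands on destination task p at input slot s.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of the scatter-gather rule; not the Tez implementation.
final class ToyScatterGatherRouter {
    private final int srcParallelism;
    private final int destParallelism;

    ToyScatterGatherRouter(int srcParallelism, int destParallelism) {
        this.srcParallelism = srcParallelism;
        this.destParallelism = destParallelism;
    }

    // One output partition per destination task.
    int numSourceTaskPhysicalOutputs() {
        return destParallelism;
    }

    // One input slot per source task.
    int numDestinationTaskPhysicalInputs() {
        return srcParallelism;
    }

    // Partition outputIndex of source task srcTaskIndex goes to
    // destination task outputIndex, input slot srcTaskIndex.
    void route(int srcTaskIndex, int outputIndex, Map<Integer, List<Integer>> out) {
        if (srcTaskIndex < 0 || srcTaskIndex >= srcParallelism) {
            throw new IllegalArgumentException("bad source task index");
        }
        if (outputIndex < 0 || outputIndex >= destParallelism) {
            throw new IllegalArgumentException("bad output index");
        }
        out.computeIfAbsent(outputIndex, k -> new ArrayList<>()).add(srcTaskIndex);
    }
}
```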
ONE_TO_ONE walkthrough
cat tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/OneToOneEdgeManager.java
Requires numSrcTasks == numDestTasks. Each source produces exactly one
partition consumed by exactly one destination of the same index.
Common bug: changing destination parallelism via reconfigureVertex while a
ONE_TO_ONE edge feeds it. Tez throws at edge initialization.
BROADCAST walkthrough
cat tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/BroadcastEdgeManager.java
Source emits a single logical output. Every destination task receives one
input event per source task. Cost scales as srcParallelism * destParallelism
— large broadcast vertices are an antipattern.
CUSTOM walkthrough — CartesianProductEdgeManager
find . -path "*/src/main/java/*" -name "CartesianProductEdgeManager*.java"
wc -l $(find . -path "*/src/main/java/*" -name "CartesianProductEdgeManager*.java")
CartesianProductVertexManager chunks source outputs and creates a 2D grid of
destination tasks; the edge manager projects (srcChunkX, srcChunkY) → destIndex.
The CUSTOM movement type is the contract by which Hive ships its own routing
for unconventional joins.
Runtime mutation: parallelism reconfiguration
A logical Vertex declares a target parallelism; the physical parallelism
can change before the vertex starts running, via the VertexManager:
grep -n "reconfigureVertex" tez-api/src/main/java/org/apache/tez/dag/api/VertexManagerPluginContext.java
reconfigureVertex(int parallelism, VertexLocationHint, Map<String,EdgeProperty>)
does three things in one atomic step inside VertexImpl:
- Resizes the TaskImpl[] array (must happen before any task is scheduled).
- Re-installs EdgeManagerPlugin instances on incoming edges.
- Updates location hints used by the scheduler.
Read the state machine guard:
grep -n "reconfigureVertex\|VertexState.INITED\|VertexState.INITIALIZING" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -30
Reconfiguration is illegal once any task has been scheduled.
Worked example: ShuffleVertexManager auto-parallelism
- Vertex R declared with parallelism = 100 (pessimistic upper bound).
- Upstream tasks emit VertexManagerEvent payloads with byte counts per partition.
- ShuffleVertexManager.onVertexManagerEventReceived accumulates totals.
- After the slow-start threshold, it computes target = ceil(totalBytes / desiredTaskInputSize) clamped to [minParallelism, originalParallelism].
- Calls reconfigureVertex(target, null, updatedEdgeProps).
- VertexImpl resizes from 100 → e.g. 17 task instances.
- The edge manager on the incoming SCATTER_GATHER edge is rebuilt to route 100-partition outputs into 17 destinations (merging at the destination).
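The target computation in the steps above is plain arithmetic and can be sketched in isolation. ToyAutoParallelism is a hypothetical helper, not code from ShuffleVertexManager, and the byte counts used with it are illustrative inputs rather than measured values.

```java
// Hypothetical helper: ceil(totalBytes / desiredTaskInputSize),
// clamped to [minParallelism, originalParallelism].
final class ToyAutoParallelism {
    static int target(long totalBytes, long desiredTaskInputSize,
                      int minParallelism, int originalParallelism) {
        if (totalBytes < 0 || desiredTaskInputSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        // Integer ceiling division without going through floating point.
        long raw = (totalBytes + desiredTaskInputSize - 1) / desiredTaskInputSize;
        long clamped = Math.max(minParallelism, Math.min(originalParallelism, raw));
        return (int) clamped;
    }
}
```

With 1.7 GB of observed output and a desired input size of 100 MB per task, this yields 17, matching the 100 → 17 resize in the example.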
Reading exercise
- grep -n "createEdgeManager\|edgeManager =" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/Edge.java: when is the EdgeManagerPlugin instantiated?
- cat tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java | head -100: list all factory methods on EdgeProperty. Which require an EdgeManagerPluginDescriptor?
- grep -n "setParallelism\|setVertexParallelism" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head: which state transitions accept a parallelism change?
- grep -rn "OneToOneEdgeManager\|ScatterGatherEdgeManager\|BroadcastEdgeManager" tez-dag/src/test: list the unit tests covering each built-in routing.
- cat $(find . -path "*/src/main/java/*" -name "CartesianProductEdgeManager.java") | head -120: what state must this plugin keep across destination task initializations?
- grep -n "EdgeProperty\." ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java | head: which edge property combinations does Hive build?
Answer these:
- For SCATTER_GATHER, what is the size of the output partition array of one source task?
- For ONE_TO_ONE, what happens if the upstream vertex auto-parallelizes from 100 → 17 after the destination has been initialized?
- For BROADCAST, what is the data volume amplification?
- Which EdgeManager methods are called on every destination task init, and which once per edge?
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| Vertex failed: Cannot change parallelism after tasks scheduled | VertexManager.reconfigureVertex invoked after scheduleTasks. Fix ordering. |
| OneToOneEdgeManager: srcParallelism != destParallelism | Auto-parallelism broke the ONE_TO_ONE invariant. Forbid auto-parallelism on ONE_TO_ONE edges. |
| Destination task receives 0 DataMovementEvents | Custom EdgeManagerPlugin returned 0 from getNumDestinationTaskPhysicalInputs. |
| Hive query produces wrong row counts after a custom join | CUSTOM EdgeManagerPlugin mis-routed partitions; fence-post bug in routeDataMovementEventToDestination. |
| BROADCAST edge OOMs the destination | Source parallelism × payload size exceeds destination heap; switch to PERSISTED source type and stream from disk. |
Validation: prove you understand this
- Given Vertex A (parallelism=4) SCATTER_GATHER → Vertex B (parallelism=3), compute the number of DataMovementEvents B.1 receives. Show the work.
- Explain in one sentence each: when does EdgeManagerPlugin get re-instantiated, and when does it survive across reconfiguration?
- Write a one-paragraph rejection of "let's just use BROADCAST for our 500-task lookup vertex" referencing concrete cost.
- Identify the exact line in VertexImpl.java where reconfigureVertex is rejected if tasks have been scheduled. Cite path + line number from grep -n.
- Sketch a CUSTOM EdgeManagerPlugin for range-partitioned merge: source task i emits keys in range [i*R, (i+1)*R); the destination is K tasks where K may differ from source parallelism. Define getNumDestinationTaskPhysicalInputs and the routing rule in code.
Shuffle and Sort
The shuffle layer is where Tez moves data between vertices. It splits into two
halves, both living in tez-runtime-library:
- Sort path (producer side): partition, sort, spill, merge. OrderedPartitionedKVOutput → PipelinedSorter / DefaultSorter → IFile segments on local disk.
- Shuffle path (consumer side): fetch, merge, iterate. ShuffleManager → Fetcher → FetchedInput → MergeManager → ValuesIterator.
Between them sits the YARN ShuffleHandler aux service inside the NodeManager
that serves spilled segments over HTTP.
ls tez-runtime-library/src/main/java/org/apache/tez/runtime/library/
The producer side
OrderedPartitionedKVOutput
find tez-runtime-library/src/main/java -name "OrderedPartitionedKVOutput.java"
wc -l $(find tez-runtime-library/src/main/java -name "OrderedPartitionedKVOutput.java")
The output that powers MapReduce-style shuffles. Lifecycle:
- initialize(): creates a Sorter (Pipelined or Default), allocates a byte buffer of tez.runtime.io.sort.mb megabytes, and registers as a MemoryUpdateCallback with the MemoryDistributor.
- getWriter(): returns a KeyValueWriter that delegates to the sorter.
- close(): calls sorter.flush() to merge spills into final segments and emits CompositeDataMovementEvent per partition with offsets into the merged file.
Two sorter implementations
find tez-runtime-library/src/main/java -name "PipelinedSorter.java" \
-o -name "DefaultSorter.java"
| Sorter | Strategy | When to pick |
|---|---|---|
| DefaultSorter | Single in-memory accumulator; quicksort by (partition, key); spill when buffer crosses tez.runtime.sort.spill.percent; final merge of all spills. | MapReduce parity, conservative memory. |
| PipelinedSorter | Multi-buffer accumulator; concurrent spill thread; per-partition sort and merge; final spill writes the merged output in one pass. | Large outputs, faster; default in Hive. |
Configuration knobs:
| Key | Default | Effect |
|---|---|---|
| tez.runtime.io.sort.mb | 100 | Sort buffer in MB. Used by both sorters. |
| tez.runtime.sort.spill.percent | 0.8 | Threshold to start spilling (DefaultSorter). |
| tez.runtime.sorter.class | PIPELINED | PIPELINED or LEGACY (DefaultSorter). |
| tez.runtime.compress | false | Per-segment compression. |
| tez.runtime.compress.codec | DefaultCodec | Snappy, Lz4, Gzip. |
| tez.runtime.combiner.class | unset | Combiner run during spill merge. |
IFile on-disk format
IFile is the segment format both sorters write.
find tez-runtime-library/src/main/java -name "IFile.java"
grep -n "class Writer\|class Reader\|EOF_MARKER\|writeKVPair" \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java | head -30
Per-record layout:
+--------------+--------------+----------------+----------------+
| keyLen (VInt)| valLen (VInt)| key bytes (KL) | value bytes (VL)|
+--------------+--------------+----------------+----------------+
End of segment:
keyLen = -1, valLen = -1 // EOF_MARKER
If compression is enabled, the bytes between the partition header and EOF_MARKER are compressed; the record framing is inside the compressed stream.
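The record framing can be made concrete with a toy writer. ToyIFileFraming is a hypothetical class, not the Tez IFile.Writer: it models only the keyLen/valLen/key/value layout and the EOF marker, omits any header, checksum, and compression, and assumes Hadoop's variable-length integer encoding, in which values from -112 to 127 occupy a single byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Toy writer for the record framing only; not the real IFile format in full.
final class ToyIFileFraming {
    static final int EOF_MARKER = -1;

    // Sketch restriction: only values that fit in a one-byte VInt.
    private static void writeSmallVInt(ByteArrayOutputStream out, int v) {
        if (v < -112 || v > 127) {
            throw new IllegalArgumentException("sketch only handles one-byte VInts");
        }
        out.write(v); // writes the low 8 bits; -1 becomes 0xFF
    }

    // Each record is {key, value}; returns framed records followed by the EOF marker.
    static byte[] segment(String[][] records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String[] kv : records) {
            byte[] key = kv[0].getBytes(StandardCharsets.UTF_8);
            byte[] val = kv[1].getBytes(StandardCharsets.UTF_8);
            writeSmallVInt(out, key.length);
            writeSmallVInt(out, val.length);
            out.write(key, 0, key.length);
            out.write(val, 0, val.length);
        }
        writeSmallVInt(out, EOF_MARKER); // keyLen = -1
        writeSmallVInt(out, EOF_MARKER); // valLen = -1
        return out.toByteArray();
    }
}
```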
A sorter writes one IFile segment per partition per spill. After the final
merge, an IFile.OutputStream produces one file per output with an *.index
sibling that records (rawLen, partLen, compressedLen) per partition.
find tez-runtime-library/src/main/java -name "TezSpillRecord.java"
grep -n "rawLength\|partLength\|compressedLength" \
$(find tez-runtime-library/src/main/java -name "TezSpillRecord.java")
The index file is what ShuffleHandler reads when a fetcher asks for partition
p of source attempt (vertex, task, attempt).
Combiner integration
Both sorters honor tez.runtime.combiner.class. The combiner is invoked
during the merge step (not during accumulation), running over sorted runs:
grep -n "combiner\|combineAndSpill\|runCombiner" \
$(find tez-runtime-library/src/main/java -name "DefaultSorter.java")
A correct combiner is associative and commutative on the value space; Tez gives no guarantee on how many merge phases run it.
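A small experiment shows why the number of merge passes must not matter. ToyCombinerCheck is a hypothetical class with made-up values: a sum gives the same answer whether it runs once or is re-applied to a partial result, while a mean over raw values does not.

```java
import java.util.Arrays;
import java.util.List;

final class ToyCombinerCheck {
    // Sum: associative and commutative, safe to re-apply.
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Mean over raw values: re-applying it to partial means changes the result.
    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    static long sumOnePass() {
        return sum(Arrays.asList(1L, 2L, 3L, 4L));
    }

    static long sumTwoPasses() {
        long partial = sum(Arrays.asList(1L, 2L, 3L)); // first merge pass
        return sum(Arrays.asList(partial, 4L));        // second merge pass
    }

    static double meanOnePass() {
        return mean(Arrays.asList(1.0, 2.0, 3.0, 6.0));
    }

    static double meanTwoPasses() {
        double partial = mean(Arrays.asList(1.0, 2.0, 3.0)); // first merge pass
        return mean(Arrays.asList(partial, 6.0));            // second merge pass
    }
}
```

A mean can still be combined safely if the combiner carries (sum, count) pairs instead of the mean itself.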
Spill walkthrough
sequenceDiagram
participant P as Processor
participant W as KeyValueWriter
participant S as Sorter (Pipelined)
participant D as Local disk
P->>W: write(K, V) [N times]
W->>S: collect into KV buffer
S->>S: buffer crosses sort.spill.percent
S->>D: spill_0.out (partitioned, sorted)
S->>D: spill_0.out.index
Note over S: continue accepting writes into next buffer
P->>W: close()
W->>S: flush()
S->>D: merge spill_0..spill_N -> file.out + file.out.index
S-->>P: CompositeDataMovementEvent per partition
The consumer side
OrderedGroupedKVInput and ShuffleManager
find tez-runtime-library/src/main/java -name "OrderedGroupedKVInput.java"
find tez-runtime-library/src/main/java -name "ShuffleManager.java"
find tez-runtime-library/src/main/java -name "Shuffle.java"
OrderedGroupedKVInput.initialize() constructs Shuffle which holds:
- ShuffleManager: pool of Fetcher threads and inbound event queue.
- MergeManager: receives FetchedInputs, decides in-memory vs disk placement, kicks off background merges.
- ValuesIterator: the reader the processor sees.
Fetcher
find tez-runtime-library/src/main/java -name "Fetcher.java"
wc -l $(find tez-runtime-library/src/main/java -name "Fetcher.java")
A Fetcher is a thread that connects via HTTP to the NodeManager
ShuffleHandler running on the source task's node:
GET /mapOutput?job=<jobId>&dag=<dagId>&reduce=<partition>&map=<attempt1,attempt2,...>
Multi-map response: ShuffleHandler streams all requested attempts back-to-back,
each prefixed with a header (MapOutputInfo). The Fetcher reads the header,
decides if the payload fits in memory (MergeManager.reserve), and either
writes to an in-memory buffer or directly to disk.
Key configs:
| Key | Default | Effect |
|---|---|---|
| tez.runtime.shuffle.parallel.copies | 20 | Fetcher thread count per task. |
| tez.runtime.shuffle.connect.timeout | 30000 | HTTP connect timeout (ms). |
| tez.runtime.shuffle.read.timeout | 180000 | HTTP socket read timeout (ms). |
| tez.runtime.shuffle.fetch.max.task.output.at.once | 20 | Max attempts per HTTP request. |
| tez.runtime.shuffle.memory.limit.percent | 0.25 | Max fraction of the in-memory shuffle budget that a single fetched input may occupy; larger inputs go to disk. |
| tez.runtime.shuffle.merge.percent | 0.9 | When in-mem buffer crosses this, kick a merge. |
FetchedInput
grep -n "abstract class FetchedInput\|MemoryFetchedInput\|DiskFetchedInput" \
$(find tez-runtime-library/src/main/java -name "FetchedInput.java")
A FetchedInput is one source partition payload. Two subclasses:
- MemoryFetchedInput: bytes held in a ByteBuffer.
- DiskFetchedInput: bytes on local disk under tez.runtime.shuffle.tmp.directory.
The MergeManager decides which based on size and current in-memory budget.
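The placement decision can be sketched as a small budget tracker. ToyMergeBudget, its constructor arguments, and its thresholds are all hypothetical and simplified; the real MergeManager has more states (for example, making a fetcher wait). The sketch only shows the two inputs named above: the size of one input and the memory already in use.

```java
// Hypothetical budget tracker; not the Tez MergeManager.
final class ToyMergeBudget {
    enum Placement { MEMORY, DISK }

    private final long memoryLimit;    // total bytes allowed in memory
    private final long maxSingleInput; // largest single input allowed in memory
    private long used;

    ToyMergeBudget(long memoryLimit, double singleInputFraction) {
        this.memoryLimit = memoryLimit;
        this.maxSingleInput = (long) (memoryLimit * singleInputFraction);
    }

    // Decide where an input of the given size goes, and account for it.
    Placement reserve(long size) {
        if (size > maxSingleInput || used + size > memoryLimit) {
            return Placement.DISK;
        }
        used += size;
        return Placement.MEMORY;
    }

    // Called when an in-memory input has been merged away.
    void release(long size) {
        used = Math.max(0, used - size);
    }

    long used() {
        return used;
    }
}
```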
MergeManager
find tez-runtime-library/src/main/java -name "MergeManager.java"
Three merge tracks:
- In-memory merge — N in-memory inputs are merged into one in-memory buffer or spilled to disk.
- On-disk merge — N on-disk inputs are merged into a single larger on-disk segment.
- Final merge — at processor pull time, remaining in-memory and on-disk
inputs are merged into a unified
KeyValuesReader.
grep -n "InMemoryMerger\|OnDiskMerger\|finalMerge\|mergeFactor" \
$(find tez-runtime-library/src/main/java -name "MergeManager.java") | head -20
tez.runtime.io.sort.factor (default 100) is the maximum number of segments
merged in one pass; more segments trigger multiple passes.
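A simplified model makes the pass count easy to estimate. ToyMergePasses is a hypothetical helper that assumes every pass merges groups of up to `factor` segments into one segment each; real mergers may size the first pass differently, so treat the result as an approximation.

```java
// Hypothetical estimate of merge passes under a fixed merge factor.
final class ToyMergePasses {
    static int passes(int segments, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        int passes = 0;
        while (segments > 1) {
            // Each group of up to `factor` segments collapses into one segment.
            segments = (segments + factor - 1) / factor;
            passes++;
        }
        return passes;
    }
}
```

Under this model, 100 segments with a factor of 100 need one pass, while 101 segments need two.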
ValuesIterator
find tez-runtime-library/src/main/java -name "ValuesIterator.java"
grep -n "next\|groupingKey\|valuesIter" \
$(find tez-runtime-library/src/main/java -name "ValuesIterator.java") | head
Wraps the merged sorted stream, presenting (key, Iterable<value>) pairs to
the processor — the classic reducer API.
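The grouping step can be sketched over plain arrays. ToyValuesGrouper is a hypothetical class, not the Tez ValuesIterator: it groups only adjacent records with equal keys, which is why the merged stream must already be sorted by key.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical grouping over a key-sorted stream of (key, value) records.
final class ToyValuesGrouper {
    // keys[i] pairs with values[i]; returns one "key=[v1, v2, ...]" string per run of equal keys.
    static List<String> group(String[] keys, int[] values) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must align");
        }
        List<String> groups = new ArrayList<>();
        int i = 0;
        while (i < keys.length) {
            String key = keys[i];
            List<Integer> run = new ArrayList<>();
            // Consume the run of adjacent records that share this key.
            while (i < keys.length && keys[i].equals(key)) {
                run.add(values[i]);
                i++;
            }
            groups.add(key + "=" + run);
        }
        return groups;
    }
}
```

If the input were not sorted, the same key would show up in more than one group, which is the failure a reducer would see from a broken merge.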
Shuffle walkthrough
sequenceDiagram
participant T as Task processor
participant SM as ShuffleManager
participant F as Fetcher
participant NM as Source NM (ShuffleHandler)
participant MM as MergeManager
SM->>F: assign source attempt + partition
F->>NM: GET /mapOutput?...
NM-->>F: stream attempt headers + IFile bytes
F->>MM: reserve(size)
alt fits in memory
MM-->>F: MemoryFetchedInput
else too big
MM-->>F: DiskFetchedInput
end
F->>MM: commit FetchedInput
MM->>MM: kick InMemoryMerger / OnDiskMerger when thresholds crossed
T->>SM: getReader() (blocks until all inputs done)
SM->>MM: finalMerge()
MM-->>T: KeyValuesReader (ValuesIterator)
ShuffleHandler is YARN's, not Tez's
ls /opt/hadoop/share/hadoop/yarn/lib/ | grep shuffle # cluster path varies
org.apache.hadoop.mapred.ShuffleHandler lives in Hadoop. NodeManagers load
it as an aux service via yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Tez piggybacks on this — Tez ships no NodeManager-side fetch service.
Misconfigured aux services are a common cause of ConnectException in
Fetcher.
Reading exercise
- grep -n "EOF_MARKER\|writeRecord" $(find tez-runtime-library/src/main/java -name "IFile.java"): verify the EOF sentinel value.
- wc -l $(find tez-runtime-library/src/main/java -name "PipelinedSorter.java" -o -name "DefaultSorter.java"): which is larger? Hypothesize why.
- grep -rn "tez.runtime.io.sort.mb\|tez.runtime.sort.spill.percent" tez-runtime-library/src/main/java: find every read site for these keys.
- grep -n "GET /mapOutput\|reduce=\|map=" $(find ~ -name ShuffleHandler.java 2>/dev/null | head -1): read the exact request format.
- cat $(find tez-runtime-library/src/main/java -name "ShuffleManager.java") | head -200: how is back-pressure on Fetcher threads applied?
- grep -n "combiner" tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/*.java: at what phases does the combiner run?
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| Fetcher: java.net.ConnectException | mapreduce_shuffle aux service not configured or NM not running. |
| Shuffle error: java.io.IOException: Failed on local exception: org.apache.hadoop.security.AccessControlException | ShuffleSecret missing or stale; check JobTokenSecretManager. |
| OOM during sort | tez.runtime.io.sort.mb too high relative to container JVM heap. |
| OOM during shuffle | tez.runtime.shuffle.memory.limit.percent too high; in-memory inputs starve heap. |
| Premature EOF from inputStream | Source task wrote partial IFile (killed mid-spill); destination retries from another attempt. |
| Wrong reducer output count | Combiner not idempotent across merge passes. |
| OnDiskMerger thrashing | tez.runtime.io.sort.factor too low; many tiny segments forcing many merge passes. |
| Long shuffle plateau | One source NM saturated; HDFS-local fetch concentration. |
Validation: prove you understand this
- Sketch the byte layout of an IFile segment containing 3 records and a single partition. Show key/val lengths and the EOF marker.
- A reducer task reads from 200 mappers. With tez.runtime.shuffle.parallel.copies=20 and tez.runtime.shuffle.fetch.max.task.output.at.once=20, what is the minimum number of HTTP requests the fetcher pool must make? Justify.
- Explain why PipelinedSorter reduces wall time but not CPU time.
- Given a 10 GB shuffle into a 4 GB heap reducer with tez.runtime.shuffle.memory.limit.percent=0.25, predict which inputs go to disk versus memory and why.
- Identify the exact file and method where the URL pattern ?reduce=&map= is constructed on the Tez fetcher side. Use grep.
Tez Runtime Internals
The Tez runtime is the code that runs inside the container, not inside the AM. Its job: boot a JVM, accept tasks from the AM over umbilical RPC, run them to completion, and report status.
Three modules collaborate:
- tez-runtime-internals: process boot, task driver, umbilical client, memory broker.
- tez-runtime-library: concrete Input/Processor/Output implementations (KV, shuffle, etc.).
- tez-api: the SPI the user implements (AbstractLogicalInput, AbstractLogicalOutput, AbstractLogicalIOProcessor).
ls tez-runtime-internals/src/main/java/org/apache/tez/runtime/
The container process: TezChild
TezChild.main() is the JVM entry point for every Tez task container.
find tez-runtime-internals/src/main/java -name "TezChild.java"
grep -n "public static void main\|new TezChild\|run()" \
$(find tez-runtime-internals/src/main/java -name "TezChild.java")
Boot sequence (paraphrased from TezChild.java):
- Read JVM args: AM host, AM port, container ID, application attempt ID, PID, JVM identifier.
- Read the security tokens from $HADOOP_TOKEN_FILE_LOCATION.
- Construct a TezTaskUmbilicalProtocol RPC proxy pointing at the AM's TaskAttemptListenerImpl.
- Enter TezChild.run(), an infinite loop:
  a. umbilical.getTask(containerContext) blocks until the AM hands us a ContainerTask.
  b. If ContainerTask.shouldDie(), exit cleanly.
  c. Otherwise build a TezTaskRunner2 for the assigned attempt and call runner.run().
  d. Loop. This is container reuse: same JVM, next task.
flowchart TD
S[JVM start] --> P[Parse args + tokens]
P --> R[RPC connect to AM]
R --> L{umbilical.getTask}
L -- shouldDie --> X[exit]
L -- new task --> T[TezTaskRunner2.run]
T --> L
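The loop in the flowchart can be sketched in a few lines. This is a minimal illustration, not the real TezChild: ReuseLoopSketch, its ContainerTask holder, and the Umbilical interface are hypothetical stand-ins, and a queue replaces the blocking RPC call to the AM.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for the TezChild.run() loop; not the real class. */
public class ReuseLoopSketch {
    /** What the AM hands back: either a task to run or a request to exit. */
    static final class ContainerTask {
        final String attemptId;
        final boolean shouldDie;

        ContainerTask(String attemptId, boolean shouldDie) {
            this.attemptId = attemptId;
            this.shouldDie = shouldDie;
        }
    }

    /** Minimal umbilical; the real one is an RPC proxy whose getTask blocks. */
    interface Umbilical {
        ContainerTask getTask();
    }

    /** Runs tasks until told to die; returns how many tasks this JVM executed. */
    static int runLoop(Umbilical umbilical) {
        int executed = 0;
        while (true) {
            ContainerTask task = umbilical.getTask();
            if (task.shouldDie) {
                return executed;                         // clean exit
            }
            System.out.println("running " + task.attemptId);
            executed++;                                  // same JVM, next task
        }
    }

    public static void main(String[] args) {
        Queue<ContainerTask> fromAm = new ArrayDeque<>();
        fromAm.add(new ContainerTask("attempt_0", false));
        fromAm.add(new ContainerTask("attempt_1", false));
        fromAm.add(new ContainerTask(null, true));
        System.out.println(runLoop(fromAm::poll) + " tasks ran in one JVM");
    }
}
```

Note that the container never decides on its own whether to continue; it only reacts to what getTask returns.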
Why container reuse needs this loop
Allocating a YARN container costs hundreds of ms; starting a JVM costs
seconds. Tez amortizes both by running multiple tasks in the same TezChild
process. See container-reuse.md for the AM side.
TezTaskRunner2 — the task driver
find tez-runtime-internals/src/main/java -name "TezTaskRunner2.java"
wc -l $(find tez-runtime-internals/src/main/java -name "TezTaskRunner2.java")
Per-attempt driver. Owns:
- a LogicalIOProcessorRuntimeTask (the actual task body),
- the input/output initialization thread pool,
- abort hooks (kill, fatal error, timeout).
Lifecycle of a single attempt:
sequenceDiagram
participant TC as TezChild
participant TR as TezTaskRunner2
participant T as LogicalIOProcessorRuntimeTask
participant IO as Inputs / Outputs
participant P as Processor
TC->>TR: new + run()
TR->>T: initialize()
T->>IO: input.initialize() (parallel)
T->>IO: output.initialize() (parallel)
T->>P: processor.initialize()
TR->>T: run() => processor.run(inputs, outputs)
P->>IO: read inputs, write outputs
T->>IO: output.close() (parallel)
T->>IO: input.close() (parallel)
TR->>TC: result (success/failure)
Failure routes:
- Input init throws → TaskFailedException to AM, attempt fails.
- Processor throws → same.
- AM sends kill via umbilical heartbeat reply → TezTaskRunner2.killTask() interrupts the processor thread.
- Fatal error on any IO → TaskReporter.notifyFatalError() short-circuits the run.
LogicalIOProcessorRuntimeTask — orchestrator
find tez-runtime-internals/src/main/java -name "LogicalIOProcessorRuntimeTask.java"
wc -l $(find tez-runtime-internals/src/main/java -name "LogicalIOProcessorRuntimeTask.java")
This is the class that actually instantiates the user's IPO triple.
initialize() does, in order:
- Build the TezConfiguration for this task from the AM-provided TaskSpec.
- Build the MemoryDistributor (next section) over all IOs.
- For each InputSpec: instantiate the input class, set its InputContext, call initialize() on a worker thread.
- Same for each OutputSpec.
- Instantiate the processor; call processor.initialize(processorContext).
- Wait for all input/output initialize() calls to complete (parallel).
run():
- Block until every input reports it has data (or signals empty).
- Call processor.run(inputs, outputs) on the main thread.
- On return, call output.close() for every output (parallel), then input.close() for every input (parallel).
- Collect counters; hand the final TaskStatus back to TezTaskRunner2.
Key field:
grep -n "initializerCompletionService\|runInputRunnable\|runOutputRunnable" \
$(find tez-runtime-internals/src/main/java -name "LogicalIOProcessorRuntimeTask.java") | head
Parallel init is what makes Tez fast for processors with many inputs (e.g. multi-input joins).
MemoryDistributor
find tez-runtime-internals/src/main/java -name "MemoryDistributor.java"
cat $(find tez-runtime-internals/src/main/java -name "MemoryDistributor.java") | head -160
A single broker that hands out portions of the task's JVM heap to IOs that ask for memory.
Flow:
- At task init, each Input/Output calls context.requestInitialMemory(size, callback) with what it would like to reserve (e.g. OrderedPartitionedKVOutput requests tez.runtime.io.sort.mb MB).
- MemoryDistributor.makeInitialAllocations() runs an InitialMemoryAllocator plugin (default: WeightedScalingMemoryDistributor) to scale requests down to fit the container's available heap.
- Allocations are dispatched to callbacks; each IO learns its actual budget via MemoryUpdateCallback.memoryAssigned(long).
- IO classes resize their buffers accordingly.
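The scale-down step can be illustrated with plain proportional scaling. This is a deliberately simplified sketch, assuming every IO has the same weight: the class MemoryScalingSketch and its allocate method are hypothetical, and the real default allocator additionally applies per-IO-class weights that this sketch ignores.

```java
import java.util.Arrays;

/** Hypothetical proportional scaler; the real allocator also applies per-IO weights. */
public class MemoryScalingSketch {
    /**
     * Scale requests down so their sum fits into heap * (1 - reserveFraction).
     * Requests that already fit are granted verbatim.
     */
    static long[] allocate(long[] requests, long heapBytes, double reserveFraction) {
        long available = (long) (heapBytes * (1.0 - reserveFraction));
        long total = Arrays.stream(requests).sum();
        if (total <= available) {
            return requests.clone();
        }
        double ratio = (double) available / total;
        long[] granted = new long[requests.length];
        for (int i = 0; i < requests.length; i++) {
            granted[i] = (long) (requests[i] * ratio);
        }
        return granted;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Two IOs each ask for 512 MB in a 1024 MB heap with 25% held back.
        long[] granted = allocate(new long[] {512 * mb, 512 * mb}, 1024 * mb, 0.25);
        for (long g : granted) {
            System.out.println(g / mb + " MB"); // prints "384 MB" twice
        }
    }
}
```

Each IO then receives its granted figure through its callback and sizes its buffers from that number rather than from its configured request.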
Configuration knobs:
| Key | Effect |
|---|---|
| tez.task.scale.memory.allocator.class | Plugin to use. Default WeightedScalingMemoryDistributor. |
| tez.task.scale.memory.enabled | Master toggle. |
| tez.task.scale.memory.ratios | Per-IO-class weight overrides. |
| tez.task.scale.memory.reserve-fraction | Reserved for processor/JVM. |
grep -n "requestInitialMemory\|memoryAssigned" \
$(find tez-runtime-library/src/main/java -name "OrderedPartitionedKVOutput.java")
Without the distributor an output would request its configured size verbatim, potentially OOMing the container when summed across IOs.
TaskReporter and the umbilical
find tez-runtime-internals/src/main/java -name "TaskReporter*.java"
find tez-api/src/main/java -name "TezTaskUmbilicalProtocol.java"
TaskReporter runs a heartbeat thread per task attempt. Each cycle:
- Collect outbound events (counter updates, completion events from completed IOs).
- Call umbilical.heartbeat(request) where request contains attempt ID, counters, status messages, and the outbound TezEvent list.
- Decode the reply: AM may push back inbound TezEvents (e.g. DataMovementEvents from upstream tasks), a shouldDie flag, or a shouldReset flag.
- Dispatch inbound events into the appropriate Input via LogicalIOProcessorRuntimeTask.handleEvents().
Interval: tez.task.am.heartbeat.interval-ms (default 100) plus a
counter-update interval tez.task.am.heartbeat.counter.interval-ms (default
4000).
Why heartbeats carry events
Tez has no separate "event bus" between AM and containers. Everything piggybacks on the umbilical heartbeat. This means:
- Event latency is bounded below by heartbeat.interval-ms.
- A wedged umbilical (network partition) blocks all task communication; tez.task.timeout-ms (default 5 minutes) eventually fires and the AM considers the attempt lost.
End-to-end task lifecycle inside the JVM
grep -n "phase\|TaskRunnerPhase" $(find tez-runtime-internals/src/main/java -name "TezTaskRunner2.java") | head
| Phase | Owner | What happens |
|---|---|---|
| 1 Receive | TezChild.run | umbilical.getTask returns a ContainerTask. |
| 2 Build | TezTaskRunner2 | Construct LogicalIOProcessorRuntimeTask, hook up TaskReporter. |
| 3 Init | LogicalIOProcessorRuntimeTask.initialize | MemoryDistributor + parallel IO init + processor init. |
| 4 Run | LogicalIOProcessorRuntimeTask.run | processor.run(inputs, outputs). |
| 5 Close | same | Outputs close (flush spills, emit DataMovementEvents), inputs close. |
| 6 Report | TaskReporter final tick | Send counters + completion event. AM transitions attempt to SUCCEEDED. |
| 7 Loop | TezChild.run | Discard task object, request next. |
Reading exercise
- grep -n "shouldDie\|exit(" $(find tez-runtime-internals/src/main/java -name "TezChild.java"): list every termination path.
- grep -n "initialize\(\)\|run\(\)\|close\(\)" $(find tez-runtime-internals/src/main/java -name "LogicalIOProcessorRuntimeTask.java") | head -40: verify the lifecycle order.
- cat $(find tez-runtime-internals/src/main/java -name "MemoryDistributor.java") | head -100: how does it handle the case where summed requests exceed available?
- grep -n "heartbeat\|TezTaskUmbilical" $(find tez-runtime-internals/src/main/java -name "TaskReporter.java") | head: find the heartbeat loop body.
- cat tez-api/src/main/java/org/apache/tez/runtime/api/AbstractLogicalIOProcessor.java: read the user-facing processor contract.
- wc -l $(find tez-runtime-internals/src/main/java -name "*.java") | sort -n | tail -20: find the biggest classes in the runtime module.
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| Container OOM during init | MemoryDistributor disabled or summed IO requests exceed heap. Enable tez.task.scale.memory.enabled. |
| TaskAttempt timed out after 5 min of no heartbeat | TaskReporter thread died (uncaught exception) or RPC hung. |
| Processor sees zero events | Inbound events not delivered; check TaskReporter.heartbeat reply path. Common when tez.task.am.heartbeat.interval-ms raised too high. |
| Container reuse off, JVMs constantly spinning up | TezChild.run loop returns shouldDie too eagerly; check AM-side AMContainerImpl reuse decision. |
| IllegalStateException: Cannot reserve more memory | IO requesting after makeInitialAllocations already ran. |
| Outputs never close (process hangs) | Processor never returned from run(); usually an infinite loop on a KeyValuesReader. |
Validation: prove you understand this
- Trace, with file:method references, the path from TezChild.main to processor.run for a single attempt.
- Explain in two sentences why LogicalIOProcessorRuntimeTask.initialize parallelizes input/output init. Cite the field name.
- Given a container with 1 GB heap, one OrderedPartitionedKVOutput requesting 512 MB and two OrderedGroupedKVInputs requesting 256 MB each, compute the actual allocations under the default WeightedScalingMemoryDistributor.
- Identify the single umbilical method that delivers inbound TezEvents to the task. Cite the file and the field on the response object.
- Sketch the smallest possible AbstractLogicalIOProcessor that prints the class names of all configured inputs and exits. Include initialize, handleEvents, run, close.
Scheduler
The scheduler is the AM-side component that turns task launch requests into
running containers. It lives in tez-dag under
org.apache.tez.dag.app.rm.
Two-layer design:
- TaskSchedulerManager: single dispatcher and router. Receives AMSchedulerEvents from the rest of the AM, forwards to the right scheduler instance.
- TaskScheduler instances: one per scheduler ID. In practice almost always YarnTaskSchedulerService (production) or LocalTaskSchedulerService (tez.local.mode=true). External pluggable schedulers (LLAP) also slot in here.
ls tez-dag/src/main/java/org/apache/tez/dag/app/rm/
TaskSchedulerManager
find tez-dag/src/main/java -name "TaskSchedulerManager.java"
wc -l $(find tez-dag/src/main/java -name "TaskSchedulerManager.java")
Implements EventHandler<AMSchedulerEvent> and is wired into the AM
AsyncDispatcher. Every scheduling decision in the AM starts by enqueuing an
AMSchedulerEvent.
Event types
find tez-dag/src/main/java -name "AMSchedulerEvent*.java"
grep -rn "extends AMSchedulerEvent" tez-dag/src/main/java
| Event | Source | Purpose |
|---|---|---|
| AMSchedulerEventTALaunchRequest | TaskAttemptImpl after a TA is ready to schedule | Ask scheduler to launch this attempt. |
| AMSchedulerEventTAStateUpdated | TaskAttemptImpl on completion | Notify scheduler the container is now releasable. |
| AMSchedulerEventContainerCompleted | YARN RM callback | RM told us a container died. |
| AMSchedulerEventDeallocateContainer | various | Force-release a held container. |
| AMSchedulerEventNodeBlacklistUpdate | NodeTracker | Add/remove node from blacklist. |
| AMSchedulerEventDAGStart, AMSchedulerEventVertexStateUpdated | DAGImpl, VertexImpl | DAG lifecycle hints (drives priority adjustments). |
TaskSchedulerManager.handle(event) switches on event type and forwards via
getTaskScheduler(event.getSchedulerId()).handleEvent(...).
YarnTaskSchedulerService
find tez-dag/src/main/java -name "YarnTaskSchedulerService.java"
wc -l $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java")
This is where Tez talks to YARN. Owns:
- AMRMClientAsync: async RM heartbeat client.
- Map<Priority, BlockingQueue<CookieContainerRequest>>: outstanding requests, bucketed by priority.
- Map<ContainerId, HeldContainer>: currently-assigned containers (see container-reuse.md).
- A DelayedContainerManager thread that releases idle reused containers.
Request flow
sequenceDiagram
participant TA as TaskAttemptImpl
participant TSM as TaskSchedulerManager
participant Y as YarnTaskSchedulerService
participant RM as YARN RM
TA->>TSM: AMSchedulerEventTALaunchRequest
TSM->>Y: allocateTask(...)
Y->>Y: build CookieContainerRequest (priority, resource, locality)
Y->>RM: addContainerRequest (via AMRMClientAsync)
Note over RM: scheduler matches request to a node
RM-->>Y: onContainersAllocated([Container])
Y->>Y: assignContainer() — match to a pending request
Y->>TSM: containerAllocated(taskAttempt, container)
TSM->>TA: TAEventContainerAssigned
Matching: priority + locality
grep -n "assignContainer\|matchContainerToRequest\|getMatchingRequests" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20
When a container arrives, assignContainer walks pending requests at the
container's priority. For each:
- NODE_LOCAL — container's node matches a hint host of the request.
- RACK_LOCAL — same rack but different host.
- ANY — locality wildcard.
AMRMClientAsync already biases matches by locality on the YARN side; this
pass is the AM-side tiebreaker when multiple requests are eligible.
| Hint level | YARN request | Tez match |
|---|---|---|
| NODE_LOCAL | host + rack + * | accepts container on the exact host |
| RACK_LOCAL | rack + * | accepts container on the same rack |
| ANY | * only | accepts any container at this priority |
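The three match levels in the table amount to a short classification function. The following is an illustrative sketch, not the code in YarnTaskSchedulerService: LocalityMatchSketch and classify are hypothetical names, and hosts and racks are modeled as plain strings.

```java
import java.util.Arrays;
import java.util.Collection;

/** Hypothetical classifier mirroring the table above; not Tez's matching code. */
public class LocalityMatchSketch {
    enum Level { NODE_LOCAL, RACK_LOCAL, ANY }

    /** Best locality level at which a container on (host, rack) satisfies a request's hints. */
    static Level classify(String host, String rack,
                          Collection<String> hintHosts, Collection<String> hintRacks) {
        if (hintHosts.contains(host)) {
            return Level.NODE_LOCAL;   // exact host named in the hint
        }
        if (hintRacks.contains(rack)) {
            return Level.RACK_LOCAL;   // same rack, different host
        }
        return Level.ANY;              // locality wildcard
    }

    public static void main(String[] args) {
        Collection<String> hosts = Arrays.asList("node1", "node2");
        Collection<String> racks = Arrays.asList("/rack-a");
        System.out.println(classify("node2", "/rack-b", hosts, racks)); // NODE_LOCAL
        System.out.println(classify("node9", "/rack-a", hosts, racks)); // RACK_LOCAL
        System.out.println(classify("node9", "/rack-z", hosts, racks)); // ANY
    }
}
```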
TaskLocationHint is set on TaskAttemptImpl either from the input split
(MRInput), the VertexLocationHint (provided by VertexManager), or left
null.
Priorities
grep -n "Priority\|priorityForVertex" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java" \
-o -name "DAGImpl.java") | head
Tez assigns each vertex a priority class derived from its topological order in the DAG. Downstream vertices get a larger numeric priority value, which YARN treats as lower priority, so source tasks are scheduled first and free their containers for downstream consumers. Priority is also the primary key for container reuse matching.
RM callbacks
grep -n "AMRMClientAsync.CallbackHandler\|onContainersAllocated\|onContainersCompleted\|onShutdownRequest" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java")
YarnTaskSchedulerService implements AMRMClientAsync.CallbackHandler:
- onContainersAllocated(List<Container>): enqueue for assignment.
- onContainersCompleted(List<ContainerStatus>): translate exit status into AMSchedulerEventContainerCompleted.
- onShutdownRequest(): RM asked AM to die (e.g. lost AM attempt).
- onNodesUpdated(List<NodeReport>): update node health for blacklisting.
- getProgress(): AM tells RM its overall DAG progress.
LocalTaskSchedulerService
find tez-dag/src/main/java -name "LocalTaskSchedulerService.java"
wc -l $(find tez-dag/src/main/java -name "LocalTaskSchedulerService.java")
Same contract as YarnTaskSchedulerService but bypasses YARN:
- A bounded ExecutorService of LocalContainer worker threads stands in for the YARN cluster.
- allocateTask instantly synthesizes a fake Container and dispatches containerAllocated.
- The container launcher (LocalContainerLauncher) runs TezChild in the same JVM on the executor.
Used by tez.local.mode=true and MiniTezCluster tests of certain flavors.
See local-mode.md.
Pluggable schedulers
grep -n "tez.am.task.scheduler.classes\|TASK_SCHEDULER_SERVICE_CLASS" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
Configuration:
tez.am.task.scheduler.classes = <comma-separated FQNs>
TaskSchedulerManager instantiates one per ID. Hive's LLAP plugs in a
custom scheduler that talks to LLAP daemons instead of YARN.
Walkthrough: launching a single task attempt
- VertexImpl decides to schedule task T.k (via VertexManager or scaling).
- TaskImpl creates TaskAttemptImpl for attempt 0 → state NEW.
- TaskAttemptImpl transitions to START_WAIT, dispatches AMSchedulerEventTALaunchRequest with TaskLocationHint and capability.
- TaskSchedulerManager.handle routes to the configured scheduler.
- YarnTaskSchedulerService.allocateTask constructs CookieContainerRequest(priority, capability, hosts, racks, relaxLocality=true) and calls AMRMClientAsync.addContainerRequest.
- RM schedules → callback onContainersAllocated([c]).
- assignContainer(c) finds the matching pending request, calls informAppAboutAssignment → TaskSchedulerManager.containerAllocated.
- TaskSchedulerManager dispatches AMContainerEventAssignTA to AMContainerImpl, then TAEventContainerAssigned to TaskAttemptImpl.
- AMContainerImpl asks ContainerLauncherImpl to launch the container (or reuse a held one).
- TezChild starts (or accepts new task via reuse loop). The umbilical fires up; the attempt transitions to RUNNING.
sequenceDiagram
participant V as VertexImpl
participant TA as TaskAttemptImpl
participant TSM as TaskSchedulerManager
participant Y as YarnTaskSchedulerService
participant AC as AMContainerImpl
participant RM as YARN RM
V->>TA: schedule
TA->>TSM: AMSchedulerEventTALaunchRequest
TSM->>Y: allocateTask
Y->>RM: addContainerRequest
RM-->>Y: onContainersAllocated
Y->>TSM: containerAllocated
TSM->>AC: AMContainerEventAssignTA
TSM->>TA: TAEventContainerAssigned
AC->>AC: launch container on the NM (via NMClient)
Reading exercise
- cat $(find tez-dag/src/main/java -name "TaskSchedulerManager.java") | head -200: list the event types handled.
- grep -n "addContainerRequest\|removeContainerRequest\|releaseAssignedContainer" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java"): find all RM client interactions.
- grep -n "NODE_LOCAL\|RACK_LOCAL\|OFF_SWITCH\|ANY" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java"): how is locality classified?
- grep -n "CookieContainerRequest" $(find tez-dag/src/main/java -name "*.java" | grep rm): what is the "cookie"? (Hint: opaque payload to thread reuse data through AMRMClient.)
- wc -l $(find tez-dag/src/main/java/org/apache/tez/dag/app/rm -name "*.java"): which file dominates? Likely YarnTaskSchedulerService ≫ everything.
- grep -n "Priority.newInstance\|priority(" $(find tez-dag/src/main/java -name "VertexImpl.java" -o -name "DAGImpl.java"): where is per-vertex priority computed?
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| AM stuck "0 containers running" | RM has no capacity at requested priority; queue at capacity. Check yarn application -status. |
| All tasks scheduled OFF_SWITCH | TaskLocationHint not propagated through VertexManager. |
| Tasks fail with Container released by AM | YarnTaskSchedulerService released a container that an attempt still owned, usually a state machine race; see failure-handling.md. |
| Reuse not happening | Priorities mismatch between completed and pending tasks; check tez.am.container.reuse.locality.delay-allocation-millis. |
| AM heartbeat thread blocked | A scheduler callback (onContainersAllocated) ran a slow blocking op on the RM client thread. Keep callbacks light. |
| IllegalStateException: Priority N not registered | allocateTask called for a vertex whose priority class was never bootstrapped. |
Validation: prove you understand this
- Walk an AMSchedulerEventTALaunchRequest from dispatch in TaskAttemptImpl to a YARN AMRMClient.addContainerRequest call. Cite file paths.
- Explain the difference between priority (YARN concept) and DAG priority (Tez concept) and where Tez sets each.
- Given a 100-task Vertex A followed by a 10-task Vertex B, what priority class does each get and why?
- Describe how YarnTaskSchedulerService decides between two pending requests at the same priority when a container arrives.
- Identify the single method on YarnTaskSchedulerService that the RM callback thread invokes when containers become available. Cite file:line.
Container Reuse
Container reuse is the single biggest reason Tez runs short-task DAGs faster than MapReduce. This chapter explains why, where the policy lives, and how to debug it when it stops working.
Why reuse matters
Container allocation has three costs:
- RM round-trip: addContainerRequest, RM scheduling cycle (yarn.scheduler.capacity.node-locality-delay can add extra scheduling delay), onContainersAllocated.
- NM container launch: ContainerLaunchContext setup, localization of resources, NodeManager forking the JVM.
- JVM warmup: classloading, JIT, GC tuning.
For a 5-second task on a fresh container the wall time looks like:
| Phase | ms |
|---|---|
| AM request → RM allocate | 200–2000 |
| NM launch + localization | 500–3000 |
| JVM start | 500–2000 |
| Task work | 5000 |
| Overhead share | roughly 19–58% |
For 100 such tasks, paying that overhead 100 times turns a DAG that should finish in ~10s into one that takes 60–90s. With reuse, tasks 2..N on the same container pay near-zero overhead.
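The overhead share follows directly from the per-phase figures in the table. The sketch below is only a worked calculation: OverheadShareSketch and overheadShare are hypothetical names, and the inputs are the endpoints of the ranges listed above.

```java
/** Worked arithmetic for the table above; class and method names are hypothetical. */
public class OverheadShareSketch {
    /** Fraction of wall time spent on allocation, NM launch, and JVM start. */
    static double overheadShare(long allocMs, long launchMs, long jvmMs, long workMs) {
        long overhead = allocMs + launchMs + jvmMs;
        return (double) overhead / (overhead + workMs);
    }

    public static void main(String[] args) {
        // Low end of every range, then high end of every range, for a 5000 ms task.
        double best = overheadShare(200, 500, 500, 5000);
        double worst = overheadShare(2000, 3000, 2000, 5000);
        System.out.printf("best case:  %.1f%% overhead%n", best * 100);
        System.out.printf("worst case: %.1f%% overhead%n", worst * 100);
    }
}
```

The shorter the task, the larger this share becomes, which is why reuse matters most for DAGs made of many short tasks.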
The reuse loop in TezChild
See tez-runtime.md. The single key fact: after each
completed task, TezChild.run() calls umbilical.getTask() again instead
of exiting. As long as the AM hands it work, the same JVM keeps running.
grep -n "umbilical.getTask\|shouldDie\|run()" \
$(find tez-runtime-internals/src/main/java -name "TezChild.java") | head
So the entire reuse policy is implemented on the AM side — the container asks "what next?" and the AM decides whether to give it another task or release it.
AMContainerImpl — per-container state machine
find tez-dag/src/main/java -name "AMContainerImpl.java"
wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java")
grep -n "AMContainerState\|enum AMContainerState" \
$(find tez-dag/src/main/java -name "AMContainerState.java" \
-o -name "AMContainerImpl.java") | head
Each YARN container the AM holds has a corresponding AMContainerImpl
state machine. States include:
| State | Meaning |
|---|---|
| ALLOCATED | RM has assigned the container; not yet launched. |
| LAUNCHING | NMClient is forking the JVM. |
| IDLE | Launched, no task assigned (reuse candidate). |
| RUNNING | A task attempt is currently executing. |
| STOP_REQUESTED / COMPLETED | Releasing or released. |
The transition RUNNING → IDLE is the moment Tez decides between reuse and
release.
HeldContainer
grep -n "HeldContainer\|heldContainers\|delayedContainers" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20
HeldContainer is the scheduler-side view of an idle reused container:
| Field | Purpose |
|---|---|
| container | The underlying YARN Container (resource, node, priority). |
| priority | The priority class it was originally allocated at. |
| lastTaskActivity | Timestamp of the last task completion. |
| nextScheduleTime | When DelayedContainerManager will reconsider it. |
| localityMatchLevel | Track the locality at which it can still be matched. |
When a task completes, AMContainerImpl reports back to
YarnTaskSchedulerService which wraps the container in a HeldContainer
and queues it for matching.
Matching: who gets the held container?
Algorithm (paraphrased from YarnTaskSchedulerService):
- Walk pending requests at the same priority as the held container's original allocation.
- Prefer requests with locality matching the container's node, then rack, then any.
- Verify resource compatibility: container's Resource must satisfy the request's capability.
- If a match exists, dispatch reuse to the matched TaskAttemptImpl.
- If no match, leave the container as HeldContainer and schedule the DelayedContainerManager to re-evaluate after the locality delay.
grep -n "tryAssignReUsedContainer\|matchHeldContainerToRequest\|getMatchingRequests" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head
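The matching steps paraphrased above can be sketched as one pass over the pending requests. This is a simplified illustration with hypothetical names (HeldContainerMatchSketch, Request, match): it models memory as the only resource, omits the rack level, and accepts a non-local fallback immediately, whereas the scheduler described here waits for the locality delay before relaxing.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical matcher for an idle container; not the YarnTaskSchedulerService code. */
public class HeldContainerMatchSketch {
    static final class Request {
        final String attempt;
        final int priority;
        final int memoryMb;
        final String host;

        Request(String attempt, int priority, int memoryMb, String host) {
            this.attempt = attempt;
            this.priority = priority;
            this.memoryMb = memoryMb;
            this.host = host;
        }
    }

    /** Returns the request that should reuse the held container, or null to keep holding it. */
    static Request match(int heldPriority, int heldMemoryMb, String heldHost, List<Request> pending) {
        Request fallback = null;
        for (Request r : pending) {
            if (r.priority != heldPriority) continue;  // priority-strict matching
            if (r.memoryMb > heldMemoryMb) continue;   // container must satisfy the capability
            if (heldHost.equals(r.host)) return r;     // node-local match wins immediately
            if (fallback == null) fallback = r;        // first eligible non-local request
        }
        return fallback;
    }

    public static void main(String[] args) {
        List<Request> pending = Arrays.asList(
            new Request("a", 2, 1024, "h1"),   // wrong priority
            new Request("b", 1, 4096, "h2"),   // too large for the container
            new Request("c", 1, 1024, "h3"),   // eligible, not local
            new Request("d", 1, 1024, "h2"));  // eligible and node-local
        Request chosen = match(1, 2048, "h2", pending);
        System.out.println(chosen.attempt);    // d
    }
}
```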
Why priority-strict matching?
Tez does not reuse a container allocated for priority class P1 for a task
of priority P2 because RM accounting attributed the container to the P1
queue/request. Crossing priority classes would corrupt fairness and create
double-counting in the RM's view of demand.
Idle timeout
grep -n "tez.am.container.idle.release-timeout" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
Two timeouts bracket the wait:
| Key | Default | Meaning |
|---|---|---|
| tez.am.container.idle.release-timeout-min.millis | 5000 | Don't release before this much idle time. |
| tez.am.container.idle.release-timeout-max.millis | 10000 | Definitely release after this much. |
DelayedContainerManager runs a periodic sweep. For each HeldContainer:
- If now - lastActivity < min, wait.
- If min ≤ now - lastActivity < max, try a relaxed-locality match.
- If now - lastActivity ≥ max, release back to YARN (AMRMClient.releaseAssignedContainer).
Why a range? Avoids thundering-herd releases when a wave of tasks finishes simultaneously, and gives the AM a window to re-match before paying the allocate-from-scratch cost.
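The three-way sweep decision is small enough to write out directly. This is a minimal sketch of the rule as stated above, assuming millisecond timestamps; IdleSweepSketch, Action, and decide are hypothetical names, not identifiers from DelayedContainerManager.

```java
/** Hypothetical version of the per-container sweep decision described above. */
public class IdleSweepSketch {
    enum Action { WAIT, TRY_RELAXED_MATCH, RELEASE }

    /** Decide what to do with a held container given its idle time and the min/max timeouts. */
    static Action decide(long nowMs, long lastActivityMs, long minMs, long maxMs) {
        long idle = nowMs - lastActivityMs;
        if (idle < minMs) {
            return Action.WAIT;               // too early to give the container up
        }
        if (idle < maxMs) {
            return Action.TRY_RELAXED_MATCH;  // inside the window: look for any usable request
        }
        return Action.RELEASE;                // past max: hand it back to YARN
    }

    public static void main(String[] args) {
        long min = 5000;
        long max = 10000;
        System.out.println(decide(3000, 0, min, max));   // WAIT
        System.out.println(decide(7000, 0, min, max));   // TRY_RELAXED_MATCH
        System.out.println(decide(10000, 0, min, max));  // RELEASE
    }
}
```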
Locality re-matching
grep -n "localityMatchLevel\|adjustLocalityMatch\|fallbackMatch" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head
A held container starts at NODE_LOCAL. Each sweep without a match
relaxes the level:
NODE_LOCAL → RACK_LOCAL → ANY → release.
tez.am.container.reuse.locality.delay-allocation-millis (default 250) is
the per-step delay. Higher values raise locality at the cost of throughput;
lower values give up locality faster.
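The relaxation chain can be modeled as a counter over unsuccessful sweeps. The sketch below is an assumption-laden illustration: LocalityRelaxationSketch and levelAfter are hypothetical names, one level is dropped per sweep, and the fallback toggles and timing are ignored.

```java
/** Hypothetical model of the relaxation chain; one step per sweep without a match. */
public class LocalityRelaxationSketch {
    enum Level { NODE_LOCAL, RACK_LOCAL, ANY, RELEASE }

    /** Level a held container is at after the given number of unsuccessful sweeps. */
    static Level levelAfter(int sweepsWithoutMatch) {
        Level[] order = Level.values();
        int index = Math.min(Math.max(sweepsWithoutMatch, 0), order.length - 1);
        return order[index];
    }

    public static void main(String[] args) {
        for (int sweeps = 0; sweeps <= 4; sweeps++) {
            System.out.println(sweeps + " -> " + levelAfter(sweeps));
        }
    }
}
```

With a per-step delay of d milliseconds, a container in this model reaches ANY after about 2d without a match, which is the trade-off the delay setting controls.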
DAG transitions and reuse
grep -n "tez.am.container.reuse.across-dags.enabled\|tez.am.container.reuse.enabled" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
Reuse policy at DAG boundaries:
| Key | Default | Effect |
|---|---|---|
| tez.am.container.reuse.enabled | true | Master toggle. |
| tez.am.container.reuse.rack-fallback.enabled | true | Allow RACK_LOCAL fallback. |
| tez.am.container.reuse.non-local-fallback.enabled | false | Allow ANY-locality fallback. |
| tez.am.container.reuse.new-containers.enabled | true | Reuse a brand-new container for a different task than originally requested. |
| tez.am.session.mode.tez-session.enabled (Hive) | controls inter-DAG reuse via session | Hive holds the AM across queries. |
When Session mode is on (Hive's TezSessionPoolManager does this), the AM
holds containers across DAGs, so the first DAG warms the JVMs that the
second DAG reuses.
Failure modes
Stale credentials
grep -n "credentials\|Token\|getCredentials" \
$(find tez-dag/src/main/java -name "ContainerLauncherImpl.java" \
-o -name "AMContainerImpl.java") | head
If a DAG uses delegation tokens (HDFS, HiveMetastore) that expire mid-session,
reused containers still hold the old tokens. Symptoms: tasks fail with
SecretManager$InvalidToken on file open. Fix: token renewal via
TokenRenewer, or release reused containers between DAGs that use tokens
with short TTLs.
Leaked containers on AM failover
grep -n "recoverContainer\|onAMRestart" \
$(find tez-dag/src/main/java -name "*.java" | head -50) | head
When the AM dies and YARN restarts attempt 2, the old containers are still
running. YARN passes them to the new AM via getContainersFromPreviousAttempts.
If the new AM mis-handles the priority mapping, those containers can become
orphaned — neither released nor reused — until the YARN-level
yarn.am.liveness-monitor.expiry-interval-ms kicks in.
Resource fragmentation
Tez does not reshape containers. A 4 GB container allocated for a heavyweight mapper sits idle through the reduce phase if reducers want 2 GB containers — the 4 GB block is not subdivided.
Container blacklisting
grep -n "blacklist\|NodeTracker" \
$(find tez-dag/src/main/java -name "*.java" | grep rm) | head
A node accumulating task failures gets blacklisted; held containers on that node are released even within the idle window.
Tuning playbook
| Goal | Tune |
|---|---|
| Reduce p50 task latency | Increase tez.am.container.idle.release-timeout-max.millis — keep JVMs warm longer. |
| Reduce YARN queue pressure | Lower tez.am.container.idle.release-timeout-min.millis — return idle containers faster. |
| Improve locality on long DAGs | Increase tez.am.container.reuse.locality.delay-allocation-millis. |
| Hive interactive queries | Enable session pools (hive.server2.tez.initialize.default.sessions) and large reuse windows. |
| Debugging "why was this container released?" | Set log4j level for org.apache.tez.dag.app.rm to DEBUG. |
Reading exercise
- wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java"), then read the state machine declaration block. Count states and transitions.
- grep -n "DelayedContainerManager" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java"): find the sweep loop.
- grep -rn "idle.release-timeout" tez-dag/src/main/java: list all read sites for the idle timeout.
- grep -n "previousAttemptContainers\|registerApplicationMaster" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java"): how does the AM enumerate inherited containers on failover?
- cat tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | grep -A 1 "REUSE\|REUSE_ENABLED": list every reuse-related config key.
- grep -n "containerCompleted" $(find tez-dag/src/main/java -name "AMContainerImpl.java"): where does the AM learn that the JVM exited unexpectedly?
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| 0% reuse despite tez.am.container.reuse.enabled=true | Priority mismatches; verify with AM log Container released because no matching request. |
| Hive query slow after token refresh | Reused container holding stale HiveMetastore delegation token. Release after refresh or shorten reuse window. |
| AM log spam: Released container X because expired | Tasks completing faster than next-wave dispatch; lower idle.release-timeout-min. |
| YARN queue at 100% but tasks pending | Held containers at wrong priority blocking new allocations; check nm-rm-heartbeat-interval-ms. |
| Containers orphaned after AM crash | New AM did not register previous containers; check getContainersFromPreviousAttempts handling. |
Validation: prove you understand this
- Describe the four-step locality relaxation a HeldContainer undergoes.
- Why is priority-strict matching enforced even when relaxing locality? Cite the RM accounting consequence.
- Given idle.release-timeout-min=5000, idle.release-timeout-max=10000, and 200 ms between successive task completions on the same vertex, what fraction of containers get reused?
- Identify the exact configuration key that controls whether RM-fresh containers can be assigned to a task different from the one that triggered the request. Cite file:line.
- Sketch the sequence of AM events when an AMContainer transitions RUNNING → IDLE → RUNNING with reuse, including which state machine emits each event.
Local Mode
Tez ships two "no YARN" execution paths:
- Local mode: tez.local.mode=true. The whole AM + all containers run in the calling JVM. No RM, no NM, no networking.
- MiniTezCluster: a real YARN MiniCluster (RM + NMs as threads) with a real Tez AM submitted as a YARN app. Networking goes over loopback.
Both let you test without a cluster, but they are not interchangeable. This chapter explains the wiring and the tradeoffs.
Why a no-YARN path exists
Production Tez requires YARN to allocate containers. For:
- IDE-driven unit tests of vertex managers, edge managers, processors;
- short reproducers in JIRAs;
- in-process pipelines (e.g. running a DAG inline from a Hive Driver in a test);
paying YARN startup cost (30+ seconds) is intolerable. Local mode is the escape hatch.
grep -rn "tez.local.mode\|LOCAL_MODE" \
tez-api/src/main/java tez-dag/src/main/java | head
How tez.local.mode=true rewires the AM
grep -n "TEZ_LOCAL_MODE\|isLocalMode\|getBoolean.*LOCAL" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
When tez.local.mode=true:
- TezClient.start() does not submit to YARN. Instead it constructs a DAGAppMaster instance directly in the client JVM and starts it as a service.
- TaskSchedulerManager is configured with LocalTaskSchedulerService instead of YarnTaskSchedulerService.
- ContainerLauncherManager uses LocalContainerLauncher instead of ContainerLauncherImpl.
- TaskCommunicatorManager uses TezLocalTaskCommunicatorImpl, which bypasses RPC entirely.
The net effect: the AM, scheduler, container launcher, and TezChilds all
live in the same JVM, talking via in-process queues.
LocalTaskSchedulerService
find tez-dag/src/main/java -name "LocalTaskSchedulerService.java"
wc -l $(find tez-dag/src/main/java -name "LocalTaskSchedulerService.java")
Mirrors YarnTaskSchedulerService but the "resource pool" is a thread pool.
| Concept | Yarn version | Local version |
|---|---|---|
| Resource pool | YARN cluster | ExecutorService of bounded thread count |
| allocateTask | AMRMClient.addContainerRequest | enqueue to local queue, immediately synthesize fake Container |
| releaseAssignedContainer | RM release | return thread to pool |
| Locality | NODE_LOCAL/RACK_LOCAL | always ANY (single "node") |
| Priority | YARN priority class | honored as a queue-ordering hint |
Configuration:
grep -n "TEZ_AM_INLINE_TASK_EXECUTION_MAX_TASKS\|tez.am.inline" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
tez.am.inline.task.execution.max-tasks (default 1) controls thread-pool
size in local mode. Bumping this exposes concurrency bugs that production
container parallelism would also expose.
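The thread-pool-as-resource-pool idea can be modeled with nothing but the JDK. The sketch below is not Tez's LocalTaskSchedulerService: the class LocalSchedulerSketch, its Request type, and the "lower value runs first" ordering are assumptions made for illustration. It shows a pool whose size stands in for the max-tasks setting and whose queue honors priority as an ordering hint.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical model, not Tez code: a bounded pool with a priority-ordered queue.
final class LocalSchedulerSketch {

  /** A queued task request; a lower priority value runs first (assumed ordering). */
  static final class Request implements Runnable, Comparable<Request> {
    final int priority;
    final Runnable body;

    Request(int priority, Runnable body) {
      this.priority = priority;
      this.body = body;
    }

    @Override public void run() { body.run(); }

    @Override public int compareTo(Request other) {
      return Integer.compare(priority, other.priority);
    }
  }

  /** Fixed-size pool: maxTasks plays the role of the max-tasks setting. */
  static ThreadPoolExecutor newPool(int maxTasks) {
    return new ThreadPoolExecutor(maxTasks, maxTasks, 0L, TimeUnit.MILLISECONDS,
        new PriorityBlockingQueue<Runnable>());
  }

  /** Queues two requests behind a blocker and returns the order they ran in. */
  static List<String> runDemo() {
    ThreadPoolExecutor pool = newPool(1);
    List<String> order = new CopyOnWriteArrayList<>();
    CountDownLatch gate = new CountDownLatch(1);
    try {
      // Occupy the single slot so the next two requests have to queue.
      pool.execute(new Request(0, () -> awaitQuietly(gate)));
      pool.execute(new Request(5, () -> order.add("reduce")));
      pool.execute(new Request(2, () -> order.add("map")));
      gate.countDown();
      pool.shutdown();
      pool.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return order;
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```

With a single slot the queued requests drain strictly by priority. Raising the pool size lets them overlap, which is the kind of interleaving that a larger max-tasks value exposes in real local-mode tests.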
LocalContainerLauncher
find tez-dag/src/main/java -name "LocalContainerLauncher.java"
When the AM "launches" a local container, the launcher allocates a
LocalContainer worker that runs TezChild logic in the same process:
- No new JVM.
- No serialization of the ContainerLaunchContext: the AM hands the TaskSpec directly to the local task runner.
- The umbilical "RPC" is a Java method call on an in-process object.
This means: local mode does not exercise the RPC layer, classpath construction, NM localization, or token plumbing. Bugs in those paths are invisible to local-mode tests.
What local mode does not exercise
| Layer | Skipped in local mode |
|---|---|
| YARN RM scheduling | ✗ |
| NodeManager container launch | ✗ |
| Resource localization (HDFS download) | ✗ |
| AMRMToken / ClientToAMToken | ✗ |
| HDFS shuffle path (uses local FS only) | ✗ |
| ShuffleHandler aux service | ✗ |
| RPC serialization | ✗ |
| JVM cold start / classloader isolation | ✗ |
What it does exercise: the DAG state machine, VertexManagers, EdgeManagers, sort/merge code, processors, and the umbilical event flow.
MiniTezCluster
find tez-tests -name "MiniTezCluster.java"
wc -l $(find tez-tests -name "MiniTezCluster.java")
A real cluster compressed onto one host. It extends MiniYARNCluster from Hadoop:
- One RM thread.
- N NM threads (configurable).
- A Tez AM submitted as a normal YARN application.
- TezChild runs in separate JVMs spawned by the NM ContainerExecutor.
- HDFS is MiniDFSCluster (a few NameNode + DataNode threads in the same JVM) or a RawLocalFileSystem.
grep -n "MiniYARNCluster\|MiniDFSCluster\|appJar\|deploy" \
$(find tez-tests -name "MiniTezCluster.java") | head
Setup pattern
grep -rn "MiniTezCluster\b" tez-tests/src/test/java | head -10
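// Arguments after the name: NodeManager count, local dirs per NM, log dirs per NM.
MiniTezCluster cluster = new MiniTezCluster("test", numNMs, numLocalDirs, numLogDirs);
cluster.init(conf);
cluster.start();
// Build the client config from the running cluster so it picks up the RM address.
TezConfiguration tezConf = new TezConfiguration(cluster.getConfig());
TezClient client = TezClient.create("test", tezConf);
client.start();
client.waitTillReady();
client.submitDAG(myDag);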
When MiniTezCluster is the right tool
- You are exercising RPC, security, or localization code.
- You hit ShuffleHandler paths or HDFS-backed recovery (see failure-handling.md).
- You're reproducing a bug that involves real container lifecycle (kill -9 vs orderly shutdown): MiniCluster can forkProcess and SIGKILL.
- You need realistic counters and ATS event flow.
When MiniTezCluster is the wrong tool
- Pure VertexManager logic — use local mode or mock dispatcher.
- Pure IFile / sort behavior — use a unit test on the runtime-library classes directly.
- Anything where 30–60 s startup + heavy memory cost (~1 GB minimum) is intolerable.
Side-by-side comparison
| Aspect | Local mode | MiniTezCluster |
|---|---|---|
| Startup | < 1 s | 30–60 s |
| Memory | ~256 MB | 1 GB+ |
| YARN exercised | no | yes (in-process) |
| RPC exercised | no | yes (loopback) |
| Tokens exercised | no | yes (simple, unkerberized by default) |
| Separate JVMs for tasks | no | yes |
| HDFS | RawLocal | MiniDFS or RawLocal |
| Shuffle path | no ShuffleHandler | full ShuffleHandler |
| Use case | unit / integration of AM logic | end-to-end integration tests |
| Example test class | TestLocalMode | TestOrderedWordCount |
find tez-tests -name "TestLocalMode.java" \
-o -name "TestOrderedWordCount.java"
Worked example: switching between modes in one test
@Parameters
public static Iterable<Object[]> modes() {
return Arrays.asList(new Object[][] {{"local"}, {"mini"}});
}
@Before
public void setUp() throws Exception {
conf = new TezConfiguration();
if ("local".equals(mode)) {
conf.set("fs.defaultFS", "file:///");
conf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, true);
} else {
miniCluster = new MiniTezCluster("test", 1, 1, 1);
miniCluster.init(conf);
miniCluster.start();
conf = new TezConfiguration(miniCluster.getConfig());
}
client = TezClient.create("t", conf);
client.start();
client.waitTillReady();
}
This is the pattern in several Tez tests where a feature must work in both universes.
Reading exercise
- wc -l $(find tez-dag/src/main/java -name "LocalTaskSchedulerService.java" -o -name "YarnTaskSchedulerService.java"): confirm the local version is much smaller.
- grep -n "tez.local.mode" $(find tez-dag/src/main/java -name "DAGAppMaster.java"): find every branch that depends on local mode.
- cat $(find tez-dag/src/main/java -name "LocalContainerLauncher.java") | head -160: how does it run TezChild without a fork?
- find tez-tests -name "MiniTezCluster.java" -exec grep -n "ShuffleHandler\|aux-services" {} \;: verify MiniTezCluster wires the YARN aux service.
- grep -rn "TEZ_LOCAL_MODE" tez-api tez-dag tez-runtime-internals | head: list every config read site.
- find tez-tests/src/test/java -name "TestLocal*" -o -name "TestMRR*": read one local-mode and one MiniCluster test, side by side.
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| Test passes in local mode, fails on cluster | Local mode skipped RPC/localization/tokens. Add a MiniCluster variant. |
| MiniCluster test times out at waitTillReady | RM never registered the AM. Check tez-site.xml is on the AM classpath in the MiniCluster config. |
| Local-mode race conditions only visible with inline.task.execution.max-tasks > 1 | Single-threaded local mode hides ordering bugs in VertexManager and dispatchers. |
| ClassNotFoundException for custom processor in MiniCluster | Container localization needs the JAR; either put it on the launch classpath or register via LocalResources. |
| MiniCluster blows the heap | Even 1 NM + MiniDFS already needs about 1 GB; raise the test JVM heap, or reduce the NM count to 1 if it is higher. |
| Hive integration test wedges only in MiniCluster | Hive needs full Hadoop config; check hadoop.security.authentication=simple in test conf. |
Validation: prove you understand this
- List four layers that local mode does not exercise. For each, name a bug class it can hide.
- In local mode, where does the "RPC" between TezChild and the AM actually happen? Cite the file path.
- Why is tez.am.inline.task.execution.max-tasks=1 the default in local mode? What test reliability tradeoff does it enforce?
- Given a reproducer for a bug in ShuffleHandler aux-service interaction, explain why a TestLocalMode-style test cannot reproduce it, and what the minimum MiniCluster setup is.
- Show the minimum TezConfiguration setup for local mode in code. Three lines max.
Hive on Tez
Hive is the largest single consumer of Tez. Roughly 70% of bug reports filed against Tez originate in a Hive query; many "Tez bugs" turn out to be Hive bugs, and vice versa. This chapter walks the compile boundary, explains how Hive operators map to Tez Input/Processor/Output (IPO) classes, and gives a triage tree for attribution.
The compile boundary
Hive's query compiler produces a TezWork, a graph of BaseWork nodes
(MapWork, ReduceWork, MergeJoinWork, etc). TezTask.execute walks
TezWork and constructs a Tez DAG.
ls ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/
Key files:
| File | Role |
|---|---|
| TezTask.java | Hive's Task impl; builds the DAG and submits via TezSessionState. |
| DagUtils.java | DAG construction helpers (createVertex, createEdge, etc). |
| TezSessionPoolManager.java | Warm session pool that keeps AMs alive between queries. |
| TezSessionState.java | One Hive session ↔ one Tez AM. |
| TezProcessor.java | The LogicalIOProcessor that runs Hive operator pipelines inside a Tez task. |
wc -l ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/{TezTask,DagUtils,TezSessionPoolManager,TezProcessor}.java
TezTask.execute — high-level flow
grep -n "execute\|build\|submitDAG" \
~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java | head -30
Steps:
- Acquire a TezSessionState from TezSessionPoolManager (or open a new one).
- build(jobConf, work, scratchDir, ...): call DagUtils to turn each BaseWork into a Tez Vertex and each TezEdgeProperty into a Tez Edge.
- submit(dag, sessionState) → tezClient.submitDAG(dag).
- Poll dagClient.getDAGStatus(...) until terminal.
- Surface counters + diagnostics back to Hive.
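The "poll until terminal" step can be sketched as a plain loop. This is a JDK-only illustration under stated assumptions: DagPollSketch, its State enum, and the Supplier standing in for dagClient.getDAGStatus(...) are hypothetical, and the real TezTask monitoring code does more (progress printing, interrupt handling, counters).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical model of a status-polling loop; not Hive or Tez code.
final class DagPollSketch {

  /** Simplified stand-in for a DAG state; not the Tez DAGStatus.State enum. */
  enum State {
    SUBMITTED, RUNNING, SUCCEEDED, FAILED, KILLED;

    boolean isTerminal() {
      return this == SUCCEEDED || this == FAILED || this == KILLED;
    }
  }

  /** Polls statusSource until it reports a terminal state or maxPolls is reached. */
  static State waitForTerminal(Supplier<State> statusSource, long pollIntervalMs, int maxPolls) {
    for (int i = 0; i < maxPolls; i++) {
      State state = statusSource.get();
      if (state.isTerminal()) {
        return state;
      }
      try {
        Thread.sleep(pollIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while polling", e);
      }
    }
    throw new IllegalStateException("DAG not terminal after " + maxPolls + " polls");
  }

  public static void main(String[] args) {
    // A scripted status sequence replaces the real AM for the demo.
    Iterator<State> scripted =
        Arrays.asList(State.SUBMITTED, State.RUNNING, State.SUCCEEDED).iterator();
    System.out.println(waitForTerminal(scripted::next, 10, 5));
  }
}
```

Bounding the loop with maxPolls is a choice made for the sketch so that a stuck DAG surfaces as an error instead of a hang.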
DagUtils.createVertex
grep -n "createVertex\|createEdge\|createEdgeProperty\|setVertexManagerPlugin" \
~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java | head -30
For a MapWork:
| Hive concept | Tez vertex configuration |
|---|---|
| Operator tree starting with TableScanOperator | processor = MapTezProcessor (subclass of TezProcessor) |
| Number of input splits | parallelism = splits.length (often overridden by grouping) |
| Per-split input | DataSourceDescriptor with MRInputLegacy and the InputFormat |
| Combiner | Edge-level (downstream ReduceWork configures it as a combiner.class) |
For a ReduceWork:
| Hive concept | Tez vertex configuration |
|---|---|
| Operator tree starting at ReduceSinkOperator's consumer | processor = ReduceTezProcessor |
| Target parallelism | numReducers (from Hive's Operator tree, optionally auto-parallelized via ShuffleVertexManager) |
| Sort key codec | OrderedGroupedKVInput.KEY_CLASS, KEY_COMPARATOR_CLASS |
| setVertexManagerPlugin | ShuffleVertexManager with auto-parallelism if hive.tez.auto.reducer.parallelism=true |
For a MergeJoinWork:
- processor = MergeJoinProcessor
- Multiple sorted inputs (one per join side) using OrderedGroupedKVInput
- A custom or built-in vertex manager that coordinates inputs
Operator → IPO mapping
Hive operators run inside a Tez task — they are not Tez constructs. The mapping happens at the input/output boundary of the vertex.
| Position | Hive operator | Tez wiring |
|---|---|---|
| Vertex entry (map side) | TableScanOperator | MRInputLegacy (tez-mapreduce) emits (key, value) from InputFormat |
| Vertex middle | Filter / Select / GroupBy partial / etc | Pure in-memory operator chain inside TezProcessor.process |
| Vertex exit (shuffle producer) | ReduceSinkOperator | OrderedPartitionedKVOutput with Hive's HiveKey serializer and partitioner |
| Vertex entry (reduce side) | First operator after the boundary | OrderedGroupedKVInput provides a KeyValuesReader; ReduceRecordProcessor adapts it into Hive's tuple-at-a-time interface |
| Vertex middle (reduce) | GroupBy aggregation, Join, etc | Operator chain |
| Vertex exit (final) | FileSinkOperator | MROutputLegacy writes to HDFS |
| Broadcast join build | HashTableSinkOperator | UnorderedKVOutput (or in newer Hive a BROADCAST-typed edge) feeding the probe vertex |
| Broadcast join probe | MapJoinOperator | UnorderedKVInput on a BROADCAST edge |
grep -rn "OrderedPartitionedKVOutput\|OrderedGroupedKVInput\|UnorderedKVOutput\|UnorderedKVInput" \
~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez | head -20
TezProcessor adapter
grep -n "class TezProcessor\|class MapTezProcessor\|class ReduceTezProcessor\|process(" \
~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezProcessor.java
TezProcessor.run(inputs, outputs):
- Pull the singular input (MRInputLegacy or first OrderedGroupedKVInput).
- Construct a RecordSource that adapts the Tez reader into Hive's Operator.process(Object row, int tag) calling convention.
- Run the operator tree until the input is drained.
- Call forward(EOF) to drain operator buffers.
- Close outputs in reverse order.
The processor is intentionally thin — all the interesting logic is in the Hive operator chain.
TezSessionPoolManager
find ~/hive-src/ql/src/java -name "TezSessionPoolManager.java"
wc -l ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPoolManager.java
A Tez session = a long-lived Tez AM holding zero or more idle containers ready to accept the next DAG.
| Config | Default | Effect |
|---|---|---|
| hive.server2.tez.default.queues | default | Pre-warm sessions per YARN queue. |
| hive.server2.tez.sessions.per.default.queue | 1 | Number of pre-warm sessions per queue. |
| hive.server2.tez.initialize.default.sessions | false | Start them at HS2 boot. |
| hive.tez.exec.print.summary | false | Surface Tez counters in query output. |
Pool flow:
- HS2 starts. If initialize.default.sessions=true, it launches N AMs per queue.
- Query comes in. HS2 calls TezSessionPoolManager.getSession(queue), which returns an idle session or opens a new one.
- Session executes the DAG; AM holds containers across DAGs (see container-reuse.md).
- On session return, AM remains idle awaiting next DAG.
- On idle timeout (hive.server2.session.check.interval), pool may close sessions.
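The pool flow above reduces to "hand out an idle session if one exists, otherwise open a new one, and take sessions back on return". The following is a minimal JDK-only sketch of that behavior for a single queue. SessionPoolSketch and its Session type are hypothetical names, not Hive's TezSessionPoolManager API, and idle-timeout eviction is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical single-queue model of a warm session pool; not Hive code.
final class SessionPoolSketch {

  /** Stand-in for one long-lived AM; prewarmed marks sessions started at boot. */
  static final class Session {
    final int id;
    final boolean prewarmed;

    Session(int id, boolean prewarmed) {
      this.id = id;
      this.prewarmed = prewarmed;
    }
  }

  private final Deque<Session> idle = new ArrayDeque<>();
  private int nextId = 0;

  /** Pre-warms the given number of sessions, as a boot-time initialization would. */
  SessionPoolSketch(int prewarmCount) {
    for (int i = 0; i < prewarmCount; i++) {
      idle.add(new Session(nextId++, true));
    }
  }

  /** Returns an idle session if available, otherwise opens a cold one. */
  synchronized Session getSession() {
    Session session = idle.poll();
    return session != null ? session : new Session(nextId++, false);
  }

  /** Puts a session back so the next query can reuse its AM. */
  synchronized void returnSession(Session session) {
    idle.push(session);
  }

  synchronized int idleCount() {
    return idle.size();
  }

  public static void main(String[] args) {
    SessionPoolSketch pool = new SessionPoolSketch(1);
    Session first = pool.getSession();   // warm
    Session second = pool.getSession();  // pool empty, so a cold session is opened
    pool.returnSession(first);
    System.out.println(first.prewarmed + " " + second.prewarmed + " " + pool.idleCount());
  }
}
```

The second caller pays the cold-start cost, which is the "slow first query" effect that pre-warming is meant to avoid.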
LLAP
LLAP (Live Long And Process) is a different execution model that replaces per-query YARN containers with long-lived per-node daemons. The Tez AM still coordinates, but instead of asking YARN for containers it asks LLAP daemons for "fragments".
find ~/hive-src/llap-* -type d -maxdepth 2 2>/dev/null | head
Key differences (do not extrapolate Tez-on-YARN debugging to LLAP):
- Containers are replaced by LlapTaskExecutorService worker slots.
- The shuffle path uses a Netty-based fetcher (LlapShuffleHandler).
- The Tez scheduler plugin is LlapTaskSchedulerService (in hive-llap-tez).
- Container reuse is not relevant: LLAP slots are always "hot".
This chapter does not cover LLAP further; treat it as a separate world.
Bug attribution: where does it really live?
Triage tree. Symptom: query fails or returns wrong result.
flowchart TD
S[Failure observed] --> Q1{Failure message mentions Hive operator?}
Q1 -- yes --> H1[Hive bug: open against HIVE]
Q1 -- no --> Q2{Failure in TezChild / IFile / Fetcher?}
Q2 -- yes --> T1[Tez bug: open against TEZ]
Q2 -- no --> Q3{Failure in container launch / RM allocation?}
Q3 -- yes --> Y1[YARN bug: open against YARN]
Q3 -- no --> Q4{Wrong result not crash?}
Q4 -- yes --> Q5{Reproduce with same DAG via TestOrderedWordCount-style?}
Q5 -- no --> H1
Q5 -- yes --> T1
Practical heuristics:
| Stack trace contains | Probably |
|---|---|
| org.apache.hadoop.hive.ql.exec.Operator | Hive |
| org.apache.tez.runtime.library | Tez |
| org.apache.tez.dag.app.rm | Tez (scheduling) |
| org.apache.hadoop.yarn | YARN |
| ShuffleHandler | YARN-side (mapreduce auxservice) |
| LlapDaemon | LLAP (Hive) |
| MapJoinOperator + OOM | Hive (join planning), even though the OOM happens in a Tez container |
Wrong-result bugs almost always live in Hive (operator semantics) unless
you can isolate the same DAG with synthetic data in TestOrderedWordCount
style.
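The heuristics table can be turned into a first-pass classifier. The sketch below is an illustration only: TriageSketch is a hypothetical helper, the check order (Hive-specific patterns before Tez and YARN packages) is an assumption chosen because Hive frames always sit on top of Tez frames in a task stack, and substring matching is far cruder than reading the trace.

```java
// Hypothetical first-pass triage helper built from the heuristics table; not project code.
final class TriageSketch {

  /** Returns the probable owner for a stack trace, checking the most specific patterns first. */
  static String probableOwner(String stackTrace) {
    if (stackTrace.contains("MapJoinOperator") && stackTrace.contains("OutOfMemoryError")) {
      return "Hive (join planning)";
    }
    if (stackTrace.contains("LlapDaemon")) {
      return "LLAP (Hive)";
    }
    if (stackTrace.contains("org.apache.hadoop.hive.ql.exec.Operator")) {
      return "Hive";
    }
    if (stackTrace.contains("org.apache.tez.dag.app.rm")) {
      return "Tez (scheduling)";
    }
    if (stackTrace.contains("org.apache.tez.runtime.library")) {
      return "Tez";
    }
    if (stackTrace.contains("ShuffleHandler")) {
      return "YARN-side (mapreduce auxservice)";
    }
    if (stackTrace.contains("org.apache.hadoop.yarn")) {
      return "YARN";
    }
    return "unknown";
  }

  public static void main(String[] args) {
    String trace = "java.lang.OutOfMemoryError: Java heap space\n"
        + "  at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process(MapJoinOperator.java)\n"
        + "  at org.apache.tez.runtime.task.TezChild.run(TezChild.java)";
    System.out.println(probableOwner(trace));
  }
}
```

Treat the result as a starting hypothesis for which tracker to search, not as a verdict; wrong-result bugs in particular leave no stack trace to classify.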
Reading exercise
- cat ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java | head -200: read the top of TezTask.execute.
- grep -n "createEdgeProperty\|EdgeProperty\.create" ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java: list all edge property factories Hive uses.
- grep -rn "ShuffleVertexManager\|RootInputVertexManager" ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez: when does Hive set each manager?
- find ~/hive-src/ql/src/java -name "TezProcessor.java" -exec wc -l {} \;: confirm the processor is < 1000 lines (it's an adapter, not the brain).
- grep -rn "TezSessionPoolManager.getSession" ~/hive-src/service/src: when does HS2 acquire sessions?
- cat ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java | head -100: see how a session wraps a TezClient.
Common bugs and symptoms
| Symptom | Likely owner |
|---|---|
| MetaException mid-query | Hive (HMS client) |
| Container OOM during reduce join | Hive operator (map-join build size); Tez cannot size around an oversized hash table |
| Wrong row counts after a query rewrite | Hive optimizer or MapJoinOperator semantics |
| Fetcher: ConnectException to nm:13562 | YARN (aux-service mis-config) |
| AM dies with org.apache.tez.dag.app.DAGAppMaster: Vertex failed and the diagnostic mentions only TezProcessor, no Hive class | Tez bug; open a reproducer DAG without Hive |
| Slow first query after HS2 restart | No warm sessions; enable initialize.default.sessions |
| Stale ACL after GRANT reissue | Hive (HMS); Tez containers cache delegation tokens, see container-reuse.md |
Validation: prove you understand this
- List the Hive operators on the source and destination sides of a SCATTER_GATHER shuffle edge and map each side to the Tez Input or Output class.
- Identify the Hive method that finally calls tezClient.submitDAG. Cite path + grep command.
- Given a query that succeeds in standalone HS2 but fails in HS2 with session pooling on, name two likely failure modes and where to look.
- Explain why a MapJoinOperator OOM is a Hive bug even though the OOM stack trace is rooted in TezChild.
- Show, in three lines, the conditional inside DagUtils that decides whether to install ShuffleVertexManager on a reduce vertex. (Find via grep; quote the file:line.)
YARN Integration
The Tez AM is, from YARN's perspective, an ordinary YARN application: an ApplicationMaster running in a container, talking to the ResourceManager to request more containers, talking to NodeManagers to launch them, and writing events to a Timeline Server.
This chapter walks every YARN-facing interface Tez touches.
DAGAppMaster as a YARN AM
find tez-dag/src/main/java -name "DAGAppMaster.java"
wc -l $(find tez-dag/src/main/java -name "DAGAppMaster.java")
grep -n "main(\|serviceStart\|serviceInit" \
$(find tez-dag/src/main/java -name "DAGAppMaster.java") | head
Boot sequence when YARN launches the AM container:
- NodeManager runs the AM command line (constructed by TezClientUtils), which is essentially java -cp <classpath> org.apache.tez.dag.app.DAGAppMaster.
- DAGAppMaster.main parses environment for ApplicationAttemptId, container ID, AM Resource, etc.
- Constructs the DAGAppMaster service tree (state machines, dispatchers, schedulers, ATS publisher).
- serviceStart() registers with the RM via AMRMClientAsync.registerApplicationMaster.
- Starts an RPC server for client connections (DAGClient, TezTaskUmbilicalProtocol).
- Waits for DAG submissions over the client RPC (or, for non-session mode, picks up the pre-submitted DAG from local disk).
Key clients owned by the AM:
| Client | Purpose | Library |
|---|---|---|
| AMRMClientAsync | RM heartbeat: request/release containers | hadoop-yarn-client |
| NMClientAsync | NM RPC: launch/stop containers | hadoop-yarn-client |
| TimelineClient | ATS event publisher | hadoop-yarn-client |
| DFSClient | HDFS access for recovery & temp files | hadoop-hdfs-client |
AMRMClientAsync
grep -n "AMRMClientAsync\|addContainerRequest\|releaseAssignedContainer\|allocate" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20
The async wrapper around AMRMClient. Tez uses it instead of the sync
client so allocate-callbacks fire on a dedicated thread.
Lifecycle:
- Register: registerApplicationMaster(host, rpcPort, trackingUrl). This is the AM telling the RM "I'm alive, here is where to find me."
- Allocate loop: a background thread heartbeats at the interval the client was configured with; Tez caps that interval with tez.am.am-rm.heartbeat.interval-ms.max. The RM expires an AM that stays silent for longer than yarn.am.liveness-monitor.expiry-interval-ms. On each heartbeat the AMRM client sends:
  - Pending container requests (added via addContainerRequest).
  - Containers to release.
  - Application progress (0..1).
- In the heartbeat response it receives:
  - Newly allocated containers.
  - Completed container statuses.
  - Updated node reports (for blacklisting).
  - Decommissioned-node reports.
- Unregister: unregisterApplicationMaster(state, msg, trackingUrl) on AM shutdown.
grep -n "CallbackHandler\|onContainersAllocated\|onContainersCompleted\|onShutdownRequest\|onNodesUpdated" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java")
These callbacks run on the AMRM client's internal thread; Tez keeps them short by forwarding to its own dispatcher.
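The "keep callbacks short, forward to a dispatcher" pattern can be shown with a queue and one worker thread. This is a JDK-only sketch: CallbackForwardSketch, its method names, and the string events are hypothetical and only mirror the shape of the real callback handler, which forwards typed events to Tez's own dispatcher.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of callback-to-dispatcher hand-off; not Tez code.
final class CallbackForwardSketch {
  private static final String STOP = "__stop__";

  private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
  private final List<String> handled = new CopyOnWriteArrayList<>();
  private final Thread dispatcher;

  CallbackForwardSketch() {
    // All real work happens on this thread, never on the caller's thread.
    dispatcher = new Thread(() -> {
      try {
        for (String event = events.take(); !STOP.equals(event); event = events.take()) {
          handled.add(Thread.currentThread().getName() + ":" + event);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "dispatcher");
    dispatcher.start();
  }

  /** Callback stand-in: enqueue and return immediately. */
  void onContainersAllocated(int count) {
    events.add("allocated=" + count);
  }

  /** Callback stand-in: enqueue and return immediately. */
  void onContainersCompleted(int count) {
    events.add("completed=" + count);
  }

  /** Stops the dispatcher after it drains queued events and returns what it handled. */
  List<String> stopAndDrain() {
    events.add(STOP);
    try {
      dispatcher.join(5000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return handled;
  }

  public static void main(String[] args) {
    CallbackForwardSketch sketch = new CallbackForwardSketch();
    sketch.onContainersAllocated(3);
    sketch.onContainersCompleted(1);
    System.out.println(sketch.stopAndDrain());
  }
}
```

Because the callbacks only enqueue, a slow handler delays the dispatcher thread and not the thread that owns the heartbeat, which is the property that protects the AM from missing RM heartbeats.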
NMClientAsync and ContainerLauncherImpl
find tez-dag/src/main/java -name "ContainerLauncherImpl.java"
wc -l $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java")
grep -n "NMClientAsync\|startContainerAsync\|stopContainerAsync" \
$(find tez-dag/src/main/java -name "ContainerLauncherImpl.java")
After the RM allocates a container, Tez must tell the relevant NM to actually
launch the JVM. ContainerLauncherImpl uses NMClientAsync to send
startContainerAsync(container, containerLaunchContext).
ContainerLaunchContext
grep -n "buildContainerLaunchContext\|ContainerLaunchContext\|setLocalResources\|setEnvironment\|setCommands" \
$(find tez-dag/src/main/java -name "ContainerLauncherImpl.java" \
-o -name "AMContainerHelpers.java")
The CLC is what NM uses to fork the JVM. It carries:
| Field | What Tez puts there |
|---|---|
| commands | java <jvm opts> -Dlog4j.configuration=... org.apache.tez.runtime.task.TezChild <args> |
| environment | CLASSPATH, JVM_PID, container ID, AM host/port |
| localResources | Tez tarball, user JARs, any HDFS-distributed resources |
| tokens | Delegation tokens (HDFS, HMS, etc) for the container to use |
| serviceData | Per-aux-service payload (e.g. mapreduce_shuffle job secret) |
grep -n "ServiceData\|JobTokenSecretManager\|shuffleSecret" \
$(find tez-dag/src/main/java -name "*.java") | head
The serviceData map entry under key mapreduce_shuffle carries the
serialized JobToken that NM's ShuffleHandler will use to authorize fetch
requests — this is why mapreduce_shuffle must be configured as an NM
aux-service even for Tez DAGs.
Tokens
grep -rn "AMRMToken\|ClientToAMToken\|TimelineDelegationToken" \
tez-dag/src/main/java | head
| Token | Issued by | Used for | Where it lives |
|---|---|---|---|
| AMRMToken | RM, auto-injected into AM's Credentials | AM↔RM RPC | AM JVM credentials |
| ClientToAMToken | RM, returned to client at submit | Client (DAGClient) ↔ AM RPC | Client credentials |
| TimelineDelegationToken | Timeline Server | AM → Timeline publisher | AM credentials, refreshed periodically |
| HDFS delegation token | NN | Tasks reading/writing HDFS | Container credentials |
| Hive Metastore token | HMS | Tasks calling HMS | Container credentials, via Hive code path |
The client collects all necessary delegation tokens at submit time
(TezClientUtils does this), and the AM passes them to NMs in the
CLC. Tokens that expire mid-DAG must be renewed by a TokenRenewer.
Log aggregation
grep -rn --include="*.java" "log-aggregation\|LogAggregationService" \
~/hadoop-src 2>/dev/null | head
YARN log aggregation is configured in yarn-site.xml:
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
</property>
When enabled, every container's stdout, stderr, and syslog are
uploaded to HDFS under
/app-logs/<user>/logs/<applicationId>/<nodeAddress> when the container
exits. Retrieve with:
yarn logs -applicationId application_1234_0001 -containerId container_..._01
yarn logs -applicationId application_1234_0001 -appOwner alice
Without aggregation, logs sit in
${yarn.nodemanager.log-dirs}/application_.../container_.../ on each NM
until cleaned by yarn.nodemanager.log.retain-seconds.
Timeline Server (ATS)
Tez publishes a rich event stream to ATS for post-mortem debugging.
find tez-plugins -type d -name "tez-yarn-timeline*"
ls tez-plugins/
Three flavors exist in the wild:
| Version | Tez plugin module | Notes |
|---|---|---|
| ATSv1 | tez-yarn-timeline-history | Original; LevelDB-backed Timeline Server. Deprecated. |
| ATSv1.5 | tez-yarn-timeline-history-with-acls and tez-yarn-timeline-history-with-fs | Adds entity-file staging to HDFS; reduces ATS write load. |
| ATSv2 | tez-yarn-timeline-history-with-fs + ATSv2 reader configuration | HBase-backed, scalable; requires Hadoop 3.x. |
grep -rn "TimelineClient\|TIMELINE_HISTORY\|HistoryEventHandler" \
tez-plugins/tez-yarn-timeline-history*/src/main/java | head
What gets published:
- AppLaunchedEvent
- DAGSubmittedEvent, DAGInitializedEvent, DAGStartedEvent, DAGFinishedEvent
- VertexInitializedEvent, VertexStartedEvent, VertexFinishedEvent
- TaskStartedEvent, TaskFinishedEvent
- TaskAttemptStartedEvent, TaskAttemptFinishedEvent
- ContainerLaunchedEvent, ContainerStoppedEvent
The Tez UI (embedded in Ambari or run standalone) reads these events to render the DAG view, vertex graphs, task swimlanes, and counter trees.
ls tez-ui/src/main 2>/dev/null
Configuration cheat sheet
grep -n "YARN\|ATS\|TIMELINE\|LOG_AGG" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head -20
| Key | Default | Effect |
|---|---|---|
| tez.am.am-rm.heartbeat.interval-ms.max | 1000 | Cap on AMRM heartbeat interval. |
| tez.am.client.am.port-range | auto | RPC port for AM client RPC. |
| tez.am.container.lookup.timeout-ms | 30000 | How long to wait for an NM ack before failing the launch. |
| tez.history.logging.service.class | (varies) | Which ATS plugin to use. |
| tez.am.tez-ui.history-url.template | template | Where the UI is hosted; surfaced in DAGStatus. |
yarn CLI behaviors for Tez apps
| Command | Behavior on a Tez app |
|---|---|
| yarn application -list | Lists Tez AMs alongside MR/Spark; type tag is TEZ. |
| yarn application -status <appId> | Shows AM state, RM tracking URL, ATS tracking URL (if configured). |
| yarn application -kill <appId> | RM SIGKILLs the AM container; Tez state is lost (no recovery beyond what RecoveryService already wrote). |
| yarn logs -applicationId <appId> | Streams aggregated logs of all containers, AM and TezChilds. |
| yarn node -list | Lists NMs and their state. It does not report aux-services, so confirm mapreduce_shuffle in each NM's yarn-site.xml or startup log. |
Reading exercise
- grep -n "registerApplicationMaster\|unregisterApplicationMaster" $(find tez-dag/src/main/java -name "*.java"): find every AM-lifecycle call.
- grep -rn "setupContainerEnvironment\|buildContainerEnvironment" tez-dag/src/main/java tez-api/src/main/java | head: what environment variables does the AM pass to each container?
- cat $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java") | head -200: read the launch path.
- grep -rn "mapreduce_shuffle" tez-dag/src/main/java tez-api/src/main/java: verify the aux-service name is hard-coded.
- find tez-plugins -name "*.java" | xargs grep -l "TimelineEntity" | head -3: which classes assemble ATS entities?
- cat $(find tez-dag/src/main/java -name "DAGAppMaster.java") | head -300: locate serviceInit and list every service added to the composite.
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| AM exits with InvalidApplicationMasterRequestException | AM tried to register twice or after un-register; usually a re-init bug. |
| Auxiliary service mapreduce_shuffle not configured | yarn-site.xml aux-services missing. |
| ConnectionRefused from Fetcher | NodeManager aux-service crashed or wrong shuffle port. |
| AM dies "RM expired" | AMRM heartbeat thread blocked or paused for GC > expiry interval. |
| ATS empty for completed app | tez.history.logging.service.class mis-set, or ATS not running. |
| yarn logs returns "Logs not aggregated" | Container did not finish cleanly, or aggregation not enabled. |
| ClientToAMToken auth fail | Client and AM disagree on cluster security; check both have the same hadoop.security.authentication. |
Validation: prove you understand this
- Trace the exact call path from DAGAppMaster.serviceStart to AMRMClientAsync.registerApplicationMaster.
- List the contents of the ContainerLaunchContext.serviceData map that Tez populates, and explain who reads each entry.
- Explain why a long AM pause for a full GC can manifest as an RM expired shutdown, and which config controls the threshold.
- For an app with yarn.log-aggregation-enable=false and a TezChild that crashed, give the exact filesystem path on the NM where its stderr lives. Use the configured yarn.nodemanager.log-dirs as a variable.
- Name the three ATS plugin modules, and pick the right one for a Hadoop 3.x cluster targeting HBase-backed ATSv2.
Failure Handling
A Tez DAG fails for many reasons: a corrupted input split, a flaky NM, an OOM in the user processor, a Kerberos token expiry, an RM connectivity blip, an AM crash. Tez has a layered escalation model: small failures are absorbed, big ones propagate, and the AM persists enough state to recover from its own death.
This chapter walks the escalation, the failure taxonomy, and the recovery machinery.
Escalation: attempt → task → vertex → DAG
flowchart TD
TA[TaskAttempt fails] -->|retry budget| TA2[New TaskAttempt]
TA -->|exhausted| T[Task fails]
T -->|failure policy| V[Vertex fails]
V -->|fail-on-vertex-failure| D[DAG fails]
Default behavior:
| Layer | Configuration | Default | Effect when exceeded |
|---|---|---|---|
| TaskAttempt | tez.am.task.max.failed.attempts | 4 | Mark Task as failed |
| Task | tez.am.vertex.max.task.failed.attempts (no direct knob; per-vertex policy) | varies | Vertex fails on first failed task by default |
| Vertex | per-DAG failure policy | fail-fast | DAG fails |
grep -n "MAX_FAILED_ATTEMPTS\|MAX_TASK_ATTEMPTS\|TEZ_AM_TASK_MAX" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
TaskAttemptTerminationCause
find tez-dag/src/main/java -name "TaskAttemptTerminationCause.java"
cat $(find tez-dag/src/main/java -name "TaskAttemptTerminationCause.java")
The enum names every reason a TaskAttempt can end up non-SUCCEEDED. A
selection:
| Cause | Source | Retryable? |
|---|---|---|
| TERMINATED_BY_CLIENT | user-initiated DAG kill | no |
| INTERNAL_PREEMPTION | scheduler preempted to make room | yes |
| EXTERNAL_PREEMPTION | YARN preempted the container | yes |
| CONTAINER_LAUNCH_FAILED | NM rejected the launch | retried on a new container |
| CONTAINER_EXITED | TezChild exited without a status update | yes |
| CONTAINER_STOPPED | AM stopped the container intentionally | depends |
| NODE_FAILED | NM died | yes, on a different node |
| NODE_BLACKLISTED | node accumulated too many failures | retried elsewhere |
| OUTPUT_LOST | downstream reported missing output | yes, re-run source TA |
| INPUT_READ_ERROR | TA failed reading shuffle from a source | yes |
| APPLICATION_ERROR | uncaught exception in user code | usually no, but retried up to attempt budget |
| FRAMEWORK_ERROR | uncaught exception in Tez code | sometimes no |
| OTHER_TASK_ATTEMPT_KILLED_DUPLICATE | speculative duplicate lost | no (not a failure) |
This enum is the most important debugging signal — every failed attempt in ATS / AM log surfaces a cause from this list.
TaskAttempt failure retries
grep -n "max.failed.attempts\|maxFailedAttempts\|attemptFailed" \
$(find tez-dag/src/main/java -name "TaskImpl.java")
On a TA failure:
- TaskAttemptImpl transitions to FAILED (or KILLED if the cause is in the "killed" subset).
- TaskImpl increments its failed-attempt counter.
- If counter < tez.am.task.max.failed.attempts, TaskImpl schedules a new TaskAttemptImpl (incremented attempt index).
- Otherwise TaskImpl transitions to FAILED and reports up to VertexImpl.
Some causes are not counted against the budget (e.g.
OUTPUT_LOST, NODE_FAILED) — these are infrastructure failures, not
user-code failures.
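The budget check described above can be sketched in plain Java. This is a simplified model under stated assumptions, not the actual TaskImpl code: the class name AttemptBudgetSketch, the Decision enum, and the exact set of non-counted causes are hypothetical, with only OUTPUT_LOST and NODE_FAILED taken from the text.

```java
import java.util.EnumSet;
import java.util.Set;

/** Simplified model of a task's retry budget; not the actual TaskImpl logic. */
final class AttemptBudgetSketch {

  /** A handful of the causes named in the table above. */
  enum Cause { APPLICATION_ERROR, FRAMEWORK_ERROR, CONTAINER_EXITED, OUTPUT_LOST, NODE_FAILED }

  enum Decision { RETRY, FAIL_TASK }

  // Assumption: infrastructure causes that do not consume the budget.
  private static final Set<Cause> NOT_COUNTED =
      EnumSet.of(Cause.OUTPUT_LOST, Cause.NODE_FAILED);

  private final int maxFailedAttempts; // models tez.am.task.max.failed.attempts
  private int failedAttempts = 0;

  AttemptBudgetSketch(int maxFailedAttempts) {
    this.maxFailedAttempts = maxFailedAttempts;
  }

  /** Records one ended attempt and decides whether the task retries or fails. */
  Decision onAttemptEnded(Cause cause) {
    if (!NOT_COUNTED.contains(cause)) {
      failedAttempts++;
    }
    return failedAttempts < maxFailedAttempts ? Decision.RETRY : Decision.FAIL_TASK;
  }

  int failedAttempts() {
    return failedAttempts;
  }
}
```

With a budget of 4, three APPLICATION_ERROR attempts plus any number of NODE_FAILED attempts still yield RETRY; the fourth counted failure yields FAIL_TASK. The real predicate lives in the Tez source, which the grep that follows helps locate.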
grep -n "isFatalFailure\|isExternalError\|countAsFailure" \
$(find tez-dag/src/main/java -name "*.java" | head -50)
Node blacklisting
grep -rn "NodeTracker\|blacklist\|BLACKLISTED" \
tez-dag/src/main/java/org/apache/tez/dag/app | head
Per-node failure accounting:
| Trigger | Effect |
|---|---|
| N task attempts fail on the same node within a window | Add node to blacklist for this app |
| NodeReport from RM says UNHEALTHY | Add node to blacklist immediately |
| tez.am.maxtaskfailures.per.node | Per-node failure threshold (default 10 in TezConfiguration; confirm against your checkout) |
| tez.am.node-blacklisting.enabled | Master toggle |
| tez.am.node-blacklisting.ignore-threshold-node-percent | Don't blacklist if it would remove more than N% of the cluster |
A blacklisted node:
- No new container requests go to it.
- Held containers on it are released.
- Existing attempts already running on it are allowed to finish (not preemptively killed).
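The per-node accounting above can be modeled in a few lines. This is a sketch under stated assumptions, not the Tez node tracker: NodeBlacklistSketch and its method names are hypothetical, the failure window is ignored, and the ignore-threshold is modeled as "refuse to blacklist one more node" rather than whatever the real implementation does once the threshold is crossed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified per-node failure accounting; not the actual Tez node tracker. */
final class NodeBlacklistSketch {
  private final int maxFailuresPerNode;     // models tez.am.maxtaskfailures.per.node
  private final int ignoreThresholdPercent; // models ...ignore-threshold-node-percent
  private final int clusterSize;
  private final Map<String, Integer> failures = new HashMap<>();
  private final Set<String> blacklisted = new HashSet<>();

  NodeBlacklistSketch(int maxFailuresPerNode, int ignoreThresholdPercent, int clusterSize) {
    this.maxFailuresPerNode = maxFailuresPerNode;
    this.ignoreThresholdPercent = ignoreThresholdPercent;
    this.clusterSize = clusterSize;
  }

  /** Records a failed attempt on a node; returns true if the node is blacklisted afterwards. */
  boolean onAttemptFailed(String node) {
    int count = failures.merge(node, 1, Integer::sum);
    if (blacklisted.contains(node)) {
      return true;
    }
    if (count < maxFailuresPerNode) {
      return false;
    }
    // Assumption: skip blacklisting when it would remove too large a share of the cluster.
    if ((blacklisted.size() + 1) * 100 > ignoreThresholdPercent * clusterSize) {
      return false;
    }
    blacklisted.add(node);
    return true;
  }

  boolean isBlacklisted(String node) {
    return blacklisted.contains(node);
  }
}
```

The integer comparison avoids floating-point percentages: a node is refused when the blacklisted count times 100 would exceed the percent times the cluster size.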
Output loss
A common late-stage failure: a downstream task reads from a shuffle source and finds the source's output is gone (the NM died, the disk was wiped, etc).
grep -rn "OUTPUT_LOST\|reportSourceTaskAttemptFailed\|inputFailedEvent" \
tez-runtime-library/src/main/java tez-dag/src/main/java | head
Flow:
- Destination TA's Fetcher fails permanently on source S.a.0.
- Destination TA sends InputReadErrorEvent via umbilical heartbeat.
- AM's VertexImpl receives the event, marks S.a.0 as OUTPUT_LOST.
- TaskImpl for S.a schedules a new attempt S.a.1.
- New attempt re-runs, produces fresh outputs, downstream resumes.
This is the cascading-rerun engine — and a source of pathological behavior when a single bad disk poisons many downstream tasks.
AM failover
find tez-dag/src/main/java -name "RecoveryService.java"
find tez-dag/src/main/java -name "RecoveryEventHandler.java"
wc -l $(find tez-dag/src/main/java -name "RecoveryService.java")
YARN keeps a small budget of AM restarts (yarn.resourcemanager.am.max-attempts,
default 2). When the AM crashes:
- RM allocates a fresh AM container, attempt index incremented.
- New AM boots, sees attempt > 1, enters recovery mode.
- RecoveryService reads the recovery log from HDFS (written by attempt 1).
- Replays events to reconstruct DAG, Vertex, Task, TaskAttempt state.
- Inherits any pre-existing containers via AMRMClient.getContainersFromPreviousAttempts.
- Resumes scheduling from the last consistent state.
RecoveryService
grep -n "writeEvent\|flush\|recover\|RecoveryEventType" \
$(find tez-dag/src/main/java -name "RecoveryService.java" \
-o -name "RecoveryEventHandler.java")
Append-only event log on HDFS, one file per app attempt:
hdfs:///tmp/staging/<user>/.staging/application_<id>/appattempt_<id>_NNNNNN/
  recovery/
    summary.dag_1.recovery
    dag_1.recovery
Event kinds:
| Event | When written |
|---|---|
| DAGSubmittedEvent | DAG arrives at AM |
| DAGInitializedEvent | DAG state machine reaches INITED |
| DAGStartedEvent | DAG reaches RUNNING |
| DAGFinishedEvent | DAG terminal state |
| VertexInitializedEvent, VertexStartedEvent, VertexFinishedEvent | mirror state transitions |
| TaskStartedEvent, TaskFinishedEvent | per task |
| TaskAttemptStartedEvent, TaskAttemptFinishedEvent | per attempt |
| VertexConfigurationDoneEvent | parallelism finalized |
find tez-dag/src/main/java -name "*Event*.java" -path "*recovery*"
Configuration
| Key | Default | Effect |
|---|---|---|
| tez.dag.recovery.enabled | true | Master toggle. |
| tez.dag.recovery.flush.interval.secs | 30 | Periodic fsync of the recovery log. |
| tez.dag.recovery.io.buffer.size | 8192 | Buffer for the writer. |
| yarn.resourcemanager.am.max-attempts (YARN) | 2 | Caps recovery attempts. |
What recovery can and cannot recover
| Can | Cannot |
|---|---|
| DAG / Vertex / Task / TA state at last flush | In-flight events lost since last flush |
| Counter snapshots written to recovery log | Real-time counter updates between flushes |
| Container assignments | NM-side container state — those are rediscovered via getContainersFromPreviousAttempts |
| User payload of DAGPlan | User in-memory state inside a custom VertexManagerPlugin |
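The "last flush" boundary in the table above can be illustrated with a toy append-only log. This is a sketch only: RecoveryReplaySketch, its Event type, and the two-state Kind enum are hypothetical and far simpler than the real recovery event classes and on-disk format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only event log with replay; not the actual RecoveryService format. */
final class RecoveryReplaySketch {

  enum Kind { STARTED, FINISHED }

  /** One logged transition for an entity such as a DAG, vertex, task, or attempt. */
  static final class Event {
    final String entityId;
    final Kind kind;

    Event(String entityId, Kind kind) {
      this.entityId = entityId;
      this.kind = kind;
    }
  }

  private final List<Event> flushed = new ArrayList<>();  // survives an AM crash
  private final List<Event> buffered = new ArrayList<>(); // lost on an AM crash

  /** Appends an event to the in-memory buffer. */
  void write(Event e) {
    buffered.add(e);
  }

  /** Moves buffered events to durable storage, preserving order. */
  void flush() {
    flushed.addAll(buffered);
    buffered.clear();
  }

  /** Rebuilds the last known state per entity from flushed events only. */
  Map<String, Kind> replay() {
    Map<String, Kind> state = new LinkedHashMap<>();
    for (Event e : flushed) {
      state.put(e.entityId, e.kind); // later events overwrite earlier ones
    }
    return state;
  }
}
```

If a task's FINISHED event is written but not flushed before the crash, replay() still reports that task as STARTED, which is why work completed since the last flush may be redone by the next AM attempt.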
A VertexManagerPlugin that holds in-memory state across events must
override getState() / setState() to participate in recovery — otherwise
it starts fresh on AM attempt 2.
DAG-level termination causes
find tez-dag/src/main/java -name "DAGTerminationCause.java"
cat $(find tez-dag/src/main/java -name "DAGTerminationCause.java")
| Cause | Trigger |
|---|---|
| DAG_KILL | client called dagClient.tryKillDAG() |
| VERTEX_FAILURE | a vertex transitioned to FAILED |
| INIT_FAILURE | DAG init failed (bad plan, bad input) |
| INTERNAL_ERROR | unhandled exception inside AM |
| AM_USERCODE_FAILURE | user-supplied plugin threw |
| OUT_OF_TEZ_TASK_RESOURCES | scheduler could not satisfy resource requests |
| RECOVERY_FAILURE | replay couldn't reconstruct prior state |
Reading exercise
- Count terminal transitions: grep -n "transition\|FAILED\|KILLED" $(find tez-dag/src/main/java -name "TaskAttemptImpl.java") | head -40
- Find what triggers this cause: grep -rn "OUTPUT_LOST" tez-dag/src/main/java tez-runtime-library/src/main/java
- Read the writer loop: cat $(find tez-dag/src/main/java -name "RecoveryService.java") | head -200
- List all recovery event classes: grep -n "RecoveryEvent" $(find tez-dag/src/main/java -name "*.java" | head -50)
- wc -l $(find tez-dag/src/main/java -name "TaskAttemptTerminationCause.java" -o -name "DAGTerminationCause.java" -o -name "VertexTerminationCause.java")
- Find where blacklist enforcement is implemented: grep -rn "node-blacklisting\|blacklistNode" tez-dag/src/main/java | head
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| OUTPUT_LOST cascade kills the DAG | One bad NM is poisoning downstream; blacklist or pin off it. |
| Recovery infinite-loops on attempt 2 | Corrupt recovery log; check fsync gating and tez.dag.recovery.flush.interval.secs. |
| INTERNAL_PREEMPTION repeatedly | Tez scheduler is preempting its own attempts; usually a higher-priority vertex starving lower; tune priorities. |
| All attempts of one task fail in < 1s | User code throws deterministically; cause is APPLICATION_ERROR. |
| DAG hangs forever after one task fails | Vertex failure policy is permissive (rare); look at the vertex transition. |
| NODE_BLACKLISTED removes 100% of cluster | ignore-threshold-node-percent not set; the DAG is now unschedulable. |
| AM crashes, attempt 2 boots, but tasks restart from scratch | Recovery disabled or HDFS staging dir not accessible to attempt 2. |
Validation: prove you understand this
- List five TaskAttemptTerminationCause values that do not count against the attempt budget. Cite where the predicate lives.
- Explain in two sentences how an OUTPUT_LOST on source S.a.0 triggers a re-run of S.a, not just S.a.0's downstream consumers.
- Identify the HDFS path pattern under which recovery events are written. Give the exact path components.
- Describe what happens to in-flight DataMovementEvents when the AM crashes mid-DAG and AM attempt 2 takes over.
- Given tez.am.maxtaskfailures.per.node=3 and an 8-node cluster, what is the smallest sequence of task failures that triggers blacklisting? Show the math.
Counters and Diagnostics
When a Tez DAG misbehaves, you have two primary signals: counters (numeric aggregates from every task) and diagnostics strings (free-text causes at every level of the hierarchy). This chapter is the operator's reference for both.
TezCounters
find tez-api/src/main/java -name "TezCounters.java"
wc -l $(find tez-api/src/main/java -name "TezCounters.java")
grep -n "class TezCounters\|addGroup\|getGroup\|findCounter" \
$(find tez-api/src/main/java -name "TezCounters.java")
TezCounters is a typed map of (groupName) → CounterGroup → (counterName) → Counter.
It is hash-cons style: identical strings share storage. Counters are long
values with thread-safe increment.
find tez-api/src/main/java -name "TaskCounter.java"
cat $(find tez-api/src/main/java -name "TaskCounter.java")
Standard groups
| Group | Source class | What lives there |
|---|---|---|
| org.apache.tez.common.counters.TaskCounter | TaskCounter enum | Per-task framework metrics |
| org.apache.tez.common.counters.DAGCounter | DAGCounter enum | Per-DAG aggregate metrics |
| org.apache.tez.common.counters.FileSystemCounter | FileSystemCounter | Per-FS bytes-read/written |
| org.apache.hadoop.mapreduce.JobCounter | (legacy MR) | Compatibility shim |
| User-defined | <your class name> | App code |
Key TaskCounter values
grep -n "INPUT_RECORDS_PROCESSED\|OUTPUT_RECORDS\|SPILLED_RECORDS\|SHUFFLE_BYTES\|GC_TIME_MILLIS\|REDUCE_INPUT_GROUPS" \
$(find tez-api/src/main/java -name "TaskCounter.java")
| Counter | Meaning |
|---|---|
| INPUT_RECORDS_PROCESSED | Records read from logical inputs |
| OUTPUT_RECORDS | Records written to logical outputs |
| OUTPUT_BYTES | Serialized bytes written to logical outputs (typically the uncompressed size) |
| OUTPUT_BYTES_PHYSICAL | Bytes actually written to disk (after compression, if enabled) |
| SPILLED_RECORDS | Records spilled by sorter |
| NUM_SPILLS | Number of spill files created |
| MERGED_MAP_OUTPUTS | Spills merged on the source side |
| SHUFFLE_BYTES | Bytes fetched by shuffle |
| SHUFFLE_BYTES_TO_MEM, SHUFFLE_BYTES_TO_DISK | Fetcher allocation split |
| REDUCE_INPUT_GROUPS | Distinct keys seen by a KeyValuesReader |
| REDUCE_INPUT_RECORDS | Total values across all groups |
| GC_TIME_MILLIS | Sum of GC time during the task |
| CPU_MILLISECONDS | Process CPU time |
| COMMITTED_HEAP_BYTES | Heap size at task end |
| PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES | Process memory snapshot |
DAGCounter
find tez-api/src/main/java -name "DAGCounter.java"
cat $(find tez-api/src/main/java -name "DAGCounter.java")
| Counter | Meaning |
|---|---|
| NUM_SUCCEEDED_TASKS | Aggregated across all vertices |
| NUM_KILLED_TASKS | Speculative duplicates + user kills |
| NUM_FAILED_TASKS | TA failures (counts every failed attempt) |
| TOTAL_LAUNCHED_TASKS | Lifetime sum |
| OTHER_LOCAL_TASKS, RACK_LOCAL_TASKS, DATA_LOCAL_TASKS | Locality histogram |
| AM_CPU_MILLISECONDS, AM_GC_TIME_MILLIS | AM process counters |
| WALL_CLOCK_MILLIS | DAG submission → completion |
Aggregation: TA → task → vertex → DAG
flowchart TD
TA[TaskAttempt counters] -->|flushed via heartbeat| T[Task counters]
T -->|on TASK_SUCCEEDED| V[Vertex counters]
V -->|on VERTEX_SUCCEEDED| D[DAG counters]
Mechanism:
- Each LogicalIOProcessorRuntimeTask accumulates counters in process.
- TaskReporter heartbeat carries a snapshot to the AM via TezTaskUmbilicalProtocol.statusUpdate.
- AM's TaskAttemptImpl stores the latest snapshot.
- On TASK_SUCCEEDED, the winning attempt's counters become the Task counters; other attempts are discarded.
- On VERTEX_SUCCEEDED, VertexImpl sums all task counters into the vertex counters.
- On DAG_SUCCEEDED, DAGImpl sums all vertex counters into DAG counters and includes AM_* and DAG_* self-counters.
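The roll-up above amounts to "pick the winner, then sum". A minimal sketch, assuming counters are modeled as plain name-to-long maps; CounterRollupSketch and its methods are hypothetical and do not mirror the TezCounters API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy roll-up of counters from attempts to a DAG; not the actual Tez classes. */
final class CounterRollupSketch {

  /** Sums several counter maps key by key (task to vertex, or vertex to DAG). */
  static Map<String, Long> sum(List<Map<String, Long>> parts) {
    Map<String, Long> total = new HashMap<>();
    for (Map<String, Long> part : parts) {
      for (Map.Entry<String, Long> e : part.entrySet()) {
        total.merge(e.getKey(), e.getValue(), Long::sum);
      }
    }
    return total;
  }

  /** A task keeps only the winning attempt's snapshot; the other attempts are discarded. */
  static Map<String, Long> taskCounters(List<Map<String, Long>> attemptSnapshots,
                                        int winningAttempt) {
    return new HashMap<>(attemptSnapshots.get(winningAttempt));
  }
}
```

Because losing attempts are dropped before any summation, a vertex total reflects one attempt per task, not the work actually performed across all attempts.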
grep -n "incrCounters\|aggregateCounters\|getCounters\|setCounters" \
$(find tez-dag/src/main/java -name "TaskAttemptImpl.java" \
-o -name "TaskImpl.java" \
-o -name "VertexImpl.java" \
-o -name "DAGImpl.java") | head -30
Counter limits (and how they kill DAGs)
grep -n "COUNTERS_MAX\|TEZ_COUNTERS_MAX\|countersMax" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
| Key | Default | Cap on |
|---|---|---|
| tez.counters.max | 1200 | Total counter count per TezCounters instance |
| tez.counters.max.groups | 500 | Group count |
| tez.counters.group-name.max | 256 | Length of a group name |
| tez.counters.counter-name.max | 64 | Length of a counter name |
Exceeding any limit throws LimitExceededException. This typically happens
when:
- An app creates a counter per unique key (e.g. per file path).
- A user vertex manager creates per-task counters.
- A DAG has very many vertices, each contributing many counters, and the DAG-level aggregate blows the cap.
The exception propagates up the heartbeat path and kills the DAG with
INTERNAL_ERROR. Look for LimitExceededException in the AM log to confirm.
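The "counter per unique key" failure mode is easy to reproduce in miniature. The sketch below is a toy registry, not Tez code: CounterLimitSketch and TooManyCountersException are hypothetical stand-ins for TezCounters and LimitExceededException, and only the total-count cap is modeled.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy counter registry that enforces a total-counter cap. */
final class CounterLimitSketch {

  /** Hypothetical stand-in for Tez's LimitExceededException. */
  static final class TooManyCountersException extends RuntimeException {
    TooManyCountersException(String message) {
      super(message);
    }
  }

  private final int maxCounters; // models tez.counters.max
  private final Map<String, Long> counters = new HashMap<>();

  CounterLimitSketch(int maxCounters) {
    this.maxCounters = maxCounters;
  }

  /** Increments a counter, creating it first only if the cap allows. */
  void increment(String group, String name, long delta) {
    String key = group + "/" + name;
    if (!counters.containsKey(key) && counters.size() >= maxCounters) {
      throw new TooManyCountersException(
          "Too many counters: " + (counters.size() + 1) + " max=" + maxCounters);
    }
    counters.merge(key, delta, Long::sum);
  }

  int size() {
    return counters.size();
  }
}
```

Incrementing an existing counter never trips the cap; only creating a new name does. That is why a fixed set of counter names is safe at any data volume, while a name derived from a file path or key fails once the input is large enough.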
Diagnostics strings
Every level (TA → Task → Vertex → DAG) has a List<String> of diagnostics.
| Level | Class | Populated by |
|---|---|---|
| TaskAttempt | TaskAttemptImpl | User exception stacks, framework errors, container exit reasons |
| Task | TaskImpl | Aggregate of failed attempt diagnostics + scheduling diagnostics |
| Vertex | VertexImpl | Aggregate of failed task diagnostics + vertex manager events |
| DAG | DAGImpl | Aggregate of failed vertex diagnostics + DAG-level events |
grep -n "addDiagnostic\|diagnostics\|getDiagnostics" \
$(find tez-dag/src/main/java -name "TaskAttemptImpl.java" \
-o -name "TaskImpl.java" \
-o -name "VertexImpl.java" \
-o -name "DAGImpl.java") | head -40
When a DAG completes, DAGStatus.getDiagnostics() is the union of every
diagnostic at every level. This is what tez-tool and the Tez UI display.
Where to find diagnostics
| Surface | Path | Notes |
|---|---|---|
| Client return value | DAGStatus.getDiagnostics() | Concatenated strings |
| AM log | syslog | Search for DIAG:, ERROR, the cause keyword |
| ATS | DAGFinishedEvent.diagnostics, VertexFinishedEvent.diagnostics, etc | One field per entity |
| Tez UI | DAG / Vertex / Task page | Renders the same ATS fields |
| dag.dot (if dumped) | local file written by TezClient when enabled | Static plan only, no diagnostics |
| Counter dump from CLI | tez-tool dump-counters <appId> | Counter snapshots |
grep -rn "DIAG\|addDiagnosticInfo" tez-dag/src/main/java | head -20
Counters in the AM log
A typical successful-task log line:
TaskAttempt: [attempt_1_0_00_000000_0]
TASK_ATTEMPT_FINISHED ...
counters: Counters: 26
org.apache.tez.common.counters.TaskCounter
INPUT_RECORDS_PROCESSED=12345
OUTPUT_RECORDS=12345
OUTPUT_BYTES=4567890
...
grep -rn "Counters: " tez-dag/src/main/java | head
For diagnostic grepping, search the AM log for:
| Pattern | What it finds |
|---|---|
| DIAG: | Diagnostics appends |
| Counters: | Counter dumps |
| LimitExceededException | Counter limit hits |
| TaskAttemptTerminationCause | Failure causes |
| TERMINATED_BY_CLIENT | User-initiated kills |
| OUTPUT_LOST | Cascading reruns |
Custom counters
User code accesses counters via the IPO context:
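import java.util.Map;
import org.apache.tez.common.counters.TezCounters;
import org.apache.tez.runtime.api.AbstractLogicalIOProcessor;
import org.apache.tez.runtime.api.LogicalInput;
import org.apache.tez.runtime.api.LogicalOutput;
public class MyProcessor extends AbstractLogicalIOProcessor {
  // The ProcessorContext constructor and the initialize/handleEvents/close overrides are omitted.
  @Override
  public void run(Map<String, LogicalInput> inputs, Map<String, LogicalOutput> outputs)
      throws Exception {
    TezCounters counters = getContext().getCounters();
    counters.findCounter("MyApp", "ROWS_FILTERED").increment(1);
  }
}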
grep -rn "getContext().getCounters\|getCounters()" \
tez-tests/src/main/java tez-examples/src/main/java | head
Operational guidance:
- Cap group/counter cardinality at compile time. Never use unbounded user input as a counter name.
- One group per app; many counters per group.
- Counter names are visible in ATS forever — treat them as a stable API.
Reading exercise
- Read the enum: cat $(find tez-api/src/main/java -name "TaskCounter.java")
- Find every place the runtime increments counters: grep -n "incrCounter\|addCounters" $(find tez-runtime-library/src/main/java -name "*.java") | head -20
- Trace the kill path: grep -rn "LimitExceededException" tez-api/src/main/java tez-dag/src/main/java
- Look at tez-tools for counter-dump tooling: find tez-tools -type f -name "*.java" | head
- Count the call sites and build a mental model of "where diagnostics flow in": grep -rn "addDiagnosticInfo\|addDiagnostic" tez-dag/src/main/java | wc -l
- Open the Tez UI for a recent app, navigate DAG → Vertex → Task, and compare each level's counter view against what the AM log shows.
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| DAG fails with LimitExceededException | Too many counters; search the AM log for the limit that triggered. |
| Counters at DAG level don't sum to vertex counters | One vertex failed; its counters are excluded from the sum. |
| Counter group missing from ATS | Counter was never incremented (zero is not stored). |
| Diagnostics string truncated | ATS field length limit; check yarn.timeline-service.client.max-attempts and entity size. |
| INPUT_RECORDS_PROCESSED is zero but task succeeded | Input had zero rows, or a custom IPO does not increment the standard counter. |
| SHUFFLE_BYTES_TO_DISK >> SHUFFLE_BYTES_TO_MEM | Fetcher exhausted memory budget; tune tez.runtime.shuffle.memory.limit.percent. |
| Wall clock huge vs CPU millis | Task spent most time waiting (shuffle, GC, blocked); not CPU bound. |
Validation: prove you understand this
- Name the four standard counter groups and the class that defines each.
- Explain why two attempts of the same task can have different counter values, and what happens to the loser's counters.
- Calculate the smallest DAG that can hit tez.counters.max=1200, assuming each TaskCounter contributes 26 counters per vertex on success.
- Trace the path of a single counter increment in user code through the classes that aggregate it up to the DAGStatus returned to the client.
- Given an AM log line DIAG: TaskAttempt attempt_1_0_05_000003_2 failed, cause=APPLICATION_ERROR, list the four levels where this diagnostic ultimately appears and the exact classes that store each copy.
Testing Framework
Tez ships three tiers of tests, each with a different cost/coverage tradeoff. Knowing which tier to use for a given change — and which patterns are considered idiomatic — is the difference between a patch that lands and one that sits in review.
Three tiers
| Tier | Module | Boots... | Run cost | Use for |
|---|---|---|---|---|
| Unit | each module's src/test/java | nothing real; pure mocks + dispatcher | seconds | State-machine transitions, parsers, helper classes |
| Mini-cluster | tez-tests/src/test/java | MiniTezCluster (MiniYARNCluster + Tez session) | seconds-to-minutes | End-to-end DAGs in a JVM |
| Full cluster | external | real YARN cluster | minutes | Release validation, perf tests |
find . -name "MiniTezCluster.java"
wc -l $(find . -name "MiniTezCluster.java")
Unit testing state machines
The dominant pattern for Tez unit tests is arrange-state, send-event,
drain-dispatcher, assert. Reference: TestVertexImpl, TestTaskImpl,
TestTaskAttemptImpl, TestDAGImpl.
find tez-dag/src/test/java -name "TestVertexImpl.java"
wc -l $(find tez-dag/src/test/java -name "TestVertexImpl.java")
grep -n "DrainDispatcher\|MockVertex\|MockDAG\|setupVertices\|dispatcher.await" \
$(find tez-dag/src/test/java -name "TestVertexImpl.java") | head -20
Building blocks
| Class | Purpose |
|---|---|
| DrainDispatcher | Synchronous-ish event dispatcher; await() blocks until queue drains |
| MockVertex, MockDAG, MockTask, etc | Lightweight stand-ins that satisfy Vertex etc interfaces |
| MockClock | Controllable clock for time-dependent transitions |
| MockHistoryEventHandler | Captures recovery / ATS events for assertion |
| Mockito (mock, when, verify) | Mocks for collaborators (TaskSchedulerManager, etc) |
Recipe
@Test
public void testVertexInitsAfterAllInputsReady() throws Exception {
  // 1. Arrange
  DrainDispatcher dispatcher = new DrainDispatcher();
  dispatcher.init(new Configuration());
  dispatcher.start();
  TaskSchedulerManager sched = mock(TaskSchedulerManager.class);
  DAG dag = mock(DAG.class);
  when(dag.getID()).thenReturn(TezDAGID.getInstance(appAttemptId, 1));
  VertexImpl v = new VertexImpl(vertexId, plan, name, conf,
      dispatcher.getEventHandler(),
      mock(TaskCommunicatorManagerInterface.class),
      mockClock, taskHeartbeatHandler, mockAppContext,
      VertexLocationHint.create(...), dispatcher,
      mockVertexManager, ...);
  // 2. Act
  dispatcher.getEventHandler().handle(
      new VertexEvent(vertexId, VertexEventType.V_INIT));
  dispatcher.await();
  // 3. Assert
  assertEquals(VertexState.INITED, v.getState());
  verify(sched, never()).taskAllocated(any(), any(), any());
}
Key idioms:
- Never call Thread.sleep. Always dispatcher.await().
- Never assume event ordering unless you've sent events sequentially through the same dispatcher.
- Mock the AppContext aggressively. It's the god-object; mocking it lets each test isolate exactly the collaborators it cares about.
MiniTezCluster tests
find tez-tests/src/test/java -name "TestOrderedWordCount.java" \
-o -name "TestMRRJobsDAGApi.java" \
-o -name "TestExtServicesWithLocalMode.java" | head
wc -l $(find tez-tests/src/test/java -name "TestOrderedWordCount.java")
MiniTezCluster boots:
- A MiniYARNCluster (in-process RM + N NMs).
- A MiniDFSCluster (in-process NN + DNs), optional.
- A TezClient configured against the mini cluster.
grep -n "MiniTezCluster\|new MiniYARNCluster\|setup\|tearDown" \
$(find tez-tests/src/test/java -name "TestOrderedWordCount.java")
Lifecycle
flowchart TD
setUp[BeforeClass: setup] --> mini[Start MiniTezCluster]
mini --> tez[Create TezClient]
test1[Test: build DAG] --> submit[submitDAG]
submit --> wait[waitForCompletion]
wait --> assert[Assert DAGStatus + counters]
tear[AfterClass: tearDown] --> stop[Stop TezClient + cluster]
Common shape
@BeforeClass
public static void setup() throws Exception {
  conf = new Configuration();
  conf.setInt(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS, 100);
  miniTezCluster = new MiniTezCluster("name", 1, 1, 1);
  miniTezCluster.init(conf);
  miniTezCluster.start();
  TezConfiguration tezConf = new TezConfiguration(miniTezCluster.getConfig());
  tezClient = TezClient.create("test", tezConf);
  tezClient.start();
}
@AfterClass
public static void tearDown() throws Exception {
  tezClient.stop();
  miniTezCluster.stop();
}
@Test(timeout = 60_000)
public void testWordCount() throws Exception {
  DAG dag = buildWordCountDAG();
  DAGClient client = tezClient.submitDAG(dag);
  DAGStatus status = client.waitForCompletionWithStatusUpdates(EnumSet.of(StatusGetOpts.GET_COUNTERS));
  assertEquals(DAGStatus.State.SUCCEEDED, status.getState());
  assertEquals(EXPECTED_ROW_COUNT,
      status.getDAGCounters().findCounter(TaskCounter.OUTPUT_RECORDS).getValue());
}
@Test(timeout = ...) is mandatory
A mini-cluster test that hangs blocks the whole CI build. Every
MiniTezCluster test has a JUnit timeout in the 60-300 second range.
Local mode for tests
Faster than MiniTezCluster: no YARN, no DFS, everything in-process.
grep -rn "TEZ_LOCAL_MODE\|setLocalMode\|tez.local.mode" \
tez-tests/src/test/java tez-runtime-library/src/test/java | head
Used for:
- Unit-style integration tests where YARN isn't relevant.
- Examples / smoke tests in tez-examples.
- Quick repro of runtime issues; see the local mode deep dive.
Patterns: do and don't
Do
grep -rn "DrainDispatcher\|await()" tez-dag/src/test/java | wc -l
- Send all setup events synchronously, then call dispatcher.await().
- Use MockClock and advance it explicitly.
- Capture emitted events with a custom handler and assert on the collection.
- Use @Before / @After to reset shared dispatcher and mocks.
- Mock external collaborators (TaskScheduler, ContainerLauncher, NMClient); never instantiate the real ones in unit tests.
- Bound parallelism in mini-cluster tests (numNodeManagers=1 is usually fine).
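The "advance the clock explicitly" idiom can be shown without any Tez classes. This is a sketch under stated assumptions: ManualClockSketch stands in for the MockClock idea, and HeartbeatTimeoutSketch is a hypothetical piece of time-dependent logic; neither mirrors a real Tez class or signature.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hand-advanced clock for tests; a stand-in for the MockClock idea. */
final class ManualClockSketch {
  private final AtomicLong nowMillis;

  ManualClockSketch(long startMillis) {
    nowMillis = new AtomicLong(startMillis);
  }

  long getTime() {
    return nowMillis.get();
  }

  /** Moves time forward; time never advances on its own. */
  void advance(long deltaMillis) {
    if (deltaMillis < 0) {
      throw new IllegalArgumentException("clock cannot go backwards");
    }
    nowMillis.addAndGet(deltaMillis);
  }
}

/** Hypothetical time-dependent logic that reads the injected clock, not the system clock. */
final class HeartbeatTimeoutSketch {
  private final ManualClockSketch clock;
  private final long timeoutMillis;
  private long lastHeartbeat;

  HeartbeatTimeoutSketch(ManualClockSketch clock, long timeoutMillis) {
    this.clock = clock;
    this.timeoutMillis = timeoutMillis;
    this.lastHeartbeat = clock.getTime();
  }

  void heartbeat() {
    lastHeartbeat = clock.getTime();
  }

  /** Expired only when strictly more than the timeout has elapsed. */
  boolean isExpired() {
    return clock.getTime() - lastHeartbeat > timeoutMillis;
  }
}
```

A test built this way checks the boundary exactly (not expired at the timeout, expired one millisecond later) and takes no wall-clock time, which is what makes it deterministic under CI load.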
Don't
| Anti-pattern | Why it bites |
|---|---|
| Thread.sleep(N) to wait for state | Flake city; transition time depends on machine load. |
| while (vertex.getState() != X) busy loop | Same flake, plus burns CPU. |
| Assume e1 happens before e2 when both posted async | Dispatcher orders by arrival, not posting. |
| Static state across tests | Tests run in some JVM order; static leaks corrupt later tests. |
| Real network calls in unit tests | Slow, flaky, often forbidden in CI sandboxes. |
| System.exit from tested code paths | Kills the JVM running the test runner. |
CI / build integration
cat pom.xml | head -100
grep -n "surefire\|failsafe" pom.xml
| Maven plugin | Runs | Default scope |
|---|---|---|
| maven-surefire-plugin | unit tests under src/test/java | Test*.java, *Test.java, *Tests.java, *TestCase.java |
| maven-failsafe-plugin | integration tests | IT*.java, *IT.java, *ITCase.java |
Tez puts MiniTezCluster tests under surefire as well (no separation),
which is one reason mvn test is slow. Run a single test:
mvn test -pl tez-dag -Dtest=TestVertexImpl
mvn test -pl tez-tests -Dtest=TestOrderedWordCount#testWordCount
Test-only utilities
find . -path "*/src/main/java/*" -name "Test*.java" | head
find . -path "*/src/test/java/*" -name "Mock*.java" | head
Helpful classes (some live under src/main so they're reusable downstream):
| Class | Module | Purpose |
|---|---|---|
| MiniTezCluster | tez-tests | Bootstrap an in-process cluster |
| TezClientForTest | tez-api | Subclass exposing internals |
| MockDAG, MockVertex, MockTask | tez-dag test sources | Plain-old objects implementing state-machine interfaces |
| TestProcessor, TestInput, TestOutput | tez-tests | No-op IPOs for plan plumbing tests |
| DrainDispatcher | hadoop-yarn-common (depended upon) | Dispatcher with await() |
Reading exercise
- Read the setup and the first test: cat $(find tez-dag/src/test/java -name "TestVertexImpl.java") | head -150
- Get a sense of the test surface: grep -n "@Test" $(find tez-dag/src/test/java -name "TestTaskImpl.java" -o -name "TestVertexImpl.java" -o -name "TestDAGImpl.java") | wc -l
- See a real MiniTezCluster test: cat $(find tez-tests/src/test/java -name "TestOrderedWordCount.java") | head -200
- Confirm the pattern is universal: grep -rn "dispatcher.await\|DrainDispatcher" tez-dag/src/test/java | wc -l
- Find any stragglers using the anti-pattern and understand why each one is there (usually waiting on real OS state, e.g. a port): grep -rn "Thread.sleep" tez-dag/src/test/java | head
- Run one test and read the output structure: mvn -pl tez-dag test -Dtest=TestVertexImpl -DfailIfNoTests=false
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
| Test passes locally, flakes in CI | Thread.sleep waiting for transition; replace with dispatcher.await(). |
| MiniTezCluster test hangs forever | Missing @Test(timeout = …); AM never finishes due to test bug. |
| BindException in mini-cluster | Previous test didn't stop(); ports leaked. |
| State machine throws InvalidStateTransitionException in test | Test sent event in wrong state; check arrange step. |
| Mock returns null from getDAG() | Forgot to stub when(appContext.getCurrentDAG()). |
| OutOfMemoryError: Java heap space in surefire | Each test forking JVM holds too much; tune argLine=-Xmx1g in pom. |
| Test depends on counter being non-zero, but it's zero | Counter incremented in code path the mock skipped; verify the code under test actually ran. |
Validation: prove you understand this
- Outline the four-step recipe for a state-machine unit test, with the exact call to drain the dispatcher.
- Name three classes from tez-dag/src/test/java that implement the Mock* pattern and what each replaces.
- Explain why Thread.sleep is an anti-pattern in Tez tests and what the correct alternative is for time-dependent transitions.
- Given a hang in TestVertexImpl#testTaskKill, list the first three diagnostics you'd inspect (no debugger).
- Describe the difference between a MiniTezCluster test and a local-mode test, and give one scenario where each is the correct choice.
Hive-on-Tez Labs
Hive on Tez is the production context that has carried Tez through the last decade. Every large Hive deployment that's not on Spark is on Tez. Understanding the Tez/Hive boundary is therefore not a niche skill — it is the production debugging skill for both projects.
These labs work from a SQL query down through Hive compilation, into a Tez DAG, into
running tasks, and back out through failure attribution and remediation. They are
deliberately hands-on; every step has commands to run against ~/tez-src and
~/hive-src.
Prerequisites
| Tool | Required version | Why |
|---|---|---|
| Apache Tez | 0.10.x | Matches the rest of this book |
| Apache Hive | 3.x or 4.x | Production-relevant; Hive 2 is end of life |
| Hadoop | 3.3.x | Tez and Hive both target this |
| JDK | 11 (Hive 4) or 8 (Hive 3) | Per project requirements |
| Local clones | ~/tez-src, ~/hive-src | All commands assume these paths |
If you have only Hive 3 or only Hive 4, the labs work either way; they call out the delta where it matters.
Class paths used throughout these labs (the integration boundary):
org.apache.hadoop.hive.ql.exec.tez.TezTask — Hive's "execute on Tez" task
org.apache.hadoop.hive.ql.exec.tez.DagUtils — Builds Tez DAG from Hive plan
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager — Pools Tez sessions
org.apache.hadoop.hive.ql.exec.tez.TezSessionState — One Hive session = one Tez AM
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource — Map-side record source
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource — Reduce-side record source
Verify these exist in your tree:
find ~/hive-src -path "*ql/exec/tez/TezTask.java"
find ~/hive-src -path "*ql/exec/tez/DagUtils.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionPoolManager.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionState.java"
find ~/hive-src -path "*ql/exec/tez/MapRecordSource.java"
find ~/hive-src -path "*ql/exec/tez/ReduceRecordSource.java"
If any are missing, your Hive tree may be too old. Hive 3.1.x and 4.0.x both have all six.
The Tez/Hive Boundary, At a Glance
The boundary is one Hive class — TezTask — and a handful of supporting utilities. Above
the boundary, Hive owns: SQL parsing, semantic analysis, logical plan, physical plan
(MapWork/ReduceWork). Below the boundary, Tez owns: DAG execution, task scheduling,
shuffle, recovery.
flowchart TD
subgraph Hive
A[SQL Query] --> B[Parser]
B --> C[Semantic Analyzer]
C --> D[Logical Plan]
D --> E[Physical Plan<br/>MapWork / ReduceWork]
E --> F[TezTask.execute]
F --> G[DagUtils.createVertex<br/>DagUtils.createEdge]
G --> H[DAG object]
end
subgraph Tez
H --> I[TezSession.submitDAG]
I --> J[DAGAppMaster<br/>tez-dag]
J --> K[Vertex tasks<br/>tez-runtime-internals]
K --> L[Shuffle I/O<br/>tez-runtime-library]
end
That TezTask → DagUtils → DAG → submitDAG sequence is the entire integration
surface. The six labs below walk it from the top (Lab H1) to the runtime (Lab H6).
Lab Index
| Lab | Goal | Output artifact |
|---|---|---|
| H1: SQL → DAG | Trace a SELECT...GROUP BY...ORDER BY from SQL to a labelled Tez DAG | DAG diagram |
| H2: Inspect DAG | Capture and inspect the DAG Hive submits | EXPLAIN output + .dot file |
| H3: Debug a query | Walk from a "Vertex failed" message to the actual exception | Failure narrative |
| H4: Bug attribution | Use stack-trace top frame to attribute to Hive, Tez runtime, Tez AM, or YARN | Decision tree applied |
| H5: Reproducing bugs | Build a minimum reproducer for a Hive-on-Tez bug | Repro tarball |
| H6: Diagnostics | Write a small diagnostic patch (log, counter, config) and attach to JIRA | Patch + JIRA |
Reading Order
H1 and H2 are foundational — do them in order. H3 and H4 are debugging skills that build on each other. H5 and H6 are the contributor-facing skills you need to file a useful Hive-on-Tez JIRA from a production observation.
If you are coming to this section from the Capstone, H4 and H5 are the most directly relevant.
Where the Real Work Happens
The Tez/Hive boundary is one of the most-asked-about areas on both project mailing lists. The labs are written so that, when you encounter a production issue, you can:
- Read the stack trace and attribute it (H4).
- Locate the SQL that produced the DAG (H1).
- Capture the DAG and find the relevant vertex (H2).
- Identify the failing task and its log (H3).
- Reproduce it minimally on MiniTezCluster (H5).
- Attach a diagnostic patch to a JIRA to get more data from the reporter (H6).
That six-step routine, executed crisply, is what gets Hive-on-Tez JIRAs resolved.
Validation for the Section
You have absorbed the Hive-on-Tez section when, given a freshly-failing query in a production Hive-on-Tez deployment, you can:
- Within 10 minutes, identify which project owns the failure (Hive / Tez / YARN).
- Within 30 minutes, locate the relevant code on both sides of the boundary.
- Within 1 hour, capture the DAG and the failing task's log.
- Within a day, produce a minimum reproducer on MiniTezCluster.
- Within a week, file a JIRA on the right project with all the data needed.
That is the standard a Hive-on-Tez committer holds themselves to. The labs build the muscle.
Lab H1: SQL → DAG
Background
A user writes:
SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;
Hive compiles this into a Tez DAG with three vertices and two edges. This lab walks the
compilation path: parser → semantic analyzer → logical plan → physical plan
(MapWork/ReduceWork) → TezTask → DagUtils.createVertex/createEdge → submitted
DAG.
By the end you will have a labelled DAG diagram for this query and you will be able to trace any similar query from SQL to runtime topology.
Setup
cd ~/hive-src
git log --oneline -1 # know the version you're on
find . -name "TezTask.java" # boundary class
find . -name "DagUtils.java" # DAG construction
A representative test table (use Hive CLI or beeline):
CREATE TABLE t (a INT, b STRING)
STORED AS ORC;
INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');
The query under study:
SELECT a, COUNT(*) AS c
FROM t
GROUP BY a
ORDER BY a;
Step 1: Parser (lexing, AST)
Hive uses ANTLR. The grammar lives in:
find ~/hive-src -name "HiveParser.g" -o -name "HiveLexer.g"
The parser produces an AST. From the CLI:
EXPLAIN AST SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
You will see a Lisp-style tree:
(TOK_QUERY
(TOK_FROM (TOK_TABREF (TOK_TABNAME t)))
(TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
(TOK_SELECT
(TOK_SELEXPR (TOK_TABLE_OR_COL a))
(TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT) c))
(TOK_GROUPBY (TOK_TABLE_OR_COL a))
(TOK_ORDERBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL a)))))
The AST is the input to the next phase.
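If you saved the AST in the parenthesised form shown above, a few lines of Python are enough to query it. The sketch below is not part of Hive; `parse_sexpr` and `find_nodes` are hypothetical helper names, and the code assumes the output really is a balanced Lisp-style tree (some Hive versions print an indented tree instead, which this parser does not handle).

```python
import re

def parse_sexpr(text):
    """Parse a Lisp-style tree into nested lists of string tokens."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing ")"
        return node

    return read()

def find_nodes(tree, name):
    """Yield every subtree whose head token equals `name`."""
    if isinstance(tree, list) and tree:
        if tree[0] == name:
            yield tree
        for child in tree[1:]:
            yield from find_nodes(child, name)

# The AST from this step, copied verbatim.
AST = """
(TOK_QUERY
 (TOK_FROM (TOK_TABREF (TOK_TABNAME t)))
 (TOK_INSERT
  (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
  (TOK_SELECT
   (TOK_SELEXPR (TOK_TABLE_OR_COL a))
   (TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT) c))
  (TOK_GROUPBY (TOK_TABLE_OR_COL a))
  (TOK_ORDERBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL a)))))
"""

tree = parse_sexpr(AST)
tables = [node[1] for node in find_nodes(tree, "TOK_TABNAME")]
group_by = next(find_nodes(tree, "TOK_GROUPBY"))
group_cols = [col[1] for col in find_nodes(group_by, "TOK_TABLE_OR_COL")]
print(tables, group_cols)
```

The same `find_nodes` call with `TOK_ORDERBY` gives you the sort columns, which is a quick way to predict whether the plan will need a final single-reducer vertex.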
Step 2: Semantic Analyzer
The AST goes through SemanticAnalyzer:
find ~/hive-src -name "SemanticAnalyzer.java" | head
It resolves table references, expands *, type-checks aggregates, and produces a
Query Block (QB) tree → Operator tree (logical plan).
EXPLAIN LOGICAL SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
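You see operators like TS (TableScan), SEL (Select), GBY (GroupBy), RS
(ReduceSink), FS (FileSink). Two GBY and two RS are typical for a
GROUP BY ... ORDER BY: the GBY pair is the partial and the final aggregation, and each RS
marks a shuffle boundary (one for the GROUP BY, one for the ORDER BY).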
Step 3: Physical Plan — MapWork, ReduceWork
The logical operator tree is converted to a physical plan whose top-level units are
MapWork, ReduceWork, and MergeJoinWork. For our query, Hive produces three
Work units:
| Work | Purpose | Operators inside |
|---|---|---|
| MapWork (Map 1) | Read t, partial aggregate by a | TS → SEL → GBY → RS |
| ReduceWork (Reducer 2) | Final aggregate by a, prepare for sort | GBY → RS |
| ReduceWork (Reducer 3) | Total-order sort by a, write output | SEL → FS |
Inspect the structures:
grep -rn "class MapWork" ~/hive-src/ql/src/java/
grep -rn "class ReduceWork" ~/hive-src/ql/src/java/
Get this from Hive directly:
EXPLAIN SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
Look for the Stage: Stage-1 / Tez block and the per-vertex sections (Map 1,
Reducer 2, Reducer 3).
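To turn the Tez block of a plain EXPLAIN into something you can compare against later captures, a small parser helps. This is a sketch under assumptions: the excerpt below is an illustrative, hand-written fragment in the `child <- parent (EDGE_TYPE)` shape that Hive prints under `Edges:`, and `parse_edges` is a hypothetical helper, not a Hive API.

```python
import re

# Illustrative excerpt only; your real EXPLAIN output will contain much more.
EXPLAIN_EXCERPT = """\
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      Vertices:
        Map 1
        Reducer 2
        Reducer 3
"""

EDGE_LINE = re.compile(r"^\s*(?P<child>.+?) <- (?P<parents>.+)$")
PARENT = re.compile(r"(?P<name>[^,(]+?) \((?P<kind>\w+)\)")

def parse_edges(explain_text):
    """Return (source, destination, edge_type) tuples from an EXPLAIN Edges block."""
    edges = []
    for line in explain_text.splitlines():
        m = EDGE_LINE.match(line)
        if not m:
            continue
        # A vertex with several parents lists them comma-separated on one line.
        for p in PARENT.finditer(m.group("parents")):
            edges.append((p.group("name").strip(), m.group("child"), p.group("kind")))
    return edges

edges = parse_edges(EXPLAIN_EXCERPT)
for src, dst, kind in edges:
    print(f"{src} -> {dst} [{kind}]")
```

Keeping edges as plain tuples makes it easy to diff the planned topology against what the runtime reports in Lab H2.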
Step 4: TezTask — The Boundary
The Hive-side execution entry point for a Tez query:
grep -n "public int execute" $(find ~/hive-src -name TezTask.java)
TezTask.execute (its parameter list differs between Hive versions, so rely on the grep output above) does roughly:
- Acquire a TezSessionState (existing pooled session or new one) via TezSessionPoolManager.
- Build a DAG from MapWork/ReduceWork via DagUtils.
- Submit the DAG via the session's TezClient.submitDAG.
- Block on the DAGClient for completion.
- Surface counters and diagnostics.
The DAG-building call:
grep -n "DagUtils\|dagUtils\.create" $(find ~/hive-src -name TezTask.java)
You will see calls to DagUtils.createVertex and DagUtils.createEdge, usually made from a
DAG-assembly helper inside TezTask itself (its name varies by Hive version).
Step 5: DagUtils.createVertex / createEdge
The mapping from Hive Work units to Tez Vertex happens here:
find ~/hive-src -name "DagUtils.java"
grep -n "createVertex\|public Vertex " $(find ~/hive-src -name DagUtils.java)
grep -n "createEdge\|public Edge " $(find ~/hive-src -name DagUtils.java)
For our query, DagUtils produces:
| Hive Work | Tez Vertex | Processor descriptor |
|---|---|---|
| MapWork "Map 1" | Vertex "Map 1" | MapTezProcessor |
| ReduceWork "Reducer 2" | Vertex "Reducer 2" | ReduceTezProcessor |
| ReduceWork "Reducer 3" | Vertex "Reducer 3" | ReduceTezProcessor |
And two edges:
| From | To | EdgeProperty kind |
|---|---|---|
| Map 1 | Reducer 2 | SCATTER_GATHER (shuffle) |
| Reducer 2 | Reducer 3 | SCATTER_GATHER (with a 1-task sink for total order) |
The "1-task sink for total order" is how Hive forces a single reducer for ORDER BY
(no LIMIT): Reducer 3 has parallelism 1.
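The two tables above can be written down as data and sanity-checked. The sketch below models the DAG with plain dataclasses; `Vertex` and `Edge` here are local, hypothetical types for note-taking, not the Tez API classes of the same name. It derives the execution order from the edges and identifies the sink vertex. It assumes Python 3.9 or later for `graphlib`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+

@dataclass(frozen=True)
class Vertex:
    name: str
    processor: str

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str

# Values copied from the two tables in this step.
VERTICES = [
    Vertex("Map 1", "MapTezProcessor"),
    Vertex("Reducer 2", "ReduceTezProcessor"),
    Vertex("Reducer 3", "ReduceTezProcessor"),
]
EDGES = [
    Edge("Map 1", "Reducer 2", "SCATTER_GATHER"),
    Edge("Reducer 2", "Reducer 3", "SCATTER_GATHER"),
]

def execution_order(vertices, edges):
    """Topologically sort vertex names so every producer precedes its consumers."""
    sorter = TopologicalSorter({v.name: set() for v in vertices})
    for e in edges:
        sorter.add(e.dst, e.src)  # e.src must run before e.dst
    return list(sorter.static_order())

def sinks(vertices, edges):
    """Vertices with no outgoing edge; for a query DAG this is where output is written."""
    producers = {e.src for e in edges}
    return [v.name for v in vertices if v.name not in producers]

order = execution_order(VERTICES, EDGES)
print(order, sinks(VERTICES, EDGES))
```

A cycle in the edge list would make `static_order` raise `graphlib.CycleError`, which is a cheap check that a hand-transcribed DAG is at least structurally valid.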
Step 6: The Submitted DAG
Once the DAG is assembled, TezTask submits it via the session:
grep -n "submitDAG" $(find ~/hive-src -name TezTask.java)
grep -n "submitDAG" $(find ~/hive-src -name TezSessionState.java)
The call lands on TezClient.submitDAG(DAG dag) in tez-api:
grep -n "public DAGClient submitDAG" \
$(find ~/tez-src/tez-api/src/main/java -name TezClient.java)
From there, Reading the Codebase Step 2's worked exercise picks up.
Step 7: Validation — Labelled DAG Diagram
Build this diagram for our query and save it.
flowchart TD
M1["Map 1<br/>processor: MapTezProcessor<br/>operators: TS → SEL → GBY → RS<br/>parallelism: numSplits(t)"]
R2["Reducer 2<br/>processor: ReduceTezProcessor<br/>operators: GBY → RS<br/>parallelism: hive.exec.reducers.* tuning"]
R3["Reducer 3<br/>processor: ReduceTezProcessor<br/>operators: SEL → FS<br/>parallelism: 1 (ORDER BY)"]
M1 -->|"SCATTER_GATHER<br/>partition on a"| R2
R2 -->|"SCATTER_GATHER<br/>partition on sort key"| R3
Capture this as your validation artifact (~/tez-notes/hive-h1-dag.md).
Step 8: Print the DAG via Hive
Hive has a setting to print a runtime summary of the executed DAG:
SET hive.tez.exec.print.summary=true;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
The summary, printed after the query, lists each vertex, its task count, and counters.
Confirm the topology matches the diagram. (If you see four vertices, you may be on a
build that splits ORDER BY differently; record the actual topology.)
For more detail, tez.am.dag.dot.file.location writes a .dot file — used in
Lab H2.
Step 9: Counter Pop Quiz
After the query runs (with hive.tez.exec.print.summary=true), find:
| Counter | Where it lives | What it measures |
|---|---|---|
| INPUT_RECORDS_PROCESSED | Map 1 | Rows read from t |
| OUTPUT_RECORDS | Map 1 | Records emitted to shuffle (post partial-aggregate) |
| REDUCE_INPUT_GROUPS | Reducer 2 | Distinct a values seen |
| OUTPUT_RECORDS | Reducer 2 | Records to Reducer 3 |
| OUTPUT_RECORDS | Reducer 3 | Final result row count |
For our 6-row input with 3 distinct values of a:
| Counter | Expected |
|---|---|
| Map 1 INPUT_RECORDS_PROCESSED | 6 |
| Map 1 OUTPUT_RECORDS | 3 (after partial GBY, assuming a single map task) |
| Reducer 2 REDUCE_INPUT_GROUPS | 3 |
| Reducer 2 OUTPUT_RECORDS | 3 |
| Reducer 3 OUTPUT_RECORDS | 3 |
Verify against your actual run.
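The expected values follow from how partial aggregation works, and a short simulation makes the arithmetic explicit. This is a sketch, not Hive's implementation: `expected_counters` is a hypothetical helper, and the round-robin split across map tasks is an assumption made purely for illustration (real splits follow the file layout).

```python
from collections import Counter

# The six rows inserted into t during Setup.
ROWS = [(1, "x"), (1, "y"), (2, "z"), (3, "p"), (3, "q"), (3, "r")]

def expected_counters(rows, map_tasks=1):
    """Simulate partial (map-side) and final (reduce-side) GROUP BY a."""
    # Assumption: rows are dealt round-robin to map tasks.
    splits = [rows[i::map_tasks] for i in range(map_tasks)]
    # Each map task emits one partial count per distinct key it saw.
    partials = [Counter(a for a, _ in split) for split in splits]
    map_output = sum(len(p) for p in partials)
    # The reducer merges partial counts per key.
    final = Counter()
    for p in partials:
        final.update(p)
    counters = {
        "Map 1 INPUT_RECORDS_PROCESSED": len(rows),
        "Map 1 OUTPUT_RECORDS": map_output,
        "Reducer 2 REDUCE_INPUT_GROUPS": len(final),
        "Reducer 2 OUTPUT_RECORDS": len(final),
        "Reducer 3 OUTPUT_RECORDS": len(final),
    }
    return counters, sorted(final.items())

counters, result = expected_counters(ROWS)
print(counters)
print(result)
```

With `map_tasks=2` the same function reports more Map 1 output records than distinct keys, because each map task aggregates only its own split. If your run shows a number above 3 for that counter, check the Map 1 task count before suspecting a bug.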
Validation Artifacts
- The labelled mermaid DAG diagram saved at ~/tez-notes/hive-h1-dag.md.
- The EXPLAIN AST, EXPLAIN LOGICAL, and EXPLAIN outputs saved.
- The hive.tez.exec.print.summary output for the actual run.
- The counter table above, with your actual numbers filled in.
- The grep results for createVertex and createEdge in DagUtils.java saved as ~/tez-notes/hive-h1-dagutils.txt.
You can now trace any Hive query through compilation to a Tez DAG. The next lab — Lab H2: Inspect the DAG — adds the production-grade techniques for capturing and inspecting that DAG at runtime.
Lab H2: Inspecting the Hive-Emitted DAG
Background
Lab H1 traced compilation to derive the DAG by reading code. In production, you can't always re-derive it; you need to capture the DAG Hive submitted to Tez. This lab covers four production-grade ways to do that, presented below as Methods 1 to 5 because the two EXPLAIN variants get a section each:
- EXPLAIN FORMATTED and EXPLAIN VECTORIZATION DETAIL from Hive.
- TezTask logging at DEBUG level.
- The Tez UI (backed by YARN ATS or Tez SimpleHistoryLoggingService).
- The tez.am.dag.dot.file.location graphviz dump.
Plus the cross-cutting skill: mapping each Hive operator in the captured DAG to its Tez Input/Processor/Output (I/P/O).
Setup
-- Hive CLI or beeline. Use the same table from H1:
CREATE TABLE IF NOT EXISTS t (a INT, b STRING) STORED AS ORC;
INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');
Verify Tez is the execution engine:
SET hive.execution.engine; -- should be 'tez'
If not:
SET hive.execution.engine=tez;
Method 1: EXPLAIN FORMATTED
EXPLAIN FORMATTED emits a JSON-ish structure with operator details. Useful for
programmatic parsing.
EXPLAIN FORMATTED
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
Snippet of the output (structure varies by Hive version):
{
"STAGE DEPENDENCIES": {
"Stage-1": {"ROOT STAGE": "TRUE"},
"Stage-0": {"DEPENDENT STAGES": "Stage-1"}
},
"STAGE PLANS": {
"Stage-1": {
"Tez": {
"DagId:": "...",
"Edges:": {
"Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
"Reducer 3": [{"parent": "Reducer 2", "type": "SIMPLE_EDGE"}]
},
"Vertices:": {
"Map 1": {
"Map Operator Tree:": [...],
"Execution mode:": "vectorized"
},
"Reducer 2": { ... },
"Reducer 3": { ... }
}
}
}
}
}
Save it:
hive -e "EXPLAIN FORMATTED SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;" \
> ~/tez-notes/hive-h2-explain-formatted.json
What it gives you, in a form you can parse rather than read:
- Edge types between vertices (SIMPLE_EDGE, BROADCAST_EDGE, CUSTOM_SIMPLE_EDGE, CUSTOM_EDGE).
- Execution mode per vertex (vectorized, llap, neither).
- The full operator tree per vertex, including row-schema annotations.
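Since the point of EXPLAIN FORMATTED is programmatic parsing, here is a minimal sketch of doing so. It assumes the key names from the snippet above (`"STAGE PLANS"`, `"Tez"`, `"Edges:"`, `"Vertices:"`, `"Execution mode:"`), which vary by Hive version. The sample document is hand-written for illustration, and the helper names `tez_edges` and `execution_modes` are hypothetical. As a defensive assumption, the code accepts an edge entry that is either a list of parents or a single parent object.

```python
import json

# Hand-written sample in the shape shown above; not real Hive output.
SAMPLE_JSON = """
{"STAGE PLANS": {"Stage-1": {"Tez": {
  "Edges:": {
    "Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
    "Reducer 3": {"parent": "Reducer 2", "type": "SIMPLE_EDGE"}
  },
  "Vertices:": {
    "Map 1": {"Execution mode:": "vectorized"},
    "Reducer 2": {},
    "Reducer 3": {}
  }
}}}}
"""

def tez_edges(plan, stage="Stage-1"):
    """Return (source, destination, edge_type) tuples for one Tez stage."""
    tez = plan["STAGE PLANS"][stage]["Tez"]
    edges = []
    for child, parents in tez.get("Edges:", {}).items():
        if isinstance(parents, dict):  # single parent emitted as an object
            parents = [parents]
        for parent in parents:
            edges.append((parent["parent"], child, parent["type"]))
    return edges

def execution_modes(plan, stage="Stage-1"):
    """Map each vertex name to its reported execution mode, if any."""
    vertices = plan["STAGE PLANS"][stage]["Tez"].get("Vertices:", {})
    return {name: body.get("Execution mode:", "(not reported)")
            for name, body in vertices.items()}

plan = json.loads(SAMPLE_JSON)
print(tez_edges(plan))
print(execution_modes(plan))
```

In practice you would replace `SAMPLE_JSON` with the contents of the saved file, after stripping any non-JSON banner lines the CLI may have written around the plan.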
Method 2: EXPLAIN VECTORIZATION DETAIL
When a query runs slower than expected on Tez, vectorization is the first thing to
check. EXPLAIN VECTORIZATION DETAIL shows per-operator whether vectorization succeeded
and, if not, why.
EXPLAIN VECTORIZATION DETAIL
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
Look for per-vertex Execution mode: vectorized and per-operator Vectorized: true.
If you see notVectorizedReason: <reason>, that's the diagnostic.
Common notVectorizedReason values:
| Reason | Cause |
|---|---|
| UDF X is not vectorized | Hive lacks a vectorized impl of a UDF you used |
| Reduce vectorization disabled | hive.vectorized.execution.reduce.enabled=false |
| MAP_JOIN with key types ... | Vectorized map-join doesn't support the key type combo |
| Column type X not supported | Vectorization doesn't handle the column type (DECIMAL precision, etc.) |
This explains a class of Hive-on-Tez perf surprises that are unrelated to Tez itself.
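For long plans it is easier to let a script collect the reasons per vertex. The sketch below assumes a simplified layout in which a vertex name sits alone on a line and `notVectorizedReason:` lines follow it; the sample text is invented for illustration, and `not_vectorized` is a hypothetical helper. Adjust the patterns to whatever your Hive version actually prints.

```python
import re

# Invented, simplified sample; real output is far more verbose.
SAMPLE = """\
Map 1
    Execution mode: vectorized
Reducer 2
    notVectorizedReason: UDF X is not vectorized
Reducer 3
    Execution mode: vectorized
"""

VERTEX = re.compile(r"^\s*((?:Map|Reducer) \d+)\s*$")
REASON = re.compile(r"notVectorizedReason:\s*(.+)$")

def not_vectorized(explain_text):
    """Group notVectorizedReason messages by the vertex they appear under."""
    reasons = {}
    current = None
    for line in explain_text.splitlines():
        vertex = VERTEX.match(line)
        if vertex:
            current = vertex.group(1)
            continue
        reason = REASON.search(line)
        if reason:
            reasons.setdefault(current, []).append(reason.group(1).strip())
    return reasons

report = not_vectorized(SAMPLE)
print(report)
```

An empty result means no operator reported a reason, which is what you want to see before moving on to Tez-side causes.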
Method 3: TezTask Logging
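Increase the log level on TezTask to capture the DAG it submitted. hive.root.logger is read when the Hive process starts, so pass it at launch rather than through SET:
hive --hiveconf hive.root.logger=DEBUG,console
Or, more targeted, from within a session:
SET hive.log.explain.output=true;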
hive.log.explain.output=true writes the EXPLAIN to the Hive log on each query —
useful in production where you can't get a CLI run but can grep the log.
grep -A100 "DAG description" /var/log/hive/hive-server2.log | head -200
For the most detail, set DEBUG specifically on the Tez integration:
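# in Hive's log4j properties file, not hive-site.xml or SET
# (Log4j 2 builds use logger.<id>.name / logger.<id>.level entries instead):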
log4j.logger.org.apache.hadoop.hive.ql.exec.tez=DEBUG
log4j.logger.org.apache.tez.dag.api=DEBUG
In DEBUG you see:
- The serialised DAGPlan size at submit time.
- Each Vertex's name, parallelism, processor descriptor class.
- Each Edge's source, destination, data-source / data-movement / scheduling type.
Method 4: Tez UI
The Tez UI runs against YARN Timeline Service (ATS) or against the file-system
SimpleHistoryLoggingService. When configured, every Tez DAG submitted by Hive (or
anything else) is captured.
Capture is enabled via tez.history.logging.service.class:
grep "tez.history.logging.service.class" ~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
Once a DAG runs, browse to:
http://<atstimeline-host>:8188/applicationhistory/
or for the standalone Tez UI:
http://<tez-ui-host>:9999/tez-ui/
Click into a DAG to see:
- Per-vertex stats (tasks, attempts, succeeded, failed, killed).
- Edges with type and statistics (BYTES_TRANSFERRED).
- A graphical DAG view.
- Per-task and per-attempt logs.
For an offline cluster, the file-system logger writes JSON files under
tez.simple.history.logging.dir. They can be loaded into the Tez UI later.
Method 5: tez.am.dag.dot.file.location
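For visual inspection, Tez can write each DAG as a Graphviz .dot file. Property names change between Tez releases, so confirm this one with a grep of TezConfiguration.java in your tree before relying on it: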
SET tez.am.dag.dot.file.location=/tmp/tez-dags;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
After the query:
ls /tmp/tez-dags/
# <app-id>_<dag-name>.dot
dot -Tpng /tmp/tez-dags/<file>.dot -o ~/tez-notes/hive-h2-dag.png
The .dot has the same nodes/edges as the Tez UI, in a portable format.
Caveat: the location is written from the AM, so on a real cluster it lands on the AM node, not the client. Configure the path to a shared filesystem or copy after the fact.
Mapping Hive Operators to Tez I/P/O
Now the cross-cutting skill: each Hive operator inside a Vertex maps to one of Tez's three runtime roles — Input, Processor, or Output. For our query:
Map 1 (vertex)
| Hive operator | Tez role | Tez class |
|---|---|---|
| TableScan | Input | MRInput (from tez-mapreduce) or HiveInputFormat adapter |
| Select | (inside Processor) | — |
| GroupBy (partial) | (inside Processor) | — |
| ReduceSink | Output | OrderedPartitionedKVOutput (from tez-runtime-library) |
The Processor itself: MapTezProcessor. Find it:
find ~/hive-src -name "MapTezProcessor.java"
Reducer 2 (vertex)
| Hive operator | Tez role | Tez class |
|---|---|---|
| (shuffle in) | Input | OrderedGroupedKVInput |
| GroupBy (final) | (inside Processor) | — |
| ReduceSink | Output | OrderedPartitionedKVOutput |
Processor: ReduceTezProcessor. Find it:
find ~/hive-src -name "ReduceTezProcessor.java"
Reducer 3 (vertex)
| Hive operator | Tez role | Tez class |
|---|---|---|
| (shuffle in) | Input | OrderedGroupedKVInput |
| Select | (inside Processor) | — |
| FileSink | Output | MROutput (from tez-mapreduce) |
Validation — A Side-by-Side Table
Build this for your captured DAG and save it:
| Vertex | Tasks | Inputs (class, source) | Processor | Outputs (class, dest) |
|---|---|---|---|---|
| Map 1 | (from EXPLAIN) | MRInput ← t ORC files | MapTezProcessor | OrderedPartitionedKVOutput → Reducer 2 |
| Reducer 2 | (from EXPLAIN) | OrderedGroupedKVInput ← Map 1 | ReduceTezProcessor | OrderedPartitionedKVOutput → Reducer 3 |
| Reducer 3 | 1 | OrderedGroupedKVInput ← Reducer 2 | ReduceTezProcessor | MROutput → query result location |
Save as ~/tez-notes/hive-h2-iop-mapping.md.
Worked Differences Across Methods
When all four capture methods agree, you have ground truth. When they disagree:
| Disagreement | Likely cause |
|---|---|
| EXPLAIN FORMATTED shows N vertices, runtime UI shows N+1 | Dynamic vertex insertion (CBO, runtime statistics) |
| tez.am.dag.dot.file.location shows fewer edges than UI | Edges added by VertexManager at runtime (see Lab 4.2) |
| UI shows BROADCAST_EDGE, EXPLAIN says SIMPLE_EDGE | Hive's EXPLAIN is sometimes loose on edge type; trust the UI |
| Parallelism in UI differs from EXPLAIN's mapred.reduce.tasks | ShuffleVertexManager reconfigured parallelism at runtime |
Each disagreement is informative — it shows you which subsystem made the dynamic decision.
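Once two captures are reduced to edge lists, spotting disagreements is a set comparison. The sketch below is a hypothetical helper (`diff_edges`) operating on `(source, destination, edge_type)` tuples; the sample data is invented to show each kind of difference and does not come from a real run.

```python
def diff_edges(planned, observed):
    """Compare two edge lists of (source, destination, edge_type) tuples."""
    planned_map = {(s, d): kind for s, d, kind in planned}
    observed_map = {(s, d): kind for s, d, kind in observed}
    shared = planned_map.keys() & observed_map.keys()
    return {
        # Edges present in one capture but not the other.
        "only_planned": sorted(planned_map.keys() - observed_map.keys()),
        "only_observed": sorted(observed_map.keys() - planned_map.keys()),
        # Same endpoints, different edge type.
        "kind_changed": sorted(
            (pair, planned_map[pair], observed_map[pair])
            for pair in shared
            if planned_map[pair] != observed_map[pair]
        ),
    }

# Invented example: EXPLAIN on the left, a runtime capture on the right.
planned = [
    ("Map 1", "Reducer 2", "SIMPLE_EDGE"),
    ("Reducer 2", "Reducer 3", "SIMPLE_EDGE"),
]
observed = [
    ("Map 1", "Reducer 2", "BROADCAST_EDGE"),
    ("Reducer 2", "Reducer 3", "SIMPLE_EDGE"),
    ("Map 4", "Reducer 2", "SIMPLE_EDGE"),
]

report = diff_edges(planned, observed)
print(report)
```

Each non-empty bucket maps onto a row of the table above: extra observed edges suggest runtime insertion, and a changed kind suggests EXPLAIN was loose about the edge type.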
Production Diagnostic Routine
When asked "why is this query slow on Tez?":
- EXPLAIN FORMATTED to see the planned DAG.
- EXPLAIN VECTORIZATION DETAIL to spot non-vectorized operators.
- Run with hive.tez.exec.print.summary=true to get the runtime summary.
- Open the Tez UI for the DAG, look at per-vertex and per-edge stats.
- Compare planned parallelism to actual (VertexManager may have changed it).
- Identify the bottleneck vertex by WALL_CLOCK_MILLIS or OUTPUT_RECORDS skew.
Most slowness is one of: vectorization failure, parallelism mismatch, data skew on a shuffle key, or AM overhead for a many-vertex DAG.
Validation Artifacts
- The EXPLAIN FORMATTED JSON saved to ~/tez-notes/hive-h2-explain-formatted.json.
- The EXPLAIN VECTORIZATION DETAIL saved to ~/tez-notes/hive-h2-vec.txt.
- A .png rendered from the .dot saved to ~/tez-notes/hive-h2-dag.png.
- The Tez UI URL for the actual DAG run, bookmarked.
- The Hive-operator-to-Tez-I/P/O table above, filled in for your captured DAG.
Once you can capture and read the DAG four ways, you are ready for failure analysis — Lab H3: Debug a Failed Query.
Lab H3: Debugging a Failed Query
Background
Production Hive-on-Tez failures usually surface as one line in the Hive console:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex failed, vertexName=Map 1, vertexId=vertex_1718000000000_4321_1_00,
diagnostics=[Task failed, taskId=task_1718000000000_4321_1_00_000003,
diagnostics=[TaskAttempt 0 failed, info=[
Container container_e123_1718000000000_4321_01_000007 failed.
Exit code: 1
Container exited with a non-zero exit code 1. Last 4096 bytes of stderr :
... ]]]
That message is the tip. The actual exception is buried 3–4 hops away. This lab is the operational walk from that tip to the root-cause stack trace, with a fabricated-but-realistic example.
The Failure Hop Sequence
flowchart TD
H[Hive console error<br/>'Vertex failed, vertexName=Map 1']
H --> A[AM log<br/>tez-dag log on the AM container]
A --> T[TaskAttempt diagnostics<br/>which task, which container]
T --> C[Container stderr / stdout log<br/>on the worker node]
C --> E[Actual exception<br/>the root cause]
E --> X[Attribute to Hive / Tez runtime / Tez AM / YARN]
Five hops. Most engineers can do hop 1 (read the console). Few can do hops 2–4 without guidance. This lab is the guidance.
Step 1: Parse the Console Message
Take the message above and extract the identifiers:
| Identifier | Value (in our example) | Use for |
|---|---|---|
| Application ID | application_1718000000000_4321 | YARN log retrieval |
| DAG ID | dag_1718000000000_4321_1 | Tez UI URL |
| Vertex ID | vertex_1718000000000_4321_1_00 | The failing vertex; here 00 ≈ Map 1 |
| Task ID | task_1718000000000_4321_1_00_000003 | Which task within the vertex |
| Attempt | 0 | First attempt failed |
| Container ID | container_e123_1718000000000_4321_01_000007 | Where the work was running |
| Exit code | 1 | Process died abnormally |
The format is consistent across all Hive-on-Tez failures. Memorise the structure.
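Because the structure is regular, the identifiers can be pulled out mechanically. The following sketch uses only regular expressions over the example message from this lab; `parse_failure` is a hypothetical helper name, and the derivation of the application and DAG IDs relies on the ID layout shown in the table above.

```python
import re

# The example console message from this lab, joined onto one logical line.
CONSOLE = (
    "Vertex failed, vertexName=Map 1, vertexId=vertex_1718000000000_4321_1_00, "
    "diagnostics=[Task failed, taskId=task_1718000000000_4321_1_00_000003, "
    "diagnostics=[TaskAttempt 0 failed, info=["
    "Container container_e123_1718000000000_4321_01_000007 failed. Exit code: 1"
)

def parse_failure(msg):
    """Extract the identifiers needed for yarn logs and the Tez UI."""
    vertex = re.search(r"vertexId=vertex_(\d+)_(\d+)_(\d+)_(\d+)", msg)
    if vertex is None:
        raise ValueError("no vertexId found in message")
    cluster_ts, app_seq, dag_seq, _vertex_seq = vertex.groups()
    task = re.search(r"taskId=(task_\d+_\d+_\d+_\d+_\d+)", msg)
    attempt = re.search(r"TaskAttempt (\d+) failed", msg)
    container = re.search(r"\b(container_(?:e\d+_)?\d+_\d+_\d+_\d+)\b", msg)
    exit_code = re.search(r"Exit code: (-?\d+)", msg)
    return {
        # The application and DAG IDs are prefixes of the vertex ID.
        "application_id": f"application_{cluster_ts}_{app_seq}",
        "dag_id": f"dag_{cluster_ts}_{app_seq}_{dag_seq}",
        "vertex_id": vertex.group(0).split("=", 1)[1],
        "task_id": task.group(1) if task else None,
        "attempt": int(attempt.group(1)) if attempt else None,
        "container_id": container.group(1) if container else None,
        "exit_code": int(exit_code.group(1)) if exit_code else None,
    }

ids = parse_failure(CONSOLE)
for key, value in ids.items():
    print(f"{key}: {value}")
```

The returned dictionary gives you the arguments for the `yarn logs` commands in the next two steps without retyping long IDs by hand.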
Step 2: Get the AM Log
The Tez AM is itself a YARN container. Its log is fetched with yarn logs:
yarn logs -applicationId application_1718000000000_4321 \
-containerId container_e123_1718000000000_4321_01_000001
The AM container is typically _01_000001: the first container of the first application
attempt. If YARN restarted the AM, the attempt field increments (_02_000001, and so on). The
log streams to stdout. Pipe to a file:
yarn logs -applicationId application_1718000000000_4321 \
-containerId container_e123_1718000000000_4321_01_000001 \
> ~/tez-notes/hive-h3-amlog.txt
The AM log contains the DAGAppMaster lifecycle, vertex state transitions, and
diagnostics aggregated from failing tasks.
Search for our failing task:
grep -n "task_1718000000000_4321_1_00_000003" ~/tez-notes/hive-h3-amlog.txt | head
You will see lines like:
2024-06-10 14:22:11,432 [INFO ] TaskImpl - task_..._000003 transitioned from SCHEDULED to RUNNING
2024-06-10 14:22:13,108 [INFO ] TaskAttemptImpl - attempt_..._000003_0 transitioned from RUNNING to FAILED
2024-06-10 14:22:13,108 [WARN ] TaskImpl - Diagnostics for ..._000003_0:
Container ..._000007 failed.
Exit code: 1
... [Last 4096 bytes of stderr] ...
The "Last 4096 bytes of stderr" is the AM's view of why the container died. It's truncated. For the full container log, hop 3.
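When the AM log is large, filtering state transitions for one task is quicker in a script than with repeated greps. The sketch below assumes log lines in the simplified shape shown above, with the elided IDs filled in hypothetically for the example; real Tez AM log lines are formatted differently across versions, so treat the regular expression as a starting point. `transitions_for` is a hypothetical helper.

```python
import re

# Sample lines in the shape shown above; the full IDs are filled in for illustration.
AM_LOG = """\
2024-06-10 14:22:11,432 [INFO ] TaskImpl - task_1718000000000_4321_1_00_000003 transitioned from SCHEDULED to RUNNING
2024-06-10 14:22:13,108 [INFO ] TaskAttemptImpl - attempt_1718000000000_4321_1_00_000003_0 transitioned from RUNNING to FAILED
2024-06-10 14:22:13,200 [INFO ] TaskImpl - task_1718000000000_4321_1_00_000004 transitioned from SCHEDULED to RUNNING
"""

TRANSITION = re.compile(
    r"^(?P<ts>\S+ \S+) .*?(?P<id>(?:task|attempt)_\w+) "
    r"transitioned from (?P<old>\w+) to (?P<new>\w+)"
)

def transitions_for(log_text, task_id):
    """Return (timestamp, id, old_state, new_state) for a task and its attempts."""
    suffix = task_id.split("_", 1)[1]  # numeric part shared by task and attempt IDs
    found = []
    for line in log_text.splitlines():
        m = TRANSITION.match(line)
        if not m:
            continue
        ident = m.group("id").split("_", 1)[1]
        # An attempt ID is the task's numeric part plus "_<attempt number>".
        if ident == suffix or ident.startswith(suffix + "_"):
            found.append((m.group("ts"), m.group("id"), m.group("old"), m.group("new")))
    return found

history = transitions_for(AM_LOG, "task_1718000000000_4321_1_00_000003")
for entry in history:
    print(entry)
```

The last entry's timestamp is the one to carry into the container log in the next step, since it narrows the search window to a second or two.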
Step 3: Get the Container Log
The container ID from the AM log (container_..._000007) is the worker. Its log:
yarn logs -applicationId application_1718000000000_4321 \
-containerId container_e123_1718000000000_4321_01_000007 \
> ~/tez-notes/hive-h3-container-007.txt
The container log contains the full stdout and stderr from the Tez task runtime
(LogicalIOProcessorRuntimeTask), including all logged exceptions and any user-code
output.
The container log structure:
LogType:stdout
...
LogType:syslog
2024-06-10 14:22:12,856 [INFO ] LogicalIOProcessorRuntimeTask - Initializing task ...
2024-06-10 14:22:12,891 [INFO ] MRInput - Initializing MRInput for ...
2024-06-10 14:22:13,007 [WARN ] MRInput - ...
2024-06-10 14:22:13,084 [ERROR] LogicalIOProcessorRuntimeTask - Failed to execute task
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing row {"a":3,"b":"q"}
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
at org.apache.hadoop.hive.ql.exec.UDFBridge.<init>(UDFBridge.java:54)
...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
...
LogType:stderr
This is the actual exception. The Caused by: chain walks from Hive's wrapping
exception down to the JVM-level cause.
Step 4: Walk the Exception
Reading the trace top-down for our example:
| Frame | Tells you |
|---|---|
| java.lang.RuntimeException | Generic wrapper; not actionable on its own |
| org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"a":3,"b":"q"} | Hive boundary; you know the input row |
| org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow:91 | Hive Tez map-side row processor |
| org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run:418 | Hive Tez map record processor |
| org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run:223 | Hive's Tez Processor adapter |
| org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run:374 | Tez runtime task |
| org.apache.tez.runtime.task.TaskRunner2Callable... | Tez runtime task launcher |
Now the Caused by: chain:
| Cause | Tells you |
|---|---|
| HiveException: Unable to load UDF X | The proximate Hive problem |
| ClassNotFoundException: com.example.udf.X | The root: classloader can't find UDF |
So the root cause is a UDF class missing from the classpath of the Tez task. That's a Hive (or user) issue, not a Tez issue. See Lab H4 for how to make that attribution rigorously.
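The walk above can be automated for traces you have saved to a file. The sketch below splits a Java stack trace into its exception headers and frames, outermost first, so that the last entry is the root cause. It is a hypothetical helper (`split_trace`) written against the abbreviated example trace from this lab, with the wrapped header joined onto one line as it would appear in a real log.

```python
import re

# Abbreviated copy of the example trace from Step 3.
TRACE = """\
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"a":3,"b":"q"}
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    ...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
    at org.apache.hadoop.hive.ql.exec.UDFBridge.<init>(UDFBridge.java:54)
    ...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    ...
"""

FRAME = re.compile(r"^\s*at\s+([\w.$<>]+)\(")
CAUSED_BY = "Caused by: "

def split_trace(trace):
    """Return [(exception_header, [frame, ...]), ...] from outermost to root cause."""
    chain = []
    for line in trace.splitlines():
        stripped = line.strip()
        frame = FRAME.match(line)
        if frame:
            if chain:
                chain[-1][1].append(frame.group(1))
        elif stripped.startswith(CAUSED_BY):
            chain.append((stripped[len(CAUSED_BY):], []))
        elif stripped and not stripped.startswith("...") and not chain:
            chain.append((stripped, []))  # the outermost exception header
    return chain

chain = split_trace(TRACE)
root_header, root_frames = chain[-1]
print(f"{len(chain)} exceptions in chain; root cause: {root_header}")
```

The first frame of the first entry is the starting point for attribution, and the header of the last entry is what usually belongs in the JIRA title.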
Step 5: Attribute the Failure
Apply the decision rule from H4 (preview):
The package of the top frame whose code you can change indicates the project.
Top frames in order:
- java.lang.RuntimeException: JVM, not actionable.
- org.apache.hadoop.hive.ql.metadata.HiveException: Hive, but generic wrap; keep walking.
- org.apache.hadoop.hive.ql.exec.tez.MapRecordSource:91: Hive code, specific. Stop here for the top frame: this is Hive's MapRecordSource.
Then the Caused by: chain:
- HiveException: Unable to load UDF X (Hive).
- ClassNotFoundException: com.example.udf.X (root cause).
Attribution: Hive (the proximate code is MapRecordSource) and user
(the missing class is the user's UDF jar). Tez is not at fault — it correctly ran the
task, the Hive code, and surfaced the exception. Tez's job is to provide a stack trace,
which it did.
The fix is to make the UDF jar available to the Tez tasks, for example per session:
ADD JAR /path/to/udf.jar;
or in hive-site.xml:
<property>
<name>hive.aux.jars.path</name>
<value>file:///opt/hive/auxlib/udf.jar</value>
</property>
Tooling Shortcuts
Get all container logs at once
yarn logs -applicationId application_1718000000000_4321 \
> ~/tez-notes/hive-h3-all.txt
For a large DAG with many containers, this is large (often 100s of MB). Use the per-container form when you know which one to look at.
Search across container logs
grep -B2 -A20 "java.lang.\|Caused by" ~/tez-notes/hive-h3-all.txt | head -100
Find the failing task fast
grep "FAILED\|state changed.*FAILED\|attempt.*FAILED" ~/tez-notes/hive-h3-amlog.txt
Tez UI shortcut
If your cluster has the Tez UI, the per-task log links are one click. The UI URL pattern:
http://<tez-ui-host>:9999/tez-ui/#/tez-dag/dag_1718000000000_4321_1
From that page, navigate to Map 1 → task 000003 → attempt 0 → "logs". The UI fetches
the container log automatically.
A Second Worked Example — Tez Runtime Failure
Console:
Vertex failed, vertexName=Reducer 2, ...
Container ... failed. Exit code: 1
Container log top of stack:
java.io.IOException: Failed on local exception: java.io.IOException: Failed to fetch shuffle data
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
...
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
...
Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.
Attribution: Tez runtime library. Specifically the shuffle fetcher. The root cause —
ConnectException: Connection refused — points to the upstream task's container being
gone (killed, evicted, or networked away). Investigation continues into the upstream
container's log.
This is the canonical Tez shuffle failure shape. The reproduction is in H5.
A Third Worked Example — AM Failure
Console:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Application application_1718000000000_4321 failed with state FAILED.
Diagnostics: Application application_1718000000000_4321 failed 2 times due to AM Container ... exited with exitCode: -103 ...
The AM itself died. Container log of the AM:
[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
...
Top frame: org.apache.tez.dag.app.dag.impl.VertexImpl:412. Attribution: Tez AM.
Root cause: AM heap too small for the DAG. Raise tez.am.resource.memory.mb, and check
tez.am.launch.cmd-opts for an explicit -Xmx that would cap the heap regardless. Fix is
configuration; if reproducible at the default, file a JIRA against Tez requesting either
a smarter default or a sizing recommendation.
Validation Artifacts
For our first example, save:
- The console error verbatim (~/tez-notes/hive-h3-console.txt).
- The parsed-identifiers table (Application ID, DAG ID, Vertex ID, Task ID, Container ID).
- The AM log fragment showing the task transition to FAILED.
- The container log fragment showing the full exception with Caused by: chain.
- The attribution paragraph: which project owns the bug, and why.
- The fix you propose.
Once you can produce that artifact for an arbitrary Hive-on-Tez failure, you can debug one. The next lab — Lab H4: Bug Attribution — makes the attribution rigorous with a decision tree and four more worked examples.
Lab H4: Bug Attribution
Background
A failing Hive-on-Tez query may be a Hive bug, a Tez runtime bug, a Tez AM bug, a YARN bug, a Hadoop common bug, a JVM bug, a user bug, or an infrastructure bug. Filing it on the wrong project wastes the reporter's time and the maintainer's. This lab gives you a mechanical decision tree to attribute correctly from a stack trace, plus four worked examples.
The Decision Tree
Given a stack trace (after Lab H3 has surfaced it):
flowchart TD
S[Start: have stack trace]
S --> T1[Find top frame whose package you can change]
T1 --> P{Package prefix?}
P -->|org.apache.hadoop.hive.*| H[Hive bug]
P -->|org.apache.tez.runtime.library.*| TR[Tez runtime library<br/>tez-runtime-library]
P -->|org.apache.tez.runtime.*<br/>not .library| TRI[Tez runtime internals<br/>tez-runtime-internals]
P -->|org.apache.tez.dag.app.*| TA[Tez AM<br/>tez-dag]
P -->|org.apache.tez.dag.api.*| TC[Tez client / API<br/>tez-api]
P -->|org.apache.tez.client.*| TC
P -->|org.apache.hadoop.yarn.*| Y[YARN bug]
P -->|org.apache.hadoop.hdfs.*| HD[HDFS bug]
P -->|org.apache.hadoop.mapred.*| MR[Hadoop MR compat<br/>tez-mapreduce]
P -->|user package| U[User code bug]
P -->|java.*, sun.*| J[Walk down to next frame]
J --> T1
H --> CD[Then check Caused by chain]
TR --> CD
TRI --> CD
TA --> CD
Y --> CD
HD --> CD
CD --> R[Root cause may shift attribution]
R --> END[File on the project that owns the actionable code]
The rule in one sentence: find the top frame in actionable code, name its package prefix, and read off the project.
Package → Project → Module Table
| Package prefix | Project | Module / area | Where to file |
|---|---|---|---|
| org.apache.hadoop.hive.ql.exec.tez.* | Hive | Tez integration | https://issues.apache.org/jira/projects/HIVE |
| org.apache.hadoop.hive.ql.exec.* (not .tez) | Hive | Operators | HIVE JIRA |
| org.apache.hadoop.hive.ql.metadata.* | Hive | Metadata / UDF | HIVE JIRA |
| org.apache.hadoop.hive.serde2.* | Hive | Serialization | HIVE JIRA |
| org.apache.hadoop.hive.* (any other) | Hive | Core | HIVE JIRA |
| org.apache.tez.runtime.library.* | Tez | tez-runtime-library | TEZ JIRA |
| org.apache.tez.runtime.task.* | Tez | tez-runtime-internals | TEZ JIRA |
| org.apache.tez.runtime.* (not .library, not .task) | Tez | tez-runtime-internals | TEZ JIRA |
| org.apache.tez.dag.app.dag.impl.* | Tez | tez-dag (state machines) | TEZ JIRA |
| org.apache.tez.dag.app.rm.* | Tez | tez-dag (RM client / container scheduling) | TEZ JIRA |
| org.apache.tez.dag.app.launcher.* | Tez | tez-dag (container launcher) | TEZ JIRA |
| org.apache.tez.dag.app.* (other) | Tez | tez-dag (AM core) | TEZ JIRA |
| org.apache.tez.dag.api.* | Tez | tez-api (DAG / Vertex / Edge) | TEZ JIRA |
| org.apache.tez.client.* | Tez | tez-api (TezClient) | TEZ JIRA |
| org.apache.tez.mapreduce.* | Tez | tez-mapreduce (MRInput/MROutput) | TEZ JIRA |
| org.apache.hadoop.yarn.client.* | YARN | Client | YARN JIRA |
| org.apache.hadoop.yarn.server.resourcemanager.* | YARN | RM | YARN JIRA |
| org.apache.hadoop.yarn.server.nodemanager.* | YARN | NM | YARN JIRA |
| org.apache.hadoop.hdfs.* | HDFS | Client / DN / NN | HDFS JIRA |
| org.apache.hadoop.mapred.* | MR compat | tez-mapreduce when the class ships in Tez; otherwise Hadoop MapReduce | TEZ JIRA (or MAPREDUCE JIRA) |
| org.apache.hadoop.io.* / .fs.* / .conf.* | Hadoop common | hadoop-common | HADOOP JIRA |
| com.<user>.* / org.<user>.* (not apache) | User code | n/a | Fix locally |
| java.*, sun.*, jdk.* | JVM | walk down | (not the cause; keep looking) |
Verify the modules against your tree:
find ~/tez-src -maxdepth 2 -name pom.xml | sort
find ~/hive-src -maxdepth 3 -name pom.xml | head
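The decision tree and table reduce to a longest-prefix lookup over the frames of a stack trace. The sketch below is one way to encode that, under assumptions: `RULES` is an abbreviated copy of the table, `attribute` is a hypothetical helper, and treating `javax.` as a JVM prefix is an addition not listed in the table. It covers only the first pass; checking the `Caused by` chain for a shift in attribution remains a manual step.

```python
# Abbreviated from the table above: (package prefix, project or module).
RULES = [
    ("org.apache.hadoop.hive.", "Hive"),
    ("org.apache.tez.runtime.library.", "Tez (tez-runtime-library)"),
    ("org.apache.tez.runtime.", "Tez (tez-runtime-internals)"),
    ("org.apache.tez.dag.app.", "Tez (tez-dag)"),
    ("org.apache.tez.dag.api.", "Tez (tez-api)"),
    ("org.apache.tez.client.", "Tez (tez-api)"),
    ("org.apache.tez.mapreduce.", "Tez (tez-mapreduce)"),
    ("org.apache.hadoop.yarn.", "YARN"),
    ("org.apache.hadoop.hdfs.", "HDFS"),
    ("org.apache.hadoop.io.", "Hadoop common"),
    ("org.apache.hadoop.fs.", "Hadoop common"),
    ("org.apache.hadoop.conf.", "Hadoop common"),
]

# Frames to skip while walking down; "javax." is an assumption added here.
JVM_PREFIXES = ("java.", "javax.", "sun.", "jdk.")

def attribute(frames):
    """Return (top actionable frame, owner) for frames listed top of stack first."""
    for frame in frames:
        if frame.startswith(JVM_PREFIXES):
            continue  # not actionable; walk down to the next frame
        matches = [(prefix, owner) for prefix, owner in RULES if frame.startswith(prefix)]
        if matches:
            # The longest matching prefix is the most specific rule.
            return frame, max(matches, key=lambda rule: len(rule[0]))[1]
        return frame, "User code (or a package not in the table)"
    return None, "No actionable frame"

shuffle_frames = [
    "java.net.PlainSocketImpl.socketConnect",
    "org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed",
]
print(attribute(shuffle_frames))
```

Longest-prefix matching is what keeps `org.apache.tez.runtime.library.*` from being swallowed by the broader `org.apache.tez.runtime.*` rule, mirroring the "not .library" qualifier in the table.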
Example 1: UDF Not Found (Hive bug → User bug)
Trace (from Lab H3):
java.lang.RuntimeException: ...
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(...)
...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
Apply the tree:
- Top actionable frame: org.apache.hadoop.hive.ql.exec.tez.MapRecordSource:91.
- Package: org.apache.hadoop.hive.ql.exec.tez.*.
- Project: Hive (the Tez integration code).
- Check Caused by: root is ClassNotFoundException: com.example.udf.X, a user class.
- Adjust: this is user error (their UDF jar isn't on the classpath), surfaced by Hive's UDF registry, surfaced by Hive's Tez integration. No bug to file.
Fix: ADD JAR or hive.aux.jars.path.
If the same trace came with Caused by: ClassNotFoundException: org.apache.hadoop.hive.ql.exec.UDFBridge,
then the root is a Hive class missing from the Hive distribution — file on HIVE.
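The two mechanical steps of the tree, "find the top actionable frame" and "follow Caused by to the root", can be sketched in plain Java. This is a minimal illustration, not a tool shipped with either project; the class name TraceWalk and the skip list (including the protobuf prefix) are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class TraceWalk {
    // Prefixes treated as non-actionable. The protobuf entry is an assumption.
    private static final List<String> SKIP =
        Arrays.asList("java.", "javax.", "sun.", "jdk.", "com.google.protobuf.");

    /** Follows getCause() to the innermost exception (bounded, in case of a cause cycle). */
    static Throwable rootCause(Throwable t) {
        Throwable cur = t;
        for (int i = 0; i < 100 && cur.getCause() != null && cur.getCause() != cur; i++) {
            cur = cur.getCause();
        }
        return cur;
    }

    /** First frame, top down, whose class is not under a skipped prefix. */
    static Optional<StackTraceElement> topActionableFrame(StackTraceElement[] frames) {
        return Arrays.stream(frames)
            .filter(f -> SKIP.stream().noneMatch(p -> f.getClassName().startsWith(p)))
            .findFirst();
    }
}
```

The frame tells you which project surfaced the failure; the root cause tells you whether that project is actually at fault, which is the "Adjust" step.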
Example 2: Shuffle Fetch Failure (Tez runtime bug)
Trace:
java.io.IOException: Failed to fetch shuffle data
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
Apply the tree:
- Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.
- Package: org.apache.tez.runtime.library.*.
- Project: Tez, module tez-runtime-library.
- Check Caused by: ConnectException, a network error.
- Adjust: the root is a network/infra failure. The shuffle code surfaced it correctly; not a bug in itself. But:
  - If this happens once with sporadic node failures: infrastructure issue, no bug.
  - If this happens frequently and the fetcher isn't retrying enough times before giving up: Tez bug. File on TEZ asking to bump or expose tez.runtime.shuffle.connect.timeout / retry counts.
  - If the upstream container died because of an AM scheduling bug: Tez AM bug, file on TEZ with the AM log evidence.
Verify the retry config:
grep "shuffle.connect\|shuffle.fetch.retry\|shuffle.read.timeout" \
~/tez-src/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/api/TezRuntimeConfiguration.java
Example 3: AM OOM During DAG Submit (Tez AM bug)
AM container log:
[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3210)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(...)
...
at com.google.protobuf.ByteString.copyFrom(ByteString.java:194)
at org.apache.tez.dag.api.records.DAGProtos$DAGPlan.toBuilder(DAGProtos.java:...)
at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
at org.apache.tez.dag.app.DAGAppMaster.createDAG(DAGAppMaster.java:...)
Apply the tree:
- Top actionable frame: skip JVM/protobuf frames, including the generated DAGProtos class. First hand-written Tez frame: org.apache.tez.dag.app.dag.impl.VertexImpl:412.
- Package: org.apache.tez.dag.app.*.
- Project: Tez, module tez-dag (AM).
- Check Caused by: none, just the OOM.
Attribution: Tez AM. The proximate cause is constructing VertexImpl from a large
DAGPlan. Three possible JIRA shapes:
- "Tez AM OOMs on submission of N-vertex DAG at default tez.am.resource.memory.mb": file requesting smarter sizing or doc.
- "VertexImpl construction allocates O(N²) memory in inputs": file with a profile and a fix suggestion.
- "DAGPlan toBuilder() materialises a full copy": file as a perf bug.
The correct shape depends on profile evidence. Without profiling, file the sizing/doc variant first; the deeper variants follow.
Example 4: NodeManager Lost (YARN bug)
AM log:
2024-06-10 ... [WARN ] AMContainerImpl - Container container_..._000007 transitioned from RUNNING to STOPPED. exitStatus -100
2024-06-10 ... [WARN ] DAGAppMaster - Container ..._000007 completed unexpectedly; will be rescheduled
2024-06-10 ... [WARN ] RMContainerRequestor - Lost node nm-12.example.com
2024-06-10 ... [INFO ] DAGAppMaster - Marking task attempt as failed due to lost node: attempt_..._000003_0
Apply the tree:
- Top frame in trace: org.apache.tez.dag.app.rm.AMContainerImpl. But this is the AM's correct reaction to a node loss, not a bug.
- The substantive cause is "NodeManager nm-12 lost". Diagnose by checking the NodeManager log on that host:
  yarn node -list -all | grep nm-12
  tail -200 /var/log/hadoop-yarn/yarn-nodemanager.log   # on nm-12
- Common NM-side root causes:
  - NM heap OOM (NM stops responding to RM heartbeats) → YARN bug or NM tuning.
  - Network partition → infra.
  - Disk full on NM local-dirs → ops issue.
Attribution:
- If NM died from OOM, file on the YARN JIRA project.
- If Tez AM didn't reschedule the lost task correctly, file on TEZ. But the AM log here shows correct reaction, so that's not in play.
- If Tez's TaskScheduler retried the task on the same lost node repeatedly, file on TEZ (a scheduler awareness issue).
Cross-Project Patterns
Some failure modes have a well-known cross-project shape. Memorise the shapes:
| Shape | Likely project | Quick diagnostic |
|---|---|---|
| ClassCastException inside MapRecordSource / ReduceRecordSource | Hive (schema mismatch in vectorization) | Check EXPLAIN VECTORIZATION DETAIL |
| IOException: Stream is closed in shuffle reader | Tez runtime library | Check upstream container alive |
| TaskCommitDeniedException | Tez AM speculative-exec coordination | Check tez.am.speculation.enabled |
| NoSuchMethodError on a Tez or Hive class | Version skew | Check classpath; check mvn dependency:tree |
| IllegalArgumentException: Wrong FS | Hadoop FS | Check fs.defaultFS, core-site.xml |
| Container killed by OOM killer (exit code 137) | YARN or workload | Check container memory request vs JVM heap |
| org.apache.hadoop.security.AccessControlException | HDFS or Hive Ranger | Permissions issue, not a code bug |
What to Do With the Attribution
Having attributed correctly:
| Attribution | Action |
|---|---|
| Hive | File on https://issues.apache.org/jira/projects/HIVE with Tez in summary if relevant |
| Tez tez-runtime-library | File on https://issues.apache.org/jira/projects/TEZ, component Runtime Library |
| Tez tez-runtime-internals | File on TEZ, component Runtime Internals |
| Tez tez-dag (AM) | File on TEZ, component AM |
| Tez tez-api | File on TEZ, component Client / API |
| Tez tez-mapreduce | File on TEZ, component MR Compat |
| YARN | File on https://issues.apache.org/jira/projects/YARN |
| HDFS | File on https://issues.apache.org/jira/projects/HDFS |
| User | Fix locally, no JIRA |
| Infrastructure | Operations issue, no JIRA |
| Multiple (Hive needs change AND Tez needs change) | File on both, cross-reference |
In all cases, the JIRA description follows the skeleton in Design via JIRA.
Validation Artifacts
After this lab:
- The decision tree printed and pinned at your desk (or in ~/tez-notes/).
- The Package → Project → Module table memorised or saved as ~/tez-notes/hive-h4-attribution.md.
- Four attributions, one for each worked example, written out in your own words.
- The reflex: never file a JIRA on a project whose code does not appear in the top of the actionable stack.
The next lab — Lab H5: Reproducing Bugs — covers how to turn an attributed bug into a minimum reproducer suitable to attach to a JIRA.
Lab H5: Reproducing Bugs
Background
A JIRA without a reproducer drifts. A JIRA with a clean reproducer gets attention.
"Clean" means: minimal schema, minimal data, minimal query, runnable in under a minute
on a local MiniTezCluster or MiniHS2. This lab is the procedure.
The Hive integration test framework (hive-itests) is the source of every pattern you
need. Reading its existing tests is the cheapest education.
The Three Reduction Axes
To minimise a reproducer, reduce along three independent axes:
| Axis | Reduce | Stop reducing when |
|---|---|---|
| Schema | Drop unused columns; simplify types | Removing a column makes the bug disappear |
| Data | Reduce row count; generate synthetic data | Reducing rows makes the bug disappear |
| Query | Drop joins, predicates, projections | Dropping a clause makes the bug disappear |
The goal is the smallest schema × smallest data × smallest query that still reproduces.
Setup — Local MiniHS2 + MiniTezCluster
MiniHS2 is a single-JVM HiveServer2 that runs against a MiniTezCluster (a single-JVM
YARN). Together they let you reproduce a Hive-on-Tez bug in seconds without an external
cluster.
Existing reference in your tree:
find ~/hive-src/itests -name "MiniHS2.java" | head
find ~/hive-src/itests -name "TestMiniLlapVectorArrowWithLlapIODisabled.java" | head
find ~/tez-src/tez-tests -name "MiniTezCluster.java"
A reproducer test class skeleton (Hive 3/4 style):
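import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.jdbc.miniHS2.MiniHS2;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestMyBugRepro {
  private MiniHS2 miniHS2;

  @Before
  public void setUp() throws Exception {
    HiveConf conf = new HiveConf();
    conf.set("hive.execution.engine", "tez");
    conf.set("tez.lib.uris",
        "file://" + System.getProperty("tez.lib.dir"));
    miniHS2 = new MiniHS2.Builder()
        .withConf(conf)
        // Intended to bring up MiniTezCluster. Check MiniHS2.Builder in your tree:
        // in some Hive versions withMiniMR() selects the MR mini cluster instead.
        .withMiniMR()
        .build();
    miniHS2.start(new HashMap<>());
  }

  @After
  public void tearDown() throws Exception {
    miniHS2.stop();
  }

  @Test
  public void reproBug() throws Exception {
    // try-with-resources closes the statement and connection even if the query throws
    try (Connection c = DriverManager.getConnection(miniHS2.getJdbcURL());
         Statement s = c.createStatement()) {
      s.execute("CREATE TABLE t (...) STORED AS ORC");
      s.execute("INSERT INTO t VALUES (...)");
      ResultSet rs = s.executeQuery("SELECT ...");
      // assert behaviour or expect exception
    }
  }
}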
Run with mvn test -pl itests -Dtest=TestMyBugRepro.
Reducing the Schema
Starting from a real production table with 200 columns, reduce iteratively:
- Identify referenced columns. Read the failing query; note which columns the SELECT, WHERE, GROUP BY, JOIN, ORDER BY actually reference.
- Drop everything else. Make a new test schema with only the referenced columns.
- Re-run. Does the bug still reproduce? If yes, you've reduced. If no, you've found a column that's load-bearing; add it back and look for why.
- Simplify remaining types. Replace DECIMAL(38,10) with DECIMAL(10,2) if the bug doesn't depend on precision. Replace STRUCT<...> with STRING if you can. Replace partitioned tables with non-partitioned ones unless the partition is load-bearing.
- Stop when reduction breaks the repro.
For our running example query:
SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;
Only column a is referenced. Schema reduces to:
CREATE TABLE t (a INT) STORED AS ORC;
If the bug needs a second column for some reason (e.g. ORC stripe layout), keep it.
Reducing the Data — JoinDataGen Pattern
Hive's itests includes data generators for systematic minimisation. The most common
pattern is JoinDataGen for generating join inputs at controlled cardinalities:
find ~/hive-src -name "JoinDataGen*.java" -o -name "*DataGen*.java" | head
The pattern (adapt for your bug):
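import java.sql.SQLException;
import java.sql.Statement;
import java.util.Random;

public final class TestDataGen {
  /** Inserts rowCount single-column rows with keys drawn uniformly from [0, distinctKeys). */
  public static void writeIntRows(String tableName, int rowCount, int distinctKeys,
      Statement s) throws SQLException {
    Random r = new Random(42); // fixed seed: the same data on every run
    StringBuilder values = new StringBuilder();
    for (int i = 0; i < rowCount; i++) {
      if (i > 0) values.append(",");
      values.append("(").append(r.nextInt(distinctKeys)).append(")");
    }
    // One multi-row INSERT; for very large rowCount, split into several statements.
    s.execute("INSERT INTO " + tableName + " VALUES " + values);
  }
}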
Reduce data:
- Start with original data size. 1 billion rows? Reduce to 1 million.
- Halve until bug disappears. Binary-search the row count: 1M → 500K → 250K → ...
- At the smallest row count that still repros, vary distinct-key count. Bug may need 5 distinct keys (skew) or 500K (cardinality). Find which.
- Vary value distribution. If the bug needs a skewed distribution (one key gets 90% of rows), generate that explicitly.
- Document the minimum. "Bug reproduces at >= 1024 rows with >= 8 distinct keys."
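The skewed-distribution step above can be made explicit with a small generator. This is a sketch under stated assumptions: the class SkewedKeys, the choice of key 0 as the hot key, and the hotFraction parameter are all hypothetical and would be adapted to the bug at hand.

```java
import java.util.Random;

final class SkewedKeys {
    /**
     * Returns rowCount keys in [0, distinctKeys). With probability hotFraction a row gets
     * key 0 (the hot key); otherwise its key is drawn uniformly from the remaining keys.
     */
    static int[] generate(int rowCount, int distinctKeys, double hotFraction, long seed) {
        if (distinctKeys < 2) {
            throw new IllegalArgumentException("need at least 2 distinct keys");
        }
        Random r = new Random(seed); // fixed seed keeps the reproducer deterministic
        int[] keys = new int[rowCount];
        for (int i = 0; i < rowCount; i++) {
            keys[i] = r.nextDouble() < hotFraction ? 0 : 1 + r.nextInt(distinctKeys - 1);
        }
        return keys;
    }

    /** Renders the keys as a VALUES list: (k1),(k2),... */
    static String toValuesClause(int[] keys) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (i > 0) sb.append(',');
            sb.append('(').append(keys[i]).append(')');
        }
        return sb.toString();
    }
}
```

With hotFraction = 0.9, roughly nine rows in ten land on key 0, which is the "one key gets 90% of rows" shape. Record the seed in the reproducer README so the reporter and the maintainer generate identical data.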
For our running example, with no actual bug, the minimum is whatever you need to exercise the GROUP BY + ORDER BY path — single-digit rows are enough.
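The halving step is an ordinary binary search over row count. A minimal sketch follows, assuming the bug is monotonic in row count (absent below some threshold, present at and above it); the class RowCountBisect and the reproduces predicate are hypothetical, and in practice the predicate would load that many rows and run the query.

```java
import java.util.function.IntPredicate;

final class RowCountBisect {
    /**
     * Smallest n in [lo, hi] for which reproduces.test(n) is true, assuming the bug is
     * monotonic in row count and that it reproduces at hi.
     */
    static int minimalRowCount(int lo, int hi, IntPredicate reproduces) {
        if (!reproduces.test(hi)) {
            throw new IllegalArgumentException("bug must reproduce at hi");
        }
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (reproduces.test(mid)) {
                hi = mid;      // still reproduces: the threshold is at mid or below
            } else {
                lo = mid + 1;  // gone: the threshold is above mid
            }
        }
        return lo;
    }
}
```

A search from 1 to 1,000,000 rows needs about 20 runs. If the bug is not monotonic (for example, it appears only at one exact row count), bisection can miss it, and a scan around the suspected boundary is the safer tool.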
Reducing the Query
Remove clauses one at a time and re-test:
- Remove ORDER BY → does the bug still happen? (Probably not, if the bug is in the total-order reducer.)
- Remove the aggregate → does the bug still happen?
- Remove WHERE predicates one at a time.
- Remove JOINs; if the join is the cause, simplify to a 2-table join, then to a tiny-on-tiny join.
- Replace MAP JOIN with SHUFFLE JOIN by disabling map joins (hive.auto.convert.join=false) and re-test.
A reproducer query of 3 lines beats a reproducer query of 30 lines, even for the same bug.
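Removing clauses one at a time and re-testing is a greedy reduction loop. The sketch below models the query as a list of optional clauses and takes a predicate that answers "does the bug still reproduce?"; both the ClauseReducer class and that modelling are illustrative assumptions, since a real query needs its clauses reassembled into valid SQL before each run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class ClauseReducer {
    /**
     * Greedy one-at-a-time reduction: try dropping each clause and keep the drop only if
     * the bug still reproduces. Repeats until a full pass removes nothing.
     */
    static List<String> reduce(List<String> clauses, Predicate<List<String>> stillReproduces) {
        List<String> current = new ArrayList<>(clauses);
        boolean removed = true;
        while (removed) {
            removed = false;
            for (int i = 0; i < current.size(); i++) {
                List<String> candidate = new ArrayList<>(current);
                candidate.remove(i); // remove by index
                if (stillReproduces.test(candidate)) {
                    current = candidate;
                    removed = true;
                    i--; // the next clause has shifted into slot i
                }
            }
        }
        return current;
    }
}
```

The result is minimal only in the sense that no single remaining clause can be dropped; that is usually enough for a JIRA reproducer.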
Capturing the Artifacts
A complete bug-report artifact set:
| Artifact | Why |
|---|---|
| CREATE TABLE DDL for every table involved | Reproducer setup |
| Data generation code or inline INSERT values | Reproducer setup |
| The minimal query | The test |
| SET hive.* lines that were necessary | Configuration |
| The expected behavior (correct result) | Oracle |
| The actual behavior (incorrect result or exception) | Symptom |
| EXPLAIN FORMATTED output | Plan |
| AM log fragment showing failure | Diagnostic |
| Container log fragment showing exception | Diagnostic |
| Tez and Hive version | Version |
Bundle into a single artifact:
cd ~/tez-notes
mkdir hive-h5-repro
cp ddl.sql hive-h5-repro/
cp gen.sql hive-h5-repro/
cp query.sql hive-h5-repro/
cp explain.txt hive-h5-repro/
cp amlog-fragment.txt hive-h5-repro/
cp container-log-fragment.txt hive-h5-repro/
cat > hive-h5-repro/README.md <<EOF
# Repro for HIVE-XXXXX / TEZ-XXXX
Tez version: 0.10.X
Hive version: 4.0.X
Hadoop version: 3.3.X
JDK: 11
Setup: hive -f ddl.sql && hive -f gen.sql
Repro: hive -f query.sql
Expected: rows = N, max value = M.
Actual: exception in container log (see container-log-fragment.txt).
EOF
tar czf hive-h5-repro.tar.gz hive-h5-repro/
Attach hive-h5-repro.tar.gz to the JIRA. A reproducer in this shape gets opened by
maintainers; one without these elements doesn't.
When MiniTezCluster Doesn't Reproduce
A bug that reproduces on a production cluster but not on MiniTezCluster is the worst
shape. Common causes:
| Cause | Diagnostic |
|---|---|
| Multi-node shuffle behavior; mini cluster is single-node | Force multiple containers per node; can't fully simulate |
| Container OOM at production memory; mini cluster doesn't have memory pressure | Configure mini cluster with tight memory limits |
| Concurrent DAG submissions; mini cluster has none | Run multiple parallel tests |
| ORC stripe layout; needs production-size files | Generate larger ORC files |
| Production data distribution; mini cluster has uniform | Use realistic random seed and distribution |
| Speculative execution; not enabled in mini by default | Enable with tez.am.speculation.enabled=true |
If none of these reduce, the bug may be in cluster-only code paths (RM scheduling edge cases). Document that the reproducer requires N nodes and attach what evidence you have.
A Worked Reproducer — Hypothetical Bug
Suppose a bug: COUNT(*) returns 0 when input table has exactly 1024 rows and
vectorization is enabled. (Imaginary; for the pattern.)
Schema
CREATE TABLE t (a INT) STORED AS ORC;
Data
INSERT INTO t SELECT col1 FROM dual WHERE 1=0; -- placeholder
-- repeat to produce exactly 1024 rows:
INSERT INTO t SELECT pos AS a FROM (
SELECT explode(sequence(1, 1024)) AS pos
) s;
(Hive's explode(sequence(...)) may or may not be available depending on version; use
the equivalent for your version.)
Query
SET hive.vectorized.execution.enabled=true;
SELECT COUNT(*) FROM t;
Expected vs Actual
Expected: 1024
Actual: 0
EXPLAIN
EXPLAIN VECTORIZATION DETAIL SELECT COUNT(*) FROM t;
Save the output. Look for Execution mode: vectorized and any odd Vectorized: false
on a key operator.
Trial Reductions
- 1023 rows: bug? No.
- 1024 rows: bug.
- 2048 rows: bug? Test.
- Vectorization off: bug? Reset.
Document the conditions:
Bug reproduces at:
- row count exactly 1024
- hive.vectorized.execution.enabled=true
Bug does NOT reproduce at:
- row count != 1024
- hive.vectorized.execution.enabled=false
That's a sharp, actionable bug. Attribution (by Lab H4): likely Hive's vectorized aggregation code path. File on HIVE.
Production-to-Test Translation
When a real production bug is reported to you with no reproducer:
- Get the query. From the user, from hive.log (hive.server2.logging.operation.enabled), or from HiveServer2 audit logs.
- Get the schema. Run SHOW CREATE TABLE on each involved table; copy.
- Get a sample of data. A few hundred to a few thousand rows. Anonymise PII if needed.
- Get the version triplet. Tez / Hive / Hadoop.
- Reproduce. Stand up MiniHS2, load the schema, load the sample data, run the query.
- If it reproduces, reduce. Apply the three axes.
- If it doesn't reproduce, expand. More data, more nodes, more concurrency.
A one-day cycle for a complex production bug is fast. A one-week cycle is realistic for something subtle.
Validation Artifacts
After this lab:
- A complete reproducer artifact (a hive-h5-repro.tar.gz-style bundle) for a real or imagined Hive-on-Tez bug.
- A TestMyBugRepro.java skeleton you can adapt.
- The three-axes reduction discipline applied at least once.
- The reflex to capture the version triplet (Tez/Hive/Hadoop) on every reproducer.
The next lab, Lab H6: Writing a Diagnostic Patch, covers what to do when you can't reproduce locally and need to ask the production reporter to capture more data.
Lab H6: Writing a Diagnostic Patch
Background
You have a Hive-on-Tez bug report from production. You can't reproduce locally (Lab H5 didn't work). You need more data. The way to get it is a diagnostic patch — a small change that adds logging, counters, or a debug toggle without changing behavior, attached to the JIRA, that the reporter can apply and re-run.
A well-shaped diagnostic patch:
- Adds boundary-INFO logging at the suspected fault site.
- Adds a TezCounter so the data is captured in the standard counter mechanism.
- Adds a debug-only TezConfiguration switch so the cost is opt-in.
This lab walks the three patterns.
Pattern 1: Boundary INFO Logging
A "boundary" is the point at which control flows from one subsystem to another:
| Boundary | Example |
|---|---|
| Hive → Tez submit | TezTask.execute → TezSession.submitDAG |
| Tez AM → Container | DAGAppMaster.scheduleTaskAttempt → ContainerLauncherImpl.launch |
| Container → Task | LogicalIOProcessorRuntimeTask.run → Processor.run |
| Task → Input shuffle | OrderedGroupedKVInput.start |
| Task → Output shuffle | OrderedPartitionedKVOutput.start |
INFO at a boundary is cheap, lasts the lifetime of a task, and gives the next debugger a structured trail.
Example patch (illustrative diff)
Suppose the bug is "DAG submission occasionally takes >10s on large DAGs." A diagnostic
patch in TezTask:
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
index abcdef1..2345678 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
@@ -201,7 +201,12 @@ public class TezTask extends Task<TezWork> {
private DAGClient submit(DAG dag, TezSessionState session) throws Exception {
+ long submitStartNs = System.nanoTime();
+ int dagPlanBytes = dag.createDag(conf, null, null, null, false).getSerializedSize();
+ LOG.info("HIVE-XXXX diag: about to submitDAG, dagName={}, vertices={}, planBytes={}",
+ dag.getName(), dag.getVertices().size(), dagPlanBytes);
DAGClient client = session.getSession().submitDAG(dag);
+ LOG.info("HIVE-XXXX diag: submitDAG returned in {} ms",
+ TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - submitStartNs));
return client;
}
Rules for the patch:
- Tag every log line with the JIRA ID. The reporter greps for HIVE-XXXX diag: to find your data.
- INFO, not DEBUG. The reporter must not have to change log levels.
- Structured key=value or {} placeholders. Easy to parse.
- Cheap. Measure or log only what's needed; no full-DAG dumps unless explicitly asked.
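The payoff of the tag and key=value rules is that the returned logs can be parsed mechanically. A minimal sketch of such a parser follows; the class DiagLineParser and its regex are assumptions for illustration, and the regex deliberately stops a value at the first comma or whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DiagLineParser {
    // key=value, where the value runs up to the next comma or whitespace
    private static final Pattern KV = Pattern.compile("(\\w+)=([^,\\s]+)");

    /**
     * Returns the key=value pairs that follow "<jiraId> diag:" on a log line,
     * or an empty map if the line does not carry that tag.
     */
    static Map<String, String> parse(String jiraId, String line) {
        Map<String, String> out = new LinkedHashMap<>();
        String tag = jiraId + " diag:";
        int at = line.indexOf(tag);
        if (at < 0) {
            return out;
        }
        Matcher m = KV.matcher(line.substring(at + tag.length()));
        while (m.find()) {
            out.put(m.group(1), m.group(2));
        }
        return out;
    }
}
```

Values containing spaces or commas would be cut short by this pattern, which is one more reason to keep diagnostic values short and token-like.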
Pattern 2: A New TezCounter
Counters are the production-safe way to surface a number. They aggregate across tasks
and are visible in the Tez UI, in hive.tez.exec.print.summary output, and in the AM log.
Define a new counter
Tez counters are enums. The Tez-side counters:
find ~/tez-src -name "DAGCounter.java" -o -name "TaskCounter.java"
public enum TaskCounter {
  // ... existing ...
  REDUCE_INPUT_GROUPS,
  REDUCE_OUTPUT_RECORDS,

  // new for diagnostic:
  /** TEZ-XXXX diag: number of shuffle fetch retries on this task. */
  SHUFFLE_FETCH_RETRIES,
}
The Hive-side counters live in:
find ~/hive-src -name "OperatorVariation.java" -o -name "HiveCounter*.java" | head
On the Hive side, use Reporter.incrCounter or the operator counter mechanism, depending on
Hive version.
Increment it where it matters
In the suspected hot spot:
- copyFromHost(host);
+ try {
+ copyFromHost(host);
+ } catch (IOException e) {
+ context.getCounters().findCounter(TaskCounter.SHUFFLE_FETCH_RETRIES).increment(1);
+ throw e;
+ }
After the reporter runs with the patch:
SET hive.tez.exec.print.summary=true;
-- repro the bug
The summary will show SHUFFLE_FETCH_RETRIES = N per task, surfacing data that was
previously invisible.
Counters vs logs
| Aspect | Counter | Log |
|---|---|---|
| Aggregation across tasks | Automatic | Manual |
| Production safety | High | High |
| Persistence | Long (ATS / Tez UI) | Short (container log rotation) |
| Detail per event | None (just a count) | Full message |
| Cost | Near zero | Low to moderate |
Use both for big diagnostics: a counter to know "this happened N times" and a log to know "and the first time, here's what it looked like."
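The "counter plus first-occurrence log" combination can be sketched without any Tez dependency. In the sketch below an AtomicLong stands in for the TezCounter and a stored string stands in for the single INFO line; the class FetchRetryDiag and the message format are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

final class FetchRetryDiag {
    private final AtomicLong retries = new AtomicLong();                         // stand-in for a TezCounter
    private final AtomicReference<String> firstDetail = new AtomicReference<>(); // stand-in for one INFO line

    /** Counts every failure; keeps the detail message only for the first one. */
    void onFailure(String host, Exception e) {
        retries.incrementAndGet();
        // compareAndSet(null, ...) succeeds exactly once, so later failures add no log volume
        firstDetail.compareAndSet(null,
            "TEZ-XXXX diag: first fetch failure, host=" + host + ", cause=" + e);
    }

    long count() {
        return retries.get();
    }

    String firstDetail() {
        return firstDetail.get();
    }
}
```

The count answers "how often", the single retained message answers "what did it look like", and the log stays bounded no matter how many failures occur.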
Pattern 3: A Debug TezConfiguration Switch
For more invasive diagnostics — extra log lines that would be too noisy by default, or extra checks that have a measurable cost — gate them behind a config switch.
Define the switch
In TezConfiguration (Tez side) or HiveConf (Hive side):
// TezConfiguration.java
@Private @Unstable
public static final String TEZ_AM_DIAGNOSTICS_VERBOSE = "tez.am.diagnostics.verbose";
public static final boolean TEZ_AM_DIAGNOSTICS_VERBOSE_DEFAULT = false;
Use @Private and @Unstable for a diagnostic key (see
Compatibility). Together they signal that this is not a
supported API and may be removed once the bug is fixed.
Gate the diagnostic
private final boolean verboseDiagnostics;

public VertexImpl(...) {
  this.verboseDiagnostics = conf.getBoolean(
      TezConfiguration.TEZ_AM_DIAGNOSTICS_VERBOSE,
      TezConfiguration.TEZ_AM_DIAGNOSTICS_VERBOSE_DEFAULT);
}

public void scheduleTasks(...) {
  // ... existing logic ...
  if (verboseDiagnostics) {
    LOG.info("TEZ-XXXX diag: scheduling {} tasks for vertex {}; first task locations: {}",
        tasksToSchedule.size(), getName(),
        tasksToSchedule.subList(0, Math.min(5, tasksToSchedule.size())));
  }
}
Reporter applies the patch and turns it on:
SET tez.am.diagnostics.verbose=true;
-- repro the bug
SET tez.am.diagnostics.verbose=false;
When the bug is diagnosed, the switch is removed in the proper fix. It is not a supported config — the JIRA tracks both the diagnostic patch (to be reverted) and the real fix.
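Because a diagnostic switch must not change behaviour when it is off, it is worth checking the default-off path in isolation. The sketch below is a dependency-free stand-in: java.util.Properties plays the role of the Tez configuration object and a list plays the role of the logger. The class VerboseSwitchDemo is hypothetical; only the key name comes from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class VerboseSwitchDemo {
    static final String KEY = "tez.am.diagnostics.verbose"; // key name from the text above
    static final boolean DEFAULT = false;

    private final boolean verbose;                  // read once, at construction
    final List<String> emitted = new ArrayList<>(); // stand-in for the logger

    VerboseSwitchDemo(Properties conf) {
        this.verbose = Boolean.parseBoolean(conf.getProperty(KEY, String.valueOf(DEFAULT)));
    }

    void scheduleTasks(List<String> tasks) {
        // ... the existing scheduling logic would run here, unchanged ...
        if (verbose) {
            List<String> firstFew = tasks.subList(0, Math.min(5, tasks.size()));
            emitted.add("TEZ-XXXX diag: scheduling " + tasks.size() + " tasks; first: " + firstFew);
        }
    }
}
```

Reading the flag once in the constructor keeps the hot path to a single boolean check, which is what makes the switch safe to leave compiled in on a production cluster.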
Assembling the Diagnostic Patch
A complete patch for attachment:
- One new INFO log line at the boundary you suspect.
- One new counter if there's a count to track.
- One debug switch if the diagnostic has cost.
- JIRA description with:
- What the patch adds.
- How to apply it.
- How to enable any switch.
- What output the reporter should capture and attach.
- Test that the patch compiles and runs the existing tests — diagnostic patches must not change behavior.
Skeleton JIRA comment
Diagnostic patch attached: TEZ-XXXX.diag.001.patch
Adds:
- INFO log "TEZ-XXXX diag:" in VertexImpl.scheduleTasks
- TaskCounter SHUFFLE_FETCH_RETRIES
- Config switch tez.am.diagnostics.verbose (default false)
To reproduce with the patch:
1. Apply: git apply TEZ-XXXX.diag.001.patch
2. Build: mvn install -DskipTests
3. Run query that reproduces the issue, with:
SET tez.am.diagnostics.verbose=true;
SET hive.tez.exec.print.summary=true;
4. Attach:
- The AM log (yarn logs -applicationId ...)
- The full Tez summary output
- One container log from a failing task
Will use the data to file a proper fix. The diagnostic patch is not for commit;
the fix patch will be separate.
Thanks,
<First>
When the Reporter Can't Apply a Patch
Some reporters can't patch their cluster (locked-down enterprise environment). In that case:
- Ask for what data they can capture: AM log, container logs, Tez UI screenshots, counter values.
- Tell them which existing INFO-level logs to grep for.
- Tell them which existing counters to read off.
- If a config switch already exists that increases logging, point at it (e.g. tez.am.history.logging.enabled=true, tez.runtime.shuffle.connect.timeout=X).
You don't always get a diagnostic patch onto the cluster. The skill is to plan the diagnostic so you get as much as possible from what's already shipped.
After the Diagnostic Runs
The reporter attaches:
- AM log with TEZ-XXXX diag: lines.
- Tez summary with counters.
- Container log fragments.
You analyse:
- What did the diagnostic counter show?
- What did the diagnostic log line tell you?
- Where is the actual root cause?
Once you know the root cause:
- File a separate JIRA for the real fix (or repurpose the diagnostic JIRA).
- Attach a proper fix patch (no diag, no INFO noise, no @Private config keys).
- Note in the JIRA comment that the diagnostic patch is being abandoned in favor of the fix patch.
Worked Example — Slow DAG Submission
Production bug: "Hive query takes 30 seconds in TezTask.submit before the DAG starts
running on large DAGs."
Diagnostic patch
(As above) — adds two INFO lines in TezTask.submit capturing time and DAGPlan size.
Reporter runs
HIVE-XXXX diag: about to submitDAG, dagName=Hive_..., vertices=347, planBytes=8421567
HIVE-XXXX diag: submitDAG returned in 28341 ms
Analysis
- 347 vertices: large DAG, but not absurd.
- DAGPlan 8.4 MB: very large.
- 28 seconds for submitDAG: most likely RPC + protobuf parse on the AM side.
Further diagnosis
Add a Tez-side INFO in DAGAppMaster.submitDAGToAppMaster:
TEZ-XXXX diag: received DAGPlan of {} bytes, deserializing
TEZ-XXXX diag: createDAG completed in {} ms
TEZ-XXXX diag: VertexImpl construction completed in {} ms
Re-run on the cluster. Pinpoint where the 28 seconds go.
Likely result: VertexImpl construction is O(N²) in vertex count for some reason.
File the fix patch with the profile evidence.
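A quadratic hypothesis is cheap to check before profiling: measure the diagnostic timing at two DAG sizes and estimate the growth exponent. If time grows as t ≈ c·n^k, then k = ln(t2/t1) / ln(n2/n1). The sketch below computes that; the class GrowthExponent is hypothetical, and the numbers used with it would come from your own runs.

```java
final class GrowthExponent {
    /**
     * Estimates k in t ~ c * n^k from two measurements (n1, t1) and (n2, t2):
     * k = ln(t2 / t1) / ln(n2 / n1).
     */
    static double estimate(double n1, double t1, double n2, double t2) {
        if (n1 <= 0 || n2 <= 0 || t1 <= 0 || t2 <= 0 || n1 == n2) {
            throw new IllegalArgumentException("need positive, distinct measurements");
        }
        return Math.log(t2 / t1) / Math.log(n2 / n1);
    }
}
```

Doubling the vertex count and seeing the time roughly quadruple gives k near 2, which supports the O(N²) reading; k near 1 points at something linear, such as plain serialization cost. Two points are only a hint, so take several sizes before putting the claim in a JIRA.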
What a Diagnostic Patch Is Not
- Not a place to add unrelated improvements.
- Not a permanent feature; it gets reverted after the bug is fixed.
- Not a substitute for a proper reproducer (combine both when you can).
- Not a place to use @Public APIs: diagnostic config keys are @Private, @Unstable.
- Not committable to trunk as-is.
Validation Artifacts
After this lab:
- A ~/tez-notes/diag-patch-template.md containing the JIRA-comment template above.
- One worked diagnostic patch (real or imagined) following the three patterns.
- The reflex to tag every diagnostic log line with the JIRA ID.
- The discipline to file diagnostic and fix as separate JIRAs (or stages of one JIRA).
This lab closes the Hive-on-Tez Labs section. You now have the full toolkit: trace a SQL query into a DAG, capture the DAG four ways, walk a failure to its root, attribute it to the right project, reproduce it minimally, and instrument it for remote diagnosis. That toolkit is the practising-Tez-committer skill at the Tez/Hive boundary.
Release & PMC Reality
This section takes you inside the committer and PMC view of Apache Tez. It is written for two audiences:
- Contributors who want to understand what a committer is reading when they review your patch, why a release vote takes 72 hours, and what a PMC member actually does between commits.
- New committers and PMC members on Tez (or any other ASF project) who need the operational playbook nobody hands them.
The chapters are deliberately not aspirational. They are the mechanics — what email to
send, what file to sign, what the [VOTE] thread template looks like, where the LICENSE
and NOTICE rules are bright lines.
Reading Order
| # | Chapter | Audience |
|---|---|---|
| 1 | Mailing Lists | Everyone |
| 2 | JIRA & Code Review | Contributors and committers |
| 3 | Committer Mindset | New committers, contributors who want to think like one |
| 4 | Release Voting | PMC and release managers |
| 5 | PMC Responsibilities | PMC members |
| 6 | Licensing | Everyone touching dependencies; PMC for releases |
| 7 | Code Style & Trust | All contributors |
Chapters 1–3 and 6–7 are useful to contributors. Chapters 4–5 are PMC-facing but worth reading earlier to understand why committers behave the way they do at release time.
How This Section Differs From the Mindset Section
The Contributor Mindset section answered the question "how do I behave so my work gets accepted?" This section answers "what is the work being done by the people who accept it?" — the asymmetric view from the other side.
You don't need to be a committer to read this material. You need to internalise it before you become one, so the offer doesn't catch you off guard.
What This Section Is Not
This section is not:
- A substitute for the ASF release distribution policy.
- A substitute for ASF legal guidance on licensing.
- A substitute for the Tez committer's onboarding email from the PMC.
It is a faithful, project-specific summary of what those documents and that onboarding actually contain, written so that a contributor can build accurate expectations and a new committer can move fast without surprises.
Prerequisites
Before this section is fully useful:
- You have read the Contributor Mindset section.
- You have a JIRA account at https://issues.apache.org/jira/.
- You are subscribed to dev@tez.apache.org.
- You have a local clone of Tez at ~/tez-src.
If you are a new Tez committer:
- You have received your ASF ID (<id>@apache.org).
- You have set up GPG (we'll cover this in Release Voting).
- You are subscribed to private@tez.apache.org.
Validation for the Section
You have absorbed this section when you can:
- Compose a [VOTE] thread email for an RC without consulting a template.
- Read a LICENSE change in a patch and predict if it would block a release.
- Explain why Tez is RTC (Review Then Commit) and not CTR (Commit Then Review).
- Predict, before opening a JIRA, which committer will likely shepherd it.
- Identify the category-A / category-B / category-X status of a dependency you want to add.
- Run mvn apache-rat:check and read its output.
The next chapter — Mailing Lists — covers the operational mechanics of the ASF list system that this entire section relies on.
Mailing Lists
Mailing lists are the spine of Apache governance. Every decision that affects the project — design, release, new committer, security disclosure — happens on a mailing list, in an archived thread, with a documented vote when required. This chapter is the operational manual for the Tez lists.
The Tez Lists
| List | Purpose | Subscribe | Notes |
|---|---|---|---|
| dev@tez.apache.org | Development discussion, design, votes | dev-subscribe@tez.apache.org | Primary list. Read first, post sparingly. |
| user@tez.apache.org | Usage questions | user-subscribe@tez.apache.org | Lower-traffic. Answer here if you can. |
| commits@tez.apache.org | Git commit notifications | commits-subscribe@tez.apache.org | Bot-driven. Subscribe to follow trunk live. |
| issues@tez.apache.org | JIRA event notifications | issues-subscribe@tez.apache.org | Bot-driven. Verbose; use a filter rule. |
| private@tez.apache.org | PMC-only | (Auto on PMC) | New-committer votes, security reports. |
Archive: https://lists.apache.org/list.html?dev@tez.apache.org and equivalent for each
list. Anything posted is public forever (except private@, which is archived but not
public).
Subscribing
# From the address you want subscribed:
echo "" | mail -s "" dev-subscribe@tez.apache.org
# You will get a confirmation request. Reply to it.
For multiple lists, repeat. To unsubscribe, replace subscribe with unsubscribe.
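The address pattern above (list name, a dash, then the command) can be sketched as a small helper. The function name list_control_address is hypothetical and exists only for illustration; the -subscribe and -unsubscribe suffixes are the ones described in this section.

```shell
# Build the control address for a Tez list command.
# Usage: list_control_address <list> <subscribe|unsubscribe>
list_control_address() {
  case "$2" in
    subscribe|unsubscribe)
      printf '%s-%s@tez.apache.org\n' "$1" "$2" ;;
    *)
      echo "unknown command: $2" >&2
      return 1 ;;
  esac
}

# Print the addresses to mail for the lists most contributors follow.
for list in dev user issues; do
  list_control_address "$list" subscribe
done
```

Each printed address still needs an empty mail sent from the account you want subscribed, followed by a reply to the confirmation request.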
Filtering
issues@tez.apache.org posts dozens of mails per day. Set a Gmail / Outlook / Thunderbird
rule to file it into a folder. Same for commits@tez.apache.org if you subscribe.
For dev@, file by subject prefix:
| Prefix | Folder |
|---|---|
| [VOTE] | dev-vote (read same-day) |
| [ANNOUNCE] | dev-announce (read same-day) |
| [NOTICE] | dev-notice (read same-day) |
| [DISCUSS] | dev-discuss (read within the week) |
| [PROPOSAL] | dev-proposal (read within the week) |
| (anything else) | dev-misc |
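Mail clients express this table as filter rules, but the matching logic itself is simple enough to sketch in shell. The function name classify_subject is hypothetical; the folder names come from the table above.

```shell
# Map a dev@ subject line to a mail folder, following the prefix table.
# Replies ("Re: [VOTE] ...") match too, because the prefix is still embedded.
classify_subject() {
  case "$1" in
    *"[VOTE]"*)     echo dev-vote ;;
    *"[ANNOUNCE]"*) echo dev-announce ;;
    *"[NOTICE]"*)   echo dev-notice ;;
    *"[DISCUSS]"*)  echo dev-discuss ;;
    *"[PROPOSAL]"*) echo dev-proposal ;;
    *)              echo dev-misc ;;
  esac
}

classify_subject "Re: [VOTE] Apache Tez 0.10.4 RC1"
```

The quoted brackets keep the shell from treating [VOTE] as a character class, the same pitfall to watch for when a mail client's filter syntax uses regular expressions.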
ASF Mailing-List Mores
Lists predate the web at Apache. The conventions are old and load-bearing.
Plain text only
HTML mail is dropped by some clients, breaks quoting, and bloats archives. Apache lists are plain-text. Configure your mail client:
- Gmail web: in the compose window, More options (⋮) → Plain text mode
- Mutt / mu4e / aerc: already plain
- Outlook: File → Options → Mail → Compose messages in this format → Plain Text
Inline reply, not top-post
The Apache convention is to reply under the relevant quoted text, quoting only the part you're answering. Trim aggressively.
On Tue, May 7, 2024, Foo Bar wrote:
> Should we bump the default for tez.am.resource.memory.mb?
Yes, but conditionally. See the sizing sketch on TEZ-4321.
> And what about the AM heap?
Same patch; -Xmx is computed from -resource.memory.mb in the launch
command. We don't need a separate knob.
--
Jane
Top-posting (your full reply at the top, the original quoted in full below) makes archive threads unreadable. People will gently note this once; do not require a second note.
No attachments
Patches go on JIRA. Logs and stack traces go in a gist or a pastebin and are linked. Long output goes as an attachment to the JIRA, not the email.
A 2 MB attachment forces hundreds of subscribers to download it. A link forces only the interested.
Sign off
A short sign-off — first name, or first + last — is conventional. No corporate signature block, no legal disclaimer, no "Sent from my iPhone."
If you must have a signature, use the standard -- \n separator (dash-dash-space) so
mail clients can suppress it.
Subject hygiene
Subject prefixes are filterable. Use them.
| Prefix | When |
|---|---|
| [DISCUSS] | Open question, no decision sought yet |
| [PROPOSAL] | Concrete proposal, comment wanted |
| [VOTE] | Vote in progress; body has voting rules |
| [VOTE][RESULT] | Closing a vote; tallies the result |
| [ANNOUNCE] | One-way announcement (release, new committer) |
| [NOTICE] | Infrastructure / policy change |
Don't prefix replies. The Re: is enough; subscribers' filters key off the embedded
[VOTE] already.
Reply-To etiquette
ASF lists are configured to set Reply-To: list. So your reply goes to the list by
default. Don't break it by manually rewriting the To:.
If you want to reply privately to the sender (rare — use only for personal/off-topic), explicitly remove the list and address them.
[VOTE] Mechanics
ASF votes are the formal decision mechanism. They use a fixed +1 / 0 / -1 syntax.
Voting tokens
| Token | Meaning |
|---|---|
| +1 | I approve. |
| +0 | I'm slightly positive but won't block. |
| 0 | I have no opinion. |
| -0 | I'm slightly negative but won't block. |
| -1 | I disapprove. |
The -1 (a "veto") is a heavy tool. It must be accompanied by a technical justification.
A -1 without justification is invalid. Once a valid -1 is cast on a code change, the
issue must be resolved (typically by revision) before the change proceeds.
Binding vs non-binding votes
| Vote topic | Who is binding |
|---|---|
| Code change | Committers and PMC |
| Release artifact | PMC only |
| New committer | PMC only |
| New PMC member | PMC only |
| Project mechanics (board reports, etc.) | PMC only |
Non-binding votes are welcomed and counted, but only the binding count determines the outcome.
Required minimums
For releases, the ASF rule:
- 72-hour minimum vote duration.
- At least 3 binding +1 votes.
- More binding +1 than binding -1 votes.
If those conditions aren't met by close, the vote is extended or fails. See Release Voting for the full mechanics.
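The release rule can be written down as a small decision function. This is a sketch under the assumption that only binding votes are counted, as the section on binding votes states; the function name and the three outcome labels are hypothetical.

```shell
# Decide the outcome of a release vote from its binding tally.
# Usage: release_vote_outcome <hours_open> <binding_plus_ones> <binding_minus_ones>
release_vote_outcome() (
  hours=$1
  plus=$2
  minus=$3
  if [ "$hours" -lt 72 ]; then
    # The 72-hour minimum has not elapsed; the vote cannot be closed yet.
    echo "still-open"
  elif [ "$plus" -ge 3 ] && [ "$plus" -gt "$minus" ]; then
    echo "passes"
  else
    echo "extend-or-fail"
  fi
)

release_vote_outcome 72 3 0
```

Non-binding votes are deliberately absent from the arguments: they appear in the result email but do not change the outcome.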
For code changes in Tez (RTC project — see JIRA & Code Review):
- Typically 1 binding +1 (a committer) is sufficient to commit, after review.
- A -1 from any committer or PMC member blocks the commit pending resolution.
For new committers / PMC:
- Run on private@.
- Typically a few-day vote window.
- Pass: more +1 than -1; common practice is at least 3 +1.
Lazy consensus
Many decisions don't require a vote. The mechanism is lazy consensus:
"I'm planning to do X. Speak up if you disagree; otherwise I'll proceed in 72 hours."
Used for things like cutting a branch, scheduling a release-vote window, or applying a trivial fix. The poster picks a reasonable window (24–72 hours). Silence = consent.
Lazy consensus is not for irreversible decisions (release, license change, PMC membership). Those require an explicit vote.
Composing a [VOTE] Email
Template — release vote (the full version is in Release Voting):
Subject: [VOTE] Apache Tez 0.10.4 RC1
Hi all,
I'd like to call a vote on releasing Apache Tez 0.10.4 RC1.
Source release: https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-RC1/
Git tag: release-0.10.4-rc1
Commit hash: <full sha>
Staging Nexus: https://repository.apache.org/content/repositories/orgapachetez-NNNN/
KEYS file: https://downloads.apache.org/tez/KEYS
Signed with key: <your key id and fingerprint>
The vote will be open for 72 hours.
[ ] +1 Release this package
[ ] 0 No opinion
[ ] -1 Do not release because ...
My +1.
Thanks,
<First>
Template — new committer (run on private@):
Subject: [VOTE] New Tez committer: <First Last>
Hi PMC,
I'd like to propose <First Last> as a new committer on Apache Tez.
<First Last> has been contributing since <month year> and has had
<N> patches committed, spanning <areas>. Highlights:
- TEZ-NNNN: <one line>
- TEZ-NNNN: <one line>
- Active reviewer on TEZ-XXX, TEZ-YYY.
They've shown <judgement / quality / breadth>.
Vote open for 72 hours.
[ ] +1
[ ] 0
[ ] -1
My +1.
Thanks,
<First>
Template — closing a vote:
Subject: [VOTE][RESULT] Apache Tez 0.10.4 RC1
Hi all,
The vote on Apache Tez 0.10.4 RC1 has passed with the following tally:
Binding +1: <list of names>
Non-binding +1: <list of names>
0: <names if any>
-1: <names if any, with reasons>
I'll proceed with the release steps.
Thanks to everyone who voted.
<First>
Lazy Consensus Examples
Good lazy-consensus posts:
- "I'm cutting branch branch-0.10.5 from current master tomorrow at 12:00 UTC unless there's objection."
- "Planning to apply TEZ-4321 (one-line log fix, trivial) by end of week unless someone flags it. Patch is .001 on JIRA."
- "Will cancel the 0.10.5 RC1 vote and roll RC2 tomorrow due to the LICENSE finding."
Bad lazy-consensus posts:
- "Going to release 0.10.5 next week." (Requires a [VOTE].)
- "Going to add NAME as committer." (Requires a [VOTE] on private@.)
- "Going to remove the deprecated key X." (User-visible behavior; requires [DISCUSS] → consensus.)
When You're New on the List
The first month of reading a list:
- Read every [VOTE] thread.
- Read every [DISCUSS] thread.
- Skim [jira] [Created] mails.
- Post nothing initially.
After the first month:
- Reply to a user@ question you can answer.
- Post a self-introduction (see Community Interaction).
- Comment on a [DISCUSS] thread once you have substance.
Validation Artifacts
After this chapter:
- Subscriptions confirmed to dev@, user@, and (if your mail client tolerates the volume) issues@.
- Mail-client filters configured for the subject prefixes table.
- A ~/tez-notes/vote-templates.md containing the three templates above.
- The reflex to inline-reply, not top-post.
- One archived [VOTE] thread URL bookmarked for reference.
The next chapter — JIRA & Code Review — is the operational view of what code review looks like from the committer side of the table.
JIRA & Code Review — Inside a Tez Review
This chapter is the committer view of code review. Read it as a contributor, and your patches will become reviewable. Read it as a new committer, and you'll have a workflow.
Tez is RTC
Apache projects choose between two commit philosophies:
| Model | Meaning | Used by |
|---|---|---|
| RTC (Review Then Commit) | Patch must be reviewed and +1'd before commit | Tez, Hive, Hadoop (for most code) |
| CTR (Commit Then Review) | Committer may commit and discuss after | Some smaller projects, certain Hadoop subsystems |
Tez is RTC. The implication: every commit went through at least one review round. A patch sitting at "Patch Available" with no review is blocked on reviewer attention, not on the contributor's velocity: the committer pool is finite.
The RTC exception: trivial fixes (typos, log message edits, javadoc improvements) may be
committed by a committer without an explicit +1, but the commit message references the
JIRA and the patch is still attached for the record.
How a Committer Reads a Patch
When a committer opens your patch (in JIRA, GitHub PR, or git apply locally), the
sequence is roughly:
1. Read the JIRA description. 30s
2. git apply --check on a fresh clone. 30s
3. Look at git diff --stat. 30s
4. Read the test changes. 2-5 min
5. Read the implementation changes. 5-15 min
6. Run mvn checkstyle:check. 30s
7. Run mvn test in changed modules. 2-15 min
8. Optionally: run an integration test. 5-30 min
9. Comment. Variable
The first three steps determine whether the patch gets the full read or a bounce. If the JIRA is unclear or the diff doesn't apply or includes unrelated changes, the patch goes back without step 5.
The Skim Phase
A committer skimming git diff --stat is looking for:
- File count and module spread. A patch touching one module is easy; one touching five is suspicious.
- Tests in the diff. No tests in a behavior-changing patch is a red flag.
- Generated files in the diff. target/, *.iml, .idea/ are never committed.
- Whitespace-only churn. git diff -w should not be vastly smaller than git diff.
If any of these are off, expect a comment before the implementation is read.
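Part of the skim can be automated before a patch is attached. The sketch below reads a unified diff on stdin and reports three of the red flags; it assumes git-style "+++ b/path" headers and the Maven src/test layout, and the function name skim_patch is hypothetical. The whitespace-churn check is left out because it needs a working tree to compare git diff against git diff -w.

```shell
# Read a unified diff on stdin and print one warning per skim-phase red flag.
skim_patch() (
  # Paths touched by the patch, taken from the "+++ b/<path>" header lines.
  files=$(sed -n 's|^+++ b/||p')

  # Build outputs and IDE metadata should never appear in a patch.
  if printf '%s\n' "$files" | grep -Eq '(^|/)target/|\.iml$|(^|/)\.idea/'; then
    echo "generated files in diff"
  fi

  # Assumption: tests live under <module>/src/test/.
  if ! printf '%s\n' "$files" | grep -q '/src/test/'; then
    echo "no test changes"
  fi

  # Count distinct top-level directories as a rough proxy for module spread.
  modules=$(printf '%s\n' "$files" | cut -d/ -f1 | sort -u | wc -l | tr -d ' ')
  if [ "$modules" -gt 1 ]; then
    echo "touches $modules modules"
  fi
)
```

Running skim_patch < TEZ-4321.001.patch before uploading gives the same first impression a committer forms from git diff --stat; silence means none of these three flags fired.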
The Test Phase
Committers read tests before implementation because the test reveals intent. A good test
named testRecoverNoInputs tells the reviewer:
- The bug is in recovery.
- The trigger is "no inputs."
- The fix should not break recovery in any other case.
If the test is missing, weak (no assertions, or assertions that would pass without the
fix), or named generically (testMethod1), the reviewer assumes the implementation is
also weak.
The Implementation Phase
By the time the reviewer reads the code, they have a mental model from the JIRA, the test name, and the diff stat. The implementation read is checking:
- Does the code match the intent of the JIRA and test?
- Is the change minimal — does it touch what it must, and only what it must?
- Are exceptions handled appropriately for the file's conventions?
- Is logging at the right level (DEBUG for hot paths, INFO for state transitions, WARN for recoverable, ERROR for unrecoverable)?
- Are there obvious thread-safety issues (state visible across threads, shared mutable collections)?
- Are there back-compat concerns? (See Compatibility)
Comment Phrasing
Committer comments follow soft conventions that contributors should recognise — they encode meaning beyond the literal text.
| Comment | Means |
|---|---|
| "Nit: ..." | Stylistic preference; you may take it or push back without controversy. |
| "Suggestion: ..." | Reviewer thinks there's a better way but isn't blocking. |
| "Concern: ..." | Reviewer wants this addressed before commit. |
| "I don't think this is right." | Block; must be resolved. |
| "Have you considered X?" | Genuine question; respond with your reasoning. |
| "Let's discuss on dev@." | Issue is bigger than the patch; design discussion needed. |
| "+1 LGTM" | Approval (informal). |
| "+1 pending checkstyle" | Conditional approval. |
| "-1, see ..." | Veto; must be resolved before commit. |
Contributors owe reciprocal etiquette in their responses (see Responding to Feedback): acknowledge every comment explicitly, fix what's fixable, and push back with evidence on what's not.
Patch Available → Reviewed Lifecycle
The JIRA state transitions for a typical patch:
Open
| (contributor starts)
v
In Progress
| (contributor attaches .001)
v
Patch Available ← reviewer reads here
| (review comments)
v
In Progress ← contributor revises
| (attaches .002)
v
Patch Available
| (LGTM)
v
Resolved (committer commits to trunk)
| (release ships)
v
Closed
The patch attachments accumulate: .001, .002, .003. They are never deleted. Future
readers can reconstruct the review by walking through them.
GitHub-PR-based reviews follow the same lifecycle, but the iteration happens in the PR's
commit history rather than separate .NNN.patch files. The JIRA still moves through the
states above.
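The lifecycle is a small state machine, and its allowed moves can be listed explicitly. The sketch below encodes only the transitions drawn in the diagram above; the real Tez JIRA workflow may permit others (reopening, for example), and the function name valid_transition is hypothetical.

```shell
# Succeed only for JIRA status moves that appear in the lifecycle diagram.
# Usage: valid_transition "<from status>" "<to status>"
valid_transition() {
  case "$1>$2" in
    "Open>In Progress")            return 0 ;;  # contributor starts
    "In Progress>Patch Available") return 0 ;;  # patch attached
    "Patch Available>In Progress") return 0 ;;  # review comments to address
    "Patch Available>Resolved")    return 0 ;;  # committer commits
    "Resolved>Closed")             return 0 ;;  # release ships
    *)                             return 1 ;;
  esac
}

if valid_transition "Patch Available" "Resolved"; then
  echo "ok: a reviewed patch can be resolved"
fi
```

Note that there is no edge from Open or In Progress straight to Resolved: under RTC, every commit passes through Patch Available first.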
Backport Patches
A patch may need to land on multiple branches (e.g. master and branch-0.10). The
contributor attaches both:
TEZ-4321.001.patch (for master)
TEZ-4321.branch-0.10.001.patch (for the maintenance branch)
The committer reviews and commits each. The JIRA comment notes the commits:
Committed to master: <sha>
Committed to branch-0.10: <sha>
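The naming convention makes the target branch recoverable from the file name alone. A minimal sketch, assuming three-digit revision numbers as in the examples above; the function name patch_target_branch is hypothetical.

```shell
# Print the branch a patch file is meant for, based on its name.
# Usage: patch_target_branch <patch-file-name>
patch_target_branch() (
  name=${1##*/}                      # strip any directory part
  case "$name" in
    TEZ-*.branch-*.[0-9][0-9][0-9].patch)
      rest=${name#TEZ-*.}            # e.g. branch-0.10.001.patch
      echo "${rest%.[0-9][0-9][0-9].patch}" ;;
    TEZ-*.[0-9][0-9][0-9].patch)
      echo master ;;
    *)
      echo "unrecognised patch name: $name" >&2
      exit 1 ;;
  esac
)

patch_target_branch /tmp/TEZ-4321.branch-0.10.001.patch
```

A name without a branch component is taken to mean master, which matches how the two attachments above are labelled.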
The Committer's Pre-Commit Checklist
A committer about to commit runs:
cd ~/tez-src
git fetch origin
git checkout master
git merge --ff-only origin/master
git apply --check /tmp/TEZ-4321.003.patch
git apply /tmp/TEZ-4321.003.patch
mvn install -DskipTests
mvn checkstyle:check
mvn test -pl tez-dag,tez-api # changed modules
mvn test -pl tez-tests -Dtest=TestOrderedWordCount
git add -A
git commit -s -m "TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gunther)"
git push origin master
Notes on the commit step:
- -s adds a Signed-off-by: trailer. Tez doesn't currently require DCO, but it's Apache-idiomatic.
- The (Jane Doe via gunther) suffix is added by the committer, not the contributor.
- The push goes to apache/tez (committer karma required).
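The subject-line shape used in the commit above can be checked mechanically before pushing. This sketch validates only the "TEZ-NNNN: summary (Contributor via committer)" form shown in this chapter; the regular expression, the assumption that committer ids are lowercase alphanumerics, and the function name are all illustrative rather than a Tez-enforced rule.

```shell
# Succeed if a commit subject follows "TEZ-NNNN: <summary> (<name> via <committer>)".
# Usage: check_commit_subject "<subject line>"
check_commit_subject() {
  printf '%s\n' "$1" |
    grep -Eq '^TEZ-[0-9]+: .+ \([^()]+ via [a-z0-9]+\)$'
}

subject="TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gunther)"
if check_commit_subject "$subject"; then
  echo "subject ok"
fi
```

A commit authored by the committer themselves would not carry the "via" attribution, so treat a failure here as a prompt to look, not as proof of a mistake.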
After push:
1. Update JIRA: status → Resolved, set Fix Version (e.g. "0.10.5").
2. Comment on JIRA with the commit SHA.
3. Thank the contributor.
Holding the "No" Muscle
A subtle and underappreciated committer skill is declining patches that shouldn't go in. A patch can be technically correct and still not belong in trunk — too narrow a use case, too much added complexity, the wrong layer.
Wording for a respectful decline:
Thanks for the patch. After reading, I'm not comfortable taking this in trunk because REASON. I appreciate the work, and I'd encourage ALTERNATIVE-PATH. Closing the JIRA as Won't Fix; if there's broader consensus on dev@ for a different approach, happy to reopen.
The "no" muscle is not natural. Committers learn it because the alternative — accepting every patch — accumulates technical debt that the committer pool will pay forever. See Committer Mindset.
When to Refactor Unsolicited Code in a Patch
A contributor's patch sometimes lands in a corner of the code the committer would like to clean up. The temptation is to do the cleanup at commit time. Don't.
The rules:
- Never modify the contributor's diff at commit. The patch attached to JIRA must match what was reviewed.
- File a follow-up JIRA for the cleanup. Reference the contributor in CC.
- If the patch creates a refactoring opportunity, take it later. Not in this commit.
The exception: trivial cleanups the contributor agreed to in review may be applied at commit. The JIRA comment notes them. Example:
Committed with a small change: extracted the new logic into a private
helper method as discussed in review. Attaching the committed patch
as .004 for the record.
What Goes On the GitHub PR vs. JIRA
Tez accepts patches as JIRA attachments and as GitHub PRs (linked from the JIRA). The mapping:
| Lives on | What |
|---|---|
| JIRA | Description, design discussion, root cause, attachments (.NNN.patch), final commit reference |
| GitHub PR (if used) | Line-by-line comments, CI run results, iterative push history |
A PR without a linked JIRA is incomplete; the JIRA is the system of record. A JIRA without a PR is fine — many Tez patches are still JIRA-attachment-only.
If you open a PR, link it on the JIRA in the first comment and set the JIRA to "Patch Available."
Worked Example — A Full Review Cycle
JIRA: TEZ-4321. "Fix NPE in VertexImpl.recover when no inputs."
Day 0 You: file JIRA with description, repro, root cause.
Set yourself as assignee, status In Progress.
Attach .001 patch; status → Patch Available.
Day 3 Committer @gopalv: applies patch locally, runs tests, reviews.
Comments on JIRA:
- L88: prefer Collections.emptyList().
- L92: add a test for the no-inputs case.
- L94: should we handle no-outputs symmetrically? Concern: see
VertexImpl.recover at L142, looks like the same shape.
Day 4 You: reply on JIRA.
- L88: agreed.
- L92: agreed; adding testRecoverNoInputs.
- L94: I see the parallel but think it's a separate JIRA.
Filing TEZ-4329 to track.
Attach .002.
Day 7 @gopalv: re-reviews. "+1 LGTM."
Day 8 @gopalv: commits.
"TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gopalv)"
Sets JIRA Resolved, Fix Version 0.10.5.
Day 8 You: comment "Thanks @gopalv. Working on TEZ-4329 next."
That is a healthy review — 2 patch rounds, 1 follow-up JIRA filed, no friction.
Validation Artifacts
After this chapter:
- A ~/tez-notes/reviewer-vocab.md cheatsheet from the comment-phrasing table.
- The committer's pre-commit checklist, saved for when you are one.
- The discipline to never modify a contributor's diff at commit (with an exception only for explicit reviewer-author agreement).
- The reflex to comment "Thanks @COMMITTER" after a merge of your patch.
The next chapter — Committer Mindset — takes the perspective further: the judgement model committers use across many patches and many years.
Committer Mindset
Becoming a committer is a one-day event. Thinking like one is a multi-year practice. This chapter sketches the practice: the asymmetries, the recurring trade-offs, and the mental model that distinguishes "writes good patches" from "stewards the codebase."
The Long-Lived Code Tax
A contributor writes a patch and leaves. A committer commits a patch and inherits it forever. Every line a committer approves is theirs to debug at 11pm three years later when it breaks in production.
Practical consequence: the committer's "yes" is a much heavier word than the contributor's "this would be nice." Committers reflexively ask:
| Question | Why |
|---|---|
| Who will maintain this in 2 years? | Code without a maintainer becomes everyone's problem |
| Is the complexity proportional to the value? | Complex code is paid for in every future bug |
| Does this make tez-dag harder to onboard into? | Onboarding cost is real |
| What's the failure mode at 10x scale? | Tez runs in production clusters at scale |
| Does this lock us into a design we'll regret? | API and proto changes are forever |
These are not abstractions. Every committer has at least one patch they regret approving. That memory is the source of the "no" muscle.
Reasoning About Compatibility
The compatibility surface is exhaustively documented in Compatibility. The mindset around it:
- Default to backwards-compat. A change that breaks no one is always preferable to one that breaks anyone, even if uglier.
- A deprecation is a promise. If you deprecate a method "to be removed in 0.12," it had better be removable in 0.12 — which means no production user can still be on it by then, which means the deprecation window has to be long enough to drain.
- Wire compat is not negotiable. A DAGPlan change that breaks recovery from an old AM means a cluster can't roll-restart safely. That's a P0 production issue.
- Configuration compatibility is silent until it isn't. Renaming a key without a deprecation alias breaks every cluster that has the old key in tez-site.xml. Reviewers will catch this if they're paying attention; committers must always pay attention.
The mental model: imagine you are the SRE on call at 1 AM at a Fortune 500 company that runs Tez via Hive. What does this patch do to your night?
Reasoning About Performance
Tez runs in the hot path of Hive on terabyte-scale workloads. A 5 ms regression in a per-task code path is real money. The mindset:
- Measure, don't guess. A patch claiming performance benefit needs numbers, not intuition. A patch claiming no performance impact in a hot path still needs a check.
- Hot vs. cold paths. Optimisations matter in tez-runtime-library and the per-task paths of tez-runtime-internals. They matter much less in tez-dag AM startup code that runs once per DAG.
- GC is performance. A patch that allocates an extra object per task adds GC pressure at scale. Reuse buffers; use primitives; bound queues.
- Logging is performance. LOG.debug("..." + obj) allocates the string even when DEBUG is off. Use LOG.debug("... {}", obj) instead.
The committer reading a patch in a hot path keeps these questions ready:
- Does this allocate per-record? Per-batch? Per-DAG?
- Is the allocation reusable / poolable?
- Is the log statement guarded or formatted?
- Has the contributor said how this performs at scale?
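The logging question lends itself to a quick mechanical pass over a patch. The sketch below is a heuristic, not a linter: it flags added lines where a LOG.debug call has a string literal followed by +, it will miss concatenations that start with a variable, and it will also flag calls that are already wrapped in an isDebugEnabled() guard. The function name find_concat_debug is hypothetical.

```shell
# Read a unified diff on stdin; print added lines that build a debug message
# by string concatenation instead of using {} placeholders.
find_concat_debug() {
  # "|| true" keeps the exit status zero when nothing is flagged.
  grep -E '^\+.*LOG\.debug\(.*"[[:space:]]*\+' || true
}

printf '%s\n' \
  '+    LOG.debug("vertex " + vertexId);' \
  '+    LOG.debug("vertex {}", vertexId);' |
  find_concat_debug
```

Only the first of the two sample lines is printed; the second uses a placeholder and defers formatting until DEBUG is actually enabled.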
Reasoning About Complexity
Complexity has a half-life of bugs. The reviewing committer's complexity check:
| Complexity addition | What it costs |
|---|---|
| A new abstract base class | A new mental model for readers |
| A new configuration key | Documentation, default-tuning, deprecation later |
| A new state in a state machine | Combinatorial new transitions to test |
| A new event type | New event dispatcher cases, new history entries |
| A new public method | Compatibility commitment |
| A new dependency | Licensing review, attack surface, build complexity |
A patch that adds, say, a new configuration key for a corner-case behavior is not trivially "yes" even if the code is correct. The cost of the key — documentation, tuning, eventual deprecation — must justify the value.
The reflexive committer question: "Could this be a default, with no key?" If the answer is yes, skip the key.
Reasoning About Risk
Different code paths carry different risk profiles:
| Path | Risk |
|---|---|
| tez-tools/ | Low. Process tooling; broken doesn't affect runtime. |
| tez-mapreduce/ | Medium. Affects MR-on-Tez users; relatively well-tested. |
| tez-runtime-library/ | High. In the per-task hot path. |
| tez-runtime-internals/ | High. Task runtime; affects every DAG. |
| tez-dag/ AM scheduling | High. AM bugs lose work. |
| tez-dag/ DAG planning | Very high. Errors are bad DAGs. |
| tez-api/ | Very high. Public API; breaking it breaks downstream projects. |
| tez-api/src/main/proto/ | Critical. Wire format; cluster-rolling-restart implications. |
Committers calibrate review depth to risk. A 50-line patch in tez-tools/ may get a
quick read and +1. A 50-line patch in tez-api/src/main/proto/ gets word-by-word
scrutiny, a [DISCUSS] thread, and possibly a -1 if the protobuf change is anything
other than additive.
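The risk table can be turned into a path lookup for scoring the files a patch touches. This is a sketch: the tier labels are taken from the table, the function name risk_tier is hypothetical, and a path alone cannot tell AM scheduling from DAG planning inside tez-dag, so that module is reported at its lower bound.

```shell
# Print the risk tier for a repository-relative path.
# Order matters: the proto directory must be tested before the rest of tez-api.
risk_tier() {
  case "$1" in
    tez-api/src/main/proto/*) echo critical ;;
    tez-api/*)                echo very-high ;;
    tez-dag/*)                echo high ;;      # DAG planning code is very-high
    tez-runtime-library/*)    echo high ;;
    tez-runtime-internals/*)  echo high ;;
    tez-mapreduce/*)          echo medium ;;
    tez-tools/*)              echo low ;;
    *)                        echo unrated ;;
  esac
}

risk_tier tez-api/src/main/proto/DAGApiRecords.proto
```

Feeding it the output of git diff --name-only, one path at a time, shows at a glance whether a patch deserves a quick read or word-by-word scrutiny.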
The "No" Muscle — When and How
The hardest committer skill is saying no. Not no-by-silence (the default and worst form), but explicit, kind, decisive no. Patterns for when to use it:
| Situation | How to say no |
|---|---|
| Patch fixes a real but rare bug at the cost of significant new complexity | "Let's not fix this in code; document the workaround and close as Won't Fix." |
| Patch adds a feature with one user (the contributor) | "Could you maintain this as an out-of-tree plugin? VertexManagerPlugin exists for this." |
| Patch is technically correct but encodes a design that conflicts with planned direction | "We're going a different way on dev@ thread XYZ; let's wait." |
| Patch is correct but vastly over-scoped | "Could you split into 3 JIRAs? Happy to commit them one at a time." |
| Patch is correct but in a part of the codebase being rewritten | "Let's wait for TEZ-NNNN to land first; this conflicts." |
The crucial thing about saying no: do it early, explicitly, and once. Don't ghost the patch. The contributor's time is worth your one paragraph of explanation.
When to Refactor Unsolicited
A patch lands in a part of the codebase the committer has been wanting to refactor. The temptation is to do the refactor in or alongside the commit. Don't, except in narrow cases.
The rules:
- Refactor neither in the contributor's patch nor in the same commit. Their patch must match what was reviewed.
- File a follow-up JIRA for the refactor. Reference the contributor in CC; they often have context.
- Do the refactor in a separate review cycle. Either you do it (review by someone else) or someone else does it (review by you).
- Exception: If the contributor's patch sits in code that is literally being moved or removed by an imminent committed patch, coordinate. Either delay the contributor's patch or rebase the imminent one.
Mentoring Pattern
A committer's leverage is not just commits — it's mentoring. The well-trodden Apache mentoring pattern:
- Notice a thoughtful new contributor. Their first patch was clean; they responded well to feedback; they asked good questions on dev@.
- Suggest a JIRA in your area. Comment on a JIRA: "This would be a good fit for NAME based on their recent work on TEZ-XXXX."
- Shepherd it. Review their patch yourself, fast. Set expectations on iteration count.
- Make them visible. Refer to their work on dev@. Cite them in commits as you would any contributor.
- Eventually propose them. When they hit the rough bar from Meritocracy, propose them on private@.
A committer who has mentored two or three contributors into committership has done more for the project than one who has committed thousands of patches.
Time Allocation
Newly-minted committers underestimate how time-consuming the role is. A rough budget for sustained committership:
| Activity | Weekly time |
|---|---|
| Reviewing patches | 2–4 hours |
| Filing or shepherding your own patches | 2–4 hours |
| dev@ discussion participation | 1–2 hours |
| JIRA triage (closing dups, asking for repros) | 0.5–1 hour |
| Mentoring | 0.5–1 hour |
| Release work (during release windows) | 4–8 hours |
A committer who spends 0.5 hours/week on the project will be reactive at best and become inactive within a year. A committer who spends 4+ hours/week stewards the codebase.
Avoiding Burnout
The committer pool at any Apache project is finite. Burnout is a real failure mode:
| Burnout signal | Self-rescue |
|---|---|
| Reviewing patches feels like a chore | Take a 2-week formal break; tell dev@ |
| You're saying yes to patches you don't believe in | Practice saying no |
| You're the only reviewer for an area | Mentor someone into co-reviewing |
| You're sleeping less because of a release window | Ask the PMC to split the RM duties |
| You haven't filed a JIRA you cared about in months | Stop reviewing for a week; write |
Committership is voluntary. Stepping back is honourable. Emeritus committer status exists at Apache for those who want a graceful exit; you can come back later.
Validation Artifacts
After this chapter:
- A ~/tez-notes/committer-questions.md of the five recurring questions a committer asks of every patch.
- The discipline to score each Tez file path you touch by risk tier.
- The vocabulary to say no, in writing, with no rancour.
- The plan to do mentoring at some point in your committer life.
The next chapter — Release Voting — is the operational manual for the most visible PMC-level work: cutting a release.
Release Voting
Cutting an Apache Tez release is a procedural, legal, and cryptographic operation. It is the most formal thing the PMC does. This chapter is the operational manual: the steps, the artifacts, the vote thread, and the failure modes.
The authoritative reference is the ASF Release Distribution Policy. This chapter is the Tez-specific overlay on top of it.
What "Release" Means at Apache
An Apache release has a precise legal meaning. Only source artifacts are official Apache releases. Binary artifacts (jars in Maven Central, Docker images) are convenience artifacts that the PMC may publish but that are not the legal release.
Practical consequence: every vote is a vote on the source release. Binaries derive from it.
Release Artifacts
A Tez release consists of:
| Artifact | Where | Format |
|---|---|---|
| Source tarball | dist.apache.org | apache-tez-X.Y.Z-src.tar.gz |
| ASCII-armored signature | dist.apache.org | apache-tez-X.Y.Z-src.tar.gz.asc |
| SHA-512 checksum | dist.apache.org | apache-tez-X.Y.Z-src.tar.gz.sha512 |
| (Optional) binary tarball | dist.apache.org | apache-tez-X.Y.Z-bin.tar.gz plus .asc and .sha512 |
| Staged Maven jars | repository.apache.org (Nexus) | Standard Maven layout |
| Git tag | apache/tez | release-X.Y.Z-rcN then release-X.Y.Z |
Notes:
- MD5 and SHA-1 are forbidden for release checksums (ASF policy since 2019). Use SHA-512 (preferred) or SHA-256.
- The signature must be ASCII-armored (.asc), not binary.
- The signing key must be in the project KEYS file at https://downloads.apache.org/tez/KEYS, and your public key must also be published on a public keyserver.
Prerequisites — One-Time PMC Setup
Before you can RM (release-manage), once:
# 1. Generate a GPG key (4096-bit RSA).
gpg --full-generate-key
# 2. Submit the public key to keyservers.
gpg --send-keys <KEY_ID>
# 3. Add your key to the Tez KEYS file.
svn co https://dist.apache.org/repos/dist/release/tez tez-dist-release
cd tez-dist-release
(gpg --list-sigs <KEY_ID> && gpg --armor --export <KEY_ID>) >> KEYS
svn commit KEYS -m "Add <Your Name>'s release-signing key"
# 4. Verify it lands at:
# https://downloads.apache.org/tez/KEYS
The Nexus staging access:
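# Create ~/.m2/settings.xml with the staging credentials. If the file already
# exists, merge the <server> element into its <servers> section by hand instead:
# appending a second <settings> root would make the file invalid XML.
[ -e ~/.m2/settings.xml ] || cat > ~/.m2/settings.xml <<EOF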
<settings>
<servers>
<server>
<id>apache.releases.https</id>
<username>YOUR_APACHE_ID</username>
<password>YOUR_APACHE_LDAP_PASSWORD</password>
</server>
</servers>
</settings>
EOF
The Release Cut
Roughly the sequence the release manager runs:
cd ~/tez-src
git fetch origin
# 1. Branch (for X.Y.0 releases) or check out maintenance branch.
git checkout -b branch-0.10.4 origin/master # for a new minor
# or
git checkout branch-0.10 # for a patch release
# 2. Update version.
mvn versions:set -DnewVersion=0.10.4
git commit -am "Setting version to 0.10.4 for release"
git tag release-0.10.4-rc1
git push origin branch-0.10.4
git push origin release-0.10.4-rc1
# 3. Build everything; tests must pass.
mvn clean install
mvn apache-rat:check
# 4. Build source tarball.
mvn clean package -Pdist,docs,src -DskipTests
ls tez-dist/target/ # apache-tez-0.10.4-src.tar.gz
# 5. Sign and checksum.
gpg --armor --output apache-tez-0.10.4-src.tar.gz.asc \
--detach-sign apache-tez-0.10.4-src.tar.gz
sha512sum apache-tez-0.10.4-src.tar.gz > apache-tez-0.10.4-src.tar.gz.sha512
# 6. Stage to dist.apache.org/dev.
svn co https://dist.apache.org/repos/dist/dev/tez tez-dev
mkdir tez-dev/tez-0.10.4-RC1
cp apache-tez-0.10.4-src.tar.gz* tez-dev/tez-0.10.4-RC1/
cd tez-dev
svn add tez-0.10.4-RC1
svn commit -m "Apache Tez 0.10.4 RC1"
# 7. Stage Maven artifacts.
mvn clean deploy -Papache-release -DskipTests
# Then on https://repository.apache.org, log in, find your
# staging repo (orgapachetez-NNNN), "Close" it.
The exact Maven profiles differ across Tez versions; check
~/tez-src/RELEASING.txt and the release notes for the prior release for the recipe in
use.
The [VOTE] Email
After staging, you send the vote. The template:
Subject: [VOTE] Apache Tez 0.10.4 RC1
Hi all,
I'd like to call a vote on releasing Apache Tez 0.10.4 RC1.
Notable changes since 0.10.3:
- TEZ-NNNN: <one line>
- TEZ-MMMM: <one line>
- <N> additional fixes; see CHANGES.txt for the full list.
Source release:
https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-RC1/
The release was signed with key:
<KEY_ID> <fingerprint>
KEYS file:
https://downloads.apache.org/tez/KEYS
Git tag: release-0.10.4-rc1
Git commit: <full 40-char sha>
Staging repository for Maven:
https://repository.apache.org/content/repositories/orgapachetez-NNNN/
The vote will be open for 72 hours.
Please verify and vote:
[ ] +1 Release this package
[ ] 0 No opinion
[ ] -1 Do not release this package because ...
Verification steps (https://www.apache.org/info/verification.html):
- Download src.tar.gz, .asc, .sha512.
- Verify SHA512: sha512sum -c apache-tez-0.10.4-src.tar.gz.sha512
- Verify signature:
gpg --import KEYS
gpg --verify apache-tez-0.10.4-src.tar.gz.asc apache-tez-0.10.4-src.tar.gz
- Untar; check LICENSE, NOTICE, DISCLAIMER.
- Build: mvn clean install -DskipTests
My +1.
Thanks,
<First Last>
Send to dev@tez.apache.org. Subject [VOTE] Apache Tez 0.10.4 RC1.
What Voters Verify
A binding +1 is not just trust. It carries a check. PMC voters typically:
| Check | Command / location |
|---|---|
| Source artifact downloads | wget from dist.apache.org/repos/dist/dev/tez/... |
| Signature is valid and from a Tez committer | gpg --verify against KEYS file |
| SHA-512 matches | sha512sum -c |
| LICENSE is correct and current | Read it |
| NOTICE reflects bundled third-party | Read it; cross-check against LICENSE |
| DISCLAIMER present if incubating (not for Tez since 2014) | Check |
| No binary files in source tree | find apache-tez-X.Y.Z-src -type f \( -name '*.jar' -o -name '*.class' \) |
| Apache RAT clean | mvn apache-rat:check |
| Builds clean | mvn clean install -DskipTests |
| Tests pass (optional but valued) | mvn test |
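The binary-file check in the table is easy to get subtly wrong, because find's -o binds more loosely than the implicit "and". A minimal bash sketch with the two -name tests grouped; the function names find_binaries and source_tree_is_clean are hypothetical:

```shell
# Hypothetical helper: list compiled artifacts under an unpacked source tree.
# The escaped parentheses group the two -name tests so that -type f applies
# to both of them, not only to the '*.jar' test.
find_binaries() {
  local src_dir="$1"
  find "$src_dir" -type f \( -name '*.jar' -o -name '*.class' \) | sort
}

# Succeeds only when the tree contains no .jar or .class files.
source_tree_is_clean() {
  [ -z "$(find_binaries "$1")" ]
}
```

Any hit still needs a human decision. The sketch only tells you where to look.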
A voter who finds anything wrong with the source tarball can -1. Common -1 reasons:
| Reason | Severity |
|---|---|
| Missing or broken signature | Vetoes (must respin) |
| MD5 / SHA-1 only | Vetoes |
| Binary files in source tree | Vetoes |
| Missing or wrong LICENSE | Vetoes |
| Missing or wrong NOTICE | Vetoes |
| GPL or category-X dep | Vetoes |
| RAT failure | Vetoes |
| Apache headers missing | Vetoes |
| Failed unit tests of significance | Usually vetoes |
| Build failure | Vetoes |
| Documentation issue | Often non-blocking, opinion |
Vote Pass Criteria
The release passes if, after the 72-hour minimum:
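- At least 3 binding +1 votes from PMC members.
- More binding +1 than binding -1 votes. Non-binding votes are advisory and do not count toward the majority.
- No unaddressed binding -1. ASF release votes are majority votes and cannot be formally vetoed, but release managers normally respin rather than ship over a legitimate objection.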
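The criteria above reduce to a few integer comparisons. A minimal bash sketch, assuming the majority is tallied over binding votes only; the function name release_vote_passes and its three arguments are hypothetical:

```shell
# Hypothetical helper: decide whether a release vote passes.
# Arguments:
#   $1  number of binding +1 votes
#   $2  number of binding -1 votes
#   $3  number of binding -1 votes whose objection is still unaddressed
# Assumption: only binding (PMC) votes are tallied.
release_vote_passes() {
  local binding_plus="$1"
  local binding_minus="$2"
  local unaddressed_minus="$3"
  [ "$binding_plus" -ge 3 ] &&
    [ "$binding_plus" -gt "$binding_minus" ] &&
    [ "$unaddressed_minus" -eq 0 ]
}
```

For example, release_vote_passes 4 1 0 succeeds, while release_vote_passes 2 0 0 fails on the three-vote minimum.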
If criteria fail:
- Extend the vote by 24–48 hours and ask explicitly for more attention.
- Or cancel and roll RC2 with the fixes.
Closing the Vote
The release manager closes:
Subject: [VOTE][RESULT] Apache Tez 0.10.4 RC1
Hi all,
The vote on Apache Tez 0.10.4 RC1 has passed.
Binding +1: <names of PMC voters>
Non-binding +1: <names>
0: <names>
-1: <names with reasons, if any>
Proceeding with the release steps.
Thanks to everyone who voted.
<First>
If the vote fails:
Subject: [VOTE][RESULT] Apache Tez 0.10.4 RC1
The vote did not pass. Issues raised:
- <issue from voter>
- <issue from voter>
Rolling RC2 with these fixes. Expect a new [VOTE] thread within
<N> days.
<First>
Promoting the Release
After the vote passes:
# 1. Move source from dev to release.
svn mv \
https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-RC1 \
https://dist.apache.org/repos/dist/release/tez/0.10.4 \
-m "Releasing Apache Tez 0.10.4"
# 2. Promote Nexus staging repo to release (one-click in Nexus UI).
# 3. Tag the final release.
cd ~/tez-src
git tag release-0.10.4 release-0.10.4-rc1
git push origin release-0.10.4
# 4. Wait 24h for mirrors.
# 5. Update the Tez website with download links.
# 6. Send ANNOUNCE.
The announce email goes to announce@apache.org (BCC), dev@tez.apache.org,
user@tez.apache.org, and your usual ASF lists for downstream projects (e.g.
dev@hive.apache.org):
Subject: [ANNOUNCE] Apache Tez 0.10.4 released
The Apache Tez community is pleased to announce the release of
Apache Tez 0.10.4.
Apache Tez is an application framework that allows for a complex
directed acyclic graph of tasks for processing data. It is built
atop Apache Hadoop YARN.
Highlights:
- <user-facing change>
- <user-facing change>
Download: https://tez.apache.org/releases/0.10.4/
Release notes: https://tez.apache.org/releases/0.10.4/release-notes.html
Thanks to everyone who contributed.
The Apache Tez team
RC Iteration Patterns
A first RC often does not pass. Typical RC counts by release type:
| Release type | Typical RCs |
|---|---|
| Patch (0.10.X) | 1–2 |
| Minor (0.10.0, 0.11.0) | 2–4 |
| Major (1.0.0 if it happened) | 4+ |
Each RC means: cancel vote, fix issues, re-tag (release-X.Y.Z-rcN+1), respin tarball,
re-sign, re-stage Nexus (new staging repo), re-send [VOTE]. Plan for 1–3 weeks per
release cycle.
Common Failure Modes
| Failure | Recovery |
|---|---|
| Signature key not in KEYS file | Stop, update KEYS, restart vote |
| RAT failure on a new file | Add Apache header, respin |
| Forgot to update CHANGES.txt | Update, respin |
| Stray .class or .jar in src tree | Clean, respin |
| Missing LICENSE entry for new bundled dep | Add LICENSE entry + NOTICE if needed, respin |
| Vote got fewer than 3 binding +1 in 72h | Extend with explicit ping to PMC |
| -1 on the source artifact for a legitimate issue | Respin |
| Maven staging mistake | Drop staging repo in Nexus, re-stage |
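The first failure in the table can be caught before the vote starts. A minimal bash sketch, assuming the KEYS file includes the human-readable key listing so the fingerprint appears as hex text; the function name fingerprint_in_keys is hypothetical:

```shell
# Hypothetical helper: check whether a key fingerprint appears in a KEYS file.
# Assumption: the KEYS file contains the textual key listing, so the
# fingerprint is present as hex digits (grouped with spaces or not).
fingerprint_in_keys() {
  local fingerprint keys_file="$2"
  # Normalise the fingerprint: drop whitespace, upper-case the hex digits.
  fingerprint="$(printf '%s' "$1" | tr -d '[:space:]' | tr '[:lower:]' '[:upper:]')"
  [ -n "$fingerprint" ] || return 1
  # Strip whitespace from the file too, so grouped fingerprints
  # ("ABCD 1234 ...") and ungrouped ones compare equal.
  tr -d '[:space:]' < "$keys_file" | tr '[:lower:]' '[:upper:]' | grep -qF "$fingerprint"
}
```

This is a text match only. It does not replace importing KEYS and running gpg --verify against the artifact.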
Validation Artifacts
After this chapter you should have:
- A GPG key generated and added to the project KEYS file (if you are PMC).
- A ~/tez-notes/release-checklist.md with the seven RM steps.
- The [VOTE] and [VOTE][RESULT] templates saved.
- The discipline to never vote +1 on an RC without having checked at least the signature, the LICENSE, and a build.
- The ASF Infra Slack channel handy in case Nexus or dist.apache.org misbehaves.
The next chapter — PMC Responsibilities — covers the rest of what PMC membership entails, beyond releases.
PMC Responsibilities
PMC (Project Management Committee) membership at Apache is not a senior-engineer title. It is a stewardship role with explicit legal, brand, community, and release responsibilities. This chapter is the operational manual for what PMC members actually do between releases.
The Tez PMC list is at private@tez.apache.org. Public PMC members are listed at
https://tez.apache.org/team-list.html (or the equivalent on the current site).
The Four Buckets of PMC Work
| Bucket | Examples | Frequency |
|---|---|---|
| Legal | License headers, NOTICE file, third-party LICENSE entries, ICLA matching | Per-patch and per-release |
| Brand | Trademark protection, conference talk approvals, logo use | Quarterly to annual |
| Community | Moderating list, voting new committers, mentoring, code of conduct enforcement | Continuous |
| Releases | Voting on RCs, cutting RCs, post-release announce | Per-release |
Plus one cross-bucket duty: board reporting, quarterly.
Legal Responsibilities
License Headers
Every source file in the Tez tree must have an Apache 2.0 license header. Tez uses Apache RAT to enforce this.
cd ~/tez-src
mvn apache-rat:check
The expected header for a .java file:
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
If RAT fails on a release candidate, the release cannot ship. PMC members reviewing a release verify RAT cleanliness as part of vote-time checks (see Release Voting).
For non-.java files (.proto, .xml, .sh, .md), use the same content with the
appropriate comment delimiters.
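For a fast check on a handful of new files before running the full RAT build, a grep over the first lines is often enough. A minimal bash sketch, assuming the header starts within the first 20 lines of the file; the function names has_asf_header and list_missing_headers are hypothetical, and RAT remains the authority because it knows the project's exclusions:

```shell
# Hypothetical helper: succeed if the file's first lines contain the opening
# phrase of the ASF header.
# Assumption: the header, when present, starts within the first 20 lines.
has_asf_header() {
  head -n 20 "$1" | grep -q 'Licensed to the Apache Software Foundation (ASF)'
}

# Print every path given as an argument that is missing the header.
list_missing_headers() {
  local f
  for f in "$@"; do
    has_asf_header "$f" || printf '%s\n' "$f"
  done
}
```

A typical use is to pass it the files added by your patch, then fix whatever it prints before running mvn apache-rat:check.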
NOTICE File
The NOTICE file at the repo root carries:
- The required Apache attribution line.
- Required attribution for any bundled third-party code that explicitly demands it.
cat ~/tez-src/NOTICE
Most BSD-, MIT-, and Apache-licensed dependencies do not require NOTICE entries. Some do (notably ones with NOTICE files of their own, which by Apache convention propagate into bundlers). The rule of thumb: if a dependency ships a NOTICE file, copy the required text into Tez's NOTICE.
Common error: adding random "thanks to" lines. NOTICE is not a thank-you file; it is a legal artifact. Keep it minimal and correct.
LICENSE File
LICENSE at the repo root is the Apache License 2.0 plus appendices for any bundled
third-party code under different licenses.
For Tez, the appendices are mostly absent because the source release bundles no third-party source. The binary release (the convenience tarball) may bundle jars whose licenses must be listed in an appendix.
If you are a PMC member adding a new dependency that gets bundled in the binary release:
- Identify the dependency's license (read it, don't guess).
- Verify category (A, B, or X) — see Licensing.
- If A: update LICENSE appendix; sometimes NOTICE.
- If B: requires PMC discussion + LICENSE / NOTICE updates.
- If X: stop. Cannot be bundled. May only be a runtime-optional dep, never a hard one.
ICLA Matching
Every non-trivial contribution must come from someone with an Apache ICLA on file.
ICLAs are filed with the ASF Secretary; PMC members can verify one by emailing
secretary@apache.org with the contributor's name.
In practice, for casual contributors:
- Trivial patches (Javadoc, typo) do not require ICLA.
- Anything substantive does.
- The contributor sends the ICLA themselves; PMC verifies it landed.
If a substantial patch is committed without an ICLA on file, that is a legal exposure for the foundation. PMC members must catch this before commit.
Brand Responsibilities
"Apache Tez" is a trademark of the Apache Software Foundation. The PMC is the steward.
| Brand decision | PMC action |
|---|---|
| New logo | PMC vote, register with VP Brand Management |
| Conference talk titled "Apache Tez" | OK; speaker should follow trademark guidelines |
| Conference talk titled "Tez" without Apache | Polite ask: please use full mark |
| Third-party product named "TezCloud" | Likely refer to VP Brand; could be misleading |
| Third-party product built on Tez, named differently | OK; clarify attribution if uncertain |
| Use of the Tez feather logo in a slide deck | OK with attribution |
For specifics see the ASF Trademark and Brand Policy.
When in doubt, the PMC defers to trademarks@apache.org.
Community Responsibilities
Moderation
Most ASF mailing lists are moderated for non-subscribers (subscribers post freely). The moderation work is light: approving first posts, rejecting spam.
Tez has a small mod team (typically a couple of PMC members). Add dev-moderate@ or
similar to your mail filter to spot moderation requests.
If subscriber behavior on a list becomes problematic — flame wars, code-of-conduct violations — the PMC handles it. Typical escalation:
- Off-list private email from a PMC member to the offending subscriber.
- If unaddressed, a public on-list warning.
- If unaddressed, removal from the list (rare; requires PMC vote).
For severe cases (harassment, security threats), escalate immediately to
board@apache.org.
Voting New Committers
The committer-bit process, from the PMC's side:
1. PMC member observes a strong contributor (see meritocracy chapter).
2. PMC member emails private@tez.apache.org with [VOTE] thread.
3. PMC members vote +1 / 0 / -1 (usually +1, sometimes 0 with rationale).
4. Vote runs ~72 hours; passes with at least 3 binding +1 and no binding -1.
5. PMC member privately emails the contributor with the offer.
6. On acceptance, ASF Infra is notified to provision the ASF account.
7. PMC announces publicly on dev@.
A -1 from a PMC member on a committer vote requires a concrete reason. "Doesn't feel
right" is not enough; "two recent JIRAs showed inadequate care for compatibility" is.
PMC members may vote 0 if they don't know the contributor well — common, no shame in
it.
Voting New PMC Members
Same mechanism as committer, except:
- All committers are pre-considered, so the candidate is always a sitting committer.
- The bar is higher (judgement, willingness to do PMC work, see Meritocracy).
After the vote passes, the PMC notifies the Apache Board and then invites the candidate to the PMC.
Code of Conduct
Apache projects follow the ASF Code of Conduct. The PMC is the enforcement body within the project. Most enforcement is gentle and private. Serious cases are escalated to the board.
Release Responsibilities
Covered in detail in Release Voting. The PMC-specific elements:
- Binding +1 votes on release artifacts are PMC-only.
- At least 3 binding +1 votes are required for a release to pass.
- A PMC member is the release manager (or supervises, if a non-PMC committer is designated by lazy consensus to RM under PMC oversight).
- Post-release, a PMC member ensures the announce@apache.org mail goes out and the website is updated.
Security Reports
Security disclosures arrive at private@tez.apache.org or security@apache.org. The
process:
1. Acknowledge receipt within 48 hours.
2. PMC investigates in private; reproduce.
3. Develop a fix in a private branch (not in apache/tez until disclosure).
4. Determine severity (CVSS) and assign a CVE.
5. Coordinate disclosure timing with downstream projects (Hive, etc).
6. Cut a release containing the fix.
7. Send disclosure to oss-security and security@apache.org with CVE and details.
The discipline: never discuss security issues on public lists or public JIRA until the fix has been released and disclosure is published.
If you are new to PMC, read the ASF Security Team process before you need it.
Board Reporting
The Apache Board oversees every project via quarterly reports. The Tez PMC submits a
report each quarter (or per the schedule the board sets — currently quarterly with
projects rotated through). The chair (or a delegate) submits it via
https://reporter.apache.org/.
A standard report contains:
- Community activity (new committers, new PMC members, list activity)
- Releases since last report
- Brand or legal issues
- Health concerns the board should know about
The board looks for warning signs:
| Warning | Board concern |
|---|---|
| No releases in many quarters | Is the project dormant? |
| All committers from one company | Is the project independent? |
| Mailing-list activity falling | Is the community shrinking? |
| Code-of-conduct issues unresolved | Is the PMC functional? |
The chair is responsible for filing on time. If the report is late, the board notices.
Time Commitment
A PMC member with no other ASF roles spends roughly:
| Activity | Monthly time |
|---|---|
| Reviewing private@ traffic | 1–2 hours |
| Voting on releases (when there is one) | 1–3 hours per release |
| Voting on new committers | 30 minutes per vote |
| Board reporting (every 3 months) | 1–2 hours |
| Security incidents (when they happen) | Variable; possibly days |
| Committer work on top of PMC duties | (as before) |
A PMC member who is also chair adds the report-filing burden and acts as the project's ambassador to the board.
Stepping Back
PMC membership is permanent until you step back. Emeritus PMC status exists for those who have stepped away from active project work but want to remain available for consultation.
To go emeritus:
Subject: [NOTICE] Going emeritus PMC
Hi all,
Effective <date>, I'm moving to emeritus PMC status on Tez. My
involvement in the project has tapered and I want the active PMC
to reflect who's currently doing the work.
Please feel free to reach out if you ever want a sanity check on
something I worked on historically.
Thanks for the years of collaboration.
<First>
The PMC removes you from the active count. You retain your ASF account; you may return to active status later by vote.
Validation Artifacts
After this chapter:
- A ~/tez-notes/pmc-duties.md listing the four buckets and a one-line example of each.
- A subscription to private@tez.apache.org (when you are PMC).
- Knowledge of how to verify an ICLA, how to find the trademark policy, how to file a board report.
- A reflex to escalate security reports to private@ immediately and never discuss them publicly until disclosure.
The next chapter — Licensing — drills into the legal bucket: ALv2, LICENSE/NOTICE rules, and category A/B/X.
Licensing
Apache licensing is precise. The rules are not "be reasonable about open source"; they are a specific framework administered by Apache Legal. Getting them wrong blocks a release. This chapter is the working knowledge needed by committers and PMC, plus the bits every contributor should know before adding a dependency.
The Apache License 2.0
Apache Tez is licensed under the Apache License, Version 2.0 ("ALv2"). This is a permissive license that allows:
- Use, reproduction, modification, distribution
- Commercial use
- Patent grant (explicitly, unlike MIT/BSD)
- Sublicensing under different terms (with attribution)
In exchange:
- You include the LICENSE and NOTICE in distributions
- You note significant modifications
- You preserve attribution and patent grants
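Practically, ALv2 is a permissive, non-copyleft license. It is compatible with most other open-source licenses; the notable exception is GPL 2.0. Compatibility with GPL 3.0 is one-way: ALv2 code can be included in a GPL 3.0 work, but GPL 3.0 code cannot be included in an ALv2 work.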
The Three Files in the Tez Repo Root
| File | Purpose |
|---|---|
| LICENSE | The Apache License 2.0 text, plus appendices for any bundled third-party code under different licenses |
| NOTICE | Required attributions for bundled code (Apache + any NOTICE-bearing deps) |
| KEYS (in dist, not repo) | PGP keys used to sign releases |
ls ~/tez-src/LICENSE ~/tez-src/NOTICE
cat ~/tez-src/NOTICE
For Tez source releases, LICENSE and NOTICE are typically short — the source tarball
bundles no third-party code. For convenience binary releases, both grow with the bundled
jars.
Category A / B / X — The Dependency Classes
Apache Legal classifies third-party licenses into categories. The full list is at Apache Legal Resolved. Summary:
| Category | Meaning | Examples | Can it be a Tez dependency? |
|---|---|---|---|
| A | Compatible with ALv2 | ALv2, MIT, BSD 2/3-clause, ISC | Yes; document in LICENSE/NOTICE if bundled |
| B | Compatible with conditions | EPL 1.0/2.0, CDDL 1.0/1.1, MPL 1.1/2.0, IBM Public License 1.0 | Yes, but only as bundled binary, not source. Add LICENSE/NOTICE entry. |
| X | Incompatible | GPL (any version), AGPL, LGPL 2.0/2.1 (see the hard cases below), SSPL, BUSL, CC-BY-NC | No. May not be bundled in any release. Runtime optional dep only, with care. |
The hard cases:
- LGPL is category X for binary distribution but acceptable as an optional runtime dependency. Be careful; this is one of the most-asked questions on legal-discuss@apache.org.
- CC-BY-SA and other ShareAlike licenses depend on the work: data and documentation are sometimes B, sometimes X.
- Bespoke licenses (custom permissive licenses) must be reviewed before use.
If you are uncertain, post on legal-discuss@apache.org with a link to the license text.
Don't guess.
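A lookup like the one in the category table can live in your notes as a small function. A minimal bash sketch covering only a handful of licenses from the table; the function name license_category and the SPDX-style identifiers used as keys are assumptions for illustration, and Apache Legal's published list is the authority:

```shell
# Hypothetical lookup: map a license identifier to its ASF category.
# Only a few licenses from the category table are covered. Anything else
# is UNKNOWN and should go to legal-discuss@apache.org rather than be guessed.
license_category() {
  case "$1" in
    Apache-2.0|MIT|BSD-2-Clause|BSD-3-Clause|ISC) echo "A" ;;
    EPL-1.0|EPL-2.0|CDDL-1.0|CDDL-1.1|MPL-1.1)    echo "B" ;;
    GPL-*|AGPL-*|LGPL-*|SSPL-*|BUSL-*)            echo "X" ;;
    *)                                            echo "UNKNOWN" ;;
  esac
}
```

The default branch is the important design choice: an unrecognised license is never silently treated as category A.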
"GPL Contamination"
Apache projects cannot ship GPL code. The rule has corollaries that catch people:
| Action | OK? |
|---|---|
| Tez code calls a GPL library via reflection at runtime | No — if the library must be present, it's a dep |
| Tez code can optionally integrate with a GPL tool the user installs themselves | Yes — runtime-optional, user-supplied |
| Tez ships a GPL jar in the binary tarball | No |
| Tez build script downloads a GPL jar during build | No (this is contamination) |
| Tez source contains a comment "see SOME GPL CODE for reference" | Risky — get review |
| Tez source copies a snippet from GPL code | No — pollutes the codebase |
The conservative rule: GPL code may exist near Tez (a user's runtime environment) but not in Tez (source or binary distribution).
Adding a New Dependency — Procedure
When a patch proposes a new third-party dependency:
- Identify the license. Open the project's LICENSE file. Don't read the GitHub "License" sidebar; it can be wrong.
- Classify. Category A, B, or X (above). If A, proceed. If B, plan for LICENSE / NOTICE updates and PMC discussion. If X, stop.
- Check transitive deps. A category-A library may pull in a category-X transitive. Use mvn dependency:tree and verify every transitive's license.
- Justify. On the JIRA, explain why this dep is needed and why no in-tree alternative suffices.
- Update LICENSE. If the dep is bundled in the binary release (it usually is), add an appendix entry naming the dep, its license, and where to find the full license text.
- Update NOTICE. If the dep ships a NOTICE file, copy the required text into Tez's NOTICE. Read the dep's NOTICE; not all of it is required.
- Test the build. Run mvn apache-rat:check and a full build. The dep should not produce RAT-flagged files (most don't).
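For the transitive-dependency step, the tree output is easier to audit once it is flattened into a unique list of coordinates. A minimal bash sketch, assuming the default text output of mvn dependency:tree (lines such as "[INFO] +- group:artifact:jar:version:scope"); the function name list_tree_dependencies is hypothetical:

```shell
# Hypothetical helper: read `mvn dependency:tree` output on stdin and print
# each distinct dependency as group:artifact:version, sorted.
# Assumption: default text output, where each dependency line carries a
# tree marker ("+- " or "\- ") followed by colon-separated coordinates.
list_tree_dependencies() {
  grep -E '(\+-|\\-) ' \
    | sed -E 's/^.*(\+-|\\-) //' \
    | awk -F: '
        NF == 5 { print $1 ":" $2 ":" $4 }   # group:artifact:type:version:scope
        NF == 6 { print $1 ":" $2 ":" $5 }   # same, with a classifier field
      ' \
    | LC_ALL=C sort -u
}
```

Usage would be along the lines of mvn dependency:tree | list_tree_dependencies, then checking each line's license by hand. The script only deduplicates; it does not know anything about licenses.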
PMC members review the dependency before commit. If you are PMC, ask:
- Is the license correctly classified?
- Is the dep maintained?
- What is the size cost (Tez binary tarball grows by N MB)?
- Are there security advisories against the version proposed?
Apache RAT in Tez Pre-commit
Apache RAT (Release Audit Tool) checks that every source file has an Apache license header. It is part of every Tez release vote and should be part of every contributor's pre-submit.
Run:
cd ~/tez-src
mvn apache-rat:check
Output on success:
[INFO] BUILD SUCCESS
Output on failure:
[ERROR] Files with unapproved licenses:
tez-dag/src/main/java/.../NewClass.java
The fix is to add the license header. The standard Java header is at the top of any existing Tez Java file; copy it.
RAT can be configured to allow certain files to be exempt (e.g. generated .proto-derived
files, META-INF/). The exemption config lives in the parent pom.xml:
grep -A20 "apache-rat-plugin" ~/tez-src/pom.xml
Adding a new file type that legitimately can't carry a header (e.g. a JSON test fixture) requires updating the exemption list and noting it in the JIRA.
License Header Template
For .java:
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
For .proto:
//
// Licensed to the Apache Software Foundation (ASF) ... (same content with // comments)
//
For .xml:
<!--
Licensed to the Apache Software Foundation (ASF) ... (same content)
-->
For .sh / .py:
#
# Licensed to the Apache Software Foundation (ASF) ... (same content with # comments)
#
For .md: by convention, no header is needed for markdown docs in the source tree, but
project policy may require one. Check mvn apache-rat:check output.
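For the line-comment formats (.proto, .sh, .py), the header can be generated from one copy of the text instead of being retyped. A minimal bash sketch; the function name asf_header_with_prefix is hypothetical, and the header text is the one shown in the .java template with the comment markers removed:

```shell
# Hypothetical helper: print the ASF header using a line-comment prefix such
# as "#" or "//". Blank header lines become the bare prefix, so the output
# carries no trailing whitespace.
asf_header_with_prefix() {
  local prefix="$1" line
  while IFS= read -r line; do
    if [ -n "$line" ]; then
      printf '%s %s\n' "$prefix" "$line"
    else
      printf '%s\n' "$prefix"
    fi
  done <<'EOF'
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
EOF
}
```

Calling asf_header_with_prefix '#' gives a header suitable for a shell script. For block-comment formats such as .xml, copying from an existing file in the tree is simpler.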
The Tez NOTICE File
A typical Tez NOTICE:
Apache Tez
Copyright 2014-YYYY The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (https://www.apache.org/).
Plus, if bundled deps require:
This product bundles SomeLibrary, which is available under
the Foo Bar License. See <path or URL>.
NOTICE is not:
- A list of contributors (that's CHANGES.txt and git).
- A thank-you list.
- A list of services or users.
Keep it minimal and legally precise.
Source vs Binary Release — Different Rules
Apache makes a sharp distinction:
| Aspect | Source release | Binary release |
|---|---|---|
| Status | Official Apache release | Convenience artifact |
| What's bundled | Source code only | Compiled jars, possibly third-party jars |
| Must have ALv2 LICENSE | Yes | Yes |
| Must have NOTICE | Yes | Yes; longer than source NOTICE |
| Must pass RAT | Yes | RAT runs against the source tree; bundled jars are not header-checked |
| Category B bundling | Not in source form | Allowed in binary form with LICENSE/NOTICE entry |
| Category X bundling | Never | Never |
Practical implication: a source release rarely bundles anything except Tez's own source.
The binary release is tez-dist/target/apache-tez-X.Y.Z-bin.tar.gz, which contains
all the runtime jars Tez depends on (Hadoop, Jackson, etc.).
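To cross-check LICENSE and NOTICE against what a binary tarball really ships, list the jars inside it. A minimal bash sketch, assuming a gzip-compressed tar archive; the function name list_bundled_jars is hypothetical:

```shell
# Hypothetical helper: print the file name of every jar inside a .tar.gz,
# one per line, sorted and de-duplicated.
list_bundled_jars() {
  local tarball="$1"
  tar -tzf "$tarball" \
    | grep '\.jar$' \
    | sed 's|.*/||' \
    | LC_ALL=C sort -u
}
```

Each jar in the output should be traceable to Tez itself or to a LICENSE entry. A jar with neither is a finding to raise before the vote.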
Common Licensing Mistakes
| Mistake | Caught by | Fix |
|---|---|---|
| New file without Apache header | mvn apache-rat:check | Add header |
| Random third-party snippet pasted into Tez | Code review | Replace with original code or pull in via dep |
| New category-B dep with no LICENSE update | PMC at release vote | Update LICENSE |
| New category-X dep | PMC at release vote | Remove dep |
| NOTICE accidentally cleared | Code review | Restore from prior release |
| Copyright (c) Company Name in a file | Code review | Replace with Apache header; Company-owned code requires CLA review |
What ICLAs and CCLAs Cover
Two contributor license agreements:
| CLA | Who signs | What it covers |
|---|---|---|
| ICLA (Individual) | An individual contributor | Their personal contributions |
| CCLA (Corporate) | A company's authorised signatory | Contributions by listed employees |
An ICLA is required for any non-trivial contribution. A CCLA is required if the contribution is made in the contributor's capacity as a company employee.
PMC members can verify ICLA status via secretary@apache.org. For a casual single-patch
contributor, the trivial-patch exception often applies and no ICLA is needed; for a
contributor on path to committer, the ICLA needs to be on file by the second or third
patch.
Validation Artifacts
After this chapter:
- A ~/tez-notes/license-categories.md cheatsheet of A/B/X with examples.
- The reflex to run mvn apache-rat:check in your pre-submit script.
- The discipline to check a new dep's category before opening a JIRA proposing it.
- The ability to read Tez's NOTICE file and confirm what each line is there for.
The next chapter — Code Style & Trust — closes the section with the operational mechanics of style enforcement and the trust ladder a contributor climbs.
Code Style & Trust
The Tez project enforces a specific code style via checkstyle. The style itself is less interesting than the trust mechanism it embodies: an automated, opinionated style is how a project of dozens of committers and hundreds of contributors keeps its codebase coherent without requiring every reviewer to argue about braces.
This chapter is the practical guide to the style, the tools that enforce it, and the trust ladder a contributor climbs from first patch to commit bit.
Where the Style Lives
Tez's checkstyle configuration:
cat ~/tez-src/tez-tools/src/main/resources/tez/checkstyle.xml
This file is the source of truth. If a reviewer says "your patch fails checkstyle," they mean this file is unhappy.
The file is invoked by the parent pom.xml:
grep -A10 "maven-checkstyle-plugin" ~/tez-src/pom.xml
Verify locally:
cd ~/tez-src
mvn checkstyle:check
On success Maven reports no violations and ends with BUILD SUCCESS (exit 0). On failure it lists each violation with file and line number.
The Rules That Matter
The full ruleset is the file above. The rules that catch contributors most often:
| Rule | What it enforces |
|---|---|
| Line length | Usually 120 chars max |
| Indentation | 2 spaces (not 4, not tabs) |
| Imports | No wildcard imports; specific order |
| Brace style | Egyptian ({ on same line) |
| Unused imports | Disallowed |
| Member ordering | Static fields, instance fields, constructors, methods |
| Trailing whitespace | Disallowed |
| Final newline | Required |
| @Override annotations | Required when overriding |
| Javadoc on public methods of @Public classes | Required |
The full list is in the file. Notable absences:
- Tez does not enforce a strict naming convention beyond standard Java (camelCase, PascalCase for classes).
- Tez does not enforce method length limits (so committers must catch overly long methods in review).
- Tez does not enforce strict cyclomatic complexity (same).
So checkstyle is a floor, not a ceiling. Passing it doesn't mean the patch is well-styled in the human sense — it means the obvious mechanical violations are absent.
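A few of the mechanical rules can be checked in well under a second, before paying for a Maven run. A minimal bash sketch using awk, assuming the 120-character limit from the table; the function name style_precheck is hypothetical, and checkstyle.xml remains the source of truth:

```shell
# Hypothetical pre-check for four mechanical rules: line length (assumed
# limit: 120), tab characters, trailing whitespace, and wildcard imports.
# Prints "file:line: message" per hit and returns non-zero if there were any.
style_precheck() {
  local file="$1"
  awk '
    length($0) > 120   { printf "%s:%d: line longer than 120 chars\n", FILENAME, NR; bad = 1 }
    /\t/               { printf "%s:%d: tab character\n", FILENAME, NR; bad = 1 }
    /[ \t]+$/          { printf "%s:%d: trailing whitespace\n", FILENAME, NR; bad = 1 }
    /^import .*\.\*;/  { printf "%s:%d: wildcard import\n", FILENAME, NR; bad = 1 }
    END                { exit bad }
  ' "$file"
}
```

Passing this says nothing about import order, member ordering, or Javadoc. Run mvn checkstyle:check before you submit regardless.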
IDE Setup
Configure your IDE to match. IntelliJ:
1. File → Settings → Editor → Code Style → Java.
2. Set Tab size: 2; Indent: 2; Continuation indent: 4.
3. Use spaces, not tabs.
4. Wrapping: hard wrap at 120.
5. Import → Class count to use import with '*': 999.
6. Final newline: required.
Or import the Hadoop / Tez IntelliJ style file if one is in the repo:
find ~/tez-src -name "*.xml" | xargs grep -l "CodeStyle" 2>/dev/null | head
Eclipse: Window → Preferences → Java → Code Style → Formatter, import an XML if one is
provided in tez-tools/.
VS Code with the Java extension: edit .vscode/settings.json per workspace:
{
"java.format.settings.url": "tez-tools/src/main/resources/tez/eclipse-formatter.xml",
"editor.tabSize": 2,
"editor.insertSpaces": true,
"files.insertFinalNewline": true,
"files.trimTrailingWhitespace": true
}
The goal: at save time, your IDE produces checkstyle-passing code.
Catching Violations Pre-Submit
The pre-submit script (from Patch Quality):
#!/usr/bin/env bash
set -e
cd ~/tez-src
mvn install -DskipTests
mvn checkstyle:check
git diff --check # detects whitespace errors
mvn test -pl tez-dag,tez-api
git diff --check is a free win — it catches trailing whitespace and conflict markers
before they reach the reviewer.
The Trust Ladder
Style is the visible surface of a deeper thing: trust. The contributor-to-committer path is a multi-step climb up a trust ladder.
Step 0: Anonymous reader.
Reads the codebase.
Trust: none required.
Step 1: First-time contributor (Javadoc fix).
Patch passes mechanical checks.
Trust to receive: a few minutes of review attention.
Step 2: Multi-patch contributor.
Several patches in over weeks/months.
Trust to receive: a sympathetic reviewer who will guide.
Trust to give: explain your reasoning on JIRA without being asked.
Step 3: Repeat contributor in one area.
Becomes recognised as an expert in that area.
Trust to receive: their +1 (non-binding) carries weight on patches in that area.
Trust to give: stay engaged on follow-up issues.
Step 4: Reviewer.
Provides non-binding +1 on others' patches with insight.
Trust to receive: PMC members notice.
Trust to give: your reviews must be substantive, not drive-by +1s.
Step 5: Committer (the bit).
Granted by PMC vote on private@.
Trust to receive: commit access to apache/tez.
Trust to give: review patches in your areas, mentor newcomers, attend to dev@.
Step 6: PMC member.
Granted later, after sustained committership.
Trust to receive: binding release vote, security-disclosure access.
Trust to give: stewardship duties (legal, brand, community, releases).
Each step takes months of consistent engagement. The ladder is asymmetric: the contribution required to climb each step grows roughly linearly, but the trust granted grows roughly exponentially.
Patterns Committers Want
Beyond mechanical style, certain patterns mark a patch as "from someone who gets it":
Use the existing logging idiom
private static final Logger LOG = LoggerFactory.getLogger(MyClass.class);
// Then in method:
LOG.info("Initialized vertex {} with {} tasks", vertexName, numTasks);
Not System.out.println. Not LOG.info("Initialized vertex " + vertexName + ...) either:
concatenation builds the string before the call, even when INFO is disabled. The
parameterized SLF4J form defers formatting until the level check has passed.
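Both anti-patterns are easy to grep for in the files your patch touches. A minimal bash sketch; the function name find_logging_smells is hypothetical, the logger is assumed to be named LOG as in the idiom above, and the pattern is deliberately simple (it only catches concatenation that directly follows the first string literal):

```shell
# Hypothetical grep-based pre-check: flag System.out/err.println calls and
# LOG calls whose message starts with a string literal followed by "+".
# Exit status follows grep: 0 when something was found, 1 when clean.
find_logging_smells() {
  grep -nE 'System\.(out|err)\.println|LOG\.(trace|debug|info|warn|error)\("[^"]*" *\+' "$@"
}
```

Treat the output as a prompt for a second look, not a verdict. Some hits may be legitimate, for example in command-line tools that print to stdout on purpose.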
Use existing helper classes
If tez-common has a TezUtils helper for serialising a config to a byte buffer, use
it. Don't write a new helper inline. Search:
grep -rn "class.*Utils" ~/tez-src/tez-common/src/main/java
Match the surrounding file's style for ambiguous things
If the file uses final on every parameter, your additions should too. If the file
uses single-letter loop variables (for (int i = 0; ...)), don't suddenly switch to
for (int taskIndex = 0; ...). Match the file.
Avoid speculative generality
Don't introduce an interface "in case we need a second implementation later." Don't add a configuration key "in case someone wants to tune this." Both increase the surface area the committer pool must maintain forever.
Cite the JIRA in non-obvious code
// TEZ-4321: handle the case where inputs is null after recovery.
if (inputs == null) {
  inputs = Collections.emptyList();
}
The comment is a permanent breadcrumb back to the design discussion.
Keep try/catch narrow
// Good
try {
  state = readState();
} catch (IOException e) {
  LOG.warn("Failed to read state for {}", id, e);
  return defaultState();
}

// Bad: catches too much
try {
  state = readState();
  process(state);          // <-- different exception domain
  publish(state);          // <-- different exception domain
} catch (Exception e) {    // <-- swallows everything
  LOG.error("Something failed", e);
}
Don't add @SuppressWarnings without justification
// Bad
@SuppressWarnings("unchecked")
public List<T> getStuff() { ... }
// Good
@SuppressWarnings("unchecked") // safe; we control all writers
public List<T> getStuff() { ... }
A bare @SuppressWarnings is a code smell that says "I didn't want to deal with the
real warning."
Use specific exception types in throws
// Bad
public DAG build() throws Exception { ... }
// Good
public DAG build() throws TezException, IOException { ... }
throws Exception defeats the type system. Reviewers will ask for specifics.
How Trust Is Withdrawn
Trust is built one patch at a time; it can also erode. Things that erode committer trust in a contributor:
| Behavior | Erosion |
|---|---|
| Ghosting a patch mid-review | Significant; reviewer's time wasted |
| Re-attaching the same patch without addressing comments | Significant; wastes another review cycle |
| Arguing without evidence | Moderate; teaches reviewer to expect friction |
| Pinging weekly | Moderate; reviewer learns to deprioritise |
| Submitting a patch that breaks tests | Mild if rare; serious if pattern |
| Committing your own patch without review (as committer) | Serious; loss of community trust |
| Reverting another committer's work without discussion | Very serious; potential PMC issue |
| Public criticism of a committer for their review | Very serious |
Most of these are recoverable: explain, apologise, and address the underlying issue, and trust returns.
The non-recoverable cases are code-of-conduct violations. The PMC handles these privately.
From First Patch to Commit Bit — The Arc
A realistic 12-month arc for a contributor on the path:
Month 1     First Javadoc fix. Review takes 2 weeks (reviewer wasn't sure).
            You learn the patch generation workflow.
Month 2     Three small bug fixes. Review faster (reviewer knows you).
            You learn checkstyle, run it pre-submit.
Month 3     Mid-sized refactor. Two review rounds, no friction.
            You start filing follow-up JIRAs from things you notice.
Month 4-5   You review someone else's patch with a substantive +1.
            A PMC member notices on dev@.
Month 6     First design discussion on a JIRA. You write a one-page design.
            Review goes well; consensus reached.
Month 7-8   You're patch-author on the implementation. Three review rounds.
            Final commit feels routine.
Month 9     You shepherd a new contributor through their first patch.
            PMC notices.
Month 10    You're proposed on private@. Vote passes.
            You're a committer.
Month 11    You commit your first patch (someone else's, reviewed by you).
            You explicitly don't commit your own work unreviewed.
Month 12    You're routine. You review 2-3 patches a month, file 2-3.
            The flywheel.
This is one path, not the only path. Some contributors hit the bit at month 6 (extremely sustained activity); some at month 24+ (slower but steady). The trust ladder doesn't have a clock; it has a contribution count + sustained behavior pattern.
Validation Artifacts
After this chapter:
- Your IDE is configured to produce checkstyle-passing code at save time.
- Your pre-submit script runs mvn checkstyle:check and git diff --check.
- A ~/tez-notes/style-patterns.md listing the "patterns committers want" above.
- A clear-eyed estimate of where you are on the trust ladder, and what step is next.
This chapter closes the Release & PMC Reality section. The next major section, Hive-on-Tez Labs, is operational engineering at the Tez/Hive boundary — the most common production context for Tez today.
Capstone Project
The Capstone is the bridge from "I have read the Tez codebase" to "I have shipped a non-trivial fix that an Apache Tez committer merged into master." Everything in Levels 1–7 was preparation. This is the work.
You will pick one real, open Apache Tez JIRA, reproduce it against a current build, trace the failure through the codebase, identify the root cause, write a minimum-diff patch with deterministic tests, get it through precommit (Yetus / GitHub Actions), respond to review comments, and land the change. Then you write it up so the next person can learn from your investigation.
This chapter is the table of contents. The ten step-chapters that follow are the work itself.
Prerequisites
Do not start the Capstone until you can answer "yes" to every one of these:
- Levels 1–7 complete. You can read DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl, AsyncDispatcher, the shuffle path (ShuffleManager, Fetcher, MergeManager), and at least one VertexManagerPlugin (ShuffleVertexManager or RootInputVertexManager) without a guide open.
- You have built Tez from source. mvn clean install -DskipTests succeeds on your machine, and mvn test -pl tez-dag finishes (some flakes are normal; see Stage 9 of the issue roadmap).
- You have run MiniTezCluster locally. mvn test -pl tez-tests -Dtest=TestOrderedWordCount goes green.
- You have a working JIRA + Apache ID (or a GitHub account ready to PR).
- You have read the Tez contribution guide: https://tez.apache.org/contribution_guide.html and https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute.
If any of these is "no," stop. Go back. The Capstone is unforgiving of partial preparation — you will spend three weeks confused instead of three weeks shipping.
The 10-Step Flow
flowchart TD
A[Step 1: Issue Selection] --> B[Step 2: Reproduction]
B --> C[Step 3: Execution Path Analysis]
C --> D[Step 4: Root Cause Identification]
D --> E[Step 5: Implementation]
E --> F[Step 6: Testing]
F --> G[Step 7: Validation]
G --> H[Step 8: Patch / PR]
H --> I[Step 9: JIRA + Docs]
I --> J[Step 10: Engineering Write-Up]
G -.fail.-> D
F -.fail.-> E
H -.review.-> E
The dotted arrows are the loops you will actually run. Nobody gets root cause right on the first hypothesis. Nobody passes precommit on the first push. Plan for two or three iterations through Steps 4–8 before you land.
Deliverables
By the time you mark the Capstone done, every one of these artifacts exists:
| # | Artifact | Lives in |
|---|---|---|
| 1 | Failing reproducer test (a JUnit test that fails on master without your patch and passes with it) | tez-tests/ or a module-local src/test/java/... |
| 2 | Root-cause document (200–500 words, with file:line citations) | capstone-work/root-cause.md in your fork |
| 3 | Minimum-diff patch | A branch on your fork of apache/tez |
| 4 | Unit tests using DrainDispatcher / mock dispatcher (if state-machine related) | The relevant src/test/java |
| 5 | Integration test using MiniTezCluster (if end-to-end behavior changed) | tez-tests/src/test/java/org/apache/tez/test/ |
| 6 | Validation report (output of mvn test -pl <module>, checkstyle, spotbugs, RAT) | capstone-work/validation.md |
| 7 | GitHub PR against apache/tez:master (or .patch file attached to JIRA) | https://github.com/apache/tez/pulls |
| 8 | JIRA updated: status = "Patch Available," PR linked, release-notes filled if user-visible | https://issues.apache.org/jira/browse/TEZ-NNNN |
| 9 | Engineering write-up (500–1000 words: problem, investigation, design, alternatives, lessons) | Personal blog, Apache wiki page, or dev@ summary |
Every one. No exceptions. The write-up is not optional — it is how the community (and your future self) learns from your investigation.
100-Point Rubric Summary
The full rubric lives in evaluation-rubric.md. Headline:
| Area | Weight |
|---|---|
| Problem articulation (symptom vs. root cause separation, conditions) | 20 |
| Execution-path mastery (file:line citations, diagram, accuracy) | 20 |
| Implementation quality (minimum diff, conventions, no scope creep) | 20 |
| Testing (unit + integration, deterministic, coverage) | 15 |
| Review responsiveness (addresses comments, iteration cadence) | 10 |
| Documentation (JIRA, code comments, write-up) | 10 |
| Community interaction (mailing-list etiquette, handoff hygiene) | 5 |
Tier thresholds:
- 80+ — credible Tez contributor. You can sustain a steady patch flow.
- 90+ — committer-ready. You are doing work a committer would do without hand-holding.
- 95+ — PMC-track. You are leading work others want to follow.
You will self-grade in Step 10. Be honest. Inflated self-grades are visible from orbit when a committer reads your write-up.
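A minimal sketch of the arithmetic, offered as a self-grading aid. The weights and thresholds are copied from the tables above; the class name, method names, and the "below threshold" label are assumptions for illustration and are not taken from evaluation-rubric.md.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CapstoneRubric {
  // Area weights from the rubric summary table.
  static final Map<String, Integer> WEIGHTS = new LinkedHashMap<>();

  static {
    WEIGHTS.put("Problem articulation", 20);
    WEIGHTS.put("Execution-path mastery", 20);
    WEIGHTS.put("Implementation quality", 20);
    WEIGHTS.put("Testing", 15);
    WEIGHTS.put("Review responsiveness", 10);
    WEIGHTS.put("Documentation", 10);
    WEIGHTS.put("Community interaction", 5);
  }

  // Sanity check: the weights should add up to the advertised 100 points.
  static int totalWeight() {
    return WEIGHTS.values().stream().mapToInt(Integer::intValue).sum();
  }

  // Maps a 0-100 score onto the tier names listed above.
  static String tier(int score) {
    if (score >= 95) return "PMC-track";
    if (score >= 90) return "committer-ready";
    if (score >= 80) return "credible Tez contributor";
    return "below threshold"; // hypothetical label; the rubric names no tier here
  }

  public static void main(String[] args) {
    System.out.println("total weight = " + totalWeight());
    System.out.println("score 92 -> " + tier(92));
  }
}
```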
Timeline
The Capstone is a 4–6 week effort if you have one focused evening per weekday plus weekend mornings. Less than that and you risk losing context between sessions (which is far more expensive than people expect for state-machine code).
| Week | Steps | Hours |
|---|---|---|
| 1 | 1–2: Pick an issue, build a deterministic reproducer | 10–15 |
| 2 | 3–4: Trace execution, identify root cause | 12–18 |
| 3 | 5–6: Implement fix, write unit + integration tests | 12–18 |
| 4 | 7–8: Validate, prepare patch / PR, push | 8–12 |
| 5 | 8–9: Review iteration (two or three rounds is normal) | 6–10 |
| 6 | 10: Write-up, JIRA cleanup, retrospective | 4–6 |
If you blow past six weeks, that is a signal — not a failure. Either the issue is
larger than it looked (in which case, pause and renegotiate scope in the JIRA), or
you are stuck on a specific step (in which case, ask on dev@tez.apache.org).
Success Indicators
You will know it is working when:
- A committer comments "+1" or "LGTM, will commit shortly" on your PR.
- Your fix appears in git log apache/master, with a (cherry picked from commit ...) commit landing on the next release branch.
- The JIRA you claimed flips to "Resolved / Fixed in X.Y.Z" with your name on it.
- Your write-up gets traffic: search-engine hits, a comment from another contributor, a question on user@.
- The next time you pick a JIRA, you reach root cause in days, not weeks.
You will know it is failing when:
- You are still editing files in Step 5 with no failing test in hand from Step 2.
- Your PR description says "I think this might fix it."
- You have not run mvn test -pl tez-dag end-to-end in over a week.
- You are arguing in PR comments instead of changing code or asking questions.
If you spot a failure signal, do not push through. Stop, reread the relevant step chapter, and reset.
How to Use This Chapter
Read all ten step-chapters once, end-to-end, before you start Step 1. You need the shape of the whole journey in your head — Step 4 (root cause) makes choices that Step 6 (testing) depends on; Step 8 (patch) assumes you have artifacts from Steps 2 and 7. Skim now, deep-read each as you arrive at it.
Then go to Step 1: Issue Selection. Pick the issue. The clock starts when you comment "Working on this" on the JIRA.
Validation / Self-check
Before starting Step 1, confirm:
- You can produce, from memory, the file path of DAGAppMaster, DAGImpl, VertexImpl, TaskImpl, and AsyncDispatcher.
- mvn clean install -DskipTests completes against your local ~/tez-src/ clone.
- mvn test -pl tez-tests -Dtest=TestOrderedWordCount passes.
- You have a capstone-work/ directory in your fork ready for the root-cause.md, validation.md, and writeup.md deliverables.
- You have skimmed every step-chapter once.
- You have set aside 4–6 calendar weeks with a realistic time budget.
- You have subscribed to dev@tez.apache.org (send subscribe to dev-subscribe@tez.apache.org) and issues@tez.apache.org.
Step 1: Issue Selection
Picking the wrong issue is the most expensive mistake in the Capstone. Two weeks of
investigation on a JIRA that turns out to be a duplicate, a WONTFIX, or a
multi-month rearchitecture is two weeks you do not get back. The goal of this step
is not to find a perfect issue. It is to find a tractable issue that exercises
the parts of Tez you actually know.
Budget: 1–3 days. If you are past day 4 and still triaging, your standards are too high.
Where the Real Issues Live
Apache Tez tracks issues in JIRA at:
https://issues.apache.org/jira/projects/TEZ
There is no good-first-issue label on Tez (unlike Hadoop). The closest
proxies are newbie, very small subtasks of larger umbrellas, and stale
unassigned bugs with reproducers attached. You will write your own JQL.
Starter JQL Queries
Run these in JIRA's "Advanced" search box. Open each in a separate tab; do not chase one result before you have seen the whole landscape.
1. Unassigned open bugs, sorted by recency:
project = TEZ AND status in (Open, "In Progress")
AND assignee is EMPTY
AND type = Bug
ORDER BY created DESC
2. Bugs with reproducers attached (the gold standard):
project = TEZ AND status = Open
AND type = Bug
AND attachments is not EMPTY
ORDER BY updated DESC
3. Newbie-labeled (small surface area):
project = TEZ AND status = Open
AND (labels = newbie OR labels = beginner OR labels = "low-hanging-fruit")
ORDER BY priority DESC, created DESC
4. Flaky tests (Stage 9 territory, often great Capstone fodder):
project = TEZ AND status = Open
AND (summary ~ "flaky" OR summary ~ "intermittent" OR description ~ "flaky")
ORDER BY votes DESC
5. Open bugs touching modules you know:
project = TEZ AND status = Open AND type = Bug
AND (component in ("tez-dag", "tez-runtime-internals", "tez-runtime-library")
OR summary ~ "VertexImpl"
OR summary ~ "ShuffleManager"
OR summary ~ "AsyncDispatcher")
ORDER BY created DESC
Cast a wide net. Pull 20+ candidates into a scratchpad. You will trim aggressively.
Triage: Pick 5 Finalists from 20
For each candidate, spend 10–15 minutes — no more — answering this single question: "Could I write a failing test for this today?" If "no" or "I have no idea," drop it. If "probably yes, here's how," keep it.
Concrete triage protocol:
- Read the JIRA description and every comment. Watch for "I cannot reproduce" or "this is a duplicate of TEZ-XXXX" buried at the bottom.
- Check git log --grep "TEZ-NNNN" in your ~/tez-src/ clone. Has it already been partially fixed?
- Search the dev@ mailing list archive for the issue number: https://lists.apache.org/list.html?dev@tez.apache.org.
- Open the linked files in your editor. Are they in tez-dag, tez-runtime-*, tez-api (familiar territory), or tez-ui, tez-plugins, tez-yarn-timeline-* (less familiar; skip unless you specifically studied them)?
- Note the Affects-Versions field. If it only affects 0.8.x and master has been rewritten in the area, the fix may not be portable.
Keep the 5 finalists in a markdown table:
| TEZ-NNNN | Title | Component | Reproducer? | Last activity | My read |
|---|---|---|---|---|---|
| TEZ-4321 | Fetcher hangs on connection reset | tez-runtime-library | none | 2024-11 | Plausible; I know ShuffleManager |
| TEZ-4456 | VertexImpl NPE on V_ROUTE_EVENT after kill | tez-dag | stack trace only | 2025-02 | Race-y; familiar state machine |
| ... | | | | | |
Scoring Rubric
Score each finalist 0–2 in each column. The winner is the highest aggregate.
| Criterion | 0 | 1 | 2 |
|---|---|---|---|
| Clarity | Description is one sentence and ambiguous | Description names symptom but not conditions | Clear symptom + reproduction conditions in description |
| Scope | Open-ended ("refactor X") | Bounded but spans modules | Bounded to one or two classes |
| Isolation | Requires Hive/Pig running | Needs MiniTezCluster | Can be reproduced in pure unit test |
| Testability | No clear failing assertion possible | Failing assertion possible after MiniTezCluster run | Failing assertion possible in DrainDispatcher test |
| Alignment | Touches code I have never read | Touches one familiar class | Touches 2–3 classes I have studied in Levels 4–6 |
| Community engagement | Last activity > 2 years, no watchers | Some activity in last year | Recently discussed; a committer responded |
Total possible: 12. Anything below 7 is risky. Pick the 9+ candidate.
Three Worked Examples
These are illustrative archetypes, not literal current JIRAs.
Candidate A: "ShuffleManager retries forever on IOException: Connection reset"
- Clarity: 2 (description names the exception and the loop).
- Scope: 2 (one class: ShuffleManager or Fetcher).
- Isolation: 1 (need a fake Fetcher to inject the exception).
- Testability: 2 (mock-based unit test with retry counter assertion).
- Alignment: 2 (you read this in Level 5).
- Community engagement: 1 (one committer comment, no resolution).
- Total: 10. Pick this.
Candidate B: "Refactor DAGImpl state machine to use enum-based transitions"
- Clarity: 1 (vague — "refactor").
- Scope: 0 (touches DAGImpl, every event handler, every test).
- Isolation: 0 (no failing behavior to test).
- Testability: 0 (regression-only testing).
- Alignment: 1 (you know DAGImpl but this is huge).
- Community engagement: 0 (no committer +1).
- Total: 2. Skip. This is a months-long design proposal, not a bug.
Candidate C: "Container reuse logs say assigned then released for same container"
- Clarity: 2 (you can pull the log lines from the description).
- Scope: 1 (touches TaskSchedulerManager and possibly YarnTaskSchedulerService).
- Isolation: 0 (need MiniYARNCluster: slow, flaky, environment-sensitive).
- Testability: 1 (assertions are on log content + scheduler state).
- Alignment: 1 (you read TaskSchedulerManager once).
- Community engagement: 2 (recent discussion).
- Total: 7. Borderline. Pick only if you have no candidate above 8 and you budget extra time for the YARN harness.
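The three totals above can be checked mechanically. This is a minimal sketch, assuming the six criteria are passed in the order they appear in the rubric; the class name, method names, and verdict strings are hypothetical, with cut-offs taken from "anything below 7 is risky" and "pick the 9+ candidate".

```java
import java.util.Arrays;

final class IssueScore {
  // Order: clarity, scope, isolation, testability, alignment, community engagement.
  static int total(int... scores) {
    if (scores.length != 6) {
      throw new IllegalArgumentException("expected 6 criteria, got " + scores.length);
    }
    for (int s : scores) {
      if (s < 0 || s > 2) {
        throw new IllegalArgumentException("score out of range 0-2: " + s);
      }
    }
    return Arrays.stream(scores).sum();
  }

  // Translates an aggregate (0-12) into the guidance given in the rubric text.
  static String verdict(int total) {
    if (total >= 9) return "pick";
    if (total >= 7) return "borderline";
    return "risky";
  }

  public static void main(String[] args) {
    int a = total(2, 2, 1, 2, 2, 1); // Candidate A
    int b = total(1, 0, 0, 0, 1, 0); // Candidate B
    int c = total(2, 1, 0, 1, 1, 2); // Candidate C
    System.out.println("A=" + a + " " + verdict(a));
    System.out.println("B=" + b + " " + verdict(b));
    System.out.println("C=" + c + " " + verdict(c));
  }
}
```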
Claiming the Issue
Once you decide, claim it publicly. This is non-negotiable — it prevents wasted work by others, and it commits you.
JIRA comment template
Hi — I'd like to work on this as part of an extended Tez learning project.
My plan:
1. Build a deterministic reproducer (target: <date+1 week>).
2. Root-cause analysis (target: <date+2 weeks>).
3. Patch + tests posted for review (target: <date+4 weeks>).
I'll post weekly updates here. If anyone with context has pointers on
<specific question, e.g. "whether this race was discussed in TEZ-NNNN">,
I'd be grateful. Otherwise I'll start on the reproducer this week.
— <Your Name>
Then assign the JIRA to yourself (you need a JIRA account; the Tez PMC grants contributor role on request — comment "please grant contributor role" on any issue and a PMC member will action it within a few days).
If you get no response in 5 business days
Post to dev@tez.apache.org:
Subject: [TEZ-NNNN] Working on this — any context before I dive in?
Hi all,
I left a comment on TEZ-NNNN <link> last week saying I plan to work on it. No
objections so far, so I'm starting on a reproducer this week. If anyone has
historical context — especially whether this overlaps with TEZ-XXXX — please
shout. Otherwise I'll update the JIRA as I make progress.
Thanks,
<Your Name>
If still no response after another week, proceed. Silence on a small bug is permission. (Silence on a redesign proposal is not — different beast.)
Red Flags: Issues to Skip
- Last comment is from a committer saying "we should think about this more." You are not the right person to land a design call.
- Open for >5 years with multiple abandoned patches. Something is structurally hard. Not Capstone material — pick later.
- Touches tez-ui (Ember 1.x). The UI is on a separate lifecycle; build and test setup is divergent from the JVM modules you studied.
- "Upgrade dependency X to version Y." Looks easy, ends up rebuilding the shuffle stack to handle a Guava API change. Skip unless you specifically want this experience.
- Critical or Blocker priority with no patch. A committer would already be on it. If they are not, the issue may be misclassified or stale-critical.
- Reproducer requires a specific Hive version + a 1TB TPC-DS run. No.
Validation / Self-check
Before you advance to Step 2, produce:
- A markdown table of your 5 finalists with full scoring rubric, saved as capstone-work/issue-shortlist.md.
- The TEZ-NNNN number of your chosen issue, posted as a JIRA comment claiming it.
- A 1-paragraph statement of why you picked it (which two criteria scored highest and which scored lowest).
- A self-assigned target date for Step 2 (deterministic reproducer in hand).
- Subscription confirmed to dev@tez.apache.org and the JIRA itself (click the "Start watching" eye icon).
- Your fork of apache/tez exists on GitHub with a branch named tez-NNNN-<short-slug> checked out locally.
- A note in capstone-work/issue-shortlist.md of any near-miss candidates you may revisit after the Capstone. These are your next contributions.
Step 2: Reproduction
You do not have a bug until you have a failing test. Stack traces in JIRA comments are circumstantial evidence; a deterministic, automated reproducer is proof. Until you have one, every hypothesis in Step 4 is unverifiable and every "fix" in Step 5 is theater.
Goal of this step: a JUnit test that fails on a clean checkout of apache/tez:master
without your patch, in under two minutes, on five out of five runs.
Where Reproducers Live
MiniTezCluster is the Tez-specific harness that boots an in-process YARN
cluster plus a DAGAppMaster against the local filesystem. It is the closest
thing to a real deployment that you can debug from your IDE.
find ~/tez-src/tez-tests -name "MiniTezCluster.java"
# tez-tests/src/test/java/org/apache/tez/test/MiniTezCluster.java
Read it first, then read one consumer:
grep -n "MiniTezCluster" \
~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestTezJobs.java
grep -n "MiniTezCluster" \
~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestOrderedWordCount.java
TestTezJobs is the canonical "wire up a real cluster, submit a small DAG, assert
on the output" example. TestOrderedWordCount is the lighter-weight end-to-end
sanity check.
For pure unit-level reproducers (no YARN, no shuffle), use the patterns in:
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java
These use DrainDispatcher (a test-only subclass of Hadoop's AsyncDispatcher whose
await() blocks until the event queue has drained, which lets you control event
ordering deterministically). See Step 6 for the full pattern.
Three Reproducer Templates
Pick the template that matches your issue type.
Template A: Race-Condition Reproducer (state-machine level)
When the bug is "two events arrive in an unexpected order and the state machine
NPEs / wedges / drops a task," you need DrainDispatcher plus controlled event
ordering. No MiniTezCluster.
package org.apache.tez.dag.app.dag.impl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.tez.dag.api.oldrecords.TaskState;
import org.apache.tez.dag.app.AppContext;
import org.apache.tez.dag.app.dag.VertexState;
import org.apache.tez.dag.app.dag.event.VertexEvent;
import org.apache.tez.dag.app.dag.event.VertexEventSourceTaskAttemptCompleted;
import org.apache.tez.dag.app.dag.event.VertexEventTaskCompleted;
import org.apache.tez.dag.app.dag.event.VertexEventType;
import org.apache.tez.dag.records.TezTaskID;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import static org.junit.Assert.assertEquals;

public class TestVertexImplTezNNNNRepro {
  private DrainDispatcher dispatcher;
  private VertexImpl vertex;
  private AppContext appContext;

  @Before
  public void setUp() {
    dispatcher = new DrainDispatcher();
    // vertexEventHandler(), MockAppContext.create(), and createVertex() are
    // placeholders. Use the same wiring as TestVertexImpl; read its setUp() carefully.
    dispatcher.register(VertexEventType.class, vertexEventHandler());
    dispatcher.init(new Configuration()); // a YARN service must be inited before start()
    dispatcher.start();
    appContext = MockAppContext.create();
    vertex = createVertex(appContext, dispatcher);
    vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
    dispatcher.await();
  }

  @After
  public void tearDown() {
    dispatcher.stop();
  }

  @Test
  public void reproTaskCompletionBeforeRouteEvent() throws Exception {
    // 1. Drive vertex to RUNNING.
    vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_START));
    dispatcher.await();
    assertEquals(VertexState.RUNNING, vertex.getState());

    // 2. Inject a task completion BEFORE the V_ROUTE_EVENT that the bug requires
    //    has been processed. This is the race window from the JIRA.
    TezTaskID t0 = vertex.getTask(0).getTaskId();
    vertex.handle(new VertexEventTaskCompleted(t0, TaskState.SUCCEEDED));
    // Do NOT call dispatcher.await() yet: interleave a second event.
    vertex.handle(new VertexEventSourceTaskAttemptCompleted(...));
    dispatcher.await();

    // 3. Assertion that fails on master, passes with fix.
    //    On master the state is FAILED due to the race.
    assertEquals(VertexState.SUCCEEDED, vertex.getState());
  }
}
Key principles:
- Drive the state machine by handing events to vertex.handle() directly, not by going through a scheduler.
- Use dispatcher.await() to deterministically drain the queue between phases.
- The failing assertion is on a getState() or counter, not on log output.
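To make the await() pattern concrete, here is a toy dispatcher written against the JDK only. It is not Hadoop's DrainDispatcher and shares no code with it; the class name, the Runnable-as-event shortcut, and the labels written to the trace are assumptions for illustration. What it sketches is the contract the template relies on: events are handled on a separate thread, and await() returns only once everything queued so far has been handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class ToyDrainDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Object lock = new Object();
  private int pending = 0; // queued or in-flight events, guarded by lock
  private final Thread worker;

  ToyDrainDispatcher() {
    worker = new Thread(() -> {
      try {
        while (true) {
          Runnable event = queue.take();
          try {
            event.run();
          } finally {
            synchronized (lock) {
              pending--;
              lock.notifyAll(); // wake any thread blocked in await()
            }
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // stop() was called
      }
    }, "toy-dispatcher");
    worker.setDaemon(true);
    worker.start();
  }

  void dispatch(Runnable event) {
    synchronized (lock) {
      pending++;
    }
    queue.add(event);
  }

  // Blocks until every event dispatched so far has finished running.
  void await() throws InterruptedException {
    synchronized (lock) {
      while (pending > 0) {
        lock.wait();
      }
    }
  }

  void stop() {
    worker.interrupt();
  }

  // Two phases separated by await(), mirroring the template's structure.
  static List<String> demo() {
    ToyDrainDispatcher d = new ToyDrainDispatcher();
    List<String> trace = new ArrayList<>();
    try {
      d.dispatch(() -> trace.add("V_INIT"));
      d.await(); // phase 1 fully drained before phase 2 starts
      d.dispatch(() -> trace.add("V_START"));
      d.dispatch(() -> trace.add("V_TASK_COMPLETED"));
      d.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      d.stop();
    }
    return trace;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Because the test thread only reads the trace after await() returns, the order it observes does not depend on thread scheduling.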
Template B: Configuration / Validation Reproducer
When the bug is "setting tez.foo=bar is silently ignored / produces wrong
behavior," reproduce at the API layer.
@Test
public void testConfigKeyHonored() throws Exception {
  // TEZ_AM_FOO_BAR and FooComponent are placeholders for the key and class in your JIRA.
  TezConfiguration conf = new TezConfiguration();
  conf.set(TezConfiguration.TEZ_AM_FOO_BAR, "42");

  DAG dag = DAG.create("test-dag");
  Vertex v = Vertex.create("v1", ProcessorDescriptor.create(NoOpProcessor.class.getName()), 4);
  dag.addVertex(v);

  // The component under test reads conf, so instantiate it directly.
  FooComponent foo = new FooComponent(conf);
  // On master this returns the default (e.g. 100) because conf is ignored.
  assertEquals(42, foo.getEffectiveValue());
}
No cluster, no DAG submission. Just instantiate the class that reads the config
and assert the effective value. The fix usually changes one conf.get() call.
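The same shape can be sketched without any Tez classes. This uses java.util.Properties as a stand-in for TezConfiguration; the key demo.fetch.retries, the misspelled key, the default of 100, and the class and method names are all hypothetical. It illustrates one common form of the bug (the component looks up a key nobody sets, so the default always wins) and a one-line fix.

```java
import java.util.Properties;

final class ConfigHonoredDemo {
  static final String KEY = "demo.fetch.retries"; // hypothetical config key
  static final int DEFAULT = 100;

  // Buggy reader: looks up a misspelled key, so the user's value is never seen.
  static int buggyEffectiveValue(Properties conf) {
    return Integer.parseInt(conf.getProperty("demo.fetch.retry", String.valueOf(DEFAULT)));
  }

  // Fixed reader: the only change is the key passed to getProperty().
  static int fixedEffectiveValue(Properties conf) {
    return Integer.parseInt(conf.getProperty(KEY, String.valueOf(DEFAULT)));
  }

  // Builds a config the way the reproducer would: set the key, nothing else.
  static Properties confWith(String value) {
    Properties conf = new Properties();
    conf.setProperty(KEY, value);
    return conf;
  }

  public static void main(String[] args) {
    Properties conf = confWith("42");
    System.out.println("buggy: " + buggyEffectiveValue(conf));
    System.out.println("fixed: " + fixedEffectiveValue(conf));
  }
}
```

The reproducer's assertion is simply that the effective value equals what was set; it fails against the buggy reader and passes against the fixed one.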
Template C: Shuffle / Correctness Reproducer
When the bug is "output is wrong" (missing rows, duplicated rows, partial sort),
you need MiniTezCluster and a small DAG with deterministic input.
public class TestShuffleCorrectnessTezNNNN {
  private static MiniTezCluster mrrTezCluster;
  private static FileSystem fs;

  @BeforeClass
  public static void setup() throws Exception {
    Configuration conf = new Configuration();
    fs = FileSystem.getLocal(conf);
    mrrTezCluster = new MiniTezCluster("TestShuffleRepro", 1, 1, 1);
    mrrTezCluster.init(conf);
    mrrTezCluster.start();
  }

  @AfterClass
  public static void cleanup() throws Exception {
    if (mrrTezCluster != null) {
      mrrTezCluster.stop();
    }
  }

  @Test(timeout = 120_000)
  public void reproPartitionedOutputMissingRows() throws Exception {
    Path inputDir = new Path("/tmp/repro-input-" + System.nanoTime());
    Path outputDir = new Path("/tmp/repro-output-" + System.nanoTime());
    writeKnownInput(fs, inputDir, /*rows=*/ 10_000);

    TezConfiguration tezConf = new TezConfiguration(mrrTezCluster.getConfig());
    DAG dag = buildTwoVertexDAG(inputDir, outputDir);
    TezClient client = TezClient.create("repro", tezConf);
    client.start();
    try {
      DAGClient dagClient = client.submitDAG(dag);
      DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(null);
      assertEquals(DAGStatus.State.SUCCEEDED, status.getState());

      long outputRowCount = countRows(fs, outputDir);
      // On master this is 9_973 (27 rows lost in shuffle). With fix: 10_000.
      assertEquals(10_000L, outputRowCount);
    } finally {
      client.stop();
    }
  }
}
Build with deterministic input (fixed seed if random) so the missing-row count is reproducible across runs.
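A minimal sketch of seeded input generation, assuming a hypothetical key<TAB>value row format. The class and method names are illustrative and are not Tez helpers (the template's writeKnownInput is likewise something you write yourself). The point is only that a fixed seed yields identical rows on every run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SeededInput {
  // Generates `count` rows; the same seed always produces the same list.
  static List<String> rows(long seed, int count) {
    Random random = new Random(seed); // fixed seed => same sequence every run
    List<String> out = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      // Key is pseudo-random (so partitions are exercised); value is the row index.
      out.add("key-" + random.nextInt(1_000) + "\t" + i);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> first = rows(42L, 10_000);
    List<String> second = rows(42L, 10_000);
    System.out.println("rows: " + first.size() + ", identical: " + first.equals(second));
  }
}
```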
Logging: See What the State Machine Is Actually Doing
A reproducer without logs is half a reproducer. You will spend Step 4 staring at these logs.
Drop this into your test resources at src/test/resources/log4j.properties
(or log4j2.properties for newer modules — check which the module uses):
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss.SSS} %-5p [%t] %c{1}: %m%n
# Tez AM internals — the state-machine event log lives here
log4j.logger.org.apache.tez.dag.app.DAGAppMaster=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.DAGImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.VertexImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.TaskImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.TaskAttemptImpl=DEBUG
# Async dispatcher event flow
log4j.logger.org.apache.tez.common.AsyncDispatcher=DEBUG
# Runtime task lifecycle
log4j.logger.org.apache.tez.runtime.task=DEBUG
log4j.logger.org.apache.tez.runtime.LogicalIOProcessorRuntimeTask=DEBUG
# Shuffle internals
log4j.logger.org.apache.tez.runtime.library.common.shuffle=DEBUG
log4j.logger.org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager=DEBUG
log4j.logger.org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher=DEBUG
# Scheduler
log4j.logger.org.apache.tez.dag.app.rm.TaskSchedulerManager=DEBUG
log4j.logger.org.apache.tez.dag.app.rm.YarnTaskSchedulerService=DEBUG
The two most useful patterns to grep for in the output:
grep -E "VertexImpl|TaskImpl|TaskAttemptImpl" target/surefire-reports/*.txt \
| grep -E "state|State|Event|EVENT"
That gives you the state-transition trace, which is what you'll diagram in Step 3.
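If you would rather post-process the log in Java than with grep, the ConversionPattern above is regular enough to parse. A minimal sketch follows; the class and method names are hypothetical, and the sample line in main is invented to match the pattern, not copied from real Tez output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLineParser {
  // Mirrors "%d{HH:mm:ss.SSS} %-5p [%t] %c{1}: %m": time, level, thread, logger, message.
  private static final Pattern LINE = Pattern.compile(
      "^(\\d{2}:\\d{2}:\\d{2}\\.\\d{3}) (\\S+)\\s+\\[(.*?)\\] (\\S+): (.*)$");

  // Returns {time, level, thread, logger, message}, or null if the line does not match.
  static String[] parse(String line) {
    Matcher m = LINE.matcher(line);
    if (!m.matches()) {
      return null;
    }
    return new String[] {m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)};
  }

  // True for lines logged by the three state-machine classes the grep targets.
  static boolean isStateMachineLine(String line) {
    String[] fields = parse(line);
    if (fields == null) {
      return false;
    }
    String logger = fields[3];
    return logger.equals("VertexImpl")
        || logger.equals("TaskImpl")
        || logger.equals("TaskAttemptImpl");
  }

  public static void main(String[] args) {
    // Invented sample line, shaped like the pattern's output.
    String sample = "12:00:01.123 DEBUG [Dispatcher thread {Central}] VertexImpl: example message";
    System.out.println(String.join(" | ", parse(sample)));
  }
}
```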
Capturing container logs from MiniTezCluster
MiniTezCluster writes container logs (where your tasks' stderr/stdout end up)
under the surefire working directory:
<module>/target/<test-class>-tmpDir/<application-id>/container-logs/
Or, in newer YARN versions:
<module>/target/MiniMRYarnCluster-localDir-nm-X_Y/usercache/<user>/appcache/<app>/container_*/
Find them with:
find ~/tez-src/tez-tests/target -name "syslog" -path "*container*" -mmin -30
Read syslog (TaskAttempt logs) and stderr (uncaught exceptions). The
prelaunch.out and directory.info files explain what was actually launched.
Verify Determinism
Five runs. If even one is green, your reproducer is not deterministic yet — it is a coin flip you happen to have caught. Fix the race window before declaring victory.
cd ~/tez-src
for i in 1 2 3 4 5; do
echo "=== Run $i ==="
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q 2>&1 \
| tail -20
done
Expected output: five FAILs with the same assertion failure on the same line.
If you see 4 FAIL / 1 PASS:
- Adding a Thread.sleep is the wrong answer. (Reread Step 6.)
- Insert an explicit event ordering: drain the dispatcher between every event, inject the conflicting events as a Future you control.
- Use CountDownLatch to gate the producer thread until the consumer is at a known state.
If you cannot get to 5/5 fails, the bug may genuinely depend on external timing
(network, GC). In that case, escalate to a stress-test pattern: run the inner
test body 100x (a plain loop under JUnit 4, or @RepeatedTest under JUnit 5) and
assert that the failure rate is >50%. Less ideal but acceptable for some shuffle
race bugs.
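The latch-gating idea can be sketched in isolation. This is a minimal, JDK-only illustration; the class name, thread roles, and trace strings are hypothetical, and in a real reproducer the two sides would be the Tez components involved in the race.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class LatchGatedOrdering {
  // Forces "consumer reaches known state" to happen before "producer delivers event".
  static List<String> run() {
    List<String> trace = Collections.synchronizedList(new ArrayList<>());
    CountDownLatch consumerReady = new CountDownLatch(1);

    Thread producer = new Thread(() -> {
      try {
        consumerReady.await(); // gate: do nothing until the consumer signals
        trace.add("producer: conflicting event delivered");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "producer");
    producer.start();

    // The calling thread plays the consumer here.
    trace.add("consumer: reached known state");
    consumerReady.countDown(); // open the gate

    try {
      producer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new ArrayList<>(trace);
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Unlike a sleep, the latch makes the ordering a guarantee: the producer cannot run its step until the consumer has recorded its state, however the threads are scheduled.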
Validation / Self-check
By the end of Step 2 you must have:
- A new test file under <module>/src/test/java/... named Test<Component>Tez<NNNN>Repro.java (the Repro suffix is for your workflow; you'll rename it to a real test name in Step 6).
- The test fails on a clean ~/tez-src/ at master with an assertion error (not a setup error, not a timeout: an assertion error).
- Five consecutive runs produce the same failure on the same line.
- The failure happens in under 120 seconds per run.
- A log4j.properties snippet in src/test/resources/ enabling debug logging on the relevant Tez packages.
- A captured log excerpt (paste into capstone-work/repro-logs.txt) showing the state-machine trace at the moment of failure.
- A one-paragraph description of the failure mode in your own words, saved to capstone-work/repro-summary.md. You will refine this into the root-cause document in Step 4.
Step 3: Execution Path Analysis
You have a failing test. Now you map the path the request takes from the moment
TezClient.submitDAG() is called through every event, dispatcher hop, and state
transition until the failure manifests. This map is the foundation for every
hypothesis in Step 4. A wrong map produces a wrong root cause.
Budget: 2–4 evenings. The work is reading code, grep, and drawing.
The Canonical Submit Path
Every DAG that fails went through this skeleton path before it failed. Memorize it; you will use it as the reference axis when you sketch where your particular failure deviates.
TezClient.submitDAG(DAG)
[tez-api/src/main/java/org/apache/tez/client/TezClient.java]
|
v
TezClient.submitDAGSession() or submitDAGApplication()
| (session vs. non-session — see TezClient.java for branch)
v
DAGClientHandler.submitDAG(...)
[tez-dag/src/main/java/org/apache/tez/dag/api/client/DAGClientHandler.java]
|
v
DAGAppMaster.submitDAGToAppMaster(...)
[tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java]
|
v
DAGAppMaster.startDAG(...)
| - builds DAGImpl
| - emits DAGEventType.DAG_INIT
v
AsyncDispatcher.dispatch(DAGEvent)
[tez-dag/src/main/java/org/apache/tez/dag/app/AsyncDispatcher.java]
(uses Hadoop's hadoop-yarn-common AsyncDispatcher under the hood;
Tez subclasses it — see Tez source for the wrapper)
|
v
DAGImpl.handle(DAGEvent)
[tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java]
| state NEW --DAG_INIT--> INITED
| emits DAGEventType.DAG_START
v
DAGImpl on DAG_START
| state INITED --DAG_START--> RUNNING
| for each Vertex: emits VertexEvent V_INIT
v
VertexImpl.handle(VertexEventType.V_INIT)
[tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java]
| state NEW --V_INIT--> INITIALIZING
| invokes VertexManagerPlugin.initialize()
| on success emits V_INITED
v
VertexImpl on V_INITED -> on V_START
| state INITED --V_START--> RUNNING
| schedules tasks via TaskImpl events (T_SCHEDULE)
v
TaskImpl.handle(T_SCHEDULE)
[tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java]
| state NEW --T_SCHEDULE--> SCHEDULED
| spawns a TaskAttemptImpl, emits TA_SCHEDULE
v
TaskAttemptImpl.handle(TA_SCHEDULE)
[tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java]
| state NEW --TA_SCHEDULE--> START_WAIT
| requests container from TaskSchedulerManager
v
TaskSchedulerManager / YarnTaskSchedulerService
[tez-dag/src/main/java/org/apache/tez/dag/app/rm/]
| assigns container, emits TA_CONTAINER_LAUNCHED
v
TaskAttemptImpl receives TA_CONTAINER_LAUNCHED
| state START_WAIT --TA_CONTAINER_LAUNCHED--> RUNNING
| the container is now actually running our task
v
[ container process boots ]
TezTaskRunner2.run()
[tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java]
|
v
TezChild / TaskRunner instantiates LogicalIOProcessorRuntimeTask
|
v
LogicalIOProcessorRuntimeTask.run()
[tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java]
| initializes Inputs, Outputs, Processor
| calls Processor.run(inputs, outputs)
v
[ user code runs — e.g. OrderedWordCount or your DAG's processor ]
|
v
heartbeat -> TaskAttemptListener -> TaskAttemptImpl TA_DONE / TA_FAILED
That is the skeleton. Your job in this step is to find the segment where your failure occurs and draw it with line numbers.
Run These Greps
These greps locate the actual file paths and method bodies on your local clone.
Run them in ~/tez-src/. Each one gives you a line number to open.
# Entry: submitDAG
grep -n "public.*submitDAG" \
tez-api/src/main/java/org/apache/tez/client/TezClient.java
# Server-side intake
grep -n "submitDAG\|startDAG" \
tez-dag/src/main/java/org/apache/tez/dag/api/client/DAGClientHandler.java \
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
# DAGImpl handlers
grep -nE "addTransition|stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -40
# VertexImpl state machine
grep -nE "addTransition|stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -60
# TaskImpl state machine
grep -nE "addTransition|stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head -60
# TaskAttemptImpl state machine
grep -nE "addTransition|stateMachineFactory" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head -80
# Dispatcher
grep -n "class AsyncDispatcher\|dispatch\b" \
tez-dag/src/main/java/org/apache/tez/dag/app/AsyncDispatcher.java
# Runtime task entry
grep -n "public void run\|class TezTaskRunner2" \
tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java
grep -n "public void run\|initialize\|class LogicalIOProcessorRuntimeTask" \
tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java
Open each line in your editor. Read the transition table. Note which event you care about and which state(s) it is legal in.
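To make "which state(s) it is legal in" concrete, here is a toy transition table in plain Java. It is a sketch only: it borrows three vertex transitions from the skeleton above, and the addTransition helper here is a local stand-in whose shape is an assumption, not the real state machine factory API.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Toy model; the enums and helper are illustrative, not Tez classes. */
class TransitionTableDemo {
  enum State { NEW, INITIALIZING, INITED, RUNNING }
  enum Event { V_INIT, V_INITED, V_START }

  // state -> (event -> next state)
  private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

  static void addTransition(State from, State to, Event on) {
    TABLE.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
  }

  static {
    addTransition(State.NEW, State.INITIALIZING, Event.V_INIT);
    addTransition(State.INITIALIZING, State.INITED, Event.V_INITED);
    addTransition(State.INITED, State.RUNNING, Event.V_START);
  }

  /** Returns every state that has a registered transition for the given event. */
  static Set<State> legalStatesFor(Event event) {
    Set<State> legal = EnumSet.noneOf(State.class);
    for (Map.Entry<State, Map<Event, State>> row : TABLE.entrySet()) {
      if (row.getValue().containsKey(event)) {
        legal.add(row.getKey());
      }
    }
    return legal;
  }

  public static void main(String[] args) {
    System.out.println("V_START is legal in: " + legalStatesFor(Event.V_START));
  }
}
```

Reading the real table is the same exercise by eye: collect every row whose event column matches yours, and the "from" states of those rows are the only states in which the event is legal.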
Locate Your Specific Failure Segment
The skeleton is the highway; your bug is at one specific exit. Use these heuristics:
| Symptom in repro logs | Likely segment |
|---|---|
| VertexImpl ... transitioned from RUNNING to FAILED | VertexImpl state machine: transition on V_TASK_RESCHEDULED or V_INTERNAL_ERROR |
| TaskAttemptImpl ... NPE | TaskAttemptImpl event handlers; check container-launched and TA_DONE paths |
| NPE in AsyncDispatcher.dispatch | Race between dispatcher start/stop and event submission |
| ShuffleManager: too many fetch failures | Fetcher retry/timeout; ShuffleManager.fetchFailure() |
| IFile checksum mismatch | IFile.Writer/Reader; check spill+merge |
| OutOfMemory ... GROUP_COMPARATOR | MergeManager memory math; ifile spill thresholds |
| Container released before TA_DONE | TaskSchedulerManager reuse path; check container release races |
Once you know your segment, draw it.
Build the Path Diagram
Two formats. Do both — they validate each other.
Text-arrow form (paste into the root-cause doc)
Use this in JIRA comments and PR descriptions. It survives any rendering.
TezClient.submitDAG (TezClient.java:485)
-> DAGClientHandler.submitDAG (DAGClientHandler.java:152)
-> DAGAppMaster.startDAG (DAGAppMaster.java:1234)
-> DAGImpl NEW --DAG_INIT--> INITED (DAGImpl.java:340)
-> DAGImpl INITED --DAG_START--> RUNNING (DAGImpl.java:380)
-> VertexImpl v1 NEW --V_INIT--> INITIALIZING (VertexImpl.java:1820)
-> VertexImpl v1 INITIALIZING --V_INITED--> INITED (VertexImpl.java:1856)
-> VertexImpl v1 INITED --V_START--> RUNNING (VertexImpl.java:1901)
-> [21 TaskImpl T_SCHEDULE events fired]
-> TaskImpl t0 NEW --T_SCHEDULE--> SCHEDULED (TaskImpl.java:412)
-> TaskAttemptImpl t0.0 NEW --TA_SCHEDULE--> START_WAIT (TaskAttemptImpl.java:560)
-> [container assigned]
-> TaskAttemptImpl t0.0 START_WAIT --TA_CONTAINER_LAUNCHED--> RUNNING (...:610)
-> [container starts LogicalIOProcessorRuntimeTask]
-> ShuffleManager.run starts fetcher loop
-> Fetcher.fetchNext throws IOException (Fetcher.java:289) <-- FAILURE HERE
-> ShuffleManager.fetchFailure -> InputReadErrorEvent
-> TaskAttemptImpl t0.0 RUNNING --TA_FAILED--> FAILED
Cite real line numbers from your checkout. Future-you will thank you.
Mermaid diagram (for the write-up and PR)
sequenceDiagram
participant C as Client
participant AM as DAGAppMaster
participant D as DAGImpl
participant V as VertexImpl v1
participant T as TaskImpl t0
participant TA as TaskAttempt t0.0
participant SM as ShuffleManager
participant F as Fetcher
C->>AM: submitDAG
AM->>D: DAG_INIT
D->>D: NEW -> INITED
AM->>D: DAG_START
D->>V: V_INIT
V->>V: NEW -> INITIALIZING -> INITED
D->>V: V_START
V->>T: T_SCHEDULE
T->>TA: TA_SCHEDULE
TA->>TA: NEW -> START_WAIT
Note over TA: container assigned + launched
TA->>TA: START_WAIT -> RUNNING
TA->>SM: shuffle starts
SM->>F: fetchNext
F-->>SM: IOException
SM->>TA: InputReadErrorEvent (TA_FAILED)
TA->>TA: RUNNING -> FAILED
Both diagrams say the same thing. Together they pass review with a committer because they prove you actually read the code instead of paraphrasing the JIRA.
Verify Empirically with Temporary LOG.info() Probes
The map is a hypothesis. Confirm it with probes. Add temporary logging at the points you think your event traverses. Pattern:
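// The class you are probing most likely already declares an slf4j logger; reuse it:
//   private static final Logger LOG = LoggerFactory.getLogger(VertexImpl.class);
// Then, inside the handler you suspect (replace NNNN with your JIRA number so the
// literal prefix is greppable in both the source and the logs):
LOG.info("PROBE-TEZNNNN: V_INIT entered for vertex={} state={}",
    getName(), getState());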
Rules for probes:
- Prefix every probe with PROBE-TEZ<NNNN> so you can grep them in one pass and delete in one pass.
- Use LOG.info not LOG.debug so they appear without changing log config.
- Include the field values you care about (state, event type, IDs).
- Never commit probes. They are scaffolding for Step 4.
After re-running your test:
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q 2>&1 \
| grep "PROBE-TEZNNNN" | tee /tmp/probe-trace.txt
Compare the probe trace to your diagram. Discrepancies are the most valuable output of this whole step — they are exactly where your mental model differs from the code.
Common discrepancies to watch for:
- "I thought this handler ran once. It ran three times." (Re-entrancy bug.)
- "I thought events arrived in order A,B,C. They arrived B,A,C." (Async dispatch reordering.)
- "I thought the vertex was in RUNNING. It was in INITED." (Wrong assumption about state at the time of the event.)
When a probe surprises you, do not delete the probe. Lean in. That is the shortest path to root cause.
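Comparing the trace to the diagram can be mechanical. The sketch below assumes you have reduced both to ordered lists of short labels (the labels and the class name ProbeTraceDiff are hypothetical) and reports the first position where they disagree:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for diffing an expected event sequence against a probe trace. */
class ProbeTraceDiff {

  /** Returns the index of the first differing position, or -1 if the sequences are identical. */
  static int firstDivergence(List<String> expected, List<String> observed) {
    int common = Math.min(expected.size(), observed.size());
    for (int i = 0; i < common; i++) {
      if (!expected.get(i).equals(observed.get(i))) {
        return i;
      }
    }
    // Same prefix: they diverge where the shorter one ends, unless lengths match.
    return expected.size() == observed.size() ? -1 : common;
  }

  public static void main(String[] args) {
    List<String> diagram = Arrays.asList("V_INIT", "V_INITED", "V_START");
    List<String> trace = Arrays.asList("V_INIT", "V_START", "V_INITED");
    System.out.println("first divergence at index " + firstDivergence(diagram, trace));
  }
}
```

A length mismatch with an identical prefix is the "handler ran three times" case; a mismatch inside the common prefix is the reordering case.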
Output
Your Step 3 deliverables live in capstone-work/execution-path/:
- path-skeleton.md: text-arrow form with line numbers.
- path.mmd: the mermaid source.
- probe-trace.txt: grep output from the probe run.
- notes.md: three to five surprises you found while reading.
Validation / Self-check
Before you advance to Step 4, you must:
- Be able to name, from memory, every state transition between TezClient.submitDAG() and your failure point.
- Have file:line citations for every transition in your diagram, against your ~/tez-src/ HEAD.
- Have run the repro with PROBE-TEZ<NNNN> log statements and confirmed the sequence matches your diagram (or, more usefully, noted where it diverges).
- Have removed every probe from your working tree before any commit (git diff should not contain "PROBE-").
- Have at least one "surprise" noted in notes.md. If you have zero, you did not look hard enough.
- Be able to answer: "Which event, in which state, on which class, fires the handler that produces the failure?" in one sentence.
- Have the mermaid diagram render without syntax errors (mdbook serve your capstone-work folder, or paste into mermaid.live).
Step 4: Root Cause Identification
A symptom is "the test fails." A root cause is "this specific line, in this specific state, when this specific event arrives, performs this specific incorrect operation, because of this specific design assumption that no longer holds." If your statement does not have that shape, you have not found root cause yet.
This step is mostly thinking. The tools are five-whys, git blame, and
git bisect. The output is a 200–500 word root-cause document and a tested
hypothesis.
Five Whys, Applied to a State-Machine Race
The five-whys technique sounds trite. It is not. The discipline of asking "why" five times in a row forces you past the first plausible explanation (almost always wrong) and into the actual design defect (almost always two or three levels deeper than you initially thought).
Worked example: vertex stays in RUNNING after all tasks succeed
Symptom from Step 2: assertEquals(SUCCEEDED, vertex.getState()) fails with
expected SUCCEEDED but was RUNNING. Repro is deterministic at 5/5.
Why 1: Why is the vertex still in RUNNING?
Because the transition to SUCCEEDED requires all tasks to have completed AND
the vertex's completion handler to have been invoked. Looking at the probe trace
from Step 3, the completion handler was invoked. So the transition was attempted.
Why 2: Why did the transition not happen even though the handler ran?
Because the handler returned a new state that depends on a counter
(completedTaskCount). The probe shows completedTaskCount = 19 when the
handler ran, but the vertex has 20 tasks. So the guard says "not done yet."
Why 3: Why is the count 19 when all 20 task-completed events were fired?
Because the count is incremented inside the handler, AFTER a check that re-routes
certain V_TASK_COMPLETED events back through another handler. The re-route
fires for the 20th task (look at VertexImpl.java around line 2750 — the
if (recoveryData != null) branch). The re-routed event is queued but the
test's dispatcher.await() returns before the queue is fully drained.
Why 4: Why does dispatcher.await() return before the re-routed event
is processed?
Because AsyncDispatcher.await() waits for the current queue to drain, but
the re-route enqueues into a secondary queue (the recovery dispatcher) which
is not joined by the primary await.
Why 5: Why are there two dispatchers, and why does the test only await one?
Because recovery events were added in TEZ-2877 as a separate dispatch path to avoid blocking the main event loop during recovery replay. The test setup predates that change. The test never knew there was a second queue to wait on.
Root cause statement: The 20th V_TASK_COMPLETED event is enqueued into the
recovery dispatcher rather than handled directly when recoveryData != null,
and the test (and any caller relying on the primary dispatcher having drained)
observes a stale completedTaskCount. The fix is either to (a) join the
recovery dispatcher in await(), (b) handle the recovery-data branch
synchronously when not actually replaying recovery, or (c) document that
callers must use a different barrier.
That is a root cause. The fix direction is now obvious-ish. You can argue between (a), (b), (c) — but you know what each one changes.
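The mechanism in this worked example can be modeled in a few lines of single-threaded Java. This is a toy under loud assumptions: TwoQueueAwaitDemo and its methods are invented for illustration, the two ArrayDeques stand in for the primary and recovery dispatchers, and the re-route is hard-wired to the last task as in the narrative above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of "await joins only one of two queues"; not Tez code. */
class TwoQueueAwaitDemo {
  private final Queue<Runnable> primary = new ArrayDeque<>();
  private final Queue<Runnable> recovery = new ArrayDeque<>();
  private final int numTasks;
  private final boolean recoveryDataPresent;
  int completedTaskCount = 0;

  TwoQueueAwaitDemo(int numTasks, boolean recoveryDataPresent) {
    this.numTasks = numTasks;
    this.recoveryDataPresent = recoveryDataPresent;
  }

  /** Enqueues a task-completed event on the primary queue. */
  void taskCompleted(int taskIndex) {
    primary.add(() -> {
      if (recoveryDataPresent && taskIndex == numTasks - 1) {
        // Re-route: the increment is deferred to the secondary queue.
        recovery.add(() -> completedTaskCount++);
        return;
      }
      completedTaskCount++;
    });
  }

  void awaitPrimaryOnly() { drain(primary); }

  void awaitBoth() { drain(primary); drain(recovery); }

  private static void drain(Queue<Runnable> queue) {
    Runnable next;
    while ((next = queue.poll()) != null) {
      next.run();
    }
  }

  /** Completes 20 tasks and returns the counter the caller observes after awaiting. */
  static int countAfterAwait(boolean joinRecovery) {
    TwoQueueAwaitDemo vertex = new TwoQueueAwaitDemo(20, true);
    for (int i = 0; i < 20; i++) {
      vertex.taskCompleted(i);
    }
    if (joinRecovery) {
      vertex.awaitBoth();
    } else {
      vertex.awaitPrimaryOnly();
    }
    return vertex.completedTaskCount;
  }

  public static void main(String[] args) {
    System.out.println("primary only: " + countAfterAwait(false));
    System.out.println("both queues:  " + countAfterAwait(true));
  }
}
```

Awaiting only the primary queue leaves the caller looking at 19, which is the stale counter from Why 3; joining both queues yields 20. Fix direction (a) corresponds to awaitBoth(), and direction (b) corresponds to never taking the re-route branch outside replay.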
Git Archaeology
Once you have a candidate cause, ask: when did this break? And why did the person who wrote it think it was correct?
git log --follow -p -S<token>
Find every commit that introduced or removed a specific string or method name:
cd ~/tez-src
# Every commit that touched the recovery dispatcher branch
git log --follow -p -S "recoveryData != null" \
-- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
# Every commit that mentions the counter
git log --follow -p -S "completedTaskCount" \
-- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
# The original change that added recovery dispatching
git log --all --grep="TEZ-2877" --oneline
-S ("pickaxe") matches commits where the count of that string changed —
either added or removed. It is the single most powerful git command in
this entire chapter. Learn it.
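The pickaxe rule is easy to misremember as "the diff mentions the string". A small model of the predicate, with hypothetical names and no claim to mirror git's internals, shows the difference:

```java
/** Toy model of the -S predicate: did the occurrence count of a string change? */
class PickaxeDemo {

  /** Counts non-overlapping occurrences of needle in text. */
  static int occurrences(String text, String needle) {
    if (needle.isEmpty()) {
      throw new IllegalArgumentException("needle must not be empty");
    }
    int count = 0;
    int from = text.indexOf(needle);
    while (from >= 0) {
      count++;
      from = text.indexOf(needle, from + needle.length());
    }
    return count;
  }

  /** True when a commit would be selected: the count differs between the two versions. */
  static boolean pickaxeMatches(String before, String after, String needle) {
    return occurrences(before, needle) != occurrences(after, needle);
  }

  public static void main(String[] args) {
    String needle = "recoveryData != null";
    // Commit that introduces the branch: count goes 0 -> 1, so it matches.
    System.out.println(pickaxeMatches("x = 1;", "if (recoveryData != null) { x = 1; }", needle));
    // Commit that only moves the line within the file: count stays 1, so it does not match.
    System.out.println(pickaxeMatches("a(); recoveryData != null", "recoveryData != null; a();", needle));
  }
}
```

The practical consequence: a commit that merely moves or re-indents the line inside the same file is skipped, which is usually what you want when hunting for the introducing change.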
git blame -L <start>,<end>
Once you know the file and lines, find the commit and its author:
git blame -L 2740,2770 \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
Output looks like:
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2745) if (recoveryData != null) {
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2746) handleRecovery(event);
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2747) return;
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2748) }
Then read the commit:
git show a1b2c3d4
git log -1 --format="%B" a1b2c3d4
Look for the JIRA reference in the commit message (TEZ-NNNN: ...). Open
that JIRA. Read every comment. Often you will discover:
- The change was made to fix a different bug (recovery correctness) and introduced your bug as collateral.
- There was a comment on the original JIRA flagging the exact concern you are hitting. ("This might race with the test dispatcher pattern" — and it did.)
- The fix you are considering was discussed and rejected for a reason you must now address.
git bisect for Regressions
If the bug is a regression — works in 0.9.x, broken in 0.10.x — bisect tells you the exact commit that introduced it. This is the highest-confidence signal in all of root-cause work.
cd ~/tez-src
git bisect start
git bisect bad master
git bisect good rel/release-0.9.2
# git checks out a midpoint commit. Build and run the repro:
mvn install -DskipTests -pl tez-dag -am -q
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q
# If the test FAILS at this commit: bug exists here
git bisect bad
# If the test PASSES at this commit: bug introduced later
git bisect good
# Repeat. git narrows to one commit in log2(N) steps.
Once bisect converges:
a1b2c3d4 is the first bad commit
commit a1b2c3d4
Author: Alice <alice@example.org>
Date: Thu Apr 12 09:34:18 2018 -0700
TEZ-2877: Add recovery dispatcher path
Now you know:
- The JIRA that introduced the regression.
- The author (potential reviewer for your fix — Cc them).
- The exact diff to study.
Automating bisect with git bisect run <script> is also fair game once you
have a return-code-clean reproducer command.
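The log2(N) claim is ordinary binary search over a linear history. The sketch below models commits as integer positions and the repro as a predicate; BisectDemo and its method are hypothetical, and real histories with merges need more care than this linear model.

```java
import java.util.function.IntPredicate;

/** Toy model of bisecting a linear history; commit positions are plain integers. */
class BisectDemo {

  /**
   * Finds the first bad position in (knownGood, knownBad], assuming every position
   * after the first bad one is also bad. Returns {firstBadPosition, testsRun}.
   */
  static int[] firstBad(int knownGood, int knownBad, IntPredicate isBad) {
    int good = knownGood;
    int bad = knownBad;
    int tests = 0;
    while (bad - good > 1) {
      int mid = good + (bad - good) / 2;
      tests++; // one build + repro run per midpoint
      if (isBad.test(mid)) {
        bad = mid;
      } else {
        good = mid;
      }
    }
    return new int[] {bad, tests};
  }

  public static void main(String[] args) {
    int[] result = firstBad(0, 1000, commit -> commit >= 637);
    System.out.println("first bad = " + result[0] + ", tests run = " + result[1]);
  }
}
```

With 1000 commits between the good and bad endpoints, at most 10 repro runs are needed, which is why a reproducer that finishes in under two minutes makes bisecting practical.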
Writing the Root-Cause Statement
This document goes into your JIRA, into your PR description, and into your write-up. 200–500 words, no more, no less. Use this template:
## Root cause: TEZ-NNNN
### Symptom
<one sentence — what the user sees>
### Trigger conditions
- <condition 1, e.g. recovery data is non-null when V_TASK_COMPLETED fires>
- <condition 2, e.g. only on the last task in a vertex>
- <condition 3 if any>
### Affected code
- `tez-dag/src/main/java/.../VertexImpl.java#L2745-L2748` (the recovery branch)
- `tez-dag/src/main/java/.../AsyncDispatcher.java#L210` (`await()` does not
join the secondary queue)
### Mechanism
<three to five sentences explaining the actual defect. Use words like "because",
"as a result", "however". This is the part most people get wrong — they describe
the symptom again instead of the mechanism. The mechanism answers: of the
many ways this code could have been written, why does the current way produce
this wrong answer?>
### Introducing change
- TEZ-2877 (commit a1b2c3d4) added the recovery-dispatch branch without
updating `AsyncDispatcher.await()` to join the recovery queue.
- The original JIRA flagged this as a concern (link to comment) but the
resolution was deferred ("we don't await in production paths, only in
tests").
### Fix direction
Three options considered:
1. **Join the recovery dispatcher in `await()`.** Smallest change. Risk: may
slow recovery in production if a slow recovery handler blocks the await.
2. **Handle the recovery branch synchronously when not replaying.** Larger
change, narrower blast radius. Recommended.
3. **Document that tests must use a new barrier.** Cheapest. Pushes burden
onto every test author. Rejected.
Recommended: option 2. See Step 5 for the diff.
Save as capstone-work/root-cause.md.
Validating the Hypothesis
A root cause is not validated until you have demonstrated it. Two ways:
1. Revert the introducing commit and re-run the repro
git checkout master
git revert --no-commit a1b2c3d4 # introducing commit from bisect
mvn install -DskipTests -pl tez-dag -am -q
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q
If the test now PASSES (because the change you reverted is what introduced the bug), your root cause is at least partially correct. If it still FAILS, the introducing commit is not the root cause — there is a deeper issue.
Reset before you go any further:
git reset --hard origin/master
2. Make a minimal one-line "patch" that confirms the mechanism
You are not writing the real fix yet. You are confirming the mechanism. For the example above:
--- a/tez-dag/.../VertexImpl.java
+++ b/tez-dag/.../VertexImpl.java
@@ -2745,4 +2745,4 @@
- if (recoveryData != null) {
+ if (recoveryData != null && isReplayingRecovery()) {
handleRecovery(event);
return;
}
(isReplayingRecovery() does not exist yet. For this experiment, add a throwaway
stub so the patch compiles: it returns false in tests and would return true only
during actual recovery replay.) Apply this and re-run the repro. If it passes,
the mechanism is confirmed even though the real API does not exist yet.
If the test still fails: your mechanism is wrong. Go back to the five-whys.
If the test now passes but breaks 14 other tests: your fix direction is too broad. Go back to "fix direction" in the root-cause statement and pick a narrower option.
Validation / Self-check
Before advancing to Step 5:
- capstone-work/root-cause.md exists, follows the template, is 200–500 words.
- You can name the introducing commit (full SHA) and JIRA.
- You ran git bisect to convergence (or proved bisect doesn't apply because the bug existed since the file was first added; note this in the doc).
- You ran a "revert introducing commit" experiment and saw the test go green (or have a documented reason the revert doesn't apply).
- You wrote a one-line throwaway "mechanism confirmation" patch and saw the test pass on it.
- You have read every comment on the introducing JIRA.
- You can articulate three fix directions and explain why you rejected two of them in one sentence each.
Step 5: Implementation
Your fix is the smallest diff that makes the failing test pass without breaking any other test. Period. Anything else — a refactor you noticed, a TODO you want to address, a better name for a field — belongs in a separate JIRA, not this PR.
Committer reviewers' single biggest objection to first-time contributors is scope creep. The second is API hygiene. This chapter is about both.
Minimum-Diff Principle
The fix should change as few lines as possible while addressing the root cause identified in Step 4. Everything that survives compilation but is not strictly required to fix the bug is review surface area. Review surface area is the enemy of "merged this week."
Too much
- public void handleVertexCompleted(VertexEvent event) {
- if (recoveryData != null) {
- handleRecovery(event);
- return;
- }
- completedTaskCount++;
- if (completedTaskCount == numTasks) {
- transitionToSucceeded();
- }
- }
+ // Refactored to use stream API for clarity
+ public void handleVertexCompleted(final VertexEvent event) {
+ Optional.ofNullable(recoveryData)
+ .filter(rd -> isReplayingRecovery())
+ .ifPresentOrElse(
+ rd -> handleRecovery(event),
+ () -> {
+ this.completedTaskCount = this.completedTaskCount + 1;
+ this.maybeTransitionToSucceeded();
+ });
+ }
+
+ private void maybeTransitionToSucceeded() {
+ if (completedTaskCount == numTasks) {
+ transitionToSucceeded();
+ }
+ }
This will be rejected. You changed five things (Optional chaining, final keyword,
method extraction, control-flow shape, formatting). A committer cannot tell
which change is the actual fix without re-deriving the root cause from scratch.
Just right
public void handleVertexCompleted(VertexEvent event) {
- if (recoveryData != null) {
+ if (recoveryData != null && isReplayingRecovery()) {
handleRecovery(event);
return;
}
completedTaskCount++;
if (completedTaskCount == numTasks) {
transitionToSucceeded();
}
}
One line. The change matches the root-cause statement verbatim. A reviewer reads it, opens the root-cause doc, agrees in 30 seconds.
The extracted helper, the final keyword, the Optional rewrite: all may be
good ideas. File them as separate JIRAs after this lands.
The Boy Scout rule does NOT apply
In a green-field project, "leave the campground cleaner than you found it" is fine. In Apache project review, drive-by cleanups block your fix because they expand the review and trigger objections you do not need to deal with to land the actual bug fix. Resist the urge.
Where Does the Fix Go? A Decision Tree
Is the bug a check that should have rejected an input but didn't?
-> Guard condition (likely in a setter or builder).
Example: TezConfiguration.validate(), DAG.verify().
Is the bug a wrong state machine transition?
-> State-machine transition table edit.
Look for stateMachineFactory.addTransition() in the affected *Impl class.
The fix is usually adding/removing a transition or changing its target state.
Is the bug a config key being read at the wrong place or with the wrong default?
-> Config validation in the constructor of the class that reads it.
Or a fix to where conf.get() / conf.getInt() is called.
Is the bug a logic error in business code (wrong arithmetic, wrong comparator,
missing close())?
-> Logic bug. Fix is local to the offending method.
Add a test that asserts the corrected behavior.
Is the bug a race?
-> First, prove it is actually a race with DrainDispatcher. Most "races"
turn out to be logic bugs that *look* race-y because event ordering
is non-obvious.
-> If genuinely a race: usually a missing dispatcher.await, a missing
volatile, or a transition guard that isn't atomic with a counter
increment. Synchronize the smallest critical section.
Is the bug a memory issue (OOM, off-heap leak)?
-> Almost never in scope for a first Capstone. Pause and consult a committer.
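For the "transition guard that isn't atomic with a counter increment" branch of the tree, here is a plain-Java sketch of the smallest critical section: fold the increment and the guard into one atomic operation. The class and method names are hypothetical, and whether the counter in your bug is touched from more than one thread at all is something to prove first.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy demo: the increment and the "am I last?" guard happen in one atomic step. */
class AtomicGuardDemo {

  /** Completes numTasks across several threads and returns how often the transition fired. */
  static int transitionsFired(int numTasks, int threads) {
    if (numTasks % threads != 0) {
      throw new IllegalArgumentException("numTasks must be divisible by threads");
    }
    AtomicInteger completed = new AtomicInteger();
    AtomicInteger transitions = new AtomicInteger();
    int perThread = numTasks / threads;
    Thread[] workers = new Thread[threads];

    for (int t = 0; t < threads; t++) {
      workers[t] = new Thread(() -> {
        for (int i = 0; i < perThread; i++) {
          // incrementAndGet returns each value exactly once, so only one caller sees numTasks.
          if (completed.incrementAndGet() == numTasks) {
            transitions.incrementAndGet();
          }
        }
      });
      workers[t].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while joining workers", e);
      }
    }
    return transitions.get();
  }

  public static void main(String[] args) {
    System.out.println("transitions fired: " + transitionsFired(1000, 4));
  }
}
```

The buggy shape is the same loop with a separate count++ followed by a later read of the counter: two threads can both observe the final value, or neither can. Keeping the read inside the atomic update removes that window without adding a lock around the whole handler.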
Configuration Keys: The Right Way
You will be tempted to "add a knob" — a new tez.foo.bar flag that defaults
to the old (buggy) behavior, lets users opt in to the fix. Resist. Knobs
are an admission that you don't trust your fix. If your fix is correct, it
should be the new default; if it isn't, fix the fix, not the user's
configuration burden.
When a knob IS justified:
- The fix changes a performance-sensitive default that may regress some users.
- The fix changes user-visible output format (release-note required).
- The fix is gated on a long-deprecation window and the old behavior must remain available for one or two releases.
When you DO add a key, conform to Tez convention. Read:
grep -n "TEZ_AM\|TEZ_TASK\|TEZ_RUNTIME" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java \
| head -40
You will see the pattern:
/**
* Maximum number of times an AM can attempt to launch a task before failing
* the task.
* <p>
* Default: {@link #TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT}.
*
* @since 0.9.0
*/
@ConfigurationScope(Scope.AM)
@ConfigurationProperty(type = "integer")
public static final String TEZ_AM_TASK_MAX_FAILED_ATTEMPTS =
TEZ_AM_PREFIX + "task.max.failed.attempts";
public static final int TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT = 4;
Mandatory elements for any new key:
- Javadoc that explains what the knob does and when to change it.
- @since X.Y.Z matching the next release version.
- @ConfigurationScope (AM, VERTEX, TASK, CLIENT).
- @ConfigurationProperty(type = "integer" / "long" / "boolean" / "string").
- A _DEFAULT constant alongside.
- Use the right prefix constant (TEZ_AM_PREFIX, TEZ_RUNTIME_PREFIX, etc.).
- Add to tez-api/src/main/resources/META-INF/services/... if the doc-gen needs to pick it up (check existing keys to see if their config-doc generator catches up automatically or needs manual entries).
A new key that violates any of these will fail review.
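One of the mandatory elements, the paired _DEFAULT constant, is easy to self-check before review. The sketch below is not a Tez tool: DemoConfiguration is a made-up holder laid out like the excerpt above, TEZ_AM_EXAMPLE_KNOB is an invented key that deliberately lacks a default, and the reflection check is just one way to catch the omission locally.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lint: every String key constant should have a NAME_DEFAULT sibling. */
class ConfigKeyLintDemo {

  /** Made-up key holder; only the first key mirrors the excerpt above. */
  static class DemoConfiguration {
    public static final String TEZ_AM_PREFIX = "tez.am.";
    public static final String TEZ_AM_TASK_MAX_FAILED_ATTEMPTS =
        TEZ_AM_PREFIX + "task.max.failed.attempts";
    public static final int TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT = 4;
    // Invented key with no _DEFAULT constant, so the lint has something to flag.
    public static final String TEZ_AM_EXAMPLE_KNOB = TEZ_AM_PREFIX + "example.knob";
  }

  /** Returns the names of static String key constants that have no _DEFAULT sibling. */
  static List<String> keysMissingDefault(Class<?> holder) {
    List<String> fieldNames = new ArrayList<>();
    for (Field field : holder.getDeclaredFields()) {
      fieldNames.add(field.getName());
    }
    List<String> missing = new ArrayList<>();
    for (Field field : holder.getDeclaredFields()) {
      String name = field.getName();
      boolean isKey = Modifier.isStatic(field.getModifiers())
          && field.getType() == String.class
          && !name.endsWith("_PREFIX")
          && !name.endsWith("_DEFAULT");
      if (isKey && !fieldNames.contains(name + "_DEFAULT")) {
        missing.add(name);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    System.out.println("keys missing a default: " + keysMissingDefault(DemoConfiguration.class));
  }
}
```

The annotations and Javadoc cannot be checked this simply, so treat this as a convenience for one item on the list rather than a substitute for reading the neighbouring keys.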
Tez Coding Style
Read the existing class you are editing. Match its style exactly. The project-wide rules below are necessary but not sufficient — the file-local conventions matter just as much.
Logging
Always slf4j, never log4j directly, never System.out:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
private static final Logger LOG = LoggerFactory.getLogger(VertexImpl.class);
LOG.info("Vertex {} transitioned from {} to {} on event {}",
getName(), oldState, newState, event.getType());
Use {} parameterization, never string concatenation in log args. Use the
exception form LOG.error("Failed to schedule task {}", taskId, ex) rather
than concatenating ex.toString().
Preconditions
Tez uses Guava Preconditions heavily. Use it for invariants and argument
checks:
import com.google.common.base.Preconditions;
Preconditions.checkNotNull(event, "event must not be null");
Preconditions.checkArgument(parallelism > 0,
"parallelism must be positive, got %s for vertex %s", parallelism, vertexName);
Preconditions.checkState(getState() == VertexState.RUNNING,
"Vertex %s must be RUNNING to receive %s, was %s",
getName(), event.getType(), getState());
The variadic %s form is preferable to string concatenation because the message
is only formatted when the check fails; a passing check never builds the string.
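To see why the template form is cheaper, consider a local stand-in with the same template-plus-arguments shape. This is not Guava's implementation: checkArgument here is a hypothetical helper that uses String.format, and the counter exists only to make the deferred formatting observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Local stand-in that shows message formatting is skipped when the check passes. */
class LazyMessageDemo {
  static final AtomicInteger FORMAT_CALLS = new AtomicInteger();

  /** Formats the message only on failure; a passing check returns immediately. */
  static void checkArgument(boolean ok, String template, Object... args) {
    if (!ok) {
      FORMAT_CALLS.incrementAndGet();
      throw new IllegalArgumentException(String.format(template, args));
    }
  }

  /** Runs the given number of passing checks and returns how many messages were formatted. */
  static int formatCallsAfter(int passingChecks) {
    FORMAT_CALLS.set(0);
    for (int parallelism = 1; parallelism <= passingChecks; parallelism++) {
      checkArgument(parallelism > 0,
          "parallelism must be positive, got %s for vertex %s", parallelism, "v1");
    }
    return FORMAT_CALLS.get();
  }

  public static void main(String[] args) {
    System.out.println("messages formatted: " + formatCallsAfter(1000));
  }
}
```

The passing path is not literally zero-cost, since the varargs array and any boxing still happen at the call site, but the string itself is never built. With concatenation, the full message would be assembled on every call before the check even ran.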
Exception messages
Always include the context: which vertex, which task ID, which state, which event. Diagnosing a Tez bug from a stack trace alone is hard enough; an exception message that just says "invalid state" is hostile.
Bad:
throw new IllegalStateException("invalid state");
Good:
throw new IllegalStateException(String.format(
"Vertex %s received event %s in state %s, which is not legal. "
+ "Expected one of [RUNNING, INITED].",
getName(), event.getType(), getState()));
Forbidden
- System.out.println / System.err.println (use LOG).
- e.printStackTrace() (use LOG.error("...", e)).
- Thread.sleep in production code unless you have a // TEZ-NNNN: justification comment AND a committer agreed in review.
- New synchronized methods on hot paths. Discuss in the JIRA before adding.
- Adding new dependencies to pom.xml without discussion. This is a major re-review trigger.
Imports
- No wildcard imports (import foo.bar.*;). The project's checkstyle catches these and you will fail precommit.
- Group order: java, javax, org, com, third-party, project. Most IDEs handle this automatically.
Tests
Discussed fully in Step 6, but: every fix must come with at least one test that fails on master and passes with your fix. No test, no merge.
Building Incrementally
Do not try to write the whole fix and run the whole test suite. That feedback loop is too slow. Instead:
# Tight loop: compile + run only the changed module's affected test.
mvn install -DskipTests -pl tez-api,tez-common -am -q && \
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q
# When that goes green, broaden:
mvn test -pl tez-dag -Dtest='TestVertex*' -q
# Finally, full module:
mvn test -pl tez-dag -q
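If your fix touches tez-api, you have to rebuild every downstream module that depends on it.
Mind the direction of the two Maven flags: -am ("also make") builds the upstream modules your -pl targets depend on, while -amd ("also make dependents") builds the downstream modules that depend on them.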
When You Get Stuck
Hard rule: if you have not made forward progress in three sessions, post on the JIRA. Format:
Status update: I have the repro from Step 2 passing/failing as expected. My
working hypothesis is <one-sentence>. I have tried:
1. <approach A> — does not work because <observed result>.
2. <approach B> — does not work because <observed result>.
I am unsure whether to (a) <option a> or (b) <option b>. The constraint I am
trying to satisfy is <invariant>. If anyone has context on whether <approach C>
was considered for a related JIRA, please share.
Reproducer is at <link to gist or branch>.
This is not failure. This is community engagement done right. Committers respect contributors who ask sharp questions with context attached. They ignore contributors who ask "any update?" or "can you help?"
Validation / Self-check
Before advancing to Step 6:
- Your fix is committed to your branch as a single commit with the title TEZ-NNNN: <short summary> and a body that references the root-cause document.
- git diff origin/master --stat shows the smallest plausible diff (single-digit files changed, double-digit lines at most for a typical bug fix).
- The diff contains zero unrelated changes (no formatting-only changes, no import reordering not caused by your edit, no Javadoc cleanups in methods you didn't touch).
- mvn install -DskipTests -pl <changed-module> -am -q succeeds.
- The Step 2 reproducer test now passes (you'll generalize the test in Step 6; the repro itself is still the gating signal).
- If you added a TezConfiguration key, it has all required annotations, Javadoc, _DEFAULT constant, and @since tag.
- You have re-read your diff line by line and convinced yourself every line change is required by the root cause. Strike anything that isn't.
Step 6: Testing
Your reproducer from Step 2 is the minimum — it proves the bug existed. The tests in this step prove that the fix is correct, that it stays correct, and that the next person who edits this code path will notice if they break it again. A good test suite is the most durable artifact you ship.
Two kinds of tests are required. Unit tests using a controlled dispatcher
(fast, deterministic, surgical) and at least one integration test on
MiniTezCluster (slow, realistic, end-to-end). Both. Always both.
Unit Tests with DrainDispatcher
The single most important Tez test pattern: synchronous, deterministic state-machine testing. Read the canonical examples top to bottom before you write your own:
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
Each is 1000+ lines. They are not light reading. They are also the only authoritative source on what is and isn't testable at the unit layer.
What DrainDispatcher Does
DrainDispatcher is Hadoop's testing dispatcher (from
hadoop-yarn-common). When you hand it an event, the event sits
in a queue. When you call await(), the call blocks until the queue has
fully drained, so every handler has run before await() returns. This gives you
two superpowers:
- Deterministic event ordering. You can dispatch A, dispatch B, await — and you know A's handler completed before B's started.
- No real threading. Bugs reproduce on every machine, not just under contention.
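To make the queue-then-drain idea concrete, here is a minimal, self-contained sketch. It is not Hadoop's DrainDispatcher: the class name MiniDrainDispatcher and its methods are hypothetical, and it runs handlers on the calling thread for simplicity, which the real class may not do. It only illustrates why draining before asserting gives a deterministic handler order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical, simplified model of a queue-then-drain test dispatcher.
// Not the Hadoop class; names and threading model are assumptions.
class MiniDrainDispatcher {
  private final Queue<String> queue = new ArrayDeque<>();
  private final Consumer<String> handler;

  MiniDrainDispatcher(Consumer<String> handler) {
    this.handler = handler;
  }

  // Enqueue only; no handler runs yet.
  void dispatch(String event) {
    queue.add(event);
  }

  // Run every queued event in FIFO order before returning.
  // Events enqueued by a handler during the drain are processed too.
  void await() {
    String event;
    while ((event = queue.poll()) != null) {
      handler.accept(event);
    }
  }

  // Demo: returns the order in which the handler saw the events.
  static List<String> demo() {
    List<String> seen = new ArrayList<>();
    MiniDrainDispatcher dispatcher = new MiniDrainDispatcher(seen::add);
    dispatcher.dispatch("V_INIT");
    dispatcher.dispatch("V_START");
    // Nothing has been handled yet: seen is still empty at this point.
    dispatcher.await();
    return seen;
  }
}
```

Because nothing runs until await() is called, a test can assert on state immediately after await() returns without guessing how long the handlers take.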
State-Transition Test Pattern
The template every state-machine unit test follows:
@Test
public void testV_TASK_COMPLETED_inRunningWithRecovery() throws Exception {
  // 1. Arrange: drive the SUT to the state under test.
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
  dispatcher.await();
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_START));
  dispatcher.await();
  assertEquals(VertexState.RUNNING, vertex.getState());

  // 2. Set up the precondition that triggers the bug.
  vertex.setRecoveryData(mockRecoveryData());

  // 3. Act: fire the event under test.
  TezTaskID lastTaskId = vertex.getTask(vertex.getNumTasks() - 1).getTaskId();
  vertex.handle(new VertexEventTaskCompleted(lastTaskId, TaskState.SUCCEEDED));
  dispatcher.await();

  // 4. Assert: the new state and any side-effect counters.
  assertEquals(VertexState.SUCCEEDED, vertex.getState());
  assertEquals(vertex.getNumTasks(), vertex.getCompletedTaskCount());
  assertFalse("vertex must not call handleRecovery when not actually replaying",
      vertex.getRecoveryHandlerCalled());
}
The sections — arrange, set precondition, act, assert — should always be visible. Reviewers skim for that shape. Hidden setup inside helpers makes the test harder to debug when it fails on a future change.
Build a Negative Test Too
You proved the bug is fixed. Now prove the non-buggy path still works:
@Test
public void testV_TASK_COMPLETED_inRunningWithoutRecovery() throws Exception {
  // Same arrange/state machinery, but recoveryData stays null.
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
  dispatcher.await();
  // ...
  TezTaskID lastTaskId = vertex.getTask(vertex.getNumTasks() - 1).getTaskId();
  vertex.handle(new VertexEventTaskCompleted(lastTaskId, TaskState.SUCCEEDED));
  dispatcher.await();

  // Without recovery data, the existing transition behavior is unchanged.
  assertEquals(VertexState.SUCCEEDED, vertex.getState());
}
The negative test catches the regression where someone "fixes" your fix by removing the recovery branch entirely.
Test Both Branches of Every Guard You Added
If your fix is:
if (recoveryData != null && isReplayingRecovery()) { ... }
You owe four tests, one per combination:
| recoveryData == null | isReplayingRecovery() returns | Expected branch |
|---|---|---|
| true | n/a (short-circuited) | non-recovery path |
| false | true | recovery path |
| false | false | non-recovery path (this is the bug fix) |
| true | true | non-recovery path (impossible? assert it cannot happen) |
The last row is the kind of test that catches a future refactor where someone deletes the short-circuit.
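The four rows can be checked mechanically. The sketch below is a hypothetical stand-in for the guarded code path (RecoveryGuard, Branch, and choose are illustration names, not Tez code). It mirrors the guard shown above and also lets a test confirm that the predicate is never evaluated when recoveryData is null.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the guarded code path; not Tez code.
class RecoveryGuard {
  enum Branch { RECOVERY, NON_RECOVERY }

  // Mirrors: if (recoveryData != null && isReplayingRecovery()) { ... }
  static Branch choose(Object recoveryData, BooleanSupplier isReplayingRecovery) {
    if (recoveryData != null && isReplayingRecovery.getAsBoolean()) {
      return Branch.RECOVERY;
    }
    return Branch.NON_RECOVERY;
  }

  // Returns true if the predicate was evaluated for the given recoveryData.
  // With the short-circuit in place it is skipped whenever recoveryData is null.
  static boolean predicateEvaluated(Object recoveryData) {
    boolean[] called = {false};
    choose(recoveryData, () -> {
      called[0] = true;
      return true;
    });
    return called[0];
  }
}
```

A refactor that drops the null check would flip the result of predicateEvaluated(null), which is exactly the regression the last table row is meant to catch.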
MockAppContext, MockHistoryEventHandler, and friends
Building a VertexImpl in a unit test requires a small zoo of collaborators
(an AppContext, an event handler, an EdgeManager, etc.). Don't try to
build them all from scratch — copy the helpers from TestVertexImpl.
grep -nE "private.*setUp\(|class Mock|createVertex\(" \
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java \
| head -30
You'll see helper methods like createVertex(...), createDAG(...), and
inner MockHistoryEventHandler. Use them as a template; do not duplicate them
in your own test if you can extend the existing test class with a new
@Test method.
Integration Tests with MiniTezCluster
Unit tests prove the fix works in isolation. Integration tests prove it works when wired up to a real YARN cluster (in-process, but real). For correctness bugs and shuffle bugs, this is non-negotiable.
Canonical example:
~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestOrderedWordCount.java
Read its setUp / tearDown carefully. The pattern:
private static MiniTezCluster mrrTezCluster;
private static Path TEST_ROOT_DIR;

@BeforeClass
public static void setup() throws IOException {
  Configuration conf = new Configuration();
  TEST_ROOT_DIR = new Path("target", TestYourFix.class.getName() + "-tmpDir");
  mrrTezCluster = new MiniTezCluster(TestYourFix.class.getSimpleName(),
      /*numNodeManagers=*/ 1, /*numLocalDirs=*/ 1, /*numLogDirs=*/ 1);
  mrrTezCluster.init(conf);
  mrrTezCluster.start();
}

@AfterClass
public static void tearDown() {
  if (mrrTezCluster != null) {
    mrrTezCluster.stop();
    mrrTezCluster = null;
  }
}

@Test(timeout = 180_000)
public void testTezNNNNFixEndToEnd() throws Exception {
  TezConfiguration tezConf = new TezConfiguration(mrrTezCluster.getConfig());
  DAG dag = buildDAGThatExercisesFix();
  TezClient tezClient = TezClient.create("test-tez-NNNN", tezConf);
  tezClient.start();
  try {
    DAGClient dagClient = tezClient.submitDAG(dag);
    DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(
        EnumSet.of(StatusGetOpts.GET_COUNTERS));
    assertEquals(DAGStatus.State.SUCCEEDED, status.getState());
    // The actual assertion: what proves the fix works end-to-end.
    long counterVal = status.getDAGCounters()
        .findCounter(YourCounterGroup.class.getName(), "ExpectedCounter")
        .getValue();
    assertEquals(20L, counterVal);
  } finally {
    tezClient.stop();
  }
}
awaitVertexState and the Deterministic Polling Pattern
MiniTezCluster tests look async (real cluster, real time) but you can still
write deterministic assertions. Use the await* helpers in the tez-tests
test utility classes:
grep -rn "awaitVertexState\|awaitDAGCompletion\|awaitTaskAttempt" \
~/tez-src/tez-tests/src/test/java/
Pattern:
TestTezUtils.awaitVertexState(dagClient, "v1", VertexStatus.State.SUCCEEDED, 60_000);
This polls with backoff up to the timeout. It never returns early on a spurious signal and never sleeps a fixed wallclock duration.
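For readers who have not written such a helper, the following is a rough sketch of poll-with-backoff under a deadline. PollUntil and its signature are assumptions for illustration, not a Tez API. The one sleep lives inside a bounded helper whose exit is driven by the condition, which is different from a fixed Thread.sleep in a test body.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; the name and signature are assumptions, not a Tez API.
class PollUntil {
  // Polls the condition until it holds or the timeout expires.
  // The sleep interval doubles each round, capped at maxSleepMs.
  // Returns true if the condition became true within the timeout.
  static boolean await(BooleanSupplier condition, long timeoutMs, long maxSleepMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    long sleepMs = 1;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false;
      }
      long remainingMs = Math.max(1, remainingNanos / 1_000_000L);
      try {
        // Never sleep past the deadline.
        Thread.sleep(Math.min(sleepMs, remainingMs));
      } catch (InterruptedException e) {
        // Preserve the interrupt and report that the condition was not met.
        Thread.currentThread().interrupt();
        return false;
      }
      sleepMs = Math.min(sleepMs * 2, maxSleepMs);
    }
  }
}
```

A caller would wrap the status lookup in the condition, for example a lambda that fetches the vertex status and compares it to the expected state, and fail the test if the helper returns false.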
Determinism Rules
Hard rules. Violating any of them gets your PR sent back.
| Rule | Bad | Good |
|---|---|---|
| No Thread.sleep | Thread.sleep(500) | dispatcher.await() or awaitVertexState(...) |
| No wallclock waits | while (!done && System.currentTimeMillis() < deadline) {...} | latch.await(60, SECONDS) driven by event callback |
| No Random without seed | new Random() | new Random(42L) |
| No timezone-dependent assertion | assertEquals("2024-...", LocalDate.now()) | inject Clock |
| No order-dependent assertion on a Set | assertEquals(List.of("a","b"), new HashSet<>(...)) | sort first or use containsInAnyOrder |
| Tests must clean up tmpdirs | leaving target/...-tmpDir between runs | @After removes it or uses unique nanoTime() path |
| No global mutable state | static int counter = 0; shared across tests | per-test instance state |
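Three of the "Good" column entries are easy to show with the JDK alone. The class below is illustrative only (DeterminismExamples and its method names are hypothetical); it shows a seeded Random, an injected Clock, and an order-independent comparison of a Set.

```java
import java.time.Clock;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative only; class and method names are hypothetical.
class DeterminismExamples {
  // Seeded Random: the same seed yields the same sequence on every run.
  static List<Integer> seededDraws(long seed, int n) {
    Random random = new Random(seed);
    List<Integer> draws = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      draws.add(random.nextInt(100));
    }
    return draws;
  }

  // Injected Clock: the date depends on the clock passed in, not on the host.
  static LocalDate today(Clock clock) {
    return LocalDate.now(clock);
  }

  // Order-independent comparison: sort the Set's contents before asserting.
  static List<String> sorted(Set<String> values) {
    List<String> copy = new ArrayList<>(values);
    Collections.sort(copy);
    return copy;
  }
}
```

In a test, the clock would typically be Clock.fixed(Instant.parse("2024-01-15T23:30:00Z"), ZoneOffset.UTC), so the asserted date no longer depends on when or where the test runs.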
Tez has shipped many flaky-test fixes. Read a few of them:
cd ~/tez-src
git log --oneline --grep="flaky\|intermittent" | head -20
git show <flaky-fix-sha>
Notice the pattern — most flaky fixes are replacing a Thread.sleep with
an event-driven await, or replacing a counter assertion with a state
assertion.
Coverage Target
You do not need 100% line coverage on the file you touched. You do need ~80% coverage on the lines you changed, plus tests that exercise every new branch (true and false sides).
Spot-check coverage:
mvn org.jacoco:jacoco-maven-plugin:prepare-agent test \
  org.jacoco:jacoco-maven-plugin:report \
  -pl tez-dag -Dtest='TestVertexImpl*'
# prepare-agent must come before test so the agent is attached when the tests run.
# Open tez-dag/target/site/jacoco/index.html
If your changed lines show red, add a test before pushing.
A Complete Test That Fails on Master, Passes With Fix
The deliverable for this step is a test (typically two or three @Test
methods on the same class) that:
- Fails on a clean checkout of origin/master with an assertion error, not a compilation error or a setup error.
- Passes when run against your fix branch.
- Runs in under 10 seconds for unit tests, under 3 minutes for integration tests.
- Has zero flakes in 10 consecutive runs.
Verify the third and fourth:
for i in {1..10}; do
echo "=== Run $i ==="
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNN -q || break
done
If even one run fails, you have a flaky test. Fix it before pushing. A flaky test you ship is technical debt every other contributor will pay.
Test Naming
Tez convention:
- Unit test file: Test<ClassUnderTest>.java lives in <module>/src/test/java/<package>/. If TestVertexImpl.java already exists, add a new @Test method there rather than a new file.
- Test method: test<Method>_<Condition>_<ExpectedResult> or test<Scenario>_<ExpectedBehavior>.
- Bad: testFoo, testBug, testCase1.
- Good: testV_TASK_COMPLETED_inRunningWithRecoveryData_doesNotShortCircuit.
The verbose name is the test's documentation. Future-you reading the failure output of CI will be glad for the verbosity.
Validation / Self-check
Before advancing to Step 7:
- At least two @Test methods exist that fail on origin/master and pass on your branch.
- At least one of them uses DrainDispatcher for deterministic event ordering (or has a documented reason it doesn't, e.g. pure unit, no events).
- At least one integration test on MiniTezCluster is present if your fix affects end-to-end behavior (correctness, shuffle, scheduling).
- Ten consecutive runs of your tests are all green.
- Every new conditional branch in your production code has at least one test that exercises each side.
- No Thread.sleep, no wallclock waits, no unseeded Random, no order-dependent assertions on unordered collections.
- mvn test -pl <module> runs your tests in under the budget (10s unit, 3min integration).
Step 7: Validation
Your patch compiles. Your new tests pass. That is not enough. Validation is proving that the rest of the build — full module test suites, the static analyzers Tez runs, the legal scanner, the end-to-end examples — is also still green. Reviewers will not run this for you. They will check that you ran it and reject the PR if you didn't.
Budget: 1–2 evenings. Most of it is waiting on mvn test.
The Validation Checklist
In order. Do not skip steps because the previous step passed.
- Full test suite of every module you touched.
- Full clean build of the whole repo.
- Checkstyle.
- SpotBugs.
- Apache RAT (license header check).
- TestOrderedWordCount end-to-end.
- Re-run your original Step 2 reproducer to confirm green.
- Regression sweep of any module that depends on what you changed.
- Performance validation (if perf-relevant).
Capture the output of each into capstone-work/validation/. You'll cite it
in the PR description.
1. Full Module Tests
The module you changed:
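cd ~/tez-src
mkdir -p capstone-work/validation
# No -q here: quiet mode hides the [INFO] summary lines you need to read.
mvn test -pl tez-dag 2>&1 | tee capstone-work/validation/01-tez-dag-test.log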
This will take 5–20 minutes depending on the module. tez-dag is the slowest
non-integration module. While it runs, work on the diff cleanup.
When it finishes, scroll to the summary lines. Look for:
[INFO] Tests run: 1342, Failures: 0, Errors: 0, Skipped: 17
If you see Failures > 0, open every failure. Then triage:
- My fix caused it. Go back to Step 5. Reread the test. Either your fix is wrong, or the test is wrong (rare; assume the test is right until proven otherwise).
- It is a known flaky test. Search the commit history (git log --grep="<TestName>") and JIRA for the test name. If there is an open ticket, link it in your PR description ("known flake, see TEZ-XYZ"). If there is not, file one before claiming the green.
- It is also broken on master. Verify by running git stash && mvn test ... && git stash pop (if your fix is already committed, check out origin/master instead, since git stash only sets aside uncommitted changes). If it fails on master too, link the JIRA or file one. Do not let your PR be the one to surface a pre-existing failure silently.
Run for every module you touched. If you touched tez-api, you touched
everything downstream — plan accordingly.
2. Full Clean Build
The compilation gate. Catches missing imports, accidental Java-version features, downstream API breaks:
mvn clean install -DskipTests -q 2>&1 \
| tee capstone-work/validation/02-clean-install.log
Expect a clean BUILD SUCCESS. Common failures:
- Missing import. Your IDE auto-imported something not on the classpath of a downstream module.
- API break. You changed a public method signature in tez-api and a downstream caller broke. Either revert the signature change or update the caller.
- Java version. You used var or text blocks. Tez compiles to a JDK baseline (check pom.xml for <maven.compiler.target>). Use compatible syntax.
3. Checkstyle
Tez uses checkstyle aggressively. Run:
mvn checkstyle:check -q 2>&1 \
| tee capstone-work/validation/03-checkstyle.log
Or, per module:
mvn checkstyle:check -pl tez-dag
Common violations and fixes:
| Violation | Fix |
|---|---|
| Line longer than 120 chars | Break the line. Indent continuation 4 spaces. |
| Wildcard import | Replace with explicit imports. |
| Missing javadoc on public method | Add /** ... */ block. |
| Trailing whitespace | Configure your editor to strip it on save. |
| Tab character | Convert to 2 spaces (Tez uses 2-space indent in most modules). |
| Method ordering | Public before private; static before instance. |
The checkstyle config lives at tez-build-tools/src/main/resources/tez/checkstyle/checkstyle.xml
— read it to understand the rules.
4. SpotBugs
Static analysis for null-deref, unchecked cast, dead-store, etc.:
mvn spotbugs:check -q 2>&1 \
| tee capstone-work/validation/04-spotbugs.log
If it fails, view the report:
mvn spotbugs:gui -pl tez-dag
Common warnings worth fixing:
- NP_NULL_ON_SOME_PATH: your new code dereferences a value that can be null on some branch.
- EI_EXPOSE_REP: your getter returns a mutable internal collection directly. Wrap in Collections.unmodifiableList(...) or copy.
- RV_RETURN_VALUE_IGNORED_BAD_PRACTICE: the result of file.delete() was ignored.
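Warnings already present on master are not your problem to fix, but the
analyzer will fail the build if your change introduces new ones.
To see which are new, run the check on both branches, save a copy of
tez-dag/target/spotbugsXml.xml from each run, and diff the two copies.
(target/ is build output, not tracked by git, so git diff origin/master
cannot compare it for you.)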
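As a rough illustration of what the fixes for the three warnings listed above tend to look like, here is a small self-contained sketch. TaskRegistry and its methods are hypothetical and not part of Tez; only the JDK calls are real.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class for illustration; not part of Tez.
class TaskRegistry {
  private final List<String> taskIds = new ArrayList<>();

  void register(String taskId) {
    taskIds.add(taskId);
  }

  // EI_EXPOSE_REP: return a read-only view instead of the internal list.
  List<String> getTaskIds() {
    return Collections.unmodifiableList(taskIds);
  }

  // RV_RETURN_VALUE_IGNORED_BAD_PRACTICE: check what delete() reports.
  static boolean deleteIfPresent(File file) {
    if (!file.exists()) {
      return false;
    }
    boolean deleted = file.delete();
    if (!deleted) {
      System.err.println("Could not delete " + file);
    }
    return deleted;
  }

  // NP_NULL_ON_SOME_PATH: guard the dereference on the null branch.
  static int lengthOrZero(String maybeNull) {
    return maybeNull == null ? 0 : maybeNull.length();
  }
}
```

Callers that try to mutate the list returned by getTaskIds() get an UnsupportedOperationException, so the internal state can only change through register().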
5. Apache RAT (License Headers)
Every new .java, .xml, .properties file must carry the ASL header.
RAT enforces this:
mvn apache-rat:check -q 2>&1 \
| tee capstone-work/validation/05-rat.log
If it complains about your new test file, prepend the standard header:
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
(Copy from any existing Tez file — it is the canonical form.)
For shell, properties, and XML files, use the appropriate comment syntax. Look at neighboring files in the same directory.
6. TestOrderedWordCount End-to-End
The closest thing to a smoke test of "does Tez actually still work for a real user workload":
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 \
| tee capstone-work/validation/06-orderedwordcount.log
Takes 2–5 minutes. If this fails when your unit tests pass, your fix likely broke an interaction your unit test didn't exercise. Common culprits:
- You changed an event ordering and a downstream component assumed the old ordering.
- You added a config key default that breaks the example's expectations.
- Your MiniTezCluster test is leaking state into a sibling test.
7. Re-Run Your Original Step 2 Reproducer
Sanity check. The thing you set out to fix is still fixed:
mvn test -pl <module> -Dtest=<YourReproTest> 2>&1 \
| tee capstone-work/validation/07-repro.log
Five runs:
for i in 1 2 3 4 5; do
mvn test -pl <module> -Dtest=<YourReproTest> -q
done
Five greens. Or you have not actually shipped a fix.
8. Regression Sweep
Run the test suite of every module that depends on what you changed. If you
touched tez-api, that is everything. If you touched tez-runtime-library,
that is at least tez-tests, tez-mapreduce, and tez-examples.
# Identify dependents
grep -l "tez-runtime-library" $(find ~/tez-src -name pom.xml)
# Run each
mvn test -pl tez-mapreduce -q | tail -5
mvn test -pl tez-examples -q | tail -5
mvn test -pl tez-tests -q | tail -10
If tez-tests takes too long (it can — there are real MiniTezCluster
runs in there), at least run the tests whose name contains your changed
class:
mvn test -pl tez-tests -Dtest='*Vertex*' -q
9. Performance Validation (If Relevant)
Skip this section unless your fix touches scheduling, shuffle, or any code path documented as "hot." For those, use async-profiler or JFR to capture a flamegraph before and after.
async-profiler pattern
# Start the JVM under test (e.g. a MiniTezCluster integration test).
# forkCount=0 keeps the tests in the Maven JVM, so $! is the PID to profile.
mvn test -pl tez-tests -Dtest=TestPerfWorkload -DforkCount=0 &
TEST_PID=$!
# Attach profiler
~/async-profiler/profiler.sh -d 60 -f /tmp/flame-before.svg $TEST_PID
# Apply your fix, rerun the test the same way, and profile the new PID
~/async-profiler/profiler.sh -d 60 -f /tmp/flame-after.svg $TEST_PID
Compare the two SVGs. The stack frames you care about (e.g.
ShuffleManager.run, MergeManager.merge) should not be wider after your
fix than before. If they are, you have introduced a regression and you owe
the JIRA an explanation.
Simpler: timing assertions in a JUnit test
@Test
public void testShuffleNotSlowerAfterFix() throws Exception {
  long start = System.nanoTime();
  runShuffleWorkload();
  long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  // Loose bound: assert no >30% regression vs. a previously-measured baseline.
  assertTrue("shuffle took " + elapsedMs + "ms, expected < 15000",
      elapsedMs < 15_000);
}
Brittle. Only add if perf is truly the concern.
The Validation Report
Compile everything into one document for the PR:
# Validation report for TEZ-NNNN
## Environment
- JDK: `java -version` -> openjdk version "11.0.21"
- Maven: `mvn -version` -> Apache Maven 3.9.6
- OS: macOS 14.2 / Linux 5.15.0-91-generic
- Tez HEAD: `git rev-parse origin/master` -> a1b2c3d4
## Results
| Check | Status | Notes |
|---|---|---|
| `mvn test -pl tez-dag` | PASS | 1342 tests, 0 failures, 17 skipped |
| `mvn clean install -DskipTests` | PASS | |
| `mvn checkstyle:check` | PASS | |
| `mvn spotbugs:check` | PASS | |
| `mvn apache-rat:check` | PASS | |
| `mvn test -pl tez-tests -Dtest=TestOrderedWordCount` | PASS | |
| Original reproducer | PASS (5/5 runs) | |
| `mvn test -pl tez-mapreduce` | PASS | |
| `mvn test -pl tez-examples` | PASS | |
## Known flakes encountered
- TestSomething#testWhatever — pre-existing flake, see TEZ-XXXX, not caused by this change.
## Performance
- Not applicable / no perf-relevant code paths touched.
Save as capstone-work/validation/REPORT.md. Paste it (or a summary plus
link) into your PR description.
Validation / Self-check
Before advancing to Step 8:
- capstone-work/validation/ contains one log file per check (logs 01–07 at minimum).
- capstone-work/validation/REPORT.md exists with the table above filled in honestly.
- Every check passes, or every failure is documented as a pre-existing issue with a JIRA link.
- You re-ran your Step 2 reproducer five times with your fix applied and got 5/5 green.
- You ran the test suite of at least one module that depends on the one you changed (regression sweep).
- No new SpotBugs warnings introduced (diff against master baseline).
- The validation report is short enough to paste into a PR description without making the reviewer scroll for a screen.
Step 8: Patch Preparation
You have working code, working tests, and a green validation run. Now you
package the change so it can land. Modern Tez does this via GitHub PR; the
older practice (still preferred by some committers) is .patch files attached
to the JIRA. You should know how to do both.
This step is the easiest to skip past and the easiest to lose a week on if you do it sloppily. Treat the PR title, description, and commit message as seriously as the code.
Modern Tez: GitHub Pull Request
Apache Tez has accepted GitHub pull requests (linked to JIRA) for several years. The flow:
# 1. Make sure your branch is up to date with master
cd ~/tez-src
git remote -v
# origin git@github.com:<you>/tez.git
# apache https://github.com/apache/tez.git
git fetch apache
git checkout tez-NNNN-<slug>
git rebase apache/master
# Resolve any conflicts. Rebuild and re-run your tests after rebase.
# 2. Squash to a single clean commit (or 2-3 if logically separable)
git rebase -i apache/master
# In the editor: pick the first commit, squash the rest. Edit the combined
# commit message to one final version.
# 3. Push to your fork
git push --force-with-lease origin tez-NNNN-<slug>
# 4. Open a PR via https://github.com/apache/tez
# Title: TEZ-NNNN: <short summary, present tense>
# Base: apache/tez:master
--force-with-lease instead of --force: protects against overwriting a
collaborator's commit if someone pushed to your branch between your fetch
and your push.
Commit Message Template
TEZ-NNNN: Fix VertexImpl recovery branch short-circuiting non-recovery path
Reverts the unconditional short-circuit added in TEZ-2877 so that
V_TASK_COMPLETED events on the final task are processed by the standard
transition when no recovery is in progress. The original short-circuit
assumed any non-null recoveryData implies an active replay; this assumption
broke when recoveryData is populated speculatively at vertex initialization
even though no replay will occur.
The fix gates the recovery branch on the new isReplayingRecovery() predicate.
The previous behavior is preserved for actual recovery scenarios.
Tests:
- New unit test TestVertexImpl#testV_TASK_COMPLETED_inRunningWithRecovery
- New integration test in tez-tests verifying end-to-end DAG success
with recoveryData populated.
- Existing TestOrderedWordCount and full tez-dag suite pass.
Note the shape:
- Title line: TEZ-NNNN: <verb-phrase, present tense, < 72 chars>.
- Blank line.
- Body paragraphs: what the change does, why, what assumption broke.
- Tests: explicit list of tests added or affected.
No "should fix" or "I think." Past tense for what you did, present tense for what the code does after the change.
PR Title
TEZ-NNNN: <Short imperative summary>
- TEZ-NNNN prefix is mandatory. The bot uses it to link to JIRA.
- Imperative mood: "Fix race", "Add config", "Avoid NPE". Not "Fixed race", not "Fixing race".
- < 72 characters total including the prefix.
Bad: Fix bug, Updates to VertexImpl, My fix for TEZ-NNNN.
Good: TEZ-4567: Honor isReplayingRecovery in VertexImpl completion path.
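The mechanical parts of these rules (prefix and length) can be checked before you push. The sketch below is a hypothetical helper, not a Tez or GitHub tool; PrTitleCheck and its regular expression are assumptions derived from the rules in this section. Imperative mood cannot be checked this way and still needs a human eye.

```java
import java.util.regex.Pattern;

// Hypothetical pre-push check; encodes only the prefix and length rules.
class PrTitleCheck {
  // "TEZ-" + digits + ": " + a summary that starts with a non-space character.
  private static final Pattern TITLE = Pattern.compile("TEZ-\\d+: \\S.*");
  // The text asks for fewer than 72 characters including the prefix.
  private static final int MAX_LENGTH_EXCLUSIVE = 72;

  static boolean isValid(String title) {
    return title.length() < MAX_LENGTH_EXCLUSIVE && TITLE.matcher(title).matches();
  }
}
```

With this check, the "Good" example above passes, while each of the "Bad" examples fails on the missing TEZ-NNNN: prefix.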
PR Description Template
## JIRA
https://issues.apache.org/jira/browse/TEZ-NNNN
## Problem
<2-4 sentences. Symptom + trigger conditions. Cite the root-cause doc.>
When the last `V_TASK_COMPLETED` event arrives for a vertex with non-null
`recoveryData` outside an actual recovery replay, the event is unconditionally
re-routed through `handleRecovery()` rather than processed by the standard
transition. As a result, `completedTaskCount` is not incremented and the vertex
fails to transition to SUCCEEDED. This affects DAGs whose AM populates
recovery data speculatively at vertex initialization.
## Root cause
See `capstone-work/root-cause.md` (or paste inline if short).
Introduced in TEZ-2877 (commit a1b2c3d4).
## Fix
Gate the recovery short-circuit on a new `isReplayingRecovery()` predicate
that returns true only during active replay. Minimum-diff (one production
line + one new private method).
## Testing
- **New:** `TestVertexImpl#testV_TASK_COMPLETED_inRunningWithRecovery` —
unit test using `DrainDispatcher` that reproduces the failure on master
and passes with this fix. Plus a negative-control test.
- **New:** `TestTezNNNNFixIntegration` — `MiniTezCluster` end-to-end test
that runs a 20-task vertex with speculatively-populated recoveryData and
asserts DAG SUCCEEDED.
- **Existing:** Full `tez-dag` suite (1342 tests) passes. `TestOrderedWordCount`
passes. Validation report in commit message footer.
## Backward compatibility
None affected. The fix changes behavior only for the broken case (no replay
in progress). Recovery scenarios are unchanged.
## Configuration
No new keys.
Adjust sections to your fix. The structure stays the same.
GitHub Actions / Yetus Precommit
When you open the PR, GitHub Actions runs the precommit checks. The full
config lives in .github/workflows/ — read it:
ls ~/tez-src/.github/workflows/
cat ~/tez-src/.github/workflows/build.yml
Common checks (subject to change as the workflow evolves):
| Check | What it runs | Failure means |
|---|---|---|
| Compile | mvn install -DskipTests | Build broken on some module |
| Tests | mvn test for each module | Some test failed (yours or flake) |
| Checkstyle | mvn checkstyle:check | Style violation in changed file |
| Javadoc | mvn javadoc:javadoc | Broken Javadoc reference or missing tag |
| RAT | mvn apache-rat:check | New file missing ASL header |
| License | License snippet check | Same as RAT, or LICENSE/NOTICE drift |
| SpotBugs | mvn spotbugs:check | New static-analysis warning |
Failures appear on the PR as red ✖ marks. Click into the failing job to read the log. Common first-PR failures:
- Javadoc broken: You referenced {@link Foo#bar} and bar doesn't exist. Either fix the link or remove it.
- Checkstyle: A line exceeded 120 chars or an unused import slipped in.
- License: New file missing the header. Add it.
- Test: A flake. Re-run the workflow ("Re-run all failed jobs" in the Actions tab). If it goes green on retry, leave a comment: "Re-ran job — flake, see TEZ-XYZ." If it fails again, your fix probably broke it.
Push fixes as new commits on the same branch. The PR auto-updates. After review approval, you'll squash on merge.
Old-Style: .patch Files on JIRA
Some committers still review .patch attachments. Know the convention.
Generate
git format-patch apache/master -o /tmp/
# Produces /tmp/0001-TEZ-NNNN-Fix-...patch
Or, for one combined diff:
git diff apache/master..HEAD > /tmp/TEZ-NNNN.01.patch
Naming convention
TEZ-<NNNN>.<iteration>.patch. So your first attachment is TEZ-4567.01.patch,
second iteration after review feedback is TEZ-4567.02.patch. Some committers
use TEZ-4567.001.patch (three-digit). Match whatever pattern the most
recent committer used on that issue.
For a branch-specific patch (e.g. against the branch-0.10):
TEZ-4567.branch-0.10.01.patch.
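If you script your patch generation, the naming convention reduces to two format strings. The helper below is a hypothetical sketch (PatchName is not a Tez tool) that produces the two-digit form; switch to %03d if the issue's most recent committer used three digits.

```java
import java.util.Locale;

// Hypothetical helper that formats attachment names per the convention above.
class PatchName {
  // Patch against master, two-digit iteration, e.g. TEZ-4567.01.patch
  static String forMaster(int issue, int iteration) {
    return String.format(Locale.ROOT, "TEZ-%d.%02d.patch", issue, iteration);
  }

  // Branch-specific patch, e.g. TEZ-4567.branch-0.10.01.patch
  static String forBranch(int issue, String branch, int iteration) {
    return String.format(Locale.ROOT, "TEZ-%d.%s.%02d.patch", issue, branch, iteration);
  }
}
```

The output of forMaster or forBranch can be used directly as the redirect target of the git diff command shown earlier.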
Attach
In JIRA: "Attach files" → upload. Then "More" → "Patch Available" to flip the state. Cancel patch (revert to "In Progress") if you find a problem before review starts.
The JIRA workflow is covered fully in Step 9. The patch-file mechanics live here.
Rebasing on Master Without Losing Review Comments
GitHub's PR view loses inline comment threads when you force-push a rebase that changes the SHAs reviewers commented on. To minimize the damage:
- Don't rebase mid-review unless you have to. Merge commits from apache/master into your branch are usually acceptable during active review; squash at the end.
- When you do rebase, leave a comment: "Force-pushed to rebase on master (was <old SHA>, now <new SHA>). All review threads should still be visible against the latest commit."
- Squash only at the very end, after approval, just before merge.
- If you really break the comment threads, post a summary comment listing "what was at line X became line Y in the new push." Reviewers appreciate it.
To rebase:
git fetch apache
git checkout tez-NNNN-<slug>
git rebase apache/master
# If conflicts: edit, git add, git rebase --continue
mvn install -DskipTests -pl <module> -am -q
mvn test -pl <module> -Dtest=<YourTests> -q
git push --force-with-lease origin tez-NNNN-<slug>
Co-Author and Sign-Off
If a committer or another contributor materially helped (suggested the fix direction, found the root cause), credit them:
TEZ-NNNN: <summary>
<body>
Co-authored-by: Alice <alice@example.org>
Tez does not require a Signed-off-by line (it is not a DCO project; contributions are made under the Apache License, and committers have an Apache ICLA on file), but committers appreciate when you note influences in the commit message.
What Reviewers Look For First
In rough order:
- PR title and JIRA link — wrong format, instant correction request.
- Description quality — vague description, "please clarify" comment.
- Diff size — > ~200 lines for a "bug fix" gets scrutiny on scope creep.
- Tests present — no tests, immediate request.
- Tests fail on master, pass with fix — confirms the test is actually testing the fix, not just a happy path.
- Production diff is minimum to fix the bug — every extra change has to justify itself.
- Style and convention compliance — checkstyle and tests must be green.
- API hygiene — no public methods added/removed without discussion.
- Backward compatibility — does the fix change observable behavior for non-buggy cases? If yes, was it discussed?
Optimize for the first seven before you push. The last two are usually discussed in JIRA comments before the PR opens.
Validation / Self-check
Before advancing to Step 9:
- PR exists on apache/tez with the format TEZ-NNNN: <summary>.
- PR description follows the template; cites the root-cause document.
- Commit message follows the template (title, body, tests footer).
- GitHub Actions precommit is green (every check), or every red has a documented and accepted explanation.
- Branch is rebased on a recent apache/master (within last 24-48h ideally).
- PR was opened with the URL pasted into the JIRA as a comment.
- You can articulate in one paragraph why every line of your diff is necessary, if a reviewer asks.
Step 9: JIRA and Documentation
The JIRA is the project's permanent memory. The PR is ephemeral — it lives on
GitHub, gets merged, fades into git log. The JIRA is what users grep when
they hit a similar bug two years later, what release managers read when
compiling release notes, what new contributors find when researching prior
art. Treating it as a checkbox is the laziest possible thing you can do.
This step is short on procedure and heavy on hygiene. Twenty minutes done well saves three different people an hour each later.
The Status Workflow
Tez uses Apache's standard JIRA workflow. The states you will pass through:
Open -> In Progress -> Patch Available -> Resolved -> Closed
             ^                                ^
             |                                |
           (you)                         (committer)
| State | Set by | Meaning |
|---|---|---|
| Open | Reporter | Bug exists, nobody is working on it. |
| In Progress | Assignee | Someone is actively investigating. |
| Patch Available | Assignee | A patch / PR is ready for committer review. |
| Resolved | Committer | Patch merged. Resolution: Fixed plus Fix Version. |
| Closed | Anyone | Verified in a release. Often skipped — many Tez tickets stay Resolved indefinitely. |
Transitioning correctly
- You move it to In Progress when you claim it in Step 1.
- You move it to Patch Available when your PR is open and precommit is green. This is the signal "ready for human review, not just CI."
- You do NOT move it to Resolved. Only a committer does that when they merge. Setting it yourself will be reverted, and you will look new.
- If a committer asks you to revise, the state usually stays at Patch Available. Move back to In Progress only if you'll be rewriting significantly (multi-day rework).
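One way to internalize who may set which state is to write the rules down as code. The model below is a hypothetical sketch of the table and bullets above, not a JIRA API; the Role and Status enums, and the simplification that a committer may set any state, are assumptions for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the workflow described above; not a JIRA API.
class JiraWorkflow {
  enum Status { OPEN, IN_PROGRESS, PATCH_AVAILABLE, RESOLVED, CLOSED }
  enum Role { CONTRIBUTOR, COMMITTER }

  // States a contributor sets on their own ticket, per the table:
  // In Progress and Patch Available as assignee, Closed as "anyone".
  private static final Set<Status> CONTRIBUTOR_TARGETS =
      EnumSet.of(Status.IN_PROGRESS, Status.PATCH_AVAILABLE, Status.CLOSED);

  // Assumption: a committer may set any state; a contributor only the set above.
  static boolean maySet(Role role, Status target) {
    if (role == Role.COMMITTER) {
      return true;
    }
    return CONTRIBUTOR_TARGETS.contains(target);
  }
}
```

The one rule worth remembering from this model is that maySet(CONTRIBUTOR, RESOLVED) is false: Resolved belongs to the committer who merges.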
The Patch Available ritual
When you flip to Patch Available, leave a comment:
PR is now open at <link>, precommit is green, ready for review.
Summary: <one paragraph from the PR description>.
Tests: <list>.
Specific reviewer requests: <if any, e.g. "would appreciate a look from
@someone since they wrote the original code">.
This wakes up the JIRA's watchers (committers who follow issues@) and
gives them enough context to decide whether to pick it up.
Required Fields
| Field | Who sets | What to set |
|---|---|---|
| Assignee | You | Yourself (Step 1). |
| Component | You | tez-dag, tez-runtime-library, etc. — whatever module you primarily changed. |
| Affects Version | Reporter or you | The earliest version where the bug reproduces. |
| Fix Version | Committer | Leave blank. Only PMC/committers set this. You can comment "suggesting fix version X.Y.Z" if you have a strong opinion. |
| Priority | Reporter or PMC | Don't bump your own. Comment if you think the priority is wrong. |
| Labels | You | Add flaky-test / recovery / shuffle if it helps grep later. Don't invent vanity labels. |
| Release Notes | You, if user-visible | Mandatory if behavior, API, or configuration changes are visible. See below. |
| Linked Issues | You | Link the PR (web link) and any related JIRAs. See below. |
Release Notes
If your fix changes anything a user can observe — output format, config key default, error message, performance characteristic — fill out the "Release Notes" field. Format:
Fixed an issue where vertices with speculatively-populated recovery data
would not transition to SUCCEEDED after all tasks completed. Affects DAGs
submitted via TezClient when checkpoint-based recovery is enabled. No
configuration or API change is required.
Two to four sentences. Past tense ("Fixed"). User-facing language ("DAGs", "TezClient"), not implementation jargon ("V_TASK_COMPLETED handler").
If your fix is purely internal (refactor of a private method, test-only change), leave Release Notes blank. The release manager will skip it.
Linking the PR
Issue Links → "is related to" → Web Link → paste the GitHub PR URL.
Tez also has a bot that auto-links a PR to the JIRA when the PR title starts
with TEZ-NNNN:. The bot fires within minutes. If after an hour the JIRA
does not have a "GitHub Pull Request" link visible, add it manually:
JIRA → More → Link → Web Link → URL: https://github.com/apache/tez/pull/<NNN> →
Link Text: GitHub PR.
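Because the auto-link depends on the PR title prefix, it can help to check the title before opening the PR. The sketch below assumes the convention is exactly "TEZ-<digits>: <summary>"; the bot's real matching rule is not documented here and may be looser or stricter, and `PrTitleCheck` is a hypothetical helper name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrTitleCheck {
    private PrTitleCheck() {}

    // Assumed convention: JIRA key, a colon, a space, then a non-empty summary.
    private static final Pattern TITLE = Pattern.compile("^(TEZ-\\d+): \\S.*$");

    // Returns the JIRA key when the title follows the assumed convention.
    static Optional<String> jiraKey(String prTitle) {
        Matcher m = TITLE.matcher(prTitle);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(jiraKey("TEZ-4567: Gate recovery short-circuit on active replay"));
        System.out.println(jiraKey("Fix vertex stuck in RUNNING (TEZ-4567)"));
    }
}
```

The first title yields the key `TEZ-4567`; the second yields an empty result because the key is not at the start of the title.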
Cross-Linking Related JIRAs
If your fix interacts with other tickets, link them explicitly:
| Relation | When to use |
|---|---|
| is duplicated by | Another JIRA is a duplicate of yours (close that one). |
| duplicates | Yours is the duplicate (close yours, work on the older one). |
| is related to | Touches similar code but distinct issue. |
| is blocked by | You cannot land until another JIRA lands first. |
| is caused by | Bisect identified TEZ-XYZ as the regression source. |
| supersedes | Your fix replaces an older abandoned attempt. |
Be conservative. Spurious links pollute the issue graph. Cross-link only where the connection is concrete.
Code Comments in the Fix
The JIRA explains what and why at the project level. Inline code comments explain why at the file level for the next person editing this line.
Good inline comment patterns:
// TEZ-NNNN: only short-circuit when recovery replay is actually in progress;
// recoveryData may be populated speculatively at vertex init even when no
// replay will occur. See the JIRA for the affected scenario.
if (recoveryData != null && isReplayingRecovery()) {
handleRecovery(event);
return;
}
Rules:
- Cite the JIRA number. Future-you grepping the file for TEZ- will find the context immediately.
- Explain the non-obvious invariant, not what the code obviously does. Never write "// increment counter" next to count++.
- One or two lines max. If you need a paragraph, write the design note in the class Javadoc or in a markdown doc under docs/.
- Don't paste the entire root-cause document. The JIRA holds that.
Notifying Watchers
After Patch Available, the JIRA's watchers see an email. If you want a
specific committer's attention (e.g. the author of the introducing commit
from your git bisect), @mention them in a JIRA comment:
[~alice] (author of TEZ-2877) — would appreciate a sanity check on the
recovery short-circuit gating in this PR, since you wrote the original
branch. No urgency.
The [~jira-username] syntax is JIRA's mention. Find the username from
their JIRA profile URL (https://issues.apache.org/jira/secure/ViewProfile.jspa?name=<username>).
Do this once. Do not @-mention in every subsequent comment — committers filter their inboxes.
Backporting Fix to Branches
For most Capstone work, you fix on master and stop. But if your bug
affects a maintained release branch and a committer asks you to backport:
- Comment on the JIRA: "Will backport to branch-0.10 once master patch lands."
- After merge to master, create a new branch from apache/branch-0.10:
  git fetch apache
  git checkout -b tez-NNNN-branch-0.10 apache/branch-0.10
  git cherry-pick <master-fix-commit-sha>
  # Resolve conflicts (often minor; sometimes major if branch diverged).
- Run validation on the branch (same Step 7 checks).
- Open a separate PR titled TEZ-NNNN (branch-0.10): <summary> or attach a TEZ-NNNN.branch-0.10.01.patch to the same JIRA.
Each branch's PR/patch is a separate review.
After Merge
When a committer merges your PR:
- The PR is closed automatically, and they'll comment "Committed to master, thanks @you" with the merged-commit SHA.
- They (or the bot) set the JIRA to Resolved with Resolution: Fixed and Fix Version: X.Y.Z.
- You comment with a thanks and any follow-up plans:
  Thanks for the review and merge, [~alice]. I'll watch for the next RC to verify it lands cleanly. Filed TEZ-MMMM for the follow-up refactor we discussed.
- If you spotted a related improvement during review, file the follow-up JIRA immediately; do not let it slip.
Documentation Beyond the JIRA
Most bug fixes need no further doc. Exceptions:
| Change | Where to document |
|---|---|
| New config key | tez-api/src/main/resources/META-INF/services/... if not auto-generated; reference from the Tez site config docs page. |
| New public API | Javadoc on the new method/class + the relevant docs/<feature>.md if one exists. |
| Behavior change visible to operators | A note in CHANGELOG.md (committer usually handles), and a JIRA Release Notes entry (you write this). |
| New tunable or debug flag for operators | Mention in the Tez configuration reference page (commit to the tez-site/ directory or open a JIRA for the site update). |
When in doubt, ask in the JIRA: "Should I update the docs page for X as part of this, or as a follow-up JIRA?" Committers will tell you.
Validation / Self-check
Before advancing to Step 10:
- JIRA status is Patch Available with a comment summarizing the change and linking the PR.
- Assignee is you.
- Component is set to the right module.
- Affects Version is set to a real Tez version where the bug reproduces.
- Release Notes field is filled in (or explicitly blank with a one-line "internal only" justification in the PR description).
- PR is linked under Issue Links → Web Link.
- Any related JIRAs are cross-linked with the correct relation (is caused by / is related to / etc).
- Inline code comments cite TEZ-NNNN where the change is non-obvious.
- If a committer was specifically helpful (author of regressing commit, reviewer on related work), you @-mentioned them once, not repeatedly.
Step 10: Engineering Write-Up
The patch is merged. The JIRA is Resolved. Most contributors stop here. The ones who become committers write the post. The write-up is the artifact that travels with you when you change jobs, apply for a committer vote, or get cited by another contributor doing similar work.
Eight hundred to a thousand words. Most of it written in the four hours right after merge, while the dead ends are still fresh.
Why It Matters
Three audiences:
- Future you. Six months from now you'll touch this code again and want to remember what you tried.
- The next contributor working a similar bug. They'll find your post via Google ("Tez vertex stuck RUNNING") and shortcut a week of work.
- The committers / PMC evaluating you for a vote. They want to see that you can communicate engineering reasoning, not just produce diffs.
A good write-up is not a press release. It is a postmortem: honest about what you tried, including the failed approaches.
The Template
Sections in order, suggested word counts.
Title (one line)
Fixing TEZ-NNNN: <one-line technical summary>
Examples:
- "Fixing TEZ-4567: A speculative-recovery short-circuit race in VertexImpl"
- "Fixing TEZ-3982: Why our shuffle was 30% slow on small inputs"
- "Fixing TEZ-2451: An off-by-one in MergeManager spill accounting"
Technical, specific. Not "My first Apache Tez contribution" — write that post separately on your blog. The engineering post stands on its own.
Problem (100–150 words)
What broke, for whom, under what conditions. Plain English, but precise.
Tez vertices configured with checkpoint-based recovery would intermittently
fail to transition to SUCCEEDED, leaving the DAG in RUNNING state until the
AM hit its global timeout. The bug only manifested when the application
master pre-populated recovery data at vertex initialization (rather than
lazily during an actual replay), which is the path used by long-running
Tez sessions reusing AMs across DAG submissions.
The symptom was a stalled DAG with all tasks reporting SUCCEEDED in the
counters but no DAGFinishedEvent in the AM log. Affected Tez 0.9.x and
0.10.0 onward.
State the symptom (what the user sees), the trigger condition (when it manifests), and the affected version range. No code yet.
Investigation Log (200–300 words)
The most valuable section. Walk through what you tried, including the hypotheses that were wrong.
Initial hypothesis was a task-scheduler bug — we suspected
TaskSchedulerManager was dropping a TASK_COMPLETED event under load.
DrainDispatcher-based reproducers in isolation showed no event loss, so
we ruled this out within a day.
Second hypothesis: a state-machine transition guard rejecting the final
event. Adding TRACE logging to VertexImpl confirmed V_TASK_COMPLETED was
arriving and being dispatched, but completedTaskCount remained one short
of total. This shifted attention from "the event is missing" to "the
event is processed but not by the expected handler."
Reading VertexImpl.handle(...) line by line revealed the recovery
short-circuit at line ~2400: `if (recoveryData != null) { handleRecovery(...); }`.
A git blame placed this in TEZ-2877 (commit a1b2c3d4), where the
assumption "non-null recoveryData implies active replay" was reasonable
at the time but became invalid when TEZ-3105 introduced speculative
recovery-data population at vertex init.
The actual race: V_TASK_COMPLETED for the final task arrived at the
moment when recoveryData was populated but no replay was in progress,
and the guard never checked whether a replay was actually active.
Three to five hypotheses, in the order you tried them. Each with one sentence on what suggested it and one sentence on what disproved it. The dead ends are not embarrassments — they are the work, and they teach readers what not to spend a week on.
Root Cause (50–100 words)
One paragraph, the truth as you now understand it.
The vertex state machine's V_TASK_COMPLETED handler in the RUNNING state
short-circuited any event to handleRecovery() when recoveryData was non-null,
regardless of whether a recovery replay was actually in progress. Speculative
population of recoveryData at vertex initialization (TEZ-3105) made the
guard fire in normal execution, routing terminal events to the recovery
path which silently ignored them when not replaying. The completedTaskCount
counter never reached totalTaskCount, blocking the SUCCEEDED transition.
Cite the introducing JIRA. Cite the bisect commit if you have it.
Final Design (150–200 words)
What you actually changed and why this design over alternatives.
The fix introduces an isReplayingRecovery() predicate that returns true
only when a recovery replay is in flight (tracked by an existing
RecoveryState flag in DAGAppMaster). The short-circuit is gated on this
predicate:
if (recoveryData != null && isReplayingRecovery()) { ... }
This is a one-line production change plus a four-line predicate method.
It preserves all behavior for actual recovery scenarios and corrects the
behavior only for the speculatively-populated case.
Show the diff size and the principle ("minimum surface area"). Note any public API impact (here: none).
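The gating described in the sample above can be illustrated with a toy model. This is a sketch under stated assumptions: `Vertex` here is a hypothetical stand-in, not Tez's VertexImpl, and the `gateOnReplay` switch exists only to compare the original guard with the gated one side by side.

```java
final class RecoveryGateSketch {
    private RecoveryGateSketch() {}

    // Hypothetical stand-in for the vertex in the write-up; not Tez's VertexImpl.
    static final class Vertex {
        private final int totalTasks;
        private final Object recoveryData;  // may be populated speculatively at init
        private final boolean replaying;    // true only while a recovery replay is in flight
        private final boolean gateOnReplay; // false = original guard, true = gated guard
        private int completedTasks;

        Vertex(int totalTasks, Object recoveryData, boolean replaying, boolean gateOnReplay) {
            this.totalTasks = totalTasks;
            this.recoveryData = recoveryData;
            this.replaying = replaying;
            this.gateOnReplay = gateOnReplay;
        }

        boolean isReplayingRecovery() {
            return replaying;
        }

        void onTaskCompleted() {
            boolean shortCircuit = gateOnReplay
                    ? recoveryData != null && isReplayingRecovery()
                    : recoveryData != null;
            if (shortCircuit) {
                // Recovery path. In this sketch it drops the event, mirroring the
                // described bug; handling during a real replay is not modeled.
                return;
            }
            completedTasks++;
        }

        boolean succeeded() {
            return completedTasks == totalTasks;
        }
    }

    // Runs three task completions with recovery data present but no replay active.
    static boolean reachesSucceeded(boolean gateOnReplay) {
        Vertex v = new Vertex(3, new Object(), false, gateOnReplay);
        for (int i = 0; i < 3; i++) {
            v.onTaskCompleted();
        }
        return v.succeeded();
    }

    public static void main(String[] args) {
        System.out.println("original guard reaches SUCCEEDED: " + reachesSucceeded(false));
        System.out.println("gated guard reaches SUCCEEDED: " + reachesSucceeded(true));
    }
}
```

With the original guard the completion count stays at zero and the vertex never succeeds; with the gated guard all three completions are counted. A snippet of this size is also a reasonable model for what to embed in the write-up itself.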
Alternatives Considered (100–150 words)
Two to three alternatives you rejected, with the reason.
**Alternative 1: stop populating recoveryData speculatively at vertex init.**
Rejected: TEZ-3105 documented performance reasons for the eager population
(avoids a stall when actual recovery kicks in). Reverting it would
regress that path.
**Alternative 2: have handleRecovery() forward the event back to the
standard transition when not replaying.** Rejected: it works, but couples
the recovery path to internal knowledge of which events the standard
transition needs. The gate-at-source approach is local and reviewable.
**Alternative 3: remove the short-circuit entirely and let handleRecovery()
no-op when not replaying.** Rejected: changes the semantics of every other
event flowing through the recovery path, with broader behavioral risk for
a narrowly-scoped bug.
This is the section that separates contributor-quality write-ups from committer-quality ones. Anyone can ship a fix. Articulating why this fix and not the obvious alternatives demonstrates engineering judgment.
Performance / Behavior Impact (50–100 words)
If perf-relevant, numbers from Step 7. Otherwise, one sentence:
No measurable performance impact. The new predicate is a single field
read on a hot path (VertexImpl.handle) but the original short-circuit
already paid this cost on every event. Validated via TestOrderedWordCount
runtime: no statistically significant change across 10 runs.
Lessons Learned (100–150 words)
The transferable insights, written for a peer. Things you would tell yourself before starting.
- Recovery code in Tez has always been the sharpest edge: it is the
least-tested path because it only runs during AM failover, and most
developer environments don't trigger it. When a bug touches recovery
data flow, assume the test coverage is thin and add reproducers
aggressively.
- `git log -S` (the pickaxe search) and `git bisect` together were decisive:
bisect found the introducing commit (TEZ-2877), and a pickaxe search on the
changed expression showed it had never had a guard. Without bisect this
would have been a week of code archaeology.
- DrainDispatcher in TestVertexImpl is underused. The repro test for this
bug took two hours to write once I learned the pattern, and it is now
permanent regression protection.
Three to five bullets. Concrete enough that a peer at another project could apply them.
Links
- JIRA: https://issues.apache.org/jira/browse/TEZ-NNNN
- PR: https://github.com/apache/tez/pull/<NNN>
- Merged commit: <SHA>
- Introducing commit (TEZ-2877): <SHA>
Where to Publish
Three venues, in roughly decreasing order of effort and impact.
1. Personal blog or company engineering blog
Full ~1000-word write-up. SEO-friendly title with the JIRA number and a keyword phrase users would search for ("Tez vertex stuck RUNNING fix"). Link prominently to JIRA and PR. This is the version that follows you across jobs.
2. Apache wiki / Tez documentation
Shorter version (300–500 words) focused on the lesson, not the personal narrative. Filed under a relevant page (recovery troubleshooting, debugging state machines). Requires wiki access — committers will grant it once you have a few merged contributions.
3. dev@ summary email
Two to three paragraph summary on dev@tez.apache.org with subject
[TEZ-NNNN] Notes on the fix. Lets watchers and PMC see the engineering
reasoning without having to read the whole PR. Optional but earns
goodwill.
Subject: [TEZ-NNNN] Notes on the fix
Hi all,
Merged TEZ-NNNN this morning. Quick notes on the investigation since
recovery bugs are uncommon and the root cause was a non-obvious
interaction with TEZ-3105:
<2 paragraphs of summary>
Full write-up: <link to blog post>
Thanks again to Alice for the review.
Anti-Patterns
What separates write-ups that help from ones that don't:
- "I learned a lot working on this!" — Yes, we know. Cut it. The artifact is the engineering, not the feel-good.
- Personal narrative dominating the engineering. Save the "my journey into open source" angle for a separate post. Engineering posts get cited and reread. Narrative posts get one-time clicks.
- Sanitized version where you "knew the answer all along." Nobody believes this and it actively misleads new contributors who feel inadequate when their investigation is messy. Be honest about the dead ends.
- No code snippets. A write-up without showing the actual diff or the symptomatic log line is unfalsifiable.
- No links. JIRA, PR, commit — all three minimum. A write-up without the JIRA link is unreviewable.
- Word-padding to look thorough. A tight 600-word write-up that respects the reader beats a 2000-word slog every time.
Validation / Self-check
Before declaring the Capstone complete:
- The write-up is published at a URL you can share (blog, GitHub Gist, capstone-work/writeup.md in a public repo).
- It is 500–1000 words; not 200 (too thin) and not 3000 (padding).
- Investigation Log section contains at least two hypotheses you ruled out, not only the winning one.
- Alternatives Considered section names at least two designs you rejected with reasons.
- Lessons Learned section has three to five bullets, each concrete enough to be reusable by another contributor.
- JIRA, PR, and merged-commit SHA are all linked.
- The write-up reads as something a peer engineer would respect, not a triumphalist blog post.
Evaluation Rubric
A 100-point self-grading rubric for the Capstone. Score yourself honestly after you finish Step 10. The scoring is calibrated against what Tez committers actually look for — not what feels good to read.
The point of this rubric is not the score. It is the diagnostic: a low score on one dimension tells you exactly where to invest the next contribution.
Scoring Dimensions
Seven dimensions, weighted by how much they matter for review outcomes.
| # | Dimension | Points |
|---|---|---|
| 1 | Problem articulation | 20 |
| 2 | Execution-path mastery | 20 |
| 3 | Implementation quality | 20 |
| 4 | Testing | 15 |
| 5 | Review responsiveness | 10 |
| 6 | Documentation | 10 |
| 7 | Community interaction | 5 |
| | Total | 100 |
1. Problem Articulation (20 pts)
Can you state, in one paragraph, what was broken, for whom, under what conditions?
| Score | What it looks like |
|---|---|
| 18-20 | Crisp one-paragraph statement covering symptom, trigger conditions, affected version range, and operational impact. Distinguishes "this is what the user sees" from "this is the underlying mechanism." Could be read aloud at a standup and a peer would correctly grasp the bug. |
| 14-17 | Clear symptom but trigger conditions vague ("happens sometimes under load"). OR trigger clear but conflates symptom with root cause. |
| 10-13 | Reader needs to ask follow-up questions to understand what was broken. Uses jargon without grounding it in user-visible behavior. |
| 5-9 | Mostly restates the JIRA title. No conditions. No version impact. |
| 0-4 | "It was broken and I fixed it." |
Look for: the word "intermittent" used without a documented trigger; conflation of symptom (vertex stuck) with cause (event short-circuit). Either one costs points.
2. Execution-Path Mastery (20 pts)
Did you actually trace the code, or did you guess?
| Score | What it looks like |
|---|---|
| 18-20 | Step-3 document maps the full path from user submission to bug location with file:line citations at every layer. Includes a diagram (mermaid or text-arrow). Cites the AsyncDispatcher event hop and the specific state-machine transition where the bug fires. Reviewer reading it could open each file at each line and follow the logic without asking questions. |
| 14-17 | Most layers cited but one or two skipped ("then the event reaches VertexImpl"). Diagram present but missing a critical hop. |
| 10-13 | Cites the location of the bug correctly but does not trace how execution reached it. No diagram. |
| 5-9 | Vague references ("the dispatcher handles it") without file:line. |
| 0-4 | No execution-path document, or it is just a paragraph of prose. |
Look for: presence of tez-api/src/main/...-style paths with line numbers
that match the resolved commit SHA.
3. Implementation Quality (20 pts)
Diff hygiene, scope discipline, convention compliance.
| Score | What it looks like |
|---|---|
| 18-20 | Minimum-diff fix. Production change measured in tens of lines, not hundreds. Every changed line is justifiable in one sentence. No drive-by refactors, no opportunistic renames. Public API surface unchanged unless required. Naming, slf4j logging style, Preconditions, exception messages all match Tez conventions. Checkstyle, SpotBugs, RAT all green without manual overrides. |
| 14-17 | Mostly minimum-diff but one or two stray changes that don't belong. Conventions mostly followed; minor style nits a reviewer would flag. |
| 10-13 | Fix works but is broader than necessary. Scope creep ("while I was here I cleaned up..."). Conventions inconsistently applied. |
| 5-9 | Significant scope creep. Public API changed unnecessarily. Style violations would block precommit without revision. |
| 0-4 | Diff is so large reviewers would request it be broken up before reviewing. OR breaks public API silently. |
Look for scope-creep tells, such as git diff origin/master --stat listing
files unrelated to the bug.
4. Testing (15 pts)
Coverage, determinism, regression value.
| Score | What it looks like |
|---|---|
| 14-15 | New unit test reproduces the bug deterministically on master (DrainDispatcher or equivalent), passes with fix. Negative-control test (similar input where the bug should NOT trigger) included. Branch coverage on the changed lines is high. Integration test with MiniTezCluster confirms the fix in an end-to-end DAG. No Thread.sleep, no wall-clock dependencies, no order-dependent assertions. Test ran 10x in a loop without flake. |
| 11-13 | Unit test present and deterministic but no negative control. OR has an integration test but the unit test is weak. |
| 7-10 | Unit test present but uses Thread.sleep or is otherwise non-deterministic. Coverage of fix path incomplete. |
| 3-6 | Test exists but only checks the happy path; would have passed before the fix. |
| 0-2 | No new tests, or tests that fail on master AND on the fix. |
Look for: presence of dispatcher.await() rather than Thread.sleep; a
test name that describes the scenario (testV_TASK_COMPLETED_inRunningWithRecovery)
rather than the method (testHandle).
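The preference for draining over sleeping can be shown without any Tez dependency. The sketch below uses a hypothetical `MiniDispatcher` built on a single-threaded executor; it illustrates the idea of a deterministic await and is not how Tez's DrainDispatcher is implemented.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class DeterministicAwaitSketch {
    private DeterministicAwaitSketch() {}

    // Hypothetical single-threaded dispatcher; a stand-in for the idea only.
    static final class MiniDispatcher implements AutoCloseable {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        void dispatch(Runnable event) {
            worker.execute(event);
        }

        // Blocks until every event dispatched before this call has been handled.
        // The worker is FIFO, so a marker task finishing implies earlier events finished.
        void await() {
            try {
                worker.submit(() -> { }).get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while draining", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("marker task failed", e);
            }
        }

        @Override
        public void close() {
            worker.shutdown();
        }
    }

    // Dispatches `events` completions and returns the count observed after draining.
    static int completedAfterAwait(int events) {
        AtomicInteger completed = new AtomicInteger();
        try (MiniDispatcher dispatcher = new MiniDispatcher()) {
            for (int i = 0; i < events; i++) {
                dispatcher.dispatch(completed::incrementAndGet);
            }
            dispatcher.await(); // deterministic: no Thread.sleep, no wall-clock guess
            return completed.get();
        }
    }

    public static void main(String[] args) {
        System.out.println(completedAfterAwait(1000));
    }
}
```

The assertion after `await()` holds on every run because it waits on the queue itself rather than on elapsed time. One limitation of this sketch: events that enqueue further events from inside a handler are not covered by a single marker task.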
5. Review Responsiveness (10 pts)
How well you ran the review cycle.
| Score | What it looks like |
|---|---|
| 9-10 | Every reviewer comment addressed in code or with a substantive reply. Iteration cadence < 48h on most comments. Disagreements (when they happened) made the technical case without defensiveness. Updated PR description after material changes so the top-of-PR text stays accurate. |
| 7-8 | Addresses comments correctly but slow (multi-day gaps). OR addresses most comments but lets a few stylistic ones slide without acknowledgement. |
| 5-6 | Defensive on at least one comment ("but I think my way is fine"). OR force-pushed without summarizing the diff for reviewers. |
| 2-4 | Required multiple reminders from reviewers. Comments not addressed cleanly. |
| 0-1 | PR went silent for > 2 weeks without explanation, or contributor argued every comment. |
Look for: PR review threads marked "resolved" by the contributor with a substantive commit pushed, not just a reply.
6. Documentation (10 pts)
JIRA fields, code comments, write-up presence.
| Score | What it looks like |
|---|---|
| 9-10 | JIRA has Component, Affects Version, Release Notes (if user-visible), PR link, and relevant cross-links. In-code comments cite TEZ-NNNN where the change is non-obvious. Write-up exists at a public URL. JIRA status correctly walked through In Progress -> Patch Available. |
| 7-8 | JIRA mostly filled but Release Notes missing on a user-visible change. Code comments present but don't cite the JIRA. |
| 5-6 | JIRA workflow followed but fields incomplete. No write-up beyond the PR description. |
| 2-4 | JIRA fields blank or wrong. Comments absent at the surprising lines. |
| 0-1 | No JIRA hygiene at all. |
Look for: the JIRA's "Release Notes" field being populated or an explicit note explaining why it's intentionally blank.
7. Community Interaction (5 pts)
Mailing list etiquette, claiming/handoff hygiene.
| Score | What it looks like |
|---|---|
| 5 | Claimed the JIRA before starting. Posted to dev@ only when meaningful (design question, summary after merge). Used [TEZ-NNNN] subject prefix. Was reachable during review. Thanked reviewers explicitly. If they hit a wall, posted clearly with "stuck on X, considering A/B/C, leaning A because Y." |
| 3-4 | Mostly good etiquette; one minor slip (claimed late, or one off-topic mailing-list post). |
| 1-2 | Did not claim the JIRA before working. OR sent mailing-list traffic that was really just chat ("does anyone know..."). |
| 0 | Worked silently for weeks, then dropped a PR with no JIRA assignment and no context. |
Look for: a JIRA comment by the contributor before the first PR push, along the lines of "Working on this, will have a patch in a few days."
Tier Thresholds
Where you land tells you what to do next.
| Score | Tier | Interpretation |
|---|---|---|
| 95-100 | PMC-ready | This is the quality of work that earns a committer vote, given a track record of several such contributions over months. You are operating at the level of someone the PMC would trust to maintain a module. |
| 90-94 | Committer-ready | You are writing patches at committer quality. With 3-5 such contributions across different modules over 6-12 months and demonstrated review participation on others' patches, a vote is plausible. |
| 80-89 | Strong contributor | A reliable contributor whose patches need minimal review iteration. Keep building the track record; this is the level where committers actively look forward to reviewing your work. |
| 65-79 | Contributor | Solid bug-fix-grade work. Patches land with normal review iteration. Most contributions to most projects live here, and it is honorable work. |
| 50-64 | Learning | Patches eventually land but with significant reviewer guidance. Use the next contribution to focus on the dimension where you scored lowest. |
| < 50 | Foundational gap | The contribution may have merged, but the process skipped enough corners that another reviewer or future maintainer is paying a tax. Restart with a smaller bug and apply the rubric end-to-end. |
The tier is not a personality assessment. It is calibrated to the artifact you produced for this one Capstone. The same person can score 65 on one contribution and 95 on the next.
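The arithmetic behind the rubric is simple enough to script. A minimal sketch, with the per-dimension maxima and tier boundaries copied from the tables above; `CapstoneTier` and its method names are hypothetical.

```java
final class CapstoneTier {
    private CapstoneTier() {}

    // Maximum points per dimension, in rubric order (sums to 100).
    static final int[] MAX = {20, 20, 20, 15, 10, 10, 5};

    // Sums the seven dimension scores, rejecting values outside each dimension's range.
    static int total(int[] scores) {
        if (scores.length != MAX.length) {
            throw new IllegalArgumentException("expected " + MAX.length + " scores");
        }
        int sum = 0;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] < 0 || scores[i] > MAX[i]) {
                throw new IllegalArgumentException("dimension " + (i + 1) + " out of range: " + scores[i]);
            }
            sum += scores[i];
        }
        return sum;
    }

    // Maps a total to the tier label using the lower bound of each band.
    static String tierFor(int total) {
        if (total < 0 || total > 100) {
            throw new IllegalArgumentException("total must be 0-100: " + total);
        }
        if (total >= 95) return "PMC-ready";
        if (total >= 90) return "Committer-ready";
        if (total >= 80) return "Strong contributor";
        if (total >= 65) return "Contributor";
        if (total >= 50) return "Learning";
        return "Foundational gap";
    }

    public static void main(String[] args) {
        int[] scores = {17, 14, 18, 11, 8, 7, 4};
        int total = total(scores);
        System.out.println(total + " -> " + tierFor(total));
    }
}
```

The example scores sum to 79, which lands in the Contributor band, one point short of Strong contributor.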
How to Self-Grade
Block 30 minutes. Open this rubric. Open your own artifacts side by side (JIRA, PR, code, root-cause doc, write-up, validation report). Score each dimension by reading the band descriptions and picking the one that most honestly matches what you produced.
Two rules:
- No interpolation upward. If you're between 14 and 17 on a dimension and unsure, score 14. The optimist's tax.
- One independent reviewer. Ask a peer (ideally another contributor) to score independently on the same rubric. If your scores differ by more than 10 points on any dimension, talk about it. The difference is where the calibration lives.
Record both scores in capstone-work/self-grade.md along with one sentence
per dimension on what would have moved the score up one band. This becomes
the input for the next contribution's plan.
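The second rule is a per-dimension comparison of two score sheets. A small sketch, assuming both reviewers score the seven dimensions in rubric order; `SelfGradeDiff` is a hypothetical helper, and the threshold of 10 comes from the rule above.

```java
import java.util.ArrayList;
import java.util.List;

final class SelfGradeDiff {
    private SelfGradeDiff() {}

    static final String[] DIMENSIONS = {
        "Problem articulation", "Execution-path mastery", "Implementation quality",
        "Testing", "Review responsiveness", "Documentation", "Community interaction"
    };

    // Returns the dimensions where the two score sheets differ by more than `threshold`.
    static List<String> toDiscuss(int[] mine, int[] peer, int threshold) {
        if (mine.length != DIMENSIONS.length || peer.length != DIMENSIONS.length) {
            throw new IllegalArgumentException("expected " + DIMENSIONS.length + " scores per sheet");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < DIMENSIONS.length; i++) {
            if (Math.abs(mine[i] - peer[i]) > threshold) {
                out.add(DIMENSIONS[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] mine = {18, 17, 18, 13, 8, 8, 4};
        int[] peer = {16, 5, 17, 12, 8, 7, 4};
        System.out.println(toDiscuss(mine, peer, 10));
    }
}
```

In this example only Execution-path mastery is flagged, since the two scores there differ by 12 points.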
What to Do With a Low Score
| Lowest dimension | Next contribution focus |
|---|---|
| Problem articulation | Pick a smaller, sharper bug. Write the one-paragraph statement before opening the JIRA edit, and post it for review. |
| Execution-path mastery | Pick a bug in a layer you've never traced (e.g. you've done DAG-level, now do shuffle-level). Force yourself to write the path doc before reading the existing tests. |
| Implementation quality | Pick a bug where the minimum fix is < 10 lines. Practice the discipline of leaving the surrounding code untouched. |
| Testing | Pick a flaky-test JIRA (Stage 9 of the roadmap). The whole bug is about testing discipline. |
| Review responsiveness | Pick a bug in a high-traffic area where you'll get more reviewers. Set a 24-hour SLA for yourself on every comment. |
| Documentation | Pick a bug that requires a Release Notes entry. Write the entry before the fix is done. |
| Community interaction | Reply substantively to three other contributors' patches before opening your next one. |
Validation / Self-check
Before declaring the Capstone done:
- capstone-work/self-grade.md exists with a score per dimension and a total.
- The total is honest, not aspirational: you can defend each dimension's score with citations to your own artifacts.
- At least one independent reviewer has also scored, and disagreements of more than 10 points on any dimension have been discussed.
- The lowest dimension is identified and the next contribution's focus is written down.
- The score is recorded somewhere you'll see again in 3 months (calendar reminder, journal, follow-on JIRA list).
- You understand that the tier label ("Contributor", "Committer-ready") describes this one piece of work, not you.
- You have a candidate next bug picked, with the focus dimension in mind.