Tez Warm-Up: From Data Engineer to Source Contributor

Before you read a single line of VertexImpl.java, you need to have sat in the seat of the person whose workload Tez is serving. The engineers who built Tez's state machines, container reuse logic, and shuffle pipelines were solving specific, painful problems that showed up in production Hive and Pig workloads every day. If you skip that context and go straight to the source code, you will memorize class names without understanding why the design exists.

This chapter is the missing first mile. You will run Tez from the outside — as a data engineer would — across a series of practical scenarios covering different data shapes, query patterns, and ecosystem integrations. After each scenario, the chapter maps what you observed back to the source code structures that own it. By the end, you will have a mental model that makes every internal class feel like an old acquaintance rather than an alien term.

What Tez Actually Is (Two Sentences)

Apache Tez is a general-purpose DAG execution engine that runs on Apache YARN. It does not execute SQL or process files itself — it provides a runtime that other systems (Hive, Pig, custom applications) compile their work into, and then Tez runs that compiled work as a directed acyclic graph of parallel tasks.

Everything else — SQL parsing, query planning, physical operators, file format codecs — belongs to the caller. Tez sees vertex descriptors, edge properties, and processor classes. That boundary is what you need to hold clearly in mind throughout the curriculum.

Where Tez Sits in the Data Engineering Spectrum

┌─────────────────────────────────────────────────────────────────────────────┐
│                     Data Engineering Tool Spectrum                          │
│                                                                             │
│  Batch ◄───────────────────────────────────────────────► Streaming         │
│                                                                             │
│  MapReduce    Tez      Spark         Flink         Kafka Streams            │
│  (2004)       (2013)   (2014)        (2014)        (2016)                   │
│  pure batch   batch+   micro-batch   true stream   native stream            │
│               pipelined & batch      & batch                                │
│                                                                             │
│  ──────────────────────────────────────────────────────────────────────     │
│  Ingest Layer:  Flume  →  Kafka  →  Flink/Kafka Streams                    │
│  Storage Layer: HDFS / S3 / ORC / Parquet / Iceberg / Delta Lake           │
│  Query Layer:   Hive (Tez), Presto, Trino, Spark SQL, Flink SQL            │
└─────────────────────────────────────────────────────────────────────────────┘

Tez vs. MapReduce

MapReduce forces every computation into map → shuffle → reduce. A five-join SQL query becomes five chained MapReduce jobs with HDFS materializations between each. Tez expresses that same query as one DAG, pipelines intermediate data between vertices without HDFS writes, and reuses JVMs across tasks. Typical improvement: 2–5x on complex queries, 10x+ on workflows that would have been five MR jobs.

Tez vs. Spark

Dimension	Tez	Spark
Primary use case	Hive SQL (on YARN/HDFS ecosystems)	General batch + ML + streaming
Execution model	YARN-native, container reuse	Driver + executor (YARN or Kubernetes)
In-memory caching	No (disk-backed shuffle)	RDD/DataFrame caching (explicit)
Streaming	Not native	Structured Streaming (micro-batch)
Deployment	YARN only	YARN, Kubernetes, standalone
Hive integration	Deep (Hive's primary engine)	Separate (Hive-on-Spark is less common)
Community	Apache Tez (focused on Hive use case)	Apache Spark (broad general use)

When you are on a Hadoop/YARN cluster where Hive is the primary SQL layer, Tez is the right choice. Spark is a better fit for Python/Scala workloads, ML pipelines, or when you need in-memory caching across multiple queries.

Tez vs. Flink

Flink is a streaming-first engine that also handles batch. Tez is a batch-first engine that handles simple pipelines. The key structural difference: Flink maintains persistent operator state across windows and checkpoints; Tez vertices are stateless per-task (state is external: HDFS, HBase). If you are building event-time windowed aggregations or exactly-once stream processing, you want Flink. If you are running nightly ETL on HDFS data via Hive, Tez is the right tool and Flink would be overengineered for the job.

Tez vs. Flume (Ingest)

Flume is not a computation engine — it is a log/event ingestion agent that moves data from sources (web servers, syslog, Kafka) to sinks (HDFS, Kafka, HBase). The typical pipeline is:

Application Logs → Flume Agent → HDFS (ORC/Parquet files) → Hive table → Tez query

Flume and Tez are not competitors; they are peers in the same pipeline. Tez reads the data that Flume (or Kafka, or Sqoop) landed on HDFS. Knowing this boundary matters when you encounter a data quality bug: is it in the ingest (Flume), the storage format (ORC serialization), or the compute layer (Tez/Hive)?

Data Formats in the Tez Ecosystem

Tez itself is format-agnostic. It does not read or write ORC, Parquet, or Iceberg directly. Tez sees InputDescriptor and OutputDescriptor objects — the actual codec lives in the class pointed to by those descriptors. The format lives in the tez-mapreduce compatibility layer (MRInput, MROutput) or in Hive's vectorized readers.

ORC (Optimized Row Columnar)

ORC is Hive's native format. When you INSERT INTO an ORC table and query it via Hive-on-Tez:

The input split is an OrcSplit generated by OrcInputFormat in the Hive ORC library.
Tez receives that split as a DataSourceDescriptor in the DAGPlan.
MRInput wraps OrcInputFormat.createRecordReader(), feeding vectorized row batches to Hive's MapOperator.
The key Tez entry point is MRInputLegacy.createReaderInternal() in tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInputLegacy.java.

ORC's predicate pushdown (column pruning, row group skipping) happens before Tez sees the data — entirely inside OrcInputFormat. If a Hive-on-Tez query reads 10 billion rows instead of the 100K it should (wrong predicate pushdown), the bug is in ORC/Hive, not in Tez.

Parquet

Parquet is the other dominant columnar format, more common in cross-ecosystem pipelines (Spark + Hive interop). With Hive-on-Tez reading Parquet:

ParquetInputFormat generates ParquetInputSplit objects.
Tez receives those as DataSourceDescriptor entries.
Vectorization depth varies: ORC vectorization is deeper in Hive; Parquet goes through an additional row-column translation layer.

From a Tez contributor's standpoint, Parquet vs. ORC differences show up mainly in:

Split size calculations affecting vertex parallelism (how many map tasks Tez schedules)
Record skew when one Parquet file is much larger than others

Iceberg

Apache Iceberg is a table format (not a file format). It stores data in Parquet or ORC files but adds a metadata layer for ACID semantics, time travel, and hidden partitioning. Hive + Tez reads Iceberg via IcebergInputFormat (from the Iceberg Hive runtime JAR).

From Tez's view, Iceberg is yet another InputFormat. The novel behavior is:

Iceberg's snapshot-based read means splits can come from multiple physical locations.
Iceberg's PlanningUtil generates splits that can be much more numerous than traditional partition-based splits — this affects Tez vertex parallelism significantly.
Time-travel queries (SELECT ... FOR SYSTEM_TIME AS OF ...) generate a different split list at query compile time, which Hive encodes into the DAGPlan before Tez sees it.

Key insight for contributors: Tez bugs triggered by Iceberg tables are almost always about parallelism (too many small tasks, too few tasks for large snapshots) or about the DataSourceDescriptor encoding. The actual file reading is not Tez's responsibility.

Scenario 1: Classic Batch ETL — Aggregation Over a Large Table

What the data engineer does:

-- Run in Hive CLI connected to a cluster with Hive-on-Tez enabled
SET hive.execution.engine=tez;

CREATE TABLE daily_sales (
  event_date STRING,
  product_id BIGINT,
  region     STRING,
  revenue    DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZSTD");

-- Query: daily revenue by region, last 90 days
SELECT
  event_date,
  region,
  SUM(revenue)     AS total_revenue,
  COUNT(*)         AS transaction_count
FROM daily_sales
WHERE event_date >= '2026-03-01'
GROUP BY event_date, region
ORDER BY event_date, region;

What Tez does under the hood:

Hive compiles the query to MapWork (map-side partial aggregation) + ReduceWork (global aggregation + sort).
DagUtils.createVertex() in Hive creates two Tez Vertex objects: Map 1 and Reducer 2.
The edge between them is SCATTER_GATHER (partitioned shuffle by GROUP BY key hash).
ShuffleVertexManager auto-parallelism kicks in: it monitors how much data map tasks produce, then dynamically reduces the reducer count if data is smaller than expected (config: tez.shuffle-vertex-manager.desired-task-input-size).
Map tasks run MapProcessor → HashTableContainer (partial agg) → OrderedPartitionedKVOutput (partitioned, sorted).
Reducer tasks run ReduceProcessor → OrderedGroupedKVInput (merge shuffle inputs) → PTFOperator (for ORDER BY) → FileSinkOperator → ORC writer.

Dataset characteristics and edge behaviors:

Dataset characteristic	Tez behavior	Source class to read
1 small file (< 1 block)	1 map task, ShuffleVertexManager sets 1 reducer	`ShuffleVertexManager.java`
1,000 files, uniform size	Parallelism = file count (MR split logic)	`MRInputLegacy.java` split sizing
1 file, 10 GB, no ORC splits	1 map task (cannot split non-splittable format)	`OrcInputFormat.isSplittable()`
WHERE predicate on partitioned column	Hive partition pruning, fewer splits passed to Tez	Hive `PartitionPruner`, not Tez
WHERE filters out all rows	0 output bytes from map, ShuffleVertexManager → 1 reducer	`ShuffleVertexManager.onSourceTaskCompleted()`

Bridge to source code:

cd ~/tez-src

# ShuffleVertexManager — the most important vertex manager for map-reduce style DAGs
find . -name "ShuffleVertexManager.java" -path "*/tez-dag/*"
# tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ShuffleVertexManager.java

# Auto-parallelism: how ShuffleVertexManager decides to reduce the number of reducers
grep -n "computeParallelism\|desiredTaskInputSize\|onSourceTaskCompleted" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ShuffleVertexManager.java | head -30

# The edge between Map 1 and Reducer 2 is SCATTER_GATHER — EdgeProperty documentation
grep -n "SCATTER_GATHER" \
  tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java

Scenario 2: Multi-Table Join — The Real Workload Tez Was Built For

What the data engineer does:

SET hive.execution.engine=tez;
SET hive.auto.convert.join=true;      -- enable map-side (broadcast) joins
SET hive.mapjoin.smalltable.filesize=25000000;  -- 25 MB threshold

SELECT
  o.order_id,
  c.customer_name,
  p.product_name,
  o.quantity * p.unit_price AS line_total
FROM orders          o
JOIN customers       c ON o.customer_id = c.customer_id
JOIN products        p ON o.product_id  = p.product_id
WHERE o.order_date = '2026-05-31'
AND   c.region = 'US-WEST';

What Tez does:

Hive's query planner analyzes table sizes:

customers and products are small (< 25 MB) → broadcast join (MapJoin)
orders is large (> 25 MB) → the probe side, goes through SCATTER_GATHER shuffle

The resulting DAG has:

Map 1 — reads orders, builds hash table from customers and products small tables, emits matching rows. Small tables arrive via a BROADCAST edge (ONE_TO_ONE semantics: every map task gets the full small table).
Optionally a Reducer 2 if there's a DISTINCT or ORDER BY.

VertexGroup for broadcast joins: Hive uses VertexGroup to express that one physical vertex's output goes to both Map 1 and any other map-side consumer. This is expressed via DAG.addVertex() with a VertexGroup wrapper.

Dataset edge cases for joins:

Scenario	What goes wrong	Where to look
`customers` grows from 20 MB to 30 MB	Map join threshold exceeded, query switches to shuffle join; slower	Hive `CommonJoinResolver`, not Tez
`orders` has extreme key skew (one customer_id has 90% of rows)	One reducer gets 90% of data; task timeout	`SkewedJoin` hint in Hive; Tez sees it as one overloaded reducer
Broadcast table > YARN container heap	OOM in map task	Container memory: `tez.task.resource.memory.mb`
Right side of join returns 0 rows	Map tasks emit 0 output; downstream vertex immediately succeeds	`VertexImpl.checkTasksForCompletion()`

Bridge to source code:

# ONE_TO_ONE edge (broadcast) — how every map task gets all small-table data
grep -n "ONE_TO_ONE\|BroadcastEdgeManager" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ -r | grep -v Test | head -20

# VertexGroup — Hive's mechanism for fan-out to multiple consumers
grep -n "class VertexGroup\|addVertexGroup" \
  tez-api/src/main/java/org/apache/tez/dag/api/DAG.java | head -15

# How the DAGAppMaster sees both edges from the same vertex
grep -n "vertexGroup\|groupInput" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -20

Scenario 3: Direct Tez API — No Hive

Not all Tez workloads go through Hive. Custom data pipelines, internal batch frameworks, and migration tools often build Tez DAGs directly. The canonical example is OrderedWordCount in tez-examples/.

// Simplified from tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java
TezClient tezClient = TezClient.create("OrderedWordCount", tezConf);
tezClient.start();

DAG dag = DAG.create("OrderedWordCount");

// Vertex 1: Tokenize words from input files
Vertex tokenizerVertex = Vertex.create(
    "Tokenizer",
    ProcessorDescriptor.create(TokenProcessor.class.getName()),
    numMapTasks,
    MRHelpers.getMapResource(conf));
tokenizerVertex.addDataSource(
    "Input",
    MRInput.createConfigBuilder(conf, TextInputFormat.class, inputPath).build());

// Vertex 2: Sort and deduplicate
Vertex sumVertex = Vertex.create(
    "Sorter",
    ProcessorDescriptor.create(SumProcessor.class.getName()),
    numReduceTasks,
    MRHelpers.getReduceResource(conf));
sumVertex.addDataSink(
    "Output",
    MROutput.createConfigBuilder(conf, TextOutputFormat.class, outputPath).build());

// Edge: SCATTER_GATHER, sorted by word key
dag.addVertex(tokenizerVertex)
   .addVertex(sumVertex)
   .addEdge(Edge.create(tokenizerVertex, sumVertex, EdgeProperty.create(
       DataMovementType.SCATTER_GATHER,
       DataSourceType.PERSISTED,
       SchedulingType.SEQUENTIAL,
       OrderedPartitionedKVOutput.createConfigBuilder(conf, HashPartitioner.class).build(),
       OrderedGroupedKVInput.createConfigBuilder(conf).build())));

DAGClient dagClient = tezClient.submitDAG(dag);
DAGStatus status = dagClient.waitForCompletion();

What this teaches about Tez's structure:

Vertex.create(name, processorDescriptor, parallelism, resource) — the four primitives of a vertex: name, code to run, how many copies, how much resource.
EdgeProperty.create(movementType, sourceType, schedulingType, outputDesc, inputDesc) — edge properties completely specify how data moves.
MRInput/MROutput bridge the gap between legacy Hadoop InputFormat/OutputFormat and Tez's native I/O descriptors.

Bridge to source code:

# Read OrderedWordCount to understand the complete DAG lifecycle from a client
cat tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java

# Follow TezClient.submitDAG() into the AM
grep -n "public.*submitDAG" \
  tez-api/src/main/java/org/apache/tez/dag/api/client/TezClient.java

# EdgeProperty — the central struct that determines routing
cat tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java

Scenario 4: Pipelined Execution — Where Tez Approaches Flink

Tez supports PIPELINED edge scheduling (vs. SEQUENTIAL). With pipelined edges, downstream tasks can start before all upstream tasks complete — the data flows like a stream within the DAG.

EdgeProperty pipelinedEdge = EdgeProperty.create(
    DataMovementType.SCATTER_GATHER,
    DataSourceType.PERSISTED_PIPELINED,    // <-- pipelined
    SchedulingType.CONCURRENT,             // <-- downstream starts immediately
    outputDescriptor,
    inputDescriptor);

This is used by Hive for query pipelining in long-running SELECT ... INSERT chains. The downstream vertex starts consuming partial output from the upstream before it finishes, reducing end-to-end latency for multi-stage queries.

When pipelining causes problems:

Problem	Symptom	Root class
Upstream task fails mid-stream	Downstream task has consumed partial data → must be killed and retried with upstream	`TaskAttemptImpl.FAILED_TRANSITION`
Downstream cannot consume fast enough	Back-pressure: upstream pauses on `write()`	`OrderedPartitionedKVOutput.sendingThreadShouldRun`
Memory overflow in pipelined buffer	`OutOfMemoryError` in fetcher threads	`MergeManager` in-memory limit

Bridge to source code:

grep -n "PERSISTENT_PIPELINED\|PIPELINED\|CONCURRENT" \
  tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java

grep -n "SchedulingType.CONCURRENT" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -10

Dataset Scenarios for Testing Edge Cases

When you are writing a repro test case or validating a fix, the dataset you choose determines which code paths you exercise. Use these as your starting templates.

Dataset 1: The Empty Partition

// Generate test data where one reduce partition has 0 records
// Triggers ShuffleVertexManager.onSourceTaskCompleted() with 0-byte output
private static final int NUM_PARTITIONS = 10;
private static final int RECORDS_PER_PARTITION = 100;

// Force all records into partitions 0–8, leave partition 9 empty
int partition = key.hashCode() % (NUM_PARTITIONS - 1);  // never 9

What this tests: ShuffleVertexManager must handle a vertex where some reducer partitions receive zero input. Before TEZ-3247, this caused reducers to hang waiting for shuffle data that would never arrive.

# Test class that covers empty-partition behavior
grep -rn "emptyPartition\|zeroInput\|emptyInput" tez-tests/src/test/ | head -10

Dataset 2: Extreme Key Skew

// One key accounts for 95% of records
for (int i = 0; i < 1_000_000; i++) {
    String key = (i < 950_000) ? "hot_key" : "key_" + i;
    writer.write(new Text(key), new IntWritable(1));
}

What this tests: The reducer that receives hot_key gets ~950,000 records while other reducers get ~50 each. This exposes:

Speculative execution decisions in LegacySpeculator
Container reuse after the skewed reducer finishes last
Per-vertex timing in VertexImpl.checkTasksForCompletion()

Dataset 3: Zero-Row Input

// Empty input — 0 files, 0 records
// The DAG should complete SUCCEEDED with 0 output, not hang
String inputPath = "/tmp/empty_dir_" + UUID.randomUUID();
fs.mkdirs(new Path(inputPath));  // create directory but put no files in it

What this tests: VertexImpl must handle the case where MRInput generates 0 splits. A vertex with 0 input splits sets its parallelism to 0, transitions immediately to V_SUCCEEDED without scheduling any tasks. This has historically been a source of NullPointerException bugs when downstream vertices assume at least one upstream task ran.

grep -n "setParallelism.*0\|numTasks.*0\|zeroTasks\|numSourceTasks.*0" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -15

Dataset 4: Very Wide Rows (Many Columns)

// 1,000 columns per row — stresses IFile serialization and spill logic
StringBuilder sb = new StringBuilder();
for (int col = 0; col < 1000; col++) {
    sb.append("column_").append(col).append("=").append("value_").append(col).append("\t");
}
writer.write(new Text("key"), new Text(sb.toString()));  // ~30 KB per record

What this tests: PipelinedSorter and DefaultSorter spill thresholds. With 30 KB per record, even a modest sort buffer fills quickly. This exercises the spill path in tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java and exposes off-by-one bugs in the IFile index writer.

Dataset 5: Many Small Files (HDFS Small-File Problem)

# Generate 50,000 files of 1 KB each — a classic HDFS anti-pattern
for i in $(seq 1 50000); do
  echo "record_$i value_$i" > /tmp/smallfiles/file_$i.txt
done
hadoop fs -put /tmp/smallfiles /data/input/smallfiles/

What this tests: Split generation produces 50,000 map tasks. This is a realistic workload that stresses:

TaskSchedulerManager task queue management
Container reuse logic (50,000 containers → reuse is essential for performance)
DAGAppMaster AMRM heartbeat frequency under high task count

# Container reuse configuration
grep -n "heldContainer\|releaseTimeout\|IDLE_TIMEOUT" \
  tez-dag/src/main/java/org/apache/tez/dag/app/launcher/ContainerLauncherImpl.java \
  | head -20

Dataset 6: Nested Structs (Complex Types)

-- ORC table with nested complex types
CREATE TABLE events (
  event_id  BIGINT,
  metadata  STRUCT<
    user_id:       BIGINT,
    session_id:    STRING,
    properties:    MAP<STRING, STRING>,
    tags:          ARRAY<STRING>
  >,
  timestamp BIGINT
) STORED AS ORC;

What this tests: ORC vectorized reader deserialization of STRUCT, MAP, and ARRAY types. These types are serialized into Hive's OrcStruct/OrcMap/OrcList classes before being passed through MRInput to the MapOperator. If the column count or type tree changes between what the ORC file was written with and what the Hive schema says, you get schema evolution behavior — which can generate bugs that look like Tez data corruption but are actually ORC schema evolution issues.

Dataset 7: Partitioned Iceberg Table (Snapshot Isolation)

# Using PyIceberg or Spark to create an Iceberg table with multiple snapshots
from pyiceberg.catalog import load_catalog
catalog = load_catalog("hive_catalog", **{"uri": "thrift://hive-metastore:9083"})
table = catalog.load_table("db.events_iceberg")

# Write 3 snapshots representing 3 days of appends
for day in range(3):
    df = generate_day_data(day)
    table.append(df)

# Now query with time travel — Hive generates a DAGPlan that reads snapshot 1
hive_execute("""
    SELECT COUNT(*) FROM db.events_iceberg
    FOR SYSTEM_TIME AS OF '2026-05-29 00:00:00'
""")

What this tests: Iceberg's IcebergInputFormat generates a split list that differs per snapshot. The DataSourceDescriptor passed to Tez encodes the snapshot ID. If Hive resolves the wrong snapshot, Tez faithfully executes it — the bug is in the DagUtils snapshot resolution in Hive, not in Tez. But the symptom (wrong row count) looks like a Tez data bug.

Running Tez End-to-End: The Local Developer Loop

Before writing source code, every Tez contributor should be able to do this loop in under 10 minutes:

# 1. Clone and build
git clone https://github.com/apache/tez.git ~/tez-src
cd ~/tez-src
mvn clean install -DskipTests -Pdist -q   # ~8–12 min cold, 3–4 min warm

# 2. Run the canonical integration test that exercises the full stack
mvn test -pl tez-tests \
    -Dtest=TestOrderedWordCount \
    -DfailIfNoTests=false 2>&1 | tail -30

# 3. Run a single unit test (fast feedback loop — use this constantly)
mvn test -pl tez-dag \
    -Dtest=TestVertexImpl#testVertexSucceededSpeculation \
    -DfailIfNoTests=false 2>&1 | tail -20

# 4. Run OrderedWordCount in local mode (no YARN cluster required)
hadoop jar tez-examples/target/tez-examples-*.jar orderedwordcount \
    -D tez.local.mode=true \
    /path/to/input /tmp/tez-output-$(date +%s)

# 5. Verify output
hadoop fs -cat /tmp/tez-output-*/part-* | sort | head -20

The TestOrderedWordCount test is your baseline health check. If it passes, the full end-to-end stack (TezClient → DAGAppMaster → VertexImpl → shuffle → MRInput/MROutput) is working. If it fails, something fundamental is broken and you need to fix that before touching anything else.

The Bridge: User Scenario → Source Code

Every scenario above maps to a specific source subsystem. Use this table whenever you see a runtime behavior and want to find the code responsible:

Observed behavior	Source location
Map task count equals file count	`tez-mapreduce/.../MRInputLegacy.createSplitsProto()`
Reducer count auto-adjusted down	`ShuffleVertexManager.computeParallelism()`
DAG completes even with 0-row input	`VertexImpl.scheduleTasks()` (0-task vertex path)
Broadcast join: small table to all maps	`BroadcastEdgeManager` + `ONE_TO_ONE` edge
Container reused between tasks	`AMContainerImpl.assignContainer()` + `HeldContainer`
Task retried after failure	`TaskAttemptImpl` → `TaskImpl.handleTaskAttemptFailed()`
OOM in shuffle fetch	`MergeManager.memoryAvailable` / `Fetcher.copyFromHost()`
Hung vertex with tasks still RUNNING	`VertexImpl.checkTasksForCompletion()` not triggered
Wrong output record count	Check `OrcInputFormat` predicate pushdown first, then Tez
Slow single reducer (skew)	`LegacySpeculator` slow-task detection → speculative attempt
Pipelined task killed on upstream failure	`TaskAttemptImpl.FAILED_TRANSITION` cascades

What to Verify Before Starting Level 1

Run through this checklist once. It takes 30–45 minutes and proves your environment is solid.

# Environment check
java -version    # must be Java 8 or Java 11
mvn -version     # must be 3.6.3+
git --version    # must be 2.x

# Clone and build
git clone https://github.com/apache/tez.git ~/tez-src
cd ~/tez-src
mvn clean install -DskipTests -Pdist 2>&1 | tail -10

# Confirm build artifacts exist
ls tez-dist/target/tez-*.tar.gz  # should exist
ls tez-examples/target/tez-examples-*.jar

# Run the unit test suite in the two most important modules
mvn test -pl tez-dag -DfailIfNoTests=false 2>&1 | grep -E "Tests run:|FAIL|ERROR" | tail -5
mvn test -pl tez-api -DfailIfNoTests=false 2>&1 | grep -E "Tests run:|FAIL|ERROR" | tail -5

# Run the critical end-to-end test
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -DfailIfNoTests=false 2>&1 | tail -10

# All lines should read "Tests run: N, Failures: 0, Errors: 0"

If any of these fail before you have modified a single line of code, stop and fix your environment. Do not proceed into Level 1 with a broken baseline. A broken baseline means every subsequent mvn test will produce false failures that obscure the real work.

Continue to Overview & Prerequisites or jump directly to Level 1: Hadoop and Tez Foundation.

Open-Source Engineer & Contributor