Tez Warm-Up: From Data Engineer to Source Contributor
Before you read a single line of VertexImpl.java, you need to have sat in the seat of the
person whose workload Tez is serving. The engineers who built Tez's state machines, container
reuse logic, and shuffle pipelines were solving specific, painful problems that showed up in
production Hive and Pig workloads every day. If you skip that context and go straight to the
source code, you will memorize class names without understanding why the design exists.
This chapter is the missing first mile. You will run Tez from the outside — as a data engineer would — across a series of practical scenarios covering different data shapes, query patterns, and ecosystem integrations. After each scenario, the chapter maps what you observed back to the source code structures that own it. By the end, you will have a mental model that makes every internal class feel like an old acquaintance rather than an alien term.
What Tez Actually Is (Two Sentences)
Apache Tez is a general-purpose DAG execution engine that runs on Apache YARN. It does not execute SQL or process files itself — it provides a runtime that other systems (Hive, Pig, custom applications) compile their work into, and then Tez runs that compiled work as a directed acyclic graph of parallel tasks.
Everything else — SQL parsing, query planning, physical operators, file format codecs — belongs to the caller. Tez sees vertex descriptors, edge properties, and processor classes. That boundary is what you need to hold clearly in mind throughout the curriculum.
Where Tez Sits in the Data Engineering Spectrum
┌─────────────────────────────────────────────────────────────────────────────┐
│ Data Engineering Tool Spectrum │
│ │
│ Batch ◄───────────────────────────────────────────────► Streaming │
│ │
│ MapReduce Tez Spark Flink Kafka Streams │
│ (2004) (2013) (2014) (2014) (2016) │
│ pure batch batch+ micro-batch true stream native stream │
│ pipelined & batch & batch │
│ │
│ ────────────────────────────────────────────────────────────────────── │
│ Ingest Layer: Flume → Kafka → Flink/Kafka Streams │
│ Storage Layer: HDFS / S3 / ORC / Parquet / Iceberg / Delta Lake │
│ Query Layer: Hive (Tez), Presto, Trino, Spark SQL, Flink SQL │
└─────────────────────────────────────────────────────────────────────────────┘
Tez vs. MapReduce
MapReduce forces every computation into map → shuffle → reduce. A five-join SQL query becomes
five chained MapReduce jobs with HDFS materializations between each. Tez expresses that same
query as one DAG, pipelines intermediate data between vertices without HDFS writes, and reuses
JVMs across tasks. Typical improvement: 2–5x on complex queries, 10x+ on workflows that would
have been five MR jobs.
Tez vs. Spark
| Dimension | Tez | Spark |
|---|---|---|
| Primary use case | Hive SQL (on YARN/HDFS ecosystems) | General batch + ML + streaming |
| Execution model | YARN-native, container reuse | Driver + executor (YARN or Kubernetes) |
| In-memory caching | No (disk-backed shuffle) | RDD/DataFrame caching (explicit) |
| Streaming | Not native | Structured Streaming (micro-batch) |
| Deployment | YARN only | YARN, Kubernetes, standalone |
| Hive integration | Deep (Hive's primary engine) | Separate (Hive-on-Spark is less common) |
| Community | Apache Tez (focused on Hive use case) | Apache Spark (broad general use) |
When you are on a Hadoop/YARN cluster where Hive is the primary SQL layer, Tez is the right choice. Spark is a better fit for Python/Scala workloads, ML pipelines, or when you need in-memory caching across multiple queries.
Tez vs. Flink
Flink is a streaming-first engine that also handles batch. Tez is a batch-first engine that handles simple pipelines. The key structural difference: Flink maintains persistent operator state across windows and checkpoints; Tez vertices are stateless per-task (state is external: HDFS, HBase). If you are building event-time windowed aggregations or exactly-once stream processing, you want Flink. If you are running nightly ETL on HDFS data via Hive, Tez is the right tool and Flink would be overengineered for the job.
Tez vs. Flume (Ingest)
Flume is not a computation engine — it is a log/event ingestion agent that moves data from sources (web servers, syslog, Kafka) to sinks (HDFS, Kafka, HBase). The typical pipeline is:
Application Logs → Flume Agent → HDFS (ORC/Parquet files) → Hive table → Tez query
Flume and Tez are not competitors; they are peers in the same pipeline. Tez reads the data that Flume (or Kafka, or Sqoop) landed on HDFS. Knowing this boundary matters when you encounter a data quality bug: is it in the ingest (Flume), the storage format (ORC serialization), or the compute layer (Tez/Hive)?
Data Formats in the Tez Ecosystem
Tez itself is format-agnostic. It does not read or write ORC, Parquet, or Iceberg directly.
Tez sees InputDescriptor and OutputDescriptor objects — the actual codec lives in the class
pointed to by those descriptors. The format lives in the tez-mapreduce compatibility layer
(MRInput, MROutput) or in Hive's vectorized readers.
ORC (Optimized Row Columnar)
ORC is Hive's native format. When you INSERT INTO an ORC table and query it via Hive-on-Tez:
- The input split is an
OrcSplitgenerated byOrcInputFormatin the Hive ORC library. - Tez receives that split as a
DataSourceDescriptorin the DAGPlan. MRInputwrapsOrcInputFormat.createRecordReader(), feeding vectorized row batches to Hive'sMapOperator.- The key Tez entry point is
MRInputLegacy.createReaderInternal()intez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInputLegacy.java.
ORC's predicate pushdown (column pruning, row group skipping) happens before Tez sees the
data — entirely inside OrcInputFormat. If a Hive-on-Tez query reads 10 billion rows instead of
the 100K it should (wrong predicate pushdown), the bug is in ORC/Hive, not in Tez.
Parquet
Parquet is the other dominant columnar format, more common in cross-ecosystem pipelines (Spark + Hive interop). With Hive-on-Tez reading Parquet:
ParquetInputFormatgeneratesParquetInputSplitobjects.- Tez receives those as
DataSourceDescriptorentries. - Vectorization depth varies: ORC vectorization is deeper in Hive; Parquet goes through an additional row-column translation layer.
From a Tez contributor's standpoint, Parquet vs. ORC differences show up mainly in:
- Split size calculations affecting vertex parallelism (how many map tasks Tez schedules)
- Record skew when one Parquet file is much larger than others
Iceberg
Apache Iceberg is a table format (not a file format). It stores data in Parquet or ORC files
but adds a metadata layer for ACID semantics, time travel, and hidden partitioning. Hive + Tez
reads Iceberg via IcebergInputFormat (from the Iceberg Hive runtime JAR).
From Tez's view, Iceberg is yet another InputFormat. The novel behavior is:
- Iceberg's snapshot-based read means splits can come from multiple physical locations.
- Iceberg's
PlanningUtilgenerates splits that can be much more numerous than traditional partition-based splits — this affects Tez vertex parallelism significantly. - Time-travel queries (
SELECT ... FOR SYSTEM_TIME AS OF ...) generate a different split list at query compile time, which Hive encodes into the DAGPlan before Tez sees it.
Key insight for contributors: Tez bugs triggered by Iceberg tables are almost always about
parallelism (too many small tasks, too few tasks for large snapshots) or about the
DataSourceDescriptor encoding. The actual file reading is not Tez's responsibility.
Scenario 1: Classic Batch ETL — Aggregation Over a Large Table
What the data engineer does:
-- Run in Hive CLI connected to a cluster with Hive-on-Tez enabled
SET hive.execution.engine=tez;
CREATE TABLE daily_sales (
event_date STRING,
product_id BIGINT,
region STRING,
revenue DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZSTD");
-- Query: daily revenue by region, last 90 days
SELECT
event_date,
region,
SUM(revenue) AS total_revenue,
COUNT(*) AS transaction_count
FROM daily_sales
WHERE event_date >= '2026-03-01'
GROUP BY event_date, region
ORDER BY event_date, region;
What Tez does under the hood:
- Hive compiles the query to
MapWork(map-side partial aggregation) +ReduceWork(global aggregation + sort). DagUtils.createVertex()in Hive creates two TezVertexobjects:Map 1andReducer 2.- The edge between them is
SCATTER_GATHER(partitioned shuffle by GROUP BY key hash). ShuffleVertexManagerauto-parallelism kicks in: it monitors how much data map tasks produce, then dynamically reduces the reducer count if data is smaller than expected (config:tez.shuffle-vertex-manager.desired-task-input-size).- Map tasks run
MapProcessor→HashTableContainer(partial agg) →OrderedPartitionedKVOutput(partitioned, sorted). - Reducer tasks run
ReduceProcessor→OrderedGroupedKVInput(merge shuffle inputs) →PTFOperator(for ORDER BY) →FileSinkOperator→ ORC writer.
Dataset characteristics and edge behaviors:
| Dataset characteristic | Tez behavior | Source class to read |
|---|---|---|
| 1 small file (< 1 block) | 1 map task, ShuffleVertexManager sets 1 reducer | ShuffleVertexManager.java |
| 1,000 files, uniform size | Parallelism = file count (MR split logic) | MRInputLegacy.java split sizing |
| 1 file, 10 GB, no ORC splits | 1 map task (cannot split non-splittable format) | OrcInputFormat.isSplittable() |
| WHERE predicate on partitioned column | Hive partition pruning, fewer splits passed to Tez | Hive PartitionPruner, not Tez |
| WHERE filters out all rows | 0 output bytes from map, ShuffleVertexManager → 1 reducer | ShuffleVertexManager.onSourceTaskCompleted() |
Bridge to source code:
cd ~/tez-src
# ShuffleVertexManager — the most important vertex manager for map-reduce style DAGs
find . -name "ShuffleVertexManager.java" -path "*/tez-dag/*"
# tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ShuffleVertexManager.java
# Auto-parallelism: how ShuffleVertexManager decides to reduce the number of reducers
grep -n "computeParallelism\|desiredTaskInputSize\|onSourceTaskCompleted" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ShuffleVertexManager.java | head -30
# The edge between Map 1 and Reducer 2 is SCATTER_GATHER — EdgeProperty documentation
grep -n "SCATTER_GATHER" \
tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
Scenario 2: Multi-Table Join — The Real Workload Tez Was Built For
What the data engineer does:
SET hive.execution.engine=tez;
SET hive.auto.convert.join=true; -- enable map-side (broadcast) joins
SET hive.mapjoin.smalltable.filesize=25000000; -- 25 MB threshold
SELECT
o.order_id,
c.customer_name,
p.product_name,
o.quantity * p.unit_price AS line_total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date = '2026-05-31'
AND c.region = 'US-WEST';
What Tez does:
Hive's query planner analyzes table sizes:
customersandproductsare small (< 25 MB) → broadcast join (MapJoin)ordersis large (> 25 MB) → the probe side, goes through SCATTER_GATHER shuffle
The resulting DAG has:
Map 1— readsorders, builds hash table fromcustomersandproductssmall tables, emits matching rows. Small tables arrive via aBROADCASTedge (ONE_TO_ONEsemantics: every map task gets the full small table).- Optionally a
Reducer 2if there's a DISTINCT or ORDER BY.
VertexGroup for broadcast joins: Hive uses VertexGroup to express that one physical
vertex's output goes to both Map 1 and any other map-side consumer. This is expressed via
DAG.addVertex() with a VertexGroup wrapper.
Dataset edge cases for joins:
| Scenario | What goes wrong | Where to look |
|---|---|---|
customers grows from 20 MB to 30 MB | Map join threshold exceeded, query switches to shuffle join; slower | Hive CommonJoinResolver, not Tez |
orders has extreme key skew (one customer_id has 90% of rows) | One reducer gets 90% of data; task timeout | SkewedJoin hint in Hive; Tez sees it as one overloaded reducer |
| Broadcast table > YARN container heap | OOM in map task | Container memory: tez.task.resource.memory.mb |
| Right side of join returns 0 rows | Map tasks emit 0 output; downstream vertex immediately succeeds | VertexImpl.checkTasksForCompletion() |
Bridge to source code:
# ONE_TO_ONE edge (broadcast) — how every map task gets all small-table data
grep -n "ONE_TO_ONE\|BroadcastEdgeManager" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ -r | grep -v Test | head -20
# VertexGroup — Hive's mechanism for fan-out to multiple consumers
grep -n "class VertexGroup\|addVertexGroup" \
tez-api/src/main/java/org/apache/tez/dag/api/DAG.java | head -15
# How the DAGAppMaster sees both edges from the same vertex
grep -n "vertexGroup\|groupInput" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -20
Scenario 3: Direct Tez API — No Hive
Not all Tez workloads go through Hive. Custom data pipelines, internal batch frameworks, and
migration tools often build Tez DAGs directly. The canonical example is OrderedWordCount in
tez-examples/.
// Simplified from tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java
TezClient tezClient = TezClient.create("OrderedWordCount", tezConf);
tezClient.start();
DAG dag = DAG.create("OrderedWordCount");
// Vertex 1: Tokenize words from input files
Vertex tokenizerVertex = Vertex.create(
"Tokenizer",
ProcessorDescriptor.create(TokenProcessor.class.getName()),
numMapTasks,
MRHelpers.getMapResource(conf));
tokenizerVertex.addDataSource(
"Input",
MRInput.createConfigBuilder(conf, TextInputFormat.class, inputPath).build());
// Vertex 2: Sort and deduplicate
Vertex sumVertex = Vertex.create(
"Sorter",
ProcessorDescriptor.create(SumProcessor.class.getName()),
numReduceTasks,
MRHelpers.getReduceResource(conf));
sumVertex.addDataSink(
"Output",
MROutput.createConfigBuilder(conf, TextOutputFormat.class, outputPath).build());
// Edge: SCATTER_GATHER, sorted by word key
dag.addVertex(tokenizerVertex)
.addVertex(sumVertex)
.addEdge(Edge.create(tokenizerVertex, sumVertex, EdgeProperty.create(
DataMovementType.SCATTER_GATHER,
DataSourceType.PERSISTED,
SchedulingType.SEQUENTIAL,
OrderedPartitionedKVOutput.createConfigBuilder(conf, HashPartitioner.class).build(),
OrderedGroupedKVInput.createConfigBuilder(conf).build())));
DAGClient dagClient = tezClient.submitDAG(dag);
DAGStatus status = dagClient.waitForCompletion();
What this teaches about Tez's structure:
Vertex.create(name, processorDescriptor, parallelism, resource)— the four primitives of a vertex: name, code to run, how many copies, how much resource.EdgeProperty.create(movementType, sourceType, schedulingType, outputDesc, inputDesc)— edge properties completely specify how data moves.MRInput/MROutputbridge the gap between legacy HadoopInputFormat/OutputFormatand Tez's native I/O descriptors.
Bridge to source code:
# Read OrderedWordCount to understand the complete DAG lifecycle from a client
cat tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java
# Follow TezClient.submitDAG() into the AM
grep -n "public.*submitDAG" \
tez-api/src/main/java/org/apache/tez/dag/api/client/TezClient.java
# EdgeProperty — the central struct that determines routing
cat tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
Scenario 4: Pipelined Execution — Where Tez Approaches Flink
Tez supports PIPELINED edge scheduling (vs. SEQUENTIAL). With pipelined edges, downstream
tasks can start before all upstream tasks complete — the data flows like a stream within the DAG.
EdgeProperty pipelinedEdge = EdgeProperty.create(
DataMovementType.SCATTER_GATHER,
DataSourceType.PERSISTED_PIPELINED, // <-- pipelined
SchedulingType.CONCURRENT, // <-- downstream starts immediately
outputDescriptor,
inputDescriptor);
This is used by Hive for query pipelining in long-running SELECT ... INSERT chains. The
downstream vertex starts consuming partial output from the upstream before it finishes, reducing
end-to-end latency for multi-stage queries.
When pipelining causes problems:
| Problem | Symptom | Root class |
|---|---|---|
| Upstream task fails mid-stream | Downstream task has consumed partial data → must be killed and retried with upstream | TaskAttemptImpl.FAILED_TRANSITION |
| Downstream cannot consume fast enough | Back-pressure: upstream pauses on write() | OrderedPartitionedKVOutput.sendingThreadShouldRun |
| Memory overflow in pipelined buffer | OutOfMemoryError in fetcher threads | MergeManager in-memory limit |
Bridge to source code:
grep -n "PERSISTENT_PIPELINED\|PIPELINED\|CONCURRENT" \
tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
grep -n "SchedulingType.CONCURRENT" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -10
Dataset Scenarios for Testing Edge Cases
When you are writing a repro test case or validating a fix, the dataset you choose determines which code paths you exercise. Use these as your starting templates.
Dataset 1: The Empty Partition
// Generate test data where one reduce partition has 0 records
// Triggers ShuffleVertexManager.onSourceTaskCompleted() with 0-byte output
private static final int NUM_PARTITIONS = 10;
private static final int RECORDS_PER_PARTITION = 100;
// Force all records into partitions 0–8, leave partition 9 empty
int partition = key.hashCode() % (NUM_PARTITIONS - 1); // never 9
What this tests: ShuffleVertexManager must handle a vertex where some reducer partitions
receive zero input. Before TEZ-3247, this caused reducers to hang waiting for shuffle data
that would never arrive.
# Test class that covers empty-partition behavior
grep -rn "emptyPartition\|zeroInput\|emptyInput" tez-tests/src/test/ | head -10
Dataset 2: Extreme Key Skew
// One key accounts for 95% of records
for (int i = 0; i < 1_000_000; i++) {
String key = (i < 950_000) ? "hot_key" : "key_" + i;
writer.write(new Text(key), new IntWritable(1));
}
What this tests: The reducer that receives hot_key gets ~950,000 records while other
reducers get ~50 each. This exposes:
- Speculative execution decisions in
LegacySpeculator - Container reuse after the skewed reducer finishes last
- Per-vertex timing in
VertexImpl.checkTasksForCompletion()
Dataset 3: Zero-Row Input
// Empty input — 0 files, 0 records
// The DAG should complete SUCCEEDED with 0 output, not hang
String inputPath = "/tmp/empty_dir_" + UUID.randomUUID();
fs.mkdirs(new Path(inputPath)); // create directory but put no files in it
What this tests: VertexImpl must handle the case where MRInput generates 0 splits.
A vertex with 0 input splits sets its parallelism to 0, transitions immediately to
V_SUCCEEDED without scheduling any tasks. This has historically been a source of
NullPointerException bugs when downstream vertices assume at least one upstream task ran.
grep -n "setParallelism.*0\|numTasks.*0\|zeroTasks\|numSourceTasks.*0" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -15
Dataset 4: Very Wide Rows (Many Columns)
// 1,000 columns per row — stresses IFile serialization and spill logic
StringBuilder sb = new StringBuilder();
for (int col = 0; col < 1000; col++) {
sb.append("column_").append(col).append("=").append("value_").append(col).append("\t");
}
writer.write(new Text("key"), new Text(sb.toString())); // ~30 KB per record
What this tests: PipelinedSorter and DefaultSorter spill thresholds. With 30 KB
per record, even a modest sort buffer fills quickly. This exercises the spill path in
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java
and exposes off-by-one bugs in the IFile index writer.
Dataset 5: Many Small Files (HDFS Small-File Problem)
# Generate 50,000 files of 1 KB each — a classic HDFS anti-pattern
for i in $(seq 1 50000); do
echo "record_$i value_$i" > /tmp/smallfiles/file_$i.txt
done
hadoop fs -put /tmp/smallfiles /data/input/smallfiles/
What this tests: Split generation produces 50,000 map tasks. This is a realistic workload that stresses:
TaskSchedulerManagertask queue management- Container reuse logic (50,000 containers → reuse is essential for performance)
DAGAppMasterAMRM heartbeat frequency under high task count
# Container reuse configuration
grep -n "heldContainer\|releaseTimeout\|IDLE_TIMEOUT" \
tez-dag/src/main/java/org/apache/tez/dag/app/launcher/ContainerLauncherImpl.java \
| head -20
Dataset 6: Nested Structs (Complex Types)
-- ORC table with nested complex types
CREATE TABLE events (
event_id BIGINT,
metadata STRUCT<
user_id: BIGINT,
session_id: STRING,
properties: MAP<STRING, STRING>,
tags: ARRAY<STRING>
>,
timestamp BIGINT
) STORED AS ORC;
What this tests: ORC vectorized reader deserialization of STRUCT, MAP, and ARRAY
types. These types are serialized into Hive's OrcStruct/OrcMap/OrcList classes
before being passed through MRInput to the MapOperator. If the column count or type
tree changes between what the ORC file was written with and what the Hive schema says,
you get schema evolution behavior — which can generate bugs that look like Tez data
corruption but are actually ORC schema evolution issues.
Dataset 7: Partitioned Iceberg Table (Snapshot Isolation)
# Using PyIceberg or Spark to create an Iceberg table with multiple snapshots
from pyiceberg.catalog import load_catalog
catalog = load_catalog("hive_catalog", **{"uri": "thrift://hive-metastore:9083"})
table = catalog.load_table("db.events_iceberg")
# Write 3 snapshots representing 3 days of appends
for day in range(3):
df = generate_day_data(day)
table.append(df)
# Now query with time travel — Hive generates a DAGPlan that reads snapshot 1
hive_execute("""
SELECT COUNT(*) FROM db.events_iceberg
FOR SYSTEM_TIME AS OF '2026-05-29 00:00:00'
""")
What this tests: Iceberg's IcebergInputFormat generates a split list that differs per
snapshot. The DataSourceDescriptor passed to Tez encodes the snapshot ID. If Hive
resolves the wrong snapshot, Tez faithfully executes it — the bug is in the DagUtils
snapshot resolution in Hive, not in Tez. But the symptom (wrong row count) looks like
a Tez data bug.
Running Tez End-to-End: The Local Developer Loop
Before writing source code, every Tez contributor should be able to do this loop in under 10 minutes:
# 1. Clone and build
git clone https://github.com/apache/tez.git ~/tez-src
cd ~/tez-src
mvn clean install -DskipTests -Pdist -q # ~8–12 min cold, 3–4 min warm
# 2. Run the canonical integration test that exercises the full stack
mvn test -pl tez-tests \
-Dtest=TestOrderedWordCount \
-DfailIfNoTests=false 2>&1 | tail -30
# 3. Run a single unit test (fast feedback loop — use this constantly)
mvn test -pl tez-dag \
-Dtest=TestVertexImpl#testVertexSucceededSpeculation \
-DfailIfNoTests=false 2>&1 | tail -20
# 4. Run OrderedWordCount in local mode (no YARN cluster required)
hadoop jar tez-examples/target/tez-examples-*.jar orderedwordcount \
-D tez.local.mode=true \
/path/to/input /tmp/tez-output-$(date +%s)
# 5. Verify output
hadoop fs -cat /tmp/tez-output-*/part-* | sort | head -20
The TestOrderedWordCount test is your baseline health check. If it passes, the full
end-to-end stack (TezClient → DAGAppMaster → VertexImpl → shuffle → MRInput/MROutput)
is working. If it fails, something fundamental is broken and you need to fix that before
touching anything else.
The Bridge: User Scenario → Source Code
Every scenario above maps to a specific source subsystem. Use this table whenever you see a runtime behavior and want to find the code responsible:
| Observed behavior | Source location |
|---|---|
| Map task count equals file count | tez-mapreduce/.../MRInputLegacy.createSplitsProto() |
| Reducer count auto-adjusted down | ShuffleVertexManager.computeParallelism() |
| DAG completes even with 0-row input | VertexImpl.scheduleTasks() (0-task vertex path) |
| Broadcast join: small table to all maps | BroadcastEdgeManager + ONE_TO_ONE edge |
| Container reused between tasks | AMContainerImpl.assignContainer() + HeldContainer |
| Task retried after failure | TaskAttemptImpl → TaskImpl.handleTaskAttemptFailed() |
| OOM in shuffle fetch | MergeManager.memoryAvailable / Fetcher.copyFromHost() |
| Hung vertex with tasks still RUNNING | VertexImpl.checkTasksForCompletion() not triggered |
| Wrong output record count | Check OrcInputFormat predicate pushdown first, then Tez |
| Slow single reducer (skew) | LegacySpeculator slow-task detection → speculative attempt |
| Pipelined task killed on upstream failure | TaskAttemptImpl.FAILED_TRANSITION cascades |
What to Verify Before Starting Level 1
Run through this checklist once. It takes 30–45 minutes and proves your environment is solid.
# Environment check
java -version # must be Java 8 or Java 11
mvn -version # must be 3.6.3+
git --version # must be 2.x
# Clone and build
git clone https://github.com/apache/tez.git ~/tez-src
cd ~/tez-src
mvn clean install -DskipTests -Pdist 2>&1 | tail -10
# Confirm build artifacts exist
ls tez-dist/target/tez-*.tar.gz # should exist
ls tez-examples/target/tez-examples-*.jar
# Run the unit test suite in the two most important modules
mvn test -pl tez-dag -DfailIfNoTests=false 2>&1 | grep -E "Tests run:|FAIL|ERROR" | tail -5
mvn test -pl tez-api -DfailIfNoTests=false 2>&1 | grep -E "Tests run:|FAIL|ERROR" | tail -5
# Run the critical end-to-end test
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -DfailIfNoTests=false 2>&1 | tail -10
# All lines should read "Tests run: N, Failures: 0, Errors: 0"
If any of these fail before you have modified a single line of code, stop and fix your
environment. Do not proceed into Level 1 with a broken baseline. A broken baseline means
every subsequent mvn test will produce false failures that obscure the real work.
Continue to Overview & Prerequisites or jump directly to Level 1: Hadoop and Tez Foundation.