Level 1: Hadoop and Tez Foundation
This level establishes the technical baseline every subsequent level depends on. You will understand where Tez fits in the Hadoop ecosystem, build the project from source, run module-scoped unit tests, and execute your first Tez DAG in local mode.
Learning Objectives
By the end of Level 1 you must be able to:
- Explain where Apache Tez sits in the Hadoop ecosystem and why it exists
- Build Apache Tez from source using Maven, with and without tests
- Execute unit tests scoped to a single module and interpret the results
- Run a simple Tez DAG in local mode without a YARN cluster
- Locate any class mentioned in Levels 2–9 without using a search engine
- Articulate the difference between a MapReduce job and a Tez DAG at the execution model level
- Read TezConfiguration.java and find any configuration key by category
The Hadoop Ecosystem Context
Apache Tez lives inside the Hadoop ecosystem. Before touching a line of Tez code, build an accurate mental model of the stack:
┌─────────────────────────────────────────────────────┐
│ Apache Hive / Apache Pig / Cascading                │ ← Query / scripting layer
├─────────────────────────────────────────────────────┤
│ Apache Tez                                          │ ← DAG execution engine
├─────────────────────────────────────────────────────┤
│ Apache YARN                                         │ ← Cluster resource management
├─────────────────────────────────────────────────────┤
│ Apache HDFS                                         │ ← Distributed storage
└─────────────────────────────────────────────────────┘
YARN (Yet Another Resource Negotiator) manages cluster resources. It runs an ApplicationMaster (AM) per application, allocates containers, and monitors health. Tez's DAGAppMaster is a YARN ApplicationMaster.
HDFS stores input, output, and sometimes intermediate data. Tez prefers to keep intermediate data on local disk or in memory, but falls back to HDFS for recovery and large-scale shuffles.
The Tez client submits a DAGAppMaster to YARN; the DAGAppMaster then requests containers for task execution. Tasks read inputs, execute processors, and write outputs, either directly to downstream tasks via shuffle or to HDFS for final output.
MapReduce vs. Tez
| Aspect | MapReduce | Apache Tez |
|---|---|---|
| Execution model | Fixed: Map → Shuffle → Reduce | Arbitrary DAG of vertices |
| Multi-stage queries | Chain of separate MR jobs | Single DAG |
| Inter-stage data | Always written to HDFS | Pipelined or local disk |
| JVM startup | New JVM per task | Container reuse across tasks |
| Vertex types | Two (Map, Reduce) | Unlimited |
| Speculative execution | Yes | Yes (configurable per vertex) |
| Session support | No | Yes — TezClient session mode |
Consider a Hive aggregation query that compiles to 10 chained stages. On MapReduce it runs as 10 separate MR jobs, and each job's output is written to HDFS before the next job reads it. Tez runs the same query as a single DAG: no HDFS round-trips between stages, containers reused across task waves, and pipeline-style data movement between compatible vertices.
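The stage arithmetic behind that comparison can be sketched in a few lines of plain Java. This is a toy cost model, not Tez or Hadoop code: the class StageCostSketch and its methods are hypothetical, and it assumes a strictly linear pipeline in which each stage maps to exactly one MapReduce job.

```java
// Toy arithmetic only; not part of the Tez or Hadoop APIs.
final class StageCostSketch {

    /** Intermediate results materialized to HDFS between stages of a linear pipeline. */
    static int intermediateHdfsWrites(int stages, boolean chainedMapReduce) {
        requirePositive(stages);
        // Chained MR: every job except the last hands its output to the next job via HDFS.
        // Single DAG: intermediate data stays off HDFS in this simplified model.
        return chainedMapReduce ? stages - 1 : 0;
    }

    /** Applications submitted to YARN, each with its own ApplicationMaster. */
    static int yarnApplications(int stages, boolean chainedMapReduce) {
        requirePositive(stages);
        return chainedMapReduce ? stages : 1;
    }

    private static void requirePositive(int stages) {
        if (stages < 1) {
            throw new IllegalArgumentException("stages must be >= 1");
        }
    }

    public static void main(String[] args) {
        int stages = 10;
        System.out.printf("Chained MR: %d YARN apps, %d intermediate HDFS writes%n",
                yarnApplications(stages, true), intermediateHdfsWrites(stages, true));
        System.out.printf("Single DAG: %d YARN app, %d intermediate HDFS writes%n",
                yarnApplications(stages, false), intermediateHdfsWrites(stages, false));
    }
}
```

The model ignores container reuse and task-level parallelism, so treat it as a way to reason about round-trips rather than a performance estimate.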
Required Reading
Complete in this order before starting the labs:
| # | Resource | What to extract |
|---|---|---|
| 1 | README.md in the Tez repo root | Build commands, module overview |
| 2 | Tez architecture document | Original design intent, DAG model rationale |
| 3 | YARN Architecture | Container lifecycle, AM responsibilities |
| 4 | tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java | Class-level Javadoc only — understand session vs. non-session |
| 5 | tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | Skim all keys — understand the category groupings |
| 6 | tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java | End-to-end DAG construction and submission |
Note on reading strategy: In a mature Apache codebase, Javadoc is often the best documentation that exists. Class-level Javadoc on public API classes reflects decisions debated and agreed upon by committers. Read it seriously.
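One way to internalize the category groupings in TezConfiguration.java (item 5 in the table above) is to bucket keys by prefix while you skim. The sketch below is plain Java with no Tez dependency. The class TezKeyCategories and its category labels are hypothetical, and the sample keys are written from memory of TezConfiguration, so confirm each one against the version you have checked out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical study helper; not part of the Tez API.
final class TezKeyCategories {

    // Prefix-to-label pairs, checked in order. The labels are our own names.
    private static final String[][] PREFIXES = {
        {"tez.am.", "application-master"},
        {"tez.task.", "task"},
        {"tez.session.", "session"},
    };

    /** Returns the label of the first matching prefix, or "general" if none matches. */
    static String category(String key) {
        for (String[] prefix : PREFIXES) {
            if (key.startsWith(prefix[0])) {
                return prefix[1];
            }
        }
        return "general";
    }

    /** Groups keys by category, with categories sorted alphabetically. */
    static Map<String, List<String>> group(List<String> keys) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String key : keys) {
            grouped.computeIfAbsent(category(key), c -> new ArrayList<>()).add(key);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Sample keys recalled from TezConfiguration; verify them in your checkout.
        List<String> sample = Arrays.asList(
                "tez.am.resource.memory.mb",
                "tez.am.container.reuse.enabled",
                "tez.task.resource.memory.mb",
                "tez.session.am.dag.submit.timeout.secs",
                "tez.local.mode",
                "tez.lib.uris");
        group(sample).forEach((cat, keys) -> System.out.println(cat + ": " + keys));
    }
}
```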
Source Code Areas to Inspect
Read these files before and after the labs. You are not modifying anything yet.
tez-api — Public API
| File | Why |
|---|---|
| dag/api/TezClient.java | Entry point for all DAG submissions. Read create(), start(), submitDAG(). |
| dag/api/DAG.java | DAG construction API. Note addVertex(), addEdge(), addTaskLocalFiles(). |
| dag/api/Vertex.java | Vertex definition. Understand ProcessorDescriptor, parallelism, and VertexManagerPlugin. |
| dag/api/Edge.java | Edge definition. Understand EdgeProperty and DataMovementType. |
| dag/api/client/DAGClient.java | DAG monitoring. Understand getDAGStatus() and progress tracking. |
| dag/api/TezConfiguration.java | All Tez configuration keys. Every key is documented. |
| dag/api/EdgeProperty.java | Data movement type and scheduling type for edges. Fundamental to DAG design. |
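Before reading DAG.java, Vertex.java, and Edge.java, it can help to see the shape of the model in isolation. The following is a toy sketch in plain Java, not the Tez API: ToyDag and its methods are hypothetical, the DataMovement constants borrow names from EdgeProperty.DataMovementType (which also has further values not modeled here), and the topological ordering is a simplification of how a scheduler might walk the graph.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model for study purposes; the real classes live in org.apache.tez.dag.api.
final class ToyDag {
    enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();
    private final Map<String, DataMovement> edgeTypes = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        if (downstream.containsKey(name)) {
            throw new IllegalStateException("duplicate vertex: " + name);
        }
        downstream.put(name, new ArrayList<>());
        inDegree.put(name, 0);
        return this;
    }

    ToyDag addEdge(String from, String to, DataMovement type) {
        if (!downstream.containsKey(from) || !downstream.containsKey(to)) {
            throw new IllegalArgumentException("both vertices must be added before the edge");
        }
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        edgeTypes.put(from + "->" + to, type);
        return this;
    }

    DataMovement movement(String from, String to) {
        return edgeTypes.get(from + "->" + to);
    }

    /** Kahn's algorithm: returns one valid topological order, or throws if there is a cycle. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected; not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        // Vertex names loosely modeled on the OrderedWordCount example.
        ToyDag dag = new ToyDag()
                .addVertex("Tokenizer").addVertex("Summation").addVertex("Sorter")
                .addEdge("Tokenizer", "Summation", DataMovement.SCATTER_GATHER)
                .addEdge("Summation", "Sorter", DataMovement.SCATTER_GATHER);
        System.out.println(dag.executionOrder());
    }
}
```

When you read the real API, compare how Vertex carries a ProcessorDescriptor and parallelism, and how EdgeProperty adds a scheduling type on top of the data movement type; the toy model omits both.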
tez-dag — Core Execution Engine
| File | Why |
|---|---|
| app/DAGAppMaster.java | The YARN ApplicationMaster. First read: just serviceInit() and serviceStart(). The class runs to thousands of lines. |
| app/dag/impl/DAGImpl.java | DAG state machine. Read the state/transition enum declarations at the top. |
| app/dag/impl/VertexImpl.java | Most complex class in the project. First read: state enum + handle() only. |
| app/dag/impl/TaskImpl.java | Task state machine. More tractable than VertexImpl. Read fully. |
| app/dag/impl/TaskAttemptImpl.java | TaskAttempt state machine. Read fully. |
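The state-machine classes listed above share one pattern: a table of (state, event) pairs mapped to a next state, plus a handle() method that applies it. The sketch below shows that pattern in plain Java under loud assumptions: ToyTaskStateMachine, its states, and its events are simplified and hypothetical, and the real classes build their tables with Hadoop's StateMachineFactory and track more states than shown here.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

// Simplified illustration; not the actual TaskImpl transition table.
final class ToyTaskStateMachine {
    enum State { NEW, SCHEDULED, RUNNING, SUCCEEDED, FAILED, KILLED }
    enum Event { SCHEDULE, ATTEMPT_LAUNCHED, ATTEMPT_SUCCEEDED, ATTEMPT_FAILED, KILL }

    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static {
        add(State.NEW, Event.SCHEDULE, State.SCHEDULED);
        add(State.NEW, Event.KILL, State.KILLED);
        add(State.SCHEDULED, Event.ATTEMPT_LAUNCHED, State.RUNNING);
        add(State.SCHEDULED, Event.KILL, State.KILLED);
        add(State.RUNNING, Event.ATTEMPT_SUCCEEDED, State.SUCCEEDED);
        add(State.RUNNING, Event.ATTEMPT_FAILED, State.FAILED);
        add(State.RUNNING, Event.KILL, State.KILLED);
        // Terminal states (SUCCEEDED, FAILED, KILLED) accept no events in this toy model.
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    }

    private State current = State.NEW;

    State current() {
        return current;
    }

    /** Applies the event, or throws if the (state, event) pair has no registered transition. */
    State handle(Event event) {
        Map<Event, State> fromCurrent =
                TRANSITIONS.getOrDefault(current, Collections.<Event, State>emptyMap());
        State next = fromCurrent.get(event);
        if (next == null) {
            throw new IllegalStateException("invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        ToyTaskStateMachine task = new ToyTaskStateMachine();
        task.handle(Event.SCHEDULE);
        task.handle(Event.ATTEMPT_LAUNCHED);
        System.out.println(task.handle(Event.ATTEMPT_SUCCEEDED));
    }
}
```

With this shape in mind, the transition declarations at the top of DAGImpl, VertexImpl, and TaskImpl read as the same kind of table, only much larger and with transition hooks attached.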
tez-runtime-library — I/O Implementations
| File | Why |
|---|---|
| runtime/library/input/OrderedGroupedKVInput.java | Standard sorted shuffle input. Used by most Hive reduce operations. |
| runtime/library/output/OrderedPartitionedKVOutput.java | Standard sorted shuffle output. Paired with the above. |
| runtime/library/input/UnorderedKVInput.java | Broadcast input; data is not sorted. |
tez-examples — Reference Implementations
| File | Why |
|---|---|
| examples/OrderedWordCount.java | The canonical Tez DAG example. Read this completely. |
| examples/IntersectExample.java | Shows a 3-vertex DAG with a broadcast edge. |
Key Classes Quick Reference
| Class | Module | Package | Role |
|---|---|---|---|
| TezClient | tez-api | org.apache.tez.dag.api | Creates sessions, submits DAGs |
| DAG | tez-api | org.apache.tez.dag.api | Defines the computation graph |
| Vertex | tez-api | org.apache.tez.dag.api | One processing stage |
| Edge | tez-api | org.apache.tez.dag.api | Data connection between vertices |
| EdgeProperty | tez-api | org.apache.tez.dag.api | Data movement + scheduling type |
| ProcessorDescriptor | tez-api | org.apache.tez.dag.api | Which Processor class runs in a vertex |
| TezConfiguration | tez-api | org.apache.tez.dag.api | All Tez configuration keys |
| DAGAppMaster | tez-dag | org.apache.tez.dag.app | YARN ApplicationMaster |
| DAGImpl | tez-dag | org.apache.tez.dag.app.dag.impl | DAG state machine |
| VertexImpl | tez-dag | org.apache.tez.dag.app.dag.impl | Vertex state machine |
| TaskImpl | tez-dag | org.apache.tez.dag.app.dag.impl | Task state machine |
| TaskAttemptImpl | tez-dag | org.apache.tez.dag.app.dag.impl | TaskAttempt state machine |
| TezTaskRunner2 | tez-runtime-internals | org.apache.tez.runtime.task | Runs a task inside a container |
| OrderedWordCount | tez-examples | org.apache.tez.examples | Canonical DAG example |
JIRA Issue Categories for Level 1 Contributors
At this stage, focus exclusively on:
- Documentation: Javadoc typos, outdated parameter descriptions, missing @param or @return annotations, broken links in comments
- Test improvements: adding missing assertions to existing tests, improving test method naming, removing dead code from test classes
- Checkstyle violations: unused imports, line length violations, missing final keywords
How to find these:
- Go to Apache Tez JIRA
- Search: project = TEZ AND labels = "newbie" AND resolution = Unresolved
- Also scan: project = TEZ AND component = "Documentation" AND resolution = Unresolved
- Look at recently closed "trivial" issues to understand the standard for accepted patches
Warning: Do not pick up a JIRA issue and immediately upload a patch. Read all existing comments. If there is an active discussion or existing assignee, move on. Leave a comment saying you are investigating before you claim an issue.
Deliverables
You must demonstrate all of the following before advancing to Level 2:
- Successful mvn install -DskipTests output with no build failures
- At least one unit test class run successfully (e.g., TestDAGImpl)
- Successful local DAG execution showing DAG completed: SUCCEEDED
- Ability to locate DAGAppMaster, TezClient, and OrderedGroupedKVInput from memory
- Written explanation (2–3 sentences) of why a Tez DAG is faster than chained MapReduce
- Written explanation of the difference between a YARN container and a Tez task
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Building with Java 17 against master | Compile errors or compatibility failures | Use Java 8 or Java 11; check <maven.compiler.source> in root pom.xml |
| Running mvn test on the full repository | Hours-long run including integration tests | Use -pl tez-dag -am to scope to one module |
| Ignoring TezConfiguration.java | Confusion about configuration keys throughout all levels | Skim the entire file; every key is documented |
| Skipping the YARN architecture doc | Confusion about what Tez owns vs. what YARN owns | YARN understanding is required from Level 3 onward |
| Trying to understand all of DAGAppMaster at once | Overwhelm; the class runs to thousands of lines | First pass: read only serviceInit() and serviceStart() |
| Reading Tez code without running it | Abstract understanding that does not transfer to debugging | Always run the code after reading it |
| Picking a JIRA issue without reading existing comments | Duplicate work; community friction | Read all comments; check assignee; leave a note before claiming |
How to Verify Success
# 1. Full build without tests
cd /path/to/tez
mvn install -DskipTests -q && echo "BUILD OK"
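# 2. Unit test from tez-dag
#    -am also builds upstream modules, which contain no test matching the filter,
#    so tell Surefire not to fail on them.
mvn test -pl tez-dag -am -Dtest=TestDAGImpl -Dsurefire.failIfNoSpecifiedTests=false -q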
# 3. Local DAG run (from Lab 1.3)
# Expected final output line:
# DAG: [OrderedWordCount] finished with status: [SUCCEEDED]
Patch Profile: Level 1 Graduate
| Patch type | Example | Test requirement |
|---|---|---|
| Javadoc fix | Correcting a wrong @param description in TezClient | None — documentation only |
| Dead import removal | Remove unused import statement flagged by checkstyle | Run mvn checkstyle:check -pl <module> |
| Test assertion improvement | Add assertEquals to an existing test that only checks for no-exception | Run the test class |
| README update | Fix a broken Maven command in the build instructions | Manual verification |
You are not ready to submit: bug fixes in state machines, new features, performance patches, or changes to the shuffle path. Those require Levels 3–7.