Level 1: Hadoop and Tez Foundation

This level establishes the technical baseline every subsequent level depends on. You will understand where Tez fits in the Hadoop ecosystem, successfully build the project from source, run the test suite, and execute your first Tez DAG in local mode.

Learning Objectives

By the end of Level 1 you must be able to:

Explain where Apache Tez sits in the Hadoop ecosystem and why it exists
Build Apache Tez from source using Maven, with and without tests
Execute unit tests scoped to a single module and interpret the results
Run a simple Tez DAG in local mode without a YARN cluster
Locate any class mentioned in Levels 2–9 without using a search engine
Articulate the difference between a MapReduce job and a Tez DAG at the execution model level
Read TezConfiguration.java and find any configuration key by category

The Hadoop Ecosystem Context

Apache Tez lives inside the Hadoop ecosystem. Before touching a line of Tez code, build an accurate mental model of the stack:

┌─────────────────────────────────────────────────────┐
│         Apache Hive / Apache Pig / Cascading        │  ← Query / scripting layer
├─────────────────────────────────────────────────────┤
│                  Apache Tez                         │  ← DAG execution engine
├─────────────────────────────────────────────────────┤
│                  Apache YARN                        │  ← Cluster resource management
├─────────────────────────────────────────────────────┤
│                  Apache HDFS                        │  ← Distributed storage
└─────────────────────────────────────────────────────┘

YARN (Yet Another Resource Negotiator) manages cluster resources. It launches one ApplicationMaster (AM) per application, allocates containers, and monitors their health. Tez's DAGAppMaster is a YARN ApplicationMaster.

HDFS stores input and final output, and it also holds the recovery data the AM needs to resume after a failure. Tez keeps intermediate data on local disk or in memory wherever it can, instead of writing it to HDFS between stages.

The Tez client submits a DAGAppMaster to YARN, and the AM then requests containers for task execution. Tasks read inputs, execute processors, and write outputs, either to downstream tasks via shuffle or to HDFS as final output.
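The input, processor, and output roles described above can be pictured with a toy model. This is a minimal sketch in plain Java, not the Tez runtime API: `SketchInput`, `SketchProcessor`, and `SketchOutput` are hypothetical interfaces invented for illustration, and the "task" is an ordinary method call rather than something running in a YARN container.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one task: read an input, run a processor, write an output. */
final class TaskPipelineSketch {

    // Hypothetical stand-ins for the three roles; these are not Tez interfaces.
    interface SketchInput { List<String> read(); }
    interface SketchProcessor { Map<String, Integer> process(List<String> records); }
    interface SketchOutput { void write(Map<String, Integer> result); }

    /** Runs the three roles in order, the way a single task would. */
    static Map<String, Integer> runTask(SketchInput in, SketchProcessor proc, SketchOutput out) {
        Map<String, Integer> result = proc.process(in.read());
        out.write(result);
        return result;
    }

    /** Example processor logic: count whitespace-separated words. */
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> sink = new ArrayList<>();
        runTask(() -> Arrays.asList("a b a", "b c"), TaskPipelineSketch::countWords, sink::add);
        System.out.println(sink.get(0)); // prints {a=2, b=2, c=1}
    }
}
```

In a real DAG the output of one task becomes the input of a downstream task; here the output is simply collected into a list.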

MapReduce vs. Tez

Aspect	MapReduce	Apache Tez
Execution model	Fixed: Map → Shuffle → Reduce	Arbitrary DAG of vertices
Multi-stage queries	Chain of separate MR jobs	Single DAG
Inter-stage data	Written to HDFS between chained jobs	Pipelined or local disk
JVM startup	New JVM per task	Container reuse across tasks
Vertex types	Two (Map, Reduce)	Unlimited
Speculative execution	Yes	Yes (configurable per vertex)
Session support	No	Yes — `TezClient` session mode
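To make the "arbitrary DAG of vertices" row concrete, here is a small stdlib-only sketch of a graph with a join shape that a fixed Map → Shuffle → Reduce pipeline cannot express as one job. `DagSketch` and the vertex names are hypothetical. The ordering uses Kahn's algorithm and only shows a valid dependency order; it does not model how Tez actually schedules vertices, which may overlap in time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy directed acyclic graph of named vertices (not the Tez DAG class). */
final class DagSketch {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    DagSketch addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    DagSketch addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    /** Kahn's algorithm: returns a dependency-respecting order, or throws on a cycle. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch()
                .addEdge("ScanA", "Join")
                .addEdge("ScanB", "Join")
                .addEdge("Join", "Aggregate");
        System.out.println(dag.executionOrder()); // prints [ScanA, ScanB, Join, Aggregate]
    }
}
```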

For a 10-stage Hive aggregation query, MapReduce can require up to 10 separate MR jobs, with an HDFS write between each consecutive pair of jobs. Tez runs the same query as a single DAG: no HDFS round-trips between stages, containers reused across task waves, and pipeline-style data movement between compatible vertices.
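The arithmetic behind that comparison is small enough to write down. The sketch below assumes the worst case for MapReduce (one MR job per stage) and the best case for Tez (no edge materialized on HDFS); the class and method names are hypothetical.

```java
/** Back-of-the-envelope count of job submissions and intermediate HDFS writes. */
final class HdfsRoundTripSketch {

    private static void requirePositive(int stages) {
        if (stages < 1) {
            throw new IllegalArgumentException("stages must be >= 1");
        }
    }

    /** Assumption: one MapReduce job per stage. */
    static int mapReduceJobs(int stages) {
        requirePositive(stages);
        return stages;
    }

    /** Every job except the last hands its result to the next job through HDFS. */
    static int mapReduceIntermediateWrites(int stages) {
        requirePositive(stages);
        return stages - 1;
    }

    /** Assumption: the whole query is one DAG and no edge is written to HDFS. */
    static int tezIntermediateWrites(int stages) {
        requirePositive(stages);
        return 0;
    }

    public static void main(String[] args) {
        int stages = 10;
        System.out.println("MR jobs: " + mapReduceJobs(stages));                          // MR jobs: 10
        System.out.println("MR HDFS hand-offs: " + mapReduceIntermediateWrites(stages)); // MR HDFS hand-offs: 9
        System.out.println("Tez HDFS hand-offs: " + tezIntermediateWrites(stages));      // Tez HDFS hand-offs: 0
    }
}
```

Both engines still write the final result to HDFS once, so that write is left out of the comparison.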

Required Reading

Complete in this order before starting the labs:

#	Resource	What to extract
1	`README.md` in the Tez repo root	Build commands, module overview
2	Tez architecture document	Original design intent, DAG model rationale
3	YARN Architecture	Container lifecycle, AM responsibilities
4	`tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java`	Class-level Javadoc only — understand session vs. non-session
5	`tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java`	Skim all keys — understand the category groupings
6	`tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java`	End-to-end DAG construction and submission
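For item 5, one way to see the category groupings is to bucket constant names by the token after `TEZ_`. The sketch below is a naive heuristic, not a tool from the Tez repo: `ConfigKeyGrouper` is hypothetical, the regular expression only matches `String` constants, and the sample text is a simplified stand-in for the real file. It will also pick up helper constants such as prefixes and defaults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Groups TEZ_* String constants by the token that follows "TEZ_". */
final class ConfigKeyGrouper {

    // Matches declarations such as: public static final String TEZ_AM_LOG_LEVEL = ...
    private static final Pattern DECLARATION = Pattern.compile(
            "public\\s+static\\s+final\\s+String\\s+(TEZ_[A-Z0-9_]+)\\s*=");

    static Map<String, List<String>> groupByCategory(String source) {
        Map<String, List<String>> groups = new TreeMap<>();
        Matcher matcher = DECLARATION.matcher(source);
        while (matcher.find()) {
            String name = matcher.group(1);
            String[] parts = name.split("_");
            // TEZ_AM_LOG_LEVEL -> "AM"; TEZ_TASK_LOG_LEVEL -> "TASK"
            String category = parts.length > 1 ? parts[1] : "OTHER";
            groups.computeIfAbsent(category, k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Simplified stand-in for a few lines of TezConfiguration.java.
        String sample =
                "public static final String TEZ_AM_LOG_LEVEL = TEZ_AM_PREFIX + \"log.level\";\n"
              + "public static final String TEZ_AM_CONTAINER_REUSE_ENABLED = TEZ_AM_PREFIX + \"container.reuse.enabled\";\n"
              + "public static final String TEZ_TASK_LOG_LEVEL = TEZ_TASK_PREFIX + \"log.level\";\n";
        System.out.println(groupByCategory(sample));
        // prints {AM=[TEZ_AM_LOG_LEVEL, TEZ_AM_CONTAINER_REUSE_ENABLED], TASK=[TEZ_TASK_LOG_LEVEL]}
    }
}
```

To run it against the real file, read the file into a string with `java.nio.file.Files.readAllBytes` and pass the text to `groupByCategory`.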

Note on reading strategy: In a mature Apache codebase, Javadoc is often the best documentation that exists. Class-level Javadoc on public API classes reflects decisions debated and agreed upon by committers. Read it seriously.

Source Code Areas to Inspect

Read these files before and after the labs. You are not modifying anything yet.

`tez-api` — Public API

File	Why
`dag/api/TezClient.java`	Entry point for all DAG submissions. Read the static `create()` factory methods, `start()`, and `submitDAG()`.
`dag/api/DAG.java`	DAG construction API. Note `addVertex()`, `addEdge()`, `addTaskLocalFiles()`.
`dag/api/Vertex.java`	Vertex definition. Understand `ProcessorDescriptor`, parallelism, and `VertexManagerPlugin`.
`dag/api/Edge.java`	Edge definition. Understand `EdgeProperty` and `DataMovementType`.
`dag/api/client/DAGClient.java`	DAG monitoring. Understand `getDAGStatus()` and progress tracking.
`dag/api/TezConfiguration.java`	All Tez configuration keys. Every key is documented.
`dag/api/EdgeProperty.java`	Data movement type and scheduling type for edges. Fundamental to DAG design.
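Before reading `EdgeProperty.java`, it can help to see what the three common data movement types mean for a single destination task. The sketch below is a plain-Java model, not the Tez API: `Movement` mirrors only the `ONE_TO_ONE`, `BROADCAST`, and `SCATTER_GATHER` names and leaves out custom edges, and the `s<task>/p<partition>` labels are an invented notation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Which source-task outputs does one destination task read under each movement type? */
final class DataMovementSketch {

    // Mirrors three of the data movement type names; custom edges are omitted.
    enum Movement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    /**
     * Returns labels like "s0/full" (entire output of source task 0)
     * or "s0/p1" (partition 1 of source task 0).
     */
    static List<String> sourcesFor(Movement type, int destIndex, int sourceTasks) {
        if (destIndex < 0 || sourceTasks < 1) {
            throw new IllegalArgumentException("destIndex must be >= 0 and sourceTasks >= 1");
        }
        List<String> reads = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // Destination task i reads only from source task i.
                if (destIndex >= sourceTasks) {
                    throw new IllegalArgumentException("ONE_TO_ONE needs matching parallelism");
                }
                return Collections.singletonList("s" + destIndex + "/full");
            case BROADCAST:
                // Every destination task reads the full output of every source task.
                for (int s = 0; s < sourceTasks; s++) {
                    reads.add("s" + s + "/full");
                }
                return reads;
            case SCATTER_GATHER:
                // Each source task partitions its output; destination i gathers partition i.
                for (int s = 0; s < sourceTasks; s++) {
                    reads.add("s" + s + "/p" + destIndex);
                }
                return reads;
            default:
                throw new AssertionError(type);
        }
    }

    public static void main(String[] args) {
        System.out.println(sourcesFor(Movement.ONE_TO_ONE, 1, 2));     // prints [s1/full]
        System.out.println(sourcesFor(Movement.BROADCAST, 1, 2));      // prints [s0/full, s1/full]
        System.out.println(sourcesFor(Movement.SCATTER_GATHER, 1, 2)); // prints [s0/p1, s1/p1]
    }
}
```

Scatter-gather is the shape of a classic MapReduce shuffle, which is why the ordered partitioned output and ordered grouped input classes are usually paired on such edges.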

`tez-dag` — Core Execution Engine

File	Why
`app/DAGAppMaster.java`	The YARN ApplicationMaster. First read: just `serviceInit()` and `serviceStart()`, the service lifecycle methods that implement init and start. It is 5000+ lines.
`app/dag/impl/DAGImpl.java`	DAG state machine. Read the state/transition enum declarations at the top.
`app/dag/impl/VertexImpl.java`	Most complex class in the project. First read: state enum + `handle()` only.
`app/dag/impl/TaskImpl.java`	Task state machine. More tractable than VertexImpl. Read fully.
`app/dag/impl/TaskAttemptImpl.java`	TaskAttempt state machine. Read fully.
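All four `*Impl` classes share one pattern: a set of states, a set of event types, and a table saying which event moves which state where. The sketch below shows that pattern in miniature. It is a simplified model with a hypothetical subset of states and events, built on hand-written maps; the real classes declare far more transitions and use a state machine factory rather than this code.

```java
import java.util.EnumMap;
import java.util.Map;

/** Miniature table-driven state machine in the style of a task attempt lifecycle. */
final class AttemptStateMachineSketch {

    // Hypothetical, heavily reduced state and event sets.
    enum State { NEW, RUNNING, SUCCEEDED, FAILED, KILLED }
    enum Event { START, DONE, ERROR, KILL }

    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static {
        add(State.NEW, Event.START, State.RUNNING);
        add(State.NEW, Event.KILL, State.KILLED);
        add(State.RUNNING, Event.DONE, State.SUCCEEDED);
        add(State.RUNNING, Event.ERROR, State.FAILED);
        add(State.RUNNING, Event.KILL, State.KILLED);
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    private State current = State.NEW;

    State getState() {
        return current;
    }

    /** Applies one event; an event with no registered transition is rejected. */
    State handle(Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        AttemptStateMachineSketch attempt = new AttemptStateMachineSketch();
        System.out.println(attempt.handle(Event.START)); // prints RUNNING
        System.out.println(attempt.handle(Event.DONE));  // prints SUCCEEDED
    }
}
```

When reading the real classes, look for the same three ingredients first: the state enum, the event type enum, and the block where transitions are registered.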

`tez-runtime-library` — I/O Implementations

File	Why
`runtime/library/input/OrderedGroupedKVInput.java`	Standard sorted shuffle input. Used by most Hive reduce operations.
`runtime/library/output/OrderedPartitionedKVOutput.java`	Standard sorted shuffle output. Paired with the above.
`runtime/library/input/UnorderedKVInput.java`	Unsorted key-value input, typically used on broadcast edges.

`tez-examples` — Reference Implementations

File	Why
`examples/OrderedWordCount.java`	The canonical Tez DAG example. Read this completely.
`examples/IntersectExample.java`	Shows a 3-vertex DAG with a broadcast edge.

Key Classes Quick Reference

Class	Module	Package	Role
`TezClient`	`tez-api`	`org.apache.tez.dag.api`	Creates sessions, submits DAGs
`DAG`	`tez-api`	`org.apache.tez.dag.api`	Defines the computation graph
`Vertex`	`tez-api`	`org.apache.tez.dag.api`	One processing stage
`Edge`	`tez-api`	`org.apache.tez.dag.api`	Data connection between vertices
`EdgeProperty`	`tez-api`	`org.apache.tez.dag.api`	Data movement + scheduling type
`ProcessorDescriptor`	`tez-api`	`org.apache.tez.dag.api`	Which Processor class runs in a vertex
`TezConfiguration`	`tez-api`	`org.apache.tez.dag.api`	All Tez configuration keys
`DAGAppMaster`	`tez-dag`	`org.apache.tez.dag.app`	YARN ApplicationMaster
`DAGImpl`	`tez-dag`	`org.apache.tez.dag.app.dag.impl`	DAG state machine
`VertexImpl`	`tez-dag`	`org.apache.tez.dag.app.dag.impl`	Vertex state machine
`TaskImpl`	`tez-dag`	`org.apache.tez.dag.app.dag.impl`	Task state machine
`TaskAttemptImpl`	`tez-dag`	`org.apache.tez.dag.app.dag.impl`	TaskAttempt state machine
`TezTaskRunner2`	`tez-runtime-internals`	`org.apache.tez.runtime.task`	Runs a task inside a container
`OrderedWordCount`	`tez-examples`	`org.apache.tez.examples`	Canonical DAG example
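The Module and Package columns are enough to derive a source path, which is the skill behind locating classes from memory. The helper below is a minimal sketch that assumes the standard Maven layout (`src/main/java`) already visible in the Required Reading paths; `SourcePathSketch` is a hypothetical name.

```java
/** Derives a repo-relative source path from module, package, and class name. */
final class SourcePathSketch {

    /** Assumes the standard Maven layout: <module>/src/main/java/<package as dirs>/<Class>.java */
    static String sourcePath(String module, String pkg, String className) {
        return module + "/src/main/java/" + pkg.replace('.', '/') + "/" + className + ".java";
    }

    public static void main(String[] args) {
        System.out.println(sourcePath("tez-api", "org.apache.tez.dag.api", "TezClient"));
        // prints tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java
        System.out.println(sourcePath("tez-dag", "org.apache.tez.dag.app", "DAGAppMaster"));
        // prints tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
    }
}
```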

JIRA Issue Categories for Level 1 Contributors

At this stage, focus exclusively on:

Documentation — Javadoc typos, outdated parameter descriptions, missing @param or @return annotations, broken links in comments
Test improvements — Adding missing assertions to existing tests, improving test method naming, removing dead code from test classes
Checkstyle violations — Unused imports, line length violations, missing final keywords

How to find these:

Go to Apache Tez JIRA
Search: project = TEZ AND labels = "newbie" AND resolution = Unresolved
Also scan: project = TEZ AND component = "Documentation" AND resolution = Unresolved
Look at recently closed "trivial" issues to understand the standard for accepted patches
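If you want to bookmark or share one of the searches above, the query can be turned into a link by URL-encoding the JQL. The sketch below assumes the Apache JIRA issue navigator accepts a `jql` query parameter at the address shown. Treat that base URL as an assumption and adjust it if the search page differs; `JiraSearchUrlSketch` is a hypothetical helper.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a shareable search link from a JQL string. */
final class JiraSearchUrlSketch {

    // Assumed location of the issue navigator; not taken from the text above.
    static final String SEARCH_BASE = "https://issues.apache.org/jira/issues/?jql=";

    static String searchUrl(String jql) {
        try {
            // The two-argument String overload also compiles on Java 8.
            return SEARCH_BASE + URLEncoder.encode(jql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required to be supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl(
                "project = TEZ AND labels = \"newbie\" AND resolution = Unresolved"));
    }
}
```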

Warning: Do not pick up a JIRA issue and immediately upload a patch. Read all existing comments. If there is an active discussion or existing assignee, move on. Leave a comment saying you are investigating before you claim an issue.

Deliverables

You must demonstrate all of the following before advancing to Level 2:

Successful mvn install -DskipTests output — no build failures
At least one unit test class run successfully (e.g., TestDAGImpl)
Successful local DAG execution showing DAG completed: SUCCEEDED
Ability to locate DAGAppMaster, TezClient, and OrderedGroupedKVInput from memory
Written explanation (2–3 sentences) of why a Tez DAG is faster than chained MapReduce
Written explanation of the difference between a YARN container and a Tez task

Common Mistakes

Mistake	Consequence	Fix
Building with Java 17 against `master`	Compile errors or compatibility failures	Use Java 8 or Java 11; check `<maven.compiler.source>` in root `pom.xml`
Running `mvn test` on the full repository	Hours-long run including integration tests	Scope to one module with `-pl tez-dag`; add `-am` only when upstream modules are not yet installed, since it runs their tests as well
Ignoring `TezConfiguration.java`	Confusion about configuration keys throughout all levels	Skim the entire file; every key is documented
Skipping the YARN architecture doc	Confusion about what Tez owns vs. what YARN owns	YARN understanding is required from Level 3 onward
Trying to understand all of `DAGAppMaster` at once	Overwhelm (5000+ lines)	First pass: read only `serviceInit()` and `serviceStart()`, the service lifecycle methods
Reading Tez code without running it	Abstract understanding that does not transfer to debugging	Always run the code after reading it
Picking a JIRA issue without reading existing comments	Duplicate work; community friction	Read all comments; check assignee; leave a note before claiming

How to Verify Success

# 1. Full build without tests
cd /path/to/tez
mvn install -DskipTests -q && echo "BUILD OK"

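# 2. Unit test from tez-dag
# Step 1 already installed the upstream modules, so -am is not needed here.
# With -am, -Dtest=... can fail in upstream modules that contain no matching test.
mvn test -pl tez-dag -Dtest=TestDAGImpl -q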

# 3. Local DAG run (from Lab 1.3)
# Expected final output line:
#   DAG: [OrderedWordCount] finished with status: [SUCCEEDED]

Patch Profile: Level 1 Graduate

Patch type	Example	Test requirement
Javadoc fix	Correcting a wrong `@param` description in `TezClient`	None — documentation only
Dead import removal	Remove unused `import` statement flagged by checkstyle	Run `mvn checkstyle:check -pl <module>`
Test assertion improvement	Add `assertEquals` to an existing test that only checks for no-exception	Run the test class
README update	Fix a broken Maven command in the build instructions	Manual verification

You are not ready to submit: bug fixes in state machines, new features, performance patches, or changes to the shuffle path. Those require Levels 3–7.
