Level 1: Hadoop and Tez Foundation

This level establishes the technical baseline every subsequent level depends on. You will understand where Tez fits in the Hadoop ecosystem, successfully build the project from source, run the test suite, and execute your first Tez DAG in local mode.


Learning Objectives

By the end of Level 1 you must be able to:

  1. Explain where Apache Tez sits in the Hadoop ecosystem and why it exists
  2. Build Apache Tez from source using Maven, with and without tests
  3. Execute unit tests scoped to a single module and interpret the results
  4. Run a simple Tez DAG in local mode without a YARN cluster
  5. Locate any class mentioned in Levels 2–9 without using a search engine
  6. Articulate the difference between a MapReduce job and a Tez DAG at the execution model level
  7. Read TezConfiguration.java and find any configuration key by category

The Hadoop Ecosystem Context

Apache Tez lives inside the Hadoop ecosystem. Before touching a line of Tez code, build an accurate mental model of the stack:

┌─────────────────────────────────────────────────────┐
│         Apache Hive / Apache Pig / Cascading        │  ← Query / scripting layer
├─────────────────────────────────────────────────────┤
│                  Apache Tez                         │  ← DAG execution engine
├─────────────────────────────────────────────────────┤
│                  Apache YARN                        │  ← Cluster resource management
├─────────────────────────────────────────────────────┤
│                  Apache HDFS                        │  ← Distributed storage
└─────────────────────────────────────────────────────┘

YARN (Yet Another Resource Negotiator) manages cluster resources. It runs an ApplicationMaster (AM) per application, allocates containers, and monitors health. Tez's DAGAppMaster IS a YARN ApplicationMaster.

HDFS stores input and output data. Tez keeps intermediate data on local disk or in memory wherever it can; HDFS additionally holds the data Tez needs for recovery.

The Tez client submits a DAGAppMaster to YARN. The DAGAppMaster then requests containers for task execution. Tasks read inputs, execute processors, and write outputs, either to downstream tasks via shuffle or to HDFS for final output.

MapReduce vs. Tez

| Aspect | MapReduce | Apache Tez |
|---|---|---|
| Execution model | Fixed: Map → Shuffle → Reduce | Arbitrary DAG of vertices |
| Multi-stage queries | Chain of separate MR jobs | Single DAG |
| Inter-stage data | Always written to HDFS | Pipelined or local disk |
| JVM startup | New JVM per task | Container reuse across tasks |
| Vertex types | Two (Map, Reduce) | Unlimited |
| Speculative execution | Yes | Yes (configurable per vertex) |
| Session support | No | Yes (TezClient session mode) |

For a 10-stage Hive aggregation query, MapReduce needs a chain of up to 10 separate MR jobs, with each job writing its output to HDFS for the next one to read. Tez runs the same query as a single DAG: no HDFS round-trips between stages, containers reused across task waves, and pipeline-style data movement between compatible vertices.
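
The round-trip argument can be made concrete with a little arithmetic. The sketch below assumes the idealised case described above, one MR job per stage; the function name is illustrative and not part of any Tez or Hadoop tooling.

```shell
# Idealised count of intermediate HDFS writes for an N-stage query.
# Assumption: MapReduce runs one job per stage and each job hands its
# output to the next job through HDFS; Tez runs all stages in one DAG.
intermediate_hdfs_writes() {
  local engine="$1" stages="$2"
  case "$engine" in
    mr)  echo $((stages - 1)) ;;  # one hand-off between each pair of chained jobs
    tez) echo 0 ;;                # inter-stage data stays off HDFS
    *)   echo "unknown engine: $engine" >&2; return 1 ;;
  esac
}

intermediate_hdfs_writes mr 10
intermediate_hdfs_writes tez 10
```

Under these assumptions the two calls print 9 and 0. Both engines still write the final result to HDFS once, so the difference lies entirely in the intermediate hand-offs.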


Required Reading

Complete in this order before starting the labs:

| # | Resource | What to extract |
|---|---|---|
| 1 | README.md in the Tez repo root | Build commands, module overview |
| 2 | Tez architecture document | Original design intent, DAG model rationale |
| 3 | YARN Architecture | Container lifecycle, AM responsibilities |
| 4 | tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java | Class-level Javadoc only: understand session vs. non-session |
| 5 | tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | Skim all keys and understand the category groupings |
| 6 | tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java | End-to-end DAG construction and submission |

Note on reading strategy: In a mature Apache codebase, Javadoc is often the best documentation that exists. Class-level Javadoc on public API classes reflects decisions debated and agreed upon by committers. Read it seriously.


Source Code Areas to Inspect

Read these files before and after the labs. You are not modifying anything yet.

tez-api — Public API

| File | Why |
|---|---|
| dag/api/TezClient.java | Entry point for all DAG submissions. Read create(), start(), submitDAG(). |
| dag/api/DAG.java | DAG construction API. Note addVertex(), addEdge(), addTaskLocalFiles(). |
| dag/api/Vertex.java | Vertex definition. Understand ProcessorDescriptor, parallelism, and VertexManagerPlugin. |
| dag/api/Edge.java | Edge definition. Understand EdgeProperty and DataMovementType. |
| dag/api/client/DAGClient.java | DAG monitoring. Understand getDAGStatus() and progress tracking. |
| dag/api/TezConfiguration.java | All Tez configuration keys. Every key is documented. |
| dag/api/EdgeProperty.java | Data movement type and scheduling type for edges. Fundamental to DAG design. |
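
One way to browse TezConfiguration.java by category is to group keys by the prefix constant they are built from. The helper below is a minimal sketch: it assumes keys are declared by concatenating a prefix constant such as TEZ_AM_PREFIX or TEZ_TASK_PREFIX with a quoted suffix on the same line, and it will miss any declaration that splits the two across lines. The function name is hypothetical.

```shell
# List the key suffixes declared against one prefix constant.
# Assumption: declarations look like  <PREFIX_CONSTANT> + "suffix"  on one line.
keys_for_prefix() {
  local file="$1" prefix="$2"
  grep -o "${prefix} *+ *\"[^\"]*\"" "$file" |   # keep only: PREFIX + "suffix"
    sed 's/.*"\([^"]*\)"$/\1/' |                   # strip everything but the suffix
    sort -u
}

# Usage, from the repository root:
#   keys_for_prefix tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java TEZ_AM_PREFIX
```

Running it once per prefix constant gives a quick per-category index to read alongside the Javadoc on each key.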

tez-dag — Core Execution Engine

| File | Why |
|---|---|
| app/DAGAppMaster.java | The YARN ApplicationMaster. First read: just serviceInit() and serviceStart(), the bodies behind init() and start(). It is several thousand lines long. |
| app/dag/impl/DAGImpl.java | DAG state machine. Read the state/transition enum declarations at the top. |
| app/dag/impl/VertexImpl.java | Most complex class in the project. First read: state enum + handle() only. |
| app/dag/impl/TaskImpl.java | Task state machine. More tractable than VertexImpl. Read fully. |
| app/dag/impl/TaskAttemptImpl.java | TaskAttempt state machine. Read fully. |

tez-runtime-library — I/O Implementations

| File | Why |
|---|---|
| runtime/library/input/OrderedGroupedKVInput.java | Standard sorted shuffle input. Used by most Hive reduce operations. |
| runtime/library/output/OrderedPartitionedKVOutput.java | Standard sorted shuffle output. Paired with the above. |
| runtime/library/input/UnorderedKVInput.java | Broadcast input; data is not sorted. |

tez-examples — Reference Implementations

| File | Why |
|---|---|
| examples/OrderedWordCount.java | The canonical Tez DAG example. Read this completely. |
| examples/IntersectExample.java | Shows a 3-vertex DAG with a broadcast edge. |

Key Classes Quick Reference

| Class | Module | Package | Role |
|---|---|---|---|
| TezClient | tez-api | org.apache.tez.dag.api | Creates sessions, submits DAGs |
| DAG | tez-api | org.apache.tez.dag.api | Defines the computation graph |
| Vertex | tez-api | org.apache.tez.dag.api | One processing stage |
| Edge | tez-api | org.apache.tez.dag.api | Data connection between vertices |
| EdgeProperty | tez-api | org.apache.tez.dag.api | Data movement + scheduling type |
| ProcessorDescriptor | tez-api | org.apache.tez.dag.api | Which Processor class runs in a vertex |
| TezConfiguration | tez-api | org.apache.tez.dag.api | All Tez configuration keys |
| DAGAppMaster | tez-dag | org.apache.tez.dag.app | YARN ApplicationMaster |
| DAGImpl | tez-dag | org.apache.tez.dag.app.dag.impl | DAG state machine |
| VertexImpl | tez-dag | org.apache.tez.dag.app.dag.impl | Vertex state machine |
| TaskImpl | tez-dag | org.apache.tez.dag.app.dag.impl | Task state machine |
| TaskAttemptImpl | tez-dag | org.apache.tez.dag.app.dag.impl | TaskAttempt state machine |
| TezTaskRunner2 | tez-runtime-internals | org.apache.tez.runtime.task | Runs a task inside a container |
| OrderedWordCount | tez-examples | org.apache.tez.examples | Canonical DAG example |

JIRA Issue Categories for Level 1 Contributors

At this stage, focus exclusively on:

  • Documentation — Javadoc typos, outdated parameter descriptions, missing @param or @return annotations, broken links in comments
  • Test improvements — Adding missing assertions to existing tests, improving test method naming, removing dead code from test classes
  • Checkstyle violations — Unused imports, line length violations, missing final keywords

How to find these:

  1. Go to Apache Tez JIRA
  2. Search: project = TEZ AND labels = "newbie" AND resolution = Unresolved
  3. Also scan: project = TEZ AND component = "Documentation" AND resolution = Unresolved
  4. Look at recently closed "trivial" issues to understand the standard for accepted patches

Warning: Do not pick up a JIRA issue and immediately upload a patch. Read all existing comments. If there is an active discussion or existing assignee, move on. Leave a comment saying you are investigating before you claim an issue.


Deliverables

You must demonstrate all of the following before advancing to Level 2:

  • Successful mvn install -DskipTests output — no build failures
  • At least one unit test class run successfully (e.g., TestDAGImpl)
  • Successful local DAG execution, with the DAG finishing in status SUCCEEDED
  • Ability to locate DAGAppMaster, TezClient, and OrderedGroupedKVInput from memory
  • Written explanation (2–3 sentences) of why a Tez DAG is faster than chained MapReduce
  • Written explanation of the difference between a YARN container and a Tez task

Common Mistakes

| Mistake | Consequence | Fix |
|---|---|---|
| Building with Java 17 against master | Compile errors or compatibility failures | Use Java 8 or Java 11; check <maven.compiler.source> in root pom.xml |
| Running mvn test on the full repository | Hours-long run including integration tests | Use -pl tez-dag to scope to one module (add -am only when its upstream modules are not yet installed) |
| Ignoring TezConfiguration.java | Confusion about configuration keys throughout all levels | Skim the entire file; every key is documented |
| Skipping the YARN architecture doc | Confusion about what Tez owns vs. what YARN owns | YARN understanding is required from Level 3 onward |
| Trying to understand all of DAGAppMaster at once | Overwhelm: the class is several thousand lines long | First pass: read only serviceInit() and serviceStart(), the bodies behind init() and start() |
| Reading Tez code without running it | Abstract understanding that does not transfer to debugging | Always run the code after reading it |
| Picking a JIRA issue without reading existing comments | Duplicate work; community friction | Read all comments; check assignee; leave a note before claiming |
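
The first mistake in the table is easy to catch before a long build starts. The sketch below pulls the major version out of a Java version string; it assumes the usual `java -version` format, where the version appears in double quotes on stderr. Both function names are hypothetical.

```shell
# Extract the major Java version from a version string such as
# "1.8.0_292" (Java 8), "11.0.2" (Java 11) or "17" (Java 17).
java_major() {
  local v="$1"
  v="${v#1.}"        # legacy "1.x" scheme: drop the leading "1."
  v="${v%%[._-]*}"   # keep everything before the first '.', '_' or '-'
  printf '%s\n' "$v"
}

# Read the quoted version from `java -version`, which prints to stderr.
current_java_major() {
  local raw
  raw="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
  java_major "$raw"
}

# Usage:
#   case "$(current_java_major)" in
#     8|11) echo "JDK OK" ;;
#     *)    echo "Switch to Java 8 or 11 before building" >&2 ;;
#   esac
```

Treat the accepted versions as a starting point taken from the table; the <maven.compiler.source> setting in the root pom.xml of your checkout is the authority.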

How to Verify Success

# 1. Full build without tests
cd /path/to/tez
mvn install -DskipTests -q && echo "BUILD OK"

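# 2. Unit test from tez-dag. Step 1 already installed the upstream modules,
#    so -am is omitted; with -am, -Dtest would also be applied to those modules.
mvn test -pl tez-dag -Dtest=TestDAGImpl -q

# 3. Local DAG run (from Lab 1.3)
# Expected final output line (exact wording may vary by Tez version):
#   DAG: [OrderedWordCount] finished with status: [SUCCEEDED]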
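
Because -q suppresses most console output, the Surefire text reports are a convenient place to read the results of step 2. The helper below is a sketch that assumes the default report location (<module>/target/surefire-reports/) and the standard summary line of the form "Tests run: N, Failures: N, Errors: N, Skipped: N"; the function name is hypothetical.

```shell
# Sum the Failures and Errors counts over all Surefire .txt reports in a directory.
surefire_problem_count() {
  local dir="$1"
  cat "$dir"/*.txt 2>/dev/null |
    sed -n 's/.*Tests run: [0-9]*, Failures: \([0-9]*\), Errors: \([0-9]*\).*/\1 \2/p' |
    awk '{ total += $1 + $2 } END { print total + 0 }'
}

# Usage, from the repository root after running the test:
#   surefire_problem_count tez-dag/target/surefire-reports
```

A result of 0 means no reported failures or errors. It also prints 0 when the directory holds no reports at all, so check that the report for your test class exists.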

Patch Profile: Level 1 Graduate

| Patch type | Example | Test requirement |
|---|---|---|
| Javadoc fix | Correcting a wrong @param description in TezClient | None (documentation only) |
| Dead import removal | Remove unused import statement flagged by checkstyle | Run mvn checkstyle:check -pl <module> |
| Test assertion improvement | Add assertEquals to an existing test that only checks for no-exception | Run the test class |
| README update | Fix a broken Maven command in the build instructions | Manual verification |

You are not ready to submit: bug fixes in state machines, new features, performance patches, or changes to the shuffle path. Those require Levels 3–7.