Apache Tez Open-Source Contributor Curriculum
Welcome to the Apache Tez Open-Source Contributor Curriculum — a complete, implementation-heavy roadmap for engineers who want to become serious Apache Tez contributors and eventually operate at the level of a core contributor, committer, or PMC-aware engineer.
What This Curriculum Is
This is not a tutorial. It is a structured engineering apprenticeship built around how Apache Tez is actually developed, tested, reviewed, and maintained by its committers and PMC members.
Every level is tied to real Apache Tez source code, real JIRA issue patterns, real test infrastructure, and real contribution workflows. The labs mirror the work an Apache Tez committer actually does — reading state machine code, tracing DAG execution paths, debugging shuffle failures, reproducing reported issues, and preparing patches for community review.
The curriculum will not hold your hand. It will point you at the right parts of the codebase, give you the right questions to ask, and push you to develop the muscle memory of someone who works at this level habitually.
Who This Is For
This curriculum is designed for strong backend and distributed systems engineers who:
- Have 3+ years of Java development experience (Maven-based projects)
- Are familiar with Hadoop, YARN, or MapReduce at a conceptual level
- Understand distributed systems fundamentals: scheduling, fault tolerance, partitioning, shuffle
- Want to contribute to Apache open-source at a serious level — not just fix typos
You should be comfortable with:
- Reading large, unfamiliar Java codebases without a guide
- git workflows, reading diffs, working with patch-based reviews
- The Hadoop ecosystem at a high level: YARN, HDFS, MapReduce, Hive
- Distributed execution concepts: task graphs, data movement, speculative execution
What You Will Be Able to Do
After completing this curriculum, you will be able to:
| Capability | Description |
|---|---|
| Build and test | Build Apache Tez from source, run unit and integration tests, run DAGs locally |
| Navigate the codebase | Find any class, understand its role, trace execution across module boundaries |
| Understand DAG execution | Follow a DAG from client submission through AM scheduling to task completion |
| Debug failures | Diagnose failed task attempts, hung DAGs, shuffle errors, and YARN allocation failures |
| Trace state machines | Read and reason about DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl state machines |
| Contribute patches | Reproduce issues, fix bugs, write tests, prepare high-quality patches |
| Engage the community | Interact productively on JIRA and mailing lists |
| Understand Hive integration | Trace a SQL query through Hive planning to a Tez DAG execution |
| Think like a committer | Reason about compatibility, test stability, performance, and release impact |
How to Use This Curriculum
Work through the 9 levels sequentially. Do not skip levels. Each level builds directly on the previous one, and the labs depend on the conceptual foundations laid earlier.
| Level | Title | Core Focus |
|---|---|---|
| 1 | Hadoop and Tez Foundation | Build, test, first DAG, Hadoop ecosystem |
| 2 | Apache Contributor Onboarding | Workflow, patches, JIRA, mailing lists |
| 3 | Tez Architecture | DAG model, TezClient, DAGAppMaster, key subsystems |
| 4 | DAG Execution Internals | State machines, vertex/task/attempt lifecycle, events |
| 5 | Testing and Debugging | Test infra, mini-cluster, debugging failed tasks |
| 6 | Hive/Tez Integration | SQL-to-DAG, Hive integration, cross-project bugs |
| 7 | Runtime and Shuffle | TezRuntime, I/O abstractions, shuffle and sort |
| 8 | Real Issue Contribution | JIRA reproduction, root cause analysis, real patches |
| 9 | Advanced Committer / PMC | Performance, backward compatibility, release practices |
Beyond the 9 levels, the curriculum includes five additional sections:
| Section | Purpose |
|---|---|
| Contributor Mindset | How to think, behave, and grow as an Apache contributor |
| Issue Roadmap | Staged progression from beginner-friendly to release-blocking issues |
| Internals Deep Dives | 21 focused deep dives, each with a mini-lab |
| Hive-on-Tez Labs | Cross-project debugging, SQL-to-DAG tracing, integration bugs |
| Release, Review, and PMC Practices | Apache governance, voting, licensing, release management |
The curriculum closes with a Capstone Project — a full contribution cycle from issue reproduction to merged patch and engineering write-up.
Required Tools
Before starting Level 1, ensure you have the following installed and working:
- Java 8 or Java 11 (OpenJDK recommended; match the Tez branch target)
- Apache Maven 3.6.3 or newer
- Git 2.x
- IntelliJ IDEA (strongly recommended) or Eclipse with M2E
- Docker (optional; useful for containerized mini-cluster environments)
You will also need:
- A clone of the Apache Tez repository (GitHub mirror of the Apache GitBox repo)
- A clone of the Apache Hadoop repository (for YARN API context and integration reference)
- An account on Apache JIRA (free to create)
- Subscription to the Apache Tez mailing lists:
  - dev@tez.apache.org: development discussion (required)
  - issues@tez.apache.org: JIRA notifications (optional but useful)
Note on Java version: Apache Tez's master branch targets Java 8 as the minimum. Some newer branches may require Java 11. Always check the pom.xml at the root of the branch you are working on.
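As a quick sanity check before building, the small program below reports which Java feature version the current JVM provides, so you can compare it with what your branch's pom.xml asks for. This is a minimal sketch, not part of Tez: the class name `JavaVersionCheck` and the default expectation of Java 8 are illustrative assumptions, and the expected version is passed in by hand rather than read from the pom.xml.

```java
// Sketch only: JavaVersionCheck is a hypothetical helper, not a Tez class.
final class JavaVersionCheck {

    /** Converts a java.specification.version value ("1.8", "11", "17") to a feature number. */
    static int featureVersion(String specVersion) {
        String[] parts = specVersion.trim().split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Java 8 and earlier report "1.x"; Java 9 and later report the feature number directly.
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        int running = featureVersion(System.getProperty("java.specification.version"));
        // Assumption: you pass the version your branch targets; 8 is only a placeholder default.
        int expected = args.length > 0 ? Integer.parseInt(args[0]) : 8;
        System.out.println("Running on Java " + running + ", branch expects Java " + expected);
        if (running != expected) {
            System.out.println("Mismatch: switch JDKs or re-check the pom.xml of your branch.");
        }
    }
}
```

Run it with the target version as the only argument, for example `java JavaVersionCheck 11` after compiling.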
Apache Tez at a Glance
Apache Tez is a general-purpose DAG execution engine built on top of Apache Hadoop YARN. It has been supported as an execution engine for Apache Hive since Hive 0.13 and is now Hive's primary engine. Other Hadoop ecosystem projects, including Pig and Cascading, have also run on Tez, and a Spark integration existed historically as an experiment.
Why Tez Exists
MapReduce forces every computation into a Map → Shuffle → Reduce pattern. Complex analytical queries (like multi-join SQL) require chaining many MapReduce jobs, with intermediate results written to HDFS between each stage. This is slow and wasteful.
Tez allows arbitrary directed acyclic graphs (DAGs) of computation where:
- Vertices represent computation stages
- Edges represent data movement between stages
- Container reuse eliminates JVM startup overhead between tasks
- Data can be pipelined between tasks without HDFS materialization
- The same container can run multiple task types
This makes Tez significantly faster than MapReduce for multi-stage queries.
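To make the vertex and edge vocabulary concrete, here is a toy model in plain Java. It is a sketch under stated assumptions: `ToyDag` and the vertex names are hypothetical, the class is not the tez-api `DAG`, and it models only dependency ordering (a vertex runs after all of its upstream vertices), not container reuse, pipelining, or anything else the real engine does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are NOT the tez-api DAG/Vertex/Edge classes.
final class ToyDag {
    // vertex name -> names of vertices that consume its output
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // vertex name -> number of incoming edges
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    /** Adds a data-movement edge; both endpoints are created if missing. */
    ToyDag addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    /** Kahn's algorithm: a vertex becomes ready once all of its upstream vertices are done. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        ToyDag dag = new ToyDag()
                .addEdge("scan_orders", "join")
                .addEdge("scan_customers", "join")
                .addEdge("join", "aggregate");
        // prints [scan_orders, scan_customers, join, aggregate]
        System.out.println(dag.executionOrder());
    }
}
```

The two scan vertices feed a single join, which feeds an aggregate. In a MapReduce chain the hand-offs between jobs would go through HDFS; in the DAG model they are just edges.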
Key Modules
You will spend the majority of your time in these modules:
| Module | Path | Description |
|---|---|---|
| tez-api | tez-api/ | Public API: DAG, Vertex, Edge, TezClient, DAGClient |
| tez-dag | tez-dag/ | Core execution engine: AM, state machines, scheduling |
| tez-runtime-library | tez-runtime-library/ | Input/Output/Processor implementations, shuffle |
| tez-mapreduce | tez-mapreduce/ | MapReduce compatibility layer (MRInput, MROutput) |
| tez-runtime-internals | tez-runtime-internals/ | Task execution framework, container management |
| tez-tests | tez-tests/ | Integration tests and system-level tests |
| tez-tools | tez-tools/ | Utility tools (job analyzers, swimlane visualization, log parsing) |
| tez-plugins | tez-plugins/ | Optional plugins (YARN timeline server history logging, history parsing, auxiliary services) |
Key Classes (High-Level Preview)
| Class | Module | Role |
|---|---|---|
| TezClient | tez-api | Entry point for DAG submission from a client |
| DAGClient | tez-api | Handle for monitoring a submitted DAG |
| DAG | tez-api | DAG definition: vertices + edges |
| Vertex | tez-api | Vertex definition: processor + parallelism |
| DAGAppMaster | tez-dag | ApplicationMaster: orchestrates DAG execution |
| DAGImpl | tez-dag | State machine: models DAG lifecycle |
| VertexImpl | tez-dag | State machine: models vertex lifecycle |
| TaskImpl | tez-dag | State machine: models task lifecycle |
| TaskAttemptImpl | tez-dag | State machine: models a single task attempt |
| TaskCommunicatorManager | tez-dag | Manages communication between AM and task containers |
| TezTaskRunner2 | tez-runtime-internals | Runs a task inside a container |
| LogicalIOProcessorRuntimeTask | tez-runtime-internals | Wires up I/O processors inside a task |
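Four of the classes above are described as state machines. If that pattern is unfamiliar, the sketch below shows the general shape in plain Java: a current state, a table of allowed (state, event) transitions, and a rejection of anything not in the table. It is a deliberately simplified illustration. `ToyAttemptStateMachine`, its states, and its events are hypothetical and do not match TaskAttemptImpl, which has many more states and events.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: states and events are simplified and do not match TaskAttemptImpl.
final class ToyAttemptStateMachine {
    enum State { NEW, RUNNING, SUCCEEDED, FAILED, KILLED }
    enum Event { START, DONE, ERROR, KILL }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    static {
        add(State.NEW, Event.START, State.RUNNING);
        add(State.NEW, Event.KILL, State.KILLED);
        add(State.RUNNING, Event.DONE, State.SUCCEEDED);
        add(State.RUNNING, Event.ERROR, State.FAILED);
        add(State.RUNNING, Event.KILL, State.KILLED);
        // SUCCEEDED, FAILED, and KILLED are terminal: no outgoing transitions.
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
    }

    private State current = State.NEW;

    State current() {
        return current;
    }

    /** Applies an event; rejects any (state, event) pair missing from the table. */
    State handle(Event event) {
        Map<Event, State> allowed = TRANSITIONS.get(current);
        State next = (allowed == null) ? null : allowed.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        ToyAttemptStateMachine attempt = new ToyAttemptStateMachine();
        System.out.println(attempt.handle(Event.START)); // prints RUNNING
        System.out.println(attempt.handle(Event.ERROR)); // prints FAILED
    }
}
```

When you read the real classes, look for the same three ingredients: the state enum, the event types, and the place where transitions are registered. Invalid-transition errors in logs usually point to an event arriving in a state whose table has no entry for it.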
Apache Tez Community
Apache Tez is a mature project with an active but selective community. The codebase reflects years of careful design decisions, many of which are documented in JIRA issues, design documents, and mailing list threads rather than in code comments.
What the community values:
- Patches that include tests
- Issues that include a clear reproduction case
- Comments that demonstrate you have read the existing code
- Contributors who engage respectfully and patiently
- Sustained contribution over time, not one-off patches
The path from contributor to committer is measured in years, not weeks. That is intentional. The Apache meritocracy rewards sustained, high-quality contribution — not volume of patches.
This curriculum will help you build the habits and depth of understanding that make that path realistic.
Begin with Level 1: Hadoop and Tez Foundation.