Apache Tez Open-Source Contributor Curriculum

Welcome to the Apache Tez Open-Source Contributor Curriculum — a complete, implementation-heavy roadmap for engineers who want to become serious Apache Tez contributors and eventually operate at the level of a core contributor, committer, or PMC-aware engineer.

What This Curriculum Is

This is not a tutorial. It is a structured engineering apprenticeship built around how Apache Tez is actually developed, tested, reviewed, and maintained by its committers and PMC members.

Every level is tied to real Apache Tez source code, real JIRA issue patterns, real test infrastructure, and real contribution workflows. The labs mirror the work an Apache Tez committer actually does — reading state machine code, tracing DAG execution paths, debugging shuffle failures, reproducing reported issues, and preparing patches for community review.

The curriculum will not hold your hand. It will point you at the right parts of the codebase, give you the right questions to ask, and push you to develop the muscle memory of someone who works at this level habitually.

Who This Is For

This curriculum is designed for strong backend and distributed systems engineers who:

Have 3+ years of Java development experience (Maven-based projects)
Are familiar with Hadoop, YARN, or MapReduce at a conceptual level
Understand distributed systems fundamentals: scheduling, fault tolerance, partitioning, shuffle
Want to contribute to Apache open-source at a serious level — not just fix typos

You should be comfortable with:

Reading large, unfamiliar Java codebases without a guide
Git workflows, reading diffs, and working with patch-based reviews
The Hadoop ecosystem at a high level: YARN, HDFS, MapReduce, Hive
Distributed execution concepts: task graphs, data movement, speculative execution

What You Will Be Able to Do

After completing this curriculum, you will be able to:

Capability	Description
Build and test	Build Apache Tez from source, run unit and integration tests, run DAGs locally
Navigate the codebase	Find any class, understand its role, trace execution across module boundaries
Understand DAG execution	Follow a DAG from client submission through AM scheduling to task completion
Debug failures	Diagnose failed task attempts, hung DAGs, shuffle errors, and YARN allocation failures
Trace state machines	Read and reason about `DAGImpl`, `VertexImpl`, `TaskImpl`, `TaskAttemptImpl` state machines
Contribute patches	Reproduce issues, fix bugs, write tests, prepare high-quality patches
Engage the community	Interact productively on JIRA and mailing lists
Understand Hive integration	Trace a SQL query through Hive planning to a Tez DAG execution
Think like a committer	Reason about compatibility, test stability, performance, and release impact

How to Use This Curriculum

Work through the 9 levels sequentially. Do not skip levels. Each level builds directly on the previous one, and the labs depend on the conceptual foundations laid earlier.

Level	Title	Core Focus
1	Hadoop and Tez Foundation	Build, test, first DAG, Hadoop ecosystem
2	Apache Contributor Onboarding	Workflow, patches, JIRA, mailing lists
3	Tez Architecture	DAG model, TezClient, DAGAppMaster, key subsystems
4	DAG Execution Internals	State machines, vertex/task/attempt lifecycle, events
5	Testing and Debugging	Test infra, mini-cluster, debugging failed tasks
6	Hive/Tez Integration	SQL-to-DAG, Hive integration, cross-project bugs
7	Runtime and Shuffle	TezRuntime, I/O abstractions, shuffle and sort
8	Real Issue Contribution	JIRA reproduction, root cause analysis, real patches
9	Advanced Committer / PMC	Performance, backward compatibility, release practices

Beyond the 9 levels, the curriculum includes five additional sections:

Section	Purpose
Contributor Mindset	How to think, behave, and grow as an Apache contributor
Issue Roadmap	Staged progression from beginner-friendly to release-blocking issues
Internals Deep Dives	21 focused deep dives, each with a mini-lab
Hive-on-Tez Labs	Cross-project debugging, SQL-to-DAG tracing, integration bugs
Release, Review, and PMC Practices	Apache governance, voting, licensing, release management

The curriculum closes with a Capstone Project — a full contribution cycle from issue reproduction to merged patch and engineering write-up.

Required Tools

Before starting Level 1, ensure you have the following installed and working:

Java 8 or Java 11 (OpenJDK recommended — match the Tez branch target)
Apache Maven 3.6.3 or newer
Git 2.x
IntelliJ IDEA (strongly recommended) or Eclipse with M2E
Docker (optional — useful for containerized mini-cluster environments)

You will also need:

A clone of the Apache Tez repository (GitHub mirror of the Apache GitBox repo)
A clone of the Apache Hadoop repository (for YARN API context and integration reference)
An account on Apache JIRA (free to create)
Subscription to the Apache Tez mailing lists:
- dev@tez.apache.org — development discussion (required)
- issues@tez.apache.org — JIRA notifications (optional but useful)

Note on Java version: The minimum Java version depends on the branch. Java 8 has long been the baseline for Tez release lines, while newer development may require Java 11 or later. Always check the pom.xml at the root of the branch you are working on.
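
As a quick sanity check before building, the small program below reports which Java feature release the current JVM identifies as and compares it with a target you supply. It is a minimal sketch, not part of Tez: the class name `JavaVersionCheck` and the default target of 8 are assumptions for illustration, and the target should be whatever the branch's pom.xml actually declares.

```java
// Sketch only: confirms which Java feature release the current JVM reports.
// JavaVersionCheck is a hypothetical helper, not a class from the Tez codebase.
final class JavaVersionCheck {

    /**
     * Converts a "java.specification.version" value to a feature number.
     * Java 8 and earlier report values like "1.8"; Java 9+ report "9", "11", "17".
     */
    static int parseFeatureVersion(String specVersion) {
        String v = specVersion.trim();
        if (v.startsWith("1.")) {
            v = v.substring(2); // "1.8" -> "8"
        }
        int dot = v.indexOf('.');
        if (dot >= 0) {
            v = v.substring(0, dot); // drop any trailing minor component
        }
        return Integer.parseInt(v);
    }

    public static void main(String[] args) {
        // Assumed default target; pass the value you read from the branch's pom.xml.
        int required = args.length > 0 ? Integer.parseInt(args[0]) : 8;
        int actual = parseFeatureVersion(System.getProperty("java.specification.version"));
        System.out.println("JVM feature version: " + actual + ", branch target: " + required);
        if (actual < required) {
            System.err.println("This JDK is older than the branch target.");
            System.exit(1);
        }
    }
}
```

Run it with the same JDK that Maven will use, since a mismatch between the shell's `java` and Maven's `JAVA_HOME` is a common source of confusing build failures.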

Apache Tez at a Glance

Apache Tez is a general-purpose DAG execution engine built on top of Apache Hadoop YARN. Hive has supported Tez as an execution engine since Hive 0.13, and Tez is also used by other Hadoop ecosystem projects, including Apache Pig and Cascading.

Why Tez Exists

MapReduce forces every computation into a Map → Shuffle → Reduce pattern. Complex analytical queries (like multi-join SQL) require chaining many MapReduce jobs, with intermediate results written to HDFS between each stage. Each intermediate write pays for HDFS replication and a fresh job launch, which is slow and wasteful.

Tez instead expresses a computation as an arbitrary directed acyclic graph (DAG), in which:

Vertices represent computation stages
Edges represent data movement between stages
Container reuse avoids JVM startup overhead between tasks
Intermediate data can flow between stages without being materialized in HDFS
The same container can run tasks from different vertices

Together, these properties typically make Tez significantly faster than MapReduce for multi-stage queries.
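
To make the vertex and edge vocabulary concrete, here is a toy model of a two-way join followed by an aggregation, expressed as one graph rather than a chain of separate jobs. It is a sketch under stated assumptions: `ToyDag` and the vertex names are hypothetical, the class is not the `tez-api` `DAG` type, and the ordering it computes is only one valid order in which stages could start, not a description of how the Tez AM actually schedules vertices.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; this is NOT the tez-api DAG, Vertex, or Edge API.
final class ToyDag {
    // Vertex name -> names of downstream vertices (each entry is an edge).
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // Vertex name -> number of incoming edges.
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    ToyDag addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    /** Kahn's algorithm: returns an order in which every stage follows its inputs. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((vertex, degree) -> {
            if (degree == 0) {
                ready.add(vertex); // source stages have no upstream data
            }
        });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next); // all inputs of 'next' are now available
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        ToyDag dag = new ToyDag()
            .addEdge("scan_orders", "join")
            .addEdge("scan_customers", "join")
            .addEdge("join", "aggregate");
        System.out.println(dag.executionOrder());
    }
}
```

In a MapReduce formulation, the join and the aggregation would usually be separate jobs with an HDFS write between them; in the graph above, that hand-off is just the edge from `join` to `aggregate`.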

Key Modules

You will spend the majority of your time in these modules:

Module	Path	Description
`tez-api`	`tez-api/`	Public API: `DAG`, `Vertex`, `Edge`, `TezClient`, `DAGClient`
`tez-dag`	`tez-dag/`	Core execution engine: AM, state machines, scheduling
`tez-runtime-library`	`tez-runtime-library/`	Input/Output/Processor implementations, shuffle
`tez-mapreduce`	`tez-mapreduce/`	MapReduce compatibility layer (`MRInput`, `MROutput`)
`tez-runtime-internals`	`tez-runtime-internals/`	Task execution framework that runs inside each container
`tez-tests`	`tez-tests/`	Integration tests and system-level tests
`tez-tools`	`tez-tools/`	Utility tools (job analyzers, swimlane and counter-diff utilities)
`tez-plugins`	`tez-plugins/`	Optional plugins (YARN Timeline Server history logging, history parser, auxiliary shuffle service)

Key Classes (High-Level Preview)

Class	Module	Role
`TezClient`	`tez-api`	Entry point for DAG submission from a client
`DAGClient`	`tez-api`	Handle for monitoring a submitted DAG
`DAG`	`tez-api`	DAG definition: vertices + edges
`Vertex`	`tez-api`	Vertex definition: processor + parallelism
`DAGAppMaster`	`tez-dag`	ApplicationMaster — orchestrates DAG execution
`DAGImpl`	`tez-dag`	State machine: models DAG lifecycle
`VertexImpl`	`tez-dag`	State machine: models vertex lifecycle
`TaskImpl`	`tez-dag`	State machine: models task lifecycle
`TaskAttemptImpl`	`tez-dag`	State machine: models a single task attempt
`TaskCommunicatorManager`	`tez-dag`	Manages communication between AM and task containers
`TezTaskRunner2`	`tez-runtime-internals`	Runs a task inside a container
`LogicalIOProcessorRuntimeTask`	`tez-runtime-internals`	Wires up the inputs, outputs, and processor inside a task
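
The four `*Impl` classes above are event-driven state machines: each holds a current state, and an incoming event either triggers a registered transition or is rejected. The sketch below shows that shape in miniature. It is deliberately simplified and makes several assumptions: the states and events are illustrative rather than the real `TaskAttemptImpl` transition table, and the hand-rolled table stands in for the state machine factory from Hadoop YARN that the real classes build on.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified sketch; states, events, and class name are hypothetical and do
// not reproduce the real TaskAttemptImpl transition table.
final class ToyAttemptStateMachine {
    enum State { NEW, RUNNING, SUCCEEDED, FAILED, KILLED }
    enum Event { START, DONE, ERROR, KILL }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<State, Map<Event, State>>(State.class);

    private static void allow(State from, Event on, State to) {
        TRANSITIONS
            .computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
            .put(on, to);
    }

    static {
        allow(State.NEW, Event.START, State.RUNNING);
        allow(State.NEW, Event.KILL, State.KILLED);
        allow(State.RUNNING, Event.DONE, State.SUCCEEDED);
        allow(State.RUNNING, Event.ERROR, State.FAILED);
        allow(State.RUNNING, Event.KILL, State.KILLED);
        // Terminal states register no transitions, so any further event is rejected.
    }

    private State current = State.NEW;

    State getState() {
        return current;
    }

    /** Applies one event; an event with no registered transition is rejected. */
    State handle(Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException(
                "Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        ToyAttemptStateMachine attempt = new ToyAttemptStateMachine();
        attempt.handle(Event.START);
        attempt.handle(Event.DONE);
        System.out.println(attempt.getState());
    }
}
```

When you read the real classes, the same two questions apply at every step: which state is the object in, and which transitions are registered for the event that just arrived.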

Apache Tez Community

Apache Tez is a mature project with an active but selective community. The codebase reflects years of careful design decisions, many of which are documented in JIRA issues, design documents, and mailing list threads rather than in code comments.

What the community values:

Patches that include tests
Issues that include a clear reproduction case
Comments that demonstrate you have read the existing code
Contributors who engage respectfully and patiently
Sustained contribution over time, not one-off patches

The path from contributor to committer is measured in years, not weeks. That is intentional. The Apache meritocracy rewards sustained, high-quality contribution — not volume of patches.

This curriculum will help you build the habits and depth of understanding that make that path realistic.

Begin with Level 1: Hadoop and Tez Foundation.
