Apache Tez Open-Source Contributor Curriculum

Welcome to the Apache Tez Open-Source Contributor Curriculum — a complete, implementation-heavy roadmap for engineers who want to become serious Apache Tez contributors and eventually operate at the level of a core contributor, committer, or PMC-aware engineer.


What This Curriculum Is

This is not a tutorial. It is a structured engineering apprenticeship built around how Apache Tez is actually developed, tested, reviewed, and maintained by its committers and PMC members.

Every level is tied to real Apache Tez source code, real JIRA issue patterns, real test infrastructure, and real contribution workflows. The labs mirror the work an Apache Tez committer actually does — reading state machine code, tracing DAG execution paths, debugging shuffle failures, reproducing reported issues, and preparing patches for community review.

The curriculum will not hold your hand. It will point you at the right parts of the codebase, give you the right questions to ask, and push you to develop the muscle memory of someone who works at this level habitually.


Who This Is For

This curriculum is designed for strong backend and distributed systems engineers who:

  • Have 3+ years of Java development experience (Maven-based projects)
  • Are familiar with Hadoop, YARN, or MapReduce at a conceptual level
  • Understand distributed systems fundamentals: scheduling, fault tolerance, partitioning, shuffle
  • Want to contribute to Apache open-source at a serious level — not just fix typos

You should be comfortable with:

  • Reading large, unfamiliar Java codebases without a guide
  • git workflows, reading diffs, working with patch-based reviews
  • The Hadoop ecosystem at a high level: YARN, HDFS, MapReduce, Hive
  • Distributed execution concepts: task graphs, data movement, speculative execution

What You Will Be Able to Do

After completing this curriculum, you will be able to:

Capability                  Description
--------------------------  -----------
Build and test              Build Apache Tez from source, run unit and integration tests, run DAGs locally
Navigate the codebase       Find any class, understand its role, trace execution across module boundaries
Understand DAG execution    Follow a DAG from client submission through AM scheduling to task completion
Debug failures              Diagnose failed task attempts, hung DAGs, shuffle errors, and YARN allocation failures
Trace state machines        Read and reason about the DAGImpl, VertexImpl, TaskImpl, and TaskAttemptImpl state machines
Contribute patches          Reproduce issues, fix bugs, write tests, prepare high-quality patches
Engage the community        Interact productively on JIRA and mailing lists
Understand Hive integration Trace a SQL query through Hive planning to a Tez DAG execution
Think like a committer      Reason about compatibility, test stability, performance, and release impact

How to Use This Curriculum

Work through the 9 levels sequentially. Do not skip levels. Each level builds directly on the previous one, and the labs depend on the conceptual foundations laid earlier.

Level  Title                          Core Focus
-----  -----------------------------  ----------
1      Hadoop and Tez Foundation      Build, test, first DAG, Hadoop ecosystem
2      Apache Contributor Onboarding  Workflow, patches, JIRA, mailing lists
3      Tez Architecture               DAG model, TezClient, DAGAppMaster, key subsystems
4      DAG Execution Internals        State machines, vertex/task/attempt lifecycle, events
5      Testing and Debugging          Test infra, mini-cluster, debugging failed tasks
6      Hive/Tez Integration           SQL-to-DAG, Hive integration, cross-project bugs
7      Runtime and Shuffle            TezRuntime, I/O abstractions, shuffle and sort
8      Real Issue Contribution        JIRA reproduction, root cause analysis, real patches
9      Advanced Committer / PMC       Performance, backward compatibility, release practices

Beyond the 9 levels, the curriculum includes five additional sections:

Section                             Purpose
----------------------------------  -------
Contributor Mindset                 How to think, behave, and grow as an Apache contributor
Issue Roadmap                       Staged progression from beginner-friendly to release-blocking issues
Internals Deep Dives                21 focused deep dives, each with a mini-lab
Hive-on-Tez Labs                    Cross-project debugging, SQL-to-DAG tracing, integration bugs
Release, Review, and PMC Practices  Apache governance, voting, licensing, release management

The curriculum closes with a Capstone Project — a full contribution cycle from issue reproduction to merged patch and engineering write-up.


Required Tools

Before starting Level 1, ensure you have the following installed and working:

  • Java 8 or Java 11 (OpenJDK recommended; match the version targeted by the Tez branch you build)
  • Apache Maven 3.6.3 or newer
  • Git 2.x
  • IntelliJ IDEA (strongly recommended) or Eclipse with M2E
  • Docker (optional; useful for containerized mini-cluster environments)

You will also need:

Note on Java version: the minimum Java version depends on the branch. Java 8 has been the long-standing baseline for Apache Tez, and newer branches may require Java 11. Always check the pom.xml at the root of the branch you are working on.
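As a quick sanity check before building, the running JVM can be compared against the version a branch expects. The sketch below is illustrative only: the class name JavaVersionCheck and the default of 8 are assumptions for this example, and the authoritative requirement is whatever the branch's pom.xml declares.

```java
/** Hypothetical helper for this curriculum; not part of Apache Tez. */
final class JavaVersionCheck {

    /** Maps a java.specification.version string to a feature number: "1.8" -> 8, "11" -> 11. */
    static int featureVersion(String specVersion) {
        // Java 8 and earlier report "1.x"; Java 9 and later report the feature number directly.
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        int dot = v.indexOf('.');
        return Integer.parseInt(dot < 0 ? v : v.substring(0, dot));
    }

    public static void main(String[] args) {
        int running = featureVersion(System.getProperty("java.specification.version"));
        // Assumption: default to 8 unless the caller passes the version read from pom.xml.
        int required = args.length > 0 ? Integer.parseInt(args[0]) : 8;
        if (running < required) {
            System.err.println("Java " + running + " is older than the required Java " + required);
            System.exit(1);
        }
        System.out.println("Java " + running + " satisfies the required minimum of " + required);
    }
}
```

Pass the minimum version you read from the branch's pom.xml as the first argument to check against something other than 8.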


Apache Tez at a Glance

Apache Tez is a general-purpose DAG execution engine built on top of Apache Hadoop YARN. Apache Hive has supported Tez as an execution engine since Hive 0.13, and Tez is now Hive's primary engine. Other Hadoop ecosystem projects, including Apache Pig and Cascading, can also run on it.

Why Tez Exists

MapReduce forces every computation into a Map → Shuffle → Reduce pattern. Complex analytical queries (like multi-join SQL) require chaining many MapReduce jobs, with intermediate results written to HDFS between each stage. This is slow and wasteful.

Tez instead lets a job be expressed as an arbitrary directed acyclic graph (DAG) of computation, in which:

  • Vertices represent computation stages
  • Edges represent data movement between stages

On top of that model, the Tez runtime adds:

  • Container reuse, which avoids paying JVM startup cost for every task
  • Data movement between stages without materializing intermediate results in HDFS
  • The ability for the same container to run tasks of different types

Together, these properties make Tez significantly faster than a chain of MapReduce jobs for multi-stage queries.
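The vertex-and-edge model can be sketched in a few lines of plain Java. This is a toy model and deliberately not the Tez API: the class ToyDag, its methods, and the vertex names are all invented for illustration. It expresses a two-way join followed by an aggregation as a single graph and derives an order in which the stages could run, where MapReduce would need a chain of separate jobs with intermediate results written to HDFS between them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy DAG model (hypothetical, not the Tez API): vertices are stages, edges are data movement. */
final class ToyDag {
    // Vertex name -> names of downstream vertices; insertion-ordered for stable output.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    ToyDag addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        return this;
    }

    /** Kahn's algorithm: a vertex becomes runnable once all of its upstream vertices have run. */
    List<String> executionOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String v : downstream.keySet()) {
            pendingInputs.put(v, 0);
        }
        for (List<String> targets : downstream.values()) {
            for (String t : targets) {
                pendingInputs.merge(t, 1, Integer::sum);
            }
        }

        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pendingInputs.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String t : downstream.get(v)) {
                if (pendingInputs.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle, so it is not a DAG");
        }
        return order;
    }

    /** Two scans feeding a join, followed by an aggregation: one graph instead of chained jobs. */
    static ToyDag twoWayJoin() {
        return new ToyDag()
                .addEdge("scan_orders", "join")
                .addEdge("scan_customers", "join")
                .addEdge("join", "aggregate");
    }

    public static void main(String[] args) {
        // Prints: [scan_orders, scan_customers, join, aggregate]
        System.out.println(twoWayJoin().executionOrder());
    }
}
```

The real Tez API attaches much more to each vertex and edge (processors, parallelism, data movement and scheduling properties), and the AM schedules tasks rather than whole vertices, so treat this only as a mental model for the graph shape.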

Key Modules

You will spend the majority of your time in these modules:

Module                 Path                    Description
---------------------  ----------------------  -----------
tez-api                tez-api/                Public API: DAG, Vertex, Edge, TezClient, DAGClient
tez-dag                tez-dag/                Core execution engine: AM, state machines, scheduling
tez-runtime-library    tez-runtime-library/    Input/Output/Processor implementations, shuffle
tez-mapreduce          tez-mapreduce/          MapReduce compatibility layer (MRInput, MROutput)
tez-runtime-internals  tez-runtime-internals/  Task execution framework, container management
tez-tests              tez-tests/              Integration tests and system-level tests
tez-tools              tez-tools/              Utility tools (job analyzers, log and TFile parsing)
tez-plugins            tez-plugins/            Optional plugins (YARN timeline server integration, history parsing)

Key Classes (High-Level Preview)

Class                          Module                 Role
-----------------------------  ---------------------  ----
TezClient                      tez-api                Entry point for DAG submission from a client
DAGClient                      tez-api                Handle for monitoring a submitted DAG
DAG                            tez-api                DAG definition: vertices + edges
Vertex                         tez-api                Vertex definition: processor + parallelism
DAGAppMaster                   tez-dag                ApplicationMaster that orchestrates DAG execution
DAGImpl                        tez-dag                State machine: models DAG lifecycle
VertexImpl                     tez-dag                State machine: models vertex lifecycle
TaskImpl                       tez-dag                State machine: models task lifecycle
TaskAttemptImpl                tez-dag                State machine: models a single task attempt
TaskCommunicatorManager        tez-dag                Manages communication between AM and task containers
TezTaskRunner2                 tez-runtime-internals  Runs a task inside a container
LogicalIOProcessorRuntimeTask  tez-runtime-internals  Wires up the inputs, processor, and outputs of a task
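Several of the classes above are described as state machines. As a warm-up for reading them, here is a minimal table-driven sketch of the idea in plain Java. The states, events, and the class name ToyAttemptStateMachine are simplified assumptions made for this example. The real DAGImpl, VertexImpl, TaskImpl, and TaskAttemptImpl classes define many more states and events and build their transition tables with the state-machine utilities that ship with Hadoop YARN.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy lifecycle for a single task attempt (hypothetical; far simpler than TaskAttemptImpl). */
final class ToyAttemptStateMachine {
    enum State { NEW, RUNNING, SUCCEEDED, FAILED, KILLED }
    enum Event { START, DONE, FAIL, KILL }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS = new EnumMap<>(State.class);

    private static void allow(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    static {
        allow(State.NEW, Event.START, State.RUNNING);
        allow(State.NEW, Event.KILL, State.KILLED);
        allow(State.RUNNING, Event.DONE, State.SUCCEEDED);
        allow(State.RUNNING, Event.FAIL, State.FAILED);
        allow(State.RUNNING, Event.KILL, State.KILLED);
        // SUCCEEDED, FAILED, and KILLED are terminal: no outgoing transitions are registered.
    }

    private State current = State.NEW;

    State current() {
        return current;
    }

    /** Applies an event, or throws if the table has no transition for it in the current state. */
    State handle(Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        ToyAttemptStateMachine attempt = new ToyAttemptStateMachine();
        System.out.println(attempt.handle(Event.START)); // RUNNING
        System.out.println(attempt.handle(Event.DONE));  // SUCCEEDED
    }
}
```

When you later read the real classes, look for the same three ingredients: the set of states, the set of event types, and the table that connects them. An event arriving in a state with no registered transition is a common source of bugs, which is why this sketch rejects it loudly.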

Apache Tez Community

Apache Tez is a mature project with an active but selective community. The codebase reflects years of careful design decisions, many of which are documented in JIRA issues, design documents, and mailing list threads rather than in code comments.

What the community values:

  • Patches that include tests
  • Issues that include a clear reproduction case
  • Comments that demonstrate you have read the existing code
  • Contributors who engage respectfully and patiently
  • Sustained contribution over time, not one-off patches

The path from contributor to committer is measured in years, not weeks. That is intentional. The Apache meritocracy rewards sustained, high-quality contribution — not volume of patches.

This curriculum will help you build the habits and depth of understanding that make that path realistic.


Begin with Level 1: Hadoop and Tez Foundation.