Lab 2.1: Navigate the Repository Structure
Background
Before writing a single line of code, a new contributor should be able to navigate the repository as fluently as a committer. This lab builds that fluency: you will walk through every module, learn the Maven multi-module structure, and practice until you can locate any class in under 30 seconds.
Repository Root Layout
apache/tez/
├── pom.xml # Root POM — module declarations, dep management
├── tez-api/ # Public client API
├── tez-common/ # Utilities shared across modules
├── tez-dag/ # DAG AppMaster — the core of Tez
├── tez-examples/ # Example DAG implementations
├── tez-ext-service-tests/ # External service integration tests
├── tez-mapreduce/ # MapReduce compatibility layer
├── tez-plugins/ # Optional plugins (ATSv2, etc.)
├── tez-runtime-internals/ # Internal runtime interfaces
├── tez-runtime-library/ # I/O processors, shuffle
├── tez-tests/ # Integration test suite
├── tez-tools/ # Performance analysis utilities
├── src/
│   └── config/
│       ├── checkstyle.xml # Style enforcement rules
│       └── checkstyle-suppressions.xml
└── CHANGES.txt # Release changelog
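To check the layout above against a real checkout, a small helper can list every top-level directory that contains its own pom.xml. This is a minimal sketch: list_modules is a hypothetical name, not part of the Tez build, and it only looks one level deep, so nested modules (for example under tez-plugins/) are not listed.

```shell
# list_modules [repo_root]
# Hypothetical helper: prints each immediate subdirectory that has a pom.xml.
# Runs in a subshell (note the parentheses) so its variables do not leak.
list_modules() (
  root="${1:-.}"
  for pom in "$root"/*/pom.xml; do
    [ -f "$pom" ] || continue      # skip the literal pattern when nothing matches
    basename "$(dirname "$pom")"
  done
)

# Example (run from the repository root): list_modules .
```

Comparing its output with the <modules> section of the root pom.xml is a quick way to spot directories that are not part of the default build.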
Module-by-Module Walkthrough
tez-api — The Public Contract
Everything in tez-api is part of the public API that application developers use. Changes here
must be backward-compatible or explicitly versioned. This is the highest-stability module.
Key packages:
| Package | Contents |
|---|---|
| org.apache.tez.dag.api | DAG, Vertex, Edge, TezClient, TezConfiguration |
| org.apache.tez.dag.api.client | DAGClient, DAGStatus (monitoring and control) |
| org.apache.tez.dag.api.event | Events emitted by the AM to task processors |
| org.apache.tez.dag.api.records | Protocol Buffer message classes (generated) |
| org.apache.tez.runtime.api | AbstractProcessor, Input, Output interfaces |
Exercise:
# Count Java source files in tez-api (a rough proxy for the size of the API surface)
find tez-api/src/main/java -name "*.java" | wc -l
# List files in tez-runtime-library whose classes extend AbstractProcessor
grep -rl "extends AbstractProcessor" tez-runtime-library/src/
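Counting files overstates the API surface, because package-private helpers and package-info.java files are counted too. A slightly closer estimate counts only files that declare a public top-level type. This is a sketch, not an exact measure: count_public_types is a hypothetical helper, and the pattern assumes the declaration starts in column one with the usual modifier order.

```shell
# count_public_types [source_dir]
# Hypothetical helper: counts .java files that contain an unindented
# "public class|interface|enum|@interface" declaration.
count_public_types() (
  src="${1:-tez-api/src/main/java}"
  # -l lists each matching file once, so wc -l yields a file count.
  grep -rlE --include='*.java' \
    '^public ((abstract|final|static|sealed) )*(class|interface|enum|@interface) ' \
    "$src" | wc -l | tr -d ' '
)

# Example (run from the repository root): count_public_types tez-api/src/main/java
```

Nested public types are indented and therefore not counted; that is deliberate here, since the goal is a per-file estimate.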
tez-dag — The Application Master
This is the largest and most complex module. It implements the DAG AppMaster that runs in a YARN container and orchestrates vertex and task execution.
Key packages:
| Package | Contents |
|---|---|
| org.apache.tez.dag.app | DAGAppMaster (the main AM class) |
| org.apache.tez.dag.app.dag | DAG, Vertex, Task, TaskAttempt state machine interfaces |
| org.apache.tez.dag.app.dag.impl | DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl |
| org.apache.tez.dag.app.rm | YARN resource management integration |
| org.apache.tez.dag.app.launcher | Container launch logic |
| org.apache.tez.dag.app.web | AM web UI servlets |
| org.apache.tez.dag.history | Timeline history event handling |
Exercise:
# Count lines in DAGImpl (the most complex class)
wc -l tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java
# Count state machine transitions in VertexImpl
grep "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | wc -l
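The same count can be taken across a whole source tree to see which classes carry the largest state machines. The sketch below assumes each transition is registered on its own line, since grep -c counts matching lines rather than occurrences. rank_state_machines is a hypothetical helper name.

```shell
# rank_state_machines [source_dir]
# Hypothetical helper: prints "path:count" for every .java file with at least
# one line containing addTransition, largest count first.
rank_state_machines() (
  src="${1:-tez-dag/src/main/java}"
  grep -rc --include='*.java' 'addTransition' "$src" \
    | grep -v ':0$' \
    | sort -t: -k2,2nr
)

# Example (run from the repository root): rank_state_machines tez-dag/src/main/java | head -5
```

The sort splits on ":", so paths that themselves contain a colon would be ranked incorrectly.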
tez-runtime-library — I/O and Shuffle
This module implements the data reading and writing performed inside task containers, including the shuffle.
Key packages:
| Package | Contents |
|---|---|
| org.apache.tez.runtime.library.input | OrderedGroupedKVInput, UnorderedKVInput, etc. |
| org.apache.tez.runtime.library.output | OrderedPartitionedKVOutput, UnorderedKVOutput, etc. |
| org.apache.tez.runtime.library.common.shuffle | Shuffle fetch infrastructure |
| org.apache.tez.runtime.library.common.sort | External sort implementation |
| org.apache.tez.runtime.library.common.writers | Spilling KV writers |
Exercise:
# Find all Input implementations
find tez-runtime-library/src/main/java -name "*Input*.java" | grep -v test
# Find the shuffle Fetcher
find tez-runtime-library/src/main/java -name "Fetcher.java"
wc -l $(find tez-runtime-library/src/main/java -name "Fetcher.java")
tez-common — Shared Utilities
Contains utilities used by multiple modules that do not fit in tez-api:
- TezUtils: configuration serialization/deserialization
- TezTaskID, TezVertexID, TezDAGID: ID types
- ReflectionUtils: Tez-specific reflection helpers
- VersionUtils: version compatibility checks
tez-mapreduce — MapReduce Compatibility
Allows MapReduce jobs to run on Tez without code changes. Contains MRInput, MROutput,
and the mapper/reducer wrapping infrastructure.
tez-examples — Reference Implementations
Four example DAGs:
| Class | What it demonstrates |
|---|---|
| OrderedWordCount | 3-vertex pipeline, ordered shuffle, sort by value |
| IntersectExample | 2-way join using broadcast edge |
| JoinDataGen | Data generation for the join example |
| FilterLinesByWord | Simple filter with configurable parallelism |
tez-tests — Integration Test Suite
Contains tests that run against MiniTezCluster — a full in-process Tez + YARN + HDFS cluster.
These tests are slow (minutes each) but provide end-to-end coverage.
Key test class: TestMiniTezSessionWithLocalMode — runs example DAGs in local mode.
Maven Structure Deep Dive
Root pom.xml
Read the root pom.xml to understand:
- Module declarations (<modules> section): the build order
- Dependency management (<dependencyManagement>): canonical versions for all deps
- Plugin management (<pluginManagement>): canonical plugin configurations
- Build profiles: hadoop-2 vs hadoop-3, dist profile for assembly
Exercise:
# What Hadoop version does Tez build against by default?
grep -A2 "hadoop.version" pom.xml | head -5
# What Java version is required?
grep "maven.compiler" pom.xml
# Roughly how many artifacts does the root pom reference? (Upper bound: this also counts plugins and the project's own artifactId.)
grep "<artifactId>" pom.xml | wc -l
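The grep commands above print whole matching lines. To pull out just a property value, a sed one-liner is usually enough. This is a sketch under stated assumptions: pom_property is a hypothetical helper, the property must be declared on a single line as <name>value</name>, and neither ${...} references nor profile overrides are resolved.

```shell
# pom_property <property-name> [pom-file]
# Hypothetical helper: prints the text of the first <name>...</name> element
# found on a single line of the given pom (default: ./pom.xml).
pom_property() (
  name="$1"
  pom="${2:-pom.xml}"
  sed -n "s:.*<$name>\(.*\)</$name>.*:\1:p" "$pom" | head -n 1
)

# Example (run from the repository root): pom_property hadoop.version
```

For an answer that accounts for profiles and inheritance, Maven's own help:evaluate goal is the authoritative route; the helper above is only a fast first look.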
Module pom.xml Structure
Each module follows the same pattern:
<parent>
  <groupId>org.apache.tez</groupId>
  <artifactId>tez</artifactId>
  <version>0.10.x-SNAPSHOT</version>
</parent>
<artifactId>tez-dag</artifactId>
<name>Tez DAG</name>
<dependencies>
  <!-- Module-specific dependencies -->
</dependencies>
Modules declare their inter-dependencies explicitly. This is how Maven knows the build order.
Exercise:
# What modules does tez-dag depend on?
grep -A3 "<dependency>" tez-dag/pom.xml | grep "tez-" | grep "artifactId"
# What does tez-runtime-library depend on?
grep -A3 "<dependency>" tez-runtime-library/pom.xml | grep "tez-" | grep "artifactId"
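The grep -A3 approach misses a dependency whose artifactId sits more than three lines below the opening tag. A small awk filter that tracks whether it is inside a <dependency> element avoids that. This is a sketch: tez_deps is a hypothetical helper, and it assumes one XML element per line, as in typical hand-formatted poms.

```shell
# tez_deps [pom-file]
# Hypothetical helper: prints the unique tez-* artifactIds declared inside
# <dependency> elements, skipping the module's own and its parent's artifactId.
tez_deps() (
  pom="${1:-pom.xml}"
  awk '
    /<dependency>/   { in_dep = 1 }
    /<\/dependency>/ { in_dep = 0 }
    in_dep && /<artifactId>tez-/ {
      # Strip everything around the element text.
      gsub(/.*<artifactId>|<\/artifactId>.*/, "")
      print
    }
  ' "$pom" | sort -u
)

# Example (run from the repository root): tez_deps tez-dag/pom.xml
```

Because the match is on <dependency> exactly, entries under <dependencies> and <dependencyManagement> are handled the same way, and plugin artifactIds are ignored.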
Finding Classes Quickly
By Name
find . -name "VertexImpl.java"
find . -name "Fetcher.java"
find . -name "TestDAGImpl.java"
By Content
# Find every file that references TEZ_LOCAL_MODE (the declaration is in TezConfiguration)
grep -rl "TEZ_LOCAL_MODE" --include="*.java" .
# Find all non-test files that use StateMachineFactory (state machine declarations)
grep -rl "StateMachineFactory" --include="*.java" . | grep -v test
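Knowing which module a class lives in is often more useful than its full path. The sketch below wraps the by-name search and prints only the top-level directory. which_module is a hypothetical helper; for nested modules it reports the parent directory (for example tez-plugins) rather than the leaf module.

```shell
# which_module <ClassName> [repo_root]
# Hypothetical helper: prints the top-level directory (module) that contains
# ClassName.java. Runs in a subshell, so the cd does not affect the caller.
which_module() (
  class="$1"
  root="${2:-.}"
  cd "$root" || exit 1
  # find prints ./module/src/...; field 2 of the "/"-split path is the module.
  find . -name "$class.java" | cut -d/ -f2 | sort -u
)

# Example (run from the repository root): which_module VertexImpl
```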
In IntelliJ
- Navigate to class (⌘ O on macOS): type the class name; wildcards are supported
- Navigate to file (⌘ ⇧ O): type the file name
- Find usages (⌥ F7): shows every place a class or method is used
- Go to implementation (⌘ ⌥ B): jumps from an interface to its implementation
Navigation Checklist
After completing this lab, time yourself on each:
| Task | Target time |
|---|---|
| Find DAGImpl.java | < 10 seconds |
| Find TezConfiguration.TEZ_LOCAL_MODE declaration | < 20 seconds |
| Find all tests for VertexImpl | < 30 seconds |
| Identify which module handles shuffle fetch retry | < 60 seconds |
| Find the class that submits a DAG from client to AM | < 60 seconds |
If any take longer, repeat the exercises in this lab.
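For the "find all tests" task, one command-line approach is to search only under src/test for files named after the class. This sketch assumes the Test<ClassName>*.java naming seen in TestDAGImpl.java holds for the class in question; tests_for is a hypothetical helper, and tests that exercise the class under a different name will not be found.

```shell
# tests_for <ClassName> [repo_root]
# Hypothetical helper: lists test sources under any src/test tree whose file
# name starts with "Test" and contains ClassName.
tests_for() (
  class="$1"
  root="${2:-.}"
  find "$root" -path '*/src/test/*' -name "Test*$class*.java" | sort
)

# Example (run from the repository root): tests_for VertexImpl
```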
Expected Output
By the end of this lab you should have notes documenting:
- The line count of VertexImpl.java and DAGImpl.java
- The number of state machine transitions in VertexImpl
- The names of all 4 example DAG classes
- The Hadoop version Tez builds against
- Which module handles shuffle (your own words, not copy-pasted)
Stretch Goals
- Generate the full module dependency graph:
  mvn dependency:tree -pl tez-dag -am | grep "\\-\\-" | head -30
- Find all Protocol Buffer definition files (.proto):
  find . -name "*.proto" | sort
  For each, identify which module it belongs to and what messages it defines.
- Read tez-api/src/main/proto/DAGApiRecords.proto completely. Identify which messages correspond to Java classes you have already read.
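To get a quick index of what a .proto file defines before reading it in full, the message names can be extracted with sed. This is a sketch: proto_messages is a hypothetical helper, and it assumes each declaration starts its own line in the form "message Name {". Nested messages are listed alongside top-level ones.

```shell
# proto_messages <file.proto>
# Hypothetical helper: prints the name from every line that begins with
# optional whitespace followed by "message <Name>".
proto_messages() (
  sed -n 's/^[[:space:]]*message[[:space:]]\{1,\}\([A-Za-z_][A-Za-z0-9_]*\).*/\1/p' "$1"
)

# Example (run from the repository root):
#   proto_messages tez-api/src/main/proto/DAGApiRecords.proto
```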