Lab 2.1: Navigate the Repository Structure

Background

Before writing a single line of code, a new contributor must be able to navigate the repository with the same fluency as a committer. This lab builds that fluency: you walk through every module, learn the Maven multi-module structure, and practice until you can locate any class in under 30 seconds.


Repository Root Layout

apache/tez/
├── pom.xml                     # Root POM — module declarations, dep management
├── tez-api/                    # Public client API
├── tez-common/                 # Utilities shared across modules
├── tez-dag/                    # DAG AppMaster — the core of Tez
├── tez-examples/               # Example DAG implementations
├── tez-ext-service-tests/      # External service integration tests
├── tez-mapreduce/              # MapReduce compatibility layer
├── tez-plugins/                # Optional plugins (ATSv2, etc.)
├── tez-runtime-internals/      # Internal runtime interfaces
├── tez-runtime-library/        # I/O processors, shuffle
├── tez-tests/                  # Integration test suite
├── tez-tools/                  # Performance analysis utilities
├── src/
│   └── config/
│       ├── checkstyle.xml      # Style enforcement rules
│       └── checkstyle-suppressions.xml
└── CHANGES.txt                 # Release changelog
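
The layout above can be cross-checked against the root POM. The following is a minimal sketch, assuming each <module> element sits on its own line in pom.xml; the function names list_modules and check_modules are hypothetical helpers, not part of Tez. Modules declared inside build profiles are listed too.

```shell
#!/bin/sh
# list_modules POM: print each <module> entry declared in POM, one per line.
# Assumes every <module>...</module> element is on a single line.
list_modules() {
  sed -n 's:.*<module>\(.*\)</module>.*:\1:p' "$1"
}

# check_modules ROOT: report declared modules that have no directory under ROOT.
check_modules() {
  list_modules "$1/pom.xml" | while IFS= read -r m; do
    [ -d "$1/$m" ] || echo "missing: $m"
  done
}

# Usage, from the repository root:
#   list_modules pom.xml
#   check_modules .
```

An empty result from check_modules means every declared module has a matching directory.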

Module-by-Module Walkthrough

tez-api — The Public Contract

Everything in tez-api is part of the public API that application developers use. Changes here must be backward-compatible or explicitly versioned. This is the highest-stability module.

Key packages:

Package                          Contents
org.apache.tez.dag.api           DAG, Vertex, Edge, TezClient, TezConfiguration
org.apache.tez.dag.api.client    DAGClient, DAGStatus — monitoring and control
org.apache.tez.dag.api.event     Events emitted by the AM to task processors
org.apache.tez.dag.api.records   Protocol Buffer message classes (generated)
org.apache.tez.runtime.api       AbstractProcessor, Input, Output interfaces

Exercise:

# Count Java source files in tez-api (a rough proxy for the API surface)
find tez-api/src/main/java -name "*.java" | wc -l

# Find classes in tez-runtime-library that extend AbstractProcessor
grep -rl "extends AbstractProcessor" tez-runtime-library/src/
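
Counting files overstates the API surface, because package-private helpers are counted too. A closer approximation, sketched below, counts only files that declare a public top-level type. It is a text heuristic, not a parser: it assumes the declaration starts at column 0 with the word "public", and the function name count_public_types is hypothetical.

```shell
#!/bin/sh
# count_public_types DIR: count .java files under DIR containing a line that
# starts with "public" and declares a class, interface, enum, or annotation.
count_public_types() {
  grep -rlE --include='*.java' \
    '^public[[:space:]]+((abstract|final|static|strictfp)[[:space:]]+)*(class|interface|enum|@interface)[[:space:]]' \
    "$1" | wc -l | tr -d ' '
}

# Usage, from the repository root:
#   count_public_types tez-api/src/main/java
```

Comparing this number with the plain file count shows how much of the module is internal.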

tez-dag — The Application Master

This is the largest and most complex module. It implements the DAG AppMaster that runs in a YARN container and orchestrates vertex and task execution.

Key packages:

Package                          Contents
org.apache.tez.dag.app           DAGAppMaster — the main AM class
org.apache.tez.dag.app.dag       DAG, Vertex, Task, TaskAttempt state machine interfaces
org.apache.tez.dag.app.dag.impl  DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl
org.apache.tez.dag.app.rm        YARN resource management integration
org.apache.tez.dag.app.launcher  Container launch logic
org.apache.tez.dag.app.web       AM web UI servlets
org.apache.tez.dag.history       Timeline history event handling

Exercise:

# Count lines in DAGImpl (the most complex class)
wc -l tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java

# Count state machine transitions in VertexImpl (lines that call addTransition)
grep -c "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
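
The same count can be taken for every class at once, which shows where the state machines live and how they compare in size. This is a sketch using a hypothetical helper named transitions_per_class. Like the exercise above, it counts lines that mention addTransition, so a call split over several lines is counted once.

```shell
#!/bin/sh
# transitions_per_class DIR: print "<count> <file>" for every .java file under
# DIR with at least one line containing addTransition, largest count first.
transitions_per_class() {
  grep -rcH --include='*.java' 'addTransition' "$1" |
    awk '{
      n = $0; sub(/.*:/, "", n)        # text after the last colon is the count
      f = $0; sub(/:[0-9]+$/, "", f)   # everything before it is the file path
      if (n + 0 > 0) print n, f
    }' |
    sort -rn
}

# Usage, from the repository root:
#   transitions_per_class tez-dag/src/main/java
```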

tez-runtime-library — I/O and Shuffle

This module implements the data reading and writing done inside task containers, including shuffle.

Key packages:

Package                                        Contents
org.apache.tez.runtime.library.input           OrderedGroupedKVInput, UnorderedKVInput, etc.
org.apache.tez.runtime.library.output          OrderedPartitionedKVOutput, UnorderedKVOutput, etc.
org.apache.tez.runtime.library.common.shuffle  Shuffle fetch infrastructure
org.apache.tez.runtime.library.common.sort     External sort implementation
org.apache.tez.runtime.library.common.writers  Spilling KV writers

Exercise:

# Find all Input implementations
find tez-runtime-library/src/main/java -name "*Input*.java" | grep -v test

# Find the shuffle Fetcher and count its lines
find tez-runtime-library/src/main/java -name "Fetcher.java" -exec wc -l {} +
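
Line counts are a quick way to spot the heavyweight classes in any module, not only the Fetcher. A minimal sketch follows; largest_sources is a hypothetical helper, and it assumes file paths contain no newlines.

```shell
#!/bin/sh
# largest_sources DIR [N]: print the N (default 10) longest .java files under
# DIR as "<lines> <path>", longest first.
largest_sources() {
  find "$1" -name '*.java' -type f -exec wc -l {} + |
    # Drop the "total" summary lines wc prints and normalise the spacing.
    awk '$2 != "total" { n = $1; sub(/^[ \t]*[0-9]+[ \t]+/, ""); print n, $0 }' |
    sort -rn | head -n "${2:-10}"
}

# Usage, from the repository root:
#   largest_sources tez-runtime-library/src/main/java 5
```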

tez-common — Shared Utilities

Contains utilities used by multiple modules that do not fit in tez-api:

  • TezUtils — configuration serialization/deserialization
  • TezTaskID, TezVertexID, TezDAGID — ID types
  • ReflectionUtils — Tez-specific reflection helpers
  • VersionUtils — version compatibility checks

tez-mapreduce — MapReduce Compatibility

Allows MapReduce jobs to run on Tez without code changes. Contains MRInput, MROutput, and the mapper/reducer wrapping infrastructure.

tez-examples — Reference Implementations

Four example DAGs:

Class              What it demonstrates
OrderedWordCount   3-vertex pipeline, ordered shuffle, sort by value
IntersectExample   2-way join using broadcast edge
JoinDataGen        Data generation for the join example
FilterLinesByWord  Simple filter with configurable parallelism

tez-tests — Integration Test Suite

Contains tests that run against MiniTezCluster — a full in-process Tez + YARN + HDFS cluster. These tests are slow (minutes each) but provide end-to-end coverage.

Key test class: TestMiniTezSessionWithLocalMode — runs example DAGs in local mode.


Maven Structure Deep Dive

Root pom.xml

Read the root pom.xml to understand:

  1. Module declarations (<modules> section) — which modules are built; the reactor orders them by their inter-module dependencies
  2. Dependency management (<dependencyManagement>) — canonical versions for all deps
  3. Plugin management (<pluginManagement>) — canonical plugin configurations
  4. Build profiles — hadoop-2 vs hadoop-3, dist profile for assembly

Exercise:

# What Hadoop version does Tez build against by default?
grep -A2 "hadoop.version" pom.xml | head -5

# What Java version is required?
grep "maven.compiler" pom.xml

# Roughly how many artifacts does the root pom reference?
# (counts every <artifactId> line, including plugins and the project itself)
grep "<artifactId>" pom.xml | wc -l
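
The grep commands above print whole lines, including the XML tags. To extract just the value, a small helper can strip the tags. This is a sketch under the assumption that the property opens and closes on one line; pom_property is a hypothetical name, and only the first match is printed.

```shell
#!/bin/sh
# pom_property POM NAME: print the text between <NAME> and </NAME> in POM.
# Prints the first match only; dots in NAME match any character.
pom_property() {
  sed -n "s:.*<$2>\(.*\)</$2>.*:\1:p" "$1" | head -n 1
}

# Usage, from the repository root:
#   pom_property pom.xml hadoop.version
```

If Maven is available, `mvn help:evaluate -Dexpression=hadoop.version -q -DforceStdout` resolves the property through Maven itself, which also accounts for values overridden by an active profile.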

Module pom.xml Structure

Each module follows the same pattern:

<parent>
  <groupId>org.apache.tez</groupId>
  <artifactId>tez</artifactId>
  <version>0.10.x-SNAPSHOT</version>
</parent>

<artifactId>tez-dag</artifactId>
<name>Tez DAG</name>

<dependencies>
  <!-- Module-specific dependencies -->
</dependencies>

Modules declare their inter-dependencies explicitly. Maven's reactor reads these declarations to work out the build order, so a module is built only after the modules it depends on.

Exercise:

# What modules does tez-dag depend on?
grep -A3 "<dependency>" tez-dag/pom.xml | grep "tez-" | grep "artifactId"

# What does tez-runtime-library depend on?
grep -A3 "<dependency>" tez-runtime-library/pom.xml | grep "tez-" | grep "artifactId"
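
Running the same query over every module gives a rough picture of the whole inter-module graph. The sketch below assumes module POMs sit one directory below the repository root and that each artifactId is on its own line; tez_deps and module_edges are hypothetical helper names. Test-scoped dependencies and any tez-* entries inside exclusions are included.

```shell
#!/bin/sh
# tez_deps POM: print the tez-* artifactIds found inside <dependency> blocks.
tez_deps() {
  awk '
    /<dependency>/   { in_dep = 1 }
    /<\/dependency>/ { in_dep = 0 }
    in_dep && /<artifactId>tez-/ {
      sub(/.*<artifactId>/, ""); sub(/<\/artifactId>.*/, ""); print
    }
  ' "$1" | sort -u
}

# module_edges ROOT: print "module -> dependency" for each module directory
# directly under ROOT that has a pom.xml.
module_edges() {
  for pom in "$1"/*/pom.xml; do
    [ -f "$pom" ] || continue
    mod=$(basename "$(dirname "$pom")")
    tez_deps "$pom" | while IFS= read -r dep; do
      echo "$mod -> $dep"
    done
  done
}

# Usage, from the repository root:
#   module_edges .
```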

Finding Classes Quickly

By Name

find . -name "VertexImpl.java"
find . -name "Fetcher.java"
find . -name "TestDAGImpl.java"
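
After a build, plain find also returns copies under target/ directories. The sketch below wraps the lookups so that build output is skipped and tests can be found by class name. The names find_class and find_tests are hypothetical, and find_tests assumes the Test<ClassName> naming convention seen in TestDAGImpl.java.

```shell
#!/bin/sh
# find_class NAME [ROOT]: print source files named NAME.java, skipping target/.
find_class() {
  find "${2:-.}" -name "$1.java" -type f ! -path '*/target/*'
}

# find_tests NAME [ROOT]: print test sources under src/test whose file name
# starts with "Test" followed by NAME.
find_tests() {
  find "${2:-.}" -path '*/src/test/*' -name "Test$1*.java" -type f
}

# Usage, from the repository root:
#   find_class VertexImpl
#   find_tests VertexImpl
```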

By Content

# Find files that mention TEZ_LOCAL_MODE (the defining class is among them)
grep -rl "TEZ_LOCAL_MODE" --include="*.java" .

# Find all non-test classes that declare a state machine
grep -rl "StateMachineFactory" --include="*.java" . | grep -v test
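
A content search returns every file that uses a constant, not only the one that declares it. To narrow the result to the declaration, the pattern can require an assignment after the name. This is a sketch with a hypothetical helper, find_constant_decl; it assumes the name and the "=" are on the same line, which may not hold for wrapped declarations.

```shell
#!/bin/sh
# find_constant_decl NAME [DIR]: print "file:line:text" for lines where NAME is
# preceded by whitespace and followed by a single "=" (an assignment, not "==").
find_constant_decl() {
  grep -rnE --include='*.java' "[[:space:]]$1[[:space:]]*=([^=]|\$)" "${2:-.}"
}

# Usage, from the repository root:
#   find_constant_decl TEZ_LOCAL_MODE
```

Because the name must be followed directly by "=", longer names that merely start with the same prefix are not matched.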

In IntelliJ

  • Navigate to class: ⌘ O (macOS) — type class name, supports wildcards
  • Navigate to file: ⌘ ⇧ O — type file name
  • Find usages: ⌥ F7 — shows all places a class/method is used
  • Go to implementation: ⌘ ⌥ B — jumps from interface to implementation

After completing this lab, time yourself on each:

Task                                                 Target time
Find DAGImpl.java                                    < 10 seconds
Find TezConfiguration.TEZ_LOCAL_MODE declaration     < 20 seconds
Find all tests for VertexImpl                        < 30 seconds
Identify which module handles shuffle fetch retry    < 60 seconds
Find the class that submits a DAG from client to AM  < 60 seconds

If any take longer, repeat the exercises in this lab.
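
A shell stopwatch is enough for the timing. The sketch below defines two hypothetical functions, task_start and task_done. Define them in your interactive shell (paste them or source a file), because the start time is kept in a shell variable. Resolution is whole seconds.

```shell
#!/bin/sh
# task_start: record the current time as the start of a timed task.
task_start() {
  TASK_T0=$(date +%s)
}

# task_done TARGET_SECONDS: print the time elapsed since task_start and whether
# it is below the target.
task_done() {
  elapsed=$(( $(date +%s) - TASK_T0 ))
  if [ "$elapsed" -lt "$1" ]; then
    echo "PASS: ${elapsed}s (target < ${1}s)"
  else
    echo "FAIL: ${elapsed}s (target < ${1}s)"
  fi
}

# Usage:
#   task_start        # then locate DAGImpl.java by any means
#   task_done 10      # run as soon as you have found it
```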


Expected Output

By the end of this lab you should have notes documenting:

  1. The line count of VertexImpl.java and DAGImpl.java
  2. The number of state machine transitions in VertexImpl
  3. The names of all 4 example DAG classes
  4. The Hadoop version Tez builds against
  5. Which module handles shuffle (your own words, not copy-pasted)

Stretch Goals

  1. Generate the dependency tree for tez-dag and the modules it builds on, keeping only the Tez artifacts:

    mvn dependency:tree -pl tez-dag -am | grep "org.apache.tez" | head -30
    
  2. Find all Protocol Buffer definition files (.proto):

    find . -name "*.proto" | sort
    

    For each, identify which module it belongs to and what messages it defines.

  3. Read tez-api/src/main/proto/DAGApiRecords.proto completely. Identify which messages correspond to Java classes you have already read.
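
For the second stretch goal, the message names can be listed per file before reading each definition in full. This is a sketch with a hypothetical helper, proto_messages; it assumes each message declaration starts its own line, and it lists nested messages alongside top-level ones. The module is the first directory component of each printed path.

```shell
#!/bin/sh
# proto_messages DIR: for each .proto file under DIR, print "<file>: <message>"
# for every line of the form "message Name ...".
proto_messages() {
  find "$1" -name '*.proto' -type f | sort | while IFS= read -r f; do
    sed -n 's/^[[:space:]]*message[[:space:]][[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\).*/\1/p' "$f" |
      while IFS= read -r m; do
        echo "$f: $m"
      done
  done
}

# Usage, from the repository root:
#   proto_messages .
```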