Reading a 200k+ LOC Apache Codebase

Apache Tez is roughly 200,000 lines of Java across 15+ Maven modules. No single human holds it all in their head — not even the most senior committers. The skill is not memory; it is navigation. This chapter gives you the strategies committers actually use.

Module Map First

Before reading any code, learn the module shape. Run this once and pin the output:

cd ~/tez-src
find . -maxdepth 2 -name pom.xml | sort
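
To turn that listing into the skeleton of a module map, the paths can be stripped down to bare module names. The following is a minimal sketch: `list_modules` is a hypothetical helper, not part of Tez, and unlike the command above it skips the root pom.xml so that only module directories are printed.

```shell
# list_modules ROOT: print one module directory name per line, based on
# the pom.xml files found directly inside ROOT's subdirectories.
# (Hypothetical helper for note-taking; not part of Tez.)
list_modules() {
  local root="${1:-.}"
  find "$root" -mindepth 2 -maxdepth 2 -name pom.xml \
    | sed -e 's|/pom\.xml$||' -e 's|.*/||' \
    | sort
}

# Example (assumes the notes directory already exists):
# list_modules ~/tez-src | sed 's/$/: /' > ~/tez-notes/module-map.md
```

Each output line then becomes a slot for a one-sentence description of that module.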

The modules that matter for ~90% of work:

| Module | What lives there | When you read it |
|---|---|---|
| tez-api | Public API: TezClient, DAG, Vertex, Edge, *Descriptor | Always start here |
| tez-common | Shared utilities, TezConfiguration, counters | Tracing configs |
| tez-runtime-internals | Task runtime, LogicalIOProcessorRuntimeTask | Following a task |
| tez-runtime-library | OrderedPartitionedKVOutput, shuffle inputs | I/O contracts |
| tez-dag | DAGAppMaster, schedulers, state machines | AM-side bugs |
| tez-mapreduce | MR compat: MRInput, MROutput | MR-on-Tez |
| tez-tests | MiniTezCluster, TestOrderedWordCount | Integration tests |
| tez-tools | Checkstyle config, swimlanes, analyzer | Process tooling |

Tez follows the Hadoop convention: code lives in <module>/src/main/java, tests in <module>/src/test/java. Protobufs live in <module>/src/main/proto.
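
A quick way to see how a given module uses those three locations is to count the files in each. This is a sketch only; `module_layout` is a hypothetical helper name, and a count of 0 simply means the directory is absent or empty in that module.

```shell
# module_layout MODULE_DIR: print "<subdir> <file count>" for the three
# conventional source locations. (Hypothetical helper; not part of Tez.)
module_layout() {
  local mod="$1" sub n
  for sub in src/main/java src/test/java src/main/proto; do
    if [ -d "$mod/$sub" ]; then
      # tr strips the padding some wc implementations add
      n=$(find "$mod/$sub" -type f | wc -l | tr -d ' ')
    else
      n=0
    fi
    printf '%s %s\n' "$sub" "$n"
  done
}

# Example:
# module_layout ~/tez-src/tez-api
```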

Strategy 1: Start From the Public API, Trace Inward

Every Tez user program goes through tez-api. That makes it the only mandatory entry point. The reading order:

tez-api (what users see)
   ↓
tez-dag (what the AM does with it)
   ↓
tez-runtime-internals (what tasks do)
   ↓
tez-runtime-library (the I/Os tasks use)

Trace example — "where does parallelism come from?":

cd ~/tez-src
grep -rn "setParallelism" tez-api/src/main/java | head
grep -rn "setParallelism\|reconfigureVertex" tez-dag/src/main/java | head

You will find Vertex.setParallelism(int) in tez-api and follow it to VertexImpl.setParallelism in tez-dag. That arc — API → impl — is the canonical pattern for reading Tez.
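
The same arc can be checked for any symbol by counting its mentions module by module, in the reading order given above. A minimal sketch, assuming the four module directory names from the diagram; `trace_symbol` is a hypothetical helper, and it matches the symbol as a fixed string rather than a regex.

```shell
# trace_symbol ROOT SYMBOL: for each module in the reading order, print
# how many main-source lines mention SYMBOL (fixed-string match).
# (Hypothetical helper; not part of Tez.)
trace_symbol() {
  local root="$1" symbol="$2" mod dir n
  for mod in tez-api tez-dag tez-runtime-internals tez-runtime-library; do
    dir="$root/$mod/src/main/java"
    n=0
    if [ -d "$dir" ]; then
      n=$(grep -rF -- "$symbol" "$dir" | wc -l | tr -d ' ')
    fi
    printf '%s %s\n' "$mod" "$n"
  done
}

# Example:
# trace_symbol ~/tez-src setParallelism
```

A symbol that shows up in tez-api and tez-dag but nowhere in the runtime modules is a hint that it is an AM-side concern.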

Strategy 2: Protobufs Are the Source of Truth for Anything Serialized

Anything that crosses a process boundary (client → AM, AM → container, AM → history) is defined in protobuf. The protos are the contract; the Java is the implementation.

find ~/tez-src -name "*.proto" | sort

The four protos to internalise:

| Proto | Role |
|---|---|
| tez-api/src/main/proto/DAGApiRecords.proto | DAGPlan, VertexPlan, EdgePlan (the DAG on the wire) |
| tez-api/src/main/proto/Events.proto | The event types that flow on the dispatcher |
| tez-common/src/main/proto/TezCommonProtos.proto | Counters, plugin descriptors |
| tez-dag/src/main/proto/DAGProtos.proto | AM-internal records |

When you see a class named *Proto, or one nested inside a *Protos outer class (e.g. DAGProtos.DAGPlan), the generated code lives under the module's target/generated-sources/ after a build. Don't read the generated code; read the .proto.

Practical rule: if you are changing a field that appears in a proto, you are changing wire compatibility. See Compatibility.
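
To see at a glance what each proto defines, the message and enum declarations can be pulled out with sed. This is a rough sketch: `proto_messages` is a hypothetical helper, and it relies on the assumption that each declaration starts its own line, which is the usual .proto layout but is not enforced by the language.

```shell
# proto_messages FILE...: print "<file>: <Name>" for every message or
# enum declaration that starts a line (nested ones included).
# (Hypothetical helper; not part of Tez.)
proto_messages() {
  local f name
  for f in "$@"; do
    sed -nE 's/^[[:space:]]*(message|enum)[[:space:]]+([A-Za-z_][A-Za-z0-9_]*).*/\2/p' "$f" \
      | while IFS= read -r name; do
          printf '%s: %s\n' "$(basename "$f")" "$name"
        done
  done
}

# Example:
# proto_messages $(find ~/tez-src -name "*.proto") > ~/tez-notes/protos.txt
```

The resulting file is plain text, so it can be searched with grep whenever a *Proto class name turns up in the Java code.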

Strategy 3: IDE Call Hierarchy + git log -S

Two tools, used together, replace 80% of speculative reading.

Call hierarchy (IntelliJ: Ctrl-Alt-H, Eclipse: Ctrl-Alt-H) answers "who calls this?". Use it on entry points like TezClient.submitDAG to find every call site in tests and examples.

git log -S answers "when and why did this code appear?".

cd ~/tez-src
git log -S "reconfigureVertex" --oneline -- tez-dag/
git log -S "reconfigureVertex" --oneline -- tez-api/

Pick the oldest commit listed (the last line of the output, since git log prints newest first) and read its JIRA:

git show <sha> | head -30
# Look for "TEZ-NNNN" in the commit message

That JIRA is the design discussion. It is more valuable than the code.
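
Pulling the JIRA ids out of the commit subjects can be scripted. A small sketch, assuming commit messages mention the issue in the TEZ-NNNN form noted above; `jira_ids` is a hypothetical helper that reads any text on stdin.

```shell
# jira_ids: read text on stdin and print each distinct TEZ-NNNN id once,
# in order of first appearance. (Hypothetical helper; not part of Tez.)
jira_ids() {
  grep -oE 'TEZ-[0-9]+' | awk '!seen[$0]++'
}

# Example: the JIRA of the oldest commit that added or removed the string
# (--reverse lists the oldest commit first).
# git log -S "reconfigureVertex" --reverse --format='%s' -- tez-dag/ | jira_ids | head -1
```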

Strategy 4: Tests Are Executable Spec

The Tez test suite is the cheapest way to learn what a class does. For any class Foo.java, look for TestFoo.java:

find ~/tez-src -name "TestVertexImpl.java"
find ~/tez-src -name "TestDAGImpl.java"
find ~/tez-src -name "TestShuffleVertexManager.java"

The test names alone form a behavior spec:

grep "  public void test" $(find ~/tez-src -name TestVertexImpl.java)
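
If only the method names are wanted, without the surrounding signature, sed can extract them. This is a sketch under the assumption that test methods are declared as `public void test...` on a single line; `test_names` is a hypothetical helper, and methods that follow a different naming style will be missed.

```shell
# test_names FILE: print the name of each method declared as
# "public void test...(". (Hypothetical helper; not part of Tez.)
test_names() {
  sed -nE 's/^[[:space:]]*public void (test[A-Za-z0-9_]*)[[:space:]]*\(.*/\1/p' "$1"
}

# Example:
# test_names "$(find ~/tez-src -name TestVertexImpl.java)" | sort
```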

For runtime behavior, the integration tests in tez-tests/ are the gold standard:

ls ~/tez-src/tez-tests/src/test/java/org/apache/tez/test/

TestTezJobs.java and TestExceptionPropagation.java walk full DAGs end-to-end on a MiniTezCluster. Read them before guessing how a feature behaves at runtime.

Strategy 5: Keep a Reading Log

Committers have working memory of the codebase because they wrote a lot of it. You don't. Compensate with notes. Keep one file:

mkdir -p ~/tez-notes
cat > ~/tez-notes/reading-log.md <<'EOF'
# Tez Reading Log

## YYYY-MM-DD — DAG submission path
- TezClient.submitDAG(DAG) in tez-api builds DAGPlan
- → DAGClientAMProtocolBlockingPB.submitDAG (RPC)
- → DAGAppMaster.submitDAGToAppMaster
- → DAGAppMaster.startDAG → AsyncDispatcher.getEventHandler().handle(DAGEventType.DAG_INIT)

## YYYY-MM-DD — Vertex parallelism reconfiguration
- VertexManagerPlugin.context.reconfigureVertex(...)
...
EOF

Re-reading three months later, the log is gold. Without it, you re-trace the same path.

Worked Exercise: TezClient.submitDAG → AsyncDispatcher

Goal: in 90 minutes, trace the path from a user calling tezClient.submitDAG(dag) to the event landing on the DAGAppMaster async dispatcher.

Step 1 (15 min) — Find the entry

cd ~/tez-src
find tez-api/src/main/java -name "TezClient.java"
grep -n "public DAGClient submitDAG" $(find tez-api/src/main/java -name TezClient.java)

You will find an overload that takes DAG dag. Read its body. Note that it does two things: builds a DAGPlan from the DAG, then sends it via an RPC stub.

Step 2 (20 min) — Identify the RPC

grep -rn "submitDAG" tez-api/src/main/proto/

Find DAGClientAMProtocol.proto. The SubmitDAGRequestProto carries the DAGPlan. The generated stub is DAGClientAMProtocolBlockingPB. The server side implements it in tez-dag.

grep -rn "implements DAGClientAMProtocolBlockingPB\|extends DAGClientAMProtocolBlockingPB" tez-dag/src/main/java

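The grep should surface the server-side class that implements the stub. Follow its submitDAG method and you will land in DAGClientHandler (in tez-dag/.../dag/app/).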
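
To list every RPC a service proto declares, together with its request and response types, the rpc lines can be parsed directly. A sketch only: `rpc_methods` is a hypothetical helper, and it assumes each rpc declaration fits on one line in the usual `rpc name (Request) returns (Response);` form.

```shell
# rpc_methods FILE: print "<method> <request type> <response type>" for
# each single-line rpc declaration. (Hypothetical helper; not part of Tez.)
rpc_methods() {
  sed -nE 's/^[[:space:]]*rpc[[:space:]]+([A-Za-z0-9_]+)[[:space:]]*\([[:space:]]*([A-Za-z0-9_.]+)[[:space:]]*\)[[:space:]]*returns[[:space:]]*\([[:space:]]*([A-Za-z0-9_.]+)[[:space:]]*\).*/\1 \2 \3/p' "$1"
}

# Example:
# rpc_methods "$(find ~/tez-src/tez-api -name DAGClientAMProtocol.proto)"
```

The output doubles as a checklist of the methods the server-side class has to implement.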

Step 3 (20 min) — Server-side handling

grep -n "submitDAG" $(find tez-dag/src/main/java -name "DAGClientHandler.java")

Follow submitDAG → DAGAppMaster.submitDAGToAppMaster → DAGAppMaster.startDAG. Inside startDAG, you will see a DAG dag = createDAG(dagPlan) and then an event dispatched through dispatcher.getEventHandler().handle(...).

Step 4 (20 min) — The dispatcher

find tez-dag/src/main/java -name "DAGAppMaster.java"
grep -n "AsyncDispatcher\|dispatcher" $(find tez-dag/src/main/java -name DAGAppMaster.java) | head

Find where dispatcher is instantiated and where event handlers are registered. The handler registered for DAGEventType routes events into DAGImpl's state machine.
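
Handler registrations can be listed in one pass. This sketch assumes they are written as `dispatcher.register(SomeEventType.class, handler)` on a single line, which is the form AsyncDispatcher's register method invites; `registrations` is a hypothetical helper, and calls split across lines or made through a differently named field will not be picked up.

```shell
# registrations FILE: print "<line>: <EventType>" for each single-line
# call of the form dispatcher.register(EventType.class, ...).
# (Hypothetical helper; not part of Tez.)
registrations() {
  grep -n 'dispatcher\.register(' "$1" \
    | sed -nE 's/^([0-9]+):.*register\([[:space:]]*([A-Za-z0-9_.]+)\.class.*/\1: \2/p'
}

# Example:
# registrations "$(find ~/tez-src/tez-dag/src/main/java -name DAGAppMaster.java)"
```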

Step 5 (15 min) — Record it

Open your reading log and write the four-line summary. Cite the file and line for each hop.
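
The file-and-line citations can be generated rather than typed. A minimal sketch: `cite` is a hypothetical helper that matches a fixed string, and it assumes file paths contain no colon characters.

```shell
# cite ROOT PATTERN: print "<path>:<line>" for every line under ROOT
# that contains the fixed string PATTERN.
# (Hypothetical helper; not part of Tez.)
cite() {
  grep -rnF -- "$2" "$1" | cut -d: -f1,2
}

# Example:
# cite ~/tez-src/tez-dag/src/main/java "submitDAGToAppMaster(" >> ~/tez-notes/reading-log.md
```

Line numbers drift as the code changes, so it helps to record the commit you were reading (for example the output of git rev-parse --short HEAD) next to the citations.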

Validation Artifacts

After this chapter you should produce and keep:

  1. A ~/tez-notes/module-map.md with one sentence per module.
  2. A ~/tez-notes/reading-log.md with the submitDAG trace from the exercise above.
  3. A grep-able list of the four protos and what each one defines.
  4. One git log -S command and the JIRA it surfaced, saved to the log.
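
A small self-check can confirm that the notes are in place. This is a sketch, not a grading tool: `check_artifacts` is a hypothetical helper, it covers items 1, 2 and 4 only (the chapter does not fix a location for the proto list), and it merely looks for the string submitDAG and a TEZ-NNNN id in the reading log.

```shell
# check_artifacts NOTES_DIR: print one status line per expected artifact.
# Returns 0 only if all checks pass. (Hypothetical helper; not part of Tez.)
check_artifacts() {
  local dir="$1" f status=0
  for f in module-map.md reading-log.md; do
    if [ -s "$dir/$f" ]; then
      echo "ok: $f"
    else
      echo "missing: $f"
      status=1
    fi
  done
  # The log should hold the submitDAG trace and at least one JIRA id.
  if grep -q 'submitDAG' "$dir/reading-log.md" 2>/dev/null; then
    echo "ok: submitDAG trace"
  else
    echo "missing: submitDAG trace"
    status=1
  fi
  if grep -qE 'TEZ-[0-9]+' "$dir/reading-log.md" 2>/dev/null; then
    echo "ok: JIRA id"
  else
    echo "missing: JIRA id"
    status=1
  fi
  return "$status"
}

# Example:
# check_artifacts ~/tez-notes
```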

When you can do the exercise without checking this page, you have the navigation skill. The next chapter — Design via JIRA — tells you where the design decisions behind that code actually lived.