Reading a 200k+ LOC Apache Codebase
Apache Tez is roughly 200,000 lines of Java across 15+ Maven modules. No single human holds it all in their head — not even the most senior committers. The skill is not memory; it is navigation. This chapter gives you the strategies committers actually use.
Module Map First
Before reading any code, learn the module shape. Run this once and pin the output:
cd ~/tez-src
find . -maxdepth 2 -name pom.xml | sort
The modules that matter for ~90% of work:
| Module | What lives there | When you read it |
|---|---|---|
| tez-api | Public API: TezClient, DAG, Vertex, Edge, *Descriptor | Always start here |
| tez-common | Shared utilities, TezConfiguration, counters | Tracing configs |
| tez-runtime-internals | Task runtime, LogicalIOProcessorRuntimeTask | Following a task |
| tez-runtime-library | OrderedPartitionedKVOutput, shuffle inputs | I/O contracts |
| tez-dag | DAGAppMaster, schedulers, state machines | AM-side bugs |
| tez-mapreduce | MR compat: MRInput, MROutput | MR-on-Tez |
| tez-tests | MiniTezCluster, TestOrderedWordCount | Integration tests |
| tez-tools | Checkstyle config, swimlanes, analyzer | Process tooling |
Tez follows the Hadoop convention: code lives in <module>/src/main/java, tests in
<module>/src/test/java. Protobufs live in <module>/src/main/proto.
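The directory convention above can be checked mechanically. The following is a minimal sketch, assuming the checkout lives at ~/tez-src; `TEZ_SRC` and `module_layout` are illustrative names, not part of Tez.

```shell
# Sketch: report which conventional source directories each Maven module has.
# TEZ_SRC and module_layout are hypothetical names used only for illustration.
TEZ_SRC="${TEZ_SRC:-$HOME/tez-src}"

module_layout() {
  local root="${1:-$TEZ_SRC}" pom module dir
  # One pom.xml per top-level module; -mindepth 2 skips the root pom.xml.
  find "$root" -mindepth 2 -maxdepth 2 -name pom.xml | sort |
  while IFS= read -r pom; do
    module="$(dirname "$pom")"
    for dir in src/main/java src/test/java src/main/proto; do
      # Print only the directories that actually exist in this module.
      if [ -d "$module/$dir" ]; then
        printf '%s\t%s\n' "$(basename "$module")" "$dir"
      fi
    done
  done
}
```

Running `module_layout` with no argument scans the default checkout and prints one tab-separated line per module and directory; a module with no src/main/proto line has no protobufs of its own.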
Strategy 1: Start From the Public API, Trace Inward
Every Tez user program goes through tez-api. That makes it the only mandatory entry
point. The reading order:
tez-api (what users see)
↓
tez-dag (what the AM does with it)
↓
tez-runtime-internals (what tasks do)
↓
tez-runtime-library (the I/Os tasks use)
Trace example — "where does parallelism come from?":
cd ~/tez-src
grep -rn "setParallelism" tez-api/src/main/java | head
grep -rn "setParallelism\|reconfigureVertex" tez-dag/src/main/java | head
You will find Vertex.setParallelism(int) in tez-api and follow it to
VertexImpl.setParallelism in tez-dag. That arc — API → impl — is the canonical pattern
for reading Tez.
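The same API-to-implementation walk can be wrapped in a small helper. This is a sketch under the assumption that the four modules sit directly under the checkout root; `trace_symbol` is a hypothetical name.

```shell
# Sketch: grep one symbol through the modules in the reading order above.
# trace_symbol is a hypothetical helper, not something shipped with Tez.
trace_symbol() {
  local symbol="$1" root="${2:-$HOME/tez-src}" module
  for module in tez-api tez-dag tez-runtime-internals tez-runtime-library; do
    echo "== $module =="
    # -F: literal match, -r: recurse, -n: show line numbers.
    # Modules missing from the checkout are skipped silently.
    grep -rnF -- "$symbol" "$root/$module/src/main/java" 2>/dev/null | head -n 5
  done
  return 0
}
```

For example, `trace_symbol setParallelism` prints up to five hits per module, so an empty section tells you the symbol never reaches that layer.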
Strategy 2: Protobufs Are the Source of Truth for Anything Serialized
Anything that crosses a process boundary (client → AM, AM → container, AM → history) is defined in protobuf. The protos are the contract; the Java is the implementation.
find ~/tez-src -name "*.proto" | sort
The four protos to internalise:
| Proto | Role |
|---|---|
| tez-api/src/main/proto/DAGApiRecords.proto | DAGPlan, VertexPlan, EdgePlan — the DAG on the wire |
| tez-api/src/main/proto/Events.proto | The event types that flow on the dispatcher |
| tez-common/src/main/proto/TezCommonProtos.proto | Counters, plugin descriptors |
| tez-dag/src/main/proto/DAGProtos.proto | AM-internal records |
When you see a class named *Proto (e.g. DAGProtos.DAGPlan) the generated code lives in
target/generated-sources/ after a build. Don't read the generated code; read the .proto.
Practical rule: if you are changing a field that appears in a proto, you are changing wire compatibility. See Compatibility.
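Protobuf identifies a field on the wire by its number, not its name, so the numbers are what you need to see before touching a message. The helper below is a rough sketch that lists message headers and every line assigning a number; `proto_tags` is a hypothetical name and the regex is a heuristic, not a proto parser.

```shell
# Sketch: list message declarations and numbered fields in a .proto file.
# proto_tags is a hypothetical helper; the pattern is a heuristic only.
proto_tags() {
  # Match "message Foo" headers, or any line that assigns a numeric tag ("= 3").
  grep -nE '^[[:space:]]*message[[:space:]]|=[[:space:]]*[0-9]+' "$1"
}
```

Run it on a proto before and after your change and compare the two outputs: a renamed field with an unchanged number is usually harmless on the wire, while a changed or reused number is not.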
Strategy 3: IDE Call Hierarchy + git log -S
Two tools, used together, replace 80% of speculative reading.
Call hierarchy (IntelliJ: Ctrl-Alt-H, Eclipse: Ctrl-Alt-H) answers "who calls
this?". Use it on entry points like TezClient.submitDAG to find every call site in
tests and examples.
git log -S answers "when and why did this code appear?".
cd ~/tez-src
git log -S "reconfigureVertex" --oneline -- tez-dag/
git log -S "reconfigureVertex" --oneline -- tez-api/
Pick the oldest commit listed (the last line, because git log prints newest first) and read its JIRA:
git show <sha> | head -30
# Look for "TEZ-NNNN" in the commit message
That JIRA is the design discussion. It is more valuable than the code.
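Pulling the JIRA IDs out of the pickaxe results can be scripted. The following is a minimal sketch, assuming commit subjects contain the issue key in the form TEZ-NNNN; `jira_ids_for` is a hypothetical name.

```shell
# Sketch: list the JIRA keys of commits that added or removed a string.
# jira_ids_for is a hypothetical helper; run it from inside the git checkout.
jira_ids_for() {
  local needle="$1"
  shift
  # -S<string>: commits that change the number of occurrences of the string.
  # --reverse: oldest first. --format=%s: subject line only.
  git log -S"$needle" --reverse --format='%s' -- "$@" |
    grep -oE 'TEZ-[0-9]+' |
    awk '!seen[$0]++'   # drop repeats, keep first-seen order
}
```

For example, `jira_ids_for reconfigureVertex tez-dag/ tez-api/` prints the keys oldest first, so the first line is the issue to read.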
Strategy 4: Tests Are Executable Spec
The Tez test suite is the cheapest way to learn what a class does. For any class
Foo.java, look for TestFoo.java:
find ~/tez-src -name "TestVertexImpl.java"
find ~/tez-src -name "TestDAGImpl.java"
find ~/tez-src -name "TestShuffleVertexManager.java"
The test names alone form a behavior spec:
grep " public void test" $(find ~/tez-src -name TestVertexImpl.java)
For runtime behavior, the integration tests in tez-tests/ are the gold standard:
ls ~/tez-src/tez-tests/src/test/java/org/apache/tez/test/
TestTezJobs.java and TestExceptionPropagation.java walk full DAGs end-to-end on a
MiniTezCluster. Read them before guessing how a feature behaves at runtime.
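The Foo.java to TestFoo.java lookup and the test-name listing can be combined into one step. This is a sketch that assumes test methods are declared as `public void testXxx(...)`; `test_names` is a hypothetical name.

```shell
# Sketch: print the test method names for a given class.
# test_names is a hypothetical helper; it assumes the TestFoo.java naming rule.
test_names() {
  local class="$1" root="${2:-$HOME/tez-src}" file
  find "$root" -name "Test${class}.java" |
  while IFS= read -r file; do
    # Keep only the method name from "public void testSomething(...)" lines.
    sed -nE 's/^[[:space:]]*public[[:space:]]+void[[:space:]]+(test[A-Za-z0-9_]*)[[:space:]]*\(.*/\1/p' "$file"
  done
}
```

`test_names VertexImpl` then reads as a bare list of behaviors, which is quicker to skim than raw grep output.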
Strategy 5: Keep a Reading Log
Committers have working memory of the codebase because they wrote a lot of it. You don't. Compensate with notes. Keep one file:
mkdir -p ~/tez-notes
cat > ~/tez-notes/reading-log.md <<'EOF'
# Tez Reading Log
## YYYY-MM-DD — DAG submission path
- TezClient.submitDAG(DAG) in tez-api builds DAGPlan
- → DAGClientAMProtocolBlockingPB.submitDAG (RPC)
- → DAGAppMaster.submitDAGToAppMaster
- → DAGAppMaster.startDAG → AsyncDispatcher.getEventHandler().handle(DAGEventType.DAG_INIT)
## YYYY-MM-DD — Vertex parallelism reconfiguration
- VertexManagerPlugin.context.reconfigureVertex(...)
...
EOF
Re-reading three months later, the log is gold. Without it, you re-trace the same path.
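Appending to the log is easier to keep up if it is one command. The following is a minimal sketch; `log_entry` and the `TEZ_LOG` override are hypothetical names, and the heading uses a plain hyphen as the separator.

```shell
# Sketch: append a dated entry with one bullet per hop to the reading log.
# log_entry and TEZ_LOG are hypothetical names chosen for this example.
log_entry() {
  local title="$1"
  shift
  local log="${TEZ_LOG:-$HOME/tez-notes/reading-log.md}" hop
  mkdir -p "$(dirname "$log")"
  {
    # Heading: today's date (YYYY-MM-DD) followed by the topic.
    printf '\n## %s - %s\n' "$(date +%F)" "$title"
    for hop in "$@"; do
      printf '%s\n' "- $hop"
    done
  } >> "$log"
}
```

Usage: `log_entry "DAG submission path" "TezClient.submitDAG(DAG) builds DAGPlan" "DAGAppMaster.startDAG dispatches the init event"`.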
Worked Exercise: TezClient.submitDAG → AsyncDispatcher
Goal: in 90 minutes, trace the path from a user calling tezClient.submitDAG(dag) to the
event landing on the DAGAppMaster async dispatcher.
Step 1 (15 min) — Find the entry
cd ~/tez-src
find tez-api/src/main/java -name "TezClient.java"
grep -n "public DAGClient submitDAG" $(find tez-api/src/main/java -name TezClient.java)
You will find an overload that takes DAG dag. Read its body. Note that it does two
things: builds a DAGPlan from the DAG, then sends it via an RPC stub.
Step 2 (20 min) — Identify the RPC
grep -rn "submitDAG" tez-api/src/main/proto/
Find DAGClientAMProtocol.proto. The SubmitDAGRequestProto carries the DAGPlan. The
generated stub is DAGClientAMProtocolBlockingPB. The server side implements it in
tez-dag.
grep -rn "implements DAGClientAMProtocolBlockingPB\|extends DAGClientAMProtocolBlockingPB" tez-dag/src/main/java
You will land in DAGClientHandler (in tez-dag/.../dag/app/).
Step 3 (20 min) — Server-side handling
grep -n "submitDAG" $(find tez-dag/src/main/java -name "DAGClientHandler.java")
Follow submitDAG → DAGAppMaster.submitDAGToAppMaster → DAGAppMaster.startDAG. Inside
startDAG, you will see a DAG dag = createDAG(dagPlan) and then an event dispatched
through dispatcher.getEventHandler().handle(...).
Step 4 (20 min) — The dispatcher
find tez-dag/src/main/java -name "DAGAppMaster.java"
grep -n "AsyncDispatcher\|dispatcher" $(find tez-dag/src/main/java -name DAGAppMaster.java) | head
Find where dispatcher is instantiated and where event handlers are registered. The
handler for DAGEventType is the DAGImpl's state machine.
Step 5 (15 min) — Record it
Open your reading log and write the four-line summary. Cite the file and line for each hop.
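To cite the file and line for a hop without copying them by hand, a small helper is enough. This is a sketch; `cite` is a hypothetical name and it reports only the first match in the file.

```shell
# Sketch: print "path:line" for the first occurrence of a literal string.
# cite is a hypothetical helper; it prints nothing if there is no match.
cite() {
  local pattern="$1" file="$2" line
  # grep -n prefixes each match with its line number; keep the first one.
  line="$(grep -nF -- "$pattern" "$file" | head -n 1 | cut -d: -f1)"
  if [ -n "$line" ]; then
    printf '%s:%s\n' "$file" "$line"
  fi
}
```

For example, `cite "submitDAGToAppMaster" "$(find tez-dag/src/main/java -name DAGAppMaster.java)"` gives a reference you can paste straight into the log. Line numbers drift between commits, so note the commit you were reading as well.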
Validation Artifacts
After this chapter you should produce and keep:
- A ~/tez-notes/module-map.md with one sentence per module.
- A ~/tez-notes/reading-log.md with the submitDAG trace from the exercise above.
- A grep-able list of the four protos and what each one defines.
- One git log -S command and the JIRA it surfaced, saved to the log.
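A quick self-check for the file-based artifacts might look like the sketch below. `check_artifacts` is a hypothetical name, and the checks are deliberately shallow: they confirm the files exist and mention the expected strings, not that the notes are any good.

```shell
# Sketch: verify that the note files exist and contain the expected traces.
# check_artifacts is a hypothetical helper; returns non-zero if anything is missing.
check_artifacts() {
  local notes="${1:-$HOME/tez-notes}" missing=0 f needle
  for f in module-map.md reading-log.md; do
    # -s: file exists and is not empty.
    if [ ! -s "$notes/$f" ]; then
      echo "missing or empty: $notes/$f"
      missing=1
    fi
  done
  for needle in 'submitDAG' 'git log -S'; do
    if ! grep -qF -- "$needle" "$notes/reading-log.md" 2>/dev/null; then
      echo "reading-log.md does not mention: $needle"
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_artifacts` with no argument to test the default ~/tez-notes directory.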
When you can do the exercise without checking this page, you have the navigation skill. The next chapter — Design via JIRA — tells you where the design decisions behind that code actually lived.