Milestones: M1 Through M9

Milestones are the "what does mastery look like at this stage" checkpoints. Each milestone has:

Expected completion: a calendar guideline.
Skills you must demonstrate: 4–6 concrete abilities.
Self-check questions (M1 through M7): answer them out loud, without notes.
Rubric: five criteria scored 1–4 each, 20 points total (M7 has four criteria, 16 points).
Pass threshold: minimum total to advance.
Move to the next level when: the binary gate.

Pass thresholds are deliberately high. The point is competence, not throughput.

M1 — Orientation (end of Week 2)

You can read the Tez DAG API and explain what every method on DAG, Vertex, and Edge does.

Skills

Write a 3-vertex DAG end-to-end without consulting docs.
Explain the three enums on EdgeProperty and pick the correct one for a given problem.
Name the protobuf message that represents a DAG on the wire.
Predict which built-in EdgeManager implementation will be selected for a given edge.
Locate any class in the tez-api module by name within 30 seconds.

Self-check questions

What is the difference between DataSourceDescriptor and a runtime Input?
Why is DAG.verify() called before submission?
Which class produces the protobuf DAGPlan?
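
One way to build intuition for the verify() question is to model a DAG in plain Java and check a structural property before doing anything expensive. The sketch below is a toy stand-in, not the Tez API: ToyDag, Movement, and isAcyclic are hypothetical names, the enum constants only mirror the names of Tez data-movement types, and cycle detection is assumed to be just one of the checks a real pre-submission verification would make.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a DAG builder; this is NOT the Tez API. */
final class ToyDag {
    /** Names mirror Tez data-movement types; the enum itself is local to this sketch. */
    enum Movement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();
    private final Map<String, Movement> edgeTypes = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        if (downstream.putIfAbsent(name, new ArrayList<>()) != null) {
            throw new IllegalStateException("duplicate vertex: " + name);
        }
        inDegree.put(name, 0);
        return this;
    }

    ToyDag addEdge(String from, String to, Movement movement) {
        if (!downstream.containsKey(from) || !downstream.containsKey(to)) {
            throw new IllegalStateException("unknown vertex in edge " + from + " -> " + to);
        }
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        edgeTypes.put(from + "->" + to, movement); // recorded for illustration only
        return this;
    }

    /** Kahn's algorithm: if not every vertex can be ordered, the graph has a cycle. */
    boolean isAcyclic() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((vertex, degree) -> { if (degree == 0) ready.add(vertex); });
        int ordered = 0;
        while (!ready.isEmpty()) {
            String vertex = ready.poll();
            ordered++;
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        return ordered == downstream.size();
    }

    public static void main(String[] args) {
        ToyDag dag = new ToyDag()
            .addVertex("tokenizer").addVertex("summer").addVertex("sorter")
            .addEdge("tokenizer", "summer", Movement.SCATTER_GATHER)
            .addEdge("summer", "sorter", Movement.SCATTER_GATHER);
        System.out.println(dag.isAcyclic()); // true
        dag.addEdge("sorter", "tokenizer", Movement.ONE_TO_ONE);
        System.out.println(dag.isAcyclic()); // false
    }
}
```

Catching a cycle on the client is cheap; discovering it after resources have been allocated is not, which is one plausible motivation for verifying before submission.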

Rubric

Criterion	1	2	3	4
API fluency	Can name classes	Can describe responsibilities	Can write code from memory	Can predict behavior
Edge model	Confused	Knows enums	Picks correct edge type	Predicts EdgeManager impl
Reading speed	>5 min/file	~3 min/file	~1 min/file	Scans fluently
Mental model	Vague	Sketches DAG	Sketches DAG + edge types	Sketches DAG + edge types + plan flow
Communication	Cannot explain	Explains with notes	Explains without notes	Teaches another

Pass threshold: 14/20, with no criterion below 2.
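
As a worked example of the gate arithmetic, the sketch below checks a score sheet against a total threshold and a per-criterion floor. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

/** Illustrative gate check; RubricGate and passes are hypothetical names. */
final class RubricGate {
    /**
     * @param scores    per-criterion scores, each expected to be in 1..4
     * @param threshold minimum total needed to pass
     * @param floor     minimum score allowed on any single criterion
     */
    static boolean passes(int[] scores, int threshold, int floor) {
        int total = Arrays.stream(scores).sum();
        boolean aboveFloor = Arrays.stream(scores).allMatch(s -> s >= floor);
        return total >= threshold && aboveFloor;
    }

    public static void main(String[] args) {
        // 4 + 4 + 4 + 3 + 1 = 16 clears 14, but the single 1 fails the floor.
        System.out.println(passes(new int[] {4, 4, 4, 3, 1}, 14, 2)); // false
        // 3 + 3 + 3 + 3 + 2 = 14 meets the threshold with no score below 2.
        System.out.println(passes(new int[] {3, 3, 3, 3, 2}, 14, 2)); // true
    }
}
```

The floor matters: a high total cannot compensate for a criterion scored 1.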

Move to Level 2 when: you can draft a new DAG class in 10 minutes from a verbal problem statement, on a whiteboard.

M2 — Build and Test Literacy (end of Week 4)

You can navigate the codebase, build it, and run any test by name.

Skills

Run a single test in any module via mvn -pl <module> test -Dtest=Class#method.
Add a new test file to tez-dag and have it picked up by Maven.
Read TestVertexImpl and explain at least 10 individual test methods.
Identify the module of a class given just its FQN (e.g., o.a.t.dag.app... → tez-dag).
Build Tez from a clean checkout in under 5 minutes (with cached deps).
Distinguish unit tests from MiniTezCluster-backed integration tests.

Self-check questions

Why does tez-dag depend on tez-api and not the reverse?
What is DrainDispatcher and why do tests use it?
Where do MiniTezCluster tests live and what classpath do they need?

Rubric

Criterion	1	2	3	4
Build mastery	mvn install works	Can skip tests and use profiles	Knows module deps	Diagnoses build failures
Test execution	Runs all tests	Runs a class	Runs a method	Runs cross-module
Test reading	Skims	Understands assertions	Understands setup	Recreates from scratch
Module map	Knows names	Knows top-level deps	Knows transitive deps	Diagnoses cycles
Tooling	IDE-only	CLI + IDE	CLI primary	CLI + scripting

Pass threshold: 14/20.

Move to Level 3 when: you can clone Tez on a fresh laptop, build it, and run a TestVertexImpl method by name within 15 minutes.

M3 — Submission and AM Bring-up (end of Week 6)

You can trace a DAG from TezClient.submitDAG() to DAGImpl.handle(...) inside the AM.

Skills

List the three local resources TezClientUtils uploads.
Explain session vs non-session mode and the AM keep-alive mechanism.
Name every AsyncDispatcher event-handler registered in DAGAppMaster.
Locate the line of code where DAGImpl is constructed inside the AM.
Read AM logs at DEBUG and map lines to source positions.
Run MiniTezCluster in your tests and inspect AM logs.
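
The pattern behind the handler list (register a handler per event type, then process events one at a time on a single thread) can be sketched in plain Java. This is a toy, not Hadoop's AsyncDispatcher: ToyDispatcher and its methods are hypothetical, and keying handlers by Java class instead of by an event-type enum is a simplification made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Toy single-threaded dispatcher; illustrates the pattern, NOT Hadoop's AsyncDispatcher. */
final class ToyDispatcher {
    private final Map<Class<?>, Consumer<Object>> handlers = new ConcurrentHashMap<>();
    // One worker thread, so handlers never run concurrently with each other.
    private final ExecutorService loop = Executors.newSingleThreadExecutor();

    /** Registers the handler for one event class (one handler per class in this sketch). */
    <E> void register(Class<E> eventType, Consumer<E> handler) {
        handlers.put(eventType, event -> handler.accept(eventType.cast(event)));
    }

    /** Enqueues an event; it is handled later, in submission order, on the loop thread. */
    void dispatch(Object event) {
        Consumer<Object> handler = handlers.get(event.getClass());
        if (handler == null) {
            throw new IllegalStateException(
                "no handler for " + event.getClass().getSimpleName());
        }
        loop.execute(() -> handler.accept(event));
    }

    /** Stops accepting events and waits for the queued ones to be handled. */
    boolean drainAndStop(long timeoutMillis) {
        loop.shutdown();
        try {
            return loop.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ToyDispatcher dispatcher = new ToyDispatcher();
        List<String> seen = new ArrayList<>(); // written only by the loop thread
        dispatcher.register(String.class, s -> seen.add("dag:" + s));
        dispatcher.register(Integer.class, i -> seen.add("task:" + i));
        dispatcher.dispatch("submitted");
        dispatcher.dispatch(7);
        dispatcher.dispatch("finished");
        dispatcher.drainAndStop(1000);
        System.out.println(seen); // [dag:submitted, task:7, dag:finished]
    }
}
```

Because a single thread drains the queue, handlers can mutate shared state without locks, and events are observed in the order they were dispatched.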

Self-check questions

What RPC does TezClient use to submit a DAG? Which protocol class?
How does the AM stay alive between DAGs in a session?
What happens if the AM dies during a DAG run with recovery disabled?

Rubric

Criterion	1	2	3	4
Submission path	Vague	Knows TezClient API	Knows RPC	Knows full byte path
AM bring-up	Cannot describe	Names dispatcher	Names handlers	Walks serviceInit
Session model	Confused	Knows the flag	Knows keep-alive	Knows timeouts
Log reading	Greps blindly	Greps with intent	Maps to code	Predicts log line
Recovery	Unknown	Aware	Knows config keys	Knows record format

Pass threshold: 14/20.

Move to Level 4 when: you can answer "where in the AM does my DAG show up?" with a file:line citation.

M4 — State Machines and VertexManager (end of Week 9)

You can read and modify the vertex/task/attempt state machines.

Skills

Write a small StateMachineFactory-based state machine from scratch.
Add a transition to VertexImpl.stateMachineFactory and update tests in the same patch.
Implement a custom VertexManagerPlugin with a unit test.
Diagnose an InvalidStateTransitonException from a stack trace.
Distinguish SingleArcTransition from MultipleArcTransition.
Explain the dispatcher single-threading invariant.
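
The core idea behind a table-driven state machine can be sketched without Hadoop on the classpath. The code below is a stand-in, not Hadoop's StateMachineFactory: ToyStateMachine, its states and events, and InvalidTransition are hypothetical. It shows a transition whose post-state is fixed at registration (the single-arc idea), one whose hook chooses the post-state at runtime (the multiple-arc idea), and the failure raised when no transition is registered for a (state, event) pair.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy table-driven state machine; NOT Hadoop's StateMachineFactory. */
final class ToyStateMachine {
    enum State { NEW, RUNNING, SUCCEEDED, FAILED }
    enum Event { START, TASKS_DONE }

    /** Thrown when no transition is registered for (state, event). */
    static final class InvalidTransition extends RuntimeException {
        InvalidTransition(State state, Event event) {
            super("Invalid event " + event + " at " + state);
        }
    }

    // (state, event) -> hook that takes the failed-task count and returns the post-state.
    private final Map<State, Map<Event, Function<Integer, State>>> table = new HashMap<>();
    private State current = State.NEW;

    ToyStateMachine() {
        // Single-arc idea: the post-state is fixed when the transition is registered.
        register(State.NEW, Event.START, failedTasks -> State.RUNNING);
        // Multiple-arc idea: the hook picks the post-state at runtime from a known set.
        register(State.RUNNING, Event.TASKS_DONE,
            failedTasks -> failedTasks == 0 ? State.SUCCEEDED : State.FAILED);
    }

    private void register(State from, Event on, Function<Integer, State> transition) {
        table.computeIfAbsent(from, key -> new HashMap<>()).put(on, transition);
    }

    /** Applies one event; must only ever be called from a single thread. */
    State handle(Event event, int failedTasks) {
        Map<Event, Function<Integer, State>> row = table.get(current);
        Function<Integer, State> transition = (row == null) ? null : row.get(event);
        if (transition == null) {
            throw new InvalidTransition(current, event);
        }
        current = transition.apply(failedTasks);
        return current;
    }

    public static void main(String[] args) {
        ToyStateMachine vertex = new ToyStateMachine();
        System.out.println(vertex.handle(Event.START, 0));      // RUNNING
        System.out.println(vertex.handle(Event.TASKS_DONE, 2)); // FAILED
        // SUCCEEDED and FAILED have no outgoing transitions here, so any
        // further event would throw InvalidTransition.
    }
}
```

Note that handle reads and then writes current with no synchronization, which is only safe because the sketch assumes a single caller thread.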

Self-check questions

Why must state-machine code be single-threaded? What breaks if not?
What happens if you forget to register a transition for an event in a state?
How does ShuffleVertexManager implement slow-start?
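
The slow-start question can be made concrete with a little arithmetic. As documented for ShuffleVertexManager, no downstream tasks are scheduled until a minimum fraction of source tasks has completed, all can be scheduled once a maximum fraction is reached, and the count scales linearly in between. The sketch below computes that ramp; SlowStart and tasksToSchedule are hypothetical names, the 0.25 and 0.75 fractions are illustrative inputs, and rounding down is an assumption of this sketch, not a statement about the real implementation.

```java
/** Linear slow-start ramp; a sketch of the documented behavior, not Tez code. */
final class SlowStart {
    /**
     * @return how many of totalTasks may be scheduled given source-task progress
     */
    static int tasksToSchedule(int totalTasks, int completedSources, int totalSources,
                               double minFraction, double maxFraction) {
        // With no source tasks at all there is nothing to wait for.
        double completed =
            totalSources == 0 ? 1.0 : (double) completedSources / totalSources;
        double range = maxFraction - minFraction;
        double ramp = range > 0
            ? (completed - minFraction) / range
            : (completed >= minFraction ? 1.0 : 0.0);
        ramp = Math.max(0.0, Math.min(1.0, ramp)); // clamp to [0, 1]
        return (int) (ramp * totalTasks);          // rounding down is an assumption
    }

    public static void main(String[] args) {
        // 40 source tasks feeding 100 downstream tasks, ramp from 25% to 75%.
        System.out.println(tasksToSchedule(100, 10, 40, 0.25, 0.75)); // 0
        System.out.println(tasksToSchedule(100, 20, 40, 0.25, 0.75)); // 50
        System.out.println(tasksToSchedule(100, 30, 40, 0.25, 0.75)); // 100
    }
}
```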

Rubric

Criterion	1	2	3	4
State machine	Knows it exists	Can read transitions	Can add transition	Can refactor safely
Test discipline	None	Adds happy path	Adds happy + sad	Updates per transition
VertexManager	Knows interface	Implements minimal	Implements custom	Implements + tests
Concurrency	Confused	Knows the rule	Knows why	Can audit a PR
Debugging	Reads stack	Maps to source	Reproduces locally	Writes regression test

Pass threshold: 16/20 — this is the first hard gate.

Move to Level 5 when: you have submitted (or at minimum drafted) a state machine change that compiles, with a passing test.

M5 — Runtime and Shuffle (end of Week 11)

You can read the runtime data path and explain spill, merge, and fetch.

Skills

Walk a single task's lifecycle: container start → processor.run() → output close.
Explain IFile framing and the difference between V1 and V2.
Distinguish DefaultSorter, PipelinedSorter, and unordered output.
Diagnose a fetcher failure from logs.
Read ShuffleManager and explain its scheduling of fetchers.
Explain combiners and where they run in the pipeline.
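
The merge step is, conceptually, a k-way merge of already-sorted runs. The sketch below shows that generic technique with a priority queue; it is not Tez's merger, it ignores IFile framing, comparators, and combiners, and SpillMerge and Cursor are hypothetical names. Each in-memory list stands in for one sorted spill.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Generic k-way merge of sorted runs; illustrates the merge step only. */
final class SpillMerge {
    /** Cursor into one sorted run (think: one spill). */
    private static final class Cursor {
        final List<String> run;
        int pos;

        Cursor(List<String> run) {
            this.run = run;
        }

        String head() {
            return run.get(pos);
        }
    }

    static List<String> merge(List<List<String>> sortedRuns) {
        // The heap always exposes the run whose next key is smallest.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor cursor = heap.poll();
            out.add(cursor.head());
            cursor.pos++;
            if (cursor.pos < cursor.run.size()) {
                heap.add(cursor); // re-insert so it is ordered by its new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> merged = merge(Arrays.asList(
            Arrays.asList("apple", "kiwi"),
            Arrays.asList("banana", "pear"),
            Arrays.asList("cherry")));
        System.out.println(merged); // [apple, banana, cherry, kiwi, pear]
    }
}
```

Only one key per run sits in the heap at a time, which is why this kind of merge can process runs far larger than memory when the runs are streamed from disk.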

Self-check questions

What umbilical RPCs does a task make during its run?
Where is the spill threshold checked?
What triggers a FAILED_FETCH event upstream?

Rubric

Criterion	1	2	3	4
Runtime path	Names classes	Walks happy path	Walks failure paths	Walks edge cases
IFile	Knows format	Reads with hexdump	Modifies safely	Diagnoses corruption
Sorter	Names them	Knows tradeoffs	Picks for workload	Tunes configs
Shuffle	Vague	Knows pull model	Knows scheduling	Knows backoff
Combiner	Aware	Knows when run	Implements one	Debugs incorrect output

Pass threshold: 15/20.

Move to Level 6 when: you can intentionally produce a fetcher failure on MiniTezCluster and explain every log line.

M6 — Scheduling and Container Reuse (end of Week 12)

You understand how Tez decides where tasks run.

Skills

Read YarnTaskSchedulerService and explain its scheduling loop.
List the conditions under which a container is or is not reused.
Explain affinity, locality, and racks.
Tune tez.am.container.reuse.* for a given workload.
Diagnose "stuck" scheduling.

Self-check questions

Why does Tez prefer to reuse containers over requesting new ones?
What happens if tez.am.container.idle-release-timeout-min.millis is too low?

Rubric

Criterion	1	2	3	4
Reuse model	Aware	Knows conditions	Knows configs	Tunes for workload
Scheduling	Black box	Reads main loop	Reads matching	Reads + modifies
Locality	Aware	Knows hints	Knows fallback	Knows rack policy
Diagnostics	Guess-and-check	Reads AM logs	Reads + maps to code	Adds counters
YARN integration	Aware	Knows AMRM	Knows tokens	Knows failover

Pass threshold: 14/20.

Move to Level 7 when: you can explain why container reuse is on by default and pick five workloads where you would tune it.

M7 — Integrations (end of Week 13)

You can read and modify the MapReduce shim and explain Hive-on-Tez at a high level.

Skills

Write a DAG that uses MRInput reading from HDFS.
Explain MROutput commit semantics.
Sketch how Hive's TezTask builds a DAG.
Identify which features Hive uses (custom edges, manager plugins, dynamic reconfig).

Self-check questions

What does MROutput.commit() do, and what guarantees does it offer?
Why does Hive use ROOT_INPUT_INITIALIZER_FAILED heavily in its bug fixes?

Rubric

Criterion	1	2	3	4
MR shim	Knows existence	Reads MRInput	Reads + uses	Modifies safely
Commit	Aware	Knows semantics	Knows failure modes	Knows speculative cleanup
Hive lens	Aware	Reads TezTask	Reads + maps	Diagnoses cross-project bug
Cross-project	Confused	Knows boundaries	Picks the right list	Files bug correctly

Pass threshold: 12/16 (only 4 criteria here).

Move to Level 8 when: you can read a Hive query plan and predict its DAG.

M8 — Production Diagnostics (end of Week 14)

You can debug a real Tez job failure given logs and an ATS dump.

Skills

Read a Tez counters dump and find a bottleneck.
Find a VertexImpl failure cause from AM logs in under 5 minutes.
Read ATS events and reconstruct a DAG timeline.
Identify a stuck task vs a slow task vs a failed task from counters.
Build a one-pager triage runbook for your team.

Rubric

Criterion	1	2	3	4
Counters	Knows existence	Reads	Interprets	Tunes
Log triage	Greps	Maps to code	Maps to state	Predicts next event
ATS	Aware	Queries	Reads events	Cross-checks vs AM log
Runbook	None	Draft	Reviewed	Shipped to team
Speed	>30 min	~15 min	<10 min	<5 min

Pass threshold: 16/20.

Move to capstone when: you've helped someone (on chat, dev list, or internally) debug a real Tez issue successfully.

M9 — Capstone (end of Week 16)

You've shipped a patch.

Skills

Select an appropriate issue.
Reproduce the issue and identify its root cause.
Implement a fix with tests.
Submit a patch in the project's accepted format.
Respond to at least one round of review feedback.

Rubric (20 points)

Criterion	1	2	3	4
Issue selection	Random	Scoped	Justified	Aligned to roadmap
Reproduction	None	Manual	Scripted	Added as a test
Root cause	Speculative	Localized	Cited	Explained in JIRA
Implementation	Compiles	Tests pass	Idiomatic	Minimal & focused
Submission	None	Draft	Submitted	Reviewed

Pass threshold: 16/20, and the patch must compile and pass mvn verify on the affected module.

Global Rubric (committer-readiness)

Use this every quarter, regardless of level, to self-assess.

Dimension	1 (Beginner)	2 (Apprentice)	3 (Practitioner)	4 (Committer-ready)
Code	Reads	Modifies	Designs subsystem	Reviews others' changes
Testing	Runs tests	Adds tests	Writes regression suites	Drives test infra
Docs	Reads	Edits	Writes user-facing	Owns module-level docs
Integration	Single module	Cross-module	Cross-project (Hive)	Drives release decisions

A committer-track contributor should be at level 3 on all four dimensions and level 4 on at least one. Aim to move from 3/3/3/3 to 4/3/3/4 (Code/Testing/Docs/Integration) by month 12 of focused contribution.
