Milestones: M1 Through M9

Milestones are the "what does mastery look like at this stage" checkpoints. Each milestone has:

  • Expected completion — a calendar guideline.
  • Skills you must demonstrate — 5–8 concrete abilities.
  • Self-check questions — answer them out loud, without notes.
  • 20-point rubric — five criteria, each scored 1 to 4 (M7 has only four criteria, so its rubric totals 16).
  • Pass threshold — minimum total to advance.
  • Move to the next level when — the binary gate.

Pass thresholds are deliberately high. The point is competence, not throughput.


M1 — Orientation (end of Week 2)

You can read the Tez DAG API and explain what every method on DAG, Vertex, and Edge does.

Skills

  1. Write a 3-vertex DAG end-to-end without consulting docs.
  2. Explain the three enums on EdgeProperty and pick the correct one for a given problem.
  3. Name the protobuf message that represents a DAG on the wire.
  4. Predict which built-in EdgeManager implementation will be selected for a given edge.
  5. Locate any class in the tez-api module by name within 30 seconds.

Self-check questions

  • What is the difference between DataSourceDescriptor and a runtime Input?
  • Why is DAG.verify() called before submission?
  • Which class produces the protobuf DAGPlan?
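As a rough illustration of why a verification pass before submission is worthwhile, the sketch below models a graph with plain collections and rejects duplicate vertex names, edges to unknown vertices, and cycles. ToyDag and its methods are hypothetical names invented for this example. They are not the Tez DAG class, and the real DAG.verify() may check more or different things.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy graph model; hypothetical, NOT the Tez DAG class. */
final class ToyDag {
    // vertex name -> names of downstream vertices
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        if (edges.putIfAbsent(name, new LinkedHashSet<>()) != null) {
            throw new IllegalStateException("duplicate vertex: " + name);
        }
        return this;
    }

    ToyDag addEdge(String from, String to) {
        if (!edges.containsKey(from) || !edges.containsKey(to)) {
            throw new IllegalStateException("edge uses unknown vertex: " + from + " -> " + to);
        }
        edges.get(from).add(to);
        return this;
    }

    /** Kahn's algorithm: a cycle leaves some vertex with a non-zero in-degree. */
    void verify() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String v : edges.keySet()) {
            inDegree.put(v, 0);
        }
        for (Set<String> targets : edges.values()) {
            for (String t : targets) {
                inDegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((v, d) -> { if (d == 0) ready.add(v); });
        int visited = 0;
        while (!ready.isEmpty()) {
            String v = ready.poll();
            visited++;
            for (String t : edges.get(v)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (visited != edges.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
    }
}
```

Catching these mistakes on the client is cheap; the same mistakes discovered after submission would cost a cluster round trip.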

Rubric

Criterion | 1 | 2 | 3 | 4
API fluency | Can name classes | Can describe responsibilities | Can write code from memory | Can predict behavior
Edge model | Confused | Knows enums | Picks correct edge type | Predicts EdgeManager impl
Reading speed | >5 min/file | ~3 min/file | ~1 min/file | Scanning fluently
Mental model | Vague | Sketches DAG | Sketches DAG + edge types | Sketches DAG + edge types + plan flow
Communication | Cannot explain | Explains with notes | Explains without notes | Teaches another

Pass threshold: 14/20, with no criterion below 2.
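The gate combines two conditions: a minimum total and a floor on every individual criterion. A minimal sketch of that arithmetic follows; RubricGate and passes are hypothetical names used only for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper: checks a set of rubric scores against a milestone gate. */
final class RubricGate {
    /**
     * @param scores   one score per criterion, each from 1 to 4
     * @param minTotal minimum total needed to pass (14 for M1)
     * @param minEach  lowest score allowed on any single criterion (2 for M1)
     */
    static boolean passes(int[] scores, int minTotal, int minEach) {
        for (int s : scores) {
            if (s < 1 || s > 4) {
                throw new IllegalArgumentException("score out of range: " + s);
            }
        }
        int total = Arrays.stream(scores).sum();
        int lowest = Arrays.stream(scores).min().orElse(0);
        return total >= minTotal && lowest >= minEach;
    }

    public static void main(String[] args) {
        // 3+3+3+3+2 = 14 and nothing is below 2, so the gate opens.
        System.out.println(passes(new int[] {3, 3, 3, 3, 2}, 14, 2));
        // 4+4+4+4+1 = 17, but one criterion is below 2, so the gate stays shut.
        System.out.println(passes(new int[] {4, 4, 4, 4, 1}, 14, 2));
    }
}
```

The second case is the reason for the floor: a high total cannot compensate for one criterion that is still at level 1.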

Move to Level 2 when: you can draft a new DAG class in 10 minutes from a verbal problem statement, on a whiteboard.


M2 — Build and Test Literacy (end of Week 4)

You can navigate the codebase, build it, and run any test by name.

Skills

  1. Run a single test in any module via mvn -pl <module> test -Dtest=Class#method.
  2. Add a new test file to tez-dag and have it picked up by Maven.
  3. Read TestVertexImpl and explain at least 10 individual test methods.
  4. Identify the module of a class given just its FQN (e.g., o.a.t.dag.app... lives in tez-dag).
  5. Build Tez from a clean checkout in under 5 minutes (with cached deps).
  6. Distinguish unit tests from MiniTezCluster-backed integration tests.
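For scripting skill 1, the command line can be assembled programmatically. The sketch below only builds the argument list in the mvn -pl <module> test -Dtest=Class#method shape shown above; SingleTestCommand is a hypothetical helper, and someTestMethod in the usage note is a placeholder rather than a real test method name.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that builds the argument list for a single-test Maven run. */
final class SingleTestCommand {
    static List<String> build(String module, String testClass, String method) {
        // -pl limits the build to one module; -Dtest=Class#method selects one test method.
        return Arrays.asList("mvn", "-pl", module, "test", "-Dtest=" + testClass + "#" + method);
    }

    public static void main(String[] args) {
        // Prints the command instead of running it, so no Maven install is needed here.
        System.out.println(String.join(" ", build("tez-dag", "TestVertexImpl", "someTestMethod")));
    }
}
```

The resulting list can be handed to new ProcessBuilder(command).inheritIO().start() from the root of a checkout if you want a script to launch the run.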

Self-check questions

  • Why does tez-dag depend on tez-api and not the reverse?
  • What is DrainDispatcher and why do tests use it?
  • Where do MiniTezCluster tests live and what classpath do they need?

Rubric

Criterion | 1 | 2 | 3 | 4
Build mastery | mvn install works | Can skip tests, profiles | Knows module deps | Diagnoses build failures
Test execution | Runs all tests | Runs a class | Runs a method | Runs cross-module
Test reading | Skims | Understands assertions | Understands setup | Recreates from scratch
Module map | Knows names | Knows top-level deps | Knows transitive deps | Diagnoses cycles
Tooling | IDE-only | CLI + IDE | CLI primary | CLI + scripting

Pass threshold: 14/20.

Move to Level 3 when: you can clone Tez on a fresh laptop, build it, and run a TestVertexImpl method by name within 15 minutes.


M3 — Submission and AM Bring-up (end of Week 6)

You can trace a DAG from TezClient.submitDAG() to DAGImpl.handle(...) inside the AM.

Skills

  1. List the three local resources TezClientUtils uploads.
  2. Explain session vs non-session mode and the AM keep-alive mechanism.
  3. Name every AsyncDispatcher event-handler registered in DAGAppMaster.
  4. Locate the line of code where DAGImpl is constructed inside the AM.
  5. Read AM logs at DEBUG and map lines to source positions.
  6. Run MiniTezCluster in your tests and inspect AM logs.

Self-check questions

  • What RPC does TezClient use to submit a DAG? Which protocol class?
  • How does the AM stay alive between DAGs in a session?
  • What happens if the AM dies during a DAG run with recovery disabled?

Rubric

Criterion | 1 | 2 | 3 | 4
Submission path | Vague | Knows TezClient API | Knows RPC | Knows full byte path
AM bring-up | Cannot describe | Names dispatcher | Names handlers | Walks serviceInit
Session model | Confused | Knows the flag | Knows keep-alive | Knows timeouts
Log reading | Greps blindly | Greps with intent | Maps to code | Predicts log line
Recovery | Unknown | Aware | Knows config keys | Knows record format

Pass threshold: 14/20.

Move to Level 4 when: you can answer "where in the AM does my DAG show up?" with a file:line citation.


M4 — State Machines and VertexManager (end of Week 9)

You can read and modify the vertex/task/attempt state machines.

Skills

  1. Write a small StateMachineFactory-based state machine from scratch.
  2. Add a transition to VertexImpl.stateMachineFactory and update tests in the same patch.
  3. Implement a custom VertexManagerPlugin with a unit test.
  4. Diagnose an InvalidStateTransitonException from a stack trace.
  5. Distinguish SingleArcTransition from MultipleArcTransition.
  6. Explain the dispatcher single-threading invariant.
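To make skills 1 and 5 concrete, here is a toy table-driven state machine written with plain collections. It is a sketch under stated assumptions: ToyVertex, Arc, and register are hypothetical names, and the code does not use or reproduce Hadoop's StateMachineFactory. It only mirrors the idea that a single-arc transition has its post-state fixed at registration, while a multiple-arc transition picks one of several declared post-states at run time.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Toy table-driven state machine; hypothetical, not Hadoop's StateMachineFactory. */
final class ToyVertex {
    enum State { NEW, RUNNING, SUCCEEDED, FAILED }
    enum Event { START, TASKS_DONE }

    /** One registered arc: the declared post-states plus a hook that picks one. */
    private static final class Arc {
        final Set<State> allowed;
        final Function<ToyVertex, State> hook;

        Arc(Set<State> allowed, Function<ToyVertex, State> hook) {
            this.allowed = allowed;
            this.hook = hook;
        }
    }

    private static final Map<State, Map<Event, Arc>> TABLE = new EnumMap<>(State.class);

    private static void register(State from, Event on, Set<State> allowed,
                                 Function<ToyVertex, State> hook) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<>(Event.class))
             .put(on, new Arc(allowed, hook));
    }

    static {
        // Single arc: the post-state is fixed when the transition is registered.
        register(State.NEW, Event.START, EnumSet.of(State.RUNNING), v -> State.RUNNING);
        // Multiple arc: the hook chooses among the declared post-states at run time.
        register(State.RUNNING, Event.TASKS_DONE, EnumSet.of(State.SUCCEEDED, State.FAILED),
                 v -> v.failedTasks == 0 ? State.SUCCEEDED : State.FAILED);
    }

    State state = State.NEW;
    int failedTasks = 0;

    /** Not thread-safe: meant to be called from one dispatcher thread only. */
    void handle(Event event) {
        Map<Event, Arc> row = TABLE.get(state);
        Arc arc = (row == null) ? null : row.get(event);
        if (arc == null) {
            // Rough analogue of an invalid-transition error for an unregistered (state, event) pair.
            throw new IllegalStateException("no transition for " + event + " in " + state);
        }
        State next = arc.hook.apply(this);
        if (!arc.allowed.contains(next)) {
            throw new IllegalStateException("hook returned undeclared state " + next);
        }
        state = next;
    }
}
```

Because handle reads and writes state without any locking, two threads calling it concurrently could both act on the same stale state. That is the kind of breakage the single-threading invariant rules out.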

Self-check questions

  • Why must state-machine code be single-threaded? What breaks if not?
  • What happens if you forget to register a transition for an event in a state?
  • How does ShuffleVertexManager implement slow-start?
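When you work on the slow-start question, it can help to have a simplified model to compare against the real code. The sketch below assumes a linear ramp between a minimum and a maximum completed-source fraction. SlowStart and tasksToSchedule are hypothetical names, and the exact rounding and boundary behavior of ShuffleVertexManager may differ, so treat this as a mental model to verify against the source rather than a description of it.

```java
/** Simplified slow-start model; hypothetical, not ShuffleVertexManager's code. */
final class SlowStart {
    /**
     * @param completedSources source tasks finished so far
     * @param totalSources     total source tasks (must be positive)
     * @param totalTasks       tasks in the downstream vertex
     * @param minFraction      below this completed fraction, schedule nothing
     * @param maxFraction      at or above this fraction, schedule everything
     * @return how many downstream tasks should have been scheduled by now
     */
    static int tasksToSchedule(int completedSources, int totalSources, int totalTasks,
                               double minFraction, double maxFraction) {
        if (totalSources <= 0 || minFraction > maxFraction) {
            throw new IllegalArgumentException("invalid slow-start parameters");
        }
        double done = (double) completedSources / totalSources;
        if (done < minFraction) {
            return 0;
        }
        if (done >= maxFraction) {
            return totalTasks;
        }
        // Linear ramp between the two fractions (assumed, for illustration).
        double ramp = (done - minFraction) / (maxFraction - minFraction);
        return (int) Math.floor(ramp * totalTasks);
    }
}
```

With fractions of 0.25 and 0.75, ten downstream tasks, and half of the sources complete, this model schedules five tasks. Nothing starts before a quarter of the sources finish, and everything is scheduled once three quarters have.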

Rubric

Criterion | 1 | 2 | 3 | 4
State machine | Knows it exists | Can read transitions | Can add transition | Can refactor safely
Test discipline | None | Adds happy path | Adds happy + sad | Updates per transition
VertexManager | Knows interface | Implements minimal | Implements custom | Implements + tests
Concurrency | Confused | Knows the rule | Knows why | Can audit a PR
Debugging | Reads stack | Maps to source | Reproduces locally | Writes regression test

Pass threshold: 16/20 — this is the first hard gate.

Move to Level 5 when: you have submitted (or at minimum drafted) a state machine change that compiles, with a passing test.


M5 — Runtime and Shuffle (end of Week 11)

You can read the runtime data path and explain spill, merge, and fetch.

Skills

  1. Walk a single task's lifecycle: container start → processor.run() → output close.
  2. Explain IFile framing and the difference between V1 and V2.
  3. Distinguish DefaultSorter, PipelinedSorter, and unordered output.
  4. Diagnose a fetcher failure from logs.
  5. Read ShuffleManager and explain its scheduling of fetchers.
  6. Explain combiners and where they run in the pipeline.

Self-check questions

  • What umbilical RPCs does a task make during its run?
  • Where is the spill threshold checked?
  • What triggers a FAILED_FETCH event upstream?

Rubric

Criterion | 1 | 2 | 3 | 4
Runtime path | Names classes | Walks happy path | Walks failure paths | Walks edge cases
IFile | Knows format | Reads with hexdump | Modifies safely | Diagnoses corruption
Sorter | Names them | Knows tradeoffs | Picks for workload | Tunes configs
Shuffle | Vague | Knows pull model | Knows scheduling | Knows backoff
Combiner | Aware | Knows when run | Implements one | Debugs incorrect output

Pass threshold: 15/20.

Move to Level 6 when: you can intentionally produce a fetcher failure on MiniTezCluster and explain every log line.


M6 — Scheduling and Container Reuse (end of Week 12)

You understand how Tez decides where tasks run.

Skills

  1. Read YarnTaskSchedulerService and explain its scheduling loop.
  2. List the conditions under which a container is/is not reused.
  3. Explain affinity, locality, and racks.
  4. Tune tez.am.container.reuse.* for a given workload.
  5. Diagnose "stuck" scheduling.

Self-check questions

  • Why does Tez prefer to reuse containers over requesting new ones?
  • What happens if tez.am.container.idle.release-timeout-min.millis is too low?

Rubric

Criterion | 1 | 2 | 3 | 4
Reuse model | Aware | Knows conditions | Knows configs | Tunes for workload
Scheduling | Black box | Reads main loop | Reads matching | Reads + modifies
Locality | Aware | Knows hints | Knows fallback | Knows rack policy
Diagnostics | Guess-and-check | Reads AM logs | Reads + maps to code | Adds counters
YARN integration | Aware | Knows AMRM | Knows tokens | Knows failover

Pass threshold: 14/20.

Move to Level 7 when: you can explain why container reuse is on by default and pick five workloads where you would tune it.


M7 — Integrations (end of Week 13)

You can read and modify the MapReduce shim and explain Hive-on-Tez at a high level.

Skills

  1. Write a DAG that uses MRInput reading from HDFS.
  2. Explain MROutput commit semantics.
  3. Sketch how Hive's TezTask builds a DAG.
  4. Identify which features Hive uses (custom edges, manager plugins, dynamic reconfig).

Self-check questions

  • What does MROutput.commit() do, and what guarantees does it offer?
  • Why does ROOT_INPUT_INITIALIZER_FAILED come up so often in Hive bug fixes?

Rubric

Criterion | 1 | 2 | 3 | 4
MR shim | Knows existence | Reads MRInput | Reads + uses | Modifies safely
Commit | Aware | Knows semantics | Knows failure modes | Knows speculative cleanup
Hive lens | Aware | Reads TezTask | Reads + maps | Diagnoses cross-project bug
Cross-project | Confused | Knows boundaries | Picks the right list | Files bug correctly

Pass threshold: 12/16 (only 4 criteria here).

Move to Level 8 when: you can read a Hive query plan and predict its DAG.


M8 — Production Diagnostics (end of Week 14)

You can debug a real Tez job failure given logs and an ATS dump.

Skills

  1. Read a Tez counters dump and find a bottleneck.
  2. Find a VertexImpl failure cause from AM logs in <5 minutes.
  3. Read ATS events and reconstruct a DAG timeline.
  4. Identify a stuck task vs a slow task vs a failed task from counters.
  5. Build a one-pager triage runbook for your team.

Rubric

Criterion | 1 | 2 | 3 | 4
Counters | Knows existence | Reads | Interprets | Tunes
Log triage | Greps | Maps to code | Maps to state | Predicts next event
ATS | Aware | Queries | Reads events | Cross-checks vs AM log
Runbook | None | Draft | Reviewed | Shipped to team
Speed | >30 min | ~15 min | <10 min | <5 min

Pass threshold: 16/20.

Move to capstone when: you've helped someone (on chat, dev list, or internally) debug a real Tez issue successfully.


M9 — Capstone (end of Week 16)

You've shipped a patch.

Skills

  1. Selected an appropriate issue.
  2. Reproduced and root-caused.
  3. Implemented a fix with tests.
  4. Submitted a patch in the project's accepted format.
  5. Responded to at least one round of review feedback.

Rubric (20 points)

Criterion | 1 | 2 | 3 | 4
Issue selection | Random | Scoped | Justified | Aligned to roadmap
Reproduction | None | Manual | Scripted | Added as a test
Root cause | Speculative | Localized | Cited | Explained in JIRA
Implementation | Compiles | Tests pass | Idiomatic | Minimal & focused
Submission | None | Draft | Submitted | Reviewed

Pass threshold: 16/20, and the patch must compile and pass mvn verify on the affected module.


Global Rubric (committer-readiness)

Use this every quarter, regardless of level, to self-assess.

Dimension | 1 (Beginner) | 2 (Apprentice) | 3 (Practitioner) | 4 (Committer-ready)
Code | Reads | Modifies | Designs subsystem | Reviews others' changes
Testing | Runs tests | Adds tests | Writes regression suites | Drives test infra
Docs | Reads | Edits | Writes user-facing | Owns module-level docs
Integration | Single module | Cross-module | Cross-project (Hive) | Drives release decisions

A committer-track contributor should be at level 3 or higher on all four dimensions and at level 4 on at least one. Aim to move from 3/3/3/3 to 4/3/3/4 (Code/Testing/Docs/Integration) by month 12 of focused contribution.