Milestones: M1 Through M9
Milestones are the "what does mastery look like at this stage" checkpoints. Each milestone has:
- Expected completion — a calendar guideline.
- Skills you must demonstrate — 5–8 concrete abilities.
- Self-check questions — answer them out loud, without notes.
- 20-point rubric — five criteria, four points each.
- Pass threshold — minimum total to advance.
- Move to the next level when — the binary gate.
Pass thresholds are deliberately high. The point is competence, not throughput.
M1 — Orientation (end of Week 2)
You can read the Tez DAG API and explain what every method on DAG, Vertex,
and Edge does.
Skills
- Write a 3-vertex DAG end-to-end without consulting docs.
- Explain the three enums on `EdgeProperty` and pick the correct one for a given problem.
- Name the protobuf message that represents a DAG on the wire.
- Predict which built-in `EdgeManager` implementation will be selected for a given edge.
- Locate any class in the `tez-api` module by name within 30 seconds.
Self-check questions
- What is the difference between `DataSourceDescriptor` and a runtime `Input`?
- Why is `DAG.verify()` called before submission?
- Which class produces the protobuf `DAGPlan`?
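
One way to approach the `DAG.verify()` question: a malformed graph, such as one with a cycle, is cheaper to reject on the client than after it has been shipped to the cluster. The sketch below is a stand-alone toy, not the Tez implementation. `ToyDag` and its methods are hypothetical names, and the only checks shown are duplicate vertex names, dangling edges, and cycle detection with Kahn's algorithm.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a DAG builder; not the Tez DAG class. */
final class ToyDag {
    // vertex name -> names of its downstream vertices
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        if (downstream.putIfAbsent(name, new ArrayList<>()) != null) {
            throw new IllegalStateException("duplicate vertex: " + name);
        }
        return this;
    }

    ToyDag addEdge(String from, String to) {
        if (!downstream.containsKey(from) || !downstream.containsKey(to)) {
            throw new IllegalStateException("edge references unknown vertex");
        }
        downstream.get(from).add(to);
        return this;
    }

    /** Kahn's algorithm: if a topological order cannot cover every vertex, there is a cycle. */
    void verify() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        downstream.keySet().forEach(v -> inDegree.put(v, 0));
        downstream.values().forEach(outs -> outs.forEach(v -> inDegree.merge(v, 1, Integer::sum)));

        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((v, d) -> { if (d == 0) ready.add(v); });

        int visited = 0;
        while (!ready.isEmpty()) {
            String v = ready.poll();
            visited++;
            for (String next : downstream.get(v)) {
                if (inDegree.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (visited != downstream.size()) {
            throw new IllegalStateException("DAG contains a cycle");
        }
    }

    public static void main(String[] args) {
        new ToyDag().addVertex("tokenizer").addVertex("summer").addVertex("sorter")
            .addEdge("tokenizer", "summer").addEdge("summer", "sorter")
            .verify(); // a 3-vertex chain passes silently
        System.out.println("3-vertex chain verified");
    }
}
```

When you answer the self-check question against the real code, compare what `DAG.verify()` actually checks with this short list; the real method validates considerably more than graph shape.
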
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| API fluency | Can name classes | Can describe responsibilities | Can write code from memory | Can predict behavior |
| Edge model | Confused | Knows enums | Picks correct edge type | Predicts EdgeManager impl |
| Reading speed | >5 min/file | ~3 min/file | ~1 min/file | scanning fluently |
| Mental model | Vague | Sketches DAG | Sketches DAG + edge types | Sketches DAG + edge types + plan flow |
| Communication | Cannot explain | Explains with notes | Explains without notes | Teaches another |
Pass threshold: 14/20, with no criterion below 2.
Move to Level 2 when: you can draft a new DAG class in 10 minutes from a verbal problem statement, on a whiteboard.
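
The M1 pass rule combines a total with a per-criterion floor, so a high total alone is not enough. A minimal sketch of that check, using hypothetical names (`RubricScore`, `passes`) that are not part of any Tez tooling:

```java
import java.util.Arrays;

/** Illustrative only: scores a milestone rubric where each criterion is graded 1-4. */
final class RubricScore {

    /**
     * @param scores       one score (1-4) per criterion
     * @param minTotal     pass threshold for the sum, e.g. 14 for M1
     * @param minCriterion lowest score allowed on any single criterion, e.g. 2 for M1
     */
    static boolean passes(int[] scores, int minTotal, int minCriterion) {
        for (int s : scores) {
            if (s < 1 || s > 4) {
                throw new IllegalArgumentException("score out of range 1-4: " + s);
            }
        }
        int total = Arrays.stream(scores).sum();
        int lowest = Arrays.stream(scores).min().orElse(0);
        return total >= minTotal && lowest >= minCriterion;
    }

    public static void main(String[] args) {
        // 4+4+4+3+1 = 16 clears the total but fails the per-criterion floor.
        System.out.println(passes(new int[] {4, 4, 4, 3, 1}, 14, 2)); // false
        // 3+3+3+3+2 = 14 meets both conditions.
        System.out.println(passes(new int[] {3, 3, 3, 3, 2}, 14, 2)); // true
    }
}
```

For milestones that state only a total, pass `1` as `minCriterion` so the floor never binds.
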
M2 — Build and Test Literacy (end of Week 4)
You can navigate the codebase, build it, and run any test by name.
Skills
- Run a single test in any module via `mvn -pl <module> test -Dtest=Class#method`.
- Add a new test file to `tez-dag` and have it picked up by Maven.
- Read `TestVertexImpl` and explain at least 10 individual test methods.
- Identify the module of a class given just its FQN (e.g., `o.a.t.dag.app...` → `tez-dag`).
- Build Tez from a clean checkout in under 5 minutes (with cached deps).
- Distinguish unit tests from `MiniTezCluster`-backed integration tests.
Self-check questions
- Why does `tez-dag` depend on `tez-api` and not the reverse?
- What is `DrainDispatcher` and why do tests use it?
- Where do `MiniTezCluster` tests live and what classpath do they need?
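
For the `DrainDispatcher` question, it may help to see the idea in isolation: events are handled on a separate thread, so a test needs a way to block until the queue is empty before asserting. The following is a toy model of that idea only. `ToyDrainDispatcher` is a hypothetical class, and the real `DrainDispatcher` is structured differently.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy single-threaded dispatcher with a drain barrier; not the real DrainDispatcher. */
final class ToyDrainDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Object lock = new Object();
    private int pending; // events enqueued but not yet fully handled; guarded by lock

    ToyDrainDispatcher() {
        Thread t = new Thread(this::loop, "toy-dispatcher");
        t.setDaemon(true); // do not keep the JVM alive for the sketch
        t.start();
    }

    void dispatch(Runnable event) {
        synchronized (lock) { pending++; }
        queue.add(event);
    }

    /** Blocks until every event dispatched so far, including follow-ups, has been handled. */
    void await() {
        synchronized (lock) {
            try {
                while (pending > 0) lock.wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while draining", e);
            }
        }
    }

    // All handlers run on this one thread, which is what makes handler code race-free.
    private void loop() {
        try {
            while (true) {
                Runnable event = queue.take();
                try {
                    event.run(); // a throwing handler ends the loop; acceptable for a sketch
                } finally {
                    synchronized (lock) { pending--; lock.notifyAll(); }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ToyDrainDispatcher dispatcher = new ToyDrainDispatcher();
        AtomicInteger handled = new AtomicInteger();
        for (int i = 0; i < 1000; i++) {
            dispatcher.dispatch(handled::incrementAndGet);
        }
        dispatcher.await(); // without this, the read below would race the handler thread
        System.out.println(handled.get()); // 1000
    }
}
```

The `await()` barrier is what lets a test dispatch events and then assert on state deterministically.
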
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Build mastery | mvn install works | Can skip tests, profiles | Knows module deps | Diagnoses build failures |
| Test execution | Runs all tests | Runs a class | Runs a method | Runs cross-module |
| Test reading | Skims | Understands assertions | Understands setup | Recreates from scratch |
| Module map | Knows names | Knows top-level deps | Knows transitive deps | Diagnoses cycles |
| Tooling | IDE-only | CLI + IDE | CLI primary | CLI + scripting |
Pass threshold: 14/20.
Move to Level 3 when: you can clone Tez on a fresh laptop, build it, and run
a TestVertexImpl method by name within 15 minutes.
M3 — Submission and AM Bring-up (end of Week 6)
You can trace a DAG from TezClient.submitDAG() to DAGImpl.handle(...) inside
the AM.
Skills
- List the three local resources `TezClientUtils` uploads.
- Explain session vs non-session mode and the AM keep-alive mechanism.
- Name every AsyncDispatcher event handler registered in `DAGAppMaster`.
- Locate the line of code where `DAGImpl` is constructed inside the AM.
- Read AM logs at `DEBUG` and map lines to source positions.
- Run `MiniTezCluster` in your tests and inspect AM logs.
Self-check questions
- What RPC does `TezClient` use to submit a DAG? Which protocol class?
- How does the AM stay alive between DAGs in a session?
- What happens if the AM dies during a DAG run with recovery disabled?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Submission path | Vague | Knows TezClient API | Knows RPC | Knows full byte path |
| AM bring-up | Cannot describe | Names dispatcher | Names handlers | Walks serviceInit |
| Session model | Confused | Knows the flag | Knows keep-alive | Knows timeouts |
| Log reading | Greps blindly | Greps with intent | Maps to code | Predicts log line |
| Recovery | Unknown | Aware | Knows config keys | Knows record format |
Pass threshold: 14/20.
Move to Level 4 when: you can answer "where in the AM does my DAG show up?" with a file:line citation.
M4 — State Machines and VertexManager (end of Week 9)
You can read and modify the vertex/task/attempt state machines.
Skills
- Write a small `StateMachineFactory`-based state machine from scratch.
- Add a transition to `VertexImpl.stateMachineFactory` and update tests in the same patch.
- Implement a custom `VertexManagerPlugin` with a unit test.
- Diagnose an `InvalidStateTransitonException` from a stack trace.
- Distinguish `SingleArcTransition` from `MultipleArcTransition`.
- Explain the dispatcher single-threading invariant.
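
Before writing against the real `StateMachineFactory`, it can help to build the underlying idea with nothing but the JDK: a transition table keyed by (state, event), where an unregistered pair is an error. The sketch below is a toy under that assumption. The class, enum, and method names are illustrative and not Tez or YARN API, and it throws `IllegalStateException` where the real machinery reports an `InvalidStateTransitonException`.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy table-driven state machine; names are illustrative, not Tez or YARN API. */
final class ToyVertexStateMachine {
    enum State { NEW, RUNNING, SUCCEEDED }
    enum Event { START, COMPLETE }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

    private static void addTransition(State from, State to, Event on) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    static {
        addTransition(State.NEW, State.RUNNING, Event.START);
        addTransition(State.RUNNING, State.SUCCEEDED, Event.COMPLETE);
    }

    private State current = State.NEW;

    State getState() {
        return current;
    }

    /** Not thread-safe by design: callers are expected to invoke this from one dispatcher thread. */
    State handle(Event event) {
        Map<Event, State> arcs = TABLE.get(current);
        State next = (arcs == null) ? null : arcs.get(event);
        if (next == null) {
            // No arc registered for this (state, event) pair.
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        ToyVertexStateMachine sm = new ToyVertexStateMachine();
        System.out.println(sm.handle(Event.START));    // RUNNING
        System.out.println(sm.handle(Event.COMPLETE)); // SUCCEEDED
    }
}
```

Every arc here has exactly one target state, which is the single-arc case. A multiple-arc transition would compute the target at runtime from a set of allowed states.
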
Self-check questions
- Why must state-machine code be single-threaded? What breaks if not?
- What happens if you forget to register a transition for an event in a state?
- How does `ShuffleVertexManager` implement slow-start?
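
As a starting point for the slow-start question, the sketch below models a linear ramp between a minimum and a maximum fraction of completed source tasks. It assumes the behaviour controlled by `tez.shuffle-vertex-manager.min-src-fraction` and `tez.shuffle-vertex-manager.max-src-fraction`; the class and method names are hypothetical, and the real plugin takes more inputs (multiple source vertices, auto-parallelism), so verify the details against the source.

```java
/** Sketch of a linear slow-start ramp; names are illustrative and the real plugin has more inputs. */
final class SlowStartSketch {

    /**
     * @param completedSourceFraction fraction of upstream tasks finished, in [0, 1]
     * @param minFraction             below this, nothing downstream is scheduled
     * @param maxFraction             at or above this, everything downstream is scheduled
     * @return fraction of downstream tasks eligible to be scheduled, in [0, 1]
     */
    static double fractionToSchedule(double completedSourceFraction,
                                     double minFraction, double maxFraction) {
        if (minFraction > maxFraction) {
            throw new IllegalArgumentException("minFraction must not exceed maxFraction");
        }
        if (completedSourceFraction < minFraction) {
            return 0.0;
        }
        double range = maxFraction - minFraction;
        if (range <= 0.0) {
            return 1.0; // min == max: a step function instead of a ramp
        }
        double ramp = (completedSourceFraction - minFraction) / range;
        return Math.max(0.0, Math.min(1.0, ramp));
    }

    public static void main(String[] args) {
        // With a 0.25-0.75 window, half-complete sources unlock half the downstream tasks.
        System.out.println(fractionToSchedule(0.50, 0.25, 0.75)); // 0.5
    }
}
```
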
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| State machine | Knows it exists | Can read transitions | Can add transition | Can refactor safely |
| Test discipline | None | Adds happy path | Adds happy + sad | Updates per transition |
| VertexManager | Knows interface | Implements minimal | Implements custom | Implements + tests |
| Concurrency | Confused | Knows the rule | Knows why | Can audit a PR |
| Debugging | Reads stack | Maps to source | Reproduces locally | Writes regression test |
Pass threshold: 16/20 — this is the first hard gate.
Move to Level 5 when: you have submitted (or at minimum drafted) a state machine change that compiles, with a passing test.
M5 — Runtime and Shuffle (end of Week 11)
You can read the runtime data path and explain spill, merge, and fetch.
Skills
- Walk a single task's lifecycle: container start → processor.run() → output close.
- Explain `IFile` framing and the difference between V1 and V2.
- Distinguish `DefaultSorter`, `PipelinedSorter`, and unordered output.
- Diagnose a fetcher failure from logs.
- Read `ShuffleManager` and explain its scheduling of fetchers.
- Explain combiners and where they run in the pipeline.
Self-check questions
- What umbilical RPCs does a task make during its run?
- Where is the spill threshold checked?
- What triggers a `FAILED_FETCH` event upstream?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Runtime path | Names classes | Walks happy path | Walks failure paths | Walks edge cases |
| IFile | Knows format | Reads with hexdump | Modifies safely | Diagnoses corruption |
| Sorter | Names them | Knows tradeoffs | Picks for workload | Tunes configs |
| Shuffle | Vague | Knows pull model | Knows scheduling | Knows backoff |
| Combiner | Aware | Knows when run | Implements one | Debugs incorrect output |
Pass threshold: 15/20.
Move to Level 6 when: you can intentionally produce a fetcher failure on
MiniTezCluster and explain every log line.
M6 — Scheduling and Container Reuse (end of Week 12)
You understand how Tez decides where tasks run.
Skills
- Read `YarnTaskSchedulerService` and explain its scheduling loop.
- List the conditions under which a container is/is not reused.
- Explain affinity, locality, and racks.
- Tune `tez.am.container.reuse.*` for a given workload.
- Diagnose "stuck" scheduling.
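
To make the reuse conditions concrete, here is a deliberately simplified model of locality fallback for an idle container. It assumes the flags behave like `tez.am.container.reuse.enabled`, `tez.am.container.reuse.rack-fallback.enabled`, and `tez.am.container.reuse.non-local-fallback.enabled`, with `tez.am.container.reuse.locality.delay-allocation-millis` gating each step down in locality. `ReuseSketch` and `mayReuse` are hypothetical names, and the model ignores other conditions the scheduler applies (for example, resource and container-signature compatibility).

```java
/** Toy model of locality fallback for reusing a held container; not the Tez scheduler's code. */
final class ReuseSketch {
    enum Locality { NODE_LOCAL, RACK_LOCAL, NON_LOCAL }

    /**
     * Decides whether an idle container may take a task whose best available match is {@code match}.
     *
     * @param reuseEnabled     models tez.am.container.reuse.enabled
     * @param rackFallback     models tez.am.container.reuse.rack-fallback.enabled
     * @param nonLocalFallback models tez.am.container.reuse.non-local-fallback.enabled
     * @param delayElapsed     stands in for the locality delay having passed (an assumption of this sketch)
     */
    static boolean mayReuse(Locality match, boolean reuseEnabled, boolean rackFallback,
                            boolean nonLocalFallback, boolean delayElapsed) {
        if (!reuseEnabled) {
            return false;
        }
        switch (match) {
            case NODE_LOCAL:
                return true;
            case RACK_LOCAL:
                return rackFallback && delayElapsed;
            case NON_LOCAL:
                return nonLocalFallback && delayElapsed;
            default:
                throw new AssertionError(match);
        }
    }

    public static void main(String[] args) {
        // Rack-local reuse is refused until the locality delay has elapsed.
        System.out.println(mayReuse(Locality.RACK_LOCAL, true, true, false, false)); // false
        System.out.println(mayReuse(Locality.RACK_LOCAL, true, true, false, true));  // true
    }
}
```

When you read `YarnTaskSchedulerService`, check each branch of this model against the actual matching code and note where it diverges.
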
Self-check questions
- Why does Tez prefer to reuse containers over requesting new ones?
- What happens if `tez.am.container.idle.release-timeout-min.millis` is too low?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Reuse model | Aware | Knows conditions | Knows configs | Tunes for workload |
| Scheduling | Black box | Reads main loop | Reads matching | Reads + modifies |
| Locality | Aware | Knows hints | Knows fallback | Knows rack policy |
| Diagnostics | Guess-and-check | Reads AM logs | Reads + maps to code | Adds counters |
| YARN integration | Aware | Knows AMRM | Knows tokens | Knows failover |
Pass threshold: 14/20.
Move to Level 7 when: you can explain why container reuse is on by default and pick five workloads where you would tune it.
M7 — Integrations (end of Week 13)
You can read and modify the MapReduce shim and explain Hive-on-Tez at a high level.
Skills
- Write a DAG that uses `MRInput` reading from HDFS.
- Explain `MROutput` commit semantics.
- Sketch how Hive's `TezTask` builds a `DAG`.
- Identify which features Hive uses (custom edges, manager plugins, dynamic reconfig).
Self-check questions
- What does `MROutput.commit()` do, and what guarantees does it offer?
- Why does Hive use `ROOT_INPUT_INITIALIZER_FAILED` heavily in its bug fixes?
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| MR shim | Knows existence | Reads MRInput | Reads + uses | Modifies safely |
| Commit | Aware | Knows semantics | Knows failure modes | Knows speculative cleanup |
| Hive lens | Aware | Reads TezTask | Reads + maps | Diagnoses cross-project bug |
| Cross-project | Confused | Knows boundaries | Picks the right list | Files bug correctly |
Pass threshold: 12/16 (only 4 criteria here).
Move to Level 8 when: you can read a Hive query plan and predict its DAG.
M8 — Production Diagnostics (end of Week 14)
You can debug a real Tez job failure given logs and an ATS dump.
Skills
- Read a Tez counters dump and find a bottleneck.
- Find a `VertexImpl` failure cause from AM logs in <5 minutes.
- Read ATS events and reconstruct a DAG timeline.
- Identify a stuck task vs a slow task vs a failed task from counters.
- Build a one-pager triage runbook for your team.
Rubric
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Counters | Knows existence | Reads | Interprets | Tunes |
| Log triage | Greps | Maps to code | Maps to state | Predicts next event |
| ATS | Aware | Queries | Reads events | Cross-checks vs AM log |
| Runbook | None | Draft | Reviewed | Shipped to team |
| Speed | >30 min | ~15 min | <10 min | <5 min |
Pass threshold: 16/20.
Move to capstone when: you've helped someone (on chat, dev list, or internally) debug a real Tez issue successfully.
M9 — Capstone (end of Week 16)
You've shipped a patch.
Skills
- Selected an appropriate issue.
- Reproduced and root-caused.
- Implemented a fix with tests.
- Submitted a patch in the project's accepted format.
- Responded to at least one round of review feedback.
Rubric (20 points)
| Criterion | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Issue selection | Random | Scoped | Justified | Aligned to roadmap |
| Reproduction | None | Manual | Scripted | Added as a test |
| Root cause | Speculative | Localized | Cited | Explained in JIRA |
| Implementation | Compiles | Tests pass | Idiomatic | Minimal & focused |
| Submission | None | Draft | Submitted | Reviewed |
Pass threshold: 16/20, and the patch must compile and pass mvn verify on
the affected module.
Global Rubric (committer-readiness)
Use this every quarter, regardless of level, to self-assess.
| Dimension | 1 (Beginner) | 2 (Apprentice) | 3 (Practitioner) | 4 (Committer-ready) |
|---|---|---|---|---|
| Code | Reads | Modifies | Designs subsystem | Reviews others' changes |
| Testing | Runs tests | Adds tests | Writes regression suites | Drives test infra |
| Docs | Reads | Edits | Writes user-facing | Owns module-level docs |
| Integration | Single module | Cross-module | Cross-project (Hive) | Drives release decisions |
A committer-track contributor should be at level 3 on all four dimensions and level 4 on at least one. Aim for 3/3/3/3 → 4/3/3/4 by month 12 of focused contribution.