Step 10: Engineering Write-Up

The patch is merged. The JIRA is Resolved. Most contributors stop here. The ones who become committers write the post. The write-up is the artifact that travels with you when you change jobs, come up for a committer vote, or get cited by another contributor doing similar work.

Aim for eight hundred to a thousand words (the self-check at the end accepts anything from 500). Write most of it in the four hours right after merge, while the dead ends are still fresh.


Why It Matters

Three audiences:

  1. Future you. Six months from now you'll touch this code again and want to remember what you tried.
  2. The next contributor working a similar bug. They'll find your post via Google ("Tez vertex stuck RUNNING") and shortcut a week of work.
  3. The committers and PMC members evaluating you for a vote. They want to see that you can communicate engineering reasoning, not just produce diffs.

A good write-up is not a press release. It is a postmortem: honest about what you tried, including the failed approaches.


The Template

Sections in order, suggested word counts.

Title (one line)

Fixing TEZ-NNNN: <one-line technical summary>

Examples:

  • "Fixing TEZ-4567: A speculative-recovery short-circuit race in VertexImpl"
  • "Fixing TEZ-3982: Why our shuffle was 30% slower on small inputs"
  • "Fixing TEZ-2451: An off-by-one in MergeManager spill accounting"

Technical, specific. Not "My first Apache Tez contribution" — write that post separately on your blog. The engineering post stands on its own.

Problem (100–150 words)

What broke, for whom, under what conditions. Plain English, but precise.

Tez vertices configured with checkpoint-based recovery would intermittently
fail to transition to SUCCEEDED, leaving the DAG in RUNNING state until the
AM hit its global timeout. The bug only manifested when the application
master pre-populated recovery data at vertex initialization (rather than
lazily during an actual replay), which is the path used by long-running
Tez sessions reusing AMs across DAG submissions.

The symptom was a stalled DAG with all tasks reporting SUCCEEDED in the
counters but no DAGFinishedEvent in the AM log. Affected Tez 0.9.x and
0.10.0 onward.

State the symptom (what the user sees), the trigger condition (when it manifests), and the affected version range. No code yet.

Investigation Log (200–300 words)

The most valuable section. Walk through what you tried, including the hypotheses that were wrong.

Initial hypothesis was a task-scheduler bug — we suspected
TaskSchedulerManager was dropping a TASK_COMPLETED event under load.
DrainDispatcher-based reproducers in isolation showed no event loss, so
we ruled this out within a day.

Second hypothesis: a state-machine transition guard rejecting the final
event. Adding TRACE logging to VertexImpl confirmed V_TASK_COMPLETED was
arriving and being dispatched, but completedTaskCount remained one short
of total. This shifted attention from "the event is missing" to "the
event is processed but not by the expected handler."

Reading VertexImpl.handle(...) line by line revealed the recovery
short-circuit at line ~2400: `if (recoveryData != null) { handleRecovery(...); }`.
A git blame placed this in TEZ-2877 (commit a1b2c3d4), where the
assumption "non-null recoveryData implies active replay" was reasonable
at the time but became invalid when TEZ-3105 introduced speculative
recovery-data population at vertex init.

The actual race: V_TASK_COMPLETED for the final task arrived while
recoveryData was populated but no replay was in progress. An
isRecovering() check would have returned false at that moment, but the
guard had no such check.

Three to five hypotheses, in the order you tried them. Each with one sentence on what suggested it and one sentence on what disproved it. The dead ends are not embarrassments — they are the work, and they teach readers what not to spend a week on.

Root Cause (50–100 words)

One paragraph, the truth as you now understand it.

The vertex state machine's V_TASK_COMPLETED handler in the RUNNING state
short-circuited any event to handleRecovery() when recoveryData was non-null,
regardless of whether a recovery replay was actually in progress. Speculative
population of recoveryData at vertex initialization (TEZ-3105) made the
guard fire in normal execution, routing terminal events to the recovery
path which silently ignored them when not replaying. The completedTaskCount
counter never reached totalTaskCount, blocking the SUCCEEDED transition.

Cite the introducing JIRA. Cite the bisect commit if you have it.

Final Design (150–200 words)

What you actually changed and why this design over alternatives.

The fix introduces an isReplayingRecovery() predicate that returns true
only when a recovery replay is in flight (tracked by an existing
RecoveryState flag in DAGAppMaster). The short-circuit is gated on this
predicate:

  if (recoveryData != null && isReplayingRecovery()) { ... }

This is a one-line production change plus a four-line predicate method.
It preserves all behavior for actual recovery scenarios and corrects the
behavior only for the speculatively-populated case.

Show the diff size and the principle ("minimum surface area"). Note any public API impact (here: none).
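
For readers who have never opened VertexImpl, a toy model of the guard can make the one-line change easier to follow than the diff alone. The sketch below is not Tez source: `ToyVertex`, its fields, and `runToCompletion` are hypothetical names invented for illustration, and the model simplifies the bug so that every completion event is swallowed rather than only the final one. Only the shape of the condition mirrors the gated short-circuit shown above.

```java
// Toy model of the guard change. NOT Tez source; every name here is hypothetical.
class RecoveryGuardSketch {

    static final class ToyVertex {
        private final Object recoveryData;     // non-null when pre-populated at init
        private final boolean replayInFlight;  // true only during an actual replay
        private final boolean gateOnReplay;    // false = original guard, true = fixed guard
        private final int totalTasks;
        private int completedTasks;

        ToyVertex(Object recoveryData, boolean replayInFlight,
                  boolean gateOnReplay, int totalTasks) {
            this.recoveryData = recoveryData;
            this.replayInFlight = replayInFlight;
            this.gateOnReplay = gateOnReplay;
            this.totalTasks = totalTasks;
        }

        boolean isReplayingRecovery() {
            return replayInFlight;
        }

        void handleTaskCompleted() {
            // Original guard: any non-null recoveryData diverts the event.
            // Fixed guard: divert only while a replay is actually running.
            boolean shortCircuit = gateOnReplay
                    ? recoveryData != null && isReplayingRecovery()
                    : recoveryData != null;
            if (shortCircuit) {
                handleRecovery();
                return;
            }
            completedTasks++;
        }

        private void handleRecovery() {
            // In this model the recovery path silently ignores events when no
            // replay is running, as the root-cause paragraph describes.
        }

        boolean succeeded() {
            return completedTasks == totalTasks;
        }
    }

    /** Runs three completions with pre-populated recovery data and no replay. */
    static boolean runToCompletion(boolean gateOnReplay) {
        ToyVertex vertex = new ToyVertex(new Object(), false, gateOnReplay, 3);
        for (int i = 0; i < 3; i++) {
            vertex.handleTaskCompleted();
        }
        return vertex.succeeded();
    }

    public static void main(String[] args) {
        System.out.println("original guard reaches SUCCEEDED: " + runToCompletion(false));
        System.out.println("fixed guard reaches SUCCEEDED:    " + runToCompletion(true));
    }
}
```

A model like this belongs in the post only if it is shorter than the explanation it replaces; the real diff and the JIRA link remain the authoritative record.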

Alternatives Considered (100–150 words)

Two to three alternatives you rejected, with the reason.

**Alternative 1: stop populating recoveryData speculatively at vertex init.**
Rejected: TEZ-3105 documented performance reasons for the eager population
(avoids a stall when actual recovery kicks in). Reverting it would
regress that path.

**Alternative 2: have handleRecovery() forward the event back to the
standard transition when not replaying.** Rejected: it works, but couples
the recovery path to internal knowledge of which events the standard
transition needs. The gate-at-source approach is local and reviewable.

**Alternative 3: remove the short-circuit entirely and let handleRecovery()
no-op when not replaying.** Rejected: changes the semantics of every other
event flowing through the recovery path, with broader behavioral risk for
a narrowly-scoped bug.

This is the section that separates contributor-quality write-ups from committer-quality ones. Anyone can ship a fix. Articulating why this fix and not the obvious alternatives demonstrates engineering judgment.

Performance / Behavior Impact (50–100 words)

If perf-relevant, numbers from Step 7. Otherwise, one sentence:

No measurable performance impact. The new predicate adds a single field
read on a hot path (VertexImpl.handle), next to the null check the
original short-circuit already performed on every event. Validated via
TestOrderedWordCount runtime: no statistically significant change across
10 runs.
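
One way to back a phrase like "no statistically significant change" is to compute a test statistic over the before and after runtimes instead of eyeballing the averages. The sketch below computes Welch's t statistic in plain Java. The class name and the runtimes in `main` are illustrative placeholders, not measurements from any Tez run, and the rule of thumb in the comment is an approximation; a proper test would consult the t distribution for the actual degrees of freedom.

```java
// Illustrative only: the runtimes below are made-up placeholders.
class RuntimeComparisonSketch {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double sampleVariance(double[] xs) {
        if (xs.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return sumSq / (xs.length - 1);
    }

    /** Welch's t statistic for two samples with possibly unequal variances. */
    static double welchT(double[] a, double[] b) {
        double standardError = Math.sqrt(
                sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    public static void main(String[] args) {
        // Placeholder runtimes in seconds for 10 runs before and after the patch.
        double[] before = {41.2, 40.8, 41.5, 40.9, 41.1, 41.4, 40.7, 41.0, 41.3, 41.2};
        double[] after  = {41.0, 41.1, 41.3, 40.8, 41.2, 41.5, 40.9, 41.1, 41.2, 41.0};
        // Rough reading: for samples of this size, |t| well below about 2 is
        // usually taken as no evidence of a difference at the 5% level.
        System.out.printf("Welch t = %.3f%n", welchT(before, after));
    }
}
```

Reporting the statistic alongside the run count lets a reviewer check the claim rather than take it on trust.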

Lessons Learned (100–150 words)

The transferable insights, written for a peer. Things you would tell yourself before starting.

- Recovery code in Tez has always been the sharpest edge: it is the
  least-tested path because it only runs during AM failover, and most
  developer environments don't trigger it. When a bug touches recovery
  data flow, assume the test coverage is thin and add reproducers
  aggressively.
- `git log -S` (the pickaxe search) and `git bisect` together were
  decisive: bisect found the introducing commit (TEZ-2877), and a pickaxe
  search on the changed expression showed it had never had a guard.
  Without bisect this would have been a week of code archaeology.
- DrainDispatcher in TestVertexImpl is underused. The repro test for this
  bug took two hours to write once I learned the pattern, and it is now
  permanent regression protection.

Three to five bullets. Concrete enough that a peer at another project could apply them.
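
When a lesson names a testing pattern, as the DrainDispatcher bullet above does, a few lines showing the pattern make it reusable by someone outside the project. The sketch below illustrates the general drain-then-assert idea with a hypothetical `ToyDrainDispatcher` built on a single-threaded executor. It is not the Hadoop or Tez class, and the real DrainDispatcher API may differ; the point is only that the test blocks until queued events are handled before it asserts.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class DrainPatternSketch {

    /** Hypothetical stand-in for a draining dispatcher; not the real class. */
    static final class ToyDrainDispatcher<E> implements AutoCloseable {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();
        private final Consumer<E> handler;

        ToyDrainDispatcher(Consumer<E> handler) {
            this.handler = handler;
        }

        /** Queues the event; the handler runs later on the worker thread. */
        void dispatch(E event) {
            worker.submit(() -> handler.accept(event));
        }

        /** Blocks until every event dispatched before this call has been handled. */
        void await() {
            try {
                // The single worker runs tasks in FIFO order, so once this marker
                // task finishes, everything queued ahead of it has finished too.
                worker.submit(() -> { }).get(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while draining", e);
            } catch (ExecutionException | TimeoutException e) {
                throw new IllegalStateException("drain failed", e);
            }
        }

        @Override
        public void close() {
            worker.shutdownNow();
        }
    }

    /** Dispatches the given number of events, drains, then reads the counter. */
    static int completedAfterDrain(int events) {
        AtomicInteger completed = new AtomicInteger();
        try (ToyDrainDispatcher<String> dispatcher =
                     new ToyDrainDispatcher<>(e -> completed.incrementAndGet())) {
            for (int i = 0; i < events; i++) {
                dispatcher.dispatch("V_TASK_COMPLETED");
            }
            dispatcher.await();  // without this, the read below would race the worker
            return completed.get();
        }
    }

    public static void main(String[] args) {
        System.out.println("completed after drain: " + completedAfterDrain(4));
    }
}
```

The design choice worth stating in the post is the explicit drain step: it turns an asynchronous, timing-dependent check into a deterministic one.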

Links

- JIRA: https://issues.apache.org/jira/browse/TEZ-NNNN
- PR: https://github.com/apache/tez/pull/<NNN>
- Merged commit: <SHA>
- Introducing commit (TEZ-2877): <SHA>

Where to Publish

Three venues, in roughly decreasing order of effort and impact.

1. Personal blog or company engineering blog

Full ~1000-word write-up. SEO-friendly title with the JIRA number and a keyword phrase users would search for ("Tez vertex stuck RUNNING fix"). Link prominently to JIRA and PR. This is the version that follows you across jobs.

2. Apache wiki / Tez documentation

Shorter version (300–500 words) focused on the lesson, not the personal narrative. Filed under a relevant page (recovery troubleshooting, debugging state machines). Requires wiki access — committers will grant it once you have a few merged contributions.

3. dev@ summary email

A two- to three-paragraph summary on dev@tez.apache.org with the subject line "[TEZ-NNNN] Notes on the fix". It lets watchers and the PMC see the engineering reasoning without having to read the whole PR. Optional, but it earns goodwill.

Subject: [TEZ-NNNN] Notes on the fix

Hi all,

Merged TEZ-NNNN this morning. Quick notes on the investigation since
recovery bugs are uncommon and the root cause was a non-obvious
interaction with TEZ-3105:

<2 paragraphs of summary>

Full write-up: <link to blog post>

Thanks again to [~alice] for the review.

Anti-Patterns

What separates write-ups that help from ones that don't:

  • "I learned a lot working on this!" — Yes, we know. Cut it. The artifact is the engineering, not the feel-good.
  • Personal narrative dominating the engineering. Save the "my journey into open source" angle for a separate post. Engineering posts get cited and reread. Narrative posts get one-time clicks.
  • Sanitized version where you "knew the answer all along." Nobody believes this and it actively misleads new contributors who feel inadequate when their investigation is messy. Be honest about the dead ends.
  • No code snippets. A write-up without showing the actual diff or the symptomatic log line is unfalsifiable.
  • No links. JIRA, PR, commit — all three minimum. A write-up without the JIRA link is unreviewable.
  • Word-padding to look thorough. A tight 600-word write-up that respects the reader beats a 2000-word slog every time.

Validation / Self-check

Before declaring the Capstone complete:

  1. The write-up is published at a URL you can share (blog, GitHub Gist, capstone-work/writeup.md in a public repo).
  2. It is 500–1000 words; not 200 (too thin) and not 3000 (padding).
  3. Investigation Log section contains at least two hypotheses you ruled out, not only the winning one.
  4. Alternatives Considered section names at least two designs you rejected with reasons.
  5. Lessons Learned section has three to five bullets, each concrete enough to be reusable by another contributor.
  6. JIRA, PR, and merged-commit SHA are all linked.
  7. The write-up reads as something a peer engineer would respect, not a triumphalist blog post.
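
The mechanical items in this checklist (word count and required links) can be checked with a short script before you publish. The sketch below is one possible version, assuming the write-up is a plain-text or Markdown file. The class name and the regular expressions are assumptions based on the URL shapes in the link template above, and the judgment items (hypotheses, alternatives, tone) still need a human read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WriteupSelfCheckSketch {

    // URL shapes assumed from the link template; adjust for your project.
    private static final Pattern JIRA_LINK =
            Pattern.compile("https://issues\\.apache\\.org/jira/browse/TEZ-\\d+");
    private static final Pattern PR_LINK =
            Pattern.compile("https://github\\.com/apache/tez/pull/\\d+");

    /** Counts whitespace-separated tokens; a rough proxy for word count. */
    static int wordCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Returns a list of failed mechanical checks; empty means all passed. */
    static List<String> problems(String text) {
        List<String> found = new ArrayList<>();
        int words = wordCount(text);
        if (words < 500 || words > 1000) {
            found.add("word count " + words + " is outside 500-1000");
        }
        if (!JIRA_LINK.matcher(text).find()) {
            found.add("missing JIRA link");
        }
        if (!PR_LINK.matcher(text).find()) {
            found.add("missing PR link");
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WriteupSelfCheckSketch.java <writeup-file>");
            return;
        }
        List<String> found = problems(Files.readString(Path.of(args[0])));
        if (found.isEmpty()) {
            System.out.println("mechanical checks passed");
        } else {
            found.forEach(System.out::println);
        }
    }
}
```

The merged-commit SHA is left out of the automated checks because its format in a post varies (full SHA, short SHA, or a commit URL).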