Stage 12 — Release-Blocking Issues

What this stage teaches

Stage 12 is the committer/PMC stage. You learn:

  • The four categories of release blockers: data loss, correctness regressions, AM crash, security CVE.
  • How to triage a candidate blocker during an RC vote: what evidence is required, who must be CC'd, and what the deadline-pressure tradeoffs are.
  • The Apache release process from a committer's seat: building an RC, signing artifacts, calling a [VOTE] thread, the 72-hour rule, and the meaning of +1 binding, -1 binding, +1, and 0 votes.
  • The Tez release notes format and what a release blocker contributes to it.
  • Security CVE handling: the private security@ list, embargoed disclosure, and the path from private patch to public release.

This is the only stage where you may be voting on someone else's work as much as writing your own. The patch surface is identical to earlier stages; the context in which you act is different.

JIRA filter to find candidates

project = TEZ
  AND priority in (Blocker, Critical)
  AND resolution = Unresolved
ORDER BY priority DESC, updated DESC

The set is usually small. During an RC vote it can grow quickly.

A second filter for the RC voting period:

project = TEZ AND priority = Blocker AND created > -7d
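
The RC-window filter can also be generated and run from the command line. The following is a minimal sketch, not part of the Tez tooling: the function name rc_blocker_jql is hypothetical, and the usage comment assumes the standard Jira REST search endpoint on issues.apache.org is reachable from your machine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the RC-window JQL for a project and a
# look-back period in days (default 7, matching the filter above).
rc_blocker_jql() {
  local project="$1" days="${2:-7}"
  printf 'project = %s AND priority = Blocker AND created > -%sd\n' \
    "$project" "$days"
}

# Usage (assumes network access; not run here):
#   curl -sG 'https://issues.apache.org/jira/rest/api/2/search' \
#     --data-urlencode "jql=$(rc_blocker_jql TEZ 7)" \
#     --data-urlencode 'fields=key,summary'
rc_blocker_jql TEZ 7
```

Letting curl do the URL encoding (-G with --data-urlencode) avoids hand-escaping the spaces and the ">" in the JQL.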

The four categories of release blockers

1. Data loss

The strictest category. Any code path where a successfully-acknowledged write can be lost, or a successfully-acknowledged read can return wrong data, is a data-loss blocker. Examples in Tez history:

  • A MergeManager spill that double-counted records and silently dropped one.
  • A Fetcher that ignored a checksum mismatch and returned corrupted bytes to the downstream processor.
  • A DAGRecovery path that reconstructed an incorrect parent vertex state after AM restart.

Triage: the JIRA description must contain a deterministic repro that the release manager can run in under five minutes. Without a repro, the issue is not a blocker — it is a "to be investigated" ticket.

2. Correctness regressions

A query that returned correct results in version N-1 returns wrong results in version N. The severity is lower than data loss (the data is still there; only the output is wrong), but the triage requirement is the same: a deterministic repro. Even a regression that affects only a single Hive query path is a blocker.

3. AM crash

Any reproducible InvalidStateTransitonException in master is a blocker during an RC. Operators expect the AM to survive their workload. An AM crash on a Hive-emitted DAG that worked in the previous release blocks the RC even if the DAG itself is "unusual" — the AM must be defensive against its inputs.

4. Security CVE

A demonstrated CVE in a Tez-owned class is a blocker regardless of whether it has been exploited. The disclosure path is security@tez.apache.org first, then the public JIRA only after the fix is ready.

Triage during an RC vote

The RC vote pattern on dev@:

Subject: [VOTE] Release Apache Tez 0.10.4 (RC1)

Hi,

I've prepared the first release candidate for Tez 0.10.4. The artifacts
are at:
  https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/

The git tag is:
  https://github.com/apache/tez/releases/tag/release-0.10.4-rc1

The release notes are:
  CHANGES.txt at the top of the tag.

Please verify the signatures, run the smoke tests, and vote:
  [+1] release this RC
  [0]  no opinion
  [-1] do not release (please explain)

The vote is open for 72 hours.

Your job, as a contributor evaluating the RC:

  1. Verify the artifact:
    curl -O https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/apache-tez-0.10.4-src.tar.gz
    curl -O https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/apache-tez-0.10.4-src.tar.gz.asc
    gpg --verify apache-tez-0.10.4-src.tar.gz.asc apache-tez-0.10.4-src.tar.gz
    
  2. Build from source:
    tar xf apache-tez-0.10.4-src.tar.gz
    cd apache-tez-0.10.4-src
    mvn clean install -DskipTests -Phadoop28
    
  3. Run a smoke test:
    mvn -pl tez-tests test -Dtest=TestExternalTezServices -Phadoop28
    
  4. Reply on the vote thread with your evidence.
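
Step 1 above checks the signature only. Many voters also compare the tarball against a published SHA-512 digest. The following is a minimal sketch under stated assumptions: the helper name verify_sha512 is hypothetical, GNU coreutils sha512sum is assumed to be installed, and the .sha512 file name in the usage comment is an assumption about what the RC directory contains.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a file's SHA-512 against an expected hex digest.
# Assumes GNU coreutils `sha512sum` is on the PATH.
verify_sha512() {
  local file="$1" expected="$2" actual
  actual=$(sha512sum "$file" | cut -d' ' -f1)
  # Normalise case: published digests are sometimes upper-case hex.
  actual=$(printf '%s' "$actual" | tr 'A-F' 'a-f')
  expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
  if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}

# Usage, assuming a .sha512 file whose first field is the hex digest:
#   verify_sha512 apache-tez-0.10.4-src.tar.gz \
#     "$(cut -d' ' -f1 apache-tez-0.10.4-src.tar.gz.sha512)"
```

A digest match shows the download is intact; only the gpg check ties the artifact to the release manager's key, so the two checks complement each other.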

Vote semantics

Vote          Meaning
+1 binding    PMC member endorses the release. At least three are required.
+1            Non-PMC endorsement. Counts for momentum, not toward the binding count.
0             No opinion. Often used to indicate "I built it and the smoke test passed, but I can't speak to my use case."
-1 binding    PMC member objects. Under ASF policy a release vote cannot be vetoed: it passes with at least three binding +1s and more binding +1s than binding -1s. In practice the release manager usually cancels the RC when a binding -1 is well substantiated.
-1            Non-PMC objection. Not binding, but committers will read it.

A -1 vote must include the reason. "Build failed" is not enough; "build failed because X test fails reproducibly on Hadoop 3.x profile, evidence at URL" is.
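
When a thread gets long, it helps to tally the votes by kind. The following is a minimal sketch: the function name tally_votes and the one-vote-per-line input format are hypothetical conventions for illustration. It only counts votes and checks the three-binding-+1 threshold stated above; interpreting any -1 is left to the release manager.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally votes read from stdin, one per line, written as
#   "<vote> <binding|non-binding>"   e.g. "+1 binding" or "0 non-binding".
# Exit status is 0 when at least three binding +1 votes were seen.
tally_votes() {
  local vote kind plus_b=0 minus_b=0 plus_nb=0 minus_nb=0 zero=0
  while read -r vote kind; do
    case "$vote/$kind" in
      +1/binding)     plus_b=$((plus_b + 1)) ;;
      -1/binding)     minus_b=$((minus_b + 1)) ;;
      +1/non-binding) plus_nb=$((plus_nb + 1)) ;;
      -1/non-binding) minus_nb=$((minus_nb + 1)) ;;
      0/*)            zero=$((zero + 1)) ;;
    esac
  done
  echo "binding: +1=$plus_b -1=$minus_b | non-binding: +1=$plus_nb -1=$minus_nb | 0=$zero"
  [ "$plus_b" -ge 3 ]
}

# Usage: feed the votes you have read off the thread.
printf '%s\n' "+1 binding" "+1 binding" "+1 binding" "-1 non-binding" \
  | tally_votes && echo "binding threshold met"
```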

Walked example — discovering a blocker during RC vote

Symptom: during the 0.10.4 RC1 vote, you run the smoke test and observe a test failure in TestShuffleManager#testReadErrorReportDebounce that did not happen in 0.10.3.

Step 1 — Reproduce

cd apache-tez-0.10.4-src
for i in 1 2 3; do
  mvn -pl tez-runtime-library test \
    -Dtest=TestShuffleManager#testReadErrorReportDebounce -q 2>&1 | tail -5
done

If the test fails 3/3, the failure is reproducible. If it fails only intermittently (1/3 or 2/3), treat it as a flake (a Stage 9 issue, not a blocker).
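
The three-run check above can be wrapped so that the verdict is computed rather than judged by eye. This is a minimal sketch; the function name classify_failure is hypothetical, and it treats any non-zero exit status of the command as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command N times and classify the outcome.
# Prints "reproducible" (every run failed), "flaky" (some runs failed),
# or "passing" (no run failed), with the failure count.
classify_failure() {
  local runs="$1" failures=0 i=0
  shift
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  if [ "$failures" -eq "$runs" ]; then
    echo "reproducible ($failures/$runs)"
  elif [ "$failures" -gt 0 ]; then
    echo "flaky ($failures/$runs)"
  else
    echo "passing (0/$runs)"
  fi
}

# Usage with the test from this example:
#   classify_failure 3 mvn -pl tez-runtime-library test \
#     -Dtest=TestShuffleManager#testReadErrorReportDebounce -q
```

The helper discards the command's output, so rerun the failing command by hand once to capture the stack trace for the JIRA.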

Step 2 — Identify the cause

git log v0.10.3..release-0.10.4-rc1 -- \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle

You see a commit that changed the debounce window default from 5000ms to 500ms. The test was written against 5000ms; the change silently broke it.

Step 3 — Decide blocker vs not

A failing unit test in an RC is not automatically a blocker. The question is: does the underlying behaviour change affect production?

  • If the default change is intentional and the test should be updated → not a blocker. Fix the test in a 0.10.4 hotfix or in 0.10.5.
  • If the default change is unintentional or it breaks production users → blocker. RC1 must be cancelled; RC2 reverts the default change.

For this example, suppose the default change was intentional but the release notes don't mention it. The behaviour change is operator-visible: fetch-failure reports can now arrive up to 10x more often and may overwhelm the AM event queue. That makes it a blocker for a different reason than the test failure: an undocumented behaviour change.
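
The 10x figure follows directly from the two window sizes. The following is a minimal sketch of the arithmetic, assuming a simplified model in which the cooldown allows at most one report per window per source; the function name max_reports_per_minute is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical model: a cooldown of W ms allows at most one report
# per W ms per source, so the per-minute ceiling is 60000 / W.
max_reports_per_minute() {
  local cooldown_ms="$1"
  echo $((60000 / cooldown_ms))
}

old=$(max_reports_per_minute 5000)
new=$(max_reports_per_minute 500)
echo "old=$old/min new=$new/min ratio=$((new / old))x"
# prints: old=12/min new=120/min ratio=10x
```

This is an upper bound per source, not a measurement. The real load on the AM event queue depends on how many fetchers are failing at once.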

Step 4 — Vote and document

Subject: Re: [VOTE] Release Apache Tez 0.10.4 (RC1)

[-1] non-binding

While building the RC and running the smoke tests, I observed:
  TestShuffleManager#testReadErrorReportDebounce fails 3/3 runs.

Root cause: commit <hash> changed the default of
tez.runtime.shuffle.fetch-failure.report.cooldown-ms from 5000 to 500.
This is an operator-visible behaviour change not noted in CHANGES.txt.

Recommendation: either revert the default in RC2 with the new default
deferred to 0.11.0, or keep the new default and update CHANGES.txt to
flag the operator impact and update the test.

Filed TEZ-XXXX with the analysis.

The release manager will respond, either by cancelling RC1 and cutting an RC2 that fixes the issue (rebuild, vote again) or by arguing on the thread why the change is acceptable.

Release notes

The Tez release notes live in CHANGES.txt at the repo root, organised by release. The format:

Release 0.10.4 - 2026-XX-XX

  NEW FEATURES:
    TEZ-XXXX. Sharded AsyncDispatcher for high-fanout DAGs. (you)

  IMPROVEMENTS:
    TEZ-YYYY. Make DAGPlan size limit configurable. (you)

  BUG FIXES:
    TEZ-ZZZZ. Release held containers on AMRM onError. (you)

  INCOMPATIBLE CHANGES:
    TEZ-AAAA. Default of tez.runtime.shuffle.fetch-failure.report.cooldown-ms
              changed from 5000 to 500. Operators of long-running session AMs
              should evaluate AM event-queue capacity. (you)

Every patch that lands during the release cycle gets a line. The release manager assembles the file from the JIRA "Fix Version" field; contributors make the lines short and accurate.
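
To keep your own lines consistent with the format above, you can generate them instead of typing them. This is a minimal sketch; the function name format_change_line is hypothetical and is not a Tez release script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format one CHANGES.txt entry as
#   "    TEZ-NNNN. Summary. (author)"
format_change_line() {
  local key="$1" summary="$2" author="$3"
  summary="${summary%.}"   # drop one trailing period so the output never has two
  printf '    %s. %s. (%s)\n' "$key" "$summary" "$author"
}

format_change_line TEZ-ZZZZ "Release held containers on AMRM onError" you
# prints: "    TEZ-ZZZZ. Release held containers on AMRM onError. (you)"
```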

Security CVE pipeline

The path from "I think I found a CVE" to a public release:

  1. Do not file a public JIRA. Email security@tez.apache.org (the private list, monitored by PMC members).
  2. Wait for acknowledgement (typically within 48 hours).
  3. Work with the security responder on a fix privately, in a private branch.
  4. Once the fix is ready, request a CVE ID via the Apache security team (or MITRE via the responder).
  5. Build a release that includes the fix.
  6. Publish the release; then the CVE is disclosed publicly with a JIRA.

The embargo window is typically 30–90 days. Contributors who report through the private channel and respect the embargo are credited in the advisory.

Pitfalls

  • Don't +1 a release you have not built and smoke-tested. A +1 carries weight; do not give it as a courtesy.
  • Don't -1 without evidence. A substantiated -1 usually leads the release manager to cancel the RC, so the bar for evidence is high.
  • Don't escalate a Stage 9 flake to a blocker. Reproduce three times before voting.
  • Don't disclose a security vulnerability publicly before the embargo expires. Apache projects take this very seriously; a leak can cost you committer status.
  • Don't file Priority: Blocker casually. Reserve it for the four categories above. JIRA pollution diminishes the signal.
  • Don't merge a "must-have" fix during an active RC vote without cancelling the RC first. The artifact under vote does not contain the fix, so shipping it requires a new RC and a fresh 72-hour vote.
  • Don't assume the release manager will catch your concern silently. Vote on the thread, even if it is only a 0 with a comment.

Exit criteria — there is no next stage

Stage 12 is the final rung of this roadmap. The exit criterion is that you continue — you are now operating as a committer-track contributor. The next steps are not stages but ongoing practices:

  • Participate in every RC vote with a built artifact and a smoke-test result, even if your vote is only a 0.
  • Watch the security@ and dev@ lists daily.
  • Mentor a new contributor through Stages 1–4 every year.
  • Read every CHANGES.txt diff for every release line you care about.
  • Send a quarterly note to dev@ on which areas of the codebase you are willing to review, so contributors know where to ask.

If you have walked all twelve stages, you are the Apache Tez committer the project needed when you started reading this book.