Stage 12 — Release-Blocking Issues
What this stage teaches
Stage 12 is the committer/PMC stage. You learn:
- The four categories of release blockers: data loss, correctness regressions, AM crash, security CVE.
- How to triage a candidate blocker during an RC vote: what evidence is required, who must be CC'd, and what the deadline-pressure tradeoffs are.
- The Apache release process from a committer's seat: building an RC, signing artifacts, calling a [VOTE] thread, the 72-hour rule, and the meaning of +1 binding, -1 binding, +1, and 0 votes.
- The Tez release notes format and what a release blocker contributes to it.
- Security CVE handling: the private security@ list, embargoed disclosure, and the path from private patch to public release.
This is the only stage where you may be voting on someone else's work as much as writing your own. The patch surface is identical to earlier stages; the context in which you act is different.
JIRA filter to find candidates
project = TEZ
AND priority in (Blocker, Critical)
AND resolution = Unresolved
ORDER BY priority DESC, updated DESC
The set is small at any given time. During an RC vote it grows fast.
A second filter for the RC voting period:
project = TEZ AND priority = Blocker AND created > -7d
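During a vote it can be convenient to run these filters from a terminal. The following is a minimal sketch, assuming the Apache JIRA instance at issues.apache.org exposes the standard JIRA REST search endpoint (/rest/api/2/search) to anonymous users. The function names are hypothetical; the JQL text is copied from the filters above.

```shell
# Hypothetical helpers that print the two JQL filters from the text.
open_blockers_jql() {
  printf '%s' 'project = TEZ AND priority in (Blocker, Critical) AND resolution = Unresolved ORDER BY priority DESC, updated DESC'
}

recent_blockers_jql() {
  printf '%s' 'project = TEZ AND priority = Blocker AND created > -7d'
}

# Run a JQL query against a JIRA REST search endpoint and print the JSON reply.
# Assumption: the endpoint is reachable without authentication.
jira_search() {
  # $1 = JQL string, $2 = JIRA base URL (defaults to the Apache instance)
  curl -sS -G "${2:-https://issues.apache.org/jira}/rest/api/2/search" \
    --data-urlencode "jql=$1" \
    --data-urlencode "fields=key,summary,priority" \
    --data-urlencode "maxResults=50"
}
```

For example, jira_search "$(recent_blockers_jql)" prints the matching issues as JSON, which you can pipe through any JSON formatter you have installed.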
The four categories of release blockers
1. Data loss
The strictest category. Any code path where a successfully-acknowledged write can be lost, or a successfully-acknowledged read can return wrong data, is a data-loss blocker. Examples in Tez history:
- A MergeManager spill that double-counted records and silently dropped one.
- A Fetcher that ignored a checksum mismatch and returned corrupted bytes to the downstream processor.
- A DAGRecovery path that reconstructed an incorrect parent vertex state after AM restart.
Triage: the JIRA description must contain a deterministic repro that the release manager can run in under five minutes. Without a repro, the issue is not a blocker — it is a "to be investigated" ticket.
2. Correctness regressions
A query that returned correct results in version N-1 returns wrong results in version N. The severity is lower than data loss (the data is still there; only the output is wrong), but the triage is the same: a deterministic repro is required. Even a correctness regression that affects only a single Hive query path is a blocker.
3. AM crash
Any reproducible InvalidStateTransitonException in master is a blocker
during an RC. Operators expect the AM to survive their workload. An AM
crash on a Hive-emitted DAG that worked in the previous release blocks the
RC even if the DAG itself is "unusual" — the AM must be defensive against
its inputs.
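One way to collect evidence for this category is to scan a saved AM log for the exception. This is only a sketch, assuming the log has already been copied to a local file; the function name is hypothetical, and the exception name is spelled exactly as in the text above.

```shell
# Count the lines in a saved AM log that mention the exception.
# Prints 0 when there are none.
count_invalid_transitions() {
  # $1 = path to a saved AM log file
  grep -c 'InvalidStateTransitonException' "$1" || true
}
```

A non-zero count is a starting point, not a verdict: you still need a repro on the RC and a comparison against the previous release.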
4. Security CVE
A demonstrated CVE in a Tez-owned class is a blocker regardless of whether
it has been exploited. The disclosure path is security@tez.apache.org
first, then the public JIRA only after the fix is ready.
Triage during an RC vote
The RC vote pattern on dev@:
Subject: [VOTE] Release Apache Tez 0.10.4 (RC1)
Hi,
I've prepared the first release candidate for Tez 0.10.4. The artifacts
are at:
https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/
The git tag is:
https://github.com/apache/tez/releases/tag/release-0.10.4-rc1
The release notes are:
CHANGES.txt at the top of the tag.
Please verify the signatures, run the smoke tests, and vote:
[+1] release this RC
[0] no opinion
[-1] do not release (please explain)
The vote is open for 72 hours.
Your job, as a contributor evaluating the RC:
- Verify the artifact:
  curl -O https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/apache-tez-0.10.4-src.tar.gz
  curl -O https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/apache-tez-0.10.4-src.tar.gz.asc
  gpg --verify apache-tez-0.10.4-src.tar.gz.asc apache-tez-0.10.4-src.tar.gz
- Build from source:
  tar xf apache-tez-0.10.4-src.tar.gz
  cd apache-tez-0.10.4-src
  mvn clean install -DskipTests -Phadoop28
- Run a smoke test:
  mvn -pl tez-tests test -Dtest=TestExternalTezServices -Phadoop28
- Reply on the vote thread with your evidence.
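The commands above check the signature only. Apache release candidates normally ship a checksum file next to each artifact as well, so a voter may also want to compare digests. The sketch below assumes a checksum file in sha512sum format ("<hex digest>  <file name>") has been downloaded alongside the tarball; that file and the function name are assumptions, not something the steps above guarantee.

```shell
# Compare a file's SHA-512 digest with the digest recorded in a checksum file.
# Assumption: the checksum file is in `sha512sum` format, "<hex digest>  <name>".
verify_sha512() {
  # $1 = artifact, $2 = checksum file
  if command -v sha512sum >/dev/null 2>&1; then
    actual=$(sha512sum "$1" | awk '{print $1}')
  else
    # Fallback for systems that ship shasum instead of sha512sum.
    actual=$(shasum -a 512 "$1" | awk '{print $1}')
  fi
  expected=$(awk 'NR == 1 {print tolower($1)}' "$2")
  # Succeeds only when a digest was found and it matches.
  [ -n "$expected" ] && [ "$actual" = "$expected" ]
}
```

Note also that gpg --verify can only report a good signature from a known key if the release manager's public key is in your keyring; importing the project's published KEYS file with gpg --import before verifying is the usual way to get it there.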
Vote semantics
| Vote | Meaning |
|---|---|
| +1 binding | PMC member endorses release. Three are required for release. |
| +1 | Non-PMC endorses. Counts for momentum, not the binding count. |
| 0 | No opinion. Often used to indicate "I built it, smoke test passed, but I can't speak to my use case." |
| -1 binding | PMC member objects. ASF policy decides release votes by majority, so a single -1 is not formally a veto, but in practice a substantiated -1 binding usually leads the release manager to cancel the RC. |
| -1 | Non-PMC objection. Not binding, but committers will read it. |
A -1 vote must include the reason. "Build failed" is not enough; "build
failed because X test fails reproducibly on Hadoop 3.x profile, evidence at
URL" is.
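The counting rule can be sketched as a small tally. This assumes the ASF release-vote rule of at least three binding +1 votes and more binding +1 than binding -1 votes; in practice a release manager will usually cancel on a well-evidenced -1 long before the arithmetic matters. The input format (one vote per line, "<vote> <binding|non-binding>") and the function name are hypothetical.

```shell
# Tally votes read from stdin, one per line: "<vote> <binding|non-binding>".
# Only binding votes count towards the result; non-binding votes are ignored.
tally_votes() {
  awk '
    $2 == "binding" && $1 == "+1" { plus++ }
    $2 == "binding" && $1 == "-1" { minus++ }
    END {
      result = (plus >= 3 && plus > minus) ? "PASS" : "FAIL"
      printf "binding +1: %d, binding -1: %d, result: %s\n", plus, minus, result
    }
  '
}
```

Three binding +1 votes plus any number of non-binding votes give PASS; two binding +1 votes give FAIL no matter how many non-binding +1 votes accompany them.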
Walked example — discovering a blocker during RC vote
Symptom: during the 0.10.4 RC1 vote, you run the smoke test and observe a
test failure in TestShuffleManager#testReadErrorReportDebounce that did
not happen in 0.10.3.
Step 1 — Reproduce
cd apache-tez-0.10.4-src
for i in 1 2 3; do
mvn -pl tez-runtime-library test \
-Dtest=TestShuffleManager#testReadErrorReportDebounce -q 2>&1 | tail -5
done
If the failure is 3/3, it is reproducible. If it fails in only some runs (for example 1/3), it is a flake (Stage 9 issue, not a blocker).
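The same classification can be automated. The sketch below is a hypothetical helper, not part of the Tez build: it runs any command a fixed number of times and uses the command's own exit status, rather than reading the tail of the output by eye.

```shell
# Run a command N times and classify the outcome by exit status.
# Assumes N >= 1.
classify_failures() {
  # $1 = number of runs, remaining arguments = the command to run
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  if [ "$failures" -eq "$runs" ]; then
    echo "reproducible ($failures/$runs)"
  elif [ "$failures" -gt 0 ]; then
    echo "flaky ($failures/$runs)"
  else
    echo "passing (0/$runs)"
  fi
}
```

For the test in this example you would call it as classify_failures 3 mvn -pl tez-runtime-library test -Dtest=TestShuffleManager#testReadErrorReportDebounce -q and paste the one-line result into your reply on the vote thread.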
Step 2 — Identify the cause
git log v0.10.3..release-0.10.4-rc1 -- \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped
You see a commit that changed the debounce window default from 5000ms to 500ms. The test was written against 5000ms; the change silently broke it.
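When the range contains many commits, git's pickaxe option (-S) can narrow the search to commits whose diff adds or removes a given string, such as the old default value. A minimal sketch follows; the function name is hypothetical, and the string to search for is whatever literal you expect to have changed.

```shell
# List commits in a revision range whose diff changes the number of
# occurrences of a literal string (git's "pickaxe" search).
commits_changing_string() {
  # $1 = revision range, $2 = literal string, $3 = optional path to restrict to
  git log --oneline -S"$2" "$1" -- ${3:+"$3"}
}
```

Run inside the source checkout, commits_changing_string v0.10.3..release-0.10.4-rc1 5000 tez-runtime-library would list candidate commits. A short literal such as 5000 can also match unrelated values, so read each hit before drawing a conclusion.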
Step 3 — Decide blocker vs not
A failing unit test in an RC is not automatically a blocker. The question is: does the underlying behaviour change affect production?
- If the default change is intentional and the test should be updated → not a blocker. Fix the test in a 0.10.4 hotfix or in 0.10.5.
- If the default change is unintentional or it breaks production users → blocker. RC1 must be cancelled; RC2 reverts the default change.
For this example, suppose the default change was intentional but the release notes don't mention it. The behaviour change is operator-visible (fetch-failure reports can now arrive up to 10x more often and may overwhelm the AM event queue). That makes it a blocker for a different reason than the test failure: an undocumented behaviour change.
Step 4 — Vote and document
Subject: Re: [VOTE] Release Apache Tez 0.10.4 (RC1)
[-1] non-binding
While building the RC and running the smoke tests, I observed:
TestShuffleManager#testReadErrorReportDebounce fails 3/3 runs.
Root cause: commit <hash> changed the default of
tez.runtime.shuffle.fetch-failure.report.cooldown-ms from 5000 to 500.
This is an operator-visible behaviour change not noted in CHANGES.txt.
Recommendation: either revert the default in RC2 with the new default
deferred to 0.11.0, or keep the new default and update CHANGES.txt to
flag the operator impact and update the test.
Filed TEZ-XXXX with the analysis.
The release manager will respond, either by cancelling RC1 and cutting an RC2 that fixes the issue (rebuild, vote again) or by arguing why the change is acceptable.
Release notes
The Tez release notes live in CHANGES.txt at the repo root, organised by
release. The format:
Release 0.10.4 - 2026-XX-XX
NEW FEATURES:
TEZ-XXXX. Sharded AsyncDispatcher for high-fanout DAGs. (you)
IMPROVEMENTS:
TEZ-YYYY. Make DAGPlan size limit configurable. (you)
BUG FIXES:
TEZ-ZZZZ. Release held containers on AMRM onError. (you)
INCOMPATIBLE CHANGES:
TEZ-AAAA. Default of tez.runtime.shuffle.fetch-failure.report.cooldown-ms
changed from 5000 to 500. Operators of long-running session AMs
should evaluate AM event-queue capacity. (you)
Every patch that lands during the release cycle gets a line. The release manager assembles the file from the JIRA "Fix Version" field; contributors make the lines short and accurate.
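When reviewing an RC you can check mechanically whether a behaviour change is mentioned in the right section. The sketch below assumes the layout shown above: section headers are upper-case lines ending in a colon, and each release starts with a line beginning "Release ". Both function names are hypothetical.

```shell
# Print the body of one section of a CHANGES.txt-style file.
# Assumption: headers look like "INCOMPATIBLE CHANGES:" (capitals, spaces, colon).
changes_section() {
  # $1 = file, $2 = section header without the trailing colon
  awk -v header="$2:" '
    /^Release / { in_section = 0 }
    /^[ \t]*[A-Z][A-Z ]*:[ \t]*$/ {
      line = $0
      gsub(/^[ \t]+|[ \t]+$/, "", line)
      in_section = (line == header)
      next
    }
    in_section { print }
  ' "$1"
}

# Succeed if the given section contains the given literal string.
section_mentions() {
  # $1 = file, $2 = section header, $3 = literal string to look for
  changes_section "$1" "$2" | grep -F -q -- "$3"
}
```

For the walked example, section_mentions CHANGES.txt "INCOMPATIBLE CHANGES" "fetch-failure.report.cooldown-ms" failing on RC1 is exactly the evidence the -1 reply cites.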
Security CVE pipeline
The path from "I think I found a CVE" to a public release:
- Do not file a public JIRA. Email security@tez.apache.org (the private list, monitored by PMC members).
- Wait for acknowledgement (typically within 48 hours).
- Work with the security responder on a fix privately, in a private branch.
- Once the fix is ready, request a CVE ID via the Apache security team (or MITRE via the responder).
- Build a release that includes the fix.
- Publish the release; then the CVE is disclosed publicly with a JIRA.
The embargo window is typically 30–90 days. Contributors who report through the private channel and respect the embargo are credited in the advisory.
Pitfalls
- Don't +1 a release you have not built and smoke-tested. A +1 carries weight; do not give it as a courtesy.
- Don't -1 without evidence. A -1 blocks the release; the bar for evidence is high.
- Don't escalate a Stage 9 flake to a blocker. Reproduce three times before voting.
- Don't disclose a security vulnerability publicly before the embargo expires. Apache projects take this very seriously; a leak can lose you committer status.
- Don't file Priority: Blocker casually. Reserve it for the four categories above. JIRA pollution diminishes the signal.
- Don't merge a "must-have" fix during an active RC vote without cancelling the RC first. Mid-vote merges invalidate the artifact and reset the 72-hour clock.
- Don't assume the release manager will catch your concern silently. Vote on the thread, even if just to 0 with a comment.
Exit criteria — there is no next stage
Stage 12 is the final rung of this roadmap. The exit criterion is that you continue — you are now operating as a committer-track contributor. The next steps are not stages but ongoing practices:
- Participate in every RC vote with a built artifact and a smoke-test result, even just 0.
- Watch the security@ and dev@ lists daily.
- Mentor a new contributor through Stages 1–4 every year.
- Read every CHANGES.txt diff for every release line you care about.
- Send a quarterly note to dev@ on which areas of the codebase you are willing to review, so contributors know where to ask.
If you have walked all twelve stages, you are the Apache Tez committer the project needed when you started reading this book.