Stage 10 — Performance Regressions
What this stage teaches
Stage 10 is where you stop fixing bugs and start measuring. You learn:
- The Tez perf-regression workflow: identify symptom, git bisect to the culprit commit, profile under load, attribute the cost, ship a fix with before/after numbers.
- Microbenchmarking with tez-examples/OrderedWordCount as the canonical small DAG. When that is too coarse, JMH at the call-site level.
- Profilers: async-profiler for CPU/lock contention, JFR for allocation/GC pressure. When to use which.
- The two perf hotspots most often blamed first: AsyncDispatcher queue contention and IFile record encoding.
- How to file a perf-regression JIRA that committers take seriously: numbers, methodology, reproducibility, and a fix bounded in scope.
Patches are 30–300 lines, always with benchmark evidence in the JIRA. A performance patch without numbers is a no-op.
JIRA filter to find candidates
project = TEZ
AND resolution = Unresolved
AND (text ~ "performance regression" OR text ~ "slow"
OR text ~ "contention" OR text ~ "allocation"
OR labels = "performance")
ORDER BY priority DESC, updated DESC
A second source is the dev@ archive — search for "slowdown" or "regression" in the last six months. Operators often report perf issues without filing a JIRA. The first contribution is filing the JIRA with a repro.
The Tez perf-regression workflow
1. Reproduce the regression with a number
Never start a perf investigation with a vibe. Get a number:
cd ~/tez-src
mvn -pl tez-examples -am clean install -DskipTests -Phadoop28 -q
# Then run OrderedWordCount end-to-end on MiniTezCluster
mvn -pl tez-tests test -Dtest=TestExternalTezServices#testOrderedWordCount -q
For a more isolated benchmark, write a JMH micro. First check whether any module already depends on JMH:
find ~/tez-src -name "pom.xml" -exec grep -l jmh {} \;
If JMH is not in the test pom, add it scoped to test only, never to compile.
2. git bisect to the culprit commit
Suppose the regression is "OrderedWordCount on a 10-node MiniTezCluster went from 12s to 19s between 0.10.2 and 0.10.3":
cd ~/tez-src
git bisect start
git bisect bad 0.10.3
git bisect good 0.10.2
# Each step:
mvn clean install -DskipTests -Phadoop28 -q
mvn -pl tez-tests test -Dtest=TestExternalTezServices#testOrderedWordCount -q
# Record the wall time. Then:
git bisect good # or 'git bisect bad'
Twenty commits between two minor releases means about ceil(log2(20)) = 5 bisect steps.
Bisect to the single commit, then read its diff. Often the commit is innocent
and the regression is in a sibling commit interacting with it; bisect is the
start of the investigation, not the end.
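The halving behind that step count can be sketched in Java. This illustrates the search only and is not git's implementation: the class name, the 15-second threshold, and the per-commit wall times are made up, and it assumes the regression is monotonic (once a commit is slow, every later commit stays slow).

```java
import java.util.function.IntPredicate;

/** Illustrative only: mimics how bisect narrows a linear commit range. */
final class BisectSketch {
    /** Number of commits tested by the most recent search. */
    static int steps;

    /**
     * Returns the index of the first bad commit in [0, n), assuming commits
     * 0..k-1 are good and k..n-1 are bad for some k < n.
     */
    static int firstBad(int n, IntPredicate isBad) {
        steps = 0;
        int lo = 0;      // lowest index that could still be the first bad commit
        int hi = n - 1;  // known bad (the regressed release)
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            steps++;
            if (isBad.test(mid)) {
                hi = mid;      // culprit is mid or earlier
            } else {
                lo = mid + 1;  // culprit is after mid
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Hypothetical wall times in seconds for 20 commits; commit 13 regresses.
        double[] wallTime = new double[20];
        for (int i = 0; i < wallTime.length; i++) {
            wallTime[i] = i < 13 ? 12.0 : 19.0;
        }
        int culprit = firstBad(wallTime.length, i -> wallTime[i] > 15.0);
        System.out.println("first bad commit index: " + culprit + ", steps: " + steps);
    }
}
```

In practice the good/bad predicate is the noisy part: a run that sits near the threshold should be repeated before it is marked either way.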
3. Profile under load
Once you suspect a region of code, profile:
# async-profiler: CPU samples
$ASYNC_PROFILER/profiler.sh -d 60 -f /tmp/dag.html -e cpu <AM-pid>
# JFR: GC + allocation
jcmd <AM-pid> JFR.start name=tez duration=60s filename=/tmp/dag.jfr
Profile the AM, not the submitting client. The AM is the long-running process where contention manifests.
For a per-task profile:
// In a one-off test only — never in production code
conf.set(TezConfiguration.TEZ_TASK_LAUNCH_CMD_OPTS,
"-agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=/tmp/task-%p.jfr");
4. Attribute the cost
Read the flame graph. A single fat frame above the noise floor is your target. Most Tez regressions land in one of three buckets:
- Lock contention on AsyncDispatcher.eventQueue or VertexImpl.writeLock.
- Allocation pressure from IFile.Writer or MergeManager building short-lived buffers in a tight loop.
- GC overhead from a long-lived collection that grows unbounded (e.g. a HashMap keyed by TaskAttemptId that is never pruned).
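The third bucket is usually the simplest to fix. The sketch below shows the general shape, assuming a hypothetical tracker class and plain String attempt IDs rather than any real Tez type: the entry is removed when the attempt reaches a terminal state, so the map is bounded by in-flight attempts instead of by everything the DAG has ever run.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker; attempt IDs are plain strings to keep the sketch self-contained. */
final class AttemptTracker {
    private final Map<String, Long> startNanos = new ConcurrentHashMap<>();

    void onAttemptStarted(String attemptId) {
        startNanos.put(attemptId, System.nanoTime());
    }

    /**
     * Removes the entry on completion and returns the elapsed nanoseconds,
     * or -1 if the attempt was never recorded.
     */
    long onAttemptFinished(String attemptId) {
        Long start = startNanos.remove(attemptId);
        return start == null ? -1L : System.nanoTime() - start;
    }

    int liveEntries() {
        return startNanos.size();
    }

    public static void main(String[] args) {
        AttemptTracker tracker = new AttemptTracker();
        for (int i = 0; i < 10_000; i++) {
            String id = "attempt_" + i;
            tracker.onAttemptStarted(id);
            tracker.onAttemptFinished(id);
        }
        System.out.println("entries after 10k attempts: " + tracker.liveEntries());
    }
}
```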
5. Ship a fix with numbers
A Stage 10 JIRA description must include:
Methodology:
- Hardware: 16-core M3 Pro, 32GB RAM.
- Command: mvn -pl tez-tests test -Dtest=...
- Runs: 5 cold, 10 warm, report median + p95.
- Hadoop profile: hadoop28.
Before (TEZ master at <hash>): median 19.0s, p95 22.1s.
After (this patch on top): median 12.4s, p95 13.7s.
Profile evidence: flame graph attached. AsyncDispatcher.handle was 38% CPU
before, 4% after.
A reviewer will ask for the profile artifact. Attach it.
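The median and p95 figures in the methodology block can be computed with a few lines of Java. This is a minimal sketch: the class name and sample values are made up, and the percentile uses the nearest-rank definition, which is one of several common conventions. State which one you used in the JIRA.

```java
import java.util.Arrays;

/** Illustrative helper for summarising benchmark wall times. */
final class RunStats {
    /** Median of the samples; averages the two middle values when the count is even. */
    static double median(double[] samples) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical warm-run wall times in seconds.
        double[] warmRuns = {12.4, 12.1, 12.6, 13.7, 12.3, 12.2, 12.5, 12.4, 12.8, 12.3};
        System.out.println("median=" + median(warmRuns) + " p95=" + percentile(warmRuns, 95));
    }
}
```

With only ten warm runs, the nearest-rank p95 is simply the slowest run, so treat it as a rough tail indicator and add runs if the tail matters to the argument.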
Walked example A — AsyncDispatcher queue contention
Symptom: AM throughput collapses on DAGs with > 10k tasks. Profile shows roughly 40%
of CPU in AsyncDispatcher.handle under LinkedBlockingQueue.put.
Step 1 — Diagnose
cd ~/tez-src
grep -n "LinkedBlockingQueue\|eventQueue" \
tez-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java
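(The class is technically Hadoop's AsyncDispatcher, but Tez subclasses and
configures it in tez-common.) The event queue is multi-producer, single-consumer:
many threads enqueue events while one dispatcher thread drains them. That pattern
would benefit from a partitioned queue keyed by event type.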
Step 2 — The fix surface
Two acceptable approaches:
- Sharded dispatcher: partition events by destination ID so each shard has its own queue. Tez has the building blocks but not the wiring; the patch is the wiring.
- Batched event submission: collect events on the producer side and submit in groups, reducing lock acquisitions per task.
Both are large patches. The Stage 10 contribution is one of them, with a clear scope: "sharded dispatcher for vertex events only", not "rewrite AsyncDispatcher".
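To make the sharded approach concrete, here is a minimal sketch of the routing idea. It is not Tez's or Hadoop's dispatcher API: the class, its methods, and the String destination IDs are all hypothetical. The property it illustrates is that events for the same destination always land on the same shard, so per-destination ordering survives even though global ordering across shards does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch of a sharded dispatcher; one queue and one thread per shard. */
final class ShardedDispatcherSketch<E> {
    /** Sentinel telling a shard thread to exit once earlier events are handled. */
    private static final Object POISON = new Object();

    private final List<BlockingQueue<Object>> queues = new ArrayList<>();
    private final List<Thread> workers = new ArrayList<>();

    ShardedDispatcherSketch(int shards, Consumer<E> handler) {
        for (int i = 0; i < shards; i++) {
            BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
            Thread worker = new Thread(() -> drain(queue, handler), "shard-" + i);
            queues.add(queue);
            workers.add(worker);
            worker.start();
        }
    }

    /** Routes by destination hash, so one destination maps to exactly one shard. */
    void dispatch(String destinationId, E event) {
        int shard = Math.floorMod(destinationId.hashCode(), queues.size());
        queues.get(shard).add(event); // unbounded queue: add never blocks
    }

    /** Handles everything already queued, then stops the shard threads. */
    void shutdown() {
        for (BlockingQueue<Object> queue : queues) {
            queue.add(POISON);
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @SuppressWarnings("unchecked")
    private static <E> void drain(BlockingQueue<Object> queue, Consumer<E> handler) {
        try {
            for (Object next = queue.take(); next != POISON; next = queue.take()) {
                handler.accept((E) next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Each destination is only ever touched by its own shard thread.
        Map<String, StringBuilder> seen = new ConcurrentHashMap<>();
        ShardedDispatcherSketch<Map.Entry<String, Integer>> dispatcher =
            new ShardedDispatcherSketch<Map.Entry<String, Integer>>(4,
                ev -> seen.computeIfAbsent(ev.getKey(), k -> new StringBuilder()).append(ev.getValue()));
        for (int seq = 0; seq < 5; seq++) {
            for (String dest : new String[] {"vertex_1", "vertex_2", "task_7"}) {
                dispatcher.dispatch(dest, Map.entry(dest, seq));
            }
        }
        dispatcher.shutdown();
        System.out.println(seen); // every destination should read 01234
    }
}
```

The sketch deliberately gives up global ordering, which is exactly the kind of constraint to raise on dev@ before writing the real patch.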
Step 3 — Numbers
For the sharded-dispatcher patch on a 10k-task OrderedWordCount:
Before: 19.0s median, 22.1s p95.
After: 12.4s median, 13.7s p95.
AsyncDispatcher.handle: 38% → 4% CPU.
These numbers go into the JIRA description, with a flame graph attached.
Step 4 — dev@ design ping
Any Stage 10 patch above ~50 lines deserves a dev@ thread:
Subject: [DISCUSS] TEZ-XXXX — shard AsyncDispatcher by destination type
I have a repro for AM throughput collapse on 10k-task DAGs. Profile attached.
Proposed fix: shard the AsyncDispatcher event queue by destination type
(Vertex / Task / TaskAttempt / Container). Numbers: 19s -> 12s median.
Open questions:
1. Default shard count: I propose 4 with a configurable override.
2. Compat: AsyncDispatcher is org.apache.hadoop, so we shim in tez-common.
3. Tests: TestAsyncDispatcher + the existing scheduler integration tests.
Comments welcome before I post the patch.
If a committer flags an unexpected constraint (e.g. "we cannot shard because ATS event ordering depends on global sequence"), redesign before coding.
Walked example B — IFile record encoding hot path
Symptom: profile shows 22% CPU in IFile.Writer.append under
WritableUtils.writeVInt. Allocation profile shows two byte[] per record.
Diagnose:
cd ~/tez-src
grep -n "writeVInt\|writeVLong\|new byte\[" \
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
The hot path allocates a fresh byte[] per record for VInt encoding. The fix
is a reusable scratch buffer per Writer instance:
+ private final byte[] vIntBuf = new byte[9];
+
public void append(DataInputBuffer key, DataInputBuffer value) throws IOException {
- byte[] scratch = new byte[9];
- int n = encodeVInt(key.getLength(), scratch);
- out.write(scratch, 0, n);
+ int n = encodeVInt(key.getLength(), vIntBuf);
+ out.write(vIntBuf, 0, n);
...
}
The patch is six lines. The justification is the JMH micro:
JMH benchmark: IFile.Writer.append for 1M small records.
Before: 14.2 us/op, 32B/op allocation.
After: 8.7 us/op, 0B/op allocation.
This is a textbook Stage 10 patch: small, measurable, attributable.
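The scratch-buffer pattern from the diff can be tried in isolation with the sketch below. It is an assumption-laden stand-in, not IFile code: the class is hypothetical, and encodeVarint uses an LEB128-style encoding rather than Hadoop's WritableUtils VInt format. What it shares with the patch is that the length prefix is encoded into one buffer owned by the writer instead of a fresh byte[] per record.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical writer; LEB128-style varints, NOT Hadoop's WritableUtils VInt format. */
final class ScratchBufferSketch {
    // Reused for every record; 5 bytes is enough for any 32-bit varint.
    private final byte[] scratch = new byte[5];
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    /** Encodes value into buf as an unsigned LEB128 varint and returns the byte count. */
    static int encodeVarint(int value, byte[] buf) {
        int n = 0;
        while ((value & ~0x7F) != 0) {
            buf[n++] = (byte) ((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
            value >>>= 7;
        }
        buf[n++] = (byte) value;
        return n;
    }

    /** Writes a length-prefixed record without allocating a new prefix buffer per call. */
    void append(byte[] key) {
        int n = encodeVarint(key.length, scratch);
        out.write(scratch, 0, n);
        out.write(key, 0, key.length);
    }

    byte[] toByteArray() {
        return out.toByteArray();
    }

    public static void main(String[] args) {
        ScratchBufferSketch writer = new ScratchBufferSketch();
        for (int i = 0; i < 1_000; i++) {
            writer.append(new byte[] {1, 2, 3});
        }
        System.out.println("bytes written: " + writer.toByteArray().length);
    }
}
```

Sharing one scratch buffer makes each writer instance unsafe to use from several threads at once, so a patch like this should state that a writer is confined to a single thread, or verify that it already is.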
Pitfalls
- Don't ship a perf patch without numbers. Reviewers will reject it. "Looks faster" is not evidence.
- Don't benchmark on the same machine you developed on without warm-up. Always run cold + warm passes; report median + p95.
- Don't compare across different Hadoop profiles. Pick one profile and hold it constant.
- Don't widen the scope of a perf patch mid-review. "I found another hotspot while I was here" → new JIRA.
- Don't use micro-benchmark numbers in isolation. Always show the end-to-end impact too. A 2x improvement in IFile.Writer.append that yields 0.1% end-to-end improvement may not be worth merging.
- Don't git bisect against a tree with unrelated WIP. git bisect is deterministic only against a clean tree.
- Don't profile in production without the operator's consent. Even async-profiler has overhead; the operator should know.
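The gap between a micro-benchmark win and its end-to-end effect can be estimated with Amdahl's law before running the full DAG. The sketch below is a rough estimate under stated assumptions: the class name is hypothetical, and it treats the hot method's share of CPU samples as its share of wall time, which only holds when the job is CPU-bound on that path.

```java
/** Back-of-envelope estimate of end-to-end gain from a local speedup (Amdahl's law). */
final class EndToEndImpact {
    /**
     * Overall speedup when a fraction of total runtime becomes
     * localSpeedup times faster and the rest is unchanged.
     */
    static double overallSpeedup(double fraction, double localSpeedup) {
        return 1.0 / ((1.0 - fraction) + fraction / localSpeedup);
    }

    public static void main(String[] args) {
        // Figures from walked example B: 22% of CPU in append, 14.2 us/op before, 8.7 us/op after.
        double fraction = 0.22;
        double localSpeedup = 14.2 / 8.7;
        double overall = overallSpeedup(fraction, localSpeedup);
        System.out.printf("estimated end-to-end speedup: %.3fx%n", overall);
    }
}
```

The estimate is a sanity check, not evidence: the JIRA still needs the measured end-to-end numbers.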
Exit criteria — when you're ready for the next stage
Move to Stage 11 when:
- You have shipped one perf patch with documented before/after numbers and an attached profile.
- You can git bisect 20 commits without referring to documentation.
- You have read at least one async-profiler flame graph for Tez and identified the hotspot without help.
- A committer has accepted your patch's methodology section as sufficient evidence.
Stage 11 takes you into the compatibility contract.