Stage 10 — Performance Regressions

What this stage teaches

Stage 10 is where you stop fixing bugs and start measuring. You learn:

The Tez perf-regression workflow: identify symptom, git bisect to the culprit commit, profile under load, attribute the cost, ship a fix with before/after numbers.
Microbenchmarking with tez-examples/OrderedWordCount as the canonical small DAG. When that is too coarse, JMH at the call-site level.
Profilers: async-profiler for CPU/lock contention, JFR for allocation/GC pressure. When to use which.
The two perf hotspots most often blamed first: AsyncDispatcher queue contention and IFile record encoding.
How to file a perf-regression JIRA that committers take seriously: numbers, methodology, reproducibility, and a fix bounded in scope.

Patches are 30–300 lines, always with benchmark evidence in the JIRA. A performance patch without numbers is a no-op.

JIRA filter to find candidates

project = TEZ
  AND resolution = Unresolved
  AND (text ~ "performance regression" OR text ~ "slow"
       OR text ~ "contention" OR text ~ "allocation"
       OR labels = "performance")
ORDER BY priority DESC, updated DESC

A second source is the dev@ archive — search for "slowdown" or "regression" in the last six months. Operators often report perf issues without filing a JIRA. The first contribution is filing the JIRA with a repro.

The Tez perf-regression workflow

1. Reproduce the regression with a number

Never start a perf investigation with a vibe. Get a number:

cd ~/tez-src
mvn -pl tez-examples -am clean install -DskipTests -Phadoop28 -q
# Then run OrderedWordCount end-to-end on MiniTezCluster
mvn -pl tez-tests test -Dtest=TestExternalTezServices#testOrderedWordCount -q

For a more isolated benchmark, write a JMH micro:

find ~/tez-src -name "pom.xml" -exec grep -l jmh {} \;

If JMH is not in the test pom, add it scoped to test only — never to compile.

2. `git bisect` to the culprit commit

Suppose the regression is "OrderedWordCount on a 10-node MiniTezCluster went from 12s to 19s between 0.10.2 and 0.10.3":

cd ~/tez-src
git bisect start
git bisect bad 0.10.3
git bisect good 0.10.2

# Each step:
mvn clean install -DskipTests -Phadoop28 -q
mvn -pl tez-tests test -Dtest=TestExternalTezServices#testOrderedWordCount -q
# Record the wall time. Then:
git bisect good   # or 'git bisect bad'

Twenty commits between two minor releases means log2(20) ≈ 5 bisect steps. Bisect to the single commit, then read its diff. Often the commit is innocent and the regression is in a sibling commit interacting with it; bisect is the start of the investigation.

3. Profile under load

Once you suspect a region of code, profile:

# async-profiler: CPU samples
$ASYNC_PROFILER/profiler.sh -d 60 -f /tmp/dag.html -e cpu <AM-pid>

# JFR: GC + allocation
jcmd <AM-pid> JFR.start name=tez duration=60s filename=/tmp/dag.jfr

Profile the AM, not the submitting client. The AM is the long-running process where contention manifests.

For a per-task profile:

// In a one-off test only — never in production code
conf.set(TezConfiguration.TEZ_TASK_LAUNCH_CMD_OPTS,
    "-agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=/tmp/task-%p.jfr");

4. Attribute the cost

Read the flame graph. A single fat frame above the noise floor is your target. Most Tez regressions land in one of three buckets:

Lock contention on AsyncDispatcher.eventQueue or VertexImpl.writeLock.
Allocation pressure from IFile.Writer or MergeManager building short-lived buffers in a tight loop.
GC overhead from a long-lived collection that grows unbounded (e.g. a HashMap keyed by TaskAttemptId that is never pruned).

5. Ship a fix with numbers

A Stage 10 JIRA description must include:

Methodology:
  - Hardware: 16-core M3 Pro, 32GB RAM.
  - Command: mvn -pl tez-tests test -Dtest=...
  - Runs: 5 cold, 10 warm, report median + p95.
  - Hadoop profile: hadoop28.

Before (TEZ master at <hash>): median 19.0s, p95 22.1s.
After  (this patch on top):    median 12.4s, p95 13.7s.

Profile evidence: flame graph attached. AsyncDispatcher.handle was 38% CPU
before, 4% after.

A reviewer will ask for the profile artifact. Attach it.

Walked example A — `AsyncDispatcher` queue contention

Symptom: AM throughput collapses on DAGs with > 10k tasks. Profile shows 40% of CPU is in AsyncDispatcher.handle under LinkedBlockingQueue.put.

Step 1 — Diagnose

cd ~/tez-src
grep -n "LinkedBlockingQueue\|eventQueue" \
  tez-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java

(The class is technically Hadoop's AsyncDispatcher, but Tez subclasses and configures it in tez-common.) Single-producer multi-consumer would benefit from a partitioned queue keyed by event type.

Step 2 — The fix surface

Two acceptable approaches:

Sharded dispatcher: partition events by destination ID so each shard has its own queue. Tez has the building blocks but not the wiring; the patch is the wiring.
Batched event submission: collect events on the producer side and submit in groups, reducing lock acquisitions per task.

Both are large patches. The Stage 10 contribution is one of them, with a clear scope: "sharded dispatcher for vertex events only", not "rewrite AsyncDispatcher".

Step 3 — Numbers

For the sharded-dispatcher patch on a 10k-task OrderedWordCount:

Before: 19.0s median, 22.1s p95.
After:  12.4s median, 13.7s p95.
AsyncDispatcher.handle: 38% → 4% CPU.

These numbers go into the JIRA description, with a flame graph attached.

Step 4 — dev@ design ping

Any Stage 10 patch above ~50 lines deserves a dev@ thread:

Subject: [DISCUSS] TEZ-XXXX — shard AsyncDispatcher by destination type

I have a repro for AM throughput collapse on 10k-task DAGs. Profile attached.
Proposed fix: shard the AsyncDispatcher event queue by destination type
(Vertex / Task / TaskAttempt / Container). Numbers: 19s -> 12s median.

Open questions:
  1. Default shard count: I propose 4 with a configurable override.
  2. Compat: AsyncDispatcher is org.apache.hadoop, so we shim in tez-common.
  3. Tests: TestAsyncDispatcher + the existing scheduler integration tests.

Comments welcome before I post the patch.

If a committer flags an unexpected constraint (e.g. "we cannot shard because ATS event ordering depends on global sequence"), redesign before coding.

Walked example B — `IFile` record encoding hot path

Symptom: profile shows 22% CPU in IFile.Writer.append under WritableUtils.writeVInt. Allocation profile shows two byte[] per record.

Diagnose:

cd ~/tez-src
grep -n "writeVInt\|writeVLong\|new byte\[" \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java

The hot path allocates a fresh byte[] per record for VInt encoding. The fix is a reusable scratch buffer per Writer instance:

+  private final byte[] vIntBuf = new byte[9];
+
   public void append(DataInputBuffer key, DataInputBuffer value) throws IOException {
-    byte[] scratch = new byte[9];
-    int n = encodeVInt(key.getLength(), scratch);
-    out.write(scratch, 0, n);
+    int n = encodeVInt(key.getLength(), vIntBuf);
+    out.write(vIntBuf, 0, n);
     ...
   }

The patch is six lines. The justification is the JMH micro:

JMH benchmark: IFileWriter.append for 1M small records.
Before: 14.2 us/op, 32B/op allocation.
After:   8.7 us/op,  0B/op allocation.

This is a textbook Stage 10 patch: small, measurable, attributable.

Pitfalls

Don't ship a perf patch without numbers. Reviewers will reject it. "Looks faster" is not evidence.
Don't benchmark on the same machine you developed on without warm-up. Always run cold + warm passes; report median + p95.
Don't compare across different Hadoop profiles. Pick one profile and hold it constant.
Don't widen the scope of a perf patch mid-review. "I found another hotspot while I was here" → new JIRA.
Don't use micro-benchmark numbers in isolation. Always show the end-to-end impact too. A 2x improvement in IFile.Writer.append that yields 0.1% end-to-end improvement may not be worth merging.
Don't git bisect against a tree with unrelated WIP. git bisect is deterministic only against a clean tree.
Don't profile in production without the operator's consent. Even async-profiler has overhead; the operator should know.

Exit criteria — when you're ready for the next stage

Move to Stage 11 when:

You have shipped one perf patch with documented before/after numbers and an attached profile.
You can git bisect 20 commits without referring to documentation.
You have read at least one async-profiler flame graph for Tez and identified the hotspot without help.
A committer has accepted your patch's methodology section as sufficient evidence.

Stage 11 takes you into the compatibility contract.

Open-Source Engineer & Contributor