Lab 9.2 — Analyze a Performance Regression
Lab type: Research & Benchmark
Estimated time: 3–4 hours
Overview
Performance regressions are among the most impactful bugs in Tez — a 10% slowdown in shuffle can translate to significant cost at scale. But they are also the hardest to reproduce and fix.
In this lab you will:
- Identify a performance-sensitive code path
- Write a micro-benchmark using JMH
- Compare two implementations and quantify the difference
- Write a JIRA with a clear, reproducible performance report
Step 1 — Identify a Hot Path
The most performance-critical paths in Tez:
| Path | Class | Why it matters |
|---|---|---|
| Record serialization | TezSerializer, WritableSerialization | Called once per record |
| Sort buffer writes | DefaultSorter.collect() | Called once per output record |
| Shuffle URL construction | Fetcher.getFetchList() | Called per fetch request |
| Counter increment | TezCounter.increment() | Called very frequently |
| BitSet operations | VertexManagerPlugin.onTaskAttemptCompleted | Called per task completion |
Step 2 — Add Maven Surefire Benchmark Configuration
For a quick JMH benchmark within the project:
<!-- Add to level-4-waving-manager/pom.xml if you want to benchmark BitSet -->
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-core</artifactId>
<version>1.37</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-generator-annprocess</artifactId>
<version>1.37</version>
<scope>test</scope>
</dependency>
Step 3 — Write the Benchmark
Example: compare BitSet.andNot(clone) vs re-building the set from scratch:
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void benchmarkBitSetAndNot(Blackhole bh) {
BitSet scheduled = createBigBitSet(1000);
BitSet finished = createBigBitSet(500);
BitSet copy = (BitSet) scheduled.clone();
copy.andNot(finished);
bh.consume(copy.isEmpty());
}
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void benchmarkManualIteration(Blackhole bh) {
Set<Integer> scheduled = createBigSet(1000);
Set<Integer> finished = createBigSet(500);
boolean allDone = finished.containsAll(scheduled);
bh.consume(allDone);
}
Step 4 — Run and Analyze
cd book/projects
mvn -pl level-4-waving-manager test -Dtest=WavingBenchmark -q 2>&1 | tail -30
Record:
- Mean time per operation (nanoseconds)
- Confidence intervals
- Winner
Step 5 — Write the JIRA Performance Report
Summary: [ClassName] uses O(n) Set.containsAll() where O(n) BitSet.andNot() is available
Description:
Micro-benchmark comparison of BitSet.andNot() vs Set.containsAll() for
wave-completion detection in WavingVertexManager (and by extension any
similar VertexManagerPlugin).
Results (1000 tasks, 500 completed, JDK 11, M1 MacBook):
BitSet.andNot(): X ± Y ns/op
Set.containsAll(): X ± Y ns/op
Speedup: Nx
For large DAGs with thousands of tasks, this difference compounds
significantly over the lifetime of the DAG.
Patch: Switch from HashSet to BitSet in [ClassName].
Priority: Minor
Component: tez-dag
Reflection
| # | Question |
|---|---|
| 1 | At what scale (number of tasks per DAG) would the BitSet optimization matter in practice? At 10 tasks? 10,000? |
| 2 | JMH benchmarks measure throughput in isolation. What real-world factors could make the benchmark results misleading? |
| 3 | Performance patches are often held to a higher standard of review than correctness patches. Why? |