Lab 9.2 — Analyze a Performance Regression

Lab type: Research & Benchmark
Estimated time: 3–4 hours

Overview

Performance regressions are among the most impactful bugs in Tez — a 10% slowdown in shuffle can translate to significant cost at scale. But they are also the hardest to reproduce and fix.

In this lab you will:

Identify a performance-sensitive code path
Write a micro-benchmark using JMH
Compare two implementations and quantify the difference
Write a JIRA with a clear, reproducible performance report

Step 1 — Identify a Hot Path

The most performance-critical paths in Tez:

Path	Class	Why it matters
Record serialization	`TezSerializer`, `WritableSerialization`	Called once per record
Sort buffer writes	`DefaultSorter.collect()`	Called once per output record
Shuffle URL construction	`Fetcher.getFetchList()`	Called per fetch request
Counter increment	`TezCounter.increment()`	Called very frequently
BitSet operations	`VertexManagerPlugin.onTaskAttemptCompleted`	Called per task completion

Step 2 — Add Maven Surefire Benchmark Configuration

For a quick JMH benchmark within the project:

<!-- Add to level-4-waving-manager/pom.xml if you want to benchmark BitSet -->
<dependency>
  <groupId>org.openjdk.jmh</groupId>
  <artifactId>jmh-core</artifactId>
  <version>1.37</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.openjdk.jmh</groupId>
  <artifactId>jmh-generator-annprocess</artifactId>
  <version>1.37</version>
  <scope>test</scope>
</dependency>

Step 3 — Write the Benchmark

Example: compare BitSet.andNot(clone) vs re-building the set from scratch:

@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void benchmarkBitSetAndNot(Blackhole bh) {
    BitSet scheduled = createBigBitSet(1000);
    BitSet finished = createBigBitSet(500);
    BitSet copy = (BitSet) scheduled.clone();
    copy.andNot(finished);
    bh.consume(copy.isEmpty());
}

@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void benchmarkManualIteration(Blackhole bh) {
    Set<Integer> scheduled = createBigSet(1000);
    Set<Integer> finished = createBigSet(500);
    boolean allDone = finished.containsAll(scheduled);
    bh.consume(allDone);
}

Step 4 — Run and Analyze

cd book/projects
mvn -pl level-4-waving-manager test -Dtest=WavingBenchmark -q 2>&1 | tail -30

Record:

Mean time per operation (nanoseconds)
Confidence intervals
Winner

Step 5 — Write the JIRA Performance Report

Summary: [ClassName] uses O(n) Set.containsAll() where O(n) BitSet.andNot() is available

Description:
  Micro-benchmark comparison of BitSet.andNot() vs Set.containsAll() for
  wave-completion detection in WavingVertexManager (and by extension any
  similar VertexManagerPlugin).

  Results (1000 tasks, 500 completed, JDK 11, M1 MacBook):

    BitSet.andNot():        X ± Y ns/op
    Set.containsAll():      X ± Y ns/op
    Speedup:                Nx

  For large DAGs with thousands of tasks, this difference compounds
  significantly over the lifetime of the DAG.

Patch: Switch from HashSet to BitSet in [ClassName].

Priority: Minor
Component: tez-dag

Reflection

#	Question
1	At what scale (number of tasks per DAG) would the BitSet optimization matter in practice? At 10 tasks? 10,000?
2	JMH benchmarks measure throughput in isolation. What real-world factors could make the benchmark results misleading?
3	Performance patches are often held to a higher standard of review than correctness patches. Why?

Open-Source Engineer & Contributor