Lab 4.4 — Fix It: Null Dereference in ShuffleVertexManager on Zero-Partition Source
Lab type: Fix-It — reproduce → locate → write failing test → patch → verify → format patch
Estimated time: 120–150 min
Tez component: tez-runtime-library → org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager
Background
ShuffleVertexManager uses partition statistics sent by map tasks to decide
when to start reduce tasks (slow-start) and how many reducers to run
(auto-parallelism). It processes these statistics via
onVertexManagerEventReceived().
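To make the slow-start idea concrete, here is a minimal sketch of the decision, assuming a simple linear ramp between a minimum and a maximum completed-source fraction. It is a simplified model, not the Tez implementation; the class and method names are hypothetical.

```java
// Simplified model of slow-start scheduling; NOT the Tez implementation.
// All names here (SlowStartSketch, reducersToStart) are hypothetical.
final class SlowStartSketch {

    /**
     * Returns how many of totalReducers may be started once
     * completedFraction of the source (map) tasks have finished.
     */
    static int reducersToStart(double completedFraction, double minFraction,
                               double maxFraction, int totalReducers) {
        if (completedFraction < minFraction) {
            return 0;                       // too early: start nothing
        }
        if (completedFraction >= maxFraction || maxFraction <= minFraction) {
            return totalReducers;           // past the upper bound: start everything
        }
        // Between the bounds: scale linearly with source progress.
        double progress = (completedFraction - minFraction) / (maxFraction - minFraction);
        return (int) (progress * totalReducers);
    }

    public static void main(String[] args) {
        for (double done : new double[] {0.10, 0.25, 0.50, 0.75, 1.00}) {
            System.out.printf("%.2f complete -> start %d of 8 reducers%n",
                    done, reducersToStart(done, 0.25, 0.75, 8));
        }
    }
}
```

Note the second branch: it also covers the case where the two bounds are equal, which would otherwise make the ramp itself divide by zero.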
A long-standing bug category in this path: when a source vertex has zero
output partitions (all records were filtered, or the vertex ran with zero
tasks), the plugin can receive a VertexManagerEvent
whose payload encodes 0 partitions. In several versions of Tez, this caused
a NullPointerException or ArithmeticException (divide by zero) deep in the
statistics-processing path — the code assumed at least one partition existed.
This lab reproduces the bug pattern in a unit test, locates the missing guard, applies the fix, and formats a patch for submission.
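The two failure modes named above behave differently in plain Java, which is worth seeing before hunting for them in Tez. The sketch below is a toy illustration with hypothetical names; it does not use any Tez code.

```java
// Toy illustration of the two failure modes; names are hypothetical, not Tez code.
final class ZeroPartitionFailureModes {

    /** Integer division by a zero partition count throws ArithmeticException. */
    static long averageBytes(long totalBytes, int partitionCount) {
        return totalBytes / partitionCount;
    }

    /** A stats array that was never allocated (null) throws NullPointerException. */
    static long firstPartitionBytes(long[] partitionStats) {
        return partitionStats[0];
    }

    /** Runs the action and returns the simple name of what it threw, or "none". */
    static String thrownBy(Runnable action) {
        try {
            action.run();
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(thrownBy(() -> averageBytes(0L, 0)));
        System.out.println(thrownBy(() -> firstPartitionBytes(null)));
        // Floating-point division by zero does not throw; it yields NaN or Infinity.
        System.out.println(0.0 / 0);
    }
}
```

Keep the last line in mind when reading the real code: a float or double division by zero will not show up as an exception, only as a NaN or Infinity that flows into later calculations.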
Step 1 — Locate the Source File
find ~/tez-src -name "ShuffleVertexManager.java" | head -5
Expected (the prefix is your expanded ~/tez-src path):
/home/<user>/tez-src/tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java
Also locate the test file:
find ~/tez-src -name "TestShuffleVertexManager.java" | head -5
Step 2 — Read the Statistics Path
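In ShuffleVertexManager.java, find the method that processes
VertexManagerEvent payloads. In versions that split the class, this logic
lives in the superclass ShuffleVertexManagerBase. Look for a payload-parsing
call such as parseStatsHeader() (the exact name varies by version) and for
variables such as numPartitions or partitionCount.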
Trace the complete call chain from onVertexManagerEventReceived() to the
line that first uses the partition count arithmetically.
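The shape of the chain you are tracing can be sketched in a few lines. The example below is a toy stand-in: the payload layout (an int count followed by one long per partition) and all names are invented for illustration, whereas Tez encodes its payload with protobuf.

```java
import java.nio.ByteBuffer;

// Toy stand-in for the event -> payload -> arithmetic chain. The payload layout
// (an int count followed by one long per partition) is invented for this sketch.
final class StatsPathSketch {

    /** Entry point, playing the role of onVertexManagerEventReceived(). */
    static long onEvent(ByteBuffer payload) {
        int partitionCount = payload.getInt();          // decode the count
        long total = 0;
        for (int i = 0; i < partitionCount; i++) {
            total += payload.getLong();                 // decode per-partition sizes
        }
        return bytesPerPartition(total, partitionCount);
    }

    /** First arithmetic use of the count: the analogue of the line to find. */
    static long bytesPerPartition(long total, int partitionCount) {
        return total / partitionCount;                  // throws when the count is 0
    }

    /** Builds a payload in the invented layout. */
    static ByteBuffer encode(long... sizes) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Long.BYTES * sizes.length);
        buf.putInt(sizes.length);
        for (long s : sizes) {
            buf.putLong(s);
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        System.out.println(onEvent(encode(100L, 300L)));   // two partitions
        try {
            onEvent(encode());                              // zero partitions
        } catch (ArithmeticException e) {
            System.out.println("zero partitions: " + e);
        }
    }
}
```

In this toy the decode loop is harmless for a zero count; the failure only appears one call deeper, which is why the questions below ask for the first arithmetic use rather than the first read.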
Questions
| # | Question |
|---|---|
| 1 | What is the name of the proto-based payload class that encodes partition statistics? |
| 2 | Which method extracts the partition count from the payload? |
| 3 | On what line does the first arithmetic operation involving the partition count occur? |
| 4 | Is there a null-check or zero-check before that line? |
| 5 | What exception would result if partitionCount == 0 at that line? |
Step 3 — Find the Existing Test
find ~/tez-src -name "TestShuffleVertexManager.java"
Open it and search for any test that covers the zero-partition case:
grep -n -i "zero\|0.*partition\|partition.*0" "$(find ~/tez-src -name TestShuffleVertexManager.java | head -1)" | head -20
Note: in most Tez versions there is no such test — that is the gap you will fill.
Step 4 — Write the Reproducing Test
Add the following test to TestShuffleVertexManager.java. The exact helper
methods depend on the version you have; adapt the setup pattern from the
nearest existing test (look for testAutoParallelism or testSlowStart).
@Test(expected = Exception.class) // replace Exception with the specific type you observe
public void testZeroPartitionSourceDoesNotCrash() throws Exception {
  // TODO: set up a ShuffleVertexManager with auto-parallelism enabled
  // TODO: send a VertexManagerEvent with numPartitions = 0
  // TODO: call onVertexManagerEventReceived with that event

  // The call should NOT throw once the bug is fixed.
  // Mark expected = Exception.class so the test initially *passes*
  // when the bug exists (the code throws), then change to asserting
  // no throw after the fix is applied.
}
Run:
cd ~/tez-src
mvn test -pl tez-runtime-library -Dtest=TestShuffleVertexManager#testZeroPartitionSourceDoesNotCrash -q 2>&1 | tail -30
Record: which exception is thrown and on which line.
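If the Maven output is truncated, the same two facts can be read from the exception itself. The helper below is a hypothetical sketch, not part of JUnit or Tez; it formats the exception type and the top stack frame so you can copy them into your notes.

```java
// Hypothetical helper, not part of JUnit or Tez: formats the exception type and
// the top stack frame so they can be copied into lab notes.
final class FailureRecorder {

    static String describe(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        if (frames.length == 0) {
            return t.getClass().getName() + " (no stack trace available)";
        }
        StackTraceElement top = frames[0];
        return t.getClass().getName() + " at " + top.getClassName() + "."
                + top.getMethodName() + ":" + top.getLineNumber();
    }

    public static void main(String[] args) {
        int partitionCount = 0;
        try {
            System.out.println(10 / partitionCount);
        } catch (ArithmeticException e) {
            System.out.println(describe(e));
        }
    }
}
```

In a temporary version of your test you could wrap the call in try/catch, print describe(e), and rethrow, then remove the wrapper before producing the patch.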
Step 5 — Apply the Fix
In ShuffleVertexManager.java, add a guard at the point identified in Step 2.
Rules
- The guard must be minimal: either if (partitionCount == 0) { return; } to skip the event, or if (partitionCount == 0) { partitionCount = 1; } to normalise (choose the semantically correct one: which is safer for scheduling?)
- Do not reformat surrounding code
- Do not change method signatures
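The two guard styles leave the surrounding state in different shapes, which is the heart of the "which is safer" question. The sketch below compares them on a toy class; the fields and method names are hypothetical and stand in for whatever bookkeeping the real method performs.

```java
// Toy comparison of the two guard styles; names are hypothetical, not Tez code.
final class GuardOptions {
    long totalBytesSeen = 0;
    int eventsProcessed = 0;
    long lastAverage = 0;

    /** Option A: skip the event entirely when there are no partitions. */
    void onEventSkip(long totalBytes, int partitionCount) {
        if (partitionCount == 0) {
            return;                       // state is left untouched
        }
        record(totalBytes, partitionCount);
    }

    /** Option B: normalise the count to 1 and carry on. */
    void onEventNormalise(long totalBytes, int partitionCount) {
        if (partitionCount == 0) {
            partitionCount = 1;           // the event is still counted
        }
        record(totalBytes, partitionCount);
    }

    private void record(long totalBytes, int partitionCount) {
        totalBytesSeen += totalBytes;
        eventsProcessed++;
        lastAverage = totalBytes / partitionCount;
    }

    public static void main(String[] args) {
        GuardOptions skip = new GuardOptions();
        skip.onEventSkip(0L, 0);
        GuardOptions normalise = new GuardOptions();
        normalise.onEventNormalise(0L, 0);
        System.out.println("skip: events=" + skip.eventsProcessed);
        System.out.println("normalise: events=" + normalise.eventsProcessed);
    }
}
```

Neither option throws, but only one of them records that the event happened. Before choosing, check whether anything downstream in the real class counts received events or reporting sources.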
Step 6 — Update the Test
Now that the fix is applied, update the test:
@Test
public void testZeroPartitionSourceDoesNotCrash() throws Exception {
// Same setup as before
// This time assert NO exception is thrown
// Optionally assert that scheduling state is unchanged
}
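The post-fix test checks two things: the call returns normally, and nothing was scheduled as a side effect. A minimal sketch of that pattern, using a hypothetical toy manager in place of ShuffleVertexManager and its mocked context:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the manager under test; the real test would use
// ShuffleVertexManager with a mocked context instead.
final class PostFixTestSketch {

    /** Toy manager that remembers which values it has scheduled. */
    static final class ToyManager {
        final List<Integer> scheduledTasks = new ArrayList<>();

        void onStats(long totalBytes, int partitionCount) {
            if (partitionCount == 0) {
                return;                               // the guard under test
            }
            scheduledTasks.add((int) (totalBytes / partitionCount));
        }
    }

    /** Returns null if the zero-partition event is handled cleanly, else a failure message. */
    static String checkZeroPartitionEvent(ToyManager manager) {
        List<Integer> before = new ArrayList<>(manager.scheduledTasks);
        try {
            manager.onStats(0L, 0);
        } catch (RuntimeException e) {
            return "unexpected exception: " + e;
        }
        if (!before.equals(manager.scheduledTasks)) {
            return "scheduling state changed: " + manager.scheduledTasks;
        }
        return null;
    }

    public static void main(String[] args) {
        String failure = checkZeroPartitionEvent(new ToyManager());
        System.out.println(failure == null ? "PASS" : "FAIL: " + failure);
    }
}
```

In the real test, the "state unchanged" half would typically be a Mockito verification that the mocked context received no scheduling calls; adapt that from the nearest existing test rather than from this toy.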
Run the full test suite of the module that contains ShuffleVertexManager:
mvn test -pl tez-runtime-library -q 2>&1 | tail -20
All tests must pass.
Step 7 — Checkstyle
mvn checkstyle:check -pl tez-runtime-library -q 2>&1 | grep -E "ERROR|WARNING|violation" | head -20
Zero violations required.
Step 8 — Format the Patch
cd ~/tez-src
git diff > /tmp/TEZ-ZEROPART.001.patch
cat /tmp/TEZ-ZEROPART.001.patch
Checklist:
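- Only ShuffleVertexManager.java and TestShuffleVertexManager.java modified
- No trailing whitespace on added lines (the command should print nothing): grep -nP '^\+.*\s$' /tmp/TEZ-ZEROPART.001.patch
- Patch applies cleanly to a clean checkout (run git stash first and git stash pop afterwards; the check fails if the change is already in the working tree): git apply --check /tmp/TEZ-ZEROPART.001.patch
- All tests pass after git apply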
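The first two checklist items can also be checked mechanically. The sketch below is a hypothetical checker, not a Tez or git tool; it assumes unified-diff text as produced by git diff and takes the patch path as its only argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker, not a Tez or git tool. It assumes unified-diff text as
// produced by `git diff`.
final class PatchChecklist {

    /** Paths named on "+++ b/..." header lines, i.e. the files the patch modifies. */
    static List<String> modifiedFiles(List<String> patchLines) {
        List<String> files = new ArrayList<>();
        for (String line : patchLines) {
            if (line.startsWith("+++ b/")) {
                files.add(line.substring("+++ b/".length()));
            }
        }
        return files;
    }

    /** 1-based patch line numbers of added lines that end in a space or tab. */
    static List<Integer> trailingWhitespace(List<String> patchLines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < patchLines.size(); i++) {
            String line = patchLines.get(i);
            // Added lines start with "+"; "+++ " marks a file header instead.
            boolean added = line.startsWith("+") && !line.startsWith("+++ ");
            if (added && (line.endsWith(" ") || line.endsWith("\t"))) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PatchChecklist.java <patch-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("files: " + modifiedFiles(lines));
        System.out.println("trailing whitespace at patch lines: " + trailingWhitespace(lines));
    }
}
```

Only added lines are inspected, because blank context lines in a unified diff consist of a single space and would otherwise be reported as false positives.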
Step 9 — Write the JIRA Description
Summary: ShuffleVertexManager throws [ExceptionType] when source vertex
has zero output partitions
Description:
When a source vertex completes with zero output partitions (all records
filtered or vertex ran zero tasks), ShuffleVertexManager.onVertexManagerEventReceived
receives a VertexManagerEvent with partitionCount=0. The statistics
processing path performs arithmetic on this value without a zero guard,
causing [ExceptionType] at [ClassName].java:[line].
Steps to reproduce:
See attached TestShuffleVertexManager#testZeroPartitionSourceDoesNotCrash.
Fix:
Add a zero-partition guard at [method name], line [N].
Skip or normalise the event when partitionCount == 0.
Priority: Major
Component: tez-dag
Affects Version: 0.10.x
Step 10 — Deeper Understanding
After completing the fix, answer these questions by reading ShuffleVertexManager.java:
| # | Question |
|---|---|
| 1 | What are slowStartMinFraction and slowStartMaxFraction used for? At what point in the scheduling lifecycle are they checked? |
| 2 | When does ShuffleVertexManager call reconfigureVertex()? What does it change? |
| 3 | What data structure accumulates partition statistics across multiple VertexManagerEvent calls? Why accumulate rather than process each event independently? |
| 4 | The test class uses mock(VertexManagerPluginContext.class). Compare this to TestWavingVertexManager — what additional interactions does ShuffleVertexManager have with the context that WavingVertexManager does not? |
| 5 | Search for all places in ShuffleVertexManager where a divide-by-zero could theoretically occur. List them. |