Lab 4.4 — Fix It: Null Dereference in ShuffleVertexManager on Zero-Partition Source

Lab type: Fix-It — reproduce → locate → write failing test → patch → verify → format patch
Estimated time: 120–150 min
Tez component: tez-dag, org.apache.tez.dag.app.dag.impl.ShuffleVertexManager


Background

ShuffleVertexManager uses partition statistics sent by map tasks to decide when to start reduce tasks (slow-start) and how many reducers to run (auto-parallelism). It processes these statistics via onVertexManagerEventReceived().
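
As a rough illustration of the slow-start idea, the sketch below maps the fraction of completed source tasks onto the fraction of reduce tasks allowed to start. It is a minimal sketch under stated assumptions, not the Tez implementation: the class SlowStartSketch, both method names, and the linear ramp between the two thresholds are illustrative choices made for this handout.

```java
/** Hypothetical slow-start calculation; a sketch, not the Tez implementation. */
final class SlowStartSketch {

    /**
     * Fraction of reduce tasks eligible to start, given the fraction of source
     * tasks that have completed and the configured min/max thresholds.
     */
    static float fractionToSchedule(float completed, float min, float max) {
        if (completed < min) {
            return 0f;               // below the lower threshold: start nothing
        }
        float range = max - min;
        if (range <= 0f) {
            return 1f;               // min == max: avoid dividing by zero, start everything
        }
        float fraction = (completed - min) / range;
        return Math.max(0f, Math.min(1f, fraction)); // clamp to [0, 1]
    }

    /** Number of reduce tasks allowed to start out of totalReducers (rounded down). */
    static int tasksToSchedule(int totalReducers, float completed, float min, float max) {
        return (int) (totalReducers * fractionToSchedule(completed, min, max));
    }
}
```

Note the explicit check on range: a division by (max - min) is itself a place where a zero denominator has to be handled.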

A long-standing category of bug in this path: when a source vertex has zero output partitions (all records were filtered out, or the vertex ran with zero tasks), the plugin can receive a VertexManagerEvent whose payload encodes 0 partitions. In several versions of Tez, this caused a NullPointerException or an ArithmeticException (integer divide by zero) deep in the statistics-processing path, because the code assumed that at least one partition existed.
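
To make the two failure modes concrete, here is a small self-contained sketch of the bug pattern. It is not Tez code: ZeroPartitionDemo, averagePartitionSize, and firstPartitionSize are hypothetical stand-ins, assumed only to resemble code that divides by, or looks up, partition data it never checked for emptiness.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a statistics path; not the Tez implementation. */
final class ZeroPartitionDemo {

    /** Integer division by the partition count: throws ArithmeticException when it is 0. */
    static long averagePartitionSize(long totalBytes, int partitionCount) {
        return totalBytes / partitionCount;
    }

    /** Reads the size of partition 0: throws NullPointerException when none was recorded. */
    static long firstPartitionSize(Map<Integer, Long> sizesByPartition) {
        // get() returns null for a missing key; auto-unboxing null to long throws.
        return sizesByPartition.get(0);
    }

    public static void main(String[] args) {
        try {
            averagePartitionSize(1024L, 0);
        } catch (ArithmeticException e) {
            System.out.println("ArithmeticException: " + e.getMessage());
        }
        try {
            firstPartitionSize(new HashMap<>());
        } catch (NullPointerException e) {
            System.out.println("NullPointerException from unboxing a missing entry");
        }
    }
}
```

Only integer division throws. The same division on float or double operands yields Infinity or NaN instead, which can hide the problem rather than surface it.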

This lab reproduces the bug pattern in a unit test, locates the missing guard, applies the fix, and packages the result as a patch.


Step 1 — Locate the Source File

find ~/tez-src -name "ShuffleVertexManager.java" | head -5

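Expected (find prints the path with the ~/tez-src prefix expanded; it should end in the following):

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ShuffleVertexManager.java

If your checkout places the file in a different module or package, use the path that find prints.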

Also locate the test file:

find ~/tez-src -name "TestShuffleVertexManager.java" | head -5

Step 2 — Read the Statistics Path

In ShuffleVertexManager.java, find the method that processes VertexManagerEvent payloads. It will have a call to ShuffleVertexManagerBase.parseStatsHeader() or similar, and will work with numPartitions or partitionCount.

Trace the complete call chain from onVertexManagerEventReceived() to the line that first uses the partition count arithmetically.

Questions

#  Question
1  What is the name of the proto-based payload class that encodes partition statistics?
2  Which method extracts the partition count from the payload?
3  On what line does the first arithmetic operation involving the partition count occur?
4  Is there a null-check or zero-check before that line?
5  What exception would result if partitionCount == 0 at that line?

Step 3 — Find the Existing Test

find ~/tez-src -name "TestShuffleVertexManager.java"

Open it and search for any test that covers the zero-partition case:

grep -n -i "zero\|0.*partition\|partition.*0" "$(find ~/tez-src -name TestShuffleVertexManager.java | head -1)" | head -20

Note: in most Tez versions there is no such test — that is the gap you will fill.


Step 4 — Write the Reproducing Test

Add the following test to TestShuffleVertexManager.java. The exact helper methods depend on the version you have; adapt the setup pattern from the nearest existing test (look for testAutoParallelism or testSlowStart).

@Test(expected = Exception.class)   // replace Exception with the specific type you observe
public void testZeroPartitionSourceDoesNotCrash() throws Exception {
    // TODO: set up a ShuffleVertexManager with auto-parallelism enabled
    // TODO: build a VertexManagerEvent whose payload encodes numPartitions = 0
    // TODO: pass that event to onVertexManagerEventReceived()
    //
    // While the bug exists the call throws, so expected = Exception.class
    // makes this test pass and documents the failure. After the fix the
    // call must NOT throw; Step 6 drops the expected attribute so the test
    // asserts that no exception occurs.
}

Run:

cd ~/tez-src
mvn test -pl tez-dag -Dtest=TestShuffleVertexManager#testZeroPartitionSourceDoesNotCrash -q 2>&1 | tail -30

Record: which exception is thrown and on which line.
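
If the Maven output is truncated, the exception type and line can also be read programmatically from the top stack frame. The sketch below shows one way to do that; FailureSite, describe, and divide are hypothetical names, and divide only stands in for the unguarded arithmetic you are hunting for.

```java
/** Hypothetical helper for recording where an exception originated. */
final class FailureSite {

    /** Returns "ExceptionType at File.java:line (Class.method)" for the frame that threw. */
    static String describe(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        if (frames.length == 0) {
            // The JVM may omit traces for frequently thrown implicit exceptions.
            return t.getClass().getName() + " (no stack trace available)";
        }
        StackTraceElement top = frames[0];
        return t.getClass().getName() + " at " + top.getFileName() + ":" + top.getLineNumber()
            + " (" + top.getClassName() + "." + top.getMethodName() + ")";
    }

    /** Stand-in for the unguarded arithmetic. */
    static int divide(int total, int partitionCount) {
        return total / partitionCount;
    }

    public static void main(String[] args) {
        try {
            divide(10, 0);
        } catch (RuntimeException e) {
            System.out.println(describe(e));
        }
    }
}
```

Inside a temporary copy of the test, wrapping the onVertexManagerEventReceived() call in a similar try/catch and printing describe(e) gives the values needed for the JIRA description.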


Step 5 — Apply the Fix

In ShuffleVertexManager.java, add a guard at the point identified in Step 2.

Rules

  • The guard must be minimal: either if (partitionCount == 0) { return; } to skip the event, or if (partitionCount == 0) { partitionCount = 1; } to normalise the count. Choose the semantically correct one: which is safer for scheduling?
  • Do not reformat surrounding code
  • Do not change method signatures
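
The difference between the two guard styles is easiest to see on a toy accumulator. The sketch below is hypothetical throughout (GuardDemo and its methods do not exist in Tez) and assumes only that the real code updates some running state for each event it processes.

```java
/** Hypothetical accumulator; illustrates the two guard styles, not the Tez code. */
final class GuardDemo {
    private long totalBytes;
    private int eventsProcessed;
    private long lastAveragePerPartition;

    /** Option A: skip the event entirely when it reports zero partitions. */
    void onStatsSkipping(long bytes, int partitionCount) {
        if (partitionCount == 0) {
            return; // leave all accumulated state untouched
        }
        record(bytes, partitionCount);
    }

    /** Option B: normalise zero to one so the arithmetic stays defined. */
    void onStatsNormalising(long bytes, int partitionCount) {
        if (partitionCount == 0) {
            partitionCount = 1; // pretends the source produced one partition
        }
        record(bytes, partitionCount);
    }

    private void record(long bytes, int partitionCount) {
        totalBytes += bytes;
        eventsProcessed++;
        lastAveragePerPartition = bytes / partitionCount;
    }

    int eventsProcessed() { return eventsProcessed; }
    long totalBytes() { return totalBytes; }
    long lastAveragePerPartition() { return lastAveragePerPartition; }
}
```

With a zero-partition event, the skipping guard leaves eventsProcessed at 0 while the normalising guard advances it to 1. Which outcome is right depends on whether later scheduling decisions count events or partitions, which is what the rule above asks you to decide.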

Step 6 — Update the Test

Now that the fix is applied, update the test:

@Test
public void testZeroPartitionSourceDoesNotCrash() throws Exception {
    // Same setup as before
    // This time assert NO exception is thrown
    // Optionally assert that scheduling state is unchanged
}

Run the full tez-dag test suite:

mvn test -pl tez-dag -q 2>&1 | tail -20

All tests must pass.


Step 7 — Checkstyle

mvn checkstyle:check -pl tez-dag -q 2>&1 | grep -E "ERROR|WARNING|violation" | head -20

Zero violations required.


Step 8 — Format the Patch

cd ~/tez-src
git diff > /tmp/TEZ-ZEROPART.001.patch
cat /tmp/TEZ-ZEROPART.001.patch

Checklist:

  • Only ShuffleVertexManager.java and TestShuffleVertexManager.java modified
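  • No trailing whitespace on added lines: grep -nP '^\+.*\s+$' /tmp/TEZ-ZEROPART.001.patch (should print nothing; unanchored, the pattern also matches blank context lines, which consist of a single space)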
  • Patch applies cleanly: git apply --check /tmp/TEZ-ZEROPART.001.patch
  • All tests pass after git apply
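
Where grep -P is unavailable, the trailing-whitespace item in the checklist can be checked with a few lines of Java. This is a sketch only: PatchWhitespaceCheck is a hypothetical helper written for this lab, and it assumes a unified diff in which added lines begin with a single "+".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checker: flags added patch lines that end in a space or tab. */
final class PatchWhitespaceCheck {

    /** Returns 1-based line numbers of added ('+') lines with trailing whitespace. */
    static List<Integer> trailingWhitespaceLines(List<String> patchLines) {
        List<Integer> offenders = new ArrayList<>();
        for (int i = 0; i < patchLines.size(); i++) {
            String line = patchLines.get(i);
            // "+++" marks the file header; an added line whose own content
            // starts with "++" would be skipped too, a known limitation.
            boolean added = line.startsWith("+") && !line.startsWith("+++");
            if (added && (line.endsWith(" ") || line.endsWith("\t"))) {
                offenders.add(i + 1);
            }
        }
        return offenders;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PatchWhitespaceCheck <patch-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("Added lines with trailing whitespace: " + trailingWhitespaceLines(lines));
    }
}
```

Context lines and removed lines are ignored on purpose, since only the lines the patch introduces are yours to clean up.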

Step 9 — Write the JIRA Description

Summary: ShuffleVertexManager throws [ExceptionType] when source vertex
         has zero output partitions

Description:
  When a source vertex completes with zero output partitions (all records
  filtered or vertex ran zero tasks), ShuffleVertexManager.onVertexManagerEventReceived
  receives a VertexManagerEvent with partitionCount=0.  The statistics
  processing path performs arithmetic on this value without a zero guard,
  causing [ExceptionType] at [ClassName].java:[line].

  Steps to reproduce:
    See attached TestShuffleVertexManager#testZeroPartitionSourceDoesNotCrash.

  Fix:
    Add a zero-partition guard at [method name], line [N].
    Skip or normalise the event when partitionCount == 0.

Priority: Major
Component: tez-dag
Affects Version: 0.10.x

Step 10 — Deeper Understanding

After completing the fix, answer these questions by reading ShuffleVertexManager.java:

#  Question
1  What are slowStartMinFraction and slowStartMaxFraction used for? At what point in the scheduling lifecycle are they checked?
2  When does ShuffleVertexManager call reconfigureVertex()? What does it change?
3  What data structure accumulates partition statistics across multiple VertexManagerEvent calls? Why accumulate rather than process each event independently?
4  The test class uses mock(VertexManagerPluginContext.class). Compare this to TestWavingVertexManager: what additional interactions does ShuffleVertexManager have with the context that WavingVertexManager does not?
5  Search for all places in ShuffleVertexManager where a divide-by-zero could theoretically occur. List them.