Lab 7.1 — Debug Shuffle Behavior

Lab type: Read & Research
Estimated time: 120 min
Tez module: tez-runtime-library


Overview

Shuffle failures are among the most common sources of Tez bug reports. They typically surface as FetchFailure events, IOExceptions during map-output reads, or hung reduce tasks. In this lab you will trace the shuffle path end to end, starting from a log line and working back to the source code.


Step 1 — Locate the Core Classes

find ~/tez-src/tez-runtime-library -name "*.java" | xargs grep -l "FetchFailure\|Fetcher\|ShuffleHandler" | head -10
find ~/tez-src/tez-shuffle -name "*.java" | head -10

Step 2 — Read the Shuffle Fetch Path

Open Fetcher.java (in tez-runtime-library) and trace the fetch loop:

Questions:

  1. What HTTP method does the Fetcher use to request map output: GET or POST?
  2. What is the format of the URL it sends to ShuffleHandler? What parameters does it include?
  3. If the HTTP response code is 404, what does the Fetcher do? (Fail immediately? Retry? Report back to the InputManager?)
  4. What does the Fetcher do when it detects data corruption (checksum mismatch)? Which class handles checksum verification?
  5. How many concurrent fetcher threads does a reduce task run? What configuration key controls this?
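
For the configuration question, one approach is to list the shuffle-related key suffixes declared in the source instead of guessing names. The sketch below assumes keys are declared as string literals that begin with "shuffle." (for example, appended to a shared prefix constant). That layout and the helper name list_shuffle_keys are assumptions made for illustration, so adjust the pattern if your checkout differs.

```shell
# list_shuffle_keys DIR
# Print every distinct string literal starting with "shuffle." found in
# the .java files under DIR. The literal layout is an assumption, not a
# guarantee about how the Tez source declares its keys.
list_shuffle_keys() {
  grep -rhoE --include='*.java' '"shuffle\.[A-Za-z0-9._-]+"' "$1" \
    | tr -d '"' \
    | sort -u
}
```

A possible invocation is list_shuffle_keys ~/tez-src/tez-runtime-library/src/main/java; scan the output for names that mention threads, copies, or parallelism, then confirm the meaning in the Javadoc next to the constant.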

Step 3 — Read the FetchFailure Event Path

When a fetch fails, an event travels up to the AM:

grep -rn "FetchFailure\|FETCH_FAILURE" ~/tez-src/tez-dag/src/main/java/ | \
  grep -v "test" | grep ".java:" | head -20

Trace: where does the FetchFailure event originate, and what state transition does it trigger in TaskAttemptImpl?

Questions:

  1. What is the name of the event class that carries the fetch-failure information to the AM?
  2. In TaskAttemptImpl, what state does the task transition to when it receives a fetch failure?
  3. Does a single fetch failure kill the task, or does Tez retry? What configuration controls max fetch retries?
  4. What happens to the source task attempt (the map) when its output cannot be fetched? Is it re-run?
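
When you start from a log line rather than a class name, searching for a constant fragment of the message is usually the quickest way into the source. The helper below is a minimal sketch: the name find_log_origin is hypothetical, and it assumes test code lives under a src/test/ directory. Pick a fragment that contains no variable parts (host names, attempt IDs), since messages are often built by string concatenation.

```shell
# find_log_origin FRAGMENT DIR
# Print file:line for each .java file under DIR that contains the fixed
# string FRAGMENT (-F disables regex interpretation), skipping any path
# that includes /src/test/.
find_log_origin() {
  grep -rnF --include='*.java' -- "$1" "$2" \
    | grep -v '/src/test/' \
    | cut -d: -f1,2
}
```

For example, find_log_origin "some constant text from the log" ~/tez-src prints candidate locations; open each one and check that the logger call matches the level and wording you saw.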

Step 4 — Read ShuffleHandler

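Open ShuffleHandler.java in tez-shuffle. If your checkout has no tez-shuffle directory, locate the file with find ~/tez-src -name ShuffleHandler.java; in upstream Apache Tez it lives under tez-plugins/tez-aux-services.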

Questions:

  1. What Netty class does ShuffleHandler extend?
  2. How does ShuffleHandler authenticate that a requester is authorized to fetch map output? (Hint: look for TOKEN or JobTokenSecretManager.)
  3. Where does ShuffleHandler read the index file? What class represents the index?
  4. If the NM restarts while a reduce is fetching, what happens to in-flight fetch requests?

Step 5 — Read the Spill Path

Open DefaultSorter.java or PipelinedSorter.java in tez-runtime-library:

  1. At what memory threshold does a spill occur?
  2. How many spill files can accumulate before a merge is triggered?
  3. After a spill, where is the index written?
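
The first question reduces to simple arithmetic once you have found the two settings involved: a sort buffer size and a spill fraction. The sketch below assumes the buffer is given in MiB and the fraction as a whole-number percentage. The function name and the sample values 100 and 80 are illustrative only and are not taken from the Tez source, so substitute whatever you find in the sorter.

```shell
# spill_threshold_bytes BUFFER_MIB PERCENT
# Bytes of buffered data at which a spill would start, assuming the
# threshold is simply buffer size times a percentage (an assumption;
# verify the actual rule in the sorter you are reading).
spill_threshold_bytes() {
  echo $(( $1 * 1024 * 1024 * $2 / 100 ))
}

spill_threshold_bytes 100 80   # prints 83886080
```

Comparing this number with the record and metadata sizes in a sorter log can show whether spills are triggered by data volume or by something else, such as metadata space.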

Step 6 — Common Shuffle Bug Patterns

For each pattern below, identify the relevant Tez class and the configuration that can mitigate it:

Pattern                                                      | Class | Config key
Slow fetch due to too few fetcher threads                    |       |
OOM in reducer due to large in-memory merge buffer           |       |
Fetch failure due to ShuffleHandler authentication timeout   |       |
Data skew: one reducer processes 100× more data than others  |       |
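
For the data-skew pattern, it helps to quantify the skew before looking for a mitigation. The sketch below assumes you have already extracted one input size per reducer (one non-negative number per line) from counters or logs. That input format and the helper name skew_ratio are hypothetical.

```shell
# skew_ratio FILE
# Print the largest value divided by the mean of all values in FILE
# (one non-negative number per line), rounded to two decimals.
# Prints "n/a" when the file has no values or the values sum to zero.
skew_ratio() {
  awk '
    NF { sum += $1; n++; if ($1 > max) max = $1 }
    END {
      if (n == 0 || sum == 0) { print "n/a"; exit }
      printf "%.2f\n", max / (sum / n)
    }
  ' "$1"
}
```

A ratio close to 1 means the reducers are balanced; a ratio that approaches the number of reducers means a single reducer is receiving almost all of the data.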

Step 7 — JIRA Research

Search:

project = TEZ AND component = "tez-runtime-library" AND resolution = Fixed ORDER BY updated DESC

Find a recently fixed shuffle or sort bug. Read the patch:

  1. What was the bug?
  2. Was it in Fetcher, DefaultSorter, MergeManager, or ShuffleHandler?
  3. Was a test added? What does it mock or simulate?