Lab 7.1 — Debug Shuffle Behavior
Lab type: Read & Research
Estimated time: 120 min
Tez module: tez-runtime-library
Overview
Shuffle failures are among the most common sources of Tez bug reports. They
manifest as FetchFailure events, IOExceptions during map-output reads, or hung
reduce tasks. In this lab you will trace the complete shuffle path from a log
line back to the source code.
Step 1 — Locate the Core Classes
find ~/tez-src/tez-runtime-library -name "*.java" | xargs grep -l "FetchFailure\|Fetcher\|ShuffleHandler" | head -10
find ~/tez-src/tez-shuffle -name "*.java" | head -10
Step 2 — Read the Shuffle Fetch Path
Open Fetcher.java (in tez-runtime-library) and trace the fetch loop:
| # | Question |
|---|---|
| 1 | What HTTP method does the Fetcher use to request map output? GET or POST? |
| 2 | What is the URL format it sends to ShuffleHandler? What parameters does it include? |
| 3 | If the HTTP response code is 404, what does the Fetcher do? (Fail immediately? Retry? Report back to the InputManager?) |
| 4 | What does the Fetcher do when it detects data corruption (checksum mismatch)? Which class handles checksum verification? |
| 5 | How many concurrent fetcher threads does a reduce task run? What configuration key controls this? |
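Questions such as 5 usually come down to finding a configuration constant and seeing where it is read. The helper below is a minimal sketch for that: it counts references to every identifier that starts with a given prefix. The function name `list_config_constants` and the default prefix `TEZ_RUNTIME_SHUFFLE_` are assumptions made for illustration; check the prefix against the configuration class in your checkout before relying on it.

```shell
# Count how often each configuration constant with a given prefix is referenced.
# Usage: list_config_constants <source-root> [prefix]
# The default prefix is an assumption; pass a different one if it finds nothing.
list_config_constants() {
  src_root="$1"
  prefix="${2:-TEZ_RUNTIME_SHUFFLE_}"
  # -o prints only the matching identifier, -h suppresses file names,
  # so identical constants from different files collapse in uniq -c.
  grep -rhoE --include='*.java' "${prefix}[A-Z0-9_]+" "$src_root" \
    | sort | uniq -c | sort -rn
}
```

A call such as `list_config_constants ~/tez-src/tez-runtime-library` prints one line per constant, most-referenced first. Grepping for the top few names then leads to the code that reads them.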
Step 3 — Read the FetchFailure Event Path
When a fetch fails, an event travels up to the AM:
grep -rn "FetchFailure\|FETCH_FAILURE" ~/tez-src/tez-dag/src/main/java/ | \
grep -v "test" | grep ".java:" | head -20
Trace: where does the FetchFailure event originate, and what state transition
does it trigger in TaskAttemptImpl?
| # | Question |
|---|---|
| 1 | What is the name of the event class that carries the fetch-failure information to the AM? |
| 2 | In TaskAttemptImpl, what state does the task attempt transition to when it receives a fetch failure? |
| 3 | Does a single fetch failure kill the task, or does Tez retry? What configuration controls max fetch retries? |
| 4 | What happens to the source task attempt (the map) when its output cannot be fetched? Is it re-run? |
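State-machine transitions in Java are often declared across several lines, so a plain one-line grep can hide the source and target states. The sketch below prints each match with a few lines of leading context. The function name `show_transitions` is hypothetical, and the event name you pass in is whatever you found while answering question 1; none is assumed here.

```shell
# Show every line mentioning an event name, with leading context lines,
# so multi-line transition declarations stay readable.
# Usage: show_transitions <java-file> <event-name> [context-lines]
show_transitions() {
  file="$1"
  event="$2"
  ctx="${3:-3}"
  # -F treats the event name as a fixed string, -n adds line numbers.
  grep -n -F -B "$ctx" -A 1 -- "$event" "$file"
}
```

In the output, matching lines are prefixed with `N:` and context lines with `N-`, which makes it easy to jump to the right place in TaskAttemptImpl.java.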
Step 4 — Read ShuffleHandler
Open ShuffleHandler.java in tez-shuffle:
| # | Question |
|---|---|
| 1 | What Netty class does ShuffleHandler extend? |
| 2 | How does ShuffleHandler authenticate that a requester is authorized to fetch map output? (Hint: look for TOKEN or JobTokenSecretManager.) |
| 3 | Where does ShuffleHandler read the index file? What class represents the index? |
| 4 | If the NodeManager (NM) restarts while a reduce is fetching, what happens to in-flight fetch requests? |
Step 5 — Read the Spill Path
Open DefaultSorter.java or PipelinedSorter.java in tez-runtime-library:
- At what memory threshold does a spill occur?
- How many spill files can accumulate before a merge is triggered?
- After a spill, where is the index written?
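Once you have found the threshold logic, it helps to turn it into a concrete byte count. The sketch below assumes the MapReduce-style model, in which a spill starts when a configured percentage of the sort buffer is full. Whether the sorter you are reading follows that model is exactly what the first question asks you to confirm. The function name and the sample values are hypothetical.

```shell
# Compute the byte count at which a spill would start, assuming a
# percentage-of-buffer model (an assumption to verify in the sorter source).
# Usage: spill_threshold_bytes <buffer-size-MB> <spill-percent-0-to-100>
spill_threshold_bytes() {
  buffer_mb="$1"
  spill_pct="$2"
  echo $(( buffer_mb * 1024 * 1024 * spill_pct / 100 ))
}
```

For a hypothetical 100 MB buffer with an 80 percent threshold, `spill_threshold_bytes 100 80` prints 83886080, that is, 80 MB.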
Step 6 — Common Shuffle Bug Patterns
For each pattern below, identify the relevant Tez class and the configuration that can mitigate it:
| Pattern | Class | Config key |
|---|---|---|
| Slow fetch due to too few fetcher threads | | |
| OOM in reducer due to large in-memory merge buffer | | |
| Fetch failure due to ShuffleHandler authentication timeout | | |
| Data skew: one reducer processes 100× more data than others | | |
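For the data-skew pattern, a quick way to confirm the symptom before reading code is to compare the largest reducer input with the mean. The sketch below assumes you have already extracted per-reducer input sizes into two whitespace-separated columns (an identifier, then bytes). That input format and the function name `skew_ratio` are assumptions for illustration, not something Tez produces directly.

```shell
# Print the ratio of the largest per-reducer input to the mean input.
# Reads lines of the form "<reducer-id> <bytes>" on standard input.
skew_ratio() {
  awk '{ sum += $2; if ($2 > max) max = $2 }
       END { if (NR > 0 && sum > 0) printf "%.1f\n", max / (sum / NR) }'
}
```

A ratio near 1 means the load is even. A ratio in the tens or hundreds matches the last row of the table and points toward partitioning rather than fetcher tuning.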
Step 7 — JIRA Research
Search:
project = TEZ AND component = "tez-runtime-library" AND resolution = Fixed ORDER BY updated DESC
Find a recently fixed shuffle or sort bug. Read the patch:
- What was the bug?
- Was it in Fetcher, DefaultSorter, MergeManager, or ShuffleHandler?
- Was a test added? What does it mock or simulate?