Step 4: Root Cause Identification
A symptom is "the test fails." A root cause is "this specific line, in this specific state, when this specific event arrives, performs this specific incorrect operation, because of this specific design assumption that no longer holds." If your statement does not have that shape, you have not found root cause yet.
This step is mostly thinking. The tools are five-whys, git blame, and
git bisect. The output is a 200–500 word root-cause document and a tested
hypothesis.
Five Whys, Applied to a State-Machine Race
The five-whys technique sounds trite. It is not. The discipline of asking "why" five times in a row forces you past the first plausible explanation (almost always wrong) and into the actual design defect (almost always two or three levels deeper than you initially thought).
Worked example: vertex stays in RUNNING after all tasks succeed
Symptom from Step 2: assertEquals(SUCCEEDED, vertex.getState()) fails with
expected SUCCEEDED but was RUNNING. Repro is deterministic at 5/5.
Why 1: Why is the vertex still in RUNNING?
Because the transition to SUCCEEDED requires all tasks to have completed AND
the vertex's completion handler to have been invoked. Looking at the probe trace
from Step 3, the completion handler was invoked. So the transition was attempted.
Why 2: Why did the transition not happen even though the handler ran?
Because the handler returned a new state that depends on a counter
(completedTaskCount). The probe shows completedTaskCount = 19 when the
handler ran, but the vertex has 20 tasks. So the guard says "not done yet."
Why 3: Why is the count 19 when all 20 task-completed events were fired?
Because the count is incremented inside the handler, AFTER a check that re-routes
certain V_TASK_COMPLETED events back through another handler. The re-route
fires for the 20th task (look at VertexImpl.java around line 2750 — the
if (recoveryData != null) branch). The re-routed event is queued but the
test's dispatcher.await() returns before the queue is fully drained.
Why 4: Why does dispatcher.await() return before the re-routed event
is processed?
Because AsyncDispatcher.await() waits for the current queue to drain, but
the re-route enqueues into a secondary queue (the recovery dispatcher) which
is not joined by the primary await.
Why 5: Why are there two dispatchers, and why does the test only await one?
Because recovery events were added in TEZ-2877 as a separate dispatch path to avoid blocking the main event loop during recovery replay. The test setup predates that change. The test never knew there was a second queue to wait on.
Root cause statement: The 20th V_TASK_COMPLETED event is enqueued into the
recovery dispatcher rather than handled directly when recoveryData != null,
and the test (and any caller relying on the primary dispatcher having drained)
observes a stale completedTaskCount. The fix is either to (a) join the
recovery dispatcher in await(), (b) handle the recovery-data branch
synchronously when not actually replaying recovery, or (c) document that
callers must use a different barrier.
That is a root cause. The fix direction is now obvious-ish. You can argue between (a), (b), (c) — but you know what each one changes.
Git Archaeology
Once you have a candidate cause, ask: when did this break? And why did the person who wrote it think it was correct?
git log --follow -p -S<token>
Find every commit that introduced or removed a specific string or method name:
cd ~/tez-src
# Every commit that touched the recovery dispatcher branch
git log --follow -p -S "recoveryData != null" \
-- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
# Every commit that mentions the counter
git log --follow -p -S "completedTaskCount" \
-- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
# The original change that added recovery dispatching
git log --all --grep="TEZ-2877" --oneline
-S ("pickaxe") matches commits where the count of that string changed —
either added or removed. It is the single most powerful git command in
this entire chapter. Learn it.
git blame -L <start>,<end>
Once you know the file and lines, find the commit and committer:
git blame -L 2740,2770 \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
Output looks like:
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2745) if (recoveryData != null) {
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2746) handleRecovery(event);
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2747) return;
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2748) }
Then read the commit:
git show a1b2c3d4
git log -1 --format="%B" a1b2c3d4
Look for the JIRA reference in the commit message (TEZ-NNNN: ...). Open
that JIRA. Read every comment. Often you will discover:
- The change was made to fix a different bug (recovery correctness) and introduced your bug as collateral.
- There was a comment on the original JIRA flagging the exact concern you are hitting. ("This might race with the test dispatcher pattern" — and it did.)
- The fix you are considering was discussed and rejected for a reason you must now address.
git bisect for Regressions
If the bug is a regression — works in 0.9.x, broken in 0.10.x — bisect tells you the exact commit that introduced it. This is the highest-confidence signal in all of root-cause work.
cd ~/tez-src
git bisect start
git bisect bad master
git bisect good rel/release-0.9.2
# git checks out a midpoint commit. Build and run the repro:
mvn install -DskipTests -pl tez-dag -am -q
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q
# If the test FAILS at this commit: bug exists here
git bisect bad
# If the test PASSES at this commit: bug introduced later
git bisect good
# Repeat. git narrows to one commit in log2(N) steps.
Once bisect converges:
a1b2c3d4 is the first bad commit
commit a1b2c3d4
Author: Alice <alice@example.org>
Date: Thu Apr 12 09:34:18 2018
TEZ-2877: Add recovery dispatcher path
Now you know:
- The JIRA that introduced the regression.
- The author (potential reviewer for your fix — Cc them).
- The exact diff to study.
Automating bisect with git bisect run <script> is also fair game once you
have a return-code-clean reproducer command.
Writing the Root-Cause Statement
This document goes into your JIRA, into your PR description, and into your write-up. 200–500 words, no more, no less. Use this template:
## Root cause: TEZ-NNNN
### Symptom
<one sentence — what the user sees>
### Trigger conditions
- <condition 1, e.g. recovery data is non-null when V_TASK_COMPLETED fires>
- <condition 2, e.g. only on the last task in a vertex>
- <condition 3 if any>
### Affected code
- `tez-dag/src/main/java/.../VertexImpl.java#L2745-L2748` (the recovery branch)
- `tez-dag/src/main/java/.../AsyncDispatcher.java#L210` (`await()` does not
join the secondary queue)
### Mechanism
<three to five sentences explaining the actual defect. Use words like "because",
"as a result", "however". This is the part most people get wrong — they describe
the symptom again instead of the mechanism. The mechanism answers: of the
many ways this code could have been written, why does the current way produce
this wrong answer?>
### Introducing change
- TEZ-2877 (commit a1b2c3d4) added the recovery-dispatch branch without
updating `AsyncDispatcher.await()` to join the recovery queue.
- The original JIRA flagged this as a concern (link to comment) but the
resolution was deferred ("we don't await in production paths, only in
tests").
### Fix direction
Three options considered:
1. **Join the recovery dispatcher in `await()`.** Smallest change. Risk: may
slow recovery in production if a slow recovery handler blocks the await.
2. **Handle the recovery branch synchronously when not replaying.** Larger
change, narrower blast radius. Recommended.
3. **Document that tests must use a new barrier.** Cheapest. Pushes burden
onto every test author. Rejected.
Recommended: option 2. See Step 5 for the diff.
Save as capstone-work/root-cause.md.
Validating the Hypothesis
A root cause is not validated until you have demonstrated it. Two ways:
1. Revert the introducing commit and re-run the repro
git checkout master
git revert --no-commit a1b2c3d4 # introducing commit from bisect
mvn install -DskipTests -pl tez-dag -am -q
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q
If the test now PASSES (because the change you reverted is what introduced the bug), your root cause is at least partially correct. If it still FAILS, the introducing commit is not the root cause — there is a deeper issue.
Reset before you go any further:
git reset --hard origin/master
2. Make a minimal one-line "patch" that confirms the mechanism
You are not writing the real fix yet. You are confirming the mechanism. For the example above:
--- a/tez-dag/.../VertexImpl.java
+++ b/tez-dag/.../VertexImpl.java
@@ -2745,3 +2745,3 @@
- if (recoveryData != null) {
+ if (recoveryData != null && isReplayingRecovery()) {
handleRecovery(event);
return;
}
(Assume isReplayingRecovery() does not exist yet — pretend it returns false
in tests, true only during actual recovery replay.) Apply this, re-run the
repro. If it passes, the mechanism is confirmed even if the actual API does
not exist yet.
If the test still fails: your mechanism is wrong. Go back to the five-whys.
If the test now passes but breaks 14 other tests: your fix direction is too broad. Go back to "fix direction" in the root-cause statement and pick a narrower option.
Validation / Self-check
Before advancing to Step 5:
capstone-work/root-cause.mdexists, follows the template, is 200–500 words.- You can name the introducing commit (full SHA) and JIRA.
- You ran
git bisectto convergence (or proved bisect doesn't apply because the bug existed since the file was first added — note this in the doc). - You ran a "revert introducing commit" experiment and saw the test go green (or have a documented reason the revert doesn't apply).
- You wrote a one-line throwaway "mechanism confirmation" patch and saw the test pass on it.
- You have read every comment on the introducing JIRA.
- You can articulate three fix directions and explain why you rejected two of them in one sentence each.