Lab 8.3 — Improve Error Messages for Failed DAGs
Lab type: Fix-It (error message quality)
Estimated time: 90 min
Overview
Poor error messages are one of the most common complaints from Tez users. "Container exited with a non-zero exit code" tells an operator almost nothing. This lab focuses on finding and improving a diagnostic message in the Tez AM.
Step 1 — Find Weak Error Messages
Search for generic or unhelpful diagnostics:
grep -rn '"Container exited\|"Task failed\|"Vertex failed\|unknown error' \
~/tez-src/tez-dag/src/main/java/ | grep -v test | head -20
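Also look for messages built by string concatenation on a potentially null object. In a basic regular expression an unescaped + matches a literal plus sign, whereas GNU grep treats \+ as a repetition operator:
grep -rn 'diagnostics.*+.*null\|null.*+.*diagnostics' \
~/tez-src/tez-dag/src/main/java/ | head -20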
Step 2 — Pick a Target
Select one diagnostic message that you can improve. Good candidates:
- A message that says "failed" without explaining why
- A message that could NPE if a field is null
- A message that uses a raw integer code without a human-readable explanation
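To make the second candidate concrete, here is a minimal sketch of how a diagnostic can throw a NullPointerException and one way to guard against it. The class `ContainerInfo` and the method `describe` are hypothetical names invented for this illustration; they are not Tez classes. Note that concatenating a null reference is itself harmless in Java; the exception comes from calling a method on the null reference while building the message.

```java
import java.util.Objects;

public class NullSafeDiagnostics {
    /** Hypothetical stand-in for an object whose fields may be unset. */
    static final class ContainerInfo {
        final String nodeId; // may be null if the container never launched

        ContainerInfo(String nodeId) {
            this.nodeId = nodeId;
        }

        String getNodeId() {
            return nodeId;
        }
    }

    /** Builds a message without dereferencing a possibly-null container. */
    static String describe(String containerId, ContainerInfo info) {
        // "..." + null is safe (it appends "null"), but info.getNodeId()
        // would throw a NullPointerException when info itself is null.
        String node = (info == null) ? null : info.getNodeId();
        return "Container " + containerId + " failed on node "
            + Objects.toString(node, "<unknown>");
    }

    public static void main(String[] args) {
        System.out.println(describe("container_1", new ContainerInfo("host-a:8041")));
        System.out.println(describe("container_2", null));
    }
}
```

Substituting a visible placeholder such as `<unknown>` keeps the message readable and tells the operator that the field was missing rather than silently printing "null".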
Step 3 — Understand the Context
For your chosen message:
- What class emits it?
- What state transition triggers it?
- What information is available at that point (in the method parameters or fields) that could be added to the message?
Step 4 — Improve the Message
Example improvement:
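// Before (unhelpful):
diagnostics.add("Container " + containerId + " failed");
// After (actionable):
// describeExitStatus is a hypothetical helper you would write to map the
// integer code to a short explanation. exitCode and logURL are assumed to
// be available in the enclosing method.
diagnostics.add(String.format(
    "Container %s failed with exit code %d (%s). " +
    "Check container logs at: %s",
    containerId,
    exitCode,
    describeExitStatus(exitCode),
    logURL));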
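The human-readable explanation in the example above has to come from somewhere. The following is a sketch of one possible lookup, not an existing Tez or YARN API: `ExitCodeDescriber`, `describeExitStatus`, and `buildDiagnostic` are hypothetical names. The negative codes are assumed to match the constants defined in YARN's `ContainerExitStatus`; confirm them against the Hadoop version you build with, and prefer the named constants over literals in a real patch.

```java
public class ExitCodeDescriber {
    /**
     * Maps a container exit code to a short explanation.
     * Assumption: the negative values mirror YARN's ContainerExitStatus
     * constants. Verify them before relying on this table.
     */
    static String describeExitStatus(int exitCode) {
        switch (exitCode) {
            case 0:
                return "success";
            case -100:
                return "aborted by the framework";
            case -101:
                return "node disks failed";
            case -102:
                return "preempted";
            case -103:
                return "killed for exceeding virtual memory limit";
            case -104:
                return "killed for exceeding physical memory limit";
            default:
                // Shell convention: 128 + N means the process died from signal N.
                if (exitCode > 128 && exitCode <= 128 + 64) {
                    return "terminated by signal " + (exitCode - 128);
                }
                return "unrecognized exit code";
        }
    }

    /** Assembles the full diagnostic line from its key fields. */
    static String buildDiagnostic(String containerId, int exitCode, String logUrl) {
        return String.format(
            "Container %s failed with exit code %d (%s). Check container logs at: %s",
            containerId, exitCode, describeExitStatus(exitCode), logUrl);
    }

    public static void main(String[] args) {
        System.out.println(buildDiagnostic("container_1", 137, "http://nm-host/logs"));
    }
}
```

Keeping the lookup in one method means every diagnostic that reports an exit code explains it the same way, and an unknown code still produces a message instead of an exception.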
Step 5 — Write a Test for the New Message
The test should verify that:
- The improved message appears in TaskAttemptImpl.getDiagnostics() or VertexImpl.getDiagnostics() after the relevant failure event
- It contains the expected key fields (exit code, container ID, etc.)
Pattern:
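// JUnit 4 style imports (assumed; match whatever the surrounding test class uses)
import static org.junit.Assert.assertTrue;
import java.util.List;
import org.junit.Test;

@Test
public void testDiagnosticsContainsExitCode() {
    // ... set up failing task attempt with specific exit code ...
    List<String> diags = taskAttempt.getDiagnostics();
    // The fragment must match the wording the improved message actually produces.
    assertTrue("Diagnostics should contain exit code",
        diags.stream().anyMatch(d -> d.contains("exit code 123")));
}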
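When a message carries several key fields, it can help to check that they all appear in the same diagnostic line, rather than scattered across unrelated ones. The helper below is a small sketch of that idea in plain Java, with no test framework required; `DiagnosticsAssertions` and `anyLineContainsAll` are hypothetical names, not part of Tez.

```java
import java.util.Arrays;
import java.util.List;

public class DiagnosticsAssertions {
    /** Returns true if at least one diagnostic line contains every fragment. */
    static boolean anyLineContainsAll(List<String> diags, String... fragments) {
        return diags.stream().anyMatch(
            d -> Arrays.stream(fragments).allMatch(d::contains));
    }

    public static void main(String[] args) {
        List<String> diags = Arrays.asList(
            "Vertex v1 failed",
            "Container c1 failed with exit code 123 (unrecognized exit code)");

        // Both fields appear together in the second line.
        System.out.println(anyLineContainsAll(diags, "c1", "exit code 123"));
        // "v1" and the exit code never appear in the same line.
        System.out.println(anyLineContainsAll(diags, "v1", "exit code 123"));
    }
}
```

Asserting on key fragments instead of the full string keeps the test stable when the surrounding wording is edited during review.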
Step 6 — Format Patch and JIRA
git diff > /tmp/TEZ-ERRORMSG.001.patch
JIRA title pattern: [tez-dag] Improve error message for [specific failure scenario]
Reflection Questions
- What makes a good diagnostic message? List 4 properties.
- Why do projects accumulate bad error messages over time? (Hint: think about who writes the code vs. who runs it.)
- Find a Tez JIRA where the only change was improving a log or diagnostic message. Was the patch accepted? How long did the review take?