Lab 8.3 — Improve Error Messages for Failed DAGs

Lab type: Fix-It (error message quality)
Estimated time: 90 min


Overview

Poor error messages are a common complaint from Tez users. A diagnostic such as "Container exited with a non-zero exit code" tells an operator almost nothing: not which container, not which exit code, not where to look next. In this lab you find one weak diagnostic message in the Tez Application Master (AM) and improve it.


Step 1 — Find Weak Error Messages

Search for generic or unhelpful diagnostics:

grep -rn '"Container exited\|"Task failed\|"Vertex failed\|unknown error' \
  ~/tez-src/tez-dag/src/main/java/ | grep -v test | head -20

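Also look for messages built by concatenating a value that may be null. Concatenation itself does not throw, but it puts the literal text "null" into the message, and any method call on the null value throws a NullPointerException. As a rough heuristic, list lines that mention diagnostics, a + operator, and null together:

grep -rn 'diagnostics.*[+].*null\|null.*[+].*diagnostics' \
  ~/tez-src/tez-dag/src/main/java/ | head -20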
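
To make the null-handling concern concrete, here is a minimal Java sketch. It is not Tez code: the class and method names are hypothetical and exist only for illustration. It contrasts plain concatenation, a method call on a null value, and a null-safe alternative based on java.util.Objects.

```java
import java.util.Objects;

// Hypothetical demo class, not part of Tez.
class NullConcatDemo {

    // Plain concatenation never throws on a null operand;
    // it silently embeds the text "null" in the message.
    static String naive(Object containerId) {
        return "Container " + containerId + " failed";
    }

    // Calling a method on the null operand is what throws NullPointerException.
    static String risky(Object containerId) {
        return "Container " + containerId.toString() + " failed";
    }

    // Objects.toString(value, fallback) substitutes a readable placeholder.
    static String safe(Object containerId) {
        return "Container " + Objects.toString(containerId, "<unknown>") + " failed";
    }

    public static void main(String[] args) {
        System.out.println(naive(null)); // Container null failed
        System.out.println(safe(null));  // Container <unknown> failed
        try {
            risky(null);
        } catch (NullPointerException e) {
            System.out.println("risky(null) threw NullPointerException");
        }
    }
}
```

When you review the grep hits, the "naive" and "risky" shapes are the ones worth flagging as candidates.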

Step 2 — Pick a Target

Select one diagnostic message that you can improve. Good candidates:

  • A message that says "failed" without explaining why
  • A message that prints "null" or throws a NullPointerException when a field is unset
  • A message that uses a raw integer code without a human-readable explanation

Step 3 — Understand the Context

For your chosen message:

  1. What class emits it?
  2. What state transition triggers it?
  3. What information is available at that point (in the method parameters or fields) that could be added to the message?

Step 4 — Improve the Message

Example improvement:

// Before (unhelpful):
diagnostics.add("Container " + containerId + " failed");

// After (actionable):
// describeExitStatus() is a hypothetical helper that you write yourself;
// YARN's ContainerExitStatus only defines integer constants.
diagnostics.add(String.format(
    "Container %s failed with exit code %d (%s). " +
    "Check container logs at: %s",
    containerId,
    exitCode,
    describeExitStatus(exitCode),
    logURL));
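
The improved message depends on a helper that turns the exit code into text. The following self-contained sketch shows one way to write it. Everything here is an assumption for illustration: the class ContainerDiagnostics and both method names are hypothetical, and the numeric codes are believed to mirror constants in YARN's ContainerExitStatus (ABORTED, DISKS_FAILED, PREEMPTED, KILLED_EXCEEDED_VMEM, KILLED_EXCEEDED_PMEM) but should be verified against your Hadoop version. In a real patch, reference those constants by name instead of hard-coding numbers.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical helper, not part of Tez or YARN.
class ContainerDiagnostics {

    // Maps an exit code to a short description. The numeric values are
    // assumed to match YARN's ContainerExitStatus; verify before relying on them.
    static String describeExitStatus(int exitCode) {
        switch (exitCode) {
            case 0:    return "success";
            case -100: return "aborted by the framework";
            case -101: return "disks failed";
            case -102: return "preempted";
            case -103: return "killed: exceeded virtual memory limit";
            case -104: return "killed: exceeded physical memory limit";
            default:   return "application-specific exit code";
        }
    }

    // Builds the diagnostic text; null fields become readable placeholders
    // instead of the literal text "null".
    static String containerFailed(Object containerId, int exitCode, String logUrl) {
        return String.format(Locale.ROOT,
            "Container %s failed with exit code %d (%s). Check container logs at: %s",
            Objects.toString(containerId, "<unknown>"),
            exitCode,
            describeExitStatus(exitCode),
            Objects.toString(logUrl, "<log URL unavailable>"));
    }
}
```

For example, containerFailed("container_X", -104, null) yields "Container container_X failed with exit code -104 (killed: exceeded physical memory limit). Check container logs at: <log URL unavailable>". Keeping the formatting in one method also gives the test in Step 5 a single place to target.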

Step 5 — Write a Test for the New Message

The test should verify that:

  1. The improved message appears in TaskAttemptImpl.getDiagnostics() or VertexImpl.getDiagnostics() after the relevant failure event
  2. It contains the expected key fields (exit code, container ID, etc.)

Pattern:

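// Assumes JUnit 4; requires:
//   import static org.junit.Assert.assertTrue;
//   import java.util.List;
//   import org.junit.Test;
@Test
public void testDiagnosticsContainsExitCode() {
    // ... set up failing task attempt with exit code 123 ...
    List<String> diags = taskAttempt.getDiagnostics();
    // The substring must match the format used by the improved message.
    assertTrue("Diagnostics should contain exit code",
        diags.stream().anyMatch(d -> d.contains("exit code 123")));
}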
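
A test for an error message should itself fail with a useful message. The sketch below is one possible helper for that, assuming nothing beyond the JDK: DiagnosticsAssert and its methods are hypothetical names, not part of Tez or JUnit. It checks several expected fragments at once and, on failure, reports which fragments were missing and what the diagnostics actually contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical test helper, not part of Tez or JUnit.
class DiagnosticsAssert {

    // Returns the expected fragments that appear in none of the diagnostics.
    static List<String> missingFragments(List<String> diagnostics, String... expected) {
        List<String> missing = new ArrayList<>();
        for (String fragment : expected) {
            boolean found = diagnostics.stream()
                .anyMatch(d -> d != null && d.contains(fragment));
            if (!found) {
                missing.add(fragment);
            }
        }
        return missing;
    }

    // Fails with a message naming the missing fragments and the actual diagnostics.
    static void assertDiagnosticsContain(List<String> diagnostics, String... expected) {
        List<String> missing = missingFragments(diagnostics, expected);
        if (!missing.isEmpty()) {
            throw new AssertionError(
                "Diagnostics missing " + missing + "; actual diagnostics: " + diagnostics);
        }
    }
}
```

In a test you might call assertDiagnosticsContain(taskAttempt.getDiagnostics(), "exit code 123", containerId.toString()), so a single failure shows every missing field rather than only the first.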

Step 6 — Format Patch and JIRA

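Stage your changes first so that any newly created test file is included in the patch (plain git diff omits untracked files):

git add <files you changed>
git diff --cached > /tmp/TEZ-ERRORMSG.001.patch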

JIRA title pattern: [tez-dag] Improve error message for [specific failure scenario]


Reflection Questions

  1. What makes a good diagnostic message? List 4 properties.
  2. Why do projects accumulate bad error messages over time? (Hint: think about who writes the code vs. who runs it.)
  3. Find a Tez JIRA where the only change was improving a log or diagnostic message. Was the patch accepted? How long did the review take?