Stage 3 — Error Messages and Exception Context

What this stage teaches

Stage 3 is the first stage where you change behavior visible to operators in a production postmortem. You learn:

  • The CONTEXT rule for tez-dag: every error raised, logged, or rethrown inside the AppMaster must include the DAG ID, and the vertex/task/attempt ID wherever the call site has them in scope.
  • How to chain causes correctly: throw new TezException(msg, cause) instead of calling cause.printStackTrace() and then throwing new TezException(msg), which discards the cause.
  • How to find exception sites that swallow the original cause: a catch (Exception e) followed by throw new RuntimeException("init failed") is the canonical bug.
  • How NDC (Nested Diagnostic Context, configured in tez-common) propagates IDs into log messages automatically — and how to add explicit IDs where NDC is not set up.

These patches are 20–200 lines, often single-method changes that touch error paths. The reviewer test is brutal but fair: "If this exception fires in a production AM log, can the on-call engineer identify the DAG, vertex, and task without cross-referencing any other log file?" If the answer is "no," the patch is not done.
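The difference between swallowing and chaining a cause can be seen in a small self-contained sketch. DemoException is a hypothetical stand-in for TezException (only the standard (String) and (String, Throwable) constructors are assumed), and loadConfig() is an invented failure source; none of this is Tez code.

```java
import java.io.IOException;

// Hypothetical stand-in for TezException; only the two standard constructors are assumed.
class DemoException extends Exception {
  DemoException(String message) { super(message); }
  DemoException(String message, Throwable cause) { super(message, cause); }
}

class CauseChainingDemo {
  // Invented low-level failure that carries the real diagnosis.
  static void loadConfig() throws IOException {
    throw new IOException("tez-site.xml not readable");
  }

  // Anti-pattern: the IOException is dropped, so only "init failed" survives.
  static void initSwallowing() throws DemoException {
    try {
      loadConfig();
    } catch (IOException e) {
      throw new DemoException("init failed");
    }
  }

  // Preferred: the IOException travels with the new exception as its cause.
  static void initChaining() throws DemoException {
    try {
      loadConfig();
    } catch (IOException e) {
      throw new DemoException("init failed: " + e.getMessage(), e);
    }
  }

  public static void main(String[] args) {
    try {
      initSwallowing();
    } catch (DemoException e) {
      System.out.println("swallowed cause: " + e.getCause());
    }
    try {
      initChaining();
    } catch (DemoException e) {
      System.out.println("chained cause:   " + e.getCause());
    }
  }
}
```

With the chained form, a logger that is handed the exception prints the original failure under "Caused by:", which is the record an on-call engineer needs.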

JIRA filter to find candidates

project = TEZ
  AND resolution = Unresolved
  AND (text ~ "uninformative error" OR text ~ "missing context"
       OR text ~ "swallowed exception" OR text ~ "no DAG id"
       OR text ~ "improve error message"
       OR (description ~ "InvalidStateTransitonException"
           AND text ~ "stack trace"))
ORDER BY updated DESC

A second sweep — find your own candidates by grep:

cd ~/tez-src
# error sites that build a message without an ID
grep -rn 'throw new .*Exception(".*failed' tez-dag/src/main/java \
  | grep -v "ID\|Id\|getName" | head -30

# catch sites that drop the cause
grep -rn "catch (.*Exception .*)" tez-dag/src/main/java -A 2 \
  | grep -B1 "throw new" | grep -v ", e)" | head -30

The second grep is fuzzy; you will get false positives. But every true positive is a Stage 3 patch.

The CONTEXT rule for tez-dag

Every error inside the AppMaster must include enough state to identify which DAG instance on which AM on which application attempt threw it. The minimum fields, listed in priority order:

  1. The DAG ID (TezDAGID).
  2. The Vertex ID (TezVertexID) — required if the error is in a vertex context.
  3. The Task ID (TezTaskID) — required if in a task context.
  4. The Task Attempt ID (TezTaskAttemptID) — required if in an attempt context.
  5. The container ID — required for container-management errors.

Each of these IDs has a stable string form (toString() returns the canonical representation), and each is reachable from the relevant impl object in tez-dag:

grep -n "getDagId\|getVertexId\|getTaskId\|getTaskAttemptId" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

If you are editing a method on VertexImpl, you have getVertexId() and getDagId() in scope. If you do not include them in the error, the patch is incomplete.
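One way to apply the rule consistently is a tiny formatting helper that renders whichever IDs the call site has. The sketch below is an illustration only: ErrorContext and describe() are hypothetical names, not part of Tez, and the ID strings in main() are sample values.

```java
// Hypothetical helper, not part of Tez: formats whichever IDs the call site has in scope.
class ErrorContext {
  // Arguments may be the ID objects themselves or their toString() forms;
  // pass null for any ID that is not in scope at the call site.
  static String describe(Object dagId, Object vertexId, Object taskId, Object attemptId) {
    StringBuilder sb = new StringBuilder();
    append(sb, "dag", dagId);
    append(sb, "vertex", vertexId);
    append(sb, "task", taskId);
    append(sb, "attempt", attemptId);
    return "[" + sb + "]";
  }

  private static void append(StringBuilder sb, String label, Object id) {
    if (id == null) {
      return; // skip IDs the caller does not have
    }
    if (sb.length() > 0) {
      sb.append(", ");
    }
    sb.append(label).append('=').append(id);
  }

  public static void main(String[] args) {
    // Sample ID strings for illustration only.
    String msg = "Vertex init failed "
        + describe("dag_1_0001_1", "vertex_1_0001_1_00", null, null);
    System.out.println(msg);
  }
}
```

The fields are emitted in the same priority order as the list above, so a vertex-level error carries the DAG and vertex IDs and simply omits the task and attempt.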

Walked example A — uninformative TezException in VertexImpl.maybeSendConfiguredEvent

Symptom: a user reports their DAG fails with:

2026-04-12 10:14:21,003 ERROR [Dispatcher thread] org.apache.tez.dag.app.dag.impl.VertexImpl:
  Vertex init failed
org.apache.tez.dag.api.TezException: init failed
    at org.apache.tez.dag.app.dag.impl.VertexImpl.maybeSendConfiguredEvent(VertexImpl.java:NNNN)

That error tells the operator nothing. No DAG ID, no vertex name, no cause.

Step 1 — Find the throw site

cd ~/tez-src
grep -n 'throw new TezException("init failed' \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

Read 20 lines of context around the hit. The method has vertexId, getDagId(), and getName() all in scope.

Step 2 — The diff

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
-    } catch (AMUserCodeException e) {
-      throw new TezException("init failed");
-    }
+    } catch (AMUserCodeException e) {
+      String msg = String.format(
+          "Vertex %s (%s) of DAG %s failed during configured-event dispatch: %s",
+          getName(), vertexId, getDagId(), e.getMessage());
+      LOG.error(msg, e);
+      throw new TezException(msg, e);
+    }

What changed:

  1. The message now identifies the vertex name (human-readable), the vertex ID (machine-stable), and the DAG ID.
  2. The original exception is chained via the two-argument TezException constructor. The full stack trace survives.
  3. The error is also logged at ERROR with the cause. Belt and braces — some callers swallow the exception silently, and the log line is the only record that survives.
  4. String.format is used so the placeholders are visually aligned with the field names. Reviewers prefer it over +-concatenation when the message has more than three substitutions.

Step 3 — Regression test

Add to tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java:

@Test(timeout = 5000)
public void testInitFailureMessageIncludesIds() throws Exception {
  VertexImpl v = createVertexThatFailsInConfigured(); // existing helper pattern
  try {
    v.maybeSendConfiguredEvent();
    fail("expected TezException");
  } catch (TezException e) {
    assertTrue("message should contain vertex id",
        e.getMessage().contains(v.getVertexId().toString()));
    assertTrue("message should contain dag id",
        e.getMessage().contains(v.getDagId().toString()));
    assertNotNull("cause should be preserved", e.getCause());
  }
}

The test asserts on substring presence, not exact string equality. Reviewers reject exact-string assertions because they break the next time the message is rephrased.
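The substring approach can be factored into a small assertion helper. This is a sketch under stated assumptions: MessageAssertions and assertMessageContainsAll are hypothetical names, it uses plain AssertionError instead of JUnit so that it runs on its own, and the IDs are sample strings.

```java
// Hypothetical test helper: checks that every expected fragment appears in the message.
class MessageAssertions {
  static void assertMessageContainsAll(Throwable t, String... fragments) {
    String message = t.getMessage();
    if (message == null) {
      throw new AssertionError("exception has no message: " + t);
    }
    for (String fragment : fragments) {
      if (!message.contains(fragment)) {
        throw new AssertionError(
            "message <" + message + "> is missing <" + fragment + ">");
      }
    }
  }

  public static void main(String[] args) {
    String dagId = "dag_1_0001_1";          // sample value
    String vertexId = "vertex_1_0001_1_00"; // sample value
    Exception original =
        new Exception("Vertex v1 (" + vertexId + ") of DAG " + dagId + " failed");
    Exception rephrased =
        new Exception("DAG " + dagId + ": vertex " + vertexId + " could not start");
    // Both phrasings pass because only the IDs are asserted, not the wording.
    assertMessageContainsAll(original, dagId, vertexId);
    assertMessageContainsAll(rephrased, dagId, vertexId);
    System.out.println("both phrasings pass");
  }
}
```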

Step 4 — Run targeted tests

cd ~/tez-src
mvn -pl tez-dag test -Dtest=TestVertexImpl -q 2>&1 | tail -40

The full TestVertexImpl suite takes 3–5 minutes on a laptop. Run it. A state-machine-adjacent change always risks breaking a sibling transition.

Walked example B — swallowed cause in DAGAppMaster.startDAG

Find the bug:

cd ~/tez-src
grep -rn "catch (.*Exception" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java \
  -A 3 | grep -B1 "throw new" | head -20

Suppose the offender looks like:

try {
  initServices();
} catch (Exception e) {
  throw new TezUncheckedException("Failed to start AM");
}

The diff:

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
@@
     try {
       initServices();
     } catch (Exception e) {
-      throw new TezUncheckedException("Failed to start AM");
+      throw new TezUncheckedException(
+          "Failed to start AM for application attempt " + appAttemptID + ": "
+              + e.getMessage(), e);
     }

Two fixes at once: the cause is preserved (the second constructor argument), and the message now includes the appAttemptID, which the surrounding DAGAppMaster has in scope. This patch is small but high-leverage: the AM startup path is the single most common place a swallowed cause hides a real configuration bug.

Walked example C — log-only context via NDC

Some hot paths cannot afford a String.format per call. The Tez convention there is NDC. Look in tez-common/src/main/java/org/apache/tez/common/CallableWithNdc.java:

cat $(find ~/tez-src/tez-common/src/main/java -name "CallableWithNdc.java")

When the dispatcher invokes a vertex transition callback, it pushes the vertex ID onto the NDC stack. log4j's %x conversion pattern (lowercase; %X{...} is the MDC lookup) then includes the ID in every log line for the duration of the call.

If you discover a log message in VertexImpl that lacks the vertex ID, first check whether NDC already provides it via the log pattern. If yes, the message is fine; if no, add the ID inline. Reviewers reject patches that add an explicit ID where NDC already prints it.
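The push/pop mechanics can be illustrated with a hand-rolled, thread-local stack. This is a simplified model for intuition, not log4j and not Tez code: MiniNdc, MiniLogger, and NdcDemo are hypothetical names, and a real deployment would rely on log4j's own NDC class and pattern layout instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, hypothetical model of a nested diagnostic context (per-thread stack).
class MiniNdc {
  private static final ThreadLocal<Deque<String>> STACK =
      ThreadLocal.withInitial(ArrayDeque::new);

  static void push(String context) { STACK.get().addLast(context); }

  static void pop() { STACK.get().pollLast(); }

  // Joins the stack from oldest to newest entry, separated by spaces.
  static String current() { return String.join(" ", STACK.get()); }
}

// Hypothetical logger stand-in: prefixes each message with the current context.
class MiniLogger {
  static String format(String message) {
    String context = MiniNdc.current();
    return context.isEmpty() ? message : "[" + context + "] " + message;
  }
}

class NdcDemo {
  // Models a dispatcher that sets the context around a callback.
  static String runTransition(String vertexId) {
    MiniNdc.push(vertexId);
    try {
      // The callback itself never mentions the vertex ID.
      return MiniLogger.format("transition failed");
    } finally {
      MiniNdc.pop(); // always restore the previous context
    }
  }

  public static void main(String[] args) {
    System.out.println(runTransition("vertex_1_0001_1_00")); // sample ID
    System.out.println(MiniLogger.format("outside any transition"));
  }
}
```

The try/finally is the important part of the design: if the pop is skipped on an error path, later log lines on the same thread are attributed to the wrong vertex.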

Pitfalls

  • Don't include e.getStackTrace() in your message. The stack trace is what LOG.error(msg, e) is for. Concatenating it into the message turns a one-line log into a 60-line one.
  • Don't use e.toString() in messages. Use e.getMessage(): toString() prepends the exception class name, which the chained throwable already reports along with the stack trace.
  • Don't catch Throwable to add context. Catching Throwable swallows OutOfMemoryError and ThreadDeath. Catch Exception (or the narrowest superclass that fits).
  • Don't add context that requires a lock. A getName() call that internally takes the vertex write-lock is a deadlock waiting to happen if the error path itself holds the lock. Always check the lock semantics of the getter you call in an error path.
  • Don't change the exception type to add context. A site that threw TezException must still throw TezException after your patch; changing it to TezUncheckedException is a behavior change and is not allowed in Stage 3.
  • Don't add context that includes user data without redaction. If your error message includes a configuration value, check whether it could contain credentials. The Tez convention is to print the key, not the value, when the key matches .*\.(password|secret|token|credential).
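The redaction pitfall in the last bullet can be handled with a small guard around any configuration value that goes into a message. The sketch below reuses the key pattern quoted in that bullet; ConfigRedactor, describe(), and the my.service.password key are hypothetical names for illustration, and the <redacted> placeholder is an assumption about how the omitted value is shown.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tez: hides values whose keys look sensitive.
class ConfigRedactor {
  // Key pattern taken from the pitfall above; matched case-sensitively as written.
  private static final Pattern SENSITIVE_KEY =
      Pattern.compile(".*\\.(password|secret|token|credential)");

  // Returns "key=value" for ordinary keys and "key=<redacted>" for sensitive ones.
  static String describe(String key, String value) {
    if (SENSITIVE_KEY.matcher(key).matches()) {
      return key + "=<redacted>";
    }
    return key + "=" + value;
  }

  public static void main(String[] args) {
    System.out.println(describe("tez.am.resource.memory.mb", "1024"));
    System.out.println(describe("my.service.password", "hunter2")); // invented key
  }
}
```

Note that matches() requires the whole key to fit the pattern, so a key that merely contains one of the words in the middle is not redacted by this sketch.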

Exit criteria — when you're ready for the next stage

Move to Stage 4 when:

  • You have shipped at least one error-context patch in tez-dag and one in tez-runtime-library, each including the DAG and vertex/task IDs.
  • A reviewer has accepted your test pattern (substring assertion, no exact-string match) without a comment.
  • You can find at least three more candidate error sites in five minutes of grepping without referring to this chapter.
  • You have read VertexImpl.maybeSendConfiguredEvent and the surrounding 200 lines without feeling lost — that file is the gateway to Stage 4.

Stage 4 will take you inside the state machines themselves.