Stage 3 — Error Messages and Exception Context
What this stage teaches
Stage 3 is the first stage where you change behaviour visible to operators in a production postmortem. You learn:
- The CONTEXT rule for tez-dag: every error raised, logged, or rethrown inside the AppMaster must include the DAG ID, and the vertex/task/attempt ID wherever the call site has them in scope.
- How to chain causes correctly: throw new TezException(msg, cause) instead of throw new TezException(msg) then cause.printStackTrace().
- How to find exception sites that swallow the original cause: a catch (Exception e) followed by throw new RuntimeException("init failed") is the canonical bug.
- How NDC (Nested Diagnostic Context, configured in tez-common) propagates IDs into log messages automatically, and how to add explicit IDs where NDC is not set up.
These patches are 20–200 lines, often single-method changes that touch error paths. The reviewer test is brutal but fair: "If this exception fires in a production AM log, can the on-call engineer identify the DAG, vertex, and task without cross-referencing any other log file?" If the answer is "no," the patch is not done.
JIRA filter to find candidates
project = TEZ
AND resolution = Unresolved
AND (text ~ "uninformative error" OR text ~ "missing context"
OR text ~ "swallowed exception" OR text ~ "no DAG id"
OR text ~ "improve error message"
OR (description ~ "InvalidStateTransitonException" AND text ~ "stack trace"))
ORDER BY updated DESC
A second sweep — find your own candidates by grep:
cd ~/tez-src
# error sites that build a message without an ID
grep -rn 'throw new .*Exception(".*failed' tez-dag/src/main/java \
| grep -v "ID\|Id\|getName" | head -30
# catch sites that drop the cause
grep -rn "catch (.*Exception .*)" tez-dag/src/main/java -A 2 \
| grep -B1 "throw new" | grep -v ", e)" | head -30
The second grep is fuzzy; you will get false positives. But every true positive is a Stage 3 patch.
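The heuristic behind the second grep can also be sketched in Java, which makes it easier to see why it produces false positives. This is an illustration only: DroppedCauseScanSketch and its methods are hypothetical names, the scan is regex and brace counting rather than a real parser, and it will be fooled by braces or variable names inside string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flags catch blocks that throw a new exception
// without mentioning the caught variable anywhere after "throw new".
final class DroppedCauseScanSketch {
  private static final Pattern CATCH = Pattern.compile(
      "catch\\s*\\(\\s*(?:final\\s+)?[\\w.|\\s]+?\\s+(\\w+)\\s*\\)\\s*\\{");

  private DroppedCauseScanSketch() {}

  // Returns the 1-based line numbers of suspicious catch clauses.
  static List<Integer> suspects(String source) {
    List<Integer> hits = new ArrayList<>();
    Matcher m = CATCH.matcher(source);
    while (m.find()) {
      String var = m.group(1);
      String body = blockBody(source, m.end());
      int t = body.indexOf("throw new");
      if (t < 0) {
        continue; // no rethrow in this block
      }
      Pattern usesVar = Pattern.compile("\\b" + Pattern.quote(var) + "\\b");
      if (!usesVar.matcher(body.substring(t)).find()) {
        hits.add(lineOf(source, m.start()));
      }
    }
    return hits;
  }

  // Text from just after an opening brace up to its matching closing brace.
  static String blockBody(String s, int start) {
    int depth = 1;
    for (int i = start; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '{') {
        depth++;
      } else if (c == '}' && --depth == 0) {
        return s.substring(start, i);
      }
    }
    return s.substring(start);
  }

  private static int lineOf(String s, int offset) {
    int line = 1;
    for (int i = 0; i < offset; i++) {
      if (s.charAt(i) == '\n') {
        line++;
      }
    }
    return line;
  }

  public static void main(String[] args) {
    String src = "try {\n  init();\n} catch (Exception e) {\n"
        + "  throw new RuntimeException(\"init failed\");\n}\n";
    System.out.println(suspects(src));
  }
}
```

Like the grep, this only narrows the search; each hit still needs to be read in context before it becomes a patch.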
The CONTEXT rule for tez-dag
Every error inside the AppMaster must include enough state to identify which DAG instance on which AM on which application attempt threw it. The minimum fields, listed in priority order:
- The DAG ID (TezDAGID).
- The Vertex ID (TezVertexID), required if the error is in a vertex context.
- The Task ID (TezTaskID), required if in a task context.
- The Task Attempt ID (TezTaskAttemptID), required if in an attempt context.
- The container ID, required for container-management errors.
Each of these IDs has a stable string form (toString() returns the canonical form). They
are present on every relevant impl object in tez-dag:
grep -n "getDagId\|getVertexId\|getTaskId\|getTaskAttemptId" \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
If you are editing a method on VertexImpl, you have getVertexId() and
getDagId() in scope. If you do not include them in the error, the patch is
incomplete.
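One way to keep the rule mechanical is a small helper that appends whichever IDs the call site has, in the priority order listed above. The sketch below is an assumption for illustration: ErrorContextSketch and describe are hypothetical names, not Tez classes, and the IDs are passed as plain strings (their toString() form) so the example runs without Tez on the classpath. The ID values in main are made up.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of Tez: formats whichever IDs are in scope.
final class ErrorContextSketch {
  private ErrorContextSketch() {}

  // Pass null for any ID the call site does not have.
  static String describe(String what, String dagId, String vertexId,
      String taskId, String attemptId) {
    StringJoiner ctx = new StringJoiner(", ", " [", "]");
    ctx.setEmptyValue(""); // no IDs at all: return the bare message
    add(ctx, "dag", dagId);
    add(ctx, "vertex", vertexId);
    add(ctx, "task", taskId);
    add(ctx, "attempt", attemptId);
    return what + ctx;
  }

  private static void add(StringJoiner joiner, String label, String id) {
    if (id != null) {
      joiner.add(label + "=" + id);
    }
  }

  public static void main(String[] args) {
    // Made-up ID strings, shaped loosely like Tez IDs.
    System.out.println(describe("Vertex init failed",
        "dag_example_1", "vertex_example_1_00", null, null));
  }
}
```

A helper like this keeps the field order consistent across call sites, which makes the resulting log lines easier to grep.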
Walked example A — uninformative TezException in VertexImpl.maybeSendConfiguredEvent
Symptom: a user reports their DAG fails with:
2026-04-12 10:14:21,003 ERROR [Dispatcher thread] org.apache.tez.dag.app.dag.impl.VertexImpl:
Vertex init failed
org.apache.tez.dag.api.TezException: init failed
at org.apache.tez.dag.app.dag.impl.VertexImpl.maybeSendConfiguredEvent(VertexImpl.java:NNNN)
That error tells the operator nothing. No DAG ID, no vertex name, no cause.
Step 1 — Find the throw site
cd ~/tez-src
grep -n 'throw new TezException("init failed' \
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
Read 20 lines of context around the hit. The method has vertexId,
getDagId(), and getName() all in scope.
Step 2 — The diff
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
-    } catch (AMUserCodeException e) {
-      throw new TezException("init failed");
-    }
+    } catch (AMUserCodeException e) {
+      String msg = String.format(
+          "Vertex %s (%s) of DAG %s failed during configured-event dispatch: %s",
+          getName(), vertexId, getDagId(), e.getMessage());
+      LOG.error(msg, e);
+      throw new TezException(msg, e);
+    }
What changed:
- The message now identifies the vertex name (human-readable), the vertex ID (machine-stable), and the DAG ID.
- The original exception is chained via the two-argument TezException constructor. The full stack trace survives.
- The error is also logged at ERROR with the cause. Belt and braces: some callers swallow the exception silently, and the log line is the only record that survives.
- String.format is used so the placeholders are visually aligned with the field names. Reviewers prefer it over +-concatenation when the message has more than three substitutions.
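The difference between the two sides of the diff can be checked in isolation. The following is a minimal sketch, assuming a stand-in exception class: SketchException only mirrors the (message) and (message, cause) constructor shapes so the example compiles without Tez, and the vertex name and IDs in main are invented.

```java
// Illustration only; ChainedCauseSketch and SketchException are hypothetical.
final class ChainedCauseSketch {
  // Stand-in for TezException with the same two constructor shapes.
  static class SketchException extends Exception {
    SketchException(String msg) {
      super(msg);
    }

    SketchException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  private ChainedCauseSketch() {}

  // The pattern being removed: the caught exception is dropped.
  static SketchException swallowed(Exception e) {
    return new SketchException("init failed");
  }

  // The pattern from the diff: IDs in the message, cause chained.
  static SketchException chained(String name, String vertexId, String dagId,
      Exception e) {
    String msg = String.format(
        "Vertex %s (%s) of DAG %s failed during configured-event dispatch: %s",
        name, vertexId, dagId, e.getMessage());
    return new SketchException(msg, e);
  }

  public static void main(String[] args) {
    Exception root = new IllegalStateException("initializer returned null");
    System.out.println(swallowed(root).getCause() == null);
    SketchException fixed = chained("Map 1", "vertex_example_00", "dag_example_1", root);
    System.out.println(fixed.getMessage());
    System.out.println(fixed.getCause() == root);
  }
}
```

With the swallowed form, getCause() returns null and the original stack trace is unrecoverable; with the chained form, the same root exception object is reachable from the rethrown one.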
Step 3 — Regression test
Add to tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java:
@Test(timeout = 5000)
public void testInitFailureMessageIncludesIds() throws Exception {
  VertexImpl v = createVertexThatFailsInConfigured(); // existing helper pattern
  try {
    v.maybeSendConfiguredEvent();
    fail("expected TezException");
  } catch (TezException e) {
    // Substring checks only: the exact wording is free to change.
    assertTrue("message should contain vertex id",
        e.getMessage().contains(v.getVertexId().toString()));
    assertTrue("message should contain dag id",
        e.getMessage().contains(v.getDagId().toString()));
    assertNotNull("cause should be preserved", e.getCause());
  }
}
The test asserts on substring presence, not exact string equality. Reviewers reject exact-string assertions because they break the next time the message is rephrased.
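To see why substring checks survive rewording, consider a small sketch that reports which required fragments are absent from a message. MessageAssertSketch and missing are hypothetical names used only for this illustration, and the ID strings are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical test helper: lists the required fragments a message lacks.
final class MessageAssertSketch {
  private MessageAssertSketch() {}

  static List<String> missing(String message, String... required) {
    List<String> absent = new ArrayList<>();
    for (String fragment : required) {
      if (message == null || !message.contains(fragment)) {
        absent.add(fragment);
      }
    }
    return absent;
  }

  public static void main(String[] args) {
    String vertexId = "vertex_example_00";
    String dagId = "dag_example_1";
    // Two phrasings of the same error; both carry the IDs.
    String before = "Vertex " + vertexId + " of DAG " + dagId + " failed";
    String after = "Init failure in " + dagId + " / " + vertexId;
    System.out.println(missing(before, vertexId, dagId));
    System.out.println(missing(after, vertexId, dagId));
    // A message with no IDs is the case the test should catch.
    System.out.println(missing("init failed", vertexId, dagId));
  }
}
```

Both phrasings pass the same check, while the ID-free message fails it, which is the property the regression test is meant to pin down.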
Step 4 — Run targeted tests
cd ~/tez-src
mvn -pl tez-dag test -Dtest=TestVertexImpl -q 2>&1 | tail -40
The full TestVertexImpl suite takes 3–5 minutes on a laptop. Run it. A
state-machine-adjacent change always risks breaking a sibling transition.
Walked example B — swallowed cause in DAGAppMaster.startDAG
Find the bug:
cd ~/tez-src
grep -rn "catch (.*Exception" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java \
-A 3 | grep -B1 "throw new" | head -20
Suppose the offender looks like:
try {
  initServices();
} catch (Exception e) {
  throw new TezUncheckedException("Failed to start AM");
}
The diff:
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
@@
     try {
       initServices();
     } catch (Exception e) {
-      throw new TezUncheckedException("Failed to start AM");
+      throw new TezUncheckedException(
+          "Failed to start AM for application " + appAttemptID + ": "
+              + e.getMessage(), e);
     }
Two fixes at once: the cause is preserved (the second constructor argument), and
the message now includes the appAttemptID which the surrounding DAGAppMaster
has in scope. This patch is small but high-leverage: the AM startup path is the
single most common place a swallowed cause hides a real configuration bug.
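What the on-call engineer gains from the chained form can be shown by walking the cause chain to its root. This is a sketch under stated assumptions: RootCauseSketch is a hypothetical name, plain RuntimeException stands in for TezUncheckedException, and the configuration error message is invented.

```java
// Illustration only: follows getCause() to the innermost exception.
final class RootCauseSketch {
  private RootCauseSketch() {}

  static Throwable rootCause(Throwable t) {
    Throwable current = t;
    // The bound guards against a pathological cyclic cause chain.
    for (int i = 0; i < 64 && current.getCause() != null; i++) {
      current = current.getCause();
    }
    return current;
  }

  public static void main(String[] args) {
    Exception config = new IllegalArgumentException("missing required setting");

    // Before the patch: the configuration error is unreachable.
    RuntimeException swallowed = new RuntimeException("Failed to start AM");
    // After the patch: the same error is the root of the chain.
    RuntimeException chained = new RuntimeException(
        "Failed to start AM: " + config.getMessage(), config);

    System.out.println(rootCause(swallowed).getMessage());
    System.out.println(rootCause(chained).getMessage());
  }
}
```

In the swallowed case the root is the wrapper itself, so the log shows only the generic startup failure; in the chained case the root is the configuration error that actually needs fixing.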
Walked example C — log-only context via NDC
Some hot paths cannot afford a String.format per call. The Tez convention there
is NDC. Look in tez-common/src/main/java/org/apache/tez/common/CallableWithNdc.java:
cat $(find ~/tez-src/tez-common/src/main/java -name "CallableWithNdc.java")
When the dispatcher invokes a vertex transition callback, it pushes the vertex ID
onto the NDC stack. A log4j pattern that contains %x (the NDC conversion; %X{...} reads the MDC instead) then includes the ID in every log
line for the duration of the call. If you discover a log message in VertexImpl
that lacks the vertex ID, first check whether NDC already provides it via the
log pattern. If yes, the message is fine; if no, add the ID inline. Submitting a
patch that adds an explicit ID where NDC already prints it is a reviewer-rejected
patch.
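The push-then-always-pop discipline that makes this work can be sketched without log4j. The class below is a stand-in written for this illustration, not log4j's NDC and not the Tez wrapper; real code would call org.apache.log4j.NDC.push and NDC.pop. NdcSketch, format, and callWith are hypothetical names, and the ID string is invented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Stand-in for a nested diagnostic context, one stack per thread.
final class NdcSketch {
  private static final ThreadLocal<Deque<String>> STACK =
      ThreadLocal.withInitial(ArrayDeque::new);

  private NdcSketch() {}

  static void push(String id) {
    STACK.get().addLast(id);
  }

  static void pop() {
    STACK.get().pollLast();
  }

  // Prefixes the message with the current context, outermost entry first.
  static String format(String msg) {
    String ctx = String.join(" ", STACK.get());
    return ctx.isEmpty() ? msg : "[" + ctx + "] " + msg;
  }

  // Runs body with id pushed, and always pops, even if body throws.
  static <T> T callWith(String id, Supplier<T> body) {
    push(id);
    try {
      return body.get();
    } finally {
      pop();
    }
  }

  public static void main(String[] args) {
    String inside = callWith("vertex_example_00",
        () -> format("transition failed"));
    System.out.println(inside);
    // Outside the call the context is gone, so the ID would be missing.
    System.out.println(format("transition failed"));
  }
}
```

The second line printed by main is the situation where an inline ID is warranted: the message is emitted on a path where nothing has pushed context.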
Pitfalls
- Don't include e.getStackTrace() in your message. The stack trace is what LOG.error(msg, e) is for. Concatenating it into the message turns a one-line log into a 60-line one.
- Don't use e.toString() in messages. Use e.getMessage() so the message stays single-line; the stack trace lives in the chained throwable.
- Don't catch Throwable to add context. Catching Throwable swallows OutOfMemoryError and ThreadDeath. Catch Exception (or the narrowest superclass that fits).
- Don't add context that requires a lock. A getName() call that internally takes the vertex write-lock is a deadlock waiting to happen if the error path itself holds the lock. Always check the lock semantics of the getter you call in an error path.
- Don't change the exception type to add context. throw new TezException is still a TezException after your patch; changing it to TezUncheckedException is a behavior change and not allowed in Stage 3.
- Don't add context that includes user data without redaction. If your error message includes a configuration value, check whether it could contain credentials. The Tez convention is to print the key, not the value, when the key matches .*\.(password|secret|token|credential).
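The redaction rule in the last pitfall can be sketched as a small formatter. This is an illustration under stated assumptions: RedactSketch and describe are hypothetical names, not a Tez utility; the regex is the one quoted above, applied case-sensitively as written; and example.store.password is an invented key.

```java
import java.util.regex.Pattern;

// Hypothetical formatter: prints key=value unless the key looks sensitive.
final class RedactSketch {
  private static final Pattern SENSITIVE_KEY =
      Pattern.compile(".*\\.(password|secret|token|credential)");

  private RedactSketch() {}

  static String describe(String key, String value) {
    if (SENSITIVE_KEY.matcher(key).matches()) {
      // Print the key only; the value never reaches the message.
      return key + " (value withheld)";
    }
    return key + "=" + value;
  }

  public static void main(String[] args) {
    System.out.println(describe("tez.am.resource.memory.mb", "2048"));
    System.out.println(describe("example.store.password", "hunter2"));
  }
}
```

Because matches() requires the whole key to match, a key such as example.password.file would not be redacted by this pattern; whether that is acceptable depends on the keys in use.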
Exit criteria — when you're ready for the next stage
Move to Stage 4 when:
- You have shipped at least one error-context patch in tez-dag and one in tez-runtime-library that includes the DAG and vertex/task IDs.
- A reviewer has accepted your test pattern (substring assertion, no exact-string match) without a comment.
- You can find at least three more candidate error sites in five minutes of grepping without referring to this chapter.
- You have read VertexImpl.maybeSendConfiguredEvent and the surrounding 200 lines without feeling lost; that file is the gateway to Stage 4.
Stage 4 will take you inside the state machines themselves.