Lab H3: Debugging a Failed Query
Background
Production Hive-on-Tez failures usually surface as one line in the Hive console:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex failed, vertexName=Map 1, vertexId=vertex_1718000000000_4321_1_00,
diagnostics=[Task failed, taskId=task_1718000000000_4321_1_00_000003,
diagnostics=[TaskAttempt 0 failed, info=[
Container container_e123_1718000000000_4321_01_000007 failed.
Exit code: 1
Container exited with a non-zero exit code 1. Last 4096 bytes of stderr :
... ]]]
That message is only the tip. The actual exception is buried 3–4 hops away. This lab walks from that tip to the root-cause stack trace, using a fabricated but realistic example.
The Failure Hop Sequence
flowchart TD
H[Hive console error<br/>'Vertex failed, vertexName=Map 1']
H --> A[AM log<br/>tez-dag log on the AM container]
A --> T[TaskAttempt diagnostics<br/>which task, which container]
T --> C[Container stderr / stdout log<br/>on the worker node]
C --> E[Actual exception<br/>the root cause]
E --> X[Attribute to Hive / Tez runtime / Tez AM / YARN]
Five hops. Most engineers can do hop 1 (read the console). Few can do hops 2–4 without guidance. This lab is the guidance.
Step 1: Parse the Console Message
Take the message above and extract the identifiers:
| Identifier | Value (in our example) | Use for |
|---|---|---|
| Application ID | application_1718000000000_4321 | YARN log retrieval |
| DAG ID | dag_1718000000000_4321_1 | Tez UI URL |
| Vertex ID | vertex_1718000000000_4321_1_00 | The failing vertex; here 00 ≈ Map 1 |
| Task ID | task_1718000000000_4321_1_00_000003 | Which task within the vertex |
| Attempt | 0 | First attempt failed |
| Container ID | container_e123_1718000000000_4321_01_000007 | Where the work was running |
| Exit code | 1 | Process died abnormally |
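The ID format is the same for every Hive-on-Tez task failure: each ID embeds the cluster timestamp (1718000000000) and the application sequence number (4321). The Application ID and DAG ID are not printed in the message; derive them from the Vertex ID. Memorise the structure.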
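The extraction in the table can be sketched in a few lines of Python. This is a minimal sketch, not a supported tool: the function name `parse_console` and the dictionary keys are hypothetical, and the regular expressions assume the message layout shown in this lab.

```python
import re

# The console message from this lab, reduced to the lines that carry IDs.
CONSOLE = """\
Vertex failed, vertexName=Map 1, vertexId=vertex_1718000000000_4321_1_00,
diagnostics=[Task failed, taskId=task_1718000000000_4321_1_00_000003,
diagnostics=[TaskAttempt 0 failed, info=[
Container container_e123_1718000000000_4321_01_000007 failed.
Exit code: 1
"""


def parse_console(message: str) -> dict:
    """Extract the Step 1 identifiers from a 'Vertex failed' message."""

    def first(pattern):
        match = re.search(pattern, message)
        return match.group(1) if match else None

    ids = {
        "vertex_name": first(r"vertexName=([^,]+)"),
        "vertex_id": first(r"vertexId=(vertex_\d+_\d+_\d+_\d+)"),
        "task_id": first(r"taskId=(task_\d+_\d+_\d+_\d+_\d+)"),
        "attempt": first(r"TaskAttempt (\d+) failed"),
        "container_id": first(r"(container_(?:e\d+_)?\d+_\d+_\d+_\d+)"),
        "exit_code": first(r"Exit code: (-?\d+)"),
    }
    # The application and DAG IDs are not printed. Derive them from the
    # vertex ID: vertex_<clusterTs>_<appSeq>_<dagSeq>_<vertexSeq>.
    if ids["vertex_id"]:
        _, ts, app, dag, _ = ids["vertex_id"].split("_")
        ids["application_id"] = f"application_{ts}_{app}"
        ids["dag_id"] = f"dag_{ts}_{app}_{dag}"
    return ids


if __name__ == "__main__":
    for key, value in parse_console(CONSOLE).items():
        print(f"{key:15} {value}")
```

All values are returned as strings, so the attempt number and exit code need an `int()` call before any numeric comparison.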
Step 2: Get the AM Log
The Tez AM is itself a YARN container. Its log is fetched with yarn logs:
yarn logs -applicationId application_1718000000000_4321 \
-containerId container_e123_1718000000000_4321_01_000001
The AM container is typically _01_000001: the first container of the first application attempt. If YARN restarted the AM, the attempt field (_01_) is higher. The
log streams to stdout. Pipe to a file:
yarn logs -applicationId application_1718000000000_4321 \
-containerId container_e123_1718000000000_4321_01_000001 \
> ~/tez-notes/hive-h3-amlog.txt
The AM log contains the DAGAppMaster lifecycle, vertex state transitions, and
diagnostics aggregated from failing tasks.
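If you start from the worker container ID, the AM container ID and the matching yarn logs command can be derived mechanically. The sketch below only builds the command line; it does not run it. The helper names are hypothetical, and it assumes, as this lab does, that the AM is container 000001 of the same application attempt.

```python
import re
import shlex

CONTAINER = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_(\d+)")


def application_id(container_id):
    """application_<clusterTs>_<appSeq>, taken from a container ID."""
    match = CONTAINER.fullmatch(container_id)
    if match is None:
        raise ValueError(f"not a YARN container ID: {container_id!r}")
    return f"application_{match.group(1)}_{match.group(2)}"


def am_container_id(container_id):
    """Assumed AM container: number 000001 of the same application attempt."""
    if CONTAINER.fullmatch(container_id) is None:
        raise ValueError(f"not a YARN container ID: {container_id!r}")
    # Drop the trailing container number and substitute 000001.
    prefix, _, _ = container_id.rpartition("_")
    return f"{prefix}_000001"


def yarn_logs_command(container_id):
    """argv for fetching one container's log, as used in this lab."""
    return [
        "yarn", "logs",
        "-applicationId", application_id(container_id),
        "-containerId", container_id,
    ]


if __name__ == "__main__":
    worker = "container_e123_1718000000000_4321_01_000007"
    print(shlex.join(yarn_logs_command(am_container_id(worker))))
```

Returning an argv list rather than a string keeps the command safe to hand to `subprocess.run` later without shell quoting concerns.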
Search for our failing task:
grep -n "task_1718000000000_4321_1_00_000003" ~/tez-notes/hive-h3-amlog.txt | head
You will see lines like:
2024-06-10 14:22:11,432 [INFO ] TaskImpl - task_..._000003 transitioned from SCHEDULED to RUNNING
2024-06-10 14:22:13,108 [INFO ] TaskAttemptImpl - attempt_..._000003_0 transitioned from RUNNING to FAILED
2024-06-10 14:22:13,108 [WARN ] TaskImpl - Diagnostics for ..._000003_0:
Container ..._000007 failed.
Exit code: 1
... [Last 4096 bytes of stderr] ...
The "Last 4096 bytes of stderr" is the AM's view of why the container died, and it is truncated to those 4096 bytes. For the full container log, move on to hop 3.
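Scanning a saved AM log for FAILED transitions can also be scripted. The sketch below assumes the line layout shown in this lab (timestamp, level, class, then "X transitioned from A to B"); real AM log lines vary by Tez version, so treat the regular expression as a starting point. The name `failed_transitions` is hypothetical.

```python
import re

# Sample lines copied from this lab, including the elided IDs.
AM_LOG = [
    "2024-06-10 14:22:11,432 [INFO ] TaskImpl - task_..._000003 "
    "transitioned from SCHEDULED to RUNNING",
    "2024-06-10 14:22:13,108 [INFO ] TaskAttemptImpl - attempt_..._000003_0 "
    "transitioned from RUNNING to FAILED",
]

TRANSITION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"(?P<entity>\S+) transitioned from (?P<old>\w+) to (?P<new>\w+)"
)


def failed_transitions(lines):
    """Return (timestamp, entity, previous_state) for each move to FAILED."""
    found = []
    for line in lines:
        match = TRANSITION.search(line)
        if match and match.group("new") == "FAILED":
            found.append(
                (match.group("ts"), match.group("entity"), match.group("old"))
            )
    return found


if __name__ == "__main__":
    for ts, entity, old in failed_transitions(AM_LOG):
        print(f"{ts}  {entity}  {old} -> FAILED")
```

To run it against the file saved earlier, pass an open file handle instead of the sample list, since the function accepts any iterable of lines.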
Step 3: Get the Container Log
The container ID from the AM log (container_..._000007) identifies the worker container. Fetch its log:
yarn logs -applicationId application_1718000000000_4321 \
-containerId container_e123_1718000000000_4321_01_000007 \
> ~/tez-notes/hive-h3-container-007.txt
The container log contains the full stdout and stderr from the Tez task runtime
(LogicalIOProcessorRuntimeTask), including all logged exceptions and any user-code
output.
The container log structure:
LogType:stdout
...
LogType:syslog
2024-06-10 14:22:12,856 [INFO ] LogicalIOProcessorRuntimeTask - Initializing task ...
2024-06-10 14:22:12,891 [INFO ] MRInput - Initializing MRInput for ...
2024-06-10 14:22:13,007 [WARN ] MRInput - ...
2024-06-10 14:22:13,084 [ERROR] LogicalIOProcessorRuntimeTask - Failed to execute task
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing row {"a":3,"b":"q"}
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
at org.apache.hadoop.hive.ql.exec.UDFBridge.<init>(UDFBridge.java:54)
...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
...
LogType:stderr
This is the actual exception. The Caused by: chain walks from Hive's wrapping
exception down to the JVM-level cause.
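Because one container's output holds several log types back to back, it helps to split it before searching. The following is a minimal sketch with a hypothetical helper name, `split_log_types`. It relies only on the "LogType:" marker lines shown above; any other header lines that your Hadoop version prints inside a section stay in that section's text.

```python
# A shortened version of the container log shown in this lab.
SAMPLE = """\
LogType:stdout
LogType:syslog
2024-06-10 14:22:13,084 [ERROR] LogicalIOProcessorRuntimeTask - Failed to execute task
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
LogType:stderr
"""


def split_log_types(text):
    """Return {log_type: contents} for ONE container's log output.

    If a log type appears more than once (for example in an all-containers
    dump), the last occurrence wins, so split such dumps per container first.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("LogType:"):
            current = line[len("LogType:"):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}


if __name__ == "__main__":
    for name, body in split_log_types(SAMPLE).items():
        print(f"{name}: {len(body.splitlines())} line(s)")
```

In this lab's example the exception sits in syslog, so `split_log_types(text)["syslog"]` is usually the section to read first.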
Step 4: Walk the Exception
Reading the trace top-down for our example:
| Frame | Tells you |
|---|---|
| java.lang.RuntimeException | Container exit, generic |
| org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"a":3,"b":"q"} | Hive boundary; you know the input row |
| org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow:91 | Hive Tez map-side row processor |
| org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run:418 | Hive Tez map record processor |
| org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run:223 | Hive's Tez Processor adapter |
| org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run:374 | Tez runtime task |
| org.apache.tez.runtime.task.TaskRunner2Callable... | Tez runtime task launcher |
Now the Caused by: chain:
| Cause | Tells you |
|---|---|
| HiveException: Unable to load UDF X | The proximate Hive problem |
| ClassNotFoundException: com.example.udf.X | The root: classloader can't find UDF |
So the root cause is a UDF class missing from the classpath of the Tez task. That's a Hive (or user) issue, not a Tez issue. See Lab H4 for how to make that attribution rigorously.
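Walking the Caused by: chain is mechanical enough to script. The sketch below is one simple way to do it, assuming the standard Java stack trace layout: the first non-blank line is the outermost exception, and each "Caused by: " line starts a nested cause. The function name `cause_chain` is hypothetical.

```python
# The trace from this lab, reduced to one frame per exception.
TRACE = """\
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing row {"a":3,"b":"q"}
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
"""


def cause_chain(trace):
    """Return [(exception_class, message), ...], outermost exception first."""
    chain = []
    for raw in trace.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Caused by: "):
            line = line[len("Caused by: "):]
        elif chain:
            # Stack frames, "..." lines and wrapped message text.
            continue
        cls, _, message = line.partition(": ")
        chain.append((cls, message.strip()))
    return chain


if __name__ == "__main__":
    chain = cause_chain(TRACE)
    for depth, (cls, message) in enumerate(chain):
        print(f"{'  ' * depth}{cls}: {message}")
    print("root cause:", chain[-1][0])
```

The last element of the returned list is the root cause; for the lab's trace that is the ClassNotFoundException for com.example.udf.X.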
Step 5: Attribute the Failure
Apply the decision rule from H4 (preview):
The package of the top frame whose code you can change indicates the project.
Top frames in order:
- java.lang.RuntimeException: JVM, not actionable.
- org.apache.hadoop.hive.ql.metadata.HiveException: Hive, but a generic wrapper; keep walking.
- org.apache.hadoop.hive.ql.exec.tez.MapRecordSource (line 91): Hive code, specific. Stop here for the top frame: this is Hive's MapRecordSource.
Then the Caused by: chain:
- HiveException: Unable to load UDF X (Hive).
- ClassNotFoundException: com.example.udf.X (root cause).
Attribution: Hive (the proximate code is MapRecordSource) and user
(the missing class is in the user's UDF jar). Tez is not at fault: it ran the task and
the Hive code correctly, and it surfaced the exception with a full stack trace, which is
all it is responsible for here.
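The top-frame part of the decision rule can be approximated by matching package prefixes. This is a rough sketch, not the H4 decision tree: the prefix table is an assumption that mirrors the four owners named in the hop diagram, and the name `attribute` is hypothetical. It cannot detect the "user" half of an attribution, which still requires reading the Caused by: chain.

```python
import re

# Assumed mapping from package prefix to owning project or component.
OWNERS = [
    ("org.apache.hadoop.hive.", "Hive"),
    ("org.apache.tez.dag.", "Tez AM"),
    ("org.apache.tez.runtime.", "Tez runtime"),
    ("org.apache.hadoop.yarn.", "YARN"),
]

# Matches "at pkg.Class.method(" and captures "pkg.Class.method".
FRAME = re.compile(r"^\s*at ([\w.$<>]+)\(")

TRACE = """\
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
"""


def attribute(trace):
    """Return (owner, frame) for the first frame in a known package, or None.

    JDK frames such as java.net.* match no prefix and are skipped.
    """
    for line in trace.splitlines():
        match = FRAME.match(line)
        if not match:
            continue
        frame = match.group(1)
        for prefix, owner in OWNERS:
            if frame.startswith(prefix):
                return owner, frame
    return None


if __name__ == "__main__":
    print(attribute(TRACE))
```

Only the first matching frame is reported, so a Hive frame above a Tez runtime frame yields Hive, which matches the manual reading of the lab's first trace.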
The fix is to make the UDF jar available to the tasks, either for the current session:
ADD JAR /path/to/udf.jar;
or for every session, through the auxiliary jars path in hive-site.xml:
<property>
<name>hive.aux.jars.path</name>
<value>file:///opt/hive/auxlib/udf.jar</value>
</property>
Tooling Shortcuts
Get all container logs at once
yarn logs -applicationId application_1718000000000_4321 \
> ~/tez-notes/hive-h3-all.txt
For a large DAG with many containers, the output is large (often hundreds of MB). Use the per-container form when you know which container to look at.
Search across container logs
grep -B2 -A20 "java.lang.\|Caused by" ~/tez-notes/hive-h3-all.txt | head -100
Find the failing task fast
grep "FAILED" ~/tez-notes/hive-h3-amlog.txt
Tez UI shortcut
If your cluster has the Tez UI, the per-task log links are one click. The UI URL pattern:
http://<tez-ui-host>:9999/tez-ui/#/tez-dag/dag_1718000000000_4321_1
From that page, navigate to Map 1 → task 000003 → attempt 0 → "logs". The UI fetches
the container log automatically.
A Second Worked Example — Tez Runtime Failure
Console:
Vertex failed, vertexName=Reducer 2, ...
Container ... failed. Exit code: 1
Container log top of stack:
java.io.IOException: Failed on local exception: java.io.IOException: Failed to fetch shuffle data
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
...
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
...
Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.
Attribution: Tez runtime library, specifically the shuffle fetcher. The root cause,
ConnectException: Connection refused, points to the upstream side being gone: the node
that produced the shuffle output, or the shuffle service on it, was killed, evicted, or
cut off from the network. Investigation continues into the upstream container's log.
This is the canonical Tez shuffle failure shape. The reproduction is in H5.
A Third Worked Example — AM Failure
Console:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Application application_1718000000000_4321 failed with state FAILED.
Diagnostics: Application application_1718000000000_4321 failed 2 times due to AM Container ... exited with exitCode: -103 ...
The AM itself died. Container log of the AM:
[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
...
Top frame: org.apache.tez.dag.app.dag.impl.VertexImpl:412. Attribution: Tez AM.
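Root cause: AM heap too small for the DAG. The AM container size is set by
tez.am.resource.memory.mb, and the JVM heap is derived from it unless -Xmx is set
explicitly in tez.am.launch.cmd-opts. The fix is configuration; if the failure is
reproducible at the default, file a JIRA against Tez requesting either a smarter default
or a sizing recommendation.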
Validation Artifacts
For our first example, save:
- The console error verbatim (~/tez-notes/hive-h3-console.txt).
- The parsed-identifiers table (Application ID, DAG ID, Vertex ID, Task ID, Container ID).
- The AM log fragment showing the task transition to FAILED.
- The container log fragment showing the full exception with Caused by: chain.
- The attribution paragraph: which project owns the bug, and why.
- The fix you propose.
Once you can produce that artifact for an arbitrary Hive-on-Tez failure, you can debug one. The next lab — Lab H4: Bug Attribution — makes the attribution rigorous with a decision tree and four more worked examples.