Lab H3: Debugging a Failed Query

Background

Production Hive-on-Tez failures usually surface as a single terse message in the Hive console:

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex failed, vertexName=Map 1, vertexId=vertex_1718000000000_4321_1_00,
diagnostics=[Task failed, taskId=task_1718000000000_4321_1_00_000003,
diagnostics=[TaskAttempt 0 failed, info=[
Container container_e123_1718000000000_4321_01_000007 failed.
Exit code: 1
Container exited with a non-zero exit code 1. Last 4096 bytes of stderr :
... ]]]

That message is only the tip. The actual exception is buried three or four hops away. This lab walks from that tip to the root-cause stack trace, using a fabricated but realistic example.


The Failure Hop Sequence

flowchart TD
  H[Hive console error<br/>'Vertex failed, vertexName=Map 1']
  H --> A[AM log<br/>tez-dag log on the AM container]
  A --> T[TaskAttempt diagnostics<br/>which task, which container]
  T --> C[Container stderr / stdout log<br/>on the worker node]
  C --> E[Actual exception<br/>the root cause]
  E --> X[Attribute to Hive / Tez runtime / Tez AM / YARN]

Five hops. Most engineers can do hop 1 (read the console). Few can do hops 2–4 without guidance. This lab is the guidance.


Step 1: Parse the Console Message

Take the message above and extract the identifiers:

| Identifier | Value (in our example) | Use for |
| --- | --- | --- |
| Application ID | application_1718000000000_4321 | YARN log retrieval |
| DAG ID | dag_1718000000000_4321_1 | Tez UI URL |
| Vertex ID | vertex_1718000000000_4321_1_00 | The failing vertex; here 00 ≈ Map 1 |
| Task ID | task_1718000000000_4321_1_00_000003 | Which task within the vertex |
| Attempt | 0 | First attempt failed |
| Container ID | container_e123_1718000000000_4321_01_000007 | Where the work was running |
| Exit code | 1 | Process died abnormally |

The format is consistent across task-level Hive-on-Tez failures. Memorise the structure.
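
The extraction above can be scripted. The following is a minimal sketch, assuming the console error has been saved as text and that the identifiers follow the shapes shown in the table; the helper name first_match is hypothetical, and the sample text is the fabricated message from this lab.

```shell
# Sketch: pull the identifiers out of a saved console error (sample text from this lab).
console_msg='Vertex failed, vertexName=Map 1, vertexId=vertex_1718000000000_4321_1_00,
diagnostics=[Task failed, taskId=task_1718000000000_4321_1_00_000003,
diagnostics=[TaskAttempt 0 failed, info=[
Container container_e123_1718000000000_4321_01_000007 failed.'

# Print the first substring of $console_msg matching the extended regex in $1.
first_match() {
  printf '%s\n' "$console_msg" | grep -oE "$1" | head -n 1
}

vertex_id=$(first_match 'vertex_[0-9]+_[0-9]+_[0-9]+_[0-9]+')
task_id=$(first_match 'task_[0-9]+_[0-9]+_[0-9]+_[0-9]+_[0-9]+')
container_id=$(first_match 'container_(e[0-9]+_)?[0-9]+_[0-9]+_[0-9]+_[0-9]+')

# vertex_<clusterTimestamp>_<appSeq>_<dagSeq>_<vertexSeq>:
# fields 2-4 are enough to rebuild the application and DAG IDs.
cluster_ts=$(printf '%s\n' "$vertex_id" | cut -d_ -f2)
app_seq=$(printf '%s\n' "$vertex_id" | cut -d_ -f3)
dag_seq=$(printf '%s\n' "$vertex_id" | cut -d_ -f4)
app_id="application_${cluster_ts}_${app_seq}"
dag_id="dag_${cluster_ts}_${app_seq}_${dag_seq}"

printf 'app=%s\ndag=%s\nvertex=%s\ntask=%s\ncontainer=%s\n' \
  "$app_id" "$dag_id" "$vertex_id" "$task_id" "$container_id"
```

Note that the application and DAG IDs do not appear literally in the message; the sketch derives them from the vertex ID, which is also how you would do it by eye.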


Step 2: Get the AM Log

The Tez AM is itself a YARN container. Its log is fetched with yarn logs, which relies on YARN log aggregation being enabled on the cluster:

yarn logs -applicationId application_1718000000000_4321 \
  -containerId container_e123_1718000000000_4321_01_000001

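The AM container is typically _01_000001: the first container of the first application attempt. If YARN restarted the AM, the attempt field increments (for example _02_000001). The log streams to stdout, so redirect it to a file: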

yarn logs -applicationId application_1718000000000_4321 \
  -containerId container_e123_1718000000000_4321_01_000001 \
  > ~/tez-notes/hive-h3-amlog.txt

The AM log contains the DAGAppMaster lifecycle, vertex state transitions, and diagnostics aggregated from failing tasks.
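
If you only have the failing worker's container ID, the AM container ID and the application ID can be derived from it. This is a minimal sketch under the assumption stated above (the AM is container 000001 of the same application attempt); it prints the yarn logs command instead of running it, so it also works off-cluster.

```shell
# Sketch: derive the AM container ID and application ID from a worker container ID.
worker_container='container_e123_1718000000000_4321_01_000007'

# Assumption: the AM is container 000001 of the same application attempt.
am_container="${worker_container%_*}_000001"

# Drop the "container_" prefix and the optional epoch field (e123_).
rest="${worker_container#container_}"
case "$rest" in
  e*) rest="${rest#*_}" ;;
esac

# What remains is <clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>.
cluster_ts=$(printf '%s\n' "$rest" | cut -d_ -f1)
app_seq=$(printf '%s\n' "$rest" | cut -d_ -f2)
app_id="application_${cluster_ts}_${app_seq}"

# Print the command rather than run it.
echo "yarn logs -applicationId ${app_id} -containerId ${am_container}"
```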

Search for our failing task:

grep -n "task_1718000000000_4321_1_00_000003" ~/tez-notes/hive-h3-amlog.txt | head

You will see lines like:

2024-06-10 14:22:11,432 [INFO ] TaskImpl - task_..._000003 transitioned from SCHEDULED to RUNNING
2024-06-10 14:22:13,108 [INFO ] TaskAttemptImpl - attempt_..._000003_0 transitioned from RUNNING to FAILED
2024-06-10 14:22:13,108 [WARN ] TaskImpl - Diagnostics for ..._000003_0:
  Container ..._000007 failed.
  Exit code: 1
  ... [Last 4096 bytes of stderr] ...

The "Last 4096 bytes of stderr" excerpt is the AM's view of why the container died, and it is truncated. For the full container log, continue to Step 3.


Step 3: Get the Container Log

The container ID in the AM diagnostics (container_..._000007) identifies the worker container. Fetch its log:

yarn logs -applicationId application_1718000000000_4321 \
  -containerId container_e123_1718000000000_4321_01_000007 \
  > ~/tez-notes/hive-h3-container-007.txt

The container log contains the full stdout, stderr, and syslog from the Tez task runtime (LogicalIOProcessorRuntimeTask), including all logged exceptions and any user-code output.

The container log structure:

LogType:stdout
...
LogType:syslog
2024-06-10 14:22:12,856 [INFO ] LogicalIOProcessorRuntimeTask - Initializing task ...
2024-06-10 14:22:12,891 [INFO ] MRInput - Initializing MRInput for ...
2024-06-10 14:22:13,007 [WARN ] MRInput - ...
2024-06-10 14:22:13,084 [ERROR] LogicalIOProcessorRuntimeTask - Failed to execute task
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
  Hive Runtime Error while processing row {"a":3,"b":"q"}
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        ...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
        at org.apache.hadoop.hive.ql.exec.UDFBridge.<init>(UDFBridge.java:54)
        ...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        ...
LogType:stderr

This is the actual exception. The Caused by: chain walks from Hive's wrapping exception down to the JVM-level cause.
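
To read one section of a container log without scrolling past the others, the LogType: headers can be used as delimiters. The sketch below assumes each section starts with a LogType:<name> line as in the structure shown above; some Hadoop versions also print an "End of LogType:" trailer, which the sketch skips. The function name extract_logtype and the sample dump are hypothetical.

```shell
# Print only the section of a `yarn logs` dump whose header is "LogType:<name>".
extract_logtype() {
  awk -v want="LogType:$1" '
    /^LogType:/        { printing = ($0 == want); next }
    /^End of LogType:/ { printing = 0; next }
    printing           { print }
  '
}

# Tiny stand-in for a real dump, shaped like the structure shown in this step.
sample_dump='LogType:stdout
hello from stdout
LogType:syslog
[ERROR] LogicalIOProcessorRuntimeTask - Failed to execute task
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
LogType:stderr
'

syslog_only=$(printf '%s' "$sample_dump" | extract_logtype syslog)
printf '%s\n' "$syslog_only"
```

On a real log the same function would be fed from the saved file, for example extract_logtype syslog < ~/tez-notes/hive-h3-container-007.txt.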


Step 4: Walk the Exception

Reading the trace top-down for our example:

| Frame | Tells you |
| --- | --- |
| java.lang.RuntimeException | Container exit, generic |
| org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"a":3,"b":"q"} | Hive boundary; you know the input row |
| org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow:91 | Hive Tez map-side row processor |
| org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run:418 | Hive Tez map record processor |
| org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run:223 | Hive's Tez Processor adapter |
| org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run:374 | Tez runtime task |
| org.apache.tez.runtime.task.TaskRunner2Callable... | Tez runtime task launcher |

Now the Caused by: chain:

| Cause | Tells you |
| --- | --- |
| HiveException: Unable to load UDF X | The proximate Hive problem |
| ClassNotFoundException: com.example.udf.X | The root: classloader can't find UDF |

So the root cause is a UDF class missing from the classpath of the Tez task. That's a Hive (or user) issue, not a Tez issue. See Lab H4 for how to make that attribution rigorously.
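
Walking the Caused by: chain is mechanical enough to script. The following is a minimal sketch, assuming each cause starts at the beginning of a line with "Caused by: " as in the trace above; the function names cause_chain and root_cause are hypothetical, and the sample trace is an abbreviated copy of this lab's example.

```shell
# Print every "Caused by:" line in order, and the innermost (last) one.
cause_chain() { grep '^Caused by: ' | sed 's/^Caused by: //'; }
root_cause()  { cause_chain | tail -n 1; }

trace='java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
  Hive Runtime Error while processing row {"a":3,"b":"q"}
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)'

# Prints: java.lang.ClassNotFoundException: com.example.udf.X
printf '%s\n' "$trace" | root_cause
```

The last Caused by: entry is the root; everything above it is wrapping added on the way up.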


Step 5: Attribute the Failure

Apply the decision rule from H4 (preview):

The package of the top frame whose code you can change indicates the project.

Top frames in order:

  1. java.lang.RuntimeException — JVM, not actionable.
  2. org.apache.hadoop.hive.ql.metadata.HiveException — Hive, but generic wrap; keep walking.
  3. org.apache.hadoop.hive.ql.exec.tez.MapRecordSource:91 — Hive code, specific. Stop here for the top frame: this is Hive's MapRecordSource.

Then the Caused by: chain:

  1. HiveException: Unable to load UDF X — Hive.
  2. ClassNotFoundException: com.example.udf.X — root cause.

Attribution: Hive (the proximate code is MapRecordSource) and user (the missing class is the user's UDF jar). Tez is not at fault: it ran the task, executed the Hive code, and surfaced the exception with its stack trace, which is all it is responsible for here.
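
The package-to-project mapping used in this step can be written down as a lookup. This is only a sketch of the preview rule, not the full decision tree from H4; the function name attribute_frame is hypothetical, and the prefix-to-project mapping is an assumption drawn from the package names that appear in this lab's traces.

```shell
# Map a stack frame (or exception class) to the project that owns its package.
# Assumed mapping, based on the package prefixes seen in this lab.
attribute_frame() {
  case "$1" in
    org.apache.hadoop.hive.*) echo "Hive" ;;
    org.apache.tez.dag.*)     echo "Tez AM" ;;
    org.apache.tez.runtime.*) echo "Tez runtime" ;;
    org.apache.hadoop.yarn.*) echo "YARN" ;;
    java.*|javax.*|sun.*)     echo "JVM" ;;
    *)                        echo "user or third-party" ;;
  esac
}

for frame in \
  'java.lang.RuntimeException' \
  'org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow' \
  'com.example.udf.X'
do
  printf '%-20s %s\n' "$(attribute_frame "$frame")" "$frame"
done
```

A frame that maps to "JVM" is the signal to keep walking; the first frame that maps to a project you can change is where attribution starts.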

The fix is to make the UDF jar available to the query's tasks, either per session:

ADD JAR /path/to/udf.jar;

or in hive-site.xml:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/hive/auxlib/udf.jar</value>
</property>
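
Before applying either fix, it is worth confirming that the jar actually contains the missing class. The sketch below is one way to check, assuming the unzip tool is installed; the function names class_to_entry and jar_has_class are hypothetical, and /path/to/udf.jar is the placeholder path from the text above.

```shell
# Convert a fully qualified class name to its entry path inside a jar.
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed if the jar's listing contains the class entry.
# Assumption: `unzip` is available on this host.
jar_has_class() {
  unzip -l "$1" | grep -qF "$(class_to_entry "$2")"
}

# Prints: com/example/udf/X.class
class_to_entry com.example.udf.X

# Usage on a real jar (placeholder path):
# jar_has_class /path/to/udf.jar com.example.udf.X && echo "class present"
```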

Tooling Shortcuts

Get all container logs at once

yarn logs -applicationId application_1718000000000_4321 \
  > ~/tez-notes/hive-h3-all.txt

For a large DAG with many containers, the output is large (often hundreds of MB). Use the per-container form when you know which container to look at.

Search across container logs

grep -E -B2 -A20 'java\.lang\.|Caused by' ~/tez-notes/hive-h3-all.txt | head -n 100
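
When the combined dump covers many containers, it helps to know which containers logged an exception chain at all. The sketch below counts Caused by: lines per container, assuming each container section in the dump begins with a "Container: <id> on <host>" header line; the function name containers_with_causes and the host names in the sample are hypothetical.

```shell
# Count "Caused by:" lines per container in a combined `yarn logs` dump read from stdin.
# Assumption: each container section starts with "Container: <id> on <host>".
containers_with_causes() {
  awk '
    /^Container: container_/        { current = $2; next }
    /^Caused by: / && current != "" { hits[current]++ }
    END { for (c in hits) printf "%s %d\n", c, hits[c] }
  ' | sort
}

# Hypothetical two-container dump; only container 000007 logged an exception chain.
sample_all='Container: container_e123_1718000000000_4321_01_000002 on worker1.example.com_45454
LogType:syslog
[INFO ] all good
Container: container_e123_1718000000000_4321_01_000007 on worker2.example.com_45454
LogType:syslog
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
'

# Prints: container_e123_1718000000000_4321_01_000007 2
printf '%s' "$sample_all" | containers_with_causes
```

The output gives you the container IDs to feed to the per-container form of yarn logs.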

Find the failing task fast

grep -n "FAILED" ~/tez-notes/hive-h3-amlog.txt

Tez UI shortcut

If your cluster has the Tez UI, the per-task log links are one click away. The URL below is an example pattern; host, port, and path depend on your deployment:

http://<tez-ui-host>:9999/tez-ui/#/tez-dag/dag_1718000000000_4321_1

From that page, navigate to Map 1 → task 000003 → attempt 0 → "logs". The UI fetches the container log automatically.


A Second Worked Example — Tez Runtime Failure

Console:

Vertex failed, vertexName=Reducer 2, ...
Container ... failed. Exit code: 1

Container log top of stack:

java.io.IOException: Failed on local exception: java.io.IOException: Failed to fetch shuffle data
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
        ...
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        ...

Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.

Attribution: Tez runtime library, specifically the shuffle fetcher. The root cause, ConnectException: Connection refused, points to the upstream side being gone: the process serving the upstream task's output was killed, evicted, or is no longer reachable. Investigation continues on the upstream side, starting with the upstream container's log.

This is the canonical Tez shuffle failure shape. The reproduction is in H5.


A Third Worked Example — AM Failure

Console:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Application application_1718000000000_4321 failed with state FAILED.
Diagnostics: Application application_1718000000000_4321 failed 2 times due to AM Container ... exited with exitCode: -103 ...

The AM itself died. Container log of the AM:

[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
        at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
        ...

Top frame: org.apache.tez.dag.app.dag.impl.VertexImpl:412. Attribution: Tez AM. Root cause: the AM heap is too small for the DAG; the AM's memory is sized through tez.am.resource.memory.mb. The fix is configuration. If the failure reproduces at the default setting, file a JIRA against Tez requesting either a smarter default or a sizing recommendation.


Validation Artifacts

For our first example, save:

  1. The console error verbatim (~/tez-notes/hive-h3-console.txt).
  2. The parsed-identifiers table (Application ID, DAG ID, Vertex ID, Task ID, Container ID).
  3. The AM log fragment showing the task transition to FAILED.
  4. The container log fragment showing the full exception with Caused by: chain.
  5. The attribution paragraph: which project owns the bug, and why.
  6. The fix you propose.

Once you can produce these artifacts for an arbitrary Hive-on-Tez failure, you can debug one. The next lab, Lab H4: Bug Attribution, makes the attribution rigorous with a decision tree and four more worked examples.