Lab 6.2 — Debug a Failed Hive-on-Tez Query
Lab type: Fix-It (diagnostics + root-cause analysis)
Estimated time: 120 min
Overview
A Hive-on-Tez query failure can originate from:
- Tez DAG layer — vertex scheduling error, shuffle failure, OOM
- Hive operator layer — deserialization error, UDF crash, wrong SerDe
- Infra layer — YARN container killed, HDFS quota exceeded, network timeout
In this lab you will work through a systematic diagnostic process and trace a simulated failure back to its Tez-layer root cause.
Scenario
A Hive query:
SELECT k, SUM(v) FROM large_table GROUP BY k;
fails with:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1700000000000_0001_1_01,
diagnostics=[Task failed, taskId=task_1700000000000_0001_1_01_000000,
diagnostics=[TaskAttempt 0 failed, info=[Container container_... exited
with exitCode: -104]]
Exit code -104 means that YARN killed the container for exceeding its physical memory limit.
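Before working through the steps, it can help to pull the key fields out of the error text mechanically. The following is a minimal sketch, not part of Tez or Hive: the function name `parse_diagnostics` and the regular expressions are illustrative assumptions based only on the message format shown in the scenario.

```python
import re

# Diagnostic text copied from the scenario (the container ID is elided in the lab).
DIAGNOSTICS = """\
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1700000000000_0001_1_01,
diagnostics=[Task failed, taskId=task_1700000000000_0001_1_01_000000,
diagnostics=[TaskAttempt 0 failed, info=[Container container_... exited
with exitCode: -104]]
"""


def parse_diagnostics(text):
    """Extract the fields that matter for triage (hypothetical helper)."""
    # The message wraps across lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    patterns = {
        "vertex_name": r"vertexName=([^,]+)",
        "vertex_id": r"vertexId=(vertex_[0-9_]+)",
        "task_id": r"taskId=(task_[0-9_]+)",
        "attempt": r"TaskAttempt (\d+) failed",
        "exit_code": r"exitCode: (-?\d+)",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, flat)
        fields[name] = match.group(1) if match else None
    return fields


fields = parse_diagnostics(DIAGNOSTICS)
for name, value in fields.items():
    print(f"{name}: {value}")
```

The extracted vertex name, task ID, and exit code are the inputs to the questions in Step 1.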
Step 1 — Identify the Layer
| # | Question |
|---|---|
| 1 | Is exit code -104 a Tez error or a YARN error? Where is this code defined? |
| 2 | Which vertex failed — the map or the reduce? How do you know from the diagnostic message? |
| 3 | What Tez API would you call (in Java) to retrieve these diagnostics programmatically? |
| 4 | The error says "TaskAttempt 0 failed". Does this mean no retries happened, or that all retries were exhausted? |
Step 2 — Locate the Logs
In a real cluster:
# Get the AM logs
yarn logs -applicationId application_1700000000000_0001 \
-log_files syslog | grep -A 20 "Reducer 2"
# Get the container logs
yarn logs -applicationId application_1700000000000_0001 \
-containerId container_... | head -200
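The application ID in the commands above does not appear in the Hive error, but it can be derived from the vertex or task ID, which embeds the same cluster timestamp and application sequence number. A minimal sketch, assuming the usual `vertex_<timestamp>_<appSeq>_<dag>_<vertex>` layout; both helper names are hypothetical, and the command is only built as a string, not executed.

```python
import re


def application_id_from_tez_id(tez_id):
    """Map a Tez dag/vertex/task/attempt ID to its YARN application ID."""
    match = re.match(r"(?:dag|vertex|task|attempt)_(\d+)_(\d+)_", tez_id)
    if match is None:
        raise ValueError(f"not a Tez ID: {tez_id!r}")
    cluster_timestamp, app_sequence = match.groups()
    return f"application_{cluster_timestamp}_{app_sequence}"


def am_syslog_command(app_id, vertex_name):
    """Build the AM-log command from this step as a string (not executed)."""
    return (
        f"yarn logs -applicationId {app_id} -log_files syslog"
        f" | grep -A 20 '{vertex_name}'"
    )


app_id = application_id_from_tez_id("vertex_1700000000000_0001_1_01")
print(am_syslog_command(app_id, "Reducer 2"))
```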
Questions:
- In the AM logs, what Tez class emits the `Task failed` message? (Hint: grep for `TaskImpl` or `VertexImpl` in the log.)
- The container log may contain a Java `OutOfMemoryError` or GC output. Where in `TaskAttemptImpl` does the container exit code get translated to a `TaskAttemptEvent`?
Step 3 — Identify the Tez Configuration Fix
The reduce vertex ran out of memory. The relevant configuration:
| Config key | Default | Description |
|---|---|---|
| `tez.am.resource.memory.mb` | 1024 | AM container memory |
| `tez.task.resource.memory.mb` | 1024 | Task container memory |
| `hive.tez.container.size` | -1 (inherits from mapred) | Hive override for Tez task memory |
| `hive.auto.convert.join.noconditionaltask.size` | 10 MB | In-memory join threshold |
- Which config key should be increased to fix the OOM?
- Is this a Tez config or a Hive config? Which system applies it?
- Find where `tez.task.resource.memory.mb` is read in the Tez source. In which class and method?
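A container killed with -104 exceeded its physical memory limit, which covers the JVM heap plus everything outside it (thread stacks, direct buffers, native allocations). Whichever key you decide to raise, the JVM's `-Xmx` has to stay below the container size with some headroom. The sketch below shows the arithmetic only; the 0.8 heap fraction is an illustrative assumption, not a value read from this lab's cluster, and `max_heap_mb` is a hypothetical helper.

```python
def max_heap_mb(container_mb, heap_fraction=0.8):
    """Largest -Xmx (in MB) that leaves headroom for non-heap memory.

    heap_fraction is an assumed rule of thumb; tune it for your workload.
    """
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    return int(container_mb * heap_fraction)


for container_mb in (1024, 2048, 4096):
    heap = max_heap_mb(container_mb)
    print(f"container={container_mb} MB -> -Xmx{heap}m, headroom={container_mb - heap} MB")
```

If the heap is left unchanged while the container grows, the extra memory only helps off-heap usage; if the heap is raised to the full container size, the container can be killed again even without a Java `OutOfMemoryError`.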
Step 4 — Tez Source Reading: Container Exit Code Handling
Find where Tez handles non-zero container exit codes:
grep -rn "exitCode\|EXIT_CODE\|ContainerExitStatus" \
~/tez-src/tez-dag/src/main/java/ | grep -v "test" | head -30
Answer:
- What class translates the YARN container exit code into a `TaskAttemptEvent`?
- Is `-104` (KILLED_EXCEEDED_PMEM) treated differently from `-102` (PREEMPTED) and `-100` (ABORTED)?
- Does Tez retry a preempted task? What configuration controls the maximum number of retries?
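When comparing exit codes in the source, it helps to have the YARN constants side by side. The table below mirrors the values of `org.apache.hadoop.yarn.api.records.ContainerExitStatus` as a lookup sketch; verify them against the Hadoop version you are reading. The helper `describe_exit_code` is hypothetical, and how Tez reacts to each code is left for you to find in the source.

```python
# Values mirror YARN's ContainerExitStatus constants; check your Hadoop version.
YARN_CONTAINER_EXIT_STATUS = {
    0: "SUCCESS",
    -1000: "INVALID",
    -100: "ABORTED",
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
    -103: "KILLED_EXCEEDED_VMEM",
    -104: "KILLED_EXCEEDED_PMEM",
    -105: "KILLED_BY_RESOURCEMANAGER",
    -106: "KILLED_BY_APPMASTER",
    -107: "KILLED_AFTER_APP_COMPLETION",
}


def describe_exit_code(code):
    """Return a readable label for a container exit code (hypothetical helper)."""
    name = YARN_CONTAINER_EXIT_STATUS.get(code)
    if name is not None:
        return f"{code} (YARN {name})"
    if code > 0:
        # Positive codes come from the process itself, not from YARN.
        return f"{code} (process exit status, not a YARN constant)"
    return f"{code} (unknown negative code)"


for code in (-104, -102, -100, 1):
    print(describe_exit_code(code))
```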
Step 5 — Simulate the Fix
In a real system you would increase tez.task.resource.memory.mb and rerun the query.
Since you do not have a Hive cluster in this lab, study the corresponding unit test instead.
Find the test in TestTaskAttemptImpl.java that covers container preemption:
grep -n "preempt\|PREEMPT\|exitCode" \
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttemptImpl.java \
| head -20
Read the test. Answer:
- How does the test simulate a container exit with a non-zero exit code?
- What state does `TaskAttemptImpl` transition to on preemption?
- Is there a test for the full retry-until-max-attempts path?
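To reason about the retry-until-max-attempts path before reading the tests, a toy model can be useful. The sketch below is not Tez's state machine: `run_task` and its `max_failed_attempts` parameter are hypothetical, and the rule that non-failure outcomes such as `KILLED` do not consume the failure budget is an assumption you should confirm (or refute) in `TaskImpl` and its tests.

```python
def run_task(attempt_outcomes, max_failed_attempts):
    """Toy retry model: returns (final_state, attempts_used).

    attempt_outcomes lists the outcome of each attempt in order, using the
    strings "SUCCEEDED", "FAILED", or "KILLED".
    """
    failed = 0
    for attempt_number, outcome in enumerate(attempt_outcomes):
        if outcome == "SUCCEEDED":
            return "SUCCEEDED", attempt_number + 1
        if outcome == "FAILED":
            failed += 1
            if failed >= max_failed_attempts:
                # Failure budget exhausted: the task fails.
                return "FAILED", attempt_number + 1
        # Assumption: any other outcome (e.g. "KILLED") is rescheduled
        # without counting against the failure budget.
    return "RUNNING", len(attempt_outcomes)


print(run_task(["FAILED", "FAILED", "FAILED", "FAILED"], max_failed_attempts=4))
print(run_task(["FAILED", "KILLED", "SUCCEEDED"], max_failed_attempts=4))
```

Comparing this model with what the tests assert is one way to answer whether the message "TaskAttempt 0 failed" implies that later attempts ran.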
Step 6 — Write a Diagnostic Runbook Entry
Write 5–8 bullet points as a "runbook entry" for this class of failure:
## Hive-on-Tez: Reducer OOM (exit code -104)
**Symptoms:** ...
**Root cause:** ...
**Diagnostic steps:** ...
**Fix:** ...
**Tez classes involved:** ...
**Relevant configuration:** ...
This is the kind of documentation that Tez PMC members write for operators.
JIRA Research
Search for Tez issues related to container OOM or preemption handling:
project = TEZ AND text ~ "preempt OR oom OR \"out of memory\"" AND resolution = Fixed
Find one. Read the patch. Was the fix in TaskAttemptImpl, in configuration
defaults, or in a different class?