Lab 6.2 — Debug a Failed Hive-on-Tez Query

Lab type: Fix-It (diagnostics + root-cause analysis)
Estimated time: 120 min


Overview

A Hive-on-Tez query failure can originate from:

  1. Tez DAG layer — vertex scheduling error, shuffle failure, OOM
  2. Hive operator layer — deserialization error, UDF crash, wrong SerDe
  3. Infra layer — YARN container killed, HDFS quota exceeded, network timeout

In this lab you will work through a systematic diagnostic process and trace a simulated failure back to its Tez-layer root cause.


Scenario

A Hive query:

SELECT k, SUM(v) FROM large_table GROUP BY k;

fails with:

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1700000000000_0001_1_01,
diagnostics=[Task failed, taskId=task_1700000000000_0001_1_01_000000,
diagnostics=[TaskAttempt 0 failed, info=[Container container_... exited
with exitCode: -104]]

Exit code -104 means that YARN killed the container for exceeding its physical memory limit.
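The diagnostic text nests several identifiers that the later steps rely on. As a minimal sketch, assuming the error text has been copied into a shell variable on one line, the fields can be pulled out with sed. The helper names `extract_field` and `extract_exit_code` are hypothetical and are not part of Hive or Tez.

```shell
# Error text from the scenario, joined onto one line.
diag='Vertex failed, vertexName=Reducer 2, vertexId=vertex_1700000000000_0001_1_01, diagnostics=[Task failed, taskId=task_1700000000000_0001_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_... exited with exitCode: -104]]'

# Hypothetical helper: print the value of a "name=value" field,
# reading up to the next comma.
extract_field() {
  printf '%s\n' "$2" | sed -n "s/.*$1=\([^,]*\),.*/\1/p"
}

# Hypothetical helper: print the (possibly negative) number after "exitCode: ".
extract_exit_code() {
  printf '%s\n' "$1" | sed -n 's/.*exitCode: \(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}

extract_field vertexName "$diag"   # Reducer 2
extract_field taskId "$diag"       # task_1700000000000_0001_1_01_000000
extract_exit_code "$diag"          # -104
```

Keeping the three values side by side makes it easier to answer the questions in Step 1 without rereading the raw message.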


Step 1 — Identify the Layer

Questions:

  1. Is exit code -104 a Tez error or a YARN error? Where is this code defined?
  2. Which vertex failed: the map or the reduce? How do you know from the diagnostic message?
  3. What Tez API would you call (in Java) to retrieve these diagnostics programmatically?
  4. The error says "TaskAttempt 0 failed". Does this mean no retries happened, or that all retries were exhausted?

Step 2 — Locate the Logs

In a real cluster:

# Get the AM logs
yarn logs -applicationId application_1700000000000_0001 \
  -log_files syslog | grep -A 20 "Reducer 2"

# Get the container logs
yarn logs -applicationId application_1700000000000_0001 \
  -containerId container_... | head -200
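The application ID passed to yarn logs is not printed in the error itself, but it shares its first two numeric fields with the vertex and task IDs. A small sketch of that mapping, assuming the usual `<type>_<clusterTimestamp>_<appSequence>_...` ID layout; `app_id_from_tez_id` is a hypothetical helper name.

```shell
# Hypothetical helper: turn a Tez vertex, task, or DAG ID into the
# YARN application ID by reusing its first two numeric fields.
app_id_from_tez_id() {
  printf '%s\n' "$1" | sed -n 's/^[a-z]*_\([0-9]*\)_\([0-9]*\)_.*/application_\1_\2/p'
}

app_id_from_tez_id vertex_1700000000000_0001_1_01        # application_1700000000000_0001
app_id_from_tez_id task_1700000000000_0001_1_01_000000   # application_1700000000000_0001

# Usage on a real cluster:
# yarn logs -applicationId "$(app_id_from_tez_id vertex_1700000000000_0001_1_01)" -log_files syslog
```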

Questions:

  1. In the AM logs, what Tez class emits the Task failed message? (Hint: grep for TaskImpl or VertexImpl in the log.)
  2. The container log may contain a Java OutOfMemoryError or GC output. Where in TaskAttemptImpl is the container exit code translated into a TaskAttemptEvent?
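When the aggregated logs are long, it can help to filter them for memory-related signatures before reading. The sketch below assumes the phrases that the JVM and YARN commonly print for heap exhaustion and memory-limit kills; treat the pattern list as an assumption to adjust for your versions. `find_memory_signatures` is a hypothetical helper name.

```shell
# Hypothetical helper: read log text on stdin and print, with line numbers,
# every line that carries a common memory-failure signature.
find_memory_signatures() {
  grep -n -E 'java\.lang\.OutOfMemoryError|running beyond (physical|virtual) memory limits'
}

# Usage on a real cluster (container ID elided as in the commands above):
# yarn logs -applicationId application_1700000000000_0001 \
#   -containerId container_... | find_memory_signatures
```

The function exits non-zero when nothing matches, which is itself a useful signal: a container killed by YARN may leave no OutOfMemoryError in its own log.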

Step 3 — Identify the Tez Configuration Fix

The reduce vertex ran out of memory. The relevant configuration:

Config key                                      Default                      Description
----------------------------------------------  ---------------------------  ---------------------------------
tez.am.resource.memory.mb                       1024                         AM container memory
tez.task.resource.memory.mb                     1024                         Task container memory
hive.tez.container.size                         -1 (inherits from mapred)    Hive override for Tez task memory
hive.auto.convert.join.noconditionaltask.size   10MB                         In-memory join threshold

Questions:

  1. Which config key should be increased to fix the OOM?
  2. Is this a Tez config or a Hive config? Which system applies it?
  3. Find where tez.task.resource.memory.mb is read in Tez source. In which class and method?
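Raising the container size is only part of the fix, because the JVM heap inside the container must leave headroom for non-heap memory. A minimal sketch of that arithmetic, assuming a heap share of 80% of the container (a common rule of thumb; verify the value your cluster actually uses). The helper name `heap_mb_for_container` is hypothetical.

```shell
# Hypothetical helper: given a container size in MB and an optional
# heap percentage (assumed default: 80), print the JVM heap size in MB.
heap_mb_for_container() {
  container_mb=$1
  heap_percent=${2:-80}
  echo $(( container_mb * heap_percent / 100 ))
}

heap_mb_for_container 1024      # 819
heap_mb_for_container 4096      # 3276
heap_mb_for_container 4096 75   # 3072
```

The result is the kind of value you would pass as -Xmx when setting JVM options by hand; integer division rounds down, which errs on the side of more headroom.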

Step 4 — Tez Source Reading: Container Exit Code Handling

Find where Tez handles non-zero container exit codes:

grep -rn "exitCode\|EXIT_CODE\|ContainerExitStatus" \
  ~/tez-src/tez-dag/src/main/java/ | grep -v "test" | head -30

Answer:

  1. What class translates the YARN container exit code into a TaskAttemptEvent?
  2. Is -104 (KILLED_EXCEEDED_PMEM) treated differently from -102 (PREEMPTED) or -100 (ABORTED)?
  3. Does Tez retry a preempted task? What configuration controls the max retries?
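While reading the source it helps to have the numeric codes and their symbolic names side by side. The lookup below is a convenience sketch of the YARN ContainerExitStatus constants most relevant to this lab; `exit_status_name` is a hypothetical helper, and the values should be confirmed against the Hadoop version you are reading.

```shell
# Hypothetical helper: map a YARN container exit status to the name of
# the matching ContainerExitStatus constant (subset only).
exit_status_name() {
  case "$1" in
    0)     echo SUCCESS ;;
    -1000) echo INVALID ;;
    -100)  echo ABORTED ;;
    -101)  echo DISKS_FAILED ;;
    -102)  echo PREEMPTED ;;
    -103)  echo KILLED_EXCEEDED_VMEM ;;
    -104)  echo KILLED_EXCEEDED_PMEM ;;
    -105)  echo KILLED_BY_APPMASTER ;;
    *)     echo UNKNOWN ;;
  esac
}

exit_status_name -104   # KILLED_EXCEEDED_PMEM
exit_status_name -102   # PREEMPTED
```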

Step 5 — Simulate the Fix

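On a real cluster you would increase the task container memory (tez.task.resource.memory.mb, or the Hive override identified in Step 3) and rerun the query. Because this lab does not assume access to a Hive cluster, study the corresponding unit test instead.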

Find the test in TestTaskAttemptImpl.java that covers container preemption:

grep -n "preempt\|PREEMPT\|exitCode" \
  ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttemptImpl.java \
  | head -20

Read the test. Answer:

  1. How does the test simulate a container exit with a non-zero exit code?
  2. What state does TaskAttemptImpl transition to on preemption?
  3. Is there a test for the full retry-until-max-attempts path?
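Before reading individual tests, an index of the test method names gives a quick view of which failure paths are covered. This is a sketch that assumes the tests follow the `public void testXxx(` naming convention; `list_test_methods` is a hypothetical helper.

```shell
# Hypothetical helper: print the names of methods that look like JUnit
# tests (public void test...) in the given Java source file.
list_test_methods() {
  sed -n 's/^[[:space:]]*public void \(test[A-Za-z0-9_]*\)(.*/\1/p' "$1"
}

# Usage:
# list_test_methods ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttemptImpl.java \
#   | grep -i -E 'preempt|terminat|retry'
```

Tests that do not follow the naming convention are missed, so treat an empty result as "not found by this pattern" and fall back to the grep above.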

Step 6 — Write a Diagnostic Runbook Entry

Write 5–8 bullet points as a "runbook entry" for this class of failure:

## Hive-on-Tez: Reducer OOM (exit code -104)

**Symptoms:** ...
**Root cause:** ...
**Diagnostic steps:** ...
**Fix:** ...
**Tez classes involved:** ...
**Relevant configuration:** ...

This is the kind of documentation that Tez PMC members write for operators.


JIRA Research

Search for Tez issues related to container OOM or preemption handling:

project = TEZ AND (text ~ "preempt" OR text ~ "oom" OR text ~ "\"out of memory\"") AND resolution = Fixed

Find one. Read the patch. Was the fix in TaskAttemptImpl, in configuration defaults, or in a different class?