Lab H4: Bug Attribution
Background
A failing Hive-on-Tez query may be a Hive bug, a Tez runtime bug, a Tez AM bug, a YARN bug, a Hadoop common bug, a JVM bug, a user bug, or an infrastructure bug. Filing it on the wrong project wastes the reporter's time and the maintainer's. This lab gives you a mechanical decision tree to attribute correctly from a stack trace, plus four worked examples.
The Decision Tree
Given a stack trace (after Lab H3 has surfaced it):
flowchart TD
S[Start: have stack trace]
S --> T1[Find top frame whose package you can change]
T1 --> P{Package prefix?}
P -->|org.apache.hadoop.hive.*| H[Hive bug]
P -->|org.apache.tez.runtime.library.*| TR[Tez runtime library<br/>tez-runtime-library]
P -->|org.apache.tez.runtime.*<br/>not .library| TRI[Tez runtime internals<br/>tez-runtime-internals]
P -->|org.apache.tez.dag.app.*| TA[Tez AM<br/>tez-dag]
P -->|org.apache.tez.dag.api.*| TC[Tez client / API<br/>tez-api]
P -->|org.apache.tez.client.*| TC
P -->|org.apache.hadoop.yarn.*| Y[YARN bug]
P -->|org.apache.hadoop.hdfs.*| HD[HDFS bug]
P -->|org.apache.hadoop.mapred.*| MR[Hadoop MR compat<br/>tez-mapreduce]
P -->|user package| U[User code bug]
P -->|java.*, sun.*, jdk.*| J[Walk down to next frame]
J --> T1
H --> CD[Then check Caused by chain]
TR --> CD
TRI --> CD
TA --> CD
Y --> CD
HD --> CD
TC --> CD
MR --> CD
CD --> R[Root cause may shift attribution]
R --> END[File on the project that owns the actionable code]
The rule in one sentence: find the top frame in actionable code, name its package prefix, and read off the project.
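The first step of the rule can be sketched in shell. This is a minimal sketch, assuming the trace has been saved to a file and that frames use the standard "at package.Class.method(File.java:NN)" layout; the function name top_actionable_frame is hypothetical. It skips only the JVM prefixes named in this lab (java.*, sun.*, jdk.*) and does not follow the Caused by chain.

```shell
# top_actionable_frame FILE
# Prints the first "at ..." frame whose class is not in a JVM package
# (java.*, sun.*, jdk.*), i.e. the frame where the decision tree stops walking.
top_actionable_frame() {
  # Keep only frame lines and strip the leading "at ".
  sed -n 's/^[[:space:]]*at[[:space:]]\{1,\}//p' "$1" |
    # Drop JVM frames, then keep the first remaining frame.
    grep -v -E '^(java|sun|jdk)\.' |
    head -n 1
}
```

Called as top_actionable_frame trace.txt (a hypothetical file name), it prints the frame whose package prefix you then look up in the tree. Third-party frames such as com.google.protobuf are not filtered and still need a manual skip.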
Package → Project → Module Table
| Package prefix | Project | Module / area | Where to file |
|---|---|---|---|
| org.apache.hadoop.hive.ql.exec.tez.* | Hive | Tez integration | https://issues.apache.org/jira/projects/HIVE |
| org.apache.hadoop.hive.ql.exec.* (not .tez) | Hive | Operators | HIVE JIRA |
| org.apache.hadoop.hive.ql.metadata.* | Hive | Metadata / UDF | HIVE JIRA |
| org.apache.hadoop.hive.serde2.* | Hive | Serialization | HIVE JIRA |
| org.apache.hadoop.hive.* (any other) | Hive | Core | HIVE JIRA |
| org.apache.tez.runtime.library.* | Tez | tez-runtime-library | TEZ JIRA |
| org.apache.tez.runtime.task.* | Tez | tez-runtime-internals | TEZ JIRA |
| org.apache.tez.runtime.* (not .library, not .task) | Tez | tez-runtime-internals | TEZ JIRA |
| org.apache.tez.dag.app.dag.impl.* | Tez | tez-dag (state machines) | TEZ JIRA |
| org.apache.tez.dag.app.rm.* | Tez | tez-dag (RM client / container scheduling) | TEZ JIRA |
| org.apache.tez.dag.app.launcher.* | Tez | tez-dag (container launcher) | TEZ JIRA |
| org.apache.tez.dag.app.* (other) | Tez | tez-dag (AM core) | TEZ JIRA |
| org.apache.tez.dag.api.* | Tez | tez-api (DAG / Vertex / Edge) | TEZ JIRA |
| org.apache.tez.client.* | Tez | tez-api (TezClient) | TEZ JIRA |
| org.apache.tez.mapreduce.* | Tez | tez-mapreduce (MRInput/MROutput) | TEZ JIRA |
| org.apache.hadoop.yarn.client.* | YARN | Client | YARN JIRA |
| org.apache.hadoop.yarn.server.resourcemanager.* | YARN | RM | YARN JIRA |
| org.apache.hadoop.yarn.server.nodemanager.* | YARN | NM | YARN JIRA |
| org.apache.hadoop.hdfs.* | HDFS | Client / DN / NN | HDFS JIRA |
| org.apache.hadoop.mapred.* | MR compat | tez-mapreduce for MR-on-Tez | TEZ JIRA |
| org.apache.hadoop.io.* / .fs.* / .conf.* | Hadoop common | hadoop-common | HADOOP JIRA (Hadoop Common) |
| com.<user>.* / org.<user>.* (not apache) | User code | n/a | Fix locally |
| java.*, sun.*, jdk.* | JVM | walk down | (not the cause; keep looking) |
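The table translates naturally into a shell case statement. The sketch below is illustrative only: the function name attribute_class and the output strings are hypothetical, the sub-areas within Hive and tez-dag are collapsed into one row each, and the two fallback branches at the end are assumptions that go beyond the table.

```shell
# attribute_class CLASS
# Maps a fully qualified class name to "project / module" using the table.
# case takes the first matching pattern, so specific prefixes
# (tez.runtime.library) must be listed before general ones (tez.runtime).
attribute_class() {
  case "$1" in
    java.*|sun.*|jdk.*)
      echo "JVM (walk down)" ;;
    org.apache.hadoop.hive.*)
      echo "Hive" ;;
    org.apache.tez.runtime.library.*)
      echo "Tez / tez-runtime-library" ;;
    org.apache.tez.runtime.*)
      echo "Tez / tez-runtime-internals" ;;
    org.apache.tez.dag.app.*)
      echo "Tez / tez-dag" ;;
    org.apache.tez.dag.api.*|org.apache.tez.client.*)
      echo "Tez / tez-api" ;;
    org.apache.tez.mapreduce.*|org.apache.hadoop.mapred.*)
      echo "Tez / tez-mapreduce" ;;
    org.apache.hadoop.yarn.*)
      echo "YARN" ;;
    org.apache.hadoop.hdfs.*)
      echo "HDFS" ;;
    org.apache.hadoop.io.*|org.apache.hadoop.fs.*|org.apache.hadoop.conf.*)
      echo "Hadoop common / hadoop-common" ;;
    org.apache.*)
      # Assumption: an Apache package the table does not list.
      echo "Unlisted Apache package (check manually)" ;;
    *)
      # Assumption: anything else is user or third-party code.
      echo "User or third-party code (check manually)" ;;
  esac
}
```

For example, attribute_class org.apache.tez.dag.app.dag.impl.VertexImpl prints "Tez / tez-dag". The attribution is only a starting point; the Caused by chain can still shift it.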
Verify the modules against your tree:
find ~/tez-src -maxdepth 2 -name pom.xml | sort
find ~/hive-src -maxdepth 3 -name pom.xml | head
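To confirm a single row of the table, it can help to look up which module actually contains a class from your trace. The following is a rough sketch, assuming a source checkout laid out as Maven modules; the function name module_of_class is hypothetical. It prints only the top-level directory, which is a simplification for trees with nested modules, and it finds nothing for generated classes that exist only after a build.

```shell
# module_of_class SRC_ROOT CLASS
# Prints the top-level directory under SRC_ROOT that contains the .java
# file for CLASS. Prints nothing when no source file is found.
module_of_class() {
  src_root=${1%/}                  # drop a trailing slash, if any
  class=${2%%\$*}                  # strip an inner-class suffix such as $DAGPlan
  rel_path=$(printf '%s' "$class" | tr . /).java
  find "$src_root" -type f -path "*/$rel_path" |
    while IFS= read -r file; do
      rel=${file#"$src_root"/}     # path relative to the source root
      printf '%s\n' "${rel%%/*}"   # first path component = top-level module
    done |
    sort -u
}
```

A call such as module_of_class ~/tez-src org.apache.tez.dag.app.dag.impl.VertexImpl should print the module directory that holds VertexImpl.java in your checkout.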
Example 1: UDF Not Found (Hive bug → User bug)
Trace (from Lab H3):
java.lang.RuntimeException: ...
at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(...)
...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
Apply the tree:
- Top actionable frame: org.apache.hadoop.hive.ql.exec.tez.MapRecordSource:91.
- Package: org.apache.hadoop.hive.ql.exec.tez.*.
- Project: Hive (the Tez integration code).
- Check Caused by: the root is ClassNotFoundException: com.example.udf.X, a user class.
- Adjust: this is user error (their UDF jar isn't on the classpath), surfaced by Hive's UDF registry and then by Hive's Tez integration. No bug to file.
Fix: ADD JAR or hive.aux.jars.path.
If the same trace came with Caused by: ClassNotFoundException: org.apache.hadoop.hive.ql.exec.UDFBridge,
then the root is a Hive class missing from the Hive distribution — file on HIVE.
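Reading the root of the Caused by chain is mechanical enough to script. The sketch below assumes the trace is saved in a file and that each cause starts with the standard "Caused by: " prefix; the function name root_cause is hypothetical.

```shell
# root_cause FILE
# Prints the last "Caused by:" entry of the trace with the prefix removed.
# When the trace has no Caused by chain, prints its first line instead.
root_cause() {
  awk '
    NR == 1 { root = $0 }
    /^[ \t]*Caused by: / {
      sub(/^[ \t]*Caused by: /, "")
      root = $0
    }
    END { print root }
  ' "$1"
}
```

On the trace in this example it prints java.lang.ClassNotFoundException: com.example.udf.X, which is the line that moves the attribution from Hive to user code.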
Example 2: Shuffle Fetch Failure (Tez runtime bug)
Trace:
java.io.IOException: Failed to fetch shuffle data
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
Apply the tree:
- Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.
- Package: org.apache.tez.runtime.library.*.
- Project: Tez, module tez-runtime-library.
- Check Caused by: ConnectException, a network error.
- Adjust: the root is a network/infra failure. The shuffle code surfaced it correctly; not a bug in itself. But:
  - If this happens once with sporadic node failures: infrastructure issue, no bug.
  - If this happens frequently and the fetcher isn't retrying enough times before giving up: Tez bug. File on TEZ asking to bump or expose tez.runtime.shuffle.connect.timeout/retry counts.
  - If the upstream container died because of an AM scheduling bug: Tez AM bug, file on TEZ with the AM log evidence.
Verify the retry config:
grep "shuffle.connect\|shuffle.fetch.retry\|shuffle.read.timeout" \
~/tez-src/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/api/TezRuntimeConfiguration.java
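Whether the failure is a one-off or frequent is easier to judge with a count. The sketch below simply counts the fetch-failure message from the trace above across a set of task logs; the function name count_fetch_failures is hypothetical, the log files are assumed to have been collected locally already, and there is no fixed threshold for "frequent".

```shell
# count_fetch_failures FILE...
# Prints how many lines across the given logs contain the fetch-failure
# message shown in the example trace.
count_fetch_failures() {
  # -F treats the message as a fixed string; -c prints the match count.
  cat "$@" | grep -c -F 'Failed to fetch shuffle data'
}
```

A single hit next to a known node failure points to infrastructure; many hits spread across healthy nodes are the evidence a TEZ report would need.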
Example 3: AM OOM During DAG Submit (Tez AM bug)
AM container log:
[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3210)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(...)
...
at com.google.protobuf.ByteString.copyFrom(ByteString.java:194)
at org.apache.tez.dag.api.records.DAGProtos$DAGPlan.toBuilder(DAGProtos.java:...)
at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
at org.apache.tez.dag.app.DAGAppMaster.createDAG(DAGAppMaster.java:...)
Apply the tree:
- Top actionable frame: skip JVM and protobuf frames, including the generated DAGProtos class. First hand-written Tez frame: org.apache.tez.dag.app.dag.impl.VertexImpl:412.
- Package: org.apache.tez.dag.app.*.
- Project: Tez, module tez-dag (AM).
- Check Caused by: none, just the OOM.
Attribution: Tez AM. The proximate cause is constructing VertexImpl from a large DAGPlan. Three possible JIRA shapes:
- "Tez AM OOMs on submission of N-vertex DAG at default tez.am.resource.memory.mb": file requesting smarter sizing or doc.
- "VertexImpl construction allocates O(N²) memory in inputs": file with a profile and a fix suggestion.
- "DAGPlan toBuilder() materialises a full copy": file as a perf bug.
The correct shape depends on profile evidence. Without profiling, file the sizing/doc variant first; the deeper variants follow.
Example 4: NodeManager Lost (YARN bug)
AM log:
2024-06-10 ... [WARN ] AMContainerImpl - Container container_..._000007 transitioned from RUNNING to STOPPED. exitStatus -100
2024-06-10 ... [WARN ] DAGAppMaster - Container ..._000007 completed unexpectedly; will be rescheduled
2024-06-10 ... [WARN ] RMContainerRequestor - Lost node nm-12.example.com
2024-06-10 ... [INFO ] DAGAppMaster - Marking task attempt as failed due to lost node: attempt_..._000003_0
Apply the tree:
- Top Tez class in the log: org.apache.tez.dag.app.rm.container.AMContainerImpl. This is the AM's correct reaction to a node loss, not a bug.
- The substantive cause is "NodeManager nm-12 lost". Diagnose it by checking the node's state and the NodeManager log on that host:
  yarn node -list -all | grep nm-12
  tail -200 /var/log/hadoop-yarn/yarn-nodemanager.log # on nm-12
- Common NM-side root causes:
  - NM heap OOM (NM stops sending heartbeats to the RM) → YARN bug or NM tuning.
  - Network partition → infra.
  - Disk full on NM local-dirs → ops issue.
Attribution:
- If NM died from OOM, file on YARN.
- If Tez AM didn't reschedule the lost task correctly, file on TEZ. But the AM log here shows correct reaction, so that's not in play.
- If Tez's TaskScheduler retried the task on the same lost node repeatedly, file on TEZ (a scheduler awareness issue).
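The exitStatus in the AM log is a useful first hint. The sketch below maps a few values to their names in YARN's ContainerExitStatus class, plus the 137 code from a SIGKILL; the function name exit_status_hint is hypothetical, and the "where to look" text after each name is a suggestion under this lab's attribution rules, not something YARN reports.

```shell
# exit_status_hint STATUS
# Maps a container exit status to its ContainerExitStatus name (or signal)
# and a suggested place to look next.
exit_status_hint() {
  case "$1" in
    -100) echo "ABORTED: released by the app or lost with its node; check the NodeManager log" ;;
    -101) echo "DISKS_FAILED: NodeManager disks unhealthy; ops issue" ;;
    -102) echo "PREEMPTED: taken back by the scheduler; check queue configuration" ;;
    -103) echo "KILLED_EXCEEDED_VMEM: over the virtual memory limit; check container sizing" ;;
    -104) echo "KILLED_EXCEEDED_PMEM: over the physical memory limit; check container sizing" ;;
    137)  echo "SIGKILL (128+9): often the OS OOM killer; compare memory request with JVM heap" ;;
    *)    echo "Not covered by this sketch; look the value up in ContainerExitStatus" ;;
  esac
}
```

For the log above, exit_status_hint -100 reports ABORTED, which is consistent with a lost node rather than a failure inside the task.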
Cross-Project Patterns
Some failure modes have a well-known cross-project shape. Memorise the shapes:
| Shape | Likely project | Quick diagnostic |
|---|---|---|
| ClassCastException inside MapRecordSource / ReduceRecordSource | Hive (schema mismatch in vectorization) | Check EXPLAIN VECTORIZATION DETAIL |
| IOException: Stream is closed in shuffle reader | Tez runtime library | Check upstream container alive |
| TaskCommitDeniedException | Tez AM speculative-exec coordination | Check tez.am.speculation.enabled |
| NoSuchMethodError on a Tez or Hive class | Version skew | Check classpath; check mvn dependency:tree |
| IllegalArgumentException: Wrong FS | Hadoop FS | Check fs.defaultFS, core-site.xml |
| Container killed by OOM killer (exit code 137) | YARN or workload | Check container memory request vs JVM heap |
| org.apache.hadoop.security.AccessControlException | HDFS or Hive Ranger | Permissions issue, not a code bug |
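Until the shapes are memorised, a small lookup can flag them in a saved log. This is a sketch only: the function name shape_hints is hypothetical, it covers just the rows that can be recognised from a single fixed string, and the ClassCastException and exit code 137 rows are left out because they need more context than one substring.

```shell
# shape_hints FILE
# Prints one hint line for every known shape whose marker string occurs in FILE.
shape_hints() {
  # Each here-document line is "marker|hint"; grep -F matches the marker literally.
  while IFS='|' read -r marker hint; do
    if grep -q -F -e "$marker" "$1"; then
      printf '%s\n' "$hint"
    fi
  done <<'EOF'
IOException: Stream is closed|Tez runtime library: check the upstream container is alive
TaskCommitDeniedException|Tez AM speculation: check tez.am.speculation.enabled
NoSuchMethodError|Version skew: check the classpath and mvn dependency:tree
IllegalArgumentException: Wrong FS|Hadoop FS: check fs.defaultFS in core-site.xml
AccessControlException|Permissions issue (HDFS or Ranger), not a code bug
EOF
}
```

A hint narrows where to look; it does not replace walking the stack and the Caused by chain.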
What to Do With the Attribution
Having attributed correctly:
| Attribution | Action |
|---|---|
| Hive | File on https://issues.apache.org/jira/projects/HIVE with Tez in summary if relevant |
| Tez tez-runtime-library | File on https://issues.apache.org/jira/projects/TEZ, component Runtime Library |
| Tez tez-runtime-internals | File on TEZ, component Runtime Internals |
| Tez tez-dag (AM) | File on TEZ, component AM |
| Tez tez-api | File on TEZ, component Client / API |
| Tez tez-mapreduce | File on TEZ, component MR Compat |
| YARN | File on https://issues.apache.org/jira/projects/YARN |
| HDFS | File on https://issues.apache.org/jira/projects/HDFS |
| User | Fix locally, no JIRA |
| Infrastructure | Operations issue, no JIRA |
| Multiple (Hive needs change AND Tez needs change) | File on both, cross-reference |
In all cases, the JIRA description follows the skeleton in Design via JIRA.
Validation Artifacts
After this lab:
- The decision tree printed and pinned at your desk (or in ~/tez-notes/).
- The Package → Project → Module table memorised or saved as ~/tez-notes/hive-h4-attribution.md.
- Four attributions, one for each worked example, written out in your own words.
- The reflex: never file a JIRA on a project whose code does not appear in the top of the actionable stack.
The next lab — Lab H5: Reproducing Bugs — covers how to turn an attributed bug into a minimum reproducer suitable to attach to a JIRA.