Lab H4: Bug Attribution

Background

A failing Hive-on-Tez query may be a Hive bug, a Tez runtime bug, a Tez AM bug, a YARN bug, a Hadoop common bug, a JVM bug, a user bug, or an infrastructure bug. Filing it on the wrong project wastes both the reporter's time and the maintainers'. This lab gives you a mechanical decision tree for attributing a failure correctly from its stack trace, plus four worked examples.

The Decision Tree

Given a stack trace (after Lab H3 has surfaced it):

flowchart TD
  S[Start: have stack trace]
  S --> T1[Find top frame whose package you can change]
  T1 --> P{Package prefix?}
  P -->|org.apache.hadoop.hive.*| H[Hive bug]
  P -->|org.apache.tez.runtime.library.*| TR[Tez runtime library<br/>tez-runtime-library]
  P -->|org.apache.tez.runtime.*<br/>not .library| TRI[Tez runtime internals<br/>tez-runtime-internals]
  P -->|org.apache.tez.dag.app.*| TA[Tez AM<br/>tez-dag]
  P -->|org.apache.tez.dag.api.*| TC[Tez client / API<br/>tez-api]
  P -->|org.apache.tez.client.*| TC
  P -->|org.apache.hadoop.yarn.*| Y[YARN bug]
  P -->|org.apache.hadoop.hdfs.*| HD[HDFS bug]
  P -->|org.apache.hadoop.mapred.*| MR[Hadoop MR compat<br/>tez-mapreduce]
  P -->|user package| U[User code bug]
  P -->|java.*, sun.*, jdk.*| J[Walk down to next frame]
  J --> T1
  H --> CD[Then check Caused by chain]
  TR --> CD
  TRI --> CD
  TA --> CD
  TC --> CD
  MR --> CD
  Y --> CD
  HD --> CD
  CD --> R[Root cause may shift attribution]
  R --> END[File on the project that owns the actionable code]

The rule in one sentence: find the top frame in actionable code, name its package prefix, read off the project, then check the Caused by chain in case the root cause shifts the attribution.
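The first step of the tree, finding the top actionable frame, can be sketched in a few lines of Python. This is a minimal illustration, not part of any Hive or Tez tooling: the function name `top_actionable_frame`, the regular expression, and the sample trace are assumptions made for this example, and the skip list holds only the JVM prefixes named in this lab.

```python
import re

# Prefixes this lab marks as "walk down" (JVM frames).
JVM_PREFIXES = ("java.", "sun.", "jdk.")

# Matches "at fully.qualified.Class.method(" at the start of a frame line.
FRAME_RE = re.compile(r"^\s*at\s+([\w.$]+)\.([\w$<>]+)\(")


def top_actionable_frame(trace, skip_prefixes=JVM_PREFIXES):
    """Return (class_name, method) of the first frame outside skip_prefixes."""
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # exception header, "Caused by:", or "..." line
        class_name, method = match.groups()
        if class_name.startswith(tuple(skip_prefixes)):
            continue  # JVM frame: keep walking down
        return class_name, method
    return None


# Illustrative trace assembled from frames that appear in this lab.
SAMPLE = """java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3210)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
"""

print(top_actionable_frame(SAMPLE))
```

Passing a longer `skip_prefixes` tuple (for example one that also contains `com.google.protobuf.`) extends the walk-down rule to third-party frames without changing the function.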

Package → Project → Module Table

Package prefix	Project	Module / area	Where to file
`org.apache.hadoop.hive.ql.exec.tez.*`	Hive	Tez integration	`https://issues.apache.org/jira/projects/HIVE`
`org.apache.hadoop.hive.ql.exec.*` (not .tez)	Hive	Operators	HIVE JIRA
`org.apache.hadoop.hive.ql.metadata.*`	Hive	Metadata / UDF	HIVE JIRA
`org.apache.hadoop.hive.serde2.*`	Hive	Serialization	HIVE JIRA
`org.apache.hadoop.hive.*` (any other)	Hive	Core	HIVE JIRA
`org.apache.tez.runtime.library.*`	Tez	`tez-runtime-library`	TEZ JIRA
`org.apache.tez.runtime.task.*`	Tez	`tez-runtime-internals`	TEZ JIRA
`org.apache.tez.runtime.*` (not .library, not .task)	Tez	`tez-runtime-internals`	TEZ JIRA
`org.apache.tez.dag.app.dag.impl.*`	Tez	`tez-dag` (state machines)	TEZ JIRA
`org.apache.tez.dag.app.rm.*`	Tez	`tez-dag` (RM client / container scheduling)	TEZ JIRA
`org.apache.tez.dag.app.launcher.*`	Tez	`tez-dag` (container launcher)	TEZ JIRA
`org.apache.tez.dag.app.*` (other)	Tez	`tez-dag` (AM core)	TEZ JIRA
`org.apache.tez.dag.api.*`	Tez	`tez-api` (DAG / Vertex / Edge)	TEZ JIRA
`org.apache.tez.client.*`	Tez	`tez-api` (TezClient)	TEZ JIRA
`org.apache.tez.mapreduce.*`	Tez	`tez-mapreduce` (MRInput/MROutput)	TEZ JIRA
`org.apache.hadoop.yarn.client.*`	YARN	Client	YARN JIRA (`https://issues.apache.org/jira/projects/YARN`)
`org.apache.hadoop.yarn.server.resourcemanager.*`	YARN	RM	YARN JIRA
`org.apache.hadoop.yarn.server.nodemanager.*`	YARN	NM	YARN JIRA
`org.apache.hadoop.hdfs.*`	HDFS	Client / DN / NN	HDFS JIRA (`https://issues.apache.org/jira/projects/HDFS`)
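`org.apache.hadoop.mapred.*`	MR compat	`tez-mapreduce` for MR-on-Tez (for example the split-grouping classes under `org.apache.hadoop.mapred.split`)	TEZ JIRA; MAPREDUCE JIRA if the class ships with Hadoop itself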
`org.apache.hadoop.io.*` / `.fs.*` / `.conf.*`	Hadoop common	hadoop-common	HADOOP JIRA
`com.<user>.*` / `org.<user>.*` (not apache)	User code	n/a	Fix locally
`java.*`, `sun.*`, `jdk.*`	JVM	walk down	(not the cause; keep looking)
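Because several rows share a prefix (for example `org.apache.tez.runtime.*` and `org.apache.tez.runtime.library.*`), the lookup has to pick the longest matching prefix. The sketch below shows one way to do that. It copies only a subset of the rows above, and the names `ATTRIBUTION` and `attribute` are hypothetical, chosen for this example.

```python
# (prefix, project, module) rows copied from the table above; not exhaustive.
ATTRIBUTION = [
    ("org.apache.hadoop.hive.ql.exec.tez.", "Hive", "Tez integration"),
    ("org.apache.hadoop.hive.ql.exec.", "Hive", "Operators"),
    ("org.apache.hadoop.hive.", "Hive", "Core"),
    ("org.apache.tez.runtime.library.", "Tez", "tez-runtime-library"),
    ("org.apache.tez.runtime.", "Tez", "tez-runtime-internals"),
    ("org.apache.tez.dag.app.", "Tez", "tez-dag"),
    ("org.apache.tez.dag.api.", "Tez", "tez-api"),
    ("org.apache.tez.client.", "Tez", "tez-api"),
    ("org.apache.tez.mapreduce.", "Tez", "tez-mapreduce"),
    ("org.apache.hadoop.yarn.", "YARN", "YARN"),
    ("org.apache.hadoop.hdfs.", "HDFS", "HDFS"),
]


def attribute(class_name):
    """Return (project, module) for the longest matching prefix, or None."""
    matches = [row for row in ATTRIBUTION if class_name.startswith(row[0])]
    if not matches:
        return None  # user code, JVM, or a package the table does not list
    _, project, module = max(matches, key=lambda row: len(row[0]))
    return project, module


print(attribute("org.apache.hadoop.hive.ql.exec.tez.MapRecordSource"))
print(attribute("org.apache.tez.runtime.LogicalIOProcessorRuntimeTask"))
```

A `None` result is not an attribution by itself; it means the frame falls outside the listed prefixes and needs the user-code or walk-down rows.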

Verify the modules against your tree:

find ~/tez-src -maxdepth 2 -name pom.xml | sort
find ~/hive-src -maxdepth 3 -name pom.xml | head

Example 1: UDF Not Found (Hive bug → User bug)

Trace (from Lab H3):

java.lang.RuntimeException: ...
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(...)
        ...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
        ...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X

Apply the tree:

Top actionable frame: org.apache.hadoop.hive.ql.exec.tez.MapRecordSource:91.
Package: org.apache.hadoop.hive.ql.exec.tez.*.
Project: Hive (the Tez integration code).
Check Caused by: root is ClassNotFoundException: com.example.udf.X — a user class.
Adjust: this is a user error (the UDF jar is not on the classpath). Hive's UDF registry detected it and Hive's Tez integration reported it. No bug to file.

Fix: load the jar with ADD JAR, or list it in hive.aux.jars.path.

If the same trace came with a Caused by: ClassNotFoundException naming a Hive-owned class, such as org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge, then the root is a Hive class missing from the Hive distribution, and the bug belongs on HIVE.
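The Caused by check in this example can be sketched as follows: take the last `Caused by:` line as the root, and if it is a `ClassNotFoundException`, look at who owns the missing class. The helper names `root_cause` and `missing_class_owner` are hypothetical, and treating every non-Hive class as user or third-party code is a simplifying assumption.

```python
import re

# "Caused by: <exception class>: <message>" (the message part is optional).
CAUSED_BY_RE = re.compile(r"^Caused by:\s*([\w.$]+)(?::\s*(.*))?$")


def root_cause(trace):
    """Return (exception_class, message) from the last 'Caused by:' line."""
    root = None
    for line in trace.splitlines():
        match = CAUSED_BY_RE.match(line.strip())
        if match:
            root = (match.group(1), match.group(2) or "")
    return root


def missing_class_owner(trace):
    """Classify the owner of a class named by a root ClassNotFoundException."""
    cause = root_cause(trace)
    if cause is None or not cause[0].endswith("ClassNotFoundException"):
        return None
    missing = cause[1].strip()
    if missing.startswith("org.apache.hadoop.hive."):
        return "Hive"
    return "User or third-party"  # simplifying assumption for this sketch


TRACE = """java.lang.RuntimeException: ...
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
"""

print(root_cause(TRACE))
print(missing_class_owner(TRACE))
```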

Example 2: Shuffle Fetch Failure (Tez runtime bug → Infrastructure)

Trace:

java.io.IOException: Failed to fetch shuffle data
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)

Apply the tree:

Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.
Package: org.apache.tez.runtime.library.*.
Project: Tez, module tez-runtime-library.
Check Caused by: ConnectException — network.
Adjust: the root is a network or infrastructure failure. The shuffle code reported it correctly, so the frame alone does not indicate a bug. The final attribution depends on context:
- If this happens once, alongside sporadic node failures: infrastructure issue, no bug.
- If this happens frequently and the fetcher gives up after too few retries: Tez bug. File on TEZ asking to raise the defaults for, or expose, tez.runtime.shuffle.connect.timeout and the retry counts.
- If the upstream container died because of an AM scheduling bug: Tez AM bug. File on TEZ with the AM log evidence.

Verify the retry config:

grep "shuffle.connect\|shuffle.fetch.retry\|shuffle.read.timeout" \
  ~/tez-src/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/api/TezRuntimeConfiguration.java

Example 3: AM OOM During DAG Submit (Tez AM bug)

AM container log:

[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3210)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(...)
        ...
        at com.google.protobuf.ByteString.copyFrom(ByteString.java:194)
        at org.apache.tez.dag.api.records.DAGProtos$DAGPlan.toBuilder(DAGProtos.java:...)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
        at org.apache.tez.dag.app.DAGAppMaster.createDAG(DAGAppMaster.java:...)

Apply the tree:

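Top actionable frame: skip the JVM and protobuf frames, including the protobuf-generated DAGProtos$DAGPlan class (it sits under org.apache.tez.dag.api.records but is not hand-written code). First hand-written Tez frame: org.apache.tez.dag.app.dag.impl.VertexImpl:412.
Package: org.apache.tez.dag.app.dag.impl.* (under org.apache.tez.dag.app.*).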
Project: Tez, module tez-dag (AM).
Check Caused by: none — just the OOM.

Attribution: Tez AM. The proximate cause is constructing VertexImpl from a large DAGPlan. Three possible JIRA shapes:

"Tez AM OOMs on submission of N-vertex DAG at default tez.am.resource.memory.mb" — file requesting smarter sizing or doc.
"VertexImpl construction allocates O(N²) memory in inputs" — file with a profile and a fix suggestion.
"DAGPlan toBuilder() materialises a full copy" — file as a perf bug.

The correct shape depends on profile evidence. Without profiling, file the sizing/doc variant first; the deeper variants follow.

Example 4: NodeManager Lost (YARN bug)

AM log:

2024-06-10 ... [WARN ] AMContainerImpl - Container container_..._000007 transitioned from RUNNING to STOPPED. exitStatus -100
2024-06-10 ... [WARN ] DAGAppMaster - Container ..._000007 completed unexpectedly; will be rescheduled
2024-06-10 ... [WARN ] RMContainerRequestor - Lost node nm-12.example.com
2024-06-10 ... [INFO ] DAGAppMaster - Marking task attempt as failed due to lost node: attempt_..._000003_0

Apply the tree:

First class named in the log (there is no stack trace to walk): org.apache.tez.dag.app.rm.container.AMContainerImpl. This is the AM's correct reaction to a node loss, not a bug.
The substantive cause is "NodeManager nm-12 lost". Diagnose it by checking the node state and the NodeManager log on that host:
```
yarn node -list -all | grep nm-12
tail -200 /var/log/hadoop-yarn/yarn-nodemanager.log  # on nm-12
```
Common NM-side root causes:
- NM heap OOM (the NM stops sending heartbeats to the RM) → YARN bug or NM tuning.
- Network partition → infra.
- Disk full on NM local-dirs → ops issue.

Attribution:

If the NM died from OOM, file on the YARN JIRA project.
If the Tez AM didn't reschedule the lost task correctly, file on TEZ. The AM log here shows the correct reaction, so that's not in play.
If Tez's TaskScheduler repeatedly retried the task on the same lost node, file on TEZ (the scheduler failed to account for the lost node).
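The `exitStatus -100` in the AM log above is a useful first clue, because YARN reserves negative exit statuses for framework-level outcomes. The sketch below pulls the status out of a log line and names it. The numeric values follow YARN's `ContainerExitStatus` constants; the regular expression assumes the `exitStatus <n>` wording shown in this lab's log, and `describe_exit` is a hypothetical helper name.

```python
import re

# Subset of YARN's ContainerExitStatus constants.
EXIT_STATUS = {
    -100: "ABORTED",               # released by the app, or lost with its node
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
    -103: "KILLED_EXCEEDED_VMEM",
    -104: "KILLED_EXCEEDED_PMEM",
}

# Assumes the "exitStatus <n>" wording used in the AM log above.
EXIT_RE = re.compile(r"exitStatus\s+(-?\d+)")


def describe_exit(log_line):
    """Return (code, name) for the exit status in log_line, or None."""
    match = EXIT_RE.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, EXIT_STATUS.get(code, "unknown")


LINE = ("[WARN ] AMContainerImpl - Container container_..._000007 "
        "transitioned from RUNNING to STOPPED. exitStatus -100")
print(describe_exit(LINE))
```

An `ABORTED` status points away from the task's own code and toward the node or the framework, which matches the attribution reasoning in this example.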

Cross-Project Patterns

Some failure modes have a well-known cross-project shape. Memorise the shapes:

Shape	Likely project	Quick diagnostic
`ClassCastException` inside `MapRecordSource` / `ReduceRecordSource`	Hive (schema mismatch in vectorization)	Check `EXPLAIN VECTORIZATION DETAIL`
`IOException: Stream is closed` in shuffle reader	Tez runtime library	Check upstream container alive
`TaskCommitDeniedException`	Tez AM speculative-exec coordination	Check `tez.am.speculation.enabled`
`NoSuchMethodError` on a Tez or Hive class	Version skew	Check classpath; check `mvn dependency:tree`
`IllegalArgumentException: Wrong FS`	Hadoop FS	Check `fs.defaultFS`, `core-site.xml`
`Container killed by OOM killer` (exit code 137)	YARN or workload	Check container memory request vs JVM heap
`org.apache.hadoop.security.AccessControlException`	HDFS permissions or Ranger policy	Permissions issue, not a code bug
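For quick triage, a few of these shapes can be encoded as substring checks. This is only a sketch: it covers a subset of the rows above, the names `SHAPES` and `match_shape` are hypothetical, and a substring match is a hint for where to look, not an attribution.

```python
# (required substrings, likely project, quick diagnostic), from the table above.
SHAPES = [
    (("ClassCastException", "RecordSource"),
     "Hive", "check EXPLAIN VECTORIZATION DETAIL"),
    (("IOException: Stream is closed",),
     "Tez runtime library", "check upstream container alive"),
    (("NoSuchMethodError",),
     "Version skew", "check classpath and mvn dependency:tree"),
    (("IllegalArgumentException: Wrong FS",),
     "Hadoop FS", "check fs.defaultFS and core-site.xml"),
]


def match_shape(text):
    """Return (project, diagnostic) for the first shape whose substrings all occur."""
    for needles, project, diagnostic in SHAPES:
        if all(needle in text for needle in needles):
            return project, diagnostic
    return None


print(match_shape("java.lang.IllegalArgumentException: Wrong FS: hdfs://a/b"))
```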

What to Do With the Attribution

Having attributed correctly:

Attribution	Action
Hive	File on `https://issues.apache.org/jira/projects/HIVE` with `Tez` in summary if relevant
Tez `tez-runtime-library`	File on `https://issues.apache.org/jira/projects/TEZ`, component `Runtime Library`
Tez `tez-runtime-internals`	File on TEZ, component `Runtime Internals`
Tez `tez-dag` (AM)	File on TEZ, component `AM`
Tez `tez-api`	File on TEZ, component `Client / API`
Tez `tez-mapreduce`	File on TEZ, component `MR Compat`
YARN	File on `https://issues.apache.org/jira/projects/YARN`
HDFS	File on `https://issues.apache.org/jira/projects/HDFS`
User	Fix locally, no JIRA
Infrastructure	Operations issue, no JIRA
Multiple (Hive needs change AND Tez needs change)	File on both, cross-reference

In all cases, the JIRA description follows the skeleton in Design via JIRA.

Validation Artifacts

After this lab:

The decision tree printed and pinned at your desk (or in ~/tez-notes/).
The Package → Project → Module table memorised or saved as ~/tez-notes/hive-h4-attribution.md.
Four attributions, one for each worked example, written out in your own words.
The reflex: never file a JIRA on a project whose code does not appear at the top of the actionable stack.

The next lab — Lab H5: Reproducing Bugs — covers how to turn an attributed bug into a minimum reproducer suitable to attach to a JIRA.
