Lab H4: Bug Attribution

Background

A failing Hive-on-Tez query may stem from a bug in Hive, the Tez runtime, the Tez AM, YARN, Hadoop common, or the JVM, or from user code or infrastructure. Filing it on the wrong project wastes both the reporter's time and the maintainers'. This lab gives you a mechanical decision tree for attributing a failure from its stack trace, plus four worked examples.


The Decision Tree

Given a stack trace (after Lab H3 has surfaced it):

flowchart TD
  S[Start: have stack trace]
  S --> T1[Find top frame whose package you can change]
  T1 --> P{Package prefix?}
  P -->|org.apache.hadoop.hive.*| H[Hive bug]
  P -->|org.apache.tez.runtime.library.*| TR[Tez runtime library<br/>tez-runtime-library]
  P -->|org.apache.tez.runtime.*<br/>not .library| TRI[Tez runtime internals<br/>tez-runtime-internals]
  P -->|org.apache.tez.dag.app.*| TA[Tez AM<br/>tez-dag]
  P -->|org.apache.tez.dag.api.*| TC[Tez client / API<br/>tez-api]
  P -->|org.apache.tez.client.*| TC
  P -->|org.apache.hadoop.yarn.*| Y[YARN bug]
  P -->|org.apache.hadoop.hdfs.*| HD[HDFS bug]
  P -->|org.apache.hadoop.mapred.*| MR[Hadoop MR compat<br/>tez-mapreduce]
  P -->|user package| U[User code bug]
  P -->|java.*, sun.*| J[Walk down to next frame]
  J --> T1
  H --> CD[Then check Caused by chain]
  TR --> CD
  TRI --> CD
  TA --> CD
  TC --> CD
  Y --> CD
  HD --> CD
  MR --> CD
  CD --> R[Root cause may shift attribution]
  R --> END[File on the project that owns the actionable code]

The rule in one sentence: find the top frame in actionable code, name its package prefix, and read off the project.
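
The lookup step of the tree can be sketched as a shell function. This is a minimal sketch, not an official tool: the function name attribute_class is hypothetical, the prefixes are the ones in the flowchart above, and anything unlisted is left for a manual check rather than assumed to be user code.

```shell
# Map a fully qualified class name to the project/module that owns it.
# `case` takes the first matching branch, so specific prefixes come first.
# attribute_class is a hypothetical helper name, not part of Hive or Tez.
attribute_class() {
  case "$1" in
    java.*|sun.*|jdk.*)                echo "JVM: walk down to the next frame" ;;
    org.apache.hadoop.hive.*)          echo "Hive" ;;
    org.apache.tez.runtime.library.*)  echo "Tez: tez-runtime-library" ;;
    org.apache.tez.runtime.*)          echo "Tez: tez-runtime-internals" ;;
    org.apache.tez.dag.app.*)          echo "Tez: tez-dag (AM)" ;;
    org.apache.tez.dag.api.*|org.apache.tez.client.*)
                                       echo "Tez: tez-api" ;;
    org.apache.hadoop.yarn.*)          echo "YARN" ;;
    org.apache.hadoop.hdfs.*)          echo "HDFS" ;;
    org.apache.hadoop.mapred.*)        echo "Tez: tez-mapreduce (MR compat)" ;;
    *)                                 echo "Unlisted: user code or third-party library, check by hand" ;;
  esac
}
```

For example, attribute_class org.apache.tez.dag.app.DAGAppMaster prints "Tez: tez-dag (AM)". The branch order matters: the .library prefix has to be tested before the broader org.apache.tez.runtime.* prefix, otherwise every runtime-library frame would be reported as runtime internals.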


Package → Project → Module Table

| Package prefix | Project | Module / area | Where to file |
| --- | --- | --- | --- |
| org.apache.hadoop.hive.ql.exec.tez.* | Hive | Tez integration | https://issues.apache.org/jira/projects/HIVE |
| org.apache.hadoop.hive.ql.exec.* (not .tez) | Hive | Operators | HIVE JIRA |
| org.apache.hadoop.hive.ql.metadata.* | Hive | Metadata / UDF | HIVE JIRA |
| org.apache.hadoop.hive.serde2.* | Hive | Serialization | HIVE JIRA |
| org.apache.hadoop.hive.* (any other) | Hive | Core | HIVE JIRA |
| org.apache.tez.runtime.library.* | Tez | tez-runtime-library | TEZ JIRA |
| org.apache.tez.runtime.task.* | Tez | tez-runtime-internals | TEZ JIRA |
| org.apache.tez.runtime.* (not .library, not .task) | Tez | tez-runtime-internals | TEZ JIRA |
| org.apache.tez.dag.app.dag.impl.* | Tez | tez-dag (state machines) | TEZ JIRA |
| org.apache.tez.dag.app.rm.* | Tez | tez-dag (RM client / container scheduling) | TEZ JIRA |
| org.apache.tez.dag.app.launcher.* | Tez | tez-dag (container launcher) | TEZ JIRA |
| org.apache.tez.dag.app.* (other) | Tez | tez-dag (AM core) | TEZ JIRA |
| org.apache.tez.dag.api.* | Tez | tez-api (DAG / Vertex / Edge) | TEZ JIRA |
| org.apache.tez.client.* | Tez | tez-api (TezClient) | TEZ JIRA |
| org.apache.tez.mapreduce.* | Tez | tez-mapreduce (MRInput/MROutput) | TEZ JIRA |
| org.apache.hadoop.yarn.client.* | YARN | Client | YARN JIRA (https://issues.apache.org/jira/projects/YARN) |
| org.apache.hadoop.yarn.server.resourcemanager.* | YARN | RM | YARN JIRA |
| org.apache.hadoop.yarn.server.nodemanager.* | YARN | NM | YARN JIRA |
| org.apache.hadoop.hdfs.* | HDFS | Client / DN / NN | HDFS JIRA (https://issues.apache.org/jira/projects/HDFS) |
| org.apache.hadoop.mapred.* | MR compat | tez-mapreduce for MR-on-Tez | TEZ JIRA |
| org.apache.hadoop.io.* / .fs.* / .conf.* | Hadoop common | hadoop-common | HADOOP JIRA (https://issues.apache.org/jira/projects/HADOOP) |
| com.<user>.* / org.<user>.* (not apache) | User code | n/a | Fix locally |
| java.*, sun.*, jdk.* | JVM | walk down | (not the cause; keep looking) |

Verify the modules against your tree:

find ~/tez-src -maxdepth 2 -name pom.xml | sort
find ~/hive-src -maxdepth 3 -name pom.xml | head

Example 1: UDF Not Found (Hive bug → User bug)

Trace (from Lab H3):

java.lang.RuntimeException: ...
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(...)
        ...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
        ...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X

Apply the tree:

  1. Top actionable frame: org.apache.hadoop.hive.ql.exec.tez.MapRecordSource:91.
  2. Package: org.apache.hadoop.hive.ql.exec.tez.*.
  3. Project: Hive (the Tez integration code).
  4. Check Caused by: root is ClassNotFoundException: com.example.udf.X — a user class.
  5. Adjust: this is a user error (the UDF jar is not on the classpath), detected by Hive's function registry and rethrown by Hive's Tez integration. No bug to file.

Fix: register the jar in the session with ADD JAR, or list it in hive.aux.jars.path.

If the same trace came with Caused by: ClassNotFoundException: org.apache.hadoop.hive.ql.exec.UDFBridge, then the root is a Hive class missing from the Hive distribution — file on HIVE.
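
Step 4 of this example, reading the Caused by chain down to its root, can be sketched with standard text tools. The helper names root_cause_class and missing_class are hypothetical, and the sketch assumes the trace arrives on stdin with each "Caused by:" line starting in column 0, as in the trace above.

```shell
# Print the exception class named on the last "Caused by:" line of a trace on stdin.
# The last "Caused by:" in a Java stack trace is the root cause.
root_cause_class() {
  grep -E '^Caused by: ' | tail -n 1 | sed -E 's/^Caused by: ([^:]+).*/\1/'
}

# If the trace contains a ClassNotFoundException, print the class that could not be loaded.
missing_class() {
  grep -E '^Caused by: java\.lang\.ClassNotFoundException: ' | tail -n 1 \
    | sed -E 's/^.*ClassNotFoundException: //'
}
```

Fed the Example 1 trace, root_cause_class prints java.lang.ClassNotFoundException and missing_class prints com.example.udf.X. The package of the missing class then decides the attribution: a user package points to a missing jar, while an org.apache.hadoop.hive.* class points back to Hive.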


Example 2: Shuffle Fetch Failure (Tez runtime bug)

Trace:

java.io.IOException: Failed to fetch shuffle data
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)

Apply the tree:

  1. Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.
  2. Package: org.apache.tez.runtime.library.*.
  3. Project: Tez, module tez-runtime-library.
  4. Check Caused by: ConnectException — network.
  5. Adjust: the root is a network/infra failure. The shuffle code surfaced it correctly; not a bug in itself. But:
    • If this happens once with sporadic node failures: infrastructure issue, no bug.
    • If this happens frequently and the fetcher isn't retrying enough times before giving up: Tez bug — file on TEZ asking to bump or expose tez.runtime.shuffle.connect.timeout/retry counts.
    • If the upstream container died because of an AM scheduling bug: Tez AM bug, file on TEZ with the AM log evidence.

Verify the retry config:

grep "shuffle.connect\|shuffle.fetch.retry\|shuffle.read.timeout" \
  ~/tez-src/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/api/TezRuntimeConfiguration.java

Example 3: AM OOM During DAG Submit (Tez AM bug)

AM container log:

[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3210)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(...)
        ...
        at com.google.protobuf.ByteString.copyFrom(ByteString.java:194)
        at org.apache.tez.dag.api.records.DAGProtos$DAGPlan.toBuilder(DAGProtos.java:...)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
        at org.apache.tez.dag.app.DAGAppMaster.createDAG(DAGAppMaster.java:...)

Apply the tree:

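  1. Top actionable frame: skip the JVM and protobuf frames, including the generated DAGProtos$DAGPlan frame. The first hand-written Tez frame is org.apache.tez.dag.app.dag.impl.VertexImpl:412.
  2. Package: org.apache.tez.dag.app.dag.impl.*, which falls under org.apache.tez.dag.app.*.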
  3. Project: Tez, module tez-dag (AM).
  4. Check Caused by: none — just the OOM.

Attribution: Tez AM. The proximate cause is constructing VertexImpl from a large DAGPlan. Three possible JIRA shapes:

  • "Tez AM OOMs on submission of N-vertex DAG at default tez.am.resource.memory.mb" — file requesting smarter sizing or doc.
  • "VertexImpl construction allocates O(N²) memory in inputs" — file with a profile and a fix suggestion.
  • "DAGPlan toBuilder() materialises a full copy" — file as a perf bug.

The correct shape depends on profile evidence. Without profiling, file the sizing/doc variant first; the deeper variants follow.
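
The frame-skipping in step 1 of this example can be sketched as a filter over the trace. The function name top_actionable_frame is hypothetical, and treating any class whose name contains "Protos$" as generated protobuf code is an assumption based on the DAGProtos frame in the trace above, not a documented convention.

```shell
# Print the first stack frame that is not JVM, protobuf library, or generated protobuf code.
# Reads a stack trace on stdin; frame lines look like "    at pkg.Class.method(File.java:N)".
top_actionable_frame() {
  grep -E '^[[:space:]]*at ' \
    | sed -E 's/^[[:space:]]*at //' \
    | grep -Ev '^(java|sun|jdk)\.|^com\.google\.protobuf\.|Protos\$' \
    | head -n 1
}
```

Saving the container log excerpt to a file and running top_actionable_frame < am-oom.log (the file name is illustrative) would print the VertexImpl frame for the trace above. The skip list is deliberately short; extend it with other libraries you cannot change.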


Example 4: NodeManager Lost (YARN bug)

AM log:

2024-06-10 ... [WARN ] AMContainerImpl - Container container_..._000007 transitioned from RUNNING to STOPPED. exitStatus -100
2024-06-10 ... [WARN ] DAGAppMaster - Container ..._000007 completed unexpectedly; will be rescheduled
2024-06-10 ... [WARN ] RMContainerRequestor - Lost node nm-12.example.com
2024-06-10 ... [INFO ] DAGAppMaster - Marking task attempt as failed due to lost node: attempt_..._000003_0

Apply the tree:

  1. There is no stack trace here, only log lines. The class logging the container transition is AMContainerImpl (under org.apache.tez.dag.app.rm.* in the Tez source), but this is the AM's correct reaction to a node loss, not a bug.

  2. The substantive cause is "NodeManager nm-12 lost". Diagnose it by checking the NodeManager log on that host:

    yarn node -list -all | grep nm-12
    tail -200 /var/log/hadoop-yarn/yarn-nodemanager.log  # on nm-12
    
  3. Common NM-side root causes:

    • NM heap OOM (the NM stops sending heartbeats to the RM) → YARN bug or NM tuning.
    • Network partition → infra.
    • Disk full on NM local-dirs → ops issue.

Attribution:

  • If the NM died from OOM, file on the YARN JIRA.
  • If Tez AM didn't reschedule the lost task correctly, file on TEZ. But the AM log here shows correct reaction, so that's not in play.
  • If Tez's TaskScheduler retried the task on the same lost node repeatedly, file on TEZ (a scheduler awareness issue).
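
To see whether one node is lost repeatedly, the AM log can be summarised per host. This is a sketch under a stated assumption: it relies on the "Lost node <host>" wording from the sample log above, which may differ across versions, and lost_nodes is a hypothetical helper name.

```shell
# List each lost node named in an AM log on stdin, most frequent first.
# Output columns: occurrence count, host name.
# Assumes the "Lost node <host>" wording shown in the sample AM log.
lost_nodes() {
  sed -n -E 's/^.*Lost node ([^[:space:]]+).*$/\1/p' | sort | uniq -c | sort -rn
}
```

A host that appears many times in one DAG's log is a candidate for the scheduler-awareness case in the last bullet; a host that appears once points back to the NodeManager log on that machine.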

Cross-Project Patterns

Some failure modes have a well-known cross-project shape. Memorise the shapes:

| Shape | Likely project | Quick diagnostic |
| --- | --- | --- |
| ClassCastException inside MapRecordSource / ReduceRecordSource | Hive (schema mismatch in vectorization) | Check EXPLAIN VECTORIZATION DETAIL |
| IOException: Stream is closed in shuffle reader | Tez runtime library | Check upstream container alive |
| TaskCommitDeniedException | Tez AM speculative-exec coordination | Check tez.am.speculation.enabled |
| NoSuchMethodError on a Tez or Hive class | Version skew | Check classpath; check mvn dependency:tree |
| IllegalArgumentException: Wrong FS | Hadoop FS | Check fs.defaultFS, core-site.xml |
| Container killed by OOM killer (exit code 137) | YARN or workload | Check container memory request vs JVM heap |
| org.apache.hadoop.security.AccessControlException | HDFS or Hive Ranger | Permissions issue, not a code bug |
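
The exit code 137 in the table follows the usual shell convention that a status above 128 means the process was killed by signal number (status - 128). A minimal sketch of that arithmetic, with exit_signal as a hypothetical helper name:

```shell
# Decode a process exit status.
# By shell convention, a status above 128 means "killed by signal (status - 128)".
exit_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo "none"
  fi
}
```

exit_signal 137 prints 9, which is SIGKILL. The signal number alone does not say who sent it, so the container memory request and JVM heap still need to be checked as the table suggests.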

What to Do With the Attribution

Having attributed correctly:

| Attribution | Action |
| --- | --- |
| Hive | File on https://issues.apache.org/jira/projects/HIVE with Tez in summary if relevant |
| Tez tez-runtime-library | File on https://issues.apache.org/jira/projects/TEZ, component Runtime Library |
| Tez tez-runtime-internals | File on TEZ, component Runtime Internals |
| Tez tez-dag (AM) | File on TEZ, component AM |
| Tez tez-api | File on TEZ, component Client / API |
| Tez tez-mapreduce | File on TEZ, component MR Compat |
| YARN | File on https://issues.apache.org/jira/projects/YARN |
| HDFS | File on https://issues.apache.org/jira/projects/HDFS |
| User | Fix locally, no JIRA |
| Infrastructure | Operations issue, no JIRA |
| Multiple (Hive needs change AND Tez needs change) | File on both, cross-reference |

In all cases, the JIRA description follows the skeleton in Design via JIRA.


Validation Artifacts

After this lab:

  1. The decision tree printed and pinned at your desk (or in ~/tez-notes/).
  2. The Package → Project → Module table memorised or saved as ~/tez-notes/hive-h4-attribution.md.
  3. Four attributions, one for each worked example, written out in your own words.
  4. The reflex: never file a JIRA on a project whose code does not appear at the top of the actionable stack.

The next lab — Lab H5: Reproducing Bugs — covers how to turn an attributed bug into a minimum reproducer suitable to attach to a JIRA.