Lab 6.1 — Trace a Hive SQL Query to the Generated Tez DAG

Lab type: Read & Research
Estimated time: 120 min
Key classes: DagUtils, TezWork, TezTask (all in Hive)


Overview

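When you run SELECT a, COUNT(*) FROM t GROUP BY a on a Hive-on-Tez cluster, Hive compiles the query into a TezWork object (a description of what the DAG should look like) and hands it to TezTask. TezTask then builds the actual Tez DAG with help from DagUtils: vertices through createVertex(), edges through createEdge(), and the VertexManagerPluginDescriptors attached to them. In current Hive source the assembly loop is in TezTask.build(); if your checkout differs, search for the createVertex() and createEdge() call sites.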

In this lab you will trace this path end-to-end.
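
Before reading the real code, it can help to see the shape of the translation in miniature. The sketch below is a toy model only: PlanToDagSketch, ToyDag, translate() and samplePlan() are hypothetical names, not Hive or Tez classes, and the join-shaped plan is made up for illustration. It shows the general pattern of walking a plan graph and emitting one vertex per node and one edge per parent/child pair.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a plan-to-DAG translation. None of these types are Hive or Tez classes. */
final class PlanToDagSketch {

    /** Hypothetical result type: vertex names plus "parent->child" edge labels. */
    static final class ToyDag {
        final List<String> vertices = new ArrayList<>();
        final List<String> edges = new ArrayList<>();
    }

    /**
     * Creates one vertex per plan node and one edge per parent/child pair.
     * The plan maps each node name to the names of its downstream nodes.
     */
    static ToyDag translate(Map<String, List<String>> plan) {
        ToyDag dag = new ToyDag();
        for (Map.Entry<String, List<String>> node : plan.entrySet()) {
            dag.vertices.add(node.getKey());
            for (String child : node.getValue()) {
                if (!plan.containsKey(child)) {
                    throw new IllegalArgumentException("unknown child: " + child);
                }
                dag.edges.add(node.getKey() + "->" + child);
            }
        }
        return dag;
    }

    /** Hypothetical join-shaped plan: two scans feeding one join node. */
    static Map<String, List<String>> samplePlan() {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        plan.put("scan_a", List.of("join"));
        plan.put("scan_b", List.of("join"));
        plan.put("join", List.of());
        return plan;
    }

    public static void main(String[] args) {
        ToyDag dag = translate(samplePlan());
        System.out.println("vertices: " + dag.vertices);
        System.out.println("edges:    " + dag.edges);
    }
}
```

As you read the Hive code, note what the real translation adds on top of this pattern: per-vertex configuration, edge properties, and vertex manager plugins.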


Step 1 — Check Out Hive Source (Optional)

To get a local copy of the Hive source and locate the three key classes:

git clone https://github.com/apache/hive.git ~/hive-src --depth=1
find ~/hive-src -name "DagUtils.java" | head -3
find ~/hive-src -name "TezWork.java" | head -3
find ~/hive-src -name "TezTask.java" | head -3

If you prefer not to clone the repository, you can read these classes on GitHub (paths are relative to the repository root):

  • ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
  • ql/src/java/org/apache/hadoop/hive/ql/plan/TezWork.java
  • ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java

Step 2 — Read TezWork.java

TezWork is a directed graph of BaseWork nodes. Answer:

  1. What are the two main subclasses of BaseWork that represent map and reduce phases?
  2. How does TezWork represent edges between vertices? What class holds edge configuration?
  3. Where does TezWork store the VertexManagerPluginDescriptor?
  4. A GROUP BY query produces how many BaseWork nodes? Draw the graph.

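Step 3 — Read the DAG Construction Code

This is the core translation step. In current Hive source the loop that iterates over TezWork is TezTask.build(), which calls DagUtils.createVertex() and DagUtils.createEdge(); read all three. If your checkout uses a differently named entry point, search for the createVertex() and createEdge() call sites and start from there.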

  1. What Tez EdgeProperty.DataMovementType does Hive use for a reduce shuffle? Where is this set?
  2. What VertexManagerPlugin does Hive attach to reduce vertices? Is this set unconditionally or based on a configuration flag?
  3. What is auto-parallelism in this context? How does Hive enable it?
  4. What UserPayload does Hive pass to ShuffleVertexManager? Specifically: what are the values of minFraction and maxFraction?
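
To make sense of minFraction and maxFraction, it may help to see how a pair of completion thresholds can drive "slow start" scheduling. The sketch below is a simplified model, not the ShuffleVertexManager source: SlowStartSketch and targetScheduled() are hypothetical names, and the threshold values used in main() are arbitrary illustrations, not the values Hive passes. It assumes no reducers are scheduled below minFraction, all are scheduled at or above maxFraction, and the share grows linearly in between.

```java
/** Simplified slow-start model. Not Tez code; names and numbers are illustrative. */
final class SlowStartSketch {

    /**
     * Returns how many reducers should be scheduled so far, given the fraction
     * of source tasks that have completed and the two thresholds.
     */
    static int targetScheduled(int completedSources, int totalSources, int totalReducers,
                               double minFraction, double maxFraction) {
        if (totalSources <= 0 || totalReducers < 0 || minFraction > maxFraction) {
            throw new IllegalArgumentException("invalid arguments");
        }
        double completed = (double) completedSources / totalSources;
        double range = maxFraction - minFraction;
        double share;
        if (range > 0) {
            // Linear interpolation between the two thresholds.
            share = (completed - minFraction) / range;
        } else {
            // Equal thresholds: a single all-or-nothing cutoff.
            share = completed >= minFraction ? 1.0 : 0.0;
        }
        share = Math.max(0.0, Math.min(1.0, share)); // clamp to [0, 1]
        return (int) Math.ceil(totalReducers * share);
    }

    public static void main(String[] args) {
        // Arbitrary thresholds chosen for the demo, with 100 sources and 20 reducers.
        for (int done : new int[] {10, 50, 75, 100}) {
            System.out.println(done + " sources done -> schedule "
                + targetScheduled(done, 100, 20, 0.25, 0.75) + " reducers");
        }
    }
}
```

When you answer question 4, compare the values you find in Hive and Tez with this model and check whether the real plugin interpolates the same way.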

Step 4 — Read TezTask.execute()

This method submits the DAG and waits for completion.

  1. Does TezTask create a new TezClient per query, or reuse one per session?
  2. How does TezTask wait for DAG completion? Which Tez API does it poll?
  3. When a Hive query fails, what information does TezTask extract from the DAGStatus to show the user?
  4. TezTask updates Hive counters from Tez counters. What is the counter group mapping?
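
The submit-then-wait pattern in this step is, at its core, a polling loop over a status source. The sketch below shows that pattern in isolation. It is a toy model under stated assumptions: DagPollSketch, ToyState, waitForCompletion() and scripted() are hypothetical names, the status source is a scripted list rather than a Tez client, and the real code may add progress reporting, timeouts and interrupt handling that this sketch omits.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Toy polling loop. Not Hive or Tez code; all names here are illustrative. */
final class DagPollSketch {

    /** Hypothetical status values; terminal states end the wait. */
    enum ToyState {
        SUBMITTED, RUNNING, SUCCEEDED, FAILED, KILLED;

        boolean isTerminal() {
            return this == SUCCEEDED || this == FAILED || this == KILLED;
        }
    }

    /** Polls the status source until it reports a terminal state or maxPolls is reached. */
    static ToyState waitForCompletion(Supplier<ToyState> statusSource, long pollMillis, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            ToyState state = statusSource.get();
            if (state.isTerminal()) {
                return state;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        throw new IllegalStateException("no terminal state after " + maxPolls + " polls");
    }

    /** Status source that replays a fixed list of states, one per poll. */
    static Supplier<ToyState> scripted(List<ToyState> states) {
        Iterator<ToyState> it = states.iterator();
        return it::next;
    }

    /** Runs the loop against a scripted SUBMITTED, RUNNING, SUCCEEDED sequence. */
    static ToyState demo() {
        return waitForCompletion(
            scripted(List.of(ToyState.SUBMITTED, ToyState.RUNNING, ToyState.SUCCEEDED)), 0, 10);
    }

    public static void main(String[] args) {
        System.out.println("final state: " + demo());
    }
}
```

While reading TezTask, look for where the equivalent of statusSource.get() lives and which class owns the loop.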

Step 5 — Tez Counterpart: ShuffleVertexManager

Open ShuffleVertexManager.java in your Tez source. Cross-reference with what you learned from DagUtils.java:

  1. The minFraction/maxFraction payload you found in Step 3 is parsed by which method in ShuffleVertexManager?
  2. When Hive enables auto-parallelism, what happens inside ShuffleVertexManager that does NOT happen when it is disabled?
  3. Where does ShuffleVertexManager call context.reconfigureVertex()? What does reconfigureVertex do to the number of reducer tasks?
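
Auto-parallelism comes down to arithmetic on expected input size. The sketch below illustrates one plausible form of that arithmetic so you have something concrete to compare against the source. It is an assumption-laden model, not a copy of ShuffleVertexManager: AutoParallelismSketch and reducedParallelism() are hypothetical names, and the rule of merging a whole number of original partitions into each remaining task is this sketch's simplification.

```java
/** Simplified auto-parallelism arithmetic. Not Tez code; names are illustrative. */
final class AutoParallelismSketch {

    /**
     * Returns a reduced task count such that each task reads roughly
     * desiredTaskInputBytes, never below minParallelism and never above
     * currentParallelism.
     */
    static int reducedParallelism(long expectedOutputBytes, long desiredTaskInputBytes,
                                  int minParallelism, int currentParallelism) {
        if (expectedOutputBytes < 0 || desiredTaskInputBytes <= 0
                || minParallelism < 1 || currentParallelism < 1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        // Ceiling division: how many tasks the data volume calls for.
        long desired = (expectedOutputBytes + desiredTaskInputBytes - 1) / desiredTaskInputBytes;
        desired = Math.max(desired, minParallelism);
        if (desired >= currentParallelism) {
            return currentParallelism; // nothing to gain from reconfiguring
        }
        // Each remaining task takes over a whole number of original partitions.
        int partitionsPerTask = (int) (currentParallelism / desired);
        if (partitionsPerTask <= 1) {
            return currentParallelism;
        }
        // Ceiling division again: the last task picks up any leftover partitions.
        return (currentParallelism + partitionsPerTask - 1) / partitionsPerTask;
    }

    public static void main(String[] args) {
        // 20 configured reducers, 500 bytes expected, 100 bytes desired per task.
        System.out.println(reducedParallelism(500, 100, 1, 20));
    }
}
```

When you answer question 3, check how the real code's result differs from this model, in particular how it rounds and what it does when the desired count is close to the current one.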

Step 6 — End-to-End Mental Model

Draw (on paper or in a text diagram) the full path for:

SELECT dept, COUNT(*) FROM employees GROUP BY dept

Show:

  • Hive logical plan nodes
  • TezWork graph (label each BaseWork)
  • Tez DAG (label each vertex, edge type, VertexManagerPlugin)
  • Which Tez APIs TezTask calls

Step 7 — JIRA Research: Hive/Tez Compatibility

Search the Apache JIRA (https://issues.apache.org/jira) with this JQL query:

project = TEZ AND text ~ "hive" AND resolution = Fixed ORDER BY updated DESC

Find one issue where a Tez change broke Hive, or where a Hive bug exposed a Tez issue, and answer:

  1. What was the incompatibility?
  2. Was the fix in Tez or Hive (or both)?
  3. Did the patch include a test? If so, where?