Lab 6.1 — Trace a Hive SQL Query to the Generated Tez DAG
Lab type: Read & Research
Estimated time: 120 min
Key classes: DagUtils, TezWork, TezTask (all in Hive)
Overview
When you run SELECT a, COUNT(*) FROM t GROUP BY a on a Hive-on-Tez cluster,
Hive builds a TezWork object (a description of what the DAG should look like)
and hands it to DagUtils.createDag(). That method creates the actual Tez
DAG, vertices, edges, and VertexManagerPluginDescriptors.
In this lab you will trace this path end-to-end.
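As a rough mental model of this hand-off, the sketch below uses hypothetical `ToyWork` and `ToyTezWork` classes (stand-ins, not Hive's real classes) and a `translate()` function that emits one vertex per work node and one edge per connection. The node names `scan` and `aggregate`, the `shuffle` label, and the two-node shape are illustrative assumptions, not the answers to the questions in Steps 2 and 3.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWork:
    """Hypothetical stand-in for one Hive BaseWork node."""
    name: str


@dataclass
class ToyTezWork:
    """Hypothetical stand-in for TezWork: work nodes plus directed edges."""
    works: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (parent, child, edge_type)

    def add(self, work):
        self.works.append(work)
        return work

    def connect(self, parent, child, edge_type):
        self.edges.append((parent.name, child.name, edge_type))


def translate(plan):
    """Build a toy DAG: one vertex per work node, one edge per connection."""
    vertices = {w.name: {"name": w.name} for w in plan.works}
    edges = []
    for parent, child, edge_type in plan.edges:
        if parent not in vertices or child not in vertices:
            raise ValueError(f"edge {parent} -> {child} references an unknown node")
        edges.append({"from": parent, "to": child, "type": edge_type})
    return {"vertices": vertices, "edges": edges}


def build_example():
    """Assemble an illustrative two-node plan and translate it."""
    plan = ToyTezWork()
    scan = plan.add(ToyWork("scan"))
    aggregate = plan.add(ToyWork("aggregate"))
    plan.connect(scan, aggregate, "shuffle")
    return translate(plan)
```

While reading the real source, it may help to note which parts of this toy have a direct counterpart in Hive and which do not.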
Step 1 — Check Out Hive Source (Optional)
If you want a local copy of the Hive source, clone it and locate the three classes:
git clone https://github.com/apache/hive.git ~/hive-src --depth=1
find ~/hive-src -name "DagUtils.java" | head -3
find ~/hive-src -name "TezWork.java" | head -3
find ~/hive-src -name "TezTask.java" | head -3
If you do not have Hive source, you can read these classes on GitHub:
- ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
- ql/src/java/org/apache/hadoop/hive/ql/plan/TezWork.java
- ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
Step 2 — Read TezWork.java
TezWork is a directed graph of BaseWork nodes. Answer:
| # | Question |
|---|---|
| 1 | What are the two main subclasses of BaseWork that represent map and reduce phases? |
| 2 | How does TezWork represent edges between vertices? What class holds edge configuration? |
| 3 | Where does TezWork store the VertexManagerPluginDescriptor? |
| 4 | A GROUP BY query produces how many BaseWork nodes? Draw the graph. |
Step 3 — Read DagUtils.createDag()
This is the core translation method. It iterates over TezWork and calls
createVertex() and createEdge().
| # | Question |
|---|---|
| 1 | What Tez EdgeProperty.DataMovementType does Hive use for a reduce shuffle? Where is this set? |
| 2 | What VertexManagerPlugin does Hive attach to reduce vertices? Is this set unconditionally or based on a configuration flag? |
| 3 | What is auto-parallelism in this context? How does Hive enable it? |
| 4 | What UserPayload does Hive pass to ShuffleVertexManager? Specifically: what are the values of minFraction and maxFraction? |
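To make question 4 concrete, here is a minimal sketch of what a min/max source-fraction pair is generally used for in slow-start scheduling: no downstream tasks are released below the minimum, all are released at or above the maximum, and the share grows linearly in between. The function names are hypothetical and the values 0.5 and 1.0 in the usage are arbitrary, not the values Hive passes. Verify the exact rule against the Tez source.

```python
import math


def fraction_to_schedule(completed_fraction, min_fraction, max_fraction):
    """Share of downstream tasks eligible to start (simplified model)."""
    if not 0.0 <= min_fraction <= max_fraction <= 1.0:
        raise ValueError("expected 0 <= min_fraction <= max_fraction <= 1")
    if completed_fraction < min_fraction:
        return 0.0
    if completed_fraction >= max_fraction:
        # Also covers min_fraction == max_fraction, so no division by zero below.
        return 1.0
    return (completed_fraction - min_fraction) / (max_fraction - min_fraction)


def tasks_to_schedule(total_tasks, completed_fraction, min_fraction, max_fraction):
    """Number of downstream tasks the simplified model would release."""
    share = fraction_to_schedule(completed_fraction, min_fraction, max_fraction)
    return math.floor(total_tasks * share)
```

For example, with 8 downstream tasks, a minimum of 0.5 and a maximum of 1.0, this model releases 4 tasks once 75% of the source tasks have completed.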
Step 4 — Read TezTask.execute()
This method submits the DAG and waits for completion.
| # | Question |
|---|---|
| 1 | Does TezTask create a new TezClient per query, or reuse one per session? |
| 2 | How does TezTask wait for DAG completion? Which Tez API does it poll? |
| 3 | When a Hive query fails, what information does TezTask extract from the DAGStatus to show the user? |
| 4 | TezTask updates Hive counters from Tez counters. What is the counter group mapping? |
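The submit-then-wait pattern behind questions 2 and 3 can be sketched generically as a polling loop. `ToyDagClient`, `get_status()`, and the state names below are hypothetical placeholders rather than the Tez API; the exercise is to find which real classes and methods play these roles.

```python
import time

# Illustrative terminal states; check the real state enum in the Tez source.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "KILLED", "ERROR"}


class ToyDagClient:
    """Hypothetical client that replays a scripted sequence of states."""

    def __init__(self, states):
        self._states = list(states)

    def get_status(self):
        # Hand out scripted states in order; repeat the last one forever.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


def wait_for_completion(client, poll_interval=0.0, max_polls=100):
    """Poll until a terminal state appears; return it with the states seen."""
    seen = []
    for _ in range(max_polls):
        state = client.get_status()
        seen.append(state)
        if state in TERMINAL_STATES:
            return state, seen
        time.sleep(poll_interval)
    raise TimeoutError(f"no terminal state after {max_polls} polls")
```

A real monitor would also report progress and collect diagnostics on failure, which is what questions 3 and 4 ask you to locate.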
Step 5 — Tez Counterpart: ShuffleVertexManager
Open ShuffleVertexManager.java in your Tez source. Cross-reference with
what you learned from DagUtils.java:
- The minFraction/maxFraction payload you found in Step 3 is parsed by which method in ShuffleVertexManager?
- When Hive enables auto-parallelism, what happens inside ShuffleVertexManager that does NOT happen when it is disabled?
- Where does ShuffleVertexManager call context.reconfigureVertex()? What does reconfigureVertex do to the number of reducer tasks?
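One way to reason about the last question is a simplified sizing model: extrapolate total shuffle output from the source tasks that have finished, divide by a target input size per reducer, and only ever shrink the reducer count. This is an assumption-laden sketch, not the logic in ShuffleVertexManager; the function and parameter names are hypothetical, and the real implementation has more conditions.

```python
import math


def desired_parallelism(current_tasks, observed_output_bytes,
                        completed_source_fraction, bytes_per_task, min_tasks=1):
    """Estimate a reducer count from partial source output (simplified model)."""
    if completed_source_fraction <= 0:
        # Nothing observed yet, so keep the planned parallelism.
        return current_tasks
    # Extrapolate the total output from the completed share of source tasks.
    expected_total = observed_output_bytes / completed_source_fraction
    desired = math.ceil(expected_total / bytes_per_task)
    desired = max(desired, min_tasks)
    # This model only reduces parallelism; it never adds reducers.
    return min(desired, current_tasks)
```

With 100 planned reducers, 500 bytes observed at 50% source completion, and a 100-byte target per reducer, the model suggests 10 reducers. Compare that behavior with what you find around the reconfigureVertex call.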
Step 6 — End-to-End Mental Model
Draw (on paper or in a text diagram) the full path for:
SELECT dept, COUNT(*) FROM employees GROUP BY dept
Show:
- Hive logical plan nodes
- TezWork graph (label each BaseWork)
- Tez DAG (label each vertex, edge type, VertexManagerPlugin)
- Which Tez APIs TezTask calls
Step 7 — JIRA Research: Hive/Tez Compatibility
Search the Apache JIRA (issues.apache.org/jira) with this JQL query:
project = TEZ AND text ~ "hive" AND resolution = Fixed ORDER BY updated DESC
Find one issue where a Tez change broke Hive or where a Hive bug exposed a Tez issue.
- What was the incompatibility?
- Was the fix in Tez or Hive (or both)?
- Did the patch include a test? If so, where?
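If you prefer to run the Step 7 search from a script, the sketch below builds a request URL for the same JQL. It assumes the Apache JIRA instance exposes the standard Jira REST search endpoint at the path shown; confirm that before relying on it. The helper name `build_search_url` is hypothetical, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: the standard Jira REST v2 search path on the Apache JIRA host.
JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"
JQL = 'project = TEZ AND text ~ "hive" AND resolution = Fixed ORDER BY updated DESC'


def build_search_url(jql, max_results=20):
    """Return a search URL with the JQL safely percent-encoded."""
    query = urlencode({
        "jql": jql,
        "maxResults": max_results,
        "fields": "summary,updated",
    })
    return f"{JIRA_SEARCH}?{query}"


if __name__ == "__main__":
    print(build_search_url(JQL))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching issues as JSON if the endpoint assumption holds.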