Lab 6.1 — Trace a Hive SQL Query to the Generated Tez DAG

Lab type: Read & Research
Estimated time: 120 min
Key classes: DagUtils, TezWork, TezTask (all in Hive)

Overview

When you run SELECT a, COUNT(*) FROM t GROUP BY a on a Hive-on-Tez cluster, Hive builds a TezWork object (a description of what the DAG should look like) and hands it to DagUtils.createDag(). That method creates the actual Tez DAG: its vertices, its edges, and the VertexManagerPluginDescriptor attached to each vertex.

In this lab you will trace this path end-to-end.
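Before reading the real classes, it can help to hold a toy version of the idea in mind. The sketch below is only a mental model, not Hive code: WorkGraph and translate are hypothetical names, the node and edge labels are made up, and the example plan is deliberately a two-scan join rather than the GROUP BY query from this lab. It shows a plan description being walked parents-first, with one vertex emitted per work node and one edge per connection.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class WorkGraph:
    """Hypothetical stand-in for a plan description (not Hive's TezWork)."""
    nodes: list = field(default_factory=list)
    # (parent, child) -> opaque edge configuration chosen by the planner
    edges: dict = field(default_factory=dict)

    def connect(self, parent, child, label):
        for name in (parent, child):
            if name not in self.nodes:
                self.nodes.append(name)
        self.edges[(parent, child)] = label


def translate(work):
    """Walk the description parents-first and emit vertices and edges."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    preds = {name: [] for name in work.nodes}
    for parent, child in work.edges:
        preds[child].append(parent)

    vertices, dag_edges = [], []
    for name in TopologicalSorter(preds).static_order():
        vertices.append(name)  # one vertex per work node
        for (parent, child), label in work.edges.items():
            if child == name:  # wire up incoming edges once the child exists
                dag_edges.append((parent, child, label))
    return vertices, dag_edges


work = WorkGraph()
work.connect("scan-left", "join", "edge-config-1")
work.connect("scan-right", "join", "edge-config-2")
vertices, dag_edges = translate(work)
print(vertices)
print(dag_edges)
```

As you read the Hive source, check which parts of this picture hold: where the edge configuration really lives, and in what order the real code visits the work nodes.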

Step 1 — Check Out Hive Source (Optional)

To get a local copy of the Hive source (a shallow clone is enough for reading), then locate the three classes:

git clone https://github.com/apache/hive.git ~/hive-src --depth=1
find ~/hive-src -name "DagUtils.java" | head -3
find ~/hive-src -name "TezWork.java" | head -3
find ~/hive-src -name "TezTask.java" | head -3

If you prefer not to clone, you can read these classes on GitHub in the apache/hive repository at the following paths:

ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
ql/src/java/org/apache/hadoop/hive/ql/plan/TezWork.java
ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java

Step 2 — Read `TezWork.java`

TezWork is a directed graph of BaseWork nodes. Answer:

#	Question
1	What are the two main subclasses of `BaseWork` that represent map and reduce phases?
2	How does `TezWork` represent edges between vertices? What class holds edge configuration?
3	Where does `TezWork` store the `VertexManagerPluginDescriptor`?
4	How many `BaseWork` nodes does a simple `GROUP BY` query produce? Draw the graph.

Step 3 — Read `DagUtils.createDag()`

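This is the core translation step. It iterates over the BaseWork nodes in TezWork and calls createVertex() for each node and createEdge() for each connection. If your Hive version has no DagUtils.createDag(), look for this loop in TezTask.build(), which calls DagUtils.createVertex() and DagUtils.createEdge().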

#	Question
1	What Tez `EdgeProperty.DataMovementType` does Hive use for a reduce shuffle? Where is this set?
2	What `VertexManagerPlugin` does Hive attach to reduce vertices? Is this set unconditionally or based on a configuration flag?
3	What is `auto-parallelism` in this context? How does Hive enable it?
4	What `UserPayload` does Hive pass to `ShuffleVertexManager`? Specifically: what are the values of `minFraction` and `maxFraction`?

Step 4 — Read `TezTask.execute()`

This method submits the DAG and waits for completion.

#	Question
1	Does `TezTask` create a new `TezClient` per query, or reuse one per session?
2	How does `TezTask` wait for DAG completion? Which Tez API does it poll?
3	When a Hive query fails, what information does `TezTask` extract from the `DAGStatus` to show the user?
4	`TezTask` updates Hive counters from Tez counters. What is the counter group mapping?
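The submit-then-wait pattern that this step asks about can be sketched generically. The code below is a toy model under stated assumptions: wait_for_completion, the status dictionaries, the state names, and the diagnostic message are all made up for illustration and are not the Tez or Hive API. It polls a status source until a terminal state appears, then returns the final status so the caller can surface diagnostics.

```python
import time

# Hypothetical state names for this toy model only.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "KILLED"}


def wait_for_completion(get_status, poll_interval_s=0.0, max_polls=100):
    """Call get_status() repeatedly until it reports a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        if status["state"] in TERMINAL_STATES:
            return status
        time.sleep(poll_interval_s)  # back off between polls
    raise TimeoutError("no terminal state after %d polls" % max_polls)


# Fake status source: RUNNING twice, then a failure with diagnostics.
_replies = iter([
    {"state": "RUNNING", "diagnostics": []},
    {"state": "RUNNING", "diagnostics": []},
    {"state": "FAILED", "diagnostics": ["made-up diagnostic message"]},
])
final = wait_for_completion(lambda: next(_replies))
print(final["state"], final["diagnostics"])
```

When you read TezTask, compare this with the real code: which object plays the role of get_status, how long it sleeps between polls, and what it reads from the final status on failure.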

Step 5 — Tez Counterpart: `ShuffleVertexManager`

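Open ShuffleVertexManager.java in your Tez source (in recent Tez versions, much of the logic lives in its parent class, ShuffleVertexManagerBase, so open that as well). Cross-reference with what you learned from DagUtils.java: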

Which method in ShuffleVertexManager parses the minFraction/maxFraction payload you found in Step 3?
When Hive enables auto-parallelism, what happens inside ShuffleVertexManager that does NOT happen when it is disabled?
Where does ShuffleVertexManager call context.reconfigureVertex()? What does reconfigureVertex do to the number of reducer tasks?
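To have something concrete to compare the source against, here is a deliberately simplified model of two ideas this step touches: releasing reducers gradually between a minimum and maximum fraction of completed source tasks, and shrinking the reducer count from an extrapolated input size. The function names, parameters, and formulas are assumptions made for illustration. Treat every line as a hypothesis to confirm or refute in ShuffleVertexManager, not as a description of what Tez does.

```python
import math


def reducers_to_schedule(completed_fraction, min_fraction, max_fraction, pending):
    """Toy slow-start: none below min_fraction, all at or above max_fraction,
    and a linear ramp in between."""
    if completed_fraction < min_fraction:
        return 0
    if completed_fraction >= max_fraction:
        return pending
    span = max_fraction - min_fraction
    return int((completed_fraction - min_fraction) / span * pending)


def shrink_parallelism(current, observed_bytes, completed_fraction,
                       desired_bytes_per_task, min_parallelism=1):
    """Toy auto-parallelism: extrapolate total input, then pick a reducer
    count that gives each task roughly desired_bytes_per_task."""
    if completed_fraction <= 0:
        return current  # nothing observed yet, so keep the planned count
    expected_total = observed_bytes / completed_fraction  # naive extrapolation
    desired = max(min_parallelism,
                  math.ceil(expected_total / desired_bytes_per_task))
    return min(current, desired)  # this model only ever shrinks


print(reducers_to_schedule(0.5, 0.25, 0.75, pending=10))
print(shrink_parallelism(100, 1_000_000, 0.5, desired_bytes_per_task=500_000))
```

While reading the real class, note where it differs: how it estimates output size, what lower bound it applies, and how the new task count maps onto the original shuffle partitions.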

Step 6 — End-to-End Mental Model

Draw (on paper or in a text diagram) the full path for:

SELECT dept, COUNT(*) FROM employees GROUP BY dept

Show:

Hive logical plan nodes
TezWork graph (label each BaseWork)
Tez DAG (label each vertex, edge type, VertexManagerPlugin)
Which Tez APIs TezTask calls

Step 7 — JIRA Research: Hive/Tez Compatibility

Search the Apache JIRA with this JQL query:

project = TEZ AND text ~ "hive" AND resolution = Fixed ORDER BY updated DESC

Find one issue where a Tez change broke Hive or where a Hive bug exposed a Tez issue.

What was the incompatibility?
Was the fix in Tez or Hive (or both)?
Did the patch include a test? If so, where?
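If you want to share or bookmark the search, the JQL can be turned into a URL. The sketch below assumes the Apache JIRA issue navigator accepts a jql query parameter at https://issues.apache.org/jira/issues/ (verify this in your browser). The helper names are hypothetical, and the code only builds and decodes the URL without making any network request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

JQL = 'project = TEZ AND text ~ "hive" AND resolution = Fixed ORDER BY updated DESC'

# Assumed base URL of the issue navigator; adjust if your JIRA differs.
BASE = "https://issues.apache.org/jira/issues/"


def jira_search_url(jql, base=BASE):
    """Percent-encode the JQL and attach it as the jql query parameter."""
    return base + "?" + urlencode({"jql": jql})


def jql_from_url(url):
    """Recover the JQL from a search URL, to check the encoding round-trips."""
    return parse_qs(urlsplit(url).query)["jql"][0]


url = jira_search_url(JQL)
print(url)
```

Decoding the URL again with jql_from_url(url) should give back the original query string, which is a quick check that the quotes and the ~ operator survived encoding.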
