Level 6: Hive/Tez Integration
Hive-on-Tez is the largest consumer of the Tez API. Understanding how Hive translates SQL into a Tez DAG — and what can go wrong — is essential for any contributor who wants to fix real production bugs.
What Hive does with Tez
Every Hive query that runs on Tez goes through this pipeline:
SQL → Hive AST → Operator tree → MapWork/ReduceWork units
→ TezWork → Tez DAG (vertices + edges + VertexManagerPlugins)
→ TezClient.submitDAG()
The translation layer lives in the hive-exec module, specifically in TezWork, DagUtils, and TezTask.
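To make the shape of that last translation step concrete, here is a minimal toy sketch in plain Java. It is not Hive or Tez code: `ToyPlan` and its methods are hypothetical names invented for illustration, and the rule that picks the data movement type is a simplifying assumption. Only the labels `SCATTER_GATHER` and `ONE_TO_ONE` are borrowed from Tez's `DataMovementType` enum, and the vertex names imitate what Hive prints in `EXPLAIN` output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: ToyPlan and its members are hypothetical, not Hive or Tez classes.
final class ToyPlan {
    // Work name -> kind ("map" or "reduce"), kept in insertion order.
    private final Map<String, String> works = new LinkedHashMap<>();
    // Each edge is a {from, to} pair of work names.
    private final List<String[]> edges = new ArrayList<>();

    ToyPlan addWork(String name, String kind) {
        works.put(name, kind);
        return this;
    }

    ToyPlan connect(String from, String to) {
        if (!works.containsKey(from) || !works.containsKey(to)) {
            throw new IllegalArgumentException("unknown work: " + from + " -> " + to);
        }
        edges.add(new String[] {from, to});
        return this;
    }

    // Render the plan as DAG lines: one per vertex, then one per edge.
    List<String> toDagLines() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> w : works.entrySet()) {
            lines.add("vertex " + w.getKey() + " (" + w.getValue() + ")");
        }
        for (String[] e : edges) {
            // Assumption for this sketch: anything feeding a reduce work is shuffled.
            String movement = "reduce".equals(works.get(e[1])) ? "SCATTER_GATHER" : "ONE_TO_ONE";
            lines.add("edge " + e[0] + " -> " + e[1] + " [" + movement + "]");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Shape of a simple GROUP BY: one scan stage feeding one aggregation stage.
        ToyPlan plan = new ToyPlan()
            .addWork("Map 1", "map")
            .addWork("Reducer 2", "reduce")
            .connect("Map 1", "Reducer 2");
        plan.toDagLines().forEach(System.out::println);
    }
}
```

The real translation does far more than this (input/output descriptors, resource localization, vertex manager setup), so treat the sketch only as a mental model of "units of work become vertices, dependencies become typed edges" before reading the actual sources in Lab 6.1.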
Why Tez contributors must understand Hive
- Most real Tez bugs are first reported from Hive (a slow query, a failing shuffle, a counter discrepancy)
- ShuffleVertexManager was built specifically for the Hive reduce pattern
- Hive adds many VertexManagerEvent payloads that Tez must handle correctly
- Compatibility issues between Hive versions and Tez versions are common release blockers
What this level covers
| Topic | Lab |
|---|---|
| Trace a Hive SQL query to the generated Tez DAG | Lab 6.1 |
| Read DagUtils and understand vertex/edge configuration | Lab 6.1 |
| Debug a failing Hive-on-Tez query (task diagnostics, AM logs) | Lab 6.2 |
| Fix a Hive-Tez compatibility issue via a Tez patch | Lab 6.2 |
Prerequisites
- Level 5 complete (you can submit and debug a Tez DAG)
- Optional but helpful: basic SQL knowledge
- Optional: Hive source checked out alongside Tez
Key classes
| Class | Where | What it does |
|---|---|---|
| TezWork | hive-exec | Container for all Tez DAG specifications |
| DagUtils | hive-exec | Builds Tez DAG from TezWork |
| TezTask | hive-exec | Executes a TezWork via TezClient |
| ShuffleVertexManager | tez-runtime-library | Manages reduce-vertex scheduling |
| OrderedPartitionedKVOutput | tez-runtime-library | Default Hive reduce output |