Level 6: Hive/Tez Integration

Hive-on-Tez is the largest consumer of the Tez API. Understanding how Hive translates SQL into a Tez DAG — and what can go wrong — is essential for any contributor who wants to fix real production bugs.

What Hive does with Tez

Every Hive query that runs on Tez goes through this pipeline:

SQL → Hive AST → Operator tree → MapWork/ReduceWork units
   → TezWork → Tez DAG (vertices + edges + VertexManagerPlugins)
   → TezClient.submitDAG()

The translation layer lives in the hive-exec module, specifically in the TezWork, DagUtils, and TezTask classes.
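
The middle of the pipeline above (work units becoming vertices and edges) can be pictured with a small, self-contained toy model. This is only a sketch: Work, WorkGraph, toDagDescription, and sampleGroupBy are hypothetical names invented for illustration, not the real hive-exec or Tez classes, and the real translation also attaches processors, edge properties, and vertex managers. The vertex names follow the "Map 1" / "Reducer 2" style that Hive shows in EXPLAIN output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TezWorkSketch {
    // Hypothetical stand-in for one Hive work unit (map-side or reduce-side).
    static final class Work {
        final String name;
        final boolean isReduce;
        Work(String name, boolean isReduce) {
            this.name = name;
            this.isReduce = isReduce;
        }
    }

    // Hypothetical stand-in for the work graph: units plus parent -> child links.
    static final class WorkGraph {
        final List<Work> works = new ArrayList<>();
        final Map<String, List<String>> children = new LinkedHashMap<>();

        Work add(Work w) {
            works.add(w);
            children.put(w.name, new ArrayList<>());
            return w;
        }

        void connect(Work parent, Work child) {
            children.get(parent.name).add(child.name);
        }
    }

    // Emit one "vertex" line per work unit, then one "edge" line per link.
    static List<String> toDagDescription(WorkGraph g) {
        List<String> out = new ArrayList<>();
        for (Work w : g.works) {
            out.add("vertex " + w.name + (w.isReduce ? " [reduce]" : " [map]"));
        }
        for (Work w : g.works) {
            for (String child : g.children.get(w.name)) {
                out.add("edge " + w.name + " -> " + child);
            }
        }
        return out;
    }

    // Shape of a simple GROUP BY query: one map stage feeding one reduce stage.
    static WorkGraph sampleGroupBy() {
        WorkGraph g = new WorkGraph();
        Work map = g.add(new Work("Map 1", false));
        Work reducer = g.add(new Work("Reducer 2", true));
        g.connect(map, reducer);
        return g;
    }

    public static void main(String[] args) {
        for (String line : toDagDescription(sampleGroupBy())) {
            System.out.println(line);
        }
    }
}
```

Running main lists two vertices followed by the single edge between them. When reading DagUtils, the things to look for are the analogous steps: where a vertex is created per work unit and where an edge is created per connection.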

Why Tez contributors must understand Hive

  • Most real Tez bugs are first reported from Hive (a slow query, a failing shuffle, a counter discrepancy)
  • ShuffleVertexManager was built specifically for the Hive reduce pattern
  • Hive adds many VertexManagerEvent payloads that Tez must handle correctly
  • Compatibility issues between Hive versions and Tez versions are common release blockers

What this level covers

Topic | Lab
Trace a Hive SQL query to the generated Tez DAG | Lab 6.1
Read DagUtils and understand vertex/edge configuration | Lab 6.1
Debug a failing Hive-on-Tez query (task diagnostics, AM logs) | Lab 6.2
Fix a Hive-Tez compatibility issue via a Tez patch | Lab 6.2

Prerequisites

  • Level 5 complete (you can submit and debug a Tez DAG)
  • Optional but helpful: basic SQL knowledge
  • Optional: Hive source checked out alongside Tez

Key classes

Class | Where | What it does
TezWork | hive-exec | Container for all Tez DAG specifications
DagUtils | hive-exec | Builds Tez DAG from TezWork
TezTask | hive-exec | Executes a TezWork via TezClient
ShuffleVertexManager | tez-runtime-library | Manages reduce-vertex scheduling
OrderedPartitionedKVOutput | tez-runtime-library | Sorted, partitioned output that feeds Hive reduce vertices
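
To make "manages reduce-vertex scheduling" more concrete, here is a simplified model of the slow-start idea behind ShuffleVertexManager: reduce tasks are released gradually as the fraction of finished source tasks moves between a minimum and a maximum threshold (configured through tez.shuffle-vertex-manager.min-src-fraction and tez.shuffle-vertex-manager.max-src-fraction). The class and method names below are hypothetical, and the linear interpolation is an assumption made for illustration; it is not a copy of the actual Tez implementation, which also tracks already-scheduled tasks and handles auto-parallelism.

```java
public class SlowStartSketch {
    /**
     * Returns how many reduce tasks may be running, in total, once
     * completedSourceTasks of totalSourceTasks have finished.
     * Simplified model: nothing below minFraction, everything at or above
     * maxFraction, linear ramp in between.
     */
    static int tasksToSchedule(int completedSourceTasks, int totalSourceTasks,
                               int totalReduceTasks,
                               double minFraction, double maxFraction) {
        if (totalSourceTasks == 0) {
            // No upstream work to wait for: release everything.
            return totalReduceTasks;
        }
        double completed = (double) completedSourceTasks / totalSourceTasks;
        if (completed < minFraction) {
            return 0;
        }
        double range = maxFraction - minFraction;
        // With an empty range, reaching minFraction releases all tasks at once.
        double fraction = range > 0 ? (completed - minFraction) / range : 1.0;
        fraction = Math.max(0.0, Math.min(1.0, fraction));
        return (int) (fraction * totalReduceTasks);
    }

    public static void main(String[] args) {
        // 100 map tasks, 10 reducers, thresholds 0.25 and 0.75.
        for (int done : new int[] {10, 25, 50, 75, 100}) {
            System.out.println(done + " source tasks done -> "
                + tasksToSchedule(done, 100, 10, 0.25, 0.75) + " reducers");
        }
    }
}
```

A model like this is useful when reading AM logs for a slow query: if reducers start much later than expected, the thresholds and the completion rate of the source vertex are the first things to compare.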