Hive-on-Tez Labs
Hive on Tez is the production context that has carried Tez through the last decade. Most large Hive deployments that are not on Spark run on Tez. Understanding the Tez/Hive boundary is therefore not a niche skill; it is the core production debugging skill for both projects.
These labs work from a SQL query down through Hive compilation, into a Tez DAG, into
running tasks, and back out through failure attribution and remediation. They are
deliberately hands-on; every step has commands to run against ~/tez-src and
~/hive-src.
Prerequisites
| Tool | Required version | Why |
|---|---|---|
| Apache Tez | 0.10.x | Matches the rest of this book |
| Apache Hive | 3.x or 4.x | Production-relevant; Hive 2 is end of life |
| Hadoop | 3.3.x | Tez and Hive both target this |
| JDK | 11 (Hive 4) or 8 (Hive 3) | Per project requirements |
| Local clones | ~/tez-src, ~/hive-src | All commands assume these paths |
If you have only Hive 3 or only Hive 4, the labs still work; they call out the differences where they matter.

Class paths used throughout these labs (the integration boundary):
org.apache.hadoop.hive.ql.exec.tez.TezTask — Hive's "execute on Tez" task
org.apache.hadoop.hive.ql.exec.tez.DagUtils — Builds Tez DAG from Hive plan
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager — Pools Tez sessions
org.apache.hadoop.hive.ql.exec.tez.TezSessionState — One Hive session = one Tez AM
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource — Map-side record source
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource — Reduce-side record source
Verify these exist in your tree:
find ~/hive-src -path "*ql/exec/tez/TezTask.java"
find ~/hive-src -path "*ql/exec/tez/DagUtils.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionPoolManager.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionState.java"
find ~/hive-src -path "*ql/exec/tez/MapRecordSource.java"
find ~/hive-src -path "*ql/exec/tez/ReduceRecordSource.java"
If any are missing, your Hive tree may be too old. Hive 3.1.x and 4.0.x both have all six.
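The six checks above can also be run as a single loop that reports which class is missing. The following is a minimal sketch, assuming a POSIX shell; check_boundary_classes is a hypothetical helper name introduced here for illustration, and the default path is the ~/hive-src clone from the prerequisites.

```shell
# Hypothetical helper: report which of the six boundary classes exist
# under a Hive source tree (defaults to ~/hive-src).
check_boundary_classes() {
  root="${1:-$HOME/hive-src}"
  missing=0
  for cls in TezTask DagUtils TezSessionPoolManager TezSessionState \
             MapRecordSource ReduceRecordSource; do
    # Same -path pattern as the individual find commands; keep the first hit.
    hit="$(find "$root" -path "*ql/exec/tez/$cls.java" 2>/dev/null | head -n 1)"
    if [ -n "$hit" ]; then
      echo "ok       $cls  $hit"
    else
      echo "MISSING  $cls"
      missing=$((missing + 1))
    fi
  done
  # Non-zero exit status when anything is missing, so scripts can gate on it.
  [ "$missing" -eq 0 ]
}
```

Calling check_boundary_classes with no argument checks ~/hive-src; passing a path checks another tree, which helps when comparing a Hive 3 and a Hive 4 checkout.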
The Tez/Hive Boundary, At a Glance
The boundary is one Hive class — TezTask — and a handful of supporting utilities. Above
the boundary, Hive owns: SQL parsing, semantic analysis, logical plan, physical plan
(MapWork/ReduceWork). Below the boundary, Tez owns: DAG execution, task scheduling,
shuffle, recovery.
flowchart TD
subgraph Hive
A[SQL Query] --> B[Parser]
B --> C[Semantic Analyzer]
C --> D[Logical Plan]
D --> E[Physical Plan<br/>MapWork / ReduceWork]
E --> F[TezTask.execute]
F --> G[DagUtils.createVertex<br/>DagUtils.createEdge]
G --> H[DAG object]
end
subgraph Tez
H --> I[TezClient.submitDAG]
I --> J[DAGAppMaster<br/>tez-dag]
J --> K[Vertex tasks<br/>tez-runtime-internals]
K --> L[Shuffle I/O<br/>tez-runtime-library]
end
That TezTask → DagUtils → DAG → submitDAG sequence is the entire integration
surface. The six labs below walk it from the top (Lab H1) to the runtime (Lab H6).
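One quick way to see that sequence in your own checkout is to list where TezTask references the DAG-building and submission calls. The sketch below assumes a POSIX shell; boundary_calls is a hypothetical helper name, and which of the three identifiers actually appear in TezTask.java (and on which lines) depends on your Hive version.

```shell
# Hypothetical helper: print line-numbered references to the DAG-building
# and submission calls inside TezTask.java under a Hive source tree.
boundary_calls() {
  root="${1:-$HOME/hive-src}"
  task="$(find "$root" -path "*ql/exec/tez/TezTask.java" 2>/dev/null | head -n 1)"
  if [ -z "$task" ]; then
    echo "TezTask.java not found under $root" >&2
    return 1
  fi
  # Identifiers taken from the diagram above; grep exits non-zero if none match.
  grep -n -E 'createVertex|createEdge|submitDAG' "$task"
}
```

Run boundary_calls with no argument against ~/hive-src. The line numbers it prints are useful starting points for breakpoints in Lab H1.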
Lab Index
| Lab | Goal | Output artifact |
|---|---|---|
| H1: SQL → DAG | Trace a SELECT...GROUP BY...ORDER BY from SQL to a labelled Tez DAG | DAG diagram |
| H2: Inspect DAG | Capture and inspect the DAG Hive submits | EXPLAIN output + .dot file |
| H3: Debug a query | Walk from a "Vertex failed" message to the actual exception | Failure narrative |
| H4: Bug attribution | Use stack-trace top frame to attribute to Hive, Tez runtime, Tez AM, or YARN | Decision tree applied |
| H5: Reproducing bugs | Build a minimum reproducer for a Hive-on-Tez bug | Repro tarball |
| H6: Diagnostics | Write a small diagnostic patch (log, counter, config) and attach to JIRA | Patch + JIRA |
Reading Order
H1 and H2 are foundational — do them in order. H3 and H4 are debugging skills that build on each other. H5 and H6 are the contributor-facing skills you need to file a useful Hive-on-Tez JIRA from a production observation.
If you are coming to this section from the Capstone, H4 and H5 are the most directly relevant.
Where the Real Work Happens
The Tez/Hive boundary is one of the most-asked-about areas on both project mailing lists. The labs are written so that, when you encounter a production issue, you can:
- Read the stack trace and attribute it (H4).
- Locate the SQL that produced the DAG (H1).
- Capture the DAG and find the relevant vertex (H2).
- Identify the failing task and its log (H3).
- Reproduce it minimally on MiniTezCluster (H5).
- Attach a diagnostic patch to a JIRA to get more data from the reporter (H6).
That six-step routine, executed crisply, is what gets Hive-on-Tez JIRAs resolved.
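The first step of that routine, attribution by top stack frame, can be roughed out in a few lines of shell. This is only a first-pass heuristic, not the decision tree from Lab H4: the function names top_frame and attribute_frame are hypothetical, and the package-prefix-to-owner mapping is an assumption made for illustration.

```shell
# Hypothetical helpers for a first-pass attribution of a Java stack trace.

# Print the fully qualified method from the first "at pkg.Class.method(...)"
# line read on stdin.
top_frame() {
  sed -n 's/^[[:space:]]*at \([^(]*\)(.*/\1/p' | head -n 1
}

# Map a frame to a likely owner by package prefix.
# ASSUMPTION: this prefix mapping is illustrative, not authoritative.
attribute_frame() {
  case "$1" in
    org.apache.hadoop.hive.*) echo "Hive" ;;
    org.apache.tez.dag.*)     echo "Tez AM" ;;
    org.apache.tez.runtime.*) echo "Tez runtime" ;;
    org.apache.hadoop.yarn.*) echo "YARN" ;;
    *)                        echo "unknown" ;;
  esac
}
```

A typical use is to pipe a saved trace through top_frame and pass the result to attribute_frame. An "unknown" result usually means the top frame is JDK or third-party code, in which case the next frames down need a manual look.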
Validation for the Section
You have absorbed the Hive-on-Tez section when, given a freshly-failing query in a production Hive-on-Tez deployment, you can:
- Within 10 minutes, identify which project owns the failure (Hive / Tez / YARN).
- Within 30 minutes, locate the relevant code on both sides of the boundary.
- Within 1 hour, capture the DAG and the failing task's log.
- Within a day, produce a minimum reproducer on MiniTezCluster.
- Within a week, file a JIRA on the right project with all the data needed.
That is the standard a Hive-on-Tez committer holds themselves to. The labs build the muscle.