Hive-on-Tez Labs

Hive on Tez is the production context that has carried Tez through the last decade. Nearly every large Hive deployment that is not on Spark runs on Tez. Understanding the Tez/Hive boundary is therefore not a niche skill: it is the core production debugging skill for both projects.

These labs work from a SQL query down through Hive compilation, into a Tez DAG, into running tasks, and back out through failure attribution and remediation. They are deliberately hands-on; every step has commands to run against ~/tez-src and ~/hive-src.

Prerequisites

Tool	Required version	Why
Apache Tez	0.10.x	Matches the rest of this book
Apache Hive	3.x or 4.x	Production-relevant; Hive 2 is end of life
Hadoop	3.3.x	Tez and Hive both target this
JDK	11 (Hive 4) or 8 (Hive 3)	Per project requirements
Local clones	`~/tez-src`, `~/hive-src`	All commands assume these paths

If you have only Hive 3 or only Hive 4, the labs work either way; they call out the differences where they matter.

Fully qualified class names used throughout these labs (the integration boundary):

org.apache.hadoop.hive.ql.exec.tez.TezTask                  — Hive's "execute on Tez" task
org.apache.hadoop.hive.ql.exec.tez.DagUtils                 — Builds Tez DAG from Hive plan
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager    — Pools Tez sessions
org.apache.hadoop.hive.ql.exec.tez.TezSessionState          — One Hive session = one Tez AM
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource          — Map-side record source
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource       — Reduce-side record source

Verify these exist in your tree:

find ~/hive-src -path "*ql/exec/tez/TezTask.java"
find ~/hive-src -path "*ql/exec/tez/DagUtils.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionPoolManager.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionState.java"
find ~/hive-src -path "*ql/exec/tez/MapRecordSource.java"
find ~/hive-src -path "*ql/exec/tez/ReduceRecordSource.java"

If any are missing, your Hive tree may be too old. Hive 3.1.x and 4.0.x both have all six.
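The six `find` commands can be folded into one pass that reports each class as present or missing. This is a minimal sketch: `check_boundary_classes` is a hypothetical helper name, not something shipped with Hive or Tez, and it assumes only the `ql/exec/tez/` path layout used in the commands above.

```shell
# check_boundary_classes ROOT
# Hypothetical helper: prints one line per boundary class and returns the
# number of classes that were NOT found under ROOT (0 means all six exist).
check_boundary_classes() {
  cbc_root="$1"
  cbc_missing=0
  for cbc_cls in TezTask DagUtils TezSessionPoolManager TezSessionState \
                 MapRecordSource ReduceRecordSource; do
    # Same -path pattern as the individual find commands above.
    cbc_hit=$(find "$cbc_root" -path "*ql/exec/tez/${cbc_cls}.java" 2>/dev/null | head -n 1)
    if [ -n "$cbc_hit" ]; then
      echo "ok       ${cbc_cls}"
    else
      echo "MISSING  ${cbc_cls}"
      cbc_missing=$((cbc_missing + 1))
    fi
  done
  return "$cbc_missing"
}

# Usage (after sourcing this snippet): check_boundary_classes ~/hive-src
```

A non-zero return status makes the check easy to use as a guard at the top of a lab script.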

The Tez/Hive Boundary, At a Glance

The boundary is one Hive class — TezTask — and a handful of supporting utilities. Above the boundary, Hive owns: SQL parsing, semantic analysis, logical plan, physical plan (MapWork/ReduceWork). Below the boundary, Tez owns: DAG execution, task scheduling, shuffle, recovery.

flowchart TD
  subgraph Hive
    A[SQL Query] --> B[Parser]
    B --> C[Semantic Analyzer]
    C --> D[Logical Plan]
    D --> E[Physical Plan<br/>MapWork / ReduceWork]
    E --> F[TezTask.execute]
    F --> G[DagUtils.createVertex<br/>DagUtils.createEdge]
    G --> H[DAG object]
  end
  subgraph Tez
    H --> I[TezClient.submitDAG]
    I --> J[DAGAppMaster<br/>tez-dag]
    J --> K[Vertex tasks<br/>tez-runtime-internals]
    K --> L[Shuffle I/O<br/>tez-runtime-library]
  end

That TezTask → DagUtils → DAG → submitDAG sequence is the entire integration surface. The six labs below walk it from the top (Lab H1) to the runtime (Lab H6).
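One way to see that sequence in your own tree is to list where the identifiers from the diagram appear in the two Hive classes at the boundary. The sketch below assumes only the file locations verified earlier. `boundary_callsites` is a hypothetical helper name, and the exact matches depend on your Hive version, so it prints whatever it finds rather than expecting particular lines.

```shell
# boundary_callsites ROOT
# Hypothetical helper: for TezTask.java and DagUtils.java under ROOT, print
# every line (with its line number) that mentions createVertex, createEdge,
# or submitDAG. Returns 1 if either file cannot be located.
boundary_callsites() {
  bc_root="$1"
  for bc_cls in TezTask DagUtils; do
    bc_file=$(find "$bc_root" -path "*ql/exec/tez/${bc_cls}.java" 2>/dev/null | head -n 1)
    if [ -z "$bc_file" ]; then
      echo "${bc_cls}.java not found under ${bc_root}" >&2
      return 1
    fi
    echo "== ${bc_file}"
    grep -n -E 'createVertex|createEdge|submitDAG' "$bc_file" || echo "(no matches)"
  done
}

# Usage (after sourcing this snippet): boundary_callsites ~/hive-src
```

Reading the hits in TezTask.java first and DagUtils.java second follows the same top-down order as the diagram.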

Lab Index

Lab	Goal	Output artifact
H1: SQL → DAG	Trace a `SELECT...GROUP BY...ORDER BY` from SQL to a labelled Tez DAG	DAG diagram
H2: Inspect DAG	Capture and inspect the DAG Hive submits	EXPLAIN output + `.dot` file
H3: Debug a query	Walk from a "Vertex failed" message to the actual exception	Failure narrative
H4: Bug attribution	Use stack-trace top frame to attribute to Hive, Tez runtime, Tez AM, or YARN	Decision tree applied
H5: Reproducing bugs	Build a minimum reproducer for a Hive-on-Tez bug	Repro tarball
H6: Diagnostics	Write a small diagnostic patch (log, counter, config) and attach to JIRA	Patch + JIRA

Reading Order

H1 and H2 are foundational — do them in order. H3 and H4 are debugging skills that build on each other. H5 and H6 are the contributor-facing skills you need to file a useful Hive-on-Tez JIRA from a production observation.

If you are coming to this section from the Capstone, H4 and H5 are the most directly relevant.

Where the Real Work Happens

The Tez/Hive boundary is one of the most-asked-about areas on both project mailing lists. The labs are written so that, when you encounter a production issue, you can:

Read the stack trace and attribute it (H4).
Locate the SQL that produced the DAG (H1).
Capture the DAG and find the relevant vertex (H2).
Identify the failing task and its log (H3).
Reproduce it minimally on MiniTezCluster (H5).
Attach a diagnostic patch to a JIRA to get more data from the reporter (H6).

That six-step routine, executed crisply, is what gets Hive-on-Tez JIRAs resolved.
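The first step of the routine, attributing a failure from its stack trace, can be roughed out as a package-prefix lookup on the top frame. This is only an illustrative sketch. Both helper names are hypothetical, and the prefix-to-owner mapping is an assumption based on the module names in the diagram (`tez-dag`, `tez-runtime-internals`, `tez-runtime-library`). The decision tree in Lab H4 is the authority.

```shell
# top_frame: read a Java stack trace on stdin, print its first "at ..." line.
top_frame() {
  sed -n '/^[[:space:]]*at /{p;q;}'
}

# attribute_frame FRAME
# Hypothetical helper: map a stack frame's package prefix to a likely owner.
# The mapping below is an assumption for illustration, not an official rule.
attribute_frame() {
  case "$1" in
    *org.apache.hadoop.hive.*)   echo "Hive" ;;
    *org.apache.tez.runtime.*)   echo "Tez runtime" ;;
    *org.apache.tez.dag.app.*)   echo "Tez AM" ;;
    *org.apache.tez.*)           echo "Tez (other)" ;;
    *org.apache.hadoop.yarn.*)   echo "YARN" ;;
    *)                           echo "unknown" ;;
  esac
}

# Usage (after sourcing this snippet), with task.log as a placeholder file name:
#   attribute_frame "$(top_frame < task.log)"
```

A top frame in JDK or third-party code comes back as "unknown". In that case, walk down the trace until the first frame that belongs to one of the projects.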

Validation for the Section

You have absorbed the Hive-on-Tez section when, given a freshly-failing query in a production Hive-on-Tez deployment, you can:

Within 10 minutes, identify which project owns the failure (Hive / Tez / YARN).
Within 30 minutes, locate the relevant code on both sides of the boundary.
Within 1 hour, capture the DAG and the failing task's log.
Within a day, produce a minimum reproducer on MiniTezCluster.
Within a week, file a JIRA on the right project with all the data needed.

That is the standard a Hive-on-Tez committer holds themselves to. The labs build the muscle.
