Hive-on-Tez Labs

Hive on Tez is the production workload that has carried Tez through the last decade. Nearly every large Hive deployment that is not on Spark runs on Tez. Understanding the Tez/Hive boundary is therefore not a niche skill: it is the core production debugging skill for both projects.

These labs work from a SQL query down through Hive compilation, into a Tez DAG, into running tasks, and back out through failure attribution and remediation. They are deliberately hands-on; every step has commands to run against ~/tez-src and ~/hive-src.

Prerequisites

| Tool | Required version | Why |
| --- | --- | --- |
| Apache Tez | 0.10.x | Matches the rest of this book |
| Apache Hive | 3.x or 4.x | Production-relevant; Hive 2 is end of life |
| Hadoop | 3.3.x | Tez and Hive both target this |
| JDK | 11 (Hive 4) or 8 (Hive 3) | Per project requirements |
| Local clones | ~/tez-src, ~/hive-src | All commands assume these paths |

If you have only Hive 3 or only Hive 4, the labs work either way; they call out the differences where they matter.

Class paths used throughout these labs (the integration boundary):

org.apache.hadoop.hive.ql.exec.tez.TezTask                  — Hive's "execute on Tez" task
org.apache.hadoop.hive.ql.exec.tez.DagUtils                 — Builds Tez DAG from Hive plan
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager    — Pools Tez sessions
org.apache.hadoop.hive.ql.exec.tez.TezSessionState          — One Hive session = one Tez AM
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource          — Map-side record source
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource       — Reduce-side record source

Verify these exist in your tree:

find ~/hive-src -path "*ql/exec/tez/TezTask.java"
find ~/hive-src -path "*ql/exec/tez/DagUtils.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionPoolManager.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionState.java"
find ~/hive-src -path "*ql/exec/tez/MapRecordSource.java"
find ~/hive-src -path "*ql/exec/tez/ReduceRecordSource.java"

If any are missing, your Hive tree may be too old. Hive 3.1.x and 4.0.x both have all six.
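The six find commands above can be folded into one check that reports each class by name. This is a minimal sketch, not part of either project: the function name check_boundary_classes and its FOUND/MISSING output format are hypothetical, and it assumes the same ~/hive-src default path used throughout these labs.

```shell
# Report which of the six boundary classes exist under a Hive source tree.
# Usage: check_boundary_classes [hive-src-root]   (defaults to ~/hive-src)
# NOTE: the function name and output format are illustrative, not a Hive tool.
check_boundary_classes() {
  local root="${1:-$HOME/hive-src}"
  local missing=0
  local cls hit
  for cls in TezTask DagUtils TezSessionPoolManager TezSessionState \
             MapRecordSource ReduceRecordSource; do
    # Same -path pattern as the individual find commands; keep the first match.
    hit="$(find "$root" -path "*ql/exec/tez/${cls}.java" 2>/dev/null | head -n 1)"
    if [ -n "$hit" ]; then
      echo "FOUND   ${cls}  ${hit}"
    else
      echo "MISSING ${cls}"
      missing=$((missing + 1))
    fi
  done
  # Return non-zero when anything is missing so the check can gate a script.
  [ "$missing" -eq 0 ]
}
```

Running check_boundary_classes ~/hive-src prints one line per class; a non-zero exit status means at least one class was not found.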

The Tez/Hive Boundary, At a Glance

The boundary is one Hive class — TezTask — and a handful of supporting utilities. Above the boundary, Hive owns: SQL parsing, semantic analysis, logical plan, physical plan (MapWork/ReduceWork). Below the boundary, Tez owns: DAG execution, task scheduling, shuffle, recovery.

flowchart TD
  subgraph Hive
    A[SQL Query] --> B[Parser]
    B --> C[Semantic Analyzer]
    C --> D[Logical Plan]
    D --> E[Physical Plan<br/>MapWork / ReduceWork]
    E --> F[TezTask.execute]
    F --> G[DagUtils.createVertex<br/>DagUtils.createEdge]
    G --> H[DAG object]
  end
  subgraph Tez
    H --> I[TezClient.submitDAG]
    I --> J[DAGAppMaster<br/>tez-dag]
    J --> K[Vertex tasks<br/>tez-runtime-internals]
    K --> L[Shuffle I/O<br/>tez-runtime-library]
  end

That TezTask → DagUtils → DAG → submitDAG sequence is the entire integration surface. The six labs below walk it from the top (Lab H1) to the runtime (Lab H6).
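One way to see that surface in your own tree is to list the call sites the diagram names. The sketch below is an illustration under assumptions: show_boundary_calls is a hypothetical helper, and it assumes the method names createVertex, createEdge, and submitDAG appear literally in DagUtils.java and TezTask.java, which may differ between Hive versions.

```shell
# List the call sites that make up the Hive-to-Tez hand-off.
# Usage: show_boundary_calls [hive-src-root]   (defaults to ~/hive-src)
# NOTE: hypothetical helper; the grep patterns come from the diagram above.
show_boundary_calls() {
  local root="${1:-$HOME/hive-src}"
  local dag_utils tez_task
  dag_utils="$(find "$root" -path "*ql/exec/tez/DagUtils.java" 2>/dev/null | head -n 1)"
  tez_task="$(find "$root" -path "*ql/exec/tez/TezTask.java" 2>/dev/null | head -n 1)"
  if [ -z "$dag_utils" ] || [ -z "$tez_task" ]; then
    echo "DagUtils.java or TezTask.java not found under $root" >&2
    return 1
  fi
  # Hive side: where the physical plan becomes Tez vertices and edges.
  echo "== DAG construction in ${dag_utils##*/} =="
  grep -nE 'createVertex|createEdge' "$dag_utils" || echo "(no matches)"
  # Hand-off: where the finished DAG is submitted to Tez.
  echo "== DAG submission in ${tez_task##*/} =="
  grep -n 'submitDAG' "$tez_task" || echo "(no matches)"
}
```

The line numbers it prints are convenient entry points for Lab H1; a "(no matches)" line is a hint that your Hive version names these methods differently.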

Lab Index

| Lab | Goal | Output artifact |
| --- | --- | --- |
| H1: SQL → DAG | Trace a SELECT...GROUP BY...ORDER BY from SQL to a labelled Tez DAG | DAG diagram |
| H2: Inspect DAG | Capture and inspect the DAG Hive submits | EXPLAIN output + .dot file |
| H3: Debug a query | Walk from a "Vertex failed" message to the actual exception | Failure narrative |
| H4: Bug attribution | Use stack-trace top frame to attribute to Hive, Tez runtime, Tez AM, or YARN | Decision tree applied |
| H5: Reproducing bugs | Build a minimum reproducer for a Hive-on-Tez bug | Repro tarball |
| H6: Diagnostics | Write a small diagnostic patch (log, counter, config) and attach to JIRA | Patch + JIRA |
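As a preview of Lab H3, the first move from a "Vertex failed" message is usually to pull out the identifiers it carries. The sketch below assumes the diagnostic line has the commonly seen shape "Vertex failed, vertexName=..., vertexId=vertex_..., diagnostics=[Task failed, taskId=task_..."; check that against your own output before relying on it. The function name parse_vertex_failed is hypothetical.

```shell
# Pull the vertex name, vertex ID, first task ID, and YARN application ID
# out of the first "Vertex failed" line read from stdin.
# NOTE: hypothetical helper; the line format is an assumption (see lead-in).
parse_vertex_failed() {
  local line vertex_name vertex_id task_id app_id
  line="$(grep -m 1 'Vertex failed' || true)"
  if [ -z "$line" ]; then
    echo "no 'Vertex failed' line found" >&2
    return 1
  fi
  vertex_name="$(printf '%s\n' "$line" | grep -oE 'vertexName=[^,]+' | head -n 1 | cut -d= -f2)"
  vertex_id="$(printf '%s\n' "$line" | grep -oE 'vertex_[0-9]+_[0-9]+_[0-9]+_[0-9]+' | head -n 1)"
  task_id="$(printf '%s\n' "$line" | grep -oE 'task_[0-9]+_[0-9]+_[0-9]+_[0-9]+_[0-9]+' | head -n 1)"
  # The first two numeric fields of a vertex ID are the application's
  # cluster timestamp and sequence number.
  app_id="$(printf '%s\n' "$vertex_id" | sed -E 's/^vertex_([0-9]+_[0-9]+)_.*/application_\1/')"
  echo "vertexName=${vertex_name}"
  echo "vertexId=${vertex_id}"
  echo "taskId=${task_id}"
  echo "applicationId=${app_id}"
}
```

Pipe a saved console or HiveServer2 log into it, then pass the printed applicationId to yarn logs -applicationId to fetch the task logs that Lab H3 walks through.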

Reading Order

H1 and H2 are foundational — do them in order. H3 and H4 are debugging skills that build on each other. H5 and H6 are the contributor-facing skills you need to file a useful Hive-on-Tez JIRA from a production observation.

If you are coming to this section from the Capstone, H4 and H5 are the most directly relevant.

Where the Real Work Happens

The Tez/Hive boundary is one of the most-asked-about areas on both project mailing lists. The labs are written so that, when you encounter a production issue, you can:

  1. Read the stack trace and attribute it (H4).
  2. Locate the SQL that produced the DAG (H1).
  3. Capture the DAG and find the relevant vertex (H2).
  4. Identify the failing task and its log (H3).
  5. Reproduce it minimally on MiniTezCluster (H5).
  6. Attach a diagnostic patch to a JIRA to get more data from the reporter (H6).

That six-step routine, executed crisply, is what gets Hive-on-Tez JIRAs resolved.
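Step 1 of the routine, attribution from the top stack frame, can be approximated with a package-prefix lookup. This is a first-pass heuristic under stated assumptions, not the decision tree from Lab H4: the function name attribute_frame and the prefix-to-owner mapping are illustrative, chosen because DAGAppMaster lives under org.apache.tez.dag.app and the runtime modules live under org.apache.tez.runtime.

```shell
# Map a stack frame's package prefix to the likely owning component.
# NOTE: hypothetical helper; the prefix table is a heuristic, not an
# official rule from either project.
attribute_frame() {
  local frame
  # Strip the leading whitespace and "at " that Java stack traces print.
  frame="$(printf '%s' "$1" | sed -E 's/^[[:space:]]*(at[[:space:]]+)?//')"
  case "$frame" in
    org.apache.hadoop.hive.*)  echo "Hive" ;;
    org.apache.tez.dag.app.*)  echo "Tez AM" ;;
    org.apache.tez.runtime.*)  echo "Tez runtime" ;;
    org.apache.tez.*)          echo "Tez (other module)" ;;
    org.apache.hadoop.yarn.*)  echo "YARN" ;;
    *)                         echo "unknown" ;;
  esac
}
```

Feed it the first "at ..." line under the innermost "Caused by:" in the trace. An "unknown" result (for example a java.* or third-party frame) means you need to read further down the stack before assigning an owner.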

Validation for the Section

You have absorbed the Hive-on-Tez section when, given a freshly failing query in a production Hive-on-Tez deployment, you can:

  1. Within 10 minutes, identify which project owns the failure (Hive / Tez / YARN).
  2. Within 30 minutes, locate the relevant code on both sides of the boundary.
  3. Within 1 hour, capture the DAG and the failing task's log.
  4. Within a day, produce a minimum reproducer on MiniTezCluster.
  5. Within a week, file a JIRA on the right project with all the data needed.

That is the standard a Hive-on-Tez committer holds themselves to. The labs build the muscle.