Lab H2: Inspecting the Hive-Emitted DAG

Background

Lab H1 traced compilation to derive the DAG by reading code. In production you can't always re-derive it; you need to capture the DAG that Hive actually submitted to Tez. This lab covers four production-grade sources for that, presented as five methods because each EXPLAIN variant gets its own section:

EXPLAIN FORMATTED and EXPLAIN VECTORIZATION DETAIL from Hive.
TezTask logging at DEBUG level.
The Tez UI (backed by YARN ATS or Tez SimpleHistoryLoggingService).
The tez.am.dag.dot.file.location graphviz dump.

Plus the cross-cutting skill: mapping each Hive operator in the captured DAG to its Tez Input/Processor/Output (I/P/O).

Setup

-- Hive CLI or beeline. Use the same table from H1:
CREATE TABLE IF NOT EXISTS t (a INT, b STRING) STORED AS ORC;
INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');

Verify Tez is the execution engine:

SET hive.execution.engine;        -- should be 'tez'

If not:

SET hive.execution.engine=tez;

Method 1: `EXPLAIN FORMATTED`

EXPLAIN FORMATTED emits the plan as JSON, including per-operator details, which makes it suitable for programmatic parsing.

EXPLAIN FORMATTED
  SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

Snippet of the output (structure varies by Hive version):

{
  "STAGE DEPENDENCIES": {
    "Stage-1": {"ROOT STAGE": "TRUE"},
    "Stage-0": {"DEPENDENT STAGES": "Stage-1"}
  },
  "STAGE PLANS": {
    "Stage-1": {
      "Tez": {
        "DagId:": "...",
        "Edges:": {
          "Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
          "Reducer 3": [{"parent": "Reducer 2", "type": "SIMPLE_EDGE"}]
        },
        "Vertices:": {
          "Map 1": {
            "Map Operator Tree:": [...],
            "Execution mode:": "vectorized"
          },
          "Reducer 2": { ... },
          "Reducer 3": { ... }
        }
      }
    }
  }
}

Save it:

mkdir -p ~/tez-notes   # create the notes directory if it does not exist yet
hive -e "EXPLAIN FORMATTED SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;" \
  > ~/tez-notes/hive-h2-explain-formatted.json

What it exposes in machine-readable form (plain EXPLAIN prints much of this only as indented text):

Edge types between vertices (SIMPLE_EDGE, BROADCAST_EDGE, CUSTOM_SIMPLE_EDGE, CUSTOM_EDGE).
Execution mode per vertex (vectorized, llap, neither).
The full operator tree per vertex, including row-schema annotations.
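
Because the output is JSON, a short script can pull the vertex and edge lists out of it. The following is a minimal Python sketch, run against a trimmed sample modeled on the snippet above rather than on real output. The key names (including the trailing colons in "Edges:" and "Vertices:") are copied from that snippet and may differ in your Hive version; `extract_dag` is a hypothetical helper name.

```python
import json

# Trimmed sample modeled on the snippet above; real output carries more keys.
SAMPLE = """
{
  "STAGE PLANS": {
    "Stage-1": {
      "Tez": {
        "Edges:": {
          "Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
          "Reducer 3": [{"parent": "Reducer 2", "type": "SIMPLE_EDGE"}]
        },
        "Vertices:": {
          "Map 1": {"Execution mode:": "vectorized"},
          "Reducer 2": {},
          "Reducer 3": {}
        }
      }
    }
  }
}
"""


def extract_dag(explain_json):
    """Return (vertices, edges) for every Tez stage in an EXPLAIN FORMATTED dump.

    edges is a list of (parent, child, edge_type) tuples.
    """
    plan = json.loads(explain_json)
    vertices, edges = [], []
    for stage in plan.get("STAGE PLANS", {}).values():
        tez = stage.get("Tez")
        if not tez:
            continue  # stages without a Tez section (e.g. the fetch stage)
        vertices.extend(tez.get("Vertices:", {}).keys())
        for child, parents in tez.get("Edges:", {}).items():
            # Assumption: a single-parent entry may be a dict instead of a list.
            if isinstance(parents, dict):
                parents = [parents]
            for parent in parents:
                edges.append((parent["parent"], child, parent["type"]))
    return vertices, edges


vertices, edges = extract_dag(SAMPLE)
print(vertices)
for parent, child, edge_type in edges:
    print(f"{parent} -> {child} [{edge_type}]")
```

To run it on the saved file, pass the file's contents to `extract_dag`. If the CLI wrote log lines into the file alongside the JSON, strip them first.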

Method 2: `EXPLAIN VECTORIZATION DETAIL`

When a query runs slower than expected on Tez, vectorization is the first thing to check. EXPLAIN VECTORIZATION DETAIL shows per-operator whether vectorization succeeded and, if not, why.

EXPLAIN VECTORIZATION DETAIL
  SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

Look for per-vertex Execution mode: vectorized and per-operator Vectorized: true. If you see notVectorizedReason: <reason>, that's the diagnostic.

Common notVectorizedReason values:

Reason	Cause
`UDF X is not vectorized`	Hive lacks a vectorized impl of a UDF you used
`Reduce vectorization disabled`	`hive.vectorized.execution.reduce.enabled=false`
`MAP_JOIN with key types ...`	Vectorized map-join doesn't support the key type combo
`Column type X not supported`	Vectorization doesn't handle the column type (DECIMAL precision, etc.)

This explains a class of Hive-on-Tez perf surprises that are unrelated to Tez itself.
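
For long plans it can help to collect every notVectorizedReason per vertex instead of scanning by eye. The Python sketch below does that on a hypothetical excerpt. The real layout of EXPLAIN VECTORIZATION DETAIL differs across Hive versions, so the vertex-header pattern (`Map N` / `Reducer N` alone on a line) is an assumption, and `not_vectorized` is an illustrative name.

```python
import re

# Hypothetical excerpt; real output is much longer and version-dependent.
SAMPLE = """\
Map 1
    Execution mode: vectorized
Reducer 2
    notVectorizedReason: Reduce vectorization disabled
"""

# Assumption: each vertex starts with its name alone on a line.
VERTEX = re.compile(r"^\s*((?:Map|Reducer) \d+)\s*$")
REASON = re.compile(r"notVectorizedReason:\s*(.+)")


def not_vectorized(explain_text):
    """Map each vertex name to the notVectorizedReason strings found under it."""
    reasons = {}
    current = None  # vertex whose section we are currently reading
    for line in explain_text.splitlines():
        header = VERTEX.match(line)
        if header:
            current = header.group(1)
            continue
        reason = REASON.search(line)
        if reason:
            reasons.setdefault(current, []).append(reason.group(1).strip())
    return reasons


result = not_vectorized(SAMPLE)
print(result)
```

An empty result means no operator reported a reason in the text you fed in. It does not by itself prove the whole plan was vectorized.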

Method 3: `TezTask` Logging

Increase the log level on TezTask to capture the DAG it submitted:

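-- hive.root.logger is read when the Hive process starts, so pass it at launch
-- rather than with SET:  hive --hiveconf hive.root.logger=DEBUG,console
-- A more targeted option that can be set per session:
SET hive.log.explain.output=true;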

hive.log.explain.output=true writes the EXPLAIN EXTENDED output for each query to the Hive log at INFO level. This is useful in production, where you often can't rerun the query from a CLI but can grep the log.

grep -A100 "DAG description" /var/log/hive/hive-server2.log | head -200

For the most detail, set DEBUG specifically on the Tez integration:

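# In hive-log4j.properties (log4j 1.x syntax). These are logging settings, not hive-site.xml or SET properties.
# Hive 2.x and later read hive-log4j2.properties, where each logger is declared via logger.<id>.name and logger.<id>.level.
log4j.logger.org.apache.hadoop.hive.ql.exec.tez=DEBUG
log4j.logger.org.apache.tez.dag.api=DEBUG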

In DEBUG you see:

The serialised DAGPlan size at submit time.
Each Vertex's name, parallelism, processor descriptor class.
Each Edge's source, destination, data-source / data-movement / scheduling type.

Method 4: Tez UI

The Tez UI runs against YARN Timeline Service (ATS) or against the file-system SimpleHistoryLoggingService. When configured, every Tez DAG submitted by Hive (or anything else) is captured.

Capture is enabled via tez.history.logging.service.class:

grep "tez.history.logging.service.class" ~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Once a DAG runs, browse to:

http://<atstimeline-host>:8188/applicationhistory/

or for the standalone Tez UI:

http://<tez-ui-host>:9999/tez-ui/

Click into a DAG to see:

Per-vertex stats (tasks, attempts, succeeded, failed, killed).
Edges with type and statistics (BYTES_TRANSFERRED).
A graphical DAG view.
Per-task and per-attempt logs.

For an offline cluster, the file-system logger writes JSON files under tez.simple.history.logging.dir. They can be loaded into the Tez UI later.
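
When ATS is the backing store, the same DAG records that the UI renders can also be fetched over the Timeline Server's REST interface. The Python sketch below only builds the URL and summarizes a response. It assumes the entity type `TEZ_DAG_ID`, and that each entity carries its status under `otherinfo`. The sample payload is hypothetical, and `dag_list_url` and `summarize` are illustrative names, so check the shape against a real response from your cluster.

```python
import json
from urllib.parse import urlencode


def dag_list_url(ats_host, port=8188, limit=10):
    """Build the Timeline Server URL that lists recent Tez DAG entities."""
    query = urlencode({"limit": limit})
    return f"http://{ats_host}:{port}/ws/v1/timeline/TEZ_DAG_ID?{query}"


# Hypothetical response body, trimmed to the fields used below.
SAMPLE_RESPONSE = json.loads("""
{
  "entities": [
    {
      "entity": "dag_0000000000000_0001_1",
      "entitytype": "TEZ_DAG_ID",
      "otherinfo": {"status": "SUCCEEDED"}
    }
  ]
}
""")


def summarize(response):
    """Return (dag_id, status) pairs; status is None when it is not recorded."""
    return [
        (entity["entity"], entity.get("otherinfo", {}).get("status"))
        for entity in response.get("entities", [])
    ]


print(dag_list_url("ats.example.com"))
print(summarize(SAMPLE_RESPONSE))
```

The sketch makes no network call on purpose. Fetch the URL with whatever HTTP client and authentication your cluster requires, then pass the decoded JSON to `summarize`.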

Method 5: `tez.am.dag.dot.file.location`

For visual inspection, Tez can write each DAG as a Graphviz .dot file:

SET tez.am.dag.dot.file.location=/tmp/tez-dags;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

After the query:

ls /tmp/tez-dags/
# <app-id>_<dag-name>.dot

dot -Tpng /tmp/tez-dags/<file>.dot -o ~/tez-notes/hive-h2-dag.png

The .dot has the same nodes/edges as the Tez UI, in a portable format.

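Caveat: the file is written by the AM, so on a real cluster it lands on the AM node, not the client. Point the path at a shared filesystem or copy the file after the fact. Property names in this area have changed between Tez releases, so confirm the name against TezConfiguration.java in your checkout; some versions instead gate the .dot dump behind tez.generate.debug.artifacts and write it to the AM container's log directory.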
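
To compare the .dot file with the other captures without rendering it, you can extract its edges as plain tuples. This is a minimal Python sketch. The sample graph is hypothetical (the node naming and attributes that Tez emits may differ), and the regex only understands simple `a -> b` statements with optional quotes.

```python
import re

# Hypothetical .dot content; check the node naming in your own dump.
SAMPLE_DOT = """
digraph hive_query {
  "Map_1" -> "Reducer_2" [ label = "SCATTER_GATHER" ];
  "Reducer_2" -> "Reducer_3" [ label = "SCATTER_GATHER" ];
}
"""

# Matches: optional quote, identifier, optional quote, arrow, same again.
EDGE = re.compile(r'"?([\w.]+)"?\s*->\s*"?([\w.]+)"?')


def dot_edges(dot_text):
    """Return (source, destination) pairs for each edge statement."""
    return EDGE.findall(dot_text)


edges = dot_edges(SAMPLE_DOT)
for src, dst in edges:
    print(f"{src} -> {dst}")
```

A regex is enough for a quick diff. For anything beyond flat edge statements (subgraphs, node names containing spaces), a real Graphviz parser is the safer choice.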

Mapping Hive Operators to Tez I/P/O

Now the cross-cutting skill: each Hive operator inside a Vertex maps to one of Tez's three runtime roles — Input, Processor, or Output. For our query:

Map 1 (vertex)

Hive operator	Tez role	Tez class
TableScan	`Input`	`MRInput` (from `tez-mapreduce`) or `HiveInputFormat` adapter
Select	(inside Processor)	—
GroupBy (partial)	(inside Processor)	—
ReduceSink	`Output`	`OrderedPartitionedKVOutput` (from `tez-runtime-library`)

The Processor itself: MapTezProcessor. Find it:

find ~/hive-src -name "MapTezProcessor.java"

Reducer 2 (vertex)

Hive operator	Tez role	Tez class
(shuffle in)	`Input`	`OrderedGroupedKVInput`
GroupBy (final)	(inside Processor)	—
ReduceSink	`Output`	`OrderedPartitionedKVOutput`

Processor: ReduceTezProcessor. Find it:

find ~/hive-src -name "ReduceTezProcessor.java"

Reducer 3 (vertex)

Hive operator	Tez role	Tez class
(shuffle in)	`Input`	`OrderedGroupedKVInput`
Select	(inside Processor)	—
FileSink	`Output`	`MROutput` (from `tez-mapreduce`)

Validation — A Side-by-Side Table

Build this for your captured DAG and save it:

Vertex	Tasks	Inputs (class, source)	Processor	Outputs (class, dest)
Map 1	(from EXPLAIN)	`MRInput` ← `t` ORC files	`MapTezProcessor`	`OrderedPartitionedKVOutput` → Reducer 2
Reducer 2	(from EXPLAIN)	`OrderedGroupedKVInput` ← Map 1	`ReduceTezProcessor`	`OrderedPartitionedKVOutput` → Reducer 3
Reducer 3	1	`OrderedGroupedKVInput` ← Reducer 2	`ReduceTezProcessor`	`MROutput` → query result location

Save as ~/tez-notes/hive-h2-iop-mapping.md.
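
If you would rather generate the file than type it, the table can be kept as data and rendered. The Python sketch below uses the rows from the table above and emits a pipe-delimited markdown table. That output format is an assumption based on the .md extension, and `to_markdown` is an illustrative helper.

```python
HEADER = ("Vertex", "Tasks", "Inputs (class, source)", "Processor",
          "Outputs (class, dest)")

# Rows copied from the table above; replace with your captured DAG.
ROWS = [
    ("Map 1", "(from EXPLAIN)", "MRInput <- t ORC files",
     "MapTezProcessor", "OrderedPartitionedKVOutput -> Reducer 2"),
    ("Reducer 2", "(from EXPLAIN)", "OrderedGroupedKVInput <- Map 1",
     "ReduceTezProcessor", "OrderedPartitionedKVOutput -> Reducer 3"),
    ("Reducer 3", "1", "OrderedGroupedKVInput <- Reducer 2",
     "ReduceTezProcessor", "MROutput -> query result location"),
]


def to_markdown(header, rows):
    """Render header and rows as a pipe-delimited markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines) + "\n"


table = to_markdown(HEADER, ROWS)
print(table)
```

Write `table` to ~/tez-notes/hive-h2-iop-mapping.md once the rows reflect your own run.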

Worked Differences Across Methods

When all the capture methods agree, you have ground truth. When they disagree:

Disagreement	Likely cause
`EXPLAIN FORMATTED` shows N vertices, runtime UI shows N+1	Dynamic vertex insertion (CBO, runtime statistics)
`tez.am.dag.dot.file.location` shows fewer edges than UI	Edges added by VertexManager at runtime (see Lab 4.2)
UI shows `BROADCAST_EDGE`, `EXPLAIN` says `SIMPLE_EDGE`	Hive's `EXPLAIN` is sometimes loose on edge type; trust the UI
Parallelism in UI differs from `EXPLAIN`'s `mapred.reduce.tasks`	`ShuffleVertexManager` reconfigured parallelism at runtime (for example with `hive.tez.auto.reducer.parallelism` enabled)

Each disagreement is informative — it shows you which subsystem made the dynamic decision.
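
Once two captures are reduced to edge tuples, finding the disagreements is a set difference. The Python sketch below compares a planned edge list with a runtime one. Both lists are hypothetical examples, and `diff_edges` is an illustrative name.

```python
def diff_edges(planned, runtime):
    """Return the edges that appear in only one of the two captures.

    Each edge is a (parent, child, edge_type) tuple.
    """
    planned, runtime = set(planned), set(runtime)
    return {
        "only_planned": sorted(planned - runtime),
        "only_runtime": sorted(runtime - planned),
    }


# Hypothetical captures: the edge type for Map 1 -> Reducer 2 disagrees.
PLANNED = [
    ("Map 1", "Reducer 2", "SIMPLE_EDGE"),
    ("Reducer 2", "Reducer 3", "SIMPLE_EDGE"),
]
RUNTIME = [
    ("Map 1", "Reducer 2", "BROADCAST_EDGE"),
    ("Reducer 2", "Reducer 3", "SIMPLE_EDGE"),
]

diff = diff_edges(PLANNED, RUNTIME)
print(diff)
```

An edge whose type differs shows up on both sides of the diff, which makes the edge-type row of the table easy to spot.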

Production Diagnostic Routine

When asked "why is this query slow on Tez?":

EXPLAIN FORMATTED to see the planned DAG.
EXPLAIN VECTORIZATION DETAIL to spot non-vectorized operators.
Run with hive.tez.exec.print.summary=true to get the runtime summary.
Open the Tez UI for the DAG, look at per-vertex and per-edge stats.
Compare planned parallelism to actual (VertexManager may have changed it).
Identify the bottleneck vertex by WALL_CLOCK_MILLIS or OUTPUT_RECORDS skew.

Most slowness is one of: vectorization failure, parallelism mismatch, data skew on a shuffle key, or AM overhead for a many-vertex DAG.
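
The last step of the routine can be sketched in Python as follows: given per-task counters for each vertex, report the slowest task and an output-skew ratio (largest task output divided by the mean). The counter values are hypothetical and `vertex_report` is an illustrative name. In practice you would fill the dictionary from the Tez UI or ATS.

```python
from statistics import mean

# Hypothetical per-task counters, keyed by vertex name.
TASKS = {
    "Map 1": [
        {"WALL_CLOCK_MILLIS": 1200, "OUTPUT_RECORDS": 1000},
        {"WALL_CLOCK_MILLIS": 1300, "OUTPUT_RECORDS": 1100},
    ],
    "Reducer 2": [
        {"WALL_CLOCK_MILLIS": 900, "OUTPUT_RECORDS": 50},
        {"WALL_CLOCK_MILLIS": 9000, "OUTPUT_RECORDS": 950},
    ],
}


def vertex_report(tasks_by_vertex):
    """Summarize each vertex by its slowest task and its output skew."""
    report = {}
    for vertex, tasks in tasks_by_vertex.items():
        wall = [t["WALL_CLOCK_MILLIS"] for t in tasks]
        out = [t["OUTPUT_RECORDS"] for t in tasks]
        avg_out = mean(out)
        report[vertex] = {
            "slowest_task_ms": max(wall),
            # 1.0 means perfectly even output; larger means more skew.
            "output_skew": max(out) / avg_out if avg_out else 0.0,
        }
    return report


report = vertex_report(TASKS)
bottleneck = max(report, key=lambda v: report[v]["slowest_task_ms"])
print(bottleneck, report[bottleneck])
```

A vertex that is both the slowest and strongly skewed points at data skew on the shuffle key. A slow vertex with even output is more likely a parallelism or vectorization problem.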

Validation Artifacts

The EXPLAIN FORMATTED JSON saved to ~/tez-notes/hive-h2-explain-formatted.json.
The EXPLAIN VECTORIZATION DETAIL saved to ~/tez-notes/hive-h2-vec.txt.
A .png rendered from the .dot saved to ~/tez-notes/hive-h2-dag.png.
The Tez UI URL for the actual DAG run, bookmarked.
The Hive-operator-to-Tez-I/P/O table above, filled in for your captured DAG.

Once you can capture and read the DAG with each of these methods, you are ready for failure analysis in Lab H3: Debug a Failed Query.
