Lab H2: Inspecting the Hive-Emitted DAG
Background
Lab H1 traced compilation to derive the DAG by reading code. In production you can't always re-derive it; you need to capture the DAG that Hive actually submitted to Tez. This lab covers the four production-grade ways to do that (the two EXPLAIN variants count as one and are split into Methods 1 and 2 below):
- EXPLAIN FORMATTED and EXPLAIN VECTORIZATION DETAIL from Hive.
- TezTask logging at DEBUG level.
- The Tez UI (backed by YARN ATS or Tez SimpleHistoryLoggingService).
- The tez.am.dag.dot.file.location graphviz dump.
Plus the cross-cutting skill: mapping each Hive operator in the captured DAG to its Tez Input/Processor/Output (I/P/O).
Setup
-- Hive CLI or beeline. Use the same table from H1:
CREATE TABLE IF NOT EXISTS t (a INT, b STRING) STORED AS ORC;
INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');
Verify Tez is the execution engine:
SET hive.execution.engine; -- should be 'tez'
If not:
SET hive.execution.engine=tez;
Method 1: EXPLAIN FORMATTED
EXPLAIN FORMATTED emits a JSON-ish structure with operator details. Useful for
programmatic parsing.
EXPLAIN FORMATTED
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
Snippet of the output (structure varies by Hive version):
{
"STAGE DEPENDENCIES": {
"Stage-1": {"ROOT STAGE": "TRUE"},
"Stage-0": {"DEPENDENT STAGES": "Stage-1"}
},
"STAGE PLANS": {
"Stage-1": {
"Tez": {
"DagId:": "...",
"Edges:": {
"Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
"Reducer 3": [{"parent": "Reducer 2", "type": "SIMPLE_EDGE"}]
},
"Vertices:": {
"Map 1": {
"Map Operator Tree:": [...],
"Execution mode:": "vectorized"
},
"Reducer 2": { ... },
"Reducer 3": { ... }
}
}
}
}
}
Save it:
hive -e "EXPLAIN FORMATTED SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;" \
> ~/tez-notes/hive-h2-explain-formatted.json
What it tells you that EXPLAIN doesn't:
- Edge types between vertices (SIMPLE_EDGE, BROADCAST_EDGE, CUSTOM_SIMPLE_EDGE, CUSTOM_EDGE).
- Execution mode per vertex (vectorized, llap, neither).
- The full operator tree per vertex, including row-schema annotations.
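To make the "programmatic parsing" point concrete, here is a minimal Python sketch that pulls the edge list out of an EXPLAIN FORMATTED document. It assumes the key layout shown in the snippet above ("STAGE PLANS", then a stage, then "Tez", then "Edges:"). Because the structure varies by Hive version, it accepts either a single parent object or a list of them per vertex. The function name extract_edges and the SAMPLE plan are illustrative, not part of Hive.

```python
import json

def extract_edges(plan):
    """Return sorted (parent, child, edge_type) tuples from an EXPLAIN FORMATTED dict."""
    edges = []
    for stage in plan.get("STAGE PLANS", {}).values():
        tez = stage.get("Tez") if isinstance(stage, dict) else None
        if not isinstance(tez, dict):
            continue  # not a Tez stage (for example the fetch stage)
        for child, parents in tez.get("Edges:", {}).items():
            # Assumption: a vertex with one parent may appear as a bare object
            # instead of a one-element list, depending on the Hive version.
            if isinstance(parents, dict):
                parents = [parents]
            for p in parents:
                edges.append((p["parent"], child, p["type"]))
    return sorted(edges)

# Hypothetical sample mirroring the snippet above; real output has many more keys.
SAMPLE = json.loads("""
{"STAGE PLANS": {
  "Stage-1": {"Tez": {"Edges:": {
    "Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
    "Reducer 3": {"parent": "Reducer 2", "type": "SIMPLE_EDGE"}}}},
  "Stage-0": {"Fetch Operator": {"limit:": "-1"}}}}
""")

for parent, child, edge_type in extract_edges(SAMPLE):
    print(f"{parent} -> {child} [{edge_type}]")
```

To run it against the saved file, load the file with json.load first. If the CLI wrote log lines ahead of the JSON, strip them before parsing.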
Method 2: EXPLAIN VECTORIZATION DETAIL
When a query runs slower than expected on Tez, vectorization is the first thing to
check. EXPLAIN VECTORIZATION DETAIL shows per-operator whether vectorization succeeded
and, if not, why.
EXPLAIN VECTORIZATION DETAIL
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
Look for per-vertex Execution mode: vectorized and per-operator Vectorized: true.
If you see notVectorizedReason: <reason>, that's the diagnostic.
Common notVectorizedReason values:
| Reason | Cause |
|---|---|
| UDF X is not vectorized | Hive lacks a vectorized impl of a UDF you used |
| Reduce vectorization disabled | hive.vectorized.execution.reduce.enabled=false |
| MAP_JOIN with key types ... | Vectorized map-join doesn't support the key type combo |
| Column type X not supported | Vectorization doesn't handle the column type (DECIMAL precision, etc.) |
This explains a class of Hive-on-Tez perf surprises that are unrelated to Tez itself.
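When the EXPLAIN VECTORIZATION DETAIL output is long, it helps to collect the reasons per vertex. The sketch below is one way to do that in Python. It assumes each reason sits on its own line as "notVectorizedReason: <text>" and that vertex names appear alone on a line ("Map 1", "Reducer 2"). The exact layout may differ between Hive versions, and the function name and sample text are hypothetical.

```python
import re

# Assumption: reasons appear as "notVectorizedReason: <text>" on their own line.
REASON_RE = re.compile(r"notVectorizedReason:\s*(.+)")
# Assumption: a vertex header is a line holding only "Map N" or "Reducer N".
VERTEX_RE = re.compile(r"^\s*((?:Map|Reducer) \d+)\s*$")

def not_vectorized_reasons(explain_text):
    """Map each vertex name to the notVectorizedReason strings found under it."""
    reasons = {}
    current = "(unknown vertex)"
    for line in explain_text.splitlines():
        vertex = VERTEX_RE.match(line)
        if vertex:
            current = vertex.group(1)
            continue
        found = REASON_RE.search(line)
        if found:
            reasons.setdefault(current, []).append(found.group(1).strip())
    return reasons

# Hypothetical excerpt, not captured from a real run.
SAMPLE = """\
        Map 1
            Execution mode: vectorized
        Reducer 2
            notVectorizedReason: Reduce vectorization disabled
"""

print(not_vectorized_reasons(SAMPLE))
```

Vertices that vectorized cleanly are simply absent from the result, so an empty dict means no reasons were reported.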
Method 3: TezTask Logging
Increase the log level on TezTask to capture the DAG it submitted:
SET hive.root.logger=DEBUG,console;
-- or, more targeted:
SET hive.log.explain.output=true;
hive.log.explain.output=true writes the EXPLAIN to the Hive log on each query —
useful in production where you can't get a CLI run but can grep the log.
grep -A100 "DAG description" /var/log/hive/hive-server2.log | head -200
For the most detail, set DEBUG specifically on the Tez integration:
# in hive-site.xml or via SET:
log4j.logger.org.apache.hadoop.hive.ql.exec.tez=DEBUG
log4j.logger.org.apache.tez.dag.api=DEBUG
In DEBUG you see:
- The serialised DAGPlan size at submit time.
- Each Vertex's name, parallelism, processor descriptor class.
- Each Edge's source, destination, data-source / data-movement / scheduling type.
Method 4: Tez UI
The Tez UI runs against YARN Timeline Service (ATS) or against the file-system
SimpleHistoryLoggingService. When configured, every Tez DAG submitted by Hive (or
anything else) is captured.
Capture is enabled via tez.history.logging.service.class:
grep "tez.history.logging.service.class" ~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
Once a DAG runs, browse to:
http://<atstimeline-host>:8188/applicationhistory/
or for the standalone Tez UI:
http://<tez-ui-host>:9999/tez-ui/
Click into a DAG to see:
- Per-vertex stats (tasks, attempts, succeeded, failed, killed).
- Edges with type and statistics (BYTES_TRANSFERRED).
- A graphical DAG view.
- Per-task and per-attempt logs.
For an offline cluster, the file-system logger writes JSON files under
tez.simple.history.logging.dir. They can be loaded into the Tez UI later.
Method 5: tez.am.dag.dot.file.location
For visual inspection, Tez can write each DAG as a Graphviz .dot file:
SET tez.am.dag.dot.file.location=/tmp/tez-dags;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
After the query:
ls /tmp/tez-dags/
# <app-id>_<dag-name>.dot
dot -Tpng /tmp/tez-dags/<file>.dot -o ~/tez-notes/hive-h2-dag.png
The .dot has the same nodes/edges as the Tez UI, in a portable format.
Caveat: the file is written by the AM, so on a real cluster it lands on the AM node, not the client. Point the path at a shared filesystem, or copy the file off the AM node afterwards.
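Because the .dot file is plain text, its edges can also be extracted without rendering, which is handy for comparing against the other capture methods. The following Python sketch does this with a regular expression. The exact node naming and attribute layout of Tez's .dot output are not verified here, so the sample is a hypothetical Graphviz file. The sketch handles only simple "A -> B" statements, not chained ones such as "A -> B -> C".

```python
import re

# A node is either a quoted string or a bare identifier.
NODE = r'(?:"([^"]+)"|(\w+))'
EDGE_RE = re.compile(NODE + r"\s*->\s*" + NODE)

def dot_edges(dot_text):
    """Return (source, destination) pairs for each simple edge statement."""
    edges = []
    for m in EDGE_RE.finditer(dot_text):
        src = m.group(1) or m.group(2)
        dst = m.group(3) or m.group(4)
        edges.append((src, dst))
    return edges

# Hypothetical sample; a real Tez dump will contain more nodes and attributes.
SAMPLE_DOT = '''
digraph example {
  "Map 1" [ label = "Map 1" ];
  "Map 1" -> "Reducer 2" [ label = "SIMPLE_EDGE" ];
  Reducer_2 -> Reducer_3;
}
'''

for src, dst in dot_edges(SAMPLE_DOT):
    print(f"{src} -> {dst}")
```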
Mapping Hive Operators to Tez I/P/O
Now the cross-cutting skill: each Hive operator inside a Vertex maps to one of Tez's three runtime roles — Input, Processor, or Output. For our query:
Map 1 (vertex)
| Hive operator | Tez role | Tez class |
|---|---|---|
| TableScan | Input | MRInput (from tez-mapreduce) or HiveInputFormat adapter |
| Select | (inside Processor) | — |
| GroupBy (partial) | (inside Processor) | — |
| ReduceSink | Output | OrderedPartitionedKVOutput (from tez-runtime-library) |
The Processor itself: MapTezProcessor. Find it:
find ~/hive-src -name "MapTezProcessor.java"
Reducer 2 (vertex)
| Hive operator | Tez role | Tez class |
|---|---|---|
| (shuffle in) | Input | OrderedGroupedKVInput |
| GroupBy (final) | (inside Processor) | — |
| ReduceSink | Output | OrderedPartitionedKVOutput |
Processor: ReduceTezProcessor. Find it:
find ~/hive-src -name "ReduceTezProcessor.java"
Reducer 3 (vertex)
| Hive operator | Tez role | Tez class |
|---|---|---|
| (shuffle in) | Input | OrderedGroupedKVInput |
| Select | (inside Processor) | — |
| FileSink | Output | MROutput (from tez-mapreduce) |
Validation — A Side-by-Side Table
Build this for your captured DAG and save it:
| Vertex | Tasks | Inputs (class, source) | Processor | Outputs (class, dest) |
|---|---|---|---|---|
| Map 1 | (from EXPLAIN) | MRInput ← t ORC files | MapTezProcessor | OrderedPartitionedKVOutput → Reducer 2 |
| Reducer 2 | (from EXPLAIN) | OrderedGroupedKVInput ← Map 1 | ReduceTezProcessor | OrderedPartitionedKVOutput → Reducer 3 |
| Reducer 3 | 1 | OrderedGroupedKVInput ← Reducer 2 | ReduceTezProcessor | MROutput → query result location |
Save as ~/tez-notes/hive-h2-iop-mapping.md.
Worked Differences Across Methods
When all four capture methods agree, you have ground truth. When they disagree:
| Disagreement | Likely cause |
|---|---|
| EXPLAIN FORMATTED shows N vertices, runtime UI shows N+1 | Dynamic vertex insertion (CBO, runtime statistics) |
| tez.am.dag.dot.file.location shows fewer edges than UI | Edges added by VertexManager at runtime (see Lab 4.2) |
| UI shows BROADCAST_EDGE, EXPLAIN says SIMPLE_EDGE | Hive's EXPLAIN is sometimes loose on edge type; trust the UI |
| Parallelism in UI differs from EXPLAIN's -mapred.reduce.tasks | tez.shuffle.vertex.manager reconfigured parallelism at runtime |
Each disagreement is informative — it shows you which subsystem made the dynamic decision.
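Spotting these disagreements by eye gets tedious once a DAG has more than a few vertices. As a rough sketch, the Python function below compares a planned edge map (for example from EXPLAIN FORMATTED) with a runtime one (for example transcribed from the Tez UI). The input shape, a dict keyed by (parent, child) with the edge type as value, is an assumption made for illustration. Neither Hive nor Tez emits it directly.

```python
def diff_dags(planned, runtime):
    """Compare two {(parent, child): edge_type} maps and describe each difference."""
    findings = []
    for parent, child in sorted(runtime.keys() - planned.keys()):
        findings.append(f"edge {parent} -> {child} only at runtime")
    for parent, child in sorted(planned.keys() - runtime.keys()):
        findings.append(f"edge {parent} -> {child} only in plan")
    for edge in sorted(planned.keys() & runtime.keys()):
        if planned[edge] != runtime[edge]:
            findings.append(
                f"edge {edge[0]} -> {edge[1]}: plan says {planned[edge]}, "
                f"runtime says {runtime[edge]}"
            )
    return findings

# Hypothetical data: one extra runtime edge and one edge-type mismatch.
PLANNED = {
    ("Map 1", "Reducer 2"): "SIMPLE_EDGE",
    ("Reducer 2", "Reducer 3"): "SIMPLE_EDGE",
}
RUNTIME = {
    ("Map 1", "Reducer 2"): "BROADCAST_EDGE",
    ("Reducer 2", "Reducer 3"): "SIMPLE_EDGE",
    ("Map 4", "Reducer 2"): "SIMPLE_EDGE",
}

for finding in diff_dags(PLANNED, RUNTIME):
    print(finding)
```

Each finding corresponds to one row of the table: an extra runtime edge points at a runtime addition, and a type mismatch points at the EXPLAIN-versus-UI case.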
Production Diagnostic Routine
When asked "why is this query slow on Tez?":
- EXPLAIN FORMATTED to see the planned DAG.
- EXPLAIN VECTORIZATION DETAIL to spot non-vectorized operators.
- Run with hive.tez.exec.print.summary=true to get the runtime summary.
- Open the Tez UI for the DAG, look at per-vertex and per-edge stats.
- Compare planned parallelism to actual (VertexManager may have changed it).
- Identify the bottleneck vertex by WALL_CLOCK_MILLIS or OUTPUT_RECORDS skew.
Most slowness is one of: vectorization failure, parallelism mismatch, data skew on a shuffle key, or AM overhead for a many-vertex DAG.
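The last step of the routine can be sketched in Python. Given per-vertex wall-clock time and per-task output record counts (copied by hand from the Tez UI, say), the function picks the slowest vertex and computes a simple skew ratio per vertex. The input field names wall_clock_ms and task_output_records are hypothetical stand-ins for the WALL_CLOCK_MILLIS and OUTPUT_RECORDS counters. The ratio of maximum to mean is just one reasonable skew measure, not a Tez-defined metric.

```python
def find_bottleneck(vertex_stats):
    """Return (slowest_vertex, skew_by_vertex).

    vertex_stats: {vertex: {"wall_clock_ms": int, "task_output_records": [int, ...]}}
    skew = max task records / mean task records; 1.0 means perfectly even.
    """
    slowest = max(vertex_stats, key=lambda v: vertex_stats[v]["wall_clock_ms"])
    skew = {}
    for name, stats in vertex_stats.items():
        records = stats["task_output_records"]
        mean = sum(records) / len(records) if records else 0
        skew[name] = max(records) / mean if mean else 1.0
    return slowest, skew

# Hypothetical numbers: Reducer 2 is slow and one of its tasks emits most rows.
STATS = {
    "Map 1": {"wall_clock_ms": 4000, "task_output_records": [100, 100]},
    "Reducer 2": {"wall_clock_ms": 9000, "task_output_records": [10, 10, 280]},
    "Reducer 3": {"wall_clock_ms": 1000, "task_output_records": [3]},
}

slowest, skew = find_bottleneck(STATS)
print(slowest, skew)
```

A vertex that is both the slowest and has a skew ratio well above 1 suggests data skew on the shuffle key rather than a parallelism problem.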
Validation Artifacts
- The EXPLAIN FORMATTED JSON saved to ~/tez-notes/hive-h2-explain-formatted.json.
- The EXPLAIN VECTORIZATION DETAIL saved to ~/tez-notes/hive-h2-vec.txt.
- A .png rendered from the .dot saved to ~/tez-notes/hive-h2-dag.png.
- The Tez UI URL for the actual DAG run, bookmarked.
- The Hive-operator-to-Tez-I/P/O table above, filled in for your captured DAG.
Once you can capture and read the DAG four ways, you are ready for failure analysis — Lab H3: Debug a Failed Query.