Lab H2: Inspecting the Hive-Emitted DAG

Background

Lab H1 derived the DAG by tracing compilation through the code. In production you can't always re-derive it; you need to capture the DAG that Hive actually submitted to Tez. This lab covers five production-grade ways to do that:

  1. EXPLAIN FORMATTED from Hive.
  2. EXPLAIN VECTORIZATION DETAIL from Hive.
  3. TezTask logging at DEBUG level.
  4. The Tez UI (backed by YARN ATS or Tez SimpleHistoryLoggingService).
  5. The tez.am.dag.dot.file.location Graphviz dump.

Plus the cross-cutting skill: mapping each Hive operator in the captured DAG to its Tez Input/Processor/Output (I/P/O).


Setup

-- Hive CLI or beeline. Use the same table as in H1:
CREATE TABLE IF NOT EXISTS t (a INT, b STRING) STORED AS ORC;
INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');

Verify Tez is the execution engine:

SET hive.execution.engine;        -- should be 'tez'

If not:

SET hive.execution.engine=tez;

Method 1: EXPLAIN FORMATTED

EXPLAIN FORMATTED emits the plan as JSON with per-operator detail, which makes it the form to use for programmatic parsing.

EXPLAIN FORMATTED
  SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

Snippet of the output (structure varies by Hive version):

{
  "STAGE DEPENDENCIES": {
    "Stage-1": {"ROOT STAGE": "TRUE"},
    "Stage-0": {"DEPENDENT STAGES": "Stage-1"}
  },
  "STAGE PLANS": {
    "Stage-1": {
      "Tez": {
        "DagId:": "...",
        "Edges:": {
          "Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
          "Reducer 3": [{"parent": "Reducer 2", "type": "SIMPLE_EDGE"}]
        },
        "Vertices:": {
          "Map 1": {
            "Map Operator Tree:": [...],
            "Execution mode:": "vectorized"
          },
          "Reducer 2": { ... },
          "Reducer 3": { ... }
        }
      }
    }
  }
}

Save it:

hive -e "EXPLAIN FORMATTED SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;" \
  > ~/tez-notes/hive-h2-explain-formatted.json

What to read from it (plain EXPLAIN prints much of this as indented text; FORMATTED makes it machine-readable):

  • Edge types between vertices (SIMPLE_EDGE, BROADCAST_EDGE, CUSTOM_SIMPLE_EDGE, CUSTOM_EDGE).
  • Execution mode per vertex (vectorized, llap, neither).
  • The full operator tree per vertex, including row-schema annotations.
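
Because the output is JSON, the edge list and execution modes can be pulled out with a few lines of code. The following is a minimal Python sketch, assuming the key names shown in the snippet above ("STAGE PLANS", "Tez", "Edges:", "Vertices:", "Execution mode:"), which vary by Hive version. The function names tez_edges and execution_modes are illustrative and not part of Hive, and the sample plan is hand-written rather than real Hive output.

```python
import json


def tez_edges(plan):
    """Return (parent, child, edge_type) tuples from an EXPLAIN FORMATTED plan."""
    edges = []
    for stage in plan.get("STAGE PLANS", {}).values():
        tez = stage.get("Tez") if isinstance(stage, dict) else None
        if not tez:
            continue  # not a Tez stage (for example a fetch stage)
        for child, parents in tez.get("Edges:", {}).items():
            # Assumption: a vertex with a single parent may be emitted as a
            # bare object instead of a one-element list, so accept both shapes.
            if isinstance(parents, dict):
                parents = [parents]
            for parent in parents:
                edges.append((parent["parent"], child, parent["type"]))
    return edges


def execution_modes(plan):
    """Return {vertex name: execution mode, or None if the key is absent}."""
    modes = {}
    for stage in plan.get("STAGE PLANS", {}).values():
        tez = stage.get("Tez") if isinstance(stage, dict) else None
        if not tez:
            continue
        for name, vertex in tez.get("Vertices:", {}).items():
            modes[name] = vertex.get("Execution mode:")
    return modes


# Hand-written sample that mirrors the snippet above; not real Hive output.
SAMPLE = json.loads("""
{
  "STAGE PLANS": {
    "Stage-1": {
      "Tez": {
        "Edges:": {
          "Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
          "Reducer 3": [{"parent": "Reducer 2", "type": "SIMPLE_EDGE"}]
        },
        "Vertices:": {
          "Map 1": {"Execution mode:": "vectorized"},
          "Reducer 2": {},
          "Reducer 3": {}
        }
      }
    },
    "Stage-0": {"Fetch Operator": {}}
  }
}
""")

for parent, child, kind in tez_edges(SAMPLE):
    print(f"{parent} -> {child} [{kind}]")
```

To run it against the saved file, load ~/tez-notes/hive-h2-explain-formatted.json with json.load. If the CLI wrote banner or log lines around the JSON, strip them first.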

Method 2: EXPLAIN VECTORIZATION DETAIL

When a query runs slower than expected on Tez, vectorization is the first thing to check. EXPLAIN VECTORIZATION DETAIL shows per-operator whether vectorization succeeded and, if not, why.

EXPLAIN VECTORIZATION DETAIL
  SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

Look for Execution mode: vectorized on each vertex and, inside each vertex's Map Vectorization or Reduce Vectorization block, vectorized: true. A notVectorizedReason: <reason> entry is the diagnostic; the exact field names vary by Hive version.

Common notVectorizedReason values:

Reason                          Cause
UDF X is not vectorized         Hive lacks a vectorized implementation of a UDF you used
Reduce vectorization disabled   hive.vectorized.execution.reduce.enabled=false
MAP_JOIN with key types ...     Vectorized map join doesn't support the key-type combination
Column type X not supported     Vectorization doesn't handle the column type (DECIMAL precision, etc.)

A vectorization failure of this kind explains a class of Hive-on-Tez performance surprises that have nothing to do with Tez itself.
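
When the EXPLAIN VECTORIZATION DETAIL output is long, it helps to collect the reasons per vertex mechanically. The sketch below is one way to do that in Python, assuming the text output names each vertex on its own line ("Map 1", "Reducer 2") and reports failures as notVectorizedReason: <reason>. The sample text is hand-written for illustration, and the function name not_vectorized is hypothetical.

```python
import re

# A vertex header line such as "        Map 1 " or "        Reducer 2".
VERTEX_RE = re.compile(r"^\s*((?:Map|Reducer) \d+)\s*$")
REASON_RE = re.compile(r"notVectorizedReason:\s*(.+)")


def not_vectorized(explain_text):
    """Map each vertex name to the notVectorizedReason strings found under it."""
    reasons = {}
    vertex = None  # reasons seen before any vertex header are filed under None
    for line in explain_text.splitlines():
        header = VERTEX_RE.match(line)
        if header:
            vertex = header.group(1)
            continue
        reason = REASON_RE.search(line)
        if reason:
            reasons.setdefault(vertex, []).append(reason.group(1).strip())
    return reasons


# Hand-written sample in the assumed layout; not real Hive output.
SAMPLE_TEXT = """
        Map 1
            Map Operator Tree:
                TableScan
            Execution mode: vectorized
        Reducer 2
            Reduce Vectorization:
                notVectorizedReason: Reduce vectorization disabled
"""

for vertex_name, found in not_vectorized(SAMPLE_TEXT).items():
    print(vertex_name, found)
```

An empty result means no vertex reported a failure in the text you passed in; it does not by itself prove that every operator ran vectorized.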


Method 3: TezTask Logging

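Raise the log level to capture the DAG that TezTask submitted. hive.root.logger is read when the process starts, so pass it on the command line rather than with SET:

hive --hiveconf hive.root.logger=DEBUG,console

-- or, more targeted, from within a session:
SET hive.log.explain.output=true;

hive.log.explain.output=true writes the EXPLAIN output to the Hive log for every query, which is useful in production where you can't get a CLI run but can grep the log. The marker text differs between Hive versions, so adjust the pattern below to what your log contains.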

grep -A100 "DAG description" /var/log/hive/hive-server2.log | head -200

For the most detail, set DEBUG specifically on the Tez integration:

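# in hive-log4j.properties (log4j 1.x syntax; Hive releases on log4j2 use hive-log4j2.properties with logger.<id>.name and logger.<id>.level entries):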
log4j.logger.org.apache.hadoop.hive.ql.exec.tez=DEBUG
log4j.logger.org.apache.tez.dag.api=DEBUG

In DEBUG you see:

  • The serialised DAGPlan size at submit time.
  • Each Vertex's name, parallelism, processor descriptor class.
  • Each Edge's source, destination, data-source / data-movement / scheduling type.

Method 4: Tez UI

The Tez UI runs against YARN Timeline Service (ATS) or against the file-system SimpleHistoryLoggingService. When configured, every Tez DAG submitted by Hive (or anything else) is captured.

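Capture is enabled via tez.history.logging.service.class, which names the history logging implementation (for example ATSHistoryLoggingService or SimpleHistoryLoggingService). Find the property and its default in the Tez source: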

grep "tez.history.logging.service.class" ~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Once a DAG runs, browse to:

http://<atstimeline-host>:8188/applicationhistory/

or for the standalone Tez UI:

http://<tez-ui-host>:9999/tez-ui/

Click into a DAG to see:

  • Per-vertex stats (tasks, attempts, succeeded, failed, killed).
  • Edges with type and statistics (BYTES_TRANSFERRED).
  • A graphical DAG view.
  • Per-task and per-attempt logs.

For an offline cluster, the file-system logger writes JSON files under tez.simple.history.logging.dir. They can be loaded into the Tez UI later.
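
The history files can also be summarised without the UI. The sketch below counts events by type in Python. It rests on two assumptions that should be checked against your Tez version: records are JSON objects separated by a \u0001 control character, and each record carries an "events" list whose items have an "eventtype" field. The entity names in the sample are made up.

```python
import json
from collections import Counter

# Assumption: SimpleHistoryLoggingService separates records with \u0001.
RECORD_SEPARATOR = "\u0001"


def event_counts(history_text):
    """Count history events by type in the text of one history file."""
    counts = Counter()
    for chunk in history_text.split(RECORD_SEPARATOR):
        chunk = chunk.strip()
        if not chunk:
            continue
        record = json.loads(chunk)
        for event in record.get("events", []):
            counts[event.get("eventtype", "UNKNOWN")] += 1
    return counts


# Hand-built sample in the assumed layout; entity names are hypothetical.
SAMPLE_HISTORY = (RECORD_SEPARATOR + "\n").join([
    json.dumps({"entity": "dag_example_1", "entitytype": "TEZ_DAG_ID",
                "events": [{"eventtype": "DAG_SUBMITTED"}]}),
    json.dumps({"entity": "vertex_example_1", "entitytype": "TEZ_VERTEX_ID",
                "events": [{"eventtype": "VERTEX_FINISHED"}]}),
    json.dumps({"entity": "vertex_example_2", "entitytype": "TEZ_VERTEX_ID",
                "events": [{"eventtype": "VERTEX_FINISHED"}]}),
])

print(dict(event_counts(SAMPLE_HISTORY)))
```

If json.loads raises on a real file, the separator or record layout differs from what is assumed here; inspect the first few hundred bytes of the file before adjusting the parser.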


Method 5: tez.am.dag.dot.file.location

For visual inspection, Tez can write each DAG as a Graphviz .dot file:

SET tez.am.dag.dot.file.location=/tmp/tez-dags;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

After the query:

ls /tmp/tez-dags/
# <app-id>_<dag-name>.dot

dot -Tpng /tmp/tez-dags/<file>.dot -o ~/tez-notes/hive-h2-dag.png

The .dot has the same nodes/edges as the Tez UI, in a portable format.

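Caveat: the file is written by the AM, so on a real cluster it lands on the AM node, not the client. Point the path at a shared filesystem or copy the file afterwards. Property names have changed between Tez releases, so confirm this one with grep in TezConfiguration.java as in Method 4; if it is absent, check whether your release writes the .dot file into the AM container log directory when tez.generate.debug.artifacts=true.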
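
To compare the .dot file with the other captures, its edges can be extracted as plain pairs. The following Python sketch uses a regular expression rather than a full DOT parser, so it assumes simple "A -> B" statements and will be confused by a label that itself contains "->". The sample graph is hand-written; the node names Tez actually emits may differ.

```python
import re

# One side of an edge: a double-quoted string or a bare identifier.
NODE = r'("(?:[^"\\]|\\.)*"|[\w.]+)'
EDGE_RE = re.compile(NODE + r"\s*->\s*" + NODE)


def dot_edges(dot_text):
    """Return the set of (source, destination) node names in a DOT graph."""
    edges = set()
    for src, dst in EDGE_RE.findall(dot_text):
        edges.add((src.strip('"'), dst.strip('"')))
    return edges


# Hand-written sample; real Tez output carries more nodes and attributes.
SAMPLE_DOT = """
digraph sample {
  "Map_1" -> "Reducer_2" [label="SCATTER_GATHER"];
  Reducer_2 -> Reducer_3;
}
"""

for src, dst in sorted(dot_edges(SAMPLE_DOT)):
    print(src, "->", dst)
```

Returning a set makes the result directly comparable with edge sets built from EXPLAIN FORMATTED or the Tez UI, once vertex names are normalised to the same spelling.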


Mapping Hive Operators to Tez I/P/O

Now the cross-cutting skill: each Hive operator inside a Vertex maps to one of Tez's three runtime roles — Input, Processor, or Output. For our query:

Map 1 (vertex)

Hive operator      Tez role            Tez class
TableScan          Input               MRInput (from tez-mapreduce) or HiveInputFormat adapter
Select             (inside Processor)
GroupBy (partial)  (inside Processor)
ReduceSink         Output              OrderedPartitionedKVOutput (from tez-runtime-library)

The Processor itself: MapTezProcessor. Find it:

find ~/hive-src -name "MapTezProcessor.java"

Reducer 2 (vertex)

Hive operator      Tez role            Tez class
(shuffle in)       Input               OrderedGroupedKVInput
GroupBy (final)    (inside Processor)
ReduceSink         Output              OrderedPartitionedKVOutput

Processor: ReduceTezProcessor. Find it:

find ~/hive-src -name "ReduceTezProcessor.java"

Reducer 3 (vertex)

Hive operator      Tez role            Tez class
(shuffle in)       Input               OrderedGroupedKVInput
Select             (inside Processor)
FileSink           Output              MROutput (from tez-mapreduce)

Validation — A Side-by-Side Table

Build this for your captured DAG and save it:

Vertex     Tasks           Inputs (class, source)             Processor           Outputs (class, dest)
Map 1      (from EXPLAIN)  MRInput ← t ORC files              MapTezProcessor     OrderedPartitionedKVOutput → Reducer 2
Reducer 2  (from EXPLAIN)  OrderedGroupedKVInput ← Map 1      ReduceTezProcessor  OrderedPartitionedKVOutput → Reducer 3
Reducer 3  1               OrderedGroupedKVInput ← Reducer 2  ReduceTezProcessor  MROutput → query result location

Save as ~/tez-notes/hive-h2-iop-mapping.md.
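
Since the target file is Markdown, the table can be generated from data instead of typed by hand. This is a minimal Python sketch: the rows are copied from the table above, the "(from EXPLAIN)" cells are placeholders to fill in from your own capture, and the helper names to_markdown and save are illustrative.

```python
from pathlib import Path

HEADER = ["Vertex", "Tasks", "Inputs (class, source)", "Processor",
          "Outputs (class, dest)"]

# Rows copied from the table above; replace the placeholders with real values.
ROWS = [
    ["Map 1", "(from EXPLAIN)", "MRInput ← t ORC files",
     "MapTezProcessor", "OrderedPartitionedKVOutput → Reducer 2"],
    ["Reducer 2", "(from EXPLAIN)", "OrderedGroupedKVInput ← Map 1",
     "ReduceTezProcessor", "OrderedPartitionedKVOutput → Reducer 3"],
    ["Reducer 3", "1", "OrderedGroupedKVInput ← Reducer 2",
     "ReduceTezProcessor", "MROutput → query result location"],
]


def to_markdown(header, rows):
    """Render a header and rows as a pipe-delimited Markdown table."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines) + "\n"


def save(path, header=HEADER, rows=ROWS):
    """Write the table to path, expanding a leading ~ to the home directory."""
    target = Path(path).expanduser()
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(to_markdown(header, rows), encoding="utf-8")


print(to_markdown(HEADER, ROWS))
```

Calling save("~/tez-notes/hive-h2-iop-mapping.md") writes the file; nothing is written until you call it.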


Worked Differences Across Methods

When all the capture methods agree, you have ground truth. When they disagree:

Disagreement                                                  Likely cause
EXPLAIN FORMATTED shows N vertices, runtime UI shows N+1      Dynamic vertex insertion (CBO, runtime statistics)
tez.am.dag.dot.file.location shows fewer edges than UI        Edges added by VertexManager at runtime (see Lab 4.2)
UI shows BROADCAST_EDGE, EXPLAIN says SIMPLE_EDGE             Hive's EXPLAIN is sometimes loose on edge type; trust the UI
Parallelism in UI differs from EXPLAIN's mapred.reduce.tasks  ShuffleVertexManager reconfigured parallelism at runtime

Each disagreement is informative — it shows you which subsystem made the dynamic decision.
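
Spotting these disagreements is a set comparison once each capture has been reduced to a map from (parent, child) to edge type. The Python sketch below shows that comparison. The input shape and the function name diff_edges are choices made for illustration, and the sample values are invented to trigger each kind of difference.

```python
def diff_edges(planned, observed):
    """Compare two edge maps of the form {(parent, child): edge_type}."""
    planned_keys, observed_keys = set(planned), set(observed)
    return {
        # Edges in the plan that never appeared at runtime.
        "only_planned": sorted(planned_keys - observed_keys),
        # Edges that appeared at runtime but were not in the plan.
        "only_observed": sorted(observed_keys - planned_keys),
        # Edges present in both captures with different types.
        "type_changed": sorted(
            (edge, planned[edge], observed[edge])
            for edge in planned_keys & observed_keys
            if planned[edge] != observed[edge]
        ),
    }


# Invented sample data; vertex names follow the lab query plus one extra vertex.
PLANNED = {
    ("Map 1", "Reducer 2"): "SIMPLE_EDGE",
    ("Reducer 2", "Reducer 3"): "SIMPLE_EDGE",
}
OBSERVED = {
    ("Map 1", "Reducer 2"): "BROADCAST_EDGE",
    ("Reducer 2", "Reducer 3"): "SIMPLE_EDGE",
    ("Map 4", "Reducer 2"): "SIMPLE_EDGE",
}

for kind, items in diff_edges(PLANNED, OBSERVED).items():
    print(kind, items)
```

Vertex names must be normalised before comparing, because one capture may write "Map 1" where another writes "Map_1".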


Production Diagnostic Routine

When asked "why is this query slow on Tez?":

  1. EXPLAIN FORMATTED to see the planned DAG.
  2. EXPLAIN VECTORIZATION DETAIL to spot non-vectorized operators.
  3. Run with hive.tez.exec.print.summary=true to get the runtime summary.
  4. Open the Tez UI for the DAG, look at per-vertex and per-edge stats.
  5. Compare planned parallelism to actual (VertexManager may have changed it).
  6. Identify the bottleneck vertex by WALL_CLOCK_MILLIS or OUTPUT_RECORDS skew.

Most slowness is one of: vectorization failure, parallelism mismatch, data skew on a shuffle key, or AM overhead for a many-vertex DAG.
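
Step 6 can be made concrete with a simple skew measure. The Python sketch below ranks vertices by the ratio of the largest per-task value to the median, which is one reasonable measure among several and not something Tez computes for you. The per-task numbers are invented, standing in for OUTPUT_RECORDS or WALL_CLOCK_MILLIS values copied from the Tez UI.

```python
from statistics import median


def skew_ratio(values):
    """Ratio of the largest value to the median; 1.0 means perfectly even."""
    if not values:
        raise ValueError("need at least one per-task value")
    top = max(values)
    if top == 0:
        return 1.0  # every task reported zero, so there is no skew to report
    mid = median(values)
    return float("inf") if mid == 0 else top / mid


def worst_vertex(per_vertex):
    """Return the vertex name whose per-task values are most skewed."""
    return max(per_vertex, key=lambda name: skew_ratio(per_vertex[name]))


# Invented per-task counter values, one list per vertex.
PER_TASK = {
    "Map 1": [100, 110, 90],
    "Reducer 2": [10, 12, 500, 11],
}

for name, values in PER_TASK.items():
    print(name, round(skew_ratio(values), 2))
print("most skewed:", worst_vertex(PER_TASK))
```

A ratio near 1 points away from data skew and toward parallelism or vectorization; a large ratio on a reducer usually points at a hot shuffle key.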


Validation Artifacts

  1. The EXPLAIN FORMATTED JSON saved to ~/tez-notes/hive-h2-explain-formatted.json.
  2. The EXPLAIN VECTORIZATION DETAIL saved to ~/tez-notes/hive-h2-vec.txt.
  3. A .png rendered from the .dot saved to ~/tez-notes/hive-h2-dag.png.
  4. The Tez UI URL for the actual DAG run, bookmarked.
  5. The Hive-operator-to-Tez-I/P/O table above, filled in for your captured DAG.

Once you can capture and read the DAG with each of these methods, you are ready for failure analysis in Lab H3: Debug a Failed Query.