Lab H1: SQL → DAG

Background

A user writes:

SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;

Hive compiles this into a Tez DAG with three vertices and two edges. This lab walks the compilation path: parser → semantic analyzer → logical plan → physical plan (MapWork/ReduceWork) → TezTask → DagUtils.createVertex/createEdge → submitted DAG.

By the end you will have a labelled DAG diagram for this query and you will be able to trace any similar query from SQL to runtime topology.


Setup

cd ~/hive-src
git log --oneline -1                    # know the version you're on
find . -name "TezTask.java"             # boundary class
find . -name "DagUtils.java"            # DAG construction

A representative test table (use Hive CLI or beeline):

CREATE TABLE t (a INT, b STRING)
  STORED AS ORC;

INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');

The query under study:

SELECT a, COUNT(*) AS c
  FROM t
  GROUP BY a
  ORDER BY a;

Step 1: Parser (lexing, AST)

Hive uses ANTLR. The grammar lives in:

find ~/hive-src -name "HiveParser.g" -o -name "HiveLexer.g"

The parser produces an AST. From the CLI:

EXPLAIN AST SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

You will see a Lisp-style tree:

(TOK_QUERY
  (TOK_FROM (TOK_TABREF (TOK_TABNAME t)))
  (TOK_INSERT
    (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT
      (TOK_SELEXPR (TOK_TABLE_OR_COL a))
      (TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT) c))
    (TOK_GROUPBY (TOK_TABLE_OR_COL a))
    (TOK_ORDERBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL a)))))

The AST is the input to the next phase.
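
To make the tree easier to inspect, the EXPLAIN AST text can be loaded into nested lists. The following is a minimal Python sketch and not part of Hive: parse_sexpr and find_nodes are hypothetical helper names, and the parser assumes that no token contains spaces or parentheses (true for the tree above, but not for every query, for example one with string literals).

```python
def parse_sexpr(text):
    """Parse a Lisp-style tree into nested lists: (A (B c)) -> ['A', ['B', 'c']]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume the closing ")"
            return node
        return tok

    return read()


def find_nodes(tree, label):
    """Yield every subtree whose first element equals label."""
    if isinstance(tree, list):
        if tree and tree[0] == label:
            yield tree
        for child in tree[1:]:
            yield from find_nodes(child, label)


# The AST text shown above, copied verbatim.
AST_TEXT = """
(TOK_QUERY
  (TOK_FROM (TOK_TABREF (TOK_TABNAME t)))
  (TOK_INSERT
    (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT
      (TOK_SELEXPR (TOK_TABLE_OR_COL a))
      (TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT) c))
    (TOK_GROUPBY (TOK_TABLE_OR_COL a))
    (TOK_ORDERBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL a)))))
"""

tree = parse_sexpr(AST_TEXT)
group_by = next(find_nodes(tree, "TOK_GROUPBY"))
order_by = next(find_nodes(tree, "TOK_ORDERBY"))
print(group_by)
print(order_by)
```

Searching for TOK_GROUPBY and TOK_ORDERBY this way is a quick check that both clauses survived parsing before you move on to the semantic analyzer.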


Step 2: Semantic Analyzer

The AST goes through SemanticAnalyzer:

find ~/hive-src -name "SemanticAnalyzer.java" | head

It resolves table references, expands *, and type-checks aggregates. It first builds a Query Block (QB) tree and then translates that into an operator tree (the logical plan).

EXPLAIN LOGICAL SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

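You see operators like TS (TableScan), SEL (Select), GBY (GroupBy), RS (ReduceSink), FS (FileSink). Two GBY and two RS are typical for a GROUP BY ... ORDER BY: the first GBY/RS pair does the partial aggregate and its shuffle, the second does the final aggregate and feeds the sort.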


Step 3: Physical Plan — MapWork, ReduceWork

The logical operator tree is converted to a physical plan whose top-level units are MapWork, ReduceWork, and MergeJoinWork. For our query, Hive produces three Work units:

| Work | Purpose | Operators inside |
|---|---|---|
| MapWork (Map 1) | Read t, partial aggregate by a | TS → SEL → GBY → RS |
| ReduceWork (Reducer 2) | Final aggregate by a, prepare for sort | GBY → RS |
| ReduceWork (Reducer 3) | Total-order sort by a, write output | SEL → FS |
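
One way to read the three Work units listed above: each Work unit ends at a ReduceSink, because an RS marks a shuffle boundary. The sketch below is a deliberately simplified Python model of that rule, assuming a single linear operator pipeline (real plans are trees, and Hive's own compiler logic is far more involved). split_at_reduce_sinks is a hypothetical name, not a Hive function.

```python
def split_at_reduce_sinks(operators):
    """Cut a linear operator pipeline into Work units, closing one after each RS."""
    works, current = [], []
    for op in operators:
        current.append(op)
        if op == "RS":  # ReduceSink: data leaves this Work unit via a shuffle
            works.append(current)
            current = []
    if current:  # trailing operators (here SEL, FS) form the last Work unit
        works.append(current)
    return works


# Operator pipeline for the query under study, flattened from the Work units above.
PIPELINE = ["TS", "SEL", "GBY", "RS", "GBY", "RS", "SEL", "FS"]

works = split_at_reduce_sinks(PIPELINE)
# Naming mirrors the EXPLAIN output: first unit is the map, the rest are reducers.
names = ["Map 1"] + [f"Reducer {i}" for i in range(2, len(works) + 1)]

for name, ops in zip(names, works):
    print(f"{name}: {' → '.join(ops)}")
```

Under these assumptions the split yields exactly the three units in the table, which is a useful sanity check when you read EXPLAIN output for other queries.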

Inspect the structures:

grep -rn "class MapWork" ~/hive-src/ql/src/java/
grep -rn "class ReduceWork" ~/hive-src/ql/src/java/

Get this from Hive directly:

EXPLAIN SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

Look for the Stage: Stage-1 / Tez block and the per-vertex sections (Map 1, Reducer 2, Reducer 3).


Step 4: TezTask — The Boundary

The Hive-side execution entry point for a Tez query:

grep -n "public int execute" $(find ~/hive-src -name TezTask.java)

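TezTask.execute (the signature varies by version: execute(DriverContext) in Hive 2.x/3.x, execute() with no arguments in Hive 4; the grep above shows it for your checkout) does roughly: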

  1. Acquire a TezSessionState (existing pooled session or new one) via TezSessionPoolManager.
  2. Build a DAG from MapWork/ReduceWork via DagUtils.
  3. Submit the DAG via the session's TezClient.submitDAG.
  4. Block on the DAGClient for completion.
  5. Surface counters and diagnostics.

The DAG-building call:

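grep -n "DagUtils\|[uU]tils\.create" $(find ~/hive-src -name TezTask.java)

Expect calls to DagUtils.createVertex and DagUtils.createEdge, typically made from a TezTask.build(...) helper that assembles the DAG (exact names vary by Hive version, so trust the grep output).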


Step 5: DagUtils.createVertex / createEdge

The mapping from Hive Work units to Tez Vertex happens here:

find ~/hive-src -name "DagUtils.java"
grep -n "createVertex\|public Vertex " $(find ~/hive-src -name DagUtils.java)
grep -n "createEdge\|public Edge "     $(find ~/hive-src -name DagUtils.java)

For our query, DagUtils produces:

| Hive Work | Tez Vertex | Processor descriptor |
|---|---|---|
| MapWork "Map 1" | Vertex "Map 1" | MapTezProcessor |
| ReduceWork "Reducer 2" | Vertex "Reducer 2" | ReduceTezProcessor |
| ReduceWork "Reducer 3" | Vertex "Reducer 3" | ReduceTezProcessor |

And two edges:

| From | To | EdgeProperty kind |
|---|---|---|
| Map 1 | Reducer 2 | SCATTER_GATHER (shuffle) |
| Reducer 2 | Reducer 3 | SCATTER_GATHER (with a 1-task sink for total order) |

The "1-task sink for total order" is how Hive forces a single reducer for ORDER BY (no LIMIT): Reducer 3 has parallelism 1.
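
The vertex and edge lists above can be captured as a small data structure, which makes it easy to compare against what you later observe at runtime. This is a Python sketch of the topology only. It does not call the Tez API, and Vertex, Edge, and execution_order are hypothetical names chosen for illustration. It assumes Python 3.9+ for graphlib.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Vertex:
    name: str
    processor: str
    operators: tuple


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    movement: str


# Values copied from the tables in this lab.
VERTICES = [
    Vertex("Map 1", "MapTezProcessor", ("TS", "SEL", "GBY", "RS")),
    Vertex("Reducer 2", "ReduceTezProcessor", ("GBY", "RS")),
    Vertex("Reducer 3", "ReduceTezProcessor", ("SEL", "FS")),
]
EDGES = [
    Edge("Map 1", "Reducer 2", "SCATTER_GATHER"),
    Edge("Reducer 2", "Reducer 3", "SCATTER_GATHER"),
]


def execution_order(vertices, edges):
    """Return vertex names in dependency order; raises CycleError if not a DAG."""
    sorter = TopologicalSorter({v.name: set() for v in vertices})
    for e in edges:
        sorter.add(e.dst, e.src)  # e.dst depends on e.src
    return list(sorter.static_order())


order = execution_order(VERTICES, EDGES)
print(" -> ".join(order))
```

If your build produces a different topology, editing VERTICES and EDGES to match the EXPLAIN output keeps the same check usable.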


Step 6: The Submitted DAG

Once the DAG has been built, TezTask submits it via the session:

grep -n "submitDAG" $(find ~/hive-src -name TezTask.java)
grep -n "submitDAG" $(find ~/hive-src -name TezSessionState.java)

The call lands on TezClient.submitDAG(DAG dag) in tez-api:

grep -n "public DAGClient submitDAG" \
  $(find ~/tez-src/tez-api/src/main/java -name TezClient.java)

From there, the worked exercise in Step 2 of Reading the Codebase picks up.


Step 7: Validation — Labelled DAG Diagram

Build this diagram for our query and save it.

flowchart TD
  M1["Map 1<br/>processor: MapTezProcessor<br/>operators: TS → SEL → GBY → RS<br/>parallelism: numSplits(t)"]
  R2["Reducer 2<br/>processor: ReduceTezProcessor<br/>operators: GBY → RS<br/>parallelism: hive.exec.reducers.* tuning"]
  R3["Reducer 3<br/>processor: ReduceTezProcessor<br/>operators: SEL → FS<br/>parallelism: 1 (ORDER BY)"]
  M1 -->|"SCATTER_GATHER<br/>partition on a"| R2
  R2 -->|"SCATTER_GATHER<br/>partition on sort key"| R3

Capture this as your validation artifact (~/tez-notes/hive-h1-dag.md).


Step 8: Print the DAG via Hive

Hive has a setting to print a runtime summary of the executed DAG:

SET hive.tez.exec.print.summary=true;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

The summary, printed after the query, lists each vertex, its task count, and counters. Confirm the topology matches the diagram. (If you see four vertices, you may be on a build that splits ORDER BY differently; record the actual topology.)

For more detail, Tez can write a .dot file of the DAG into the AM log directory when tez.generate.debug.artifacts is enabled; Lab H2 uses that file.


Step 9: Counter Pop Quiz

After the query runs (with hive.tez.exec.print.summary=true), find:

| Counter | Where it lives | What it measures |
|---|---|---|
| INPUT_RECORDS_PROCESSED | Map 1 | Rows read from t |
| OUTPUT_RECORDS | Map 1 | Records emitted to shuffle (post partial-aggregate) |
| REDUCE_INPUT_GROUPS | Reducer 2 | Distinct a values seen |
| OUTPUT_RECORDS | Reducer 2 | Records to Reducer 3 |
| OUTPUT_RECORDS | Reducer 3 | Final result row count |

For our 6-row input with 3 distinct values of a, and assuming a single map task (the usual case for one small ORC file):

| Counter | Expected |
|---|---|
| Map 1 INPUT_RECORDS_PROCESSED | 6 |
| Map 1 OUTPUT_RECORDS | 3 (after partial GBY) |
| Reducer 2 REDUCE_INPUT_GROUPS | 3 |
| Reducer 2 OUTPUT_RECORDS | 3 |
| Reducer 3 OUTPUT_RECORDS | 3 |

Verify against your actual run.
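
The expected values can be reproduced with a small simulation of the two-stage aggregate. This Python sketch is an idealized model, not Hive code: map_task and simulate are hypothetical names, and it assumes the map-side GBY fully collapses each key within a task (a real map task may flush partial results early under memory pressure). It also shows why the number of map tasks matters for Map 1 OUTPUT_RECORDS.

```python
from collections import Counter

# The six rows inserted into t in Setup.
ROWS = [(1, "x"), (1, "y"), (2, "z"), (3, "p"), (3, "q"), (3, "r")]


def map_task(rows):
    """Partial aggregate: one (a, partial_count) record per distinct a in this task."""
    partial = Counter(a for a, _b in rows)
    return sorted(partial.items())


def simulate(splits):
    """Run one map task per split, then a final aggregate; return (counters, result)."""
    map_out = [rec for split in splits for rec in map_task(split)]
    totals = Counter()
    for a, n in map_out:
        totals[a] += n  # final aggregate sums the partial counts per key
    final = sorted(totals.items())  # the ORDER BY a stage
    counters = {
        "Map 1 INPUT_RECORDS_PROCESSED": sum(len(s) for s in splits),
        "Map 1 OUTPUT_RECORDS": len(map_out),
        "Reducer 2 REDUCE_INPUT_GROUPS": len(totals),
        "Reducer 2 OUTPUT_RECORDS": len(final),
        "Reducer 3 OUTPUT_RECORDS": len(final),
    }
    return counters, final


one_task, result = simulate([ROWS])
two_tasks, _ = simulate([ROWS[:4], ROWS[4:]])  # hypothetical two-split layout

print(result)
print(one_task)
print(two_tasks["Map 1 OUTPUT_RECORDS"])
```

With a single split the model matches the table above. With the hypothetical two-split layout, a = 3 appears in both map tasks, so Map 1 emits four records while every reducer-side counter stays at 3.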


Validation Artifacts

  1. The labelled mermaid DAG diagram saved at ~/tez-notes/hive-h1-dag.md.
  2. The EXPLAIN AST, EXPLAIN LOGICAL, and EXPLAIN outputs saved.
  3. The hive.tez.exec.print.summary output for the actual run.
  4. The counter table above, with your actual numbers filled in.
  5. The grep results for createVertex and createEdge in DagUtils.java saved as ~/tez-notes/hive-h1-dagutils.txt.

You can now trace a Hive query like this one from SQL through compilation to a Tez DAG. The next lab, Lab H2: Inspect the DAG, adds the production-grade techniques for capturing and inspecting that DAG at runtime.