Lab H1: SQL → DAG
Background
A user writes:
SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;
Hive compiles this into a Tez DAG with three vertices and two edges. This lab walks the
compilation path: parser → semantic analyzer → logical plan → physical plan
(MapWork/ReduceWork) → TezTask → DagUtils.createVertex/createEdge → submitted
DAG.
By the end you will have a labelled DAG diagram for this query and you will be able to trace any similar query from SQL to runtime topology.
Setup
cd ~/hive-src
git log --oneline -1 # know the version you're on
find . -name "TezTask.java" # boundary class
find . -name "DagUtils.java" # DAG construction
A representative test table (use Hive CLI or beeline):
CREATE TABLE t (a INT, b STRING)
STORED AS ORC;
INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');
The query under study:
SELECT a, COUNT(*) AS c
FROM t
GROUP BY a
ORDER BY a;
Step 1: Parser (lexing, AST)
Hive uses ANTLR. The grammar lives in:
find ~/hive-src -name "HiveParser.g" -o -name "HiveLexer.g"
The parser produces an AST. From the CLI:
EXPLAIN AST SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
You will see a Lisp-style tree:
(TOK_QUERY
  (TOK_FROM (TOK_TABREF (TOK_TABNAME t)))
  (TOK_INSERT
    (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT
      (TOK_SELEXPR (TOK_TABLE_OR_COL a))
      (TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT) c))
    (TOK_GROUPBY (TOK_TABLE_OR_COL a))
    (TOK_ORDERBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL a)))))
The AST is the input to the next phase.
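To get a feel for what the next phase consumes, you can load the tree above into nested lists and pull out the clauses that matter for DAG shape. This is a minimal sketch: `parse_sexpr` and `find_nodes` are hypothetical helpers written for this lab, not Hive code, and the sketch assumes your `EXPLAIN AST` output has the parenthesised form shown above.

```python
import re


def parse_sexpr(text):
    """Parse a parenthesised, Lisp-style tree into nested Python lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume the closing ")"
            return node
        return tok

    return read()


def find_nodes(tree, name):
    """Yield every subtree whose head token equals `name`."""
    if isinstance(tree, list) and tree:
        if tree[0] == name:
            yield tree
        for child in tree[1:]:
            yield from find_nodes(child, name)


AST_TEXT = """
(TOK_QUERY
  (TOK_FROM (TOK_TABREF (TOK_TABNAME t)))
  (TOK_INSERT
    (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT
      (TOK_SELEXPR (TOK_TABLE_OR_COL a))
      (TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT) c))
    (TOK_GROUPBY (TOK_TABLE_OR_COL a))
    (TOK_ORDERBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL a)))))
"""

ast = parse_sexpr(AST_TEXT)
group_by = next(find_nodes(ast, "TOK_GROUPBY"))
order_by = next(find_nodes(ast, "TOK_ORDERBY"))
select_exprs = list(find_nodes(ast, "TOK_SELEXPR"))

print(group_by)           # the GROUP BY clause subtree
print(order_by)           # the ORDER BY clause subtree
print(len(select_exprs))  # number of select expressions
```

The two clauses to watch are TOK_GROUPBY and TOK_ORDERBY: each one ends up requiring a shuffle, which is why the final DAG has two edges.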
Step 2: Semantic Analyzer
The AST goes through SemanticAnalyzer:
find ~/hive-src -name "SemanticAnalyzer.java" | head
It resolves table references, expands *, type-checks aggregates, and produces a
Query Block (QB) tree → Operator tree (logical plan).
EXPLAIN LOGICAL SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
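You see operators like TS (TableScan), SEL (Select), GBY (GroupBy), RS
(ReduceSink), FS (FileSink). A GROUP BY ... ORDER BY typically shows two GBY
operators (a map-side partial aggregate and a reduce-side final aggregate) and two
RS operators (one feeding the aggregation shuffle, one feeding the sort).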
Step 3: Physical Plan — MapWork, ReduceWork
The logical operator tree is converted to a physical plan whose top-level units are
MapWork, ReduceWork, and MergeJoinWork. For our query, Hive produces three
Work units:
| Work | Purpose | Operators inside |
|---|---|---|
| MapWork (Map 1) | Read t, partial aggregate by a | TS → SEL → GBY → RS |
| ReduceWork (Reducer 2) | Final aggregate by a, prepare for sort | GBY → RS |
| ReduceWork (Reducer 3) | Total-order sort by a, write output | SEL → FS |
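One way to see where the three Work units come from is to cut the operator chain after every RS, since each ReduceSink marks a shuffle boundary. The sketch below is a simplified model of that idea, not Hive's actual compiler code: `split_at_reduce_sinks` and `name_units` are hypothetical names, the chain is assumed to be linear, and the "Reducer N" numbering is assumed to follow position in the chain.

```python
def split_at_reduce_sinks(operators):
    """Cut a linear operator chain after every RS (ReduceSink).

    Simplified model: the operators up to and including an RS form one
    unit of work; whatever follows starts the next unit.
    """
    units, current = [], []
    for op in operators:
        current.append(op)
        if op == "RS":
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units


def name_units(units):
    """Label the first unit 'Map 1' and each later unit 'Reducer N'."""
    named = {}
    for i, ops in enumerate(units, start=1):
        label = "Map 1" if i == 1 else f"Reducer {i}"
        named[label] = " → ".join(ops)
    return named


# Operator chain for the query under study, as listed in the table above.
chain = ["TS", "SEL", "GBY", "RS", "GBY", "RS", "SEL", "FS"]
work_units = name_units(split_at_reduce_sinks(chain))

for label, ops in work_units.items():
    print(f"{label}: {ops}")
```

Two RS operators give two cuts, hence three units. Queries with joins or unions produce branching operator trees, which this linear model does not cover.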
Inspect the structures:
grep -rn "class MapWork" ~/hive-src/ql/src/java/
grep -rn "class ReduceWork" ~/hive-src/ql/src/java/
Get this from Hive directly:
EXPLAIN SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
Look for the Stage: Stage-1 / Tez block and the per-vertex sections (Map 1,
Reducer 2, Reducer 3).
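If you save the EXPLAIN output, a few lines of Python can recover the topology from its Edges section. This is a sketch under assumptions: the sample text below is what the Edges section commonly looks like, but the layout can differ between Hive versions, and `parse_edges` is a hypothetical helper that handles only lines with a single source vertex. Note that Hive prints its own edge label here (SIMPLE_EDGE), which, as far as this lab is concerned, corresponds to the Tez SCATTER_GATHER edges discussed in Step 5.

```python
import re

# Assumed sample of the "Edges:" part of EXPLAIN output; check it against
# what your Hive version actually prints.
EXPLAIN_SNIPPET = """\
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
"""

# "<destination> <- <source> (<EDGE_KIND>)", one source per line.
EDGE_LINE = re.compile(r"^\s*(?P<dst>.+?) <- (?P<src>.+?) \((?P<kind>\w+)\)\s*$")


def parse_edges(explain_text):
    """Return (source, destination, kind) tuples from single-source edge lines."""
    edges = []
    for line in explain_text.splitlines():
        m = EDGE_LINE.match(line)
        if m:
            edges.append((m.group("src"), m.group("dst"), m.group("kind")))
    return edges


edges = parse_edges(EXPLAIN_SNIPPET)
vertices = sorted({v for src, dst, _ in edges for v in (src, dst)})

print(edges)
print(vertices)
```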
Step 4: TezTask — The Boundary
The Hive-side execution entry point for a Tez query:
grep -n "public int execute" $(find ~/hive-src -name TezTask.java)
TezTask.execute (its parameter list varies by Hive version: some versions take a DriverContext, newer ones take no arguments) does roughly:
- Acquire a TezSessionState (an existing pooled session or a new one) via TezSessionPoolManager.
- Build a DAG from MapWork/ReduceWork via DagUtils.
- Submit the DAG through the session's TezClient.submitDAG.
- Block on the DAGClient for completion.
- Surface counters and diagnostics.
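The order of those five steps is easier to remember as a skeleton. The sketch below is illustrative only: every name in it (`execute`, `StubSession`, `StubDagClient`, `submit_dag`, `wait_for_completion`) is a hypothetical stand-in so that the code runs without Hive or Tez, and none of it mirrors the real Java signatures.

```python
def execute(work_units, acquire_session, build_dag, report):
    """Run the five steps in order and return a process-style exit code."""
    session = acquire_session()              # 1. pooled or new session
    dag = build_dag(work_units)              # 2. Work units -> DAG
    client = session.submit_dag(dag)         # 3. submit through the session
    status = client.wait_for_completion()    # 4. block until the DAG finishes
    report(status)                           # 5. counters and diagnostics
    return 0 if status["state"] == "SUCCEEDED" else 1


# Hypothetical stand-ins so the sketch runs on its own.
class StubDagClient:
    def __init__(self, dag):
        self.dag = dag

    def wait_for_completion(self):
        return {"state": "SUCCEEDED", "vertex_count": len(self.dag["vertices"])}


class StubSession:
    def submit_dag(self, dag):
        return StubDagClient(dag)


reports = []
rc = execute(
    work_units=["Map 1", "Reducer 2", "Reducer 3"],
    acquire_session=StubSession,
    build_dag=lambda works: {"vertices": list(works)},
    report=reports.append,
)
print(rc, reports)
```

When you read the real method, try to mark which lines correspond to each numbered comment.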
The DAG-building call:
grep -n "DagUtils\|dagUtils\.create" $(find ~/hive-src -name TezTask.java)
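What you see depends on the Hive version. Typically TezTask assembles the DAG in a
helper method of its own (often named build) that calls createVertex and createEdge
on a DagUtils instance. If the second pattern matches nothing, the field may simply
have a different name (for example utils); widen the grep accordingly.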
Step 5: DagUtils.createVertex / createEdge
The mapping from Hive Work units to Tez Vertex happens here:
find ~/hive-src -name "DagUtils.java"
grep -n "createVertex\|public Vertex " $(find ~/hive-src -name DagUtils.java)
grep -n "createEdge\|public Edge " $(find ~/hive-src -name DagUtils.java)
For our query, DagUtils produces:
| Hive Work | Tez Vertex | Processor descriptor |
|---|---|---|
| MapWork "Map 1" | Vertex "Map 1" | MapTezProcessor |
| ReduceWork "Reducer 2" | Vertex "Reducer 2" | ReduceTezProcessor |
| ReduceWork "Reducer 3" | Vertex "Reducer 3" | ReduceTezProcessor |
And two edges:
| From | To | EdgeProperty kind |
|---|---|---|
| Map 1 | Reducer 2 | SCATTER_GATHER (shuffle) |
| Reducer 2 | Reducer 3 | SCATTER_GATHER (with a 1-task sink for total order) |
The "1-task sink for total order" is how Hive forces a single reducer for ORDER BY
(no LIMIT): Reducer 3 has parallelism 1.
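The two tables above describe a small graph, and it can help to hold it in a data structure you can check. The following is a minimal sketch of that mapping, not the Tez API: `Dag`, `add_vertex`, `add_edge`, and `PROCESSOR_FOR` are hypothetical names chosen for this lab.

```python
from dataclasses import dataclass, field


@dataclass
class Dag:
    vertices: dict = field(default_factory=dict)  # vertex name -> processor name
    edges: list = field(default_factory=list)     # (source, destination, kind)

    def add_vertex(self, name, processor):
        if name in self.vertices:
            raise ValueError(f"duplicate vertex: {name}")
        self.vertices[name] = processor

    def add_edge(self, src, dst, kind):
        for end in (src, dst):
            if end not in self.vertices:
                raise ValueError(f"unknown vertex: {end}")
        self.edges.append((src, dst, kind))

    def topological_order(self):
        """Kahn's algorithm: upstream vertices come before downstream ones."""
        indegree = {v: 0 for v in self.vertices}
        for _, dst, _ in self.edges:
            indegree[dst] += 1
        ready = [v for v, d in indegree.items() if d == 0]
        order = []
        while ready:
            v = ready.pop(0)
            order.append(v)
            for src, dst, _ in self.edges:
                if src == v:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.vertices):
            raise ValueError("cycle detected; not a DAG")
        return order


# Work kind -> processor, as in the first table above.
PROCESSOR_FOR = {"MapWork": "MapTezProcessor", "ReduceWork": "ReduceTezProcessor"}
works = [("Map 1", "MapWork"), ("Reducer 2", "ReduceWork"), ("Reducer 3", "ReduceWork")]

dag = Dag()
for name, kind in works:
    dag.add_vertex(name, PROCESSOR_FOR[kind])
dag.add_edge("Map 1", "Reducer 2", "SCATTER_GATHER")
dag.add_edge("Reducer 2", "Reducer 3", "SCATTER_GATHER")

print(dag.topological_order())
```

The topological order is also the order in which the vertices can start consuming input, which is a useful sanity check when you later compare against a real run.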
Step 6: The Submitted DAG
Once the DAG is built, TezTask submits it via the session:
grep -n "submitDAG" $(find ~/hive-src -name TezTask.java)
grep -n "submitDAG" $(find ~/hive-src -name TezSessionState.java)
The call lands on TezClient.submitDAG(DAG dag) in tez-api:
grep -n "public DAGClient submitDAG" \
$(find ~/tez-src/tez-api/src/main/java -name TezClient.java)
From there, the worked exercise in Step 2 of Reading the Codebase picks up.
Step 7: Validation — Labelled DAG Diagram
Build this diagram for our query and save it.
flowchart TD
M1["Map 1<br/>processor: MapTezProcessor<br/>operators: TS → SEL → GBY → RS<br/>parallelism: numSplits(t)"]
R2["Reducer 2<br/>processor: ReduceTezProcessor<br/>operators: GBY → RS<br/>parallelism: hive.exec.reducers.* tuning"]
R3["Reducer 3<br/>processor: ReduceTezProcessor<br/>operators: SEL → FS<br/>parallelism: 1 (ORDER BY)"]
M1 -->|"SCATTER_GATHER<br/>partition on a"| R2
R2 -->|"SCATTER_GATHER<br/>partition on sort key"| R3
Capture this as your validation artifact (~/tez-notes/hive-h1-dag.md).
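If you expect to repeat this lab for other queries, you may prefer to generate the diagram text instead of typing it. The sketch below is one possible way to do that: `to_mermaid` is a hypothetical helper, and it emits only the flowchart syntax used in the diagram above.

```python
def to_mermaid(vertices, edges):
    """Render vertices and edges as Mermaid flowchart text.

    vertices: list of (node_id, [label lines])
    edges:    list of (source_id, destination_id, label)
    """
    lines = ["flowchart TD"]
    for node_id, label_lines in vertices:
        label = "<br/>".join(label_lines)
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, label in edges:
        lines.append(f'    {src} -->|"{label}"| {dst}')
    return "\n".join(lines)


vertices = [
    ("M1", ["Map 1", "processor: MapTezProcessor",
            "operators: TS → SEL → GBY → RS"]),
    ("R2", ["Reducer 2", "processor: ReduceTezProcessor",
            "operators: GBY → RS"]),
    ("R3", ["Reducer 3", "processor: ReduceTezProcessor",
            "operators: SEL → FS", "parallelism: 1 (ORDER BY)"]),
]
edges = [
    ("M1", "R2", "SCATTER_GATHER"),
    ("R2", "R3", "SCATTER_GATHER"),
]

diagram = to_mermaid(vertices, edges)
print(diagram)
```

Paste the printed text into your notes file; the vertex and edge lists are the only parts you need to change for a different query.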
Step 8: Print the DAG via Hive
Hive has a setting to print a runtime summary of the executed DAG:
SET hive.tez.exec.print.summary=true;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;
The summary, printed after the query, lists each vertex, its task count, and counters.
Confirm the topology matches the diagram. (If you see four vertices, you may be on a
build that splits ORDER BY differently; record the actual topology.)
For more detail, tez.am.dag.dot.file.location writes a .dot file — used in
Lab H2.
Step 9: Counter Pop Quiz
After the query runs (with hive.tez.exec.print.summary=true), find:
| Counter | Where it lives | What it measures |
|---|---|---|
| INPUT_RECORDS_PROCESSED | Map 1 | Rows read from t |
| OUTPUT_RECORDS | Map 1 | Records emitted to shuffle (post partial-aggregate) |
| REDUCE_INPUT_GROUPS | Reducer 2 | Distinct a values seen |
| OUTPUT_RECORDS | Reducer 2 | Records to Reducer 3 |
| OUTPUT_RECORDS | Reducer 3 | Final result row count |
For our 6-row input with 3 distinct values of a:
| Counter | Expected |
|---|---|
| Map 1 INPUT_RECORDS_PROCESSED | 6 |
| Map 1 OUTPUT_RECORDS | 3 (after partial GBY, assuming a single map task) |
| Reducer 2 REDUCE_INPUT_GROUPS | 3 |
| Reducer 2 OUTPUT_RECORDS | 3 |
| Reducer 3 OUTPUT_RECORDS | 3 |
Verify against your actual run.
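The expected values can be derived by hand from the six inserted rows. The sketch below does that arithmetic with a simplified model: `partial_aggregate` and `expected_counters` are hypothetical helpers, a "split" is just a list of rows handled by one map task, and the model assumes the map-side GBY emits exactly one record per distinct key per task.

```python
from collections import Counter

# The six rows inserted during Setup.
rows = [(1, "x"), (1, "y"), (2, "z"), (3, "p"), (3, "q"), (3, "r")]


def partial_aggregate(split):
    """Map-side GBY: one (key, partial_count) record per distinct key in a split."""
    return list(Counter(a for a, _ in split).items())


def expected_counters(splits):
    """Return (counter dict, final result rows) for a list of input splits."""
    map_output = [rec for split in splits for rec in partial_aggregate(split)]
    final = Counter()
    for key, partial in map_output:
        final[key] += partial  # Reducer 2 merges the partial counts
    counters = {
        "Map 1 INPUT_RECORDS_PROCESSED": sum(len(s) for s in splits),
        "Map 1 OUTPUT_RECORDS": len(map_output),
        "Reducer 2 REDUCE_INPUT_GROUPS": len(final),
        "Reducer 2 OUTPUT_RECORDS": len(final),
        "Reducer 3 OUTPUT_RECORDS": len(final),
    }
    return counters, sorted(final.items())


# One map task sees all six rows.
one_task, result = expected_counters([rows])
# Two map tasks, with the rows for a=1 landing in different splits.
two_tasks, _ = expected_counters([rows[:1], rows[1:]])

print(one_task)
print(result)
print(two_tasks["Map 1 OUTPUT_RECORDS"])
```

With a single map task the model gives 6, 3, 3, 3, 3 and the result rows (1, 2), (2, 1), (3, 3). In the two-task case Map 1 OUTPUT_RECORDS rises to 4 because the key a=1 is partially aggregated twice, which is one reason your actual numbers may differ from the table.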
Validation Artifacts
- The labelled mermaid DAG diagram saved at ~/tez-notes/hive-h1-dag.md.
- The EXPLAIN AST, EXPLAIN LOGICAL, and EXPLAIN outputs saved.
- The hive.tez.exec.print.summary output for the actual run.
- The counter table above, with your actual numbers filled in.
- The grep results for createVertex and createEdge in DagUtils.java saved as ~/tez-notes/hive-h1-dagutils.txt.
You can now trace any Hive query through compilation to a Tez DAG. The next lab — Lab H2: Inspect the DAG — adds the production-grade techniques for capturing and inspecting that DAG at runtime.