Stage 7 — Hive-on-Tez Compatibility
What this stage teaches
Stage 7 is the cross-project stage. You learn:
- The largest consumer of Tez in production is Hive. Bugs that look like Tez bugs are often Hive bugs that surface through Tez, and vice versa.
- The contracts Hive depends on: DAGPlan size limits, edge property serialisation, session reuse via TezSessionPoolManager, and the HiveSplitGenerator event protocol.
- The attribution decision tree: when to file on TEZ, when on HIVE, and when on both with a cross-reference.
- The release-train interplay: Hive 3.x ships a specific Tez version; Hive 4.x ships a different one. A "fix" in Tez master does not automatically reach a Hive user until the next Hive release picks up a Tez release.
- How to write an attribution argument in a JIRA description so committers in both projects agree on ownership before any code is written.
The "patch" deliverable for Stage 7 is often a JIRA, not code. A correct attribution call is the contribution; the code may be one line in each project or zero lines in Tez and a workaround in Hive.
JIRA filter to find candidates
project = TEZ AND text ~ "Hive" AND resolution = Unresolved
ORDER BY updated DESC
Then on the Hive side:
project = HIVE
AND (text ~ "Tez" OR text ~ "TezSession" OR text ~ "DAGPlan"
OR text ~ "VertexManagerPlugin")
AND resolution = Unresolved
ORDER BY updated DESC
Cross-reference: a TEZ- ticket linked to a HIVE- ticket is a Stage 7
opportunity. The contribution is reading both, choosing the owner project, and
writing the attribution.
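Spotting linked tickets can be partly mechanised. The following is a minimal sketch, assuming you have the ticket text (description plus comments) as a plain string; the class and method names are hypothetical, and it only looks for literal keys such as TEZ-1234 or HIVE-5678, not for JIRA issue links.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tez or Hive.
class TicketRefs {
    // Matches JIRA keys for the two projects, e.g. TEZ-1234 or HIVE-5678.
    private static final Pattern KEY = Pattern.compile("\\b(TEZ|HIVE)-(\\d+)\\b");

    /** Returns the distinct keys of one project found in the text, in order of appearance. */
    static Set<String> keysFor(String project, String text) {
        Set<String> keys = new LinkedHashSet<>();
        Matcher m = KEY.matcher(text);
        while (m.find()) {
            if (m.group(1).equals(project)) {
                keys.add(m.group());
            }
        }
        return keys;
    }

    /** True when a ticket owned by ownProject mentions at least one key from the other project. */
    static boolean isCrossProjectCandidate(String ownProject, String text) {
        String other = ownProject.equals("TEZ") ? "HIVE" : "TEZ";
        return !keysFor(other, text).isEmpty();
    }
}
```

A ticket that mentions "Hive" in prose but carries no HIVE- key is not flagged, which is usually what you want: the filter above already finds those, and the missing counter-ticket is itself a finding.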
The attribution decision tree
Given a symptom, walk this tree:
Is the symptom observed in a non-Hive Tez workload?
├── Yes → Tez bug. File on TEZ. Stage 4–6 patch.
└── No (Hive-specific)
    │
    Does the symptom depend on a Hive class on the stack trace?
    ├── Yes (Hive frame is the top user-code frame)
    │   │
    │   Is the Tez API contract being misused by Hive?
    │   ├── Yes → HIVE bug. File on HIVE. Tez may need a clearer
    │   │         contract / better exception message; file
    │   │         a follow-up TEZ ticket.
    │   └── No  → Possibly a Tez API contract gap. File on TEZ
    │             with a Hive repro, link the HIVE ticket.
    │
    └── No (the bug surfaces inside Tez code triggered by Hive's DAG)
        │
        Does the Hive DAG exercise an edge case Tez tests don't cover?
        ├── Yes → Tez bug. File on TEZ. Add a Tez-side test that
        │         reproduces the shape without Hive.
        └── No  → File a `cross-project` ticket on TEZ with a
                  HIVE counter-ticket; sort ownership on dev@.
The tree is not law. It is the start of a dev@ conversation.
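For readers who prefer code to diagrams, the tree can be written as a pure function of its four questions. This is only a restatement of the diagram; the enum and parameter names are made up for illustration, and the result is a starting position for the dev@ thread rather than a ruling.

```java
// Hypothetical encoding of the attribution tree; names are illustrative only.
class Attribution {
    enum Verdict {
        TEZ_BUG,                 // file on TEZ, Stage 4-6 patch
        HIVE_BUG_TEZ_FOLLOW_UP,  // file on HIVE, optional TEZ follow-up for a clearer contract
        TEZ_CONTRACT_GAP,        // file on TEZ with a Hive repro, link the HIVE ticket
        TEZ_BUG_NEEDS_TEZ_TEST,  // file on TEZ, reproduce the DAG shape without Hive
        CROSS_PROJECT            // TEZ ticket plus HIVE counter-ticket, settle on dev@
    }

    static Verdict attribute(boolean seenInNonHiveWorkload,
                             boolean hiveFrameOnTop,
                             boolean hiveMisusesTezContract,
                             boolean uncoveredTezEdgeCase) {
        // First question: does the symptom reproduce without Hive at all?
        if (seenInNonHiveWorkload) {
            return Verdict.TEZ_BUG;
        }
        // Hive-specific, with a Hive class as the top user-code frame.
        if (hiveFrameOnTop) {
            return hiveMisusesTezContract
                    ? Verdict.HIVE_BUG_TEZ_FOLLOW_UP
                    : Verdict.TEZ_CONTRACT_GAP;
        }
        // Hive-specific, but the failure surfaces inside Tez code.
        return uncoveredTezEdgeCase
                ? Verdict.TEZ_BUG_NEEDS_TEZ_TEST
                : Verdict.CROSS_PROJECT;
    }
}
```

Note that the later questions are only consulted on the branch where they apply, so `attribute(true, ...)` ignores its remaining arguments.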
Walked example A — DAGPlan size exceeds limit on Hive autogenerated DAG
Symptom: a Hive 3.1 query with a large IN list (10k+ literals) submits a DAG
that fails at TezClient.submitDAG with:
TezException: DAGPlan too large
The Tez default limit is 64 MiB (67,108,864 bytes) on the wire. Hive can in principle stay under it, but
the codegen path for very large IN lists does not bound the size of the plan it emits.
Step 1 — Attribution
Walk the tree:
- Non-Hive workload? No, Hive-specific.
- Hive on stack? Yes, HiveSplitGenerator.
- Is Hive misusing Tez API? No: DAGPlan is exactly the wire format Tez expects; Hive is sending a legitimate but large payload.
- Is this an edge case Tez tests don't cover? Yes: Tez tests submit small DAGPlans.
Conclusion: this is a Tez API contract gap that Hive happens to hit first. The fix is twofold:
- Tez side: raise the configurable limit and improve the error message to tell the operator which key to bump. File on TEZ.
- Hive side: paginate the IN list literal codegen. File on HIVE.
The Tez patch is small and lands first.
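The Hive-side half is out of scope for the Tez patch, but it helps to know what "paginate" means when you describe it in the HIVE ticket. The sketch below shows only the batching idea, assuming the literals are available as a list; where Hive would apply it in its codegen is not covered here, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of batching a large literal list; not Hive code.
class InListPager {
    /** Splits literals into consecutive pages of at most pageSize elements. */
    static <T> List<List<T>> paginate(List<T> literals, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<List<T>> pages = new ArrayList<>();
        for (int start = 0; start < literals.size(); start += pageSize) {
            int end = Math.min(start + pageSize, literals.size());
            // Copy the view so each page is independent of the source list.
            pages.add(new ArrayList<>(literals.subList(start, end)));
        }
        return pages;
    }
}
```

Each page can then be serialised separately, so no single payload grows with the full length of the IN list.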
Step 2 — The Tez-side diff
--- a/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
@@
+  /**
+   * Maximum size (bytes) of the serialised {@link DAGPlan} that the AM
+   * accepts in a single submission. The default of 64 MiB is a Hadoop
+   * IPC limit. Operators submitting very large DAGs (typically generated
+   * by upstream query engines) may need to raise this.
+   * @since 0.10.4
+   */
+  public static final String TEZ_DAG_PLAN_MAX_BYTES =
+      TEZ_PREFIX + "dag.plan.max.bytes";
+  public static final int TEZ_DAG_PLAN_MAX_BYTES_DEFAULT = 64 * 1024 * 1024;
And in tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java:
-    if (serialised.length > 64 * 1024 * 1024) {
-      throw new TezException("DAGPlan too large");
+    // Read the cap from configuration instead of hard-coding 64 MiB.
+    int max = conf.getInt(TezConfiguration.TEZ_DAG_PLAN_MAX_BYTES,
+        TezConfiguration.TEZ_DAG_PLAN_MAX_BYTES_DEFAULT);
+    if (serialised.length > max) {
+      throw new TezException(String.format(
+          "DAGPlan serialised size %d exceeds limit %d. "
+              + "Raise %s on the submitter and AM, or reduce DAGPlan size "
+              + "(typically by pruning literal lists or split metadata).",
+          serialised.length, max, TezConfiguration.TEZ_DAG_PLAN_MAX_BYTES));
     }
The patch makes the limit explicit, configurable, and self-describing.
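To reason about the boundary behaviour without a Tez build, the check can be modelled in plain Java. This is a sketch under stated assumptions: a `Map` stands in for the Hadoop `Configuration`, `IllegalStateException` stands in for `TezException`, and the key is spelled out as `tez.dag.plan.max.bytes`, the name used in the JIRA description. The class name is hypothetical.

```java
import java.util.Map;

// Standalone model of the size check; not the DAGAppMaster code itself.
class PlanSizeGuard {
    static final String TEZ_DAG_PLAN_MAX_BYTES = "tez.dag.plan.max.bytes";
    static final int TEZ_DAG_PLAN_MAX_BYTES_DEFAULT = 64 * 1024 * 1024;

    /** Throws if serialisedSize is strictly greater than the configured limit. */
    static void check(long serialisedSize, Map<String, String> conf) {
        String raw = conf.get(TEZ_DAG_PLAN_MAX_BYTES);
        // Fall back to the default when the key is not set.
        int max = (raw == null)
                ? TEZ_DAG_PLAN_MAX_BYTES_DEFAULT
                : Integer.parseInt(raw.trim());
        if (serialisedSize > max) {
            throw new IllegalStateException(String.format(
                    "DAGPlan serialised size %d exceeds limit %d. "
                            + "Raise %s on the submitter and AM, or reduce DAGPlan size "
                            + "(typically by pruning literal lists or split metadata).",
                    serialisedSize, max, TEZ_DAG_PLAN_MAX_BYTES));
        }
    }
}
```

Because the comparison is strict, a plan of exactly 67,108,864 bytes passes under the default and one byte more fails; a test such as testDAGPlanSizeLimitConfigurable would want to pin both sides of that boundary as well as the raised-limit case.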
Step 3 — The JIRA description (attribution argument)
Summary: Make DAGPlan size limit configurable and self-describing
Description:
Hive's HiveSplitGenerator can generate DAGPlans > 64MiB for queries with
very large IN lists. Currently Tez throws "DAGPlan too large" with no
actionable advice. The Hive side will paginate (HIVE-NNNNN), but Tez
should:
1. Expose tez.dag.plan.max.bytes so operators can raise the cap.
2. Produce an error message that names the key and the cause.
Attribution rationale:
- This is a Tez API contract gap: legitimate DAGPlans should not be
silently rejected with no recourse.
- Hive is the first downstream that hits this; other DAG generators
(Pig-on-Tez, custom DAGs from BI tools) will hit it next.
- HIVE-NNNNN is filed in parallel for the codegen pagination.
Tests:
- TestDAGAppMaster#testDAGPlanSizeLimitConfigurable
- End-to-end repro left to HIVE-NNNNN (Tez has no test that builds a
pathological 64MiB DAGPlan).
This is the cross-project pattern: the TEZ ticket cites HIVE-NNNNN explicitly, states the attribution rationale, and stops short of fixing Hive's behaviour.
Walked example B — edge property mismatch on Hive upgrade
Symptom: after upgrading Hive 3.1 → 3.2, certain queries fail with:
TezException: EdgeProperty mismatch on edge v1->v2: source class
org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput
does not match sink class
org.apache.tez.runtime.library.input.UnorderedKVInput
Tez rejects the DAG because the edge wiring is inconsistent.
Attribution: Hive 3.2 emitted a different sink type for that vertex. Tez is behaving correctly — it is enforcing the edge contract. This is a HIVE bug. File on HIVE. The Tez side requires no patch.
The contribution here is the attribution itself plus a Tez-side documentation
note on the validator: "see EdgeProperty.checkCompatible for the rules
enforced." Add a docs patch (Stage 1) if no such note exists.
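When you explain the mismatch in the HIVE ticket, it helps to show what a consistent pairing looks like. The sketch below is a simplified, hypothetical pairing table for three output classes from the Tez runtime library and the input class each is normally wired to; it is not the Tez validator, and the real rules admit more combinations than this table lists.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative pairing check; the table is a simplification, not Tez's own rules.
class EdgeWiringCheck {
    private static final String LIB = "org.apache.tez.runtime.library.";
    private static final Map<String, String> EXPECTED_INPUT = new HashMap<>();
    static {
        EXPECTED_INPUT.put(LIB + "output.OrderedPartitionedKVOutput",
                LIB + "input.OrderedGroupedKVInput");
        EXPECTED_INPUT.put(LIB + "output.UnorderedKVOutput",
                LIB + "input.UnorderedKVInput");
        EXPECTED_INPUT.put(LIB + "output.UnorderedPartitionedKVOutput",
                LIB + "input.UnorderedKVInput");
    }

    /** Returns null when the pair is consistent or unknown, otherwise a mismatch message. */
    static String mismatch(String edge, String outputClass, String inputClass) {
        String expected = EXPECTED_INPUT.get(outputClass);
        if (expected == null || expected.equals(inputClass)) {
            return null; // unknown outputs are not judged by this sketch
        }
        return String.format(
                "EdgeProperty mismatch on edge %s: source class %s does not match sink class %s",
                edge, outputClass, inputClass);
    }
}
```

Fed the two classes from the symptom, the sketch reports a mismatch: an ordered, partitioned output is paired with an unordered input, which is the inconsistency Hive introduced.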
Walked example C — TezSessionPoolManager reuse leak
Symptom: HiveServer2 uses TezSessionPoolManager to reuse AMs across queries.
A specific Hive query path leaves the session in a state where the next query
sees stale credentials.
Attribution: TezSessionPoolManager is a Hive class (in the Hive repo),
even though it manages TezClient instances. Find it:
grep -rn "class TezSessionPoolManager" ~/hive-src/ql/src/java
The bug is in Hive. The Tez API used (TezClient.start()) is correct.
File on HIVE. The Tez contribution is zero code; it is the attribution call and the explanation in the JIRA comments that prevents the ticket bouncing.
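The class of bug is easy to picture with a toy pool. The sketch below is not Hive's TezSessionPoolManager or TezSessionState; every name in it is hypothetical, and it only illustrates why per-query state has to be reset by whoever owns the pool, which here is the client side rather than Tez.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy model of session reuse; stands in for the pattern, not for Hive's classes.
class ToySessionPool {
    /** Stand-in for a pooled session holding per-query state. */
    static final class Session {
        final int id;
        final Map<String, String> credentials = new HashMap<>();
        Session(int id) { this.id = id; }
    }

    private final Deque<Session> idle = new ArrayDeque<>();
    private int nextId = 0;

    /** Hands out an idle session if one exists, otherwise creates a new one. */
    Session borrow() {
        Session s = idle.poll();
        return (s != null) ? s : new Session(nextId++);
    }

    /** Returns a session to the pool, clearing per-query state first. */
    void giveBack(Session s) {
        // Without this reset the next borrower would see the previous query's credentials.
        s.credentials.clear();
        idle.push(s);
    }
}
```

The underlying session object is reused as-is in both the correct and the buggy version; the difference lies entirely in the pool's return path, which is why the attribution lands on the project that owns the pool.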
Reading the Hive code path for attribution
Even though you may not commit to Hive, you must be able to read the Hive classes that touch Tez:
- org.apache.hadoop.hive.ql.exec.tez.DagUtils: Hive's DAG builder.
- org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator: Hive's input split generation, which Tez runs in the AM as an InputInitializer.
- org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager: session reuse.
- org.apache.hadoop.hive.ql.exec.tez.TezSessionState: per-session state.
Keep a Hive checkout next to your Tez checkout:
git clone https://github.com/apache/hive ~/hive-src
A grep across both:
grep -rn "DAGPlan\|VertexManagerPluginDescriptor" ~/hive-src/ql/src/java | head
is the start of every Stage 7 investigation.
Pitfalls
- Don't fix a Hive bug in Tez. Even if the symptom appears on a Tez stack frame, do not patch Tez to work around an incorrect Hive use of the API. You will trap Tez into supporting buggy clients forever.
- Don't expand a Tez API to "make Hive easier". That is a Stage 11 patch with a dev@ design thread; not a Stage 7 patch.
- Don't assume the Hive committers will read your TEZ ticket. CC the appropriate Hive committers explicitly, or post a short note on dev@hive.apache.org linking the JIRA.
- Don't promise a Tez backport to a specific Hive release. Release alignment is a separate conversation; you control your patch's landing in Tez, not when Hive picks it up.
- Don't file the same bug on both projects without distinguishing the work. TEZ-NNNN should fix the Tez side; HIVE-NNNN should fix the Hive side; each ticket should cross-reference the other and say exactly what code lives in which project.
- Don't break older Hive versions to fix newer ones. A Tez change that raises the minimum required Hive version is a Stage 11 / Stage 12 call.
Exit criteria — when you're ready for the next stage
Move to Stage 8 when:
- You have correctly attributed at least one symptom to HIVE (saving Tez from an incorrect patch) and one to TEZ (with a Hive counter-ticket).
- You have a ~/hive-src checkout next to ~/tez-src and have grepped across both at least three times during real investigation.
- You can describe the lifecycle of a TezSessionState from creation to reuse to teardown in five sentences.
- You have read EdgeProperty.checkCompatible and know which mismatches the Tez validator does and does not flag.
Stage 8 takes you into the YARN integration layer.