Lab H6: Writing a Diagnostic Patch
Background
You have a Hive-on-Tez bug report from production. You can't reproduce locally (Lab H5 didn't work). You need more data. The way to get it is a diagnostic patch — a small change that adds logging, counters, or a debug toggle without changing behavior, attached to the JIRA, that the reporter can apply and re-run.
A well-shaped diagnostic patch:
- Adds boundary-INFO logging at the suspected fault site.
- Adds a
TezCounterso the data is captured in the standard counter mechanism. - Adds a debug-only
TezConfigurationswitch so the cost is opt-in.
This lab walks the three patterns.
Pattern 1: Boundary INFO Logging
A "boundary" is the point at which control flows from one subsystem to another:
| Boundary | Example |
|---|---|
| Hive → Tez submit | TezTask.execute → TezSession.submitDAG |
| Tez AM → Container | DAGAppMaster.scheduleTaskAttempt → ContainerLauncherImpl.launch |
| Container → Task | LogicalIOProcessorRuntimeTask.run → Processor.run |
| Task → Input shuffle | OrderedGroupedKVInput.start |
| Task → Output shuffle | OrderedPartitionedKVOutput.start |
INFO at a boundary is cheap, lasts the lifetime of a task, and gives the next debugger a structured trail.
Example patch (illustrative diff)
Suppose the bug is "DAG submission occasionally takes >10s on large DAGs." A diagnostic
patch in TezTask:
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
index abcdef1..2345678 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
@@ -201,7 +201,12 @@ public class TezTask extends Task<TezWork> {
private DAGClient submit(DAG dag, TezSessionState session) throws Exception {
+ long submitStartNs = System.nanoTime();
+ int dagPlanBytes = dag.createDag(conf, null, null, null, false).getSerializedSize();
+ LOG.info("HIVE-XXXX diag: about to submitDAG, dagName={}, vertices={}, planBytes={}",
+ dag.getName(), dag.getVertices().size(), dagPlanBytes);
DAGClient client = session.getSession().submitDAG(dag);
+ LOG.info("HIVE-XXXX diag: submitDAG returned in {} ms",
+ TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - submitStartNs));
return client;
}
Rules for the patch:
- Tag every log line with the JIRA ID. The reporter greps for
HIVE-XXXX diag:to find your data. - INFO, not DEBUG. The reporter must not have to change log levels.
- Structured key=value or
{}placeholders. Easy to parse. - Cheap. Measure or log only what's needed; no full-DAG dumps unless explicitly asked.
Pattern 2: A New TezCounter
Counters are the production-safe way to surface a number. They aggregate across tasks
and are visible in the Tez UI, in hive.exec.print.summary output, and in the AM log.
Define a new counter
Tez counters are enums. The Tez-side counters:
find ~/tez-src -name "DAGCounter.java" -o -name "TaskCounter.java"
public enum TaskCounter {
// ... existing ...
REDUCE_INPUT_GROUPS,
REDUCE_OUTPUT_RECORDS,
// new for diagnostic:
/** TEZ-XXXX diag: number of shuffle fetch retries on this task. */
SHUFFLE_FETCH_RETRIES,
}
The Hive-side counters live in:
find ~/hive-src -name "OperatorVariation.java" -o -name "HiveCounter*.java" | head
For Hive-side, use a Reporter.incrCounter or operator counter mechanism, depending on
Hive version.
Increment it where it matters
In the suspected hot spot:
- copyFromHost(host);
+ try {
+ copyFromHost(host);
+ } catch (IOException e) {
+ context.getCounters().findCounter(TaskCounter.SHUFFLE_FETCH_RETRIES).increment(1);
+ throw e;
+ }
After the reporter runs with the patch:
SET hive.exec.print.summary=true;
-- repro the bug
The summary will show SHUFFLE_FETCH_RETRIES = N per task, surfacing data that was
previously invisible.
Counters vs logs
| Aspect | Counter | Log |
|---|---|---|
| Aggregation across tasks | Automatic | Manual |
| Production safety | High | High |
| Persistence | Long (ATS / Tez UI) | Short (containerlog rotation) |
| Detail per event | None (just a count) | Full message |
| Cost | Near zero | Low to moderate |
Use both for big diagnostics: a counter to know "this happened N times" and a log to know "and the first time, here's what it looked like."
Pattern 3: A Debug TezConfiguration Switch
For more invasive diagnostics — extra log lines that would be too noisy by default, or extra checks that have a measurable cost — gate them behind a config switch.
Define the switch
In TezConfiguration (Tez side) or HiveConf (Hive side):
// TezConfiguration.java
@Private @Unstable
public static final String TEZ_AM_DIAGNOSTICS_VERBOSE = "tez.am.diagnostics.verbose";
public static final boolean TEZ_AM_DIAGNOSTICS_VERBOSE_DEFAULT = false;
Use @Private and @Unstable for a diagnostic key — see
Compatibility. It signals "this is not a
supported API, may be removed once the bug is fixed."
Gate the diagnostic
private final boolean verboseDiagnostics;
public VertexImpl(...) {
this.verboseDiagnostics = conf.getBoolean(
TezConfiguration.TEZ_AM_DIAGNOSTICS_VERBOSE,
TezConfiguration.TEZ_AM_DIAGNOSTICS_VERBOSE_DEFAULT);
}
public void scheduleTasks(...) {
// ... existing logic ...
if (verboseDiagnostics) {
LOG.info("TEZ-XXXX diag: scheduling {} tasks for vertex {}; first task locations: {}",
tasksToSchedule.size(), getName(),
tasksToSchedule.subList(0, Math.min(5, tasksToSchedule.size())));
}
}
Reporter applies the patch and turns it on:
SET tez.am.diagnostics.verbose=true;
-- repro the bug
SET tez.am.diagnostics.verbose=false;
When the bug is diagnosed, the switch is removed in the proper fix. It is not a supported config — the JIRA tracks both the diagnostic patch (to be reverted) and the real fix.
Assembling the Diagnostic Patch
A complete patch for attachment:
- One new INFO log line at the boundary you suspect.
- One new counter if there's a count to track.
- One debug switch if the diagnostic has cost.
- JIRA description with:
- What the patch adds.
- How to apply it.
- How to enable any switch.
- What output the reporter should capture and attach.
- Test that the patch compiles and runs the existing tests — diagnostic patches must not change behavior.
Skeleton JIRA comment
Diagnostic patch attached: TEZ-XXXX.diag.001.patch
Adds:
- INFO log "TEZ-XXXX diag:" in VertexImpl.scheduleTasks
- TaskCounter SHUFFLE_FETCH_RETRIES
- Config switch tez.am.diagnostics.verbose (default false)
To reproduce with the patch:
1. Apply: git apply TEZ-XXXX.diag.001.patch
2. Build: mvn install -DskipTests
3. Run query that reproduces the issue, with:
SET tez.am.diagnostics.verbose=true;
SET hive.exec.print.summary=true;
4. Attach:
- The AM log (yarn logs -applicationId ...)
- The full Tez summary output
- One container log from a failing task
Will use the data to file a proper fix. The diagnostic patch is not for commit;
the fix patch will be separate.
Thanks,
<First>
When the Reporter Can't Apply a Patch
Some reporters can't patch their cluster (locked-down enterprise environment). In that case:
- Ask for what data they can capture: AM log, container logs, Tez UI screenshots, counter values.
- Tell them which existing INFO-level logs to grep for.
- Tell them which existing counters to read off.
- If a config switch already exists that increases logging, point at it
(e.g.
tez.am.history.logging.enabled=true,tez.runtime.shuffle.connect.timeout=X).
You don't always get a diagnostic patch onto the cluster. The skill is to plan the diagnostic so you get as much as possible from what's already shipped.
After the Diagnostic Runs
The reporter attaches:
- AM log with
TEZ-XXXX diag:lines. - Tez summary with counters.
- Container log fragments.
You analyse:
- What did the diagnostic counter show?
- What did the diagnostic log line tell you?
- Where is the actual root cause?
Once you know the root cause:
- File a separate JIRA for the real fix (or repurpose the diagnostic JIRA).
- Attach a proper fix patch (no diag, no INFO noise, no
@Privateconfig keys). - Note in the JIRA comment that the diagnostic patch is being abandoned in favor of the fix patch.
Worked Example — Slow DAG Submission
Production bug: "Hive query takes 30 seconds in TezTask.submit before the DAG starts
running on large DAGs."
Diagnostic patch
(As above) — adds two INFO lines in TezTask.submit capturing time and DAGPlan size.
Reporter runs
HIVE-XXXX diag: about to submitDAG, dagName=Hive_..., vertices=347, planBytes=8421567
HIVE-XXXX diag: submitDAG returned in 28341 ms
Analysis
- 347 vertices: large DAG, but not absurd.
- DAGPlan 8.4 MB: very large.
- 28 seconds for
submitDAG: most likely RPC + protobuf parse on the AM side.
Further diagnosis
Add a Tez-side INFO in DAGAppMaster.submitDAGToAppMaster:
TEZ-XXXX diag: received DAGPlan of {} bytes, deserializing
TEZ-XXXX diag: createDAG completed in {} ms
TEZ-XXXX diag: VertexImpl construction completed in {} ms
Re-run on the cluster. Pinpoint where the 28 seconds go.
Likely result: VertexImpl construction is O(N²) in vertex count for some reason.
File the fix patch with the profile evidence.
What a Diagnostic Patch Is Not
- Not a place to add unrelated improvements.
- Not a permanent feature; it gets reverted after the bug is fixed.
- Not a substitute for a proper reproducer (combine both when you can).
- Not a place to use
@PublicAPIs — diagnostic config keys are@Private,@Unstable. - Not committable to trunk as-is.
Validation Artifacts
After this lab:
- A
~/tez-notes/diag-patch-template.mdcontaining the JIRA-comment template above. - One worked diagnostic patch (real or imagined) following the three patterns.
- The reflex to tag every diagnostic log line with the JIRA ID.
- The discipline to file diagnostic and fix as separate JIRAs (or stages of one JIRA).
This lab closes the Hive-on-Tez Labs section. You now have the full toolkit: trace a SQL query into a DAG, capture the DAG four ways, walk a failure to its root, attribute it to the right project, reproduce it minimally, and instrument it for remote diagnosis. That toolkit is the practising-Tez-committer skill at the Tez/Hive boundary.