Lab H6: Writing a Diagnostic Patch

Background

You have a Hive-on-Tez bug report from production. You can't reproduce locally (Lab H5 didn't work). You need more data. The way to get it is a diagnostic patch — a small change that adds logging, counters, or a debug toggle without changing behavior, attached to the JIRA, that the reporter can apply and re-run.

A well-shaped diagnostic patch:

  1. Adds boundary-INFO logging at the suspected fault site.
  2. Adds a TezCounter so the data is captured in the standard counter mechanism.
  3. Adds a debug-only TezConfiguration switch so the cost is opt-in.

This lab walks the three patterns.


Pattern 1: Boundary INFO Logging

A "boundary" is the point at which control flows from one subsystem to another:

BoundaryExample
Hive → Tez submitTezTask.executeTezSession.submitDAG
Tez AM → ContainerDAGAppMaster.scheduleTaskAttemptContainerLauncherImpl.launch
Container → TaskLogicalIOProcessorRuntimeTask.runProcessor.run
Task → Input shuffleOrderedGroupedKVInput.start
Task → Output shuffleOrderedPartitionedKVOutput.start

INFO at a boundary is cheap, lasts the lifetime of a task, and gives the next debugger a structured trail.

Example patch (illustrative diff)

Suppose the bug is "DAG submission occasionally takes >10s on large DAGs." A diagnostic patch in TezTask:

diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
index abcdef1..2345678 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
@@ -201,7 +201,12 @@ public class TezTask extends Task<TezWork> {
   private DAGClient submit(DAG dag, TezSessionState session) throws Exception {
+    long submitStartNs = System.nanoTime();
+    int dagPlanBytes = dag.createDag(conf, null, null, null, false).getSerializedSize();
+    LOG.info("HIVE-XXXX diag: about to submitDAG, dagName={}, vertices={}, planBytes={}",
+        dag.getName(), dag.getVertices().size(), dagPlanBytes);
     DAGClient client = session.getSession().submitDAG(dag);
+    LOG.info("HIVE-XXXX diag: submitDAG returned in {} ms",
+        TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - submitStartNs));
     return client;
   }

Rules for the patch:

  • Tag every log line with the JIRA ID. The reporter greps for HIVE-XXXX diag: to find your data.
  • INFO, not DEBUG. The reporter must not have to change log levels.
  • Structured key=value or {} placeholders. Easy to parse.
  • Cheap. Measure or log only what's needed; no full-DAG dumps unless explicitly asked.

Pattern 2: A New TezCounter

Counters are the production-safe way to surface a number. They aggregate across tasks and are visible in the Tez UI, in hive.exec.print.summary output, and in the AM log.

Define a new counter

Tez counters are enums. The Tez-side counters:

find ~/tez-src -name "DAGCounter.java" -o -name "TaskCounter.java"
public enum TaskCounter {
  // ... existing ...
  REDUCE_INPUT_GROUPS,
  REDUCE_OUTPUT_RECORDS,
  // new for diagnostic:
  /** TEZ-XXXX diag: number of shuffle fetch retries on this task. */
  SHUFFLE_FETCH_RETRIES,
}

The Hive-side counters live in:

find ~/hive-src -name "OperatorVariation.java" -o -name "HiveCounter*.java" | head

For Hive-side, use a Reporter.incrCounter or operator counter mechanism, depending on Hive version.

Increment it where it matters

In the suspected hot spot:

-        copyFromHost(host);
+        try {
+          copyFromHost(host);
+        } catch (IOException e) {
+          context.getCounters().findCounter(TaskCounter.SHUFFLE_FETCH_RETRIES).increment(1);
+          throw e;
+        }

After the reporter runs with the patch:

SET hive.exec.print.summary=true;
-- repro the bug

The summary will show SHUFFLE_FETCH_RETRIES = N per task, surfacing data that was previously invisible.

Counters vs logs

AspectCounterLog
Aggregation across tasksAutomaticManual
Production safetyHighHigh
PersistenceLong (ATS / Tez UI)Short (containerlog rotation)
Detail per eventNone (just a count)Full message
CostNear zeroLow to moderate

Use both for big diagnostics: a counter to know "this happened N times" and a log to know "and the first time, here's what it looked like."


Pattern 3: A Debug TezConfiguration Switch

For more invasive diagnostics — extra log lines that would be too noisy by default, or extra checks that have a measurable cost — gate them behind a config switch.

Define the switch

In TezConfiguration (Tez side) or HiveConf (Hive side):

// TezConfiguration.java
@Private @Unstable
public static final String TEZ_AM_DIAGNOSTICS_VERBOSE = "tez.am.diagnostics.verbose";
public static final boolean TEZ_AM_DIAGNOSTICS_VERBOSE_DEFAULT = false;

Use @Private and @Unstable for a diagnostic key — see Compatibility. It signals "this is not a supported API, may be removed once the bug is fixed."

Gate the diagnostic

private final boolean verboseDiagnostics;

public VertexImpl(...) {
  this.verboseDiagnostics = conf.getBoolean(
      TezConfiguration.TEZ_AM_DIAGNOSTICS_VERBOSE,
      TezConfiguration.TEZ_AM_DIAGNOSTICS_VERBOSE_DEFAULT);
}

public void scheduleTasks(...) {
  // ... existing logic ...
  if (verboseDiagnostics) {
    LOG.info("TEZ-XXXX diag: scheduling {} tasks for vertex {}; first task locations: {}",
        tasksToSchedule.size(), getName(),
        tasksToSchedule.subList(0, Math.min(5, tasksToSchedule.size())));
  }
}

Reporter applies the patch and turns it on:

SET tez.am.diagnostics.verbose=true;
-- repro the bug
SET tez.am.diagnostics.verbose=false;

When the bug is diagnosed, the switch is removed in the proper fix. It is not a supported config — the JIRA tracks both the diagnostic patch (to be reverted) and the real fix.


Assembling the Diagnostic Patch

A complete patch for attachment:

  1. One new INFO log line at the boundary you suspect.
  2. One new counter if there's a count to track.
  3. One debug switch if the diagnostic has cost.
  4. JIRA description with:
    • What the patch adds.
    • How to apply it.
    • How to enable any switch.
    • What output the reporter should capture and attach.
  5. Test that the patch compiles and runs the existing tests — diagnostic patches must not change behavior.

Skeleton JIRA comment

Diagnostic patch attached: TEZ-XXXX.diag.001.patch

Adds:
  - INFO log "TEZ-XXXX diag:" in VertexImpl.scheduleTasks
  - TaskCounter SHUFFLE_FETCH_RETRIES
  - Config switch tez.am.diagnostics.verbose (default false)

To reproduce with the patch:

  1. Apply: git apply TEZ-XXXX.diag.001.patch
  2. Build: mvn install -DskipTests
  3. Run query that reproduces the issue, with:
       SET tez.am.diagnostics.verbose=true;
       SET hive.exec.print.summary=true;
  4. Attach:
       - The AM log (yarn logs -applicationId ...)
       - The full Tez summary output
       - One container log from a failing task

Will use the data to file a proper fix. The diagnostic patch is not for commit;
the fix patch will be separate.

Thanks,
<First>

When the Reporter Can't Apply a Patch

Some reporters can't patch their cluster (locked-down enterprise environment). In that case:

  1. Ask for what data they can capture: AM log, container logs, Tez UI screenshots, counter values.
  2. Tell them which existing INFO-level logs to grep for.
  3. Tell them which existing counters to read off.
  4. If a config switch already exists that increases logging, point at it (e.g. tez.am.history.logging.enabled=true, tez.runtime.shuffle.connect.timeout=X).

You don't always get a diagnostic patch onto the cluster. The skill is to plan the diagnostic so you get as much as possible from what's already shipped.


After the Diagnostic Runs

The reporter attaches:

  • AM log with TEZ-XXXX diag: lines.
  • Tez summary with counters.
  • Container log fragments.

You analyse:

  1. What did the diagnostic counter show?
  2. What did the diagnostic log line tell you?
  3. Where is the actual root cause?

Once you know the root cause:

  1. File a separate JIRA for the real fix (or repurpose the diagnostic JIRA).
  2. Attach a proper fix patch (no diag, no INFO noise, no @Private config keys).
  3. Note in the JIRA comment that the diagnostic patch is being abandoned in favor of the fix patch.

Worked Example — Slow DAG Submission

Production bug: "Hive query takes 30 seconds in TezTask.submit before the DAG starts running on large DAGs."

Diagnostic patch

(As above) — adds two INFO lines in TezTask.submit capturing time and DAGPlan size.

Reporter runs

HIVE-XXXX diag: about to submitDAG, dagName=Hive_..., vertices=347, planBytes=8421567
HIVE-XXXX diag: submitDAG returned in 28341 ms

Analysis

  • 347 vertices: large DAG, but not absurd.
  • DAGPlan 8.4 MB: very large.
  • 28 seconds for submitDAG: most likely RPC + protobuf parse on the AM side.

Further diagnosis

Add a Tez-side INFO in DAGAppMaster.submitDAGToAppMaster:

TEZ-XXXX diag: received DAGPlan of {} bytes, deserializing
TEZ-XXXX diag: createDAG completed in {} ms
TEZ-XXXX diag: VertexImpl construction completed in {} ms

Re-run on the cluster. Pinpoint where the 28 seconds go.

Likely result: VertexImpl construction is O(N²) in vertex count for some reason. File the fix patch with the profile evidence.


What a Diagnostic Patch Is Not

  • Not a place to add unrelated improvements.
  • Not a permanent feature; it gets reverted after the bug is fixed.
  • Not a substitute for a proper reproducer (combine both when you can).
  • Not a place to use @Public APIs — diagnostic config keys are @Private, @Unstable.
  • Not committable to trunk as-is.

Validation Artifacts

After this lab:

  1. A ~/tez-notes/diag-patch-template.md containing the JIRA-comment template above.
  2. One worked diagnostic patch (real or imagined) following the three patterns.
  3. The reflex to tag every diagnostic log line with the JIRA ID.
  4. The discipline to file diagnostic and fix as separate JIRAs (or stages of one JIRA).

This lab closes the Hive-on-Tez Labs section. You now have the full toolkit: trace a SQL query into a DAG, capture the DAG four ways, walk a failure to its root, attribute it to the right project, reproduce it minimally, and instrument it for remote diagnosis. That toolkit is the practising-Tez-committer skill at the Tez/Hive boundary.