Counters and Diagnostics

When a Tez DAG misbehaves, you have two primary signals: counters (numeric aggregates from every task) and diagnostics strings (free-text causes at every level of the hierarchy). This chapter is the operator's reference for both.


TezCounters

find tez-api/src/main/java -name "TezCounters.java"
wc -l $(find tez-api/src/main/java -name "TezCounters.java")
grep -n "class TezCounters\|addGroup\|getGroup\|findCounter" \
  $(find tez-api/src/main/java -name "TezCounters.java")

TezCounters is a two-level map: group name → CounterGroup, then counter name → Counter. Names are handled hash-cons style, so identical strings share storage. Each counter holds a long value and supports thread-safe increments.
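As a rough mental model of the group → counter structure described above, here is a sketch built only from standard-library types. It is not Tez's implementation: `CounterMapSketch` and its methods are hypothetical names chosen for illustration, and the real class does considerably more (limit checks, among other things).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative model only: group name -> counter name -> long value.
// CounterMapSketch is a hypothetical name, not a Tez class.
final class CounterMapSketch {
  private final Map<String, Map<String, AtomicLong>> groups = new ConcurrentHashMap<>();

  // Returns the counter, creating the group and the counter on first use.
  AtomicLong findCounter(String group, String name) {
    return groups
        .computeIfAbsent(group, g -> new ConcurrentHashMap<>())
        .computeIfAbsent(name, n -> new AtomicLong());
  }

  // Reads a counter without creating it; missing counters read as zero.
  long value(String group, String name) {
    Map<String, AtomicLong> g = groups.get(group);
    AtomicLong c = (g == null) ? null : g.get(name);
    return (c == null) ? 0L : c.get();
  }

  public static void main(String[] args) throws InterruptedException {
    CounterMapSketch counters = new CounterMapSketch();
    Runnable work = () -> {
      for (int i = 0; i < 1000; i++) {
        counters.findCounter("MyApp", "ROWS_FILTERED").incrementAndGet();
      }
    };
    Thread a = new Thread(work);
    Thread b = new Thread(work);
    a.start();
    b.start();
    a.join();
    b.join();
    System.out.println(counters.value("MyApp", "ROWS_FILTERED")); // 2000
  }
}
```

Two threads incrementing the same counter lose no updates because each increment is atomic; that is the property "thread-safe increment" refers to.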

find tez-api/src/main/java -name "TaskCounter.java"
cat $(find tez-api/src/main/java -name "TaskCounter.java")

Standard groups

| Group | Source class | What lives there |
| --- | --- | --- |
| org.apache.tez.common.counters.TaskCounter | TaskCounter enum | Per-task framework metrics |
| org.apache.tez.common.counters.DAGCounter | DAGCounter enum | Per-DAG aggregate metrics |
| org.apache.tez.common.counters.FileSystemCounter | FileSystemCounter | Per-filesystem bytes read/written |
| org.apache.hadoop.mapreduce.JobCounter | (legacy MR) | Compatibility shim |
| User-defined | <your class name> | App code |

Key TaskCounter values

grep -n "INPUT_RECORDS_PROCESSED\|OUTPUT_RECORDS\|SPILLED_RECORDS\|SHUFFLE_BYTES\|GC_TIME_MILLIS\|REDUCE_INPUT_GROUPS" \
  $(find tez-api/src/main/java -name "TaskCounter.java")

| Counter | Meaning |
| --- | --- |
| INPUT_RECORDS_PROCESSED | Records read from logical inputs |
| OUTPUT_RECORDS | Records written to logical outputs |
| OUTPUT_BYTES | Serialized bytes written, before compression |
| OUTPUT_BYTES_PHYSICAL | Bytes actually written to disk, after compression if enabled |
| SPILLED_RECORDS | Records spilled by the sorter |
| NUM_SPILLS | Number of spill files created |
| MERGED_MAP_OUTPUTS | Fetched outputs merged on the consuming (shuffle) side |
| SHUFFLE_BYTES | Bytes fetched by shuffle |
| SHUFFLE_BYTES_TO_MEM, SHUFFLE_BYTES_TO_DISK | Fetched bytes split by destination (memory vs. disk) |
| REDUCE_INPUT_GROUPS | Distinct keys seen by a KeyValuesReader |
| REDUCE_INPUT_RECORDS | Total values across all groups |
| GC_TIME_MILLIS | Sum of GC time during the task |
| CPU_MILLISECONDS | Process CPU time |
| COMMITTED_HEAP_BYTES | Heap size at task end |
| PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES | Process memory snapshot |

DAGCounter

find tez-api/src/main/java -name "DAGCounter.java"
cat $(find tez-api/src/main/java -name "DAGCounter.java")

| Counter | Meaning |
| --- | --- |
| NUM_SUCCEEDED_TASKS | Aggregated across all vertices |
| NUM_KILLED_TASKS | Speculative duplicates + user kills |
| NUM_FAILED_TASKS | TA failures (counts every failed attempt) |
| TOTAL_LAUNCHED_TASKS | Lifetime sum |
| OTHER_LOCAL_TASKS, RACK_LOCAL_TASKS, DATA_LOCAL_TASKS | Locality histogram |
| AM_CPU_MILLISECONDS, AM_GC_TIME_MILLIS | AM process counters |
| WALL_CLOCK_MILLIS | DAG submission → completion |

Aggregation: TA → task → vertex → DAG

flowchart TD
  TA[TaskAttempt counters] -->|flushed via heartbeat| T[Task counters]
  T -->|on TASK_SUCCEEDED| V[Vertex counters]
  V -->|on VERTEX_SUCCEEDED| D[DAG counters]

Mechanism:

  1. Each LogicalIOProcessorRuntimeTask accumulates counters in process.
  2. TaskReporter heartbeat carries a snapshot to the AM via TezTaskUmbilicalProtocol.statusUpdate.
  3. AM's TaskAttemptImpl stores the latest snapshot.
  4. On TASK_SUCCEEDED, the winning attempt's counters become the Task counters; other attempts are discarded.
  5. On VERTEX_SUCCEEDED, VertexImpl sums all task counters into the vertex counters.
  6. On DAG_SUCCEEDED, DAGImpl sums all vertex counters into DAG counters and includes AM_* and DAG_* self-counters.
grep -n "incrCounters\|aggregateCounters\|getCounters\|setCounters" \
  $(find tez-dag/src/main/java -name "TaskAttemptImpl.java" \
                                -o -name "TaskImpl.java" \
                                -o -name "VertexImpl.java" \
                                -o -name "DAGImpl.java") | head -30
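The roll-up in steps 4 to 6 can be sketched with plain maps. This is a simplified model under stated assumptions, not the code in TaskImpl, VertexImpl, or DAGImpl: `CounterRollupSketch` is a hypothetical name, counters are flattened to a single name → value map, and the AM self-counters are left out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of the roll-up; CounterRollupSketch is a hypothetical name.
final class CounterRollupSketch {
  // Task level: only the successful attempt's snapshot survives.
  static Map<String, Long> taskCounters(Map<String, Long> winningAttempt) {
    return new HashMap<>(winningAttempt);
  }

  // Vertex and DAG level: sum the children's counters name by name.
  static Map<String, Long> sum(List<Map<String, Long>> children) {
    Map<String, Long> total = new HashMap<>();
    for (Map<String, Long> child : children) {
      child.forEach((name, value) -> total.merge(name, value, Long::sum));
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Long> task0 = taskCounters(Map.of("OUTPUT_RECORDS", 100L));
    Map<String, Long> task1 =
        taskCounters(Map.of("OUTPUT_RECORDS", 50L, "SPILLED_RECORDS", 7L));
    Map<String, Long> vertex = sum(List.of(task0, task1));
    Map<String, Long> dag = sum(List.of(vertex, Map.of("OUTPUT_RECORDS", 25L)));
    System.out.println(vertex.get("OUTPUT_RECORDS")); // 150
    System.out.println(dag.get("OUTPUT_RECORDS"));    // 175
  }
}
```

Because only the winning attempt feeds the task level, work done by killed or failed attempts never shows up in the vertex or DAG totals.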

Counter limits (and how they kill DAGs)

grep -n "COUNTERS_MAX\|TEZ_COUNTERS_MAX\|countersMax" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

| Key | Default | Cap on |
| --- | --- | --- |
| tez.counters.max | 1200 | Total counter count per TezCounters instance |
| tez.counters.max.groups | 500 | Group count |
| tez.counters.group-name.max-length | 256 | Length of a group name |
| tez.counters.counter-name.max-length | 64 | Length of a counter name |

Exceeding the counter or group count throws LimitExceededException (over-long names are truncated rather than rejected). This typically happens when:

  • An app creates a counter per unique key (e.g. per file path).
  • A user vertex manager creates per-task counters.
  • A DAG has very many vertices, each contributing many counters, and the DAG-level aggregate blows the cap.

The exception propagates up the heartbeat path and kills the DAG with INTERNAL_ERROR. Look for LimitExceededException in the AM log to confirm.
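To make the failure mode concrete, the sketch below models a per-instance cap and the counter-per-key anti-pattern. It is an illustration, not Tez's limit-checking code: `CounterLimitSketch` is a hypothetical name, `IllegalStateException` stands in for LimitExceededException, and `verticesToExceed` assumes every vertex contributes counters with distinct names (counters that share a name merge and do not add to the total).

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative model of a per-instance counter cap; not Tez's own limit code.
final class CounterLimitSketch {
  private final int maxCounters;
  private final Set<String> names = new HashSet<>();

  CounterLimitSketch(int maxCounters) {
    this.maxCounters = maxCounters;
  }

  // Registers a counter name; throws once a new name would exceed the cap.
  // IllegalStateException stands in for LimitExceededException here.
  void register(String name) {
    if (!names.contains(name) && names.size() >= maxCounters) {
      throw new IllegalStateException(
          "Too many counters: " + (names.size() + 1) + " max=" + maxCounters);
    }
    names.add(name);
  }

  // Smallest vertex count whose combined distinct counters exceed the cap.
  static int verticesToExceed(int maxCounters, int countersPerVertex) {
    return maxCounters / countersPerVertex + 1;
  }

  public static void main(String[] args) {
    CounterLimitSketch limits = new CounterLimitSketch(3);
    try {
      // Anti-pattern: one counter per input path.
      for (String path : new String[] {"/a", "/b", "/c", "/d"}) {
        limits.register("BYTES_" + path);
      }
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // Too many counters: 4 max=3
    }
    System.out.println(verticesToExceed(1200, 40)); // 31
  }
}
```

With a cap of 1200 and a hypothetical 40 distinct counters per vertex, 30 vertices land exactly on the cap and the 31st pushes the aggregate over it.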


Diagnostics strings

Every level of the hierarchy (task attempt, or TA → Task → Vertex → DAG) keeps a List<String> of diagnostics.

| Level | Class | Populated by |
| --- | --- | --- |
| TaskAttempt | TaskAttemptImpl | User exception stacks, framework errors, container exit reasons |
| Task | TaskImpl | Aggregate of failed attempt diagnostics + scheduling diagnostics |
| Vertex | VertexImpl | Aggregate of failed task diagnostics + vertex manager events |
| DAG | DAGImpl | Aggregate of failed vertex diagnostics + DAG-level events |

grep -n "addDiagnostic\|diagnostics\|getDiagnostics" \
  $(find tez-dag/src/main/java -name "TaskAttemptImpl.java" \
                                -o -name "TaskImpl.java" \
                                -o -name "VertexImpl.java" \
                                -o -name "DAGImpl.java") | head -40

When a DAG completes, DAGStatus.getDiagnostics() is the union of every diagnostic at every level. This is what tez-tool and the Tez UI display.
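A minimal model of how a diagnostic climbs the hierarchy, assuming each level simply copies the entries of its failed children. `DiagnosticsSketch`, `absorbFailedChild`, and the name prefixing are hypothetical; the real classes format and filter their messages in their own ways.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model: each level keeps its own list and absorbs
// the entries of failed children. Not the Tez implementation.
final class DiagnosticsSketch {
  final String name;
  final List<String> diagnostics = new ArrayList<>();

  DiagnosticsSketch(String name) {
    this.name = name;
  }

  void addDiagnostic(String message) {
    diagnostics.add(message);
  }

  // Called by a parent when a child fails: copy the child's entries,
  // prefixed with the child's name (the prefix is an assumption).
  void absorbFailedChild(DiagnosticsSketch child) {
    for (String d : child.diagnostics) {
      diagnostics.add(child.name + ": " + d);
    }
  }

  public static void main(String[] args) {
    DiagnosticsSketch attempt = new DiagnosticsSketch("attempt");
    attempt.addDiagnostic("java.lang.RuntimeException: boom");
    DiagnosticsSketch task = new DiagnosticsSketch("task");
    task.absorbFailedChild(attempt);
    DiagnosticsSketch vertex = new DiagnosticsSketch("vertex");
    vertex.absorbFailedChild(task);
    // prints "task: attempt: java.lang.RuntimeException: boom"
    System.out.println(vertex.diagnostics.get(0));
  }
}
```

The practical consequence is that the root cause is usually the innermost entry, so read a DAG-level diagnostic from the inside out.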


Where to find diagnostics

| Surface | Path | Notes |
| --- | --- | --- |
| Client return value | DAGStatus.getDiagnostics() | Concatenated strings |
| AM log | syslog | Search for DIAG:, ERROR, or the cause keyword |
| ATS | DAGFinishedEvent.diagnostics, VertexFinishedEvent.diagnostics, etc. | One field per entity |
| Tez UI | DAG / Vertex / Task page | Renders the same ATS fields |
| dag.dot (if dumped) | Local file written by TezClient when enabled | Static plan only, no diagnostics |
| Counter dump from CLI | tez-tool dump-counters <appId> | Counter snapshots |

grep -rn "DIAG\|addDiagnosticInfo" tez-dag/src/main/java | head -20

Counters in the AM log

A typical successful-task log line:

TaskAttempt: [attempt_1_0_00_000000_0]
  TASK_ATTEMPT_FINISHED ...
  counters: Counters: 26
    org.apache.tez.common.counters.TaskCounter
      INPUT_RECORDS_PROCESSED=12345
      OUTPUT_RECORDS=12345
      OUTPUT_BYTES=4567890
      ...
grep -rn "Counters: " tez-dag/src/main/java | head
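If you need the numbers rather than the text, a small parser is usually enough. The sketch below assumes the layout shown in the sample above (a group line followed by indented NAME=value lines); the exact log format may differ between versions, and `CounterDumpParser` is a hypothetical helper, not part of Tez.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: extracts NAME=value pairs from a counter dump,
// keyed as "<group>/<NAME>". Assumes the layout shown in the sample log.
final class CounterDumpParser {
  static Map<String, Long> parse(String dump) {
    Map<String, Long> out = new LinkedHashMap<>();
    String group = "";
    for (String raw : dump.split("\n")) {
      String line = raw.trim();
      int eq = line.indexOf('=');
      if (eq > 0) {
        try {
          long value = Long.parseLong(line.substring(eq + 1).trim());
          out.put(group + "/" + line.substring(0, eq), value);
        } catch (NumberFormatException ignored) {
          // Has '=' but no numeric value (e.g. cause=APPLICATION_ERROR); skip.
        }
      } else if (!line.isEmpty()
          && Character.isLetter(line.charAt(0))
          && !line.matches(".*[\\s:].*")) {
        // A bare token without spaces or colons is treated as a group name.
        group = line;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String dump = String.join("\n",
        "TaskAttempt: [attempt_1_0_00_000000_0]",
        "  TASK_ATTEMPT_FINISHED ...",
        "  counters: Counters: 26",
        "    org.apache.tez.common.counters.TaskCounter",
        "      INPUT_RECORDS_PROCESSED=12345",
        "      OUTPUT_BYTES=4567890");
    Map<String, Long> counters = parse(dump);
    System.out.println(
        counters.get("org.apache.tez.common.counters.TaskCounter/OUTPUT_BYTES")); // 4567890
  }
}
```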

For diagnostic grepping, search the AM log for:

| Pattern | What it finds |
| --- | --- |
| DIAG: | Diagnostics appends |
| Counters: | Counter dumps |
| LimitExceededException | Counter limit hits |
| TaskAttemptTerminationCause | Failure causes |
| TERMINATED_BY_CLIENT | User-initiated kills |
| OUTPUT_LOST | Cascading reruns |

Custom counters

User code accesses counters through the context object of an input, processor, or output (IPO):

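import java.util.*;
import org.apache.tez.runtime.api.*;

public class MyProcessor extends AbstractLogicalIOProcessor {
  public MyProcessor(ProcessorContext context) { super(context); }
  @Override public void initialize() {}
  @Override public void handleEvents(List<Event> events) {}
  @Override public void close() {}
  @Override
  public void run(Map<String, LogicalInput> inputs, Map<String, LogicalOutput> outputs) {
    // findCounter creates the counter on first use: group "MyApp", name "ROWS_FILTERED".
    getContext().getCounters().findCounter("MyApp", "ROWS_FILTERED").increment(1);
  }
}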
grep -rn "getContext().getCounters\|getCounters()" \
  tez-tests/src/main/java tez-examples/src/main/java | head

Operational guidance:

  • Cap group/counter cardinality at compile time. Never use unbounded user input as a counter name.
  • One group per app; many counters per group.
  • Counter names are visible in ATS forever — treat them as a stable API.
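One way to follow the first rule is to route every unbounded value through a fixed enum before it becomes a counter name. The sketch below is illustrative: `BoundedCounterNames`, `FormatCounter`, and the file-extension buckets are hypothetical choices, not anything Tez prescribes.

```java
import java.util.Locale;

// Illustrative: map unbounded input (file paths) onto a fixed set of counter names.
final class BoundedCounterNames {
  // The full set of counter names is fixed at compile time.
  enum FormatCounter { ROWS_ORC, ROWS_PARQUET, ROWS_TEXT, ROWS_OTHER }

  static FormatCounter counterFor(String path) {
    String p = path.toLowerCase(Locale.ROOT);
    if (p.endsWith(".orc")) return FormatCounter.ROWS_ORC;
    if (p.endsWith(".parquet")) return FormatCounter.ROWS_PARQUET;
    if (p.endsWith(".txt") || p.endsWith(".csv")) return FormatCounter.ROWS_TEXT;
    return FormatCounter.ROWS_OTHER; // catch-all keeps cardinality bounded
  }

  public static void main(String[] args) {
    System.out.println(counterFor("/data/2024/part-00001.orc")); // ROWS_ORC
    System.out.println(counterFor("/tmp/upload/unknown.bin"));   // ROWS_OTHER
  }
}
```

Inside a processor, the resulting name would be passed on as `findCounter("MyApp", counterFor(path).name())`, so the group can never hold more than four counters regardless of how many paths the job reads.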

Reading exercise

  1. cat $(find tez-api/src/main/java -name "TaskCounter.java") — read the enum.
  2. grep -n "incrCounter\|addCounters" $(find tez-runtime-library/src/main/java -name "*.java") | head -20 — find every place the runtime increments counters.
  3. grep -rn "LimitExceededException" tez-api/src/main/java tez-dag/src/main/java — trace the kill path.
  4. find tez-tools -type f -name "*.java" | head — look at tez-tools for counter-dump tooling.
  5. grep -rn "addDiagnosticInfo\|addDiagnostic" tez-dag/src/main/java | wc -l — count the call sites; build a mental model of "where diagnostics flow in."
  6. Open the Tez UI for a recent app, navigate DAG → Vertex → Task, and compare each level's counter view against what the AM log shows.

Common bugs and symptoms

| Symptom | Likely cause |
| --- | --- |
| DAG fails with LimitExceededException | Too many counters; search the AM log for the limit that triggered. |
| Counters at DAG level don't sum to vertex counters | One vertex failed; its counters are excluded from the sum. |
| Counter group missing from ATS | Counter was never incremented (zero is not stored). |
| Diagnostics string truncated | ATS field length limit; check yarn.timeline-service.client.max-retries and entity size. |
| INPUT_RECORDS_PROCESSED is zero but task succeeded | Input had zero rows, or a custom IPO does not increment the standard counter. |
| SHUFFLE_BYTES_TO_DISK >> SHUFFLE_BYTES_TO_MEM | Fetcher exhausted its memory budget; tune tez.runtime.shuffle.memory.limit.percent. |
| Wall clock huge vs CPU millis | Task spent most of its time waiting (shuffle, GC, blocked); not CPU bound. |

Validation: prove you understand this

  1. Name the four standard counter groups and the class that defines each.
  2. Explain why two attempts of the same task can have different counter values, and what happens to the loser's counters.
  3. Calculate the smallest DAG that can hit tez.counters.max=1200, assuming each vertex contributes 26 TaskCounter counters on success.
  4. Trace the path of a single counter increment in user code through the classes that aggregate it up to the DAGStatus returned to the client.
  5. Given an AM log line DIAG: TaskAttempt attempt_1_0_05_000003_2 failed, cause=APPLICATION_ERROR, list the four levels where this diagnostic ultimately appears and the exact classes that store each copy.