Counters and Diagnostics

When a Tez DAG misbehaves, you have two primary signals: counters (numeric aggregates from every task) and diagnostics strings (free-text causes at every level of the hierarchy). This chapter is the operator's reference for both.

`TezCounters`

find tez-api/src/main/java -name "TezCounters.java"
wc -l $(find tez-api/src/main/java -name "TezCounters.java")
grep -n "class TezCounters\|addGroup\|getGroup\|findCounter" \
  $(find tez-api/src/main/java -name "TezCounters.java")

TezCounters is a two-level map: group name → CounterGroup, then counter name → Counter. Names are stored hash-cons style, so identical strings share storage. Each counter holds a long value and supports thread-safe increments.
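The two-level shape can be pictured with a minimal stand-in built only from the JDK. This is a sketch of the data structure described above, not the Tez implementation: the class `CounterMapSketch` and its methods are hypothetical, and string sharing is not modelled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for TezCounters: group name -> counter name -> value.
class CounterMapSketch {
  private final Map<String, Map<String, AtomicLong>> groups = new ConcurrentHashMap<>();

  // Looks up a counter, creating the group and the counter on first use.
  AtomicLong findCounter(String group, String name) {
    return groups
        .computeIfAbsent(group, g -> new ConcurrentHashMap<>())
        .computeIfAbsent(name, n -> new AtomicLong());
  }

  // Reads the current value; a counter that was never touched reads as 0.
  long value(String group, String name) {
    return findCounter(group, name).get();
  }

  int groupCount() {
    return groups.size();
  }

  public static void main(String[] args) {
    CounterMapSketch counters = new CounterMapSketch();
    counters.findCounter("MyApp", "ROWS_FILTERED").incrementAndGet();
    counters.findCounter("MyApp", "ROWS_FILTERED").addAndGet(4);
    System.out.println(counters.value("MyApp", "ROWS_FILTERED")); // prints 5
  }
}
```

The `AtomicLong` gives the thread-safe increment; the nested `computeIfAbsent` calls give the create-on-first-lookup behaviour that makes unbounded counter names dangerous.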

find tez-api/src/main/java -name "TaskCounter.java"
cat $(find tez-api/src/main/java -name "TaskCounter.java")

Standard groups

Group	Source class	What lives there
`org.apache.tez.common.counters.TaskCounter`	`TaskCounter` enum	Per-task framework metrics
`org.apache.tez.common.counters.DAGCounter`	`DAGCounter` enum	Per-DAG aggregate metrics
`org.apache.tez.common.counters.FileSystemCounter`	`FileSystemCounter`	Per-FS bytes-read/written
`org.apache.hadoop.mapreduce.JobCounter`	(legacy MR)	Compatibility shim
User-defined	`<your class name>`	App code

Key `TaskCounter` values

grep -n "INPUT_RECORDS_PROCESSED\|OUTPUT_RECORDS\|SPILLED_RECORDS\|SHUFFLE_BYTES\|GC_TIME_MILLIS\|REDUCE_INPUT_GROUPS" \
  $(find tez-api/src/main/java -name "TaskCounter.java")

Counter	Meaning
`INPUT_RECORDS_PROCESSED`	Records read from logical inputs
`OUTPUT_RECORDS`	Records written to logical outputs
`OUTPUT_BYTES`	Serialized bytes written, before compression
`OUTPUT_BYTES_PHYSICAL`	Bytes actually written to disk, after compression
`SPILLED_RECORDS`	Records spilled by sorter
`NUM_SPILLS`	Number of spill files created
`MERGED_MAP_OUTPUTS`	Upstream outputs merged on the consuming side during shuffle
`SHUFFLE_BYTES`	Bytes fetched by shuffle
`SHUFFLE_BYTES_TO_MEM`, `SHUFFLE_BYTES_TO_DISK`	Fetcher allocation split
`REDUCE_INPUT_GROUPS`	Distinct keys seen by a `KeyValuesReader`
`REDUCE_INPUT_RECORDS`	Total values across all groups
`GC_TIME_MILLIS`	Sum of GC time during the task
`CPU_MILLISECONDS`	Process CPU time
`COMMITTED_HEAP_BYTES`	Heap size at task end
`PHYSICAL_MEMORY_BYTES`, `VIRTUAL_MEMORY_BYTES`	Process memory snapshot

`DAGCounter`

find tez-api/src/main/java -name "DAGCounter.java"
cat $(find tez-api/src/main/java -name "DAGCounter.java")

Counter	Meaning
`NUM_SUCCEEDED_TASKS`	Aggregated across all vertices
`NUM_KILLED_TASKS`	Speculative duplicates + user kills
`NUM_FAILED_TASKS`	TA failures (counts every failed attempt)
`TOTAL_LAUNCHED_TASKS`	Lifetime sum
`OTHER_LOCAL_TASKS`, `RACK_LOCAL_TASKS`, `DATA_LOCAL_TASKS`	Locality histogram
`AM_CPU_MILLISECONDS`, `AM_GC_TIME_MILLIS`	AM process counters
`WALL_CLOCK_MILLIS`	DAG submission → completion

Aggregation: TA → task → vertex → DAG

flowchart TD
  TA[TaskAttempt counters] -->|flushed via heartbeat| T[Task counters]
  T -->|on TASK_SUCCEEDED| V[Vertex counters]
  V -->|on VERTEX_SUCCEEDED| D[DAG counters]

Mechanism:

Each LogicalIOProcessorRuntimeTask accumulates counters in process.
TaskReporter heartbeat carries a snapshot to the AM via TezTaskUmbilicalProtocol.statusUpdate.
AM's TaskAttemptImpl stores the latest snapshot.
On TASK_SUCCEEDED, the winning attempt's counters become the Task counters; other attempts are discarded.
On VERTEX_SUCCEEDED, VertexImpl sums all task counters into the vertex counters.
On DAG_SUCCEEDED, DAGImpl sums all vertex counters into DAG counters and includes AM_* and DAG_* self-counters.
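The roll-up in the steps above can be sketched with plain maps. This is a simplified model under stated assumptions: counters are flattened to name → value, and `CounterRollupSketch`, `sum`, and `taskCounters` are hypothetical names, not Tez classes or methods.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the roll-up; counters are plain name -> value maps.
class CounterRollupSketch {
  // Sums counter maps by name, as the vertex and DAG steps are described to do.
  static Map<String, Long> sum(List<Map<String, Long>> parts) {
    Map<String, Long> total = new HashMap<>();
    for (Map<String, Long> part : parts) {
      part.forEach((name, value) -> total.merge(name, value, Long::sum));
    }
    return total;
  }

  // A task takes the winning attempt's snapshot; the other attempts are ignored.
  static Map<String, Long> taskCounters(List<Map<String, Long>> attempts, int winnerIndex) {
    return new HashMap<>(attempts.get(winnerIndex));
  }

  public static void main(String[] args) {
    Map<String, Long> failedAttempt = Map.of("OUTPUT_RECORDS", 40L);
    Map<String, Long> winningAttempt = Map.of("OUTPUT_RECORDS", 100L);
    Map<String, Long> task0 = taskCounters(List.of(failedAttempt, winningAttempt), 1);
    Map<String, Long> task1 = Map.of("OUTPUT_RECORDS", 50L);

    Map<String, Long> vertex = sum(List.of(task0, task1));
    Map<String, Long> dag = sum(List.of(vertex, Map.of("OUTPUT_RECORDS", 25L)));
    System.out.println(vertex.get("OUTPUT_RECORDS")); // prints 150
    System.out.println(dag.get("OUTPUT_RECORDS"));    // prints 175
  }
}
```

Note that the failed attempt's 40 records never reach the vertex total: only the winning snapshot is carried forward.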

grep -n "incrCounters\|aggregateCounters\|getCounters\|setCounters" \
  $(find tez-dag/src/main/java -name "TaskAttemptImpl.java" \
                                -o -name "TaskImpl.java" \
                                -o -name "VertexImpl.java" \
                                -o -name "DAGImpl.java") | head -30

Counter limits (and how they kill DAGs)

grep -n "COUNTERS_MAX\|TEZ_COUNTERS_MAX\|countersMax" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Key	Default	Cap on
`tez.counters.max`	1200	Total counter count per `TezCounters` instance
`tez.counters.max.groups`	500	Group count
`tez.counters.group-name.max-length`	256	Length of a group name
`tez.counters.counter-name.max-length`	64	Length of a counter name

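Exceeding the counter or group count throws LimitExceededException; names longer than the length limits are truncated rather than rejected. The count limits are typically hit when: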

An app creates a counter per unique key (e.g. per file path).
A user vertex manager creates per-task counters.
A DAG has very many vertices, each contributing many counters, and the DAG-level aggregate blows the cap.

The exception propagates up the heartbeat path and kills the DAG with INTERNAL_ERROR. Look for LimitExceededException in the AM log to confirm.
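To see how quickly a per-key counter pattern reaches the cap, here is a minimal model of a counter-count limit. It is an illustration only: `CounterLimitSketch` and its nested `LimitExceeded` exception are hypothetical and do not reproduce the Tez limit-checking code or its message format.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a counter-count cap; not the Tez limit-checking code.
class CounterLimitSketch {
  static class LimitExceeded extends RuntimeException {
    LimitExceeded(String message) {
      super(message);
    }
  }

  private final int maxCounters;
  private final Map<String, Long> counters = new HashMap<>();

  CounterLimitSketch(int maxCounters) {
    this.maxCounters = maxCounters;
  }

  // Incrementing an existing counter is always allowed; creating one past the cap fails.
  void increment(String name) {
    if (!counters.containsKey(name) && counters.size() >= maxCounters) {
      throw new LimitExceeded("Too many counters: " + (counters.size() + 1) + " max=" + maxCounters);
    }
    counters.merge(name, 1L, Long::sum);
  }

  int size() {
    return counters.size();
  }

  public static void main(String[] args) {
    CounterLimitSketch perPath = new CounterLimitSketch(1200);
    try {
      for (int i = 0; ; i++) {
        perPath.increment("BYTES_READ_/data/part-" + i); // one counter per file path
      }
    } catch (LimitExceeded e) {
      System.out.println(perPath.size() + " counters created before: " + e.getMessage());
    }
  }
}
```

With one counter per input file, a job reading more files than the cap fails no matter how small each file is.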

Diagnostics strings

Every level (TA → Task → Vertex → DAG) has a List<String> of diagnostics.

Level	Class	Populated by
`TaskAttempt`	`TaskAttemptImpl`	User exception stacks, framework errors, container exit reasons
`Task`	`TaskImpl`	Aggregate of failed attempt diagnostics + scheduling diagnostics
`Vertex`	`VertexImpl`	Aggregate of failed task diagnostics + vertex manager events
`DAG`	`DAGImpl`	Aggregate of failed vertex diagnostics + DAG-level events

grep -n "addDiagnostic\|diagnostics\|getDiagnostics" \
  $(find tez-dag/src/main/java -name "TaskAttemptImpl.java" \
                                -o -name "TaskImpl.java" \
                                -o -name "VertexImpl.java" \
                                -o -name "DAGImpl.java") | head -40

When a DAG completes, DAGStatus.getDiagnostics() is the union of every diagnostic at every level. This is what tez-tool and the Tez UI display.
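The "union at every level" behaviour can be modelled as a tree walk. The sketch below is an assumption-laden illustration: `DiagnosticsSketch` is a hypothetical class, the diagnostic strings in `main` are invented, and the real ordering and deduplication in Tez may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: each level holds its own diagnostics plus its children.
class DiagnosticsSketch {
  final String name;
  final List<String> diagnostics = new ArrayList<>();
  final List<DiagnosticsSketch> children = new ArrayList<>();

  DiagnosticsSketch(String name) {
    this.name = name;
  }

  // Creates a child level (vertex under DAG, task under vertex, and so on).
  DiagnosticsSketch child(String childName) {
    DiagnosticsSketch c = new DiagnosticsSketch(childName);
    children.add(c);
    return c;
  }

  // Own diagnostics first, then everything reported below this level.
  List<String> collect() {
    List<String> all = new ArrayList<>(diagnostics);
    for (DiagnosticsSketch c : children) {
      all.addAll(c.collect());
    }
    return all;
  }

  public static void main(String[] args) {
    DiagnosticsSketch dag = new DiagnosticsSketch("dag");
    DiagnosticsSketch attempt = dag.child("vertex").child("task").child("attempt");
    attempt.diagnostics.add("attempt failed, cause=APPLICATION_ERROR"); // invented text
    dag.diagnostics.add("DAG did not succeed due to a vertex failure"); // invented text
    dag.collect().forEach(System.out::println);
  }
}
```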

Where to find diagnostics

Surface	Path	Notes
Client return value	`DAGStatus.getDiagnostics()`	Concatenated strings
AM log	`syslog`	Search for `DIAG:`, `ERROR`, the cause keyword
ATS	`DAGFinishedEvent.diagnostics`, `VertexFinishedEvent.diagnostics`, etc	One field per entity
Tez UI	DAG / Vertex / Task page	Renders the same ATS fields
`dag.dot` (if dumped)	local file written by `TezClient` when enabled	Static plan only, no diagnostics
Counter dump from CLI	`tez-tool dump-counters <appId>`	Counter snapshots

grep -rn "DIAG\|addDiagnosticInfo" tez-dag/src/main/java | head -20

Counters in the AM log

A typical successful-task log line:

TaskAttempt: [attempt_1_0_00_000000_0]
  TASK_ATTEMPT_FINISHED ...
  counters: Counters: 26
    org.apache.tez.common.counters.TaskCounter
      INPUT_RECORDS_PROCESSED=12345
      OUTPUT_RECORDS=12345
      OUTPUT_BYTES=4567890
      ...

grep -rn "Counters: " tez-dag/src/main/java | head

For diagnostic grepping, search the AM log for:

Pattern	What it finds
`DIAG:`	Diagnostics appends
`Counters:`	Counter dumps
`LimitExceededException`	Counter limit hits
`TaskAttemptTerminationCause`	Failure causes
`TERMINATED_BY_CLIENT`	User-initiated kills
`OUTPUT_LOST`	Cascading reruns
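When an AM log is too large to eyeball, a small tally of these patterns gives a first impression of what went wrong. The following is a minimal sketch using plain substring matching; `AmLogScanSketch` is a hypothetical helper and the sample lines in `main` are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts how many log lines contain each search pattern.
class AmLogScanSketch {
  static final List<String> PATTERNS = List.of(
      "DIAG:", "Counters:", "LimitExceededException",
      "TaskAttemptTerminationCause", "TERMINATED_BY_CLIENT", "OUTPUT_LOST");

  // Returns pattern -> number of matching lines, in the order of PATTERNS.
  static Map<String, Integer> countMatches(List<String> lines) {
    Map<String, Integer> hits = new LinkedHashMap<>();
    for (String pattern : PATTERNS) {
      hits.put(pattern, 0);
    }
    for (String line : lines) {
      for (String pattern : PATTERNS) {
        if (line.contains(pattern)) {
          hits.merge(pattern, 1, Integer::sum);
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    // Invented sample lines; a real run would read the AM syslog instead.
    List<String> sample = List.of(
        "DIAG: TaskAttempt attempt_1_0_05_000003_2 failed, cause=APPLICATION_ERROR",
        "counters: Counters: 26",
        "some unrelated line");
    countMatches(sample).forEach((pattern, count) -> System.out.println(pattern + " " + count));
  }
}
```

To scan a real file, the sample list could be replaced with `java.nio.file.Files.readAllLines(path)`.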

Custom counters

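User code accesses counters through the context of an Input, Processor, or Output (IPO):

import java.util.Map;
import org.apache.tez.common.counters.TezCounters;
import org.apache.tez.runtime.api.*;

public class MyProcessor extends AbstractLogicalIOProcessor {
  public MyProcessor(ProcessorContext context) { super(context); }
  @Override
  public void run(Map<String, LogicalInput> inputs, Map<String, LogicalOutput> outputs) throws Exception {
    TezCounters counters = getContext().getCounters();
    // Fixed group and counter names keep cardinality bounded.
    counters.findCounter("MyApp", "ROWS_FILTERED").increment(1);
  }
  // initialize(), handleEvents(), and close() omitted for brevity.
}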

grep -rn "getContext().getCounters\|getCounters()" \
  tez-tests/src/main/java tez-examples/src/main/java | head

Operational guidance:

Cap group/counter cardinality at compile time. Never use unbounded user input as a counter name.
One group per app; many counters per group.
Counter names are visible in ATS forever — treat them as a stable API.
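One way to follow the cardinality rule is to map unbounded input onto a fixed set of buckets before it becomes a counter name. The sketch below assumes file paths as the unbounded input; `BoundedCounterNameSketch`, the `Format` enum, and the `FILES_` prefix are all hypothetical choices for illustration.

```java
import java.util.Locale;

// Hypothetical helper: turns any file path into one of four fixed counter names.
class BoundedCounterNameSketch {
  enum Format { PARQUET, ORC, TEXT, OTHER }

  // The result set is fixed at compile time, whatever paths arrive at run time.
  static String counterNameFor(String path) {
    int dot = path.lastIndexOf('.');
    String ext = dot < 0 ? "" : path.substring(dot + 1).toLowerCase(Locale.ROOT);
    Format format;
    switch (ext) {
      case "parquet": format = Format.PARQUET; break;
      case "orc":     format = Format.ORC;     break;
      case "txt":     format = Format.TEXT;    break;
      default:        format = Format.OTHER;
    }
    return "FILES_" + format.name();
  }

  public static void main(String[] args) {
    System.out.println(counterNameFor("/warehouse/t1/part-00000.orc")); // prints FILES_ORC
    System.out.println(counterNameFor("/tmp/upload-7f3a"));             // prints FILES_OTHER
  }
}
```

The returned name would then be passed to `findCounter` under a single application group, so the group contributes at most four counters regardless of how many files are read.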

Reading exercise

cat $(find tez-api/src/main/java -name "TaskCounter.java") — read the enum.
grep -n "incrCounter\|addCounters" $(find tez-runtime-library/src/main/java -name "*.java") | head -20 — find every place the runtime increments counters.
grep -rn "LimitExceededException" tez-api/src/main/java tez-dag/src/main/java — trace the kill path.
find tez-tools -type f -name "*.java" | head — look at tez-tools for counter-dump tooling.
grep -rn "addDiagnosticInfo\|addDiagnostic" tez-dag/src/main/java | wc -l — count the call sites; build a mental model of "where diagnostics flow in."
Open the Tez UI for a recent app, navigate DAG → Vertex → Task, and compare each level's counter view against what the AM log shows.

Common bugs and symptoms

Symptom	Likely cause
DAG fails with `LimitExceededException`	Too many counters — search AM log for the limit that triggered.
DAG-level counters don't equal the sum of the vertex counters	One vertex failed; its counters are excluded from the sum.
Counter group missing from ATS	Counter was never incremented (zero is not stored).
Diagnostics string truncated	ATS field length limit; check the entity size the timeline service accepts.
`INPUT_RECORDS_PROCESSED` is zero but task succeeded	Input had zero rows, or a custom IPO does not increment the standard counter.
`SHUFFLE_BYTES_TO_DISK` >> `SHUFFLE_BYTES_TO_MEM`	Fetcher exhausted memory budget; tune `tez.runtime.shuffle.memory.limit.percent`.
Wall-clock time far exceeds `CPU_MILLISECONDS`	Task spent most of its time waiting (shuffle, GC, blocked); it is not CPU-bound.

Validation: prove you understand this

Name the four standard counter groups and the class that defines each.
Explain why two attempts of the same task can have different counter values, and what happens to the loser's counters.
Calculate the smallest DAG that can hit tez.counters.max=1200, assuming the TaskCounter group contributes 26 counters per vertex on success.
Trace the path of a single counter increment in user code through the classes that aggregate it up to the DAGStatus returned to the client.
Given an AM log line DIAG: TaskAttempt attempt_1_0_05_000003_2 failed, cause=APPLICATION_ERROR, list the four levels where this diagnostic ultimately appears and the exact classes that store each copy.
