Container Reuse

Container reuse is the single biggest reason Tez runs short-task DAGs faster than MapReduce. This chapter explains why, where the policy lives, and how to debug it when it stops working.

Why reuse matters

Container allocation has three costs:

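RM round-trip: addContainerRequest, an RM scheduling cycle (yarn.scheduler.capacity.node-locality-delay is counted in missed scheduling opportunities, not milliseconds, and can hold an allocation back while the scheduler waits for a node-local slot), then onContainersAllocated.
NM container launch: ContainerLaunchContext setup, resource localization, and the NodeManager forking the JVM.
JVM warmup: classloading, JIT compilation, and GC heap sizing.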

For a 5-second task on a fresh container the wall time looks like:

Phase	Time (ms)
AM request → RM allocate	200–2000
NM launch + localization	500–3000
JVM start	500–2000
Task work	5000
Total overhead	1200–7000
Overhead share of wall time	19–58%

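For 100 such tasks, paying that overhead 100 times adds 120–700 s of cumulative container time on top of 500 s of actual task work; how much of it shows up as wall time depends on how many containers run in parallel. With reuse, tasks 2..N on the same container pay close to zero overhead.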
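The arithmetic behind these figures can be checked by summing the phase ranges from the table. The following is a minimal sketch, not part of Tez; the count of 10 containers in the usage section is an assumed value chosen only for illustration.

```python
# Per-phase overhead ranges in milliseconds, copied from the table above.
PHASES_MS = {
    "AM request -> RM allocate": (200, 2000),
    "NM launch + localization": (500, 3000),
    "JVM start": (500, 2000),
}
TASK_WORK_MS = 5000


def overhead_range_ms(phases=PHASES_MS):
    """Return (best, worst) total overhead for one fresh container."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


def overhead_share(overhead_ms, work_ms=TASK_WORK_MS):
    """Fraction of a task's wall time spent on overhead rather than work."""
    return overhead_ms / (overhead_ms + work_ms)


def cumulative_overhead_ms(tasks, containers, overhead_ms, reuse):
    """Total overhead paid across all tasks.

    Without reuse every task pays. With reuse only the first task on each
    container pays (idealized: no container is released early).
    """
    payers = min(tasks, containers) if reuse else tasks
    return payers * overhead_ms


if __name__ == "__main__":
    low, high = overhead_range_ms()
    print(f"overhead per fresh container: {low}-{high} ms")
    print(f"overhead share: {overhead_share(low):.0%}-{overhead_share(high):.0%}")
    # 10 containers is an assumed cluster slice, not a figure from the text.
    for reuse in (False, True):
        total = cumulative_overhead_ms(100, 10, high, reuse)
        print(f"reuse={reuse}: worst-case cumulative overhead {total / 1000:.0f} s")
```

The model ignores partial overlap between phases and assumes every task is exactly 5 s, so treat the output as an order-of-magnitude guide.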

The container side is covered in tez-runtime.md. The one fact that matters here: after each completed task, TezChild.run() calls umbilical.getTask() again instead of exiting. As long as the AM hands it work, the same JVM keeps running.

grep -n "umbilical.getTask\|shouldDie\|run()" \
  $(find tez-runtime-internals/src/main/java -name "TezChild.java") | head

So the entire reuse policy is implemented on the AM side — the container asks "what next?" and the AM decides whether to give it another task or release it.
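The pull model can be illustrated with a toy loop. This is a sketch under stated assumptions, not the TezChild code: FakeUmbilical, get_task, and child_main are hypothetical names, and the real loop also deals with polling intervals and a shutdown signal that are omitted here.

```python
from collections import deque


class FakeUmbilical:
    """Hypothetical stand-in for the AM side of the task umbilical."""

    def __init__(self, tasks):
        self._pending = deque(tasks)

    def get_task(self):
        # Returns the next task, or None to signal "no more work, shut down".
        return self._pending.popleft() if self._pending else None


def child_main(umbilical):
    """Container-side loop: keep asking for work until the AM says stop."""
    completed = []
    while True:
        task = umbilical.get_task()
        if task is None:  # the AM chose to release this container
            break
        completed.append(task())  # run the task inside the same process
    return completed


if __name__ == "__main__":
    work = [lambda i=i: f"task-{i} done" for i in range(3)]
    print(child_main(FakeUmbilical(work)))
```

The container holds no policy of its own in this sketch: whether it lives or dies depends entirely on what get_task returns.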

`AMContainerImpl` — per-container state machine

find tez-dag/src/main/java -name "AMContainerImpl.java"
wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java")
grep -n "AMContainerState" \
  $(find tez-dag/src/main/java -name "AMContainerState.java" \
                                -o -name "AMContainerImpl.java") | head

Each YARN container the AM holds has a corresponding AMContainerImpl state machine. States include:

State	Meaning
`ALLOCATED`	RM has assigned the container; not yet launched.
`LAUNCHING`	Launch request sent to the NodeManager; the JVM is starting.
`IDLE`	Launched, no task assigned (reuse candidate).
`RUNNING`	A task attempt is currently executing.
`STOP_REQUESTED` / `COMPLETED`	Releasing or released.

The transition RUNNING → IDLE is the moment Tez decides between reuse and release.
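A reduced model of this state machine helps when reading the real one. The sketch below uses only the states from the table; the event names and the transition set are illustrative assumptions, and the real AMContainerImpl has more states, events, and edges.

```python
from enum import Enum, auto


class State(Enum):
    ALLOCATED = auto()
    LAUNCHING = auto()
    IDLE = auto()
    RUNNING = auto()
    STOP_REQUESTED = auto()
    COMPLETED = auto()


# Illustrative transition table; event names are hypothetical.
TRANSITIONS = {
    (State.ALLOCATED, "launch"): State.LAUNCHING,
    (State.LAUNCHING, "launched"): State.IDLE,
    (State.IDLE, "assign_task"): State.RUNNING,
    (State.RUNNING, "task_done"): State.IDLE,  # the reuse decision point
    (State.IDLE, "stop"): State.STOP_REQUESTED,
    (State.RUNNING, "stop"): State.STOP_REQUESTED,
    (State.STOP_REQUESTED, "completed"): State.COMPLETED,
}


class ContainerFsm:
    """Tracks one container's state and how many tasks it has run."""

    def __init__(self):
        self.state = State.ALLOCATED
        self.tasks_run = 0

    def handle(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"invalid event {event!r} in state {self.state.name}")
        if event == "task_done":
            self.tasks_run += 1
        self.state = TRANSITIONS[key]
        return self.state


if __name__ == "__main__":
    fsm = ContainerFsm()
    for event in ["launch", "launched", "assign_task", "task_done",
                  "assign_task", "task_done", "stop", "completed"]:
        print(f"{event:12s} -> {fsm.handle(event).name}")
    print("tasks run on this container:", fsm.tasks_run)
```

Each pass through IDLE and back to RUNNING is one reuse; a container that goes from IDLE to STOP_REQUESTED was released instead.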

`HeldContainer`

grep -n "HeldContainer\|heldContainers\|delayedContainers" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20

HeldContainer is the scheduler-side view of an idle, reusable container:

Field	Purpose
`container`	The underlying YARN `Container` (resource, node, priority).
`priority`	The priority class it was originally allocated at.
`lastTaskActivity`	Timestamp of the last task completion.
`nextScheduleTime`	When `DelayedContainerManager` will reconsider it.
`localityMatchLevel`	The locality level at which it can still be matched.

When a task completes, AMContainerImpl reports back to YarnTaskSchedulerService, which wraps the container in a HeldContainer and queues it for matching.

Matching: who gets the held container?

Algorithm (paraphrased from YarnTaskSchedulerService):

Walk pending requests at the same priority as the held container's original allocation.
Prefer requests with locality matching the container's node, then rack, then any.
Verify resource compatibility: container's Resource must satisfy the request's capability.
If a match exists, dispatch reuse to the matched TaskAttemptImpl.
If no match, leave the container as HeldContainer and schedule the DelayedContainerManager to re-evaluate after the locality delay.
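The steps above can be sketched as a small matching function. This is a simplified model under stated assumptions, not the YarnTaskSchedulerService code: Resource, Held, Request, locality_rank, and match are hypothetical names, and resources are reduced to memory and vcores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int

    def fits(self, request: "Resource") -> bool:
        """True if this container is at least as large as the request."""
        return self.memory_mb >= request.memory_mb and self.vcores >= request.vcores


@dataclass
class Held:
    node: str
    rack: str
    priority: int
    resource: Resource


@dataclass
class Request:
    task_id: str
    priority: int
    capability: Resource
    nodes: List[str]
    racks: List[str]


def locality_rank(held: Held, req: Request) -> int:
    """0 = node-local, 1 = rack-local, 2 = any."""
    if held.node in req.nodes:
        return 0
    if held.rack in req.racks:
        return 1
    return 2


def match(held: Held, pending: List[Request], max_rank: int = 2) -> Optional[Request]:
    """Pick the best-locality request at the same priority that fits."""
    candidates = [
        r for r in pending
        if r.priority == held.priority
        and held.resource.fits(r.capability)
        and locality_rank(held, r) <= max_rank
    ]
    # min() keeps the first request among equally local candidates.
    return min(candidates, key=lambda r: locality_rank(held, r), default=None)


if __name__ == "__main__":
    held = Held("n1", "r1", priority=1, resource=Resource(4096, 1))
    pending = [
        Request("rack-only", 1, Resource(2048, 1), ["n2"], ["r1"]),
        Request("node-local", 1, Resource(2048, 1), ["n1"], ["r1"]),
        Request("other-priority", 2, Resource(1024, 1), ["n1"], ["r1"]),
    ]
    chosen = match(held, pending)
    print(chosen.task_id if chosen else "no match: hold and retry later")
```

The max_rank parameter stands in for the current locality level: a caller that has not yet relaxed locality would pass 0, and a None result corresponds to leaving the container held for the next sweep.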

grep -n "tryAssignReUsedContainer\|matchHeldContainerToRequest\|getMatchingRequests" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head

Why priority-strict matching?

Tez does not reuse a container allocated for priority class P1 for a task of priority P2 because RM accounting attributed the container to the P1 queue/request. Crossing priority classes would corrupt fairness and create double-counting in the RM's view of demand.

Idle timeout

grep -n "tez.am.container.idle.release-timeout" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Two timeouts bracket the wait:

Key	Default	Meaning
`tez.am.container.idle.release-timeout-min.millis`	5000	Don't release before this much idle time.
`tez.am.container.idle.release-timeout-max.millis`	10000	Definitely release after this much.

DelayedContainerManager runs a periodic sweep. For each HeldContainer:

If now - lastTaskActivity < min, wait.
If min ≤ now - lastTaskActivity < max, try a relaxed-locality match.
If now - lastTaskActivity ≥ max, release back to YARN (AMRMClient.releaseAssignedContainer).

Why a range? Avoids thundering-herd releases when a wave of tasks finishes simultaneously, and gives the AM a window to re-match before paying the allocate-from-scratch cost.
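The three-way decision made per held container can be written as one pure function. This is a minimal sketch of the rule as described above, using the default values from the table; Action and sweep_decision are hypothetical names, not Tez identifiers.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    TRY_RELAXED_MATCH = "try_relaxed_match"
    RELEASE = "release"


MIN_MS = 5000   # tez.am.container.idle.release-timeout-min.millis default
MAX_MS = 10000  # tez.am.container.idle.release-timeout-max.millis default


def sweep_decision(now_ms, last_activity_ms, min_ms=MIN_MS, max_ms=MAX_MS):
    """Decide what a sweep does with one held container."""
    if min_ms > max_ms:
        raise ValueError("min timeout must not exceed max timeout")
    idle_ms = now_ms - last_activity_ms
    if idle_ms < min_ms:
        return Action.WAIT
    if idle_ms < max_ms:
        return Action.TRY_RELAXED_MATCH
    return Action.RELEASE


if __name__ == "__main__":
    for idle in (0, 4999, 5000, 9999, 10000):
        print(f"idle {idle:5d} ms -> {sweep_decision(idle, 0).value}")
```

Setting min equal to max collapses the middle window, so a container goes straight from waiting to release.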

Locality re-matching

grep -n "localityMatchLevel\|adjustLocalityMatch\|fallbackMatch" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head

A held container starts at NODE_LOCAL. Each sweep without a match relaxes the level:

NODE_LOCAL → RACK_LOCAL → ANY → release.

tez.am.container.reuse.locality.delay-allocation-millis (default 250) is the per-step delay. Higher values raise locality at the cost of throughput; lower values give up locality faster.
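To see how the delay shapes the relaxation timeline, the following sketch maps idle time to a locality level. It assumes exactly one relaxation step per elapsed delay interval and leaves the final release to the idle timeout; level_after is a hypothetical helper, and the level names follow this chapter rather than the source enum.

```python
LEVELS = ("NODE_LOCAL", "RACK_LOCAL", "ANY")
DELAY_MS = 250  # tez.am.container.reuse.locality.delay-allocation-millis default


def level_after(idle_ms, delay_ms=DELAY_MS):
    """Locality level reached after idle_ms without a match.

    Assumes one relaxation step per delay_ms; a non-positive delay means
    locality is given up immediately.
    """
    if delay_ms <= 0:
        return LEVELS[-1]
    steps = idle_ms // delay_ms
    return LEVELS[min(steps, len(LEVELS) - 1)]


if __name__ == "__main__":
    for delay in (250, 1000):
        timeline = {t: level_after(t, delay) for t in (0, 250, 500, 1000, 2000)}
        print(f"delay={delay} ms: {timeline}")
```

With the default delay a container is open to any request after half a second under this model; quadrupling the delay keeps it node-local for the first full second.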

DAG transitions and reuse

grep -n "tez.am.container.reuse.across-dags.enabled\|tez.am.container.reuse.enabled" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Reuse toggles, including the one that governs reuse across DAG boundaries:

Key	Default	Effect
`tez.am.container.reuse.enabled`	true	Master toggle.
`tez.am.container.reuse.rack-fallback.enabled`	true	Allow RACK_LOCAL fallback.
`tez.am.container.reuse.non-local-fallback.enabled`	false	Allow ANY-locality fallback.
`tez.am.container.reuse.new-containers.enabled`	false	Reuse a brand-new container for a different task than originally requested.
`tez.am.mode.session`	false	Session mode: the AM outlives a single DAG, which is what makes inter-DAG reuse possible. Hive holds the AM across queries this way.

When session mode is on (Hive's TezSessionPoolManager runs its sessions this way), the AM holds containers across DAGs, so the JVMs warmed by the first DAG are reused by the second.
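For experiments it can be handy to generate a tez-site.xml fragment from a dictionary of keys. The sketch below emits the standard Hadoop configuration XML layout for three of the keys from the table, set to their listed defaults; to_site_xml is a hypothetical helper, not part of Tez or Hadoop.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Keys and values taken from the table above (defaults as listed there).
REUSE_SETTINGS = {
    "tez.am.container.reuse.enabled": "true",
    "tez.am.container.reuse.rack-fallback.enabled": "true",
    "tez.am.container.reuse.non-local-fallback.enabled": "false",
}


def to_site_xml(settings):
    """Render a {name: value} mapping as Hadoop-style configuration XML."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    raw = ET.tostring(root, encoding="unicode")
    return minidom.parseString(raw).toprettyxml(indent="  ")


if __name__ == "__main__":
    print(to_site_xml(REUSE_SETTINGS))
```

Flipping a single value in the dictionary and regenerating the file keeps A/B comparisons of reuse settings reproducible.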

Failure modes

Stale credentials

grep -n "credentials\|Token\|getCredentials" \
  $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java" \
                                -o -name "AMContainerImpl.java") | head

If a DAG uses delegation tokens (HDFS, HiveMetastore) that expire mid-session, reused containers still hold the old tokens. Symptoms: tasks fail with SecretManager$InvalidToken on file open. Fix: token renewal via TokenRenewer, or release reused containers between DAGs that use tokens with short TTLs.

Leaked containers on AM failover

grep -rn "recoverContainer\|onAMRestart" tez-dag/src/main/java | head

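When the AM dies and YARN starts attempt 2, the old containers can still be running. YARN hands them to the new AM via getContainersFromPreviousAttempts on the RegisterApplicationMasterResponse. If the new AM mishandles the priority mapping, those containers can become orphaned (neither released nor reused) and keep occupying queue capacity until the application finishes and YARN reclaims them. yarn.am.liveness-monitor.expiry-interval-ms does not help here: it only governs how long the RM waits for AM heartbeats before declaring the AM itself dead.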

Resource fragmentation

Tez does not reshape containers. A 4 GB container allocated for a heavyweight mapper sits idle through the reduce phase if reducers want 2 GB containers — the 4 GB block is not subdivided.

Container blacklisting

grep -n "blacklist\|NodeTracker" \
  $(find tez-dag/src/main/java -path "*/rm/*" -name "*.java") | head

A node accumulating task failures gets blacklisted; held containers on that node are released even within the idle window.
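The effect on held containers can be modeled in a few lines. This is an illustrative sketch only: NodeFailureTracker and partition_held are hypothetical names, and the failure threshold of 3 is an assumed value, not a Tez default.

```python
from collections import Counter


class NodeFailureTracker:
    """Counts task failures per node; the threshold is an assumed value."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self._failures = Counter()

    def record_failure(self, node):
        self._failures[node] += 1

    def is_blacklisted(self, node):
        return self._failures[node] >= self.max_failures


def partition_held(held, tracker):
    """Split (container_id, node) pairs into (keep, release_now)."""
    keep, release_now = [], []
    for container_id, node in held:
        target = release_now if tracker.is_blacklisted(node) else keep
        target.append(container_id)
    return keep, release_now


if __name__ == "__main__":
    tracker = NodeFailureTracker()
    for _ in range(3):
        tracker.record_failure("node-7")
    held = [("c1", "node-7"), ("c2", "node-3"), ("c3", "node-7")]
    keep, release_now = partition_held(held, tracker)
    print("keep:", keep, "release immediately:", release_now)
```

The point of the model is that blacklisting overrides the idle window: a container on a bad node is released regardless of how recently it finished a task.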

Tuning playbook

Goal	Tune
Reduce p50 task latency	Increase `tez.am.container.idle.release-timeout-max.millis` — keep JVMs warm longer.
Reduce YARN queue pressure	Lower `tez.am.container.idle.release-timeout-min.millis` — return idle containers faster.
Improve locality on long DAGs	Increase `tez.am.container.reuse.locality.delay-allocation-millis`.
Hive interactive queries	Enable session pools (`hive.server2.tez.initialize.default.sessions`) and large reuse windows.
Debugging "why was this container released?"	Set log4j level for `org.apache.tez.dag.app.rm` to DEBUG.

Reading exercise

wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java"), then read the state machine declaration block: count states and transitions.
grep -n "DelayedContainerManager" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java"): find the sweep loop.
grep -rn "idle.release-timeout" tez-dag/src/main/java: list all read sites for the idle timeout.
grep -n "previousAttemptContainers\|registerApplicationMaster" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java"): how does the AM enumerate inherited containers on failover?
grep -n -A 1 "REUSE" tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java: list every reuse-related config key.
grep -n "containerCompleted" $(find tez-dag/src/main/java -name "AMContainerImpl.java"): where does the AM learn that the JVM exited unexpectedly?

Common bugs and symptoms

Symptom	Likely cause
0% reuse despite `tez.am.container.reuse.enabled=true`	Priority mismatches; verify with AM log `Container released because no matching request`.
Hive query slow after token refresh	Reused container holding stale `HiveMetastore` delegation token. Release after refresh or shorten reuse window.
AM log spam: `Released container X because expired`	Containers expire before the next wave of tasks is dispatched; raise `idle.release-timeout-max` to keep them warm longer.
YARN queue at 100% but tasks pending	Idle held containers at another priority occupying capacity that pending requests cannot use; lower the idle release timeouts so they return to YARN sooner.
Containers orphaned after AM crash	New AM did not register previous containers; check `getContainersFromPreviousAttempts` handling.

Validation: prove you understand this

Describe the four-step locality relaxation a HeldContainer undergoes.
Why is priority-strict matching enforced even when relaxing locality? Cite the RM accounting consequence.
Given idle.release-timeout-min=5000, idle.release-timeout-max=10000, and 200 ms between successive task completions on the same vertex, what fraction of containers get reused?
Identify the exact configuration key that controls whether RM-fresh containers can be assigned to a task different from the one that triggered the request. Cite file:line.
Sketch the sequence of AM events when an AMContainer transitions RUNNING → IDLE → RUNNING with reuse, including which state machine emits each event.
