Container Reuse

Container reuse is the single biggest reason Tez runs short-task DAGs faster than MapReduce. This chapter explains why, where the policy lives, and how to debug it when it stops working.


Why reuse matters

Container allocation has three costs:

  1. RM round-tripaddContainerRequest, RM scheduling cycle (typically yarn.scheduler.capacity.node-locality-delay adds extra ms), onContainersAllocated.
  2. NM container launchContainerLaunchContext setup, localization of resources, NodeManager forking the JVM.
  3. JVM warmup — classloading, JIT, GC tuning.

For a 5-second task on a fresh container the wall time looks like:

Phasems
AM request → RM allocate200–2000
NM launch + localization500–3000
JVM start500–2000
Task work5000
Overhead share25–60%

For 100 such tasks, paying that overhead 100 times turns a DAG that should finish in ~10s into one that takes 60–90s. Reuse drops to near-zero overhead for tasks 2..N on the same container.


The reuse loop in TezChild

See tez-runtime.md. The single key fact: after each completed task, TezChild.run() calls umbilical.getTask() again instead of exiting. As long as the AM hands it work, the same JVM keeps running.

grep -n "umbilical.getTask\|shouldDie\|run()" \
  $(find tez-runtime-internals/src/main/java -name "TezChild.java") | head

So the entire reuse policy is implemented on the AM side — the container asks "what next?" and the AM decides whether to give it another task or release it.


AMContainerImpl — per-container state machine

find tez-dag/src/main/java -name "AMContainerImpl.java"
wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java")
grep -n "AMContainerState\|enum AMContainerState" \
  $(find tez-dag/src/main/java -name "AMContainerState.java" \
                                -o -name "AMContainerImpl.java") | head

Each YARN container the AM holds has a corresponding AMContainerImpl state machine. States include:

StateMeaning
ALLOCATEDRM has assigned the container; not yet launched.
LAUNCHINGNMClient is forking the JVM.
IDLELaunched, no task assigned (reuse candidate).
RUNNINGA task attempt is currently executing.
STOP_REQUESTED / COMPLETEDReleasing or released.

The transition RUNNING → IDLE is the moment Tez decides between reuse and release.


HeldContainer

grep -n "HeldContainer\|heldContainers\|delayedContainers" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20

HeldContainer is the scheduler-side view of an idle reused container:

FieldPurpose
containerThe underlying YARN Container (resource, node, priority).
priorityThe priority class it was originally allocated at.
lastTaskActivityTimestamp of the last task completion.
nextScheduleTimeWhen DelayedContainerManager will reconsider it.
localityMatchLevelTrack the locality at which it can still be matched.

When a task completes, AMContainerImpl reports back to YarnTaskSchedulerService which wraps the container in a HeldContainer and queues it for matching.


Matching: who gets the held container?

Algorithm (paraphrased from YarnTaskSchedulerService):

  1. Walk pending requests at the same priority as the held container's original allocation.
  2. Prefer requests with locality matching the container's node, then rack, then any.
  3. Verify resource compatibility: container's Resource must satisfy the request's capability.
  4. If a match exists, dispatch reuse to the matched TaskAttemptImpl.
  5. If no match, leave the container as HeldContainer and schedule the DelayedContainerManager to re-evaluate after the locality delay.
grep -n "tryAssignReUsedContainer\|matchHeldContainerToRequest\|getMatchingRequests" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head

Why priority-strict matching?

Tez does not reuse a container allocated for priority class P1 for a task of priority P2 because RM accounting attributed the container to the P1 queue/request. Crossing priority classes would corrupt fairness and create double-counting in the RM's view of demand.


Idle timeout

grep -n "tez.am.container.idle.release-timeout" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Two timeouts bracket the wait:

KeyDefaultMeaning
tez.am.container.idle.release-timeout-min.millis5000Don't release before this much idle time.
tez.am.container.idle.release-timeout-max.millis10000Definitely release after this much.

DelayedContainerManager runs a periodic sweep. For each HeldContainer:

  • If now - lastActivity < min, wait.
  • If min ≤ now - lastActivity < max, try a relaxed-locality match.
  • If now - lastActivity ≥ max, release back to YARN (AMRMClient.releaseAssignedContainer).

Why a range? Avoids thundering-herd releases when a wave of tasks finishes simultaneously, and gives the AM a window to re-match before paying the allocate-from-scratch cost.


Locality re-matching

grep -n "localityMatchLevel\|adjustLocalityMatch\|fallbackMatch" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head

A held container starts at NODE_LOCAL. Each sweep without a match relaxes the level:

NODE_LOCAL → RACK_LOCAL → ANY → release.

tez.am.container.reuse.locality.delay-allocation-millis (default 250) is the per-step delay. Higher values raise locality at the cost of throughput; lower values give up locality faster.


DAG transitions and reuse

grep -n "tez.am.container.reuse.across-dags.enabled\|tez.am.container.reuse.enabled" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Reuse policy at DAG boundaries:

KeyDefaultEffect
tez.am.container.reuse.enabledtrueMaster toggle.
tez.am.container.reuse.rack-fallback.enabledtrueAllow RACK_LOCAL fallback.
tez.am.container.reuse.non-local-fallback.enabledfalseAllow ANY-locality fallback.
tez.am.container.reuse.new-containers.enabledtrueReuse a brand-new container for a different task than originally requested.
tez.am.session.mode.tez-session.enabled (Hive)controls inter-DAG reuse via sessionHive holds the AM across queries.

When Session mode is on (Hive's TezSessionPoolManager does this), the AM holds containers across DAGs, so the first DAG warms the JVMs that the second DAG reuses.


Failure modes

Stale credentials

grep -n "credentials\|Token\|getCredentials" \
  $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java" \
                                -o -name "AMContainerImpl.java") | head

If a DAG uses delegation tokens (HDFS, HiveMetastore) that expire mid-session, reused containers still hold the old tokens. Symptoms: tasks fail with SecretManager$InvalidToken on file open. Fix: token renewal via TokenRenewer, or release reused containers between DAGs that use tokens with short TTLs.

Leaked containers on AM failover

grep -n "recoverContainer\|onAMRestart" \
  $(find tez-dag/src/main/java -name "*.java" | head -50) | head

When the AM dies and YARN restarts attempt 2, the old containers are still running. YARN passes them to the new AM via getContainersFromPreviousAttempts. If the new AM mis-handles the priority mapping, those containers can become orphaned — neither released nor reused — until the YARN-level yarn.am.liveness-monitor.expiry-interval-ms kicks in.

Resource fragmentation

Tez does not reshape containers. A 4 GB container allocated for a heavyweight mapper sits idle through the reduce phase if reducers want 2 GB containers — the 4 GB block is not subdivided.

Container blacklisting

grep -n "blacklist\|NodeTracker" \
  $(find tez-dag/src/main/java -name "*.java" | grep rm) | head

A node accumulating task failures gets blacklisted; held containers on that node are released even within the idle window.


Tuning playbook

GoalTune
Reduce p50 task latencyIncrease tez.am.container.idle.release-timeout-max.millis — keep JVMs warm longer.
Reduce YARN queue pressureLower tez.am.container.idle.release-timeout-min.millis — return idle containers faster.
Improve locality on long DAGsIncrease tez.am.container.reuse.locality.delay-allocation-millis.
Hive interactive queriesEnable session pools (hive.server2.tez.initialize.default.sessions) and large reuse windows.
Debugging "why was this container released?"Set log4j level for org.apache.tez.dag.app.rm to DEBUG.

Reading exercise

  1. wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java") then read the state machine declaration block. Count states and transitions.
  2. grep -n "DelayedContainerManager" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") — find the sweep loop.
  3. grep -rn "idle.release-timeout" tez-dag/src/main/java — list all read sites for the idle timeout.
  4. grep -n "previousAttemptContainers\|registerApplicationMaster" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") — how does the AM enumerate inherited containers on failover?
  5. cat tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | grep -A 1 "REUSE\|REUSE_ENABLED" — list every reuse-related config key.
  6. grep -n "containerCompleted" $(find tez-dag/src/main/java -name "AMContainerImpl.java") — where does the AM learn that the JVM exited unexpectedly?

Common bugs and symptoms

SymptomLikely cause
0% reuse despite tez.am.container.reuse.enabled=truePriority mismatches; verify with AM log Container released because no matching request.
Hive query slow after token refreshReused container holding stale HiveMetastore delegation token. Release after refresh or shorten reuse window.
AM log spam: Released container X because expiredTasks completing faster than next-wave dispatch — lower idle.release-timeout-min.
YARN queue at 100% but tasks pendingHeld containers at wrong priority blocking new allocations; check nm-rm-heartbeat-interval-ms.
Containers orphaned after AM crashNew AM did not register previous containers; check
getContainersFromPreviousAttempts handling.

Validation: prove you understand this

  1. Describe the four-step locality relaxation a HeldContainer undergoes.
  2. Why is priority-strict matching enforced even when relaxing locality? Cite the RM accounting consequence.
  3. Given idle.release-timeout-min=5000, idle.release-timeout-max=10000, and 200 ms between successive task completions on the same vertex, what fraction of containers get reused?
  4. Identify the exact configuration key that controls whether RM-fresh containers can be assigned to a task different from the one that triggered the request. Cite file:line.
  5. Sketch the sequence of AM events when an AMContainer transitions RUNNING → IDLE → RUNNING with reuse, including which state machine emits each event.