Container Reuse
Container reuse is the single biggest reason Tez runs short-task DAGs faster than MapReduce. This chapter explains why, where the policy lives, and how to debug it when it stops working.
Why reuse matters
Container allocation has three costs:
- RM round-trip —
addContainerRequest, RM scheduling cycle (typicallyyarn.scheduler.capacity.node-locality-delayadds extra ms),onContainersAllocated. - NM container launch —
ContainerLaunchContextsetup, localization of resources, NodeManager forking the JVM. - JVM warmup — classloading, JIT, GC tuning.
For a 5-second task on a fresh container the wall time looks like:
| Phase | ms |
|---|---|
| AM request → RM allocate | 200–2000 |
| NM launch + localization | 500–3000 |
| JVM start | 500–2000 |
| Task work | 5000 |
| Overhead share | 25–60% |
For 100 such tasks, paying that overhead 100 times turns a DAG that should finish in ~10s into one that takes 60–90s. Reuse drops to near-zero overhead for tasks 2..N on the same container.
The reuse loop in TezChild
See tez-runtime.md. The single key fact: after each
completed task, TezChild.run() calls umbilical.getTask() again instead
of exiting. As long as the AM hands it work, the same JVM keeps running.
grep -n "umbilical.getTask\|shouldDie\|run()" \
$(find tez-runtime-internals/src/main/java -name "TezChild.java") | head
So the entire reuse policy is implemented on the AM side — the container asks "what next?" and the AM decides whether to give it another task or release it.
AMContainerImpl — per-container state machine
find tez-dag/src/main/java -name "AMContainerImpl.java"
wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java")
grep -n "AMContainerState\|enum AMContainerState" \
$(find tez-dag/src/main/java -name "AMContainerState.java" \
-o -name "AMContainerImpl.java") | head
Each YARN container the AM holds has a corresponding AMContainerImpl
state machine. States include:
| State | Meaning |
|---|---|
ALLOCATED | RM has assigned the container; not yet launched. |
LAUNCHING | NMClient is forking the JVM. |
IDLE | Launched, no task assigned (reuse candidate). |
RUNNING | A task attempt is currently executing. |
STOP_REQUESTED / COMPLETED | Releasing or released. |
The transition RUNNING → IDLE is the moment Tez decides between reuse and
release.
HeldContainer
grep -n "HeldContainer\|heldContainers\|delayedContainers" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20
HeldContainer is the scheduler-side view of an idle reused container:
| Field | Purpose |
|---|---|
container | The underlying YARN Container (resource, node, priority). |
priority | The priority class it was originally allocated at. |
lastTaskActivity | Timestamp of the last task completion. |
nextScheduleTime | When DelayedContainerManager will reconsider it. |
localityMatchLevel | Track the locality at which it can still be matched. |
When a task completes, AMContainerImpl reports back to
YarnTaskSchedulerService which wraps the container in a HeldContainer
and queues it for matching.
Matching: who gets the held container?
Algorithm (paraphrased from YarnTaskSchedulerService):
- Walk pending requests at the same priority as the held container's original allocation.
- Prefer requests with locality matching the container's node, then rack, then any.
- Verify resource compatibility: container's
Resourcemust satisfy the request'scapability. - If a match exists, dispatch reuse to the matched
TaskAttemptImpl. - If no match, leave the container as
HeldContainerand schedule theDelayedContainerManagerto re-evaluate after the locality delay.
grep -n "tryAssignReUsedContainer\|matchHeldContainerToRequest\|getMatchingRequests" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head
Why priority-strict matching?
Tez does not reuse a container allocated for priority class P1 for a task
of priority P2 because RM accounting attributed the container to the P1
queue/request. Crossing priority classes would corrupt fairness and create
double-counting in the RM's view of demand.
Idle timeout
grep -n "tez.am.container.idle.release-timeout" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
Two timeouts bracket the wait:
| Key | Default | Meaning |
|---|---|---|
tez.am.container.idle.release-timeout-min.millis | 5000 | Don't release before this much idle time. |
tez.am.container.idle.release-timeout-max.millis | 10000 | Definitely release after this much. |
DelayedContainerManager runs a periodic sweep. For each HeldContainer:
- If
now - lastActivity < min, wait. - If
min ≤ now - lastActivity < max, try a relaxed-locality match. - If
now - lastActivity ≥ max, release back to YARN (AMRMClient.releaseAssignedContainer).
Why a range? Avoids thundering-herd releases when a wave of tasks finishes simultaneously, and gives the AM a window to re-match before paying the allocate-from-scratch cost.
Locality re-matching
grep -n "localityMatchLevel\|adjustLocalityMatch\|fallbackMatch" \
$(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head
A held container starts at NODE_LOCAL. Each sweep without a match
relaxes the level:
NODE_LOCAL → RACK_LOCAL → ANY → release.
tez.am.container.reuse.locality.delay-allocation-millis (default 250) is
the per-step delay. Higher values raise locality at the cost of throughput;
lower values give up locality faster.
DAG transitions and reuse
grep -n "tez.am.container.reuse.across-dags.enabled\|tez.am.container.reuse.enabled" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
Reuse policy at DAG boundaries:
| Key | Default | Effect |
|---|---|---|
tez.am.container.reuse.enabled | true | Master toggle. |
tez.am.container.reuse.rack-fallback.enabled | true | Allow RACK_LOCAL fallback. |
tez.am.container.reuse.non-local-fallback.enabled | false | Allow ANY-locality fallback. |
tez.am.container.reuse.new-containers.enabled | true | Reuse a brand-new container for a different task than originally requested. |
tez.am.session.mode.tez-session.enabled (Hive) | controls inter-DAG reuse via session | Hive holds the AM across queries. |
When Session mode is on (Hive's TezSessionPoolManager does this), the AM
holds containers across DAGs, so the first DAG warms the JVMs that the
second DAG reuses.
Failure modes
Stale credentials
grep -n "credentials\|Token\|getCredentials" \
$(find tez-dag/src/main/java -name "ContainerLauncherImpl.java" \
-o -name "AMContainerImpl.java") | head
If a DAG uses delegation tokens (HDFS, HiveMetastore) that expire mid-session,
reused containers still hold the old tokens. Symptoms: tasks fail with
SecretManager$InvalidToken on file open. Fix: token renewal via
TokenRenewer, or release reused containers between DAGs that use tokens
with short TTLs.
Leaked containers on AM failover
grep -n "recoverContainer\|onAMRestart" \
$(find tez-dag/src/main/java -name "*.java" | head -50) | head
When the AM dies and YARN restarts attempt 2, the old containers are still
running. YARN passes them to the new AM via getContainersFromPreviousAttempts.
If the new AM mis-handles the priority mapping, those containers can become
orphaned — neither released nor reused — until the YARN-level
yarn.am.liveness-monitor.expiry-interval-ms kicks in.
Resource fragmentation
Tez does not reshape containers. A 4 GB container allocated for a heavyweight mapper sits idle through the reduce phase if reducers want 2 GB containers — the 4 GB block is not subdivided.
Container blacklisting
grep -n "blacklist\|NodeTracker" \
$(find tez-dag/src/main/java -name "*.java" | grep rm) | head
A node accumulating task failures gets blacklisted; held containers on that node are released even within the idle window.
Tuning playbook
| Goal | Tune |
|---|---|
| Reduce p50 task latency | Increase tez.am.container.idle.release-timeout-max.millis — keep JVMs warm longer. |
| Reduce YARN queue pressure | Lower tez.am.container.idle.release-timeout-min.millis — return idle containers faster. |
| Improve locality on long DAGs | Increase tez.am.container.reuse.locality.delay-allocation-millis. |
| Hive interactive queries | Enable session pools (hive.server2.tez.initialize.default.sessions) and large reuse windows. |
| Debugging "why was this container released?" | Set log4j level for org.apache.tez.dag.app.rm to DEBUG. |
Reading exercise
wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java")then read the state machine declaration block. Count states and transitions.grep -n "DelayedContainerManager" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java")— find the sweep loop.grep -rn "idle.release-timeout" tez-dag/src/main/java— list all read sites for the idle timeout.grep -n "previousAttemptContainers\|registerApplicationMaster" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java")— how does the AM enumerate inherited containers on failover?cat tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | grep -A 1 "REUSE\|REUSE_ENABLED"— list every reuse-related config key.grep -n "containerCompleted" $(find tez-dag/src/main/java -name "AMContainerImpl.java")— where does the AM learn that the JVM exited unexpectedly?
Common bugs and symptoms
| Symptom | Likely cause |
|---|---|
0% reuse despite tez.am.container.reuse.enabled=true | Priority mismatches; verify with AM log Container released because no matching request. |
| Hive query slow after token refresh | Reused container holding stale HiveMetastore delegation token. Release after refresh or shorten reuse window. |
AM log spam: Released container X because expired | Tasks completing faster than next-wave dispatch — lower idle.release-timeout-min. |
| YARN queue at 100% but tasks pending | Held containers at wrong priority blocking new allocations; check nm-rm-heartbeat-interval-ms. |
| Containers orphaned after AM crash | New AM did not register previous containers; check |
getContainersFromPreviousAttempts handling. |
Validation: prove you understand this
- Describe the four-step locality relaxation a
HeldContainerundergoes. - Why is priority-strict matching enforced even when relaxing locality? Cite the RM accounting consequence.
- Given
idle.release-timeout-min=5000,idle.release-timeout-max=10000, and 200 ms between successive task completions on the same vertex, what fraction of containers get reused? - Identify the exact configuration key that controls whether RM-fresh containers can be assigned to a task different from the one that triggered the request. Cite file:line.
- Sketch the sequence of AM events when an AMContainer transitions
RUNNING → IDLE → RUNNINGwith reuse, including which state machine emits each event.