Stage 2 — Build and Logging Hygiene

What this stage teaches

Stage 2 teaches the smallest patches that touch running production code: build metadata (pom.xml), logging idioms, and dependency hygiene. You learn:

  • How Tez's dependency version bands work and which bumps are safe within a minor line.
  • The slf4j-api + log4j (or reload4j) logging stack as wired in tez-common, and the four idioms reviewers actively enforce.
  • How to remove deprecated Guava and Hadoop calls without breaking older Hadoop consumers in the supported compatibility band.
  • How to triage log-level mismatches: messages logged at INFO that should be DEBUG (and the reverse).

The patches are still small (5–80 lines) and the risk surface is small, but they go into the AM and the runtime tasks. A LOG.info in ShuffleManager that fires once per fetch will be seen by every operator running Hive-on-Tez.

JIRA filter to find candidates

project = TEZ
  AND resolution = Unresolved
  AND (summary ~ "logging" OR summary ~ "deprecated" OR summary ~ "guava"
       OR summary ~ "bump" OR summary ~ "upgrade dependency"
       OR summary ~ "System.out" OR description ~ "isDebugEnabled")
ORDER BY updated DESC

A second sweep for dependency bumps that the build flags:

project = TEZ AND component in (build) AND status = Open ORDER BY priority DESC

You can also generate candidates by running OWASP / dependency-check:

cd ~/tez-src
mvn -pl tez-common dependency:tree -DoutputType=text | grep -E "guava|jackson|netty"

Any line that flags a Guava 12.x in transitive scope is a Stage 2 candidate, because Tez has been on Guava-shaded internals for years.

Walked example A — System.out.println in production code

Symptom: a grep finds three stray System.out.println calls in tez-runtime-library. They were left over from a debugging session and now show up in NodeManager stdout logs, polluting operator dashboards.

Step 1 — Find every offender

cd ~/tez-src
grep -rn "System\.out\.println\|System\.err\.println" \
  tez-runtime-library/src/main/java tez-runtime-internals/src/main/java tez-dag/src/main/java \
  | grep -v "/test/" | grep -v "examples"

Each hit is a separate JIRA candidate (one stage-2 patch per ticket). Pick one, file the JIRA, claim it.

Step 2 — The diff

Suppose the offender is in tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java:

--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java
@@
-    System.out.println("Spill " + numSpills + " starting, size=" + buffer.position());
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Spill {} starting, size={}", numSpills, buffer.position());
+    }

Three rules in one diff:

  1. Replace System.out with the class's existing slf4j LOG. If the file does not have one, add private static final Logger LOG = LoggerFactory.getLogger(...) at the top.
  2. Use slf4j {} placeholders, not string concatenation. The placeholder form avoids constructing the message string when the log level is filtered out.
  3. Wrap the call in LOG.isDebugEnabled() only when the argument list does non-trivial work (a toString() on a large object, a list copy, a .size() on a synchronized collection). Pure references (numbers, already-bound strings) do not need the guard.

The third rule is the one reviewers nitpick most. The placeholder form already defers toString(), so a guard around a plain LOG.debug("foo {}", x) where x is an int is unnecessary noise. But this:

LOG.debug("Pending {}", scheduledTasks);   // scheduledTasks.toString() is expensive

does benefit from a guard, because scheduledTasks will be toString()-ed before slf4j forms the message.

Step 3 — Verify the build

mvn -pl tez-runtime-library -am clean install -DskipTests -Phadoop28 -q
mvn -pl tez-runtime-library checkstyle:check -q

There is no easy unit test for "no System.out left behind." The reviewer signal is a clean grep across the changed file plus a green checkstyle run.

Walked example B — pom.xml dep bump within the compat band

Symptom: jackson-databind 2.12.x has a known CVE; Tez is pinned to 2.12.6 in the parent POM. The compatibility band for the 0.10.x line allows bumps within the 2.12.* range.

Step 1 — Find the pin

cd ~/tez-src
grep -n "jackson-databind\|jackson.version\|jackson-core" pom.xml

Result, abbreviated:

pom.xml:178:    <jackson.version>2.12.6</jackson.version>

Most jackson artifacts in Tez are governed by ${jackson.version} in the parent POM. That is the only string you change.

Step 2 — The diff

--- a/pom.xml
+++ b/pom.xml
@@
-    <jackson.version>2.12.6</jackson.version>
+    <jackson.version>2.12.7.1</jackson.version>

That is the entire patch. The harder part is justifying it.

Step 3 — The JIRA description

Summary: Bump jackson-databind from 2.12.6 to 2.12.7.1

Description:
2.12.6 is affected by CVE-YYYY-NNNN. 2.12.7.1 is the latest patch on the 2.12
line and is API-compatible per the jackson maintainers' compat notes. We do not
bump to 2.13 / 2.14 here to keep Hive-on-Tez compatibility unchanged.

Verification:
  mvn clean install -DskipTests -Phadoop28
  mvn -pl tez-dag test -Dtest=TestDAGImpl
  mvn -pl tez-runtime-library test -Dtest=TestShuffleManager

Step 4 — Why "within the compat band" matters

If you bumped to 2.14, you would break Hive 3.x users who ship 2.13. A 2.12 → 2.12.7.1 bump is a one-line patch. A 2.12 → 2.14 bump is a six-month compatibility argument and lives in Stage 11. Stay on rung.

Walked example C — log-level mismatch

Symptom: a user reports their NodeManager logs are at 100GB/day. Investigation shows Fetcher is logging every single shuffle fetch at INFO:

LOG.info("Fetcher " + id + " connecting to " + host + ":" + port);

That message fires per attempt per source per fetch. For a 10k-task vertex it is catastrophic.

Diff

--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
@@
-    LOG.info("Fetcher " + id + " connecting to " + host + ":" + port);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Fetcher {} connecting to {}:{}", id, host, port);
+    }

Rules for INFO → DEBUG demotions:

  • The message fires more than once per task attempt → almost always DEBUG.
  • The message fires once per DAG lifecycle event (DAG start, vertex committed, task killed by user) → keep at INFO.
  • The message fires per exception → keep at WARN or ERROR per the existing level, never demote silently.
  • Never demote a log without dev@ confirmation if the message references a contract event (state transition, container release). Operators rely on those for postmortems.

The Fetcher example is uncontroversial; a LOG.info on every state transition in VertexImpl is not — that would be Stage 4.

Pitfalls

  • Don't introduce a logger dependency change in a logging patch. If the file imports org.apache.commons.logging.Log, do not migrate it to slf4j in this patch. That migration is a separate JIRA and a much larger surface area.
  • Don't use Throwable.printStackTrace() even in tests. Reviewers will flag it. Use LOG.error("msg", t) instead.
  • Don't bump a dep across a major version line in a Stage 2 patch. That is Stage 11.
  • Don't mvn versions:use-latest-releases and submit the resulting diff. The bump must be justified per artifact with the CVE or the bug being fixed.
  • Don't remove deprecated Guava calls by adding new Guava calls. The Tez trajectory is off Guava in public code. Replace Preconditions.checkNotNull with Objects.requireNonNull (JDK 7+) — not with a different Guava class.
  • Don't add a LOG.debug guard around a string literal. LOG.debug("hello") needs no guard.

Exit criteria — when you're ready for the next stage

Move on when:

  • You have shipped at least two logging-cleanup patches and one pom dep bump.
  • You can explain, without looking it up, when to add LOG.isDebugEnabled() and when not to.
  • You have read tez-common/src/main/java/org/apache/tez/common/CallableWithNdc.java and understand the NDC pattern used to attach DAG/Vertex IDs to log messages — that knowledge is the bridge to Stage 3.
  • Your last patch was reviewed and merged without a "split this into two JIRAs" comment.

Stage 3 layers on top: you keep the same surgical patch style, but now you make the content of error messages tell the operator which DAG and which vertex.

Appendix — finding logging hygiene candidates yourself

The JIRA filter at the top of this stage may return zero results during quiet periods. When that happens, you can manufacture candidates yourself with two grep patterns that have a high signal-to-noise ratio.

Pattern A — unguarded string concatenation in LOG.debug

cd ~/tez-src
grep -rn 'LOG\.debug(.*+' --include="*.java" tez-dag tez-runtime-internals \
    tez-runtime-library | grep -v isDebugEnabled | head -30

This finds calls of the form LOG.debug("got " + counter) that allocate the concatenated string unconditionally. Pick one, wrap in if (LOG.isDebugEnabled()), attach to a JIRA.

Pattern B — LOG.info calls with high call-site frequency

cd ~/tez-src
grep -rn 'LOG\.info' --include="*.java" tez-runtime-library | wc -l
grep -rn 'LOG\.info' --include="*.java" tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle | head

The shuffle path runs per-fetch — any LOG.info there fires hundreds of thousands of times per DAG. Most are candidates for demotion to DEBUG.

Pattern C — pom files referencing pinned old versions

cd ~/tez-src
grep -rn "<version>" --include="pom.xml" | grep -E "jackson|commons-|guava|netty" \
  | grep -v -- "-test" | head -20

Cross-reference against the latest patch release on the package's GitHub releases page. If your match is two patch versions behind and the changelog mentions a security fix, you have a Stage 2 candidate.

The bar for these "self-found" candidates is the same: file a JIRA before coding, attach a 001 patch, wait for review.