Stage 1 — Docs and Tests

What this stage teaches

Stage 1 is the on-ramp. The skills are deliberately about workflow rather than code:

  • Navigate the Apache JIRA workflow: claim a ticket, assign it to yourself, attach a patch, set "Patch Available", respond to review.
  • Run mvn apache-rat:check and mvn checkstyle:check cleanly.
  • Produce a git format-patch artifact that applies on master.
  • Wait for a Jenkins precommit run and read its output without panicking.

The contributions themselves are surgical: a docs typo, a missing @since tag, a @param javadoc that the linter complains about, a LOG.info whose message is misleading. Nothing in this stage will surprise a reviewer. That is the point: you are exercising the workflow so the next stages can be about code.

JIRA filter to find candidates

Real JQL you can paste into https://issues.apache.org/jira/issues:

project = TEZ
  AND labels in (newbie, beginner, "newbie-friendly", "low-hanging-fruit")
  AND resolution = Unresolved
  AND (component in (Documentation) OR summary ~ "typo" OR summary ~ "javadoc")
ORDER BY updated DESC

A second filter that often surfaces good Stage 1 work — javadoc that the build already flags:

project = TEZ AND status = Open AND text ~ "javadoc" AND text ~ "missing"

Open three candidates and read each comment thread end to end. Choose one that has no assignee, no patch attached, and no update in the last three months. That is the abandoned-but-still-valid ticket: a perfect Stage 1 candidate.
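
The three selection criteria can also be pushed into the query itself. The helper below is a minimal sketch that only prints a JQL string. The function name stage1_jql is hypothetical, and treating "three months" as 90 days is an assumption.

```shell
# Print a JQL query for unassigned, patch-free, stale tickets.
# Hypothetical helper; the 90-day default approximates "three months".
stage1_jql() {
  local project="${1:-TEZ}"
  local days="${2:-90}"
  printf 'project = %s AND resolution = Unresolved' "$project"
  printf ' AND assignee is EMPTY AND attachments is EMPTY'
  printf ' AND updated <= -%sd ORDER BY updated ASC\n' "$days"
}
```

Running stage1_jql TEZ 90 prints one line to paste into the JIRA search box. The attachments clause cannot see a patch that was only mentioned in a comment, so reading the thread is still required.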

If nothing fits, file your own. Walk the docs/src/site/markdown/ tree and grep for broken links, stale Hadoop version numbers, and configuration keys removed years ago:

cd ~/tez-src
grep -rn "tez\.am\.task\.max\.failed\.attempts" docs/src/site/markdown/
grep -rn "hadoop-2\.[0-6]" docs/src/site/markdown/
grep -rn "TODO\|FIXME\|XXX" docs/src/site/markdown/

A genuine doc bug found this way is fair game for your first JIRA.

Walked example — TezConfiguration javadoc missing @since

Symptom: a contributor reports on dev@ that TezConfiguration.TEZ_AM_RESOURCE_MEMORY_MB has no @since tag, so users cannot tell which release introduced the property.

Step 1 — Locate the symbol

cd ~/tez-src
grep -n "TEZ_AM_RESOURCE_MEMORY_MB" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head

Open the file. The relevant block looks roughly like:

@ConfigurationScope(Scope.AM)
public static final String TEZ_AM_RESOURCE_MEMORY_MB =
    TEZ_AM_PREFIX + "resource.memory.mb";
public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;

No javadoc, no @since. That is the bug.

Step 2 — Claim the JIRA

On https://issues.apache.org/jira/projects/TEZ:

  1. Click Create, set Project = TEZ, Issue Type = Improvement.
  2. Summary: Add @since tags and javadoc for TEZ_AM_RESOURCE_MEMORY_MB family.
  3. Component: tez-api. Affects Version: 0.10.3. Fix Version: leave blank — the release manager sets it.
  4. Description: state the symptom, paste the grep above, link the dev@ thread.
  5. Save, then click Assign to me.

Step 3 — Diff

--- a/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
@@
+  /**
+   * Memory (in MB) requested for the AppMaster container. If the AM is launched
+   * by YARN, this is passed through to {@link
+   * org.apache.hadoop.yarn.api.records.Resource#setMemorySize(long)} on the
+   * {@code ApplicationSubmissionContext}.
+   *
+   * @since 0.5.0
+   */
   @ConfigurationScope(Scope.AM)
   public static final String TEZ_AM_RESOURCE_MEMORY_MB =
       TEZ_AM_PREFIX + "resource.memory.mb";
+  /**
+   * Default value of {@link #TEZ_AM_RESOURCE_MEMORY_MB}.
+   *
+   * @since 0.5.0
+   */
   public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;

Two rules for @since:

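  1. Use the earliest commit that introduced the symbol, not the current version. Run git log --diff-filter=A -- tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java to see when the file was added, then git log -S "TEZ_AM_RESOURCE_MEMORY_MB" -- tez-api/.... to list the commits that added or removed the symbol. git log prints newest first, so the introducing commit is the last entry. Cross-reference that commit hash against the release tags (git tag --contains <hash>); the oldest release tag listed gives the @since value.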
  2. Never guess. If you cannot find the release, ask on dev@. A wrong @since is worse than no @since.
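
The lookup in rule 1 can be sketched as two small shell functions. This is an illustration under assumptions, not project tooling: the function names are hypothetical, and the tag pattern release-X.Y.Z is assumed, so check git tag in your clone before trusting the result.

```shell
# Read tag names on stdin and print the lowest final release version.
# Assumes tags look like release-X.Y.Z; release candidates are ignored.
earliest_release_tag() {
  grep -E '^release-[0-9]+\.[0-9]+\.[0-9]+$' |
    sed 's/^release-//' |
    sort -t. -k1,1n -k2,2n -k3,3n |
    head -n 1
}

# Hypothetical wrapper: find the oldest commit that added or removed the
# symbol under the given path, then the earliest release containing it.
since_for_symbol() {
  local symbol="$1"
  local path="$2"
  local first
  first=$(git log --reverse --format=%H -S "$symbol" -- "$path" | head -n 1)
  [ -n "$first" ] || return 1
  git tag --contains "$first" | earliest_release_tag
}
```

If the file was moved at some point, limiting git log to its current path can miss the older history. An empty result means rule 2 applies: ask on dev@.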

Step 4 — Build and lint

cd ~/tez-src
mvn -pl tez-api -am clean install -DskipTests -Phadoop28 -q
mvn -pl tez-api checkstyle:check -q
mvn -pl tez-api apache-rat:check -q
mvn -pl tez-api javadoc:javadoc -q 2>&1 | grep -i "error\|warning" | head

The javadoc target is the slowest gate in Tez. Run it. If it warns about an @link that no longer resolves, fix that in the same patch — reviewers will ask anyway.

Step 5 — Format and attach the patch

cd ~/tez-src
git add tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
git commit -m "TEZ-XXXX. Add @since tags for TEZ_AM_RESOURCE_MEMORY_MB family"
git format-patch -1 HEAD --stdout > /tmp/TEZ-XXXX.001.patch

The Tez convention is TEZ-XXXX.NNN.patch, where NNN starts at 001 and increments on every reroll. Upload the file to the JIRA and click "Submit Patch" so the status flips to Patch Available. Jenkins precommit typically picks it up within an hour and posts results.
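
The NNN numbering is easy to get wrong by hand on a reroll. A minimal sketch that derives the next file name from the patches already sitting in a directory follows; the function name next_patch_name is hypothetical.

```shell
# Print the next TEZ-XXXX.NNN.patch name for a JIRA key, based on the
# patches already present in a directory (default: current directory).
next_patch_name() {
  local jira="$1"
  local dir="${2:-.}"
  local last=0
  local f n
  for f in "$dir/$jira".[0-9][0-9][0-9].patch; do
    [ -e "$f" ] || continue      # the glob matched nothing
    n="${f%.patch}"              # drop the extension
    n="${n##*.}"                 # keep only the NNN part
    n="${n#0}"; n="${n#0}"       # strip leading zeros to avoid octal parsing
    if [ "$n" -gt "$last" ]; then last="$n"; fi
  done
  printf '%s.%03d.patch\n' "$jira" "$((last + 1))"
}
```

With no earlier patch in the directory it prints TEZ-XXXX.001.patch; with 001 present it prints TEZ-XXXX.002.patch. This only works if you keep earlier rerolls on disk, which is the same rule as never overwriting them.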

Step 6 — Respond to review

Reviewer requests you should expect on a docs patch:

  • "Add {@value} macros so the default appears inline."
  • "Wrap the line at 100 chars."
  • "Capitalise the first word of the javadoc sentence."

Reroll as 002; never overwrite the 001 file. Each reroll is a new attachment in JIRA, not a force-push, and reviewers compare attachments by name.

Pitfalls

  • Don't fix two bugs in one patch. A whitespace cleanup tacked onto a typo fix is the most common reason a Stage 1 patch sits unmerged for months.
  • Don't run mvn install without -DskipTests. The full test suite takes well over an hour. For a docs patch you need only the lint targets above.
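  • Don't squash with git rebase -i master and then upload the output of git diff master. The Apache toolchain expects git format-patch -1 output, which carries the author, date, and commit message as mail-style headers; a plain git diff carries none of that. The two also diverge whenever your branch contains merge commits.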
  • Don't paste the diff into the JIRA description. Attach the .patch file.
  • Don't request a reviewer in the JIRA description. Use the Assignee field to assign the ticket to yourself and let committers self-select. Ping dev@ if more than two weeks pass with no review.
  • Don't open a GitHub PR instead of a JIRA patch unless the project guide says so. As of 0.10.x, Tez accepts GitHub PRs but the JIRA is still the source of truth and must be referenced in the PR title.
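
A quick guard against a subject line that lacks the JIRA key, or still carries the TEZ-XXXX placeholder, can be sketched as below. The pattern is inferred from the commit message shown in Step 5, the function name has_jira_prefix is hypothetical, and applying the same form to PR titles is an assumption.

```shell
# Succeed only if the subject looks like "TEZ-1234. Some summary".
# The pattern is inferred from the Step 5 commit message, not taken
# from a published project rule.
has_jira_prefix() {
  printf '%s\n' "$1" | grep -Eq '^TEZ-[0-9]+\. [^ ]'
}
```

One way to use it before formatting the patch is has_jira_prefix "$(git log -1 --format=%s)" followed by || echo "fix the commit subject".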

Exit criteria — when you're ready for the next stage

You can move to Stage 2 when:

  • You have one merged docs or javadoc patch and one merged test-only patch (typically a missing @Test method or a broken assertion message in tez-tests/).
  • You have responded to at least one round of reviewer nits without needing the reviewer to walk you through git format-patch syntax.
  • A Jenkins precommit run on your patch no longer makes you nervous: you can read the report and tell which warnings are pre-existing and which were introduced by your change.
  • You can recite from memory: "JIRA first, branch from master, one logical change per patch, TEZ-XXXX.NNN.patch naming, attach not paste."

A second walked example — fixing a misleading log message

Symptom: a contributor sees a LOG.info in tez-dag that reads:

LOG.info("Vertex " + vertexName + " has " + numTasks + " tasks");

But it fires every time the vertex is re-initialised, not just on first initialisation. The message implies a one-shot event; operators have complained that they cannot grep the log to find unique vertices.

The diff

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
-    LOG.info("Vertex " + vertexName + " has " + numTasks + " tasks");
+    // Increment outside the log call so the count never depends on logging.
+    initCount++;
+    LOG.info("Vertex {} (id={}) initialised with {} tasks (init count={})",
+        vertexName, vertexId, numTasks, initCount);

Three changes in one diff:

  1. The message uses slf4j placeholders.
  2. The vertex ID is added so operators can correlate with downstream ATS events.
  3. The init counter makes the "re-initialise" case visible.

This patch is a borderline Stage 3 candidate because it adds the vertex ID (see stage-3-error-messages.md). For a first patch, the JIRA description should explicitly say "I am only changing the log message; the init-count field is added but no transition behaviour changes." That framing keeps the patch in Stage 1 scope.

Test

A log-message change usually has no functional test. Reviewers instead look for a manual run of a small OrderedWordCount job against MiniTezCluster with the modified jar, plus a grep of the resulting log that confirms the new format. Document the grep in the JIRA comments:

grep "initialised with" tez-am.log | head
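
To show reviewers that the re-initialise case is now visible, the same log can be filtered for vertices whose init count went above 1. This is a sketch that assumes the message format from the diff above; the function name reinitialised_vertices is hypothetical, and whatever precedes "Vertex" on each line (timestamp, level) is ignored.

```shell
# Print the IDs of vertices whose init count exceeded 1, one per line.
# Assumes lines of the form:
#   ... Vertex NAME (id=ID) initialised with N tasks (init count=C)
reinitialised_vertices() {
  sed -n 's/.*Vertex .* (id=\([^)]*\)) initialised with [0-9]* tasks (init count=\([0-9]*\)).*/\1 \2/p' "$1" |
    awk '$2 > 1 { print $1 }' |
    sort -u
}
```

Calling reinitialised_vertices tez-am.log prints nothing when every vertex initialised exactly once.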

When to file a follow-up

If, while working on a Stage 1 patch, you discover a bigger issue (suppose the javadoc is missing because the configuration key was silently renamed without an @since in either place), file a follow-up JIRA in the same component. Do not bundle the bigger fix into your Stage 1 patch.

Standard wording in your JIRA comments:

While working on TEZ-XXXX I noticed that TEZ_AM_RESOURCE_MEMORY_MB was
renamed from TEZ_AM_MEMORY_MB in 0.7.0 without an @deprecated on the
old key. Filed TEZ-YYYY to track the deprecation cleanup.

This habit — narrow Stage 1 patch + follow-up JIRA — is what reviewers mean when they say "keep patches focused." It is the skill the rest of the roadmap depends on.

Where Stage 1 patches go wrong

The two most common failure modes for a Stage 1 patch:

  1. Scope creep. The contributor "just fixes" three sibling issues while editing the file. Reviewers ask for a split. The contributor rerolls incompletely. Two months later the patch is abandoned.
  2. Silent rebase break. The contributor rebases on master, the patch no longer applies cleanly, but they never upload an 002 reroll. The committer sees a stale patch and moves on.

Neither failure is about code. Both are about workflow discipline. Stage 1 exists to drill that discipline before the stakes get higher.
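
Both failure modes can be caught with a pre-upload check. The sketch below uses hypothetical function names. Treating "touches exactly one file" as a stand-in for "one logical change" is a rough heuristic and not a project rule, since a legitimate Stage 1 patch can span several files.

```shell
# List the files a format-patch file touches, one per line.
patch_files() {
  sed -n 's|^+++ b/||p' "$1" | sort -u
}

# Succeed if the patch touches exactly one file (scope-creep heuristic).
touches_single_file() {
  [ "$(patch_files "$1" | grep -c .)" -eq 1 ]
}

# Succeed if the patch still applies to the current checkout; run this
# after updating master to notice a silent rebase break.
patch_still_applies() {
  git apply --check "$1"
}
```

If patch_still_applies fails after you update master, rebase and upload the next numbered reroll.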

Stage 2 will move you from documentation into code that runs in production AMs.