Stage 1 — Docs and Tests
What this stage teaches
Stage 1 is the on-ramp. The skills are deliberately non-technical:
- Navigate the Apache JIRA workflow: claim a ticket, assign it to yourself, attach a patch, set "Patch Available", respond to review.
- Run mvn apache-rat:check and mvn checkstyle:check cleanly.
- Produce a git format-patch artifact that applies on master.
- Wait for a Jenkins precommit run and read its output without panicking.
The contributions themselves are surgical: a docs typo, a missing @since tag, a
@param javadoc that the linter complains about, a LOG.info whose message is
misleading. Nothing in this stage will surprise a reviewer. That is the point: you
are exercising the workflow so the next stages can be about code.
JIRA filter to find candidates
Real JQL you can paste into https://issues.apache.org/jira/issues:
project = TEZ
AND labels in (newbie, beginner, "newbie-friendly", "low-hanging-fruit")
AND resolution = Unresolved
AND (component in (Documentation) OR summary ~ "typo" OR summary ~ "javadoc")
ORDER BY updated DESC
A second filter that often surfaces good Stage 1 work — javadoc that the build already flags:
project = TEZ AND status = Open AND text ~ "javadoc" AND text ~ "missing"
Open three candidates and read each comment thread end to end. Choose one that has no assignee and no attached patch, and that was last updated more than three months ago. That is the abandoned-but-still-valid ticket: a perfect Stage 1.
If nothing fits, file your own. Walk the docs/src/site/markdown/ tree and grep for
broken links, stale Hadoop version numbers, and configuration keys removed years ago:
cd ~/tez-src
grep -rn "tez\.am\.task\.max\.failed\.attempts" docs/src/site/markdown/
grep -rn "hadoop-2\.[0-6]" docs/src/site/markdown/
grep -rn "TODO\|FIXME\|XXX" docs/src/site/markdown/
A genuine doc bug found this way is fair game for your first JIRA.
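The greps above catch stale strings but not dead links. A minimal sketch of a relative-link check follows. The function name find_broken_links is hypothetical, and the sketch assumes the docs are plain .md files that use inline [text](target) links without titles; adjust it if the tree differs.

```shell
#!/usr/bin/env bash
# find_broken_links DIR
# Hypothetical helper: prints "file: target" for every relative markdown link
# under DIR whose target file does not exist. External URLs, mailto: links and
# pure #anchor links are skipped.
find_broken_links() {
  local root="$1" file target path
  while IFS= read -r -d '' file; do
    # Pull out the "(target)" half of each [text](target) link, one per line.
    grep -o '\]([^)]*)' "$file" | sed 's/^](//; s/)$//' |
      while IFS= read -r target; do
        case "$target" in
          http://*|https://*|mailto:*|'#'*|'') continue ;;
        esac
        path="${target%%#*}"   # drop a trailing #anchor before testing the file
        [ -e "$(dirname "$file")/$path" ] || printf '%s: %s\n' "$file" "$target"
      done
  done < <(find "$root" -name '*.md' -print0)
}
```

Usage: find_broken_links docs/src/site/markdown. Confirm each hit by hand before filing a JIRA for it; the check is deliberately naive.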
Walked example — TezConfiguration javadoc missing @since
Symptom: a contributor reports on dev@ that TezConfiguration.TEZ_AM_RESOURCE_MEMORY_MB
has no @since tag, so users cannot tell which release introduced the property.
Step 1 — Locate the symbol
cd ~/tez-src
grep -n "TEZ_AM_RESOURCE_MEMORY_MB" \
tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head
Open the file. The relevant block looks roughly like:
@ConfigurationScope(Scope.AM)
public static final String TEZ_AM_RESOURCE_MEMORY_MB =
TEZ_AM_PREFIX + "resource.memory.mb";
public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;
No javadoc, no @since. That is the bug.
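To see whether sibling constants have the same gap, a rough scan can list every public static final field that is not directly preceded by a comment block. This is a heuristic sketch, not a parser: list_undocumented_constants is a hypothetical name, and it assumes the layout shown in the excerpt above (optional javadoc, then optional annotations, then the field, with the constant name on the same line as "public static final").

```shell
#!/usr/bin/env bash
# list_undocumented_constants FILE
# Hypothetical helper: prints the name of each `public static final` field that
# has no comment block immediately above it. Annotation lines between the
# comment and the field are allowed. Any closing "*/" counts as documentation,
# so plain block comments are treated the same as javadoc (an assumption).
list_undocumented_constants() {
  awk '
    /\*\/[ \t]*$/ { documented = 1; next }   # a comment block just ended
    /^[ \t]*@/    { next }                   # annotations do not reset the flag
    /public static final/ {
      if (!documented && match($0, /[A-Z_][A-Z0-9_]*[ \t]*[=;]/)) {
        name = substr($0, RSTART, RLENGTH)
        sub(/[ \t]*[=;]$/, "", name)         # strip the trailing " =" or ";"
        print name
      }
      documented = 0
      next
    }
    { documented = 0 }                       # any other line breaks adjacency
  ' "$1"
}
```

Run it against TezConfiguration.java and treat the output as candidates to inspect, not as a final list.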
Step 2 — Claim the JIRA
On https://issues.apache.org/jira/projects/TEZ:
- Click Create, set Project = TEZ, Issue Type = Improvement.
- Summary: Add @since tags and javadoc for TEZ_AM_RESOURCE_MEMORY_MB family.
- Component: tez-api. Affects Version: 0.10.3. Fix Version: leave blank; the release manager sets it.
- Description: state the symptom, paste the grep above, link the dev@ thread.
- Save, then click Assign to me.
Step 3 — Diff
--- a/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
@@
+ /**
+ * Memory (in MB) requested for the AppMaster container. If the AM is launched
+ * by YARN, this is passed through to {@link
+ * org.apache.hadoop.yarn.api.records.Resource#setMemorySize(long)} on the
+ * {@code ApplicationSubmissionContext}.
+ *
+ * @since 0.5.0
+ */
@ConfigurationScope(Scope.AM)
public static final String TEZ_AM_RESOURCE_MEMORY_MB =
TEZ_AM_PREFIX + "resource.memory.mb";
+ /** Default value of {@link #TEZ_AM_RESOURCE_MEMORY_MB}. @since 0.5.0 */
public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;
Two rules for @since:
- Look at the earliest commit that introduced the symbol, not the current version. Run git log --diff-filter=A -- tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java, then git log -S "TEZ_AM_RESOURCE_MEMORY_MB" -- tez-api/.... Cross-reference the commit hash against the release tags (git tag --contains <hash>).
- Never guess. If you cannot find the release, ask on dev@. A wrong @since is worse than no @since.
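The lookup in the first rule can be chained into one helper. This is a sketch under stated assumptions: first_release_with is a hypothetical name, and it assumes that ordering tags by creation date puts the earliest release first, which should be checked against the project's actual tag history before the answer is trusted.

```shell
#!/usr/bin/env bash
# first_release_with SYMBOL PATH
# Hypothetical helper: prints the oldest tag that contains the commit which
# first added SYMBOL under PATH. Run it from inside the repository.
first_release_with() {
  local symbol="$1" path="$2" commit
  # -S lists commits that change how often the string occurs;
  # --reverse puts the oldest such commit first.
  commit=$(git log --reverse --format=%H -S "$symbol" -- "$path" | head -n 1)
  if [ -z "$commit" ]; then
    echo "no commit adds or removes $symbol under $path" >&2
    return 1
  fi
  # Assumption: the tag with the earliest creation date is the first release.
  git tag --contains "$commit" --sort=creatordate | head -n 1
}
```

Usage: first_release_with TEZ_AM_RESOURCE_MEMORY_MB tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java. If the output is empty or looks wrong, the second rule still applies: ask on dev@.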
Step 4 — Build and lint
cd ~/tez-src
mvn -pl tez-api -am clean install -DskipTests -Phadoop28 -q
mvn -pl tez-api checkstyle:check -q
mvn -pl tez-api apache-rat:check -q
mvn -pl tez-api javadoc:javadoc -q 2>&1 | grep -i "error\|warning" | head
The javadoc target is the slowest gate in Tez. Run it. If it warns about an @link
that no longer resolves, fix that in the same patch — reviewers will ask anyway.
Step 5 — Format and attach the patch
cd ~/tez-src
git add tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
git commit -m "TEZ-XXXX. Add @since tags for TEZ_AM_RESOURCE_MEMORY_MB family"
git format-patch -1 HEAD --stdout > /tmp/TEZ-XXXX.001.patch
The Tez convention is TEZ-XXXX.NNN.patch, where NNN starts at 001 and
increments on every reroll. Upload the file to the JIRA, then click
"Submit Patch" so the status flips to Patch Available. Jenkins precommit
will pick it up within an hour and post results.
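The naming rule is easy to get wrong by hand on a reroll. A small sketch that derives the next file name from the patches already saved locally; next_patch_name is a hypothetical helper, and it assumes every earlier reroll is kept in one directory.

```shell
#!/usr/bin/env bash
# next_patch_name ISSUE [DIR]
# Hypothetical helper: prints ISSUE.NNN.patch, where NNN is one higher than the
# highest reroll already present in DIR (default: the current directory).
next_patch_name() {
  local issue="$1" dir="${2:-.}" max=0 f n
  for f in "$dir"/"$issue".[0-9][0-9][0-9].patch; do
    [ -e "$f" ] || continue        # the glob matched nothing
    n="${f%.patch}"                # strip the extension
    n="${n##*.}"                   # keep only the NNN field
    n=$((10#$n))                   # force base 10 so 008 is not read as octal
    if [ "$n" -gt "$max" ]; then
      max="$n"
    fi
  done
  printf '%s.%03d.patch\n' "$issue" $((max + 1))
}
```

One way to use it: git format-patch -1 HEAD --stdout > "/tmp/$(next_patch_name TEZ-XXXX /tmp)".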
Step 6 — Respond to review
Almost certain reviewer requests for a docs patch:
- "Add {@value} macros so the default appears inline."
- "Wrap the line at 100 chars."
- "Capitalise the first word of the javadoc sentence."
Reroll as 002; never overwrite the 001 file. Each reroll is an attachment in
JIRA, not a force-push; reviewers compare attachments by name.
Pitfalls
- Don't fix two bugs in one patch. A whitespace cleanup tacked onto a typo fix is the most common reason a Stage 1 patch sits unmerged for months.
- Don't run mvn install without -DskipTests. The full test suite takes well over an hour. For a docs patch you need only the lint targets above.
- Don't squash through git rebase -i master and call git diff master; the Apache toolchain expects git format-patch -1 output. The two are not identical whenever your branch contains merge commits.
- Don't paste the diff into the JIRA description. Attach the .patch file.
- Don't request a reviewer in the JIRA description. Use the Assignee field to assign to yourself and let committers self-select. CC on dev@ if it has been more than two weeks with no review.
- Don't open a GitHub PR instead of a JIRA patch unless the project guide says so. As of 0.10.x, Tez accepts GitHub PRs but the JIRA is still the source of truth and must be referenced in the PR title.
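Several of these pitfalls can be caught mechanically before upload. The sketch below is a hypothetical pre-upload check, not part of any Tez tooling: check_patch tests only the file name, that the file holds exactly one commit, and that the subject line names the issue. It assumes the file was produced by git format-patch.

```shell
#!/usr/bin/env bash
# check_patch FILE
# Hypothetical pre-upload check. Prints "ok" and returns 0 when FILE looks like
# a single-commit format-patch artifact named TEZ-XXXX.NNN.patch; otherwise
# prints the first problem found and returns 1.
check_patch() {
  local file="$1" base issue commits
  base=$(basename "$file")
  if ! [[ "$base" =~ ^(TEZ-[0-9]+)\.[0-9]{3}\.patch$ ]]; then
    echo "bad name: $base (want TEZ-XXXX.NNN.patch)"
    return 1
  fi
  issue="${BASH_REMATCH[1]}"
  # Each commit in format-patch output starts with "From <40-hex-sha> <date>".
  commits=$(grep -c -E '^From [0-9a-f]{40} ' "$file" || true)
  if [ "$commits" -ne 1 ]; then
    echo "expected 1 commit, found $commits"
    return 1
  fi
  if ! grep -q "^Subject: .*$issue" "$file"; then
    echo "subject line does not mention $issue"
    return 1
  fi
  echo "ok"
}
```

A passing check says nothing about whether the patch applies on master; git apply --check on an up-to-date checkout covers that separately.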
Exit criteria — when you're ready for the next stage
You can move to Stage 2 when:
- You have one merged docs or javadoc patch and one merged test-only patch (typically a missing @Test method or a broken assertion message in tez-tests/).
- You have responded to at least one round of reviewer nits without needing the reviewer to walk you through git format-patch syntax.
- A green Jenkins precommit run on your patch no longer makes you nervous, and you can read the report and tell which warnings are pre-existing versus introduced by your change.
- You can recite from memory: "JIRA first, branch from master, one logical change per patch, TEZ-XXXX.NNN.patch naming, attach not paste."
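Telling pre-existing warnings from new ones can be practised locally by saving the lint or javadoc output once on master and once on your branch, then diffing the two. A minimal sketch, assuming both logs are plain text files you captured yourself; new_warnings is a hypothetical name and the ":<line>:" stripping is an assumption about the log format.

```shell
#!/usr/bin/env bash
# new_warnings BASELINE_LOG PATCHED_LOG
# Hypothetical helper: prints warning lines that appear in PATCHED_LOG but not
# in BASELINE_LOG. The first ":<line>:" field is removed so that a warning
# which only moved further down the file is not reported as new.
new_warnings() {
  local base="$1" patched="$2"
  comm -13 \
    <(grep -i 'warning' "$base"    | sed 's/:[0-9][0-9]*:/:/' | sort -u) \
    <(grep -i 'warning' "$patched" | sed 's/:[0-9][0-9]*:/:/' | sort -u)
}
```

Empty output means the branch adds no warning text that master did not already produce.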
A second walked example — fixing a misleading log message
Symptom: a contributor sees a LOG.info in tez-dag that reads:
LOG.info("Vertex " + vertexName + " has " + numTasks + " tasks");
But it fires every time the vertex is re-initialised, not just on first initialisation. The message implies a one-shot event; operators have complained that they cannot grep the log to find unique vertices.
The diff
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
- LOG.info("Vertex " + vertexName + " has " + numTasks + " tasks");
+ LOG.info("Vertex {} (id={}) initialised with {} tasks (init count={})",
+ vertexName, vertexId, numTasks, ++initCount);
Three changes in one diff:
- The message uses slf4j placeholders.
- The vertex ID is added so operators can correlate with downstream ATS events.
- The init counter makes the "re-initialise" case visible.
This patch is technically a borderline Stage 3 candidate (it adds the vertex ID — see stage-3-error-messages.md). For a first patch, the JIRA description should explicitly say "I am only changing the log message; the init-count field is added but no transition behaviour changes." That framing keeps the patch in Stage 1 scope.
Test
A log-message change usually has no functional test. The reviewer signal
is a manual run of a small OrderedWordCount against MiniTezCluster
with the modified jar, and a grep of the resulting log to confirm the new
format. Document the grep in the JIRA comments:
grep "initialised with" tez-am.log | head
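The same log can also confirm that the init counter does what the patch claims. The sketch below assumes the exact message format from the diff above and a log file you collected yourself; reinitialised_vertices is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# reinitialised_vertices LOG
# Hypothetical helper: prints "<vertex name> <highest init count>" for every
# vertex that was initialised more than once, based on the new message format.
reinitialised_vertices() {
  # Reduce each matching line to "<vertex name> <init count>".
  sed -n 's/.*Vertex \(.*\) (id=[^)]*) initialised with [0-9]* tasks (init count=\([0-9]*\)).*/\1 \2/p' "$1" |
    awk '
      {
        c = $NF + 0               # the last field is the init count
        sub(/ [0-9]+$/, "")       # what remains is the vertex name (may contain spaces)
        if (c > max[$0]) max[$0] = c
      }
      END { for (v in max) if (max[v] > 1) print v, max[v] }
    ' | sort
}
```

Empty output on a run with no re-initialisation, and one line per affected vertex otherwise, is reasonable evidence to paste into the JIRA alongside the grep.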
When to file a follow-up
If, while working on a Stage 1 patch, you discover a bigger issue
(suppose the javadoc is missing because the configuration key was
silently renamed without an @since in either place), file a follow-up
JIRA in the same component. Do not bundle the bigger fix into your
Stage 1 patch.
Standard wording in your JIRA comments:
While working on TEZ-XXXX I noticed that TEZ_AM_RESOURCE_MEMORY_MB was
renamed from TEZ_AM_MEMORY_MB in 0.7.0 without an @deprecated on the
old key. Filed TEZ-YYYY to track the deprecation cleanup.
This habit — narrow Stage 1 patch + follow-up JIRA — is what reviewers mean when they say "keep patches focused." It is the skill the rest of the roadmap depends on.
Where Stage 1 patches go wrong
The two most common failure modes for a Stage 1 patch:
- Scope creep. The contributor "just fixes" three sibling issues while editing the file. Reviewers ask for a split. The contributor rerolls incompletely. Two months later the patch is abandoned.
- Silent rebase break. The contributor rebases on master, the patch no longer applies cleanly, but they never upload an 002 reroll. The committer sees a stale patch and moves on.
Neither failure is about code. Both are about workflow discipline. Stage 1 exists to drill that discipline before the stakes get higher.
Stage 2 will move you from documentation into code that runs in production AMs.