Step 1: Issue Selection
Picking the wrong issue is the most expensive mistake in the Capstone. Two weeks of
investigation on a JIRA that turns out to be a duplicate, a WONTFIX, or a
multi-month rearchitecture is two weeks you do not get back. The goal of this step
is not to find a perfect issue. It is to find a tractable issue that exercises
the parts of Tez you actually know.
Budget: 1–3 days. If you are past day 4 and still triaging, your standards are too high.
Where the Real Issues Live
Apache Tez tracks issues in JIRA at:
https://issues.apache.org/jira/projects/TEZ
There is no `good-first-issue` label on Tez (unlike Hadoop). The closest
proxies are the `newbie` label, very small subtasks of larger umbrella issues, and stale
unassigned bugs with reproducers attached. You will write your own JQL.
Starter JQL Queries
Run these in JIRA's "Advanced" search box. Open each in a separate tab; do not chase one result before you have seen the whole landscape.
1. Unassigned open bugs, sorted by recency:
project = TEZ AND status in (Open, "In Progress")
AND assignee is EMPTY
AND type = Bug
ORDER BY created DESC
2. Bugs with attachments (a reproducer is the gold standard, but an attachment can also be a log or an old patch, so check):
project = TEZ AND status = Open
AND type = Bug
AND attachments is not EMPTY
ORDER BY updated DESC
3. Newbie-labeled (small surface area):
project = TEZ AND status = Open
AND (labels = newbie OR labels = beginner OR labels = "low-hanging-fruit")
ORDER BY priority DESC, created DESC
4. Flaky tests (Stage 9 territory, often great Capstone fodder):
project = TEZ AND status = Open
AND (summary ~ "flaky" OR summary ~ "intermittent" OR description ~ "flaky")
ORDER BY votes DESC
5. Open bugs touching modules you know:
project = TEZ AND status = Open AND type = Bug
AND (component in ("tez-dag", "tez-runtime-internals", "tez-runtime-library")
OR summary ~ "VertexImpl"
OR summary ~ "ShuffleManager"
OR summary ~ "AsyncDispatcher")
ORDER BY created DESC
Cast a wide net. Pull 20+ candidates into a scratchpad. You will trim aggressively.
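To open each query in its own tab, it can help to turn the JQL into a shareable link. The following is a minimal sketch, assuming the usual Jira issue-navigator URL shape (`/issues/?jql=...`); the base URL and the helper name `jql_url` are assumptions for illustration, so confirm the link in your browser before relying on it.

```python
from urllib.parse import quote

# Assumption: the common Jira issue-navigator URL shape on the ASF instance.
JIRA_SEARCH_BASE = "https://issues.apache.org/jira/issues/?jql="


def jql_url(jql: str) -> str:
    """Collapse a multi-line JQL query to one line and percent-encode it."""
    one_line = " ".join(jql.split())
    return JIRA_SEARCH_BASE + quote(one_line, safe="")


# Query 1 from the list above, copied verbatim.
UNASSIGNED_BUGS = """
project = TEZ AND status in (Open, "In Progress")
AND assignee is EMPTY
AND type = Bug
ORDER BY created DESC
"""

if __name__ == "__main__":
    print(jql_url(UNASSIGNED_BUGS))
```

Keeping the queries as named constants in one file also gives you a record of exactly what you searched when you write up your shortlist.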
Triage: Pick 5 Finalists from 20
For each candidate, spend 10–15 minutes — no more — answering this single question: "Could I write a failing test for this today?" If "no" or "I have no idea," drop it. If "probably yes, here's how," keep it.
Concrete triage protocol:
- Read the JIRA description and every comment. Watch for "I cannot reproduce" or "this is a duplicate of TEZ-XXXX" buried at the bottom.
- Check `git log --grep "TEZ-NNNN"` in your `~/tez-src/` clone: has it already been partially fixed?
- Search the dev@ mailing list archive for the issue number: https://lists.apache.org/list.html?dev@tez.apache.org.
- Open the linked files in your editor. Are they in `tez-dag`, `tez-runtime-*`, `tez-api` (familiar territory), or `tez-ui`, `tez-plugins`, `tez-yarn-timeline-*` (less familiar; skip unless you specifically studied them)?
- Note the Affects-Versions field. If it only affects 0.8.x and master has been rewritten in the area, the fix may not be portable.
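The `git log --grep` check in the protocol above has one trap: a pattern such as `TEZ-432` also matches commits for `TEZ-4321`. Below is a small sketch that runs the search and then keeps only exact matches. The helper names and the default `~/tez-src` path are illustrative assumptions; adjust them to your setup.

```python
import re
import subprocess
from pathlib import Path

ISSUE_KEY = re.compile(r"^TEZ-\d+$")


def git_log_command(issue_key: str) -> list:
    """Build the git invocation; reject anything that is not a TEZ-NNNN key."""
    if not ISSUE_KEY.match(issue_key):
        raise ValueError(f"not a Tez issue key: {issue_key!r}")
    return ["git", "log", "--oneline", f"--grep={issue_key}"]


def exact_matches(issue_key: str, log_output: str) -> list:
    """Drop lines where the key is only a prefix of a longer issue number."""
    exact = re.compile(rf"\b{re.escape(issue_key)}\b")
    return [line for line in log_output.splitlines() if exact.search(line)]


def commits_for(issue_key: str, repo: Path = Path.home() / "tez-src") -> list:
    """Run git in the clone and return one-line summaries mentioning the key."""
    result = subprocess.run(
        git_log_command(issue_key),
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return exact_matches(issue_key, result.stdout)
```

An empty list from `commits_for("TEZ-NNNN")` suggests no commit on the current branch references the issue; it does not rule out a fix that landed under a different issue number.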
Keep the 5 finalists in a markdown table:
| TEZ-NNNN | Title | Component | Reproducer? | Last activity | My read |
|---|---|---|---|---|---|
| TEZ-4321 | Fetcher hangs on connection reset | tez-runtime-library | none | 2024-11 | Plausible; I know ShuffleManager |
| TEZ-4456 | VertexImpl NPE on V_ROUTE_EVENT after kill | tez-dag | stack trace only | 2025-02 | Race-y; familiar state machine |
| ... | | | | | |
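If you keep your finalists as structured data, the table can be regenerated rather than edited by hand. This is a sketch only: the column names are taken from the table above, and `render_table` is a hypothetical helper.

```python
COLUMNS = ("TEZ-NNNN", "Title", "Component", "Reproducer?", "Last activity", "My read")


def render_table(rows) -> str:
    """Render finalist rows as a markdown table with the shortlist columns."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "|" + "---|" * len(COLUMNS),
    ]
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} cells, got {len(row)}")
        # Escape literal pipes so a title cannot break the table layout.
        cells = [str(cell).replace("|", "\\|") for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Illustrative row copied from the example table, not a verified JIRA.
FINALISTS = [
    ("TEZ-4321", "Fetcher hangs on connection reset", "tez-runtime-library",
     "none", "2024-11", "Plausible; I know ShuffleManager"),
]

if __name__ == "__main__":
    print(render_table(FINALISTS))
```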
Scoring Rubric
Score each finalist 0–2 in each column. The winner is the highest aggregate.
| Criterion | 0 | 1 | 2 |
|---|---|---|---|
| Clarity | Description is one sentence and ambiguous | Description names symptom but not conditions | Clear symptom + reproduction conditions in description |
| Scope | Open-ended ("refactor X") | Bounded but spans modules | Bounded to one or two classes |
| Isolation | Requires Hive/Pig running | Needs MiniTezCluster | Can be reproduced in pure unit test |
| Testability | No clear failing assertion possible | Failing assertion possible after MiniTezCluster run | Failing assertion possible in DrainDispatcher test |
| Alignment | Touches code I have never read | Touches one familiar class | Touches 2–3 classes I have studied in Levels 4–6 |
| Community engagement | Last activity > 2 years, no watchers | Some activity in last year | Recently discussed; a committer responded |
Total possible: 12. Anything below 7 is risky. Pick the 9+ candidate.
Three Worked Examples
These are illustrative archetypes, not literal current JIRAs.
Candidate A: "ShuffleManager retries forever on IOException: Connection reset"
- Clarity: 2 (description names the exception and the loop).
- Scope: 2 (one class: `ShuffleManager` or `Fetcher`).
- Isolation: 1 (need a fake `Fetcher` to inject the exception).
- Testability: 2 (mock-based unit test with retry counter assertion).
- Alignment: 2 (you read this in Level 5).
- Community engagement: 1 (one committer comment, no resolution).
- Total: 10. Pick this.
Candidate B: "Refactor DAGImpl state machine to use enum-based transitions"
- Clarity: 1 (vague — "refactor").
- Scope: 0 (touches `DAGImpl`, every event handler, every test).
- Isolation: 0 (no failing behavior to test).
- Testability: 0 (regression-only testing).
- Alignment: 1 (you know `DAGImpl` but this is huge).
- Community engagement: 0 (no committer +1).
- Total: 2. Skip. This is a months-long design proposal, not a bug.
Candidate C: "Container reuse logs say assigned then released for same container"
- Clarity: 2 (you can pull the log lines from the description).
- Scope: 1 (touches `TaskSchedulerManager` and possibly `YarnTaskSchedulerService`).
- Isolation: 0 (need `MiniYARNCluster`: slow, flaky, environment-sensitive).
- Testability: 1 (assertions are on log content + scheduler state).
- Alignment: 1 (you read `TaskSchedulerManager` once).
- Community engagement: 2 (recent discussion).
- Total: 7. Borderline. Pick only if you have no candidate above 8 and you budget extra time for the YARN harness.
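The three totals above can be checked mechanically. The sketch below encodes the rubric as six 0–2 scores and applies the thresholds from this section. The "borderline" band for 7–8 is an inference from Candidate C, and the function names are illustrative.

```python
CRITERIA = ("clarity", "scope", "isolation", "testability", "alignment", "community")


def total(scores: dict) -> int:
    """Sum the six rubric scores after checking each is present and in 0..2."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2")
    return sum(scores[name] for name in CRITERIA)


def verdict(points: int) -> str:
    """Below 7 is risky; 9 or more is a pick; 7-8 is treated as borderline."""
    if points >= 9:
        return "pick"
    if points >= 7:
        return "borderline"
    return "risky"


# Scores transcribed from the three worked examples.
CANDIDATES = {
    "A": dict(clarity=2, scope=2, isolation=1, testability=2, alignment=2, community=1),
    "B": dict(clarity=1, scope=0, isolation=0, testability=0, alignment=1, community=0),
    "C": dict(clarity=2, scope=1, isolation=0, testability=1, alignment=1, community=2),
}

if __name__ == "__main__":
    for name, scores in CANDIDATES.items():
        points = total(scores)
        print(name, points, verdict(points))
```

The validation in `total` is the useful part: it catches a forgotten criterion or a score of 3 before the aggregate misleads you.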
Claiming the Issue
Once you decide, claim it publicly. This is non-negotiable — it prevents wasted work by others, and it commits you.
JIRA comment template
Hi — I'd like to work on this as part of an extended Tez learning project.
My plan:
1. Build a deterministic reproducer (target: <date+1 week>).
2. Root-cause analysis (target: <date+2 weeks>).
3. Patch + tests posted for review (target: <date+4 weeks>).
I'll post weekly updates here. If anyone with context has pointers on
<specific question, e.g. "whether this race was discussed in TEZ-NNNN">,
I'd be grateful. Otherwise I'll start on the reproducer this week.
— <Your Name>
Then assign the JIRA to yourself. This requires a JIRA account with the contributor role, which the Tez PMC grants on request: comment "please grant contributor role" on any issue and a PMC member will usually action it within a few days.
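The `<date+N weeks>` placeholders in the comment template are simple offsets from the day you claim the issue. Here is a minimal sketch for filling them in; the milestone labels come from the template, while the function name is an illustrative assumption.

```python
from datetime import date, timedelta

# (milestone, weeks after claiming), as listed in the comment template.
MILESTONES = (
    ("Deterministic reproducer", 1),
    ("Root-cause analysis", 2),
    ("Patch + tests posted for review", 4),
)


def milestone_dates(claimed_on: date) -> list:
    """Return (milestone, target date) pairs relative to the claim date."""
    return [(name, claimed_on + timedelta(weeks=weeks)) for name, weeks in MILESTONES]


if __name__ == "__main__":
    for name, target in milestone_dates(date.today()):
        print(f"{name} (target: {target.isoformat()})")
```

ISO dates (`YYYY-MM-DD`) avoid day/month ambiguity for readers in other locales.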
If you get no response in 5 business days
Post to dev@tez.apache.org:
Subject: [TEZ-NNNN] Working on this — any context before I dive in?
Hi all,
I left a comment on TEZ-NNNN <link> last week saying I plan to work on it. No
objections so far, so I'm starting on a reproducer this week. If anyone has
historical context — especially whether this overlaps with TEZ-XXXX — please
shout. Otherwise I'll update the JIRA as I make progress.
Thanks,
<Your Name>
If still no response after another week, proceed. Silence on a small bug is permission. (Silence on a redesign proposal is not — different beast.)
Red Flags: Issues to Skip
- Last comment is from a committer saying "we should think about this more." You are not the right person to land a design call.
- Open for >5 years with multiple abandoned patches. Something is structurally hard. Not Capstone material — pick later.
- Touches `tez-ui` (Ember 1.x). The UI is on a separate lifecycle; build and test setup is divergent from the JVM modules you studied.
- "Upgrade dependency X to version Y." Looks easy, ends up rebuilding the shuffle stack to handle a Guava API change. Skip unless you specifically want this experience.
- `Critical` or `Blocker` priority with no patch. A committer would already be on it. If they are not, the issue may be misclassified or stale-critical.
- Reproducer requires a specific Hive version + a 1TB TPC-DS run. No.
Validation / Self-check
Before you advance to Step 2, produce:
- A markdown table of your 5 finalists with full scoring rubric, saved as `capstone-work/issue-shortlist.md`.
- The TEZ-NNNN number of your chosen issue, posted as a JIRA comment claiming it.
- A 1-paragraph statement of why you picked it (which two criteria scored highest and which scored lowest).
- A self-assigned target date for Step 2 (deterministic reproducer in hand).
- Subscription confirmed to `dev@tez.apache.org` and the JIRA itself (click the "Start watching" eye icon).
- Your fork of `apache/tez` exists on GitHub with a branch named `tez-NNNN-<short-slug>` checked out locally.
- A note in `capstone-work/issue-shortlist.md` of any near-miss candidates you may revisit after the Capstone; these are your next contributions.
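The `tez-NNNN-<short-slug>` branch name can be derived from the issue title. The sketch below shows one way to do it; the four-word cap and the helper name `branch_name` are arbitrary choices for illustration, not a project convention.

```python
import re


def branch_name(issue_number: int, title: str, max_words: int = 4) -> str:
    """Build tez-NNNN-<short-slug> from an issue number and its title."""
    # Keep only lowercase alphanumeric runs so the slug is safe as a git ref.
    words = re.findall(r"[a-z0-9]+", title.lower())
    if not words:
        raise ValueError("title has no usable words for a slug")
    return f"tez-{issue_number}-" + "-".join(words[:max_words])


if __name__ == "__main__":
    # Illustrative issue number and title from the example shortlist table.
    print(branch_name(4321, "Fetcher hangs on connection reset"))
```

Pass the result to `git checkout -b` in your fork's clone.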