Open-Source Engineer & Contributor

A collection of deep, implementation-level curricula for engineers who want to contribute seriously to major open-source projects — not just fix typos, but build the kind of sustained understanding that leads to committer status.

Each curriculum is designed around how the project is actually developed, tested, reviewed, and maintained by its core contributors. Labs reference real source code, real issue trackers, and real contribution workflows.

Curricula

Project	Focus	Status
Apache Tez	DAG execution engine on YARN — used by Hive, Pig, and custom batch pipelines	Active
OpenSearch	Distributed search & analytics engine on Apache Lucene — search, log analytics, observability	Active
Apache Kafka	Distributed log — producers, consumers, brokers, replication, Streams API	Planned
Apache Flink	Streaming and batch — state machines, checkpointing, watermarks, operators	Planned
Apache Spark	Unified analytics — scheduler, shuffle, RDD lineage, SQL planning	Planned
Apache Hadoop	HDFS, YARN, MapReduce — the foundation layer for everything above	Planned

How to Use This Book

Each curriculum is self-contained. Start at the curriculum's Introduction page and work through its levels sequentially. Levels build on each other — skipping levels skips foundations that later labs depend on.

What you will need for any curriculum:

3+ years of Java (or the project's primary language) on production-grade codebases
Comfort reading large, unfamiliar codebases without a guide
Git, a build tool (Maven / Gradle / sbt), and an IDE (IntelliJ recommended)
Patience: the path from contributor to committer is measured in months to years

Select a curriculum from the table above or from the sidebar to begin.

Apache Tez Open-Source Contributor Curriculum

Welcome to the Apache Tez Open-Source Contributor Curriculum — a complete, implementation-heavy roadmap for engineers who want to become serious Apache Tez contributors and eventually operate at the level of a core contributor, committer, or PMC-aware engineer.

What This Curriculum Is

This is not a tutorial. It is a structured engineering apprenticeship built around how Apache Tez is actually developed, tested, reviewed, and maintained by its committers and PMC members.

Every level is tied to real Apache Tez source code, real JIRA issue patterns, real test infrastructure, and real contribution workflows. The labs mirror the work an Apache Tez committer actually does — reading state machine code, tracing DAG execution paths, debugging shuffle failures, reproducing reported issues, and preparing patches for community review.

The curriculum will not hold your hand. It will point you at the right parts of the codebase, give you the right questions to ask, and push you to develop the muscle memory of someone who works at this level habitually.

Who This Is For

This curriculum is designed for strong backend and distributed systems engineers who:

Have 3+ years of Java development experience (Maven-based projects)
Are familiar with Hadoop, YARN, or MapReduce at a conceptual level
Understand distributed systems fundamentals: scheduling, fault tolerance, partitioning, shuffle
Want to contribute to Apache open-source at a serious level — not just fix typos

You should be comfortable with:

Reading large, unfamiliar Java codebases without a guide
git workflows, reading diffs, working with patch-based reviews
The Hadoop ecosystem at a high level: YARN, HDFS, MapReduce, Hive
Distributed execution concepts: task graphs, data movement, speculative execution

What You Will Be Able to Do

After completing this curriculum, you will be able to:

Capability	Description
Build and test	Build Apache Tez from source, run unit and integration tests, run DAGs locally
Navigate the codebase	Find any class, understand its role, trace execution across module boundaries
Understand DAG execution	Follow a DAG from client submission through AM scheduling to task completion
Debug failures	Diagnose failed task attempts, hung DAGs, shuffle errors, and YARN allocation failures
Trace state machines	Read and reason about `DAGImpl`, `VertexImpl`, `TaskImpl`, `TaskAttemptImpl` state machines
Contribute patches	Reproduce issues, fix bugs, write tests, prepare high-quality patches
Engage the community	Interact productively on JIRA and mailing lists
Understand Hive integration	Trace a SQL query through Hive planning to a Tez DAG execution
Think like a committer	Reason about compatibility, test stability, performance, and release impact

How to Use This Curriculum

Work through the 9 levels sequentially. Do not skip levels. Each level builds directly on the previous one, and the labs depend on the conceptual foundations laid earlier.

Level	Title	Core Focus
1	Hadoop and Tez Foundation	Build, test, first DAG, Hadoop ecosystem
2	Apache Contributor Onboarding	Workflow, patches, JIRA, mailing lists
3	Tez Architecture	DAG model, TezClient, DAGAppMaster, key subsystems
4	DAG Execution Internals	State machines, vertex/task/attempt lifecycle, events
5	Testing and Debugging	Test infra, mini-cluster, debugging failed tasks
6	Hive/Tez Integration	SQL-to-DAG, Hive integration, cross-project bugs
7	Runtime and Shuffle	TezRuntime, I/O abstractions, shuffle and sort
8	Real Issue Contribution	JIRA reproduction, root cause analysis, real patches
9	Advanced Committer / PMC	Performance, backward compatibility, release practices

Beyond the 9 levels, the curriculum includes five additional sections:

Section	Purpose
Contributor Mindset	How to think, behave, and grow as an Apache contributor
Issue Roadmap	Staged progression from beginner-friendly to release-blocking issues
Internals Deep Dives	21 focused deep dives, each with a mini-lab
Hive-on-Tez Labs	Cross-project debugging, SQL-to-DAG tracing, integration bugs
Release, Review, and PMC Practices	Apache governance, voting, licensing, release management

The curriculum closes with a Capstone Project — a full contribution cycle from issue reproduction to merged patch and engineering write-up.

Required Tools

Before starting Level 1, ensure you have the following installed and working:

Java 8 or Java 11 (OpenJDK recommended — match the Tez branch target)
Apache Maven 3.6.3 or newer
Git 2.x
IntelliJ IDEA (strongly recommended) or Eclipse with M2E
Docker (optional — useful for containerized mini-cluster environments)

You will also need:

A clone of the Apache Tez repository (GitHub mirror of the Apache GitBox repo)
A clone of the Apache Hadoop repository (for YARN API context and integration reference)
An account on Apache JIRA (free to create)
Subscription to the Apache Tez mailing lists:
- dev@tez.apache.org — development discussion (required)
- issues@tez.apache.org — JIRA notifications (optional but useful)

Note on Java version: Apache Tez's master branch targets Java 8 as the minimum. Some newer branches may require Java 11. Always check the pom.xml at the root of the branch you are working on.

Apache Tez at a Glance

Apache Tez is a general-purpose DAG execution engine built on top of Apache Hadoop YARN. Hive has supported Tez as an execution engine since Hive 0.13, and Tez is now Hive's primary engine. Other Hadoop ecosystem projects have also run on it, including Pig, Cascading, and (historically) Spark.

Why Tez Exists

MapReduce forces every computation into a Map → Shuffle → Reduce pattern. Complex analytical queries (like multi-join SQL) require chaining many MapReduce jobs, with intermediate results written to HDFS between each stage. This is slow and wasteful.

Tez allows arbitrary directed acyclic graphs (DAGs) of computation where:

Vertices represent computation stages
Edges represent data movement between stages
Container reuse eliminates JVM startup overhead between tasks
Data can be pipelined between tasks without HDFS materialization
The same container can run multiple task types

This makes Tez significantly faster than MapReduce for multi-stage queries.
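To make the vertex/edge model concrete, here is a minimal sketch in plain Java. It does not use the Tez API: `MiniDag` and its methods are hypothetical names chosen for illustration. It stores a graph as vertices and edges and derives one valid order in which vertices become runnable, assuming a vertex can start only after all of its inputs have finished.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is not the Tez DAG class.
final class MiniDag {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    MiniDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    MiniDag addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: a vertex becomes runnable once all of its inputs are done.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String next : downstream.get(v)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        MiniDag dag = new MiniDag()
            .addEdge("Map 1", "Reducer 3")
            .addEdge("Map 2", "Reducer 3")
            .addEdge("Reducer 3", "Reducer 4");
        System.out.println(dag.executionOrder());
    }
}
```

A real Tez DAG is richer than this (edges carry data movement and scheduling properties, and vertices may overlap in time), but the acyclic dependency structure is the same.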

Key Modules

You will spend the majority of your time in these modules:

Module	Path	Description
`tez-api`	`tez-api/`	Public API: `DAG`, `Vertex`, `Edge`, `TezClient`, `DAGClient`
`tez-dag`	`tez-dag/`	Core execution engine: AM, state machines, scheduling
`tez-runtime-library`	`tez-runtime-library/`	Input/Output/Processor implementations, shuffle
`tez-mapreduce`	`tez-mapreduce/`	MapReduce compatibility layer (`MRInput`, `MROutput`)
`tez-runtime-internals`	`tez-runtime-internals/`	Task execution framework, container management
`tez-tests`	`tez-tests/`	Integration tests and system-level tests
`tez-tools`	`tez-tools/`	Utility tools (DAG recovery, history parsing)
`tez-plugins`	`tez-plugins/`	Optional plugins (YARN Timeline Server history logging, history parsing, auxiliary shuffle service)

Key Classes (High-Level Preview)

Class	Module	Role
`TezClient`	`tez-api`	Entry point for DAG submission from a client
`DAGClient`	`tez-api`	Handle for monitoring a submitted DAG
`DAG`	`tez-api`	DAG definition: vertices + edges
`Vertex`	`tez-api`	Vertex definition: processor + parallelism
`DAGAppMaster`	`tez-dag`	ApplicationMaster — orchestrates DAG execution
`DAGImpl`	`tez-dag`	State machine: models DAG lifecycle
`VertexImpl`	`tez-dag`	State machine: models vertex lifecycle
`TaskImpl`	`tez-dag`	State machine: models task lifecycle
`TaskAttemptImpl`	`tez-dag`	State machine: models a single task attempt
`TaskCommunicatorManager`	`tez-dag`	Manages communication between AM and task containers
`TezTaskRunner2`	`tez-runtime-internals`	Runs a task inside a container
`LogicalIOProcessorRuntimeTask`	`tez-runtime-internals`	Wires up I/O processors inside a task

Apache Tez Community

Apache Tez is a mature project with an active but selective community. The codebase reflects years of careful design decisions, many of which are documented in JIRA issues, design documents, and mailing list threads rather than in code comments.

What the community values:

Patches that include tests
Issues that include a clear reproduction case
Comments that demonstrate you have read the existing code
Contributors who engage respectfully and patiently
Sustained contribution over time, not one-off patches

The path from contributor to committer is measured in years, not weeks. That is intentional. The Apache meritocracy rewards sustained, high-quality contribution — not volume of patches.

This curriculum will help you build the habits and depth of understanding that make that path realistic.

Begin with Level 1: Hadoop and Tez Foundation.

Overview & Prerequisites

Status: Full content coming in Phase 11.

This section covers the complete prerequisites checklist, environment setup guide, and how to navigate the curriculum effectively.

Topics covered:

Detailed environment setup (Java, Maven, Git, IDE configuration)
Cloning and verifying the Apache Tez and Hadoop repositories
Subscribing to Apache Tez mailing lists
Setting up an Apache JIRA account
How to navigate each curriculum level
How to use the labs

Tez Warm-Up: From Data Engineer to Source Contributor

Before you read a single line of VertexImpl.java, you need to have sat in the seat of the person whose workload Tez is serving. The engineers who built Tez's state machines, container reuse logic, and shuffle pipelines were solving specific, painful problems that showed up in production Hive and Pig workloads every day. If you skip that context and go straight to the source code, you will memorize class names without understanding why the design exists.

This chapter is the missing first mile. You will run Tez from the outside — as a data engineer would — across a series of practical scenarios covering different data shapes, query patterns, and ecosystem integrations. After each scenario, the chapter maps what you observed back to the source code structures that own it. By the end, you will have a mental model that makes every internal class feel like an old acquaintance rather than an alien term.

What Tez Actually Is (Two Sentences)

Apache Tez is a general-purpose DAG execution engine that runs on Apache YARN. It does not execute SQL or process files itself — it provides a runtime that other systems (Hive, Pig, custom applications) compile their work into, and then Tez runs that compiled work as a directed acyclic graph of parallel tasks.

Everything else — SQL parsing, query planning, physical operators, file format codecs — belongs to the caller. Tez sees vertex descriptors, edge properties, and processor classes. That boundary is what you need to hold clearly in mind throughout the curriculum.

Where Tez Sits in the Data Engineering Spectrum

┌─────────────────────────────────────────────────────────────────────────────┐
│                     Data Engineering Tool Spectrum                          │
│                                                                             │
│  Batch ◄───────────────────────────────────────────────► Streaming         │
│                                                                             │
│  MapReduce    Tez      Spark         Flink         Kafka Streams            │
│  (2004)       (2013)   (2014)        (2014)        (2016)                   │
│  pure batch   batch+   micro-batch   true stream   native stream            │
│               pipelined & batch      & batch                                │
│                                                                             │
│  ──────────────────────────────────────────────────────────────────────     │
│  Ingest Layer:  Flume  →  Kafka  →  Flink/Kafka Streams                    │
│  Storage Layer: HDFS / S3 / ORC / Parquet / Iceberg / Delta Lake           │
│  Query Layer:   Hive (Tez), Presto, Trino, Spark SQL, Flink SQL            │
└─────────────────────────────────────────────────────────────────────────────┘

Tez vs. MapReduce

MapReduce forces every computation into map → shuffle → reduce. A five-join SQL query becomes five chained MapReduce jobs with HDFS materializations between each. Tez expresses that same query as one DAG, pipelines intermediate data between vertices without HDFS writes, and reuses JVMs across tasks. Typical improvement: 2–5x on complex queries, 10x+ on workflows that would have been five MR jobs.

Tez vs. Spark

Dimension	Tez	Spark
Primary use case	Hive SQL (on YARN/HDFS ecosystems)	General batch + ML + streaming
Execution model	YARN-native, container reuse	Driver + executor (YARN or Kubernetes)
In-memory caching	No (disk-backed shuffle)	RDD/DataFrame caching (explicit)
Streaming	Not native	Structured Streaming (micro-batch)
Deployment	YARN only	YARN, Kubernetes, standalone
Hive integration	Deep (Hive's primary engine)	Separate (Hive-on-Spark is less common)
Community	Apache Tez (focused on Hive use case)	Apache Spark (broad general use)

When you are on a Hadoop/YARN cluster where Hive is the primary SQL layer, Tez is the right choice. Spark is a better fit for Python/Scala workloads, ML pipelines, or when you need in-memory caching across multiple queries.

Tez vs. Flink

Flink is a streaming-first engine that also handles batch. Tez is a batch-first engine that handles simple pipelines. The key structural difference: Flink maintains persistent operator state across windows and checkpoints; Tez vertices are stateless per-task (state is external: HDFS, HBase). If you are building event-time windowed aggregations or exactly-once stream processing, you want Flink. If you are running nightly ETL on HDFS data via Hive, Tez is the right tool and Flink would be overengineered for the job.

Tez vs. Flume (Ingest)

Flume is not a computation engine — it is a log/event ingestion agent that moves data from sources (web servers, syslog, Kafka) to sinks (HDFS, Kafka, HBase). The typical pipeline is:

Application Logs → Flume Agent → HDFS (ORC/Parquet files) → Hive table → Tez query

Flume and Tez are not competitors; they are peers in the same pipeline. Tez reads the data that Flume (or Kafka, or Sqoop) landed on HDFS. Knowing this boundary matters when you encounter a data quality bug: is it in the ingest (Flume), the storage format (ORC serialization), or the compute layer (Tez/Hive)?

Data Formats in the Tez Ecosystem

Tez itself is format-agnostic. It does not read or write ORC, Parquet, or Iceberg directly. Tez sees InputDescriptor and OutputDescriptor objects — the actual codec lives in the class pointed to by those descriptors. The format lives in the tez-mapreduce compatibility layer (MRInput, MROutput) or in Hive's vectorized readers.
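The descriptor idea can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not Tez code: `Descriptor`, `Reader`, and `CsvReader` are hypothetical names. The point is that the framework stores only a class name and an opaque payload, instantiates the class by name, and never interprets the payload itself.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: these are not the Tez descriptor classes.
final class DescriptorSketch {

    /** What the framework stores: a class name plus bytes it never interprets. */
    static final class Descriptor {
        final String className;
        final byte[] payload;

        Descriptor(String className, byte[] payload) {
            this.className = className;
            this.payload = payload;
        }
    }

    /** Contract implemented by caller-supplied code. */
    interface Reader {
        String describe(byte[] payload);
    }

    /** A caller-supplied implementation; the framework never names it in code. */
    public static final class CsvReader implements Reader {
        @Override
        public String describe(byte[] payload) {
            return "csv:" + new String(payload, StandardCharsets.UTF_8);
        }
    }

    /** Framework side: instantiate by name and hand over the opaque payload. */
    static String run(Descriptor d) {
        try {
            Reader reader = (Reader) Class.forName(d.className)
                .getDeclaredConstructor()
                .newInstance();
            return reader.describe(d.payload);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load " + d.className, e);
        }
    }

    public static void main(String[] args) {
        Descriptor d = new Descriptor(
            CsvReader.class.getName(),
            "delimiter=,".getBytes(StandardCharsets.UTF_8));
        System.out.println(run(d));
    }
}
```

This is why a format bug rarely lives in the framework: the only format-aware code is the class the descriptor points to.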

ORC (Optimized Row Columnar)

ORC is Hive's native format. When you INSERT INTO an ORC table and query it via Hive-on-Tez:

The input split is an OrcSplit generated by OrcInputFormat in the Hive ORC library.
Tez receives that split as a DataSourceDescriptor in the DAGPlan.
MRInput wraps the record reader returned by OrcInputFormat.getRecordReader() (the mapred API that Hive's OrcInputFormat implements), feeding vectorized row batches to Hive's MapOperator.
The key Tez entry point is MRInputLegacy.createReaderInternal() in tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInputLegacy.java.

ORC's predicate pushdown (row group skipping) and column pruning happen before Tez sees the data, inside the ORC reader that OrcInputFormat creates. If a Hive-on-Tez query reads 10 billion rows instead of the 100K it should because predicate pushdown was not applied, the bug is in ORC/Hive, not in Tez.

Parquet

Parquet is the other dominant columnar format, more common in cross-ecosystem pipelines (Spark + Hive interop). With Hive-on-Tez reading Parquet:

ParquetInputFormat generates ParquetInputSplit objects.
Tez receives those as DataSourceDescriptor entries.
Vectorization depth varies: ORC vectorization is deeper in Hive; Parquet goes through an additional row-column translation layer.

From a Tez contributor's standpoint, Parquet vs. ORC differences show up mainly in:

Split size calculations affecting vertex parallelism (how many map tasks Tez schedules)
Record skew when one Parquet file is much larger than others

Iceberg

Apache Iceberg is a table format (not a file format). It stores data in Parquet or ORC files but adds a metadata layer for ACID semantics, time travel, and hidden partitioning. Hive + Tez reads Iceberg via IcebergInputFormat (from the Iceberg Hive runtime JAR).

From Tez's view, Iceberg is yet another InputFormat. The novel behavior is:

Iceberg's snapshot-based read means splits can come from multiple physical locations.
Iceberg's PlanningUtil generates splits that can be much more numerous than traditional partition-based splits — this affects Tez vertex parallelism significantly.
Time-travel queries (SELECT ... FOR SYSTEM_TIME AS OF ...) generate a different split list at query compile time, which Hive encodes into the DAGPlan before Tez sees it.

Key insight for contributors: Tez bugs triggered by Iceberg tables are almost always about parallelism (too many small tasks, too few tasks for large snapshots) or about the DataSourceDescriptor encoding. The actual file reading is not Tez's responsibility.

Scenario 1: Classic Batch ETL — Aggregation Over a Large Table

What the data engineer does:

-- Run in Hive CLI connected to a cluster with Hive-on-Tez enabled
SET hive.execution.engine=tez;

CREATE TABLE daily_sales (
  event_date STRING,
  product_id BIGINT,
  region     STRING,
  revenue    DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZSTD");

-- Query: daily revenue by region, last 90 days
SELECT
  event_date,
  region,
  SUM(revenue)     AS total_revenue,
  COUNT(*)         AS transaction_count
FROM daily_sales
WHERE event_date >= '2026-03-01'
GROUP BY event_date, region
ORDER BY event_date, region;

What Tez does under the hood:

Hive compiles the query to MapWork (map-side partial aggregation) + ReduceWork (global aggregation + sort).
DagUtils.createVertex() in Hive creates the Tez Vertex objects: Map 1 and Reducer 2 for the aggregation, and typically a further reducer vertex (Reducer 3) for the global ORDER BY.
The edge between Map 1 and Reducer 2 is SCATTER_GATHER (partitioned shuffle by GROUP BY key hash).
ShuffleVertexManager auto-parallelism kicks in: it monitors how much data map tasks produce, then dynamically reduces the reducer count if data is smaller than expected (config: tez.shuffle-vertex-manager.desired-task-input-size).
Map tasks run MapProcessor → HashTableContainer (partial agg) → OrderedPartitionedKVOutput (partitioned, sorted).
Reducer tasks run ReduceProcessor → OrderedGroupedKVInput (merge shuffle inputs) → GroupByOperator (final aggregation). The ORDER BY sort is produced by the ordered shuffle into the last reducer, whose FileSinkOperator writes the query result.
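The auto-parallelism step above can be approximated with a small calculation. The sketch below is a simplified model, not the ShuffleVertexManager source: the class name, method name, and parameters are hypothetical, and the real implementation also extrapolates from partially completed source tasks. It assumes the reducer count is only ever reduced from the configured value, never increased.

```java
// Hypothetical, simplified model of reducer auto-parallelism; not Tez source code.
final class AutoParallelismSketch {

    /**
     * @param observedOutputBytes  bytes produced by the source (map) tasks
     * @param desiredTaskInputSize target number of input bytes per reducer
     * @param configuredReducers   reducer count chosen at compile time (upper bound)
     * @param minReducers          lower bound, at least 1
     */
    static int chooseReducers(long observedOutputBytes, long desiredTaskInputSize,
                              int configuredReducers, int minReducers) {
        if (observedOutputBytes < 0 || desiredTaskInputSize <= 0
                || configuredReducers < 1 || minReducers < 1) {
            throw new IllegalArgumentException("sizes and counts must be positive");
        }
        // Ceiling division without floating point.
        long needed = (observedOutputBytes + desiredTaskInputSize - 1) / desiredTaskInputSize;
        long clamped = Math.max(minReducers, Math.min(configuredReducers, needed));
        return (int) clamped;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(chooseReducers(350 * mb, 100 * mb, 100, 1)); // 4
        System.out.println(chooseReducers(0, 100 * mb, 100, 1));        // 1
    }
}
```

The second call shows why a map stage that emits no data still ends up with one reducer in this model rather than zero.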

Dataset characteristics and edge behaviors:

Dataset characteristic	Tez behavior	Source class to read
1 small file (< 1 block)	1 map task, ShuffleVertexManager sets 1 reducer	`ShuffleVertexManager.java`
1,000 files, uniform size	Parallelism = file count (MR split logic)	`MRInputLegacy.java` split sizing
1 file, 10 GB, no ORC splits	1 map task (cannot split non-splittable format)	`OrcInputFormat.isSplittable()`
WHERE predicate on partitioned column	Hive partition pruning, fewer splits passed to Tez	Hive `PartitionPruner`, not Tez
WHERE filters out all rows	0 output bytes from map, ShuffleVertexManager → 1 reducer	`ShuffleVertexManager.onSourceTaskCompleted()`

Bridge to source code:

cd ~/tez-src

# ShuffleVertexManager: the most important vertex manager for map-reduce style DAGs.
# It lives in tez-runtime-library, not tez-dag.
find . -name "ShuffleVertexManager*.java" -not -path "*/test/*"
# tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java

# Auto-parallelism: how ShuffleVertexManager decides to reduce the number of reducers
grep -n "computeParallelism\|desiredTaskInputSize\|onSourceTaskCompleted" \
  tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager*.java | head -30

# The edge between Map 1 and Reducer 2 is SCATTER_GATHER — EdgeProperty documentation
grep -n "SCATTER_GATHER" \
  tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java

Scenario 2: Multi-Table Join — The Real Workload Tez Was Built For

What the data engineer does:

SET hive.execution.engine=tez;
SET hive.auto.convert.join=true;      -- enable map-side (broadcast) joins
SET hive.mapjoin.smalltable.filesize=25000000;  -- 25 MB threshold

SELECT
  o.order_id,
  c.customer_name,
  p.product_name,
  o.quantity * p.unit_price AS line_total
FROM orders          o
JOIN customers       c ON o.customer_id = c.customer_id
JOIN products        p ON o.product_id  = p.product_id
WHERE o.order_date = '2026-05-31'
AND   c.region = 'US-WEST';

What Tez does:

Hive's query planner analyzes table sizes:

customers and products are small (< 25 MB) → broadcast join (MapJoin)
orders is large (> 25 MB) → the streamed (probe) side, read directly by Map 1 without a shuffle

The resulting DAG has:

Map 1: reads orders, builds hash tables from the customers and products small tables, emits matching rows. Small tables arrive via BROADCAST edges, so every Map 1 task receives the full small table. This is different from ONE_TO_ONE, where task i of the source feeds only task i of the destination.
Optionally a Reducer 2 if there's a DISTINCT or ORDER BY.
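The three built-in data movement types differ in which destination tasks read from a given source task, and in whether they see its whole output or only one partition. The sketch below models that routing in plain Java. It is an illustration, not the Tez EdgeManager API: `RoutingSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of routing semantics; not the Tez EdgeManager API.
final class RoutingSketch {
    enum Movement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    /** Destination task indexes that read some output of source task {@code src}. */
    static List<Integer> destinations(Movement m, int src, int numDestTasks) {
        List<Integer> dests = new ArrayList<>();
        switch (m) {
            case ONE_TO_ONE:
                // Task i feeds only task i, so both vertices need equal parallelism.
                dests.add(src);
                break;
            case BROADCAST:
            case SCATTER_GATHER:
                // Every destination task reads from every source task.
                for (int d = 0; d < numDestTasks; d++) {
                    dests.add(d);
                }
                break;
        }
        return dests;
    }

    /** True if a destination sees the source's full output rather than one partition. */
    static boolean receivesFullOutput(Movement m) {
        return m != Movement.SCATTER_GATHER;
    }

    public static void main(String[] args) {
        for (Movement m : Movement.values()) {
            System.out.println(m + " src=2 -> " + destinations(m, 2, 4)
                + ", full output: " + receivesFullOutput(m));
        }
    }
}
```

A map join needs the BROADCAST row of this picture: every probe-side task must see the complete small table, which neither ONE_TO_ONE nor SCATTER_GATHER provides.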

VertexGroup: Tez also lets a caller treat the outputs of several vertices as a single logical input. Hive uses this mainly for UNION ALL branches: the group is created with DAG.createVertexGroup() and connected to its consumer with a GroupInputEdge. A plain broadcast join like the one above does not need it.

Dataset edge cases for joins:

Scenario	What goes wrong	Where to look
`customers` grows from 20 MB to 30 MB	Map join threshold exceeded, query switches to shuffle join; slower	Hive `CommonJoinResolver`, not Tez
`orders` has extreme key skew (one customer_id has 90% of rows)	One reducer gets 90% of data; task timeout	`SkewedJoin` hint in Hive; Tez sees it as one overloaded reducer
Broadcast table > YARN container heap	OOM in map task	Container memory: `tez.task.resource.memory.mb`
Right side of join returns 0 rows	Map tasks emit 0 output; downstream vertex immediately succeeds	`VertexImpl.checkTasksForCompletion()`
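The first row of the table is a threshold decision, which can be sketched as follows. This is a deliberately simplified model: `JoinStrategySketch` is a hypothetical name, and Hive's real planner weighs more inputs than a single file size (and the exact boundary comparison is an assumption here).

```java
// Hypothetical, simplified model of the map-join threshold; not Hive planner code.
final class JoinStrategySketch {
    enum Strategy { MAP_JOIN, SHUFFLE_JOIN }

    /** Chooses a broadcast (map) join when the small side fits under the threshold. */
    static Strategy choose(long smallTableBytes, long thresholdBytes) {
        if (smallTableBytes < 0 || thresholdBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        return smallTableBytes <= thresholdBytes ? Strategy.MAP_JOIN : Strategy.SHUFFLE_JOIN;
    }

    public static void main(String[] args) {
        long threshold = 25_000_000L; // value used for hive.mapjoin.smalltable.filesize above
        System.out.println(choose(20_000_000L, threshold)); // MAP_JOIN
        System.out.println(choose(30_000_000L, threshold)); // SHUFFLE_JOIN
    }
}
```

Because the decision is made at plan time, a table that grows past the threshold changes the DAG shape Tez receives without any change in Tez itself.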

Bridge to source code:

# BROADCAST edge: how every map task gets all small-table data
grep -rn "BROADCAST\|BroadcastEdgeManager" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ | grep -v Test | head -20

# VertexGroup: grouping several producer vertices into one logical input
grep -n "VertexGroup\|createVertexGroup" \
  tez-api/src/main/java/org/apache/tez/dag/api/DAG.java | head -15

# How the DAGAppMaster sees both edges from the same vertex
grep -n "vertexGroup\|groupInput" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -20

Scenario 3: Direct Tez API — No Hive

Not all Tez workloads go through Hive. Custom data pipelines, internal batch frameworks, and migration tools often build Tez DAGs directly. The canonical example is OrderedWordCount in tez-examples/.

// Simplified from tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java
TezClient tezClient = TezClient.create("OrderedWordCount", tezConf);
tezClient.start();

DAG dag = DAG.create("OrderedWordCount");

// Vertex 1: Tokenize words from input files
Vertex tokenizerVertex = Vertex.create(
    "Tokenizer",
    ProcessorDescriptor.create(TokenProcessor.class.getName()),
    numMapTasks,
    MRHelpers.getMapResource(conf));
tokenizerVertex.addDataSource(
    "Input",
    MRInput.createConfigBuilder(conf, TextInputFormat.class, inputPath).build());

// Vertex 2: Sum the counts for each word
Vertex sumVertex = Vertex.create(
    "Sorter",
    ProcessorDescriptor.create(SumProcessor.class.getName()),
    numReduceTasks,
    MRHelpers.getReduceResource(conf));
sumVertex.addDataSink(
    "Output",
    MROutput.createConfigBuilder(conf, TextOutputFormat.class, outputPath).build());

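// Edge: SCATTER_GATHER, sorted by word key.
// OrderedPartitionedKVEdgeConfig (org.apache.tez.runtime.library.conf) builds the
// output and input descriptors; createDefaultEdgeProperty() is shorthand for
// EdgeProperty.create(SCATTER_GATHER, PERSISTED, SEQUENTIAL, outputDesc, inputDesc).
OrderedPartitionedKVEdgeConfig edgeConf = OrderedPartitionedKVEdgeConfig
    .newBuilder(Text.class.getName(), IntWritable.class.getName(),
        HashPartitioner.class.getName())
    .setFromConfiguration(conf)
    .build();

dag.addVertex(tokenizerVertex)
   .addVertex(sumVertex)
   .addEdge(Edge.create(tokenizerVertex, sumVertex, edgeConf.createDefaultEdgeProperty()));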

DAGClient dagClient = tezClient.submitDAG(dag);
DAGStatus status = dagClient.waitForCompletion();

What this teaches about Tez's structure:

Vertex.create(name, processorDescriptor, parallelism, resource) — the four primitives of a vertex: name, code to run, how many copies, how much resource.
EdgeProperty.create(movementType, sourceType, schedulingType, outputDesc, inputDesc) — edge properties completely specify how data moves.
MRInput/MROutput bridge the gap between legacy Hadoop InputFormat/OutputFormat and Tez's native I/O descriptors.

Bridge to source code:

# Read OrderedWordCount to understand the complete DAG lifecycle from a client
cat tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java

# Follow TezClient.submitDAG() into the AM (TezClient is in the org.apache.tez.client package)
grep -n "public.*submitDAG" \
  tez-api/src/main/java/org/apache/tez/client/TezClient.java

# EdgeProperty — the central struct that determines routing
cat tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java

Scenario 4: Pipelined Execution — Where Tez Approaches Flink

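Tez models pipelined edges through CONCURRENT scheduling (vs. SEQUENTIAL). With concurrent edges, downstream tasks can start before all upstream tasks complete, so data flows like a stream within the DAG. Support for this combination depends on the Tez version; check the DataSourceType and SchedulingType enums in EdgeProperty.java on the branch you are working on.

EdgeProperty pipelinedEdge = EdgeProperty.create(
    DataMovementType.SCATTER_GATHER,
    DataSourceType.EPHEMERAL,              // <-- output is not persisted; consume it while the source runs
    SchedulingType.CONCURRENT,             // <-- downstream starts immediately
    outputDescriptor,
    inputDescriptor);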

This is used by Hive for query pipelining in long-running SELECT ... INSERT chains. The downstream vertex starts consuming partial output from the upstream before it finishes, reducing end-to-end latency for multi-stage queries.

When pipelining causes problems:

Problem	Symptom	Root class
Upstream task fails mid-stream	Downstream task has consumed partial data → must be killed and retried with upstream	`TaskAttemptImpl.FAILED_TRANSITION`
Downstream cannot consume fast enough	Back-pressure: upstream pauses on `write()`	`OrderedPartitionedKVOutput.sendingThreadShouldRun`
Memory overflow in pipelined buffer	`OutOfMemoryError` in fetcher threads	`MergeManager` in-memory limit

Bridge to source code:

grep -n "EPHEMERAL\|PERSISTED\|CONCURRENT" \
  tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java

grep -n "SchedulingType.CONCURRENT" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -10

Dataset Scenarios for Testing Edge Cases

When you are writing a repro test case or validating a fix, the dataset you choose determines which code paths you exercise. Use these as your starting templates.

Dataset 1: The Empty Partition

// Generate test data where one reduce partition has 0 records
// Triggers ShuffleVertexManager.onSourceTaskCompleted() with 0-byte output
private static final int NUM_PARTITIONS = 10;
private static final int RECORDS_PER_PARTITION = 100;

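// Force all records into partitions 0–8, leave partition 9 empty.
// Mask the sign bit first: hashCode() can be negative, and % would keep the sign.
int partition = (key.hashCode() & Integer.MAX_VALUE) % (NUM_PARTITIONS - 1);  // never 9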

What this tests: ShuffleVertexManager must handle a vertex where some reducer partitions receive zero input. Before TEZ-3247, this caused reducers to hang waiting for shuffle data that would never arrive.
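Before using such a dataset in a repro, it is worth confirming that the partition you expect to be empty really is. The sketch below is a standalone check in plain Java, with hypothetical class and method names; it assumes string keys of the form `key_N` and masks the sign bit so that negative hash codes cannot produce a negative index.

```java
import java.util.Arrays;

// Hypothetical test-data helper; class and method names are illustrative.
final class EmptyPartitionSketch {
    static final int NUM_PARTITIONS = 10;

    /** Maps a key to partitions 0..8 only, so partition 9 never receives data. */
    static int partitionFor(String key) {
        // hashCode() may be negative; masking the sign bit keeps the index in range.
        return (key.hashCode() & Integer.MAX_VALUE) % (NUM_PARTITIONS - 1);
    }

    /** Counts how many of the generated keys land in each partition. */
    static int[] histogram(int numKeys) {
        int[] counts = new int[NUM_PARTITIONS];
        for (int i = 0; i < numKeys; i++) {
            counts[partitionFor("key_" + i)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // The last entry is always 0 because partitionFor never returns 9.
        System.out.println(Arrays.toString(histogram(1000)));
    }
}
```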

# Test class that covers empty-partition behavior
grep -rn "emptyPartition\|zeroInput\|emptyInput" tez-tests/src/test/ | head -10

Dataset 2: Extreme Key Skew

// One key accounts for 95% of records
for (int i = 0; i < 1_000_000; i++) {
    String key = (i < 950_000) ? "hot_key" : "key_" + i;
    writer.write(new Text(key), new IntWritable(1));
}

What this tests: The reducer that receives hot_key gets at least 950,000 records, while the remaining 50,000 records are spread across all reducers. This exposes:

Speculative execution decisions in LegacySpeculator
Container reuse after the skewed reducer finishes last
Per-vertex timing in VertexImpl.checkTasksForCompletion()
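To quantify the skew before running the DAG, the same key distribution can be pushed through a hash partitioner offline. The sketch below is a plain-Java approximation with hypothetical names; it assumes a simple sign-masked `hashCode() % numReducers` partitioner, which may differ from the partitioner your DAG actually uses.

```java
// Hypothetical offline skew check; not part of Tez.
final class SkewSketch {

    /** Distributes the 1,000,000 generated keys over reducers by hash. */
    static long[] recordsPerReducer(int numReducers) {
        long[] counts = new long[numReducers];
        for (int i = 0; i < 1_000_000; i++) {
            String key = (i < 950_000) ? "hot_key" : "key_" + i;
            counts[(key.hashCode() & Integer.MAX_VALUE) % numReducers]++;
        }
        return counts;
    }

    /** Ratio of the busiest reducer's load to the mean load. */
    static double skewRatio(long[] counts) {
        long max = 0;
        long total = 0;
        for (long c : counts) {
            max = Math.max(max, c);
            total += c;
        }
        return max / ((double) total / counts.length);
    }

    public static void main(String[] args) {
        long[] counts = recordsPerReducer(10);
        System.out.printf("skew ratio: %.1f%n", skewRatio(counts));
    }
}
```

With 10 reducers the busiest one holds at least 95% of the records, so the ratio is at least 9.5; a ratio near 1.0 would indicate an even distribution.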

Dataset 3: Zero-Row Input

// Empty input — 0 files, 0 records
// The DAG should complete SUCCEEDED with 0 output, not hang
String inputPath = "/tmp/empty_dir_" + UUID.randomUUID();
fs.mkdirs(new Path(inputPath));  // create directory but put no files in it

What this tests: VertexImpl must handle the case where MRInput generates 0 splits. A vertex with 0 input splits sets its parallelism to 0 and transitions to SUCCEEDED without scheduling any tasks. This has historically been a source of NullPointerException bugs when downstream vertices assume at least one upstream task ran.
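The zero-task shortcut is easy to see in a toy lifecycle. The sketch below is heavily simplified and uses hypothetical names; the real VertexImpl state machine has many more states and events. It assumes only that a vertex with no tasks has nothing to wait for.

```java
// Hypothetical, heavily simplified vertex lifecycle; not VertexImpl.
final class VertexLifecycleSketch {
    enum State { NEW, RUNNING, SUCCEEDED }

    private final int numTasks;
    private int completedTasks;
    private State state = State.NEW;

    VertexLifecycleSketch(int numTasks) {
        if (numTasks < 0) {
            throw new IllegalArgumentException("numTasks must be >= 0");
        }
        this.numTasks = numTasks;
    }

    State start() {
        // With zero tasks there is nothing to schedule, so completion is immediate.
        state = (numTasks == 0) ? State.SUCCEEDED : State.RUNNING;
        return state;
    }

    State taskCompleted() {
        if (state != State.RUNNING) {
            throw new IllegalStateException("not running: " + state);
        }
        if (++completedTasks == numTasks) {
            state = State.SUCCEEDED;
        }
        return state;
    }

    /** May be 0 for a SUCCEEDED vertex; downstream code must not assume otherwise. */
    int completedTasks() {
        return completedTasks;
    }

    public static void main(String[] args) {
        VertexLifecycleSketch empty = new VertexLifecycleSketch(0);
        System.out.println(empty.start() + " with " + empty.completedTasks() + " tasks");
    }
}
```

The failure mode described above corresponds to a consumer that reads per-task results from a SUCCEEDED vertex without first checking that `completedTasks()` is greater than zero.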

grep -n "setParallelism.*0\|numTasks.*0\|zeroTasks\|numSourceTasks.*0" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -15

Dataset 4: Very Wide Rows (Many Columns)

// 1,000 columns per row — stresses IFile serialization and spill logic
StringBuilder sb = new StringBuilder();
for (int col = 0; col < 1000; col++) {
    sb.append("column_").append(col).append("=").append("value_").append(col).append("\t");
}
writer.write(new Text("key"), new Text(sb.toString()));  // ~20 KB per record

What this tests: PipelinedSorter and DefaultSorter spill thresholds. With roughly 20 KB per record, even a modest sort buffer fills quickly. This exercises the spill path in tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java and can expose off-by-one bugs in the IFile index writer.
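When sizing such a test, it helps to compute the record size exactly rather than estimate it. The sketch below rebuilds the same wide row in plain Java and measures its UTF-8 length. The class name is hypothetical, the 100 MB buffer is an assumed example value, and the calculation ignores key bytes and per-record sorter metadata, so treat the result as an upper bound on records per buffer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sizing helper; not part of Tez.
final class WideRowSketch {

    /** Builds the same tab-separated "column_N=value_N" row as the dataset above. */
    static String wideRow(int columns) {
        StringBuilder sb = new StringBuilder();
        for (int col = 0; col < columns; col++) {
            sb.append("column_").append(col).append('=')
              .append("value_").append(col).append('\t');
        }
        return sb.toString();
    }

    static int encodedBytes(String row) {
        return row.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Upper bound on records per buffer, ignoring keys and sorter metadata. */
    static long recordsThatFit(long bufferBytes, int recordBytes) {
        if (recordBytes <= 0) {
            throw new IllegalArgumentException("recordBytes must be positive");
        }
        return bufferBytes / recordBytes;
    }

    public static void main(String[] args) {
        int bytes = encodedBytes(wideRow(1000));
        long bufferBytes = 100L * 1024 * 1024; // assumed example sort buffer size
        System.out.println(bytes + " bytes per record");
        System.out.println(recordsThatFit(bufferBytes, bytes) + " records before a spill");
    }
}
```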

Dataset 5: Many Small Files (HDFS Small-File Problem)

# Generate 50,000 tiny files (a few dozen bytes each), a classic HDFS anti-pattern
mkdir -p /tmp/smallfiles
for i in $(seq 1 50000); do
  echo "record_$i value_$i" > /tmp/smallfiles/file_$i.txt
done
hadoop fs -mkdir -p /data/input
hadoop fs -put /tmp/smallfiles /data/input/

What this tests: Without split grouping, split generation produces one split per file, that is, 50,000 map tasks. This is a realistic workload that stresses:

TaskSchedulerManager task queue management
Container reuse logic (50,000 containers → reuse is essential for performance)
DAGAppMaster AMRM heartbeat frequency under high task count
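One common mitigation for this workload is to group many small splits into fewer, larger ones before scheduling. The sketch below shows the idea with a greedy size-based packer in plain Java. It is not Tez's grouping algorithm, which also considers locality and configured minimum and maximum group sizes; the class name and the 16 MB target are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical greedy grouper; Tez's own split grouping is more sophisticated.
final class SplitGroupingSketch {

    /**
     * Packs file sizes, in order, into groups of at most targetBytes.
     * A single file larger than targetBytes gets a group of its own.
     */
    static List<List<Long>> group(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Long> files = Collections.nCopies(50_000, 1024L); // 50,000 files of 1 KB
        long target = 16L * 1024 * 1024;                        // assumed 16 MB per task
        System.out.println(group(files, target).size() + " tasks instead of " + files.size());
    }
}
```

With these example numbers, 16,384 one-kilobyte files fit in each group, so 50,000 files collapse into 4 tasks.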

# Container reuse: held containers and idle timeouts are managed by the task scheduler
grep -rn "heldContainer\|releaseTimeout\|idleContainerTimeout" \
  tez-dag/src/main/java/org/apache/tez/dag/app/rm/ \
  | head -20

Dataset 6: Nested Structs (Complex Types)

-- ORC table with nested complex types
CREATE TABLE events (
  event_id  BIGINT,
  metadata  STRUCT<
    user_id:       BIGINT,
    session_id:    STRING,
    properties:    MAP<STRING, STRING>,
    tags:          ARRAY<STRING>
  >,
  `timestamp` BIGINT  -- quoted because TIMESTAMP is a reserved keyword in Hive
) STORED AS ORC;

What this tests: ORC vectorized reader deserialization of STRUCT, MAP, and ARRAY types. These values are deserialized into Hive's OrcStruct/OrcMap/OrcList classes before being passed through MRInput to the MapOperator. If the column count or type tree in the ORC file differs from what the Hive schema declares, the reader falls into schema evolution behavior, which can produce bugs that look like Tez data corruption but are actually ORC schema evolution issues.

Dataset 7: Partitioned Iceberg Table (Snapshot Isolation)

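# Load an existing Iceberg table with PyIceberg and append several snapshots.
# generate_day_data() and hive_execute() are placeholders for your own helpers;
# table.append() expects a PyArrow table.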
from pyiceberg.catalog import load_catalog
catalog = load_catalog("hive_catalog", **{"uri": "thrift://hive-metastore:9083"})
table = catalog.load_table("db.events_iceberg")

# Write 3 snapshots representing 3 days of appends
for day in range(3):
    df = generate_day_data(day)
    table.append(df)

# Now query with time travel — Hive generates a DAGPlan that reads snapshot 1
hive_execute("""
    SELECT COUNT(*) FROM db.events_iceberg
    FOR SYSTEM_TIME AS OF '2026-05-29 00:00:00'
""")

What this tests: Iceberg's IcebergInputFormat generates a split list that differs per snapshot. The DataSourceDescriptor passed to Tez encodes the snapshot ID. If Hive resolves the wrong snapshot, Tez faithfully executes it — the bug is in the DagUtils snapshot resolution in Hive, not in Tez. But the symptom (wrong row count) looks like a Tez data bug.
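A "wrong snapshot" bug is easier to recognize with the resolution rule in view: an as-of query selects the latest snapshot committed at or before the requested timestamp. The sketch below models only that rule with a sorted map. `SnapshotResolver` and its methods are hypothetical names for illustration, not Iceberg or Hive classes, and the commit times are made-up values.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

final class SnapshotResolver {
    // commit time (epoch millis) -> snapshot id, kept sorted by commit time
    private final TreeMap<Long, Long> snapshotIdByCommitMillis = new TreeMap<>();

    SnapshotResolver addSnapshot(long commitMillis, long snapshotId) {
        snapshotIdByCommitMillis.put(commitMillis, snapshotId);
        return this;
    }

    /** Latest snapshot committed at or before asOfMillis, if any. */
    Optional<Long> resolveAsOf(long asOfMillis) {
        Map.Entry<Long, Long> entry = snapshotIdByCommitMillis.floorEntry(asOfMillis);
        return entry == null ? Optional.empty() : Optional.of(entry.getValue());
    }

    public static void main(String[] args) {
        SnapshotResolver resolver = new SnapshotResolver()
            .addSnapshot(1_000L, 1L)   // day 0 append
            .addSnapshot(2_000L, 2L)   // day 1 append
            .addSnapshot(3_000L, 3L);  // day 2 append
        System.out.println(resolver.resolveAsOf(2_500L)); // Optional[2]
        System.out.println(resolver.resolveAsOf(500L));   // Optional.empty
    }
}
```

When a row count looks wrong, comparing the snapshot the query should have resolved against the one the splits were actually generated from is usually faster than reading Tez task logs.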

Running Tez End-to-End: The Local Developer Loop

Before writing source code, every Tez contributor should be able to run this loop in under 10 minutes once the initial build is cached:

# 1. Clone and build
git clone https://github.com/apache/tez.git ~/tez-src
cd ~/tez-src
mvn clean install -DskipTests -Pdist -q   # ~8–12 min cold, 3–4 min warm

# 2. Run the canonical integration test that exercises the full stack
mvn test -pl tez-tests \
    -Dtest=TestOrderedWordCount \
    -DfailIfNoTests=false 2>&1 | tail -30

# 3. Run a single unit test (fast feedback loop — use this constantly)
mvn test -pl tez-dag \
    -Dtest=TestVertexImpl#testVertexSucceededSpeculation \
    -DfailIfNoTests=false 2>&1 | tail -20

# 4. Run OrderedWordCount in local mode (no YARN cluster required)
hadoop jar tez-examples/target/tez-examples-*.jar orderedwordcount \
    -D tez.local.mode=true \
    /path/to/input /tmp/tez-output-$(date +%s)

# 5. Verify output
hadoop fs -cat /tmp/tez-output-*/part-* | sort | head -20

The TestOrderedWordCount test is your baseline health check. If it passes, the full end-to-end stack (TezClient → DAGAppMaster → VertexImpl → shuffle → MRInput/MROutput) is working. If it fails, something fundamental is broken and you need to fix that before touching anything else.

The Bridge: User Scenario → Source Code

Every scenario above maps to a specific source subsystem. Use this table whenever you see a runtime behavior and want to find the code responsible:

Observed behavior	Source location
Map task count equals file count	`tez-mapreduce/.../MRInputLegacy.createSplitsProto()`
Reducer count auto-adjusted down	`ShuffleVertexManager.computeParallelism()`
DAG completes even with 0-row input	`VertexImpl.scheduleTasks()` (0-task vertex path)
Broadcast join: small table to all maps	`BroadcastEdgeManager` + `BROADCAST` edge
Container reused between tasks	`AMContainerImpl.assignContainer()` + `HeldContainer`
Task retried after failure	`TaskAttemptImpl` → `TaskImpl.handleTaskAttemptFailed()`
OOM in shuffle fetch	`MergeManager.memoryAvailable` / `Fetcher.copyFromHost()`
Hung vertex with tasks still RUNNING	`VertexImpl.checkTasksForCompletion()` not triggered
Wrong output record count	Check `OrcInputFormat` predicate pushdown first, then Tez
Slow single reducer (skew)	`LegacySpeculator` slow-task detection → speculative attempt
Pipelined task killed on upstream failure	`TaskAttemptImpl.FAILED_TRANSITION` cascades

What to Verify Before Starting Level 1

Run through this checklist once. It takes 30–45 minutes and proves your environment is solid.

# Environment check
java -version    # must be Java 8 or Java 11
mvn -version     # must be 3.6.3+
git --version    # must be 2.x

# Clone and build
git clone https://github.com/apache/tez.git ~/tez-src
cd ~/tez-src
mvn clean install -DskipTests -Pdist 2>&1 | tail -10

# Confirm build artifacts exist
ls tez-dist/target/tez-*.tar.gz  # should exist
ls tez-examples/target/tez-examples-*.jar

# Run the unit test suite in the two most important modules
mvn test -pl tez-dag -DfailIfNoTests=false 2>&1 | grep -E "Tests run:|FAIL|ERROR" | tail -5
mvn test -pl tez-api -DfailIfNoTests=false 2>&1 | grep -E "Tests run:|FAIL|ERROR" | tail -5

# Run the critical end-to-end test
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -DfailIfNoTests=false 2>&1 | tail -10

# All lines should read "Tests run: N, Failures: 0, Errors: 0"

If any of these fail before you have modified a single line of code, stop and fix your environment. Do not proceed into Level 1 with a broken baseline. A broken baseline means every subsequent mvn test will produce false failures that obscure the real work.

Continue to Overview & Prerequisites or jump directly to Level 1: Hadoop and Tez Foundation.

16-Week Plan: From Curious Reader to Tez Committer Candidate

This is a 16-week, ~10-hour-per-week plan that maps the curriculum (Levels 1–8 plus a 2-week capstone) onto a calendar. Each week states:

Reading — concrete Tez source files. Open them; do not just skim diagrams.
Hands-on — what you must build/run on your machine.
JIRA practice queries — searches that surface real, beginner-appropriate issues.
Labs — the curriculum labs you must complete.
Exit checkpoint — concrete deliverables. If you cannot produce them, repeat the week.

The plan assumes you have ~/tez-src checked out, tez-tests/ building with mvn -DskipTests install, and a working Java 8+/Maven 3.6+ environment.

Weeks 1–2: Level 1 — Orientation and First DAG

Week 1 — The DAG model and the client API

Reading

tez-api/src/main/java/org/apache/tez/dag/api/DAG.java (entire file; ~600 lines)
tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
tez-api/src/main/java/org/apache/tez/dag/api/Edge.java
tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java
tez-api/src/main/proto/DAGApiRecords.proto — focus on DAGPlan, VertexPlan, EdgePlan, EdgeProperty.

Hands-on

Build Tez from source: mvn clean install -DskipTests -Phadoop28.
Run OrderedWordCount against a local file using MiniTezCluster (see tez-tests/src/test/java/org/apache/tez/test/TestTezJobs.java).
Inspect the generated DAGPlan: print it with dag.createDag(...).toString().

JIRA practice queries

project = TEZ AND status in (Open, "In Progress") AND labels = newbie
project = TEZ AND component = tez-api AND fixVersion is empty AND priority in (Trivial, Minor)

Labs

Lab 1.1 — Trace a WordCount end-to-end.
Lab 1.2 — Modify the DAG: add a second mapper vertex.

Exit checkpoint

You can name every required argument to DAG.create(), Vertex.create(), Edge.create(), and EdgeProperty.create().
You can diagram the WordCount DAG without looking.
You have one JIRA ticket open in a browser tab that you've read end-to-end (description + every comment).

Week 2 — Edges in depth

Reading

tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java — all three enums (DataMovementType, DataSourceType, SchedulingType).
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/EdgeManager*.java — five built-in edge managers.
tez-api/src/main/java/org/apache/tez/dag/api/InputDescriptor.java, OutputDescriptor.java, ProcessorDescriptor.java.

Hands-on

Build the same WordCount with BROADCAST instead of SCATTER_GATHER for the edge. Observe the failure mode and explain it.
Write a 3-vertex DAG (A -> B -> C) where A->B is ONE_TO_ONE and B->C is SCATTER_GATHER. Run it; confirm parallelism rules from the source.
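Before running the experiments, it can help to write down the fan-out each movement type implies. The sketch below is a plain-Java model of those parallelism rules as I understand them from the built-in edge managers; `EdgeFanout` and its `Movement` enum are hypothetical stand-ins, not Tez classes, so verify the numbers against the EdgeManager sources listed in this week's reading.

```java
final class EdgeFanout {
    /** Local mirror of the three data movement kinds discussed above. */
    enum Movement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    /** Physical inputs each destination task reads over the edge. */
    static int inputsPerDestTask(Movement m, int srcParallelism, int destParallelism) {
        check(m, srcParallelism, destParallelism);
        return m == Movement.ONE_TO_ONE ? 1 : srcParallelism;
    }

    /** Physical outputs (partitions) each source task writes for the edge. */
    static int outputsPerSrcTask(Movement m, int srcParallelism, int destParallelism) {
        check(m, srcParallelism, destParallelism);
        return m == Movement.SCATTER_GATHER ? destParallelism : 1;
    }

    private static void check(Movement m, int src, int dest) {
        if (src <= 0 || dest <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        if (m == Movement.ONE_TO_ONE && src != dest) {
            throw new IllegalArgumentException("ONE_TO_ONE needs equal parallelism on both sides");
        }
    }

    public static void main(String[] args) {
        // A -> B is ONE_TO_ONE (3 tasks each); B -> C is SCATTER_GATHER (3 -> 2).
        System.out.println(inputsPerDestTask(Movement.ONE_TO_ONE, 3, 3));     // 1
        System.out.println(inputsPerDestTask(Movement.SCATTER_GATHER, 3, 2)); // 3
        System.out.println(outputsPerSrcTask(Movement.SCATTER_GATHER, 3, 2)); // 2
    }
}
```

The BROADCAST row is the one to keep in mind for the WordCount experiment: each source task writes a single output, and every destination task reads all of them.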

JIRA practice queries

project = TEZ AND text ~ "EdgeManager" AND resolution = Unresolved
project = TEZ AND text ~ "broadcast" AND status = Resolved ORDER BY created DESC

Labs

Lab 1.3 — Edge type matrix experiment.

Exit checkpoint

Edge type matrix (movement × scheduling × source) drawn from memory.
You can predict, given edge properties, which EdgeManager impl will be picked.
One short forum/dev-list email you drafted (do not send) summarizing your reading of an EdgeManager file.

Weeks 3–4: Level 2 — Build, run, and read tests

Week 3 — Tez build system and module layout

Reading

pom.xml (root), tez-api/pom.xml, tez-dag/pom.xml.
BUILDING.txt.
tez-tests/src/test/java/org/apache/tez/test/MiniTezCluster.java — entry-point for nearly every integration test.

Hands-on

Run mvn -pl tez-dag test -Dtest=TestVertexImpl#testBasicVertexCompletion.
Run mvn -pl tez-tests test -Dtest=TestTezJobs#testWordCount.
Trace the reactor build order: mvn -DskipTests install -X 2>&1 | grep "Building\|BUILD".

JIRA practice queries

project = TEZ AND component = build AND status = Open
project = TEZ AND text ~ "MiniTezCluster" AND resolution = Unresolved

Labs

Lab 2.1 — Build Tez and run all tez-api tests.
Lab 2.2 — Add a no-op test to tez-dag and run it via Maven.

Exit checkpoint

You can explain why tez-dag depends on tez-api but not vice versa.
You know the difference between tez-runtime-internals and tez-runtime-library.
You can run a single test via Maven without consulting any docs.

Week 4 — Tests as documentation

Reading

tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java (~5000 lines; pick 10 test methods to read closely).
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestDAGImpl.java.
tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java.

Hands-on

Pick one test method in TestVertexImpl; rewrite it from scratch in your notebook, then diff against the original.
Add an assertion that fails; observe the message; fix it.

JIRA practice queries

project = TEZ AND text ~ "flaky" AND status in (Open, "In Progress")
project = TEZ AND text ~ "TestVertexImpl" AND resolution = Unresolved

Labs

Lab 2.3 — Read TestVertexImpl#testKilledTasksHandling and explain every line.

Exit checkpoint

You can write a test that constructs a VertexImpl directly (without MiniTezCluster).
You understand the DrainDispatcher pattern (see state-machines.md).
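The idea behind the drain pattern is that a test queues events, forces every queued event through the handler, and only then asserts on state. The sketch below is a simplified, synchronous model of that idea; `ToyDrainDispatcher` is a hypothetical class and is not the dispatcher used by the Tez or Hadoop test suites, which processes events on its own thread and blocks the test until the queue is empty.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

final class ToyDrainDispatcher<E> {
    private final Queue<E> queue = new ArrayDeque<>();
    private final Consumer<E> handler;

    ToyDrainDispatcher(Consumer<E> handler) {
        this.handler = handler;
    }

    /** Queues an event; nothing is handled until drain() is called. */
    void dispatch(E event) {
        queue.add(event);
    }

    /** Handles events until the queue is empty, including events queued by handlers. */
    int drain() {
        int handled = 0;
        E event;
        while ((event = queue.poll()) != null) {
            handler.accept(event);
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        ToyDrainDispatcher<String> dispatcher = new ToyDrainDispatcher<>(seen::add);
        dispatcher.dispatch("V_INIT");
        dispatcher.dispatch("V_START");
        System.out.println("before drain: " + seen); // []
        dispatcher.drain();
        System.out.println("after drain: " + seen);  // [V_INIT, V_START]
    }
}
```

Asserting before the drain is the classic source of flaky state-machine tests: the event is queued but the transition has not run yet.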

Weeks 5–6: Level 3 — Submission and AM lifecycle

Week 5 — TezClient and submission

Reading

tez-api/src/main/java/org/apache/tez/client/TezClient.java.
tez-api/src/main/java/org/apache/tez/client/TezClientUtils.java.
tez-api/src/main/java/org/apache/tez/client/TezSessionImpl.java.

Hands-on

Write a small Java program that uses TezClient directly (no MR shim) to submit a DAG to MiniTezCluster.
Use both session and non-session modes; measure the second-DAG latency difference.

JIRA practice queries

project = TEZ AND component = "tez-api" AND text ~ "TezClient" AND status = Open

Labs

Lab 3.1 — Build a custom client that submits two DAGs in one session.

Exit checkpoint

You can list every method that talks to the AM over RPC (grep for dagAMProtocol in TezClient.java).
You can name the three local resources that TezClientUtils uploads.

Week 6 — DAGAppMaster bring-up

Reading

tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java — focus on serviceInit, serviceStart, dispatcher registration.
tez-dag/src/main/java/org/apache/tez/dag/app/TaskCommunicatorManager.java.
tez-dag/src/main/java/org/apache/tez/dag/app/launcher/ContainerLauncher*.java.

Hands-on

Run a DAG against MiniTezCluster with AM logs at DEBUG. Identify the line in DAGAppMaster.java that emits the first "Created DAG" log line.

Labs

Lab 3.2 — Map an AM log line to source code (Lab in Level 3).

Exit checkpoint

You can list the AsyncDispatcher event-handler registrations in DAGAppMaster in order.
You can walk the path from TezClient.submitDAG() to DAGImpl being instantiated inside the AM.

Weeks 7–9: Level 4 — Vertex internals and state machines

Week 7 — State machine library

Reading

hadoop-yarn-common StateMachineFactory source (you'll need to fetch Hadoop source separately).
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java — read only the stateMachineFactory block first (~200 lines near the top).

Hands-on

Write a toy StateMachineFactory for a Light (OFF, ON, BROKEN) in a scratch project.

Labs

Lab 4.1 — State-machine introduction.

Exit checkpoint

You can explain SingleArcTransition vs MultipleArcTransition without notes.
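For the toy Light exercise, the distinction can be modeled without Hadoop on the classpath. In the sketch below a single-arc transition always lands in one fixed state, while a multiple-arc transition computes its target from the event payload. This is a plain-Java imitation of the shape of the pattern, not the StateMachineFactory API; the event names and the 250-volt threshold are invented for illustration.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.IntFunction;

final class LightStateMachine {
    enum State { OFF, ON, BROKEN }
    enum Event { SWITCH_ON, SWITCH_OFF, SURGE }

    // state -> event -> arc; an arc maps the event payload (volts) to the next state
    private final Map<State, Map<Event, IntFunction<State>>> arcs = new EnumMap<>(State.class);
    private State current = State.OFF;

    LightStateMachine() {
        // Single-arc style: the post-state is fixed.
        addArc(State.OFF, Event.SWITCH_ON, volts -> State.ON);
        addArc(State.ON, Event.SWITCH_OFF, volts -> State.OFF);
        // Multiple-arc style: the post-state depends on the event payload.
        addArc(State.ON, Event.SURGE, volts -> volts > 250 ? State.BROKEN : State.ON);
    }

    private void addArc(State from, Event on, IntFunction<State> arc) {
        arcs.computeIfAbsent(from, s -> new EnumMap<Event, IntFunction<State>>(Event.class))
            .put(on, arc);
    }

    /** Applies the event, or throws if no arc is registered for it in the current state. */
    State handle(Event event, int volts) {
        Map<Event, IntFunction<State>> fromCurrent = arcs.get(current);
        IntFunction<State> arc = (fromCurrent == null) ? null : fromCurrent.get(event);
        if (arc == null) {
            throw new IllegalStateException("Invalid event " + event + " at " + current);
        }
        current = arc.apply(volts);
        return current;
    }

    public static void main(String[] args) {
        LightStateMachine light = new LightStateMachine();
        System.out.println(light.handle(Event.SWITCH_ON, 230)); // ON
        System.out.println(light.handle(Event.SURGE, 400));     // BROKEN
    }
}
```

BROKEN registers no arcs, so any further event throws. That mirrors what an unregistered transition looks like in a real state machine.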

Week 8 — VertexManager plugins

Reading

tez-api/src/main/java/org/apache/tez/dag/api/VertexManagerPlugin.java, VertexManagerPluginContext.java.
tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java.

Labs

Lab 4.2 — VertexManager deep dive (the depth-bar lab).

Exit checkpoint

A working CountingVertexManager with passing unit test, as specified in Lab 4.2.

Week 9 — Task and TaskAttempt

Reading

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java.
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java.

Labs

Lab 4.3 — Task lifecycle walk.
Lab 4.4 — TaskAttempt termination causes.

Exit checkpoint

You can draw the TaskAttempt state machine from memory.
You can list every TaskAttemptTerminationCause and what produces it.

Weeks 10–11: Level 5 — Runtime, IPO (Input/Processor/Output), and shuffle

Week 10 — Runtime task execution

Reading

tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java.
tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java.

Labs

Lab 5.1 — Trace a task from container start to processor exit.

Exit checkpoint

You can list every umbilical call a task makes during its lifetime (grep umbilical in tez-runtime-internals).

Week 11 — Shuffle and merge

Reading

tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java.
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/Fetcher.java.
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java.

Labs

Lab 5.2 — Spilled output inspection on MiniTezCluster.
Lab 5.3 — Force a fetch failure.

Exit checkpoint

You can explain IFile framing in two paragraphs.
You can name the three sorter implementations and when each is used.

Week 12: Level 6 — Scheduling and container reuse

Reading

tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java.
tez-dag/src/main/java/org/apache/tez/dag/app/rm/TaskSchedulerManager.java.
tez-dag/src/main/java/org/apache/tez/dag/app/rm/container/AMContainerImpl.java.

JIRA practice queries

project = TEZ AND text ~ "container reuse" AND status in (Open, "In Progress")

Labs

Lab 6.1 — Disable container reuse; measure latency cost.
Lab 6.2 — Read and explain tez.am.container.reuse.* configs.

Exit checkpoint

You can list the four conditions under which a container is not reused.

Week 13: Level 7 — MapReduce compatibility and integrations

Reading

tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInput.java.
tez-mapreduce/src/main/java/org/apache/tez/mapreduce/output/MROutput.java.
tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/map/MapProcessor.java.

Labs

Lab 7.1 — Submit a vanilla MR job via Tez (mapreduce.framework.name=yarn-tez, with tez.lib.uris pointing at the Tez tarball).

Exit checkpoint

You can write a one-page essay on "what MRInput does that a plain LogicalInput does not."

Week 14: Level 8 — Production diagnostics

Reading

tez-api/src/main/java/org/apache/tez/common/counters/TezCounters.java.
tez-dag/src/main/java/org/apache/tez/dag/history/HistoryEventHandler.java.
tez-plugins/tez-yarn-timeline-history/.

Labs

Lab 8.1 — Read a real ATS event dump.
Lab 8.2 — Trace a failure through the AM log + ATS + counters.

Exit checkpoint

You can answer: "Why did vertex X fail?" given only an AM log and ATS dump.

Weeks 15–16: Capstone

Follow capstone/index.md start-to-finish:

Issue selection (week 15, day 1–2).
Reproduction → root cause (week 15, day 3–7).
Implementation + tests (week 16, day 1–4).
Patch submission + write-up (week 16, day 5–7).

Exit checkpoint

A real patch attached to a real JIRA, with passing tests and a clear summary.
A 1500–3000 word public write-up of the experience.

How to use this plan when you fall behind

If you finish a week's reading but cannot pass the exit checkpoint, repeat the week. Do not advance.
If a JIRA query returns no results, change the query. The dev community moves; labels and components shift.
Skip a Level only if you can pass all exit checkpoints from previous Levels in one sitting.

Milestones: M1 Through M9

Milestones are the "what does mastery look like at this stage" checkpoints. Each milestone has:

Expected completion — a calendar guideline.
Skills you must demonstrate — 5–8 concrete abilities.
Self-check questions — answer them out loud, without notes.
20-point rubric — five criteria, four points each.
Pass threshold — minimum total to advance.
Move to the next level when — the binary gate.

Pass thresholds are deliberately high. The point is competence, not throughput.

M1 — Orientation (end of Week 2)

You can read the Tez DAG API and explain what every method on DAG, Vertex, and Edge does.

Skills

Write a 3-vertex DAG end-to-end without consulting docs.
Explain the three enums on EdgeProperty and pick the correct one for a given problem.
Name the protobuf message that represents a DAG on the wire.
Predict which built-in EdgeManager implementation will be selected for a given edge.
Locate any class in the tez-api module by name within 30 seconds.

Self-check questions

What is the difference between DataSourceDescriptor and a runtime Input?
Why is DAG.verify() called before submission?
Which class produces the protobuf DAGPlan?

Rubric

Criterion	1	2	3	4
API fluency	Can name classes	Can describe responsibilities	Can write code from memory	Can predict behavior
Edge model	Confused	Knows enums	Picks correct edge type	Predicts EdgeManager impl
Reading speed	>5 min/file	~3 min/file	~1 min/file	scanning fluently
Mental model	Vague	Sketches DAG	Sketches DAG + edge types	Sketches DAG + edge types + plan flow
Communication	Cannot explain	Explains with notes	Explains without notes	Teaches another

Pass threshold: 14/20, with no criterion below 2.
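The gate has two parts, a total and a per-criterion floor, and a high total does not rescue a single weak criterion. A minimal sketch of the check, assuming scores are integers from 1 to 4 as in the rubric above (`RubricGate` is a hypothetical helper name):

```java
final class RubricGate {
    /** True when the total meets the threshold and no criterion falls below the floor. */
    static boolean passes(int[] scores, int threshold, int minPerCriterion) {
        int total = 0;
        int lowest = Integer.MAX_VALUE;
        for (int score : scores) {
            if (score < 1 || score > 4) {
                throw new IllegalArgumentException("scores must be between 1 and 4");
            }
            total += score;
            lowest = Math.min(lowest, score);
        }
        return scores.length > 0 && total >= threshold && lowest >= minPerCriterion;
    }

    public static void main(String[] args) {
        // M1 gate: 14/20 with no criterion below 2.
        System.out.println(passes(new int[] {3, 3, 3, 3, 2}, 14, 2)); // true
        System.out.println(passes(new int[] {4, 4, 4, 4, 1}, 14, 2)); // false
    }
}
```

The second case totals 17 but still fails, because one criterion sits at 1.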

Move to Level 2 when: you can draft a new DAG class in 10 minutes from a verbal problem statement, on a whiteboard.

M2 — Build and Test Literacy (end of Week 4)

You can navigate the codebase, build it, and run any test by name.

Skills

Run a single test in any module via mvn -pl <module> test -Dtest=Class#method.
Add a new test file to tez-dag and have it picked up by Maven.
Read TestVertexImpl and explain at least 10 individual test methods.
Identify the module of a class given just its FQN (e.g., o.a.t.dag.app... → tez-dag).
Build Tez from a clean checkout in under 5 minutes (with cached deps).
Distinguish unit tests from MiniTezCluster-backed integration tests.

Self-check questions

Why does tez-dag depend on tez-api and not the reverse?
What is DrainDispatcher and why do tests use it?
Where do MiniTezCluster tests live and what classpath do they need?

Rubric

Criterion	1	2	3	4
Build mastery	`mvn install` works	Can skip tests, profiles	Knows module deps	Diagnoses build failures
Test execution	Runs all tests	Runs a class	Runs a method	Runs cross-module
Test reading	Skims	Understands assertions	Understands setup	Recreates from scratch
Module map	Knows names	Knows top-level deps	Knows transitive deps	Diagnoses cycles
Tooling	IDE-only	CLI + IDE	CLI primary	CLI + scripting

Pass threshold: 14/20.

Move to Level 3 when: you can clone Tez on a fresh laptop, build it, and run a TestVertexImpl method by name within 15 minutes.

M3 — Submission and AM Bring-up (end of Week 6)

You can trace a DAG from TezClient.submitDAG() to DAGImpl.handle(...) inside the AM.

Skills

List the three local resources TezClientUtils uploads.
Explain session vs non-session mode and the AM keep-alive mechanism.
Name every AsyncDispatcher event-handler registered in DAGAppMaster.
Locate the line of code where DAGImpl is constructed inside the AM.
Read AM logs at DEBUG and map lines to source positions.
Run MiniTezCluster in your tests and inspect AM logs.

Self-check questions

What RPC does TezClient use to submit a DAG? Which protocol class?
How does the AM stay alive between DAGs in a session?
What happens if the AM dies during a DAG run with recovery disabled?

Rubric

Criterion	1	2	3	4
Submission path	Vague	Knows TezClient API	Knows RPC	Knows full byte path
AM bring-up	Cannot describe	Names dispatcher	Names handlers	Walks serviceInit
Session model	Confused	Knows the flag	Knows keep-alive	Knows timeouts
Log reading	Greps blindly	Greps with intent	Maps to code	Predicts log line
Recovery	Unknown	Aware	Knows config keys	Knows record format

Pass threshold: 14/20.

Move to Level 4 when: you can answer "where in the AM does my DAG show up?" with a file:line citation.

M4 — State Machines and VertexManager (end of Week 9)

You can read and modify the vertex/task/attempt state machines.

Skills

Write a small StateMachineFactory-based state machine from scratch.
Add a transition to VertexImpl.stateMachineFactory and update tests in the same patch.
Implement a custom VertexManagerPlugin with a unit test.
Diagnose an InvalidStateTransitonException from a stack trace.
Distinguish SingleArcTransition from MultipleArcTransition.
Explain the dispatcher single-threading invariant.
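The single-threading invariant is what lets state-machine code mutate its fields without locks: many threads may enqueue events, but only one thread ever handles them. The sketch below demonstrates that shape with a single-thread executor; it is a toy model with hypothetical names, not the dispatcher Tez uses.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SingleThreadedDispatcher {
    private final ExecutorService eventThread = Executors.newSingleThreadExecutor();
    private int counter; // read and written only on eventThread, so no lock is needed

    /** Safe to call from any thread: the mutation itself runs on the event thread. */
    Future<?> dispatch(int delta) {
        return eventThread.submit(() -> { counter += delta; });
    }

    /** Reads state on the event thread, after every previously queued event has run. */
    int drainAndGet() throws InterruptedException, ExecutionException {
        return eventThread.submit(() -> counter).get();
    }

    void stop() {
        eventThread.shutdown();
    }

    /** Several producer threads dispatch +1 events; returns the final counter value. */
    static int demo(int producers, int eventsPerProducer) {
        SingleThreadedDispatcher dispatcher = new SingleThreadedDispatcher();
        try {
            Thread[] threads = new Thread[producers];
            for (int i = 0; i < producers; i++) {
                threads[i] = new Thread(() -> {
                    for (int j = 0; j < eventsPerProducer; j++) {
                        dispatcher.dispatch(1);
                    }
                });
                threads[i].start();
            }
            for (Thread thread : threads) {
                thread.join();
            }
            return dispatcher.drainAndGet();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            dispatcher.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000)); // 4000: no lost updates despite four producers
    }
}
```

If the producers incremented `counter` directly instead of dispatching, the unsynchronized `+=` could lose updates. That is the class of bug the invariant rules out.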

Self-check questions

Why must state-machine code be single-threaded? What breaks if not?
What happens if you forget to register a transition for an event in a state?
How does ShuffleVertexManager implement slow-start?
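As a starting point for the slow-start question, the behavior can be modeled as a linear ramp between a minimum and a maximum fraction of completed source tasks. The sketch below is a simplified model, not the ShuffleVertexManager implementation: the real class also handles multiple source vertices, rounding details, and auto-parallelism, and the 0.25/0.75 fractions used here are illustrative values, so check the source for the actual rules.

```java
final class SlowStart {
    /**
     * Cumulative number of destination tasks eligible to be scheduled, given how many
     * source tasks have completed. Below minFraction nothing is eligible; at or above
     * maxFraction everything is; in between, eligibility ramps up linearly.
     */
    static int tasksEligible(int completedSrc, int totalSrc, int totalDest,
                             double minFraction, double maxFraction) {
        if (totalSrc <= 0) {
            return totalDest; // no source tasks to wait for (e.g. a 0-task upstream vertex)
        }
        double done = (double) completedSrc / totalSrc;
        if (done < minFraction) {
            return 0;
        }
        if (done >= maxFraction || maxFraction <= minFraction) {
            return totalDest;
        }
        double ramp = (done - minFraction) / (maxFraction - minFraction);
        return (int) Math.ceil(ramp * totalDest);
    }

    public static void main(String[] args) {
        int totalSrc = 100;
        int totalDest = 10;
        for (int completed : new int[] {0, 25, 50, 75, 100}) {
            System.out.println(completed + " source tasks done -> "
                + tasksEligible(completed, totalSrc, totalDest, 0.25, 0.75) + " reducers eligible");
        }
    }
}
```

Note the `totalSrc <= 0` branch: a model that divides by the source task count without that guard fails on exactly the zero-split case described earlier in this book.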

Rubric

Criterion	1	2	3	4
State machine	Knows it exists	Can read transitions	Can add transition	Can refactor safely
Test discipline	None	Adds happy path	Adds happy + sad	Updates per transition
VertexManager	Knows interface	Implements minimal	Implements custom	Implements + tests
Concurrency	Confused	Knows the rule	Knows why	Can audit a PR
Debugging	Reads stack	Maps to source	Reproduces locally	Writes regression test

Pass threshold: 16/20 — this is the first hard gate.

Move to Level 5 when: you have submitted (or at minimum drafted) a state machine change that compiles, with a passing test.

M5 — Runtime and Shuffle (end of Week 11)

You can read the runtime data path and explain spill, merge, and fetch.

Skills

Walk a single task's lifecycle: container start → processor.run() → output close.
Explain IFile framing and the difference between V1 and V2.
Distinguish DefaultSorter, PipelinedSorter, and unordered output.
Diagnose a fetcher failure from logs.
Read ShuffleManager and explain its scheduling of fetchers.
Explain combiners and where they run in the pipeline.

Self-check questions

What umbilical RPCs does a task make during its run?
Where is the spill threshold checked?
What triggers a FAILED_FETCH event upstream?

Rubric

Criterion	1	2	3	4
Runtime path	Names classes	Walks happy path	Walks failure paths	Walks edge cases
IFile	Knows format	Reads with hexdump	Modifies safely	Diagnoses corruption
Sorter	Names them	Knows tradeoffs	Picks for workload	Tunes configs
Shuffle	Vague	Knows pull model	Knows scheduling	Knows backoff
Combiner	Aware	Knows when run	Implements one	Debugs incorrect output

Pass threshold: 15/20.

Move to Level 6 when: you can intentionally produce a fetcher failure on MiniTezCluster and explain every log line.

M6 — Scheduling and Container Reuse (end of Week 12)

You understand how Tez decides where tasks run.

Skills

Read YarnTaskSchedulerService and explain its scheduling loop.
List the conditions under which a container is/is not reused.
Explain affinity, locality, and racks.
Tune tez.am.container.reuse.* for a given workload.
Diagnose "stuck" scheduling.

Self-check questions

Why does Tez prefer to reuse containers over requesting new ones?
What happens if tez.am.container.idle.release-timeout-min.millis is too low?
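One way to reason about the idle-timeout question is to count how often a container sits idle longer than the timeout between consecutive tasks. The sketch below is a toy model of a single container slot, with hypothetical names and made-up gap values; it ignores locality matching and the min/max range the real scheduler uses.

```java
final class IdleRelease {
    /**
     * Counts container launches for one slot, given the idle gaps (ms) between
     * consecutive tasks. A gap longer than the timeout means the idle container
     * was released, so the next task pays for a fresh launch.
     */
    static int launches(long[] idleGapsMillis, long idleReleaseTimeoutMillis) {
        int launches = 1; // the first task always needs a container
        for (long gap : idleGapsMillis) {
            if (gap > idleReleaseTimeoutMillis) {
                launches++;
            }
        }
        return launches;
    }

    public static void main(String[] args) {
        long[] gaps = {1_000, 6_000, 2_000}; // hypothetical gaps between four tasks
        System.out.println("timeout 5000 ms -> launches: " + launches(gaps, 5_000)); // 2
        System.out.println("timeout  500 ms -> launches: " + launches(gaps, 500));   // 4
    }
}
```

With the shorter timeout every gap exceeds it, so all four tasks pay for a fresh container and reuse is effectively disabled.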

Rubric

Criterion	1	2	3	4
Reuse model	Aware	Knows conditions	Knows configs	Tunes for workload
Scheduling	Black box	Reads main loop	Reads matching	Reads + modifies
Locality	Aware	Knows hints	Knows fallback	Knows rack policy
Diagnostics	Guess-and-check	Reads AM logs	Reads + maps to code	Adds counters
YARN integration	Aware	Knows AMRM	Knows tokens	Knows failover

Pass threshold: 14/20.

Move to Level 7 when: you can explain why container reuse is on by default and pick five workloads where you would tune it.

M7 — Integrations (end of Week 13)

You can read and modify the MapReduce shim and explain Hive-on-Tez at a high level.

Skills

Write a DAG that uses MRInput reading from HDFS.
Explain MROutput commit semantics.
Sketch how Hive's TezTask builds a DAG.
Identify which features Hive uses (custom edges, manager plugins, dynamic reconfig).

Self-check questions

What does MROutput.commit() do, and what guarantees does it offer?
Why does Hive use ROOT_INPUT_INITIALIZER_FAILED heavily in its bug fixes?

Rubric

Criterion	1	2	3	4
MR shim	Knows existence	Reads MRInput	Reads + uses	Modifies safely
Commit	Aware	Knows semantics	Knows failure modes	Knows speculative cleanup
Hive lens	Aware	Reads TezTask	Reads + maps	Diagnoses cross-project bug
Cross-project	Confused	Knows boundaries	Picks the right list	Files bug correctly

Pass threshold: 12/16 (only 4 criteria here).

Move to Level 8 when: you can read a Hive query plan and predict its DAG.

M8 — Production Diagnostics (end of Week 14)

You can debug a real Tez job failure given logs and an ATS dump.

Skills

Read a Tez counters dump and find a bottleneck.
Find a VertexImpl failure cause from AM logs in <5 minutes.
Read ATS events and reconstruct a DAG timeline.
Identify a stuck task vs a slow task vs a failed task from counters.
Build a one-pager triage runbook for your team.

Rubric

Criterion	1	2	3	4
Counters	Knows existence	Reads	Interprets	Tunes
Log triage	Greps	Maps to code	Maps to state	Predicts next event
ATS	Aware	Queries	Reads events	Cross-checks vs AM log
Runbook	None	Draft	Reviewed	Shipped to team
Speed	>30 min	~15 min	<10 min	<5 min

Pass threshold: 16/20.

Move to capstone when: you've helped someone (on chat, dev list, or internally) debug a real Tez issue successfully.

M9 — Capstone (end of Week 16)

You've shipped a patch.

Skills

Selected an appropriate issue.
Reproduced and root-caused.
Implemented a fix with tests.
Submitted a patch in the project's accepted format.
Responded to at least one round of review feedback.

Rubric (20 points)

Criterion	1	2	3	4
Issue selection	Random	Scoped	Justified	Aligned to roadmap
Reproduction	None	Manual	Scripted	Added as a test
Root cause	Speculative	Localized	Cited	Explained in JIRA
Implementation	Compiles	Tests pass	Idiomatic	Minimal & focused
Submission	None	Draft	Submitted	Reviewed

Pass threshold: 16/20, and the patch must compile and pass mvn verify on the affected module.

Global Rubric (committer-readiness)

Use this every quarter, regardless of level, to self-assess.

Dimension	1 (Beginner)	2 (Apprentice)	3 (Practitioner)	4 (Committer-ready)
Code	Reads	Modifies	Designs subsystem	Reviews others' changes
Testing	Runs tests	Adds tests	Writes regression suites	Drives test infra
Docs	Reads	Edits	Writes user-facing	Owns module-level docs
Integration	Single module	Cross-module	Cross-project (Hive)	Drives release decisions

A committer-track contributor should be at level 3 on all four dimensions and level 4 on at least one. Aim for 3/3/3/3 → 4/3/3/4 by month 12 of focused contribution.

Level 1: Hadoop and Tez Foundation

This level establishes the technical baseline every subsequent level depends on. You will understand where Tez fits in the Hadoop ecosystem, successfully build the project from source, run the test suite, and execute your first Tez DAG in local mode.

Learning Objectives

By the end of Level 1 you must be able to:

Explain where Apache Tez sits in the Hadoop ecosystem and why it exists
Build Apache Tez from source using Maven, with and without tests
Execute unit tests scoped to a single module and interpret the results
Run a simple Tez DAG in local mode without a YARN cluster
Locate any class mentioned in Levels 2–9 without using a search engine
Articulate the difference between a MapReduce job and a Tez DAG at the execution model level
Read TezConfiguration.java and find any configuration key by category

The Hadoop Ecosystem Context

Apache Tez lives inside the Hadoop ecosystem. Before touching a line of Tez code, build an accurate mental model of the stack:

┌─────────────────────────────────────────────────────┐
│         Apache Hive / Apache Pig / Cascading        │  ← Query / scripting layer
├─────────────────────────────────────────────────────┤
│                  Apache Tez                         │  ← DAG execution engine
├─────────────────────────────────────────────────────┤
│                  Apache YARN                        │  ← Cluster resource management
├─────────────────────────────────────────────────────┤
│                  Apache HDFS                        │  ← Distributed storage
└─────────────────────────────────────────────────────┘

YARN (Yet Another Resource Negotiator) manages cluster resources. It runs an ApplicationMaster (AM) per application, allocates containers, and monitors health. Tez's DAGAppMaster is a YARN ApplicationMaster.

HDFS stores job input and final output. Tez keeps intermediate (shuffle) data in memory or on local disk rather than in HDFS; what the AM does write to HDFS is its recovery log, so that a restarted AM can resume a DAG.

The Tez client submits a DAGAppMaster to YARN; the AM then requests containers for task execution. Tasks read inputs, execute processors, and write outputs, either directly to downstream tasks via shuffle or to HDFS for final output.

MapReduce vs. Tez

Aspect	MapReduce	Apache Tez
Execution model	Fixed: Map → Shuffle → Reduce	Arbitrary DAG of vertices
Multi-stage queries	Chain of separate MR jobs	Single DAG
Inter-stage data	Written to HDFS between chained jobs	Pipelined or local disk
JVM startup	New JVM per task	Container reuse across tasks
Vertex types	Two (Map, Reduce)	Unlimited
Speculative execution	Yes	Yes (configurable per vertex)
Session support	No	Yes — `TezClient` session mode

For a 10-stage Hive aggregation query, MapReduce requires 10 separate MR jobs with HDFS writes between every stage. Tez runs the same query as a single DAG — no HDFS round-trips between stages, containers reused across task waves, and pipeline-style data movement between compatible vertices.
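The phrase "single DAG" is concrete: vertices plus directed edges, executed in dependency order. The sketch below models that with an adjacency list and a topological sort (Kahn's algorithm). `ToyDag` is a hypothetical class, not the Tez `DAG` API, and the vertex names are illustrative labels for a word-count pipeline.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ToyDag {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    ToyDag addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    /** Vertices in an order where every vertex follows all of its upstream vertices. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        ToyDag dag = new ToyDag()
            .addEdge("Tokenizer", "Summation")
            .addEdge("Summation", "Sorter");
        System.out.println(dag.executionOrder()); // [Tokenizer, Summation, Sorter]
    }
}
```

A chain of MapReduce jobs is the degenerate case of this structure in which every edge is forced through HDFS; a DAG engine can pick a cheaper data movement per edge.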

Required Reading

Complete in this order before starting the labs:

#	Resource	What to extract
1	`README.md` in the Tez repo root	Build commands, module overview
2	Tez architecture document	Original design intent, DAG model rationale
3	YARN Architecture	Container lifecycle, AM responsibilities
4	`tez-api/src/main/java/org/apache/tez/client/TezClient.java`	Class-level Javadoc only — understand session vs. non-session
5	`tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java`	Skim all keys — understand the category groupings
6	`tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java`	End-to-end DAG construction and submission

Note on reading strategy: In a mature Apache codebase, Javadoc is often the best documentation that exists. Class-level Javadoc on public API classes reflects decisions debated and agreed upon by committers. Read it seriously.

Source Code Areas to Inspect

Read these files before and after the labs. You are not modifying anything yet.

`tez-api` — Public API

File	Why
`client/TezClient.java`	Entry point for all DAG submissions. Read `create()`, `start()`, `submitDAG()`.
`dag/api/DAG.java`	DAG construction API. Note `addVertex()`, `addEdge()`, `addTaskLocalFiles()`.
`dag/api/Vertex.java`	Vertex definition. Understand `ProcessorDescriptor`, parallelism, and `VertexManagerPlugin`.
`dag/api/Edge.java`	Edge definition. Understand `EdgeProperty` and `DataMovementType`.
`dag/api/client/DAGClient.java`	DAG monitoring. Understand `getDAGStatus()` and progress tracking.
`dag/api/TezConfiguration.java`	All Tez configuration keys. Every key is documented.
`dag/api/EdgeProperty.java`	Data movement type and scheduling type for edges. Fundamental to DAG design.

`tez-dag` — Core Execution Engine

File	Why
`app/DAGAppMaster.java`	The YARN ApplicationMaster. First read: just `init()` and `start()`. It is 5000+ lines.
`app/dag/impl/DAGImpl.java`	DAG state machine. Read the state/transition enum declarations at the top.
`app/dag/impl/VertexImpl.java`	Most complex class in the project. First read: state enum + `handle()` only.
`app/dag/impl/TaskImpl.java`	Task state machine. More tractable than VertexImpl. Read fully.
`app/dag/impl/TaskAttemptImpl.java`	TaskAttempt state machine. Read fully.

`tez-runtime-library` — I/O Implementations

File	Why
`runtime/library/input/OrderedGroupedKVInput.java`	Standard sorted shuffle input. Used by most Hive reduce operations.
`runtime/library/output/OrderedPartitionedKVOutput.java`	Standard sorted shuffle output. Paired with the above.
`runtime/library/input/UnorderedKVInput.java`	Broadcast input — data is not sorted.

`tez-examples` — Reference Implementations

File	Why
`examples/OrderedWordCount.java`	The canonical Tez DAG example. Read this completely.
`examples/IntersectExample.java`	Shows a 3-vertex DAG with a broadcast edge.

Key Classes Quick Reference

Class	Module	Package	Role
`TezClient`	`tez-api`	`org.apache.tez.dag.api`	Creates sessions, submits DAGs
`DAG`	`tez-api`	`org.apache.tez.dag.api`	Defines the computation graph
`Vertex`	`tez-api`	`org.apache.tez.dag.api`	One processing stage
`Edge`	`tez-api`	`org.apache.tez.dag.api`	Data connection between vertices
`EdgeProperty`	`tez-api`	`org.apache.tez.dag.api`	Data movement + scheduling type
`ProcessorDescriptor`	`tez-api`	`org.apache.tez.dag.api`	Which Processor class runs in a vertex
`TezConfiguration`	`tez-api`	`org.apache.tez.dag.api`	All Tez configuration keys
`DAGAppMaster`	`tez-dag`	`org.apache.tez.dag.app`	YARN ApplicationMaster
`DAGImpl`	`tez-dag`	`org.apache.tez.dag.app.dag.impl`	DAG state machine
`VertexImpl`	`tez-dag`	`org.apache.tez.dag.app.dag.impl`	Vertex state machine
`TaskImpl`	`tez-dag`	`org.apache.tez.dag.app.dag.impl`	Task state machine
`TaskAttemptImpl`	`tez-dag`	`org.apache.tez.dag.app.dag.impl`	TaskAttempt state machine
`TezTaskRunner2`	`tez-runtime-internals`	`org.apache.tez.runtime.task`	Runs a task inside a container
`OrderedWordCount`	`tez-examples`	`org.apache.tez.examples`	Canonical DAG example

JIRA Issue Categories for Level 1 Contributors

At this stage, focus exclusively on:

Documentation — Javadoc typos, outdated parameter descriptions, missing @param or @return annotations, broken links in comments
Test improvements — Adding missing assertions to existing tests, improving test method naming, removing dead code from test classes
Checkstyle violations — Unused imports, line length violations, missing final keywords

How to find these:

Go to Apache Tez JIRA
Search: project = TEZ AND labels = "newbie" AND resolution = Unresolved
Also scan: project = TEZ AND component = "Documentation" AND resolution = Unresolved
Look at recently closed "trivial" issues to understand the standard for accepted patches

Warning: Do not pick up a JIRA issue and immediately upload a patch. Read all existing comments. If there is an active discussion or existing assignee, move on. Leave a comment saying you are investigating before you claim an issue.

Deliverables

You must demonstrate all of the following before advancing to Level 2:

Successful mvn install -DskipTests output — no build failures
At least one unit test class run successfully (e.g., TestDAGImpl)
Successful local DAG execution showing DAG completed: SUCCEEDED
Ability to locate DAGAppMaster, TezClient, and OrderedGroupedKVInput by memory
Written explanation (2–3 sentences) of why a Tez DAG is faster than chained MapReduce
Written explanation of the difference between a YARN container and a Tez task

Common Mistakes

Mistake	Consequence	Fix
Building with Java 17 against `master`	Compile errors or compatibility failures	Use Java 8 or Java 11; check `<maven.compiler.source>` in root `pom.xml`
Running `mvn test` on the full repository	Hours-long run including integration tests	Use `-pl tez-dag` to scope to one module (add `-am` only when upstream modules must be rebuilt; it runs their tests too)
Ignoring `TezConfiguration.java`	Confusion about configuration keys throughout all levels	Skim the entire file; every key is documented
Skipping the YARN architecture doc	Confusion about what Tez owns vs. what YARN owns	YARN understanding is required from Level 3 onward
Trying to understand all of `DAGAppMaster` at once	Overwhelm — 5000+ lines	First pass: read only `init()` and `start()`
Reading Tez code without running it	Abstract understanding that does not transfer to debugging	Always run the code after reading it
Picking a JIRA issue without reading existing comments	Duplicate work; community friction	Read all comments; check assignee; leave a note before claiming

How to Verify Success

# 1. Full build without tests
cd /path/to/tez
mvn install -DskipTests -q && echo "BUILD OK"

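# 2. Unit test from tez-dag (no -am: combined with -Dtest it fails in upstream
#    modules that contain no matching test; step 1 already installed them)
mvn test -pl tez-dag -Dtest=TestDAGImpl -q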

# 3. Local DAG run (from Lab 1.3)
# Expected final output line:
#   DAG: [OrderedWordCount] finished with status: [SUCCEEDED]

Patch Profile: Level 1 Graduate

Patch type	Example	Test requirement
Javadoc fix	Correcting a wrong `@param` description in `TezClient`	None — documentation only
Dead import removal	Remove unused `import` statement flagged by checkstyle	Run `mvn checkstyle:check -pl <module>`
Test assertion improvement	Add `assertEquals` to an existing test that only checks for no-exception	Run the test class
README update	Fix a broken Maven command in the build instructions	Manual verification

You are not ready to submit: bug fixes in state machines, new features, performance patches, or changes to the shuffle path. Those require Levels 3–7.

Lab 1.1: Build Apache Tez from Source

Background

Apache Tez is a multi-module Maven project. Building from source is the mandatory first step for any contributor — you need the ability to make code changes, rebuild specific modules, and run tests against your local changes. This lab walks through the full build, from cloning to verifying artifacts.

Why This Lab Matters for Contributors

You cannot submit a credible patch without first verifying it builds cleanly
Knowing which Maven flags control which modules saves hours during development
Understanding the build structure helps you scope test runs efficiently
Build failures are sometimes real bugs — knowing a clean build baseline lets you detect regressions

Prerequisites

Verify before starting:

java -version    # Must be Java 8 or Java 11
mvn -version     # Must be Maven 3.6.3 or newer
git --version    # Must be 2.x

Disk space: at least 10 GB free. The full build with tests generates large artifacts.
Memory: at least 8 GB RAM. The tez-dag unit tests can spike to 4 GB during parallel runs.

Step-by-Step Tasks

Step 1: Clone the Repository

git clone https://github.com/apache/tez.git
cd tez

The GitHub repository at https://github.com/apache/tez is a mirror of the canonical Apache GitBox repository. For contribution purposes (submitting patches via JIRA), the GitHub mirror is acceptable for development. The patch will be attached to the JIRA issue rather than sent as a GitHub PR — this is Apache's traditional workflow.

Verify the remote:

git remote -v
# origin  https://github.com/apache/tez.git (fetch)
# origin  https://github.com/apache/tez.git (push)

Step 2: Inspect the Branch Structure

git branch -r | grep -v HEAD | sort

You will see branches like:

origin/master — development trunk
origin/branch-0.10 — stable release branch
origin/branch-0.9 — older stable branch

For contributor work, use master unless you are reproducing an issue specific to a release branch. Bug fixes for release branches are typically backported from master.

Check the current Hadoop dependency in pom.xml:

grep -m1 "hadoop.version" pom.xml

This tells you which Hadoop version Tez is built against. The default Hadoop version target controls which APIs are available.

Step 3: Full Build (Skip Tests)

mvn install -DskipTests -q

Expected duration: 5–15 minutes depending on hardware and Maven cache state.

The first run downloads all dependencies. With a warm Maven cache (~/.m2/repository), later builds skip the downloads, and the compiler plugin recompiles only modules whose sources changed, so they finish much faster.

What -DskipTests does:
Skips test execution but still compiles the test classes. To skip test compilation as well, use -Dmaven.test.skip=true. Use -DskipTests for iterative development when you only need the main artifacts.

What -q does:
Suppresses INFO-level Maven output. Remove -q if you need to debug build failures.

When the build completes, Maven prints the following summary (run without -q to see it; with -q a successful build prints nothing):

[INFO] BUILD SUCCESS
[INFO] Total time:  X min Y s

If you see BUILD FAILURE, go to the Troubleshooting section below.

Step 4: Verify Build Artifacts

After a successful build, key JARs exist in each module's target/ directory:

find . -name "tez-dag-*.jar" -not -path "*/test-*" | grep -v -e sources -e tests
# Expected: ./tez-dag/target/tez-dag-<version>.jar

find . -name "tez-api-*.jar" -not -path "*/test-*" | grep -v -e sources -e tests
# Expected: ./tez-api/target/tez-api-<version>.jar

The assembled distribution tarball is built by a separate command:

mvn package -DskipTests -Pdist -q
ls tez-dist/target/*.tar.gz

This produces the full binary distribution used by HDP and other distributions.

Step 5: Build a Single Module

During development you will almost always build a single module to save time:

# Build only tez-dag and its dependencies
mvn install -DskipTests -pl tez-dag -am -q

# Build only tez-api (no dependencies needed — it has none in Tez)
mvn install -DskipTests -pl tez-api -q

-pl specifies the module path. -am (also-make) builds all upstream dependencies first. This is the command you will run hundreds of times during contributor work.

Step 6: Configure IntelliJ IDEA

IntelliJ handles Maven multi-module projects natively.

File → Open → select the tez/ directory (the one containing pom.xml)
IntelliJ detects the Maven project and imports all modules
When prompted, select the JDK that matches the build (Java 8 or Java 11)
Wait for the initial index build to complete (2–5 minutes)

Verify the import worked:

Open tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
Ctrl+Click on any class reference — it should navigate correctly
Open Find Class (Cmd+O / Ctrl+N) and search TestDAGImpl — it should find the test

Enable checkstyle integration:

Install the CheckStyle-IDEA plugin (Settings → Plugins)
Configure it to use src/config/checkstyle.xml in the Tez repo root
This gives you real-time checkstyle feedback as you edit

Implementation Requirements

This lab has no code to implement. Deliverables are:

A successful mvn install -DskipTests run (screenshot or terminal output)
Identification of the Hadoop version Tez is built against
Location of the tez-dag-<version>.jar artifact
A working IntelliJ project that resolves all imports

Troubleshooting Common Build Failures

"Source/Target Java version mismatch"

error: Source option X is no longer supported. Use Y or later.

Cause: Your JAVA_HOME or java in PATH is the wrong version.
Fix:

export JAVA_HOME=$(/usr/libexec/java_home -v 11)   # macOS
export PATH=$JAVA_HOME/bin:$PATH
java -version   # verify
mvn install -DskipTests -q

"Cannot resolve dependency: org.apache.hadoop:..."

Cause: The required Hadoop version is not in Maven Central or your local cache.
Fix: Ensure Maven Central is reachable. If building offline, use an internal repository mirror. On a clean machine with network access this should not occur.

"Killed" or "Out of Memory"

Cause: The Maven JVM runs out of heap.
Fix:

export MAVEN_OPTS="-Xmx4g"
mvn install -DskipTests -q

"ERROR: Failed to execute goal ... tez-tests"

Cause: The tez-tests module requires specific integration test infrastructure.
Fix: Build only the modules you need:

mvn install -DskipTests -pl tez-api,tez-dag,tez-runtime-library,tez-examples -am -q

Expected Output

[INFO] Reactor Summary:
[INFO] Apache Tez ......................................... SUCCESS [  2.345 s]
[INFO] tez-api ............................................ SUCCESS [ 15.678 s]
[INFO] tez-dag ............................................ SUCCESS [ 45.123 s]
[INFO] tez-runtime-internals .............................. SUCCESS [ 12.456 s]
[INFO] tez-runtime-library ................................ SUCCESS [ 18.789 s]
[INFO] tez-mapreduce ...................................... SUCCESS [  8.012 s]
[INFO] tez-examples ....................................... SUCCESS [  5.234 s]
...
[INFO] BUILD SUCCESS

Stretch Goals

Build against a specific Hadoop version by overriding the hadoop.version property:
```
mvn install -DskipTests -Dhadoop.version=3.3.6 -q
```
Inspect the effective POM for tez-dag (printed to the console, not written to a file) to see all inherited dependency versions:
```
mvn help:effective-pom -pl tez-dag | grep -A3 "dependency>"
```
Identify which modules depend on tez-api by inspecting all pom.xml files:
```
grep -r "tez-api" */pom.xml | grep "artifactId"
```

Build breakage issues (e.g., dependency version conflicts) — you can observe but not fix at Level 1
Java version compatibility issues — important context when reading bug reports

Lab 1.2: Run Unit and Integration Tests

Background

Apache Tez has a well-structured test suite that spans unit tests, module-level integration tests, and full cluster integration tests using MiniTezCluster. Understanding how to run specific tests, read failures, and scope test execution is essential for contributor work — your patch must include a passing test run before upload.

Why This Lab Matters for Contributors

You must run tests before submitting any patch
Being able to run a single test class in seconds makes iteration fast
Understanding test failure output is the first step to debugging
Many flaky tests are contributor opportunities once you understand how tests work

How Tez Tests Are Organized

Tez tests fall into three categories:

Category	Location	Runs with	Scope
Unit tests	`src/test/java/` in each module	`mvn test -pl <module>`	Fast, no cluster
Module integration tests	`tez-tests/src/test/java/`	`mvn test -pl tez-tests`	Requires MiniTezCluster
System tests	Manual / CI scripts	Requires full cluster	Not run locally

For Level 1–3 work, focus exclusively on unit tests.

Key unit test classes in tez-dag (path: tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/):

Test Class	What it Tests
`TestDAGImpl`	`DAGImpl` state machine transitions, initialization, completion
`TestVertexImpl`	`VertexImpl` state machine — the most complex test class in the project
`TestTaskImpl`	`TaskImpl` state machine transitions
`TestTaskAttemptImpl`	`TaskAttemptImpl` state transitions, speculation, failure handling

Supporting test infrastructure in tez-dag/src/test/java/org/apache/tez/dag/app/:

Class	Role
`MockDAGAppMaster`	A reduced AM for unit testing — no YARN connection needed
`MockAppContext`	Mock `AppContext` that provides state to state machine tests
`MockHistoryEventHandler`	No-op history handler for tests that don't test history

Step-by-Step Tasks

Step 1: Run All Unit Tests in `tez-dag`

cd /path/to/tez
mvn test -pl tez-dag 2>&1 | tail -30

Expected duration: 3–8 minutes depending on hardware.

Expected completion:

[INFO] Tests run: NNNN, Failures: 0, Errors: 0, Skipped: NN
[INFO] BUILD SUCCESS

Some tests are marked @Ignore or skipped due to environment constraints — a non-zero Skipped count is normal.

Step 2: Run a Single Test Class

mvn test -pl tez-dag -Dtest=TestDAGImpl

Expected output (last few lines):

[INFO] Tests run: 42, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: X.XXX s
[INFO] BUILD SUCCESS

If a test fails, you will see:

[ERROR] Tests run: 42, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: X.XXX s
[ERROR] testDAGCreation(org.apache.tez.dag.app.dag.impl.TestDAGImpl): expected:<...> but was:<...>

Step 3: Run a Single Test Method

mvn test -pl tez-dag -Dtest=TestDAGImpl#testDAGCreation -q

This is the command you will use most often: run exactly one test after a code change to verify your fix.

Step 4: Read the Surefire Report

Maven writes detailed test results to:

tez-dag/target/surefire-reports/

For a failing test, read the .txt file for the test class:

cat tez-dag/target/surefire-reports/org.apache.tez.dag.app.dag.impl.TestDAGImpl.txt

This contains the full stack trace, which is often more informative than the Maven console output.
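
The summary line in that report has a fixed shape, so the numbers are easy to extract when you want to compare many reports at once. The sketch below is an illustration only: the class name `SurefireSummary` is hypothetical, and the line format is assumed to match the sample output shown in this lab. Newer Surefire versions append extra text after the elapsed time, which the pattern tolerates.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this lab; not part of Tez or Surefire.
final class SurefireSummary {
    // Assumed format: "Tests run: 42, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.345 s"
    private static final Pattern SUMMARY = Pattern.compile(
        "Tests run: (\\d+), Failures: (\\d+), Errors: (\\d+), Skipped: (\\d+), Time elapsed: ([0-9.]+) s");

    final int run;
    final int failures;
    final int errors;
    final int skipped;
    final double seconds;

    private SurefireSummary(int run, int failures, int errors, int skipped, double seconds) {
        this.run = run;
        this.failures = failures;
        this.errors = errors;
        this.skipped = skipped;
        this.seconds = seconds;
    }

    /** Parses one summary line; any prefix such as "[INFO] " is ignored. */
    static SurefireSummary parse(String line) {
        Matcher m = SUMMARY.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a surefire summary line: " + line);
        }
        return new SurefireSummary(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4)),
            Double.parseDouble(m.group(5)));
    }

    /** A class passes when it reports neither failures nor errors. */
    boolean passed() {
        return failures == 0 && errors == 0;
    }

    public static void main(String[] args) {
        SurefireSummary s = parse(
            "[ERROR] Tests run: 42, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 12.345 s");
        System.out.println(s.run + " tests, passed=" + s.passed() + ", " + s.seconds + " s");
        // prints: 42 tests, passed=false, 12.345 s
    }
}
```

Skipped tests do not count against `passed()`, which matches the earlier note that a non-zero Skipped count is normal.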

Step 5: Run Tests in `tez-api`

mvn test -pl tez-api -q

tez-api tests are faster and simpler. Key test classes:

Test Class	What it Tests
`TestDAG`	`DAG` API construction, validation, serialization
`TestVertex`	`Vertex` API construction and edge validation
`TestTezClient`	`TezClient` initialization and session management
`TestAMControl`	AM communication protocol

Step 6: Run Tests in `tez-runtime-library`

mvn test -pl tez-runtime-library -am -q

This includes shuffle and I/O tests. Expected duration: 5–10 minutes.

Key test classes:

Test Class	What it Tests
`TestOrderedPartitionedKVWriter`	Sorted KV output serialization
`TestFetcher`	Shuffle fetch logic
`TestShuffleScheduler`	Fetch scheduling and retry
`TestTezMerger`	Sort-merge implementation

Step 7: Understand a Test Failure

Intentionally break a test to understand failure output:

Open tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java
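Find the getTotalVertices() method
Replace its body with return 0; (adding the statement above the existing code would fail to compile with an "unreachable statement" error instead of failing a test)
Run mvn test -pl tez-dag -Dtest=TestDAGImpl -q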
Read the failure output in both the console and the surefire report
Revert the change with git checkout tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java

This exercise makes test failure output familiar before you encounter a real failure.

Debugging Test Failures

Adding Log Output

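Tez uses SLF4J + Log4j. Log4j 1.x takes logger levels from its properties file rather than from -D flags, so add the line log4j.logger.org.apache.tez=DEBUG to the log4j.properties file referenced below, then point the test run at it:

mvn test -pl tez-dag -Dtest=TestDAGImpl \
  -Dlog4j.configuration=file:src/test/resources/log4j.properties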

Running Tests with Remote Debug (IntelliJ)

To attach a debugger to a Maven test run:

mvn test -pl tez-dag -Dtest=TestDAGImpl \
  -Dmaven.surefire.debug="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5005"

In IntelliJ: Run → Attach to Process → port 5005. The test JVM pauses until IntelliJ connects.

Testing Checklist

Before submitting any patch:

Run mvn test -pl <changed-module> -am — zero failures
If adding a new test: mvn test -pl <module> -Dtest=<YourNewTest> passes
Run mvn checkstyle:check -pl <changed-module> — zero violations
If the change touches shuffle or I/O: run mvn test -pl tez-runtime-library -am

Expected Output

A clean test run for TestDAGImpl:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.tez.dag.app.dag.impl.TestDAGImpl
[INFO] Tests run: 42, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.345 s
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 42, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] BUILD SUCCESS

Stretch Goals

Find all test classes in tez-dag that test the VertexImpl state machine:
```
find tez-dag/src/test -name "*.java" | xargs grep -l "VertexImpl"
```

Count the total number of test methods in TestVertexImpl:

grep -c "@Test" tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java

Identify which test classes take the longest to run by examining the elapsed times in the surefire reports:
```
grep -H "Time elapsed" tez-dag/target/surefire-reports/*.txt | awk -F'Time elapsed: ' '{print $2 "\t" $1}' | sort -rn | head -10
```
Find tests that use MockDAGAppMaster to understand the test infrastructure pattern:
```
grep -rl "MockDAGAppMaster" tez-dag/src/test/
```

Flaky tests (timing-dependent, environment-dependent) — a major contributor opportunity
Tests that don't assert anything meaningful — test quality improvements
Missing test coverage for error paths — discoverable by reading state machine code

Lab 1.3: Run a Simple Tez DAG Locally

Background

Apache Tez supports a local mode that runs the entire DAG execution inside a single JVM without YARN or HDFS. This is the primary environment for rapid development and testing. Understanding how to run a DAG in local mode is essential before attempting cluster testing.

The tez-examples module contains reference DAG implementations. OrderedWordCount is the canonical example: it reads text, counts word occurrences, and sorts by frequency. It demonstrates the complete Tez DAG API: TezClient, DAG, Vertex, Edge, and I/O processors.

Why This Lab Matters for Contributors

Local mode is how you verify behavior changes without a cluster
All integration test work in tez-tests builds on the same local mode infrastructure
Understanding how a real DAG is constructed gives concrete context for reading state machine code
Every DAG execution produces log output that teaches you about the AM lifecycle

Understanding Tez Local Mode

Tez local mode is enabled by setting tez.local.mode=true in the TezConfiguration. When this is set:

No YARN cluster is contacted
No containers are launched — task execution happens in threads within the same JVM
LocalClient.java takes the place of the YARN client and starts the DAGAppMaster in a thread inside the client JVM
HDFS is replaced by the local filesystem (configurable)

Key configuration for local mode:

TezConfiguration tezConf = new TezConfiguration();
tezConf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, true);
// Use local filesystem instead of HDFS
tezConf.set("fs.defaultFS", "file:///");
tezConf.setBoolean("tez.local.mode.without.network", true);

Anatomy of `OrderedWordCount`

Before running the example, read tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java.

The DAG structure:

[Tokenizer Vertex]
      |
      | (SCATTER_GATHER edge — partitioned by hash, sorted)
      v
[SumReducer Vertex]
      |
      | (SCATTER_GATHER edge — partitioned by value for sort)
      v
[Sorter Vertex] → HDFS output

Tokenizer: Reads input text lines, splits into words, emits (word, 1) pairs.
Processor class: TokenProcessor (inner class in OrderedWordCount)

SumReducer: Receives (word, [1, 1, 1, ...]) groups, sums counts, emits (word, count).
Processor class: SumProcessor (inner class in OrderedWordCount)

Sorter: Receives (count, word) pairs, that is, the SumReducer output with key and value swapped, and writes them out in the order they arrive.
Processor class: NoOpSorter — uses OrderedGroupedKVInput to do the sort during shuffle

The key insight: Tez uses edge properties and I/O processor configuration to control the sort and partition behavior. The Sorter vertex does not sort — the shuffle/merge into OrderedGroupedKVInput does the sorting.
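
To see why the consuming vertex needs no sort of its own, it helps to model the merge step in isolation. The following is a simplified sketch, not Tez's merger: it assumes each upstream task has already produced a sorted run of integer keys, and it uses a plain `PriorityQueue` to merge them. The class name `SortedRunMerge` and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified model of a sorted shuffle: k sorted runs in, one sorted stream out.
final class SortedRunMerge {

    /** Tracks the current head of one run plus the remainder of that run. */
    private static final class Cursor {
        int head;
        final Iterator<Integer> rest;

        Cursor(int head, Iterator<Integer> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    /** Merges runs that are each already sorted in ascending order. */
    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
        for (List<Integer> run : sortedRuns) {
            Iterator<Integer> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advance this run and put it back so its next key competes again.
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 9),   // output of upstream task 0
            Arrays.asList(2, 3, 10),  // output of upstream task 1
            Arrays.asList(5));        // output of upstream task 2
        System.out.println(merge(runs)); // prints: [1, 2, 3, 4, 5, 9, 10]
    }
}
```

The consumer of `merge()` only iterates. That is the role a processor like NoOpSorter plays: the ordering is a property of the data handed to it, not of any code it runs.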

Step-by-Step Tasks

Step 1: Prepare Sample Input

mkdir -p /tmp/tez-lab/input
cat > /tmp/tez-lab/input/words.txt << 'EOF'
the quick brown fox jumps over the lazy dog
the dog barked at the fox
quick brown dog
EOF

Step 2: Build `tez-examples`

cd /path/to/tez
mvn package -DskipTests -pl tez-examples -am -q

Locate the examples JAR:

ls tez-examples/target/tez-examples-*.jar | grep -v sources | grep -v tests

Step 3: Run `OrderedWordCount` in Local Mode

The example is run as a standard Java main class:

# Set classpath to include Tez JARs
TEZ_HOME=/path/to/tez

# Collect each module's main JAR. Java expands only a bare dir/* wildcard,
# so patterns like tez-api-*.jar must be expanded by the shell.
CLASSPATH=""
for m in tez-examples tez-api tez-dag tez-runtime-library tez-runtime-internals tez-mapreduce tez-common; do
  for jar in "$TEZ_HOME/$m/target/$m"-*.jar; do
    case "$jar" in *-sources.jar|*-tests.jar|*-javadoc.jar) continue ;; esac
    CLASSPATH="${CLASSPATH:+$CLASSPATH:}$jar"
  done
done

# Add Hadoop JARs (required for FileSystem, Configuration, etc.)
# If Hadoop is installed:
CLASSPATH=$CLASSPATH:$(hadoop classpath)
# If not, add from Maven local cache manually

java -cp "$CLASSPATH" \
  org.apache.tez.examples.OrderedWordCount \
  /tmp/tez-lab/input \
  /tmp/tez-lab/output \
  1

Tip: The easiest way to handle classpaths during development is to use Maven's exec:java goal or to build a fat JAR using the shade plugin. The tez-dist assembly includes all JARs and the bin/ scripts handle classpath setup.

Step 4: Run with a Launcher Script (simpler)

If you have Hadoop installed and HADOOP_HOME set, use the Tez distributed shell script:

cd $TEZ_HOME
bin/tez-examples.sh OrderedWordCount \
  /tmp/tez-lab/input \
  /tmp/tez-lab/output \
  1

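Or pass the local mode settings as Hadoop generic options. Put them after the class name: -D flags placed before it become JVM system properties, which Hadoop's Configuration does not treat as configuration keys. (This assumes the example parses its arguments through Hadoop's GenericOptionsParser, as the Tez examples do.)

java -cp "$CLASSPATH" \
     org.apache.tez.examples.OrderedWordCount \
     -Dtez.local.mode=true \
     -Dfs.defaultFS=file:/// \
     /tmp/tez-lab/input \
     /tmp/tez-lab/output \
     1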

Step 5: Verify Output

cat /tmp/tez-lab/output/part-*

Expected output (sorted by frequency descending):

the	4
dog	3
fox	2
quick	2
brown	2
...

Step 6: Read the Execution Log

Examine the log output from the run. Key lines to understand:

INFO  TezClient: Submitting DAG to YARN, queueName=...
INFO  DAGAppMaster: Running DAG: [OrderedWordCount]
INFO  VertexImpl: Vertex: [Tokenizer] initialized
INFO  VertexImpl: Vertex: [Tokenizer] started
INFO  DAGImpl: DAG: [OrderedWordCount] finished with status: [SUCCEEDED]

These lines correspond directly to state machine transitions you will study in Level 4. For each log line, identify the state transition it represents.
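
As a warm-up for that exercise, here is a deliberately tiny model of a table-driven state machine. It is a sketch under stated assumptions: the states and events are a hypothetical, heavily reduced subset chosen to line up with the log lines above, and the class `ToyVertexLifecycle` is not Tez code. The real classes have many more states and register their transitions through a state machine factory rather than a plain map.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: states and events are a hypothetical subset, not Tez's real enums.
final class ToyVertexLifecycle {
    enum State { NEW, INITED, RUNNING, SUCCEEDED }
    enum Event { INIT, START, COMPLETE }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<State, Map<Event, State>>(State.class);

    static {
        register(State.NEW, Event.INIT, State.INITED);           // "initialized"
        register(State.INITED, Event.START, State.RUNNING);      // "started"
        register(State.RUNNING, Event.COMPLETE, State.SUCCEEDED); // "finished with status"
    }

    private static void register(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    private State current = State.NEW;

    /** Applies one event; an event with no registered transition is rejected. */
    State handle(Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    State current() {
        return current;
    }

    public static void main(String[] args) {
        ToyVertexLifecycle vertex = new ToyVertexLifecycle();
        System.out.println(vertex.handle(Event.INIT));     // prints: INITED
        System.out.println(vertex.handle(Event.START));    // prints: RUNNING
        System.out.println(vertex.handle(Event.COMPLETE)); // prints: SUCCEEDED
    }
}
```

When you read the log, ask the same two questions this model makes explicit: which state was the object in, and which event moved it.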

Implementation Requirements

Modify OrderedWordCount to add a fourth vertex that filters out words with count < 2:

Add a new Vertex named "Filter" after SumReducer and before Sorter
Write a minimal FilterProcessor extends SimpleProcessor (the base class the example's own processors build on):
- In run(): iterate the input, skip pairs where the count value < 2, forward the rest
Add an edge SumReducer → Filter and Filter → Sorter
Run the modified DAG and verify that single-occurrence words are removed from output

This exercise teaches you:

How to add a vertex to an existing DAG
How to write a minimal Processor implementation
How edges connect processors

Do not overthink the implementation — the processor body is ~20 lines.

Debugging Checklist

If the DAG fails with DAG status: FAILED:

Read the log for ERROR lines — they contain the failure reason and task attempt ID
Check DAGAppMaster log for VertexImpl: Vertex [...] failed
The error message will include the class and method where the exception occurred
Common causes:
- Classpath missing a required JAR (NoClassDefFoundError)
- Output directory already exists (FileAlreadyExistsException)
- Wrong input path (FileNotFoundException)

Clean output directory before re-running:

rm -rf /tmp/tez-lab/output

Expected Output

A successful run ends with:

INFO  DAGImpl: DAG: [OrderedWordCount] finished with status: [SUCCEEDED]
INFO  TezClient: Shutting down TezSession...

Stretch Goals

Enable INFO-level logging for org.apache.tez.dag.app.dag.impl and observe vertex state transitions in the console output during the DAG run.
Modify the DAG to use UnorderedKVInput/UnorderedKVOutput instead of the ordered pair for the first edge. Observe the difference in output ordering.
Change the parallelism of the Sorter vertex to 2 and observe the output directory structure (2 part files instead of 1).
Add a timer around the TezClient.submitDAG() → DAGClient.waitForCompletion() block and measure execution time for different input sizes.

Local mode-specific bugs (different from cluster mode) — contributor opportunity
DAG API usability issues — often exposed by example code
Local mode configuration issues — often reported by new users

Lab 1.4: Project — Number Pipeline DAG

What You Are Building

A self-contained, runnable Java project that builds and executes a 3-vertex Tez DAG entirely in local mode — no YARN cluster, no HDFS, no Docker required.

Generator (2 tasks)
    │  SCATTER_GATHER shuffle
    ▼
Multiplier (2 tasks)   [value * 2]
    │  SCATTER_GATHER shuffle
    ▼
Sink (1 task)          [sum → counter]

Numbers 0–99 flow through the pipeline. The expected final sum is: sum(0..99) * 2 = 4950 * 2 = 9900.
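
The expected value can be checked independently of Tez. The sketch below computes it two ways, by brute force and with the closed form factor * n(n-1)/2 for n = 100; the class and method names are hypothetical and exist only for this check.

```java
// Standalone arithmetic check; no Tez classes involved.
final class PipelineSumCheck {

    /** Sums n * factor for n in [0, count), mirroring Generator -> Multiplier -> Sink. */
    static long simulate(int count, int factor) {
        long sum = 0;
        for (int n = 0; n < count; n++) {
            sum += (long) n * factor;
        }
        return sum;
    }

    /** Closed form: factor * (0 + 1 + ... + (count - 1)). */
    static long closedForm(int count, int factor) {
        return (long) factor * count * (count - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(simulate(100, 2));   // prints: 9900
        System.out.println(closedForm(100, 2)); // prints: 9900
    }
}
```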

This pipeline intentionally mirrors the structure of Apache Tez's own OrderedWordCount example but with an integer domain so the math is verifiable without a corpus.

Project Location

book/projects/
├── pom.xml                              ← parent; sets Tez + Hadoop versions
└── level-1-number-pipeline/
    ├── pom.xml
    └── src/main/java/org/apache/tez/learning/l1/
        ├── GeneratorProcessor.java      ← no inputs; emits integers
        ├── MultiplierProcessor.java     ← one input, one output; value * 2
        ├── SinkProcessor.java           ← sums values; publishes counter
        ├── FilterProcessor.java         ← exercise stub (incomplete)
        └── NumberPipelineDAG.java       ← main class; configures + submits DAG

Prerequisites

Completed Lab 1.1 (Apache Tez built from source with mvn install -DskipTests)
Java 8+ on $PATH
Maven 3.6+ on $PATH

Step 1: Set the Tez Version

The parent pom.xml needs to reference the exact version that mvn install installed into your local ~/.m2 repository. Find it:

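# Inside your apache/tez clone. This prints the project's own version;
# a plain grep for the first <version> tag tends to hit the parent POM's version instead.
mvn help:evaluate -Dexpression=project.version -q -DforceStdout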

Open book/projects/pom.xml and set <tez.version> to match:

<tez.version>0.10.3-SNAPSHOT</tez.version>   <!-- adjust to your build -->

Step 2: Compile

cd /path/to/opensource-engineer-and-contributor/apache-tez/book/projects

# Build only the level-1 module (fast; skips the other modules)
mvn -pl level-1-number-pipeline package -q

You should see no errors. The fat JAR is at:

level-1-number-pipeline/target/level-1-number-pipeline-1.0-SNAPSHOT-jar-with-dependencies.jar

If you see Could not resolve dependency org.apache.tez:tez-api:

Verify that tez.version matches the version in ~/.m2/repository/org/apache/tez/tez-api/
Re-run mvn install -DskipTests in your Tez clone

Step 3: Run

java -jar level-1-number-pipeline/target/level-1-number-pipeline-1.0-SNAPSHOT-jar-with-dependencies.jar

Expected output (log lines abbreviated):

TezClient started (local mode).
Submitting DAG...
[SinkProcessor] task=0  partialSum=9900

=== NumberPipeline Result ===
  Expected : 9900
  Actual   : 9900
  Result   : PASS

Note: You will see a large number of INFO log lines from the Tez framework. This is normal for local mode. The important lines are the ones from [SinkProcessor] and the final === Result === block.

Step 4: Read Every Source File

Before modifying anything, read each Java file carefully.

`GeneratorProcessor.java`

Key questions:

Which Tez interface does it implement?
Why is output.start() called before getWriter()? What happens if you remove it?
How does the processor know which range of numbers to generate? What Tez API provides this?
The key and value written are both the same integer n. Why? When would you want them to differ?

`MultiplierProcessor.java`

Key questions:

OrderedGroupedKVInput vs OrderedPartitionedKVOutput — which side is the input and which is the output? Why are they named differently?
Both input.start() and output.start() are called. What does input.start() actually trigger? (Hint: look at OrderedGroupedKVInput.start() in the Tez source.)
FACTOR = 2 is hardcoded. The Javadoc explains how to pass it via UserPayload. What is the size in bytes of an int encoded in a ByteBuffer?
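
For the last question, it can help to try the encoding outside Tez first. This is a minimal sketch using only `java.nio.ByteBuffer`; the class name `FactorPayload` is hypothetical. In the project, the encoded buffer would presumably be wrapped with `UserPayload.create(...)` on the DAG side and read back from the processor context, but that wiring is an assumption here and is not shown.

```java
import java.nio.ByteBuffer;

// JDK-only sketch of encoding one int as a payload; names are illustrative.
final class FactorPayload {

    /** Writes the factor into a buffer that is ready for reading. */
    static ByteBuffer encode(int factor) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES);
        buf.putInt(factor);
        buf.flip(); // switch from writing to reading: position = 0, limit = 4
        return buf;
    }

    /** Reads the factor with an absolute get, leaving the buffer's position untouched. */
    static int decode(ByteBuffer payload) {
        return payload.getInt(payload.position());
    }

    public static void main(String[] args) {
        ByteBuffer payload = encode(2);
        System.out.println(payload.remaining()); // prints: 4
        System.out.println(decode(payload));     // prints: 2
    }
}
```

The absolute read in `decode()` is a deliberate choice: a payload buffer may be shared or read-only, so the sketch avoids moving its position.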

`SinkProcessor.java`

Key questions:

What is the type of getContext().getCounters()?
findCounter(group, name) — what happens if the counter doesn't exist yet when first called?
There is only one Sink task (parallelism=1). If you changed it to 2, would the counter still be correct? Why?

`NumberPipelineDAG.java`

Key questions:

What does tez.local.mode=true actually change about task execution?
OrderedPartitionedKVEdgeConfig.newBuilder(keyClass, valueClass, partitionerClass) — what is HashPartitioner doing here, and where does the partition count come from?
dagClient.waitForCompletion() — does this block on the calling thread, or is it async?
EnumSet.of(StatusGetOpts.GET_COUNTERS) — why is this extra call needed? Why aren't counters always included in DAGStatus?

Step 5: Break It and Understand It

Make each change, run the JAR, observe the failure, then revert.

Break 1: Remove `output.start()`

In GeneratorProcessor.run(), comment out logicalOutput.start().

Expected: NullPointerException or IllegalStateException from the Tez runtime when getWriter() is called on an uninitialized output.

Why this matters: Tez I/O objects are lazily initialized. The start() method triggers buffer allocation, sort buffer setup, and (for inputs) the shuffle fetch. Forgetting start() is a common first patch mistake.
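
The failure mode can be imitated without Tez. The sketch below uses a hypothetical `LazyOutput` class (not a Tez type) whose buffer only exists after `start()` has run. It illustrates the lazy-initialization pattern under that assumption, not the actual `OrderedPartitionedKVOutput` implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a lazily initialized output; not a Tez class. */
final class LazyOutput {
    private List<Integer> buffer; // allocated only in start()

    /** Allocates the buffer; calling it twice is harmless. */
    void start() {
        if (buffer == null) {
            buffer = new ArrayList<>();
        }
    }

    /** Returns the buffer to write into, or fails if start() was skipped. */
    List<Integer> getWriter() {
        if (buffer == null) {
            throw new IllegalStateException("start() must be called before getWriter()");
        }
        return buffer;
    }

    public static void main(String[] args) {
        LazyOutput output = new LazyOutput();
        try {
            output.getWriter(); // simulates the commented-out start() call
        } catch (IllegalStateException e) {
            System.out.println("Failed as expected: " + e.getMessage());
        }
        output.start();
        output.getWriter().add(42);
        System.out.println("Records written after start(): " + output.getWriter().size());
    }
}
```

A real runtime may surface the problem as a `NullPointerException` instead, which is why the expected result above lists both exception types.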

Break 2: Set the wrong parallelism

Change sink parallelism from 1 to 3, run again.

Observe: does the result change? Is it still 9900? Why or why not?

Expected: the total counter is still 9900, because each Sink task emits a partial sum and the AM aggregates counters across all tasks automatically.
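
Why the total is independent of the number of sink tasks can be checked with plain arithmetic. The sketch below is a simulation, not Tez code: the class name `PartialSums` and the routing rule (value modulo task count) are assumptions chosen only to split the records across tasks.

```java
import java.util.stream.IntStream;

final class PartialSums {
    /** Sum of the doubled values (0, 2, ..., 198) assigned to one sink task. */
    static long partialSum(int task, int numTasks) {
        return IntStream.range(0, 100)
            .map(n -> n * 2)
            .filter(v -> v % numTasks == task) // assumed routing: value modulo task count
            .asLongStream()
            .sum();
    }

    /** Adds up the per-task partial sums, as counter aggregation would. */
    static long aggregate(int numTasks) {
        return IntStream.range(0, numTasks)
            .mapToLong(task -> partialSum(task, numTasks))
            .sum();
    }

    public static void main(String[] args) {
        System.out.println("1 sink task : " + aggregate(1)); // 9900
        System.out.println("3 sink tasks: " + aggregate(3)); // 9900
    }
}
```

Every record lands on exactly one task, so the partial sums always add up to the same total regardless of how the records are routed.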

Break 3: Use a constant key in the Generator

Change writer.write(new IntWritable(n), new IntWritable(n)) to writer.write(new IntWritable(0), new IntWritable(n)) (fixed key = 0).

Expected: all values route to the same Multiplier task (the one that owns partition 0). The other Multiplier task gets no work. The result is still 9900 (correct) but the work distribution is skewed. You can verify this by adding a counter in MultiplierProcessor that tracks how many records each task processed.

Why this matters: key-skew (many records with the same key) is one of the most common Tez/MapReduce performance problems. This exercise makes it visible.
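
The skew can be reproduced outside Tez. The sketch below applies the formula used by Hadoop's `HashPartitioner`, `(hash & Integer.MAX_VALUE) % numPartitions`, to 100 records. It assumes the key's hash code equals its integer value, and `KeySkewDemo` is a hypothetical class name.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

final class KeySkewDemo {
    /** Same formula as Hadoop's HashPartitioner: mask the sign bit, then take the modulus. */
    static int partition(int keyHash, int numPartitions) {
        return (keyHash & Integer.MAX_VALUE) % numPartitions;
    }

    /** Counts records per partition for keys produced by keyFor(n), n in [0, 100). */
    static int[] distribution(IntUnaryOperator keyFor, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int n = 0; n < 100; n++) {
            int key = keyFor.applyAsInt(n);
            // Assumption: the key hashes to its own integer value
            counts[partition(Integer.hashCode(key), numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("key = n : " + Arrays.toString(distribution(n -> n, 2))); // [50, 50]
        System.out.println("key = 0 : " + Arrays.toString(distribution(n -> 0, 2))); // [100, 0]
    }
}
```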

Step 6: Add a FilterProcessor (Exercise)

Open FilterProcessor.java. This is the skeleton for your exercise.

Your task: Insert a FilterProcessor between Multiplier and Sink that drops all values not divisible by 4, then verify the new expected sum.

Step 6a: Implement `FilterProcessor`

Add a private int threshold field.

In initialize(), read the threshold from UserPayload:

byte[] bytes = getContext().getUserPayload().deepCopyAsArray();
this.threshold = ByteBuffer.wrap(bytes).getInt();

In run(), replace if (true) with if (value.get() % threshold == 0).
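
The payload encoding can be tested in isolation before wiring it into the DAG. This is a minimal sketch using only `java.nio.ByteBuffer`; `PayloadCodec` is a hypothetical helper, not part of the project or of Tez.

```java
import java.nio.ByteBuffer;

final class PayloadCodec {
    /** Encodes an int as 4 big-endian bytes (ByteBuffer's default byte order). */
    static byte[] encode(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    /** Decodes the first 4 bytes back into an int, rejecting short payloads. */
    static int decode(byte[] bytes) {
        if (bytes == null || bytes.length < Integer.BYTES) {
            throw new IllegalArgumentException("payload must contain at least 4 bytes");
        }
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        byte[] payload = encode(4);
        System.out.println("bytes   = " + payload.length);  // 4
        System.out.println("decoded = " + decode(payload)); // 4
    }
}
```

The length check is worth copying into `initialize()`: a vertex created without a payload would otherwise fail with a less descriptive buffer exception.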

Step 6b: Update `NumberPipelineDAG.buildDAG()`

Vertex filter = Vertex.create("filter",
    ProcessorDescriptor.create(FilterProcessor.class.getName())
        .setUserPayload(UserPayload.create(
            // The cast keeps this compiling on Java 8, where flip() returns Buffer
            (ByteBuffer) ByteBuffer.allocate(4).putInt(4).flip())),  // threshold=4
    2);  // same parallelism as multiplier

// In the existing DAG builder chain, register the new vertex and replace
// the old edges with the new chain: generator → multiplier → filter → sink
.addVertex(filter)
.addEdge(Edge.create(generator,  multiplier, edgeConf.createDefaultEdgeProperty()))
.addEdge(Edge.create(multiplier, filter,     edgeConf.createDefaultEdgeProperty()))
.addEdge(Edge.create(filter,     sink,       edgeConf.createDefaultEdgeProperty()));

Step 6c: Calculate the new expected sum

After multiplying by 2, the values are: 0, 2, 4, 6, 8, …, 198. After filtering (keep only values divisible by 4): 0, 4, 8, 12, …, 196. Sum of {0, 4, 8, …, 196} = 4 * sum(0, 1, 2, …, 49) = 4 * (49*50/2) = 4 * 1225 = 4900.

Update NumberPipelineDAG.expectedSum() to return 4900 and verify PASS.
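
The arithmetic can be double-checked by brute force. The sketch below assumes, as the value list above implies, that the generator emits the integers 0 through 99; `ExpectedSum` is a hypothetical helper class.

```java
import java.util.stream.IntStream;

final class ExpectedSum {
    /** Multiplies 0..99 by factor, keeps values divisible by divisor, and sums them. */
    static int expectedSum(int factor, int divisor) {
        return IntStream.range(0, 100)
            .map(n -> n * factor)
            .filter(v -> v % divisor == 0)
            .sum();
    }

    public static void main(String[] args) {
        System.out.println("no filter      : " + expectedSum(2, 1)); // 9900
        System.out.println("divisible by 4 : " + expectedSum(2, 4)); // 4900
    }
}
```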

Step 7: Connect This to the Tez Source

Every class you used in this project maps to a real Tez module.

Class	Module	Source path
`AbstractLogicalIOProcessor`	`tez-api`	`tez-api/src/main/java/org/apache/tez/runtime/api/AbstractLogicalIOProcessor.java`
`OrderedPartitionedKVOutput`	`tez-runtime-library`	`tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java`
`OrderedGroupedKVInput`	`tez-runtime-library`	`tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java`
`OrderedPartitionedKVEdgeConfig`	`tez-runtime-library`	`tez-runtime-library/src/main/java/org/apache/tez/runtime/library/conf/OrderedPartitionedKVEdgeConfig.java`
`TezClient`	`tez-api`	`tez-api/src/main/java/org/apache/tez/client/TezClient.java`
`TezConfiguration`	`tez-api`	`tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java`

After running the pipeline successfully, open each source file above. For each one:

Find the method you called
Read its implementation — what does it actually do?
Find the unit test class for that file (usually in src/test/java/ under the same package)

This pipeline uses OrderedPartitionedKVOutput. Search the Tez JIRA for issues in this component to find real bugs and improvements you could work on:

project = TEZ AND component = "runtime-library" AND status in (Open, Patch Available)
ORDER BY priority DESC

Also search specifically:

project = TEZ AND text ~ "OrderedPartitionedKVOutput" AND status in (Open, "Patch Available")

For each open issue you find, ask yourself:

Do you understand what the bug description is saying?
Can you locate the relevant code in the source?
Is there a failing test, or do you need to write one?

Expected Deliverables

Project compiles without errors
Running the JAR prints PASS with result 9900
You can answer all questions in Step 4 (with file:line references to the source)
You have run all three "Break It" experiments and understand each failure
FilterProcessor is implemented and the pipeline prints PASS with result 4900
You have opened all 6 source files from the "Connect to Source" table
You have found at least 2 open JIRA issues in the runtime-library component

Level 2: Apache Contributor Onboarding

This level teaches you how the Apache open-source contribution machine works — not in the abstract, but in the specific context of Apache Tez. You will set up your tooling, understand the community structure, learn the patch workflow, and submit your first meaningful change.

Learning Objectives

By the end of Level 2 you must be able to:

Subscribe to dev@tez.apache.org and read a week's worth of threads
Navigate Apache Tez JIRA to find and evaluate open issues
Describe the full lifecycle of a patch: from JIRA issue to committed code
Generate a unified diff patch from a Git branch
Run Apache checkstyle and resolve all violations before submitting a patch
Write a JIRA comment that adds technical value
Find any class in the Tez repository in under 30 seconds

Apache Open-Source Contribution Fundamentals

Apache projects operate differently from GitHub-native open-source projects. The primary communication channels are mailing lists, not GitHub issues or Slack. Patches are attached to JIRA issues, not submitted as GitHub pull requests (some projects accept GitHub PRs as a convenience, but Tez still prefers a JIRA-based workflow).

The Contribution Hierarchy

PMC (Project Management Committee)
  └─ Committers (can commit directly)
       └─ Contributors (submit patches via JIRA)
            └─ Everyone else (can file issues, ask questions)

Becoming a contributor means submitting patches. Becoming a committer means sustained, high-quality contributions over time that earn the trust of existing committers.

The Patch Lifecycle

1. Find or file a JIRA issue
2. Leave a comment: "I'm looking into this"
3. Make changes on a local branch
4. Run: mvn test -pl <module> -am  (must pass)
5. Run: mvn checkstyle:check -pl <module>  (must pass)
6. Generate a patch: git diff origin/master > TEZ-NNNN.patch
7. Attach the patch to the JIRA issue
8. Set JIRA status to "Patch Available"
9. Wait for review — a committer will comment or set "Reviewed" or "Not a bug"
10. Address feedback → upload v2 patch → repeat
11. Committer commits the patch (you cannot commit yourself until you are a committer)

Required Reading

#	Resource	What to extract
1	Apache Tez Contributing	The official contribution guide
2	Apache JIRA for Tez	Browse recent issues to understand what active work looks like
3	`dev@tez.apache.org` archives	Read 2 weeks of mailing list threads at https://lists.apache.org/list.html?dev@tez.apache.org
4	`src/config/checkstyle.xml` in the Tez repo	What style rules are enforced
5	Apache How It Works	Meritocracy, governance, why Apache operates the way it does
6	Any 3 recently closed Tez patches	Read the JIRA comment thread — observe how committers give feedback

Source Code Areas to Inspect

File	Why
`pom.xml` (root)	Module structure, dependency management, build profiles
`tez-dag/pom.xml`	Module-level dependency declarations
`src/config/checkstyle.xml`	Style rules enforced on every patch
`src/config/checkstyle-suppressions.xml`	Suppressions — which files are exempt and why
`.gitignore`	What is excluded from version control
Any recently committed file	Read the commit message format

Apache Tez JIRA Structure

Issue Types You Will Encounter

Type	Description
Bug	A defect in behavior
Improvement	An enhancement to existing functionality
New Feature	Something that does not exist yet
Task	Non-code work (documentation, release, etc.)
Sub-task	Part of a larger issue
Test	Adding or fixing a test

Priority Levels

Priority	Meaning
Blocker	Prevents a release
Critical	Significant data loss or correctness risk
Major	Important but not release-blocking
Minor	Small issue or improvement
Trivial	Typo, cosmetic, minor cleanup

For Level 2 contributors: Only work on Minor and Trivial issues. Do not pick up Major or higher issues until you have at least 3 accepted patches in the project.

Component Labels

JIRA issues are labeled by component. The most relevant for early contributors:

Component	What it covers
`Tez-DAG`	DAG execution, AM, state machines
`Tez-Runtime`	I/O library, shuffle
`Tez-API`	Public API — high stability required
`Documentation`	Docs, Javadoc, website
`Tests`	Test additions and fixes

Mailing List Etiquette

# To subscribe, send an empty email to:
dev-subscribe@tez.apache.org
# You will receive a confirmation email — reply to it

What to Read First

Do not post until you have read at least two weeks of threads. Understand:

What issues are currently being discussed
How committers respond to patches
The tone and technical depth expected
What questions get quick responses vs. what gets ignored

How to Ask a Question

Good question format:

Subject: [QUESTION] Understanding VertexImpl initialization flow

Hi dev@,

I'm trying to understand the initialization sequence in VertexImpl.
Specifically, I'm looking at the transition from INITIALIZING to INITED
in VertexImpl.java around line 1234.

The code calls rootInputInitializer() before transitioning, but I'm unclear
on what happens if an initializer throws an unchecked exception.

I've read the JIRA issue TEZ-XXXX and the associated commit, but I still
have this question. Can anyone point me to the relevant code path?

Thanks,
[Your name]

What makes this question good:

Specific class and approximate line number
State machine terminology used correctly
References prior research
Concrete question, not "how does Tez work?"

What makes a question bad:

"How do I contribute?" — this is answered in the contributing guide
"Can you explain how shuffle works?" — too broad; you should read the code first
Posting before subscribing and reading archives

Apache Checkstyle

Tez enforces checkstyle on every patch. A patch that fails checkstyle will not be committed.

Running Checkstyle

# Check a specific module
mvn checkstyle:check -pl tez-dag

# Check all modules (slow)
mvn checkstyle:check

# Generate a report, then inspect the violations it lists
mvn checkstyle:checkstyle -pl tez-dag
open tez-dag/target/checkstyle-result.xml   # macOS; use xdg-open on Linux

Common Violations

Violation	Cause	Fix
`UnusedImports`	Import statement for an unused class	Remove the import
`LineLength`	Line exceeds 100 characters	Break the line
`WhitespaceAround`	Missing space around operator	Add space
`LeftCurly`	`{` on wrong line	Move to end of previous line
`JavadocMethod`	Public method missing Javadoc	Add `/** ... */` block
`FinalClass`	Utility class not declared `final`	Add `final` modifier
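
As an illustration, the class below is written to avoid every violation in the table. `IdFormats` is a hypothetical utility class, and the exact rules (line length, Javadoc scope) depend on the checkstyle configuration in the Tez repository, so treat this as a sketch of the style rather than a guarantee of a clean run.

```java
/**
 * Hypothetical utility class used only to illustrate the checks above.
 */
final class IdFormats { // FinalClass: utility class is declared final

    private IdFormats() {
        // Utility class; no instances.
    }

    /**
     * Pads a numeric ID with leading zeros to the given width.
     *
     * @param id the non-negative ID to format
     * @param width the minimum number of digits
     * @return the zero-padded decimal string
     */
    public static String pad(int id, int width) { // LeftCurly: brace ends the line
        // WhitespaceAround: spaces on both sides of each + operator
        return String.format("%0" + width + "d", id);
    }

    public static void main(String[] args) {
        System.out.println(pad(0, 6)); // 000000
    }
}
```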

JIRA Issue Categories for Level 2 Contributors

In addition to Level 1 categories, you can now attempt:

Test improvements — adding tests for uncovered paths you identify from reading the code
Logging improvements — adding LOG.debug() statements that would help diagnose issues
Checkstyle fixes — especially in modules you have been reading

Discipline: The quality of your first 5 patches determines how quickly you build credibility in the community. A patch with a checkstyle violation, compilation error, or test failure will be rejected immediately. Every patch must be verified locally before upload.

Deliverables

Subscribed to dev@tez.apache.org and can describe two active discussions
Apache JIRA account created
One JIRA issue identified, studied, and commented on (even if not yet working on it)
Lab 2.1 completed: module-by-module walkthrough documented
Lab 2.2 completed: patch generated, checkstyle passing, JIRA description written
Understanding of the difference between a Minor and a Trivial issue

Common Mistakes

Mistake	Consequence	Fix
Opening a GitHub PR instead of attaching a patch to JIRA	PR will likely be ignored or closed	Use JIRA; attach a `.patch` file
Submitting a patch that changes formatting in unrelated lines	Noise in the diff; committers reject it	Change only the lines you meant to change
Claiming an issue without leaving a JIRA comment	Another contributor may do the same work	Comment "I am investigating this" before starting
Submitting a patch without running tests	Immediate rejection	Test everything locally first
Writing a JIRA comment that just says "fix attached"	Unhelpful; committers will ask for explanation	Explain what was wrong and what the fix does
Using `git commit -m "fix"`	Unprofessional commit message	Format: `TEZ-NNNN. Short description of change.`
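
The commit message format in the last row is easy to check mechanically. The sketch below is a hypothetical helper (`CommitMessageCheck` is not part of Tez tooling) that validates the first line of a commit message against the `TEZ-NNNN. Short description` pattern.

```java
import java.util.regex.Pattern;

final class CommitMessageCheck {
    // "TEZ-", one or more digits, a period, a space, then a non-empty description
    private static final Pattern FORMAT = Pattern.compile("TEZ-\\d+\\. \\S.*");

    /** Returns true if the subject line follows the TEZ-NNNN. Description format. */
    static boolean isValid(String subject) {
        return subject != null && FORMAT.matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("TEZ-1234. Improve Javadoc for Vertex.addDataSink()")); // true
        System.out.println(isValid("fix"));                                                // false
    }
}
```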

How to Verify Success

# Your patch generates cleanly
git diff origin/master > /tmp/TEZ-NNNN.001.patch
git diff --stat origin/master        # should list only the files you meant to change
head -20 /tmp/TEZ-NNNN.001.patch     # spot-check the first hunk

# Checkstyle passes on the module you changed
mvn checkstyle:check -pl <changed-module>

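# Tests pass (-DfailIfNoTests=false stops upstream modules built by -am from failing
# because they contain no test matching -Dtest; newer Surefire versions use
# -Dsurefire.failIfNoSpecifiedTests=false instead)
mvn test -pl <changed-module> -am -Dtest=<RelevantTestClass> -DfailIfNoTests=false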

Patch Profile: Level 2 Graduate

Patch type	Example	Test requirement
Javadoc improvement	Add missing `@throws` annotation to a method	None
Log statement improvement	Add context to an existing LOG.warn that is unhelpful	Run the affected test class
Checkstyle fix	Fix unused import across multiple files in one module	Run `mvn checkstyle:check -pl <module>`
Test comment improvement	Add test setup comments explaining what `MockAppContext` does	Run the test class

You are not ready to submit: behavioral code changes, new features, bug fixes in state machines or shuffle. Continue to Level 3.

Lab 2.1: Navigate the Repository Structure

Background

Before writing a single line of code, a new contributor must be able to navigate the repository with the same fluency as a committer. This lab builds that fluency: you will walk through every module, learn the Maven multi-module structure, and practice locating any class in under 30 seconds.

Repository Root Layout

apache/tez/
├── pom.xml                     # Root POM — module declarations, dep management
├── tez-api/                    # Public client API
├── tez-common/                 # Utilities shared across modules
├── tez-dag/                    # DAG AppMaster — the core of Tez
├── tez-examples/               # Example DAG implementations
├── tez-ext-service-tests/      # External service integration tests
├── tez-mapreduce/              # MapReduce compatibility layer
├── tez-plugins/                # Optional plugins (ATSv2, etc.)
├── tez-runtime-internals/      # Internal runtime interfaces
├── tez-runtime-library/        # I/O processors, shuffle
├── tez-tests/                  # Integration test suite
├── tez-tools/                  # Performance analysis utilities
├── src/
│   └── config/
│       ├── checkstyle.xml      # Style enforcement rules
│       └── checkstyle-suppressions.xml
└── CHANGES.txt                 # Release changelog

Module-by-Module Walkthrough

`tez-api` — The Public Contract

Everything in tez-api is part of the public API that application developers use. Changes here must be backward-compatible or explicitly versioned. This is the highest-stability module.

Key packages:

Package	Contents
`org.apache.tez.client`	`TezClient` — session and DAG submission
`org.apache.tez.dag.api`	`DAG`, `Vertex`, `Edge`, `TezConfiguration`
`org.apache.tez.dag.api.client`	`DAGClient`, `DAGStatus` — monitoring and control
`org.apache.tez.dag.api.event`	Events emitted by the AM to task processors
`org.apache.tez.dag.api.records`	Protocol Buffer message classes (generated)
`org.apache.tez.runtime.api`	`AbstractLogicalIOProcessor`, `Input`, `Output` interfaces

Exercise:

# Count Java source files in tez-api (a rough proxy for the size of the API surface)
find tez-api/src/main/java -name "*.java" | wc -l

# Find all classes that extend AbstractLogicalIOProcessor
grep -rl "extends AbstractLogicalIOProcessor" tez-runtime-library/src/

`tez-dag` — The Application Master

This is the largest and most complex module. It implements the DAG AppMaster that runs in a YARN container and orchestrates vertex and task execution.

Key packages:

Package	Contents
`org.apache.tez.dag.app`	`DAGAppMaster` — the main AM class
`org.apache.tez.dag.app.dag`	`DAG`, `Vertex`, `Task`, `TaskAttempt` state machine interfaces
`org.apache.tez.dag.app.dag.impl`	`DAGImpl`, `VertexImpl`, `TaskImpl`, `TaskAttemptImpl`
`org.apache.tez.dag.app.rm`	YARN resource management integration
`org.apache.tez.dag.app.launcher`	Container launch logic
`org.apache.tez.dag.app.web`	AM web UI servlets
`org.apache.tez.dag.history`	Timeline history event handling

Exercise:

# Count lines in DAGImpl (one of the most complex classes)
wc -l tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java

# Count state machine transitions in VertexImpl
grep "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | wc -l
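
To make the `addTransition` count meaningful, it helps to picture what one such call registers. The sketch below is a deliberately simplified, hypothetical transition table. It is not the state machine factory that Tez uses, and the states and events are illustrative only.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical, simplified transition table; not the Tez/YARN state machine API. */
final class TinyStateMachine {
    enum State { NEW, INITIALIZING, INITED, RUNNING }
    enum Event { INIT, INIT_DONE, START }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    /** Registers one legal transition; each call adds one row to the table. */
    TinyStateMachine addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
        return this;
    }

    /** Applies an event, failing fast if no transition was registered for it. */
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        TinyStateMachine vertex = new TinyStateMachine()
            .addTransition(State.NEW, State.INITIALIZING, Event.INIT)
            .addTransition(State.INITIALIZING, State.INITED, Event.INIT_DONE)
            .addTransition(State.INITED, State.RUNNING, Event.START);
        System.out.println(vertex.handle(Event.INIT));      // INITIALIZING
        System.out.println(vertex.handle(Event.INIT_DONE)); // INITED
    }
}
```

Each `addTransition` line in `VertexImpl` plays the same role as one row here, so the grep count approximates the number of legal (state, event) pairs a reviewer has to reason about.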

`tez-runtime-library` — I/O and Shuffle

The I/O module implements the actual data reading/writing done inside task containers. Shuffle happens here.

Key packages:

Package	Contents
`org.apache.tez.runtime.library.input`	`OrderedGroupedKVInput`, `UnorderedKVInput`, etc.
`org.apache.tez.runtime.library.output`	`OrderedPartitionedKVOutput`, `UnorderedKVOutput`, etc.
`org.apache.tez.runtime.library.common.shuffle`	Shuffle fetch infrastructure
`org.apache.tez.runtime.library.common.sort`	External sort implementation
`org.apache.tez.runtime.library.common.writers`	Spilling KV writers

Exercise:

# Find all Input implementations
find tez-runtime-library/src/main/java -name "*Input*.java" | grep -v test

# Find the shuffle Fetcher
find tez-runtime-library/src/main/java -name "Fetcher.java"
wc -l $(find tez-runtime-library/src/main/java -name "Fetcher.java")

`tez-common` — Shared Utilities

Contains utilities used by multiple modules that do not fit in tez-api:

TezUtils — configuration serialization/deserialization
TezTaskID, TezVertexID, TezDAGID — ID types
ReflectionUtils — Tez-specific reflection helpers
VersionUtils — version compatibility checks

`tez-mapreduce` — MapReduce Compatibility

Allows MapReduce jobs to run on Tez without code changes. Contains MRInput, MROutput, and the mapper/reducer wrapping infrastructure.

`tez-examples` — Reference Implementations

Four example DAGs:

Class	What it demonstrates
`OrderedWordCount`	3-vertex pipeline, ordered shuffle, sort by value
`IntersectExample`	2-way join using broadcast edge
`JoinDataGen`	Data generation for the join example
`FilterLinesByWord`	Simple filter with configurable parallelism

`tez-tests` — Integration Test Suite

Contains tests that run against MiniTezCluster — a full in-process Tez + YARN + HDFS cluster. These tests are slow (minutes each) but provide end-to-end coverage.

Key test class: TestMiniTezSessionWithLocalMode — runs example DAGs in local mode.

Maven Structure Deep Dive

Root `pom.xml`

Read the root pom.xml to understand:

Module declarations (<modules> section) — the build order
Dependency management (<dependencyManagement>) — canonical versions for all deps
Plugin management (<pluginManagement>) — canonical plugin configurations
Build profiles — hadoop-2 vs hadoop-3, dist profile for assembly

Exercise:

# What Hadoop version does Tez build against by default?
grep -A2 "hadoop.version" pom.xml | head -5

# What Java version is required?
grep "maven.compiler" pom.xml

# How many external dependencies does the root pom manage?
grep "<artifactId>" pom.xml | wc -l

Module `pom.xml` Structure

Each module follows the same pattern:

<parent>
  <groupId>org.apache.tez</groupId>
  <artifactId>tez</artifactId>
  <version>0.10.x-SNAPSHOT</version>
</parent>

<artifactId>tez-dag</artifactId>
<name>Tez DAG</name>

<dependencies>
  <!-- Module-specific dependencies -->
</dependencies>

Modules declare their inter-dependencies explicitly. This is how Maven knows the build order.

Exercise:

# What modules does tez-dag depend on?
grep -A3 "<dependency>" tez-dag/pom.xml | grep "tez-" | grep "artifactId"

# What does tez-runtime-library depend on?
grep -A3 "<dependency>" tez-runtime-library/pom.xml | grep "tez-" | grep "artifactId"
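
How a build order falls out of declared dependencies can be sketched with a small topological sort. The dependency map below is illustrative only (verify the real edges with the grep commands above), and `BuildOrder` is a hypothetical class that has nothing to do with Maven's own reactor implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BuildOrder {
    /** Returns modules ordered so that each appears after everything it depends on. */
    static List<String> order(Map<String, List<String>> deps) {
        Set<String> done = new LinkedHashSet<>();
        for (String module : deps.keySet()) {
            visit(module, deps, done, new LinkedHashSet<>());
        }
        return new ArrayList<>(done);
    }

    /** Depth-first visit: emit all dependencies first, then the module itself. */
    private static void visit(String module, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress) {
        if (done.contains(module)) {
            return;
        }
        if (!inProgress.add(module)) {
            throw new IllegalStateException("Dependency cycle at " + module);
        }
        for (String dep : deps.getOrDefault(module, Collections.emptyList())) {
            visit(dep, deps, done, inProgress);
        }
        inProgress.remove(module);
        done.add(module);
    }

    public static void main(String[] args) {
        // Illustrative edges only; not read from the real pom.xml files
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("tez-dag", Arrays.asList("tez-api", "tez-common", "tez-runtime-internals"));
        deps.put("tez-runtime-internals", Arrays.asList("tez-api", "tez-common"));
        deps.put("tez-common", Arrays.asList("tez-api"));
        System.out.println(order(deps));
    }
}
```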

Finding Classes Quickly

By Name

find . -name "VertexImpl.java"
find . -name "Fetcher.java"
find . -name "TestDAGImpl.java"

By Content

# Find the class that defines TEZ_LOCAL_MODE
grep -rl "TEZ_LOCAL_MODE" --include="*.java" .

# Find all non-test classes that declare a state machine
grep -rl "StateMachineFactory" --include="*.java" . | grep -v test

In IntelliJ

Navigate to class: ⌘ O (macOS) — type class name, supports wildcards
Navigate to file: ⌘ ⇧ O — type file name
Find usages: ⌥ F7 — shows all places a class/method is used
Go to implementation: ⌘ ⌥ B — jumps from interface to implementation

After completing this lab, time yourself on each:

Task	Target time
Find `DAGImpl.java`	< 10 seconds
Find `TezConfiguration.TEZ_LOCAL_MODE` declaration	< 20 seconds
Find all tests for `VertexImpl`	< 30 seconds
Identify which module handles shuffle fetch retry	< 60 seconds
Find the class that submits a DAG from client to AM	< 60 seconds

If any take longer, repeat the exercises in this lab.

Expected Output

By end of this lab you should have notes documenting:

The line count of VertexImpl.java and DAGImpl.java
The number of state machine transitions in VertexImpl
The names of all 4 example DAG classes
The Hadoop version Tez builds against
Which module handles shuffle (your own words, not copy-pasted)

Stretch Goals

Generate the full module dependency graph:

mvn dependency:tree -pl tez-dag -am | grep -E '(\+|\\)- ' | head -30

Find all Protocol Buffer definition files (.proto):
```
find . -name "*.proto" | sort
```
For each, identify which module it belongs to and what messages it defines.
Read tez-api/src/main/proto/DAGApiRecords.proto completely. Identify which messages correspond to Java classes you have already read.

Lab 2.2: Prepare a Patch Using Apache Practices

Background

A "patch" in Apache open-source culture means a unified diff file attached to a JIRA issue. This lab walks you through the complete workflow: finding a safe change to make, preparing the patch, verifying it, and writing the JIRA description.

This lab uses a real but trivial change as the vehicle — a Javadoc improvement in tez-api. Trivial changes are intentional: the goal is to master the workflow, not to write impressive code.

The Apache Git Patch Workflow

Apache Tez development uses a linear history on the master branch (some other Apache projects call their main branch trunk). The standard contributor workflow:

origin/master  (read-only for non-committers)
      |
      ↓ checkout
local/master
      |
      ↓ branch
local/TEZ-NNNN
      |
      ↓ make changes
      ↓ mvn test (pass)
      ↓ mvn checkstyle:check (pass)
      ↓ git diff origin/master > TEZ-NNNN.001.patch
      |
      → Attach to JIRA

You never push your branch to Apache. You generate a diff and attach it.

Step-by-Step Tasks

Step 1: Set Up Your Working Branch

cd /path/to/tez

# Always start from a clean, up-to-date master
git fetch origin
git checkout master
git merge origin/master

# Create a branch named after the JIRA issue you are working on
# Use TEZ-0000 as a placeholder for this lab
git checkout -b TEZ-0000-javadoc-tezvertex

Verify you are on the new branch:

git branch
# * TEZ-0000-javadoc-tezvertex
#   master

Step 2: Find a Target for Your Change

Open tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java.

Look for public methods that:

Have no Javadoc, or
Have a @param tag whose description is empty or a placeholder such as TODO, or
Have a @return tag missing from a non-void method

A useful starting point:

# Find methods with empty or missing Javadoc in tez-api
javadoc -private -sourcepath tez-api/src/main/java \
  org.apache.tez.dag.api 2>&1 | grep "no comment"

Or manually: open Vertex.java in IntelliJ, look at the addDataSink() method. If it lacks a @param description for dataSink, that is your target.

Step 3: Make the Change

Add or improve the Javadoc for the method you identified. Follow this format exactly:

/**
 * Adds a data sink to this vertex. The sink will receive the output
 * of this vertex after all tasks complete.
 *
 * @param outputName
 *          the name used to identify this sink in the DAG; must be unique
 *          within this vertex
 * @param dataSink
 *          the {@link DataSinkDescriptor} defining the sink type and
 *          configuration
 * @return this {@link Vertex} instance (for method chaining)
 * @throws IllegalStateException if the vertex has already been added to a {@link DAG}
 */
public Vertex addDataSink(String outputName, DataSinkDescriptor dataSink) {

Rules for Apache Javadoc style:

First sentence is a brief third-person description with no explicit subject ("Adds a…", not "This method adds a…" and not the imperative "Add a…")
Multi-line @param descriptions start on the line after the tag, with each description line indented 10 spaces after the asterisk
Use {@link ClassName} for all class references
Use {@code value} for code literals and parameter names in prose
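
Applied together, the rules look like this. `NamedSink` is a hypothetical class written only to demonstrate the comment style; it does not exist in Tez.

```java
import java.util.Objects;

/** Hypothetical class used only to demonstrate the Javadoc rules above. */
final class NamedSink {
    private final String name;

    NamedSink(String name) {
        this.name = Objects.requireNonNull(name, "name");
    }

    /**
     * Returns a copy of this sink that carries a different name.
     *
     * @param newName
     *          the replacement name; must not be {@code null} or empty
     * @return a new {@link NamedSink} whose name is {@code newName}
     * @throws IllegalArgumentException if {@code newName} is empty
     */
    NamedSink withName(String newName) {
        Objects.requireNonNull(newName, "newName");
        if (newName.isEmpty()) {
            throw new IllegalArgumentException("newName must not be empty");
        }
        return new NamedSink(newName);
    }

    /**
     * Returns the name of this sink.
     *
     * @return the sink name, never {@code null}
     */
    String name() {
        return name;
    }
}
```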

Step 4: Verify Compilation

mvn compile -pl tez-api -q

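Expected: a zero exit status and no ERROR lines. The -q flag suppresses INFO-level output, including the BUILD SUCCESS banner; drop -q if you want to see it.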

Step 5: Run Checkstyle

mvn checkstyle:check -pl tez-api

Expected: BUILD SUCCESS. If there are violations, fix them before continuing.

Common Javadoc-specific violations:

JavadocStyle — Javadoc comment does not end with a period
JavadocMethod — @param or @return tag is missing
JavadocVariable — public field missing Javadoc

Step 6: Run the Relevant Tests

mvn test -pl tez-api -q

Expected: a zero exit status and no ERROR lines (the -q flag hides the BUILD SUCCESS banner). Even a pure Javadoc change requires a test run, because checkstyle runs as part of the test phase in some configurations.

Step 7: Generate the Patch

# Verify what you changed
git diff

# The diff should show only the lines you intentionally changed
# No whitespace changes, no unrelated files

# Generate the patch file
git diff origin/master > /tmp/TEZ-0000.001.patch

# Inspect it
cat /tmp/TEZ-0000.001.patch

The patch file should:

Start with diff --git a/tez-api/...
Show exactly the lines you added/removed (prefixed with +/-)
Contain no changes to files you did not intend to modify

If the patch is longer than expected, run git status to find unexpected changes and use git checkout -- <file> to revert them.
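
For larger patches, listing the touched files mechanically is faster than reading the whole diff. The sketch below is a hypothetical helper (`PatchFiles` is not part of any Tez tooling) that extracts the path from each `diff --git` header. It assumes paths do not themselves contain the sequence " b/".

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class PatchFiles {
    /** Collects the b/ path from every "diff --git a/<path> b/<path>" header line. */
    static Set<String> touchedFiles(String patchText) {
        Set<String> files = new LinkedHashSet<>();
        for (String line : patchText.split("\\R")) {
            if (line.startsWith("diff --git ")) {
                int idx = line.lastIndexOf(" b/");
                if (idx >= 0) {
                    files.add(line.substring(idx + 3)); // skip the " b/" prefix
                }
            }
        }
        return files;
    }

    public static void main(String[] args) {
        String patch = String.join("\n",
            "diff --git a/tez-api/src/Example.java b/tez-api/src/Example.java",
            "--- a/tez-api/src/Example.java",
            "+++ b/tez-api/src/Example.java",
            "@@ -1 +1 @@",
            "-old line",
            "+new line");
        System.out.println(touchedFiles(patch)); // [tez-api/src/Example.java]
    }
}
```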

Step 8: Write the JIRA Description

For the JIRA issue you would create for this patch, write:

Summary line format:

TEZ-0000. Improve Javadoc for Vertex.addDataSink()

Description format:

Problem:
The addDataSink() method in Vertex.java has no @param documentation for the
'dataSink' parameter. This makes it harder for new users to understand the
expected input without reading the implementation.

Fix:
Add complete @param, @return, and @throws Javadoc for addDataSink().

Testing:
mvn test -pl tez-api  (all existing tests pass)
mvn checkstyle:check -pl tez-api  (no violations)

Step 9: Review the Patch as a Committer Would

Before attaching a patch, ask yourself:

Does the patch contain only the changes described in the JIRA description?
Does it pass mvn test -pl <module> locally?
Does it pass mvn checkstyle:check -pl <module>?
Is the commit message format correct? (TEZ-NNNN. Short description.)
Is there a clear explanation in the JIRA description of what was wrong and what was fixed?

If any answer is "no", fix it before uploading.

Common Mistakes

Mistake	How to detect	Fix
Patch includes unrelated formatting changes	`git diff` shows hundreds of lines	`git checkout -- <unintended-file>`
Patch modifies generated code	Proto-generated files in the diff	Revert generated files; only change source
Patch applies only to a non-`master` branch	`git diff origin/master` shows changes you did not make	Rebase your branch onto current `master`
Checkstyle violation in unchanged line	`mvn checkstyle:check` fails in a line you did not write	You must fix it anyway — it is in your patch
Test fails on unrelated module	Running all tests surfaces a pre-existing failure	Confirm by running on a clean checkout; note the existing failure in JIRA

JIRA Status Workflow

After attaching your patch:

Set the JIRA status to "Patch Available"
Add a comment: "Patch attached. Tested with mvn test -pl tez-api and mvn checkstyle:check -pl tez-api, both pass."
Wait for a committer to review — do not ping on the mailing list immediately

If no response in 2 weeks, it is acceptable to send one polite reminder to dev@tez.apache.org:

Subject: [REMINDER] TEZ-NNNN patch available for review

Hi dev@,

Friendly reminder that TEZ-NNNN has a patch attached. Any feedback welcome.
https://issues.apache.org/jira/browse/TEZ-NNNN

Thanks

Expected Output

At the end of this lab you have:

A local branch TEZ-0000-javadoc-tezvertex with a Javadoc change
A passing test run: mvn test -pl tez-api
A passing checkstyle run: mvn checkstyle:check -pl tez-api
A patch file at /tmp/TEZ-0000.001.patch with only the intended diff
A written JIRA description (even if not submitted) in the format above

Stretch Goals

Find a real Minor or Trivial open issue in Apache Tez JIRA that has been open for more than 6 months with no patch. Leave a JIRA comment expressing interest.
Attempt the same patch workflow with a real issue:
- Use git checkout -b TEZ-<real-number>-<short-description> for the branch name
- Use the real JIRA number in the patch filename: TEZ-NNNN.001.patch
Read three recently committed Tez patches by browsing JIRA issues with status "Resolved". For each, read the complete comment thread to understand the feedback cycle and how many patch revisions were required.
Generate a git log view that shows only your branch's commits (this requires committing your change locally first):
```
git log origin/master..HEAD --oneline
```
This is what a committer sees when reviewing your work.

Lab 2.3 — Fix It: `NullPointerException` in `TezTaskAttemptID.fromString`

Lab type: Fix-It — reproduce → locate → write failing test → patch → verify → format patch
Estimated time: 90–120 min
Tez component: tez-common → org.apache.tez.dag.records.TezTaskAttemptID

Background

TezTaskAttemptID is the primary key that links a task attempt to its vertex, DAG, and application. Its static fromString method parses a serialised ID like:

attempt_1609459200000_0001_1_00_000000_0

In the Tez codebase the parse path for certain malformed inputs has historically thrown an unguarded NullPointerException rather than a descriptive IllegalArgumentException. The usual culprits are call sites that skip validation: invoking a method on a null input, indexing past the end of the array returned by String.split(), or letting a NumberFormatException from Integer.parseInt() propagate.
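
These failure modes can be seen in isolation with a deliberately naive parser. The class below is a hypothetical illustration written for this lab. It is not the Tez implementation, and it extracts only the trailing attempt number.

```java
/** Hypothetical naive parser; NOT the actual TezTaskAttemptID code. */
final class NaiveAttemptIdParser {
    /** Parses the trailing attempt number without validating the input. */
    static int attemptNumber(String id) {
        String[] parts = id.split("_");    // NullPointerException if id is null
        return Integer.parseInt(parts[6]); // ArrayIndexOutOfBoundsException if too few parts,
                                           // NumberFormatException if the token is not numeric
    }

    /** Returns the simple name of the exception thrown for the input, or "none". */
    static String failureFor(String id) {
        try {
            attemptNumber(id);
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(failureFor("attempt_1609459200000_0001_1_00_000000_0")); // none
        System.out.println(failureFor(null));                           // NullPointerException
        System.out.println(failureFor("attempt_1609459200000_0001_1")); // ArrayIndexOutOfBoundsException
        System.out.println(failureFor("attempt_1_2_3_4_5_x"));          // NumberFormatException
    }
}
```

None of the three exceptions tells the caller which input was rejected or why, which is the gap a descriptive `IllegalArgumentException` closes.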

This lab walks the complete Apache contribution workflow for such a bug:

Reproduce the crash in a test
Read the source to understand why it crashes
Apply the minimal fix
Verify tests pass and checkstyle is clean
Produce a .patch file ready for JIRA upload

Step 1 — Locate the Source File

cd ~/tez-src         # your local Tez clone from Lab 1.1
find . -name "TezTaskAttemptID.java" | head -5

Expected path:

./tez-common/src/main/java/org/apache/tez/dag/records/TezTaskAttemptID.java

Open the file and read the fromString method in full.

Questions to answer before continuing

#	Question
1	What does `fromString` call first — `TezDAGID.fromString`, `TezVertexID.fromString`, or does it parse raw tokens?
2	What happens if the input string is `null`? Is there an explicit null guard?
3	What exception type does the method declare in its signature (`throws` clause)?
4	Find the `split("_")` call(s). If the split produces fewer parts than expected, what line would throw?
5	Is there a sibling method `toString()`? What is the canonical string format it produces?

Step 2 — Find the Existing Tests

find . -name "TestTezTaskAttemptID.java" | head -5

Expected path:

./tez-common/src/test/java/org/apache/tez/common/TestTezTaskAttemptID.java

Open it.

Questions

#	Question
1	How many `fromString` test cases already exist?
2	Is there a test for a `null` input?
3	Is there a test for a string with too few underscore-separated parts?
4	What assertion style does the file use — JUnit 4 `@Test(expected=...)` or `try/catch`?

Step 3 — Reproduce the Bug

Add the following test to TestTezTaskAttemptID.java inside the existing test class. Do not modify the test — the goal is to make it pass, not work around it.

@Test(expected = IllegalArgumentException.class)
public void testFromStringNullInput() {
    TezTaskAttemptID.fromString(null);
}

@Test(expected = IllegalArgumentException.class)
public void testFromStringTooFewParts() {
    // Fewer underscore-separated tokens than the format requires
    TezTaskAttemptID.fromString("attempt_1609459200000_0001_1");
}

Run the tests:

cd ~/tez-src         # stay at the repo root so the -pl tez-common paths in later steps resolve
mvn test -pl tez-common -Dtest=TestTezTaskAttemptID -q 2>&1 | tail -30

Expected result: Both new tests FAIL (the method throws NullPointerException or ArrayIndexOutOfBoundsException, not IllegalArgumentException).

Record the exact exception and stack-trace line. You will need this for the JIRA description later.
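
To see why the two tests can fail with different exception types, here is a minimal sketch of an unguarded parser. It is not the Tez source: the class UnguardedParseDemo and its methods are hypothetical, and the seven-token layout is an assumption taken from the example ID in the Background section.

```java
/** Hypothetical demo class; not part of Tez. */
final class UnguardedParseDemo {

    /** Parses the last field of an ID with no validation at all. */
    static int lastField(String id) {
        String[] parts = id.split("_");     // NullPointerException if id is null
        return Integer.parseInt(parts[6]);  // ArrayIndexOutOfBoundsException if fewer than 7 tokens
    }

    /** Returns the simple name of the exception thrown for the input, or "none". */
    static String failureType(String id) {
        try {
            lastField(id);
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(failureType(null));
        System.out.println(failureType("attempt_1609459200000_0001_1"));
        System.out.println(failureType("attempt_1609459200000_0001_1_00_000000_0"));
    }
}
```

The real method may fail at a different call site, which is why the lab asks you to record the actual stack trace rather than rely on a sketch like this one.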

Step 4 — Apply the Fix

Open TezTaskAttemptID.java and apply a minimal patch to the fromString method.

Rules for a minimal patch

Add a null-check at the very top of fromString; throw IllegalArgumentException with a clear message
Add a length-check on the parsed tokens before subscripting the array
Do not reformat unrelated lines (noisy diffs hide the real change and are routinely sent back in review)
Do not change method signatures or visibility

Hint — guard pattern used elsewhere in the same class

Search the file for how other fromString variants guard their input:

grep -n "IllegalArgumentException" TezTaskAttemptID.java

Use the same pattern and message style.
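
The two guards described in the rules above can be sketched as follows. This is a hedged illustration on a hypothetical class (AttemptIdGuardSketch), not the Tez implementation: the prefix, the token count of seven, and the message wording are assumptions based on the example ID, and your patch should follow whatever style the real file already uses.

```java
/** Hypothetical parser illustrating the guard pattern; not the Tez implementation. */
final class AttemptIdGuardSketch {
    private static final String PREFIX = "attempt";
    private static final int EXPECTED_PARTS = 7; // assumed from the example ID

    /** Returns the six numeric fields that follow the "attempt" prefix. */
    static long[] parse(String id) {
        // Guard 1: reject null before any method is called on the input.
        if (id == null) {
            throw new IllegalArgumentException("Attempt ID string must not be null");
        }
        String[] parts = id.split("_");
        // Guard 2: check the token count before indexing into the array.
        if (parts.length != EXPECTED_PARTS || !PREFIX.equals(parts[0])) {
            throw new IllegalArgumentException("Malformed attempt ID: " + id);
        }
        long[] fields = new long[EXPECTED_PARTS - 1];
        for (int i = 1; i < parts.length; i++) {
            // NumberFormatException extends IllegalArgumentException,
            // so non-numeric tokens already surface as the expected type.
            fields[i - 1] = Long.parseLong(parts[i]);
        }
        return fields;
    }
}
```

Both guards run before any array access, so every malformed input leaves the method through the same exception type.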

Step 5 — Verify the Fix

# All TezTaskAttemptID tests must pass
mvn test -pl tez-common -Dtest=TestTezTaskAttemptID -q

# Full tez-common test suite (regression guard)
mvn test -pl tez-common -q 2>&1 | tail -20

# Checkstyle must be clean
mvn checkstyle:check -pl tez-common -q 2>&1 | grep -E "ERROR|WARNING|violation" | head -20

All three commands must produce zero errors.

Step 6 — Understand the Checkstyle Rules

grep -A2 "LineLength" tez-common/src/main/checkstyle/tez-checkstyle.xml
grep -A2 "Javadoc" tez-common/src/main/checkstyle/tez-checkstyle.xml

Questions

#	Question
1	What is the maximum line length enforced?
2	Does the project require Javadoc on all public methods, or only some?
3	What import ordering rule is in effect — alphabetical, grouped, or none?

Step 7 — Format the Patch File

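Apache Tez uses the unified diff format. Plain git diff captures only uncommitted working-tree changes; if you have already committed on your branch, use git diff origin/master instead. From the repo root: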

cd ~/tez-src
git diff > /tmp/TEZ-XXXX.001.patch

Inspect the patch:

cat /tmp/TEZ-XXXX.001.patch

Checklist before uploading to JIRA

Patch header shows the correct file path relative to the repo root
Only TezTaskAttemptID.java and TestTezTaskAttemptID.java are modified
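No trailing whitespace on any added line (grep -nE '^\+.*[[:space:]]$' /tmp/TEZ-XXXX.001.patch should print nothing; matching only lines that start with + avoids false hits on context lines)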
Patch applies cleanly to a fresh checkout: git apply --check /tmp/TEZ-XXXX.001.patch
mvn test -pl tez-common still passes after git apply
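
The trailing-whitespace item in the checklist can also be expressed in code. The sketch below is a hypothetical helper (PatchWhitespaceCheck is not a Tez or Apache tool) that scans the lines of a unified diff and reports the 1-based positions of added lines ending in whitespace. It assumes file headers start with "+++ " and skips them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of any Apache tooling. */
final class PatchWhitespaceCheck {

    /** Returns 1-based line numbers of added lines that end in whitespace. */
    static List<Integer> addedLinesWithTrailingWhitespace(List<String> patchLines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < patchLines.size(); i++) {
            String line = patchLines.get(i);
            // Added lines start with '+'; the "+++ b/..." file header is not content.
            boolean added = line.startsWith("+") && !line.startsWith("+++ ");
            if (added && line.length() > 1
                    && Character.isWhitespace(line.charAt(line.length() - 1))) {
                hits.add(i + 1);
            }
        }
        return hits;
    }
}
```

Context lines (leading space) and removed lines (leading minus) are ignored on purpose, because only the lines you add are your responsibility in review.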

Step 8 — Write the JIRA Description

Draft a JIRA ticket description following the Apache Tez convention:

Summary: TezTaskAttemptID.fromString throws NPE/AIOOBE on malformed input
         instead of IllegalArgumentException

Description:
  TezTaskAttemptID.fromString does not validate its input before parsing.
  Passing null or a string with fewer than N underscore-separated parts
  causes an unhandled NullPointerException (null path) or
  ArrayIndexOutOfBoundsException (short-string path) instead of
  the expected IllegalArgumentException.

  Steps to reproduce:
    TezTaskAttemptID.fromString(null);
    → NullPointerException at TezTaskAttemptID.java:NN

    TezTaskAttemptID.fromString("attempt_1609459200000_0001_1");
    → ArrayIndexOutOfBoundsException at TezTaskAttemptID.java:NN

  Fix: add explicit null guard + array-length guard at the top of fromString.

Priority: Minor
Component: tez-common

Replace NN with the actual line numbers from your stack traces in Step 3.

Step 9 — Connect the Concepts

Concept	Where to find it in the codebase
`TezTaskAttemptID`	`tez-common/src/main/java/.../TezTaskAttemptID.java`
`TezID` base class	Same package — `TezID.java`
All `fromString` sibling methods	`TezDAGID`, `TezVertexID`, `TezTaskID` — same package
Checkstyle config	`tez-common/src/main/checkstyle/tez-checkstyle.xml`
Example past fix (similar pattern)	Search JIRA for `TEZ-` + `IllegalArgumentException` + `fromString`

Reflection

Why should library code throw IllegalArgumentException rather than letting a NullPointerException propagate?
What does the Apache contribution guide say about test coverage for bug fixes?
(Hint: CONTRIBUTING.md or the Apache Tez wiki — every bug fix must include a reproducing test.)
How does the Tez fromString guard pattern compare to the one in Hadoop's TaskAttemptID.forName?
Could this same class of bug exist in TezDAGID.fromString or TezVertexID.fromString?
Check both files and note your findings.

Lab 2.4 — Review It: Spot the Flaws in `TEZ-FAKE001.001.patch`

Lab type: Review-It — read a synthetic patch, find every flaw, explain the impact, propose fixes
Estimated time: 60–90 min
Tez component: tez-dag → org.apache.tez.dag.app.dag.impl.TaskImpl

Context

You are a Tez committer reviewing a patch uploaded to JIRA. The contributor claims the patch fixes a race condition where TaskImpl.getCounters() returns null when called before any task attempt has completed.

Your job is to review the patch before it merges. There are exactly 5 intentional flaws hidden in the diff below. Find them all.

The Synthetic Patch

diff --git a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
index a1b2c3d..e4f5a6b 100644
--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java
@@ -214,6 +214,8 @@ public class TaskImpl implements Task, EventHandler<TaskEvent> {
 
+  import org.apache.tez.common.counters.TezCounters;
+
   public synchronized TezCounters getCounters() {
     TezCounters counters = null;
     if (successfulAttempt != null) {
@@ -221,7 +223,7 @@ public class TaskImpl implements Task, EventHandler<TaskEvent> {
       counters = successfulAttempt.getCounters();
     } else {
       counters = attemptList.stream()
-          .filter(a -> a.getState() == TaskAttemptState.SUCCEEDED)
+          .filter(a -> a.getState() == TaskAttemptState.RUNNING)
           .findFirst()
           .map(TaskAttemptImpl::getCounters)
           .orElse(null);
@@ -231,6 +233,14 @@ public class TaskImpl implements Task, EventHandler<TaskEvent> {
     return counters;
   }
 
+  /**
+   * Returns the counter for this task, or a new empty TezCounters object
+   * if no counters are available yet.
+   *
+   * @return counters, never null
+   */
+  public synchronized TezCounters getCountersOrEmpty() {
+    TezCounters c = getCounters();
+    return c == null ? new TezCounters() : c;
+  }
+
diff --git a/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java b/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
index b7c8d9e..f0a1b2c 100644
--- a/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
+++ b/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java
@@ -891,6 +891,18 @@ public class TestTaskImpl {
 
+  @Test
+  public void testGetCountersBeforeAnyAttempt() {
+    // No attempts started; counters should not be null
+    initTask();
+    TezCounters result = task.getCounters();
+    assertNotNull("getCounters() must not return null", result);
+  }
+
+  @Test
+  public void testGetCountersOrEmptyReturnsSameObjectEachTime() {
+    initTask();
+    TezCounters first  = task.getCountersOrEmpty();
+    TezCounters second = task.getCountersOrEmpty();
+    assertSame("Must return same instance", first, second);
+  }
+

Your Task

For each flaw you find, fill in the table:

#	File	Line / hunk	Flaw description	Why it matters	Suggested fix
1
2
3
4
5

Guided Questions

Work through these questions one by one. Each one points at a different flaw.

Question 1 — Import placement

Look at where the import statement was added:

+  import org.apache.tez.common.counters.TezCounters;
+
   public synchronized TezCounters getCounters() {

Is this a valid location for a Java import declaration?
What would happen at compile time if this diff were applied as-is?
Where should imports go in a Java file?
Lookup: does TaskImpl.java already import TezCounters at the top?
(grep "import.*TezCounters" tez-dag/src/main/java/.../TaskImpl.java)

What is the flaw?

Question 2 — The filter predicate

The patch changes the fallback stream filter from:

.filter(a -> a.getState() == TaskAttemptState.SUCCEEDED)

to:

.filter(a -> a.getState() == TaskAttemptState.RUNNING)

Re-read the JIRA description: the reporter says getCounters() returns null when called before any attempt has completed.
Does filtering for RUNNING attempts fix that?
What does it mean to read counters from a RUNNING attempt vs a SUCCEEDED one?
Are the counters of a still-running attempt considered final/reliable?

What is the flaw? What should the filter be?

Question 3 — The new test `testGetCountersBeforeAnyAttempt`

Read the test body carefully:

TezCounters result = task.getCounters();
assertNotNull("getCounters() must not return null", result);

The test asserts that getCounters() is not null when no attempt has started.
But the patch does not change getCounters() to return an empty object — it adds a separate method getCountersOrEmpty() for that.
When successfulAttempt is null and attemptList is empty, what does getCounters() actually return?
Will this test pass or fail against the patched code?

What is the flaw?

Question 4 — The new test `testGetCountersOrEmptyReturnsSameObjectEachTime`

assertSame("Must return same instance", first, second);

getCountersOrEmpty() is implemented as:
return c == null ? new TezCounters() : c;
Each call creates a new TezCounters() when c is null.
Does the assertSame assertion match the implementation?
Is assertSame testing a documented contract, or is it over-specifying an implementation detail?
What assertion would actually verify the intended contract ("not null")?

What is the flaw?

Question 5 — The JIRA description says the fix is needed, but…

Re-read the patch one final time. The root cause (as stated in the JIRA) is that getCounters() can return null. The correct caller-safe fix for most Tez callers would be to make getCounters() itself never return null (return empty TezCounters as the contract).

Instead the patch adds getCountersOrEmpty() as a new method — but leaves the old getCounters() method returning null.

Every existing caller of getCounters() still gets null.
The Tez codebase uses getCounters() in aggregation loops that iterate counters: counters.incrAllCounters(taskCounters) — passing null there throws NPE.
How many callers of getCounters() exist in tez-dag?

grep -rn "\.getCounters()" tez-dag/src/main/ | grep -v "//.*getCounters" | wc -l

Does the patch actually fix the original bug?

What is the flaw?

Answer Key (Read After You've Filled the Table)

Reveal answers

#	Flaw	Impact	Fix
1	`import` statement placed inside the class body (after the opening `{`)	Compile error — Java imports must precede the class declaration	Remove the import; `TezCounters` is already imported at the top of `TaskImpl.java`
2	Filter changed to `RUNNING` instead of keeping `SUCCEEDED`; a running attempt's counters are partial and unstable	Returns wrong/partial data; counter values change as the attempt progresses, and the null case for a task with no attempts is still unhandled	Revert to `SUCCEEDED` filter; the real fix is to handle the "no succeeded attempt yet" case separately (return null or empty)
3	`testGetCountersBeforeAnyAttempt` asserts `assertNotNull` on `getCounters()` which still returns null when no attempt has completed	Test will fail on the patched code — the patch doesn't make `getCounters()` non-null	Test should call `getCountersOrEmpty()` or the assertion should accept null and document the contract
4	`assertSame` requires the same object reference but `getCountersOrEmpty()` creates a new `TezCounters()` each time null is returned	Test fails on every call where no successful attempt exists	Use `assertNotNull` to verify the non-null contract; don't assert reference identity
5	The patch adds `getCountersOrEmpty()` but doesn't fix the root cause — `getCounters()` still returns null; all existing callers are still broken	Downstream NPEs in counter aggregation loops are not fixed	Change `getCounters()` itself to return `new TezCounters()` instead of null, or add a null-guard in every caller; document the chosen contract
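
The fix proposed for flaw 5 (make the getter itself never return null) can be sketched on stand-in types. Counters and Task below are hypothetical simplifications, not TezCounters or TaskImpl, and returning a fresh empty object is only one of the possible contracts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins for TezCounters and TaskImpl; not Tez code. */
final class CountersContractSketch {

    static final class Counters {
        final Map<String, Long> values = new HashMap<>();

        /** Adds every counter from other into this object. */
        void incrAll(Counters other) {
            other.values.forEach((name, v) -> values.merge(name, v, Long::sum));
        }
    }

    static final class Task {
        private final Counters succeeded; // null until an attempt has succeeded

        Task(Counters succeeded) {
            this.succeeded = succeeded;
        }

        /** Contract: never null. Callers get an empty object when nothing has finished. */
        Counters getCounters() {
            return succeeded != null ? succeeded : new Counters();
        }
    }

    /** An aggregation loop like the one described in Question 5; safe under the contract. */
    static Counters aggregate(List<Task> tasks) {
        Counters total = new Counters();
        for (Task t : tasks) {
            total.incrAll(t.getCounters());
        }
        return total;
    }
}
```

Because each call on an unfinished task creates a new empty object, two calls return distinct instances. That is the reason an identity assertion such as assertSame is the wrong check for this contract.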

Reflection

A patch that adds a new method instead of fixing the old one is sometimes called an "additive workaround." When is that acceptable? When is it wrong?
The Apache Tez review process requires that every patch include a test that would have failed before the fix and passes after. Does this patch satisfy that requirement? Why or why not?
If you were the committer, what feedback would you leave on JIRA? Write two or three sentences in the style of a real review comment (constructive, specific, pointing at the line).
Look up a real Tez JIRA review thread (search issues.apache.org/jira for project = TEZ AND labels = patch-available AND resolution = Fixed). Find one comment where a committer asked for a test change. What did they say?

Level 3: Tez Architecture

This level gives you a working mental model of how all Tez components fit together. After completing it you will be able to trace any execution path — from API call to task output — through the code without getting lost. Architecture knowledge is what separates a contributor who fixes isolated bugs from one who can design improvements.

Learning Objectives

By the end of Level 3 you must be able to:

Draw the Tez component topology from memory (Client → AM → RM → NM → Container)
Trace a TezClient.submitDAG() call through four class boundaries to the first vertex start
Explain the role of each of the four state machines and how they interact
Describe what happens on each of the four communication channels between components
Explain the Input-Processor-Output (IPO) model and how it relates to DAG edges
Identify which Protocol Buffer message type carries a given piece of information

Component Topology

┌─────────────────────────────────────────────────────────────────────┐
│  Client JVM                                                         │
│  ┌─────────────┐                                                    │
│  │  TezClient  │──── submitDAG() ────────────────────────────────┐  │
│  └─────────────┘                                                 │  │
└──────────────────────────────────────────────────────────────────┼──┘
                                                                   │ DAGPlan (protobuf)
                                                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│  YARN cluster (RM allocates the AM container on a NodeManager)      │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │  ApplicationMaster container (DAGAppMaster)                  │   │
│  │                                                              │   │
│  │  ┌───────────┐  ┌────────────┐  ┌──────────┐  ┌─────────┐  │   │
│  │  │  DAGImpl  │→ │ VertexImpl │→ │ TaskImpl │→ │ TaskAttemptImpl│  │
│  │  └───────────┘  └────────────┘  └──────────┘  └─────────┘  │   │
│  │         │              │                                     │   │
│  │         └──── events ──┘                                    │   │
│  │                                                              │   │
│  │  ContainerLauncher ─── launches ──────────────────────────┐ │   │
│  └─────────────────────────────────────────────────────────┬─┘ │   │
└────────────────────────────────────────────────────────────┼───┼───┘
                                                             │   │
                                              container req  │   │ container
                                                             ▼   ▼
┌─────────────────────────────────────────────────────────────────────┐
│  YARN NodeManagers (one per worker node)                            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  TezChild (task container JVM)                                │  │
│  │  ┌────────────────────────────────────────────────────────┐  │  │
│  │  │  LogicalIOProcessorRuntimeTask                         │  │  │
│  │  │   Input(s) ─── Processor ─── Output(s)                │  │  │
│  │  └────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘

Communication Channels

Channel	From → To	What travels
Client → AM	`TezClient` → `DAGClientAMProtocol` (IPC)	`DAGPlan` protobuf, `GetDAGStatusRequest`
AM → RM	`RMCommunicator` → YARN RM	Container requests, heartbeats, AM completion
AM → NM	`ContainerLauncher` → YARN NM	Container launch context, env, classpath, command
AM ↔ Container	`TaskCommunicatorManager` → `TezTaskUmbilicalProtocol` (IPC)	Task assignment, task status, event routing

The Four State Machines

Tez execution is modeled as four nested state machines. Each tracks a specific level of granularity and sends events to the others.

DAGImpl State Machine

State	Description
`NEW`	DAG created, not yet initialized
`INITED`	All vertices initialized, ready to start
`RUNNING`	At least one vertex is running
`SUCCEEDED`	All vertices succeeded
`FAILED`	At least one vertex failed (unrecoverable)
`KILLED`	AM received a kill request
`ERROR`	Internal AM error

Key transition: on DAG_INIT, NEW → INITED builds the VertexImpl objects from the DAGPlan; each vertex is then driven by its own V_INIT event (see Key Event Types below).

VertexImpl State Machine

The most complex state machine in Tez. Has ~30 states and 80+ transitions.

Core states (simplified):

State	Description
`NEW`	Vertex created, not yet initialized
`INITIALIZING`	Waiting for inputs and vertex managers to initialize
`INITED`	Ready to schedule tasks
`RUNNING`	At least one task is running
`COMMITTING`	All tasks done, running output committers
`SUCCEEDED`	All tasks succeeded, all outputs committed
`FAILED`	Unrecoverable failure
`RECOVERING`	AM restarted, recovering state from history

The VertexImpl state machine is defined by the StateMachineFactory at the top of VertexImpl.java. Reading the factory definition gives you the complete transition table.
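
To make the idea of a transition table concrete, here is a minimal table-driven state machine. It is a teaching sketch, not Hadoop's StateMachineFactory: the class TransitionTableSketch is hypothetical, the state set is cut down to four, and collapsing "all tasks done" into a single V_TASK_COMPLETED event is a deliberate simplification.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical, heavily simplified state machine; not the Hadoop/Tez implementation. */
final class TransitionTableSketch {
    enum State { NEW, INITED, RUNNING, SUCCEEDED }
    enum Event { V_INIT, V_START, V_TASK_COMPLETED }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    /** Registers one row of the transition table: (from, event) -> to. */
    TransitionTableSketch addTransition(State from, State to, Event on) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
        return this;
    }

    /** Applies an event; rejects events that have no row for the current state. */
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    /** Builds a vertex-like machine with a single-task lifecycle. */
    static TransitionTableSketch vertexLike() {
        return new TransitionTableSketch()
            .addTransition(State.NEW, State.INITED, Event.V_INIT)
            .addTransition(State.INITED, State.RUNNING, Event.V_START)
            .addTransition(State.RUNNING, State.SUCCEEDED, Event.V_TASK_COMPLETED);
    }
}
```

Reading the chain of addTransition calls in vertexLike() tells you every legal move, which is the same reason the factory declaration is the best entry point into VertexImpl.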

TaskImpl State Machine

Each vertex has N tasks (parallelism = N). TaskImpl tracks one task across its attempts.

State	Description
`NEW` → `SCHEDULED`	Task created and placed in the scheduler queue
`RUNNING`	At least one attempt is running
`SUCCEEDED`	One attempt succeeded
`FAILED`	All attempts exhausted
`KILLED`	Task explicitly killed (e.g., pre-emption)

TaskImpl manages the attempt retry logic: if attempt 1 fails, TaskImpl decides whether to launch attempt 2 based on the failure mode and retry count configuration.
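
The retry decision described above can be reduced to a small predicate. This is a sketch under stated assumptions: RetryPolicySketch, its parameter names, and the notion of a single "fatal" flag are hypothetical, and the real logic in TaskImpl distinguishes more failure modes than this.

```java
/** Hypothetical retry predicate; the real TaskImpl logic is more detailed. */
final class RetryPolicySketch {

    /**
     * @param failedAttempts    attempts of this task that have failed so far
     * @param maxFailedAttempts configured limit on failed attempts (assumed parameter)
     * @param fatalError        true if the failure mode makes a retry pointless
     * @return true if another attempt should be launched
     */
    static boolean shouldLaunchNewAttempt(int failedAttempts, int maxFailedAttempts, boolean fatalError) {
        if (fatalError) {
            return false; // retrying cannot help; fail the task immediately
        }
        return failedAttempts < maxFailedAttempts;
    }
}
```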

TaskAttemptImpl State Machine

One actual container execution of a task.

State	Description
`NEW`	Attempt created, awaiting container assignment
`ASSIGNED`	Container assigned by the scheduler
`RUNNING`	Container launched, task code executing
`SUCCESS_FINISHING_CONTAINER`	Task reported success, container cleanup in progress
`SUCCEEDED`	Attempt completed successfully
`FAILED`	Attempt failed (may or may not trigger task retry)
`KILLED`	Attempt pre-empted or killed by AM

Event System

State machine transitions are driven by events. The event bus (AsyncDispatcher) routes events from producers to the correct state machine.

Key Event Types

Event Type	Producer	Consumer
`DAGEventType.DAG_INIT`	`DAGAppMaster`	`DAGImpl`
`VertexEventType.V_INIT`	`DAGImpl`	`VertexImpl`
`VertexEventType.V_START`	`DAGImpl`	`VertexImpl`
`TaskEventType.T_SCHEDULE`	`VertexImpl`	`TaskImpl`
`TaskAttemptEventType.TA_ASSIGNED`	`TaskScheduler`	`TaskAttemptImpl`
`TaskAttemptEventType.TA_DONE`	`TezTaskUmbilicalProtocol` (container callback)	`TaskAttemptImpl`
`VertexEventType.V_TASK_COMPLETED`	`TaskImpl`	`VertexImpl`
`DAGEventType.DAG_VERTEX_COMPLETED`	`VertexImpl`	`DAGImpl`

The event flow for a normal task success:

Container reports TA_DONE
  → TaskAttemptImpl: RUNNING → SUCCEEDED
  → sends T_ATTEMPT_SUCCEEDED to TaskImpl
    → TaskImpl: RUNNING → SUCCEEDED
    → sends V_TASK_COMPLETED to VertexImpl
      → VertexImpl checks: all tasks done?
        → if yes: sends DAG_VERTEX_COMPLETED to DAGImpl
          → DAGImpl checks: all vertices done?
            → if yes: DAG transitions to SUCCEEDED

Every state transition in this chain corresponds to a log line you will see in the AM logs.
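
The event chain above can be modeled with a queue that is drained one event at a time. The sketch below is hypothetical (EventChainSketch is not AsyncDispatcher): it uses string event names, a single task and a single vertex, and an explicit drain() call in place of a dispatcher thread, so the "all tasks done?" checks are trivially true.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical single-task, single-vertex model of the success chain. */
final class EventChainSketch {
    private final Queue<String> queue = new ArrayDeque<>();
    final List<String> log = new ArrayList<>(); // events in the order they were handled

    /** Posting only enqueues; nothing is handled until the queue is drained. */
    void post(String event) {
        queue.add(event);
    }

    /** Handles queued events in order, as a dispatcher thread would. */
    void drain() {
        String event;
        while ((event = queue.poll()) != null) {
            log.add(event);
            switch (event) {
                case "TA_DONE":             post("T_ATTEMPT_SUCCEEDED");  break; // attempt -> task
                case "T_ATTEMPT_SUCCEEDED": post("V_TASK_COMPLETED");     break; // task -> vertex
                case "V_TASK_COMPLETED":    post("DAG_VERTEX_COMPLETED"); break; // vertex -> DAG
                default:                    break;                               // end of chain
            }
        }
    }
}
```

Note that the log is still empty after post("TA_DONE") and before drain(). The same holds in the AM: posting an event does not mean the transition has happened yet.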

Protocol Buffers

Control messages that cross process boundaries in Tez (DAG plans, task events, history records) are serialized with Protocol Buffers; the protobuf library version depends on the Tez release.

Proto file	Location	Key messages
`DAGApiRecords.proto`	`tez-api/src/main/proto/`	`DAGPlan`, `VertexPlan`, `EdgePlan`
`DAGIo.proto`	`tez-api/src/main/proto/`	`RootInputLeafOutputProto`, `EntityDescriptorProto`
`HistoryProtos.proto`	`tez-dag/src/main/proto/`	All timeline/history event types
`Events.proto`	`tez-runtime-internals/src/main/proto/`	Task-level events (DataMovementEvent, etc.)

The DAGPlan message is what TezClient sends to the AM. It contains the complete description of the DAG: vertices, edges, processor descriptors, I/O configurations, and edge properties. It is generated from the DAG API object.

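// Illustrative flow, not verbatim Tez source.
// Client side: TezClient wraps the DAGPlan in a submit request; the RPC
// response carries only the DAG ID string, not the plan.
// AM side: the IPC handler extracts the plan from the request.
DAGPlan dagPlan = request.getDAGPlan();
// DAGAppMaster then builds a DAGImpl from the plan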

Input-Processor-Output (IPO) Model

Each task runs a single AbstractProcessor. The processor has access to named Input and Output instances, which are determined by the edges in the DAG.

┌──────────────────────────────────────────────────────────────────┐
│  Task container                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │  LogicalIOProcessorRuntimeTask                          │    │
│  │                                                         │    │
│  │  Inputs:                    Outputs:                    │    │
│  │  ┌──────────────────┐       ┌──────────────────────┐   │    │
│  │  │ OrderedGrouped   │       │ OrderedPartitioned   │   │    │
│  │  │ KVInput          │──┐ ┌──│ KVOutput             │   │    │
│  │  └──────────────────┘  │ │  └──────────────────────┘   │    │
│  │                        ▼ │                              │    │
│  │               ┌───────────────┐                         │    │
│  │               │  MyProcessor  │                         │    │
│  │               │  extends      │                         │    │
│  │               │  AbstractProcessor                      │    │
│  │               └───────────────┘                         │    │
│  └─────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘
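
The shape of the IPO contract can be sketched with three small interfaces. These are hypothetical simplifications, not the Tez runtime API: Input, Output, Processor, and the names "tokens" and "counts" are invented for illustration, and runTask stands in for the wiring that the task runtime performs.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical IPO model; the interfaces are not the Tez runtime API. */
final class IpoSketch {
    interface Input { Iterator<String> reader(); }
    interface Output { void write(String key, int value); }
    interface Processor { void run(Map<String, Input> inputs, Map<String, Output> outputs); }

    /** A processor that sees only named inputs and outputs, never the edges. */
    static final class WordCountProcessor implements Processor {
        @Override
        public void run(Map<String, Input> inputs, Map<String, Output> outputs) {
            Map<String, Integer> counts = new TreeMap<>();
            Iterator<String> words = inputs.get("tokens").reader();
            while (words.hasNext()) {
                counts.merge(words.next(), 1, Integer::sum);
            }
            Output out = outputs.get("counts");
            counts.forEach((word, n) -> out.write(word, n));
        }
    }

    /** Plays the role of the task runtime: wires one input and one output, then runs the processor. */
    static Map<String, Integer> runTask(List<String> words) {
        Map<String, Integer> sink = new LinkedHashMap<>();
        Input tokens = () -> words.iterator();
        Output counts = (key, value) -> sink.put(key, value);
        new WordCountProcessor().run(
            Collections.singletonMap("tokens", tokens),
            Collections.singletonMap("counts", counts));
        return sink;
    }
}
```

The processor never learns where its data comes from or goes to. Swapping the edge type changes which Input and Output implementations are wired in, not the processor code.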

Edge Property Types

The EdgeProperty in the DAG API determines what I/O classes are used between two vertices.

`DataMovementType`	Meaning	Default I/O pair
`SCATTER_GATHER`	Partitioned, sorted shuffle	`OrderedPartitionedKVOutput` → `OrderedGroupedKVInput`
`BROADCAST`	All output sent to all downstream tasks	`UnorderedKVOutput` → `UnorderedKVInput`
`ONE_TO_ONE`	Task i → Task i, no shuffle	`UnorderedKVOutput` → `UnorderedKVInput`
`CUSTOM`	User-defined routing	User-provided `EdgeManagerPlugin`

SCATTER_GATHER corresponds to the classic MapReduce shuffle. BROADCAST is used for joins where one side is small enough to replicate to all tasks.
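
The three built-in movement types differ only in which downstream tasks receive a given piece of output. The sketch below illustrates that routing decision on a hypothetical class (EdgeRoutingSketch); it assumes hash partitioning for SCATTER_GATHER and treats the partition index as the destination task index, which is a simplification of what the real outputs and edge managers do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical routing model; not the Tez EdgeManager implementation. */
final class EdgeRoutingSketch {
    enum Movement { SCATTER_GATHER, BROADCAST, ONE_TO_ONE }

    /** Returns the downstream task indices that receive a record with the given key. */
    static List<Integer> destinations(Movement type, int sourceTask, String key, int numDestTasks) {
        switch (type) {
            case SCATTER_GATHER:
                // Assumed hash partitioning: equal keys always land on the same task.
                return Collections.singletonList((key.hashCode() & Integer.MAX_VALUE) % numDestTasks);
            case BROADCAST:
                // Every downstream task receives all of the output.
                List<Integer> all = new ArrayList<>();
                for (int i = 0; i < numDestTasks; i++) {
                    all.add(i);
                }
                return all;
            case ONE_TO_ONE:
                // Task i feeds task i; no shuffle.
                return Collections.singletonList(sourceTask);
            default:
                throw new IllegalArgumentException("Unsupported movement type: " + type);
        }
    }
}
```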

DataMovementEvent

When a task output is ready, it sends a DataMovementEvent through the umbilical to the AM. The AM routes it to the downstream tasks so their input knows which partition to fetch.

This event routing is the mechanism by which OrderedGroupedKVInput discovers where each upstream partition is located — it receives DataMovementEvents from the AM containing the shuffle server address and partition index.

Required Reading

#	Resource	What to extract
1	`tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java`	The `StateMachineFactory` declaration — read all `addTransition()` calls
2	`tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java`	The constructor and `InitTransition`: how `DAGPlan` becomes state machine objects (the `createDag()` call that builds `DAGImpl` is made from `DAGAppMaster`)
3	`tez-api/src/main/proto/DAGApiRecords.proto`	`DAGPlan` and `VertexPlan` message definitions
4	`tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java`	The `serviceStart()` method — component initialization order
5	`tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java`	How inputs, processors, and outputs are initialized in a container

Key Classes Quick Reference

Class	Module	Role
`DAGAppMaster`	`tez-dag`	AM main class; manages all components; starts the event dispatcher
`DAGImpl`	`tez-dag`	DAG state machine; tracks vertex completion; manages history
`VertexImpl`	`tez-dag`	Vertex state machine; manages task scheduling; calls VertexManager
`TaskImpl`	`tez-dag`	Task state machine; manages attempt lifecycle and retry logic
`TaskAttemptImpl`	`tez-dag`	TaskAttempt state machine; coordinates container assignment
`AsyncDispatcher`	`tez-dag` (via Hadoop)	Event bus; routes events to state machines asynchronously
`TezTaskUmbilicalProtocol`	`tez-runtime-internals`	IPC interface between container and AM
`TezChild`	`tez-runtime-internals`	Container main class; receives task assignment; runs the task
`LogicalIOProcessorRuntimeTask`	`tez-runtime-internals`	In-container task runner; sets up IPO
`TezClient`	`tez-api`	Client API; creates TezSession; submits DAGs

JIRA Categories for Level 3

Having read the architecture, you can now evaluate:

Architecture improvement JIRAs — proposals to change how components interact
State machine correctness bugs — transitions that lead to wrong states
Event routing issues — events that are lost or sent to wrong consumers
Container reuse improvements — how tasks are assigned to existing containers

You are still not ready to submit fixes for state machine bugs — those require Level 4. But you can now read these issues intelligently and leave informed comments.

Deliverables

Draw the component topology diagram from memory (no looking)
Trace TezClient.submitDAG() to VertexImpl V_START event through class names
Identify the state machines and their event types from code (not from this page)
Explain in your own words what DataMovementEvent does and why it exists
Lab 3.1 completed: DAG submission trace documented
Lab 3.2 completed: IPO abstraction walkthrough complete

Common Mistakes

Mistake	Impact	Correct understanding
Thinking the AM runs tasks directly	Leads to wrong mental model of container lifecycle	Tasks run in separate JVMs (containers); AM only schedules and monitors
Confusing `VertexImpl` with `Vertex` (API)	You search the wrong module for runtime behavior	`Vertex` is the builder in `tez-api`; `VertexImpl` is the runtime state machine in `tez-dag`
Thinking `AsyncDispatcher` is synchronous	Code that reads state right after posting an event sees the old state	Events are queued and transitions run later on the dispatcher thread; never assume a transition is immediate after an event is posted
Reading `VertexImpl` top-to-bottom	The class is 6000+ lines; reading linearly is unproductive	Start with the `StateMachineFactory` declaration, then follow individual transitions

Lab 3.1: Trace a DAG Submission End-to-End

Background

A DAG goes from a Java object constructed with the API to running tasks in containers through a sequence of method calls, IPC calls, and event posts that spans six class boundaries and three JVMs. This lab asks you to trace that path precisely — class name, method name, and the data that crosses each boundary.

Being able to reconstruct this trace from code (not from documentation) is the skill. That means reading DAGAppMaster.java, DAGImpl.java, VertexImpl.java, and TezChild.java and following the chain yourself.

The Six Class Boundaries

[1] TezClient.submitDAG(dag)
         │
         │  DAGClientAMProtocol (IPC) — carries: SubmitDAGRequest{DAGPlan}
         ▼
[2] DAGClientHandler.submitDAG(request)   [in DAGAppMaster]
         │
         │  posts: DAGAppMasterEvent(NEW_DAG_SUBMITTED)
         ▼
[3] DAGAppMaster.handle(event)
         │
         │  calls createDag(dagPlan) → new DAGImpl(...)
         │  posts: DAGEvent(DAG_INIT)
         ▼
[4] DAGImpl.handle(DAGEvent{DAG_INIT})
         │
         │  InitTransition: initializes all VertexImpl objects
         │  posts: VertexEvent(V_INIT) for each vertex
         ▼
[5] VertexImpl.handle(VertexEvent{V_INIT})
         │
         │  InitTransition: sets up tasks, calls VertexManager
         │  posts: VertexEvent(V_START) when ready
         │  posts: TaskEvent(T_SCHEDULE) for each task
         ▼
[6] TaskImpl → TaskAttemptImpl → ContainerLauncher → NM
         │
         │  NM starts container JVM: TezChild.main()
         ▼
[Container JVM] TezChild receives task assignment via TezTaskUmbilicalProtocol
         │
         ▼
LogicalIOProcessorRuntimeTask.run()  — Processor.run() called

Step-by-Step Tasks

Step 1: Find the Entry Point in `TezClient`

Open tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java.

Find the submitDAG(DAG dag) method. Answer:

What is the name of the IPC protocol interface used to communicate with the AM?
What does TezClient do if it does not yet have an AM to talk to (session not started)?
What method on the DAG object serializes it to a DAGPlan protobuf?
What request object wraps the DAGPlan before it is sent over IPC?

# Find the IPC protocol interface
grep -n "Protocol" tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java | head -10

# Find DAGPlan construction
grep -n "DAGPlan\|createDag\|getPlan" tez-api/src/main/java/org/apache/tez/dag/api/TezClient.java

Step 2: Find the AM-side IPC Handler

The AM exposes the DAGClientAMProtocol interface. The implementation is in DAGAppMaster.

# Find the implementation of submitDAG on the AM side
grep -rn "submitDAG" tez-dag/src/main/java/org/apache/tez/dag/app/ | grep -v test

Open the handler class. Answer:

What is the exact class name that implements DAGClientAMProtocol?
What event type does it post to the AsyncDispatcher after receiving the DAGPlan?
Does the submitDAG call on the AM side block until the DAG completes, or does it return immediately?

Step 3: Trace `DAGAppMaster` Initialization

Open tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java.

Find the serviceStart() method. Read the component initialization order:

List the components initialized in serviceStart() in order
Find where AsyncDispatcher is created and started
Find where the DAGEventDispatcher (the component that routes DAGEvents to DAGImpl) is registered

# Find component initialization
grep -n "addService\|serviceStart\|startService" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head -20

Step 4: Read the `DAGImpl` Init Transition

Open tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java.

Find the StateMachineFactory definition. Locate the transition for DAGEventType.DAG_INIT.

The transition handler class is InitTransition. Find it in the same file.

Answer:

What does InitTransition.transition() do with each vertex in the DAG?
After initializing vertices, what event does DAGImpl post?
Under what condition does the init transition immediately move to RUNNING vs waiting?

# Find the init transition
grep -n "InitTransition\|DAG_INIT" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -20

Step 5: Read the `VertexImpl` Init Transition

Open tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java.

Find the transition out of the NEW state on event V_INIT (it moves the vertex into INITIALIZING). The handler is InitTransition (a different class from the one in DAGImpl).

Answer:

What is the VertexManager and when is it invoked during initialization?
How does VertexImpl know how many tasks to create (the parallelism)?
What event does VertexImpl send to DAGImpl when initialization completes?

# Find vertex init transition
grep -n "V_INIT\|InitTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -20

Step 6: Trace the Container Launch

After tasks are scheduled, TaskAttemptImpl requests a container from the TaskScheduler. When a container is assigned, ContainerLauncher builds the launch context.

# Find the container launch command construction
grep -rn "containerLaunchContext\|getContainerLaunchContext\|vargs" \
  tez-dag/src/main/java/org/apache/tez/dag/app/launcher/ | grep -v test | head -10

Answer:

What is the main class of the container JVM? (The class with main() that YARN launches)
What information is passed to TezChild via system properties vs environment variables?
How does TezChild know which task to run when it starts?

Step 7: Read `TezChild.main()`

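Open tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezChild.java. (TezChild runs inside the task container, so it lives with the runtime code rather than in tez-dag.)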

Find the main() method and the run() loop.

Answer:

What IPC interface does TezChild use to communicate with the AM?
What does TezChild do when it receives a TaskSpec from the AM?
What class is instantiated to actually run the processor?

# Find TezChild (search from the repo root; it is not in the tez-dag module)
find . -path "*/src/main/java/*" -name "TezChild.java"
wc -l $(find . -path "*/src/main/java/*" -name "TezChild.java")

Complete the Trace Table

Fill in this table by reading the code (not from this page or any other documentation):

Step	Class	Method	Data / Event
1	`TezClient`	`submitDAG()`	Sends `SubmitDAGRequest{DAGPlan}` via IPC
2	?	`submitDAG()`	Posts event ???
3	`DAGAppMaster`	`handle()`	Creates `DAGImpl`, posts `DAGEvent{DAG_INIT}`
4	`DAGImpl`	`InitTransition.transition()`	Posts `VertexEvent{V_INIT}` for each vertex
5	`VertexImpl`	`InitTransition.transition()`	Posts `TaskEvent{T_SCHEDULE}` for each task
6	`TaskAttemptImpl`	?	Requests container from RM via `TaskScheduler`
7	`ContainerLauncher`	?	Launches container JVM with `TezChild` as main class
8	`TezChild`	`run()`	Receives task spec, starts processor
9	`LogicalIOProcessorRuntimeTask`	`run()`	Calls `Processor.run()`

Fill in the ? cells from the actual code. Each cell should contain the real method name.

Expected Output

A completed trace table with all cells filled from code, not from documentation. Each answer should be verifiable by pointing to a specific line in a specific file.

Example format for your notes (the path, line number, and event name below are placeholders; record what you actually find):

Step 2: DAGClientHandler.submitDAG()
  in: tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
  line: ~1234
  posts: DAGAppMasterEvent(NEW_DAG_SUBMITTED)

Stretch Goals

Find the AsyncDispatcher queue size configuration. What happens if the queue fills up?

grep -rn "AsyncDispatcher\|dispatcher.queue" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head -10

Find where the AM is told to exit when the DAG completes:

grep -n "stop\|shutdown\|exit" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | grep -i "succeeded\|complete"

Trace what happens to a TA_DONE event from TezChild back to DAGImpl:
- TezChild calls a method on the umbilical
- The AM receives it and posts a TaskAttemptEvent
- TaskAttemptImpl transitions to SUCCEEDED
- The chain continues up to DAGImpl

Identify every class and event in this reverse chain.

Lab 3.2: Understand the IPO Abstraction

Background

Every task in Tez runs a Processor that reads from one or more Input objects and writes to one or more Output objects. This Input-Processor-Output (IPO) model is the fundamental abstraction for how data moves through a DAG. Edge properties in the API (EdgeProperty, DataMovementType) determine which I/O classes are instantiated in the container.

This lab walks through the IPO model from the API layer to the runtime, tracing how an edge configured with OrderedPartitionedKVOutput becomes actual bytes in a shuffle buffer.

The IPO Interface Hierarchy

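org.apache.tez.runtime.api (in the tez-api module):
  AbstractLogicalInput        implements LogicalInput   (which extends Input)
  AbstractLogicalOutput       implements LogicalOutput  (which extends Output)
  AbstractProcessor
      └── AbstractLogicalIOProcessor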

tez-runtime-library (implementations):
  OrderedPartitionedKVOutput    extends AbstractLogicalOutput
  OrderedGroupedKVInput         extends AbstractLogicalInput
  UnorderedKVOutput             extends AbstractLogicalOutput
  UnorderedKVInput              extends AbstractLogicalInput
  UnorderedPartitionedKVOutput  extends AbstractLogicalOutput
  BroadcastKVInput              extends AbstractLogicalInput (alias for UnorderedKVInput)

The key interface chain:

AbstractLogicalOutput.initialize() → called by LogicalIOProcessorRuntimeTask
AbstractLogicalOutput.start()      → called when the processor is started
AbstractLogicalOutput.getWriter()  → returns KeyValueWriter for the processor to use
AbstractLogicalOutput.commit()     → called after processor.run() completes
AbstractLogicalOutput.close()      → cleanup
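The ordering of these calls matters more than any single method. Here is a minimal JDK-only sketch of the initialize, start, getWriter, close portion of the chain. OutputLifecycleSketch is a hypothetical stand-in that only records calls; it does not extend any Tez class, and the guard inside getWriter() is an assumption of the sketch rather than a documented Tez check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical stand-in for a logical output; it only records lifecycle calls. */
final class OutputLifecycleSketch {
    final List<String> calls = new ArrayList<>();
    final List<String> records = new ArrayList<>();
    private boolean started;

    void initialize() { calls.add("initialize"); }

    void start() { started = true; calls.add("start"); }

    /** The started-guard is this sketch's own assumption, not a documented Tez check. */
    BiConsumer<String, Integer> getWriter() {
        if (!started) {
            throw new IllegalStateException("start() must be called before getWriter()");
        }
        calls.add("getWriter");
        return (key, value) -> records.add(key + "=" + value);
    }

    void close() { calls.add("close"); }

    /** Plays the framework and the processor in the order listed above. */
    static OutputLifecycleSketch runTask() {
        OutputLifecycleSketch out = new OutputLifecycleSketch();
        out.initialize();                                      // framework, before the processor runs
        out.start();                                           // when the processor starts
        BiConsumer<String, Integer> writer = out.getWriter();  // processor asks for a writer
        writer.accept("tez", 1);                               // processor.run() writes records
        out.close();                                           // framework cleanup
        return out;
    }
}
```

When you read the real classes in Step 1, check which of these ordering constraints the framework enforces and which it leaves to the processor author.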

Step-by-Step Tasks

Step 1: Read the `AbstractLogicalOutput` Interface

Open tez-api/src/main/java/org/apache/tez/runtime/api/AbstractLogicalOutput.java. (The org.apache.tez.runtime.api package ships in the tez-api module, as the hierarchy above shows.)

Answer:

What is the purpose of the initialize() method? What does it return?
What is the difference between start() and initialize()? Why are they separate?
What method does a Processor call to get a writer to write records?

Step 2: Trace `OrderedPartitionedKVOutput.initialize()`

Open tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java.

Find the initialize() method.

Answer:

What configuration key controls the buffer size for sorting?
What class is created in initialize() to handle the actual sort-and-spill?
How is the Partitioner class determined at runtime?

# Find sort buffer configuration
grep -n "SORT_MB\|sortmb\|buffer" \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java \
  | head -10

# Find the writer/sorter creation
grep -n "new.*Writer\|new.*Sorter\|ExternalSorter" \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/output/OrderedPartitionedKVOutput.java

Step 3: Trace the Write Path

When a processor calls writer.write(key, value), the data goes:

KeyValueWriter.write(key, value)
  → ExternalSorter.collect(key, value, partition)
    → SpillThread triggers when buffer is full
      → IFile.Writer writes sorted partition to local disk
        → On close(): merges all spills into final output file
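The collect, spill, merge sequence above can be modeled in a few lines. This is a minimal in-memory sketch, assuming a record-count capacity instead of a byte budget and an in-memory list per spill instead of an IFile on disk. SortSpillSketch, Rec, and spillCount() are hypothetical names; the real sorter classes in tez-runtime-library work on serialized bytes and differ in almost every detail.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** In-memory model of buffer, sort, spill, merge. Hypothetical; not the Tez sorter. */
final class SortSpillSketch {
    /** One buffered record: target partition plus key and value. */
    static final class Rec {
        final int partition;
        final String key;
        final int value;
        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Records are ordered by partition first, then by key within the partition.
    private static final Comparator<Rec> ORDER =
        Comparator.comparingInt((Rec r) -> r.partition).thenComparing((Rec r) -> r.key);

    private final int capacity;
    private final int numPartitions;
    private final List<Rec> buffer = new ArrayList<>();
    private final List<List<Rec>> spills = new ArrayList<>();

    SortSpillSketch(int capacity, int numPartitions) {
        this.capacity = capacity;
        this.numPartitions = numPartitions;
    }

    /** Assigns a partition by key hash, buffers the record, and spills when full. */
    void collect(String key, int value) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        buffer.add(new Rec(partition, key, value));
        if (buffer.size() >= capacity) {
            spill();
        }
    }

    /** Sorts the buffered records into one run and empties the buffer. */
    private void spill() {
        List<Rec> run = new ArrayList<>(buffer);
        run.sort(ORDER);
        spills.add(run);
        buffer.clear();
    }

    /** Spills any remainder, then combines all runs into one sorted list. */
    List<Rec> close() {
        if (!buffer.isEmpty()) {
            spill();
        }
        List<Rec> merged = new ArrayList<>();
        for (List<Rec> run : spills) {
            merged.addAll(run);
        }
        merged.sort(ORDER);  // stand-in for a streaming merge of the sorted runs
        return merged;
    }

    int spillCount() { return spills.size(); }
}
```

With a capacity of 2 and five records, the sketch produces three runs: two full spills plus the remainder flushed at close(). The real code avoids the final full re-sort by merging the already-sorted runs.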

Find the ExternalSorter class:

find tez-runtime-library/src/main/java -name "ExternalSorter.java"

Answer:

What data structure holds records before they are spilled?
What algorithm is used to sort records in the buffer?
How is the sort key computed for (K, V) pairs with a custom Comparator?

Step 4: Read `OrderedGroupedKVInput.initialize()`

Open tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java.

Find initialize().

Answer:

What class handles the shuffle (fetching data from remote nodes)?
How does the input know which upstream tasks it needs to fetch from?
What event type does the input consume to discover shuffle locations?

grep -n "Shuffle\|ShuffleManager\|DataMovementEvent" \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/input/OrderedGroupedKVInput.java \
  | head -15

Step 5: Trace the Read Path

When a processor calls keyValueReader.next(), the data flow is:

KeyValueReader.next()
  → MergedKeyValueIterator.next()    [merging multiple sorted partitions]
    → TezRawKeyValueIterator         [from TezMerger]
      → IFile.Reader reads from local merged file

But before the merge can happen, the shuffle must fetch data:

DataMovementEvent arrives (from AM, routed from upstream task)
  → ShuffleManager records: "partition P is at host H:port/path"
  → Fetcher.fetch() downloads the partition file
    → stores locally
  → When all partitions fetched: MergeManager merges them
    → final sorted output available for KeyValueReader
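The merge step at the end of this flow is a k-way merge of already-sorted runs. A minimal sketch of that technique, using a heap of cursors over in-memory lists, is shown here. KWayMergeSketch and Cursor are hypothetical names; the real merge code operates on serialized key/value segments, some on disk and some in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Heap-based k-way merge over sorted integer runs. Hypothetical; not TezMerger. */
final class KWayMergeSketch {
    /** Cursor into one sorted run. */
    private static final class Cursor {
        final List<Integer> run;
        int pos;
        Cursor(List<Integer> run) { this.run = run; }
        int head() { return run.get(pos); }
    }

    /** Merges runs that are each sorted ascending into one ascending list. */
    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.head(), b.head()));
        for (List<Integer> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();       // run with the smallest current head
            out.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c);              // re-insert with its next element as the head
            }
        }
        return out;
    }
}
```

Each output element costs one heap operation, so merging n records from k runs takes O(n log k) comparisons without ever holding more than k heads in the heap.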

Find the Fetcher class:

find tez-runtime-library/src/main/java -name "Fetcher.java"
wc -l $(find tez-runtime-library/src/main/java -name "Fetcher.java")

Answer:

What HTTP endpoint does Fetcher call to retrieve partition data?
What does Fetcher do if the HTTP request fails?
How many simultaneous fetch connections does Fetcher allow by default?

Step 6: Understand `DataMovementEvent` Routing

The DataMovementEvent is what connects output to input. When a task completes its output:

The ShuffleHandler (shuffle server) registers the output location
The task sends a DataMovementEvent via the TezTaskUmbilicalProtocol to the AM
The AM routes the event to the downstream tasks that need it
The downstream input receives it and knows to fetch from that location
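For the routing step, the plain scatter-gather case reduces to swapping two indices: partition p of every source task goes to destination task p, and the destination identifies the sender by the source task's index. The sketch below models only that case, assuming no auto-parallelism and no custom edge manager. ScatterGatherRoutingSketch and Route are hypothetical names, not Tez classes.

```java
/** Index mapping for plain scatter-gather routing. Hypothetical; not a Tez class. */
final class ScatterGatherRoutingSketch {
    /** Where one event lands: which destination task, and which of its physical inputs. */
    static final class Route {
        final int destTaskIndex;
        final int destInputIndex;
        Route(int destTaskIndex, int destInputIndex) {
            this.destTaskIndex = destTaskIndex;
            this.destInputIndex = destInputIndex;
        }
    }

    /**
     * Routes the event for one partition of one source task.
     * The partition number selects the destination task; the source task
     * index selects the physical input slot on that destination.
     */
    static Route route(int sourceTaskIndex, int sourceOutputIndex) {
        return new Route(sourceOutputIndex, sourceTaskIndex);
    }
}
```

For example, the event describing partition 1 of source task 3 is delivered to destination task 1, physical input 3. Compare this against the edge manager code you find with the grep later in this step.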

Find the DataMovementEvent class:

find . -name "DataMovementEvent.java" | grep -v test

Answer:

What fields does DataMovementEvent carry?
Why is the payload (userPayload) a byte array and not a typed field?
How does the AM know which downstream tasks to route the event to?

# Find the AM-side routing logic
grep -rn "DataMovementEvent\|routeEvent" \
  tez-dag/src/main/java/org/apache/tez/dag/app/ --include="*.java" | grep -v test | head -15

Step 7: Edge Properties → I/O Classes

The EdgeProperty object in the API specifies which I/O classes to use. Trace how EdgeProperty becomes actual I/O class instantiation in the container.

Starting point:

# Find EdgeProperty
find tez-api/src/main/java -name "EdgeProperty.java"

Then trace:

# How does VertexImpl use EdgeProperty to configure I/O for a vertex?
grep -n "EdgeProperty\|getInputDescriptor\|getOutputDescriptor" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -15

Answer:

What field in EdgeProperty specifies the Input class for the destination vertex?
What field specifies the Output class for the source vertex?
How is the class name passed to the container so it can instantiate the correct I/O class?

Build the IPO Map

For OrderedWordCount, fill in this table by reading the code:

Edge	Source vertex	Dest vertex	Output class	Input class	DataMovementType
Tokenizer → SumReducer	Tokenizer	SumReducer	?	?	?
SumReducer → Sorter	SumReducer	Sorter	?	?	?

Read tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java to fill in the ? cells.

grep -n "EdgeProperty\|KVOutput\|KVInput\|DataMovementType" \
  tez-examples/src/main/java/org/apache/tez/examples/OrderedWordCount.java

Expected Output

By the end of this lab you will have:

The IPO map table for OrderedWordCount completed
An answer for each step question (from code, not from documentation)
Understanding of what DataMovementEvent carries and why it exists
Knowledge of which configuration key controls sort buffer size

Stretch Goals

Find the shuffle HTTP server that serves partition data to Fetcher:
```
find . -name "ShuffleHandler.java" | grep -v test
```
What HTTP framework does it use? What is the URL pattern for fetching a partition?
Trace what happens when Fetcher receives corrupted data (a checksum mismatch). Does the task fail immediately? Or does it retry from a different source?
Find the EdgeManagerPlugin interface and read its contract:
```
find tez-api/src/main/java -name "EdgeManagerPlugin.java"
```
What three methods must a custom edge manager implement, and what do they do? Why would you use a custom edge manager instead of SCATTER_GATHER?
Look at IntersectExample.java in tez-examples. It uses BROADCAST for one edge. Explain why: what is the semantic meaning of broadcasting in a join operation?

Lab 3.3 — Build It: Multi-Input Union DAG

Lab type: Build It — real Maven project, compilable Java, run + break + fix cycle
Estimated time: 90–120 min
Maven module: book/projects/level-3-multi-input
Main class: org.apache.tez.learning.l3.MultiInputDAG

What You Will Build

EvenNumberSource(1) ──even-edge──┐
                                  ├─▶ MultiInputUnionProcessor(1) ──▶ UnionSinkProcessor(1)
OddNumberSource(1)  ──odd-edge───┘

Two source vertices emit separate streams of integers (even: 0,2,4,…,98 and odd: 1,3,5,…,99). A middle vertex receives both streams through two named input edges, unions them, and forwards everything to a terminal sink. The sink sums all values and publishes the result via a Tez counter.

Expected output: TotalSum=4950 PASS

This is the smallest possible Tez program with a multi-input vertex — the same structural pattern used by every Tez join, union, and co-group operation.
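The expected total can be checked independently of Tez. This small sketch assumes each source emits 50 values, which is what the ranges 0..98 and 1..99 imply; ExpectedSumSketch is a hypothetical helper and not part of the lab project.

```java
import java.util.stream.IntStream;

/** Recomputes the lab's expected counter value with plain arithmetic. */
final class ExpectedSumSketch {
    /** Sum of 0, 2, 4, ..., 98. */
    static long evenSum() {
        return IntStream.range(0, 50).mapToLong(i -> 2L * i).sum();
    }

    /** Sum of 1, 3, 5, ..., 99. */
    static long oddSum() {
        return IntStream.range(0, 50).mapToLong(i -> 2L * i + 1).sum();
    }

    public static void main(String[] args) {
        System.out.println("TotalSum=" + (evenSum() + oddSum()));
    }
}
```

The even stream contributes 2450 and the odd stream 2500, which together give the 4950 that the sink is expected to publish.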

Step 1 — Set the Tez Version

Open book/projects/pom.xml and confirm <tez.version> matches your local build:

cd ~/tez-src         # your Tez clone from Lab 1.1
git log --oneline -1 | head -c 60
mvn help:evaluate -Dexpression=project.version -q -DforceStdout 2>/dev/null

If the version printed differs from what is in the POM, update <tez.version> before continuing.

Step 2 — Compile and Run the Unit Tests

cd /path/to/apache-tez/book/projects

# Compile and test the new module only
mvn -pl level-3-multi-input test

You should see:

Tests run: 10, Failures: 0, Errors: 0, Skipped: 0

Read every test in TestMultiInputProcessors.java before moving on.

Questions

#	Question
1	`testEvenAndOddRangesNoOverlapNoGap` simulates both sources using a `boolean[]`. Why is this a more rigorous check than just verifying the counts?
2	`testEdgeNameConstants` tests string literals. What real bug would be caught if a developer renamed the constant but not the string in `buildDAG()`?
3	`testExpectedSum` hardcodes `4950L`. Could you make this test fail by changing only `EvenNumberSource.COUNT`? What would change?

Step 3 — Build the Fat JAR and Run the DAG

mvn -pl level-3-multi-input package -q

java -jar level-3-multi-input/target/level-3-multi-input-1.0-SNAPSHOT-jar-with-dependencies.jar

Expected final line:

[MultiInputDAG] TotalSum=4950  expected=4950  PASS

If you see FAIL, the counter value is wrong — note the actual value before proceeding to the debugging exercises.

Step 4 — Read Every Source File

Work through each file in src/main/java/org/apache/tez/learning/l3/.

`EvenNumberSource.java`

#	Question
1	What Tez base class does it extend? What does that class provide?
2	Why does `run()` call `output.start()` before `getWriter()`? What happens if you skip it? (Break It experiment below)
3	The output is retrieved by `getOutputs().values().iterator().next()`. What would break if this vertex had two outputs?
4	Why are `key` and `value` declared once outside the loop rather than inside it? What allocation cost would the inner-loop placement cause?

`MultiInputUnionProcessor.java`

#	Question
1	Inputs are retrieved by string name: `inputs.get(EVEN_EDGE)`. Where are these names assigned? Trace the call to `setDestinationEdgeName` in `MultiInputDAG.java`.
2	Both inputs are started before either reader is obtained. Could you start them one at a time (start even → read even → start odd → read odd)? What would happen?
3	After draining the even input, the odd input's reader is obtained separately. Is there a scenario where odd records arrive before all even records have been read? How does Tez's input buffering handle this?
4	The processor forwards records unchanged (key=value=integer). What change to `run()` would be needed to emit only distinct values if both sources could produce duplicates?

`MultiInputDAG.java`

#	Question
1	Both `evenEdge` and `oddEdge` use `edgeCfg.createDefaultEdgeProperty()`. Could you use different edge configs for the two sources? When would that be necessary?
2	`Edge.setDestinationEdgeName(...)` names the edge as seen by the destination vertex. Does the source vertex also see this name? Check by reading the `Edge` API.
3	The DAG has 4 vertices. Draw the dependency graph. Which vertices can run in parallel?
4	`waitForCompletion(EnumSet.of(StatusGetOpts.GET_COUNTERS))` — what does `GET_COUNTERS` do? What would `status.getDAGCounters()` return if this option were omitted?

Step 5 — Break It: Three Experiments

Perform each experiment, observe the failure, then revert before the next one.

Experiment A — Swap the edge names

In MultiInputDAG.buildDAG(), swap EVEN_EDGE and ODD_EDGE:

.setDestinationEdgeName(MultiInputUnionProcessor.ODD_EDGE)   // was EVEN_EDGE
// ...
.setDestinationEdgeName(MultiInputUnionProcessor.EVEN_EDGE)  // was ODD_EDGE

Rebuild and run.

Does the DAG succeed or fail?
Is the sum still 4950?
Why does swapping the names not cause a failure here, but would cause a failure in a join operation where the left and right inputs have different schemas?

Experiment B — Remove one `start()` call

In MultiInputUnionProcessor.run(), remove evenInput.start().

Rebuild and run.

What exception is thrown? On which line?
Search the Tez source for the method that throws this exception. What is the guard condition?

Experiment C — Make one source emit duplicates

In EvenNumberSource.run(), change int n = i * 2 to int n = 0 (every write uses key=0).

Rebuild and run.

What is the counter value now?
Is the DAG PASS or FAIL?
What does this reveal about how OrderedGroupedKVInput handles duplicate keys when the value type is IntWritable?

Step 6 — Implement a `FilterUnionProcessor`

Create a new file in the same package: FilterUnionProcessor.java

Specification:

Extends AbstractLogicalIOProcessor
Has the same two named inputs as MultiInputUnionProcessor
Accepts a threshold via UserPayload (key "threshold", default 50)
Reads from both inputs; only forwards values >= threshold
Increments counter UnionPipeline/FilteredCount for each record dropped

Wire it into the DAG as a replacement for MultiInputUnionProcessor:

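// Requires java.nio.ByteBuffer and java.nio.charset.StandardCharsets
Vertex filter = Vertex.create(
    "FilterUnion",
    ProcessorDescriptor.create(FilterUnionProcessor.class.getName())
        .setUserPayload(UserPayload.create(
            // Name the charset explicitly so the payload decodes the same on every JVM
            ByteBuffer.wrap("threshold=50".getBytes(StandardCharsets.UTF_8)))),
    1);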

Expected result: With threshold=50, values 0–49 are dropped, values 50–99 are forwarded. Sum at sink = 50+51+…+99 = 3725. FilteredCount = 50.
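The payload parsing and the filter arithmetic can be prototyped without Tez before you write the processor. This is a minimal sketch, assuming the payload is UTF-8 text of the form key=value. FilterSketch, parseThreshold, and run are hypothetical names; in the real processor the ByteBuffer would come from the UserPayload, and the dropped count would go to the counter.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Tez-free prototype of the threshold parsing and filtering logic. */
final class FilterSketch {
    /** Reads "threshold=N" from a UTF-8 payload, or returns the default. */
    static int parseThreshold(ByteBuffer payload, int defaultValue) {
        if (payload == null || !payload.hasRemaining()) {
            return defaultValue;
        }
        // duplicate() leaves the caller's buffer position untouched
        String text = StandardCharsets.UTF_8.decode(payload.duplicate()).toString();
        for (String pair : text.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].trim().equals("threshold")) {
                return Integer.parseInt(kv[1].trim());
            }
        }
        return defaultValue;
    }

    /** Returns {sum of forwarded values, count of dropped values} for inputs 0..99. */
    static long[] run(int threshold) {
        long sum = 0;
        long dropped = 0;
        for (int n = 0; n < 100; n++) {
            if (n >= threshold) {
                sum += n;
            } else {
                dropped++;
            }
        }
        return new long[] {sum, dropped};
    }
}
```

Running the filter with a threshold of 50 over 0..99 forwards 50 values summing to 3725 and drops 50, which matches the expected result stated above.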

Step 7 — Tez Source Connection Table

For each class below, locate the corresponding source file in your Tez clone and record the path:

Class used in this project	Tez source file (relative to repo root)
`OrderedPartitionedKVOutput`
`OrderedGroupedKVInput`
`OrderedPartitionedKVEdgeConfig`
`HashPartitioner`
`Edge.setDestinationEdgeName`

Step 8 — Connect to Real Tez Data Flows

Open tez-examples/src/main/java/org/apache/tez/examples/JoinDataGen.java or OrderedWordCount.java in the Tez source tree.

Find a DAG in the examples that has more than 2 vertices.
Draw its topology as an ASCII diagram.
Identify which vertex is the "union-like" vertex (if any) that receives edges from multiple sources.
Compare its processor class to MultiInputUnionProcessor: what is similar, what is different?

Step 9 — JIRA Research

Search issues.apache.org/jira for:

project = TEZ AND text ~ "multi-input" AND resolution = Fixed

Find one resolved issue involving multiple inputs to a single vertex.

What was the bug?
Which class was modified?
Was a test added? If so, what does it test?

Level 4: DAG State Machine Internals

This level takes you inside VertexImpl — the most complex class in the Tez codebase. You will read the full state machine, understand every major state and the conditions that drive transitions, learn how VertexManager plugs in to control scheduling, and understand how speculative execution works. After this level you are capable of diagnosing vertex-level failures from AM log output and writing patches to the state machine.

Learning Objectives

By the end of Level 4 you must be able to:

Read a StateMachineFactory definition and produce a transition table from code
Explain the full INITIALIZING → INITED → RUNNING → SUCCEEDED path with all preconditions
Describe what VertexManager does and when it is invoked
Explain the difference between ImmediateStartVertexManager and ShuffleVertexManager
Describe the speculative execution trigger conditions and what it causes
Trace a vertex failure from first task failure to DAGImpl receiving V_COMPLETED
Explain what vertex groups are and why they exist

Reading a `StateMachineFactory`

Tez uses Hadoop's StateMachineFactory (org.apache.hadoop.yarn.state, shipped in hadoop-yarn-common). The pattern:

private static final StateMachineFactory<VertexImpl, VertexState, VertexEventType, VertexEvent>
    stateMachineFactory =
        new StateMachineFactory<>(VertexState.NEW)

        // From NEW
        .addTransition(VertexState.NEW,
            VertexState.INITIALIZING,
            VertexEventType.V_INIT,
            new InitTransition())

        // From INITIALIZING
        .addTransition(VertexState.INITIALIZING,
            EnumSet.of(VertexState.INITED, VertexState.FAILED),
            VertexEventType.V_INIT_DONE,
            new InitedTransition())
        ...
        .installTopology();

Reading rules

First argument — the source state (where we are now)
Second argument — the destination state(s). If an EnumSet, the transition handler decides which destination to return.
Third argument — the event type that triggers this transition
Fourth argument — the SingleArcTransition or MultipleArcTransition handler

A SingleArcTransition always goes to the same destination state. Its transition() method returns void.

A MultipleArcTransition can go to different states. Its transition() method returns the next VertexState.

When you see an EnumSet as the second argument, look for a MultipleArcTransition implementation — the logic inside that class decides which state to move to.
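To make the single-arc versus multiple-arc distinction concrete, here is a minimal JDK-only state machine with one transition of each kind. StateMachineSketch and everything inside it are hypothetical and far simpler than the Hadoop factory: handlers are plain functions, and the initSucceeded flag stands in for whatever the real handler inspects.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Tiny table-driven state machine. Hypothetical; not Hadoop's StateMachineFactory. */
final class StateMachineSketch {
    enum State { NEW, INITIALIZING, INITED, FAILED }
    enum Event { V_INIT, V_INIT_DONE }

    /** One table row: the allowed destinations and the handler that picks one. */
    private static final class Arc {
        final EnumSet<State> allowed;
        final Function<StateMachineSketch, State> handler;
        Arc(EnumSet<State> allowed, Function<StateMachineSketch, State> handler) {
            this.allowed = allowed;
            this.handler = handler;
        }
    }

    private final Map<String, Arc> table = new HashMap<>();
    private State current = State.NEW;
    boolean initSucceeded = true;   // hypothetical input to the multi-arc handler

    StateMachineSketch() {
        // Single arc: exactly one destination, so the handler has nothing to decide.
        addTransition(State.NEW, EnumSet.of(State.INITIALIZING), Event.V_INIT,
            m -> State.INITIALIZING);
        // Multiple arc: the handler returns one of the states in the EnumSet.
        addTransition(State.INITIALIZING, EnumSet.of(State.INITED, State.FAILED),
            Event.V_INIT_DONE, m -> m.initSucceeded ? State.INITED : State.FAILED);
    }

    private void addTransition(State from, EnumSet<State> to, Event on,
                               Function<StateMachineSketch, State> handler) {
        table.put(from + "/" + on, new Arc(to, handler));
    }

    /** Looks up (current state, event), runs the handler, and moves to its result. */
    State handle(Event event) {
        Arc arc = table.get(current + "/" + event);
        if (arc == null) {
            throw new IllegalStateException("Invalid event " + event + " at " + current);
        }
        State next = arc.handler.apply(this);
        if (!arc.allowed.contains(next)) {
            throw new IllegalStateException("Handler returned undeclared state " + next);
        }
        current = next;
        return current;
    }
}
```

An event with no row for the current state is rejected outright. Keep that in mind while reading VertexImpl: a missing addTransition is itself a behavior, and it is a common source of state machine bugs.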

How to extract the full transition table

# List all addTransition calls in VertexImpl
grep -n "addTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | wc -l

# Print them all
grep -n "addTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

The Full Vertex State Machine

`NEW → INITIALIZING` (event: `V_INIT`)

Triggered by DAGImpl.InitTransition when the DAG is initializing.

Handler: InitTransition

What happens:

VertexImpl sets up inputsWithInitializers — inputs that require RootInputInitializer
Registers event handlers for root input initializer completion events
If there are no root input initializers, immediately posts V_INIT_DONE (transitions to INITED in the same logical step)

Precondition for INITIALIZING → INITED:

All RootInputInitializers have reported completion
VertexManager.initialize() has completed

`INITED → RUNNING` (event: `V_START`)

Triggered by DAGImpl when all source vertices of this vertex have started or when the vertex has no source edges (it is a root vertex).

Handler: StartTransition

What happens:

Calls vertexManager.onVertexStarted()
The VertexManager decides when to schedule tasks

Important: the V_START event does not directly schedule tasks. The VertexManager does, by calling back into the framework through VertexManagerPluginContext.scheduleTasks() (scheduleVertexTasks() in older releases).

`RUNNING` task completion handling

Each task completion (success or failure) generates a V_TASK_COMPLETED event.

The TaskCompletedTransition handler:

Increments the succeeded/failed task counter
Checks if all tasks are done → if yes, triggers V_COMPLETE_EVENT
Checks speculative execution conditions
Checks if failure count exceeds tolerable failures threshold

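Key configuration: tez.vertex.failures.maxpercent (TezConfiguration.TEZ_VERTEX_FAILURES_MAXPERCENT): the percentage of failed tasks a vertex tolerates before the whole vertex fails. Default: 0 (any failed task fails the vertex). Setting it above 0 enables partial failure tolerance. Confirm the key name against TezConfiguration in your checkout.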
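The threshold check itself is simple percentage arithmetic. The sketch below shows one plausible form of it, assuming the vertex fails only when the failed percentage strictly exceeds the configured maximum. FailureToleranceSketch and vertexShouldFail are hypothetical names, and the strictness of the comparison is an assumption to verify against VertexImpl.

```java
/** One plausible form of the failure-percentage check. Hypothetical; verify in VertexImpl. */
final class FailureToleranceSketch {
    /**
     * Returns true when the share of failed tasks exceeds the tolerated percentage.
     * With maxFailedPercent = 0, the first failed task already fails the vertex.
     */
    static boolean vertexShouldFail(int failedTasks, int totalTasks, float maxFailedPercent) {
        if (failedTasks == 0) {
            return false;
        }
        float failedPercent = 100.0f * failedTasks / totalTasks;
        return failedPercent > maxFailedPercent;   // strict comparison is an assumption
    }
}
```

With the default of 0, one failure out of 100 tasks is 1 percent and fails the vertex. With a setting of 5, five failures out of 100 are tolerated in this sketch and the sixth is not.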

`RUNNING → COMMITTING` (all tasks succeeded)

Before a vertex is marked SUCCEEDED, its output committers run.

Handler: VertexCommitCallback

What happens:

OutputCommitter.commitOutput() is called for each output with a committer
Commit is atomic: either all outputs commit or the vertex fails
The AM must not fail between task completion and output commit (AM recovery handles this)

`RUNNING → FAILED`

Triggers:

A task exceeds the failure threshold (V_TASK_COMPLETED with failure)
A container dies without a task completion report
VertexManager reports an error
A downstream vertex fails and error propagation is configured

`RECOVERING` states

When the AM restarts (e.g., due to a node failure), VertexImpl enters RECOVERING states. Recovery replays the history events that RecoveryService wrote to the recovery log under the DFS staging directory, reconstructing which tasks completed before the AM died so that already-succeeded tasks are not re-run.

This is the most complex part of VertexImpl. Recovery bugs are a major category of contributor-fixable issues.

VertexManager

VertexManager is the plugin interface that controls task scheduling within a vertex. It sits between the AM framework and the actual task scheduler.

Interface (simplified)

public abstract class VertexManagerPlugin {
    // Called when vertex is initialized; plugin configures itself
    public abstract void initialize() throws Exception;

    // Called when V_START event fires; plugin decides when to schedule tasks
    public abstract void onVertexStarted(List<TaskAttemptIdentifier> completions)
        throws Exception;

    // Called each time a source vertex task completes
    // Plugin uses this to update scheduling decisions (for slow-start)
    public abstract void onSourceTaskCompleted(TaskAttemptIdentifier completedSrcTaskAttempt)
        throws Exception;

    // Called when a VertexManagerEvent sent by a task arrives
    // (e.g., output-size statistics that drive auto-parallelism)
    public abstract void onVertexManagerEventReceived(VertexManagerEvent vmEvent)
        throws Exception;
}

`ImmediateStartVertexManager`

The default for root vertices and vertices with no special scheduling requirements.

Behavior:

Schedules all tasks immediately when onVertexStarted() is called
Does not wait for any source task completion
Used by: Tokenizer vertex in OrderedWordCount

`ShuffleVertexManager`

Used for vertices that receive SCATTER_GATHER input from a source vertex.

Behavior:

Implements slow start: waits until a configurable fraction of source tasks have completed before scheduling downstream tasks
Configuration keys: tez.shuffle-vertex-manager.min-src-fraction (default 0.25) and tez.shuffle-vertex-manager.max-src-fraction (default 0.75)
Implements auto-parallelism: can reduce the downstream vertex's parallelism based on the actual size of shuffle data
When auto-parallelism reduces parallelism, it calls context.reconfigureVertex() which posts a V_PARALLELISM_UPDATED event to VertexImpl
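Slow start can be pictured as a ramp between the two fractions: nothing is scheduled below the minimum, everything is scheduled at or above the maximum, and a growing share is scheduled in between. The sketch below assumes linear interpolation with simple rounding. SlowStartSketch and tasksToSchedule are hypothetical names, and the exact rounding and boundary behavior in ShuffleVertexManager may differ.

```java
/** Linear slow-start ramp. Hypothetical; read ShuffleVertexManager for the real rules. */
final class SlowStartSketch {
    /**
     * Returns how many of the vertex's tasks may be scheduled so far, given
     * how many source tasks have completed.
     */
    static int tasksToSchedule(int completedSrc, int totalSrc, int totalTasks,
                               float minFraction, float maxFraction) {
        if (totalSrc == 0) {
            return totalTasks;                 // nothing to wait for
        }
        float done = (float) completedSrc / totalSrc;
        if (done < minFraction) {
            return 0;                          // below the ramp: hold everything back
        }
        if (done >= maxFraction || maxFraction <= minFraction) {
            return totalTasks;                 // past the ramp: release everything
        }
        float share = (done - minFraction) / (maxFraction - minFraction);
        return Math.round(share * totalTasks); // linear interpolation is an assumption
    }
}
```

With the default fractions and 10 downstream tasks, half of the source tasks finishing puts the vertex halfway up the ramp, so this sketch releases 5 tasks.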

Why VertexManager Matters for Contributors

Auto-parallelism and slow-start bugs are a major category of Tez issues. The interaction between ShuffleVertexManager and VertexImpl involves:

Parallelism changes after task scheduling
Race conditions between task completion events and parallelism updates
Recovery of vertices that had parallelism changed before AM death

Speculative Execution

Speculative execution launches a duplicate task attempt when the original attempt is slow.

Trigger conditions

VertexImpl checks speculation conditions in TaskCompletedTransition and on a periodic timer:

At least one task has completed (we have a baseline for "normal" task duration)
The running attempt has been running longer than speculative_threshold * median_time
The running attempt's progress is lower than expected for its elapsed time

Configuration:

tez.am.speculation.enabled = true   (default: false)
tez.am.speculation.interval-ms = 5000  (check interval)
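The three trigger conditions listed above combine into a single predicate. The following sketch is one way to express them, assuming expected progress grows linearly with elapsed time relative to the median. SpeculationSketch, shouldSpeculate, and the threshold parameter are hypothetical; the real speculator uses its own runtime estimators and configuration.

```java
import java.util.Arrays;

/** One way to combine the trigger conditions. Hypothetical; not the Tez speculator. */
final class SpeculationSketch {
    /** Median of the completed task durations (input must be non-empty). */
    static double median(long[] completedMillis) {
        long[] sorted = completedMillis.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /**
     * Returns true when there is a baseline, the attempt has run longer than
     * threshold * median, and its progress lags what the elapsed time suggests.
     */
    static boolean shouldSpeculate(long[] completedMillis, long elapsedMillis,
                                   double progress, double threshold) {
        if (completedMillis.length == 0) {
            return false;                                   // no baseline yet
        }
        double med = median(completedMillis);
        boolean slow = elapsedMillis > threshold * med;
        // Linear expected progress is an assumption of this sketch.
        double expectedProgress = Math.min(1.0, elapsedMillis / med);
        return slow && progress < expectedProgress;
    }
}
```

Using the median rather than the mean keeps one very slow completed task from inflating the baseline and masking other stragglers.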

What happens

VertexImpl posts a TaskEventType.T_ADD_SPEC_ATTEMPT event to TaskImpl
TaskImpl creates a new TaskAttemptImpl
Both attempts run concurrently
The first to succeed wins; the other is killed
The winning attempt's output is committed; the losing attempt's output is discarded

Interaction with `ShuffleVertexManager`

If a speculative attempt completes, ShuffleVertexManager receives an onSourceTaskCompleted callback for the winning attempt. It must de-duplicate: the task's output should only be counted once regardless of which attempt succeeded.
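De-duplication of this kind usually means tracking completion per task rather than per attempt. A minimal sketch of that idea follows, using a bit per source task index. CompletionDedupSketch is a hypothetical class; how ShuffleVertexManager actually tracks completions is something to confirm in its source.

```java
import java.util.BitSet;

/** Counts each source task once, whichever attempt reports it. Hypothetical sketch. */
final class CompletionDedupSketch {
    private final BitSet completedTasks = new BitSet();

    /**
     * Records a completion keyed by task index; the attempt number is ignored.
     * Returns true only the first time a given task index is reported.
     */
    boolean onSourceTaskCompleted(int taskIndex, int attemptNumber) {
        if (completedTasks.get(taskIndex)) {
            return false;                 // another attempt of this task already counted
        }
        completedTasks.set(taskIndex);
        return true;
    }

    /** Number of distinct source tasks that have completed. */
    int completedCount() { return completedTasks.cardinality(); }
}
```

If the slow-start fraction were computed from attempt completions instead, a vertex with speculation enabled could appear to pass its minimum fraction earlier than it really has.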

Vertex Groups

Vertex groups (VertexGroup in the API) allow multiple vertices to be treated as a single logical vertex for downstream consumption.

Use case: merging the output of multiple Map vertices before a single Reduce vertex, without an intermediate shuffle. This is used in the Hive UnionAll operator implementation.

Key classes:

VertexGroup API: tez-api/src/main/java/org/apache/tez/dag/api/VertexGroup.java
GroupInputEdge: an edge from a VertexGroup to a regular vertex
The downstream vertex sees a single MergedLogicalInput that combines all group members

Key Classes for This Level

Class	Path	Focus
`VertexImpl`	`tez-dag/.../dag/impl/VertexImpl.java`	The entire state machine; 6000+ lines
`ShuffleVertexManager`	`tez-runtime-library/.../dag/library/vertexmanager/ShuffleVertexManager.java`	Slow start and auto-parallelism
`ImmediateStartVertexManager`	`tez-dag/.../dag/impl/ImmediateStartVertexManager.java`	Simple baseline
`VertexManagerPlugin`	`tez-api/.../VertexManagerPlugin.java`	The interface
`VertexManagerPluginContext`	`tez-api/.../VertexManagerPluginContext.java`	What the plugin can call back into
`TaskImpl`	`tez-dag/.../dag/impl/TaskImpl.java`	Manages attempt lifecycle

# Find the VertexManager implementations (they are spread across modules)
find . -path "*/src/main/java/*" -name "*VertexManager*.java"

JIRA Categories for Level 4 Contributors

You are now ready to investigate and submit patches for:

Vertex failure handling bugs — incorrect state transitions, wrong error messages
VertexManager logic bugs — slow-start fraction calculation, auto-parallelism edge cases
Recovery bugs — vertices that fail to recover correctly after AM restart
Speculation bugs — duplicate completions, wrong trigger conditions
Test improvements — TestVertexImpl has hundreds of tests; adding coverage for edge cases

Approach:

Find a TestVertexImpl test annotated with @Ignore and read the comment explaining why
If the bug is fixed, the @Ignore can be removed (a trivial but real contribution)
Or find a state machine transition that has no test coverage (grep for the transition, then grep for the handler class name in test files)

Deliverables

Extract the complete VertexImpl state transition table (all source states, event types, destination states) from the code
Explain ShuffleVertexManager slow-start in your own words, with the relevant config keys
Trace a vertex failure through TaskImpl → VertexImpl → DAGImpl using event type names
Identify one test annotated with @Ignore in TestVertexImpl and read why it is ignored
Lab 4.1 completed: full state machine map documented
Lab 4.2 completed: VertexManager walkthrough complete

Common Mistakes

Mistake	Impact	Correct understanding
Assuming `V_START` schedules tasks	Code changes that bypass `VertexManager` break auto-parallelism	`V_START` calls `VertexManager.onVertexStarted()`; the manager schedules
Ignoring `RECOVERING` states	Patches that forget about recovery cause AM restart failures	Every new state or transition must handle the `RECOVERING_*` path
Confusing `TaskImpl` failure handling with `VertexImpl`	Retry logic is in `TaskImpl`; failure threshold is in `VertexImpl`	Read both classes before touching failure handling code
Reading `VertexImpl` in isolation	Many transitions involve callbacks to `DAGImpl`	Always trace events both ways: into the state machine AND back out

Lab 4.1: Read the VertexImpl State Machine

Background

VertexImpl.java is the most complex class in Apache Tez. It is approximately 6,000 lines long and contains the complete state machine for vertex execution, including initialization, scheduling, task completion handling, failure handling, speculative execution, and AM recovery. Reading it systematically — rather than linearly — is the skill this lab builds.

The output of this lab is a complete state transition table that you have produced from the source code, without reference to any external documentation.

How to Read a Large State Machine Class

Do not read VertexImpl.java from top to bottom. Instead:

Start with the StateMachineFactory declaration (search for stateMachineFactory =)
Extract all addTransition calls — this gives you the complete transition table
For each transition, find the handler class — the inner class that implements SingleArcTransition or MultipleArcTransition
Read each handler's transition() method — this is the actual state machine logic
Trace inter-state-machine events — where does the handler post events to other state machines?
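
To see why this reading order works, it can help to look at a toy model of the pattern first. The sketch below is not Hadoop's or Tez's state machine code; the class `ToyStateMachine`, its enums, and its handlers are invented for illustration. It only shows that a factory built from `addTransition` calls is a lookup table from (state, event) to (destination, handler), which is why extracting those calls gives you the whole transition table.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of a table-driven state machine; NOT the Hadoop/Tez implementation. */
final class ToyStateMachine {
    enum State { NEW, INITED, RUNNING, SUCCEEDED }
    enum Event { INIT, START, LAST_TASK_DONE }

    /** One row of the transition table: where to go and which hook to run. */
    private static final class Arc {
        final State destination;
        final Consumer<StringBuilder> handler;

        Arc(State destination, Consumer<StringBuilder> handler) {
            this.destination = destination;
            this.handler = handler;
        }
    }

    private final Map<State, Map<Event, Arc>> table = new EnumMap<>(State.class);
    private final StringBuilder log = new StringBuilder();
    private State current = State.NEW;

    /** Registers one row, mirroring the shape of an addTransition(...) call. */
    ToyStateMachine addTransition(State from, State to, Event on,
                                  Consumer<StringBuilder> handler) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, Arc>(Event.class))
             .put(on, new Arc(to, handler));
        return this;
    }

    /** Looks up (current state, event); an unregistered pair is an invalid transition. */
    State handle(Event event) {
        Map<Event, Arc> row = table.get(current);
        Arc arc = (row == null) ? null : row.get(event);
        if (arc == null) {
            throw new IllegalStateException(
                "Invalid event " + event + " in state " + current);
        }
        arc.handler.accept(log);      // the handler is where the real logic lives
        current = arc.destination;
        return current;
    }

    String log() {
        return log.toString();
    }

    /** Builds a three-row table for the toy happy path. */
    static ToyStateMachine build() {
        return new ToyStateMachine()
            .addTransition(State.NEW, State.INITED, Event.INIT, l -> l.append("init;"))
            .addTransition(State.INITED, State.RUNNING, Event.START, l -> l.append("start;"))
            .addTransition(State.RUNNING, State.SUCCEEDED, Event.LAST_TASK_DONE,
                l -> l.append("done;"));
    }
}
```

The real factory also supports transitions with several possible destinations, where the handler returns the state to move to. Those are the `MultipleArcTransition` handlers you will meet in the steps below.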

Step-by-Step Tasks

Step 1: Find the `StateMachineFactory`

grep -n "stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -5

Note the line number. The factory declaration starts there and continues for hundreds of lines. Read the entire factory definition — do not skip any transitions.

Step 2: Count All States and Transitions

# List the distinct source states referenced in addTransition (append "| wc -l" to count them)
grep "addTransition(VertexState\." \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | sed 's/.*addTransition(VertexState\.\([A-Z_]*\).*/\1/' \
  | sort -u

# Count total transitions
grep -c "addTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

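Record your numbers. The list of source states cannot be longer than the `VertexState` enum, so expect roughly a dozen states and far more transitions than states; exact counts vary by Tez version. If the first command prints fewer states than the enum declares, some `addTransition` calls probably break the line before the state argument, so read those entries by hand.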

Step 3: Build the Transition Table

For each line in the StateMachineFactory, extract:

Source state
Event type
Destination state(s)
Handler class name

Begin with the transitions from NEW:

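# Find all transitions FROM NEW, with the continuation lines that follow each call
# (an awk range from one addTransition to the next would end on its own first line)
grep -A4 "addTransition(VertexState\.NEW" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | head -40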

Then from INITIALIZING:

grep -A4 "addTransition(VertexState\.INITIALIZING" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -40

Build a table with columns: Source State | Event | Destination | Handler.
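
If you prefer to generate a first draft of the table rather than copy rows by hand, a small extractor along the following lines may help. It is a hypothetical helper, not part of Tez. It assumes single-destination calls of the form `addTransition(VertexState.X, VertexState.Y, VertexEventType.Z, handler)`; calls that use `EnumSet.of(...)` for destinations or event types will not match and still need to be read by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns simple addTransition(...) calls into table rows. */
final class TransitionRows {
    // Matches addTransition(VertexState.A, VertexState.B, VertexEventType.C[, handler]).
    // \s* also matches line breaks, so calls wrapped over several lines still match.
    private static final Pattern CALL = Pattern.compile(
        "addTransition\\s*\\(\\s*VertexState\\.(\\w+)\\s*,\\s*VertexState\\.(\\w+)\\s*,"
        + "\\s*VertexEventType\\.(\\w+)(?:\\s*,\\s*(?:new\\s+)?(\\w+))?");

    /** Returns one "| source | event | destination | handler |" row per matched call. */
    static List<String> extract(String source) {
        List<String> rows = new ArrayList<>();
        Matcher m = CALL.matcher(source);
        while (m.find()) {
            String handler = (m.group(4) == null) ? "(none)" : m.group(4);
            rows.add("| " + m.group(1) + " | " + m.group(3) + " | "
                + m.group(2) + " | " + handler + " |");
        }
        return rows;
    }
}
```

Feed `extract` the full text of `VertexImpl.java` and compare the output with your hand-built rows. Any `addTransition` call that appears in the file but not in the output is a multi-destination or multi-event transition, and those are usually the interesting ones.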

Step 4: Trace the Happy Path

The happy path for a vertex with no source edges (a root vertex, e.g., Tokenizer):

NEW
  V_INIT → INITIALIZING (InitTransition)
    V_INIT_DONE → INITED (InitedTransition — if no root input initializers)
  V_START → RUNNING (StartTransition)
    [VertexManager schedules tasks]
    [All tasks complete successfully]
    V_TASK_COMPLETED (final task) → COMMITTING (TaskCompletedTransition)
    V_COMMIT_COMPLETED → SUCCEEDED (CommitCompletedTransition)

For each transition in the happy path, find the handler class and answer:

InitTransition.transition():

grep -n "class InitTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

What does InitTransition do when there are no RootInputInitializers?
Does it immediately post V_INIT_DONE, or is there an intermediate step?

InitedTransition.transition() (or whatever class handles V_INIT_DONE):

When does INITIALIZING go to INITED vs going directly to RUNNING?
What is the condition that allows immediate transition to RUNNING?

StartTransition.transition():

What method on VertexManager is called here?
Does this method block or is it asynchronous?

TaskCompletedTransition.transition():

How does it track whether all tasks have completed?
What is numSuccessSourceAttemptCompletions?
At what point does it decide the vertex can move to COMMITTING?

Step 5: Trace the Failure Path

A task fails. The event chain:

TaskAttemptImpl: RUNNING → FAILED (sends T_ATTEMPT_FAILED to TaskImpl)
  TaskImpl: RUNNING → FAILED (if retry limit exceeded; sends V_TASK_COMPLETED{FAILED})
    VertexImpl: RUNNING → ?

Find the handler for V_TASK_COMPLETED when the task is FAILED:

# TaskCompletedTransition handles both success and failure
grep -n "TaskCompletedTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -10

Answer:

What field tracks the number of failed tasks?
What is the condition that causes the vertex to transition to FAILED?
What event does VertexImpl send to DAGImpl when it fails?
Does DAGImpl fail immediately when a vertex fails, or does it try to continue?

# Find how DAGImpl handles vertex failure
grep -n "DAG_VERTEX_COMPLETED\|vertexFailed\|VERTEX_FAILED" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -15
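
The first hop of this chain can be modelled in a few lines. The sketch below is a toy, not Tez code: `ToyTask` and its method names are invented, and the limit stands in for the AM's maximum-failed-attempts setting. It shows only the escalation rule implied by the chain above, where attempt failures are absorbed by retries until the budget is spent and only then does the vertex see a failed task. What the vertex does next is what the questions above ask you to find.

```java
/** Toy model of attempt-to-task failure escalation; names are illustrative, not Tez classes. */
final class ToyTask {
    enum Outcome { RETRY, TASK_FAILED }

    private final int maxFailedAttempts;
    private int failedAttempts = 0;

    ToyTask(int maxFailedAttempts) {
        if (maxFailedAttempts < 1) {
            throw new IllegalArgumentException("maxFailedAttempts must be >= 1");
        }
        this.maxFailedAttempts = maxFailedAttempts;
    }

    /** Called once per failed attempt; escalates only when the retry budget is spent. */
    Outcome onAttemptFailed() {
        failedAttempts++;
        return (failedAttempts >= maxFailedAttempts) ? Outcome.TASK_FAILED : Outcome.RETRY;
    }

    int failedAttempts() {
        return failedAttempts;
    }
}
```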

Step 6: Find the RECOVERING States

grep "RECOVERING" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | grep "VertexState\." | head -20

Answer:

How many RECOVERING_* states exist?
What event exits the RECOVERING state?
What class handles recovery completion?

Step 7: Find All `@Ignore`d Tests in `TestVertexImpl`

grep -n "@Ignore" \
  tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java

For each @Ignored test:

Read the comment explaining why it is ignored
Determine if the bug has been fixed (search JIRA for the referenced issue number)
If the fix exists, the test can likely be re-enabled — this is a contributor opportunity

Step 8: Find a Transition with No Test Coverage

Pick three transition handler classes from your transition table. For each, check if TestVertexImpl has a test that exercises that handler:

# Example: does TestVertexImpl test TaskCompletedTransition?
grep -n "TaskCompletedTransition\|taskCompletedTransition" \
  tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java | head -5

# If none found, search for tests that trigger V_TASK_COMPLETED
grep -n "V_TASK_COMPLETED\|VertexEventType.V_TASK_COMPLETED" \
  tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java | head -10

Identify one transition that appears to have insufficient test coverage and document it. This is a potential Test JIRA issue you could file and fix.

Deliverable: Your Transition Table

Produce a table in this format (populate all rows from code):

| Source State      | Event Type          | Destination       | Handler Class            |
|---|---|---|---|
| NEW               | V_INIT              | INITIALIZING      | InitTransition           |
| INITIALIZING      | V_INIT_DONE         | INITED / FAILED   | InitedTransition         |
| INITED            | V_START             | RUNNING           | StartTransition          |
| RUNNING           | V_TASK_COMPLETED    | RUNNING/SUCCEEDED | TaskCompletedTransition  |
| ...               | ...                 | ...               | ...                      |

Your table should have at least 30 rows (covering the main execution paths). Recovery states are optional for this level.

Expected Output

A complete (or near-complete) state transition table for VertexImpl
Answers to all questions in Steps 4–6 with file:line references
List of @Ignored tests with your assessment of whether they could be re-enabled
One transition identified as having insufficient test coverage

Stretch Goals

Produce the same transition table for TaskImpl and TaskAttemptImpl. Compare their complexity (number of states and transitions) to VertexImpl.
Find all places where VertexImpl calls eventHandler.handle() to post an event to another state machine. What are the target state machines and what event types are used?
```
grep -n "eventHandler.handle" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | grep -v "VertexEvent" | head -20
```
Find the V_PARALLELISM_UPDATED transition — what does it do, and why is it one of the most bug-prone transitions in the state machine?

Lab 4.2: VertexManager Deep Dive

Background

VertexManager is the hook that makes Tez more than just a DAG scheduler. By plugging in a custom VertexManagerPlugin, applications can implement dynamic parallelism, slow start, skew handling, and custom task scheduling — without modifying the core AM.

This lab walks through the two built-in VertexManager implementations, explains their behaviors via code reading, and ends with a minimal custom VertexManagerPlugin that you write and unit-test.

The `VertexManagerPlugin` Contract

Full source: tez-api/src/main/java/org/apache/tez/dag/api/VertexManagerPlugin.java

// Abridged. Signatures differ slightly between Tez versions;
// treat the file in your checkout as authoritative.
public abstract class VertexManagerPlugin {
    private final VertexManagerPluginContext context;

    // The AM instantiates the plugin reflectively and passes the context to this constructor
    public VertexManagerPlugin(VertexManagerPluginContext context) { ... }

    // Subclasses reach the AM through this accessor
    public final VertexManagerPluginContext getContext() { ... }

    // Lifecycle callbacks a plugin provides:
    public abstract void initialize() throws Exception;
    public abstract void onVertexStarted(List<TaskAttemptIdentifier> completions)
        throws Exception;
    public abstract void onSourceTaskCompleted(
        TaskAttemptIdentifier completedSrcTaskAttempt) throws Exception;
    public abstract void onVertexManagerEventReceived(
        VertexManagerEvent vmEvent) throws Exception;
    // Called when a root input's initializer has finished (root inputs only):
    public abstract void onRootVertexInitialized(String inputName,
        InputDescriptor inputDescriptor, List<Event> events) throws Exception;
}

`VertexManagerPluginContext` — what the plugin can call back into

find tez-api/src/main/java -name "VertexManagerPluginContext.java"
cat $(find tez-api/src/main/java -name "VertexManagerPluginContext.java")

Key methods on the context:

Method	What it does
`scheduleTasks(List<ScheduleTaskRequest>)`	Schedules the given tasks for execution; older releases expose `scheduleVertexTasks(List<TaskWithLocationHint>)` instead
`reconfigureVertex(int parallelism, VertexLocationHint, Map<String,EdgeProperty>)`	Changes parallelism and/or edge properties at runtime
`getVertexNumTasks(String vertexName)`	Returns the current parallelism of a named vertex
`getVertexName()`	Returns this vertex's name; pass it to `getVertexNumTasks()` to read this vertex's own parallelism
`getInputVertexEdgeProperties()`	Returns the `EdgeProperty` for each input edge, keyed by source vertex name
`sendEventToProcessor(Collection<CustomProcessorEvent>, int taskId)`	Sends custom events to the processor of one task of this vertex

Reading `ImmediateStartVertexManager`

find tez-dag/src/main/java -name "ImmediateStartVertexManager.java"
cat $(find tez-dag/src/main/java -name "ImmediateStartVertexManager.java")

Answer these questions from the code:

In initialize(): does ImmediateStartVertexManager do anything? If not, why does it exist?
In onVertexStarted(): does it schedule tasks immediately or wait for anything?
What TaskWithLocation does it create for each task? Does it provide any location hints?
Does it implement onSourceTaskCompleted()? If so, what does it do?

Expected finding: ImmediateStartVertexManager is intentionally minimal. Its purpose is to provide a named, testable implementation that schedules all tasks immediately with no location hints. It is the baseline from which ShuffleVertexManager diverges.

Reading `ShuffleVertexManager`

# ShuffleVertexManager does not live in the tez-dag module, so search the whole checkout
find . -name "ShuffleVertexManager.java"
wc -l $(find . -name "ShuffleVertexManager.java")

Slow Start

Find the slow-start logic in onSourceTaskCompleted().

grep -n "minFraction\|maxFraction\|min-src-fraction\|completedSourceTasks\|pendingTasksToSchedule" \
  $(find . -name "ShuffleVertexManager.java") | head -20

Answer:

What is the variable that tracks how many source tasks have completed?
At what fraction does ShuffleVertexManager start scheduling tasks?
What is the formula: at fraction F between minFraction and maxFraction, what percentage of downstream tasks are scheduled?
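
Once you have an answer to the last question, you can check it against a small model. The sketch below assumes a linear ramp between the two fractions; that assumption, the class name `SlowStart`, and the example values are mine, not taken from the Tez source, so confirm the exact arithmetic and rounding in the code you read.

```java
/** Sketch of linear slow-start; an assumption about the shape, not a copy of Tez code. */
final class SlowStart {
    /**
     * Returns how many of totalTasks should have been scheduled once the given
     * fraction of source tasks has completed.
     */
    static int tasksToHaveScheduled(int totalTasks, float completedFraction,
                                    float minFraction, float maxFraction) {
        if (completedFraction < minFraction) {
            return 0;                       // below the threshold: hold everything back
        }
        float range = maxFraction - minFraction;
        float scheduleFraction = (range > 0f)
            ? Math.min(1f, (completedFraction - minFraction) / range)
            : 1f;                           // min == max: release everything at once
        return (int) (totalTasks * scheduleFraction);
    }
}
```

With `minFraction = 0.25` and `maxFraction = 0.75` as example settings, a vertex with 100 tasks would have none scheduled at 10% source completion, 50 at 50%, and all 100 from 75% onwards. A manager built this way schedules the difference between this number and what it has already scheduled each time a source task completes.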

Auto-Parallelism

Find the auto-parallelism logic:

grep -n "reconfigureVertex\|numBipartiteSourceTasks\|desiredTaskInputSize\|targetParallelism" \
  $(find . -name "ShuffleVertexManager.java") | head -20

Answer:

What configuration key enables auto-parallelism?
What information does ShuffleVertexManager use to compute the optimal parallelism?
When is context.reconfigureVertex() called?
What is the minimum parallelism ShuffleVertexManager will ever set (the floor)?
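
As a reference point while you read, here is a simplified model of size-based parallelism selection. It is a sketch under stated assumptions, not the Tez algorithm: the class `AutoParallelism`, its parameter names, and the decision to only ever reduce parallelism are mine. The real code also has to decide how to regroup partitions across the smaller number of tasks, which this sketch ignores.

```java
/** Sketch of size-based parallelism selection; simplified and hypothetical. */
final class AutoParallelism {
    /**
     * Extrapolates total source output from the tasks that have reported so far
     * and picks enough downstream tasks to give each roughly desiredTaskInputBytes.
     */
    static int desiredParallelism(long observedOutputBytes, int completedSourceTasks,
                                  int totalSourceTasks, long desiredTaskInputBytes,
                                  int minParallelism, int currentParallelism) {
        if (completedSourceTasks <= 0 || desiredTaskInputBytes <= 0) {
            return currentParallelism;               // nothing to extrapolate from
        }
        double expectedTotalBytes =
            (double) observedOutputBytes * totalSourceTasks / completedSourceTasks;
        int desired = (int) Math.ceil(expectedTotalBytes / desiredTaskInputBytes);
        desired = Math.max(desired, minParallelism); // the floor
        return Math.min(desired, currentParallelism); // this sketch never grows the vertex
    }
}
```

Note the two guards at the top. Dividing by the number of completed source tasks or by the desired input size is exactly the kind of arithmetic that needs a zero check, a theme that returns in Lab 4.4.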

`VertexManagerEvent` handling

When auto-parallelism is enabled, each upstream task sends a VertexManagerEvent to the downstream VertexManagerPlugin containing statistics about its output (byte count, record count, partition sizes).

grep -n "VertexManagerEvent\|onVertexManagerEventReceived\|vmEvent" \
  $(find . -name "ShuffleVertexManager.java") | head -15

Answer:

What protobuf message is decoded from the event payload?
What statistic is accumulated across all events?
How does ShuffleVertexManager use the accumulated statistics to decide on new parallelism?

Write a Minimal Custom VertexManager

Create a CountingVertexManager that:

Schedules 50% of tasks immediately when onVertexStarted() is called
Schedules the remaining tasks when all source tasks have completed
Logs the number of scheduled tasks at each scheduling call

This is the core pattern of slow-start, stripped to its minimum.

Implementation skeleton

package org.apache.tez.dag.library.vertexmanager;

// Imports and signatures follow recent Tez releases; adjust them to your checkout.
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.VertexManagerPlugin;
import org.apache.tez.dag.api.VertexManagerPluginContext;
import org.apache.tez.dag.api.VertexManagerPluginContext.ScheduleTaskRequest;
import org.apache.tez.runtime.api.Event;
import org.apache.tez.runtime.api.TaskAttemptIdentifier;
import org.apache.tez.runtime.api.events.VertexManagerEvent;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.List;

public class CountingVertexManager extends VertexManagerPlugin {

    private static final Logger LOG =
        LoggerFactory.getLogger(CountingVertexManager.class);

    private int totalSourceTasks = 0;
    private int completedSourceTasks = 0;
    private boolean secondBatchScheduled = false;
    private int totalTasksToSchedule = 0;

    // The AM creates plugins reflectively through this one-argument constructor
    public CountingVertexManager(VertexManagerPluginContext context) {
        super(context);
    }

    @Override
    public void initialize() {
        // This vertex's own parallelism
        totalTasksToSchedule =
            getContext().getVertexNumTasks(getContext().getVertexName());
        // Count source tasks across all input vertices
        for (String inputVertex : getContext().getInputVertexEdgeProperties().keySet()) {
            totalSourceTasks += getContext().getVertexNumTasks(inputVertex);
        }
    }

    @Override
    public void onVertexStarted(List<TaskAttemptIdentifier> completions) {
        // Schedule the first 50% (rounded down); no location hints
        int firstBatch = totalTasksToSchedule / 2;
        List<ScheduleTaskRequest> toSchedule = new ArrayList<>();
        for (int i = 0; i < firstBatch; i++) {
            toSchedule.add(ScheduleTaskRequest.create(i, null));
        }
        LOG.info("CountingVertexManager: scheduling first batch of {} tasks", firstBatch);
        getContext().scheduleTasks(toSchedule);
    }

    @Override
    public void onSourceTaskCompleted(TaskAttemptIdentifier completedSrcTaskAttempt) {
        // Note: a vertex with no source tasks never reaches this callback,
        // so its second batch is never scheduled. Decide how to handle that.
        completedSourceTasks++;
        if (!secondBatchScheduled && completedSourceTasks >= totalSourceTasks) {
            // Schedule the remaining tasks
            int firstBatch = totalTasksToSchedule / 2;
            List<ScheduleTaskRequest> toSchedule = new ArrayList<>();
            for (int i = firstBatch; i < totalTasksToSchedule; i++) {
                toSchedule.add(ScheduleTaskRequest.create(i, null));
            }
            LOG.info("CountingVertexManager: scheduling second batch of {} tasks",
                toSchedule.size());
            getContext().scheduleTasks(toSchedule);
            secondBatchScheduled = true;
        }
    }

    @Override
    public void onVertexManagerEventReceived(VertexManagerEvent vmEvent) {
        // No-op: we don't need statistics for this simple implementation
    }

    @Override
    public void onRootVertexInitialized(String inputName,
            InputDescriptor inputDescriptor, List<Event> events) {
        // No-op: this manager does not react to root input initialization
    }
}

Implementation tasks

Identify the correct API method: getContext().scheduleTasks() vs getContext().scheduleVertexTasks() — check which one exists in your version of the API.
Write a unit test using MockVertexManagerPluginContext (if it exists) or a mock:
- Initialize the manager with parallelism = 10 and 4 source tasks
- Call onVertexStarted() — verify 5 tasks are scheduled
- Call onSourceTaskCompleted() 4 times — verify remaining 5 tasks are scheduled on the 4th call
- Verify secondBatchScheduled is true after

To use the plugin in a DAG, attach it to the vertex it should manage:

Vertex reduceVertex = Vertex.create("reducer",
    ProcessorDescriptor.create(MyReducer.class.getName()), 10);
reduceVertex.setVertexManagerPlugin(
    VertexManagerPluginDescriptor.create(CountingVertexManager.class.getName()));

Finding the VertexManager Test Utilities

# Find mock context for testing
find . -path "*/src/test/*" \( -name "*Mock*Vertex*" -o -name "Test*VertexManager*" \) | grep -v "\.class$"

# Find TestShuffleVertexManager
find . -name "TestShuffleVertexManager.java" | grep test

Read TestShuffleVertexManager.java to understand how VertexManager tests are structured. The test creates a mock context, calls lifecycle methods in order, and asserts which tasks were scheduled.
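
The same test shape can be shown without Tez or Mockito on the classpath. The sketch below is a plain-Java model, not Tez code: `HalfThenRestModel`, its `Scheduler` interface, and `RecordingScheduler` are invented stand-ins for the plugin, the context, and the mock. It mirrors the scenario from the implementation tasks above: create a recording fake, drive the callbacks in order, and assert on what was scheduled.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of "fake context, drive callbacks, assert on scheduling". */
final class HalfThenRestModel {
    /** Stand-in for the single context method the manager calls. */
    interface Scheduler {
        void schedule(List<Integer> taskIndexes);
    }

    /** Hand-rolled recording fake; a Mockito mock plays this role in real tests. */
    static final class RecordingScheduler implements Scheduler {
        final List<List<Integer>> calls = new ArrayList<>();

        @Override
        public void schedule(List<Integer> taskIndexes) {
            calls.add(new ArrayList<>(taskIndexes));   // defensive copy of each call
        }
    }

    private final Scheduler scheduler;
    private final int totalTasks;
    private final int totalSourceTasks;
    private int completedSourceTasks = 0;
    private boolean secondBatchScheduled = false;

    HalfThenRestModel(Scheduler scheduler, int totalTasks, int totalSourceTasks) {
        this.scheduler = scheduler;
        this.totalTasks = totalTasks;
        this.totalSourceTasks = totalSourceTasks;
    }

    /** First half goes out as soon as the vertex starts. */
    void onVertexStarted() {
        scheduler.schedule(range(0, totalTasks / 2));
    }

    /** Second half goes out exactly once, when the last source task completes. */
    void onSourceTaskCompleted() {
        completedSourceTasks++;
        if (!secondBatchScheduled && completedSourceTasks >= totalSourceTasks) {
            scheduler.schedule(range(totalTasks / 2, totalTasks));
            secondBatchScheduled = true;
        }
    }

    private static List<Integer> range(int from, int to) {
        List<Integer> out = new ArrayList<>();
        for (int i = from; i < to; i++) {
            out.add(i);
        }
        return out;
    }
}
```

In the real test, `RecordingScheduler` becomes a mock of `VertexManagerPluginContext`, and the assertions on `calls` become `verify(...)` calls with an `ArgumentCaptor`.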

Expected Output

Answers to all questions in the ImmediateStartVertexManager and ShuffleVertexManager sections, with file:line references
A working CountingVertexManager implementation that compiles
A unit test that passes for the two scheduling scenarios

Stretch Goals

Read CartesianProductVertexManager — the most complex VertexManager:
```
find . -name "CartesianProductVertexManager.java"
```
What computation does it coordinate? When is it used?
Find a ShuffleVertexManager related JIRA (search for "ShuffleVertexManager" in JIRA). Read the issue description and the patch. What invariant was violated?
Implement a NoOpVertexManager that schedules no tasks (for testing DAG failure paths). Use it in a test DAG and verify the vertex fails with FAILED status after the timeout.

Lab 4.3 — Build It: `WavingVertexManager`

Lab type: Build It — VertexManagerPlugin with full JUnit + Mockito test suite
Estimated time: 120–150 min
Maven module: book/projects/level-4-waving-manager
Key class: org.apache.tez.learning.l4.WavingVertexManager

What You Will Build

A VertexManagerPlugin that schedules tasks in configurable waves:

Wave 0: tasks 0 to waveSize-1
Wave 1: tasks waveSize to 2×waveSize-1
Wave N: starts only when all tasks in wave N-1 have succeeded

Wave size is read from UserPayload as "waveSize=N".
Default: WavingVertexManager.DEFAULT_WAVE_SIZE = 2.
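
Before opening the lab source, it may help to see the wave rule in isolation. The following is a self-contained model written for this page, not the lab's `WavingVertexManager`: the class `WaveModel` and its method names are hypothetical, and it works on plain task indexes with no Tez types. It parses the `"waveSize=N"` payload, hands out one wave at a time, and releases the next wave only when every scheduled task has succeeded.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Self-contained model of wave scheduling; not the lab's WavingVertexManager source. */
final class WaveModel {
    static final int DEFAULT_WAVE_SIZE = 2;

    private final int totalTasks;
    private final int waveSize;
    private final BitSet scheduled = new BitSet();
    private final BitSet finished = new BitSet();
    private int nextTaskToSchedule = 0;

    WaveModel(int totalTasks, String payload) {
        this.totalTasks = totalTasks;
        this.waveSize = parseWaveSize(payload);
    }

    /** Parses "waveSize=N"; anything else falls back to the default. */
    static int parseWaveSize(String payload) {
        if (payload != null && payload.startsWith("waveSize=")) {
            try {
                int n = Integer.parseInt(payload.substring("waveSize=".length()).trim());
                if (n > 0) {
                    return n;
                }
            } catch (NumberFormatException ignored) {
                // fall through to the default
            }
        }
        return DEFAULT_WAVE_SIZE;
    }

    /** Returns the indexes of the next wave (empty once every task is scheduled). */
    List<Integer> scheduleNextWave() {
        List<Integer> wave = new ArrayList<>();
        while (nextTaskToSchedule < totalTasks && wave.size() < waveSize) {
            scheduled.set(nextTaskToSchedule);
            wave.add(nextTaskToSchedule++);
        }
        return wave;
    }

    /** Records a success; returns the next wave if the current one is now complete. */
    List<Integer> onTaskSucceeded(int taskIndex) {
        finished.set(taskIndex);
        BitSet outstanding = (BitSet) scheduled.clone();   // copy: andNot mutates its receiver
        outstanding.andNot(finished);
        return outstanding.isEmpty() ? scheduleNextWave() : new ArrayList<Integer>();
    }
}
```

With 5 tasks and `waveSize=2`, the model hands out tasks 0 and 1, then 2 and 3 once both have succeeded, then the single remaining task 4.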

This is a minimal but complete VertexManagerPlugin — the same architectural pattern used by ImmediateStartVertexManager, ShuffleVertexManager, and the VertexManagerPlugin inside every Hive-on-Tez reduce vertex.

Step 1 — Understand the VertexManagerPlugin Contract

Before reading any code, open the Tez source:

find ~/tez-src -name "VertexManagerPlugin.java" | head -3
find ~/tez-src -name "VertexManagerPluginContext.java" | head -3
find ~/tez-src -name "ImmediateStartVertexManager.java" | head -3

Read all three files completely. Then answer:

#	Question
1	What are all the lifecycle callback methods in `VertexManagerPlugin`? List them.
2	When does the Tez AM call `initialize()`? Can you call `scheduleVertexTasks()` from inside `initialize()`?
3	What does `VertexManagerPluginContext.scheduleVertexTasks(List<ScheduleTaskRequest>)` actually do to the DAG execution engine?
4	`ImmediateStartVertexManager.onVertexStarted()` calls `scheduleAllTasks()`. Does it call `scheduleVertexTasks` once (all tasks in one list) or once per task? Why does that matter for performance?
5	What is the purpose of `VertexManagerPluginContext.reconfigureVertex()`? Does `WavingVertexManager` use it?

Step 2 — Compile and Run the Tests

cd /path/to/apache-tez/book/projects
mvn -pl level-4-waving-manager test

Expected:

Tests run: 13, Failures: 0, Errors: 0, Skipped: 0

Step 3 — Read the Source Code

Open WavingVertexManager.java and work through every section.

`initialize()`

#	Question
1	The payload is parsed as `"waveSize=N"`. Where in a real DAG would you set this payload? (Hint: `VertexManagerPluginDescriptor.setUserPayload()` in `DAG.create()`)
2	Why does `initialize()` store `totalTasks` from the context rather than accepting it as a constructor argument?
3	If the user sets `waveSize=1000` but there are only 5 tasks, what happens? Is there a bug?
4	Why are `scheduled` and `waveFinished` `BitSet`s rather than `List<Integer>`? What is the time complexity of `BitSet.andNot()`?

`onVertexStarted()`

#	Question
1	The `completions` map passed to `onVertexStarted` is ignored. Under what condition would a real plugin need to process it?
2	Why is `scheduleNextWave()` called here and not from `initialize()`?

`onTaskAttemptCompleted()`

#	Question
1	Failed attempts are silently ignored (`if (!successful) return`). What should a production plugin do instead?
2	`checkAndScheduleNextWave()` clones `scheduled` to avoid mutating it. What subtle bug would occur without the clone?
3	Trace through the state machine for 4 tasks, waveSize=2. Draw the state of `scheduled` and `waveFinished` after each callback.

`scheduleNextWave()`

#	Question
1	The while loop has two conditions: `nextTaskToSchedule < totalTasks` AND `count < waveSize`. Which terminates the loop for the last wave if the number of tasks is not a multiple of waveSize?
2	The `scheduled.get(idx)` guard protects against double-scheduling. In what scenario could `idx` already be set? (Hint: look at the `testTaskNotScheduledTwice` test.)

Step 4 — Read the Test Suite

Open TestWavingVertexManager.java. For each test, before reading the assertions:

Read the test name
Predict what the test will assert
Then read the actual assertions and compare to your prediction

Pay particular attention to how Mockito is used:

Mockito call	What it does
`mock(VertexManagerPluginContext.class)`	Creates a fake context that records all calls
`when(ctx.getVertexNumTasks(...)).thenReturn(6)`	Stubs a specific return value
`verify(ctx, times(2)).scheduleVertexTasks(anyList())`	Asserts the method was called exactly twice
`ArgumentCaptor.forClass(List.class)`	Captures the actual argument for deep inspection

Questions

#	Question
1	`testThreeWavesForSixTasks` is an integration test of the entire scheduling lifecycle. Which individual unit tests cover the sub-cases that this test depends on?
2	`testPartialWave0DoesNotTriggerWave1` verifies the negative case (wave NOT triggered). How does `verify(times(1))` prove this? Could you use `verifyNoMoreInteractions()` instead?
3	The test class has a `@Before setUp()` method. What happens if you remove it and inline `mockContext = mock(...)` into each test instead?

Step 5 — Break It: Three Experiments

Experiment A — Remove the `if (!successful) return` guard

Delete the early-return in onTaskAttemptCompleted. Run:

mvn -pl level-4-waving-manager test -Dtest=TestWavingVertexManager#testFailedAttemptDoesNotAdvanceWave

Which test fails?
What is the actual vs. expected scheduleVertexTasks call count?
Why does treating failures as successes cause premature wave advancement?

Experiment B — Remove the `BitSet.clone()` in `checkAndScheduleNextWave`

Change:

BitSet scheduledCopy = (BitSet) scheduled.clone();
scheduledCopy.andNot(waveFinished);

to:

scheduled.andNot(waveFinished);

Run the full test suite.

Which tests fail?
What data corruption does this mutation cause? Trace through testThreeWavesForSixTasks manually.
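
The mechanism behind this experiment is easy to reproduce outside the lab. The sketch below uses only `java.util.BitSet`; the class `AndNotDemo` and its two methods are invented for illustration and are not the lab's code. It contrasts a completion check that works on a copy with one that calls `andNot` on the live set.

```java
import java.util.BitSet;

/** Demonstrates why andNot needs a copy: it mutates the BitSet it is called on. */
final class AndNotDemo {
    /** Safe: works on a clone and leaves 'scheduled' untouched. */
    static boolean waveCompleteSafe(BitSet scheduled, BitSet finished) {
        BitSet outstanding = (BitSet) scheduled.clone();
        outstanding.andNot(finished);
        return outstanding.isEmpty();
    }

    /** Unsafe: clears the finished bits out of 'scheduled' itself as a side effect. */
    static boolean waveCompleteUnsafe(BitSet scheduled, BitSet finished) {
        scheduled.andNot(finished);
        return scheduled.isEmpty();
    }
}
```

Both methods return the same answer on the first call. The difference is what `scheduled` contains afterwards, and therefore what any later guard that reads `scheduled` will decide.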

Experiment C — Change `count < waveSize` to `count <= waveSize`

In scheduleNextWave(), change the loop condition.

How many tasks does wave 0 now schedule?
Which test catches this?

Step 6 — Add a New Feature: `onVertexManagerEventReceived`

The real ShuffleVertexManager uses onVertexManagerEventReceived to receive partition statistics from map tasks. Add support for a simple variant:

Implement the callback (the `VertexManagerPlugin` contract shown in Lab 4.2 delivers one event per call):

@Override
public void onVertexManagerEventReceived(
        VertexManagerEvent vmEvent) throws Exception {
    // If the event's user payload contains "skip=true", mark
    // that task as finished so it does not block wave advancement.
    // TODO: parse the payload for "skip=true"; if present, call
    //       onTaskAttemptCompleted(taskIndex, true) to release the wave
}

Write a test for this method:

@Test
public void testSkipEventReleasesWave() {
    // set up 4 tasks, wave size 2
    // trigger onVertexStarted (wave 0: tasks 0,1)
    // send a VertexManagerEvent for task 0 with payload "skip=true"
    // verify task 0 is treated as done for wave-completion purposes
}
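
For the payload check itself, one possible shape is sketched below. It assumes the payload reaches you as a `ByteBuffer` of UTF-8 text holding comma-separated `key=value` pairs; that format, the class `SkipPayload`, and the method `isSkip` are assumptions made for this exercise, not a Tez convention.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for the "skip=true" payload convention used in this step. */
final class SkipPayload {
    /** Returns true if the payload text contains the pair "skip=true". */
    static boolean isSkip(ByteBuffer payload) {
        if (payload == null) {
            return false;
        }
        // duplicate() so that decoding does not move the caller's buffer position
        String text = StandardCharsets.UTF_8.decode(payload.duplicate()).toString();
        for (String pair : text.split(",")) {
            if (pair.trim().equals("skip=true")) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the parsing in a small static method like this lets you unit-test it separately from the wave logic.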

Step 7 — Tez Source Connection Table

Fill in the right-hand column with the path of each file in your Tez checkout:

Class used in this project	Tez source file
`VertexManagerPlugin`
`VertexManagerPluginContext`
`ScheduleTaskRequest`
`ImmediateStartVertexManager`
`ShuffleVertexManager`

Step 8 — ShuffleVertexManager Deep Dive

Open ShuffleVertexManager.java in the Tez source:

find ~/tez-src -name "ShuffleVertexManager.java"

Read onVertexStarted(). Does it schedule tasks immediately like WavingVertexManager, or does it wait? What does it wait for?
Find the slow-start fraction fields (`slowStartMinFraction` and `slowStartMaxFraction`). How do they determine when to start scheduling?
Find where reconfigureVertex() is called. What does it change about the vertex?
How does ShuffleVertexManager prevent double-scheduling? Compare its guard to the scheduled BitSet in WavingVertexManager.
ShuffleVertexManager has ~700 lines. Identify the 5 most important methods (the ones that contain the core scheduling logic) and list them.

Step 9 — JIRA Research: VertexManager Bugs

Search:

project = TEZ AND component = "tez-dag" AND text ~ "VertexManager" AND resolution = Fixed

Find one resolved issue where a VertexManagerPlugin had a scheduling bug.

What was the bug? (Race condition? Double scheduling? Wrong wave boundary?)
What was the fix?
Was a test added? What does it mock?

Lab 4.4 — Fix It: Null Dereference in `ShuffleVertexManager` on Zero-Partition Source

Lab type: Fix-It — reproduce → locate → write failing test → patch → verify → format patch
Estimated time: 120–150 min
Tez component: tez-runtime-library → org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager

Background

ShuffleVertexManager uses partition statistics sent by map tasks to decide when to start reduce tasks (slow-start) and how many reducers to run (auto-parallelism). It processes these statistics via onVertexManagerEventReceived().

A long-standing bug category in this path: when a source vertex has zero output partitions (all records were filtered, or the vertex ran with zero tasks), the plugin can receive a VertexManagerEvent whose payload encodes 0 partitions. In several versions of Tez, this caused a NullPointerException or ArithmeticException (divide by zero) deep in the statistics-processing path, because the code assumed at least one partition existed.

This lab reproduces the bug pattern in a unit test, locates the exact guard that is missing, applies the fix, and submits a patch.

Step 1 — Locate the Source File

find ~/tez-src -name "ShuffleVertexManager.java" | head -5

Expected (with ~ expanded to your home directory):

~/tez-src/tez-runtime-library/src/main/java/org/apache/tez/dag/library/vertexmanager/ShuffleVertexManager.java

Also locate the test file:

find ~/tez-src -name "TestShuffleVertexManager.java" | head -5

Step 2 — Read the Statistics Path

In ShuffleVertexManager.java, find the method that processes VertexManagerEvent payloads. It will have a call to ShuffleVertexManagerBase.parseStatsHeader() or similar, and will work with numPartitions or partitionCount.

Trace the complete call chain from onVertexManagerEventReceived() to the line that first uses the partition count arithmetically.

Questions

#	Question
1	What is the name of the proto-based payload class that encodes partition statistics?
2	Which method extracts the partition count from the payload?
3	On what line does the first arithmetic operation involving the partition count occur?
4	Is there a null-check or zero-check before that line?
5	What exception would result if `partitionCount == 0` at that line?

Step 3 — Find the Existing Test

find ~/tez-src -name "TestShuffleVertexManager.java"

Open it and search for any test that covers the zero-partition case:

grep -in "zero\|0.*partition\|partition.*0" \
  "$(find ~/tez-src -name "TestShuffleVertexManager.java" | head -1)" | head -20

Note: in most Tez versions there is no such test — that is the gap you will fill.

Step 4 — Write the Reproducing Test

Add the following test to TestShuffleVertexManager.java. The exact helper methods depend on the version you have; adapt the setup pattern from the nearest existing test (look for testAutoParallelism or testSlowStart).

@Test(expected = Exception.class)   // replace Exception with the specific type you observe
public void testZeroPartitionSourceDoesNotCrash() throws Exception {
    // TODO: set up a ShuffleVertexManager with auto-parallelism enabled
    // TODO: send a VertexManagerEvent with numPartitions = 0
    // TODO: call onVertexManagerEventReceived with that event
    //       The call should NOT throw — once fixed.
    //       Mark expected = Exception.class so the test initially *passes*
    //       when the bug exists (the code throws), then change to asserting
    //       no throw after the fix is applied.
}

Run:

cd ~/tez-src
mvn test -pl tez-runtime-library -Dtest=TestShuffleVertexManager#testZeroPartitionSourceDoesNotCrash -q 2>&1 | tail -30

Record: which exception is thrown and on which line.

Step 5 — Apply the Fix

In ShuffleVertexManager.java, add a guard at the point identified in Step 2.

Rules

The guard must be minimal: either if (partitionCount == 0) { return; } to skip the event, or if (partitionCount == 0) { partitionCount = 1; } to normalise it. Choose the semantically correct one: which is safer for scheduling?
Do not reformat surrounding code
Do not change method signatures
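
The "skip the event" variant of the guard has the following general shape. This is a sketch with hypothetical names (`PartitionStats`, `averagePartitionBytes`), not the actual Tez method; the point is only where the check sits relative to the division.

```java
/** Sketch of a zero-partition guard; class and method names are hypothetical. */
final class PartitionStats {
    /**
     * Returns the average bytes per partition, or -1 when the event carries no
     * partitions and should be skipped by the caller.
     */
    static long averagePartitionBytes(long[] partitionBytes) {
        if (partitionBytes == null || partitionBytes.length == 0) {
            return -1L;                    // guard: skip instead of dividing by zero
        }
        long total = 0L;
        for (long bytes : partitionBytes) {
            total += bytes;
        }
        return total / partitionBytes.length;
    }
}
```

In the real patch the guard should sit immediately before the first arithmetic use you identified in Step 2, so that the diff stays a few lines long.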

Step 6 — Update the Test

Now that the fix is applied, update the test:

@Test
public void testZeroPartitionSourceDoesNotCrash() throws Exception {
    // Same setup as before
    // This time assert NO exception is thrown
    // Optionally assert that scheduling state is unchanged
}

Run the full test suite of the module that contains ShuffleVertexManager (tez-runtime-library):

mvn test -pl tez-runtime-library -q 2>&1 | tail -20

All tests must pass.

Step 7 — Checkstyle

mvn checkstyle:check -pl tez-runtime-library -q 2>&1 | grep -E "ERROR|WARNING|violation" | head -20

Zero violations required.

Step 8 — Format the Patch

cd ~/tez-src
git diff > /tmp/TEZ-ZEROPART.001.patch
cat /tmp/TEZ-ZEROPART.001.patch

Checklist:

Only ShuffleVertexManager.java and TestShuffleVertexManager.java modified
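No trailing whitespace on added lines: grep -nP '^\+.*[ \t]+$' /tmp/TEZ-ZEROPART.001.patch prints nothing (a bare \s+$ would also match the blank context lines of the diff)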
Patch applies cleanly: git apply --check /tmp/TEZ-ZEROPART.001.patch
All tests pass after git apply

Step 9 — Write the JIRA Description

Summary: ShuffleVertexManager throws [ExceptionType] when source vertex
         has zero output partitions

Description:
  When a source vertex completes with zero output partitions (all records
  filtered or vertex ran zero tasks), ShuffleVertexManager.onVertexManagerEventReceived
  receives a VertexManagerEvent with partitionCount=0.  The statistics
  processing path performs arithmetic on this value without a zero guard,
  causing [ExceptionType] at [ClassName].java:[line].

  Steps to reproduce:
    See attached TestShuffleVertexManager#testZeroPartitionSourceDoesNotCrash.

  Fix:
    Add a zero-partition guard at [method name], line [N].
    Skip or normalise the event when partitionCount == 0.

Priority: Major
Component: tez-dag
Affects Version: 0.10.x

Step 10 — Deeper Understanding

After completing the fix, answer these questions by reading ShuffleVertexManager.java:

#	Question
1	What is the `slowStartMinFraction` and `slowStartMaxFraction` used for? At what point in the scheduling lifecycle are they checked?
2	When does `ShuffleVertexManager` call `reconfigureVertex()`? What does it change?
3	What data structure accumulates partition statistics across multiple `VertexManagerEvent` calls? Why accumulate rather than process each event independently?
4	The test class uses `mock(VertexManagerPluginContext.class)`. Compare this to `TestWavingVertexManager` — what additional interactions does `ShuffleVertexManager` have with the context that `WavingVertexManager` does not?
5	Search for all places in `ShuffleVertexManager` where a divide-by-zero could theoretically occur. List them.

Level 5: Testing Infrastructure

Apache Tez has one of the most complete test suites in the Hadoop ecosystem: thousands of unit tests, a MiniTezCluster integration harness, and a TestOrderedWordCount end-to-end reference. At this level you will move from reading tests to writing them — adding missing coverage to TestVertexImpl, submitting a real DAG against MiniTezCluster, and finding and fixing a flaky test.

Why testing matters for contributors

Every Tez patch must include either (a) a new test that fails without the patch and passes with it, or (b) a clear justification in the JIRA for why a test is not needed. Committers will block patches that regress existing tests or that add unverified logic.

What this level covers

Topic	Where
`MiniTezCluster` setup/teardown lifecycle	Lab 5.1
`TestOrderedWordCount` as the canonical integration test template	Lab 5.1
Adding a missing `TestVertexImpl` transition test	Lab 5.2
Writing a full mini-cluster integration test for your own DAG	Lab 5.3
Identifying, reproducing, and fixing a flaky test	Lab 5.4

Prerequisites

Level 4 complete (you understand the VertexImpl state machine and VertexManagerPlugin)
Tez source checked out and mvn install -DskipTests succeeded

Test categories and Maven commands

Category	What it tests	Command
Unit	Single class in isolation with mocks	`mvn test -pl tez-dag -Dtest=TestVertexImpl`
Mini-cluster integration	Full AM + YARN + HDFS in-process	`mvn test -pl tez-tests -Dtest=TestOrderedWordCount`
System	Real cluster (CI only)	Not run locally

Key test classes

Class	Module	What it covers
`TestVertexImpl`	`tez-dag`	`VertexImpl` state machine, transitions, vertex recovery
`TestDAGImpl`	`tez-dag`	`DAGImpl` state machine, DAG-level events
`TestTaskImpl`	`tez-dag`	`TaskImpl` scheduling, speculation, counters
`TestTaskAttemptImpl`	`tez-dag`	`TaskAttemptImpl` state transitions
`TestOrderedWordCount`	`tez-tests`	End-to-end DAG submission against MiniTezCluster
`TestMiniTezClusterWithTez`	`tez-tests`	Multi-DAG runs, recovery, kill scenarios

Expected outcome

By the end of this level you will have:

Run a DAG against MiniTezCluster inside a JUnit test
Added a missing state-machine transition test to TestVertexImpl
Identified and fixed a flaky test (or documented why it flakes)

Lab 5.1 — Explore `MiniTezCluster` and `TestOrderedWordCount`

Lab type: Read & Run
Estimated time: 90 min
Tez module: tez-tests
Key class: org.apache.tez.test.TestOrderedWordCount

Overview

MiniTezCluster spins up an in-process YARN ResourceManager and NodeManager(s) that can launch the Tez ApplicationMaster and its tasks. Tests that need HDFS pair it with Hadoop's MiniDFSCluster, which provides an in-process NameNode and DataNode. Together they let you submit real DAGs from a JUnit test with no external infrastructure.

TestOrderedWordCount is the canonical example: it submits a multi-stage word-count DAG (tokenize → partition → sort → count) and asserts correct output.

Step 1 — Locate the Files

find ~/tez-src -name "MiniTezCluster.java" | head -5
find ~/tez-src -name "TestOrderedWordCount.java" | head -5
find ~/tez-src -name "TestMiniTezClusterWithTez.java" | head -5

Step 2 — Read `MiniTezCluster.java`

Open MiniTezCluster.java and answer:

#	Question
1	What superclass does `MiniTezCluster` extend? What Hadoop class sets up the in-process YARN cluster?
2	Where is `TezConfiguration` created and how is it modified to use the in-process services?
3	What is the purpose of the `serviceStart()` method? What does it start?
4	After `serviceStop()`, can you call `serviceStart()` again on the same instance? Why or why not?
5	Where does MiniTezCluster write its temporary data (HDFS files, YARN work dirs)? How would a test clean this up?

Step 3 — Read `TestOrderedWordCount.java`

Work through the test lifecycle:

3a — `@BeforeClass setUpClass()`

#	Question
1	How many NodeManagers does the test cluster start with?
2	After `miniTezCluster.start()`, what call copies the Tez auxiliary service config?
3	Where are test input files created — on HDFS or local FS?
4	Is a new `TezClient` created per test or per class?

3b — `@Test testOrderedWordCount()`

#	Question
1	Trace the method calls from `TezClient.submitDAG()` to when the test receives the final `DAGStatus`.
2	What does the assertion verify — DAG state, output correctness, or counter values?
3	If you wanted to assert on a specific counter (e.g. `TaskCounter.INPUT_RECORDS_PROCESSED`), where in the test would you add that assertion?

3c — `@AfterClass tearDownClass()`

#	Question
1	What is the order of shutdown calls? Does the `TezClient` stop before or after the cluster?
2	Does the test delete the HDFS working directory? Should it?

Step 4 — Run the Test

cd ~/tez-src
mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -20

Expected:

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

If you see Unable to find class: org.apache.tez.test.TestOrderedWordCount, ensure mvn install -DskipTests completed successfully for all modules.

Step 5 — Measure the Overhead

Time the test:

time mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 | tail -3

Record how long it takes. Then answer:

Is the bottleneck cluster startup, DAG execution, or cluster shutdown? (Hint: add -Dorg.apache.tez.test.MiniTezCluster.log.level=DEBUG and look at the timestamps.)
Why is @BeforeClass used instead of @Before? What is the performance difference?

Step 6 — Find More Integration Tests

find ~/tez-src/tez-tests -name "Test*.java" | xargs grep -l "MiniTezCluster" | head -10

Pick one that is NOT TestOrderedWordCount. Read its @BeforeClass and one @Test method. Answer:

What scenario does this test cover that TestOrderedWordCount does not?
Does it use a separate MiniTezCluster instance, or the same one reused across multiple test classes? How?

Step 7 — Source Connection Table

Class used in this lab	Tez source file (relative to repo root)
`MiniTezCluster`
`TezClient`
`TezConfiguration`
`DAGStatus`
`MiniDFSCluster` (Hadoop helper)

Step 8 — JIRA Research

Search:

project = TEZ AND component = "tez-tests" AND resolution = Fixed ORDER BY updated DESC

Find a recent test-improvement JIRA.

What was added or fixed?
Does the patch include a new test, an existing test modification, or a flaky-test fix?

Lab 5.2 — Add a Missing `TestVertexImpl` Transition Test

Lab type: Fix-It (test improvement)
Estimated time: 90 min
Tez module: tez-dag
Key class: org.apache.tez.dag.app.dag.impl.TestVertexImpl

Overview

TestVertexImpl covers the VertexImpl state machine, but no test suite is ever complete. In this lab you will:

Read the state machine definition
Identify an untested transition
Write a JUnit test that exercises that transition
Verify the test can fail (for example, temporarily assert the wrong target state) and that it passes once the assertions are correct

This is the canonical entry point for new Tez contributors — many accepted patches are "add test coverage for transition X".

Step 1 — Locate the State Machine Definition

find ~/tez-src -name "VertexImpl.java" | head -3
grep -n "StateMachineFactory\|addTransition" \
  ~/tez-src/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
  | head -50

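The state machine is built with StateMachineFactory<VertexImpl, VertexState, VertexEventType, VertexEvent>. Each addTransition() call takes its arguments in this order:

pre-state (the current state)
post-state (the next state, or a set of candidate states for a multi-arc transition)
event type
transition hook (the action to run; optional)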
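To make the shape of a transition table concrete, here is a minimal pure-JDK sketch. It is not Hadoop's or Tez's StateMachineFactory: the class, the states, and the events are invented for illustration, and transition hooks are omitted. It only mirrors the idea that each registered (state, event) pair maps to a next state and that an unregistered pair is an error.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy transition table; state and event names are illustrative, not Tez's. */
final class ToyVertexStateMachine {
    enum State { NEW, INITED, RUNNING, SUCCEEDED, FAILED }
    enum Event { INIT, START, COMPLETE, FAIL }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    /** Registers one arc: from pre, on event, move to post. */
    ToyVertexStateMachine addTransition(State pre, State post, Event event) {
        table.computeIfAbsent(pre, s -> new EnumMap<Event, State>(Event.class))
             .put(event, post);
        return this;
    }

    /** Applies the event, or throws if no arc is registered for it. */
    State handle(Event event) {
        Map<Event, State> arcs = table.get(current);
        State next = (arcs == null) ? null : arcs.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " at " + current);
        }
        current = next;
        return current;
    }

    State getState() {
        return current;
    }

    static ToyVertexStateMachine create() {
        return new ToyVertexStateMachine()
            .addTransition(State.NEW, State.INITED, Event.INIT)
            .addTransition(State.INITED, State.RUNNING, Event.START)
            .addTransition(State.INITED, State.FAILED, Event.FAIL)
            .addTransition(State.RUNNING, State.SUCCEEDED, Event.COMPLETE)
            .addTransition(State.RUNNING, State.FAILED, Event.FAIL);
    }

    public static void main(String[] args) {
        ToyVertexStateMachine sm = create();
        sm.handle(Event.INIT);
        System.out.println(sm.handle(Event.FAIL));
    }
}
```

A transition test then has three parts: drive the machine into the pre-state, fire one event, and assert on getState(). The tests in TestVertexImpl follow the same rhythm, with a dispatcher in between.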

Step 2 — Read `TestVertexImpl.java`

wc -l ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java

It is large (~5,000 lines). You do not need to read it all. Instead:

grep -n "public void test" \
  ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java \
  | head -60

Skim the test method names. Drop the trailing | head -60 to list them all.

Step 3 — Find an Untested Transition

Compare the transitions in VertexImpl.java to the tests in TestVertexImpl.java.

Strategy:

List all addTransition calls with grep -n "addTransition" VertexImpl.java
For each transition, search TestVertexImpl.java for a test that covers the (fromState, eventType) pair
Find one that is missing

Hint: look at transitions from INITED state. Some transitions from INITED triggered by rare events (e.g. VERTEX_FAILED before a task is scheduled) are often not explicitly tested.
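The comparison in the strategy above is a set difference: the pairs defined by the state machine minus the pairs some test exercises. The sketch below assumes you have already collected both sets by hand or with grep; the class name and the placeholder pair names (EVENT_A, EVENT_B) are hypothetical and do not refer to real Tez events.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative helper for bookkeeping transition coverage; not part of Tez. */
final class TransitionCoverage {
    private TransitionCoverage() {}

    /** Encodes a (fromState, eventType) pair as a single key. */
    static String pair(String fromState, String eventType) {
        return fromState + "/" + eventType;
    }

    /** Returns the defined pairs that no test exercises, in definition order. */
    static Set<String> untested(Set<String> defined, Set<String> tested) {
        Set<String> gaps = new LinkedHashSet<>(defined);
        gaps.removeAll(tested);
        return gaps;
    }

    public static void main(String[] args) {
        Set<String> defined = new LinkedHashSet<>();
        defined.add(pair("INITED", "EVENT_A"));
        defined.add(pair("INITED", "EVENT_B"));
        defined.add(pair("RUNNING", "EVENT_A"));

        Set<String> tested = new LinkedHashSet<>();
        tested.add(pair("INITED", "EVENT_A"));
        tested.add(pair("RUNNING", "EVENT_A"));

        System.out.println(untested(defined, tested));
    }
}
```

Keeping the list of gaps in a scratch file is useful later: each entry is a candidate for a small, reviewable test-only patch.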

Step 4 — Write the Test

Add a new test to TestVertexImpl.java. Follow the exact style of the surrounding tests:

@Test(timeout = 5000)
public void testVertexFailed_FromInitedState() {
    // TODO: initialize a vertex to INITED state using the existing test helpers
    //       then send a VERTEX_FAILED event
    //       assert the vertex transitions to ERROR or FAILED state
    //       assert any cleanup callbacks were invoked
}

Pattern to follow:

Look for an existing test that puts the vertex in the state you need (e.g. testVertexWithInitializer reaches RUNNING; look for a simpler path)
Use dispatcher.getEventHandler().handle(new VertexEventXxx(...)) to fire events
Use vertex.getState() to assert the resulting state

Step 5 — Run the New Test

cd ~/tez-src
mvn test -pl tez-dag \
  -Dtest=TestVertexImpl#testVertexFailed_FromInitedState -q 2>&1 | tail -20

Step 6 — Run the Full Test Class

mvn test -pl tez-dag -Dtest=TestVertexImpl -q 2>&1 | tail -10

All existing tests must still pass.

Step 7 — Write the Patch and JIRA Description

cd ~/tez-src
git diff > /tmp/TEZ-VERTEXTEST.001.patch
cat /tmp/TEZ-VERTEXTEST.001.patch

Draft JIRA:

Summary: TestVertexImpl is missing coverage for VERTEX_FAILED from INITED state

Description:
  The VertexImpl state machine defines a transition (INITED, VERTEX_FAILED)
  but TestVertexImpl has no test that fires this event path.  This patch adds
  TestVertexImpl#testVertexFailed_FromInitedState to cover the gap.

Priority: Minor
Component: tez-dag

Deeper Understanding

#	Question
1	What is the difference between `VertexState.FAILED` and `VertexState.ERROR`? When does the AM choose each?
2	`TestVertexImpl` uses a mock `AppContext`. What methods on `AppContext` does `VertexImpl` call most frequently? (grep for `appContext.`)
3	What is `DrainDispatcher` and why is it used in tests instead of `AsyncDispatcher`?
4	Some tests set a `Clock` mock. Why would a state machine test need to control time?

Lab 5.3 — Build It: Integration Test with `MiniTezCluster`

Lab type: Build It — Maven module with a real mini-cluster integration test
Estimated time: 150 min
Maven module: book/projects/level-5-integration-test
Key class: org.apache.tez.learning.l5.TestNumberPipelineWithMiniCluster

What You Will Build

A JUnit integration test that:

Starts MiniTezCluster in @BeforeClass
Submits the Level 1 NumberPipelineDAG (reused from level-1-number-pipeline)
Waits for the DAG to complete
Reads back the counter NumberPipeline/TotalSum and asserts it equals 4950 (the sum of 0..99)
Stops the cluster in @AfterClass

This is the same pattern used by TestOrderedWordCount — you are building the exact kind of test that Tez committers write for new DAG features.
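The expected counter value can be checked before any cluster is started. The sketch below assumes, as the assertion message in the skeleton states, that the pipeline sums the integers 0..99 unchanged; the class name is hypothetical. It computes the sum by iteration and with the closed form n(n-1)/2.

```java
import java.util.stream.LongStream;

/** Standalone arithmetic check for the expected TotalSum; not part of the lab module. */
final class TotalSumCheck {
    private TotalSumCheck() {}

    /** Sum of the integers 0..n-1 by direct iteration. */
    static long sumByLoop(int n) {
        return LongStream.range(0, n).sum();
    }

    /** Closed form for the same sum: n(n-1)/2. */
    static long sumByFormula(int n) {
        return (long) n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(sumByLoop(100));
        System.out.println(sumByFormula(100));
    }
}
```

If your Level 1 pipeline transforms the values before summing (for example, doubling them), adjust the expected constant accordingly.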

Step 1 — Create the Maven Module

book/projects/level-5-integration-test/
  pom.xml
  src/test/java/org/apache/tez/learning/l5/
    TestNumberPipelineWithMiniCluster.java

The module is a test-only module (no src/main/). It depends on:

org.apache.tez.learning:level-1-number-pipeline:1.0-SNAPSHOT (your DAG)
org.apache.tez:tez-tests (for MiniTezCluster)
JUnit 4.13.2
org.apache.hadoop:hadoop-minicluster

<dependency>
  <groupId>org.apache.tez</groupId>
  <artifactId>tez-tests</artifactId>
  <version>${tez.version}</version>
  <classifier>tests</classifier>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-minicluster</artifactId>
  <version>${hadoop.version}</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.tez.learning</groupId>
  <artifactId>level-1-number-pipeline</artifactId>
  <version>1.0-SNAPSHOT</version>
  <scope>test</scope>
</dependency>

Add level-5-integration-test to the parent pom.xml modules list.

Step 2 — Write `TestNumberPipelineWithMiniCluster.java`

Skeleton:

package org.apache.tez.learning.l5;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.tez.client.TezClient;
import org.apache.tez.common.counters.TezCounters;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.client.DAGClient;
import org.apache.tez.dag.client.DAGStatus;
import org.apache.tez.learning.l1.NumberPipelineDAG;
import org.apache.tez.test.MiniTezCluster;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

import static org.junit.Assert.*;

public class TestNumberPipelineWithMiniCluster {

    private static MiniTezCluster miniTezCluster;
    private static TezClient tezClient;
    private static TezConfiguration tezConf;

    @BeforeClass
    public static void setUpClass() throws Exception {
        // Start MiniTezCluster with 1 NodeManager
        miniTezCluster = new MiniTezCluster(
                TestNumberPipelineWithMiniCluster.class.getName(), 1, 1, 1);
        Configuration conf = new Configuration();
        miniTezCluster.init(conf);
        miniTezCluster.start();

        tezConf = new TezConfiguration(miniTezCluster.getConfig());
        tezConf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, false);

        tezClient = TezClient.create(
                "TestNumberPipelineClient", tezConf);
        tezClient.start();
    }

    @AfterClass
    public static void tearDownClass() throws Exception {
        if (tezClient != null) {
            tezClient.stop();
        }
        if (miniTezCluster != null) {
            miniTezCluster.stop();
        }
    }

    @Test(timeout = 120_000)
    public void testNumberPipelineTotalSum() throws Exception {
        // Build the Level 1 DAG against the mini-cluster config (TEZ_LOCAL_MODE is false here)
        DAG dag = NumberPipelineDAG.buildDAG(tezConf);

        DAGClient dagClient = tezClient.submitDAG(dag);
        DAGStatus dagStatus = dagClient.waitForCompletion();

        assertEquals("DAG must succeed",
                DAGStatus.State.SUCCEEDED, dagStatus.getState());

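        // Counters are only populated when the status was requested with
        // StatusGetOpts.GET_COUNTERS; if this is null, revisit question 1 in Step 4.
        TezCounters counters = dagStatus.getDAGCounters();
        assertNotNull("Counters must be present", counters);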

        long totalSum = counters
                .getGroup("NumberPipeline")
                .findCounter("TotalSum")
                .getValue();

        assertEquals("TotalSum for 0..99 must equal 4950", 4950L, totalSum);
    }
}

Adapting NumberPipelineDAG: the Level 1 project is designed for local mode (TezConfiguration.TEZ_LOCAL_MODE = true). You will need to either (a) add a static buildDAG(TezConfiguration conf) factory method that accepts an external config, or (b) create a subclass that overrides the DAG construction to accept an injected config. Choose (a).

Step 3 — Verify the Build

cd book/projects
mvn -pl level-1-number-pipeline install -DskipTests -q
mvn -pl level-5-integration-test test -q 2>&1 | tail -20

Expected:

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

Step 4 — Deep Questions

#	Question
1	Why does the test use `dagStatus.getDAGCounters()` instead of `dagClient.getDAGStatus(EnumSet.of(StatusGetOpts.GET_COUNTERS))`? Are they equivalent?
2	The timeout is `120_000 ms`. Why does a simple 100-integer DAG need 2 minutes?
3	If the DAG fails, `dagStatus.getState()` returns `FAILED` and the assertion fires. How would you get the failure reason from `dagStatus`?
4	`@BeforeClass` uses `static` fields. What happens if two test classes in the same JVM both start `MiniTezCluster`? How does `TestOrderedWordCount` handle this?
5	The counter group is `"NumberPipeline"` and the counter name is `"TotalSum"`. If you mistype the group name, what does `getGroup()` return? Does the assertion fail gracefully?

Step 5 — Experiment: Add a Second Assertion

After verifying TotalSum, add an assertion on the number of input records processed (this needs an import of org.apache.tez.common.counters.TaskCounter):

long inputRecords = counters
        .findCounter(TaskCounter.INPUT_RECORDS_PROCESSED)
        .getValue();
// How many input records do you expect?
assertEquals(???, inputRecords);

Think about the DAG topology:

Source vertex: 1 task, emits 100 integers
Sink vertex: 1 task, reads 100 records

What value do you expect for INPUT_RECORDS_PROCESSED across both vertices?

Step 6 — Tez Source Connection Table

Class used in this lab	Tez source file
`MiniTezCluster`
`TezClient`
`DAGClient`
`DAGStatus`
`TezCounters`

Lab 5.4 — Fix It: Un-Ignore a Flaky Test in `TestVertexImpl`

Lab type: Fix-It — flaky test investigation and repair
Estimated time: 90 min
Tez module: tez-dag
Key class: TestVertexImpl

Overview

Large Java projects accumulate @Ignored tests that were disabled because they were "flaky" — meaning they passed sometimes and failed other times. A flaky test is almost always a symptom of a real bug: a race condition, an incorrect assertion, or missing test isolation.

In this lab you will:

Find an @Ignored test in TestVertexImpl
Un-ignore it and run it 10 times to characterize the failure
Identify the root cause
Apply the minimum fix
Verify the test passes reliably

Step 1 — Find the Ignored Tests

grep -n "@Ignore\|@Disabled" \
  ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java

Also search across all tez-dag tests:

grep -rn "@Ignore\|@Disabled" ~/tez-src/tez-dag/src/test/java/ | \
  grep -v "^Binary" | head -30

Step 2 — Pick a Target

Select one ignored test. Prefer tests that have a comment explaining why they were ignored — these are the most educational.

Record:

The test method name
The reason given in the @Ignore annotation or nearby comment
Which state transition or feature it is testing

Step 3 — Un-Ignore and Run

Remove the @Ignore annotation. Run the test 10 times:

for i in $(seq 1 10); do
  mvn test -pl tez-dag -Dtest=TestVertexImpl#yourTestName -q 2>&1 | \
    grep -E "PASS|FAIL|ERROR|Tests run" | tail -1
done

Record the pass/fail pattern. Is it:

Always failing (deterministic bug)
Randomly failing (race condition or timing sensitivity)
Always passing (was it already fixed in this version?)
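The three outcomes above can be written as a small classifier over the recorded results. This is only a bookkeeping sketch with hypothetical names. Note that ten passes in a row do not prove a test is stable; they only mean the failure was not reproduced in this sample.

```java
import java.util.Arrays;
import java.util.List;

/** Classifies a series of test runs; each element is true if that run passed. */
final class FlakinessClassifier {
    enum Verdict { ALWAYS_FAILING, RANDOMLY_FAILING, ALWAYS_PASSING }

    private FlakinessClassifier() {}

    static Verdict classify(List<Boolean> passed) {
        if (passed.isEmpty()) {
            throw new IllegalArgumentException("need at least one run");
        }
        long passes = passed.stream().filter(Boolean::booleanValue).count();
        if (passes == 0) {
            return Verdict.ALWAYS_FAILING;   // deterministic bug
        }
        if (passes == passed.size()) {
            return Verdict.ALWAYS_PASSING;   // not reproduced in this sample
        }
        return Verdict.RANDOMLY_FAILING;     // race condition or timing sensitivity
    }

    public static void main(String[] args) {
        List<Boolean> runs = Arrays.asList(true, true, false, true, true,
                                           true, false, true, true, true);
        System.out.println(classify(runs));
    }
}
```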

Step 4 — Diagnose the Failure

Read the test carefully. Common flaky-test patterns in Tez state machine tests:

Pattern	Symptom	Fix
`AsyncDispatcher` not drained before assertion	Assertion fires before event is processed	Use `DrainDispatcher` instead
Mock returns null for a method that returns a list	`NullPointerException` in production code	Stub with `Collections.emptyList()`
`Thread.sleep(N)` instead of proper synchronization	Fails on slow CI machines	Replace with `waitFor()` or `DrainDispatcher`
Leaked state from another test	First run passes, second fails	Verify `@Before` / `@After` cleans up completely

Identify which pattern applies.

Step 5 — Apply the Fix

Apply the minimum fix. Options:

Option A — Replace AsyncDispatcher with DrainDispatcher

// Before (flaky):
AsyncDispatcher dispatcher = new AsyncDispatcher();

// After (deterministic):
DrainDispatcher dispatcher = new DrainDispatcher();
dispatcher.register(VertexEventType.class, vertex);
dispatcher.init(conf);
dispatcher.start();
// ... fire events ...
dispatcher.await(); // blocks until queue is empty

Option B — Add missing stub

when(mockContext.getSomeList()).thenReturn(Collections.emptyList());

Option C — Fix assertion order

Move assertions AFTER the dispatcher.await() call.
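Why draining matters can be shown without any Tez classes. The toy dispatcher below is a stand-in for the idea only; it is not Tez's DrainDispatcher, and its names are hypothetical. Events are handled on a background thread, so an assertion made right after dispatch() races with the handler. await() queues a no-op behind everything already submitted and blocks until it completes.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy single-threaded dispatcher illustrating "drain before assert". */
final class ToyDrainDispatcher implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues the handler and returns immediately, like an asynchronous dispatcher. */
    void dispatch(Runnable handler) {
        worker.execute(handler);
    }

    /** Blocks until every handler queued before this call has finished. */
    void await() {
        try {
            // One worker thread and a FIFO queue: this no-op runs last.
            worker.submit(() -> { }).get(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("queue did not drain", e);
        }
    }

    @Override
    public void close() {
        worker.shutdownNow();
    }

    public static void main(String[] args) {
        AtomicInteger handled = new AtomicInteger();
        try (ToyDrainDispatcher dispatcher = new ToyDrainDispatcher()) {
            for (int i = 0; i < 100; i++) {
                dispatcher.dispatch(handled::incrementAndGet);
            }
            // Asserting here, before await(), would race with the worker thread.
            dispatcher.await();
            System.out.println(handled.get());
        }
    }
}
```

The same ordering applies to Options A and C: fire the events, drain, and only then read state.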

Step 6 — Verify Reliability

Run the test 20 times:

for i in $(seq 1 20); do
  mvn test -pl tez-dag -Dtest=TestVertexImpl#yourTestName -q 2>&1 | \
    grep -E "Tests run" | tail -1
done

All 20 runs must pass.

Step 7 — Run the Full Suite

mvn test -pl tez-dag -q 2>&1 | tail -10

All existing tests must pass.

Step 8 — Format the Patch and Write the JIRA

cd ~/tez-src
git diff > /tmp/TEZ-FLAKY.001.patch

Summary: TestVertexImpl#[testName] is flaky due to [root cause]

Description:
  TestVertexImpl#[testName] was marked @Ignore with the note "[original reason]".
  Investigation shows the root cause is [description].

  The fix [replaces AsyncDispatcher with DrainDispatcher / adds the missing stub /
  moves the assertion after the drain], making the test deterministic.

  Ran the test 20 times with the fix applied — all passed.

Priority: Minor
Component: tez-dag

Deeper Understanding

#	Question
1	What is the difference between `AsyncDispatcher` and `DrainDispatcher`? Where is `DrainDispatcher` defined?
2	Why is a flaky test arguably worse than no test? What does it do to CI reliability?
3	Tez's `StateMachineFactory` is modeled after Hadoop's. Does Hadoop's `TestStateMachine` use `DrainDispatcher` or `AsyncDispatcher` in its tests?
4	Some Tez flaky tests are caused by `System.currentTimeMillis()` being called in a tight loop and the assertion depending on a specific elapsed time. How would you make such a test deterministic?

Level 6: Hive/Tez Integration

Hive-on-Tez is the largest consumer of the Tez API. Understanding how Hive translates SQL into a Tez DAG — and what can go wrong — is essential for any contributor who wants to fix real production bugs.

What Hive does with Tez

Every Hive query that runs on Tez goes through this pipeline:

SQL → Hive AST → Operator tree → MapWork/ReduceWork tasks
   → TezWork → Tez DAG (vertices + edges + VertexManagerPlugins)
   → TezClient.submitDAG()

The translation layer lives in the hive-exec module, specifically in TezWork, DagUtils, and TezTask.

Why Tez contributors must understand Hive

Most real Tez bugs are first reported from Hive (a slow query, a failing shuffle, a counter discrepancy)
ShuffleVertexManager was built specifically for the Hive reduce pattern
Hive adds many VertexManagerEvent payloads that Tez must handle correctly
Compatibility issues between Hive versions and Tez versions are common release blockers

What this level covers

Topic	Lab
Trace a Hive SQL query to the generated Tez DAG	Lab 6.1
Read `DagUtils` and understand vertex/edge configuration	Lab 6.1
Debug a failing Hive-on-Tez query (task diagnostics, AM logs)	Lab 6.2
Fix a Hive-Tez compatibility issue via a Tez patch	Lab 6.2

Prerequisites

Level 5 complete (you can submit and debug a Tez DAG)
Optional but helpful: basic SQL knowledge
Optional: Hive source checked out alongside Tez

Key classes

Class	Where	What it does
`TezWork`	`hive-exec`	Container for all Tez DAG specifications
`DagUtils`	`hive-exec`	Builds Tez DAG from `TezWork`
`TezTask`	`hive-exec`	Executes a `TezWork` via `TezClient`
`ShuffleVertexManager`	`tez-dag`	Manages reduce-vertex scheduling
`OrderedPartitionedKVOutput`	`tez-runtime-library`	Default Hive reduce output

Lab 6.1 — Trace a Hive SQL Query to the Generated Tez DAG

Lab type: Read & Research
Estimated time: 120 min
Key classes: DagUtils, TezWork, TezTask (all in Hive)

Overview

When you run SELECT a, COUNT(*) FROM t GROUP BY a on a Hive-on-Tez cluster, Hive builds a TezWork object (a description of what the DAG should look like) and hands it to DagUtils.createDag(). That method creates the actual Tez DAG, vertices, edges, and VertexManagerPluginDescriptors.

In this lab you will trace this path end-to-end.

Step 1 — Check Out Hive Source (Optional)

To check out the Hive source (a shallow clone is enough):

git clone https://github.com/apache/hive.git ~/hive-src --depth=1
find ~/hive-src -name "DagUtils.java" | head -3
find ~/hive-src -name "TezWork.java" | head -3
find ~/hive-src -name "TezTask.java" | head -3

If you do not have Hive source, you can read these classes on GitHub:

ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
ql/src/java/org/apache/hadoop/hive/ql/plan/TezWork.java
ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java

Step 2 — Read `TezWork.java`

TezWork is a directed graph of BaseWork nodes. Answer:

#	Question
1	What are the two main subclasses of `BaseWork` that represent map and reduce phases?
2	How does `TezWork` represent edges between vertices? What class holds edge configuration?
3	Where does `TezWork` store the `VertexManagerPluginDescriptor`?
4	A `GROUP BY` query produces how many `BaseWork` nodes? Draw the graph.

Step 3 — Read `DagUtils.createDag()`

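This is the core translation step: the code iterates over TezWork and calls createVertex() and createEdge(). Depending on the Hive version, the loop itself may live in TezTask (for example in a build() method) rather than in a DagUtils.createDag() method; grep for createVertex( to find the caller in your checkout.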

#	Question
1	What Tez `EdgeProperty.DataMovementType` does Hive use for a reduce shuffle? Where is this set?
2	What `VertexManagerPlugin` does Hive attach to reduce vertices? Is this set unconditionally or based on a configuration flag?
3	What is `auto-parallelism` in this context? How does Hive enable it?
4	What `UserPayload` does Hive pass to `ShuffleVertexManager`? Specifically: what are the values of `minFraction` and `maxFraction`?

Step 4 — Read `TezTask.execute()`

This method submits the DAG and waits for completion.

#	Question
1	Does `TezTask` create a new `TezClient` per query, or reuse one per session?
2	How does `TezTask` wait for DAG completion? Which Tez API does it poll?
3	When a Hive query fails, what information does `TezTask` extract from the `DAGStatus` to show the user?
4	`TezTask` updates Hive counters from Tez counters. What is the counter group mapping?

Step 5 — Tez Counterpart: `ShuffleVertexManager`

Open ShuffleVertexManager.java in your Tez source. Cross-reference with what you learned from DagUtils.java:

The minFraction/maxFraction payload you found in Step 3 is parsed by which method in ShuffleVertexManager?
When Hive enables auto-parallelism, what happens inside ShuffleVertexManager that does NOT happen when it is disabled?
Where does ShuffleVertexManager call context.reconfigureVertex()? What does reconfigureVertex do to the number of reducer tasks?

Step 6 — End-to-End Mental Model

Draw (on paper or in a text diagram) the full path for:

SELECT dept, COUNT(*) FROM employees GROUP BY dept

Show:

Hive logical plan nodes
TezWork graph (label each BaseWork)
Tez DAG (label each vertex, edge type, VertexManagerPlugin)
Which Tez APIs TezTask calls

Step 7 — JIRA Research: Hive/Tez Compatibility

Search:

project = TEZ AND text ~ "hive" AND resolution = Fixed ORDER BY updated DESC

Find one issue where a Tez change broke Hive or where a Hive bug exposed a Tez issue.

What was the incompatibility?
Was the fix in Tez or Hive (or both)?
Did the patch include a test? If so, where?

Lab 6.2 — Debug a Failed Hive-on-Tez Query

Lab type: Fix-It (diagnostics + root-cause analysis)
Estimated time: 120 min

Overview

A Hive-on-Tez query failure can originate from:

Tez DAG layer — vertex scheduling error, shuffle failure, OOM
Hive operator layer — deserialization error, UDF crash, wrong SerDe
Infra layer — YARN container killed, HDFS quota exceeded, network timeout

In this lab you will work through a systematic diagnostic process and trace a simulated failure back to its Tez-layer root cause.

Scenario

A Hive query:

SELECT k, SUM(v) FROM large_table GROUP BY k;

fails with:

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Vertex failed, vertexName=Reducer 2, vertexId=vertex_1700000000000_0001_1_01,
diagnostics=[Task failed, taskId=task_1700000000000_0001_1_01_000000,
diagnostics=[TaskAttempt 0 failed, info=[Container container_... exited
with exitCode: -104]]

Exit code -104 is YARN's ContainerExitStatus.KILLED_EXCEEDED_PMEM: the NodeManager killed the container for exceeding its physical memory limit.
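When triaging many failures it helps to decode the exit code mechanically. The sketch below is a standalone helper with hypothetical names. The numeric values are believed to match the constants in org.apache.hadoop.yarn.api.records.ContainerExitStatus, but confirm them against the Hadoop version you run. The regular expression assumes the diagnostics contain the text "exitCode: N", as in the scenario above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standalone decoder for YARN container exit codes found in Tez diagnostics. */
final class ContainerExitCodes {
    private static final Pattern EXIT_CODE = Pattern.compile("exitCode:\\s*(-?\\d+)");

    private ContainerExitCodes() {}

    /** Extracts the exit code from a diagnostics string, or null if none is present. */
    static Integer parseExitCode(String diagnostics) {
        Matcher m = EXIT_CODE.matcher(diagnostics);
        return m.find() ? Integer.valueOf(m.group(1)) : null;
    }

    /** Maps a code to its constant name (values assumed from ContainerExitStatus). */
    static String describe(int exitCode) {
        switch (exitCode) {
            case 0:    return "SUCCESS";
            case -100: return "ABORTED";
            case -101: return "DISKS_FAILED";
            case -102: return "PREEMPTED";
            case -103: return "KILLED_EXCEEDED_VMEM";
            case -104: return "KILLED_EXCEEDED_PMEM";
            default:   return "UNKNOWN(" + exitCode + ")";
        }
    }

    public static void main(String[] args) {
        String diagnostics = "TaskAttempt 0 failed, info=[Container exited\n"
                + "with exitCode: -104]";
        Integer code = parseExitCode(diagnostics);
        System.out.println(code == null ? "no exit code" : describe(code));
    }
}
```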

Step 1 — Identify the Layer

#	Question
1	Is exit code `-104` a Tez error or a YARN error? Where is this code defined?
2	Which vertex failed — the map or the reduce? How do you know from the diagnostic message?
3	What Tez API would you call (in Java) to retrieve these diagnostics programmatically?
4	The error says "TaskAttempt 0 failed". Does this mean no retries happened, or that all retries were exhausted?

Step 2 — Locate the Logs

In a real cluster:

# Get the AM logs
yarn logs -applicationId application_1700000000000_0001 \
  -log_files syslog | grep -A 20 "Reducer 2"

# Get the container logs
yarn logs -applicationId application_1700000000000_0001 \
  -containerId container_... | head -200

Questions:

In the AM logs, what Tez class emits the Task failed message? (Hint: grep for TaskImpl or VertexImpl in the log.)
The container log has a Java OOM or GC log. Where in TaskAttemptImpl does the container exit code get translated to a TaskAttemptEvent?

Step 3 — Identify the Tez Configuration Fix

The reduce vertex ran out of memory. The relevant configuration:

Config key	Default	Description
`tez.am.resource.memory.mb`	1024	AM container memory
`tez.task.resource.memory.mb`	1024	Task container memory
`hive.tez.container.size`	-1 (inherits from mapred)	Hive override for Tez task memory
`hive.auto.convert.join.noconditionaltask.size`	10MB	In-memory join threshold

Which config key should be increased to fix the OOM?
Is this a Tez config or a Hive config? Which system applies it?
Find where tez.task.resource.memory.mb is read in Tez source. In which class and method?

Step 4 — Tez Source Reading: Container Exit Code Handling

Find where Tez handles non-zero container exit codes:

grep -rn "exitCode\|EXIT_CODE\|ContainerExitStatus" \
  ~/tez-src/tez-dag/src/main/java/ | grep -v "test" | head -30

Answer:

What class translates the YARN container exit code into a TaskAttemptEvent?
Is -104 (KILLED_EXCEEDED_PMEM) treated differently from -102 (PREEMPTED) or -100 (ABORTED)?
Does Tez retry a task whose container was killed for exceeding memory? A preempted one? What configuration controls the max retries?

Step 5 — Simulate the Fix

In a real system you would increase tez.task.resource.memory.mb and rerun. Without a Hive cluster, study the unit tests instead.

Find the tests in TestTaskAttemptImpl.java that cover abnormal container termination (preemption and non-zero exit codes):

grep -n "preempt\|PREEMPT\|exitCode" \
  ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttemptImpl.java \
  | head -20

Read the test. Answer:

How does the test simulate a container exit with a non-zero exit code?
What state does TaskAttemptImpl transition to on preemption?
Is there a test for the full retry-until-max-attempts path?

Step 6 — Write a Diagnostic Runbook Entry

Write 5–8 bullet points as a "runbook entry" for this class of failure:

## Hive-on-Tez: Reducer OOM (exit code -104)

**Symptoms:** ...
**Root cause:** ...
**Diagnostic steps:** ...
**Fix:** ...
**Tez classes involved:** ...
**Relevant configuration:** ...

This is the kind of documentation that Tez PMC members write for operators.

JIRA Research

Search for Tez issues related to container OOM or preemption handling:

project = TEZ AND text ~ "preempt OR oom OR \"out of memory\"" AND resolution = Fixed

Find one. Read the patch. Was the fix in TaskAttemptImpl, in configuration defaults, or in a different class?

Level 7: Runtime and Shuffle

The Tez shuffle layer is the most performance-critical and most bug-prone part of the runtime. Understanding it is required for diagnosing slow queries, data-skew issues, and shuffle fetch failures.

How shuffle works in Tez

Map task → OrderedPartitionedKVOutput → TezIndexRecord (index + data files)
                                         ↓
                              ShuffleHandler (HTTP server in NM)
                                         ↓
Reduce task → OrderedGroupedKVInput ← shuffle fetcher threads
                                         ↓
                                    merge + sort → processor

Key insight: unlike Hadoop MapReduce's ShuffleConsumerPlugin, Tez's shuffle is split into framework code (tez-runtime-library) and user code (Processor). The processor never sees unsorted records — sorting happens in the runtime layer.
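The "merge + sort" box in the diagram is, at its core, a k-way merge of runs that are each already sorted. The sketch below shows that technique with a priority queue over plain integers. It is not Tez's merge code: real shuffle merges serialized key/value pairs, spills to disk, and uses a configurable comparator. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** k-way merge of sorted runs, the core idea behind a reduce-side merge. */
final class SortedRunMerger {
    private SortedRunMerger() {}

    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        // Heap entries are {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < sortedRuns.size(); r++) {
            if (!sortedRuns.get(r).isEmpty()) {
                heap.add(new int[] {sortedRuns.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            out.add(head[0]);
            List<Integer> run = sortedRuns.get(head[1]);
            int next = head[2] + 1;
            if (next < run.size()) {
                // Refill the heap from the run the smallest element came from.
                heap.add(new int[] {run.get(next), head[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = new ArrayList<>();
        runs.add(Arrays.asList(1, 4, 9));
        runs.add(Arrays.asList(2, 3));
        runs.add(new ArrayList<Integer>());
        System.out.println(merge(runs));
    }
}
```

Because each map output arrives as a sorted run, the reducer only ever pays for the merge, which is why the processor can assume sorted input.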

What this level covers

Topic	Lab
Trace shuffle fetch failure from AM logs to root cause	Lab 7.1
Add or modify an `OrderedPartitionedKVOutput` processor	Lab 7.2

Key classes

Class	Where	What it does
`OrderedPartitionedKVOutput`	`tez-runtime-library`	Map output: partition + sort + spill
`OrderedGroupedKVInput`	`tez-runtime-library`	Reduce input: fetch + merge + sort
`ShuffleFetch` / `Fetcher`	`tez-runtime-library`	HTTP fetch from ShuffleHandler
`MergeManager`	`tez-runtime-library`	In-memory and on-disk merge
`ShuffleHandler`	`tez-shuffle`	Netty HTTP server serving map output
`TezIndexRecord`	`tez-runtime-library`	Per-partition offset+length in output file
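
To make the "per-partition offset+length" row concrete, here is a minimal plain-Java sketch of the idea. It is not Tez's `TezIndexRecord` or its on-disk format, which carries more fields; the names `PartitionIndexSketch`, `IndexEntry`, `writePartitions`, and `slice` are hypothetical. It only illustrates how one data file plus an index lets a server hand each reducer exactly its own byte range.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not Tez's TezIndexRecord or spill-file format.
final class PartitionIndexSketch {

    // One entry per partition: where its bytes start in the data file and how long they are.
    static final class IndexEntry {
        final int offset;
        final int length;

        IndexEntry(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Appends each partition's payload to a single "data file" and records its byte range.
    static List<IndexEntry> writePartitions(List<byte[]> partitions, ByteArrayOutputStream dataFile) {
        List<IndexEntry> index = new ArrayList<>();
        for (byte[] payload : partitions) {
            index.add(new IndexEntry(dataFile.size(), payload.length));
            dataFile.write(payload, 0, payload.length);
        }
        return index;
    }

    // What a shuffle server conceptually does for one reducer: look up the range, return the slice.
    static byte[] slice(byte[] dataFile, List<IndexEntry> index, int partition) {
        IndexEntry entry = index.get(partition);
        return Arrays.copyOfRange(dataFile, entry.offset, entry.offset + entry.length);
    }

    // Writes three partitions and reads back only partition 1.
    static String demo() {
        ByteArrayOutputStream dataFile = new ByteArrayOutputStream();
        List<IndexEntry> index = writePartitions(Arrays.asList(
            "aaa".getBytes(StandardCharsets.UTF_8),
            "bb".getBytes(StandardCharsets.UTF_8),
            "cccc".getBytes(StandardCharsets.UTF_8)), dataFile);
        return new String(slice(dataFile.toByteArray(), index, 1), StandardCharsets.UTF_8);
    }
}
```

The point of the design is that the reducer for partition 1 never touches the bytes of partitions 0 or 2, even though all three live in the same file.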

Lab 7.1 — Debug Shuffle Behavior

Lab type: Read & Research
Estimated time: 120 min
Tez module: tez-runtime-library

Overview

Shuffle failures are among the most common sources of Tez bug reports. They manifest as FetchFailure events, IOException during map-output reads, or hung reduce tasks. In this lab you will trace the complete shuffle path from log line to source code.

Step 1 — Locate the Core Classes

find ~/tez-src/tez-runtime-library -name "*.java" | xargs grep -l "FetchFailure\|Fetcher\|ShuffleHandler" | head -10
find ~/tez-src/tez-plugins/tez-aux-services -name "*.java" | head -10

Step 2 — Read the Shuffle Fetch Path

Open Fetcher.java (in tez-runtime-library) and trace the fetch loop:

#	Question
1	What HTTP method does the Fetcher use to request map output? GET or POST?
2	What is the URL format it sends to `ShuffleHandler`? What parameters does it include?
3	If the HTTP response code is 404, what does the Fetcher do? (Fail immediately? Retry? Report back to the InputManager?)
4	What does the Fetcher do when it detects data corruption (checksum mismatch)? Which class handles checksum verification?
5	How many concurrent fetcher threads does a reduce task run? What configuration key controls this?

Step 3 — Read the `FetchFailure` Event Path

When a fetch fails, an event travels up to the AM:

grep -rn "FetchFailure\|FETCH_FAILURE" ~/tez-src/tez-dag/src/main/java/ | \
  grep -v "test" | grep ".java:" | head -20

Trace: where does the FetchFailure event originate, and what state transition does it trigger in TaskAttemptImpl?

#	Question
1	What is the name of the event class that carries the fetch-failure information to the AM?
2	In `TaskAttemptImpl`, what state does the task transition to when it receives a fetch failure?
3	Does a single fetch failure kill the task, or does Tez retry? What configuration controls max fetch retries?
4	What happens to the source task attempt (the map) when its output cannot be fetched? Is it re-run?

Step 4 — Read `ShuffleHandler`

Open ShuffleHandler.java in tez-plugins/tez-aux-services:

#	Question
1	What YARN class does `ShuffleHandler` itself extend, and which Netty handler class does its inner `Shuffle` class extend?
2	How does `ShuffleHandler` authenticate that a requester is authorized to fetch map output? (Hint: look for `TOKEN` or `JobTokenSecretManager`.)
3	Where does `ShuffleHandler` read the index file? What class represents the index?
4	If the NM restarts while a reduce is fetching, what happens to in-flight fetch requests?

Step 5 — Read the Spill Path

Open DefaultSorter.java or PipelinedSorter.java in tez-runtime-library:

At what memory threshold does a spill occur?
How many spill files can accumulate before a merge is triggered?
After a spill, where is the index written?
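
The three questions above all concern one mechanism: a bounded sort buffer that writes a sorted run whenever it passes a fill threshold. The toy model below shows that mechanism in isolation. It is not `DefaultSorter` or `PipelinedSorter`; the class name `SpillBufferSketch`, the record-count capacity, and the percentages used are assumptions for illustration, not Tez defaults.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: not DefaultSorter or PipelinedSorter, and the numbers are not Tez defaults.
final class SpillBufferSketch {
    private final int spillThreshold; // number of buffered records that triggers a spill
    private final List<Integer> buffer = new ArrayList<>();
    private final List<List<Integer>> spilledRuns = new ArrayList<>();

    SpillBufferSketch(int capacity, int spillPercent) {
        this.spillThreshold = Math.max(1, capacity * spillPercent / 100);
    }

    // Accepts one record and spills once the buffer reaches the threshold.
    void collect(int record) {
        buffer.add(record);
        if (buffer.size() >= spillThreshold) {
            flush();
        }
    }

    // Sorts the buffered records into one run and empties the buffer.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<Integer> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spilledRuns.add(run);
        buffer.clear();
    }

    int spillCount() {
        return spilledRuns.size();
    }

    List<Integer> spilledRun(int i) {
        return spilledRuns.get(i);
    }
}
```

With a capacity of 10 and a spill percentage of 80, twenty `collect` calls produce two full runs and leave four records buffered until the final `flush()`. When you read the real sorters, look for the equivalent of `spillThreshold` and for where the per-run index is recorded.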

Step 6 — Common Shuffle Bug Patterns

For each pattern below, identify the relevant Tez class and the configuration that can mitigate it:

Pattern	Class	Config key
Slow fetch due to too few fetcher threads
OOM in reducer due to large in-memory merge buffer
Fetch failure due to ShuffleHandler authentication timeout
Data skew: one reducer processes 100× more data than others

Step 7 — JIRA Research

Search:

project = TEZ AND component = "tez-runtime-library" AND resolution = Fixed ORDER BY updated DESC

Find a recently fixed shuffle or sort bug. Read the patch:

What was the bug?
Was it in Fetcher, DefaultSorter, MergeManager, or ShuffleHandler?
Was a test added? What does it mock or simulate?

Lab 7.2 — Modify a Processor: Add Deduplication to `UnionSinkProcessor`

Lab type: Fix-It / Extend
Estimated time: 90 min
Maven module: book/projects/level-3-multi-input

Overview

UnionSinkProcessor from Level 3 sums all values it receives. In this lab you will extend it to deduplicate records by key before summing — only the first record for each key is counted.

This exercise teaches:

How to modify a Processor that uses OrderedGroupedKVInput
How counters interact with deduplication logic
How to write a unit test for processor logic using mocks

Step 1 — Understand the Current Behavior

The current UnionSinkProcessor (Level 3) receives (Integer key, Integer value) pairs and sums all values. For the test input (0..99 integers), expected sum is 4950.

Open UnionSinkProcessor.java and answer:

How does it iterate over input records?
Where does it write the counter?
What happens if the same key appears twice? Can that actually occur with the current sources? Check EvenNumberSource and OddNumberSource to see whether any key (for example key=5) is emitted by both.

Step 2 — Add a Deduplicating Variant

Create DeduplicatingUnionSinkProcessor.java in the same package. It should:

Maintain a Set<Integer> of seen keys
For each (key, value) pair from the input: if key is new, add to set and add value to the sum; otherwise skip
Publish the same UnionPipeline/TotalSum counter
Also publish a new counter UnionPipeline/DuplicatesSkipped
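
Separated from the Tez plumbing, the core of the deduplicating variant might look like the sketch below. It assumes the records are already available as (key, value) pairs; `DedupSumSketch` and its fields are hypothetical names, and wiring the two results into the `UnionPipeline/TotalSum` and `UnionPipeline/DuplicatesSkipped` counters is left to your processor.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: the dedup-and-sum core, with no Tez types involved.
final class DedupSumSketch {
    long totalSum;          // would back the UnionPipeline/TotalSum counter
    long duplicatesSkipped; // would back the UnionPipeline/DuplicatesSkipped counter
    private final Set<Integer> seenKeys = new HashSet<>();

    void consume(Iterable<Map.Entry<Integer, Integer>> records) {
        for (Map.Entry<Integer, Integer> record : records) {
            // Set.add returns false when the key was already present.
            if (seenKeys.add(record.getKey())) {
                totalSum += record.getValue();
            } else {
                duplicatesSkipped++;
            }
        }
    }

    // Same data as the first test in Step 3: (1,10) twice and (2,20) once.
    static DedupSumSketch demo() {
        List<Map.Entry<Integer, Integer>> records = new ArrayList<>();
        records.add(new SimpleEntry<>(1, 10));
        records.add(new SimpleEntry<>(1, 10));
        records.add(new SimpleEntry<>(2, 20));
        DedupSumSketch sketch = new DedupSumSketch();
        sketch.consume(records);
        return sketch;
    }
}
```

Keeping the logic in a plain class like this is one way to make it testable without mocking any Tez input; whether you keep it separate or inline it in the processor is a design choice.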

Step 3 — Write a Unit Test

Create TestDeduplicatingUnionSinkProcessor.java. Use the Mockito pattern from TestMultiInputProcessors:

@Test
public void testDuplicateKeyIsSkippedOnce() {
    // Create a mock input that returns (key=1, value=10) twice
    // and (key=2, value=20) once
    // Expected TotalSum: 10 + 20 = 30
    // Expected DuplicatesSkipped: 1
}

@Test
public void testAllUniqueKeys() {
    // No duplicates: result must equal non-deduplicating sum
}

Step 4 — Run the Tests

cd book/projects
mvn -pl level-3-multi-input test -q 2>&1 | tail -10

Step 5 — Questions

#	Question
1	If your deduplication `Set` grows very large (millions of keys), what would happen to the task JVM heap?
2	The input is already sorted by key (because `OrderedGroupedKVInput` sorts). Could you use this property to deduplicate without a `Set`? Rewrite `DeduplicatingUnionSinkProcessor` to use `O(1)` memory.
3	Your new counter `UnionPipeline/DuplicatesSkipped` — where in the Tez framework does it get propagated to the AM and eventually to `DAGStatus.getDAGCounters()`?

Level 8: Real Issue Contribution

This level is the transition from learner to contributor. You will pick a real open JIRA issue, reproduce it, write a patch, and go through the Apache contribution process from start to submission.

The Apache contribution loop

1. Pick an issue (JIRA)          → identify something you can fix
2. Understand the context         → read related code, existing tests, comments
3. Reproduce the bug              → write a failing test or reproduce steps
4. Implement the fix              → minimum change that passes all tests
5. Format the patch               → `git diff > TEZ-NNNN.001.patch`
6. Upload to JIRA                 → attach the patch, set status to "Patch Available"
7. Respond to review comments     → iterate, upload TEZ-NNNN.002.patch etc.
8. Patch committed                → a committer votes +1 and commits

Choosing the right issue

Good first contributions:

Type	Difficulty	Acceptance rate
Missing test coverage	Low	High
Wrong error message	Low	High
Javadoc improvement	Low	High
Logging improvement	Low	High
NPE in edge case	Medium	High
Performance regression (small)	Medium	Medium
New feature	High	Low (needs design discussion first)

Rule: Start with "Minor" or "Trivial" priority JIRAs. Do not attempt "Blocker" or "Critical" until you have 3+ committed patches.

What this level covers

Topic	Lab
Find and reproduce a real open JIRA issue	Lab 8.1
Implement a fix, write the test, format the patch	Lab 8.2
Write better error messages for failed DAGs	Lab 8.3

Lab 8.1 — Find and Reproduce a Real JIRA Issue

Lab type: Research & Reproduce
Estimated time: 2–4 hours (actual time varies by issue)

Step 1 — Find a Good Candidate

Go to: https://issues.apache.org/jira/projects/TEZ

Filter:

Status: Open
Priority: Minor or Trivial
Component: tez-dag or tez-runtime-library
Resolution: Unresolved

Look for issues with:

A small reproduction case described in comments
No existing "Patch Available" attachment
Last comment less than 1 year old

Step 2 — Read Everything

For your chosen issue, read:

The original description
Every comment (some comments contain critical reproduction steps)
Any attached patches (even if they were rejected — understand why)
Related issues in the "is blocked by" / "depends on" links

Answer for your issue:

#	Question
1	What is the exact symptom? (Exception? Wrong result? Performance regression?)
2	Which Tez class is implicated? Which method?
3	Under what conditions does the bug occur?
4	Is there a unit test that would catch this if it existed?

Step 3 — Reproduce the Bug

For a unit-test-reproducible bug:

cd ~/tez-src
# Write a test that fails
mvn test -pl tez-dag -Dtest=TestVertexImpl#testMyReproduction -q 2>&1 | tail -20

For a configuration-dependent bug, write a minimal local-mode DAG that triggers it.

Record:

The exact exception and stack trace
Which class and line number triggers it
Whether it is deterministic or intermittent

Step 4 — Map the Root Cause

Trace from the symptom to the line of code that is wrong:

Start with the exception message
Find the throw site in source code
Walk backwards through the call stack
Identify the single line that is wrong (the real fix site is often 10 lines above the throw site)

Step 5 — Verify Your Understanding

Post a comment on the JIRA (be professional and concise):

I was able to reproduce this issue on Tez trunk (commit <hash>) with the
following minimal test case:
[paste test code or reproduction steps]

The root cause appears to be [one sentence description] at
[ClassName.java:line].  I am working on a patch.

This establishes you as working on the issue and prevents duplicate work.

Questions

How long did it take you to go from "reading the JIRA" to "reproducing the bug"?
Was the root cause where you expected it based on the stack trace, or did you have to trace further?
Is there a comment in the code near the bug site that explains the intended behavior? Was the comment wrong?

Lab 8.2 — Implement the Fix, Write the Test, Format the Patch

Lab type: Fix-It (real JIRA)
Estimated time: 2–6 hours

Step 1 — Implement the Minimum Fix

Rules:

Change only what is necessary to fix the bug
Do not reformat surrounding code
Do not add unrelated improvements
Do not add comments unless they explain the fix
If the fix requires changes in multiple files, make all changes in one commit

Step 2 — Write the Test

Every Tez patch must include a test that:

Fails on the original code (without the fix)
Passes on the patched code

The test must be in the same test class as existing tests for the modified class.

Test quality checklist:

Test name clearly describes what it is testing
@Test(timeout = 5000) annotation (prevents hung tests from blocking CI)
No Thread.sleep() (use DrainDispatcher.await() or CountDownLatch instead)
Assertion messages explain what was expected vs. what was found
No hardcoded absolute paths or ports
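
The "no Thread.sleep()" rule is easiest to see in code. The sketch below uses only the JDK: the worker signals completion through a `CountDownLatch` and the caller waits with an upper bound, so the wait ends as soon as the work does and fails fast if it never does. `LatchWaitSketch` is a hypothetical name; in a Tez test the same shape would sit inside a `@Test(timeout = ...)` method.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: waiting on an explicit signal instead of sleeping.
final class LatchWaitSketch {

    static int runAndAwait() {
        AtomicInteger result = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            result.set(42);   // stands in for the asynchronous work under test
            done.countDown(); // signal completion explicitly
        });
        worker.start();

        try {
            // Returns as soon as countDown() runs; the 5 seconds is only an upper bound.
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new AssertionError("worker did not finish within 5 seconds");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("interrupted while waiting", e);
        }
        return result.get();
    }
}
```

A sleep-based version must guess how long the work takes: too short and the test is flaky, too long and every run pays the full delay.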

Step 3 — Run the Full Test Module

cd ~/tez-src
mvn test -pl tez-dag -q 2>&1 | tail -10

All tests must pass.

Step 4 — Run Checkstyle

mvn checkstyle:check -pl tez-dag -q 2>&1 | grep -E "ERROR|violation" | head -20

Zero violations required.

Step 5 — Format the Patch

cd ~/tez-src
git diff HEAD > /tmp/TEZ-NNNN.001.patch

Verify:

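# No trailing whitespace on added lines
grep -nE '^\+.*[[:space:]]$' /tmp/TEZ-NNNN.001.patch

# Patch is well-formed: the working tree already contains the change,
# so check that the patch reverse-applies cleanly
git apply --check --reverse /tmp/TEZ-NNNN.001.patch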

Step 6 — Upload to JIRA

Open your JIRA issue
Click "Attach File" and upload /tmp/TEZ-NNNN.001.patch
Click "Submit Patch" so that the issue status changes to "Patch Available"
Update the description or add a comment:

Attaching patch TEZ-NNNN.001.patch

Changes:
- [ClassName.java]: [one-line description of the fix]
- [TestClassName.java]: [one-line description of the new test]

The test fails on unpatched code and passes with the fix applied.

After You Submit

You will typically receive review feedback within a few days to a few weeks. Common feedback categories:

Feedback	Meaning	Your response
"Can you add a test?"	Test is missing	Add test, re-upload
"This is too broad"	Change is larger than needed	Narrow scope, re-upload
"Style nit: …"	Checkstyle or code style	Fix, re-upload
"+1"	Committer approves	Wait for commit, or ask "Is this ready to commit?"
"-1"	Hard block	Address all -1 comments before re-uploading

Lab 8.3 — Improve Error Messages for Failed DAGs

Lab type: Fix-It (error message quality)
Estimated time: 90 min

Overview

Poor error messages are one of the most common complaints from Tez users. "Container exited with a non-zero exit code" tells an operator almost nothing. This lab focuses on finding and improving a diagnostic message in the Tez AM.

Step 1 — Find Weak Error Messages

Search for generic or unhelpful diagnostics:

grep -rn '"Container exited\|"Task failed\|"Vertex failed\|unknown error' \
  ~/tez-src/tez-dag/src/main/java/ | grep -v test | head -20

Also look for messages that use string concatenation on a potentially-null object:

grep -rn 'diagnostics.*[+].*null\|null.*[+].*diagnostics' \
  ~/tez-src/tez-dag/src/main/java/ | head -20

Step 2 — Pick a Target

Select one diagnostic message that you can improve. Good candidates:

A message that says "failed" without explaining why
A message that could NPE if a field is null
A message that uses a raw integer code without a human-readable explanation

Step 3 — Understand the Context

For your chosen message:

What class emits it?
What state transition triggers it?
What information is available at that point (in the method parameters or fields) that could be added to the message?

Step 4 — Improve the Message

Example improvement:

// Before (unhelpful):
diagnostics.add("Container " + containerId + " failed");

// After (actionable). describeExitStatus(int) is a hypothetical helper that
// you would write to turn a ContainerExitStatus constant into readable text.
diagnostics.add(String.format(
    "Container %s failed with exit code %d (%s). " +
    "Check container logs at: %s",
    containerId,
    exitCode,
    describeExitStatus(exitCode),
    logURL));
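
The part of the improved message that turns an exit code into readable text could be a small helper like the one sketched here. The class `ExitStatusSketch` and both method names are hypothetical. The numeric values are meant to mirror the constants in YARN's `ContainerExitStatus` (for example -104 for a container killed for exceeding physical memory), but treat them as assumptions and verify them against the Hadoop version you build with; in a real patch you would reference the constants rather than literals.

```java
// Hypothetical helper: maps a container exit code to operator-readable text.
final class ExitStatusSketch {

    // Literal values are assumed to match org.apache.hadoop.yarn.api.records.ContainerExitStatus;
    // a real patch should reference those constants directly.
    static String describeExitStatus(int exitCode) {
        switch (exitCode) {
            case 0:
                return "success";
            case -100:
                return "ABORTED: container released or lost by the framework";
            case -101:
                return "DISKS_FAILED: node local disks exceeded the failure threshold";
            case -102:
                return "PREEMPTED: container preempted by the scheduler";
            case -103:
                return "KILLED_EXCEEDED_VMEM: virtual memory limit exceeded";
            case -104:
                return "KILLED_EXCEEDED_PMEM: physical memory limit exceeded";
            default:
                return exitCode > 0 ? "process exited with an error" : "unrecognized exit status";
        }
    }

    // Builds the full diagnostic line in the same format as the "After" example.
    static String containerFailedMessage(String containerId, int exitCode, String logUrl) {
        return String.format(
            "Container %s failed with exit code %d (%s). Check container logs at: %s",
            containerId, exitCode, describeExitStatus(exitCode), logUrl);
    }
}
```

Keeping the mapping in one method means a test can assert on the text for a specific code without constructing a whole task attempt.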

Step 5 — Write a Test for the New Message

The test should verify that:

The improved message appears in TaskAttemptImpl.getDiagnostics() or VertexImpl.getDiagnostics() after the relevant failure event
It contains the expected key fields (exit code, container ID, etc.)

Pattern:

@Test
public void testDiagnosticsContainsExitCode() {
    // ... set up failing task attempt with specific exit code ...
    List<String> diags = taskAttempt.getDiagnostics();
    assertTrue("Diagnostics should contain exit code",
        diags.stream().anyMatch(d -> d.contains("exit code 123")));
}

Step 6 — Format Patch and JIRA

git diff > /tmp/TEZ-ERRORMSG.001.patch

JIRA title pattern: [tez-dag] Improve error message for [specific failure scenario]

Reflection Questions

What makes a good diagnostic message? List 4 properties.
Why do projects accumulate bad error messages over time? (Hint: think about who writes the code vs. who runs it.)
Find a Tez JIRA where the only change was improving a log or diagnostic message. Was the patch accepted? How long did the review take?

Level 9: Advanced Committer / PMC-Level Contributor

At this level you move beyond fixing bugs into shaping the project: writing performance-critical tests, analyzing regressions, participating in design discussions, and understanding how Apache governance works.

The committer path

Contributor → trusted contributor (10+ accepted patches)
           → committer candidate (PMC votes)
           → committer (can merge patches)
           → PMC member (vote on releases and project direction)

Becoming a committer is about demonstrated judgment — not just writing correct code, but consistently:

Choosing the minimum-impact fix over the clever refactor
Writing tests that catch real bugs, not just satisfy coverage metrics
Reviewing others' patches with constructive, specific feedback
Following up on issues you reported or started

What this level covers

Topic	Lab
Write comprehensive scheduler behavior tests	Lab 9.1
Analyze and quantify a performance regression	Lab 9.2

Lab 9.1 — Write Tests for Scheduler Behavior

Lab type: Build It — comprehensive test coverage
Estimated time: 3–4 hours
Tez module: tez-dag

Overview

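The Tez task scheduler layer (TaskSchedulerEventHandler, which later releases rename to TaskSchedulerManager, plus implementations such as YarnTaskSchedulerService and LocalTaskSchedulerService) manages how containers are requested from YARN and how pending tasks are assigned to available containers.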

This is one of the least-tested areas of Tez. Well-written scheduler tests are highly valued by committers.

Step 1 — Understand the Scheduler Interface

find ~/tez-src -name "TaskScheduler.java" | head -3
find ~/tez-src -name "TaskSchedulerEventHandler.java" | head -3
find ~/tez-src -name "TestTaskScheduler*.java" | head -10

Open the scheduler interface and answer:

#	Question
1	What events does `TaskSchedulerEventHandler` process? List all event types.
2	When a container becomes available, what is the algorithm for choosing which task to assign to it?
3	When Tez requests a container from YARN, what resource profile does it request? (CPU + memory?)
4	If YARN preempts a container, what does the scheduler do to the task that was running in it?

Step 2 — Identify Missing Coverage

grep -n "public void test" \
  ~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/rm/TestTaskSchedulerEventHandler.java \
  | head -30

Find 3 scenarios that are NOT covered by existing tests. Good candidates:

Container allocation after task is cancelled (race condition scenario)
Scheduling under resource pressure (all containers allocated, new task arrives)
Task scheduled to a blacklisted node

Step 3 — Write 3 New Tests

For each missing scenario, write a test following the pattern of the existing tests. Each test must:

Set up the scheduler with a mock RMCommunicator and DAGAppMaster
Drive a sequence of events
Assert on the scheduler's resulting state and on calls made to the mock YARN RM

@Test(timeout = 5000)
public void testTaskScheduledAfterContainerPreempted() {
    // TODO: set up scheduler with 1 running container
    // TODO: simulate YARN preemption of that container
    // TODO: verify the task is re-queued (not dropped)
    // TODO: simulate new container allocation
    // TODO: verify the task is re-scheduled to the new container
}

Step 4 — Run and Verify

mvn test -pl tez-dag -Dtest=TestTaskSchedulerEventHandler -q 2>&1 | tail -10

Step 5 — Reflection

#	Question
1	The test uses mocks for YARN and the DAGAppMaster. What real behavior is NOT exercised by this approach?
2	A scheduler has inherently concurrent behavior. How do the existing tests handle thread safety?
3	If you were to write an integration test for the scheduler (using `MiniTezCluster`), what would be harder to set up than in a unit test? What would be easier to assert?

Lab 9.2 — Analyze a Performance Regression

Lab type: Research & Benchmark
Estimated time: 3–4 hours

Overview

Performance regressions are among the most impactful bugs in Tez — a 10% slowdown in shuffle can translate to significant cost at scale. But they are also the hardest to reproduce and fix.

In this lab you will:

Identify a performance-sensitive code path
Write a micro-benchmark using JMH
Compare two implementations and quantify the difference
Write a JIRA with a clear, reproducible performance report

Step 1 — Identify a Hot Path

The most performance-critical paths in Tez:

Path	Class	Why it matters
Record serialization	`TezSerializer`, `WritableSerialization`	Called once per record
Sort buffer writes	`DefaultSorter.collect()`	Called once per output record
Shuffle URL construction	`Fetcher.getFetchList()`	Called per fetch request
Counter increment	`TezCounter.increment()`	Called very frequently
BitSet operations	`VertexManagerPlugin.onSourceTaskCompleted`	Called per task completion

Step 2 — Add Maven Surefire Benchmark Configuration

For a quick JMH benchmark within the project:

<!-- Add to level-4-waving-manager/pom.xml if you want to benchmark BitSet -->
<dependency>
  <groupId>org.openjdk.jmh</groupId>
  <artifactId>jmh-core</artifactId>
  <version>1.37</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.openjdk.jmh</groupId>
  <artifactId>jmh-generator-annprocess</artifactId>
  <version>1.37</version>
  <scope>test</scope>
</dependency>

Step 3 — Write the Benchmark

Example: compare a BitSet clone followed by andNot against Set.containsAll for deciding whether every scheduled task has finished:

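import java.util.BitSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class WavingBenchmark {
    private BitSet scheduledBits, finishedBits;
    private Set<Integer> scheduledSet, finishedSet;

    // Build inputs once so the benchmarks time the comparison, not the allocation.
    // Assumed input shape: tasks 0..999 scheduled, tasks 0..499 finished.
    @Setup
    public void setUp() {
        scheduledBits = new BitSet(1000);
        scheduledBits.set(0, 1000);
        finishedBits = new BitSet(1000);
        finishedBits.set(0, 500);
        scheduledSet = IntStream.range(0, 1000).boxed().collect(Collectors.toSet());
        finishedSet = IntStream.range(0, 500).boxed().collect(Collectors.toSet());
    }

    @Benchmark
    public void benchmarkBitSetAndNot(Blackhole bh) {
        BitSet copy = (BitSet) scheduledBits.clone();
        copy.andNot(finishedBits);
        bh.consume(copy.isEmpty());
    }

    @Benchmark
    public void benchmarkManualIteration(Blackhole bh) {
        bh.consume(finishedSet.containsAll(scheduledSet));
    }
}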
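
Before trusting any timings, it is worth checking that the two approaches return the same answer on the same inputs; otherwise the benchmark compares different computations. The plain-JDK sketch below does that check. The class and method names are hypothetical, and "tasks 0..n-1 scheduled, tasks 0..m-1 finished" is an assumed input shape.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical check that the BitSet and Set formulations agree on "all scheduled tasks finished".
final class WaveCompletionCheck {

    // BitSet formulation: remove finished tasks from a copy of scheduled; empty means all done.
    static boolean allDoneBitSet(int scheduledCount, int finishedCount) {
        BitSet scheduled = new BitSet();
        scheduled.set(0, scheduledCount);
        BitSet finished = new BitSet();
        finished.set(0, finishedCount);
        BitSet remaining = (BitSet) scheduled.clone();
        remaining.andNot(finished);
        return remaining.isEmpty();
    }

    // Set formulation: every scheduled task id must be present in the finished set.
    static boolean allDoneSet(int scheduledCount, int finishedCount) {
        return range(finishedCount).containsAll(range(scheduledCount));
    }

    private static Set<Integer> range(int endExclusive) {
        Set<Integer> ids = new HashSet<>();
        for (int i = 0; i < endExclusive; i++) {
            ids.add(i);
        }
        return ids;
    }

    static boolean agree(int scheduledCount, int finishedCount) {
        return allDoneBitSet(scheduledCount, finishedCount) == allDoneSet(scheduledCount, finishedCount);
    }
}
```

Run the check for both the "partially finished" and "fully finished" cases, since a benchmark that only ever exercises the early-exit path of `containsAll` can flatter the Set version.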

Step 4 — Run and Analyze

cd book/projects
mvn -pl level-4-waving-manager test -Dtest=WavingBenchmark -q 2>&1 | tail -30

Record:

Mean time per operation (nanoseconds)
Confidence intervals
Winner

Step 5 — Write the JIRA Performance Report

Summary: [ClassName] uses Set.containsAll() (O(n) hash lookups on boxed Integers) where BitSet.andNot() (O(n/64) word operations) is available

Description:
  Micro-benchmark comparison of BitSet.andNot() vs Set.containsAll() for
  wave-completion detection in WavingVertexManager (and by extension any
  similar VertexManagerPlugin).

  Results (1000 tasks, 500 completed, JDK 11, M1 MacBook):

    BitSet.andNot():        X ± Y ns/op
    Set.containsAll():      X ± Y ns/op
    Speedup:                Nx

  For large DAGs with thousands of tasks, this difference compounds
  significantly over the lifetime of the DAG.

Patch: Switch from HashSet to BitSet in [ClassName].

Priority: Minor
Component: tez-dag

Reflection

#	Question
1	At what scale (number of tasks per DAG) would the BitSet optimization matter in practice? At 10 tasks? 10,000?
2	JMH benchmarks measure a code path in isolation. What real-world factors could make the benchmark results misleading?
3	Performance patches are often held to a higher standard of review than correctness patches. Why?

Contributor Mindset

This section is the "soft skills with hard edges" half of the curriculum. The technical chapters teach you how Tez works; this section teaches you how the Apache Tez project works — how decisions are made, how patches are accepted, how trust is earned, and how a contributor becomes a committer.

These are not optional skills. A technically excellent patch with poor process around it will sit on JIRA for months. A modest patch with clean process gets reviewed and committed.

Reading Order

The chapters are ordered to mirror the actual arc of a new contributor.

#	Chapter	What it answers
1	Reading the Codebase	How do I navigate ~200k LOC without drowning?
2	Design via JIRA	Where does design happen in Apache projects?
3	Community Interaction	How do I talk to dev@ and JIRA without burning trust?
4	Patch Quality	What does a committer-ready patch look like?
5	Responding to Feedback	How do I handle review comments well?
6	Compatibility	What can I change without breaking users?
7	Meritocracy	How does someone become a committer or PMC member?

Chapters 1–2 are pre-work — read them before opening any JIRA. Chapters 3–5 are operational — read them before submitting your first patch. Chapters 6–7 are strategic — read them when you start thinking beyond a single patch.

How This Complements the Technical Labs

The labs in Levels 1–9 build engineering competence inside the Tez codebase. This section builds the project-level competence needed to ship that work into Apache Tez itself.

The relationship is concrete:

Technical chapter	Mindset chapter that pairs with it
Level 2 Lab 2: Prepare a Patch	Patch Quality
Level 3 deep dives on AM internals	Reading the Codebase
Level 5 Tez/Hive integration	Compatibility
Level 7 protocol & wire format	Compatibility
Capstone project (`capstone/`)	All seven mindset chapters

If you are doing the Capstone, you should have read all seven chapters in this section by the time you reach Step 8 (the patch).

What This Section Is Not

It is not generic open-source advice. Every claim, template, and procedure here is grounded in:

The Apache Software Foundation Way
The Apache Tez JIRA project (TEZ)
The dev@tez.apache.org mailing-list archive
The tez-tools/src/main/resources/tez/checkstyle.xml and other in-repo policy files
The @InterfaceAudience / @InterfaceStability annotations in tez-api

Where a chapter generalises, it labels the generalisation. Where it states a Tez-specific rule, it cites the in-repo file or the JIRA where the rule was set.

Prerequisites

Before this section is useful you must have:

A local clone of Tez at ~/tez-src (git clone https://github.com/apache/tez.git)
A JIRA account at https://issues.apache.org/jira/
A subscription to dev@tez.apache.org (send empty mail to dev-subscribe@tez.apache.org)
An ASF ID is not required — that comes later, with committership.

Validation for the Section

You have absorbed this section when you can:

Find any feature in Tez within 10 minutes by tracing from TezClient or DAGAppMaster.
Write a JIRA description that a committer can act on without follow-up questions.
Produce a patch that passes mvn checkstyle:check and mvn test in changed modules on the first try.
Read a @InterfaceAudience annotation and predict what you may and may not change.
Explain to a colleague the difference between contributor, committer, and PMC.

The next chapter — Reading the Codebase — gives you the navigation strategy you will use through everything that follows.

Reading a 200k+ LOC Apache Codebase

Apache Tez is roughly 200,000 lines of Java across 15+ Maven modules. No single human holds it all in their head — not even the most senior committers. The skill is not memory; it is navigation. This chapter gives you the strategies committers actually use.

Module Map First

Before reading any code, learn the module shape. Run this once and pin the output:

cd ~/tez-src
find . -maxdepth 2 -name pom.xml | sort

The modules that matter for ~90% of work:

Module	What lives there	When you read it
`tez-api`	Public API: `TezClient`, `DAG`, `Vertex`, `Edge`, `*Descriptor`	Always start here
`tez-common`	Shared utilities, `TezConfiguration`, counters	Tracing configs
`tez-runtime-internals`	Task runtime, `LogicalIOProcessorRuntimeTask`	Following a task
`tez-runtime-library`	`OrderedPartitionedKVOutput`, shuffle inputs	I/O contracts
`tez-dag`	`DAGAppMaster`, schedulers, state machines	AM-side bugs
`tez-mapreduce`	MR compat: `MRInput`, `MROutput`	MR-on-Tez
`tez-tests`	`MiniTezCluster`, `TestOrderedWordCount`	Integration tests
`tez-tools`	Checkstyle config, swimlanes, analyzer	Process tooling

Tez follows the Hadoop convention: code lives in <module>/src/main/java, tests in <module>/src/test/java. Protobufs live in <module>/src/main/proto.

Strategy 1: Start From the Public API, Trace Inward

Every Tez user program goes through tez-api. That makes it the only mandatory entry point. The reading order:

tez-api (what users see)
   ↓
tez-dag (what the AM does with it)
   ↓
tez-runtime-internals (what tasks do)
   ↓
tez-runtime-library (the I/Os tasks use)

Trace example — "where does parallelism come from?":

cd ~/tez-src
grep -rn "setParallelism" tez-api/src/main/java | head
grep -rn "setParallelism\|reconfigureVertex" tez-dag/src/main/java | head

You will find Vertex.setParallelism(int) in tez-api and follow it to VertexImpl.setParallelism in tez-dag. That arc — API → impl — is the canonical pattern for reading Tez.

Strategy 2: Protobufs Are the Source of Truth for Anything Serialized

Anything that crosses a process boundary (client → AM, AM → container, AM → history) is defined in protobuf. The protos are the contract; the Java is the implementation.

find ~/tez-src -name "*.proto" | sort

The four protos to internalise:

Proto	Role
`tez-api/src/main/proto/DAGApiRecords.proto`	`DAGPlan`, `VertexPlan`, `EdgePlan` — the DAG on the wire
`tez-api/src/main/proto/Events.proto`	The event types that flow on the dispatcher
`tez-common/src/main/proto/TezCommonProtos.proto`	Counters, plugin descriptors
`tez-dag/src/main/proto/DAGProtos.proto`	AM-internal records

When you see a class named *Proto (e.g. DAGProtos.DAGPlan) the generated code lives in target/generated-sources/ after a build. Don't read the generated code; read the .proto.

Practical rule: if you are changing a field that appears in a proto, you are changing wire compatibility. See Compatibility.

Strategy 3: IDE Call Hierarchy + `git log -S`

Two tools, used together, replace 80% of speculative reading.

Call hierarchy (IntelliJ: Ctrl-Alt-H, Eclipse: Ctrl-Alt-H) answers "who calls this?". Use it on entry points like TezClient.submitDAG to find every call site in tests and examples.

git log -S answers "when and why did this code appear?".

cd ~/tez-src
git log -S "reconfigureVertex" --oneline -- tez-dag/
git log -S "reconfigureVertex" --oneline -- tez-api/

Pick the oldest commit referenced and read its JIRA:

git show <sha> | head -30
# Look for "TEZ-NNNN" in the commit message

That JIRA is the design discussion. It is more valuable than the code.

Strategy 4: Tests Are Executable Spec

The Tez test suite is the cheapest way to learn what a class does. For any class Foo.java, look for TestFoo.java:

find ~/tez-src -name "TestVertexImpl.java"
find ~/tez-src -name "TestDAGImpl.java"
find ~/tez-src -name "TestShuffleVertexManager.java"

The test names alone form a behavior spec:

grep "  public void test" $(find ~/tez-src -name TestVertexImpl.java)

For runtime behavior, the integration tests in tez-tests/ are the gold standard:

ls ~/tez-src/tez-tests/src/test/java/org/apache/tez/test/

TestTezJobs.java and TestExceptionPropagation.java walk full DAGs end-to-end on a MiniTezCluster. Read them before guessing how a feature behaves at runtime.

Strategy 5: Keep a Reading Log

Committers have working memory of the codebase because they wrote a lot of it. You don't. Compensate with notes. Keep one file:

mkdir -p ~/tez-notes
cat > ~/tez-notes/reading-log.md <<'EOF'
# Tez Reading Log

## YYYY-MM-DD — DAG submission path
- TezClient.submitDAG(DAG) in tez-api builds DAGPlan
- → DAGClientAMProtocolBlockingPB.submitDAG (RPC)
- → DAGAppMaster.submitDAGToAppMaster
- → DAGAppMaster.startDAG → AsyncDispatcher.getEventHandler().handle(DAGEventType.DAG_INIT)

## YYYY-MM-DD — Vertex parallelism reconfiguration
- VertexManagerPlugin.context.reconfigureVertex(...)
...
EOF

Re-reading three months later, the log is gold. Without it, you re-trace the same path.

Worked Exercise: TezClient.submitDAG → AsyncDispatcher

Goal: in 90 minutes, trace the path from a user calling tezClient.submitDAG(dag) to the event landing on the DAGAppMaster async dispatcher.

Step 1 (15 min) — Find the entry

cd ~/tez-src
find tez-api/src/main/java -name "TezClient.java"
grep -n "public DAGClient submitDAG" $(find tez-api/src/main/java -name TezClient.java)

You will find an overload that takes DAG dag. Read its body. Note that it does two things: builds a DAGPlan from the DAG, then sends it via an RPC stub.

Step 2 (20 min) — Identify the RPC

grep -rn "submitDAG" tez-api/src/main/proto/

Find DAGClientAMProtocol.proto. The SubmitDAGRequestProto carries the DAGPlan. The generated stub is DAGClientAMProtocolBlockingPB. The server side implements it in tez-dag.

grep -rn "implements DAGClientAMProtocolBlockingPB\|extends DAGClientAMProtocolBlockingPB" tez-dag/src/main/java

You will land in DAGClientAMProtocolBlockingPBServerImpl, a thin RPC adapter that delegates to DAGClientHandler (in tez-dag/.../dag/app/).

Step 3 (20 min) — Server-side handling

grep -n "submitDAG" $(find tez-dag/src/main/java -name "DAGClientHandler.java")

Follow submitDAG → DAGAppMaster.submitDAGToAppMaster → DAGAppMaster.startDAG. Inside startDAG, you will see a DAG dag = createDAG(dagPlan) and then an event dispatched through dispatcher.getEventHandler().handle(...).

Step 4 (20 min) — The dispatcher

find tez-dag/src/main/java -name "DAGAppMaster.java"
grep -n "AsyncDispatcher\|dispatcher" $(find tez-dag/src/main/java -name DAGAppMaster.java) | head

Find where dispatcher is instantiated and where event handlers are registered. The handler for DAGEventType is the DAGImpl's state machine.

Step 5 (15 min) — Record it

Open your reading log and write the four-line summary. Cite the file and line for each hop.
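
One way to collect the file-and-line citations is a small helper, sketched below. `cite_hop` is a hypothetical name, not part of Tez or its tooling. It prints FILE:LINE for the first line that contains a given literal string, which is an assumption that suits method signatures better than a regular expression would.

```shell
# cite_hop TEXT FILE: print "FILE:LINE" for the first line of FILE containing TEXT.
# TEXT is matched literally (awk index), not as a regular expression.
cite_hop() {
  text=$1
  file=$2
  line=$(awk -v text="$text" 'index($0, text) { print NR; exit }' "$file")
  # No match: report failure so a typo does not silently log nothing.
  [ -n "$line" ] || return 1
  printf '%s:%s\n' "$file" "$line"
}
```

From ~/tez-src, something like cite_hop "submitDAG(" "$(find tez-api/src/main/java -name TezClient.java)" >> ~/tez-notes/reading-log.md appends one citation per hop.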

Validation Artifacts

After this chapter you should produce and keep:

A ~/tez-notes/module-map.md with one sentence per module.
A ~/tez-notes/reading-log.md with the submitDAG trace from the exercise above.
A grep-able list of the four protos and what each one defines.
One git log -S command and the JIRA it surfaced, saved to the log.

When you can do the exercise without checking this page, you have the navigation skill. The next chapter — Design via JIRA — tells you where the design decisions behind that code actually lived.

Design via JIRA, Not PRs

Apache projects design in the open. In Tez, "the open" is the TEZ JIRA project and the dev@tez.apache.org mailing list — not the GitHub PR.

A PR with a "see what you think" attitude and no JIRA attached will be ignored. A JIRA with a clear problem statement and rough design will get responses within days, often from people who never read the PR. This chapter is about why, and how to use that system.

Why Not Just PRs?

GitHub PRs at Apache are mirrors of patches. They are convenient for diff viewing, but they are not the system of record. The system of record is:

Artifact	System	Why there
Bug report, problem statement	JIRA	Searchable, citeable forever
Design discussion	JIRA + `dev@`	Archived by the ASF, public
Patch / code review	JIRA attachment or PR linked from JIRA	Reviewed under ASF ICLA
Vote on release / committer	`dev@` / `private@`	Required by ASF policy
The final code	git	The result, not the discussion

If a discussion happens only on a PR and the PR is later force-closed or the repo moves, the rationale evaporates. JIRA + mailing list don't move.

Concrete consequence: when you read code in tez-dag/ and ask "why?", the answer is almost certainly in a JIRA referenced from the commit message — see Reading the Codebase, Strategy 3.

The TEZ JIRA Workflow

A Tez JIRA moves through these statuses:

Open → In Progress → Patch Available → Resolved → Closed
                                    ↘ Reopened

Triggers:

Transition	Triggered by	Means
Open → In Progress	Assignee starts work	Don't duplicate this
In Progress → Patch Available	Patch (or PR) is ready for review	Reviewers, please look
Patch Available → Resolved	Committer commits it	Done in trunk
Resolved → Closed	Release ships containing the fix	Done for users
Resolved → Reopened	Bug returns or revert needed	Re-do

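Of the transitions in the table, you trigger the first two yourself: Open → In Progress when you start work, and In Progress → Patch Available when the patch is ready for review. The remaining three (resolving, closing, reopening) are done by a committer.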

Reading Old JIRAs for Context

The single highest-leverage Tez skill is reading old JIRAs. Conventions:

Issues are referenced as TEZ-NNNN in commit messages and source comments. You will see // see TEZ-3045 or // TEZ-1597 peppered through the code.
Search them at https://issues.apache.org/jira/browse/TEZ-NNNN.
The "Activity" tab shows the design conversation. The "Attachments" tab shows the patch iterations (TEZ-NNNN.001.patch, TEZ-NNNN.002.patch, ...).

Try this now:

cd ~/tez-src
git log --all --oneline | grep -oE "TEZ-[0-9]+" | sort -u -t- -k2,2n | tail -20

Pick one and open it in a browser. Read the description, the comments, and the patch iterations. You will see the design happen — alternative considered, rejected, refined. This is more useful than any architecture document because it shows reasoning, not conclusions.
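
To turn the issue keys in any piece of text (a commit message, a source comment) into browsable links, a pipeline along these lines should work. `jira_urls` is a name made up for this sketch; the URL prefix is the one given in the conventions above.

```shell
# jira_urls: read text on stdin, print one JIRA URL per distinct TEZ-NNNN key found.
jira_urls() {
  grep -oE 'TEZ-[0-9]+' \
    | sort -u \
    | sed 's|^|https://issues.apache.org/jira/browse/|'
}
```

For a single commit, git log -1 --format=%B <sha> | jira_urls prints the links to open.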

When to Open a JIRA Yourself

You open a JIRA before writing the patch when any of the following is true:

Situation	Open JIRA?
Typo in Javadoc or log message	Yes (small, but track it)
One-line bug fix with obvious cause	Yes
Multi-file refactor	Yes, with a brief design
New public API	Yes, mandatory, with `[DISCUSS]` on `dev@` first
New configuration key	Yes
Performance change with measurable impact	Yes, with benchmark plan
Anything touching `DAGPlan` proto	Yes, with compatibility note

You do not need a JIRA to:

Ask a question on dev@ or user@
File a documentation question
Patch a private fork

The JIRA Description Skeleton

A Tez JIRA description that committers can act on contains, in order:

## Problem

(Two to four sentences. What is wrong. Who hits it.)

## Reproduction

(Steps to reproduce, or a code sample. If a test reproduces it, name the test class.)

## Root Cause

(One paragraph. Cite file and method.)

## Proposed Fix

(One paragraph. What you intend to do. Mention any alternatives considered.)

## Compatibility

(One sentence. Wire compat? API change? Config rename? "None." is a valid answer.)

## Test Plan

(One paragraph. Which tests pass after the change. Any new test added.)

A trivial bug fix may collapse Compatibility and Test Plan to one line each. A new API must expand them.
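
Before pasting a draft into JIRA, you can check mechanically that it carries all six headings in the skeleton's order. The sketch below assumes the draft is a local Markdown file that uses exactly the `## ` headings shown above and no others; `check_jira_draft` is a hypothetical helper name.

```shell
# check_jira_draft FILE: succeed only if FILE's "## " headings are exactly the
# six skeleton sections, in skeleton order.
check_jira_draft() {
  # Collect the heading texts into one "|"-terminated string for comparison.
  found=$(grep -E '^## ' "$1" | sed 's/^## //' | tr '\n' '|')
  [ "$found" = "Problem|Reproduction|Root Cause|Proposed Fix|Compatibility|Test Plan|" ]
}
```

Run it as check_jira_draft ~/tez-notes/draft-jira.md && echo "skeleton complete". The check is deliberately strict: an extra or misspelled heading fails it.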

Design Doc on a JIRA — Skeleton

For anything larger than a single-file fix, attach a design doc (Markdown or PDF) to the JIRA. The skeleton:

# TEZ-NNNN: <short title>

## 1. Problem
What is wrong today. Who is affected. Why "do nothing" is not acceptable.

## 2. Goals
Bulleted, testable. "DAGPlan submission survives a 10 MB plan without OOM."

## 3. Non-Goals
What this design explicitly will not address. Prevents scope creep.

## 4. Alternatives Considered
- Option A: <description>. Pros / Cons. Why rejected.
- Option B: <description>. Pros / Cons. Why rejected.
- Option C (chosen): <description>. Pros / Cons.

## 5. Chosen Approach
Architecture sketch. Mermaid or ASCII. Cite files that will change.

## 6. Compatibility
- Wire compat: <change to any proto? backward compatible?>
- API compat: <InterfaceAudience.Public touched? deprecation plan?>
- Config compat: <new keys? renamed keys? default change?>

## 7. Test Plan
- Unit tests: which classes
- Integration: MiniTezCluster scenarios
- Manual: any out-of-suite verification

## 8. Rollout
- Default off? On? Feature flag name?
- Migration steps for existing users.

Attach as TEZ-NNNN-design.md or TEZ-NNNN-design.pdf. Announce it on dev@ with subject [DISCUSS] TEZ-NNNN: <short title> and a link.

Expect 1–2 weeks of asynchronous discussion before consensus. Do not start patching until the design is at least loosely agreed — patches without design buy-in get rejected.

"See TEZ-NNNN" — The Codebase Convention

Search the Tez source for back-references:

cd ~/tez-src
grep -rn "TEZ-[0-9]" tez-dag/src/main/java | head -20

Every such reference is a permanent link from the code to a design conversation. When you add a non-obvious workaround, you do the same — leave a // TEZ-NNNN: <one line why> so the next reader can find your reasoning.
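
A rough way to see which design conversations a module leans on most is to count back-references per issue. This is only a sketch: `top_cited` is a hypothetical name, and a high count says the issue is cited often, not that it is important.

```shell
# top_cited DIR: count TEZ-NNNN back-references under DIR, most-cited first.
# Output format: "TEZ-NNNN COUNT", one issue per line.
top_cited() {
  grep -rhoE 'TEZ-[0-9]+' "$1" \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $2, $1 }'
}
```

For example, top_cited tez-dag/src/main/java | head -5 gives five issues worth reading first.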

When the Design Lives on `dev@` Only

Some discussions never reach JIRA — release planning, branch policy, build infrastructure. Those live on dev@tez.apache.org only. Archive:

https://lists.apache.org/list.html?dev@tez.apache.org

Search by subject prefix:

Prefix	Means
`[DISCUSS]`	Open question, no decision sought yet
`[PROPOSAL]`	Concrete proposal, feedback wanted
`[VOTE]`	Decision being made; minimum 72h window
`[ANNOUNCE]`	One-way: release, new committer
`[NOTICE]`	One-way: infrastructure change

Subscribing: send empty mail to dev-subscribe@tez.apache.org.

Validation Artifacts

After this chapter you should be able to produce:

The URL of three different TEZ-NNNN JIRAs cited from the Tez source, and a one-line summary of what each one is about.
A draft JIRA description (in a local file ~/tez-notes/draft-jira.md) for a bug or improvement you have noticed, following the skeleton above.
A subscription confirmation to dev@tez.apache.org.
One archived [DISCUSS] thread URL relevant to a Tez area you care about.

The next chapter — Community Interaction — covers how to actually post on dev@ and behave on JIRA without burning trust on day one.

Community Interaction

This chapter covers the operational mechanics of communicating with the Apache Tez community — dev@tez.apache.org, JIRA, and the project's chat presence. Most of the "rules" below are not Tez rules; they are Apache-wide conventions that 25 years of mailing lists have settled into. Violating them is not a hanging offence, but it does mark you as new and costs you a small amount of credibility you have not yet earned.

The Lists

Tez has the standard ASF list set:

List	Purpose	Who reads
`dev@tez.apache.org`	Development discussion, design, votes	Contributors, committers, PMC
`user@tez.apache.org`	Usage questions, "how do I"	Users, some committers
`commits@tez.apache.org`	Auto-mailed commit notifications	Mostly bots; subscribe to follow trunk
`issues@tez.apache.org`	Auto-mailed JIRA notifications	Bots, some committers
`private@tez.apache.org`	PMC-only (new-committer votes, security)	PMC only

Subscribe to a list by sending an empty mail to <list>-subscribe@tez.apache.org. Confirm the reply. Unsubscribe via <list>-unsubscribe@tez.apache.org.

Default for new contributors: subscribe to dev@ and user@. Add issues@ once you are actively tracking JIRAs.

Mail Etiquette: Subject Prefixes

Subject lines on dev@ use ASCII-bracketed prefixes so subscribers can filter. Use them.

Prefix	When
`[DISCUSS]`	Open-ended question or design idea, no vote yet
`[PROPOSAL]`	Concrete proposal seeking comment
`[VOTE]`	Vote in progress; body has voting rules
`[VOTE][RESULT]`	Closing a vote; tallies the result
`[ANNOUNCE]`	One-way announcement (release, new committer)
`[NOTICE]`	Infrastructure / branch / policy change
`[jira] [Created]` etc.	Auto-prefixed by the JIRA bot; don't compose these

For a JIRA-related question, the subject is usually Re: [jira] [Created] (TEZ-NNNN) <title> — a reply to the bot mail.

Examples of good subjects:

[DISCUSS] Promoting MROutput#getDelegationToken to @Public
[PROPOSAL] TEZ-4321: Caching DAG plans across submissions
[VOTE] Apache Tez 0.10.4 RC1
[ANNOUNCE] New Tez committer: NAME

Mail Etiquette: Formatting

The ASF lists are plaintext-first. The hard rules:

Plain text only. No HTML, no rich text. Most clients have a "Send as plain text" toggle; set it as the default for *@apache.org recipients.
Inline reply, not top-post. Quote the relevant lines, reply below each.
Wrap at ~78 columns. Long unbroken lines render badly in archives.
Sign off. First name or first + last; not your full corporate signature block.
No attachments over a few KB. Patches go on JIRA, not the list.
No images. Diagrams as ASCII or as links to images hosted elsewhere.
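
The wrap rule is easy to check before sending. The sketch below assumes you compose the mail in a local text file first; `long_lines` is a hypothetical helper, and it checks quoted lines too, which you may want to relax.

```shell
# long_lines FILE [MAX]: print "LINE_NO: LENGTH" for every line longer than
# MAX characters (default 78).
long_lines() {
  awk -v max="${2:-78}" 'length($0) > max { printf "%d: %d\n", NR, length($0) }' "$1"
}
```

Empty output means every line fits. long_lines draft.txt 72 applies a stricter limit.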

A good dev@ reply looks like:

On Tue, May 7, 2024 at 10:14 AM, Foo Bar <foo@example.com> wrote:
> I think we should change the default of tez.am.resource.memory.mb
> from 1024 to 2048 to handle large DAGs better.

Agreed for large DAGs, but 2048 doubles the AM footprint for everyone
running small jobs (most CI users). Could we instead size it based on
DAGPlan size, falling back to 1024? Sketch:

  am_mem_mb = max(1024, dagPlanBytes / 1024 * 4)

I can prototype on TEZ-4XXX if there's interest.

-- 
Jane

What it doesn't have: HTML, a corporate disclaimer, a 2 MB inline screenshot, "+1" with no context, or "any updates?" with no quoted reference.

JIRA Etiquette

JIRA is the system of record for code-touching work. The mores:

Don't reassign

The Assignee field belongs to whoever is doing the work. If a JIRA is assigned to someone else, do not reassign it to yourself, even if it's been idle for a year. Comment first:

Hi @ASSIGNEE, I'd like to pick this up if you're not actively working on it. Happy to hand back if you have an in-flight patch. If I don't hear back in a week I'll assign to myself.

After a week of silence, take it.

Ask before claiming high-traffic JIRAs

For high-visibility issues (release blockers, anything with multiple watchers), comment "I'll take a look at this" before you set yourself as assignee. This prevents two people working on the same fix.

"Patch Available" semantics

Setting status to Patch Available is a signal that means:

A patch (or PR linked from the JIRA) is attached
It applies cleanly to the current trunk
The author believes tests pass locally
The author is requesting review

It does not mean "I am still iterating." If you upload a draft, leave the status as In Progress and say so in a comment.

Status flow you control vs. don't

You may set	Means
Open → In Progress	Starting work
In Progress → Patch Available	Ready for review
Patch Available → In Progress	Reopening to revise after feedback
Comment with new patch	Iteration

Committer-only	Means
Patch Available → Resolved	Committed
Resolved → Closed	Released
Any → Reopened	Bug returned

Patch naming convention

Patches attached to JIRA use the convention TEZ-NNNN.NNN.patch:

TEZ-4321.001.patch   <- first iteration
TEZ-4321.002.patch   <- after first review round
TEZ-4321.003.patch   <- after second review round

Branch-specific patches add a branch suffix:

TEZ-4321.branch-0.10.001.patch

Old patches stay attached — never delete them. The history is part of the review record.
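
If you keep every iteration in one local directory, the next filename can be derived instead of typed. A minimal sketch, assuming the patches live side by side and none has been deleted; `next_patch_name` is a hypothetical helper.

```shell
# next_patch_name ISSUE [DIR]: print the first unused ISSUE.NNN.patch name in
# DIR (default: current directory), counting up from 001.
next_patch_name() {
  issue=$1
  dir=${2:-.}
  n=1
  while [ -e "$dir/$(printf '%s.%03d.patch' "$issue" "$n")" ]; do
    n=$((n + 1))
  done
  printf '%s.%03d.patch\n' "$issue" "$n"
}
```

For a branch-specific series, pass the prefix with the branch suffix included, for example next_patch_name TEZ-4321.branch-0.10.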

Where the Tez Community Currently Lives

Tez does not have an official Slack or Discord. The active channels are:

Channel	Use
`dev@tez.apache.org`	Primary, for all dev discussion
`user@tez.apache.org`	Usage questions
JIRA	Per-issue discussion
ASF Slack (`the-asf.slack.com`), `#tez` if it exists	Informal, ephemeral

Check whether a #tez channel actually exists before relying on it; do not assume one. The mailing list is the official channel and is where decisions are made and archived. Slack or IRC is at most a hallway conversation that must be summarised back to the list.

Sister projects you may need to follow because Tez integrates with them:

dev@hive.apache.org — Hive on Tez execution issues
dev@hadoop.apache.org — YARN / HDFS compatibility
dev@pig.apache.org — Pig on Tez (mostly inactive but exists)

Self-Introduction Template

A first post to dev@tez.apache.org after subscribing is optional but helpful. Keep it short:

Subject: [DISCUSS] Introduction and intent to contribute

Hi all,

I'm <first> <last>, a <role> at <company / "independent">. I've been
using Tez via Hive in production for ~<N> months and have been
reading the codebase to understand <component / area>.

I'm interested in contributing in the area of <one or two concrete
areas, e.g. "shuffle reliability" or "AM logging">. I've worked
through Levels 1-4 of the open-source-engineer curriculum and have
TEZ-NNNN (small Javadoc fix) ready as my first patch.

Happy for any pointers on first issues to tackle.

Thanks,
<First>

What this does:

Signals you've done homework (not asking "how do I start?")
Names a concrete area so committers can match you to mentors
References a tiny first patch, so you've already shown you understand the workflow

What to avoid:

"I'd like to contribute, please assign me a task" (no committer will do this for you)
A list of grand redesigns
A corporate signature block

Asking a Question on `user@` Well

The format that gets answers:

Subject: Tez 0.10.x: AM OOMing on submission of 200-vertex DAG

Versions:
  Tez 0.10.3
  Hadoop 3.3.6
  Hive 3.1.3
  JDK 11

Symptom:
  TezClient.submitDAG throws OOM after ~12 seconds. AM log attached
  shows GC overhead limit exceeded inside DAGImpl.init.

Reproduction:
  - submit DAG with 200 vertices, each with 5 inputs
  - tez.am.resource.memory.mb = 1024 (default)

What I tried:
  - bumping to 2048 — works
  - reducing parallelism — works around but unwanted

Question:
  Is there a known scaling limit for DAGPlan size with default AM
  memory? Should the AM default scale with DAGPlan size?

Logs / DAG: <link to gist or paste in JIRA>

It gives versions, symptom, reproduction, what was already tried, and a focused question. A question that omits any of these gets a "please provide more info" reply, costing a round-trip day.

Validation Artifacts

After this chapter you should have, on disk and in the public archive:

A subscription confirmation to dev@tez.apache.org and user@tez.apache.org.
A self-introduction email posted to dev@, with archive URL saved.
One inline-reply (not top-post) reply to an existing dev@ thread.
A JIRA filed in the TEZ project (status Open) describing a real issue you've noticed.
A ~/tez-notes/etiquette.md cheatsheet with the subject prefixes table.

The next chapter — Patch Quality — is what your first attached patch needs to look like.

Patch Quality

A "patch" in Apache parlance is a unified diff attached to a JIRA (or, more recently, a GitHub PR linked from a JIRA). This chapter tells you what a committer is looking for when they open it for the first time. Internalising these expectations is the difference between a patch that gets committed in two review rounds and one that dies after a "please rebase" comment in month three.

What Committers Look For — In Reading Order

A committer reviewing your patch does, roughly, this:

1. Read JIRA description.        (30 sec)
2. Open the patch, skim the diff stat.   (30 sec)
3. Look at tests.                (2 min)
4. Look at the implementation.   (5 min)
5. Run mvn install / mvn test.   (background)
6. Comment.                      (variable)

Notice tests come before implementation. If the test diff is empty or weak, the implementation is read with suspicion. If the test diff is strong and minimal, the implementation is read with trust.

Rule 1: Minimum Diff

This is the single rule that most distinguishes a strong patch from a weak one: the diff should contain only the changes that the JIRA describes. Not:

A whitespace cleanup of the surrounding method
A rename of an unrelated variable you didn't like
An import reorder by your IDE
A bumped dependency version "while you were here"
A reformatted block

Every line you change costs the reviewer attention. Lines that don't serve the JIRA are a tax on the review.

Check before submitting:

cd ~/tez-src
git diff --stat origin/master
git diff origin/master | head -50

If git diff --stat shows changes in files unrelated to the JIRA, revert them:

git checkout origin/master -- path/to/unrelated/file

Rule 2: No Unrelated Changes

The corollary to Rule 1. Even within a touched file, do not bundle unrelated improvements. If you notice a separate bug while fixing your bug:

# don't fix it here. Open a separate JIRA:
echo "Noticed: VertexImpl.java:842 catches Exception too broadly" >> ~/tez-notes/queue.md

File a follow-up JIRA at the end of the week. Two small patches beat one mixed patch every time.

Rule 3: Apache Commit Message Format

The exact format used in git log for committed Tez changes:

TEZ-NNNN: <short imperative summary, under 72 chars>. (<contributor-name> via <committer-name>)

Verify with:

cd ~/tez-src
git log --oneline -20

You will see lines like:

abc1234 TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gunther)
def5678 TEZ-4322: Add MR compat test for vectorized output. (John Smith via gopalv)

When you submit, your commit message has the contributor side only:

TEZ-4321: Fix NPE in VertexImpl.recover when no inputs.

The committer appends (Jane Doe via <committer>) at commit time. Don't pre-fill it.

The summary line rules:

Imperative mood: "Fix", "Add", "Remove", "Refactor" — not "Fixed", "Adding".
Under 72 characters.
Ends with a period.
No trailing whitespace.
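
Most of these rules can be checked mechanically; imperative mood cannot, so it is left to you. The sketch below is one possible check, not a Tez tool: `check_summary` is a hypothetical name, and it assumes the 72-character limit applies to the whole line, issue key included.

```shell
# check_summary LINE: succeed if LINE looks like a contributor-side summary.
check_summary() {
  s=$1
  # Assumption: the limit covers the whole line, "TEZ-NNNN: " prefix included.
  [ "${#s}" -lt 72 ] || { echo "summary is 72 characters or longer" >&2; return 1; }
  # Issue key, colon, space, capitalised first word, and a final period.
  # Requiring the period as the last character also rules out trailing whitespace.
  printf '%s\n' "$s" | grep -Eq '^TEZ-[0-9]+: [A-Z].*\.$' \
    || { echo "expected 'TEZ-NNNN: Summary.'" >&2; return 1; }
}
```

To check the commit you are about to turn into a patch: check_summary "$(git log -1 --format=%s)".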

If the change needs more explanation, leave one blank line and add a body wrapped at 72 columns:

TEZ-4321: Fix NPE in VertexImpl.recover when no inputs.

When a vertex has no Inputs (a root data-source vertex with no
upstream edges), VertexImpl.recover called .iterator() on a null
inputs collection. The fix initialises inputs to an empty list in
the recover path.

Adds TestVertexImpl.testRecoverNoInputs covering the case.

Rule 4: Tests for Behavior Changes

Any behavior change must come with a test. This includes bug fixes — the test should fail before your fix and pass after. Verify:

cd ~/tez-src
# stash only the fix (main sources), leaving the new test in place
git stash push -- tez-dag/src/main/java
# run the new test
mvn test -pl tez-dag -Dtest=TestVertexImpl#testRecoverNoInputs
# it should fail
git stash pop
mvn test -pl tez-dag -Dtest=TestVertexImpl#testRecoverNoInputs
# now it should pass

If the new test also passes with the fix stashed, it doesn't actually exercise the bug.

Exceptions where a test is not required:

Change type	Test needed?
Javadoc fix	No
Log message string change	No
Comment / formatting (rare; should be its own patch)	No
Build / Maven config change	Usually no, but justify
Behavior change	Yes, always

Rule 5: No Whitespace Churn

Whitespace-only diff lines are noise. IDEs love to insert them — turn off "format on save" for tez-src, or restrict it to lines you edited.

Detect before submitting:

cd ~/tez-src
git diff -w origin/master --stat
git diff origin/master --stat

If the second shows many more changed files than the first, you have whitespace churn. Either clean it up or, if it's pervasive, configure your editor and re-do the change.
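
The two --stat runs tell you that churn exists; to name the offending files, something like the following may help. It is a sketch under stated assumptions: `ws_only_files` is a hypothetical helper, it must run inside the repository, and paths containing tabs or other characters that git quotes are not handled.

```shell
# ws_only_files BASE: list files whose diff against BASE is whitespace-only.
ws_only_files() {
  base=$1
  all=$(mktemp)
  real=$(mktemp)
  # Every file that differs from BASE in any way.
  git diff --no-renames --name-only "$base" | sort > "$all"
  # Files that still differ once whitespace is ignored. Depending on the git
  # version, whitespace-only files show up as "0 0" or are omitted entirely;
  # the awk filter treats both cases the same.
  git diff --no-renames -w --numstat "$base" \
    | awk -F'\t' '$1 != "0" || $2 != "0" { print $3 }' \
    | sort > "$real"
  # Lines only in the first list: changed, but only in whitespace.
  comm -23 "$all" "$real"
  rm -f "$all" "$real"
}
```

ws_only_files origin/master prints the files to revert with git checkout origin/master -- <file>.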

Rule 6: Javadoc for `@Public` API

If you add or modify a method on a class annotated @InterfaceAudience.Public, it needs Javadoc. The check:

cd ~/tez-src
grep -rlE "@(InterfaceAudience\.)?Public" tez-api/src/main/java | head

For each such class, every public method has Javadoc with at least:

One-sentence summary
@param for each parameter
@return for non-void
@throws for any non-RuntimeException declared exception

If your patch adds a new public method without Javadoc, expect the first review comment to ask for it.

Rule 7: `@InterfaceAudience` and `@InterfaceStability` Annotations

Every public-ish class in tez-api is annotated. Example from Vertex.java:

@Public
@Evolving
public class Vertex {
    ...
}

The grid:

	`@Stable`	`@Evolving`	`@Unstable`
`@Public`	Compat guaranteed across minor versions	May change between minor versions with warning	May change between any release
`@LimitedPrivate({"Hive"})`	Stable for named projects	Evolving for named projects	Unstable, named projects only
`@Private`	Internal; do not depend on	Internal	Internal

When you add a new class to tez-api, you must annotate it. The annotations are Hadoop's org.apache.hadoop.classification.InterfaceAudience and InterfaceStability, which tez-api imports rather than defines. When in doubt, default to:

@Public
@Unstable

so users see the class but know not to depend on its shape yet.

Rule 8: Pre-Submit Checklist

Before you upload TEZ-NNNN.001.patch, run each of these and have all pass.

cd ~/tez-src

# 1. Full compile, all modules, no tests.
mvn install -DskipTests

# 2. Checkstyle. Tez uses the config in tez-tools/.
mvn checkstyle:check

# 3. Tests in modules you changed.
# For tez-dag, tez-api, etc.:
mvn test -pl tez-dag
mvn test -pl tez-api

# 4. A representative integration test.
mvn test -pl tez-tests -Dtest=TestOrderedWordCount

# 5. Patch applies cleanly to current master.
git fetch origin
git rebase origin/master
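git diff origin/master > /tmp/TEZ-NNNN.001.patch
# Check against a pristine checkout of origin/master; your working tree already contains the change.
git worktree add --detach /tmp/tez-check origin/master
git -C /tmp/tez-check apply --check /tmp/TEZ-NNNN.001.patch
git worktree remove /tmp/tez-check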

If any step fails, fix and re-run. Submit only when all pass.
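
The checklist lends itself to a script that stops at the first failure. What follows is one possible shape for ~/tez-notes/precommit.sh, not an official Tez script: `run_step` and `precommit` are names invented here, the step labels are arbitrary, and the patch-generation commands are left out because they need the issue number.

```shell
# run_step LABEL CMD...: run CMD, report the outcome, return non-zero if CMD fails.
run_step() {
  label=$1
  shift
  printf '==> %s\n' "$label"
  if "$@"; then
    printf 'PASS %s\n' "$label"
  else
    printf 'FAIL %s\n' "$label" >&2
    return 1
  fi
}

# precommit MODULE...: run the checklist for the modules you changed,
# stopping at the first failing step. Run it from the root of the checkout.
precommit() {
  run_step "compile" mvn install -DskipTests || return 1
  run_step "checkstyle" mvn checkstyle:check || return 1
  for module in "$@"; do
    run_step "tests in $module" mvn test -pl "$module" || return 1
  done
  run_step "integration" mvn test -pl tez-tests -Dtest=TestOrderedWordCount || return 1
  run_step "fetch" git fetch origin || return 1
  run_step "rebase" git rebase origin/master
}
```

After sourcing the file, cd ~/tez-src && precommit tez-dag tez-api runs the steps in order; the first FAIL line tells you where to look.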

Rule 9: Patch Generation

Generate the patch from a clean rebase against origin/master:

cd ~/tez-src
git fetch origin
git rebase origin/master           # resolves conflicts now, not at commit time
git diff origin/master --no-color --unified=5 > TEZ-NNNN.001.patch

The --unified=5 gives reviewers 5 lines of context instead of the default 3. This is a small kindness that makes review materially easier.

Inspect the patch before attaching:

wc -l TEZ-NNNN.001.patch          # how big is it?
head -30 TEZ-NNNN.001.patch       # right files?
grep -v "^+++ " TEZ-NNNN.001.patch | grep -c "^+"   # added lines, excluding +++ file headers
grep -v "^--- " TEZ-NNNN.001.patch | grep -c "^-"   # removed lines, excluding --- file headers

A patch of 50–300 lines is comfortable for a single review round. A patch over 1000 lines will sit unreviewed until you split it.
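
For a one-line size summary to paste into a JIRA comment, a small awk wrapper is enough. `patch_stats` is a hypothetical name; the sketch assumes a unified diff and would miscount a content line that itself begins with `++ ` or `-- `, which is rare in Java sources.

```shell
# patch_stats FILE: print "+ADDED -REMOVED" for a unified diff.
patch_stats() {
  awk '
    /^\+\+\+ / || /^--- / { next }   # file headers, not content
    /^\+/ { added++ }
    /^-/  { removed++ }
    END   { printf "+%d -%d\n", added, removed }
  ' "$1"
}
```

patch_stats TEZ-NNNN.001.patch prints a single line in the form +ADDED -REMOVED.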

Worked Example — A Minimal Trivial Patch

A real-shape patch for a Javadoc fix on Vertex.java:

diff --git a/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java b/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
index abcdef1..1234567 100644
--- a/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
@@ -180,6 +180,10 @@ public class Vertex {
   }

   /**
-   * Set the parallelism.
+   * Set the parallelism (number of tasks) for this Vertex.
+   *
+   * @param parallelism the number of tasks. Must be positive unless
+   *                    {@link #setVertexManagerPlugin} configures a dynamic plugin.
+   * @return this Vertex, for chaining.
    */
   public Vertex setParallelism(int parallelism) {

That's the entire patch: one line removed and five added (+5/-1). No test (Javadoc only). It passes mvn checkstyle:check and mvn install -DskipTests, and the JIRA is TEZ-NNNN: Improve Javadoc on Vertex#setParallelism.

Anti-Patterns

What committers flag immediately:

Anti-pattern	Why it's flagged
Reformat of an entire file	Hides the real change
`// TODO: refactor` comment added	Should be a separate JIRA
`System.out.println` left in	Use `LOG`, never System.out
`e.printStackTrace()`	Use `LOG.warn(msg, e)`
Catch `Exception` swallowing everything	Catch specific or rethrow
New configuration key with no `@Public` annotation	Won't be honored as stable
New method with `throws Exception`	Use specific exceptions
Test that always passes (no assertion)	Useless
Test depending on wall-clock timing	Flaky
`@Ignore` added to silence a failing test	Fix it or revert

Validation Artifacts

After this chapter you should have:

A ~/tez-notes/precommit.sh script running the five pre-submit steps from Rule 8.
One actual patch file TEZ-NNNN.001.patch on disk, even if you haven't uploaded it.
A ~/tez-notes/patch-checklist.md cheatsheet from Rule 8.
Knowledge of the @InterfaceAudience / @InterfaceStability matrix.

The next chapter — Responding to Feedback — covers what happens after you press "Attach".

Responding to Feedback

Your patch is attached. A committer comments. What happens next is the most underrated skill in open source: turning review comments into a committed patch without burning the reviewer's patience or your own. This chapter is the playbook.

The Asynchronous Reality

Apache review is asynchronous and bursty. The committer who reviews your patch may:

Be in a different time zone (most likely)
Be reviewing on weekends or commute time
Have other patches queued
Be the only person in the world who deeply knows the file you touched

Practical consequences:

Reality	What it means for you
Reviews come in bursts, not steady drip	Respond within 24–48h of the burst, then wait
Patches sit for weeks between rounds	Keep a `~/tez-notes/in-flight.md` list
Same committer often reviews 2–3 of your patches in one sitting	Have all of them ready
A committer may never come back	Polite ping on JIRA at 2 weeks, `dev@` at 4

Set the expectation early — both for yourself and for reviewers — that a non-trivial patch takes 3–6 weeks from first attach to commit. Optimise for round-trip count, not round-trip duration.

Address Each Comment Explicitly

Reviewers leave per-line comments on a patch (on JIRA in older Tez, on a PR in newer). Each comment needs an explicit response. Not implicit. The committer should not have to diff your old and new patches to figure out which feedback you took.

The pattern:

Reviewer:  L243: This catches Exception too broadly. Tighten to IOException.

You (in JIRA comment when attaching .002):

  Addressed in .002:
    - L243: tightened to IOException; rethrowing wrapped TezException as before.
    - L301: added the missing null check you mentioned.
    - L427: pushed back; see explanation below.

This three-line response is more valuable than a perfect patch with no commentary. It shows you read every comment and decided about each one.

Don't Argue Without Evidence

When a committer says "this is wrong" and you disagree, the natural reflex is to defend. The Apache-effective reflex is to provide evidence.

Bad:

I don't think changing this would help.

Good:

I tried the suggested approach in a local branch. It causes TestVertexImpl#testRecover to fail because REASON. Output:
java.lang.AssertionError: expected 3 attempts, got 2
  at ...
Suggesting we keep the current approach with the additional comment you also asked for.

Three rules for pushback:

Always try the alternative first. Often the committer is right and you didn't see it.
Quote the failing test or benchmark. Numbers and stack traces close arguments.
Offer the smallest possible compromise. "Keep current behavior but add the comment you asked for" is much easier to accept than "no."

When to Push Back

You should push back when:

The committer's suggestion would break a documented behavior of a @Public API.
The committer's suggestion contradicts another committer's suggestion (cite the other).
The committer's suggestion expands scope beyond the JIRA (offer to file a follow-up).
You have a measurement (perf, memory) that contradicts the suggestion.

You should not push back when:

It's a style preference and you don't strongly care. Take it; save your capital.
It's a test-coverage ask. Add the test.
It's a "split this into two patches" ask. Split it.
It's "rename this method." Rename it.

The principle: defend the substance of the patch, never the shape.

When to Abandon

Most patches that get abandoned should not have been opened in the first place. But some get abandoned mid-review and that's the right call. Signals:

Signal	Right action
Two committers disagree irreconcilably on the approach	Wait for them to resolve on `dev@`; don't ping-pong patches
The JIRA is rejected as "won't fix" after design discussion	Close the JIRA, archive the patch locally, move on
The required change is much larger than you estimated and you can't commit the time	Comment honestly, unassign yourself, leave the JIRA open
The codebase has changed significantly and a complete rewrite is needed	Comment, unassign, leave for someone else

Abandoning is a respectable outcome. Ghosting a patch is not. If you can't continue, say so on the JIRA in one sentence:

Stepping away from this; my time has been redirected. Unassigning so someone else can pick it up. Latest patch (.003) is a good starting point but needs the test reviewer @NAME asked for.

Post a New Patch with a Clear Delta

When you upload TEZ-NNNN.002.patch, leave a JIRA comment that lists the deltas from .001:

Posted .002. Delta from .001:

- L243: tightened catch to IOException, per @<reviewer>.
- L301: added null check, per @<reviewer>.
- L427: kept current logic; rationale above.
- Added testRecoverNoInputs in TestVertexImpl.

mvn install -DskipTests, mvn checkstyle:check, mvn test -pl tez-dag all pass.

Why this matters:

Reviewer can re-review by diffing the delta, not the full patch.
Future readers of the JIRA see the iteration history at the JIRA level, not just in git.
It demonstrates the patch had real iteration, not a vibes-based "I changed some stuff."

Diff your own patches locally:

diff -u TEZ-NNNN.001.patch TEZ-NNNN.002.patch | less

Thank the Reviewer

After commit, comment on the JIRA:

Thanks @COMMITTER for the review and commit. Thanks @OTHER-REVIEWERS for the feedback.

This is not perfunctory. Apache is a long game. The committer who reviewed your first patch is likely to review your tenth. They are humans investing volunteer attention.

Acknowledgement also matters at the project level — it shows other onlookers that the project's reviewers are responsive, which makes the next contributor more likely to attempt a patch.

The Shepherd Committer

For non-trivial JIRAs, especially design-heavy ones, one committer often becomes the "shepherd" — the de facto reviewer and merge-committer. The relationship:

Their role	Your role
Reviews each patch iteration	Addresses comments promptly
Surfaces concerns from other committers	Treats them as that committer's concerns, not the shepherd's
Commits the final patch	Provides commit message text
May ask for sub-JIRAs	Files them, links them
Champions the design on `dev@` if questioned	Provides ammunition (numbers, tests)

Spotting a shepherd: after 2–3 review rounds with the same committer, they're shepherding. Direct future questions on the JIRA to them ("@COMMITTER, would you prefer A or B for the rename?"). Don't ping multiple committers in parallel; that fragments attention.

When to Ping

JIRA pings have a half-life. Use them sparingly.

Wait time since last activity	Action
< 1 week	Don't ping. Reviewers are busy.
1–2 weeks	Comment on JIRA: "Friendly ping — anything blocking on my side?"
2–4 weeks	Re-ping on JIRA, cc'ing any prior reviewer by `@`-mention.
> 4 weeks	Mention on `dev@` in a `[DISCUSS]` thread: "TEZ-NNNN has been quiet for a month, anyone willing to take another look?"

What kills a patch dead: pinging daily or every few days. After two such pings, reviewers deprioritise the patch out of self-defence. Stick to the schedule above.
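The ping table can be written down as a small lookup, which is handy if you track in-flight JIRAs in a script. This is a minimal sketch: the class name, the enum, and the exact day boundaries (7, 14, 28) are assumptions made for illustration, since the table itself leaves the tier edges loose.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper; the tier boundaries are one reading of the ping table. */
final class PingSchedule {
    enum Action { WAIT, FRIENDLY_PING, REPING_WITH_MENTION, RAISE_ON_DEV_LIST }

    /** Maps whole days of silence on a JIRA to the suggested action. */
    static Action actionFor(long daysQuiet) {
        if (daysQuiet < 0) {
            throw new IllegalArgumentException("daysQuiet must be >= 0");
        }
        if (daysQuiet < 7) return Action.WAIT;                  // under 1 week
        if (daysQuiet <= 14) return Action.FRIENDLY_PING;       // 1 to 2 weeks
        if (daysQuiet <= 28) return Action.REPING_WITH_MENTION; // 2 to 4 weeks
        return Action.RAISE_ON_DEV_LIST;                        // over 4 weeks
    }

    /** Convenience overload working from the date of last JIRA activity. */
    static Action actionFor(LocalDate lastActivity, LocalDate today) {
        return actionFor(ChronoUnit.DAYS.between(lastActivity, today));
    }

    public static void main(String[] args) {
        LocalDate last = LocalDate.of(2024, 3, 1);
        System.out.println(actionFor(last, LocalDate.of(2024, 3, 4)));  // WAIT
        System.out.println(actionFor(last, LocalDate.of(2024, 3, 11))); // FRIENDLY_PING
        System.out.println(actionFor(last, LocalDate.of(2024, 4, 15))); // RAISE_ON_DEV_LIST
    }
}
```

The function is deliberately stateless: it answers "what now?" from the date of last activity alone, which is the only input the table uses.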

Worked Example — A Full Round-Trip

JIRA: TEZ-4321, "Fix NPE in VertexImpl.recover when no inputs."

Day 0:  You attach TEZ-4321.001.patch, set status to Patch Available.
Day 4:  Committer @gunther comments:
          L88: prefer Collections.emptyList() over new ArrayList<>()
          L92: add test for the no-inputs case
          L94: should we also handle no-outputs symmetrically?

Day 5:  You reply on JIRA:
          - L88: agreed, will fix.
          - L92: agreed, adding TestVertexImpl#testRecoverNoInputs.
          - L94: noticed but out of scope for this JIRA. Filed TEZ-4329 for follow-up.

Day 5:  You attach TEZ-4321.002.patch and a delta-summary comment.

Day 9:  @gunther comments: "+1 LGTM"

Day 10: @gunther commits as
          "TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gunther)"
        and sets status to Resolved.

Day 10: You comment:
          "Thanks @gunther. Working on TEZ-4329 next."

10 days, 2 patch rounds, 1 follow-up JIRA filed, 0 arguments. That is a healthy review.

When Feedback Comes from a Non-Committer

Non-committers can review too. Their +1 is non-binding (only committers' votes count for commit), but their feedback is often substantively excellent — they may know the area better than the committer who eventually commits.

Treat non-committer feedback exactly like committer feedback: address each comment, explain, iterate. Two non-binding +1s also signal to a committer that the patch is ready to consider, accelerating attention.

Validation Artifacts

After this chapter you should have:

A ~/tez-notes/in-flight.md listing any JIRA you currently have a patch on, with the date of last activity.
A template for the "delta from previous patch" comment, saved as ~/tez-notes/delta-template.md.
Internalised the four-tier ping schedule.
The reflex to thank the committer after merge.

The next chapter — Compatibility — is the technical knowledge you need so reviewers don't have to teach you compatibility rules during review.

Compatibility

Tez is a library that ships into long-lived production clusters running Hive, Pig, and custom DAG applications. A compatibility break in Tez ripples out to every downstream project that depends on it. This chapter is the operational knowledge of what you may and may not change without breaking users.

The Three Compatibility Surfaces

Tez has three distinct compatibility surfaces, each with different rules:

Surface	What it covers	Where defined
API compatibility	Source/binary compat of Java classes	`@InterfaceAudience`/`@InterfaceStability` annotations in `tez-api`
Wire compatibility	Serialised messages over the network	protobufs in `*/src/main/proto/`
Configuration compatibility	Config keys and default values	`TezConfiguration` constants in `tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java`

A single patch may touch zero, one, two, or all three. Knowing which surface you're touching tells you which rules apply.

API Compatibility — The Annotation Grid

Every class in tez-api is (or should be) annotated. The two-axis grid:

	`@Stable`	`@Evolving`	`@Unstable`
`@Public`	Compat across minor versions. Major bump to change.	May change across minor versions with deprecation.	May change across any release.
`@LimitedPrivate({"Hive"})`	Stable for named projects only (e.g. Hive).	Evolving for named projects.	Unstable, named projects only.
`@Private`	Internal. No external compat.	Internal.	Internal.

The annotations live at tez-api/src/main/java/org/apache/hadoop/classification/:

ls ~/tez-src/tez-api/src/main/java/org/apache/hadoop/classification/
# InterfaceAudience.java
# InterfaceStability.java

Verify a class:

grep -B2 "^public class Vertex" ~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java

You will see:

@Public
@Evolving
public class Vertex {

That tells you: external users may write code against Vertex, but the class may evolve between minor versions. You may add methods. You should not remove or change the signature of an existing method without deprecation.
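The same check can be done programmatically. The sketch below is self-contained, so it declares stand-in annotations named after the grid rather than importing the real ones. The stand-ins use runtime retention so reflection can see them; whether the real annotations are retained at runtime is an assumption you should verify before pointing code like this at tez-api. All class names here are hypothetical.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

final class AnnotationGridSketch {
    // Stand-in annotations, declared locally so the sketch compiles without
    // Tez or Hadoop on the classpath. They are NOT the real annotation types.
    @Retention(RetentionPolicy.RUNTIME) @interface Public {}
    @Retention(RetentionPolicy.RUNTIME) @interface Private {}
    @Retention(RetentionPolicy.RUNTIME) @interface Stable {}
    @Retention(RetentionPolicy.RUNTIME) @interface Evolving {}
    @Retention(RetentionPolicy.RUNTIME) @interface Unstable {}

    // Hypothetical classes standing in for real API types.
    @Public @Evolving
    static class VertexLike {}

    @Private
    static class InternalHelper {}

    static class Unannotated {}

    /** Returns the grid cell for a class as "audience/stability". */
    static String describe(Class<?> type) {
        String audience = type.isAnnotationPresent(Public.class) ? "Public"
            : type.isAnnotationPresent(Private.class) ? "Private"
            : "unannotated";
        String stability = type.isAnnotationPresent(Stable.class) ? "Stable"
            : type.isAnnotationPresent(Evolving.class) ? "Evolving"
            : type.isAnnotationPresent(Unstable.class) ? "Unstable"
            : "unspecified";
        return audience + "/" + stability;
    }

    public static void main(String[] args) {
        System.out.println(describe(VertexLike.class));     // Public/Evolving
        System.out.println(describe(InternalHelper.class)); // Private/unspecified
        System.out.println(describe(Unannotated.class));    // unannotated/unspecified
    }
}
```

A class that comes back as "unannotated" is the interesting case in review: it has no stated contract, so ask on the JIRA before treating it as either public or internal.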

What You Can and Can't Change

The decision matrix for modifying an existing public method:

Change	`@Public @Stable`	`@Public @Evolving`	`@Public @Unstable`	`@Private`
Add new method to class	OK	OK	OK	OK
Add overload (different signature)	OK	OK	OK	OK
Add optional parameter (new overload)	OK	OK	OK	OK
Rename method	Major version only	Deprecate first	OK with note in CHANGES.txt	OK
Change parameter type	Major version only	Deprecate + add new	OK	OK
Change return type (widening)	Major version only	OK with note	OK	OK
Change return type (narrowing)	Major version only	Major version only	OK	OK
Remove method	Major version only	Major after 1 minor deprecation	OK with note	OK
Change method behavior (same signature)	Avoid; needs `dev@` discussion	Note in CHANGES.txt	OK	OK

The default rule for @Public @Stable: assume you can't change it. To change it, you need dev@ agreement first.

Deprecation Procedure

When deprecating a @Public @Evolving method:

/**
 * @deprecated Since 0.10.5, use {@link #setParallelism(int, VertexLocationHint)} instead.
 *             This method will be removed in 0.12.0.
 */
@Deprecated
public Vertex setParallelism(int parallelism) {
    return setParallelism(parallelism, null);
}

Three required elements:

@Deprecated annotation on the method.
@deprecated Javadoc tag explaining what to use instead.
A target removal version. Vague "may be removed" deprecations live forever.

Add a note to CHANGES.txt:

DEPRECATIONS:
  TEZ-NNNN: Vertex.setParallelism(int) is deprecated; use setParallelism(int, VertexLocationHint).
            Will be removed in 0.12.0.

Wire Compatibility — Protobufs

The DAGPlan protobuf is the most compatibility-sensitive file in Tez. It is the serialised contract between:

The Tez client (often inside Hive, Pig, or user code) and the AM
The AM and history (ATSHistoryLoggingService)
The AM and the recovery file

A DAGPlan written by a 0.10.3 client must be readable by a 0.10.5 AM. An AM built today must be able to read a DAGPlan from a recovery file written months ago.

The protobuf compatibility rules (protobuf 2.5 semantics, which Tez still uses for historic reasons):

Change to a `.proto`	Wire compat impact
Add a new `optional` field with default	Forward + backward compatible
Add a new `repeated` field	Forward + backward compatible
Add a new `required` field	BREAKS old readers
Remove an `optional` field	Wire-safe only if the tag number is retired and never reused; BREAKS readers that depend on the value being set
Rename a field (same tag)	OK in wire, breaks source compat
Change a field's tag number	BREAKS wire compat
Change a field's type	Usually BREAKS
Convert `optional` to `repeated`	BREAKS
Add a new enum value	BREAKS if old readers reject unknowns

The hard rules for DAGApiRecords.proto and the other files in that directory:

ls ~/tez-src/tez-api/src/main/proto/
# DAGApiRecords.proto
# DAGClientAMProtocol.proto
# Events.proto

Never reuse a tag number. Once tag 12 was used, it's used forever.
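Never change a field's type. Even widening (int32 to int64) is unsafe: both are varints on the wire, so nothing fails to parse, but a reader built against the old int32 declaration silently truncates large values.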
Never make an optional field required.
New fields go at the end with the next free tag number, marked optional.
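The widening case is worth seeing concretely, because it fails quietly rather than loudly. The sketch below hand-rolls the base-128 varint encoding that protobuf uses for int32 and int64 fields and then reads the same bytes back at both widths. It does not use the protobuf library; the class and method names are hypothetical, and the cast to int models the truncation an old int32 reader performs.

```java
import java.io.ByteArrayOutputStream;

final class VarintWideningSketch {
    /** Encodes a value as a base-128 varint (7 payload bits per byte). */
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // last byte, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a varint at full 64-bit width, as a new int64 reader would. */
    static long decodeVarint64(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated varint");
    }

    /** Models an old reader that still declares the field as int32. */
    static int decodeAsInt32(byte[] bytes) {
        return (int) decodeVarint64(bytes); // upper 32 bits are discarded
    }

    public static void main(String[] args) {
        byte[] wire = encodeVarint(5_000_000_000L); // written by a new int64 writer
        System.out.println(decodeVarint64(wire));   // 5000000000
        System.out.println(decodeAsInt32(wire));    // 705032704
    }
}
```

No exception is thrown anywhere in the old reader's path, which is why this class of break tends to surface as corrupted values in production rather than as a test failure.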

When adding a new field:

 message VertexPlan {
   required string name = 1;
   optional int32 num_tasks = 2;
   ...
   optional int64 last_modified_time = 11;
+  optional int32 max_attempts = 12;
 }

The Java side should treat the new field as "may be absent" forever — old plans don't have it.
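On the Java side, "may be absent" usually means checking the has-accessor before trusting the getter. For a proto2 optional field, protoc generates a hasX() and getX() pair; the sketch below models that pair with a hand-written interface so it compiles without the generated classes. The interface, the fallback constant, and its value are hypothetical, and max_attempts is the illustrative field from the diff above.

```java
final class OptionalFieldSketch {
    /** Stand-in for the accessors generated for an optional int32 field. */
    interface VertexPlanView {
        boolean hasMaxAttempts();
        int getMaxAttempts();
    }

    /** Hypothetical behaviour to apply when an old plan lacks the field. */
    static final int FALLBACK_MAX_ATTEMPTS = 4;

    /** Reads the field, treating "absent" differently from "set to 0". */
    static int effectiveMaxAttempts(VertexPlanView plan) {
        return plan.hasMaxAttempts() ? plan.getMaxAttempts() : FALLBACK_MAX_ATTEMPTS;
    }

    /** A plan written before the field existed: nothing set, getter yields 0. */
    static VertexPlanView oldPlan() {
        return new VertexPlanView() {
            @Override public boolean hasMaxAttempts() { return false; }
            @Override public int getMaxAttempts() { return 0; }
        };
    }

    /** A plan written by code that knows the field. */
    static VertexPlanView newPlan(final int maxAttempts) {
        return new VertexPlanView() {
            @Override public boolean hasMaxAttempts() { return true; }
            @Override public int getMaxAttempts() { return maxAttempts; }
        };
    }

    public static void main(String[] args) {
        System.out.println(effectiveMaxAttempts(oldPlan()));  // 4
        System.out.println(effectiveMaxAttempts(newPlan(7))); // 7
        System.out.println(effectiveMaxAttempts(newPlan(0))); // 0
    }
}
```

Calling the getter alone would make an old plan indistinguishable from a new plan that explicitly set 0, which is the bug this pattern avoids.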

Recovery File Compatibility

The AM writes recovery files containing serialised DAGPlan and event records. On restart, the AM reads the recovery file left by the previous attempt. After an upgrade that previous attempt may have run older code, so a patched AM must be able to read recovery files written by the unpatched AM.

Practical rule: recovery is at least as wire-compat-sensitive as RPC. Treat every DAGPlan change as a recovery-format change. Tests:

find ~/tez-src -name "TestDAGRecovery*.java"
find ~/tez-src -name "TestRecovery*.java"

If your patch touches a proto, run these tests and add a new case demonstrating old-format recovery still works.

History / ATS Compatibility

The history record format (used by the Tez UI and ATS) is also a wire format:

find ~/tez-src -name "HistoryEvent*.java" | head
find ~/tez-src -name "HistoryEvent.proto"

A change here breaks Tez UI queries on historical DAGs. The compatibility rule is the same as for DAGPlan. The reviewer for any history-format patch is typically a Hive committer who depends on the Tez UI.

Configuration Compatibility

Configuration keys are defined in TezConfiguration:

grep "public static final String TEZ_" \
    ~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head -30

Each key looks like:

@ConfigurationProperty(type = "integer")
public static final String TEZ_AM_RESOURCE_MEMORY_MB = "tez.am.resource.memory.mb";
public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;

Adding a new key

OK at any time. Add the String constant, the _DEFAULT constant, an @Public / @Unstable (or @Evolving) annotation if the surrounding class is annotated, and a javadoc explaining the key and its valid range.
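As a rough illustration of those pieces together, the sketch below defines a made-up key with its default and documented range, plus a reader that enforces the range. The key name, the range, and the use of java.util.Properties as a stand-in for the real configuration object are all assumptions made so the example compiles on its own.

```java
import java.util.Properties;

final class NewKeySketch {
    /**
     * Hypothetical key: upper bound on retries for an example subsystem.
     * Valid range: 1 to 10 inclusive.
     */
    static final String TEZ_AM_EXAMPLE_RETRY_LIMIT = "tez.am.example.retry.limit";
    static final int TEZ_AM_EXAMPLE_RETRY_LIMIT_DEFAULT = 3;

    /** Reads the key, falling back to the default and rejecting bad values. */
    static int readRetryLimit(Properties conf) {
        String raw = conf.getProperty(TEZ_AM_EXAMPLE_RETRY_LIMIT);
        if (raw == null) {
            return TEZ_AM_EXAMPLE_RETRY_LIMIT_DEFAULT; // unset: old configs keep working
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 1 || value > 10) {
            throw new IllegalArgumentException(
                TEZ_AM_EXAMPLE_RETRY_LIMIT + " must be in [1, 10], got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(readRetryLimit(conf)); // 3
        conf.setProperty(TEZ_AM_EXAMPLE_RETRY_LIMIT, "5");
        System.out.println(readRetryLimit(conf)); // 5
    }
}
```

The compatibility property that matters is the unset branch: a cluster whose config files predate the key must behave exactly as it did before the patch.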

Renaming a key

This requires a deprecation alias. Tez has a deprecation mechanism via Hadoop's Configuration.addDeprecation. Pattern:

public static final String TEZ_AM_RESOURCE_MEMORY_MB = "tez.am.resource.memory.mb";

// Old key, deprecated since 0.10.5.
public static final String TEZ_AM_RESOURCE_MEMORY_MB_DEPRECATED = "tez.am.memory.mb";

static {
    Configuration.addDeprecation(
        TEZ_AM_RESOURCE_MEMORY_MB_DEPRECATED,
        TEZ_AM_RESOURCE_MEMORY_MB);
}

Old config files using the deprecated name continue to work. Hadoop's Configuration logs a deprecation message when the old key is used, so you do not need to add a warning yourself.
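To make the alias behaviour concrete without pulling Hadoop onto the classpath, here is a toy model of what a deprecation mapping does from the caller's point of view. It is not Hadoop's implementation: the class, its methods, and the warn-once bookkeeping are hypothetical, written only to show that a value set under the old name is visible under the new one.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of key deprecation; not Hadoop's Configuration class. */
final class DeprecationAliasSketch {
    private final Map<String, String> oldToNew = new HashMap<>();
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> warned = new HashSet<>();

    /** Registers oldKey as a deprecated alias of newKey. */
    void addDeprecation(String oldKey, String newKey) {
        oldToNew.put(oldKey, newKey);
    }

    /** Resolves a key to its current name, warning once per deprecated key. */
    private String canonical(String key) {
        String newKey = oldToNew.get(key);
        if (newKey == null) {
            return key;
        }
        if (warned.add(key)) {
            System.err.println(key + " is deprecated; use " + newKey);
        }
        return newKey;
    }

    void set(String key, String value) {
        values.put(canonical(key), value);
    }

    String get(String key) {
        return values.get(canonical(key));
    }

    int warningCount() {
        return warned.size();
    }

    public static void main(String[] args) {
        DeprecationAliasSketch conf = new DeprecationAliasSketch();
        conf.addDeprecation("tez.am.memory.mb", "tez.am.resource.memory.mb");
        conf.set("tez.am.memory.mb", "2048");                      // old config file
        System.out.println(conf.get("tez.am.resource.memory.mb")); // 2048
    }
}
```

Storing values only under the canonical name keeps the two keys from drifting apart, which is the property a rename needs to preserve.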

Removing a key

Only at a major version bump, after at least one minor version of deprecation. Document in CHANGES.txt and the release notes.

Changing a default

Treat as a behavior change. Requires dev@ discussion if the change affects perf or resource usage. Document the change explicitly:

DEFAULT CHANGES:
  TEZ-NNNN: tez.am.resource.memory.mb default changed from 1024 to 1536 to reduce OOMs
            on large DAGs. Users with tight container budgets should explicitly set the
            old value.

Compatibility Across Tez and Hive/Pig

Tez has cross-project compatibility commitments to Hive and Pig — they bundle Tez and expect a Tez version bump not to break them. The mechanism is @LimitedPrivate.

grep -rn "@LimitedPrivate" ~/tez-src/tez-api/src/main/java | head

A class annotated @LimitedPrivate({"Hive"}) has API compatibility guaranteed to Hive only. The Tez side may not break it without first warning dev@hive.apache.org. The Hive side commits to not relying on anything other than @LimitedPrivate or @Public APIs.

When you change a @LimitedPrivate({"Hive"}) class:

Search Hive for usage: grep -rn <ClassName> ~/hive-src/ql/src/
If Hive uses it, post a heads-up on dev@hive.apache.org referencing the JIRA.
Consider providing both old and new methods for one Tez minor version.

Validation Artifacts

After this chapter you should have:

A ~/tez-notes/compat-cheatsheet.md with the API matrix from above.
A list of every .proto file in tez-api and which compat surface each protects.
The set of files in tez-api/.../classification/ open in your IDE for reference.

Knowledge of which Hive classes import from tez-api:

grep -rn "import org.apache.tez" ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ | head

The ability to predict, for any change, which compat surface(s) it touches and what the deprecation timeline would be.

The next chapter — Meritocracy — is the project-level perspective: how Apache Tez decides who gets to make compatibility decisions.

Meritocracy: Contributor → Committer → PMC

The Apache Way uses a specific, technical sense of the word "meritocracy" that is often misread. This chapter is what it actually means inside Apache Tez, how the path from casual contributor to PMC member works, and what each step really requires.

The Three Roles

Role	Granted by	What it gives you	What it asks of you
Contributor	Nothing — anyone who contributes is one	JIRA account, ability to submit patches	Nothing formal
Committer	Vote on `private@tez.apache.org` by PMC	Commit access to `apache/tez`, vote rights on patches (non-binding for releases)	ICLA on file, ongoing engagement
PMC member	Vote on `private@tez.apache.org` by PMC	Binding vote on releases, vote rights on new committers and PMC members, board reporting share	Legal stewardship, release responsibility

There is no fourth role. "Lead contributor" and "maintainer" are not Apache concepts. The "Chair" is a PMC member who reports to the board on the project's behalf; the board appoints the chair on the PMC's recommendation, and many PMCs rotate the role.

What "Meritocracy" Actually Means at Apache

Apache uses "meritocracy" in a very specific sense: decisions and elevations are based on accumulated, evidenced contribution to the project — not on title, employer, or personal connections.

That is narrower than the colloquial meaning. It explicitly does not mean:

"Best engineer wins." Many excellent engineers are not committers because they have not engaged with this specific community.
"Most patches wins." LOC is not a measure of merit.
"Paid time on the project wins." Full-time paid Tez work, on its own, does not earn committership. The community must observe the contribution.
"Smartest design wins arguments." Arguments are won by evidence and consensus, not cleverness.

What it does mean:

Sustained, visible contribution over months
Quality demonstrated by patches getting committed with few iterations
Trust demonstrated by reasonable behavior on JIRA and dev@
Investment in the project itself, not just in your features

The Path to Committer

The committer vote is private; the criteria are not codified anywhere with bullet points. What committers actually look at, in rough order:

Patch quality. Have your patches gone in with light review? Have you mastered the workflow in Patch Quality?
Volume and sustained activity. Not LOC, but consistency. 10 small patches over 6 months is much stronger than 1 huge patch.
Engagement breadth. Have you reviewed others' patches (with non-binding +1s)? Helped on user@ questions? Filed clean JIRAs?
Judgement on dev@. Have you participated in design discussions? Were your contributions thoughtful, not just adding noise?
Area coverage. Have you worked in more than one corner of the codebase, or are you trusted for a deep one? Either can earn the bit.
Trust. Would the existing committers be comfortable with you committing your own patches?

There is no fixed threshold. Different projects have different bars; Tez is in the middle (not as strict as Hadoop, not as loose as a brand-new TLP).

Typical Trajectory

Month 1-2:   First few small patches (Javadoc, log messages, tiny bug fixes).
             Some friction in review as you learn conventions.
Month 3-6:   More substantive patches. Lower review iteration count.
             Reviewing others' patches with non-binding +1.
Month 6-12:  Larger patches with design discussion.
             Filing follow-up JIRAs after your patches.
             Recognised name on dev@.
Month 12+:   A PMC member notices and proposes you on private@.
             PMC discusses and votes, in private.
             You receive a private email offering the bit and accept by reply.
             The PMC posts an [ANNOUNCE] thread on dev@; you acknowledge it there.

The 12-month figure is a median, not a rule. Faster is possible with very sustained engagement; slower is common.

Accepting the Bit

If a PMC member emails you with an offer of committership, the steps:

Accept privately, via reply to the offer email.
The PMC raises an [ANNOUNCE] New Tez committer: <name> thread on dev@.
You acknowledge publicly on the thread.
ASF Infrastructure provisions your ASF ID (<id>@apache.org).
You get karma to push to apache/tez.

What changes for you:

You can commit your own patches. Don't commit your own patches without review for the first few months. The community trust applies to your judgement of others' patches; your own still get reviewed.
You get a binding +1 vote on commits.
You get a non-binding +1 on releases (PMC +1 is binding).
You are now visible as part of the project. Behave accordingly on dev@, JIRA, and conferences.

The Path to PMC

PMC membership is a separate, later, additive step. Committership is necessary but not sufficient. The criteria are even less codified than those for committership:

Sustained activity as a committer. Months to years post-committer.
Project-level judgement, not just code. Have you weighed in on release timing, compat questions, community-management issues?
Willingness to take on release-management or PMC duties. Cutting a release, responding to security reports, mentoring new committers.
Trust to handle confidential matters — security disclosures arrive on private@tez.apache.org, and PMC members must handle them carefully.

PMC votes are also private. You are notified by email; the public announcement is on dev@.

What PMC Members Do That Committers Don't

Duty	Why PMC only
Binding `+1` on releases	ASF policy: releases are PMC acts
Vote on new committers and PMC members	Self-perpetuating governance
Receive and process security reports	Confidentiality
Approve / sign release artifacts	Legal liability flows through PMC
Quarterly board reports	Stewardship to the foundation
Trademark guardianship	"Apache Tez" is a Foundation mark
Brand decisions (logos, names, conferences)	ASF authorises through PMCs

Common Misconceptions

"I work on Tez full-time, so I should be a committer."

Paid time is irrelevant. The community can only assess what it can observe — public patches, public reviews, public discussion. Internal company work, no matter how extensive, does not exist from the project's perspective.

If your day job is Tez work, the way to convert that into committership is to do that work in the open: file JIRAs, attach patches, post designs.

"I wrote N lines of code, so I should be a committer."

LOC is not used. A contributor with 200 lines spread across 15 thoughtful patches is strictly stronger than one with 5000 lines in 2 mega-patches. Smaller, frequent, high-quality contributions demonstrate the judgement committership rewards.

"My company has N committers, so we should have the next slot."

Apache projects are explicitly company-independent. Many PMCs have an informal limit on the proportion of committers from any single employer (no more than ~50%) to preserve project independence. Companies do not have slots.

"I was a committer on project X, so I should get the bit here automatically."

You don't. Committership is per-project. Past contribution elsewhere is positive prior evidence but does not substitute for engagement on Tez.

"I have an ASF ICLA on file, so I'm a contributor."

An ICLA is a legal document covering future contributions. It does not make you a contributor; submitting a contribution does. An ICLA must be on file before non-trivial contributions can be committed.

"There is a contributor-rank or leaderboard."

There isn't. Apache projects do not maintain rankings, badges, or stars. The closest thing is the CHANGES.txt file, which records the contributor name on each committed patch.

What Earns the Bit, Concretely

If you want a checklist, this is roughly it. None are individually required, but most committers tick most boxes by the time they're proposed:

10+ patches committed, spanning multiple areas of the code.
At least one patch with non-trivial design discussion on dev@ or JIRA.
At least one bug found by you, reproduced by you, fixed by you, tested by you.
Reviewed at least 5 other contributors' patches with constructive non-binding +1s or -1s.
Helped answer questions on user@ or in JIRA comments.
Filed follow-up JIRAs when you noticed adjacent issues.
Behaved well in every public interaction, including when a patch was rejected.
Maintained existing patches as the codebase moved under them (rebased, addressed review).
Sustained over 6+ months, not concentrated in one sprint.
Not gaming any of the above (committers can tell).

What Earns PMC, Concretely

Committer for 1–3+ years.
Demonstrated judgement on dev@ beyond your own patches.
Have either cut a release or helped with one.
Have proposed or seconded other committers.
Have engaged with at least one cross-project compat concern.
Visible willingness to do PMC work (security, brand, board reports) — not just code.

Validation Artifacts

After this chapter you should have:

A clear-eyed view of where you currently are on the path.
A ~/tez-notes/karma.md listing every concrete thing you've done that the community can observe — patches, reviews, JIRA comments, dev@ posts.
A goal for the next 3 months in terms of contribution shape, not LOC.
The ability to explain the contributor / committer / PMC distinction to a colleague without using the word "lead."

This chapter closes the Contributor Mindset section. The next major section, Release & PMC Reality, takes you inside the committer and PMC view — what those roles actually look like from inside.

Issue Roadmap — Twelve Stages from Trivial to Release-Blocking

This roadmap is a deliberately ordered ladder of Apache Tez contributions. Each rung trains a specific skill, depends on the rung below it, and ends at a concrete review-ready patch. Skipping rungs is the most common reason contributors stall: a shuffle bug fix without state-machine fluency turns into a six-month patch thread, and a release-blocker triage call without compatibility reflexes turns into a reverted commit.

The stages are calibrated to the Tez 0.10.x codebase on disk at ~/tez-src. JIRA queries assume https://issues.apache.org/jira/projects/TEZ. Patch discussion happens on dev@tez.apache.org. Where stages reference real modules they use the exact paths you will see under ~/tez-src:

tez-api/                       public interfaces, descriptors, configuration keys
tez-common/                    IDs, util, log helpers, ATS/timeline shared code
tez-dag/                       AppMaster: DAGImpl, VertexImpl, TaskImpl, schedulers
tez-runtime-internals/         TezTaskRunner, LogicalIOProcessorRuntimeTask
tez-runtime-library/           ShuffleManager, Fetcher, IFile, MergeManager
tez-mapreduce/                 MR-shim inputs/outputs/processors
tez-tests/                     MiniTezCluster integration tests
tez-examples/                  OrderedWordCount, SimpleSessionExample, etc.
tez-plugins/tez-yarn-timeline-history/   ATS history events
tez-plugins/tez-aux-services/  NM-side ShuffleHandler hook
docs/                          User-facing site under src/site/markdown

The Twelve Stages

#	Stage	Target skill	Prereq	Typical patch size	Review depth
1	Docs & tests	Reading the codebase, JIRA workflow, RAT/checkstyle	none	1–30 lines	1 reviewer
2	Build & logging hygiene	pom dep bands, slf4j idioms, `LOG.isDebugEnabled()`	1	5–80 lines	1 reviewer
3	Error message context	Exception chaining, ID propagation, `tez-dag` CONTEXT rule	2	20–200 lines	1–2 reviewers
4	State machine transitions	`StateMachineFactory`, `InvalidStateTransitonException`	3	30–250 lines + test	2 reviewers, dev@ ping
5	Scheduler bugs	`TaskSchedulerManager`, `YarnTaskSchedulerService`, AMRMClient	4	50–500 lines + MiniCluster test	2 reviewers
6	Shuffle & runtime	`ShuffleManager`, `Fetcher`, `MergeManager`, `IFile`	5	80–600 lines + test	2 reviewers
7	Hive-on-Tez compatibility	DAGPlan size, edge property contracts, session reuse	5 or 6	varies; often a tez-side + HIVE-side ticket	committers in both projects
8	YARN integration	AMRMToken, log aggregation, NM aux service, kerberos renewal	5	50–400 lines	2 reviewers, often YARN-side too
9	Flaky tests	`DrainDispatcher`, dispatcher-aware waits, port collisions	4	20–150 lines per test	1–2 reviewers; sometimes "stamped"
10	Performance regression	`git bisect`, async-profiler / JFR, JMH micro	6 or 8	30–300 lines + bench evidence	2 reviewers, dev@ design ping
11	Backward compatibility	`@InterfaceAudience`, `@InterfaceStability`, protobuf evolution	4	small code, long dev@ thread	committers + PMC
12	Release-blocking	RC voting, -1 binding, security CVE pipeline	committer	varies	PMC + release manager

How to Use This Roadmap

Pick a stage honestly

Find your rung by asking what the largest patch you have shipped is:

Never landed a Tez patch: start at Stage 1.
Landed a docs patch but never touched Java in tez-dag: Stage 2.
Comfortable with tez-common Java but never read a state machine: Stage 3.
Read VertexImpl.stateMachineFactory once and were confused: Stage 4.
Read it twice and could draw the state graph: Stage 5+.
Already a Tez committer: jump straight to Stages 10–12 for sharpening.

Do not jump rungs to chase a "cool" bug. A locality miscount in YarnTaskSchedulerService looks self-contained and isn't — the patch will land on state-machine transitions you have never edited.

One stage per patch

Resist the urge to fix two things in one patch. Reviewers reject mixed-concern patches almost reflexively. If you find a logging issue while fixing an error message, file a follow-up JIRA and move on. The roadmap rewards small surface area.

Always start with `git log` and `git blame`

Before touching a file, find the last 5 commits that modified it:

cd ~/tez-src
git log --oneline -n 5 -- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
git blame -L 1200,1260 tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

The blame output shows who last changed those lines, which is usually a good proxy for which committer cares about that area. CC them on the JIRA.

Time investment per stage

Calibrated against a working contributor who has the codebase checked out, can build locally with mvn clean install -DskipTests -Phadoop28, and has filed at least one JIRA before:

Stage	First patch	Becoming fluent (5 patches landed)
1	half a day	1 week
2	1 day	2 weeks
3	1–2 days	1 month
4	3–5 days	2–3 months
5	1–2 weeks	4–6 months
6	2–4 weeks	6 months
7	weeks per attribution call	a year of cross-project work
8	1–3 weeks	6 months
9	1–3 days per flake	ongoing
10	weeks (perf is bisect-bound)	committer-level skill
11	weeks (dev@ design cycle)	committer-level skill
12	PMC-level responsibility	n/a

Success criterion per stage

Each stage is "complete" for you when:

Stage 1: one docs and one test patch are committed to master.
Stage 2: at least two logging or build patches are committed without nits.
Stage 3: one error-context patch is committed with no reviewer asking "which DAG?"
Stage 4: one transition fix is committed and has a regression test in TestVertexImpl.
Stage 5: one scheduler patch is committed with a MiniTezCluster repro test.
Stage 6: one shuffle-runtime patch is committed with a deterministic repro.
Stage 7: one cross-project ticket is filed with a written attribution argument.
Stage 8: one YARN-integration patch is committed with explicit Hadoop-version evidence.
Stage 9: at least three flaky tests have been de-flaked.
Stage 10: one perf patch is committed with before/after benchmark numbers.
Stage 11: one compatibility-sensitive patch is committed with explicit annotations and dev@ sign-off.
Stage 12: you have helped triage at least one RC vote.

When to ask on dev@

Before writing any code for Stages 4 and above, send a short note to dev@tez.apache.org:

Subject: [DISCUSS] TEZ-XXXX — proposed approach

I see <symptom> at <file>:<line>. My read is <cause>. I plan to <fix>, with
a regression test in <test>. Would appreciate any context I'm missing before
I post a patch.

Three sentences. No essay. The list will usually tell you within a day or two whether you are about to step on someone else's in-flight work.

When the roadmap does not apply

This roadmap is for bug fixes and small features. It is not for:

New runtime engines or scheduler rewrites — those are Tez Improvement Proposals (TEPs); start a dev@ thread, not a patch.
Hive query-engine changes that happen to surface in Tez — file on HIVE, not TEZ.
YARN-side fixes that Tez merely consumes — file on YARN, not TEZ.

Stage 7 teaches the attribution skill that keeps these in the right project.

What to read alongside this roadmap

Roadmap stage	Companion deep-dive
1–3	Reading the codebase
4	State machines, Vertex lifecycle
5	Scheduler, DAG App Master
6	Shuffle & sort, Tez runtime
7	Hive integration
8	YARN integration
9	Testing framework
10	Container reuse, Tez runtime
11	Compatibility
12	Release & PMC

What this roadmap is not

This roadmap is not a tutorial on Apache Tez itself. The deep dives in ../deep-dives/index.md cover the architecture; the labs in ../level-1/index.md onward cover the hands-on code reading. The roadmap assumes you can already build Tez from source, run the unit tests, and stand up a MiniTezCluster end-to-end. If you cannot, the prerequisite chapter is Level 1, Lab 1.1.

It is also not a generic Apache contribution guide. The Apache "How to Contribute" pages cover the cross-project mechanics (ICLA, JIRA account creation, mailing list etiquette). The roadmap assumes those are done.

Finally, it is not a roadmap for committership. Becoming a Tez committer is a separate path that the PMC manages. The roadmap teaches the skills that, applied consistently over time, make committership a reasonable outcome — but landing patches is necessary, not sufficient.

Reading order

If you read this book front-to-back, you will hit this chapter after the deep dives and before the capstone. That is the intended sequence:

Read the deep dives to understand the architecture.
Read this roadmap to understand the contribution ladder.
Pick a rung and ship a patch.
Come back to this roadmap when the patch lands, and step up a rung.
After three or four rungs, attempt the capstone in ../capstone/index.md.

If you are jumping in mid-book, start at the rung that matches your current skill (see "Pick a stage honestly" above) and read the stage's companion deep dive at the same time.

A note on JQL

The JIRA queries in each stage are starting points. The Tez project's issue labelling has drifted over the years — labels like newbie and beginner are inconsistently applied. If a filter returns zero results, broaden it (remove a clause) before assuming the filter is wrong. Each stage gives at least one fallback grep-based candidate-finding method that does not depend on labels.

A second JQL tip: pin a "watched issues" filter for the components you care about. Tez has roughly a dozen components in JIRA; you do not need to watch all of them, but watching the two or three closest to your current rung is how you stay current on landed work.

A note on local clone hygiene

Every stage in this roadmap assumes you have a clean checkout at ~/tez-src. "Clean" means:

git status shows no untracked files outside .gitignore.
git branch shows you on master (or a topic branch you remember creating).
mvn clean install -DskipTests -Phadoop28 completes in under two minutes locally.

A messy checkout produces hard-to-reproduce results: a grep that catches your own WIP, a git bisect that visits commits whose builds were already broken by an unrelated local change, a mvn test that passes locally because of a stale ~/.m2 jar.

Refresh on Mondays:

cd ~/tez-src
git checkout master
git pull --ff-only
git clean -fdx
mvn -q clean install -DskipTests -Phadoop28

The git clean -fdx is aggressive — it removes everything not tracked by git, including IDE artifacts. Keep an .idea/ (or equivalent) backup elsewhere if you customise it.

How the stages interlock

Each stage builds vocabulary the next stage uses without re-explaining:

Stage 1 teaches the patch artifact format. Every later stage assumes it.
Stage 2 teaches the LOG.isDebugEnabled() pattern. Stage 3 builds on it with the CONTEXT rule.
Stage 3 teaches you to navigate tez-dag. Stage 4 lives in tez-dag/...impl/.
Stage 4 teaches the state-machine DSL. Stage 5 reads the same DSL in the scheduler.
Stage 5 teaches MiniTezCluster. Stage 6 leans on it for every shuffle test.
Stage 6 teaches the runtime contracts. Stage 7 attributes bugs against those contracts to Hive.
Stage 8 teaches the YARN boundary. Stage 11 references it when discussing compat across Hadoop versions.
Stage 9 teaches deterministic testing. Stage 10 uses it as the baseline for benchmark stability.
Stage 10 teaches measurement. Stage 11 uses measurement as evidence for compat decisions.
Stage 11 teaches the audience/stability matrix. Stage 12 uses it when triaging blockers.

Skipping a stage means skipping a vocabulary. Reviewers will notice.

Now turn the page to Stage 1.

Stage 1 — Docs and Tests

What this stage teaches

Stage 1 is the on-ramp. The skills are deliberately non-technical:

Navigate the Apache JIRA workflow: claim a ticket, assign it to yourself, attach a patch, set "Patch Available", respond to review.
Run mvn apache-rat:check and mvn checkstyle:check cleanly.
Produce a git format-patch artifact that applies on master.
Wait for a Jenkins precommit run and read its output without panicking.

The contributions themselves are surgical: a docs typo, a missing @since tag, a @param javadoc that the linter complains about, a LOG.info whose message is misleading. Nothing in this stage will surprise a reviewer. That is the point: you are exercising the workflow so the next stages can be about code.

JIRA filter to find candidates

Real JQL you can paste into https://issues.apache.org/jira/issues:

project = TEZ
  AND labels in (newbie, beginner, "newbie-friendly", "low-hanging-fruit")
  AND resolution = Unresolved
  AND (component in (Documentation) OR summary ~ "typo" OR summary ~ "javadoc")
ORDER BY updated DESC

A second filter that often surfaces good Stage 1 work — javadoc that the build already flags:

project = TEZ AND status = Open AND text ~ "javadoc" AND text ~ "missing"

Open three candidates, read each comment thread end to end. Choose one that has no assignee, no patch attached, and was last updated more than three months ago. That is the abandoned-but-still-valid ticket: a perfect Stage 1.

If nothing fits, file your own. Walk the docs/src/site/markdown/ tree and grep for broken links, stale Hadoop version numbers, and configuration keys removed years ago:

cd ~/tez-src
grep -rn "tez\.am\.task\.max\.failed\.attempts" docs/src/site/markdown/
grep -rn "hadoop-2\.[0-6]" docs/src/site/markdown/
grep -rn "TODO\|FIXME\|XXX" docs/src/site/markdown/

A genuine doc bug found this way is fair game for your first JIRA.

Walked example — `TezConfiguration` javadoc missing `@since`

Symptom: a contributor reports on dev@ that TezConfiguration.TEZ_AM_RESOURCE_MEMORY_MB has no javadoc and no @since tag, so users cannot tell which release introduced the property.

Step 1 — Locate the symbol

cd ~/tez-src
grep -n "TEZ_AM_RESOURCE_MEMORY_MB" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head

Open the file. The relevant block looks roughly like:

@ConfigurationScope(Scope.AM)
public static final String TEZ_AM_RESOURCE_MEMORY_MB =
    TEZ_AM_PREFIX + "resource.memory.mb";
public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;

No javadoc, no @since. That is the bug.

Step 2 — Claim the JIRA

On https://issues.apache.org/jira/projects/TEZ:

Click Create, set Project = TEZ, Issue Type = Improvement.
Summary: Add @since tags and javadoc for TEZ_AM_RESOURCE_MEMORY_MB family.
Component: tez-api. Affects Version: 0.10.3. Fix Version: leave blank — the release manager sets it.
Description: state the symptom, paste the grep above, link the dev@ thread.
Save, then click Assign to me.

Step 3 — Diff

--- a/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
@@
+  /**
+   * Memory (in MB) requested for the AppMaster container. If the AM is launched
+   * by YARN, this is passed through to {@link
+   * org.apache.hadoop.yarn.api.records.Resource#setMemorySize(long)} on the
+   * {@code ApplicationSubmissionContext}.
+   *
+   * @since 0.5.0
+   */
   @ConfigurationScope(Scope.AM)
   public static final String TEZ_AM_RESOURCE_MEMORY_MB =
       TEZ_AM_PREFIX + "resource.memory.mb";
+  /**
+   * Default value of {@link #TEZ_AM_RESOURCE_MEMORY_MB}.
+   *
+   * @since 0.5.0
+   */
   public static final int TEZ_AM_RESOURCE_MEMORY_MB_DEFAULT = 1024;

Two rules for @since:

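Look at the earliest commit that introduced the symbol, not the current version. Run git log --diff-filter=A -- tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java to see when the file was added, then git log -S "TEZ_AM_RESOURCE_MEMORY_MB" -- tez-api/... to list the commits that added or removed the symbol; the oldest one is the introduction. Cross-reference that commit hash against the release tags (git tag --contains <hash>); the lowest release version listed is the first one that shipped it.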
Never guess. If you cannot find the release, ask on dev@. A wrong @since is worse than no @since.

Step 4 — Build and lint

cd ~/tez-src
mvn -pl tez-api -am clean install -DskipTests -Phadoop28 -q
mvn -pl tez-api checkstyle:check -q
mvn -pl tez-api apache-rat:check -q
mvn -pl tez-api javadoc:javadoc -q 2>&1 | grep -i "error\|warning" | head

The javadoc target is the slowest gate in Tez. Run it. If it warns about an @link that no longer resolves, fix that in the same patch — reviewers will ask anyway.

Step 5 — Format and attach the patch

cd ~/tez-src
git add tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
git commit -m "TEZ-XXXX. Add @since tags for TEZ_AM_RESOURCE_MEMORY_MB family"
git format-patch -1 HEAD --stdout > /tmp/TEZ-XXXX.001.patch

The Tez convention is TEZ-XXXX.NNN.patch where NNN starts at 001 and increments on every reroll. Upload to the JIRA, click "Submit Patch" so the status flips to Patch Available. Jenkins precommit will pick it up within an hour and post results.
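The naming rule is mechanical enough to sketch in a few lines. The helper below is purely illustrative: PatchNames, format, and nextReroll are hypothetical names, not part of Tez or its tooling, and the three-digit zero-padded revision is taken from the convention described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: formats and increments TEZ-XXXX.NNN.patch names. */
final class PatchNames {
  // JIRA key, a dot, a three-digit revision, then the .patch suffix.
  private static final Pattern NAME =
      Pattern.compile("(TEZ-\\d+)\\.(\\d{3})\\.patch");

  private PatchNames() {}

  /** Builds the attachment name for a JIRA key and a 1-based revision. */
  static String format(String jiraKey, int revision) {
    if (revision < 1 || revision > 999) {
      throw new IllegalArgumentException("revision out of range: " + revision);
    }
    return String.format("%s.%03d.patch", jiraKey, revision);
  }

  /** Returns the name the next reroll should use, e.g. 001 -> 002. */
  static String nextReroll(String currentName) {
    Matcher m = NAME.matcher(currentName);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a TEZ patch name: " + currentName);
    }
    return format(m.group(1), Integer.parseInt(m.group(2)) + 1);
  }
}
```

For example, PatchNames.nextReroll("TEZ-1234.001.patch") yields TEZ-1234.002.patch; the JIRA key here is made up for illustration.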

Step 6 — Respond to review

Almost certain reviewer requests for a docs patch:

"Add {@value} macros so the default appears inline."
"Wrap the line at 100 chars."
"Capitalise the first word of the javadoc sentence."

Reroll as 002, never overwrite the 001 file. Each reroll is an attachment in JIRA, not a force-push; reviewers compare attachments by name.

Pitfalls

Don't fix two bugs in one patch. A whitespace cleanup tacked onto a typo fix is the most common reason a Stage 1 patch sits unmerged for months.
Don't run mvn install without -DskipTests. The full test suite takes well over an hour. For a docs patch you need only the lint targets above.
Don't squash with git rebase -i master and then attach the output of git diff master. The Apache toolchain expects git format-patch -1 output, which carries the author, date, and commit message that a bare git diff omits. The two also diverge whenever your branch contains merge commits.
Don't paste the diff into the JIRA description. Attach the .patch file.
Don't request a reviewer in the JIRA description. Use the Assignee field to assign to yourself and let committers self-select. CC on dev@ if it has been more than two weeks with no review.
Don't open a GitHub PR instead of a JIRA patch unless the project guide says so. As of 0.10.x, Tez accepts GitHub PRs but the JIRA is still the source of truth and must be referenced in the PR title.

Exit criteria — when you're ready for the next stage

You can move to Stage 2 when:

You have one merged docs or javadoc patch and one merged test-only patch (typically a missing @Test annotation or a broken assertion message in tez-tests/).
You have responded to at least one round of reviewer nits without needing the reviewer to walk you through git format-patch syntax.
A Jenkins precommit run on your patch no longer makes you nervous: you can read the report and tell which warnings are pre-existing and which were introduced by your change.
You can recite from memory: "JIRA first, branch from master, one logical change per patch, TEZ-XXXX.NNN.patch naming, attach not paste."

A second walked example — fixing a misleading log message

Symptom: a contributor sees a LOG.info in tez-dag that reads:

LOG.info("Vertex " + vertexName + " has " + numTasks + " tasks");

But it fires every time the vertex is re-initialised, not just on first initialisation. The message implies a one-shot event; operators have complained that they cannot grep the log to find unique vertices.

The diff

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
-    LOG.info("Vertex " + vertexName + " has " + numTasks + " tasks");
+    LOG.info("Vertex {} (id={}) initialised with {} tasks (init count={})",
+        vertexName, vertexId, numTasks, ++initCount);

Three changes in one diff:

The message uses slf4j placeholders.
The vertex ID is added so operators can correlate with downstream ATS events.
The init counter makes the "re-initialise" case visible.

This patch is technically a borderline Stage 3 candidate (it adds the vertex ID — see stage-3-error-messages.md). For a first patch, the JIRA description should explicitly say "I am only changing the log message; the init-count field is added but no transition behaviour changes." That framing keeps the patch in Stage 1 scope.

Test

A log-message change usually has no functional test. The reviewer signal is a manual run of a small OrderedWordCount against MiniTezCluster with the modified jar, and a grep of the resulting log to confirm the new format. Document the grep in the JIRA comments:

grep "initialised with" tez-am.log | head

When to file a follow-up

If, while working on a Stage 1 patch, you discover a bigger issue (suppose the javadoc is missing because the configuration key was silently renamed without an @since in either place), file a follow-up JIRA in the same component. Do not bundle the bigger fix into your Stage 1 patch.

Standard wording in your JIRA comments:

While working on TEZ-XXXX I noticed that TEZ_AM_RESOURCE_MEMORY_MB was
renamed from TEZ_AM_MEMORY_MB in 0.7.0 without an @deprecated on the
old key. Filed TEZ-YYYY to track the deprecation cleanup.

This habit — narrow Stage 1 patch + follow-up JIRA — is what reviewers mean when they say "keep patches focused." It is the skill the rest of the roadmap depends on.

Where Stage 1 patches go wrong

The two most common failure modes for a Stage 1 patch:

Scope creep. The contributor "just fixes" three sibling issues while editing the file. Reviewers ask for a split. The contributor rerolls incompletely. Two months later the patch is abandoned.
Silent rebase break. The contributor rebases on master, the patch no longer applies cleanly, but they never upload an 002 reroll. The committer sees a stale patch and moves on.

Neither failure is about code. Both are about workflow discipline. Stage 1 exists to drill that discipline before the stakes get higher.

Stage 2 will move you from documentation into code that runs in production AMs.

Stage 2 — Build and Logging Hygiene

What this stage teaches

Stage 2 teaches the smallest patches that touch running production code: build metadata (pom.xml), logging idioms, and dependency hygiene. You learn:

How Tez's dependency version bands work and which bumps are safe within a minor line.
The slf4j-api + log4j (or reload4j) logging stack as wired in tez-common, and the four idioms reviewers actively enforce.
How to remove deprecated Guava and Hadoop calls without breaking older Hadoop consumers in the supported compatibility band.
How to triage log-level mismatches: messages logged at INFO that should be DEBUG (and the reverse).

The patches are still small (5–80 lines) and the risk surface is small, but they go into the AM and the runtime tasks. A LOG.info in ShuffleManager that fires once per fetch will be seen by every operator running Hive-on-Tez.

JIRA filter to find candidates

project = TEZ
  AND resolution = Unresolved
  AND (summary ~ "logging" OR summary ~ "deprecated" OR summary ~ "guava"
       OR summary ~ "bump" OR summary ~ "upgrade dependency"
       OR summary ~ "System.out" OR description ~ "isDebugEnabled")
ORDER BY updated DESC

A second sweep for dependency bumps that the build flags:

project = TEZ AND component in (build) AND status = Open ORDER BY priority DESC

You can also generate candidates by inspecting the dependency tree for old transitive versions (a lightweight stand-in for a full OWASP dependency-check run):

cd ~/tez-src
mvn -pl tez-common dependency:tree -DoutputType=text | grep -E "guava|jackson|netty"

Any line that shows Guava 12.x in transitive scope is a Stage 2 candidate, because Tez has been on Guava-shaded internals for years.

Walked example A — `System.out.println` in production code

Symptom: a grep finds three stray System.out.println calls in tez-runtime-library. They were left over from a debugging session and now show up in NodeManager stdout logs, polluting operator dashboards.

Step 1 — Find every offender

cd ~/tez-src
grep -rn "System\.out\.println\|System\.err\.println" \
  tez-runtime-library/src/main/java tez-runtime-internals/src/main/java tez-dag/src/main/java \
  | grep -v "/test/" | grep -v "examples"

Each hit is a separate JIRA candidate (one Stage 2 patch per ticket). Pick one, file the JIRA, claim it.

Step 2 — The diff

Suppose the offender is in tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java:

--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java
@@
-    System.out.println("Spill " + numSpills + " starting, size=" + buffer.position());
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Spill {} starting, size={}", numSpills, buffer.position());
+    }

Three rules in one diff:

Replace System.out with the class's existing slf4j LOG. If the file does not have one, add private static final Logger LOG = LoggerFactory.getLogger(...) at the top.
Use slf4j {} placeholders, not string concatenation. The placeholder form avoids constructing the message string when the log level is filtered out.
Wrap the call in LOG.isDebugEnabled() only when the argument list does non-trivial work (a toString() on a large object, a list copy, a .size() on a synchronized collection). Pure references (numbers, already-bound strings) do not need the guard.

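The third rule is the one reviewers nitpick most. The placeholder form already defers toString() and message formatting until the level check passes, so a guard around a plain LOG.debug("foo {}", x) where x is an int is unnecessary noise. The guard in the diff above is therefore optional: its two arguments are cheap ints, and the only cost it avoids is autoboxing them. But this:

LOG.debug("Pending {}", new ArrayList<>(scheduledTasks));   // the copy is made on every call

does benefit from a guard, because Java evaluates the argument expression before LOG.debug is entered, whether or not DEBUG is enabled.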
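The evaluation-order point can be checked without slf4j on the classpath. The sketch below uses a deliberately tiny stand-in logger (GuardDemo.debug is not the slf4j API, and every name in it is hypothetical) to count how often an expensive argument expression runs while DEBUG is off, once without a guard and once with one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in logger (NOT slf4j): formats the message only when DEBUG is on. */
final class GuardDemo {
  static final AtomicInteger copies = new AtomicInteger();
  static boolean debugEnabled = false;

  private GuardDemo() {}

  static boolean isDebugEnabled() {
    return debugEnabled;
  }

  /** Mimics a placeholder-style call: the argument is stringified only if enabled. */
  static void debug(String format, Object arg) {
    if (debugEnabled) {
      System.out.println(format.replace("{}", String.valueOf(arg)));
    }
  }

  /** An "expensive" argument expression: copies the list and counts the copy. */
  static List<String> snapshot(List<String> tasks) {
    copies.incrementAndGet();
    return new ArrayList<>(tasks);
  }

  /** Returns {copies made by the unguarded call, copies made by the guarded call}. */
  static int[] run() {
    List<String> scheduled = Arrays.asList("t1", "t2");

    copies.set(0);
    debug("Pending {}", snapshot(scheduled));   // copy happens although DEBUG is off
    int unguarded = copies.get();

    copies.set(0);
    if (isDebugEnabled()) {
      debug("Pending {}", snapshot(scheduled)); // not reached while DEBUG is off
    }
    int guarded = copies.get();

    return new int[] {unguarded, guarded};
  }
}
```

With DEBUG disabled, the unguarded call still pays for one copy and the guarded call pays for none. That difference is the whole case for the guard; when the arguments are plain references there is nothing to save.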

Step 3 — Verify the build

mvn -pl tez-runtime-library -am clean install -DskipTests -Phadoop28 -q
mvn -pl tez-runtime-library checkstyle:check -q

There is no easy unit test for "no System.out left behind." The reviewer signal is a clean grep across the changed file plus a green checkstyle run.

Walked example B — `pom.xml` dep bump within the compat band

Symptom: jackson-databind 2.12.x has a known CVE; Tez is pinned to 2.12.6 in the parent POM. The compatibility band for the 0.10.x line allows bumps within the 2.12.* range.

Step 1 — Find the pin

cd ~/tez-src
grep -n "jackson-databind\|jackson.version\|jackson-core" pom.xml

Result, abbreviated:

pom.xml:178:    <jackson.version>2.12.6</jackson.version>

Most jackson artifacts in Tez are governed by ${jackson.version} in the parent POM. That is the only string you change.

Step 2 — The diff

--- a/pom.xml
+++ b/pom.xml
@@
-    <jackson.version>2.12.6</jackson.version>
+    <jackson.version>2.12.7.1</jackson.version>

That is the entire patch. The harder part is justifying it.

Step 3 — The JIRA description

Summary: Bump jackson-databind from 2.12.6 to 2.12.7.1

Description:
2.12.6 is affected by CVE-YYYY-NNNN. 2.12.7.1 is the latest patch on the 2.12
line and is API-compatible per the jackson maintainers' compat notes. We do not
bump to 2.13 / 2.14 here to keep Hive-on-Tez compatibility unchanged.

Verification:
  mvn clean install -DskipTests -Phadoop28
  mvn -pl tez-dag test -Dtest=TestDAGImpl
  mvn -pl tez-runtime-library test -Dtest=TestShuffleManager

Step 4 — Why "within the compat band" matters

If you bumped to 2.14, you would break Hive 3.x users who ship 2.13. A 2.12 → 2.12.7.1 bump is a one-line patch. A 2.12 → 2.14 bump is a six-month compatibility argument and lives in Stage 11. Stay on rung.
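As a rough illustration of "within the compat band", the check below treats the band as "same major.minor line", which is how the 2.12.* range is described above. CompatBand and its methods are hypothetical names; the real decision also depends on the downstream consumers discussed in Stage 11.

```java
/** Hypothetical check: does a version bump stay on the same major.minor line? */
final class CompatBand {
  private CompatBand() {}

  /** Returns "major.minor" for a dotted version such as 2.12.7.1. */
  static String minorLine(String version) {
    String[] parts = version.split("\\.");
    if (parts.length < 2) {
      throw new IllegalArgumentException("expected major.minor[.patch]: " + version);
    }
    return parts[0] + "." + parts[1];
  }

  /** True when both versions share the same major.minor prefix. */
  static boolean staysInBand(String from, String to) {
    return minorLine(from).equals(minorLine(to));
  }
}
```

Under this assumption, staysInBand("2.12.6", "2.12.7.1") holds and staysInBand("2.12.6", "2.14.0") does not, which matches the split between a Stage 2 patch and a Stage 11 discussion.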

Walked example C — log-level mismatch

Symptom: a user reports their NodeManager logs are at 100 GB/day. Investigation shows Fetcher is logging every single shuffle fetch at INFO:

LOG.info("Fetcher " + id + " connecting to " + host + ":" + port);

That message fires per attempt per source per fetch. For a 10k-task vertex it is catastrophic.

Diff

--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
@@
-    LOG.info("Fetcher " + id + " connecting to " + host + ":" + port);
+    if (LOG.isDebugEnabled()) {
+      LOG.debug("Fetcher {} connecting to {}:{}", id, host, port);
+    }

Rules for INFO → DEBUG demotions:

The message fires more than once per task attempt → almost always DEBUG.
The message fires once per DAG lifecycle event (DAG start, vertex committed, task killed by user) → keep at INFO.
The message fires per exception → keep at WARN or ERROR per the existing level, never demote silently.
Never demote a log without dev@ confirmation if the message references a contract event (state transition, container release). Operators rely on those for postmortems.

The Fetcher example is uncontroversial; a LOG.info on every state transition in VertexImpl is not — that would be Stage 4.

Pitfalls

Don't introduce a logger dependency change in a logging patch. If the file imports org.apache.commons.logging.Log, do not migrate it to slf4j in this patch. That migration is a separate JIRA and a much larger surface area.
Don't use Throwable.printStackTrace() even in tests. Reviewers will flag it. Use LOG.error("msg", t) instead.
Don't bump a dep across a major version line in a Stage 2 patch. That is Stage 11.
Don't mvn versions:use-latest-releases and submit the resulting diff. The bump must be justified per artifact with the CVE or the bug being fixed.
Don't remove deprecated Guava calls by adding new Guava calls. The Tez trajectory is off Guava in public code. Replace Preconditions.checkNotNull with Objects.requireNonNull (JDK 7+) — not with a different Guava class.
Don't add a LOG.debug guard around a string literal. LOG.debug("hello") needs no guard.
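For the Guava pitfall in the list above, the replacement is a one-line change per call site. The class below is a made-up holder used only to show the before and after; Objects.requireNonNull(T, String) is standard JDK API and, like the Guava call it replaces, throws NullPointerException with the given message.

```java
import java.util.Objects;

/** Hypothetical holder class, used only to show the null-check replacement. */
final class VertexNameHolder {
  private final String vertexName;

  VertexNameHolder(String vertexName) {
    // Before (Guava): this.vertexName = Preconditions.checkNotNull(vertexName, "vertexName");
    // After (JDK only): same failure mode, no Guava import.
    this.vertexName = Objects.requireNonNull(vertexName, "vertexName");
  }

  String name() {
    return vertexName;
  }
}
```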

Exit criteria — when you're ready for the next stage

Move on when:

You have shipped at least two logging-cleanup patches and one pom dep bump.
You can explain, without looking it up, when to add LOG.isDebugEnabled() and when not to.
You have read tez-common/src/main/java/org/apache/tez/common/CallableWithNdc.java and understand the NDC pattern used to attach DAG/Vertex IDs to log messages — that knowledge is the bridge to Stage 3.
Your last patch was reviewed and merged without a "split this into two JIRAs" comment.

Stage 3 layers on top: you keep the same surgical patch style, but now you make the content of error messages tell the operator which DAG and which vertex.

Appendix — finding logging hygiene candidates yourself

The JIRA filter at the top of this stage may return zero results during quiet periods. When that happens, you can manufacture candidates yourself with three grep patterns that have a high signal-to-noise ratio.

Pattern A — unguarded string concatenation in `LOG.debug`

cd ~/tez-src
grep -rn 'LOG\.debug(.*+' --include="*.java" tez-dag tez-runtime-internals \
    tez-runtime-library | grep -v isDebugEnabled | head -30

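This finds calls of the form LOG.debug("got " + counter) that build the concatenated string unconditionally, even when DEBUG is off. Pick one, convert it to {} placeholders (adding an if (LOG.isDebugEnabled()) guard only if an argument is expensive to compute), and attach the patch to a JIRA.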

Pattern B — `LOG.info` calls with high call-site frequency

cd ~/tez-src
grep -rn 'LOG\.info' --include="*.java" tez-runtime-library | wc -l
grep -rn 'LOG\.info' --include="*.java" tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle | head

The shuffle path runs per fetch, so a LOG.info there can fire hundreds of thousands of times in a large DAG. Most are candidates for demotion to DEBUG.

Pattern C — pom files referencing pinned old versions

cd ~/tez-src
grep -rn "<version>" --include="pom.xml" | grep -E "jackson|commons-|guava|netty" \
  | grep -v -- "-test" | head -20

Cross-reference against the latest patch release on the package's GitHub releases page. If your match is two patch versions behind and the changelog mentions a security fix, you have a Stage 2 candidate.

The bar for these "self-found" candidates is the same: file a JIRA before coding, attach a 001 patch, wait for review.

Stage 3 — Error Messages and Exception Context

What this stage teaches

Stage 3 is the first stage where you change behaviour visible to operators in a production postmortem. You learn:

The CONTEXT rule for tez-dag: every error raised, logged, or rethrown inside the AppMaster must include the DAG ID, and the vertex/task/attempt ID wherever the call site has them in scope.
How to chain causes correctly: throw new TezException(msg, cause) instead of throw new TezException(msg) then cause.printStackTrace().
How to find exception sites that swallow the original cause: a catch (Exception e) followed by throw new RuntimeException("init failed") is the canonical bug.
How NDC (Nested Diagnostic Context, configured in tez-common) propagates IDs into log messages automatically — and how to add explicit IDs where NDC is not set up.

These patches are 20–200 lines, often single-method changes that touch error paths. The reviewer test is brutal but fair: "If this exception fires in a production AM log, can the on-call engineer identify the DAG, vertex, and task without cross-referencing any other log file?" If the answer is "no," the patch is not done.

JIRA filter to find candidates

project = TEZ
  AND resolution = Unresolved
  AND (text ~ "uninformative error" OR text ~ "missing context"
       OR text ~ "swallowed exception" OR text ~ "no DAG id"
       OR text ~ "improve error message"
       OR (description ~ "InvalidStateTransitonException" AND text ~ "stack trace"))
ORDER BY updated DESC

A second sweep — find your own candidates by grep:

cd ~/tez-src
# error sites that build a message without an ID
grep -rn 'throw new .*Exception(".*failed' tez-dag/src/main/java \
  | grep -v "ID\|Id\|getName" | head -30

# catch sites that drop the cause
grep -rn "catch (.*Exception .*)" tez-dag/src/main/java -A 2 \
  | grep -B1 "throw new" | grep -v ", e)" | head -30

The second grep is fuzzy; you will get false positives. But every true positive is a Stage 3 patch.

The CONTEXT rule for `tez-dag`

Every error inside the AppMaster must include enough state to identify which DAG instance on which AM on which application attempt threw it. The minimum fields, listed in priority order:

The DAG ID (TezDAGID).
The Vertex ID (TezVertexID) — required if the error is in a vertex context.
The Task ID (TezTaskID) — required if in a task context.
The Task Attempt ID (TezTaskAttemptID) — required if in an attempt context.
The container ID — required for container-management errors.

Each of these IDs is a stable string (toString() returns the canonical form). They are present on every relevant impl object in tez-dag:

grep -n "getDagId\|getVertexId\|getTaskId\|getTaskAttemptId" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

If you are editing a method on VertexImpl, you have getVertexId() and getDagId() in scope. If you do not include them in the error, the patch is incomplete.

Walked example A — uninformative `TezException` in `VertexImpl.maybeSendConfiguredEvent`

Symptom: a user reports their DAG fails with:

2026-04-12 10:14:21,003 ERROR [Dispatcher thread] org.apache.tez.dag.app.dag.impl.VertexImpl:
  Vertex init failed
org.apache.tez.dag.api.TezException: init failed
    at org.apache.tez.dag.app.dag.impl.VertexImpl.maybeSendConfiguredEvent(VertexImpl.java:NNNN)

That error tells the operator nothing. No DAG ID, no vertex name, no cause.

Step 1 — Find the throw site

cd ~/tez-src
grep -n 'throw new TezException("init failed' \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

Read 20 lines of context around the hit. The method has vertexId, getDagId(), and getName() all in scope.

Step 2 — The diff

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
-    } catch (AMUserCodeException e) {
-      throw new TezException("init failed");
-    }
+    } catch (AMUserCodeException e) {
+      String msg = String.format(
+          "Vertex %s (%s) of DAG %s failed during configured-event dispatch: %s",
+          getName(), vertexId, getDagId(), e.getMessage());
+      LOG.error(msg, e);
+      throw new TezException(msg, e);
+    }

What changed:

The message now identifies the vertex name (human-readable), the vertex ID (machine-stable), and the DAG ID.
The original exception is chained via the two-argument TezException constructor. The full stack trace survives.
The error is also logged at ERROR with the cause. Belt and braces — some callers swallow the exception silently, and the log line is the only record that survives.
String.format is used so the placeholders are visually aligned with the field names. Reviewers prefer it over +-concatenation when the message has more than three substitutions.
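The same shape can be tried outside the Tez tree. The sketch below swaps in a stand-in exception (ContextException, hypothetical) for TezException and passes the IDs as plain strings, so it only demonstrates the two properties a reviewer checks: the IDs appear in the message and the original cause survives.

```java
/** Stand-in for a checked exception with a (message, cause) constructor. */
final class ContextException extends Exception {
  ContextException(String message, Throwable cause) {
    super(message, cause);
  }
}

/** Hypothetical helper that builds a context-rich wrapper around a cause. */
final class ErrorContext {
  private ErrorContext() {}

  static ContextException wrap(
      String vertexName, String vertexId, String dagId, Exception cause) {
    // Name for humans, IDs for machines, original message last.
    String msg = String.format(
        "Vertex %s (%s) of DAG %s failed during configured-event dispatch: %s",
        vertexName, vertexId, dagId, cause.getMessage());
    // Two-argument constructor: the original stack trace stays reachable.
    return new ContextException(msg, cause);
  }
}
```

A check against this helper would assert on substrings (the vertex ID, the DAG ID) and on getCause() being the original exception, not on the full message text.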

Step 3 — Regression test

Add to tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java:

@Test(timeout = 5000)
public void testInitFailureMessageIncludesIds() throws Exception {
  VertexImpl v = createVertexThatFailsInConfigured(); // existing helper pattern
  try {
    v.maybeSendConfiguredEvent();
    fail("expected TezException");
  } catch (TezException e) {
    assertTrue("message should contain vertex id",
        e.getMessage().contains(v.getVertexId().toString()));
    assertTrue("message should contain dag id",
        e.getMessage().contains(v.getDagId().toString()));
    assertNotNull("cause should be preserved", e.getCause());
  }
}

The test asserts on substring presence, not exact string equality. Reviewers reject exact-string assertions because they break the next time the message is rephrased.

Step 4 — Run targeted tests

cd ~/tez-src
mvn -pl tez-dag test -Dtest=TestVertexImpl -q 2>&1 | tail -40

The full TestVertexImpl suite takes 3–5 minutes on a laptop. Run it. A state-machine-adjacent change always risks breaking a sibling transition.

Walked example B — swallowed cause in `DAGAppMaster.startDAG`

Find the bug:

cd ~/tez-src
grep -rn "catch (.*Exception" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java \
  -A 3 | grep -B1 "throw new" | head -20

Suppose the offender looks like:

try {
  initServices();
} catch (Exception e) {
  throw new TezUncheckedException("Failed to start AM");
}

The diff:

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
@@
     try {
       initServices();
     } catch (Exception e) {
-      throw new TezUncheckedException("Failed to start AM");
+      throw new TezUncheckedException(
+          "Failed to start AM for application " + appAttemptID + ": "
+              + e.getMessage(), e);
     }

Two fixes at once: the cause is preserved (the second constructor argument), and the message now includes the appAttemptID which the surrounding DAGAppMaster has in scope. This patch is small but high-leverage: the AM startup path is the single most common place a swallowed cause hides a real configuration bug.

Walked example C — log-only context via NDC

Some hot paths cannot afford a String.format per call. The Tez convention there is NDC. Look in tez-common/src/main/java/org/apache/tez/common/CallableWithNdc.java:

cat $(find ~/tez-src/tez-common/src/main/java -name "CallableWithNdc.java")

When the dispatcher invokes a vertex transition callback, it pushes the vertex ID onto the NDC stack. log4j's %x conversion pattern (lowercase; %X{...} is the MDC equivalent) then includes the ID in every log line for the duration of the call. If you discover a log message in VertexImpl that lacks the vertex ID, first check whether NDC already provides it via the log pattern. If yes, the message is fine; if no, add the ID inline. A patch that adds an explicit ID where NDC already prints it will be rejected in review.
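The push/pop discipline behind NDC can be sketched with a thread-local stack. This is a simplified stand-in, not log4j's NDC and not the contents of CallableWithNdc: MiniNdc and its methods are hypothetical, and only the top of the stack is printed, whereas a real layout prints the whole stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Simplified stand-in for a nested diagnostic context (not log4j's NDC). */
final class MiniNdc {
  private static final ThreadLocal<Deque<String>> STACK =
      ThreadLocal.withInitial(ArrayDeque::new);

  private MiniNdc() {}

  /** Formats a line the way a layout with a context field might. */
  static String line(String message) {
    String top = STACK.get().peek();
    return "[" + (top == null ? "" : top) + "] " + message;
  }

  /** Runs body with ctx pushed, and always pops it again afterwards. */
  static <T> T callWith(String ctx, Supplier<T> body) {
    STACK.get().push(ctx);
    try {
      return body.get();
    } finally {
      STACK.get().pop();   // pop even if body throws, so the context never leaks
    }
  }
}
```

Inside callWith("vertex-A", ...) every line carries the context without the call site mentioning it; outside, the field is empty. That is why an explicit ID is redundant wherever the context is already pushed.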

Pitfalls

Don't include e.getStackTrace() in your message. The stack trace is what LOG.error(msg, e) is for. Concatenating it into the message turns a one-line log into a 60-line one.
Don't use e.toString() in messages. Use e.getMessage() so the message stays single-line; the stack trace lives in the chained throwable.
Don't catch Throwable to add context. Catching Throwable swallows OutOfMemoryError and ThreadDeath. Catch Exception (or the narrowest superclass that fits).
Don't add context that requires a lock. A getName() call that internally takes the vertex write-lock is a deadlock waiting to happen if the error path itself holds the lock. Always check the lock semantics of the getter you call in an error path.
Don't change the exception type to add context. throw new TezException is still a TezException after your patch; changing it to TezUncheckedException is a behaviour change and not allowed in Stage 3.
Don't add context that includes user data without redaction. If your error message includes a configuration value, check whether it could contain credentials. The Tez convention is to print the key, not the value, when the key matches .*\.(password|secret|token|credential).
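The redaction rule in the last pitfall can be sketched as a small formatter. Redactor and describe are hypothetical names, the pattern is the one quoted above, and printing "<redacted>" in place of the value is one possible reading of "print the key, not the value".

```java
import java.util.regex.Pattern;

/** Hypothetical formatter: hides values of keys that look like credentials. */
final class Redactor {
  // Same pattern as in the pitfall above; matches() requires the whole key to match.
  private static final Pattern SENSITIVE =
      Pattern.compile(".*\\.(password|secret|token|credential)");

  private Redactor() {}

  /** Returns key=value for ordinary keys and key=<redacted> for sensitive ones. */
  static String describe(String key, String value) {
    if (SENSITIVE.matcher(key).matches()) {
      return key + "=<redacted>";
    }
    return key + "=" + value;
  }
}
```

A key such as example.store.password (made up here) is printed without its value, while tez.am.resource.memory.mb is printed in full.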

Exit criteria — when you're ready for the next stage

Move to Stage 4 when:

You have shipped at least one error-context patch in tez-dag and one in tez-runtime-library that includes the DAG and vertex/task IDs.
A reviewer has accepted your test pattern (substring assertion, no exact-string match) without a comment.
You can find at least three more candidate error sites in five minutes of grepping without referring to this chapter.
You have read VertexImpl.maybeSendConfiguredEvent and the surrounding 200 lines without feeling lost — that file is the gateway to Stage 4.

Stage 4 will take you inside the state machines themselves.

Stage 4 — State Machine Transitions

What this stage teaches

Stage 4 is the first stage that requires you to understand the Tez AppMaster, not just navigate it. You learn:

The StateMachineFactory DSL used in Hadoop / Tez to declare finite state machines. The two canonical instances are VertexImpl.stateMachineFactory and TaskImpl.stateMachineFactory.
The InvalidStateTransitonException (note the historical typo — "Transiton", not "Transition" — preserved for API compatibility) that the state machine throws when an event arrives in a state with no registered transition.
How to add a transition with the right guard, without widening the surface area of the state machine accidentally.
The hard rule: never widen a transition without a dev@ design discussion. Adding a transition from RUNNING to KILLED on a new event class is a semantic change that may cascade to ATS, the client, and the speculator.
The TestVertexImpl and TestTaskImpl patterns for asserting that an event in a state produces an expected next state.

Patches are typically 30–250 lines: a transition table entry, a small guard helper, a fired event, and a deterministic regression test.

Reading order before you touch any code

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java — read the static stateMachineFactory block end to end. It is several hundred lines of .addTransition(...) calls. Diagram it on paper.
tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java — same exercise for tasks.
tez-common/src/main/java/org/apache/tez/state/StateMachineTez.java — the wrapper Tez puts around the Hadoop state machine.
The deep dives state-machines and vertex-lifecycle. Do not skip these.

Then, and only then, file a JIRA.

JIRA filter to find candidates

The most fruitful filter:

project = TEZ
  AND resolution = Unresolved
  AND (text ~ "InvalidStateTransitonException" OR text ~ "Invalid event"
       OR text ~ "missing transition" OR description ~ "stateMachineFactory")
ORDER BY updated DESC

A second filter for postmortem-style tickets:

project = TEZ AND status = Open AND component in ("tez-dag")
  AND priority in (Major, Critical) AND (text ~ "VertexState" OR text ~ "TaskState")

Most real Stage 4 work comes from operator reports of an AM that crashed with InvalidStateTransitonException: Invalid event: X at Z. That message is the smoking gun: state Z received event X and had no registered handler. The fix is one of:

Add the transition with a guard (most common).
Suppress the event in that state because it is a benign late delivery (use addTransition(state, state, event) — a self-loop).
Fix the sender not to emit the event in that state (sometimes the bug is upstream).

Choosing wrong is the most common Stage 4 mistake. Pick option 3 only if you can prove the event should never have been emitted.
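
To make the difference between a missing entry and a self-loop concrete, here is a toy transition table. It is a sketch under stated assumptions: ToyVertexStateMachine and its methods are hypothetical and deliberately do not use the Hadoop StateMachineFactory API. The states and events are a simplified subset, and the exception is a plain IllegalStateException standing in for InvalidStateTransitonException.

```java
import java.util.EnumMap;
import java.util.Map;

/** Toy transition table for illustration; NOT the Hadoop StateMachineFactory API. */
final class ToyVertexStateMachine {
  enum State { NEW, INITED, RUNNING }
  enum Event { V_INIT, V_START }

  private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
  private State current = State.NEW;

  /** Registers one (state, event) -> next-state entry and returns this for chaining. */
  ToyVertexStateMachine addTransition(State from, State to, Event on) {
    table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    return this;
  }

  /** Applies the event, or throws if the current state has no entry for it. */
  void handle(Event event) {
    Map<Event, State> row = table.get(current);
    State next = (row == null) ? null : row.get(event);
    if (next == null) {
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    current = next;
  }

  State state() {
    return current;
  }

  /** Only the two happy-path entries; a duplicate V_INIT in INITED will throw. */
  static ToyVertexStateMachine strict() {
    return new ToyVertexStateMachine()
        .addTransition(State.NEW, State.INITED, Event.V_INIT)
        .addTransition(State.INITED, State.RUNNING, Event.V_START);
  }

  public static void main(String[] args) {
    // Option 2 from the list above: a self-loop absorbs the late duplicate.
    ToyVertexStateMachine tolerant = strict()
        .addTransition(State.INITED, State.INITED, Event.V_INIT);
    tolerant.handle(Event.V_INIT);
    tolerant.handle(Event.V_INIT);
    System.out.println(tolerant.state());
  }
}
```

The self-loop changes exactly one cell of the table, which is why it counts as the narrow option. Adding an entry that moves to a different state would change what every observer of the state sees.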

Walked example — unhandled duplicate `V_INIT` in `VertexState.INITED`

Symptom: an operator reports a recurring AM crash:

InvalidStateTransitonException: Invalid event: V_INIT at INITED
  at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(...)
  at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:NNNN)

A crash on V_INIT is suspicious at first glance, because NEW is supposed to accept V_INIT. The message names the state the vertex was actually in, though: INITED, not NEW. Investigation reveals that the transition is registered for the common path, but a recently added early-error path emits V_INIT from a different thread before the main scheduler does, so the scheduler's V_INIT arrives after the vertex has already left NEW.

Step 1 — Read the existing transitions

cd ~/tez-src
grep -n "addTransition(VertexState.NEW" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -20

You will see something like (simplified):

.addTransition(VertexState.NEW, VertexState.INITED,
    VertexEventType.V_INIT, new InitTransition())
.addTransition(VertexState.NEW, VertexState.FAILED,
    VertexEventType.V_TERMINATE, new TerminateNewVertexTransition())

V_INIT on NEW is registered, so the first V_INIT is handled. The crash therefore came from a V_INIT that arrived after the vertex had left NEW, and INITED is the obvious candidate. Re-grep for it:

grep -n "addTransition(VertexState.INITED" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | grep "V_INIT"

No hit. That is the bug: V_INIT arriving in INITED is unhandled.

Step 2 — Decide: add, ignore, or fix upstream

V_INIT in INITED is a duplicate event. It is benign (the vertex is already initialised; the second message is redundant). The correct fix is to ignore the duplicate — a self-loop. This is the safe, narrow change.

We are not widening behaviour. We are saying: "in INITED, a redundant V_INIT is a no-op, not a crash."

Step 3 — The diff

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java
@@
        .addTransition(VertexState.INITED, VertexState.RUNNING,
            VertexEventType.V_START, new StartTransition())
+
+       // A duplicate V_INIT can arrive when an early error path fires V_INIT
+       // concurrently with the scheduler. The vertex is already initialised;
+       // ignore the duplicate rather than crashing the AM. See TEZ-XXXX.
+       .addTransition(VertexState.INITED, VertexState.INITED,
+           VertexEventType.V_INIT, VERTEX_STATE_CHANGED_CALLBACK_NOOP)

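Where VERTEX_STATE_CHANGED_CALLBACK_NOOP stands for a do-nothing SingleArcTransition (a self-loop has a single target state, so MultipleArcTransition does not apply). It can be a shared constant or, more idiomatically, a small inner class, in which case the entry above passes new IgnoreEventTransition() instead: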

private static class IgnoreEventTransition
    implements SingleArcTransition<VertexImpl, VertexEvent> {
  @Override
  public void transition(VertexImpl vertex, VertexEvent event) {
    LOG.debug("Ignoring duplicate {} on vertex {} in state {}",
        event.getType(), vertex.getVertexId(), vertex.getState());
  }
}

Two rules in this diff:

The transition has a comment with the JIRA ID explaining why the self-loop exists. State-machine entries without comments are hard to remove safely two years later.
The transition logs at DEBUG, not INFO. If the duplicate event is actually a symptom of a larger bug upstream, the debug log is what tells the operator.

Step 4 — Regression test in `TestVertexImpl`

@Test(timeout = 10000)
public void testDuplicateVInitInInitedIsNoOp() throws Exception {
  initAllVertices(VertexState.INITED);                  // existing helper
  VertexImpl v = vertices.get("vertex1");
  assertEquals(VertexState.INITED, v.getState());

  // Fire a second V_INIT — must not throw, must not change state
  v.handle(new VertexEvent(v.getVertexId(), VertexEventType.V_INIT));
  dispatcher.await();

  assertEquals("duplicate V_INIT should leave INITED unchanged",
      VertexState.INITED, v.getState());
}

The test pattern:

Use the existing initAllVertices(VertexState.INITED) helper. Do not invent your own bootstrap.
Always call dispatcher.await() after v.handle(...). TestVertexImpl uses DrainDispatcher, which is the only way to make event-driven tests deterministic.
Assert the post state. Never assert on internal counters unless the transition is supposed to change them.

Run it:

cd ~/tez-src
mvn -pl tez-dag test -Dtest=TestVertexImpl#testDuplicateVInitInInitedIsNoOp -q 2>&1 | tail -30

Then run the whole TestVertexImpl suite. A single transition addition has broken a sibling test more than once in Tez history.
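
To see why the await() call in the test pattern matters, here is a toy dispatcher with the same drain idea. It is a sketch only: ToyDrainDispatcher is hypothetical and far simpler than the DrainDispatcher used by TestVertexImpl. Events are plain Runnables handled on one worker thread, and await() returns once every queued event, including follow-up events dispatched by handlers, has finished.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy single-threaded dispatcher with a drain barrier; not the Tez DrainDispatcher. */
final class ToyDrainDispatcher {
  private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "toy-dispatcher");
    t.setDaemon(true); // do not keep the JVM alive if stop() is forgotten
    return t;
  });
  private int pending; // events queued or running; guarded by "this"

  /** Queues an event handler; may be called from inside another handler. */
  void dispatch(Runnable handler) {
    synchronized (this) {
      pending++;
    }
    worker.execute(() -> {
      try {
        handler.run();
      } finally {
        synchronized (this) {
          pending--;
          notifyAll();
        }
      }
    });
  }

  /** Blocks until no event is queued or running. */
  synchronized void await() {
    try {
      while (pending > 0) {
        wait();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while draining", e);
    }
  }

  void stop() {
    worker.shutdownNow();
  }

  public static void main(String[] args) {
    ToyDrainDispatcher d = new ToyDrainDispatcher();
    AtomicInteger handled = new AtomicInteger();
    // The first event dispatches a follow-up, as a real transition often does.
    d.dispatch(() -> {
      handled.incrementAndGet();
      d.dispatch(handled::incrementAndGet);
    });
    d.await(); // without this, the assertion below would race the worker thread
    System.out.println(handled.get());
    d.stop();
  }
}
```

The follow-up event is counted before its parent finishes, so await() cannot return between the two. That ordering is what lets a test assert on the post-state without sleeping.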

Step 5 — dev@ notification

Before you post the patch:

Subject: [DISCUSS] TEZ-XXXX — add INITED -> INITED self-loop for V_INIT

I have a repro for a recurring AM crash where V_INIT arrives twice. The state
machine currently has no INITED+V_INIT entry. Proposed fix: self-loop with a
debug log. Sender side (early-error path) is left unchanged on the grounds
that defensive handling in the state machine is cheaper than chasing every
sender. Would appreciate a sanity check before I post the patch.

If a committer replies "actually the sender is the bug, fix that instead," revise your approach. If the thread is silent for 48 hours, post the patch.

The "never widen without dev@" rule

What counts as widening:

Adding a transition from a non-terminal state to a terminal state on a new event. Example: RUNNING -> KILLED on V_USER_REQUEST_FORCE_KILL.
Adding a transition that changes a previously-rejected event into an accepted one with side effects (counters updated, downstream events emitted).
Removing a transition.

What is not widening:

Adding a self-loop that ignores a duplicate event (as above).
Adding a transition that converts an InvalidStateTransitonException into a controlled ERROR transition, when the event was clearly a fatal-bug signal.

The dev@ rule exists because state machines are observed externally: the AM emits state-changed events to ATS, the client poll loop watches them, the speculator reads them. Adding a transition is an API change for those observers, even if no Java type signature changes.

Pitfalls

Don't add transitions to fix symptoms. If you see InvalidStateTransitonException and the cause is "the sender shouldn't have emitted that event," fix the sender. Adding a transition to silence the exception hides the real bug.
Don't forget the regression test. Every transition patch must have a test that fires the event in the state and asserts the result. Tests using DrainDispatcher are the only ones reviewers accept.
Don't use Mockito.spy on VertexImpl. The state machine has private internal state that spies cannot reach reliably. Use the production class with the test helpers in TestVertexImpl and MockDAGAppMaster.
Don't change the transition() callback signature. Existing transitions use SingleArcTransition or MultipleArcTransition. Pick the matching one; do not introduce a new interface.
Don't "fix" the typo. InvalidStateTransitonException (missing the second "i" of "Transition") is the canonical name in Hadoop. If you rename it in Tez code, you break binary compatibility with downstream callers that catch the exception by name.
Don't bundle a transition fix with an unrelated cleanup. Reviewers will ask you to split.

Exit criteria — when you're ready for the next stage

Move to Stage 5 when:

You have shipped one transition fix in VertexImpl or TaskImpl with a passing regression test in the corresponding Test* class.
You can draw the VertexImpl state diagram from memory (every VertexState value, the main transitions, the terminal set).
You have read TaskAttemptImpl.stateMachineFactory in full and recognise the similarities and differences to VertexImpl.
A committer has reviewed your transition patch and accepted the addition without asking for a dev@ design thread — meaning your choice of "ignore vs add vs fix sender" was correct.

Stage 5 takes you out of the AM event loop and into the scheduler.

Stage 5 — Scheduler Bugs

What this stage teaches

Stage 5 takes you out of the per-vertex event loop and into the AM-wide scheduling layer. You learn:

The split between TaskSchedulerManager (the multi-scheduler dispatch shim) and the concrete YarnTaskSchedulerService (the AMRMClient-backed scheduler used in production), plus the alternative LocalTaskSchedulerService used by local mode and tests.
How container requests, allocations, and releases flow through AMRMClient, including the heldContainer lifecycle and the canonical leak: a held container that is never returned to YARN after an onError callback fires.
Locality miscounts: the bookkeeping mistake where a node-local allocation is charged as rack-local in getAvailableContainers, distorting the affinity signal sent back to the AMRM protocol.
Priority inversion: a high-priority request stuck behind a low-priority pending list because the request was added to the wrong queue.
Container behaviour across AM failover: when YARN starts a new AM attempt (bounded by tez.am.max.app.attempts), what should and should not be re-claimed.
How to write a MiniTezCluster-backed integration test, and when the cheaper AMRMClient stub pattern is sufficient.

Patches are 50–500 lines, often with a non-trivial test that needs MiniTezCluster or MiniYARNCluster. Reviewers are strict: a scheduler patch without a deterministic test is rejected on sight.

Reading order

tez-dag/src/main/java/org/apache/tez/dag/app/rm/TaskSchedulerManager.java
tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
tez-dag/src/main/java/org/apache/tez/dag/app/rm/container/AMContainerImpl.java
tez-dag/src/test/java/org/apache/tez/dag/app/rm/TestTaskSchedulerManager.java
The deep dive scheduler.

cd ~/tez-src
wc -l tez-dag/src/main/java/org/apache/tez/dag/app/rm/*.java

If YarnTaskSchedulerService.java is over 2000 lines, that is expected.

JIRA filter to find candidates

project = TEZ
  AND component in ("tez-dag")
  AND resolution = Unresolved
  AND (text ~ "container leak" OR text ~ "scheduler" OR text ~ "locality"
       OR text ~ "priority" OR text ~ "AMRMClient" OR text ~ "heldContainer"
       OR description ~ "onError")
ORDER BY priority DESC, updated DESC

A second filter for AM-failover-related candidates:

project = TEZ AND resolution = Unresolved AND (text ~ "failover" OR text ~ "AM restart")
  AND component in ("tez-dag")

Walked example A — `heldContainer` never released after `onError`

Symptom: an operator reports their long-running session AM holds onto containers indefinitely after a transient RM disconnect. yarn application -status shows allocated containers far above what the running DAG should need.

Step 1 — Locate the leak path

cd ~/tez-src
grep -n "onError\|heldContainer\|releaseContainer" \
  tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java | head -30

You find a class field:

private final Map<ContainerId, HeldContainer> heldContainers = new HashMap<>();

and an onError(Throwable t) callback (inherited from AMRMClientAsync.CallbackHandler):

@Override
public void onError(Throwable t) {
  LOG.error("AMRMClient error", t);
  appContext.getEventHandler().handle(
      new DAGAppMasterEventSchedulingServiceError(t));
}

The bug: heldContainers is populated by onContainersAllocated but never drained in onError. When the AM recovers and the RM reissues the same container IDs, the map already has stale entries, and the new allocations are silently dropped (the bookkeeping path checks heldContainers.containsKey(id)). The containers are effectively leaked.

Step 2 — Diff

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
@@
   @Override
   public void onError(Throwable t) {
     LOG.error("AMRMClient error", t);
+    // Before we tear down, release any containers we still hold. If we don't,
+    // a recovering RM will re-issue the same ContainerIds and the dedup
+    // bookkeeping below will silently drop the new allocations. See TEZ-XXXX.
+    synchronized (heldContainers) {
+      for (HeldContainer hc : heldContainers.values()) {
+        try {
+          amRmClient.releaseAssignedContainer(hc.getContainer().getId());
+        } catch (Exception releaseErr) {
+          LOG.warn("Failed to release {} during onError cleanup: {}",
+              hc.getContainer().getId(), releaseErr.getMessage());
+        }
+      }
+      heldContainers.clear();
+    }
     appContext.getEventHandler().handle(
         new DAGAppMasterEventSchedulingServiceError(t));
   }

Rules in this diff:

The cleanup runs before the event is dispatched. Once the event fires, the AM may shut down handlers, and any release call would race.
The cleanup is synchronized on the same monitor that other writers to heldContainers use. Find that monitor first; if there is none, you have a second bug to file separately. Do not introduce a new lock in this patch.
Each release is wrapped individually. One failure must not prevent the others from being released.
Logged failures are WARN, not ERROR. The AM is already in an error path; doubling the severity drowns the originating cause.
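
The "release each one individually, then clear" rule can be shown in isolation. The sketch below is hypothetical: HeldContainerRegistry is not a Tez class, container IDs are plain strings, and the Consumer stands in for the RM client's release call. It demonstrates only the control flow: one failing release does not stop the loop, and the bookkeeping is emptied under the same monitor that guards writes.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical stand-in for the scheduler's held-container bookkeeping. */
final class HeldContainerRegistry {
  private final Set<String> held = new LinkedHashSet<>(); // guarded by "held"
  private final Consumer<String> releaser;                // stand-in for the RM release call

  HeldContainerRegistry(Consumer<String> releaser) {
    this.releaser = releaser;
  }

  void hold(String containerId) {
    synchronized (held) {
      held.add(containerId);
    }
  }

  /** Releases every held container and returns the IDs whose release failed. */
  List<String> releaseAllOnError() {
    List<String> failed = new ArrayList<>();
    synchronized (held) {
      for (String id : held) {
        try {
          releaser.accept(id);
        } catch (RuntimeException e) {
          failed.add(id); // record it and keep going; the others must still be released
        }
      }
      held.clear(); // no stale entries survive the error path
    }
    return failed;
  }

  int size() {
    synchronized (held) {
      return held.size();
    }
  }
}
```

A caller in an error path would log the returned IDs at WARN and carry on with the shutdown event, matching the ordering rule above.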

Step 3 — Test with `AMRMClient` stub

A full MiniTezCluster test for this is overkill. Stub the client:

@Test(timeout = 10000)
public void testOnErrorReleasesHeldContainers() throws Exception {
  AMRMClientAsync<CookieContainerRequest> mockRm =
      mock(AMRMClientAsync.class);
  YarnTaskSchedulerService scheduler =
      new YarnTaskSchedulerService(mockAppCallbackHandler, appContext, mockRm);
  scheduler.serviceInit(new Configuration());
  scheduler.serviceStart();

  // simulate two allocations
  Container c1 = newContainer("container_1");
  Container c2 = newContainer("container_2");
  scheduler.onContainersAllocated(Arrays.asList(c1, c2));

  // fire onError
  scheduler.onError(new RuntimeException("RM gone"));

  // verify both were released
  verify(mockRm).releaseAssignedContainer(c1.getId());
  verify(mockRm).releaseAssignedContainer(c2.getId());
  assertTrue(scheduler.getHeldContainersForTest().isEmpty());
}

The pattern uses Mockito on the AMRM client, not on the YarnTaskSchedulerService itself. getHeldContainersForTest() is a package-private accessor you add in the same patch, marked with the @VisibleForTesting annotation.

Step 4 — Build, test, sign off

cd ~/tez-src
mvn -pl tez-dag test -Dtest=TestYarnTaskSchedulerService -q 2>&1 | tail -40
mvn -pl tez-tests test -Dtest=TestExternalTezServices -q 2>&1 | tail -10

The integration test (tez-tests) takes 5–10 minutes; skip it on the first local iteration but run it before the patch submission.

Walked example B — locality miscount

Symptom: a debug log shows node-local: 4, rack-local: 12, off-switch: 0 for a vertex whose input splits should give 14 node-local containers. The bookkeeping is off.

Locating the counter

cd ~/tez-src
grep -n "nodeLocal\|rackLocal\|offSwitch" \
  tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java | head -20

You find an assignContainer(...) path that compares the allocated host against the request's preferred host. The bug: the comparison is host.equals(request.getHosts()[0]), but host arrives as node-1.cluster.local while the requested host is node-1. The exact-string comparison fails, the allocation is miscounted as rack-local, and the affinity penalty cascades into the next request.

Diff

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java
@@
-    if (host.equals(request.getHosts()[0])) {
+    // Hosts may be reported as FQDNs by the RM but as short names by the
+    // caller-supplied hint. Compare on the leading label to keep both forms
+    // equivalent. See TEZ-XXXX.
+    if (hostMatches(host, request.getHosts()[0])) {
       nodeLocalCount.incrementAndGet();
     } else if (rackOf(host).equals(rackOf(request.getHosts()[0]))) {
       rackLocalCount.incrementAndGet();
     } else {
       offSwitchCount.incrementAndGet();
     }
   }
+
+  static boolean hostMatches(String a, String b) {
+    if (a == null || b == null) return false;
+    return a.equals(b)
+        || leadingLabel(a).equals(leadingLabel(b));
+  }
+
+  private static String leadingLabel(String h) {
+    int dot = h.indexOf('.');
+    return dot < 0 ? h : h.substring(0, dot);
+  }

The accompanying test asserts the counter under both FQDN and short-name forms.
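
A self-contained version of the classification makes the three outcomes easy to test. This is a sketch with assumptions: LocalityClassifier, the Locality enum, and the rack map keyed by short host name are hypothetical. The host comparison mirrors hostMatches from the diff above.

```java
import java.util.Map;

/** Hypothetical classifier mirroring the node-local / rack-local / off-switch decision. */
final class LocalityClassifier {
  enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

  private LocalityClassifier() {}

  /** Returns the part of the host name before the first dot. */
  static String leadingLabel(String host) {
    int dot = host.indexOf('.');
    return dot < 0 ? host : host.substring(0, dot);
  }

  /** True when both names are equal or share the same leading label. */
  static boolean hostMatches(String a, String b) {
    if (a == null || b == null) {
      return false;
    }
    return a.equals(b) || leadingLabel(a).equals(leadingLabel(b));
  }

  /** rackByShortName maps a short host name to its rack; unknown hosts have no rack. */
  static Locality classify(String allocated, String requested,
      Map<String, String> rackByShortName) {
    if (allocated == null || requested == null) {
      return Locality.OFF_SWITCH;
    }
    if (hostMatches(allocated, requested)) {
      return Locality.NODE_LOCAL;
    }
    String allocatedRack = rackByShortName.get(leadingLabel(allocated));
    String requestedRack = rackByShortName.get(leadingLabel(requested));
    if (allocatedRack != null && allocatedRack.equals(requestedRack)) {
      return Locality.RACK_LOCAL;
    }
    return Locality.OFF_SWITCH;
  }
}
```

One caveat worth raising in review: comparing on the leading label treats node-1.a.example and node-1.b.example as the same host. That is only safe when short names are unique across the cluster.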

Walked example C — priority inversion

Symptom: a high-priority request (priority 2, AM speculation) waits indefinitely behind a long queue of priority-10 requests, even though the scheduler has capacity.

Root cause: the request was added to a queue keyed by the priority as a string, not as an int. In string ordering "10" sorts before "2", so the numerically lower (more urgent) priority is served last. The fix is to use an Integer key or a TreeMap with a numeric comparator. The diff and test follow the same pattern as above; the file is tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java near the requestsByPriority field.
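
The ordering difference is easy to reproduce outside the scheduler. In this sketch the class PriorityKeyOrdering and the "request@N" labels are hypothetical. It only shows how a TreeMap picks its first entry under String keys versus Integer keys, assuming at least one priority is supplied.

```java
import java.util.TreeMap;

/** Hypothetical demo of lexicographic versus numeric ordering of priority keys. */
final class PriorityKeyOrdering {
  private PriorityKeyOrdering() {}

  /** Returns the request a TreeMap keyed by priority strings would serve first. */
  static String firstWithStringKeys(int... priorities) {
    TreeMap<String, String> queue = new TreeMap<>();
    for (int p : priorities) {
      queue.put(String.valueOf(p), "request@" + p);
    }
    return queue.firstEntry().getValue();
  }

  /** Returns the request a TreeMap keyed by Integer priorities would serve first. */
  static String firstWithIntKeys(int... priorities) {
    TreeMap<Integer, String> queue = new TreeMap<>();
    for (int p : priorities) {
      queue.put(p, "request@" + p);
    }
    return queue.firstEntry().getValue();
  }

  public static void main(String[] args) {
    System.out.println(firstWithStringKeys(2, 10)); // lexicographic: '1' < '2'
    System.out.println(firstWithIntKeys(2, 10));    // numeric: 2 < 10
  }
}
```

String keys compare character by character, so the bug stays invisible while every priority is a single digit. A regression test should therefore include at least one two-digit priority.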

MiniTezCluster pattern

For bugs that only manifest end-to-end:

cd ~/tez-src
find tez-tests/src/test/java -name "TestMRRJobsDAGApi.java"

That file is the canonical worked example. The setup pattern:

private static MiniTezCluster tezCluster;

@BeforeClass
public static void setup() throws Exception {
  Configuration conf = new Configuration();
  tezCluster = new MiniTezCluster("TEZ-XXXX", 1, 1, 1);
  tezCluster.init(conf);
  tezCluster.start();
}

@AfterClass
public static void teardown() {
  if (tezCluster != null) {
    tezCluster.stop();
  }
}

Tests should:

Submit a small DAG (an OrderedWordCount derivative is fine).
Assert on DAGStatus and VertexStatus via the client.
Set tight tez.am.am-rm.heartbeat.interval-ms.max and tez.task.am.heartbeat.interval-ms.max overrides so retries fire quickly.

A MiniTezCluster test takes 30s+ per run; do not add more than one per JIRA.

Pitfalls

Don't mock the AppContext or the EventHandler if you can avoid it. Scheduler bugs often live in the handoff between scheduler and dispatcher. Mocking the dispatcher hides the bug.
Don't add Thread.sleep to scheduler tests. Use DrainDispatcher.await() or poll the scheduler's getHeldContainers() view with a timeout.
Don't introduce a new lock to fix a race. Most scheduler races are fixed by moving an existing line inside an existing synchronized block. Adding a new lock is a Stage 11 patch.
Don't change the AMRM heartbeat interval to make a test pass. That hides timing bugs that bite in production. Use the existing test helpers that drive the heartbeat synchronously.
Don't release containers in onContainersCompleted to "be safe". Hadoop's AMRMClient documentation forbids that; the container is already released by the RM, and a second release fires a confusing log line.
Don't fix a locality miscount by changing the comparison everywhere. The bug is usually a single inconsistency. Pin it down with a focused unit test before broadening the change.

Exit criteria — when you're ready for the next stage

Move to Stage 6 when:

You have shipped one scheduler patch with a passing MiniTezCluster or AMRM stub regression test.
You can read YarnTaskSchedulerService.assignContainer without referring to external docs.
You have written a MiniTezCluster test from scratch and it runs locally in under a minute.
You can explain the heldContainer lifecycle to another contributor in five sentences.

Stage 6 moves you into the runtime: ShuffleManager, Fetcher, MergeManager.

Stage 6 — Shuffle and Runtime

What this stage teaches

Stage 6 is the runtime stage. You learn:

The shuffle pipeline: how ShuffleManager schedules Fetcher threads against the upstream task outputs, how MergeManager consolidates fetched segments, and how the result is presented to the downstream processor as a KeyValuesReader.
The on-disk IFile format and the off-by-one EOF bugs that haunt every serialiser written against it.
FetchedInput and the in-memory vs on-disk decision: how tez.runtime.shuffle.memory.limit.percent interacts with MergeManager.canShuffleToMemory.
Fetch-failure retry storms: when a single bad NodeManager triggers cascading fetcher restarts that swamp the AM event queue.
How to inject deterministic faults using the FaultInjectionFetcher pattern (or, where it does not exist, the equivalent test double).

Patches are 80–600 lines and almost always come with a MiniTezCluster test because the runtime contracts are too subtle for unit tests alone.

Reading order

tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/MergeManager.java
tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
The deep dive shuffle-sort.

cd ~/tez-src
wc -l tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/*.java
wc -l tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java

JIRA filter to find candidates

project = TEZ
  AND component in ("tez-runtime-library", "tez-runtime-internals")
  AND resolution = Unresolved
  AND (text ~ "shuffle" OR text ~ "fetcher" OR text ~ "MergeManager"
       OR text ~ "IFile" OR text ~ "FetchedInput" OR text ~ "spill")
ORDER BY priority DESC, updated DESC

A second filter for fetch-failure storms specifically:

project = TEZ AND text ~ "fetch failure" AND text ~ "retry" AND resolution = Unresolved

Walked example A — fetch-failure retry storm

Symptom: a 5k-task vertex runs on a cluster where one NodeManager goes bad. Within minutes the AM logs are flooded with INPUT_READ_ERROR events. The DAG eventually succeeds but takes hours instead of minutes. The AM event queue backs up to 100k+ pending events.

Step 1 — Trace the path

cd ~/tez-src
grep -n "INPUT_READ_ERROR\|reportReadError\|fetchFailures" \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java

You find ShuffleManager.reportReadError(...) which sends an INPUT_READ_ERROR event to the AM for every failed fetch. With 5k downstream tasks each trying to fetch from the bad source, the AM receives 5k events per cycle. The AM dedupes by source attempt, but only after the events are on the queue.

Step 2 — Identify the fix

The minimal fix is client-side debounce: a ShuffleManager should not re-report the same source attempt failure more than once per tez.runtime.shuffle.fetch-failure.report.cooldown-ms window. The Tez convention is to add the config key with a sensible default (here 5,000 ms, read into reportCooldownMs).

Step 3 — Diff

--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleManager.java
@@
+  private final ConcurrentMap<InputAttemptIdentifier, Long> lastReportedAt =
+      new ConcurrentHashMap<>();
+  private final long reportCooldownMs;
@@
   public void reportReadError(InputAttemptIdentifier srcAttempt, IOException e) {
+    long now = clock.getTime();
+    Long prev = lastReportedAt.get(srcAttempt);
+    if (prev != null && now - prev < reportCooldownMs) {
+      if (LOG.isDebugEnabled()) {
+        LOG.debug("Debouncing read-error report for {} (last={}ms ago)",
+            srcAttempt, now - prev);
+      }
+      return;
+    }
+    lastReportedAt.put(srcAttempt, now);
     inputContext.sendEvents(Collections.singletonList(
         createInputReadErrorEvent(srcAttempt, e)));
   }

Add the config key to TezRuntimeConfiguration:

+  public static final String TEZ_RUNTIME_SHUFFLE_FETCH_FAILURE_REPORT_COOLDOWN_MS =
+      TEZ_RUNTIME_PREFIX + "shuffle.fetch-failure.report.cooldown-ms";
+  public static final long
+      TEZ_RUNTIME_SHUFFLE_FETCH_FAILURE_REPORT_COOLDOWN_MS_DEFAULT = 5_000L;

And register it in the same file's tezRuntimeKeys set so the validator does not reject it.
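
One detail a reviewer may raise about the reportReadError diff: the get-then-put on lastReportedAt is not atomic, so two fetcher threads failing on the same source at the same moment can both pass the check. The sketch below shows one way to close that gap with ConcurrentHashMap.compute. ReportDebouncer and its method names are hypothetical, and the clock is a plain LongSupplier rather than the Tez Clock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.LongSupplier;

/** Hypothetical per-key debouncer; at most one report per key per cooldown window. */
final class ReportDebouncer<K> {
  private final ConcurrentMap<K, Long> lastReportedAt = new ConcurrentHashMap<>();
  private final long cooldownMs;
  private final LongSupplier clock; // injected so tests can control time

  ReportDebouncer(long cooldownMs, LongSupplier clock) {
    this.cooldownMs = cooldownMs;
    this.clock = clock;
  }

  /** Returns true if the caller should send a report for this key now. */
  boolean shouldReport(K key) {
    long now = clock.getAsLong();
    boolean[] allowed = {false};
    // compute() runs atomically per key, so the check and the update cannot interleave.
    lastReportedAt.compute(key, (k, prev) -> {
      if (prev == null || now - prev >= cooldownMs) {
        allowed[0] = true;
        return now;
      }
      return prev;
    });
    return allowed[0];
  }

  /** Drops the entry for a key, for example once its source attempt is obsolete. */
  void forget(K key) {
    lastReportedAt.remove(key);
  }
}
```

The forget method is there because a map keyed by source attempt grows for the life of the task unless something prunes it. Where that pruning belongs in ShuffleManager is a design question for the JIRA.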

Step 4 — Test with `FaultInjectionFetcher` pattern

There is no production FaultInjectionFetcher; the test pattern is to subclass ShuffleManager and override createFetcher to return a Fetcher that throws IOException on every call. The repro test sits in tez-runtime-library/src/test/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/TestShuffleManager.java:

@Test(timeout = 10000)
public void testReadErrorReportDebounce() throws Exception {
  ControlledClock clock = new ControlledClock();
  TezConfiguration conf = new TezConfiguration();
  conf.setLong(TEZ_RUNTIME_SHUFFLE_FETCH_FAILURE_REPORT_COOLDOWN_MS, 1000);

  ShuffleManager sm = createShuffleManager(conf, clock);
  InputAttemptIdentifier src = newInputAttempt(0);

  sm.reportReadError(src, new IOException("first"));
  sm.reportReadError(src, new IOException("second (debounced)"));
  sm.reportReadError(src, new IOException("third (debounced)"));

  // Only the first event should reach the inputContext
  verify(inputContext, times(1)).sendEvents(anyList());

  // Advance the clock past the cooldown
  clock.setTime(clock.getTime() + 2000);
  sm.reportReadError(src, new IOException("after cooldown"));
  verify(inputContext, times(2)).sendEvents(anyList());
}

Then add a MiniTezCluster integration test that runs OrderedWordCount with a fault injected on a single Fetcher, to confirm that the AM event queue stays bounded.

Walked example B — off-by-one in `IFile` EOF

Symptom: a reader of IFile-format data occasionally returns one extra zero-length record at the end of a segment. Downstream processors see a null/empty key and either throw or silently insert a bogus row.

Step 1 — Locate

cd ~/tez-src
grep -n "EOF_MARKER\|readNextKeyValue\|nextRawKey" \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java

Read the Reader.nextRawKey loop and the EOF_MARKER constant. The classic bug shape: the loop tests bytesRead >= length after a successful read instead of before, allowing one extra iteration when the segment ends exactly on a record boundary.

Diff

--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java
@@
   public boolean nextRawKey(DataInputBuffer key) throws IOException {
+    if (bytesRead >= segmentLength) {
+      return false;
+    }
     int recordLength = readVInt(dataIn);
     if (recordLength == EOF_MARKER) {
       return false;
     }
     ...
   }

The fix is three added lines. The harder part is the test.
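
The phantom-record shape can be reproduced with a toy format. Everything in this sketch is an assumption made for illustration: BoundedRecordReader is hypothetical, records are a 4-byte length followed by the payload (not the real IFile layout), and the buggy variant simply skips the bound check. The sample data holds a 10-byte segment followed by four zero bytes, which parse as a zero-length record if the reader runs past the boundary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Toy length-prefixed reader; illustrates the boundary check, not the IFile format. */
final class BoundedRecordReader {
  static final int SAMPLE_SEGMENT_LENGTH = 10; // two records of (4-byte length + 1 byte)

  private final DataInputStream in;
  private final long segmentLength;
  private final boolean checkBoundFirst;
  private long bytesRead;

  BoundedRecordReader(byte[] data, long segmentLength, boolean checkBoundFirst) {
    this.in = new DataInputStream(new ByteArrayInputStream(data));
    this.segmentLength = segmentLength;
    this.checkBoundFirst = checkBoundFirst;
  }

  /** Returns the next payload, or null at end of segment or end of data. */
  byte[] next() {
    if (checkBoundFirst && bytesRead >= segmentLength) {
      return null; // the guard: stop before reading past the segment
    }
    try {
      if (in.available() < 4) {
        return null;
      }
      int length = in.readInt();
      bytesRead += 4;
      if (length < 0) {
        return null; // explicit end marker
      }
      byte[] payload = new byte[length];
      in.readFully(payload);
      bytesRead += length;
      return payload;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Counts the records a reader returns for the given data and segment length. */
  static int countRecords(byte[] data, long segmentLength, boolean checkBoundFirst) {
    BoundedRecordReader reader = new BoundedRecordReader(data, segmentLength, checkBoundFirst);
    int count = 0;
    while (reader.next() != null) {
      count++;
    }
    return count;
  }

  /** Two one-byte records, then four zero bytes that lie beyond the segment. */
  static byte[] sampleData() {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(1);
      out.writeByte('a');
      out.writeInt(1);
      out.writeByte('b');
      out.writeInt(0); // bytes past the boundary that look like an empty record
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling countRecords(sampleData(), SAMPLE_SEGMENT_LENGTH, false) yields the extra empty record, and passing true stops at the boundary. A real regression test needs the same ingredient: bytes after the segment end that happen to parse as a record.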

Step 2 — Test

Add to tez-runtime-library/src/test/java/org/apache/tez/runtime/library/common/sort/impl/TestIFile.java:

@Test
public void testReaderStopsAtExactSegmentBoundary() throws Exception {
  // Write exactly two records, capture the byte length, construct a Reader
  // bounded to that byte length, and assert the third nextRawKey() returns
  // false without throwing.
  Path p = writeRecords(2);
  long segLen = fs.getFileStatus(p).getLen();
  Reader r = new Reader(conf, fs, p, codec, /*ifileReadAhead*/false, 0, segLen);
  assertTrue(r.nextRawKey(keyBuf));
  assertTrue(r.nextRawKey(keyBuf));
  assertFalse("must not return phantom third record",
      r.nextRawKey(keyBuf));
  r.close();
}

Run:

mvn -pl tez-runtime-library test -Dtest=TestIFile -q 2>&1 | tail -30

A reviewer will also ask for a check that bytesRead does not advance past segmentLength on a malformed input — add it.

Walked example C — `MergeManager` unexpected spill

Symptom: a small DAG that fits comfortably in memory still spills to disk. Investigation: MergeManager.canShuffleToMemory returns false for inputs smaller than the configured threshold because it compares against the total memory budget rather than the per-input share.

The bug shape is in MergeManager.canShuffleToMemory(long size): the check compares usedMemory + size against the total budget (maxMemory * memoryLimitPercent), where it should compare size against the per-input limit, singleShuffleLimit.

The repro: a tiny OrderedWordCount on MiniTezCluster with tez.runtime.shuffle.memory.limit.percent=0.95 and a single 100KB input. The counter MERGED_MAP_OUTPUTS_DISK should be 0 and is not.

The fix and test follow the same pattern as the previous two examples.

Pitfalls

Don't add Thread.sleep to a shuffle test. Use DrainDispatcher, the ControlledClock pattern, or a CountDownLatch driven by the production callback. Sleep-based shuffle tests are the #1 source of flakes in tez-runtime-library (see Stage 9).
Don't relax MergeManager thresholds to "fix" a memory error. The thresholds are a contract with the AM scheduler. If MergeManager runs out of memory, the bug is usually upstream — a Fetcher that should have used disk and went to memory.
Don't add a config key without registering it in tezRuntimeKeys. The runtime validates against an allowlist; an unregistered key is silently ignored.
Don't fix the IFile reader by widening the boundary check. Boundary bugs in IFile usually have a sibling bug in the writer. Read both before patching either.
Don't add a Fetcher retry loop that does not respect the AM's already-scheduled retry policy. Two retry loops in series turn a 3x retry into a 9x retry. Confirm via dispatcher trace that the AM is the only retry authority.
Don't change the on-disk IFile format without bumping IFile.VERSION. That is a Stage 11 patch and requires explicit back-compat shims.
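
As a concrete alternative to sleeping, the latch pattern from the first pitfall looks like this in miniature. LatchedCallback is a hypothetical test double: production code would invoke run() from whatever callback the test wants to observe, and the test waits on the latch with a timeout instead of guessing a sleep duration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical test double: lets a test wait until a callback has fired N times. */
final class LatchedCallback implements Runnable {
  private final CountDownLatch remaining;

  LatchedCallback(int expectedCalls) {
    this.remaining = new CountDownLatch(expectedCalls);
  }

  /** Invoked by the code under test each time the observed event happens. */
  @Override
  public void run() {
    remaining.countDown();
  }

  /** Returns true if all expected calls arrived within the timeout. */
  boolean awaitCalls(long timeoutMs) {
    try {
      return remaining.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    LatchedCallback callback = new LatchedCallback(2);
    Thread producer = new Thread(() -> {
      callback.run();
      callback.run();
    });
    producer.start();
    // Returns as soon as both calls have happened; the timeout is only an upper bound.
    System.out.println(callback.awaitCalls(5000));
  }
}
```

The timeout bounds how long a failing test can hang, while a passing test finishes as soon as the callbacks arrive. That is the property sleep-based tests lack.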

Exit criteria — when you're ready for the next stage

Move to Stage 7 when:

You have shipped one shuffle or runtime patch with a deterministic MiniTezCluster regression test that passes in under two minutes.
You can recite the relationship between tez.runtime.shuffle.memory.limit.percent, tez.runtime.shuffle.fetch.buffer.percent, and the JVM heap.
You have read MergeManager.merge() end to end and can explain the on-disk vs in-memory branches.
A reviewer has accepted your fix without asking "is this the same bug as TEZ-XXXX?" — meaning you have learned to grep for prior art before patching.

Stage 7 takes you out of Tez code and into the Hive-on-Tez attribution skill.

Stage 7 — Hive-on-Tez Compatibility

What this stage teaches

Stage 7 is the cross-project stage. You learn:

The largest consumer of Tez in production is Hive. Bugs that look like Tez bugs are often Hive bugs that surface through Tez, and vice versa.
The contracts Hive depends on: DAGPlan size limits, edge property serialisation, session reuse via TezSessionPoolManager, and the HiveSplitGenerator event protocol.
The attribution decision tree: when to file on TEZ, when on HIVE, and when on both with a cross-reference.
The release-train interplay: Hive 3.x ships a specific Tez version; Hive 4.x ships a different one. A "fix" in Tez master does not automatically reach a Hive user until the next Hive release picks up a Tez release.
How to write an attribution argument in a JIRA description so committers in both projects agree on ownership before any code is written.

The "patch" deliverable for Stage 7 is often a JIRA, not code. A correct attribution call is the contribution; the code may be one line in each project or zero lines in Tez and a workaround in Hive.

JIRA filter to find candidates

project = TEZ AND text ~ "Hive" AND resolution = Unresolved
  ORDER BY updated DESC

Then on the Hive side:

project = HIVE
  AND (text ~ "Tez" OR text ~ "TezSession" OR text ~ "DAGPlan"
       OR text ~ "VertexManagerPlugin")
  AND resolution = Unresolved
  ORDER BY updated DESC

Cross-reference: a TEZ- ticket linked to a HIVE- ticket is a Stage 7 opportunity. The contribution is reading both, choosing the owner project, and writing the attribution.

The attribution decision tree

Given a symptom, walk this tree:

Is the symptom observed in a non-Hive Tez workload?
├── Yes  → Tez bug. File on TEZ. Stage 4–6 patch.
└── No (Hive-specific)
    │
    Does the symptom depend on a Hive class on the stack trace?
    ├── Yes (Hive frame is the top user-code frame)
    │   │
    │   Is the Tez API contract being misused by Hive?
    │   ├── Yes  → HIVE bug. File on HIVE. Tez may need a clearer
    │   │         contract / better exception message — file
    │   │         a follow-up TEZ ticket.
    │   └── No   → Possibly a Tez API contract gap. File on TEZ
    │             with a Hive repro, link the HIVE ticket.
    │
    └── No (the bug surfaces inside Tez code triggered by Hive's DAG)
        │
        Does the Hive DAG exercise an edge case Tez tests don't cover?
        ├── Yes  → Tez bug. File on TEZ. Add a Tez-side test that
        │         reproduces the shape without Hive.
        └── No   → File a `cross-project` ticket on TEZ with a
                  HIVE counter-ticket; sort ownership on dev@.

The tree is not law. It is the start of a dev@ conversation.
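
One way to check your reading of the tree is to encode it as a pure function. The sketch below is a hypothetical helper, not a Tez or Hive API; the enum names are invented labels for the five leaves, and the four booleans correspond to the four questions in the tree.

```java
/** Hypothetical encoding of the attribution tree; not part of Tez or Hive. */
final class AttributionTree {
  enum Outcome {
    TEZ_BUG,                      // seen without Hive: file on TEZ
    HIVE_BUG_WITH_TEZ_FOLLOW_UP,  // Hive misuses the Tez API
    TEZ_CONTRACT_GAP,             // Hive frame on top, API used correctly
    TEZ_BUG_ADD_TEST,             // surfaces inside Tez, untested edge case
    CROSS_PROJECT                 // ownership to be settled on dev@
  }

  static Outcome decide(boolean seenInNonHiveWorkload,
                        boolean hiveIsTopUserFrame,
                        boolean hiveMisusesTezApi,
                        boolean uncoveredTezEdgeCase) {
    if (seenInNonHiveWorkload) {
      return Outcome.TEZ_BUG;
    }
    if (hiveIsTopUserFrame) {
      return hiveMisusesTezApi
          ? Outcome.HIVE_BUG_WITH_TEZ_FOLLOW_UP
          : Outcome.TEZ_CONTRACT_GAP;
    }
    return uncoveredTezEdgeCase ? Outcome.TEZ_BUG_ADD_TEST : Outcome.CROSS_PROJECT;
  }
}
```

Note that the fourth question is only consulted when no Hive frame is the top user-code frame; on the other branch the answer to the third question already selects a leaf.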

Walked example A — `DAGPlan` size exceeds limit on Hive autogenerated DAG

Symptom: a Hive 3.1 query with a large IN list (10k+ literals) submits a DAG that fails at TezClient.submitDAG with:

TezException: DAGPlan too large

The Tez default is 64MB on the wire. Hive can in principle stay under it, but the codegen path for very large IN lists doesn't truncate.

Step 1 — Attribution

Walk the tree:

Non-Hive workload? No, Hive-specific.
Hive on stack? Yes, HiveSplitGenerator.
Is Hive misusing Tez API? No — DAGPlan is exactly the wire format Tez expects; Hive is sending a legitimate but large payload.
The tree stops at that leaf. As supporting evidence, Tez tests submit only small DAGPlans, so this path is untested on the Tez side.

Conclusion: this is a Tez API contract gap that Hive happens to hit first. The fix is twofold:

Tez side: make the limit configurable and improve the error message to tell the operator which key to bump. File on TEZ.
Hive side: paginate the IN list literal codegen. File on HIVE.

The Tez patch is small and lands first.

Step 2 — The Tez-side diff

--- a/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
+++ b/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java
@@
+  /**
+   * Maximum size (bytes) of the serialised {@link DAGPlan} that the AM
+   * accepts in a single submission. The default of 64MiB is a Hadoop
+   * IPC limit. Operators submitting very large DAGs (typically generated
+   * by upstream query engines) may need to raise this.
+   * @since 0.10.4
+   */
+  public static final String TEZ_DAG_PLAN_MAX_BYTES =
+      TEZ_PREFIX + "dag.plan.max.bytes";
+  public static final int TEZ_DAG_PLAN_MAX_BYTES_DEFAULT = 64 * 1024 * 1024;

And in tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java:

-    if (serialised.length > 64 * 1024 * 1024) {
-      throw new TezException("DAGPlan too large");
+    int max = conf.getInt(TEZ_DAG_PLAN_MAX_BYTES, TEZ_DAG_PLAN_MAX_BYTES_DEFAULT);
+    if (serialised.length > max) {
+      throw new TezException(String.format(
+          "DAGPlan serialised size %d exceeds limit %d. "
+              + "Raise %s on the submitter and AM, or reduce DAGPlan size "
+              + "(typically by pruning literal lists or split metadata).",
+          serialised.length, max, TEZ_DAG_PLAN_MAX_BYTES));
     }

The patch makes the limit explicit, configurable, and self-describing.
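
The behaviour of the patched check can be exercised outside Tez with a minimal sketch. It assumes java.util.Properties as a stand-in for Tez's Configuration and IllegalStateException as a stand-in for TezException; the class name DagPlanSizeCheck is hypothetical. The key name and default are copied from the patch above.

```java
import java.util.Properties;

/** Hypothetical stand-in for the AM-side check; Properties replaces Tez's Configuration. */
final class DagPlanSizeCheck {
  static final String MAX_BYTES_KEY = "tez.dag.plan.max.bytes";
  static final int MAX_BYTES_DEFAULT = 64 * 1024 * 1024;

  /** Throws if the serialised plan is strictly larger than the configured limit. */
  static void check(int serialisedLength, Properties conf) {
    int max = Integer.parseInt(
        conf.getProperty(MAX_BYTES_KEY, String.valueOf(MAX_BYTES_DEFAULT)));
    if (serialisedLength > max) {
      throw new IllegalStateException(String.format(
          "DAGPlan serialised size %d exceeds limit %d. Raise %s or reduce DAGPlan size.",
          serialisedLength, max, MAX_BYTES_KEY));
    }
  }
}
```

A plan exactly at the limit is accepted, matching the `>` comparison in the patch; only a strictly larger plan is rejected, and the message names the key an operator would raise.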

Step 3 — The JIRA description (attribution argument)

Summary: Make DAGPlan size limit configurable and self-describing

Description:
Hive's HiveSplitGenerator can generate DAGPlans > 64MiB for queries with
very large IN lists. Currently Tez throws "DAGPlan too large" with no
actionable advice. The Hive side will paginate (HIVE-NNNNN), but Tez
should:

  1. Expose tez.dag.plan.max.bytes so operators can raise the cap.
  2. Produce an error message that names the key and the cause.

Attribution rationale:
  - This is a Tez API contract gap: legitimate DAGPlans should not be
    silently rejected with no recourse.
  - Hive is the first downstream that hits this; other DAG generators
    (Pig-on-Tez, custom DAGs from BI tools) will hit it next.
  - HIVE-NNNNN is filed in parallel for the codegen pagination.

Tests:
  - TestDAGAppMaster#testDAGPlanSizeLimitConfigurable
  - End-to-end repro left to HIVE-NNNNN (Tez has no test that builds a
    pathological 64MiB DAGPlan).

This is the cross-project pattern: the TEZ ticket cites HIVE-NNNNN explicitly, states the attribution rationale, and stops short of fixing Hive's behaviour.

Walked example B — edge property mismatch on Hive upgrade

Symptom: after upgrading Hive 3.1 → 3.2, certain queries fail with:

TezException: EdgeProperty mismatch on edge v1->v2: source class
  org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput
  does not match sink class
  org.apache.tez.runtime.library.input.UnorderedKVInput

Tez rejects the DAG because the edge wiring is inconsistent.

Attribution: Hive 3.2 emitted a different sink type for that vertex. Tez is behaving correctly — it is enforcing the edge contract. This is a HIVE bug. File on HIVE. The Tez side requires no patch.

The contribution here is the attribution itself plus a Tez-side documentation note on the validator: "see EdgeProperty.checkCompatible for the rules enforced." Add a docs patch (Stage 1) if no such note exists.

Walked example C — `TezSessionPoolManager` reuse leak

Symptom: HiveServer2 uses TezSessionPoolManager to reuse AMs across queries. A specific Hive query path leaves the session in a state where the next query sees stale credentials.

Attribution: TezSessionPoolManager is a Hive class (in the Hive repo), even though it manages TezClient instances. Find it:

grep -rn "class TezSessionPoolManager" ~/hive-src/ql/src/java

The bug is in Hive. The Tez API used (TezClient.start()) is correct.

File on HIVE. The Tez contribution is zero code; it is the attribution call and the explanation in the JIRA comments that prevents the ticket bouncing.

Reading the Hive code path for attribution

Even though you may not commit to Hive, you must be able to read the Hive classes that touch Tez:

org.apache.hadoop.hive.ql.exec.tez.DagUtils — Hive's DAG builder.
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator — Hive's input split generation, implemented as a Tez InputInitializer and invoked by Tez during vertex initialisation.
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager — session reuse.
org.apache.hadoop.hive.ql.exec.tez.TezSessionState — per-session state.

Keep a Hive checkout next to your Tez checkout:

git clone https://github.com/apache/hive ~/hive-src

A grep across both:

grep -rn "DAGPlan\|VertexManagerPluginDescriptor" \
  ~/hive-src/ql/src/java ~/tez-src/tez-api/src/main/java | head

is the start of every Stage 7 investigation.

Pitfalls

Don't fix a Hive bug in Tez. Even if the symptom appears on a Tez stack frame, do not patch Tez to work around an incorrect Hive use of the API. You will trap Tez into supporting buggy clients forever.
Don't expand a Tez API to "make Hive easier". That is a Stage 11 patch with a dev@ design thread; not a Stage 7 patch.
Don't assume the Hive committers will read your TEZ ticket. CC the appropriate Hive committers explicitly, or post a short note on dev@hive.apache.org linking the JIRA.
Don't promise a Tez backport to a specific Hive release. Release alignment is a separate conversation; you control your patch's landing in Tez, not when Hive picks it up.
Don't file the same bug on both projects without distinguishing the work. TEZ-NNNN should fix the Tez side; HIVE-NNNN should fix the Hive side; each ticket should cross-reference the other and say exactly what code lives in which project.
Don't break older Hive versions to fix newer ones. A Tez change that raises the minimum required Hive version is a Stage 11 / Stage 12 call.

Exit criteria — when you're ready for the next stage

Move to Stage 8 when:

You have correctly attributed at least one symptom to HIVE (saving Tez from an incorrect patch) and one to TEZ (with a Hive counter-ticket).
You have a ~/hive-src checkout next to ~/tez-src and have grepped across both at least three times during real investigation.
You can describe the lifecycle of a TezSessionState from creation to reuse to teardown in five sentences.
You have read EdgeProperty.checkCompatible and know which mismatches the Tez validator does and does not flag.

Stage 8 takes you into the YARN integration layer.

Stage 8 — YARN Integration

What this stage teaches

Stage 8 lives at the Tez/YARN boundary. You learn:

How the Tez AM acquires and renews its AMRMToken, and the canonical bug: long-running session AMs (multi-day Hive sessions) whose AMRMToken expires while the AM is mid-RPC.
Log aggregation: how Tez's container exit hooks interact with the NM's LogAggregationService. The canonical symptom: missing container logs after AM crash because the AM never told the NM to flush.
The NM aux service: the Tez ShuffleHandler (or the MR ShuffleHandler when configured) lives in tez-plugins/tez-aux-services. Version mismatches between AM-side tez-runtime-library and NM-side aux service cause shuffle failures with cryptic error messages.
Kerberos delegation token renewal across DAG lifecycles, especially when multiple DAGs in a session use the same Credentials object.
TezClient AMRMToken handling: where the token lives in the submitter process versus the AM.

Patches in this stage are 50–400 lines but often require a Hadoop-version-specific code path, so the tez-plugins/tez-aux-services profile structure matters more than in other stages.

Reading order

tez-api/src/main/java/org/apache/tez/client/TezClient.java
tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java — focus on the AMRMToken handling and the credential propagation.
tez-plugins/tez-aux-services/src/main/java/org/apache/tez/auxservices/ShuffleHandler.java
The deep dive yarn-integration.

cd ~/tez-src
grep -rn "AMRMToken\|getCredentials\|TokenIdentifier" \
  tez-api/src/main/java tez-dag/src/main/java | head -30
ls tez-plugins/tez-aux-services/src/main/java/org/apache/tez/auxservices/

JIRA filter to find candidates

project = TEZ
  AND component in ("tez-dag", "tez-plugins")
  AND resolution = Unresolved
  AND (text ~ "AMRMToken" OR text ~ "kerberos" OR text ~ "delegation token"
       OR text ~ "log aggregation" OR text ~ "ShuffleHandler"
       OR text ~ "aux service" OR description ~ "TokenExpired")
ORDER BY updated DESC

A second filter focused on long-running session bugs:

project = TEZ AND text ~ "session" AND text ~ "expired"
  AND resolution = Unresolved

Walked example A — AMRMToken expiry on long DAGs

Symptom: a Hive session AM runs for 36 hours. At hour 24 it starts logging:

SecretManager$InvalidToken: AMRMToken for application appattempt_X has expired.

The AM crashes mid-DAG. The user loses the long-running session and resubmits all in-progress queries.

Step 1 — Trace token lifetime

cd ~/tez-src
grep -n "AMRMToken\|registerApplicationMaster" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
grep -rn "renewMaxLifetime\|token-max-lifetime" tez-api tez-dag tez-common

YARN's yarn.resourcemanager.am-rm-tokens.master-key-rolling-interval-secs default is 24h. When the RM rotates the master key, the AM's cached AMRMToken becomes invalid. The fix is to detect a token-expired exception on the AMRM heartbeat path and re-acquire the token from the RM (which already exposes this via the heartbeat response in modern Hadoop versions).

Step 2 — Choose the right Hadoop version

tez-aux-services and tez-dag build against the configured Hadoop profile:

grep -rn "hadoop28\|hadoop29\|hadoop31" pom.xml | head

Token rollover handling differs across Hadoop minor versions. The patch must be a no-op on profiles where the Hadoop client already handles the rollover transparently. Confirm by:

grep -rn "AMRMToken" ~/hadoop-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client | head

If AMRMClientAsyncImpl already loops on token expiry in Hadoop 3.x, your Tez patch is a Hadoop-2.x-only path guarded by an availability check.

Step 3 — Diff

--- a/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
+++ b/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
@@
   private void heartbeatLoop() {
     while (!shutdownRequested) {
       try {
         AllocateResponse resp = amRmClient.allocate(progress);
+        // Hadoop 2.x clients did not transparently refresh the AMRMToken
+        // on master-key rollover. Detect token-expired and re-acquire.
+        // See TEZ-XXXX.
+        if (resp.getAMRMToken() != null) {
+          UserGroupInformation.getCurrentUser().addToken(
+              ConverterUtils.convertFromYarn(resp.getAMRMToken(), (Text) null));
+        }
         processAllocations(resp);
       } catch (InvalidToken e) {
+        LOG.warn("AMRMToken invalid for {}, attempting re-register", appAttemptID);
+        try {
+          amRmClient.registerApplicationMaster(host, port, trackingUrl);
+          continue;
+        } catch (Exception reErr) {
+          LOG.error("Re-register failed; AM will exit", reErr);
+          throw new TezUncheckedException(reErr);
+        }
       } catch (Exception e) {
         ...
       }
     }
   }

Step 4 — Test

A unit test stubs the AMRMClient to return an InvalidToken once then a healthy response, and asserts that registerApplicationMaster was called once and the loop continued. Pattern:

@Test(timeout = 10000)
public void testAmrmTokenReacquiredOnInvalidToken() throws Exception {
  AMRMClient mockRm = mock(AMRMClient.class);
  when(mockRm.allocate(anyFloat()))
      .thenThrow(new InvalidToken("expired"))
      .thenReturn(emptyAllocateResponse());
  DAGAppMaster am = createTestAM(mockRm);
  am.runOneHeartbeatIteration();
  verify(mockRm).registerApplicationMaster(anyString(), anyInt(), anyString());
  am.runOneHeartbeatIteration();
  // second iteration must succeed
}

A MiniYARNCluster test that triggers an actual key rollover is possible but slow; the unit test above is sufficient for review.
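
The shape of that unit test can be rehearsed without Hadoop or Mockito on the classpath. The sketch below is a simplified model under stated assumptions: RmClient, InvalidTokenException, and FlakyOnceClient are hypothetical stand-ins for AMRMClient, Hadoop's InvalidToken, and the mock, and the exception is unchecked here only to keep the sketch short (Hadoop's InvalidToken is a checked IOException).

```java
/** Hypothetical model of one heartbeat iteration; the real types live in Hadoop. */
final class HeartbeatSketch {
  /** Stand-in for SecretManager.InvalidToken, made unchecked for brevity. */
  static class InvalidTokenException extends RuntimeException {
    InvalidTokenException(String msg) {
      super(msg);
    }
  }

  interface RmClient {
    void allocate();
    void register();
  }

  /** Returns true if the heartbeat succeeded, false if it had to re-register. */
  static boolean runOneIteration(RmClient rm) {
    try {
      rm.allocate();
      return true;
    } catch (InvalidTokenException e) {
      rm.register(); // a failure here propagates and ends the loop
      return false;
    }
  }

  /** Fake client that rejects the first allocate call, then recovers. */
  static class FlakyOnceClient implements RmClient {
    int allocateCalls;
    int registerCalls;

    @Override
    public void allocate() {
      if (++allocateCalls == 1) {
        throw new InvalidTokenException("expired");
      }
    }

    @Override
    public void register() {
      registerCalls++;
    }
  }
}
```

A hand-written fake like FlakyOnceClient makes the two assertions explicit: exactly one re-registration, and a healthy second iteration.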

Walked example B — log aggregation race on AM crash

Symptom: an AM crashes (OutOfMemoryError). The cluster operator runs yarn logs -applicationId ... and gets nothing. The NodeManager's LogAggregationService reports the logs as never finalised.

Root cause: the JVM crashed before Tez's DAGAppMaster.shutdown() could flag the logs as aggregation-ready. NM's default is "wait for the AM to mark finalisation" rather than aggregating on container exit.

The fix

Tez registers a JVM shutdown hook (Runtime.getRuntime().addShutdownHook) that calls into the YARN LogAggregationContext to force-finalise. The hook must run before the JVM's normal exit handlers.

cd ~/tez-src
grep -n "addShutdownHook\|LogAggregationContext" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java

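If a shutdown hook is registered but does not handle OutOfMemoryError, add a defensive try/catch (Throwable). Do not rely on registration order to sequence it: the JVM starts hooks registered through Runtime.addShutdownHook in an unspecified order. If the hook must run relative to other cleanup, register it through Hadoop's ShutdownHookManager, which runs hooks in priority order (higher priority first).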
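
A minimal sketch of the defensive wrapper follows. DefensiveHook and the finaliseLogs action are hypothetical names; the sketch assumes only that the real finalisation step can be expressed as a Runnable, and it uses the plain JDK hook API rather than any Tez or Hadoop helper.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper; finaliseLogs stands in for the AM's log-finalisation step. */
final class DefensiveHook {
  /** Wraps an action so that no Throwable escapes the hook thread. */
  static Runnable guarded(Runnable finaliseLogs, AtomicBoolean failed) {
    return () -> {
      try {
        finaliseLogs.run();
      } catch (Throwable t) {
        // Includes OutOfMemoryError. Catching Throwable is acceptable only on a shutdown path.
        failed.set(true);
      }
    };
  }

  /** Registers the guarded action as a JVM shutdown hook. */
  static void install(Runnable finaliseLogs, AtomicBoolean failed) {
    Runtime.getRuntime().addShutdownHook(
        new Thread(guarded(finaliseLogs, failed), "log-finaliser"));
  }
}
```

Separating guarded from install keeps the wrapper testable: a unit test can run the Runnable directly without shutting down the JVM.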

The diff is small; the test is hard. The accepted pattern is a logged-evidence test: spin up a MiniYARNCluster, submit a DAG, terminate the AM process with SIGTERM (kill -9 skips shutdown hooks entirely, so it cannot exercise this path), and assert that the NM log aggregation finalised the logs within a bounded time. This test belongs in tez-tests and is slow (~30s).

Walked example C — NM aux service version mismatch

Symptom: a cluster operator deploys Tez 0.10.3 but the NMs still run the Tez 0.10.1 aux service. Shuffle fails with:

IOException: Unknown shuffle handler version: 1

The fix is in tez-aux-services plus a docs note: the aux service on every NM must match the AM-side tez-runtime-library minor version. The Tez patch is twofold:

The aux service must report its version in the protocol handshake.
The client side must produce a self-describing error message that names the NM, the version it reported, and the version the AM expected.

--- a/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
+++ b/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/Fetcher.java
@@
-    if (serverVersion != EXPECTED_SHUFFLE_VERSION) {
-      throw new IOException("Unknown shuffle handler version: " + serverVersion);
+    if (serverVersion != EXPECTED_SHUFFLE_VERSION) {
+      throw new IOException(String.format(
+          "Tez shuffle handler version mismatch on %s:%d: server=%d, expected=%d. "
+              + "Likely cause: NodeManager aux-service jar is older than the AM. "
+              + "Ensure tez-aux-services-%s.jar is deployed to every NM.",
+          host, port, serverVersion, EXPECTED_SHUFFLE_VERSION,
+          TezVersionInfo.getVersion()));
     }

The patch is one improved error message and one documentation update in docs/src/site/markdown/install.md.

Pitfalls

Don't add a new JVM shutdown hook without considering ordering. Java does not guarantee shutdown hook order; if two hooks rely on each other, you must serialise them explicitly.
Don't catch Throwable outside a shutdown path. Catching Throwable in the heartbeat loop will swallow OutOfMemoryError and leave the AM in an undefined state.
Don't conflate AMRMToken with delegation tokens. AMRMToken authenticates the AM to the RM; delegation tokens authenticate the AM/tasks to HDFS or other services. Renewal paths and lifetimes are different.
Don't deploy a fix that requires the operator to redeploy tez-aux-services without saying so in the release notes. Aux service upgrades require an NM restart; that is operationally expensive.
Don't assume the Hadoop version on disk is the Hadoop version in production. Test against the minimum Hadoop version supported by your Tez release line (see pom.xml profile defs).
Don't hard-code token renewal intervals. Use the YARN-side configuration keys directly (yarn.resourcemanager.am-rm-tokens.master-key-rolling-interval-secs).

Exit criteria — when you're ready for the next stage

Move to Stage 9 when:

You have shipped one YARN-integration patch with evidence (in the JIRA description) of which Hadoop minor versions you tested against.
You can describe the AMRMToken lifecycle in five sentences including the master-key rollover.
You have read the LogAggregationContext API in the Hadoop source and understand the logIncludePattern / logExcludePattern interplay.
You have a tez-plugins/tez-aux-services build that runs locally and you understand which NMs need it.

Stage 9 returns to the in-repo skill set with a focus on test stability.

Stage 9 — Flaky Tests

What this stage teaches

Stage 9 is the unglamorous-but-essential stage. You learn:

The Tez flake taxonomy: Thread.sleep races, undrained AsyncDispatcher, MiniTezCluster port collisions, and @Test(timeout=...) budgets that were too tight for slow CI.
How to distinguish a flake (passes locally, fails on Jenkins 1-in-30 runs) from a real intermittent bug (manifests in production under load). Flakes are tests; intermittent bugs are not.
The DrainDispatcher.await() refactor: how to convert a sleep-based synchronisation to an event-drain-based one.
The @Rule and TestName patterns for diagnosing which test in a suite leaks state into the next.
When a flake fix is also a production code fix (the test was right; the code had a race).

Patches are 20–150 lines per test. They rarely change production code. The ones that do warrant a Stage 4–6 ticket in addition to the test fix.

JIRA filter to find candidates

project = TEZ
  AND (text ~ "flaky" OR text ~ "intermittent" OR labels = "flaky-test")
  AND resolution = Unresolved
ORDER BY updated DESC

A second source: Jenkins precommit history. Pick any open JIRA, find its Jenkins URL in the comments, click through to recent runs, look for tests that failed in one run and passed in the next on the same patch. Those tests are flake candidates regardless of whether a JIRA already exists.

A third source: your own mvn test output. Run any tez-dag test suite three times in a row:

cd ~/tez-src
for i in 1 2 3; do
  mvn -pl tez-dag test -Dtest=TestVertexImpl -q 2>&1 | tail -5
done

Any failure across the three passes that does not repeat is a flake to investigate.

The Tez flake taxonomy

1. `Thread.sleep` races

The most common shape:

worker.submitJob(j);
Thread.sleep(500);                 // "wait for it to start"
assertTrue(worker.isJobRunning(j));

On a slow CI box, 500ms may not be enough. On a fast box, the job may have completed before the assertion. Both fail.

The fix is a poll with timeout:

worker.submitJob(j);
TestUtils.waitFor(() -> worker.isJobRunning(j), /*pollMs*/50, /*timeoutMs*/30_000);
assertTrue(worker.isJobRunning(j));

If TestUtils.waitFor does not exist in the module, copy the pattern from Hadoop's org.apache.hadoop.test.GenericTestUtils.waitFor or write one yourself in a few lines.
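
If you do write the helper yourself, a minimal version might look like the sketch below. WaitUtil is a hypothetical name, and returning a boolean instead of throwing on timeout is a choice made for this sketch, not an established Tez signature.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical poll-with-timeout helper for tests. */
final class WaitUtil {
  /** Polls condition every pollMs; returns true once it holds, false after timeoutMs. */
  static boolean waitFor(BooleanSupplier condition, long pollMs, long timeoutMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out without the condition holding
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        throw new IllegalStateException("interrupted while waiting", e);
      }
    }
    return true;
  }
}
```

The sleep here is bounded by a condition and a deadline, which is what separates it from the fixed Thread.sleep(500) in the flaky version: a fast machine returns early and a slow machine gets the full budget.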

2. Undrained `AsyncDispatcher`

The dispatcher is event-driven. A test that fires an event and immediately asserts on state will see the pre-event state half the time.

The fix is DrainDispatcher.await():

cd ~/tez-src
grep -rn "class DrainDispatcher" tez-common/src/main/java tez-dag/src/test

Find the canonical class. The refactor:

-    dispatcher.getEventHandler().handle(new VertexEvent(vid, VertexEventType.V_INIT));
-    Thread.sleep(200);
-    assertEquals(VertexState.INITED, vertex.getState());
+    dispatcher.getEventHandler().handle(new VertexEvent(vid, VertexEventType.V_INIT));
+    dispatcher.await();
+    assertEquals(VertexState.INITED, vertex.getState());

The contract: await() returns when the event queue is empty and the last event has been fully handled (including any subsequent events the handler itself emitted). If the test still flakes after this refactor, the handler is emitting events to a different dispatcher (e.g. a child component has its own). Find it and drain that one too.
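
The await() contract can be illustrated with a miniature dispatcher. This is a hypothetical sketch, not Tez's DrainDispatcher: it assumes a single worker thread and tracks a count of events that have been enqueued but not yet fully handled, so follow-up events emitted by a handler are counted before the parent event finishes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical miniature of a drainable dispatcher; not Tez's DrainDispatcher. */
final class MiniDrainDispatcher<E> {
  private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
  private final Object lock = new Object();
  private int pending; // enqueued but not yet fully handled; guarded by lock
  private final Thread worker;

  MiniDrainDispatcher(Consumer<E> handler) {
    worker = new Thread(() -> drainLoop(handler), "mini-dispatcher");
    worker.setDaemon(true);
    worker.start();
  }

  private void drainLoop(Consumer<E> handler) {
    try {
      while (true) {
        E event = queue.take();
        try {
          handler.accept(event); // may call handle() to emit follow-up events
        } finally {
          synchronized (lock) {
            pending--;
            lock.notifyAll();
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // stop() was called; exit quietly
    }
  }

  /** Enqueues an event; the count rises before the event becomes visible to the worker. */
  void handle(E event) {
    synchronized (lock) {
      pending++;
    }
    queue.add(event);
  }

  /** Blocks until every enqueued event, including follow-ups, has been handled. */
  void await() {
    synchronized (lock) {
      try {
        while (pending > 0) {
          lock.wait();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while draining", e);
      }
    }
  }

  void stop() {
    worker.interrupt();
  }
}
```

Because a follow-up event increments the counter while its parent is still being handled, the count cannot touch zero between the two, which is the property a sleep-based test only approximates.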

3. `MiniTezCluster` port collisions

The default MiniTezCluster binds a fixed RM port. Two suites running in parallel on the same machine collide. The fix is per-suite port randomisation:

-    tezCluster = new MiniTezCluster("test", 1, 1, 1);
+    // testName is the suite's JUnit rule: @Rule public TestName testName = new TestName();
+    tezCluster = new MiniTezCluster(testName.getMethodName(), 1, 1, 1);
+    Configuration conf = new Configuration();
+    conf.set(YarnConfiguration.RM_ADDRESS, "localhost:0");  // port 0 = OS-assigned
+    conf.set(YarnConfiguration.RM_SCHEDULER_ADDRESS, "localhost:0");
+    conf.set(YarnConfiguration.RM_RESOURCE_TRACKER_ADDRESS, "localhost:0");
+    tezCluster.init(conf);

Port 0 tells the OS to assign an unused port. Then read the actual address from the cluster after start:

String rmAddress = tezCluster.getConfig().get(YarnConfiguration.RM_ADDRESS);
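
The port-0 behaviour is a property of the operating system's socket API, not of YARN, and can be seen with the JDK alone. EphemeralPort below is a hypothetical helper written for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

/** Hypothetical demonstration of OS-assigned ports using only the JDK. */
final class EphemeralPort {
  /** Binds port 0, reads back the port the OS chose, then releases it. */
  static int probe() {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The port is released when the socket closes, so another process may claim it afterwards. That is why a cluster should bind port 0 itself and report the result, rather than a test probing for a free port and passing the number along.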

4. `@Test(timeout=...)` too tight

A test with @Test(timeout=1000) may pass on a developer's M3 Pro and fail on a contention-laden Jenkins agent. Raise the timeout to a value that comfortably covers the slow CI but is still bounded:

-  @Test(timeout = 1000)
+  @Test(timeout = 30_000)
   public void testInitTransitionRunsOnce() { ... }

The Tez convention: 30s for unit tests, 300s for MiniTezCluster tests. Never @Test(timeout = 0) — a hung test will block CI for hours.

Walked example — `TestShuffleManager` flake

Symptom: testReadErrorReportDebounce fails 1-in-12 runs on Jenkins with:

expected:<1> but was:<2>

i.e. the verify on inputContext.sendEvents saw two calls when one was expected.

Step 1 — Reproduce locally

cd ~/tez-src
for i in $(seq 1 50); do
  mvn -pl tez-runtime-library test \
    -Dtest=TestShuffleManager#testReadErrorReportDebounce \
    -q 2>&1 | tail -3
done | grep -c "FAILED"

A local reproduction at 1/50 frequency is good enough to start.

Step 2 — Diagnose

Read the test. The pattern:

sm.reportReadError(src, new IOException("first"));
sm.reportReadError(src, new IOException("second"));
verify(inputContext, times(1)).sendEvents(anyList());

reportReadError may dispatch to an internal executor. The verify runs before the executor has serviced the call. The Mockito verify sees only the synchronous call most of the time; the async one fires 1-in-12.

Step 3 — Fix

Replace verify with a timeout-bounded verify:

-    verify(inputContext, times(1)).sendEvents(anyList());
+    verify(inputContext, timeout(5_000).times(1)).sendEvents(anyList());

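Mockito.timeout(ms) polls until the expected interactions match and fails only if they still do not match when the window closes, so the test now waits up to 5 seconds before failing. It returns as soon as the count is satisfied, so it cannot catch a second call that arrives late; Mockito.after(ms) waits the full window before asserting and is the stricter choice for an at-most-once check.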

A bigger refactor (preferred): inject a deterministic executor:

ShuffleManager sm = createShuffleManager(conf, new DirectExecutor());

where DirectExecutor is a java.util.concurrent.Executor whose execute runs synchronously on the caller thread. Now there is no race, and the original verify(..., times(1)) is correct.
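
If the module has no such class, it is a few lines of plain JDK code. The sketch below assumes nothing beyond java.util.concurrent.Executor; whether ShuffleManager accepts an injected executor is something to confirm in the source before relying on it.

```java
import java.util.concurrent.Executor;

/** Runs every task synchronously on the calling thread; intended for tests only. */
final class DirectExecutor implements Executor {
  @Override
  public void execute(Runnable command) {
    command.run();
  }
}
```

Because execute returns only after the task has finished, every side effect is complete by the time the test reaches its verify call.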

The reviewer rule: prefer the deterministic executor refactor over Mockito.timeout. The timeout-based fix masks future races; the deterministic fix eliminates them.

Step 4 — Confirm the fix

Run the loop again:

for i in $(seq 1 200); do
  mvn -pl tez-runtime-library test \
    -Dtest=TestShuffleManager#testReadErrorReportDebounce -q 2>&1 | tail -3
done | grep -c "FAILED"

200 runs, zero failures, is the bar. Don't ship a flake fix you have not stress-tested.

When a flake is a real bug

Sometimes a test flakes because the production code has a race. If the "obvious" flake fix is to insert a sleep or relax an assertion, stop and ask: could a production caller exercise the same race?

Example: VertexImpl.handle returning before all event-emission side effects complete. Adding dispatcher.await() makes the flaky test pass, but a production caller doing the same sequence sees a partially-applied state. That is a Stage 4 bug, not a Stage 9 bug.

The decision rule:

The test races against an internal event queue → flake fix.
The test races against a public contract method → file a real bug.

Pitfalls

Don't @Ignore a flake to "fix" CI. The next contributor will silently remove the @Ignore and re-introduce the flake. File a real ticket with a written analysis even if you don't fix it.
Don't bump the @Test(timeout) without reasoning. A 30s timeout is evidence the test does real work; a 30000s timeout is evidence the test is broken.
Don't replace assertEquals with assertTrue(... contains ...) to silence a flake. That weakens the assertion permanently and hides the underlying race.
Don't refactor a test class wholesale in a flake patch. Fix the one test. If the class needs a wholesale refactor, file a separate JIRA.
Don't use Thread.yield() to fix a race. It is not a guarantee; it is a hint. Always use a real synchronisation primitive (CountDownLatch, dispatcher.await(), Future.get()).
Don't catch InterruptedException and ignore it. The Tez convention is Thread.currentThread().interrupt(); throw new ... so the interrupt status propagates.
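
The interrupt-handling convention in the last pitfall can be sketched as follows. InterruptAware is a hypothetical name, and IllegalStateException stands in for whichever exception type the surrounding code throws.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical example of restoring the interrupt flag before rethrowing. */
final class InterruptAware {
  static void awaitLatch(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      // await() cleared the flag when it threw; set it again so callers can see it.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    }
  }
}
```

Without the call to interrupt(), code further up the stack that checks the flag would never learn that a shutdown or cancellation was requested.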

Exit criteria — when you're ready for the next stage

Move to Stage 10 when:

You have de-flaked at least three tests with confirmed 200-run stability.
You have caught at least one real production race that was masquerading as a flake.
You can name the four flake patterns by heart (sleep races, undrained dispatcher, port collisions, tight timeouts).
A reviewer has accepted your deterministic-executor refactor as the preferred pattern over Mockito.timeout.

Stage 10 turns the focus to performance regressions.

Stage 10 — Performance Regressions

What this stage teaches

Stage 10 is where you stop fixing bugs and start measuring. You learn:

The Tez perf-regression workflow: identify symptom, git bisect to the culprit commit, profile under load, attribute the cost, ship a fix with before/after numbers.
Microbenchmarking with tez-examples/OrderedWordCount as the canonical small DAG. When that is too coarse, JMH at the call-site level.
Profilers: async-profiler for CPU/lock contention, JFR for allocation/GC pressure. When to use which.
The two perf hotspots most often blamed first: AsyncDispatcher queue contention and IFile record encoding.
How to file a perf-regression JIRA that committers take seriously: numbers, methodology, reproducibility, and a fix bounded in scope.

Patches are 30–300 lines, always with benchmark evidence in the JIRA. A performance patch without numbers is a no-op.

JIRA filter to find candidates

project = TEZ
  AND resolution = Unresolved
  AND (text ~ "performance regression" OR text ~ "slow"
       OR text ~ "contention" OR text ~ "allocation"
       OR labels = "performance")
ORDER BY priority DESC, updated DESC

A second source is the dev@ archive — search for "slowdown" or "regression" in the last six months. Operators often report perf issues without filing a JIRA. The first contribution is filing the JIRA with a repro.

The Tez perf-regression workflow

1. Reproduce the regression with a number

Never start a perf investigation with a vibe. Get a number:

cd ~/tez-src
mvn -pl tez-examples -am clean install -DskipTests -Phadoop28 -q
# Then run OrderedWordCount end-to-end on MiniTezCluster
mvn -pl tez-tests test -Dtest=TestExternalTezServices#testOrderedWordCount -q

For a more isolated benchmark, write a JMH micro:

find ~/tez-src -name "pom.xml" -exec grep -l jmh {} \;

If JMH is not in the test pom, add it scoped to test only — never to compile.

2. `git bisect` to the culprit commit

Suppose the regression is "OrderedWordCount on a 10-node MiniTezCluster went from 12s to 19s between 0.10.2 and 0.10.3":

cd ~/tez-src
git bisect start
git bisect bad 0.10.3
git bisect good 0.10.2

# Each step:
mvn clean install -DskipTests -Phadoop28 -q
mvn -pl tez-tests test -Dtest=TestExternalTezServices#testOrderedWordCount -q
# Record the wall time. Then:
git bisect good   # or 'git bisect bad'

Twenty commits between two minor releases means log2(20) ≈ 5 bisect steps. Bisect to the single commit, then read its diff. Often the commit is innocent and the regression is in a sibling commit interacting with it; bisect is the start of the investigation.
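
The step estimate is the ceiling of a base-2 logarithm, which is worth computing before you commit to a bisect where each step is a full build. BisectMath is a hypothetical helper; it gives the worst case for a simple halving search, which approximates what git bisect does on a linear history.

```java
/** Hypothetical helper: worst-case halving steps for n candidate commits. */
final class BisectMath {
  /** Returns ceil(log2(candidateCommits)) using integer arithmetic. */
  static int steps(int candidateCommits) {
    if (candidateCommits < 1) {
      throw new IllegalArgumentException("need at least one candidate commit");
    }
    return 32 - Integer.numberOfLeadingZeros(candidateCommits - 1);
  }
}
```

For 20 candidate commits this gives 5 steps; multiply by the build-plus-test time per step to budget the investigation.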

3. Profile under load

Once you suspect a region of code, profile:

# async-profiler: CPU samples
$ASYNC_PROFILER/profiler.sh -d 60 -f /tmp/dag.html -e cpu <AM-pid>

# JFR: GC + allocation
jcmd <AM-pid> JFR.start name=tez duration=60s filename=/tmp/dag.jfr

Profile the AM, not the submitting client. The AM is the long-running process where contention manifests.

For a per-task profile:

// In a one-off test only — never in production code
conf.set(TezConfiguration.TEZ_TASK_LAUNCH_CMD_OPTS,
    "-agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=/tmp/task-%p.jfr");

4. Attribute the cost

Read the flame graph. A single fat frame above the noise floor is your target. Most Tez regressions land in one of three buckets:

Lock contention on AsyncDispatcher.eventQueue or VertexImpl.writeLock.
Allocation pressure from IFile.Writer or MergeManager building short-lived buffers in a tight loop.
GC overhead from a long-lived collection that grows unbounded (e.g. a HashMap keyed by TaskAttemptId that is never pruned).

5. Ship a fix with numbers

A Stage 10 JIRA description must include:

Methodology:
  - Hardware: 16-core M3 Pro, 32GB RAM.
  - Command: mvn -pl tez-tests test -Dtest=...
  - Runs: 5 cold, 10 warm, report median + p95.
  - Hadoop profile: hadoop28.

Before (TEZ master at <hash>): median 19.0s, p95 22.1s.
After  (this patch on top):    median 12.4s, p95 13.7s.

Profile evidence: flame graph attached. AsyncDispatcher.handle was 38% CPU
before, 4% after.

A reviewer will ask for the profile artifact. Attach it.
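The median and p95 in a methodology block are easy to compute rather than eyeball. A minimal sketch, assuming wall times are collected in seconds and using the nearest-rank definition of a percentile; `RunStats` and the sample values are hypothetical.

```java
import java.util.Arrays;

/** Summarises benchmark wall times (seconds) as median and p95. */
final class RunStats {

  /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
  static double percentile(double[] samples, double p) {
    if (samples.length == 0) {
      throw new IllegalArgumentException("no samples");
    }
    double[] sorted = samples.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
    return sorted[Math.max(rank, 1) - 1];
  }

  /** Median: the middle sample, or the mean of the two middle samples. */
  static double median(double[] samples) {
    if (samples.length == 0) {
      throw new IllegalArgumentException("no samples");
    }
    double[] sorted = samples.clone();
    Arrays.sort(sorted);
    int mid = sorted.length / 2;
    return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
  }

  public static void main(String[] args) {
    // Ten hypothetical warm-run wall times in seconds.
    double[] warm = {12.1, 12.4, 12.3, 12.6, 12.4, 13.7, 12.2, 12.5, 12.4, 12.9};
    System.out.printf("median %.1fs, p95 %.1fs%n", median(warm), percentile(warm, 95));
  }
}
```

With only ten warm runs, nearest-rank p95 is simply the slowest run, which is worth stating in the JIRA so reviewers do not over-read it.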

Walked example A — `AsyncDispatcher` queue contention

Symptom: AM throughput collapses on DAGs with > 10k tasks. Profile shows 38% of CPU in AsyncDispatcher.handle under LinkedBlockingQueue.put.

Step 1 — Diagnose

cd ~/tez-src
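grep -rn "LinkedBlockingQueue\|eventQueue" \
  --include="AsyncDispatcher*.java" tez-common/src/main/java

(The class is technically Hadoop's AsyncDispatcher, but Tez subclasses and configures it in tez-common; the exact package path varies by branch, hence the recursive search.) The queue is multi-producer, single-consumer: many threads enqueue events and one dispatcher thread drains them, so producers contend on the queue's put lock. A partitioned queue keyed by event type would spread that contention.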

Step 2 — The fix surface

Two acceptable approaches:

Sharded dispatcher: partition events by destination ID so each shard has its own queue. Tez has the building blocks but not the wiring; the patch is the wiring.
Batched event submission: collect events on the producer side and submit in groups, reducing lock acquisitions per task.

Both are large patches. The Stage 10 contribution is one of them, with a clear scope: "sharded dispatcher for vertex events only", not "rewrite AsyncDispatcher".
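To make the sharded approach concrete, here is a minimal sketch using only the JDK. It is not Tez code: `ShardedDispatcher`, its methods, and the event strings are hypothetical, and a real patch would have to fit the existing dispatcher and event-handler registration. The point it illustrates is that events with the same key always land on the same single-threaded shard, so per-key ordering survives while producers for different keys stop sharing one queue lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: routes events to one of N single-threaded shards by key. */
final class ShardedDispatcher<E> {
  private final List<ExecutorService> shards = new ArrayList<>();
  private final Consumer<E> handler;

  ShardedDispatcher(int shardCount, Consumer<E> handler) {
    if (shardCount <= 0) {
      throw new IllegalArgumentException("shardCount must be positive");
    }
    this.handler = handler;
    for (int i = 0; i < shardCount; i++) {
      // Each single-thread executor owns its own queue.
      shards.add(Executors.newSingleThreadExecutor());
    }
  }

  /** Same key, same shard: per-key ordering is preserved. */
  int shardFor(Object key) {
    return Math.floorMod(key.hashCode(), shards.size());
  }

  void dispatch(Object key, E event) {
    shards.get(shardFor(key)).execute(() -> handler.accept(event));
  }

  /** Drains every shard queue, then stops the shard threads. */
  boolean shutdown(long timeout, TimeUnit unit) {
    for (ExecutorService shard : shards) {
      shard.shutdown();
    }
    boolean drained = true;
    try {
      for (ExecutorService shard : shards) {
        drained &= shard.awaitTermination(timeout, unit);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return drained;
  }

  public static void main(String[] args) {
    ShardedDispatcher<String> dispatcher = new ShardedDispatcher<>(
        4, event -> System.out.println(Thread.currentThread().getName() + " " + event));
    dispatcher.dispatch("vertex-1", "V_INIT");
    dispatcher.dispatch("vertex-1", "V_START");
    dispatcher.dispatch("vertex-2", "V_INIT");
    dispatcher.shutdown(5, TimeUnit.SECONDS);
  }
}
```

Ordering is guaranteed only within a key. Anything that relies on a global event sequence across keys would break under this design, so that question belongs in the design discussion before any code is posted.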

Step 3 — Numbers

For the sharded-dispatcher patch on a 10k-task OrderedWordCount:

Before: 19.0s median, 22.1s p95.
After:  12.4s median, 13.7s p95.
AsyncDispatcher.handle: 38% → 4% CPU.

These numbers go into the JIRA description, with a flame graph attached.

Step 4 — dev@ design ping

Any Stage 10 patch above ~50 lines deserves a dev@ thread:

Subject: [DISCUSS] TEZ-XXXX — shard AsyncDispatcher by destination type

I have a repro for AM throughput collapse on 10k-task DAGs. Profile attached.
Proposed fix: shard the AsyncDispatcher event queue by destination type
(Vertex / Task / TaskAttempt / Container). Numbers: 19s -> 12s median.

Open questions:
  1. Default shard count: I propose 4 with a configurable override.
  2. Compat: AsyncDispatcher is org.apache.hadoop, so we shim in tez-common.
  3. Tests: TestAsyncDispatcher + the existing scheduler integration tests.

Comments welcome before I post the patch.

If a committer flags an unexpected constraint (e.g. "we cannot shard because ATS event ordering depends on global sequence"), redesign before coding.

Walked example B — `IFile` record encoding hot path

Symptom: profile shows 22% CPU in IFile.Writer.append under WritableUtils.writeVInt. Allocation profile shows two byte[] per record.

Diagnose:

cd ~/tez-src
grep -n "writeVInt\|writeVLong\|new byte\[" \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java

The hot path allocates a fresh byte[] per record for VInt encoding. The fix is a reusable scratch buffer per Writer instance:

+  private final byte[] vIntBuf = new byte[9];
+
   public void append(DataInputBuffer key, DataInputBuffer value) throws IOException {
-    byte[] scratch = new byte[9];
-    int n = encodeVInt(key.getLength(), scratch);
-    out.write(scratch, 0, n);
+    int n = encodeVInt(key.getLength(), vIntBuf);
+    out.write(vIntBuf, 0, n);
     ...
   }

The patch is six lines. The justification is the JMH micro:

JMH benchmark: IFileWriter.append for 1M small records.
Before: 14.2 us/op, 32B/op allocation.
After:   8.7 us/op,  0B/op allocation.

This is a textbook Stage 10 patch: small, measurable, attributable.
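The scratch-buffer pattern is easy to try in isolation. The sketch below is an assumption-laden stand-in, not the IFile code: `ScratchBufferWriter` and `encodeVarint` are hypothetical, and the encoding is a generic 7-bits-per-byte varint rather than Hadoop's WritableUtils format. It shows only the shape of the fix: one buffer per writer, reused for every length prefix.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical writer that reuses one scratch buffer for varint length prefixes. */
final class ScratchBufferWriter {
  // Allocated once per writer. Five bytes hold any non-negative 32-bit value
  // in this 7-bits-per-byte encoding.
  private final byte[] scratch = new byte[5];
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  /** Encodes value into buf, low 7 bits first; returns the number of bytes used. */
  static int encodeVarint(int value, byte[] buf) {
    if (value < 0) {
      throw new IllegalArgumentException("negative length: " + value);
    }
    int n = 0;
    while ((value & ~0x7F) != 0) {
      buf[n++] = (byte) ((value & 0x7F) | 0x80); // continuation bit set
      value >>>= 7;
    }
    buf[n++] = (byte) value;
    return n;
  }

  /** Appends a length-prefixed record without allocating a scratch array per call. */
  void append(byte[] record) {
    int n = encodeVarint(record.length, scratch);
    out.write(scratch, 0, n);
    out.write(record, 0, record.length);
  }

  byte[] toByteArray() {
    return out.toByteArray();
  }

  public static void main(String[] args) {
    ScratchBufferWriter writer = new ScratchBufferWriter();
    writer.append(new byte[] {1, 2, 3});
    writer.append(new byte[200]);
    System.out.println("bytes written: " + writer.toByteArray().length);
  }
}
```

The reused buffer makes the writer unsafe to share across threads, which is acceptable only if each writer instance is confined to one thread; that assumption should be stated in the patch.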

Pitfalls

Don't ship a perf patch without numbers. Reviewers will reject it. "Looks faster" is not evidence.
Don't benchmark without warm-up, especially on the machine you develop on, where background load and JIT state vary between runs. Always run cold + warm passes; report median + p95.
Don't compare across different Hadoop profiles. Pick one profile and hold it constant.
Don't widen the scope of a perf patch mid-review. "I found another hotspot while I was here" → new JIRA.
Don't use micro-benchmark numbers in isolation. Always show the end-to-end impact too. A 2x improvement in IFile.Writer.append that yields 0.1% end-to-end improvement may not be worth merging.
Don't git bisect against a tree with unrelated WIP. git bisect is deterministic only against a clean tree.
Don't profile in production without the operator's consent. Even async-profiler has overhead; the operator should know.

Exit criteria — when you're ready for the next stage

Move to Stage 11 when:

You have shipped one perf patch with documented before/after numbers and an attached profile.
You can git bisect 20 commits without referring to documentation.
You have read at least one async-profiler flame graph for Tez and identified the hotspot without help.
A committer has accepted your patch's methodology section as sufficient evidence.

Stage 11 takes you into the compatibility contract.

Stage 11 — Backward Compatibility

What this stage teaches

Stage 11 is where every change you make is constrained by what was there before. You learn:

The Apache @InterfaceAudience and @InterfaceStability annotations and what they obligate you to preserve.
The Tez API surface: which packages are Public, which are LimitedPrivate("Hive,Pig"), and which are Private. The audience determines the cost of breaking a contract.
How to evolve a protobuf message without breaking older clients (optional fields, never reuse field numbers, never change a field type).
The deprecation cycle: how long a deprecated symbol must remain before removal, and what evidence is required to declare it ready for removal.
How to negotiate the dev@ conversation when a change is technically compatible but operationally disruptive.

The patches in this stage are often small. The thread is long. A compatibility change without a dev@ design thread is a Stage 11 patch that will be reverted.

The annotation taxonomy

Three audience levels:

Annotation	Meaning	Examples
`@InterfaceAudience.Public`	Any external consumer may call this. Removal is a major-version break.	`TezClient`, `DAG`, `Vertex`, `Edge`, `Processor`, most of `tez-api`.
`@InterfaceAudience.LimitedPrivate({"Hive","Pig"})`	Only the named projects may call this. Coordinate with them before changing.	Some internal-ish `tez-api` helpers used by Hive's DagUtils.
`@InterfaceAudience.Private`	Internal to Tez. Free to change.	Everything in `tez-dag/src/main/java/org/apache/tez/dag/app/...`.

Three stability levels:

Annotation	Meaning
`@InterfaceStability.Stable`	Compatible across minor versions. Removal requires a major bump.
`@InterfaceStability.Evolving`	May change between minor versions, but deprecation cycle expected.
`@InterfaceStability.Unstable`	Free to break at any time.

The combined matrix gives nine cells. Most public Tez API is Public + Stable: the most expensive to change. Most internal Tez API is Private + Unstable: free to change.

Find the annotations:

cd ~/tez-src
grep -rn "@InterfaceAudience\|@InterfaceStability" tez-api/src/main/java | head -20

JIRA filter to find candidates

project = TEZ AND resolution = Unresolved
  AND (text ~ "deprecate" OR text ~ "compatibility"
       OR text ~ "InterfaceAudience" OR text ~ "protobuf"
       OR labels = "incompatible")
ORDER BY priority DESC, updated DESC

Walked example A — adding an optional protobuf field

Symptom: Tez wants to add a per-vertex "originating-user-class" string to the DAGPlan so the AM can attribute resource usage. The DAGPlan is wire-serialised to YARN's RM cache, so older AMs must continue to deserialise plans without the new field.

Step 1 — Locate the proto

cd ~/tez-src
find . -name "*.proto" | head
grep -n "message VertexPlan" $(find . -name "*.proto") | head

Read the existing VertexPlan message. Note the highest field number in use (say, 12). The new field must use a new number, not a recycled one.

Step 2 — The diff

--- a/tez-api/src/main/proto/DAGApiRecords.proto
+++ b/tez-api/src/main/proto/DAGApiRecords.proto
@@
 message VertexPlan {
   optional string name = 1;
   ...
   optional int32 task_resource_memory_mb = 12;
+  // @since 0.10.4 — optional; old AMs ignore unknown fields.
+  optional string originating_user_class = 13;
 }

Three rules:

The field is optional, never required: a required field breaks every reader and writer that does not set it. Tez uses proto2, where each field carries an explicit label, and optional is the only safe label for a singular field added after release.
The field number 13 has never been used before. Search the entire git history:
```
git log -p -S "= 13" -- tez-api/src/main/proto/DAGApiRecords.proto
```
to confirm.
The comment names the introduction release. Future contributors will use it to decide whether the field is safe to assume in their code path.

Step 3 — Producer and consumer sides

The producer in tez-api/src/main/java/org/apache/tez/dag/api/DAG.java sets the field when known and leaves it unset when not. The consumer in tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java must tolerate the unset case:

+    if (vertexPlan.hasOriginatingUserClass()) {
+      this.originatingUserClass = vertexPlan.getOriginatingUserClass();
+    } else {
+      this.originatingUserClass = null;
+    }

The reviewer will reject any consumer that calls getOriginatingUserClass() without first calling hasOriginatingUserClass(). Proto2 optional fields return a default ("" for strings) when unset, which is not the same as "absent".

Step 4 — Test the back-compat

The test is a serialisation round-trip with an older binary deserialiser:

@Test
public void testOldAMCanDeserialiseNewPlan() throws Exception {
  VertexPlan newPlan = VertexPlan.newBuilder()
      .setName("v1")
      .setOriginatingUserClass("com.example.Job")
      .build();
  byte[] wire = newPlan.toByteArray();

  // Parsing with the current generated class only proves the round trip.
  // To emulate an older AM, parse `wire` with a schema that lacks field 13:
  // the generated class from an older Tez jar, or a DynamicMessage built
  // from the pre-change descriptor.
  VertexPlan parsed = VertexPlan.parseFrom(wire);
  assertEquals("v1", parsed.getName());
  // With the old schema, field 13 lands in getUnknownFields() and the AM's
  // logic never reads it. That is the contract.
}

A real test against an older Tez jar is also valuable; check it in as a resource.
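Why an old reader survives a new field is easier to see at the byte level. The sketch below hand-rolls just enough of the protobuf wire format (varint tags and length-delimited strings) to show an "old" reader that knows only field 1 stepping over field 13. It deliberately does not use the protobuf library; `UnknownFieldSketch` and its methods are hypothetical, and only wire types 0 and 2 are handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical, hand-rolled subset of the protobuf wire format. */
final class UnknownFieldSketch {
  static final int WIRE_VARINT = 0;
  static final int WIRE_LEN = 2;

  static void writeVarint(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  /** Writes tag = (field << 3) | wire type, then length, then UTF-8 bytes. */
  static void writeString(ByteArrayOutputStream out, int field, String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    writeVarint(out, (field << 3) | WIRE_LEN);
    writeVarint(out, bytes.length);
    out.write(bytes, 0, bytes.length);
  }

  static int readVarint(byte[] buf, int[] pos) {
    int result = 0;
    int shift = 0;
    while (true) {
      byte b = buf[pos[0]++];
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
      shift += 7;
    }
  }

  /** "Old reader": understands only field 1 (name) and skips everything else. */
  static String readNameOnly(byte[] wire) {
    int[] pos = {0};
    String name = null;
    while (pos[0] < wire.length) {
      int tag = readVarint(wire, pos);
      int field = tag >>> 3;
      int type = tag & 7;
      if (type == WIRE_LEN) {
        int len = readVarint(wire, pos);
        if (field == 1) {
          name = new String(wire, pos[0], len, StandardCharsets.UTF_8);
        }
        pos[0] += len; // known or unknown, step past the payload
      } else if (type == WIRE_VARINT) {
        readVarint(wire, pos); // skip an unknown varint field
      } else {
        throw new IllegalArgumentException("wire type not handled in this sketch: " + type);
      }
    }
    return name;
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeString(out, 1, "v1");
    writeString(out, 13, "com.example.Job");
    System.out.println("old reader sees name = " + readNameOnly(out.toByteArray()));
  }
}
```

The tag carries the wire type, so a reader can skip a field it has never heard of. That is also why recycling a field number is dangerous: the old reader would not skip it, it would misinterpret it.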

Walked example B — deprecating a public method

Symptom: TezClient.submitDAG(DAG) returns a DAGClient whose getDAGStatus contract is unclear. A new method submitDAGWithStatus(DAG) returns a typed future. The old method should be deprecated.

The diff

--- a/tez-api/src/main/java/org/apache/tez/client/TezClient.java
+++ b/tez-api/src/main/java/org/apache/tez/client/TezClient.java
@@
+  /**
+   * @deprecated as of 0.10.4. Use {@link #submitDAGWithStatus(DAG)} which
+   *     returns a typed future. This method will be removed in 0.11.0.
+   *     See <a href="https://issues.apache.org/jira/browse/TEZ-XXXX">TEZ-XXXX</a>.
+   */
+  @Deprecated
   public DAGClient submitDAG(DAG dag) throws ... { ... }

Rules for deprecation:

The Javadoc names the replacement, the removal version, and the JIRA with the rationale.
The @Deprecated annotation is on the method, not the class.
The implementation is unchanged. Deprecation is a docs-and-annotation change; behaviour stays the same so existing callers continue to work.
Never delete a deprecated method in the same patch. Deprecation and removal are separate releases. The minimum cycle in Tez is one minor release as deprecated, then removal in the next major.

The removal patch goes in only when:

The deprecation has been in a released version for at least one minor cycle.
Search of downstream code (Hive, Pig, the Tez examples) confirms no remaining callers.
A dev@ thread has confirmed removal is acceptable.

Walked example C — changing a `LimitedPrivate("Hive")` API

Symptom: a LimitedPrivate("Hive") helper in tez-api is mis-named. You want to rename it.

This is not a free change, despite LimitedPrivate. The audience ("Hive") must be coordinated with. The workflow:

File the TEZ ticket with the rename proposal.
Search the Hive source for the existing name; if any caller uses it, write the HIVE-side patch first (deprecation-import shim).
Add the new name in Tez. Keep the old name as a @Deprecated wrapper for one release.
Remove the old name in Tez only after Hive has shipped a release that uses the new name.

The contribution often spans two Tez releases and two Hive releases. That is the cost of LimitedPrivate.
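The "new name plus deprecated wrapper" step has a simple shape. A minimal sketch with hypothetical names (`EdgeHelper`, `descEdge`, `describeEdge`), not an actual Tez helper: the old name forwards to the new one so there is exactly one implementation to maintain during the overlap.

```java
/** Hypothetical helper illustrating a rename with a deprecated forwarding wrapper. */
final class EdgeHelper {
  private EdgeHelper() {}

  /** New name: the single implementation. */
  static String describeEdge(String source, String destination) {
    return source + " -> " + destination;
  }

  /**
   * @deprecated Use {@link #describeEdge(String, String)}. Kept for one release
   *     so existing downstream callers keep compiling.
   */
  @Deprecated
  static String descEdge(String source, String destination) {
    // Forward to the new name; never keep a second copy of the logic.
    return describeEdge(source, destination);
  }

  public static void main(String[] args) {
    System.out.println(describeEdge("map", "reduce"));
    System.out.println(descEdge("map", "reduce"));
  }
}
```

Because the wrapper only forwards, removing it later is a pure deletion, which keeps the eventual removal patch trivial to review.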

Pitfalls

Don't reuse a protobuf field number after removing a field. Reserve it with reserved 7; in the proto file. Recycling a number breaks cross-version readers in undetectable ways.
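Don't change the type of a protobuf field. Even pairs that share a wire type are risky: string → bytes parses cleanly only while the payload is valid UTF-8, and a change such as int32 → string alters the wire type outright, so readers on the other version cannot decode the value. Add a new field with a new number; deprecate the old.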
Don't widen a Private API to Public without a dev@ thread. Once public, you cannot retract.
Don't remove a @Deprecated method in the same release that introduces the deprecation. That defeats the purpose of deprecation.
Don't change the default value of a configuration key without a dev@ thread. Default changes are invisible to compile-time checks but catastrophic in production. They are a Stage 12-adjacent change.
Don't introduce a new Stable annotation lightly. Once Stable, the method is locked for a major-version cycle.
Don't assume Hadoop's compatibility annotations are identical in meaning. They are similar but have project-specific nuance; read the Tez project's BUILDING.txt and the dev@ archive before relying on them.

Exit criteria — when you're ready for the next stage

Move to Stage 12 when:

You have shipped one compatibility-sensitive change (a protobuf evolution, a deprecation, or an API rename) with explicit annotations and dev@ sign-off.
You can recite the audience × stability matrix and pick the correct cell for an arbitrary tez-api class.
You have written a deprecation Javadoc that named the replacement, the removal version, and the JIRA without being prompted.
You have read the BUILDING.txt and dev@-archived compatibility guidance for Tez and Hadoop.

Stage 12 is the final stage: release-blocking issues and PMC-level work.

Stage 12 — Release-Blocking Issues

What this stage teaches

Stage 12 is the committer/PMC stage. You learn:

The four categories of release blockers: data loss, correctness regressions, AM crash, security CVE.
How to triage a candidate blocker during an RC vote: what evidence is required, who must be CC'd, and what the deadline-pressure tradeoffs are.
The Apache release process from a committer's seat: building an RC, signing artifacts, calling a [VOTE] thread, the 72-hour rule, and the meaning of +1 binding, -1 binding, +1, and 0 votes.
The Tez release notes format and what a release blocker contributes to it.
Security CVE handling: the private security@ list, embargoed disclosure, and the path from private patch to public release.

This is the only stage where you may be voting on someone else's work as much as writing your own. The patch surface is identical to earlier stages; the context in which you act is different.

JIRA filter to find candidates

project = TEZ
  AND priority in (Blocker, Critical)
  AND resolution = Unresolved
ORDER BY priority DESC, updated DESC

The set is small at any given time. During an RC vote it grows fast.

A second filter for the RC voting period:

project = TEZ AND priority = Blocker AND created > -7d

The four categories of release blockers

1. Data loss

The strictest category. Any code path where a successfully-acknowledged write can be lost, or a successfully-acknowledged read can return wrong data, is a data-loss blocker. Examples in Tez history:

A MergeManager spill that double-counted records and silently dropped one.
A Fetcher that ignored a checksum mismatch and returned corrupted bytes to the downstream processor.
A DAGRecovery path that reconstructed an incorrect parent vertex state after AM restart.

Triage: the JIRA description must contain a deterministic repro that the release manager can run in under five minutes. Without a repro, the issue is not a blocker — it is a "to be investigated" ticket.

2. Correctness regressions

A query that returned correct results in version N-1 returns wrong results in version N. The bar is lower than data loss (the data is still there; the output is wrong) but the triage is the same. A correctness regression that affects a single Hive query path is a blocker.

3. AM crash

Any reproducible InvalidStateTransitonException in master is a blocker during an RC. Operators expect the AM to survive their workload. An AM crash on a Hive-emitted DAG that worked in the previous release blocks the RC even if the DAG itself is "unusual" — the AM must be defensive against its inputs.

4. Security CVE

A demonstrated CVE in a Tez-owned class is a blocker regardless of whether it has been exploited. The disclosure path is security@tez.apache.org first, then the public JIRA only after the fix is ready.

Triage during an RC vote

The RC vote pattern on dev@:

Subject: [VOTE] Release Apache Tez 0.10.4 (RC1)

Hi,

I've prepared the first release candidate for Tez 0.10.4. The artifacts
are at:
  https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/

The git tag is:
  https://github.com/apache/tez/releases/tag/release-0.10.4-rc1

The release notes are:
  CHANGES.txt at the top of the tag.

Please verify the signatures, run the smoke tests, and vote:
  [+1] release this RC
  [0]  no opinion
  [-1] do not release (please explain)

The vote is open for 72 hours.

Your job, as a contributor evaluating the RC:

Verify the artifact:

# Import the release managers' public keys from the project's KEYS file first;
# without them gpg cannot check the signature.
curl -O https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/apache-tez-0.10.4-src.tar.gz
curl -O https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-rc1/apache-tez-0.10.4-src.tar.gz.asc
gpg --verify apache-tez-0.10.4-src.tar.gz.asc apache-tez-0.10.4-src.tar.gz

Build from source:

tar xf apache-tez-0.10.4-src.tar.gz
cd apache-tez-0.10.4-src
mvn clean install -DskipTests -Phadoop28

Run a smoke test:

mvn -pl tez-tests test -Dtest=TestExternalTezServices -Phadoop28

Reply on the vote thread with your evidence.

Vote semantics

Vote	Meaning
`+1 binding`	PMC member endorses release. At least three are required, and they must outnumber the binding -1 votes.
`+1`	Non-PMC endorses. Counts for momentum, not the binding count.
`0`	No opinion. Often used to indicate "I built it, smoke test passed, but I can't speak to my use case."
`-1 binding`	PMC member votes against. Under ASF policy a release vote passes by majority approval and cannot be vetoed, but in practice a well-evidenced -1 usually leads the release manager to cancel the RC.
`-1`	Non-PMC objection. Not binding, but committers will read it.

A -1 vote must include the reason. "Build failed" is not enough; "build failed because X test fails reproducibly on Hadoop 3.x profile, evidence at URL" is.

Walked example — discovering a blocker during RC vote

Symptom: during the 0.10.4 RC1 vote, you run the smoke test and observe a test failure in TestShuffleManager#testReadErrorReportDebounce that did not happen in 0.10.3.

Step 1 — Reproduce

cd apache-tez-0.10.4-src
for i in 1 2 3; do
  mvn -pl tez-runtime-library test \
    -Dtest=TestShuffleManager#testReadErrorReportDebounce -q 2>&1 | tail -5
done

If the failure is 3/3, it is reproducible. If it fails only intermittently (1/3 or 2/3), treat it as a flake (Stage 9 issue, not a blocker) until you can make it deterministic.

Step 2 — Identify the cause

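cd ~/tez-src   # the git clone; the extracted source tarball has no history
# Tag names below are illustrative; list the real ones with: git tag -l '*0.10*'
git log v0.10.3..release-0.10.4-rc1 -- \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle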

You see a commit that changed the debounce window default from 5000ms to 500ms. The test was written against 5000ms; the change silently broke it.

Step 3 — Decide blocker vs not

A failing unit test in an RC is not automatically a blocker. The question is: does the underlying behaviour change affect production?

If the default change is intentional and the test should be updated → not a blocker. Fix the test in 0.10.4 hotfix or 0.10.5.
If the default change is unintentional or it breaks production users → blocker. RC1 must be cancelled; RC2 reverts the default change.

For this example, suppose the default change was intentional but the release notes don't mention it. The behaviour change is operator-visible (fetch-failure reports now arrive 10x more often, may overwhelm the AM event queue). That makes it a blocker for a different reason than the test failure: an undocumented behaviour change.

Step 4 — Vote and document

Subject: Re: [VOTE] Release Apache Tez 0.10.4 (RC1)

[-1] non-binding

While building the RC and running the smoke tests, I observed:
  TestShuffleManager#testReadErrorReportDebounce fails 3/3 runs.

Root cause: commit <hash> changed the default of
tez.runtime.shuffle.fetch-failure.report.cooldown-ms from 5000 to 500.
This is operator-visible behaviour change not noted in CHANGES.txt.

Recommendation: either revert the default in RC2 with the new default
deferred to 0.11.0, or keep the new default and update CHANGES.txt to
flag the operator impact and update the test.

Filed TEZ-XXXX with the analysis.

The release manager will respond, either by cancelling RC1 and cutting an RC2 that fixes the issue (rebuild, vote again) or by arguing why the change is acceptable.

Release notes

The Tez release notes live in CHANGES.txt at the repo root, organised by release. The format:

Release 0.10.4 - 2026-XX-XX

  NEW FEATURES:
    TEZ-XXXX. Sharded AsyncDispatcher for high-fanout DAGs. (you)

  IMPROVEMENTS:
    TEZ-YYYY. Make DAGPlan size limit configurable. (you)

  BUG FIXES:
    TEZ-ZZZZ. Release held containers on AMRM onError. (you)

  INCOMPATIBLE CHANGES:
    TEZ-AAAA. Default of tez.runtime.shuffle.fetch-failure.report.cooldown-ms
              changed from 5000 to 500. Operators of long-running session AMs
              should evaluate AM event-queue capacity. (you)

Every patch that lands during the release cycle gets a line. The release manager assembles the file from the JIRA "Fix Version" field; contributors make the lines short and accurate.

Security CVE pipeline

The path from "I think I found a CVE" to a public release:

Do not file a public JIRA. Email security@tez.apache.org (the private list, monitored by PMC members).
Wait for acknowledgement (typically within 48 hours).
Work with the security responder on a fix privately, in a private branch.
Once the fix is ready, request a CVE ID via the Apache security team (or MITRE via the responder).
Build a release that includes the fix.
Publish the release; then the CVE is disclosed publicly with a JIRA.

The embargo window is typically 30–90 days. Contributors who report through the private channel and respect the embargo are credited in the advisory.

Pitfalls

Don't +1 a release you have not built and smoke-tested. A +1 carries weight; do not give it as a courtesy.
Don't -1 without evidence. A -1 usually stalls or cancels the RC; the bar for evidence is high.
Don't escalate a Stage 9 flake to a blocker. Reproduce three times before voting.
Don't disclose a security vulnerability publicly before the embargo expires. Apache projects take this very seriously; a leak can lose you committer status.
Don't file Priority: Blocker casually. Reserve it for the four categories above. JIRA pollution diminishes the signal.
Don't merge a "must-have" fix during an active RC vote without cancelling the RC first. Mid-vote merges invalidate the artifact and reset the 72-hour clock.
Don't assume the release manager will catch your concern silently. Vote on the thread, even if your vote is only a 0 with a comment.

Exit criteria — there is no next stage

Stage 12 is the final rung of this roadmap. The exit criterion is that you continue — you are now operating as a committer-track contributor. The next steps are not stages but ongoing practices:

Participate in every RC vote with a built artifact and a smoke-test result, even if your vote is only a 0.
Watch the security@ and dev@ lists daily.
Mentor a new contributor through Stages 1–4 every year.
Read every CHANGES.txt diff for every release line you care about.
Send a quarterly note to dev@ on which areas of the codebase you are willing to review, so contributors know where to ask.

If you have walked all twelve stages, you are the Apache Tez committer the project needed when you started reading this book.

Deep Dives: Reading Order

This directory contains 21 deep-dive chapters. They are the reference material behind the Level curriculum. Each chapter is self-contained, but most chapters depend on a handful of earlier ones. Read in the order below the first time through; thereafter use the index as a lookup.

The chapters are grouped by subsystem. For each chapter we list:

Title — the file.
One-line summary — what you should walk away knowing.
Consumed by — which Levels/Labs depend on it.

Group 1 — The DAG Model and the Client

These four chapters define "what is a Tez job" before any execution machinery exists.

#	File	Summary	Consumed by
1	dag-model.md	DAG/Vertex/Edge as immutable plan; `DAGPlan` protobuf; validation rules	Level 1 (all labs); Level 2 lab 2.1
2	logical-physical.md	How the logical DAG becomes a physical execution plan with concrete parallelism	Level 4 lab 4.2; Level 5 lab 5.1
3	tez-client.md	Client-side bring-up: session mode, local resources, AM start, submission RPC	Level 3 lab 3.1; Level 7 lab 7.1
4	dag-client.md	Status polling, kill, error reporting; RPC vs ATS backends	Level 3 lab 3.1; Level 8 lab 8.1

Start here. Without the DAG model in your head, every later chapter feels like trivia.

Group 2 — AM Lifecycle and Dispatch

#	File	Summary	Consumed by
5	dag-app-master.md	AM as YARN application; dispatchers, heartbeats, recovery	Level 3 lab 3.2; Level 8 lab 8.2
6	state-machines.md	Hadoop `StateMachineFactory` API; dispatcher invariants; tests	Level 4 labs 4.1, 4.3, 4.4
7	event-routing.md	The event hierarchy; "events are the only mutation API" rule	Level 4 (all labs)

These chapters explain how the AM mutates state. They must precede the per-entity lifecycle chapters that follow.

Group 3 — Per-Entity Lifecycle

#	File	Summary	Consumed by
8	vertex-lifecycle.md	`VertexImpl` state machine: NEW → SUCCEEDED, plus failure/kill paths	Level 4 lab 4.2
9	task-lifecycle.md	`TaskImpl` state machine; speculation; max-failed-attempts	Level 4 lab 4.3
10	task-attempt-lifecycle.md	`TaskAttemptImpl` state machine; container assignment; termination causes	Level 4 lab 4.4; Level 8 lab 8.2

Read 8, 9, 10 in this order. Each refers backward to events from chapter 7 and state-machine primitives from chapter 6.

Group 4 — Input/Processor/Output

#	File	Summary	Consumed by
11	ipo-abstractions.md	`LogicalInput`/`LogicalOutput`/`Processor`; lifecycle methods; merged inputs	Level 5 lab 5.1; Level 7 lab 7.1
12	tez-runtime.md	`TezTaskRunner2`, `LogicalIOProcessorRuntimeTask`, the umbilical	Level 5 lab 5.1

These chapters cover the code in tez-runtime-internals and tez-runtime-library, which runs inside the task JVM.

Group 5 — Shuffle, Sort, and Counters

#	File	Summary	Consumed by
13	shuffle-sort.md	Sorter implementations, `IFile`, `ShuffleManager`, `Fetcher`, `MergeManager`	Level 5 labs 5.2, 5.3
14	counters-diagnostics.md	`TezCounters`, framework counters, custom counters, ATS publication	Level 8 lab 8.1

If you skip 13, do not attempt to debug shuffle issues in production. Read it in full before opening a fetcher-related JIRA.

Group 6 — Scheduling and Resources

#	File	Summary	Consumed by
15	scheduler.md	`TaskSchedulerManager`, `YarnTaskSchedulerService`, AMRM heartbeats	Level 6 lab 6.2
16	container-reuse.md	`AMContainerImpl` lifecycle; reuse policy; idle timeouts	Level 6 labs 6.1, 6.2
17	yarn-integration.md	YARN tokens, AMRM client, app master failover, log aggregation	Level 6 lab 6.2

Group 7 — Modes and Integrations

#	File	Summary	Consumed by
18	local-mode.md	`LocalContainerLauncher`, debugging without YARN	Level 2 labs
19	hive-integration.md	Hive `TezTask`, edge usage, `DynamicPartitionPruning`, ATS spans	Level 7 (Hive labs h1–h6)

Group 8 — Failure, Recovery, and Testing

#	File	Summary	Consumed by
20	failure-handling.md	Task retry, vertex rerun, AM restart, recovery records	Level 8 lab 8.2
21	testing-framework.md	`MiniTezCluster`, `MockContainerLauncher`, `DrainDispatcher`, fault injection	Level 2 labs; Level 4 labs

A note on order vs index

The deep-dives double as an index: they exist to be looked up later. The first read should follow the group order above. When you return to fix a bug, jump directly to the most relevant chapter and use the cross-references inside it.

Every chapter ends with a "Validation: prove you understand this" section. Treat that as the gate before declaring the chapter "read."

DAG Model

A Tez DAG is an immutable plan for a distributed computation. This chapter describes the model classes (DAG, Vertex, Edge, EdgeProperty, DataSourceDescriptor, DataSinkDescriptor, *Descriptor), the protobuf representation that crosses the wire, and the validation rules that turn a "DAG you wrote" into a "DAG the AM will accept."

After this chapter you should be able to write a small DAG by hand, predict which EdgeManager implementation will be picked for each edge, and find any classification rule in the source.

The classes you actually call from a client

All of these live in tez-api:

tez-api/src/main/java/org/apache/tez/dag/api/
  DAG.java
  Vertex.java
  Edge.java
  EdgeProperty.java
  InputDescriptor.java
  OutputDescriptor.java
  ProcessorDescriptor.java
  VertexManagerPluginDescriptor.java
  DataSourceDescriptor.java
  DataSinkDescriptor.java
  EntityDescriptor.java          (base class for all *Descriptors)
  GroupInputEdge.java            (multi-source unioning edge)
  VertexGroup.java               (group of vertices for grouped commits)

Use this command to inspect the API surface:

grep -n "^public " tez-api/src/main/java/org/apache/tez/dag/api/DAG.java | head -40

Every class above is immutable by convention once handed to TezClient. You may mutate via the builder methods (addVertex, addEdge, addDataSource) before submission. After submission the only way to change the plan is via VertexManagerPlugin callbacks (see vertex-lifecycle.md and the Level 4 lab on VertexManager).

EdgeProperty — three orthogonal axes

EdgeProperty.create(DataMovementType, DataSourceType, SchedulingType, OutputDescriptor, InputDescriptor) is the single most important factory method in the API.

grep -n "enum " tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java

The three enums:

Enum	Values	What it controls
`DataMovementType`	`ONE_TO_ONE`, `BROADCAST`, `SCATTER_GATHER`, `CUSTOM`	How outputs are routed from src to dst tasks
`DataSourceType`	`PERSISTED`, `PERSISTED_RELIABLE`, `EPHEMERAL`	Durability of intermediate data
`SchedulingType`	`SEQUENTIAL`, `CONCURRENT`	Whether dst tasks must wait for src to finish

Edge type matrix (movement × scheduling)

Movement	Scheduling	Typical use	EdgeManager impl
SCATTER_GATHER	SEQUENTIAL	Map → Reduce shuffle	`ScatterGatherEdgeManager` (the AM-internal default)
ONE_TO_ONE	SEQUENTIAL	Sorted reducer → re-sorter (rare)	`OneToOneEdgeManager`
BROADCAST	SEQUENTIAL	Small-side join broadcast	`BroadcastEdgeManager`
CUSTOM	SEQUENTIAL	Hive cartesian product, custom partitioner	User-supplied `EdgeManagerPlugin`
BROADCAST	CONCURRENT	Streaming push between long-running tasks	`BroadcastEdgeManager`
SCATTER_GATHER	CONCURRENT	(Unusual — generally invalid for shuffles)	—
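
One way to internalize the matrix is to model it in a few lines of plain Java. The sketch below is a stand-alone illustration, not Tez code: `EdgeAxesSketch` and `builtInManager` are hypothetical names, the manager names are the ones this chapter lists under tez-dag, and keying the lookup on the movement axis alone is an assumption you should check against the AM-side Edge class.

```java
import java.util.Optional;

/** Stand-alone model of the edge-type matrix. These are NOT the real Tez classes. */
final class EdgeAxesSketch {
  enum DataMovementType { ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM }
  enum DataSourceType { PERSISTED, PERSISTED_RELIABLE, EPHEMERAL }
  enum SchedulingType { SEQUENTIAL, CONCURRENT }

  final DataMovementType movement;
  final DataSourceType source;
  final SchedulingType scheduling;

  EdgeAxesSketch(DataMovementType movement, DataSourceType source, SchedulingType scheduling) {
    this.movement = movement;
    this.source = source;
    this.scheduling = scheduling;
  }

  /**
   * Name of the built-in manager for this edge, or empty for CUSTOM, where the
   * user must supply an EdgeManagerPlugin. Assumption: only the movement axis matters.
   */
  Optional<String> builtInManager() {
    switch (movement) {
      case ONE_TO_ONE:     return Optional.of("OneToOneEdgeManager");
      case BROADCAST:      return Optional.of("BroadcastEdgeManager");
      case SCATTER_GATHER: return Optional.of("ScatterGatherEdgeManager");
      default:             return Optional.empty();   // CUSTOM
    }
  }

  public static void main(String[] args) {
    EdgeAxesSketch shuffle = new EdgeAxesSketch(
        DataMovementType.SCATTER_GATHER, DataSourceType.PERSISTED, SchedulingType.SEQUENTIAL);
    System.out.println(shuffle.builtInManager().orElse("user-supplied plugin"));
  }
}
```

Keeping the three axes as separate fields mirrors the point of the section: durability and scheduling can vary without changing how data is routed.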

Locate the actual EdgeManager implementations:

find tez-dag/src/main/java -name "*EdgeManager*"

Key files (exact names vary slightly by branch):

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/
  OneToOneEdgeManagerOnDemand.java
  ScatterGatherEdgeManager.java
  BroadcastEdgeManager.java

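Read the AM-side Edge.java (the tez-dag class, not the tez-api Edge, which only records the two vertices and their EdgeProperty) to see how it wires the right manager based on EdgeProperty:

grep -n "EdgeManager\|edgeManager\|createEdgeManager" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/Edge.java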

DataSourceDescriptor vs Input

Beginners frequently confuse these two:

Concept	Class	Defined in	Lives during
Plan-time root-input definition	`DataSourceDescriptor`	`tez-api`	Client + AM (planning)
Runtime input attached to a task	`Input` (interface)	`tez-api`	Task JVM (execution)

A DataSourceDescriptor describes "how to materialize splits for this vertex": an InputDescriptor, an optional InputInitializerDescriptor, and optional credentials. The AM may run an InputInitializer (e.g., MRInputAMSplitGenerator) to enumerate splits before the vertex starts. The result of that initialization becomes InputDataInformationEvents pushed to tasks (see ipo-abstractions.md and event-routing.md).

At task time the input class is instantiated from the InputDescriptor and called with initialize() → start() → getReader() → close(). The task never sees the DataSourceDescriptor.

The DAGPlan protobuf — the wire format

tez-api/src/main/proto/DAGApiRecords.proto

Inspect:

grep -n "^message " tez-api/src/main/proto/DAGApiRecords.proto

Key messages:

DAGPlan — root: name, vertices, edges, plan-level configs, credentials, ACLs.
VertexPlan — name, processor descriptor, parallelism, location hints, associated edges, root inputs.
EdgePlan — source/dest vertex names, edge properties, edge manager descriptor.
TezEntityDescriptorProto — {class_name, user_payload, history_text} — the serialized form of any *Descriptor.
RootInputLeafOutputProto — the protobuf encoding of DataSourceDescriptor and DataSinkDescriptor.

The conversion from API classes to protobuf happens in:

grep -rn "createDag\|createDAGPlan\|toProtoFormat" tez-api/src/main/java/org/apache/tez/dag/api/ | head

Specifically DAG.createDag(...) and DagTypeConverters (a kitchen-sink class of to/from helpers).

Validation — what `DAG.verify()` checks

grep -n "verify(" \
  tez-api/src/main/java/org/apache/tez/dag/api/DAG.java

DAG.verify(...) with restricted set to true enforces, at minimum:

Name uniqueness — vertex names and DAG name are unique.
No cycles — DFS over the edge graph; throws IllegalStateException ("DAG contains a cycle") if any back-edge is found.
Parallelism rules:
- ONE_TO_ONE edges require source.parallelism == dest.parallelism if both are statically set.
- Vertices with BROADCAST outputs must have a finite parallelism (since each downstream task receives every output).
Descriptor non-null for required slots (Processor, Output for vertices that produce, Input for vertices that consume).
No "dangling" data sources — every root input is on a real vertex.
VertexManagerPlugin specified explicitly for vertices that need dynamic reconfig (else a default is chosen — see vertex-lifecycle.md for the default rules).

Read the body of verify(...) line-by-line; the comments cite the JIRA that added each check.
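
Before reading the real method, it can help to see the cycle check in isolation. The following is a minimal sketch of a three-color depth-first search over vertex names, assuming the DAG is given as an adjacency map. `CycleCheckSketch` and `verifyAcyclic` are hypothetical names, and appending the offending path to the message is an illustrative choice, not a claim about what DAG.verify() prints.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone cycle check over vertex names; a model, not the code in DAG.java. */
final class CycleCheckSketch {
  private enum Color { WHITE, GRAY, BLACK }   // unvisited, on the current path, finished

  /** edges maps a source vertex name to its destination vertex names. */
  static void verifyAcyclic(Map<String, List<String>> edges) {
    Map<String, Color> color = new HashMap<>();
    for (String vertex : edges.keySet()) {
      if (color.getOrDefault(vertex, Color.WHITE) == Color.WHITE) {
        visit(vertex, edges, color, new ArrayDeque<String>());
      }
    }
  }

  private static void visit(String vertex, Map<String, List<String>> edges,
                            Map<String, Color> color, Deque<String> path) {
    color.put(vertex, Color.GRAY);
    path.addLast(vertex);
    for (String next : edges.getOrDefault(vertex, Collections.<String>emptyList())) {
      Color c = color.getOrDefault(next, Color.WHITE);
      if (c == Color.GRAY) {
        // An edge back into the current DFS path is a back-edge, hence a cycle.
        throw new IllegalStateException("DAG contains a cycle: " + path + " -> " + next);
      }
      if (c == Color.WHITE) {
        visit(next, edges, color, path);
      }
    }
    path.removeLast();
    color.put(vertex, Color.BLACK);
  }

  public static void main(String[] args) {
    Map<String, List<String>> edges = new HashMap<>();
    edges.put("map", Collections.singletonList("reduce"));
    verifyAcyclic(edges);   // passes silently
    edges.put("reduce", Collections.singletonList("map"));
    try {
      verifyAcyclic(edges);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Compare the bookkeeping here with what you find in DAG.java: the real implementation may use a different traversal, but it has to answer the same question of whether any edge points back into the path currently being explored.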

How a DAG becomes a plan, end-to-end

flowchart LR
    A[User code: new DAG] --> B[addVertex/addEdge/addDataSource]
    B --> C[TezClient.submitDAG]
    C --> D[DAG.verify]
    D -->|ok| E[DAG.createDag -> DAGPlan proto]
    E --> F[RPC DAGClientAMProtocol.submitDAG]
    F --> G[DAGAppMaster: DAGImpl init]
    G --> H[VertexImpl per VertexPlan]
    H --> I[Edge per EdgePlan; EdgeManager selected]

Each arrow has a citation:

verify: DAG.verify(...).
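createDag: DAG.createDag(Configuration, Credentials, Map<String,LocalResource>, LocalResource binaryConfig, boolean tezLrsAsArchive, ...); overloads and trailing parameters vary by version, so confirm the signature on your branch.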
AM-side: DAGImpl.init() and VertexImpl.constructInputDescriptors(), Edge.<init> (in tez-dag, not the tez-api Edge).

Reading exercise

# Top-level surface
sed -n '1,80p' tez-api/src/main/java/org/apache/tez/dag/api/DAG.java
sed -n '1,80p' tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java
sed -n '1,80p' tez-api/src/main/java/org/apache/tez/dag/api/Edge.java

# All the places where DAGPlan is constructed
grep -rn "DAGPlan.newBuilder" tez-api/src/main/java | head

# Cycle detection
grep -n "cycle\|cycleFound\|visit" \
  tez-api/src/main/java/org/apache/tez/dag/api/DAG.java

Answer:

What exception class does DAG.verify() throw on a cycle, and what does its message contain that helps a user diagnose the offending vertex?
Which method on Vertex is used to attach a DataSourceDescriptor? Which to attach a DataSinkDescriptor?
What is the role of DagTypeConverters and why is it preferred over each class owning its own toProto/fromProto methods?
When you call Edge.create(srcV, dstV, EdgeProperty.create(...)), where is the resulting Edge registered? On the source vertex? Destination? The DAG itself?
Suppose you call dag.addVertex(v) twice with the same v instance. What happens, and where in DAG.java is the protection?
What is the difference between DataSourceType.PERSISTED and DataSourceType.PERSISTED_RELIABLE? Find the consumer (search tez-dag for uses of DataSourceType).

Common bugs and symptoms

Symptom	Root cause	Where to look
`IllegalStateException: DAG contains a cycle` at submission	Accidentally added a back-edge	`DAG.verify`
Vertex starts with parallelism `-1` and never runs	`setParallelism(-1)` and no `VertexManagerPlugin` to reconfigure	`VertexImpl.initialize`; check for "parallelism not set"
Job hangs with the root vertex stuck in `INITIALIZING`	A `DataSourceDescriptor` has an initializer that never emits events	Search AM log for `InputInitializerEvent`; cross-reference initializer impl
`ClassNotFoundException` at task start for your `Processor`	The class is on the client classpath but was not shipped to task containers as a local resource	`Vertex.addTaskLocalFiles` (or `DAG.addTaskLocalFiles`) not called for that jar; see tez-client.md
`EdgeManager` mismatch between sides — task hangs reading	Custom `EdgeManagerPlugin` returns inconsistent partition counts	Always run `TestEdgeManagerSelf` on your plugin
`DAGPlan` proto exceeds 64 MB	Encoding huge `userPayload` directly into the plan	Use a side file via `LocalResource`; payload is `byte[]` not free-form storage

Validation: prove you understand this

Write, on a whiteboard, a 4-vertex DAG with two SCATTER_GATHER edges and one BROADCAST edge. Annotate each edge with its three EdgeProperty enums. Justify each choice.
Given an edge with (SCATTER_GATHER, PERSISTED, SEQUENTIAL), name the EdgeManager class that will be selected at runtime and the source file where the selection logic lives.
From memory, list the five required arguments to EdgeProperty.create(...).
Open DAG.verify() and identify the first five checks. For each, propose a one-line DAG that would fail it.
In a new method getAllRootInputs(DAG), walk the DAG and return all DataSourceDescriptor objects across all vertices. Compile it; check against DAG.java's own helpers.

TezClient

TezClient is the client-side API: the class your driver code instantiates to start an AM, submit DAGs, and (optionally) keep the AM alive across DAGs. This chapter walks bring-up, the session vs non-session distinction, local resource staging, RPC submission, and ATS hookup.

After this chapter you should be able to point at every line of code that runs between TezClient.create(...) and the moment a DAG appears inside the AM ready to be start()ed.

Files to open

tez-api/src/main/java/org/apache/tez/client/
  TezClient.java
  TezClientUtils.java
  TezSessionImpl.java
  FrameworkClient.java
  TezYarnClient.java            (YARN-backed FrameworkClient)
  LocalClient.java              (in-process FrameworkClient for local mode)

Plus the client-to-AM protocol definition:

tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClient.java
tez-api/src/main/proto/DAGClientAMProtocol.proto

Two modes: session and non-session

The mode is chosen at TezClient.create(...):

TezClient client = TezClient.create(
    "MyApp",
    tezConf,
    isSession  /* true = session mode */);

Property	Non-session	Session
AM lifetime	Per DAG	Across many DAGs
`start()` semantics	No-op (AM launched at `submitDAG`)	Launches AM and waits for it to register
Allowed DAGs in flight	1	1 (sequential within a session by default)
Keep-alive	n/a	`tez.session.am.dag.submit.timeout.secs`
Use case	One-shot jobs (CLI tools, scheduled batch)	Latency-sensitive (Hive, Pig, interactive)

The AM keep-alive timer is critical. In session mode, after a DAG completes the AM waits for the configured timeout for a new DAG. If none arrives, it shuts down to free YARN resources. Find the timer:

grep -n "AMSessionDAGSubmitTimeout\|dag.submit.timeout" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
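
The keep-alive rule described above can be modeled as a small idle timer. This is a sketch under stated assumptions, not the AM's code: `SessionIdleTimerSketch` and its methods are hypothetical, the clock is passed in explicitly so the behavior is easy to test, and treating the period before the first DAG the same as the period between DAGs is an assumption to verify in DAGAppMaster.

```java
/** Stand-alone model of a session AM's idle timeout; not the real DAGAppMaster timer. */
final class SessionIdleTimerSketch {
  private final long timeoutMillis;
  private long idleSinceMillis;   // when the AM last became idle
  private boolean dagRunning;

  /** timeoutSecs plays the role of tez.session.am.dag.submit.timeout.secs. */
  SessionIdleTimerSketch(long timeoutSecs, long nowMillis) {
    this.timeoutMillis = timeoutSecs * 1000L;
    this.idleSinceMillis = nowMillis;
  }

  void onDagSubmitted() {
    dagRunning = true;
  }

  void onDagFinished(long nowMillis) {
    dagRunning = false;
    idleSinceMillis = nowMillis;   // the idle window restarts after every DAG
  }

  /** True once the AM has been idle for at least the configured timeout. */
  boolean shouldShutDown(long nowMillis) {
    return !dagRunning && nowMillis - idleSinceMillis >= timeoutMillis;
  }

  public static void main(String[] args) {
    SessionIdleTimerSketch timer = new SessionIdleTimerSketch(300, 0L);
    timer.onDagSubmitted();
    timer.onDagFinished(60_000L);
    System.out.println(timer.shouldShutDown(120_000L));   // still inside the idle window
    System.out.println(timer.shouldShutDown(360_000L));   // idle window elapsed
  }
}
```

The design point to carry into the real code is that the timer only counts while no DAG is running, so a long DAG never triggers the session timeout by itself.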

Bring-up control flow

sequenceDiagram
    participant U as User code
    participant TC as TezClient
    participant TCU as TezClientUtils
    participant YC as TezYarnClient
    participant RM as YARN RM
    participant AM as DAGAppMaster

    U->>TC: TezClient.create(name, conf, isSession)
    U->>TC: addAppMasterLocalFiles(map)
    U->>TC: start()
    TC->>TCU: createApplicationSubmissionContext(...)
    TCU->>TCU: stage local resources to HDFS
    TCU->>TCU: build classpath & env
    TC->>YC: submitApplication(appSubmissionContext)
    YC->>RM: submitApplication
    RM-->>YC: appId
    Note over RM,AM: RM launches AM container
    AM->>AM: serviceInit, serviceStart
    AM-->>TC: AM registers via heartbeat; TC sees RUNNING
    U->>TC: submitDAG(dag)
    TC->>AM: DAGClientAMProtocol.submitDAG(rpcCall)
    AM-->>TC: dagId
    TC-->>U: DAGClient

Where each call lives:

TezClient.start() → TezClientUtils.createFinalConfProtoForApp() → TezClientUtils.createApplicationSubmissionContext() → frameworkClient.submitApplication(...).
TezClient.submitDAG(dag) → getSessionAMProxy() → dagAMProtocol.submitDAG(submitRequest) (the RPC proxy to the AM; this is the session-mode path).

grep -n "submitApplication\|submitDAG\|dagAMProtocol" \
  tez-api/src/main/java/org/apache/tez/client/TezClient.java

Local resources that `TezClientUtils` uploads

A YARN container starts with a clean working directory plus whatever local resources the AM submission context declares. For Tez, that includes:

Tez framework tarball — pointed to by tez.lib.uris (or a local jar list). Contains tez-api.jar, tez-dag.jar, tez-runtime-*.jar, etc.
User application jars — anything you added via TezClient.addAppMasterLocalFiles(Map<String, LocalResource>) plus addTaskLocalFiles.
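The DAGPlan: in session mode it is not a local resource; it is sent via the submitDAG RPC payload. In non-session mode, TezClientUtils serializes it to a plan file that is staged as an AM local resource, which is the "inline DAG plan" the AM handles at startup.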

Inspect:

grep -n "tez.lib.uris\|TezConfiguration.TEZ_LIB_URIS\|addAppMasterLocalFiles" \
  tez-api/src/main/java/org/apache/tez/client/TezClient.java \
  tez-api/src/main/java/org/apache/tez/client/TezClientUtils.java

The AMRM token is delivered by YARN when the container starts; Tez does not manage it directly.

The submission RPC

The protocol is defined in:

tez-api/src/main/proto/DAGClientAMProtocol.proto

grep -n "rpc " tez-api/src/main/proto/DAGClientAMProtocol.proto

Key RPCs:

RPC	What it does
`submitDAG`	Submit a new DAG to a running AM
`getDAGStatus`	Poll status (also used by `DAGClient`)
`getVertexStatus`	Poll a specific vertex
`tryKillDAG`	Initiate kill
`shutdownSession`	Stop the AM in session mode

The RPC server lives in the AM (DAGClientHandler and its Protobuf implementation):

grep -rn "DAGClientAMProtocol\|submitDAG" \
  tez-dag/src/main/java/org/apache/tez/dag/api/client/ 2>/dev/null | head

ATS / Timeline Service integration

When tez.history.logging.service.class is set to ATSHistoryLoggingService (the default in many distros), TezClient does not publish events itself — the AM does, via the HistoryEventHandler. However, TezClient does:

Propagate tez.history.logging.service.class to the AM through the configuration shipped with the submission context.
Provide ATS credentials in the application submission context.

Read:

grep -rn "ATSHistoryLoggingService\|YARN_TIMELINE_SERVICE" \
  tez-api/src/main/java/org/apache/tez/client/

For the AM-side, see counters-diagnostics.md.

`TezSessionImpl` vs `TezClient`

TezSessionImpl belongs to the older session API; modern Tez uses TezClient with isSession=true, and on some branches TezSessionImpl may no longer exist. Where both are present they are largely interchangeable. Check which your branch has:

grep -n "class TezClient\|class TezSessionImpl" \
  tez-api/src/main/java/org/apache/tez/client/*.java

Reading exercise

sed -n '1,120p' tez-api/src/main/java/org/apache/tez/client/TezClient.java
grep -n "submitDAG\b" tez-api/src/main/java/org/apache/tez/client/TezClient.java
grep -n "stopSession\|stop\|close" \
  tez-api/src/main/java/org/apache/tez/client/TezClient.java
grep -rn "submitApplication" tez-api/src/main/java/org/apache/tez/client/

Answer:

What is the difference between TezClient.stop() in session vs non-session mode?
When TezClient.submitDAG() is called for a DAG that conflicts with one currently running in the session, what happens?
Find the timeout used while waiting for the AM to reach RUNNING after start(). Which config key controls it?
What pre-condition does submitDAG enforce on the DAG's vertex names with respect to previously-submitted DAGs in the same session?
Trace addAppMasterLocalFiles(...) end-to-end. Where do those files end up on HDFS?
Why is tez.lib.uris sometimes a directory and sometimes a tarball? What does TezClientUtils.setupTezJarsLocalResources do for each case?

Common bugs and symptoms

Symptom	Root cause	Fix
AM never reaches RUNNING; client hangs in `start()`	`tez.lib.uris` points to a path the NodeManager can't read	Verify HDFS perms; check NM logs
`submitDAG` throws `SessionNotRunning`	AM died (idle timeout, crash)	Catch, recreate `TezClient`, resubmit
`submitDAG` blocks forever	Previous DAG still in flight in the session	Don't reuse session for parallel DAGs; or wait
`IOException: Failed to submit application`	RM rejected (queue full, ACL)	Inspect RM logs; verify queue config
Client cannot reach the AM after it starts	Firewall or NAT between the client host and the AM's client RPC port	Open or pin the AM port range; `DAGClient` polls the AM and receives no callbacks
Tasks fail with `ClassNotFoundException` for user code	`addTaskLocalFiles` not called for that jar	Add jars via both `addAppMasterLocalFiles` and `addTaskLocalFiles` if used in tasks

Validation: prove you understand this

Write a 30-line Java driver that creates a TezClient in session mode, submits two DAGs back-to-back, prints both DAGClient.getDAGStatus() results, and shuts down cleanly.
From TezClient.java, list every method that ultimately reaches dagAMProtocol.
Explain why addAppMasterLocalFiles is a Map<String, LocalResource> and not a List<Path>.
From the proto file DAGClientAMProtocol.proto, write the exact request message used by submitDAG.
Reproduce the "AM idle timeout" path on MiniTezCluster: submit one DAG, wait past the configured timeout, attempt a second submit, observe the exception class and message.

DAGClient

DAGClient is the read-only client-side handle to a submitted DAG. It is returned by TezClient.submitDAG(...) and lives until the DAG completes (or the user kills it). This chapter covers status polling, the StatusGetOpts flag, the RPC vs ATS backends, error reporting, and the contract DAGClient exposes to callers like Hive, Pig, and CLI drivers.

After this chapter you should know which backend a given DAGClient instance is using, what fields will be populated, and which calls block vs poll.

Files to open

tez-api/src/main/java/org/apache/tez/dag/api/client/
  DAGClient.java                       (abstract base)
  DAGStatus.java                       (the snapshot type)
  VertexStatus.java
  Progress.java
  StatusGetOpts.java                   (enum: GET_COUNTERS, GET_MEMORY_USAGE)

  rpc/
    DAGClientRPCImpl.java              (talks to the AM via DAGClientAMProtocol)
    DAGClientImplLocal.java            (in-process; for LocalClient)

  registry/                            (service discovery if applicable)

ATS-backed variant:

tez-plugins/tez-yarn-timeline-history-with-fs/  or
tez-plugins/tez-yarn-timeline-history/
  src/main/java/org/apache/tez/dag/api/client/DAGClientTimelineImpl.java

(Module names vary across versions; locate with find . -name "DAGClientTimelineImpl.java".)

Core API

// Abridged from DAGClient.java; confirm exact signatures and annotations on your branch.
public abstract class DAGClient implements Closeable {
  public abstract String getExecutionContext();
  public abstract DAGStatus getDAGStatus(Set<StatusGetOpts> opts)
      throws IOException, TezException;
  public abstract DAGStatus getDAGStatus(Set<StatusGetOpts> opts, long timeoutMillis)
      throws IOException, TezException;
  public abstract VertexStatus getVertexStatus(String vertexName, Set<StatusGetOpts> opts)
      throws IOException, TezException;
  public abstract DAGStatus waitForCompletion()
      throws IOException, TezException, InterruptedException;
  public abstract DAGStatus waitForCompletionWithStatusUpdates(Set<StatusGetOpts> opts)
      throws IOException, TezException, InterruptedException;
  public abstract void tryKillDAG() throws IOException, TezException;
  // ...
}

grep -n "public abstract\|public " \
  tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClient.java

`DAGStatus` — what callers actually consume

grep -n "public " tez-api/src/main/java/org/apache/tez/dag/api/client/DAGStatus.java

Fields you'll see in production triage:

Field	Populated by	Notes
`state` (`DAGStatus.State`)	Always	`SUBMITTED`/`INITING`/`RUNNING`/`SUCCEEDED`/`FAILED`/`KILLED`/`ERROR`
`progress`	RPC backend; ATS backend may lag	`Progress` per vertex + aggregate
`diagnostics`	On terminal states	Newline-joined messages
`counters`	Only if `StatusGetOpts.GET_COUNTERS` passed	Expensive over RPC
`memoryUsage`	Only if `StatusGetOpts.GET_MEMORY_USAGE` passed	Aggregated across containers

Note: state is not the same as VertexStatus.State. Vertex states are richer (INITED, RUNNING, COMMITTING, SUCCEEDED, etc.) — see vertex-lifecycle.md. DAG state is a roll-up.

RPC backend: `DAGClientRPCImpl`

grep -n "DAGClientAMProtocol\|proxy" \
  tez-api/src/main/java/org/apache/tez/dag/api/client/rpc/DAGClientRPCImpl.java

Behavior:

Each getDAGStatus(opts) is a synchronous RPC to the AM.
Default timeout per call is governed by tez.dag.am.client.am-connect-timeout-secs.
If GET_COUNTERS is set, the AM serializes the entire TezCounters tree (potentially MBs); avoid in tight loops.
waitForCompletion() is implemented as a polling loop with backoff. Find the loop:

grep -n "waitForCompletion\|sleep\|poll" \
  tez-api/src/main/java/org/apache/tez/dag/api/client/rpc/DAGClientRPCImpl.java
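
To have something concrete to compare the real loop against, here is a minimal sketch of polling with capped exponential backoff. It is generic on purpose: `PollingSketch` and `pollUntil` are hypothetical names, the status call and the terminal-state test are supplied by the caller, and the sleep values are placeholders, not the intervals DAGClientRPCImpl uses.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Stand-alone polling loop with capped backoff; not the loop in DAGClientRPCImpl. */
final class PollingSketch {
  /**
   * Calls statusCall until isTerminal accepts the result, sleeping between calls.
   * The sleep doubles after every non-terminal poll, up to maxSleepMs.
   */
  static <S> S pollUntil(Supplier<S> statusCall, Predicate<S> isTerminal,
                         long initialSleepMs, long maxSleepMs) {
    long sleepMs = initialSleepMs;
    while (true) {
      S status = statusCall.get();
      if (isTerminal.test(status)) {
        return status;
      }
      try {
        Thread.sleep(sleepMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();   // preserve the interrupt for callers
        throw new IllegalStateException("Interrupted while polling", e);
      }
      sleepMs = Math.min(maxSleepMs, sleepMs * 2);
    }
  }

  public static void main(String[] args) {
    // A canned sequence of states stands in for successive status RPCs.
    Iterator<String> states = Arrays.asList("RUNNING", "RUNNING", "SUCCEEDED").iterator();
    String last = PollingSketch.<String>pollUntil(
        states::next, s -> !s.equals("RUNNING"), 10L, 40L);
    System.out.println(last);
  }
}
```

Note that this sketch loops forever if the status never becomes terminal. A production caller would add a deadline.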

ATS backend: `DAGClientTimelineImpl`

When the AM has exited but ATS retains history, status is fetched from the ATS REST API (or RM web UI) instead. This is critical for post-mortem and "why did my job fail" UIs.

Behavior differences from RPC:

Eventually consistent (ATS publication is async; see counters-diagnostics.md).
state is the final state recorded; intermediate states between two ATS events are invisible.
Counters are available if ATSHistoryLoggingService was active and the event made it past the publisher queue.

Search for the fallback path that picks ATS when RPC fails:

grep -rn "DAGClientTimelineImpl\|getDAGAndAMURL\|RPCFailed\|amProxyFailed" \
  tez-api/src/main/java/org/apache/tez/dag/api/client/ \
  tez-api/src/main/java/org/apache/tez/client/

`tryKillDAG()` — the only mutation

DAGClient is otherwise read-only; its one mutating method is tryKillDAG. The call asks the AM to start the kill path and returns without waiting for the DAG to die.

grep -n "tryKillDAG\|killDAG" \
  tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClient.java \
  tez-api/src/main/java/org/apache/tez/dag/api/client/rpc/DAGClientRPCImpl.java

To wait for the kill to take effect:

client.tryKillDAG();
DAGStatus status = client.waitForCompletion();
// status.getState() will be KILLED, or whichever terminal state the DAG reached first

Status populate flow

sequenceDiagram
    participant U as User code
    participant DC as DAGClientRPCImpl
    participant AM as DAGAppMaster
    participant DH as DAGClientHandler
    participant DI as DAGImpl

    U->>DC: getDAGStatus(opts)
    DC->>AM: RPC: getDAGStatus(dagId, opts)
    AM->>DH: dispatch
    DH->>DI: dagImpl.getDAGStatus(opts)
    DI-->>DH: DAGStatusProto
    DH-->>AM: response
    AM-->>DC: response bytes
    DC-->>U: DAGStatus

The conversion DAGImpl → DAGStatusProto happens in DAGImpl.getDAGStatus() (in tez-dag). For GET_COUNTERS, the AM walks the counter aggregation tree — expensive.

Reading exercise

# Surface
sed -n '1,80p' tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClient.java

# State enum
grep -n "public enum State\b" \
  tez-api/src/main/java/org/apache/tez/dag/api/client/DAGStatus.java

# RPC polling loop
grep -n "waitForCompletion\|backoff\|sleep" \
  tez-api/src/main/java/org/apache/tez/dag/api/client/rpc/DAGClientRPCImpl.java

Answer:

What is the difference between waitForCompletion() and waitForCompletionWithStatusUpdates(opts)?
What happens if GET_COUNTERS is requested but the DAG is still INITING?
List the exact DAGStatus.State enum values and the terminal subset.
From the polling loop, what is the maximum sleep between polls?
When tryKillDAG() is called after the DAG already finished, what does the RPC return? Is it an error?
In DAGClientTimelineImpl, how is the "I don't see a SUCCEEDED event yet" case distinguished from "the DAG is still running"?

Common bugs and symptoms

Symptom	Root cause	Fix
`waitForCompletion()` returns `RUNNING` forever	AM crashed, RPC keeps timing out	Add timeout; check AM log; fall back to ATS
Counters are stale by ~30s	AM aggregation interval	`tez.am.aggregate.counters.interval-secs`
`tryKillDAG()` returns immediately but DAG keeps running for minutes	Kill is async; tasks must drain	Always follow with `waitForCompletion`
Hive sees `DAGStatus.State=ERROR` with no diagnostics	AM crashed before publishing	Check NM container log for the AM
ATS-backed status missing for a recently completed DAG	ATS publisher queue backed up	Wait; or query ATS REST directly
Inconsistent state between RPC and ATS for same DAG	Race during AM shutdown; ATS publishes after final RPC	Trust RPC while AM lives, ATS after

Validation: prove you understand this

Write a 20-line program that polls getDAGStatus(EnumSet.of(StatusGetOpts.GET_COUNTERS)) once a second and prints the FILE_BYTES_WRITTEN counter from each snapshot.
List the StatusGetOpts enum values (check the source; there may be fewer or more than the two named under "Files to open") and what each adds to the payload.
From DAGClient.java, draw the inheritance/factory diagram for how a DAGClient instance is actually constructed (look at TezClient.submitDAG to see which subclass is returned in YARN vs local mode).
Force the RPC backend to fail and confirm whether (or not) Tez falls back to the ATS backend automatically. Cite the line that performs the fallback.
Explain why DAGStatus is a snapshot rather than an observable.

DAGAppMaster

DAGAppMaster is Tez's YARN ApplicationMaster: a single JVM, launched in a container at the ResourceManager's request, that owns one or more DAGs over its lifetime. This chapter describes its bring-up, its dispatcher topology, its YARN-facing heartbeats, and the recovery service that lets it restart after a crash.

After this chapter you should be able to map any AM log line in the first 60 seconds of operation to a method in DAGAppMaster.java.

Files to open

tez-dag/src/main/java/org/apache/tez/dag/app/
  DAGAppMaster.java                          (the AM main class)
  TaskCommunicatorManager.java               (task umbilical multiplexer)
  ContainerHeartbeatHandler.java             (container liveness)
  rm/
    TaskSchedulerManager.java                (one per scheduler instance)
    YarnTaskSchedulerService.java            (the default scheduler impl)
    container/
      AMContainerImpl.java                   (container state machine)
  launcher/
    ContainerLauncherManager.java
    DagContainerLauncher.java                (varies by version)
    LocalContainerLauncher.java              (in-process)
  recovery/
    RecoveryService.java                     (event log; restart path)
  dag/impl/
    DAGImpl.java
    VertexImpl.java
    TaskImpl.java
    TaskAttemptImpl.java

Bring-up: `serviceInit` and `serviceStart`

DAGAppMaster extends AbstractService. YARN starts it with a main; control flows:

main()
  -> DAGAppMaster.create / new DAGAppMaster(...)
  -> init(conf)
       -> serviceInit(conf)
          - parse appAttemptId
          - load credentials
          - construct AsyncDispatcher
          - construct + register child services: TaskSchedulerManager,
              ContainerLauncherManager, TaskCommunicatorManager,
              RecoveryService (if enabled), HistoryEventHandler, ATSHook
          - register event handlers on the dispatcher
  -> start()
       -> serviceStart()
          - start child services (they each start their own threads)
          - if not session mode: handle the inline DAG plan
          - if session mode: enter idle loop, wait for submitDAG RPC

Inspect the boundaries:

grep -n "serviceInit\|serviceStart\|serviceStop" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java

The AsyncDispatcher and registered handlers

DAGAppMaster builds one AsyncDispatcher (originally the hadoop-yarn-common class; recent branches use Tez's own copy in tez-common) and registers a handler per event type. The contract is:

All handlers run on the same single dispatch thread, one event at a time.
Handlers must be fast (no blocking I/O); they should mutate state and emit follow-on events.

Find the registrations:

grep -n "dispatcher.register\|register(.*\.class" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java

Typical registrations (names approximate by version):

Event type	Handler class	Owned subsystem
`DAGEventType`	`DAGEventDispatcher` (forwards to `DAGImpl.handle`)	DAG lifecycle
`VertexEventType`	`VertexEventDispatcher` (forwards to `VertexImpl.handle`)	Vertex lifecycle
`TaskEventType`	`TaskEventDispatcher` (forwards to `TaskImpl.handle`)	Task lifecycle
`TaskAttemptEventType`	`TaskAttemptEventDispatcher` (forwards to `TaskAttemptImpl.handle`)	Attempt lifecycle
`AMSchedulerEventType`	`AMSchedulerEventDispatcher` (forwards to `TaskSchedulerManager`)	Scheduling
`AMContainerEventType`	container event dispatcher	Container state
`AMNodeEventType`	node event dispatcher	Node tracking
`ContainerLauncherEventType`	launcher dispatcher	Launch/stop containers
`TaskCommunicatorEventType`	comms dispatcher	Per-launcher umbilical
`HistoryEventType`	history event dispatcher	ATS/log publication
`SpeculatorEventType`	speculator dispatcher	Speculation (if enabled)
`DAGAppMasterEventType`	AM itself	Lifecycle (e.g., shutdown)
`RecoveryEventType`	recovery dispatcher	Recovery log

The handlers themselves are inner classes or top-level dispatchers found in:

grep -rn "extends EventHandler\|implements EventHandler" \
  tez-dag/src/main/java/org/apache/tez/dag/app/ | head -20

Event flow diagram

flowchart TB
    subgraph "Sources of events"
        TC[Task heartbeat]
        SCH[Scheduler callback]
        TL[Container launcher]
        UC[User: submitDAG/killDAG]
        RC[Recovery on restart]
    end
    TC --> D
    SCH --> D
    TL --> D
    UC --> D
    RC --> D
    D[AsyncDispatcher] --> DH[DAGEventDispatcher]
    D --> VH[VertexEventDispatcher]
    D --> TH[TaskEventDispatcher]
    D --> AH[TaskAttemptEventDispatcher]
    D --> SH[AMSchedulerEventDispatcher]
    D --> HH[HistoryEventDispatcher]
    DH --> DI[DAGImpl]
    VH --> VI[VertexImpl]
    TH --> TI[TaskImpl]
    AH --> TAI[TaskAttemptImpl]
    SH --> TSM[TaskSchedulerManager]
    HH --> HEH[HistoryEventHandler]

Everything flows through D. There is no other way to mutate the state of a DAG, vertex, task, or attempt. See event-routing.md.
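
The dispatcher contract is easier to reason about with a toy version in hand. The sketch below is a deliberately small model, assuming one queue, one consumer thread, and handlers keyed by the event type's enum class. `DispatcherSketch`, its nested `Event` and `Handler` interfaces, and `TaskEventType` are hypothetical and much simpler than the real AsyncDispatcher.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Toy single-threaded event dispatcher; a model, not Tez's AsyncDispatcher. */
final class DispatcherSketch {
  interface Event { Enum<?> getType(); }
  interface Handler { void handle(Event event); }
  enum TaskEventType { T_SCHEDULE, T_DONE }   // hypothetical event types

  private final Map<Class<?>, Handler> handlers = new ConcurrentHashMap<>();
  private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
  private final Thread loop = new Thread(this::run, "dispatcher");
  private volatile boolean stopped;

  void register(Class<? extends Enum<?>> eventTypeClass, Handler handler) {
    handlers.put(eventTypeClass, handler);
  }

  void start() { loop.start(); }

  /** Safe to call from any thread; the event is handled later on the dispatch thread. */
  void post(Event event) { queue.add(event); }

  void stop() {
    stopped = true;
    loop.interrupt();
    try {
      loop.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  private void run() {
    while (!stopped) {
      Event event;
      try {
        event = queue.take();
      } catch (InterruptedException e) {
        return;   // stop() was called
      }
      Handler handler = handlers.get(event.getType().getDeclaringClass());
      if (handler == null) {
        throw new IllegalStateException("No handler registered for " + event.getType());
      }
      handler.handle(event);   // a slow handler here stalls every other event
    }
  }

  /** Posts two events and returns what the handler observed, in order. */
  static List<String> demo() {
    final List<String> seen = Collections.synchronizedList(new ArrayList<String>());
    final CountDownLatch done = new CountDownLatch(2);
    DispatcherSketch dispatcher = new DispatcherSketch();
    dispatcher.register(TaskEventType.class, e -> {
      seen.add(e.getType() + "@" + Thread.currentThread().getName());
      done.countDown();
    });
    dispatcher.start();
    dispatcher.post(() -> TaskEventType.T_SCHEDULE);
    dispatcher.post(() -> TaskEventType.T_DONE);
    try {
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    dispatcher.stop();
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Because there is exactly one consumer thread, handlers never race each other and events are handled in posting order. That is also why a single blocking handler backs up the whole queue.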

YARN-facing components

AMRM heartbeat (the resource conversation)

TaskSchedulerManager (and underneath, YarnTaskSchedulerService) maintains an AMRMClient (from YARN). This heartbeats with the RM at a configurable interval (tez.am.am-rm.heartbeat.interval-ms.max) carrying:

ContainerRequests for new tasks.
ContainerReleases for freed containers.
Progress percent (visible in yarn application -status).

Responses contain:

AllocatedContainers (RM granted).
CompletedContainersStatuses (RM tells us a container died).

grep -n "heartbeat\|AMRMClient\|allocate" \
  tez-dag/src/main/java/org/apache/tez/dag/app/rm/YarnTaskSchedulerService.java | head

Container heartbeat (the liveness check)

ContainerHeartbeatHandler tracks the wall time of the last heartbeat() call from each running container's umbilical. If a container goes silent past tez.task.timeout-ms, the AM declares the container unresponsive and kills the attempt.

grep -n "ContainerHeartbeatHandler\|tez.task.timeout" \
  tez-dag/src/main/java/org/apache/tez/dag/app/ContainerHeartbeatHandler.java
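
The liveness check boils down to a map of last-seen timestamps and a periodic sweep. The following is a minimal sketch of that idea, assuming string container ids and an explicit clock. `HeartbeatTrackerSketch` and its method names are hypothetical and do not mirror ContainerHeartbeatHandler's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Stand-alone last-seen tracker; a model of the idea, not ContainerHeartbeatHandler. */
final class HeartbeatTrackerSketch {
  private final long timeoutMs;   // plays the role of the configured timeout
  private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

  HeartbeatTrackerSketch(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  void register(String containerId, long nowMs) {
    lastSeen.put(containerId, nowMs);
  }

  /** Records a heartbeat; ignored for containers that are not being tracked. */
  void pinged(String containerId, long nowMs) {
    lastSeen.computeIfPresent(containerId, (id, previous) -> nowMs);
  }

  void unregister(String containerId) {
    lastSeen.remove(containerId);
  }

  /** Returns the containers silent for longer than the timeout and stops tracking them. */
  List<String> expire(long nowMs) {
    List<String> expired = new ArrayList<>();
    for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
      if (nowMs - entry.getValue() > timeoutMs) {
        expired.add(entry.getKey());
      }
    }
    for (String id : expired) {
      lastSeen.remove(id);
    }
    Collections.sort(expired);   // deterministic order for callers and tests
    return expired;
  }

  public static void main(String[] args) {
    HeartbeatTrackerSketch tracker = new HeartbeatTrackerSketch(1_000L);
    tracker.register("container_1", 0L);
    tracker.register("container_2", 0L);
    tracker.pinged("container_1", 900L);
    System.out.println(tracker.expire(1_500L));   // only the silent container is reported
  }
}
```

In the real AM the sweep runs on its own thread and reports expiry by emitting an event rather than returning a list, so the state change still goes through the dispatcher.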

Task umbilical (the per-task RPC server)

TaskCommunicatorManager runs an in-AM RPC server (the umbilical) that tasks call into for:

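getTask(...): pick up an assigned task.
heartbeat(...): liveness, plus the carrier for task events (status updates, completion, failure, and data-movement events).
canCommit(...): ask the AM whether this attempt may commit its output.

Confirm the exact method set in the interface itself; unlike the MapReduce umbilical, completion and failure travel as events inside heartbeat rather than as separate RPCs.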

The umbilical protocol is TezTaskUmbilicalProtocol:

find . -name "TezTaskUmbilicalProtocol.java"

Recovery: surviving an AM restart

If recovery is enabled (tez.dag.recovery.enabled=true), RecoveryService writes a log of state-changing events to HDFS. If YARN permits another AM attempt (tez.am.max.app.attempts, capped by yarn.resourcemanager.am.max-attempts), a restart gives the AM a new appAttemptId but the same appId, and the new AM:

Reads the recovery log under ${tez.staging-dir}/$appId/recovery/$attemptId/.
Replays events into DAGImpl, VertexImpl, etc., to rebuild in-memory state up to the last durable point.
Resumes execution: completed tasks remain completed, in-flight tasks are relaunched.

grep -rn "RecoveryService\|RecoveryEvent\|replayEvents" \
  tez-dag/src/main/java/org/apache/tez/dag/app/recovery/ | head

Note: recovery is per-DAG, not per-task. A vertex that was RUNNING becomes RUNNING again; tasks that completed stay completed; tasks that were in flight get fresh attempts.
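
The replay rule in the note can be sketched as a fold over the event log. This is a simplified model under stated assumptions: the log contains only two hypothetical event kinds, `TASK_STARTED` and `TASK_FINISHED`, and `RecoveryReplaySketch` is not related to the real RecoveryService classes or their on-disk format.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Stand-alone model of replaying a recovery log; not the real RecoveryService. */
final class RecoveryReplaySketch {
  enum Kind { TASK_STARTED, TASK_FINISHED }   // hypothetical event kinds

  static final class LoggedEvent {
    final Kind kind;
    final String taskId;

    LoggedEvent(Kind kind, String taskId) {
      this.kind = kind;
      this.taskId = taskId;
    }
  }

  final Set<String> completed = new LinkedHashSet<>();    // stay completed after restart
  final Set<String> toRelaunch = new LinkedHashSet<>();   // were in flight; need new attempts

  /** Rebuilds task state by applying the logged events in order. */
  static RecoveryReplaySketch replay(List<LoggedEvent> log) {
    RecoveryReplaySketch state = new RecoveryReplaySketch();
    for (LoggedEvent event : log) {
      if (event.kind == Kind.TASK_STARTED) {
        state.toRelaunch.add(event.taskId);
      } else {
        state.toRelaunch.remove(event.taskId);
        state.completed.add(event.taskId);
      }
    }
    return state;
  }

  public static void main(String[] args) {
    RecoveryReplaySketch state = replay(Arrays.asList(
        new LoggedEvent(Kind.TASK_STARTED, "task_0"),
        new LoggedEvent(Kind.TASK_STARTED, "task_1"),
        new LoggedEvent(Kind.TASK_FINISHED, "task_0")));
    System.out.println("completed=" + state.completed + " relaunch=" + state.toRelaunch);
  }
}
```

The real log carries many more event types, and some state is deliberately kept in memory only, which is what the validation exercise on the RecoveryService writer asks you to find.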

Reading exercise

# Bring-up
sed -n '1,200p' tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
grep -n "serviceInit\|serviceStart" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java

# Handlers
grep -n "register\b" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head -30

# Session vs non-session control
grep -n "isSession\|sessionMode" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head -20

# Recovery hookup
grep -n "RecoveryService\|recoveryEnabled" tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java | head

Answer:

In what order are the child services started in serviceStart? Why does order matter?
List the first three events that flow through the dispatcher when an AM in non-session mode starts.
What thread does DAGImpl.handle(DAGEvent) execute on? Is it the same thread as VertexImpl.handle(VertexEvent)?
Where is the appAttemptId > 1 check that decides whether to start fresh or recover?
What is the difference between DAGAppMaster.shutdown() and DAGAppMaster.serviceStop()?
Find the line that emits the first "DAGAppMaster started" log statement (or its modern equivalent).

Common bugs and symptoms

Symptom	Root cause	Where to look
AM dies immediately with `NPE` in `serviceInit`	Missing or wrong `tez.lib.uris`; jars not found	NM container log; verify HDFS perms
AM sits idle after `serviceStart` in session mode, then exits	No DAGs submitted; `tez.session.am.dag.submit.timeout.secs` exhausted	Increase timeout; or check why client isn't submitting
Tasks all fail with "container lost" after a long GC	AM GC pause exceeded heartbeat budget; RM killed AM	Tune AM heap; reduce dispatcher pressure
Recovery replays but stalls in `INITING`	Recovery log truncated mid-vertex-init	Look for `SummaryEventWriter` errors in prior attempt
Event dispatcher queue grows without bound	A handler is doing blocking I/O on the dispatch thread	Take a thread dump; verify which event is stuck
AM exits with `ERROR` and no DAG transition	An uncaught exception bubbled out of an event handler	`grep "Error in dispatcher thread"` in AM log

Validation: prove you understand this

From memory, list ten event-type→handler registrations in DAGAppMaster.
Draw the event flow from TezTaskUmbilicalProtocol.heartbeat to TaskAttemptImpl.handle(TA_DONE).
Reproduce a single-DAG, non-session AM bring-up on MiniTezCluster and identify the log line emitted by each child-service start.
Read the RecoveryService writer and identify which event types are persisted vs in-memory-only.
Explain why the dispatcher must be single-threaded and what would break if you parallelized it.

VertexImpl Lifecycle

VertexImpl is the AM-side representation of a single Vertex in a running DAG. Its lifecycle is a Hadoop state machine with about a dozen states and dozens of events. This chapter walks the happy path (NEW → SUCCEEDED), the major failure and kill paths, and the rules that govern transitions.

After this chapter you should be able to draw the state machine on a whiteboard and predict every state transition for any event in any state.

File

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

This is one of the largest files in Tez (typically 4000+ lines). Skim once top-to-bottom, then read the stateMachineFactory block carefully.

grep -n "stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

The factory is a single chained builder defined near the top of the file (roughly lines 200–600, depending on version).

The states

grep -n "VertexState\." tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head
# or
grep -n "public enum\|enum VertexState" \
  tez-api/src/main/java/org/apache/tez/dag/api/event/VertexState.java \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

The full state set (names exact as of 0.10.x):

State	Meaning
`NEW`	Just constructed; no events seen
`INITIALIZING`	Inputs being initialized (e.g., split generation)
`INITED`	Ready to run; awaiting `V_START`
`RUNNING`	Tasks executing
`COMMITTING`	All tasks succeeded; outputs being committed
`SUCCEEDED`	Terminal: all good
`TERMINATING`	Failure/kill in progress; awaiting task drain
`KILLED`	Terminal: killed externally
`FAILED`	Terminal: failed (own fault)
`ERROR`	Terminal: AM internal error
`RECOVERING`	(Recovery only) replaying events into this vertex

State × event matrix (happy path)

State	Event	Next state	Action
NEW	V_INIT	INITIALIZING	construct inputs, kick off `InputInitializer`s
INITIALIZING	V_ROOT_INPUT_INITIALIZED	INITIALIZING	accumulate events; if all done → INITED
INITIALIZING	V_ROOT_INPUT_FAILED	TERMINATING	bubble failure
INITIALIZING	V_INIT_COMPLETED	INITED	finalize parallelism if not set
INITED	V_START	RUNNING	schedule tasks via `VertexManagerPlugin`
RUNNING	V_TASK_COMPLETED (success)	RUNNING	bump counter; if all done → COMMITTING
RUNNING	V_TASK_COMPLETED (final fail)	TERMINATING	initiate cleanup
RUNNING	V_TASK_RESCHEDULED	RUNNING	rerun a task
COMMITTING	V_COMMIT_COMPLETED	SUCCEEDED	publish history
COMMITTING	V_COMMIT_FAILED	TERMINATING	rerun or fail

For the complete matrix, count the addTransition(...) calls:

grep -c "addTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

There are usually more than 100 transitions registered. Some carry a short comment naming the bug or JIRA that motivated them; read those comments wherever they exist.

Failure path walk

stateDiagram-v2
    [*] --> NEW
    NEW --> INITIALIZING: V_INIT
    INITIALIZING --> INITED: V_INIT_COMPLETED
    INITIALIZING --> TERMINATING: V_ROOT_INPUT_FAILED
    INITED --> RUNNING: V_START
    RUNNING --> COMMITTING: all tasks SUCCEEDED
    RUNNING --> TERMINATING: any task FAILED beyond max-attempts
    RUNNING --> TERMINATING: V_TERMINATE
    COMMITTING --> SUCCEEDED: V_COMMIT_COMPLETED
    COMMITTING --> TERMINATING: V_COMMIT_FAILED
    TERMINATING --> FAILED
    TERMINATING --> KILLED
    SUCCEEDED --> [*]
    FAILED --> [*]
    KILLED --> [*]

TERMINATING exists because a vertex cannot jump straight to FAILED: it must first kill all running tasks and clean up its outputs. The transition from TERMINATING to a terminal state happens once the count of still-running tasks reaches zero.
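The drain rule can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Tez code: the class `DrainingVertex`, its fields, and its method names are hypothetical, and the way it picks FAILED versus KILLED is a deliberate simplification of what VertexImpl does through registered transitions.

```java
// Hypothetical sketch of the TERMINATING drain rule; not the VertexImpl API.
final class DrainingVertex {
  enum State { RUNNING, TERMINATING, FAILED, KILLED }

  private State state = State.RUNNING;
  private int runningTasks;
  private State pendingTerminal; // terminal state to enter once drained

  DrainingVertex(int runningTasks) {
    this.runningTasks = runningTasks;
  }

  // A failure or kill request never jumps straight to a terminal state.
  void terminate(boolean killedExternally) {
    if (state != State.RUNNING) {
      return; // already terminating or finished
    }
    pendingTerminal = killedExternally ? State.KILLED : State.FAILED;
    state = State.TERMINATING;
    maybeFinish();
  }

  // Called once for each task that has stopped running.
  void taskStopped() {
    if (runningTasks > 0) {
      runningTasks--;
    }
    maybeFinish();
  }

  // Leave TERMINATING only when no task is still running.
  private void maybeFinish() {
    if (state == State.TERMINATING && runningTasks == 0) {
      state = pendingTerminal;
    }
  }

  State state() {
    return state;
  }

  public static void main(String[] args) {
    DrainingVertex v = new DrainingVertex(2);
    v.terminate(false);
    System.out.println(v.state());
    v.taskStopped();
    v.taskStopped();
    System.out.println(v.state());
  }
}
```

The point of the sketch is the ordering: the terminal state is decided when termination starts, but it is only entered after the last running task has reported in.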

Vertex initialization in detail

V_INIT is the most complex transition. The handler must:

Construct each root InputDescriptor and call its InputInitializer.
If parallelism is -1, defer task creation until either the VertexManagerPlugin calls reconfigureVertex(...) or the root inputs report concrete counts.
Construct downstream Edge objects (the AM-side Edge, not the tez-api one) and bind their EdgeManagers.
Schedule the VertexManagerPlugin.onVertexStarted callback (it fires on V_START, not V_INIT).

Read the body:

grep -n "InitTransition\|RootInputInitTransition\|RECOVERING" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -20

The commit path

A vertex with a DataSink (an OutputCommitter) must run a commit phase after all tasks succeed. The commit:

Runs on the AM (not in tasks).
May fail and trigger a rerun (V_COMMIT_FAILED → TERMINATING).
Holds the vertex in COMMITTING for the duration.

Vertex-group commit (when multiple vertices write to a shared VertexGroup) is coordinated by DAGImpl; individual VertexImpls just signal that they are ready to commit.

grep -n "CommittingTransition\|commitOutput\|OutputCommitter" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

Reading exercise

# State machine block
sed -n '1,500p' tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

# Count transitions
grep -c "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

# Find every event that can take the vertex to FAILED
grep -n "VertexState.FAILED" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

# List the transition classes (InitTransition is among them)
grep -n "private.*class.*Transition\b" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

Answer:

List five events that can take the vertex from RUNNING to TERMINATING.
What determines the final state (FAILED vs KILLED) once TERMINATING completes?
Why is INITED distinct from RUNNING — what does V_START actually trigger?
How is parallelism set when a vertex starts with parallelism = -1?
What happens to in-flight tasks when a vertex transitions to TERMINATING?
Why does the state machine have a separate COMMITTING state instead of committing inside RUNNING?

Common bugs and symptoms

Symptom	Root cause	Where to look
`InvalidStateTransitonException: Invalid event: V_TASK_COMPLETED at SUCCEEDED`	A late task completion event arrived after vertex completed (race)	Check task retry logic; add a no-op transition
Vertex stuck in `INITIALIZING` forever	Root input initializer never emitted events	Check `InputInitializerEvent`s in log; cross-check initializer impl
Vertex transitions to `FAILED` but the failing task was killed externally	Bug in `TaskAttemptImpl` setting the wrong termination cause	See task-attempt-lifecycle.md
All tasks succeed but vertex stays in `COMMITTING`	Output committer hangs	Check committer for synchronous slow I/O; consider async
Recovery replays into `RUNNING` but tasks aren't relaunched	Missing recovery event for in-flight tasks	Look for `VertexTaskStartEvent` gaps in recovery log
`V_TERMINATE` causes vertex to stay in `TERMINATING` with one task lingering	Container heartbeat timeout > kill deadline	Tune `tez.task.timeout-ms`

Validation: prove you understand this

From memory, list all 10–11 VertexState values with a one-line meaning.
Without running code, predict the next state for: (NEW, V_TERMINATE), (INITIALIZING, V_TERMINATE), (RUNNING, V_TASK_RESCHEDULED), (COMMITTING, V_TASK_RESCHEDULED). Verify against the source.
Find the JIRA reference next to one transition you don't understand; read the JIRA; come back and explain why the transition exists.
Write a unit test that drives a VertexImpl from NEW to SUCCEEDED using DrainDispatcher. (Use TestVertexImpl as a template.)
Modify VertexImpl to add a no-op transition for some (state, event) pair currently absent; update TestVertexImpl in the same patch. Compile.

TaskImpl Lifecycle

TaskImpl is the AM-side representation of one logical task within a vertex. It is a relatively small state machine, but it owns a critical piece of policy: which attempt of this task is the "winner." This chapter walks the states, the attempt management rules, speculation, and the max-failed threshold that promotes a task to "this whole vertex must fail."

After this chapter you should be able to explain why a task with three failed attempts may still be RUNNING while another with one failed attempt is already FAILED.

File

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java

Tests:

tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java

The states

grep -n "TaskState\." tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head
grep -rn "enum TaskState" \
  tez-api/src/main/java tez-dag/src/main/java

State	Meaning
`NEW`	Constructed; no attempts yet
`SCHEDULED`	First attempt requested from scheduler
`RUNNING`	At least one attempt is `RUNNING`
`SUCCEEDED`	Terminal: one attempt succeeded; task complete
`KILLED`	Terminal: explicitly killed (vertex termination, user)
`FAILED`	Terminal: max attempts exceeded

TaskImpl does not have INITIALIZING or TERMINATING — those concerns belong to the vertex.

State × event matrix

State	Event	Next state	Action
NEW	T_SCHEDULE	SCHEDULED	create first `TaskAttemptImpl`, send `TA_SCHEDULE`
SCHEDULED	T_ATTEMPT_LAUNCHED	RUNNING	mark first attempt as running
RUNNING	T_ATTEMPT_SUCCEEDED	SUCCEEDED	pick this attempt as the winner; kill others (if speculating)
RUNNING	T_ATTEMPT_FAILED	RUNNING (retry) or FAILED (exceeded)	spawn new attempt or terminate
RUNNING	T_ATTEMPT_KILLED	RUNNING	no-op unless this was last attempt
RUNNING	T_ADD_SPEC_ATTEMPT	RUNNING	spawn a duplicate attempt
RUNNING	T_TERMINATE	KILLED	kill all attempts
any	T_RECOVER_*	recovered state	replay events

Count transitions:

grep -c "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java

Retry: how `max-failed-attempts` works

The config:

grep -n "TASK_MAX_FAILED_ATTEMPTS\|tez.am.task.max.failed.attempts" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Default is 4 in most branches. A task becomes FAILED only after that many attempts have failed; attempts that were killed do not count.

Failed vs killed distinction:

Outcome	Counts toward `max.failed.attempts`?
`TaskAttempt` failed (own crash, processor exception)	yes
`TaskAttempt` killed by speculation (lost the race)	no
`TaskAttempt` killed because vertex terminated	no
`TaskAttempt` killed because container preempted	no

The classification is owned by TaskAttemptTerminationCause (see task-attempt-lifecycle.md). TaskImpl.handle consults the cause when deciding whether to retry or fail.

grep -n "TerminationCause\|isFailureCause" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head
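The failed-versus-killed rule can be sketched as a small failure budget. This is a minimal illustration, assuming a simplified cause enum: `RetryBudget`, `Cause`, and the method names are hypothetical and do not mirror TaskImpl or TaskAttemptTerminationCause.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of a failure budget; names do not mirror TaskImpl.
final class RetryBudget {
  // Illustrative causes only; the real enum is TaskAttemptTerminationCause.
  enum Cause { PROCESSOR_ERROR, LOST_SPECULATION_RACE, VERTEX_TERMINATED, CONTAINER_PREEMPTED }

  // Only these causes consume the budget; every other cause is a "kill".
  private static final Set<Cause> FAILURES = EnumSet.of(Cause.PROCESSOR_ERROR);

  private final int maxFailedAttempts;
  private int failedAttempts;

  RetryBudget(int maxFailedAttempts) {
    this.maxFailedAttempts = maxFailedAttempts;
  }

  // Records one ended attempt; returns true if the task should now fail.
  boolean attemptEnded(Cause cause) {
    if (FAILURES.contains(cause)) {
      failedAttempts++;
    }
    return failedAttempts >= maxFailedAttempts;
  }

  int failedAttempts() {
    return failedAttempts;
  }

  public static void main(String[] args) {
    RetryBudget budget = new RetryBudget(4);
    System.out.println(budget.attemptEnded(Cause.CONTAINER_PREEMPTED));
    System.out.println(budget.attemptEnded(Cause.PROCESSOR_ERROR));
  }
}
```

With a budget of 4, any number of preempted or speculation-losing attempts leaves the task retryable, while the fourth genuine failure exhausts the budget.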

Speculation

Speculation runs a second copy of a task before the first finishes, hoping the second wins. Implementation:

tez-dag/src/main/java/org/apache/tez/dag/app/dag/speculate/
  LegacySpeculator.java
  SimpleSpeculator.java                    (varies by version)
  legacy/RuntimeTaskStatsEstimator.java

The speculator emits T_ADD_SPEC_ATTEMPT events into the dispatcher; the task spawns an additional attempt. The first attempt to succeed wins; the others are killed with a speculation-related cause (for example, TERMINATED_INEFFECTIVE_SPECULATION; verify the exact value in your branch). Killed attempts do not count toward max.failed.attempts.

Enabled by:

grep -n "tez.am.speculation.enabled\|speculation" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head

"Best attempt" selection

When multiple attempts of the same task exist, the first to send TA_DONE (successful completion) wins. The handler:

Marks that attempt as the canonical one (cached in TaskImpl).
Iterates remaining attempts, sending each a kill event.
Transitions task to SUCCEEDED.

grep -n "successfulAttempt\|setWinnerAttempt\|markSuccessful" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head

Downstream vertices reading from this task's output use the winner's outputLocationHint for shuffle (see shuffle-sort.md).
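The first-success-wins rule can be sketched as follows. This is a minimal illustration only: `WinnerSelection` and its members are hypothetical names, attempts are reduced to integer ids, and "sending a kill event" is modelled as returning the list of attempts to kill.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of first-success-wins; not TaskImpl's actual fields.
final class WinnerSelection {
  private final Set<Integer> liveAttempts = new LinkedHashSet<>();
  private Integer winner; // null until some attempt succeeds

  void attemptStarted(int attemptId) {
    liveAttempts.add(attemptId);
  }

  // Returns the attempts that must be sent a kill event (empty if already decided).
  List<Integer> attemptSucceeded(int attemptId) {
    if (winner != null) {
      return new ArrayList<>(); // a later success never replaces the winner
    }
    winner = attemptId;
    liveAttempts.remove(attemptId);
    List<Integer> toKill = new ArrayList<>(liveAttempts);
    liveAttempts.clear();
    return toKill;
  }

  Integer winner() {
    return winner;
  }

  public static void main(String[] args) {
    WinnerSelection task = new WinnerSelection();
    task.attemptStarted(0);
    task.attemptStarted(1);
    System.out.println(task.attemptSucceeded(1));
    System.out.println(task.winner());
  }
}
```

Caching the winner before issuing kills is what makes a second, late success harmless: it finds the decision already made and changes nothing.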

Reading exercise

# Surface
sed -n '1,120p' tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java

# State machine block
grep -n "addTransition" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head -40

# Retry logic
grep -n "addAttempt\|nextAttemptNumber\|createAttempt" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head

# Speculation hook
grep -rn "T_ADD_SPEC_ATTEMPT" tez-dag/src/main/java/org/apache/tez/dag/app/ | head

Answer:

What is the precise condition for transitioning from RUNNING to FAILED on a T_ATTEMPT_FAILED event? Cite the line.
Where is a new TaskAttemptImpl constructed? Is it a public method or private to TaskImpl?
How does TaskImpl know whether a failed attempt should count toward the failure budget?
In what state can T_ADD_SPEC_ATTEMPT arrive? What does the handler do?
Why does TaskImpl not own its own scheduling? Who does?
When a task succeeds with two parallel attempts, which one becomes the downstream input? How is the loser cleaned up?

Common bugs and symptoms

Symptom	Root cause	Where to look
Task retries forever and never fails	`max.failed.attempts` set absurdly high; or all failures classified as "kill"	Check config; verify `TerminationCause` for each failure
Speculation kills the original just after it succeeds (lost work)	Race on `markSuccessful` and speculative-attempt kill	Ensure speculator backs off when task is in completing
Task `SUCCEEDED` but a sibling attempt still appears as `RUNNING` for a long time	Container slow to acknowledge kill	Look at `ContainerHeartbeatHandler` and `TA_KILL_REQUEST`
`Task succeeded` reported but downstream cannot fetch outputs	Race between `TA_DONE` and output ready event	Check ordering of `outputReady` umbilical calls
Recovery brings task back as `RUNNING` even though it had finished	Missing `TaskFinishedEvent` in recovery log	Investigate `RecoveryService` flush boundaries

Validation: prove you understand this

Draw the TaskImpl state machine from memory, including all six states.
From TestTaskImpl, identify a test that drives a task to FAILED. Walk the events it sends.
List the TaskAttemptTerminationCause values that do not count toward max.failed.attempts. Cite the enum and the consumer.
Trace, line by line, what TaskImpl does when T_ATTEMPT_SUCCEEDED arrives for the second of two concurrent attempts.
Modify TaskImpl to log the winner's attempt number explicitly at the INFO level. Run a MiniTezCluster job and observe.

TaskAttemptImpl Lifecycle

TaskAttemptImpl is the AM-side representation of a single execution attempt of a task. It owns the container assignment, the umbilical, the output commit decision, and — critically — the TaskAttemptTerminationCause that drives upstream retry decisions.

After this chapter you should be able to look at any TaskAttemptImpl state in an AM log and explain (a) what container holds it, (b) which umbilical calls have or have not landed, and (c) what its termination cause will be if it dies right now.

File

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java

Tests:

tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java

Termination cause enum:

tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptTerminationCause.java

The states (typical 0.10.x naming)

State	Meaning
`NEW`	Constructed; not yet given to scheduler
`START_WAIT`	Request sent to scheduler; awaiting container
`SUBMITTED`	Container allocated; awaiting launch ack (some versions)
`RUNNING`	Container launched; processor executing
`SUCCEEDED`	Terminal: `TA_DONE` received
`KILL_IN_PROGRESS`	Kill requested; awaiting confirmation
`KILLED`	Terminal: killed before/during execution
`FAIL_IN_PROGRESS`	Failure recognized; cleaning up
`FAILED`	Terminal: failed (counts against `max.failed.attempts`)

Exact list varies by branch. Verify:

grep -n "TaskAttemptStateInternal\." \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head

Tez separates the external state (TaskAttemptState in tez-api, the coarse enum visible to ATS) from the richer internal state-machine state. Mapping:

Internal	External
NEW, START_WAIT, SUBMITTED, RUNNING	`STARTING` / `RUNNING`
SUCCEEDED	`SUCCEEDED`
KILL_IN_PROGRESS, KILLED	`KILLED`
FAIL_IN_PROGRESS, FAILED	`FAILED`
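The mapping table can be written as a single function. This is a sketch under assumptions: the two enums below are illustrative copies of the names in the tables above, not the Tez types, and the choice to report every pre-RUNNING state as STARTING is an assumption the table leaves open.

```java
// Hypothetical mirror of the mapping table; enum names are illustrative.
final class AttemptStateMapping {
  enum Internal {
    NEW, START_WAIT, SUBMITTED, RUNNING, SUCCEEDED,
    KILL_IN_PROGRESS, KILLED, FAIL_IN_PROGRESS, FAILED
  }

  enum External { STARTING, RUNNING, SUCCEEDED, KILLED, FAILED }

  // Assumption: everything before RUNNING is reported as STARTING.
  static External toExternal(Internal state) {
    switch (state) {
      case NEW:
      case START_WAIT:
      case SUBMITTED:
        return External.STARTING;
      case RUNNING:
        return External.RUNNING;
      case SUCCEEDED:
        return External.SUCCEEDED;
      case KILL_IN_PROGRESS:
      case KILLED:
        return External.KILLED;
      case FAIL_IN_PROGRESS:
      case FAILED:
        return External.FAILED;
      default:
        throw new IllegalArgumentException("unmapped state: " + state);
    }
  }

  public static void main(String[] args) {
    for (Internal s : Internal.values()) {
      System.out.println(s + " -> " + toExternal(s));
    }
  }
}
```

The in-progress states collapse onto their terminal counterparts, so an external observer sees KILLED or FAILED as soon as the outcome is decided, even while cleanup is still running.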

State × event matrix (key transitions)

State	Event	Next state	Notes
NEW	TA_SCHEDULE	START_WAIT	request container
START_WAIT	TA_STARTED	SUBMITTED/RUNNING	container launched
START_WAIT	TA_CONTAINER_TERMINATING	KILL_IN_PROGRESS	preemption before launch
RUNNING	TA_DONE	SUCCEEDED	`done(...)` umbilical call
RUNNING	TA_FAILED	FAIL_IN_PROGRESS	processor threw
RUNNING	TA_TIMED_OUT	FAIL_IN_PROGRESS	heartbeat exceeded `tez.task.timeout-ms`
RUNNING	TA_KILL_REQUEST	KILL_IN_PROGRESS	external kill
RUNNING	TA_CONTAINER_TERMINATED	FAIL_IN_PROGRESS / KILL_IN_PROGRESS	NM said container died
KILL_IN_PROGRESS	TA_CONTAINER_TERMINATED	KILLED	cleanup done
FAIL_IN_PROGRESS	TA_CONTAINER_TERMINATED	FAILED	cleanup done

grep -c "addTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java

Container assignment

When a TaskAttempt becomes schedulable, the AM:

Builds a ContainerRequest (resource, priority, locality).
Hands it to TaskSchedulerManager.allocateTask(...).
The scheduler (YarnTaskSchedulerService) eventually matches a granted container.
The match drives an AMSchedulerEventTAEnded/...TALaunchRequest flow that updates the TaskAttemptImpl state.
ContainerLauncherManager actually starts the JVM via NMClient.

grep -n "allocateTask\|deallocateTask\|AMSchedulerEvent" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head

The container is not assigned at construction; that's why the START_WAIT state exists. Some configurations short-circuit this via container reuse (the scheduler offers a free, idle container).

See container-reuse.md and scheduler.md.

Output commit rules (per attempt)

For attempts of vertices with an OutputCommitter:

Condition	Who commits?
Output commits are at the task level (`tez.am.commit-all-outputs-on-dag-success=false`)	Each `TaskAttemptImpl` runs `commit()` from inside the task JVM (via processor)
Output commits are at the vertex level (default for MROutput)	Only the AM commits, after all tasks succeed (see vertex-lifecycle.md)

Losing speculative attempts must not commit. The setOutputCommitted(true) flag on TaskAttemptImpl records who actually committed. The AM ensures exactly one attempt of each task has outputCommitted=true.

grep -n "outputCommitted\|commitOutput\|noCommit" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head
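The "exactly one attempt commits" guarantee can be sketched as a first-caller-wins arbiter. This is an illustration under assumptions, not the Tez mechanism: `CommitArbiter` and `canCommit` are hypothetical names used here for the idea that the AM grants commit permission to a single attempt and refuses all others.

```java
// Hypothetical sketch of "exactly one attempt may commit"; not Tez's API.
final class CommitArbiter {
  private Integer committingAttempt; // null until an attempt is granted the commit

  // First caller wins; the same attempt may ask again; all others are refused.
  synchronized boolean canCommit(int attemptId) {
    if (committingAttempt == null) {
      committingAttempt = attemptId;
    }
    return committingAttempt == attemptId;
  }

  public static void main(String[] args) {
    CommitArbiter task = new CommitArbiter();
    System.out.println(task.canCommit(0));
    System.out.println(task.canCommit(1));
  }
}
```

Making the grant idempotent for the chosen attempt matters in practice: a retried permission request from the same attempt must get the same answer, while a speculative sibling must never be granted one.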

`TaskAttemptTerminationCause` — the policy enum

sed -n '1,200p' \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptTerminationCause.java

Categories (the exact enum is long):

Cause	Counts as failure?	Typical trigger
`TERMINATED_BY_CLIENT`	No	User killed DAG
`TERMINATED_AT_SHUTDOWN`	No	AM shutting down
`TERMINATED_INEFFECTIVE_SPECULATION`	No	Lost the speculation race
`INTERNAL_PREEMPTION`	No	AM preempted it (e.g., for higher-priority work)
`EXTERNAL_PREEMPTION`	No	YARN preempted the container
`CONTAINER_EXITED`	Yes (default)	Container died mid-run
`NODE_FAILED`	Yes	NM died
`TASK_HEARTBEAT_ERROR`	Yes	Heartbeat timeout
`OUTPUT_LOST`	Yes	Downstream reported output gone (rerun)
`APPLICATION_ERROR`	Yes	Processor threw

TaskImpl uses cause.causesFailure() (or equivalent) to decide whether to bump the failure counter.

Reading exercise

sed -n '1,160p' tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java
grep -n "TaskAttemptStateInternal\." \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head -20
grep -n "TerminationCause" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head -20

# Heartbeat timeout path
grep -n "TA_TIMED_OUT\|heartbeatTimeout" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head

Answer:

What event arrives when an attempt's container heartbeat times out? What issues it?
What is the difference between TA_FAILED and TA_CONTAINER_TERMINATED? When does each fire?
Which TaskAttemptTerminationCause values are not counted toward tez.am.task.max.failed.attempts?
In what state does an attempt sit during container provisioning?
What does outputCommitted track, and how is it used by the AM to choose the canonical attempt?
Why are there separate FAIL_IN_PROGRESS and FAILED states (likewise for kill)?

Common bugs and symptoms

Symptom	Root cause	Where to look
Attempt stuck in `START_WAIT` for minutes	Scheduler can't satisfy locality/resource	`TaskSchedulerManager` log; relax locality
Attempt marked `FAILED` when container was preempted	`TerminationCause` set incorrectly	Check the `TA_CONTAINER_TERMINATED` handler
Two attempts both commit outputs (data corruption)	`setOutputCommitted` race; speculative commit	Run `TestSpeculation`; ensure committer is idempotent
`TaskAttempt` heartbeat timeout fires even though task was running	AM GC pause; clock skew	Tune AM heap; check NM/AM clock drift
Recovery comes back with all attempts `FAILED`	Recovery log lacks `TaskAttemptStartedEvent` for last attempt	Force flush before submitting next event
`KILL_IN_PROGRESS` lingers	`TA_CONTAINER_TERMINATED` never arrives	NM is dead; AM eventually times out container

Validation: prove you understand this

Without running code: given an attempt in RUNNING and event TA_CONTAINER_TERMINATED with cause INTERNAL_PREEMPTION, what is the next state and does the failure counter increment?
From the enum, list every TaskAttemptTerminationCause and tag each "counts" / "does not count".
Reproduce a heartbeat timeout on MiniTezCluster by suspending a task JVM. Identify the exact log line that transitions the attempt.
Walk the path from TaskCommunicatorManager.heartbeat returning a LATEST_RESPONSE_TIMEOUT to TaskAttemptImpl.handle(TA_TIMED_OUT).
Verify that a speculative-loser attempt does not corrupt counters by reading the kill-handler code.

State Machines

Tez's AM uses Hadoop's StateMachineFactory extensively: every long-lived entity (DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl, container state objects) is a state machine. This chapter explains the API, the dispatcher contract that keeps state machines correct, the AsyncDispatcher vs DrainDispatcher distinction, the common InvalidStateTransitonException bug class, and the discipline required to add a transition safely.

After this chapter you should be able to write a small state machine from scratch and review a transition-modifying patch.

The API

The factory lives in:

hadoop-yarn-common
  org/apache/hadoop/yarn/state/StateMachineFactory.java
  org/apache/hadoop/yarn/state/SingleArcTransition.java
  org/apache/hadoop/yarn/state/MultipleArcTransition.java
  org/apache/hadoop/yarn/state/InvalidStateTransitonException.java

(Yes, the exception is spelled Transiton in the Hadoop source — historical typo, preserved for compatibility. Greps that look for Transition will miss it.)

Skeleton:

private static final StateMachineFactory<MyEntity, MyState, MyEvtType, MyEvt>
    stateMachineFactory =
    new StateMachineFactory<MyEntity, MyState, MyEvtType, MyEvt>(MyState.NEW)

        // Single-arc: preState, postState, eventType, transition
        .addTransition(MyState.NEW, MyState.RUNNING,
            MyEvtType.START,
            new StartTransition())

        // Multiple-arc: state, set of possible next states, event, transition
        .addTransition(MyState.RUNNING,
            EnumSet.of(MyState.SUCCEEDED, MyState.FAILED),
            MyEvtType.DONE,
            new DoneTransition())

        // Self-loop: state, state, event, transition
        .addTransition(MyState.RUNNING, MyState.RUNNING,
            MyEvtType.HEARTBEAT,
            new HeartbeatTransition())

        // No-op self-loop with no transition object
        .addTransition(MyState.SUCCEEDED, MyState.SUCCEEDED,
            EnumSet.of(MyEvtType.HEARTBEAT))

        .installTopology();

installTopology() returns the finished factory, which you store in the static field. Each instance then makes its own state machine from it:

private final StateMachine<MyState, MyEvtType, MyEvt> stateMachine =
    stateMachineFactory.make(this);

public void handle(MyEvt event) {
  writeLock.lock();
  try {
    MyState oldState = stateMachine.getCurrentState();
    try {
      stateMachine.doTransition(event.getType(), event);
    } catch (InvalidStateTransitonException e) {
      LOG.error("Invalid event " + event.getType() + " at " + oldState, e);
      // typically: re-throw or transition to ERROR
    }
  } finally {
    writeLock.unlock();
  }
}

Single-arc vs multiple-arc

Concept	When to use	Implementation
`SingleArcTransition<OPERAND, EVENT>`	The next state is always the same	`void transition(OPERAND op, EVENT event)`
`MultipleArcTransition<OPERAND, EVENT, STATE>`	Next state depends on event content	`STATE transition(OPERAND op, EVENT event)` (returns the chosen state)

You almost always start with SingleArcTransition. Promote to MultipleArcTransition only when the next state legitimately depends on runtime data (e.g., "if task count == 0 then SUCCEEDED else RUNNING").
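The difference between the two hook shapes can be shown with simplified stand-ins. The interfaces below are hypothetical and are not the Hadoop `SingleArcTransition` and `MultipleArcTransition` types; they only mirror the void-versus-returned-state distinction from the table, and `Counter` is an invented operand.

```java
// Simplified stand-ins for the two hook shapes; these are NOT the Hadoop interfaces.
final class ArcShapes {
  enum State { RUNNING, SUCCEEDED }

  interface SingleArc<O, E> {
    void transition(O operand, E event);
  }

  interface MultipleArc<O, E, S> {
    S transition(O operand, E event);
  }

  // Invented operand: tracks how many tasks are still outstanding.
  static final class Counter {
    int remaining;

    Counter(int remaining) {
      this.remaining = remaining;
    }
  }

  // Single arc: the next state was fixed at registration time, so nothing is returned.
  static final SingleArc<Counter, String> HEARTBEAT = (c, e) -> { /* bookkeeping only */ };

  // Multiple arc: the hook inspects runtime data and returns the next state.
  static final MultipleArc<Counter, String, State> TASK_DONE = (c, e) -> {
    c.remaining--;
    return c.remaining == 0 ? State.SUCCEEDED : State.RUNNING;
  };

  public static void main(String[] args) {
    Counter c = new Counter(2);
    System.out.println(TASK_DONE.transition(c, "done"));
    System.out.println(TASK_DONE.transition(c, "done"));
  }
}
```

If a hook never needs to look at the operand or event to pick its destination, it belongs in the single-arc shape; the returned state is only worth its complexity when the data decides.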

Dispatcher contract

State machines are not thread-safe by themselves. Tez upholds correctness via the single-dispatcher-thread invariant:

All events for a DAGAppMaster's state machines flow through one AsyncDispatcher.
The dispatcher has one thread that pulls events and calls handle(event).
Therefore handlers run serially; no two handle() calls overlap.

This invariant is the reason VertexImpl.handle can manipulate fields without synchronization. Break the invariant and you get races no test will catch consistently.

grep -n "AsyncDispatcher\|GenericEventHandler" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java
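The single-dispatcher-thread invariant can be demonstrated without Hadoop on the classpath. The sketch below is a stand-in, not AsyncDispatcher: `SerialDispatcher` is a hypothetical class that uses a single-thread executor as its queue and dispatch thread, and the "handler" is just an increment of a plain, unsynchronized field.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a single-threaded dispatcher; not AsyncDispatcher.
final class SerialDispatcher {
  private final ExecutorService dispatchThread = Executors.newSingleThreadExecutor();
  private int handled; // plain field: touched only on the dispatch thread

  // Any thread may enqueue; the "handler" runs later on the dispatch thread.
  void dispatch() {
    dispatchThread.execute(() -> handled++);
  }

  // Queues a final read behind all pending events, then stops the thread.
  int drainAndCount() {
    try {
      int result = dispatchThread.submit(() -> handled).get();
      dispatchThread.shutdown();
      return result;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  // Several producers enqueue concurrently; returns how many events were handled.
  static int run(int producers, int eventsPerProducer) {
    SerialDispatcher d = new SerialDispatcher();
    Thread[] threads = new Thread[producers];
    for (int i = 0; i < producers; i++) {
      threads[i] = new Thread(() -> {
        for (int j = 0; j < eventsPerProducer; j++) {
          d.dispatch();
        }
      });
      threads[i].start();
    }
    try {
      for (Thread t : threads) {
        t.join();
      }
    } catch (InterruptedException e) {
      throw new IllegalStateException(e);
    }
    return d.drainAndCount();
  }

  public static void main(String[] args) {
    System.out.println(run(4, 1000));
  }
}
```

No update is lost even though `handled` has no lock, because every increment runs on the same thread. Calling `handled++` directly from the producer threads instead would reintroduce exactly the kind of race the invariant exists to prevent.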

AsyncDispatcher vs DrainDispatcher

Class	Where	Behavior
`AsyncDispatcher`	Production	Background thread; events processed asynchronously
`DrainDispatcher`	Tests	Same API; tests call `await()` to block until queue empty

Tests use DrainDispatcher so they can assert state after a known set of events has been processed:

DrainDispatcher dispatcher = new DrainDispatcher();
dispatcher.register(VertexEventType.class, vertexEventHandler);
dispatcher.init(conf);
dispatcher.start();
dispatcher.getEventHandler().handle(new VertexEvent(...));
dispatcher.await();   // blocks until queue empty
assertEquals(VertexState.RUNNING, vertex.getState());

find . -name "DrainDispatcher.java"
grep -rn "new DrainDispatcher" tez-dag/src/test/java | head

`InvalidStateTransitonException`

Thrown when doTransition(type, event) finds no registered handler for the (currentState, eventType) pair. The exception message has the form:

Invalid event: V_TASK_RESCHEDULED at SUCCEEDED

Common causes:

A late event arrived after the entity reached a terminal state (race between cancellation and a completion event).
A new code path emits an event but the receiving state machine forgot to register a handler.
The event sender misunderstood the protocol.

Fixing one of these almost always requires:

Adding a (state, eventType, sameState) no-op transition (case 1).
Adding a real transition (case 2).
Removing the bogus emit (case 3).

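Never silently catch and swallow the exception in production code. It indicates a real protocol violation, so surface it: log the (state, event) pair and fail the entity gracefully (for example, by moving it to ERROR), because an unhandled exception that kills the dispatch thread is a worse outcome than a graceful error.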
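Case 1 (a late event after a terminal state) and its no-op fix can be reproduced with a tiny table-driven machine. This is a hypothetical stand-in for StateMachineFactory, not the Hadoop class: `LateEventDemo`, its string-keyed table, and the use of `IllegalStateException` in place of `InvalidStateTransitonException` are all assumptions made to keep the sketch dependency-free.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical table-driven state machine; a stand-in for StateMachineFactory.
final class LateEventDemo {
  enum State { RUNNING, SUCCEEDED }
  enum Event { TASK_COMPLETED, ALL_DONE }

  private final Map<String, State> table = new HashMap<>();
  private State current = State.RUNNING;

  LateEventDemo(boolean tolerateLateCompletion) {
    add(State.RUNNING, Event.TASK_COMPLETED, State.RUNNING);
    add(State.RUNNING, Event.ALL_DONE, State.SUCCEEDED);
    if (tolerateLateCompletion) {
      // The case-1 fix: a no-op self-loop on the terminal state.
      add(State.SUCCEEDED, Event.TASK_COMPLETED, State.SUCCEEDED);
    }
  }

  private void add(State from, Event on, State to) {
    table.put(from + "/" + on, to);
  }

  // Looks up the (state, event) pair; an unregistered pair is a protocol violation.
  void handle(Event event) {
    State next = table.get(current + "/" + event);
    if (next == null) {
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    current = next;
  }

  State current() {
    return current;
  }

  public static void main(String[] args) {
    LateEventDemo tolerant = new LateEventDemo(true);
    tolerant.handle(Event.ALL_DONE);
    tolerant.handle(Event.TASK_COMPLETED); // late event, absorbed by the no-op
    System.out.println(tolerant.current());
  }
}
```

Constructing the machine with `false` and replaying the same two events throws on the second one, which is the shape of failure the no-op transition removes.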

How to add a transition safely

Process every Tez committer follows when modifying a state machine:

Find the existing transitions for the state — read all addTransition(STATE, ...) lines.
Identify the gap — confirm the event is not already handled.
Add the transition in the correct alphabetical/grouping order the file uses.
Add a unit test to the corresponding Test*Impl class that triggers the new event in the relevant state.
Update related no-op transitions for terminal states (a new event needs no-op handlers in SUCCEEDED, FAILED, KILLED).
Run all tests in the module before opening a PR.

The discipline "always update the test in the same patch" is enforced by reviewers. PRs that change VertexImpl without changes to TestVertexImpl are typically blocked.

Reading exercise

# Find the factory blocks
grep -n "stateMachineFactory" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/*.java

# Count transitions per entity
for f in tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java \
         tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java \
         tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java \
         tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java; do
  echo "$f $(grep -c addTransition "$f")"
done

# Look at one transition impl
grep -n "class StartTransition\|class InitTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

Answer:

Why is the exception named InvalidStateTransitonException (with a typo)? What would happen if you renamed it?
Which Tez class uses MultipleArcTransition most heavily, and why?
What does installTopology() return, and why is the factory typically a static final field?
In TestVertexImpl, find the DrainDispatcher.await() calls. Why are they essential and what failure mode occurs if you forget?
If two threads call vertexImpl.handle(event) concurrently — bypassing the dispatcher — what specific bug class arises?
Read one MultipleArcTransition and explain how its return value determines the next state.

Common bugs and symptoms

Symptom	Root cause	Fix
`InvalidStateTransitonException: Invalid event: X at TERMINAL_STATE`	Late event after terminal state	Add a no-op transition
Test passes locally, fails on CI intermittently	`DrainDispatcher.await()` missing or called too early	Always call `await()` between event sends and asserts
State machine mutates wrong fields	Transition class accidentally captures outer state	Make transition classes static; pass everything via the event
Dispatcher thread deadlocks	Handler is doing blocking I/O on dispatch thread	Move I/O to a worker; emit a follow-up event when done
`addTransition` for a no-op fails to compile	Wrong arity overload	Use the variant with `EnumSet<EventType>`
Adding a transition silently breaks recovery	Recovery replay hits the new event in an old state	Cover the recovery test path; recovery uses the same SM

Validation: prove you understand this

Implement a Light state machine with states OFF, ON, BROKEN and events TOGGLE, BREAK. Compile and unit-test.
Find every SingleArcTransition in VertexImpl that is registered as static final — explain why static.
Take an InvalidStateTransitonException from a real AM log; map it to the exact (state, event) pair and propose either a fix or a JIRA.
Run TestVertexImpl#testKilledTasksHandling. Identify every DrainDispatcher.await() call and what it guards.
Add a (SUCCEEDED, T_HEARTBEAT, SUCCEEDED) no-op to TaskImpl and the corresponding test in TestTaskImpl. Ensure all tests pass.

Event Routing

Events are the only sanctioned API for mutating any AM-side entity. This chapter catalogs the event hierarchy, explains the "events are the only mutation API" rule, walks how a single task-completion percolates up to the DAG, and shows where each event is registered and dispatched.

After this chapter you should be able to trace any state transition in the AM back through the chain of events that caused it.

The hierarchy

hadoop-yarn-common
  org/apache/hadoop/yarn/event/AbstractEvent<EVT_TYPE>
  org/apache/hadoop/yarn/event/EventHandler<E>
  org/apache/hadoop/yarn/event/AsyncDispatcher

tez-dag
  org/apache/tez/dag/app/dag/event/
    DAGEvent (subclasses: DAGEventStart, DAGEventDAGAttemptStarted, ...)
    VertexEvent (subclasses: VertexEventTaskCompleted, VertexEventVertexCompleted, ...)
    TaskEvent (subclasses: TaskEventTAUpdate, TaskEventTermination, ...)
    TaskAttemptEvent (subclasses: TaskAttemptEventStartedRemotely, ...)
    AMSchedulerEvent
    AMContainerEvent
    AMNodeEvent
    SpeculatorEvent
    ...

Hint to grep all event classes:

find tez-dag/src/main/java/org/apache/tez/dag/app -path "*event*" -name "*.java" \
  | xargs grep -l "extends AbstractEvent\|extends DAGEvent\|extends VertexEvent" \
  | head -30

The AbstractEvent<E> base has two fields: an event type (enum) and a timestamp. Concrete event classes add payloads (e.g., VertexEventTaskCompleted carries the TezTaskID and the task's final TaskState).
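To make that shape concrete, here is a minimal plain-Java sketch of a type-plus-timestamp base class and one payload-carrying subclass. SketchEvent, SketchVertexEventType, and SketchTaskCompleted are hypothetical stand-ins invented for this example, not the Hadoop or Tez classes, and the payload is a plain string ID so the snippet stays self-contained.

```java
// Hypothetical stand-ins for the event base class and one concrete event.
enum SketchVertexEventType { V_START, V_TASK_COMPLETED }

abstract class SketchEvent<T extends Enum<T>> {
  private final T type;          // selects the transition to run
  private final long timestamp;  // creation time, handy when reading logs

  protected SketchEvent(T type) {
    this.type = type;
    this.timestamp = System.currentTimeMillis();
  }

  public T getType() { return type; }
  public long getTimestamp() { return timestamp; }

  @Override
  public String toString() { return "EventType: " + type; }
}

// A concrete event adds only the payload its handler needs.
final class SketchTaskCompleted extends SketchEvent<SketchVertexEventType> {
  private final String taskId;     // an ID, not an object reference
  private final boolean succeeded;

  SketchTaskCompleted(String taskId, boolean succeeded) {
    super(SketchVertexEventType.V_TASK_COMPLETED);
    this.taskId = taskId;
    this.succeeded = succeeded;
  }

  public String getTaskId() { return taskId; }
  public boolean isSucceeded() { return succeeded; }
}
```

The subclass fixes its own event type in the constructor, so a caller cannot build a "task completed" event with the wrong type.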

The "events are the only mutation API" rule

This rule is the bedrock of correctness:

Any change to the externally observable state of a DAGImpl, VertexImpl, TaskImpl, or TaskAttemptImpl must occur inside a state-machine transition handler, triggered by an event that flowed through the AsyncDispatcher.

Concretely:

Never call a setter directly on VertexImpl from another thread.
Never have one entity reach into another and mutate. Send an event.
The only "side door" is read-only getters (intentionally not synchronized; callers tolerate slight staleness).

Why this rule:

Concurrency safety — the dispatcher serializes everything. Direct mutation re-introduces races.
Auditability — events appear in the AM log; field writes do not.
Recoverability — RecoveryService writes events; replay rebuilds state. Mutations outside events are invisible to recovery.
Testability — DrainDispatcher controls the world; bypass it and tests become non-deterministic.

A patch that calls a mutator method outside a transition handler is, by convention, immediately rejected.

Bubble-up: a task completion to the DAG

sequenceDiagram
    participant TA as TaskAttemptImpl
    participant T as TaskImpl
    participant V as VertexImpl
    participant D as DAGImpl
    participant DI as Dispatcher

    Note over TA: heartbeat -> done(...) on umbilical
    TA->>TA: handle(TA_DONE)
    TA-->>DI: emit T_ATTEMPT_SUCCEEDED
    DI->>T: handle(T_ATTEMPT_SUCCEEDED)
    T->>T: mark winner; check siblings
    T-->>DI: emit V_TASK_COMPLETED (success)
    DI->>V: handle(V_TASK_COMPLETED)
    V->>V: bump succeededTaskCount
    alt All tasks done
        V-->>DI: emit V_COMMIT_REQUEST (if applicable)
        V-->>DI: emit DAG_VERTEX_COMPLETED
        DI->>D: handle(DAG_VERTEX_COMPLETED)
        D->>D: bump succeededVertexCount
    end

Every arrow is a state-machine transition. Every emit is an eventHandler.handle(...) call inside the transition body.
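The chain in the diagram can be modeled in a few lines of plain Java. This is a toy, assuming one vertex and a single shared queue in place of the real dispatcher; BubbleUpSketch and its methods are hypothetical names, and only the three event names come from the diagram.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model: every state change happens inside handle(), and every
// cross-entity effect is a new event placed on one shared queue.
final class BubbleUpSketch {
  enum Type { T_ATTEMPT_SUCCEEDED, V_TASK_COMPLETED, DAG_VERTEX_COMPLETED }

  private final Queue<Type> queue = new ArrayDeque<>();  // stands in for the dispatcher
  private final List<Type> handled = new ArrayList<>();  // order of delivery
  private final int tasksInVertex;
  private int succeededTaskCount;
  private int succeededVertexCount;

  BubbleUpSketch(int tasksInVertex) { this.tasksInVertex = tasksInVertex; }

  void emit(Type type) { queue.add(type); }

  // Drain on the caller's thread, the way a test drains a dispatcher.
  void drain() {
    Type next;
    while ((next = queue.poll()) != null) {
      handled.add(next);
      handle(next);
    }
  }

  private void handle(Type type) {
    switch (type) {
      case T_ATTEMPT_SUCCEEDED:    // task level: an attempt won
        emit(Type.V_TASK_COMPLETED);
        break;
      case V_TASK_COMPLETED:       // vertex level: count the task
        succeededTaskCount++;
        if (succeededTaskCount == tasksInVertex) {
          emit(Type.DAG_VERTEX_COMPLETED);
        }
        break;
      case DAG_VERTEX_COMPLETED:   // DAG level: count the vertex
        succeededVertexCount++;
        break;
    }
  }

  int succeededTasks() { return succeededTaskCount; }
  int succeededVertices() { return succeededVertexCount; }
  List<Type> handledEvents() { return handled; }
}
```

With two tasks in the vertex, the first T_ATTEMPT_SUCCEEDED stops at the vertex counter; the second one carries through to the DAG counter. No entity ever touches another entity's fields.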

Find the emit sites:

grep -n "eventHandler.handle\b" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head
grep -n "eventHandler.handle\b" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head

Where events are registered

Registrations live in DAGAppMaster.serviceInit (see dag-app-master.md):

grep -n "dispatcher.register\|register(.*\.class" \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java

Each registration maps an event type to a handler. Most handlers are inner classes that delegate to entity.handle(event):

private class TaskEventDispatcher implements EventHandler<TaskEvent> {
  @SuppressWarnings("unchecked")
  @Override
  public void handle(TaskEvent event) {
    // The event names its target by ID; resolve DAG -> vertex -> task.
    DAG dag = context.getCurrentDAG();
    Task task = dag.getVertex(event.getTaskID().getVertexID())
                   .getTask(event.getTaskID());
    // Forward to the task's own state machine.
    ((EventHandler<TaskEvent>) task).handle(event);
  }
}

Why the indirection: events carry IDs, not object references. The dispatcher handler does the resolve, then forwards.
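One consequence of resolving at delivery time is that an event addressed to an entity that no longer exists can be detected and dropped. The sketch below models that with a map lookup. IdRoutingSketch and its nested classes are hypothetical, and silently dropping an unknown ID is an assumption of this sketch, not a claim about what every Tez handler does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: events name their target by ID; the dispatcher-side
// handler looks the entity up when the event is delivered, then forwards.
final class IdRoutingSketch {
  static final class TaskEvent {
    final String taskId;
    final String type;
    TaskEvent(String taskId, String type) {
      this.taskId = taskId;
      this.type = type;
    }
  }

  static final class Task {
    String lastEvent = "NONE";
    void handle(TaskEvent event) { lastEvent = event.type; }
  }

  private final Map<String, Task> tasksById = new HashMap<>();

  void addTask(String id, Task task) { tasksById.put(id, task); }
  void removeTask(String id) { tasksById.remove(id); }

  // Returns true if the event was delivered, false if the target is gone.
  boolean dispatch(TaskEvent event) {
    Task target = tasksById.get(event.taskId);  // resolve ID -> object
    if (target == null) {
      return false;                             // stale event: drop it
    }
    target.handle(event);                       // forward
    return true;
  }
}
```

Had the event held a direct Task reference, the second dispatch in a remove-then-deliver sequence would still have mutated the removed object.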

Per-entity event types

Entity	Event type enum	Where emitted
`DAGImpl`	`DAGEventType`	Vertex completions, kill, recovery
`VertexImpl`	`VertexEventType`	Task completions, manager callbacks, root input events
`TaskImpl`	`TaskEventType`	Attempt completions, speculation, kill
`TaskAttemptImpl`	`TaskAttemptEventType`	Container events, umbilical events
`TaskSchedulerManager`	`AMSchedulerEventType`	New requests, completions, container availability
`AMContainerImpl`	`AMContainerEventType`	Launch, assignment, completion
`HistoryEventHandler`	`HistoryEventType`	Any history-loggable change

Each enum lives next to its event class. For the DAG, vertex, task, and task-attempt entities that is the dag/event package:

ls tez-dag/src/main/java/org/apache/tez/dag/app/dag/event/*EventType.java

Reading exercise

# Catalog
find tez-dag/src/main/java/org/apache/tez/dag/app -name "*Event.java" \
  | head -40

# Find a transition that emits other events
grep -B2 -A15 "class CommitCompletedTransition" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

# Find AMSchedulerEvent emit sites
grep -rn "AMSchedulerEvent" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ | head

# Compare: emit vs direct mutation
grep -n "eventHandler.handle" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | wc -l

Answer:

Why do events carry IDs (e.g., TezTaskID) rather than object references?
Find an event that crosses subsystems: e.g., TaskAttemptImpl emitting an AMSchedulerEvent. What is the receiver and what action does it take?
List the four classes of events that VertexImpl.handle reacts to and the three classes it emits.
How does the AM ensure ordering when multiple events for the same entity are emitted in quick succession?
What happens if a transition handler throws an uncaught exception? Which thread catches it?
Find one event that has no consumers (dead code). If you find one, propose its removal in a JIRA.

Common bugs and symptoms

Symptom	Root cause	Where to look
Inconsistent state visible to two getters	Direct mutation outside dispatcher	Audit for setters called from non-handler code
Event "lost" — entity never sees it	Forgot to register handler in `DAGAppMaster.serviceInit`	Add registration; add unit test
Replay during recovery diverges from original run	An event was emitted but not recorded (recovery log gap)	`RecoveryService` writer filter
Deadlock when one entity event handler tries to read another entity	Reader path uses a lock held elsewhere	Prefer event-emit over cross-entity reads
Test hangs in `DrainDispatcher.await()`	Transition emitted an event of a type with no handler in test	Register the missing handler (no-op is fine)
One subsystem floods the dispatcher	Storm of small events (e.g., per-heartbeat)	Batch in the emitter; or upgrade to a separate dispatcher

Validation: prove you understand this

Pick one transition in TaskAttemptImpl and trace every event it emits; for each, name the receiving entity.
Open DAGAppMaster and list every event type registered, in order.
Walk a V_KILL from DAGImpl.killDAG down to a TaskAttemptImpl actually shutting down its container.
Write a unit test that triggers a transition with an event whose payload is malformed; verify the dispatcher logs the error without crashing.
Explain why moving from AsyncDispatcher to a multi-threaded dispatcher would break Tez and what would have to change to support it.

IPO Abstractions

Input, Processor, Output (collectively "IPO") are the three core runtime contracts. A Tez task is built from one processor, zero or more inputs, and zero or more outputs. This chapter walks the abstractions, the distinction between the LogicalInput/LogicalOutput layer (rich, modern) and the plain Input/Output layer (used for raw byte pipelines), the lifecycle methods, merged inputs, root vs intermediate inputs, and the minimum skeleton needed to write a new input or output.

After this chapter you should be able to read any concrete IPO class in tez-runtime-library and explain what each lifecycle method is for.

The interfaces

tez-api/src/main/java/org/apache/tez/runtime/api/
  Input.java
  Output.java
  Processor.java
  LogicalInput.java                          (extends Input)
  LogicalOutput.java                         (extends Output)
  LogicalIOProcessor.java                    (extends Processor)
  AbstractLogicalInput.java                  (base class for custom inputs)
  AbstractLogicalOutput.java
  AbstractLogicalIOProcessor.java
  Reader.java                                (the byte/record stream interface)
  Writer.java
  MergedLogicalInput.java                    (combines multiple inputs)
  InputContext.java
  OutputContext.java
  ProcessorContext.java
  Event.java                                 (DataMovementEvent, etc.)

grep -n "^public " tez-api/src/main/java/org/apache/tez/runtime/api/LogicalInput.java

Plain `Input`/`Output` vs `LogicalInput`/`LogicalOutput`

Layer	Class	Why it exists
Low-level	`Input`	Bare contract: provides a `Reader`
Low-level	`Output`	Bare contract: provides a `Writer`
High-level	`LogicalInput`	Adds events, lifecycle, knowledge of upstream completion
High-level	`LogicalOutput`	Adds events (to AM and downstream)

Almost all production inputs/outputs are LogicalInput/LogicalOutput. The plain layer exists for primitive byte-stream cases (rarely used directly).

Lifecycle methods (LogicalInput)

public abstract class AbstractLogicalInput implements LogicalInput {
  // Called by the runtime when the task starts. Setup; no I/O yet.
  public abstract List<Event> initialize() throws Exception;

  // Called after `initialize` for *all* inputs has completed.
  // Begin actively pulling data.
  public abstract void start() throws Exception;

  // The processor calls this to get a Reader for this input.
  public abstract Reader getReader() throws Exception;

  // Handle data movement / control events from the AM (e.g., upstream task done).
  public abstract void handleEvents(List<Event> inputEvents) throws Exception;

  // Final cleanup; close streams; return any final events.
  public abstract List<Event> close() throws Exception;
}

Order in a task's life:

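constructor -> initialize -> start -> getReader -> close

The context is passed to the constructor, so there is no separate setContext step. handleEvents sits outside this chain; it runs whenever the AM delivers events to the input.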

initialize returns events to the AM (for example, InputInitializerEvents that ask the AM to do more split work). Most inputs return an empty list.
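The call order is easier to remember once you have seen a driver enforce it. The following is a Tez-free sketch: SketchInput, RecordingInput, and runTask are hypothetical names, the "events" are plain strings, and the driver assumes the all-initialize-then-all-start ordering described in the comments above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the call order a task runtime follows.
final class LifecycleSketch {
  interface SketchInput {
    List<String> initialize();    // returns events destined for the AM
    void start();
    Iterable<String> getReader();
    List<String> close();
  }

  // Records each lifecycle call so the order can be inspected afterwards.
  static final class RecordingInput implements SketchInput {
    final List<String> calls = new ArrayList<>();

    public List<String> initialize() {
      calls.add("initialize");
      return Collections.emptyList();
    }
    public void start() { calls.add("start"); }
    public Iterable<String> getReader() {
      calls.add("getReader");
      return Collections.singletonList("hello");
    }
    public List<String> close() {
      calls.add("close");
      return Collections.emptyList();
    }
  }

  // Drives every input through the lifecycle; returns the records read.
  static List<String> runTask(List<? extends SketchInput> inputs) {
    List<String> records = new ArrayList<>();
    for (SketchInput in : inputs) { in.initialize(); }  // all inputs first
    for (SketchInput in : inputs) { in.start(); }       // then start each
    for (SketchInput in : inputs) {                     // the "processor" reads
      for (String record : in.getReader()) { records.add(record); }
    }
    for (SketchInput in : inputs) { in.close(); }       // cleanup last
    return records;
  }
}
```

Running two RecordingInput instances through runTask leaves each with the call list initialize, start, getReader, close.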

Lifecycle methods (LogicalOutput)

Mirror of input:

public abstract class AbstractLogicalOutput implements LogicalOutput {
  public abstract List<Event> initialize() throws Exception;
  public abstract void start() throws Exception;
  public abstract Writer getWriter() throws Exception;
  public abstract void handleEvents(List<Event> outputEvents) throws Exception;
  public abstract List<Event> close() throws Exception;
}

The close of an output is the most consequential call: it flushes pending data and returns the CompositeDataMovementEvent (or VertexManagerEvent) objects that tell the AM, and through it the downstream vertices, what this output produced.

Root inputs vs intermediate inputs

Kind	Source of data	Initializer runs where?
Root input	External (HDFS, HBase, Kafka)	AM-side: `InputInitializer` enumerates splits, emits `InputDataInformationEvent`s
Intermediate input	Upstream Tez vertex output	No initializer; data arrives via `DataMovementEvent` from the AM

MRInput is the canonical root input. Its AM-side initializer (MRInputAMSplitGenerator) calls InputFormat.getSplits(...) and pushes the resulting splits to tasks.

Intermediate inputs (e.g., OrderedGroupedKVInput) receive their data descriptors from the AM via DataMovementEvents — one event per upstream task completion, carrying the upstream task's location and partition.

MergedLogicalInput

When a vertex has multiple physical inputs that should look like one to the processor (e.g., a vertex group union), Tez wraps them in a MergedLogicalInput:

grep -n "MergedLogicalInput\|getInputs\|getReader" \
  tez-api/src/main/java/org/apache/tez/runtime/api/MergedLogicalInput.java

The processor calls getReader() once; the merged input combines all underlying readers. Common subclasses live in tez-runtime-library:

OrderedGroupedMergedKVInput — merge K/V streams preserving sort order.
ConcatenatedMergedKeyValueInput — concatenate.
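The two strategies differ only in how they interleave the underlying readers. Here is a rough sketch over lists of integers; MergedReaderSketch is a hypothetical class, and the heap-based merge is a standard k-way merge chosen for illustration, not a description of the Tez implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of two ways to present several readers as one.
final class MergedReaderSketch {
  // Concatenation: drain each input fully, in the order given.
  static List<Integer> concatenate(List<List<Integer>> inputs) {
    List<Integer> out = new ArrayList<>();
    for (List<Integer> input : inputs) {
      out.addAll(input);
    }
    return out;
  }

  // Ordered merge: each input is already sorted; a heap of "current heads"
  // yields the globally smallest remaining key at every step.
  static List<Integer> orderedMerge(List<List<Integer>> sortedInputs) {
    final class Head {
      final int value;
      final Iterator<Integer> rest;
      Head(int value, Iterator<Integer> rest) {
        this.value = value;
        this.rest = rest;
      }
    }
    PriorityQueue<Head> heap =
        new PriorityQueue<>((x, y) -> Integer.compare(x.value, y.value));
    for (List<Integer> input : sortedInputs) {
      Iterator<Integer> it = input.iterator();
      if (it.hasNext()) {
        heap.add(new Head(it.next(), it));
      }
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      Head head = heap.poll();
      out.add(head.value);
      if (head.rest.hasNext()) {
        heap.add(new Head(head.rest.next(), head.rest));
      }
    }
    return out;
  }
}
```

For inputs [1, 4, 7], [2, 5], [3, 6], concatenate yields 1, 4, 7, 2, 5, 3, 6 while orderedMerge yields 1 through 7 in order. A processor that groups by key needs the second behavior.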

Events flowing between AM and task

Event class	Direction	Carries
`DataMovementEvent`	AM → task input	Source task index, source URL/path, partition
`InputReadErrorEvent`	task input → AM	"This source URL is broken, please re-route"
`CompositeDataMovementEvent`	task output → AM (then forwarded)	Bulk version of `DataMovementEvent`
`InputDataInformationEvent`	AM → task input	Concrete split (root inputs only)
`InputInitializerEvent`	task → AM (initializer)	Custom signal to the initializer
`VertexManagerEvent`	task output → AM (vertex manager)	Stats for auto-parallelism (`ShuffleVertexManager`)

ls tez-api/src/main/java/org/apache/tez/runtime/api/events/

Minimal LogicalInput skeleton

package com.example;

import org.apache.tez.runtime.api.*;
import org.apache.tez.runtime.api.events.*;
import java.io.IOException;
import java.util.Collections;
import java.util.List;

public class HelloLogicalInput extends AbstractLogicalInput {

  public HelloLogicalInput(InputContext ctx, int physicalInputCount) {
    super(ctx, physicalInputCount);
  }

  @Override
  public List<Event> initialize() throws IOException {
    // Allocate resources here. Do not do I/O.
    return Collections.emptyList();
  }

  @Override
  public void start() throws IOException {
    // Begin background fetch threads if any.
  }

  @Override
  public Reader getReader() throws IOException {
    // Return a Reader for the processor. SimpleStringReader is a hypothetical
    // Reader subclass that you write yourself; it is not part of Tez.
    return new SimpleStringReader("hello");
  }

  @Override
  public void handleEvents(List<Event> events) throws IOException {
    // Receive DataMovementEvents from the AM. Build internal routing.
  }

  @Override
  public List<Event> close() throws IOException {
    return Collections.emptyList();
  }
}

Real implementations to read for reference:

find tez-runtime-library/src/main/java -name "OrderedGrouped*Input*.java"
find tez-runtime-library/src/main/java -name "Unordered*Input.java"

Reading exercise

sed -n '1,140p' tez-api/src/main/java/org/apache/tez/runtime/api/LogicalInput.java
sed -n '1,140p' tez-api/src/main/java/org/apache/tez/runtime/api/LogicalOutput.java
grep -rn "extends AbstractLogicalInput" tez-runtime-library/src/main/java | head
grep -rn "extends AbstractLogicalOutput" tez-runtime-library/src/main/java | head

# Event flow
ls tez-api/src/main/java/org/apache/tez/runtime/api/events/

Answer:

What is the ordering guarantee between initialize() calls across the multiple inputs/outputs of a task?
When does start() get called relative to getReader()?
What's the difference in return semantics between getReader() of a LogicalInput vs a MergedLogicalInput?
Find one concrete LogicalOutput; identify what event types its close() returns and what downstream effect each has.
Why does initialize() return List<Event> instead of void?
What is the difference between InputInitializerEvent and InputDataInformationEvent? Who emits each?

Common bugs and symptoms

Symptom	Root cause	Where to look
Task hangs in `getReader()`	Input's `start()` never returned; deadlock with handler	Always make `start()` non-blocking
`NullPointerException` in `handleEvents`	Events arrived before `initialize()`; you're using a field not yet set	Allocate state in `initialize()`
Downstream sees half the data	Output's `close()` returned `Collections.emptyList()` when it should have emitted `DataMovementEvent`s	Always emit completion events
Custom input never receives `DataMovementEvent`s	EdgeManager on the upstream side not aware of your partitioning	Check edge property `OutputDescriptor` matches your `InputDescriptor`
Root input never starts	Initializer's `handleInputInitializerEvent` not implemented	Provide a default; never silently drop
Task succeeds but produces no output	`Writer` was never flushed (forgot `close()`)	Verify with `IFile` size = 0 in logs

Validation: prove you understand this

Write a minimal LogicalInput that produces 100 fixed strings via its Reader. Wire it into a one-vertex DAG and run on MiniTezCluster.
From OrderedGroupedKVInput, identify exactly when handleEvents is called and what it does with each event.
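List every event class in org.apache.tez.runtime.api.events and state who emits and who consumes each (the exact count depends on the Tez version you have checked out).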
Diagram the events flowing from one upstream task's LogicalOutput.close() to a downstream task's LogicalInput.handleEvents().
Explain why initialize() is split from start() rather than collapsed into a single method.

Logical vs Physical Plan

Tez exposes two planes to the user and to the runtime:

Logical plan — what the application author writes: vertices, edges, edge properties. Lives in tez-api. Immutable once submitted (mostly).
Physical plan — what the AM actually schedules: task instances per vertex, per-edge routing decisions, container assignments. Lives in tez-dag. Mutable at runtime via VertexManager reconfiguration and EdgeManagerPlugin routing.

This chapter walks the boundary between them.

The logical plane

A logical DAG is a DAG object containing Vertex objects connected by Edge objects, each carrying an EdgeProperty.

ls tez-api/src/main/java/org/apache/tez/dag/api/ | head -30

Key classes:

Class	File	Purpose
`DAG`	`tez-api/src/main/java/org/apache/tez/dag/api/DAG.java`	The DAG builder.
`Vertex`	`tez-api/src/main/java/org/apache/tez/dag/api/Vertex.java`	Logical vertex with processor + target parallelism.
`Edge`	`tez-api/src/main/java/org/apache/tez/dag/api/Edge.java`	Logical edge between two vertices.
`EdgeProperty`	`tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java`	Routing + scheduling + storage characteristics.

`EdgeProperty` — four orthogonal axes

grep -n "enum DataMovementType\|enum DataSourceType\|enum SchedulingType" \
  tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java

public enum DataMovementType { ONE_TO_ONE, BROADCAST, SCATTER_GATHER, CUSTOM }
public enum DataSourceType   { PERSISTED, PERSISTED_RELIABLE, EPHEMERAL }
public enum SchedulingType   { SEQUENTIAL, CONCURRENT }

Axis	Values	What it controls
`DataMovementType`	`ONE_TO_ONE`, `BROADCAST`, `SCATTER_GATHER`, `CUSTOM`	How source outputs map to destination inputs.
`DataSourceType`	`PERSISTED`, `PERSISTED_RELIABLE`, `EPHEMERAL`	Whether outputs survive a task failure; affects re-execution policy.
`SchedulingType`	`SEQUENTIAL`, `CONCURRENT`	Whether destination can start before source completes (required for pipelined shuffle and broadcast).
`OutputDescriptor` / `InputDescriptor`	class names + payloads	The IO classes wired on each side of the edge.

A logical edge says nothing about which destination task index reads from which source task index. That decision belongs to the EdgeManagerPlugin.

The physical plane

When the AM initializes a DAG it builds:

VertexImpl per logical Vertex (tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java)
TaskImpl[] per vertex, sized by parallelism (tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java)
TaskAttemptImpl per attempt of each task (tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java)
Edge runtime objects with an active EdgeManagerPlugin (tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/Edge.java)

wc -l tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/{VertexImpl,TaskImpl,TaskAttemptImpl,Edge}.java

Mapping logical to physical

flowchart LR
  subgraph logical[Logical]
    LV1[Vertex A parallelism=3]
    LV2[Vertex B parallelism=2]
    LV1 -- "EdgeProperty SCATTER_GATHER" --> LV2
  end
  subgraph physical[Physical]
    A0[A.0] --> B0[B.0]
    A0 --> B1[B.1]
    A1[A.1] --> B0
    A1 --> B1
    A2[A.2] --> B0
    A2 --> B1
  end
  logical --> physical

On this SCATTER_GATHER edge, every source attempt produces one partition for every destination task. The EdgeManagerPlugin decides which output partition feeds which destination input.

`EdgeManagerPlugin` — the routing brain

find tez-api/src/main/java -name "EdgeManagerPlugin.java"
cat tez-api/src/main/java/org/apache/tez/dag/api/EdgeManagerPlugin.java

The contract (paraphrased):

public abstract class EdgeManagerPlugin {
  public abstract void routeDataMovementEventToDestination(
      DataMovementEvent event,
      int srcTaskIndex,
      int outputIndex,
      Map<Integer, List<Integer>> destTaskAndInputIndices);

  public abstract int getNumDestinationConsumerTasks(int srcTaskIndex);
  public abstract int getDestinationConsumerTaskNumber(int srcTaskIndex,
                                                       int srcOutputIndex);
  public abstract int getNumDestinationTaskPhysicalInputs(int destTaskIndex);
  public abstract int getNumSourceTaskPhysicalOutputs(int srcTaskIndex);
}

Built-in implementations

find tez-dag/src/main/java -name "*EdgeManager*.java"

Plugin	DataMovementType	Routing rule
`ScatterGatherEdgeManager`	`SCATTER_GATHER`	Source task `i` produces N partitions; destination `d` reads partition `d` from every source.
`BroadcastEdgeManager`	`BROADCAST`	Every source output is consumed by every destination task.
`OneToOneEdgeManager`	`ONE_TO_ONE`	Requires `srcParallelism == destParallelism`. Source `i` → destination `i`.
User-supplied	`CUSTOM`	Anything. Cartesian product, range partitioning, etc.

Inspecting routing on a live AM

grep -n "EdgeManager\|setCustomEdgeManager" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/Edge.java | head -20

For each destination task, Edge.sendTezEventToDestinationTasks() consults the plugin to expand source outputs into per-destination input events. The destination task receives a DataMovementEvent per logical input partition.

SCATTER_GATHER walkthrough

Source: A with parallelism 3, each task emits N partitions. Destination: B with parallelism 2.

cat tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/ScatterGatherEdgeManager.java

For source task A.1 emitting partitions [0, 1]:

Call	Returns
`getNumSourceTaskPhysicalOutputs(1)`	2 (= destination parallelism)
`getNumDestinationTaskPhysicalInputs(0)`	3 (= source parallelism)
`getNumDestinationConsumerTasks(1)`	2
`routeDataMovementEventToDestination(event, 1, 0, out)`	`out = { 0 -> [1] }`
`routeDataMovementEventToDestination(event, 1, 1, out)`	`out = { 1 -> [1] }`

Invariant: numSrcOutputs == destParallelism, numDestInputs == srcParallelism.
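The table can be reproduced with a small plain-Java model of the rule. ScatterGatherSketch is a hypothetical class that borrows the method names from the paraphrased contract; it is not the Tez ScatterGatherEdgeManager and ignores auto-parallelism.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the scatter-gather rule; not the Tez class.
final class ScatterGatherSketch {
  private final int srcParallelism;
  private final int destParallelism;

  ScatterGatherSketch(int srcParallelism, int destParallelism) {
    this.srcParallelism = srcParallelism;
    this.destParallelism = destParallelism;
  }

  // One output partition per destination task.
  int getNumSourceTaskPhysicalOutputs(int srcTaskIndex) { return destParallelism; }

  // One input slot per source task.
  int getNumDestinationTaskPhysicalInputs(int destTaskIndex) { return srcParallelism; }

  int getNumDestinationConsumerTasks(int srcTaskIndex) { return destParallelism; }

  // Partition p of source task s lands on destination task p, input slot s.
  Map<Integer, List<Integer>> route(int srcTaskIndex, int outputIndex) {
    Map<Integer, List<Integer>> out = new HashMap<>();
    out.put(outputIndex, Collections.singletonList(srcTaskIndex));
    return out;
  }
}
```

With new ScatterGatherSketch(3, 2), route(1, 0) gives {0=[1]} and route(1, 1) gives {1=[1]}, matching the last two rows of the table.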

ONE_TO_ONE walkthrough

cat tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/OneToOneEdgeManager.java

Requires numSrcTasks == numDestTasks. Each source produces exactly one partition consumed by exactly one destination of the same index.

Common bug: changing destination parallelism via reconfigureVertex while a ONE_TO_ONE edge feeds it. Tez throws at edge initialization.

BROADCAST walkthrough

cat tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/BroadcastEdgeManager.java

Source emits a single logical output. Every destination task receives one input event per source task. Cost scales as srcParallelism * destParallelism — large broadcast vertices are an antipattern.
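The quadratic cost is easy to see once the fan-out is written down. The sketch below is a hypothetical model (BroadcastSketch is not the Tez BroadcastEdgeManager): it routes one source output to every destination and counts total deliveries across the edge.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of broadcast fan-out and its cost.
final class BroadcastSketch {
  // One source output goes to every destination task; the input slot on
  // each destination is the source task index.
  static Map<Integer, List<Integer>> route(int srcTaskIndex, int destParallelism) {
    Map<Integer, List<Integer>> out = new LinkedHashMap<>();
    for (int dest = 0; dest < destParallelism; dest++) {
      List<Integer> slots = new ArrayList<>();
      slots.add(srcTaskIndex);
      out.put(dest, slots);
    }
    return out;
  }

  // Total routed events, and copies of source output fetched, across the edge.
  static long totalDeliveries(int srcParallelism, int destParallelism) {
    return (long) srcParallelism * destParallelism;
  }
}
```

totalDeliveries(500, 200) is 100000: a 500-task broadcast source feeding 200 consumers means one hundred thousand fetches of data that never changes.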

CUSTOM walkthrough — `CartesianProductEdgeManager`

find . -path "*/src/main/java/*" -name "CartesianProductEdgeManager*.java"
wc -l $(find . -path "*/src/main/java/*" -name "CartesianProductEdgeManager*.java")

CartesianProductVertexManager chunks source outputs and creates a 2D grid of destination tasks; the edge manager projects (srcChunkX, srcChunkY) → destIndex.
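The projection itself is ordinary grid arithmetic. The sketch below assumes a row-major layout of the destination grid, which is an assumption made for illustration; check the real plugin before relying on any particular ordering. CartesianGridSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Assumes a row-major destination grid; the real plugin's layout may differ.
final class CartesianGridSketch {
  private final int chunksX;  // chunks of the first source
  private final int chunksY;  // chunks of the second source

  CartesianGridSketch(int chunksX, int chunksY) {
    if (chunksX <= 0 || chunksY <= 0) {
      throw new IllegalArgumentException("grid dimensions must be positive");
    }
    this.chunksX = chunksX;
    this.chunksY = chunksY;
  }

  int numDestinationTasks() { return chunksX * chunksY; }

  // (x, y) -> destination task index, row-major.
  int destIndex(int x, int y) { return x * chunksY + y; }

  // Destinations that need chunk x of the first source: the whole row x.
  List<Integer> destinationsForRow(int x) {
    List<Integer> out = new ArrayList<>();
    for (int y = 0; y < chunksY; y++) {
      out.add(destIndex(x, y));
    }
    return out;
  }

  // Destinations that need chunk y of the second source: the whole column y.
  List<Integer> destinationsForColumn(int y) {
    List<Integer> out = new ArrayList<>();
    for (int x = 0; x < chunksX; x++) {
      out.add(destIndex(x, y));
    }
    return out;
  }
}
```

On a 2 x 3 grid there are 6 destination tasks; row 1 maps to destinations 3, 4, 5 and column 2 maps to destinations 2 and 5.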

The CUSTOM movement type is the contract by which Hive ships its own routing for unconventional joins.

Runtime mutation: parallelism reconfiguration

A logical Vertex declares a target parallelism; the physical parallelism can change before the vertex starts running, via the VertexManager:

grep -n "reconfigureVertex" tez-api/src/main/java/org/apache/tez/dag/api/VertexManagerPluginContext.java

reconfigureVertex(int parallelism, VertexLocationHint, Map<String,EdgeProperty>) does three things in one atomic step inside VertexImpl:

Resizes the TaskImpl[] array (must happen before any task is scheduled).
Re-installs EdgeManagerPlugin instances on incoming edges.
Updates location hints used by the scheduler.

Read the state machine guard:

grep -n "reconfigureVertex\|VertexState.INITED\|VertexState.INITIALIZING" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -30

Reconfiguration is illegal once any task has been scheduled.

Worked example: `ShuffleVertexManager` auto-parallelism

Vertex R declared with parallelism = 100 (pessimistic upper bound).
Upstream tasks emit VertexManagerEvent payloads with byte counts per partition.
ShuffleVertexManager.onVertexManagerEventReceived accumulates totals.
After the slow-start threshold, it computes target = ceil(totalBytes / desiredTaskInputSize) clamped to [minParallelism, originalParallelism].
Calls reconfigureVertex(target, null, updatedEdgeProps).
VertexImpl resizes from 100 → e.g. 17 task instances.
The edge manager on the incoming SCATTER_GATHER edge is rebuilt to route 100-partition outputs into 17 destinations (merging at the destination).
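The sizing step in the list above is a ceiling division followed by a clamp. A minimal sketch, with hypothetical names (AutoParallelismSketch, targetParallelism) and byte figures chosen only to reproduce the 100 to 17 example:

```java
// Sketch of the sizing formula only; not ShuffleVertexManager itself.
final class AutoParallelismSketch {
  // target = ceil(totalBytes / desiredTaskInputSize),
  // clamped to [minParallelism, originalParallelism].
  static int targetParallelism(long totalBytes, long desiredTaskInputSize,
                               int minParallelism, int originalParallelism) {
    if (desiredTaskInputSize <= 0) {
      throw new IllegalArgumentException("desiredTaskInputSize must be positive");
    }
    // Integer ceiling without floating point.
    long needed = (totalBytes + desiredTaskInputSize - 1) / desiredTaskInputSize;
    long clamped = Math.max(minParallelism, Math.min(originalParallelism, needed));
    return (int) clamped;
  }
}
```

With 1,700,000,000 bytes observed, a desired input size of 100 MiB (104,857,600 bytes), a minimum of 1, and an original parallelism of 100, the result is 17. Fifty gigabytes under the same settings would ask for 477 tasks and be clamped back to 100, because reconfiguration can shrink a vertex but never grow it past what was declared.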

Reading exercise

grep -n "createEdgeManager\|edgeManager =" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/Edge.java — when is the EdgeManagerPlugin instantiated?
cat tez-api/src/main/java/org/apache/tez/dag/api/EdgeProperty.java | head -100 — list all factory methods on EdgeProperty. Which require an EdgeManagerPluginDescriptor?
grep -n "setParallelism\|setVertexParallelism" tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head — which state transitions accept a parallelism change?
grep -rn "OneToOneEdgeManager\|ScatterGatherEdgeManager\|BroadcastEdgeManager" tez-dag/src/test — list the unit tests covering each built-in routing.
cat $(find . -path "*/src/main/java/*" -name "CartesianProductEdgeManager.java") | head -120 — what state must this plugin keep across destination task initializations?
grep -n "EdgeProperty\." ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java | head — which edge property combinations does Hive build?

Answer these:

For SCATTER_GATHER what is the size of the output partition array of one source task?
For ONE_TO_ONE, what happens if the upstream vertex auto-parallelizes from 100 → 17 after the destination has been initialized?
For BROADCAST, what is the data volume amplification?
Which EdgeManager methods are called on every destination task init, and which once per edge?

Common bugs and symptoms

Symptom	Likely cause
`Vertex failed: Cannot change parallelism after tasks scheduled`	`VertexManager.reconfigureVertex` invoked after `scheduleTasks`. Fix ordering.
`OneToOneEdgeManager: srcParallelism != destParallelism`	Auto-parallelism broke the `ONE_TO_ONE` invariant. Forbid auto-parallelism on `ONE_TO_ONE` edges.
Destination task receives 0 `DataMovementEvent`s	Custom `EdgeManagerPlugin` returned 0 from `getNumDestinationTaskPhysicalInputs`.
Hive query produces wrong row counts after a custom join	`CUSTOM` `EdgeManagerPlugin` mis-routed partitions; fence-post bug in `routeDataMovementEventToDestination`.
`BROADCAST` edge OOMs the destination	Source parallelism × payload size exceeds destination heap; switch to `PERSISTED` source type and stream from disk.

Validation: prove you understand this

Given Vertex A (parallelism=4) SCATTER_GATHER→ Vertex B (parallelism=3), compute the number of DataMovementEvents B.1 receives. Show the work.
Explain in one sentence each: when does EdgeManagerPlugin get re-instantiated, and when does it survive across reconfiguration?
Write a one-paragraph rejection of "let's just use BROADCAST for our 500-task lookup vertex" referencing concrete cost.
Identify the exact line in VertexImpl.java where reconfigureVertex is rejected if tasks have been scheduled. Cite path + line number from grep -n.
Sketch a CUSTOM EdgeManagerPlugin for range-partitioned merge: source task i emits keys in range [i*R, (i+1)*R); the destination is K tasks where K may differ from source parallelism. Define getNumDestinationTaskPhysicalInputs and the routing rule in code.

Shuffle and Sort

The shuffle layer is where Tez moves data between vertices. It splits into two halves, both living in tez-runtime-library:

Sort path — producer side: partition, sort, spill, merge. OrderedPartitionedKVOutput → PipelinedSorter / DefaultSorter → IFile segments on local disk.
Shuffle path — consumer side: fetch, merge, iterate. ShuffleManager → Fetcher → FetchedInput → MergeManager → ValuesIterator.

Between them sits the YARN ShuffleHandler aux service inside the NodeManager that serves spilled segments over HTTP.

ls tez-runtime-library/src/main/java/org/apache/tez/runtime/library/

The producer side

`OrderedPartitionedKVOutput`

find tez-runtime-library/src/main/java -name "OrderedPartitionedKVOutput.java"
wc -l $(find tez-runtime-library/src/main/java -name "OrderedPartitionedKVOutput.java")

The output that powers MapReduce-style shuffles. Lifecycle:

initialize() — creates a Sorter (Pipelined or Default), allocates tez.runtime.io.sort.mb of byte buffer, registers as a MemoryUpdateCallback with the MemoryDistributor.
getWriter() — returns a KeyValueWriter that delegates to the sorter.
close() — calls sorter.flush() to merge spills into final segments and emits CompositeDataMovementEvent per partition with offsets into the merged file.

Two sorter implementations

find tez-runtime-library/src/main/java -name "PipelinedSorter.java" \
                                       -o -name "DefaultSorter.java"

Sorter	Strategy	When to pick
`DefaultSorter`	Single in-memory accumulator; quicksort by `(partition, key)`; spill when buffer crosses `tez.runtime.sort.spill.percent`; final merge of all spills.	MapReduce parity, conservative memory.
`PipelinedSorter`	Multi-buffer accumulator; concurrent spill thread; per-partition sort and merge; final spill writes the merged output in one pass.	Large outputs, faster; default in Hive.

Configuration knobs:

Key	Default	Effect
`tez.runtime.io.sort.mb`	100	Sort buffer in MB. Reused for both sorters.
`tez.runtime.sort.spill.percent`	0.8	Threshold to start spilling (DefaultSorter).
`tez.runtime.sorter.class`	`PIPELINED`	`PIPELINED` or `LEGACY` (DefaultSorter).
`tez.runtime.compress`	false	Per-segment compression.
`tez.runtime.compress.codec`	DefaultCodec	Snappy, Lz4, Gzip.
`tez.runtime.combiner.class`	unset	Combiner run during spill merge.

IFile on-disk format

IFile is the segment format both sorters write.

find tez-runtime-library/src/main/java -name "IFile.java"
grep -n "class Writer\|class Reader\|EOF_MARKER\|writeKVPair" \
  tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/IFile.java | head -30

Per-record layout:

+---------------+---------------+----------------+------------------+
| keyLen (VInt) | valLen (VInt) | key bytes (KL) | value bytes (VL) |
+---------------+---------------+----------------+------------------+

End of segment:

keyLen = -1, valLen = -1   // EOF_MARKER

If compression is enabled, the bytes between the partition header and EOF_MARKER are compressed; the record framing is inside the compressed stream.
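The framing is simple enough to prototype. The sketch below writes and reads length-prefixed records terminated by the (-1, -1) marker. It deliberately uses fixed 4-byte ints where the real format uses VInts, and it has no header, checksum, or compression, so it illustrates the framing idea only and is not byte-compatible with IFile. RecordFramingSketch is a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Framing sketch: fixed 4-byte lengths stand in for VInts.
final class RecordFramingSketch {
  static final int EOF_MARKER = -1;

  static byte[] writeSegment(List<String[]> keyValues) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (String[] kv : keyValues) {
        byte[] key = kv[0].getBytes(StandardCharsets.UTF_8);
        byte[] val = kv[1].getBytes(StandardCharsets.UTF_8);
        out.writeInt(key.length);  // keyLen
        out.writeInt(val.length);  // valLen
        out.write(key);
        out.write(val);
      }
      out.writeInt(EOF_MARKER);    // keyLen = -1
      out.writeInt(EOF_MARKER);    // valLen = -1
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static List<String[]> readSegment(byte[] segment) {
    List<String[]> out = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(segment))) {
      while (true) {
        int keyLen = in.readInt();
        int valLen = in.readInt();
        if (keyLen == EOF_MARKER && valLen == EOF_MARKER) {
          return out;              // clean end of segment
        }
        byte[] key = new byte[keyLen];
        byte[] val = new byte[valLen];
        in.readFully(key);
        in.readFully(val);
        out.add(new String[] {
            new String(key, StandardCharsets.UTF_8),
            new String(val, StandardCharsets.UTF_8)});
      }
    } catch (IOException e) {
      // A truncated segment (missing marker) surfaces here as EOFException.
      throw new UncheckedIOException(e);
    }
  }
}
```

The explicit marker is what lets a reader tell "segment finished" apart from "stream cut short": a segment without it fails loudly instead of returning a partial record list.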

A sorter writes one IFile segment per partition per spill. After the final merge, the sorter leaves one file per output with an *.index sibling that records (startOffset, rawLength, partLength) per partition, where partLength is the on-disk (possibly compressed) length.

find tez-runtime-library/src/main/java -name "TezSpillRecord.java"
grep -n "rawLength\|partLength\|compressedLength" \
  $(find tez-runtime-library/src/main/java -name "TezSpillRecord.java")

The index file is what ShuffleHandler reads when a fetcher asks for partition p of source attempt (vertex, task, attempt).

Combiner integration

Both sorters honor tez.runtime.combiner.class. The combiner is invoked during the merge step (not during accumulation), running over sorted runs:

grep -n "combiner\|combineAndSpill\|runCombiner" \
  $(find tez-runtime-library/src/main/java -name "DefaultSorter.java")

A correct combiner is associative and commutative on the value space; Tez gives no guarantee on how many merge phases run it.
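A quick way to test a candidate combiner is to apply it under two different groupings and compare. The sketch below does that for a sum (safe) and a mean (unsafe). CombinerSketch is a hypothetical helper, unrelated to the Tez Combiner interface.

```java
import java.util.List;

// Hypothetical helper for reasoning about combiner safety.
final class CombinerSketch {
  // Sum is associative and commutative: applying it zero, one, or many
  // times over any grouping gives the same final total.
  static long sum(List<Long> values) {
    long total = 0;
    for (long v : values) {
      total += v;
    }
    return total;
  }

  // Mean is not a valid combiner: a mean of partial means depends on
  // how the values happened to be grouped.
  static double mean(List<Double> values) {
    double total = 0;
    for (double v : values) {
      total += v;
    }
    return total / values.size();
  }
}
```

Summing 1, 2, 3, 4 gives 10 whether done in one pass or as sum(sum(1, 2), sum(3, 4)). The mean of 1, 2, 3, 6 is 3.0, but grouping it as mean(mean(1), mean(2, 3, 6)) gives about 2.33. To combine averages correctly, carry (sum, count) pairs through the combiner and divide only at the end.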

Spill walkthrough

sequenceDiagram
  participant P as Processor
  participant W as KeyValueWriter
  participant S as Sorter (Pipelined)
  participant D as Local disk
  P->>W: write(K, V) [N times]
  W->>S: collect into KV buffer
  S->>S: buffer crosses sort.spill.percent
  S->>D: spill_0.out (partitioned, sorted)
  S->>D: spill_0.out.index
  Note over S: continue accepting writes into next buffer
  P->>W: close()
  W->>S: flush()
  S->>D: merge spill_0..spill_N -> file.out + file.out.index
  S-->>P: CompositeDataMovementEvent per partition

The consumer side

`OrderedGroupedKVInput` and `ShuffleManager`

find tez-runtime-library/src/main/java -name "OrderedGroupedKVInput.java"
find tez-runtime-library/src/main/java -name "ShuffleManager.java"
find tez-runtime-library/src/main/java -name "Shuffle.java"

OrderedGroupedKVInput.initialize() constructs Shuffle which holds:

ShuffleManager — pool of Fetcher threads and inbound event queue.
MergeManager — receives FetchedInputs, decides in-memory vs disk placement, kicks off background merges.
ValuesIterator — the reader the processor sees.

Fetcher

find tez-runtime-library/src/main/java -name "Fetcher.java"
wc -l $(find tez-runtime-library/src/main/java -name "Fetcher.java")

A Fetcher is a thread that connects via HTTP to the NodeManager ShuffleHandler running on the source task's node:

GET /mapOutput?job=<jobId>&dag=<dagId>&reduce=<partition>&map=<attempt1,attempt2,...>

Multi-map response: ShuffleHandler streams all requested attempts back-to-back, each prefixed with a header (MapOutputInfo). The Fetcher reads the header, decides if the payload fits in memory (MergeManager.reserve), and either writes to an in-memory buffer or directly to disk.

Key configs:

Key	Default	Effect
`tez.runtime.shuffle.parallel.copies`	20	Fetcher thread count per task.
`tez.runtime.shuffle.connect.timeout`	30000	HTTP connect timeout.
`tez.runtime.shuffle.read.timeout`	180000	HTTP socket read timeout.
`tez.runtime.shuffle.fetch.max.task.output.at.once`	20	Max attempts per HTTP request.
`tez.runtime.shuffle.memory.limit.percent`	0.25	Max fraction of the in-memory shuffle buffer that a single fetched input may occupy; larger inputs go straight to disk.
`tez.runtime.shuffle.merge.percent`	0.9	When in-mem buffer crosses this, kick a merge.

FetchedInput

grep -n "abstract class FetchedInput\|MemoryFetchedInput\|DiskFetchedInput" \
  $(find tez-runtime-library/src/main/java -name "FetchedInput.java")

A FetchedInput is one source partition payload. Two subclasses:

MemoryFetchedInput — bytes held in a ByteBuffer.
DiskFetchedInput — bytes on local disk under tez.runtime.shuffle.tmp.directory.

The MergeManager decides which based on size and current in-memory budget.
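The decision can be pictured as a small reservation ledger. The following is a simplified sketch under stated assumptions, not the actual MergeManager: it assumes a fixed in-memory budget, a per-input ceiling expressed as a fraction of that budget, and only two outcomes (the real class may also make a fetcher wait for a merge to free memory). All names are hypothetical.

```java
final class ShuffleReserveSketch {
    enum Placement { MEMORY, DISK }

    private final long memoryLimit;     // bytes available for in-memory inputs (assumed fixed)
    private final long maxSingleInput;  // ceiling for any one input
    private long usedMemory;

    ShuffleReserveSketch(long memoryLimit, double singleInputFraction) {
        this.memoryLimit = memoryLimit;
        this.maxSingleInput = (long) (memoryLimit * singleInputFraction);
    }

    // Decide placement for an input whose size was announced in the fetch header.
    synchronized Placement reserve(long size) {
        if (size > maxSingleInput || usedMemory + size > memoryLimit) {
            return Placement.DISK;
        }
        usedMemory += size;
        return Placement.MEMORY;
    }

    // Called once an in-memory input has been merged away.
    synchronized void release(long size) {
        usedMemory -= size;
    }

    public static void main(String[] args) {
        ShuffleReserveSketch budget = new ShuffleReserveSketch(1000, 0.25);
        System.out.println(budget.reserve(300)); // DISK: above the per-input ceiling of 250
        System.out.println(budget.reserve(200)); // MEMORY
    }
}
```

The per-input ceiling keeps one oversized source partition from evicting many small ones, which are the inputs that benefit most from staying in memory.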

MergeManager

find tez-runtime-library/src/main/java -name "MergeManager.java"

Three merge tracks:

In-memory merge — N in-memory inputs are merged into one in-memory buffer or spilled to disk.
On-disk merge — N on-disk inputs are merged into a single larger on-disk segment.
Final merge — at processor pull time, remaining in-memory and on-disk inputs are merged into a unified KeyValuesReader.

grep -n "InMemoryMerger\|OnDiskMerger\|finalMerge\|mergeFactor" \
  $(find tez-runtime-library/src/main/java -name "MergeManager.java") | head -20

tez.runtime.io.sort.factor (default 100) caps the number of segments merged in one pass; more segments than that force multiple passes.
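As a rough model of why the merge factor matters, the sketch below counts merge rounds under the simplifying assumption that every round merges groups of up to `factor` segments until one remains. The real merger may size its first pass differently, so treat the numbers as an estimate; MergePassSketch is a hypothetical name.

```java
final class MergePassSketch {
    // Number of rounds needed to reduce `segments` sorted runs to one,
    // merging at most `factor` runs at a time (simplified model).
    static int mergeRounds(int segments, int factor) {
        if (segments < 1 || factor < 2) {
            throw new IllegalArgumentException("need segments >= 1 and factor >= 2");
        }
        int rounds = 0;
        while (segments > 1) {
            segments = (segments + factor - 1) / factor; // ceil(segments / factor)
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(mergeRounds(100, 100)); // 1
        System.out.println(mergeRounds(250, 100)); // 2
        System.out.println(mergeRounds(250, 10));  // 3
    }
}
```

Every extra round rereads and rewrites intermediate data, so a factor that is too low for the segment count shows up directly as additional disk I/O.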

ValuesIterator

find tez-runtime-library/src/main/java -name "ValuesIterator.java"
grep -n "next\|groupingKey\|valuesIter" \
  $(find tez-runtime-library/src/main/java -name "ValuesIterator.java") | head

Wraps the merged sorted stream, presenting (key, Iterable<value>) pairs to the processor — the classic reducer API.

Shuffle walkthrough

sequenceDiagram
  participant T as Task processor
  participant SM as ShuffleManager
  participant F as Fetcher
  participant NM as Source NM (ShuffleHandler)
  participant MM as MergeManager
  SM->>F: assign source attempt + partition
  F->>NM: GET /mapOutput?...
  NM-->>F: stream attempt headers + IFile bytes
  F->>MM: reserve(size)
  alt fits in memory
    MM-->>F: MemoryFetchedInput
  else too big
    MM-->>F: DiskFetchedInput
  end
  F->>MM: commit FetchedInput
  MM->>MM: kick InMemoryMerger / OnDiskMerger when thresholds crossed
  T->>SM: getReader() (blocks until all inputs done)
  SM->>MM: finalMerge()
  MM-->>T: KeyValuesReader (ValuesIterator)

ShuffleHandler is YARN's, not Tez's

ls /opt/hadoop/share/hadoop/yarn/lib/ | grep shuffle    # cluster path varies

org.apache.hadoop.mapred.ShuffleHandler lives in Hadoop. NodeManagers load it as an aux service via yarn-site.xml:

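<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>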

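Tez piggybacks on this by default. Recent Tez releases also ship an optional tez_shuffle aux service (tez-aux-services); the service the AM targets is chosen by tez.am.shuffle.auxiliary-service.id. Misconfigured aux services are a common cause of ConnectException in Fetcher.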

Reading exercise

grep -n "EOF_MARKER\|writeRecord" $(find tez-runtime-library/src/main/java -name "IFile.java") — verify the EOF sentinel value.
wc -l $(find tez-runtime-library/src/main/java -name "PipelinedSorter.java" -o -name "DefaultSorter.java") — which is larger? Hypothesize why.
grep -rn "tez.runtime.io.sort.mb\|tez.runtime.sort.spill.percent" tez-runtime-library/src/main/java — find every read site for these keys.
grep -n "GET /mapOutput\|reduce=\|map=" $(find ~ -name ShuffleHandler.java 2>/dev/null | head -1) — read the exact request format.
cat $(find tez-runtime-library/src/main/java -name "ShuffleManager.java") | head -200 — how is back-pressure on Fetcher threads applied?
grep -n "combiner" tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/*.java — at what phases does the combiner run?

Common bugs and symptoms

Symptom	Likely cause
`Fetcher: java.net.ConnectException`	`mapreduce_shuffle` aux service not configured or NM not running.
`Shuffle error: java.io.IOException: Failed on local exception: org.apache.hadoop.security.AccessControlException`	ShuffleSecret missing or stale; check `JobTokenSecretManager`.
OOM during sort	`tez.runtime.io.sort.mb` too high relative to container JVM heap.
OOM during shuffle	`tez.runtime.shuffle.memory.limit.percent` too high; in-memory inputs starve heap.
`Premature EOF from inputStream`	Source task wrote partial IFile (killed mid-spill); destination retries from another attempt.
Wrong reducer output count	Combiner not associative and commutative, so the result changes with the number of merge passes.
`OnDiskMerger` thrashing	`tez.runtime.io.sort.factor` too low; many tiny segments forcing many merge passes.
Long shuffle plateau	One source NM saturated: data-local scheduling concentrated the source tasks, and therefore the fetches, on a few hosts.

Validation: prove you understand this

Sketch the byte layout of an IFile segment containing 3 records and a single partition. Show key/val lengths and the EOF marker.
A reducer task reads from 200 mappers. With tez.runtime.shuffle.parallel.copies=20 and tez.runtime.shuffle.fetch.max.task.output.at.once=20, what is the minimum number of HTTP requests the fetcher pool must make? Justify.
Explain why PipelinedSorter reduces wall time but not CPU time.
Given a 10 GB shuffle into a 4 GB heap reducer with tez.runtime.shuffle.memory.limit.percent=0.25, predict which inputs go to disk versus memory and why.
Identify the exact file and method where the URL pattern ?reduce=&map= is constructed on the Tez fetcher side. Use grep.

Tez Runtime Internals

The Tez runtime is the code that runs inside the container, not inside the AM. Its job: boot a JVM, accept tasks from the AM over umbilical RPC, run them to completion, and report status.

Three modules collaborate:

tez-runtime-internals — process boot, task driver, umbilical client, memory broker.
tez-runtime-library — concrete Input / Processor / Output implementations (KV, shuffle, etc).
tez-api — the SPI the user implements (AbstractLogicalInput, AbstractLogicalOutput, AbstractLogicalIOProcessor).

ls tez-runtime-internals/src/main/java/org/apache/tez/runtime/

The container process: `TezChild`

TezChild.main() is the JVM entry point for every Tez task container.

find tez-runtime-internals/src/main/java -name "TezChild.java"
grep -n "public static void main\|new TezChild\|run()" \
  $(find tez-runtime-internals/src/main/java -name "TezChild.java")

Boot sequence (paraphrased from TezChild.java):

Read JVM args: AM host, AM port, container ID, application attempt ID, PID, JVM identifier.
Read the security tokens from $HADOOP_TOKEN_FILE_LOCATION.
Construct TezTaskUmbilicalProtocol RPC proxy pointing at the AM TaskAttemptListenerImpl.
Enter TezChild.run(), an infinite loop:
a. umbilical.getTask(containerContext) blocks until the AM hands us a ContainerTask.
b. If ContainerTask.shouldDie(), exit cleanly.
c. Otherwise build a TezTaskRunner2 for the assigned attempt and call runner.run().
d. Loop back to (a). This is container reuse: same JVM, next task.

flowchart TD
  S[JVM start] --> P[Parse args + tokens]
  P --> R[RPC connect to AM]
  R --> L{umbilical.getTask}
  L -- shouldDie --> X[exit]
  L -- new task --> T[TezTaskRunner2.run]
  T --> L
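The shape of that loop is easy to show in isolation. The sketch below is an illustration only: Task and Umbilical are hypothetical stand-ins for ContainerTask and the umbilical proxy, and running a task is reduced to recording its ID where the real code would call TezTaskRunner2.run().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

final class ReuseLoopSketch {
    /** Hypothetical stand-in for ContainerTask: either work to run or a die signal. */
    static final class Task {
        final String attemptId;
        final boolean shouldDie;

        Task(String attemptId, boolean shouldDie) {
            this.attemptId = attemptId;
            this.shouldDie = shouldDie;
        }
    }

    /** Hypothetical stand-in for the umbilical proxy; the real getTask takes a container context. */
    interface Umbilical {
        Task getTask();
    }

    // Keeps asking for work until told to die; returns the attempts it ran.
    static List<String> runContainer(Umbilical umbilical) {
        List<String> executed = new ArrayList<>();
        while (true) {
            Task task = umbilical.getTask();   // blocks in the real runtime
            if (task.shouldDie) {
                return executed;               // the real process exits here
            }
            executed.add(task.attemptId);      // placeholder for TezTaskRunner2.run()
        }
    }

    public static void main(String[] args) {
        Queue<Task> fromAm = new ArrayDeque<>(Arrays.asList(
                new Task("attempt_0", false),
                new Task("attempt_1", false),
                new Task(null, true)));
        System.out.println(runContainer(fromAm::poll)); // [attempt_0, attempt_1]
    }
}
```

Note that the loop itself contains no policy: whether the JVM is reused depends entirely on what the other side returns from getTask.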

Why container reuse needs this loop

Allocating a YARN container costs hundreds of ms; starting a JVM costs seconds. Tez amortizes both by running multiple tasks in the same TezChild process. See container-reuse.md for the AM side.

`TezTaskRunner2` — the task driver

find tez-runtime-internals/src/main/java -name "TezTaskRunner2.java"
wc -l $(find tez-runtime-internals/src/main/java -name "TezTaskRunner2.java")

Per-attempt driver. Owns:

a LogicalIOProcessorRuntimeTask (the actual task body),
the input/output initialization thread pool,
abort hooks (kill, fatal error, timeout).

Lifecycle of a single attempt:

sequenceDiagram
  participant TC as TezChild
  participant TR as TezTaskRunner2
  participant T as LogicalIOProcessorRuntimeTask
  participant IO as Inputs / Outputs
  participant P as Processor
  TC->>TR: new + run()
  TR->>T: initialize()
  T->>IO: input.initialize() (parallel)
  T->>IO: output.initialize() (parallel)
  T->>P: processor.initialize()
  TR->>T: run() => processor.run(inputs, outputs)
  P->>IO: read inputs, write outputs
  T->>IO: output.close() (parallel)
  T->>IO: input.close() (parallel)
  TR->>TC: result (success/failure)

Failure routes:

Input init throws → TaskFailedException to AM, attempt fails.
Processor throws → same.
AM sends kill via umbilical heartbeat reply → TezTaskRunner2.killTask() interrupts the processor thread.
Fatal error on any IO → TaskReporter.notifyFatalError() short-circuits the run.

`LogicalIOProcessorRuntimeTask` — orchestrator

find tez-runtime-internals/src/main/java -name "LogicalIOProcessorRuntimeTask.java"
wc -l $(find tez-runtime-internals/src/main/java -name "LogicalIOProcessorRuntimeTask.java")

This is the class that actually instantiates the user's IPO triple.

initialize() does, in order:

Build the TezConfiguration for this task from the AM-provided TaskSpec.
Build the MemoryDistributor (next section) over all IOs.
For each InputSpec: instantiate the input class, set its InputContext, call initialize() on a worker thread.
Same for each OutputSpec.
Instantiate the processor; call processor.initialize(processorContext).
Wait for all input/output initialize() calls to complete (parallel).

run():

Block until every input reports it has data (or signals empty).
Call processor.run(inputs, outputs) on the main thread.
On return, call output.close() for every output (parallel), then input.close() for every input (parallel).
Collect counters; hand the final TaskStatus back to TezTaskRunner2.

Key field:

grep -n "initializerCompletionService\|runInputRunnable\|runOutputRunnable" \
  $(find tez-runtime-internals/src/main/java -name "LogicalIOProcessorRuntimeTask.java") | head

Parallel init is what makes Tez fast for processors with many inputs (e.g. multi-input joins).
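The pattern behind parallel initialization is the standard java.util.concurrent completion service. The sketch below shows that pattern in isolation and is not taken from LogicalIOProcessorRuntimeTask; ParallelInitSketch and initializeAll are hypothetical names, and each initializer is reduced to a Callable that returns a label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelInitSketch {
    // Runs every initializer on its own worker and waits for all of them.
    // Results come back in completion order, not submission order.
    static List<String> initializeAll(List<Callable<String>> initializers) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, initializers.size()));
        try {
            ExecutorCompletionService<String> completion = new ExecutorCompletionService<>(pool);
            for (Callable<String> init : initializers) {
                completion.submit(init);
            }
            List<String> done = new ArrayList<>();
            for (int i = 0; i < initializers.size(); i++) {
                done.add(completion.take().get()); // surfaces the first failure it encounters
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during IO init", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("IO init failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> inits = Arrays.asList(
                () -> "input:left", () -> "input:right", () -> "output:joined");
        System.out.println(initializeAll(inits).size() + " IOs initialized");
    }
}
```

With sequential init the startup cost is the sum of all initialize() calls; with this pattern it approaches the cost of the slowest one.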

`MemoryDistributor`

find tez-runtime-internals/src/main/java -name "MemoryDistributor.java"
cat $(find tez-runtime-internals/src/main/java -name "MemoryDistributor.java") | head -160

A single broker that hands out portions of the task's JVM heap to IOs that ask for memory.

Flow:

At task init, each Input / Output calls context.requestInitialMemory(size, callback) with what it would like to reserve (e.g. OrderedPartitionedKVOutput requests tez.runtime.io.sort.mb MB).
MemoryDistributor.makeInitialAllocations() runs an InitialMemoryAllocator plugin (default: WeightedScalingMemoryDistributor) to scale requests down to fit the container's available heap.
Allocations are dispatched to callbacks; each IO learns its actual budget via MemoryUpdateCallback.memoryAssigned(long).
IO classes resize their buffers accordingly.
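To see what "scaling requests down" means numerically, here is a deliberately simple proportional allocator. It is not the WeightedScalingMemoryDistributor algorithm: it ignores per-class weights, and the integer reserve percentage and all names are assumptions made for the example.

```java
import java.util.Arrays;

final class MemoryScalingSketch {
    // Grants each request verbatim if everything fits; otherwise scales all
    // requests by the same ratio so that their sum fits the available heap.
    // Assumes byte counts small enough that request * available fits in a long.
    static long[] allocate(long heapBytes, int reservePercent, long[] requests) {
        long available = heapBytes * (100 - reservePercent) / 100;
        long total = Arrays.stream(requests).sum();
        if (total <= available) {
            return requests.clone();
        }
        long[] granted = new long[requests.length];
        for (int i = 0; i < requests.length; i++) {
            granted[i] = requests[i] * available / total;
        }
        return granted;
    }

    public static void main(String[] args) {
        long[] granted = allocate(1000, 30, new long[] {600, 300, 100});
        System.out.println(Arrays.toString(granted)); // [420, 210, 70]
    }
}
```

With 1000 units of heap and 30% held back, 700 units are available; requests totalling 1000 are each scaled by 0.7. A weighted allocator changes the ratios per IO class but keeps the same invariant: the grants never sum to more than the available heap.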

Configuration knobs:

Key	Effect
`tez.task.scale.memory.allocator.class`	Plugin to use. Default `WeightedScalingMemoryDistributor`.
`tez.task.scale.memory.enabled`	Master toggle.
`tez.task.scale.memory.ratios`	Per-IO-class weight overrides.
`tez.task.scale.memory.reserve-fraction`	Reserved for processor/JVM.

grep -n "requestInitialMemory\|memoryAssigned" \
  $(find tez-runtime-library/src/main/java -name "OrderedPartitionedKVOutput.java")

Without the distributor an output would request its configured size verbatim, potentially OOMing the container when summed across IOs.

`TaskReporter` and the umbilical

find tez-runtime-internals/src/main/java -name "TaskReporter*.java"
find tez-api/src/main/java -name "TezTaskUmbilicalProtocol.java"

TaskReporter runs a heartbeat thread per task attempt. Each cycle:

Collect outbound events (counter updates, completion events from completed IOs).
Call umbilical.heartbeat(request) where request contains attempt ID, counters, status messages, and the outbound TezEvent list.
Decode the reply: AM may push back inbound TezEvents (e.g. DataMovementEvents from upstream tasks), a shouldDie flag, or a shouldReset flag.
Dispatch inbound events into the appropriate Input via LogicalIOProcessorRuntimeTask.handleEvents().

Interval: tez.task.am.heartbeat.interval-ms.max (default 100 ms). Counters are sent less often, at most every tez.task.am.heartbeat.counter.interval-ms.max (default 4000 ms).
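One heartbeat cycle, as listed above, can be sketched as follows. The Request, Response and Umbilical types are hypothetical, heavily simplified shapes (events are plain strings), not the real TezTaskUmbilicalProtocol classes; the point is only the drain, send, dispatch sequence.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

final class HeartbeatSketch {
    static final class Request {
        final List<String> outboundEvents;
        Request(List<String> outboundEvents) { this.outboundEvents = outboundEvents; }
    }

    static final class Response {
        final List<String> inboundEvents;
        final boolean shouldDie;
        Response(List<String> inboundEvents, boolean shouldDie) {
            this.inboundEvents = inboundEvents;
            this.shouldDie = shouldDie;
        }
    }

    /** Hypothetical single-method view of the umbilical. */
    interface Umbilical {
        Response heartbeat(Request request);
    }

    private final List<String> pendingOutbound = new ArrayList<>();
    private final Umbilical umbilical;
    private final Consumer<List<String>> inboundHandler;

    HeartbeatSketch(Umbilical umbilical, Consumer<List<String>> inboundHandler) {
        this.umbilical = umbilical;
        this.inboundHandler = inboundHandler;
    }

    synchronized void enqueue(String event) {
        pendingOutbound.add(event);
    }

    private synchronized List<String> drainOutbound() {
        List<String> batch = new ArrayList<>(pendingOutbound);
        pendingOutbound.clear();
        return batch;
    }

    // One cycle: send queued events, then dispatch whatever the AM sent back.
    // Returns false when the AM asks the task to die.
    boolean heartbeatOnce() {
        Response response = umbilical.heartbeat(new Request(drainOutbound()));
        if (response.shouldDie) {
            return false;
        }
        if (!response.inboundEvents.isEmpty()) {
            inboundHandler.accept(response.inboundEvents);
        }
        return true;
    }

    public static void main(String[] args) {
        HeartbeatSketch hb = new HeartbeatSketch(
                req -> new Response(Collections.singletonList("DataMovementEvent"), false),
                events -> System.out.println("dispatch " + events));
        hb.enqueue("counters");
        hb.heartbeatOnce(); // prints "dispatch [DataMovementEvent]"
    }
}
```

Because both directions share one call, an event queued just after a cycle waits a full interval before it leaves the container.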

Why heartbeats carry events

Tez has no separate "event bus" between AM and containers. Everything piggybacks on the umbilical heartbeat. This means:

Event latency is bounded below by heartbeat.interval-ms.
A wedged umbilical (network partition) blocks all task communication; tez.task.timeout-ms (default 5 minutes) eventually fires and the AM considers the attempt lost.

End-to-end task lifecycle inside the JVM

grep -n "phase\|TaskRunnerPhase" $(find tez-runtime-internals/src/main/java -name "TezTaskRunner2.java") | head

Phase	Owner	What happens
1 Receive	`TezChild.run`	`umbilical.getTask` returns a `ContainerTask`.
2 Build	`TezTaskRunner2`	Construct `LogicalIOProcessorRuntimeTask`, hook up `TaskReporter`.
3 Init	`LogicalIOProcessorRuntimeTask.initialize`	MemoryDistributor + parallel IO init + processor init.
4 Run	`LogicalIOProcessorRuntimeTask.run`	`processor.run(inputs, outputs)`.
5 Close	same	Outputs close (flush spills, emit DataMovementEvents), inputs close.
6 Report	`TaskReporter` final tick	Send counters + completion event. AM transitions attempt to SUCCEEDED.
7 Loop	`TezChild.run`	Discard task object, request next.

Reading exercise

grep -n "shouldDie\|exit(" $(find tez-runtime-internals/src/main/java -name "TezChild.java") — list every termination path.
grep -n "initialize\|run\|close" $(find tez-runtime-internals/src/main/java -name "LogicalIOProcessorRuntimeTask.java") | head -40 — verify the lifecycle order.
cat $(find tez-runtime-internals/src/main/java -name "MemoryDistributor.java") | head -100 — how does it handle the case where summed requests exceed available?
grep -n "heartbeat\|TezTaskUmbilical" $(find tez-runtime-internals/src/main/java -name "TaskReporter.java") | head — find the heartbeat loop body.
cat tez-api/src/main/java/org/apache/tez/runtime/api/AbstractLogicalIOProcessor.java — read the user-facing processor contract.
wc -l $(find tez-runtime-internals/src/main/java -name "*.java") | sort -n | tail -20 — find the biggest classes in the runtime module.

Common bugs and symptoms

Symptom	Likely cause
Container OOM during init	`MemoryDistributor` disabled or summed IO requests exceed heap. Enable `tez.task.scale.memory.enabled`.
`TaskAttempt timed out` after 5 min of no heartbeat	`TaskReporter` thread died (uncaught exception) or RPC hung.
Processor sees zero events	Inbound events not delivered — check `TaskReporter.heartbeat` reply path; common when `tez.task.am.heartbeat.interval-ms` raised too high.
Container reuse off, JVMs constantly spinning up	`TezChild.run` loop returns shouldDie too eagerly; check AM-side `AMContainerImpl` reuse decision.
`IllegalStateException: Cannot reserve more memory`	IO requesting after `makeInitialAllocations` already ran.
Outputs never close (process hangs)	Processor never returned from `run()`; usually an infinite loop on a `KeyValuesReader`.

Validation: prove you understand this

Trace, with file:method references, the path from TezChild.main to processor.run for a single attempt.
Explain in two sentences why LogicalIOProcessorRuntimeTask.initialize parallelizes input/output init. Cite the field name.
Given a container with 1 GB heap, one OrderedPartitionedKVOutput requesting 512 MB and two OrderedGroupedKVInputs requesting 256 MB each, compute the actual allocations under the default WeightedScalingMemoryDistributor.
Identify the single umbilical method that delivers inbound TezEvents to the task. Cite the file and the field on the response object.
Sketch the smallest possible AbstractLogicalIOProcessor that prints the class names of all configured inputs and exits. Include initialize, handleEvents, run, close.

Scheduler

The scheduler is the AM-side component that turns task launch requests into running containers. It lives in tez-dag under org.apache.tez.dag.app.rm.

Two-layer design:

TaskSchedulerManager — single dispatcher and router. Receives AMSchedulerEvents from the rest of the AM, forwards to the right scheduler instance.
TaskScheduler instances — one per scheduler ID. In practice almost always YarnTaskSchedulerService (production) or LocalTaskSchedulerService (tez.local.mode=true). External pluggable schedulers (LLAP) also slot in here.

ls tez-dag/src/main/java/org/apache/tez/dag/app/rm/

`TaskSchedulerManager`

find tez-dag/src/main/java -name "TaskSchedulerManager.java"
wc -l $(find tez-dag/src/main/java -name "TaskSchedulerManager.java")

Implements EventHandler<AMSchedulerEvent> and is wired into the AM AsyncDispatcher. Every scheduling decision in the AM starts by enqueuing an AMSchedulerEvent.

Event types

find tez-dag/src/main/java -name "AMSchedulerEvent*.java"
grep -rn "extends AMSchedulerEvent" tez-dag/src/main/java

Event	Source	Purpose
`AMSchedulerEventTALaunchRequest`	`TaskAttemptImpl` after a TA is ready to schedule	Ask scheduler to launch this attempt.
`AMSchedulerEventTAStateUpdated`	`TaskAttemptImpl` on completion	Notify scheduler the container is now releasable.
`AMSchedulerEventContainerCompleted`	YARN RM callback	RM told us a container died.
`AMSchedulerEventDeallocateContainer`	various	Force-release a held container.
`AMSchedulerEventNodeBlacklistUpdate`	`NodeTracker`	Add/remove node from blacklist.
`AMSchedulerEventDAGStart`, `AMSchedulerEventVertexStateUpdated`	`DAGImpl`, `VertexImpl`	DAG lifecycle hints (drives priority adjustments).

TaskSchedulerManager.handle(event) switches on event type and forwards via getTaskScheduler(event.getSchedulerId()).handleEvent(...).

`YarnTaskSchedulerService`

find tez-dag/src/main/java -name "YarnTaskSchedulerService.java"
wc -l $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java")

This is where Tez talks to YARN. Owns:

AMRMClientAsync — async RM heartbeat client.
Map<Priority, BlockingQueue<CookieContainerRequest>> — outstanding requests, bucketed by priority.
Map<ContainerId, HeldContainer> — currently-assigned containers (see container-reuse.md).
A DelayedContainerManager thread that releases idle reused containers.

Request flow

sequenceDiagram
  participant TA as TaskAttemptImpl
  participant TSM as TaskSchedulerManager
  participant Y as YarnTaskSchedulerService
  participant RM as YARN RM
  TA->>TSM: AMSchedulerEventTALaunchRequest
  TSM->>Y: allocateTask(...)
  Y->>Y: build CookieContainerRequest (priority, resource, locality)
  Y->>RM: addContainerRequest (via AMRMClientAsync)
  Note over RM: scheduler matches request to a node
  RM-->>Y: onContainersAllocated([Container])
  Y->>Y: assignContainer() — match to a pending request
  Y->>TSM: containerAllocated(taskAttempt, container)
  TSM->>TA: TAEventContainerAssigned

Matching: priority + locality

grep -n "assignContainer\|matchContainerToRequest\|getMatchingRequests" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20

When a container arrives, assignContainer walks pending requests at the container's priority and tests each one against three locality levels, from strictest to loosest:

NODE_LOCAL — container's node matches a hint host of the request.
RACK_LOCAL — same rack but different host.
ANY — locality wildcard.

AMRMClientAsync already biases matches by locality on the YARN side; this pass is the AM-side tiebreaker when multiple requests are eligible.

Hint level	YARN request	Tez match
`NODE_LOCAL`	host + rack + `*`	accepts container on the exact host
`RACK_LOCAL`	rack + `*`	accepts container on the same rack
`ANY`	`*` only	accepts any container at this priority
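The three levels in the table reduce to two set lookups. The sketch below shows that classification on its own; LocalityMatchSketch and classify are hypothetical names, and hosts and racks are plain strings rather than YARN node objects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class LocalityMatchSketch {
    enum Level { NODE_LOCAL, RACK_LOCAL, ANY }

    // Classifies how well a container's location matches a request's hints.
    static Level classify(String containerHost, String containerRack,
                          Set<String> hintHosts, Set<String> hintRacks) {
        if (hintHosts.contains(containerHost)) {
            return Level.NODE_LOCAL;
        }
        if (hintRacks.contains(containerRack)) {
            return Level.RACK_LOCAL;
        }
        return Level.ANY;
    }

    public static void main(String[] args) {
        Set<String> hosts = new HashSet<>(Arrays.asList("node1", "node2"));
        Set<String> racks = new HashSet<>(Arrays.asList("rackA"));
        System.out.println(classify("node2", "rackA", hosts, racks)); // NODE_LOCAL
        System.out.println(classify("node9", "rackA", hosts, racks)); // RACK_LOCAL
        System.out.println(classify("node9", "rackZ", hosts, racks)); // ANY
    }
}
```

A request with no hints at all always classifies as ANY, which is why tasks without a TaskLocationHint can land on whichever container arrives first.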

TaskLocationHint is set on TaskAttemptImpl either from the input split (MRInput), the VertexLocationHint (provided by VertexManager), or left null.

Priorities

grep -n "Priority\|priorityForVertex" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java" \
                                -o -name "DAGImpl.java") | head

Tez assigns each vertex a priority class derived from its topological depth in the DAG. Upstream vertices get lower numeric priority values, which YARN treats as higher priority, so source tasks are scheduled and complete first and free their containers for downstream consumers. Priority is also the primary key for container reuse matching.
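A depth-based priority can be computed with a memoized walk over the DAG. The sketch below is an illustration under assumptions: it takes the longest path from any root as the vertex's depth and maps it to a value with (depth + 1) * 3, where the multiplier is chosen for the example rather than quoted from the Tez source. Lower values mean "scheduled earlier", following YARN's convention.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class VertexPrioritySketch {
    // upstream.get(v) lists the vertices feeding v; roots have no entry.
    static int distanceFromRoot(String vertex, Map<String, List<String>> upstream,
                                Map<String, Integer> memo) {
        Integer cached = memo.get(vertex);
        if (cached != null) {
            return cached;
        }
        int distance = 0;
        for (String parent : upstream.getOrDefault(vertex, Collections.<String>emptyList())) {
            distance = Math.max(distance, distanceFromRoot(parent, upstream, memo) + 1);
        }
        memo.put(vertex, distance);
        return distance;
    }

    // Illustrative mapping from depth to a numeric priority (multiplier is an assumption).
    static int priority(String vertex, Map<String, List<String>> upstream) {
        return (distanceFromRoot(vertex, upstream, new HashMap<>()) + 1) * 3;
    }

    public static void main(String[] args) {
        Map<String, List<String>> upstream = new HashMap<>();
        upstream.put("join", Arrays.asList("scanA", "scanB"));
        upstream.put("sink", Arrays.asList("join"));
        System.out.println(priority("scanA", upstream)); // 3
        System.out.println(priority("join", upstream));  // 6
        System.out.println(priority("sink", upstream));  // 9
    }
}
```

Leaving gaps between consecutive values gives room for variants within a vertex (for example retried attempts) without colliding with the next vertex's class.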

RM callbacks

grep -n "AMRMClientAsync.CallbackHandler\|onContainersAllocated\|onContainersCompleted\|onShutdownRequest" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java")

YarnTaskSchedulerService implements AMRMClientAsync.CallbackHandler:

onContainersAllocated(List<Container>) — enqueue for assignment.
onContainersCompleted(List<ContainerStatus>) — translate exit status into AMSchedulerEventContainerCompleted.
onShutdownRequest() — RM asked AM to die (e.g. lost AM attempt).
onNodesUpdated(List<NodeReport>) — update node health for blacklisting.
getProgress() — AM tells RM its overall DAG progress.

`LocalTaskSchedulerService`

find tez-dag/src/main/java -name "LocalTaskSchedulerService.java"
wc -l $(find tez-dag/src/main/java -name "LocalTaskSchedulerService.java")

Same contract as YarnTaskSchedulerService but bypasses YARN:

A bounded ExecutorService of LocalContainer worker threads stands in for the YARN cluster.
allocateTask instantly synthesizes a fake Container and dispatches containerAllocated.
The container launcher (LocalContainerLauncher) runs TezChild in the same JVM on the executor.

Used by tez.local.mode=true and MiniTezCluster tests of certain flavors. See local-mode.md.

Pluggable schedulers

grep -n "tez.am.task.scheduler.classes\|TASK_SCHEDULER_SERVICE_CLASS" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Configuration:

tez.am.task.scheduler.classes = <comma-separated FQNs>

TaskSchedulerManager instantiates one per ID. Hive's LLAP plugs in a custom scheduler that talks to LLAP daemons instead of YARN.

Walkthrough: launching a single task attempt

VertexImpl decides to schedule task T.k (via VertexManager or scaling).
TaskImpl creates TaskAttemptImpl for attempt 0 → state NEW.
TaskAttemptImpl transitions to START_WAIT, dispatches AMSchedulerEventTALaunchRequest with TaskLocationHint and capability.
TaskSchedulerManager.handle routes to the configured scheduler.
YarnTaskSchedulerService.allocateTask constructs CookieContainerRequest(priority, capability, hosts, racks, relaxLocality=true) and calls AMRMClientAsync.addContainerRequest.
RM schedules → callback onContainersAllocated([c]).
assignContainer(c) finds the matching pending request, calls informAppAboutAssignment → TaskSchedulerManager.containerAllocated.
TaskSchedulerManager dispatches AMContainerEventAssignTA to AMContainerImpl, then TAEventContainerAssigned to TaskAttemptImpl.
AMContainerImpl asks ContainerLauncherImpl to launch the container (or reuse a held one).
TezChild starts (or accepts new task via reuse loop). The umbilical fires up; the attempt transitions to RUNNING.

sequenceDiagram
  participant V as VertexImpl
  participant TA as TaskAttemptImpl
  participant TSM as TaskSchedulerManager
  participant Y as YarnTaskSchedulerService
  participant AC as AMContainerImpl
  participant RM as YARN RM
  participant NM as YARN NM
  V->>TA: schedule
  TA->>TSM: AMSchedulerEventTALaunchRequest
  TSM->>Y: allocateTask
  Y->>RM: addContainerRequest
  RM-->>Y: onContainersAllocated
  Y->>TSM: containerAllocated
  TSM->>AC: AMContainerEventAssignTA
  TSM->>TA: TAEventContainerAssigned
  AC->>NM: start container (via NMClient)

Reading exercise

cat $(find tez-dag/src/main/java -name "TaskSchedulerManager.java") | head -200 — list the event types handled.
grep -n "addContainerRequest\|removeContainerRequest\|releaseAssignedContainer" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") — find all RM client interactions.
grep -n "NODE_LOCAL\|RACK_LOCAL\|OFF_SWITCH\|ANY" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") — how is locality classified?
grep -n "CookieContainerRequest" $(find tez-dag/src/main/java -name "*.java" | grep rm) — what is the "cookie"? (Hint: opaque payload to thread reuse data through AMRMClient.)
wc -l $(find tez-dag/src/main/java/org/apache/tez/dag/app/rm -name "*.java") — which file dominates? Likely YarnTaskSchedulerService ≫ everything.
grep -n "Priority.newInstance\|priority(" $(find tez-dag/src/main/java -name "VertexImpl.java" -o -name "DAGImpl.java") — where is per-vertex priority computed?

Common bugs and symptoms

Symptom	Likely cause
AM stuck "0 containers running"	RM has no capacity at requested priority; queue at capacity. Check `yarn application -status <appId>`.
All tasks scheduled OFF_SWITCH	`TaskLocationHint` not propagated through `VertexManager`.
Tasks fail with `Container released by AM`	`YarnTaskSchedulerService` released a container that an attempt still owned — usually a state machine race; see failure-handling.md.
Reuse not happening	Priorities mismatch between completed and pending tasks; check `tez.am.container.reuse.locality.delay-allocation-millis`.
AM heartbeat thread blocked	A scheduler callback (`onContainersAllocated`) ran a slow blocking op on the RM client thread. Keep callbacks light.
`IllegalStateException: Priority N not registered`	`allocateTask` called for a vertex whose priority class was never bootstrapped.

Validation: prove you understand this

Walk an AMSchedulerEventTALaunchRequest from dispatch in TaskAttemptImpl to a YARN AMRMClient.addContainerRequest call. Cite file paths.
Explain the difference between priority (YARN concept) and DAG priority (Tez concept) and where Tez sets each.
Given a 100-task Vertex A followed by a 10-task Vertex B, what priority class does each get and why?
Describe how YarnTaskSchedulerService decides between two pending requests at the same priority when a container arrives.
Identify the single method on YarnTaskSchedulerService that the RM callback thread invokes when containers become available. Cite file:line.

Container Reuse

Container reuse is the single biggest reason Tez runs short-task DAGs faster than MapReduce. This chapter explains why, where the policy lives, and how to debug it when it stops working.

Why reuse matters

Container allocation has three costs:

RM round-trip — addContainerRequest, RM scheduling cycle (yarn.scheduler.capacity.node-locality-delay can hold a request back for additional scheduling opportunities), onContainersAllocated.
NM container launch — ContainerLaunchContext setup, localization of resources, NodeManager forking the JVM.
JVM warmup — classloading, JIT, GC tuning.

For a 5-second task on a fresh container the wall time looks like:

Phase	ms
AM request → RM allocate	200–2000
NM launch + localization	500–3000
JVM start	500–2000
Task work	5000
Overhead share	19–58%

For 100 such tasks, paying that overhead 100 times turns a DAG that should finish in ~10s into one that takes 60–90s. Reuse drops to near-zero overhead for tasks 2..N on the same container.
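The arithmetic behind these figures is worth doing once by hand. The sketch below plugs the low and high ends of the table into a two-line model; the class and method names are hypothetical, and the serial, single-container model in serialMs is a simplification that ignores parallelism across containers.

```java
final class ReuseOverheadSketch {
    // Fraction of an attempt's wall time spent on allocation, launch and JVM start.
    static double overheadShare(long allocateMs, long launchMs, long jvmMs, long workMs) {
        long overhead = allocateMs + launchMs + jvmMs;
        return (double) overhead / (overhead + workMs);
    }

    // Wall time for `tasks` attempts run back to back on one slot:
    // with reuse the overhead is paid once, without reuse once per task.
    static long serialMs(int tasks, long overheadMs, long workMs, boolean reuse) {
        long paidOverhead = reuse ? overheadMs : overheadMs * tasks;
        return paidOverhead + workMs * tasks;
    }

    public static void main(String[] args) {
        System.out.println(overheadShare(200, 500, 500, 5000));    // low end, about 0.19
        System.out.println(overheadShare(2000, 3000, 2000, 5000)); // high end, about 0.58
        System.out.println(serialMs(10, 1200, 5000, false));       // 62000
        System.out.println(serialMs(10, 1200, 5000, true));        // 51200
    }
}
```

With the table's bounds the per-task overhead is 1200 to 7000 ms, so the share ranges from roughly 19% to 58%. Even at the cheap end, ten tasks on one slot lose about 11 seconds to overhead that reuse pays only once.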

The reuse loop in `TezChild`

See tez-runtime.md. The single key fact: after each completed task, TezChild.run() calls umbilical.getTask() again instead of exiting. As long as the AM hands it work, the same JVM keeps running.

grep -n "umbilical.getTask\|shouldDie\|run()" \
  $(find tez-runtime-internals/src/main/java -name "TezChild.java") | head

So the entire reuse policy is implemented on the AM side — the container asks "what next?" and the AM decides whether to give it another task or release it.

`AMContainerImpl` — per-container state machine

find tez-dag/src/main/java -name "AMContainerImpl.java"
wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java")
grep -n "AMContainerState\|enum AMContainerState" \
  $(find tez-dag/src/main/java -name "AMContainerState.java" \
                                -o -name "AMContainerImpl.java") | head

Each YARN container the AM holds has a corresponding AMContainerImpl state machine. States include:

State	Meaning
`ALLOCATED`	RM has assigned the container; not yet launched.
`LAUNCHING`	NMClient is forking the JVM.
`IDLE`	Launched, no task assigned (reuse candidate).
`RUNNING`	A task attempt is currently executing.
`STOP_REQUESTED` / `COMPLETED`	Releasing or released.

The transition RUNNING → IDLE is the moment Tez decides between reuse and release.

`HeldContainer`

grep -n "HeldContainer\|heldContainers\|delayedContainers" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20

HeldContainer is the scheduler-side view of an idle reused container:

Field	Purpose
`container`	The underlying YARN `Container` (resource, node, priority).
`priority`	The priority class it was originally allocated at.
`lastTaskActivity`	Timestamp of the last task completion.
`nextScheduleTime`	When `DelayedContainerManager` will reconsider it.
`localityMatchLevel`	Track the locality at which it can still be matched.

When a task completes, AMContainerImpl reports back to YarnTaskSchedulerService which wraps the container in a HeldContainer and queues it for matching.

Matching: who gets the held container?

Algorithm (paraphrased from YarnTaskSchedulerService):

Walk pending requests at the same priority as the held container's original allocation.
Prefer requests with locality matching the container's node, then rack, then any.
Verify resource compatibility: container's Resource must satisfy the request's capability.
If a match exists, dispatch reuse to the matched TaskAttemptImpl.
If no match, leave the container as HeldContainer and schedule the DelayedContainerManager to re-evaluate after the locality delay.
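The steps above can be sketched as a single selection function. This is a paraphrase of the listed algorithm, not the YarnTaskSchedulerService code: Request and Held are hypothetical value classes, resources are reduced to a memory figure, and a null return stands for "stay held and let the delayed manager retry".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class HeldContainerMatchSketch {
    static final class Request {
        final String attempt;
        final int priority;
        final int memoryMb;
        final Set<String> hosts;
        final Set<String> racks;

        Request(String attempt, int priority, int memoryMb, List<String> hosts, List<String> racks) {
            this.attempt = attempt;
            this.priority = priority;
            this.memoryMb = memoryMb;
            this.hosts = new HashSet<>(hosts);
            this.racks = new HashSet<>(racks);
        }
    }

    static final class Held {
        final int priority;
        final int memoryMb;
        final String host;
        final String rack;

        Held(int priority, int memoryMb, String host, String rack) {
            this.priority = priority;
            this.memoryMb = memoryMb;
            this.host = host;
            this.rack = rack;
        }
    }

    // Returns the best pending request for the held container, or null if none qualifies.
    static Request match(Held held, List<Request> pending) {
        Request best = null;
        int bestRank = Integer.MAX_VALUE;
        for (Request r : pending) {
            if (r.priority != held.priority || r.memoryMb > held.memoryMb) {
                continue; // wrong priority class, or the container is too small
            }
            // 0 = node-local, 1 = rack-local, 2 = any
            int rank = r.hosts.contains(held.host) ? 0 : (r.racks.contains(held.rack) ? 1 : 2);
            if (rank < bestRank) {
                best = r;
                bestRank = rank;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Held held = new Held(3, 2048, "node7", "rack2");
        List<Request> pending = Arrays.asList(
                new Request("ta_any", 3, 1024, Collections.<String>emptyList(), Collections.<String>emptyList()),
                new Request("ta_rack", 3, 1024, Arrays.asList("node1"), Arrays.asList("rack2")));
        System.out.println(match(held, pending).attempt); // ta_rack
    }
}
```

The priority and resource checks act as hard filters, while locality only ranks the survivors.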

grep -n "tryAssignReUsedContainer\|matchHeldContainerToRequest\|getMatchingRequests" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head

Why priority-strict matching?

Tez does not reuse a container allocated for priority class P1 for a task of priority P2 because RM accounting attributed the container to the P1 queue/request. Crossing priority classes would corrupt fairness and create double-counting in the RM's view of demand.

Idle timeout

grep -n "tez.am.container.idle.release-timeout" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Two timeouts bracket the wait:

Key	Default	Meaning
`tez.am.container.idle.release-timeout-min.millis`	5000	Don't release before this much idle time.
`tez.am.container.idle.release-timeout-max.millis`	10000	Definitely release after this much.

DelayedContainerManager runs a periodic sweep. For each HeldContainer:

If now - lastTaskActivity < min, wait.
If min ≤ now - lastTaskActivity < max, try a relaxed-locality match.
If now - lastTaskActivity ≥ max, release back to YARN (AMRMClient.releaseAssignedContainer).

Why a range? Avoids thundering-herd releases when a wave of tasks finishes simultaneously, and gives the AM a window to re-match before paying the allocate-from-scratch cost.
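The three-way sweep decision reads naturally as a small pure function. The sketch below only restates the rule as described in this chapter; `IdleSweepSketch`, `Action`, and `decide` are hypothetical names, not Tez identifiers.

```java
// Sketch only: restates the min/max idle rule from the text, not DelayedContainerManager itself.
final class IdleSweepSketch {

  enum Action { WAIT, TRY_RELAXED_MATCH, RELEASE }

  /** Decides what a sweep does with one held container, given its last task activity. */
  static Action decide(long nowMillis, long lastTaskActivityMillis,
                       long minMillis, long maxMillis) {
    if (minMillis > maxMillis) {
      throw new IllegalArgumentException("min must not exceed max");
    }
    long idle = nowMillis - lastTaskActivityMillis;
    if (idle < minMillis) return Action.WAIT;               // too early to give it back
    if (idle < maxMillis) return Action.TRY_RELAXED_MATCH;  // inside the re-match window
    return Action.RELEASE;                                  // idle too long: return to YARN
  }

  public static void main(String[] args) {
    long min = 5000;   // tez.am.container.idle.release-timeout-min.millis
    long max = 10000;  // tez.am.container.idle.release-timeout-max.millis
    for (long idle : new long[] {1000, 7000, 12000}) {
      System.out.println(idle + " ms idle: " + decide(idle, 0, min, max));
    }
  }
}
```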

Locality re-matching

grep -n "localityMatchLevel\|adjustLocalityMatch\|fallbackMatch" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head

A held container starts at NODE_LOCAL. Each sweep without a match relaxes the level:

NODE_LOCAL → RACK_LOCAL → ANY → release.

tez.am.container.reuse.locality.delay-allocation-millis (default 250) is the per-step delay. Higher values raise locality at the cost of throughput; lower values give up locality faster.
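To make the per-step delay concrete, here is a minimal sketch that maps "time spent unmatched" to a locality level. It models only the relaxation chain and delay described above; it ignores the idle-release timeouts and the rack and non-local fallback toggles, and `LocalityRelaxationSketch` and `levelAt` are hypothetical names.

```java
// Sketch only: one relaxation step per delay interval, as described in the text.
final class LocalityRelaxationSketch {

  enum Level { NODE_LOCAL, RACK_LOCAL, ANY, RELEASE }

  /** Level reached after a container has gone unmatched for unmatchedMillis. */
  static Level levelAt(long unmatchedMillis, long delayMillis) {
    if (unmatchedMillis < 0 || delayMillis <= 0) {
      throw new IllegalArgumentException("unmatchedMillis must be >= 0 and delayMillis > 0");
    }
    Level[] levels = Level.values();
    long steps = unmatchedMillis / delayMillis;   // whole delay intervals elapsed
    return levels[(int) Math.min(steps, levels.length - 1)];
  }

  public static void main(String[] args) {
    long delay = 250;  // tez.am.container.reuse.locality.delay-allocation-millis default
    for (long t : new long[] {0, 250, 500, 750}) {
      System.out.println(t + " ms unmatched: " + levelAt(t, delay));
    }
  }
}
```

Doubling the delay doubles the time a container waits at each level, which is the locality-versus-throughput tradeoff stated above.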

DAG transitions and reuse

grep -n "container.reuse" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Reuse policy at DAG boundaries:

Key	Default	Effect
`tez.am.container.reuse.enabled`	true	Master toggle.
`tez.am.container.reuse.rack-fallback.enabled`	true	Allow RACK_LOCAL fallback.
`tez.am.container.reuse.non-local-fallback.enabled`	false	Allow ANY-locality fallback.
`tez.am.container.reuse.new-containers.enabled`	false	Hold a newly allocated container that cannot be assigned immediately so it can be reused later, instead of releasing it.
`tez.am.mode.session`	false	Session mode keeps the AM alive across DAGs, which is what allows inter-DAG reuse. Hive's session pool turns it on.

When Session mode is on (Hive's TezSessionPoolManager does this), the AM holds containers across DAGs, so the first DAG warms the JVMs that the second DAG reuses.

Failure modes

Stale credentials

grep -n "credentials\|Token\|getCredentials" \
  $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java" \
                                -o -name "AMContainerImpl.java") | head

If a DAG uses delegation tokens (HDFS, HiveMetastore) that expire mid-session, reused containers still hold the old tokens. Symptoms: tasks fail with SecretManager$InvalidToken on file open. Fix: token renewal via TokenRenewer, or release reused containers between DAGs that use tokens with short TTLs.
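The "release between DAGs" fix boils down to comparing the token's remaining lifetime with the work a reused container is about to take on. The check below is a hypothetical policy sketch, not something Tez implements under this name; the safety margin and the expected run time are assumptions the caller has to supply.

```java
// Sketch only: a hypothetical policy check, not Tez behaviour.
final class TokenReuseCheckSketch {

  /**
   * True if the token would lapse before the expected work (plus a safety margin) finishes,
   * in which case reusing a container that holds it is risky.
   */
  static boolean shouldReleaseBeforeReuse(long nowMillis, long tokenExpiryMillis,
                                          long expectedRunMillis, long safetyMarginMillis) {
    long remaining = tokenExpiryMillis - nowMillis;
    return remaining < expectedRunMillis + safetyMarginMillis;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    boolean release = shouldReleaseBeforeReuse(now, now + 35_000, 30_000, 10_000);
    System.out.println(release ? "release the container" : "safe to reuse");
  }
}
```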

Leaked containers on AM failover

grep -n "recoverContainer\|onAMRestart" \
  $(find tez-dag/src/main/java -name "*.java" | head -50) | head

When the AM dies and YARN restarts attempt 2, the old containers are still running. YARN passes them to the new AM via getContainersFromPreviousAttempts. If the new AM mis-handles the priority mapping, those containers can become orphaned — neither released nor reused — until the YARN-level yarn.am.liveness-monitor.expiry-interval-ms kicks in.

Resource fragmentation

Tez does not reshape containers. A 4 GB container allocated for a heavyweight mapper sits idle through the reduce phase if reducers want 2 GB containers — the 4 GB block is not subdivided.

Container blacklisting

grep -n "blacklist\|NodeTracker" \
  $(find tez-dag/src/main/java -name "*.java" | grep rm) | head

A node accumulating task failures gets blacklisted; held containers on that node are released even within the idle window.
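A minimal sketch of that interaction, assuming a simple per-node failure counter with a fixed threshold. `NodeBlacklistSketch` and its methods are hypothetical; the real bookkeeping lives in the classes the grep above turns up.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a per-node failure counter with a fixed threshold (an assumption).
final class NodeBlacklistSketch {
  private final int maxFailuresPerNode;
  private final Map<String, Integer> failures = new HashMap<>();

  NodeBlacklistSketch(int maxFailuresPerNode) {
    this.maxFailuresPerNode = maxFailuresPerNode;
  }

  /** Records one task failure; returns true once the node has reached the threshold. */
  boolean recordFailure(String node) {
    int count = failures.merge(node, 1, Integer::sum);
    return count >= maxFailuresPerNode;
  }

  boolean isBlacklisted(String node) {
    return failures.getOrDefault(node, 0) >= maxFailuresPerNode;
  }

  /** Held containers on a blacklisted node are released even inside the idle window. */
  boolean shouldReleaseHeldContainer(String node, long idleMillis, long maxIdleMillis) {
    return isBlacklisted(node) || idleMillis >= maxIdleMillis;
  }

  public static void main(String[] args) {
    NodeBlacklistSketch tracker = new NodeBlacklistSketch(3);
    for (int i = 0; i < 3; i++) {
      tracker.recordFailure("node-7");
    }
    System.out.println("release held container on node-7: "
        + tracker.shouldReleaseHeldContainer("node-7", 100, 10_000));
  }
}
```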

Tuning playbook

Goal	Tune
Reduce p50 task latency	Increase `tez.am.container.idle.release-timeout-max.millis` — keep JVMs warm longer.
Reduce YARN queue pressure	Lower `tez.am.container.idle.release-timeout-min.millis` — return idle containers faster.
Improve locality on long DAGs	Increase `tez.am.container.reuse.locality.delay-allocation-millis`.
Hive interactive queries	Enable session pools (`hive.server2.tez.initialize.default.sessions`) and large reuse windows.
Debugging "why was this container released?"	Set log4j level for `org.apache.tez.dag.app.rm` to DEBUG.

Reading exercise

wc -l $(find tez-dag/src/main/java -name "AMContainerImpl.java") then read the state machine declaration block. Count states and transitions.
grep -n "DelayedContainerManager" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") — find the sweep loop.
grep -rn "IDLE_RELEASE_TIMEOUT" tez-dag/src/main/java: list all read sites for the idle timeout (read sites reference the TezConfiguration constants, not the literal key).
grep -n "previousAttemptContainers\|registerApplicationMaster" $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") — how does the AM enumerate inherited containers on failover?
grep -A 1 "REUSE" tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java: list every reuse-related config key.
grep -n "containerCompleted" $(find tez-dag/src/main/java -name "AMContainerImpl.java") — where does the AM learn that the JVM exited unexpectedly?

Common bugs and symptoms

Symptom	Likely cause
0% reuse despite `tez.am.container.reuse.enabled=true`	Priority mismatches; verify with AM log `Container released because no matching request`.
Hive query slow after token refresh	Reused container holding stale `HiveMetastore` delegation token. Release after refresh or shorten reuse window.
AM log spam: `Released container X because expired`	Tasks completing faster than next-wave dispatch — lower `idle.release-timeout-min`.
YARN queue at 100% but tasks pending	Held containers at wrong priority blocking new allocations; check `nm-rm-heartbeat-interval-ms`.
Containers orphaned after AM crash	New AM did not register previous containers; check `getContainersFromPreviousAttempts` handling.

Validation: prove you understand this

Describe the four-step locality relaxation a HeldContainer undergoes.
Why is priority-strict matching enforced even when relaxing locality? Cite the RM accounting consequence.
Given idle.release-timeout-min=5000, idle.release-timeout-max=10000, and 200 ms between successive task completions on the same vertex, what fraction of containers get reused?
Identify the exact configuration key that controls whether RM-fresh containers can be assigned to a task different from the one that triggered the request. Cite file:line.
Sketch the sequence of AM events when an AMContainer transitions RUNNING → IDLE → RUNNING with reuse, including which state machine emits each event.

Local Mode

Tez ships two "no YARN" execution paths:

Local mode — tez.local.mode=true. The whole AM + all containers run in the calling JVM. No RM, no NM, no networking.
MiniTezCluster — a real YARN MiniCluster (RM + NMs as threads) with a real Tez AM submitted as a YARN app. Networking goes over loopback.

Both let you test without a cluster, but they are not interchangeable. This chapter explains the wiring and the tradeoffs.

Why a no-YARN path exists

Production Tez requires YARN to allocate containers. For:

IDE-driven unit tests of vertex managers, edge managers, processors;
short reproducers in JIRAs;
in-process pipelines (e.g. running a DAG inline from a Hive Driver in a test);

paying YARN startup cost (30+ seconds) is intolerable. Local mode is the escape hatch.

grep -rn "tez.local.mode\|LOCAL_MODE" \
  tez-api/src/main/java tez-dag/src/main/java | head

How `tez.local.mode=true` rewires the AM

grep -n "TEZ_LOCAL_MODE\|isLocalMode\|getBoolean.*LOCAL" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java

When tez.local.mode=true:

TezClient.start() does not submit to YARN. Instead it constructs a DAGAppMaster instance directly in the client JVM and starts it as a service.
TaskSchedulerManager is configured with LocalTaskSchedulerService instead of YarnTaskSchedulerService.
ContainerLauncherManager uses LocalContainerLauncher instead of ContainerLauncherImpl.
TaskCommunicatorManager uses TezLocalTaskCommunicatorImpl which bypasses RPC entirely.

The net effect: the AM, scheduler, container launcher, and TezChilds all live in the same JVM, talking via in-process queues.

`LocalTaskSchedulerService`

find tez-dag/src/main/java -name "LocalTaskSchedulerService.java"
wc -l $(find tez-dag/src/main/java -name "LocalTaskSchedulerService.java")

Mirrors YarnTaskSchedulerService but the "resource pool" is a thread pool.

Concept	Yarn version	Local version
Resource pool	YARN cluster	`ExecutorService` of bounded thread count
`allocateTask`	`AMRMClient.addContainerRequest`	enqueue to local queue, immediately synthesize fake `Container`
`releaseAssignedContainer`	RM release	return thread to pool
Locality	NODE_LOCAL/RACK_LOCAL	always ANY (single "node")
Priority	YARN priority class	honored as a queue-ordering hint

Configuration:

grep -n "TEZ_AM_INLINE_TASK_EXECUTION_MAX_TASKS\|tez.am.inline" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

tez.am.inline.task.execution.max-tasks (default 1) controls thread-pool size in local mode. Bumping this exposes concurrency bugs that production container parallelism would also expose.
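The effect of the pool size can be seen with a plain `ExecutorService`. This sketch does not use LocalTaskSchedulerService; `InlineTaskPoolSketch` and `peakConcurrency` are hypothetical, and the sleeping tasks only stand in for real task work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a bounded thread pool standing in for the local-mode "resource pool".
final class InlineTaskPoolSketch {

  /** Runs taskCount short tasks on maxTasks threads and returns the peak concurrency seen. */
  static int peakConcurrency(int maxTasks, int taskCount) {
    ExecutorService pool = Executors.newFixedThreadPool(maxTasks);
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    try {
      for (int i = 0; i < taskCount; i++) {
        futures.add(pool.submit(() -> {
          peak.accumulateAndGet(running.incrementAndGet(), Math::max);
          try {
            Thread.sleep(5);   // pretend to do task work
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
          running.decrementAndGet();
        }));
      }
      for (Future<?> f : futures) {
        f.get();               // wait for every task to finish
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
    return peak.get();
  }

  public static void main(String[] args) {
    System.out.println("max-tasks=1 peak: " + peakConcurrency(1, 8));
    System.out.println("max-tasks=4 peak: " + peakConcurrency(4, 8));
  }
}
```

With a single thread, tasks can never overlap, which is why ordering bugs stay hidden until the pool size is raised.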

`LocalContainerLauncher`

find tez-dag/src/main/java -name "LocalContainerLauncher.java"

When the AM "launches" a local container, the launcher allocates a LocalContainer worker that runs TezChild logic in the same process:

No new JVM.
No serialization of the ContainerLaunchContext — the AM hands the TaskSpec directly to the local task runner.
The umbilical "RPC" is a Java method call on an in-process object.

This means: local mode does not exercise the RPC layer, classpath construction, NM localization, or token plumbing. Bugs in those paths are invisible to local-mode tests.
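Why an in-process hand-off hides wire bugs can be shown with a toy payload. Java serialization is used here purely as a stand-in for a real serialization boundary; it is not the format Tez uses on the wire, and `TaskPayload` and both hand-off methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch only: Java serialization stands in for any real wire format.
final class InProcessVsWireSketch {

  /** Hypothetical payload; the transient field is dropped by any real serialization. */
  static final class TaskPayload implements Serializable {
    private static final long serialVersionUID = 1L;
    final String vertexName;
    final transient String userHint;   // silently lost when the object crosses the wire

    TaskPayload(String vertexName, String userHint) {
      this.vertexName = vertexName;
      this.userHint = userHint;
    }
  }

  /** Local-mode style hand-off: the callee receives the very same object. */
  static TaskPayload handOffInProcess(TaskPayload payload) {
    return payload;
  }

  /** Cluster style hand-off: the payload is written to bytes and read back. */
  static TaskPayload handOffOverWire(TaskPayload payload) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(payload);
      }
      try (ObjectInputStream in =
               new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (TaskPayload) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    TaskPayload payload = new TaskPayload("Map 1", "prefer-ssd");
    System.out.println("in-process hint: " + handOffInProcess(payload).userHint);
    System.out.println("over-the-wire hint: " + handOffOverWire(payload).userHint);
  }
}
```

A test that only ever takes the in-process path would never notice the lost field.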

What local mode does not exercise

Layer	Exercised in local mode?
YARN RM scheduling	✗
NodeManager container launch	✗
Resource localization (HDFS download)	✗
AMRMToken / ClientToAMToken	✗
HDFS shuffle path (uses local FS only)	✗
`ShuffleHandler` aux service	✗
RPC serialization	✗
JVM cold start / classloader isolation	✗

What it does exercise: the DAG state machine, VertexManagers, EdgeManagers, sort/merge code, processors, and the umbilical event flow.

`MiniTezCluster`

find tez-tests/src/test/java -name "MiniTezCluster.java"
wc -l $(find tez-tests/src/test/java -name "MiniTezCluster.java")

A real cluster compressed onto one host. MiniTezCluster extends Hadoop's MiniYARNCluster, which gives it:

One RM thread.
N NM threads (configurable).
A Tez AM submitted as a normal YARN application.
TezChild runs in separate JVMs spawned by NM ContainerExecutor.
HDFS is MiniDFSCluster (a few NameNode + DataNode threads in the same JVM) or a RawLocalFileSystem.

grep -n "MiniYARNCluster\|MiniDFSCluster\|appJar\|deploy" \
  $(find tez-tests/src/test/java -name "MiniTezCluster.java") | head

Setup pattern

grep -rn "MiniTezCluster\b" tez-tests/src/test/java | head -10

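// Constructor arguments: test name, NodeManager count, local dirs, log dirs.
MiniTezCluster cluster = new MiniTezCluster("test", numNMs, numLocalDirs, numLogDirs);
cluster.init(conf);
cluster.start();

// Build the client configuration from the running cluster so RM/NM addresses match.
TezConfiguration tezConf = new TezConfiguration(cluster.getConfig());
TezClient client = TezClient.create("test", tezConf);
client.start();
client.waitTillReady();   // only blocks in session mode; otherwise a no-op
client.submitDAG(myDag);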

When MiniTezCluster is the right tool

You are exercising RPC, security, or localization code.
You hit ShuffleHandler paths or HDFS-backed recovery (see failure-handling.md).
You're reproducing a bug that involves real container lifecycle (kill -9 vs orderly shutdown) — MiniCluster can forkProcess and SIGKILL.
You need realistic counters and ATS event flow.

When MiniTezCluster is the wrong tool

Pure VertexManager logic — use local mode or mock dispatcher.
Pure IFile / sort behavior — use a unit test on the runtime-library classes directly.
Anything where 30–60 s startup + heavy memory cost (~1 GB minimum) is intolerable.

Side-by-side comparison

Aspect	Local mode	MiniTezCluster
Startup	< 1 s	30–60 s
Memory	~256 MB	1 GB+
YARN exercised	no	yes (in-process)
RPC exercised	no	yes (loopback)
Tokens exercised	no	yes (simple, unkerberized by default)
Separate JVMs for tasks	no	yes
HDFS	RawLocal	MiniDFS or RawLocal
Shuffle path	no `ShuffleHandler`	full `ShuffleHandler`
Use case	unit / integration of AM logic	end-to-end integration tests
Example test class	`TestLocalMode`	`TestOrderedWordCount`

find tez-tests/src/test/java -name "TestLocalMode.java" \
                              -o -name "TestOrderedWordCount.java"

Worked example: switching between modes in one test

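// JUnit 4 Parameterized runner; TestBothModes is a placeholder class name.
@RunWith(Parameterized.class)
public class TestBothModes {
  private final String mode;
  private TezConfiguration conf;
  private MiniTezCluster miniCluster;   // stays null in local mode
  private TezClient client;

  public TestBothModes(String mode) {
    this.mode = mode;
  }

  @Parameters
  public static Iterable<Object[]> modes() {
    return Arrays.asList(new Object[][] {{"local"}, {"mini"}});
  }

  @Before
  public void setUp() throws Exception {
    conf = new TezConfiguration();
    if ("local".equals(mode)) {
      conf.set("fs.defaultFS", "file:///");
      conf.setBoolean(TezConfiguration.TEZ_LOCAL_MODE, true);
    } else {
      // Arguments: test name, NodeManager count, local dirs, log dirs.
      miniCluster = new MiniTezCluster("test", 1, 1, 1);
      miniCluster.init(conf);
      miniCluster.start();
      conf = new TezConfiguration(miniCluster.getConfig());
    }
    client = TezClient.create("t", conf);
    client.start();
    client.waitTillReady();
  }

  @After
  public void tearDown() throws Exception {
    if (client != null) client.stop();
    if (miniCluster != null) miniCluster.stop();
  }
}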

This is the pattern in several Tez tests where a feature must work in both universes.

Reading exercise

wc -l $(find tez-dag/src/main/java -name "LocalTaskSchedulerService.java" -o -name "YarnTaskSchedulerService.java"): confirm the local version is much smaller.
grep -n "TEZ_LOCAL_MODE\|isLocal" $(find tez-dag/src/main/java -name "DAGAppMaster.java"): find every branch that depends on local mode.
cat $(find tez-dag/src/main/java -name "LocalContainerLauncher.java") | head -160 — how does it run TezChild without a fork?
find tez-tests/src/test/java -name "MiniTezCluster.java" -exec grep -n "ShuffleHandler\|aux-services" {} \; — verify MiniTezCluster wires the YARN aux service.
grep -rn "TEZ_LOCAL_MODE" tez-api tez-dag tez-runtime-internals | head — list every config read site.
find tez-tests/src/test/java -name "TestLocal*" -o -name "TestMRR*" — read one local-mode and one MiniCluster test, side by side.

Common bugs and symptoms

Symptom	Likely cause
Test passes in local mode, fails on cluster	Local mode skipped RPC/localization/tokens. Add a MiniCluster variant.
MiniCluster test times out at `waitTillReady`	RM never registered the AM. Check `tez-site.xml` is on the AM classpath in the MiniCluster config.
Local-mode race conditions only visible with `inline.task.execution.max-tasks > 1`	Single-threaded local mode hides ordering bugs in `VertexManager` and dispatchers.
`ClassNotFoundException` for custom processor in MiniCluster	Container localization needs the JAR; either put it on the launch classpath or register via `LocalResources`.
MiniCluster blows the heap	Even 1 NM + MiniDFS needs about 1 GB; raise the test JVM heap or lower the NM count.
Hive integration test wedges only in MiniCluster	Hive needs full Hadoop config; check `hadoop.security.authentication=simple` in test conf.

Validation: prove you understand this

List four layers that local mode does not exercise. For each, name a bug class it can hide.
In local mode, where does the "RPC" between TezChild and the AM actually happen? Cite the file path.
Why is tez.am.inline.task.execution.max-tasks=1 the default in local mode? What test reliability tradeoff does it enforce?
Given a reproducer for a bug in ShuffleHandler aux-service interaction, explain why a TestLocalMode-style test cannot reproduce it, and what the minimum MiniCluster setup is.
Show the minimum TezConfiguration setup for local mode in code. Three lines max.

Hive on Tez

Hive is the largest single consumer of Tez. Roughly 70% of bug reports filed against Tez originate in a Hive query; many "Tez bugs" turn out to be Hive bugs, and vice versa. This chapter walks the compile boundary, explains how Hive operators map to Tez I/P/O, and gives a triage tree for attribution.

The compile boundary

Hive's query compiler produces a TezWork, a graph of BaseWork nodes (MapWork, ReduceWork, MergeJoinWork, etc). TezTask.execute walks TezWork and constructs a Tez DAG.

ls ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/

Key files:

File	Role
`TezTask.java`	Hive's `Task` impl; builds the DAG and submits via `TezSessionState`.
`DagUtils.java`	DAG construction helpers (createVertex, createEdge, etc).
`TezSessionPoolManager.java`	Warm session pool — keeps AMs alive between queries.
`TezSessionState.java`	One Hive session ↔ one Tez AM.
`TezProcessor.java`	The `LogicalIOProcessor` that runs Hive operator pipelines inside a Tez task.

wc -l ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/{TezTask,DagUtils,TezSessionPoolManager,TezProcessor}.java

`TezTask.execute` — high-level flow

grep -n "execute\|build\|submitDAG" \
  ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java | head -30

Steps:

Acquire a TezSessionState from TezSessionPoolManager (or open a new one).
build(jobConf, work, scratchDir, ...) — call DagUtils to turn each BaseWork into a Tez Vertex and each TezEdgeProperty into a Tez Edge.
submit(dag, sessionState) → tezClient.submitDAG(dag).
Poll dagClient.getDAGStatus(...) until terminal.
Surface counters + diagnostics back to Hive.
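The polling step can be sketched without any Tez dependency. Below, a `Supplier` stands in for `dagClient.getDAGStatus(...)` and the `State` enum is a hypothetical subset of the states a DAG can report; neither is the Hive or Tez API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Sketch only: Supplier<State> stands in for a status call; State is a hypothetical subset.
final class DagPollSketch {

  enum State { SUBMITTED, RUNNING, SUCCEEDED, FAILED, KILLED }

  static boolean isTerminal(State state) {
    return state == State.SUCCEEDED || state == State.FAILED || state == State.KILLED;
  }

  /** Polls until a terminal state or until maxPolls is used up; returns the last state seen. */
  static State pollUntilTerminal(Supplier<State> statusSource, int maxPolls, long sleepMillis) {
    State last = null;
    for (int i = 0; i < maxPolls; i++) {
      last = statusSource.get();
      if (isTerminal(last)) {
        return last;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();   // preserve the interrupt and stop polling
        return last;
      }
    }
    return last;
  }

  public static void main(String[] args) {
    Iterator<State> scripted =
        Arrays.asList(State.SUBMITTED, State.RUNNING, State.SUCCEEDED).iterator();
    System.out.println("final state: " + pollUntilTerminal(scripted::next, 10, 10));
  }
}
```

Bounding the loop with `maxPolls` is a choice made for the sketch so that it always terminates.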

`DagUtils.createVertex`

grep -n "createVertex\|createEdge\|createEdgeProperty\|setVertexManagerPlugin" \
  ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java | head -30

For a MapWork:

Hive concept	Tez vertex configuration
Operator tree starting with `TableScanOperator`	`processor = MapTezProcessor` (subclass of `TezProcessor`)
Number of input splits	`parallelism = splits.length` (often overridden by grouping)
Per-split input	`DataSourceDescriptor` with `MRInputLegacy` and the InputFormat
Combiner	Edge-level (downstream `ReduceWork` configures it as a `combiner.class`)

For a ReduceWork:

Hive concept	Tez vertex configuration
Operator tree starting at `ReduceSinkOperator`'s consumer	`processor = ReduceTezProcessor`
Target parallelism	`numReducers` (from Hive's `Operator` tree, optionally auto-parallelized via `ShuffleVertexManager`)
Sort key codec	`OrderedGroupedKVInput.KEY_CLASS`, `KEY_COMPARATOR_CLASS`
`setVertexManagerPlugin`	`ShuffleVertexManager` with auto-parallelism if `hive.tez.auto.reducer.parallelism=true`

For a MergeJoinWork:

processor = MergeJoinProcessor
Multiple sorted inputs (one per join side) using OrderedGroupedKVInput
A custom or built-in vertex manager that coordinates inputs

Operator → IPO mapping

Hive operators run inside a Tez task — they are not Tez constructs. The mapping happens at the input/output boundary of the vertex.

Position	Hive operator	Tez wiring
Vertex entry (map side)	`TableScanOperator`	`MRInputLegacy` (`tez-mapreduce`) emits `(key, value)` from InputFormat
Vertex middle	Filter / Select / GroupBy partial / etc	Pure in-memory operator chain inside `TezProcessor.process`
Vertex exit (shuffle producer)	`ReduceSinkOperator`	`OrderedPartitionedKVOutput` with Hive's `HiveKey` serializer and partitioner
Vertex entry (reduce side)	First operator after the boundary	`OrderedGroupedKVInput` provides a `KeyValuesReader`; `ReduceRecordProcessor` adapts it into Hive's tuple-at-a-time interface
Vertex middle (reduce)	GroupBy aggregation, Join, etc	Operator chain
Vertex exit (final)	`FileSinkOperator`	`MROutputLegacy` writes to HDFS
Broadcast join build	`HashTableSinkOperator`	`UnorderedKVOutput` (or in newer Hive a `BROADCAST`-typed edge) feeding the probe vertex
Broadcast join probe	`MapJoinOperator`	`UnorderedKVInput` on a BROADCAST edge

grep -rn "OrderedPartitionedKVOutput\|OrderedGroupedKVInput\|UnorderedKVOutput\|UnorderedKVInput" \
  ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez | head -20

`TezProcessor` adapter

grep -n "class TezProcessor\|class MapTezProcessor\|class ReduceTezProcessor\|process(" \
  ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezProcessor.java

TezProcessor.run(inputs, outputs):

Pull the singular input (MRInputLegacy or first OrderedGroupedKVInput).
Construct a RecordSource that adapts the Tez reader into Hive's Operator.process(Object row, int tag) calling convention.
Run the operator tree until the input is drained.
Call forward(EOF) to drain operator buffers.
Close outputs in reverse order.

The processor is intentionally thin — all the interesting logic is in the Hive operator chain.
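The shape of that adapter (drain the reader, flush the operator tree, close outputs in reverse) fits in a few lines. The interfaces below are hypothetical simplifications; Hive's `Operator` and Tez's `LogicalOutput` have much richer contracts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch only: RowOperator and Output are hypothetical, simplified interfaces.
final class ProcessorAdapterSketch {

  interface RowOperator {
    void process(Object row, int tag);
    void flush();
  }

  interface Output {
    void close();
  }

  /** Feeds every row to the operator tree, flushes it, then closes outputs in reverse order. */
  static void run(Iterator<?> reader, RowOperator root, List<? extends Output> outputs) {
    while (reader.hasNext()) {
      root.process(reader.next(), 0);   // tag 0: the single main input
    }
    root.flush();                       // let operators drain their buffers
    for (int i = outputs.size() - 1; i >= 0; i--) {
      outputs.get(i).close();
    }
  }

  /** Runs the flow with recording stubs and returns the ordered event log. */
  static List<String> trace(List<?> rows, int outputCount) {
    List<String> log = new ArrayList<>();
    RowOperator op = new RowOperator() {
      @Override public void process(Object row, int tag) { log.add("row:" + row); }
      @Override public void flush() { log.add("flush"); }
    };
    List<Output> outputs = new ArrayList<>();
    for (int i = 0; i < outputCount; i++) {
      final int id = i;
      outputs.add(() -> log.add("close:" + id));
    }
    run(rows.iterator(), op, outputs);
    return log;
  }

  public static void main(String[] args) {
    System.out.println(trace(Arrays.asList("a", "b"), 2));
  }
}
```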

`TezSessionPoolManager`

find ~/hive-src/ql/src/java -name "TezSessionPoolManager.java"
wc -l ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionPoolManager.java

A Tez session = a long-lived Tez AM holding zero or more idle containers ready to accept the next DAG.

Config	Default	Effect
`hive.server2.tez.default.queues`	`default`	Pre-warm sessions per YARN queue.
`hive.server2.tez.sessions.per.default.queue`	1	Number of pre-warm sessions per queue.
`hive.server2.tez.initialize.default.sessions`	false	Start them at HS2 boot.
`hive.tez.exec.print.summary`	false	Surface Tez counters in query output.

Pool flow:

HS2 starts. If initialize.default.sessions=true, launches N AMs per queue.
Query comes in. HS2 calls TezSessionPoolManager.getSession(queue) — gets an idle session or opens a new one.
Session executes the DAG; AM holds containers across DAGs (see container-reuse.md).
On session return, AM remains idle awaiting next DAG.
On idle timeout (hive.server2.session.check.interval), pool may close sessions.
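The borrow-or-open behaviour of the pool can be sketched with a deque. This is a toy model of the flow above, not TezSessionPoolManager: `SessionPoolSketch`, `Session`, and the method names are hypothetical, and queues, idle timeouts, and AM lifecycle are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: a toy borrow/return pool; queues, timeouts and AM lifecycle are omitted.
final class SessionPoolSketch {

  static final class Session {
    final int id;
    final boolean prewarmed;

    Session(int id, boolean prewarmed) {
      this.id = id;
      this.prewarmed = prewarmed;
    }
  }

  private final Deque<Session> idle = new ArrayDeque<>();
  private int nextId = 0;

  /** Pre-warms the pool, as HS2 would at boot when default sessions are initialized. */
  SessionPoolSketch(int prewarmCount) {
    for (int i = 0; i < prewarmCount; i++) {
      idle.addLast(new Session(nextId++, true));
    }
  }

  /** Hands out an idle session if one exists, otherwise opens a new (cold) one. */
  synchronized Session getSession() {
    Session session = idle.pollFirst();
    return session != null ? session : new Session(nextId++, false);
  }

  /** Returns a session to the pool; it stays idle until the next query borrows it. */
  synchronized void returnSession(Session session) {
    idle.addLast(session);
  }

  synchronized int idleCount() {
    return idle.size();
  }

  public static void main(String[] args) {
    SessionPoolSketch pool = new SessionPoolSketch(1);
    Session warm = pool.getSession();
    Session cold = pool.getSession();
    System.out.println("first prewarmed=" + warm.prewarmed + ", second prewarmed=" + cold.prewarmed);
  }
}
```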

LLAP

LLAP (Live Long And Process) is a different execution model that replaces per-query task containers with long-lived per-node daemons. The Tez AM still coordinates, but instead of asking YARN for containers it asks LLAP daemons to run "fragments".

find ~/hive-src/llap-* -maxdepth 2 -type d 2>/dev/null | head

Key differences (do not extrapolate Tez-on-YARN debugging to LLAP):

Containers are replaced by LlapTaskExecutorService worker slots.
The shuffle path uses a Netty-based fetcher (LlapShuffleHandler).
The Tez scheduler plugin is LlapTaskSchedulerService (in hive-llap-tez).
Container reuse is not relevant — LLAP slots are always "hot".

This chapter does not cover LLAP further; treat it as a separate world.

Bug attribution: where does it really live?

Triage tree. Symptom: query fails or returns wrong result.

flowchart TD
  S[Failure observed] --> Q1{Failure message mentions Hive operator?}
  Q1 -- yes --> H1[Hive bug: open against HIVE]
  Q1 -- no --> Q2{Failure in TezChild / IFile / Fetcher?}
  Q2 -- yes --> T1[Tez bug: open against TEZ]
  Q2 -- no --> Q3{Failure in container launch / RM allocation?}
  Q3 -- yes --> Y1[YARN bug: open against YARN]
  Q3 -- no --> Q4{Wrong result not crash?}
  Q4 -- yes --> Q5{Reproduce with same DAG via TestOrderedWordCount-style?}
  Q5 -- no --> H1
  Q5 -- yes --> T1

Practical heuristics:

Stack trace contains	Probably
`org.apache.hadoop.hive.ql.exec.Operator`	Hive
`org.apache.tez.runtime.library`	Tez
`org.apache.tez.dag.app.rm`	Tez (scheduling)
`org.apache.hadoop.yarn`	YARN
`ShuffleHandler`	YARN-side (mapreduce auxservice)
`LlapDaemon`	LLAP (Hive)
`MapJoinOperator` + OOM	Hive (join planning), even though the OOM happens in a Tez container

Wrong-result bugs almost always live in Hive (operator semantics) unless you can isolate the same DAG with synthetic data in TestOrderedWordCount style.
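The heuristics table translates directly into an ordered list of substring checks. The sketch below is a first-pass triage aid only; the precedence between rules is an assumption (the table does not define one), and `TriageSketch` and `probableOwner` are hypothetical names.

```java
// Sketch only: rule order is an assumption; the table above does not define precedence.
final class TriageSketch {

  /** Returns the probable owner for a stack trace, following the heuristics table. */
  static String probableOwner(String stackTrace) {
    boolean oom = stackTrace.contains("OutOfMemoryError");
    if (stackTrace.contains("MapJoinOperator") && oom) return "Hive (join planning)";
    if (stackTrace.contains("LlapDaemon")) return "LLAP (Hive)";
    if (stackTrace.contains("org.apache.hadoop.hive.ql.exec.Operator")) return "Hive";
    if (stackTrace.contains("ShuffleHandler")) return "YARN-side (mapreduce auxservice)";
    if (stackTrace.contains("org.apache.tez.dag.app.rm")) return "Tez (scheduling)";
    if (stackTrace.contains("org.apache.tez.runtime.library")) return "Tez";
    if (stackTrace.contains("org.apache.hadoop.yarn")) return "YARN";
    return "unknown";
  }

  public static void main(String[] args) {
    String trace = "java.lang.OutOfMemoryError: Java heap space\n"
        + "  at org.apache.hadoop.hive.ql.exec.MapJoinOperator.process\n"
        + "  at org.apache.tez.runtime.task.TezChild.run";
    System.out.println(probableOwner(trace));
  }
}
```

The most specific rule (MapJoinOperator plus OOM) is checked first so that a TezChild frame further down the trace does not change the attribution.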

Reading exercise

cat ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java | head -200 — read the top of TezTask.execute.
grep -n "createEdgeProperty\|EdgeProperty\.create" ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java: list all edge property factories Hive uses.
grep -rn "ShuffleVertexManager\|RootInputVertexManager" ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez: when does Hive set each manager?
find ~/hive-src/ql/src/java -name "TezProcessor.java" -exec wc -l {} \; — confirm the processor is < 1000 lines (it's an adapter, not the brain).
grep -rn "TezSessionPoolManager.getSession" ~/hive-src/service/src — when does HS2 acquire sessions?
cat ~/hive-src/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java | head -100 — see how a session wraps a TezClient.

Common bugs and symptoms

Symptom	Likely owner
`MetaException` mid-query	Hive (HMS client)
Container OOM during reduce join	Hive operator (map-join build size); Tez cannot size around an oversized hash table
Wrong row counts after a query rewrite	Hive optimizer or `MapJoinOperator` semantics
`Fetcher: ConnectException to nm:13562`	YARN (aux-service mis-config)
AM dies with `org.apache.tez.dag.app.DAGAppMaster: Vertex failed` and the diagnostic mentions only `TezProcessor`, no Hive class	Tez bug — open a reproducer DAG without Hive
Slow first query after HS2 restart	No warm sessions; enable `initialize.default.sessions`
Stale ACL after `GRANT` reissue	Hive (HMS) — Tez containers cache delegation tokens; see container-reuse.md

Validation: prove you understand this

List the Hive operators on the source and destination sides of a SCATTER_GATHER shuffle edge and map each side to the Tez Input or Output class.
Identify the Hive method that finally calls tezClient.submitDAG. Cite path + grep command.
Given a query that succeeds in standalone HS2 but fails in HS2 with session pooling on, name two likely failure modes and where to look.
Explain why a MapJoinOperator OOM is a Hive bug even though the OOM stack trace is rooted in TezChild.
Show, in three lines, the conditional inside DagUtils that decides whether to install ShuffleVertexManager on a reduce vertex. (Find via grep; quote the file:line.)

YARN Integration

The Tez AM is, from YARN's perspective, an ordinary YARN application: an ApplicationMaster running in a container, talking to the ResourceManager to request more containers, talking to NodeManagers to launch them, and writing events to a Timeline Server.

This chapter walks every YARN-facing interface Tez touches.

`DAGAppMaster` as a YARN AM

find tez-dag/src/main/java -name "DAGAppMaster.java"
wc -l $(find tez-dag/src/main/java -name "DAGAppMaster.java")
grep -n "main(\|serviceStart\|serviceInit" \
  $(find tez-dag/src/main/java -name "DAGAppMaster.java") | head

Boot sequence when YARN launches the AM container:

NodeManager runs the AM command line (constructed by TezClientUtils), which is essentially java -cp <classpath> org.apache.tez.dag.app.DAGAppMaster.
DAGAppMaster.main parses environment for ApplicationAttemptId, container ID, AM Resource, etc.
Constructs the DAGAppMaster service tree (state machines, dispatchers, schedulers, ATS publisher).
serviceStart() registers with the RM via AMRMClientAsync.registerApplicationMaster.
Starts its RPC servers: one for clients (DAGClient) and one for the task umbilical (TezTaskUmbilicalProtocol).
Waits for DAG submissions over the client RPC (or, for non-session mode, picks up the pre-submitted DAG from local disk).

Key clients owned by the AM:

Client	Purpose	Library
`AMRMClientAsync`	RM heartbeat: request/release containers	`hadoop-yarn-client`
`NMClientAsync`	NM RPC: launch/stop containers	`hadoop-yarn-client`
`TimelineClient`	ATS event publisher	`hadoop-yarn-client`
`DFSClient`	HDFS access for recovery & temp files	`hadoop-hdfs-client`

`AMRMClientAsync`

grep -n "AMRMClientAsync\|addContainerRequest\|releaseAssignedContainer\|allocate" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java") | head -20

The async wrapper around AMRMClient. Tez uses it instead of the sync client so allocate-callbacks fire on a dedicated thread.

Lifecycle:

Register: registerApplicationMaster(host, rpcPort, trackingUrl). This is the AM telling the RM "I'm alive, here is where to find me."
Allocate loop: a background thread heartbeats at the interval passed to the AMRMClientAsync constructor, which Tez caps with tez.am.am-rm.heartbeat.interval-ms.max. The interval has to stay well below yarn.am.liveness-monitor.expiry-interval-ms, or the RM expires the AM. On each heartbeat the AMRM client sends:
- Pending container requests (added via addContainerRequest).
- Containers to release.
- Application progress (0..1).
It receives:
- Newly allocated containers.
- Completed container statuses.
- Updated node reports (for blacklisting).
- Decommissioned-node reports.
Unregister: unregisterApplicationMaster(state, msg, trackingUrl) on AM shutdown.
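The "accumulate between heartbeats, send as one batch" pattern in the allocate loop can be sketched as follows. This is a simplified model, not AMRMClient's implementation: the class and its `Heartbeat` type are hypothetical, the method names merely echo the real client, and retry or resend behaviour is ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: a simplified batching model; retries and resends are ignored.
final class HeartbeatBatchSketch {

  static final class Heartbeat {
    final List<String> asks;
    final List<String> releases;
    final float progress;

    Heartbeat(List<String> asks, List<String> releases, float progress) {
      this.asks = Collections.unmodifiableList(asks);
      this.releases = Collections.unmodifiableList(releases);
      this.progress = progress;
    }
  }

  private final List<String> pendingAsks = new ArrayList<>();
  private final List<String> pendingReleases = new ArrayList<>();
  private float progress;

  synchronized void addContainerRequest(String ask) {
    pendingAsks.add(ask);
  }

  synchronized void releaseAssignedContainer(String containerId) {
    pendingReleases.add(containerId);
  }

  synchronized void setProgress(float value) {
    progress = Math.max(0f, Math.min(1f, value));   // progress is reported in 0..1
  }

  /** Builds the payload for one heartbeat and clears what has been handed over. */
  synchronized Heartbeat nextHeartbeat() {
    Heartbeat hb = new Heartbeat(
        new ArrayList<>(pendingAsks), new ArrayList<>(pendingReleases), progress);
    pendingAsks.clear();
    pendingReleases.clear();
    return hb;
  }

  public static void main(String[] args) {
    HeartbeatBatchSketch client = new HeartbeatBatchSketch();
    client.addContainerRequest("2048MB@priority1");
    client.releaseAssignedContainer("container_7");
    client.setProgress(0.25f);
    Heartbeat hb = client.nextHeartbeat();
    System.out.println(hb.asks + " " + hb.releases + " " + hb.progress);
  }
}
```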

grep -n "CallbackHandler\|onContainersAllocated\|onContainersCompleted\|onShutdownRequest\|onNodesUpdated" \
  $(find tez-dag/src/main/java -name "YarnTaskSchedulerService.java")

These callbacks run on the AMRM client's internal thread; Tez keeps them short by forwarding to its own dispatcher.
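Keeping callbacks short by handing events to another thread looks roughly like this. A plain `BlockingQueue` and a single thread stand in for Tez's dispatcher; `CallbackForwardSketch` and its event strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: a queue plus one thread stands in for the AM's dispatcher.
final class CallbackForwardSketch {
  private static final String STOP = "__stop__";

  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final List<String> handled = new ArrayList<>();  // written by the dispatcher thread
  private volatile String handlerThreadName;
  private final Thread dispatcher = new Thread(this::drain, "sketch-dispatcher");

  void start() {
    dispatcher.start();
  }

  /** Simulated RM callback: enqueue and return immediately, no real work here. */
  void onContainersAllocated(String containerId) {
    queue.add("ALLOCATED:" + containerId);
  }

  /** Simulated RM callback for completed containers. */
  void onContainersCompleted(String containerId) {
    queue.add("COMPLETED:" + containerId);
  }

  /** Stops the dispatcher once it has drained the queue and returns the handled events. */
  List<String> stopAndGetHandled() {
    queue.add(STOP);
    try {
      dispatcher.join();   // join makes the dispatcher's writes visible to the caller
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return handled;
  }

  String handlerThreadName() {
    return handlerThreadName;
  }

  private void drain() {
    try {
      while (true) {
        String event = queue.take();
        if (STOP.equals(event)) {
          return;
        }
        handlerThreadName = Thread.currentThread().getName();
        handled.add(event);   // the "real" handling happens off the callback thread
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    CallbackForwardSketch sketch = new CallbackForwardSketch();
    sketch.start();
    sketch.onContainersAllocated("container_1");
    sketch.onContainersCompleted("container_1");
    System.out.println(sketch.stopAndGetHandled() + " handled on " + sketch.handlerThreadName());
  }
}
```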

`NMClientAsync` and `ContainerLauncherImpl`

find tez-dag/src/main/java -name "ContainerLauncherImpl.java"
wc -l $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java")
grep -n "NMClientAsync\|startContainerAsync\|stopContainerAsync" \
  $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java")

After the RM allocates a container, Tez must tell the relevant NM to actually launch the JVM. ContainerLauncherImpl uses NMClientAsync to send startContainerAsync(container, containerLaunchContext).

`ContainerLaunchContext`

grep -n "buildContainerLaunchContext\|ContainerLaunchContext\|setLocalResources\|setEnvironment\|setCommands" \
  $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java" \
                                -o -name "AMContainerHelpers.java")

The CLC is what NM uses to fork the JVM. It carries:

Field	What Tez puts there
`commands`	`java <jvm opts> -Dlog4j.configuration=... org.apache.tez.runtime.task.TezChild <args>`
`environment`	`CLASSPATH`, `JVM_PID`, container ID, AM host/port
`localResources`	Tez tarball, user JARs, any HDFS-distributed resources
`tokens`	Delegation tokens (HDFS, HMS, etc) for the container to use
`serviceData`	Per-aux-service payload (e.g. `mapreduce_shuffle` job secret)

grep -n "ServiceData\|JobTokenSecretManager\|shuffleSecret" \
  $(find tez-dag/src/main/java -name "*.java") | head

The serviceData map entry under key mapreduce_shuffle carries the serialized JobToken that NM's ShuffleHandler will use to authorize fetch requests — this is why mapreduce_shuffle must be configured as an NM aux-service even for Tez DAGs.

Tokens

grep -rn "AMRMToken\|ClientToAMToken\|TimelineDelegationToken" \
  tez-dag/src/main/java | head

Token	Issued by	Used for	Where it lives
`AMRMToken`	RM, auto-injected into AM's `Credentials`	AM↔RM RPC	AM JVM credentials
`ClientToAMToken`	RM, returned to client at submit	Client (`DAGClient`) ↔ AM RPC	Client credentials
`TimelineDelegationToken`	Timeline Server	AM → Timeline publisher	AM credentials, refreshed periodically
HDFS delegation token	NN	Tasks reading/writing HDFS	Container credentials
Hive Metastore token	HMS	Tasks calling HMS	Container credentials, via Hive code path

The client collects all necessary delegation tokens at submit time (TezClientUtils does this); the AM then passes them to NMs in the CLC. Tokens that expire mid-DAG must be renewed by a TokenRenewer.

Log aggregation

grep -rn "log-aggregation\|LogAggregationService" \
  $(find ~/hadoop-src -name "*.java" 2>/dev/null | head -3) 2>/dev/null | head

YARN log aggregation is configured in yarn-site.xml:

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/app-logs</value>
</property>

When enabled, every container's stdout, stderr, and syslog are uploaded to HDFS under /app-logs/<user>/logs/<applicationId>/<nodeAddress> once the application finishes (or at intervals, if rolling aggregation is configured on the NodeManagers). Retrieve with:

yarn logs -applicationId application_1234_0001 -containerId container_..._01
yarn logs -applicationId application_1234_0001 -appOwner alice

Without aggregation, logs sit in ${yarn.nodemanager.log-dirs}/application_.../container_.../ on each NM until cleaned by yarn.nodemanager.log.retain-seconds.
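When scripting against aggregated logs it helps to build the HDFS directory from its parts. The helper below follows the layout quoted in this section (`<remote-dir>/<user>/logs/<applicationId>/<nodeAddress>`); that layout is taken from the text and can differ between Hadoop versions, and the `host_port` node name in the example is an assumption.

```java
// Sketch only: path layout as quoted in this section; it may differ across Hadoop versions.
final class AggregatedLogPathSketch {

  /** Joins the parts of an aggregated-log location, tolerating a trailing slash on the root. */
  static String containerLogDir(String remoteRoot, String user, String suffix,
                                String applicationId, String nodeAddress) {
    String root = remoteRoot.endsWith("/")
        ? remoteRoot.substring(0, remoteRoot.length() - 1)
        : remoteRoot;
    return root + "/" + user + "/" + suffix + "/" + applicationId + "/" + nodeAddress;
  }

  public static void main(String[] args) {
    // "nm1_45454" is a made-up node name used only for illustration.
    System.out.println(
        containerLogDir("/app-logs", "alice", "logs", "application_1234_0001", "nm1_45454"));
  }
}
```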

Timeline Server (ATS)

Tez publishes a rich event stream to ATS for post-mortem debugging.

find tez-plugins -type d -name "tez-yarn-timeline*"
ls tez-plugins/

Three flavors exist in the wild:

Version	Tez plugin module	Notes
ATSv1	`tez-yarn-timeline-history`	Original; LevelDB-backed Timeline Server. Deprecated.
ATSv1.5	`tez-yarn-timeline-history-with-acls` and `tez-yarn-timeline-history-with-fs`	Adds entity-file staging to HDFS; reduces ATS write load.
ATSv2	`tez-yarn-timeline-history-with-fs` + ATSv2 reader configuration	HBase-backed, scalable; requires Hadoop 3.x.

grep -rn "TimelineClient\|TIMELINE_HISTORY\|HistoryEventHandler" \
  tez-plugins/tez-yarn-timeline-history*/src/main/java | head

What gets published:

AppLaunchedEvent
DAGSubmittedEvent, DAGInitializedEvent, DAGStartedEvent, DAGFinishedEvent
VertexInitializedEvent, VertexStartedEvent, VertexFinishedEvent
TaskStartedEvent, TaskFinishedEvent
TaskAttemptStartedEvent, TaskAttemptFinishedEvent
ContainerLaunchedEvent, ContainerStoppedEvent

The Tez UI (Ambari, standalone) reads these events to render the DAG view, vertex graphs, task swimlanes, and counter trees.

ls tez-ui/src/main 2>/dev/null

Configuration cheat sheet

grep -n "YARN\|ATS\|TIMELINE\|LOG_AGG" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java | head -20

Key	Default	Effect
`tez.am.am-rm.heartbeat.interval-ms.max`	1000	Cap on AMRM heartbeat interval.
`tez.am.client.am.port-range`	auto	Port range for the AM's client-facing RPC server.
`tez.am.container.lookup.timeout-ms`	30000	How long to wait for an NM ack before failing the launch.
`tez.history.logging.service.class`	(varies)	Which ATS plugin to use.
`tez.am.tez-ui.history-url.template`	template	Where the UI is hosted; surfaced in `DAGStatus`.

`yarn` CLI behaviors for Tez apps

Command	Behavior on a Tez app
`yarn application -list`	Lists Tez AMs alongside MR/Spark; type tag is `TEZ`.
`yarn application -status <appId>`	Shows AM state, RM tracking URL, ATS tracking URL (if configured).
`yarn application -kill <appId>`	RM SIGKILLs the AM container; Tez state is lost (no recovery beyond what `RecoveryService` already wrote).
`yarn logs -applicationId <appId>`	Streams aggregated logs of all containers: the AM and every TezChild.
`yarn node -list`	Useful for confirming aux-service `mapreduce_shuffle` is up on each NM.

Reading exercise

grep -n "registerApplicationMaster\|unregisterApplicationMaster" $(find tez-dag/src/main/java -name "*.java")  # find every AM-lifecycle call.
grep -rn "setupContainerEnvironment\|buildContainerEnvironment" tez-dag/src/main/java tez-api/src/main/java | head  # what environment variables does the AM pass to each container?
cat $(find tez-dag/src/main/java -name "ContainerLauncherImpl.java") | head -200 — read the launch path.
grep -rn "mapreduce_shuffle" tez-dag/src/main/java tez-api/src/main/java — verify the aux-service name is hard-coded.
find tez-plugins -name "*.java" | xargs grep -l "TimelineEntity" | head -3 — which classes assemble ATS entities?
cat $(find tez-dag/src/main/java -name "DAGAppMaster.java") | head -300 — locate serviceInit and list every service added to the composite.

Common bugs and symptoms

Symptom	Likely cause
AM exits with `InvalidApplicationMasterRequestException`	AM tried to register twice or after un-register; usually a re-init bug.
`Auxiliary service mapreduce_shuffle not configured`	`yarn-site.xml` aux-services missing.
`ConnectionRefused` from Fetcher	NodeManager aux-service crashed or wrong shuffle port.
AM dies with "RM expired"	AMRM heartbeat thread blocked, or paused for GC longer than the expiry interval.
ATS empty for completed app	`tez.history.logging.service.class` mis-set, or ATS not running.
`yarn logs` returns "Logs not aggregated"	Container did not finish cleanly, or aggregation not enabled.
ClientToAMToken auth fail	Client and AM disagree on cluster security; check both have the same `hadoop.security.authentication`.

Validation: prove you understand this

Trace the exact call path from DAGAppMaster.serviceStart to AMRMClientAsync.registerApplicationMaster.
List the contents of the ContainerLaunchContext.serviceData map that Tez populates, and explain who reads each entry.
Explain why a long full-GC pause in the AM can manifest as an "RM expired" shutdown, and which config controls the threshold.
For an app with yarn.log-aggregation-enable=false and a TezChild that crashed, give the exact filesystem path on the NM where its stderr lives. Use the configured yarn.nodemanager.log-dirs as a variable.
Name the three ATS plugin modules, and pick the right one for a Hadoop 3.x cluster targeting HBase-backed ATSv2.

Failure Handling

A Tez DAG fails for many reasons: a corrupted input split, a flaky NM, an OOM in the user processor, a Kerberos token expiry, an RM connectivity blip, an AM crash. Tez has a layered escalation model: small failures are absorbed, big ones propagate, and the AM persists enough state to recover from its own death.

This chapter walks the escalation, the failure taxonomy, and the recovery machinery.

Escalation: attempt → task → vertex → DAG

flowchart TD
  TA[TaskAttempt fails] -->|retry budget| TA2[New TaskAttempt]
  TA -->|exhausted| T[Task fails]
  T -->|failure policy| V[Vertex fails]
  V -->|fail-on-vertex-failure| D[DAG fails]

Default behavior:

Layer	Configuration	Default	Effect when exceeded
TaskAttempt	`tez.am.task.max.failed.attempts`	4	Mark Task as failed
Task	no direct knob; per-vertex policy	varies	Vertex fails on first failed task by default
Vertex	per-DAG failure policy	fail-fast	DAG fails

grep -n "MAX_FAILED_ATTEMPTS\|MAX_TASK_ATTEMPTS\|TEZ_AM_TASK_MAX" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

`TaskAttemptTerminationCause`

find tez-dag/src/main/java -name "TaskAttemptTerminationCause.java"
cat $(find tez-dag/src/main/java -name "TaskAttemptTerminationCause.java")

The enum names every reason a TaskAttempt can end up non-SUCCEEDED. A selection:

Cause	Source	Retryable?
`TERMINATED_BY_CLIENT`	user-initiated DAG kill	no
`INTERNAL_PREEMPTION`	scheduler preempted to make room	yes
`EXTERNAL_PREEMPTION`	YARN preempted the container	yes
`CONTAINER_LAUNCH_FAILED`	NM rejected the launch	retried on a new container
`CONTAINER_EXITED`	TezChild exited without a status update	yes
`CONTAINER_STOPPED`	AM stopped the container intentionally	depends
`NODE_FAILED`	NM died	yes, on a different node
`NODE_BLACKLISTED`	node accumulated too many failures	retried elsewhere
`OUTPUT_LOST`	downstream reported missing output	yes, re-run source TA
`INPUT_READ_ERROR`	TA failed reading shuffle from a source	yes
`APPLICATION_ERROR`	uncaught exception in user code	usually no, but retried up to attempt budget
`FRAMEWORK_ERROR`	uncaught exception in Tez code	sometimes no
`OTHER_TASK_ATTEMPT_KILLED_DUPLICATE`	speculative duplicate lost	no (not a failure)

This enum is the most important debugging signal — every failed attempt in ATS / AM log surfaces a cause from this list.

TaskAttempt failure retries

grep -n "max.failed.attempts\|maxFailedAttempts\|attemptFailed" \
  $(find tez-dag/src/main/java -name "TaskImpl.java")

On a TA failure:

TaskAttemptImpl transitions to FAILED (or KILLED if the cause is in the "killed" subset).
TaskImpl increments its failed-attempt counter.
If counter < tez.am.task.max.failed.attempts, TaskImpl schedules a new TaskAttemptImpl (incremented attempt index).
Otherwise TaskImpl transitions to FAILED and reports up to VertexImpl.

Some causes are not counted against the budget (e.g. OUTPUT_LOST, NODE_FAILED) — these are infrastructure failures, not user-code failures.
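The retry decision above can be sketched in a few lines. This is a simplified model, not Tez's TaskImpl: the class, the `Decision` enum, and the choice of which causes are exempt are assumptions for illustration (the exempt set simply mirrors the two examples named in the text).

```java
import java.util.EnumSet;
import java.util.Set;

/** Sketch of a per-task retry budget; a simplified model, not Tez's TaskImpl. */
final class RetryBudgetSketch {
  /** Illustrative subset of causes; names mirror the table above. */
  enum Cause { APPLICATION_ERROR, FRAMEWORK_ERROR, INPUT_READ_ERROR, OUTPUT_LOST, NODE_FAILED }

  enum Decision { RETRY, FAIL_TASK }

  /** Assumption: these causes are treated as infrastructure failures and are not counted. */
  static final Set<Cause> NOT_COUNTED = EnumSet.of(Cause.OUTPUT_LOST, Cause.NODE_FAILED);

  private final int maxFailedAttempts; // stands in for tez.am.task.max.failed.attempts
  private int failedAttempts;

  RetryBudgetSketch(int maxFailedAttempts) {
    this.maxFailedAttempts = maxFailedAttempts;
  }

  /** Records one failed attempt and decides whether the task gets another one. */
  Decision onAttemptFailed(Cause cause) {
    if (!NOT_COUNTED.contains(cause)) {
      failedAttempts++;
    }
    return failedAttempts < maxFailedAttempts ? Decision.RETRY : Decision.FAIL_TASK;
  }

  int failedAttempts() {
    return failedAttempts;
  }

  public static void main(String[] args) {
    RetryBudgetSketch task = new RetryBudgetSketch(4);
    System.out.println(task.onAttemptFailed(Cause.NODE_FAILED)); // exempt, budget untouched
    for (int i = 0; i < 4; i++) {
      System.out.println(task.onAttemptFailed(Cause.APPLICATION_ERROR));
    }
  }
}
```

With a budget of 4, the first three counted failures still yield a retry and the fourth fails the task, which matches the `counter < max` rule in the list above.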

grep -n "isFatalFailure\|isExternalError\|countAsFailure" \
  $(find tez-dag/src/main/java -name "*.java" | head -50)

Node blacklisting

grep -rn "NodeTracker\|blacklist\|BLACKLISTED" \
  tez-dag/src/main/java/org/apache/tez/dag/app | head

Per-node failure accounting:

Trigger or key	Effect
N task attempts fail on the same node within a window	Add node to blacklist for this app
`NodeReport` from RM says UNHEALTHY	Add node to blacklist immediately
`tez.am.maxtaskfailures.per.node`	Per-node failure threshold (default 3)
`tez.am.node-blacklisting.enabled`	Master toggle
`tez.am.node-blacklisting.ignore-threshold-node-percent`	Don't blacklist if it would remove more than N% of the cluster

A blacklisted node:

No new container requests go to it.
Held containers on it are released.
Existing attempts already running on it are allowed to finish (not preemptively killed).
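To make the interaction between the per-node threshold and the percentage cap concrete, here is a minimal sketch. It is a deliberately simplified model (no time window, no un-blacklisting), the class and field names are hypothetical, and the exact comparison used for the cap is an assumption rather than Tez's implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of per-node failure accounting with a percentage cap; not Tez code. */
final class NodeBlacklistSketch {
  private final int maxFailuresPerNode;     // stands in for tez.am.maxtaskfailures.per.node
  private final int ignoreThresholdPercent; // stands in for ...ignore-threshold-node-percent
  private final int clusterSize;
  private final Map<String, Integer> failures = new HashMap<>();
  private final Set<String> blacklisted = new HashSet<>();

  NodeBlacklistSketch(int maxFailuresPerNode, int ignoreThresholdPercent, int clusterSize) {
    this.maxFailuresPerNode = maxFailuresPerNode;
    this.ignoreThresholdPercent = ignoreThresholdPercent;
    this.clusterSize = clusterSize;
  }

  /** Records one failed attempt on a node; returns true if the node is blacklisted afterwards. */
  boolean recordFailure(String node) {
    int count = failures.merge(node, 1, Integer::sum);
    boolean overThreshold = count >= maxFailuresPerNode;
    // Assumption: refuse to blacklist if doing so would exceed the allowed share of the cluster.
    boolean withinCap = (blacklisted.size() + 1) * 100 <= ignoreThresholdPercent * clusterSize;
    if (overThreshold && withinCap) {
      blacklisted.add(node);
    }
    return blacklisted.contains(node);
  }

  public static void main(String[] args) {
    NodeBlacklistSketch tracker = new NodeBlacklistSketch(3, 33, 8);
    for (String node : new String[] {"nm-a", "nm-b", "nm-c"}) {
      boolean result = false;
      for (int i = 0; i < 3; i++) {
        result = tracker.recordFailure(node);
      }
      System.out.println(node + " blacklisted=" + result);
    }
  }
}
```

In this model, with 8 nodes and a 33% cap, two nodes can be blacklisted (25% of the cluster) but a third would reach 37.5% and is left schedulable.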

Output loss

A common late-stage failure: a downstream task reads from a shuffle source and finds the source's output is gone (the NM died, the disk was wiped, etc).

grep -rn "OUTPUT_LOST\|reportSourceTaskAttemptFailed\|inputFailedEvent" \
  tez-runtime-library/src/main/java tez-dag/src/main/java | head

Flow:

Destination TA's Fetcher fails permanently on source S.a.0.
Destination TA sends InputReadErrorEvent via umbilical heartbeat.
AM's VertexImpl receives the event, marks S.a.0 as OUTPUT_LOST.
TaskImpl for S.a schedules a new attempt S.a.1.
New attempt re-runs, produces fresh outputs, downstream resumes.

This is the cascading-rerun engine — and a source of pathological behavior when a single bad disk poisons many downstream tasks.
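The flow above can be modeled in miniature as follows. This is a sketch under stated assumptions: `SourceTask`, `AttemptState`, and `onInputReadError` are hypothetical names, and the rule that repeated reports against the same attempt are ignored is an assumption added to show why many downstream readers do not each trigger their own re-run.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the output-lost re-run flow; a toy model, not Tez's VertexImpl or TaskImpl. */
final class OutputLostSketch {
  enum AttemptState { RUNNING, SUCCEEDED, OUTPUT_LOST }

  /** One source task such as "S.a"; attempts are indexed 0, 1, 2, ... */
  static final class SourceTask {
    final String name;
    final List<AttemptState> attempts = new ArrayList<>();

    SourceTask(String name) {
      this.name = name;
    }

    /**
     * Handles a read-error report against one attempt.
     * Returns the id of the newly scheduled attempt, or null if nothing was scheduled.
     */
    String onInputReadError(int attemptIndex) {
      if (attempts.get(attemptIndex) != AttemptState.SUCCEEDED) {
        return null; // already marked lost by an earlier report, or never produced output
      }
      attempts.set(attemptIndex, AttemptState.OUTPUT_LOST);
      attempts.add(AttemptState.RUNNING); // the fresh attempt that regenerates the output
      return name + "." + (attempts.size() - 1);
    }
  }

  public static void main(String[] args) {
    SourceTask task = new SourceTask("S.a");
    task.attempts.add(AttemptState.SUCCEEDED);
    System.out.println(task.onInputReadError(0)); // first report schedules S.a.1
    System.out.println(task.onInputReadError(0)); // a second reader's report is a no-op
  }
}
```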

AM failover

find tez-dag/src/main/java -name "RecoveryService.java"
find tez-dag/src/main/java -name "RecoveryEventHandler.java"
wc -l $(find tez-dag/src/main/java -name "RecoveryService.java")

YARN keeps a small budget of AM restarts (yarn.resourcemanager.am.max-attempts, default 2). When the AM crashes:

RM allocates a fresh AM container, attempt index incremented.
New AM boots, sees attempt > 1, enters recovery mode.
RecoveryService reads the recovery log from HDFS (written by attempt 1).
Replays events to reconstruct DAG, Vertex, Task, TaskAttempt state.
Inherits any pre-existing containers reported by the RM at registration (RegisterApplicationMasterResponse.getContainersFromPreviousAttempts).
Resumes scheduling from the last consistent state.
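The replay step is easiest to see on a toy log. The sketch below assumes a flattened event shape (entity id plus the state it reached) and a single terminal state named FINISHED; both are simplifications for illustration and none of these class names exist in Tez.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of replaying an append-only event log; not Tez's RecoveryService. */
final class RecoveryReplaySketch {
  /** Flattened stand-in for a recovery event: which entity, and the state it reached. */
  static final class LoggedEvent {
    final String entityId;
    final String state;

    LoggedEvent(String entityId, String state) {
      this.entityId = entityId;
      this.state = state;
    }
  }

  /** Replays events in log order; the last event written for an entity wins. */
  static Map<String, String> replay(List<LoggedEvent> log) {
    Map<String, String> state = new LinkedHashMap<>();
    for (LoggedEvent e : log) {
      state.put(e.entityId, e.state);
    }
    return state;
  }

  /** Entities whose last logged state is not terminal must be resumed or re-run. */
  static List<String> unfinished(Map<String, String> state) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, String> e : state.entrySet()) {
      if (!"FINISHED".equals(e.getValue())) {
        out.add(e.getKey());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<LoggedEvent> log = Arrays.asList(
        new LoggedEvent("dag_1", "STARTED"),
        new LoggedEvent("task_0", "STARTED"),
        new LoggedEvent("task_0", "FINISHED"),
        new LoggedEvent("task_1", "STARTED")); // AM crashed before task_1 finished
    System.out.println(unfinished(replay(log)));
  }
}
```

Anything that happened after the last flushed event is simply absent from the log, so the new attempt sees the entity in its last recorded state.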

RecoveryService

grep -n "writeEvent\|flush\|recover\|RecoveryEventType" \
  $(find tez-dag/src/main/java -name "RecoveryService.java" \
                                -o -name "RecoveryEventHandler.java")

Append-only event logs on HDFS, one directory per app attempt, holding a summary file and a per-DAG file:

hdfs:///tmp/staging/<user>/.staging/application_<id>/appattempt_<id>_NNNNNN/
  recovery/
    summary.dag_1.recovery
    dag_1.recovery

Event kinds:

Event	When written
`DAGSubmittedEvent`	DAG arrives at AM
`DAGInitializedEvent`	DAG state machine reaches INITED
`DAGStartedEvent`	DAG reaches RUNNING
`DAGFinishedEvent`	DAG terminal state
`VertexInitializedEvent`, `VertexStartedEvent`, `VertexFinishedEvent`	mirror state transitions
`TaskStartedEvent`, `TaskFinishedEvent`	per task
`TaskAttemptStartedEvent`, `TaskAttemptFinishedEvent`	per attempt
`VertexConfigurationDoneEvent`	parallelism finalized

find tez-dag/src/main/java -name "*Event*.java" -path "*recovery*"

Configuration

Key	Default	Effect
`tez.dag.recovery.enabled`	true	Master toggle.
`tez.dag.recovery.flush.interval.secs`	30	Periodic fsync of the recovery log.
`tez.dag.recovery.io.buffer.size`	8192	Buffer for the writer.
`yarn.resourcemanager.am.max-attempts` (YARN)	2	Caps recovery attempts.

What recovery can and cannot recover

Can	Cannot
DAG / Vertex / Task / TA state at last flush	In-flight events lost since last flush
Counter snapshots written to recovery log	Real-time counter updates between flushes
Container assignments	NM-side container state (rediscovered via `RegisterApplicationMasterResponse.getContainersFromPreviousAttempts`)
User payload of `DAGPlan`	User in-memory state inside a custom `VertexManagerPlugin`

A VertexManagerPlugin that holds in-memory state across events must override getState() / setState() to participate in recovery — otherwise it starts fresh on AM attempt 2.

DAG-level termination causes

find tez-dag/src/main/java -name "DAGTerminationCause.java"
cat $(find tez-dag/src/main/java -name "DAGTerminationCause.java")

Cause	Trigger
`DAG_KILL`	client called `dagClient.tryKillDAG()`
`VERTEX_FAILURE`	a vertex transitioned to FAILED
`INIT_FAILURE`	DAG init failed (bad plan, bad input)
`INTERNAL_ERROR`	unhandled exception inside AM
`AM_USERCODE_FAILURE`	user-supplied plugin threw
`OUT_OF_TEZ_TASK_RESOURCES`	scheduler could not satisfy resource requests
`RECOVERY_FAILURE`	replay couldn't reconstruct prior state

Reading exercise

grep -n "transition\|FAILED\|KILLED" $(find tez-dag/src/main/java -name "TaskAttemptImpl.java") | head -40  # count terminal transitions.
grep -rn "OUTPUT_LOST" tez-dag/src/main/java tez-runtime-library/src/main/java — what triggers this cause?
cat $(find tez-dag/src/main/java -name "RecoveryService.java") | head -200 — read the writer loop.
grep -n "RecoveryEvent" $(find tez-dag/src/main/java -name "*.java" | head -50) — list all recovery event classes.
wc -l $(find tez-dag/src/main/java -name "TaskAttemptTerminationCause.java" -o -name "DAGTerminationCause.java" -o -name "VertexTerminationCause.java")
grep -rn "node-blacklisting\|blacklistNode" tez-dag/src/main/java | head — where is blacklist enforcement implemented?

Common bugs and symptoms

Symptom	Likely cause
`OUTPUT_LOST` cascade kills the DAG	One bad NM is poisoning downstream tasks; blacklist the node or steer work away from it.
Recovery infinite-loops on attempt 2	Corrupt recovery log; check fsync gating and `tez.dag.recovery.flush.interval.secs`.
`INTERNAL_PREEMPTION` repeatedly	Tez scheduler is preempting its own attempts; usually a higher-priority vertex starving lower; tune priorities.
All attempts of one task fail in < 1s	User code throws deterministically; cause is `APPLICATION_ERROR`.
DAG hangs forever after one task fails	Vertex failure policy is permissive (rare); look at the vertex transition.
`NODE_BLACKLISTED` removes 100% of cluster	`ignore-threshold-node-percent` not set; the DAG is now unschedulable.
AM crashes, attempt 2 boots, but tasks restart from scratch	Recovery disabled or HDFS staging dir not accessible to attempt 2.

Validation: prove you understand this

List five TaskAttemptTerminationCause values that do not count against the attempt budget. Cite where the predicate lives.
Explain in two sentences how an OUTPUT_LOST on source S.a.0 triggers a re-run of S.a, not just S.a.0's downstream consumers.
Identify the HDFS path pattern under which recovery events are written. Give the exact path components.
Describe what happens to in-flight DataMovementEvents when the AM crashes mid-DAG and AM attempt 2 takes over.
Given tez.am.maxtaskfailures.per.node=3 and an 8-node cluster, what is the smallest sequence of task failures that triggers blacklisting? Show the math.

Counters and Diagnostics

When a Tez DAG misbehaves, you have two primary signals: counters (numeric aggregates from every task) and diagnostics strings (free-text causes at every level of the hierarchy). This chapter is the operator's reference for both.

`TezCounters`

find tez-api/src/main/java -name "TezCounters.java"
wc -l $(find tez-api/src/main/java -name "TezCounters.java")
grep -n "class TezCounters\|addGroup\|getGroup\|findCounter" \
  $(find tez-api/src/main/java -name "TezCounters.java")

TezCounters is a typed map of (groupName) → CounterGroup → (counterName) → Counter. Group and counter names are interned (hash-cons style), so identical strings share storage. Counters are long values with thread-safe increment.

find tez-api/src/main/java -name "TaskCounter.java"
cat $(find tez-api/src/main/java -name "TaskCounter.java")

Standard groups

Group	Source class	What lives there
`org.apache.tez.common.counters.TaskCounter`	`TaskCounter` enum	Per-task framework metrics
`org.apache.tez.common.counters.DAGCounter`	`DAGCounter` enum	Per-DAG aggregate metrics
`org.apache.tez.common.counters.FileSystemCounter`	`FileSystemCounter`	Per-FS bytes-read/written
`org.apache.hadoop.mapreduce.JobCounter`	(legacy MR)	Compatibility shim
User-defined	`<your class name>`	App code

Key `TaskCounter` values

grep -n "INPUT_RECORDS_PROCESSED\|OUTPUT_RECORDS\|SPILLED_RECORDS\|SHUFFLE_BYTES\|GC_TIME_MILLIS\|REDUCE_INPUT_GROUPS" \
  $(find tez-api/src/main/java -name "TaskCounter.java")

Counter	Meaning
`INPUT_RECORDS_PROCESSED`	Records read from logical inputs
`OUTPUT_RECORDS`	Records written to logical outputs
`OUTPUT_BYTES`	Serialized bytes written, before compression
`OUTPUT_BYTES_PHYSICAL`	Bytes physically written, after compression if enabled
`SPILLED_RECORDS`	Records spilled by sorter
`NUM_SPILLS`	Number of spill files created
`MERGED_MAP_OUTPUTS`	Spills merged on the source side
`SHUFFLE_BYTES`	Bytes fetched by shuffle
`SHUFFLE_BYTES_TO_MEM`, `SHUFFLE_BYTES_TO_DISK`	Fetcher allocation split
`REDUCE_INPUT_GROUPS`	Distinct keys seen by a `KeyValuesReader`
`REDUCE_INPUT_RECORDS`	Total values across all groups
`GC_TIME_MILLIS`	Sum of GC time during the task
`CPU_MILLISECONDS`	Process CPU time
`COMMITTED_HEAP_BYTES`	Heap size at task end
`PHYSICAL_MEMORY_BYTES`, `VIRTUAL_MEMORY_BYTES`	Process memory snapshot

`DAGCounter`

find tez-api/src/main/java -name "DAGCounter.java"
cat $(find tez-api/src/main/java -name "DAGCounter.java")

Counter	Meaning
`NUM_SUCCEEDED_TASKS`	Aggregated across all vertices
`NUM_KILLED_TASKS`	Speculative duplicates + user kills
`NUM_FAILED_TASKS`	TA failures (counts every failed attempt)
`TOTAL_LAUNCHED_TASKS`	Lifetime sum
`OTHER_LOCAL_TASKS`, `RACK_LOCAL_TASKS`, `DATA_LOCAL_TASKS`	Locality histogram
`AM_CPU_MILLISECONDS`, `AM_GC_TIME_MILLIS`	AM process counters
`WALL_CLOCK_MILLIS`	DAG submission → completion

Aggregation: TA → task → vertex → DAG

flowchart TD
  TA[TaskAttempt counters] -->|flushed via heartbeat| T[Task counters]
  T -->|on TASK_SUCCEEDED| V[Vertex counters]
  V -->|on VERTEX_SUCCEEDED| D[DAG counters]

Mechanism:

Each LogicalIOProcessorRuntimeTask accumulates counters in process.
TaskReporter heartbeat carries a snapshot to the AM via TezTaskUmbilicalProtocol.statusUpdate.
AM's TaskAttemptImpl stores the latest snapshot.
On TASK_SUCCEEDED, the winning attempt's counters become the Task counters; other attempts are discarded.
On VERTEX_SUCCEEDED, VertexImpl sums all task counters into the vertex counters.
On DAG_SUCCEEDED, DAGImpl sums all vertex counters into DAG counters and includes AM_* and DAG_* self-counters.
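The roll-up described above amounts to "pick the winner, then sum". The sketch below models counters as plain maps to show that shape; the class and method names are hypothetical and this is not how TezCounters is implemented internally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the counter roll-up using plain maps; not the TezCounters implementation. */
final class CounterRollupSketch {

  /** A task keeps only the winning attempt's counters; other attempts are dropped. */
  static Map<String, Long> taskCounters(Map<Integer, Map<String, Long>> byAttempt, int winner) {
    return byAttempt.get(winner);
  }

  /** Sums counter maps key by key, as a vertex does over tasks or a DAG over vertices. */
  static Map<String, Long> sum(List<Map<String, Long>> parts) {
    Map<String, Long> total = new HashMap<>();
    for (Map<String, Long> part : parts) {
      for (Map.Entry<String, Long> e : part.entrySet()) {
        total.merge(e.getKey(), e.getValue(), Long::sum);
      }
    }
    return total;
  }

  public static void main(String[] args) {
    // Task 0 ran twice: attempt 0 failed part-way, attempt 1 won.
    Map<Integer, Map<String, Long>> task0 = new HashMap<>();
    task0.put(0, Collections.singletonMap("OUTPUT_RECORDS", 5L));
    task0.put(1, Collections.singletonMap("OUTPUT_RECORDS", 10L));
    Map<String, Long> task1 = Collections.singletonMap("OUTPUT_RECORDS", 7L);

    Map<String, Long> vertex = sum(Arrays.asList(taskCounters(task0, 1), task1));
    System.out.println(vertex); // the failed attempt's 5 records are not included
  }
}
```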

grep -n "incrCounters\|aggregateCounters\|getCounters\|setCounters" \
  $(find tez-dag/src/main/java -name "TaskAttemptImpl.java" \
                                -o -name "TaskImpl.java" \
                                -o -name "VertexImpl.java" \
                                -o -name "DAGImpl.java") | head -30

Counter limits (and how they kill DAGs)

grep -n "COUNTERS_MAX\|TEZ_COUNTERS_MAX\|countersMax" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Key	Default	Cap on
`tez.counters.max`	1200	Total counter count per `TezCounters` instance
`tez.counters.max.groups`	500	Group count
`tez.counters.group-name.max`	256	Length of a group name
`tez.counters.counter-name.max`	64	Length of a counter name

Exceeding any limit throws LimitExceededException. This typically happens when:

An app creates a counter per unique key (e.g. per file path).
A user vertex manager creates per-task counters.
A DAG has very many vertices, each contributing many counters, and the DAG-level aggregate blows the cap.

The exception propagates up the heartbeat path and kills the DAG with INTERNAL_ERROR. Look for LimitExceededException in the AM log to confirm.
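A minimal model of a counter cap helps explain why per-key counters are dangerous: existing counters can be looked up forever, but the first new name past the cap throws. The class below is a sketch; `LimitExceeded` is a hypothetical stand-in for the real exception, and the cap is passed in rather than read from tez.counters.max.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of a total-counter cap; a toy model, not Tez's limit enforcement. */
final class CounterLimitSketch {
  /** Hypothetical stand-in for the real LimitExceededException. */
  static final class LimitExceeded extends RuntimeException {
    LimitExceeded(String message) {
      super(message);
    }
  }

  private final int maxCounters; // stands in for tez.counters.max
  private final Set<String> names = new HashSet<>();

  CounterLimitSketch(int maxCounters) {
    this.maxCounters = maxCounters;
  }

  /** Looks up a counter, creating it if needed; creation past the cap throws. */
  void findCounter(String group, String name) {
    String key = group + "/" + name;
    if (!names.contains(key) && names.size() >= maxCounters) {
      throw new LimitExceeded("Too many counters: " + (names.size() + 1) + " max=" + maxCounters);
    }
    names.add(key);
  }

  int size() {
    return names.size();
  }

  public static void main(String[] args) {
    CounterLimitSketch counters = new CounterLimitSketch(2);
    counters.findCounter("MyApp", "ROWS_FILTERED");
    counters.findCounter("MyApp", "ROWS_MALFORMED");
    counters.findCounter("MyApp", "ROWS_FILTERED"); // existing counter, no growth
    try {
      counters.findCounter("MyApp", "/data/part-00042"); // per-file counter pushes past the cap
    } catch (LimitExceeded e) {
      System.out.println(e.getMessage());
    }
  }
}
```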

Diagnostics strings

Every level (TA → Task → Vertex → DAG) has a List<String> of diagnostics.

Level	Class	Populated by
`TaskAttempt`	`TaskAttemptImpl`	User exception stacks, framework errors, container exit reasons
`Task`	`TaskImpl`	Aggregate of failed attempt diagnostics + scheduling diagnostics
`Vertex`	`VertexImpl`	Aggregate of failed task diagnostics + vertex manager events
`DAG`	`DAGImpl`	Aggregate of failed vertex diagnostics + DAG-level events

grep -n "addDiagnostic\|diagnostics\|getDiagnostics" \
  $(find tez-dag/src/main/java -name "TaskAttemptImpl.java" \
                                -o -name "TaskImpl.java" \
                                -o -name "VertexImpl.java" \
                                -o -name "DAGImpl.java") | head -40

When a DAG completes, DAGStatus.getDiagnostics() is the union of every diagnostic at every level. This is what tez-tool and the Tez UI display.
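The union can be pictured as a depth-first walk over the DAG → Vertex → Task → TA hierarchy. The sketch below uses a generic `Node` type, which is a hypothetical simplification; the real classes store and forward diagnostics through their own state machines, and the ordering shown here is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of rolling diagnostics up a hierarchy; not how DAGImpl is implemented. */
final class DiagnosticsRollupSketch {
  /** Generic stand-in for a DAG, Vertex, Task, or TaskAttempt. */
  static final class Node {
    final String label;
    final List<String> diagnostics = new ArrayList<>();
    final List<Node> children = new ArrayList<>();

    Node(String label) {
      this.label = label;
    }
  }

  /** Collects a node's own diagnostics first, then those of its children, depth first. */
  static List<String> collect(Node node) {
    List<String> out = new ArrayList<>(node.diagnostics);
    for (Node child : node.children) {
      out.addAll(collect(child));
    }
    return out;
  }

  public static void main(String[] args) {
    Node dag = new Node("dag");
    Node vertex = new Node("vertex");
    Node attempt = new Node("attempt");
    dag.children.add(vertex);
    vertex.children.add(attempt);
    attempt.diagnostics.add("cause=APPLICATION_ERROR");
    vertex.diagnostics.add("Vertex failed: task failed");
    dag.diagnostics.add("DAG failed: vertex failed");
    System.out.println(collect(dag));
  }
}
```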

Where to find diagnostics

Surface	Path	Notes
Client return value	`DAGStatus.getDiagnostics()`	Concatenated strings
AM log	`syslog`	Search for `DIAG:`, `ERROR`, the cause keyword
ATS	`DAGFinishedEvent.diagnostics`, `VertexFinishedEvent.diagnostics`, etc	One field per entity
Tez UI	DAG / Vertex / Task page	Renders the same ATS fields
`dag.dot` (if dumped)	local file written by `TezClient` when enabled	Static plan only, no diagnostics
Counter dump from CLI	`tez-tool dump-counters <appId>`	Counter snapshots

grep -rn "DIAG\|addDiagnosticInfo" tez-dag/src/main/java | head -20

Counters in the AM log

A typical successful-task log line:

TaskAttempt: [attempt_1_0_00_000000_0]
  TASK_ATTEMPT_FINISHED ...
  counters: Counters: 26
    org.apache.tez.common.counters.TaskCounter
      INPUT_RECORDS_PROCESSED=12345
      OUTPUT_RECORDS=12345
      OUTPUT_BYTES=4567890
      ...

grep -rn "Counters: " tez-dag/src/main/java | head

For diagnostic grepping, search the AM log for:

Pattern	What it finds
`DIAG:`	Diagnostics appends
`Counters:`	Counter dumps
`LimitExceededException`	Counter limit hits
`TaskAttemptTerminationCause`	Failure causes
`TERMINATED_BY_CLIENT`	User-initiated kills
`OUTPUT_LOST`	Cascading reruns

Custom counters

User code accesses counters via the IPO context:

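public class MyProcessor extends AbstractLogicalIOProcessor {
  public MyProcessor(ProcessorContext context) {
    super(context);
  }

  // initialize(), handleEvents(), and close() are omitted here for brevity.
  @Override
  public void run(Map<String, LogicalInput> inputs, Map<String, LogicalOutput> outputs)
      throws Exception {
    // Counters hang off the task's ProcessorContext.
    TezCounters counters = getContext().getCounters();
    counters.findCounter("MyApp", "ROWS_FILTERED").increment(1);
  }
}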

grep -rn "getContext().getCounters\|getCounters()" \
  tez-tests/src/main/java tez-examples/src/main/java | head

Operational guidance:

Cap group/counter cardinality at compile time. Never use unbounded user input as a counter name.
One group per app; many counters per group.
Counter names are visible in ATS forever — treat them as a stable API.
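One way to follow the first rule is to route every dynamic value through a fixed enum before it becomes a counter name. The sketch below is an illustration only: the enum, its members, and the `classify` mapping are hypothetical and would be replaced by whatever categories the application actually needs.

```java
import java.util.Locale;

/** Sketch: keep counter cardinality fixed by mapping free-form input onto an enum. */
final class BoundedCounterNames {
  /** The complete, compile-time set of counter names this app will ever create. */
  enum RowCounter { ROWS_FILTERED, ROWS_MALFORMED, ROWS_OTHER }

  /** Maps an arbitrary reason string (e.g. taken from a record) onto the fixed set. */
  static RowCounter classify(String reason) {
    if (reason == null) {
      return RowCounter.ROWS_OTHER;
    }
    switch (reason.trim().toLowerCase(Locale.ROOT)) {
      case "filtered":
        return RowCounter.ROWS_FILTERED;
      case "malformed":
        return RowCounter.ROWS_MALFORMED;
      default:
        return RowCounter.ROWS_OTHER; // unknown input never creates a new counter
    }
  }

  public static void main(String[] args) {
    System.out.println(classify("Malformed"));
    System.out.println(classify("/data/2024/part-00042"));
  }
}
```

Inside a processor the result would feed the same call shown earlier, for example `counters.findCounter("MyApp", classify(reason).name()).increment(1)`, so the group can never hold more counters than the enum has members.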

Reading exercise

cat $(find tez-api/src/main/java -name "TaskCounter.java") — read the enum.
grep -n "incrCounter\|addCounters" $(find tez-runtime-library/src/main/java -name "*.java") | head -20  # find every place runtime increments counters.
grep -rn "LimitExceededException" tez-api/src/main/java tez-dag/src/main/java — trace the kill path.
find tez-tools -type f -name "*.java" | head — look at tez-tools for counter-dump tooling.
grep -rn "addDiagnosticInfo\|addDiagnostic" tez-dag/src/main/java | wc -l — count the call sites; build a mental model of "where diagnostics flow in."
Open the Tez UI for a recent app, navigate DAG → Vertex → Task, and compare each level's counter view against what the AM log shows.

Common bugs and symptoms

Symptom	Likely cause
DAG fails with `LimitExceededException`	Too many counters — search AM log for the limit that triggered.
Counters at DAG level don't sum to vertex counters	One vertex failed; its counters are excluded from the sum.
Counter group missing from ATS	Counter was never incremented (zero is not stored).
Diagnostics string truncated	ATS field length limit; check `yarn.timeline-service.client.max-attempts` and entity size.
`INPUT_RECORDS_PROCESSED` is zero but task succeeded	Input had zero rows, or a custom IPO does not increment the standard counter.
`SHUFFLE_BYTES_TO_DISK` >> `SHUFFLE_BYTES_TO_MEM`	Fetcher exhausted memory budget; tune `tez.runtime.shuffle.memory.limit.percent`.
Wall clock huge vs CPU millis	Task spent most time waiting (shuffle, GC, blocked); not CPU bound.

Validation: prove you understand this

Name the four standard counter groups and the class that defines each.
Explain why two attempts of the same task can have different counter values, and what happens to the loser's counters.
Calculate the smallest DAG that can hit tez.counters.max=1200, assuming the TaskCounter group contributes 26 counters per vertex on success.
Trace the path of a single counter increment in user code through the classes that aggregate it up to the DAGStatus returned to the client.
Given an AM log line DIAG: TaskAttempt attempt_1_0_05_000003_2 failed, cause=APPLICATION_ERROR, list the four levels where this diagnostic ultimately appears and the exact classes that store each copy.

Testing Framework

Tez ships three tiers of tests, each with a different cost/coverage tradeoff. Knowing which tier to use for a given change — and which patterns are considered idiomatic — is the difference between a patch that lands and one that sits in review.

Three tiers

Tier	Module	Boots...	Run cost	Use for
Unit	each module's `src/test/java`	nothing real; pure mocks + dispatcher	seconds	State-machine transitions, parsers, helper classes
Mini-cluster	`tez-tests/src/test/java`	`MiniTezCluster` (`MiniYARNCluster` + Tez session)	seconds-to-minutes	End-to-end DAGs in a JVM
Full cluster	external	real YARN cluster	minutes	Release validation, perf tests

find . -name "MiniTezCluster.java"
wc -l $(find . -name "MiniTezCluster.java")

Unit testing state machines

The dominant pattern for Tez unit tests is arrange-state, send-event, drain-dispatcher, assert. Reference: TestVertexImpl, TestTaskImpl, TestTaskAttemptImpl, TestDAGImpl.

find tez-dag/src/test/java -name "TestVertexImpl.java"
wc -l $(find tez-dag/src/test/java -name "TestVertexImpl.java")
grep -n "DrainDispatcher\|MockVertex\|MockDAG\|setupVertices\|dispatcher.await" \
  $(find tez-dag/src/test/java -name "TestVertexImpl.java") | head -20

Building blocks

Class	Purpose
`DrainDispatcher`	Synchronous-ish event dispatcher; `await()` blocks until queue drains
`MockVertex`, `MockDAG`, `MockTask`, etc	Lightweight stand-ins that satisfy `Vertex` etc interfaces
`MockClock`	Controllable clock for time-dependent transitions
`MockHistoryEventHandler`	Captures recovery / ATS events for assertion
Mockito (`mock`, `when`, `verify`)	Mocks for collaborators (`TaskSchedulerManager`, etc)

Recipe

@Test
public void testVertexInitsAfterAllInputsReady() throws Exception {
  // 1. Arrange
  DrainDispatcher dispatcher = new DrainDispatcher();
  dispatcher.init(new Configuration());
  dispatcher.start();

  TaskSchedulerManager sched = mock(TaskSchedulerManager.class);
  DAG dag = mock(DAG.class);
  when(dag.getID()).thenReturn(TezDAGID.getInstance(appAttemptId, 1));

  VertexImpl v = new VertexImpl(vertexId, plan, name, conf,
      dispatcher.getEventHandler(),
      mock(TaskCommunicatorManagerInterface.class),
      mockClock, taskHeartbeatHandler, mockAppContext,
      VertexLocationHint.create(...), dispatcher,
      mockVertexManager, ...);

  // 2. Act
  dispatcher.getEventHandler().handle(
      new VertexEvent(vertexId, VertexEventType.V_INIT));
  dispatcher.await();

  // 3. Assert
  assertEquals(VertexState.INITED, v.getState());
  verify(sched, never()).taskAllocated(any(), any(), any());
}

Key idioms:

Never call Thread.sleep to wait for a transition. Always use dispatcher.await().
Never assume event ordering unless you've sent events sequentially through the same dispatcher.
Mock the AppContext aggressively. It's the god-object; mocking it lets each test isolate exactly the collaborators it cares about.
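To see why draining beats sleeping, it helps to build the idea from scratch. The class below is a self-contained sketch of a dispatcher with an `await()` that returns only when every queued event, including events posted by handlers, has been processed. It is not the DrainDispatcher used by the Tez tests; the name `MiniDrainDispatcher` and its internals are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Sketch of a drainable async dispatcher; illustrative only, not Tez's DrainDispatcher. */
final class MiniDrainDispatcher<E> implements AutoCloseable {
  private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
  private final Consumer<E> handler;
  private final Object lock = new Object();
  private final Thread worker;
  private int pending; // events posted but not yet fully handled; guarded by lock

  MiniDrainDispatcher(Consumer<E> handler) {
    this.handler = handler;
    this.worker = new Thread(this::loop, "mini-dispatcher");
    this.worker.setDaemon(true);
    this.worker.start();
  }

  /** Posts an event; returns immediately, like an async dispatcher. */
  void handle(E event) {
    synchronized (lock) {
      pending++;
    }
    queue.add(event);
  }

  /** Blocks until every posted event has been handled. */
  void await() {
    synchronized (lock) {
      while (pending > 0) {
        try {
          lock.wait();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while draining", e);
        }
      }
    }
  }

  private void loop() {
    try {
      while (true) {
        E event = queue.take();
        try {
          handler.accept(event);
        } finally {
          synchronized (lock) {
            pending--;
            lock.notifyAll();
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // close() was called
    }
  }

  @Override
  public void close() {
    worker.interrupt();
  }

  public static void main(String[] args) {
    List<String> seen = Collections.synchronizedList(new ArrayList<>());
    MiniDrainDispatcher<String> dispatcher = new MiniDrainDispatcher<String>(e -> seen.add(e));
    dispatcher.handle("V_INIT");
    dispatcher.handle("V_START");
    dispatcher.await(); // deterministic: no sleep, no polling
    System.out.println(seen);
    dispatcher.close();
  }
}
```

Because `pending` is incremented before an event is queued and decremented only after its handler returns, a handler that posts follow-up events keeps `await()` blocked until the whole cascade has settled.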

`MiniTezCluster` tests

find tez-tests/src/test/java -name "TestOrderedWordCount.java" \
                              -o -name "TestMRRJobsDAGApi.java" \
                              -o -name "TestExtServicesWithLocalMode.java" | head
wc -l $(find tez-tests/src/test/java -name "TestOrderedWordCount.java")

MiniTezCluster boots:

A MiniYARNCluster (in-process RM + N NMs).
A MiniDFSCluster (in-process NN + DNs) — optional.
A TezClient configured against the mini cluster.

grep -n "MiniTezCluster\|new MiniYARNCluster\|setup\|tearDown" \
  $(find tez-tests/src/test/java -name "TestOrderedWordCount.java")

Lifecycle

flowchart TD
  setUp[BeforeClass: setup] --> mini[Start MiniTezCluster]
  mini --> tez[Create TezClient]
  tez --> test1[Test: build DAG]
  test1 --> submit[submitDAG]
  submit --> wait[waitForCompletion]
  wait --> assert[Assert DAGStatus + counters]
  assert --> tear[AfterClass: tearDown]
  tear --> stop[Stop TezClient + cluster]

Common shape

@BeforeClass
public static void setup() throws Exception {
  conf = new Configuration();
  conf.setInt(YarnConfiguration.RM_NM_HEARTBEAT_INTERVAL_MS, 100);
  miniTezCluster = new MiniTezCluster("name", 1, 1, 1);
  miniTezCluster.init(conf);
  miniTezCluster.start();
  TezConfiguration tezConf = new TezConfiguration(miniTezCluster.getConfig());
  tezClient = TezClient.create("test", tezConf);
  tezClient.start();
}

@AfterClass
public static void tearDown() throws Exception {
  tezClient.stop();
  miniTezCluster.stop();
}

@Test(timeout = 60_000)
public void testWordCount() throws Exception {
  DAG dag = buildWordCountDAG();
  DAGClient client = tezClient.submitDAG(dag);
  DAGStatus status = client.waitForCompletionWithStatusUpdates(EnumSet.of(StatusGetOpts.GET_COUNTERS));
  assertEquals(DAGStatus.State.SUCCEEDED, status.getState());
  assertEquals(EXPECTED_ROW_COUNT,
      status.getDAGCounters().findCounter(TaskCounter.OUTPUT_RECORDS).getValue());
}

`@Test(timeout = ...)` is mandatory

A mini-cluster test that hangs blocks the whole CI build. Give every MiniTezCluster test a JUnit timeout, typically in the 60-300 second range.

Local mode for tests

Faster than MiniTezCluster: no YARN, no DFS, everything in-process.

grep -rn "TEZ_LOCAL_MODE\|setLocalMode\|tez.local.mode" \
  tez-tests/src/test/java tez-runtime-library/src/test/java | head

Used for:

Unit-style integration tests where YARN isn't relevant.
Examples / smoke tests in tez-examples.
Quick repro of runtime issues — see the local mode deep dive.

Patterns: do and don't

Do

grep -rn "DrainDispatcher\|await()" tez-dag/src/test/java | wc -l

Send all setup events synchronously, then call dispatcher.await().
Use MockClock and advance it explicitly.
Capture emitted events with a custom handler and assert on the collection.
Use @Before / @After to reset shared dispatcher and mocks.
Mock external collaborators (TaskScheduler, ContainerLauncher, NMClient); never instantiate the real ones in unit tests.
Bound parallelism in mini-cluster tests (numNodeManagers=1 is usually fine).

Don't

Anti-pattern	Why it bites
`Thread.sleep(N)` to wait for state	Flake city; transition time depends on machine load.
`while (vertex.getState() != X)` busy loop	Same flake, plus burns CPU.
Assume `e1` happens before `e2` when both posted async	Dispatcher orders by arrival, not posting.
Static state across tests	Test execution order is not guaranteed; leaked static state corrupts later tests.
Real network calls in unit tests	Slow, flaky, often forbidden in CI sandboxes.
`System.exit` from tested code paths	Kills the JVM running the test runner.

CI / build integration

cat pom.xml | head -100
grep -n "surefire\|failsafe" pom.xml

Maven plugin	Runs	Default includes
`maven-surefire-plugin`	unit tests under `src/test/java`	`Test*.java`, `*Test.java`, `*Tests.java`, `*TestCase.java`
`maven-failsafe-plugin`	integration tests	`IT*.java`, `*IT.java`, `*ITCase.java`

Tez puts MiniTezCluster tests under surefire as well (no separation), which is one reason mvn test is slow. Run a single test:

mvn test -pl tez-dag -Dtest=TestVertexImpl
mvn test -pl tez-tests -Dtest=TestOrderedWordCount#testWordCount

Test-only utilities

find . -path "*/src/main/java/*" -name "Test*.java" | head
find . -path "*/src/test/java/*" -name "Mock*.java" | head

Helpful classes (some live under src/main so they're reusable downstream):

Class	Module	Purpose
`MiniTezCluster`	`tez-tests`	Bootstrap an in-process cluster
`TezClientForTest`	`tez-api`	Subclass exposing internals
`MockDAG`, `MockVertex`, `MockTask`	`tez-dag` test sources	Plain-old objects implementing state-machine interfaces
`TestProcessor`, `TestInput`, `TestOutput`	`tez-tests`	No-op IPOs for plan plumbing tests
`DrainDispatcher`	hadoop-yarn-common (depended upon)	Dispatcher with `await()`

Reading exercise

cat $(find tez-dag/src/test/java -name "TestVertexImpl.java") | head -150 — read the setup + first test.
grep -n "@Test" $(find tez-dag/src/test/java -name "TestTaskImpl.java" -o -name "TestVertexImpl.java" -o -name "TestDAGImpl.java") | wc -l  # get a sense of the test surface.
cat $(find tez-tests/src/test/java -name "TestOrderedWordCount.java") | head -200 — see a real MiniTezCluster test.
grep -rn "dispatcher.await\|DrainDispatcher" tez-dag/src/test/java | wc -l — confirm the pattern is universal.
grep -rn "Thread.sleep" tez-dag/src/test/java | head — find any stragglers using the anti-pattern; understand why each one is there (usually waiting on real OS state, e.g. a port).
mvn -pl tez-dag test -Dtest=TestVertexImpl -DfailIfNoTests=false — run one and read the output structure.

Common bugs and symptoms

Symptom	Likely cause
Test passes locally, flakes in CI	`Thread.sleep` waiting for transition; replace with `dispatcher.await()`.
`MiniTezCluster` test hangs forever	Missing `@Test(timeout = …)`; AM never finishes due to test bug.
`BindException` in mini-cluster	Previous test didn't `stop()`; ports leaked.
State machine throws `InvalidStateTransitionException` in test	Test sent event in wrong state; check arrange step.
Mock `AppContext` returns null from `getCurrentDAG()`	Forgot to stub it: `when(appContext.getCurrentDAG()).thenReturn(dag)`.
`OutOfMemoryError: Java heap space` in surefire	Forked test JVM heap is too small for the test; raise it via surefire's `argLine` (e.g. `-Xmx1g`) in the pom.
Test depends on counter being non-zero, but it's zero	Counter incremented in code path the mock skipped; verify the code under test actually ran.

Validation: prove you understand this

Outline the four-step recipe for a state-machine unit test, with the exact call to drain the dispatcher.
Name three classes from tez-dag/src/test/java that implement the Mock* pattern and what each replaces.
Explain why Thread.sleep is an anti-pattern in Tez tests and what the correct alternative is for time-dependent transitions.
Given a hang in TestVertexImpl#testTaskKill, list the first three diagnostics you'd inspect (no debugger).
Describe the difference between a MiniTezCluster test and a local-mode test, and give one scenario where each is the correct choice.

Hive-on-Tez Labs

Hive on Tez is the production context that has carried Tez through the last decade. Most large Hive deployments that are not on Spark run on Tez. Understanding the Tez/Hive boundary is therefore not a niche skill; it is the production debugging skill for both projects.

These labs work from a SQL query down through Hive compilation, into a Tez DAG, into running tasks, and back out through failure attribution and remediation. They are deliberately hands-on; every step has commands to run against ~/tez-src and ~/hive-src.

Prerequisites

Tool	Required version	Why
Apache Tez	0.10.x	Matches the rest of this book
Apache Hive	3.x or 4.x	Production-relevant; Hive 2 is end of life
Hadoop	3.3.x	Tez and Hive both target this
JDK	11 (Hive 4) or 8 (Hive 3)	Per project requirements
Local clones	`~/tez-src`, `~/hive-src`	All commands assume these paths

If you have only one of Hive 3 or Hive 4, the labs work either way; they call out the delta where it matters. Fully qualified class names used throughout these labs (the integration boundary):

org.apache.hadoop.hive.ql.exec.tez.TezTask                  — Hive's "execute on Tez" task
org.apache.hadoop.hive.ql.exec.tez.DagUtils                 — Builds Tez DAG from Hive plan
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager    — Pools Tez sessions
org.apache.hadoop.hive.ql.exec.tez.TezSessionState          — One Hive session = one Tez AM
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource          — Map-side record source
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource       — Reduce-side record source

Verify these exist in your tree:

find ~/hive-src -path "*ql/exec/tez/TezTask.java"
find ~/hive-src -path "*ql/exec/tez/DagUtils.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionPoolManager.java"
find ~/hive-src -path "*ql/exec/tez/TezSessionState.java"
find ~/hive-src -path "*ql/exec/tez/MapRecordSource.java"
find ~/hive-src -path "*ql/exec/tez/ReduceRecordSource.java"

If any are missing, your Hive tree may be too old. Hive 3.1.x and 4.0.x both have all six.
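The six `find` commands above can be folded into one check. The following is a minimal sketch that walks a Hive checkout and reports which boundary classes have no `ql/exec/tez/<Name>.java` file; `missing_boundary_classes` is a hypothetical helper name, and the only layout assumption is the `ql/exec/tez` package path already used by the commands.

```python
from pathlib import Path

BOUNDARY_CLASSES = [
    "TezTask",
    "DagUtils",
    "TezSessionPoolManager",
    "TezSessionState",
    "MapRecordSource",
    "ReduceRecordSource",
]


def missing_boundary_classes(hive_src):
    """Return the boundary class names with no ql/exec/tez/<Name>.java file."""
    missing = []
    for name in BOUNDARY_CLASSES:
        # Mirrors: find <hive_src> -path "*ql/exec/tez/<Name>.java"
        hits = [
            p for p in Path(hive_src).rglob(name + ".java")
            if p.parent.as_posix().endswith("ql/exec/tez")
        ]
        if not hits:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_boundary_classes(Path.home() / "hive-src"))
```

An empty list means all six classes are present in the tree.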

The Tez/Hive Boundary, At a Glance

The boundary is one Hive class — TezTask — and a handful of supporting utilities. Above the boundary, Hive owns: SQL parsing, semantic analysis, logical plan, physical plan (MapWork/ReduceWork). Below the boundary, Tez owns: DAG execution, task scheduling, shuffle, recovery.

flowchart TD
  subgraph Hive
    A[SQL Query] --> B[Parser]
    B --> C[Semantic Analyzer]
    C --> D[Logical Plan]
    D --> E[Physical Plan<br/>MapWork / ReduceWork]
    E --> F[TezTask.execute]
    F --> G[DagUtils.createVertex<br/>DagUtils.createEdge]
    G --> H[DAG object]
  end
  subgraph Tez
    H --> I[TezClient.submitDAG]
    I --> J[DAGAppMaster<br/>tez-dag]
    J --> K[Vertex tasks<br/>tez-runtime-internals]
    K --> L[Shuffle I/O<br/>tez-runtime-library]
  end

That TezTask → DagUtils → DAG → submitDAG sequence is the entire integration surface. The six labs below walk it from the top (Lab H1) to the runtime (Lab H6).

Lab Index

Lab	Goal	Output artifact
H1: SQL → DAG	Trace a `SELECT...GROUP BY...ORDER BY` from SQL to a labelled Tez DAG	DAG diagram
H2: Inspect DAG	Capture and inspect the DAG Hive submits	EXPLAIN output + `.dot` file
H3: Debug a query	Walk from a "Vertex failed" message to the actual exception	Failure narrative
H4: Bug attribution	Use stack-trace top frame to attribute to Hive, Tez runtime, Tez AM, or YARN	Decision tree applied
H5: Reproducing bugs	Build a minimum reproducer for a Hive-on-Tez bug	Repro tarball
H6: Diagnostics	Write a small diagnostic patch (log, counter, config) and attach to JIRA	Patch + JIRA

Reading Order

H1 and H2 are foundational — do them in order. H3 and H4 are debugging skills that build on each other. H5 and H6 are the contributor-facing skills you need to file a useful Hive-on-Tez JIRA from a production observation.

If you are coming to this section from the Capstone, H4 and H5 are the most directly relevant.

Where the Real Work Happens

The Tez/Hive boundary is one of the most-asked-about areas on both project mailing lists. The labs are written so that, when you encounter a production issue, you can:

Read the stack trace and attribute it (H4).
Locate the SQL that produced the DAG (H1).
Capture the DAG and find the relevant vertex (H2).
Identify the failing task and its log (H3).
Reproduce it minimally on MiniTezCluster (H5).
Attach a diagnostic patch to a JIRA to get more data from the reporter (H6).

That six-step routine, executed crisply, is what gets Hive-on-Tez JIRAs resolved.

Validation for the Section

You have absorbed the Hive-on-Tez section when, given a freshly-failing query in a production Hive-on-Tez deployment, you can:

Within 10 minutes, identify which project owns the failure (Hive / Tez / YARN).
Within 30 minutes, locate the relevant code on both sides of the boundary.
Within 1 hour, capture the DAG and the failing task's log.
Within a day, produce a minimum reproducer on MiniTezCluster.
Within a week, file a JIRA on the right project with all the data needed.

That is the standard a Hive-on-Tez committer holds themselves to. The labs build the muscle.

Lab H1: SQL → DAG

Background

A user writes:

SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;

Hive compiles this into a Tez DAG with three vertices and two edges. This lab walks the compilation path: parser → semantic analyzer → logical plan → physical plan (MapWork/ReduceWork) → TezTask → DagUtils.createVertex/createEdge → submitted DAG.

By the end you will have a labelled DAG diagram for this query and you will be able to trace any similar query from SQL to runtime topology.

Setup

cd ~/hive-src
git log --oneline -1                    # know the version you're on
find . -name "TezTask.java"             # boundary class
find . -name "DagUtils.java"            # DAG construction

A representative test table (use Hive CLI or beeline):

CREATE TABLE t (a INT, b STRING)
  STORED AS ORC;

INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');

The query under study:

SELECT a, COUNT(*) AS c
  FROM t
  GROUP BY a
  ORDER BY a;

Step 1: Parser (lexing, AST)

Hive uses ANTLR. The grammar lives in:

find ~/hive-src -name "HiveParser.g" -o -name "HiveLexer.g"

The parser produces an AST. From the CLI:

EXPLAIN AST SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

You will see a Lisp-style tree:

(TOK_QUERY
  (TOK_FROM (TOK_TABREF (TOK_TABNAME t)))
  (TOK_INSERT
    (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT
      (TOK_SELEXPR (TOK_TABLE_OR_COL a))
      (TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT) c))
    (TOK_GROUPBY (TOK_TABLE_OR_COL a))
    (TOK_ORDERBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL a)))))

The AST is the input to the next phase.
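To make the tree structure concrete, here is a small sketch that parses a parenthesised AST dump into nested Python lists and looks up clauses by token name. It assumes the Lisp-style form shown above (some Hive versions print an indented tree instead); `parse_ast` and `find_nodes` are hypothetical helper names.

```python
import re

SAMPLE_AST = """
(TOK_QUERY
  (TOK_FROM (TOK_TABREF (TOK_TABNAME t)))
  (TOK_INSERT
    (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT
      (TOK_SELEXPR (TOK_TABLE_OR_COL a))
      (TOK_SELEXPR (TOK_FUNCTIONSTAR COUNT) c))
    (TOK_GROUPBY (TOK_TABLE_OR_COL a))
    (TOK_ORDERBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL a)))))
"""


def parse_ast(text):
    """Parse a parenthesised AST dump into nested lists of token names."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok              # leaf: a bare token or identifier
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1                    # consume the closing ")"
        return children

    return node()


def find_nodes(tree, label):
    """Collect every subtree whose first element equals `label`."""
    found = []
    if isinstance(tree, list) and tree:
        if tree[0] == label:
            found.append(tree)
        for child in tree[1:]:
            found.extend(find_nodes(child, label))
    return found


if __name__ == "__main__":
    ast = parse_ast(SAMPLE_AST)
    print(find_nodes(ast, "TOK_GROUPBY"))
    print(find_nodes(ast, "TOK_ORDERBY"))
```

Looking up `TOK_GROUPBY` and `TOK_ORDERBY` this way is a quick check that both clauses survived parsing before you move on to semantic analysis.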

Step 2: Semantic Analyzer

The AST goes through SemanticAnalyzer:

find ~/hive-src -name "SemanticAnalyzer.java" | head

It resolves table references, expands *, type-checks aggregates, and produces a Query Block (QB) tree → Operator tree (logical plan).

EXPLAIN LOGICAL SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

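You see operators like TS (TableScan), SEL (Select), GBY (GroupBy), RS (ReduceSink), FS (FileSink). Two GBY and two RS are typical for a GROUP BY ... ORDER BY: the two GBYs are the map-side partial and reduce-side final aggregation, and the two RSs feed the GROUP BY shuffle and the ORDER BY sort respectively.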

Step 3: Physical Plan — `MapWork`, `ReduceWork`

The logical operator tree is converted to a physical plan whose top-level units are MapWork, ReduceWork, and MergeJoinWork. For our query, Hive produces three Work units:

Work	Purpose	Operators inside
`MapWork` (Map 1)	Read `t`, partial aggregate by `a`	`TS → SEL → GBY → RS`
`ReduceWork` (Reducer 2)	Final aggregate by `a`, prepare for sort	`GBY → RS`
`ReduceWork` (Reducer 3)	Total-order sort by `a`, write output	`SEL → FS`

Inspect the structures:

grep -rn "class MapWork" ~/hive-src/ql/src/java/
grep -rn "class ReduceWork" ~/hive-src/ql/src/java/

Get this from Hive directly:

EXPLAIN SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

Look for the Stage: Stage-1 / Tez block and the per-vertex sections (Map 1, Reducer 2, Reducer 3).

Step 4: `TezTask` — The Boundary

The Hive-side execution entry point for a Tez query:

grep -n "public int execute" $(find ~/hive-src -name TezTask.java)

TezTask.execute (its parameter list differs between Hive versions; the grep above shows the signature in your tree) does roughly:

Acquire a TezSessionState (existing pooled session or new one) via TezSessionPoolManager.
Build a DAG from MapWork/ReduceWork via DagUtils.
Submit the DAG via the session's TezClient.submitDAG.
Block on the DAGClient for completion.
Surface counters and diagnostics.

The DAG-building call:

grep -n "DagUtils\|dagUtils\.create" $(find ~/hive-src -name TezTask.java)

You will see calls to DagUtils.createDag or DagUtils.buildDag (name varies by Hive version).

Step 5: `DagUtils.createVertex` / `createEdge`

The mapping from Hive Work units to Tez Vertex happens here:

find ~/hive-src -name "DagUtils.java"
grep -n "createVertex\|public Vertex " $(find ~/hive-src -name DagUtils.java)
grep -n "createEdge\|public Edge "     $(find ~/hive-src -name DagUtils.java)

For our query, DagUtils produces:

Hive `Work`	Tez `Vertex`	Processor descriptor
`MapWork` "Map 1"	`Vertex` "Map 1"	`MapTezProcessor`
`ReduceWork` "Reducer 2"	`Vertex` "Reducer 2"	`ReduceTezProcessor`
`ReduceWork` "Reducer 3"	`Vertex` "Reducer 3"	`ReduceTezProcessor`

And two edges:

From	To	EdgeProperty kind
Map 1	Reducer 2	`SCATTER_GATHER` (shuffle)
Reducer 2	Reducer 3	`SCATTER_GATHER` (with a 1-task sink for total order)

The "1-task sink for total order" is how Hive forces a single reducer for ORDER BY (no LIMIT): Reducer 3 has parallelism 1.
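The topology in the two tables above is small enough to model directly. The sketch below is a plain-Python stand-in (not the Tez `DAG` API) that records the three vertices and two edges and derives an execution order with Kahn's algorithm; the names `VERTICES`, `EDGES`, and `topological_order` are illustrative only.

```python
from collections import defaultdict, deque

# Vertex name -> processor, as listed in the table above.
VERTICES = {
    "Map 1": "MapTezProcessor",
    "Reducer 2": "ReduceTezProcessor",
    "Reducer 3": "ReduceTezProcessor",
}
# (source, destination, edge kind)
EDGES = [
    ("Map 1", "Reducer 2", "SCATTER_GATHER"),
    ("Reducer 2", "Reducer 3", "SCATTER_GATHER"),
]


def topological_order(vertices, edges):
    """Kahn's algorithm: order vertices so that every edge points forward."""
    indegree = {v: 0 for v in vertices}
    children = defaultdict(list)
    for src, dst, _kind in edges:
        children[src].append(dst)
        indegree[dst] += 1
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for child in children[v]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(indegree):
        raise ValueError("cycle detected: not a DAG")
    return order


if __name__ == "__main__":
    print(topological_order(VERTICES, EDGES))
```

The same model is a convenient place to record the topology you actually observe if your build produces a different vertex count.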

Step 6: The Submitted DAG

After DagUtils.createDag returns, TezTask submits via the session:

grep -n "submitDAG" $(find ~/hive-src -name TezTask.java)
grep -n "submitDAG" $(find ~/hive-src -name TezSessionState.java)

The call lands on TezClient.submitDAG(DAG dag) in tez-api:

grep -n "public DAGClient submitDAG" \
  $(find ~/tez-src/tez-api/src/main/java -name TezClient.java)

From there, Reading the Codebase Step 2's worked exercise picks up.

Step 7: Validation — Labelled DAG Diagram

Build this diagram for our query and save it.

flowchart TD
  M1["Map 1<br/>processor: MapTezProcessor<br/>operators: TS → SEL → GBY → RS<br/>parallelism: numSplits(t)"]
  R2["Reducer 2<br/>processor: ReduceTezProcessor<br/>operators: GBY → RS<br/>parallelism: hive.exec.reducers.* tuning"]
  R3["Reducer 3<br/>processor: ReduceTezProcessor<br/>operators: SEL → FS<br/>parallelism: 1 (ORDER BY)"]
  M1 -->|"SCATTER_GATHER<br/>partition on a"| R2
  R2 -->|"SCATTER_GATHER<br/>partition on sort key"| R3

Capture this as your validation artifact (~/tez-notes/hive-h1-dag.md).

Step 8: Print the DAG via Hive

Hive has a setting to print a runtime summary of the executed DAG:

SET hive.tez.exec.print.summary=true;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

The summary, printed after the query, lists each vertex, its task count, and counters. Confirm the topology matches the diagram. (If you see four vertices, you may be on a build that splits ORDER BY differently; record the actual topology.)

For more detail, tez.am.dag.dot.file.location writes a .dot file — used in Lab H2.

Step 9: Counter Pop Quiz

After the query runs (with hive.tez.exec.print.summary=true), find:

Counter	Where it lives	What it measures
`INPUT_RECORDS_PROCESSED`	Map 1	Rows read from `t`
`OUTPUT_RECORDS`	Map 1	Records emitted to shuffle (post partial-aggregate)
`REDUCE_INPUT_GROUPS`	Reducer 2	Distinct `a` values seen
`OUTPUT_RECORDS`	Reducer 2	Records to Reducer 3
`OUTPUT_RECORDS`	Reducer 3	Final result row count

For our 6-row input with 3 distinct values of a:

Counter	Expected
Map 1 `INPUT_RECORDS_PROCESSED`	6
Map 1 `OUTPUT_RECORDS`	3 (after partial GBY)
Reducer 2 `REDUCE_INPUT_GROUPS`	3
Reducer 2 `OUTPUT_RECORDS`	3
Reducer 3 `OUTPUT_RECORDS`	3

Verify against your actual run.
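The expected values follow from the sample data by simple arithmetic. The sketch below replays the partial and final aggregation in plain Python, assuming the six rows arrive in a single split so that Map 1 emits one partial count per distinct key; `simulate` is a hypothetical helper, not Hive or Tez code.

```python
from collections import Counter

ROWS = [(1, "x"), (1, "y"), (2, "z"), (3, "p"), (3, "q"), (3, "r")]


def simulate(rows):
    """Mimic GROUP BY a ORDER BY a and return (result rows, expected counters)."""
    partial = Counter(a for a, _b in rows)   # Map 1: partial count per key
    final = sorted(partial.items())          # Reducer 2 merges, Reducer 3 sorts
    counters = {
        "Map 1 INPUT_RECORDS_PROCESSED": len(rows),
        "Map 1 OUTPUT_RECORDS": len(partial),
        "Reducer 2 REDUCE_INPUT_GROUPS": len(partial),
        "Reducer 2 OUTPUT_RECORDS": len(final),
        "Reducer 3 OUTPUT_RECORDS": len(final),
    }
    return final, counters


if __name__ == "__main__":
    result, counters = simulate(ROWS)
    print(result)
    for name, value in counters.items():
        print(name, value)
```

If the table was loaded as several files, Map 1 may run as several tasks and emit more than three partial records in total; that is one legitimate reason for your run to differ from the table.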

Validation Artifacts

The labelled mermaid DAG diagram saved at ~/tez-notes/hive-h1-dag.md.
The EXPLAIN AST, EXPLAIN LOGICAL, and EXPLAIN outputs saved.
The hive.tez.exec.print.summary output for the actual run.
The counter table above, with your actual numbers filled in.
The grep results for createVertex and createEdge in DagUtils.java saved as ~/tez-notes/hive-h1-dagutils.txt.

You can now trace any Hive query through compilation to a Tez DAG. The next lab — Lab H2: Inspect the DAG — adds the production-grade techniques for capturing and inspecting that DAG at runtime.

Lab H2: Inspecting the Hive-Emitted DAG

Background

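Lab H1 derived the DAG by tracing compilation through the code. In production you often cannot re-derive it; you need to capture the DAG that Hive actually submitted to Tez. This lab covers four production-grade ways to do that (the two EXPLAIN variants each get their own section, so the sections below are numbered Method 1 to Method 5):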

EXPLAIN FORMATTED and EXPLAIN VECTORIZATION DETAIL from Hive.
TezTask logging at DEBUG level.
The Tez UI (backed by YARN ATS or Tez SimpleHistoryLoggingService).
The tez.am.dag.dot.file.location graphviz dump.

Plus the cross-cutting skill: mapping each Hive operator in the captured DAG to its Tez Input/Processor/Output (I/P/O).

Setup

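-- Hive CLI or beeline. Reuse the table from H1; skip the INSERT if t is already loaded,
-- otherwise the rows are duplicated and the H1 counter expectations no longer hold.
CREATE TABLE IF NOT EXISTS t (a INT, b STRING) STORED AS ORC;
INSERT INTO t VALUES (1,'x'),(1,'y'),(2,'z'),(3,'p'),(3,'q'),(3,'r');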

Verify Tez is the execution engine:

SET hive.execution.engine;        -- should be 'tez'

If not:

SET hive.execution.engine=tez;

Method 1: `EXPLAIN FORMATTED`

EXPLAIN FORMATTED emits a JSON-ish structure with operator details. Useful for programmatic parsing.

EXPLAIN FORMATTED
  SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

Snippet of the output (structure varies by Hive version):

{
  "STAGE DEPENDENCIES": {
    "Stage-1": {"ROOT STAGE": "TRUE"},
    "Stage-0": {"DEPENDENT STAGES": "Stage-1"}
  },
  "STAGE PLANS": {
    "Stage-1": {
      "Tez": {
        "DagId:": "...",
        "Edges:": {
          "Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
          "Reducer 3": [{"parent": "Reducer 2", "type": "SIMPLE_EDGE"}]
        },
        "Vertices:": {
          "Map 1": {
            "Map Operator Tree:": [...],
            "Execution mode:": "vectorized"
          },
          "Reducer 2": { ... },
          "Reducer 3": { ... }
        }
      }
    }
  }
}

Save it:

hive -e "EXPLAIN FORMATTED SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;" \
  > ~/tez-notes/hive-h2-explain-formatted.json

What it tells you that EXPLAIN doesn't:

Edge types between vertices (SIMPLE_EDGE, BROADCAST_EDGE, CUSTOM_SIMPLE_EDGE, CUSTOM_EDGE).
Execution mode per vertex (vectorized, llap, neither).
The full operator tree per vertex, including row-schema annotations.
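Since the output is JSON, the planned edges can be pulled out programmatically. The sketch below assumes the key layout shown in the snippet above (`STAGE PLANS` → stage → `Tez` → `Edges:`), which varies by Hive version; `planned_edges` is a hypothetical helper and the embedded sample is a trimmed copy of that snippet.

```python
import json

SAMPLE_PLAN = """
{
  "STAGE DEPENDENCIES": {
    "Stage-1": {"ROOT STAGE": "TRUE"},
    "Stage-0": {"DEPENDENT STAGES": "Stage-1"}
  },
  "STAGE PLANS": {
    "Stage-1": {
      "Tez": {
        "Edges:": {
          "Reducer 2": [{"parent": "Map 1", "type": "SIMPLE_EDGE"}],
          "Reducer 3": [{"parent": "Reducer 2", "type": "SIMPLE_EDGE"}]
        },
        "Vertices:": {"Map 1": {}, "Reducer 2": {}, "Reducer 3": {}}
      }
    }
  }
}
"""


def planned_edges(explain_json):
    """Return sorted (parent, child, edge type) triples from EXPLAIN FORMATTED JSON."""
    plan = json.loads(explain_json)
    edges = []
    for stage in plan.get("STAGE PLANS", {}).values():
        tez = stage.get("Tez") if isinstance(stage, dict) else None
        if not tez:
            continue
        for child, parents in tez.get("Edges:", {}).items():
            if isinstance(parents, dict):    # tolerate a single, unwrapped parent
                parents = [parents]
            for parent in parents:
                edges.append((parent["parent"], child, parent["type"]))
    return sorted(edges)


if __name__ == "__main__":
    for edge in planned_edges(SAMPLE_PLAN):
        print(edge)
```

Run against the saved `hive-h2-explain-formatted.json`, this gives a planned edge list you can later compare with what the Tez UI reports.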

Method 2: `EXPLAIN VECTORIZATION DETAIL`

When a query runs slower than expected on Tez, vectorization is the first thing to check. EXPLAIN VECTORIZATION DETAIL shows per-operator whether vectorization succeeded and, if not, why.

EXPLAIN VECTORIZATION DETAIL
  SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

Look for per-vertex Execution mode: vectorized and per-operator Vectorized: true. If you see notVectorizedReason: <reason>, that's the diagnostic.

Common notVectorizedReason values:

Reason	Cause
`UDF X is not vectorized`	Hive lacks a vectorized impl of a UDF you used
`Reduce vectorization disabled`	`hive.vectorized.execution.reduce.enabled=false`
`MAP_JOIN with key types ...`	Vectorized map-join doesn't support the key type combo
`Column type X not supported`	Vectorization doesn't handle the column type (DECIMAL precision, etc.)

This explains a class of Hive-on-Tez perf surprises that are unrelated to Tez itself.

Method 3: `TezTask` Logging

Increase the log level on TezTask to capture the DAG it submitted:

-- hive.root.logger is read at startup, so pass it on the command line:
--   hive --hiveconf hive.root.logger=DEBUG,console
-- or, more targeted, per session:
SET hive.log.explain.output=true;

hive.log.explain.output=true writes the EXPLAIN to the Hive log on each query — useful in production where you can't get a CLI run but can grep the log.

grep -A100 "DAG description" /var/log/hive/hive-server2.log | head -200

For the most detail, set DEBUG specifically on the Tez integration:

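# in the logging properties file, not hive-site.xml or SET (log4j 1.x syntax shown;
# Hive 2.1+ uses log4j2, where each logger needs a logger.<id>.name / logger.<id>.level pair):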
log4j.logger.org.apache.hadoop.hive.ql.exec.tez=DEBUG
log4j.logger.org.apache.tez.dag.api=DEBUG

In DEBUG you see:

The serialised DAGPlan size at submit time.
Each Vertex's name, parallelism, processor descriptor class.
Each Edge's source, destination, data-source / data-movement / scheduling type.

Method 4: Tez UI

The Tez UI runs against YARN Timeline Service (ATS) or against the file-system SimpleHistoryLoggingService. When configured, every Tez DAG submitted by Hive (or anything else) is captured.

Capture is enabled via tez.history.logging.service.class:

grep "tez.history.logging.service.class" ~/tez-src/tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java

Once a DAG runs, browse to:

http://<atstimeline-host>:8188/applicationhistory/

or for the standalone Tez UI:

http://<tez-ui-host>:9999/tez-ui/

Click into a DAG to see:

Per-vertex stats (tasks, attempts, succeeded, failed, killed).
Edges with type and statistics (BYTES_TRANSFERRED).
A graphical DAG view.
Per-task and per-attempt logs.

For an offline cluster, the file-system logger writes JSON files under tez.simple.history.logging.dir. They can be loaded into the Tez UI later.

Method 5: `tez.am.dag.dot.file.location`

For visual inspection, Tez can write each DAG as a Graphviz .dot file:

SET tez.am.dag.dot.file.location=/tmp/tez-dags;
SELECT a, COUNT(*) AS c FROM t GROUP BY a ORDER BY a;

After the query:

ls /tmp/tez-dags/
# <app-id>_<dag-name>.dot

dot -Tpng /tmp/tez-dags/<file>.dot -o ~/tez-notes/hive-h2-dag.png

The .dot has the same nodes/edges as the Tez UI, in a portable format.

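Caveat: the file is written by the AM, so on a real cluster it lands on the AM node, not the client. Configure the path to a shared filesystem or copy it after the fact. Property names differ across Tez releases, so confirm this one with a grep of TezConfiguration.java in ~/tez-src; tez.generate.debug.artifacts is the related switch for AM-side debug artifacts such as the text-format DAG plan.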

Mapping Hive Operators to Tez I/P/O

Now the cross-cutting skill: each Hive operator inside a Vertex maps to one of Tez's three runtime roles — Input, Processor, or Output. For our query:

Map 1 (vertex)

Hive operator	Tez role	Tez class
TableScan	`Input`	`MRInput` (from `tez-mapreduce`) or `HiveInputFormat` adapter
Select	(inside Processor)	—
GroupBy (partial)	(inside Processor)	—
ReduceSink	`Output`	`OrderedPartitionedKVOutput` (from `tez-runtime-library`)

The Processor itself: MapTezProcessor. Find it:

find ~/hive-src -name "MapTezProcessor.java"

Reducer 2 (vertex)

Hive operator	Tez role	Tez class
(shuffle in)	`Input`	`OrderedGroupedKVInput`
GroupBy (final)	(inside Processor)	—
ReduceSink	`Output`	`OrderedPartitionedKVOutput`

Processor: ReduceTezProcessor. Find it:

find ~/hive-src -name "ReduceTezProcessor.java"

Reducer 3 (vertex)

Hive operator	Tez role	Tez class
(shuffle in)	`Input`	`OrderedGroupedKVInput`
Select	(inside Processor)	—
FileSink	`Output`	`MROutput` (from `tez-mapreduce`)

Validation — A Side-by-Side Table

Build this for your captured DAG and save it:

Vertex	Tasks	Inputs (class, source)	Processor	Outputs (class, dest)
Map 1	(from EXPLAIN)	`MRInput` ← `t` ORC files	`MapTezProcessor`	`OrderedPartitionedKVOutput` → Reducer 2
Reducer 2	(from EXPLAIN)	`OrderedGroupedKVInput` ← Map 1	`ReduceTezProcessor`	`OrderedPartitionedKVOutput` → Reducer 3
Reducer 3	1	`OrderedGroupedKVInput` ← Reducer 2	`ReduceTezProcessor`	`MROutput` → query result location

Save as ~/tez-notes/hive-h2-iop-mapping.md.

Worked Differences Across Methods

When all four capture methods agree, you have ground truth. When they disagree:

Disagreement	Likely cause
`EXPLAIN FORMATTED` shows N vertices, runtime UI shows N+1	Dynamic vertex insertion (CBO, runtime statistics)
`tez.am.dag.dot.file.location` shows fewer edges than UI	Edges added by VertexManager at runtime (see Lab 4.2)
UI shows `BROADCAST_EDGE`, `EXPLAIN` says `SIMPLE_EDGE`	Hive's `EXPLAIN` is sometimes loose on edge type; trust the UI
Parallelism in UI differs from `EXPLAIN`'s `mapred.reduce.tasks`	`ShuffleVertexManager` reconfigured parallelism at runtime (auto-reduce parallelism)

Each disagreement is informative — it shows you which subsystem made the dynamic decision.

Production Diagnostic Routine

When asked "why is this query slow on Tez?":

EXPLAIN FORMATTED to see the planned DAG.
EXPLAIN VECTORIZATION DETAIL to spot non-vectorized operators.
Run with hive.tez.exec.print.summary=true to get the runtime summary.
Open the Tez UI for the DAG, look at per-vertex and per-edge stats.
Compare planned parallelism to actual (VertexManager may have changed it).
Identify the bottleneck vertex by WALL_CLOCK_MILLIS or OUTPUT_RECORDS skew.

Most slowness is one of: vectorization failure, parallelism mismatch, data skew on a shuffle key, or AM overhead for a many-vertex DAG.

Validation Artifacts

The EXPLAIN FORMATTED JSON saved to ~/tez-notes/hive-h2-explain-formatted.json.
The EXPLAIN VECTORIZATION DETAIL saved to ~/tez-notes/hive-h2-vec.txt.
A .png rendered from the .dot saved to ~/tez-notes/hive-h2-dag.png.
The Tez UI URL for the actual DAG run, bookmarked.
The Hive-operator-to-Tez-I/P/O table above, filled in for your captured DAG.

Once you can capture and read the DAG four ways, you are ready for failure analysis — Lab H3: Debug a Failed Query.

Lab H3: Debugging a Failed Query

Background

Production Hive-on-Tez failures usually surface as one line in the Hive console:

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex failed, vertexName=Map 1, vertexId=vertex_1718000000000_4321_1_00,
diagnostics=[Task failed, taskId=task_1718000000000_4321_1_00_000003,
diagnostics=[TaskAttempt 0 failed, info=[
Container container_e123_1718000000000_4321_01_000007 failed.
Exit code: 1
Container exited with a non-zero exit code 1. Last 4096 bytes of stderr :
... ]]]

That message is the tip. The actual exception is buried 3–4 hops away. This lab is the operational walk from that tip to the root-cause stack trace, with a fabricated-but-realistic example.

The Failure Hop Sequence

flowchart TD
  H[Hive console error<br/>'Vertex failed, vertexName=Map 1']
  H --> A[AM log<br/>tez-dag log on the AM container]
  A --> T[TaskAttempt diagnostics<br/>which task, which container]
  T --> C[Container stderr / stdout log<br/>on the worker node]
  C --> E[Actual exception<br/>the root cause]
  E --> X[Attribute to Hive / Tez runtime / Tez AM / YARN]

Five hops. Most engineers can do hop 1 (read the console). Few can do hops 2–4 without guidance. This lab is the guidance.

Step 1: Parse the Console Message

Take the message above and extract the identifiers:

Identifier	Value (in our example)	Use for
Application ID	`application_1718000000000_4321`	YARN log retrieval
DAG ID	`dag_1718000000000_4321_1`	Tez UI URL
Vertex ID	`vertex_1718000000000_4321_1_00`	The failing vertex; here `00` ≈ Map 1
Task ID	`task_1718000000000_4321_1_00_000003`	Which task within the vertex
Attempt	`0`	First attempt failed
Container ID	`container_e123_1718000000000_4321_01_000007`	Where the work was running
Exit code	`1`	Process died abnormally

The format is consistent across all Hive-on-Tez failures. Memorise the structure.
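Because the format is regular, the identifiers in the table can be extracted mechanically. The sketch below parses the example message with regular expressions and derives the application and DAG IDs from the vertex ID; `parse_failure` is a hypothetical helper, and the patterns are written against the example above rather than taken from Hive or Tez source.

```python
import re

CONSOLE_MESSAGE = (
    "Vertex failed, vertexName=Map 1, vertexId=vertex_1718000000000_4321_1_00, "
    "diagnostics=[Task failed, taskId=task_1718000000000_4321_1_00_000003, "
    "diagnostics=[TaskAttempt 0 failed, info=[ "
    "Container container_e123_1718000000000_4321_01_000007 failed. "
    "Exit code: 1"
)


def _first(pattern, text):
    """Return the first capture group of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def parse_failure(message):
    """Extract the identifiers of a 'Vertex failed' message into a dict."""
    vertex = re.search(r"vertexId=(vertex_(\d+)_(\d+)_(\d+)_\d+)", message)
    if vertex is None:
        raise ValueError("no vertexId found in message")
    cluster_ts, app_seq, dag_seq = vertex.group(2, 3, 4)
    attempt = _first(r"TaskAttempt (\d+) failed", message)
    exit_code = _first(r"Exit code: (-?\d+)", message)
    return {
        "application_id": f"application_{cluster_ts}_{app_seq}",
        "dag_id": f"dag_{cluster_ts}_{app_seq}_{dag_seq}",
        "vertex_name": _first(r"vertexName=([^,]+)", message),
        "vertex_id": vertex.group(1),
        "task_id": _first(r"taskId=(task_\w+)", message),
        "attempt": int(attempt) if attempt is not None else None,
        "container_id": _first(r"(container_\w+) failed", message),
        "exit_code": int(exit_code) if exit_code is not None else None,
    }


if __name__ == "__main__":
    for key, value in parse_failure(CONSOLE_MESSAGE).items():
        print(f"{key}: {value}")
```

The application ID feeds `yarn logs` in the next step, and the DAG ID is what the Tez UI URL expects.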

Step 2: Get the AM Log

The Tez AM is itself a YARN container. Its log is fetched with yarn logs:

yarn logs -applicationId application_1718000000000_4321 \
  -containerId container_e123_1718000000000_4321_01_000001

The AM container is typically _01_000001: the first container of the first application attempt (if YARN retried the AM, look for _02_000001 and so on). The log streams to stdout. Pipe to a file:

yarn logs -applicationId application_1718000000000_4321 \
  -containerId container_e123_1718000000000_4321_01_000001 \
  > ~/tez-notes/hive-h3-amlog.txt

The AM log contains the DAGAppMaster lifecycle, vertex state transitions, and diagnostics aggregated from failing tasks.
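Deriving the AM container ID from a worker container ID is a string substitution. The sketch below assumes the layout used in the example, `container[_e<epoch>]_<clusterTs>_<appSeq>_<attempt>_<containerSeq>`, and that the AM is container 1 of its attempt; both helper names are hypothetical.

```python
def am_container_id(container_id, app_attempt=1):
    """Return the presumed AM container ID for the same application."""
    parts = container_id.split("_")
    parts[-2] = f"{app_attempt:02d}"   # application attempt number
    parts[-1] = f"{1:06d}"             # assumption: the AM is container 1
    return "_".join(parts)


def yarn_logs_command(container_id):
    """Build the `yarn logs` argument list for one container."""
    parts = container_id.split("_")
    app_id = "application_" + "_".join(parts[-4:-2])
    return ["yarn", "logs", "-applicationId", app_id, "-containerId", container_id]


if __name__ == "__main__":
    worker = "container_e123_1718000000000_4321_01_000007"
    print(" ".join(yarn_logs_command(am_container_id(worker))))
```

If the command returns nothing for the derived ID, list the application's attempts with the YARN CLI and retry with the matching attempt number.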

Search for our failing task:

grep -n "task_1718000000000_4321_1_00_000003" ~/tez-notes/hive-h3-amlog.txt | head

You will see lines like:

2024-06-10 14:22:11,432 [INFO ] TaskImpl - task_..._000003 transitioned from SCHEDULED to RUNNING
2024-06-10 14:22:13,108 [INFO ] TaskAttemptImpl - attempt_..._000003_0 transitioned from RUNNING to FAILED
2024-06-10 14:22:13,108 [WARN ] TaskImpl - Diagnostics for ..._000003_0:
  Container ..._000007 failed.
  Exit code: 1
  ... [Last 4096 bytes of stderr] ...

The "Last 4096 bytes of stderr" is the AM's view of why the container died. It's truncated. For the full container log, hop 3.

Step 3: Get the Container Log

The container ID from the AM log (container_..._000007) is the worker. Its log:

yarn logs -applicationId application_1718000000000_4321 \
  -containerId container_e123_1718000000000_4321_01_000007 \
  > ~/tez-notes/hive-h3-container-007.txt

The container log contains the full stdout and stderr from the Tez task runtime (LogicalIOProcessorRuntimeTask), including all logged exceptions and any user-code output.

The container log structure:

LogType:stdout
...
LogType:syslog
2024-06-10 14:22:12,856 [INFO ] LogicalIOProcessorRuntimeTask - Initializing task ...
2024-06-10 14:22:12,891 [INFO ] MRInput - Initializing MRInput for ...
2024-06-10 14:22:13,007 [WARN ] MRInput - ...
2024-06-10 14:22:13,084 [ERROR] LogicalIOProcessorRuntimeTask - Failed to execute task
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
  Hive Runtime Error while processing row {"a":3,"b":"q"}
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:267)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
        at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
        ...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
        at org.apache.hadoop.hive.ql.exec.UDFBridge.<init>(UDFBridge.java:54)
        ...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        ...
LogType:stderr

This is the actual exception. The Caused by: chain walks from Hive's wrapping exception down to the JVM-level cause.

Step 4: Walk the Exception

Reading the trace top-down for our example:

Frame	Tells you
`java.lang.RuntimeException`	Generic wrapper; not informative on its own
`org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"a":3,"b":"q"}`	Hive boundary; you know the input row
`org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow:91`	Hive Tez map-side row processor
`org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run:418`	Hive Tez map record processor
`org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run:223`	Hive's Tez Processor adapter
`org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run:374`	Tez runtime task
`org.apache.tez.runtime.task.TaskRunner2Callable...`	Tez runtime task launcher

Now the Caused by: chain:

Cause	Tells you
`HiveException: Unable to load UDF X`	The proximate Hive problem
`ClassNotFoundException: com.example.udf.X`	The root: classloader can't find UDF

So the root cause is a UDF class missing from the classpath of the Tez task. That's a Hive (or user) issue, not a Tez issue. See Lab H4 for how to make that attribution rigorously.
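Walking the `Caused by:` chain is easy to automate when you have many container logs to triage. The sketch below collects each cause from a trace and returns the last one as the root; the embedded trace is an abbreviated copy of the example above, and `cause_chain` and `root_cause` are hypothetical helper names.

```python
import re

TRACE = """java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
  Hive Runtime Error while processing row {"a":3,"b":"q"}
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
Caused by: java.lang.ClassNotFoundException: com.example.udf.X
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
"""

CAUSE_LINE = re.compile(r"Caused by: ([\w.$]+)(?:: (.*))?$")


def cause_chain(trace):
    """Return [(exception class, message), ...] for each 'Caused by:' line, in order."""
    chain = []
    for line in trace.splitlines():
        match = CAUSE_LINE.match(line.strip())
        if match:
            chain.append((match.group(1), match.group(2) or ""))
    return chain


def root_cause(trace):
    """Return the innermost cause, or None when the trace has no 'Caused by:' line."""
    chain = cause_chain(trace)
    return chain[-1] if chain else None


if __name__ == "__main__":
    print(root_cause(TRACE))
```

The innermost cause is where the investigation starts; the outer frames tell you which project's code was running when it surfaced.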

Step 5: Attribute the Failure

Apply the decision rule from H4 (preview):

The package of the top frame whose code you can change indicates the project.

Top frames in order:

java.lang.RuntimeException — JVM, not actionable.
org.apache.hadoop.hive.ql.metadata.HiveException — Hive, but generic wrap; keep walking.
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource:91 — Hive code, specific. Stop here for the top frame: this is Hive's MapRecordSource.

Then the Caused by: chain:

HiveException: Unable to load UDF X — Hive.
ClassNotFoundException: com.example.udf.X — root cause.

Attribution: Hive (the proximate code is MapRecordSource) and user (the missing class is the user's UDF jar). Tez is not at fault — it correctly ran the task, the Hive code, and surfaced the exception. Tez's job is to provide a stack trace, which it did.
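The "first frame whose code you can change" rule can be approximated by matching stack frames against package prefixes. The sketch below is a deliberate simplification: the prefix-to-project table is an assumption made for illustration, not the full decision tree from H4, and `first_actionable_frame` is a hypothetical helper.

```python
# Assumed mapping from package prefix to owning project (illustrative only).
PACKAGE_OWNERS = [
    ("org.apache.hadoop.hive.", "Hive"),
    ("org.apache.tez.runtime.", "Tez runtime"),
    ("org.apache.tez.dag.", "Tez AM"),
    ("org.apache.hadoop.yarn.", "YARN"),
]

HIVE_TRACE = """java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException:
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
"""


def first_actionable_frame(trace):
    """Return (project, frame) for the first 'at ...' frame in a known package."""
    for line in trace.splitlines():
        line = line.strip()
        if not line.startswith("at "):
            continue
        frame = line[3:]
        for prefix, owner in PACKAGE_OWNERS:
            if frame.startswith(prefix):
                return owner, frame
    return None


if __name__ == "__main__":
    print(first_actionable_frame(HIVE_TRACE))
```

A prefix match only nominates a project. As in the example above, the `Caused by:` chain can still shift responsibility to the user or the environment.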

The fix is to ensure the UDF jar is on the AuxJar list:

ADD JAR /path/to/udf.jar;

or in hive-site.xml:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/hive/auxlib/udf.jar</value>
</property>

Tooling Shortcuts

Get all container logs at once

yarn logs -applicationId application_1718000000000_4321 \
  > ~/tez-notes/hive-h3-all.txt

For a large DAG with many containers, this is large (often 100s of MB). Use the per-container form when you know which one to look at.

Search across container logs

grep -B2 -A20 "java.lang.\|Caused by" ~/tez-notes/hive-h3-all.txt | head -100

Find the failing task fast

grep "FAILED\|state changed.*FAILED\|attempt.*FAILED" ~/tez-notes/hive-h3-amlog.txt

Tez UI shortcut

If your cluster has the Tez UI, the per-task log links are one click. The UI URL pattern:

http://<tez-ui-host>:9999/tez-ui/#/tez-dag/dag_1718000000000_4321_1

From that page, navigate to Map 1 → task 000003 → attempt 0 → "logs". The UI fetches the container log automatically.

A Second Worked Example — Tez Runtime Failure

Console:

Vertex failed, vertexName=Reducer 2, ...
Container ... failed. Exit code: 1

Container log top of stack:

java.io.IOException: Failed on local exception: java.io.IOException: Failed to fetch shuffle data
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
        ...
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        ...

Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.

Attribution: Tez runtime library, specifically the shuffle fetcher. The root cause (ConnectException: Connection refused) points to the upstream side being gone: the upstream task's container or the node serving its shuffle output was killed, evicted, or is unreachable over the network. Investigation continues into the upstream container's log.

This is the canonical Tez shuffle failure shape. The reproduction is in H5.

A Third Worked Example — AM Failure

Console:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Application application_1718000000000_4321 failed with state FAILED.
Diagnostics: Application application_1718000000000_4321 failed 2 times due to AM Container ... exited with exitCode: -103 ...

The AM itself died. Container log of the AM:

[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
        at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
        ...

Top frame: org.apache.tez.dag.app.dag.impl.VertexImpl:412. Attribution: Tez AM. Root cause: AM heap too small for the DAG (tez.am.resource.memory.mb). Fix is configuration; if reproducible at the default, file a JIRA against Tez requesting either a smarter default or a sizing recommendation.

Validation Artifacts

For our first example, save:

The console error verbatim (~/tez-notes/hive-h3-console.txt).
The parsed-identifiers table (Application ID, DAG ID, Vertex ID, Task ID, Container ID).
The AM log fragment showing the task transition to FAILED.
The container log fragment showing the full exception with Caused by: chain.
The attribution paragraph: which project owns the bug, and why.
The fix you propose.
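Most of the parsed-identifiers table can be derived mechanically from a single task attempt ID, because the application, DAG, vertex, and task IDs are prefixes of it. The following is a minimal sketch under that assumption; `TezIds` and `fromAttemptId` are hypothetical names, not part of Tez, and the container ID still has to be read from the AM log because it is not encoded in the attempt ID.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: derives the enclosing IDs from a Tez task attempt ID. */
final class TezIds {
  private TezIds() {}

  /**
   * Expects attempt_<clusterTs>_<appSeq>_<dagSeq>_<vertexSeq>_<taskSeq>_<attemptSeq>
   * and returns the IDs in outermost-to-innermost order.
   */
  static Map<String, String> fromAttemptId(String attemptId) {
    String[] p = attemptId.split("_");
    if (p.length != 7 || !p[0].equals("attempt")) {
      throw new IllegalArgumentException("Not a Tez task attempt ID: " + attemptId);
    }
    String app = p[1] + "_" + p[2];               // cluster timestamp + application sequence
    Map<String, String> ids = new LinkedHashMap<>();
    ids.put("application", "application_" + app);
    ids.put("dag", "dag_" + app + "_" + p[3]);
    ids.put("vertex", "vertex_" + app + "_" + p[3] + "_" + p[4]);
    ids.put("task", "task_" + app + "_" + p[3] + "_" + p[4] + "_" + p[5]);
    ids.put("attempt", attemptId);
    return ids;
  }
}
```

Calling `TezIds.fromAttemptId(...)` on the attempt ID from the console error gives the rows to paste into the table; add the container ID by hand.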

Once you can produce that artifact for an arbitrary Hive-on-Tez failure, you can debug one. The next lab — Lab H4: Bug Attribution — makes the attribution rigorous with a decision tree and four more worked examples.

Lab H4: Bug Attribution

Background

A failing Hive-on-Tez query may be a Hive bug, a Tez runtime bug, a Tez AM bug, a YARN bug, a Hadoop common bug, a JVM bug, a user bug, or an infrastructure bug. Filing it on the wrong project wastes the reporter's time and the maintainer's. This lab gives you a mechanical decision tree to attribute correctly from a stack trace, plus four worked examples.

The Decision Tree

Given a stack trace (after Lab H3 has surfaced it):

flowchart TD
  S[Start: have stack trace]
  S --> T1[Find top frame whose package you can change]
  T1 --> P{Package prefix?}
  P -->|org.apache.hadoop.hive.*| H[Hive bug]
  P -->|org.apache.tez.runtime.library.*| TR[Tez runtime library<br/>tez-runtime-library]
  P -->|org.apache.tez.runtime.*<br/>not .library| TRI[Tez runtime internals<br/>tez-runtime-internals]
  P -->|org.apache.tez.dag.app.*| TA[Tez AM<br/>tez-dag]
  P -->|org.apache.tez.dag.api.*| TC[Tez client / API<br/>tez-api]
  P -->|org.apache.tez.client.*| TC
  P -->|org.apache.hadoop.yarn.*| Y[YARN bug]
  P -->|org.apache.hadoop.hdfs.*| HD[HDFS bug]
  P -->|org.apache.hadoop.mapred.*| MR[Hadoop MR compat<br/>tez-mapreduce]
  P -->|user package| U[User code bug]
  P -->|java.*, sun.*| J[Walk down to next frame]
  J --> T1
  H --> CD[Then check Caused by chain]
  TR --> CD
  TRI --> CD
  TA --> CD
  Y --> CD
  HD --> CD
  TC --> CD
  MR --> CD
  U --> CD
  CD --> R[Root cause may shift attribution]
  R --> END[File on the project that owns the actionable code]

The rule in one sentence: find the top frame in actionable code, name its package prefix, and read off the project.
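The rule can be sketched as a prefix lookup over the frames of a trace. This is an illustrative sketch, not a tool that ships with Tez or Hive: the class name `Attribution`, the label strings, and the decision to treat generated protobuf classes (`DAGProtos`) as non-actionable are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps the top actionable stack frame to an owning project. */
final class Attribution {
  // Frames to walk past. Skipping generated protobuf code is an assumption.
  private static final List<String> SKIP = Arrays.asList(
      "java.", "javax.", "sun.", "jdk.", "com.google.protobuf.",
      "org.apache.tez.dag.api.records.DAGProtos");

  // Insertion order matters: more specific prefixes must come first.
  private static final Map<String, String> OWNERS = new LinkedHashMap<>();
  static {
    OWNERS.put("org.apache.hadoop.hive.", "Hive");
    OWNERS.put("org.apache.tez.runtime.library.", "Tez / tez-runtime-library");
    OWNERS.put("org.apache.tez.runtime.", "Tez / tez-runtime-internals");
    OWNERS.put("org.apache.tez.dag.app.", "Tez / tez-dag");
    OWNERS.put("org.apache.tez.dag.api.", "Tez / tez-api");
    OWNERS.put("org.apache.tez.client.", "Tez / tez-api");
    OWNERS.put("org.apache.tez.mapreduce.", "Tez / tez-mapreduce");
    OWNERS.put("org.apache.hadoop.yarn.", "YARN");
    OWNERS.put("org.apache.hadoop.hdfs.", "HDFS");
    OWNERS.put("org.apache.hadoop.", "Hadoop common");
  }

  private Attribution() {}

  /** @param frames fully qualified class names, top of stack first */
  static String attribute(List<String> frames) {
    for (String frame : frames) {
      if (SKIP.stream().anyMatch(frame::startsWith)) {
        continue;                                  // walk down to the next frame
      }
      for (Map.Entry<String, String> owner : OWNERS.entrySet()) {
        if (frame.startsWith(owner.getKey())) {
          return owner.getValue();
        }
      }
      return "User code";                          // first actionable frame, unknown prefix
    }
    return "Unknown (no actionable frame)";
  }
}
```

The result is only the first half of the tree: the Caused by chain still has to be checked by hand, because the root cause may shift the attribution.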

Package → Project → Module Table

Package prefix	Project	Module / area	Where to file
`org.apache.hadoop.hive.ql.exec.tez.*`	Hive	Tez integration	`https://issues.apache.org/jira/projects/HIVE`
`org.apache.hadoop.hive.ql.exec.*` (not .tez)	Hive	Operators	HIVE JIRA
`org.apache.hadoop.hive.ql.metadata.*`	Hive	Metadata / UDF	HIVE JIRA
`org.apache.hadoop.hive.serde2.*`	Hive	Serialization	HIVE JIRA
`org.apache.hadoop.hive.*` (any other)	Hive	Core	HIVE JIRA
`org.apache.tez.runtime.library.*`	Tez	`tez-runtime-library`	TEZ JIRA
`org.apache.tez.runtime.task.*`	Tez	`tez-runtime-internals`	TEZ JIRA
`org.apache.tez.runtime.*` (not .library, not .task)	Tez	`tez-runtime-internals`	TEZ JIRA
`org.apache.tez.dag.app.dag.impl.*`	Tez	`tez-dag` (state machines)	TEZ JIRA
`org.apache.tez.dag.app.rm.*`	Tez	`tez-dag` (RM client / container scheduling)	TEZ JIRA
`org.apache.tez.dag.app.launcher.*`	Tez	`tez-dag` (container launcher)	TEZ JIRA
`org.apache.tez.dag.app.*` (other)	Tez	`tez-dag` (AM core)	TEZ JIRA
`org.apache.tez.dag.api.*`	Tez	`tez-api` (DAG / Vertex / Edge)	TEZ JIRA
`org.apache.tez.client.*`	Tez	`tez-api` (TezClient)	TEZ JIRA
`org.apache.tez.mapreduce.*`	Tez	`tez-mapreduce` (MRInput/MROutput)	TEZ JIRA
`org.apache.hadoop.yarn.client.*`	YARN	Client	YARN JIRA
`org.apache.hadoop.yarn.server.resourcemanager.*`	YARN	RM	YARN JIRA
`org.apache.hadoop.yarn.server.nodemanager.*`	YARN	NM	YARN JIRA
`org.apache.hadoop.hdfs.*`	HDFS	Client / DN / NN	HDFS JIRA
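`org.apache.hadoop.mapred.*`	MR compat	`tez-mapreduce` for MR-on-Tez classes that ship in the Tez jars; Hadoop MapReduce otherwise	TEZ JIRA (MAPREDUCE JIRA for Hadoop's own classes)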
`org.apache.hadoop.io.*` / `.fs.*` / `.conf.*`	Hadoop common	hadoop-common	HADOOP JIRA (Hadoop Common)
`com.<user>.*` / `org.<user>.*` (not apache)	User code	n/a	Fix locally
`java.*`, `sun.*`, `jdk.*`	JVM	walk down	(not the cause; keep looking)

Verify the modules against your tree:

find ~/tez-src -maxdepth 2 -name pom.xml | sort
find ~/hive-src -maxdepth 3 -name pom.xml | head

Example 1: UDF Not Found (Hive bug → User bug)

Trace (from Lab H3):

java.lang.RuntimeException: ...
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:418)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:223)
        at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(...)
        ...
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load UDF X
        at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:1893)
        ...
Caused by: java.lang.ClassNotFoundException: com.example.udf.X

Apply the tree:

Top actionable frame: org.apache.hadoop.hive.ql.exec.tez.MapRecordSource:91.
Package: org.apache.hadoop.hive.ql.exec.tez.*.
Project: Hive (the Tez integration code).
Check Caused by: root is ClassNotFoundException: com.example.udf.X — a user class.
Adjust: this is user error (their UDF jar isn't on the classpath), reported first by Hive's function registry and then re-thrown through Hive's Tez integration. No bug to file.

Fix: ADD JAR or hive.aux.jars.path.

If the same trace came with Caused by: ClassNotFoundException: org.apache.hadoop.hive.ql.exec.UDFBridge, then the root is a Hive class missing from the Hive distribution — file on HIVE.
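The "check Caused by" step in this example can be sketched as follows: read the last Caused by line, and for a ClassNotFoundException decide whose class is missing. This is a minimal sketch; `CausedByChain` and its methods are hypothetical names, and the only ownership rule encoded is the Hive-versus-user distinction drawn above.

```java
/** Hypothetical helper: reads the root cause off a stack trace's "Caused by:" chain. */
final class CausedByChain {
  private static final String MARKER = "Caused by: ";

  private CausedByChain() {}

  /** Returns {exceptionClass, message} for the last "Caused by:" line, or null if none. */
  static String[] rootCause(String trace) {
    String[] root = null;
    for (String line : trace.split("\\R")) {
      String t = line.trim();
      if (!t.startsWith(MARKER)) {
        continue;
      }
      String rest = t.substring(MARKER.length());
      int colon = rest.indexOf(": ");
      root = colon < 0
          ? new String[] {rest, ""}
          : new String[] {rest.substring(0, colon), rest.substring(colon + 2)};
    }
    return root;                                   // the last match is the deepest cause
  }

  /** For a ClassNotFoundException root, says whose class is missing. */
  static String missingClassOwner(String trace) {
    String[] root = rootCause(trace);
    if (root == null || !root[0].endsWith("ClassNotFoundException")) {
      return "n/a";
    }
    return root[1].startsWith("org.apache.hadoop.hive.") ? "Hive" : "User code";
  }
}
```

For the trace in this example the owner comes back as user code; with the `UDFBridge` variant it comes back as Hive.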

Example 2: Shuffle Fetch Failure (Tez runtime bug)

Trace:

java.io.IOException: Failed to fetch shuffle data
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:391)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.copyFromHost(Fetcher.java:355)
        at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher.run(Fetcher.java:262)
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)

Apply the tree:

Top actionable frame: org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler:391.
Package: org.apache.tez.runtime.library.*.
Project: Tez, module tez-runtime-library.
Check Caused by: ConnectException — network.
Adjust: the root is a network/infra failure. The shuffle code surfaced it correctly; not a bug in itself. But:
- If this happens once with sporadic node failures: infrastructure issue, no bug.
- If this happens frequently and the fetcher isn't retrying enough times before giving up: Tez bug — file on TEZ asking to bump or expose tez.runtime.shuffle.connect.timeout/retry counts.
- If the upstream container died because of an AM scheduling bug: Tez AM bug, file on TEZ with the AM log evidence.

Verify the retry config:

grep "shuffle.connect\|shuffle.fetch.retry\|shuffle.read.timeout" \
  ~/tez-src/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/api/TezRuntimeConfiguration.java

Example 3: AM OOM During DAG Submit (Tez AM bug)

AM container log:

[ERROR] DAGAppMaster - Caught exception while running DAGAppMaster
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3210)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(...)
        ...
        at com.google.protobuf.ByteString.copyFrom(ByteString.java:194)
        at org.apache.tez.dag.api.records.DAGProtos$DAGPlan.toBuilder(DAGProtos.java:...)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.<init>(VertexImpl.java:412)
        at org.apache.tez.dag.app.DAGAppMaster.createDAG(DAGAppMaster.java:...)

Apply the tree:

Top actionable frame: skip the JVM and protobuf frames, including the generated DAGProtos class. First hand-written Tez frame: org.apache.tez.dag.app.dag.impl.VertexImpl:412.
Package: org.apache.tez.dag.app.*.
Project: Tez, module tez-dag (AM).
Check Caused by: none — just the OOM.

Attribution: Tez AM. The proximate cause is constructing VertexImpl from a large DAGPlan. Three possible JIRA shapes:

"Tez AM OOMs on submission of N-vertex DAG at default tez.am.resource.memory.mb" — file requesting smarter sizing or doc.
"VertexImpl construction allocates O(N²) memory in inputs" — file with a profile and a fix suggestion.
"DAGPlan toBuilder() materialises a full copy" — file as a perf bug.

The correct shape depends on profile evidence. Without profiling, file the sizing/doc variant first; the deeper variants follow.

Example 4: NodeManager Lost (YARN bug)

AM log:

2024-06-10 ... [WARN ] AMContainerImpl - Container container_..._000007 transitioned from RUNNING to STOPPED. exitStatus -100
2024-06-10 ... [WARN ] DAGAppMaster - Container ..._000007 completed unexpectedly; will be rescheduled
2024-06-10 ... [WARN ] RMContainerRequestor - Lost node nm-12.example.com
2024-06-10 ... [INFO ] DAGAppMaster - Marking task attempt as failed due to lost node: attempt_..._000003_0

Apply the tree:

Top component in the log (there is no stack trace here): AMContainerImpl, under org.apache.tez.dag.app.rm. This is the AM's correct reaction to a node loss, not a bug.
The substantive cause is "NodeManager nm-12 lost" — diagnose by checking NodeManager log on that host:
```
yarn node -list -all | grep nm-12
tail -200 /var/log/hadoop-yarn/yarn-nodemanager.log  # on nm-12
```
Common nm-side root causes:
- NM heap OOM (NM stops sending heartbeats to the RM) → YARN bug or NM tuning.
- Network partition → infra.
- Disk full on NM local-dirs → ops issue.

Attribution:

If the NM died from OOM, file on the YARN JIRA.
If Tez AM didn't reschedule the lost task correctly, file on TEZ. But the AM log here shows correct reaction, so that's not in play.
If Tez's TaskScheduler retried the task on the same lost node repeatedly, file on TEZ (a scheduler awareness issue).
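The negative exit statuses in these logs (exitStatus -100 here, exitCode -103 in the AM failure from Lab H3) are YARN container exit statuses, and decoding them first often narrows the attribution. The lookup below is a sketch: the numeric values are the ones defined in YARN's `ContainerExitStatus` as far as we know, so verify them against your Hadoop version; the class `ExitStatus` and the wording of the hints are our own.

```java
/** Hypothetical helper: decodes common YARN container exit statuses. */
final class ExitStatus {
  private ExitStatus() {}

  static String describe(int status) {
    switch (status) {
      case 0:    return "SUCCESS";
      case -100: return "ABORTED: released by the application or lost with its node";
      case -101: return "DISKS_FAILED: NodeManager local or log dirs unusable";
      case -102: return "PREEMPTED: taken back by the scheduler";
      case -103: return "KILLED_EXCEEDED_VMEM: over the virtual memory limit";
      case -104: return "KILLED_EXCEEDED_PMEM: over the physical memory limit";
      case 137:  return "SIGKILL (128 + 9): often the OS OOM killer";
      case 143:  return "SIGTERM (128 + 15): asked to stop";
      default:   return "Unrecognised status " + status + ": check ContainerExitStatus";
    }
  }
}
```

A status of -100 on a task container together with a "Lost node" line is consistent with the node-loss reading above; -103 or -104 points at container sizing before it points at a code bug.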

Cross-Project Patterns

Some failure modes have a well-known cross-project shape. Memorise the shapes:

Shape	Likely project	Quick diagnostic
`ClassCastException` inside `MapRecordSource` / `ReduceRecordSource`	Hive (schema mismatch in vectorization)	Check `EXPLAIN VECTORIZATION DETAIL`
`IOException: Stream is closed` in shuffle reader	Tez runtime library	Check upstream container alive
`TaskCommitDeniedException`	Tez AM speculative-exec coordination	Check `tez.am.speculation.enabled`
`NoSuchMethodError` on a Tez or Hive class	Version skew	Check classpath; check `mvn dependency:tree`
`IllegalArgumentException: Wrong FS`	Hadoop FS	Check `fs.defaultFS`, `core-site.xml`
`Container killed by OOM killer` (exit code 137)	YARN or workload	Check container memory request vs JVM heap
`org.apache.hadoop.security.AccessControlException`	HDFS permissions or a Ranger policy	Permissions issue, not a code bug

What to Do With the Attribution

Having attributed correctly:

Attribution	Action
Hive	File on `https://issues.apache.org/jira/projects/HIVE` with `Tez` in summary if relevant
Tez `tez-runtime-library`	File on `https://issues.apache.org/jira/projects/TEZ`, component `Runtime Library`
Tez `tez-runtime-internals`	File on TEZ, component `Runtime Internals`
Tez `tez-dag` (AM)	File on TEZ, component `AM`
Tez `tez-api`	File on TEZ, component `Client / API`
Tez `tez-mapreduce`	File on TEZ, component `MR Compat`
YARN	File on `https://issues.apache.org/jira/projects/YARN`
HDFS	File on `https://issues.apache.org/jira/projects/HDFS`
User	Fix locally, no JIRA
Infrastructure	Operations issue, no JIRA
Multiple (Hive needs change AND Tez needs change)	File on both, cross-reference

In all cases, the JIRA description follows the skeleton in Design via JIRA.

Validation Artifacts

After this lab:

The decision tree printed and pinned at your desk (or in ~/tez-notes/).
The Package → Project → Module table memorised or saved as ~/tez-notes/hive-h4-attribution.md.
Four attributions, one for each worked example, written out in your own words.
The reflex: never file a JIRA on a project whose code does not appear at the top of the actionable stack.

The next lab — Lab H5: Reproducing Bugs — covers how to turn an attributed bug into a minimum reproducer suitable to attach to a JIRA.

Lab H5: Reproducing Bugs

Background

A JIRA without a reproducer drifts. A JIRA with a clean reproducer gets attention. "Clean" means: minimal schema, minimal data, minimal query, runnable in under a minute on a local MiniTezCluster or MiniHS2. This lab is the procedure.

The Hive integration test framework (hive-itests) is the source of every pattern you need. Reading its existing tests is the cheapest education.

The Three Reduction Axes

To minimise a reproducer, reduce along three independent axes:

Axis	Reduce	Stop reducing when
Schema	Drop unused columns; simplify types	Removing a column makes the bug disappear
Data	Reduce row count; generate synthetic data	Reducing rows makes the bug disappear
Query	Drop joins, predicates, projections	Dropping a clause makes the bug disappear

The goal is the smallest schema × smallest data × smallest query that still reproduces.

Setup — Local `MiniHS2` + `MiniTezCluster`

MiniHS2 is a single-JVM HiveServer2 that runs against a MiniTezCluster (an in-process YARN mini cluster configured for Tez). Together they let you reproduce a Hive-on-Tez bug in seconds without an external cluster.

Existing reference in your tree:

find ~/hive-src/itests -name "MiniHS2.java" | head
find ~/hive-src/itests -name "TestMiniLlapVectorArrowWithLlapIODisabled.java" | head
find ~/tez-src/tez-tests -name "MiniTezCluster.java"

A reproducer test class skeleton (Hive 3/4 style):

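import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.jdbc.miniHS2.MiniHS2;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestMyBugRepro {
  private MiniHS2 miniHS2;

  @Before
  public void setUp() throws Exception {
    HiveConf conf = new HiveConf();
    conf.set("hive.execution.engine", "tez");
    // tez.lib.dir is a system property you pass to the test JVM yourself.
    conf.set("tez.lib.uris",
        "file://" + System.getProperty("tez.lib.dir"));
    miniHS2 = new MiniHS2.Builder()
        .withConf(conf)
        // Mini-cluster selector. Check MiniHS2.Builder in your tree: if this one
        // starts an MR cluster, switch to the option that selects MiniClusterType.TEZ.
        .withMiniMR()
        .build();
    miniHS2.start(new HashMap<>());
  }

  @After
  public void tearDown() throws Exception {
    if (miniHS2 != null) {             // setUp may have failed before assignment
      miniHS2.stop();
    }
  }

  @Test
  public void reproBug() throws Exception {
    try (Connection c = DriverManager.getConnection(miniHS2.getJdbcURL());
         Statement s = c.createStatement()) {
      s.execute("CREATE TABLE t (...) STORED AS ORC");
      s.execute("INSERT INTO t VALUES (...)");
      try (ResultSet rs = s.executeQuery("SELECT ...")) {
        // assert behaviour or expect exception
      }
    }
  }
}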

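Place the class in an itests module (MiniHS2-based tests usually live under itests/hive-unit) and run it from that module: cd itests/hive-unit && mvn test -Dtest=TestMyBugRepro.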

Reducing the Schema

Starting from a real production table with 200 columns, reduce iteratively:

Identify referenced columns. Read the failing query; note which columns the SELECT, WHERE, GROUP BY, JOIN, ORDER BY actually reference.
Drop everything else. Make a new test schema with only the referenced columns.
Re-run. Does the bug still reproduce? If yes, you've reduced. If no, you've found a column that's load-bearing; add it back and look for why.
Simplify remaining types. Replace DECIMAL(38,10) with DECIMAL(10,2) if the bug doesn't depend on precision. Replace STRUCT<...> with STRING if you can. Replace partition columns with non-partitioned tables unless the partition is load-bearing.
Stop when reduction breaks the repro.

For our running example query:

SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;

Only column a is referenced. Schema reduces to:

CREATE TABLE t (a INT) STORED AS ORC;

If the bug needs the second column for some reason (e.g. ORC stripe layout), keep it.

Reducing the Data — `JoinDataGen` Pattern

Hive's itests includes data generators for systematic minimisation. The most common pattern is JoinDataGen for generating join inputs at controlled cardinalities:

find ~/hive-src -name "JoinDataGen*.java" -o -name "*DataGen*.java" | head

The pattern (adapt for your bug):

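import java.sql.SQLException;
import java.sql.Statement;
import java.util.Random;

public final class TestDataGen {
  /** Inserts rowCount single-INT rows with keys drawn uniformly from [0, distinctKeys). */
  public static void writeIntRows(String tableName, int rowCount, int distinctKeys,
                                   Statement s) throws SQLException {
    if (rowCount <= 0) {
      return;                          // an empty VALUES list is not valid SQL
    }
    Random r = new Random(42);         // fixed seed keeps the reproducer deterministic
    StringBuilder values = new StringBuilder();
    for (int i = 0; i < rowCount; i++) {
      if (i > 0) values.append(",");
      values.append("(").append(r.nextInt(distinctKeys)).append(")");
    }
    // One multi-row INSERT; fine for small reproducers, too large for millions of rows.
    s.execute("INSERT INTO " + tableName + " VALUES " + values);
  }
}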

Reduce data:

Start with original data size. 1 billion rows? Reduce to 1 million.
Halve until bug disappears. Binary-search the row count: 1M → 500K → 250K → ...
At the smallest row count that still repros, vary distinct-key count. Bug may need 5 distinct keys (skew) or 500K (cardinality). Find which.
Vary value distribution. If the bug needs a skewed distribution (one key gets 90% of rows), generate that explicitly.
Document the minimum. "Bug reproduces at >= 1024 rows with >= 8 distinct keys."
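The halving step above is a binary search, which can be sketched as below. It assumes the bug is monotonic in row count (if N rows reproduce, every larger count does too) and that an empty table does not reproduce; a bug that fires only at one exact row count breaks that assumption and needs a linear scan around the boundary instead. `RowCountBisect` is a hypothetical name, and `reproduces` stands for whatever runs your test at a given row count.

```java
import java.util.function.IntPredicate;

/** Hypothetical helper: finds the smallest row count that still reproduces a bug. */
final class RowCountBisect {
  private RowCountBisect() {}

  /**
   * @param knownFailing a row count at which the bug is known to reproduce
   * @param reproduces   runs the reproducer at the given row count
   * @return the smallest reproducing row count, assuming monotonic behaviour
   */
  static int smallestFailing(int knownFailing, IntPredicate reproduces) {
    if (knownFailing < 1 || !reproduces.test(knownFailing)) {
      throw new IllegalArgumentException("Start from a row count that reproduces");
    }
    int lo = 0;                        // assumed not to reproduce (empty table)
    int hi = knownFailing;             // known to reproduce
    while (hi - lo > 1) {
      int mid = lo + (hi - lo) / 2;
      if (reproduces.test(mid)) {
        hi = mid;                      // still fails: the minimum is at or below mid
      } else {
        lo = mid;                      // passes: the minimum is above mid
      }
    }
    return hi;
  }
}
```

Starting from 1,000,000 rows this needs about 20 runs, which is why each run has to finish in well under a minute.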

For our running example, with no actual bug, the minimum is whatever you need to exercise the GROUP BY + ORDER BY path — single-digit rows are enough.
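When a bug does depend on skew, the "one key gets 90% of rows" distribution mentioned in the reduction steps can be generated explicitly. The sketch below is one way to do it; `SkewedKeys` is a hypothetical name, key 0 is arbitrarily chosen as the hot key, and the fixed seed is there so that the reproducer produces the same data on every run.

```java
import java.util.Random;

/** Hypothetical helper: generates join or group-by keys with one hot key. */
final class SkewedKeys {
  private SkewedKeys() {}

  /**
   * @param rowCount     number of keys to generate
   * @param distinctKeys size of the key domain [0, distinctKeys)
   * @param hotFraction  probability that a row gets the hot key 0, e.g. 0.9
   * @param seed         fixed seed for repeatable data
   */
  static int[] generate(int rowCount, int distinctKeys, double hotFraction, long seed) {
    if (distinctKeys < 1) {
      throw new IllegalArgumentException("distinctKeys must be >= 1");
    }
    Random r = new Random(seed);
    int[] keys = new int[rowCount];
    for (int i = 0; i < rowCount; i++) {
      boolean hot = distinctKeys == 1 || r.nextDouble() < hotFraction;
      // Cold rows are spread uniformly over the remaining keys 1..distinctKeys-1.
      keys[i] = hot ? 0 : 1 + r.nextInt(distinctKeys - 1);
    }
    return keys;
  }
}
```

The resulting array can be fed into an INSERT builder in the style of `writeIntRows` in place of the uniform `r.nextInt(distinctKeys)` draw.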

Reducing the Query

Remove clauses one at a time and re-test:

Remove ORDER BY → does the bug still happen? (Probably not, if the bug is in the total-order reducer.)
Remove the aggregate → does the bug still happen?
Remove WHERE predicates one at a time.
Remove JOINs; if the join is the cause, simplify to a 2-table join, then to a tiny-on-tiny join.
Replace MAP JOIN with SHUFFLE JOIN by disabling map joins (hive.auto.convert.join=false) and re-test.

A reproducer query of 3 lines beats a reproducer query of 30 lines, even for the same bug.
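Removing clauses one at a time and re-testing is a greedy reduction loop, sketched below. It is a simplified stand-in for delta debugging, not a Hive tool: `ClauseReducer` is a hypothetical name, a "clause" is any removable piece of the query you choose to model, and `reproduces` stands for rebuilding the query from the remaining clauses and running it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper: greedily drops query clauses while the bug still reproduces. */
final class ClauseReducer {
  private ClauseReducer() {}

  static <T> List<T> reduce(List<T> clauses, Predicate<List<T>> reproduces) {
    List<T> current = new ArrayList<>(clauses);
    boolean removedSomething = true;
    while (removedSomething) {
      removedSomething = false;
      for (int i = 0; i < current.size(); i++) {
        List<T> candidate = new ArrayList<>(current);
        candidate.remove(i);                       // drop exactly one clause (by index)
        if (reproduces.test(candidate)) {
          current = candidate;                     // keep the smaller query and restart
          removedSomething = true;
          break;
        }
      }
    }
    return current;                                // every remaining clause is load-bearing
  }
}
```

The loop stops when no single removal still reproduces, which is the same stopping rule as in the reduction-axes table.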

Capturing the Artifacts

A complete bug-report artifact set:

Artifact	Why
`CREATE TABLE` DDL for every table involved	Reproducer setup
Data generation code or inline `INSERT` values	Reproducer setup
The minimal query	The test
`SET hive.*` lines that were necessary	Configuration
The expected behavior (correct result)	Oracle
The actual behavior (incorrect result or exception)	Symptom
`EXPLAIN FORMATTED` output	Plan
AM log fragment showing failure	Diagnostic
Container log fragment showing exception	Diagnostic
Tez and Hive version	Version

Bundle into a single artifact:

cd ~/tez-notes
mkdir -p hive-h5-repro
cp ddl.sql hive-h5-repro/
cp gen.sql hive-h5-repro/
cp query.sql hive-h5-repro/
cp explain.txt hive-h5-repro/
cp amlog-fragment.txt hive-h5-repro/
cp container-log-fragment.txt hive-h5-repro/
cat > hive-h5-repro/README.md <<EOF
# Repro for HIVE-XXXXX / TEZ-XXXX

Tez version: 0.10.X
Hive version: 4.0.X
Hadoop version: 3.3.X
JDK: 11

Setup:  hive -f ddl.sql && hive -f gen.sql
Repro:  hive -f query.sql
Expected: rows = N, max value = M.
Actual:   exception in container log (see container-log-fragment.txt).
EOF
tar czf hive-h5-repro.tar.gz hive-h5-repro/

Attach hive-h5-repro.tar.gz to the JIRA. A reproducer in this shape gets opened by maintainers; one without these elements doesn't.

When `MiniTezCluster` Doesn't Reproduce

A bug that reproduces on a production cluster but not on MiniTezCluster is the worst shape. Common causes:

Cause	Diagnostic
Multi-node shuffle behavior; mini cluster is single-node	Force multiple containers per node; can't fully simulate
Container OOM at production memory; mini cluster doesn't have memory pressure	Configure mini cluster with tight memory limits
Concurrent DAG submissions; mini cluster has none	Run multiple parallel tests
ORC stripe layout; needs production-size files	Generate larger ORC files
Production data distribution; mini cluster has uniform	Use realistic random seed and distribution
Speculative execution; not enabled in mini by default	Enable with `tez.am.speculation.enabled=true`

If none of these reduce, the bug may be in cluster-only code paths (RM scheduling edge cases). Document that the reproducer requires N nodes and attach what evidence you have.

A Worked Reproducer — Hypothetical Bug

Suppose a bug: COUNT(*) returns 0 when input table has exactly 1024 rows and vectorization is enabled. (Imaginary; for the pattern.)

Schema

CREATE TABLE t (a INT) STORED AS ORC;

Data

-- produce exactly 1024 rows (a = 1..1024):
INSERT INTO t
SELECT pe.pos + 1 AS a
FROM (
  SELECT posexplode(split(space(1023), ' ')) AS (pos, val)
) pe;

(posexplode over split(space(1023), ' ') yields 1024 rows on Hive versions that ship posexplode; use the equivalent for your version.)

Query

SET hive.vectorized.execution.enabled=true;
SELECT COUNT(*) FROM t;

Expected vs Actual

Expected: 1024
Actual:   0

EXPLAIN

EXPLAIN VECTORIZATION DETAIL SELECT COUNT(*) FROM t;

Save the output. Look for Execution mode: vectorized and any odd Vectorized: false on a key operator.

Trial Reductions

1023 rows: bug? No.
1024 rows: bug? Yes.
2048 rows: bug? No (test it; a multiple of 1024 is the obvious next suspect).
Vectorization off: bug? No. Reset the setting afterwards.

Document the conditions:

Bug reproduces at:
  - row count exactly 1024
  - hive.vectorized.execution.enabled=true
Bug does NOT reproduce at:
  - row count != 1024
  - hive.vectorized.execution.enabled=false

That's a sharp, actionable bug. Attribution (by Lab H4): likely Hive's vectorized aggregation code path. File on HIVE.

Production-to-Test Translation

When a real production bug is reported to you with no reproducer:

Get the query. From the user, from hive.log (hive.server2.logging.operation.enabled), or from HiveServer2 audit logs.
Get the schema. Run SHOW CREATE TABLE on each involved table; copy.
Get a sample of data. A few hundred to a few thousand rows. Anonymise PII if needed.
Get the version triplet. Tez / Hive / Hadoop.
Reproduce. Stand up MiniHS2, load the schema, load the sample data, run the query.
If it reproduces, reduce. Apply the three axes.
If it doesn't reproduce, expand. More data, more nodes, more concurrency.

A one-day cycle for a complex production bug is fast. A one-week cycle is realistic for something subtle.

Validation Artifacts

After this lab:

A complete reproducer artifact (a hive-h5-repro.tar.gz-style bundle) for a real or imagined Hive-on-Tez bug.
A TestMyBugRepro.java skeleton you can adapt.
The three-axes reduction discipline applied at least once.
The reflex to capture the version triplet (Tez/Hive/Hadoop) on every reproducer.

The next lab, Lab H6: Writing a Diagnostic Patch, covers what to do when you can't reproduce locally and need to ask the production reporter to capture more data.

Lab H6: Writing a Diagnostic Patch

Background

You have a Hive-on-Tez bug report from production. You can't reproduce locally (Lab H5 didn't work). You need more data. The way to get it is a diagnostic patch — a small change that adds logging, counters, or a debug toggle without changing behavior, attached to the JIRA, that the reporter can apply and re-run.

A well-shaped diagnostic patch:

Adds boundary-INFO logging at the suspected fault site.
Adds a TezCounter so the data is captured in the standard counter mechanism.
Adds a debug-only TezConfiguration switch so the cost is opt-in.

This lab walks the three patterns.

Pattern 1: Boundary INFO Logging

A "boundary" is the point at which control flows from one subsystem to another:

Boundary	Example
Hive → Tez submit	`TezTask.execute` → `TezSession.submitDAG`
Tez AM → Container	`DAGAppMaster.scheduleTaskAttempt` → `ContainerLauncherImpl.launch`
Container → Task	`LogicalIOProcessorRuntimeTask.run` → `Processor.run`
Task → Input shuffle	`OrderedGroupedKVInput.start`
Task → Output shuffle	`OrderedPartitionedKVOutput.start`

INFO at a boundary is cheap because it fires once per crossing rather than once per record, and it gives the next debugger a structured trail.

Example patch (illustrative diff)

Suppose the bug is "DAG submission occasionally takes >10s on large DAGs." A diagnostic patch in TezTask:

diff --git a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
index abcdef1..2345678 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
@@ -201,4 +201,10 @@ public class TezTask extends Task<TezWork> {
   private DAGClient submit(DAG dag, TezSessionState session) throws Exception {
+    long submitStartNs = System.nanoTime();
+    int dagPlanBytes = dag.createDag(conf, null, null, null, false).getSerializedSize();
+    LOG.info("HIVE-XXXX diag: about to submitDAG, dagName={}, vertices={}, planBytes={}",
+        dag.getName(), dag.getVertices().size(), dagPlanBytes);
     DAGClient client = session.getSession().submitDAG(dag);
+    LOG.info("HIVE-XXXX diag: submitDAG returned in {} ms",
+        TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - submitStartNs));
     return client;
   }

Rules for the patch:

Tag every log line with the JIRA ID. The reporter greps for HIVE-XXXX diag: to find your data.
INFO, not DEBUG. The reporter must not have to change log levels.
Structured key=value or {} placeholders. Easy to parse.
Cheap. Measure or log only what's needed; no full-DAG dumps unless explicitly asked.
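The first and third rules (a greppable JIRA tag, structured key=value fields) can be sketched as a pair of helpers: one that formats a diagnostic line and one that parses it back out of a log the reporter attaches. `DiagLine` is a hypothetical class for illustration, not something in Hive or Tez; in a real patch the formatted string would go through the existing SLF4J `LOG.info` call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: formats and parses "<JIRA> diag:" log lines. */
final class DiagLine {
  private static final Pattern FIELD = Pattern.compile("(\\w+)=([^,\\s]+)");

  private DiagLine() {}

  /** Builds "<jiraId> diag: <event>, k1=v1, k2=v2" in field insertion order. */
  static String format(String jiraId, String event, Map<String, ?> fields) {
    StringBuilder sb = new StringBuilder(jiraId).append(" diag: ").append(event);
    for (Map.Entry<String, ?> e : fields.entrySet()) {
      sb.append(", ").append(e.getKey()).append('=').append(e.getValue());
    }
    return sb.toString();
  }

  /** Extracts key=value pairs from a tagged line; empty if the tag is absent. */
  static Map<String, String> parse(String jiraId, String line) {
    Map<String, String> out = new LinkedHashMap<>();
    String tag = jiraId + " diag:";
    int at = line.indexOf(tag);
    if (at < 0) {
      return out;
    }
    Matcher m = FIELD.matcher(line.substring(at + tag.length()));
    while (m.find()) {
      out.put(m.group(1), m.group(2));
    }
    return out;
  }
}
```

Values that contain commas or spaces would need quoting; the sketch keeps to simple tokens such as counts, sizes, and IDs, which is what a boundary log line should carry anyway.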

Pattern 2: A New `TezCounter`

Counters are the production-safe way to surface a number. They aggregate across tasks and are visible in the Tez UI, in hive.tez.exec.print.summary output, and in the AM log.

Define a new counter

Tez counters are enums. The Tez-side counters:

find ~/tez-src -name "DAGCounter.java" -o -name "TaskCounter.java"

public enum TaskCounter {
  // ... existing ...
  REDUCE_INPUT_GROUPS,
  REDUCE_OUTPUT_RECORDS,
  // new for diagnostic:
  /** TEZ-XXXX diag: number of shuffle fetch retries on this task. */
  SHUFFLE_FETCH_RETRIES,
}

The Hive-side counters live in:

find ~/hive-src -name "OperatorVariation.java" -o -name "HiveCounter*.java" | head

For Hive-side, use a Reporter.incrCounter or operator counter mechanism, depending on Hive version.

Increment it where it matters

In the suspected hot spot:

-        copyFromHost(host);
+        try {
+          copyFromHost(host);
+        } catch (IOException e) {
+          context.getCounters().findCounter(TaskCounter.SHUFFLE_FETCH_RETRIES).increment(1);
+          throw e;
+        }

After the reporter runs with the patch:

SET hive.tez.exec.print.summary=true;
-- repro the bug

The summary will show SHUFFLE_FETCH_RETRIES = N, aggregated across tasks, surfacing data that was previously invisible.

Counters vs logs

Aspect	Counter	Log
Aggregation across tasks	Automatic	Manual
Production safety	High	High
Persistence	Long (ATS / Tez UI)	Short (container log rotation)
Detail per event	None (just a count)	Full message
Cost	Near zero	Low to moderate

Use both for big diagnostics: a counter to know "this happened N times" and a log to know "and the first time, here's what it looked like."
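The "count every time, log the first time" combination can be sketched in plain Java. The sketch deliberately uses stand-ins so that it runs on its own: an `AtomicLong` plays the role of the `TezCounter`, and a `Consumer<String>` plays the role of the logger. `CountAndLogFirst` is a hypothetical name.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical helper: counts every event but logs only the first one in full. */
final class CountAndLogFirst {
  private final AtomicLong count = new AtomicLong();   // stand-in for a TezCounter
  private final String jiraId;
  private final Consumer<String> log;                  // stand-in for LOG::info

  CountAndLogFirst(String jiraId, Consumer<String> log) {
    this.jiraId = jiraId;
    this.log = log;
  }

  /** Records one occurrence; safe to call from several fetcher threads. */
  void record(Throwable cause) {
    if (count.incrementAndGet() == 1) {
      log.accept(jiraId + " diag: first occurrence: " + cause);
    }
  }

  long count() {
    return count.get();
  }
}
```

In a real patch the count would be incremented on the task's counters so that it aggregates across tasks; the first-occurrence log stays local to the container that saw it.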

Pattern 3: A Debug `TezConfiguration` Switch

For more invasive diagnostics — extra log lines that would be too noisy by default, or extra checks that have a measurable cost — gate them behind a config switch.

Define the switch

In TezConfiguration (Tez side) or HiveConf (Hive side):

// TezConfiguration.java
@Private @Unstable
public static final String TEZ_AM_DIAGNOSTICS_VERBOSE = "tez.am.diagnostics.verbose";
public static final boolean TEZ_AM_DIAGNOSTICS_VERBOSE_DEFAULT = false;

Use @Private and @Unstable for a diagnostic key — see Compatibility. It signals "this is not a supported API, may be removed once the bug is fixed."

Gate the diagnostic

private final boolean verboseDiagnostics;

public VertexImpl(...) {
  this.verboseDiagnostics = conf.getBoolean(
      TezConfiguration.TEZ_AM_DIAGNOSTICS_VERBOSE,
      TezConfiguration.TEZ_AM_DIAGNOSTICS_VERBOSE_DEFAULT);
}

public void scheduleTasks(...) {
  // ... existing logic ...
  if (verboseDiagnostics) {
    LOG.info("TEZ-XXXX diag: scheduling {} tasks for vertex {}; first task locations: {}",
        tasksToSchedule.size(), getName(),
        tasksToSchedule.subList(0, Math.min(5, tasksToSchedule.size())));
  }
}

Reporter applies the patch and turns it on:

SET tez.am.diagnostics.verbose=true;
-- repro the bug
SET tez.am.diagnostics.verbose=false;

When the bug is diagnosed, the switch is removed in the proper fix. It is not a supported config — the JIRA tracks both the diagnostic patch (to be reverted) and the real fix.
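One detail worth getting right in a gated diagnostic is that the cost of building the message is also paid only when the switch is on. The sketch below shows that with a `Supplier`; it uses `java.util.Properties` as a stand-in for the Tez `Configuration` object and a `Consumer<String>` as a stand-in for the logger, and `GatedDiag` is a hypothetical name.

```java
import java.util.Properties;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical helper: emits diagnostics only when a config switch is on. */
final class GatedDiag {
  private final boolean enabled;
  private final Consumer<String> log;                  // stand-in for LOG::info

  GatedDiag(Properties conf, String key, Consumer<String> log) {
    // Read the switch once at construction; it defaults to off.
    this.enabled = Boolean.parseBoolean(conf.getProperty(key, "false"));
    this.log = log;
  }

  /** The supplier runs only when the switch is on, so the disabled path costs one branch. */
  void info(Supplier<String> message) {
    if (enabled) {
      log.accept(message.get());
    }
  }
}
```

With SLF4J's `{}` placeholders the string formatting is already deferred, but arguments such as a `subList` or a serialized size are still computed eagerly, which is what the explicit `if (verboseDiagnostics)` guard in the patch avoids.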

Assembling the Diagnostic Patch

A complete patch for attachment:

One new INFO log line at the boundary you suspect.
One new counter if there's a count to track.
One debug switch if the diagnostic has cost.
JIRA description with:
- What the patch adds.
- How to apply it.
- How to enable any switch.
- What output the reporter should capture and attach.
Test that the patch compiles and runs the existing tests — diagnostic patches must not change behavior.

Skeleton JIRA comment

Diagnostic patch attached: TEZ-XXXX.diag.001.patch

Adds:
  - INFO log "TEZ-XXXX diag:" in VertexImpl.scheduleTasks
  - TaskCounter SHUFFLE_FETCH_RETRIES
  - Config switch tez.am.diagnostics.verbose (default false)

To reproduce with the patch:

  1. Apply: git apply TEZ-XXXX.diag.001.patch
  2. Build: mvn install -DskipTests
  3. Run query that reproduces the issue, with:
       SET tez.am.diagnostics.verbose=true;
       SET hive.tez.exec.print.summary=true;
  4. Attach:
       - The AM log (yarn logs -applicationId ...)
       - The full Tez summary output
       - One container log from a failing task

Will use the data to file a proper fix. The diagnostic patch is not for commit;
the fix patch will be separate.

Thanks,
<First>

When the Reporter Can't Apply a Patch

Some reporters can't patch their cluster (locked-down enterprise environment). In that case:

Ask for what data they can capture: AM log, container logs, Tez UI screenshots, counter values.
Tell them which existing INFO-level logs to grep for.
Tell them which existing counters to read off.
If a config switch already exists that increases logging, point at it (e.g. tez.am.log.level=DEBUG, tez.task.log.level=DEBUG, or tez.am.history.logging.enabled=true if history logging was turned off).

You don't always get a diagnostic patch onto the cluster. The skill is to plan the diagnostic so you get as much as possible from what's already shipped.

After the Diagnostic Runs

The reporter attaches:

AM log with TEZ-XXXX diag: lines.
Tez summary with counters.
Container log fragments.

You analyse:

What did the diagnostic counter show?
What did the diagnostic log line tell you?
Where is the actual root cause?

Once you know the root cause:

File a separate JIRA for the real fix (or repurpose the diagnostic JIRA).
Attach a proper fix patch (no diag, no INFO noise, no @Private config keys).
Note in the JIRA comment that the diagnostic patch is being abandoned in favor of the fix patch.

Worked Example — Slow DAG Submission

Production bug: "Hive query takes 30 seconds in TezTask.submit before the DAG starts running on large DAGs."

Diagnostic patch

(As above) — adds two INFO lines in TezTask.submit capturing time and DAGPlan size.

Reporter runs

HIVE-XXXX diag: about to submitDAG, dagName=Hive_..., vertices=347, planBytes=8421567
HIVE-XXXX diag: submitDAG returned in 28341 ms

Analysis

347 vertices: large DAG, but not absurd.
DAGPlan 8.4 MB: very large.
28 seconds for submitDAG: most likely RPC + protobuf parse on the AM side.
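It helps to turn the two diagnostic lines into rates before guessing at a cause. With the numbers above, the plan moved at roughly 297 bytes per millisecond and the submission cost roughly 82 ms per vertex. The sketch below only does that arithmetic; `SubmitRate` is a hypothetical name and the inputs are the values from the reporter's log.

```java
/** Hypothetical helper: normalises submitDAG timings so runs can be compared. */
final class SubmitRate {
  private SubmitRate() {}

  /** Plan bytes handled per millisecond of submitDAG wall time. */
  static double bytesPerMs(long planBytes, long elapsedMs) {
    return (double) planBytes / elapsedMs;
  }

  /** Average submitDAG wall time attributable to each vertex. */
  static double msPerVertex(long elapsedMs, int vertices) {
    return (double) elapsedMs / vertices;
  }

  public static void main(String[] args) {
    long planBytes = 8_421_567L;       // planBytes from the first diag line
    long elapsedMs = 28_341L;          // elapsed time from the second diag line
    int vertices = 347;                // vertices from the first diag line
    System.out.printf("bytes/ms = %.2f, ms/vertex = %.2f%n",
        bytesPerMs(planBytes, elapsedMs), msPerVertex(elapsedMs, vertices));
  }
}
```

A single run cannot distinguish linear from superlinear cost. Ask for the same two lines from a smaller DAG: if ms per vertex grows with the vertex count, the work is superlinear in the number of vertices.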

Further diagnosis

Add a Tez-side INFO in DAGAppMaster.submitDAGToAppMaster:

TEZ-XXXX diag: received DAGPlan of {} bytes, deserializing
TEZ-XXXX diag: createDAG completed in {} ms
TEZ-XXXX diag: VertexImpl construction completed in {} ms

Re-run on the cluster. Pinpoint where the 28 seconds go.

Likely result: VertexImpl construction turns out to be O(N²) in vertex count. File the fix patch with the timing evidence attached.
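
The three Tez-side lines in the further-diagnosis step amount to timing each phase and tagging the result. A minimal sketch of that pattern follows. DiagTimer is a hypothetical helper, and the Consumer<String> stands in for the class's existing logger so the example runs without Tez on the classpath.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical timing wrapper for a temporary diagnostic patch. */
class DiagTimer {
    /**
     * Runs one phase and emits a single line tagged with the JIRA ID, so the
     * line can be grepped out of the AM log and deleted once the fix lands.
     */
    static <T> T timed(String jiraId, String phase, Supplier<T> body, Consumer<String> log) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            // The finally block logs the elapsed time even if the phase throws.
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            log.accept(jiraId + " diag: " + phase + " completed in " + elapsedMs + " ms");
        }
    }

    public static void main(String[] args) {
        // Stand-in workload; in a real patch the body is the existing call.
        int result = timed("TEZ-XXXX", "createDAG", () -> 21 * 2, System.out::println);
        System.out.println("result = " + result);
    }
}
```

To test the O(N²) hypothesis, run the same instrumentation at two vertex counts: if doubling the vertex count roughly quadruples the VertexImpl construction time, the quadratic reading holds.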

What a Diagnostic Patch Is Not

Not a place to add unrelated improvements.
Not a permanent feature; it gets reverted after the bug is fixed.
Not a substitute for a proper reproducer (combine both when you can).
Not a place to use @Public APIs — diagnostic config keys are @Private, @Unstable.
Not committable to trunk as-is.

Validation Artifacts

After this lab:

A ~/tez-notes/diag-patch-template.md containing the JIRA-comment template above.
One worked diagnostic patch (real or imagined) following the three patterns.
The reflex to tag every diagnostic log line with the JIRA ID.
The discipline to file diagnostic and fix as separate JIRAs (or stages of one JIRA).

This lab closes the Hive-on-Tez Labs section. You now have the full toolkit: trace a SQL query into a DAG, capture the DAG four ways, walk a failure to its root, attribute it to the right project, reproduce it minimally, and instrument it for remote diagnosis. That toolkit is the practising-Tez-committer skill at the Tez/Hive boundary.

Release & PMC Reality

This section takes you inside the committer and PMC view of Apache Tez. It is written for two audiences:

Contributors who want to understand what a committer is reading when they review your patch, why a release vote takes 72 hours, and what a PMC member actually does between commits.
New committers and PMC members on Tez (or any other ASF project) who need the operational playbook nobody hands them.

The chapters are deliberately not aspirational. They are the mechanics — what email to send, what file to sign, what the [VOTE] thread template looks like, where the LICENSE and NOTICE rules are bright lines.

Reading Order

#	Chapter	Audience
1	Mailing Lists	Everyone
2	JIRA & Code Review	Contributors and committers
3	Committer Mindset	New committers, contributors who want to think like one
4	Release Voting	PMC and release managers
5	PMC Responsibilities	PMC members
6	Licensing	Everyone touching dependencies; PMC for releases
7	Code Style & Trust	All contributors

Chapters 1–3 and 6–7 are useful to contributors. Chapters 4–5 are PMC-facing but worth reading earlier to understand why committers behave the way they do at release time.

How This Section Differs From the Mindset Section

The Contributor Mindset section answered the question "how do I behave so my work gets accepted?" This section answers "what is the work being done by the people who accept it?" — the asymmetric view from the other side.

You don't need to be a committer to read this material. You need to internalise it before you become one, so the offer doesn't catch you off guard.

What This Section Is Not

This section is not:

A substitute for the ASF release distribution policy.
A substitute for ASF legal guidance on licensing.
A substitute for the Tez committer's onboarding email from the PMC.

It is a faithful, project-specific summary of what those documents and that onboarding actually contain, written so that a contributor can build accurate expectations and a new committer can move fast without surprises.

Prerequisites

Before this section is fully useful:

You have read the Contributor Mindset section.
You have a JIRA account at https://issues.apache.org/jira/.
You are subscribed to dev@tez.apache.org.
You have a local clone of Tez at ~/tez-src.

If you are a new Tez committer:

You have received your ASF ID (<id>@apache.org).
You have set up GPG (we'll cover this in Release Voting).
You are subscribed to private@tez.apache.org.

Validation for the Section

You have absorbed this section when you can:

Compose a [VOTE] thread email for an RC without consulting a template.
Read a LICENSE change in a patch and predict if it would block a release.
Explain why Tez is RTC (Review Then Commit) and not CTR (Commit Then Review).
Predict, before opening a JIRA, which committer will likely shepherd it.
Identify the category-A / category-B / category-X status of a dependency you want to add.
Run mvn apache-rat:check and read its output.

The next chapter — Mailing Lists — covers the operational mechanics of the ASF list system that this entire section relies on.

Mailing Lists

Mailing lists are the spine of Apache governance. Every decision that affects the project — design, release, new committer, security disclosure — happens on a mailing list, in an archived thread, with a documented vote when required. This chapter is the operational manual for the Tez lists.

The Tez Lists

List	Purpose	Subscribe	Notes
`dev@tez.apache.org`	Development discussion, design, votes	`dev-subscribe@tez.apache.org`	Primary list. Read first, post sparingly.
`user@tez.apache.org`	Usage questions	`user-subscribe@tez.apache.org`	Lower-traffic. Answer here if you can.
`commits@tez.apache.org`	Git commit notifications	`commits-subscribe@tez.apache.org`	Bot-driven. Subscribe to follow trunk live.
`issues@tez.apache.org`	JIRA event notifications	`issues-subscribe@tez.apache.org`	Bot-driven. Verbose; use a filter rule.
`private@tez.apache.org`	PMC-only	(Auto on PMC)	New-committer votes, security reports.

Archive: https://lists.apache.org/list.html?dev@tez.apache.org and equivalent for each list. Anything posted is public forever (except private@, which is archived but not public).

Subscribing

# From the address you want subscribed:
echo "" | mail -s "" dev-subscribe@tez.apache.org
# You will get a confirmation request. Reply to it.

For multiple lists, repeat. To unsubscribe, replace subscribe with unsubscribe.

Filtering

issues@tez.apache.org posts dozens of mails per day. Set a Gmail / Outlook / Thunderbird rule to file it into a folder. Same for commits@tez.apache.org if you subscribe.

For dev@, file by subject prefix:

Prefix	Folder
`[VOTE]`	`dev-vote` (read same-day)
`[ANNOUNCE]`	`dev-announce` (read same-day)
`[NOTICE]`	`dev-notice` (read same-day)
`[DISCUSS]`	`dev-discuss` (read within the week)
`[PROPOSAL]`	`dev-proposal` (read within the week)
(anything else)	`dev-misc`
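
The table above is effectively a first-match rule set. As a sketch, here is the same routing in code. DevListFilter is a hypothetical name, and matching anywhere in the subject (rather than only at the start) is an assumption made so that "Re:" replies land in the same folder as the original.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the dev@ filter table; folder names follow the table. */
class DevListFilter {
    // LinkedHashMap keeps the table order, so the first matching prefix wins.
    private static final Map<String, String> FOLDERS = new LinkedHashMap<>();
    static {
        FOLDERS.put("[VOTE]", "dev-vote");
        FOLDERS.put("[ANNOUNCE]", "dev-announce");
        FOLDERS.put("[NOTICE]", "dev-notice");
        FOLDERS.put("[DISCUSS]", "dev-discuss");
        FOLDERS.put("[PROPOSAL]", "dev-proposal");
    }

    /** Returns the folder for a subject line, or dev-misc when no prefix matches. */
    static String folderFor(String subject) {
        for (Map.Entry<String, String> rule : FOLDERS.entrySet()) {
            if (subject.contains(rule.getKey())) {
                return rule.getValue();
            }
        }
        return "dev-misc";
    }

    public static void main(String[] args) {
        System.out.println(folderFor("Re: [VOTE] Apache Tez 0.10.4 RC1"));
        System.out.println(folderFor("Question about shuffle timeouts"));
    }
}
```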

ASF Mailing-List Mores

Lists predate the web at Apache. The conventions are old and load-bearing.

Plain text only

HTML mail is dropped by some clients, breaks quoting, and bloats archives. Apache lists are plain-text. Configure your mail client:

Gmail web: in the compose window, More options (three-dot menu) → Plain text mode
Mutt / mu4e / aerc: already plain
Outlook: File → Options → Mail → Plain Text

Inline reply, not top-post

The Apache convention is to reply under the relevant quoted text, quoting only the part you're answering. Trim aggressively.

On Tue, May 7, 2024, Foo Bar wrote:
> Should we bump the default for tez.am.resource.memory.mb?

Yes, but conditionally. See the sizing sketch on TEZ-4321.

> And what about the AM heap?

Same patch; -Xmx is computed from -resource.memory.mb in the launch
command. We don't need a separate knob.

-- 
Jane

Top-posting, with your full reply at the top and the original quoted in full below, makes archive threads unreadable. People will gently note this once; do not require a second note.

No attachments

Patches go on JIRA. Logs and stack traces go in a gist or a pastebin and are linked. Long output goes as an attachment to the JIRA, not the email.

A 2 MB attachment forces hundreds of subscribers to download it. A link forces only the interested.

Sign off

A short sign-off — first name, or first + last — is conventional. No corporate signature block, no legal disclaimer, no "Sent from my iPhone."

If you must have a signature, use the standard -- \n separator (dash-dash-space) so mail clients can suppress it.

Subject hygiene

Subject prefixes are filterable. Use them.

Prefix	When
`[DISCUSS]`	Open question, no decision sought yet
`[PROPOSAL]`	Concrete proposal, comment wanted
`[VOTE]`	Vote in progress; body has voting rules
`[VOTE][RESULT]`	Closing a vote; tallies the result
`[ANNOUNCE]`	One-way announcement (release, new committer)
`[NOTICE]`	Infrastructure / policy change

Don't prefix replies. The Re: is enough; subscribers' filters key off the embedded [VOTE] already.

Reply-To etiquette

ASF lists are configured to set Reply-To: list. So your reply goes to the list by default. Don't break it by manually rewriting the To:.

If you want to reply privately to the sender (rare — use only for personal/off-topic), explicitly remove the list and address them.

`[VOTE]` Mechanics

ASF votes are the formal decision mechanism. They use a fixed +1 / 0 / -1 syntax, with +0 and -0 as optional shadings.

Voting tokens

Token	Meaning
`+1`	I approve.
`+0`	I'm slightly positive but won't block.
`0`	I have no opinion.
`-0`	I'm slightly negative but won't block.
`-1`	I disapprove.

On a code change, a -1 is a veto and a heavy tool. It must be accompanied by a technical justification; a -1 without one is invalid. Once a valid -1 is cast on a code change, the issue must be resolved (typically by revision) before the change proceeds. On a release vote a -1 is not a veto: releases pass by majority approval, subject to the minimums below.

Binding vs non-binding votes

Vote topic	Who is binding
Code change	Committers and PMC
Release artifact	PMC only
New committer	PMC only
New PMC member	PMC only
Project mechanics (board reports, etc.)	PMC only

Non-binding votes are welcomed and counted, but only the binding count determines the outcome.

Required minimums

For releases, the ASF rule:

72-hour minimum vote duration.
At least 3 binding +1 votes.
More binding +1 than binding -1 votes.

If those conditions aren't met by close, the vote is extended or fails. See Release Voting for the full mechanics.
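
The release minimums reduce to a three-part predicate. A minimal sketch, with hypothetical class and parameter names, counting only binding votes because only the binding count determines the outcome:

```java
/** Hypothetical tally check for a release vote, per the three minimums above. */
class ReleaseVoteTally {
    static final long MIN_HOURS_OPEN = 72;
    static final int MIN_BINDING_PLUS_ONES = 3;

    /**
     * Non-binding votes are deliberately not parameters: they are welcome on
     * the thread but do not change the result.
     */
    static boolean passes(int bindingPlusOnes, int bindingMinusOnes, long hoursOpen) {
        return hoursOpen >= MIN_HOURS_OPEN
            && bindingPlusOnes >= MIN_BINDING_PLUS_ONES
            && bindingPlusOnes > bindingMinusOnes;
    }

    public static void main(String[] args) {
        System.out.println("3 binding +1, 0 -1, 72h:  " + passes(3, 0, 72));
        System.out.println("2 binding +1, 0 -1, 100h: " + passes(2, 0, 100));
        System.out.println("5 binding +1, 0 -1, 48h:  " + passes(5, 0, 48));
    }
}
```

A false result does not by itself mean the vote failed. As the text says, the release manager may extend the window instead.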

For code changes in Tez (RTC project — see JIRA & Code Review):

Typically 1 binding +1 (a committer) is sufficient to commit, after review.
A -1 from any committer or PMC member blocks the commit pending resolution.

For new committers / PMC:

Run on private@.
Typically a 72-hour vote window.
Pass: more +1 than -1; common practice is at least 3 +1.

Lazy consensus

Many decisions don't require a vote. The mechanism is lazy consensus:

"I'm planning to do X. Speak up if you disagree; otherwise I'll proceed in 72 hours."

Used for things like cutting a branch, scheduling a release-vote window, or applying a trivial fix. The poster picks a reasonable window (24–72 hours). Silence = consent.

Lazy consensus is not for irreversible decisions (release, license change, PMC membership). Those require an explicit vote.

Composing a `[VOTE]` Email

Template — release vote (the full version is in Release Voting):

Subject: [VOTE] Apache Tez 0.10.4 RC1

Hi all,

I'd like to call a vote on releasing Apache Tez 0.10.4 RC1.

Source release:  https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-RC1/
Git tag:         release-0.10.4-rc1
Commit hash:     <full sha>
Staging Nexus:   https://repository.apache.org/content/repositories/orgapachetez-NNNN/

KEYS file:       https://downloads.apache.org/tez/KEYS
Signed with key: <your key id and fingerprint>

The vote will be open for 72 hours.

[ ] +1 Release this package
[ ]  0 No opinion
[ ] -1 Do not release because ...

My +1.

Thanks,
<First>

Template — new committer (run on private@):

Subject: [VOTE] New Tez committer: <First Last>

Hi PMC,

I'd like to propose <First Last> as a new committer on Apache Tez.

<First Last> has been contributing since <month year> and has had
<N> patches committed, spanning <areas>. Highlights:

  - TEZ-NNNN: <one line>
  - TEZ-NNNN: <one line>
  - Active reviewer on TEZ-XXX, TEZ-YYY.

They've shown <judgement / quality / breadth>.

Vote open for 72 hours.

[ ] +1
[ ]  0
[ ] -1

My +1.

Thanks,
<First>

Template — closing a vote:

Subject: [VOTE][RESULT] Apache Tez 0.10.4 RC1

Hi all,

The vote on Apache Tez 0.10.4 RC1 has passed with the following tally:

Binding +1: <list of names>
Non-binding +1: <list of names>
0: <names if any>
-1: <names if any, with reasons>

I'll proceed with the release steps.

Thanks to everyone who voted.

<First>

Lazy Consensus Examples

Good lazy-consensus posts:

"I'm cutting branch branch-0.10.5 from current master tomorrow at 12:00 UTC unless there's objection."
"Planning to apply TEZ-4321 (one-line log fix, trivial) by end of week unless someone flags it. Patch is .001 on JIRA."
"Will cancel the 0.10.5 RC1 vote and roll RC2 tomorrow due to the LICENSE finding."

Bad lazy-consensus posts:

"Going to release 0.10.5 next week." (Requires a [VOTE].)
"Going to add NAME as committer." (Requires a [VOTE] on private@.)
"Going to remove the deprecated key X." (User-visible behavior; requires [DISCUSS] → consensus.)

When You're New on the List

The first month of reading a list:

Read every [VOTE] thread.
Read every [DISCUSS] thread.
Skim [jira] [Created] mails.
Post nothing initially.

After the first month:

Reply to a user@ question you can answer.
Post a self-introduction (see Community Interaction).
Comment on a [DISCUSS] thread once you have substance.

Validation Artifacts

After this chapter:

Subscriptions confirmed to dev@, user@, and (if your mail client tolerates the volume) issues@.
Mail-client filters configured for the subject prefixes table.
A ~/tez-notes/vote-templates.md containing the three templates above.
The reflex to inline-reply, not top-post.
One archived [VOTE] thread URL bookmarked for reference.

The next chapter — JIRA & Code Review — is the operational view of what code review looks like from the committer side of the table.

JIRA & Code Review — Inside a Tez Review

This chapter is the committer view of code review. Read it as a contributor, and your patches will become reviewable. Read it as a new committer, and you'll have a workflow.

Tez is RTC

Apache projects choose between two commit philosophies:

Model	Meaning	Used by
RTC (Review Then Commit)	Patch must be reviewed and `+1`'d before commit	Tez, Hive, Hadoop (for most code)
CTR (Commit Then Review)	Committer may commit and discuss after	Some smaller projects, certain Hadoop subsystems

Tez is RTC. The implication: every commit went through at least one review round. A patch sitting at "Patch Available" with no review is blocked on reviewer attention, not on its own quality: the committer pool is finite.

The RTC exception: trivial fixes (typos, log message edits, javadoc improvements) may be committed by a committer without an explicit +1, but the commit message references the JIRA and the patch is still attached for the record.

How a Committer Reads a Patch

When a committer opens your patch (in JIRA, GitHub PR, or git apply locally), the sequence is roughly:

1. Read the JIRA description.              30s
2. git apply --check on a fresh clone.     30s
3. Look at git diff --stat.                30s
4. Read the test changes.                  2-5 min
5. Read the implementation changes.        5-15 min
6. Run mvn checkstyle:check.               30s
7. Run mvn test in changed modules.        2-15 min
8. Optionally: run an integration test.    5-30 min
9. Comment.                                Variable

The first three steps determine whether the patch gets the full read or a bounce. If the JIRA is unclear or the diff doesn't apply or includes unrelated changes, the patch goes back without step 5.

The Skim Phase

A committer skimming git diff --stat is looking for:

File count and module spread. A patch touching one module is easy; one touching five is suspicious.
Tests in the diff. No tests in a behavior-changing patch is a red flag.
Generated files in the diff. target/, *.iml, .idea/ — never committed.
Whitespace-only churn. git diff -w should not be vastly smaller than git diff.

If any of these are off, expect a comment before the implementation is read.
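
The generated-files check is easy to run on your own patch before a reviewer does. A small sketch follows. DiffSkim is a hypothetical helper, and its input is assumed to be the path list printed by git diff --name-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-submission check for build and IDE files in a diff. */
class DiffSkim {
    /** True for the three kinds of generated files named in the skim list. */
    static boolean isGenerated(String path) {
        return path.startsWith("target/") || path.contains("/target/")
            || path.startsWith(".idea/") || path.contains("/.idea/")
            || path.endsWith(".iml");
    }

    /** Returns the subset of changed paths that should never be committed. */
    static List<String> generatedFiles(List<String> changedPaths) {
        List<String> offenders = new ArrayList<>();
        for (String path : changedPaths) {
            if (isGenerated(path)) {
                offenders.add(path);
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        List<String> changed = Arrays.asList(
            "tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java",
            "tez-dag/target/classes/VertexImpl.class",
            "tez-dag/tez-dag.iml");
        System.out.println("remove before attaching: " + generatedFiles(changed));
    }
}
```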

The Test Phase

Committers read tests before implementation because the test reveals intent. A good test named testRecoverNoInputs tells the reviewer:

The bug is in recovery.
The trigger is "no inputs."
The fix should not break recovery in any other case.

If the test is missing, weak (no assertions, or assertions that would pass without the fix), or named generically (testMethod1), the reviewer assumes the implementation is also weak.

The Implementation Phase

By the time the reviewer reads the code, they have a mental model from the JIRA, the test name, and the diff stat. The implementation read is checking:

Does the code match the intent of the JIRA and test?
Is the change minimal — does it touch what it must, and only what it must?
Are exceptions handled appropriately for the file's conventions?
Is logging at the right level (DEBUG for hot paths, INFO for state transitions, WARN for recoverable, ERROR for unrecoverable)?
Are there obvious thread-safety issues (state visible across threads, shared mutable collections)?
Are there back-compat concerns? (See Compatibility)

Comment Phrasing

Committer comments follow soft conventions that contributors should recognise — they encode meaning beyond the literal text.

Comment	Means
"Nit: ..."	Stylistic preference; you may take it or push back without controversy.
"Suggestion: ..."	Reviewer thinks there's a better way but isn't blocking.
"Concern: ..."	Reviewer wants this addressed before commit.
"I don't think this is right."	Block; must be resolved.
"Have you considered X?"	Genuine question; respond with your reasoning.
"Let's discuss on dev@."	Issue is bigger than the patch; design discussion needed.
"+1 LGTM"	Approval (informal).
"+1 pending checkstyle"	Conditional approval.
"-1, see ..."	Veto; must be resolved before commit.

For the reciprocal etiquette on responses, see Responding to Feedback: acknowledge every comment explicitly, fix what's fixable, push back with evidence on what's not.

Patch Available → Reviewed Lifecycle

The JIRA state transitions for a typical patch:

Open
 |  (contributor starts)
 v
In Progress
 |  (contributor attaches .001)
 v
Patch Available  ← reviewer reads here
 |  (review comments)
 v
In Progress  ← contributor revises
 |  (attaches .002)
 v
Patch Available
 |  (LGTM)
 v
Resolved (committer commits to trunk)
 |  (release ships)
 v
Closed

The patch attachments accumulate: .001, .002, .003. They are never deleted. Future readers can reconstruct the review by walking through them.

GitHub-PR-based reviews follow the same lifecycle, but the iteration happens in the PR's commit history rather than separate .NNN.patch files. The JIRA still moves through the states above.

Backport Patches

A patch may need to land on multiple branches (e.g. master and branch-0.10). The contributor attaches both:

TEZ-4321.001.patch                 (for master)
TEZ-4321.branch-0.10.001.patch     (for the maintenance branch)

The committer reviews and commits each. The JIRA comment notes the commits:

Committed to master: <sha>
Committed to branch-0.10: <sha>
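
The attachment names encode the JIRA ID, the target branch, and the revision. The sketch below parses that convention as inferred from the two file names above. PatchName is a hypothetical helper, and treating a missing branch segment as master is an assumption drawn from this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the patch-naming convention shown above. */
class PatchName {
    // JIRA id, optional "branch-..." segment, three-digit revision, ".patch".
    private static final Pattern NAME =
        Pattern.compile("^(TEZ-\\d+)\\.(?:(branch-[\\w.]+)\\.)?(\\d{3})\\.patch$");

    /** Returns {jiraId, targetBranch, revision}. */
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected patch name: " + fileName);
        }
        // Assumption: no branch segment means the patch targets master.
        String branch = m.group(2) == null ? "master" : m.group(2);
        return new String[] {m.group(1), branch, m.group(3)};
    }

    public static void main(String[] args) {
        for (String name : new String[] {
                "TEZ-4321.001.patch", "TEZ-4321.branch-0.10.001.patch"}) {
            String[] parts = parse(name);
            System.out.println(parts[0] + " rev " + parts[2] + " -> " + parts[1]);
        }
    }
}
```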

The Committer's Pre-Commit Checklist

A committer about to commit runs:

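cd ~/tez-src
# Start from an up-to-date master; --ff-only refuses to create a merge commit
git fetch origin
git checkout master
git merge --ff-only origin/master
# Dry-run the patch first, then apply it for real
git apply --check /tmp/TEZ-4321.003.patch
git apply /tmp/TEZ-4321.003.patch

# Build, style check, then tests
mvn install -DskipTests
mvn checkstyle:check
mvn test -pl tez-dag,tez-api      # changed modules
mvn test -pl tez-tests -Dtest=TestOrderedWordCount

# Stage the applied patch (confirm with git status that nothing stray is included)
git add -A
git commit -s -m "TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gunther)"
git push origin master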

Notes on the commit step:

-s adds a Signed-off-by: trailer. Tez doesn't require DCO sign-off (ASF contributions are covered by the Apache License and the ICLA), so the flag is optional.
The (Jane Doe via gunther) suffix is added by the committer, not the contributor.
The push goes to apache/tez (committer karma required).
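
The commit subject follows a fixed shape: JIRA ID, summary, then contributor and committer in parentheses. Below is a sketch of building and checking it. CommitMessage is a hypothetical helper, and the pattern is inferred from the example message in this chapter, not from any hook Tez enforces.

```java
import java.util.regex.Pattern;

/** Hypothetical builder and checker for the commit-subject convention. */
class CommitMessage {
    // Shape inferred from the example: "TEZ-NNNN: summary (Contributor via committer)".
    private static final Pattern SHAPE =
        Pattern.compile("^TEZ-\\d+: .+ \\(.+ via \\S+\\)$");

    /** Builds the subject; the committer id is appended by the committer. */
    static String format(String jiraId, String summary, String contributor, String committer) {
        return jiraId + ": " + summary + " (" + contributor + " via " + committer + ")";
    }

    /** True when a subject line matches the inferred shape. */
    static boolean isWellFormed(String subject) {
        return SHAPE.matcher(subject).matches();
    }

    public static void main(String[] args) {
        String subject = format("TEZ-4321",
            "Fix NPE in VertexImpl.recover when no inputs.", "Jane Doe", "gunther");
        System.out.println(subject);
        System.out.println("well-formed: " + isWellFormed(subject));
    }
}
```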

After push:

1. Update JIRA: status → Resolved, set Fix Version (e.g. "0.10.5").
2. Comment on JIRA with the commit SHA.
3. Thank the contributor.

Holding the "No" Muscle

A subtle and underappreciated committer skill is declining patches that shouldn't go in. A patch can be technically correct and still not belong in trunk — too narrow a use case, too much added complexity, the wrong layer.

Wording for a respectful decline:

Thanks for the patch. After reading, I'm not comfortable taking this in trunk because REASON. I appreciate the work, and I'd encourage ALTERNATIVE-PATH. Closing the JIRA as Won't Fix; if there's broader consensus on dev@ for a different approach, happy to reopen.

The "no" muscle is not natural. Committers learn it because the alternative — accepting every patch — accumulates technical debt that the committer pool will pay forever. See Committer Mindset.

When to Refactor Unsolicited Code in a Patch

A contributor's patch sometimes lands in a corner of the code the committer would like to clean up. The temptation is to do the cleanup at commit time. Don't.

The rules:

Never modify the contributor's diff at commit. The patch attached to JIRA must match what was reviewed.
File a follow-up JIRA for the cleanup. CC the contributor.
If the patch creates a refactoring opportunity, take it later. Not in this commit.

The exception: trivial cleanups the contributor agreed to in review may be applied at commit. The JIRA comment notes them. Example:

Committed with a small change: extracted the new logic into a private
helper method as discussed in review. Attaching the committed patch
as .004 for the record.

What Goes On the GitHub PR vs. JIRA

Tez accepts patches as JIRA attachments and as GitHub PRs (linked from the JIRA). The mapping:

Lives on	What
JIRA	Description, design discussion, root cause, attachments (`.NNN.patch`), final commit reference
GitHub PR (if used)	Line-by-line comments, CI run results, iterative push history

A PR without a linked JIRA is incomplete; the JIRA is the system of record. A JIRA without a PR is fine — many Tez patches are still JIRA-attachment-only.

If you open a PR, link it on the JIRA in the first comment and set the JIRA to "Patch Available."

Worked Example — A Full Review Cycle

JIRA: TEZ-4321. "Fix NPE in VertexImpl.recover when no inputs."

Day 0   You: file JIRA with description, repro, root cause.
        Set yourself as assignee, status In Progress.
        Attach .001 patch; status → Patch Available.

Day 3   Committer @gopalv: applies patch locally, runs tests, reviews.
        Comments on JIRA:
          - L88: prefer Collections.emptyList().
          - L92: add a test for the no-inputs case.
          - L94: should we handle no-outputs symmetrically? Concern: see
            VertexImpl.recover at L142, looks like the same shape.

Day 4   You: reply on JIRA.
          - L88: agreed.
          - L92: agreed; adding testRecoverNoInputs.
          - L94: I see the parallel but think it's a separate JIRA.
            Filing TEZ-4329 to track.
        Attach .002.

Day 7   @gopalv: re-reviews. "+1 LGTM."

Day 8   @gopalv: commits.
        "TEZ-4321: Fix NPE in VertexImpl.recover when no inputs. (Jane Doe via gopalv)"
        Sets JIRA Resolved, Fix Version 0.10.5.

Day 8   You: comment "Thanks @gopalv. Working on TEZ-4329 next."

That is a healthy review — 2 patch rounds, 1 follow-up JIRA filed, no friction.

Validation Artifacts

After this chapter:

A ~/tez-notes/reviewer-vocab.md cheatsheet from the comment-phrasing table.
The committer's pre-commit checklist, saved for when you are one.
The discipline to never modify a contributor's diff at commit (with an exception only for explicit reviewer-author agreement).
The reflex to comment "Thanks @COMMITTER" after a merge of your patch.

The next chapter — Committer Mindset — takes the perspective further: the judgement model committers use across many patches and many years.

Committer Mindset

Becoming a committer is a one-day event. Thinking like one is a multi-year practice. This chapter sketches the practice: the asymmetries, the recurring trade-offs, and the mental model that distinguishes "writes good patches" from "stewards the codebase."

The Long-Lived Code Tax

A contributor writes a patch and leaves. A committer commits a patch and inherits it forever. Every line a committer approves is theirs to debug at 11pm three years later when it breaks in production.

Practical consequence: the committer's "yes" is a much heavier word than the contributor's "this would be nice." Committers reflexively ask:

Question	Why
Who will maintain this in 2 years?	Code without a maintainer becomes everyone's problem
Is the complexity proportional to the value?	Complex code is paid for in every future bug
Does this make `tez-dag` harder to onboard into?	Onboarding cost is real
What's the failure mode at 10x scale?	Tez runs in production clusters at scale
Does this lock us into a design we'll regret?	API and proto changes are forever

These are not abstractions. Every committer has at least one patch they regret approving. That memory is the source of the "no" muscle.

Reasoning About Compatibility

The compatibility surface is exhaustively documented in Compatibility. The mindset around it:

Default to backwards-compat. A change that breaks no one is always preferable to one that breaks anyone, even if uglier.
A deprecation is a promise. If you deprecate a method "to be removed in 0.12," it had better be removable in 0.12 — which means no production user can still be on it by then, which means the deprecation window has to be long enough to drain.
Wire compat is not negotiable. A DAGPlan change that breaks recovery from an old AM means a cluster can't roll-restart safely. That's a P0 production issue.
Configuration compatibility is silent until it isn't. Renaming a key without a deprecation alias breaks every cluster that has the old key in tez-site.xml. Reviewers will catch this if they're paying attention; committers must always pay attention.

The mental model: imagine you are the SRE on call at 1 AM at a Fortune 500 company that runs Tez via Hive. What does this patch do to your night?

Reasoning About Performance

Tez runs in the hot path of Hive on terabyte-scale workloads. A 5 ms regression in a per-task code path is real money. The mindset:

Measure, don't guess. A patch claiming performance benefit needs numbers, not intuition. A patch claiming no performance impact in a hot path still needs a check.
Hot vs. cold paths. Optimisations matter in tez-runtime-library and the per-task paths of tez-runtime-internals. They matter much less in tez-dag AM startup code that runs once per DAG.
GC is performance. A patch that allocates an extra object per task adds GC pressure at scale. Reuse buffers; use primitives; bound queues.
Logging is performance. LOG.debug("..." + obj) allocates the string even when DEBUG is off. Use LOG.debug("... {}", obj) instead.
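
The logging point can be demonstrated without a logging library. In the sketch below, StubLogger is a deliberately minimal stand-in written for this example, not the SLF4J API. It only mimics the one property that matters here: a parameterized call defers formatting until after the level check.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustration only: counts toString() calls made while DEBUG is disabled. */
class LazyLoggingDemo {
    /** Stands in for an object that is expensive to render as a string. */
    static final class Costly {
        final AtomicInteger toStringCalls = new AtomicInteger();
        @Override
        public String toString() {
            toStringCalls.incrementAndGet();
            return "Costly";
        }
    }

    /** Minimal stand-in logger; NOT the real SLF4J Logger. */
    static final class StubLogger {
        private final boolean debugEnabled;
        StubLogger(boolean debugEnabled) { this.debugEnabled = debugEnabled; }

        // Eager style: the caller has already built the full message.
        void debug(String message) {
            if (debugEnabled) System.out.println(message);
        }

        // Parameterized style: the argument is rendered only if DEBUG is on.
        void debug(String template, Object arg) {
            if (debugEnabled) System.out.println(template.replace("{}", String.valueOf(arg)));
        }
    }

    static int eagerToStringCalls() {
        Costly obj = new Costly();
        new StubLogger(false).debug("state: " + obj); // concatenation runs first
        return obj.toStringCalls.get();
    }

    static int deferredToStringCalls() {
        Costly obj = new Costly();
        new StubLogger(false).debug("state: {}", obj); // nothing is rendered
        return obj.toStringCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("eager toString calls:    " + eagerToStringCalls());
        System.out.println("deferred toString calls: " + deferredToStringCalls());
    }
}
```

With DEBUG off, the concatenating call still renders the object once per log statement, while the parameterized call renders nothing. In a per-record path that difference is multiplied by the record count.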

The committer reading a patch in a hot path keeps these questions ready:

Does this allocate per-record? Per-batch? Per-DAG?
Is the allocation reusable / poolable?
Is the log statement guarded or formatted?
Has the contributor said how this performs at scale?

Reasoning About Complexity

Complexity has a half-life of bugs. The reviewing committer's complexity check:

Complexity addition	What it costs
A new abstract base class	A new mental model for readers
A new configuration key	Documentation, default-tuning, deprecation later
A new state in a state machine	Combinatorial new transitions to test
A new event type	New event dispatcher cases, new history entries
A new public method	Compatibility commitment
A new dependency	Licensing review, attack surface, build complexity

A patch that adds, say, a new configuration key for a corner-case behavior is not trivially "yes" even if the code is correct. The cost of the key — documentation, tuning, eventual deprecation — must justify the value.

The reflexive committer question: "Could this be a default, with no key?" If the answer is yes, skip the key.

Reasoning About Risk

Different code paths carry different risk profiles:

Path	Risk
`tez-tools/`	Low. Process tooling; broken doesn't affect runtime.
`tez-mapreduce/`	Medium. Affects MR-on-Tez users; relatively well-tested.
`tez-runtime-library/`	High. In the per-task hot path.
`tez-runtime-internals/`	High. Task runtime; affects every DAG.
`tez-dag/` AM scheduling	High. AM bugs lose work.
`tez-dag/` DAG planning	Very high. Errors are bad DAGs.
`tez-api/`	Very high. Public API; breaking it breaks downstream projects.
`tez-api/src/main/proto/`	Critical. Wire format; cluster-rolling-restart implications.

Committers calibrate review depth to risk. A 50-line patch in tez-tools/ may get a quick read and +1. A 50-line patch in tez-api/src/main/proto/ gets word-by-word scrutiny, a [DISCUSS] thread, and possibly a -1 if the protobuf change is anything other than additive.
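
The table can be turned into a lookup you run over the paths in a patch. A sketch with hypothetical names (RiskTier, forPath) follows. Because a path alone cannot separate AM scheduling from DAG planning inside tez-dag/, the sketch assigns the lower of the two tiers there and treats it as a floor.

```java
/** Hypothetical path-to-risk lookup based on the table above. */
class RiskTier {
    enum Risk { UNKNOWN, LOW, MEDIUM, HIGH, VERY_HIGH, CRITICAL }

    static Risk forPath(String path) {
        // Most specific prefix first: the proto directory sits under tez-api/.
        if (path.startsWith("tez-api/src/main/proto/")) return Risk.CRITICAL;
        if (path.startsWith("tez-api/")) return Risk.VERY_HIGH;
        // Floor only: DAG planning code in tez-dag/ is VERY_HIGH per the table.
        if (path.startsWith("tez-dag/")) return Risk.HIGH;
        if (path.startsWith("tez-runtime-library/")) return Risk.HIGH;
        if (path.startsWith("tez-runtime-internals/")) return Risk.HIGH;
        if (path.startsWith("tez-mapreduce/")) return Risk.MEDIUM;
        if (path.startsWith("tez-tools/")) return Risk.LOW;
        // Modules the table does not list are left unclassified, not guessed.
        return Risk.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(forPath("tez-api/src/main/proto/DAGApiRecords.proto"));
        System.out.println(forPath("tez-tools/analyzers/README"));
    }
}
```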

The "No" Muscle — When and How

The hardest committer skill is saying no. Not no-by-silence (the default and worst form), but explicit, kind, decisive no. Patterns for when to use it:

Situation	How to say no
Patch fixes a real but rare bug at the cost of significant new complexity	"Let's not fix this in code; document the workaround and close as Won't Fix."
Patch adds a feature with one user (the contributor)	"Could you maintain this as an out-of-tree plugin? `VertexManagerPlugin` exists for this."
Patch is technically correct but encodes a design that conflicts with planned direction	"We're going a different way on dev@ thread XYZ; let's wait."
Patch is correct but vastly over-scoped	"Could you split into 3 JIRAs? Happy to commit them one at a time."
Patch is correct but in a part of the codebase being rewritten	"Let's wait for TEZ-NNNN to land first; this conflicts."

The crucial thing about saying no: do it early, explicitly, and once. Don't ghost the patch. The contributor's time is worth your one paragraph of explanation.

When to Refactor Unsolicited

A patch lands in a part of the codebase the committer has been wanting to refactor. The temptation is to do the refactor in or alongside the commit. Don't, except in narrow cases.

The rules:

Refactor neither in the contributor's patch nor in the same commit. Their patch must match what was reviewed.
File a follow-up JIRA for the refactor. CC the contributor; they often have context.
Do the refactor in a separate review cycle. Either you do it (review by someone else) or someone else does it (review by you).
Exception: If the contributor's patch sits in code that is literally being moved or removed by an imminent committed patch, coordinate. Either delay the contributor's patch or rebase the imminent one.

Mentoring Pattern

A committer's leverage is not just commits — it's mentoring. The well-trodden Apache mentoring pattern:

Notice a thoughtful new contributor. Their first patch was clean; they responded well to feedback; they asked good questions on dev@.
Suggest a JIRA in your area. Comment on a JIRA: "This would be a good fit for NAME based on their recent work on TEZ-XXXX."
Shepherd it. Review their patch yourself, fast. Set expectations on iteration count.
Make them visible. Refer to their work on dev@. Cite them in commits as you would any contributor.
Eventually propose them. When they hit the rough bar from Meritocracy, propose them on private@.

A committer who has mentored two or three contributors into committership has done more for the project than one who has committed thousands of patches.

Time Allocation

Newly-minted committers underestimate how time-consuming the role is. A rough budget for sustained committership:

Activity	Weekly time
Reviewing patches	2–4 hours
Filing or shepherding your own patches	2–4 hours
`dev@` discussion participation	1–2 hours
JIRA triage (closing dups, asking for repros)	0.5–1 hour
Mentoring	0.5–1 hour
Release work (during release windows)	4–8 hours

A committer who spends 0.5 hours/week on the project will be reactive at best and become inactive within a year. A committer who spends 4+ hours/week stewards the codebase.

Avoiding Burnout

The committer pool at any Apache project is finite. Burnout is a real failure mode:

Burnout signal	Self-rescue
Reviewing patches feels like a chore	Take a 2-week formal break; tell `dev@`
You're saying yes to patches you don't believe in	Practice saying no
You're the only reviewer for an area	Mentor someone into co-reviewing
You're sleeping less because of a release window	Ask the PMC to split the RM duties
You haven't filed a JIRA you cared about in months	Stop reviewing for a week; write

Committership is voluntary. Stepping back is honourable. Emeritus committer status exists at Apache for those who want a graceful exit; you can come back later.

Validation Artifacts

After this chapter:

A ~/tez-notes/committer-questions.md of the five recurring questions a committer asks of every patch.
The discipline to score each Tez file path you touch by risk tier.
The vocabulary to say no, in writing, with no rancour.
The plan to do mentoring at some point in your committer life.

The next chapter — Release Voting — is the operational manual for the most visible PMC-level work: cutting a release.

Release Voting

Cutting an Apache Tez release is a procedural, legal, and cryptographic operation. It is the most formal thing the PMC does. This chapter is the operational manual: the steps, the artifacts, the vote thread, and the failure modes.

The authoritative reference is the ASF Release Distribution Policy. This chapter is the Tez-specific overlay on top of it.

What "Release" Means at Apache

An Apache release has a precise legal meaning. Only source artifacts are official Apache releases. Binary artifacts (jars in Maven Central, Docker images) are convenience artifacts that the PMC may publish but that are not the legal release.

Practical consequence: every vote is a vote on the source release. Binaries derive from it.

Release Artifacts

A Tez release consists of:

Artifact	Where	Format
Source tarball	`dist.apache.org`	`apache-tez-X.Y.Z-src.tar.gz`
ASCII-armored signature	`dist.apache.org`	`apache-tez-X.Y.Z-src.tar.gz.asc`
SHA-512 checksum	`dist.apache.org`	`apache-tez-X.Y.Z-src.tar.gz.sha512`
(Optional) binary tarball	`dist.apache.org`	`apache-tez-X.Y.Z-bin.tar.gz` plus `.asc` and `.sha512`
Staged Maven jars	`repository.apache.org` (Nexus)	Standard Maven layout
Git tag	`apache/tez`	`release-X.Y.Z-rcN` then `release-X.Y.Z`

Notes:

MD5 and SHA-1 are forbidden for release checksums (ASF policy since 2019). Use SHA-512 (preferred) or SHA-256.
The signature must be ASCII-armored (.asc), not binary.
The signing key must be listed in the project KEYS file at https://downloads.apache.org/tez/KEYS, and the public key must also be published on a public keyserver.
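
To make the checksum note concrete, here is a minimal sketch of producing and checking a .sha512 file with Python's standard library. It assumes the two-space `<hex>  <filename>` layout that `sha512sum` writes; the function names are illustrative and not part of any Tez tooling.

```python
import hashlib
from pathlib import Path


def sha512_hex(path, chunk_size=1 << 20):
    """Hash the file in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(path):
    """Write '<hex>  <filename>' next to the artifact, as sha512sum does."""
    path = Path(path)
    out = path.with_name(path.name + ".sha512")
    out.write_text(f"{sha512_hex(path)}  {path.name}\n")
    return out


def verify_checksum(path):
    """Return True if the artifact matches its .sha512 companion file."""
    path = Path(path)
    recorded = path.with_name(path.name + ".sha512").read_text().split()[0]
    return recorded.lower() == sha512_hex(path)
```

A checksum only detects corruption; it says nothing about who produced the file, which is why the signature check remains a separate step.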

Prerequisites — One-Time PMC Setup

Before you can act as release manager (RM), complete this one-time setup:

# 1. Generate a GPG key (4096-bit RSA).
gpg --full-generate-key

# 2. Submit the public key to keyservers.
gpg --send-keys <KEY_ID>

# 3. Add your key to the Tez KEYS file.
svn co https://dist.apache.org/repos/dist/release/tez tez-dist-release
cd tez-dist-release
(gpg --list-sigs <KEY_ID> && gpg --armor --export <KEY_ID>) >> KEYS
svn commit KEYS -m "Add <Your Name>'s release-signing key"

# 4. Verify it lands at:
#    https://downloads.apache.org/tez/KEYS

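Nexus staging access needs a server entry in ~/.m2/settings.xml:

# Create ~/.m2/settings.xml if it does not exist yet. If it already exists,
# merge the <server> block into its <servers> section by hand; appending a
# second <settings> root would make the file invalid XML.
# Prefer an encrypted password (mvn --encrypt-password) over plain text.
[ -e ~/.m2/settings.xml ] || cat > ~/.m2/settings.xml <<'EOF'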
<settings>
  <servers>
    <server>
      <id>apache.releases.https</id>
      <username>YOUR_APACHE_ID</username>
      <password>YOUR_APACHE_LDAP_PASSWORD</password>
    </server>
  </servers>
</settings>
EOF

The Release Cut

Roughly the sequence the release manager runs:

cd ~/tez-src
git fetch origin

# 1. Cut a release branch from master, or check out the existing
#    maintenance branch. Run only one of the two checkouts.
git checkout -b branch-0.10.4 origin/master    # new release branch
# or
git checkout branch-0.10                       # existing maintenance branch

# 2. Update version.
mvn versions:set -DnewVersion=0.10.4
git commit -am "Setting version to 0.10.4 for release"
git tag release-0.10.4-rc1
git push origin HEAD                           # pushes whichever branch step 1 chose
git push origin release-0.10.4-rc1

# 3. Build everything; tests must pass.
mvn clean install
mvn apache-rat:check

# 4. Build source tarball.
mvn clean package -Pdist,docs,src -DskipTests
ls tez-dist/target/                       # apache-tez-0.10.4-src.tar.gz

# 5. Sign and checksum (run inside the directory that holds the tarball).
cd tez-dist/target
gpg --armor --output apache-tez-0.10.4-src.tar.gz.asc \
    --detach-sign apache-tez-0.10.4-src.tar.gz
sha512sum apache-tez-0.10.4-src.tar.gz > apache-tez-0.10.4-src.tar.gz.sha512

# 6. Stage to dist.apache.org/dev.
svn co https://dist.apache.org/repos/dist/dev/tez ~/tez-dev
mkdir ~/tez-dev/tez-0.10.4-RC1
cp apache-tez-0.10.4-src.tar.gz* ~/tez-dev/tez-0.10.4-RC1/
cd ~/tez-dev
svn add tez-0.10.4-RC1
svn commit -m "Apache Tez 0.10.4 RC1"

# 7. Stage Maven artifacts (back in the source checkout).
cd ~/tez-src
mvn clean deploy -Papache-release -DskipTests
#    Then on https://repository.apache.org, log in, find your
#    staging repo (orgapachetez-NNNN), "Close" it.

The exact Maven profiles differ across Tez versions; check ~/tez-src/RELEASING.txt and the release notes for the prior release for the recipe in use.
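
The recipe repeats the same version and RC number in several names, and a typo in any one of them costs a respin. A small helper, sketched below, derives them from one place. The naming patterns are taken from the commands above; the class and its methods are hypothetical and not part of Tez.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseCandidate:
    version: str  # for example "0.10.4"
    rc: int       # 1 for the first candidate

    @property
    def rc_tag(self):
        return f"release-{self.version}-rc{self.rc}"

    @property
    def final_tag(self):
        return f"release-{self.version}"

    @property
    def staging_dir(self):
        return f"tez-{self.version}-RC{self.rc}"

    @property
    def source_artifacts(self):
        base = f"apache-tez-{self.version}-src.tar.gz"
        return [base, base + ".asc", base + ".sha512"]

    def respin(self):
        """Return the next candidate for the same version."""
        return ReleaseCandidate(self.version, self.rc + 1)


rc1 = ReleaseCandidate("0.10.4", 1)
print(rc1.rc_tag, rc1.staging_dir, rc1.source_artifacts)
```

Printing these once and pasting them into both the shell session and the vote email keeps the tag, staging directory, and artifact names consistent.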

The `[VOTE]` Email

After staging, you send the vote. The template:

Subject: [VOTE] Apache Tez 0.10.4 RC1

Hi all,

I'd like to call a vote on releasing Apache Tez 0.10.4 RC1.

Notable changes since 0.10.3:
  - TEZ-NNNN: <one line>
  - TEZ-MMMM: <one line>
  - <N> additional fixes; see CHANGES.txt for the full list.

Source release:
  https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-RC1/

The release was signed with key:
  <KEY_ID>  <fingerprint>

KEYS file:
  https://downloads.apache.org/tez/KEYS

Git tag:        release-0.10.4-rc1
Git commit:     <full 40-char sha>

Staging repository for Maven:
  https://repository.apache.org/content/repositories/orgapachetez-NNNN/

The vote will be open for at least 72 hours.

Please verify and vote:

  [ ] +1 Release this package
  [ ]  0 No opinion
  [ ] -1 Do not release this package because ...

Verification steps (https://www.apache.org/info/verification.html):
  - Download src.tar.gz, .asc, .sha512.
  - Verify SHA512: sha512sum -c apache-tez-0.10.4-src.tar.gz.sha512
  - Verify signature:
      gpg --import KEYS
      gpg --verify apache-tez-0.10.4-src.tar.gz.asc apache-tez-0.10.4-src.tar.gz
  - Untar; check LICENSE and NOTICE.
  - Build: mvn clean install -DskipTests

My +1.

Thanks,
<First Last>

Send to dev@tez.apache.org, using the subject line shown at the top of the template.
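
Because the 72-hour window is counted from the moment the email is sent, it helps to compute the earliest closing time once and quote it in UTC. The sketch below uses only the standard library; the send time is an illustrative value.

```python
from datetime import datetime, timedelta, timezone

MIN_VOTE_WINDOW = timedelta(hours=72)


def earliest_close(sent_at):
    """Earliest UTC time at which a vote sent at `sent_at` may be closed."""
    if sent_at.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    return sent_at.astimezone(timezone.utc) + MIN_VOTE_WINDOW


sent = datetime(2024, 3, 1, 15, 30, tzinfo=timezone.utc)  # illustrative
print(earliest_close(sent).isoformat())
```

Quoting the UTC deadline in the email avoids arguments about time zones when voters are spread across continents.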

What Voters Verify

A binding +1 is not just trust. It carries a check. PMC voters typically:

Check	Command / location
Source artifact downloads	`wget` from `dist.apache.org/repos/dist/dev/tez/...`
Signature is valid and from a Tez committer	`gpg --verify` against `KEYS` file
SHA-512 matches	`sha512sum -c`
`LICENSE` is correct and current	Read it
`NOTICE` reflects bundled third-party	Read it; cross-check against `LICENSE`
`DISCLAIMER` present if incubating (not for Tez since 2014)	Check
No binary files in source tree	`find apache-tez-X.Y.Z-src -type f \( -name '*.jar' -o -name '*.class' \)`
Apache RAT clean	`mvn apache-rat:check`
Builds clean	`mvn clean install -DskipTests`
Tests pass (optional but valued)	`mvn test`
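
The binary-file check in the table can also be scripted for voters who keep a verification helper. This is a minimal sketch, assuming the tarball has already been extracted; the function name is illustrative.

```python
import os

BINARY_SUFFIXES = (".jar", ".class")


def find_binaries(root):
    """Return sorted relative paths of .jar and .class files under `root`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(BINARY_SUFFIXES):
                full = os.path.join(dirpath, name)
                hits.append(os.path.relpath(full, root))
    return sorted(hits)
```

An empty list is the expected result for a source release; any hit is worth raising on the vote thread.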

A voter who finds anything wrong with the source tarball can vote -1. Under ASF policy a release vote is decided by majority approval and cannot be vetoed, but in practice the release manager respins when a -1 raises one of the blocking issues below. Common -1 reasons:

Reason	Severity
Missing or broken signature	Blocking (must respin)
MD5 / SHA-1 only	Blocking
Binary files in source tree	Blocking
Missing or wrong LICENSE	Blocking
Missing or wrong NOTICE	Blocking
GPL or category-X dep	Blocking
RAT failure	Blocking
Apache headers missing	Blocking
Failed unit tests of significance	Usually blocking
Build failure	Blocking
Documentation issue	Often non-blocking, opinion

Vote Pass Criteria

The release passes if, after the 72-hour minimum:

At least 3 binding +1 votes from PMC members.
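More binding +1 than binding -1 votes (release votes use majority approval; non-binding votes are advisory).
No unaddressed binding -1 (a convention rather than a formal rule: a release cannot be vetoed, but every -1 deserves a written response).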

If criteria fail:

Extend the vote by 24–48 hours and ask explicitly for more attention.
Or cancel and roll RC2 with the fixes.
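
The pass criteria can be written down as a small tally function. The sketch below assumes the ASF majority-approval rule, under which only binding (PMC) votes are counted; the `Vote` type and function name are hypothetical. It deliberately does not model whether a -1 has been addressed, since that is a judgement call for the release manager.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    voter: str
    value: int     # +1, 0, or -1
    binding: bool  # True for PMC members


def release_vote_passes(votes, hours_open):
    """Majority approval over binding votes, after the 72-hour minimum."""
    if hours_open < 72:
        return False
    plus = sum(1 for v in votes if v.binding and v.value == 1)
    minus = sum(1 for v in votes if v.binding and v.value == -1)
    return plus >= 3 and plus > minus
```

Two binding +1 votes plus any number of non-binding +1 votes still fail this check, which is the case that usually triggers the explicit ping to the PMC.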

Closing the Vote

The release manager closes:

Subject: [VOTE][RESULT] Apache Tez 0.10.4 RC1

Hi all,

The vote on Apache Tez 0.10.4 RC1 has passed.

Binding +1: <names of PMC voters>
Non-binding +1: <names>
0: <names>
-1: <names with reasons, if any>

Proceeding with the release steps.

Thanks to everyone who voted.

<First>

If the vote fails:

Subject: [VOTE][RESULT] Apache Tez 0.10.4 RC1

The vote did not pass. Issues raised:
  - <issue from voter>
  - <issue from voter>

Rolling RC2 with these fixes. Expect a new [VOTE] thread within
<N> days.

<First>

Promoting the Release

After the vote passes:

# 1. Move source from dev to release.
svn mv \
  https://dist.apache.org/repos/dist/dev/tez/tez-0.10.4-RC1 \
  https://dist.apache.org/repos/dist/release/tez/0.10.4 \
  -m "Releasing Apache Tez 0.10.4"

# 2. Promote Nexus staging repo to release (one-click in Nexus UI).

# 3. Tag the final release.
cd ~/tez-src
git tag release-0.10.4 release-0.10.4-rc1
git push origin release-0.10.4

# 4. Wait 24h for mirrors.

# 5. Update the Tez website with download links.

# 6. Send ANNOUNCE.

The announce email goes to announce@apache.org (BCC), dev@tez.apache.org, user@tez.apache.org, and your usual ASF lists for downstream projects (e.g. dev@hive.apache.org):

Subject: [ANNOUNCE] Apache Tez 0.10.4 released

The Apache Tez community is pleased to announce the release of
Apache Tez 0.10.4.

Apache Tez is an application framework that allows for a complex
directed acyclic graph of tasks for processing data. It is built
atop Apache Hadoop YARN.

Highlights:
  - <user-facing change>
  - <user-facing change>

Download:    https://tez.apache.org/releases/0.10.4/
Release notes: https://tez.apache.org/releases/0.10.4/release-notes.html

Thanks to everyone who contributed.

The Apache Tez team

RC Iteration Patterns

A first RC often does not pass, especially for larger releases. Typical RC counts by release type:

Release type	Typical RCs
Patch (0.10.X)	1–2
Minor (0.10.0, 0.11.0)	2–4
Major (1.0.0 if it happened)	4+

Each RC means: cancel vote, fix issues, re-tag (release-X.Y.Z-rcN+1), respin tarball, re-sign, re-stage Nexus (new staging repo), re-send [VOTE]. Plan for 1–3 weeks per release cycle.

Common Failure Modes

Failure	Recovery
Signature key not in KEYS file	Stop, update KEYS, restart vote
RAT failure on a new file	Add Apache header, respin
Forgot to update CHANGES.txt	Update, respin
Stray `.class` or `.jar` in src tree	Clean, respin
Missing LICENSE entry for new bundled dep	Add LICENSE entry + NOTICE if needed, respin
Vote got fewer than 3 binding +1 in 72h	Extend with explicit ping to PMC
-1 on the source artifact for a legitimate issue	Respin
Maven staging mistake	Drop staging repo in Nexus, re-stage

Validation Artifacts

After this chapter you should have:

A GPG key generated and added to the project KEYS file (if you are PMC).
A ~/tez-notes/release-checklist.md with the seven RM steps.
The [VOTE] and [VOTE][RESULT] templates saved.
The discipline to never vote +1 on an RC you haven't checked at least signature + LICENSE + a build.
The ASF Infra contact channels (Infra Slack, Infra issue tracker) bookmarked in case Nexus or dist.apache.org misbehaves.

The next chapter — PMC Responsibilities — covers the rest of what PMC membership entails, beyond releases.

PMC Responsibilities

PMC (Project Management Committee) membership at Apache is not a senior-engineer title. It is a stewardship role with explicit legal, brand, community, and release responsibilities. This chapter is the operational manual for what PMC members actually do between releases.

The Tez PMC's private mailing list is private@tez.apache.org. PMC members are listed publicly at https://tez.apache.org/team-list.html (or the equivalent on the current site).

The Four Buckets of PMC Work

Bucket	Examples	Frequency
Legal	License headers, NOTICE file, third-party LICENSE entries, ICLA matching	Per-patch and per-release
Brand	Trademark protection, conference talk approvals, logo use	Quarterly to annual
Community	Moderating lists, voting new committers, mentoring, code of conduct enforcement	Continuous
Releases	Voting on RCs, cutting RCs, post-release announce	Per-release

Plus one cross-bucket: board reporting, quarterly.

Legal Responsibilities

License Headers

Every source file in the Tez tree must have an Apache 2.0 license header. Tez uses Apache RAT to enforce this.

cd ~/tez-src
mvn apache-rat:check

The expected header for a .java file:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

If RAT fails on a release candidate, the release cannot ship. PMC members reviewing a release verify RAT cleanliness as part of vote-time checks (see Release Voting).

For non-.java files (.proto, .xml, .sh, .md), the same content with the appropriate comment delimiters.

NOTICE File

The NOTICE file at the repo root carries:

The required Apache attribution line.
Required attribution for any bundled third-party code that explicitly demands it.

cat ~/tez-src/NOTICE

Most BSD-, MIT-, and Apache-licensed dependencies do not require NOTICE entries. Some do (notably ones with NOTICE files of their own, which by Apache convention propagate into bundlers). The rule of thumb: if a dependency ships a NOTICE file, copy the required text into Tez's NOTICE.

Common error: adding random "thanks to" lines. NOTICE is not a thank-you file; it is a legal artifact. Keep it minimal and correct.

LICENSE File

LICENSE at the repo root is the Apache License 2.0 plus appendices for any bundled third-party code under different licenses.

For Tez, the appendices are mostly absent because the source release bundles no third-party source. The binary release (the convenience tarball) may bundle jars whose licenses must be listed in an appendix.

If you are a PMC member adding a new dependency that gets bundled in the binary release:

Identify the dependency's license (read it, don't guess).
Verify category (A, B, or X) — see Licensing.
If A: update LICENSE appendix; sometimes NOTICE.
If B: requires PMC discussion + LICENSE / NOTICE updates.
If X: stop. Cannot be bundled. May only be a runtime-optional dep, never a hard one.

ICLA Matching

Every non-trivial contribution must come from someone with an Apache ICLA on file. ICLA records are maintained by the ASF Secretary; PMC members can verify by emailing secretary@apache.org with a contributor name.

In practice, for casual contributors:

Trivial patches (Javadoc, typo) do not require ICLA.
Anything substantive does.
The contributor sends the ICLA themselves; PMC verifies it landed.

If a substantial patch is committed without an ICLA on file, that is a legal exposure for the foundation. PMC members must catch this before commit.

Brand Responsibilities

"Apache Tez" is a trademark of the Apache Software Foundation. The PMC is the steward.

Brand decision	PMC action
New logo	PMC vote, register with VP Brand Management
Conference talk titled "Apache Tez"	OK; speaker should follow trademark guidelines
Conference talk titled "Tez" without Apache	Polite ask: please use full mark
Third-party product named "TezCloud"	Likely refer to VP Brand; could be misleading
Third-party product built on Tez, named differently	OK; clarify attribution if uncertain
Use of the Tez feather logo in a slide deck	OK with attribution

For specifics see the ASF Trademark and Brand Policy. When in doubt, the PMC defers to trademarks@apache.org.

Community Responsibilities

Moderation

Most ASF mailing lists are moderated for non-subscribers (subscribers post freely). The moderation work is light: approving first posts, rejecting spam.

Tez has a small mod team (typically a couple of PMC members). Add dev-moderate@ or similar to your mail filter to spot moderation requests.

If subscriber behavior on a list becomes problematic — flame wars, code-of-conduct violations — the PMC handles it. Typical escalation:

Off-list private email from a PMC member to the offending subscriber.
If unaddressed, a public on-list warning.
If unaddressed, removal from the list (rare; requires PMC vote).

For severe cases (harassment, security threats), escalate immediately to board@apache.org.

Voting New Committers

The committer-bit process, from the PMC's side:

1. PMC member observes a strong contributor (see meritocracy chapter).
2. PMC member emails private@tez.apache.org with [VOTE] thread.
3. PMC members vote +1 / 0 / -1 (usually +1, sometimes 0 with rationale).
4. Vote runs ~72 hours; passes with at least 3 binding +1 and no binding -1.
5. PMC member privately emails the contributor with the offer.
6. On acceptance, ASF Infra is notified to provision the ASF account.
7. PMC announces publicly on dev@.

A -1 from a PMC member on a committer vote requires a concrete reason. "Doesn't feel right" is not enough; "two recent JIRAs showed inadequate care for compatibility" is.

PMC members may vote 0 if they don't know the contributor well — common, no shame in it.

Voting New PMC Members

Same mechanism as committer, except:

The candidate is almost always a sitting committer, so the pool under consideration is the current committer list.
The bar is higher (judgement, willingness to do PMC work, see Meritocracy).

After acceptance, the candidate is invited to the PMC. The Apache Board confirms.

Code of Conduct

Apache projects follow the ASF Code of Conduct. The PMC is the enforcement body within the project. Most enforcement is gentle and private. Serious cases are escalated to the board.

Release Responsibilities

Covered in detail in Release Voting. The PMC-specific elements:

Binding +1 votes on release artifacts are PMC-only.
At least 3 binding +1 required for a release to pass.
PMC member is the release manager (or supervises if a non-PMC committer is designated by lazy consensus to RM under PMC oversight).
Post-release, PMC member ensures the announce@apache.org mail goes out and the website is updated.

Security Reports

Security disclosures arrive at private@tez.apache.org or security@apache.org. The process:

1. Acknowledge receipt within 48 hours.
2. PMC investigates in private; reproduce.
3. Develop a fix in a private branch (not in apache/tez until disclosure).
4. Determine severity (CVSS) and assign a CVE.
5. Coordinate disclosure timing with downstream projects (Hive, etc).
6. Cut a release containing the fix.
7. Send disclosure to oss-security and security@apache.org with CVE and details.

The discipline: never discuss security issues on public lists or public JIRA until the fix has been released and disclosure is published.

If you are new to PMC, read the ASF Security Team process before you need it.

Board Reporting

The Apache Board oversees every project via quarterly reports. The Tez PMC submits a report each quarter (or per the schedule the board sets — currently quarterly with projects rotated through). The chair (or a delegate) submits it via https://reporter.apache.org/.

A standard report contains:

Community activity (new committers, new PMC members, list activity)
Releases since last report
Brand or legal issues
Health concerns the board should know about

The board looks for warning signs:

Warning	Board concern
No releases in many quarters	Is the project dormant?
All committers from one company	Is the project independent?
Mailing-list activity falling	Is the community shrinking?
Code-of-conduct issues unresolved	Is the PMC functional?

The chair is responsible for filing on time. If the report is late, the board notices.

Time Commitment

A PMC member with no other ASF roles spends roughly:

Activity	Monthly time
Reviewing private@ traffic	1–2 hours
Voting on releases (when there is one)	1–3 hours per release
Voting on new committers	30 minutes per vote
Board reporting (every 3 months)	1–2 hours
Security incidents (when they happen)	Variable; possibly days
Committer work on top of PMC duties	(as before)

A PMC member who is also chair adds the report-filing burden and acts as the project's ambassador to the board.

Stepping Back

PMC membership is permanent until you step back. Emeritus PMC status exists for those who have stepped away from active project work but want to remain available for consultation.

To go emeritus:

Subject: [NOTICE] Going emeritus PMC

Hi all,

Effective <date>, I'm moving to emeritus PMC status on Tez. My
involvement in the project has tapered and I want the active PMC
to reflect who's currently doing the work.

Please feel free to reach out if you ever want a sanity check on
something I worked on historically.

Thanks for the years of collaboration.

<First>

PMC removes you from active count. You retain your ASF account; you may return to active later by vote.

Validation Artifacts

After this chapter:

A ~/tez-notes/pmc-duties.md listing the four buckets and a one-line example of each.
A subscription to private@tez.apache.org (when you are PMC).
Knowledge of how to verify an ICLA, how to find the trademark policy, how to file a board report.
A reflex to escalate security reports to private@ immediately and never discuss them publicly until disclosure.

The next chapter — Licensing — drills into the legal bucket: ALv2, LICENSE/NOTICE rules, and category A/B/X.

Licensing

Apache licensing is precise. The rules are not "be reasonable about open source"; they are a specific framework administered by Apache Legal. Getting them wrong blocks a release. This chapter is the working knowledge needed by committers and PMC, plus the bits every contributor should know before adding a dependency.

The Apache License 2.0

Apache Tez is licensed under the Apache License, Version 2.0 ("ALv2"). This is a permissive license that allows:

Use, reproduction, modification, distribution
Commercial use
Patent grant (explicitly, unlike MIT/BSD)
Sublicensing under different terms (with attribution)

In exchange:

You include the LICENSE and NOTICE in distributions
You note significant modifications
You preserve attribution and patent grants

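Practically, ALv2 is a permissive, non-copyleft license. It is compatible with most other licenses; the notable exception is GPL 2.0. Compatibility with GPL 3.0 is one-way: ALv2 code can be included in GPL 3.0 works, but GPL 3.0 code cannot be included in ALv2 works.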

The Three Files in the Tez Repo Root

File	Purpose
`LICENSE`	The Apache License 2.0 text, plus appendices for any bundled third-party code under different licenses
`NOTICE`	Required attributions for bundled code (Apache + any NOTICE-bearing deps)
`KEYS` (in dist, not repo)	PGP keys used to sign releases

ls ~/tez-src/LICENSE ~/tez-src/NOTICE
cat ~/tez-src/NOTICE

For Tez source releases, LICENSE and NOTICE are typically short — the source tarball bundles no third-party code. For convenience binary releases, both grow with the bundled jars.

Category A / B / X — The Dependency Classes

Apache Legal classifies third-party licenses into categories. The full list is at Apache Legal Resolved. Summary:

Category	Meaning	Examples	Can it be a Tez dependency?
A	Compatible with ALv2	ALv2, MIT, BSD 2/3-clause, ISC	Yes; document in LICENSE/NOTICE if bundled
B	Compatible with conditions	EPL 1.0/2.0, CDDL 1.0/1.1, MPL 1.1/2.0, IBM Public License 1.0	Yes, but only as bundled binary, not source. Add LICENSE/NOTICE entry.
X	Incompatible	GPL (any version), AGPL, LGPL 2.0/2.1/3.0, SSPL, BUSL, CC-BY-NC	No. May not be bundled in any release. Runtime optional dep only, with care.

The hard cases:

LGPL is category X: it may not be bundled in a source or binary release, but it is acceptable as an optional runtime dependency that the user supplies. Be careful; this is one of the most-asked questions on legal-discuss@apache.org.
CC-BY-SA and other ShareAlike licenses depend on the work: data and documentation are sometimes B, sometimes X.
Bespoke licenses (custom permissive licenses) must be reviewed before use.

If you are uncertain, post on legal-discuss@apache.org with a link to the license text. Don't guess.
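
A contributor's cheatsheet for the categories can be as simple as a lookup table. The sketch below covers only a handful of licenses named in this chapter, uses SPDX-style short names chosen for illustration, and treats anything it does not recognise as a question for legal-discuss@. It is not a substitute for the Apache Legal list.

```python
# A deliberately small subset; the Apache Legal list is the authority.
CATEGORY = {
    "Apache-2.0": "A", "MIT": "A", "BSD-2-Clause": "A",
    "BSD-3-Clause": "A", "ISC": "A",
    "EPL-1.0": "B", "EPL-2.0": "B", "CDDL-1.0": "B",
    "CDDL-1.1": "B", "MPL-1.1": "B",
    "GPL-2.0": "X", "GPL-3.0": "X", "AGPL-3.0": "X",
    "LGPL-2.1": "X", "SSPL-1.0": "X", "BUSL-1.1": "X",
}

ACTION = {
    "A": "OK; document in LICENSE/NOTICE if bundled",
    "B": "binary-only bundling; PMC discussion and LICENSE/NOTICE entry",
    "X": "stop; may not be bundled",
}


def classify(license_id):
    """Return (category, action); unknown licenses are not guessed."""
    category = CATEGORY.get(license_id)
    if category is None:
        return None, "unknown; ask legal-discuss@apache.org"
    return category, ACTION[category]


print(classify("EPL-2.0"))
print(classify("Bespoke-1.0"))
```

Returning "unknown" rather than a default category mirrors the rule above: when the license is not clearly classified, ask instead of guessing.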

"GPL Contamination"

Apache projects cannot ship GPL code. The rule has corollaries that catch people:

Action	OK?
Tez code calls a GPL library via reflection at runtime	No — if the library must be present, it's a dep
Tez code can optionally integrate with a GPL tool the user installs themselves	Yes — runtime-optional, user-supplied
Tez ships a GPL jar in the binary tarball	No
Tez build script downloads a GPL jar during build	No (this is contamination)
Tez source contains a comment "see SOME GPL CODE for reference"	Risky — get review
Tez source copies a snippet from GPL code	No — pollutes the codebase

The conservative rule: GPL code may exist near Tez (a user's runtime environment) but not in Tez (source or binary distribution).

Adding a New Dependency — Procedure

When a patch proposes a new third-party dependency:

Identify the license. Open the project's LICENSE file. Don't read the GitHub "License" sidebar; it can be wrong.
Classify. Category A, B, or X (above). If A, proceed. If B, plan for LICENSE / NOTICE updates and PMC discussion. If X, stop.
Check transitive deps. A category-A library may pull in a category-X transitive. Use mvn dependency:tree and verify every transitive's license.
Justify. On the JIRA, explain why this dep is needed and why no in-tree alternative suffices.
Update LICENSE. If the dep is bundled in the binary release (it usually is), add an appendix entry naming the dep, its license, and where to find the full license text.
Update NOTICE. If the dep ships a NOTICE file, copy the required text into Tez's NOTICE. Read the dep's NOTICE; not all of it is required.
Test the build. Run mvn apache-rat:check and a full build. The dep should not produce RAT-flagged files (most don't).
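
For the transitive-dependency step, the output of mvn dependency:tree can be reduced to a flat list of coordinates to review one by one. The sketch below assumes the plugin's usual text layout (`group:artifact:type[:classifier]:version[:scope]` after the tree-drawing characters). It only extracts coordinates, including the project's own on the first line; the licenses still have to be looked up by hand.

```python
import re

# Coordinate at the end of a dependency:tree line, optionally followed
# by a parenthesised note such as "(optional)".
COORD = re.compile(r"([\w.\-]+(?::[\w.\-]+){3,5})(?:\s+\(.*\))?\s*$")


def parse_tree(output):
    """Return a sorted list of unique (group, artifact, version) tuples."""
    found = set()
    for line in output.splitlines():
        match = COORD.search(line)
        if not match:
            continue
        parts = match.group(1).split(":")
        # 4 or 5 fields: group:artifact:type:version[:scope]
        # 6 fields:      group:artifact:type:classifier:version:scope
        version = parts[4] if len(parts) == 6 else parts[3]
        found.add((parts[0], parts[1], version))
    return sorted(found)
```

A typical use is to pipe `mvn dependency:tree` into a file, run `parse_tree` on its contents, and diff the result against the list from the previous release so that only new coordinates need a license check.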

PMC members review the dependency before commit. If you are PMC, ask:

Is the license correctly classified?
Is the dep maintained?
What is the size cost (Tez binary tarball grows by N MB)?
Are there security advisories against the version proposed?

Apache RAT in Tez Pre-commit

Apache RAT (Release Audit Tool) checks that every source file has an Apache license header. It is part of every Tez release vote and should be part of every contributor's pre-submit.

Run:

cd ~/tez-src
mvn apache-rat:check

Output on success:

[INFO] BUILD SUCCESS

Output on failure:

[ERROR] Files with unapproved licenses:
  tez-dag/src/main/java/.../NewClass.java

The fix is to add the license header. The standard Java header is at the top of any existing Tez Java file; copy it.

RAT can be configured to allow certain files to be exempt (e.g. generated .proto-derived files, META-INF/). The exemption config lives in the parent pom.xml:

grep -A20 "apache-rat-plugin" ~/tez-src/pom.xml

Adding a new file type that legitimately can't carry a header (e.g. a JSON test fixture) requires updating the exemption list and noting it in the JIRA.
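
For a fast local check before invoking Maven, a few lines of Python can flag files that obviously lack the header. This is a rough stand-in, not a replacement for RAT: it assumes the header's first sentence appears within the first 30 lines, checks only the suffixes listed, and knows nothing about the exemption list in the pom.

```python
import os

MARKER = "Licensed to the Apache Software Foundation (ASF)"
CHECKED_SUFFIXES = (".java", ".proto", ".xml", ".sh", ".py")


def missing_headers(root, head_lines=30):
    """Return relative paths whose first `head_lines` lines lack MARKER."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(CHECKED_SUFFIXES):
                continue
            full = os.path.join(dirpath, name)
            with open(full, encoding="utf-8", errors="replace") as fh:
                head = "".join(line for _, line in zip(range(head_lines), fh))
            if MARKER not in head:
                missing.append(os.path.relpath(full, root))
    return sorted(missing)
```

Anything this reports is worth fixing before running the real check; a clean result here still needs `mvn apache-rat:check` to confirm.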

License Header Template

For .java:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

For .proto:

//
// Licensed to the Apache Software Foundation (ASF) ... (same content with // comments)
//

For .xml:

<!--
   Licensed to the Apache Software Foundation (ASF) ... (same content)
-->

For .sh / .py:

#
# Licensed to the Apache Software Foundation (ASF) ... (same content with # comments)
#

For .md: by convention, no header is needed for markdown docs in the source tree, but project policy may require one. Check mvn apache-rat:check output.
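
Since the header text is identical across file types and only the comment delimiters change, the variants can be generated from one source string. The sketch below is illustrative; the delimiter table follows the templates in this section, and the function name is hypothetical.

```python
HEADER_TEXT = """\
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

# (opening line, per-line prefix, closing line) for each file type.
STYLES = {
    ".java": ("/**", " * ", " */"),
    ".proto": ("//", "// ", "//"),
    ".xml": ("<!--", "   ", "-->"),
    ".sh": ("#", "# ", "#"),
    ".py": ("#", "# ", "#"),
}


def render_header(suffix):
    """Return the license header wrapped in the comment style for `suffix`."""
    opener, prefix, closer = STYLES[suffix]
    body = [(prefix + line).rstrip() for line in HEADER_TEXT.splitlines()]
    return "\n".join([opener, *body, closer]) + "\n"


print(render_header(".sh"))
```

For shell and Python scripts, keep any shebang line first and place the rendered header directly after it.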

The Tez NOTICE File

A typical Tez NOTICE:

Apache Tez
Copyright 2014-YYYY The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (https://www.apache.org/).

Plus, if bundled deps require:

This product bundles SomeLibrary, which is available under
the Foo Bar License. See <path or URL>.

NOTICE is not:

A list of contributors (that's CHANGES.txt and git).
A thank-you list.
A list of services or users.

Keep it minimal and legally precise.

Source vs Binary Release — Different Rules

Apache makes a sharp distinction:

Aspect	Source release	Binary release
Status	Official Apache release	Convenience artifact
What's bundled	Source code only	Compiled jars, possibly third-party jars
Must have ALv2 LICENSE	Yes	Yes
Must have NOTICE	Yes	Yes; longer than source NOTICE
Must pass RAT	Yes	RAT runs on the source tree; jars bundled in the binary are not header-checked
Category B bundling	Not as source; binary form only	Allowed with LICENSE/NOTICE entry
Category X bundling	Never	Never

Practical implication: a source release rarely bundles anything except Tez's own source. The binary release is tez-dist/target/apache-tez-X.Y.Z-bin.tar.gz, which contains all the runtime jars Tez depends on (Hadoop, Jackson, etc.).
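
To see exactly which jars a binary tarball bundles, and therefore which licenses its LICENSE and NOTICE must cover, the archive can be listed without unpacking it. A minimal sketch using the standard library; the function name is illustrative.

```python
import tarfile


def bundled_jars(tarball_path):
    """Return sorted base names of .jar members inside a .tar.gz archive."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        return sorted(
            member.name.rsplit("/", 1)[-1]
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(".jar")
        )
```

Comparing this list between two releases shows which bundled jars are new and need a LICENSE entry.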

Common Licensing Mistakes

Mistake	Caught by	Fix
New file without Apache header	`mvn apache-rat:check`	Add header
Random third-party snippet pasted into Tez	Code review	Replace with original code or pull in via dep
New category-B dep with no LICENSE update	PMC at release vote	Update LICENSE
New category-X dep	PMC at release vote	Remove dep
NOTICE accidentally cleared	Code review	Restore from prior release
`Copyright (c) Company Name` in a file	Code review	Replace with Apache header; Company-owned code requires CLA review

What ICLAs and CCLAs Cover

Two contributor license agreements:

CLA	Who signs	What it covers
ICLA (Individual)	An individual contributor	Their personal contributions
CCLA (Corporate)	A company's authorised signatory	Contributions by listed employees

An ICLA is required for any non-trivial contribution. A CCLA is required if the contribution is made in the contributor's capacity as a company employee.

PMC members can verify ICLA status via secretary@apache.org. For a casual single-patch contributor, the trivial-patch exception often applies and no ICLA is needed; for a contributor on path to committer, the ICLA needs to be on file by the second or third patch.

Validation Artifacts

After this chapter:

A ~/tez-notes/license-categories.md cheatsheet of A/B/X with examples.
The reflex to run mvn apache-rat:check in your pre-submit script.
The discipline to check a new dep's category before opening a JIRA proposing it.
The ability to read Tez's NOTICE file and confirm what each line is there for.

The next chapter — Code Style & Trust — closes the section with the operational mechanics of style enforcement and the trust ladder a contributor climbs.

Code Style & Trust

The Tez project enforces a specific code style via checkstyle. The style itself is less interesting than the trust mechanism it embodies: an automated, opinionated style is how a project of dozens of committers and hundreds of contributors keeps its codebase coherent without requiring every reviewer to argue about braces.

This chapter is the practical guide to the style, the tools that enforce it, and the trust ladder a contributor climbs from first patch to commit bit.

Where the Style Lives

Tez's checkstyle configuration:

cat ~/tez-src/tez-tools/src/main/resources/tez/checkstyle.xml

This file is the source of truth. If a reviewer says "your patch fails checkstyle," they mean this file is unhappy.

The file is invoked by the parent pom.xml:

grep -A10 "maven-checkstyle-plugin" ~/tez-src/pom.xml

Verify locally:

cd ~/tez-src
mvn checkstyle:check

On success, Maven ends with BUILD SUCCESS and reports no violations (exit 0). On failure, it lists each violation with file and line number.

The Rules That Matter

The full ruleset is the file above. The rules that catch contributors most often:

Rule	What it enforces
Line length	120 chars max (confirm the LineLength value in checkstyle.xml)
Indentation	2 spaces (not 4, not tabs)
Imports	No wildcard imports; specific order
Brace style	Egyptian (`{` on same line)
Unused imports	Disallowed
Member ordering	Static fields, instance fields, constructors, methods
Trailing whitespace	Disallowed
Final newline	Required
`@Override` annotations	Required when overriding
Javadoc on public methods of `@Public` classes	Required

The full list is in the file. Notable absences:

Tez does not enforce a strict naming convention beyond standard Java (camelCase, PascalCase for classes).
Tez does not enforce method length limits (so committers must catch overly long methods in review).
Tez does not enforce strict cyclomatic complexity (same).

So checkstyle is a floor, not a ceiling. Passing it doesn't mean the patch is well-styled in the human sense — it means the obvious mechanical violations are absent.
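
To make "mechanical violations" concrete, here is a toy approximation of three of the rules from the table (line length, tabs, trailing whitespace). It is a sketch, not how checkstyle is implemented; the class name is hypothetical and the 120-character limit is an assumption taken from the table, so confirm it against checkstyle.xml.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy approximation of three mechanical rules; the real checks live in checkstyle.xml. */
final class MechanicalStyleCheck {
  static final int MAX_LINE_LENGTH = 120; // assumption: value from the rules table

  /** Returns one human-readable violation per broken rule per line. */
  static List<String> check(List<String> lines) {
    List<String> violations = new ArrayList<>();
    for (int i = 0; i < lines.size(); i++) {
      String line = lines.get(i);
      int lineNo = i + 1;
      if (line.length() > MAX_LINE_LENGTH) {
        violations.add(lineNo + ": line longer than " + MAX_LINE_LENGTH);
      }
      if (line.indexOf('\t') >= 0) {
        violations.add(lineNo + ": tab character");
      }
      if (!line.isEmpty() && Character.isWhitespace(line.charAt(line.length() - 1))) {
        violations.add(lineNo + ": trailing whitespace");
      }
    }
    return violations;
  }

  public static void main(String[] args) {
    List<String> sample = List.of("class A {", "\tint x;", "  int y; ", "}");
    check(sample).forEach(System.out::println);
  }
}
```

Nothing in this kind of check knows whether a method is too long or a name is misleading, which is exactly the gap human review fills.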

IDE Setup

Configure your IDE to match. IntelliJ:

1. File → Settings → Editor → Code Style → Java.
2. Set Tab size: 2; Indent: 2; Continuation indent: 4.
3. Use spaces, not tabs.
4. Wrapping: hard wrap at 120.
5. Import → Class count to use import with '*': 999.
6. Final newline: required.

Or import the Hadoop / Tez IntelliJ style file if one is in the repo:

find ~/tez-src -name "*.xml" | xargs grep -l "CodeStyle" 2>/dev/null | head

Eclipse: Window → Preferences → Java → Code Style → Formatter, import an XML if one is provided in tez-tools/.

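VS Code with the Java extension: edit .vscode/settings.json per workspace. The formatter path below is a placeholder; set it only if the search above found a formatter XML, and point it at that file: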

{
  "java.format.settings.url": "tez-tools/src/main/resources/tez/eclipse-formatter.xml",
  "editor.tabSize": 2,
  "editor.insertSpaces": true,
  "files.insertFinalNewline": true,
  "files.trimTrailingWhitespace": true
}

The goal: at save time, your IDE produces checkstyle-passing code.

Catching Violations Pre-Submit

The pre-submit script (from Patch Quality):

#!/usr/bin/env bash
set -e
cd ~/tez-src
mvn install -DskipTests
mvn checkstyle:check
git diff --check                       # detects whitespace errors
mvn test -pl tez-dag,tez-api

git diff --check is a free win — it catches trailing whitespace and conflict markers before they reach the reviewer.

The Trust Ladder

Style is the visible surface of a deeper thing: trust. The contributor-to-committer path is a multi-step climb up a trust ladder.

Step 0: Anonymous reader.
        Reads the codebase.
        Trust: none required.

Step 1: First-time contributor (Javadoc fix).
        Patch passes mechanical checks.
        Trust to receive: a few minutes of review attention.

Step 2: Multi-patch contributor.
        Several patches in over weeks/months.
        Trust to receive: a sympathetic reviewer who will guide.
        Trust to give: explain your reasoning on JIRA without being asked.

Step 3: Repeat contributor in one area.
        Becomes recognised as an expert in that area.
        Trust to receive: their +1 (non-binding) carries weight on patches in that area.
        Trust to give: stay engaged on follow-up issues.

Step 4: Reviewer.
        Provides non-binding +1 on others' patches with insight.
        Trust to receive: PMC members notice.
        Trust to give: your reviews must be substantive, not drive-by +1s.

Step 5: Committer (the bit).
        Granted by PMC vote on private@.
        Trust to receive: commit access to apache/tez.
        Trust to give: review patches in your areas, mentor newcomers, attend to dev@.

Step 6: PMC member.
        Granted later, after sustained committership.
        Trust to receive: binding release vote, security-disclosure access.
        Trust to give: stewardship duties (legal, brand, community, releases).

Each step takes months of consistent engagement. The ladder is asymmetric: the contribution required to climb each step grows roughly linearly, but the trust granted grows roughly exponentially.

Patterns Committers Want

Beyond mechanical style, certain patterns mark a patch as "from someone who gets it":

Use the existing logging idiom

private static final Logger LOG = LoggerFactory.getLogger(MyClass.class);

// Then in method:
LOG.info("Initialized vertex {} with {} tasks", vertexName, numTasks);

Not System.out.println. Not LOG.info("Initialized vertex " + vertexName + ...): concatenation builds the string before the call, even when INFO is disabled. The parameterized SLF4J form defers formatting until the logger has confirmed the level is enabled.
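
The cost difference is easy to see with a stand-in logger. The sketch below is not SLF4J: ToyLogger and CountingName are hypothetical classes written for this illustration, and the placeholder handling is deliberately simplistic. It only demonstrates that concatenation calls toString() before the level check, while the parameterized form does not.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in logger to show eager vs deferred formatting; not the SLF4J implementation. */
final class DeferredFormattingDemo {
  /** Counts toString() calls, i.e. how often a message fragment is actually built. */
  static final class CountingName {
    final AtomicInteger toStringCalls = new AtomicInteger();

    @Override
    public String toString() {
      toStringCalls.incrementAndGet();
      return "vertex-1";
    }
  }

  static final class ToyLogger {
    private final boolean infoEnabled;
    final StringBuilder out = new StringBuilder();

    ToyLogger(boolean infoEnabled) {
      this.infoEnabled = infoEnabled;
    }

    /** Eager style: the caller built the whole string before this method ran. */
    void info(String message) {
      if (infoEnabled) {
        out.append(message).append('\n');
      }
    }

    /** Parameterized style: "{}" placeholders are filled only if INFO is enabled. */
    void info(String format, Object... args) {
      if (!infoEnabled) {
        return;
      }
      StringBuilder sb = new StringBuilder();
      int from = 0;
      for (Object arg : args) {
        int idx = format.indexOf("{}", from);
        if (idx < 0) {
          break;
        }
        sb.append(format, from, idx).append(arg);
        from = idx + 2;
      }
      out.append(sb).append(format.substring(from)).append('\n');
    }
  }

  public static void main(String[] args) {
    ToyLogger off = new ToyLogger(false);
    CountingName eager = new CountingName();
    off.info("Initialized vertex " + eager);
    CountingName lazy = new CountingName();
    off.info("Initialized vertex {}", lazy);
    System.out.println("eager toString calls: " + eager.toStringCalls.get());
    System.out.println("lazy toString calls: " + lazy.toStringCalls.get());
  }
}
```

With INFO off, the eager call still pays for one toString(); the parameterized call pays for none.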

Use existing helper classes

If tez-common has a TezUtils helper for serialising a config to a byte buffer, use it. Don't write a new helper inline. Search:

grep -rn "class.*Utils" ~/tez-src/tez-common/src/main/java

Match the surrounding file's style for ambiguous things

If the file uses final on every parameter, your additions should too. If the file uses single-letter loop variables (for (int i = 0; ...)), don't suddenly switch to for (int taskIndex = 0; ...). Match the file.

Avoid speculative generality

Don't introduce an interface "in case we need a second implementation later." Don't add a configuration key "in case someone wants to tune this." Both increase the surface area the committer pool must maintain forever.

Cite the JIRA in non-obvious code

// TEZ-4321: handle the case where inputs is null after recovery.
if (inputs == null) {
  inputs = Collections.emptyList();
}

The comment is a permanent breadcrumb back to the design discussion.

Keep `try`/`catch` narrow

// Good
try {
  state = readState();
} catch (IOException e) {
  LOG.warn("Failed to read state for {}", id, e);
  return defaultState();
}

// Bad: catches too much
try {
  state = readState();
  process(state);              // <-- different exception domain
  publish(state);              // <-- different exception domain
} catch (Exception e) {        // <-- swallows everything
  LOG.error("Something failed", e);
}

Don't add `@SuppressWarnings` without justification

// Bad
@SuppressWarnings("unchecked")
public List<T> getStuff() { ... }

// Good
@SuppressWarnings("unchecked") // safe; we control all writers
public List<T> getStuff() { ... }

A bare @SuppressWarnings is a code smell that says "I didn't want to deal with the real warning."

Use specific exception types in `throws`

// Bad
public DAG build() throws Exception { ... }

// Good
public DAG build() throws TezException, IOException { ... }

throws Exception defeats the type system. Reviewers will ask for specifics.

How Trust Is Withdrawn

Trust is built one patch at a time; it can also erode. Things that erode committer trust in a contributor:

Behavior	Erosion
Ghosting a patch mid-review	Significant; reviewer's time wasted
Re-attaching the same patch without addressing comments	Significant; wastes another review cycle
Arguing without evidence	Moderate; teaches reviewer to expect friction
Pinging weekly	Moderate; reviewer learns to deprioritise
Submitting a patch that breaks tests	Mild if rare; serious if pattern
Committing your own patch without review (as committer)	Serious; loss of community trust
Reverting another committer's work without discussion	Very serious; potential PMC issue
Public criticism of a committer for their review	Very serious

The recoverable: explain, apologise, address the underlying issue. Trust returns.

The non-recoverable: code-of-conduct violations. PMC handles these privately.

From First Patch to Commit Bit — The Arc

A realistic 12-month arc for a contributor on the path:

Month 1   First Javadoc fix.   Review takes 2 weeks (reviewer wasn't sure).
          You learn the patch generation workflow.
Month 2   Three small bug fixes.   Review faster (reviewer knows you).
          You learn checkstyle, run it pre-submit.
Month 3   Mid-sized refactor.   Two review rounds, no friction.
          You start filing follow-up JIRAs from things you notice.
Month 4-5 You review someone else's patch with a substantive +1.
          A PMC member notices on dev@.
Month 6   First design discussion on a JIRA.   You write a one-page design.
          Review goes well; consensus reached.
Month 7-8 You're patch-author on the implementation.   Three review rounds.
          Final commit feels routine.
Month 9   You shepherd a new contributor through their first patch.
          PMC notices.
Month 10  You're proposed on private@.   Vote passes.
          You're a committer.
Month 11  You commit your first patch (someone else's, reviewed by you).
          You explicitly don't commit your own work unreviewed.
Month 12  You're routine.   You review 2-3 patches a month, file 2-3.
          The flywheel.

This is one path, not the only path. Some contributors hit the bit at month 6 (extremely sustained activity); some at month 24+ (slower but steady). The trust ladder doesn't have a clock; it has a contribution count + sustained behavior pattern.

Validation Artifacts

After this chapter:

Your IDE is configured to produce checkstyle-passing code at save time.
Your pre-submit script runs mvn checkstyle:check and git diff --check.
A ~/tez-notes/style-patterns.md listing the "patterns committers want" above.
A clear-eyed estimate of where you are on the trust ladder, and what step is next.

This chapter closes the Release & PMC Reality section. The next major section, Hive-on-Tez Labs, is operational engineering at the Tez/Hive boundary — the most common production context for Tez today.

Capstone Project

The Capstone is the bridge from "I have read the Tez codebase" to "I have shipped a non-trivial fix that an Apache Tez committer merged into master." Everything in Levels 1–7 was preparation. This is the work.

You will pick one real, open Apache Tez JIRA, reproduce it against a current build, trace the failure through the codebase, identify the root cause, write a minimum-diff patch with deterministic tests, get it through precommit (Yetus / GitHub Actions), respond to review comments, and land the change. Then you write it up so the next person can learn from your investigation.

This chapter is the table of contents. The ten step-chapters that follow are the work itself.

Prerequisites

Do not start the Capstone until you can answer "yes" to every one of these:

Level 1–7 complete. You can read DAGImpl, VertexImpl, TaskImpl, TaskAttemptImpl, AsyncDispatcher, the shuffle path (ShuffleManager, Fetcher, MergeManager), and at least one VertexManagerPlugin (ShuffleVertexManager or RootInputVertexManager) without a guide open.
You have built Tez from source. mvn clean install -DskipTests succeeds on your machine, and mvn test -pl tez-dag finishes (some flakes are normal — see Stage 9 of the issue roadmap).
You have run MiniTezCluster locally. mvn test -pl tez-tests -Dtest=TestOrderedWordCount goes green.
You have a working JIRA + Apache ID (or a GitHub account ready to PR).
You have read the Tez contribution guide: https://tez.apache.org/contribution_guide.html and https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute.

If any of these is "no," stop. Go back. The Capstone is unforgiving of partial preparation — you will spend three weeks confused instead of three weeks shipping.

The 10-Step Flow

flowchart TD
    A[Step 1: Issue Selection] --> B[Step 2: Reproduction]
    B --> C[Step 3: Execution Path Analysis]
    C --> D[Step 4: Root Cause Identification]
    D --> E[Step 5: Implementation]
    E --> F[Step 6: Testing]
    F --> G[Step 7: Validation]
    G --> H[Step 8: Patch / PR]
    H --> I[Step 9: JIRA + Docs]
    I --> J[Step 10: Engineering Write-Up]
    G -.fail.-> D
    F -.fail.-> E
    H -.review.-> E

The dotted arrows are the loops you will actually run. Nobody gets root cause right on the first hypothesis. Nobody passes precommit on the first push. Plan for two or three iterations through Steps 4–8 before you land.

Deliverables

By the time you mark the Capstone done, every one of these artifacts exists:

#	Artifact	Lives in
1	Failing reproducer test (a JUnit test that fails on `master` without your patch and passes with it)	`tez-tests/` or a module-local `src/test/java/...`
2	Root-cause document (200–500 words, with file:line citations)	`capstone-work/root-cause.md` in your fork
3	Minimum-diff patch	A branch on your fork of `apache/tez`
4	Unit tests using `DrainDispatcher` / mock dispatcher (if state-machine related)	The relevant `src/test/java`
5	Integration test using `MiniTezCluster` (if end-to-end behavior changed)	`tez-tests/src/test/java/org/apache/tez/test/`
6	Validation report (output of `mvn test -pl <module>`, checkstyle, spotbugs, RAT)	`capstone-work/validation.md`
7	GitHub PR against `apache/tez:master` (or `.patch` file attached to JIRA)	`https://github.com/apache/tez/pulls`
8	JIRA updated: status = "Patch Available," PR linked, release-notes filled if user-visible	`https://issues.apache.org/jira/browse/TEZ-NNNN`
9	Engineering write-up (500–1000 words: problem, investigation, design, alternatives, lessons)	Personal blog, Apache wiki page, or dev@ summary

Every one. No exceptions. The write-up is not optional — it is how the community (and your future self) learns from your investigation.

100-Point Rubric Summary

The full rubric lives in evaluation-rubric.md. Headline:

Area	Weight
Problem articulation (symptom vs. root cause separation, conditions)	20
Execution-path mastery (file:line citations, diagram, accuracy)	20
Implementation quality (minimum diff, conventions, no scope creep)	20
Testing (unit + integration, deterministic, coverage)	15
Review responsiveness (addresses comments, iteration cadence)	10
Documentation (JIRA, code comments, write-up)	10
Community interaction (mailing-list etiquette, handoff hygiene)	5

Tier thresholds:

80+ — credible Tez contributor. You can sustain a steady patch flow.
90+ — committer-ready. You are doing work a committer would do without hand-holding.
95+ — PMC-track. You are leading work others want to follow.

You will self-grade in Step 10. Be honest. Inflated self-grades are visible from orbit when a committer reads your write-up.
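
For the self-grade, the arithmetic is simple enough to script. The following is a minimal sketch with a hypothetical class name; the weights and tier thresholds are copied from the summary in this chapter, and the label for scores under 80 is my own wording, not part of the rubric.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical self-grading helper; weights and thresholds come from this chapter. */
final class CapstoneRubric {
  static final Map<String, Integer> MAX_POINTS = new LinkedHashMap<>();

  static {
    MAX_POINTS.put("Problem articulation", 20);
    MAX_POINTS.put("Execution-path mastery", 20);
    MAX_POINTS.put("Implementation quality", 20);
    MAX_POINTS.put("Testing", 15);
    MAX_POINTS.put("Review responsiveness", 10);
    MAX_POINTS.put("Documentation", 10);
    MAX_POINTS.put("Community interaction", 5);
  }

  /** Sums earned points, rejecting unknown areas and scores above an area's weight. */
  static int total(Map<String, Integer> earned) {
    int sum = 0;
    for (Map.Entry<String, Integer> e : earned.entrySet()) {
      Integer max = MAX_POINTS.get(e.getKey());
      if (max == null || e.getValue() < 0 || e.getValue() > max) {
        throw new IllegalArgumentException("Bad score for area: " + e.getKey());
      }
      sum += e.getValue();
    }
    return sum;
  }

  static String tier(int score) {
    if (score >= 95) {
      return "PMC-track";
    }
    if (score >= 90) {
      return "committer-ready";
    }
    if (score >= 80) {
      return "credible Tez contributor";
    }
    return "below the first tier"; // label is an assumption; the rubric names no tier here
  }

  public static void main(String[] args) {
    Map<String, Integer> earned = new LinkedHashMap<>(MAX_POINTS);
    earned.put("Testing", 9);               // 6 points lost
    earned.put("Review responsiveness", 6); // 4 points lost
    int score = total(earned);
    System.out.println(score + " -> " + tier(score)); // 90 -> committer-ready
  }
}
```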

Timeline

The Capstone is a 4–6 week effort if you have one focused evening per weekday plus weekend mornings. Less than that and you risk losing context between sessions (which is far more expensive than people expect for state-machine code).

Week	Steps	Hours
1	1–2: Pick an issue, build a deterministic reproducer	10–15
2	3–4: Trace execution, identify root cause	12–18
3	5–6: Implement fix, write unit + integration tests	12–18
4	7–8: Validate, prepare patch / PR, push	8–12
5	8–9: Review iteration (two or three rounds is normal)	6–10
6	10: Write-up, JIRA cleanup, retrospective	4–6
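
Summing the Hours column gives the overall budget to plan against. A small sketch (hypothetical class name, numbers copied from the table):

```java
/** Adds up the low and high ends of the weekly hour ranges from the timeline table. */
final class CapstoneBudget {
  // One entry per week: {low, high}.
  static final int[][] WEEKLY_HOURS = {
      {10, 15}, {12, 18}, {12, 18}, {8, 12}, {6, 10}, {4, 6}};

  /** column 0 sums the low estimates, column 1 the high estimates. */
  static int sum(int column) {
    int total = 0;
    for (int[] week : WEEKLY_HOURS) {
      total += week[column];
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(sum(0) + " to " + sum(1) + " hours"); // 52 to 79 hours
  }
}
```

So the six weeks add up to roughly 52 to 79 hours of focused work.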

If you blow past six weeks, that is a signal — not a failure. Either the issue is larger than it looked (in which case, pause and renegotiate scope in the JIRA), or you are stuck on a specific step (in which case, ask on dev@tez.apache.org).

Success Indicators

You will know it is working when:

A committer comments "+1" or "LGTM, will commit shortly" on your PR.
Your fix appears in git log apache/master, and a backport whose message includes (cherry picked from commit ...) lands on the next release branch.
The JIRA you claimed flips to "Resolved / Fixed in X.Y.Z" with your name on it.
Your write-up gets traffic — search-engine hits, a comment from another contributor, a question on user@.
The next time you pick a JIRA, you reach root cause in days, not weeks.

You will know it is failing when:

You are still editing files in Step 5 with no failing test in hand from Step 2.
Your PR description says "I think this might fix it."
You have not run mvn test -pl tez-dag end-to-end in over a week.
You are arguing in PR comments instead of changing code or asking questions.

If you spot a failure signal, do not push through. Stop, reread the relevant step chapter, and reset.

How to Use This Chapter

Read all ten step-chapters once, end-to-end, before you start Step 1. You need the shape of the whole journey in your head — Step 4 (root cause) makes choices that Step 6 (testing) depends on; Step 8 (patch) assumes you have artifacts from Steps 2 and 7. Skim now, deep-read each as you arrive at it.

Then go to Step 1: Issue Selection. Pick the issue. The clock starts when you comment "Working on this" on the JIRA.

Validation / Self-check

Before starting Step 1, confirm:

You can produce, from memory, the file path of DAGAppMaster, DAGImpl, VertexImpl, TaskImpl, and AsyncDispatcher.
mvn clean install -DskipTests completes against your local ~/tez-src/ clone.
mvn test -pl tez-tests -Dtest=TestOrderedWordCount passes.
You have a capstone-work/ directory in your fork ready for the root-cause.md, validation.md, and writeup.md deliverables.
You have skimmed every step-chapter once.
You have set aside 4–6 calendar weeks with realistic time budget.
You have subscribed to dev@tez.apache.org (send an empty email to dev-subscribe@tez.apache.org, then reply to the confirmation message) and to issues@tez.apache.org.

Step 1: Issue Selection

Picking the wrong issue is the most expensive mistake in the Capstone. Two weeks of investigation on a JIRA that turns out to be a duplicate, a WONTFIX, or a multi-month rearchitecture is two weeks you do not get back. The goal of this step is not to find a perfect issue. It is to find a tractable issue that exercises the parts of Tez you actually know.

Budget: 1–3 days. If you are past day 4 and still triaging, your standards are too high.

Where the Real Issues Live

Apache Tez tracks issues in JIRA at:

https://issues.apache.org/jira/projects/TEZ

There is no good-first-issue label on Tez (unlike Hadoop). The closest proxies are newbie, very small subtasks of larger umbrellas, and stale unassigned bugs with reproducers attached. You will write your own JQL.

Starter JQL Queries

Run these in JIRA's "Advanced" search box. Open each in a separate tab; do not chase one result before you have seen the whole landscape.

1. Unassigned open bugs, sorted by recency:

project = TEZ AND status in (Open, "In Progress")
  AND assignee is EMPTY
  AND type = Bug
ORDER BY created DESC

2. Bugs with reproducers attached (the gold standard):

project = TEZ AND status = Open
  AND type = Bug
  AND attachments is not EMPTY
ORDER BY updated DESC

3. Newbie-labeled (small surface area):

project = TEZ AND status = Open
  AND (labels = newbie OR labels = beginner OR labels = "low-hanging-fruit")
ORDER BY priority DESC, created DESC

4. Flaky tests (Stage 9 territory, often great Capstone fodder):

project = TEZ AND status = Open
  AND (summary ~ "flaky" OR summary ~ "intermittent" OR description ~ "flaky")
ORDER BY votes DESC

5. Open bugs touching modules you know:

project = TEZ AND status = Open AND type = Bug
  AND (component in ("tez-dag", "tez-runtime-internals", "tez-runtime-library")
       OR summary ~ "VertexImpl"
       OR summary ~ "ShuffleManager"
       OR summary ~ "AsyncDispatcher")
ORDER BY created DESC

Cast a wide net. Pull 20+ candidates into a scratchpad. You will trim aggressively.
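
If you want each query as a bookmarkable link for your scratchpad, it can be percent-encoded into a search URL. This is a sketch under one assumption: that issues.apache.org uses the standard JIRA issue-navigator URL shape (/jira/issues/?jql=...). The class name is hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Turns a multi-line JQL query into a shareable search link (URL shape is an assumption). */
final class JqlLink {
  static final String SEARCH_BASE = "https://issues.apache.org/jira/issues/?jql=";

  static String toUrl(String jql) {
    // Collapse the line breaks used for readability, then percent-encode the query.
    String oneLine = jql.trim().replaceAll("\\s+", " ");
    return SEARCH_BASE + URLEncoder.encode(oneLine, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String jql = "project = TEZ AND status = Open\n"
        + "  AND type = Bug\n"
        + "ORDER BY updated DESC";
    System.out.println(toUrl(jql));
  }
}
```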

Triage: Pick 5 Finalists from 20

For each candidate, spend 10–15 minutes — no more — answering this single question: "Could I write a failing test for this today?" If "no" or "I have no idea," drop it. If "probably yes, here's how," keep it.

Concrete triage protocol:

Read the JIRA description and every comment. Watch for "I cannot reproduce" or "this is a duplicate of TEZ-XXXX" buried at the bottom.
Check git log --grep "TEZ-NNNN" in your ~/tez-src/ clone — has it already been partially fixed?
Search the dev@ mailing list archive for the issue number: https://lists.apache.org/list.html?dev@tez.apache.org.
Open the linked files in your editor. Are they in tez-dag, tez-runtime-*, tez-api (familiar territory), or tez-ui, tez-plugins, tez-yarn-timeline-* (less familiar — skip unless you specifically studied them)?
Note the Affects-Versions field. If it only affects 0.8.x and master has been rewritten in the area, the fix may not be portable.
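
One caveat with the git log --grep check in the protocol above: the pattern matches substrings, so searching for a short issue number also matches longer ones that start with the same digits. A sketch of an exact-match scan over commit subjects (for example the output of git log --format=%s) is below; the class name is hypothetical and the sample subjects are made up.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts whole TEZ-NNNN ids from commit subjects so TEZ-432 does not match TEZ-4321. */
final class JiraIdScan {
  private static final Pattern TEZ_ID = Pattern.compile("\\bTEZ-\\d+\\b");

  static Set<String> idsIn(List<String> commitSubjects) {
    Set<String> ids = new LinkedHashSet<>();
    for (String subject : commitSubjects) {
      Matcher m = TEZ_ID.matcher(subject);
      while (m.find()) {
        ids.add(m.group());
      }
    }
    return ids;
  }

  static boolean mentions(List<String> commitSubjects, String jiraId) {
    return idsIn(commitSubjects).contains(jiraId);
  }

  public static void main(String[] args) {
    // Made-up subjects for illustration only.
    List<String> subjects = List.of(
        "TEZ-4321: Example fix in the shuffle path",
        "TEZ-4400. Example follow-up (addendum to TEZ-4321)");
    System.out.println(idsIn(subjects));
    System.out.println(mentions(subjects, "TEZ-432"));
  }
}
```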

Keep the 5 finalists in a markdown table:

| TEZ-NNNN | Title | Component | Reproducer? | Last activity | My read |
|---|---|---|---|---|---|
| TEZ-4321 | Fetcher hangs on connection reset | tez-runtime-library | none | 2024-11 | Plausible; I know ShuffleManager |
| TEZ-4456 | VertexImpl NPE on V_ROUTE_EVENT after kill | tez-dag | stack trace only | 2025-02 | Race-y; familiar state machine |
| ... | | | | | |

Scoring Rubric

Score each finalist 0–2 in each column. The winner is the highest aggregate.

Criterion	0	1	2
Clarity	Description is one sentence and ambiguous	Description names symptom but not conditions	Clear symptom + reproduction conditions in description
Scope	Open-ended ("refactor X")	Bounded but spans modules	Bounded to one or two classes
Isolation	Requires Hive/Pig running	Needs `MiniTezCluster`	Can be reproduced in pure unit test
Testability	No clear failing assertion possible	Failing assertion possible after `MiniTezCluster` run	Failing assertion possible in `DrainDispatcher` test
Alignment	Touches code I have never read	Touches one familiar class	Touches 2–3 classes I have studied in Levels 4–6
Community engagement	Last activity > 2 years, no watchers	Some activity in last year	Recently discussed; a committer responded

Total possible: 12. Anything below 7 is risky. Pick the 9+ candidate.

Three Worked Examples

These are illustrative archetypes, not literal current JIRAs.

Candidate A: "ShuffleManager retries forever on `IOException: Connection reset`"

Clarity: 2 (description names the exception and the loop).
Scope: 2 (one class: ShuffleManager or Fetcher).
Isolation: 1 (need a fake Fetcher to inject the exception).
Testability: 2 (mock-based unit test with retry counter assertion).
Alignment: 2 (you read this in Level 5).
Community engagement: 1 (one committer comment, no resolution).
Total: 10. Pick this.

Candidate B: "Refactor `DAGImpl` state machine to use enum-based transitions"

Clarity: 1 (vague — "refactor").
Scope: 0 (touches DAGImpl, every event handler, every test).
Isolation: 0 (no failing behavior to test).
Testability: 0 (regression-only testing).
Alignment: 1 (you know DAGImpl but this is huge).
Community engagement: 0 (no committer +1).
Total: 2. Skip. This is a months-long design proposal, not a bug.

Candidate C: "Container reuse logs say `assigned` then `released` for same container"

Clarity: 2 (you can pull the log lines from the description).
Scope: 1 (touches TaskSchedulerManager and possibly YarnTaskSchedulerService).
Isolation: 0 (need MiniYARNCluster — slow, flaky, environment-sensitive).
Testability: 1 (assertions are on log content + scheduler state).
Alignment: 1 (you read TaskSchedulerManager once).
Community engagement: 2 (recent discussion).
Total: 7. Borderline. Pick only if you have no candidate above 8 and you budget extra time for the YARN harness.
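
The scoring is mechanical enough to put in a few lines of code. The sketch below (hypothetical class name) totals the six criteria and applies the thresholds from the rubric: below 7 is risky, 9 or more is a pick, and anything in between is labeled borderline, following Candidate C.

```java
/** Totals the six 0-2 rubric scores and maps the total to the chapter's verdicts. */
final class FinalistScore {
  // Order: clarity, scope, isolation, testability, alignment, community engagement.
  static int total(int... scores) {
    if (scores.length != 6) {
      throw new IllegalArgumentException("Expected 6 criteria, got " + scores.length);
    }
    int sum = 0;
    for (int s : scores) {
      if (s < 0 || s > 2) {
        throw new IllegalArgumentException("Each criterion is scored 0-2, got " + s);
      }
      sum += s;
    }
    return sum;
  }

  static String verdict(int total) {
    if (total >= 9) {
      return "pick";
    }
    if (total >= 7) {
      return "borderline";
    }
    return "risky";
  }

  public static void main(String[] args) {
    int a = total(2, 2, 1, 2, 2, 1); // Candidate A
    int b = total(1, 0, 0, 0, 1, 0); // Candidate B
    int c = total(2, 1, 0, 1, 1, 2); // Candidate C
    System.out.println("A: " + a + " " + verdict(a)); // A: 10 pick
    System.out.println("B: " + b + " " + verdict(b)); // B: 2 risky
    System.out.println("C: " + c + " " + verdict(c)); // C: 7 borderline
  }
}
```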

Claiming the Issue

Once you decide, claim it publicly. This is non-negotiable — it prevents wasted work by others, and it commits you.

JIRA comment template

Hi — I'd like to work on this as part of an extended Tez learning project.

My plan:
1. Build a deterministic reproducer (target: <date+1 week>).
2. Root-cause analysis (target: <date+2 weeks>).
3. Patch + tests posted for review (target: <date+4 weeks>).

I'll post weekly updates here. If anyone with context has pointers on
<specific question, e.g. "whether this race was discussed in TEZ-NNNN">,
I'd be grateful. Otherwise I'll start on the reproducer this week.

— <Your Name>

Then assign the JIRA to yourself (you need a JIRA account; the Tez PMC grants contributor role on request — comment "please grant contributor role" on any issue and a PMC member will action it within a few days).

If you get no response in 5 business days

Post to dev@tez.apache.org:

Subject: [TEZ-NNNN] Working on this — any context before I dive in?

Hi all,

I left a comment on TEZ-NNNN <link> last week saying I plan to work on it. No
objections so far, so I'm starting on a reproducer this week. If anyone has
historical context — especially whether this overlaps with TEZ-XXXX — please
shout. Otherwise I'll update the JIRA as I make progress.

Thanks,
<Your Name>

If still no response after another week, proceed. Silence on a small bug is permission. (Silence on a redesign proposal is not — different beast.)

Red Flags: Issues to Skip

Last comment is from a committer saying "we should think about this more." You are not the right person to land a design call.
Open for >5 years with multiple abandoned patches. Something is structurally hard. Not Capstone material — pick later.
Touches tez-ui (Ember 1.x). The UI is on a separate lifecycle; build and test setup is divergent from the JVM modules you studied.
"Upgrade dependency X to version Y." Looks easy, ends up rebuilding the shuffle stack to handle a Guava API change. Skip unless you specifically want this experience.
Critical or Blocker priority with no patch. A committer would already be on it. If they are not, the issue may be misclassified or stale-critical.
Reproducer requires a specific Hive version + a 1TB TPC-DS run. No.

Validation / Self-check

Before you advance to Step 2, produce:

A markdown table of your 5 finalists with full scoring rubric, saved as capstone-work/issue-shortlist.md.
The TEZ-NNNN number of your chosen issue, posted as a JIRA comment claiming it.
A 1-paragraph statement of why you picked it (which two criteria scored highest and which scored lowest).
A self-assigned target date for Step 2 (deterministic reproducer in hand).
Subscription confirmed to dev@tez.apache.org and the JIRA itself (click the "Start watching" eye icon).
Your fork of apache/tez exists on GitHub with a branch named tez-NNNN-<short-slug> checked out locally.
A note in capstone-work/issue-shortlist.md of any near-miss candidates you may revisit after the Capstone — these are your next contributions.
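
The tez-NNNN-<short-slug> branch name can be derived from the JIRA title. A minimal sketch, assuming you want a lowercase, hyphen-separated slug cut to a few words; the class name and the word limit are my own choices, not a Tez convention.

```java
import java.util.Arrays;
import java.util.Locale;

/** Builds a branch name like tez-4321-fetcher-hangs-on from a JIRA id and title. */
final class BranchName {
  static String forIssue(String jiraId, String title, int maxWords) {
    String slug = title.toLowerCase(Locale.ROOT)
        .replaceAll("[^a-z0-9]+", "-")  // collapse anything non-alphanumeric to a hyphen
        .replaceAll("^-+|-+$", "");     // trim leading and trailing hyphens
    String id = jiraId.toLowerCase(Locale.ROOT);
    if (slug.isEmpty() || maxWords <= 0) {
      return id;
    }
    String[] words = slug.split("-");
    String shortSlug = String.join("-", Arrays.copyOf(words, Math.min(maxWords, words.length)));
    return id + "-" + shortSlug;
  }

  public static void main(String[] args) {
    System.out.println(forIssue("TEZ-4321", "Fetcher hangs on connection reset", 3));
  }
}
```

Usage is then git checkout -b followed by the printed name.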

Step 2: Reproduction

You do not have a bug until you have a failing test. Stack traces in JIRA comments are circumstantial evidence; a deterministic, automated reproducer is proof. Until you have one, every hypothesis in Step 4 is unverifiable and every "fix" in Step 5 is theater.

Goal of this step: a JUnit test that fails on a clean checkout of apache/tez:master without your patch, in under two minutes, on five out of five runs.
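
The "five out of five" bar can be checked mechanically. The sketch below is a plain-Java stand-in for rerunning a test: it runs a body repeatedly and counts how many runs fail with an AssertionError. The class and method names are hypothetical; in practice you would rerun the JUnit test itself.

```java
/** Counts how often a reproducer body fails; a usable repro fails on every run. */
final class ReproStability {
  /** Runs the body the given number of times and returns how many runs threw AssertionError. */
  static int countFailures(Runnable body, int runs) {
    int failures = 0;
    for (int i = 0; i < runs; i++) {
      try {
        body.run();
      } catch (AssertionError expected) {
        failures++;
      }
    }
    return failures;
  }

  /** True only if every single run reproduced the failure. */
  static boolean isDeterministicRepro(Runnable body, int runs) {
    return countFailures(body, runs) == runs;
  }

  public static void main(String[] args) {
    Runnable alwaysFails = () -> {
      throw new AssertionError("expected SUCCEEDED but was FAILED");
    };
    System.out.println(isDeterministicRepro(alwaysFails, 5)); // true
  }
}
```

A body that fails three times out of five is a flaky test, not a reproducer; keep tightening the event ordering until the count equals the number of runs.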

Where Reproducers Live

MiniTezCluster is the Tez-specific harness that boots an in-process YARN cluster plus a DAGAppMaster against the local filesystem. It is the closest thing to a real deployment that you can debug from your IDE.

find ~/tez-src/tez-tests -name "MiniTezCluster.java"
# tez-tests/src/test/java/org/apache/tez/test/MiniTezCluster.java

Read it first, then read one consumer:

grep -n "MiniTezCluster" \
  ~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestTezJobs.java
grep -n "MiniTezCluster" \
  ~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestOrderedWordCount.java

TestTezJobs is the canonical "wire up a real cluster, submit a small DAG, assert on the output" example. TestOrderedWordCount is the lighter-weight end-to-end sanity check.

For pure unit-level reproducers (no YARN, no shuffle), use the patterns in:

~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java

These use DrainDispatcher, an asynchronous dispatcher whose await() blocks until the event queue has drained, which lets you control event ordering deterministically (see Step 6 for the full pattern).

Three Reproducer Templates

Pick the template that matches your issue type.

Template A: Race-Condition Reproducer (state-machine level)

When the bug is "two events arrive in an unexpected order and the state machine NPEs / wedges / drops a task," you need DrainDispatcher plus controlled event ordering. No MiniTezCluster.

package org.apache.tez.dag.app.dag.impl;

// Confirm the DrainDispatcher import against TestVertexImpl in your checkout.
import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.tez.dag.app.AppContext;
import org.apache.tez.dag.app.dag.event.VertexEventTaskCompleted;
import org.apache.tez.dag.app.dag.event.VertexEventSourceTaskAttemptCompleted;
import org.apache.tez.dag.records.TezTaskID;
import org.junit.Before;
import org.junit.Test;

import static org.junit.Assert.assertEquals;

public class TestVertexImplTezNNNNRepro {

  private DrainDispatcher dispatcher;
  private VertexImpl vertex;
  private AppContext appContext;

  @Before
  public void setUp() {
    dispatcher = new DrainDispatcher();
    dispatcher.register(VertexEventType.class, vertexEventHandler());
    dispatcher.start();
    // vertexEventHandler(), MockAppContext.create() and createVertex() are placeholders
    // for the fixture code in TestVertexImpl. Read its setUp() carefully.
    appContext = MockAppContext.create();
    vertex = createVertex(appContext, dispatcher);
    vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
    dispatcher.await();
  }

  @Test
  public void reproTaskCompletionBeforeRouteEvent() throws Exception {
    // 1. Drive vertex to RUNNING.
    vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_START));
    dispatcher.await();
    assertEquals(VertexState.RUNNING, vertex.getState());

    // 2. Inject a task completion BEFORE the V_ROUTE_EVENT that the bug requires
    //    has been processed. This is the race window from the JIRA.
    TezTaskID t0 = vertex.getTask(0).getTaskId();
    vertex.handle(new VertexEventTaskCompleted(t0, TaskState.SUCCEEDED));

    // Do NOT call dispatcher.await() yet — interleave a second event.
    vertex.handle(new VertexEventSourceTaskAttemptCompleted(...));

    dispatcher.await();

    // 3. Assertion that fails on master, passes with fix.
    assertEquals(VertexState.SUCCEEDED, vertex.getState());
    //                     ^^^^^^^^^^^ on master this is FAILED due to the race
  }
}

Key principles:

Drive the state machine by handing events to vertex.handle() directly, not by going through a scheduler.
Use dispatcher.await() to deterministically drain the queue between phases.
The failing assertion is on a getState() or counter, not on log output.
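
To see why draining between phases makes a test deterministic, here is a toy model of the idea in plain Java. It is not the DrainDispatcher class used by the Tez tests: ToyDrainDispatcher is a hypothetical, simplified stand-in with one worker thread and a pending-event counter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy model of the drain idea; not the DrainDispatcher used by Tez tests. */
final class ToyDrainDispatcher {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Object lock = new Object();
  private int pending; // events queued but not yet fully handled; guarded by lock

  ToyDrainDispatcher() {
    Thread worker = new Thread(this::loop, "toy-dispatcher");
    worker.setDaemon(true);
    worker.start();
  }

  /** Queues an event; safe to call from inside another event's handler. */
  void dispatch(Runnable event) {
    synchronized (lock) {
      pending++;
    }
    queue.add(event);
  }

  /** Blocks until every event, including ones queued by handlers, has run. */
  void await() {
    synchronized (lock) {
      try {
        while (pending > 0) {
          lock.wait();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while draining", e);
      }
    }
  }

  private void loop() {
    try {
      while (true) {
        Runnable event = queue.take();
        try {
          event.run();
        } finally {
          synchronized (lock) {
            pending--;
            lock.notifyAll();
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    ToyDrainDispatcher d = new ToyDrainDispatcher();
    List<String> seen = Collections.synchronizedList(new ArrayList<>());
    d.dispatch(() -> {
      seen.add("V_START");
      d.dispatch(() -> seen.add("follow-up")); // handler queues a secondary event
    });
    d.await(); // phase boundary: both events above have run
    d.dispatch(() -> seen.add("TASK_COMPLETED"));
    d.await();
    System.out.println(seen); // [V_START, follow-up, TASK_COMPLETED]
  }
}
```

Without the first await(), the follow-up event and TASK_COMPLETED would race for queue position. That race is the one a reproducer has to pin down on purpose.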

Template B: Configuration / Validation Reproducer

When the bug is "setting tez.foo=bar is silently ignored / produces wrong behavior," reproduce at the API layer.

@Test
public void testConfigKeyHonored() throws Exception {
  TezConfiguration conf = new TezConfiguration();
  conf.set(TezConfiguration.TEZ_AM_FOO_BAR, "42");

  DAG dag = DAG.create("test-dag");
  Vertex v = Vertex.create("v1", ProcessorDescriptor.create(NoOpProcessor.class.getName()), 4);
  dag.addVertex(v);

  // The component under test reads conf — instantiate it directly.
  FooComponent foo = new FooComponent(conf);
  assertEquals(42, foo.getEffectiveValue());
  //                ^^ on master this is the default (e.g. 100) because conf is ignored
}

No cluster, no DAG submission. Just instantiate the class that reads the config and assert the effective value. The fix usually changes one conf.get() call.

Template C: Shuffle / Correctness Reproducer

When the bug is "output is wrong" (missing rows, duplicated rows, partial sort), you need MiniTezCluster and a small DAG with deterministic input.

public class TestShuffleCorrectnessTezNNNN {

  private static MiniTezCluster mrrTezCluster;
  private static FileSystem fs;

  @BeforeClass
  public static void setup() throws Exception {
    Configuration conf = new Configuration();
    fs = FileSystem.getLocal(conf);
    mrrTezCluster = new MiniTezCluster("TestShuffleRepro", 1, 1, 1);
    mrrTezCluster.init(conf);
    mrrTezCluster.start();
  }

  @AfterClass
  public static void cleanup() throws Exception {
    if (mrrTezCluster != null) {
      mrrTezCluster.stop();
    }
  }

  @Test(timeout = 120_000)
  public void reproPartitionedOutputMissingRows() throws Exception {
    Path inputDir = new Path("/tmp/repro-input-" + System.nanoTime());
    Path outputDir = new Path("/tmp/repro-output-" + System.nanoTime());
    writeKnownInput(fs, inputDir, /*rows=*/ 10_000);

    TezConfiguration tezConf = new TezConfiguration(mrrTezCluster.getConfig());
    DAG dag = buildTwoVertexDAG(inputDir, outputDir);

    TezClient client = TezClient.create("repro", tezConf);
    client.start();
    try {
      DAGClient dagClient = client.submitDAG(dag);
      DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(null);
      assertEquals(DAGStatus.State.SUCCEEDED, status.getState());

      long outputRowCount = countRows(fs, outputDir);
      // On master this is 9_973 (27 rows lost in shuffle). With fix: 10_000.
      assertEquals(10_000L, outputRowCount);
    } finally {
      client.stop();
    }
  }
}

Build with deterministic input (fixed seed if random) so the missing-row count is reproducible across runs.
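
One way to get deterministic input is to derive every row from a fixed seed. The sketch below is a hypothetical generator (DeterministicInputSketch and generateRows are illustrative names); a helper such as writeKnownInput could write these lines to the input directory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of deterministic test input: the same seed always yields the same rows.
final class DeterministicInputSketch {
  // Generates `rows` lines of "key<TAB>value" from a fixed seed.
  static List<String> generateRows(int rows, long seed) {
    Random random = new Random(seed);        // fixed seed, never the clock
    List<String> out = new ArrayList<>(rows);
    for (int i = 0; i < rows; i++) {
      int key = random.nextInt(1_000);       // bounded key space forces repeated keys
      out.add(key + "\t" + i);               // value is the row index, so each row is unique
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> first = generateRows(10_000, 42L);
    List<String> second = generateRows(10_000, 42L);
    System.out.println("rows=" + first.size() + " identical=" + first.equals(second));
  }
}
```

Because every row is unique, a row count or a set difference on the output identifies exactly which rows went missing.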

Logging: See What the State Machine Is Actually Doing

A reproducer without logs is half a reproducer. You will spend Step 4 staring at these logs.

Drop this into your test resources at src/test/resources/log4j.properties (or log4j2.properties for newer modules — check which the module uses):

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss.SSS} %-5p [%t] %c{1}: %m%n

# Tez AM internals — the state-machine event log lives here
log4j.logger.org.apache.tez.dag.app.DAGAppMaster=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.DAGImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.VertexImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.TaskImpl=DEBUG
log4j.logger.org.apache.tez.dag.app.dag.impl.TaskAttemptImpl=DEBUG

# Async dispatcher event flow
log4j.logger.org.apache.tez.common.AsyncDispatcher=DEBUG

# Runtime task lifecycle
log4j.logger.org.apache.tez.runtime.task=DEBUG
log4j.logger.org.apache.tez.runtime.LogicalIOProcessorRuntimeTask=DEBUG

# Shuffle internals
log4j.logger.org.apache.tez.runtime.library.common.shuffle=DEBUG
log4j.logger.org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager=DEBUG
log4j.logger.org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Fetcher=DEBUG

# Scheduler
log4j.logger.org.apache.tez.dag.app.rm.TaskSchedulerManager=DEBUG
log4j.logger.org.apache.tez.dag.app.rm.YarnTaskSchedulerService=DEBUG

The two most useful patterns to grep for in the output:

grep -E "VertexImpl|TaskImpl|TaskAttemptImpl" target/surefire-reports/*.txt \
  | grep -E "state|State|Event|EVENT"

That gives you the state-transition trace, which is what you'll diagram in Step 3.

Capturing container logs from `MiniTezCluster`

MiniTezCluster writes container logs (where your tasks' stderr/stdout end up) under the surefire working directory:

<module>/target/<test-class>-tmpDir/<application-id>/container-logs/

Or, in newer YARN versions:

<module>/target/MiniMRYarnCluster-localDir-nm-X_Y/usercache/<user>/appcache/<app>/container_*/

Find them with:

find ~/tez-src/tez-tests/target -name "syslog" -path "*container*" -mmin -30

Read syslog (TaskAttempt logs) and stderr (uncaught exceptions). The prelaunch.out and directory.info files explain what was actually launched.

Verify Determinism

Five runs. If even one is green, your reproducer is not deterministic yet — it is a coin flip you happen to have caught. Fix the race window before declaring victory.

cd ~/tez-src
for i in 1 2 3 4 5; do
  echo "=== Run $i ==="
  mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q 2>&1 \
    | tail -20
done

Expected output: five FAILs with the same assertion failure on the same line.

If you see 4 FAIL / 1 PASS:

Adding a Thread.sleep is the wrong answer. (Reread Step 6.)
Insert an explicit event ordering: drain the dispatcher between every event, inject the conflicting events as a Future you control.
Use CountDownLatch to gate the producer thread until the consumer is at a known state.
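
The latch technique from the list above can be sketched in plain Java. This is a minimal illustration with hypothetical names (LatchGateSketch, the state strings); in a real Tez test the "consumer" would be the component under test and the "producer" the thread that injects the conflicting event.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: gate a producer thread until the consumer has reached a known state.
final class LatchGateSketch {
  // Returns the consumer state that the producer observed once it was released.
  static String runOnce() {
    AtomicReference<String> consumerState = new AtomicReference<>("NEW");
    AtomicReference<String> observed = new AtomicReference<>("NOT_RUN");
    CountDownLatch consumerReady = new CountDownLatch(1);

    Thread producer = new Thread(() -> {
      try {
        // Block until the consumer signals; no sleep, no guessing at timing.
        if (consumerReady.await(5, TimeUnit.SECONDS)) {
          observed.set(consumerState.get());
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.start();

    consumerState.set("RUNNING");   // consumer reaches the state the test needs
    consumerReady.countDown();      // only now may the producer proceed
    try {
      producer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while joining producer", e);
    }
    return observed.get();
  }

  public static void main(String[] args) {
    System.out.println("producer observed " + runOnce());
  }
}
```

The ordering is enforced by the latch rather than by elapsed time, so the result does not depend on how fast the machine is.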

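If you cannot get to 5/5 fails, the bug may genuinely depend on external timing (network, GC). In that case, escalate to a stress-test pattern: run the inner test body 100 times inside a single test method and assert that the failure rate is above 50%. (JUnit 5's @RepeatedTest reports each repetition as a separate test, so on its own it cannot assert an aggregate rate.) This is less ideal but acceptable for some shuffle race bugs.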
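
A minimal sketch of that stress-test fallback, assuming the flaky body can be wrapped in a BooleanSupplier that returns true on a pass. FailureRateSketch and failureRate are hypothetical names, and the body used in main is a deterministic stand-in for a real race so the example is repeatable.

```java
import java.util.function.BooleanSupplier;

// Sketch: run a flaky body many times in one test method and measure how often it fails.
final class FailureRateSketch {
  // Returns the fraction of runs in which the body reported false or threw.
  static double failureRate(int runs, BooleanSupplier bodyPassed) {
    int failures = 0;
    for (int i = 0; i < runs; i++) {
      boolean passed;
      try {
        passed = bodyPassed.getAsBoolean();
      } catch (RuntimeException | AssertionError e) {
        passed = false;               // an assertion failure counts as a failed run
      }
      if (!passed) {
        failures++;
      }
    }
    return (double) failures / runs;
  }

  public static void main(String[] args) {
    // Stand-in body: passes only on every fourth run.
    int[] counter = {0};
    double rate = failureRate(100, () -> counter[0]++ % 4 == 0);
    System.out.println("failure rate = " + rate);
  }
}
```

In the real test, the final assertion would be on the returned rate (for example, that it exceeds 0.5) rather than on any single run.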

Validation / Self-check

By the end of Step 2 you must have:

A new test file under <module>/src/test/java/... named Test<Component>Tez<NNNN>Repro.java (the Repro suffix is for your workflow; you'll rename it to a real test name in Step 6).
The test fails on a clean ~/tez-src/ at master with an assertion error (not a setup error, not a timeout — an assertion error).
Five consecutive runs produce the same failure on the same line.
The failure happens in under 120 seconds per run.
A log4j.properties snippet in src/test/resources/ enabling debug logging on the relevant Tez packages.
A captured log excerpt (paste into capstone-work/repro-logs.txt) showing the state-machine trace at the moment of failure.
A one-paragraph description of the failure mode in your own words, saved to capstone-work/repro-summary.md. You will refine this into the root-cause document in Step 4.

Step 3: Execution Path Analysis

You have a failing test. Now you map the path the request takes from the moment TezClient.submitDAG() is called, through every event, dispatcher hop, and state transition, until the failure manifests. This map is the foundation for every hypothesis in Step 4. A wrong map produces a wrong root cause.

Budget: 2–4 evenings. The work is reading code, grep, and drawing.

The Canonical Submit Path

Every DAG that fails went through this skeleton path before it failed. Memorize it; you will use it as the reference axis when you sketch where your particular failure deviates.

TezClient.submitDAG(DAG)
    [tez-api/src/main/java/org/apache/tez/client/TezClient.java]
        |
        v
TezClient.submitDAGSession() or submitDAGApplication()
        |  (session vs. non-session — see TezClient.java for branch)
        v
DAGClientHandler.submitDAG(...)
    [tez-dag/src/main/java/org/apache/tez/dag/api/client/DAGClientHandler.java]
        |
        v
DAGAppMaster.submitDAGToAppMaster(...)
    [tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java]
        |
        v
DAGAppMaster.startDAG(...)
        |  - builds DAGImpl
        |  - emits DAGEventType.DAG_INIT
        v
AsyncDispatcher.dispatch(DAGEvent)
    [tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java]
    (Tez ships its own dispatcher, modeled on the AsyncDispatcher in Hadoop's
     hadoop-yarn-common; confirm the location in your checkout)
        |
        v
DAGImpl.handle(DAGEvent)
    [tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java]
        |  state NEW --DAG_INIT--> INITED
        |  emits DAGEventType.DAG_START
        v
DAGImpl on DAG_START
        |  state INITED --DAG_START--> RUNNING
        |  for each Vertex: emits VertexEvent V_INIT
        v
VertexImpl.handle(VertexEventType.V_INIT)
    [tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java]
        |  state NEW --V_INIT--> INITIALIZING
        |  invokes VertexManagerPlugin.initialize()
        |  on success emits V_INITED
        v
VertexImpl on V_INITED -> on V_START
        |  state INITED --V_START--> RUNNING
        |  schedules tasks via TaskImpl events (T_SCHEDULE)
        v
TaskImpl.handle(T_SCHEDULE)
    [tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java]
        |  state NEW --T_SCHEDULE--> SCHEDULED
        |  spawns a TaskAttemptImpl, emits TA_SCHEDULE
        v
TaskAttemptImpl.handle(TA_SCHEDULE)
    [tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java]
        |  state NEW --TA_SCHEDULE--> START_WAIT
        |  requests container from TaskSchedulerManager
        v
TaskSchedulerManager / YarnTaskSchedulerService
    [tez-dag/src/main/java/org/apache/tez/dag/app/rm/]
        |  assigns container, emits TA_CONTAINER_LAUNCHED
        v
TaskAttemptImpl receives TA_CONTAINER_LAUNCHED
        |  state START_WAIT --TA_CONTAINER_LAUNCHED--> RUNNING
        |  the container is now actually running our task
        v
[ container process boots ]
TezTaskRunner2.run()
    [tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java]
        |
        v
TezChild / TaskRunner instantiates LogicalIOProcessorRuntimeTask
        |
        v
LogicalIOProcessorRuntimeTask.run()
    [tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java]
        |  initializes Inputs, Outputs, Processor
        |  calls Processor.run(inputs, outputs)
        v
[ user code runs — e.g. OrderedWordCount or your DAG's processor ]
        |
        v
heartbeat -> TaskAttemptListener -> TaskAttemptImpl TA_DONE / TA_FAILED

That is the skeleton. Your job in this step is to find the segment where your failure occurs and draw it with line numbers.

Run These Greps

These greps locate the actual file paths and method bodies on your local clone. Run them in ~/tez-src/. Each one gives you a line number to open.

# Entry: submitDAG
grep -n "public.*submitDAG" \
  tez-api/src/main/java/org/apache/tez/client/TezClient.java

# Server-side intake
grep -n "submitDAG\|startDAG" \
  tez-dag/src/main/java/org/apache/tez/dag/api/client/DAGClientHandler.java \
  tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java

# DAGImpl handlers
grep -nE "addTransition|stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/DAGImpl.java | head -40

# VertexImpl state machine
grep -nE "addTransition|stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java | head -60

# TaskImpl state machine
grep -nE "addTransition|stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskImpl.java | head -60

# TaskAttemptImpl state machine
grep -nE "addTransition|stateMachineFactory" \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java | head -80

# Dispatcher
grep -n "class AsyncDispatcher\|dispatch\b" \
  tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java

# Runtime task entry
grep -n "public void run\|class TezTaskRunner2" \
  tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TezTaskRunner2.java

grep -n "public void run\|initialize\|class LogicalIOProcessorRuntimeTask" \
  tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java

Open each line in your editor. Read the transition table. Note which event you care about and which state(s) it is legal in.
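
If the phrase "legal in" is unfamiliar, the following sketch models a transition table as a nested map. The enums are a hypothetical, trimmed-down vertex lifecycle chosen to match the skeleton above; they are not Tez's real state machine classes.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

// Sketch of a state-machine transition table: (state, event) -> next state.
final class TransitionTableSketch {
  enum State { NEW, INITIALIZING, INITED, RUNNING }
  enum Event { V_INIT, V_INITED, V_START }

  private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);

  private static void addTransition(State from, Event on, State to) {
    TABLE.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
  }

  static {
    addTransition(State.NEW, Event.V_INIT, State.INITIALIZING);
    addTransition(State.INITIALIZING, Event.V_INITED, State.INITED);
    addTransition(State.INITED, Event.V_START, State.RUNNING);
  }

  // An event is legal in a state only if the table has a row for that pair.
  static boolean isLegal(State state, Event event) {
    Map<Event, State> row = TABLE.getOrDefault(state, Collections.emptyMap());
    return row.containsKey(event);
  }

  static State next(State state, Event event) {
    if (!isLegal(state, event)) {
      throw new IllegalStateException("Invalid event " + event + " in state " + state);
    }
    return TABLE.get(state).get(event);
  }

  public static void main(String[] args) {
    System.out.println(next(State.NEW, Event.V_INIT));
    System.out.println(isLegal(State.RUNNING, Event.V_INIT));
  }
}
```

When you read the real table, the question to answer for your event is the same one isLegal answers here: does a row exist for the state the object was actually in?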

Locate Your Specific Failure Segment

The skeleton is the highway; your bug is at one specific exit. Use these heuristics:

Symptom in repro logs	Likely segment
`VertexImpl ... transitioned from RUNNING to FAILED`	`VertexImpl` state machine — transition on `V_TASK_RESCHEDULED` or `V_INTERNAL_ERROR`
`TaskAttemptImpl ... NPE`	`TaskAttemptImpl` event handlers; check container-launched and TA_DONE paths
`NPE in AsyncDispatcher.dispatch`	Race between dispatcher start/stop and event submission
`ShuffleManager: too many fetch failures`	`Fetcher` retry/timeout; `ShuffleManager.fetchFailure()`
`IFile checksum mismatch`	`IFile.Writer`/`Reader`; check spill+merge
`OutOfMemory ... GROUP_COMPARATOR`	`MergeManager` memory math; IFile spill thresholds
`Container released before TA_DONE`	`TaskSchedulerManager` reuse path; check container release races

Once you know your segment, draw it.

Build the Path Diagram

Two formats. Do both — they validate each other.

Text-arrow form (paste into the root-cause doc)

Use this in JIRA comments and PR descriptions. It survives any rendering.

TezClient.submitDAG (TezClient.java:485)
  -> DAGClientHandler.submitDAG (DAGClientHandler.java:152)
  -> DAGAppMaster.startDAG (DAGAppMaster.java:1234)
  -> DAGImpl NEW --DAG_INIT--> INITED (DAGImpl.java:340)
  -> DAGImpl INITED --DAG_START--> RUNNING (DAGImpl.java:380)
  -> VertexImpl v1 NEW --V_INIT--> INITIALIZING (VertexImpl.java:1820)
  -> VertexImpl v1 INITIALIZING --V_INITED--> INITED (VertexImpl.java:1856)
  -> VertexImpl v1 INITED --V_START--> RUNNING (VertexImpl.java:1901)
  -> [21 TaskImpl T_SCHEDULE events fired]
  -> TaskImpl t0 NEW --T_SCHEDULE--> SCHEDULED (TaskImpl.java:412)
  -> TaskAttemptImpl t0.0 NEW --TA_SCHEDULE--> START_WAIT (TaskAttemptImpl.java:560)
  -> [container assigned]
  -> TaskAttemptImpl t0.0 START_WAIT --TA_CONTAINER_LAUNCHED--> RUNNING (...:610)
  -> [container starts LogicalIOProcessorRuntimeTask]
  -> ShuffleManager.run starts fetcher loop
  -> Fetcher.fetchNext throws IOException (Fetcher.java:289)  <-- FAILURE HERE
  -> ShuffleManager.fetchFailure -> InputReadErrorEvent
  -> TaskAttemptImpl t0.0 RUNNING --TA_FAILED--> FAILED

Cite real line numbers from your checkout. Future-you will thank you.

Mermaid diagram (for the write-up and PR)

sequenceDiagram
    participant C as Client
    participant AM as DAGAppMaster
    participant D as DAGImpl
    participant V as VertexImpl v1
    participant T as TaskImpl t0
    participant TA as TaskAttempt t0.0
    participant SM as ShuffleManager
    participant F as Fetcher

    C->>AM: submitDAG
    AM->>D: DAG_INIT
    D->>D: NEW -> INITED
    AM->>D: DAG_START
    D->>V: V_INIT
    V->>V: NEW -> INITIALIZING -> INITED
    D->>V: V_START
    V->>T: T_SCHEDULE
    T->>TA: TA_SCHEDULE
    TA->>TA: NEW -> START_WAIT
    Note over TA: container assigned + launched
    TA->>TA: START_WAIT -> RUNNING
    TA->>SM: shuffle starts
    SM->>F: fetchNext
    F-->>SM: IOException
    SM->>TA: InputReadErrorEvent (TA_FAILED)
    TA->>TA: RUNNING -> FAILED

Both diagrams say the same thing. Together they pass review with a committer because they prove you actually read the code instead of paraphrasing the JIRA.

Verify Empirically with Temporary `LOG.info()` Probes

The map is a hypothesis. Confirm it with probes. Add temporary logging at the points you think your event traverses. Pattern:

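// VertexImpl normally already declares this logger; reuse it instead of adding a second field:
//   private static final Logger LOG = LoggerFactory.getLogger(VertexImpl.class);

// Inside the handler you suspect (temporary; delete before committing):
LOG.info("PROBE-TEZ{}: V_INIT entered for vertex={} state={}",
    "NNNN", getName(), getState());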

Rules for probes:

Prefix every probe with PROBE-TEZ<NNNN> so you can grep them in one pass and delete in one pass.
Use LOG.info not LOG.debug so they appear without changing log config.
Include the field values you care about (state, event type, IDs).
Never commit probes. They are scaffolding for Step 4.

After re-running your test:

mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q 2>&1 \
  | grep "PROBE-TEZNNNN" | tee /tmp/probe-trace.txt

Compare the probe trace to your diagram. Discrepancies are the most valuable output of this whole step — they are exactly where your mental model differs from the code.

Common discrepancies to watch for:

"I thought this handler ran once. It ran three times." (Re-entrancy bug.)
"I thought events arrived in order A,B,C. They arrived B,A,C." (Async dispatch reordering.)
"I thought the vertex was in RUNNING. It was in INITED." (Wrong assumption about state at the time of the event.)

When a probe surprises you, do not delete the probe. Lean in. That is the shortest path to root cause.
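
Comparing the probe trace to the diagram by eye gets error-prone once the trace is long. A small helper along these lines can point at the first divergence; it is a sketch with hypothetical names, and it assumes you have already reduced both the diagram and the trace to ordered lists of event labels.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: find the first point where a probe trace departs from the diagrammed path.
final class ProbeTraceDiffSketch {
  // Returns the index of the first mismatch, or -1 when the sequences agree.
  static int firstDivergence(List<String> expected, List<String> actual) {
    int shared = Math.min(expected.size(), actual.size());
    for (int i = 0; i < shared; i++) {
      if (!expected.get(i).equals(actual.get(i))) {
        return i;
      }
    }
    // One sequence is a strict prefix of the other: they diverge where the shorter ends.
    return expected.size() == actual.size() ? -1 : shared;
  }

  public static void main(String[] args) {
    List<String> diagram = Arrays.asList("V_INIT", "V_INITED", "V_START", "T_SCHEDULE");
    List<String> trace = Arrays.asList("V_INIT", "V_START", "V_INITED", "T_SCHEDULE");
    System.out.println("first divergence at index " + firstDivergence(diagram, trace));
  }
}
```

The index it reports is where to start reading code again, because that is the first event your model predicted incorrectly.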

Output

Your Step 3 deliverables live in capstone-work/execution-path/:

path-skeleton.md — text-arrow form with line numbers.
path.mmd — the mermaid source.
probe-trace.txt — grep output from the probe run.
notes.md — three to five surprises you found while reading.

Validation / Self-check

Before you advance to Step 4, you must:

Be able to name, from memory, every state transition between TezClient.submitDAG() and your failure point.
Have file:line citations for every transition in your diagram, against your ~/tez-src/ HEAD.
Have run the repro with PROBE-TEZ<NNNN> log statements and confirmed the sequence matches your diagram (or, more usefully, noted where it diverges).
Have removed every probe from your working tree before any commit (git diff should not contain "PROBE-").
Have at least one "surprise" noted in notes.md — if you have zero, you did not look hard enough.
Be able to answer: "Which event, in which state, on which class, fires the handler that produces the failure?" in one sentence.
Have the mermaid diagram render without syntax errors (mdbook serve your capstone-work folder, or paste into mermaid.live).

Step 4: Root Cause Identification

A symptom is "the test fails." A root cause is "this specific line, in this specific state, when this specific event arrives, performs this specific incorrect operation, because of this specific design assumption that no longer holds." If your statement does not have that shape, you have not found root cause yet.

This step is mostly thinking. The tools are five-whys, git blame, and git bisect. The output is a 200–500 word root-cause document and a tested hypothesis.

Five Whys, Applied to a State-Machine Race

The five-whys technique sounds trite. It is not. The discipline of asking "why" five times in a row forces you past the first plausible explanation (almost always wrong) and into the actual design defect (almost always two or three levels deeper than you initially thought).

Worked example: vertex stays in `RUNNING` after all tasks succeed

Symptom from Step 2: assertEquals(SUCCEEDED, vertex.getState()) fails with expected SUCCEEDED but was RUNNING. Repro is deterministic at 5/5.

Why 1: Why is the vertex still in RUNNING?

Because the transition to SUCCEEDED requires all tasks to have completed AND the vertex's completion handler to have been invoked. Looking at the probe trace from Step 3, the completion handler was invoked. So the transition was attempted.

Why 2: Why did the transition not happen even though the handler ran?

Because the handler returned a new state that depends on a counter (completedTaskCount). The probe shows completedTaskCount = 19 when the handler ran, but the vertex has 20 tasks. So the guard says "not done yet."

Why 3: Why is the count 19 when all 20 task-completed events were fired?

Because the count is incremented inside the handler, AFTER a check that re-routes certain V_TASK_COMPLETED events back through another handler. The re-route fires for the 20th task (look at VertexImpl.java around line 2750 — the if (recoveryData != null) branch). The re-routed event is queued but the test's dispatcher.await() returns before the queue is fully drained.

Why 4: Why does dispatcher.await() return before the re-routed event is processed?

Because AsyncDispatcher.await() waits for the current queue to drain, but the re-route enqueues into a secondary queue (the recovery dispatcher) which is not joined by the primary await.

Why 5: Why are there two dispatchers, and why does the test only await one?

Because recovery events were added in TEZ-2877 as a separate dispatch path to avoid blocking the main event loop during recovery replay. The test setup predates that change. The test never knew there was a second queue to wait on.

Root cause statement: The 20th V_TASK_COMPLETED event is enqueued into the recovery dispatcher rather than handled directly when recoveryData != null, and the test (and any caller relying on the primary dispatcher having drained) observes a stale completedTaskCount. The fix is either to (a) join the recovery dispatcher in await(), (b) handle the recovery-data branch synchronously when not actually replaying recovery, or (c) document that callers must use a different barrier.

That is a root cause. The fix direction is now obvious-ish. You can argue between (a), (b), (c) — but you know what each one changes.
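
The mechanism in this worked example can be modeled without Tez at all. The sketch below is a single-threaded simulation with hypothetical names: a primary queue that the test drains, a secondary queue that it does not, and a counter that is only updated when the re-routed event finally runs.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Single-threaded model of the worked example: draining only the primary queue
// leaves a re-routed completion event unprocessed, so the counter is stale.
final class TwoQueueSketch {
  private final Queue<Runnable> primary = new ArrayDeque<>();
  private final Queue<Runnable> secondary = new ArrayDeque<>();
  private final int numTasks;
  private final boolean rerouteLast;
  private int completedTaskCount;

  TwoQueueSketch(int numTasks, boolean rerouteLast) {
    this.numTasks = numTasks;
    this.rerouteLast = rerouteLast;
  }

  // Enqueue one completion event per task on the primary queue.
  void fireAllCompletions() {
    for (int task = 1; task <= numTasks; task++) {
      final boolean last = task == numTasks;
      primary.add(() -> {
        if (last && rerouteLast) {
          secondary.add(() -> completedTaskCount++);   // re-routed, counted later
        } else {
          completedTaskCount++;
        }
      });
    }
  }

  // Models an await() that only knows about the primary queue.
  void drainPrimary() { drain(primary); }

  void drainSecondary() { drain(secondary); }

  private static void drain(Queue<Runnable> queue) {
    Runnable event;
    while ((event = queue.poll()) != null) {
      event.run();
    }
  }

  int completed() { return completedTaskCount; }

  String state() { return completedTaskCount == numTasks ? "SUCCEEDED" : "RUNNING"; }

  public static void main(String[] args) {
    TwoQueueSketch vertex = new TwoQueueSketch(20, true);
    vertex.fireAllCompletions();
    vertex.drainPrimary();
    System.out.println(vertex.completed() + " " + vertex.state());
    vertex.drainSecondary();
    System.out.println(vertex.completed() + " " + vertex.state());
  }
}
```

Building a toy model like this is a cheap way to check that your five-whys chain is internally consistent before you spend time validating it against the real code.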

Git Archaeology

Once you have a candidate cause, ask: when did this break? And why did the person who wrote it think it was correct?

`git log --follow -p -S<token>`

Find every commit that introduced or removed a specific string or method name:

cd ~/tez-src

# Every commit that touched the recovery dispatcher branch
git log --follow -p -S "recoveryData != null" \
  -- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

# Every commit that mentions the counter
git log --follow -p -S "completedTaskCount" \
  -- tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

# The original change that added recovery dispatching
git log --all --grep="TEZ-2877" --oneline

-S ("pickaxe") matches commits where the count of that string changed — either added or removed. It is the single most powerful git command in this entire chapter. Learn it.
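
To make the "count changed" rule concrete, here is a small model of it. It illustrates the matching rule only and says nothing about how git implements it; PickaxeSketch and its methods are hypothetical names.

```java
// Sketch of the pickaxe rule: a change matches -S<token> only when the number of
// occurrences of the token differs between the old and new contents of a file.
final class PickaxeSketch {
  // Counts non-overlapping occurrences of token in text.
  static int occurrences(String text, String token) {
    if (token.isEmpty()) {
      throw new IllegalArgumentException("token must not be empty");
    }
    int count = 0;
    int at = text.indexOf(token);
    while (at >= 0) {
      count++;
      at = text.indexOf(token, at + token.length());
    }
    return count;
  }

  static boolean pickaxeMatches(String before, String after, String token) {
    return occurrences(before, token) != occurrences(after, token);
  }

  public static void main(String[] args) {
    String before = "if (recoveryData != null) {\n  handleRecovery(event);\n}\ncount++;\n";
    String moved = "count++;\nif (recoveryData != null) {\n  handleRecovery(event);\n}\n";
    String removed = "count++;\n";
    System.out.println("moved:   " + pickaxeMatches(before, moved, "recoveryData != null"));
    System.out.println("removed: " + pickaxeMatches(before, removed, "recoveryData != null"));
  }
}
```

The practical consequence is that a commit which only moves the line within the same file leaves the count unchanged and is not reported by -S.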

`git blame -L <start>,<end>`

Once you know the file and lines, find the commit and its author (git blame prints the author, not the committer, by default):

git blame -L 2740,2770 \
  tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/VertexImpl.java

Output looks like:

a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2745)     if (recoveryData != null) {
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2746)       handleRecovery(event);
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2747)       return;
a1b2c3d4 (Alice 2018-04-12 09:34:18 -0700 2748)     }

Then read the commit:

git show a1b2c3d4
git log -1 --format="%B" a1b2c3d4

Look for the JIRA reference in the commit message (TEZ-NNNN: ...). Open that JIRA. Read every comment. Often you will discover:

The change was made to fix a different bug (recovery correctness) and introduced your bug as collateral.
There was a comment on the original JIRA flagging the exact concern you are hitting. ("This might race with the test dispatcher pattern" — and it did.)
The fix you are considering was discussed and rejected for a reason you must now address.

`git bisect` for Regressions

If the bug is a regression — works in 0.9.x, broken in 0.10.x — bisect tells you the exact commit that introduced it. This is the highest-confidence signal in all of root-cause work.

cd ~/tez-src
git bisect start
git bisect bad master
git bisect good rel/release-0.9.2

# git checks out a midpoint commit. Build and run the repro:
mvn install -DskipTests -pl tez-dag -am -q
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q

# If the test FAILS at this commit: bug exists here
git bisect bad
# If the test PASSES at this commit: bug introduced later
git bisect good

# Repeat. git narrows to one commit in roughly log2(N) steps.

Once bisect converges:

a1b2c3d4 is the first bad commit
commit a1b2c3d4
Author: Alice <alice@example.org>
Date:   Thu Apr 12 09:34:18 2018
    TEZ-2877: Add recovery dispatcher path

Now you know:

The JIRA that introduced the regression.
The author (potential reviewer for your fix — Cc them).
The exact diff to study.

Automating bisect with git bisect run <script> is also fair game once you have a return-code-clean reproducer command.
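
For intuition about why bisect converges so quickly, the sketch below runs the same binary search over a hypothetical linear history of numbered commits. Real histories contain merges and git picks midpoints differently, so treat this as a model of the idea, not of git's algorithm.

```java
import java.util.function.IntPredicate;

// Sketch of what bisect does: binary search for the first bad commit in a linear history.
final class BisectSketch {
  // good and bad are indices into the history with good < bad; isBad tests one commit.
  // Returns {firstBadIndex, numberOfCommitsTested}.
  static int[] firstBad(int good, int bad, IntPredicate isBad) {
    int tested = 0;
    while (bad - good > 1) {
      int mid = good + (bad - good) / 2;
      tested++;
      if (isBad.test(mid)) {
        bad = mid;      // bug already present here: look earlier
      } else {
        good = mid;     // still fine here: look later
      }
    }
    return new int[] {bad, tested};
  }

  public static void main(String[] args) {
    // Hypothetical history of 1000 commits where commit 613 introduced the bug.
    int[] result = firstBad(0, 1000, commit -> commit >= 613);
    System.out.println("first bad = " + result[0] + ", builds needed = " + result[1]);
  }
}
```

Each test halves the remaining range, which is why a thousand-commit range needs only about ten build-and-test cycles.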

Writing the Root-Cause Statement

This document goes into your JIRA, into your PR description, and into your write-up. 200–500 words, no more, no less. Use this template:

## Root cause: TEZ-NNNN

### Symptom
<one sentence — what the user sees>

### Trigger conditions
- <condition 1, e.g. recovery data is non-null when V_TASK_COMPLETED fires>
- <condition 2, e.g. only on the last task in a vertex>
- <condition 3 if any>

### Affected code
- `tez-dag/src/main/java/.../VertexImpl.java#L2745-L2748` (the recovery branch)
- `tez-dag/src/main/java/.../AsyncDispatcher.java#L210` (`await()` does not
  join the secondary queue)

### Mechanism
<three to five sentences explaining the actual defect. Use words like "because",
"as a result", "however". This is the part most people get wrong — they describe
the symptom again instead of the mechanism. The mechanism answers: of the
many ways this code could have been written, why does the current way produce
this wrong answer?>

### Introducing change
- TEZ-2877 (commit a1b2c3d4) added the recovery-dispatch branch without
  updating `AsyncDispatcher.await()` to join the recovery queue.
- The original JIRA flagged this as a concern (link to comment) but the
  resolution was deferred ("we don't await in production paths, only in
  tests").

### Fix direction
Three options considered:

1. **Join the recovery dispatcher in `await()`.** Smallest change. Risk: may
   slow recovery in production if a slow recovery handler blocks the await.
2. **Handle the recovery branch synchronously when not replaying.** Larger
   change, narrower blast radius. Recommended.
3. **Document that tests must use a new barrier.** Cheapest. Pushes burden
   onto every test author. Rejected.

Recommended: option 2. See Step 5 for the diff.

Save as capstone-work/root-cause.md.

Validating the Hypothesis

A root cause is not validated until you have demonstrated it. Two ways:

1. Revert the introducing commit and re-run the repro

git checkout master
git revert --no-commit a1b2c3d4   # introducing commit from bisect
mvn install -DskipTests -pl tez-dag -am -q
mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q

If the test now PASSES (because the change you reverted is what introduced the bug), your root cause is at least partially correct. If it still FAILS, the introducing commit is not the root cause — there is a deeper issue.

Reset before you go any further:

git reset --hard HEAD   # discards the staged revert; local commits and untracked files are kept

2. Make a minimal one-line "patch" that confirms the mechanism

You are not writing the real fix yet. You are confirming the mechanism. For the example above:

--- a/tez-dag/.../VertexImpl.java
+++ b/tez-dag/.../VertexImpl.java
@@ -2745,4 +2745,4 @@
-    if (recoveryData != null) {
+    if (recoveryData != null && isReplayingRecovery()) {
       handleRecovery(event);
       return;
     }

(isReplayingRecovery() does not exist yet. For this experiment, add a throwaway stub that returns false in tests and would return true only during actual recovery replay.) Apply the patch and re-run the repro. If it passes, the mechanism is confirmed even though the real API does not exist yet.

If the test still fails: your mechanism is wrong. Go back to the five-whys.

If the test now passes but breaks 14 other tests: your fix direction is too broad. Go back to "fix direction" in the root-cause statement and pick a narrower option.

Validation / Self-check

Before advancing to Step 5:

capstone-work/root-cause.md exists, follows the template, is 200–500 words.
You can name the introducing commit (full SHA) and JIRA.
You ran git bisect to convergence (or proved bisect doesn't apply because the bug existed since the file was first added — note this in the doc).
You ran a "revert introducing commit" experiment and saw the test go green (or have a documented reason the revert doesn't apply).
You wrote a one-line throwaway "mechanism confirmation" patch and saw the test pass on it.
You have read every comment on the introducing JIRA.
You can articulate three fix directions and explain why you rejected two of them in one sentence each.

Step 5: Implementation

Your fix is the smallest diff that makes the failing test pass without breaking any other test. Period. Anything else — a refactor you noticed, a TODO you want to address, a better name for a field — belongs in a separate JIRA, not this PR.

Committer reviewers' single biggest objection to first-time contributors is scope creep. The second is API hygiene. This chapter is about both.

Minimum-Diff Principle

The fix should change as few lines as possible while addressing the root cause identified in Step 4. Everything that survives compilation but is not strictly required to fix the bug is review surface area. Review surface area is the enemy of "merged this week."

Too much

-  public void handleVertexCompleted(VertexEvent event) {
-    if (recoveryData != null) {
-      handleRecovery(event);
-      return;
-    }
-    completedTaskCount++;
-    if (completedTaskCount == numTasks) {
-      transitionToSucceeded();
-    }
-  }
+  // Refactored to use stream API for clarity
+  public void handleVertexCompleted(final VertexEvent event) {
+    Optional.ofNullable(recoveryData)
+      .filter(rd -> isReplayingRecovery())
+      .ifPresentOrElse(
+          rd -> handleRecovery(event),
+          () -> {
+            this.completedTaskCount = this.completedTaskCount + 1;
+            this.maybeTransitionToSucceeded();
+          });
+  }
+
+  private void maybeTransitionToSucceeded() {
+    if (completedTaskCount == numTasks) {
+      transitionToSucceeded();
+    }
+  }

This will be rejected. You changed five things (Optional chain, final keyword, method extraction, control-flow shape, formatting). A committer cannot tell which change is the actual fix without re-deriving the root cause from scratch.

Just right

   public void handleVertexCompleted(VertexEvent event) {
-    if (recoveryData != null) {
+    if (recoveryData != null && isReplayingRecovery()) {
       handleRecovery(event);
       return;
     }
     completedTaskCount++;
     if (completedTaskCount == numTasks) {
       transitionToSucceeded();
     }
   }

One line. The change matches the root-cause statement verbatim. A reviewer reads it, opens the root-cause doc, agrees in 30 seconds.

The extracted helper, the final keyword, and the Optional rewrite may all be good ideas. File them as separate JIRAs after this lands.

The Boy Scout rule does NOT apply

In a green-field project, "leave the campground cleaner than you found it" is fine. In Apache project review, drive-by cleanups block your fix because they expand the review and trigger objections you do not need to deal with to land the actual bug fix. Resist the urge.

Where Does the Fix Go? A Decision Tree

Is the bug a check that should have rejected an input but didn't?
    -> Guard condition (likely in a setter or builder).
       Example: TezConfiguration.validate(), DAG.verify().

Is the bug a wrong state machine transition?
    -> State-machine transition table edit.
       Look for stateMachineFactory.addTransition() in the affected *Impl class.
       The fix is usually adding/removing a transition or changing its target state.

Is the bug a config key being read at the wrong place or with the wrong default?
    -> Config validation in the constructor of the class that reads it.
       Or a fix to where conf.get() / conf.getInt() is called.

Is the bug a logic error in business code (wrong arithmetic, wrong comparator,
missing close())?
    -> Logic bug. Fix is local to the offending method.
       Add a test that asserts the corrected behavior.

Is the bug a race?
    -> First, prove it is actually a race with DrainDispatcher. Most "races"
       turn out to be logic bugs that *look* race-y because event ordering
       is non-obvious.
    -> If genuinely a race: usually a missing dispatcher.await, a missing
       volatile, or a transition guard that isn't atomic with a counter
       increment. Synchronize the smallest critical section.

Is the bug a memory issue (OOM, off-heap leak)?
    -> Almost never in scope for a first Capstone. Pause and consult a committer.
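
For the race branch of the tree, the "guard that isn't atomic with a counter increment" case can be sketched as follows. This uses AtomicInteger as one possible way to shrink the critical section to a single operation; the class and method names are hypothetical, and whether an atomic or a lock is appropriate depends on how the surrounding class already synchronizes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: make "increment the counter" and "check whether we are done" one atomic step,
// so exactly one caller observes the final count and fires the transition.
final class AtomicCompletionSketch {
  private final int numTasks;
  private final AtomicInteger completed = new AtomicInteger();
  private final AtomicInteger transitionsFired = new AtomicInteger();

  AtomicCompletionSketch(int numTasks) {
    this.numTasks = numTasks;
  }

  void onTaskCompleted() {
    // incrementAndGet returns the post-increment value atomically, so only the
    // caller that performs the final increment sees numTasks.
    if (completed.incrementAndGet() == numTasks) {
      transitionsFired.incrementAndGet();   // stands in for the SUCCEEDED transition
    }
  }

  // Fires numTasks completions across several threads and returns how many
  // times the transition fired. Assumes numTasks is divisible by threads.
  static int runConcurrently(int numTasks, int threads) {
    AtomicCompletionSketch vertex = new AtomicCompletionSketch(numTasks);
    int perThread = numTasks / threads;
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int n = 0; n < perThread; n++) {
          vertex.onTaskCompleted();
        }
      });
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while joining worker", e);
      }
    }
    return vertex.transitionsFired.get();
  }

  public static void main(String[] args) {
    System.out.println("transitions fired: " + runConcurrently(1000, 8));
  }
}
```

A separate read after a separate increment would allow zero or two callers to see the final value; folding both into one operation removes that window.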

Configuration Keys: The Right Way

You will be tempted to "add a knob" — a new tez.foo.bar flag that defaults to the old (buggy) behavior, lets users opt in to the fix. Resist. Knobs are an admission that you don't trust your fix. If your fix is correct, it should be the new default; if it isn't, fix the fix, not the user's configuration burden.

When a knob IS justified:

The fix changes a performance-sensitive default that may regress some users.
The fix changes user-visible output format (release-note required).
The fix is gated on a long-deprecation window and the old behavior must remain available for one or two releases.

When you DO add a key, conform to Tez convention. Read:

grep -n "TEZ_AM\|TEZ_TASK\|TEZ_RUNTIME" \
  tez-api/src/main/java/org/apache/tez/dag/api/TezConfiguration.java \
  | head -40

You will see the pattern:

/**
 * Maximum number of times an AM can attempt to launch a task before failing
 * the task.
 * <p>
 * Default: {@link #TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT}.
 *
 * @since 0.9.0
 */
@ConfigurationScope(Scope.AM)
@ConfigurationProperty(type = "integer")
public static final String TEZ_AM_TASK_MAX_FAILED_ATTEMPTS =
    TEZ_AM_PREFIX + "task.max.failed.attempts";
public static final int TEZ_AM_TASK_MAX_FAILED_ATTEMPTS_DEFAULT = 4;

Mandatory elements for any new key:

Javadoc that explains what the knob does and when to change it.
@since X.Y.Z matching the next release version.
@ConfigurationScope (AM, DAG, VERTEX, CLIENT; confirm the current values against the Scope enum in tez-api).
@ConfigurationProperty(type = "integer" / "long" / "boolean" / "string").
A _DEFAULT constant alongside.
Use the right prefix constant (TEZ_AM_PREFIX, TEZ_RUNTIME_PREFIX, etc.).
Add to tez-api/src/main/resources/META-INF/services/... if the doc-gen needs to pick it up (check existing keys to see whether the config-doc generator picks new keys up automatically or needs manual entries).

A new key that violates any of these will fail review.
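One way to self-check the _DEFAULT rule before a reviewer does is a small reflection pass over the constants class. The sketch below is JDK-only and runs against a stand-in class (ExampleKeys and ConfigKeyCheck are hypothetical names, not Tez code). It checks only the naming pairing, not the annotations or the Javadoc, and it assumes that constants ending in _PREFIX are not keys.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a configuration-constants class; not Tez code. */
final class ExampleKeys {
  static final String EXAMPLE_PREFIX = "example.";
  public static final String EXAMPLE_RETRY_LIMIT = EXAMPLE_PREFIX + "retry.limit";
  public static final int EXAMPLE_RETRY_LIMIT_DEFAULT = 4;
  // Deliberately declared without a sibling _DEFAULT constant.
  public static final String EXAMPLE_ORPHAN_KEY = EXAMPLE_PREFIX + "orphan";

  private ExampleKeys() {}
}

final class ConfigKeyCheck {
  private ConfigKeyCheck() {}

  /** Returns the names of public static final String keys with no NAME_DEFAULT sibling. */
  static List<String> keysMissingDefault(Class<?> constants) {
    List<String> missing = new ArrayList<>();
    for (Field field : constants.getDeclaredFields()) {
      int mods = field.getModifiers();
      boolean isKey = Modifier.isPublic(mods) && Modifier.isStatic(mods)
          && Modifier.isFinal(mods)
          && field.getType() == String.class
          && !field.getName().endsWith("_DEFAULT")
          && !field.getName().endsWith("_PREFIX");
      if (!isKey) {
        continue;
      }
      try {
        constants.getDeclaredField(field.getName() + "_DEFAULT");
      } catch (NoSuchFieldException e) {
        missing.add(field.getName());
      }
    }
    return missing;
  }
}
```

Pointing keysMissingDefault at the real constants class from a scratch test would be the obvious next step, with the caveat that some existing keys may legitimately have no default.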

Tez Coding Style

Read the existing class you are editing. Match its style exactly. The project-wide rules below are necessary but not sufficient — the file-local conventions matter just as much.

Logging

Always slf4j, never log4j directly, never System.out:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOG = LoggerFactory.getLogger(VertexImpl.class);

LOG.info("Vertex {} transitioned from {} to {} on event {}",
    getName(), oldState, newState, event.getType());

Use {} parameterization, never string concatenation in log args. Use the exception form LOG.error("Failed to schedule task {}", taskId, ex) rather than concatenating ex.toString().

Preconditions

Tez uses Guava Preconditions heavily. Use it for invariants and argument checks:

import com.google.common.base.Preconditions;

Preconditions.checkNotNull(event, "event must not be null");
Preconditions.checkArgument(parallelism > 0,
    "parallelism must be positive, got %s for vertex %s", parallelism, vertexName);
Preconditions.checkState(getState() == VertexState.RUNNING,
    "Vertex %s must be RUNNING to receive %s, was %s",
    getName(), event.getType(), getState());

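The variadic %s form is preferable to string concatenation because the message is formatted only when the check fails. On the passing path you pay for evaluating the arguments and allocating the varargs array, not for building a string.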

Exception messages

Always include the context: which vertex, which task ID, which state, which event. Diagnosing a Tez bug from a stack trace alone is hard enough; an exception message that just says "invalid state" is hostile.

Bad:

throw new IllegalStateException("invalid state");

Good:

throw new IllegalStateException(String.format(
    "Vertex %s received event %s in state %s, which is not legal. "
        + "Expected one of [RUNNING, INITED].",
    getName(), event.getType(), getState()));

Forbidden

System.out.println / System.err.println (use LOG).
e.printStackTrace() (use LOG.error("...", e)).
Thread.sleep in production code unless you have a // TEZ-NNNN: justification comment AND a committer agreed in review.
New synchronized methods on hot paths — discuss in the JIRA before adding.
Adding new dependencies to pom.xml without discussion. This is a major re-review trigger.

Imports

No wildcard imports (import foo.bar.*;). The project's checkstyle catches these and you will fail precommit.
Group order: java, javax, org, com, third-party, project. Most IDEs handle this automatically.

Tests

Discussed fully in Step 6, but: every fix must come with at least one test that fails on master and passes with your fix. No test, no merge.

Building Incrementally

Do not try to write the whole fix and run the whole test suite. That feedback loop is too slow. Instead:

# Tight loop: compile + run only the changed module's affected test.
mvn install -DskipTests -pl tez-api,tez-common -am -q && \
  mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNNRepro -q

# When that goes green, broaden:
mvn test -pl tez-dag -Dtest='TestVertex*' -q

# Finally, full module:
mvn test -pl tez-dag -q

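If your fix touches tez-api, you have to rebuild every downstream module. Two flags matter here: -am ("also make") builds the upstream modules that your selected module depends on, while -amd ("also make dependents") builds the downstream modules that depend on it. Use -amd (for example, -pl tez-api -amd) when you change tez-api.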

When You Get Stuck

Hard rule: if you have not made forward progress in three sessions, post on the JIRA. Format:

Status update: I have the repro from Step 2 passing/failing as expected. My
working hypothesis is <one-sentence>. I have tried:

1. <approach A> — does not work because <observed result>.
2. <approach B> — does not work because <observed result>.

I am unsure whether to (a) <option a> or (b) <option b>. The constraint I am
trying to satisfy is <invariant>. If anyone has context on whether <approach C>
was considered for a related JIRA, please share.

Reproducer is at <link to gist or branch>.

This is not failure. This is community engagement done right. Committers respect contributors who ask sharp questions with context attached. They ignore contributors who ask "any update?" or "can you help?"

Validation / Self-check

Before advancing to Step 6:

Your fix is committed to your branch as a single commit with the title TEZ-NNNN: <short summary> and a body that references the root-cause document.
git diff origin/master --stat shows the smallest plausible diff (single digit files changed, double-digit lines at most for a typical bug fix).
The diff contains zero unrelated changes (no formatting-only changes, no import reordering not caused by your edit, no Javadoc cleanups in methods you didn't touch).
mvn install -DskipTests -pl <changed-module> -am -q succeeds.
The Step 2 reproducer test now passes (you'll generalize the test in Step 6 — the repro itself is still the gating signal).
If you added a TezConfiguration key, it has all required annotations, Javadoc, _DEFAULT constant, and @since tag.
You have re-read your diff line by line and convinced yourself every line change is required by the root cause. Strike anything that isn't.

Step 6: Testing

Your reproducer from Step 2 is the minimum — it proves the bug existed. The tests in this step prove that the fix is correct, that it stays correct, and that the next person who edits this code path will notice if they break it again. A good test suite is the most durable artifact you ship.

Two kinds of tests are required: unit tests using a controlled dispatcher (fast, deterministic, surgical) and at least one integration test on MiniTezCluster (slow, realistic, end-to-end). Both. Always both.

Unit Tests with `DrainDispatcher`

The single most important Tez test pattern: synchronous, deterministic state-machine testing. Read the canonical example top to bottom before you write your own:

~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskAttempt.java
~/tez-src/tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestTaskImpl.java

Each is 1000+ lines. They are not light reading. They are also the only authoritative source on what is and isn't testable at the unit layer.

What `DrainDispatcher` Does

DrainDispatcher is Hadoop's testing dispatcher (from hadoop-yarn-common). When you hand an event to its event handler, the event sits in a queue. When you call await(), the calling thread blocks until the dispatcher thread has drained that queue, so every queued handler has run before await() returns. This gives you two superpowers:

Deterministic event ordering. You can dispatch A, dispatch B, await — and you know A's handler completed before B's started.
No timing dependence. The test thread never races the dispatcher, so bugs reproduce on every machine, not just under contention.
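To make the queue-and-await contract concrete, here is a toy model written against the JDK only. It is not Hadoop's DrainDispatcher: ToyDrainDispatcher is a hypothetical name, and the pending counter is just one simple way to implement "block until drained". A single worker thread runs the handlers, and await() parks the caller until nothing is pending.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Toy model of a drainable dispatcher; illustrative only, not Hadoop's DrainDispatcher. */
final class ToyDrainDispatcher<E> implements AutoCloseable {
  private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
  private final Consumer<E> handler;
  private final Object lock = new Object();
  private int pending = 0; // events dispatched but not yet fully handled
  private final Thread worker;

  ToyDrainDispatcher(Consumer<E> handler) {
    this.handler = handler;
    this.worker = new Thread(this::runLoop, "toy-dispatcher");
    this.worker.setDaemon(true);
    this.worker.start();
  }

  /** Enqueues an event; the handler runs later on the dispatcher thread. */
  void dispatch(E event) {
    synchronized (lock) {
      pending++;
    }
    queue.add(event);
  }

  /** Blocks the caller until every dispatched event has been handled. */
  void await() {
    synchronized (lock) {
      while (pending > 0) {
        try {
          lock.wait();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while draining", e);
        }
      }
    }
  }

  private void runLoop() {
    try {
      while (true) {
        E event = queue.take();
        try {
          handler.accept(event);
        } finally {
          synchronized (lock) {
            pending--;
            lock.notifyAll();
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // close() was called; exit quietly
    }
  }

  @Override
  public void close() {
    worker.interrupt();
  }
}
```

Because pending is incremented before the event is queued, a handler that dispatches follow-on events keeps await() blocked until the whole cascade has finished. That is the property a state-machine test depends on.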

State-Transition Test Pattern

The template every state-machine unit test follows:

@Test
public void testV_TASK_COMPLETED_inRunningWithRecovery() throws Exception {
  // 1. Arrange: drive the SUT to the state under test.
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
  dispatcher.await();
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_START));
  dispatcher.await();
  assertEquals(VertexState.RUNNING, vertex.getState());

  // 2. Set up the precondition that triggers the bug.
  vertex.setRecoveryData(mockRecoveryData());

  // 3. Act: fire the event under test.
  TezTaskID lastTaskId = vertex.getTask(vertex.getNumTasks() - 1).getTaskId();
  vertex.handle(new VertexEventTaskCompleted(lastTaskId, TaskState.SUCCEEDED));
  dispatcher.await();

  // 4. Assert: the new state and any side-effect counters.
  assertEquals(VertexState.SUCCEEDED, vertex.getState());
  assertEquals(vertex.getNumTasks(), vertex.getCompletedTaskCount());
  assertFalse("vertex must not call handleRecovery when not actually replaying",
      vertex.getRecoveryHandlerCalled());
}

The sections — arrange, set precondition, act, assert — should always be visible. Reviewers skim for that shape. Hidden setup inside helpers makes the test harder to debug when it fails on a future change.

Build a Negative Test Too

You proved the bug is fixed. Now prove the non-buggy path still works:

@Test
public void testV_TASK_COMPLETED_inRunningWithoutRecovery() throws Exception {
  // Same arrange/state machinery, but recoveryData stays null.
  vertex.handle(new VertexEvent(vertex.getVertexId(), VertexEventType.V_INIT));
  dispatcher.await();
  // ...
  TezTaskID lastTaskId = vertex.getTask(vertex.getNumTasks() - 1).getTaskId();
  vertex.handle(new VertexEventTaskCompleted(lastTaskId, TaskState.SUCCEEDED));
  dispatcher.await();
  // Without recovery data, the existing transition behavior is unchanged.
  assertEquals(VertexState.SUCCEEDED, vertex.getState());
}

The negative test catches the regression where someone "fixes" your fix by removing the recovery branch entirely.

Test Both Branches of Every Guard You Added

If your fix is:

if (recoveryData != null && isReplayingRecovery()) { ... }

You owe four tests, one per combination:

`recoveryData == null`	`isReplayingRecovery()` returns	Expected branch
true	false (never evaluated: short-circuited)	non-recovery path
false	true	recovery path
false	false	non-recovery path (this is the bug fix)
true	true (never evaluated: short-circuited)	non-recovery path (should be impossible; assert it cannot happen)

The last row is the kind of test that catches a future refactor where someone deletes the short-circuit.
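The table can be enumerated mechanically. The sketch below is JDK-only and models the guard with a hypothetical chooseBranch method (RecoveryGuard and RecoveryGuardTable are illustrative names, not VertexImpl code). It counts predicate invocations so the short-circuit rows are observable rather than assumed.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical model of the guard; not VertexImpl code. */
final class RecoveryGuard {
  private RecoveryGuard() {}

  /** Mirrors the guard shape: recoveryData != null && isReplayingRecovery(). */
  static String chooseBranch(Object recoveryData, BooleanSupplier isReplayingRecovery) {
    if (recoveryData != null && isReplayingRecovery.getAsBoolean()) {
      return "recovery";
    }
    return "non-recovery";
  }
}

final class RecoveryGuardTable {
  private RecoveryGuardTable() {}

  /** Prints one line per combination, including how often the predicate was called. */
  public static void main(String[] args) {
    Object[] dataCases = {null, new Object()};
    boolean[] replayCases = {false, true};
    for (Object data : dataCases) {
      for (boolean replaying : replayCases) {
        int[] calls = {0};
        String branch = RecoveryGuard.chooseBranch(data, () -> {
          calls[0]++;
          return replaying;
        });
        System.out.printf("recoveryData==null: %b, replaying: %b -> %s (predicate calls: %d)%n",
            data == null, replaying, branch, calls[0]);
      }
    }
  }
}
```

In a real test class each printed line would become its own @Test method, with the call count asserted for the two rows where recoveryData is null.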

`MockAppContext`, `MockHistoryEventHandler`, and friends

Building a VertexImpl in a unit test requires a small zoo of collaborators (an AppContext, an event handler, an EdgeManager, etc.). Don't try to build them all from scratch — copy the helpers from TestVertexImpl.

grep -nE "private.*setUp\(|class Mock|createVertex\(" \
  tez-dag/src/test/java/org/apache/tez/dag/app/dag/impl/TestVertexImpl.java \
  | head -30

You'll see helper methods like createVertex(...), createDAG(...), and inner MockHistoryEventHandler. Use them as a template; do not duplicate them in your own test if you can extend the existing test class with a new @Test method.

Integration Tests with `MiniTezCluster`

Unit tests prove the fix works in isolation. Integration tests prove it works when wired up to a real YARN cluster (in-process, but real). For correctness bugs and shuffle bugs, this is non-negotiable.

Canonical example:

~/tez-src/tez-tests/src/test/java/org/apache/tez/test/TestOrderedWordCount.java

Read its setUp / tearDown carefully. The pattern:

private static MiniTezCluster mrrTezCluster;
private static Path TEST_ROOT_DIR;

@BeforeClass
public static void setup() throws IOException {
  Configuration conf = new Configuration();
  TEST_ROOT_DIR = new Path("target", TestYourFix.class.getName() + "-tmpDir");
  mrrTezCluster = new MiniTezCluster(TestYourFix.class.getSimpleName(),
      /*numNodeManagers=*/ 1, /*numLocalDirs=*/ 1, /*numLogDirs=*/ 1);
  mrrTezCluster.init(conf);
  mrrTezCluster.start();
}

@AfterClass
public static void tearDown() {
  if (mrrTezCluster != null) {
    mrrTezCluster.stop();
    mrrTezCluster = null;
  }
}

@Test(timeout = 180_000)
public void testTezNNNNFixEndToEnd() throws Exception {
  TezConfiguration tezConf = new TezConfiguration(mrrTezCluster.getConfig());
  DAG dag = buildDAGThatExercisesFix();

  TezClient tezClient = TezClient.create("test-tez-NNNN", tezConf);
  tezClient.start();
  try {
    DAGClient dagClient = tezClient.submitDAG(dag);
    DAGStatus status = dagClient.waitForCompletionWithStatusUpdates(
        EnumSet.of(StatusGetOpts.GET_COUNTERS));

    assertEquals(DAGStatus.State.SUCCEEDED, status.getState());

    // The actual assertion — what proves the fix works end-to-end:
    long counterVal = status.getDAGCounters()
        .findCounter(YourCounterGroup.class.getName(), "ExpectedCounter")
        .getValue();
    assertEquals(20L, counterVal);
  } finally {
    tezClient.stop();
  }
}

`awaitVertexState` and the Deterministic Polling Pattern

MiniTezCluster tests look async (real cluster, real time) but you can still write deterministic assertions. Use the await* helpers in the tez-tests test utility classes:

grep -rn "awaitVertexState\|awaitDAGCompletion\|awaitTaskAttempt" \
  ~/tez-src/tez-tests/src/test/java/

Pattern:

TestTezUtils.awaitVertexState(dagClient, "v1", VertexStatus.State.SUCCEEDED, 60_000);

This polls with backoff up to the timeout. It never returns early on a spurious signal and never sleeps a fixed wallclock duration.

Determinism Rules

Hard rules. Violating any of them gets your PR sent back.

Rule	Bad	Good
No `Thread.sleep`	`Thread.sleep(500)`	`dispatcher.await()` or `awaitVertexState(...)`
No wallclock waits	`while (!done && System.currentTimeMillis() < deadline) {...}`	`latch.await(60, SECONDS)` driven by event callback
No `Random` without seed	`new Random()`	`new Random(42L)`
No timezone-dependent assertion	`assertEquals("2024-...", LocalDate.now())`	inject `Clock`
No order-dependent assertion on a Set	`assertEquals(List.of("a","b"), new HashSet<>(...))`	sort first or use `containsInAnyOrder`
Tests must clean up tmpdirs	leaving `target/...-tmpDir` between runs	`@After` removes it or uses unique `nanoTime()` path
No global mutable state	`static int counter = 0;` shared across tests	per-test instance state
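Three of the "Good" column entries, sketched with the JDK only. DeterminismSketch and its method names are hypothetical, and the worker thread stands in for whatever event source your test actually listens to.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helpers illustrating the determinism rules; not Tez test utilities. */
final class DeterminismSketch {
  private DeterminismSketch() {}

  /** Event-driven wait: the callback releases the latch; the timeout is only a safety net. */
  static boolean awaitCallback(long timeoutSeconds) {
    CountDownLatch done = new CountDownLatch(1);
    Thread eventSource = new Thread(done::countDown, "fake-event-source");
    eventSource.start();
    try {
      return done.await(timeoutSeconds, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Seeded Random: the same seed yields the same sequence on every run. */
  static List<Integer> seededInts(long seed, int count) {
    Random random = new Random(seed);
    List<Integer> values = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      values.add(random.nextInt(100));
    }
    return values;
  }

  /** Order-independent comparison: sort a copy of the set before comparing it to a list. */
  static List<String> sorted(Set<String> unordered) {
    List<String> copy = new ArrayList<>(unordered);
    Collections.sort(copy);
    return copy;
  }
}
```

The latch returns as soon as the callback fires, so a passing test never waits out the full timeout. Only a genuinely stuck test does.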

Tez has shipped many flaky-test fixes. Read a few of them:

cd ~/tez-src
git log --oneline --grep="flaky\|intermittent" | head -20
git show <flaky-fix-sha>

Notice the pattern — most flaky fixes are replacing a Thread.sleep with an event-driven await, or replacing a counter assertion with a state assertion.

Coverage Target

You do not need 100% line coverage on the file you touched. You do need ~80% coverage on the lines you changed, plus tests that exercise every new branch (true and false sides).

Spot-check coverage:

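# prepare-agent must come before the test phase so the agent is attached
# to the test JVM; report must come after it.
mvn org.jacoco:jacoco-maven-plugin:prepare-agent test \
  org.jacoco:jacoco-maven-plugin:report \
  -pl tez-dag -Dtest='TestVertexImpl*'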

# Open tez-dag/target/site/jacoco/index.html

If your changed lines show red, add a test before pushing.

A Complete Test That Fails on Master, Passes With Fix

The deliverable for this step is a test (typically two or three @Test methods on the same class) that:

Fails on a clean checkout of origin/master — assertion error, not a compilation error, not a setup error.
Passes when run against your fix branch.
Runs in under 10 seconds for unit tests, under 3 minutes for integration tests.
Has zero flakes in 10 consecutive runs.

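Verify the fourth with the loop below. For the third, run once without -q and read Surefire's "Time elapsed" line for your test class: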

for i in {1..10}; do
  echo "=== Run $i ==="
  mvn test -pl tez-dag -Dtest=TestVertexImplTezNNNN -q || break
done

If even one run fails, you have a flaky test. Fix it before pushing. A flaky test you ship is technical debt every other contributor will pay.

Test Naming

Tez convention:

Unit test file: Test<ClassUnderTest>.java lives in <module>/src/test/java/<package>/. If TestVertexImpl.java already exists, add a new @Test method there rather than a new file.
Test method: test<Method>_<Condition>_<ExpectedResult> or test<Scenario>_<ExpectedBehavior>.
Bad: testFoo, testBug, testCase1.
Good: testV_TASK_COMPLETED_inRunningWithRecoveryData_doesNotShortCircuit.

The verbose name is the test's documentation. Future-you reading the failure output of CI will be glad for the verbosity.

Validation / Self-check

Before advancing to Step 7:

At least two @Test methods exist that fail on origin/master and pass on your branch.
At least one of them uses DrainDispatcher for deterministic event ordering (or has a documented reason it doesn't — pure unit, no events).
At least one integration test on MiniTezCluster is present if your fix affects end-to-end behavior (correctness, shuffle, scheduling).
Ten consecutive runs of your tests are all green.
Every new conditional branch in your production code has at least one test that exercises each side.
No Thread.sleep, no wallclock waits, no unseeded Random, no order-dependent assertions on unordered collections.
mvn test -pl <module> runs your tests in under the budget (10s unit, 3min integration).

Step 7: Validation

Your patch compiles. Your new tests pass. That is not enough. Validation is proving that the rest of the build — full module test suites, the static analyzers Tez runs, the legal scanner, the end-to-end examples — is also still green. Reviewers will not run this for you. They will check that you ran it and reject the PR if you didn't.

Budget: 1–2 evenings. Most of it is waiting on mvn test.

The Validation Checklist

In order. Do not skip steps because the previous step passed.

Full test suite of every module you touched.
Full clean build of the whole repo.
Checkstyle.
SpotBugs.
Apache RAT (license header check).
TestOrderedWordCount end-to-end.
Re-run your original Step 2 reproducer to confirm green.
Regression sweep of any module that depends on what you changed.
Performance validation (if perf-relevant).

Capture the output of each into capstone-work/validation/. You'll cite it in the PR description.

1. Full Module Tests

The module you changed:

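cd ~/tez-src
mkdir -p capstone-work/validation
# No -q here: quiet mode suppresses the [INFO] summary line you need to read.
mvn test -pl tez-dag 2>&1 | tee capstone-work/validation/01-tez-dag-test.log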

This will take 5–20 minutes depending on the module. tez-dag is the slowest non-integration module. While it runs, work on the diff cleanup.

When it finishes, scroll to the summary lines. Look for:

[INFO] Tests run: 1342, Failures: 0, Errors: 0, Skipped: 17

If you see Failures > 0, open every failure. Then triage:

My fix caused it. Go back to Step 5. Reread the test. Either your fix is wrong, or the test is wrong (rare — assume the test is right until proven otherwise).
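It is a known flaky test. Search the commit history (git log --grep="<TestName>") and JIRA for the test name. If there is an open ticket, link it in your PR description ("known flake, see TEZ-XYZ"). If there is not, file one before claiming the green.
It is also broken on master. Verify on a clean checkout of master rather than with git stash: your fix is already committed, so stashing does not remove it, and chaining git stash && mvn test ... && git stash pop skips the pop whenever the test fails. A separate worktree (git worktree add --detach ../tez-master origin/master) lets you run the same mvn test ... there without disturbing your branch. If it fails on master too, link the JIRA or file one. Do not let your PR be the one to surface a pre-existing failure silently.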

Run for every module you touched. If you touched tez-api, you touched everything downstream — plan accordingly.

2. Full Clean Build

The compilation gate. Catches missing imports, accidental Java-version features, downstream API breaks:

mvn clean install -DskipTests -q 2>&1 \
  | tee capstone-work/validation/02-clean-install.log

Expect a clean BUILD SUCCESS. Common failures:

Missing import. Your IDE auto-imported something not on the classpath of a downstream module.
API break. You changed a public method signature in tez-api and a downstream caller broke. Either revert the signature change or update the caller.
Java version. You used var or text blocks. Tez compiles to a JDK baseline (check pom.xml for <maven.compiler.target>). Use compatible syntax.

3. Checkstyle

Tez uses checkstyle aggressively. Run:

mvn checkstyle:check -q 2>&1 \
  | tee capstone-work/validation/03-checkstyle.log

Or, per module:

mvn checkstyle:check -pl tez-dag

Common violations and fixes:

Violation	Fix
Line longer than 120 chars	Break the line. Indent continuation 4 spaces.
Wildcard import	Replace with explicit imports.
Missing javadoc on public method	Add `/** ... */` block.
Trailing whitespace	Configure your editor to strip it on save.
Tab character	Convert to 2 spaces (Tez uses 2-space indent in most modules).
Method ordering	Public before private; static before instance.

The checkstyle config lives at tez-build-tools/src/main/resources/tez/checkstyle/checkstyle.xml — read it to understand the rules.

4. SpotBugs

Static analysis for null-deref, unchecked cast, dead-store, etc.:

mvn spotbugs:check -q 2>&1 \
  | tee capstone-work/validation/04-spotbugs.log

If it fails, view the report:

mvn spotbugs:gui -pl tez-dag

Common warnings worth fixing:

NP_NULL_ON_SOME_PATH — your new code dereferences a value that can be null on some branch.
EI_EXPOSE_REP — your getter returns a mutable internal collection directly. Wrap in Collections.unmodifiableList(...) or copy.
RV_RETURN_VALUE_IGNORED_BAD_PRACTICE — the result of file.delete() was ignored.
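A minimal sketch of the usual fixes for the last two warnings, using only the JDK. TaskDiagnostics is a hypothetical class for illustration, not Tez code, and whether a read-only view or a defensive copy is the better fix depends on how callers use the result.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder class used only to illustrate the two warnings; not Tez code. */
final class TaskDiagnostics {
  private final List<String> messages = new ArrayList<>();

  void add(String message) {
    messages.add(message);
  }

  /** EI_EXPOSE_REP fix: hand out a read-only view, never the internal list itself. */
  List<String> getMessages() {
    return Collections.unmodifiableList(messages);
  }

  /**
   * RV_RETURN_VALUE_IGNORED_BAD_PRACTICE fix: Files.deleteIfExists reports I/O failure by
   * throwing, and its boolean result is returned to the caller instead of being dropped.
   */
  static boolean deleteScratchFile(Path scratch) throws IOException {
    return Files.deleteIfExists(scratch);
  }
}
```

Callers that try to mutate the returned list get an UnsupportedOperationException, which surfaces the misuse at the call site instead of silently corrupting internal state.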

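Warnings already present on master are not your problem to fix, but the analyzer will fail the build if your change introduces new ones. To see which are new, run the analysis on both branches, copy tez-dag/target/spotbugsXml.xml aside after each run, and compare the two copies with diff. target/ is not tracked by git, so git diff origin/master cannot show the difference.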

5. Apache RAT (License Headers)

Every new .java, .xml, .properties file must carry the ASL header. RAT enforces this:

mvn apache-rat:check -q 2>&1 \
  | tee capstone-work/validation/05-rat.log

If it complains about your new test file, prepend the standard header:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

(Copy from any existing Tez file — it is the canonical form.)

For shell, properties, and XML files, use the appropriate comment syntax. Look at neighboring files in the same directory.

6. `TestOrderedWordCount` End-to-End

The closest thing to a smoke test of "does Tez actually still work for a real user workload":

mvn test -pl tez-tests -Dtest=TestOrderedWordCount -q 2>&1 \
  | tee capstone-work/validation/06-orderedwordcount.log

Takes 2–5 minutes. If this fails when your unit tests pass, your fix likely broke an interaction your unit test didn't exercise. Common culprits:

You changed an event ordering and a downstream component assumed the old ordering.
You added a config key default that breaks the example's expectations.
Your MiniTezCluster test is leaking state into a sibling test.

7. Re-Run Your Original Step 2 Reproducer

Sanity check. The thing you set out to fix is still fixed:

mvn test -pl <module> -Dtest=<YourReproTest> 2>&1 \
  | tee capstone-work/validation/07-repro.log

Five runs:

for i in 1 2 3 4 5; do
  mvn test -pl <module> -Dtest=<YourReproTest> -q
done

Five greens. Or you have not actually shipped a fix.

8. Regression Sweep

Run the test suite of every module that depends on what you changed. If you touched tez-api, that is everything. If you touched tez-runtime-library, that is at least tez-tests, tez-mapreduce, and tez-examples.

# Identify dependents
grep -l "tez-runtime-library" $(find ~/tez-src -name pom.xml)

# Run each
mvn test -pl tez-mapreduce -q | tail -5
mvn test -pl tez-examples -q | tail -5
mvn test -pl tez-tests -q | tail -10

If tez-tests takes too long (it can — there are real MiniTezCluster runs in there), at least run the tests whose name contains your changed class:

mvn test -pl tez-tests -Dtest='*Vertex*' -q

9. Performance Validation (If Relevant)

Skip this section unless your fix touches scheduling, shuffle, or any code path documented as "hot." For those, use async-profiler or JFR to capture a flamegraph before and after.

async-profiler pattern

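# Baseline run on master: start the JVM under test (e.g. a MiniTezCluster integration test).
# -DforkCount=0 keeps the tests inside the Maven JVM, so $! is the PID to profile
# (older Surefire versions spell this -DforkMode=never).
mvn test -pl tez-tests -Dtest=TestPerfWorkload -DforkCount=0 &
TEST_PID=$!

# Attach profiler for 60 seconds, then let the run finish
~/async-profiler/profiler.sh -d 60 -f /tmp/flame-before.svg "$TEST_PID"
wait "$TEST_PID"

# Switch to your fix branch, rebuild, and repeat against the NEW process
mvn test -pl tez-tests -Dtest=TestPerfWorkload -DforkCount=0 &
TEST_PID=$!
~/async-profiler/profiler.sh -d 60 -f /tmp/flame-after.svg "$TEST_PID"
wait "$TEST_PID"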

Compare the two SVGs. The stack frames you care about (e.g. ShuffleManager.run, MergeManager.merge) should not be wider after your fix than before. If they are, you have introduced a regression and you owe the JIRA an explanation.

Simpler: timing assertions in a JUnit test

@Test
public void testShuffleNotSlowerAfterFix() throws Exception {
  long start = System.nanoTime();
  runShuffleWorkload();
  long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  // Loose bound — assert no >30% regression vs. a previously-measured baseline.
  assertTrue("shuffle took " + elapsedMs + "ms, expected < 15000",
      elapsedMs < 15_000);
}

Brittle. Only add if perf is truly the concern.

The Validation Report

Compile everything into one document for the PR:

# Validation report for TEZ-NNNN

## Environment
- JDK: `java -version` -> openjdk version "11.0.21"
- Maven: `mvn -version` -> Apache Maven 3.9.6
- OS: macOS 14.2 / Linux 5.15.0-91-generic
- Tez HEAD: `git rev-parse origin/master` -> a1b2c3d4

## Results

| Check | Status | Notes |
|---|---|---|
| `mvn test -pl tez-dag` | PASS | 1342 tests, 0 failures, 17 skipped |
| `mvn clean install -DskipTests` | PASS | |
| `mvn checkstyle:check` | PASS | |
| `mvn spotbugs:check` | PASS | |
| `mvn apache-rat:check` | PASS | |
| `mvn test -pl tez-tests -Dtest=TestOrderedWordCount` | PASS | |
| Original reproducer | PASS (5/5 runs) | |
| `mvn test -pl tez-mapreduce` | PASS | |
| `mvn test -pl tez-examples` | PASS | |

## Known flakes encountered
- TestSomething#testWhatever — pre-existing flake, see TEZ-XXXX, not caused by this change.

## Performance
- Not applicable / no perf-relevant code paths touched.

Save as capstone-work/validation/REPORT.md. Paste it (or a summary plus link) into your PR description.

Validation / Self-check

Before advancing to Step 8:

capstone-work/validation/ contains one log file per check (logs 01–07 at minimum).
capstone-work/validation/REPORT.md exists with the table above filled in honestly.
Every check passes, or every failure is documented as a pre-existing issue with a JIRA link.
You re-ran your Step 2 reproducer five times with your fix applied and got 5/5 green.
You ran the test suite of at least one module that depends on the one you changed (regression sweep).
No new SpotBugs warnings introduced (diff against master baseline).
The validation report is short enough to paste into a PR description without making the reviewer scroll for a screen.

Step 8: Patch Preparation

You have working code, working tests, and a green validation run. Now you package the change so it can land. Modern Tez does this via GitHub PR; older Tez (and still some committers' preference) is .patch files attached to the JIRA. You should know how to do both.

This step is the easiest to skip past and the easiest to lose a week on if you do it sloppily. Treat the PR title, description, and commit message as seriously as the code.

Modern Tez: GitHub Pull Request

Apache Tez has accepted GitHub pull requests, each linked back to its JIRA issue, for several years. Issues themselves are still tracked in JIRA. The flow:

# 1. Make sure your branch is up to date with master
cd ~/tez-src
git remote -v
# origin    git@github.com:<you>/tez.git
# apache    https://github.com/apache/tez.git

git fetch apache
git checkout tez-NNNN-<slug>
git rebase apache/master
# Resolve any conflicts. Rebuild and re-run your tests after rebase.

# 2. Squash to a single clean commit (or 2-3 if logically separable)
git rebase -i apache/master
# In the editor: pick the first commit, squash the rest. Edit the combined
# commit message to one final version.

# 3. Push to your fork
git push --force-with-lease origin tez-NNNN-<slug>

# 4. Open a PR via https://github.com/apache/tez
# Title: TEZ-NNNN: <short summary, imperative mood>
# Base: apache/tez:master

--force-with-lease instead of --force: protects against overwriting a collaborator's commit if someone pushed to your branch between your fetch and your push.

Commit Message Template

TEZ-NNNN: Fix VertexImpl recovery branch short-circuiting non-recovery path

Reverts the unconditional short-circuit added in TEZ-2877 so that
V_TASK_COMPLETED events on the final task are processed by the standard
transition when no recovery is in progress. The original short-circuit
assumed any non-null recoveryData implies an active replay; this assumption
broke when recoveryData is populated speculatively at vertex initialization
even though no replay will occur.

The fix gates the recovery branch on the new isReplayingRecovery() predicate.
The previous behavior is preserved for actual recovery scenarios.

Tests:
- New unit test TestVertexImpl#testV_TASK_COMPLETED_inRunningWithRecovery
- New integration test in tez-tests verifying end-to-end DAG success
  with recoveryData populated.
- Existing TestOrderedWordCount and full tez-dag suite pass.

Note the shape:

Title line: TEZ-NNNN: <verb phrase, imperative mood, < 72 chars>.
Blank line.
Body paragraphs: what the change does, why, what assumption broke.
Tests: explicit list of tests added or affected.

No "should fix" or "I think." Past tense for what you did, present tense for what the code does after the change.

PR Title

TEZ-NNNN: <Short imperative summary>

TEZ-NNNN prefix is mandatory. The bot uses it to link to JIRA.
Imperative mood: "Fix race", "Add config", "Avoid NPE". Not "Fixed race", not "Fixing race".
< 72 characters total including the prefix.

Bad: Fix bug, Updates to VertexImpl, My fix for TEZ-NNNN.
Good: TEZ-4567: Honor isReplayingRecovery in VertexImpl completion path.
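The mechanical half of these rules (prefix and length) is easy to check before you open the PR. A small JDK-only sketch follows; PrTitleCheck is a hypothetical helper, not part of any Tez tooling, and it makes no attempt to judge imperative mood.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical pre-flight check for PR titles; not part of Tez tooling. */
final class PrTitleCheck {
  private static final Pattern PREFIX = Pattern.compile("^TEZ-\\d+: \\S.*");
  private static final int MAX_LENGTH = 72; // exclusive: the title must be shorter than this

  private PrTitleCheck() {}

  /** Returns a list of problems; an empty list means the mechanical checks pass. */
  static List<String> problems(String title) {
    List<String> problems = new ArrayList<>();
    if (!PREFIX.matcher(title).matches()) {
      problems.add("missing 'TEZ-NNNN: ' prefix");
    }
    if (title.length() >= MAX_LENGTH) {
      problems.add("title is " + title.length() + " chars, must be under " + MAX_LENGTH);
    }
    return problems;
  }
}
```

An empty result only means the title is well-formed. Whether it describes the change well is still a judgment call for you and the reviewer.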

PR Description Template

## JIRA

https://issues.apache.org/jira/browse/TEZ-NNNN

## Problem

<2-4 sentences. Symptom + trigger conditions. Cite the root-cause doc.>

When the last `V_TASK_COMPLETED` event arrives for a vertex with non-null
`recoveryData` outside an actual recovery replay, the event is unconditionally
re-routed through `handleRecovery()` rather than processed by the standard
transition. As a result, `completedTaskCount` is not incremented and the vertex
fails to transition to SUCCEEDED. This affects DAGs whose AM populates
recovery data speculatively at vertex initialization.

## Root cause

See `capstone-work/root-cause.md` (or paste inline if short).

Introduced in TEZ-2877 (commit a1b2c3d4).

## Fix

Gate the recovery short-circuit on a new `isReplayingRecovery()` predicate
that returns true only during active replay. Minimum-diff (one production
line + one new private method).

## Testing

- **New:** `TestVertexImpl#testV_TASK_COMPLETED_inRunningWithRecovery` —
  unit test using `DrainDispatcher` that reproduces the failure on master
  and passes with this fix. Plus a negative-control test.
- **New:** `TestTezNNNNFixIntegration` — `MiniTezCluster` end-to-end test
  that runs a 20-task vertex with speculatively-populated recoveryData and
  asserts DAG SUCCEEDED.
- **Existing:** Full `tez-dag` suite (1342 tests) passes. `TestOrderedWordCount`
  passes. Validation report at the end of this description.

## Backward compatibility

None affected. The fix changes behavior only for the broken case (no replay
in progress). Recovery scenarios are unchanged.

## Configuration

No new keys.

Adjust sections to your fix. The structure stays the same.

GitHub Actions / Yetus Precommit

When you open the PR, GitHub Actions runs the precommit checks. The full config lives in .github/workflows/ — read it:

ls ~/tez-src/.github/workflows/
cat ~/tez-src/.github/workflows/build.yml

Common checks (subject to change as the workflow evolves):

Check	What it runs	Failure means
Compile	`mvn install -DskipTests`	Build broken on some module
Tests	`mvn test` for each module	Some test failed (yours or flake)
Checkstyle	`mvn checkstyle:check`	Style violation in changed file
Javadoc	`mvn javadoc:javadoc`	Broken Javadoc reference or missing tag
RAT	`mvn apache-rat:check`	New file missing ASL header
License	License snippet check	Same as RAT, or LICENSE/NOTICE drift
SpotBugs	`mvn spotbugs:check`	New static-analysis warning

Failures appear on the PR as red ✖ marks. Click into the failing job to read the log. Common first-PR failures:

Javadoc broken: You referenced {@link Foo#bar} and bar doesn't exist. Either fix the link or remove it.
Checkstyle: A line exceeded 120 chars or an unused import slipped in.
License: New file missing the header. Add it.
Test: A flake. Re-run the workflow ("Re-run all failed jobs" in the Actions tab). If it goes green on retry, leave a comment: "Re-ran job — flake, see TEZ-XYZ." If it fails again, your fix probably broke it.

Push fixes as new commits on the same branch. The PR auto-updates. After review approval, you'll squash on merge.

Old-Style: `.patch` Files on JIRA

Some committers still review .patch attachments. Know the convention.

Generate

git format-patch apache/master -o /tmp/
# Produces /tmp/0001-TEZ-NNNN-Fix-...patch

Or, for one combined diff:

# Three-dot form diffs against the merge base, so upstream-only commits stay out of the patch
git diff apache/master...HEAD > /tmp/TEZ-NNNN.01.patch

Naming convention

TEZ-<NNNN>.<iteration>.patch. So your first attachment is TEZ-4567.01.patch, second iteration after review feedback is TEZ-4567.02.patch. Some committers use TEZ-4567.001.patch (three-digit). Match whatever pattern the most recent committer used on that issue.

For a branch-specific patch (e.g. against branch-0.10):

TEZ-4567.branch-0.10.01.patch.
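
The naming convention above is mechanical enough to sketch in code. The helper below is a hypothetical illustration (the class `PatchFileName` and its parameters are assumptions made for this example, not part of Tez or any Apache tooling); it builds the file name from the issue number, an optional branch, the iteration, and the digit width the most recent committer used on the issue.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Tez or Apache tooling.
final class PatchFileName {
    private PatchFileName() {}

    /**
     * Builds TEZ-<NNNN>.<iteration>.patch, or
     * TEZ-<NNNN>.<branch>.<iteration>.patch when a branch is given.
     * `digits` is the zero-padded width of the iteration (2 or 3 in the text).
     */
    static String of(int issue, String branch, int iteration, int digits) {
        if (issue <= 0 || iteration <= 0 || digits <= 0) {
            throw new IllegalArgumentException("issue, iteration and digits must be positive");
        }
        String iter = String.format(Locale.ROOT, "%0" + digits + "d", iteration);
        String branchPart = (branch == null || branch.isEmpty()) ? "" : branch + ".";
        return "TEZ-" + issue + "." + branchPart + iter + ".patch";
    }

    public static void main(String[] args) {
        System.out.println(of(4567, null, 1, 2));          // first attachment
        System.out.println(of(4567, null, 2, 2));          // after review feedback
        System.out.println(of(4567, null, 1, 3));          // three-digit variant
        System.out.println(of(4567, "branch-0.10", 1, 2)); // branch-specific patch
    }
}
```

Keeping the digit width as a parameter reflects the advice to match whatever pattern is already in use on the issue rather than hard-coding one style.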

Attach

In JIRA: "Attach files" → upload. Then "More" → "Patch Available" to flip the state. Cancel patch (revert to "In Progress") if you find a problem before review starts.

The JIRA workflow is covered fully in Step 9. The patch-file mechanics live here.

Rebasing on Master Without Losing Review Comments

GitHub's PR view loses inline comment threads when you force-push a rebase that changes the SHAs reviewers commented on. To minimize the damage:

Don't rebase mid-review unless you have to. Merge commits from apache/master into your branch are usually acceptable during active review; squash at the end.
When you do rebase, leave a comment: "Force-pushed to rebase on master (was <old SHA>, now <new SHA>). All review threads should still be visible against the latest commit."
Squash only at the very end, after approval, just before merge.
If you really break the comment threads, post a summary comment listing "what was at line X became line Y in the new push." Reviewers appreciate it.

To rebase:

git fetch apache
git checkout tez-NNNN-<slug>
git rebase apache/master
# If conflicts: edit, git add, git rebase --continue
mvn install -DskipTests -pl <module> -am -q
mvn test -pl <module> -Dtest=<YourTests> -q
git push --force-with-lease origin tez-NNNN-<slug>

Co-Author and Sign-Off

If a committer or another contributor materially helped (suggested the fix direction, found the root cause), credit them:

TEZ-NNNN: <summary>

<body>

Co-authored-by: Alice <alice@example.org>

Tez does not require a Signed-off-by line (it is not a DCO project — it requires an Apache CLA), but committers appreciate when you note influences in the commit message.

What Reviewers Look For First

In rough order:

PR title and JIRA link — wrong format, instant correction request.
Description quality — vague description, "please clarify" comment.
Diff size — > ~200 lines for a "bug fix" gets scrutiny on scope creep.
Tests present — no tests, immediate request.
Tests fail on master, pass with fix — confirms the test is actually testing the fix, not just a happy path.
Production diff is minimum to fix the bug — every extra change has to justify itself.
Style and convention compliance — checkstyle and tests must be green.
API hygiene — no public methods added/removed without discussion.
Backward compatibility — does the fix change observable behavior for non-buggy cases? If yes, was it discussed?

Optimize for the first seven before you push. The last two are usually discussed in JIRA comments before the PR opens.

Validation / Self-check

Before advancing to Step 9:

PR exists on apache/tez with the format TEZ-NNNN: <summary>.
PR description follows the template; cites the root-cause document.
Commit message follows the template (title, body, tests footer).
GitHub Actions precommit is green (every check), or every red has a documented and accepted explanation.
Branch is rebased on a recent apache/master (ideally within the last 24-48 hours).
PR is open, and its URL is pasted into the JIRA as a comment.
You can articulate in one paragraph why every line of your diff is necessary, if a reviewer asks.

Step 9: JIRA and Documentation

The JIRA is the project's permanent memory. The PR is ephemeral — it lives on GitHub, gets merged, fades into git log. The JIRA is what users grep when they hit a similar bug two years later, what release managers read when compiling release notes, what new contributors find when researching prior art. Treating it as a checkbox is the laziest possible thing you can do.

This step is short on procedure and heavy on hygiene. Twenty minutes done well saves three different people an hour each later.

The Status Workflow

Tez uses Apache's standard JIRA workflow. The states you will pass through:

Open  ->  In Progress  ->  Patch Available  ->  Resolved  ->  Closed
                              ^                    ^
                              |                    |
                            (you)              (committer)

State	Set by	Meaning
Open	Reporter	Bug exists, nobody is working on it.
In Progress	Assignee	Someone is actively investigating.
Patch Available	Assignee	A patch / PR is ready for committer review.
Resolved	Committer	Patch merged. `Resolution: Fixed` plus `Fix Version`.
Closed	Anyone	Verified in a release. Often skipped — many Tez tickets stay Resolved indefinitely.

Transitioning correctly

You move it to In Progress when you claim it in Step 1.
You move it to Patch Available when your PR is open and precommit is green. This is the signal "ready for human review, not just CI."
You do NOT move it to Resolved. Only a committer does that when they merge. Setting it yourself will be reverted, and you will look new.
If a committer asks you to revise, the state usually stays at Patch Available. Move back to In Progress only if you'll be rewriting significantly (multi-day rework).
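
The state table and the transition rules above can be read as a small permission check. The sketch below is illustrative only: the `JiraWorkflow` class, its enums, and the two-role split are assumptions made for this example, not JIRA's API or configuration. It encodes which moves exist and that only a committer may move an issue to Resolved.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative model of the workflow described above; names are assumptions.
final class JiraWorkflow {
    enum Status { OPEN, IN_PROGRESS, PATCH_AVAILABLE, RESOLVED, CLOSED }
    enum Role { CONTRIBUTOR, COMMITTER }

    private static final Map<Status, Set<Status>> NEXT = new EnumMap<>(Status.class);
    static {
        NEXT.put(Status.OPEN, EnumSet.of(Status.IN_PROGRESS));
        NEXT.put(Status.IN_PROGRESS, EnumSet.of(Status.PATCH_AVAILABLE));
        // Back to IN_PROGRESS covers "cancel patch" and a multi-day rework.
        NEXT.put(Status.PATCH_AVAILABLE, EnumSet.of(Status.IN_PROGRESS, Status.RESOLVED));
        NEXT.put(Status.RESOLVED, EnumSet.of(Status.CLOSED));
        NEXT.put(Status.CLOSED, EnumSet.noneOf(Status.class));
    }

    /** True if `who` may move an issue from `from` to `to` in this model. */
    static boolean allowed(Role who, Status from, Status to) {
        if (!NEXT.get(from).contains(to)) {
            return false;
        }
        // Only a committer resolves; the table lets anyone close.
        return to != Status.RESOLVED || who == Role.COMMITTER;
    }

    public static void main(String[] args) {
        System.out.println(allowed(Role.CONTRIBUTOR, Status.IN_PROGRESS, Status.PATCH_AVAILABLE));
        System.out.println(allowed(Role.CONTRIBUTOR, Status.PATCH_AVAILABLE, Status.RESOLVED));
        System.out.println(allowed(Role.COMMITTER, Status.PATCH_AVAILABLE, Status.RESOLVED));
    }
}
```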

The `Patch Available` ritual

When you flip to Patch Available, leave a comment:

PR is now open at <link>, precommit is green, ready for review.

Summary: <one paragraph from the PR description>.

Tests: <list>.

Specific reviewer requests: <if any, e.g. "would appreciate a look from
@someone since they wrote the original code">.

This wakes up the JIRA's watchers (committers who follow issues@) and gives them enough context to decide whether to pick it up.

Required Fields

Field	Who sets	What to set
Assignee	You	Yourself (Step 1).
Component	You	`tez-dag`, `tez-runtime-library`, etc. — whatever module you primarily changed.
Affects Version	Reporter or you	The earliest version where the bug reproduces.
Fix Version	Committer	Leave blank. Only PMC/committers set this. You can comment "suggesting fix version X.Y.Z" if you have a strong opinion.
Priority	Reporter or PMC	Don't bump your own. Comment if you think the priority is wrong.
Labels	You	Add `flaky-test` / `recovery` / `shuffle` if it helps grep later. Don't invent vanity labels.
Release Notes	You, if user-visible	Mandatory if behavior, API, or configuration changes are visible. See below.
Linked Issues	You	Link the PR (web link) and any related JIRAs. See below.

Release Notes

If your fix changes anything a user can observe — output format, config key default, error message, performance characteristic — fill out the "Release Notes" field. Format:

Fixed an issue where vertices with speculatively-populated recovery data
would not transition to SUCCEEDED after all tasks completed. Affects DAGs
submitted via TezClient when checkpoint-based recovery is enabled. No
configuration or API change is required.

Two to four sentences. Past tense ("Fixed"). User-facing language ("DAGs", "TezClient"), not implementation jargon ("V_TASK_COMPLETED handler").

If your fix is purely internal (refactor of a private method, test-only change), leave Release Notes blank. The release manager will skip it.

Linking the PR

Issue Links → "is related to" → Web Link → paste the GitHub PR URL.

Tez also has a bot that auto-links a PR to the JIRA when the PR title starts with TEZ-NNNN:. The bot fires within minutes. If after an hour the JIRA does not have a "GitHub Pull Request" link visible, add it manually:

JIRA → More → Link → Web Link → URL: https://github.com/apache/tez/pull/<NNN> → Link Text: GitHub PR.
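
Because the auto-link depends on the PR title starting with TEZ-NNNN:, it can be worth checking the title before opening the PR. The following is a minimal sketch under stated assumptions: the exact pattern the bot matches is not documented here, so the regular expression below is a guess at a reasonable strict form, and the class `PrTitle` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-flight check; the bot's real matching rule is an assumption.
final class PrTitle {
    // "TEZ-" + digits + ": " + a summary that starts with a non-space character.
    private static final Pattern TITLE = Pattern.compile("^TEZ-(\\d+): (\\S.*)$");

    private PrTitle() {}

    static boolean isWellFormed(String title) {
        return title != null && TITLE.matcher(title).matches();
    }

    /** Derives the JIRA browse URL from a well-formed title. */
    static String jiraUrl(String title) {
        Matcher m = TITLE.matcher(title);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a TEZ-NNNN: <summary> title: " + title);
        }
        return "https://issues.apache.org/jira/browse/TEZ-" + m.group(1);
    }

    public static void main(String[] args) {
        String title = "TEZ-4567: Gate recovery short-circuit on active replay";
        System.out.println(isWellFormed(title));
        System.out.println(jiraUrl(title));
        System.out.println(isWellFormed("Fix vertex stuck in RUNNING"));
    }
}
```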

If your fix interacts with other tickets, link them explicitly:

Relation	When to use
is duplicated by	Another JIRA is a duplicate of yours (close that one).
duplicates	Yours is the duplicate (close yours, work on the older one).
is related to	Touches similar code but distinct issue.
is blocked by	You cannot land until another JIRA lands first.
is caused by	Bisect identified TEZ-XYZ as the regression source.
supersedes	Your fix replaces an older abandoned attempt.

Be conservative. Spurious links pollute the issue graph. Cross-link only where the connection is concrete.

Code Comments in the Fix

The JIRA explains what and why at the project level. Inline code comments explain why at the file level for the next person editing this line.

Good inline comment patterns:

// TEZ-NNNN: only short-circuit when recovery replay is actually in progress;
// recoveryData may be populated speculatively at vertex init even when no
// replay will occur. See the JIRA for the affected scenario.
if (recoveryData != null && isReplayingRecovery()) {
  handleRecovery(event);
  return;
}

Rules:

Cite the JIRA number. Future-you grepping the file for TEZ- will find the context immediately.
Explain the non-obvious invariant, not what the code obviously does. Never write // increment counter next to count++.
One or two lines max. If you need a paragraph, write the design note in the class Javadoc or in a markdown doc under docs/.
Don't paste the entire root-cause document. The JIRA holds that.

Notifying Watchers

After Patch Available, the JIRA's watchers receive an email notification. If you want a specific committer's attention (e.g. the author of the introducing commit from your git bisect), @mention them in a JIRA comment:

[~alice] (author of TEZ-2877) — would appreciate a sanity check on the
recovery short-circuit gating in this PR, since you wrote the original
branch. No urgency.

The [~jira-username] syntax is JIRA's mention. Find the username from their JIRA profile URL (https://issues.apache.org/jira/secure/ViewProfile.jspa?name=<username>).

Do this once. Do not @-mention in every subsequent comment — committers filter their inboxes.

Backporting Fix to Branches

For most Capstone work, you fix on master and stop. But if your bug affects a maintained release branch and a committer asks you to backport:

Comment on the JIRA: "Will backport to branch-0.10 once master patch lands."

After merge to master, create a new branch from apache/branch-0.10:

git fetch apache
git checkout -b tez-NNNN-branch-0.10 apache/branch-0.10
git cherry-pick <master-fix-commit-sha>
# Resolve conflicts (often minor; sometimes major if branch diverged).

Run validation on the branch (same Step 7 checks).
Open a separate PR titled TEZ-NNNN (branch-0.10): <summary> or attach a TEZ-NNNN.branch-0.10.01.patch to the same JIRA.

Each branch's PR/patch is a separate review.

After Merge

When a committer merges your PR:

The PR is closed automatically, and they'll comment "Committed to master, thanks @you" with the merged-commit SHA.
They (or the bot) set the JIRA to Resolved with Resolution: Fixed and Fix Version: X.Y.Z.

You comment with a thanks and any follow-up plans:

Thanks for the review and merge, [~alice]. I'll watch for the next RC
to verify it lands cleanly. Filed TEZ-MMMM for the follow-up refactor
we discussed.

If you spotted a related improvement during review, file the follow-up JIRA immediately — do not let it slip.

Documentation Beyond the JIRA

Most bug fixes need no further doc. Exceptions:

Change	Where to document
New config key	`tez-api/src/main/resources/META-INF/services/...` if not auto-generated; reference from the Tez site config docs page.
New public API	Javadoc on the new method/class + the relevant `docs/<feature>.md` if one exists.
Behavior change visible to operators	A note in `CHANGELOG.md` (committer usually handles), and a JIRA Release Notes entry (you write this).
New tunable or debug flag for operators	Mention in the Tez configuration reference page (commit to the `tez-site/` directory or open a JIRA for the site update).

When in doubt, ask in the JIRA: "Should I update the docs page for X as part of this, or as a follow-up JIRA?" Committers will tell you.

Validation / Self-check

Before advancing to Step 10:

JIRA status is Patch Available with a comment summarizing the change and linking the PR.
Assignee is you.
Component is set to the right module.
Affects Version is set to a real Tez version where the bug reproduces.
Release Notes field is filled in (or explicitly blank with a one-line "internal only" justification in the PR description).
PR is linked under Issue Links → Web Link.
Any related JIRAs are cross-linked with the correct relation (is caused by / is related to / etc).
Inline code comments cite TEZ-NNNN where the change is non-obvious.
If a committer was specifically helpful (author of regressing commit, reviewer on related work), you @-mentioned them once, not repeatedly.

Step 10: Engineering Write-Up

The patch is merged. The JIRA is Resolved. Most contributors stop here. The ones who become committers write the post. The write-up is the artifact that travels with you when you change jobs, apply for a committer vote, or get cited by another contributor doing similar work.

Eight hundred to a thousand words. Most of it written in the four hours right after merge, while the dead ends are still fresh.

Why It Matters

Three audiences:

Future you. Six months from now you'll touch this code again and want to remember what you tried.
The next contributor working a similar bug. They'll find your post via Google ("Tez vertex stuck RUNNING") and shortcut a week of work.
The committers / PMC evaluating you for a vote. They want to see that you can communicate engineering reasoning, not just produce diffs.

A good write-up is not a press release. It is a postmortem: honest about what you tried, including the failed approaches.

The Template

Sections in order, suggested word counts.

Title (one line)

Fixing TEZ-NNNN: <one-line technical summary>

Examples:

"Fixing TEZ-4567: A speculative-recovery short-circuit race in VertexImpl"
"Fixing TEZ-3982: Why our shuffle was 30% slower on small inputs"
"Fixing TEZ-2451: An off-by-one in MergeManager spill accounting"

Technical, specific. Not "My first Apache Tez contribution" — write that post separately on your blog. The engineering post stands on its own.

Problem (100–150 words)

What broke, for whom, under what conditions. Plain English, but precise.

Tez vertices configured with checkpoint-based recovery would intermittently
fail to transition to SUCCEEDED, leaving the DAG in RUNNING state until the
AM hit its global timeout. The bug only manifested when the application
master pre-populated recovery data at vertex initialization (rather than
lazily during an actual replay), which is the path used by long-running
Tez sessions reusing AMs across DAG submissions.

The symptom was a stalled DAG with all tasks reporting SUCCEEDED in the
counters but no DAGFinishedEvent in the AM log. Affected Tez 0.9.x and
0.10.0 onward.

State the symptom (what the user sees), the trigger condition (when it manifests), and the affected version range. No code yet.

Investigation Log (200–300 words)

The most valuable section. Walk through what you tried, including the hypotheses that were wrong.

Initial hypothesis was a task-scheduler bug — we suspected
TaskSchedulerManager was dropping a TASK_COMPLETED event under load.
DrainDispatcher-based reproducers in isolation showed no event loss, so
we ruled this out within a day.

Second hypothesis: a state-machine transition guard rejecting the final
event. Adding TRACE logging to VertexImpl confirmed V_TASK_COMPLETED was
arriving and being dispatched, but completedTaskCount remained one short
of total. This shifted attention from "the event is missing" to "the
event is processed but not by the expected handler."

Reading VertexImpl.handle(...) line by line revealed the recovery
short-circuit at line ~2400: `if (recoveryData != null) { handleRecovery(...); }`.
A git blame placed this in TEZ-2877 (commit a1b2c3d4), where the
assumption "non-null recoveryData implies active replay" was reasonable
at the time but became invalid when TEZ-3105 introduced speculative
recovery-data population at vertex init.

The actual race: V_TASK_COMPLETED for the final task arrived while
recoveryData was populated but no replay was in progress, and nothing in
the handler checked whether a replay was actually active.

Three to five hypotheses, in the order you tried them. Each with one sentence on what suggested it and one sentence on what disproved it. The dead ends are not embarrassments — they are the work, and they teach readers what not to spend a week on.

Root Cause (50–100 words)

One paragraph, the truth as you now understand it.

The vertex state machine's V_TASK_COMPLETED handler in the RUNNING state
short-circuited any event to handleRecovery() when recoveryData was non-null,
regardless of whether a recovery replay was actually in progress. Speculative
population of recoveryData at vertex initialization (TEZ-3105) made the
guard fire in normal execution, routing terminal events to the recovery
path which silently ignored them when not replaying. The completedTaskCount
counter never reached totalTaskCount, blocking the SUCCEEDED transition.

Cite the introducing JIRA. Cite the bisect commit if you have it.

Final Design (150–200 words)

What you actually changed and why this design over alternatives.

The fix introduces an isReplayingRecovery() predicate that returns true
only when a recovery replay is in flight (tracked by an existing
RecoveryState flag in DAGAppMaster). The short-circuit is gated on this
predicate:

  if (recoveryData != null && isReplayingRecovery()) { ... }

This is a one-line production change plus a four-line predicate method.
It preserves all behavior for actual recovery scenarios and corrects the
behavior only for the speculatively-populated case.

Show the diff size and the principle ("minimum surface area"). Note any public API impact (here: none).
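
To make the effect of the gate concrete, here is a toy model of the before and after behavior. Everything in it is a hypothetical stand-in written for this example (`ToyVertex`, its fields, and `runAll` are assumptions, not Tez's VertexImpl); it only mirrors the shape of the guard described above, with a recovery path that does not count the completion.

```java
// Toy model only: ToyVertex and everything in it are hypothetical
// stand-ins for illustration, not Tez's VertexImpl.
final class ToyVertex {
    private final int totalTaskCount;
    private final Object recoveryData; // may be non-null even when no replay runs
    private final boolean replaying;   // true only while a replay is in flight
    private final boolean gated;       // false = original guard, true = fixed guard
    private int completedTaskCount;

    ToyVertex(int totalTaskCount, Object recoveryData, boolean replaying, boolean gated) {
        this.totalTaskCount = totalTaskCount;
        this.recoveryData = recoveryData;
        this.replaying = replaying;
        this.gated = gated;
    }

    private boolean isReplayingRecovery() {
        return replaying;
    }

    void onTaskCompleted() {
        boolean shortCircuit = gated
            ? recoveryData != null && isReplayingRecovery()
            : recoveryData != null;
        if (shortCircuit) {
            return; // recovery path: in this toy it never counts the completion
        }
        completedTaskCount++;
    }

    boolean succeeded() {
        return completedTaskCount == totalTaskCount;
    }

    /** Completes every task once and reports whether the vertex succeeded. */
    static boolean runAll(int tasks, Object recoveryData, boolean replaying, boolean gated) {
        ToyVertex v = new ToyVertex(tasks, recoveryData, replaying, gated);
        for (int i = 0; i < tasks; i++) {
            v.onTaskCompleted();
        }
        return v.succeeded();
    }

    public static void main(String[] args) {
        Object speculative = new Object();
        System.out.println(runAll(20, speculative, false, false)); // original guard
        System.out.println(runAll(20, speculative, false, true));  // gated guard
        System.out.println(runAll(20, null, false, false));        // no recovery data
    }
}
```

A model this small is also a useful template for the negative-control test: the case with no recovery data should behave identically under both guards.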

Alternatives Considered (100–150 words)

Two to three alternatives you rejected, with the reason.

**Alternative 1: stop populating recoveryData speculatively at vertex init.**
Rejected: TEZ-3105 documented performance reasons for the eager population
(avoids a stall when actual recovery kicks in). Reverting it would
regress that path.

**Alternative 2: have handleRecovery() forward the event back to the
standard transition when not replaying.** Rejected: it works, but couples
the recovery path to internal knowledge of which events the standard
transition needs. The gate-at-source approach is local and reviewable.

**Alternative 3: remove the short-circuit entirely and let handleRecovery()
no-op when not replaying.** Rejected: changes the semantics of every other
event flowing through the recovery path, with broader behavioral risk for
a narrowly-scoped bug.

This is the section that separates contributor-quality write-ups from committer-quality ones. Anyone can ship a fix. Articulating why this fix and not the obvious alternatives demonstrates engineering judgment.

Performance / Behavior Impact (50–100 words)

If perf-relevant, numbers from Step 7. Otherwise, one sentence:

No measurable performance impact. The new predicate adds a single field
read on a hot path (VertexImpl.handle), next to the null check the original
short-circuit already performed on every event. Validated via
TestOrderedWordCount runtime: no statistically significant change across 10 runs.

Lessons Learned (100–150 words)

The transferable insights, written for a peer. Things you would tell yourself before starting.

- Recovery code in Tez has always been the sharpest edge: it is the
  least-tested path because it only runs during AM failover, and most
  developer environments don't trigger it. When a bug touches recovery
  data flow, assume the test coverage is thin and add reproducers
  aggressively.
- `git log -S` (the pickaxe search) and `git bisect` together were decisive: bisect found
  the introducing commit (TEZ-2877), and pickaxe on the changed expression
  showed it had never had a guard. Without bisect this would have been
  a week of code archaeology.
- DrainDispatcher in TestVertexImpl is underused. The repro test for this
  bug took two hours to write once I learned the pattern, and it is now
  permanent regression protection.

Three to five bullets. Concrete enough that a peer at another project could apply them.

Links

- JIRA: https://issues.apache.org/jira/browse/TEZ-NNNN
- PR: https://github.com/apache/tez/pull/<NNN>
- Merged commit: <SHA>
- Introducing commit (TEZ-2877): <SHA>

Where to Publish

Three venues, in roughly decreasing order of effort and impact.

1. Personal blog or company engineering blog

Full ~1000-word write-up. SEO-friendly title with the JIRA number and a keyword phrase users would search for ("Tez vertex stuck RUNNING fix"). Link prominently to JIRA and PR. This is the version that follows you across jobs.

2. Apache wiki / Tez documentation

Shorter version (300–500 words) focused on the lesson, not the personal narrative. Filed under a relevant page (recovery troubleshooting, debugging state machines). Requires wiki access — committers will grant it once you have a few merged contributions.

3. dev@ summary email

A two- to three-paragraph summary on dev@tez.apache.org with subject [TEZ-NNNN] Notes on the fix. Lets watchers and PMC see the engineering reasoning without having to read the whole PR. Optional but earns goodwill.

Subject: [TEZ-NNNN] Notes on the fix

Hi all,

TEZ-NNNN was merged this morning. Quick notes on the investigation, since
recovery bugs are uncommon and the root cause was a non-obvious
interaction with TEZ-3105:

<2 paragraphs of summary>

Full write-up: <link to blog post>

Thanks again to Alice for the review.

Anti-Patterns

What separates write-ups that help from ones that don't:

"I learned a lot working on this!" — Yes, we know. Cut it. The artifact is the engineering, not the feel-good.
Personal narrative dominating the engineering. Save the "my journey into open source" angle for a separate post. Engineering posts get cited and reread. Narrative posts get one-time clicks.
Sanitized version where you "knew the answer all along." Nobody believes this and it actively misleads new contributors who feel inadequate when their investigation is messy. Be honest about the dead ends.
No code snippets. A write-up without showing the actual diff or the symptomatic log line is unfalsifiable.
No links. JIRA, PR, commit — all three minimum. A write-up without the JIRA link is unreviewable.
Word-padding to look thorough. A tight 600-word write-up that respects the reader beats a 2000-word slog every time.

Validation / Self-check

Before declaring the Capstone complete:

The write-up is published at a URL you can share (blog, GitHub Gist, capstone-work/writeup.md in a public repo).
It is 500–1000 words; not 200 (too thin) and not 3000 (padding).
Investigation Log section contains at least two hypotheses you ruled out, not only the winning one.
Alternatives Considered section names at least two designs you rejected with reasons.
Lessons Learned section has three to five bullets, each concrete enough to be reusable by another contributor.
JIRA, PR, and merged-commit SHA are all linked.
The write-up reads as something a peer engineer would respect, not a triumphalist blog post.

Evaluation Rubric

A 100-point self-grading rubric for the Capstone. Score yourself honestly after you finish Step 10. The scoring is calibrated against what Tez committers actually look for — not what feels good to read.

The point of this rubric is not the score. It is the diagnostic: a low score on one dimension tells you exactly where to invest the next contribution.

Scoring Dimensions

Seven dimensions, weighted by how much they matter for review outcomes.

#	Dimension	Points
1	Problem articulation	20
2	Execution-path mastery	20
3	Implementation quality	20
4	Testing	15
5	Review responsiveness	10
6	Documentation	10
7	Community interaction	5
	Total	100

1. Problem Articulation (20 pts)

Can you state, in one paragraph, what was broken, for whom, under what conditions?

Score	What it looks like
18-20	Crisp one-paragraph statement covering symptom, trigger conditions, affected version range, and operational impact. Distinguishes "this is what the user sees" from "this is the underlying mechanism." Could be read aloud at a standup and a peer would correctly grasp the bug.
14-17	Clear symptom but trigger conditions vague ("happens sometimes under load"). OR trigger clear but conflates symptom with root cause.
10-13	Reader needs to ask follow-up questions to understand what was broken. Uses jargon without grounding it in user-visible behavior.
5-9	Mostly restates the JIRA title. No conditions. No version impact.
0-4	"It was broken and I fixed it."

Look for: the word "intermittent" used without a documented trigger, or conflation of symptom (vertex stuck) with cause (event short-circuit). Either one costs points.

2. Execution-Path Mastery (20 pts)

Did you actually trace the code, or did you guess?

Score	What it looks like
18-20	Step-3 document maps the full path from user submission to bug location with file:line citations at every layer. Includes a diagram (mermaid or text-arrow). Cites the AsyncDispatcher event hop and the specific state-machine transition where the bug fires. Reviewer reading it could open each file at each line and follow the logic without asking questions.
14-17	Most layers cited but one or two skipped ("then the event reaches VertexImpl"). Diagram present but missing a critical hop.
10-13	Cites the location of the bug correctly but does not trace how execution reached it. No diagram.
5-9	Vague references ("the dispatcher handles it") without file:line.
0-4	No execution-path document, or it is just a paragraph of prose.

Look for: presence of tez-api/src/main/...-style paths with line numbers that match the resolved commit SHA.

3. Implementation Quality (20 pts)

Diff hygiene, scope discipline, convention compliance.

Score	What it looks like
18-20	Minimum-diff fix. Production change measured in tens of lines, not hundreds. Every changed line is justifiable in one sentence. No drive-by refactors, no opportunistic renames. Public API surface unchanged unless required. Naming, slf4j logging style, Preconditions, exception messages all match Tez conventions. Checkstyle, SpotBugs, RAT all green without manual overrides.
14-17	Mostly minimum-diff but one or two stray changes that don't belong. Conventions mostly followed; minor style nits a reviewer would flag.
10-13	Fix works but is broader than necessary. Scope creep ("while I was here I cleaned up..."). Conventions inconsistently applied.
5-9	Significant scope creep. Public API changed unnecessarily. Style violations would block precommit without revision.
0-4	Diff is so large reviewers would request it be broken up before reviewing. OR breaks public API silently.

Look for: scope-creep tells, such as git diff apache/master --stat listing files unrelated to the bug.

4. Testing (15 pts)

Coverage, determinism, regression value.

Score	What it looks like
14-15	New unit test reproduces the bug deterministically on master (DrainDispatcher or equivalent), passes with fix. Negative-control test (similar input where the bug should NOT trigger) included. Branch coverage on the changed lines is high. Integration test with MiniTezCluster confirms the fix in an end-to-end DAG. No Thread.sleep, no wall-clock dependencies, no order-dependent assertions. Test ran 10x in a loop without flake.
11-13	Unit test present and deterministic but no negative control. OR has an integration test but the unit test is weak.
7-10	Unit test present but uses Thread.sleep or is otherwise non-deterministic. Coverage of fix path incomplete.
3-6	Test exists but only checks the happy path; would have passed before the fix.
0-2	No new tests, or tests that fail on master AND on the fix.

Look for: presence of dispatcher.await() rather than Thread.sleep; a test name that describes the scenario (testV_TASK_COMPLETED_inRunningWithRecovery) rather than the method (testHandle).
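
The difference between awaiting a drain and sleeping can be shown with a toy dispatcher. This is a stand-in written for this example, not Tez's DrainDispatcher: the class `ToyDispatcher` and its methods are assumptions, and it uses a single-threaded executor so that a marker task finishing implies every earlier event has been handled.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Toy single-threaded dispatcher with a drain-style await(). A stand-in
// for illustration, not Tez's DrainDispatcher.
final class ToyDispatcher implements AutoCloseable {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    void dispatch(Runnable event) {
        worker.execute(event);
    }

    /** Blocks until every event dispatched before this call has been handled. */
    void await() throws InterruptedException, ExecutionException {
        // A single-threaded executor runs tasks in submission order, so when
        // this marker task finishes, all earlier events have finished too.
        worker.submit(() -> { }).get();
    }

    @Override
    public void close() {
        worker.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger completed = new AtomicInteger();
        try (ToyDispatcher dispatcher = new ToyDispatcher()) {
            for (int i = 0; i < 20; i++) {
                dispatcher.dispatch(completed::incrementAndGet);
            }
            dispatcher.await(); // no Thread.sleep, no wall-clock guess
            System.out.println(completed.get());
        }
    }
}
```

The assertion after `await()` does not depend on how fast the machine is, which is the property the top scoring band asks for.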

5. Review Responsiveness (10 pts)

How well you ran the review cycle.

Score	What it looks like
9-10	Every reviewer comment addressed in code or with a substantive reply. Iteration cadence < 48h on most comments. Disagreements (when they happened) made the technical case without defensiveness. Updated PR description after material changes so the top-of-PR text stays accurate.
7-8	Addresses comments correctly but slow (multi-day gaps). OR addresses most comments but lets a few stylistic ones slide without acknowledgement.
5-6	Defensive on at least one comment ("but I think my way is fine"). OR force-pushed without summarizing the diff for reviewers.
2-4	Required multiple reminders from reviewers. Comments not addressed cleanly.
0-1	PR went silent for > 2 weeks without explanation, or contributor argued every comment.

Look for: PR review threads marked "resolved" by the contributor with a substantive commit pushed, not just a reply.

6. Documentation (10 pts)

JIRA fields, code comments, write-up presence.

Score	What it looks like
9-10	JIRA has Component, Affects Version, Release Notes (if user-visible), PR link, and relevant cross-links. In-code comments cite TEZ-NNNN where the change is non-obvious. Write-up exists at a public URL. JIRA status correctly walked through In Progress -> Patch Available.
7-8	JIRA mostly filled but Release Notes missing on a user-visible change. Code comments present but don't cite the JIRA.
5-6	JIRA workflow followed but fields incomplete. No write-up beyond the PR description.
2-4	JIRA fields blank or wrong. Comments absent at the surprising lines.
0-1	No JIRA hygiene at all.

Look for: the JIRA's "Release Notes" field being populated or an explicit note explaining why it's intentionally blank.

7. Community Interaction (5 pts)

Mailing list etiquette, claiming/handoff hygiene.

Score	What it looks like
5	Claimed the JIRA before starting. Posted to dev@ only when meaningful (design question, summary after merge). Used `[TEZ-NNNN]` subject prefix. Was reachable during review. Thanked reviewers explicitly. If they hit a wall, posted clearly with "stuck on X, considering A/B/C, leaning A because Y."
3-4	Mostly good etiquette; one minor slip (claimed late, or one off-topic mailing-list post).
1-2	Did not claim the JIRA before working. OR sent mailing-list traffic that was really just chat ("does anyone know...").
0	Worked silently for weeks, then dropped a PR with no JIRA assignment and no context.

Look for: a JIRA comment by the contributor before the first PR push, along the lines of "Working on this, will have a patch in a few days."

Tier Thresholds

Where you land tells you what to do next.

Score	Tier	Interpretation
95-100	PMC-ready	This is the quality of work that earns a committer vote, given a track record of several such contributions over months. You are operating at the level of someone the PMC would trust to maintain a module.
90-94	Committer-ready	You are writing patches at committer quality. With 3-5 such contributions across different modules over 6-12 months and demonstrated review participation on others' patches, a vote is plausible.
80-89	Strong contributor	A reliable contributor whose patches need minimal review iteration. Keep building the track record; this is the level where committers actively look forward to reviewing your work.
65-79	Contributor	Solid bug-fix-grade work. Patches land with normal review iteration. Most contributions to most projects live here, and it is honorable work.
50-64	Learning	Patches eventually land but with significant reviewer guidance. Use the next contribution to focus on the dimension where you scored lowest.
< 50	Foundational gap	The contribution may have merged, but the process skipped enough corners that another reviewer or future maintainer is paying a tax. Restart with a smaller bug and apply the rubric end-to-end.

The tier is not a personality assessment. It is calibrated to the artifact you produced for this one Capstone. The same person can score 65 on one contribution and 95 on the next.
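The threshold table can be read as a simple lookup. As a minimal sketch, assuming you have already summed your dimension scores into one integer total, the hypothetical helper below (`tier_for_score` is not part of any Tez tooling) returns the tier name for that total:

```shell
# Map a total Capstone score (0-100) to the tier names in the table above.
# tier_for_score is a hypothetical helper written for this sketch.
tier_for_score() {
  score=$1
  # Reject anything that is not a plain non-negative integer.
  case $score in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;
  esac
  if   [ "$score" -gt 100 ]; then echo "invalid"; return 1
  elif [ "$score" -ge 95 ];  then echo "PMC-ready"
  elif [ "$score" -ge 90 ];  then echo "Committer-ready"
  elif [ "$score" -ge 80 ];  then echo "Strong contributor"
  elif [ "$score" -ge 65 ];  then echo "Contributor"
  elif [ "$score" -ge 50 ];  then echo "Learning"
  else                            echo "Foundational gap"
  fi
}
```

For example, `tier_for_score 87` prints `Strong contributor`. The label describes the artifact, so rerun it per contribution rather than keeping a running average.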

How to Self-Grade

Block 30 minutes. Open this rubric. Open your own artifacts side by side (JIRA, PR, code, root-cause doc, write-up, validation report). Score each dimension by reading the band descriptions and picking the one that most honestly matches what you produced.

Two rules:

No interpolation upward. If you're between 14 and 17 on a dimension and unsure, score 14. Consider it the optimist's tax.
One independent reviewer. Ask a peer (ideally another contributor) to score independently on the same rubric. If your scores differ by more than 10 points on any dimension, talk about it. The difference is where the calibration lives.

Record both scores in capstone-work/self-grade.md along with one sentence per dimension on what would have moved the score up one band. This becomes the input for the next contribution's plan.
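The second rule is easy to mechanize once both reviewers have scored. The sketch below assumes a hypothetical three-column layout (dimension name without spaces, self score, peer score) fed on stdin; the layout and the `score_gaps` name are illustrative choices, not something this rubric prescribes:

```shell
# Flag rubric dimensions where self and peer scores differ by more than 10,
# and print both totals. Input (stdin): "<dimension> <self> <peer>" per line.
# The three-column layout is an assumption made for this sketch.
score_gaps() {
  awk '
    NF == 3 {
      gap = $2 - $3
      if (gap < 0) gap = -gap          # absolute difference
      self += $2; peer += $3           # running totals
      if (gap > 10) printf "DISCUSS %s (self=%d peer=%d)\n", $1, $2, $3
    }
    END { printf "TOTAL self=%d peer=%d\n", self, peer }
  '
}
```

A gap of exactly 10 is not flagged, matching the "more than 10 points" wording. Paste the output into capstone-work/self-grade.md next to your one-sentence notes per dimension.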

What to Do With a Low Score

Lowest dimension	Next contribution focus
Problem articulation	Pick a smaller, sharper bug. Write the one-paragraph problem statement before you edit the JIRA, and post it for review.
Execution-path mastery	Pick a bug in a layer you've never traced (e.g. you've done DAG-level, now do shuffle-level). Force yourself to write the path doc before reading the existing tests.
Implementation quality	Pick a bug where the minimum fix is < 10 lines. Practice the discipline of leaving the surrounding code untouched.
Testing	Pick a flaky-test JIRA (Stage 9 of the roadmap). The whole bug is about testing discipline.
Review responsiveness	Pick a bug in a high-traffic area where you'll get more reviewers. Set a 24-hour SLA for yourself on every comment.
Documentation	Pick a bug that requires a Release Notes entry. Write the entry before the fix is done.
Community interaction	Reply substantively to three other contributors' patches before opening your next one.

Validation / Self-check

Before declaring the Capstone done:

capstone-work/self-grade.md exists with a score per dimension and a total.
The total is honest, not aspirational — you can defend each dimension's score with citations to your own artifacts.
At least one independent reviewer has also scored, and disagreements of more than 10 points on any dimension have been discussed.
The lowest dimension is identified and the next contribution's focus is written down.
The score is recorded somewhere you'll see again in 3 months (calendar reminder, journal, follow-on JIRA list).
You understand that the tier label ("Contributor", "Committer-ready") describes this one piece of work, not you.
You have a candidate next bug picked, with the focus dimension in mind.

OpenSearch Open-Source Contributor Curriculum

Welcome to the OpenSearch Open-Source Contributor Curriculum — a complete, implementation-heavy roadmap for engineers who want to become serious OpenSearch contributors and eventually operate at the level of a core contributor, maintainer, or TSC-aware engineer.

What This Curriculum Is

This is not a tutorial. It is a structured engineering apprenticeship built around how OpenSearch is actually developed, tested, reviewed, and maintained by its maintainers and the Technical Steering Committee.

Every level is tied to real OpenSearch source code, real GitHub issue patterns, real test infrastructure, and real contribution workflows. The labs mirror the work an OpenSearch maintainer actually does — reading the coordination layer, tracing a search request from REST handler to Lucene, debugging shard allocation, reproducing reported issues, and preparing pull requests for community review.

Who This Is For

This curriculum is designed for strong backend and distributed systems engineers who:

Have 3+ years of Java development experience (Gradle-based projects are a plus)
Are comfortable with HTTP/REST APIs and JSON
Understand distributed systems fundamentals: replication, consensus, sharding, failure detection
Want to contribute to open source at a serious level — not just fix typos

You should be comfortable with:

Reading large, unfamiliar Java codebases without a guide
git workflows, reading diffs, working with GitHub pull requests
The search/storage domain at a high level: inverted indexes, documents, queries, aggregations
Distributed execution concepts: leader election, quorums, primary/replica data flow

You do not need prior Apache Lucene experience. You will build it here.

What You Will Be Able to Do

After completing this curriculum, you will be able to:

Capability	Description
Build and test	Build OpenSearch from source with Gradle, run unit/integration tests, launch a local cluster
Navigate the codebase	Find any class, understand its role, trace execution across module boundaries
Understand the request path	Follow a request from REST handler through transport actions to shards and Lucene
Reason about coordination	Explain cluster-manager election, cluster state publishing, and the applier/listener model
Debug failures	Diagnose unassigned shards, failing recoveries, search errors, and circuit-breaker trips
Master the engine	Trace indexing through `IndexShard`/`InternalEngine`/`Translog` and search through query/fetch phases
Contribute pull requests	Reproduce issues, fix bugs, write tests, prepare high-quality PRs with CHANGELOG + DCO
Engage the community	Interact productively on GitHub, the forum, and in community meetings
Extend the engine	Build a plugin, add a REST action, custom analyzer, or aggregation
Think like a maintainer	Reason about wire/index backward compatibility, test stability, performance, and release impact

How to Use This Curriculum

Work through the 9 levels sequentially. Do not skip levels. Each level builds directly on the previous one, and the labs depend on the conceptual foundations laid earlier.

Level	Title	Core Focus
1	Lucene and OpenSearch Foundation	Build, test, first cluster, where OpenSearch fits
2	OpenSearch Contributor Onboarding	GitHub workflow, PRs, DCO, CHANGELOG, first fix
3	OpenSearch Architecture	Nodes, shards, REST → Transport → Action, threadpools
4	Cluster Coordination and State	Cluster-manager election, cluster state, allocation
5	Testing and Debugging	Test framework, `InternalTestCluster`, flaky tests
6	Indexing Path and Storage Engine	`IndexShard`, `InternalEngine`, translog, mappings
7	Search Path and Aggregations	Query/fetch phases, aggregations, the coordinating reduce
8	Real Issue Contribution	GitHub reproduction, root cause analysis, real PRs
9	Advanced Maintainer / TSC	Backward compatibility, performance, release practices

Beyond the 9 levels, the curriculum includes five additional sections:

Section	Purpose
Contributor Mindset	How to think, behave, and grow as an OpenSearch contributor
Issue Roadmap	Staged progression from beginner-friendly to release-blocking issues
Internals Deep Dives	24 focused deep dives, each with a mini-lab
Plugins, Extensions & Cross-Repo Labs	Cross-project debugging, Dashboards-to-core tracing, plugin internals
Release, Review & Governance Practices	Foundation governance, release trains, licensing, the TSC

The curriculum closes with a Capstone Project — a full contribution cycle from issue reproduction to merged pull request and engineering write-up.

Required Tools

Before starting Level 1, ensure you have the following installed and working:

JDK 21 (the OpenSearch 3.x baseline; the repo also bundles its own JDK for the build)
Git 2.x
IntelliJ IDEA (strongly recommended) or Eclipse
Docker (optional — useful for multi-node and packaging experiments)
A modern shell with curl (for hitting the local cluster's REST API on :9200)

Note on the build: OpenSearch builds with Gradle via the ./gradlew wrapper — you do not install Gradle yourself, and the wrapper provisions a matching JDK for the build. You only need a system JDK to run tools and your IDE. Always check gradle/ and build.gradle on the branch you are working on for the exact baseline.

You will also need:

A clone of the OpenSearch repository (the core engine — where you will spend ~90% of your time)
Optionally, a clone of OpenSearch Dashboards (the UI — needed only for the cross-repo labs)
A free GitHub account with your local git configured for DCO sign-off (git config commit.gpgsign optional; git commit -s required)
An account on the OpenSearch community forum and the public Slack (optional but recommended)

Note on contribution mechanics: OpenSearch uses GitHub pull requests, not patches or a JIRA. Every PR requires a Developer Certificate of Origin sign-off (Signed-off-by: line via git commit -s) and a CHANGELOG.md entry. There is no CLA.

OpenSearch at a Glance

OpenSearch is a distributed search and analytics engine built on top of Apache Lucene. It powers full-text search, log analytics, observability, and security analytics workloads. It is the open-source (Apache 2.0) successor to the Elasticsearch 7.10.2 codebase, from which it was forked in 2021.

Why OpenSearch Exists

OpenSearch was created when Elasticsearch and Kibana were relicensed from Apache 2.0 to the SSPL and the Elastic License in early 2021. The community needed a permissively licensed, openly governed search engine. OpenSearch is that engine: Apache 2.0, governed by the OpenSearch Software Foundation under the Linux Foundation, with a Technical Steering Committee and per-repo maintainers.

What OpenSearch Does

Stores JSON documents in indices, each split into shards (primary + replica)
Each shard is an Apache Lucene index of immutable segments
Distributes shards across nodes and keeps them allocated and balanced
Elects a cluster manager (formerly "master") that owns the authoritative cluster state
Serves a REST API on port 9200 and an internal transport protocol on 9300
Executes queries and aggregations by fanning out to shards and reducing the results

Key Modules

You will spend the majority of your time in these areas:

Area	Path	Description
Core engine	`server/`	Cluster, nodes, shards, indexing, search, allocation — `org.opensearch.*`
Shared libs	`libs/`	`libs/core` (serialization), `libs/x-content` (XContent), `libs/common`
Bundled modules	`modules/`	`transport-netty4`, `reindex`, `lang-painless`, `analysis-common`
In-repo plugins	`plugins/`	`analysis-icu`, `repository-s3`, `discovery-ec2`, and more
Test framework	`test/framework/`	`OpenSearchTestCase`, `OpenSearchIntegTestCase`, `InternalTestCluster`
BWC / QA	`qa/`	Backward-compatibility, rolling-upgrade, packaging tests
REST specs	`rest-api-spec/`	REST API JSON specs and shared YAML tests

Key Classes (High-Level Preview)

Class	Area	Role
`Node`	`server`	A running OpenSearch node; wires up all services
`ClusterService`	`server`	Ties together cluster-manager service + applier service
`ClusterState`	`server`	The immutable, versioned snapshot of cluster metadata
`Coordinator`	`server`	The consensus/coordination layer (election, joins, publish)
`AllocationService`	`server`	Decides where shards go, gated by `AllocationDeciders`
`RestController`	`server`	Routes HTTP requests to `RestHandler`s
`TransportAction`	`server`	Server-side action invoked by the `NodeClient`
`IndicesService` / `IndexShard`	`server`	Manages indices and per-shard lifecycle
`InternalEngine`	`server`	Wraps Lucene `IndexWriter` + translog; the write path
`SearchService`	`server`	Executes query and fetch phases on a shard
`Plugin`	`server`	Base class for all extensions

If half of these are unfamiliar, that is expected. By Level 4 you will read them without a guide.

The OpenSearch Community

OpenSearch is a large, active, openly governed project. The codebase reflects both its Elasticsearch heritage and years of independent evolution (segment replication, the cluster-manager rename, the Extensions SDK, remote-backed storage). Many design decisions live in GitHub issues, RFCs, and recorded community meetings rather than in code comments.

What the community values:

Pull requests that include tests and a CHANGELOG entry
Issues that include a clear, minimal reproduction
Comments that demonstrate you have read the existing code
Contributors who engage respectfully and patiently across GitHub, the forum, and Slack
Sustained contribution over time, not one-off patches

The path from contributor to maintainer is measured in months to years, not weeks. That is intentional. Maintainership is earned through sustained, high-quality contribution and demonstrated judgment — not volume of PRs.

This curriculum will help you build the habits and depth of understanding that make that path realistic.

Begin with Level 1: Lucene and OpenSearch Foundation.

Overview & Prerequisites

This section is the on-ramp. Before you read a single line of IndexShard.java or trace a search request through TransportSearchAction, you need a working build, a running local cluster, a GitHub identity wired for contribution, and a clear mental map of how the curriculum is structured. This page gets you there. Budget two to four hours for a cold setup; most of that is the first Gradle build downloading the world.

This curriculum will not hold your hand. It assumes you are a strong backend/distributed-systems engineer who can read unfamiliar Java without a guide. What it will do is point you at the exact parts of OpenSearch that matter, give you the right questions, and make you prove competence at each gate. The setup below is the first gate: if you cannot build OpenSearch and hit localhost:9200, nothing else in the curriculum will work.

What You Are Setting Up (the whole picture)

┌──────────────────────────────────────────────────────────────────────────┐
│  Your laptop                                                              │
│                                                                          │
│   JDK 21 (system)  ── runs ──►  IntelliJ / your editor                   │
│         │                                                                │
│         │  imports as a Gradle project                                   │
│         ▼                                                                │
│   ~/src/OpenSearch  ◄── git clone ── github.com/opensearch-project/...    │
│         │                                                                │
│         │  ./gradlew  (wrapper provisions its OWN bundled JDK 21)        │
│         ▼                                                                │
│   ./gradlew run  ──►  single-node cluster, REST on :9200, transport :9300 │
│         ▲                                                                │
│         │  curl / your tests hit this                                    │
│   git commit -s  ──►  DCO Signed-off-by  ──►  PR on GitHub               │
└──────────────────────────────────────────────────────────────────────────┘

Two facts shape everything:

OpenSearch builds with Gradle, not Maven. You drive it through the ./gradlew wrapper, which downloads the correct Gradle version and a bundled JDK matched to the branch. You do not install Gradle, and you do not need your system JDK to exactly match the build JDK.
Contribution happens on GitHub — issues and pull requests on github.com/opensearch-project/OpenSearch. There is no JIRA, no CLA. Every commit needs a DCO sign-off (git commit -s) and every PR needs a CHANGELOG.md entry.

Step 1 — Install the toolchain

Tool	Version	Why
JDK	21 (LTS)	Baseline for the OpenSearch `3.x`/`main` line. Older lines used 11/17.
Git	2.x	Clone, branch, sign-off commits, push to your fork.
IntelliJ IDEA	latest (Community is fine)	Gradle import, navigation, debugger. Eclipse works too.
curl	any	Drive the REST API on `:9200`.
Docker	optional	Multi-node and packaging experiments later.

# Verify your toolchain. JDK must report 21 (or compatible).
java -version      # openjdk version "21.x"
git --version      # git version 2.x
curl --version     # any recent curl

Note: You do not install Gradle. The repository ships ./gradlew and gradle/wrapper/. The wrapper also provisions a build JDK via the toolchain mechanism, so a slightly different system JDK is usually fine for running tools — but install JDK 21 to keep your IDE and the build aligned and avoid surprises. If ./gradlew --version runs, your wrapper is healthy.

On macOS the path of least resistance is a JDK distribution like Temurin/Corretto 21 via your package manager; on Linux use your distro's openjdk-21-jdk or a tarball; set JAVA_HOME accordingly.

Step 2 — Clone the repositories

You will spend ~90% of your time in the core engine repo. Clone Dashboards only when the cross-repo labs (Lab P1+) call for it.

mkdir -p ~/src && cd ~/src

# The core engine — your home for the whole curriculum.
git clone https://github.com/opensearch-project/OpenSearch.git
cd OpenSearch

# Confirm you are on the development line (main / 3.x).
git branch -a | head
git log --oneline -5

# OPTIONAL — only needed for the Dashboards cross-repo labs (Section: Plugin Labs).
cd ~/src
git clone https://github.com/opensearch-project/OpenSearch-Dashboards.git

Warning: The first ./gradlew invocation downloads Gradle, the bundled JDK, and a large dependency set. Do it on a good connection and expect 15–40 minutes the first time. Subsequent builds use the Gradle daemon and local caches and are far faster.

Read these five files in the repo root before you build anything; they are the project's own contract with you:

cd ~/src/OpenSearch
ls CONTRIBUTING.md DEVELOPER_GUIDE.md TESTING.md CHANGELOG.md MAINTAINERS.md

Step 3 — First build and first run

cd ~/src/OpenSearch

# Sanity-check the wrapper (provisions Gradle + bundled JDK on first call).
./gradlew --version

# Assemble a runnable local distribution under distribution/archives/.
./gradlew localDistro          # first run: slow; later runs: minutes

# Launch a single-node cluster straight from source — REST on :9200.
# Leave this running in one terminal.
./gradlew run

In a second terminal, prove the cluster is alive — this is the canonical health check you will run hundreds of times:

curl -s localhost:9200 | head -20
curl -s 'localhost:9200/_cluster/health?pretty'

You want "status" : "green" (or "yellow" for a single node with replicas unassigned — that is normal and expected, not a failure).

{
  "cluster_name" : "runTask",
  "status" : "green",
  "number_of_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "unassigned_shards" : 0
}

Note: ./gradlew run starts security-disabled core OpenSearch (the security plugin lives in a separate repo and is not part of ./gradlew run). That is exactly what you want for development — plain HTTP on :9200, no TLS, no auth.
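Because you will run this health check constantly, it can help to wrap it. The following is a minimal sketch, not project tooling: `health_status`, `is_healthy`, and `wait_for_cluster` are hypothetical names, the `sed` expression is a shortcut rather than a real JSON parser, and the attempt count and sleep interval are arbitrary defaults.

```shell
# Extract the "status" value from a _cluster/health response on stdin.
# Handles the pretty-printed and compact forms; a sed shortcut, not a JSON parser.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Treat green and yellow as healthy for a single-node dev cluster.
is_healthy() {
  case $1 in
    green|yellow) return 0 ;;
    *)            return 1 ;;
  esac
}

# Poll until the cluster is healthy or the attempts run out.
# URL, attempt count, and sleep interval are illustrative defaults.
wait_for_cluster() {
  url=${1:-localhost:9200/_cluster/health}
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    status=$(curl -s --max-time 5 "$url" | health_status)
    if is_healthy "$status"; then
      echo "cluster is $status"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  echo "cluster not healthy after waiting" >&2
  return 1
}
```

Call `wait_for_cluster` in the second terminal right after starting `./gradlew run`; it returns as soon as the node reports green or yellow.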

To run a single test (your fast feedback loop for the entire curriculum):

# A whole test class:
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests"

# A single method:
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests.testToXContent"

If ./gradlew run serves :9200 and :server:test passes a known-good class, your environment is solid.

Step 4 — Import into IntelliJ (as a Gradle project)

OpenSearch is a Gradle project. Do not try to import it as a Maven or "from sources" project.

IntelliJ → File → Open → select the ~/src/OpenSearch directory (open the folder, not a single file).
When prompted, choose Open as Project and let IntelliJ detect the Gradle build. Accept "Trust Project".
Gradle JVM: set it to your JDK 21 in Settings → Build Tools → Gradle. If imports fail with toolchain errors, point IntelliJ's Gradle JVM at the same JDK 21 the wrapper uses.
Let the initial Gradle sync finish (it indexes server/, libs/, modules/, plugins/, …). This takes a while the first time.
Verify navigation works: use Go to Class to open RestSearchAction, TransportSearchAction, and IndexShard. If Go to Class finds them, your index is healthy.

Tip: Run/debug a test directly from the IDE gutter once the Gradle import is done. Being able to set a breakpoint in IndexShard.applyIndexOperationOnPrimary(...) and step through it is the single highest-leverage skill in this curriculum. Eclipse users: import via File → Import → Gradle → Existing Gradle Project.

Step 5 — Configure Git for DCO sign-off

OpenSearch requires a Developer Certificate of Origin sign-off on every commit. This is not a CLA — it is a one-line Signed-off-by: trailer asserting you have the right to contribute the code. The PR check (DCO) fails if any commit is missing it.

# Set the identity that will appear in your Signed-off-by line.
# These MUST match the GitHub account/email you contribute from.
git config --global user.name  "Your Real Name"
git config --global user.email "you@example.com"

# Sign off every commit with -s. This appends:
#   Signed-off-by: Your Real Name <you@example.com>
git commit -s -m "Fix off-by-one in date_histogram bucket key"

# Forgot -s on the last commit? Amend it:
git commit --amend -s --no-edit

# Forgot it across several commits? Re-sign a range:
git rebase --signoff HEAD~3

A correctly signed commit message ends with:

Fix off-by-one in date_histogram bucket key

Signed-off-by: Your Real Name <you@example.com>

Warning: The name/email in Signed-off-by: must match your git identity exactly, and the email should be one GitHub recognizes for your account. Mismatches make the DCO check fail and force you to rewrite history.
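You can catch a missing or mismatched sign-off locally before the PR check does. The sketch below is an approximation under stated assumptions: `has_signoff` and `check_range` are hypothetical helpers, and they only test for an exact `Signed-off-by:` line matching each commit's author, which may be stricter or looser than what the hosted DCO check actually enforces.

```shell
# Succeed if the commit message on stdin has a Signed-off-by line
# for exactly this name and email. has_signoff is a hypothetical helper.
has_signoff() {
  expected="Signed-off-by: $1 <$2>"
  # -F: fixed string, -x: match the whole line, -q: no output.
  grep -Fxq "$expected"
}

# Report every commit in a range (for example origin/main..HEAD)
# whose message lacks a sign-off matching its own author.
check_range() {
  for sha in $(git rev-list "$1"); do
    name=$(git log -1 --format=%an "$sha")
    email=$(git log -1 --format=%ae "$sha")
    if ! git log -1 --format=%B "$sha" | has_signoff "$name" "$email"; then
      echo "missing sign-off: $sha"
    fi
  done
}
```

Running `check_range origin/main..HEAD` before you push prints one line per offending commit and nothing when the branch is clean.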

You will also fork the repo on GitHub and push branches to your fork; that flow is covered in Level 2 and in the contributor-mindset chapter on PR quality. For now, just get your local identity and sign-off working.

Step 6 — Create accounts and join the community

OpenSearch is developed in the open. Get plugged into the channels now so you are not a stranger when you open your first issue or PR.

Channel	URL	Use it for
GitHub account	https://github.com	Issues, PRs, code review, the entire contribution flow.
Community forum	https://forum.opensearch.org	Longer-form questions, design discussion, user help.
Public Slack	https://opensearch.org/slack	Real-time questions, maintainer chatter, SIG channels.
Community meetings	linked from `opensearch.org`	Recorded; watch a few to learn how decisions get made.

Large designs and roadmap items live as GitHub issues labeled RFC, meta, or proposal — not in a wiki. Bookmark the issues list: https://github.com/opensearch-project/OpenSearch/issues and learn the labels you will use constantly: good first issue, help wanted, bug, enhancement, flaky-test, untriaged, backport 2.x.

How the Curriculum Fits Together

The curriculum is 9 levels of core engineering, 5 supporting sections, and a capstone.

flowchart TD
    L1[Level 1: Lucene + OpenSearch Foundation] --> L2[Level 2: Contributor Onboarding]
    L2 --> L3[Level 3: Architecture / Request Path]
    L3 --> L4[Level 4: Cluster Coordination + State]
    L4 --> L5[Level 5: Testing + Debugging]
    L5 --> L6[Level 6: Indexing + Storage Engine]
    L6 --> L7[Level 7: Search + Aggregations]
    L7 --> L8[Level 8: Real Issue Contribution]
    L8 --> L9[Level 9: Advanced Maintainer / TSC]
    L9 --> CAP[Capstone: full contribution cycle]

    DD[Deep Dives x24] -.referenced by.-> L3
    DD -.-> L4
    DD -.-> L6
    DD -.-> L7
    PL[Plugin / Cross-Repo Labs] -.-> L8
    GOV[Release + Governance] -.-> L9

Track	What it is	When you touch it
Levels 1–9	The spine. Sequential. Each level has 2–4 labs.	Work top to bottom; do not skip.
Contributor Mindset	How to read the codebase, design via GitHub, handle feedback, grow toward maintainership.	Read alongside Levels 2, 8, 9.
Issue Roadmap	12 staged issue difficulties, docs-only → release-blocking.	Pick real issues as you progress.
Deep Dives (24)	Focused internals chapters, each with a mini-lab.	Open the relevant one whenever a level references it.
Plugin / Cross-Repo Labs	Dashboards-to-core tracing, plugin internals, bug attribution.	Around Level 8.
Release & Governance	Foundation, TSC, release trains, licensing, trust.	Level 9 and the capstone.
Capstone	A complete real contribution: issue → reproduction → fix → PR → write-up.	The final two weeks.

The deep dives are not optional reading — they are where the real depth lives. A level says "trace a search request"; the Search Execution deep dive is where you learn how QueryPhase and FetchPhase actually work. Treat the levels as the spine and the deep dives as the muscle.

How to Use the Labs

Every lab follows the same shape, so you always know where you are:

Background and Why This Lab Matters for Contributors — the why before the how.
Prerequisites — what must already work (usually a green ./gradlew run and a prior lab).
Step-by-Step Tasks — numbered, with real ./gradlew/curl/git/grep commands.
Implementation Requirements / Deliverables — checkboxes you must satisfy.
Troubleshooting, Expected Output, Stretch Goals.
Validation / Self-check — 5–7 questions or exercises that gate completion.

Rules for the labs:

Run every command. This is a hands-on apprenticeship; reading is not doing.
When a lab gives a grep/find, run it rather than trusting a line number. Code moves between branches, so the curriculum points you at code with commands instead of line numbers that would go stale.
Do not advance past a lab's Validation section until you can answer it without notes.
Keep ./gradlew run alive in a dedicated terminal for any lab that hits :9200.
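The "run the grep" rule is worth turning into a habit. As a small sketch, the hypothetical wrapper below (`find_symbol` is not a repo script) searches a source tree for a literal symbol; the default root assumes you are standing in the OpenSearch checkout and that the sources you care about live under `server/src/main/java`.

```shell
# Locate where a symbol appears instead of trusting a remembered line number.
# find_symbol is a hypothetical wrapper; the default root is an assumption
# about where you run it from.
find_symbol() {
  symbol=$1
  root=${2:-server/src/main/java}
  # -r: recurse, -n: print line numbers, -F: treat the symbol as a literal.
  grep -rnF -- "$symbol" "$root"
}
```

For instance, `find_symbol applyIndexOperationOnPrimary | head` lists `path:line:text` hits on whatever branch you have checked out, which is exactly the information a hard-coded line number cannot give you.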

You Are Ready When…

Work through this checklist and confirm every box before opening Level 1:

java -version reports JDK 21; git --version is 2.x.
~/src/OpenSearch is cloned and you have read CONTRIBUTING.md, DEVELOPER_GUIDE.md, TESTING.md.
./gradlew --version runs and provisions cleanly.
./gradlew localDistro produced an archive under distribution/archives/.
./gradlew run serves curl -s localhost:9200 and _cluster/health returns green/yellow.
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests" passes.
IntelliJ imported the project as Gradle and Go to Class finds RestSearchAction.
git config user.name/user.email are set and git commit -s adds a Signed-off-by: line.
You have a GitHub account and have skimmed the OpenSearch issues list and its labels.
You have joined (or bookmarked) the forum and Slack.

# A 5-minute "am I ready" smoke test (run from ~/src/OpenSearch):
java -version
./gradlew --version | head -3
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests" 2>&1 | tail -5
# In another terminal, with ./gradlew run live:
curl -s 'localhost:9200/_cluster/health?pretty'

If any box is unchecked, fix it now. A broken baseline means every later ./gradlew test and every curl will produce confusing failures that hide the real work.

Where to Go Next

OpenSearch Warm-Up: From User to Contributor — the most important page in this section. Run OpenSearch as a user across five real scenarios, then bridge each one to the org.opensearch.* source. Read this before Level 1.
16-Week Plan — a calendar that maps Levels 1–9 + capstone onto 16 weeks, with weekly reading, hands-on tasks, GitHub issue practice, and exit checkpoints.
Milestones: M1–M9 — the competence gates. Each milestone has skills, self-check questions, and a 20-point rubric.

Continue to the Warm-Up, or jump straight to Level 1: Lucene and OpenSearch Foundation.

The Hitchhiker's Guide to OpenSearch, Lucene & Vectors

Don't Panic.

You are about to read a tour of a system with roughly two decades of accumulated engineering inside it: a distributed coordination layer, a near-real-time indexing engine, a 20-year-old search library called Apache Lucene buried at the bottom, and — bolted on the side and rapidly becoming the main event — a vector search engine that talks to native C++ and, on a good day, a GPU.

Most people meet this system in a state of mild terror. They open IndexShard.java, see 4,000 lines, scroll back up, and quietly close the tab. This guide exists so that does not happen to you. The trick is the same one the actual Hitchhiker's Guide recommends: you need a friend who has been here before, a rough map, and the confidence that the scary words are just names for things you already half-understand.

So here is the deal. We are going to follow one single document from the moment it leaves a curl command until the moment it is an immutable file on disk — and, if it happens to be a vector, until it is a node in a graph being scored by SIMD instructions. Along the way you will meet the cast of characters: the cluster manager, the shard, the engine, the codec, the inverted index, the HNSW graph, and DocValues. Each one gets a memorable introduction and then, immediately, its real org.opensearch.* or org.apache.lucene.* name and the file it lives in, so that the friendliness never costs you accuracy.

This is the flagship on-ramp. After this, the rest of the curriculum — the nine Levels, the Lucene section, the k-NN section, the Engineering section, and the Deep Dives — stops being a wall of links and becomes a set of places you have already glimpsed from the road.

Everything below assumes a local cluster from the prerequisites: ./gradlew run is live, REST is on localhost:9200. If it is not, go fix that first; the guide is much funnier when you can run the commands.

The Three Depths of Mastery

Before the journey, fix your destination. There are exactly three depths at which a person can "know OpenSearch", and conflating them is the single biggest source of wasted effort. Know which one you are aiming for today.

Depth	You can…	You think in terms of…	This curriculum
1. User	Index data, write Query DSL, build aggregations and dashboards, run a cluster, read `_cat` APIs.	indices, mappings, queries, shards, health colors.	Warm-Up + this guide.
2. Contributor	Build from source, trace a request through the code, reproduce a GitHub issue, write a fix + test, open a quality PR.	`RestHandler` → `TransportAction` → `IndexShard` → `InternalEngine`; the Lucene boundary; cluster state.	Levels 1–9, the Lucene and k-NN sections, the Capstone.
3. Core engineer	Drive a hard, cross-cutting change the way maintainers do — design it in public, write the RFC, land it across subsystems, defend the trade-offs.	concurrency models, on-disk formats, recall vs latency vs memory, BWC, distributed failure modes.	Engineering at Scale + the Capstone Projects.

Nobody starts at depth 3. Everybody thinks they need depth 3 on day one. You do not. You need depth 1 in your hands (run the scenarios in the Warm-Up) and depth 2 in your sights. Depth 3 is what the Engineering section is for, and you will get there by doing the journey below enough times that the class names stop being scary.

Note: A wonderful property of this codebase is that depth 2 and depth 3 use the same map. The difference is not "more classes" — it is that the core engineer also holds the on-disk format, the failure modes, and the public design history in their head simultaneously. Same territory, more dimensions.

The One-Sentence Mental Model

If you remember nothing else, remember this boundary, because it decides where every bug lives:

OpenSearch is a distributed system that wraps thousands of single-machine Apache Lucene indices and makes them look like one searchable thing.

Everything about shards, nodes, replication, cluster state, REST, transport, the translog, and the aggregation reduce is OpenSearch. Everything about scoring, tokenization, segments, postings, points, DocValues, and the HNSW graph is Lucene, which OpenSearch merely configures and drives. The Warm-Up has the full ownership table; tape it to your monitor. When someone says "search is slow", your first reflex should be to ask which layer — and by the end of this guide you will have the vocabulary to answer.
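The "which layer?" reflex can be rehearsed as a lookup. This is only a memory aid under obvious assumptions: `owning_layer` is a hypothetical helper, and its keyword lists simply copy the two sentences above rather than the full ownership table from the Warm-Up.

```shell
# First-pass triage: which layer owns a concept, per the split described above.
# owning_layer is a hypothetical helper; the keyword lists mirror the paragraph.
owning_layer() {
  # Lower-case the input so "DocValues" and "docvalues" match alike.
  concept=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case $concept in
    shard*|node*|replication|"cluster state"|rest|transport|translog|"aggregation reduce")
      echo "OpenSearch" ;;
    scoring|tokenization|segment*|postings|points|docvalues|hnsw*)
      echo "Lucene" ;;
    *)
      echo "unknown" ;;
  esac
}
```

`owning_layer translog` prints `OpenSearch` and `owning_layer DocValues` prints `Lucene`. Anything that prints `unknown` is a prompt to go look it up in the ownership table rather than guess.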

The Journey: One `curl` Command, End to End

Here is our protagonist. A single document, an order, being written to an index.

curl -s -XPUT 'localhost:9200/orders/_doc/42?refresh=true' \
  -H 'Content-Type: application/json' -d '{
    "customer": "Ford Prefect",
    "total": 42.00,
    "note": "mostly harmless",
    "embedding": [0.12, 0.07, 0.99, 0.03]
  }'

That HTTP request is going to pass through nine layers, touch the authoritative state of the entire cluster, land on exactly one machine, and ultimately become bytes in an immutable file. Let us watch it.

flowchart TD
    Curl["curl PUT /orders/_doc/42"] --> REST["REST layer: RestController → RestIndexAction"]
    REST --> Action["Action framework: NodeClient → TransportIndexAction"]
    Action --> CM["Cluster manager: does 'orders' exist? map 'embedding' as knn_vector?"]
    CM --> Route["Routing: hash(_id=42) → primary shard 3"]
    Route --> Transport["Transport layer: send to the node holding shard 3"]
    Transport --> Shard["IndexShard.applyIndexOperationOnPrimary"]
    Shard --> Engine["InternalEngine.index → Lucene IndexWriter + Translog"]
    Engine --> Refresh["refresh() → new DirectoryReader → searchable"]
    Engine --> Segment["flush/merge → immutable segment files on disk"]
    Segment --> Vec[".vec / HNSW graph for the embedding field"]

We will take these in order. Each stop introduces a character, gives you the real name, and tells you where it lives.

Stop 1 — The Doorman: the REST layer

The character. Every request to OpenSearch comes in over HTTP and is met by a doorman whose entire job is to recognize the shape of what you asked (PUT /orders/_doc/42), check it is well-formed, and hand it to the right specialist inside. The doorman does no real work. He just routes.

The real names. The HTTP request lands in RestController, which matches the method+path against a registry of RestHandlers and dispatches to RestIndexAction. The handler parses your JSON into an IndexRequest and calls the client. It does not know what a shard is.

cd ~/src/OpenSearch
find server -name "RestController.java"
#   server/src/main/java/org/opensearch/rest/RestController.java
find server -name "RestIndexAction.java"
#   server/src/main/java/org/opensearch/rest/action/document/RestIndexAction.java
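
To make the doorman concrete, here is a minimal sketch of method-plus-path dispatch. It is a toy model, not OpenSearch code: `ToyRestController`, its linear scan over routes, and the `{index}` / `{id}` template syntax are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


class ToyRestController:
    """Hypothetical stand-in for RestController: matches method + path, does no real work."""

    def __init__(self) -> None:
        # (HTTP method, path template split on "/") -> handler name
        self._routes: Dict[Tuple[str, Tuple[str, ...]], str] = {}

    def register(self, method: str, template: str, handler: str) -> None:
        self._routes[(method, tuple(template.strip("/").split("/")))] = handler

    def dispatch(self, method: str, path: str) -> Optional[Tuple[str, Dict[str, str]]]:
        # Drop the query string, then compare segment by segment.
        parts = tuple(path.split("?")[0].strip("/").split("/"))
        for (route_method, template), handler in self._routes.items():
            if route_method != method or len(template) != len(parts):
                continue
            params: Dict[str, str] = {}
            for expected, actual in zip(template, parts):
                if expected.startswith("{") and expected.endswith("}"):
                    params[expected[1:-1]] = actual  # path parameter
                elif expected != actual:
                    break  # literal segment mismatch
            else:
                return handler, params
        return None


controller = ToyRestController()
controller.register("PUT", "/{index}/_doc/{id}", "RestIndexAction")
controller.register("POST", "/_bulk", "RestBulkAction")
match = controller.dispatch("PUT", "/orders/_doc/42?refresh=true")
```

The point of the sketch is the shape of the job: the dispatcher only recognizes the request and extracts `index` and `id`; nothing in it knows about shards or storage.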

Full treatment: the REST layer deep dive.

Stop 2 — The Dispatcher: the action framework

The character. Inside the building, nobody talks HTTP. Work is expressed as actions — typed request/response pairs — and a dispatcher matches an action type to the one transport action that knows how to perform it. This indirection is what lets the same logic run whether the request arrived over HTTP or from another node.

The real names. RestIndexAction calls NodeClient.execute(...), which looks up the registered TransportAction for the index action — TransportIndexAction (itself routed through the bulk machinery). The wiring lives in ActionModule, where every action type is bound to its handler at startup.

grep -rn "class ActionModule" server/src/main/java/org/opensearch/action/
grep -rln "class TransportIndexAction\|class TransportBulkAction" server/src/main/java/org/opensearch/action/
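
The indirection can be sketched in a few lines. This is a toy registry under stated assumptions: `ToyActionRegistry`, `IndexRequest`, `IndexResponse`, and `toy_transport_index_action` are hypothetical names for illustration; only the action name string `indices:data/write/index` is taken from the real index action.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class IndexRequest:
    index: str
    doc_id: str
    source: dict


@dataclass(frozen=True)
class IndexResponse:
    index: str
    doc_id: str
    result: str


class ToyActionRegistry:
    """Binds each action name to the single handler that performs it."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable] = {}

    def register(self, action_name: str, handler: Callable) -> None:
        if action_name in self._handlers:
            raise ValueError("duplicate action: " + action_name)
        self._handlers[action_name] = handler

    def execute(self, action_name: str, request):
        # The caller never learns whether the request came from HTTP or a peer node.
        return self._handlers[action_name](request)


def toy_transport_index_action(request: IndexRequest) -> IndexResponse:
    return IndexResponse(request.index, request.doc_id, "created")


registry = ToyActionRegistry()
registry.register("indices:data/write/index", toy_transport_index_action)

request = IndexRequest("orders", "42", {"total": 42.0})
from_rest = registry.execute("indices:data/write/index", request)
from_another_node = registry.execute("indices:data/write/index", request)
```

Both call sites reach the same handler with the same typed request, which is the property the action framework exists to provide.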

Full treatment: the action framework deep dive.

Stop 3 — The Keeper of Truth: the cluster manager and cluster state

The character. Somewhere in your cluster, one node has been elected to a lonely and important job: it is the single source of truth about what exists. Which indices are there, what their mappings are, where every shard lives, which nodes are alive. This node is the cluster manager. When our document arrives for an index that does not exist yet, this node is who decides to create it, who picks a field type for embedding, and who writes that decision down.

Note: "Cluster manager" is the term you will see in modern OpenSearch. It replaced master after the fork from Elasticsearch, as part of the inclusive-naming work in the OpenSearch 2.x line. Old blog posts, some metric names, and a few legacy class names still say master / Master; mentally translate. The deep dives say cluster manager; the code is mid-migration.

The real names. The authoritative object is an immutable ClusterState: a big value containing Metadata (index mappings, settings), RoutingTable (where shards are), and DiscoveryNodes (who is alive). Creating the orders index is a ClusterStateUpdateTask executed on the cluster manager's service (ClusterManagerService, formerly MasterService). Deciding the field type for embedding is mapping resolution (MapperService). Dynamic mapping alone would infer a plain float field from the JSON array; embedding becomes a knn_vector (KNNVectorFieldMapper) only when the k-NN plugin is installed and an explicit mapping or index template declares it. The new state is then published in two phases (publish, then commit) to every node and applied locally by each node's ClusterApplierService.

flowchart LR
    Need["new index 'orders' needed"] --> Task["ClusterStateUpdateTask on cluster manager"]
    Task --> Compute["compute new ClusterState (Metadata + RoutingTable)"]
    Compute --> Publish["PublicationTransportHandler: publish → commit"]
    Publish --> Apply["every node: ClusterApplierService applies new state"]
    Apply --> Ready["index 'orders' now exists cluster-wide"]

grep -rln "class ClusterState" server/src/main/java/org/opensearch/cluster/
ls server/src/main/java/org/opensearch/cluster/coordination/   # election, publishing
grep -rln "class AllocationService" server/src/main/java/org/opensearch/cluster/routing/allocation/
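
The two ideas in this stop, immutable state and two-phase publication, can be sketched as follows. This is a deliberately small model: `ToyClusterState`, `ToyNode`, `create_index_task`, and `publish` are hypothetical names, and the simple "more than half of all nodes" rule stands in for the real voting configuration.

```python
from dataclasses import dataclass, replace
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ToyClusterState:
    version: int
    indices: Mapping[str, int]  # index name -> number of primary shards


def create_index_task(name, primary_shards):
    """An update task is a pure function: old state in, new state out."""
    def task(state):
        if name in state.indices:
            return state  # nothing to change, nothing to publish
        indices = dict(state.indices)
        indices[name] = primary_shards
        return replace(state, version=state.version + 1,
                       indices=MappingProxyType(indices))
    return task


class ToyNode:
    def __init__(self, state):
        self.applied = state
        self.pending = None

    def accept(self, state):
        # Phase 1 (publish): stage the state only if it is strictly newer.
        if state.version <= self.applied.version:
            return False
        self.pending = state
        return True

    def commit(self):
        # Phase 2 (commit): make the staged state the applied one.
        self.applied, self.pending = self.pending, None


def publish(state, nodes):
    accepted = [node for node in nodes if node.accept(state)]
    if len(accepted) <= len(nodes) // 2:
        return False  # no majority, so the state is never committed
    for node in accepted:
        node.commit()
    return True


initial = ToyClusterState(version=1, indices=MappingProxyType({}))
nodes = [ToyNode(initial) for _ in range(3)]
updated = create_index_task("orders", 5)(initial)
committed = publish(updated, nodes)
```

Note that `initial` is never mutated: the task returns a new value with a higher version, which is what lets every node compare versions and reject stale publications.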

Full treatment: cluster state, cluster state publishing, discovery and coordination. The Warm-Up's Scenario 4 lets you watch the cluster manager react to a dying node.

Stop 4 — The Sorting Hat: routing to a shard

The character. An index is not one thing; it is sliced into shards, and every shard lives on some node with zero or more replica copies. Our document has an _id of 42. Something has to decide, deterministically, which shard owns 42 — and it must give the same answer every time, or reads and writes would disagree.

The real names. Routing is a hash: by default, roughly murmur3(_routing ?? _id) % number_of_primary_shards (the real code also folds in index.number_of_routing_shards so that an index can later be split). OperationRouting consults the RoutingTable from cluster state to find the primary shard for that slot and the node hosting it. Each shard is, at the storage level, a complete and independent Lucene index. A primary plus its replicas is one logical shard; the primary takes writes first and forwards them to its replicas, any copy (primary or replica) can serve reads, and a replica is promoted if the primary fails.

grep -rln "class OperationRouting" server/src/main/java/org/opensearch/cluster/routing/
grep -ni "murmur3\|generateShardId\|partition" server/src/main/java/org/opensearch/cluster/routing/OperationRouting.java | head
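
A minimal sketch of the routing rule, with one loud assumption: `zlib.crc32` is used here as a stand-in hash so the example stays self-contained. OpenSearch uses a murmur3 variant, so the shard numbers below will not match a real cluster. `toy_shard_for` is a hypothetical name.

```python
import zlib


def toy_shard_for(doc_id, num_primary_shards, routing=None):
    """Deterministically map a document to a shard number."""
    key = routing if routing is not None else doc_id
    # Stand-in hash (crc32), NOT the murmur3 function OpenSearch really uses.
    hashed = zlib.crc32(key.encode("utf-8"))
    return hashed % num_primary_shards


# Same id, same shard count: always the same answer.
shard_a = toy_shard_for("42", 5)
shard_b = toy_shard_for("42", 5)

# A custom routing value overrides the id, co-locating related documents.
order = toy_shard_for("42", 5, routing="ford-prefect")
invoice = toy_shard_for("99", 5, routing="ford-prefect")

# Changing the shard count silently re-homes most documents.
moved = sum(
    1 for i in range(1000)
    if toy_shard_for(str(i), 5) != toy_shard_for(str(i), 6)
)
```

The last line is the reason the number of primary shards is fixed at index creation: a plain modulo over a different shard count would send existing ids to shards that do not hold them.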

Term	What it is	Owner
Index	A named collection of documents with mappings + settings.	OpenSearch
Shard	A horizontal slice of an index; the unit of distribution. Each shard is a Lucene index.	OpenSearch wraps Lucene
Primary	The shard copy that accepts writes.	OpenSearch
Replica	A copy for reads + failover; promoted to primary if the primary dies.	OpenSearch

Full treatment: shard allocation.

Stop 5 — The Foreman: `IndexShard`

The character. Our request has now been transported to the one node holding primary shard 3 of orders. On that node, a foreman takes over. The foreman owns the lifecycle of the shard — is it recovering? started? relocating? — and is the gatekeeper for every operation that touches its data. He does not do the storage himself; he delegates that to a specialist. But nothing happens to this shard without going through him.

The real names. That foreman is IndexShard. The write enters at IndexShard.applyIndexOperationOnPrimary(...), which checks the shard is in a state that can accept writes, manages permits and sequence numbers, and then delegates the real storage work to its Engine.

find server -name "IndexShard.java"
#   server/src/main/java/org/opensearch/index/shard/IndexShard.java
grep -n "applyIndexOperationOnPrimary\|public Engine.IndexResult" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java | head
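
The foreman's gatekeeping can be sketched like this. It is heavily simplified and every name is hypothetical (`ToyIndexShard`, `ToyShardState`, `ToyStorage`); the real shard has more states and also handles permits, primary terms, and replication, none of which appear here.

```python
from enum import Enum


class ToyShardState(Enum):
    RECOVERING = "recovering"
    STARTED = "started"
    CLOSED = "closed"


class ToyStorage:
    """Hypothetical stand-in for the Engine the shard delegates to."""

    def __init__(self):
        self.operations = []

    def index(self, doc_id, source):
        self.operations.append((doc_id, source))
        return len(self.operations) - 1  # position of the stored operation


class ToyIndexShard:
    def __init__(self, storage):
        self.state = ToyShardState.RECOVERING
        self._storage = storage

    def mark_started(self):
        self.state = ToyShardState.STARTED

    def apply_index_operation_on_primary(self, doc_id, source):
        # Gatekeeping first: refuse the write unless the shard is ready.
        if self.state is not ToyShardState.STARTED:
            raise RuntimeError("shard is " + self.state.value + ", cannot accept writes")
        # Then delegate; the shard itself stores nothing.
        return self._storage.index(doc_id, source)


shard = ToyIndexShard(ToyStorage())
try:
    shard.apply_index_operation_on_primary("42", {"total": 42.0})
    rejected_while_recovering = False
except RuntimeError:
    rejected_while_recovering = True

shard.mark_started()
position = shard.apply_index_operation_on_primary("42", {"total": 42.0})
```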

Full treatment: index shard lifecycle.

Stop 6 — The Engine Room: `InternalEngine` and Lucene's `IndexWriter`

The character. Here, finally, is where a document becomes data. The engine is the layer that turns "apply this index operation" into a concrete Lucene write, a durable log record, and — eventually — a searchable view. It is the single most important class to understand for depth 2, because it sits exactly on the OpenSearch/Lucene boundary: above it is OpenSearch's operation model (versions, sequence numbers, the translog); below it is raw Lucene.

The real names. InternalEngine.index(Engine.Index) does roughly this: acquire a per-_id lock, decide whether this is a new doc / update / conflict, assign a sequence number and primary term, call Lucene IndexWriter.addDocument (or updateDocument), append the operation to the Translog for durability, record the result (including its translog location) in the in-memory LiveVersionMap, and advance the LocalCheckpointTracker.

flowchart TD
    Idx["InternalEngine.index(Engine.Index)"] --> Lock["per-_id lock"]
    Lock --> Plan{new / update / conflict?}
    Plan -->|conflict| Conf["VersionConflictEngineException"]
    Plan -->|ok| Seq["assign seqNo + primaryTerm"]
    Seq --> IW["Lucene IndexWriter.addDocument"]
    IW --> TL["Translog.add  (durable once fsynced)"]
    TL --> VM["LiveVersionMap.put"]
    VM --> CP["LocalCheckpointTracker advance"]

The crucial insight, worth more than any class name: a document becomes durable once its translog record is fsynced (with the default index.translog.durability=request, that happens before the write is acknowledged), but it becomes visible only after a refresh opens a new Lucene DirectoryReader. Durability and visibility are independent. This is the near-real-time (NRT) model.

find server -name "InternalEngine.java"
#   server/src/main/java/org/opensearch/index/engine/InternalEngine.java
grep -n "public IndexResult index\|IndexWriter\|Translog.add\|versionMap" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head
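
The durable-before-visible behavior is easy to model. The sketch below is a toy under stated assumptions: `ToyNrtEngine` and `ToyVersionConflict` are hypothetical names, a Python list stands in for the translog, and two dictionaries stand in for the indexing buffer and the current reader.

```python
class ToyVersionConflict(Exception):
    pass


class ToyNrtEngine:
    def __init__(self):
        self.translog = []      # append-only operation log (durability)
        self._buffer = {}       # indexed, but not yet searchable
        self._searchable = {}   # what the current "reader" can see
        self._versions = {}     # doc id -> latest version (cf. LiveVersionMap)
        self._next_seq_no = 0

    def index(self, doc_id, source, if_version=None):
        current = self._versions.get(doc_id, 0)
        if if_version is not None and if_version != current:
            raise ToyVersionConflict(doc_id)
        seq_no = self._next_seq_no
        self._next_seq_no += 1
        self._buffer[doc_id] = source                    # "IndexWriter.addDocument"
        self.translog.append((seq_no, doc_id, source))   # "Translog.add"
        self._versions[doc_id] = current + 1
        return seq_no

    def refresh(self):
        # Open a new point-in-time view that includes the buffered docs.
        self._searchable = {**self._searchable, **self._buffer}
        self._buffer = {}

    def get_searchable(self, doc_id):
        return self._searchable.get(doc_id)


engine = ToyNrtEngine()
engine.index("42", {"note": "mostly harmless"})
visible_before_refresh = engine.get_searchable("42")
logged_before_refresh = len(engine.translog)
engine.refresh()
visible_after_refresh = engine.get_searchable("42")
```

Between `index` and `refresh` the operation is already in the log but a search cannot see it, which is the gap `?refresh=true` closes in the `curl` command at the top of this journey.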

Full treatment: the Engine Internals deep dive — the next thing you should read after this guide, because it is the connective tissue of the whole system. Also: the translog and refresh/flush/merge.

Stop 7 — The Stenographer: the Lucene segment and the codec

The character. Lucene never edits anything in place. When the in-memory buffer fills (or a refresh/flush happens), Lucene writes a segment: a small, immutable, self-contained mini-index. Your document is now frozen into a set of files that will never be rewritten — only, eventually, merged with others into a larger segment and then deleted. Immutability is not a limitation; it is the secret to lock-free reads, cheap snapshots, and crash safety.

But who decides the byte layout of those files? The codec — a pluggable format specification. It is the stenographer that knows exactly how to transcribe terms, doc values, points, and vectors into bytes, and how to read them back.

The real names. A Lucene index is a set of segments plus a segments_N commit point (SegmentInfos). Each segment is a bundle of files written by the current default Codec, call it LuceneNNNCodec (e.g. Lucene101Codec / Lucene103Codec; grep to find the exact one your bundled Lucene ships). The codec is an SPI: Codec, PostingsFormat, DocValuesFormat, and KnnVectorsFormat implementations are discovered by name via META-INF/services, while StoredFieldsFormat, PointsFormat, and the other per-structure formats are supplied by the Codec itself.

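# Find the immutable segment files in a real shard directory.
# On-disk layout is .../nodes/<n>/indices/<index-uuid>/<shard-id>/index
DATA=$(find . -type d -name index -path "*nodes*indices*" | head -1)
find "$DATA" -maxdepth 1 -type f | sed 's:.*/::' | sort
# Expect: segments_N, plus per-segment .si .fnm .fdt/.fdx .tim/.tip .doc/.pos
#         .dvd/.dvm .nvd/.nvm .kdd/.kdi .vec/.vex/.vem ...
# Small segments are often packed into compound files (.cfs/.cfe),
# so on a fresh index you may see only those plus .si.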

File ext	Holds	The character
`.si`	segment info	the segment's ID card
`.fnm`	field infos	the field directory
`.fdt` / `.fdx` / `.fdm`	stored fields	the original `_source`
`.tim` / `.tip` / `.tmd`	terms dictionary + index	the index of the book
`.doc` / `.pos` / `.pay`	postings	the inverted index proper
`.dvd` / `.dvm`	DocValues	the columnar store
`.nvd` / `.nvm`	norms	length normalization for scoring
`.kdd` / `.kdi` / `.kdm`	points / BKD	numeric & geo trees
`.vec` / `.vex` / `.vem`	HNSW vectors	the graph for `embedding`
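
When you are staring at a real directory listing, a small helper that groups files by segment and labels them with the roles from the table can help. This is a convenience sketch, not a Lucene tool: `summarize` and `segment_name` are hypothetical helpers, and the file names in `listing` are illustrative rather than copied from a real shard.

```python
import os
from collections import defaultdict

# Extension -> role, copied from the table above.
EXTENSION_ROLES = {
    ".si": "segment info",
    ".fnm": "field infos",
    ".fdt": "stored fields", ".fdx": "stored fields", ".fdm": "stored fields",
    ".tim": "terms dictionary", ".tip": "terms dictionary", ".tmd": "terms dictionary",
    ".doc": "postings", ".pos": "postings", ".pay": "postings",
    ".dvd": "DocValues", ".dvm": "DocValues",
    ".nvd": "norms", ".nvm": "norms",
    ".kdd": "points / BKD", ".kdi": "points / BKD", ".kdm": "points / BKD",
    ".vec": "HNSW vectors", ".vex": "HNSW vectors", ".vem": "HNSW vectors",
}


def segment_name(filename):
    # "_0.si" and "_0_Lucene99_0.tim" both belong to segment "_0".
    return "_" + filename[1:].split("_")[0].split(".")[0]


def summarize(filenames):
    by_segment = defaultdict(set)
    for name in filenames:
        if not name.startswith("_"):
            continue  # segments_N, write.lock: not per-segment files
        ext = os.path.splitext(name)[1]
        role = EXTENSION_ROLES.get(ext, "other (" + ext + ")")
        by_segment[segment_name(name)].add(role)
    return {segment: sorted(roles) for segment, roles in by_segment.items()}


# Illustrative listing; real names depend on the codec your build ships.
listing = [
    "segments_3", "write.lock",
    "_0.si", "_0.fnm", "_0.fdt", "_0_Lucene99_0.tim", "_0_Lucene90_0.dvd",
    "_1.si", "_1.cfs", "_1.cfe",
]
summary = summarize(listing)
```

Feeding it the output of the `find` command above gives a quick per-segment inventory; anything labeled "other" is worth looking up in the codec's documentation.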

Full treatment: Segments and Codecs and IndexWriter and Merges. To physically open one of these directories with a GUI, the Crack Open a Lucene Index lab points Lucene's Luke tool at it.

Stop 8 — The Library Card Catalog: the inverted index and DocValues

We are now inside the segment, looking at two of its most important structures. They are opposites, and knowing which one a query uses is half of understanding search performance.

The inverted index is the card catalog. For the note field ("mostly harmless"), the analyzer produced terms [mostly, harmless], and the index stores, for each term, the posting list: the set of document IDs containing it. To answer "which docs contain harmless?" Lucene jumps straight to that term and reads the list. This is the structure that makes full-text search fast, and it is term → documents.

DocValues is the opposite: document → value, stored column-wise. To sort by total or compute sum(total), you do not want the inverted index — you want to stream the total value for every matching doc, in order, like a column in a database. That is DocValues, and it is the backbone of sorting and aggregations.

The real names. The terms dictionary is a BlockTree backed by an FST (finite-state transducer); postings are read via PostingsEnum. DocValues come in NUMERIC / SORTED / SORTED_SET / SORTED_NUMERIC / BINARY flavors and are read through DocValuesProducer and types like SortedNumericDocValues.

# OpenSearch's own DocValues/fielddata deep dive ties this to aggregations.
grep -rn "SortedNumericDocValues\|getLeafCollector" \
  server/src/main/java/org/opensearch/search/aggregations/ | head

Structure	Maps	Powers	Lucene class
Inverted index	term → docs	full-text `match`, term filters	`BlockTree` terms + `PostingsEnum`
Points / BKD	value-range → docs	numeric/date/geo range	`PointValues`, `BKDReader`
DocValues	doc → value (columnar)	sort, aggregations, scripts	`*DocValues`, `DocValuesProducer`
HNSW graph	vector → nearest vectors	k-NN vector search	`HnswGraph`, `FloatVectorValues`
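
The two directions, term → docs and doc → value, fit in a few lines. This is a toy model: the three documents, the whitespace `analyze` function, and the helper names are made up for illustration, and a Python list stands in for a DocValues column.

```python
from collections import defaultdict

docs = {
    0: {"note": "mostly harmless", "total": 42.0},
    1: {"note": "harmless towel", "total": 7.5},
    2: {"note": "mostly towel", "total": 19.0},
}


def analyze(text):
    # Stand-in analyzer: lowercase and split on whitespace.
    return text.lower().split()


# Inverted index: term -> ascending list of doc ids (the posting list).
inverted = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(analyze(docs[doc_id]["note"]))):
        inverted[term].append(doc_id)

# Column store: position = doc id, value = the field for that doc.
total_column = [docs[doc_id]["total"] for doc_id in sorted(docs)]


def match(term):
    return inverted.get(term, [])


def sum_total(doc_ids):
    return sum(total_column[doc_id] for doc_id in doc_ids)


def sort_by_total(doc_ids):
    return sorted(doc_ids, key=lambda doc_id: total_column[doc_id])


harmless = match("harmless")
harmless_total = sum_total(harmless)
towels_by_total = sort_by_total(match("towel"))
```

The query uses the inverted index to find which documents match, then the column to aggregate or sort them; neither structure could do the other's job efficiently.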

Full treatment: The Inverted Index and Postings, Points and BKD Trees, DocValues: The Columnar Store, and the existing DocValues and Fielddata deep dive.

Stop 9 — The Constellation: the HNSW graph and SIMD scoring

Our document had an embedding. Vectors do not fit the card-catalog model at all — there is no "term" to look up. Finding the nearest vectors to a query vector, exactly, would mean comparing against every vector in the segment, which at scale is hopeless. So vector search cheats, beautifully.

The character. Picture every vector as a star. HNSW (Hierarchical Navigable Small World) connects each star to a handful of nearby stars, building a navigable constellation with a few sparse "highway" layers on top and dense local streets at the bottom. To find the nearest stars to a query, you parachute into the top layer, greedily walk toward the query along the highways, drop a layer, walk again, and so on — visiting a tiny fraction of all vectors and still landing on the true neighbors almost always. "Almost" is the whole game: this is approximate nearest neighbor (ANN), and you trade a sliver of recall for orders of magnitude of speed.

The scoring. Walking the graph means computing distances — dot products, squared Euclidean distances — over and over. This inner loop is where vector search lives or dies, so Lucene vectorizes it with SIMD: the Panama Vector API (jdk.incubator.vector) computes many lane-wise multiply-adds per instruction, picked at runtime by a VectorizationProvider (PanamaVectorUtilSupport when the CPU supports it, a scalar fallback otherwise). This is why HNSW scoring is fast, and it is the "vectorization" theme you will meet again in the Engineering section.

The real names (Lucene engine). KnnFloatVectorField / KnnByteVectorField declare the field; FloatVectorValues reads vectors; VectorSimilarityFunction (EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT) defines distance; HnswGraph / HnswGraphBuilder are the graph; KnnFloatVectorQuery runs the search; the on-disk format is Lucene99HnswVectorsFormat (and scalar-quantized variants, up to the Lucene104* formats), written into the .vec / .vex / .vem files; VectorUtil is the SIMD-accelerated math.

The real names (k-NN plugin). OpenSearch's k-NN plugin adds the knn_vector field type and three engines: faiss (the default — native C++ via JNI), lucene (the Lucene HNSW above), and nmslib (deprecated). For faiss, a custom codec writes the native graph as segment files, and the graph is loaded into native memory outside the JVM heap, capped by a circuit breaker.

flowchart TD
    Q["query vector q"] --> Top["enter top HNSW layer"]
    Top --> Walk["greedy walk toward q (SIMD distance via VectorUtil/Panama)"]
    Walk --> Down["descend a layer, repeat"]
    Down --> Bottom["dense bottom layer: collect candidates"]
    Bottom --> K["return top-k approximate neighbors"]
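
The walk in the flowchart can be sketched with a two-layer toy graph. Several simplifications are assumed: the vectors sit on a line so the greedy walk provably reaches the true neighbor, layers are built with a fixed stride rather than randomly, and the search keeps a single best candidate instead of the candidate list real HNSW maintains. All names (`greedy_walk`, `toy_hnsw_search`, `brute_force`) are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_walk(vectors, neighbors, start, query):
    """Move to any neighbor closer to the query until none is closer."""
    current, best = start, squared_distance(vectors[start], query)
    evaluated = 1
    improved = True
    while improved:
        improved = False
        for candidate in neighbors[current]:
            d = squared_distance(vectors[candidate], query)
            evaluated += 1
            if d < best:
                current, best, improved = candidate, d, True
    return current, evaluated


N, STRIDE = 256, 16
vectors = {i: (float(i), 0.0) for i in range(N)}
# Bottom layer: every vector, linked to its immediate neighbors.
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j < N] for i in range(N)}
# Top layer: every 16th vector, linked by long "highway" edges.
layer1 = {i: [j for j in (i - STRIDE, i + STRIDE) if 0 <= j < N]
          for i in range(0, N, STRIDE)}


def toy_hnsw_search(query, entry=0):
    total_evaluated = 0
    for layer in (layer1, layer0):  # top layer first, then descend
        entry, evaluated = greedy_walk(vectors, layer, entry, query)
        total_evaluated += evaluated
    return entry, total_evaluated


def brute_force(query):
    return min(vectors, key=lambda i: squared_distance(vectors[i], query))


nearest, evaluated = toy_hnsw_search((200.3, 0.0))
```

Every call to `squared_distance` is the inner loop that Lucene accelerates with SIMD; the graph's contribution is that the walk makes far fewer of those calls than the exhaustive scan in `brute_force`.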

Full treatment: HNSW Vector Search in Lucene, SIMD and the Panama Vector API, and the entire k-NN section. The HNSW-from-scratch lab has you build a toy graph by hand.

The Return Trip: a search, in one paragraph

We followed a write all the way down. A read runs the same map in reverse, fanned out. RestSearchAction hands off to TransportSearchAction on a coordinating node, which uses the RoutingTable to fan the query out to one copy of every shard. On each shard, the query phase (QueryPhase) runs the Lucene query against that shard's segments and returns the top-K doc IDs and scores; a separate fetch phase (FetchPhase) loads _source for the survivors; the coordinating node merges the per-shard results in SearchPhaseController. Aggregations add a reduce step where partial per-shard results are combined into the global answer. A k-NN query is the same shape, but the per-shard work walks the HNSW graph instead of the inverted index. See the search execution deep dive and the Warm-Up's Scenarios 2 and 3.
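
The query-then-fetch shape can be sketched as a scatter-gather over in-memory "shards". This is a toy: raw term frequency stands in for BM25, the two shard dictionaries are invented data, and `query_phase`, `fetch_phase`, and `search` are hypothetical function names that only echo the real phases.

```python
import heapq


def query_phase(shard_id, shard_docs, term, k):
    """Score locally; return only this shard's top-k (score, doc_id, shard_id)."""
    hits = []
    for doc_id, doc in shard_docs.items():
        score = doc["note"].split().count(term)  # term frequency, a stand-in for BM25
        if score > 0:
            hits.append((score, doc_id, shard_id))
    return heapq.nlargest(k, hits)


def fetch_phase(shards, winners):
    """Load the full document only for the global winners."""
    return [(doc_id, score, shards[shard_id][doc_id])
            for score, doc_id, shard_id in winners]


def search(shards, term, k):
    per_shard = [query_phase(i, docs, term, k) for i, docs in enumerate(shards)]
    # Coordinating node: merge the per-shard top-k lists into a global top-k.
    winners = heapq.nlargest(k, (hit for hits in per_shard for hit in hits))
    return fetch_phase(shards, winners)


shards = [
    {"a": {"note": "towel towel towel"}, "b": {"note": "mostly harmless"}},
    {"c": {"note": "towel"}, "d": {"note": "towel towel"}},
]
results = search(shards, "towel", k=2)
```

Splitting the work this way means each shard ships only k small (score, id) pairs over the network, and full documents are loaded for at most k hits in total.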

The Whole Cast, On One Page

Character	Real class	Lives in	Owner
The Doorman	`RestController` → `RestIndexAction`	`server/.../rest/`	OpenSearch
The Dispatcher	`NodeClient` → `TransportIndexAction`, `ActionModule`	`server/.../action/`	OpenSearch
The Keeper of Truth	`ClusterState`, `ClusterManagerService`, `Coordinator`	`server/.../cluster/`	OpenSearch
The Sorting Hat	`OperationRouting`, `RoutingTable`	`server/.../cluster/routing/`	OpenSearch
The Foreman	`IndexShard`	`server/.../index/shard/`	OpenSearch
The Engine Room	`InternalEngine`, `Translog`, `LiveVersionMap`	`server/.../index/engine/`, `server/.../index/translog/`	boundary
The Stenographer	Lucene `IndexWriter`, `Codec`	`org.apache.lucene.*`	Lucene
The Card Catalog	inverted index, `BlockTree`, FST	`org.apache.lucene.*`	Lucene
The Columnar Store	DocValues	`org.apache.lucene.*`	Lucene
The Constellation	`HnswGraph`, `KnnFloatVectorQuery`, `VectorUtil`	`org.apache.lucene.*` (+ k-NN plugin)	Lucene / k-NN

Where To Go Next (the map)

You have now seen the whole road once. Here is where each branch leads, and the order most people should take them.

flowchart TD
    Guide["You are here: the Hitchhiker's Guide"] --> Warmup["Warm-Up: run it as a user (depth 1)"]
    Warmup --> Levels["Levels 1–9: become a contributor (depth 2)"]
    Levels --> Lucene["Lucene section: segments, codecs, HNSW, SIMD"]
    Levels --> KNN["k-NN section: vectors, engines, JNI, quantization"]
    Lucene --> Eng["Engineering at Scale: real RFCs (depth 3)"]
    KNN --> Eng
    Eng --> Projects["Capstone Projects: ship a real contribution"]
    Levels --> Deep["Deep Dives: the reference for every subsystem"]

If you want to…	Go to
Feel it as a user first (do this now)	The Warm-Up — five hands-on scenarios that map every command to source.
Become a contributor, in order	Levels 1–9: build from source → repo structure → REST/transport → coordination → testing → engine → search → fix-an-issue → BWC.
Understand any one subsystem deeply	The Deep Dives — including Engine Internals, Search Execution, Aggregations, Cluster State.
Go down to Lucene	The Lucene section: segments & codecs, postings, BKD, DocValues, merges, HNSW, SIMD.
Learn vector search end to end	The k-NN section: architecture, engines, native JNI & memory, quantization & disk-ANN.
Drive a hard, cross-cutting change (depth 3)	Engineering at Scale and the catalog of real issues & RFCs.
Ship a portfolio-grade contribution	The Capstone (process) and Capstone Projects (eight concrete briefs).

Before You Drive On

You are ready to leave this guide when, without notes, you can:

Name the three depths of mastery and say which one you are aiming for now.
State the one-sentence model and split a list of concepts into "OpenSearch owns it" vs "Lucene owns it".
Trace our document's nine stops in order, naming the real class at each: RestIndexAction → TransportIndexAction → cluster manager / ClusterState → OperationRouting → IndexShard → InternalEngine → Lucene IndexWriter → segment/Codec → (for a vector) HnswGraph + VectorUtil.
Explain why a document is durable before it is visible.
Explain in one breath what HNSW does and why SIMD makes it fast.
Point at the right next section for whatever you want to learn.

If you cannot tick one of these off, the offending stop above is one re-read away.

The Guide's final, load-bearing advice has not changed in this edition: Don't Panic. The system is large, but it is legible — every scary name is just a character with a job, a class, and a file. You have met them all now.

Continue to the Warm-Up to run this journey as a user, or straight to Level 1 to start building from source. When you are ready for the deep end, Engineering at Scale is waiting.

OpenSearch Warm-Up: From User to Contributor

Before you read a single line of IndexShard.java, you need to have sat in the seat of the person whose workload OpenSearch is serving. The engineers who built OpenSearch's coordination layer, search fan-out, and indexing engine were solving specific, painful problems that show up in production log analytics, full-text search, and observability workloads every day. If you skip that context and go straight to the source, you will memorize class names without understanding why the design exists.

This chapter is the missing first mile. You will run OpenSearch from the outside — as a user would — across a series of practical scenarios covering different data shapes, query patterns, and cluster behaviors. After each scenario, the chapter maps what you observed back to the org.opensearch.* source structures that own it, with real grep/find commands. By the end, every internal class will feel like an old acquaintance rather than an alien term.

Everything here assumes the local cluster from the prerequisites: ./gradlew run is live, with REST on localhost:9200.

What OpenSearch Actually Is (Two Sentences)

OpenSearch is a distributed search and analytics engine built on Apache Lucene: it stores JSON documents in sharded, replicated indices and serves full-text search, structured queries, and aggregations over a REST API. It is the engine — OpenSearch Dashboards is a completely separate TypeScript application (its own repo, opensearch-project/OpenSearch-Dashboards) that talks to the engine over HTTP to draw charts and dashboards.

Hold that boundary the entire curriculum: when a chart is wrong, the bug is in Dashboards (the query it built), in core search (how the engine ran it), in a plugin, or in Lucene itself — and your job as a contributor is to attribute it correctly. This warm-up keeps that line sharp.

Where OpenSearch Sits in the Search/Analytics Spectrum

┌──────────────────────────────────────────────────────────────────────────────┐
│                   Search & Analytics Tool Spectrum                            │
│                                                                              │
│  Full-text / relevance ◄──────────────────────────────► Analytics / OLAP     │
│                                                                              │
│  Solr      OpenSearch / Elasticsearch        ClickHouse        Splunk         │
│  (Lucene)  (Lucene, distributed)             (columnar OLAP)   (logs, SPL)     │
│                                                                              │
│  Vector / semantic:  Milvus · pgvector · OpenSearch k-NN plugin               │
│                                                                              │
│  ──────────────────────────────────────────────────────────────────────       │
│  Ingest:   Logstash · Data Prepper · Fluent Bit · Beats → OpenSearch          │
│  Store:    Lucene segments on local disk / remote-backed store (S3)           │
│  Query UI: OpenSearch Dashboards (the analog of Kibana / Grafana)             │
└──────────────────────────────────────────────────────────────────────────────┘

OpenSearch lives at the intersection: a distributed, near-real-time engine that is strong at both relevance-ranked full-text search and log/observability analytics, while not being a true columnar OLAP database. Knowing the neighbors tells you when OpenSearch is the right tool and where a reported "OpenSearch is slow" problem is really a "wrong tool" problem.

OpenSearch vs. Elasticsearch (the fork)

This is the most important comparison, because OpenSearch is a fork of Elasticsearch 7.10.2.

Dimension	OpenSearch	Elasticsearch
Origin	Forked from Elasticsearch 7.10.2 (2021)	The original Lucene-based engine
License	Apache 2.0	SSPL / Elastic License (since 7.11); AGPLv3 added as an option in 2024
Governance	OpenSearch Software Foundation under the Linux Foundation, with a Technical Steering Committee (TSC)	Elastic N.V. (a company)
Source of truth	GitHub issues + PRs, DCO sign-off, no CLA	GitHub + Elastic's CLA
Package namespace	`org.opensearch.*`	`org.elasticsearch.*`
Terminology	"cluster manager" (renamed from master)	still "master"
Notable divergence	segment replication, remote-backed storage, Extensions SDK, the cluster-manager rename	its own 8.x feature line

Because of the shared lineage, much of the architecture is identical in shape — but class names are org.opensearch.*, and many features have diverged. When you read old Elasticsearch blog posts, mentally translate master → cluster_manager and org.elasticsearch → org.opensearch.

OpenSearch vs. Apache Solr

Dimension	OpenSearch	Apache Solr
Lucene-based	Yes	Yes
Distribution model	Native: shards, replicas, cluster manager, `RoutingTable`	SolrCloud + ZooKeeper
Coordination	Built-in `Coordinator` (Zen2/Raft-like)	External ZooKeeper ensemble
API style	JSON over REST, rich Query DSL	XML/JSON, query params + JSON DSL
Sweet spot	Logs, observability, app search, dashboards	Enterprise/site search, faceting
Ecosystem	Dashboards, plugins (k-NN, SQL, alerting)	Solr admin UI, plugins

Both wrap Lucene. The structural difference is that OpenSearch ships its own coordination layer (no external ZooKeeper); Solr leans on ZooKeeper for cluster state. If you ever debug split-brain or election issues, that is org.opensearch.cluster.coordination, not a separate service.

OpenSearch vs. ClickHouse

Dimension	OpenSearch	ClickHouse
Engine type	Inverted index + DocValues (Lucene)	True columnar MPP OLAP
Best at	Relevance search, flexible filtering, near-real-time logs	Massive analytical scans/aggregations
Schema	Dynamic mappings, JSON documents	Strongly typed columnar tables
Aggregations	`date_histogram`, `terms`, etc. over DocValues	SQL `GROUP BY` over columns, vectorized
Updates	Document-level, near-real-time	Append-optimized, slow point updates

If a user runs trillion-row GROUP BY reports with no text search and no per-document relevance, ClickHouse will often win. OpenSearch wins when you need full-text relevance, flexible ad-hoc filtering, and a search-shaped data model alongside aggregations.

OpenSearch vs. Vector Databases (Milvus / pgvector)

Dimension	OpenSearch (+ k-NN plugin)	Milvus / pgvector
Vector search	k-NN plugin (HNSW/IVF via Lucene/FAISS/nmslib)	Purpose-built ANN engines
Hybrid (text + vector)	Native: combine `match` + k-NN in one query	Usually needs a separate text store
Core engine	Vectors are an added field type via plugin	Vectors are the entire point

OpenSearch does vector/semantic search through the k-NN plugin (a separate repo), layered on the same shards and search path. For pure, billion-scale ANN with nothing else, a dedicated vector DB may be leaner; for hybrid lexical+semantic search in one system, OpenSearch is compelling.

OpenSearch vs. Splunk

Dimension	OpenSearch	Splunk
License/cost	Open source, Apache 2.0	Commercial, volume-priced
Query language	Query DSL (JSON), SQL/PPL via plugin	SPL (Search Processing Language)
UI	OpenSearch Dashboards	Splunk Web
Sweet spot	Logs, security analytics, observability	Logs, SIEM, ops analytics

OpenSearch is frequently adopted as an open, self-hosted alternative to Splunk for log analytics and security. The compute model differs (OpenSearch builds a typed Lucene index at ingest, while Splunk indexes raw events and extracts most fields at search time), but the workload (ingest logs, search/filter, aggregate, dashboard) is the same, which is exactly Scenarios 1 and 3 below.

The Data Model: Documents, Indices, Shards, Segments

You must be able to say precisely what is OpenSearch and what is Lucene — this distinction decides where a bug lives.

Cluster
 └── Index  "logs-2026.06.16"  (OpenSearch concept: mappings, settings, aliases)
      ├── Primary shard 0  ─────────┐
      │    └── Lucene index         │  (LUCENE: this is a real Lucene index)
      │         ├── segment _0      │      immutable, append-only files
      │         ├── segment _1      │      built by IndexWriter, read by DirectoryReader
      │         └── segment _2      │
      ├── Replica shard 0  (copy of primary 0, on another node)
      ├── Primary shard 1
      └── Replica shard 1

Concept	Owner	What it is
Document	OpenSearch	A JSON object you index. Has an `_id`, lives in one index, routed to one shard.
Index	OpenSearch	A named collection of documents with mappings (field types) and settings (shard count, etc.).
Shard (primary/replica)	OpenSearch	A horizontal slice of an index. The unit of distribution and replication. Each shard is a Lucene index.
Segment	Lucene	An immutable mini-index inside a shard. New docs go to new segments; merges combine them.
Mapping	OpenSearch	Field → type declaration (`text`, `keyword`, `long`, `date`, `nested`, …). Drives how Lucene fields are built.
Analyzer / Tokenizer / TokenFilter	mostly Lucene (configured by OpenSearch)	Turns `text` field values into terms (`"Hello World"` → `[hello, world]`).
Inverted index	Lucene	term → posting list (which docs contain the term). The core of full-text search.
DocValues	Lucene	Columnar per-field storage used for sorting and aggregations.
Translog	OpenSearch	A write-ahead log for durability between Lucene commits. Not a Lucene concept.
Cluster state / routing / allocation	OpenSearch	Where shards live, who is cluster manager, index metadata. Pure OpenSearch.

Rule of thumb: anything about scoring, tokenization, segments, postings, DocValues is Lucene (OpenSearch configures it). Anything about shards across nodes, replication, cluster state, REST, transport, translog, aggregation reduce is OpenSearch. Keep this table near your desk.

Scenario 1: Log Analytics Ingest + Range/Term Search

What the user does — bulk-index some structured logs, then filter by status and time range.

# Bulk-index four log lines into an index "weblogs".
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/weblogs/_bulk?refresh=true' --data-binary $'
{"index":{}}
{"@timestamp":"2026-06-16T10:00:00Z","status":200,"bytes":512,"path":"/home","client":"10.0.0.1"}
{"index":{}}
{"@timestamp":"2026-06-16T10:00:05Z","status":404,"bytes":128,"path":"/missing","client":"10.0.0.2"}
{"index":{}}
{"@timestamp":"2026-06-16T10:01:00Z","status":500,"bytes":256,"path":"/api","client":"10.0.0.1"}
{"index":{}}
{"@timestamp":"2026-06-16T10:02:30Z","status":200,"bytes":900,"path":"/home","client":"10.0.0.3"}
'

# Filter: 5xx errors in a time window. A "filter" context = no scoring, just yes/no.
curl -s -H 'Content-Type: application/json' 'localhost:9200/weblogs/_search?pretty' -d '
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "status": { "gte": 500 } } },
        { "range": { "@timestamp": { "gte": "2026-06-16T10:00:00Z", "lte": "2026-06-16T10:05:00Z" } } }
      ]
    }
  }
}'

You get back the one status:500 document. Notice "max_score": 0.0 and "_score": 0.0 on the hit: filter context does no relevance scoring, so every match scores zero.

What OpenSearch does under the hood:

RestBulkAction parses the NDJSON body, builds a BulkRequest, and calls the NodeClient.
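TransportBulkAction auto-creates the weblogs index (a cluster state update that adds index metadata and a RoutingTable entry), then groups documents per target shard. Dynamic mapping happens later, on the primary: parsing the first document yields a mapping update (status→long, @timestamp→date, path→text+keyword) that is sent to the cluster manager before the write proceeds.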
Each shard's writes go through TransportShardBulkAction → IndexShard.applyIndexOperationOnPrimary(...) → InternalEngine.index(...) → Lucene IndexWriter.addDocument(...) + Translog.add(...). ?refresh=true forces an immediate refresh so the docs are searchable.
The search request hits RestSearchAction → TransportSearchAction. The coordinating node fans out to each shard; per shard, SearchService.executeQueryPhase runs the range filters as Lucene point range queries (PointRangeQuery) against the BKD-tree points indexed for numeric and date fields, optionally paired with a doc-values query through IndexOrDocValuesQuery, and collects the matching doc IDs.
FetchPhase loads the stored _source for the hits; SearchPhaseController merges shard results on the coordinating node.
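
The "groups documents per target shard" step can be pictured with a small Java sketch. This is an illustration under stated assumptions, not the OpenSearch router: `ShardRoutingSketch` is a hypothetical class, and `String.hashCode()` stands in for the Murmur3 hash (and routing factor) that the real code applies to the routing value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShardRoutingSketch {
    // Stand-in hash: same routing value always maps to the same shard number.
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }

    // Split one bulk request into per-shard batches, keyed by shard number.
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numberOfShards) {
        Map<Integer, List<String>> perShard = new TreeMap<>();
        for (String id : docIds) {
            perShard.computeIfAbsent(shardFor(id, numberOfShards), s -> new ArrayList<>()).add(id);
        }
        return perShard;
    }

    public static void main(String[] args) {
        System.out.println(groupByShard(List.of("a", "b", "c", "d"), 2)); // {0=[b, d], 1=[a, c]}
    }
}
```

Because the shard is a pure function of the routing value and the shard count, any node can compute where a document lives without asking anyone else.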

Bridge to source code:

cd ~/src/OpenSearch

# The REST handler that accepts /_bulk
find server -name "RestBulkAction.java"
#   server/src/main/java/org/opensearch/rest/action/document/RestBulkAction.java

# The transport action that splits a bulk by shard and creates indices/mappings
grep -rn "class TransportBulkAction" server/src/main/java/org/opensearch/action/bulk/

# Where a range filter becomes a Lucene query
grep -rn "PointRangeQuery\|newRangeQuery" server/src/main/java/org/opensearch/index/mapper/ | head

Scenario 2: Full-Text Relevance Query (`match`, BM25, `_explain`)

What the user does — index a few articles, then run a relevance-ranked match query and ask the engine to explain the score.

curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/articles/_bulk?refresh=true' --data-binary $'
{"index":{"_id":"1"}}
{"title":"OpenSearch query performance tuning","body":"Tuning search relevance and query latency in OpenSearch."}
{"index":{"_id":"2"}}
{"title":"A gentle introduction to Lucene","body":"Lucene powers search; OpenSearch builds on it."}
{"index":{"_id":"3"}}
{"title":"Cooking with cast iron","body":"Nothing to do with search at all."}
'

# Relevance query in "query" context — scored by BM25 by default.
curl -s -H 'Content-Type: application/json' 'localhost:9200/articles/_search?pretty' -d '
{ "query": { "match": { "body": "search relevance" } } }'

# Ask WHY document 1 scored what it did.
curl -s -H 'Content-Type: application/json' 'localhost:9200/articles/_explain/1?pretty' -d '
{ "query": { "match": { "body": "search relevance" } } }'

The _explain response is a tree of BM25 components (termFreq, idf, tf, field length norm). The top scoring doc is the one whose body best matches the analyzed terms [search, relevance].

What OpenSearch does under the hood:

The match query is a QueryBuilder (MatchQueryBuilder). At query time it runs the field's analyzer over "search relevance" to produce terms, then builds a Lucene BooleanQuery of TermQuery clauses via QueryShardContext.
Per shard, QueryPhase executes the Lucene query; Lucene scores each matching doc with BM25 (the default Similarity), using term frequency, inverse document frequency, and field-length normalization — all computed inside Lucene, not OpenSearch.
The coordinating node merges the top-K from each shard in SearchPhaseController; FetchPhase loads _source for the survivors.
_explain runs the same query against one doc and returns Lucene's Explanation tree verbatim.
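
To read the idf and tf nodes in an _explain tree, it helps to have the formula in front of you. The sketch below is a plain-Java rendering of the BM25 term score with Lucene's default k1 = 1.2 and b = 0.75. `Bm25Sketch` is a hypothetical class, the statistics in `main` are made up, and the constant (k1 + 1) factor that some Lucene/OpenSearch versions fold into the boost is omitted because it scales every score equally.

```java
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (Lucene default)
    static final double B = 0.75;  // strength of length normalization (Lucene default)

    // idf = ln(1 + (N - n + 0.5) / (n + 0.5)); rarer terms weigh more.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)); long fields are penalized.
    static double tf(double freq, double fieldLength, double avgFieldLength) {
        double norm = K1 * (1.0 - B + B * fieldLength / avgFieldLength);
        return freq / (freq + norm);
    }

    static double score(double freq, long docCount, long docFreq,
                        double fieldLength, double avgFieldLength) {
        return idf(docCount, docFreq) * tf(freq, fieldLength, avgFieldLength);
    }

    public static void main(String[] args) {
        // Hypothetical statistics for a 3-document index.
        System.out.printf("term in 1 of 3 docs: %.4f%n", score(1, 3, 1, 7, 7));
        System.out.printf("term in 3 of 3 docs: %.4f%n", score(1, 3, 3, 7, 7));
    }
}
```

A multi-term match query sums this per-term score over the matching terms. Lucene also stores field lengths in a lossy one-byte norm, so the dl value shown by _explain may differ slightly from the true token count.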

Bridge to source code:

# The match query builder
find server -name "MatchQueryBuilder.java"
#   server/src/main/java/org/opensearch/index/query/MatchQueryBuilder.java

# Where QueryBuilders turn into Lucene Query objects
grep -rn "QueryShardContext\|toQuery" server/src/main/java/org/opensearch/index/query/AbstractQueryBuilder.java | head

# The query phase itself
find server -name "QueryPhase.java"
#   server/src/main/java/org/opensearch/search/query/QueryPhase.java

# BM25 / Similarity wiring (OpenSearch configures Lucene's Similarity)
grep -rn "BM25\|Similarity" server/src/main/java/org/opensearch/index/similarity/ | head

Scenario 3: Aggregations (`date_histogram` + `terms` sub-agg) — The Dashboards Workload

What the user does — the canonical "log volume over time, broken down by status" dashboard panel.

curl -s -H 'Content-Type: application/json' 'localhost:9200/weblogs/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "by_status": { "terms": { "field": "status" } }
      }
    }
  }
}'

"size": 0 means "no hits, just the aggregation". The result is buckets per minute, each with a nested breakdown by status. This single request shape is what powers a huge fraction of OpenSearch Dashboards visualizations.

What OpenSearch does under the hood:

RestSearchAction parses the aggs block into an AggregatorFactories tree: DateHistogramAggregationBuilder with a child TermsAggregationBuilder.
Per shard, each AggregatorFactory produces an Aggregator. The aggregators iterate matching docs reading the @timestamp and status values from DocValues (columnar, not the inverted index), bucketing as they go. This is why aggregations need doc_values (on by default for numerics/dates and keyword).
Each shard returns a partial InternalAggregation. The coordinating node calls InternalAggregation.reduce(...) to merge partial buckets into the final result — this reduce step is where shard-local partial results become a globally correct answer.
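
The reduce step is easier to reason about with a toy version. The following Java sketch assumes each shard reports a simple map of bucket key to doc_count and merges them by summing. `AggReduceSketch` is a hypothetical class; the real InternalAggregation.reduce(...) also handles sub-aggregations, ordering, and size limits.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AggReduceSketch {
    // Merge partial per-shard buckets: counts for the same key are added together.
    static Map<String, Long> reduce(List<Map<String, Long>> shardResults) {
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> partial : shardResults) {
            partial.forEach((key, count) -> merged.merge(key, count, Long::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> shard0 = Map.of("10:00", 2L, "10:01", 1L);
        Map<String, Long> shard1 = Map.of("10:00", 1L, "10:02", 1L);
        System.out.println(reduce(List.of(shard0, shard1))); // {10:00=3, 10:01=1, 10:02=1}
        System.out.println(reduce(List.of()));               // {}
    }
}
```

Note the second call: reducing zero partial results must still yield a valid, empty answer.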

Bridge to source code:

# Date histogram aggregation
grep -rln "class DateHistogramAggregator" server/src/main/java/org/opensearch/search/aggregations/

# Terms aggregation
grep -rln "class TermsAggregator" server/src/main/java/org/opensearch/search/aggregations/bucket/terms/

# The reduce contract — how partial shard aggs merge on the coordinating node
grep -rn "public InternalAggregation reduce" server/src/main/java/org/opensearch/search/aggregations/ | head

# Aggregations read DocValues, not the inverted index
grep -rn "getLeafCollector\|LeafBucketCollector" server/src/main/java/org/opensearch/search/aggregations/AggregatorBase.java | head

See the Aggregations deep dive and the DocValues and Fielddata deep dive for the full mechanics.

Scenario 4: Multi-Node Cluster — Replicas, Health, Killing a Node

What the user does — bring up a 3-node cluster, create an index with replicas, check health, then kill a node and watch shards reallocate. This is the heart of OpenSearch's distributed nature.

# Run a 3-node local cluster from source instead of the single-node default.
./gradlew run -PnumNodes=3

# Create an index with 2 primaries and 1 replica each (4 shards total).
curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/inventory' -d '
{ "settings": { "number_of_shards": 2, "number_of_replicas": 1 } }'

# Cluster health: should be green (every replica has a home).
curl -s 'localhost:9200/_cluster/health?pretty'

# See exactly where each shard lives and its state (STARTED / RELOCATING / UNASSIGNED).
curl -s 'localhost:9200/_cat/shards/inventory?v'
curl -s 'localhost:9200/_cat/nodes?v'

Now kill one node process (find its PID with _cat/nodes?v&h=name,pid or in your process list) and immediately re-check:

# In the ./gradlew run output, identify a data node's PID; kill it. Then:
curl -s 'localhost:9200/_cluster/health?pretty'        # status flips to yellow, then back
curl -s 'localhost:9200/_cat/shards/inventory?v'       # watch UNASSIGNED → INITIALIZING → STARTED

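You will see health go yellow (a replica lost its home). If you killed the cluster manager, the remaining nodes elect a new one. The AllocationService promotes a surviving replica to primary where needed and rebuilds the missing copy on another node, eventually returning the cluster to green. By default the replacement replica is not allocated until index.unassigned.node_left.delayed_timeout (1m) expires, so expect UNASSIGNED to linger for about a minute.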

What OpenSearch does under the hood:

The elected cluster manager (formerly master) owns the authoritative ClusterState. When a node dies, FollowersChecker/LeaderChecker detect it and the cluster manager publishes a new ClusterState removing that node from DiscoveryNodes.
AllocationService + BalancedShardsAllocator, gated by AllocationDeciders (SameShardAllocationDecider, DiskThresholdDecider, MaxRetryAllocationDecider, …), recompute the RoutingTable: promote a replica to primary, allocate a new replica.
The new state is published two-phase (publish → commit) from the cluster manager to all nodes via PublicationTransportHandler; each node's ClusterApplierService applies it locally.
Recovery of the rebuilt replica runs through PeerRecoverySourceService/PeerRecoveryTargetService, tracked by sequence numbers (LocalCheckpointTracker, global checkpoint via ReplicationTracker).
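
The green/yellow/red transitions you just watched follow a simple rule, sketched below in Java. `ClusterHealthSketch` and `ShardCopy` are hypothetical names; the sketch only models whether each shard copy is active, not the full routing table.

```java
import java.util.List;

final class ClusterHealthSketch {
    enum Health { GREEN, YELLOW, RED }

    // One shard copy: is it a primary, and is it currently active (serving)?
    record ShardCopy(boolean primary, boolean active) {}

    // RED if any primary is inactive, YELLOW if only replicas are, else GREEN.
    static Health health(List<ShardCopy> copies) {
        Health result = Health.GREEN;
        for (ShardCopy copy : copies) {
            if (copy.active()) {
                continue;
            }
            if (copy.primary()) {
                return Health.RED;
            }
            result = Health.YELLOW;
        }
        return result;
    }

    public static void main(String[] args) {
        // After a node dies: the primary survived (or was promoted), one replica is unassigned.
        System.out.println(health(List.of(new ShardCopy(true, true), new ShardCopy(false, false)))); // YELLOW
    }
}
```

This is why losing a node that held a primary usually shows yellow rather than red: the replica is promoted in the same cluster-state update, so no primary stays inactive.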

Bridge to source code:

# The coordination layer: election, joins, publishing
ls server/src/main/java/org/opensearch/cluster/coordination/
grep -rln "class Coordinator" server/src/main/java/org/opensearch/cluster/coordination/

# The allocation engine and the deciders that gate it
grep -rln "class AllocationService" server/src/main/java/org/opensearch/cluster/routing/allocation/
ls server/src/main/java/org/opensearch/cluster/routing/allocation/decider/

# Where a node-left event triggers reroute
grep -rn "disassociateDeadNodes\|reroute" server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java | head

The Discovery and Coordination, Shard Allocation, and Recovery deep dives cover this in full. Level 4 is built entirely around this scenario.

Scenario 5: Snapshot to a Filesystem Repository and Restore

What the user does — register a filesystem snapshot repository, snapshot an index, delete it, and restore from the snapshot. This is the backup/restore and disaster-recovery story.

# ./gradlew run permits the build's temp dir as a snapshot path by default; if not,
# add path.repo to the run config. Register an "fs" repository:
curl -s -H 'Content-Type: application/json' -XPUT 'localhost:9200/_snapshot/my_fs_repo' -d '
{ "type": "fs", "settings": { "location": "my_backup_dir" } }'

# Take a snapshot of the "weblogs" index, waiting for it to finish.
curl -s -XPUT 'localhost:9200/_snapshot/my_fs_repo/snap-1?wait_for_completion=true&pretty' \
  -H 'Content-Type: application/json' -d '{ "indices": "weblogs" }'

# Destroy the index, then restore it from the snapshot.
curl -s -XDELETE 'localhost:9200/weblogs'
curl -s -XPOST 'localhost:9200/_snapshot/my_fs_repo/snap-1/_restore?wait_for_completion=true&pretty'

# Confirm the data is back.
curl -s 'localhost:9200/weblogs/_count?pretty'

What OpenSearch does under the hood:

RepositoriesService registers the fs repository (FsRepository, a BlobStoreRepository).
SnapshotsService runs a snapshot as a series of cluster-state transitions on the cluster manager; on each data node, SnapshotShardsService copies the shard's Lucene segment files plus index metadata into the blob store. The copy is incremental at the segment level: only segment files not already in the repository are uploaded.
Restore is the inverse: RestoreService builds new shards from the snapshotted segments and recovers them (a special recovery source), then the AllocationService allocates them.
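
Segment-level incrementality boils down to a set difference, which this Java sketch illustrates. `IncrementalSnapshotSketch` and the file names are hypothetical, and matching by name alone is a simplification: the sketch assumes a file name uniquely identifies its content, whereas a real repository would also compare file metadata.

```java
import java.util.Set;
import java.util.TreeSet;

final class IncrementalSnapshotSketch {
    // Files the shard has now, minus files an earlier snapshot already uploaded.
    static Set<String> filesToUpload(Set<String> shardFiles, Set<String> alreadyInRepo) {
        Set<String> missing = new TreeSet<>(shardFiles);
        missing.removeAll(alreadyInRepo);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> shardFiles = Set.of("_0.cfs", "_1.cfs", "segments_2");
        Set<String> alreadyInRepo = Set.of("_0.cfs");
        System.out.println(filesToUpload(shardFiles, alreadyInRepo)); // [_1.cfs, segments_2]
    }
}
```

Because segments are immutable, a file that was uploaded once never needs to be uploaded again; a merge produces new files rather than modifying old ones.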

Bridge to source code:

# Snapshot and repository services
grep -rln "class SnapshotsService"   server/src/main/java/org/opensearch/snapshots/
grep -rln "class RepositoriesService" server/src/main/java/org/opensearch/repositories/
grep -rln "class BlobStoreRepository" server/src/main/java/org/opensearch/repositories/blobstore/

# The fs repository implementation
find server modules plugins -name "FsRepository.java"

# Restore is recovery from a snapshot source
grep -rln "class RestoreService" server/src/main/java/org/opensearch/snapshots/

See the Snapshots and Repositories deep dive.

Dataset Scenarios for Testing Edge Cases

When you write a repro or validate a fix, the dataset you choose determines which code path you exercise. Use these as starting templates; each names the subsystem it stresses.

Dataset 1: The Empty Index

curl -s -XPUT 'localhost:9200/empty_idx'
curl -s 'localhost:9200/empty_idx/_search?pretty'        # 0 hits, must NOT error
curl -s 'localhost:9200/empty_idx/_search?pretty' -H 'Content-Type: application/json' \
  -d '{ "aggs": { "s": { "sum": { "field": "missing" } } } }'

What this tests: the coordinating-node reduce path when every shard returns zero hits, and aggregations over a field that does not exist. SearchPhaseController.reducedQueryPhase(...) and InternalAggregation.reduce(...) must produce an empty-but-valid response, not an NPE. Historically a rich source of "empty result" bugs. Source: server/src/main/java/org/opensearch/action/search/SearchPhaseController.java.

Dataset 2: A Single Huge Document

# One ~5 MB document: stresses _source storage and field length, and passes through the http.max_content_length check (default 100mb).
python3 - <<'PY' > /tmp/huge.json
import json
print(json.dumps({"blob": "x" * 5_000_000, "n": 1}))
PY
curl -s -XPOST 'localhost:9200/huge/_doc?refresh=true' -H 'Content-Type: application/json' --data-binary @/tmp/huge.json

What this tests: _source storage, the http.max_content_length guard in the REST/HTTP layer (default 100mb, so this 5 MB request is accepted unless you lower the setting), and Lucene stored-field limits. Exercises RestController request-size handling and IndexShard/InternalEngine for a single oversized doc. Source: server/src/main/java/org/opensearch/rest/RestController.java and the HTTP transport in modules/transport-netty4.

Dataset 3: Deeply Nested Objects

curl -s -XPUT 'localhost:9200/nested_idx' -H 'Content-Type: application/json' -d '
{ "mappings": { "properties": { "user": { "type": "nested",
    "properties": { "comments": { "type": "nested" } } } } } }'

What this tests: the nested field type and the index.mapping.nested_objects.limit / index.mapping.depth.limit guards. Nested docs are stored as hidden child Lucene documents, so this exercises ObjectMapper/NestedObjectMapper and the join-based nested query. Source: server/src/main/java/org/opensearch/index/mapper/ObjectMapper.java and NestedObjectMapper.

Dataset 4: High-Cardinality Terms Field

# Index many unique keyword values, then a terms agg over them.
for i in $(seq 1 5000); do
  printf '{"index":{}}\n{"uid":"user-%s"}\n' "$i"
done | curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/hicard/_bulk?refresh=true' --data-binary @-

curl -s 'localhost:9200/hicard/_search?pretty' -H 'Content-Type: application/json' \
  -d '{ "size": 0, "aggs": { "u": { "terms": { "field": "uid", "size": 10 } } } }'

What this tests: the terms aggregator's bucket collection over a high-cardinality field — memory pressure, the request circuit breaker, and shard-level vs. coordinating-node accuracy (doc_count_error_upper_bound). This is a classic OOM/breaker-trip path. Source: server/src/main/java/org/opensearch/search/aggregations/bucket/terms/ and HierarchyCircuitBreakerService.

Dataset 5: Many Tiny Shards

# 50 shards for a trivially small index — a real anti-pattern users hit.
curl -s -XPUT 'localhost:9200/oversharded' -H 'Content-Type: application/json' \
  -d '{ "settings": { "number_of_shards": 50, "number_of_replicas": 0 } }'
curl -s 'localhost:9200/_cat/shards/oversharded?v'
curl -s 'localhost:9200/_cluster/health/oversharded?pretty'

What this tests: AllocationService/BalancedShardsAllocator placing many shards, cluster-state size growth (50 routing entries), and the per-shard search fan-out overhead in TransportSearchAction (50 shard requests for almost no data). Source: server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java and org/opensearch/action/search/TransportSearchAction.java.

Dataset 6: Time-Series With Rollover

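# Create a write alias backed by a numbered index, index one doc, then roll over on a doc-count/age condition.
curl -s -XPUT 'localhost:9200/ts-000001' -H 'Content-Type: application/json' \
  -d '{ "aliases": { "ts-write": { "is_write_index": true } } }'
# Without at least one doc, neither condition is met and the response says "rolled_over": false.
curl -s -XPOST 'localhost:9200/ts-write/_doc?refresh=true' -H 'Content-Type: application/json' \
  -d '{ "@timestamp": "2026-06-16T10:00:00Z" }'
curl -s -XPOST 'localhost:9200/ts-write/_rollover?pretty' -H 'Content-Type: application/json' \
  -d '{ "conditions": { "max_docs": 1, "max_age": "7d" } }'
curl -s 'localhost:9200/_cat/indices/ts-*?v'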

What this tests: the rollover path (alias → new backing index), evaluation of rollover conditions against the index's doc count and age, and the metadata mutations that create a new index as a cluster-state update task. Source: server/src/main/java/org/opensearch/action/admin/indices/rollover/ and MetadataRolloverService.

The Bridge: User Scenario → Source Code (Master Table)

Use this whenever you observe a runtime behavior and want to find the code that owns it. Exact class names can vary by branch — when in doubt, grep for the class name under server/src/main/java.

Observed behavior	Owning subsystem / class
HTTP request hits `:9200` and is routed	`RestController` → a `RestHandler` (`RestSearchAction`, `RestBulkAction`, `RestIndexAction`)
A REST handler dispatches to server logic	`NodeClient.execute(ActionType, request, listener)` → a `TransportAction` (wired in `ActionModule`)
A search fans out to shards and merges	`TransportSearchAction` → per-shard `SearchService.executeQueryPhase`/`executeFetchPhase`; merge in `SearchPhaseController`
The query phase runs the Lucene query	`QueryPhase`, with `SearchContext`/`DefaultSearchContext` per shard
The fetch phase loads `_source` for hits	`FetchPhase`
Global term stats across shards (optional)	`DfsPhase`
An aggregation is computed and reduced	`AggregatorFactory` → `Aggregator` → `InternalAggregation.reduce(...)`
A document is indexed on the primary	`TransportBulkAction`/`TransportShardBulkAction` → `IndexShard.applyIndexOperationOnPrimary` → `InternalEngine.index`
A write is replicated to replicas	`TransportReplicationAction` (document replication) or `SegmentReplicationTargetService` (segment replication)
A doc becomes searchable / durable	refresh (new searcher) / `Translog` + flush (Lucene commit)
A request must run on the cluster manager	`TransportClusterManagerNodeAction` (formerly `TransportMasterNodeAction`)
Cluster metadata changes	`ClusterStateUpdateTask`/`ClusterStateTaskExecutor` on the cluster-manager service
A new cluster state is distributed	`MasterService` computes it; `PublicationTransportHandler` publishes; `ClusterApplierService` applies
Leader election / node failure detection	`Coordinator`, `PreVoteCollector`, `ElectionSchedulerFactory`, `FollowersChecker`, `LeaderChecker`
Shards placed / rebalanced	`AllocationService` + `BalancedShardsAllocator`, gated by `AllocationDeciders`
A replica rebuilt after a node dies	`PeerRecoverySourceService`/`PeerRecoveryTargetService`, `ReplicationTracker`
Snapshot/restore	`SnapshotsService`, `RepositoriesService`, `BlobStoreRepository`, `RestoreService`
Cross-node wire serialization	`Writeable` + `StreamInput`/`StreamOutput`, `NamedWriteableRegistry`, over `TransportService`/`Netty4Transport`
Memory protection trips	`HierarchyCircuitBreakerService` (`parent`, `fielddata`, `request`, `in_flight_requests`)

Each row has a deep dive: REST layer, Action framework, Transport layer, Search execution, Engine internals, Cluster state, Cluster state publishing, Shard allocation, Recovery, Circuit breakers and memory.

Running OpenSearch End-to-End: The Local Developer Loop

Every contributor should be able to do this loop in under ten minutes, from cold:

cd ~/src/OpenSearch

# 1. Launch a local cluster from source (REST on :9200). Leave running.
./gradlew run

# 2. The canonical health check (separate terminal).
curl -s 'localhost:9200/_cluster/health?pretty'

# 3. Smoke test: index a doc and read it back.
curl -s -XPOST 'localhost:9200/sanity/_doc/1?refresh=true' \
  -H 'Content-Type: application/json' -d '{"hello":"world"}'
curl -s 'localhost:9200/sanity/_doc/1?pretty'
curl -s 'localhost:9200/sanity/_search?pretty' -d '{"query":{"match_all":{}}}' \
  -H 'Content-Type: application/json'

# 4. The fast inner loop you will live in: run one test class / one method.
./gradlew :server:test --tests "org.opensearch.search.query.QueryPhaseTests"
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests.testToXContent"

# 5. Multi-node integration tests (in-JVM, backed by InternalTestCluster).
./gradlew :server:internalClusterTest --tests "org.opensearch.cluster.*"

Your baseline health check is: ./gradlew run serves :9200, a known-good :server:test class passes, and a trivial index+search round-trips. If any of those fail before you have changed a single line, stop and fix your environment — do not start Level 1 on a broken baseline, because every later ./gradlew test will produce false failures that hide the real work.

What to Verify Before Starting Level 1

Run through this checklist once. It takes 30–45 minutes and proves your environment and your understanding.

# Environment
java -version                       # JDK 21
cd ~/src/OpenSearch
./gradlew --version | head -3        # wrapper healthy

# Build + run
./gradlew localDistro                # archive under distribution/archives/
./gradlew run                        # serves :9200 (leave running, new terminal below)

# Functional round-trip
curl -s 'localhost:9200/_cluster/health?pretty'                 # green/yellow
curl -s -XPOST 'localhost:9200/check/_doc/1?refresh=true' \
  -H 'Content-Type: application/json' -d '{"k":"v"}'
curl -s 'localhost:9200/check/_search?pretty'                   # 1 hit

# Tests
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests" 2>&1 | tail -5

You are ready when, without notes, you can:

State the OpenSearch-vs-Lucene boundary for: scoring, segments, DocValues, translog, cluster state, REST/transport.
Name the four steps of the search path: REST handler → transport action → query phase → fetch phase, plus the coordinating-node reduce.
Name the indexing path: bulk/index action → IndexShard → InternalEngine → Lucene IndexWriter + translog.
Explain why killing a node turns the cluster yellow and how it returns to green (cluster manager → new ClusterState → AllocationService → recovery).
Run a single ./gradlew :server:test --tests ... and a single curl search from memory.
Open RestSearchAction, TransportSearchAction, QueryPhase, and IndexShard in your IDE via Go to Class.

If any box is unchecked, re-run the scenario it maps to before moving on.

Continue to the 16-Week Plan and Milestones, or jump to Level 1: Lucene and OpenSearch Foundation. The internals you previewed here are covered in full in the Deep Dives.

16-Week Plan: From Curious Reader to OpenSearch Maintainer Candidate

This is a 16-week, ~10-hour-per-week plan that maps the curriculum (Levels 1–9 plus a 2-week capstone) onto a calendar. Each week states:

Reading — concrete OpenSearch source files. Open them; do not just skim diagrams.
Hands-on — what you must build/run on your machine (./gradlew, curl, git).
GitHub issue practice — real search queries against opensearch-project/OpenSearch that surface beginner-appropriate issues. OpenSearch tracks everything on GitHub, not JIRA.
Labs — the curriculum labs you must complete.
Exit checkpoint — concrete deliverables. If you cannot produce them, repeat the week.

The plan assumes you have ~/src/OpenSearch checked out, ./gradlew run serving :9200, a passing ./gradlew :server:test --tests "...ClusterStateTests", and a working JDK 21 / Git environment from the prerequisites and the warm-up.

How to run a GitHub issue query: paste it into the search box at https://github.com/opensearch-project/OpenSearch/issues, or build a URL: https://github.com/opensearch-project/OpenSearch/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22. Labels with spaces must be quoted: label:"good first issue".
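
If you prefer to generate these URLs rather than hand-encode them, a few lines of Java will do it. This is a small convenience sketch; `IssueQueryUrlSketch` is a hypothetical helper, and it relies only on the JDK's `URLEncoder`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class IssueQueryUrlSketch {
    static final String BASE = "https://github.com/opensearch-project/OpenSearch/issues?q=";

    // Form-encode the query: ':' becomes %3A, '"' becomes %22, spaces become '+'.
    static String urlFor(String query) {
        return BASE + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(urlFor("is:issue is:open label:\"good first issue\""));
        // https://github.com/opensearch-project/OpenSearch/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
    }
}
```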

Weeks 1–2: Level 1 — Lucene & OpenSearch Foundation

Week 1 — Build, run, and the data model

Reading

README.md, CONTRIBUTING.md, DEVELOPER_GUIDE.md, TESTING.md in the repo root.
server/src/main/java/org/opensearch/node/Node.java — how a node wires up every service (read Node(...) constructor top-to-bottom; do not memorize, just map the territory).
server/src/main/java/org/opensearch/index/shard/IndexShard.java — read the class javadoc and the field declarations only this week.

Hands-on

Build: ./gradlew localDistro then ./gradlew run.
Reproduce all five warm-up scenarios against :9200 (bulk ingest, match, date_histogram, multi-node, snapshot).
Inspect a real shard on disk: find the nodes/0/indices data dir under the ./gradlew run working tree and list the Lucene segment files.

GitHub issue practice

is:issue is:open label:"good first issue"
is:issue is:open label:"good first issue" label:"untriaged"
is:issue is:open label:documentation

Labs

Exit checkpoint

You can build from a clean checkout and serve :9200.
You can state the OpenSearch-vs-Lucene boundary (segments, DocValues, translog, cluster state).
You have one good first issue open in a browser tab and have read it end-to-end (description + every comment).

Week 2 — First cluster, first index, Lucene underneath

Reading

server/src/main/java/org/opensearch/index/engine/InternalEngine.java — class javadoc + the index(...) method signature.
server/src/main/java/org/opensearch/index/mapper/ — skim MapperService, DocumentMapper, KeywordFieldMapper, TextFieldMapper.

Hands-on

Create an index with explicit mappings; index docs; run _search, _explain, and _analyze.
./gradlew :server:test --tests "org.opensearch.index.engine.InternalEngineTests" and read 5 of the test methods it ran.

GitHub issue practice

is:issue is:open label:"good first issue" label:"Indexing"
is:issue is:open "analyzer" label:enhancement

Labs

Exit checkpoint

You can write a mapping and explain each field type's Lucene representation.
You can run a single test method by name without consulting docs.
You can describe what _analyze returns and why.

Weeks 3–4: Level 2 — Contributor Onboarding (GitHub, PRs, DCO)

Week 3 — Repository structure and the contribution flow

Reading

MAINTAINERS.md, CHANGELOG.md, .github/pull_request_template.md.
Top-level build.gradle and settings.gradle — how modules are declared.
The SPDX header on any file in server/src/main/java/org/opensearch/.

Hands-on

Fork the repo on GitHub; add your fork as a remote; create a branch.
Configure DCO: git config user.name/user.email, then git commit -s and verify the Signed-off-by: trailer.
Run ./gradlew precommit and read what it checks (checkstyle, forbidden APIs, license headers).

GitHub issue practice

is:pr is:merged label:"good first issue"
is:issue is:open label:"help wanted" label:documentation
is:pr is:open review:required

Labs

Exit checkpoint

You can explain the full PR lifecycle: fork → branch → commit -s → CHANGELOG entry → PR → DCO/precommit checks → review → backport label.
You can run ./gradlew precommit and interpret a failure.
You can name what every PR must include (test, CHANGELOG entry, sign-off).

Week 4 — Tests as documentation; your first fix

Reading

test/framework/src/main/java/org/opensearch/test/OpenSearchTestCase.java (skim).
One real merged "good first issue" PR diff (open it on GitHub; read the conversation too).

Hands-on

Add a trivial no-op test to server and run it via ./gradlew :server:test --tests "...".
Reproduce a documentation or error-message issue locally.

GitHub issue practice

is:issue is:open label:"good first issue" -label:"untriaged"
is:issue is:open label:bug label:"good first issue" sort:created-desc

Labs

Exit checkpoint

You can produce a signed, CHANGELOG-bearing commit that passes ./gradlew precommit.
You can review a real PR and list three things its author should improve.
You have drafted (not necessarily submitted) a comment on a real good first issue.

Weeks 5–6: Level 3 — Architecture (REST → Transport → Action)

Week 5 — The request path

Reading

server/src/main/java/org/opensearch/rest/RestController.java.
server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java.
server/src/main/java/org/opensearch/action/ActionModule.java — how actions are registered.

Hands-on

Start ./gradlew run; send a _search and trace it: set a breakpoint in RestSearchAction and in TransportSearchAction, step from HTTP to action.
grep -n "registerHandler\|actions\.register" server/src/main/java/org/opensearch/action/ActionModule.java.

GitHub issue practice

is:issue is:open label:"good first issue" label:enhancement
is:issue is:open "RestController" in:title,body
is:pr is:merged "RestController" in:title,body

Labs

Exit checkpoint

You can name the chain RestController → RestHandler → NodeClient.execute → TransportAction and cite each file.
You can explain the difference between HandledTransportAction, TransportSingleShardAction, TransportBroadcastAction, TransportReplicationAction, and TransportClusterManagerNodeAction.

Week 6 — Transport, threadpools, and a custom REST action

Reading

server/src/main/java/org/opensearch/transport/TransportService.java.
libs/core/src/main/java/org/opensearch/core/common/io/stream/StreamInput.java / StreamOutput.java and Writeable.
server/src/main/java/org/opensearch/threadpool/ThreadPool.java (the Names constants).

Hands-on

grep the ThreadPool.Names constants and find where SEARCH and WRITE pools are used.
Build the custom REST action plugin in Lab 3.3 and load it into ./gradlew run.

GitHub issue practice

is:issue is:open "transport" label:bug
is:issue is:open label:"good first issue" "thread pool" in:title,body

Labs

Lab 3.3 — Build It: A Custom REST Action Plugin

Exit checkpoint

You can explain how a TransportRequest is serialized (Writeable/StreamOutput, NamedWriteableRegistry).
You have a working custom REST action plugin that responds on :9200.
You can name three thread pools and the work each handles.

Weeks 7–8: Level 4 — Cluster Coordination and State

Week 7 — The Coordinator and cluster-manager election

Reading

server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java.
server/src/main/java/org/opensearch/cluster/coordination/CoordinationState.java, JoinHelper.java, PreVoteCollector.java.
server/src/main/java/org/opensearch/cluster/ClusterState.java — Metadata, RoutingTable, DiscoveryNodes, ClusterBlocks.

Hands-on

Run ./gradlew run -PnumNodes=3. Inspect _cluster/state, _cat/nodes, _cat/shards.
Kill the cluster-manager node; watch a re-election; confirm via _cluster/health.

GitHub issue practice

is:issue is:open "cluster manager" in:title,body
is:issue is:open label:"good first issue" "cluster state" in:title,body
is:issue is:open "election" label:bug

Labs

Exit checkpoint

You can explain the two-phase publish/commit and which class publishes (PublicationTransportHandler) vs. applies (ClusterApplierService).
You can name the four components of ClusterState and what each holds.
You can describe what happens to writes during a cluster-manager re-election.
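
The commit decision in the two-phase publish can be sketched as a quorum check. The Java below is a simplified model, not the Coordinator's logic: `TwoPhasePublishSketch` is a hypothetical class, and it assumes a single voting configuration where a strict majority of acknowledgements is enough to commit.

```java
import java.util.Set;

final class TwoPhasePublishSketch {
    // Phase 1: the cluster manager sends the new state and collects acks.
    // Phase 2: it sends "commit" only if a strict majority of voting nodes acked.
    static boolean canCommit(Set<String> votingNodes, Set<String> ackedNodes) {
        long acks = votingNodes.stream().filter(ackedNodes::contains).count();
        return acks * 2 > votingNodes.size();
    }

    public static void main(String[] args) {
        Set<String> voting = Set.of("node-1", "node-2", "node-3");
        System.out.println(canCommit(voting, Set.of("node-1", "node-2"))); // true
        System.out.println(canCommit(voting, Set.of("node-1")));           // false
    }
}
```

Acks from nodes outside the voting set do not count toward the quorum, which is why a cluster manager isolated with only data nodes cannot commit new states.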

Week 8 — Cluster state updates, applier model, and allocation

Reading

server/src/main/java/org/opensearch/cluster/service/MasterService.java (cluster-manager service), ClusterApplierService.java, ClusterService.java.
server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java.
server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ — pick 4 deciders.

Hands-on

Force an UNASSIGNED shard (set number_of_replicas to at least the number of data nodes, so one copy can never be placed); inspect with _cluster/allocation/explain.
Implement the custom ClusterStateListener (Lab 4.3); log every state version change.

GitHub issue practice

is:issue is:open "allocation" label:bug
is:issue is:open "AllocationDecider" in:title,body
is:issue is:open label:"good first issue" "unassigned shard" in:title,body

Labs

Exit checkpoint

You can read _cluster/allocation/explain output and name the decider that blocked an allocation.
You can write a ClusterStateListener and explain when it fires.
You can trace a ClusterStateUpdateTask through MasterService to a published state.

Weeks 9–10: Level 5 — Testing and Debugging

Week 9 — The test framework and InternalTestCluster

Reading

test/framework/src/main/java/org/opensearch/test/OpenSearchIntegTestCase.java.
test/framework/src/main/java/org/opensearch/test/InternalTestCluster.java.
test/framework/src/main/java/org/opensearch/test/OpenSearchSingleNodeTestCase.java.

Hands-on

Run ./gradlew :server:internalClusterTest --tests "org.opensearch.cluster.*".
Reproduce a randomized failure deliberately: run a test, capture the -Dtests.seed=... line, re-run with that seed.
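
Re-running with a captured seed works because every "random" input is derived from that seed. The sketch below illustrates the principle with java.util.Random, which is an assumption made for self-containment: the real test framework has its own randomization machinery. ToySeededTest, its methods, and the seed value are all made up for illustration.

```java
import java.util.Random;

/** Toy illustration of seed-driven reproducibility; not the real test framework. */
final class ToySeededTest {

    /** Derives "random" test inputs from a seed, as a randomized test would. */
    static int[] randomInputs(long seed, int count) {
        Random random = new Random(seed);
        int[] inputs = new int[count];
        for (int i = 0; i < count; i++) {
            inputs[i] = random.nextInt(1000);
        }
        return inputs;
    }

    /** Hypothetical code under test: fails only when an input is divisible by 7. */
    static boolean passes(int[] inputs) {
        for (int value : inputs) {
            if (value % 7 == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A made-up hex seed, parsed the way a printed seed might be.
        long seed = Long.parseUnsignedLong("DEADBEEF", 16);
        boolean first = passes(randomInputs(seed, 20));
        boolean second = passes(randomInputs(seed, 20));
        // Same seed, same inputs, same outcome: the result is reproducible.
        System.out.println("reproducible: " + (first == second));
    }
}
```

A failure that appears on only some seeds is not necessarily flaky in the timing sense; it may be a deterministic bug that only certain inputs reach.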

GitHub issue practice

is:issue is:open label:flaky-test
is:issue is:open label:flaky-test label:"untriaged"
is:issue is:open label:flaky-test sort:created-desc

Labs

Exit checkpoint

You can write an OpenSearchIntegTestCase test that spins up a multi-node cluster in-JVM.
You can reproduce a randomized failure from its printed seed.
You can distinguish a unit test from an *IT integration test by what it extends.

Week 10 — Flaky tests and serialization round-trips

Reading

test/framework/src/main/java/org/opensearch/test/AbstractWireSerializingTestCase.java.
An existing *Tests class using @AwaitsFix (grep for it).

Hands-on

Pick a flaky-test-labeled issue; reproduce the failure by running its test on a loop with random seeds.
Write an AbstractWireSerializingTestCase round-trip for a Writeable of your choice.
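
Before writing the real test, it can help to see the shape of a round-trip in isolation. The sketch below is a stand-in that uses java.io data streams instead of StreamInput/StreamOutput. ToyShardStats and roundTrip are hypothetical. In the real base class you would supply a test instance and a reader, and the framework performs the copy and the equality check for you.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

/** Toy stand-in for a Writeable; uses java.io streams, not OpenSearch streams. */
final class ToyShardStats {
    final String index;
    final int shardId;
    final long docCount;

    ToyShardStats(String index, int shardId, long docCount) {
        this.index = index;
        this.shardId = shardId;
        this.docCount = docCount;
    }

    /** Read constructor: consumes fields in exactly the order writeTo emits them. */
    ToyShardStats(DataInputStream in) throws IOException {
        this.index = in.readUTF();
        this.shardId = in.readInt();
        this.docCount = in.readLong();
    }

    void writeTo(DataOutputStream out) throws IOException {
        out.writeUTF(index);
        out.writeInt(shardId);
        out.writeLong(docCount);
    }

    /** Serializes the instance and reads back a copy: the core of a round-trip test. */
    static ToyShardStats roundTrip(ToyShardStats original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return new ToyShardStats(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ToyShardStats)) return false;
        ToyShardStats that = (ToyShardStats) other;
        return shardId == that.shardId && docCount == that.docCount && index.equals(that.index);
    }

    @Override
    public int hashCode() {
        return Objects.hash(index, shardId, docCount);
    }
}
```

The test only has teeth if equals and hashCode cover every serialized field. A field that is written but not compared can be lost on the wire without any test noticing.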

GitHub issue practice

is:issue is:open label:flaky-test no:assignee
is:pr is:merged "AwaitsFix" in:title,body
is:issue is:open "reproduce" label:flaky-test

Labs

Exit checkpoint

You can mute a flaky test correctly (@AwaitsFix(bugUrl=...), never @Ignore).
You can write a wire-serialization round-trip test for BWC confidence.
You can take a flaky-test issue from repro to a candidate fix.

Weeks 11–12: Level 6 — Indexing Path and Storage Engine

Week 11 — The write path to Lucene

Reading

server/src/main/java/org/opensearch/action/bulk/TransportShardBulkAction.java.
server/src/main/java/org/opensearch/index/shard/IndexShard.java — applyIndexOperationOnPrimary(...).
server/src/main/java/org/opensearch/index/engine/InternalEngine.java — index(...).

Hands-on

Index a doc with a breakpoint in IndexShard.applyIndexOperationOnPrimary and step into InternalEngine.index → Lucene IndexWriter.
Force a _refresh and a _flush; observe segment file changes on disk.

GitHub issue practice

is:issue is:open "InternalEngine" in:title,body
is:issue is:open label:bug "translog" in:title,body
is:issue is:open label:"good first issue" "indexing" in:title,body

Labs

Exit checkpoint

You can walk indexing end-to-end: action → IndexShard → InternalEngine → IndexWriter + Translog.
You can explain refresh (visibility) vs. flush (durability) vs. merge (cleanup).
You can describe how a write replicates (document vs. segment replication).
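
The refresh-versus-flush distinction in the second checkpoint can be sketched with a toy engine. This is a deliberately simplified model, not InternalEngine: ToyEngine and its methods are hypothetical, merges are left out entirely, and having flush also refresh is a simplification of this toy.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of refresh vs. flush; not InternalEngine. */
final class ToyEngine {
    private final List<String> indexingBuffer = new ArrayList<>(); // indexed, not yet searchable
    private final List<String> searchable = new ArrayList<>();     // visible to searches
    private final List<String> committed = new ArrayList<>();      // survives a crash on its own
    private final List<String> translog = new ArrayList<>();       // operations since the last commit

    void index(String doc) {
        indexingBuffer.add(doc);
        translog.add(doc); // until the next flush, durability comes from the log
    }

    /** Refresh: make buffered documents visible. Nothing is committed. */
    void refresh() {
        searchable.addAll(indexingBuffer);
        indexingBuffer.clear();
    }

    /** Flush: commit everything indexed so far and start a fresh translog. */
    void flush() {
        refresh();
        committed.clear();
        committed.addAll(searchable);
        translog.clear();
    }

    boolean isSearchable(String doc) {
        return searchable.contains(doc);
    }

    /** What a restart can recover: the last commit plus a replay of the translog. */
    List<String> recoverAfterCrash() {
        List<String> recovered = new ArrayList<>(committed);
        recovered.addAll(translog);
        return recovered;
    }
}
```

In this model a document indexed but not refreshed is recoverable yet invisible, and a document refreshed but not flushed is visible yet depends on the translog to survive a crash.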

Week 12 — Mapping, analysis, and a custom analyzer

Reading

server/src/main/java/org/opensearch/index/mapper/MapperService.java, DocumentMapper.java, TextFieldMapper.java.
modules/analysis-common/src/main/java/org/opensearch/analysis/common/ — pick two filters.

Hands-on

Build the custom analyzer plugin (Lab 6.3); register it; verify with _analyze.
grep for an existing TokenFilterFactory and mirror its structure.

GitHub issue practice

is:issue is:open "analyzer" label:enhancement
is:issue is:open label:"good first issue" "tokenizer" in:title,body
is:pr is:merged path:modules/analysis-common

Labs

Lab 6.3 — Build It: A Custom Analyzer

Exit checkpoint

You can implement an AnalysisPlugin exposing a custom filter and prove it via _analyze.
You can explain how a mapping field type chooses its analyzer and Lucene field.

Weeks 13–14: Level 7 — Search Path and Aggregations

Week 13 — Query and fetch phases

Reading

server/src/main/java/org/opensearch/action/search/TransportSearchAction.java.
server/src/main/java/org/opensearch/search/SearchService.java, search/query/QueryPhase.java, search/fetch/FetchPhase.java.
server/src/main/java/org/opensearch/action/search/SearchPhaseController.java (the reduce).

Hands-on

Trace a _search from TransportSearchAction fan-out to per-shard QueryPhase to the coordinating-node reduce with breakpoints.
Run the same query with ?explain=true and read the BM25 explanation.

GitHub issue practice

is:issue is:open "QueryPhase" in:title,body
is:issue is:open label:bug "search" label:"good first issue"
is:pr is:merged path:server/src/main/java/org/opensearch/search

Labs

Lab 7.1 — Trace a Search Through Query and Fetch Phases

Exit checkpoint

You can name and order the phases: (DFS) → query → fetch, and where the reduce happens.
You can explain what SearchContext/DefaultSearchContext hold per shard.
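
The phase ordering in the first checkpoint can be sketched as a scatter-gather over per-shard top-k lists. The toy below is not OpenSearch code: ToySearch, Hit, queryPhase, and reduce are hypothetical names, and the fetch phase is represented only by the fact that the reduce returns the few winning ids whose documents would then be fetched.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy scatter-gather: per-shard top-k, then a coordinating-node reduce. */
final class ToySearch {

    /** A lightweight query-phase result: an id and a score, no document body. */
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Sorts by descending score and keeps at most k hits. */
    private static List<Hit> topK(List<Hit> hits, int k) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit hit) -> hit.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    /** Query phase on one shard: return only the local top k. */
    static List<Hit> queryPhase(List<Hit> shardMatches, int k) {
        return topK(shardMatches, k);
    }

    /** Reduce on the coordinating node: merge shard-local lists into a global top k. */
    static List<Hit> reduce(List<List<Hit>> perShardTopK, int k) {
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> shardTopK : perShardTopK) {
            merged.addAll(shardTopK);
        }
        return topK(merged, k);
    }
}
```

The point of the split is cost: every shard sends k small entries, and only the k global winners ever have their documents loaded.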

Week 14 — Aggregations and a custom aggregation

Reading

server/src/main/java/org/opensearch/search/aggregations/AggregatorFactory.java, AggregatorBase.java.
search/aggregations/bucket/histogram/DateHistogramAggregator.java, bucket/terms/TermsAggregator.java.
Any InternalAggregation.reduce(...) implementation.

Hands-on

Run date_histogram + terms sub-agg (warm-up Scenario 3); break in the aggregator and the reduce.
Build the custom aggregation (Lab 7.3); wire it via SearchPlugin.

GitHub issue practice

is:issue is:open "aggregation" label:bug
is:issue is:open label:"good first issue" "aggregation" in:title,body
is:issue is:open "doc_count_error_upper_bound" in:title,body

Labs

Exit checkpoint

You can explain the AggregatorFactory → Aggregator → InternalAggregation.reduce lifecycle.
You can explain why aggregations read DocValues, not the inverted index.
You have a working custom aggregation registered via SearchPlugin.

Week 15: Level 8 — Real Issue Contribution (and into the capstone)

Weeks 15–16 overlap with the capstone. Use Week 15 for Level 8's labs and to select and reproduce your capstone issue; use Week 16 to implement, test, and submit.

Reading

A real open bug issue's linked code paths (follow the stack trace into server/).
Level 8 overview and the issue-roadmap stages you are targeting.

Hands-on

Reproduce a real GitHub issue locally with a failing test or a curl repro.
Localize the root cause to a file:method; draft the fix and a regression test.

GitHub issue practice

is:issue is:open label:bug label:"good first issue" no:assignee
is:issue is:open label:"help wanted" -label:"untriaged" sort:reactions-desc
is:issue is:open label:bug "stack trace" in:body sort:created-desc

Labs

Exit checkpoint

You have a reproducible failing case for a real issue (test or curl).
You have a root cause localized to a file:method with a one-paragraph explanation.
You have selected your capstone issue and confirmed it is unassigned and in scope.

Week 16: Level 9 + Capstone — Maintainer-Level Concerns and Shipping

Reading

qa/ BWC tests and server/.../Version.java — how versions gate wire/index compatibility.
Level 9 overview, the compatibility mindset, and the release process.

Hands-on

Run ./gradlew :server:test + ./gradlew precommit on your capstone branch; fix every failure.
Write/extend a BWC or wire-serialization test where your change touches serialization.
Open the PR: signed commits, CHANGELOG entry, filled-out PR template, linked issue.

GitHub issue practice

is:issue is:open label:"backport 2.x"
is:pr is:open label:"backport 2.x"
is:issue is:open label:v3.0.0

Labs

Capstone — follow capstone/index.md start to finish:

Issue selection (done in Week 15) → reproduction → execution-path analysis.
Root cause → implementation → testing → validation.
PR preparation (DCO sign-off, CHANGELOG, PR template) → GitHub documentation → write-up.

Exit checkpoint

A real PR opened against opensearch-project/OpenSearch (or a complete, review-ready patch on a branch), with passing ./gradlew precommit and tests on the affected module.
A 1500–3000 word public write-up of the experience (see capstone step 10).
At least one round of self-review against the PR quality checklist.

How to Use This Plan When You Fall Behind

If you finish a week's reading but cannot pass the exit checkpoint, repeat the week. Do not advance. The milestones are the real gate; the calendar is a suggestion.
If a GitHub issue query returns nothing useful, change the query. Labels and triage state shift constantly: drop a label, change the sort order, or try is:issue is:open no:assignee plus a keyword. The community moves; your queries should too.
Treat a green ./gradlew precommit and a green test run on every module you touched (for example ./gradlew :server:test) as non-negotiable before you call a week done.
Skip a Level only if you can pass all exit checkpoints from the previous Levels in one sitting, and only if a milestone explicitly permits it.
The two highest-leverage habits, sustained every week: (1) keep one real GitHub issue open and read it fully, comments included; (2) keep ./gradlew run alive and run at least one breakpoint-driven trace of the week's subsystem.

Milestones: M1 Through M9

Milestones are the "what does mastery look like at this stage" checkpoints. They map to the nine levels and the 16-week plan, but they are the real gate — the calendar is only a suggestion. Each milestone has:

Expected completion — a calendar guideline tied to the 16-week plan.
Skills you must demonstrate — 5–8 concrete abilities, tied to OpenSearch internals.
Self-check questions — answer them out loud, without notes.
20-point rubric — five criteria, four points each.
Pass threshold — minimum total to advance.
Move to the next level when — the binary gate.

Pass thresholds are deliberately high. The point is competence, not throughput. A maintainer-track contributor is measured in months of sustained, high-quality work — these milestones are how you know you are on track.

M1 — Orientation and Data Model (end of Week 2)

You can run OpenSearch as a user, and you can state precisely what is OpenSearch and what is Lucene.

Skills

Build OpenSearch from a clean checkout and serve :9200 via ./gradlew run.
Reproduce all five warm-up scenarios with curl against localhost:9200.
Draw the data model: cluster → index → shard (primary/replica) → Lucene index → segment.
Classify any concept as OpenSearch-owned vs. Lucene-owned (scoring, DocValues, translog, cluster state, REST/transport).
Write a mapping and explain each field type's Lucene representation.
Run a single test by name: ./gradlew :server:test --tests "<Class>.<method>".
Locate any org.opensearch.* class in server/ within 60 seconds via Go to Class or grep.

Self-check questions

Which of these is Lucene and which is OpenSearch: BM25 scoring, the translog, the inverted index, the RoutingTable, segment merges?
What is the difference between a text and a keyword field, and why does it matter for aggregations?
What does _refresh make happen, and how is it different from _flush?

Rubric

Criterion	1	2	3	4
Build/run fluency	`./gradlew run` works	Runs scenarios	Runs a named test	Diagnoses a run failure
Data model	Names the terms	Sketches index→shard	Sketches down to segments	Predicts shard layout
OS-vs-Lucene boundary	Confused	Knows a few	Classifies most	Classifies any concept
REST/curl fluency	Copies examples	Edits queries	Writes from memory	Predicts the response shape
Communication	Cannot explain	Explains with notes	Explains without notes	Teaches another

Pass threshold: 14/20, with no criterion below 2.

Move to Level 2 when: from a verbal prompt, you can index sample data and run a filtered search + a date_histogram aggregation from memory, and state the OS-vs-Lucene boundary cold.

M2 — Build and Test Literacy (end of Week 4)

You can navigate the codebase, build it, run any test, and produce a contribution-ready commit.

Skills

Run a single test in any module: ./gradlew :server:test --tests "Class.method".
Add a new test file and have Gradle pick it up.
Run and interpret ./gradlew precommit (checkstyle, forbidden APIs, license headers, loggerUsageCheck).
Identify the module of a class from its FQN (org.opensearch.cluster... → server; org.opensearch.core... → libs/core).
Produce a commit with a DCO Signed-off-by: line (git commit -s) and a CHANGELOG.md entry.
Distinguish a unit test (*Tests) from an integration test (*IT, internalClusterTest).
Describe the full PR lifecycle: fork → branch → sign-off → CHANGELOG → PR → DCO/precommit → review → backport label.

Self-check questions

Why is there no JIRA and no CLA — what replaces them in OpenSearch?
What does every PR have to include besides code, and which check enforces the sign-off?
What is the difference between ./gradlew :server:test and ./gradlew :server:internalClusterTest?

Rubric

Criterion	1	2	3	4
Build mastery	`assemble` works	Knows `localDistro`/`run`	Knows module deps	Diagnoses build failures
Test execution	Runs all	Runs a class	Runs a method	Runs `internalClusterTest`
Precommit	Unaware	Runs it	Reads failures	Fixes checkstyle/headers
Contribution hygiene	None	Signs off	Adds CHANGELOG	Fills PR template correctly
Module map	Knows names	Knows top-level deps	Maps FQN→module	Diagnoses where a class lives

Pass threshold: 14/20.

Move to Level 3 when: on a fresh checkout you can build, run a :server:test method by name, and produce a signed, CHANGELOG-bearing commit that passes ./gradlew precommit — within 20 minutes.

M3 — The Request Path (end of Week 6)

You can trace a request from the REST handler through the transport action framework.

Skills

Trace _search end to end: RestController → RestSearchAction → NodeClient.execute → TransportSearchAction, with a breakpoint at each hop.
Explain ActionModule registration: how a RestHandler and a TransportAction get wired.
Distinguish the transport action base classes (HandledTransportAction, TransportSingleShardAction, TransportBroadcastAction, TransportReplicationAction, TransportClusterManagerNodeAction).
Describe how a TransportRequest is serialized (Writeable + StreamInput/StreamOutput, NamedWriteableRegistry) and moved by TransportService/Netty4Transport.
Name three thread pools (SEARCH, WRITE, GET, …) and the work each owns.
Build a custom REST action plugin and serve it on :9200.

Self-check questions

When a request must run on the elected cluster manager, which base action routes it there?
How does a polymorphic transport payload deserialize on the receiving node?
Which thread pool does a _search run on, and why does that matter under load?

Rubric

Criterion	1	2	3	4
REST→action path	Vague	Names the chain	Cites files	Walks it with breakpoints
Action framework	Confused	Knows base classes	Picks the right one	Knows routing/replication semantics
Serialization	Unaware	Knows `Writeable`	Knows `StreamInput/Output`	Knows `NamedWriteableRegistry`
Threadpools	Unaware	Names a few	Maps work→pool	Reasons about saturation
Plugin	Cannot	Stub compiles	Responds on :9200	Idiomatic + tested

Pass threshold: 14/20.

Move to Level 4 when: you can answer "where does my _search request first leave the REST layer and enter server code?" with a file:method citation, and your custom REST action responds on :9200.

M4 — Coordination and Cluster State (end of Week 8)

You understand cluster-manager election, the cluster state, its publishing, and shard allocation.

Skills

Name the four components of ClusterState (Metadata, RoutingTable, DiscoveryNodes, ClusterBlocks) and what each holds.
Explain the two-phase publish/commit and which class publishes (PublicationTransportHandler) vs. applies (ClusterApplierService).
Trace a ClusterStateUpdateTask through MasterService to a published, applied state.
Read _cluster/allocation/explain and name the AllocationDecider that blocked an allocation.
Describe election and failure detection (Coordinator, PreVoteCollector, FollowersChecker, LeaderChecker) and what happens to writes during a re-election.
Implement a ClusterStateListener and explain exactly when it fires.

Self-check questions

Why is ClusterState immutable and versioned? What breaks if two nodes apply different versions?
Which service computes new states and which applies them — and why are they separate?
What turns a cluster yellow, and what sequence of events brings it back to green?

Rubric

Criterion	1	2	3	4
Cluster state	Names it	Knows the 4 parts	Reads `_cluster/state`	Predicts a state diff
Publish/apply	Confused	Knows the split	Knows two-phase	Walks it in source
Coordination	Aware	Names `Coordinator`	Knows election flow	Reasons about split-brain safety
Allocation	Black box	Names deciders	Reads `allocation/explain`	Diagnoses + fixes a decider
Listener/update	Cannot	Stub fires	Correct timing	Batched update task understood

Pass threshold: 16/20 — this is the first hard gate.

Move to Level 5 when: given an UNASSIGNED shard, you can use _cluster/allocation/explain to name the responsible decider, and you have a working ClusterStateListener with a test.

M5 — Testing and the InternalTestCluster (end of Week 10)

You can write multi-node in-JVM tests, reproduce randomized failures, and handle flaky tests correctly.

Skills

Write an OpenSearchIntegTestCase that spins up a multi-node InternalTestCluster in-JVM.
Write an OpenSearchSingleNodeTestCase and a plain OpenSearchTestCase unit test.
Reproduce a randomized failure from its printed -Dtests.seed=... line.
Write an AbstractWireSerializingTestCase/AbstractSerializingTestCase round-trip for a Writeable/XContent type.
Mute a flaky test correctly with @AwaitsFix(bugUrl="https://github.com/opensearch-project/OpenSearch/issues/NNNN") — never @Ignore.
Take a flaky-test-labeled issue from reproduction toward a candidate fix.

Self-check questions

Why are OpenSearch tests randomized, and how do you make a failure deterministic again?
Why does serialization round-trip testing matter for backward compatibility?
What is the wrong way to silence a flaky test, and why is it wrong?

Rubric

Criterion	1	2	3	4
Integ tests	Runs them	Writes single-node	Writes multi-node	Controls cluster scope
Randomization	Confused	Knows seeds	Reproduces a failure	Minimizes a repro
Serialization tests	Unaware	Knows the base class	Writes a round-trip	Catches a BWC break
Flaky discipline	`@Ignore`	Knows `@AwaitsFix`	Links the issue	Roots out the cause
Debugging	Reads stack	Maps to source	Reproduces locally	Writes a regression test

Pass threshold: 15/20.

Move to Level 6 when: you can reproduce a randomized failure from a seed and write a multi-node integration test that asserts cluster behavior — both from memory.

M6 — Indexing Path and the Engine (end of Week 12)

You can read the write path from a transport action down to Lucene and the translog.

Skills

Walk indexing end to end: TransportShardBulkAction → IndexShard.applyIndexOperationOnPrimary → InternalEngine.index → Lucene IndexWriter + Translog.add.
Explain refresh (visibility / new searcher) vs. flush (durability / Lucene commit) vs. merge (segment cleanup, MergePolicy/MergeScheduler).
Explain how a write replicates: document replication (TransportReplicationAction) vs. segment replication (SegmentReplicationTargetService/SourceService).
Describe sequence-number tracking: LocalCheckpointTracker, global checkpoint via ReplicationTracker.
Implement an AnalysisPlugin exposing a custom token filter and prove it with _analyze.
Explain how a mapping field type selects its analyzer and Lucene field.

Self-check questions

What guarantees does the translog provide between two Lucene commits?
Why is a freshly indexed document not searchable until a refresh?
What is the difference, on the wire, between document replication and segment replication?

Rubric

Criterion	1	2	3	4
Write path	Names classes	Walks happy path	Walks to Lucene	Walks edge cases
Refresh/flush/merge	Confused	Knows definitions	Knows triggers	Tunes/diagnoses
Replication	Aware	Knows doc-rep	Knows seg-rep	Reasons about seqno/checkpoints
Analysis/mapping	Aware	Reads a mapper	Builds a filter	Builds + tests an AnalysisPlugin
Engine debugging	Reads stack	Maps to source	Breakpoints in engine	Writes a repro test

Pass threshold: 15/20.

Move to Level 7 when: you can set a breakpoint in IndexShard.applyIndexOperationOnPrimary, step into InternalEngine.index, and explain refresh vs. flush vs. merge without notes.

M7 — Search and Aggregations (end of Week 14)

You can read the search fan-out, the per-shard phases, and the coordinating-node reduce.

Skills

Walk search end to end: TransportSearchAction fan-out → per-shard SearchService → QueryPhase → FetchPhase → reduce in SearchPhaseController.
Explain the optional DfsPhase (global term statistics) and when it matters.
Explain the aggregation lifecycle: AggregatorFactory → Aggregator → InternalAggregation.reduce(...).
Explain why aggregations read DocValues, not the inverted index, and what doc_count_error_upper_bound means.
Read a BM25 _explain tree and account for each term.
Build a custom aggregation and register it via SearchPlugin.
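
To see why a terms aggregation can be approximate across shards, consider a toy reduce in which each shard returns only its most frequent terms. This is a simplified model, not the real aggregator: ToyTermsAgg, shardTopTerms, and reduce are hypothetical names, and the shard-local cutoff is passed in directly.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy model of a sharded terms aggregation; not the real TermsAggregator. */
final class ToyTermsAgg {

    /** Shard-local step: keep only the shardSize most frequent terms. */
    static Map<String, Long> shardTopTerms(Map<String, Long> shardCounts, int shardSize) {
        return shardCounts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(shardSize)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    /** Coordinating-node reduce: sum whatever counts the shards returned. */
    static Map<String, Long> reduce(List<Map<String, Long>> shardResults) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> result : shardResults) {
            result.forEach((term, count) -> merged.merge(term, count, Long::sum));
        }
        return merged;
    }
}
```

Suppose shard A holds x=10, y=9, z=8 and shard B holds z=10, w=9, y=1, with a cutoff of two terms per shard. Shard A drops z and shard B drops y, so the reduce reports z=10 and y=9 although the true totals are z=18 and y=10. This undercounting is the kind of error that doc_count_error_upper_bound puts a bound on.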

Self-check questions

Where does shard-local partial work become a globally correct answer in both search and aggs?
What does SearchContext/DefaultSearchContext hold per shard, and when is it released?
Why can a terms aggregation be approximate across shards?

Rubric

Criterion	1	2	3	4
Search path	Names phases	Orders them	Cites files	Walks with breakpoints
Reduce	Vague	Knows it exists	Knows it merges shards	Reasons about correctness
Aggregations	Aware	Knows factory→agg	Knows reduce	Builds a custom agg
DocValues lens	Confused	Knows aggs use them	Knows why	Reasons about cardinality/memory
QueryBuilders	Aware	Reads one	Knows `toQuery` path	Adds/extends a query

Pass threshold: 15/20.

Move to Level 8 when: you can trace a _search from fan-out to reduce with breakpoints and you have a custom aggregation registered via SearchPlugin that returns correct results.

M8 — Production Diagnostics and Real Contribution (end of Week 15)

You can reproduce a real GitHub issue, localize the root cause, and prepare a fix.

Skills

Reproduce a reported issue locally with a failing test or a curl repro.
Read a stack trace and walk it into server/ to a file:method.
Use cluster diagnostics: _cluster/health, _cat/shards, _cluster/allocation/explain, _nodes/stats, and the circuit-breaker stats.
Distinguish a core bug from a plugin/Dashboards/Lucene bug (correct attribution).
Write a minimal regression test that fails before the fix and passes after.
Open a PR with a DCO sign-off, a CHANGELOG entry, and a filled-out PR template, linked to the issue.

Self-check questions

Given a wrong dashboard chart, how do you decide whether the bug is in Dashboards, core search, a plugin, or Lucene?
What is the minimum a good issue reproduction contains?
Why does a regression test belong in the same PR as the fix?

Rubric

Criterion	1	2	3	4
Reproduction	None	Manual curl	Scripted	Added as a failing test
Root cause	Speculative	Localized	Cited file:method	Explained in the issue
Diagnostics	Guesses	Reads `_cat`/health	Uses `allocation/explain`	Cross-checks stats+logs
Attribution	Confused	Knows boundaries	Picks the right repo	Files/triages correctly
PR readiness	None	Draft	Signed + CHANGELOG	Template + linked issue

Pass threshold: 16/20.

Move to the capstone when: you have a reproducible failing case for a real issue, a root cause localized to a file:method, and a regression test that gates the fix.

M9 — Capstone: You Have Shipped a Patch (end of Week 16)

You have taken a real OpenSearch issue through the full contribution cycle.

Skills

Selected an appropriate, unassigned, in-scope issue.
Reproduced and root-caused it.
Implemented a minimal, idiomatic fix with a regression test.
Submitted a PR in OpenSearch's accepted format: signed commits, CHANGELOG entry, PR template, linked issue; ./gradlew precommit and module tests green.
Responded to at least one round of review feedback (real or simulated against the PR quality checklist).

Self-check questions

Is your change minimal and focused, or does it bundle unrelated cleanup?
Did you consider wire/index backward compatibility and add a serialization/BWC test if needed?
Can you summarize the root cause and the fix in three sentences for a reviewer?

Rubric (20 points)

Criterion	1	2	3	4
Issue selection	Random	Scoped	Justified	Aligned to roadmap
Reproduction	None	Manual	Scripted	Added as a test
Root cause	Speculative	Localized	Cited	Explained on the issue/PR
Implementation	Compiles	Tests pass	Idiomatic	Minimal, focused, BWC-aware
Submission	None	Draft	Submitted (signed + CHANGELOG)	Reviewed, feedback addressed

Pass threshold: 16/20, and the change must pass ./gradlew precommit plus the tests on the affected module.

Global Rubric (maintainer-readiness)

Use this every quarter, regardless of level, to self-assess against where OpenSearch maintainers operate.

Dimension	1 (Beginner)	2 (Apprentice)	3 (Practitioner)	4 (Maintainer-ready)
Code	Reads `org.opensearch.*`	Modifies safely	Designs a subsystem change	Reviews others' PRs
Testing	Runs tests	Adds unit/integ tests	Writes regression + BWC suites	Drives test-infra / flaky triage
Distributed reasoning	Single node	Shards/replicas	Coordination + allocation	Reasons about consensus/recovery edge cases
Contribution & community	Opens issues	Opens PRs with sign-off + CHANGELOG	Reviews, attributes cross-repo bugs	Shapes RFCs, mentors, weighs release impact

A maintainer-track contributor should be at level 3 on all four dimensions and level 4 on at least one. Aim for 3/3/3/3 → 4/3/3/4 by month 12 of focused, sustained contribution. Maintainership in OpenSearch is earned through demonstrated judgment over time — not PR volume — and the per-repo MAINTAINERS.md reflects that.

Use these milestones together with the 16-week plan. When the calendar and the milestone disagree, the milestone wins: repeat the week until you pass the gate.

Level 1: Lucene and OpenSearch Foundation

This level establishes the technical baseline every subsequent level depends on. You will build OpenSearch from source with Gradle, run its randomized test suite, launch a real single-node cluster from your own checkout, index documents and query them over REST, and — critically — write a tiny standalone Lucene program so you understand what a "shard" actually is before you ever read IndexShard.

OpenSearch is a distributed search and analytics engine built on Apache Lucene. It was forked from Elasticsearch 7.10.2 in 2021 (after Elastic relicensed under SSPL) and is licensed Apache 2.0. Governance now sits with the OpenSearch Software Foundation under the Linux Foundation, overseen by a Technical Steering Committee (TSC). The source of truth is GitHub — github.com/opensearch-project/OpenSearch — not JIRA, and there is no CLA: contributions are made by GitHub Pull Request with a DCO sign-off.

This curriculum will not hold your hand. It will point you at the right parts of the codebase, give you the right questions to ask, and make you run everything you read.

Learning Objectives

By the end of Level 1 you must be able to:

Explain where OpenSearch sits in the search stack — Dashboards above it, Lucene beneath it — and which problems each layer owns.
Build OpenSearch from source with the Gradle wrapper and its bundled JDK, with and without running tests.
Run unit and integration tests scoped to a single project, reproduce a randomized failure from its -Dtests.seed, and read the HTML test reports.
Launch a single-node cluster with ./gradlew run, index documents over REST, and run a match query and a terms aggregation against localhost:9200.
Map a REST call (PUT /index, POST /_bulk, GET /_search) to the RestHandler and TransportAction that service it.
Write a standalone Lucene program that builds an index, opens a reader, and runs a query — and explain how IndexShard/InternalEngine wrap exactly those primitives.
Locate any class named in Levels 2–9 (Node, ClusterService, IndexShard, SearchService, …) without a search engine.

The Search-Stack Context

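Before touching a line of OpenSearch code, build an accurate mental model of the stack. OpenSearch is the middle layer of a three-layer software stack (Dashboards, OpenSearch, Lucene) that ultimately reads and writes segment files on the filesystem: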

┌───────────────────────────────────────────────────────────┐
│              OpenSearch Dashboards (TypeScript)             │  ← Visualization / UI
│         visualizations, dev tools, index management        │     (separate repo)
└───────────────────────────────────────────────────────────┘
                     │ HTTP: _search / _msearch / _bulk (JSON)
                     ▼
┌───────────────────────────────────────────────────────────┐
│                  OpenSearch (org.opensearch.*)             │  ← Distributed engine
│   REST layer  →  Transport/Action layer  →  cluster mgmt  │     (this repo)
│   RestController → TransportAction → IndicesService        │
│                    → IndexShard → Engine                   │
└───────────────────────────────────────────────────────────┘
                     │ Engine wraps a Lucene IndexWriter / DirectoryReader
                     ▼
┌───────────────────────────────────────────────────────────┐
│                     Apache Lucene (library)                │  ← Inverted index / search
│       IndexWriter, DirectoryReader, Query, Document        │
└───────────────────────────────────────────────────────────┘
                     │ Directory abstraction
                     ▼
┌───────────────────────────────────────────────────────────┐
│              Filesystem: segments (.cfs/.si/.dvd/…)         │  ← Immutable on-disk files
└───────────────────────────────────────────────────────────┘

The boundaries matter for everything that follows, especially debugging:

Dashboards never touches your data directly. It issues _search/_bulk HTTP calls to OpenSearch through opensearch-js. A "slow visualization" is almost always a slow _search underneath. Dashboards is a client.
OpenSearch owns the distributed problem: clusters, nodes, shards, replication, coordination, the REST API, the query DSL, aggregations. It does not implement the inverted index — it delegates to Lucene.
Lucene owns the single-machine search problem: building inverted indexes, merging immutable segments, executing Query objects, scoring with BM25. A single OpenSearch shard is one Lucene index. You will prove this to yourself in Lab 1.4.

Note: When something is wrong, the first question a maintainer asks is which layer. Is it a Dashboards rendering bug, a coordinating-node reduce bug, a per-shard query bug, or a Lucene bug? Levels 7–9 and the plugin labs drill this attribution skill.

OpenSearch vs. Elasticsearch vs. Solr

You will be reading code whose history predates OpenSearch. Know the lineage:

	OpenSearch	Elasticsearch	Apache Solr
License	Apache 2.0	SSPL / Elastic License (post-7.10); AGPLv3 option added in 2024	Apache 2.0
Origin	Fork of Elasticsearch 7.10.2 (2021)	Original (2010)	Original (2004)
Governance	OpenSearch Software Foundation / Linux Foundation, TSC	Elastic N.V. (single vendor)	Apache Software Foundation
Package root	`org.opensearch.*`	`org.elasticsearch.*`	`org.apache.solr.*`
Build	Gradle (`./gradlew`)	Gradle	Gradle (was Ant/Ivy)
Contribution	GitHub PR + DCO (`git commit -s`), no CLA	GitHub PR + CLA	GitHub PR + ASF CLA / JIRA
Search core	Apache Lucene	Apache Lucene	Apache Lucene
Cluster coordination	Built-in (Zen2-style, `Coordinator`)	Built-in (Zen2)	Apache ZooKeeper (SolrCloud)

The three share a Lucene heart. OpenSearch and Elasticsearch share a codebase up to 7.10.2, which is why many class names (IndexShard, InternalEngine, SearchService) are identical and why Elasticsearch documentation from the 7.x era is often still accurate for OpenSearch internals. The most visible OpenSearch-specific change is the rename of "master" → "cluster manager" for inclusive language: the master node role is a deprecated alias of cluster_manager, and the setting cluster.initial_master_nodes is a deprecated alias of cluster.initial_cluster_manager_nodes.

Required Reading

Read these in order, in your own checkout, before starting the labs. In a mature codebase the best documentation is often in-repo Markdown and class-level Javadoc — read it seriously.

#	Resource	What to extract
1	`README.md` (repo root)	The project's scope, where to file issues, the high-level layout.
2	`DEVELOPER_GUIDE.md`	Gradle tasks, JDK requirements, IDE setup, the `run` task, debugging. The single most important file in this level.
3	`TESTING.md`	Test types, randomized testing, `-Dtests.seed`, `--tests` filtering, integration vs unit.
4	`CONTRIBUTING.md`	The GitHub/DCO/CHANGELOG workflow (you will live this in Level 2).
5	`server/src/main/java/org/opensearch/node/Node.java`	Class-level Javadoc + the constructor. The node is the object graph root — every service is wired here.
6	`server/src/main/java/org/opensearch/index/shard/IndexShard.java`	Class-level Javadoc only. The bridge from OpenSearch to Lucene.

Run this to confirm the files exist on your branch (names occasionally move between major lines):

ls README.md DEVELOPER_GUIDE.md TESTING.md CONTRIBUTING.md CHANGELOG.md MAINTAINERS.md
find server/src/main/java/org/opensearch/node/Node.java \
     server/src/main/java/org/opensearch/index/shard/IndexShard.java

Source Code Areas to Inspect

You are not modifying anything yet — you are building a map. Skim these before and after the labs.

`server/` — the core engine

This is the bulk of what you will read across the whole curriculum.

Path	Why
`org/opensearch/node/Node.java`	Root of the object graph. All services are constructed and wired here.
`org/opensearch/cluster/service/ClusterService.java`	Access point for cluster state and update tasks.
`org/opensearch/cluster/ClusterState.java`	The immutable cluster-metadata snapshot (covered in Level 4).
`org/opensearch/rest/RestController.java`	Dispatches HTTP requests to `RestHandler`s.
`org/opensearch/action/search/TransportSearchAction.java`	Entry point of the distributed search path.
`org/opensearch/indices/IndicesService.java`	Owns the `IndexService`/`IndexShard` instances on a node.
`org/opensearch/index/shard/IndexShard.java`	One shard. Wraps a Lucene index via an `Engine`.
`org/opensearch/index/engine/InternalEngine.java`	The default `Engine`: holds the Lucene `IndexWriter`, translog, refresh/flush.
`org/opensearch/search/SearchService.java`	Per-shard query and fetch phase execution.

`libs/` — shared low-level libraries

Path	Why
`libs/core`	`org.opensearch.core`: `StreamInput`/`StreamOutput`, `Writeable` — the wire serialization primitives (BWC lives here; see Level 9).
`libs/common`	Common utilities used everywhere.
`libs/x-content`	XContent: pluggable JSON/YAML/CBOR/SMILE parsing for request/response bodies.
`libs/geo`, `libs/secure-sm`	Geo primitives; the security manager.

`modules/` — bundled-by-default modules

Path	Why
`modules/transport-netty4`	The default network transport (`Netty4Transport`) and HTTP server.
`modules/lang-painless`	The Painless scripting language.
`modules/analysis-common`	The standard analyzers, tokenizers, and token filters.
`modules/reindex`, `modules/ingest-common`, `modules/percolator`	Reindex/update-by-query, ingest pipelines, percolation.

`test/framework/` — the test harness

Path	Why
`OpenSearchTestCase`	Base unit test: randomized seed, `randomAlphaOfLength`, `assertBusy`, leak detection.
`OpenSearchSingleNodeTestCase`	One in-JVM node for tests that need a real index.
`OpenSearchIntegTestCase`	Multi-node `InternalTestCluster` integration tests.
`InternalTestCluster`	Spins up real nodes in-JVM; the backbone of `*IT` tests.

Key Classes Quick Reference

Memorize the role of each. By the end of Level 1 you should be able to name the file path of any of these from memory.

Class	Project / package	Role
`Node`	`server` · `org.opensearch.node`	Object-graph root; constructs and wires every service.
`ClusterService`	`server` · `org.opensearch.cluster.service`	Access to cluster state; submits cluster-state update tasks.
`ClusterState`	`server` · `org.opensearch.cluster`	Immutable snapshot of cluster metadata, routing, nodes, blocks.
`RestController`	`server` · `org.opensearch.rest`	Routes HTTP requests to `RestHandler`s.
`TransportAction`	`server` · `org.opensearch.action.support`	Base for node-to-node actions; the execution unit behind every REST call.
`IndicesService`	`server` · `org.opensearch.indices`	Owns per-node `IndexService` and `IndexShard` instances.
`IndexShard`	`server` · `org.opensearch.index.shard`	One shard = one Lucene index, wrapped via an `Engine`.
`InternalEngine`	`server` · `org.opensearch.index.engine`	Default `Engine`: Lucene `IndexWriter` + translog + refresh/flush/merge.
`SearchService`	`server` · `org.opensearch.search`	Executes the query and fetch phases on a single shard.

Build the muscle memory now:

# You should be able to predict each path before running this.
for c in Node ClusterService ClusterState RestController IndicesService \
         IndexShard InternalEngine SearchService; do
  echo "== $c =="
  find server -name "$c.java" -path "*/main/*"
done

GitHub Issue Categories for Level 1 Contributors

OpenSearch tracks everything on GitHub. At this stage, restrict yourself to issues that exercise the workflow, not the engine:

good first issue — curated, scoped, beginner-appropriate. Always start here.
Documentation — incorrect Javadoc, stale REST examples, broken links, outdated version references in in-repo docs.
flaky-test — tests that fail intermittently. At Level 1 you observe these and learn to read the -Dtests.seed reproduction line; you fix them in Level 5.

How to find them:

# The label-filtered issue lists (open in a browser):
#   https://github.com/opensearch-project/OpenSearch/issues?q=is:open+label:%22good+first+issue%22
#   https://github.com/opensearch-project/OpenSearch/issues?q=is:open+label:flaky-test

# Or with the gh CLI:
gh issue list --repo opensearch-project/OpenSearch \
  --label "good first issue" --state open --limit 30

Warning: Do not start coding on an issue without reading the full comment thread. If it has an assignee or an open linked PR, move on. Leave a comment saying you intend to work on it before you do. Etiquette is covered in depth in Community Interaction.

Deliverables

Demonstrate all of the following before advancing to Level 2:

A successful ./gradlew assemble run — no build failures (Lab 1.1).
At least one unit-test class run green via ./gradlew :server:test --tests ... (Lab 1.2).
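A running single-node cluster: GET _cat/health returns green on the fresh node, and you have indexed and queried documents (Lab 1.3). Expect the status to drop to yellow once you create an index with the default one replica, because a single node has nowhere to allocate the replica shard.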
A standalone Lucene program that builds an index and runs a TermQuery, plus a written explanation of how InternalEngine wraps the same primitives (Lab 1.4).
Ability to name the file path of Node, IndexShard, InternalEngine, and SearchService from memory.
A written explanation (2–3 sentences) of why an OpenSearch shard is a Lucene index.

Common Mistakes

Mistake	Consequence	Fix
Building with an arbitrary system JDK	Cryptic compile/toolchain errors	The wrapper pins Gradle, not the JDK. Use a JDK major version that `DEVELOPER_GUIDE.md` lists for your branch and let Gradle manage toolchains (see Lab 1.1).
Running `./gradlew check` on the whole repo first	Hours-long run; you give up	Scope: `./gradlew :server:test --tests "..."`. Save `check` for pre-PR.
Treating a randomized test failure as flaky	Real bug ignored	Re-run with the printed `-Dtests.seed=...` to reproduce deterministically.
Editing files but skipping `spotlessApply`	`precommit` fails in CI	Run `./gradlew spotlessApply` before committing.
Confusing Dashboards with the engine	Misattributed bugs	Dashboards is a separate TS repo; it is a client of OpenSearch.
Calling a shard a "node" or "index"	Wrong mental model that breaks at scale	A shard is one Lucene index; an OpenSearch index is N shards.
Reading code without running it	Abstract understanding that fails under debugging	Always `./gradlew run` and exercise the path you just read.

How to Verify Success

# 1. Build artifacts without tests.
./gradlew assemble -q && echo "ASSEMBLE OK"

# 2. One unit-test class, green.
./gradlew :server:test --tests "org.opensearch.common.UUIDTests"

# 3. Launch a cluster in one terminal...
./gradlew run
#    ...and in another, confirm it is alive:
curl -s "localhost:9200/_cat/health?v"
#    expected: a row whose 'status' column is 'green'
#    (yellow is normal on a single node once an index with replicas exists)

# 4. The Lucene project (Lab 1.4) prints its hits, e.g.:
#    Found 2 hit(s) for term body:lucene
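
To check step 3 from a script rather than by eye, the status column can be picked out by its header name. This is a minimal sketch: health_status is a hypothetical helper, and the sample rows are abbreviated, illustrative data rather than real output.

```shell
# Hypothetical helper: print the 'status' value from `_cat/health?v` output on stdin.
# The column is located by header name, so it does not depend on how many
# columns a given version prints.
health_status() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "status") col = i }
       NR == 2 && col { print $col }'
}

# Illustrative input only (real output has more columns):
printf '%s\n' \
  'epoch timestamp cluster status node.total' \
  '1700000000 12:00:00 runTask green 1' | health_status
# prints: green
```

Against a live node you would pipe `curl -s "localhost:9200/_cat/health?v"` into health_status.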

PR Profile: Level 1 Graduate

A Level 1 graduate can credibly open these kinds of PRs. Scope is everything.

PR type	Example	Test requirement
Documentation fix	Correct a wrong parameter description in a REST handler's Javadoc	None — docs only
Stale example fix	Fix an outdated `curl` example or version string in an in-repo `.md`	Manual verification
Test naming / assertion clarity	Rename a confusing test method; add a missing `assertEquals` message	Re-run the test class
`flaky-test` observation	Comment a reliable `-Dtests.seed` reproduction on an existing issue	Reproduce locally

You are not yet ready to submit: validation-logic changes, new REST actions, engine or allocation changes, or anything touching the wire protocol. Those start in Level 2 (workflow) and ramp through Levels 3–9.

Next: Lab 1.1 — Build OpenSearch from Source.

Lab 1.1: Build OpenSearch from Source

Background

OpenSearch is a large, multi-project Gradle build (not Maven). Building from source is the mandatory first step for any contributor: you need the ability to compile, rebuild a single project, run tests against your local changes, and produce a runnable distribution. Unlike many Java projects, OpenSearch bundles its own JDK and provisions Gradle through the wrapper — so "works on my machine" failures from JDK drift are largely designed away, if you let the build manage the toolchain.

This lab walks the full build: clone, the ./gradlew wrapper and bundled JDK, assemble, localDistro, building a single project, IntelliJ import, and the Spotless/precommit machinery you will rely on in Level 2.

Why This Lab Matters for Contributors

You cannot submit a credible PR without first proving it builds cleanly.
Knowing which Gradle task touches which project saves hours of needless full builds.
A clean build baseline is what lets you tell a real regression from a local mistake.
The same assemble/precommit/spotless tasks are exactly what CI (gradle-check) runs on your PR — running them locally first is the difference between a one-round review and five.

Prerequisites

Verify before starting:

git --version          # 2.x
java -version          # informational only; the build provisions its own JDK
./gradlew --version    # run from the repo root after cloning; prints the wrapper-pinned Gradle version (8.x at the time of writing)

Note: You do not need a perfectly matching system JDK. OpenSearch uses Gradle toolchains and downloads/uses the JDK it needs (JDK 21 is the baseline for the 3.x/main line; older lines used 11/17). A modern system JDK (17+) on your PATH is enough to launch Gradle itself. If java -version shows something ancient (8), install a current JDK first.
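
Reading the major version out of java -version by eye is error-prone because Java 8 and earlier report themselves as 1.x. The sketch below shows one way to normalize both schemes; java_major is a hypothetical helper and the version strings are examples.

```shell
# Hypothetical helper: extract the major version from the first line of
# `java -version` output, passed as a single argument.
java_major() {
  # Pull out the quoted version string, e.g. 21.0.2 or 1.8.0_392.
  v=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$v" in
    1.*) v=${v#1.}; printf '%s\n' "${v%%.*}" ;;   # legacy scheme: 1.8.0_392 -> 8
    *)   printf '%s\n' "${v%%[.+-]*}" ;;          # modern scheme: 21.0.2 -> 21
  esac
}

java_major 'openjdk version "21.0.2" 2024-01-16'   # prints: 21
java_major 'java version "1.8.0_392"'              # prints: 8
```

On your own machine the argument would be `"$(java -version 2>&1 | head -n 1)"`; anything below 17 means you should install a newer JDK before launching Gradle.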

Resource floor:

Resource	Minimum	Why
Disk	~15 GB free	Dependencies, build outputs, a local distribution, Gradle caches.
RAM	8 GB (16 GB comfortable)	The Gradle daemon and forked test JVMs are memory-hungry.
Network	Required for first build	First run downloads Gradle, the toolchain JDK, and all dependencies.

Step-by-Step Tasks

Step 1: Clone the Repository

git clone https://github.com/opensearch-project/OpenSearch.git
cd OpenSearch

This is the canonical repository — GitHub is the source of truth (there is no JIRA). When you start contributing in Level 2 you will fork this and add your fork as a remote; for now, the upstream clone is what you build.

Confirm the remote and the line you are on:

git remote -v
# origin  https://github.com/opensearch-project/OpenSearch.git (fetch)
# origin  https://github.com/opensearch-project/OpenSearch.git (push)

git branch -r | grep -vi HEAD | sort | head

You will see branches like:

origin/main — the development trunk (current major line, 3.x).
origin/2.x — the maintenance line for the 2.x series.
origin/1.x — legacy.

For contributor work, use main unless you are reproducing an issue specific to a release branch. Fixes generally land on main and are backported to 2.x via the backport 2.x label (you will see this in Lab 2.2).

Step 2: Meet the Wrapper and the Bundled Toolchain

The gradlew script is the Gradle wrapper. It pins the exact Gradle version the build expects, downloading it on first use. Never install Gradle globally and run gradle — always run ./gradlew (or gradlew.bat on Windows).

grep distributionUrl gradle/wrapper/gradle-wrapper.properties
# e.g. distributionUrl=https\://services.gradle.org/distributions/gradle-8.x-all.zip

./gradlew --version
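
If you want the pinned version as a bare string (for a bug report, say), it can be cut out of the distributionUrl line. A minimal sketch follows; wrapper_gradle_version is a hypothetical helper, the URL is a sample value, and release-candidate or milestone version strings are deliberately not handled.

```shell
# Hypothetical helper: read gradle-wrapper.properties on stdin and print the
# pinned Gradle version (plain releases such as 8.5 or 8.10.2 only).
wrapper_gradle_version() {
  sed -n 's/^distributionUrl=.*gradle-\([0-9][0-9.]*\)-[a-z]*\.zip.*/\1/p'
}

# Sample line; the version number here is illustrative.
printf '%s\n' \
  'distributionUrl=https\://services.gradle.org/distributions/gradle-8.10.2-all.zip' \
  | wrapper_gradle_version
# prints: 8.10.2
```

In the repo root: `wrapper_gradle_version < gradle/wrapper/gradle-wrapper.properties`.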

OpenSearch's build configures a Java toolchain: Gradle resolves (and, if necessary, downloads) the JDK the build targets, independent of your JAVA_HOME. You can see the requirement and the build-tool plumbing:

# The baseline build JDK is declared in the build logic / CI; grep for it:
grep -rnE "JavaLanguageVersion|languageVersion" build.gradle buildSrc/ build-tools*/ 2>/dev/null | head
# Where the build looks for / provisions JDKs:
./gradlew javaToolchains | head -40

Note: If your network blocks the automatic toolchain download, point Gradle at a locally installed JDK of the right major version with -Porg.gradle.java.installations.paths=/path/to/jdk21 (or set org.gradle.java.installations.paths in ~/.gradle/gradle.properties). The DEVELOPER_GUIDE.md section on "JDK" documents the current accepted versions for your branch.

Step 3: List the Projects

OpenSearch is a multi-project build. See the project tree before building anything:

./gradlew projects | head -60

You will see the top-level projects that mirror the source layout — :server, :libs:*, :modules:*, :plugins:*, :client:*, :distribution:*, :test:framework, :qa:*, :rest-api-spec. You will explore this map in detail in Lab 2.1. For now, note that :server is the core engine and where you will spend most of your time.

Step 4: Assemble the Build (No Tests)

assemble compiles and packages all build artifacts without running tests. This is your "does it build?" command.

./gradlew assemble

Expected duration: 10–30 minutes on the first run (downloading the toolchain JDK and all dependencies dominates), then minutes on warm, incremental builds.

What you should not do at this stage is run ./gradlew build or ./gradlew check — those run the full test suite and the precommit gate across every project and can take well over an hour. Save those for pre-PR (Lab 1.2).

On success you will see:

BUILD SUCCESSFUL in 14m 32s
1234 actionable tasks: 1234 executed

If you see BUILD FAILED, go to Troubleshooting below and re-run with --stacktrace.

Step 5: Build a Single Project

In day-to-day development you almost never build everything. Scope to the project you are editing:

# Compile only the server main sources (the most common inner loop):
./gradlew :server:compileJava

# Compile the server test sources too:
./gradlew :server:compileTestJava

# Assemble a single library or module:
./gradlew :libs:core:assemble
./gradlew :modules:analysis-common:assemble

The :project:task syntax (Gradle "path") is how you address any project. Gradle automatically builds upstream dependencies first (e.g. :server depends on :libs:core, so compiling :server recompiles :libs:core if it changed). This incremental, scoped build is the command you will run hundreds of times.
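
Since project paths mirror the directory layout, the project that owns a file you just edited can usually be derived from its path. The sketch below assumes exactly that mirroring and nothing more: gradle_project_for is a hypothetical helper, and deeper trees such as distribution/ or qa/ are not resolved precisely. `./gradlew projects` remains the authority.

```shell
# Hypothetical helper: map a repo-relative file path to a Gradle project path.
gradle_project_for() {
  case "$1" in
    libs/*|modules/*|plugins/*)
      # Two-level projects: <group>/<name>/... -> :<group>:<name>
      group=${1%%/*}
      rest=${1#*/}
      printf ':%s:%s\n' "$group" "${rest%%/*}" ;;
    test/framework/*)
      printf ':test:framework\n' ;;
    */*)
      # Single-level projects, e.g. server/... -> :server
      printf ':%s\n' "${1%%/*}" ;;
    *)
      return 1 ;;   # a file in the repo root belongs to no subproject
  esac
}

gradle_project_for server/src/main/java/org/opensearch/node/Node.java
# prints: :server
gradle_project_for modules/analysis-common/src/main/java/Example.java
# prints: :modules:analysis-common
```

Combined with a task name this gives the scoped invocation directly, for example `./gradlew "$(gradle_project_for libs/core/build.gradle):assemble"`.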

Tip: --offline reuses the dependency cache and skips network checks — much faster once your first build has populated ~/.gradle. --console=plain gives clean, scrollback-friendly output.

Step 6: Produce a Runnable Local Distribution

assemble builds JARs; localDistro assembles a full, runnable OpenSearch distribution under distribution/archives/ — the same shape an end user downloads.

./gradlew localDistro

# Find what was produced (the exact path/architecture suffix varies by platform):
find distribution/archives -maxdepth 3 -type d -name "*local*" 2>/dev/null
ls distribution/archives/*/build/install/ 2>/dev/null

You generally will not run OpenSearch from this distribution during development — that is what the dedicated ./gradlew run task is for (it launches a debuggable single node from source; see Lab 1.3). localDistro matters because it is what packaging and QA build, and because being able to produce it proves your build is end-to-end healthy.

Step 7: Import into IntelliJ IDEA

IntelliJ understands Gradle multi-project builds natively. Import the build, not the files.

File → Open → select the OpenSearch/ root directory (the one containing settings.gradle).
When prompted, choose "Open as Project" and let IntelliJ import it as a Gradle project. It reads settings.gradle / build.gradle and materializes every project as a module.
Set the Gradle JVM to a JDK of the build's target major version (Settings → Build, Execution, Deployment → Build Tools → Gradle → Gradle JVM). Matching the build toolchain avoids resolution surprises.
Wait for the initial sync and index build (several minutes on first import).

Verify the import worked:

Open server/src/main/java/org/opensearch/index/shard/IndexShard.java.
Cmd/Ctrl+Click a class reference (e.g. Engine) — it should navigate.
Find Class (Cmd+O / Ctrl+N) → InternalEngine should resolve to server/src/main/java/org/opensearch/index/engine/InternalEngine.java.

Note: OpenSearch enforces formatting with Spotless and a host of checks under precommit. Rather than fighting IntelliJ's default formatter, rely on ./gradlew spotlessApply (Step 8) to normalize formatting before you commit. The repo includes IDE config the DEVELOPER_GUIDE.md points to; import it if offered.

Step 8: Meet Spotless and Precommit (Awareness)

You will run these constantly from Level 2 onward; meet them now so they are not a surprise.

# Auto-format your changes to satisfy the formatter:
./gradlew spotlessApply

# Verify formatting without changing files (what CI checks):
./gradlew spotlessJavaCheck

# The full static-analysis gate: checkstyle, forbidden APIs, license/SPDX headers,
# dependency checks, logger-usage, etc. Run this before every PR.
./gradlew precommit

precommit is the local mirror of much of what CI's gradle-check enforces. A clean precommit locally is the single biggest predictor of a smooth review. Every new .java file must carry the SPDX header (you will see it flagged here if you forget; details in Lab 2.1).
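
A quick local pre-check for the header can save a precommit round trip. This sketch assumes the identifier line reads SPDX-License-Identifier: Apache-2.0 and sits within the first five lines, as it does in existing source files; has_spdx_header is a hypothetical helper, and precommit remains the real gate. Copy the full header from an existing file rather than typing it.

```shell
# Hypothetical helper: succeed if the SPDX identifier appears in the first
# five lines of the given file.
has_spdx_header() {
  head -n 5 "$1" | grep -q 'SPDX-License-Identifier: Apache-2.0'
}

# Demonstrate on two throwaway files.
tmp=$(mktemp -d)
printf '/*\n * SPDX-License-Identifier: Apache-2.0\n */\npackage demo;\n' > "$tmp/Good.java"
printf 'package demo;\n' > "$tmp/Bad.java"
for f in "$tmp"/*.java; do
  if has_spdx_header "$f"; then
    echo "ok       ${f##*/}"
  else
    echo "MISSING  ${f##*/}"
  fi
done
rm -rf "$tmp"
```

To scan only the files you added on a branch, feed the loop from `git diff --name-only --diff-filter=A main -- '*.java'` instead of the temporary directory.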

Implementation Requirements

This lab has no code to implement. Deliverables:

A successful ./gradlew assemble run (terminal output showing BUILD SUCCESSFUL).
The Gradle version printed by ./gradlew --version and the build's target JDK major version.
A successful ./gradlew :server:compileJava (proving the scoped inner loop works).
A produced local distribution from ./gradlew localDistro (path identified).
A working IntelliJ Gradle import that resolves InternalEngine via Find Class.
A clean ./gradlew precommit (or a clear note of any pre-existing failures on your branch).

Troubleshooting

`BUILD FAILED` with no obvious cause

Re-run with diagnostics:

./gradlew assemble --stacktrace --info 2>&1 | tail -80

Read the first failing task, not the last line. Gradle prints > Task :some:project:task FAILED near the actual error; everything after is fallout.
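
Finding that first failing task in a long log is easy to automate. The sketch below uses a hypothetical helper, first_failed_task, on an invented log excerpt; it relies only on the "> Task ... FAILED" line format described above.

```shell
# Hypothetical helper: print the first "> Task ... FAILED" line from a
# Gradle log read on stdin, then stop reading.
first_failed_task() {
  awk '/^> Task :[^ ]+ FAILED/ { print; exit }'
}

# Invented log excerpt for illustration:
first_failed_task <<'EOF'
> Task :libs:core:compileJava
> Task :server:compileJava FAILED
> Task :server:processResources
FAILURE: Build failed with an exception.
EOF
# prints: > Task :server:compileJava FAILED
```

For a real build, capture the log first (`./gradlew assemble --console=plain > build.log 2>&1`) and then run `first_failed_task < build.log`.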

JDK / toolchain mismatch

> Could not determine the dependencies of task ...
  No compatible toolchains found for request specification: {languageVersion=21 ...}

Cause: Gradle cannot find or download the JDK the build targets. Fix: Either allow the automatic toolchain download (ensure network access), or point Gradle at a locally installed JDK of the right major version:

# One-off:
./gradlew assemble -Porg.gradle.java.installations.paths=/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home

# Or persistently in ~/.gradle/gradle.properties:
#   org.gradle.java.installations.paths=/path/to/jdk21
./gradlew javaToolchains   # confirm the JDK is now detected

Gradle daemon out of memory / `Killed`

> Java heap space
# or the JVM is OOM-killed mid-build

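Cause: The Gradle daemon's heap is too small for a full build on your machine. Fix: Raise org.gradle.jvmargs. Prefer ~/.gradle/gradle.properties: it takes precedence over the gradle.properties in the repo root and keeps your working tree clean. If you edit the repo-root file instead, do not commit the change: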

org.gradle.jvmargs=-Xmx4g -XX:+HeapDumpOnOutOfMemoryError
org.gradle.daemon=true

Then restart the daemon so it picks up the new args:

./gradlew --stop
./gradlew assemble

First build is extremely slow or stalls on downloads

Cause: Cold caches; the first run downloads Gradle, the toolchain JDK, and all dependencies. Fix: This is expected once. After the first successful build, use --offline to skip network checks. If you are behind a proxy, set systemProp.http.proxyHost and systemProp.http.proxyPort (and the matching systemProp.https.proxyHost and systemProp.https.proxyPort) in ~/.gradle/gradle.properties.

"Spotless found violations" when you only changed one file

Cause: Formatting drift. Fix: ./gradlew spotlessApply, then re-stage. Do not hand-format to match — let the tool do it.

Configuration cache or stale-daemon weirdness

If a build behaves inconsistently after you switched branches or changed gradle.properties:

./gradlew --stop                 # kill stale daemons
./gradlew assemble --rerun-tasks # ignore the up-to-date cache for this run

Expected Output

A clean assemble ends like this (numbers vary):

> Task :distribution:archives:integ-test-zip:buildExpanded
> Task :distribution:archives:darwin-tar:assemble

BUILD SUCCESSFUL in 16m 04s
2871 actionable tasks: 2871 executed

A clean precommit ends like this:

> Task :server:precommit
> Task :precommit

BUILD SUCCESSFUL in 6m 41s

Stretch Goals

Inspect the dependency graph of :server. See what :server actually depends on:
```
./gradlew :server:dependencies --configuration compileClasspath | head -60
```
Confirm :libs:core (and the Lucene artifacts) appear.

Find the bundled Lucene version. You will need this exact version in Lab 1.4:

grep -rn "lucene" buildSrc/version.properties 2>/dev/null \
  || grep -rn "lucene" gradle/libs.versions.toml 2>/dev/null \
  || ./gradlew :server:dependencies --configuration compileClasspath | grep -i lucene-core

Time a no-op incremental build. Run ./gradlew :server:compileJava twice in a row; the second run should report UP-TO-DATE for the compile task — proof that Gradle's incremental build is working.
Build with the build scan (optional): add --scan to any task to get a shareable web report of exactly what ran and how long it took. Useful when asking for help on a slow build.

Validation / Self-check

You are done when you can answer these without notes:

Why do you run ./gradlew rather than a globally installed gradle?
What does the Gradle toolchain mechanism do for you, and how do you point it at a local JDK?
What is the difference between ./gradlew assemble, ./gradlew localDistro, and ./gradlew run (the last one you will use in Lab 1.3)?
Which Gradle task addresses only the server's main Java compilation, and how do you express it?
Which property controls the Gradle daemon's heap, and in which files can you set it?
Which two tasks (spotless*, precommit) must be green before you open a PR, and what does each one check?
What exact Apache Lucene version does your branch bundle? (You will reuse this in Lab 1.4.)

Next: Lab 1.2 — Run Unit and Integration Tests.

Lab 1.2: Run Unit and Integration Tests

Background

OpenSearch has one of the most thorough test suites of any open-source distributed system, and a PR is not credible unless it passes the relevant slice of it. The suite is built on Randomized Testing (the same RandomizedRunner Lucene uses): every run picks a random seed, and the test exercises randomized inputs (field types, document counts, cluster topologies, serialization round-trips). This finds bugs ordinary fixed-input tests never would — and it means a failure must be reproduced with the seed it failed under, not dismissed as flaky.
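
A toy model may make the seed mechanics concrete. The sketch below is only an analogy: it uses awk's srand/rand rather than RandomizedRunner, and toy_randomized_test and its "fails when the input is a multiple of 7" property are invented for illustration. The point it demonstrates is that a given seed always yields the same input and therefore the same verdict. Exact numbers depend on your awk implementation, so no output is shown.

```shell
# Toy "randomized test": derive an input from the seed, then check a property.
# An analogy only; this is not how the real test framework generates inputs.
toy_randomized_test() {
  awk -v seed="$1" 'BEGIN {
    srand(seed)                      # pin the generator to the given seed
    n = int(rand() * 100)            # the "randomized input", 0..99
    verdict = (n % 7 == 0) ? "FAIL" : "PASS"
    printf "seed=%s input=%d %s\n", seed, n, verdict
  }'
}

toy_randomized_test 12345
toy_randomized_test 12345   # same seed: same input, same verdict
toy_randomized_test 67890   # different seed: usually a different input
```

A failure that appears for one seed and vanishes for another is therefore not noise; it is a property of the input that seed produced.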

This lab teaches you to run unit and integration tests, scope them tightly, reproduce a randomized failure deterministically, run the precommit and full-check gates, and read the HTML reports under build/reports/tests.

Why This Lab Matters for Contributors

CI's gradle-check runs a superset of what you will run here; passing locally first turns a five-round review into one.
Knowing how to scope tests with --tests is the difference between a 5-second feedback loop and a 90-minute one.
Reproducing a randomized failure from its seed is the core debugging skill for OpenSearch — you will use it constantly in Level 5 and Level 8.
Reading the test reports tells you what failed and why, not just that something did.

Prerequisites

Lab 1.1 complete: a clean ./gradlew assemble.
Familiarity with the project paths printed by ./gradlew projects.

Step-by-Step Tasks

Step 1: The Test Types

OpenSearch tests come in tiers. Know which tier you are running before you run it.

Base class	Project	What it gives you	Gradle task
`OpenSearchTestCase`	`test:framework`	Plain unit test: random seed, `randomAlphaOfLength`, `assertBusy`, leak detection. No cluster.	`:server:test`
`OpenSearchSingleNodeTestCase`	`test:framework`	One real in-JVM node — lets you create a real index and shard.	`:server:test`
`OpenSearchIntegTestCase`	`test:framework`	A multi-node `InternalTestCluster` in-JVM. Subclasses are conventionally named `*IT`.	`:server:internalClusterTest`
`AbstractWireSerializingTestCase` / `AbstractSerializingTestCase`	`test:framework`	Round-trips a `Writeable`/XContent object to verify serialization (great for BWC).	`:server:test`
`OpenSearchRestTestCase` + YAML	`rest-api-spec`, modules	Black-box REST tests driven by YAML specs.	`:rest-api-spec:yamlRestTest`, `:modules:...:yamlRestTest`

The two you will touch most at Level 1 are OpenSearchTestCase (fast, no cluster) and OpenSearchIntegTestCase (slow, real cluster). The distinction is covered in depth in Level 5.

# See how many of each style live in :server (rough proxy via base class):
grep -rln "extends OpenSearchTestCase"            server/src/test | wc -l
grep -rln "extends OpenSearchSingleNodeTestCase"  server/src/test | wc -l
grep -rln "extends OpenSearchIntegTestCase"       server/src/internalClusterTest 2>/dev/null | wc -l

Step 2: Run a Single Unit-Test Class

Never start with the whole :server:test task — it runs thousands of classes. Scope with --tests.

# A small, fast, dependency-free class — good first run:
./gradlew :server:test --tests "org.opensearch.common.UUIDTests"

--tests accepts a fully-qualified class name. You can also use wildcards and pick a single method:

# All test classes in a package:
./gradlew :server:test --tests "org.opensearch.cluster.*"

# A single method on a class:
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests.testToXContent"

# Every test class whose name ends in "Tests" under a subtree:
./gradlew :server:test --tests "org.opensearch.index.engine.*Tests"

Tip: ./gradlew :server:test --tests "X" will recompile changed sources first. If you only changed test code, the main-source compile is skipped — the loop stays fast.

Step 3: Run Integration (In-JVM Cluster) Tests

Integration tests spin up a real multi-node cluster inside the JVM via InternalTestCluster. They live under a separate source set and run via a separate task:

# One integration-test class (note the *IT naming convention):
./gradlew :server:internalClusterTest --tests "org.opensearch.cluster.SimpleClusterStateIT"

These are much slower (seconds-to-minutes per class) because each spins up nodes, allocates shards, and tears everything down. Run them scoped; never blanket-run :server:internalClusterTest.
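
Choosing the right task and the right --tests filter for a file you have open is mechanical once you know the two source sets. The sketch below assumes only the layout named in this lab (server/src/test and server/src/internalClusterTest); server_test_command is a hypothetical helper that prints the command rather than running it.

```shell
# Hypothetical helper: turn a test source path under server/ into a scoped
# Gradle invocation (printed, not executed).
server_test_command() {
  case "$1" in
    server/src/test/java/*)
      task=test
      rel=${1#server/src/test/java/} ;;
    server/src/internalClusterTest/java/*)
      task=internalClusterTest
      rel=${1#server/src/internalClusterTest/java/} ;;
    *)
      echo "not a server test source: $1" >&2
      return 1 ;;
  esac
  # Strip the .java suffix and turn directories into package separators.
  fqcn=$(printf '%s' "${rel%.java}" | tr '/' '.')
  printf './gradlew :server:%s --tests "%s"\n' "$task" "$fqcn"
}

server_test_command server/src/internalClusterTest/java/org/opensearch/cluster/SimpleClusterStateIT.java
# prints: ./gradlew :server:internalClusterTest --tests "org.opensearch.cluster.SimpleClusterStateIT"
```

Printing instead of executing keeps the helper safe to experiment with; pipe the output to `sh` once you trust it.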

Step 4: Randomized Testing and `-Dtests.seed`

Every test run uses a random seed. When a test fails, the test framework prints a reproduction line that pins the seed (and the other randomized settings) so the failure is deterministic:

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.SomeTest.testThing" \
  -Dtests.seed=DEADBEEFCAFE -Dtests.locale=en-US -Dtests.timezone=UTC ...

To reproduce, copy everything after REPRODUCE WITH: verbatim, including the locale, the timezone, and any further flags the line carries:

./gradlew ':server:test' --tests "org.opensearch.SomeTest.testThing" \
  -Dtests.seed=DEADBEEFCAFE -Dtests.locale=en-US -Dtests.timezone=UTC

You can also force a seed to make a run deterministic, or iterate a flaky test to flush out the failing seed:

# Force a specific seed:
./gradlew :server:test --tests "org.opensearch.SomeTest" -Dtests.seed=DEADBEEFCAFE

# Run the same test many times with fresh random seeds to surface flakiness:
./gradlew :server:test --tests "org.opensearch.SomeTest" -Dtests.iters=50

Warning: A randomized failure is not automatically "flaky." If it reproduces under its seed, it is a real bug that only some inputs trigger. Only when it fails on one seed and passes on the same seed on re-run is it genuinely non-deterministic. Flaky-test handling — muting with @AwaitsFix(bugUrl=...), never @Ignore — is Level 5 material.
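
When the reproduction line is buried in a CI log, pulling out just the -Dtests.* flags avoids copy-paste mistakes. A minimal sketch, with a hypothetical helper name (repro_flags) and the sample failure text used elsewhere in this lab:

```shell
# Hypothetical helper: collect every -Dtests.<name>=<value> flag from a log
# read on stdin and print them on one line, in order of appearance.
repro_flags() {
  grep -o -- '-Dtests\.[a-z.]*=[^ ]*' | paste -s -d ' ' -
}

repro_flags <<'EOF'
org.opensearch.cluster.SomeTest > testThing FAILED
REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.SomeTest.testThing" \
  -Dtests.seed=A1B2C3D4 -Dtests.locale=en-US -Dtests.timezone=UTC
EOF
# prints: -Dtests.seed=A1B2C3D4 -Dtests.locale=en-US -Dtests.timezone=UTC
```

Append the printed flags to your own scoped `--tests` command to replay the failing configuration.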

Step 5: Run the Precommit Gate

precommit is the static-analysis half of the gate: checkstyle, forbidden APIs, license/SPDX headers, dependency checks, loggerUsageCheck, and more. It is fast relative to the tests and CI runs it on every PR.

./gradlew precommit

Pair it with formatting:

./gradlew spotlessApply      # fix formatting
./gradlew spotlessJavaCheck  # verify formatting (what CI checks)

Step 6: The Full Check (Know What It Is, Run It Sparingly)

check is the everything gate: unit tests + precommit + integration tests for the project(s) you target. It is long-running. Scope it to a project, and only run it before a substantial PR.

# The full gate for the server project (long — minutes to tens of minutes):
./gradlew :server:check

# The whole repo (very long — reserve for pre-release / large changes):
# ./gradlew check

For most PRs you will run, in order: the affected --tests, then spotlessApply, then precommit. That mirrors what review will demand.

Step 7: Read the Test Reports

Whether a test passes or fails, Gradle writes an HTML report. For a failure you want the report, not just the console tail.

# Reports live under each project's build/reports/tests/<taskName>/index.html
find server/build/reports/tests -name index.html
# e.g. server/build/reports/tests/test/index.html
#      server/build/reports/tests/internalClusterTest/index.html

# Open it (macOS):
open server/build/reports/tests/test/index.html

The report gives you, per class and per method: pass/fail/skip counts, the full stack trace of each failure, and the captured stdout/stderr (where the test logged the reproduction line and any cluster diagnostics). The raw JUnit XML is alongside, under server/build/test-results/test/*.xml — useful for grepping or CI parsing.

# Quickly find which methods failed without opening a browser:
grep -lE '<(failure|error)' server/build/test-results/test/*.xml   # result files that contain a failure
grep -rE '<(failure|error)' server/build/test-results/test/*.xml | head
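
The grep above shows the failure elements but not which test method each belongs to. The sketch below pairs every failure with its enclosing testcase. It assumes the usual JUnit XML shape, in which each testcase element opens on its own line with name and classname attributes; failed_tests is a hypothetical helper and the XML is an invented sample.

```shell
# Hypothetical helper: print "<classname>.<method>" for every failed or
# errored testcase in JUnit XML (files as arguments, or stdin).
failed_tests() {
  awk '
    /<testcase / {
      # Remember the most recent testcase; the leading space keeps
      # name= from matching inside classname=.
      name = ""; cls = ""
      if (match($0, / name="[^"]*"/))      name = substr($0, RSTART + 7,  RLENGTH - 8)
      if (match($0, / classname="[^"]*"/)) cls  = substr($0, RSTART + 12, RLENGTH - 13)
    }
    /<(failure|error)[ >]/ && name != "" { print cls "." name; name = "" }
  ' "$@"
}

failed_tests <<'EOF'
<testsuite name="org.opensearch.cluster.SomeTest" tests="2" failures="1">
  <testcase name="testOk" classname="org.opensearch.cluster.SomeTest" time="0.01"/>
  <testcase name="testThing" classname="org.opensearch.cluster.SomeTest" time="0.12">
    <failure message="expected 3 but was 2" type="java.lang.AssertionError">...</failure>
  </testcase>
</testsuite>
EOF
# prints: org.opensearch.cluster.SomeTest.testThing
```

Each printed name is already in the form --tests accepts, so it can be passed straight back to `./gradlew :server:test --tests`.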

Implementation Requirements

This lab has no code to implement. Deliverables:

One OpenSearchTestCase-style unit class run green via --tests.
One *IT integration class run green via :server:internalClusterTest --tests.
A demonstrated seed reproduction: take any test's printed REPRODUCE WITH line and re-run it.
A clean ./gradlew precommit.
The path to your build/reports/tests/test/index.html and a one-line description of what it shows.

Troubleshooting

A test "passes locally but fails in CI" (or vice versa)

Almost always a seed difference. Reproduce CI's failure by copying the -Dtests.seed=... (and -Dtests.locale/-Dtests.timezone) from the CI log into your local command. Randomization includes locale and timezone — a locale-sensitive bug will only appear under certain -Dtests.locale values.

`OutOfMemoryError` during `:server:test`

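Test JVMs are forked, so the daemon heap set by org.gradle.jvmargs (Lab 1.1) does not apply to them. First scope your run with --tests instead of running the whole suite. If a scoped run still runs out of memory, raise the heap of the forked test JVMs; TESTING.md documents a tests.heap.size system property for this (for example -Dtests.heap.size=1g), so check the file on your branch for the current name and default.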

Integration test hangs or leaks threads

OpenSearchIntegTestCase has aggressive leak detection; a hang at teardown usually means the test (or your change) left a thread/Closeable open. Read the report's captured output — the framework names the leaked resource. This is exactly the signal Level 5 teaches you to act on.

"Tests are UP-TO-DATE and won't re-run"

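Gradle skips a test task whose inputs have not changed since its last successful run. Force a re-run (note that --rerun-tasks re-executes every task in the graph, so expect a slower build):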

./gradlew :server:test --tests "org.opensearch.common.UUIDTests" --rerun-tasks

`precommit` fails on a file you did not touch

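Confirm it is pre-existing (some failures depend on branch state). First re-run without your uncommitted changes to establish a baseline. Only do this when you have uncommitted changes; otherwise git stash saves nothing and git stash pop restores an older, unrelated stash:

git stash && ./gradlew precommit ; git stash pop

This checks your branch's last commit, not main. If your branch already carries commits of its own, also run precommit on a clean checkout of main (a separate git worktree works well).

If the baseline is clean and your change introduced the failure, it is yours to fix.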

Expected Output

A passing scoped unit run:

> Task :server:test

org.opensearch.common.UUIDTests > testRandomUUID PASSED
org.opensearch.common.UUIDTests > testTimeBasedUUID PASSED

BUILD SUCCESSFUL in 38s

A failure prints the all-important reproduction line:

org.opensearch.cluster.SomeTest > testThing FAILED
    java.lang.AssertionError: expected:<3> but was:<2>
        at org.opensearch.cluster.SomeTest.testThing(SomeTest.java:91)

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.SomeTest.testThing" \
  -Dtests.seed=A1B2C3D4 -Dtests.locale=en-US -Dtests.timezone=UTC

BUILD FAILED in 41s

Stretch Goals

Find a serialization round-trip test and run it. These verify wire/XContent BWC and are the guardians of the protocol you must respect in Level 9:
```
grep -rln "extends AbstractWireSerializingTestCase" server/src/test | head
# pick one and run it with --tests
```

Stress a single test for flakiness. Run a chosen test 100 times with fresh seeds:

./gradlew :server:test --tests "org.opensearch.common.UUIDTests" -Dtests.iters=100

Run a module's REST-YAML tests and watch a black-box test of the REST API:
```
./gradlew :modules:reindex:yamlRestTest --tests "*" 2>&1 | tail -30
```
Diff two runs' reports. Run a test twice with two explicit different seeds and compare what the report captured — see the randomized inputs differ.

Validation / Self-check

You are done when you can answer these without notes:

What does --tests accept, and how do you scope to a single method? A whole package?
Why is :server:internalClusterTest separate from :server:test, and which base class backs it?
What is -Dtests.seed for, and where do you find the value to reproduce a given failure?
When is a randomized failure a real bug, and when is it genuinely non-deterministic?
What does precommit check that the test task does not?
Where do the HTML report and the raw JUnit XML for a :server:test run live?
Which three commands, in order, would you run before opening a PR that touches :server?

See the testing internals in depth in Level 5. Next: Lab 1.3 — Launch a Single-Node Cluster and Index Data.

Lab 1.3: Launch a Single-Node Cluster and Index Data

Background

Reading the engine is abstract until you watch it serve a request. In this lab you launch a real, debuggable OpenSearch node straight from your source checkout with ./gradlew run, then drive it over its REST API: check cluster health, create an index with an explicit mapping, bulk-index documents, run a match query, run a terms aggregation, and inspect the shards and Lucene segments backing your index.

The point is not "learn the REST API" (that is user documentation). The point is to connect every curl you run to the RestHandler that parses it and the TransportAction that executes it — so that when you read those classes in Level 3, you have already seen their effects.

Why This Lab Matters for Contributors

./gradlew run is the tool you will use for the rest of the curriculum to exercise code you have just read or changed.
Mapping a REST call to its handler/action is the foundational request-tracing skill (Level 3 is built entirely on it).
Watching _cat/segments change as you index, refresh, and merge makes the refresh/flush/merge lifecycle concrete.

Prerequisites

Lab 1.1 complete: a clean ./gradlew assemble.
curl and (optionally) jq installed for readable JSON.

Step-by-Step Tasks

Step 1: Launch the Node

From the repo root:

./gradlew run

This task builds a distribution and launches a single node with REST on localhost:9200 and the transport port on 9300. It runs in the foreground, streaming the node's logs — leave it running and open a second terminal for the curl calls. The node's data directory is ephemeral under the build output, so each run starts clean (great for reproducible experiments).

Wait for the line indicating the node has started and recovered:

[INFO ][o.o.n.Node] [runTask-0] started
[INFO ][o.o.g.GatewayService] [runTask-0] recovered [0] indices into cluster_state

Note: ./gradlew run launches with security/auth disabled by default in this dev flow, so plain HTTP on :9200 works without credentials. This is a development convenience — production clusters run the Security plugin (a separate repo). To attach a debugger, use ./gradlew run --debug-jvm and connect your IDE's remote JVM debugger to the printed port.

Step 2: Confirm the Cluster Is Alive

In the second terminal:

curl -s "localhost:9200" | jq .

Expected (versions/names vary):

{
  "name" : "runTask-0",
  "cluster_name" : "runTask",
  "version" : {
    "distribution" : "opensearch",
    "number" : "3.0.0",
    "lucene_version" : "9.x.x"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

Now health:

curl -s "localhost:9200/_cat/health?v"

epoch      timestamp cluster status node.total node.data shards pri relo init unassign ...
1718531200 12:00:00  runTask green           1         1      0   0    0    0        0 ...

A single-node cluster reports green only when there are no unassigned replica shards to place. Watch this column change as you create indices in the next steps.

REST → handler map: GET / is served by RestMainAction; GET /_cat/health by RestHealthAction (one of the _cat family registered in RestController). The _cat handlers ultimately call the cluster-health transport action.

Step 3: Create an Index with an Explicit Mapping

Define the schema instead of relying on dynamic mapping — it makes the field types explicit and the later aggregation deterministic.

curl -s -X PUT "localhost:9200/books" \
  -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "title":  { "type": "text" },
      "author": { "type": "keyword" },
      "year":   { "type": "integer" },
      "tags":   { "type": "keyword" }
    }
  }
}' | jq .

Expected:

{ "acknowledged": true, "shards_acknowledged": true, "index": "books" }

We set number_of_replicas: 0 so a single node stays green (no replica to leave unassigned). The text vs keyword distinction matters: title is analyzed for full-text match queries; author/tags are keyword (exact, not analyzed) so they aggregate cleanly.
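The green/yellow rule can be written down as a toy function. This is a sketch under simplifying assumptions, not OpenSearch's allocation logic: it assumes identical data nodes, that a primary can always be placed when at least one node exists, and that two copies of the same shard never share a node. HealthDemo is a hypothetical name.

```java
/** Toy model of index health on a cluster of identical data nodes; not OpenSearch code. */
class HealthDemo {

    /**
     * Assumes every primary can be assigned whenever at least one node exists, and that
     * a replica may never share a node with another copy of the same shard.
     */
    static String health(int dataNodes, int replicasPerPrimary) {
        if (dataNodes < 1) {
            return "red"; // no node can host the primary
        }
        // Each replica needs a node that does not already hold a copy of the shard.
        int placeable = Math.min(replicasPerPrimary, dataNodes - 1);
        return placeable == replicasPerPrimary ? "green" : "yellow";
    }

    public static void main(String[] args) {
        System.out.println(health(1, 0)); // green
        System.out.println(health(1, 1)); // yellow
        System.out.println(health(2, 1)); // green
    }
}
```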

REST → handler/action map: PUT /books → RestCreateIndexAction → TransportCreateIndexAction. Creating an index is a cluster-state change: it routes to the elected cluster manager (TransportClusterManagerNodeAction lineage), updates Metadata, and publishes a new ClusterState (the publish path is Level 4 and the cluster-state deep dive).

Step 4: Bulk-Index Documents

The Bulk API takes newline-delimited JSON (NDJSON): alternating action and source lines, with a trailing newline.

curl -s -X POST "localhost:9200/books/_bulk" \
  -H 'Content-Type: application/x-ndjson' --data-binary '
{ "index": { "_id": "1" } }
{ "title": "Lucene in Action", "author": "mccandless", "year": 2010, "tags": ["search","java"] }
{ "index": { "_id": "2" } }
{ "title": "Mastering OpenSearch", "author": "community", "year": 2023, "tags": ["search","ops"] }
{ "index": { "_id": "3" } }
{ "title": "Designing Data-Intensive Applications", "author": "kleppmann", "year": 2017, "tags": ["systems","data"] }
{ "index": { "_id": "4" } }
{ "title": "Elasticsearch: The Definitive Guide", "author": "community", "year": 2015, "tags": ["search","java"] }
' | jq '{errors, items: (.items | length)}'

Expected:

{ "errors": false, "items": 4 }

Make the documents searchable immediately (a refresh opens a new Lucene searcher — see refresh/flush/merge):

curl -s -X POST "localhost:9200/books/_refresh" >/dev/null
curl -s "localhost:9200/books/_count" | jq .
# { "count": 4, ... }

REST → handler/action map: POST /books/_bulk → RestBulkAction → TransportBulkAction → per-shard TransportShardBulkAction → IndexShard.applyIndexOperationOnPrimary(...) → InternalEngine.index(...) → Lucene IndexWriter.addDocument/updateDocument + Translog.add. You will trace this exact path in Level 6, Lab 6.1.

Step 5: Run a `match` (Full-Text) Query

match analyzes the query text and searches the analyzed title field:

curl -s -X GET "localhost:9200/books/_search" \
  -H 'Content-Type: application/json' -d '{
  "query": { "match": { "title": "search" } }
}' | jq '{hits: .hits.total.value, titles: [.hits.hits[]._source.title]}'

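The hit count may surprise you: zero. The analyzer splits each title into whole-word terms, so "Mastering OpenSearch" is indexed as mastering and opensearch, and the query term search equals neither of them. A match query compares analyzed terms; it does not do substring matching. Now try terms that do exist in the index: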

curl -s -X GET "localhost:9200/books/_search" \
  -H 'Content-Type: application/json' -d '{
  "query": { "match": { "title": "opensearch lucene" } }
}' | jq '[.hits.hits[] | {title: ._source.title, score: ._score}]'

Expected (scores vary by BM25 stats; the shorter title tends to rank first because BM25 normalizes by field length):

[
  { "title": "Mastering OpenSearch", "score": 1.4 },
  { "title": "Lucene in Action", "score": 1.1 }
]

REST → handler/action map: GET /books/_search → RestSearchAction → TransportSearchAction. The coordinating node fans out to each shard; each runs SearchService.executeQueryPhase (QueryPhase) then executeFetchPhase (FetchPhase); the coordinator merges shard results in SearchPhaseController. The DSL match becomes a Lucene Query via QueryShardContext. Full trace in Level 7, Lab 7.1 and the search-execution deep dive.

Step 6: Run a `terms` Aggregation

Aggregate over the author keyword field, returning zero hits (we only want the buckets):

curl -s -X GET "localhost:9200/books/_search" \
  -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "by_author": { "terms": { "field": "author" } }
  }
}' | jq '.aggregations.by_author.buckets'

Expected:

[
  { "key": "community", "doc_count": 2 },
  { "key": "kleppmann", "doc_count": 1 },
  { "key": "mccandless", "doc_count": 1 }
]

This works because author is a keyword field with DocValues (a columnar, per-document store Lucene maintains). Aggregations read DocValues, not the inverted index. Try aggregating over the text field title and watch it fail: text fields have no DocValues, and the in-memory alternative (fielddata) is off by default:

curl -s -X GET "localhost:9200/books/_search" -H 'Content-Type: application/json' -d '{
  "size": 0, "aggs": { "by_title": { "terms": { "field": "title" } } }
}' | jq '.error.type, .error.reason'
# "illegal_argument_exception"
# "Text fields are not optimised for ... aggregations ... set fielddata=true ..."

Error messages like this one are something contributors routinely improve; you will study handler validation in Level 2, Lab 2.3. Aggregation internals are covered in the aggregations deep dive.
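The difference between the two structures in this step can be sketched with plain collections. This is a toy contrast, not how Lucene stores either structure; DocValuesDemo and its methods are hypothetical names, and the author values mirror the four documents indexed earlier (doc IDs 0 to 3 are an illustrative numbering).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy contrast between an inverted index and a doc-values column; not Lucene code. */
class DocValuesDemo {

    // Column store: position = doc ID, value = that doc's author.
    static final String[] AUTHOR_DOC_VALUES = {"mccandless", "community", "kleppmann", "community"};

    /** Inverted index: term -> doc IDs. Good for "which docs contain X?". */
    static Map<String, List<Integer>> invert(String[] column) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < column.length; docId++) {
            postings.computeIfAbsent(column[docId], k -> new ArrayList<>()).add(docId);
        }
        return postings;
    }

    /** Terms aggregation: walk the matching doc IDs and count each doc's value. */
    static Map<String, Integer> termsAgg(String[] column, List<Integer> matchingDocs) {
        Map<String, Integer> buckets = new TreeMap<>();
        for (int docId : matchingDocs) {
            buckets.merge(column[docId], 1, Integer::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(invert(AUTHOR_DOC_VALUES));
        // {community=[1, 3], kleppmann=[2], mccandless=[0]}
        System.out.println(termsAgg(AUTHOR_DOC_VALUES, List.of(0, 1, 2, 3)));
        // {community=2, kleppmann=1, mccandless=1}
    }
}
```

The aggregation needs "given a doc, what is its value?", which is the lookup direction a column gives cheaply and an inverted index does not. The toy orders buckets by key for readability; a real terms aggregation orders them by doc_count by default.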

Step 7: Inspect Shards and Segments

Now look beneath the index at its physical shape.

curl -s "localhost:9200/_cat/shards/books?v"

index shard prirep state   docs store ip        node
books 0     p      STARTED    4  ...  127.0.0.1 runTask-0

One primary shard (p), no replicas (we set number_of_replicas: 0), in STARTED state. That one shard is a single Lucene index. Now the segments inside it:

curl -s "localhost:9200/_cat/segments/books?v"

index shard prirep ip        segment generation docs.count docs.deleted size searchable committed
books 0     p      127.0.0.1 _0               0          4            0  ...  true       false

You likely see one segment after a single bulk + refresh. Index more documents in separate batches (each with its own refresh) and re-run _cat/segments — you will see multiple segments appear. A force-merge collapses them:

curl -s -X POST "localhost:9200/books/_forcemerge?max_num_segments=1" >/dev/null
curl -s "localhost:9200/_cat/segments/books?v"
# now a single, larger segment

This is the merge process the engine runs continuously in the background. You will build a standalone Lucene index and watch the same segment/merge behavior with no OpenSearch at all in Lab 1.4.

Step 8: Shut Down

Stop the node with Ctrl+C in the terminal running ./gradlew run. Because the data directory is ephemeral, the next run starts empty.

Implementation Requirements

This lab has no code to implement. Deliverables:

GET /_cat/health returning green with your node running.
The books index created with an explicit mapping; _count returning 4.
A match query returning the expected titles, and a terms aggregation returning the author buckets.
A completed REST → handler → action mapping table (below) filled in from your own reading.
_cat/segments output before and after a force-merge, with one sentence explaining the change.

Fill this in by grepping the source (grep -rn "new Route" server/src/main/java/org/opensearch/rest and the action classes):

REST call	RestHandler class	TransportAction class
`GET /`	`RestMainAction`	(main/info)
`GET /_cat/health`	`RestHealthAction`	`TransportClusterHealthAction`
`PUT /books`	`RestCreateIndexAction`	`TransportCreateIndexAction`
`POST /books/_bulk`	`RestBulkAction`	`TransportBulkAction` → `TransportShardBulkAction`
`GET /books/_search`	`RestSearchAction`	`TransportSearchAction`
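
One way to hold the table in your head is as a route table keyed by method and path template. The sketch below is a toy: the handler names are the ones from the table, but RouteDemo, the {index} template syntax, and the matching logic are illustrative assumptions. The real dispatch lives in RestController and is more involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy route table; the real dispatch lives in RestController and is more involved. */
class RouteDemo {

    // Handler names come from the table above; the {index} template syntax is illustrative.
    static final Map<String, String> ROUTES = new LinkedHashMap<>();
    static {
        ROUTES.put("GET /", "RestMainAction");
        ROUTES.put("GET /_cat/health", "RestHealthAction");
        ROUTES.put("POST /{index}/_bulk", "RestBulkAction");
        ROUTES.put("GET /{index}/_search", "RestSearchAction");
        ROUTES.put("PUT /{index}", "RestCreateIndexAction");
    }

    /** Returns the handler for a request, or "no handler" when nothing matches. */
    static String resolve(String method, String path) {
        for (Map.Entry<String, String> route : ROUTES.entrySet()) {
            String[] parts = route.getKey().split(" ", 2);
            if (parts[0].equals(method) && matches(parts[1], path)) {
                return route.getValue();
            }
        }
        return "no handler";
    }

    /** Segment-by-segment match; a {name} segment accepts any non-empty value. */
    static boolean matches(String template, String path) {
        String[] t = template.split("/");
        String[] p = path.split("/");
        if (t.length != p.length) {
            return false;
        }
        for (int i = 0; i < t.length; i++) {
            boolean wildcard = t[i].startsWith("{") && t[i].endsWith("}");
            if (wildcard ? p[i].isEmpty() : !t[i].equals(p[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(resolve("PUT", "/books"));        // RestCreateIndexAction
        System.out.println(resolve("POST", "/books/_bulk")); // RestBulkAction
        System.out.println(resolve("DELETE", "/books"));     // no handler
    }
}
```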

Troubleshooting

`curl: (7) Failed to connect to localhost port 9200`

The node has not finished starting, or run failed. Check the ./gradlew run terminal for the started line; if it crashed, read the stack trace there.

`./gradlew run` exits immediately with a port-in-use error

java.net.BindException: Address already in use

A previous node or another service holds :9200/:9300. Stop the old process (./gradlew --stop won't kill a foreground run — use Ctrl+C; find strays with lsof -i :9200).

Bulk request returns `"errors": true`

Inspect the per-item errors:

curl -s -X POST "localhost:9200/books/_bulk" -H 'Content-Type: application/x-ndjson' \
  --data-binary @your.ndjson | jq '.items[] | select(.index.error) | .index.error'

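Per-item errors usually mean a document did not fit the mapping (for example, a non-numeric year). A malformed request behaves differently: a body without the trailing newline, or an unsupported Content-Type, is rejected as a whole with a top-level error instead of "errors": true. Use --data-binary, not -d (which strips newlines when it reads the body from a file).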
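
The NDJSON framing is easy to get right when it is generated instead of typed. The sketch below is a hypothetical helper, not an OpenSearch client API: BulkBody and build are made-up names, and it assumes IDs need no JSON escaping and that each source document is already a single-line JSON string.

```java
import java.util.List;

/** Hypothetical helper that frames a _bulk body as NDJSON; not an OpenSearch client API. */
class BulkBody {

    /** Each document becomes an action line plus a source line; every line ends in \n. */
    static String build(List<String[]> idAndSource) {
        StringBuilder body = new StringBuilder();
        for (String[] doc : idAndSource) {
            body.append("{\"index\":{\"_id\":\"").append(doc[0]).append("\"}}\n");
            body.append(doc[1]).append('\n'); // the source must stay on one line
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String body = build(List.of(
                new String[] {"1", "{\"title\":\"Lucene in Action\",\"year\":2010}"},
                new String[] {"2", "{\"title\":\"Mastering OpenSearch\",\"year\":2023}"}));
        System.out.print(body);
        System.out.println(body.endsWith("\n")); // true
    }
}
```

Because the newline is appended per line rather than used as a separator, the trailing newline the Bulk API requires is always present.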

Search returns zero hits right after indexing

You did not refresh. Documents are not searchable until a refresh opens a new searcher. Either POST /books/_refresh or wait for the periodic refresh (default ~1s). This is refresh/flush/merge in action.

Expected Output

The end-to-end happy path, condensed:

$ curl -s localhost:9200/_cat/health?h=status      ->  green
$ curl -s localhost:9200/books/_count | jq .count  ->  4
$ ... match "opensearch lucene"                     ->  2 hits
$ ... terms by_author                               ->  community:2 kleppmann:1 mccandless:1
$ curl -s localhost:9200/_cat/shards/books?h=prirep,state -> p STARTED
$ curl -s localhost:9200/_cat/segments/books?h=segment    -> _0  (one segment after forcemerge)

Stretch Goals

Force a yellow cluster. Recreate books with number_of_replicas: 1 and watch _cat/health go yellow (the replica cannot be allocated on a single node) and _cat/shards/books show an UNASSIGNED replica. Explain why — this is the shard-allocation story from Level 4.
Watch refresh make data visible. Index a doc with ?refresh=false, immediately search (zero hits), then _refresh and search again (one hit). Confirm the refresh boundary.
Profile the query. Add "profile": true to a _search body and read the per-shard, per-Lucene-query timing breakdown. This is the query phase exposed.
Find the route registration. For each row in your mapping table, find the new Route(...) or routes() declaration in the handler:
```
grep -rn "routes()\|new Route(" server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java
```

Validation / Self-check

You are done when you can answer these without notes:

Why does a single-node cluster go yellow when you ask for one replica, but green with zero?
Which handler and which transport action service POST /_bulk? Where does the write ultimately call into Lucene?
Why does a terms aggregation on a keyword field work but the same on a text field fail by default? What underlying Lucene structure does the aggregation read?
What does _refresh do at the Lucene level, and why is a document not searchable before it?
What is the relationship between an OpenSearch index, a shard, and a Lucene segment?
What does a force-merge do to _cat/segments, and why is that the same thing the engine does in the background?

Next: Lab 1.4 — Build a Minimal Lucene Index, where you build a shard by hand.

Lab 1.4: Project — Build a Minimal Lucene Index

Background

Every OpenSearch shard is a Lucene index. That sentence is repeated everywhere in this book, but it does not become real until you build a Lucene index with your own hands — no cluster, no REST, no org.opensearch.*, just the library underneath everything. In this build-it project you write a ~120-line standalone Java program that:

opens a Directory and an IndexWriter,
adds a few Documents,
commits, opens a DirectoryReader and an IndexSearcher,
runs a TermQuery and a BooleanQuery, and prints the hits,
observes segments and a merge.

Then you map each Lucene primitive to the OpenSearch class that wraps it — IndexWriter → InternalEngine, DirectoryReader/IndexSearcher → Engine.Searcher, the whole bundle → IndexShard. After this lab, "a shard is a Lucene index" is something you have seen, not something you have read.

Why This Lab Matters for Contributors

The single most useful mental model for OpenSearch internals is "OpenSearch is distribution + durability + an API on top of Lucene." This lab installs that model permanently.
When you trace the index path in Level 6 and the search path in Level 7, every Lucene call you see (IndexWriter.addDocument, IndexSearcher.search) will already be familiar.
Understanding segments and merges first-hand makes the refresh/flush/merge deep dive and engine internals read like review.

Prerequisites

Lab 1.1 complete, and you found the exact Lucene version your branch bundles (Stretch Goal 2 of that lab). You will use it here so your standalone program uses the same Lucene OpenSearch runs.
A JDK on your PATH (17+ is fine to compile/run this tiny program).

# Re-confirm the bundled Lucene version from inside your OpenSearch checkout:
grep -rn "lucene" gradle/libs.versions.toml 2>/dev/null \
  || grep -rn "lucene" buildSrc/version.properties 2>/dev/null \
  || ./gradlew :server:dependencies --configuration compileClasspath | grep -i lucene-core
# Note the version, e.g. 9.11.1 — call it LUCENE_VERSION below.

Step-by-Step Tasks

Step 1: Create the Project Skeleton

Work outside the OpenSearch repo so you do not pollute it:

mkdir -p ~/lucene-shard/src ~/lucene-shard/lib
cd ~/lucene-shard

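You need three Lucene JARs at the version OpenSearch bundles: lucene-core, lucene-queryparser, and lucene-analysis-common. The program in Step 3 compiles against lucene-core alone; the other two are for the stretch goals (the QueryParser and other analyzers). The artifact names changed at Lucene 9 (lucene-analyzers-common became lucene-analysis-common). This lab targets the Lucene 9.x line; if your branch bundles Lucene 10, expect a few API differences.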

Step 2: Obtain the Lucene JARs

You already have them — they are in your Gradle cache from building OpenSearch. Copy them so your classpath is self-contained:

# Replace 9.11.1 with YOUR LUCENE_VERSION from the Prerequisites.
LUCENE_VERSION=9.11.1
find ~/.gradle/caches -name "lucene-core-${LUCENE_VERSION}.jar"            -exec cp {} lib/ \;
find ~/.gradle/caches -name "lucene-queryparser-${LUCENE_VERSION}.jar"     -exec cp {} lib/ \;
find ~/.gradle/caches -name "lucene-analysis-common-${LUCENE_VERSION}.jar" -exec cp {} lib/ \;
ls -1 lib/
# lucene-analysis-common-9.11.1.jar
# lucene-core-9.11.1.jar
# lucene-queryparser-9.11.1.jar

Note: If a JAR is not in the cache, run a build that needs it (./gradlew :server:assemble), or download those three artifacts from Maven Central at the matching version. Using the same Lucene OpenSearch ships is the whole point — APIs differ across major Lucene versions.

Step 3: Write the Program

Create src/MiniShard.java. This is the full, runnable source — read it as you type it.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

import java.util.List;

/**
 * A minimal Lucene index: the same primitives OpenSearch wraps inside a single shard.
 *
 *   Directory        -> where segments live (we use an in-memory one)
 *   IndexWriter      -> wrapped by OpenSearch InternalEngine
 *   DirectoryReader  -> the point-in-time view a "refresh" opens
 *   IndexSearcher    -> wrapped by OpenSearch Engine.Searcher
 */
public class MiniShard {

    public static void main(String[] args) throws Exception {
        // 1. A "shard" needs a place to put its segments. In OpenSearch this is an
        //    FSDirectory on disk; here we use RAM so the program leaves no files.
        Directory dir = new ByteBuffersDirectory();

        // 2. The analyzer turns text into terms. OpenSearch's "standard" analyzer is this one.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);

        // 3. The IndexWriter is what InternalEngine.index(...) ultimately drives.
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            writer.addDocument(doc("1", "Lucene in Action",      "search", 2010));
            writer.addDocument(doc("2", "Mastering OpenSearch",  "search", 2023));
            writer.addDocument(doc("3", "Data-Intensive Apps",   "systems", 2017));
            writer.addDocument(doc("4", "The Definitive Guide",  "search", 2015));
            // commit() == a Lucene commit == what OpenSearch "flush" does (durability point).
            writer.commit();
            // IndexWriter.getSegmentCount() is not public, so count the segments
            // recorded in the latest commit point instead.
            System.out.println("Indexed 4 documents; segment count after first commit: "
                    + SegmentInfos.readLatestCommit(dir).size());

            // Add more in a second batch so a second segment forms, then force a merge.
            writer.addDocument(doc("5", "OpenSearch Internals",  "search", 2024));
            writer.commit();
            System.out.println("Segment count after second commit: "
                    + SegmentInfos.readLatestCommit(dir).size());
            writer.forceMerge(1); // what POST /_forcemerge?max_num_segments=1 triggers
            writer.commit();
            System.out.println("Segment count after forceMerge(1): "
                    + SegmentInfos.readLatestCommit(dir).size());
        }

        // 4. Open a point-in-time view of the last commit. An OpenSearch "refresh" also
        //    opens a new reader, but a near-real-time one taken from the IndexWriter, so
        //    it exposes docs that are indexed but not yet committed.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            System.out.println("Reader sees " + reader.numDocs() + " live docs across "
                    + reader.leaves().size() + " segment(s).");

            // 5a. A TermQuery on the analyzed 'title' field. Note: StandardAnalyzer lowercases,
            //     so we query the lowercased term.
            runQuery(searcher, "TermQuery title:opensearch",
                    new TermQuery(new Term("title", "opensearch")));

            // 5b. A TermQuery on the un-analyzed 'tag' StringField (exact match, like a keyword).
            runQuery(searcher, "TermQuery tag:search",
                    new TermQuery(new Term("tag", "search")));

            // 5c. A BooleanQuery: title contains 'opensearch' AND tag == 'search'.
            BooleanQuery bool = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("title", "opensearch")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("tag", "search")),       BooleanClause.Occur.MUST)
                    .build();
            runQuery(searcher, "BooleanQuery (title:opensearch AND tag:search)", bool);
        }
    }

    /** Build a document. 'title' is analyzed (TextField); 'tag' is exact (StringField). */
    private static Document doc(String id, String title, String tag, int year) {
        Document d = new Document();
        d.add(new StringField("id", id, Field.Store.YES));      // exact, stored
        d.add(new TextField("title", title, Field.Store.YES));  // analyzed, stored
        d.add(new StringField("tag", tag, Field.Store.YES));    // exact, stored
        d.add(new IntPoint("year", year));                      // numeric, range-searchable
        return d;
    }

    private static void runQuery(IndexSearcher searcher, String label, Query q) throws Exception {
        TopDocs top = searcher.search(q, 10);
        System.out.printf("%n[%s] -> %d hit(s)%n", label, top.totalHits.value);
        List<ScoreDoc> hits = List.of(top.scoreDocs);
        for (ScoreDoc sd : hits) {
            Document d = searcher.storedFields().document(sd.doc);
            System.out.printf("    id=%s  title=%-24s  score=%.3f%n",
                    d.get("id"), d.get("title"), sd.score);
        }
    }
}

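Note: Two Lucene 9 details that bite people coming from older examples. First, stored-field access is searcher.storedFields().document(docId); this API arrived in Lucene 9.5, and the old searcher.doc(id) / reader.document(id) were deprecated from that point and removed in a later major version. Second, the analysis artifacts were renamed at Lucene 9 (lucene-analyzers-common became lucene-analysis-common), while StandardAnalyzer itself stays in org.apache.lucene.analysis.standard and ships inside lucene-core. If your bundled Lucene is a different major version, adjust the affected call sites (in Lucene 10, for example, TotalHits is a record, so top.totalHits.value becomes top.totalHits.value()). The concepts are identical.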

Step 4: Compile and Run

cd ~/lucene-shard
# Compile against the Lucene JARs:
javac -cp "lib/*" -d out src/MiniShard.java

# Run with the same classpath plus your compiled class:
java -cp "out:lib/*" MiniShard

Expected output (scores vary slightly by Lucene version):

Indexed 4 documents; segment count after first commit: 1
Segment count after second commit: 2
Segment count after forceMerge(1): 1
Reader sees 5 live docs across 1 segment(s).

[TermQuery title:opensearch] -> 2 hit(s)
    id=2  title=Mastering OpenSearch     score=...
    id=5  title=OpenSearch Internals     score=...

[TermQuery tag:search] -> 4 hit(s)
    id=1  title=Lucene in Action         score=1.000
    id=2  title=Mastering OpenSearch     score=1.000
    id=4  title=The Definitive Guide     score=1.000
    id=5  title=OpenSearch Internals     score=1.000

[BooleanQuery (title:opensearch AND tag:search)] -> 2 hit(s)
    id=2  title=Mastering OpenSearch     score=...
    id=5  title=OpenSearch Internals     score=...

You just built, committed, merged, and queried a Lucene index — a shard, by hand.

Step 5: Map Lucene Primitives to OpenSearch

Now connect what you wrote to where OpenSearch wraps it. Verify each mapping with a grep in your OpenSearch checkout — do not take the table on faith.

Lucene primitive (your program)	OpenSearch wrapper	Find it
`Directory` (`ByteBuffersDirectory` / `FSDirectory`)	`Store` / the shard's data path	`grep -rn "class Store" server/src/main/java/org/opensearch/index/store/Store.java`
`IndexWriter` + `IndexWriterConfig`	`InternalEngine` (holds the `IndexWriter`)	`grep -n "IndexWriter" server/src/main/java/org/opensearch/index/engine/InternalEngine.java`
`writer.addDocument(...)`	`InternalEngine.index(...)` ← `IndexShard.applyIndexOperationOnPrimary`	`grep -n "addDocument\|updateDocument" server/src/main/java/org/opensearch/index/engine/InternalEngine.java`
`writer.commit()` (durability)	engine flush + the translog	`grep -n "flush\|commitIndexWriter" server/src/main/java/org/opensearch/index/engine/InternalEngine.java`
`DirectoryReader.open(...)` (visibility)	engine refresh → opens a new `Engine.Searcher`	`grep -n "refresh\|ReferenceManager\|SearcherManager" server/src/main/java/org/opensearch/index/engine/InternalEngine.java`
`IndexSearcher`	`Engine.Searcher` (acquired via `IndexShard.acquireSearcher(...)`)	`grep -rn "class Searcher" server/src/main/java/org/opensearch/index/engine/Engine.java`
`forceMerge(1)`	`IndexShard.forceMerge(...)` → engine `forceMerge`	`grep -n "forceMerge" server/src/main/java/org/opensearch/index/shard/IndexShard.java`
The whole bundle	`IndexShard` (one shard)	`grep -n "class IndexShard" server/src/main/java/org/opensearch/index/shard/IndexShard.java`

Read the class-level Javadoc of InternalEngine with this table in hand — it will now read as "a managed, recoverable, refreshable wrapper around exactly the IndexWriter/DirectoryReader I just used." That is the engine. Detail lives in the engine internals deep dive and the IndexShard lifecycle deep dive.

Step 6: Reflect on Segments and Merges

Your program printed the segment count going 1 → 2 → 1. Re-read what happened and connect it to OpenSearch:

Each commit() flushed buffered documents into a new immutable segment. Segments are never edited in place: an update is a delete plus a re-add, and a delete is a tombstone that marks the old document as gone.
forceMerge(1) rewrote all live documents into one segment, reclaiming space from deletes. In OpenSearch, Lucene's MergePolicy and MergeScheduler do this in the background as new segments accumulate, so the number of segments stays bounded without you asking.
A refresh in OpenSearch == opening a new near-real-time reader == making newly indexed docs visible to search, whether or not they are committed yet. (Your program reopens from the Directory instead, so it only sees committed docs.) A flush == a Lucene commit + trimming the translog == the durability point. This is precisely the lifecycle in the refresh/flush/merge deep dive.

This is the same behavior you watched over REST in Lab 1.3, Step 7 via _cat/segments and _forcemerge — except now you have seen it from inside, with nothing distributed in the way.
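
The lifecycle in this step can also be modeled without Lucene at all. The sketch below is a toy with hypothetical names (SegmentDemo and its methods); it models only visibility and segment count, and deliberately leaves out commits, the translog, and deletes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of buffer, segments, refresh, and merge; not Lucene or OpenSearch code. */
class SegmentDemo {

    private final List<String> buffer = new ArrayList<>();         // indexed, not yet searchable
    private final List<List<String>> segments = new ArrayList<>(); // immutable once written

    void index(String doc) {
        buffer.add(doc);
    }

    /** Turns the buffer into a new segment, which makes those docs searchable. */
    void refresh() {
        if (!buffer.isEmpty()) {
            segments.add(List.copyOf(buffer));
            buffer.clear();
        }
    }

    /** Rewrites all segments into one; existing segments are replaced, never edited. */
    void forceMerge() {
        List<String> merged = new ArrayList<>();
        segments.forEach(merged::addAll);
        segments.clear();
        if (!merged.isEmpty()) {
            segments.add(List.copyOf(merged));
        }
    }

    int segmentCount() {
        return segments.size();
    }

    /** Search sees segments only, never the buffer. */
    long searchableDocs() {
        return segments.stream().mapToLong(List::size).sum();
    }

    public static void main(String[] args) {
        SegmentDemo shard = new SegmentDemo();
        for (int i = 1; i <= 4; i++) {
            shard.index("doc" + i);
        }
        System.out.println(shard.searchableDocs()); // 0 (nothing refreshed yet)
        shard.refresh();
        shard.index("doc5");
        shard.refresh();
        System.out.println(shard.segmentCount());   // 2
        shard.forceMerge();
        System.out.println(shard.segmentCount() + " " + shard.searchableDocs()); // 1 5
    }
}
```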

Implementation Requirements

Deliverables:

MiniShard.java compiles against the same Lucene version OpenSearch bundles and runs.
Program output shows the segment count going 1 → 2 → 1 across the two commits and the force-merge.
All three queries (two TermQuery, one BooleanQuery) return the expected hit counts (2, 4, 2).
The Lucene→OpenSearch mapping table above, with each row's grep run and the file/line you found pasted in.
A written reflection (4–6 sentences) answering: why is "a shard is a Lucene index" the right mental model, and what does OpenSearch add on top of Lucene that this program does not?

Troubleshooting

`error: cannot find symbol` on `storedFields()` or `StandardAnalyzer`

Your Lucene version differs from what these calls assume. storedFields() exists from Lucene 9.5 onward; older lines use searcher.doc(id). StandardAnalyzer is in the org.apache.lucene.analysis.standard package and, in Lucene 9, ships inside lucene-core; much older lines packaged it in a separate analyzers JAR. Match your imports to your JARs: run jar tf lib/lucene-core-*.jar | grep -i StandardAnalyzer if unsure where a class lives.

`NoClassDefFoundError` at runtime

Your runtime classpath is missing a JAR. In Lucene 9, everything MiniShard imports (including StandardAnalyzer) is in lucene-core; lucene-analysis-common and lucene-queryparser only come into play once you add other analyzers or the QueryParser stretch goal. Confirm the JARs are on -cp: java -cp "out:lib/*" MiniShard (the lib/* glob must be quoted so the shell does not expand it).

`TermQuery title:opensearch` returns 0 hits

StandardAnalyzer lowercases. The indexed term is opensearch, not OpenSearch. A TermQuery does not analyze its input — it matches the raw term. Query the lowercased form (as the code does). This is exactly why OpenSearch distinguishes text (analyzed) from keyword (not), which you saw in Lab 1.3.
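
The asymmetry between the two query styles can be reproduced with a toy analysis chain. The sketch below is an illustration, not StandardAnalyzer: AnalyzeDemo and its methods are hypothetical, and the "analyzer" simply splits on non-alphanumeric characters and lowercases.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: split on non-alphanumerics, then lowercase. Not StandardAnalyzer. */
class AnalyzeDemo {

    static Set<String> analyze(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Like a TermQuery: the input is compared to indexed terms exactly as given. */
    static boolean termQuery(Set<String> indexedTerms, String rawTerm) {
        return indexedTerms.contains(rawTerm);
    }

    /** Like a match query: the input goes through the same analysis as the field. */
    static boolean matchQuery(Set<String> indexedTerms, String queryText) {
        for (String term : analyze(queryText)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> indexed = analyze("Mastering OpenSearch"); // terms: mastering, opensearch
        System.out.println(termQuery(indexed, "OpenSearch"));  // false
        System.out.println(termQuery(indexed, "opensearch"));  // true
        System.out.println(matchQuery(indexed, "OpenSearch")); // true
        System.out.println(matchQuery(indexed, "search"));     // false
    }
}
```

The last line shows the other half of the rule: analysis produces whole terms, so a query for search does not match a document whose only related term is opensearch.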

Different scores than shown

BM25 scores depend on document/term statistics and Lucene version. Hit counts are what you validate; exact scores will differ and that is fine.

Stretch Goals

Add a numeric range query. You stored year as an IntPoint. Add an IntPoint.newRangeQuery("year", 2015, 2025) and print the hits. This is the Lucene primitive behind OpenSearch's range query on numeric fields.
Delete and observe. Open a fresh IndexWriter on the same directory, call writer.deleteDocuments(new Term("id", "3")), commit, and re-check numDocs() vs reader.maxDoc(). The gap is "deleted but not yet merged away" — exactly docs.deleted in _cat/segments.
Switch to an on-disk Directory. Replace ByteBuffersDirectory with FSDirectory.open(Path.of("./idx")), run the program, and inspect the files it writes (ls -la idx/). You will see .cfs/.si/.cfe segment files — the literal bytes an OpenSearch shard stores on disk.
Use the QueryParser. Pull in lucene-queryparser and parse "title:opensearch AND tag:search" with QueryParser instead of building the BooleanQuery by hand. The OpenSearch query_string query is conceptually this parser, wired into the DSL.

Validation / Self-check

You are done when you can answer these without notes:

In one sentence: what is an OpenSearch shard, in Lucene terms?
Which OpenSearch class holds the Lucene IndexWriter, and which method drives addDocument?
What does an OpenSearch refresh correspond to in your program, and what does a flush correspond to?
Why does a TermQuery for OpenSearch (capital O) miss, while a match query over a text field for "OpenSearch" hits? What component is responsible for the difference?
What is a segment, why is it immutable, and what does a merge accomplish?
Name three things OpenSearch adds on top of the bare Lucene primitives in this program (hint: durability across crashes, distribution across nodes, an API).

You have now seen OpenSearch from the REST surface (Lab 1.3) down to the Lucene floor (this lab). Proceed to Level 2 — OpenSearch Contributor Onboarding to learn how to turn understanding into merged pull requests.

Level 2: OpenSearch Contributor Onboarding

Level 1 gave you a working OpenSearch you can build, test, and run. This level turns that capability into merged pull requests. You will learn how OpenSearch is actually contributed to: the GitHub-based model (issues and PRs, no JIRA), the DCO sign-off that replaces a CLA, the mandatory CHANGELOG.md entry, the precommit/CI gate, the review loop, and the backport mechanism. By the end you will have walked a trivial change, a realistic good-first-issue fix, and a code review from the maintainer's side of the table.

This is deliberately a workflow level. The contributions are surgical — a doc fix, a validation message, a missing test. Nothing here will surprise a reviewer. That is the point: you drill the mechanics now so that Levels 3–9 can be about the engine.

Learning Objectives

By the end of Level 2 you must be able to:

Describe the OpenSearch contribution model end-to-end: issue → fork → branch → signed commit → PR → CI → review → merge → backport.
Find and qualify a good first issue without stepping on another contributor.
Make a clean, signed (git commit -s) commit that passes the DCO check.
Add a correct CHANGELOG.md entry under ## [Unreleased].
Run ./gradlew spotlessApply and ./gradlew precommit locally so CI passes on the first push.
Open a PR that satisfies the PR template, understand each CI check, and respond to review without thrashing.
Explain what the backport 2.x label does and when to apply it.

The OpenSearch Contribution Model

OpenSearch development happens entirely on GitHub at github.com/opensearch-project/OpenSearch. There is no Apache JIRA, no patch files, and no CLA. If you have contributed to an Apache project (like the Tez curriculum's JIRA + .patch flow), unlearn that here — OpenSearch is a fork-and-pull-request project.

The pieces you must internalize:

Artifact	What it is	Where
Issue	A bug, enhancement, or task. The unit of "what should change."	GitHub Issues, labeled (`good first issue`, `bug`, `flaky-test`, …).
Pull Request (PR)	Your proposed change, against `main`.	GitHub PRs, from your fork's branch.
DCO	Developer Certificate of Origin. Replaces a CLA. Asserted by a `Signed-off-by` line.	Every commit: `git commit -s`. Enforced by the DCO check.
`CHANGELOG.md`	A human-readable log; every user-facing PR adds one line.	Under `## [Unreleased]`, in an `Added`/`Changed`/`Fixed`/… section.
`MAINTAINERS.md`	Who can approve/merge, and who owns which areas.	Repo root.
PR template	`.github/pull_request_template.md` — the checklist your PR must satisfy.	Auto-filled into the PR body.
SPDX header	The Apache-2.0 license header every source file carries.	Top of every `.java` file (precommit enforces it).

The governance context — TSC, the OpenSearch Software Foundation under the Linux Foundation, the forum and Slack — is covered in release-governance and community interaction. For contribution mechanics, GitHub + DCO + CHANGELOG is all you need.

DCO, not CLA

The DCO is a lightweight assertion — by signing off, you certify you wrote the change or have the right to submit it under the project's license. You assert it per-commit:

git commit -s -m "Your message"

-s appends a trailer:

Signed-off-by: Your Name <your.email@example.com>

The name and email must match your git config user.name / user.email. A bot enforces this on every PR; a commit without a valid Signed-off-by blocks the merge. There is no separate agreement to sign — this line is the agreement. You will do this for real in Lab 2.2.
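As a minimal sketch of what the DCO check looks for, the helper below tests whether a commit message contains a Signed-off-by line for a given author. The function name has_signoff and the sample message are hypothetical, written for this illustration; the real check runs on GitHub and may apply additional rules.

```shell
# has_signoff AUTHOR MESSAGE
# Hypothetical helper (not part of OpenSearch or git): succeeds when MESSAGE
# contains a line that is exactly "Signed-off-by: AUTHOR".
has_signoff() {
  printf '%s\n' "$2" | grep -qxF "Signed-off-by: $1"
}

# Illustrative commit message, using the trailer format shown above.
sample_message='Your message

Signed-off-by: Your Name <your.email@example.com>'

if has_signoff 'Your Name <your.email@example.com>' "$sample_message"; then
  echo "sign-off matches"
else
  echo "sign-off missing or mismatched"
fi
```

Inside a repository you could feed it the latest commit: has_signoff "$(git log -1 --format='%an <%ae>')" "$(git log -1 --format=%B)".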

The CHANGELOG entry

OpenSearch keeps a CHANGELOG.md (Keep-a-Changelog style). Almost every PR must add one line under the ## [Unreleased] heading (in the file the heading carries a version suffix, as in the example here), in the right subsection, with a link to the PR:

## [Unreleased 3.x]
### Added
- Add `foo` parameter to the `_bar` API ([#12345](https://github.com/opensearch-project/OpenSearch/pull/12345))
### Fixed
- Fix misleading validation message in `RestSomethingAction` ([#12346](https://github.com/opensearch-project/OpenSearch/pull/12346))

A missing CHANGELOG entry is the single most common reason a first PR gets a "please add a changelog entry" comment and an extra review round. (Purely internal changes — e.g. a test-only refactor — can sometimes use the "skip changelog" label, but default to adding one.)
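To check your own entry before pushing, a rough sketch like the following lists the bullet lines that sit under an Unreleased heading. The function name unreleased_entries, the scratch file, and the `## [2.x]` heading in the sample are assumptions made for illustration; the CI changelog check itself may work differently.

```shell
# unreleased_entries FILE
# Hypothetical helper: print the bullet lines that sit under any
# "## [Unreleased ...]" heading of a Keep-a-Changelog style file.
unreleased_entries() {
  awk '
    /^## \[Unreleased/   { in_section = 1; next }  # entering an Unreleased section
    /^## \[/             { in_section = 0 }        # any other release heading ends it
    in_section && /^- /  { print }
  ' "$1"
}

# Illustrative file; the PR numbers are the placeholders used in the example entry.
changelog="$(mktemp)"
cat > "$changelog" <<'EOF'
## [Unreleased 3.x]
### Added
- Add `foo` parameter to the `_bar` API ([#12345](https://github.com/opensearch-project/OpenSearch/pull/12345))
### Fixed
- Fix misleading validation message in `RestSomethingAction` ([#12346](https://github.com/opensearch-project/OpenSearch/pull/12346))

## [2.x]
- An entry that already shipped
EOF

entry_count="$(unreleased_entries "$changelog" | wc -l | tr -d ' ')"
echo "entries under Unreleased: $entry_count"
rm -f "$changelog"
```

Run against the real file as unreleased_entries CHANGELOG.md | grep "your PR number" to confirm your line landed in the right section.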

Finding a Good First Issue

Restrict yourself, at this level, to issues labeled good first issue. They are curated to be scoped, self-contained, and low-risk.

# Browser:
#   https://github.com/opensearch-project/OpenSearch/issues?q=is:open+label:%22good+first+issue%22

# gh CLI — list and inspect:
gh issue list --repo opensearch-project/OpenSearch \
  --label "good first issue" --state open --limit 30

gh issue view 12345 --repo opensearch-project/OpenSearch --comments

Qualify before you claim. A good Level 2 issue:

has the good first issue label and is not assigned to anyone;
has no open linked PR (gh issue view shows linked PRs);
has a clear, bounded description — you can state the fix in one sentence;
has no unresolved design debate in the comments.

When you find one, leave a short comment ("I'd like to work on this") before you start, so you do not duplicate someone's in-flight work. Etiquette is detailed in Community Interaction.

Warning: Do not open a PR and then find out two other people already did. Read the full comment thread and the linked-PR list first. This is the same discipline as the Tez "read all comments before claiming" rule — the platform changed, the etiquette did not.

The PR Lifecycle

flowchart TD
    A["Find/claim a good first issue"] --> B["Fork opensearch-project/OpenSearch"]
    B --> C["git clone your fork; add upstream remote"]
    C --> D["git checkout -b fix/issue-12345"]
    D --> E["Make the surgical change"]
    E --> F["Add a CHANGELOG.md entry under [Unreleased]"]
    F --> G["./gradlew spotlessApply"]
    G --> H["./gradlew precommit + scoped tests"]
    H --> I["git commit -s  (DCO sign-off)"]
    I --> J["git push origin fix/issue-12345"]
    J --> K["Open PR against main; fill the template"]
    K --> L["CI runs: gradle-check, assemble, precommit, DCO, CHANGELOG"]
    L -->|red| H
    L -->|green| M["Maintainer review"]
    M -->|changes requested| E
    M -->|approved| N["Maintainer merges (squash) to main"]
    N --> O["Add 'backport 2.x' label -> bot opens backport PR"]

The full state machine in words:

Fork the repo to your account; clone your fork; add the canonical repo as the upstream remote so you can keep main current.
Branch from an up-to-date main (fix/... or feature/...).
Change exactly what the issue asks — nothing more (scope creep kills first PRs).
CHANGELOG: add one line under ## [Unreleased].
Format and gate locally: spotlessApply, then precommit, then the scoped tests for what you touched. (From Lab 1.2.)
Commit with -s so the DCO check passes.
Push and open a PR against main; complete the PR template.
CI runs the checks (below). Green is required.
Review: a maintainer (from MAINTAINERS.md) reviews. Respond to every comment; push more signed commits (do not force-push away the review history unless asked).
Merge: a maintainer squash-merges. If the fix belongs on the maintenance line, they (or you, if you have rights) add the backport 2.x label and a bot opens the backport PR.

The CI checks you will see

Check	What it verifies	Local equivalent
DCO	Every commit has a valid `Signed-off-by`.	`git commit -s`
Changelog	A `CHANGELOG.md` entry was added (or the skip label applied).	edit `CHANGELOG.md`
gradle-check	The big one: builds + runs the relevant tests + precommit across affected projects.	`./gradlew :server:check` (subset)
assemble	The distribution still builds.	`./gradlew assemble`
precommit	Static analysis: checkstyle, forbidden APIs, SPDX headers, etc.	`./gradlew precommit`

Running spotlessApply + precommit + the scoped tests locally before you push is the single biggest lever on review speed. Detail on quality expectations is in PR Quality and Preparation.
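One way to make that habit automatic is a small wrapper that runs the gates in order and stops at the first failure. This is a sketch under assumptions: run_gates and the GRADLE override variable are hypothetical names, and the scoped tests are left out because they depend on what you touched.

```shell
# run_gates
# Hypothetical wrapper: run the local gates in order, stopping at the first failure.
# GRADLE defaults to the repository wrapper; override it to dry-run the sequence.
run_gates() {
  gradle="${GRADLE:-./gradlew}"
  for task in spotlessApply precommit; do
    echo "== $task =="
    if ! "$gradle" "$task"; then
      echo "gate failed: $task" >&2
      return 1
    fi
  done
  echo "all gates passed"
}
```

From the repo root, call run_gates and then run the scoped tests for the classes you changed. GRADLE=echo run_gates prints the task names without invoking Gradle, which is handy for checking the order.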

Deliverables

Demonstrate all of the following before advancing to Level 3:

A fork of opensearch-project/OpenSearch with upstream configured, and a topic branch off an up-to-date main (Lab 2.1, Lab 2.2).
A signed commit whose Signed-off-by matches your git identity (a passing local DCO check).
A correct CHANGELOG.md entry under ## [Unreleased].
A clean local ./gradlew spotlessApply + ./gradlew precommit.
A walked good-first-issue fix with a unit test using assertThrows/assertEquals (Lab 2.3).
A completed review of the flawed example PR — you found every issue a maintainer would flag (Lab 2.4).
A written explanation of what backport 2.x does and when to apply it.

Common Mistakes

Mistake	Consequence	Fix
Forgetting `-s` on the commit	DCO check fails; PR blocked	`git commit -s`; fix history with `git commit --amend -s` or `git rebase --signoff`.
`Signed-off-by` email ≠ git email	DCO check fails	Set `git config user.email` to match; re-sign.
No `CHANGELOG.md` entry	Changelog check fails; extra review round	Add one line under `## [Unreleased]` linking your PR.
Skipping `spotlessApply`	precommit/CI fails on formatting	Run `./gradlew spotlessApply` before committing.
Scope creep ("while I'm here…")	Reviewer asks for a split; PR stalls for weeks	One logical change per PR. File a follow-up issue for the rest.
Force-pushing over review history	Reviewers lose the thread of what changed	Add new signed commits during review; squash happens at merge.
Opening a PR for a claimed issue	Duplicate work, community friction	Check assignee and linked PRs; comment before starting.
Treating the PR template as boilerplate	Missing checkboxes; reviewer asks you to redo it	Fill it honestly — tests, CHANGELOG, related issues, DCO.

PR Profile: Level 2 Graduate

A Level 2 graduate can credibly open these PRs end-to-end, with passing CI on the first or second push:

PR type	Example	Test requirement
Doc/example fix	Correct an outdated `curl` or a wrong Javadoc `@param`	None — docs only; still needs DCO + (often) CHANGELOG
Validation-message fix	Clarify a confusing error from a `*Request.validate()` or a REST parser	A unit test asserting the new message (Lab 2.3)
Missing-assertion / test clarity	Add an `assertEquals` with a message to an under-checked test	Re-run the test class
Misleading log line	Convert a string-concat `LOG.info` to placeholders; add context	Usually none; manual run noted in the PR

You are not yet ready to submit: changes to coordination, allocation, the engine, the search path, or anything touching the wire protocol / BWC. Those are Levels 3–9. What you are ready for is the thing most contributors get wrong: a clean, focused, well-tested, properly-signed PR with a CHANGELOG entry that a maintainer can merge without a five-round back-and-forth.

Next: Lab 2.1 — Navigate the OpenSearch Repository Structure.

Lab 2.1: Navigate the OpenSearch Repository Structure

Background

The OpenSearch repository is large — hundreds of Gradle projects, tens of thousands of Java files. A contributor who cannot navigate it quickly wastes hours and, worse, edits the wrong layer. This lab is a guided tour: you will use ./gradlew projects, find, and grep to build a durable map of the codebase, learn to locate any class in seconds, read a build.gradle to understand inter-project dependencies, and recognize the SPDX header convention that every source file carries (and that precommit enforces).

This is a reading and orientation lab. You will not change any code — but the muscle memory you build here is what makes every later lab fast.

Why This Lab Matters for Contributors

"Where does this live?" is the question you answer dozens of times per PR. Answering it in seconds instead of minutes compounds.
Editing the right layer matters: a fix in :server vs :libs:core vs a :modules:* has different review owners, BWC implications, and test scopes.
Reading build.gradle dependency blocks tells you what you are allowed to call from where — and why some "obvious" imports are forbidden.
The SPDX header is a hard precommit gate; knowing it up front saves a failed CI run.

Prerequisites

Lab 1.1 complete; you can run ./gradlew.
A clone of opensearch-project/OpenSearch (your fork is fine; see Lab 2.2 for fork setup).

Step-by-Step Tasks

Step 1: The Top-Level Map

From the repo root, list directories and the Gradle project tree side by side:

ls -d */ | sort
./gradlew projects 2>/dev/null | sed -n '1,60p'

The directories you must know:

Dir	Gradle project(s)	What lives there
`server/`	`:server`	The core engine. `org.opensearch.*`. The bulk of what you read and change.
`libs/`	`:libs:core`, `:libs:common`, `:libs:x-content`, `:libs:geo`, …	Shared low-level libraries with no dependency on `:server`. `libs/core` holds `StreamInput`/`StreamOutput`/`Writeable` (the wire primitives).
`modules/`	`:modules:transport-netty4`, `:modules:lang-painless`, `:modules:analysis-common`, `:modules:reindex`, …	Modules bundled in every distribution by default.
`plugins/`	`:plugins:analysis-icu`, `:plugins:repository-s3`, `:plugins:discovery-ec2`, …	Optional, in-repo plugins. (Security, k-NN, SQL, alerting, ml-commons live in separate repos.)
`client/`	`:client:rest`, `:client:sniffer`, …	Java clients. (The modern client is the separate `opensearch-java` repo.)
`distribution/`	`:distribution:archives:*`, `:distribution:docker`, `:distribution:packages:*`	Packaging: tarballs, Docker, deb/rpm, `distribution/tools`.
`test/framework/`	`:test:framework`	`OpenSearchTestCase`, `OpenSearchIntegTestCase`, `InternalTestCluster`, disruption helpers.
`qa/`	`:qa:*`	Cross-version BWC, rolling-upgrade, mixed-cluster, packaging QA.
`rest-api-spec/`	`:rest-api-spec`	REST API JSON specs and shared REST-YAML tests.
`buildSrc/`, `build-tools*/`	(build logic)	Gradle plugins and build conventions.
`sandbox/`	`:sandbox:*`	Experimental modules/plugins.

Note: :server is allowed to depend on libs/*, but libs/* must not depend on :server — the dependency arrow only points one way. That is why core serialization primitives live in libs/core: so everything (including :server) can use them. You will see this enforced by the build.gradle files in Step 5.

Step 2: Walk `server/`

:server is where you will spend most of your time. Get its internal shape:

ls server/src/main/java/org/opensearch | sort

The packages map directly to subsystems you will study:

Package	Subsystem	Where it appears
`node`	The `Node` object graph root	Level 1
`rest`	REST layer (`RestController`, `BaseRestHandler`)	Level 3, rest-layer deep dive
`action`	Transport actions (the execution units)	Level 3, action-framework deep dive
`transport`	Node-to-node transport	transport-layer deep dive
`cluster`	Cluster state, coordination, routing, allocation	Level 4
`indices`, `index`	`IndicesService`, `IndexShard`, `engine`, `translog`, `mapper`	Level 6
`search`	Query/fetch phases, aggregations	Level 7
`common`	Utilities, `settings`, `io`, `unit`	everywhere

# How big is the engine, roughly?
find server/src/main/java -name "*.java" | wc -l
# Where is each subsystem rooted?
find server/src/main/java/org/opensearch/index/engine -name "*.java" | head

Step 3: Source Sets — main vs test vs internalClusterTest

Each project has multiple source sets. Know which one a file is in before you edit it:

ls -d server/src/*/
# server/src/main/                -> production code
# server/src/test/                -> unit tests (OpenSearchTestCase, ...)
# server/src/internalClusterTest/ -> in-JVM multi-node integration tests (*IT)

This matters because the Gradle task differs (:server:test vs :server:internalClusterTest, from Lab 1.2) and because production code must never depend on test code.
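The mapping from source set to Gradle task can be written down as a small lookup. The sketch below is illustrative only: task_for_path is a hypothetical helper, it covers just the three :server source sets listed above, and ExampleTests / ExampleIT are placeholder class names.

```shell
# task_for_path PATH
# Hypothetical helper: map a server/ source path to the Gradle task that runs it.
task_for_path() {
  case "$1" in
    server/src/test/*)                echo ":server:test" ;;
    server/src/internalClusterTest/*) echo ":server:internalClusterTest" ;;
    server/src/main/*)                echo "(production code, no test task)" ;;
    *)                                echo "(unknown source set)" ;;
  esac
}

for p in \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java \
  server/src/test/java/org/opensearch/ExampleTests.java \
  server/src/internalClusterTest/java/org/opensearch/ExampleIT.java; do
  echo "$p -> $(task_for_path "$p")"
done
```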

Step 4: Locate Any Class

The core skill. Three reliable methods, fastest first:

# 1. By file name — when you know the class name:
find server -name "IndexShard.java" -path "*/main/*"
#   server/src/main/java/org/opensearch/index/shard/IndexShard.java

# 2. By declaration — when you are not sure of the file name or want the exact site:
grep -rn "class TransportSearchAction" server/src/main/java | head

# 3. By usage — when you want callers of something:
grep -rn "applyIndexOperationOnPrimary" server/src/main/java | head

Practice until it is reflex. Find each of these and note the path:

for c in Node ClusterService RestController IndicesService IndexShard \
         InternalEngine SearchService StreamInput Writeable; do
  echo "== $c =="
  find server libs -name "$c.java" -path "*/main/*"
done

Note that StreamInput and Writeable resolve under libs/core, not server — the wire primitives live in the shared library, exactly as Step 1 predicted.

Step 5: Read a `build.gradle` for Dependencies

A project's build.gradle declares what it may depend on. This is how you learn the allowed direction of imports. Read the server's:

sed -n '1,80p' server/build.gradle
grep -nE "api project|implementation project|testImplementation project" server/build.gradle

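You will see :server declaring dependencies on the core, common, and x-content libraries (in the build file their Gradle project paths may carry an opensearch- prefix, for example :libs:opensearch-core; trust your own output) and on the Lucene artifacts. You will not see any libs/* project depending on :server. Now read a library's build file to confirm the one-way arrow: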

grep -nE "project\(" libs/core/build.gradle
# libs:core depends only on other libs / external jars — never on :server

This dependency structure is why certain code lives where it does. A Writeable in libs/core can be used by :server, every module, and every plugin; if it lived in :server, the libraries below it could not use it. Keep this in mind when a reviewer says "this belongs in libs, not server."
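The one-way rule can also be checked mechanically. The following is a sketch, assuming dependencies are declared with the project('...') syntax the greps above look for; server_refs is a hypothetical helper, and the scratch directory stands in for the real repository so the example runs anywhere.

```shell
# server_refs DIR
# Hypothetical check: list build.gradle lines under DIR that declare a
# dependency on the :server project. No output (exit status 1) means none.
server_refs() {
  grep -rnE --include=build.gradle "project\(['\"]:server['\"]" "$1"
}

# Illustrative layout in a scratch directory, not the real repository.
scratch="$(mktemp -d)"
mkdir -p "$scratch/libs/core" "$scratch/modules/example"
echo "api project(':libs:common')"    > "$scratch/libs/core/build.gradle"
echo "compileOnly project(':server')" > "$scratch/modules/example/build.gradle"

if server_refs "$scratch/libs" > /dev/null; then
  libs_result="libs depends on :server"
else
  libs_result="libs is clean"
fi
module_hits="$(server_refs "$scratch/modules" | wc -l | tr -d ' ')"

echo "$libs_result"
echo "module build files referencing :server: $module_hits"
rm -rf "$scratch"
```

In the real repository, server_refs libs should print nothing, while server_refs modules is expected to produce hits, since modules are allowed to build on :server.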

Tip: To see the resolved dependency graph (including transitive Lucene/Netty/etc.): ./gradlew :server:dependencies --configuration compileClasspath | head -60.

Step 6: The SPDX Header Convention

Every OpenSearch source file carries an SPDX license header. New files must include it or precommit fails. Inspect a real one:

head -20 server/src/main/java/org/opensearch/index/shard/IndexShard.java

The canonical header for a new file is:

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

Older files that predate the fork also carry an Apache-2.0 attribution block referencing the original Elasticsearch copyright (the fork preserved upstream attribution). You do not remove those. For any new file you create (e.g. a new test in Lab 2.3), copy the SPDX header from a sibling file in the same directory.

Confirm precommit cares:

grep -rn "licenseHeaders\|forbiddenApis\|spotless" server/build.gradle build-tools*/ buildSrc/ 2>/dev/null | head
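Before running the full precommit, you can spot-check new files yourself. This is a rough sketch, not the rule precommit actually applies: has_spdx_header is a hypothetical helper that only looks for the identifier line within the first ten lines, and the two Java files are throwaway samples.

```shell
# has_spdx_header FILE
# Hypothetical check: succeeds when the SPDX identifier appears in the first 10 lines.
has_spdx_header() {
  head -n 10 "$1" | grep -qF 'SPDX-License-Identifier: Apache-2.0'
}

# Two throwaway files: one with the canonical header, one without.
scratch="$(mktemp -d)"
cat > "$scratch/WithHeader.java" <<'EOF'
/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */
package org.example;
EOF
echo 'package org.example;' > "$scratch/NoHeader.java"

report="$(for f in "$scratch"/*.java; do
  if has_spdx_header "$f"; then
    echo "ok      $(basename "$f")"
  else
    echo "MISSING $(basename "$f")"
  fi
done)"
echo "$report"
rm -rf "$scratch"
```

To apply it to files you added on a branch, loop over the output of git diff --name-only --diff-filter=A upstream/main and call has_spdx_header on each .java path.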

Step 7: `rest-api-spec/` and `qa/` — Know They Exist

Two directories you will not edit yet but must recognize when reading PRs:

ls rest-api-spec/src/main/resources/rest-api-spec/api | head
#   the JSON specs describing every REST endpoint (params, paths, bodies)
ls rest-api-spec/src/yamlRestTest 2>/dev/null
ls qa/ | head
#   bwc-test, rolling-upgrade, mixed-cluster, full-cluster-restart, ...

rest-api-spec is the contract for the REST API and the home of shared YAML REST tests; qa/ holds the cross-version backward-compatibility tests you will write in Level 9. A PR that changes a REST endpoint usually touches rest-api-spec; a PR that changes the wire format usually adds a qa/ BWC test.

Implementation Requirements

This lab produces a map, not code. Deliverables:

A filled-in copy of the top-level dir → Gradle project → purpose table, verified against your own ./gradlew projects output.
The file path (from memory, then verified) for: Node, IndexShard, InternalEngine, SearchService, StreamInput, Writeable.
Two sentences explaining why StreamInput/Writeable live in libs/core and not :server.
A note of the :server project dependencies you found in server/build.gradle.
The SPDX header pasted from a real file, plus the path you copied it from.

Troubleshooting

`./gradlew projects` is overwhelming

Pipe it. ./gradlew projects | grep -E "':(server|libs|modules)[:']" narrows to the projects you care about (including :server itself, which has no trailing colon). You rarely need the full list at once.

`find` returns test and main copies of the same class name

Constrain the path: add -path "*/main/*" for production code or -path "*/test/*" for tests. Many classes have a FooTests.java next to Foo.java.

`grep -rn "class X"` returns nothing

The class may be nested, generic, or named differently than you assume. Try grep -rn "X" --include='*.java' -l to find files mentioning it (quote the glob so the shell does not expand it), or search the declaration loosely: grep -rn "class X\b\|interface X\b\|enum X\b".

A new file fails precommit with a license-header error

You omitted the SPDX header. Copy it verbatim from a sibling file in the same package. Then re-run ./gradlew precommit.

Expected Output

Your locate-the-class drill should produce paths like:

== Node ==           server/src/main/java/org/opensearch/node/Node.java
== IndexShard ==     server/src/main/java/org/opensearch/index/shard/IndexShard.java
== InternalEngine == server/src/main/java/org/opensearch/index/engine/InternalEngine.java
== SearchService ==  server/src/main/java/org/opensearch/search/SearchService.java
== StreamInput ==    libs/core/src/main/java/org/opensearch/core/common/io/stream/StreamInput.java
== Writeable ==      libs/core/src/main/java/org/opensearch/core/common/io/stream/Writeable.java

(Exact libs/core sub-paths vary by branch — the point is they are under libs/core, not server.)

Stretch Goals

Map a module's extension points. Open modules/analysis-common/build.gradle and its *Plugin.java; identify which Plugin interfaces it implements (you will study these in Level 3 and the plugin-architecture deep dive).
Trace a REST route to its handler. Pick an endpoint from rest-api-spec (say _count), then grep -rn "_count" server/src/main/java/org/opensearch/rest to find the handler that registers it. This is the bridge to Level 3, Lab 3.1.
Find every place a setting is defined. Settings are Setting<T> constants. Run grep -rn "Setting.intSetting\|Setting.boolSetting" server/src/main/java/org/opensearch/cluster | head to see how cluster settings are declared.
Diff the dependency graph of two projects. Compare ./gradlew :libs:core:dependencies with ./gradlew :server:dependencies and articulate why the library's graph is a strict subset of the server's.

Validation / Self-check

You are done when you can answer these without notes:

Which top-level dir holds the core engine, and which holds the shared wire-serialization primitives? Why are they separate?
In which direction may dependencies point between :server and libs/*? How do you verify it from a build.gradle?
Give the three ways to locate a class, and which you reach for when you know only the behavior, not the name.
What are the three source sets of :server, and which Gradle task runs the tests in each of the two test source sets?
What must every new .java file contain to pass precommit, and where do you copy it from?
What is rest-api-spec/ for, and what does qa/ hold?

Next: Lab 2.2 — Prepare a PR Using OpenSearch Practices.

Lab 2.2: Prepare a PR Using OpenSearch Practices

Background

This lab is the full mechanical pipeline of an OpenSearch pull request, end to end, with a trivial change so the workflow — not the code — is the lesson. You will fork, clone, branch, make a tiny doc/test change, add a CHANGELOG.md entry, format with Spotless, run precommit, make a signed commit (git commit -s for DCO), push, and open a PR against main. You will see a real diff and a real Signed-off-by commit message, understand the PR template and each CI check, and learn how the backport 2.x label works.

Do this once with a throwaway change so that when you do Lab 2.3 for real, the plumbing is invisible.

Why This Lab Matters for Contributors

Every PR you ever open follows this exact sequence. Internalize it and you stop thinking about mechanics and start thinking about the change.
DCO and CHANGELOG are blocking CI checks; getting them right locally avoids the two most common first-PR failures.
Running spotlessApply + precommit locally is what turns a five-round review into one.

Prerequisites

Lab 2.1 complete; you can navigate the repo.
A GitHub account, git configured, and the gh CLI (optional but convenient).
Your git identity set correctly — this becomes your DCO sign-off:

git config --global user.name  "Your Name"
git config --global user.email "your.email@example.com"
git config user.name; git config user.email   # verify; these MUST match your Signed-off-by

Step-by-Step Tasks

Step 1: Fork and Clone

Fork opensearch-project/OpenSearch to your account (the Fork button on GitHub, or gh repo fork). Then clone your fork and wire the canonical repo as upstream:

# Clone your fork (replace YOURNAME):
git clone https://github.com/YOURNAME/OpenSearch.git
cd OpenSearch

# Add the canonical repo as 'upstream' so you can keep main current:
git remote add upstream https://github.com/opensearch-project/OpenSearch.git
git remote -v
# origin    https://github.com/YOURNAME/OpenSearch.git (fetch/push)   <- your fork
# upstream  https://github.com/opensearch-project/OpenSearch.git (fetch/push)

gh does this in one step:

gh repo fork opensearch-project/OpenSearch --clone=true --remote=true

Step 2: Sync `main` and Branch

Always branch from an up-to-date main:

git checkout main
git fetch upstream
git merge --ff-only upstream/main     # fast-forward your local main to upstream
git push origin main                  # keep your fork's main current too

# Create a topic branch (name it after the change):
git checkout -b docs/clarify-bulk-example

Note: Branch off main. Fixes land on main first and are backported to 2.x later via the backport 2.x label — you do not branch off 2.x for a new fix.

Step 3: Make a Trivial Change

For this dry run, pick something harmless and real — for example, fix a small inaccuracy or add a clarifying sentence to an in-repo doc, or tighten a test's assertion message. Keep it to one logical change. Suppose you correct a stale example in a developer doc:

# Find a candidate (illustrative grep — look for a fixable doc nit):
grep -rn "localhost:9200" DEVELOPER_GUIDE.md TESTING.md 2>/dev/null | head

Make the edit in your editor. The resulting diff should be small and obviously correct:

diff --git a/DEVELOPER_GUIDE.md b/DEVELOPER_GUIDE.md
index abc1234..def5678 100644
--- a/DEVELOPER_GUIDE.md
+++ b/DEVELOPER_GUIDE.md
@@ -212,7 +212,7 @@ To run a single test class:
-    ./gradlew test --tests "org.opensearch.ExampleTests"
+    ./gradlew :server:test --tests "org.opensearch.ExampleTests"

This is deliberately tiny. The discipline of "one logical change, obviously correct" is what you are practicing — not the change itself.

Step 4: Add a CHANGELOG Entry

Open CHANGELOG.md, find the ## [Unreleased ...] heading, and add one line in the appropriate subsection (Added / Changed / Fixed / Deprecated / Removed). A doc fix usually goes under Fixed or Changed:

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1111aaa..2222bbb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@
 ## [Unreleased 3.x]
 ### Added
 ### Changed
+- Correct the single-test example in the developer guide to use the `:server:test` task ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))
 ### Fixed

You will not know the PR number yet, so use a placeholder (#NNNNN) and update it once the PR is open. Many contributors push, open the PR, read the assigned number, and then update the entry (by amending, or with a follow-up signed commit). CI checks that an entry exists, not that the number is final.
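Filling in the placeholder afterward is a one-line substitution. The sketch below assumes the placeholder is the literal text NNNNN; set_pr_number is a hypothetical helper and 12345 is a made-up PR number.

```shell
# set_pr_number FILE NUMBER
# Hypothetical helper: replace every NNNNN placeholder in FILE with NUMBER.
# Writes to a temporary file first so it works with both GNU and BSD sed.
set_pr_number() {
  sed "s/NNNNN/$2/g" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

changelog="$(mktemp)"
echo '- Correct the single-test example ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))' > "$changelog"

set_pr_number "$changelog" 12345   # 12345 is a made-up number for illustration
updated_entry="$(cat "$changelog")"
echo "$updated_entry"
rm -f "$changelog"
```

After running it on CHANGELOG.md, review the diff before committing, in case the literal NNNNN appears anywhere other than your entry.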

Note: Truly trivial, non-user-facing changes can sometimes carry the "skip changelog" label instead of an entry — but default to adding one. A missing CHANGELOG is the #1 first-PR nit.

Step 5: Format with Spotless

Even for a doc/test change, run Spotless so formatting never blocks you:

./gradlew spotlessApply        # auto-format any Java you touched
./gradlew spotlessJavaCheck    # verify (this is what CI checks)

For a pure-Markdown change this is a no-op for the formatter, but make it a reflex — the moment you touch a .java, Spotless matters.

Step 6: Run Precommit (and Scoped Tests if You Touched Code)

./gradlew precommit
# If you changed a test or source file, also run the scoped test:
# ./gradlew :server:test --tests "org.opensearch.ExampleTests"

A green precommit locally means the CI precommit check will be green too. This is the highest-leverage habit in the whole workflow (from Lab 1.2).

Step 7: Commit with DCO Sign-off

The -s flag is mandatory — it appends the Signed-off-by line the DCO check requires:

git add DEVELOPER_GUIDE.md CHANGELOG.md
git commit -s -m "Use the :server:test task in the single-test developer-guide example"

Inspect the resulting commit — note the trailer:

git log -1 --format=full

commit a1b2c3d4...
Author:     Your Name <your.email@example.com>
Commit:     Your Name <your.email@example.com>

    Use the :server:test task in the single-test developer-guide example

    Signed-off-by: Your Name <your.email@example.com>

The Signed-off-by name/email must match your git identity. If you forgot -s, fix it without re-doing the work:

git commit --amend -s --no-edit     # add sign-off to the latest commit
# or, across multiple commits on the branch:
git rebase --signoff upstream/main

Step 8: Push and Open the PR

git push origin docs/clarify-bulk-example

GitHub prints a "create a pull request" URL, or use gh:

gh pr create --repo opensearch-project/OpenSearch --base main \
  --title "Use the :server:test task in the single-test developer-guide example" \
  --body "See template below."

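When you open the PR in the GitHub web UI, the description is pre-filled from .github/pull_request_template.md. If you created the PR with gh and --body, edit the description afterward so that it follows the template. Fill every section honestly: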

### Description
Corrects the single-test example in DEVELOPER_GUIDE.md to use the `:server:test`
Gradle path, matching the project layout. Pure documentation change.

### Related Issues
Resolves #NNNNN   <!-- or "N/A" if you filed no issue for a trivial doc fix -->

### Check List
- [x] New functionality includes testing.  (N/A — docs only)
- [x] New functionality has been documented.
- [x] API changes companion pull request created, if applicable.
- [x] Commits are signed per the DCO using `--signoff`.
- [x] Changelog updated, or "skip changelog" justified.

Step 9: Read the CI Checks

Once opened, CI runs. Each check maps to something you can reproduce locally:

CI check	Verifies	If it's red
DCO	Every commit has a valid `Signed-off-by`.	`git commit --amend -s` / `git rebase --signoff`, force-push the branch.
Changelog	An entry exists under `## [Unreleased]` (or skip label).	Add the line; push.
precommit	Checkstyle, forbidden APIs, SPDX headers, etc.	`./gradlew precommit` locally; fix; push.
assemble	The distribution still builds.	`./gradlew assemble`.
gradle-check	Build + relevant tests + precommit across affected projects.	Reproduce the failing test locally with its `-Dtests.seed`.

Note: Some CI workflows require a maintainer to comment to start the heavy gradle-check on a first-time contributor's PR (a security gate against running arbitrary code). Be patient; do not spam pushes to retrigger.

Step 10: Respond to Review and the Backport Label

A maintainer (listed in MAINTAINERS.md) reviews. For each comment:

Make the change as a new signed commit on the same branch and push. Do not force-push away the review history unless a maintainer asks you to squash.
Reply to each comment, marking resolved when addressed.

# Address a review comment:
# ...edit...
git add -A && git commit -s -m "Address review: reword the example caption"
git push origin docs/clarify-bulk-example

When approved, a maintainer squash-merges to main. If the fix should also go to the maintenance line, the backport 2.x label is applied (by a maintainer, or by you if you have the rights). A bot then opens a backport PR cherry-picking your squashed commit onto 2.x; you may need to resolve conflicts there. Responding to feedback well is its own skill; see Responding to Maintainer Feedback.

Implementation Requirements

Deliverables (you may open the PR as a draft and close it afterward — the goal is the mechanics):

A fork with upstream configured and a local main fast-forwarded to upstream/main.
A topic branch off main with one small, obviously-correct change.
A CHANGELOG.md entry under ## [Unreleased].
A clean ./gradlew spotlessApply and ./gradlew precommit.
A commit whose git log -1 --format=full shows a Signed-off-by matching your git identity.
A pushed branch and an opened (draft) PR with the template fully filled in.
A written description of what each of the five CI checks verifies and how to fix a red one.

Troubleshooting

DCO check is red: "Commit sha … does not have a valid sign-off"

The commit lacks (or has a mismatched) Signed-off-by. Fix and force-push the branch:

git rebase --signoff upstream/main   # add sign-off to every commit on the branch
git push --force-with-lease origin docs/clarify-bulk-example

(Force-pushing your own unmerged topic branch to fix DCO/rebase is fine; force-pushing away a maintainer's in-progress review is not.)

Changelog check is red

You did not add the entry, or you misplaced it. It must be under the ## [Unreleased] heading in the correct subsection. Re-check that the heading name matches your target line (3.x for main).

`git merge --ff-only upstream/main` fails

Your local main has diverged (you committed on it by accident). Reset it to upstream:

git checkout main
git fetch upstream
git reset --hard upstream/main

Never commit on main; always branch.

precommit fails on a file you did not touch

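Check whether the failure already exists on a clean main: git stash; git checkout main; ./gradlew precommit; git checkout -; git stash pop. If precommit passes on main, the failure comes from your branch, so fix it. If it also fails on main, it is pre-existing and not something to fix in this PR.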

Expected Output

A correct signed commit:

$ git log -1 --format='%an <%ae>%n%n%B'
Your Name <your.email@example.com>

Use the :server:test task in the single-test developer-guide example

Signed-off-by: Your Name <your.email@example.com>

A green local gate:

$ ./gradlew spotlessJavaCheck precommit
BUILD SUCCESSFUL in 5m 12s

Stretch Goals

Practice the backport mentally. Read .github/workflows/ for the backport workflow file and grep -rn "backport" .github/. Identify which label triggers the bot.
Inspect the PR template source. cat .github/pull_request_template.md and map each checkbox to a CI check or a maintainer expectation.
Amend versus new commit. Practice both: git commit --amend -s (rewrites the last commit) and a fresh git commit -s (adds a new one). Know when each is appropriate during review (new commits during review; amend only before the first push or when asked to squash).
Keep a long-lived branch current. While your PR sits in review, main moves. Practice: git fetch upstream && git rebase upstream/main && git push --force-with-lease.

Validation / Self-check

You are done when you can answer these without notes:

What does git commit -s add, and why is it required? What must it match?
Where exactly does a CHANGELOG entry go, and what happens in CI if it is missing?
Which two Gradle tasks should you run locally before pushing, and why does that speed up review?
Why do you branch off main and not 2.x for a new fix? How does the fix reach 2.x?
When is force-pushing your branch acceptable, and when is it not?
Name the five CI checks and, for each, the local command that reproduces it.

Next: Lab 2.3 — Fix It: A Good First Issue, where the change is real.

Lab 2.3: Fix It — A Good First Issue

Background

This is a fix-it lab: a realistic, end-to-end walk of a beginner-appropriate OpenSearch fix, from the user-visible symptom to a merged-quality PR with a unit test. The class of bug is the sweet spot for a first real contribution: an unclear validation message, where a request is correctly rejected but the error text leaves the user guessing what they did wrong. The fix is surgical (a better message), the blast radius is tiny, and a clean unit test is easy to write with expectThrows/assertEquals or a containsString matcher.

The specific symptom, code site, and class names below are illustrative of the pattern — the exact file and message on your branch will differ, and you will grep to find the real one. What is not illustrative is the method: how you locate the code, scope the change, write the test, and keep the PR focused. That transfers to every good first issue you will ever take.

Why This Lab Matters for Contributors

Validation-message fixes are the highest-value beginner contribution: they improve real user experience, require touching only one method, and force you to write a focused test.
You practice the full discipline: reproduce → locate → minimal diff → unit test → CHANGELOG → signed commit → PR — the loop you set up in Lab 2.2.
The pitfalls here (scope creep, missing CHANGELOG, missing DCO, untested message) are the exact reasons first PRs stall.

Prerequisites

Lab 2.2 complete; fork + branch + DCO + CHANGELOG mechanics are second nature.
A running node from Lab 1.3 to reproduce the symptom.

Step 1: The Symptom

A user reports (in a good first issue) that a clearly-invalid request returns an unhelpful error. Two common, real flavors of this:

Flavor A — a validation method with a vague message. Many actions implement ActionRequest.validate(), accumulating problems into an ActionRequestValidationException via addValidationError(...). When the message is terse, the user cannot tell which field or what constraint failed. Example shape of a weak message:

{ "error": { "type": "action_request_validation_exception",
             "reason": "Validation Failed: 1: index is missing;" } }

Flavor B — a REST parser rejecting a parameter with no guidance. A RestHandler reads a query parameter and throws IllegalArgumentException with a message that names neither the bad value nor the allowed values.

Reproduce a concrete one against your running node. For instance, the _shrink/_split resize API requires a target index name; omitting it (or other resize preconditions) trips a validate():

# Symptom reproduction (illustrative — the exact endpoint/message varies by branch):
curl -s -X POST "localhost:9200/source/_shrink/" \
  -H 'Content-Type: application/json' -d '{}' | jq '.error | {type, reason}'

Read the reason. If it does not tell the user what is wrong and how to fix it, that is your bug.

Note: Pick a message that is genuinely unclear, not merely terse-but-correct. "index is missing" might be fine; "request is invalid" is not. The reviewer's bar is: does the new message help a confused user more than the old one, without changing behavior?

Step 2: Locate the Code

Find the message string, then the method that emits it. Start from the literal text:

# Search for the offending message text (use a distinctive fragment):
grep -rn "request is invalid\|Validation Failed" server/src/main/java | head

# More reliably, find the validate() method and its addValidationError calls
# for the request type in question (e.g. ResizeRequest):
grep -rn "addValidationError" server/src/main/java/org/opensearch/action/admin/indices/shrink/ResizeRequest.java
find server -name "ResizeRequest.java" -path "*/main/*"

Open the method. A weak validate() looks like this (illustrative):

@Override
public ActionRequestValidationException validate() {
    ActionRequestValidationException validationException = null;
    if (targetIndexRequest == null) {
        validationException = addValidationError("no target index request", validationException);
    }
    if (targetIndexRequest != null && targetIndexRequest.index() == null) {
        validationException = addValidationError("the target index name is not set", validationException);
    }
    // ...
    return validationException;
}

The first message — "no target index request" — is the kind of internal-jargon string a user cannot act on. That is your target.
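
The accumulate-and-return idiom behind validate() can be tried in isolation. The sketch below uses no OpenSearch classes: `ValidationErrors`, `ValidateSketch`, and `addError` are hypothetical stand-ins, the two checks are invented for illustration, and the message layout simply mirrors the "Validation Failed: 1: ...;" shape from Step 1.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an accumulating validation exception (hypothetical; not the OpenSearch class).
class ValidationErrors extends IllegalArgumentException {
    private final List<String> errors = new ArrayList<>();

    void add(String error) {
        errors.add(error);
    }

    // Mirrors the "Validation Failed: 1: first;2: second;" layout of the example response.
    @Override
    public String getMessage() {
        StringBuilder sb = new StringBuilder("Validation Failed: ");
        for (int i = 0; i < errors.size(); i++) {
            sb.append(i + 1).append(": ").append(errors.get(i)).append(";");
        }
        return sb.toString();
    }
}

class ValidateSketch {
    // Same idiom as addValidationError: create the exception lazily, then append to it.
    static ValidationErrors addError(String error, ValidationErrors existing) {
        ValidationErrors result = existing == null ? new ValidationErrors() : existing;
        result.add(error);
        return result;
    }

    // Returns null when the request is valid, otherwise every problem found.
    static ValidationErrors validate(String sourceIndex, String targetIndex) {
        ValidationErrors e = null;
        if (sourceIndex == null) {
            e = addError("source index is missing", e);
        }
        if (targetIndex == null) {
            e = addError("target index is missing", e);
        }
        return e;
    }

    public static void main(String[] args) {
        System.out.println(validate(null, null).getMessage());
        // prints: Validation Failed: 1: source index is missing;2: target index is missing;
    }
}
```

Because every failed check appends to the same exception, one response can carry several numbered messages, which is why each individual string has to make sense on its own.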

Tip: Confirm there is an existing test for this validate() before you write a new one. A test class usually sits at the mirror path under src/test: find server -name "ResizeRequestTests.java". If it exists, you will add a method; if not, you may create one (with the SPDX header — see Lab 2.1).

Step 3: The Diff

Improve the message so it states the field and the fix. Keep it to the message string(s) — do not change when the error fires, only what it says.

diff --git a/server/src/main/java/org/opensearch/action/admin/indices/shrink/ResizeRequest.java b/server/src/main/java/org/opensearch/action/admin/indices/shrink/ResizeRequest.java
index 1234567..89abcde 100644
--- a/server/src/main/java/org/opensearch/action/admin/indices/shrink/ResizeRequest.java
+++ b/server/src/main/java/org/opensearch/action/admin/indices/shrink/ResizeRequest.java
@@ public ActionRequestValidationException validate() {
         ActionRequestValidationException validationException = null;
         if (targetIndexRequest == null) {
-            validationException = addValidationError("no target index request", validationException);
+            validationException = addValidationError(
+                "target index request is missing; specify the target index name in the request path, "
+                    + "e.g. POST /<source>/_shrink/<target>",
+                validationException
+            );
         }
         if (targetIndexRequest != null && targetIndexRequest.index() == null) {
-            validationException = addValidationError("the target index name is not set", validationException);
+            validationException = addValidationError(
+                "the target index name is not set; it must be provided in the request path "
+                    + "(POST /<source>/_shrink/<target>)",
+                validationException
+            );
         }
         return validationException;

Two rules for the message:

Name the field and the remedy. A good validation message answers "what is wrong" and "what do I type to fix it."
Do not change behavior. Same condition, same exception type, same fire-or-not — only clearer words. If you find yourself adding or removing a check, you have left "good first issue" territory; file a follow-up issue instead (see Pitfalls).

Step 4: The Unit Test

A message change is testable and must be tested: assert the new text so a future refactor cannot silently regress it. Extend (or create) the request's test. Because validate() returns the exception rather than throwing it, assert on the returned value with assertNotNull and assertThat(..., containsString(...)). OpenSearchTestCase gives you the base class, plus expectThrows for code that does throw.

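// SPDX license header goes here (copy it from a sibling test).
package org.opensearch.action.admin.indices.shrink;

import org.opensearch.action.ActionRequestValidationException;
import org.opensearch.test.OpenSearchTestCase;

import static org.hamcrest.Matchers.containsString;

public class ResizeRequestTests extends OpenSearchTestCase {

    public void testValidationMessageWhenTargetIndexMissing() {
        ResizeRequest request = new ResizeRequest();   // no target set -> should fail validate()

        // validate() returns the exception (or null); it does not throw.
        ActionRequestValidationException e = request.validate();

        assertNotNull("expected a validation error when the target index is missing", e);
        assertThat(
            e.getMessage(),
            containsString("target index request is missing; specify the target index name in the request path")
        );
    }
}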

If you prefer the throw-style assertion (for an action whose constructor or REST parser throws rather than returning a validation exception), the pattern is:

public void testRejectsUnknownValueWithHelpfulMessage() {
    IllegalArgumentException e = expectThrows(
        IllegalArgumentException.class,
        () -> SomeParser.parseMode("nonsense")
    );
    assertEquals(
        "unknown mode [nonsense]; allowed values are [a, b, c]",
        e.getMessage()
    );
}
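
For context, a parser that would satisfy this test might look like the sketch below. `SomeParser`, `parseMode`, and the allowed values a, b, c are the placeholders from the test, not a real OpenSearch class; the point is that the message names both the rejected value and the allowed ones.

```java
import java.util.List;

// Hypothetical parser matching the placeholder test; not an OpenSearch class.
class SomeParser {
    private static final List<String> ALLOWED = List.of("a", "b", "c");

    // Returns the mode unchanged when it is allowed; otherwise throws with a
    // message that names the rejected value and lists the allowed ones.
    static String parseMode(String value) {
        // List.of(...).contains(null) throws, so check for null first.
        if (value != null && ALLOWED.contains(value)) {
            return value;
        }
        throw new IllegalArgumentException("unknown mode [" + value + "]; allowed values are " + ALLOWED);
    }
}
```

Building the message from the same list that drives the check keeps the text honest: if someone adds a fourth allowed value, the error message updates with it.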

Note the conventions: test methods start with test, no @Test annotation is required (the runner discovers test* methods), and you use the framework's expectThrows/assertThat with Hamcrest matchers (containsString). New test files need the SPDX header.

Run only your test (fast loop from Lab 1.2):

./gradlew :server:test --tests "org.opensearch.action.admin.indices.shrink.ResizeRequestTests.testValidationMessageWhenTargetIndexMissing"

Expected:

org.opensearch.action.admin.indices.shrink.ResizeRequestTests > testValidationMessageWhenTargetIndexMissing PASSED

BUILD SUCCESSFUL

Step 5: CHANGELOG, Format, Sign, Push

The same pipeline as Lab 2.2:

# 1. CHANGELOG entry under [Unreleased] -> Fixed:
#    - Clarify the resize (_shrink/_split) validation message when the target index is missing ([#NNNNN](...))

# 2. Format and gate:
./gradlew spotlessApply
./gradlew :server:test --tests "*ResizeRequestTests*"
./gradlew precommit

# 3. Signed commit (DCO):
git add server/src/main/java/.../ResizeRequest.java \
        server/src/test/java/.../ResizeRequestTests.java \
        CHANGELOG.md
git commit -s -m "Clarify resize validation messages when the target index is missing"

# 4. Push and open the PR (fill the template; link the issue):
git push origin fix/resize-validation-message

Confirm the sign-off:

git log -1 --format='%B'
# Clarify resize validation messages when the target index is missing
#
# Signed-off-by: Your Name <your.email@example.com>

In the PR body, link the issue (Resolves #12345) and state plainly: "This changes only the validation message text; no behavior changes. Added a unit test asserting the new message." That one sentence preempts the reviewer's first question.

Where This Goes Wrong (Pitfalls)

Pitfall	Symptom	Avoid by
Scope creep	You "also" tweak a nearby check, reorder logic, or refactor the method. Reviewer asks for a split; the PR sits for weeks.	Touch only the message string(s). File a follow-up issue for anything else you noticed.
Behavior change disguised as a message fix	You add/remove/relax a validation condition. Now it needs deeper review and BWC thought.	If the when changes, it is no longer a Level 2 change. Stop and file a separate issue.
No test	"It's just a string." A future refactor silently reverts your message; reviewers reject untested message changes.	Always assert the new message: `assertThat(..., containsString(...))` on a returned exception, or `expectThrows` + `assertEquals` when the code throws.
Missing CHANGELOG	The changelog CI check is red; extra review round.	Add one line under `## [Unreleased]` → `Fixed`.
Missing DCO	The DCO check is red; PR blocked.	`git commit -s`; fix with `git rebase --signoff` if forgotten.
Over-asserting the message	You `assertEquals` the entire long message; a tiny later wording tweak breaks your test needlessly.	Assert a stable, meaningful fragment with `containsString`, not the whole sentence — unless the exact text is the contract.
Editing the wrong layer	You change a message in a `libs/` class that many callers share, with surprise side effects.	Confirm the message originates where you think (`grep` for it); change the narrowest site.

Implementation Requirements

Deliverables:

A reproduced symptom: the original, unclear error captured from a real curl.
A minimal diff that changes only the message text (no behavior change).
A unit test (OpenSearchTestCase subclass) asserting the new message via assertThrows/expectThrows + assertEquals/containsString, run green.
A CHANGELOG.md entry under ## [Unreleased] → Fixed.
A clean ./gradlew spotlessApply and ./gradlew precommit.
A signed commit and an opened PR whose body states "message-only, no behavior change" and links the issue.

Troubleshooting

Your `grep` for the message text finds nothing

The message may be assembled from parts (string concatenation, String.format, a constant). Search for a distinctive word, or search by the method (grep -rn "addValidationError" server/src/main/java) and read the candidates.

The test passes but the real `curl` still shows the old message

You changed a different code path than the one the request hits. Re-reproduce, then trace from the REST handler: the resize REST handler → the action → the request's validate(). Make sure the message you edited is the one on the path your curl exercises.

`assertEquals` on the message is brittle across runs

If randomized inputs make the exact message vary, assert a stable fragment with containsString instead of the full string. Reserve exact-match assertEquals for messages whose precise text is a deliberate contract.

precommit flags your new test file

Almost always a missing SPDX header or an unused import. Copy the header from a sibling test; run ./gradlew spotlessApply to drop unused imports.

Expected Output

After: the same invalid request returns a message a user can act on:

curl -s -X POST "localhost:9200/source/_shrink/" -H 'Content-Type: application/json' -d '{}' \
  | jq -r '.error.reason'
# Validation Failed: 1: target index request is missing; specify the target index name
#   in the request path, e.g. POST /<source>/_shrink/<target>;

And the test asserting it:

org.opensearch.action.admin.indices.shrink.ResizeRequestTests > testValidationMessageWhenTargetIndexMissing PASSED
BUILD SUCCESSFUL

Stretch Goals

Find three more weak messages. grep -rn "addValidationError" server/src/main/java and skim for messages that name no field or remedy. Each is a candidate good first issue — file one (do not fix all of them in one PR).
Trace the message to the wire. Confirm where the ActionRequestValidationException is turned into the JSON error the user sees — follow it from validate() up through the REST response path (rest-layer deep dive).
Add a parameterized assertion. If the validate() accumulates multiple errors, write a test that triggers two of them and asserts both fragments appear in the combined message.
Compare to upstream wording. Look at how a few existing high-quality validation messages in the same package are phrased (grep -rn "addValidationError" server/src/main/java/org/opensearch/action/admin/indices) and match that house style.

Validation / Self-check

You are done when you can answer these without notes:

What distinguishes a "good first issue" message fix from a behavior change — and what do you do the moment you realize you are changing behavior?
Why must a message-only change still have a test, and which assertion (assertEquals vs containsString) is appropriate when?
How do you locate the exact code site that emits a message a user reported?
Which three CI checks would block this PR if you skipped the corresponding step (test, CHANGELOG, sign-off)?
What single sentence in the PR body preempts the reviewer's first question on a message change?
Name two pitfalls that most often stall a first PR and how each is avoided.

Next: Lab 2.4 — Review It: Spot the Flaws in a PR, where you sit on the other side of the table.

Lab 2.4: Review It — Spot the Flaws in a PR

Background

Contributing is half of the job; reviewing is the other half, and the faster half to become valuable at. Maintainers spend more time reading PRs than writing them, and a contributor who reviews well is one a project trusts. This lab flips you to the maintainer's side: you are handed a plausible-but-flawed example PR — the kind that looks fine at a glance and fails on the things OpenSearch maintainers actually block on — and asked to enumerate every problem before you read the answer key.

The flaws are deliberately the recurring ones: a missing test, mutation of shared state, a backward-compatibility break in StreamOutput/StreamInput ordering, a Thread.sleep where assertBusy belongs, a missing CHANGELOG, and an unsigned commit. These are not exotic — they are the top reasons real PRs get "changes requested."

Why This Lab Matters for Contributors

Learning to spot these flaws is exactly how you learn to avoid them in your own PRs.
BWC and shared-state bugs are silent — they pass the author's happy-path test and break in production or on upgrade. Recognizing them on sight is a senior skill you start building now.
Reviewing well is the single fastest path to the trust that leads to maintainership (the path to maintainership).

Prerequisites

Lab 2.1–Lab 2.3 complete.
A reading acquaintance with Writeable/StreamInput/StreamOutput (from Lab 2.1; deep treatment in Level 9 and the serialization & BWC deep dive).

The Example PR

PR #99999 — "Add a priority field to FooStats and a helper to compute averages"

Description: "Adds a new priority field to FooStats so clients can sort stats. Also adds a small cache and a convenience method. Tested manually."

The diff (read it carefully before scrolling to the questions):

diff --git a/server/src/main/java/org/opensearch/foo/FooStats.java b/server/src/main/java/org/opensearch/foo/FooStats.java
index aaaaaaa..bbbbbbb 100644
--- a/server/src/main/java/org/opensearch/foo/FooStats.java
+++ b/server/src/main/java/org/opensearch/foo/FooStats.java
@@ public class FooStats implements Writeable {
     private final long count;
     private final long totalMillis;
+    private final int priority;

     public FooStats(StreamInput in) throws IOException {
+        this.priority = in.readInt();
         this.count = in.readLong();
         this.totalMillis = in.readLong();
     }

     @Override
     public void writeTo(StreamOutput out) throws IOException {
+        out.writeInt(priority);
         out.writeLong(count);
         out.writeLong(totalMillis);
     }

diff --git a/server/src/main/java/org/opensearch/foo/FooStatsCache.java b/server/src/main/java/org/opensearch/foo/FooStatsCache.java
index ccccccc..ddddddd 100644
--- a/server/src/main/java/org/opensearch/foo/FooStatsCache.java
+++ b/server/src/main/java/org/opensearch/foo/FooStatsCache.java
@@ public class FooStatsCache {
-    private static final Map<String, FooStats> CACHE = Collections.emptyMap();
+    public static final Map<String, FooStats> CACHE = new HashMap<>();

     public FooStats getOrCompute(String key, Supplier<FooStats> supplier) {
-        return supplier.get();
+        if (CACHE.containsKey(key)) {
+            return CACHE.get(key);
+        }
+        FooStats s = supplier.get();
+        CACHE.put(key, s);
+        return s;
     }

diff --git a/server/src/test/java/org/opensearch/foo/FooStatsCacheTests.java b/server/src/test/java/org/opensearch/foo/FooStatsCacheTests.java
index eeeeeee..fffffff 100644
--- a/server/src/test/java/org/opensearch/foo/FooStatsCacheTests.java
+++ b/server/src/test/java/org/opensearch/foo/FooStatsCacheTests.java
@@ public class FooStatsCacheTests extends OpenSearchTestCase {
     public void testCachePopulatesAsynchronously() throws Exception {
         FooStatsCache cache = new FooStatsCache();
         triggerAsyncPopulate(cache);
-        assertTrue(cache.size() > 0);
+        Thread.sleep(2000);
+        assertTrue(cache.size() > 0);
     }
 }

The commit that carries this diff:

commit 0badc0de
Author: A Contributor <contrib@example.com>

    Add priority to FooStats and a stats cache

(Note: no CHANGELOG.md change is in the diff, and the commit has no Signed-off-by trailer.)

Your Task

Before reading the answer key, write down every problem a maintainer would flag. Aim for at least seven. For each, note: what is wrong, why it matters, and what you would ask the author to do. Group them as: correctness, backward compatibility, testing, concurrency/state, and process. Only then continue.

(Your review notes here — do not skip ahead.)

The Maintainer's Review (Answer Key)

Here is the review a maintainer would actually leave. Each item is tied to an OpenSearch convention.

1. Backward-compatibility break in the wire format (BLOCKER)

The new priority field is read/written first, before the existing count/totalMillis:

this.priority = in.readInt();   // <-- new field read FIRST
this.count = in.readLong();

StreamInput/StreamOutput are positional: fields are read in the exact order they were written, with no field names. Inserting a new field at the front means an old node (which writes count, totalMillis) and a new node (which expects priority, count, totalMillis) will mis-parse each other's bytes: the new node takes the first four bytes of the old node's count as priority, and every read after that is misaligned. This breaks every mixed-version interaction (rolling upgrades, cross-node transport) and can corrupt data silently or fail deserialization outright.
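
The misalignment is easy to reproduce with plain JDK streams, which are positional in the same way. In this sketch `DataOutputStream`/`DataInputStream` stand in for `StreamOutput`/`StreamInput` (an assumption for illustration only), `WireOrderDemo` is a hypothetical class, and the values 7 and 1500 are arbitrary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

class WireOrderDemo {

    // "Old node": writes count, then totalMillis (two 8-byte longs, 16 bytes total).
    static byte[] writeOldFormat(long count, long totalMillis) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(count);
            out.writeLong(totalMillis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // "New node" from the flawed PR: expects priority first, then count, then totalMillis.
    static String readWithPriorityFirst(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            int priority = in.readInt();   // really the first 4 bytes of count
            long count = in.readLong();    // last 4 bytes of count + first 4 of totalMillis
            try {
                long totalMillis = in.readLong();   // only 4 bytes are left
                return "priority=" + priority + " count=" + count + " totalMillis=" + totalMillis;
            } catch (EOFException e) {
                return "priority=" + priority + " count=" + count + " then EOF";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readWithPriorityFirst(writeOldFormat(7L, 1500L)));
        // prints: priority=0 count=30064771072 then EOF
    }
}
```

Note that the first two reads succeed with plausible-looking numbers. Nothing fails until the stream runs out, which is why this class of bug slips past a happy-path test.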

The convention: new fields are appended at the end, and their read/write is guarded by a version check so old nodes neither write nor expect them:

public FooStats(StreamInput in) throws IOException {
    this.count = in.readLong();
    this.totalMillis = in.readLong();
    if (in.getVersion().onOrAfter(Version.V_3_1_0)) {  // the version that introduces the field
        this.priority = in.readInt();
    } else {
        this.priority = DEFAULT_PRIORITY;
    }
}

@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeLong(count);
    out.writeLong(totalMillis);
    if (out.getVersion().onOrAfter(Version.V_3_1_0)) {
        out.writeInt(priority);
    }
}

Ask: "Append priority after the existing fields and guard read/write with getVersion().onOrAfter(...). Add a BWC test (under qa/ or as a serialization test)." This is the heart of Level 9 and the serialization & BWC deep dive.
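
The append-and-guard shape can be exercised without a cluster. The sketch below is a JDK-only analogue, not OpenSearch code: an int wire version stands in for `Version`, `GuardedStats`, `NEW_FIELD_VERSION`, and `DEFAULT_PRIORITY` are hypothetical names, and `DataOutputStream`/`DataInputStream` stand in for the stream classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class GuardedStats {
    static final int NEW_FIELD_VERSION = 2;  // hypothetical: first wire version carrying priority
    static final int DEFAULT_PRIORITY = 0;   // hypothetical default for data from older peers

    final long count;
    final long totalMillis;
    final int priority;

    GuardedStats(long count, long totalMillis, int priority) {
        this.count = count;
        this.totalMillis = totalMillis;
        this.priority = priority;
    }

    // Existing fields keep their positions; the new field is appended and version-guarded.
    byte[] write(int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(count);
            out.writeLong(totalMillis);
            if (peerVersion >= NEW_FIELD_VERSION) {
                out.writeInt(priority);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads in the same order, applying the same guard as write().
    static GuardedStats read(byte[] wire, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            long count = in.readLong();
            long totalMillis = in.readLong();
            int priority = peerVersion >= NEW_FIELD_VERSION ? in.readInt() : DEFAULT_PRIORITY;
            return new GuardedStats(count, totalMillis, priority);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GuardedStats stats = new GuardedStats(7L, 1500L, 5);
        GuardedStats viaOld = read(stats.write(1), 1);
        GuardedStats viaNew = read(stats.write(2), 2);
        System.out.println(viaOld.count + " " + viaOld.priority);  // prints: 7 0
        System.out.println(viaNew.count + " " + viaNew.priority);  // prints: 7 5
    }
}
```

Talking to an old peer loses the priority value but keeps count and totalMillis intact, which is the trade a version guard makes: degraded information instead of misread bytes.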

2. No serialization round-trip test for `FooStats` (BLOCKER)

FooStats implements Writeable and just gained a field, but there is no test that round-trips it through StreamOutput → StreamInput. OpenSearch has a base class precisely for this: AbstractWireSerializingTestCase<FooStats>. Without it, the BWC bug in #1 would not have been caught by the author, and future field additions can break silently.

Ask: "Add a FooStatsTests extends AbstractWireSerializingTestCase<FooStats> with createTestInstance() and instanceReader(), and (for BWC) a test that round-trips an instance at an older wire version."

3. Mutable shared static state — a concurrency and correctness bug (BLOCKER)

public static final Map<String, FooStats> CACHE = new HashMap<>();

Three problems stacked:

static shared mutable state. A single process-wide HashMap shared across all instances and threads. OpenSearch is heavily multi-threaded (thread pools deep dive); concurrent put/get on a plain HashMap can corrupt the map (infinite loops, lost entries) and is a data race.
public exposes the cache for any code to mutate — no encapsulation, impossible to reason about.
Unbounded. It never evicts; it is a memory leak masquerading as a cache. Real caches in OpenSearch are bounded and often built on org.opensearch.common.cache.Cache.

Ask: "Don't use mutable static state. Make the cache an instance field; if it must be shared, use a bounded, thread-safe cache (org.opensearch.common.cache.Cache or at least a ConcurrentHashMap with an eviction policy). Keep it private."
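
One possible shape for the fix, shown with JDK classes only. This is a sketch, not the project's cache: `BoundedStatsCache` is a hypothetical name, a synchronized access-ordered `LinkedHashMap` stands in for a real bounded cache such as org.opensearch.common.cache.Cache, and the value type is generic so the example compiles on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical bounded cache: private instance state, guarded by the instance lock.
class BoundedStatsCache<V> {
    private final Map<String, V> entries;   // instance field, not public static

    BoundedStatsCache(int maxEntries) {
        // accessOrder=true keeps the least recently used entry first, so
        // removeEldestEntry evicts it once the map grows past maxEntries.
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Synchronized because LinkedHashMap is not thread-safe (even get() reorders
    // entries in access-order mode). Simple, but it serializes all callers.
    synchronized V getOrCompute(String key, Supplier<V> supplier) {
        V cached = entries.get(key);
        if (cached != null) {
            return cached;
        }
        V computed = supplier.get();
        entries.put(key, computed);
        return computed;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Each owner creates its own instance, for example `new BoundedStatsCache<FooStats>(1000)`, so tests no longer share state through a static field. A single lock is the simplest correct choice here; a reviewer may still ask for a purpose-built cache if contention matters.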

4. `Thread.sleep` instead of `assertBusy` in a test (BLOCKER for tests)

triggerAsyncPopulate(cache);
Thread.sleep(2000);
assertTrue(cache.size() > 0);

Thread.sleep in a test is an anti-pattern OpenSearch reviewers reject on sight. It is flaky (2s may be too short on a loaded CI machine → spurious failure) and slow (2s wasted when the work finished in 5ms). The framework provides assertBusy(...), which polls a condition until it holds or times out:

triggerAsyncPopulate(cache);
assertBusy(() -> assertTrue("cache should populate", cache.size() > 0));

Ask: "Replace Thread.sleep(2000) with assertBusy(() -> ...). Please do not use fixed sleeps in tests; they are a common source of flaky-test issues." (Flaky tests and assertBusy are Level 5.)
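
What polling buys you over a fixed sleep can be shown in plain Java. This is an illustrative stand-in, not the framework's assertBusy implementation: `PollingWait.awaitTrue` is a hypothetical helper, and the timeout and back-off values are arbitrary.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical helper showing the poll-until-true idea behind assertBusy.
class PollingWait {

    // Re-checks the condition with a growing pause until it holds, and fails
    // with an AssertionError if the timeout expires first.
    static void awaitTrue(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long pauseMillis = 1;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError("condition not met within " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(pauseMillis);   // short pause between polls, not a fixed wait
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", e);
            }
            pauseMillis = Math.min(pauseMillis * 2, 100);
        }
    }

    public static void main(String[] args) {
        AtomicBoolean populated = new AtomicBoolean(false);
        new Thread(() -> populated.set(true)).start();   // stands in for the async populate
        awaitTrue(populated::get, 5_000);                // returns as soon as the flag flips
        System.out.println("populated=" + populated.get());
    }
}
```

The generous timeout only costs time when the test is actually failing; when the work finishes quickly, the wait ends within a few milliseconds.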

5. Missing `CHANGELOG.md` entry (BLOCKER — CI will fail)

The diff adds a user-facing field (priority) but no CHANGELOG.md entry. The changelog CI check is red. Even ignoring CI, a new field is exactly the kind of change the changelog exists to record.

Ask: "Add a CHANGELOG.md entry under ## [Unreleased] → Added linking this PR."

6. Unsigned commit — DCO will fail (BLOCKER — CI will fail)

The commit has no Signed-off-by trailer. The DCO check blocks the merge.

Ask: "Sign your commits: git rebase --signoff upstream/main and force-push. The Signed-off-by must match your git identity."

7. Unfocused PR / scope creep (process)

The PR bundles two unrelated changes: a new FooStats.priority field and a new caching layer in FooStatsCache. These have different risk profiles (one is a wire/BWC change, the other a concurrency change), different reviewers, and different test needs. Bundling them means neither can be reviewed cleanly, and a problem in one blocks the other.

Ask: "Please split this into two PRs: one for the priority field (with BWC handling + a serialization test) and one for the cache (with a thread-safety design). They are independent."

8. The `priority` field is unvalidated and undocumented (minor)

priority is a raw int with no bounds, default, or Javadoc. What does a negative priority mean? What is the default for old data deserialized via the BWC path (#1)?

Ask: "Document the field, define a default (used in the pre-version-bump BWC branch), and validate the range if there is one."

The Review Rubric

Internalize the categories. A useful OpenSearch review checks, in roughly this order:

Category	What you check	Blocker?
Correctness	Does it do what the description claims? Edge cases?	Yes
Backward compatibility	Wire format (`StreamInput`/`StreamOutput` order + version guards), REST API, settings, index format	Yes — silent and severe
Concurrency / state	Shared mutable state, thread safety, no static mutable caches, proper `Closeable` handling	Yes
Testing	Is there a test? Right type (unit vs IT)? Serialization round-trip for `Writeable`? No `Thread.sleep` (use `assertBusy`)?	Yes
Scope	One logical change per PR; no drive-by refactors	Often — ask for a split
Process	DCO sign-off, CHANGELOG entry, SPDX header on new files, passing `precommit`	Yes — CI enforces most
Style / docs	Naming, Javadoc, message quality, follows house style	Usually non-blocking nits

Note: Lead with the blockers, not the nits. A review that opens with three style comments and buries a BWC break at the bottom helps no one. State the blocker, the why, and the concrete ask — the way the answer key above does. Tone matters; see GitHub review etiquette.

Implementation Requirements

Deliverables:

Your independent review notes, written before reading the answer key, listing at least seven problems grouped by category.
A diff of your notes against the answer key: which did you catch, which did you miss?
For each item you missed, one sentence on the convention you will remember.
A corrected version of the BWC read/write (append + version guard) written out from memory.
A one-paragraph review comment you would actually post on #99999 — blockers first, concrete asks, courteous tone.

Troubleshooting (Your Review, Not the Code)

"I only found four problems"

You likely caught the visible ones (CHANGELOG, DCO, Thread.sleep) and missed the silent ones (BWC ordering, static mutable state, missing serialization test). The silent ones are the valuable catches — train your eye on writeTo/StreamInput constructors and static mutable fields first.

"I flagged things that aren't actually wrong"

Over-flagging erodes trust as much as under-flagging. Before you post a comment, ask: is this a correctness/BWC/test blocker, or a personal style preference? Mark preferences as "nit:" so the author can weigh them appropriately.

"How do I know which version to put in the version guard?"

The next unreleased version on the line you target (the one in the ## [Unreleased] heading). The constant lives in the Version class — grep -rn "public static final Version V_" libs/core | tail. Exact handling is Level 9.

Stretch Goals

Find a real Writeable and audit its writeTo. Pick any class in :server implementing Writeable (grep -rln "implements Writeable" server/src/main/java | head) and confirm its constructor reads fields in the same order writeTo writes them, with version guards around the newer ones. This is what you will be doing for real in Level 9.
Find assertBusy in the wild. grep -rn "assertBusy(" server/src/internalClusterTest | head — read how integration tests wait on cluster conditions without sleeping.
Find a real BWC test. ls qa/ and open a rolling-upgrade or mixed-cluster test; see how the project proves old and new nodes interoperate.
Re-review your own Lab 2.3 change with this rubric. Does it have a test? A CHANGELOG? A sign-off? Is it scoped? Reviewing your own work this way before pushing is the habit that makes you a low-maintenance contributor.
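The assertBusy stretch goal is easier to read once the underlying pattern is familiar: poll a condition with a growing pause until it holds or a deadline passes. The following is a minimal sketch of that pattern in plain Java. `BusyWaitSketch` and `waitUntil` are hypothetical names used only for illustration; they are not the test framework's API, and the real helper's signature and back-off details should be read from the source.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical illustration of "wait on a condition" instead of a fixed Thread.sleep.
final class BusyWaitSketch {
    // Polls until the condition holds or maxWaitMillis elapses, doubling the pause each round.
    static boolean waitUntil(BooleanSupplier condition, long maxWaitMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        long pauseMillis = 1;
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // returns as soon as the condition is met
            }
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMillis <= 0) {
                return false; // deadline passed without the condition holding
            }
            try {
                Thread.sleep(Math.min(pauseMillis, remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            pauseMillis *= 2;
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        boolean ok = waitUntil(() -> System.currentTimeMillis() - start >= 10, 1_000);
        System.out.println(ok);
    }
}
```

The design point is that a fast machine finishes early and a slow one gets the full budget, whereas a fixed sleep is wrong on both.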

Validation / Self-check

You are done when you can answer these without notes:

Why does inserting a new field at the front of a writeTo/StreamInput pair break backward compatibility, and what two things make a field-addition BWC-safe?
Which base test class round-trips a Writeable, and why is its absence a blocker for a field-addition PR?
Name three things wrong with a public static final Map CACHE = new HashMap<>() in server code.
Why is Thread.sleep in a test a blocker, and what replaces it?
Which two process omissions in the example PR will make CI red on their own?
Why should this PR be split, and how would you phrase that ask?
In a review, what do you put first — blockers or nits — and why?

You have now contributed a fix and reviewed one. That two-sided fluency is what Level 3 builds on as you move from workflow into the engine itself.

Level 3: OpenSearch Architecture

This level builds the mental model you will use for the rest of the curriculum: how a running OpenSearch process is organized, how a request travels from an HTTP socket to a shard, and which thread does the work at each hop. By the end you will be able to open the codebase, point at the class that handles any given request, and explain why that class exists.

Levels 1 and 2 got you building, testing, and submitting a trivial PR. From here on, every lab assumes you can find a class without a search engine and run a local cluster. Level 3 is where you stop reading about OpenSearch and start reading OpenSearch.

Learning Objectives

By the end of Level 3 you must be able to:

Explain the node / cluster / index / shard model and how it maps to running processes and Lucene.
Enumerate the node roles (cluster_manager, data, ingest, coordinating, search, …) and say what each one is allowed to do.
Trace GET /_cluster/health and POST /<index>/_doc from the HTTP layer through RestController → RestHandler → NodeClient → TransportAction without a guide.
Distinguish the REST layer (port 9200, HTTP/JSON) from the transport layer (port 9300, binary, node-to-node) and say which classes own each.
Explain how actions are registered in ActionModule and how an ActionType resolves to a concrete TransportAction.
Name the major thread pools (ThreadPool.Names) and predict which one runs a given operation.
Build a minimal plugin that registers a new REST endpoint and transport action (Lab 3.3).

The Node / Cluster / Index / Shard Model

Start with the four nouns. Everything in OpenSearch is built out of them.

Concept	What it is	Backing class	Lives where
Cluster	A named set of nodes that share one cluster state	(no single class; `ClusterService` is its hub)	The whole system
Node	One running OpenSearch JVM process	`Node` (`server/.../node/Node.java`)	One machine/container
Index	A logical collection of documents with a mapping	`IndexMetadata` + `IndexService`	Metadata in cluster state; data on data nodes
Shard	A horizontal slice of an index; one Lucene index	`IndexShard` wrapping an `Engine`/`InternalEngine`	A data node's disk

An index is split into primary shards at creation time (immutable count, modulo _split/_shrink); each primary has zero or more replicas. A shard — primary or replica — is a self-contained Apache Lucene index made of immutable segments. IndexShard wraps a Lucene IndexWriter/DirectoryReader through an Engine (the default is InternalEngine). Indexing, refresh, flush, and merge all happen at the shard level. The cluster's job is to decide which node holds which shard — that is the routing table, and computing it is the subject of Level 4.

Cluster "logs-prod"
├── Node A (cluster_manager, data)
│   ├── shard [orders][0] primary      ← a Lucene index (segments on disk)
│   └── shard [orders][1] replica
├── Node B (data, ingest)
│   ├── shard [orders][0] replica
│   └── shard [orders][1] primary
└── Node C (coordinating only)         ← holds no data; routes requests

Note: "Index" is overloaded. An OpenSearch index is a collection of shards; a Lucene index is one shard. When reading the engine code, the word almost always means the Lucene one.

For the full storage story — IndexShard, InternalEngine, translog, segments, refresh/flush — see Level 6 and the indexing-path deep dive.

The Request Path: REST → RestController → Action → Shards

This is the spine of the system. Memorize it. Every user-facing operation — search, index, cluster health, snapshot — follows the same skeleton.

flowchart TD
    HTTP[HTTP request on :9200] --> HC[HttpServerTransport<br/>Netty4HttpServerTransport]
    HC --> RC[RestController.dispatchRequest]
    RC -->|matches path+method| RH[RestHandler<br/>e.g. RestClusterHealthAction / RestIndexAction]
    RH -->|parse JSON/params| REQ[builds an ActionRequest]
    REQ --> NC[NodeClient.execute ActionType, request, listener]
    NC -->|looks up in actions map| TA[TransportAction<br/>HandledTransportAction]
    TA --> AF[ActionFilters chain]
    AF --> DO[doExecute on the right node]
    DO -->|fan out over transport :9300| SH[per-shard work<br/>IndexShard / SearchService]
    SH -->|results| LIS[ActionListener.onResponse]
    LIS --> XC[ToXContent -> JSON]
    XC --> HTTP

Read it as five questions:

Who accepts the socket? HttpServerTransport (default Netty4HttpServerTransport from modules/transport-netty4). It turns bytes into a RestRequest.
Who routes it? RestController.dispatchRequest matches (method, path) against the registered handlers and calls the matching RestHandler.
Who parses and validates it? A RestHandler — almost always a subclass of BaseRestHandler. It reads path params, query params, and the JSON body, builds a typed ActionRequest, and returns a RestChannelConsumer that invokes the NodeClient.
Who dispatches to logic? NodeClient.execute(ActionType, request, listener) looks up the TransportAction registered for that ActionType and runs it locally.
Who does the work? A TransportAction (typically HandledTransportAction). For cluster-wide reads it may route to the cluster manager; for writes it routes to the primary shard and replicates; for searches it fans out to many shards and reduces.

The crucial decoupling: a RestHandler never calls business logic directly. It only knows how to turn HTTP into an ActionRequest and hand it to the client. The same TransportAction can be invoked from REST, from the Java client, or from another action — REST is just one front door.
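The shape of that decoupling can be sketched in a few lines of plain Java. This is a simplified model, not the OpenSearch classes: `ActionKey`, `Action`, `Client`, and `handleRest` are hypothetical stand-ins for `ActionType`, `TransportAction`, `NodeClient`, and a `RestHandler`, and requests and responses are reduced to strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-ins: these mirror the shape of the real classes, nothing more.
final class DispatchSketch {
    // A named key, playing the role of ActionType.
    static final class ActionKey {
        final String name;
        ActionKey(String name) { this.name = name; }
    }

    // Server-side logic, playing the role of TransportAction.
    interface Action {
        void execute(String request, Consumer<String> listener);
    }

    // Plays the role of NodeClient: a map lookup followed by execute.
    static final class Client {
        private final Map<ActionKey, Action> actions = new HashMap<>();

        void register(ActionKey key, Action action) {
            actions.put(key, action);
        }

        void execute(ActionKey key, String request, Consumer<String> listener) {
            Action action = actions.get(key);
            if (action == null) {
                throw new IllegalStateException("no action registered for [" + key.name + "]");
            }
            action.execute(request, listener);
        }
    }

    static final ActionKey HEALTH = new ActionKey("cluster:monitor/health");

    // Plays the role of a RestHandler: translate input into a request, then hand off.
    static String handleRest(Client client, String indexParam) {
        StringBuilder response = new StringBuilder();
        client.execute(HEALTH, "health:" + indexParam, response::append);
        return response.toString();
    }

    public static void main(String[] args) {
        Client client = new Client();
        client.register(HEALTH, (request, listener) -> listener.accept("green for " + request));
        System.out.println(handleRest(client, "orders")); // prints "green for health:orders"
    }
}
```

Note that `handleRest` never sees the action's implementation. Any other caller holding the same key and client reaches the same logic.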

For the layered detail, read the REST layer deep dive, the transport layer deep dive, and the action framework deep dive.

Node Roles

A node advertises a set of roles via the node.roles setting in opensearch.yml (or -Enode.roles=... on the command line). Roles decide what work a node is allowed to do; the cluster manager decides what it actually does.

Role	Setting token	Responsibility
Cluster manager (formerly master)	`cluster_manager`	Eligible to be elected to own and publish cluster state
Data	`data`	Holds shards; runs indexing, search, recovery, merges
Ingest	`ingest`	Runs ingest pipelines (`processors`) before indexing
Coordinating only	(empty `node.roles: []`)	Routes/reduces requests; holds no data, never elected
Search	`search`	Serves searchable-snapshot / remote-backed read traffic (newer role)
Remote cluster client	`remote_cluster_client`	Connects to remote clusters for cross-cluster search/replication
ML	`ml`	Runs ml-commons model inference (when the plugin is installed)
Warm	`warm`	Tiered storage warm nodes (newer, tiering feature)

Terminology — cluster manager vs master: OpenSearch renamed the "master" role and APIs to cluster manager for inclusive language. The old master role and settings like cluster.initial_master_nodes and ?master_timeout still work as deprecated aliases of cluster_manager / cluster.initial_cluster_manager_nodes / ?cluster_manager_timeout. New code uses cluster_manager. You will see both spellings throughout the codebase; first reference in this curriculum always notes both.

Every node is also implicitly a coordinating node: any node can accept a client request, fan it out to the nodes that hold the relevant shards, and reduce the results. A node with node.roles: [] is only a coordinator — useful as a load-balancing tier.

Find the role definitions yourself:

grep -rn "class DiscoveryNodeRole" server/src/main/java/org/opensearch/cluster/node/
grep -rn "CLUSTER_MANAGER_ROLE\|DATA_ROLE\|INGEST_ROLE\|roleMap" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java

REST vs Transport: Two Ports, Two Protocols

	REST layer	Transport layer
Default port	9200	9300
Audience	External clients (curl, Dashboards, SDKs)	Node-to-node traffic (the in-process Java `NodeClient` invokes the same transport actions locally, without a socket)
Protocol	HTTP/1.1 (+ HTTP/2 path), JSON body	Custom binary framing
Payloads	`RestRequest` → `ActionRequest`; JSON via XContent	`TransportRequest`/`TransportResponse`, `Writeable`
Entry class	`HttpServerTransport` → `RestController`	`TransportService` → `Transport` (`Netty4Transport`)
Serialization	XContent (`ToXContent` / `XContentParser`)	`StreamInput`/`StreamOutput`, `NamedWriteableRegistry`

A single user request usually uses both: HTTP on 9200 to reach the coordinating node, then binary transport on 9300 to reach the data nodes that hold the shards. A search hitting 10 shards on 3 nodes is one HTTP request and many transport round-trips.

The transport payload contract is Writeable: a type implements writeTo(StreamOutput) and a StreamInput-taking constructor. Polymorphic types (e.g., which QueryBuilder is on the wire) resolve through NamedWriteableRegistry. Getting this wire format right is the heart of backward compatibility — see Level 9 and the transport layer deep dive.
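The contract is easiest to internalize as a round-trip. Below is a minimal sketch in plain Java that uses `java.io.DataOutputStream` and `DataInputStream` as stand-ins for `StreamOutput` and `StreamInput`, and a plain `int` as a stand-in for the peer's `Version`. The class `WireSketch`, the `priority` field, and the `V_OLD`/`V_NEW` constants are all hypothetical; the point is only the two rules: read in the same order you write, and append new fields behind a version guard with a defined default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical message type; Data*Stream stands in for StreamInput/StreamOutput.
final class WireSketch {
    static final int V_OLD = 1;            // assumed: a version before `priority` existed
    static final int V_NEW = 2;            // assumed: the version that introduced `priority`
    static final int DEFAULT_PRIORITY = 0; // default used when the peer never sent the field

    final String name;
    final int priority;

    WireSketch(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }

    // Existing fields first; the new field is appended last and version-guarded.
    void writeTo(DataOutputStream out, int peerVersion) throws IOException {
        out.writeUTF(name);
        if (peerVersion >= V_NEW) {
            out.writeInt(priority);
        }
    }

    // Reads fields in exactly the order writeTo emits them, with the same guard.
    static WireSketch read(DataInputStream in, int peerVersion) throws IOException {
        String name = in.readUTF();
        int priority = peerVersion >= V_NEW ? in.readInt() : DEFAULT_PRIORITY;
        return new WireSketch(name, priority);
    }

    // Serializes and deserializes as if talking to a peer on the given version.
    static WireSketch roundTrip(WireSketch message, int peerVersion) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            message.writeTo(new DataOutputStream(bytes), peerVersion);
            return read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())), peerVersion);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WireSketch message = new WireSketch("reindex", 7);
        System.out.println(roundTrip(message, V_NEW).priority); // prints 7
        System.out.println(roundTrip(message, V_OLD).priority); // prints 0 (the default)
    }
}
```

If the new field were written first instead of last, an old peer would interpret its bytes as the start of `name`, which is exactly the silent failure a reviewer is looking for.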

Thread Pools

OpenSearch never blocks a transport or HTTP I/O thread on real work. Work is dispatched onto named thread pools owned by ThreadPool. Picking the wrong pool (or blocking on GENERIC) is a classic contributor bug.

`ThreadPool.Names`	Type	Used for
`SEARCH`	fixed (sized to CPUs)	Query and fetch phases
`SEARCH_THROTTLED`	fixed (small)	Searches on throttled/searchable-snapshot indices
`WRITE`	fixed	Indexing, bulk, delete (the write path)
`GET`	fixed	Realtime `_doc` GETs
`MANAGEMENT`	scaling	Cluster/management housekeeping tasks
`GENERIC`	scaling (unbounded-ish)	Misc. background work; never block here forever
`SNAPSHOT`	scaling	Snapshot/restore I/O
`REFRESH`	scaling	Periodic shard refreshes
`FLUSH`	scaling	Lucene commits / flushes
Cluster-manager update executor (not a `ThreadPool.Names` constant; owned by `ClusterManagerService` / `MasterService`)	single-threaded, prioritized	Computing cluster state updates, one at a time

Inspect them on a live node:

curl -s 'localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'

Find the constants and defaults in source:

grep -n "public static final String" server/src/main/java/org/opensearch/threadpool/ThreadPool.java | head -40
grep -n "ThreadPool.Names" server/src/main/java/org/opensearch/action/search/TransportSearchAction.java | head

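Cluster state updates are computed on a single-threaded executor owned by the cluster manager service rather than on a general-purpose ThreadPool.Names pool, which is why they are serialized. You will revisit this detail in Level 4.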
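The `queue` and `rejected` columns in the `_cat/thread_pool` output make more sense once you have seen a fixed pool saturate. The sketch below uses only `java.util.concurrent`; `PoolSketch`, `fixedPool`, and `countRejections` are hypothetical names, and the sizes are arbitrary. It is an illustration of the fixed-pool-with-bounded-queue idea, not OpenSearch's executor implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a "fixed" pool: N threads, bounded queue, reject when full.
final class PoolSketch {
    static ThreadPoolExecutor fixedPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize),
            new ThreadPoolExecutor.AbortPolicy());
    }

    // Submits `tasks` tasks that all block until released, and counts the rejections.
    static int countRejections(int threads, int queueSize, int tasks) {
        ThreadPoolExecutor pool = fixedPool(threads, queueSize);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> awaitQuietly(release));
            } catch (RejectedExecutionException e) {
                rejected++; // threads busy and queue full: the caller is told immediately
            }
        }
        release.countDown(); // unblock the accepted tasks
        pool.shutdown();     // let queued work drain, then stop the threads
        return rejected;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // 2 threads + 3 queue slots accept 5 tasks; the other 5 are rejected.
        System.out.println(countRejections(2, 3, 10)); // prints 5
    }
}
```

Rejecting early is the design choice: a bounded pool pushes back on the caller instead of accumulating unbounded work, which is also why blocking a pool thread on slow work starves everything queued behind it.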

Source Code Areas to Inspect

Read these before and after the labs. You are not modifying anything yet.

REST layer (`server/src/main/java/org/opensearch/rest/`)

File	Why
`RestController.java`	The router. Read `registerHandler(...)` and `dispatchRequest(...)`.
`BaseRestHandler.java`	Base for all handlers. Read `prepareRequest(...)` and `handleRequest(...)`.
`RestHandler.java`	The interface. `routes()` declares the `(method, path)` it owns.
`action/cat/RestNodesAction.java`	A readable example of a `_cat` handler.
`action/admin/cluster/RestClusterHealthAction.java`	Worked example in Lab 3.1.
`action/document/RestIndexAction.java`	The write-path REST entry; worked example in Lab 3.1.

Action framework (`server/src/main/java/org/opensearch/action/`)

File	Why
`ActionModule.java`	Registers every action and REST handler. The wiring catalog.
`support/TransportAction.java`	Base class. Read `execute(...)` and the `ActionFilters` loop.
`support/HandledTransportAction.java`	The common base for "do it on this node" actions.
`ActionType.java`	The typed key that maps a request to its transport action.
`support/ActionFilters.java` / `ActionFilter.java`	The pre/post interception chain.
`admin/cluster/health/TransportClusterHealthAction.java`	Worked example in Lab 3.1.
`bulk/TransportBulkAction.java` + `bulk/TransportShardBulkAction.java`	The write path; worked in Lab 3.1.

Transport layer (`server/` + `modules/transport-netty4/`)

File	Why
`server/.../transport/TransportService.java`	Registers request handlers; `sendRequest(...)`.
`server/.../transport/TransportRequest.java` / `TransportResponse.java`	Wire message bases.
`libs/core/.../io/stream/Writeable.java`	The serialization contract.
`libs/core/.../io/stream/StreamInput.java` / `StreamOutput.java`	Read/write primitives.
`libs/core/.../io/stream/NamedWriteableRegistry.java` (older branches: `server/.../common/io/stream/`)	Polymorphic type resolution.
`modules/transport-netty4/.../transport/Netty4Transport.java`	The default transport impl.

Node bootstrap & thread pools (`server/`)

File	Why
`node/Node.java`	The big constructor that wires every service. Skim `new Node(...)`.
`threadpool/ThreadPool.java`	Thread pool names, types, and default sizing.
`client/node/NodeClient.java`	`execute(ActionType, ...)` — the bridge from REST to action.

Key Classes Quick Reference

Class	Package	Role
`RestController`	`org.opensearch.rest`	Routes HTTP `(method, path)` to a `RestHandler`
`BaseRestHandler`	`org.opensearch.rest`	Base class; parses request, returns a channel consumer
`NodeClient`	`org.opensearch.client.node`	`execute(ActionType, req, listener)` → transport action
`ActionModule`	`org.opensearch.action`	Registers all actions + REST handlers (the wiring)
`ActionType<Response>`	`org.opensearch.action`	Typed key mapping a request to its transport action
`TransportAction`	`org.opensearch.action.support`	Base server-side action; runs the `ActionFilters` chain
`HandledTransportAction`	`org.opensearch.action.support`	"Run on this node" action base
`TransportService`	`org.opensearch.transport`	Sends/receives `TransportRequest`/`Response` between nodes
`ThreadPool`	`org.opensearch.threadpool`	Named executor pools (`SEARCH`, `WRITE`, …)
`Node`	`org.opensearch.node`	A running node; constructs and wires all services

The Labs

Lab	Title	Type
3.1	Trace a REST Request from HTTP to TransportAction	Code-reading trace
3.2	Node Roles, Discovery, and the Transport Layer	Hands-on + reading
3.3	Build It — A Custom REST Action Plugin	Build-it project

Deliverables

You must demonstrate all of the following before advancing to Level 4:

A reading-log artifact tracing GET /_cluster/health from RestController to TransportClusterHealthAction (Lab 3.1).
A reading-log artifact tracing POST /<index>/_doc from RestIndexAction to TransportShardBulkAction (Lab 3.1).
A 3-node local cluster you started, with _cat/nodes showing distinct roles (Lab 3.2).
A one-paragraph explanation of REST (9200) vs transport (9300), naming the entry class for each.
A working plugin that adds a new REST endpoint and transport action, installed into a local distro and curled successfully (Lab 3.3).
From memory: name the thread pool (or dedicated executor) that runs a search, a bulk index, and a cluster state update.

Common Mistakes

Mistake	Consequence	Fix
Thinking a `RestHandler` contains the business logic	You read the wrong class for a bug	The handler only builds an `ActionRequest`; logic is in the `TransportAction`
Confusing OpenSearch index with Lucene index	Misread the engine code	Index = many shards; one shard = one Lucene index
Assuming "master" is gone	You miss deprecated aliases still in the code	`master` is a deprecated alias of `cluster_manager`; both appear
Looking for the action registration in the handler	You can't find how an `ActionType` resolves	Registration is centralized in `ActionModule`
Blocking on a transport/HTTP I/O thread	Cluster stalls under load	Dispatch real work onto a `ThreadPool` executor
Curling the transport port (9300)	Connection garbage / errors	REST is 9200; 9300 is binary node-to-node only
Reading `ActionModule` top to bottom	Overwhelm: hundreds of registrations	`grep` for the one action you care about

How to Verify Success

# 1. Cluster is up and you can read its roles
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,cluster_manager'

# 2. You can find any hop in the request path by grep, not by memory of line numbers
grep -rn "registerHandler" server/src/main/java/org/opensearch/rest/RestController.java | head
grep -rn "register(ClusterHealthAction" server/src/main/java/org/opensearch/action/ActionModule.java

# 3. Thread pools are visible
curl -s 'localhost:9200/_cat/thread_pool/search,write?v'

When you can open the codebase cold and point at the handler, the action, and the thread pool for any request, you are ready for Level 4: Cluster Coordination and State.

Lab 3.1: Trace a REST Request from HTTP to TransportAction

This is a code-reading lab. You will not write a line of production code. You will follow two real requests — GET /_cluster/health and POST /<index>/_doc — from the HTTP socket all the way to the TransportAction that does the work, using only grep, find, and your IDE. The deliverable is a reading-log artifact you will reuse for the rest of the curriculum.

This is the OpenSearch analog of the Tez submitDAG → AsyncDispatcher trace. The skill is the same: navigate, don't memorize.

Background

Every user-facing OpenSearch operation follows the same skeleton (see Level 3 overview):

HTTP :9200 → HttpServerTransport → RestController.dispatchRequest
          → RestHandler (BaseRestHandler) → builds an ActionRequest
          → NodeClient.execute(ActionType, request, listener)
          → TransportAction (HandledTransportAction) → ActionFilters → doExecute
          → (over transport :9300) per-shard work → ActionListener → JSON

The two requests in this lab exercise the two most important variants of that skeleton:

GET /_cluster/health — a cluster-manager-routed read. The work must run on (or be answered from the cluster state owned by) the elected cluster manager (formerly master).
POST /<index>/_doc — a write. It enters as a single-doc index, is wrapped into a bulk, routed to the primary shard, and replicated.

Deep-dive companions: rest-layer.md, action-framework.md, transport-layer.md.

Why This Lab Matters for Contributors

When a bug report says "cluster health returns the wrong status" or "indexing a doc throws an NPE," the first thing a maintainer does is locate the exact class on the path. If you can't get from a URL to the responsible TransportAction in two minutes, you can't triage. Every issue lab in Level 8 and the capstone starts with this skill.

Prerequisites

OpenSearch cloned and building (./gradlew assemble succeeds — see Level 1).
A running node for the live half: ./gradlew run (REST on localhost:9200).
An IDE with "Find Usages" / "Call Hierarchy" (IntelliJ Ctrl-Alt-H) — optional but faster.
A scratch file for your reading log:

mkdir -p ~/opensearch-notes
: > ~/opensearch-notes/reading-log-3.1.md

Note: Class and method names are stable across recent branches, but line numbers are not. This lab gives you grep/find commands to locate each hop on your checkout rather than fabricated line numbers. If a name has drifted on your branch, the grep still points you at the right neighborhood.

Part A — `GET /_cluster/health` (budget: 45 min)

Step 1 (5 min) — Confirm the request live

curl -s 'localhost:9200/_cluster/health?pretty'

Expected (abridged):

{
  "cluster_name" : "opensearch",
  "status" : "green",
  "number_of_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "unassigned_shards" : 0
}

Now find the code that produced it.

Step 2 (8 min) — Find the RestHandler and its route

The handler name follows OpenSearch's convention: Rest<Thing>Action.

find server/src/main/java -name "RestClusterHealthAction.java"
grep -n "routes()\|new Route\|GET\|prepareRequest" \
  server/src/main/java/org/opensearch/rest/action/admin/cluster/RestClusterHealthAction.java

In routes() you will see the (GET, "/_cluster/health") and (GET, "/_cluster/health/{index}") registrations. Read prepareRequest(RestRequest, NodeClient). Note the two things every BaseRestHandler does:

Build a typed request — here a ClusterHealthRequest from path/query params (index, wait_for_status, timeout, …).
Return a RestChannelConsumer that calls client.admin().cluster().health(request, listener) (a thin wrapper over NodeClient.execute).

Log it:

cat >> ~/opensearch-notes/reading-log-3.1.md <<'EOF'
## GET /_cluster/health
1. RestController routes (GET,/_cluster/health) -> RestClusterHealthAction.prepareRequest
   - builds ClusterHealthRequest from RestRequest params
   - returns channel -> client.admin().cluster().health(req, listener)
EOF

Step 3 (7 min) — How RestController matched the route

The handler does not register itself; RestController does. Find the registration and the dispatch:

grep -n "registerHandler\|registerHandlerNoWrap\|dispatchRequest\|tryAllHandlers" \
  server/src/main/java/org/opensearch/rest/RestController.java | head

Read dispatchRequest(...) / tryAllHandlers(...): RestController walks its path trie, finds the handler whose routes() matched (method, path), and calls handler.handleRequest(request, channel, client). Any plugin-supplied handler wrapper was already applied when the handler was registered (registerHandler wraps, registerHandlerNoWrap does not), so dispatch sees the wrapped handler. BaseRestHandler.handleRequest then calls your prepareRequest and consumes the returned channel consumer.
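To make the matching step concrete, here is a minimal sketch of (method, path) routing with `{param}` placeholders. It deliberately uses a linear scan over registered routes rather than a trie, and every name in it (`RouteSketch`, `register`, `dispatch`) is hypothetical; handlers are represented by their class names as strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical router: a linear scan standing in for RestController's path trie.
final class RouteSketch {
    private static final class Route {
        final String method;
        final String[] template;
        final String handler;

        Route(String method, String path, String handler) {
            this.method = method;
            this.template = path.split("/");
            this.handler = handler;
        }
    }

    private final List<Route> routes = new ArrayList<>();

    void register(String method, String path, String handler) {
        routes.add(new Route(method, path, handler));
    }

    // Returns the handler name for (method, path) and fills `params`, or null when nothing matches.
    String dispatch(String method, String path, Map<String, String> params) {
        String[] segments = path.split("/");
        for (Route route : routes) {
            if (!route.method.equals(method) || route.template.length != segments.length) {
                continue;
            }
            Map<String, String> captured = new HashMap<>();
            boolean matches = true;
            for (int i = 0; i < segments.length && matches; i++) {
                String expected = route.template[i];
                if (expected.startsWith("{") && expected.endsWith("}")) {
                    // Placeholder segment: capture the value under the placeholder's name.
                    captured.put(expected.substring(1, expected.length() - 1), segments[i]);
                } else {
                    matches = expected.equals(segments[i]);
                }
            }
            if (matches) {
                params.putAll(captured);
                return route.handler;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RouteSketch router = new RouteSketch();
        router.register("GET", "/_cluster/health", "RestClusterHealthAction");
        router.register("GET", "/_cluster/health/{index}", "RestClusterHealthAction");
        Map<String, String> params = new HashMap<>();
        String handler = router.dispatch("GET", "/_cluster/health/orders", params);
        System.out.println(handler + " " + params); // prints "RestClusterHealthAction {index=orders}"
    }
}
```

The captured `index` value is what the handler later reads as a path parameter when it builds its typed request.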

Who registered RestClusterHealthAction in the first place? ActionModule:

grep -n "RestClusterHealthAction" server/src/main/java/org/opensearch/action/ActionModule.java

You'll find it in initRestHandlers(...) via registerHandler.accept(new RestClusterHealthAction()).

Step 4 (10 min) — From ActionType to TransportAction

client.admin().cluster().health(...) ultimately calls NodeClient.execute(ClusterHealthAction.INSTANCE, request, listener). Find the ActionType:

find server/src/main/java -name "ClusterHealthAction.java"
grep -n "extends ActionType\|public static final\|INSTANCE\|NAME" \
  server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthAction.java

ClusterHealthAction is an ActionType<ClusterHealthResponse> with a NAME of "cluster:monitor/health" and a singleton INSTANCE. The NAME is the transport action name — the string registered on TransportService so other nodes can invoke it.

Now find where the ActionType is bound to its TransportAction:

grep -n "ClusterHealthAction.INSTANCE\|TransportClusterHealthAction" \
  server/src/main/java/org/opensearch/action/ActionModule.java

You'll see actions.register(ClusterHealthAction.INSTANCE, TransportClusterHealthAction.class) (the exact helper is an ActionRegistry/registerAction call). This is the pivot: the NodeClient holds a Map<ActionType, TransportAction> built from these registrations, and execute is just a map lookup followed by transportAction.execute(...).

grep -n "execute\|Map<.*ActionType\|actions.get" \
  server/src/main/java/org/opensearch/client/node/NodeClient.java

Log it:

cat >> ~/opensearch-notes/reading-log-3.1.md <<'EOF'
2. NodeClient.execute(ClusterHealthAction.INSTANCE, req, listener)
   - ActionType NAME = "cluster:monitor/health"
   - ActionModule binds INSTANCE -> TransportClusterHealthAction
   - NodeClient = map lookup ActionType -> TransportAction, then .execute(...)
EOF

Step 5 (10 min) — Inside the TransportAction

find server/src/main/java -name "TransportClusterHealthAction.java"
grep -n "extends Transport\|clusterManagerOperation\|masterOperation\|doExecute\|ClusterStateObserver" \
  server/src/main/java/org/opensearch/action/admin/cluster/health/TransportClusterHealthAction.java

Observe:

It extends TransportClusterManagerNodeAction (formerly TransportMasterNodeAction) — the base for actions that must run on the elected cluster manager. If the local node isn't the manager, the base class transparently re-sends the request over transport to the manager node.
The real work is in clusterManagerOperation(...) (older branches: masterOperation(...)), which reads/derives the answer from the current ClusterState and may register a ClusterStateObserver to wait for wait_for_status=green etc.
The result is a ClusterHealthResponse that implements ToXContent; the REST channel serializes it back to the JSON you saw in Step 1.

This routing-to-the-manager behavior is your first contact with the coordination layer of Level 4. Note the pattern; you'll meet TransportClusterManagerNodeAction again.
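The local-or-forward decision can be sketched as a single branch. This is a simplified model under stated assumptions: nodes are identified by plain strings, the transport send is a caller-supplied function, and the retry-on-manager-change behavior of the real base class is omitted. `ManagerRoutingSketch`, `execute`, and `managerOperation` are hypothetical names.

```java
import java.util.function.BiFunction;

// Hypothetical model of "run on the elected cluster manager, wherever that is".
final class ManagerRoutingSketch {
    // transportSend takes (targetNode, request) and returns the remote node's response.
    static String execute(String localNode, String electedManager, String request,
                          BiFunction<String, String, String> transportSend) {
        if (electedManager == null) {
            throw new IllegalStateException("no elected cluster manager");
        }
        if (localNode.equals(electedManager)) {
            return managerOperation(localNode, request); // we are the manager: run locally
        }
        return transportSend.apply(electedManager, request); // otherwise forward over transport
    }

    // Stand-in for the action's real work, which reads the cluster state.
    static String managerOperation(String node, String request) {
        return request + " answered on " + node;
    }

    public static void main(String[] args) {
        BiFunction<String, String, String> send = ManagerRoutingSketch::managerOperation;
        System.out.println(execute("node-a", "node-a", "health", send)); // health answered on node-a
        System.out.println(execute("node-b", "node-a", "health", send)); // health answered on node-a
    }
}
```

Either way the caller receives the same answer, which is why the REST handler does not need to know which node it landed on.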

Log it:

cat >> ~/opensearch-notes/reading-log-3.1.md <<'EOF'
3. TransportClusterHealthAction extends TransportClusterManagerNodeAction
   - re-routes to elected cluster manager if not local
   - clusterManagerOperation(): derives status from ClusterState (+ ClusterStateObserver waits)
   - ClusterHealthResponse (ToXContent) -> JSON on the REST channel
EOF

Step 6 (5 min) — Prove it with a breakpoint (optional but recommended)

Start the node with ./gradlew run --debug-jvm, connect your IDE's debugger (default debug port 5005), and set breakpoints in RestClusterHealthAction.prepareRequest and TransportClusterHealthAction.clusterManagerOperation. Re-run the curl. Watch the call stack at the transport action: it shows you the NodeClient.execute → TransportAction.execute → ActionFilters frames you just read.

Part B — `POST /<index>/_doc` (budget: 45 min)

The write path is more interesting: a single-doc index is not its own action — it is funneled through bulk.

Step 1 (5 min) — Run it

curl -s -XPOST 'localhost:9200/blog/_doc?refresh=true' \
  -H 'Content-Type: application/json' \
  -d '{"title":"hello","views":1}' | python3 -m json.tool

Expected (abridged):

{
  "_index": "blog",
  "_id": "Xa3...",
  "_version": 1,
  "result": "created",
  "_shards": { "total": 2, "successful": 1, "failed": 0 }
}

Step 2 (8 min) — The RestHandler

find server/src/main/java -name "RestIndexAction.java"
grep -n "routes()\|new Route\|POST\|PUT\|prepareRequest\|IndexRequest\|bulk" \
  server/src/main/java/org/opensearch/rest/action/document/RestIndexAction.java

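RestIndexAction and its nested handler classes register the document routes: (POST, "/{index}/_doc") for auto-generated IDs (look for a nested AutoIdHandler on current branches), (PUT, "/{index}/_doc/{id}") and (POST, "/{index}/_doc/{id}") for explicit IDs, and the _create variants. In prepareRequest it builds an IndexRequest (index name, optional id, routing, the source bytes, op type, refresh policy).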

But note where it sends the request: it calls client.index(indexRequest, listener). Follow that.

Step 3 (7 min) — IndexRequest is dispatched via the bulk action

grep -n "IndexAction\|BulkAction\|bulk(" \
  server/src/main/java/org/opensearch/action/index/TransportIndexAction.java 2>/dev/null
grep -rn "class TransportIndexAction\|class TransportSingleItemBulkWriteAction\|toSingleItemBulkRequest" \
  server/src/main/java/org/opensearch/action/ | head

In current OpenSearch the single-document index path is implemented on top of bulk: a single IndexRequest is wrapped into a one-item BulkRequest and dispatched through BulkAction / TransportBulkAction. (Historically there was a standalone TransportIndexAction; the modern path goes through bulk. Verify which your branch uses with the grep above — if TransportIndexAction delegates to bulk, you're on the modern path.)

Find the bulk action registration and class:

grep -n "BulkAction.INSTANCE\|TransportBulkAction" \
  server/src/main/java/org/opensearch/action/ActionModule.java
find server/src/main/java -name "TransportBulkAction.java"

Step 4 (12 min) — TransportBulkAction → TransportShardBulkAction

grep -n "doExecute\|doRun\|executeBulk\|groupRequestsByShards\|createIndex\|shardBulkAction\|TransportShardBulkAction" \
  server/src/main/java/org/opensearch/action/bulk/TransportBulkAction.java | head -30

TransportBulkAction does the coordinating-node work:

Auto-creates the index if it doesn't exist (a cluster state update — note the link to Level 4).
Resolves each item's shard by routing (OperationRouting / _routing hash).
Groups items by target shard.
For each shard, sends a BulkShardRequest to the node holding that shard's primary, via TransportShardBulkAction.
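The resolve-and-group steps above can be sketched as follows. One loud assumption: the hash here is `String.hashCode`, chosen only so the example is self-contained, whereas OpenSearch's `OperationRouting` uses a Murmur3 hash of the routing value and additional routing-shard arithmetic. `ShardGroupingSketch`, `shardFor`, and `groupByShard` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of "resolve each item's shard, then group items by shard".
final class ShardGroupingSketch {
    // Stand-in hash for illustration only; the real routing hash is Murmur3-based.
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }

    // Each resulting entry corresponds to one per-shard request sent to that shard's primary.
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numberOfShards) {
        Map<Integer, List<String>> byShard = new TreeMap<>();
        for (String id : docIds) {
            byShard.computeIfAbsent(shardFor(id, numberOfShards), shard -> new ArrayList<>()).add(id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        System.out.println(groupByShard(List.of("a", "b", "c", "d"), 2)); // prints {0=[b, d], 1=[a, c]}
    }
}
```

Because the shard is a pure function of the routing value and the shard count, every coordinating node computes the same target, and this is also why the primary shard count cannot simply be changed after index creation.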

Now the per-shard write:

find server/src/main/java -name "TransportShardBulkAction.java"
grep -n "extends TransportWriteAction\|[sS]hardOperationOnPrimary\|[sS]hardOperationOnReplica\|applyIndexOperationOnPrimary\|executeBulkItemRequest" \
  server/src/main/java/org/opensearch/action/bulk/TransportShardBulkAction.java | head

TransportShardBulkAction extends TransportWriteAction extends TransportReplicationAction. This is the primary/replica write framework:

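shardOperationOnPrimary(...) runs on the node with the primary shard (on recent branches the override in TransportShardBulkAction is named dispatchedShardOperationOnPrimary(...); grep for both). It calls into IndexShard.applyIndexOperationOnPrimary(...) → InternalEngine.index(...) → Lucene IndexWriter + Translog.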
The framework then replicates the same operation to each replica via shardOperationOnReplica(...) over transport.
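The primary-then-replicas order, and the `_shards` counts in the response, can be modeled in a few lines. This is a deliberately reduced sketch: shard copies are in-memory lists, replication is a direct method call rather than a transport request, and failure handling is omitted. `ReplicationSketch`, `ShardCopy`, and `write` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a replicated write: primary first, then every assigned replica.
final class ReplicationSketch {
    static final class ShardCopy {
        final boolean assigned; // false models a replica with no node to live on
        final List<String> operations = new ArrayList<>();

        ShardCopy(boolean assigned) {
            this.assigned = assigned;
        }
    }

    // Returns {total, successful}, mirroring the "_shards" section of a write response.
    static int[] write(ShardCopy primary, List<ShardCopy> replicas, String operation) {
        primary.operations.add(operation); // the primary applies the operation first
        int successful = 1;
        for (ShardCopy replica : replicas) {
            if (replica.assigned) {
                replica.operations.add(operation); // same operation, replayed on the replica
                successful++;
            }
        }
        return new int[] { 1 + replicas.size(), successful };
    }

    public static void main(String[] args) {
        ShardCopy primary = new ShardCopy(true);
        ShardCopy unassignedReplica = new ShardCopy(false);
        int[] shards = write(primary, List.of(unassignedReplica), "index doc 1");
        System.out.println(Arrays.toString(shards)); // prints [2, 1]
    }
}
```

The `[2, 1]` result matches the abridged response in Step 1: on a single-node cluster the replica has nowhere to be assigned, so `total` is 2 while `successful` is 1 and nothing has failed.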

You have now reached the storage engine boundary. The engine internals (IndexShard, InternalEngine, translog, refresh) are Level 6 and the indexing-path deep dive. Stop here; you've traced REST → action → shard.

Log it:

cat >> ~/opensearch-notes/reading-log-3.1.md <<'EOF'
## POST /<index>/_doc
1. RestController routes (POST,/{index}/_doc) -> RestIndexAction.prepareRequest
   - builds IndexRequest -> client.index(req, listener)
2. single IndexRequest funneled through BulkAction/TransportBulkAction
   - coordinating node: auto-create index, resolve shard by routing, group by shard
3. per-shard: TransportShardBulkAction (extends TransportReplicationAction)
   - shardOperationOnPrimary -> IndexShard.applyIndexOperationOnPrimary -> InternalEngine.index
   - replicate via shardOperationOnReplica over transport :9300
   - (engine internals = Level 6)
EOF

How ActionType, ActionFilters, and the actions map fit together

Pull the three concepts together with one more pass:

Concept	Class	What it does
`ActionType<R>`	`org.opensearch.action.ActionType`	A typed, named key. `NAME` is the transport string (`cluster:monitor/health`, `indices:data/write/bulk[s]`).
Action registry	`ActionModule`	Binds each `ActionType` to a `TransportAction` class via Guice; the `NodeClient` gets the resulting `Map`.
`NodeClient.execute`	`NodeClient`	`actions.get(actionType).execute(...)` — pure map lookup + invoke.
`ActionFilters`	`ActionFilters` / `ActionFilter`	A chain wrapped around every `TransportAction.execute`. Plugins (e.g. security) intercept here.

Trace the filter chain yourself:

grep -n "ActionFilters\|filters\|apply\|proceed\|filterChain" \
  server/src/main/java/org/opensearch/action/support/TransportAction.java | head

Read TransportAction.execute(...): it constructs a RequestFilterChain and calls chain.proceed, which walks each ActionFilter and finally calls doExecute. This is where the security plugin enforces authz before any action runs — a fact you'll need in cross-repo debugging (plugin labs).
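The chain-of-filters idea is easier to hold in your head with a small model. The sketch below assumes a hypothetical `ToyFilterChain` with a `Filter` interface; it is not the real `RequestFilterChain`, but it shows the two behaviours that matter: each filter decides whether to call `proceed`, and the action body runs only after the last filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy filter chain. ToyFilterChain and Filter are hypothetical names for illustration.
class ToyFilterChain {
    // Stand-in for ActionFilter: inspect the request, then either call
    // chain.proceed(...) to continue or return without calling it to reject.
    interface Filter {
        void apply(String request, ToyFilterChain chain);
    }

    private final List<Filter> filters;
    private final Consumer<String> doExecute; // the action body, run last
    private int index = 0;

    ToyFilterChain(List<Filter> filters, Consumer<String> doExecute) {
        this.filters = filters;
        this.doExecute = doExecute;
    }

    void proceed(String request) {
        if (index < filters.size()) {
            filters.get(index++).apply(request, this);
        } else {
            doExecute.accept(request);
        }
    }

    // Runs a request through an "authz" filter and an "audit" filter; returns the event trace.
    static List<String> run(String request) {
        List<String> trace = new ArrayList<>();
        Filter authz = (req, chain) -> {
            if (req.startsWith("forbidden")) {
                trace.add("authz:rejected"); // chain stops here; doExecute never runs
                return;
            }
            trace.add("authz:ok");
            chain.proceed(req);
        };
        Filter audit = (req, chain) -> {
            trace.add("audit");
            chain.proceed(req);
        };
        new ToyFilterChain(List.of(authz, audit), req -> trace.add("doExecute")).proceed(request);
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(run("health"));
        System.out.println(run("forbidden:health"));
    }
}
```

The second call never reaches `doExecute`, which is the property a security filter relies on.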

Expected Output / Deliverable

Your ~/opensearch-notes/reading-log-3.1.md must contain both traces with a file path for every hop. A complete artifact looks like:

GET /_cluster/health
  RestController.dispatchRequest -> RestClusterHealthAction (rest/action/admin/cluster/)
  -> NodeClient.execute(ClusterHealthAction.INSTANCE) [name=cluster:monitor/health]
  -> TransportClusterHealthAction (extends TransportClusterManagerNodeAction)
  -> clusterManagerOperation reads ClusterState -> ClusterHealthResponse (ToXContent)

POST /<index>/_doc
  RestController.dispatchRequest -> RestIndexAction (rest/action/document/)
  -> client.index(IndexRequest) funneled through BulkAction/TransportBulkAction
  -> group by shard -> TransportShardBulkAction (extends TransportReplicationAction)
  -> shardOperationOnPrimary -> IndexShard.applyIndexOperationOnPrimary -> InternalEngine.index

Stretch Goals

Trace GET /<index>/_search. Find RestSearchAction → SearchAction.INSTANCE → TransportSearchAction. Note it does not route to the cluster manager — it fans out to shards. How does its base class differ from TransportClusterManagerNodeAction? (Preview of Level 7.)
Find the transport action name for bulk-at-shard. Grep TransportShardBulkAction for its ACTION_NAME / transportPrimaryAction; you'll see indices:data/write/bulk[s]. Note the [s], [p], [r] suffixes the replication framework appends for shard/primary/replica sub-actions.
Find a _cat handler. Trace RestNodesAction (rest/action/cat/). _cat handlers are the easiest first PRs — see Level 2.

Validation / Self-check

Answer without looking back at this lab:

Which class matched (GET, /_cluster/health) to a handler, and which method did the matching?
What is the transport action name (NAME) of ClusterHealthAction, and where is it used besides the NodeClient map?
TransportClusterHealthAction extends TransportClusterManagerNodeAction. What does that base class do when the local node is not the elected cluster manager?
A single POST /<index>/_doc does not have its own dedicated transport action that talks to the shard. Through which action is it funneled, and why is that a sensible design?
Name the method on TransportShardBulkAction (via its base) that runs on the primary shard, and the one that runs on each replica.
Where in the path could a plugin (e.g. security) reject the request before doExecute runs? Name the class and the mechanism.
From IndexShard.applyIndexOperationOnPrimary, what is the next component that touches Lucene? (You don't need to read it yet — just name it.)

When you can answer all seven and your reading log has a file path for every hop, you've completed Lab 3.1. Continue to Lab 3.2: Node Roles, Discovery, and the Transport Layer.

Lab 3.2: Node Roles, Discovery, and the Transport Layer

In Lab 3.1 you traced a request within one node. Real clusters span many nodes, and the moment a request leaves a node it travels over the transport layer: a binary protocol on port 9300, with Writeable payloads and a NamedWriteableRegistry. This lab makes that concrete. You will start a 3-node cluster, give each node different roles, inspect roles and thread pools live, and read the transport classes that move messages between nodes.

Background

A node is one OpenSearch JVM. A cluster is a set of nodes that discovered each other and agreed on a cluster manager (formerly master). Three things make that work:

Node roles — what each node is allowed to do (node.roles).
Discovery — how nodes find each other and form a cluster (discovery.seed_hosts, cluster.initial_cluster_manager_nodes).
The transport layer — how nodes talk once connected (TransportService, Netty4Transport).

Discovery and election themselves are Level 4 material; here you treat them as "the thing that wires the nodes together" and focus on roles and transport.

Deep-dive companions: transport-layer.md, discovery-coordination.md.

Why This Lab Matters for Contributors

Most distributed bugs are inter-node bugs: a request hits a coordinating node, fans out to data nodes, and something serializes wrong, lands on the wrong thread pool, or hangs. To debug those you must be able to (a) reason about which node does what, and (b) read the transport code that carries the message. Wire-format changes are also the single most common source of backward-compatibility breakage (Level 9) — and they all live in the transport layer.

Prerequisites

OpenSearch builds and ./gradlew run works (Lab 3.1).
curl and jq (or python3 -m json.tool) for reading JSON.
Two free terminals (one per node group) if you start nodes manually.

Step-by-Step Tasks

Step 1 (10 min) — Read the role definitions in source

Before starting anything, learn where roles live:

find server/src/main/java -name "DiscoveryNodeRole.java"
grep -n "DiscoveryNodeRole\|roleName\|CLUSTER_MANAGER_ROLE\|DATA_ROLE\|INGEST_ROLE\|REMOTE_CLUSTER_CLIENT_ROLE\|SEARCH_ROLE" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java | head -40

DiscoveryNodeRole defines each built-in role as a static instance with a roleName (the token you put in node.roles) and a roleNameAbbreviation (the letter _cat/nodes prints). Note the deprecated master role:

grep -n "MASTER_ROLE\|master\|deprecat\|cluster_manager" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java

Terminology — cluster manager vs master: OpenSearch renamed the master role to cluster manager. DiscoveryNodeRole.MASTER_ROLE remains as a deprecated alias of CLUSTER_MANAGER_ROLE; setting node.roles: [master] still works but logs a deprecation warning. Always prefer cluster_manager in new config and code.

Now see how a node reads its roles:

grep -rn "node.roles\|NODE_ROLES_SETTING\|getRolesFromSettings\|nodeRoles" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java | head

DiscoveryNode.getRolesFromSettings(Settings) parses node.roles; a DiscoveryNode (id, name, address, roles, version, attributes) is what gets serialized into DiscoveryNodes inside the cluster state.
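As a rough mental model of what role parsing produces, the sketch below turns a `node.roles` string into a role set and the abbreviation string that `_cat/nodes` prints. `ToyRoles` is hypothetical, it models only three roles (d=data, m=cluster_manager, i=ingest), and the sorting of abbreviations is an assumption of this sketch; read `DiscoveryNodeRole` for the real definitions.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Toy role parsing. ToyRoles is a hypothetical name; only three roles are modelled.
class ToyRoles {
    // Role name (the node.roles token) -> one-letter abbreviation.
    static final Map<String, String> ABBREVIATIONS = Map.of(
        "cluster_manager", "m",
        "data", "d",
        "ingest", "i"
    );

    // Parses a comma-separated node.roles value into a sorted set of role names.
    static TreeSet<String> parse(String nodeRoles) {
        TreeSet<String> roles = Arrays.stream(nodeRoles.split(","))
            .map(String::trim)
            .filter(token -> !token.isEmpty())
            .collect(Collectors.toCollection(TreeSet::new));
        for (String role : roles) {
            if (!ABBREVIATIONS.containsKey(role)) {
                throw new IllegalArgumentException("unknown role [" + role + "]");
            }
        }
        return roles;
    }

    // Joins the abbreviations in sorted order; an empty role set prints as "-".
    static String abbreviate(String nodeRoles) {
        String joined = parse(nodeRoles).stream()
            .map(ABBREVIATIONS::get)
            .sorted()
            .collect(Collectors.joining());
        return joined.isEmpty() ? "-" : joined;
    }

    public static void main(String[] args) {
        System.out.println(abbreviate("cluster_manager,data"));
        System.out.println(abbreviate("data,ingest"));
        System.out.println(abbreviate(""));
    }
}
```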

Step 2 (10 min) — Start a 3-node cluster

You have two options. Option A is the fast path; Option B is closer to production.

Option A — ./gradlew run with multiple nodes:

# Three nodes in one invocation; gradle assigns roles and forms the cluster for you.
./gradlew run -Dtests.opensearch.cluster.initial_cluster_manager_nodes= \
  -PnumNodes=3

Check the task's own flags on your branch (they evolve):

grep -rn "numNodes\|numberOfNodes\|setNumberOfNodes" build.gradle */build.gradle 2>/dev/null | head
./gradlew help --task run

Option B — three nodes from a built distro, distinct roles:

./gradlew localDistro
DISTRO=$(find distribution/archives -type d -name "opensearch-*" | head -1)
# (Adjust DISTRO to the unpacked archive dir on your branch.)

# Node 1: cluster_manager + data
"$DISTRO/bin/opensearch" \
  -Ecluster.name=lab32 -Enode.name=n1 \
  -Enode.roles=cluster_manager,data \
  -Ehttp.port=9200 -Etransport.port=9300 \
  -Ediscovery.seed_hosts=127.0.0.1:9300,127.0.0.1:9301,127.0.0.1:9302 \
  -Ecluster.initial_cluster_manager_nodes=n1 &

# Node 2: data + ingest
"$DISTRO/bin/opensearch" \
  -Ecluster.name=lab32 -Enode.name=n2 \
  -Enode.roles=data,ingest \
  -Ehttp.port=9201 -Etransport.port=9301 \
  -Ediscovery.seed_hosts=127.0.0.1:9300,127.0.0.1:9301,127.0.0.1:9302 &

# Node 3: coordinating only (empty roles)
"$DISTRO/bin/opensearch" \
  -Ecluster.name=lab32 -Enode.name=n3 \
  -Enode.roles= \
  -Ehttp.port=9202 -Etransport.port=9302 \
  -Ediscovery.seed_hosts=127.0.0.1:9300,127.0.0.1:9301,127.0.0.1:9302 &

Note: cluster.initial_cluster_manager_nodes (deprecated alias: cluster.initial_master_nodes) is used only for the very first cluster bootstrap to seed the voting configuration. Set it on the cluster-manager-eligible nodes, and only on first formation. More in Level 4 Lab 4.1.

Step 3 (8 min) — Inspect roles live

curl -s 'localhost:9200/_cat/nodes?v&h=name,ip,node.role,cluster_manager,version'

Expected shape:

name ip         node.role cluster_manager version
n1   127.0.0.1  dm        *               3.x.x
n2   127.0.0.1  di        -               3.x.x
n3   127.0.0.1  -         -               3.x.x

The node.role column uses the abbreviations from DiscoveryNodeRole (d=data, m=cluster_manager, i=ingest, -=coordinating-only). The * in cluster_manager marks the elected manager. Now the structured view:

curl -s 'localhost:9200/_nodes?filter_path=nodes.*.name,nodes.*.roles&pretty'

Confirm n1 has [cluster_manager, data], n2 [data, ingest], n3 [].

Step 4 (7 min) — Inspect thread pools per node

curl -s 'localhost:9200/_cat/thread_pool?v&h=node_name,name,type,size,active,queue,rejected' \
  | grep -E 'node_name|search|write|generic'

Notice that the coordinating-only node (n3) still has search and write pools, because any node can coordinate a request and reduce results, but it never runs the shard-local half of a write (it has no shards). The cluster manager service does not show up in this output: it applies cluster-state updates on its own single-threaded executor rather than on a named ThreadPool pool, and that executor is only busy on the elected manager. Cross-reference the table in the Level 3 overview.

Step 5 (15 min) — Read the transport layer

This is the reading core of the lab. Find the players:

find server/src/main/java -name "TransportService.java"
find modules/transport-netty4/src/main/java -name "Netty4Transport.java"
find libs/core/src/main/java -name "Writeable.java" -o -name "StreamInput.java" -o -name "StreamOutput.java"

TransportService — the front door for node-to-node messaging. Read these methods:

grep -n "public.*sendRequest\|registerRequestHandler\|connectToNode\|TransportRequestHandler" \
  server/src/main/java/org/opensearch/transport/TransportService.java | head -30

registerRequestHandler(action, executor, requestReader, handler) — every transport action registers a handler here under a string name (the ActionType.NAME you saw in Lab 3.1), on a specific thread pool (the executor argument — this is where the pool choice lives).
sendRequest(connection, action, request, options, responseHandler) — serialize request (Writeable.writeTo), ship it to the remote node, and invoke responseHandler when the TransportResponse returns.
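The link between an action name, its handler, and the pool it runs on can be sketched with plain executors. Everything below is hypothetical (`ToyTransport`, the `toy:demo/echo` action, the pool names); it only illustrates that the executor chosen at registration time decides which thread runs the handler.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Toy handler registry. ToyTransport and its pool names are hypothetical.
class ToyTransport {
    private record Registration(String executor, Function<String, String> handler) {}

    private final Map<String, Registration> handlers = new ConcurrentHashMap<>();
    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

    ToyTransport(String... poolNames) {
        for (String pool : poolNames) {
            pools.put(pool, Executors.newSingleThreadExecutor(task -> {
                Thread thread = new Thread(task, "toy[" + pool + "]");
                thread.setDaemon(true); // do not keep the JVM alive for the demo
                return thread;
            }));
        }
    }

    // The executor argument is fixed at registration time, per action name.
    void registerRequestHandler(String action, String executor, Function<String, String> handler) {
        if (!pools.containsKey(executor)) {
            throw new IllegalArgumentException("no such pool [" + executor + "]");
        }
        handlers.put(action, new Registration(executor, handler));
    }

    // "Receives" a request: looks up the handler and runs it on its registered pool.
    String dispatch(String action, String request) {
        Registration registration = handlers.get(action);
        if (registration == null) {
            throw new IllegalStateException("no handler for action [" + action + "]");
        }
        try {
            return pools.get(registration.executor())
                .submit(() -> registration.handler().apply(request))
                .get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ToyTransport transport = new ToyTransport("generic", "write");
        transport.registerRequestHandler("toy:demo/echo", "write",
            request -> request + " handled on " + Thread.currentThread().getName());
        System.out.println(transport.dispatch("toy:demo/echo", "ping"));
    }
}
```

Because each toy pool has one thread, a handler that blocks would stall every other action registered on the same pool, which is the stall the `executor` choice is meant to avoid.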

The wire contract — read Writeable and a real request:

sed -n '1,60p' libs/core/src/main/java/org/opensearch/core/common/io/stream/Writeable.java
grep -n "writeTo\|StreamInput\|readFrom\|public ClusterHealthRequest" \
  server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthRequest.java

A Writeable type has two halves: a StreamInput-taking constructor (read) and writeTo(StreamOutput) (write). The order of reads must exactly mirror the order of writes — get it wrong and you corrupt every subsequent field. This symmetry is the #1 thing to verify when you touch a request class.
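You can reproduce the symmetry rule with nothing but `java.io`. The sketch below uses `DataOutputStream`/`DataInputStream` as stand-ins for `StreamOutput`/`StreamInput` (the encodings differ, so this is an analogy, not the OpenSearch wire format), and `ToyRequest` is a hypothetical two-field request. One reader mirrors the write order; the other deliberately swaps it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy wire round-trip. ToyWire and ToyRequest are hypothetical names.
class ToyWire {
    record ToyRequest(String index, int timeoutSeconds) {
        void writeTo(DataOutputStream out) throws IOException {
            out.writeUTF(index);          // field 1
            out.writeInt(timeoutSeconds); // field 2
        }

        // Correct reader: same order as writeTo.
        static ToyRequest read(DataInputStream in) throws IOException {
            String index = in.readUTF();  // field 1
            int timeout = in.readInt();   // field 2
            return new ToyRequest(index, timeout);
        }

        // Deliberately wrong reader: consumes field 2 before field 1.
        static ToyRequest readSwapped(DataInputStream in) throws IOException {
            int timeout = in.readInt();
            String index = in.readUTF();
            return new ToyRequest(index, timeout);
        }
    }

    static byte[] serialize(ToyRequest request) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            request.writeTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static ToyRequest deserialize(byte[] wire, boolean swapped) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            return swapped ? ToyRequest.readSwapped(in) : ToyRequest.read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = serialize(new ToyRequest("logs-2024", 30));
        System.out.println(deserialize(wire, false));
        try {
            System.out.println(deserialize(wire, true));
        } catch (UncheckedIOException e) {
            System.out.println("swapped read failed: " + e.getCause());
        }
    }
}
```

The swapped reader interprets the string's length prefix and first characters as an integer, and everything after that is misaligned. That is the "corrupt every subsequent field" failure in miniature.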

Polymorphism on the wire — NamedWriteableRegistry:

find libs/core/src/main/java server/src/main/java -name "NamedWriteableRegistry.java"
grep -n "register\|getReader\|NamedWriteable\|readNamedWriteable" \
  $(find . -name NamedWriteableRegistry.java | head -1)

When a field could be one of many types — e.g. which QueryBuilder is inside a SearchRequest — OpenSearch writes a name then the payload. The reader looks the name up in NamedWriteableRegistry to pick the right StreamInput constructor. Plugins register their own named writeables (you'll do this in Lab 4.3).
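The "name first, then payload" scheme can be modelled in a few lines. In the sketch below, `ToyNamedRegistry`, `Query`, `TermQuery`, and `MatchAllQuery` are all hypothetical stand-ins, and `java.io` data streams replace `StreamInput`/`StreamOutput`. The point is the lookup: the reader cannot know which constructor to call until it has read the name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Toy named-writeable registry. All type names here are hypothetical.
class ToyNamedRegistry {
    interface Query {
        String writeableName();
        void writeTo(DataOutputStream out) throws IOException;
    }

    record TermQuery(String field, String value) implements Query {
        public String writeableName() { return "term"; }
        public void writeTo(DataOutputStream out) throws IOException {
            out.writeUTF(field);
            out.writeUTF(value);
        }
    }

    record MatchAllQuery() implements Query {
        public String writeableName() { return "match_all"; }
        public void writeTo(DataOutputStream out) { /* no payload */ }
    }

    interface Reader {
        Query read(DataInputStream in) throws IOException;
    }

    private final Map<String, Reader> readers = new HashMap<>();

    void register(String name, Reader reader) {
        readers.put(name, reader);
    }

    // Writes the name first, then the type-specific payload.
    static byte[] write(Query query) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(query.writeableName());
            query.writeTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the name, looks up the matching reader, and lets it consume the payload.
    Query read(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            String name = in.readUTF();
            Reader reader = readers.get(name);
            if (reader == null) {
                throw new IllegalArgumentException("unknown named writeable [" + name + "]");
            }
            return reader.read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ToyNamedRegistry registry = new ToyNamedRegistry();
        registry.register("term", in -> new TermQuery(in.readUTF(), in.readUTF()));
        registry.register("match_all", in -> new MatchAllQuery());
        System.out.println(registry.read(write(new TermQuery("user", "kim"))));
        System.out.println(registry.read(write(new MatchAllQuery())));
    }
}
```

A node that never registered `"term"` fails at the lookup, which mirrors what happens when a plugin's named writeable is missing on the receiving side.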

Step 6 (10 min) — Watch a request hop between nodes

Send a request to the coordinating-only node and reason about the hops:

# Hits n3 (coordinating only) over HTTP :9202
curl -s 'localhost:9202/_cluster/health?pretty' | jq .status

What happened:

HTTP on :9202 reaches n3's RestController → RestClusterHealthAction → NodeClient.
TransportClusterHealthAction on n3 sees that it is not the cluster manager, so it calls TransportService.sendRequest(...) to forward the request over transport to n1 (the manager, listening on :9300).
n1's registered handler for cluster:monitor/health runs clusterManagerOperation, builds a ClusterHealthResponse, and sends it back over transport.
n3 serializes the response to JSON on the HTTP channel.
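The four hops above reduce to one decision: "am I the elected manager? If not, forward." The sketch below models that with a hypothetical `ToyManagerForwarding.Node`; direct method calls stand in for the transport round-trip, so it shows the control flow only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of forwarding to the elected manager. All names are hypothetical.
class ToyManagerForwarding {
    static final class Node {
        final String name;
        final Node electedManager; // null when this node is itself the elected manager

        Node(String name, Node electedManager) {
            this.name = name;
            this.electedManager = electedManager;
        }

        // Runs locally on the manager; otherwise forwards and relays the response.
        String health(List<String> hops) {
            if (electedManager != null) {
                hops.add(name + " -> " + electedManager.name + " [transport]");
                String status = electedManager.health(hops);
                hops.add(electedManager.name + " -> " + name + " [transport]");
                return status;
            }
            hops.add(name + " runs clusterManagerOperation");
            return "green";
        }
    }

    // Replays the scenario from this step: a client asks n3, n3 forwards to n1.
    static List<String> trace() {
        Node n1 = new Node("n1", null);
        Node n3 = new Node("n3", n1);
        List<String> hops = new ArrayList<>();
        hops.add("client -> n3 [http]");
        String status = n3.health(hops);
        hops.add("n3 -> client [http] status=" + status);
        return hops;
    }

    public static void main(String[] args) {
        trace().forEach(System.out::println);
    }
}
```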

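Make the hop visible by enabling transport tracing. The tracer only emits output when its logger is at TRACE, so set the logger level together with the include pattern. Cluster settings apply to every node, so any node's HTTP port works:

curl -s -XPUT 'localhost:9202/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "logger.org.opensearch.transport.TransportService.tracer": "TRACE",
    "transport.tracer.include": ["cluster:monitor/health*"],
    "transport.tracer.exclude": []
  }
}'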
# Re-run the health curl, then watch the node logs:
#   [trace] [transport.tracer] [n3] ... sent request ... action [cluster:monitor/health] ... node [n1]
#   [trace] [transport.tracer] [n3] ... received response for ... action [cluster:monitor/health]

Find the tracer in source to understand the log lines:

grep -rn "transport.tracer\|TransportLogger\|tracerLog\|traceRequestSent\|traceReceivedResponse" \
  server/src/main/java/org/opensearch/transport/ | head

Log it in your reading log:

cat >> ~/opensearch-notes/reading-log-3.1.md <<'EOF'

## 3.2 — transport hop (health from coordinating node)
- n3 (coord) RestClusterHealthAction -> TransportClusterHealthAction
- not manager -> TransportService.sendRequest("cluster:monitor/health") -> n1 over :9300
- n1 handler (registered via registerRequestHandler on a thread pool) runs clusterManagerOperation
- ClusterHealthResponse (Writeable.writeTo) serialized back to n3, then to JSON
EOF

Reading Exercises

# 1. Which thread pool does the bulk-at-shard transport handler run on?
grep -n "registerRequestHandler\|ThreadPool.Names\|executor" \
  server/src/main/java/org/opensearch/action/support/replication/TransportReplicationAction.java | head

# 2. How does a node's role set get into the cluster state?
grep -rn "DiscoveryNodes\|getRoles\|DiscoveryNode(" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNodes.java | head

# 3. Where is the default transport implementation chosen?
#    (The plugin class name varies by branch, so search the whole package.)
grep -rn "Netty4Transport\|TRANSPORT_TYPE\|getTransports" \
  modules/transport-netty4/src/main/java/org/opensearch/transport/ | head

Answer:

Roles & capability. A node has node.roles: [data, ingest]. Can the cluster manager assign a shard to it? Can it be elected cluster manager? Which DiscoveryNodeRole flag decides each?
The executor argument. In registerRequestHandler(action, executor, ...), what is executor, and why is choosing it correctly a correctness issue, not just a performance one? (Hint: blocking the wrong pool.)
Wire symmetry. In ClusterHealthRequest, the StreamInput constructor reads fields in some order. What invariant must writeTo satisfy, and what breaks if a new field is read before it is written?
NamedWriteableRegistry. Why can't SearchRequest simply call new TermQueryBuilder(in) when reading its query off the wire? What does the registry give you that a direct constructor cannot?
Coordinating-only nodes. n3 holds no data but has a write thread pool. What part of a bulk write would n3 ever execute, and what part can it never execute?

Expected Output

_cat/nodes shows three nodes with distinct role abbreviations and exactly one * cluster manager.
_nodes confirms each node's roles array matches what you configured.
Transport tracer log lines show the health request hopping from the coordinating node to the manager and back.
Your reading log gained a "transport hop" entry with a file path / class name for each hop.

Stretch Goals

Kill the manager. Stop n1 and watch n2 (if you make it cluster_manager-eligible) get elected; _cat/nodes shows a new *. This is a preview of Level 4 Lab 4.1.
Force a serialization round-trip in a test. Find an AbstractWireSerializingTestCase subclass (e.g. ClusterHealthRequestTests) and run it: ./gradlew :server:test --tests "*ClusterHealthRequestTests". These tests round-trip a Writeable through StreamOutput/StreamInput and are the cheapest way to learn the wire format. (Full treatment in Level 5.)
Inspect transport traffic. Run curl 'localhost:9200/_nodes/stats/transport?pretty', find rx_count/tx_count (messages received and sent), and correlate them with the requests you sent.

Validation / Self-check

Name the setting that declares a node's roles and the two settings (current + deprecated alias) used to bootstrap the first cluster's voting configuration.
Which class parses node.roles into a set of DiscoveryNodeRole, and which class carries those roles into the cluster state?
Port 9200 vs 9300: which protocol, which audience, and which entry class for each?
In TransportService.registerRequestHandler, which argument decides the thread pool the handler runs on, and why does the wrong choice cause stalls?
Explain, in two sentences, the role of NamedWriteableRegistry and why plugins must register their own named writeables.
A health request sent to a coordinating-only node still returns the right answer. Trace the exact sequence of HTTP and transport hops that makes that work.

When you can start a multi-role cluster, read its roles, and explain a single request's hops with class names, you've completed Lab 3.2. Continue to Lab 3.3: Build It — A Custom REST Action Plugin.

Lab 3.3: Build It — A Custom REST Action Plugin

You have traced the request path twice (Labs 3.1, 3.2). Now you build your own piece of it. In this lab you implement a minimal OpenSearch plugin that registers a new REST endpoint and a new transport action, builds it with the opensearch.opensearchplugin Gradle plugin, installs it into a local distribution, and curls it. This is the smallest possible end-to-end extension, and it exercises every concept from this level: Plugin, ActionPlugin, RestHandler, ActionType, Writeable, ToXContent, HandledTransportAction, and NodeClient.

The endpoint you build:

GET /_my/greeting?name=<who>   ->   {"greeting":"hello, <who>", "node":"<node_name>"}

It greets you and reports which node answered — proving the action ran inside the engine, not in the REST layer.

Background

A plugin extends org.opensearch.plugins.Plugin and opts into extension points by implementing interfaces. For actions you implement ActionPlugin:

Method (on `ActionPlugin`)	What you return
`getActions()`	`List<ActionHandler<Req,Resp>>` — bind each `ActionType` to its `TransportAction`
`getRestHandlers(...)`	`List<RestHandler>` — your REST endpoints

The plugin is loaded by PluginsService, gets an isolated classloader, and is described by a plugin-descriptor.properties file. See the plugin internals deep dive and the action framework deep dive.

Why This Lab Matters for Contributors

Most of the OpenSearch ecosystem — security, k-NN, SQL, alerting, ml-commons — is plugins built exactly this way, in separate repos, against the published org.opensearch:opensearch artifacts. Even in core, new features frequently arrive as a module or plugin first. Knowing how to register an action and a handler is table stakes for the plugin labs and a real asset when you pick up a feature issue.

Prerequisites

OpenSearch builds locally and you can produce a distro: ./gradlew localDistro (Lab 3.1/3.2).
JDK 21 available to your IDE (the repo bundles its own JDK for the build).
You know the version you're building against:

grep -n "^opensearch" buildSrc/version.properties 2>/dev/null || grep -rn "version" build.gradle | head
grep -n "CURRENT\|V_3" libs/core/src/main/java/org/opensearch/Version.java | head

Let <OS_VERSION> below stand for that version (e.g. 3.0.0).

Step-by-Step Tasks

You will build this as a standalone plugin project (the way ecosystem plugins are built), not inside the OpenSearch tree. That keeps the dependency wiring explicit.

Step 1 (5 min) — Project skeleton

mkdir -p my-greeting-plugin/src/main/java/org/example/greeting
mkdir -p my-greeting-plugin/src/main/plugin-metadata
cd my-greeting-plugin

Layout you will create:

my-greeting-plugin/
├── build.gradle
├── settings.gradle
└── src/main/
    ├── java/org/example/greeting/
    │   ├── GreetingPlugin.java
    │   ├── GreetingAction.java
    │   ├── GreetingRequest.java
    │   ├── GreetingResponse.java
    │   ├── TransportGreetingAction.java
    │   └── RestGreetingAction.java
    └── plugin-metadata/
        └── plugin-descriptor.properties   (generated by the gradle plugin; see Step 6)

Step 2 (8 min) — The `ActionType` and request/response

GreetingAction.java — the typed key. NAME is the transport action name; keep it namespaced.

package org.example.greeting;

import org.opensearch.action.ActionType;

public class GreetingAction extends ActionType<GreetingResponse> {
    public static final GreetingAction INSTANCE = new GreetingAction();
    public static final String NAME = "cluster:admin/greeting";

    private GreetingAction() {
        super(NAME, GreetingResponse::new); // reader for the response off the wire
    }
}

GreetingRequest.java — a Writeable + validatable request carrying the name param.

package org.example.greeting;

import org.opensearch.action.ActionRequest;
import org.opensearch.action.ActionRequestValidationException;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;

import java.io.IOException;

import static org.opensearch.action.ValidateActions.addValidationError;

public class GreetingRequest extends ActionRequest {
    private final String name;

    public GreetingRequest(String name) {
        this.name = name;
    }

    // Read constructor — order MUST mirror writeTo().
    public GreetingRequest(StreamInput in) throws IOException {
        super(in);
        this.name = in.readString();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        super.writeTo(out);
        out.writeString(name);
    }

    @Override
    public ActionRequestValidationException validate() {
        ActionRequestValidationException e = null;
        if (name == null || name.isBlank()) {
            e = addValidationError("name must not be empty", e);
        }
        return e;
    }

    public String getName() {
        return name;
    }
}

GreetingResponse.java — Writeable (transport) + ToXContentObject (JSON).

package org.example.greeting;

import org.opensearch.core.action.ActionResponse;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.xcontent.ToXContentObject;
import org.opensearch.core.xcontent.XContentBuilder;

import java.io.IOException;

public class GreetingResponse extends ActionResponse implements ToXContentObject {
    private final String greeting;
    private final String nodeName;

    public GreetingResponse(String greeting, String nodeName) {
        this.greeting = greeting;
        this.nodeName = nodeName;
    }

    public GreetingResponse(StreamInput in) throws IOException {
        super(in);
        this.greeting = in.readString();
        this.nodeName = in.readString();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(greeting);
        out.writeString(nodeName);
    }

    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject();
        builder.field("greeting", greeting);
        builder.field("node", nodeName);
        builder.endObject();
        return builder;
    }
}

Warning: writeTo and the StreamInput constructor must read/write fields in the same order. This is the single most common plugin bug and the root of countless wire-compat issues. A round-trip test (AbstractWireSerializingTestCase) catches it — see Level 5.

Package paths for StreamInput/StreamOutput/ToXContent have moved between org.opensearch.common.* and org.opensearch.core.* across versions. If an import doesn't resolve, run: grep -rn "class StreamInput\|interface ToXContentObject" libs/ server/ | head against your branch.

Step 3 (10 min) — The transport action

HandledTransportAction is the base for "run on this node" actions. It registers the transport handler for you; you implement doExecute.

package org.example.greeting;

import org.opensearch.action.support.ActionFilters;
import org.opensearch.action.support.HandledTransportAction;
import org.opensearch.cluster.service.ClusterService;
import org.opensearch.common.inject.Inject;
import org.opensearch.core.action.ActionListener;
import org.opensearch.tasks.Task;
import org.opensearch.transport.TransportService;

public class TransportGreetingAction extends HandledTransportAction<GreetingRequest, GreetingResponse> {

    private final ClusterService clusterService;

    @Inject
    public TransportGreetingAction(
        TransportService transportService,
        ActionFilters actionFilters,
        ClusterService clusterService
    ) {
        // Registers the transport handler under GreetingAction.NAME. This constructor uses the
        // SAME executor (the handler runs on the calling thread), so doExecute must not block.
        // The request reader tells the transport layer how to deserialize incoming GreetingRequests.
        super(GreetingAction.NAME, transportService, actionFilters, GreetingRequest::new);
        this.clusterService = clusterService;
    }

    @Override
    protected void doExecute(Task task, GreetingRequest request, ActionListener<GreetingResponse> listener) {
        try {
            String node = clusterService.localNode().getName();
            GreetingResponse response = new GreetingResponse("hello, " + request.getName(), node);
            listener.onResponse(response);
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }
}

Key points:

@Inject — OpenSearch uses a Guice container internally; the constructor's parameters are wired for you. ClusterService gives you localNode(), which proves the action ran inside the engine.
doExecute must be non-blocking and complete the ActionListener exactly once (onResponse or onFailure). Never throw out of it without calling onFailure.
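The "exactly once" rule is easy to break with a try/catch: if the code after `onResponse` throws, the catch block calls `onFailure` on a listener that has already completed. One common defence is a guard that ignores every call after the first. The sketch below is a hypothetical `ToyNotifyOnce`, not an OpenSearch class; check your branch's `ActionListener` for a ready-made equivalent before writing your own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Toy "complete exactly once" guard. ToyNotifyOnce is a hypothetical name.
class ToyNotifyOnce<T> {
    private final AtomicBoolean completed = new AtomicBoolean(false);
    private final Consumer<T> responseHandler;
    private final Consumer<Exception> failureHandler;

    ToyNotifyOnce(Consumer<T> responseHandler, Consumer<Exception> failureHandler) {
        this.responseHandler = responseHandler;
        this.failureHandler = failureHandler;
    }

    // Only the first call, of either kind, reaches the delegate.
    void onResponse(T response) {
        if (completed.compareAndSet(false, true)) {
            responseHandler.accept(response);
        }
    }

    void onFailure(Exception e) {
        if (completed.compareAndSet(false, true)) {
            failureHandler.accept(e);
        }
    }

    // Completes a listener successfully, then tries to fail it; the failure is dropped.
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        ToyNotifyOnce<String> listener = new ToyNotifyOnce<>(
            response -> events.add("response:" + response),
            failure -> events.add("failure:" + failure.getMessage())
        );
        listener.onResponse("hello");
        listener.onFailure(new IllegalStateException("late failure")); // ignored
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```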

Step 4 (10 min) — The REST handler

package org.example.greeting;

// 3.x package; on 2.x branches this is org.opensearch.client.node.NodeClient.
import org.opensearch.transport.client.node.NodeClient;
import org.opensearch.rest.BaseRestHandler;
import org.opensearch.rest.RestRequest;
import org.opensearch.rest.action.RestToXContentListener;

import java.util.List;

import static java.util.Collections.singletonList;
import static org.opensearch.rest.RestRequest.Method.GET;

public class RestGreetingAction extends BaseRestHandler {

    @Override
    public String getName() {
        return "greeting_action";
    }

    @Override
    public List<Route> routes() {
        return singletonList(new Route(GET, "/_my/greeting"));
    }

    @Override
    protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) {
        String name = request.param("name", "world");
        GreetingRequest greetingRequest = new GreetingRequest(name);
        // RestToXContentListener serializes the ToXContent response to the channel as JSON.
        return channel -> client.execute(
            GreetingAction.INSTANCE,
            greetingRequest,
            new RestToXContentListener<>(channel)
        );
    }
}

This is exactly the pattern you read in RestClusterHealthAction (Lab 3.1): parse params → build an ActionRequest → return a consumer that calls client.execute(ActionType, ...). The handler holds no business logic.

Step 5 (8 min) — The plugin class wiring it together

package org.example.greeting;

import org.opensearch.action.ActionRequest;
import org.opensearch.cluster.metadata.IndexNameExpressionResolver;
import org.opensearch.cluster.node.DiscoveryNodes;
import org.opensearch.common.settings.ClusterSettings;
import org.opensearch.common.settings.IndexScopedSettings;
import org.opensearch.common.settings.Settings;
import org.opensearch.common.settings.SettingsFilter;
import org.opensearch.core.action.ActionResponse;
import org.opensearch.plugins.ActionPlugin;
import org.opensearch.plugins.Plugin;
import org.opensearch.rest.RestController;
import org.opensearch.rest.RestHandler;

import java.util.List;
import java.util.function.Supplier;

import static java.util.Collections.singletonList;

public class GreetingPlugin extends Plugin implements ActionPlugin {

    @Override
    public List<ActionHandler<? extends ActionRequest, ? extends ActionResponse>> getActions() {
        // Bind the ActionType to its TransportAction. This is the plugin-side equivalent
        // of ActionModule.register(...) that you read in Lab 3.1.
        return singletonList(new ActionHandler<>(GreetingAction.INSTANCE, TransportGreetingAction.class));
    }

    @Override
    public List<RestHandler> getRestHandlers(
        Settings settings,
        RestController restController,
        ClusterSettings clusterSettings,
        IndexScopedSettings indexScopedSettings,
        SettingsFilter settingsFilter,
        IndexNameExpressionResolver indexNameExpressionResolver,
        Supplier<DiscoveryNodes> nodesInCluster
    ) {
        return singletonList(new RestGreetingAction());
    }
}

Note: The exact signature of getRestHandlers(...) has gained/lost parameters across versions. If it doesn't match, copy the current signature from an in-tree module: grep -rn "getRestHandlers" modules/*/src/main/java | head and mirror it. The same is true for ActionHandler — confirm with find server/src/main/java -name ActionHandler.java.

Step 6 (10 min) — Gradle build files

settings.gradle:

rootProject.name = 'my-greeting-plugin'

build.gradle:

plugins {
    id 'java'
    // The OpenSearch-provided plugin: produces the plugin zip + descriptor + assemble tasks.
    id 'opensearch.opensearchplugin'
}

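// Project version: becomes the zip name suffix and the descriptor's version field.
version = '1.0.0'

opensearchplugin {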
    name = 'my-greeting-plugin'
    description = 'Adds GET /_my/greeting'
    classname = 'org.example.greeting.GreetingPlugin'
    licenseFile = rootProject.file('LICENSE.txt')
    noticeFile = rootProject.file('NOTICE.txt')
}

dependencies {
    // Provided by the running OpenSearch node at runtime; do NOT bundle it in the zip.
    compileOnly "org.opensearch:opensearch:${opensearch_version}"
    testImplementation "org.opensearch.test:framework:${opensearch_version}"
}

// Skip strict precommit gates for a learning plugin (re-enable for a real one).
loggerUsageCheck.enabled = false
validateNebulaPom.enabled = false

Note: The opensearch.opensearchplugin Gradle plugin must be resolvable. Real ecosystem plugins add the OpenSearch build-tools to the pluginManagement block in settings.gradle and set opensearch_version in gradle.properties. Mirror a known-good project such as opensearch-plugin-template-java for the exact pluginManagement and version wiring on your target branch — the values drift, and copying a maintained template is the supported path.

The Gradle plugin generates plugin-descriptor.properties at build time from the opensearchplugin {} block and the project version. It looks like this (you do not hand-write it; this is what ends up in the zip):

description=Adds GET /_my/greeting
version=1.0.0
name=my-greeting-plugin
classname=org.example.greeting.GreetingPlugin
java.version=21
opensearch.version=<OS_VERSION>

Step 7 (8 min) — Build, install, run

# Build the plugin zip.
./gradlew assemble
ls build/distributions/   # -> my-greeting-plugin-1.0.0.zip

# Install it into a local OpenSearch distro (built earlier with ./gradlew localDistro in the OS tree).
DISTRO=/path/to/opensearch-<OS_VERSION>     # the unpacked localDistro
"$DISTRO/bin/opensearch-plugin" install \
  "file://$(pwd)/build/distributions/my-greeting-plugin-1.0.0.zip"

# Confirm it registered:
"$DISTRO/bin/opensearch-plugin" list   # -> my-greeting-plugin

# Start the node and curl the endpoint:
"$DISTRO/bin/opensearch" &
sleep 20
curl -s 'localhost:9200/_my/greeting?name=contributor&pretty'

Expected:

{
  "greeting" : "hello, contributor",
  "node" : "<your-node-name>"
}

The name default and validation work too:

curl -s 'localhost:9200/_my/greeting?pretty'          # -> "hello, world"
curl -s 'localhost:9200/_my/greeting?name=&pretty'    # -> 400, validation error "name must not be empty"

Note: The opensearch.version in your descriptor must exactly match the running distro's version, or bin/opensearch-plugin install rejects the zip. The strictness is deliberate: plugins run in-process with deep access to engine internals, so they must be compiled against the exact engine version.

Implementation Requirements

GreetingAction is an ActionType<GreetingResponse> with a namespaced NAME and a singleton.
GreetingRequest implements validate() and a symmetric Writeable (read order == write order).
GreetingResponse implements both Writeable (transport) and ToXContentObject (JSON).
TransportGreetingAction extends HandledTransportAction, completes the listener exactly once, and reports the local node name from ClusterService.
RestGreetingAction extends BaseRestHandler, declares its route in routes(), and contains no business logic.
GreetingPlugin implements ActionPlugin and wires both getActions() and getRestHandlers(...).
The plugin builds, installs via bin/opensearch-plugin install, and the curl returns the expected JSON including the node name.
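The "read order == write order" requirement is easy to see with a standard-library stand-in. The sketch below is not OpenSearch code: `DataOutputStream`/`DataInputStream` take the place of `StreamOutput`/`StreamInput`, and `GreetingPayload` with its `repeat` field is a hypothetical class invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writeable; stdlib streams replace StreamOutput/StreamInput.
final class GreetingPayload {
    final String name;
    final int repeat;

    GreetingPayload(String name, int repeat) {
        this.name = name;
        this.repeat = repeat;
    }

    // "Read constructor": reads fields in exactly the order writeTo wrote them.
    GreetingPayload(DataInputStream in) throws IOException {
        this.name = in.readUTF();
        this.repeat = in.readInt();
    }

    void writeTo(DataOutputStream out) throws IOException {
        out.writeUTF(name);
        out.writeInt(repeat);
    }

    // Serialize, then deserialize from the same bytes.
    static GreetingPayload roundTrip(GreetingPayload original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.writeTo(new DataOutputStream(bytes));
            return new GreetingPayload(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GreetingPayload copy = roundTrip(new GreetingPayload("contributor", 2));
        System.out.println(copy.name + " x" + copy.repeat); // contributor x2
    }
}
```

Swapping the two reads in the read constructor makes the round trip fail or return garbage, which is the failure a wire-serialization test is meant to catch before merge.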

Troubleshooting

Symptom	Likely cause	Fix
`plugin [...] requires opensearch version [X] but ...` on install	Descriptor `opensearch.version` ≠ distro version	Set `opensearch_version` to the distro's exact version
`ClassNotFoundException` for `GreetingPlugin` at startup	Wrong `classname` in `opensearchplugin {}`	Match the fully-qualified class name exactly
404 on `/_my/greeting`	Handler not registered / route mismatch	Check `getRestHandlers` returns it and `routes()` path is exact
500 with serialization error	`writeTo` / read constructor field order mismatch	Make read order mirror write order; add a round-trip test
`Plugin already exists` on install	Reinstalling	`bin/opensearch-plugin remove my-greeting-plugin` first
Import doesn't resolve (`StreamInput`, `ToXContent`)	Package moved between `common`/`core` on your branch	`grep -rn "class StreamInput" libs/ server/` and fix the import

Expected Output

A clean install + a successful curl:

$ bin/opensearch-plugin list
my-greeting-plugin

$ curl -s 'localhost:9200/_my/greeting?name=contributor'
{"greeting":"hello, contributor","node":"node-1"}

Stretch Goals

Return a node count instead of a greeting. Inject ClusterService, read clusterService.state().nodes().getSize(), and return it. You'll touch ClusterState — a direct bridge to Level 4.
Add a transport handler test. Use OpenSearchSingleNodeTestCase to load your plugin (getPlugins()), call client().execute(GreetingAction.INSTANCE, req).get(), and assert the response. Wire-test the request/response with AbstractWireSerializingTestCase. Full treatment in Lab 4.3 and Level 5.
Route it to the cluster manager. Make a variant whose transport action extends TransportClusterManagerNodeAction and reports which node is the elected manager. Compare how the request now hops when you curl a non-manager node (cross-reference Lab 3.2).

Validation / Self-check

Where does the binding from GreetingAction.INSTANCE to TransportGreetingAction happen in your plugin, and what is the core-engine equivalent you read in Lab 3.1?
Why does GreetingResponse implement both Writeable and ToXContentObject — what is each one for, and on which port/protocol is each used?
Your RestGreetingAction contains no business logic. Where is the logic, and why is that separation enforced by the framework rather than just by convention?
What two things must be true for bin/opensearch-plugin install to accept your zip?
doExecute must complete the ActionListener exactly once. What goes wrong if you (a) complete it twice, or (b) throw without calling onFailure?
If you reorder two fields in writeTo but not in the read constructor, what fails, when, and which kind of test would have caught it before merge?

When the curl returns your greeting with the answering node's name, and you can answer all six questions, you've completed Lab 3.3 — and Level 3. Continue to Level 4: Cluster Coordination and State.

Level 4: Cluster Coordination and State

This is the hardest level in the curriculum so far, and one of the hardest in the whole book. To say it plainly: the coordination layer is where strong engineers get humbled. It combines a distributed consensus protocol (Zen2, a Raft-like algorithm), an immutable versioned state object, a two-phase publish/commit protocol, two cooperating single-threaded services, and a shard allocation engine, all interacting. You will not understand it in one pass. You will read it, run it, watch the logs, re-read it, and only then will the picture cohere.

This level is the OpenSearch analog of Tez's state-machine internals (VertexImpl/DAGImpl). Where Tez has a DAG state machine driven by an async dispatcher, OpenSearch has a cluster state driven by consensus and applied through a publish protocol. Both reward the same approach: read the state object, read the transitions, run it, watch it move.

Warning: Do not start Level 4 until you can trace REST → Transport → Action without a guide (Level 3). The coordination layer assumes you know what a TransportAction, Writeable, and the cluster manager (formerly master) are.

Learning Objectives

By the end of Level 4 you must be able to:

Explain the coordination layer: the Coordinator, election, joins, and publish/commit, and how it forms and maintains a cluster.
Describe ClusterState as an immutable, versioned snapshot and name its four major components: Metadata, RoutingTable, DiscoveryNodes, ClusterBlocks.
Distinguish the two services: the cluster-manager service (MasterService) that computes new states from update tasks, and the ClusterApplierService that applies committed states.
Trace a cluster state update from a ClusterStateUpdateTask through computation, two-phase publication, and application — and explain why each phase exists.
Distinguish a ClusterStateApplier from a ClusterStateListener and say when to use each.
Explain shard allocation: AllocationService, RoutingAllocation, and the AllocationDeciders chain, and read a single decider end to end.
Read and reason about org.opensearch.cluster.coordination and org.opensearch.cluster.routing.allocation without a guide.

The Coordination Layer

Everything in OpenSearch that is "the cluster agreeing on something" runs through org.opensearch.cluster.coordination. The protagonist is the Coordinator: a per-node object that implements OpenSearch's consensus (called Zen2, a Raft-inspired protocol that replaced the original "Zen" discovery). It owns election, joins, failure detection, and publication.

find server/src/main/java -path "*cluster/coordination*" -name "*.java" | sort

The cast you will meet in Lab 4.1:

Class	Responsibility
`Coordinator`	The orchestrator: discovery, election, joins, publication, failure handling
`CoordinationState`	The persisted consensus state: term, last-accepted/committed state, voting config
`PreVoteCollector`	Pre-voting round (avoids disruptive elections before a real vote)
`ElectionSchedulerFactory`	Randomized, backing-off election timers (prevents thundering herds)
`JoinHelper`	Sends/receives join requests; a follower joins an elected manager
`FollowersChecker`	The manager pings followers; removes ones that stop responding
`LeaderChecker`	A follower pings the manager; triggers a new election if it disappears
`ClusterFormationFailureHelper`	Produces the "cluster not formed; reason: ..." diagnostics
`PublicationTransportHandler`	The transport handler for two-phase publish then commit

Consensus in one paragraph. A cluster-manager-eligible node that thinks there is no leader starts a pre-vote (PreVoteCollector). If it gathers enough pre-votes, it bumps the term and starts a real election, collecting join votes from a quorum of the voting configuration (CoordinationState.VoteCollection). Once it has a quorum it becomes the elected manager, and from then on it is the only node allowed to compute and publish new cluster states. Quorum-based voting is what prevents split-brain: with a voting config of N nodes, you need ⌊N/2⌋+1 votes, so two disjoint majorities cannot both exist.
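The majority rule can be written down in a few lines. This is a minimal sketch of the ⌊N/2⌋+1 arithmetic, not the OpenSearch implementation; `QuorumSketch` and `hasQuorum` are illustrative names, and node ids are plain strings.

```java
import java.util.HashSet;
import java.util.Set;

final class QuorumSketch {
    // A vote set is a quorum when strictly more than half of the voting configuration voted.
    static boolean hasQuorum(Set<String> votingConfig, Set<String> votes) {
        Set<String> counted = new HashSet<>(votingConfig);
        counted.retainAll(votes); // votes from nodes outside the configuration do not count
        return counted.size() * 2 > votingConfig.size();
    }

    public static void main(String[] args) {
        Set<String> config = Set.of("n1", "n2", "n3");
        System.out.println(hasQuorum(config, Set.of("n1", "n2"))); // true: 2 of 3
        System.out.println(hasQuorum(config, Set.of("n3")));       // false: 1 of 3
        System.out.println(hasQuorum(config, Set.of("n3", "n9"))); // false: n9 is not a voter
    }
}
```

Because any two vote sets that pass this check must share at least one node, the two sides of a partition cannot both elect a manager.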

The first cluster ever formed needs a seed: cluster.initial_cluster_manager_nodes (deprecated alias: cluster.initial_master_nodes) names the bootstrap voting configuration. You set it once, on first formation, and never again — it is ignored after the cluster has a committed state.

See the discovery & coordination deep dive and the cluster state publishing deep dive.

ClusterState: The Immutable Versioned Snapshot

ClusterState is the single source of truth about the cluster. It is immutable: every change produces a new ClusterState with an incremented version. The elected cluster manager publishes it to every node; all nodes converge on the same versioned snapshot.

find server/src/main/java -name "ClusterState.java"
grep -n "public.*version\|Metadata\|RoutingTable\|DiscoveryNodes\|ClusterBlocks\|Builder\|long version" \
  server/src/main/java/org/opensearch/cluster/ClusterState.java | head -40

Its four major components:

Component	Class	What it holds
Metadata	`Metadata`	Index settings/mappings/aliases, templates, persistent settings, custom metadata
Routing table	`RoutingTable`	Which shard copies (primary/replica) are assigned to which node, and their state
Nodes	`DiscoveryNodes`	The set of nodes (id, name, address, roles), and who is the cluster manager
Blocks	`ClusterBlocks`	Cluster/index-level blocks (e.g. read-only, no-cluster-manager)

Two more properties matter constantly:

version — a monotonically increasing number per published state. "The state moved from v42 to v43" is the unit of progress you will watch in logs.
stateUUID — a unique id for this exact state; used to validate diffs.

Because publishing the whole state every time would be expensive, OpenSearch ships diffs: a node already on v42 receives only the Diff to reach v43 (Diffable/Diff). Full treatment in Lab 4.2 and the cluster state deep dive.
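The "immutable, versioned snapshot" idea can be modeled in miniature. The sketch below is a hypothetical `StateSnapshot` class, not the real `ClusterState` API: a settings map stands in for `Metadata`, and each change returns a new object with the version incremented and a fresh UUID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical miniature of an immutable, versioned state; not the real ClusterState API.
final class StateSnapshot {
    final long version;
    final String stateUUID;
    final Map<String, String> settings; // stand-in for Metadata

    private StateSnapshot(long version, String stateUUID, Map<String, String> settings) {
        this.version = version;
        this.stateUUID = stateUUID;
        this.settings = Map.copyOf(settings); // unmodifiable copy
    }

    static StateSnapshot initial() {
        return new StateSnapshot(0, UUID.randomUUID().toString(), Map.of());
    }

    // "Changing" the state returns a new snapshot: version + 1 and a fresh UUID.
    StateSnapshot withSetting(String key, String value) {
        Map<String, String> next = new HashMap<>(settings);
        next.put(key, value);
        return new StateSnapshot(version + 1, UUID.randomUUID().toString(), next);
    }

    public static void main(String[] args) {
        StateSnapshot v0 = initial();
        StateSnapshot v1 = v0.withSetting("cluster.routing.allocation.enable", "all");
        System.out.println(v0.version + " -> " + v1.version); // 0 -> 1
        System.out.println(v0.settings.isEmpty());            // true: the old snapshot is untouched
    }
}
```

Readers holding the old snapshot never see a half-applied change, which is why the real engine can hand the same state object to many threads without locking it.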

The Two Services: Compute vs Apply

OpenSearch separates deciding a new state from applying it. Two services, both effectively single-threaded, both hung off ClusterService:

Service	Class	Runs where	Job
Cluster-manager service	`MasterService` (a.k.a. cluster-manager service)	only on the elected manager	Batches `ClusterStateUpdateTask`s, computes the next `ClusterState`, publishes it
Applier service	`ClusterApplierService`	every node	Applies each committed `ClusterState` to local components

find server/src/main/java -name "MasterService.java" -o -name "ClusterApplierService.java" -o -name "ClusterService.java"
grep -n "submitStateUpdateTask\|runTasks\|calculateTaskOutputs\|publish" \
  server/src/main/java/org/opensearch/cluster/service/MasterService.java | head

Why split them? Only the manager may compute a new state (consensus authority). But every node must apply the committed state to its own services (routing, mappings, settings, allocation). Splitting compute (manager-only) from apply (all nodes) is what lets the same state object drive the whole cluster.

A state change always flows: update task → compute (manager) → publish → commit → apply (all nodes). The next section diagrams it.

A Cluster State Update, End to End

flowchart TD
    SRC[Trigger: PUT _cluster/settings,<br/>create index, node join, shard event] --> TASK[submitStateUpdateTask<br/>ClusterStateUpdateTask]
    TASK --> BATCH[MasterService batches tasks<br/>by ClusterStateTaskExecutor]
    BATCH --> COMP[executor.execute oldState -> newState<br/>version++ , new stateUUID]
    COMP --> PUB1[PublicationTransportHandler<br/>PHASE 1: PUBLISH diff to all nodes]
    PUB1 -->|each node validates & ACKs| QUORUM{quorum of<br/>cluster-manager-eligible<br/>nodes ACKed?}
    QUORUM -->|no| FAIL[publication fails;<br/>state NOT applied; retry/step down]
    QUORUM -->|yes| PUB2[PHASE 2: COMMIT to all nodes]
    PUB2 --> APPLY[each node: ClusterApplierService.applyState]
    APPLY --> AP[ClusterStateAppliers run first<br/>routing, mappings, allocation]
    AP --> LIS[ClusterStateListeners run after<br/>observers, plugins, your listener]
    LIS --> ACKD[task listeners ACKed; update complete]

The two phases are the crux:

Publish (phase 1): the manager sends the new state (as a diff) to every node. Each node validates it and ACKs, but does not apply it yet. The manager waits for a quorum of cluster-manager-eligible nodes to ACK.
Commit (phase 2): once a quorum has ACKed, the manager sends a commit. Now every node applies the state via ClusterApplierService.
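The two phases can be simulated with a toy model. This is a sketch under simplifying assumptions, not the `PublicationTransportHandler` logic: `Node`, `acceptedVersion`, and `appliedVersion` are hypothetical names, and an unreachable node simply never ACKs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of publish-then-commit; names are illustrative, not OpenSearch APIs.
final class TwoPhasePublishSketch {
    static final class Node {
        final String id;
        final boolean reachable;
        long acceptedVersion; // phase 1 result: validated and stored, not applied
        long appliedVersion;  // phase 2 result: visible to the node's components

        Node(String id, boolean reachable) {
            this.id = id;
            this.reachable = reachable;
        }
    }

    // Returns true if the new version was committed and applied on the nodes that ACKed.
    static boolean publish(long newVersion, List<Node> nodes, Set<String> votingConfig) {
        int acksFromVoters = 0;
        List<Node> accepted = new ArrayList<>();
        for (Node node : nodes) { // PHASE 1: publish
            if (node.reachable) {
                node.acceptedVersion = newVersion;
                accepted.add(node);
                if (votingConfig.contains(node.id)) {
                    acksFromVoters++;
                }
            }
        }
        if (acksFromVoters * 2 <= votingConfig.size()) {
            return false; // no quorum: never committed, never applied
        }
        for (Node node : accepted) { // PHASE 2: commit
            node.appliedVersion = newVersion;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> voters = Set.of("n1", "n2", "n3");
        List<Node> healthy = List.of(new Node("n1", true), new Node("n2", true), new Node("n3", false));
        System.out.println(publish(43, healthy, voters));      // true: 2 of 3 voters ACKed
        List<Node> partitioned = List.of(new Node("n1", true), new Node("n2", false), new Node("n3", false));
        System.out.println(publish(43, partitioned, voters));  // false: 1 of 3 voters ACKed
        System.out.println(partitioned.get(0).appliedVersion); // 0: accepted but never applied
    }
}
```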

If the quorum is not reached, the state is never committed and never applied — this is what keeps a partitioned manager from unilaterally changing the cluster. On apply, appliers run before listeners: appliers (ClusterStateApplier) make the node's components consistent with the new state (and may fail the apply); listeners (ClusterStateListener) are fire-and-forget observers.
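The ordering and failure semantics on apply can be sketched as follows. This is an illustration only: `ApplyOrderSketch` is a hypothetical class, `Consumer<Long>` stands in for both extension-point interfaces, and the rule that a failing listener is logged and skipped is the assumption being demonstrated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ApplyOrderSketch {
    private final List<Consumer<Long>> appliers = new ArrayList<>();
    private final List<Consumer<Long>> listeners = new ArrayList<>();

    void addApplier(Consumer<Long> applier) {
        appliers.add(applier);
    }

    void addListener(Consumer<Long> listener) {
        listeners.add(listener);
    }

    // Appliers run first and may abort the apply by throwing; listeners only observe.
    void apply(long committedVersion) {
        for (Consumer<Long> applier : appliers) {
            applier.accept(committedVersion); // an exception here propagates to the caller
        }
        for (Consumer<Long> listener : listeners) {
            try {
                listener.accept(committedVersion);
            } catch (RuntimeException e) {
                // A failing listener is reported and skipped; it cannot fail the apply.
                System.err.println("listener failed: " + e.getMessage());
            }
        }
    }

    // Registers the listener before the applier on purpose, then records the call order.
    static List<String> demo() {
        List<String> calls = new ArrayList<>();
        ApplyOrderSketch service = new ApplyOrderSketch();
        service.addListener(v -> calls.add("listener v" + v));
        service.addApplier(v -> calls.add("applier v" + v));
        service.apply(43);
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [applier v43, listener v43]
    }
}
```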

Both halves are covered in Lab 4.2.

Shard Allocation

The routing table doesn't compute itself. When shards are unassigned (new index, node left, restored snapshot), the AllocationService runs to place them, gated by a chain of AllocationDeciders.

find server/src/main/java -path "*routing/allocation*" -name "AllocationService.java"
find server/src/main/java -path "*routing/allocation/decider*" -name "*.java" | sort

Class	Role
`AllocationService`	Entry point: `reroute(...)`, `applyStartedShards(...)`, `applyFailedShards(...)`
`RoutingAllocation`	The mutable working context for one allocation round (routing nodes, deciders, state)
`RoutingNodes`	Mutable view of shard→node assignment being computed
`AllocationDeciders`	The ordered chain of deciders; each returns a `Decision`
`BalancedShardsAllocator`	The default balancer that proposes moves within what deciders allow
`Decision`	`YES` / `NO` / `THROTTLE` with an explanation string

Common deciders you will meet (and fix in Lab 4.4):

Decider	Prevents
`SameShardAllocationDecider`	Putting two copies of the same shard (for example a primary and its replica) on the same node
`MaxRetryAllocationDecider`	Endlessly retrying a shard that keeps failing to allocate
`DiskThresholdDecider`	Allocating onto nodes over the disk watermark
`AwarenessAllocationDecider`	Violating zone/rack awareness
`FilterAllocationDecider`	Ignoring include/exclude/require allocation filters
`ThrottlingAllocationDecider`	Too many concurrent recoveries on a node

A Decision is the unit of correctness here. A wrong Decision.YES/NO, or an unhelpful explanation, is exactly the kind of bug Lab 4.4 fixes. See the shard allocation deep dive.
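How a chain of deciders combines into one answer can be sketched in a few lines. This is a simplified model, not the `AllocationDeciders` source: `Placement`, its fields, and the recovery limit of 2 are hypothetical, and the two functions only mirror the intent of the same-shard and throttling deciders.

```java
import java.util.List;
import java.util.function.Function;

final class DeciderChainSketch {
    enum Decision { YES, THROTTLE, NO }

    // Hypothetical inputs for one "may this shard copy go on this node?" question.
    record Placement(boolean nodeAlreadyHoldsCopy, int recoveriesOnNode) {}

    // Mirrors the intent of SameShardAllocationDecider: never two copies on one node.
    static Decision sameShard(Placement p) {
        return p.nodeAlreadyHoldsCopy() ? Decision.NO : Decision.YES;
    }

    // Mirrors the intent of ThrottlingAllocationDecider; the limit of 2 is illustrative.
    static Decision throttling(Placement p) {
        return p.recoveriesOnNode() >= 2 ? Decision.THROTTLE : Decision.YES;
    }

    // Any NO is final; otherwise THROTTLE outranks YES ("allowed, but not right now").
    static Decision decide(Placement p, List<Function<Placement, Decision>> deciders) {
        Decision result = Decision.YES;
        for (Function<Placement, Decision> decider : deciders) {
            Decision d = decider.apply(p);
            if (d == Decision.NO) {
                return Decision.NO;
            }
            if (d == Decision.THROTTLE) {
                result = Decision.THROTTLE;
            }
        }
        return result;
    }

    static final List<Function<Placement, Decision>> CHAIN =
        List.of(DeciderChainSketch::sameShard, DeciderChainSketch::throttling);

    public static void main(String[] args) {
        System.out.println(decide(new Placement(false, 0), CHAIN)); // YES
        System.out.println(decide(new Placement(false, 2), CHAIN)); // THROTTLE
        System.out.println(decide(new Placement(true, 2), CHAIN));  // NO
    }
}
```

The third case is the one to remember: a throttled node that also violates a hard rule is a NO, not a "try again later".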

Key Classes Quick Reference

Class	Package	Role
`Coordinator`	`org.opensearch.cluster.coordination`	Consensus: discovery, election, joins, publication
`CoordinationState`	`org.opensearch.cluster.coordination`	Persisted term, accepted/committed state, voting config
`JoinHelper` / `PreVoteCollector`	`org.opensearch.cluster.coordination`	Joins and pre-voting
`FollowersChecker` / `LeaderChecker`	`org.opensearch.cluster.coordination`	Bidirectional failure detection
`PublicationTransportHandler`	`org.opensearch.cluster.coordination`	Two-phase publish then commit
`ClusterState`	`org.opensearch.cluster`	The immutable, versioned snapshot
`Metadata` / `RoutingTable` / `DiscoveryNodes` / `ClusterBlocks`	`org.opensearch.cluster.metadata` / `.routing` / `.node` / `.block`	The four state components
`MasterService`	`org.opensearch.cluster.service`	Cluster-manager service: computes states from tasks
`ClusterApplierService`	`org.opensearch.cluster.service`	Applies committed states on every node
`ClusterService`	`org.opensearch.cluster.service`	Ties the two services together; `addListener`, `submitStateUpdateTask`
`ClusterStateUpdateTask` / `ClusterStateTaskExecutor`	`org.opensearch.cluster`	A state change + how a batch of them computes a new state
`ClusterStateApplier` / `ClusterStateListener`	`org.opensearch.cluster`	Apply hooks (consistency) vs observe hooks (reactions)
`AllocationService` / `RoutingAllocation` / `AllocationDeciders`	`org.opensearch.cluster.routing.allocation`	Shard placement and its gating

The Labs

Lab	Title	Type
4.1	Read the Coordinator and Cluster-Manager Election	Code-reading (90 min trace)
4.2	ClusterState, Publishing, and the Applier Model	Reading + live TRACE logging
4.3	Build It — A Custom ClusterStateListener / Custom Metadata	Build-it project
4.4	Fix It — An AllocationDecider Edge Case	Fix-it lab with a diff

Deliverables

You must demonstrate all of the following before advancing to Level 5:

A 90-minute guided trace from node start → election → first published state, with class/file citations (Lab 4.1).
A _cluster/settings change watched live in TRACE logs, with the state version incrementing (Lab 4.2).
A working plugin whose ClusterStateListener logs index create/delete events, exercised under a node test (Lab 4.3).
A fixed AllocationDecider with a passing unit test and a _cluster/allocation/explain reproduction (Lab 4.4).
From memory: the four components of ClusterState, the two services, and the two phases of publication.

Common Mistakes

Mistake	Consequence	Fix
Mutating a `ClusterState` in place	Compile/illegal-state errors; you misread the model	State is immutable; build a new one via `ClusterState.builder(oldState)`
Confusing `ClusterStateApplier` with `ClusterStateListener`	Wrong extension point; missed consistency requirement	Appliers ensure consistency (run first, may fail apply); listeners only observe
Thinking publish == commit	Misunderstand partition behavior	Two phases: publish+ACK (quorum), then commit; uncommitted states are never applied
Setting `cluster.initial_cluster_manager_nodes` on a running cluster	No effect; confusion	It seeds only the first bootstrap; ignored once a state is committed
Reading deciders as boolean	You miss `THROTTLE`	A `Decision` is `YES`/`NO`/`THROTTLE`; throttling is not rejection
Blocking inside an applier/listener	Stalls the single-threaded applier; cluster freezes	Appliers/listeners must be fast and non-blocking
Expecting `master` terminology to be gone	You can't find code	`master` survives as deprecated aliases throughout (`MasterService`, `master_timeout`)

How to Verify Success

# 1. You can see the current cluster state version and who is manager.
curl -s 'localhost:9200/_cluster/state/version,master_node?pretty'
curl -s 'localhost:9200/_cat/cluster_manager?v'

# 2. You can find any coordination/allocation class by grep, not memory.
find server/src/main/java -path "*cluster/coordination*" -name "Coordinator.java"
find server/src/main/java -path "*routing/allocation/decider*" -name "SameShardAllocationDecider.java"

# 3. You can watch a state version increment.
curl -s -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
curl -s 'localhost:9200/_cluster/state/version?pretty'   # version went up

When you can explain election, the two services, the two-phase publish, and a single decider — and prove each with a command — you are ready for Level 5: Testing and Debugging.

Lab 4.1: Read the Coordinator and Cluster-Manager Election

This is a deep code-reading lab over org.opensearch.cluster.coordination — the consensus layer that elects a cluster manager (formerly master) and keeps the cluster formed. It is the OpenSearch analog of reading Tez's VertexImpl state machine: dense, important, and not absorbable in one pass. You will read the Coordinator and its collaborators, then run a guided 90-minute trace from node start → election → first published cluster state, with a file citation for every hop.

You will not change code. You will end with a reading-log artifact and answers to a set of questions that prove you understood the protocol, not just the file names.

Background

OpenSearch's coordination layer implements Zen2 — a Raft-inspired consensus protocol that replaced the original "Zen" discovery. It guarantees that at most one node believes it is the elected cluster manager at any committed term, using quorum-based voting over a voting configuration. The same machinery handles node joins, leaves, and failure detection.

The core ideas:

Term: a monotonically increasing election epoch. Every successful election bumps the term. A node only accepts a publication from a manager whose term is ≥ its own.
Voting configuration: the set of cluster-manager-eligible nodes whose votes count. A quorum is a strict majority of it; quorum overlap is what prevents split-brain.
Pre-vote → vote → join → publish: a candidate first runs a non-binding pre-vote, then a real election, collects join votes from a quorum, becomes leader, and publishes the first state of the new term.

Deep-dive companions: discovery-coordination.md, cluster-state-publishing.md. The Level 4 overview has the class table you should keep open beside this lab.

Why This Lab Matters for Contributors

Coordination bugs are the scariest issues in the tracker: "cluster won't form," "split-brain after a network partition," "node can't rejoin," "stuck in election." Maintainers triage these by reading the exact classes in this lab and the ClusterFormationFailureHelper diagnostics. You cannot meaningfully contribute to — or even reproduce — these issues without having read the Coordinator. This is also the foundation for Lab 4.2 (publication) and the disruption tests in Level 5.

Prerequisites

OpenSearch building; ./gradlew run works (Level 3).
You can trace REST → Transport → Action (Lab 3.1).
A reading log:

mkdir -p ~/opensearch-notes
: > ~/opensearch-notes/reading-log-4.1.md

Note: Coordination class/method names are stable, but bodies are long and line numbers drift. Every step gives you a grep/find to locate code on your branch. The terms here (term, quorum, voting config) are protocol vocabulary — learn them, not line numbers.

Map the package first (10 min)

find server/src/main/java -path "*cluster/coordination*" -name "*.java" | sort

Pin this table; it's your map for the whole lab:

File	One-sentence role
`Coordinator.java`	The orchestrator. Owns the mode (CANDIDATE/LEADER/FOLLOWER), election, joins, publication.
`CoordinationState.java`	The persisted consensus state: `currentTerm`, last-accepted/committed `ClusterState`, voting config; vote handling.
`PreVoteCollector.java`	Runs the non-binding pre-vote round before a real election.
`ElectionSchedulerFactory.java`	Schedules randomized, backing-off election attempts.
`JoinHelper.java`	Sends/receives `Join` requests; the mechanism by which followers join a leader.
`FollowersChecker.java`	Leader→follower health checks; removes unresponsive followers.
`LeaderChecker.java`	Follower→leader health checks; triggers re-election if the leader vanishes.
`ClusterFormationFailureHelper.java`	Builds the human-readable "cluster not formed because…" message.
`PublicationTransportHandler.java`	Transport handler for two-phase publish/commit (detailed in Lab 4.2).
`CoordinationMetadata.java`	The voting config + voting-config exclusions stored in cluster `Metadata`.

The guided 90-minute trace: node start → election → first published state

Budget the steps. Keep the reading log open and append after each step.

Step 1 (15 min) — Where the Coordinator is born and started

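# The constructor call may live in the discovery package rather than Node.java, so search both.
grep -rn "new Coordinator\|Coordinator(" server/src/main/java/org/opensearch/node/Node.java \
  server/src/main/java/org/opensearch/discovery/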
find server/src/main/java -name "Coordinator.java"
grep -n "public Coordinator(\|void start(\|void startInitialJoin\|enum Mode\|becomeCandidate\|becomeLeader\|becomeFollower" \
  server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java | head -30

Read:

The Coordinator constructor — note its collaborators: PeerFinder/discovery, JoinHelper, PreVoteCollector, ElectionSchedulerFactory, FollowersChecker, LeaderChecker, PublicationTransportHandler, CoordinationState (lazily, once persisted state is read).
start() and startInitialJoin() — on node start the Coordinator enters CANDIDATE mode and begins discovery (finding peers via discovery.seed_hosts).
The Mode enum: CANDIDATE, LEADER, FOLLOWER, and the becomeCandidate/becomeLeader/becomeFollower transitions. This is the state machine at the heart of the file.
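The shape of that state machine can be sketched independently of the real file. The class below is a hypothetical miniature, not the `Coordinator`: it keeps only the mode field and the three transitions, and it assumes (verify on your branch) that becoming leader is legal only from CANDIDATE.

```java
import java.util.ArrayList;
import java.util.List;

final class ModeSketch {
    enum Mode { CANDIDATE, LEADER, FOLLOWER }

    private Mode mode = Mode.CANDIDATE; // a node always starts as a candidate

    Mode mode() {
        return mode;
    }

    void becomeCandidate() {
        mode = Mode.CANDIDATE;
    }

    // Only a candidate can win an election, so LEADER is reachable only from CANDIDATE.
    void becomeLeader() {
        if (mode != Mode.CANDIDATE) {
            throw new IllegalStateException("expected candidate but was " + mode);
        }
        mode = Mode.LEADER;
    }

    void becomeFollower() {
        mode = Mode.FOLLOWER;
    }

    // Walks one plausible lifecycle and records the mode after each transition.
    static List<Mode> demo() {
        ModeSketch node = new ModeSketch();
        List<Mode> seen = new ArrayList<>();
        seen.add(node.mode());  // started
        node.becomeLeader();    // won an election
        seen.add(node.mode());
        node.becomeCandidate(); // lost contact with a quorum
        seen.add(node.mode());
        node.becomeFollower();  // another node won the next election
        seen.add(node.mode());
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [CANDIDATE, LEADER, CANDIDATE, FOLLOWER]
    }
}
```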

Log it:

cat >> ~/opensearch-notes/reading-log-4.1.md <<'EOF'
## Coordinator lifecycle
- Node.java constructs Coordinator with JoinHelper, PreVoteCollector, ElectionSchedulerFactory,
  FollowersChecker/LeaderChecker, PublicationTransportHandler, CoordinationState.
- start() -> CANDIDATE; startInitialJoin() begins peer discovery.
- Mode: CANDIDATE | LEADER | FOLLOWER ; becomeCandidate/becomeLeader/becomeFollower
EOF

Step 2 (15 min) — Discovery and the pre-vote round

grep -n "PeerFinder\|getFoundPeers\|onFoundPeersUpdated\|startElectionScheduler\|PreVoteCollector\|startPreVoting\|preVoteResponse" \
  server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java | head
grep -n "PreVoteResponse\|PreVoteRequest\|handlePreVoteRequest\|update\|isElectionQuorum\|start(" \
  server/src/main/java/org/opensearch/cluster/coordination/PreVoteCollector.java | head

The flow:

The PeerFinder finds other cluster-manager-eligible nodes and reports them via onFoundPeersUpdated.
When the node has discovered enough peers to possibly form a quorum, it starts the ElectionScheduler, which periodically (with randomized backoff) triggers an election attempt.
Before a real election, PreVoteCollector runs a pre-vote: it asks peers "would you vote for me?" without bumping the term. Only if it gets a pre-vote quorum does it proceed. This avoids disruptive elections that would needlessly bump the term and unseat a healthy leader.

Why pre-vote? In Raft, a partitioned node can repeatedly time out and bump the term, then rejoin and force a re-election even though the cluster was healthy. The pre-vote round lets such a node discover it wouldn't win before it disrupts anything. Read the class Javadoc for the exact rationale.
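The effect of the pre-vote round on the term can be shown with a toy model. This is a sketch of the idea, not `PreVoteCollector`: `Peer`, `Candidate`, and the rule that a peer with a live leader refuses are simplifying assumptions.

```java
import java.util.List;

final class PreVoteSketch {
    // Hypothetical peer view: a peer that still follows a live leader refuses pre-votes.
    record Peer(String id, boolean hasLiveLeader) {}

    static final class Candidate {
        long term;

        Candidate(long term) {
            this.term = term;
        }

        // Pre-vote: count willing peers (plus this node) without touching the term.
        boolean wouldWin(List<Peer> peers) {
            long grants = 1 + peers.stream().filter(p -> !p.hasLiveLeader()).count();
            return grants * 2 > peers.size() + 1;
        }

        // The term is bumped only after a successful pre-vote.
        boolean maybeStartElection(List<Peer> peers) {
            if (!wouldWin(peers)) {
                return false;
            }
            term++;
            return true;
        }
    }

    public static void main(String[] args) {
        Candidate rejoining = new Candidate(5);
        List<Peer> healthyCluster = List.of(new Peer("n2", true), new Peer("n3", true));
        System.out.println(rejoining.maybeStartElection(healthyCluster) + " term=" + rejoining.term); // false term=5

        List<Peer> leaderless = List.of(new Peer("n2", false), new Peer("n3", true));
        System.out.println(rejoining.maybeStartElection(leaderless) + " term=" + rejoining.term); // true term=6
    }
}
```

In the first case the rejoining node learns it cannot win and leaves the term alone, so the healthy leader is never unseated.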

Step 3 (15 min) — The real election: term bump and join votes

grep -n "startElection\|StartJoinRequest\|joinLeaderInTerm\|getCurrentTerm\|incrementTerm\|handleJoinRequest" \
  server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java | head
grep -n "handleJoin\|JoinRequest\|Join \|getTargetNode\|sendJoinRequest\|joinLeaderInTerm" \
  server/src/main/java/org/opensearch/cluster/coordination/JoinHelper.java | head

The real election:

The candidate broadcasts a StartJoinRequest for a new term: one higher than any term it has seen, so at least currentTerm + 1.
Each recipient, via JoinHelper/Coordinator.joinLeaderInTerm, decides whether it can vote in that term (the requested term must be strictly greater than the recipient's current term, so a node votes at most once per term) and, if so, sends a Join.
The candidate collects Joins. The vote-counting lives in CoordinationState:

grep -n "handleJoin\|isElectionQuorum\|VoteCollection\|getLastAcceptedConfiguration\|getLastCommittedConfiguration\|electionWon" \
  server/src/main/java/org/opensearch/cluster/coordination/CoordinationState.java | head

CoordinationState.handleJoin(...) adds a vote to a VoteCollection; isElectionQuorum(...) checks whether the collected votes form a quorum of the last-committed voting configuration and of the last-accepted one (the two differ only while a voting-configuration change is in flight). When they do, the candidate calls becomeLeader(...).
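Both sides of the election can be sketched together. This is a hedged miniature, not `CoordinationState`: the `Voter` class and `electionWon` are illustrative names, and the sketch assumes the quorum is checked against both the last-committed and the last-accepted voting configuration, which you should confirm in `isElectionQuorum` on your branch.

```java
import java.util.HashSet;
import java.util.Set;

final class ElectionSketch {
    // Voter side: hypothetical miniature of the StartJoin term check.
    static final class Voter {
        long currentTerm;

        Voter(long currentTerm) {
            this.currentTerm = currentTerm;
        }

        // Grants a join vote only for a term strictly greater than the current one.
        boolean handleStartJoin(long requestedTerm) {
            if (requestedTerm <= currentTerm) {
                return false;
            }
            currentTerm = requestedTerm; // adopting the term rules out a second vote in it
            return true;
        }
    }

    static boolean isQuorum(Set<String> config, Set<String> votes) {
        Set<String> counted = new HashSet<>(config);
        counted.retainAll(votes);
        return counted.size() * 2 > config.size();
    }

    // Candidate side: the joins must be a quorum of both voting configurations.
    static boolean electionWon(Set<String> joins, Set<String> lastCommitted, Set<String> lastAccepted) {
        return isQuorum(lastCommitted, joins) && isQuorum(lastAccepted, joins);
    }

    public static void main(String[] args) {
        Voter voter = new Voter(4);
        System.out.println(voter.handleStartJoin(5)); // true: first request for term 5
        System.out.println(voter.handleStartJoin(5)); // false: already moved to term 5

        Set<String> committed = Set.of("n1", "n2", "n3");
        Set<String> accepted = Set.of("n1", "n2", "n3", "n4", "n5"); // reconfiguration in flight
        System.out.println(electionWon(Set.of("n1", "n2"), committed, accepted));       // false
        System.out.println(electionWon(Set.of("n1", "n2", "n3"), committed, accepted)); // true
    }
}
```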

Log it:

cat >> ~/opensearch-notes/reading-log-4.1.md <<'EOF'
## Election
- ElectionScheduler fires -> pre-vote (PreVoteCollector) -> if pre-vote quorum, startElection
- StartJoinRequest for a new term (at least currentTerm+1)
- peers send Join (JoinHelper / joinLeaderInTerm)
- CoordinationState.handleJoin -> VoteCollection; isElectionQuorum() over last-committed AND last-accepted voting config
- quorum reached -> Coordinator.becomeLeader -> LEADER mode
EOF

Step 4 (20 min) — The first published state and how a follower joins

A freshly-elected leader must publish a cluster state to make the election durable. Find the publish entry and the follower side:

grep -n "publish\|coordinationState.handleClientValue\|PublishRequest\|PublishResponse\|ApplyCommitRequest\|handlePublishRequest" \
  server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java | head
grep -n "handlePublishRequest\|handleApplyCommit\|PublishWithJoinResponse\|sendPublishRequest\|sendApplyCommit" \
  server/src/main/java/org/opensearch/cluster/coordination/PublicationTransportHandler.java | head

The first published state of a new term carries the new DiscoveryNodes (including the new leader) and the updated CoordinationMetadata. It goes through the two-phase publish (publish → quorum ACK → commit) you will study in detail in Lab 4.2. For now, note:

The leader builds a new ClusterState (via MasterService) and hands it to Coordinator.publish(...).
PublicationTransportHandler sends a PublishRequest (a diff or full state) to every node.
Followers validate it against their CoordinationState (term checks), ACK, and — on the commit phase — apply it.
A follower that wasn't part of the cluster joins by sending a Join to the new leader, which adds it to DiscoveryNodes in a subsequent published state.
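The follower-side term checks can be sketched as a pair of comparisons. This is an assumption-laden miniature, not `CoordinationState.handlePublishRequest`: it assumes the follower has already raised its own term to the incoming one when that was higher, and the class and field names are illustrative.

```java
final class PublishCheckSketch {
    long currentTerm;
    long lastAcceptedTerm;
    long lastAcceptedVersion;

    PublishCheckSketch(long currentTerm, long lastAcceptedTerm, long lastAcceptedVersion) {
        this.currentTerm = currentTerm;
        this.lastAcceptedTerm = lastAcceptedTerm;
        this.lastAcceptedVersion = lastAcceptedVersion;
    }

    // Returns true and records the state if the publication is acceptable.
    boolean handlePublish(long term, long version) {
        if (term != currentTerm) {
            return false; // e.g. a deposed leader still publishing in an old term
        }
        if (term == lastAcceptedTerm && version <= lastAcceptedVersion) {
            return false; // replayed or out-of-order publication
        }
        lastAcceptedTerm = term;
        lastAcceptedVersion = version;
        return true;
    }

    public static void main(String[] args) {
        PublishCheckSketch follower = new PublishCheckSketch(3, 3, 42);
        System.out.println(follower.handlePublish(2, 50)); // false: stale term
        System.out.println(follower.handlePublish(3, 43)); // true: next version in the current term
        System.out.println(follower.handlePublish(3, 43)); // false: version already accepted
    }
}
```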

Now the failure-detection loops that keep the formed cluster alive:

grep -n "FollowersChecker\|handleWakeUp\|setCurrentNodes\|onNodeFailure\|FollowerChecker" \
  server/src/main/java/org/opensearch/cluster/coordination/FollowersChecker.java | head
grep -n "LeaderChecker\|handleWakeUp\|leaderFailed\|setCurrentNodes\|onLeaderFailure" \
  server/src/main/java/org/opensearch/cluster/coordination/LeaderChecker.java | head

FollowersChecker (leader→followers) removes followers that stop responding; LeaderChecker (followers→leader) triggers a fresh CANDIDATE/election if the leader disappears. Together they are the cluster's heartbeat.
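The core of either checker is a streak counter. The sketch below is a hypothetical model, not the `FollowersChecker` source: `retryCount` is an illustrative constructor parameter, and a node is declared failed only after that many consecutive missed checks.

```java
final class FollowerCheckSketch {
    private final int retryCount; // hypothetical: consecutive failures tolerated before removal
    private int consecutiveFailures;
    private boolean failed;

    FollowerCheckSketch(int retryCount) {
        this.retryCount = retryCount;
    }

    boolean isFailed() {
        return failed;
    }

    // Feed one health-check result; a success resets the failure streak.
    void onCheckResult(boolean responded) {
        if (failed) {
            return;
        }
        if (responded) {
            consecutiveFailures = 0;
            return;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= retryCount) {
            failed = true;
        }
    }

    // Replays a sequence of check results and reports whether the node ends up failed.
    static boolean failsAfter(int retryCount, boolean... results) {
        FollowerCheckSketch checker = new FollowerCheckSketch(retryCount);
        for (boolean responded : results) {
            checker.onCheckResult(responded);
        }
        return checker.isFailed();
    }

    public static void main(String[] args) {
        System.out.println(failsAfter(3, false, false, true, false, false)); // false: streak was reset
        System.out.println(failsAfter(3, false, false, false));              // true: three in a row
    }
}
```

Requiring a streak rather than a single miss is what keeps one dropped packet from ejecting a healthy node.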

Log it:

cat >> ~/opensearch-notes/reading-log-4.1.md <<'EOF'
## First publish + steady state
- becomeLeader -> MasterService computes first state of new term -> Coordinator.publish()
- PublicationTransportHandler: PublishRequest -> quorum ACK -> ApplyCommitRequest (commit) [see Lab 4.2]
- new followers send Join -> added to DiscoveryNodes in a later published state
- steady state: FollowersChecker (leader->followers), LeaderChecker (followers->leader) heartbeats
EOF

Step 5 (5 min) — Read the failure diagnostics

When a cluster won't form, this class produces the message users paste into issues:

grep -n "describeQuorum\|getDescription\|cluster-manager not discovered\|master not discovered\|initial_cluster_manager_nodes\|discovery.seed_hosts" \
  server/src/main/java/org/opensearch/cluster/coordination/ClusterFormationFailureHelper.java | head

Read getDescription(). It explains why no quorum is possible — e.g. "this node must discover cluster-manager-eligible nodes [...] to bootstrap a cluster" or "have discovered [...]; need N more." Learning to read this message saves hours on real "cluster won't form" issues.

Bootstrapping: `cluster.initial_cluster_manager_nodes`

The very first cluster has no committed voting configuration, so there is nothing to form a quorum over. Bootstrapping seeds it:

grep -rn "INITIAL_CLUSTER_MANAGER_NODES\|INITIAL_MASTER_NODES\|initial_cluster_manager_nodes\|ClusterBootstrapService" \
  server/src/main/java/org/opensearch/cluster/coordination/ | head
find server/src/main/java -name "ClusterBootstrapService.java"
grep -n "INITIAL_CLUSTER_MANAGER_NODES\|INITIAL_MASTER_NODES\|deprecat\|bootstrap" \
  server/src/main/java/org/opensearch/cluster/coordination/ClusterBootstrapService.java | head

Terminology — cluster manager vs master: cluster.initial_cluster_manager_nodes is the modern setting; cluster.initial_master_nodes is its deprecated alias (still honored, logs a deprecation warning). ClusterBootstrapService uses whichever is set to build the initial voting configuration. It is consulted only until the cluster has a committed state — set it once, on first formation, on the cluster-manager-eligible nodes, and never again. Setting it on a running cluster does nothing.

You can watch bootstrapping live:

./gradlew run -PnumNodes=3
# In the logs of the elected node, look for:
#   [INFO ][o.o.c.c.Coordinator] ... cluster-manager node changed {previous [], current [{n1}...]}
#   [INFO ][o.o.c.s.ClusterApplierService] ... cluster-manager node changed
curl -s 'localhost:9200/_cat/cluster_manager?v'
curl -s 'localhost:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination&pretty'   # the committed voting config

Reading Exercises

# 1. Where exactly is "quorum" defined? Find the majority math.
grep -rn "isQuorum\|hasQuorum\|quorum\|VotingConfiguration" \
  server/src/main/java/org/opensearch/cluster/coordination/CoordinationState.java | head

# 2. How does a node refuse a stale leader?
grep -n "term\|currentTerm\|coordinationState.handlePublishRequest\|CoordinationStateRejectedException" \
  server/src/main/java/org/opensearch/cluster/coordination/CoordinationState.java | head

# 3. The election timer backoff.
grep -n "ELECTION_INITIAL_TIMEOUT\|ELECTION_BACK_OFF\|ELECTION_MAX_TIMEOUT\|backoff\|scheduleNextElection" \
  server/src/main/java/org/opensearch/cluster/coordination/ElectionSchedulerFactory.java | head

Validation / Self-check

Answer in your own words; cite the class for each:

What is a term, and what guarantees does bumping it provide during an election?
Define quorum in terms of the voting configuration. Why does quorum overlap prevent split-brain? Which method in CoordinationState enforces it?
What problem does the pre-vote round (PreVoteCollector) solve that a plain vote does not?
Trace the messages of one election: name the request that starts it, the response a peer sends to vote, and the moment the candidate decides it has won.
After winning, why must the new leader publish a state before it is truly the manager? What would be inconsistent if it skipped that?
FollowersChecker and LeaderChecker both exist. Why two checkers instead of one? What does each trigger on failure?
cluster.initial_cluster_manager_nodes — when is it read, when is it ignored, and what is its deprecated alias? What goes wrong if two operators set different values on different seed nodes?
A user reports "cluster won't form." Which class produces the diagnostic message, and what are two distinct reasons that message might give?

When you can answer all eight from your reading log and explain the election as a sequence of messages (not just class names), you've completed Lab 4.1. Continue to Lab 4.2: ClusterState, Publishing, and the Applier Model.

Lab 4.2: ClusterState, Publishing, and the Applier Model

Lab 4.1 elected a cluster manager (formerly master). This lab follows what that manager spends its life doing: computing new ClusterStates and getting every node to agree on them. You will read the state object and its diffs, the two-phase publication protocol, and the applier/listener model — then make a real setting change and watch the cluster state version increment live in TRACE logs.

This is the OpenSearch analog of watching a Tez DAG state machine fire transitions on the async dispatcher. Here the "transition" is a versioned cluster state moving across the whole cluster.

Background

ClusterState is immutable and versioned. Every change produces a new state with version = old.version + 1 and a fresh stateUUID. The elected cluster manager is the only node that may compute a new state; it then publishes it to all nodes in two phases:

Publish: send the new state (usually a Diff) to every node; each validates and ACKs. Wait for a quorum of cluster-manager-eligible nodes.
Commit: once a quorum ACKs, tell every node to apply the state.

On apply, appliers run before listeners:

ClusterStateApplier — makes a node's components consistent with the new state (routing, mappings, allocation). Runs first; an applier failure fails the apply.
ClusterStateListener — observes the change and reacts. Runs after the appliers, on the same thread; a listener failure is logged and does not fail the apply.

Deep-dive companions: cluster-state.md, cluster-state-publishing.md. The election that gives you a publisher is Lab 4.1.

Why This Lab Matters for Contributors

Almost every cluster-level operation — create/delete index, update mapping, change a setting, add a node, start/fail a shard — is a cluster state update. When one of those "hangs," "doesn't propagate," or "applies on some nodes but not others," you debug it through exactly the classes in this lab. The applier/listener distinction is also a constant source of plugin bugs (people register a listener where they needed an applier, or block the applier thread). You'll build your own listener in Lab 4.3.

Prerequisites

A running node (./gradlew run) you can edit logging on, or a built distro.
Lab 4.1 read (you know what term, quorum, and the manager are).
Reading log: create (or truncate) it with : > ~/opensearch-notes/reading-log-4.2.md

Part A — Read `ClusterState` and diffs (25 min)

Step 1 (10 min) — The state object

find server/src/main/java -name "ClusterState.java"
grep -n "private final\|long version\|String stateUUID\|Metadata metadata\|RoutingTable\|DiscoveryNodes\|ClusterBlocks\|public Builder builder\|public static Builder" \
  server/src/main/java/org/opensearch/cluster/ClusterState.java | head -40

Confirm for yourself:

The fields are final — ClusterState is immutable. You never mutate it; you build a new one.
The four components are Metadata, RoutingTable, DiscoveryNodes, ClusterBlocks (plus per-feature Customs — relevant in Lab 4.3).
A version (monotonic) and a stateUUID.
The Builder pattern: ClusterState.builder(previousState).metadata(...).build() is how every update produces the next state.
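The immutability and builder points above can be sketched without any OpenSearch classes. ToyState below is a hypothetical stand-in, not the real ClusterState: a single settings map plays the role of Metadata, and the version bump is folded into build() only to keep the sketch short (finding where the real code increments the version is one of the reading exercises later in this lab).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Toy stand-in for an immutable, versioned state object. */
final class ToyState {
    final long version;
    final String stateUUID;
    final Map<String, String> settings;   // stands in for Metadata

    private ToyState(long version, String stateUUID, Map<String, String> settings) {
        this.version = version;
        this.stateUUID = stateUUID;
        // Defensive copy plus unmodifiable view: nobody can mutate a published state.
        this.settings = Collections.unmodifiableMap(new HashMap<>(settings));
    }

    static ToyState initial() {
        return new ToyState(0L, UUID.randomUUID().toString(), new HashMap<>());
    }

    static Builder builder(ToyState previous) {
        return new Builder(previous);
    }

    static final class Builder {
        private final long previousVersion;
        private final Map<String, String> settings;

        private Builder(ToyState previous) {
            this.previousVersion = previous.version;
            this.settings = new HashMap<>(previous.settings);   // copy, never alias
        }

        Builder putSetting(String key, String value) {
            settings.put(key, value);
            return this;
        }

        ToyState build() {
            // Sketch-only simplification: next version and fresh UUID assigned here.
            return new ToyState(previousVersion + 1, UUID.randomUUID().toString(), settings);
        }
    }

    public static void main(String[] args) {
        ToyState v0 = ToyState.initial();
        ToyState v1 = ToyState.builder(v0).putSetting("cluster.routing.allocation.enable", "all").build();
        System.out.println("v" + v0.version + " settings=" + v0.settings);
        System.out.println("v" + v1.version + " settings=" + v1.settings);
    }
}
```

The previous state is untouched after build(), which is what lets other threads keep reading it safely while the next state is being computed.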

Read it live:

curl -s 'localhost:9200/_cluster/state?filter_path=version,state_uuid,master_node&pretty'
curl -s 'localhost:9200/_cluster/state/version?pretty'

Step 2 (15 min) — Diffs: how only the delta crosses the wire

Publishing the whole state to every node on every change would be wasteful. OpenSearch ships a Diff instead.

find server/src -name "Diff.java" -o -name "Diffable.java" | head
grep -rn "interface Diffable\|interface Diff<\|Diff<ClusterState>\|readDiffFrom\|diff(" \
  server/src/main/java/org/opensearch/cluster/ | head
grep -n "public Diff<ClusterState> diff\|static ClusterState readDiffFrom\|ClusterStateDiff\|apply(" \
  server/src/main/java/org/opensearch/cluster/ClusterState.java | head

The model:

A Diffable<T> type knows how to compute diff(previous); the resulting Diff<T> knows how to apply(previous) to reconstruct the new value.
ClusterState.diff(previousState) produces a Diff<ClusterState> that contains only the changed components (e.g. only the Metadata diff if just a setting changed).
A receiving node already on version N applies the diff to reach N+1. If it has fallen behind or the stateUUID doesn't line up, the publisher falls back to sending the full state.

This diff/full-state fallback is exactly what PublicationTransportHandler manages. Note: if you ever add a field to a state component, you must handle it in both the Writeable serialization and the diff — a classic source of BWC bugs (Level 9).
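The diff-then-fallback flow can be modeled in a few lines. This is a sketch under stated assumptions, not the real Diffable/Diff API: DiffSketch, State, Delta, and receive are hypothetical names, one string map stands in for a state component, and the base check uses only the version where the real code also compares the state UUID.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of diff-based publication with a full-state fallback. */
final class DiffSketch {

    /** A versioned, immutable map standing in for one state component. */
    static final class State {
        final long version;
        final Map<String, String> entries;

        State(long version, Map<String, String> entries) {
            this.version = version;
            this.entries = Map.copyOf(entries);
        }
    }

    /** Only the changed and removed keys, plus the version the delta was computed against. */
    static final class Delta {
        final long fromVersion;
        final long toVersion;
        final Map<String, String> upserts = new HashMap<>();
        final Set<String> deletes = new HashSet<>();

        Delta(long fromVersion, long toVersion) {
            this.fromVersion = fromVersion;
            this.toVersion = toVersion;
        }
    }

    static Delta diff(State previous, State next) {
        Delta delta = new Delta(previous.version, next.version);
        for (Map.Entry<String, String> e : next.entries.entrySet()) {
            if (!e.getValue().equals(previous.entries.get(e.getKey()))) {
                delta.upserts.put(e.getKey(), e.getValue());   // new or changed key
            }
        }
        for (String key : previous.entries.keySet()) {
            if (!next.entries.containsKey(key)) {
                delta.deletes.add(key);                        // key removed in the new state
            }
        }
        return delta;
    }

    static State apply(State local, Delta delta) {
        if (local.version != delta.fromVersion) {
            throw new IllegalStateException("delta is based on v" + delta.fromVersion + " but local is v" + local.version);
        }
        Map<String, String> merged = new HashMap<>(local.entries);
        merged.keySet().removeAll(delta.deletes);
        merged.putAll(delta.upserts);
        return new State(delta.toVersion, merged);
    }

    /** Try the delta first; if the receiver's base does not match, use the full state. */
    static State receive(State local, Delta delta, State fullState) {
        try {
            return apply(local, delta);
        } catch (IllegalStateException incompatibleBase) {
            return fullState;
        }
    }

    public static void main(String[] args) {
        State v1 = new State(1, Map.of("a", "1", "b", "2"));
        State v2 = new State(2, Map.of("a", "1", "b", "3", "c", "4"));
        Delta delta = diff(v1, v2);
        System.out.println("keys on the wire: " + delta.upserts.keySet());
        System.out.println("rebuilt equals v2: " + apply(v1, delta).entries.equals(v2.entries));
    }
}
```

Only the changed keys travel in the delta; a receiver whose base does not match cannot use it and needs the whole state instead.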

Log it:

cat >> ~/opensearch-notes/reading-log-4.2.md <<'EOF'
## ClusterState + diffs
- immutable, final fields; build next via ClusterState.builder(prev)...
- components: Metadata, RoutingTable, DiscoveryNodes, ClusterBlocks (+ Customs)
- versioned (monotonic) + stateUUID
- Diffable/Diff: publish only the delta; fall back to full state if receiver is behind / UUID mismatch
EOF

Part B — Read the publication protocol (25 min)

Step 3 (15 min) — Two-phase publish then commit

find server/src/main/java -name "PublicationTransportHandler.java"
grep -n "PUBLISH_STATE_ACTION_NAME\|COMMIT_STATE_ACTION_NAME\|sendClusterState\|handleIncomingPublishRequest\|handleApplyCommit\|PublishWithJoinResponse\|serializeFullClusterState\|serializeDiffClusterState" \
  server/src/main/java/org/opensearch/cluster/coordination/PublicationTransportHandler.java | head -30
find server/src/main/java -name "Publication.java"
grep -n "onPossibleCommitFailure\|isPublishQuorum\|handlePublishResponse\|onPossibleCompletion\|sendApplyCommit\|PublishResponse\|ApplyCommitRequest" \
  server/src/main/java/org/opensearch/cluster/coordination/Publication.java | head

Read the two transport action names: internal:cluster/coordination/publish_state and internal:cluster/coordination/commit_state (names may carry a version suffix). They correspond to the two phases:

Phase	Action	Sender	Receiver behavior
Publish	`publish_state`	Manager → all nodes	Validate term/UUID against `CoordinationState`; ACK (`PublishResponse`); do not apply yet
Commit	`commit_state`	Manager → all nodes	`ApplyCommitRequest`: now apply via `ClusterApplierService`

The manager (Publication) waits for a publish quorum of cluster-manager-eligible ACKs before it sends the commit. If it can't reach quorum (e.g. partition), the publication fails and the state is never applied anywhere — the safety property that keeps a minority manager from changing the cluster. Trace the quorum gate:

grep -n "isPublishQuorum\|VoteCollection\|coordinationState.handlePublishResponse\|onPossibleCommitFailure" \
  server/src/main/java/org/opensearch/cluster/coordination/CoordinationState.java | head
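The quorum gate between the two phases can be sketched as a toy model. TwoPhasePublishSketch and its Node class are hypothetical and use no OpenSearch types; terms, joins, and timeouts are left out, and the majority rule shown (a strict majority of the voting configuration) is an assumption to check against the code you just grepped.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of publish-then-commit with a quorum gate. */
final class TwoPhasePublishSketch {

    /** True when the ACKs include a strict majority of the voting configuration. */
    static boolean isQuorum(Set<String> votingConfig, Set<String> acks) {
        Set<String> votes = new HashSet<>(acks);
        votes.retainAll(votingConfig);            // ACKs from non-voting nodes do not count
        return votes.size() * 2 > votingConfig.size();
    }

    /** A node that stages a state on publish and exposes it only on commit. */
    static final class Node {
        final String id;
        final boolean reachable;
        long accepted = 0;   // staged in phase 1
        long applied = 0;    // visible only after phase 2

        Node(String id, boolean reachable) {
            this.id = id;
            this.reachable = reachable;
        }

        boolean handlePublish(long version) {
            if (!reachable || version <= applied) {
                return false;                     // partitioned, or a stale version
            }
            accepted = version;
            return true;                          // ACK, but do not apply yet
        }

        void handleCommit(long version) {
            if (reachable && accepted == version) {
                applied = version;
            }
        }
    }

    /** Returns true if the version was committed; false means no node applied it. */
    static boolean publish(long version, List<Node> nodes, Set<String> votingConfig) {
        Set<String> acks = new HashSet<>();
        for (Node node : nodes) {
            if (node.handlePublish(version)) {
                acks.add(node.id);
            }
        }
        if (!isQuorum(votingConfig, acks)) {
            return false;                         // phase 2 never starts
        }
        for (Node node : nodes) {
            node.handleCommit(version);
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> voting = Set.of("n1", "n2", "n3");
        List<Node> partitioned = List.of(new Node("n1", true), new Node("n2", false), new Node("n3", false));
        System.out.println("committed with 1 of 3 voters: " + publish(1, partitioned, voting));
        List<Node> healthy = List.of(new Node("n1", true), new Node("n2", true), new Node("n3", false));
        System.out.println("committed with 2 of 3 voters: " + publish(1, healthy, voting));
    }
}
```

Note that a reachable data node can ACK without helping: only members of the voting configuration count toward the gate.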

Step 4 (10 min) — How publication connects to `MasterService`

The manager doesn't publish out of nowhere; the cluster-manager service computes the state and calls publish:

find server/src/main/java -name "MasterService.java"
grep -n "publish\|ClusterStatePublisher\|patchVersions\|incrementVersion\|runTasks\|calculateTaskOutputs\|onPublicationSuccess\|onPublicationFailed" \
  server/src/main/java/org/opensearch/cluster/service/MasterService.java | head -30

MasterService.runTasks(...) computes the new state (next subsection), increments its version, then hands it to the publisher (the Coordinator's publication). On success the task listeners are notified; on failure they are told about the failure and there is nothing to undo, because the node's applied state never moved. Log it:

cat >> ~/opensearch-notes/reading-log-4.2.md <<'EOF'
## Publication
- MasterService.runTasks computes newState, version++, calls publisher (Coordinator.publish)
- Phase 1 publish_state -> nodes ACK PublishResponse (no apply); manager waits for publish quorum
- Phase 2 commit_state (ApplyCommitRequest) -> nodes apply via ClusterApplierService
- no quorum -> publication fails -> state NEVER applied (safety)
EOF

Part C — Update tasks, batching, appliers, listeners (25 min)

Step 5 (12 min) — `ClusterStateUpdateTask` and batching on `MasterService`

Every state change starts as a task submitted to the cluster-manager service:

find server/src/main/java -name "ClusterStateUpdateTask.java" -o -name "ClusterStateTaskExecutor.java"
grep -n "abstract ClusterState execute\|onFailure\|submitStateUpdateTask\|submitStateUpdateTasks" \
  server/src/main/java/org/opensearch/cluster/service/MasterService.java | head
grep -n "interface ClusterStateTaskExecutor\|ClusterTasksResult\|execute(\|batchExecutionContext" \
  server/src/main/java/org/opensearch/cluster/ClusterStateTaskExecutor.java | head

Key facts:

A ClusterStateUpdateTask (which is a ClusterStateTaskExecutor for itself) implements execute(currentState) -> newState. It must be a pure function of the input state — no blocking, no side effects.
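MasterService batches tasks that share the same ClusterStateTaskExecutor and runs them together in one computation, producing one new state for the whole batch. This is why a burst of 50 shard-started notifications does not have to publish 50 times in lockstep: they share one executor (see ShardStateAction), which coalesces them. A plain ClusterStateUpdateTask is its own executor, so it is not coalesced with other tasks.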
The computation runs on the single-threaded CLUSTER_MANAGER_SERVICE (a.k.a. MASTER_SERVICE) thread pool — serialization of state changes is by design.

grep -n "MASTER_SERVICE\|CLUSTER_MANAGER_SERVICE\|masterServiceExecutor\|threadPoolExecutor" \
  server/src/main/java/org/opensearch/cluster/service/MasterService.java | head
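Batching by executor can be pictured as grouping a queue by key and producing one state per group. The sketch below is a simplification with hypothetical names (BatchingSketch, Task, runQueue): it groups the whole queue up front, whereas the real service batches around the task at the head of the queue, and a string key stands in for executor identity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model: tasks that share an executor key are folded into one new state. */
final class BatchingSketch {

    static final class State {
        final long version;
        final List<String> entries;

        State(long version, List<String> entries) {
            this.version = version;
            this.entries = List.copyOf(entries);   // immutable snapshot
        }
    }

    static final class Task {
        final String executorKey;   // stands in for the shared executor instance
        final String entry;         // the change this task wants to make

        Task(String executorKey, String entry) {
            this.executorKey = executorKey;
            this.entry = entry;
        }
    }

    /** Drains the queue; records one published version per batch and returns the final state. */
    static State runQueue(State current, List<Task> queue, List<Long> publishedVersions) {
        Map<String, List<Task>> batches = new LinkedHashMap<>();
        for (Task task : queue) {
            batches.computeIfAbsent(task.executorKey, key -> new ArrayList<>()).add(task);
        }
        State state = current;
        for (List<Task> batch : batches.values()) {
            List<String> entries = new ArrayList<>(state.entries);
            for (Task task : batch) {
                entries.add(task.entry);           // every task in the batch, one computation
            }
            state = new State(state.version + 1, entries);
            publishedVersions.add(state.version);  // one publication for the whole batch
        }
        return state;
    }

    public static void main(String[] args) {
        List<Task> queue = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            queue.add(new Task("shard-started", "shard-" + i));
        }
        List<Long> published = new ArrayList<>();
        State result = runQueue(new State(0, List.of()), queue, published);
        System.out.println(queue.size() + " tasks, " + published.size() + " publication(s), version " + result.version);
    }
}
```

Because each batch is computed from the state the previous batch produced, the result is the same as running the tasks one by one, just with far fewer publications.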

Step 6 (13 min) — Appliers vs listeners

The committed state is applied by ClusterApplierService on every node:

find server/src/main/java -name "ClusterApplierService.java" -o -name "ClusterStateApplier.java" -o -name "ClusterStateListener.java"
grep -n "callClusterStateAppliers\|callClusterStateListeners\|addStateApplier\|addListener\|applyChanges\|ClusterChangedEvent" \
  server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java | head -30

Read applyChanges(...) (name may vary): it builds a ClusterChangedEvent (old state, new state, what changed) and then:

Calls all ClusterStateAppliers first — callClusterStateAppliers(...). These wire the node's internal components to the new state (e.g. IndicesClusterStateService creates/removes shards, RoutingService reacts). If an applier throws, the apply fails loudly.
Calls all ClusterStateListeners after — callClusterStateListeners(...). These just observe.

	`ClusterStateApplier`	`ClusterStateListener`
When	First, during apply	After appliers
Purpose	Make components consistent with the state	React to / observe the change
Failure	Fails the apply (serious)	Logged; doesn't fail the apply
Registered via	`clusterService.addStateApplier(...)`	`clusterService.addListener(...)`
Example	`IndicesClusterStateService`	A plugin logging index creation
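The ordering and failure rows of the table can be exercised with a toy apply loop. ApplySketch is a hypothetical model that borrows the method names addStateApplier, addListener, and applyChanges for readability only; it uses a bare version number in place of a ClusterChangedEvent and is not the real ClusterApplierService.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

/** Toy model of "appliers first, listeners after" with different failure handling. */
final class ApplySketch {
    private final List<LongConsumer> appliers = new ArrayList<>();
    private final List<LongConsumer> listeners = new ArrayList<>();
    private long appliedVersion = 0;

    void addStateApplier(LongConsumer applier) {
        appliers.add(applier);
    }

    void addListener(LongConsumer listener) {
        listeners.add(listener);
    }

    long appliedVersion() {
        return appliedVersion;
    }

    void applyChanges(long newVersion) {
        for (LongConsumer applier : appliers) {
            applier.accept(newVersion);        // an exception here aborts the whole apply
        }
        appliedVersion = newVersion;           // the state becomes visible only after all appliers ran
        for (LongConsumer listener : listeners) {
            try {
                listener.accept(newVersion);
            } catch (RuntimeException e) {
                // Listener failures are reported and swallowed; the apply already succeeded.
                System.err.println("listener failed (ignored): " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        ApplySketch service = new ApplySketch();
        service.addListener(v -> System.out.println("listener saw v" + v));
        service.addStateApplier(v -> System.out.println("applier wired v" + v));
        service.applyChanges(1);               // the applier line prints first despite registration order
    }
}
```

Registration order between the two groups does not matter; the group a callback belongs to decides when it runs and what its failure means.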

Warning: Both run on the single applier thread. Blocking inside an applier or listener freezes the cluster's ability to apply states. This is one of the most damaging plugin bugs. If you need to do slow work in reaction to a state change, hand it off to another thread pool.
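One way to follow that advice is to do only cheap work in the callback and forward the slow part to a separate executor. The sketch below uses plain java.util.concurrent; OffloadSketch and OffloadingListener are hypothetical names, and in a plugin the worker would normally come from the node's thread pools rather than a private executor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongConsumer;

/** Toy model of a listener that never blocks the thread that notifies it. */
final class OffloadSketch {

    /** Does only cheap work inline and forwards the slow part to a worker. */
    static final class OffloadingListener {
        private final ExecutorService worker;
        private final LongConsumer slowWork;

        OffloadingListener(ExecutorService worker, LongConsumer slowWork) {
            this.worker = worker;
            this.slowWork = slowWork;
        }

        /** Called on the (simulated) applier thread; returns as soon as the work is queued. */
        void clusterChanged(long version) {
            worker.execute(() -> slowWork.accept(version));
        }
    }

    /** Sends one notification and reports whether the slow work ran off the calling thread. */
    static boolean slowWorkRanOffCallerThread() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch done = new CountDownLatch(1);
            AtomicReference<Thread> ranOn = new AtomicReference<>();
            OffloadingListener listener = new OffloadingListener(worker, version -> {
                ranOn.set(Thread.currentThread());
                done.countDown();
            });
            listener.clusterChanged(42L);
            boolean finished = done.await(5, TimeUnit.SECONDS);
            return finished && ranOn.get() != Thread.currentThread();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("slow work ran on another thread: " + slowWorkRanOffCallerThread());
    }
}
```

The callback captures what it needs (here just the version) before handing off, so the worker never has to reach back into state that may have moved on.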

Log it:

cat >> ~/opensearch-notes/reading-log-4.2.md <<'EOF'
## Update tasks + apply
- ClusterStateUpdateTask.execute(currentState)->newState : pure, no blocking
- MasterService batches tasks by ClusterStateTaskExecutor (one state per batch)
- runs on single-threaded CLUSTER_MANAGER_SERVICE pool
- ClusterApplierService.applyChanges: appliers FIRST (consistency, may fail apply), listeners AFTER (observe)
- never block in an applier/listener -> freezes apply
EOF

Part D — Watch a state update live in TRACE logs (20 min)

Now make a real change and watch the version increment.

Step 7 — Enable TRACE on the cluster service

Dynamically, on a running node:

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "logger.org.opensearch.cluster.service": "TRACE"
  }
}'

(Or set logger.org.opensearch.cluster.service: TRACE in config/opensearch.yml before start; config/log4j2.properties uses its own logger.<id>.name / logger.<id>.level syntax.)

Step 8 — Trigger an update and read the log

# Record the version before:
curl -s 'localhost:9200/_cluster/state/version?pretty'

# A trivially harmless cluster-state change:
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'

# Version after:
curl -s 'localhost:9200/_cluster/state/version?pretty'

In the node log (logs/<cluster>.log for a distro, or stdout for ./gradlew run) you'll see lines like these (exact wording and levels vary by version):

[TRACE][o.o.c.s.MasterService] will process [cluster_update_settings ...]
[TRACE][o.o.c.s.MasterService] cluster state updated, version [N+1], source [cluster_update_settings...]
[DEBUG][o.o.c.s.MasterService] publishing cluster state version [N+1]
[DEBUG][o.o.c.s.ClusterApplierService] processing [...]: execute
[DEBUG][o.o.c.s.ClusterApplierService] cluster state updated, version [N+1], source [...]
[TRACE][o.o.c.s.ClusterApplierService] applying settings from cluster state with version [N+1]

You are watching: MasterService compute v(N+1) → publish → ClusterApplierService apply v(N+1). The version you printed went up by exactly one. This is the whole chapter, made visible.

Turn TRACE back off when done:

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "logger.org.opensearch.cluster.service": null,
                 "cluster.routing.allocation.enable": null }
}'

Reading Exercises

# 1. Where does the new state's version get incremented?
grep -n "incrementVersion\|version() + 1\|builder(.*).version\|patchVersions" \
  server/src/main/java/org/opensearch/cluster/service/MasterService.java | head

# 2. How is "what changed" computed for appliers/listeners?
grep -n "ClusterChangedEvent\|indicesDeleted\|nodesChanged\|metadataChanged\|routingTableChanged" \
  server/src/main/java/org/opensearch/cluster/ClusterChangedEvent.java | head

# 3. Find a real applier in the engine.
grep -rln "implements ClusterStateApplier\|addStateApplier" server/src/main/java | head

Answer:

ClusterState is immutable. Show, with the builder API, how a setting change becomes a new state object. Where does the version++ happen?
Why are states published as diffs, and what makes the publisher fall back to a full state?
Explain the two phases of publication. What exactly is guaranteed if the commit phase never runs because quorum was lost?
MasterService batches tasks by ClusterStateTaskExecutor. Give a concrete example where batching matters for performance, and explain why the executor must be a pure function of the input state.
Appliers run before listeners. Give one operation that must be an applier (not a listener) and explain what would break if it were a listener.
What happens to the cluster if a plugin's ClusterStateListener blocks for 30 seconds on every state change? Which thread is it blocking?
From the TRACE logs you captured, quote the two log lines that show (a) the manager computing the new version and (b) a node applying it.

When you can produce the TRACE log of a version increment and answer all seven, you've completed Lab 4.2. Continue to Lab 4.3: Build It — A Custom ClusterStateListener / Custom Metadata.

Lab 4.3: Build It — A Custom ClusterStateListener / Custom Metadata

You have read the publication and applier model (Lab 4.2). Now you build against it. In this lab you write a plugin that registers a ClusterStateListener reacting to real cluster state changes — it logs whenever an index is created or deleted — and, as a stretch, registers a custom Metadata.Custom type that rides inside the cluster state and survives serialization. You then test it with the OpenSearch test framework (OpenSearchSingleNodeTestCase / OpenSearchIntegTestCase).

This is the smallest realistic way to hook the coordination layer, and it is exactly how ecosystem plugins (index-management, security, cross-cluster-replication) observe and extend cluster state.

Background

Two extension points from Lab 4.2:

ClusterStateListener — clusterChanged(ClusterChangedEvent event) is called on every node after a new state is applied. The ClusterChangedEvent carries old state, new state, and helpers like indicesCreated() / indicesDeleted().
Metadata.Custom — a plugin-defined, named, Writeable + ToXContent blob that lives inside ClusterState.metadata(), is published with the state, and is persisted. Registering one requires entries in NamedWriteableRegistry (transport) and NamedXContentRegistry (XContent), which a Plugin contributes via getNamedWriteables() / getNamedXContent().

Wiring is via ClusterService.addListener(...). You get the ClusterService in Plugin.createComponents(...).

Deep-dive companions: cluster-state.md, cluster-state-publishing.md, plugin-internals.md. The applier/listener distinction is from Lab 4.2; plugin/action mechanics are from Lab 3.3.

Why This Lab Matters for Contributors

Reacting to cluster state is one of the most common things a real OpenSearch plugin does: enforce a policy when an index appears, clean up resources when one is deleted, replicate metadata, track state. Custom Metadata is how features store cluster-scoped data (ISM policies, security config, replication bookkeeping). If you can register a listener and a custom metadata type and prove they survive serialization in a test, you've done the core of a stateful plugin.

Prerequisites

You completed Lab 3.3 (you can build and install a plugin).
You read Lab 4.2 (appliers vs listeners; two-phase publish).
A standalone plugin project skeleton like Lab 3.3's, or work inside a modules/ sandbox module.

Note: Method signatures on Plugin (createComponents, getNamedWriteables, getNamedXContent) and on Metadata.Custom have evolved across versions. Where a signature doesn't match your branch, copy the current one from an in-tree user: grep -rln "implements Metadata.Custom\|getNamedWriteables\|createComponents" server/ modules/ plugins/ | head and mirror it. Package paths for Writeable/ToXContent/StreamInput may be under org.opensearch.core.* — verify with grep -rn "class StreamInput" libs/.

Part A — The ClusterStateListener (45 min)

Step 1 (10 min) — The listener

package org.example.statewatch;

import org.opensearch.cluster.ClusterChangedEvent;
import org.opensearch.cluster.ClusterStateListener;
import org.opensearch.core.index.Index;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class IndexLifecycleLogger implements ClusterStateListener {

    private static final Logger logger = LogManager.getLogger(IndexLifecycleLogger.class);

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // indicesCreated() / indicesDeleted() are diffed for us from old vs new state.
        for (String created : event.indicesCreated()) {
            logger.info("index created: [{}] (cluster state v{})",
                created, event.state().version());
        }
        for (Index deleted : event.indicesDeleted()) {
            logger.info("index deleted: [{}] (cluster state v{})",
                deleted.getName(), event.state().version());
        }
    }
}

Notes:

clusterChanged runs on the single applier thread after appliers. It must be fast and non-blocking (Lab 4.2 warning). Logging is fine; calling out to the network is not.
ClusterChangedEvent already computed the create/delete deltas from old vs new state — you don't diff the metadata yourself.
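For intuition about what those helpers hand you, the deltas amount to set differences between the old and new index sets. The sketch below is a deliberate simplification with hypothetical names (IndexDeltaSketch, created, deleted): it compares index names only, while the real event works on the actual metadata and handles more cases.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model of deriving created/deleted indices from two snapshots of index names. */
final class IndexDeltaSketch {

    /** Names present in the new state but not in the old one. */
    static Set<String> created(Set<String> oldIndices, Set<String> newIndices) {
        Set<String> result = new TreeSet<>(newIndices);
        result.removeAll(oldIndices);
        return result;
    }

    /** Names present in the old state but not in the new one. */
    static Set<String> deleted(Set<String> oldIndices, Set<String> newIndices) {
        Set<String> result = new TreeSet<>(oldIndices);
        result.removeAll(newIndices);
        return result;
    }

    public static void main(String[] args) {
        Set<String> before = Set.of("logs-1", "logs-2");
        Set<String> after = Set.of("logs-2", "watch-me");
        System.out.println("created: " + created(before, after));
        System.out.println("deleted: " + deleted(before, after));
    }
}
```

A listener that only cares about one of the two directions can ignore the other set entirely, which keeps clusterChanged cheap.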

Step 2 (10 min) — A component to register the listener

You need the ClusterService to call addListener. The plugin gets it in createComponents. Make a small component that registers and unregisters the listener cleanly.

package org.example.statewatch;

import org.opensearch.cluster.service.ClusterService;
import org.opensearch.common.lifecycle.AbstractLifecycleComponent;

public class IndexLifecycleService extends AbstractLifecycleComponent {

    private final ClusterService clusterService;
    private final IndexLifecycleLogger listener = new IndexLifecycleLogger();

    public IndexLifecycleService(ClusterService clusterService) {
        this.clusterService = clusterService;
    }

    @Override
    protected void doStart() {
        clusterService.addListener(listener);
    }

    @Override
    protected void doStop() {
        clusterService.removeListener(listener);
    }

    @Override
    protected void doClose() {
        // nothing else to release
    }
}

Step 3 (10 min) — The plugin: wire it in `createComponents`

package org.example.statewatch;

import org.opensearch.client.Client;
import org.opensearch.cluster.metadata.IndexNameExpressionResolver;
import org.opensearch.cluster.service.ClusterService;
import org.opensearch.core.common.io.stream.NamedWriteableRegistry;
import org.opensearch.core.xcontent.NamedXContentRegistry;
import org.opensearch.env.Environment;
import org.opensearch.env.NodeEnvironment;
import org.opensearch.plugins.ClusterPlugin;
import org.opensearch.plugins.Plugin;
import org.opensearch.repositories.RepositoriesService;
import org.opensearch.script.ScriptService;
import org.opensearch.threadpool.ThreadPool;
import org.opensearch.watcher.ResourceWatcherService;

import java.util.Collection;
import java.util.Collections;
import java.util.function.Supplier;

// ASSUMPTION: the parameter list and import paths below follow the long-form createComponents
// signature described in the note after this listing (as on the 2.x line). Both are
// version-sensitive (for example, Client may live in a different package on newer branches),
// so match them to Plugin.java on your branch before compiling.
public class StateWatchPlugin extends Plugin implements ClusterPlugin {

    @Override
    public Collection<Object> createComponents(
        Client client,
        ClusterService clusterService,
        ThreadPool threadPool,
        ResourceWatcherService resourceWatcherService,
        ScriptService scriptService,
        NamedXContentRegistry xContentRegistry,
        Environment environment,
        NodeEnvironment nodeEnvironment,
        NamedWriteableRegistry namedWriteableRegistry,
        IndexNameExpressionResolver indexNameExpressionResolver,
        Supplier<RepositoriesService> repositoriesServiceSupplier
    ) {
        // Returning a LifecycleComponent means OpenSearch will start()/stop() it for you,
        // which is what registers/unregisters the listener.
        return Collections.singletonList(new IndexLifecycleService(clusterService));
    }
}

Note on the createComponents signature: it is the most version-sensitive method on Plugin. Depending on your branch it takes either a single PluginServices-style bundle or a long parameter list (Client client, ClusterService clusterService, ThreadPool threadPool, ResourceWatcherService ..., ScriptService ..., NamedXContentRegistry ..., Environment ..., NodeEnvironment ..., NamedWriteableRegistry ..., IndexNameExpressionResolver ..., Supplier<RepositoriesService> ...). Copy the exact list with grep -rn "createComponents" server/src/main/java/org/opensearch/plugins/Plugin.java, then return List.of(new IndexLifecycleService(clusterService)). Returning the component as a LifecycleComponent is what gets doStart() (and thus addListener) called.

The concrete, working body once you've pulled ClusterService from the parameters:

ClusterService clusterService = /* the ClusterService argument */;
return Collections.singletonList(new IndexLifecycleService(clusterService));

Step 4 (15 min) — Test it with `OpenSearchSingleNodeTestCase`

A single-node test is the fastest way to prove the listener fires. It loads your plugin, creates an index, and asserts the listener saw it.

package org.example.statewatch;

import org.opensearch.plugins.Plugin;
import org.opensearch.test.OpenSearchSingleNodeTestCase;

import java.util.Collection;
import java.util.Collections;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class IndexLifecycleListenerIT extends OpenSearchSingleNodeTestCase {

    @Override
    protected Collection<Class<? extends Plugin>> getPlugins() {
        return Collections.singletonList(StateWatchPlugin.class);
    }

    public void testListenerFiresOnIndexCreate() throws Exception {
        // Register our own listener directly on the node's ClusterService to observe the event
        // deterministically (the plugin's logger is fine for manual runs, but a latch is testable).
        CountDownLatch created = new CountDownLatch(1);
        getInstanceFromNode(org.opensearch.cluster.service.ClusterService.class)
            .addListener(event -> {
                if (event.indicesCreated().contains("watch-me")) {
                    created.countDown();
                }
            });

        client().admin().indices().prepareCreate("watch-me").get();

        assertTrue("listener should have observed index creation",
            created.await(10, TimeUnit.SECONDS));
    }
}

Run it:

# If inside the OpenSearch tree as a module/sandbox:
./gradlew :modules:<your-module>:test --tests "*IndexLifecycleListenerIT"
# Standalone plugin project:
./gradlew test --tests "*IndexLifecycleListenerIT"

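Note: getInstanceFromNode(...) reaches into the test node's Guice injector for a real service instance, which is the supported way to inspect a node's internals in OpenSearchSingleNodeTestCase. The CountDownLatch makes the assertion deterministic instead of relying on log scraping. If the test task on your build only picks up classes named *Tests (worth checking, since the in-tree convention reserves the *IT suffix for integration-test tasks), rename the class or run it under the task that includes *IT classes. The test framework is the subject of Level 5.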

Part B (Stretch) — A custom `Metadata.Custom` type (45 min)

This is the harder, optional half: a plugin-defined blob that lives in the cluster state, publishes with it, and round-trips through transport and XContent.

Step 5 (15 min) — The custom metadata

package org.example.statewatch;

import org.opensearch.cluster.AbstractNamedDiffable;
import org.opensearch.cluster.NamedDiff;
import org.opensearch.cluster.metadata.Metadata;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.xcontent.XContentBuilder;
import org.opensearch.Version;

import java.io.IOException;
import java.util.EnumSet;

public class WatchMetadata extends AbstractNamedDiffable<Metadata.Custom> implements Metadata.Custom {

    public static final String TYPE = "watch_metadata";

    private final long createdCount;

    public WatchMetadata(long createdCount) {
        this.createdCount = createdCount;
    }

    public WatchMetadata(StreamInput in) throws IOException {
        this.createdCount = in.readVLong();
    }

    @Override
    public String getWriteableName() {
        return TYPE;
    }

    @Override
    public Version getMinimalSupportedVersion() {
        return Version.CURRENT.minimumIndexCompatibilityVersion();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeVLong(createdCount);
    }

    @Override
    public EnumSet<Metadata.XContentContext> context() {
        // Persist in snapshots + on-disk state; include in the API. Choose per feature.
        return Metadata.ALL_CONTEXTS;
    }

    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        builder.field("created_count", createdCount);
        return builder;
    }

    public long getCreatedCount() {
        return createdCount;
    }

    // Value equality: the wire round-trip test compares the copy with equals()/hashCode().
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        return createdCount == ((WatchMetadata) o).createdCount;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(createdCount);
    }

    public static NamedDiff<Metadata.Custom> readDiffFrom(StreamInput in) throws IOException {
        return readDiffFrom(Metadata.Custom.class, TYPE, in);
    }
}

Step 6 (15 min) — Register the named writeable + named XContent

The cluster state is serialized by name, so a custom type must be registered in both registries. A Plugin contributes these:

@Override
public List<NamedWriteableRegistry.Entry> getNamedWriteables() {
    return List.of(
        new NamedWriteableRegistry.Entry(Metadata.Custom.class, WatchMetadata.TYPE, WatchMetadata::new),
        new NamedWriteableRegistry.Entry(NamedDiff.class, WatchMetadata.TYPE, WatchMetadata::readDiffFrom)
    );
}

@Override
public List<NamedXContentRegistry.Entry> getNamedXContent() {
    return List.of(
        new NamedXContentRegistry.Entry(
            Metadata.Custom.class,
            new ParseField(WatchMetadata.TYPE),
            WatchMetadata::fromXContent   // implement a fromXContent parser to match toXContent
        )
    );
}

Warning: If you write a custom metadata to the cluster state but forget the NamedWriteableRegistry entry, the next state publication will fail to deserialize on receiving nodes (IllegalArgumentException: Unknown NamedWriteable [...]) and the cluster will be unable to apply state — a cluster-wide outage from one missing registration. This is the classic custom metadata bug. Always wire both registries, and always add a serialization round-trip test.
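The failure mode in that warning comes from name-keyed deserialization: the reader sees a name on the wire and has to find a matching entry. The sketch below models this with plain java.io; NamedRegistrySketch and its Reader interface are hypothetical stand-ins, not the NamedWriteableRegistry API, and the error text only imitates the message quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Toy model of reading a named payload through a registry of readers. */
final class NamedRegistrySketch {

    /** Reads one payload of a known type from the stream. */
    interface Reader {
        Object read(DataInputStream in) throws IOException;
    }

    private final Map<String, Reader> readers = new HashMap<>();

    void register(String name, Reader reader) {
        readers.put(name, reader);
    }

    /** Sender side: the name goes first, then the payload (a single long in this sketch). */
    static byte[] write(String name, long createdCount) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(name);
            out.writeLong(createdCount);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Receiver side: look the name up; without an entry the bytes cannot be interpreted. */
    Object read(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            String name = in.readUTF();
            Reader reader = readers.get(name);
            if (reader == null) {
                throw new IllegalArgumentException("Unknown NamedWriteable [" + name + "]");
            }
            return reader.read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = write("watch_metadata", 42L);

        NamedRegistrySketch withEntry = new NamedRegistrySketch();
        withEntry.register("watch_metadata", DataInputStream::readLong);
        System.out.println("registered node read: " + withEntry.read(wire));

        NamedRegistrySketch withoutEntry = new NamedRegistrySketch();
        try {
            withoutEntry.read(wire);
        } catch (IllegalArgumentException e) {
            System.out.println("unregistered node failed: " + e.getMessage());
        }
    }
}
```

The sender needs no registry to write, which is why the bug only surfaces on the receiving side, after the state has already been published.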

Step 7 (15 min) — Round-trip test for the custom metadata

Use an AbstractNamedWriteableTestCase (or AbstractSerializingTestCase / AbstractWireSerializingTestCase) to prove the type survives transport serialization:

package org.example.statewatch;

import org.opensearch.cluster.metadata.Metadata;
import org.opensearch.core.common.io.stream.NamedWriteableRegistry;
import org.opensearch.test.AbstractNamedWriteableTestCase;

import java.util.List;

public class WatchMetadataTests extends AbstractNamedWriteableTestCase<Metadata.Custom> {

    @Override
    protected WatchMetadata createTestInstance() {
        return new WatchMetadata(randomNonNegativeLong());
    }

    @Override
    protected Class<Metadata.Custom> categoryClass() {
        return Metadata.Custom.class;
    }

    // The base class copies instances through this registry by name, so no separate
    // instanceReader() is needed (that hook belongs to AbstractWireSerializingTestCase).
    @Override
    protected NamedWriteableRegistry getNamedWriteableRegistry() {
        return new NamedWriteableRegistry(List.of(
            new NamedWriteableRegistry.Entry(Metadata.Custom.class, WatchMetadata.TYPE, WatchMetadata::new)
        ));
    }
}

./gradlew test --tests "*WatchMetadataTests"

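This round-trips a random WatchMetadata through StreamOutput→StreamInput via the registry and asserts equality, which is the exact protection against the "unknown NamedWriteable" outage above. The comparison goes through equals() and hashCode(), so WatchMetadata needs value-based implementations of both or the test fails even when serialization is correct. Wire testing is the heart of Level 5 and BWC (Level 9).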
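What such a round-trip test checks can be shown with plain java.io. In this sketch RoundTripSketch and Counter are hypothetical names, and the 7-bit variable-length encoding is written out by hand as a stand-in for what writeVLong/readVLong are assumed to do; it is not the OpenSearch stream implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Toy wire round trip: write a value object, read it back, compare with equals(). */
final class RoundTripSketch {

    /** Value object with value-based equality, like the custom metadata needs. */
    static final class Counter {
        final long createdCount;

        Counter(long createdCount) {
            this.createdCount = createdCount;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Counter && ((Counter) o).createdCount == createdCount;
        }

        @Override
        public int hashCode() {
            return Long.hashCode(createdCount);
        }
    }

    /** Encodes a non-negative long in 7-bit groups, low bits first; the high bit means "more follows". */
    static byte[] write(Counter value) {
        if (value.createdCount < 0) {
            throw new IllegalArgumentException("negative values are not supported by this encoding");
        }
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            long v = value.createdCount;
            while ((v & ~0x7FL) != 0) {
                out.writeByte((int) ((v & 0x7F) | 0x80));
                v >>>= 7;
            }
            out.writeByte((int) v);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes the format produced by write(). */
    static Counter read(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            long result = 0;
            int shift = 0;
            while (true) {
                byte b = in.readByte();
                result |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    return new Counter(result);
                }
                shift += 7;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Counter original = new Counter(300);
        Counter copy = read(write(original));
        System.out.println("distinct objects: " + (original != copy) + ", equal values: " + original.equals(copy));
    }
}
```

Without the equals() override the copy would compare unequal to the original even though every byte survived, which is the same trap the real test exposes.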

Implementation Requirements

IndexLifecycleLogger implements ClusterStateListener, uses indicesCreated()/indicesDeleted(), and does no blocking work.
The listener is registered via ClusterService.addListener(...) from a LifecycleComponent returned by createComponents(...), and unregistered on stop.
An OpenSearchSingleNodeTestCase proves the listener fires on index creation (deterministically, via a latch).
(Stretch) WatchMetadata implements Metadata.Custom, is registered in both getNamedWriteables() and getNamedXContent(), and a round-trip serialization test passes.
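The first and third requirements share one pattern: the callback only enqueues work, and the test waits on a latch rather than sleeping. The following is a standalone sketch of that pattern with plain java.util.concurrent. OffloadingListener and onIndexCreated are hypothetical names, and nothing here uses the OpenSearch listener API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Standalone model of a non-blocking listener; NOT the OpenSearch ClusterStateListener API.
final class OffloadingListener {
    private final ExecutorService worker;
    private final CountDownLatch handled;
    private final List<String> seen = new CopyOnWriteArrayList<>();

    OffloadingListener(ExecutorService worker, int expectedEvents) {
        this.worker = worker;
        this.handled = new CountDownLatch(expectedEvents);
    }

    /** Called on the (modelled) applier thread: only enqueue, never block here. */
    void onIndexCreated(String indexName) {
        worker.execute(() -> {
            seen.add(indexName);   // slow work would go here, off the applier thread
            handled.countDown();   // signal the test that this event was processed
        });
    }

    /** Fire one event and wait on the latch instead of sleeping. */
    static List<String> demo() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            OffloadingListener listener = new OffloadingListener(worker, 1);
            listener.onIndexCreated("watch-me");
            if (!listener.handled.await(10, TimeUnit.SECONDS)) {
                throw new AssertionError("listener did not handle the event in time");
            }
            return List.copyOf(listener.seen);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("interrupted while waiting for the listener", e);
        } finally {
            worker.shutdownNow();
        }
    }
}
```

The latch makes the test deterministic: it returns as soon as the event is handled and fails with a clear message if it never is.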

Troubleshooting

Symptom	Likely cause	Fix
Listener never fires	Not registered / component not started	Return it as a `LifecycleComponent`; confirm `doStart()` calls `addListener`
`createComponents` won't compile	Signature differs on your branch	Copy the exact signature from `Plugin.java`
Cluster state apply fails on other nodes after writing custom metadata	Missing `NamedWriteableRegistry` entry	Register `Metadata.Custom` and `NamedDiff` entries
`Unknown NamedXContent` when reading state via REST	Missing `getNamedXContent()` entry	Register the `NamedXContentRegistry.Entry` with a `fromXContent` parser
Test hangs	Blocking inside the listener on the applier thread	Listener must be non-blocking; offload slow work
`IllegalStateException` mutating state	You tried to change `ClusterState` in place	Build a new state via a `ClusterStateUpdateTask`

Expected Output

Listener test output (abridged):

> Task :test
IndexLifecycleListenerIT > testListenerFiresOnIndexCreate PASSED

And, when run against a live node, the plugin log on index creation:

[INFO ][o.e.s.IndexLifecycleLogger] index created: [watch-me] (cluster state v37)

Round-trip test:

WatchMetadataTests > testSerialization PASSED

Stretch Goals

Mutate the custom metadata via an update task. Add a ClusterStateUpdateTask, submitted on the cluster manager from code that sits next to your listener, that increments WatchMetadata.createdCount whenever an index is created. The task must build a new Metadata and return a new state from ClusterState.builder(currentState) rather than modify the current one. This takes you from observing state to changing it, and forces you to respect immutability.
Expose it via REST. Add a GET /_watch/stats handler (Lab 3.3 pattern) that reads clusterService.state().metadata().custom(WatchMetadata.TYPE) and returns created_count.
Multi-node integration test. Convert to OpenSearchIntegTestCase with @ClusterScope(numDataNodes = 2) and assert the custom metadata is identical on both nodes after a create — proving it published correctly. (Full treatment in Level 5.)
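The immutability rule behind the first stretch goal can be modelled in a few lines. This is a sketch under stated assumptions: ImmutableWatchState and its methods are invented for illustration and only imitate the shape of an update task (current state in, new state out, same instance meaning no change).

```java
import java.util.function.UnaryOperator;

// Standalone model of copy-on-write state updates; NOT the OpenSearch ClusterState API.
final class ImmutableWatchState {
    final long version;
    final long createdCount;

    ImmutableWatchState(long version, long createdCount) {
        this.version = version;
        this.createdCount = createdCount;
    }

    /** Returns a copy with the new count, or this instance if nothing changes. */
    ImmutableWatchState withCreatedCount(long count) {
        return count == createdCount ? this : new ImmutableWatchState(version, count);
    }

    /** Runs an update task; the version is bumped only when the task returns a new instance. */
    static ImmutableWatchState apply(ImmutableWatchState current,
                                     UnaryOperator<ImmutableWatchState> task) {
        ImmutableWatchState next = task.apply(current);
        if (next == current) {
            return current; // no-op task: nothing to publish
        }
        return new ImmutableWatchState(current.version + 1, next.createdCount);
    }
}
```

The old state stays readable and unchanged while the new one is built, which is the property that lets readers on other threads keep using the state they already hold.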

Validation / Self-check

Your listener runs on the applier thread. Name one thing it must never do, and explain the cluster-wide consequence if it does.
Why does a custom Metadata.Custom need entries in both NamedWriteableRegistry and NamedXContentRegistry? What fails if you register only one?
ClusterChangedEvent.indicesCreated() already gives you the delta. What did the framework diff to produce it, and on which two states?
You wrote custom metadata on the manager and it appeared on the manager but broke applies on the other nodes. What is the single most likely missing line, and why does it only fail on the receiving nodes?
When would you implement a ClusterStateApplier instead of a ClusterStateListener for this plugin? Give a concrete example.
Your round-trip test serializes a random WatchMetadata and asserts equality. Which production failure mode does that test specifically prevent?

When the listener test passes (and, ideally, the custom-metadata round-trip too), you've completed Lab 4.3. Continue to Lab 4.4: Fix It — An AllocationDecider Edge Case.

Lab 4.4: Fix It — An AllocationDecider Edge Case

This is a Fix-It lab. You will work in org.opensearch.cluster.routing.allocation, understand how the AllocationService and the AllocationDeciders chain decide where shards go, then fix a realistic bug in a single decider — complete with a diff, a unit test that constructs a RoutingAllocation and asserts the resulting Decision, and a _cluster/allocation/explain reproduction on a live cluster.

Allocation is where "unassigned shards" issues live, and they are some of the most common real OpenSearch bugs. The skill is precise: a single decider returning the wrong Decision — or the right decision with a useless explanation — silently breaks recovery or balancing.

Background

When shards need placement (new index, a node left, a snapshot restored, a failed shard), the AllocationService runs a reroute. It builds a RoutingAllocation (the working context) and asks the AllocationDeciders chain whether each candidate placement is allowed. Each decider returns a Decision:

`Decision.Type`	Meaning
`YES`	This decider permits the placement
`NO`	This decider forbids it (with an explanation)
`THROTTLE`	Allowed eventually, but not right now (e.g. too many concurrent recoveries)

The chain's combined verdict is the most restrictive: any NO forbids; otherwise any THROTTLE throttles; otherwise YES. The explanation strings are what _cluster/allocation/explain shows the operator — so a wrong or unhelpful explanation is itself a bug.
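The combining rule is small enough to model directly. The sketch below is a simplified stand-in: VerdictCombiner and its Type enum are invented here and carry no explanations, unlike the real Decision class.

```java
import java.util.List;

// Simplified model of the chain's combining rule; NOT the OpenSearch Decision class.
final class VerdictCombiner {
    enum Type { YES, THROTTLE, NO }

    /** Most restrictive wins: any NO forbids, otherwise any THROTTLE throttles, otherwise YES. */
    static Type combine(List<Type> decisions) {
        Type result = Type.YES;
        for (Type decision : decisions) {
            if (decision == Type.NO) {
                return Type.NO;            // nothing later in the chain can override a NO
            }
            if (decision == Type.THROTTLE) {
                result = Type.THROTTLE;    // remember it, but keep looking for a NO
            }
        }
        return result;
    }
}
```

For example, combine(List.of(YES, THROTTLE, NO)) yields NO, combine(List.of(YES, THROTTLE)) yields THROTTLE, and an empty chain yields YES.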

find server/src/main/java -path "*routing/allocation*" -name "AllocationService.java"
find server/src/main/java -path "*routing/allocation/decider*" -name "*.java" | sort

Class	Role
`AllocationService`	`reroute(...)`, `applyStartedShards(...)`, `applyFailedShards(...)` — runs allocation
`RoutingAllocation`	Per-round context: `RoutingNodes`, `DiscoveryNodes`, `Metadata`, the deciders, and the debug-decision flag (`debugDecision(...)`)
`RoutingNodes`	Mutable shard→node assignment being computed
`AllocationDeciders`	The ordered chain; combines per-decider `Decision`s
`Decision`	`YES`/`NO`/`THROTTLE` + explanation
`BalancedShardsAllocator`	Proposes moves within what deciders allow

Deep-dive companion: shard-allocation.md. The cluster state that allocation reads/writes is from Lab 4.2.

Why This Lab Matters for Contributors

"My shard is UNASSIGNED and I don't know why" is one of the highest-volume issue categories in the tracker, and _cluster/allocation/explain is the first tool a maintainer reaches for. Its output is built directly from decider Decisions. Fixing a decider — getting the verdict and the explanation right, with a unit test that pins it — is a high-value, very mergeable kind of PR, and a frequent good first issue shape. This lab is the bridge from reading the allocation engine to changing it.

Prerequisites

OpenSearch builds; you can run ./gradlew :server:test.
You read Lab 4.2 (allocation runs as part of state computation).
A 2-node cluster for the live reproduction (Lab 3.2 setup), or a single node for the unit test.

Step 1 (15 min) — Read one decider end to end

We'll center on MaxRetryAllocationDecider, which stops OpenSearch from endlessly retrying a shard that keeps failing to allocate (after index.allocation.max_retries, default 5). It is small, real, and an excellent specimen of the "off-by-one + bad explanation" bug class.

find server/src/main/java -name "MaxRetryAllocationDecider.java"
sed -n '1,120p' server/src/main/java/org/opensearch/cluster/routing/allocation/decider/MaxRetryAllocationDecider.java
grep -n "canAllocate\|getNumFailedAllocations\|maxRetries\|SETTING_ALLOCATION_MAX_RETRY\|Decision\|allocation.decision" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/decider/MaxRetryAllocationDecider.java

What to understand:

canAllocate(ShardRouting shardRouting, RoutingAllocation allocation) is the hook. It reads the shard's UnassignedInfo, which tracks getNumFailedAllocations().
It compares failures against index.allocation.max_retries (SETTING_ALLOCATION_MAX_RETRY).
If failures have reached the limit, it returns Decision.NO with an explanation that includes the failure count, the limit, and a hint to retry via POST /_cluster/reroute?retry_failed=true.
Otherwise it returns Decision.YES (or allocation.decision(Decision.YES, NAME, ...)).

Read the real explanation string — it's the operator-facing contract you must not break.

Step 2 (10 min) — The bug

Suppose a contributor "simplified" the comparison and introduced an off-by-one that lets a shard be retried one time too many, and also degraded the explanation so it no longer tells the operator how to recover. Here is the planted regression as a diff (your starting point):

--- a/server/src/main/java/org/opensearch/cluster/routing/allocation/decider/MaxRetryAllocationDecider.java
+++ b/server/src/main/java/org/opensearch/cluster/routing/allocation/decider/MaxRetryAllocationDecider.java
@@ public Decision canAllocate(ShardRouting shardRouting, RoutingAllocation allocation) {
         final UnassignedInfo unassignedInfo = shardRouting.unassignedInfo();
         final int maxRetries = SETTING_ALLOCATION_MAX_RETRY.get(indexMetadata.getSettings());
         if (unassignedInfo != null && unassignedInfo.getNumFailedAllocations() > 0) {
             final int numFailedAllocations = unassignedInfo.getNumFailedAllocations();
-            if (numFailedAllocations >= maxRetries) {
-                return allocation.decision(Decision.NO, NAME,
-                    "shard has exceeded the maximum number of retries [%d] on failed allocation attempts - "
-                        + "manually call [%s] to retry, [%s]",
-                    maxRetries, RETRY_FAILED_API, unassignedInfo.toString());
+            if (numFailedAllocations > maxRetries) {
+                return allocation.decision(Decision.NO, NAME,
+                    "shard cannot be allocated");
             } else {
                 return allocation.decision(Decision.YES, NAME,
                     "shard has failed allocating [%d] times but [%d] retries are allowed",
                     numFailedAllocations, maxRetries);
             }
         }
         return allocation.decision(Decision.YES, NAME, "shard has no previous failures");

Two defects, both realistic:

Off-by-one: > maxRetries instead of >= maxRetries. With max_retries = 5, the shard is allowed a 6th attempt — one too many. The contract is "stop at the limit," i.e. >=.
Useless explanation: "shard cannot be allocated" drops the failure count, the limit, and the crucial recovery hint (?retry_failed=true). An operator running allocation/explain now learns nothing actionable.

Note: This is a teaching regression. The real MaxRetryAllocationDecider already uses >= and a rich explanation. You are practicing the fix-it motion on a class whose correct behavior you can verify against the actual source.
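The off-by-one can be checked in isolation before touching the decider. This is a minimal model, assuming every permitted attempt fails; RetryBoundary and its methods are hypothetical names, not part of OpenSearch.

```java
// Simplified model of the retry boundary; NOT the real MaxRetryAllocationDecider.
final class RetryBoundary {
    /** Fixed comparison: deny once the failure count reaches the limit. */
    static boolean deniedFixed(int failures, int maxRetries) {
        return failures >= maxRetries;
    }

    /** Buggy comparison: denies only after the limit has been exceeded. */
    static boolean deniedBuggy(int failures, int maxRetries) {
        return failures > maxRetries;
    }

    /** Counts the allocation attempts made before the shard is first denied. */
    static int attemptsBeforeDenied(boolean buggy, int maxRetries) {
        int failures = 0;
        while (!(buggy ? deniedBuggy(failures, maxRetries) : deniedFixed(failures, maxRetries))) {
            failures++; // each permitted attempt fails in this model
        }
        return failures;
    }
}
```

With a limit of 5, the fixed comparison stops after 5 attempts and the buggy one after 6, which is the extra retry described above.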

Step 3 (10 min) — The fix

@@ public Decision canAllocate(ShardRouting shardRouting, RoutingAllocation allocation) {
             final int numFailedAllocations = unassignedInfo.getNumFailedAllocations();
-            if (numFailedAllocations > maxRetries) {
-                return allocation.decision(Decision.NO, NAME,
-                    "shard cannot be allocated");
+            if (numFailedAllocations >= maxRetries) {
+                return allocation.decision(Decision.NO, NAME,
+                    "shard has exceeded the maximum number of retries [%d] on failed allocation attempts - "
+                        + "manually call [%s] to retry, [%s]",
+                    maxRetries, RETRY_FAILED_API, unassignedInfo.toString());
             } else {

The fix restores the boundary (>=) and the actionable explanation (count, limit, retry API, and the full UnassignedInfo so the operator sees why it failed).

Pitfall — explanation strings are an API. _cluster/allocation/explain output and decider messages are consumed by operators, dashboards, and support tooling. Changing them casually breaks people's runbooks. When you fix the verdict, preserve (or improve, deliberately) the message — and note it in CHANGELOG.md.

Step 4 (20 min) — The unit test that pins the boundary

This is the heart of the lab: a test that constructs a RoutingAllocation with a shard that has exactly maxRetries failures and asserts the Decision is NO. Find the existing test to mirror its setup:

find server/src/test/java -name "MaxRetryAllocationDeciderTests.java"
grep -n "createOnFailedAllocation\|allocation\|canAllocate\|max_retries\|Decision\|RoutingAllocation\|UnassignedInfo" \
  server/src/test/java/org/opensearch/cluster/routing/allocation/decider/MaxRetryAllocationDeciderTests.java | head -40

The test shape (adapt names to your branch; helpers like createInitialClusterState / MockAllocationService already exist in the allocation test package):

package org.opensearch.cluster.routing.allocation.decider;

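import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.OpenSearchAllocationTestCase;
import org.opensearch.cluster.routing.ShardRouting;
import org.opensearch.cluster.routing.allocation.RoutingAllocation;

import static org.hamcrest.Matchers.containsString;

// Same package as Decision and MaxRetryAllocationDecider, so neither needs an import.
// createClusterStateWithFailedAllocations(...) and newRoutingAllocation(...) are placeholder
// helpers: write them by mirroring the setup in MaxRetryAllocationDeciderTests.
public class MaxRetryBoundaryTests extends OpenSearchAllocationTestCase {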

    public void testDeniedExactlyAtMaxRetries() {
        int maxRetries = 5;

        // Build a cluster state with one index whose primary has failed exactly maxRetries times.
        ClusterState clusterState = createClusterStateWithFailedAllocations(
            "idx", maxRetries, maxRetries /* index.allocation.max_retries */);

        RoutingAllocation allocation = newRoutingAllocation(clusterState);
        allocation.debugDecision(true); // capture explanations for assertion

        ShardRouting unassignedPrimary = allocation.routingNodes()
            .unassigned().iterator().next();

        MaxRetryAllocationDecider decider = new MaxRetryAllocationDecider();
        Decision decision = decider.canAllocate(unassignedPrimary, allocation);

        // The boundary: at exactly maxRetries, the shard must be DENIED (>=, not >).
        assertEquals(Decision.Type.NO, decision.type());
        // The explanation must remain actionable.
        assertThat(decision.getExplanation(),
            containsString("exceeded the maximum number of retries"));
        assertThat(decision.getExplanation(),
            containsString("retry"));
    }

    public void testAllowedBelowMaxRetries() {
        int maxRetries = 5;
        ClusterState clusterState = createClusterStateWithFailedAllocations(
            "idx", maxRetries - 1, maxRetries);
        RoutingAllocation allocation = newRoutingAllocation(clusterState);

        ShardRouting unassignedPrimary = allocation.routingNodes()
            .unassigned().iterator().next();

        Decision decision = new MaxRetryAllocationDecider()
            .canAllocate(unassignedPrimary, allocation);

        assertEquals(Decision.Type.YES, decision.type());
    }
}

The two tests are the whole point:

testDeniedExactlyAtMaxRetries fails on the buggy > (it would return YES at the boundary) and passes on the fixed >=. It also pins the explanation, so the "useless message" regression can't slip back in.
testAllowedBelowMaxRetries guards the other side: you didn't over-correct into denying valid retries.

Note: The real test class (MaxRetryAllocationDeciderTests) uses helpers from OpenSearchAllocationTestCase to build cluster states and routing allocations and to simulate failed allocations (applyFailedShards). Read it before writing your own — it shows the supported way to construct a RoutingAllocation with a shard that has N failures, which is fiddly to do by hand.

Run it:

./gradlew :server:test --tests "*MaxRetryBoundaryTests"
# or the real class while you study it:
./gradlew :server:test --tests "*MaxRetryAllocationDeciderTests"

A failing run against the buggy code, then a passing run after the fix, is your proof.

Step 5 (15 min) — Reproduce on a live cluster with allocation/explain

Unit tests pin the boundary; _cluster/allocation/explain shows the operator-visible behavior.

# On a 2-node cluster. Create an index whose shards cannot be placed: require._name names a
# node that does not exist, so the filter decider rejects every node for primary and replica.
curl -s -XPUT 'localhost:9200/retry-demo' -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.routing.allocation.require._name": "does-not-exist",
    "index.allocation.max_retries": 5
  }
}'

# Ask why the shard is unassigned:
curl -s -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '{
  "index": "retry-demo", "shard": 0, "primary": true
}'

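In the response, look at node_allocation_decisions[].deciders[]. By default only the deciders that did not say YES are listed; add include_yes_decisions=true to the URL to see every decider. For this index the filter decider returns NO (the require._name you set), while max_retry stays YES, because a shard that a filter rejects never starts an allocation attempt and so never fails one. The failure counter grows only when an attempted allocation fails, for example when an analyzer refers to a stopwords file that does not exist on all nodes. After index.allocation.max_retries such failures, max_retry reports NO. The decision field is the Decision.Type and explanation is the string you just fixed: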

{
  "decider": "max_retry",
  "decision": "NO",
  "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info ...]"
}

With the buggy code, the explanation would read "shard cannot be allocated" and the boundary would be off by one — the exact operator-facing degradation your fix prevents. Recover the demo:

curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'
curl -s -XDELETE 'localhost:9200/retry-demo'

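Pitfall: called without a request body, allocation/explain picks the first unassigned shard it finds and returns an error if there is none, which is what happens on a green cluster. Either force an unassignable shard (as above) or name the shard explicitly with an index/shard/primary body.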

Implementation Requirements

The decider uses the correct boundary (>= maxRetries) so a shard is denied at the limit, not one past it.
The Decision.NO explanation includes the failure count, the limit, and the retry API hint.
A unit test asserts Decision.Type.NO at exactly maxRetries and YES below it, and pins the explanation content.
You reproduced the operator-facing behavior with _cluster/allocation/explain and saw the max_retry decider entry.
A CHANGELOG.md entry under ## [Unreleased] (Fixed) if you were submitting this for real.

Common Pitfalls

Pitfall	Why it bites	Avoid by
Treating a `Decision` as boolean	You ignore `THROTTLE`, which is not a `NO`	Always switch on `Decision.Type` (`YES`/`NO`/`THROTTLE`)
Changing the verdict but not the test	Regression slips back in later	Pin the boundary value, not just "denies eventually"
Degrading the explanation string	Breaks operator runbooks and dashboards	Treat `allocation/explain` text as API; preserve actionable detail
Off-by-one on `>=` vs `>`	Allows one retry too many/few	Write the boundary test first
Constructing `RoutingAllocation` by hand	Brittle, wrong shard states	Reuse `OpenSearchAllocationTestCase` helpers
Calling `allocation/explain` with no body on a green cluster	No unassigned shard to pick, so the API returns an error	Create an intentionally-unassignable shard first, or name the shard in the body
Forgetting `allocation.debugDecision(true)`	`getExplanation()` returns null in the test, because non-debug decisions carry no message	Enable debug decisions to capture the message

Expected Output

Test:

> Task :server:test
MaxRetryBoundaryTests > testDeniedExactlyAtMaxRetries PASSED
MaxRetryBoundaryTests > testAllowedBelowMaxRetries PASSED

Live reproduction shows a max_retry decider entry with decision: NO and the full, actionable explanation string.

Stretch Goals

A different decider, same motion. Apply the read → spot-bug → test → explain loop to SameShardAllocationDecider (it must return NO when the same shard copy is already on the node) or DiskThresholdDecider (watermarks). Each has its own boundary worth a test.
Combine deciders. Read AllocationDeciders.canAllocate(...) and prove with a test that one NO overrides any number of YESes, and that THROTTLE is returned only when there's no NO.
Trace applyFailedShards. In AllocationService, follow how a failed shard's UnassignedInfo.numFailedAllocations gets incremented — that's the value MaxRetryAllocationDecider reads. This closes the loop between Lab 4.2's state updates and this decider.

Validation / Self-check

Name the three Decision.Type values and explain how AllocationDeciders combines a chain of them into one verdict.
In MaxRetryAllocationDecider, which field on UnassignedInfo does the verdict depend on, and which index setting is the limit?
Why is >= (not >) the correct boundary, and what is the operator-visible symptom of the > bug?
Why is the explanation string part of the contract — who consumes it, and through which API?
Describe the unit test you would write to pin this fix so the regression can't return. Which exact failure-count value does it use, and what two things does it assert?
On a live cluster, how do you force an unassignable shard so that _cluster/allocation/explain shows the max_retry decider? Why won't it work on a green cluster?
If this were a real PR, what goes in CHANGELOG.md, and under which heading?

When your boundary test fails on the buggy code, passes on the fix, and your allocation/explain reproduction shows the actionable max_retry explanation, you've completed Lab 4.4 — and Level 4. Continue to Level 5: Testing and Debugging.

Level 5: Testing and Debugging

Up to now you have read the engine and made small, surgical changes. From here on, no change ships without a test, and most of your time as a contributor is spent in the test framework, not in server/src/main. This level makes the OpenSearch test framework your home: its taxonomy, its randomization, its in-JVM clusters, and the debugging tooling you use when a test — or production — misbehaves.

This is the OpenSearch analog of learning how a mature project defends itself. A maintainer's first question on almost every PR is "where's the test?" The second is "is it deterministic?" By the end of this level you can answer both without being asked.

Note: OpenSearch inherited a famously rigorous, randomized test framework from Elasticsearch (which inherited the Carrotsearch Randomized Testing runner from Lucene). Tests run with a random seed by default; the same test exercises different inputs on every run. This finds bugs ordinary example-based tests miss — and it is also the source of "flaky" tests, which you will learn to fix rather than mute.

Learning Objectives

By the end of Level 5 you must be able to:

Name every base test class in the OpenSearch test framework and choose the right one for a given change (unit vs single-node vs multi-node vs REST vs serialization vs BWC).
Explain randomized testing: the RandomizedRunner, the seed, randomAlphaOfLength/randomIntBetween, and how to reproduce a failure from a printed -Dtests.seed=....
Write and run an OpenSearchIntegTestCase that stands up a multi-node InternalTestCluster, indexes data, and asserts cluster health and query results.
Write a unit test, including an equals/hashCode contract test and a serialization round-trip for a Writeable via AbstractWireSerializingTestCase.
Run the right Gradle test task for each kind of test and read the HTML and XML reports it produces.
Debug a test or a running node: --debug-jvm, TRACE logging on a single package, and assertBusy for asynchronous assertions.
Reproduce, diagnose, and correctly fix a flaky test — and explain why adding Thread.sleep is not a fix.

The Test Framework Taxonomy

Everything lives under test/framework/ (the published org.opensearch.test framework artifact) and in each module's src/test/java. The single most valuable skill in this level is picking the cheapest test class that still proves your change. Reach for a heavier base only when a lighter one cannot exercise the behavior.

find test/framework/src/main/java -name "OpenSearch*TestCase.java" | sort
find test/framework/src/main/java -name "InternalTestCluster.java" -o -name "MockNode.java"

Base class	What it gives you	Cost	Use when
`OpenSearchTestCase`	Plain JUnit + randomization, `random*` helpers, `assertBusy`, leak/thread-leak detection	Cheapest	Pure logic: a parser, a `Setting`, a comparator, `equals`/`hashCode`, a small algorithm
`AbstractWireSerializingTestCase<T>`	A full `Writeable` round-trip harness (write→read, equality, BWC across versions)	Cheap	Any `Writeable`/`NamedWriteable` request, response, or metadata object
`AbstractSerializingTestCase<T>`	The above plus XContent (JSON) round-trip (`toXContent`/`fromXContent`)	Cheap	Objects with both wire and JSON serialization (most API request/response types)
`OpenSearchSingleNodeTestCase`	One real in-JVM node; `client()`, `createIndex`, the index/shard/engine stack	Medium	Behavior that needs a real `IndicesService`/`IndexShard`/`Engine` but not multiple nodes
`OpenSearchIntegTestCase`	A multi-node `InternalTestCluster`; `@ClusterScope`, `internalCluster()`, `ensureGreen()`	Heavy	Distributed behavior: replication, allocation, recovery, failover, cross-node queries
`OpenSearchRestTestCase`	A running cluster reached over HTTP via the low-level REST client	Heavy	Black-box REST/API contract tests; what an external client actually sees
`OpenSearchTokenStreamTestCase`	Lucene analysis assertions (`assertTokenStreamContents`, `assertAnalyzesTo`)	Cheap	Tokenizers, token filters, analyzers (see Lab 6.3)

Two more you must recognize even before you write them:

REST-YAML tests (yamlRestTest). The shared, language-agnostic API tests under rest-api-spec/ (and per-module src/yamlRestTest/resources/rest-api-spec/test/...). They are YAML files of do:/match: steps run by a Java harness (OpenSearchClientYamlSuiteTestCase). Every OpenSearch client project reuses them. Run with ./gradlew :rest-api-spec:yamlRestTest or a module's :yamlRestTest.
Backward-compatibility (BWC) tests in qa/. These start a cluster on an old version and an upgraded node, asserting data and APIs survive an upgrade. They key off bwcVersion. You will rarely write these early, but you must know they exist and gate the release.

flowchart TD
    Q{What does your change touch?} --> L1[Pure logic / parsing / equality]
    Q --> L2[A Writeable wire/XContent type]
    Q --> L3[One node: index/shard/engine]
    Q --> L4[Multiple nodes: replication/allocation/recovery]
    Q --> L5[The REST/API contract]
    Q --> L6[Analysis: tokenizer/filter/analyzer]
    L1 --> C1[OpenSearchTestCase]
    L2 --> C2[AbstractWireSerializingTestCase /<br/>AbstractSerializingTestCase]
    L3 --> C3[OpenSearchSingleNodeTestCase]
    L4 --> C4[OpenSearchIntegTestCase + @ClusterScope]
    L5 --> C5[OpenSearchRestTestCase + yamlRestTest]
    L6 --> C6[OpenSearchTokenStreamTestCase]

Randomized Testing: The Seed Is the Test

OpenSearch tests run under Carrotsearch RandomizedRunner (@RunWith is wired into the base classes). Every test method gets a deterministic source of randomness derived from a master seed. Instead of hand-picking inputs, you ask the framework for them:

String name   = randomAlphaOfLength(randomIntBetween(1, 20));   // varies every run
int shards    = randomIntBetween(1, 5);
boolean flag  = randomBoolean();
TimeValue ttl = TimeValue.timeValueSeconds(randomIntBetween(1, 60)); // random duration
Version v     = VersionUtils.randomVersion(random());           // random known version (BWC)

The same method body exercises a different name, shards, flag on each run. Over CI's thousands of runs, the input space gets covered far more than any example you'd write by hand.

When a randomized test fails, the framework prints a reproduce line. It looks like:

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.SomeTests.testThing" \
  -Dtests.seed=ABCDEF0123456789 -Dtests.locale=de-DE -Dtests.timezone=America/New_York

Copy that line verbatim. The seed pins all randomness (the random* values, the iteration order, even the JVM locale/timezone) so the failure reproduces exactly. This is the single most important habit in OpenSearch testing: never debug a randomized failure without its seed.
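Why a single seed is enough to replay a run can be shown with java.util.Random standing in for the runner's randomness source. This is only an analogy: SeededInputs is a hypothetical class, and the real runner derives per-test seeds in its own way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Analogy for seed-driven test inputs; NOT how RandomizedRunner derives its values.
final class SeededInputs {
    /** Parses a hex seed such as the value printed after -Dtests.seed=. */
    static long parseSeed(String hex) {
        return Long.parseUnsignedLong(hex, 16);
    }

    /** Draws shard counts in [1, 5], comparable to repeated randomIntBetween(1, 5) calls. */
    static List<Integer> shardCounts(long seed, int howMany) {
        Random random = new Random(seed);
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < howMany; i++) {
            counts.add(1 + random.nextInt(5));
        }
        return counts;
    }
}
```

Calling shardCounts twice with the same parsed seed returns the same list, so whatever input triggered the failure is drawn again on the replay.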

Flag	Effect
`-Dtests.seed=<hex>`	Reproduce a specific run exactly
`-Dtests.iters=N`	Run each selected method N times (hunt flakiness)
`-Dtests.locale=<loc>` / `-Dtests.timezone=<tz>`	Pin locale/timezone (locale-sensitive bugs are common)
`-Dtests.nightly=true`	Enable heavier `@Nightly` tests
`-Dtests.leaveTemporary=true`	Keep the test's data dirs for inspection
`-Dtests.security.manager=false`	Disable the test security manager when debugging

Warning: Do not "fix" a randomized test by narrowing its inputs (e.g. replacing randomIntBetween(1, 100) with a constant) to dodge a failure. That hides the bug the randomization found. Either the production code is wrong, or your test made an unwarranted assumption — fix the real cause. See Lab 5.4.

InternalTestCluster and MockNode

OpenSearchIntegTestCase runs a real cluster inside the test JVM. The machinery is InternalTestCluster: it starts N Nodes (actually MockNodes — Node subclasses that swap in test implementations like MockTransportService, deterministic random, and bounded thread pools) and wires them into a working cluster you drive through internalCluster() and client().

grep -n "startNode\|startNodes\|stopRandomDataNode\|client()\|ensureGreen\|nodesInclude\|getMasterName\|getClusterManagerName" \
  test/framework/src/main/java/org/opensearch/test/InternalTestCluster.java | head -40

What you control from a test:

Call	Does
`internalCluster().startNode(Settings)` / `startNodes(n)`	Start node(s) with overrides
`internalCluster().startClusterManagerOnlyNode()` / `startDataOnlyNode()`	Role-specific nodes
`internalCluster().stopRandomDataNode()`	Kill a data node (failover tests)
`internalCluster().getClusterManagerName()`	Name of the elected cluster manager (formerly master)
`client()` / `internalCluster().client(nodeName)`	A client to the cluster / to a specific node
`ensureGreen("index")` / `ensureYellow(...)`	Block until shards reach the health state

@ClusterScope controls cluster reuse, which is the single biggest lever on integ-test speed and isolation:

Scope	Lifetime	Trade-off
`@ClusterScope(scope = Scope.SUITE)` (default)	One cluster shared by all methods in the class	Fast; methods must clean up after themselves
`@ClusterScope(scope = Scope.TEST)`	A fresh cluster per test method	Slow but maximally isolated; use for disruptive tests
`numDataNodes`, `numClientNodes`, `minNumDataNodes`/`maxNumDataNodes`	Sizing (fixed or randomized)	Randomized sizing finds size-dependent bugs
`supportsDedicatedMasters`	Whether dedicated cluster-manager nodes may be added	Mirrors real topologies

Covered hands-on in Lab 5.1 and pushed further (failover) in Lab 5.3.

Gradle Test Tasks

There is a task per kind of test. Picking the wrong one wastes minutes (or hours).

Task	Runs	Source set
`./gradlew :server:test`	Server unit tests (`*Tests.java`)	`src/test`
`./gradlew :server:internalClusterTest`	In-JVM multi-node integ tests (`*IT.java`)	`src/internalClusterTest`
`./gradlew :rest-api-spec:yamlRestTest`	Shared REST-YAML API tests	`src/yamlRestTest`
`./gradlew :qa:full-cluster-restart:test` (and friends in `qa/`)	BWC / rolling-upgrade / mixed-cluster	`qa/`
`./gradlew precommit`	Checkstyle, forbidden-APIs, headers, `loggerUsageCheck`, etc.	—
`./gradlew check`	The full gate: unit + integ + precommit (long)	—

Scoping is mandatory for fast iteration:

# One class
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests"
# One method
./gradlew :server:test --tests "org.opensearch.index.engine.InternalEngineTests.testVersioningNewIndex"
# A glob
./gradlew :server:test --tests "*MaxRetry*"
# An integ test
./gradlew :server:internalClusterTest --tests "org.opensearch.cluster.SpecificClusterManagerNodesIT"

Note: Naming convention is load-bearing. *Tests.java are picked up by test; *IT.java by internalClusterTest. Put an integ test in src/test and it simply won't run; put a unit test named *IT and it won't run under test. Match the source set to the suffix.

Reading the Reports

Every Gradle test task writes machine- and human-readable reports. When CI is red and the console is a wall of text, the report is faster than scrolling.

# HTML report (open in a browser) — index, per-class pages, stdout/stderr captured
open server/build/reports/tests/test/index.html          # macOS
xdg-open server/build/reports/tests/test/index.html       # Linux

# Raw per-test XML (what CI parses)
ls server/build/test-results/test/*.xml

# Find the exact reproduce line from a failed run
grep -rn "REPRODUCE WITH" server/build/test-results/ server/build/reports/ 2>/dev/null | head

The HTML page for a failed test shows the assertion, the full stack trace, and the captured stdout/stderr — which is where the REPRODUCE WITH line and any test logging end up.

Debugging

Three tools cover almost every test/debug situation.

1. Attach a debugger to a running node — --debug-jvm:

# Pauses the node JVM waiting for a debugger on port 5005, then runs a single node from source.
./gradlew run --debug-jvm
# Attach IntelliJ "Remote JVM Debug" to localhost:5005. Set breakpoints in IndexShard,
# InternalEngine, TransportShardBulkAction, etc., then send a curl and step through.

To debug a test under a debugger, use the test-task variant:

./gradlew :server:test --tests "*InternalEngineTests.testSimple" --debug-jvm

2. TRACE logging on exactly the package you care about (cluster-wide, dynamically):

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "logger.org.opensearch.index.engine": "TRACE",
                 "logger.org.opensearch.index.translog": "TRACE" }
}'
# ...do the operation, read the logs, then turn it back off:
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "logger.org.opensearch.index.engine": null,
                 "logger.org.opensearch.index.translog": null }
}'

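Inside a test, set it on the class or method with @TestLogging(value = "org.opensearch.index.engine:TRACE", reason = "why you need the extra logging"). The annotation requires a reason, so a bare value will not compile.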

3. assertBusy for asynchronous assertions. Most OpenSearch work is asynchronous (refresh, allocation, recovery, cluster-state apply). Do not assert immediately, and do not Thread.sleep. Poll until true (with a timeout):

assertBusy(() -> {
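    // On 3.x (Lucene 10) TotalHits is a record, so call value(); on 2.x read the public field .value instead.
    long count = client().prepareSearch("idx").setSize(0).get().getHits().getTotalHits().value();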
    assertEquals(3L, count);
}, 30, TimeUnit.SECONDS);

assertBusy retries the assertion with backoff until it passes or the timeout fires. It is the deterministic alternative to "sleep and hope" — the distinction at the heart of Lab 5.4.
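To make the retry idea concrete, here is a minimal JDK-only sketch of a poll-with-backoff loop in the spirit of assertBusy. It is not the framework's implementation: the class name BusyWaitSketch, the method retryUntilPasses, and the 100 ms backoff cap are all assumptions made for illustration. The sleep lives inside the polling loop and is bounded by a deadline, which is what separates this from a single fixed sleep followed by one assertion.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; names and backoff values are assumptions, not OpenSearch code. */
final class BusyWaitSketch {

    /** Re-runs {@code check} with a doubling, capped pause until it passes or the deadline expires. */
    static void retryUntilPasses(Runnable check, long maxWait, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // the assertion held, stop polling
            } catch (AssertionError failure) {
                if (System.nanoTime() - deadline >= 0) {
                    throw failure; // surface the last real failure, not a generic timeout
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", e);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // capped exponential backoff
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger visibleDocs = new AtomicInteger(0);
        // Simulate asynchronous work (a refresh, a recovery) that finishes a little later.
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            visibleDocs.set(3);
        });
        worker.start();
        retryUntilPasses(() -> {
            if (visibleDocs.get() != 3) {
                throw new AssertionError("expected 3 docs, saw " + visibleDocs.get());
            }
        }, 5, TimeUnit.SECONDS);
        worker.join();
        System.out.println("condition met");
    }
}
```

The test returns as soon as the condition holds, so a fast machine pays almost nothing and a slow CI runner gets the full timeout before the failure is reported.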

Key Classes Quick Reference

Class	Package / location	Role
`OpenSearchTestCase`	`org.opensearch.test` (`test/framework`)	Base unit test: randomization, `random*`, `assertBusy`, leak detection
`AbstractWireSerializingTestCase<T>`	`org.opensearch.test`	`Writeable` wire round-trip + BWC harness
`AbstractSerializingTestCase<T>`	`org.opensearch.test`	Wire and XContent round-trip harness
`OpenSearchSingleNodeTestCase`	`org.opensearch.test`	One in-JVM node; real index/shard/engine stack
`OpenSearchIntegTestCase`	`org.opensearch.test`	Multi-node `InternalTestCluster`; `@ClusterScope`
`OpenSearchRestTestCase`	`org.opensearch.test.rest`	Black-box tests over the HTTP REST client
`OpenSearchTokenStreamTestCase`	`org.opensearch.test`	Lucene analysis assertions
`InternalTestCluster`	`org.opensearch.test`	Starts/manages in-JVM nodes; `internalCluster()` returns it
`MockNode`	`org.opensearch.node` (`test/framework`)	Test `Node` with mock transport / deterministic services
`@ClusterScope`	`org.opensearch.test.OpenSearchIntegTestCase`	Cluster reuse + sizing for integ tests
`RandomizedRunner`	`com.carrotsearch.randomizedtesting`	The JUnit runner that drives the seed
`MockTransportService`	`org.opensearch.test.transport`	Transport you can intercept/disrupt (failure injection)
`ServiceDisruptionScheme` / `NetworkDisruption`	`org.opensearch.test.disruption`	Inject partitions, slow links, node freezes

The Labs

Lab	Title	Type
5.1	OpenSearchIntegTestCase and InternalTestCluster	Hands-on integ test
5.2	Add a Missing Unit Test	Hands-on unit + serialization
5.3	Build It — A Multi-Node Integration Test	Build-it (failover)
5.4	Fix It — Un-Mute a Flaky Test	Fix-it with a diff

Deliverables

You must demonstrate all of the following before advancing to Level 6:

An OpenSearchIntegTestCase you wrote that starts a multi-node InternalTestCluster, indexes docs, ensureGreens, and asserts a query result (Lab 5.1).
A unit test for a previously untested class, including equals/hashCode coverage and a serialization round-trip via AbstractWireSerializingTestCase (Lab 5.2).
A multi-node integ test that stops a data node and asserts replica promotion / shard reallocation with ensureGreen + assertBusy (Lab 5.3).
A reproduced, diagnosed, and correctly fixed flaky test — un-muted, with the race explained (Lab 5.4).
From memory: the base-class taxonomy table, the meaning of -Dtests.seed, and why assertBusy beats Thread.sleep.

Common Mistakes

Mistake	Consequence	Fix
Debugging a randomized failure without the seed	Can't reproduce; you chase ghosts	Copy the printed `-Dtests.seed=...` line verbatim
Using `OpenSearchIntegTestCase` for pure logic	100× slower tests; flaky CI	Use the cheapest base that proves the change
`Thread.sleep` to wait for async state	Flaky under load; slow always	Use `assertBusy(...)` / `ensureGreen(...)`
Putting an integ test in `src/test`	It silently never runs	`*IT.java` → `src/internalClusterTest`
Asserting before a refresh	Search misses just-indexed docs	`refresh()` the index, or `setRefreshPolicy(IMMEDIATE)`
Narrowing `random*` ranges to dodge a failure	Hides a real bug	Fix the production code or the bad assumption
`@Ignore`-ing a flaky test	No tracking; rots forever	`@AwaitsFix(bugUrl=...)` with a tracking issue, then fix it
Forgetting cleanup in `Scope.SUITE`	Cross-test pollution; order-dependent failures	Delete indices/templates you created, or use `Scope.TEST`
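
The first row of the table rests on one property: a pseudo-random generator replays the same sequence when given the same seed. The sketch below shows that property with plain java.util.Random. It is an analogy, not the randomizedtesting runner's actual seed derivation; the class SeedReplaySketch and its draw method are hypothetical names.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only: java.util.Random stands in for the test framework's generator. */
final class SeedReplaySketch {

    /** Draws {@code n} pseudo-random ints in [0, bound) from a generator seeded with {@code seed}. */
    static int[] draw(long seed, int n, int bound) {
        Random random = new Random(seed);
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = random.nextInt(bound);
        }
        return values;
    }

    public static void main(String[] args) {
        // A hex seed in the same style as -Dtests.seed=DEADBEEFDEADBEEF.
        long seed = Long.parseUnsignedLong("DEADBEEFDEADBEEF", 16);
        int[] firstRun = draw(seed, 5, 50);
        int[] replay = draw(seed, 5, 50);
        System.out.println(Arrays.toString(firstRun));
        System.out.println("replay identical: " + Arrays.equals(firstRun, replay));
    }
}
```

Without the seed you are drawing a fresh sequence each run, which is why a failure seen once may not come back.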

How to Verify Success

# 1. You can scope and reproduce any test.
./gradlew :server:test --tests "*ClusterStateTests" -Dtests.seed=DEADBEEFDEADBEEF

# 2. You can run an integ test and read its report.
./gradlew :server:internalClusterTest --tests "*ClusterManager*IT"
open server/build/reports/tests/internalClusterTest/index.html

# 3. You can turn on TRACE for one package and back off again.
curl -s -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' \
  -d '{"transient":{"logger.org.opensearch.index.engine":"TRACE"}}'
curl -s -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' \
  -d '{"transient":{"logger.org.opensearch.index.engine":null}}'

When you can choose the right base class, reproduce any failure from its seed, run the matching Gradle task, and fix flakiness without sleeping, you are ready for Level 6: Indexing Path and Storage Engine.

Lab 5.1: OpenSearchIntegTestCase and InternalTestCluster

Background

Most of what makes OpenSearch OpenSearch — replication, allocation, recovery, failover, cross-node search — only happens when there is more than one node. You cannot prove that behavior with a unit test. The framework's answer is OpenSearchIntegTestCase: it stands up a real, multi-node cluster inside the test JVM via InternalTestCluster, hands you a client(), and lets you drive it exactly like a production cluster — except it is deterministic, fast to start, and torn down automatically.

In this lab you write a working integration test from scratch: start a multi-node cluster, create an index with replicas, index documents, wait for the cluster to go green, and assert a query returns what you indexed. Along the way you read the framework itself so the magic stops being magic.

This is the hands-on counterpart to the taxonomy in the Level 5 index. For the engine internals these tests exercise, see the engine-internals deep dive and index-shard-lifecycle deep dive.

Why This Lab Matters for Contributors

A huge fraction of OpenSearch's test suite is *IT.java integ tests. To fix bugs in distributed behavior — and to prove the fix — you must be fluent in OpenSearchIntegTestCase.
Maintainers reject PRs whose tests can't actually fail when the bug is present. Knowing how to size the cluster, choose @ClusterScope, and ensureGreen() correctly is the difference between a test that catches regressions and one that's theater.
The same patterns (internalCluster(), ensureGreen(), assertBusy()) carry into every later level, including the failover test in Lab 5.3.

Prerequisites

OpenSearch builds and ./gradlew :server:test passes on your machine (Level 1).
You can run a single node from source (./gradlew run) and hit it with curl (Level 2).
You've read the Level 5 index taxonomy and the @ClusterScope table.
JDK 21, ~8 GB RAM free (an in-JVM 3-node cluster is real and uses real memory).

Step-by-Step Tasks

Step 1 (15 min) — Read the framework you're about to use

Before writing a test, read the two classes that do the work. You will refer back to these constantly.

# The base class: what client(), internalCluster(), ensureGreen, @ClusterScope come from.
find test/framework/src/main/java -name "OpenSearchIntegTestCase.java"
grep -n "public.*client()\|protected.*internalCluster()\|ensureGreen\|ensureYellow\|@interface ClusterScope\|numDataNodes\|Scope " \
  test/framework/src/main/java/org/opensearch/test/OpenSearchIntegTestCase.java | head -40

# The cluster itself: how nodes start/stop and how health is reached.
find test/framework/src/main/java -name "InternalTestCluster.java"
grep -n "startNodes\|startNode\|stopRandomDataNode\|getClusterManagerName\|client(\|ensureAtLeastNumDataNodes" \
  test/framework/src/main/java/org/opensearch/test/InternalTestCluster.java | head -40

Note three things:

client() returns a client that load-balances across the cluster (a random node), so your request goes through a real coordinating hop — exactly like production.
internalCluster() is your handle to the physical cluster: starting/stopping nodes, finding the cluster manager (formerly master), addressing a specific node.
ensureGreen(...) blocks until all primaries and replicas of the named indices are assigned and started. Without it, your assertions race the allocator.

Step 2 (10 min) — Find an existing integ test to mirror

Never write an integ test on a blank page. Copy the shape of a known-good one.

# Real integ tests live in src/internalClusterTest and end in *IT.java
find server/src/internalClusterTest/java -name "*IT.java" | head
# A small, readable example to model on:
grep -rln "extends OpenSearchIntegTestCase" server/src/internalClusterTest/java | head
grep -rln "ensureGreen\|internalCluster().startNodes" server/src/internalClusterTest/java | head

Open one and study how it: declares @ClusterScope, sizes the cluster, creates an index via client().admin().indices().prepareCreate(...), indexes via client().prepareIndex(...), and asserts via client().prepareSearch(...).

Step 3 (5 min) — Decide cluster scope and size

For this lab we want a fresh, isolated cluster so our assertions are clean, and three data nodes so replicas have somewhere to go:

Decision	Choice	Why
Scope	`@ClusterScope(scope = Scope.TEST, numDataNodes = 3)`	Fresh cluster per method; deterministic, no cross-test pollution
Replicas	`number_of_replicas: 1`	Forces real allocation; `green` means replicas placed
Health gate	`ensureGreen("integ-lab")`	Block until primaries + replicas are started

Note: Scope.TEST is slower than the default Scope.SUITE because it rebuilds the cluster each method. For a teaching lab with one method that's fine. In real PRs, prefer Scope.SUITE and clean up after yourself unless the test is disruptive (kills nodes), in which case use Scope.TEST.

Step 4 (25 min) — Write the test

Create server/src/internalClusterTest/java/org/opensearch/integlab/IntegLabIT.java:

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

package org.opensearch.integlab;

import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
import org.opensearch.action.search.SearchResponse;
import org.opensearch.cluster.health.ClusterHealthStatus;
import org.opensearch.common.settings.Settings;
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.test.OpenSearchIntegTestCase;
import org.opensearch.test.OpenSearchIntegTestCase.ClusterScope;
import org.opensearch.test.OpenSearchIntegTestCase.Scope;

import static org.opensearch.test.hamcrest.OpenSearchAssertions.assertHitCount;
import static org.opensearch.test.hamcrest.OpenSearchAssertions.assertSearchHits;

@ClusterScope(scope = Scope.TEST, numDataNodes = 3)
public class IntegLabIT extends OpenSearchIntegTestCase {

    public void testIndexThenSearchOnAGreenMultiNodeCluster() throws Exception {
        final String index = "integ-lab";

        // 1. Create an index with 2 primaries and 1 replica each. With 3 data nodes,
        //    every shard copy can be assigned -> the cluster can reach GREEN.
        createIndex(index, Settings.builder()
            .put("index.number_of_shards", 2)
            .put("index.number_of_replicas", 1)
            .build());

        // 2. Wait until all primaries AND replicas are started. This is the gate that
        //    makes the rest of the test deterministic. ensureGreen returns the status,
        //    so fetch the health response separately to inspect shard counts.
        assertEquals(ClusterHealthStatus.GREEN, ensureGreen(index));
        ClusterHealthResponse health = client().admin().cluster().prepareHealth(index).get();
        // 2 shards * (1 primary + 1 replica) = 4 active shards.
        assertEquals(4, health.getActiveShards());

        // 3. Index three documents, then refresh explicitly so they are searchable
        //    now instead of after the next periodic refresh.
        client().prepareIndex(index).setId("1")
            .setSource("title", "opensearch internals", "level", 5).get();
        client().prepareIndex(index).setId("2")
            .setSource("title", "lucene segments", "level", 6).get();
        client().prepareIndex(index).setId("3")
            .setSource("title", "opensearch testing", "level", 5).get();
        refresh(index); // open a new searcher so the docs are visible

        // 4. Assert a match-all sees all three docs.
        SearchResponse all = client().prepareSearch(index).get();
        assertHitCount(all, 3L);

        // 5. Assert a match query returns the right subset, by id.
        SearchResponse hits = client().prepareSearch(index)
            .setQuery(QueryBuilders.matchQuery("title", "opensearch"))
            .get();
        assertHitCount(hits, 2L);
        assertSearchHits(hits, "1", "3");

        // 6. Sanity: the cluster really does have 3 data nodes, and a single cluster manager.
        assertEquals(3, internalCluster().numDataNodes());
        assertNotNull(internalCluster().getClusterManagerName()); // formerly getMasterName()
    }
}

What each piece is doing:

createIndex(name, settings) is an OpenSearchIntegTestCase helper around client().admin().indices().prepareCreate(...). With replicas: 1 and 3 data nodes, the replica has a legal home, so green is reachable.
ensureGreen(index) blocks on a cluster-health request that waits for GREEN. If allocation is broken (e.g. a decider bug from Lab 4.4), this is where your test would hang/fail — that is exactly the signal you want.
refresh(index) opens a new Lucene searcher so just-indexed docs become visible. Skipping this is the #1 cause of "my integ test sees 0 hits" (see Troubleshooting).
assertHitCount / assertSearchHits come from OpenSearchAssertions — use these, not raw JUnit, because they produce far better failure messages for search responses.
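
The shard arithmetic behind the test's expectations can be written down as two small functions. This is a simplified sketch under stated assumptions: it only models the "one copy of a shard per node" constraint and ignores allocation filters, disk watermarks, and every other decider. ShardMathSketch and its method names are hypothetical.

```java
/** Illustrative sketch only: models shard-copy counting, not the real allocator. */
final class ShardMathSketch {

    /** Every primary plus each of its replicas counts as one active shard once GREEN. */
    static int expectedActiveShards(int numberOfShards, int numberOfReplicas) {
        return numberOfShards * (1 + numberOfReplicas);
    }

    /** Copies of the same shard cannot share a node, so each shard needs replicas + 1 data nodes. */
    static boolean greenReachable(int numDataNodes, int numberOfReplicas) {
        return numDataNodes >= numberOfReplicas + 1;
    }

    public static void main(String[] args) {
        System.out.println(expectedActiveShards(2, 1)); // 4, the value asserted in the lab test
        System.out.println(greenReachable(3, 1));       // true: 3 nodes can host a primary and its replica apart
        System.out.println(greenReachable(1, 1));       // false: the replica has nowhere legal to go
    }
}
```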

Step 5 (10 min) — Run it and read the output

./gradlew :server:internalClusterTest --tests "org.opensearch.integlab.IntegLabIT"

Note: It's :server:internalClusterTest, not :server:test. The *IT.java suffix routes the class to the internalClusterTest source set/task. Run it under :server:test and it won't be picked up at all.

Open the report:

open server/build/reports/tests/internalClusterTest/index.html   # macOS
# or
xdg-open server/build/reports/tests/internalClusterTest/index.html

Step 6 (10 min) — Prove the test can actually fail

A test that can't fail is worthless. Temporarily break it, confirm it goes red, then revert:

// Change the expected count to something wrong:
assertHitCount(all, 99L);   // should be 3

Run again — it must fail with a clear message (expected 99 but was 3). Revert. This habit (deliberately falsifying the assertion once) catches tests that silently pass for the wrong reason.

Implementation Requirements / Deliverables

A new IntegLabIT under server/src/internalClusterTest/java/... extending OpenSearchIntegTestCase.
@ClusterScope(scope = Scope.TEST, numDataNodes = 3) (or your justified alternative).
Creates an index with number_of_replicas: 1, then ensureGreen(...) and asserts the status is GREEN and the active-shard count is what you expect.
Indexes ≥3 docs, refreshes, and asserts both a match-all count and a match-query subset (by id).
Runs green under ./gradlew :server:internalClusterTest --tests "*IntegLabIT".
You demonstrated it can fail (Step 6) before reverting.

Troubleshooting

Symptom	Cause	Fix
Test class "not found" / nothing runs	Wrong task or wrong suffix	Use `:server:internalClusterTest`; class must end in `IT` and live in `src/internalClusterTest`
`ensureGreen` hangs / times out	Replica can't allocate (too few data nodes, allocation disabled)	Ensure `numDataNodes >= number_of_replicas + 1`; check allocation isn't disabled
Search returns 0 hits	You didn't refresh	Call `refresh(index)` or `setRefreshPolicy(IMMEDIATE)` on the index requests
Flaky hit counts	Asserting before async work settled	Wrap in `assertBusy(...)`; never `Thread.sleep`
`Thread leaked` / `Suite timeout`	A background thread/handle not released	Don't start your own threads; rely on the framework's lifecycle
OOM / very slow	Too many `Scope.TEST` nodes per method	Reduce `numDataNodes`, or switch to `Scope.SUITE` and clean up

Expected Output

> Task :server:internalClusterTest
org.opensearch.integlab.IntegLabIT > testIndexThenSearchOnAGreenMultiNodeCluster PASSED

BUILD SUCCESSFUL

The HTML report shows one test, status PASSED, with captured node logs under the stdout/stderr tab (a good place to see the cluster forming, the index being created, and shards going green).

Stretch Goals

Randomize the topology. Replace fixed numDataNodes = 3 with @ClusterScope(minNumDataNodes = 2, maxNumDataNodes = 5) and make the test size-agnostic (compute expected active shards from number_of_shards * (1 + replicas) and the actual node count). This is how real OpenSearch tests find size-dependent bugs.
Assert per-node placement. Use client().admin().cluster().prepareState().get().getState().getRoutingTable() to assert no primary and its replica landed on the same node (the SameShardAllocationDecider invariant). Cross-reference shard-allocation deep dive.
Make it a SUITE-scoped test. Convert to Scope.SUITE, add a second test method, and add a teardown that deletes the index — proving you can keep a shared cluster clean between methods.
Watch it form under TRACE. Add @TestLogging(value = "org.opensearch.cluster.service:TRACE", reason = "watching cluster-state versions advance") and read the cluster-state versions advancing in the report, connecting this lab to Level 4.
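
For the per-node placement goal, the invariant itself is simple enough to sketch without a cluster. The code below uses a hypothetical ShardCopy class as a flattened stand-in for a routing-table entry; in a real test you would build the same (shard id, node id) pairs from the routing table returned by the cluster state. Nothing here is OpenSearch API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only: ShardCopy is a hypothetical stand-in for a routing-table entry. */
final class PlacementCheckSketch {

    static final class ShardCopy {
        final int shardId;
        final boolean primary;
        final String nodeId;

        ShardCopy(int shardId, boolean primary, String nodeId) {
            this.shardId = shardId;
            this.primary = primary;
            this.nodeId = nodeId;
        }
    }

    /** Returns true when no two copies of the same shard are assigned to the same node. */
    static boolean noCopiesShareANode(List<ShardCopy> copies) {
        Set<String> seen = new HashSet<>();
        for (ShardCopy copy : copies) {
            // add() returns false if this (shard, node) pair was already present.
            if (!seen.add(copy.shardId + "@" + copy.nodeId)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<ShardCopy> healthy = Arrays.asList(
            new ShardCopy(0, true, "node-a"), new ShardCopy(0, false, "node-b"),
            new ShardCopy(1, true, "node-b"), new ShardCopy(1, false, "node-c"));
        System.out.println(noCopiesShareANode(healthy)); // true
    }
}
```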

Validation / Self-check

Why is the task :server:internalClusterTest and not :server:test? What makes Gradle route IntegLabIT to the right source set?
What exactly does ensureGreen("idx") wait for, and what would make it never return?
Why must you refresh(index) (or use IMMEDIATE) before searching in an integ test? What is actually happening when you refresh?
With number_of_shards: 2 and number_of_replicas: 1 on 3 data nodes, how many active shards do you expect, and why?
What's the difference between client() and internalCluster().client(nodeName)? When would you need the second?
When should you choose Scope.TEST over the default Scope.SUITE, and what's the cost?
You changed an assertion to a wrong value and the test still passed. What does that tell you, and what would you check?

When your test stands up a green 3-node cluster, indexes and queries documents deterministically, and you've proven it can fail, move on to Lab 5.2: Add a Missing Unit Test.

Lab 5.2: Add a Missing Unit Test

Background

OpenSearch has thousands of classes, and coverage is uneven. Some classes — especially small value objects, Writeable request/response types, and helper utilities added in a hurry — ship with no sibling *Tests class. These are the cheapest, highest-value contributions a new contributor can make: a focused unit test that nails down equals/hashCode, exercises edge cases, and round-trips wire serialization. Maintainers merge these readily because they increase the safety net without touching production behavior.

This lab teaches you to find an under-tested class with grep, then write a proper OpenSearchTestCase for it — using the framework's randomization helpers, exception assertions, and the serialization round-trip harness AbstractWireSerializingTestCase. You learn what separates a good assertion (one that would fail if the code were wrong) from a useless one (one that passes no matter what).

Note: Read Level 5 index first if you have not. This lab assumes you know the test taxonomy table, the meaning of -Dtests.seed, and the random* helpers.

Why This Lab Matters for Contributors

"Where's the test?" is the first question on nearly every PR. A test-only PR is the fastest way to build trust and get your name in CHANGELOG.md.
Writing a serialization round-trip teaches you the wire protocol (StreamInput/StreamOutput, Writeable, NamedWriteableRegistry) that underpins every transport message — see the serialization & BWC deep dive.
Randomized tests find bugs example-based tests miss. Learning to let the framework pick inputs is a permanent upgrade to how you test.
A wrong equals/hashCode is a real, shippable bug class (deduplication, cluster-state diffing, set membership). A contract test catches it.

Prerequisites

A clean build (./gradlew :server:compileJava succeeds).
You can run a scoped test: ./gradlew :server:test --tests "*ClusterStateTests".
An IDE that resolves org.opensearch.* imports (IntelliJ recommended; ./gradlew idea if needed).

java -version            # JDK 21 baseline for 3.x
./gradlew :server:compileTestJava -q    # test sources compile

Step-by-Step Tasks

Step 1: Find an under-tested class

The convention is that Foo.java is tested by FooTests.java in the mirrored src/test package. A class with no sibling *Tests is a candidate. Find one:

# List main classes whose name has no matching *Tests class under server/src/test/java.
cd server/src/main/java
for f in $(find org/opensearch -name '*.java' | sed 's#.*/##; s/\.java$//'); do
  if ! grep -rql "class ${f}Tests" ../../../../server/src/test/java 2>/dev/null; then
    echo "$f"
  fi
done | sort | head -50
cd - >/dev/null

That is a blunt instrument (it ignores inner classes and abstract types), so narrow to good targets. The best candidates are small, self-contained value types — ideally Writeable and/or ToXContent, with an equals/hashCode:

# Writeable value types under server, then check which lack a *Tests sibling.
grep -rln "implements Writeable\|implements Writeable.Reader\|extends TransportResponse\|extends ActionRequest" \
  server/src/main/java/org/opensearch | while read -r f; do
    base=$(basename "$f" .java)
    if ! find server/src/test -name "${base}Tests.java" | grep -q .; then
      echo "UNTESTED  $f"
    fi
  done | head -40

Note: Exact hits vary by branch — that is expected and fine. Pick one small class you can fully understand in 20 minutes. Avoid anything that needs a live node, a ThreadPool, or an IndexShard to construct; those belong in heavier tests. You want a pure value object.
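
If you prefer to do the same search in Java, the core of it is a set difference on class names. The sketch below is a hypothetical helper, not part of the OpenSearch build: UntestedFinderSketch, simpleNames, and untested are made-up names, and like the shell loop it matches on simple names only, so inner classes and same-named classes in different packages are not distinguished.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustrative sketch only: a name-based search for classes without a sibling *Tests class. */
final class UntestedFinderSketch {

    /** Collects the simple names (file name minus ".java") of every Java file under {@code root}. */
    static Set<String> simpleNames(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files.map(p -> p.getFileName().toString())
                .filter(name -> name.endsWith(".java"))
                .map(name -> name.substring(0, name.length() - ".java".length()))
                .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    /** Returns, sorted, every main class name that has no "<Name>Tests" counterpart. */
    static List<String> untested(Collection<String> mainClasses, Collection<String> testClasses) {
        return mainClasses.stream()
            .filter(name -> !testClasses.contains(name + "Tests"))
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Run from the repository root; these are the same directories the shell loop scans.
        Set<String> main = simpleNames(Path.of("server/src/main/java"));
        Set<String> test = simpleNames(Path.of("server/src/test/java"));
        untested(main, test).stream().limit(50).forEach(System.out::println);
    }
}
```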

Record your choice. For the rest of this lab the running example is a fictional but representative value type:

public final class ShardCounts implements Writeable {
    private final int total;
    private final int active;
    private final String indexName;

    public ShardCounts(int total, int active, String indexName) { /* validates total >= active >= 0 */ }
    public ShardCounts(StreamInput in) throws IOException { /* reads in field order */ }
    @Override public void writeTo(StreamOutput out) throws IOException { /* writes in field order */ }
    // getters, equals, hashCode, toString
}

Map every method on your real class to a test below.

Step 2: Read the class before you test it

You cannot write a good assertion for code you have not read. For your chosen class, answer:

CLASS=ShardCounts   # <- your class
FILE=$(find server/src/main/java -name "${CLASS}.java")
echo "$FILE"
grep -n "public ${CLASS}\|StreamInput\|writeTo\|equals\|hashCode\|throw new\|Objects.requireNonNull\|assert " "$FILE"

Specifically note:

Invariants the constructor enforces (if (...) throw new IllegalArgumentException(...)). Each one is an expectThrows test.
Field write order in writeTo vs read order in the StreamInput constructor. They must match; a round-trip test proves it.
Whether equals/hashCode use all fields or a subset (a subset is sometimes a bug).
Whether it's also ToXContentObject — if so, prefer AbstractSerializingTestCase (wire and XContent) over AbstractWireSerializingTestCase (wire only).

Step 3: Create the test file in the mirrored package

CLASS=ShardCounts
PKG_DIR=$(dirname "$(find server/src/main/java -name "${CLASS}.java")" | sed 's#src/main#src/test#')
mkdir -p "$PKG_DIR"
echo "Create: $PKG_DIR/${CLASS}Tests.java"

Every test file carries the SPDX header (precommit's licenseHeaders check fails without it):

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

Step 4: Write the construction, validation, and accessor tests

Start with OpenSearchTestCase (the cheapest base). Use the random* helpers so each run exercises different inputs — and centralize construction in a randomShardCounts() factory you will reuse:

package org.opensearch.cluster.metadata; // mirror the class's package

import org.opensearch.test.OpenSearchTestCase;

import static org.hamcrest.Matchers.containsString;

public class ShardCountsTests extends OpenSearchTestCase {

    /** One place that builds a valid random instance — reused by every test. */
    static ShardCounts randomShardCounts() {
        int total  = randomIntBetween(0, 50);
        int active = randomIntBetween(0, total);              // respect the invariant active <= total
        String idx = randomAlphaOfLength(randomIntBetween(1, 20));
        return new ShardCounts(total, active, idx);
    }

    public void testAccessorsReflectConstructorArgs() {
        int total  = randomIntBetween(0, 50);
        int active = randomIntBetween(0, total);
        String idx = randomAlphaOfLength(randomIntBetween(1, 20));

        ShardCounts counts = new ShardCounts(total, active, idx);

        assertEquals(total, counts.getTotal());
        assertEquals(active, counts.getActive());
        assertEquals(idx, counts.getIndexName());
    }

    public void testRejectsNegativeTotal() {
        int badTotal = -randomIntBetween(1, 10);                  // strictly negative
        IllegalArgumentException e = expectThrows(
            IllegalArgumentException.class,
            () -> new ShardCounts(badTotal, 0, "idx")
        );
        assertThat(e.getMessage(), containsString("total"));      // assert the message, not just the type
    }

    public void testRejectsActiveGreaterThanTotal() {
        int total  = randomIntBetween(0, 20);
        int active = total + randomIntBetween(1, 5);              // violates active <= total
        expectThrows(IllegalArgumentException.class, () -> new ShardCounts(total, active, "idx"));
    }

    public void testRejectsNullIndexName() {
        expectThrows(NullPointerException.class, () -> new ShardCounts(1, 1, null));
    }
}

Note: expectThrows is OpenSearch's preferred form (it returns the exception so you can assert on the message); assertThrows from JUnit works too. Always assert on something inside the exception (message substring or a field). A bare "an exception was thrown" test passes even when the wrong exception is thrown for the wrong reason.
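
To see why returning the exception matters, here is a minimal JDK-only sketch of a helper with the same shape. It is not the framework's expectThrows: the class ExpectThrowsSketch, the method expectThrowsSketch, and the requireNonNegative validator are hypothetical names used only for illustration.

```java
/** Illustrative sketch only: mimics the shape of expectThrows without any test framework. */
final class ExpectThrowsSketch {

    /** A block of code that is allowed to throw anything. */
    interface ThrowingRunnable {
        void run() throws Throwable;
    }

    /** Runs {@code code}, fails unless it throws {@code expectedType}, and returns the caught exception. */
    static <T extends Throwable> T expectThrowsSketch(Class<T> expectedType, ThrowingRunnable code) {
        try {
            code.run();
        } catch (Throwable thrown) {
            if (expectedType.isInstance(thrown)) {
                return expectedType.cast(thrown); // hand it back so the caller can inspect the message
            }
            throw new AssertionError("expected " + expectedType.getSimpleName() + " but got " + thrown, thrown);
        }
        throw new AssertionError("expected " + expectedType.getSimpleName() + " but nothing was thrown");
    }

    /** Hypothetical validator standing in for a constructor invariant. */
    static int requireNonNegative(int total) {
        if (total < 0) {
            throw new IllegalArgumentException("total must be >= 0, got " + total);
        }
        return total;
    }

    public static void main(String[] args) {
        IllegalArgumentException e =
            expectThrowsSketch(IllegalArgumentException.class, () -> requireNonNegative(-1));
        // Pinning the message proves the failure came from the invariant under test.
        System.out.println(e.getMessage().contains("total"));
    }
}
```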

Step 5: Add the `equals`/`hashCode` contract test

OpenSearch ships a ready-made contract checker, EqualsHashCodeTestUtils, that verifies reflexivity, symmetry, the equals↔hashCode agreement, and that a mutated copy is unequal:

import org.opensearch.test.EqualsHashCodeTestUtils;

public void testEqualsAndHashCode() {
    EqualsHashCodeTestUtils.checkEqualsAndHashCode(
        randomShardCounts(),                                   // a random instance
        original -> new ShardCounts(                           // copy: must be equal & same hash
            original.getTotal(), original.getActive(), original.getIndexName()),
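        original -> {                                          // mutate: must be NOT equal
            // If total == 0 the only legal active is 0, so skip case 1: randomValueOtherThan
            // would spin forever looking for a different value.
            int branch = original.getTotal() == 0 ? (randomBoolean() ? 0 : 2) : randomIntBetween(0, 2);
            switch (branch) {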
                case 0:  return new ShardCounts(
                             original.getTotal() + 1, original.getActive(), original.getIndexName());
                case 1:  int a = randomValueOtherThan(
                             original.getActive(), () -> randomIntBetween(0, original.getTotal()));
                         return new ShardCounts(original.getTotal(), a, original.getIndexName());
                default: return new ShardCounts(
                             original.getTotal(), original.getActive(),
                             original.getIndexName() + randomAlphaOfLength(1));
            }
        });
}

The mutator is the part that catches bugs: if equals ignores indexName, the default branch produces an object that equals treats as equal even though it should not be, and the test fails. randomValueOtherThan guarantees the mutated field actually changes, but it loops until it draws a different value, so its supplier must be able to produce one.
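
The following JDK-only sketch shows the mechanism on a deliberately broken class. It is a simplified stand-in for what a copy-and-mutate contract check does, not the code of EqualsHashCodeTestUtils; BuggyCounts and firstViolation are hypothetical names.

```java
import java.util.function.UnaryOperator;

/** Illustrative sketch only: a tiny copy-and-mutate contract check run against a buggy class. */
final class EqualsContractSketch {

    /** Deliberately buggy value type: equals and hashCode ignore indexName. */
    static final class BuggyCounts {
        final int total;
        final String indexName;

        BuggyCounts(int total, String indexName) {
            this.total = total;
            this.indexName = indexName;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BuggyCounts && ((BuggyCounts) o).total == total; // bug: indexName skipped
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(total);
        }
    }

    /** Returns null when the contract holds, otherwise a description of the first violation found. */
    static <T> String firstViolation(T original, UnaryOperator<T> copy, UnaryOperator<T> mutate) {
        T same = copy.apply(original);
        if (!original.equals(original)) return "not reflexive";
        if (!original.equals(same) || !same.equals(original)) return "copy is not equal";
        if (original.hashCode() != same.hashCode()) return "equal objects have different hash codes";
        if (original.equals(mutate.apply(original))) return "mutated instance is still equal";
        return null;
    }

    public static void main(String[] args) {
        BuggyCounts original = new BuggyCounts(3, "idx");
        UnaryOperator<BuggyCounts> copy = c -> new BuggyCounts(c.total, c.indexName);
        // Mutating the ignored field exposes the bug; mutating total alone would not.
        System.out.println(firstViolation(original, copy, c -> new BuggyCounts(c.total, c.indexName + "x")));
        System.out.println(firstViolation(original, copy, c -> new BuggyCounts(c.total + 1, c.indexName)));
    }
}
```

The second call reports no violation even though the class is broken, which is why the mutator should pick among all fields instead of always changing the same one.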

Step 6: Add the wire serialization round-trip

For a Writeable, the cleanest approach is to convert the test to extend AbstractWireSerializingTestCase<T>, which gives you a full round-trip harness (write → read → assert equal) and a version-aware copy hook you can use for BWC checks. You implement three methods:

package org.opensearch.cluster.metadata;

import org.opensearch.common.io.stream.Writeable;
import org.opensearch.test.AbstractWireSerializingTestCase;

public class ShardCountsTests extends AbstractWireSerializingTestCase<ShardCounts> {

    @Override
    protected ShardCounts createTestInstance() {
        int total  = randomIntBetween(0, 50);
        int active = randomIntBetween(0, total);
        return new ShardCounts(total, active, randomAlphaOfLength(randomIntBetween(1, 20)));
    }

    @Override
    protected Writeable.Reader<ShardCounts> instanceReader() {
        return ShardCounts::new;                               // the StreamInput constructor
    }

    @Override
    protected ShardCounts mutateInstance(ShardCounts instance) {
        // Return an instance guaranteed NOT equal to `instance` (drives the inequality checks).
        int total = instance.getTotal() + randomIntBetween(1, 5);
        return new ShardCounts(total, instance.getActive(), instance.getIndexName());
    }

    // ... plus the validation / accessor tests from Step 4 (they coexist in the same class) ...
}

AbstractWireSerializingTestCase provides testSerialization() (write to a BytesStreamOutput, read back, assert equal, all for free) and an inherited equals/hashCode check that uses a wire copy as the equal instance and mutateInstance as the unequal one. If your writeTo and StreamInput constructor disagree on field order, this test fails immediately, which is exactly the bug you want it to find.

Note: If your type is also ToXContentObject, extend AbstractSerializingTestCase<T> instead and additionally implement doParseInstance(XContentParser); you then get the JSON round-trip (toXContent → fromXContent) for free as well.
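
The field-order failure mode is easy to reproduce outside OpenSearch. The sketch below uses java.io.DataOutputStream and DataInputStream as stand-ins for StreamOutput and StreamInput, which is an assumption for illustration: the real classes use different encodings. WireOrderSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Illustrative sketch only: JDK data streams stand in for StreamOutput / StreamInput. */
final class WireOrderSketch {

    /** Writes the three fields in the order total, active, indexName. */
    static byte[] write(int total, int active, String indexName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(total);
            out.writeInt(active);
            out.writeUTF(indexName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back; with swapInts=true it simulates a reader that disagrees on field order. */
    static String read(byte[] wire, boolean swapInts) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            int first = in.readInt();
            int second = in.readInt();
            String indexName = in.readUTF();
            int total = swapInts ? second : first;
            int active = swapInts ? first : second;
            return "total=" + total + " active=" + active + " index=" + indexName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = write(5, 3, "idx");
        System.out.println(read(wire, false)); // total=5 active=3 index=idx
        System.out.println(read(wire, true));  // total=3 active=5 index=idx
    }
}
```

The swapped reader does not crash; it silently produces a different object. Only an equality check after the round-trip catches it, which is what the harness does for you.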

Step 7: Run it (and reproduce a failure)

# Whole class
./gradlew :server:test --tests "org.opensearch.cluster.metadata.ShardCountsTests"

# One method while iterating
./gradlew :server:test --tests "*ShardCountsTests.testEqualsAndHashCode"

# Hammer it for flakiness across many random inputs
./gradlew :server:test --tests "*ShardCountsTests" -Dtests.iters=100

If a run fails, copy the printed reproduce line verbatim — its -Dtests.seed=... pins the exact random inputs:

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.metadata.ShardCountsTests.testSerialization" \
  -Dtests.seed=1A2B3C4D5E6F7A8B -Dtests.locale=de-DE -Dtests.timezone=Asia/Tokyo

A failing serialization round-trip here usually means a real ordering or version-gating bug in the class — investigate the production code before "fixing" the test.
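Why a seed pins the inputs can be sketched with the JDK's own generator. This is an illustration only: the real test runner has its own seed format and random source, and `SeedReplaySketch` and `drawInputs` are hypothetical names.

```java
import java.util.Random;

final class SeedReplaySketch {
    // Draws the kind of inputs a random test instance might use: total, then active <= total.
    static int[] drawInputs(long seed) {
        Random random = new Random(seed);
        int total = random.nextInt(100);          // 0..99
        int active = random.nextInt(total + 1);   // 0..total
        return new int[] { total, active };
    }
}
```

Calling `drawInputs` twice with the same seed yields the same pair, which is the property the reproduce line relies on: same seed, same inputs, same failure.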

Step 8: Pass precommit and add a CHANGELOG entry

./gradlew spotlessApply                 # auto-format
./gradlew :server:precommit             # headers, forbidden-APIs, checkstyle, loggerUsageCheck

Add one line under ## [Unreleased ...] in CHANGELOG.md:

### Added
- Add unit tests for `ShardCounts` (equals/hashCode contract + wire serialization round-trip) ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))

Commit with DCO sign-off:

git checkout -b test/shardcounts-unit-tests
git add server/src/test/java CHANGELOG.md
git commit -s -m "Add unit tests for ShardCounts"

What Makes a Good Assertion

This is the heart of the lab. An assertion is good only if it would fail when the code is wrong.

Weak assertion	Why it's weak	Strong replacement
`assertNotNull(counts)` after construction	A constructor that returns garbage still passes	Assert each accessor equals its constructor arg
`expectThrows(Exception.class, ...)`	Passes for any exception, including an NPE bug	Pin the exact type and assert a message substring
`assertTrue(a.equals(b))` with `a == b`	Reflexivity is trivially true; tests nothing	Compare a fresh copy and a mutated instance
`assertEquals(json, obj.toString())`	Brittle on whitespace; couples to formatting	Round-trip via the XContent harness, compare objects
Asserting only the happy path	Bugs live in edge cases	Cover boundaries (0, max, empty string) and invalid input
Hard-coded inputs only	Misses input-dependent bugs	Drive with `random*`; let the seed widen coverage

Warning: A test that always passes is worse than no test — it gives false confidence and future contributors trust it. Before you commit, break the production code on purpose (e.g. drop a field from writeTo, or make equals ignore indexName) and confirm your test goes red. If it stays green, your assertions are too weak. Then revert the sabotage.
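The sabotage check can be sketched in plain Java. Everything here is hypothetical (`TeethCheckSketch`, `Counts`, the two predicates); it only illustrates why a "copy is equal" assertion stays green under a broken equals while a mutation-based assertion goes red.

```java
import java.util.Objects;
import java.util.function.BiPredicate;

final class TeethCheckSketch {
    static final class Counts {
        final int total;
        final String indexName;

        Counts(int total, String indexName) {
            this.total = total;
            this.indexName = indexName;
        }
    }

    // Correct equality: compares every field.
    static final BiPredicate<Counts, Counts> CORRECT =
        (a, b) -> a.total == b.total && Objects.equals(a.indexName, b.indexName);

    // Sabotaged equality: silently ignores indexName.
    static final BiPredicate<Counts, Counts> SABOTAGED = (a, b) -> a.total == b.total;

    // Weak check: only asserts that a copy equals the original.
    static boolean weakCheck(BiPredicate<Counts, Counts> eq) {
        Counts original = new Counts(3, "logs");
        return eq.test(original, new Counts(3, "logs"));
    }

    // Strong check: additionally requires a mutated instance to be unequal.
    static boolean strongCheck(BiPredicate<Counts, Counts> eq) {
        Counts original = new Counts(3, "logs");
        Counts copy = new Counts(3, "logs");
        Counts mutated = new Counts(3, "metrics");
        return eq.test(original, copy) && !eq.test(original, mutated);
    }
}
```

`weakCheck(SABOTAGED)` passes, so the weak test has no teeth; `strongCheck(SABOTAGED)` fails, which is the red result you want to see before reverting the sabotage.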

Implementation Requirements / Deliverables

A *Tests class in the mirrored test package, with the SPDX header.
Construction + accessor tests driven by random* helpers.
At least one expectThrows test per constructor invariant, asserting the message.
An equals/hashCode contract test (EqualsHashCodeTestUtils.checkEqualsAndHashCode) with a mutator that changes a real field.
A wire round-trip via AbstractWireSerializingTestCase (or the XContent variant if applicable), with createTestInstance, instanceReader, mutateInstance.
./gradlew :server:precommit passes; spotlessApply applied; a CHANGELOG.md entry added.
Proof the tests have teeth: a note describing the sabotage you applied and that the test failed.
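The "assert the message" deliverable is easiest to see with a minimal stand-in for `expectThrows`. This sketch is plain Java, not the framework helper; `validate` and its messages are hypothetical and only mirror the two constructor invariants discussed in this lab.

```java
final class ExpectThrowsSketch {
    // Hypothetical validation mirroring two constructor invariants.
    static void validate(int total, int active) {
        if (total < 0) {
            throw new IllegalArgumentException("total must be >= 0, got " + total);
        }
        if (active > total) {
            throw new IllegalArgumentException("active must be <= total, got " + active);
        }
    }

    // Minimal stand-in for expectThrows: returns the exception so the caller can inspect it.
    static <T extends Throwable> T expectThrows(Class<T> type, Runnable code) {
        try {
            code.run();
        } catch (Throwable t) {
            if (type.isInstance(t)) {
                return type.cast(t);
            }
            throw new AssertionError("expected " + type.getSimpleName() + " but got " + t, t);
        }
        throw new AssertionError("expected " + type.getSimpleName() + " but nothing was thrown");
    }
}
```

Both invariants throw the same exception type, so only the message substring tells you which invariant actually fired. A type-only assertion would pass even if the wrong check rejected the input.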

Troubleshooting

Symptom	Likely cause	Fix
`error: cannot find symbol randomAlphaOfLength`	Not extending an OpenSearch base test class	Extend `OpenSearchTestCase` / `AbstractWireSerializingTestCase`
`licenseHeaders` precommit failure	Missing SPDX header	Paste the Apache-2.0 header block at the top
Test "passes" but you suspect it shouldn't	Assertions too weak	Sabotage the production code; confirm red; revert
`NamedWriteableRegistry` / "unknown writeable" in round-trip	Your type is a `NamedWriteable`	Override `getNamedWriteableRegistry()` to register the entry
Round-trip fails only on some seeds	Field order mismatch or version gating	Diff `writeTo` against the `StreamInput` ctor field-by-field
`Class not found by Gradle` / test never runs	Wrong source set or name	File must be `*Tests.java` under `src/test`, in the mirrored package

Expected Output

> Task :server:test
org.opensearch.cluster.metadata.ShardCountsTests > testSerialization PASSED
org.opensearch.cluster.metadata.ShardCountsTests > testEqualsAndHashCode PASSED
org.opensearch.cluster.metadata.ShardCountsTests > testRejectsNegativeTotal PASSED
org.opensearch.cluster.metadata.ShardCountsTests > testRejectsActiveGreaterThanTotal PASSED
org.opensearch.cluster.metadata.ShardCountsTests > testAccessorsReflectConstructorArgs PASSED

BUILD SUCCESSFUL in 41s

HTML report (open after a run):

open server/build/reports/tests/test/index.html      # macOS

Stretch Goals

Add the XContent round-trip. If your type is ToXContentObject, switch to AbstractSerializingTestCase, implement doParseInstance, and confirm toXContent→fromXContent is lossless. Add assertToXContentEquivalent for a hand-written JSON sample.
Add a BWC serialization assertion. Override the version range so the harness serializes against older Versions. If the class gates a field behind out.getVersion().onOrAfter(...), verify the round-trip across that boundary. See the serialization & BWC deep dive.
Find a second untested class and repeat — bundle several into one "increase test coverage" PR; maintainers love these.
Probe a real edge case. Pick a genuine corner (empty collection, Integer.MAX_VALUE) that the original author plausibly forgot and confirm the behavior is sane.
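For the BWC stretch goal, the shape of a version-gated field can be sketched with JDK streams. This is an assumption-laden illustration: `VersionGateSketch`, the integer `V_NEW`, and the `"total/indexName"` return format are hypothetical, and plain ints stand in for OpenSearch `Version` objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class VersionGateSketch {
    static final int V_NEW = 2;   // hypothetical version that introduced indexName

    // Writer: only emits indexName when the receiving side understands it.
    static byte[] write(int total, String indexName, int streamVersion) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(total);
            if (streamVersion >= V_NEW) {
                out.writeUTF(indexName);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader: must apply the same gate; older streams fall back to an empty default.
    static String read(byte[] wire, int streamVersion) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            int total = in.readInt();
            String indexName = streamVersion >= V_NEW ? in.readUTF() : "";
            return total + "/" + indexName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A round-trip at the new version preserves the field; a round-trip at the old version drops it to the default. A BWC assertion checks both sides of that boundary, and checks that writer and reader use the same gate.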

Validation / Self-check

Answer these before you call the lab done:

Show the grep you used to find an untested class. Why are small Writeable value types the best targets, and what classes did you deliberately avoid (and why)?
What does AbstractWireSerializingTestCase give you for free, and which three methods did you implement? What would break if your writeTo and StreamInput constructor disagreed on field order?
Why is expectThrows(IllegalArgumentException.class, ...) not enough on its own? What did you add?
In EqualsHashCodeTestUtils.checkEqualsAndHashCode, what is the mutator for, and what bug does it catch that a "copy is equal" check cannot?
Describe the sabotage you applied to prove your test has teeth, and which assertion caught it.
When would you reach for AbstractSerializingTestCase instead of AbstractWireSerializingTestCase?
Why must this stay an OpenSearchTestCase/serialization test rather than an OpenSearchIntegTestCase? (Hint: cost vs. what the change actually touches.)

When you can find a target, write assertions that fail on broken code, and round-trip a Writeable without a live node, move on to Lab 5.3: Build It — A Multi-Node Integration Test.

Lab 5.3: Build It — A Multi-Node Integration Test

Background

Unit tests prove logic; integration tests prove distributed behavior. The behavior that breaks in production — a primary dying, a replica getting promoted, shards reallocating, the cluster recovering to green — only shows up when you have multiple nodes and you take one away. This is what OpenSearchIntegTestCase and its in-JVM InternalTestCluster exist for.

In this Build-It lab you write a complete failover integration test from scratch: stand up a three-node cluster, index documents into an index with one replica, stop a random data node, and assert that OpenSearch promotes the replica and reallocates shards so the cluster returns to green with no data loss. You will do this deterministically — no Thread.sleep, only ensureGreen and assertBusy.

Note: This builds on Lab 5.1, which got you a multi-node cluster indexing data. Here we add the hard part: surviving a node death. The cluster manager (formerly master) is what drives replica promotion and reallocation under the hood — see the replication and shard allocation deep dives for the machinery you are testing.

Why This Lab Matters for Contributors

Failover/recovery is where the hardest OpenSearch bugs live. You cannot credibly fix one without a test that reproduces the failure scenario — and that test is exactly this shape.
Maintainers expect distributed changes to come with an *IT that runs under internalClusterTest. Knowing the idioms (internalCluster(), ensureGreen, assertBusy, disruption helpers) makes your PRs reviewable.
This is the difference between "it worked on my laptop" and "it survives a node going away under load" — the bar real clusters are held to.
Writing it teaches you determinism: why asserting on async state without polling produces flaky CI, the topic you will hunt in Lab 5.4.

Prerequisites

Lab 5.1 completed; you can run an *IT under internalClusterTest.
A clean build: ./gradlew :server:compileInternalClusterTestJava -q.
~6 GB free RAM (three in-JVM nodes plus the test runner).

# Confirm the integ-test source set and task exist.
ls server/src/internalClusterTest/java/org/opensearch | head
./gradlew :server:tasks --all | grep internalClusterTest

Step-by-Step Tasks

Step 1: Decide the topology and what "correct" means

You are testing the simplest non-trivial failover:

3 data nodes, 1 index, 1 primary + 1 replica per shard (number_of_replicas = 1)
        ┌─ node A: primary p0
        ├─ node B: replica r0   (eligible to be promoted)
        └─ node C: (spare capacity for reallocation)

Kill the node holding a copy → cluster must:
  1. promote the surviving copy to primary (if the primary died), AND
  2. allocate a NEW replica onto a remaining node, AND
  3. return to GREEN with the same document count (no data loss).

The invariants your test asserts:

Invariant	How you assert it
Cluster recovers to green	`ensureGreen(index)` after the node stop
No documents lost	search count equals what you indexed, via `assertBusy`
A replica was promoted / reallocated	health shows `active_shards == primaries + replicas`; `unassigned == 0`
It happened automatically	no manual reroute in the test
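The expected end state can be reasoned about with a toy model before writing the real test. The sketch below is not OpenSearch code: `FailoverModelSketch` models a single shard as a map from node name to role (`"P"` or `"R"`), and it assumes a new replica may only land on a node that holds no copy of that shard.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FailoverModelSketch {
    // One shard: primary on A, replica on B. Node C (if present) holds no copy.
    static Map<String, String> initial() {
        Map<String, String> copies = new LinkedHashMap<>();
        copies.put("A", "P");
        copies.put("B", "R");
        return copies;
    }

    // Returns the copy placement after `stopped` leaves the cluster.
    static Map<String, String> stopNode(Map<String, String> copies, List<String> liveNodes, String stopped) {
        Map<String, String> next = new LinkedHashMap<>(copies);
        String lostRole = next.remove(stopped);
        List<String> live = new ArrayList<>(liveNodes);
        live.remove(stopped);

        if ("P".equals(lostRole)) {
            // Action 1: promote the surviving replica to primary.
            for (Map.Entry<String, String> entry : next.entrySet()) {
                if ("R".equals(entry.getValue())) {
                    entry.setValue("P");
                    break;
                }
            }
        }
        if (lostRole != null) {
            // Action 2: allocate a new replica on a live node that holds no copy yet.
            for (String node : live) {
                if (!next.containsKey(node)) {
                    next.put(node, "R");
                    break;
                }
            }
        }
        return next;
    }

    // "Green" in this model: a primary and a replica are both assigned.
    static boolean isGreen(Map<String, String> copies) {
        return copies.containsValue("P") && copies.containsValue("R");
    }
}
```

With three nodes, stopping A leaves B as primary and C as the new replica, so the model is green. With only two nodes, stopping A leaves B as primary with nowhere to place a replica, so the model never returns to green.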

Step 2: Create the IT file (correct source set, correct suffix)

Integration tests are *IT.java under src/internalClusterTest — not src/test. Put it in the wrong place and Gradle silently never runs it.

mkdir -p server/src/internalClusterTest/java/org/opensearch/cluster/routing
$EDITOR server/src/internalClusterTest/java/org/opensearch/cluster/routing/ReplicaPromotionFailoverIT.java

Step 3: Write the test

Here is the complete, runnable test. Read every line — the comments explain why each call is there.

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

package org.opensearch.cluster.routing;

import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
import org.opensearch.action.index.IndexRequestBuilder;
import org.opensearch.cluster.health.ClusterHealthStatus;
import org.opensearch.common.settings.Settings;
import org.opensearch.test.OpenSearchIntegTestCase;
import org.opensearch.test.OpenSearchIntegTestCase.ClusterScope;
import org.opensearch.test.OpenSearchIntegTestCase.Scope;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import static org.opensearch.index.query.QueryBuilders.matchAllQuery;
import static org.hamcrest.Matchers.equalTo;

/**
 * Stops a data node and verifies OpenSearch promotes a replica and reallocates shards
 * back to GREEN with no data loss. A failover test — Scope.TEST so node death never
 * leaks into another method's cluster.
 */
@ClusterScope(scope = Scope.TEST, numDataNodes = 0)   // we start nodes ourselves, explicitly
public class ReplicaPromotionFailoverIT extends OpenSearchIntegTestCase {

    private static final String INDEX = "failover-idx";

    public void testReplicaIsPromotedAndClusterRecoversAfterDataNodeStop() throws Exception {
        // 1) A dedicated cluster manager + three data nodes. Dedicated CM keeps election
        //    independent of the data node we are about to kill.
        internalCluster().startClusterManagerOnlyNode();
        final List<String> dataNodes = new ArrayList<>(internalCluster().startDataOnlyNodes(3));
        assertEquals(3, dataNodes.size());

        // 2) An index with 2 primaries and 1 replica each → 4 shard copies across 3 data nodes.
        createIndex(
            INDEX,
            Settings.builder()
                .put("index.number_of_shards", 2)
                .put("index.number_of_replicas", 1)
                .build()
        );
        ensureGreen(INDEX);   // block until all 4 copies are STARTED before we touch anything

        // 3) Index a known number of docs and make them durable+visible.
        final int docCount = scaledRandomIntBetween(20, 100);   // randomized, but bounded
        indexDocs(docCount);
        refresh(INDEX);
        assertHitCount(docCount);   // sanity: all docs are searchable before the failure

        // 4) Inject the failure: stop a random DATA node (could be holding a primary or a replica).
        internalCluster().stopRandomDataNode();

        // 5) The cluster must self-heal. ensureGreen blocks until promotion + reallocation finish.
        //    We allow a generous timeout because reallocation copies a shard.
        ensureGreen(TimeValueMinutes(1), INDEX);

        // 6) Health invariants: no unassigned shards, all copies active again.
        ClusterHealthResponse health = client().admin().cluster()
            .prepareHealth(INDEX)
            .setWaitForGreenStatus()
            .get();
        assertThat(health.getStatus(), equalTo(ClusterHealthStatus.GREEN));
        assertThat(health.getUnassignedShards(), equalTo(0));
        assertThat(health.getActiveShards(), equalTo(4));   // 2 primaries + 2 replicas

        // 7) No data loss. Search is async w.r.t. shard relocation, so poll with assertBusy.
        assertBusy(() -> assertHitCount(docCount), 30, TimeUnit.SECONDS);
    }

    // ----- helpers -----

    private void indexDocs(int count) throws InterruptedException {
        final List<IndexRequestBuilder> builders = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            builders.add(
                client().prepareIndex(INDEX)
                    .setId(Integer.toString(i))
                    .setSource("field", randomAlphaOfLengthBetween(1, 20), "n", i)
            );
        }
        // indexRandom shuffles, batches, and refreshes for you — preferred over a manual loop.
        indexRandom(true, builders);
    }

    private void assertHitCount(long expected) {
        long actual = client().prepareSearch(INDEX)
            .setQuery(matchAllQuery())
            .setSize(0)
            .get()
            .getHits()
            .getTotalHits()
            .value();
        assertThat(actual, equalTo(expected));
    }

    private static org.opensearch.common.unit.TimeValue TimeValueMinutes(long m) {
        return org.opensearch.common.unit.TimeValue.timeValueMinutes(m);
    }
}

Note: indexRandom(true, builders) is the idiomatic way to index in integ tests: it randomizes order and batching (which surfaces ordering bugs) and refreshes at the end so docs are searchable. scaledRandomIntBetween keeps the count within the given bounds, biased toward the lower bound on ordinary runs and toward the upper bound on nightly runs or with a larger tests.multiplier, so everyday runs stay fast.

Step 4: Run it under the integration test task

# The whole class
./gradlew :server:internalClusterTest --tests "org.opensearch.cluster.routing.ReplicaPromotionFailoverIT"

# One method, verbose
./gradlew :server:internalClusterTest \
  --tests "*ReplicaPromotionFailoverIT.testReplicaIsPromotedAndClusterRecoversAfterDataNodeStop" -i

It is slower than a unit test (it builds and tears down a real cluster), typically 20–90 seconds.

Step 5: Prove it's deterministic — hammer it

A failover test that passes once but flakes under load is worse than none. Run it many times with different seeds to flush out timing assumptions:

./gradlew :server:internalClusterTest --tests "*ReplicaPromotionFailoverIT" -Dtests.iters=50

If any iteration fails, copy the REPRODUCE WITH ... -Dtests.seed=... line and debug that exact seed. A flaky failure here almost always means you asserted on async state without polling — replace a direct assertion with assertBusy or add the missing ensureGreen.
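What an iteration sweep buys you can be sketched without a cluster. The model below is deliberately artificial: `convergenceTicks` derives the "time to converge" from the seed with a simple modulo so the result can be checked by hand, whereas the real runner derives inputs and timing from a proper random source. All names are hypothetical.

```java
import java.util.OptionalLong;
import java.util.function.LongPredicate;

final class SeedSweepSketch {
    // Stand-in for seed-dependent timing: convergence takes 0..4 ticks.
    static int convergenceTicks(long seed) {
        return (int) (seed % 5);
    }

    // Flaky style: wait a fixed 3 ticks, then assert. Fails whenever convergence is slower.
    static boolean fixedWaitPasses(long seed) {
        return convergenceTicks(seed) <= 3;
    }

    // Deterministic style: poll each tick until converged, up to a generous limit.
    static boolean pollingPasses(long seed) {
        for (int tick = 0; tick <= 10; tick++) {
            if (tick >= convergenceTicks(seed)) {
                return true;
            }
        }
        return false;
    }

    // Runs the test once per seed and reports the first seed that fails, if any.
    static OptionalLong firstFailingSeed(LongPredicate test, long iterations) {
        for (long seed = 0; seed < iterations; seed++) {
            if (!test.test(seed)) {
                return OptionalLong.of(seed);
            }
        }
        return OptionalLong.empty();
    }
}
```

The fixed-wait version passes on seeds 0 through 3 and fails on seed 4, so a single run would likely look green. The sweep exposes the failing seed, and re-running that seed reproduces the failure every time. The polling version has no failing seed in the sweep.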

Step 6 (optional): Use a disruption helper instead of a hard stop

Stopping a node is the simplest failure. To test a network partition (the node is alive but unreachable — a different and nastier failure mode), use the disruption framework. Override the test plugins to mock the transport, then apply a partition:

import org.opensearch.plugins.Plugin;
import org.opensearch.test.disruption.NetworkDisruption;
import org.opensearch.test.transport.MockTransportService;

import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

@Override
protected Collection<Class<? extends Plugin>> nodePlugins() {
    return Collections.singletonList(MockTransportService.TestPlugin.class);
}

public void testPartitionThenHeal() throws Exception {
    // Keep the node names: we need them to build the two sides of the partition.
    final String clusterManager = internalCluster().startClusterManagerOnlyNode();
    final List<String> dataNodes = internalCluster().startDataOnlyNodes(3);
    createIndex(INDEX, Settings.builder()
        .put("index.number_of_shards", 1).put("index.number_of_replicas", 1).build());
    ensureGreen(INDEX);

    // Isolate one DATA node from the rest (never the cluster manager), hold it, then heal.
    final String victim = randomFrom(dataNodes);
    final Set<String> others = new HashSet<>(dataNodes);
    others.remove(victim);
    others.add(clusterManager);
    NetworkDisruption partition = new NetworkDisruption(
        new NetworkDisruption.TwoPartitions(Collections.singleton(victim), others),
        NetworkDisruption.DISCONNECT);
    internalCluster().setDisruptionScheme(partition);
    partition.startDisrupting();

    // While partitioned, the cluster manager evicts the isolated node and reallocates its shards.
    ensureStableCluster(/* expectedNodes */ 3, internalCluster().getClusterManagerName());

    partition.stopDisrupting();
    internalCluster().clearDisruptionScheme();
    ensureGreen(TimeValueMinutes(1), INDEX);
}

Warning: Disruption tests are powerful but easy to get wrong — partition the wrong set and you either deadlock or test nothing. Start with the hard-stop version (Step 3); reach for partitions only when the bug specifically involves a node being unreachable rather than dead.

Determinism: `ensureGreen` and `assertBusy`, never `Thread.sleep`

Everything after a node stop is asynchronous: the cluster manager detects the departure, computes a new cluster state (promotion + a new replica), publishes it, and shards recover. You do not know when that finishes — only that it will.

Anti-pattern	Why it flakes	Deterministic replacement
`Thread.sleep(2000)` then assert	Too short under load → fail; too long → always slow	`ensureGreen(index)` blocks only until the cluster is healthy
Assert hit count immediately after stop	Search may hit a relocating shard mid-flight	`assertBusy(() -> assertHitCount(n), 30, SECONDS)`
Assume the killed node held the primary	It might have held the replica; promotion may be a no-op	Assert the end state (green, 0 unassigned), not the path
`ensureGreen()` with the default short timeout during reallocation	Copying a shard can exceed it	Pass an explicit timeout: `ensureGreen(timeValueMinutes(1), index)`

ensureGreen polls cluster health until GREEN (all primaries and replicas STARTED) or it times out. assertBusy re-runs an assertion with backoff until it passes or times out. Together they let you assert on a converged state without guessing how long convergence takes. This is the determinism discipline that Lab 5.4 is entirely about.
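The polling idea behind assertBusy can be sketched in a few lines of plain Java. This is not the framework's implementation: `PollUntilSketch`, the 1 ms starting delay, and the 100 ms cap are assumptions chosen for illustration. The short sleeps inside the loop are only pauses between re-checks; the loop exits as soon as the condition holds.

```java
import java.util.concurrent.TimeUnit;

final class PollUntilSketch {
    // Re-runs `check` with exponential backoff until it stops throwing AssertionError
    // or the deadline passes. Returns the number of attempts it took.
    static int pollUntilPasses(Runnable check, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                check.run();
                return attempts;          // condition reached: stop waiting immediately
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e;              // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }
}
```

The difference from a bare sleep is what terminates the wait: here the condition ends it, and the timeout is only an upper bound. A fast machine returns after the first successful check; a slow one gets the full budget.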

Implementation Requirements / Deliverables

ReplicaPromotionFailoverIT.java in server/src/internalClusterTest/java/... with the SPDX header.
@ClusterScope(scope = Scope.TEST) (failure must not leak into other tests).
A 3-data-node cluster + dedicated cluster manager, index with number_of_replicas >= 1.
ensureGreen before indexing and after the node stop.
Documents indexed via indexRandom, with a count assertion before the failure.
internalCluster().stopRandomDataNode() (or a disruption scheme) as the injected failure.
Post-failure assertions: green status, unassignedShards == 0, correct activeShards, and a no-data-loss hit count via assertBusy.
Zero Thread.sleep calls anywhere in the test.
Survives -Dtests.iters=50.

Troubleshooting

Symptom	Likely cause	Fix
Test never runs / "0 tests"	File is in `src/test` or not named `*IT`	Move to `src/internalClusterTest`, name it `*IT.java`
`ensureGreen` times out after the stop	Only one data node is left, and it cannot host both copies of a shard (same-shard decider)	Start enough data nodes (3) so a new replica has a home
Hit count is short right after stop	Asserted before relocation settled	Wrap in `assertBusy(...)`; ensure `refresh` ran
Flaky under `-Dtests.iters`	An async assertion without polling	Replace direct asserts with `ensureGreen`/`assertBusy`
`NoNodeAvailableException`	You stopped the node the client was pinned to	Use `client()` (round-robins), not a node-pinned client
Cluster won't form	No cluster-manager-eligible node started	Call `internalCluster().startClusterManagerOnlyNode()` first
OOM / very slow	Too many docs or `Scope.SUITE` leaking nodes	Use `scaledRandomIntBetween`; keep `Scope.TEST`

Expected Output

> Task :server:internalClusterTest
org.opensearch.cluster.routing.ReplicaPromotionFailoverIT >
  testReplicaIsPromotedAndClusterRecoversAfterDataNodeStop PASSED

BUILD SUCCESSFUL in 1m 12s
1 test completed

Open the report to inspect captured node logs (where the promotion/reallocation messages are):

open server/build/reports/tests/internalClusterTest/index.html   # macOS

In the captured logs you should see the cluster manager log the node leaving, then shard-started messages as the replica is promoted and a new replica recovers.

Stretch Goals

Assert the promotion explicitly. Before the stop, record which node holds the primary for each shard via client().admin().cluster().prepareState().get().getState().routingTable(). Stop the node holding a primary on purpose (not a random one) and assert the former replica's node now holds the primary.
Test the no-replica case. Set number_of_replicas = 0, stop the node holding a shard, and assert the cluster goes red and the index loses data — proving why replicas matter. (Use ensureRed/a health check; this is a negative test.)
Add a disruption variant. Implement the testPartitionThenHeal from Step 6 and confirm the cluster ejects then re-admits the isolated node.
Index during the failure. Start a background indexing thread, stop a node mid-stream, and assert no acknowledged write was lost — closer to a real-world workload and a great bug finder.
Tie it to a real issue. Find an open flaky-test or recovery bug in github.com/opensearch-project/OpenSearch/issues whose repro is "stop a node," and shape your test to match it.

Validation / Self-check

Why is this an OpenSearchIntegTestCase and not an OpenSearchSingleNodeTestCase or a unit test? What specifically requires multiple real nodes?
Why @ClusterScope(scope = Scope.TEST) rather than the default Scope.SUITE for a test that kills a node?
After stopRandomDataNode(), exactly what does OpenSearch do to return to green? Name the two distinct actions and which component drives them.
Why must you call ensureGreen before indexing as well as after the stop?
Why is assertBusy(() -> assertHitCount(n), ...) correct but Thread.sleep(2000); assertHitCount(n) wrong? Give the two distinct failure modes of the sleep version.
You started 3 data nodes. What goes wrong if you start only 2, and which AllocationDecider explains it?
Run -Dtests.iters=50. Did any iteration fail? If so, what timing assumption did it expose and how did you make it deterministic?

When your failover test is green, deterministic across 50 iterations, and free of Thread.sleep, proceed to Lab 5.4: Fix It — Un-Mute a Flaky Test.

Lab 5.4: Fix It — Un-Mute a Flaky Test

Background

A flaky test passes sometimes and fails sometimes with no code change. In a randomized framework like OpenSearch's, flakiness is endemic: a test that races against asynchronous cluster state, refresh timing, or allocation will fail on some seeds and pass on others. OpenSearch's policy is firm — when a test flakes, you do not delete it and you do not @Ignore it. You mute it with @AwaitsFix(bugUrl=...) pointing at a tracking issue, file the issue with the flaky-test label, and then someone (you, in this lab) reproduces it deterministically, finds the race, fixes it, and un-mutes it.

This Fix-It lab gives you a realistic flaky *IT, muted with @AwaitsFix. You will reproduce the failure with a pinned seed and -Dtests.iters, identify the race (a missing ensureGreen, a refresh timing assumption, a Thread.sleep standing in for real synchronization), apply the minimal diff that makes it deterministic, and remove the @AwaitsFix.

Note: The systematic workflow for flaky tests — triage, reproduce, classify, fix, un-mute — is the subject of Issue Roadmap Stage 9: Flaky Tests. This lab is the hands-on counterpart. Read that stage for the policy; do this lab for the muscle memory.

Why This Lab Matters for Contributors

flaky-test issues are an explicit, well-labeled on-ramp for new contributors. They are real, visible, and maintainers are grateful for them — flaky CI is a tax on the whole project.
Fixing flakiness teaches you the asynchronous nature of OpenSearch better than any happy-path test: refresh, allocation, cluster-state publication, and recovery are all eventually-consistent.
The wrong fix (sprinkling Thread.sleep) is a trap that looks like it works and silently makes CI slower and still-flaky. Learning to spot and avoid it separates competent contributors from cargo-cult ones.
"Un-mute a flaky test" is a concrete, mergeable PR with a clear before/after: red → green, and one fewer @AwaitsFix in the tree.

Prerequisites

Lab 5.3 completed — you understand ensureGreen, assertBusy, and why they beat Thread.sleep.
You can run a scoped integ test and read its HTML report.
You can reproduce a randomized failure from a printed -Dtests.seed=... line.

./gradlew :server:internalClusterTest --tests "*SomeIT" -Dtests.iters=5   # sanity: the task works

The Muted Test

Here is the flaky test as you find it in the tree — already muted. It refreshes one node, then searches on (potentially) another node, asserting the doc is visible. The bug: refresh is per-shard and asynchronous across replicas, so a search routed to a not-yet-refreshed copy intermittently misses the document.

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

package org.opensearch.search.basic;

import org.opensearch.common.settings.Settings;
import org.opensearch.test.OpenSearchIntegTestCase;
import org.apache.lucene.tests.util.LuceneTestCase.AwaitsFix;

import static org.opensearch.index.query.QueryBuilders.matchQuery;
import static org.hamcrest.Matchers.equalTo;

public class SearchVisibilityAfterIndexIT extends OpenSearchIntegTestCase {

    @AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/14321")
    public void testIndexedDocumentIsImmediatelyVisible() throws Exception {
        internalCluster().startNodes(3);
        createIndex("vis", Settings.builder()
            .put("index.number_of_shards", 1)
            .put("index.number_of_replicas", 2)   // 3 copies of the one shard, one per node
            .build());

        // BUG 1: no ensureGreen — replicas may still be initializing when we index/search.

        client().prepareIndex("vis").setId("1").setSource("title", "opensearch").get();

        // BUG 2: refresh is async per-shard-copy; this does not guarantee all copies are refreshed.
        client().admin().indices().prepareRefresh("vis").get();

        // BUG 3: a single immediate search may be routed to a copy that hasn't refreshed yet.
        long hits = client().prepareSearch("vis")
            .setQuery(matchQuery("title", "opensearch"))
            .get().getHits().getTotalHits().value();

        assertThat(hits, equalTo(1L));
    }
}

Step-by-Step Tasks

Step 1: Reproduce the flakiness

A flaky test passes "usually," so a single run tells you nothing. First, temporarily remove the @AwaitsFix locally (do not commit this yet) so the runner will execute it, then iterate it many times:

# Run the method 50 times. With the bug present, at least one iteration should fail.
./gradlew :server:internalClusterTest \
  --tests "org.opensearch.search.basic.SearchVisibilityAfterIndexIT.testIndexedDocumentIsImmediatelyVisible" \
  -Dtests.iters=50

When an iteration fails, the runner prints the reproduce line. Copy it verbatim — the seed pins the race:

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' \
  --tests "org.opensearch.search.basic.SearchVisibilityAfterIndexIT.testIndexedDocumentIsImmediatelyVisible" \
  -Dtests.seed=7F0C19A4B2E5D660 -Dtests.locale=fr-FR -Dtests.timezone=UTC

Now you can reproduce it on demand:

./gradlew :server:internalClusterTest \
  --tests "*SearchVisibilityAfterIndexIT.testIndexedDocumentIsImmediatelyVisible" \
  -Dtests.seed=7F0C19A4B2E5D660 -Dtests.iters=10

Warning: If you cannot reproduce it locally even at -Dtests.iters=200, do not guess at a fix. Flaky tests are sometimes environment-sensitive (CPU count, disk speed). Raise iters, try -Dtests.jvms=1, or run under load (stress -c 8 in another shell) to bias the timing the way CI does. A fix you cannot validate is not a fix.

Step 2: Read the failure and classify the race

The assertion message will be something like:

java.lang.AssertionError:
Expected: <1L>
     but: was <0L>

Zero hits, not an exception. The doc was indexed and acknowledged, but a search missed it: a visibility race. Flaky OpenSearch tests fall into a small number of archetypes; learn to classify them quickly:

Archetype	Tell-tale	Root cause
Allocation/health race	`unassigned`/`yellow` when test assumes green; `NoShardAvailableActionException`	Missing `ensureGreen` before acting
Refresh/visibility race	search count short by a few, intermittent	Asserting before all copies refreshed; relying on `refresh_interval`
Async state race	a setting/mapping/template "not applied yet"	Reading state before cluster-state apply; needs `assertBusy`
`Thread.sleep` race	passes locally, flakes on slow CI	A sleep used where real synchronization is required
Order/seed dependence	only fails on specific seeds	Test assumes an ordering randomization doesn't guarantee
Leftover state	fails only after another test	`Scope.SUITE` pollution; missing cleanup

This one is a refresh/visibility race compounded by a missing ensureGreen.

Step 3: Confirm the race with TRACE logging

Prove your hypothesis instead of guessing. Turn on TRACE for the engine and shard packages and re-run the pinned seed; in the captured node logs (HTML report) you will see refresh events arrive on the three shard copies at different times, and the search hitting one copy before its refresh.

// Add this import at the top of the file:
import org.opensearch.test.junit.annotations.TestLogging;

// Add this annotation temporarily, directly above the test method:
@TestLogging(value = "org.opensearch.index.engine:TRACE,org.opensearch.index.shard:TRACE", reason = "diagnose visibility race")

./gradlew :server:internalClusterTest --tests "*SearchVisibilityAfterIndexIT*" \
  -Dtests.seed=7F0C19A4B2E5D660
open server/build/reports/tests/internalClusterTest/index.html

You will see, on the failing seed, the search executing against a replica whose refresh for the new segment has not yet completed — confirming visibility, not durability or allocation alone, is the issue. For the mechanics of why refresh is per-copy and asynchronous, see the refresh / flush / merge deep dive and the engine internals deep dive.
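The race described above can be reduced to a toy model. This sketch is not the engine: `VisibilityRaceSketch.Copy` is a hypothetical stand-in for one shard copy, with the simplifying assumption that a document becomes searchable on a copy only after that copy's own refresh.

```java
import java.util.ArrayList;
import java.util.List;

final class VisibilityRaceSketch {
    // One shard copy: indexed docs become searchable only after this copy refreshes.
    static final class Copy {
        private int indexed;
        private int searchable;

        void index() {
            indexed++;
        }

        void refresh() {
            searchable = indexed;
        }

        int search() {
            return searchable;
        }
    }

    // Three copies of one shard, each holding the same single indexed document.
    static List<Copy> threeCopiesWithOneDoc() {
        List<Copy> copies = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            Copy copy = new Copy();
            copy.index();
            copies.add(copy);
        }
        return copies;
    }
}
```

If only the first copy has refreshed, a search routed there finds the document while a search routed to either other copy finds nothing. Whether the test passes then depends on routing, which is the "zero hits, intermittently" signature seen in the failure.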

Step 4: Choose the correct fix (not a sleep)

There are three legitimate ways to make a just-indexed doc reliably visible. Choose based on what the test is actually trying to prove:

Fix	Mechanism	When to use
`setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE)` on the index request	Forces a refresh on all active copies before the index call returns	The cleanest fix when you index then immediately read
`ensureGreen` + explicit `refresh` + `assertBusy` on the search	Guarantees copies are started, refreshes, and polls until the count converges	When you need realistic timing but deterministic assertions
Use `flush`/`forceMerge`	Heavyweight (Lucene commit)	Almost never for a visibility test — overkill

And add the structural fix the test is missing regardless: ensureGreen before you touch the index, so you are not also racing allocation.

Warning: The tempting "fix" is Thread.sleep(1000) before the search. It will appear to work on your machine. It is wrong because: (1) on slow/loaded CI 1s is sometimes not enough → still flaky; (2) on fast machines it wastes a full second every run → slower CI for everyone; (3) it hides the real synchronization point, so the next person can't tell what is being waited for. Never paper over a race with a sleep. Wait for the condition, not for the clock.

Step 5: Apply the diff

Here is the minimal patch. It adds the missing ensureGreen, makes visibility deterministic with an immediate-refresh write, and wraps the assertion in assertBusy as a belt-and-suspenders against any residual relocation timing — and removes the @AwaitsFix.

--- a/server/src/internalClusterTest/java/org/opensearch/search/basic/SearchVisibilityAfterIndexIT.java
+++ b/server/src/internalClusterTest/java/org/opensearch/search/basic/SearchVisibilityAfterIndexIT.java
@@ -7,34 +7,39 @@
 package org.opensearch.search.basic;

 import org.opensearch.common.settings.Settings;
+import org.opensearch.action.support.WriteRequest;
 import org.opensearch.test.OpenSearchIntegTestCase;
-import org.apache.lucene.tests.util.LuceneTestCase.AwaitsFix;
+
+import java.util.concurrent.TimeUnit;

 import static org.opensearch.index.query.QueryBuilders.matchQuery;
 import static org.hamcrest.Matchers.equalTo;

 public class SearchVisibilityAfterIndexIT extends OpenSearchIntegTestCase {

-    @AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/14321")
     public void testIndexedDocumentIsImmediatelyVisible() throws Exception {
         internalCluster().startNodes(3);
         createIndex("vis", Settings.builder()
             .put("index.number_of_shards", 1)
             .put("index.number_of_replicas", 2)
             .build());

-        // BUG 1: no ensureGreen — replicas may still be initializing when we index/search.
+        // FIX 1: wait until all 3 copies are STARTED before indexing — removes the allocation race.
+        ensureGreen("vis");

-        client().prepareIndex("vis").setId("1").setSource("title", "opensearch").get();
+        // FIX 2: IMMEDIATE refreshes every active copy before the call returns — removes the
+        //        per-copy refresh-timing race deterministically (no clock involved).
+        client().prepareIndex("vis").setId("1").setSource("title", "opensearch")
+            .setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE)
+            .get();

-        // BUG 2: refresh is async per-shard-copy; this does not guarantee all copies are refreshed.
-        client().admin().indices().prepareRefresh("vis").get();
-
-        // BUG 3: a single immediate search may be routed to a copy that hasn't refreshed yet.
-        long hits = client().prepareSearch("vis")
-            .setQuery(matchQuery("title", "opensearch"))
-            .get().getHits().getTotalHits().value();
-
-        assertThat(hits, equalTo(1L));
+        // FIX 3: assert on the converged condition, not a single shot. With IMMEDIATE this passes
+        //        first try, but assertBusy keeps it robust against any residual relocation.
+        assertBusy(() -> {
+            long hits = client().prepareSearch("vis")
+                .setQuery(matchQuery("title", "opensearch"))
+                .get().getHits().getTotalHits().value();
+            assertThat(hits, equalTo(1L));
+        }, 30, TimeUnit.SECONDS);
     }
 }

Note: RefreshPolicy.IMMEDIATE is the right tool here because the test's whole point is "index then read immediately." If the test were instead about a realistic delayed read, you would keep a normal write and rely on assertBusy to poll until visible — never on a fixed sleep.

Step 6: Validate the fix is real

Re-run on the seed that used to fail, then hammer with iters. A real fix passes the previously failing seed every time and survives many iterations:

# 1) The exact seed that failed before must now pass repeatedly.
./gradlew :server:internalClusterTest --tests "*SearchVisibilityAfterIndexIT*" \
  -Dtests.seed=7F0C19A4B2E5D660 -Dtests.iters=20

# 2) Broad iteration with fresh random seeds.
./gradlew :server:internalClusterTest --tests "*SearchVisibilityAfterIndexIT*" -Dtests.iters=100

Both must be green. If iter 73 fails, you have not found the whole race — go back to Step 3.

Step 7: Un-mute, close the loop, and ship

Confirm the @AwaitsFix annotation and its now-unused import are gone (the diff above removes both).

Add a CHANGELOG.md entry:

### Fixed
- Fix and un-mute flaky `SearchVisibilityAfterIndexIT` (visibility race: missing `ensureGreen` + non-deterministic refresh) ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))

Run precommit and format:

./gradlew spotlessApply
./gradlew :server:precommit

Commit with DCO sign-off and reference the tracking issue so the bot links the PR and the issue auto-closes on merge:

git checkout -b flaky/fix-search-visibility-it
git add server/src/internalClusterTest CHANGELOG.md
git commit -s -m "Fix and un-mute flaky SearchVisibilityAfterIndexIT

The test raced refresh visibility across replica copies and lacked an
ensureGreen. Make visibility deterministic with RefreshPolicy.IMMEDIATE
and assertBusy; remove @AwaitsFix.

Fixes #14321"

Pitfalls of "Fixing" Flakiness with Sleeps

What people do	Why it's wrong	What to do instead
`Thread.sleep(500)` before the assert	Magic number; too short under load, too slow always	`assertBusy(() -> assert..., timeout, unit)`: wait for the condition
Bump the sleep until CI goes green	You've tuned to this CI's speed; next machine flakes	Remove the sleep; synchronize on the actual event
Add a retry loop with `Thread.sleep` inside	Reinvents `assertBusy`, badly, with no timeout	Use `assertBusy` (built-in backoff + timeout)
Narrow `random*` ranges so the race "can't" happen	Hides the bug the randomization found	Keep randomization; fix the synchronization
`@Ignore` the test	No tracking, rots forever, coverage silently lost	`@AwaitsFix(bugUrl=...)` only as a temporary mute, then fix
Increase `refresh_interval` to dodge timing	Changes behavior under test; masks the race	Use `IMMEDIATE` refresh policy or poll for visibility

The single rule: wait for a condition, never for the clock. ensureGreen, assertBusy, RefreshPolicy.IMMEDIATE, and ensureStableCluster are condition-waits. Thread.sleep is a clock-wait and is almost always the wrong answer in a test.
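
The rule can be illustrated without any OpenSearch dependency. The sketch below is a plain-JDK imitation of the idea behind assertBusy: re-run an assertion with short, growing pauses until it passes or a deadline expires. `BusyWait` and `awaitCondition` are hypothetical names chosen for this example, and the backoff values are assumptions; the real helper in the test framework differs in detail.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the idea behind assertBusy; not the OpenSearch implementation. */
final class BusyWait {

    /**
     * Re-runs {@code check} until it stops throwing AssertionError or the deadline passes.
     * Returns the number of attempts it took; rethrows the last failure on timeout.
     */
    static int awaitCondition(Runnable check, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long pauseMillis = 1; // assumed starting pause
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                check.run();
                return attempts; // condition holds: stop immediately, no time wasted
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // timed out: surface the real assertion message
                }
            }
            try {
                Thread.sleep(pauseMillis); // pause between polls, bounded by the deadline check
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", ie);
            }
            pauseMillis = Math.min(pauseMillis * 2, 100); // assumed cap on the pause
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger visibleDocs = new AtomicInteger(0);
        // Simulates a refresh that completes a little later on another thread.
        Thread refresher = new Thread(() -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            visibleDocs.set(1);
        });
        refresher.start();

        int attempts = awaitCondition(() -> {
            if (visibleDocs.get() != 1) {
                throw new AssertionError("expected 1 visible doc, saw " + visibleDocs.get());
            }
        }, 5, TimeUnit.SECONDS);
        refresher.join();
        System.out.println("condition met after " + attempts + " attempt(s)");
    }
}
```

The helper returns as soon as the condition holds, so a fast machine pays almost nothing and a slow one gets the full timeout. A fixed sleep gives neither property.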

Implementation Requirements / Deliverables

The flaky failure reproduced from a pinned -Dtests.seed=... (paste the reproduce line).
The race classified using the archetype table, with the TRACE-log evidence that confirms it.
A minimal diff that fixes the race using condition-waits (no Thread.sleep).
The @AwaitsFix annotation and its now-unused import removed.
Proof of fix: the previously-failing seed passes ≥20×, and ≥100 fresh iters are green.
./gradlew :server:precommit passes; a CHANGELOG.md "Fixed" entry; DCO-signed commit referencing the tracking issue.

Troubleshooting

Symptom	Likely cause	Fix
Can't reproduce locally even at high iters	Environment-sensitive timing; too fast a machine	Raise iters; run under CPU load; try `-Dtests.jvms=1`
Fix passes the old seed but a new seed flakes	A second race you haven't addressed	TRACE again on the new seed; classify and fix it too
Precommit fails on unused import	Left the `AwaitsFix` import after removing the annotation	Delete the import line
`assertBusy` itself times out	The condition genuinely never holds	It's not flakiness — it's a real bug; investigate production code
Reviewer says "this just hides it"	You used a sleep or narrowed randomness	Replace with a condition-wait; restore randomization

Expected Output

Before (muted):

> Task :server:internalClusterTest
org.opensearch.search.basic.SearchVisibilityAfterIndexIT >
  testIndexedDocumentIsImmediatelyVisible SKIPPED   (@AwaitsFix)

After (fixed and un-muted), across 100 iterations:

> Task :server:internalClusterTest
org.opensearch.search.basic.SearchVisibilityAfterIndexIT >
  testIndexedDocumentIsImmediatelyVisible PASSED

BUILD SUCCESSFUL in 2m 03s

Stretch Goals

Fix it the other way. Instead of RefreshPolicy.IMMEDIATE, keep a normal write, then refresh("vis") and poll the search with assertBusy. Compare run times and reason about which better reflects production behavior.
Find a real one. Search github.com/opensearch-project/OpenSearch/issues?q=label:flaky-test, pick an open issue with an AwaitsFix you can locate in the tree, reproduce it from the seed in the issue, and attempt a fix.
Write a regression guard. Add a comment in the test linking the fixed issue, so a future "optimization" that reintroduces the race is caught in review.
Build a flaky-test hunter. Wrap your CI invocation in a loop that runs a target with high iters and bisects to the first failing seed — the same technique maintainers use to triage the flaky-test backlog.

Validation / Self-check

Why does a single run of a flaky test tell you nothing? What two flags do you combine to reproduce it reliably, and what does each do?
Classify the race in SearchVisibilityAfterIndexIT using the archetype table. What evidence (from TRACE logs) confirmed your classification?
List the three legitimate fixes for a visibility race and when you'd choose each. Which did the diff use and why?
Give three concrete reasons Thread.sleep(1000) is the wrong fix here.
What is the difference between waiting for a condition and waiting for the clock? Name three OpenSearch condition-wait helpers.
Why must you remove both the @AwaitsFix annotation and its import before the PR is mergeable?
How do you prove a flaky fix is real rather than lucky? What two runs gate the PR?

When you can reproduce a flaky failure from its seed, classify the race, fix it with a condition-wait, and un-mute the test with proof it's deterministic, you have completed Level 5. Continue to Level 6: Indexing Path and Storage Engine, and revisit Issue Roadmap Stage 9 when you take a real flaky-test issue.

Level 6: Indexing Path and Storage Engine

You have learned how OpenSearch defends itself with tests (Level 5). Now you learn what it is defending: the write path — the journey of a document from an HTTP request to durable, searchable bytes on disk. This is the heart of OpenSearch as a storage engine, and it is where the project meets Apache Lucene.

A document does not simply "get saved." It is routed to a shard, applied to a primary, versioned and stamped with a sequence number, written to Lucene's in-memory buffer and appended to a write-ahead log (the translog), replicated to replicas, then — separately and asynchronously — made visible by a refresh and made durable as a Lucene commit by a flush, while background merges keep the segment count sane. Every one of those words is a class you can open, a setting you can tune, and a source of real bugs you can fix.

By the end of this level you can stand at any point on the write path and say which class is running, what invariant it maintains, and what would break if it were wrong.

Learning Objectives

By the end of Level 6 you must be able to:

Trace a POST /<index>/_doc and a _bulk request from the REST layer all the way to Lucene's IndexWriter and the translog, naming every hop and the class that owns it.
Explain the division of labor between IndicesService, IndexService, IndexShard, the Engine abstraction and its InternalEngine implementation, the Translog, and the mapping stack (MapperService/DocumentMapper).
Explain how versioning and sequence numbers (_seq_no, primary term) are assigned on the primary and replayed on replicas, and how local/global checkpoints track durability.
Distinguish refresh (visibility — opens a new searcher), flush (durability — Lucene commit + translog trim), and merge (segment housekeeping), and reason about the settings that control each.
Reason about durability under failure: what the translog guarantees, what index.translog.durability controls, and what gets replayed on recovery.
Map a parsed JSON document through the mapping/analysis pipeline into Lucene fields.

The Write Path, End to End

A single-document index and a bulk request share almost the whole path; bulk just batches at the coordinating and shard level. The canonical flow:

flowchart TD
    A["HTTP POST /idx/_doc or _bulk"] --> B["RestController → RestIndexAction / RestBulkAction"]
    B --> C["NodeClient.execute(...) → TransportBulkAction"]
    C -->|resolve routing, group by shard| D["TransportShardBulkAction<br/>(extends TransportReplicationAction)"]
    D -->|on the PRIMARY node| E["IndexShard.applyIndexOperationOnPrimary(...)"]
    E -->|parse + map| F["MapperService / DocumentMapper → ParsedDocument (Lucene fields)"]
    E -->|assign _seq_no, version, primary term| G["InternalEngine.index(Engine.Index)"]
    G --> H["Lucene IndexWriter.addDocument / updateDocument"]
    G --> I["Translog.add(operation)  (write-ahead log)"]
    D -->|replicate the same op| J["replica shards: applyIndexOperationOnReplica → InternalEngine.index"]
    H -.->|later, async| K["refresh: open new DirectoryReader → docs become searchable"]
    H -.->|later, async| L["flush: IndexWriter.commit() + trim translog (durable Lucene commit)"]
    H -.->|background| M["merge: MergePolicy/MergeScheduler combine segments"]

Read it as three phases:

Coordination (any node): REST parses the request; TransportBulkAction resolves index/routing, may create the index, applies pipelines, and groups operations by target shard.
Primary write (the node holding the primary): TransportShardBulkAction runs each op through IndexShard.applyIndexOperationOnPrimary, which parses + maps the source, then calls InternalEngine.index(...), which writes to Lucene's IndexWriter and appends to the Translog. Versioning and the sequence number are assigned here, on the primary.
Replication (replica nodes): TransportReplicationAction ships the same operation (with the already-assigned _seq_no/version/primary term) to each replica, which applies it via applyIndexOperationOnReplica. (With segment replication, replicas instead copy finished segments from the primary — see replication.)

Then, decoupled from the request, three background processes govern visibility, durability, and storage shape:

Process	Trigger	What it does	Deep dive
Refresh	`index.refresh_interval` (default 1s) or `_refresh`	Opens a new Lucene searcher so buffered docs become searchable. Cheap-ish; does not fsync.	refresh/flush/merge
Flush	translog size/age, `_flush`, or shard close	`IndexWriter.commit()` (fsync segments) + start a fresh translog generation. Makes the commit point durable.	engine internals
Merge	segment count/size policy	`MergePolicy` picks segments; `MergeScheduler` merges them, reclaiming deletes.	refresh/flush/merge

Note: Refresh ≠ flush. Refresh makes docs visible; flush makes them durable as a Lucene commit. Between refreshes, a newly indexed doc sits in the indexing buffer and is not yet searchable, but it is already durable via the translog (fsync'd per request by default). This separation of visibility from durability is the single most important mental model in this level.
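
The visibility/durability split can be made concrete with a toy model. The sketch below is plain Java with no Lucene or OpenSearch types; `ToyShard` and all of its fields are illustrative names, and the model deliberately ignores segments, merges, and replicas. It only shows which operation moves data between the buffer, the searchable view, the last commit, and the translog.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one shard copy; names are illustrative, not OpenSearch classes. */
final class ToyShard {
    private final List<String> translog = new ArrayList<>();    // write-ahead log, assumed fsync'd per request
    private final List<String> indexBuffer = new ArrayList<>(); // in memory, not yet searchable
    private final List<String> searchable = new ArrayList<>();  // what the current searcher sees
    private final List<String> committed = new ArrayList<>();   // last commit point on disk

    /** Index: the doc goes to the in-memory buffer and to the translog. */
    void index(String doc) {
        indexBuffer.add(doc);
        translog.add(doc);
    }

    /** Refresh: buffered docs become searchable. Nothing is committed. */
    void refresh() {
        searchable.addAll(indexBuffer);
        indexBuffer.clear();
    }

    /** Flush: everything indexed so far becomes the commit, so the translog can be trimmed. */
    void flush() {
        refresh();
        committed.clear();
        committed.addAll(searchable);
        translog.clear();
    }

    /** Crash and restart: memory is lost; state is rebuilt from the last commit plus translog replay. */
    void crashAndRecover() {
        indexBuffer.clear();
        searchable.clear();
        searchable.addAll(committed);
        searchable.addAll(translog);
    }

    int hits() {
        return searchable.size();
    }

    int translogOps() {
        return translog.size();
    }

    public static void main(String[] args) {
        ToyShard shard = new ToyShard();
        shard.index("doc-1");
        System.out.println("after index:   hits=" + shard.hits() + " translogOps=" + shard.translogOps()); // hits=0 translogOps=1
        shard.refresh();
        System.out.println("after refresh: hits=" + shard.hits() + " translogOps=" + shard.translogOps()); // hits=1 translogOps=1
        shard.crashAndRecover();
        System.out.println("after crash:   hits=" + shard.hits() + " translogOps=" + shard.translogOps()); // hits=1 translogOps=1
        shard.flush();
        System.out.println("after flush:   hits=" + shard.hits() + " translogOps=" + shard.translogOps()); // hits=1 translogOps=0
    }
}
```

Note the third line of output: the doc survives a crash even though no flush ever ran, because recovery replays the translog on top of the last commit.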

Mappings and Analysis: From JSON to Lucene Fields

Before a document reaches the engine, its JSON source is turned into Lucene fields. The MapperService holds the index's mapping; the DocumentMapper parses the source into a ParsedDocument (a set of Lucene IndexableFields) according to each field's Mapper. Text fields run through an analyzer (a Tokenizer + TokenFilter chain) into indexed terms; keyword/numeric fields are indexed verbatim and as DocValues. Dynamic mapping infers types for unseen fields and triggers a cluster-state mapping update.

JSON source ──DocumentMapper──► fields:
  "title": "Fast Search"  ──(text, standard analyzer)──► terms: [fast, search]  (no DocValues; fielddata is opt-in)
  "views": 42             ──(long)──────────────────────► points + DocValues
  "tag":   "prod"         ──(keyword)────────────────────► term: [prod]         + DocValues

Analysis is the subject of Lab 6.3 (you build a custom analyzer plugin) and the mapping & analysis deep dive.
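
To make the tokenizer-plus-filter chain tangible before Lab 6.3, here is a small plain-Java sketch. It is not Lucene code and not the standard analyzer: `ToyAnalyzer` is a hypothetical class that splits on non-alphanumeric characters and lowercases, which is enough to reproduce the `title` and `tag` rows of the diagram above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for an analysis chain; real analysis uses Lucene Tokenizer/TokenFilter classes. */
final class ToyAnalyzer {

    /** Tokenizer step: split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Token filter step: lowercase every token. */
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** A text field runs the full chain. */
    static List<String> analyzeText(String value) {
        return lowercase(tokenize(value));
    }

    /** A keyword field is indexed verbatim as a single term. */
    static List<String> analyzeKeyword(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        System.out.println(analyzeText("Fast Search")); // [fast, search]
        System.out.println(analyzeKeyword("prod"));     // [prod]
    }
}
```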

Sequence Numbers and Checkpoints

Every write on a primary is stamped with two coordinates that make replication and recovery correct:

Concept	What it is	Owner
`_seq_no`	A monotonically increasing per-shard sequence number assigned on the primary	`LocalCheckpointTracker` (via `InternalEngine`)
primary term	A counter incremented each time a new primary is elected; disambiguates ops from different primaries	`ReplicationTracker` / cluster state
local checkpoint	Highest `_seq_no` such that this copy has processed every op at or below it	`LocalCheckpointTracker`
global checkpoint	Highest `_seq_no` such that all in-sync copies have every op at or below it; the durable, replicated watermark	`ReplicationTracker`
`_version`	Per-document version counter, incremented on each write to that `_id`; optimistic concurrency is expressed with `if_seq_no`/`if_primary_term`	`InternalEngine` versioning

The global checkpoint is what lets recovery be cheap: a recovering replica only needs operations after the global checkpoint (replayed from the translog), not the whole shard. This is the backbone of peer recovery. When you read _seq_no assignment in Lab 6.1, you are reading the code that makes all of this work.

# See the seqno/checkpoint plumbing for yourself.
grep -rn "LocalCheckpointTracker\|globalCheckpoint\|primaryTerm\|generateSeqNo\|SequenceNumbers" \
  server/src/main/java/org/opensearch/index/seqno \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head -30
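
The checkpoint definitions are easier to hold onto with a worked example. The sketch below is a simplified, plain-Java illustration: `CheckpointSketch` is a hypothetical class, not the real `LocalCheckpointTracker`, and it computes the global checkpoint as the minimum local checkpoint across in-sync copies, which is the core idea rather than the full production logic.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified illustration of local/global checkpoints; not the OpenSearch implementation. */
final class CheckpointSketch {
    private final Set<Long> processedAhead = new HashSet<>(); // ops processed above the checkpoint
    private long localCheckpoint = -1;                        // -1 here means "no ops processed yet"

    /** Records an op and advances the checkpoint across any now-contiguous sequence numbers. */
    void markProcessed(long seqNo) {
        processedAhead.add(seqNo);
        while (processedAhead.remove(localCheckpoint + 1)) {
            localCheckpoint++;
        }
    }

    long localCheckpoint() {
        return localCheckpoint;
    }

    /** The global checkpoint can be no higher than the slowest in-sync copy. */
    static long globalCheckpoint(long... inSyncLocalCheckpoints) {
        if (inSyncLocalCheckpoints.length == 0) {
            throw new IllegalArgumentException("need at least one in-sync copy");
        }
        long min = Long.MAX_VALUE;
        for (long checkpoint : inSyncLocalCheckpoints) {
            min = Math.min(min, checkpoint);
        }
        return min;
    }

    public static void main(String[] args) {
        CheckpointSketch primary = new CheckpointSketch();
        CheckpointSketch replica = new CheckpointSketch();
        for (long seqNo = 0; seqNo <= 3; seqNo++) {
            primary.markProcessed(seqNo);
        }
        // The replica has received 0, 1 and 3; op 2 is still in flight.
        replica.markProcessed(0);
        replica.markProcessed(1);
        replica.markProcessed(3);

        System.out.println("replica local = " + replica.localCheckpoint()); // 1 (gap at 2)
        System.out.println("global        = "
            + globalCheckpoint(primary.localCheckpoint(), replica.localCheckpoint())); // 1

        replica.markProcessed(2); // the gap closes
        System.out.println("replica local = " + replica.localCheckpoint()); // 3
        System.out.println("global        = "
            + globalCheckpoint(primary.localCheckpoint(), replica.localCheckpoint())); // 3
    }
}
```

The gap at sequence number 2 is what holds the replica's local checkpoint at 1 even though it has already processed op 3; a recovering copy would need everything above the global checkpoint.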

Key Classes

Class	Location	Role
`IndicesService`	`server/.../indices/IndicesService.java`	Owns all `IndexService`s on the node; creates/closes indices; the node-level entry point to shards
`IndexService`	`server/.../index/IndexService.java`	One per index per node; owns that index's `IndexShard`s, `MapperService`, `IndexSettings`, analysis registry
`IndexShard`	`server/.../index/shard/IndexShard.java`	One shard copy; the operational unit. `applyIndexOperationOnPrimary/OnReplica`, refresh, flush, recovery, the `Engine`
`Engine` (abstract)	`server/.../index/engine/Engine.java`	The storage-engine contract: `index`, `delete`, `get`, `refresh`, `flush`, `acquireSearcher`
`InternalEngine`	`server/.../index/engine/InternalEngine.java`	The default engine: wraps Lucene `IndexWriter`/`SearcherManager` + the `Translog`; assigns seqno/version
`Translog`	`server/.../index/translog/Translog.java`	The write-ahead log: `add`, `sync`, generations, replay on recovery
`MapperService`	`server/.../index/mapper/MapperService.java`	Holds the index mapping; resolves field `Mapper`s and analyzers
`DocumentMapper`	`server/.../index/mapper/DocumentMapper.java`	Parses a JSON source into a `ParsedDocument` of Lucene fields

# Confirm the locations on your branch (names are stable; paths may shift slightly).
for c in IndicesService IndexService index/shard/IndexShard index/engine/Engine \
         index/engine/InternalEngine index/translog/Translog index/mapper/MapperService \
         index/mapper/DocumentMapper; do
  find server/src/main/java -path "*org/opensearch/${c}.java" -o -name "$(basename $c).java" -path "*org/opensearch/*" 2>/dev/null
done | sort -u

The Labs

Lab	Title	Type
6.1	Trace a Document Index Request to Lucene	Timed code-reading trace
6.2	IndexShard, InternalEngine, and the Translog	Deep hands-on (curl + reading)
6.3	Build It — A Custom Analyzer	Build-it plugin

These build on each other: 6.1 maps the territory, 6.2 makes you operate the engine from the outside and reason about translog/refresh/flush, and 6.3 has you extend the analysis pipeline with a real plugin.

Deliverables

You must demonstrate all of the following before advancing to Level 7:

A written reading-log artifact tracing _doc and _bulk from REST to IndexWriter.addDocument and Translog.add, naming each class and where _seq_no/version are assigned (Lab 6.1).
Empirical observations, via curl, of refresh changing visibility, flush trimming the translog, and index.translog.durability changing fsync behavior — mapped back to the source (Lab 6.2).
A working custom analyzer plugin: a Plugin that implements AnalysisPlugin, a token filter factory, and a Lucene TokenFilter; built with opensearch.opensearchplugin, installed, and verified with _analyze and an OpenSearchTokenStreamTestCase unit test (Lab 6.3).
From memory: the difference between refresh, flush, and merge; what the global checkpoint enables; and why the translog exists when Lucene already persists segments on commit.

Common Mistakes

Mistake	Consequence	Fix
Conflating refresh and flush	You expect `_refresh` to make data durable (it doesn't)	Refresh = visibility; flush = Lucene commit (durability)
Thinking the translog is for search	You look for query data in it	Translog is a write-ahead log for recovery, not a read path
Assuming `_seq_no` is assigned on replicas	You misread replication	Primary assigns it; replicas replay the same value
Setting `refresh_interval` to a tiny value in prod	Refresh thrash, segment explosion, merge pressure	Default 1s; raise it for write-heavy ingest
Reading `InternalEngine` before `Engine`	You miss the abstraction boundary	Read the `Engine` contract first, then the impl
Forgetting analysis happens before the engine	You look for tokenization in `InternalEngine`	Tokenization is in the mapper/analysis layer
Treating `_bulk` as a totally separate path	Duplicated mental model	It's the single-doc path, batched per shard

How to Verify Success

# 1) You can locate every hop on the write path from a single grep chain.
grep -rn "applyIndexOperationOnPrimary" server/src/main/java/org/opensearch/index/shard/IndexShard.java
grep -rn "public IndexResult index" server/src/main/java/org/opensearch/index/engine/InternalEngine.java
grep -rn "addDocument\|updateDocument\|translog.add\|getTranslog().add" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java

# 2) You can drive visibility vs durability from the outside (Lab 6.2).
curl -s -XPUT  'localhost:9200/demo?pretty'
curl -s -XPOST 'localhost:9200/demo/_doc?pretty' -H 'Content-Type: application/json' -d '{"t":"hello"}'
curl -s 'localhost:9200/demo/_search?pretty'            # likely 0 hits before refresh
curl -s -XPOST 'localhost:9200/demo/_refresh?pretty'
curl -s 'localhost:9200/demo/_search?pretty'            # now 1 hit
curl -s 'localhost:9200/_cat/segments/demo?v'

When you can trace a document to Lucene, explain seqno/checkpoints, drive refresh/flush/merge from curl, and have built a working analysis plugin, you are ready for the search side in Level 7. For the underlying theory, lean on the engine internals, translog, refresh/flush/merge, mapping & analysis, and index-shard lifecycle deep dives.

Lab 6.1: Trace a Document Index Request to Lucene

Background

You cannot fix the write path until you can see it. In this lab you follow a single POST /<index>/_doc request — and then a _bulk request — from the HTTP socket all the way down to Lucene's IndexWriter.addDocument(...) and the translog's Translog.add(...), naming every class on the way. This is a code-reading trace, the most important skill in the entire curriculum: a senior contributor can open an unfamiliar request path and narrate it hop by hop using nothing but grep and the IDE's "go to definition."

The output of this lab is a reading-log artifact — a written, hop-by-hop map of the path with the exact file, class, and method at each step, plus the two spots where versioning and the sequence number are assigned. You will reuse this map for the rest of Level 6 and whenever you touch indexing.

Note: This lab is timed (see Step 0). The point is not to read everything — it is to learn to follow a path efficiently and stop at the right boundaries. Pair it with the engine internals and translog deep dives, and the Level 6 overview diagram.

Why This Lab Matters for Contributors

Every indexing bug, performance issue, or feature request lands somewhere on this path. Knowing the path turns "I have no idea where to start" into "this is a TransportShardBulkAction problem."
Maintainers describe bugs by class (IndexShard.applyIndexOperationOnPrimary). You must be able to open that method and orient yourself in seconds.
The trace reveals where cross-cutting concerns live: routing, mapping, versioning, seqno assignment, replication, and the translog. Each is a future contribution area.
A reading log is reusable: you will paste hops from it into PR descriptions and issue comments to show you understand the code you're changing.

Prerequisites

A running node from source so you can correlate code with live behavior:
```
./gradlew run            # single-node cluster, REST on :9200
```
The repo open in an IDE with working "go to definition" on org.opensearch.*.
Familiarity with the REST layer deep dive and the action framework deep dive — those cover the upper hops in depth; here we move quickly through them to reach the engine.

Step-by-Step Tasks

Step 0: Set a timer and prepare the reading log

Give yourself 75 minutes. Create a scratch file and fill one row per hop as you go. The format:

HOP | FILE | CLASS#method | WHAT HAPPENS HERE | NOTE/QUESTION
----+------+--------------+-------------------+--------------

Resist the urge to read method bodies in full. For each hop, answer only: what does this hand off, and to whom? You are drawing a map, not auditing the code.

Step 1: Send the requests and watch them land

In one terminal, tail the node; in another, send the requests:

# Make the path observable: TRACE the bulk/shard action + engine for one request.
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "logger.org.opensearch.action.bulk": "TRACE",
    "logger.org.opensearch.index.shard": "TRACE",
    "logger.org.opensearch.index.engine": "TRACE",
    "logger.org.opensearch.index.translog": "TRACE"
  }
}'

# Single document
curl -s -XPOST 'localhost:9200/trace/_doc?pretty' -H 'Content-Type: application/json' \
  -d '{"title":"trace me","n":1}'

# Bulk
curl -s -XPOST 'localhost:9200/_bulk?pretty' -H 'Content-Type: application/x-ndjson' --data-binary $'
{"index":{"_index":"trace","_id":"b1"}}
{"title":"bulk one","n":2}
{"index":{"_index":"trace","_id":"b2"}}
{"title":"bulk two","n":3}
'

# Turn TRACE back off when done.
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "logger.org.opensearch.action.bulk": null, "logger.org.opensearch.index.shard": null,
                 "logger.org.opensearch.index.engine": null, "logger.org.opensearch.index.translog": null }
}'

The TRACE lines in the node's stdout name the classes you are about to read — use them to confirm your map matches reality.

Step 2: Hop 1 — REST handler

Both _doc and _bulk enter through a RestHandler registered in ActionModule. Find them:

grep -rn "class RestIndexAction"  server/src/main/java/org/opensearch/rest/action/document/
grep -rn "class RestBulkAction"   server/src/main/java/org/opensearch/rest/action/document/
# How each routes the parsed request into the transport layer:
grep -n "client.execute\|client.bulk\|prepareRequest\|BulkRequest\|IndexRequest" \
  server/src/main/java/org/opensearch/rest/action/document/RestBulkAction.java

Record: RestBulkAction parses the HTTP body into a BulkRequest and hands it to the client, which amounts to NodeClient.execute(BulkAction.INSTANCE, request, listener). RestIndexAction parses the body into an IndexRequest and hands that to the client; the transport layer then wraps the single request into a one-item bulk, so both paths converge on TransportBulkAction. Confirm the exact calls on your branch with the grep above. Note that convergence; it's the key insight that collapses two paths into one.

Step 3: Hop 2 — coordinating action (`TransportBulkAction`)

find server/src/main/java -name "TransportBulkAction.java"
grep -n "doExecute\|doInternalExecute\|executeBulk\|createIndex\|resolveRouting\|BulkShardRequest\|groupRequestsByShards\|ingest\|pipeline" \
  server/src/main/java/org/opensearch/action/bulk/TransportBulkAction.java

Record what happens on the coordinating node:

auto-create the index if missing (a cluster-state update),
run ingest pipelines if any,
resolve routing → which shard each op targets,
group operations by shard into BulkShardRequests,
dispatch each BulkShardRequest to the shard's primary via TransportShardBulkAction.

This is the "fan-out by shard" hop. A single _doc produces exactly one BulkShardRequest.
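
The "group by shard" step can be sketched in a few lines of plain Java. This is an illustration only: `ShardGrouping` is a hypothetical class, and `String.hashCode()` is a stand-in for the routing hash (the real routing function is different, so the shard numbers below will not match a live cluster). What carries over is the shape: one routing decision per op, one batch per target shard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative fan-out by shard; the hash is a stand-in, not the real routing function. */
final class ShardGrouping {

    /** Maps a routing value (by default the doc id) to a shard number. */
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }

    /** Groups doc ids into one batch per target shard, like one BulkShardRequest per shard. */
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numberOfShards) {
        Map<Integer, List<String>> perShard = new TreeMap<>();
        for (String id : docIds) {
            perShard.computeIfAbsent(shardFor(id, numberOfShards), shard -> new ArrayList<>()).add(id);
        }
        return perShard;
    }

    public static void main(String[] args) {
        // A bulk of four ops against a 3-shard index fans out into at most three batches.
        System.out.println(groupByShard(List.of("b1", "b2", "b3", "b4"), 3));
        // A single-document request always yields exactly one batch.
        System.out.println(groupByShard(List.of("only-one"), 3));
    }
}
```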

Step 4: Hop 3 — the primary write (`TransportShardBulkAction`)

find server/src/main/java -name "TransportShardBulkAction.java"
grep -n "extends TransportWriteAction\|extends TransportReplicationAction\|shardOperationOnPrimary\|executeBulkItemRequest\|applyIndexOperationOnPrimary\|dispatchedShardOperationOnPrimary" \
  server/src/main/java/org/opensearch/action/bulk/TransportShardBulkAction.java

TransportShardBulkAction extends TransportWriteAction extends TransportReplicationAction — that inheritance is the primary→replica machinery. On the primary it calls (per item) IndexShard.applyIndexOperationOnPrimary(...). Record the method name that bridges into the shard.

Note: TransportReplicationAction is the reusable base for all write actions (index, delete, bulk). It handles routing to the primary, executing on the primary, then replicating to replicas and waiting for the required number of copies. See the replication deep dive.

Step 5: Hop 4 — the shard (`IndexShard.applyIndexOperationOnPrimary`)

grep -n "applyIndexOperationOnPrimary\|applyIndexOperationOnReplica\|applyIndexOperation\|prepareIndex\|docMapper\|MapperService\|markSeqNoAsNoop\|Engine.Index\|getEngine().index" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java

This is the hinge of the whole path. Record, in order:

The source is parsed and mapped here: prepareIndex(...) uses the DocumentMapper to build a ParsedDocument (Lucene fields). (Trace into MapperService/DocumentMapper only briefly — that's Lab 6.3's territory.)
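An Engine.Index operation is constructed wrapping the parsed doc, the version, the version type, the operation primary term, and the _seq_no. On the primary the _seq_no is still unassigned at this point (look for UNASSIGNED_SEQ_NO); the engine fills it in at the next hop. On a replica the operation already carries the value the primary chose.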
It calls getEngine().index(engineIndex) → InternalEngine.index(...).

Note the two assignment sites you must pin down precisely:

# Where versioning + seqno are resolved on the primary vs replayed on the replica.
grep -n "primaryTerm\|seqNo\|SequenceNumbers\|UNASSIGNED_SEQ_NO\|versionType\|resolveDocVersion\|getOperationPrimaryTerm" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java | head

Step 6: Hop 5 — the engine (`InternalEngine.index`)

This is the bottom of the path that you own as a reader. Open it:

grep -n "public IndexResult index(Index index)\|private IndexResult indexIntoLucene\|planIndexingAsPrimary\|planIndexingAsNonPrimary\|generateSeqNoForOperationOnPrimary\|addDocs\|updateDocs\|addDocuments\|translog\.\|getTranslog().add\|markSeqNoAsCompleted\|versionMap" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head -40

Record the engine's decision flow:

index(Index) builds an indexing plan (IndexingStrategy): is this a new doc (addDocument) or an update (updateDocument, which deletes-then-adds)? Does version conflict? Is this primary or replica?
On the primary, generateSeqNoForOperationOnPrimary(...) assigns the _seq_no (via the LocalCheckpointTracker). On a replica, the _seq_no arrives with the operation and is not regenerated — confirm this in the non-primary plan.
indexIntoLucene(...) calls Lucene IndexWriter.addDocument(...) (new) or updateDocument(...) (update — a delete + add keyed on _id).
Translog.add(new Translog.Index(...)) appends the op to the write-ahead log.
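The op's _seq_no is then marked in the LocalCheckpointTracker (look for a markSeqNoAs... call; the exact method name varies by branch), which lets the local checkpoint advance once there are no gaps below it.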
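
The decision flow above can be condensed into a toy engine. The sketch below is plain Java under stated assumptions: `ToyEngine`, `Result`, and the method names are hypothetical, the version map is a simple `HashMap`, and conflict checks are left out. It shows only the two facts worth memorizing: the primary generates the sequence number, and a replica reuses the one it was sent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy engine: add-vs-update planning plus seq-no assignment. Not InternalEngine. */
final class ToyEngine {

    /** What the engine reports back for one indexing operation. */
    static final class Result {
        final String result;
        final long version;
        final long seqNo;

        Result(String result, long version, long seqNo) {
            this.result = result;
            this.version = version;
            this.seqNo = seqNo;
        }

        @Override
        public String toString() {
            return result + " version=" + version + " seqNo=" + seqNo;
        }
    }

    private final Map<String, Long> versionMap = new HashMap<>(); // _id -> current version
    private final List<String> translog = new ArrayList<>();      // appended after every op
    private long nextSeqNo = 0;

    /** Primary: the engine generates the next sequence number and version. */
    Result indexOnPrimary(String id) {
        long seqNo = nextSeqNo++;
        Long current = versionMap.get(id);
        boolean isNew = current == null;
        long version = isNew ? 1 : current + 1;
        versionMap.put(id, version);
        translog.add((isNew ? "add " : "update ") + id + " seqNo=" + seqNo);
        return new Result(isNew ? "created" : "updated", version, seqNo);
    }

    /** Replica: version and sequence number arrive with the op and are applied as-is. */
    Result applyOnReplica(String id, Result fromPrimary) {
        versionMap.put(id, fromPrimary.version);
        nextSeqNo = Math.max(nextSeqNo, fromPrimary.seqNo + 1);
        translog.add("replicated " + id + " seqNo=" + fromPrimary.seqNo);
        return fromPrimary;
    }

    List<String> translog() {
        return translog;
    }

    public static void main(String[] args) {
        ToyEngine primary = new ToyEngine();
        ToyEngine replica = new ToyEngine();

        Result first = primary.indexOnPrimary("fixed");
        Result second = primary.indexOnPrimary("fixed");
        System.out.println(first);  // created version=1 seqNo=0
        System.out.println(second); // updated version=2 seqNo=1

        replica.applyOnReplica("fixed", first);
        System.out.println(replica.applyOnReplica("fixed", second)); // updated version=2 seqNo=1
        System.out.println(replica.translog());
    }
}
```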

Step 7: Hop 6 — Lucene and the translog (the two destinations)

The path ends in two places, written for two different reasons:

# Lucene write (visibility-on-refresh, durability-on-commit):
grep -n "indexWriter.addDocument\|indexWriter.updateDocument\|indexWriter.softUpdateDocument\|addDocuments\|updateDocuments" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head

# Translog write (durability-now, replay-on-recovery):
find server/src/main/java -name "Translog.java"
grep -n "public Location add(Operation operation)\|class Index\|void sync\|ensureSynced" \
  server/src/main/java/org/opensearch/index/translog/Translog.java | head

Record the why for each destination:

Destination	Why	Made durable by	Made visible by
`IndexWriter` (Lucene)	The searchable index	`flush` → `IndexWriter.commit()` (fsync)	`refresh` → new `DirectoryReader`
`Translog`	Crash recovery before the next commit	`Translog.sync()` (fsync, per request by default)	never — it is not a read path

This table is the answer to the most common confusion in the level: "if Lucene persists on commit, why a translog?" Because commits are expensive and infrequent; the translog makes every acknowledged write durable between commits, and is replayed on recovery. See the translog deep dive.
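
A minimal sketch of why both destinations exist, assuming a per-request fsync of the translog. Everything here is a toy: the three lists stand in for the Lucene commit point, the IndexWriter's uncommitted buffer, and the synced translog, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one shard's two write destinations. Hypothetical names; not OpenSearch code. */
class ToyShardStore {
    private final List<String> committed = new ArrayList<>(); // Lucene commit point: survives a crash
    private final List<String> buffered = new ArrayList<>();  // uncommitted Lucene state: lost on a crash
    private final List<String> translog = new ArrayList<>();  // fsync'd per request: survives a crash

    /** One acknowledged write goes to both destinations. */
    void index(String doc) {
        buffered.add(doc);
        translog.add(doc);
    }

    /** Flush: commit the buffer, after which the translog entries are redundant. */
    void flush() {
        committed.addAll(buffered);
        buffered.clear();
        translog.clear();
    }

    /** Crash: uncommitted Lucene state is gone; recovery optionally replays the translog. */
    void crashAndRecover(boolean replayTranslog) {
        buffered.clear();
        if (replayTranslog) {
            buffered.addAll(translog);
        }
    }

    int docCount() {
        return committed.size() + buffered.size();
    }

    public static void main(String[] args) {
        ToyShardStore shard = new ToyShardStore();
        shard.index("a");
        shard.index("b");
        shard.index("c");
        shard.flush();
        shard.index("d"); // acknowledged, but not yet part of any commit
        shard.index("e");
        shard.crashAndRecover(true);
        System.out.println("docs after recovery with replay: " + shard.docCount());
    }
}
```

Call crashAndRecover(false) instead to see what a commit-only design would lose: the two writes acknowledged after the flush.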

Step 8: Confirm the seqno/version assignment with a live request

Tie the code to behavior. Index, then read the metadata fields the engine assigned:

curl -s -XPOST 'localhost:9200/trace/_doc?pretty' -H 'Content-Type: application/json' \
  -d '{"title":"seqno demo"}'
# Response includes _version, _seq_no, _primary_term: the primary term set at Hop 4, the version and seq no resolved at Hop 5.

{
  "_index" : "trace",
  "_id" : "x1Ab...",
  "_version" : 1,
  "result" : "created",
  "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Index the same _id again and watch _version/_seq_no advance — proof that updateDocument (Hop 6) ran, not addDocument:

curl -s -XPOST 'localhost:9200/trace/_doc/fixed?pretty' -H 'Content-Type: application/json' -d '{"v":1}'
curl -s -XPOST 'localhost:9200/trace/_doc/fixed?pretty' -H 'Content-Type: application/json' -d '{"v":2}'
# Second response: "_version": 2, "_seq_no" increments, "result": "updated".

Implementation Requirements / Deliverables

A completed reading-log artifact with one row per hop (Steps 2–7), each naming the exact file, class, and method.
The two assignment sites identified precisely: where the _seq_no/primary term is assigned on the primary vs replayed on the replica, with the method names.
The Lucene-vs-translog destination table, in your own words, explaining why both.
The live _doc response showing _version/_seq_no/_primary_term, and the update sequence showing them advance.
One open question per hop you couldn't fully answer — these are your future reading targets.

Troubleshooting

Symptom	Likely cause	Fix
`grep` finds nothing for a class	Path moved on your branch	Drop the path: `grep -rn "class TransportShardBulkAction" server/src/main`
TRACE produces no lines	Logger name typo or wrong package	Match package to the file's `package` declaration exactly
"go to definition" lands in a `.class` not source	IDE indexing incomplete	Re-import Gradle project; `./gradlew idea` if needed
Can't tell primary from replica path in the engine	Reading too fast	Find `planIndexingAsPrimary` vs `planIndexingAsNonPrimary` and compare
No `_seq_no` in the response	Old client/format	Use `?pretty` against a 2.x/3.x node; it's in the metadata block

Expected Output

Your reading log should resolve to roughly this chain (exact method names vary slightly by branch):

HTTP POST /trace/_doc
  → RestIndexAction / RestBulkAction          (rest/action/document/)
  → NodeClient.execute(BulkAction, ...)
  → TransportBulkAction.doExecute             (action/bulk/)        [coordinating: route, group by shard]
  → TransportShardBulkAction.shardOperationOnPrimary               [primary node]
  → IndexShard.applyIndexOperationOnPrimary   (index/shard/)        [parse+map, set primary term]
  → InternalEngine.index(Engine.Index)        (index/engine/)       [assign seqno/version, plan add vs update]
  → IndexWriter.addDocument / updateDocument  (Lucene)              [the searchable index]
  → Translog.add(Translog.Index)              (index/translog/)     [write-ahead log]
  → (replication) TransportReplicationAction → replicas: applyIndexOperationOnReplica → InternalEngine.index

Stretch Goals

Trace a delete. Repeat for DELETE /trace/_doc/fixed: RestDeleteAction → TransportBulkAction (deletes are bulked too) → IndexShard.applyDeleteOperationOnPrimary → InternalEngine.delete → IndexWriter.deleteDocuments / soft-delete + Translog.add(Translog.Delete). Note how soft-deletes keep the doc for recovery.
Trace an update script. POST /trace/_update/fixed goes through TransportUpdateAction, which does a get-modify-reindex and then re-enters the bulk path. Find where it loops back.
Attach a debugger. ./gradlew run --debug-jvm, breakpoint in InternalEngine.indexIntoLucene and Translog.add, send one curl, and step through the plan.
Diagram it. Turn your reading log into a mermaid diagram and compare it to the one in the Level 6 overview. Where did your branch differ?

Validation / Self-check

Name all hops from POST /_doc to IndexWriter.addDocument, with the owning class at each.
Why do _doc and _bulk converge on the same code? At which class, and what does a single _doc become internally?
Exactly where is the _seq_no assigned on the primary, and why is it not re-assigned on a replica? Name the method on each side.
When does the engine call addDocument vs updateDocument, and what does updateDocument do under the hood?
The document is written to two destinations in InternalEngine.index. Name them and give the distinct reason each exists.
After indexing, your response showed _seq_no: 0, _primary_term: 1. Which classes produced those two numbers?
What did the live update sequence (fixed indexed twice) prove about which Lucene call ran the second time?

When your reading log narrates the full path without looking it up, proceed to Lab 6.2: IndexShard, InternalEngine, and the Translog.

Lab 6.2: IndexShard, InternalEngine, and the Translog

Background

Lab 6.1 mapped the write path. This lab makes you operate it. You will read the three classes that own a shard's storage — IndexShard, InternalEngine, and Translog — and then drive their behavior from the outside with curl, watching the three things that confuse every newcomer: visibility (refresh), durability (flush, translog), and the translog's role in recovery.

The thesis of the lab is one sentence: a document is durable long before it is searchable, and searchable long before its Lucene commit is durable. Refresh, flush, and the translog are the three mechanisms that make that sentence true, and you will see each one move under a curl.

Note: Keep the engine internals, translog, and refresh/flush/merge deep dives open. This lab is the hands-on counterpart; those chapters are the theory. The index-shard lifecycle deep dive explains the states a shard moves through around these operations.

Why This Lab Matters for Contributors

"Why don't my docs show up immediately?" and "did I lose data on that crash?" are the two most common user questions. The answers are refresh and the translog — and you should be able to explain both with a curl demo.
Most engine/storage bugs are about timing: a refresh that didn't happen, a translog that wasn't trimmed, a durability setting misread. Operating the engine builds the intuition to spot them.
Tuning refresh_interval and translog.durability is a frequent production decision and a frequent source of issues; you must understand the trade-offs to review such changes.
The translog-replay model is the foundation of recovery; you can't reason about recovery bugs without it.

Prerequisites

A running node from source: ./gradlew run (REST on :9200).
Lab 6.1 completed — you know the write path.
jq is handy but optional (commands below also work with ?pretty).

curl -s 'localhost:9200/_cluster/health?pretty' | head    # node is up

Step-by-Step Tasks

Step 1: Read the `Engine` contract before the implementation

InternalEngine is large. Read the abstract Engine first — it is the contract, and it names every operation the storage engine supports:

find server/src/main/java -name "Engine.java" -path "*engine*"
grep -n "public abstract\|abstract IndexResult index\|abstract DeleteResult delete\|abstract void refresh\|abstract void flush\|abstract GetResult get\|acquireSearcher\|class Index\|class Delete\|class Get\|class Searcher" \
  server/src/main/java/org/opensearch/index/engine/Engine.java | head -40

Record the contract surface: index, delete, get, refresh, flush, forceMerge, acquireSearcher, plus the Engine.Index/Engine.Delete/Engine.Get operation value types. Every storage engine (including segment-replication variants) implements this.

Step 2: Read the visibility/durability machinery in `InternalEngine`

Now the implementation, focused on the three mechanisms:

# Visibility: the searcher manager and refresh.
grep -n "ExternalReaderManager\|OpenSearchReaderManager\|SearcherManager\|refresh(\|maybeRefresh\|reopen" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head

# Durability: commit + translog interplay on flush.
grep -n "public void flush\|commitIndexWriter\|indexWriter.commit\|translog.trimUnreferenced\|rollGeneration\|ensureTranslogSynced\|translog.sync" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head

Record:

Refresh opens a new reader from the in-memory + committed segments → buffered docs become searchable. It does not fsync and does not trim the translog.
Flush = IndexWriter.commit() (fsync segments, write a new commit point) then roll/trim the translog (the committed ops no longer need replaying). This is where durability and translog size meet.
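
The two records above can be restated as a toy state machine with three counters. The names are hypothetical and the model ignores segments entirely; it only shows which counter each operation moves. As a simplification, flush here does not change what is searchable.

```java
/** Toy counters for one shard. Hypothetical names; not OpenSearch code. */
class ToyRefreshFlush {
    private int indexed;    // ops applied to the IndexWriter so far
    private int searchable; // ops visible through the current searcher
    private int committed;  // ops covered by the last Lucene commit

    void index() {
        indexed++;
    }

    /** Refresh: open a new reader. No fsync, and the translog is left alone. */
    void refresh() {
        searchable = indexed;
    }

    /** Flush: commit segments, so the translog no longer needs those ops. */
    void flush() {
        committed = indexed;
    }

    int searchable() {
        return searchable;
    }

    /** Mirrors the idea behind uncommitted_operations in the translog stats. */
    int uncommittedOperations() {
        return indexed - committed;
    }

    public static void main(String[] args) {
        ToyRefreshFlush shard = new ToyRefreshFlush();
        shard.index();
        shard.index();
        shard.index();
        System.out.println("searchable=" + shard.searchable() + " uncommitted=" + shard.uncommittedOperations());
        shard.refresh();
        System.out.println("searchable=" + shard.searchable() + " uncommitted=" + shard.uncommittedOperations());
        shard.flush();
        System.out.println("searchable=" + shard.searchable() + " uncommitted=" + shard.uncommittedOperations());
    }
}
```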

Step 3: Read the translog itself

find server/src/main/java -name "Translog.java"
grep -n "public Location add\|public void sync\|enum Durability\|REQUEST\|ASYNC\|rollGeneration\|trimUnreferencedReaders\|newSnapshot\|recoverFromTranslog\|getMinFileGeneration" \
  server/src/main/java/org/opensearch/index/translog/Translog.java | head -30

Record the two durability modes and what sync() does:

`index.translog.durability`	When the translog is fsync'd	Trade-off
`request` (default)	After every write request, before it's acknowledged	Safest; a crash loses nothing acknowledged. Slower per-op.
`async`	Every `index.translog.sync_interval` (default 5s)	Faster; a crash can lose up to `sync_interval` of acknowledged writes.

Translog.add appends the operation to the current generation's buffered file writer but does not fsync it; sync is what forces it to disk. On recovery, recoverFromTranslog replays operations after the last commit's local checkpoint.
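
The difference between the two durability modes comes down to when sync runs relative to the acknowledgement. A minimal sketch, with a hypothetical ToyTranslog class that counts operations instead of writing files:

```java
/** Toy translog with the two durability modes. Hypothetical names; not OpenSearch code. */
class ToyTranslog {
    enum Durability { REQUEST, ASYNC }

    private final Durability durability;
    private int appended; // ops written to the translog, not necessarily fsync'd
    private int synced;   // ops covered by the last fsync

    ToyTranslog(Durability durability) {
        this.durability = durability;
    }

    /** Called once per write, before the write is acknowledged. */
    void add() {
        appended++;
        if (durability == Durability.REQUEST) {
            sync(); // REQUEST: fsync before acknowledging
        }
    }

    /** In ASYNC mode a background task would call this every sync_interval. */
    void sync() {
        synced = appended;
    }

    /** Acknowledged ops that a crash at this instant would lose. */
    int lostOnCrash() {
        return appended - synced;
    }

    public static void main(String[] args) {
        ToyTranslog request = new ToyTranslog(Durability.REQUEST);
        ToyTranslog async = new ToyTranslog(Durability.ASYNC);
        for (int i = 0; i < 3; i++) {
            request.add();
            async.add();
        }
        System.out.println("request mode, lost on crash: " + request.lostOnCrash());
        System.out.println("async mode, lost on crash:   " + async.lostOnCrash());
    }
}
```

In the async case the exposure resets to zero each time the periodic sync fires, which is why the worst-case loss is bounded by sync_interval.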

Step 4: Observe visibility — refresh changes what search sees

Create an index with refresh disabled so you control visibility manually:

curl -s -XPUT 'localhost:9200/lab62?pretty' -H 'Content-Type: application/json' -d '{
  "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 0, "refresh_interval": "-1" } }
}'

# Index a doc WITHOUT refreshing.
curl -s -XPOST 'localhost:9200/lab62/_doc?pretty' -H 'Content-Type: application/json' -d '{"t":"invisible"}'

# Search immediately — expect 0 hits, because no refresh has opened a new searcher.
curl -s 'localhost:9200/lab62/_search?pretty' | grep '"value"'
#   "value" : 0,

# Force a refresh, then search again — now 1 hit.
curl -s -XPOST 'localhost:9200/lab62/_refresh?pretty' >/dev/null
curl -s 'localhost:9200/lab62/_search?pretty' | grep '"value"'
#   "value" : 1,

You just watched visibility turn on. The doc existed (and was durable in the translog) the whole time — refresh only made it searchable. Confirm the doc was retrievable by primary key the whole time (GET reads the live version map / translog, not the searcher):

# Index another with refresh still off, then GET it by id — works even without a refresh.
ID=$(curl -s -XPOST 'localhost:9200/lab62/_doc?pretty' -H 'Content-Type: application/json' \
  -d '{"t":"gettable"}' | sed -n 's/.*"_id" : "\(.*\)",/\1/p')
curl -s "localhost:9200/lab62/_doc/${ID}?pretty" | grep '"found"'
#   "found" : true,     <- visible to GET-by-id before any refresh

Note: This GET-vs-search asymmetry is real and important: GET-by-id can read the in-flight version map, so a just-indexed doc is retrievable by id immediately, but search (which uses the Lucene searcher) only sees it after a refresh. Many "why is my doc missing?" issues are this asymmetry.
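
The asymmetry can be sketched with two views over the same data: a live map that every write updates, and a point-in-time snapshot that only refresh replaces. This is a toy under stated assumptions: the live map stands in for the version map plus translog, the snapshot stands in for the Lucene searcher, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of realtime GET versus search. Hypothetical names; not OpenSearch code. */
class ToyRealtimeGet {
    private final Map<String, String> live = new HashMap<>(); // updated by every write
    private Map<String, String> searcherSnapshot = Map.of();  // replaced only on refresh

    void index(String id, String source) {
        live.put(id, source);
    }

    /** Refresh: the searcher gets a new point-in-time view of everything indexed so far. */
    void refresh() {
        searcherSnapshot = Map.copyOf(live);
    }

    /** GET-by-id consults the live state, so it sees writes immediately. */
    Optional<String> getById(String id) {
        return Optional.ofNullable(live.get(id));
    }

    /** Search consults the snapshot, so it lags until the next refresh. */
    int searchHits() {
        return searcherSnapshot.size();
    }

    public static void main(String[] args) {
        ToyRealtimeGet shard = new ToyRealtimeGet();
        shard.index("1", "{\"t\":\"gettable\"}");
        System.out.println("found by id: " + shard.getById("1").isPresent() + ", search hits: " + shard.searchHits());
        shard.refresh();
        System.out.println("found by id: " + shard.getById("1").isPresent() + ", search hits: " + shard.searchHits());
    }
}
```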

Step 5: Observe the refresh interval as a knob

# Turn refresh back on at a slow 30s and watch search lag, then speed it up.
curl -s -XPUT 'localhost:9200/lab62/_settings?pretty' -H 'Content-Type: application/json' \
  -d '{"index":{"refresh_interval":"30s"}}' >/dev/null
curl -s -XPOST 'localhost:9200/lab62/_doc?pretty' -H 'Content-Type: application/json' -d '{"t":"slowvis"}' >/dev/null
curl -s 'localhost:9200/lab62/_search?pretty' | grep '"value"'   # may still be the old count for up to 30s

# A write with refresh=wait_for blocks until the next refresh makes it visible (no busy-wait).
curl -s -XPOST 'localhost:9200/lab62/_doc?refresh=wait_for&pretty' -H 'Content-Type: application/json' \
  -d '{"t":"waited"}' >/dev/null
curl -s 'localhost:9200/lab62/_search?pretty' | grep '"value"'   # now includes the waited doc

Record the trade-off: tiny refresh_interval = near-real-time search but many small segments and merge pressure; large = fewer/larger segments, less overhead, staler search. The default 1s is a balance.
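
To put rough numbers on that trade-off, here is a back-of-the-envelope estimate. It assumes writes arrive continuously, so that every periodic refresh has something to turn into a segment, and it ignores merges and flushes. The helper name is hypothetical and the result is an upper bound, not a prediction.

```java
import java.time.Duration;

/** Rough upper bound on segments produced by periodic refresh. Illustrative arithmetic only. */
class RefreshSegmentEstimate {

    /** Segments created by periodic refresh over a window of continuous writes. */
    static long maxSegments(Duration window, Duration refreshInterval) {
        if (refreshInterval.isZero() || refreshInterval.isNegative()) {
            return 0; // an interval of -1 disables periodic refresh altogether
        }
        return window.toMillis() / refreshInterval.toMillis();
    }

    public static void main(String[] args) {
        Duration oneMinute = Duration.ofMinutes(1);
        System.out.println("1s interval:  " + maxSegments(oneMinute, Duration.ofSeconds(1)));
        System.out.println("30s interval: " + maxSegments(oneMinute, Duration.ofSeconds(30)));
        System.out.println("disabled:     " + maxSegments(oneMinute, Duration.ofSeconds(-1)));
    }
}
```

Compare the estimate with what _cat/segments actually shows on your node; the gap is mostly background merging.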

Step 6: Observe durability — flush, the translog, and `_stats`

Watch the translog grow with writes and shrink on flush. _stats/translog exposes the live numbers:

# Index a batch, then look at translog size (operations not yet committed to a Lucene commit point).
for i in $(seq 1 50); do
  curl -s -XPOST 'localhost:9200/lab62/_doc?pretty' -H 'Content-Type: application/json' \
    -d "{\"t\":\"d$i\"}" >/dev/null
done
curl -s 'localhost:9200/lab62/_stats/translog?pretty' | grep -E '"operations"|"uncommitted_operations"|"size_in_bytes"'

# Flush: IndexWriter.commit() + trim the now-redundant translog. uncommitted_operations drops to ~0.
curl -s -XPOST 'localhost:9200/lab62/_flush?pretty' >/dev/null
curl -s 'localhost:9200/lab62/_stats/translog?pretty' | grep -E '"uncommitted_operations"'
#   "uncommitted_operations" : 0,

Record: flush moved durability from the translog into a Lucene commit and trimmed the translog. Before the flush, durability for those ops lived only in the translog; after, it lives in the committed segments and the translog generation can be discarded.

Step 7: See segments appear and merge

# Each refresh/flush can create a segment. List them.
curl -s 'localhost:9200/_cat/segments/lab62?v'
#   index  shard prirep ip        segment generation docs.count docs.deleted size  committed searchable ...

# Force-merge down to 1 segment (housekeeping; do NOT do this routinely in prod).
curl -s -XPOST 'localhost:9200/lab62/_forcemerge?max_num_segments=1&pretty' >/dev/null
curl -s 'localhost:9200/_cat/segments/lab62?v'   # fewer rows now

Record: refresh/flush create segments; merge (MergePolicy/MergeScheduler, forced here for the demo) combines them and reclaims deleted docs. In production merges run in the background — see the refresh/flush/merge deep dive.
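
What "combines them and reclaims deleted docs" means can be shown with a toy merge over segment summaries. The Segment record is hypothetical and carries only the two counts that _cat/segments reports as docs.count and docs.deleted; here docCount is taken to include the deleted docs, which is an assumption of the toy. Real merges are driven by the merge policy and copy actual index data.

```java
import java.util.List;

/** Toy merge over segment summaries. Hypothetical names; not Lucene code. */
class ToyMerge {

    /** A segment reduced to its doc count (including deleted docs, in this toy) and its deleted count. */
    record Segment(String name, int docCount, int deletedDocs) {
        int liveDocs() {
            return docCount - deletedDocs;
        }
    }

    /** A merge copies only live docs into one new segment; deleted docs are dropped. */
    static Segment merge(String newName, List<Segment> inputs) {
        int live = 0;
        for (Segment segment : inputs) {
            live += segment.liveDocs();
        }
        return new Segment(newName, live, 0);
    }

    public static void main(String[] args) {
        List<Segment> before = List.of(
            new Segment("_0", 10, 2),
            new Segment("_1", 5, 0),
            new Segment("_2", 3, 3));
        System.out.println("before: " + before);
        System.out.println("after:  " + merge("_3", before));
    }
}
```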

Step 8: Reason about translog replay (the recovery model)

You cannot easily crash ./gradlew run mid-write safely, but you can reason precisely about what would be replayed. Inspect the durability setting and the uncommitted operation count — together they tell you the recovery contract:

curl -s 'localhost:9200/lab62/_settings?include_defaults=true&flat_settings=true&pretty' | grep 'translog'
curl -s 'localhost:9200/lab62/_stats/translog?pretty' | grep -E '"uncommitted_operations"'

Answer in your reading log:

With durability: request (default), every acknowledged write is fsync'd to the translog before the response returns. On a crash, all uncommitted_operations are replayed from the translog after the last Lucene commit's local checkpoint → no acknowledged write is lost.
With durability: async, writes acknowledged within the last sync_interval may not be fsync'd → a crash can lose them. The translog still replays whatever was synced.
A flush you just ran trims the translog, so a crash right after the flush replays almost nothing — recovery is fast because the committed segments already hold those ops.

Map this back to the source: recoverFromTranslog in InternalEngine/Translog and the local/global checkpoint plumbing (see the Level 6 overview on checkpoints).

grep -rn "recoverFromTranslog\|newSnapshotFromGen\|getMinSeqNoToKeep\|trimOperations\|recoverFromTranslogInternal" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java \
  server/src/main/java/org/opensearch/index/translog/Translog.java | head
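
The replay rule itself is small enough to sketch: operations at or below the last commit's local checkpoint are already in Lucene, and everything above it is replayed. The Op record and opsToReplay helper are hypothetical, and -1 is used here to mean "nothing committed yet".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy version of the translog replay filter. Hypothetical names; not OpenSearch code. */
class ToyTranslogReplay {

    /** One translog entry, reduced to its sequence number and document id. */
    record Op(long seqNo, String id) {}

    /** Keep only the ops the last commit does not already contain. */
    static List<Op> opsToReplay(List<Op> translog, long committedLocalCheckpoint) {
        List<Op> replay = new ArrayList<>();
        for (Op op : translog) {
            if (op.seqNo() > committedLocalCheckpoint) {
                replay.add(op);
            }
        }
        return replay;
    }

    public static void main(String[] args) {
        List<Op> translog = List.of(
            new Op(0, "a"), new Op(1, "b"), new Op(2, "c"), new Op(3, "d"), new Op(4, "e"));
        // The last commit covered seqNo 0..2, so only 3 and 4 need replaying.
        System.out.println(opsToReplay(translog, 2));
    }
}
```

When you read the real recovery code, check how it handles the cases this toy ignores, such as gaps in the sequence and operations from an older primary term.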

Step 9: Toggle `translog.durability` and observe (read-mostly)

# Switch to async with a long sync interval; index; note that ops are not fsync'd per request.
curl -s -XPUT 'localhost:9200/lab62/_settings?pretty' -H 'Content-Type: application/json' -d '{
  "index": { "translog": { "durability": "async", "sync_interval": "10s" } }
}'
curl -s 'localhost:9200/lab62/_settings?pretty' | grep -A3 translog
# Index a few docs; each is appended to the translog but not fsync'd per request. The fsync now happens only every 10s.
# Switch back to the safe default before doing anything you care about:
curl -s -XPUT 'localhost:9200/lab62/_settings?pretty' -H 'Content-Type: application/json' \
  -d '{"index":{"translog":{"durability":"request"}}}' >/dev/null

Warning: durability: async trades data safety for write throughput. It is a deliberate, documented choice for specific ingest workloads — never a default you flip casually. Reviewing a PR that changes a default durability is a red flag worth a careful read.

Implementation Requirements / Deliverables

A reading note distinguishing, in your own words, refresh (visibility), flush (Lucene commit + translog trim), and merge (segment housekeeping), each tied to its InternalEngine/Translog method.
The Step 4 demo output proving a doc is durable/GET-able before it is searchable.
The Step 6 _stats/translog before/after a flush, showing uncommitted_operations drop to ~0.
The Step 7 _cat/segments before/after a force-merge.
A written translog-replay answer (Step 8) for both request and async durability.
Source pointers (grep hits) for refresh, flush, Translog.add/sync, and recoverFromTranslog.

Troubleshooting

Symptom	Likely cause	Fix
Search shows 0 hits after indexing	No refresh has run (`refresh_interval: -1` or just slow)	`POST /idx/_refresh`, or index with `?refresh=wait_for`
`uncommitted_operations` never drops	You searched/refreshed but didn't flush	Flush is what trims the translog, not refresh
`_cat/segments` shows many tiny segments	Frequent refresh + no merge yet	Normal; merges run in background, or force-merge for the demo
GET-by-id works but search doesn't	The GET-vs-search visibility asymmetry	Expected; refresh to make it searchable
Settings update rejected	Some translog settings are static	Recreate the index, or set the dynamic subset
Node won't start after `run`	Stale data dir lock	Stop the old `./gradlew run`, retry

Expected Output

# Step 4
"value" : 0,        # before refresh
"value" : 1,        # after refresh
"found" : true,     # GET-by-id before any refresh

# Step 6
"uncommitted_operations" : 54,   # before flush, if you ran Steps 4-5 first (50 from the loop + 4 earlier docs)
"uncommitted_operations" : 0,    # after flush

# Step 7
# _cat/segments shows several rows, then fewer after force-merge to 1 segment.

Stretch Goals

Measure refresh cost. Index 100k docs at refresh_interval: 1s vs 30s vs -1; compare _cat/segments counts and merge activity (_nodes/stats/indices/merges). Explain the trade-off with numbers.
Watch a merge happen for real. Index, delete a chunk of docs, refresh, and watch docs.deleted in _cat/segments shrink as background merges reclaim them — without force-merge.
Trace a recovery. Restart the node (./gradlew run again on the same data dir if configured) and TRACE org.opensearch.index.translog and org.opensearch.indices.recovery to watch translog replay on startup. Cross-reference the recovery deep dive.
Find the flush triggers. grep for flush_threshold_size / shouldPeriodicallyFlush in IndexShard/InternalEngine and explain what automatically triggers a flush (translog size/age, merges, shard close) without an explicit _flush.

Validation / Self-check

Explain the sentence "a document is durable before it is searchable, and searchable before its Lucene commit is durable" using refresh, flush, and the translog.
Why was your doc GET-able by id but not found by search before a refresh? Which component does each read from?
What exactly does flush do that refresh does not? Which two side effects make uncommitted_operations drop to 0?
Contrast translog.durability: request vs async. For each, what is lost in a crash, and why?
On recovery, what gets replayed from the translog, and from which point? Name the method.
You set refresh_interval: -1, then 30s, then used refresh=wait_for. Describe the visibility behavior in each case and the production trade-off of small vs large intervals.
Where in InternalEngine/Translog do refresh, flush, add, sync, and recoverFromTranslog live? Paste the grep lines.

When you can drive visibility and durability from curl and explain every transition from the source, build something on top of the engine in Lab 6.3: Build It — A Custom Analyzer.

Lab 6.3: Build It — A Custom Analyzer

Background

You have traced the write path (Lab 6.1) and operated the engine (Lab 6.2). The one piece you have only glanced at is analysis — the step where a text field's value is tokenized and filtered into the terms Lucene actually indexes. Analysis is the most plugin-friendly extension point in OpenSearch: tokenizers, token filters, char filters, and analyzers are all registered through AnalysisPlugin, and writing one is the cleanest introduction to the entire plugin model.

In this Build-It lab you create a real, installable OpenSearch plugin from scratch: a Plugin that implements AnalysisPlugin and registers a custom token filter. You will write the Lucene TokenFilter (the incrementToken() loop), the AbstractTokenFilterFactory that wires it into OpenSearch, the plugin-descriptor.properties, and the opensearch.opensearchplugin Gradle build. Then you build it, install it into a distribution, and verify it with the _analyze API and a unit test extending OpenSearchTokenStreamTestCase.

The example filter is skip_short: it drops any token shorter than a configurable min_length (with an uppercase variant in the stretch goals). It is small enough to understand fully and real enough to be useful.

Note: Read the plugin architecture and mapping & analysis deep dives alongside this lab. The bundled analysis-common module is your reference implementation — when in doubt, copy its patterns.

Why This Lab Matters for Contributors

Plugins are how OpenSearch is extended in the real world (security, k-NN, SQL, analyzers all ship as plugins). Building one — even a tiny one — demystifies plugin-descriptor.properties, the opensearch.opensearchplugin Gradle plugin, isolated classloaders, and the registration interfaces.
AnalysisPlugin is the gentlest entry point: no cluster state, no transport, no threading — pure Lucene TokenStream mechanics you can fully test in-JVM.
incrementToken() is the heart of Lucene text processing. Writing one teaches you how every analyzer in OpenSearch actually works, which makes analysis bugs and feature requests legible.
A working plugin with a _analyze demo and a token-stream unit test is a complete, portfolio-quality contribution shape.

Prerequisites

A clean OpenSearch build (./gradlew assemble works).
Lab 6.2 done: you know how the engine stores and exposes the terms that analysis produces.
JDK 21; the repo's bundled JDK is fine.

Familiarity with the bundled analysis module as a template:

ls modules/analysis-common/src/main/java/org/opensearch/analysis/common/ | head -30
grep -rln "AbstractTokenFilterFactory" modules/analysis-common/src/main/java | head

Step-by-Step Tasks

Step 1: Understand the four pieces you must write

A token-filter plugin is exactly four artifacts plus a build file. Hold this map in your head:

flowchart LR
    A["plugin-descriptor.properties<br/>(name, classname, versions)"] --> B["SkipShortTokensPlugin<br/>implements AnalysisPlugin"]
    B -->|getTokenFilters() registers| C["SkipShortTokenFilterFactory<br/>extends AbstractTokenFilterFactory"]
    C -->|create() wraps stream| D["SkipShortTokenFilter<br/>extends Lucene TokenFilter<br/>incrementToken()"]

Artifact	Responsibility
`plugin-descriptor.properties`	Tells OpenSearch the plugin's name, entry `classname`, and the OpenSearch/Java versions it targets
`Plugin implements AnalysisPlugin`	The entry point; `getTokenFilters()` returns the name→factory registry
`AbstractTokenFilterFactory` subclass	Reads settings (e.g. `min_length`) and `create(TokenStream)`s the Lucene filter
Lucene `TokenFilter` subclass	The actual logic in `incrementToken()`

Step 2: Scaffold the plugin module

Create the module directory under plugins/ (mirroring the in-repo plugins):

mkdir -p plugins/analysis-skip-short/src/main/java/org/opensearch/analysis/skipshort
mkdir -p plugins/analysis-skip-short/src/test/java/org/opensearch/analysis/skipshort
mkdir -p plugins/analysis-skip-short/src/main/resources
ls plugins/ | head     # confirm your module sits beside analysis-icu etc.

Step 3: The Lucene `TokenFilter` — `incrementToken()`

This is the core. A TokenFilter decorates an input TokenStream; incrementToken() pulls tokens from the input, decides whether to emit each, and returns true for "a token is ready" or false for end-of-stream. We read each token's text via the CharTermAttribute and skip those shorter than minLength:

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

package org.opensearch.analysis.skipshort;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;

/** Drops tokens whose length is strictly less than {@code minLength}. */
public final class SkipShortTokenFilter extends TokenFilter {

    private final int minLength;
    // CharTermAttribute is the token's text; it is shared/mutated across the stream.
    private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);

    public SkipShortTokenFilter(TokenStream input, int minLength) {
        super(input);
        this.minLength = minLength;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Pull tokens until we find one to keep, or the stream ends.
        while (input.incrementToken()) {
            if (termAttr.length() >= minLength) {
                return true;     // keep this token; its attributes are already populated
            }
            // else: too short — loop and pull the next one, effectively dropping this token
        }
        return false;            // input exhausted
    }
}

Note: Token attributes (CharTermAttribute, OffsetAttribute, PositionIncrementAttribute, …) are shared, mutable objects reused for every token, so never cache a token's value across incrementToken() calls without copying it. This filter also does not fix up position increments: a dropped token leaves no gap, so positions are renumbered and a phrase query can match across the removed word. A stricter implementation would add each dropped token's increment to the next kept token's PositionIncrementAttribute (Lucene's FilteringTokenFilter base class does this for you); note that as a known limitation.
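
The position-increment fix-up mentioned in the note can be sketched without Lucene on the classpath. This is a toy under stated assumptions: tokens are modeled as an immutable list rather than a shared-attribute stream, and Token, skipShort, and positions are hypothetical names. Absolute positions are computed by starting at -1 and adding each increment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of dropping tokens with and without position fix-up. Not Lucene code. */
class ToyFilteringFilter {

    /** A token and its gap, in positions, from the previously emitted token. */
    record Token(String term, int positionIncrement) {}

    /** Drop short tokens; optionally carry each dropped increment to the next kept token. */
    static List<Token> skipShort(List<Token> input, int minLength, boolean preservePositions) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token token : input) {
            if (token.term().length() >= minLength) {
                int increment = token.positionIncrement() + (preservePositions ? skipped : 0);
                out.add(new Token(token.term(), increment));
                skipped = 0;
            } else {
                skipped += token.positionIncrement();
            }
        }
        return out;
    }

    /** Absolute positions: start at -1 and add each token's increment. */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> out = new ArrayList<>();
        int position = -1;
        for (Token token : tokens) {
            position += token.positionIncrement();
            out.add(position);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> input = new ArrayList<>();
        for (String term : "the quick brown fox is up".split(" ")) {
            input.add(new Token(term, 1));
        }
        System.out.println("no fix-up:   " + positions(skipShort(input, 4, false)));
        System.out.println("with fix-up: " + positions(skipShort(input, 4, true)));
    }
}
```

With the fix-up, quick and brown keep the positions they had in the original text, so a phrase query for "the quick" cannot accidentally match text where another word sat between them.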

Step 4: The factory — read settings and wire it in

AbstractTokenFilterFactory is OpenSearch's adapter between a Lucene TokenFilter and the analysis registry. It receives the index settings, the analysis settings block, and the filter's name. Read your min_length setting here, with a default and validation:

/* SPDX header omitted for brevity — include it in your real file */
package org.opensearch.analysis.skipshort;

import org.apache.lucene.analysis.TokenStream;
import org.opensearch.common.settings.Settings;
import org.opensearch.env.Environment;
import org.opensearch.index.IndexSettings;
import org.opensearch.index.analysis.AbstractTokenFilterFactory;

public class SkipShortTokenFilterFactory extends AbstractTokenFilterFactory {

    private final int minLength;

    public SkipShortTokenFilterFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
        super(indexSettings, name, settings);
        // "min_length" from the analyzer's filter config; default 3, must be >= 1.
        this.minLength = settings.getAsInt("min_length", 3);
        if (minLength < 1) {
            throw new IllegalArgumentException(
                "[min_length] must be >= 1 for filter [" + name + "], got [" + minLength + "]");
        }
    }

    @Override
    public TokenStream create(TokenStream tokenStream) {
        return new SkipShortTokenFilter(tokenStream, minLength);
    }
}

Step 5: The plugin entry point — `AnalysisPlugin.getTokenFilters()`

The Plugin is the classname OpenSearch loads. Implementing AnalysisPlugin and overriding getTokenFilters() registers your filter under a name (the name users put in their analyzer config):

/* SPDX header omitted for brevity — include it in your real file */
package org.opensearch.analysis.skipshort;

import org.opensearch.index.analysis.TokenFilterFactory;
import org.opensearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.opensearch.plugins.AnalysisPlugin;
import org.opensearch.plugins.Plugin;

import java.util.Map;
import java.util.TreeMap;

public class SkipShortTokensPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        Map<String, AnalysisProvider<TokenFilterFactory>> filters = new TreeMap<>();
        // Registered name "skip_short" -> factory constructor reference.
        filters.put("skip_short", SkipShortTokenFilterFactory::new);
        return filters;
    }
}

Note: The AnalysisProvider<TokenFilterFactory> functional interface matches the factory's 4-arg constructor (IndexSettings, Environment, String name, Settings). The method-reference SkipShortTokenFilterFactory::new satisfies it directly — if it doesn't compile, your constructor signature is off.
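
If the constructor-reference trick feels like magic, it helps to reproduce it with plain Java. In the sketch below every type is a hypothetical stand-in (the records replace IndexSettings, Environment, and Settings, and Provider replaces AnalysisProvider); only the shape matters: a functional interface whose single method takes the same four arguments as the factory constructor.

```java
import java.util.Map;
import java.util.TreeMap;

/** Stand-alone illustration of a 4-arg constructor reference. Not OpenSearch code. */
class ProviderSketch {
    // Hypothetical stand-ins for the real OpenSearch types.
    record IndexSettings(String indexName) {}
    record Environment(String home) {}
    record Settings(int minLength) {}

    interface TokenFilterFactory {
        String name();
    }

    /** One abstract method whose parameters match the factory constructor. */
    @FunctionalInterface
    interface Provider<T> {
        T get(IndexSettings indexSettings, Environment env, String name, Settings settings);
    }

    static final class SkipShortFactory implements TokenFilterFactory {
        private final String name;
        final int minLength;

        SkipShortFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
            this.name = name;
            this.minLength = settings.minLength();
        }

        @Override
        public String name() {
            return name;
        }
    }

    /** The registry maps a filter type name to something that can build a factory. */
    static Map<String, Provider<TokenFilterFactory>> registry() {
        Map<String, Provider<TokenFilterFactory>> filters = new TreeMap<>();
        filters.put("skip_short", SkipShortFactory::new); // compiles only if the signatures line up
        return filters;
    }

    public static void main(String[] args) {
        TokenFilterFactory factory = registry().get("skip_short")
            .get(new IndexSettings("lab63"), new Environment("/tmp"), "my_skip_short", new Settings(4));
        System.out.println(factory.name());
    }
}
```

Change the constructor's parameter order or drop one argument and the registry line stops compiling, which is the same failure mode the note describes.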

Step 6: `plugin-descriptor.properties`

OpenSearch reads this at load time. Place it where the Gradle plugin expects it (it is generated into the zip from your opensearchplugin {} config, but you can also provide a static one). The fields:

# plugins/analysis-skip-short/src/main/resources/plugin-descriptor.properties
description=Adds a skip_short token filter that drops tokens shorter than min_length
version=${version}
name=analysis-skip-short
classname=org.opensearch.analysis.skipshort.SkipShortTokensPlugin
java.version=21
opensearch.version=${opensearchVersion}

Note: In an in-repo plugin you usually do not hand-write this file — the opensearch.opensearchplugin Gradle plugin generates it from the opensearchplugin { ... } block (Step 7) so versions stay in sync. Hand-write it only for an out-of-tree plugin built against the published artifacts.

Step 7: The Gradle build (`opensearch.opensearchplugin`)

Create plugins/analysis-skip-short/build.gradle. The opensearch.opensearchplugin plugin produces an installable plugin zip and generates the descriptor:

/* plugins/analysis-skip-short/build.gradle */
apply plugin: 'opensearch.opensearchplugin'
apply plugin: 'opensearch.yaml-rest-test'   // optional: enables yamlRestTest for the plugin

opensearchplugin {
    description 'Adds a skip_short token filter that drops tokens shorter than min_length'
    classname  'org.opensearch.analysis.skipshort.SkipShortTokensPlugin'
}

// Analysis APIs come from the core 'opensearch' dependency the plugin already builds against.
// Lucene's analysis classes are provided transitively.
restResources {
    restApi {
        includeCore '_common', 'indices', 'index', 'indices.analyze'
    }
}

Wire the module into the build by ensuring it is discovered (in-repo plugins under plugins/ are picked up by the root settings.gradle's project inclusion; confirm):

grep -rn "analysis-skip-short\|project(.*plugins" settings.gradle build.gradle 2>/dev/null | head
# If in-repo plugins are auto-included you'll see the glob; otherwise add:
#   include ':plugins:analysis-skip-short'

Step 8: Build and install the plugin

# Build just your plugin's zip.
./gradlew :plugins:analysis-skip-short:assemble
find plugins/analysis-skip-short/build/distributions -name "*.zip"
#   plugins/analysis-skip-short/build/distributions/analysis-skip-short-<version>.zip

# Easiest path to a node WITH the plugin already loaded: run with the plugin on the classpath.
./gradlew run -PinstalledPlugins="['analysis-skip-short']"

Alternatively, install into a built distribution by hand:

./gradlew localDistro
DISTRO=$(find distribution/archives -maxdepth 2 -name "opensearch-*" -type d | head -1)
"$DISTRO/bin/opensearch-plugin" install \
  "file://$(pwd)/plugins/analysis-skip-short/build/distributions/$(ls plugins/analysis-skip-short/build/distributions | grep '\.zip$')"
"$DISTRO/bin/opensearch"        # start it

Confirm OpenSearch loaded it:

curl -s 'localhost:9200/_cat/plugins?v'
#   name   component             version
#   node-0 analysis-skip-short   <version>

Step 9: Test it with `_analyze`

The _analyze API runs text through a tokenizer + filter chain and returns the resulting tokens — the fastest way to verify your filter. Test it standalone first:

curl -s -XPOST 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '{
  "tokenizer": "standard",
  "filter": [ { "type": "skip_short", "min_length": 4 } ],
  "text": "the quick brown fox is up"
}'

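Expected: tokens shorter than 4 chars (the, fox, is, up) are dropped; quick and brown remain. Offsets still point into the original text. Positions restart at 0, assuming the filter as built in this lab does not carry the position increments of dropped tokens forward (that is the position-increment stretch goal):

{
  "tokens" : [
    { "token" : "quick", "start_offset" : 4,  "end_offset" : 9,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1 }
  ]
}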

Then bind it into a real custom analyzer on an index and index a doc through it:

curl -s -XPUT 'localhost:9200/skipdemo?pretty' -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "filter":   { "no_short": { "type": "skip_short", "min_length": 4 } },
      "analyzer": { "long_words": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase","no_short"] } }
    }
  },
  "mappings": { "properties": { "body": { "type": "text", "analyzer": "long_words" } } }
}'

# Verify the named analyzer applies the filter.
curl -s -XPOST 'localhost:9200/skipdemo/_analyze?pretty' -H 'Content-Type: application/json' -d '{
  "analyzer": "long_words", "text": "An OpenSearch token filter in action"
}'
# Expect: opensearch, token, filter, action  (an, in dropped)
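
The two `_analyze` calls above can be reproduced in a few lines as a mental model. This is a minimal Python sketch, not the Java filter: `tokenize`, `skip_short`, and `positions` are hypothetical helpers, the tokenizer is a plain whitespace split, and positions are accumulated from increments starting at -1. The `preserve_positions` flag shows what changes once the position-increment stretch goal is implemented.

```python
def tokenize(text):
    """Whitespace stand-in for a tokenizer: each token advances one position."""
    return [(word, 1) for word in text.split()]


def skip_short(tokens, min_length, preserve_positions=False):
    """Drop (text, position_increment) tokens shorter than min_length.

    With preserve_positions=True the increments of dropped tokens are added
    to the next kept token, which is what the stretch goal asks for.
    """
    skipped = 0
    for text, pos_inc in tokens:
        if len(text) >= min_length:
            yield text, pos_inc + (skipped if preserve_positions else 0)
            skipped = 0
        else:
            skipped += pos_inc


def positions(tokens):
    """Turn increments into absolute positions, counting from -1."""
    position, out = -1, []
    for text, pos_inc in tokens:
        position += pos_inc
        out.append((text, position))
    return out


stream = tokenize("the quick brown fox is up")
plain = positions(skip_short(stream, 4))
gapped = positions(skip_short(stream, 4, preserve_positions=True))
lowered = tokenize("An OpenSearch token filter in action".lower())
analyzer = [text for text, _ in skip_short(lowered, 4)]
print(plain)
print(gapped)
print(analyzer)
```

Comparing `plain` with `gapped` is a quick way to see why a phrase query behaves differently depending on whether dropped tokens leave a gap.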

Step 10: The unit test — `OpenSearchTokenStreamTestCase`

_analyze is a smoke test; the real test is a fast, deterministic unit test of the token stream. OpenSearchTokenStreamTestCase gives you Lucene's analysis assertions:

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

package org.opensearch.analysis.skipshort;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.opensearch.test.OpenSearchTokenStreamTestCase;

import java.io.StringReader;

public class SkipShortTokenFilterTests extends OpenSearchTokenStreamTestCase {

    public void testDropsShortTokens() throws Exception {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        src.setReader(new StringReader("a bb ccc dddd eeeee"));
        TokenStream filtered = new SkipShortTokenFilter(src, 3);   // keep length >= 3

        // assertTokenStreamContents verifies the exact remaining tokens in order.
        assertTokenStreamContents(filtered, new String[] { "ccc", "dddd", "eeeee" });
    }

    public void testKeepsAllWhenMinLengthIsOne() throws Exception {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        src.setReader(new StringReader("a bb ccc"));
        TokenStream filtered = new SkipShortTokenFilter(src, 1);
        assertTokenStreamContents(filtered, new String[] { "a", "bb", "ccc" });
    }

    public void testEmptyInputProducesNoTokens() throws Exception {
        WhitespaceTokenizer src = new WhitespaceTokenizer();
        src.setReader(new StringReader(""));
        TokenStream filtered = new SkipShortTokenFilter(src, 3);
        assertTokenStreamContents(filtered, new String[0]);
    }
}

Run it:

./gradlew :plugins:analysis-skip-short:test --tests "*SkipShortTokenFilterTests"

assertTokenStreamContents checks not just the token texts but that the stream reset()s, close()s, and end()s correctly — it catches the classic bug of a filter that forgets to propagate end()/reset to its input. Add a YAML REST test (src/yamlRestTest/...) for the _analyze behavior as well if you enabled opensearch.yaml-rest-test.

Step 11: Precommit and CHANGELOG

./gradlew spotlessApply
./gradlew :plugins:analysis-skip-short:precommit

Add a CHANGELOG.md entry and commit with DCO sign-off:

### Added
- New `analysis-skip-short` plugin adding a `skip_short` token filter ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))

git checkout -b feature/analysis-skip-short
git add plugins/analysis-skip-short CHANGELOG.md settings.gradle
git commit -s -m "Add analysis-skip-short plugin with a skip_short token filter"

Implementation Requirements / Deliverables

SkipShortTokenFilter extends TokenFilter with a correct incrementToken() loop.
SkipShortTokenFilterFactory extends AbstractTokenFilterFactory reading min_length with a default and validation.
SkipShortTokensPlugin extends Plugin implements AnalysisPlugin, registering skip_short in getTokenFilters().
A build.gradle applying opensearch.opensearchplugin (descriptor generated from it).
The plugin builds (:plugins:analysis-skip-short:assemble) and loads (_cat/plugins lists it).
_analyze demonstrates short tokens dropped, both standalone and via a custom analyzer on an index.
A passing OpenSearchTokenStreamTestCase unit test (assertTokenStreamContents) covering drop, keep-all, and empty-input.
precommit passes; CHANGELOG.md entry added; DCO-signed commit; SPDX headers on every file.

Troubleshooting

Symptom	Likely cause	Fix
`_cat/plugins` doesn't list it	Plugin not installed / not on the run classpath	Use `./gradlew run -PinstalledPlugins="['analysis-skip-short']"` or install the zip
`unknown filter type [skip_short]`	Name in `getTokenFilters()` ≠ name in the analyzer config	Make both `skip_short`
`IllegalArgumentException: classname` on load	`classname` in descriptor wrong	Point it at the FQN of `SkipShortTokensPlugin`
Plugin fails version check on install	Built against a different OpenSearch version	Build the plugin from the same checkout as the node
`SkipShortTokenFilterFactory::new` won't compile	Constructor signature mismatch	Match `(IndexSettings, Environment, String, Settings)` exactly
`assertTokenStreamContents` fails on `end()`/`reset()`	Filter doesn't delegate to input properly	`TokenFilter` base handles it; ensure you call `super(input)` and loop on `input.incrementToken()`
`licenseHeaders` precommit failure	Missing SPDX header	Add it to every `.java` file

Expected Output

# _cat/plugins
name   component             version
node-0 analysis-skip-short   3.0.0-SNAPSHOT

# _analyze with min_length: 4 on "the quick brown fox is up"
tokens: [ "quick", "brown" ]

# Unit test
> Task :plugins:analysis-skip-short:test
org.opensearch.analysis.skipshort.SkipShortTokenFilterTests > testDropsShortTokens PASSED
org.opensearch.analysis.skipshort.SkipShortTokenFilterTests > testKeepsAllWhenMinLengthIsOne PASSED
org.opensearch.analysis.skipshort.SkipShortTokenFilterTests > testEmptyInputProducesNoTokens PASSED
BUILD SUCCESSFUL

Stretch Goals

Add an uppercase filter too. Implement UpperCaseTokenFilter (mutate CharTermAttribute in place with Character.toUpperCase) and register it as to_upper. This teaches mutating a token vs dropping one. Verify with _analyze.
Register a tokenizer. Override getTokenizers() to add a simple custom Tokenizer (e.g. a fixed-length n-gram or a regex split). Tokenizers are the source of the stream, not a decorator — note how the lifecycle differs.
Make it a pre-configured filter. Override getPreConfiguredTokenFilters() so skip_short is usable without an explicit min_length (a sensible default), the way lowercase is built in.
Fix position increments. Make the filter bump PositionIncrementAttribute for dropped tokens so phrase queries across a gap behave correctly. Add a test asserting positions.
Add a YAML REST test. Under src/yamlRestTest/resources/rest-api-spec/test/, write a do: indices.analyze + match: test for the filter and run :plugins:analysis-skip-short:yamlRestTest.

Validation / Self-check

Name the four artifacts a token-filter plugin needs and the single responsibility of each.
In incrementToken(), why is it a while loop and not an if? What does returning false mean?
Why must you never cache a CharTermAttribute's value across incrementToken() calls?
What does AbstractTokenFilterFactory give you, and where do you read the min_length setting?
How does the registered name (skip_short) connect the analyzer config in _settings to your Java factory? Where is that mapping declared?
Why is OpenSearchTokenStreamTestCase + assertTokenStreamContents a better test than the _analyze curl? What does it check that curl can't easily?
Which Gradle plugin builds the installable zip and generates plugin-descriptor.properties, and what does classname in that descriptor point to?

When your plugin builds, loads, drops short tokens through _analyze, and passes its token-stream unit test, you have completed Level 6. You now understand the write path from REST to Lucene, can operate the engine's visibility/durability mechanisms, and have extended OpenSearch with a real plugin. Continue to Level 7: Search Path and Aggregations, and deepen the theory with the plugin architecture and mapping & analysis deep dives.

Level 7: Search Path and Aggregations

This level is where you learn how OpenSearch actually answers a _search. Up to now you have read the write path, cluster state, allocation, and recovery. Now you turn to the read path: the two-phase, scatter-gather protocol that fans a query out to many shards on many nodes and reduces the partial answers into one result. You will trace a search through QueryPhase and FetchPhase, understand how aggregations collect per shard and reduce on the coordinating node, and finally build and register a custom aggregation in a SearchPlugin.

A contributor who does not understand query-then-fetch cannot reason about pagination cost, partial failures, aggregation approximation error, or why a sort needs DocValues. This level closes that gap and ends with you shipping real Java into the search subsystem.

Learning Objectives

By the end of Level 7 you must be able to:

Diagram the query-then-fetch fan-out: what the coordinating node sends, what each shard returns, and what the second round trip fetches.
Locate the per-shard phase entry points (SearchService.executeQueryPhase, executeFetchPhase) and the coordinating-node reduce (SearchPhaseController) in source.
Explain why deep from/size pagination is expensive and what search_after, scroll, and point-in-time (PIT) do about it.
Explain the DfsPhase and when it changes scoring; explain the can_match pre-filter.
Walk the aggregation lifecycle: AggregatorFactory → Aggregator (per-shard collect + buildAggregation) → InternalAggregation.reduce(...) on the coordinator.
Explain shard_size vs size for a terms aggregation and where approximation error comes from.
Trace a QueryBuilder to a Lucene Query through QueryShardContext, and explain why sort and aggs read columnar DocValues.
Register a custom aggregation (or query) via SearchPlugin, build it, install it, and exercise it through _search.

The Two-Phase Query-Then-Fetch Model

A _search is not "run the query on a shard." An index is split into shards; each shard is an independent Lucene index, and no single shard has the global answer. So the coordinating node (the node that received the REST request) runs a scatter-gather:

flowchart TD
    C[Client] -->|HTTP GET _search| R[RestSearchAction]
    R --> T[TransportSearchAction<br/>coordinating node]
    T -->|can_match pre-filter<br/>optional| CM[per-shard can_match]
    T -->|QUERY: fan out| S1[Shard 0<br/>SearchService.executeQueryPhase]
    T -->|QUERY: fan out| S2[Shard 1<br/>QueryPhase]
    T -->|QUERY: fan out| S3[Shard N<br/>QueryPhase]
    S1 -->|top docIds + scores + agg slices| RED[SearchPhaseController<br/>reduce on coordinator]
    S2 --> RED
    S3 --> RED
    RED -->|which docs to fetch| F1[Shard 0<br/>executeFetchPhase]
    RED --> F2[Shard 1<br/>FetchPhase]
    F1 -->|_source + fields| MERGE[merge hits + aggs]
    F2 --> MERGE
    MERGE -->|SearchResponse| C

Phase 1 — Query. The coordinating node sends the parsed request to every target shard. Each shard runs the query locally (QueryPhase), computes its top from + size doc IDs plus their scores (and any aggregation partial results), and returns only IDs, scores, and sort values, not the documents themselves. The coordinator collects these slices in SearchPhaseController, merges them into the global top from + size, and keeps the last size entries as the requested page. This is the reduce.

Phase 2 — Fetch. Now the coordinator knows exactly which doc IDs (on which shards) belong in the final page. It sends a second, much smaller fan-out — only to the shards that own winning docs — asking each to load the actual _source, highlighted fields, etc. (FetchPhase). The coordinator stitches the fetched documents into the final SearchResponse.

Why two phases? Because fetching _source for every candidate on every shard would move enormous amounts of data over the wire just to throw most of it away. Query-then-fetch moves only the documents that survive the global merge.

Note: This is why deep pagination is expensive. To serve from=10000&size=10, every shard must produce its top 10010 docs so the coordinator can find the global top 10010 and slice the last 10. Cost grows with from + size on every shard, which is why index.max_result_window rejects requests with from + size above 10,000 by default. Use search_after (cursor by sort values) or point-in-time (PIT) instead. See Search Execution.
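
The cost argument is easy to check with a toy model. The following is a minimal Python sketch of query-then-fetch's first phase and the reduce, not OpenSearch code: `build_shards`, `query_phase`, `reduce_phase`, and `search` are hypothetical names, scores are random numbers, and "entries shipped" simply counts the (score, doc ID) pairs each shard sends to the coordinator.

```python
import heapq
import itertools
import random


def build_shards(num_shards, docs_per_shard, seed=7):
    """Hypothetical data: each doc is (score, (shard_id, local_doc_id))."""
    rng = random.Random(seed)
    return [
        [(rng.random(), (shard, doc)) for doc in range(docs_per_shard)]
        for shard in range(num_shards)
    ]


def query_phase(shard_docs, from_, size):
    """Per shard: return the local top from+size entries, best first. No _source."""
    return heapq.nlargest(from_ + size, shard_docs)


def reduce_phase(per_shard_tops, from_, size):
    """Coordinator: merge the sorted per-shard lists and keep the requested page."""
    merged = heapq.merge(*per_shard_tops, reverse=True)
    return list(itertools.islice(merged, from_, from_ + size))


def search(shards, from_, size):
    tops = [query_phase(docs, from_, size) for docs in shards]
    page = reduce_phase(tops, from_, size)
    entries_shipped = sum(len(top) for top in tops)
    return page, entries_shipped


shards = build_shards(num_shards=3, docs_per_shard=2000)
shallow_page, shallow_cost = search(shards, from_=0, size=10)
deep_page, deep_cost = search(shards, from_=1000, size=10)
print("shallow page ships", shallow_cost, "entries to the coordinator")
print("deep page ships", deep_cost, "entries to the coordinator")
```

Both calls return a page of 10 hits, but the deep page makes every shard ship 1010 entries instead of 10. The fetch phase would then run only for the 10 winners in either case.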

Who Does What — Coordinating Node vs Data Node

Responsibility	Where	Class
Parse REST, build `SearchRequest`	coordinating node	`RestSearchAction`
Resolve indices, route to shards, fan out	coordinating node	`TransportSearchAction`, `SearchPhase` impls
Optional `can_match` pre-filter (skip shards that cannot match)	per shard	`SearchService.canMatch`
Optional global term stats for scoring	per shard	`DfsPhase`
Run the query, collect top-K + aggs	per shard (data node)	`SearchService.executeQueryPhase` → `QueryPhase`
Hold per-shard search state	per shard	`SearchContext` / `DefaultSearchContext`
Reduce top-K and aggregations	coordinating node	`SearchPhaseController`, `InternalAggregation.reduce`
Fetch `_source`/fields for winners	per shard	`SearchService.executeFetchPhase` → `FetchPhase`
Assemble `SearchResponse`	coordinating node	`TransportSearchAction`

A single node is often both coordinator and data node for a request — but the roles are distinct, and keeping them straight is the key to reading the code.

DFS, can_match, and Scoring

DfsPhase (DFS = Distributed Frequency Search). BM25 scoring depends on term statistics (document frequency, total term frequency). Each shard only knows its local statistics, so the same term can score differently on different shards. With search_type=dfs_query_then_fetch, an extra round trip (DfsPhase) gathers global term statistics first, so scoring is consistent across shards. The default query_then_fetch skips it (cheaper, slightly less precise scores). Find it:
```
grep -rn "class DfsPhase" server/src/main/java/org/opensearch/search/dfs/
```
can_match pre-filter. Before the query phase, the coordinator can ask each shard a cheap "could this shard possibly contain a matching doc?" question (using min/max ranges in the shard's segment metadata). Shards that cannot match are skipped entirely. This is why a time-range query over hundreds of time-based indices stays fast — most shards are pruned.
```
grep -rn "canMatch" server/src/main/java/org/opensearch/search/SearchService.java
```
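
To see why local statistics skew scores, here is a minimal Python sketch using the BM25 inverse-document-frequency formula as Lucene's BM25 similarity defines it, ln(1 + (N - n + 0.5) / (n + 0.5)). The shard statistics are made-up numbers and `bm25_idf` is a hypothetical helper. It illustrates the effect that the DFS round trip removes; it does not reproduce DfsPhase itself.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25 idf for a term that appears in doc_freq of doc_count documents."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical per-shard statistics for the same term.
shard_stats = [
    {"doc_count": 1000, "doc_freq": 5},    # term is rare on this shard
    {"doc_count": 1000, "doc_freq": 400},  # term is common on this shard
]

# query_then_fetch: each shard scores with its own statistics.
local_idf = [bm25_idf(s["doc_freq"], s["doc_count"]) for s in shard_stats]

# dfs_query_then_fetch: statistics are summed first, then every shard uses them.
global_idf = bm25_idf(
    sum(s["doc_freq"] for s in shard_stats),
    sum(s["doc_count"] for s in shard_stats),
)

print("local idf per shard:", [round(v, 3) for v in local_idf])
print("global idf:", round(global_idf, 3))
```

With local statistics, an otherwise identical document scores much higher on the first shard than on the second. With global statistics, both shards apply the same weight.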

Aggregations in One Paragraph

Aggregations ride the query phase. As the query collects matching docs on a shard, each Aggregator also collects into buckets/metrics. At the end of the shard's query phase, each aggregator emits an InternalAggregation (its partial result). The coordinator then calls InternalAggregation.reduce(...) to combine all shards' partials into the final aggregation.

flowchart LR
    AB[AggregationBuilder<br/>from JSON] --> AF[AggregatorFactory<br/>per shard]
    AF --> AG[Aggregator<br/>collect per doc]
    AG -->|buildAggregation| IA[InternalAggregation<br/>shard partial]
    IA -->|wire| RED[InternalAggregation.reduce<br/>coordinator]
    RED --> FINAL[final aggregation in SearchResponse]

The crucial subtlety: aggregations can be approximate by design. A terms aggregation asking for the top 10 terms cannot be exact unless every shard returns every term. Instead each shard returns its top shard_size terms, and the coordinator merges. A term that falls just outside the top shard_size on several shards can still belong in the global top 10, and it is then missed or undercounted. That trade-off, and the doc_count_error_upper_bound field that quantifies it, are covered in Lab 7.2 and the Aggregations deep dive.
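
A small worked example makes the miss concrete. This is a simplified Python model, not the real terms aggregator: `shard_partial`, `reduce_terms`, and `run` are hypothetical names, the counts are made up, and the error bound is modelled as the sum, over shards that truncated their list, of the count of the last term each shard returned.

```python
from collections import Counter


def shard_partial(term_counts, shard_size):
    """Per shard: keep the top shard_size terms and remember the truncation bound."""
    top = Counter(term_counts).most_common(shard_size)
    truncated = len(term_counts) > shard_size
    # A term this shard did not return can have at most the count of the
    # last term it did return; 0 if nothing was cut off.
    max_missed = top[-1][1] if truncated and top else 0
    return dict(top), max_missed


def reduce_terms(partials, size):
    """Coordinator: sum counts per term across shards and keep the top size."""
    merged = Counter()
    error_upper_bound = 0
    for counts, max_missed in partials:
        merged.update(counts)
        error_upper_bound += max_missed
    return merged.most_common(size), error_upper_bound


# Hypothetical per-shard term counts: "z" is third on each shard but first overall.
shards = [
    {"x": 10, "y": 9, "z": 8},
    {"p": 12, "q": 11, "z": 8},
]


def run(shard_size, size=2):
    partials = [shard_partial(counts, shard_size) for counts in shards]
    return reduce_terms(partials, size)


for shard_size in (2, 3):
    top, error_bound = run(shard_size)
    print(f"shard_size={shard_size}: top={top} error_upper_bound={error_bound}")
```

With shard_size=2 the term z (16 documents overall) never reaches the coordinator, and the non-zero error bound is the only hint that something may be missing. With shard_size=3 every term is returned and the result is exact.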

QueryBuilders → Lucene Queries, and DocValues

The JSON query DSL is parsed into a tree of QueryBuilder objects (AbstractQueryBuilder subclasses: MatchQueryBuilder, BoolQueryBuilder, RangeQueryBuilder, …). Each builder is Writeable (it crosses the wire to data nodes) and ToXContent (it round-trips to JSON). On the shard, QueryBuilder.toQuery(QueryShardContext) turns the builder into an actual Lucene Query that Lucene's IndexSearcher can execute. See Query DSL and QueryBuilders.

Scoring and matching read the inverted index. But sorting, aggregations, and script field access read columnar DocValues — a per-field, per-document column store that is fast to iterate by document ID. You cannot sort or aggregate on a field with doc_values: false. This is the single most common "why is my aggregation failing?" root cause. See DocValues and Fielddata.
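
The difference between the two structures can be sketched in a few lines. This is a toy Python model under obvious simplifications (three hypothetical documents, plain dicts and lists rather than Lucene's on-disk formats): the inverted index answers "which docs contain this term?", and the doc-values column answers "what is this doc's value?".

```python
from collections import defaultdict

# Hypothetical documents keyed by doc ID.
docs = {
    0: {"carrier": "C1", "delay_min": 45},
    1: {"carrier": "C0", "delay_min": 10},
    2: {"carrier": "C1", "delay_min": 90},
}

# Inverted index: (field, term) -> list of doc IDs. Used for matching.
inverted = defaultdict(list)
for doc_id, fields in docs.items():
    inverted[("carrier", fields["carrier"])].append(doc_id)

# Doc values: field -> column indexed by doc ID. Used for sort and aggregations.
doc_values = {"delay_min": [docs[doc_id]["delay_min"] for doc_id in sorted(docs)]}

matches = inverted[("carrier", "C1")]
ranked = sorted(matches, key=lambda d: doc_values["delay_min"][d], reverse=True)
average = sum(doc_values["delay_min"][d] for d in matches) / len(matches)
print("matches:", matches, "sorted by delay:", ranked, "avg delay:", average)
```

Sorting and averaging never touch the inverted index here. Without the column, the only way to get a document's delay_min would be to scan terms looking for the doc ID, which is the access pattern doc values exist to avoid.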

Key Classes Quick Reference

Class	Package (under `server/src/main/java/`)	Role
`RestSearchAction`	`org.opensearch.rest.action.search`	Parse `_search` REST into a `SearchRequest`
`TransportSearchAction`	`org.opensearch.action.search`	Coordinating-node fan-out + assembly
`SearchPhaseController`	`org.opensearch.action.search`	Coordinating-node reduce of top-K + aggs
`SearchService`	`org.opensearch.search`	Per-shard entry: `executeQueryPhase` / `executeFetchPhase` / `canMatch`
`QueryPhase`	`org.opensearch.search.query`	Per-shard query execution + top-K collection
`FetchPhase`	`org.opensearch.search.fetch`	Per-shard `_source`/field loading for winners
`DfsPhase`	`org.opensearch.search.dfs`	Global term statistics for consistent scoring
`SearchContext` / `DefaultSearchContext`	`org.opensearch.search.internal`	Per-shard search state
`AggregatorFactory`	`org.opensearch.search.aggregations`	Builds an `Aggregator` per shard
`Aggregator`	`org.opensearch.search.aggregations`	Collects docs into buckets/metrics on a shard
`InternalAggregation`	`org.opensearch.search.aggregations`	Shard partial + `reduce(...)` on coordinator
`AggregationBuilder`	`org.opensearch.search.aggregations`	Parsed-from-JSON, wire-serializable agg request
`AbstractQueryBuilder`	`org.opensearch.index.query`	Base for all query DSL builders
`QueryShardContext`	`org.opensearch.index.query`	Per-shard context: mappings, `toQuery`

Exact package paths and class names vary slightly by branch. Confirm with grep, e.g. grep -rn "class SearchPhaseController" server/src/main/java/.

Source Areas to Read

# The coordinating-node search action and reduce
ls server/src/main/java/org/opensearch/action/search/

# Per-shard search service and phases
ls server/src/main/java/org/opensearch/search/
ls server/src/main/java/org/opensearch/search/query/
ls server/src/main/java/org/opensearch/search/fetch/
ls server/src/main/java/org/opensearch/search/dfs/

# The aggregation framework
ls server/src/main/java/org/opensearch/search/aggregations/
ls server/src/main/java/org/opensearch/search/aggregations/bucket/terms/
ls server/src/main/java/org/opensearch/search/aggregations/metrics/

# Query DSL builders
ls server/src/main/java/org/opensearch/index/query/

Labs

Lab	Title	Type
7.1	Trace a Search Through Query and Fetch Phases	Code-reading trace
7.2	Aggregations and the Coordinating-Node Reduce	Hands-on + reading
7.3	Build It — A Custom Aggregation	Build It (Java + plugin)

Deliverables

You must demonstrate all of the following before advancing to Level 8:

A completed reading log tracing RestSearchAction → TransportSearchAction → SearchService.executeQueryPhase/executeFetchPhase → SearchPhaseController, with the grep you used at each hop (Lab 7.1).
A Profile API (_search?profile=true) output correlated to the phases you traced.
A written explanation of shard_size vs size and where doc_count_error_upper_bound comes from (Lab 7.2).
A custom aggregation registered via SearchPlugin, built, installed, and exercised via _search, with a passing AggregatorTestCase (Lab 7.3).
A two-sentence explanation of why sort/aggregations require DocValues.

Common Mistakes

Mistake	Consequence	Fix
Thinking a shard returns documents in the query phase	Wrong mental model of where `_source` is loaded	Query returns IDs+scores; fetch loads `_source`
Assuming `terms` results are exact	Surprising "missing" terms in dashboards	Understand `shard_size`; read `doc_count_error_upper_bound`
Implementing `InternalAggregation.reduce` as "concat shards"	Wrong totals, broken merging	Reduce must combine partials correctly, often re-bucket
Forgetting an agg/query is `Writeable` and `ToXContent`	Wire or round-trip test failures	Implement both; add a serialization test
Sorting/aggregating on a `doc_values:false` field	`IllegalArgumentException` at search time	DocValues required; check the mapping
Setting deep `from`/`size` for pagination	OOM / slow coordinator reduce	Use `search_after` or PIT
Skipping `precommit`/`spotlessApply` on plugin Java	CI red on the PR	Run both before pushing
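
The "concat shards" mistake in the table is easiest to see with an average. The following is a minimal Python sketch, not OpenSearch code: `AvgPartial`, `collect`, and `reduce_avg` are hypothetical stand-ins for a shard partial, the per-shard collect, and the coordinator reduce. The partial carries a sum and a count because a per-shard average alone cannot be merged correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgPartial:
    """What one shard sends to the coordinator for an avg metric."""
    total: float
    count: int


def collect(values):
    """Per shard: fold the matching docs' values into a mergeable partial."""
    return AvgPartial(total=sum(values), count=len(values))


def reduce_avg(partials):
    """Coordinator: combine partials, then compute the final value once."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return total / count if count else None


# Hypothetical matching values on two shards of very different size.
shard_values = [[10, 20, 30], [100]]

correct = reduce_avg([collect(values) for values in shard_values])
naive = sum(sum(v) / len(v) for v in shard_values) / len(shard_values)
print("reduce over (sum, count):", correct)
print("average of shard averages:", naive)
```

The naive version weights the single document on the second shard as heavily as the three on the first. A serialization test plus a reduce test over uneven shards catches this class of bug early.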

Where This Level Points Next

The deep dives this level depends on: Search Execution, Aggregations, Query DSL and QueryBuilders, DocValues and Fielddata.
Level 8 takes you from curated labs to real GitHub issues.
The issue roadmap stage on search and aggregations lists the kinds of search/agg issues you are now equipped to take.

Lab 7.1: Trace a Search Through Query and Fetch Phases

Background

A _search request travels through more layers than most other requests in OpenSearch. It enters as HTTP, is parsed into a SearchRequest, fans out from the coordinating node to every target shard, runs a query phase on each shard, is reduced on the coordinator, fans out again for a fetch phase, and is assembled into a SearchResponse. If you cannot point at the class and method for each of those hops, you cannot debug a search bug, a slow query, or a partial failure.

This is a timed code-reading trace. You will not modify code. You will follow a single search from REST to response, leaving a grep trail at every hop, and then correlate your trace with the Profile API so the abstract phases become concrete timings on a real cluster.

Why This Lab Matters for Contributors

Search bugs are reported as "wrong results" or "slow" — both require you to know which phase.
Partial failures (_shards.failed > 0) only make sense if you know the fan-out.
Pagination cost (from/size/search_after) is invisible until you see the reduce.
Almost every search PR touches a class on one of the five hops this lab traces. Read them before you change them.

Prerequisites

A built OpenSearch checkout (Level 1). Confirm ./gradlew assemble succeeds.
The ability to run a local node: ./gradlew run (single-node, REST on :9200).
You have read the Level 7 overview, especially the query-then-fetch diagram.
Recommended reading alongside: Search Execution deep dive.

Set up a shell variable so the curls are copy-pasteable:

export OS=http://localhost:9200

Step-by-Step Tasks

This lab is timed: aim for 90 minutes. Each step has a target time. If you blow past it, write down where you got stuck — that confusion is the lab's real output.

Step 1 — Spin up a cluster and index sample data (10 min)

./gradlew run

In a second terminal, create an index with a few shards so the fan-out is real (a single-shard index hides the scatter-gather):

curl -s -XPUT "$OS/flights" -H 'Content-Type: application/json' -d '{
  "settings": { "number_of_shards": 3, "number_of_replicas": 0 },
  "mappings": { "properties": {
    "carrier":  { "type": "keyword" },
    "dest":     { "type": "keyword" },
    "delay_min":{ "type": "integer" },
    "ts":       { "type": "date" }
  }}
}'

for i in $(seq 1 50); do
  # Zero-pad the day: the default date format (strict_date_optional_time) rejects "2026-06-5".
  day=$(printf '%02d' $((1 + RANDOM % 15)))
  curl -s -XPOST "$OS/flights/_doc" -H 'Content-Type: application/json' -d "{
    \"carrier\":\"C$((RANDOM%4))\",\"dest\":\"D$((RANDOM%6))\",
    \"delay_min\":$((RANDOM%120)),\"ts\":\"2026-06-${day}T00:00:00Z\"
  }" >/dev/null
done
curl -s -XPOST "$OS/flights/_refresh" >/dev/null
echo "indexed"

Run a baseline search and keep the response open in another buffer:

curl -s "$OS/flights/_search?size=5&sort=delay_min:desc" \
  -H 'Content-Type: application/json' -d '{ "query": { "range": { "delay_min": { "gte": 30 } } } }' | head -40

Step 2 — Hop 1: REST parsing in `RestSearchAction` (10 min)

The HTTP request is handled by a RestHandler. Find it:

grep -rn "class RestSearchAction" server/src/main/java/org/opensearch/rest/action/search/
grep -n "_search" server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java | head

Read prepareRequest(...) and parseSearchRequest(...). Answer in your reading log:

Which method maps the URL params (from, size, search_type, scroll) onto the SearchRequest/SearchSourceBuilder?
Where does the request body JSON get parsed into a SearchSourceBuilder? (look for SearchSourceBuilder.fromXContent or parseSearchSource).
What ActionType is dispatched? (grep for client.execute( / SearchAction.INSTANCE.)

grep -n "SearchAction" server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java

Step 3 — Hop 2: coordinating node fan-out in `TransportSearchAction` (20 min)

This is the heart of the coordinating node. Locate it:

grep -rn "class TransportSearchAction" server/src/main/java/org/opensearch/action/search/

Read doExecute(...) and follow it into the search phase machinery. The fan-out is implemented as a sequence of SearchPhase objects. Grep for the phase classes:

ls server/src/main/java/org/opensearch/action/search/ | grep -i "Phase\|SearchAction\|Controller"

You are looking for the abstract async search phase driver (often AbstractSearchAsyncAction) and concrete phases. Trace these questions:

How does the coordinator decide which shards to target? (index resolution → RoutingTable → GroupShardsIterator<SearchShardIterator>). Grep:
```
grep -rn "SearchShardIterator\|GroupShardsIterator" server/src/main/java/org/opensearch/action/search/ | head
```

Where is the can_match pre-filter phase? It can prune shards before the query phase:

grep -rn "CanMatch\|canMatch" server/src/main/java/org/opensearch/action/search/ | head

What is the difference between query_then_fetch and dfs_query_then_fetch in the dispatch?

grep -rn "DFS_QUERY_THEN_FETCH\|QUERY_THEN_FETCH" server/src/main/java/org/opensearch/action/search/SearchType.java

Note: The coordinating node never holds the documents during the query phase — it holds SearchPhaseResult objects containing top doc IDs, scores, sort values, and aggregation partials. Confirm this by reading what the query-phase results class carries:
grep -rn "class QuerySearchResult" server/src/main/java/org/opensearch/search/query/
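
The can_match question in this step boils down to a range-overlap test. Here is a minimal Python sketch under stated assumptions: `can_match` is a hypothetical function, the per-shard min/max values are made up and encoded as YYYYMMDD integers, and only a single range filter is considered. The real pre-filter works on rewritten queries and is more general.

```python
def can_match(field_range, gte=None, lte=None):
    """Return False only when the shard provably holds no doc in [gte, lte]."""
    low, high = field_range
    if gte is not None and high < gte:
        return False
    if lte is not None and low > lte:
        return False
    return True


# Hypothetical min/max of a date field per shard, as YYYYMMDD integers.
shard_ranges = {
    "logs-2026.06.01[0]": (20260601, 20260601),
    "logs-2026.06.02[0]": (20260602, 20260602),
    "logs-2026.06.03[0]": (20260603, 20260603),
    "logs-2026.06.04[0]": (20260604, 20260604),
}

targets = [
    shard
    for shard, field_range in shard_ranges.items()
    if can_match(field_range, gte=20260603)
]
print("query phase goes to:", targets)
```

The check is deliberately one-sided: a True answer means "might match", never "does match", so pruning can skip work but cannot change results.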

Step 4 — Hop 3: the per-shard query phase (15 min)

The fan-out lands on each target shard at SearchService. Find the entry point:

grep -n "executeQueryPhase" server/src/main/java/org/opensearch/search/SearchService.java | head

Follow executeQueryPhase into QueryPhase:

grep -rn "class QueryPhase" server/src/main/java/org/opensearch/search/query/
grep -n "execute\|executeInternal\|TopDocs\|collector" server/src/main/java/org/opensearch/search/query/QueryPhase.java | head -30

Answer:

Where is the SearchContext/DefaultSearchContext created, and what does it hold? (the Lucene IndexSearcher, the parsed Query, the from/size, the aggregators).
```
grep -rn "class DefaultSearchContext" server/src/main/java/org/opensearch/search/internal/
```
Where does the QueryBuilder become a Lucene Query? (look for searchContext.query() set via QueryShardContext.toQuery).
How many docs does each shard collect for a request of from=10000&size=10? (answer: top from + size = 10010 — this is the deep-pagination cost).

Step 5 — Hop 4: the coordinating-node reduce in `SearchPhaseController` (10 min)

After all shards return their query results, the coordinator merges them:

grep -rn "class SearchPhaseController" server/src/main/java/org/opensearch/action/search/
grep -n "reducedQueryPhase\|sortDocs\|merge" server/src/main/java/org/opensearch/action/search/SearchPhaseController.java | head -20

This is where:

The global top size is computed from per-shard top-Ks (a merge of sorted lists).
Aggregations are reduced (InternalAggregations.reduce / InternalAggregation.reduce).
The coordinator decides which doc IDs on which shards must be fetched.

Find the agg reduce call:

grep -rn "InternalAggregation" server/src/main/java/org/opensearch/action/search/SearchPhaseController.java | head

Step 6 — Hop 5: the fetch phase (10 min)

Only winning docs are fetched. Entry point:

grep -n "executeFetchPhase" server/src/main/java/org/opensearch/search/SearchService.java | head
grep -rn "class FetchPhase" server/src/main/java/org/opensearch/search/fetch/

Answer:

What does FetchPhase load that the query phase did not? (_source, stored fields, highlights, script fields — via FetchSubPhase implementations).
```
ls server/src/main/java/org/opensearch/search/fetch/subphase/
```
Why is the fetch fan-out usually smaller than the query fan-out? (only shards owning winners).
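
The second answer can be sketched directly. This is a minimal Python model, not the coordinator's actual code: `plan_fetch` and `assemble` are hypothetical helpers, the winners list is made up, and the `fetched` dict stands in for each shard loading _source for the doc IDs it was asked for.

```python
from collections import defaultdict


def plan_fetch(winners):
    """Group the final page's (shard_id, doc_id) pairs into one request per shard."""
    requests = defaultdict(list)
    for shard_id, doc_id in winners:
        requests[shard_id].append(doc_id)
    return dict(requests)


def assemble(winners, fetched):
    """Put fetched documents back into global rank order."""
    return [fetched[shard_id][doc_id] for shard_id, doc_id in winners]


# Hypothetical reduce output for size=5 over shards 0, 1 and 2, best hit first.
winners = [(2, 17), (0, 4), (2, 3), (0, 9), (2, 8)]
requests = plan_fetch(winners)

# Stand-in for each shard loading _source for the doc IDs it was asked for.
fetched = {
    shard_id: {doc_id: {"_id": f"{shard_id}:{doc_id}"} for doc_id in doc_ids}
    for shard_id, doc_ids in requests.items()
}
hits = assemble(winners, fetched)
print("fetch requests:", requests)
print("hits in rank order:", [hit["_id"] for hit in hits])
```

Shard 1 took part in the query phase but owns no winner, so it receives no fetch request at all.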

Step 7 — Correlate with the Profile API (15 min)

Now make the phases concrete. The Profile API instruments per-shard timings:

curl -s "$OS/flights/_search" -H 'Content-Type: application/json' -d '{
  "profile": true,
  "size": 5,
  "sort": [ { "delay_min": "desc" } ],
  "query": { "bool": { "must": [
    { "range": { "delay_min": { "gte": 30 } } },
    { "terms": { "carrier": ["C0","C1"] } }
  ]}},
  "aggs": { "by_carrier": { "terms": { "field": "carrier" } } }
}' > /tmp/profile.json

# How many shards reported? (this is your fan-out width)
# The response is a single unformatted line and each shard entry is keyed by "id", so count with a JSON parser rather than grep -c.
python3 -c "import json;d=json.load(open('/tmp/profile.json'));print('shards:',len(d['profile']['shards']))"

Inspect one shard's profile. You will see the query rewritten into Lucene query nodes (BooleanQuery, IndexOrDocValuesQuery, TermInSetQuery), each with time_in_nanos broken into create_weight, build_scorer, next_doc, score, etc., plus a collector tree and an aggregations section.

python3 - <<'PY'
import json
d=json.load(open('/tmp/profile.json'))
sh=d['profile']['shards'][0]
print("shard:", sh['id'])
q=sh['searches'][0]['query'][0]
print("top query node:", q['type'], "time_ns:", q['time_in_nanos'])
print("collector:", sh['searches'][0]['collector'][0]['name'])
print("aggs:", [a['type'] for a in sh.get('aggregations',[])])
PY

Map what you see back onto your trace:

Profile section	Phase you traced	Class
`query` rewrite tree	query phase, query → Lucene	`QueryPhase`, `QueryShardContext.toQuery`
`collector` tree	query phase top-K collection	`QueryPhase` collectors
`aggregations`	per-shard collect	`Aggregator`
(not shown — coordinator-side)	reduce	`SearchPhaseController`

Note: The Profile API shows only per-shard work. The coordinator-side reduce and the fetch phase are not in the profile output. That gap is itself a lesson: profiling tells you about shard cost, not coordinator cost. Deep pagination makes every shard rank a larger top-K, which the profile does show, but the merge of numShards × (from + size) entries happens in the coordinator's reduce, which the profile cannot see.
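To compare shard cost across the fan-out, a small helper can total the query time per shard. This is a sketch: the function name `shard_query_nanos` and the sample values are invented, and the sample only mimics the fields used in the steps above (`profile.shards[].id`, `searches[].query[].time_in_nanos`).

```python
def shard_query_nanos(profile_response):
    """Return {shard id: total time_in_nanos of its top-level query nodes}."""
    totals = {}
    for shard in profile_response["profile"]["shards"]:
        nanos = 0
        for search in shard.get("searches", []):
            for node in search.get("query", []):
                # A node's time_in_nanos already includes its children,
                # so only top-level nodes are summed here.
                nanos += node["time_in_nanos"]
        totals[shard["id"]] = nanos
    return totals

# Hand-written sample in the same shape as the profile output (values invented).
sample = {
    "profile": {
        "shards": [
            {
                "id": "[node1][flights][0]",
                "searches": [{"query": [{
                    "type": "BooleanQuery",
                    "time_in_nanos": 1200,
                    "children": [{"type": "TermInSetQuery", "time_in_nanos": 400}],
                }]}],
            },
            {
                "id": "[node1][flights][1]",
                "searches": [{"query": [{"type": "BooleanQuery", "time_in_nanos": 3000}]}],
            },
        ]
    }
}

totals = shard_query_nanos(sample)
slowest = max(totals, key=totals.get)
print(slowest, totals[slowest])
```

To run it against your own capture, replace `sample` with `json.load(open('/tmp/profile.json'))`.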

Pagination: `from`/`size` vs `search_after` (reading exercise)

Run both and compare what the coordinator must do:

# Deep from/size — every shard must produce top 110 docs
curl -s "$OS/flights/_search?from=100&size=10&sort=delay_min:desc" -o /dev/null -w "from/size: %{http_code}\n"

# search_after — cursor by the last page's sort values (cheap, stateless)
LAST=$(curl -s "$OS/flights/_search?size=5&sort=delay_min:desc" | python3 -c "import json,sys;h=json.load(sys.stdin)['hits']['hits'];print(json.dumps(h[-1]['sort']))")
curl -s "$OS/flights/_search" -H 'Content-Type: application/json' -d "{
  \"size\":5, \"sort\":[{\"delay_min\":\"desc\"}], \"search_after\": $LAST }" -o /dev/null -w "search_after: %{http_code}\n"

In your log, explain: for from=10000, why does every shard pay, and why does search_after not? (Because from/size requires each shard to materialize the top from+size; search_after seeks each shard directly to the cursor and collects only size.)
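The cost difference can be made concrete with a toy model. It assumes unique sort values (so no tiebreaker is needed) and uses a full sort per shard where a real shard would use a bounded priority queue; the function names and data are invented for illustration.

```python
import heapq

def page_from_size(shards, from_, size):
    """from/size: every shard ranks its top from+size, the coordinator merges them all."""
    per_shard = [sorted(values, reverse=True)[: from_ + size] for values in shards]
    moved = sum(len(part) for part in per_shard)  # entries sent to the coordinator
    merged = heapq.nlargest(from_ + size, (v for part in per_shard for v in part))
    return merged[from_:], moved

def page_search_after(shards, after, size):
    """search_after: every shard skips to the cursor and returns only `size` entries."""
    per_shard = [
        sorted((v for v in values if v < after), reverse=True)[:size]
        for values in shards
    ]
    moved = sum(len(part) for part in per_shard)
    merged = heapq.nlargest(size, (v for part in per_shard for v in part))
    return merged, moved

# Three shards holding the values 0..2999, sorted descending at query time.
shards = [list(range(start, 3000, 3)) for start in range(3)]

page_a, moved_a = page_from_size(shards, from_=100, size=10)
cursor = 2900  # sort value of the last hit on the previous page
page_b, moved_b = page_search_after(shards, after=cursor, size=10)

print(page_a == page_b, moved_a, moved_b)  # True 330 30
```

Both calls return the same page, but from/size moves 3 × 110 entries to the coordinator while search_after moves 3 × 10. At from=10000 the first number grows to 3 × 10010; the second does not change.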

Implementation Requirements

This is a reading/tracing lab. Your deliverable is a reading-log artifact — a Markdown file with one section per hop:

## Hop 1: RestSearchAction
- file: server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java
- grep used: grep -n "_search" .../RestSearchAction.java
- method: prepareRequest -> parseSearchRequest
- what crosses to next hop: a SearchRequest (with SearchSourceBuilder)

## Hop 2: TransportSearchAction  ... (etc for all five hops)

Each hop section must contain: the file path, the grep command you ran, the method, and what crosses the wire to the next hop.
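If you want to self-check the log before submitting it, a few lines of Python can flag incomplete hops. This is an optional helper, not part of the lab tooling: the function name `missing_fields` is invented, and it assumes you label the bullets exactly as in the Hop 1 example above.

```python
import re

REQUIRED = ("file:", "grep used:", "method:", "what crosses to next hop:")

def missing_fields(markdown):
    """Return {hop heading: [missing field labels]} for a reading log."""
    # Split on level-2 headings; the first chunk is any text before the first hop.
    sections = re.split(r"^## ", markdown, flags=re.MULTILINE)[1:]
    report = {}
    for section in sections:
        heading, _, body = section.partition("\n")
        missing = [label for label in REQUIRED if f"- {label}" not in body]
        if missing:
            report[heading.strip()] = missing
    return report

log = """## Hop 1: RestSearchAction
- file: server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java
- grep used: grep -n "_search" .../RestSearchAction.java
- method: prepareRequest -> parseSearchRequest
- what crosses to next hop: a SearchRequest (with SearchSourceBuilder)

## Hop 2: TransportSearchAction
- file: (to be filled in)
- method: (to be filled in)
"""

print(missing_fields(log))
```

An empty dict means every hop section carries all four required items.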

Expected Output

A flights index with 3 shards and ~50 docs.
A Profile API JSON whose shard count equals your index's shard count.
A reading log covering all five hops plus the pagination exercise.
A clear statement of which two things the Profile API does not show (reduce, fetch).

Troubleshooting

Symptom	Cause	Fix
`_search` returns `_shards.total: 1`	Index has one shard; no fan-out to observe	Recreate with `number_of_shards: 3`
Profile JSON has empty `aggregations`	No `aggs` in the request	Add the `terms` agg from Step 7
`grep` finds the class in `test/` not `main/`	You matched a test	Restrict the path to `server/src/main/java/`
`search_after` returns the same page	Wrong/stale `sort` cursor	Re-read the last hit's `sort` array each page

Stretch Goals

Add "explain": true to a search and correlate the BM25 explanation with the score timing in the profile. Where does the term frequency come from on a single shard?
Force search_type=dfs_query_then_fetch and diff the scores against query_then_fetch on a multi-shard index. Explain the difference using the DfsPhase.

Find where terminate_after short-circuits the query-phase collector:

grep -rn "terminate_after\|terminateAfter" server/src/main/java/org/opensearch/search/ | head

Validation / Self-check

You are done when you can answer, without re-reading:

Name the five classes, in order, that a _search passes through from REST to response.
In the query phase, does a shard return documents? If not, what does it return?
Why is from=10000&size=10 expensive, and which class pays the cost?
What does the fetch phase load that the query phase did not, and why is its fan-out smaller?
Where does a QueryBuilder become a Lucene Query, and what context object performs the conversion?
What does can_match do, and which class implements it?
Name two things the Profile API does not measure.

Cross-references: Search Execution deep dive, Query DSL and QueryBuilders, Lab 7.2: Aggregations, Capstone Step 2: Reproduction.

Lab 7.2: Aggregations and the Coordinating-Node Reduce

Background

Aggregations are OpenSearch's analytics engine: terms, date_histogram, avg, cardinality, and arbitrarily nested sub-aggregations are what power dashboards. Like search, aggregations run per shard and then reduce on the coordinating node, which is whichever node received the request (it has nothing to do with the cluster-manager role, formerly called master). Unlike top-K search, some aggregations are approximate by design, and a contributor who does not understand where the approximation comes from will write incorrect reduce logic and misread bug reports.

In this lab you run real aggregations with curl, then read the framework code that produced them: AggregatorFactory → Aggregator → InternalAggregation.reduce. You will see exactly which work happens on the shard and which happens on the coordinator, and you will measure shard_size vs size approximation error on a real index.

Why This Lab Matters for Contributors

Aggregation bugs are subtle: "wrong count," "missing bucket," "wrong total after merge" are all reduce bugs. You cannot fix them without understanding the collect/reduce split.
The most common new-contributor mistake in this subsystem is implementing reduce as a naive concatenation. Seeing a correct reduce makes the trap obvious.
Lab 7.3 asks you to build an aggregation. This lab is the reading you need first.

Prerequisites

A running node: ./gradlew run. Set export OS=http://localhost:9200.
Lab 7.1 completed (you know where the query phase is).
Read alongside: Aggregations deep dive and DocValues and Fielddata.

Step-by-Step Tasks

Step 1 — Index data with a deliberate skew (10 min)

To see approximation error you need a term distribution where a globally significant term is not locally significant on every shard. Build that on purpose:

curl -s -XPUT "$OS/sales" -H 'Content-Type: application/json' -d '{
  "settings": { "number_of_shards": 5, "number_of_replicas": 0 },
  "mappings": { "properties": {
    "product": { "type": "keyword" },
    "amount":  { "type": "double" },
    "ts":      { "type": "date" }
  }}
}'

# 26 products A..Z; "A" is dominant overall but spread thin across shards
for i in $(seq 1 2000); do
  R=$((RANDOM%100))
  if [ $R -lt 8 ]; then P="A"; else P=$(printf "\\$(printf '%03o' $((66+RANDOM%25)))"); fi
  curl -s -XPOST "$OS/sales/_doc" -H 'Content-Type: application/json' -d "{
    \"product\":\"$P\",\"amount\":$((RANDOM%500)).$((RANDOM%99)),
    \"ts\":\"2026-06-$(printf '%02d' $((1+RANDOM%28)))T00:00:00Z\" }" >/dev/null
done
curl -s -XPOST "$OS/sales/_refresh" >/dev/null
echo "loaded"

Step 2 — Run a `terms` + `date_histogram` and read the response (10 min)

curl -s "$OS/sales/_search?size=0" -H 'Content-Type: application/json' -d '{
  "aggs": {
    "top_products": {
      "terms": { "field": "product", "size": 5 },
      "aggs": { "avg_amount": { "avg": { "field": "amount" } } }
    },
    "over_time": {
      "date_histogram": { "field": "ts", "calendar_interval": "week" },
      "aggs": { "revenue": { "sum": { "field": "amount" } } }
    }
  }
}' | python3 -m json.tool | head -60

Note size=0 — you do not want hits, only aggregations, so the query phase collects buckets but skips top-K. Find the two key fields in the top_products response:

doc_count_error_upper_bound: the largest number of docs that could be missing from any term's count, because each shard reports only its top shard_size terms.
sum_other_doc_count: the number of docs in terms that fell outside the returned top size.

These two fields are the approximation, made visible.

Step 3 — Read the framework: factory → aggregator → internal (25 min)

Locate the framework root:

ls server/src/main/java/org/opensearch/search/aggregations/

Read these four touchpoints. For each, grep first, then read the named method.

(a) AggregationBuilder — the parsed-from-JSON, wire-serializable request. The terms builder:

grep -rn "class TermsAggregationBuilder" server/src/main/java/org/opensearch/search/aggregations/bucket/terms/

Read how size, shardSize, and field are parsed and serialized (innerWriteTo / doXContentBody; the values-source base class owns writeTo and the field itself). Note: it is both Writeable (crosses to data nodes) and ToXContent (round-trips to JSON).

(b) AggregatorFactory — builds an Aggregator per shard:

grep -rn "class AggregatorFactory" server/src/main/java/org/opensearch/search/aggregations/
grep -rn "class TermsAggregatorFactory" server/src/main/java/org/opensearch/search/aggregations/bucket/terms/

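Read doCreateInternal(...) (the base AggregatorFactory declares createInternal; values-source factories such as this one override doCreateInternal) and follow it to where the concrete Aggregator is chosen (e.g. GlobalOrdinalsStringTermsAggregator vs MapStringTermsAggregator) based on field type and cardinality. This is a performance decision (global ordinals → fast). See DocValues and Fielddata.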

(c) Aggregator — collects per document on a shard:

grep -rn "abstract class Aggregator" server/src/main/java/org/opensearch/search/aggregations/Aggregator.java
grep -n "getLeafCollector\|collect\|buildAggregations\|buildAggregation" \
  server/src/main/java/org/opensearch/search/aggregations/Aggregator.java

The two methods that matter:

getLeafCollector(...) returns a LeafBucketCollector whose collect(doc, bucket) is called for every matching doc — this is the collect loop, reading DocValues.
buildAggregations(...) / buildAggregation(...) produces the shard's InternalAggregation partial at the end of the query phase.

(d) InternalAggregation — the partial result and the reduce logic:

grep -rn "abstract class InternalAggregation" server/src/main/java/org/opensearch/search/aggregations/InternalAggregation.java
grep -n "reduce" server/src/main/java/org/opensearch/search/aggregations/InternalAggregation.java | head

reduce(List<InternalAggregation> aggregations, ReduceContext) runs on the coordinator. For terms (InternalTerms.reduce):

grep -rn "class InternalTerms" server/src/main/java/org/opensearch/search/aggregations/bucket/terms/
grep -n "reduce\|reduceBuckets\|merge" server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java | head

Read reduceBuckets. Observe that reduce is not concatenation: it merges buckets by key, sums their doc_count, recursively reduces each bucket's sub-aggregations, re-sorts, and re-applies size. This is the single most important method in the aggregation framework.
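The shape of that merge can be sketched in Python. This is a simplified model, not the InternalTerms implementation: bucket dicts stand in for InternalTerms buckets, the only sub-aggregation is an avg carried as sum and count, and the name `reduce_terms` is invented.

```python
from collections import defaultdict

def reduce_terms(shard_partials, size):
    """Merge per-shard terms buckets by key, reduce sub-aggs, re-sort, apply size."""
    merged = defaultdict(lambda: {"doc_count": 0, "sum": 0.0, "count": 0})
    for partial in shard_partials:
        for bucket in partial:
            acc = merged[bucket["key"]]
            acc["doc_count"] += bucket["doc_count"]
            # Recursive step: the bucket's avg sub-agg is reduced from its accumulators.
            acc["sum"] += bucket["avg"]["sum"]
            acc["count"] += bucket["avg"]["count"]
    # Order by doc_count descending, then key, and keep only the top `size`.
    ordered = sorted(merged.items(), key=lambda kv: (-kv[1]["doc_count"], kv[0]))
    return [
        {"key": key, "doc_count": acc["doc_count"], "avg": acc["sum"] / acc["count"]}
        for key, acc in ordered[:size]
    ]

shard0 = [
    {"key": "A", "doc_count": 3, "avg": {"sum": 30.0, "count": 3}},
    {"key": "B", "doc_count": 2, "avg": {"sum": 10.0, "count": 2}},
]
shard1 = [
    {"key": "B", "doc_count": 4, "avg": {"sum": 40.0, "count": 4}},
    {"key": "C", "doc_count": 1, "avg": {"sum": 7.0, "count": 1}},
]

result = reduce_terms([shard0, shard1], size=2)
print([(b["key"], b["doc_count"]) for b in result])  # [('B', 6), ('A', 3)]
```

A naive concatenation would return four buckets with B listed twice and ranked below A; the merge promotes B to first place with a count of 6.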

Step 4 — Collect vs reduce: trace one `terms` request through both (10 min)

Fill in this table from what you read (it is the lab's core artifact):

Step	Where	What happens	Reads
Parse	coordinator	JSON → `TermsAggregationBuilder`	—
Serialize + fan out	coordinator → shards	builder crosses the wire (`Writeable`)	—
Build aggregator	each shard	`TermsAggregatorFactory.doCreateInternal`	mapping/field type
Collect	each shard	`LeafBucketCollector.collect(doc, owningBucket)` per matching doc	DocValues
Build partial	each shard	`buildAggregations` → top `shard_size` terms as `InternalTerms`	—
Reduce	coordinator	`InternalTerms.reduce` merges by key, sums counts, re-sorts, applies `size`	—
Sub-agg reduce	coordinator	recursive `reduce` of each bucket's `avg_amount`	—

Step 5 — Measure `shard_size` vs `size` approximation error (15 min)

Run the same terms agg three ways and compare the returned count for product A and the doc_count_error_upper_bound:

agg () {
  curl -s "$OS/sales/_search?size=0" -H 'Content-Type: application/json' -d "{
    \"aggs\": { \"p\": { \"terms\": { \"field\": \"product\", \"size\": 5, \"shard_size\": $1 } } } }" \
  | python3 -c "import json,sys;d=json.load(sys.stdin)['aggregations']['p'];print('shard_size=$1','err_bound=',d['doc_count_error_upper_bound'],'sum_other=',d['sum_other_doc_count']);[print('  ',b['key'],b['doc_count']) for b in d['buckets']]"
}
agg 1
agg 5
agg 100

You should observe: tiny shard_size inflates doc_count_error_upper_bound and can drop or mis-rank product A; larger shard_size shrinks the error toward zero (at the cost of more data moved in the reduce). Explain why in your log, referencing where each shard truncates to shard_size in buildAggregations before the coordinator ever sees the buckets.

Note: The default shard_size is roughly size * 1.5 + 10 (read it in source rather than trusting this number: grep -rn "shardSize" server/.../TermsAggregationBuilder.java). The trade-off is accuracy vs. data moved and CPU spent in the reduce. A terms agg on a multi-shard index is only guaranteed exact when shard_size covers the field's full cardinality; with a smaller shard_size a given result may still happen to be exact, but nothing guarantees it.
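The effect of shard-side truncation can be reproduced without a cluster. The model below is a simplification under stated assumptions: per-shard term counts are hand-picked, the error bound is modeled as the sum of the last returned count from each shard that truncated, and `default_shard_size` simply encodes the "roughly size * 1.5 + 10" figure quoted above rather than the source of truth.

```python
def default_shard_size(size):
    # Encodes the heuristic quoted in the note; confirm the real formula in source.
    return int(size * 1.5 + 10)

def terms_agg(shards, size, shard_size):
    """Return (top buckets, error bound) after shard truncation and coordinator reduce."""
    merged = {}
    error_bound = 0
    for counts in shards:
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        kept = ranked[:shard_size]  # shard-side truncation
        if len(ranked) > len(kept):
            # A term this shard dropped could have had up to this many docs.
            error_bound += kept[-1][1]
        for term, n in kept:
            merged[term] = merged.get(term, 0) + n
    top = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return top, error_bound

# "A" is the global winner (30 docs) but is second on every shard.
shards = [
    {"B": 12, "A": 10, "C": 1},
    {"C": 12, "A": 10, "D": 1},
    {"D": 12, "A": 10, "B": 1},
]

for shard_size in (1, 2, 3):
    print(shard_size, terms_agg(shards, size=1, shard_size=shard_size))
```

With shard_size=1 the coordinator never sees A and returns B with a bound of 36; with shard_size=2 A is found with its exact count but the bound is still 30; with shard_size=3 every shard returns all of its terms and the bound drops to 0.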

Step 6 — Sub-aggregations and the reduce recursion (10 min)

Your top_products.avg_amount is a metric sub-aggregation. Confirm in source that reduce recurses: when InternalTerms.reduce merges two buckets with the same key, it must also reduce their child aggregations. Grep:

grep -rn "InternalAggregations.reduce\|reduce(.*reduceContext" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java | head

Answer: why can avg be reduced correctly across shards from partials alone? (Because avg's InternalAvg partial carries both the sum and the count, not the average — so the coordinator can recompute sum/count globally.) Confirm:

grep -rn "class InternalAvg" server/src/main/java/org/opensearch/search/aggregations/metrics/
grep -n "getValue\|sum\|count" server/src/main/java/org/opensearch/search/aggregations/metrics/InternalAvg.java | head

This is the design rule for any reducible metric: the partial must carry enough state to reduce losslessly. You will apply this rule directly in Lab 7.3.
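A short numeric check shows why the rule matters. The sketch uses plain dicts in place of InternalAvg, and the function names are invented; only the idea (ship sum and count, divide once at the end) comes from the text above.

```python
def shard_partial(values):
    """What a shard ships: the accumulators, not the derived average."""
    return {"sum": sum(values), "count": len(values)}

def reduce_avg(partials):
    """Coordinator-side reduce: combine accumulators, divide once."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count if count else None

def naive_avg_of_avgs(shards):
    """The lossy alternative: each shard ships only its own average."""
    averages = [sum(values) / len(values) for values in shards if values]
    return sum(averages) / len(averages)

# Two shards with very different doc counts.
shards = [[10.0, 20.0, 30.0, 40.0], [100.0]]

correct = reduce_avg([shard_partial(values) for values in shards])
lossy = naive_avg_of_avgs(shards)
print(correct, lossy)  # 40.0 62.5
```

The average of averages weights the one-doc shard as heavily as the four-doc shard, which is exactly the information the count accumulator preserves.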

Reading Exercises

Answer each in your log with the file + grep:

Find where a terms aggregator truncates to shard_size before emitting its partial.
Find the pipeline aggregations directory — how do pipeline aggs differ from bucket/metric aggs in when they run (they run on the reduce output, not during collect)?
```
ls server/src/main/java/org/opensearch/search/aggregations/pipeline/ | head
```
Find where size=0 causes the query phase to skip top-K hit collection but still run aggregators.
Why does aggregating on a text field fail by default, while keyword works? (Hint: text fields have no DocValues; they become aggregatable only with fielddata: true, which builds an in-heap structure instead.)

Implementation Requirements

Deliverable is a Markdown analysis file containing:

The completed collect vs reduce table from Step 4.
The shard_size measurement from Step 5 with your three runs and an explanation of the trend.
A one-paragraph statement of the reducible-metric rule (partials must carry enough state), citing InternalAvg as the example.

Expected Output

A sales index, 5 shards, ~2000 docs, product A globally dominant.
A terms response showing nonzero doc_count_error_upper_bound at shard_size=1 and 0 at shard_size=100 (at that setting every shard returns all of its terms).
A reading log mapping every framework class you grepped to its role.

Troubleshooting

Symptom	Cause	Fix
`doc_count_error_upper_bound` always 0	One shard, or `shard_size` ≥ cardinality	Use 5 shards; try `shard_size=1`
Aggregating a field throws `IllegalArgumentException`	Field has no DocValues (e.g. `text`)	Use `keyword`, or set `fielddata: true` (costly)
`avg` returns `null`	No docs matched / field missing	Check the query and mapping
Buckets look right but counts are too low	`shard_size` too small	Increase `shard_size` and re-measure

Stretch Goals

Add "show_term_doc_count_error": true to the terms agg and read the per-bucket error in the response. Find where it is computed in InternalTerms.
Build a cardinality agg and read InternalCardinality — it carries a HyperLogLog++ sketch. Why is the sketch (not a count) the correct partial to ship for reduce?
Nest a terms inside a date_histogram and reason about the reduce cost as the product of bucket counts.

Validation / Self-check

Which class runs the per-document collect loop, and which method is called per doc?
Which class and method run the reduce, and on which node?
Why is InternalTerms.reduce not a simple concatenation? Name two things it does.
State the reducible-metric rule and explain why InternalAvg ships sum+count, not the average.
What do doc_count_error_upper_bound and sum_other_doc_count mean, and what makes them nonzero?
Why is a multi-shard terms agg not guaranteed to be exact unless shard_size covers the full cardinality?
Why do aggregations read DocValues rather than the inverted index?

Cross-references: Aggregations deep dive, DocValues and Fielddata, Lab 7.3: Build a Custom Aggregation, Issue roadmap: search and aggregations.

Lab 7.3: Build It — A Custom Aggregation

Background

You have traced the search phases (Lab 7.1) and read the aggregation framework (Lab 7.2). Now you build one. This lab walks you through a complete SearchPlugin that registers a custom metric aggregation called stats_lite: for a numeric field it computes count, sum, min, and max in a single pass — a deliberately small subset of the built-in stats agg, chosen so the code is fully readable while exercising every required moving part.

You will write all five pieces the framework requires:

Plugin implements SearchPlugin with getAggregations() registration.
StatsLiteAggregationBuilder — Writeable + ToXContent, parsed from JSON.
StatsLiteAggregatorFactory — builds the per-shard aggregator.
StatsLiteAggregator — the per-shard collect + buildAggregation.
InternalStatsLite — the partial result and the reduce that merges shards.

Then you build, install, and exercise it via _search, and write a test extending AggregatorTestCase.

Why a metric and not a bucket agg? A metric agg has the cleanest reduce: the partial carries raw accumulators (count/sum/min/max), and reduce just combines them. That makes the reducible-metric rule from Lab 7.2 concrete: ship the accumulators, not the derived answer.

Why This Lab Matters for Contributors

New aggregations and queries are a steady stream of real OpenSearch issues and RFCs. This is the exact shape of that work.
Getting Writeable + ToXContent + reduce correct is the skill that separates a mergeable PR from a "back to the drawing board" review.
The AggregatorTestCase harness you use here is the same one core uses for built-in aggs.

Prerequisites

A built checkout; ./gradlew assemble green.
Lab 7.2 done: you know AggregatorFactory / Aggregator / InternalAggregation.
Read alongside: Plugin Architecture, Aggregations deep dive, Serialization and BWC.

Step-by-Step Tasks

Step 1 — Scaffold the plugin module (15 min)

Create a plugin under plugins/ (the in-repo plugin location). Use an existing small plugin as a template for the Gradle wiring:

ls plugins/                       # see analysis-icu, mapper-murmur3, etc.
cat plugins/mapper-murmur3/build.gradle | head -40   # a minimal plugin build file

Create the directory tree:

mkdir -p plugins/agg-stats-lite/src/main/java/org/opensearch/aggregations/statslite
mkdir -p plugins/agg-stats-lite/src/test/java/org/opensearch/aggregations/statslite

plugins/agg-stats-lite/build.gradle:

apply plugin: 'opensearch.opensearchplugin'
apply plugin: 'opensearch.yaml-rest-test'

opensearchplugin {
  description = 'Adds a stats_lite metric aggregation (count/sum/min/max).'
  classname = 'org.opensearch.aggregations.statslite.StatsLitePlugin'
}

// No external deps; this plugin only uses server APIs.

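Then check how plugin projects are registered in settings.gradle:

grep -n "project(\":plugins" settings.gradle | head
# if plugins are listed explicitly, add a line:  include ':plugins:agg-stats-lite'
# if plugin directories are discovered automatically, no change is needed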

Note: Every source file you create needs the SPDX header (the repo's CONTRIBUTING.md and PR quality cover the conventions). ./gradlew precommit will fail without it.

Step 2 — The `InternalStatsLite` partial + reduce (the heart) (25 min)

This is the class to get right. It carries the accumulators so reduce is lossless.

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */
package org.opensearch.aggregations.statslite;

import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.xcontent.XContentBuilder;
import org.opensearch.search.aggregations.InternalAggregation;
import org.opensearch.search.aggregations.metrics.InternalNumericMetricsAggregation;

import java.io.IOException;
import java.util.List;
import java.util.Map;

public class InternalStatsLite extends InternalNumericMetricsAggregation.MultiValue {

    private final long count;
    private final double sum;
    private final double min;
    private final double max;

    public InternalStatsLite(String name, long count, double sum, double min, double max, Map<String, Object> metadata) {
        super(name, metadata);
        this.count = count;
        this.sum = sum;
        this.min = min;
        this.max = max;
    }

    // Wire deserialization — order MUST match writeTo.
    public InternalStatsLite(StreamInput in) throws IOException {
        super(in);
        this.count = in.readVLong();
        this.sum = in.readDouble();
        this.min = in.readDouble();
        this.max = in.readDouble();
    }

    @Override
    protected void doWriteTo(StreamOutput out) throws IOException {
        out.writeVLong(count);
        out.writeDouble(sum);
        out.writeDouble(min);
        out.writeDouble(max);
    }

    @Override
    public String getWriteableName() {
        return StatsLiteAggregationBuilder.NAME; // "stats_lite"
    }

    // THE REDUCE: combine shard partials losslessly. Not a concat — a merge of accumulators.
    @Override
    public InternalStatsLite reduce(List<InternalAggregation> aggregations, ReduceContext reduceContext) {
        long count = 0;
        double sum = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (InternalAggregation agg : aggregations) {
            InternalStatsLite s = (InternalStatsLite) agg;
            count += s.count;
            sum += s.sum;
            min = Math.min(min, s.min);
            max = Math.max(max, s.max);
        }
        return new InternalStatsLite(name, count, sum, min, max, getMetadata());
    }

    @Override
    public double value(String name) {
        switch (name) {
            case "count": return count;
            case "sum":   return sum;
            case "min":   return min;
            case "max":   return max;
            default: throw new IllegalArgumentException("unknown value [" + name + "]");
        }
    }

    @Override
    public XContentBuilder doXContentBody(XContentBuilder builder, Params params) throws IOException {
        builder.field("count", count);
        builder.field("sum", sum);
        // min/max are meaningless with zero docs; emit null like core does.
        builder.field("min", count == 0 ? null : min);
        builder.field("max", count == 0 ? null : max);
        return builder;
    }

    // count/sum/min/max accessors for tests
    public long getCount() { return count; }
    public double getSum() { return sum; }
    public double getMin() { return min; }
    public double getMax() { return max; }
}

Warning: doWriteTo and the StreamInput constructor must read/write fields in exactly the same order. A mismatch is silent corruption across the wire — the canonical BWC bug. Test it with a round-trip (Step 7). See Serialization and BWC.
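The failure mode is easy to demonstrate outside OpenSearch. The sketch below uses Python's `struct` module as a stand-in for StreamOutput/StreamInput; it writes a fixed-width 64-bit count rather than a variable-length VLong, and the function names are invented for illustration.

```python
import struct

WIRE = ">qddd"  # count as int64, then sum, min, max as big-endian doubles

def write_partial(count, total, lo, hi):
    return struct.pack(WIRE, count, total, lo, hi)

def read_partial(buf):
    """Reads fields in the same order they were written."""
    count, total, lo, hi = struct.unpack(WIRE, buf)
    return {"count": count, "sum": total, "min": lo, "max": hi}

def read_partial_swapped(buf):
    """Deliberate bug: reads min before sum. No exception is raised."""
    count, lo, total, hi = struct.unpack(WIRE, buf)
    return {"count": count, "sum": total, "min": lo, "max": hi}

buf = write_partial(3, 60.0, 10.0, 30.0)
good = read_partial(buf)
bad = read_partial_swapped(buf)
print(good)
print(bad)
```

Both readers consume the same number of bytes, so nothing fails at the transport level; only a round-trip equality check catches that sum and min have traded places.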

Step 3 — The `StatsLiteAggregator` (per-shard collect) (20 min)

/* SPDX header ... */
package org.opensearch.aggregations.statslite;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ScoreMode;
import org.opensearch.common.lease.Releasables;
import org.opensearch.common.util.BigArrays;
import org.opensearch.common.util.DoubleArray;
import org.opensearch.common.util.LongArray;
import org.opensearch.index.fielddata.SortedNumericDoubleValues;
import org.opensearch.search.aggregations.Aggregator;
import org.opensearch.search.aggregations.InternalAggregation;
import org.opensearch.search.aggregations.LeafBucketCollector;
import org.opensearch.search.aggregations.LeafBucketCollectorBase;
import org.opensearch.search.aggregations.metrics.NumericMetricsAggregator;
import org.opensearch.search.aggregations.support.ValuesSource;
import org.opensearch.search.aggregations.support.ValuesSourceConfig;
import org.opensearch.search.internal.SearchContext;

import java.io.IOException;
import java.util.Map;

public class StatsLiteAggregator extends NumericMetricsAggregator.MultiValue {

    private final ValuesSource.Numeric valuesSource;

    // owningBucketOrdinal-indexed accumulators (bucket ordinals support sub-agg nesting).
    private LongArray counts;
    private DoubleArray sums;
    private DoubleArray mins;
    private DoubleArray maxes;

    public StatsLiteAggregator(
        String name,
        ValuesSourceConfig config,
        SearchContext context,
        Aggregator parent,
        Map<String, Object> metadata
    ) throws IOException {
        super(name, context, parent, metadata);
        this.valuesSource = config.hasValues() ? (ValuesSource.Numeric) config.getValuesSource() : null;
        if (valuesSource != null) {
            final BigArrays bigArrays = context.bigArrays();
            counts = bigArrays.newLongArray(1, true);
            sums = bigArrays.newDoubleArray(1, true);
            mins = bigArrays.newDoubleArray(1, false);
            mins.fill(0, mins.size(), Double.POSITIVE_INFINITY);
            maxes = bigArrays.newDoubleArray(1, false);
            maxes.fill(0, maxes.size(), Double.NEGATIVE_INFINITY);
        }
    }

    @Override
    public ScoreMode scoreMode() {
        return valuesSource != null && valuesSource.needsScores() ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES;
    }

    @Override
    public LeafBucketCollector getLeafCollector(LeafReaderContext ctx, LeafBucketCollector sub) throws IOException {
        if (valuesSource == null) {
            return LeafBucketCollector.NO_OP_COLLECTOR;
        }
        final BigArrays bigArrays = context.bigArrays();
        final SortedNumericDoubleValues values = valuesSource.doubleValues(ctx); // reads DocValues
        return new LeafBucketCollectorBase(sub, values) {
            @Override
            public void collect(int doc, long bucket) throws IOException {
                growIfNeeded(bigArrays, bucket);
                if (values.advanceExact(doc)) {
                    final int valueCount = values.docValueCount();
                    for (int i = 0; i < valueCount; i++) {
                        double v = values.nextValue();
                        counts.increment(bucket, 1);
                        sums.increment(bucket, v);
                        mins.set(bucket, Math.min(mins.get(bucket), v));
                        maxes.set(bucket, Math.max(maxes.get(bucket), v));
                    }
                }
            }
        };
    }

    private void growIfNeeded(BigArrays bigArrays, long bucket) {
        if (bucket >= counts.size()) {
            counts = bigArrays.grow(counts, bucket + 1);
            sums = bigArrays.grow(sums, bucket + 1);
            long oldMinSize = mins.size();
            mins = bigArrays.grow(mins, bucket + 1);
            mins.fill(oldMinSize, mins.size(), Double.POSITIVE_INFINITY);
            long oldMaxSize = maxes.size();
            maxes = bigArrays.grow(maxes, bucket + 1);
            maxes.fill(oldMaxSize, maxes.size(), Double.NEGATIVE_INFINITY);
        }
    }

    @Override
    public boolean hasMetric(String name) {
        switch (name) {
            case "count": case "sum": case "min": case "max": return true;
            default: return false;
        }
    }

    @Override
    public double metric(String name, long owningBucketOrd) {
        if (valuesSource == null || owningBucketOrd >= counts.size()) {
            return name.equals("count") ? 0 : Double.NaN;
        }
        switch (name) {
            case "count": return counts.get(owningBucketOrd);
            case "sum":   return sums.get(owningBucketOrd);
            case "min":   return mins.get(owningBucketOrd);
            case "max":   return maxes.get(owningBucketOrd);
            default: throw new IllegalArgumentException("unknown metric [" + name + "]");
        }
    }

    // Builds the shard partial for a given owning bucket ordinal.
    @Override
    public InternalAggregation buildAggregation(long owningBucketOrd) throws IOException {
        if (valuesSource == null || owningBucketOrd >= counts.size()) {
            return buildEmptyAggregation();
        }
        return new InternalStatsLite(
            name,
            counts.get(owningBucketOrd),
            sums.get(owningBucketOrd),
            mins.get(owningBucketOrd),
            maxes.get(owningBucketOrd),
            metadata()
        );
    }

    @Override
    public InternalAggregation buildEmptyAggregation() {
        return new InternalStatsLite(name, 0, 0, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, metadata());
    }

    @Override
    public void doClose() {
        Releasables.close(counts, sums, mins, maxes);
    }
}

Note: Class/method signatures drift between branches (e.g. BigArrays access, the exact NumericMetricsAggregator.MultiValue contract, metadata() vs getMetadata()). After you paste this, let the compiler guide you and diff against a real built-in metric aggregator for the current contract:
ls server/src/main/java/org/opensearch/search/aggregations/metrics/ | grep -i "StatsAggregator\|AvgAggregator"
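The bucket-ordinal bookkeeping in the aggregator is independent of the Java API and can be modeled in a few lines. This sketch uses Python lists in place of BigArrays and an invented class name; it shows only why the min and max arrays must be filled with identity values when they grow.

```python
import math

class StatsLiteAccumulators:
    """Per-bucket count/sum/min/max, indexed by owning bucket ordinal."""

    def __init__(self):
        self.counts = [0]
        self.sums = [0.0]
        self.mins = [math.inf]    # identity for min
        self.maxes = [-math.inf]  # identity for max

    def _grow(self, bucket):
        extra = bucket + 1 - len(self.counts)
        if extra > 0:
            self.counts.extend([0] * extra)
            self.sums.extend([0.0] * extra)
            # New slots must start at the identity, not at zero.
            self.mins.extend([math.inf] * extra)
            self.maxes.extend([-math.inf] * extra)

    def collect(self, bucket, value):
        self._grow(bucket)
        self.counts[bucket] += 1
        self.sums[bucket] += value
        self.mins[bucket] = min(self.mins[bucket], value)
        self.maxes[bucket] = max(self.maxes[bucket], value)

    def build(self, bucket):
        """Partial for one bucket; an unseen ordinal yields the empty partial."""
        if bucket >= len(self.counts):
            return (0, 0.0, math.inf, -math.inf)
        return (self.counts[bucket], self.sums[bucket], self.mins[bucket], self.maxes[bucket])

acc = StatsLiteAccumulators()
acc.collect(0, 5.0)
acc.collect(2, 7.0)
acc.collect(2, 3.0)
print(acc.build(0), acc.build(1), acc.build(2))
```

If the grown min slots were zero-filled, bucket 2 would report a minimum of 0.0 even though every collected value was positive.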

Step 4 — The factory and the builder (20 min)

StatsLiteAggregatorFactory (per-shard aggregator creation), extending the values-source factory so field resolution and the ValuesSourceRegistry are handled for you:

/* SPDX header ... */
package org.opensearch.aggregations.statslite;

import org.opensearch.search.aggregations.Aggregator;
import org.opensearch.search.aggregations.AggregatorFactories;
import org.opensearch.search.aggregations.CardinalityUpperBound;
import org.opensearch.search.aggregations.support.CoreValuesSourceType;
import org.opensearch.search.aggregations.support.ValuesSourceAggregatorFactory;
import org.opensearch.search.aggregations.support.ValuesSourceConfig;
import org.opensearch.search.aggregations.support.ValuesSourceRegistry;
import org.opensearch.search.internal.SearchContext;
import org.opensearch.index.query.QueryShardContext;

import java.io.IOException;
import java.util.Map;

public class StatsLiteAggregatorFactory extends ValuesSourceAggregatorFactory {

    public static void registerAggregators(ValuesSourceRegistry.Builder builder) {
        builder.register(
            StatsLiteAggregationBuilder.REGISTRY_KEY,
            CoreValuesSourceType.NUMERIC,
            StatsLiteAggregator::new,
            true
        );
    }

    StatsLiteAggregatorFactory(
        String name,
        ValuesSourceConfig config,
        QueryShardContext queryShardContext,
        AggregatorFactories.Builder subFactoriesBuilder,
        Map<String, Object> metadata
    ) throws IOException {
        super(name, config, queryShardContext, subFactoriesBuilder, metadata);
    }

    @Override
    protected Aggregator createUnmapped(SearchContext searchContext, Aggregator parent, Map<String, Object> metadata) throws IOException {
        return new StatsLiteAggregator(name, config, searchContext, parent, metadata);
    }

    @Override
    protected Aggregator doCreateInternal(
        SearchContext searchContext,
        Aggregator parent,
        CardinalityUpperBound bucketCardinality,
        Map<String, Object> metadata
    ) throws IOException {
        return new StatsLiteAggregator(name, config, searchContext, parent, metadata);
    }
}

StatsLiteAggregationBuilder (Writeable + ToXContent, parsed from JSON). It is a values-source builder so field/missing/script parsing is inherited:

/* SPDX header ... */
package org.opensearch.aggregations.statslite;

import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.xcontent.XContentBuilder;
import org.opensearch.index.query.QueryShardContext;
import org.opensearch.search.aggregations.AggregationBuilder;
import org.opensearch.search.aggregations.AggregatorFactories;
import org.opensearch.search.aggregations.support.CoreValuesSourceType;
import org.opensearch.search.aggregations.support.ValuesSource;
import org.opensearch.search.aggregations.support.ValuesSourceAggregationBuilder;
import org.opensearch.search.aggregations.support.ValuesSourceConfig;
import org.opensearch.search.aggregations.support.ValuesSourceRegistry;
import org.opensearch.search.aggregations.support.ValuesSourceType;

import java.io.IOException;
import java.util.Map;

// LeafOnly's first type parameter is the ValuesSource kind (compare AvgAggregationBuilder), not the config.
public class StatsLiteAggregationBuilder
        extends ValuesSourceAggregationBuilder.LeafOnly<ValuesSource.Numeric, StatsLiteAggregationBuilder> {

    public static final String NAME = "stats_lite";
    // Typed key so builder.register(...) can bind StatsLiteAggregator::new to the supplier interface.
    public static final ValuesSourceRegistry.RegistryKey<StatsLiteAggregatorSupplier> REGISTRY_KEY =
        new ValuesSourceRegistry.RegistryKey<>(NAME, StatsLiteAggregatorSupplier.class);

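    public StatsLiteAggregationBuilder(String name) {
        super(name);
    }

    // Copy constructor backing shallowCopy(); the base builder requires a shallowCopy implementation.
    protected StatsLiteAggregationBuilder(
        StatsLiteAggregationBuilder clone,
        AggregatorFactories.Builder factoriesBuilder,
        Map<String, Object> metadata
    ) {
        super(clone, factoriesBuilder, metadata);
    }

    @Override
    protected AggregationBuilder shallowCopy(AggregatorFactories.Builder factoriesBuilder, Map<String, Object> metadata) {
        return new StatsLiteAggregationBuilder(this, factoriesBuilder, metadata);
    }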

    public StatsLiteAggregationBuilder(StreamInput in) throws IOException {
        super(in); // field/script/missing read by the base class
    }

    @Override
    protected ValuesSourceType defaultValueSourceType() {
        return CoreValuesSourceType.NUMERIC;
    }

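    @Override
    public String getType() {
        return NAME;
    }

    // Assumption: your branch's base class declares getRegistryKey() (compare AvgAggregationBuilder);
    // drop this override if it does not.
    @Override
    protected ValuesSourceRegistry.RegistryKey<?> getRegistryKey() {
        return REGISTRY_KEY;
    }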

    @Override
    protected void innerWriteTo(StreamOutput out) {
        // no extra fields beyond the base values-source config
    }

    @Override
    protected StatsLiteAggregatorFactory innerBuild(
        QueryShardContext queryShardContext,
        ValuesSourceConfig config,
        AggregatorFactories.Builder subFactoriesBuilder
    ) throws IOException {
        return new StatsLiteAggregatorFactory(name, config, queryShardContext, subFactoriesBuilder, metadata);
    }

    @Override
    protected XContentBuilder doXContentBody(XContentBuilder builder, Params params) {
        return builder; // nothing custom to render
    }

    @Override
    public BucketCardinality bucketCardinality() {
        return BucketCardinality.NONE; // it's a metric, produces no buckets
    }
}

The supplier interface the registry binds (StatsLiteAggregator::new matches it):

/* SPDX header ... */
package org.opensearch.aggregations.statslite;

import org.opensearch.search.aggregations.Aggregator;
import org.opensearch.search.aggregations.support.ValuesSourceConfig;
import org.opensearch.search.internal.SearchContext;
// ... (signature mirrors a built-in metric supplier; compare with AvgAggregatorSupplier)

@FunctionalInterface
public interface StatsLiteAggregatorSupplier {
    Aggregator build(String name, ValuesSourceConfig config, SearchContext context,
                     Aggregator parent, java.util.Map<String, Object> metadata) throws java.io.IOException;
}

Step 5 — Register it in the plugin (10 min)

/* SPDX header ... */
package org.opensearch.aggregations.statslite;

import org.opensearch.plugins.Plugin;
import org.opensearch.plugins.SearchPlugin;

import java.util.List;

public class StatsLitePlugin extends Plugin implements SearchPlugin {

    @Override
    public List<AggregationSpec> getAggregations() {
        return List.of(
            new AggregationSpec(
                StatsLiteAggregationBuilder.NAME,
                StatsLiteAggregationBuilder::new,            // wire reader
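                // XContent parser. Simplified: this lambda never consumes the request body, so
                // "field" is not actually read. A working parser is typically an ObjectParser wired
                // through ValuesSourceAggregationBuilder.declareFields(...); copy a built-in metric.
                (parser, name) -> new StatsLiteAggregationBuilder(name)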
            )
            .addResultReader(InternalStatsLite::new)          // reads the partial off the wire
            .setAggregatorRegistrar(StatsLiteAggregatorFactory::registerAggregators)
        );
    }
}

The four registration hooks map directly onto the classes you wrote:

Hook	Class	Why
Builder wire reader	`StatsLiteAggregationBuilder::new(StreamInput)`	Builder crosses to data nodes
Builder XContent parser	`(parser,name) -> ...`	JSON `"stats_lite": {...}` → builder
`addResultReader`	`InternalStatsLite::new(StreamInput)`	Partial crosses back to coordinator
`setAggregatorRegistrar`	`StatsLiteAggregatorFactory::registerAggregators`	Binds NUMERIC values source → aggregator
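
The reader hooks in the table are, in essence, name-keyed lookups: a name arrives on the wire and something must know how to deserialize what follows. The sketch below illustrates that idea with plain Java only. `ReaderRegistry`, `register`, and `read` are hypothetical names for illustration, not OpenSearch's registry API, and the string payload stands in for a stream.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a name-keyed reader lookup; not OpenSearch's real registry. */
final class ReaderRegistry {
    private final Map<String, Function<String, Object>> readers = new HashMap<>();

    /** Bind a type name to the function that can deserialize its payload. */
    void register(String name, Function<String, Object> reader) {
        if (readers.putIfAbsent(name, reader) != null) {
            throw new IllegalArgumentException("reader for [" + name + "] already registered");
        }
    }

    /** Look up the reader by name; an unregistered name fails here, before any bytes are read. */
    Object read(String name, String payload) {
        Function<String, Object> reader = readers.get(name);
        if (reader == null) {
            throw new IllegalArgumentException("unknown type [" + name + "]");
        }
        return reader.apply(payload);
    }

    public static void main(String[] args) {
        ReaderRegistry registry = new ReaderRegistry();
        registry.register("stats_lite", payload -> "parsed:" + payload);
        System.out.println(registry.read("stats_lite", "count=6"));
    }
}
```

Under this model, forgetting one hook shows up as a lookup failure for that name on whichever node needed the reader, which is why each row in the table has its own failure mode.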

plugin-descriptor.properties is generated by the opensearch.opensearchplugin Gradle plugin from build.gradle; you do not write it by hand.

Step 6 — Build, install, and exercise (15 min)

./gradlew :plugins:agg-stats-lite:spotlessApply
./gradlew :plugins:agg-stats-lite:assemble
./gradlew :plugins:agg-stats-lite:precommit     # license header, checkstyle, forbidden APIs

# Run a node WITH the plugin pre-installed:
./gradlew run -PinstalledPlugins="['agg-stats-lite']"

Then exercise it:

export OS=http://localhost:9200
curl -s -XPUT "$OS/m" -H 'Content-Type: application/json' -d '{"mappings":{"properties":{"v":{"type":"double"}}}}'
for x in 3 7 1 9 4 2; do curl -s -XPOST "$OS/m/_doc" -H 'Content-Type: application/json' -d "{\"v\":$x}" >/dev/null; done
curl -s -XPOST "$OS/m/_refresh" >/dev/null

curl -s "$OS/m/_search?size=0" -H 'Content-Type: application/json' -d '{
  "aggs": { "s": { "stats_lite": { "field": "v" } } }
}' | python3 -m json.tool

Expected (count=6, sum=26, min=1, max=9):

{ "aggregations": { "s": { "count": 6, "sum": 26.0, "min": 1.0, "max": 9.0 } } }

Confirm it survives multi-shard reduce — recreate m with 3 shards and re-run; the numbers must be identical (this is your reduce being exercised across shards).
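
Why the numbers must match can be sketched without a cluster. The block below is a minimal model, assuming each shard produces a partial of (count, sum, min, max) and the coordinator merges them field by field. `StatsPartial` is a hypothetical stand-in, not the real `InternalStatsLite`.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a per-shard partial result; not the real InternalStatsLite. */
final class StatsPartial {
    final long count;
    final double sum;
    final double min;
    final double max;

    StatsPartial(long count, double sum, double min, double max) {
        this.count = count;
        this.sum = sum;
        this.min = min;
        this.max = max;
    }

    /** Accumulate one shard's values, the way a collect loop would. */
    static StatsPartial of(double... values) {
        long count = 0;
        double sum = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            count++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new StatsPartial(count, sum, min, max);
    }

    /** Merge partials field by field; an empty shard contributes only identity values. */
    static StatsPartial reduce(List<StatsPartial> partials) {
        long count = 0;
        double sum = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (StatsPartial p : partials) {
            count += p.count;
            sum += p.sum;
            min = Math.min(min, p.min);
            max = Math.max(max, p.max);
        }
        return new StatsPartial(count, sum, min, max);
    }

    public static void main(String[] args) {
        // The same six documents, spread over three "shards".
        StatsPartial merged = reduce(Arrays.asList(of(3, 7), of(1, 9), of(4, 2)));
        System.out.println(merged.count + " " + merged.sum + " " + merged.min + " " + merged.max);
        // prints: 6 26.0 1.0 9.0
    }
}
```

Because every field is merged with an associative operation (addition, min, max), the way documents are split across shards cannot change the result for these inputs.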

Step 7 — Test with `AggregatorTestCase` (20 min)

AggregatorTestCase builds a tiny in-memory Lucene index, runs your aggregator over it, and gives you the reduced InternalAggregation — no cluster needed.

/* SPDX header ... */
package org.opensearch.aggregations.statslite;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.opensearch.index.mapper.MappedFieldType;
import org.opensearch.index.mapper.NumberFieldMapper;
import org.opensearch.search.aggregations.AggregatorTestCase;

import static org.apache.lucene.util.NumericUtils.doubleToSortableLong;

public class StatsLiteAggregatorTests extends AggregatorTestCase {

    public void testCountSumMinMax() throws Exception {
        MappedFieldType ft = new NumberFieldMapper.NumberFieldType("v", NumberFieldMapper.NumberType.DOUBLE);
        StatsLiteAggregationBuilder builder = new StatsLiteAggregationBuilder("s").field("v");

        testCase(builder, new org.apache.lucene.search.MatchAllDocsQuery(), iw -> {
            for (double v : new double[]{3, 7, 1, 9, 4, 2}) {
                Document d = new Document();
                d.add(new SortedNumericDocValuesField("v", doubleToSortableLong(v)));
                iw.addDocument(d);
            }
        }, (InternalStatsLite result) -> {
            assertEquals(6, result.getCount());
            assertEquals(26.0, result.getSum(), 0.0);
            assertEquals(1.0, result.getMin(), 0.0);
            assertEquals(9.0, result.getMax(), 0.0);
        }, ft);
    }

    public void testEmptyIsZeroCount() throws Exception {
        MappedFieldType ft = new NumberFieldMapper.NumberFieldType("v", NumberFieldMapper.NumberType.DOUBLE);
        StatsLiteAggregationBuilder builder = new StatsLiteAggregationBuilder("s").field("v");
        testCase(builder, new org.apache.lucene.search.MatchAllDocsQuery(), iw -> {}, (InternalStatsLite r) -> {
            assertEquals(0, r.getCount());
            assertEquals(0.0, r.getSum(), 0.0);
        }, ft);
    }
}

Add a serialization round-trip test as well (extend AbstractWireSerializingTestCase<InternalStatsLite> to prove that writeTo and the StreamInput constructor agree; this is the BWC guard), then run the aggregator tests:

./gradlew :plugins:agg-stats-lite:test --tests "*.StatsLiteAggregatorTests"
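
What a wire round-trip test checks can be sketched with the JDK's own data streams. This is only an illustration under stated assumptions: `DataOutputStream` and `DataInputStream` stand in for OpenSearch's `StreamOutput` and `StreamInput`, and `WireRoundTrip` and `copy` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in: JDK data streams play the role of StreamOutput/StreamInput. */
final class WireRoundTrip {
    final long count;
    final double sum;
    final double min;
    final double max;

    WireRoundTrip(long count, double sum, double min, double max) {
        this.count = count;
        this.sum = sum;
        this.min = min;
        this.max = max;
    }

    /** Read constructor: the field order must mirror writeTo exactly. */
    WireRoundTrip(DataInputStream in) throws IOException {
        this.count = in.readLong();
        this.sum = in.readDouble();
        this.min = in.readDouble();
        this.max = in.readDouble();
    }

    void writeTo(DataOutputStream out) throws IOException {
        out.writeLong(count);
        out.writeDouble(sum);
        out.writeDouble(min);
        out.writeDouble(max);
    }

    /** Serialize to bytes and read back, as a round-trip test does. */
    static WireRoundTrip copy(WireRoundTrip original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return new WireRoundTrip(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WireRoundTrip copy = copy(new WireRoundTrip(6, 26.0, 1.0, 9.0));
        System.out.println(copy.count + " " + copy.sum + " " + copy.min + " " + copy.max);
    }
}
```

If the read constructor swapped `min` and `max`, nothing would throw; the copy would simply carry the wrong values, which is exactly the kind of silent mismatch a round-trip equality assertion is meant to catch.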

Note: The exact testCase(...) overload signature varies by branch. If it does not compile, grep an existing metric test for the current shape:
grep -rln "extends AggregatorTestCase" server/src/test/java/org/opensearch/search/aggregations/metrics/ | head

Implementation Requirements

Five classes compile: StatsLitePlugin, StatsLiteAggregationBuilder, StatsLiteAggregatorFactory, StatsLiteAggregator, InternalStatsLite (+ supplier).
Every file carries the SPDX header; precommit passes.
InternalStatsLite.reduce merges accumulators (count/sum/min/max), not concatenation.
getAggregations() registers all four hooks (builder reader, parser, result reader, registrar).
_search with stats_lite returns correct count/sum/min/max on a multi-shard index.
StatsLiteAggregatorTests passes (non-empty + empty cases).
A Writeable round-trip test passes for InternalStatsLite.

Expected Output

> Task :plugins:agg-stats-lite:test
StatsLiteAggregatorTests > testCountSumMinMax PASSED
StatsLiteAggregatorTests > testEmptyIsZeroCount PASSED
InternalStatsLiteWireTests > testSerialization PASSED
BUILD SUCCESSFUL

And the _search response: {"count":6,"sum":26.0,"min":1.0,"max":9.0}, identical on 1 and 3 shards.

Troubleshooting

Symptom	Cause	Fix
`unknown aggregation type [stats_lite]`	Plugin not installed / not registered	`run -PinstalledPlugins="['agg-stats-lite']"`; check `getAggregations()`
`NamedWriteable... [stats_lite]` not found on reduce	Missing `addResultReader(InternalStatsLite::new)`	Add the result reader hook
Wire round-trip test fails	`doWriteTo` / `StreamInput` ctor field order mismatch	Make read order == write order
Aggregating throws "no doc values"	Field is `text`/non-numeric	Use a numeric field with DocValues
Multi-shard total wrong	`reduce` concatenated instead of merging	Sum count/sum, min/max across partials
`precommit` fails on header	Missing SPDX block	Add the header to every file

Stretch Goals

Add avg as a derived output (sum/count) in doXContentBody — and confirm you do not ship it on the wire (derive on render, reduce the accumulators). This is the reducible-metric rule made literal.
Convert the plugin into a custom query instead, via SearchPlugin.getQueries() and a QuerySpec registering a *QueryBuilder (Writeable + ToXContent) whose doToQuery returns a Lucene Query. Compare the registration surface to getAggregations().
Add a yamlRestTest under src/yamlRestTest/resources/rest-api-spec/test/ that asserts the _search contract end-to-end on a real ./gradlew run-style cluster.
Make stats_lite work as a sub-aggregation inside a terms bucket and confirm the bucket-ordinal accumulator arrays (the growIfNeeded logic) handle many owning buckets.
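
For the first stretch goal, the reason to derive avg at render time rather than ship it can be shown with two lines of arithmetic. The sketch below assumes two shards with different document counts; `DerivedAvg` and its methods are hypothetical names for illustration.

```java
/** Hypothetical sketch: per-shard partials reduced as (count, sum), with avg derived at render time. */
final class DerivedAvg {

    /** Wrong: averaging per-shard averages weights every shard equally, whatever its doc count. */
    static double averageOfAverages(long[] counts, double[] sums) {
        double total = 0;
        for (int i = 0; i < counts.length; i++) {
            total += sums[i] / counts[i];
        }
        return total / counts.length;
    }

    /** Right: reduce the accumulators first, then divide once when rendering. */
    static double derivedAverage(long[] counts, double[] sums) {
        long count = 0;
        double sum = 0;
        for (int i = 0; i < counts.length; i++) {
            count += counts[i];
            sum += sums[i];
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        long[] counts = { 4, 2 };        // shard A holds 3, 7, 1, 9; shard B holds 4, 2
        double[] sums = { 20.0, 6.0 };
        System.out.println(averageOfAverages(counts, sums)); // (5 + 3) / 2 = 4.0
        System.out.println(derivedAverage(counts, sums));    // 26 / 6, about 4.33
    }
}
```

The two results differ as soon as shards hold different numbers of documents, so only count and sum are safe to carry through reduce.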

Validation / Self-check

Which method is your collect loop, and what does it read from each doc?
Why does InternalStatsLite ship count/sum/min/max and not the average?
What are the four registration hooks in getAggregations(), and which class does each bind?
Why must doWriteTo and the StreamInput constructor agree on field order?
How does AggregatorTestCase let you test reduce without a cluster?
What changes if you make this a getQueries() registration instead of getAggregations()?
Why must your multi-shard result equal your single-shard result, and which method guarantees it?

Cross-references: Aggregations deep dive, Plugin Architecture, Serialization and BWC, Level 8: Real Issue Contribution.

Level 8: Real Issue Contribution

Every level before this one was curated. You built from source against a known-good baseline, you traced code paths someone chose for you, and you implemented labs with a known answer. Level 8 is different. Here you contribute to OpenSearch the way maintainers actually work: you find a real, open GitHub issue, reproduce it deterministically, root-cause it, fix it with a minimal diff, write the test that fails-then-passes, and open a Pull Request that meets merge quality.

There is no answer key. The issue may be mis-triaged. The "obvious" fix may be wrong. The repro may be flaky. That is the job. This level teaches you the loop that the rest of your contributor life runs on, and it deliberately hands you less scaffolding than any prior level.

Learning Objectives

By the end of Level 8 you must be able to:

Find a tractable real issue in opensearch-project/OpenSearch and judge whether it is in scope for you.
Build a deterministic reproducer (a failing JUnit test, or a curl/REST-YAML sequence on ./gradlew run) that fails on main and will pass after the fix.
Separate the symptom (what the reporter saw) from the trigger conditions (what is actually required to provoke it).
Implement a minimal-diff fix with the SPDX header, spotlessApply, and a CHANGELOG.md entry.
Open a merge-quality Pull Request: DCO sign-off (git commit -s), a filled-in PR template, green CI, and a description a reviewer can act on.
Improve error messages and diagnostics as a focused, high-value contribution type.

The End-to-End Contribution Cycle (preview)

Everything in this level — and the Capstone — is one loop:

flowchart LR
    F[Find issue<br/>good first issue / bug] --> R[Reproduce<br/>deterministic, fails on main]
    R --> C[Root cause<br/>grep + read + debug]
    C --> X[Fix<br/>minimal diff]
    X --> T[Test<br/>fails-then-passes]
    T --> P[PR<br/>DCO + CHANGELOG + template]
    P --> CI[CI + review]
    CI -->|changes requested| X
    CI -->|approved| M[Merge + backport label]

The non-obvious truth: the reproducer, not the fix, is the center of gravity. A bug you can reproduce on demand is a bug you can fix with confidence; one you cannot is a guess. Maintainers look for the reproducer first. Lab 8.1 is entirely about building one.

How This Level Differs From Levels 1–7

Earlier levels	Level 8
Curated labs with a known answer	Real open issues, possibly mis-triaged
Code path chosen for you	You locate the code from a symptom
Repro provided	You build the repro from a bug report
"Make it compile / pass"	"Make a reviewer say yes"
No external stakes	A real PR, real CI, real maintainers

You bring the skills from prior levels: reading the codebase (Level 1), the request/transport path (Levels 2–3), cluster state and allocation (Levels 4–5), the engine (Level 6), and search/aggs (Level 7). Level 8 is where they combine on a target you did not choose.

Finding a Tractable Issue

Browse github.com/opensearch-project/OpenSearch/issues and filter with these labels:

Label	What it signals	Good for you now?
`good first issue`	Maintainers vetted it as small and self-contained	Yes — start here
`help wanted`	Maintainers want a contributor; may be larger	Sometimes
`bug`	A defect with (ideally) a repro in the issue	Yes, if scoped
`flaky-test`	A test that fails intermittently	Yes — see Level 9 / stage-9
`enhancement`	New behavior; needs design buy-in	Not yet (needs an RFC discussion)
`untriaged`	Not yet reviewed by a maintainer	Risky — may be invalid

GitHub search you can paste into the issues search box:

is:open is:issue label:"good first issue" repo:opensearch-project/OpenSearch

is:open is:issue label:bug label:"help wanted" sort:updated-desc

Triage checklist before you claim an issue:

Read the entire thread. Is someone already assigned, or is there an open PR?
Is the symptom described concretely enough to reproduce?
Is it in a subsystem you can actually read (server core, not a separate plugin repo)?
Is it small? A first contribution should touch a handful of files, not the coordination layer.
Comment that you are investigating before you sink a day into it — avoid duplicate work.

Warning: Do not open a PR against an issue someone else is actively working on. Check for linked PRs and recent comments. The community norm is: announce intent, then work.

The Deliverables of a Real PR

A merge-quality OpenSearch PR contains, at minimum:

Artifact	Where	Why
Minimal code diff	`server/` (or the relevant module)	The fix, with SPDX headers intact
A test that fails-then-passes	`*Tests.java` / `*IT.java` / `*.yml`	Proves the bug and guards regression
`CHANGELOG.md` entry	one line under `## [Unreleased ...]`	Required by the changelog verifier check in CI
DCO sign-off	every commit (`git commit -s`)	OpenSearch requires DCO, not a CLA
Filled PR template	`.github/pull_request_template.md`	Description, related issue, checklist
Green CI	GitHub Actions on the PR	`precommit`, tests, gradle check

The labs in this level produce exactly these artifacts.

Key Practices

Practice	What it means	Tooling
Repro first	Fail on `main` before you touch the fix	JUnit / `curl` / REST-YAML
Minimal diff	Change only what the bug requires	`git diff`, no drive-by reformat
Fails-then-passes test	Revert the fix → test red; apply → green	`./gradlew :server:test --tests ...`
Run the gate locally	Don't burn CI to find a checkstyle nit	`./gradlew spotlessApply precommit`
DCO sign-off	`Signed-off-by:` on every commit	`git commit -s`
One logical change	One issue, one PR	rebase, not bundle
CHANGELOG hygiene	Added/Changed/Fixed/Deprecated/Removed	edit `CHANGELOG.md`

Labs

Lab	Title	Output
8.1	Reproduce an Existing GitHub Issue	A deterministic reproducer that fails on `main`
8.2	Implement the Fix, Write the Test, Open the PR	A merge-quality PR with a fails-then-passes test
8.3	Improve Error Messages and Diagnostics	A diagnostics PR with a message-asserting unit test

Deliverables

You must demonstrate all of the following before advancing to Level 9:

A chosen real (or real-feeling) issue with a written triage note (scope, assignee check, plan).
A deterministic reproducer that fails on main (Lab 8.1).
A minimal-diff fix and a fails-then-passes test (Lab 8.2).
A CHANGELOG.md entry and DCO-signed commits.
A PR (or a complete PR-ready branch + filled template, if not submitting upstream).
One diagnostics improvement with a unit test asserting the new message (Lab 8.3).

Common Mistakes

Mistake	Consequence	Fix
Fixing before reproducing	You "fix" the wrong thing; can't prove it	Build the repro first; it must fail on `main`
Test that passes before the fix	Proves nothing; reviewer rejects	Revert the fix and confirm the test goes red
Drive-by reformatting in the diff	Reviewer can't see the real change	Keep the diff minimal; reformat in a separate PR
Forgetting the CHANGELOG entry	The changelog check fails CI	Add the one-liner under `[Unreleased]`
Forgetting `git commit -s`	DCO check blocks the PR	Sign off every commit (`-s`)
Bundling two fixes in one PR	Hard to review; stalls	One issue per PR
Picking an `enhancement` as a first PR	Needs design buy-in; spins forever	Start with `good first issue` / `bug`

Where This Level Points Next

The full, graded version of this loop is the Capstone; Lab 8.1 maps onto Capstone Step 2: Reproduction and Lab 8.2 onto the implementation/PR steps.
The issue roadmap sequences contribution types by difficulty; Lab 8.3 corresponds to Stage 3: Error Messages.
Level 9 takes you into cross-repo and harder, release-relevant work.
For PR etiquette and review, see PR quality and responding to feedback.

Lab 8.1: Reproduce an Existing GitHub Issue

Background

A bug you cannot reproduce on demand is a bug you cannot fix with confidence. The reproducer is the single most valuable artifact in a contribution: it makes root cause provable, makes the fix verifiable, and is the first thing a maintainer looks for in your PR. Before you write one line of a fix, you must have something that fails the same way every time on main.

This lab takes a real open issue from opensearch-project/OpenSearch and turns its prose bug report into a deterministic reproducer — ideally a failing JUnit test you will ship in the PR, at minimum a recorded curl/REST-YAML sequence on ./gradlew run. The governing rule: it must fail on main without your change, every run, and pass after the fix. If it does not fail on main, you are reproducing the wrong thing.

Why This Lab Matters for Contributors

Maintainers triage by reproducibility. "I can't reproduce" closes more issues than any fix.
A repro separates symptom (what the reporter saw) from trigger conditions (what is actually required). That separation is the root-cause work, started early.
A repro promoted to a JUnit test becomes your regression guard for free.

Prerequisites

A built checkout on main: ./gradlew assemble green. Record the commit:
```
git rev-parse --short HEAD
git log -1 --format='%h %ci %s'
```
The ability to run ./gradlew run and ./gradlew :server:test.
Read Level 8 index and Capstone Step 2: Reproduction — this lab is the warm-up for that graded step.

Step-by-Step Tasks

Step 1 — Pick and triage an issue (15 min)

Open the issues list and filter:

is:open is:issue label:bug repo:opensearch-project/OpenSearch sort:updated-desc

Pick something that (a) is in server core (not a separate plugin repo), (b) has a concrete symptom, and (c) is small. Write a triage note:

Issue: #NNNNN — <title>
Symptom (reporter's words): <quote the failing behavior>
Suspected subsystem: <e.g. search aggregations / setting validation / parsing>
Assignee / open PR? <none>
My plan: reproduce via <unit | curl> then root-cause.

Note: If you cannot find a clean open issue, this lab works equally well on a real-feeling bug class. A canonical, evergreen example used below: a request parser accepts an out-of-range value that should be rejected, producing a confusing downstream failure instead of a clean validation error. This is the exact shape of dozens of real OpenSearch issues.

Step 2 — Reproduce it manually first, to see it (15 min)

Always start manual. You need to watch the bug happen before you encode it. Spin up a node:

./gradlew run
export OS=http://localhost:9200

Drive the reported scenario. For the worked example (a terms aggregation accepting size: 0 or a negative value and behaving oddly instead of rejecting it):

curl -s -XPUT "$OS/t" -H 'Content-Type: application/json' -d '{"mappings":{"properties":{"k":{"type":"keyword"}}}}'
curl -s -XPOST "$OS/t/_doc?refresh=true" -H 'Content-Type: application/json' -d '{"k":"a"}' >/dev/null

# The suspicious request — does it 400 cleanly, or do something surprising?
curl -s -i "$OS/t/_search?size=0" -H 'Content-Type: application/json' -d '{
  "aggs": { "g": { "terms": { "field": "k", "size": -1 } } } }' | head -20

Record exactly what you see: the HTTP status, the error class/message, or the wrong result. That recorded behavior is your symptom baseline.

Step 3 — Separate symptom from trigger conditions (15 min)

A bug report says "it crashed." Your job is to find the minimal conditions that provoke it. Vary one factor at a time and note which ones matter:

Factor	Does it still reproduce?	Conclusion
`size: -1` vs `size: 0` vs `size: 5`	only `-1`/`0`?	the boundary is the trigger
`keyword` vs `long` field	both?	field type irrelevant
1 shard vs 5 shards	both?	not a reduce/fan-out bug
empty index vs 1 doc vs 1000 docs	all?	not data-dependent

The set of factors that must be true to reproduce is your trigger condition. Everything else is noise to strip out of the test. This is the heart of the lab — a vague report becomes a precise, minimal statement: "terms with size <= 0 on any field."
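
The one-factor-at-a-time procedure can be sketched as code. This is a toy model, not a harness: `reproduces` is a hypothetical stand-in for "send the request and check for the symptom", hard-coded to the worked example's assumed behavior, and `TriggerIsolation` and `Scenario` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of one-factor-at-a-time isolation; nothing here calls OpenSearch. */
final class TriggerIsolation {

    /** One combination of the factors from the table above. */
    static final class Scenario {
        final int size;
        final String fieldType;
        final int shards;
        final int docs;

        Scenario(int size, String fieldType, int shards, int docs) {
            this.size = size;
            this.fieldType = fieldType;
            this.shards = shards;
            this.docs = docs;
        }
    }

    /** Stand-in for running the request; assumed behavior of the worked example. */
    static boolean reproduces(Scenario s) {
        return s.size <= 0;
    }

    /** From a failing baseline, flip one factor at a time; a factor matters if flipping it hides the symptom. */
    static List<String> relevantFactors(Scenario failing) {
        List<String> relevant = new ArrayList<>();
        String otherType = "keyword".equals(failing.fieldType) ? "long" : "keyword";
        int otherShards = failing.shards == 1 ? 5 : 1;
        int otherDocs = failing.docs == 0 ? 1000 : 0;
        if (!reproduces(new Scenario(5, failing.fieldType, failing.shards, failing.docs))) {
            relevant.add("size");
        }
        if (!reproduces(new Scenario(failing.size, otherType, failing.shards, failing.docs))) {
            relevant.add("fieldType");
        }
        if (!reproduces(new Scenario(failing.size, failing.fieldType, otherShards, failing.docs))) {
            relevant.add("shards");
        }
        if (!reproduces(new Scenario(failing.size, failing.fieldType, failing.shards, otherDocs))) {
            relevant.add("docs");
        }
        return relevant;
    }

    public static void main(String[] args) {
        System.out.println(relevantFactors(new Scenario(-1, "keyword", 1, 1))); // prints [size]
    }
}
```

In a real investigation each `reproduces` call is a manual or scripted run against the node; the value of the structure is that every run changes exactly one thing from a known-failing baseline.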

Step 4 — Pin version and seed (5 min)

Reproducibility means the same inputs every run. Pin:

Commit: the git rev-parse --short HEAD from prerequisites. State it in the issue/PR.
Seed: OpenSearch tests are randomized (RandomizedRunner). When a test fails, the output prints a reproduce line:
```
REPRODUCE WITH: ./gradlew :server:test --tests "...Tests.method" -Dtests.seed=ABCDEF123
```
Capture that seed. A bug that only reproduces under a specific seed is still a real bug — pin the seed and say so.
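
The principle behind pinning a seed can be illustrated with `java.util.Random`. This is only a sketch of the idea: the OpenSearch test framework derives its randomness differently, and `SeededInputs` and `randomSizes` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch: the same seed always regenerates the same "random" test inputs. */
final class SeededInputs {

    /** Generate n candidate size values in [-10, 10] from a seed. */
    static int[] randomSizes(long seed, int n) {
        Random random = new Random(seed);
        int[] sizes = new int[n];
        for (int i = 0; i < n; i++) {
            sizes[i] = random.nextInt(21) - 10;
        }
        return sizes;
    }

    public static void main(String[] args) {
        long seed = 0xABCDEF123L; // the seed copied from a failing run
        System.out.println(Arrays.toString(randomSizes(seed, 8)));
        System.out.println(Arrays.toString(randomSizes(seed, 8))); // identical to the line above
    }
}
```

A failure that depends on one of these generated values is deterministic once the seed is fixed, which is what makes a seed-specific bug reportable.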

Step 5 — Promote the manual repro to a deterministic test (25 min)

Manual curl is for seeing the bug. Ship code. Choose the lowest-cost harness that reliably reproduces it (see the table in Capstone Step 2):

Harness	Use when	Speed
`OpenSearchTestCase` / `AggregatorTestCase` / `OpenSearchSingleNodeTestCase`	logic in one class (parser, validator, reduce)	seconds
`OpenSearchIntegTestCase` (`InternalTestCluster`)	needs multiple nodes/shards	tens of seconds
`OpenSearchRestTestCase` + `.yml`	a REST contract (status code, message, shape)	tens of seconds

For the worked example, the bug is in a request parser/validator — a unit test is right. Find the parser:

grep -rn "size" server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregationBuilder.java | grep -i "parse\|require\|valid\|setSize\|shardSize" | head

Now write a test that asserts the desired behavior (a clean IllegalArgumentException), so it fails on main (where no such validation exists yet):

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */
package org.opensearch.search.aggregations.bucket.terms;

import org.opensearch.test.OpenSearchTestCase;

import static org.hamcrest.Matchers.containsString;

public class TermsAggregationBuilderReproTests extends OpenSearchTestCase {

    public void testNegativeSizeIsRejected() {
        TermsAggregationBuilder builder = new TermsAggregationBuilder("g");
        // On main this may NOT throw (the bug). After the fix it must throw with a clear message.
        IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> builder.size(-1));
        assertThat(e.getMessage(), containsString("[size] must be greater than 0"));
    }

    public void testZeroSizeIsRejected() {
        TermsAggregationBuilder builder = new TermsAggregationBuilder("g");
        IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> builder.size(0));
        assertThat(e.getMessage(), containsString("[size] must be greater than 0"));
    }
}

Run it and confirm it is red on main:

./gradlew :server:test --tests "*.TermsAggregationBuilderReproTests"

Warning: If this test passes on main, the validation already exists — you reproduced the wrong thing, or the issue is already fixed. Go back to Step 3 and re-examine the trigger conditions. A repro that does not fail on main is not a repro.

Step 6 — (Alternative) a REST-YAML reproducer (15 min)

If the bug is a REST contract (status code or error body), encode it as a yamlRestTest instead. Find an existing one to copy the structure:

find . -path "*rest-api-spec/test*search*" -name "*.yml" | head
sed -n '1,30p' "$(find . -path '*rest-api-spec/test*' -name '*.yml' | head -1)"

A reproducer asserting a clean 400:

---
"terms agg rejects non-positive size":
  - do:
      catch: bad_request
      search:
        index: t
        body:
          aggs:
            g:
              terms:
                field: k
                size: 0
  - match: { error.type: "x_content_parse_exception" }
  - match: { status: 400 }

Run it:

./gradlew :rest-api-spec:yamlRestTest --tests "*search*"

Step 7 — Document the reproducer (10 min)

Write the repro up so a maintainer (and future-you) can run it in one command. This goes in the issue comment and later the PR:

Reproduced on <commit hash> (./gradlew assemble green).

Minimal trigger: terms aggregation with size <= 0.

Failing test (red on main):
  ./gradlew :server:test --tests "*.TermsAggregationBuilderReproTests"

Observed on main: size(-1) is accepted silently; the failure surfaces later/confusingly.
Expected: IllegalArgumentException "[size] must be greater than 0" at build time.

Implementation Requirements

Deliverable is a reproducer package:

A triage note (issue link, symptom, subsystem, assignee check, plan).
A minimal trigger statement (the stripped-down conditions from Step 3).
A pinned commit hash (and seed, if relevant).
A reproducer that fails on main: a JUnit test or a REST-YAML case or a recorded curl sequence with captured output.
A one-command way to run it.

Expected Output

A red test on main, e.g.:

TermsAggregationBuilderReproTests > testNegativeSizeIsRejected FAILED
    java.lang.AssertionError: Expected IllegalArgumentException to be thrown, but nothing was thrown

A documented repro block ready to paste into the issue/PR.

Troubleshooting

Symptom	Cause	Fix
Test passes on `main`	Bug already fixed, or you asserted current behavior	Re-check the issue; re-derive trigger conditions
Repro only sometimes fails	Randomized seed / timing / data-dependent	Pin `-Dtests.seed`; remove nondeterminism
Can't find the parser/validator	Wrong subsystem	grep the symptom's keywords across `server/src/main/java`
`curl` shows the bug but the unit test doesn't	Bug is in transport/REST, not the class you tested	Move up a harness level (integ or REST-YAML)
`./gradlew run` won't start	Stale build / port in use	`./gradlew clean assemble`; free `:9200`

Stretch Goals

Reduce the reproducer to its absolute minimum: fewest docs, fewest settings, smallest harness that still fails on main. A maintainer should be able to read it in 20 seconds.
Write the repro at two levels (unit + REST-YAML) and decide which you would ship. Justify it.

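Search history to find roughly when the behavior was introduced (or whether it was always present). The pickaxe search below lists commits that added or removed the symbol; use git bisect with your reproducer if you need the exact commit: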

git log -S "the symbol you grepped" -- server/src/main/java/org/opensearch/search/aggregations/bucket/terms/ | head

Validation / Self-check

Does your reproducer fail on main? Show the red output.
State the minimal trigger conditions — what must be true, and what is irrelevant?
Which harness did you choose and why was it the lowest-cost option that reliably reproduces?
What commit (and seed, if any) did you pin?
How will this same artifact prove your fix works in Lab 8.2?
Could a maintainer reproduce the bug from your write-up alone, in one command?

Cross-references: Capstone Step 2: Reproduction, Lab 8.2: Implement the Fix, reading the codebase.

Lab 8.2: Implement the Fix, Write the Test, Open the PR

Background

You have a deterministic reproducer from Lab 8.1 that fails on main. This lab takes you from that red test to a merge-quality Pull Request: a minimal diff, the SPDX header, spotlessApply formatting, the test that fails-then-passes, a CHANGELOG.md entry, DCO-signed commits, a clean ./gradlew precommit, and a PR description a reviewer can act on.

The discipline here is what separates a contribution that merges in days from one that rots for months. The fix itself is often the smallest part of the work.

Why This Lab Matters for Contributors

A perfect fix with a sloppy PR gets bounced; a small fix with a clean PR merges fast.
The fails-then-passes test is the proof maintainers trust. Without it, your fix is an assertion.
precommit and DCO are hard gates — failing them wastes CI minutes and reviewer goodwill.

Prerequisites

A reproducer that fails on main (Lab 8.1). You will reuse it verbatim as the regression test.
A fork of opensearch-project/OpenSearch and a feature branch:
```
git checkout -b fix/terms-size-validation
```

Git configured for DCO sign-off:

git config user.name "Your Name"
git config user.email "you@example.com"

Read PR quality and responding to feedback.

Step-by-Step Tasks

Step 1 — Root-cause to the exact line (15 min)

Your repro told you what; now find where. From Lab 8.1 the trigger is "terms with size <= 0". Locate the setter/parser that should reject it:

grep -rn "public TermsAggregationBuilder size" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregationBuilder.java
grep -rn "shardSize\|requiredSize\|bucketCountThresholds" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregationBuilder.java | head

Read the setter. On main it likely stores the value without validating it (that absence is the bug). Confirm by reading the method body — do not guess line numbers, read the code the grep points to.

Step 2 — Write the minimal fix (15 min)

The fix is a single guard in the setter. Minimal diff — change only what the bug requires, no reformatting of surrounding code:

--- a/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregationBuilder.java
+++ b/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregationBuilder.java
@@ public TermsAggregationBuilder size(int size) {
     public TermsAggregationBuilder size(int size) {
+        if (size <= 0) {
+            throw new IllegalArgumentException("[size] must be greater than 0. Found [" + size + "] in [" + name + "]");
+        }
         bucketCountThresholds.setRequiredSize(size);
         return this;
     }

Three properties of a good message (Lab 8.3 extends these into a four-item checklist): it names the parameter ([size]), states the constraint (must be greater than 0), and echoes the offending value and location (Found [-1] in [g]).

Note: If the real validation belongs at parse time (XContent) rather than the setter, put it where the value first becomes known so the error surfaces at the REST boundary with a 400, not deep in execution. For this bug the setter is the single choke point both REST and the Java client pass through, so it is the right place.
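The choke-point argument can be sketched in a few lines. This is a minimal, self-contained illustration, not OpenSearch code: `TermsSketch` and `fromRestParam` are hypothetical stand-ins for the builder and the REST parse path. It only shows that when both paths delegate to one validated setter, a single guard covers both.

```java
// Hypothetical stand-in for a builder; NOT the real TermsAggregationBuilder.
final class TermsSketch {
    private final String name;
    private int size = 10;

    TermsSketch(String name) {
        this.name = name;
    }

    // Single choke point: every caller that sets size passes through this guard.
    TermsSketch size(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException(
                "[size] must be greater than 0. Found [" + size + "] in [" + name + "]");
        }
        this.size = size;
        return this;
    }

    int size() {
        return size;
    }

    // Simulated REST path (hypothetical): parses a raw value, then delegates to the same setter.
    static TermsSketch fromRestParam(String name, String rawSize) {
        return new TermsSketch(name).size(Integer.parseInt(rawSize));
    }

    public static void main(String[] args) {
        try {
            fromRestParam("g", "-1");
        } catch (IllegalArgumentException e) {
            System.out.println("REST path rejected: " + e.getMessage());
        }
        try {
            new TermsSketch("g").size(0);
        } catch (IllegalArgumentException e) {
            System.out.println("Java path rejected: " + e.getMessage());
        }
    }
}
```

If the guard lived only in the parse path, the direct `size(0)` call would slip through, which is the case the Note warns about.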

Step 3 — Verify the test now passes (fails-then-passes) (10 min)

This is the proof. Your Lab 8.1 test was red on main. With the fix applied it must go green. Then revert the fix and confirm it goes red again. That round trip is the whole point.

# With the fix applied:
./gradlew :server:test --tests "*.TermsAggregationBuilderReproTests"
# Expect: BUILD SUCCESSFUL

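# Prove the test actually guards the bug: stash only the production change and re-run.
# (A bare `git stash` would also stash the test if it is already tracked or staged.)
git stash push -- server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregationBuilder.java
./gradlew :server:test --tests "*.TermsAggregationBuilderReproTests"   # expect FAILED
git stash pop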

A test that is green both with and without your fix proves nothing. If that happens, the test is asserting something the fix did not change — rewrite it against the bug.
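The round trip can be modelled without the OpenSearch test framework. The sketch below is hypothetical throughout (`buggySetter`, `fixedSetter`, and `regressionTestPasses` are illustrative names). It runs one regression check against an unguarded setter and a guarded one, and shows the check is only meaningful because the two results differ.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch of the fails-then-passes round trip; no OpenSearch classes involved.
final class RoundTripSketch {

    // Behaviour without the fix: accepts any value (the bug).
    static void buggySetter(int size) {
        // no validation
    }

    // Behaviour with the fix: rejects non-positive values.
    static void fixedSetter(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("[size] must be greater than 0. Found [" + size + "]");
        }
    }

    // The regression check: passes only if the setter rejects both 0 and -1.
    static boolean regressionTestPasses(IntConsumer setter) {
        for (int bad : new int[] { 0, -1 }) {
            try {
                setter.accept(bad);
                return false; // no exception thrown: the bug is still present
            } catch (IllegalArgumentException expected) {
                // rejected, as the fix requires
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + regressionTestPasses(RoundTripSketch::buggySetter));
        System.out.println("with fix:    " + regressionTestPasses(RoundTripSketch::fixedSetter));
    }
}
```

A check that returned `true` for both setters would be asserting something the fix never changed.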

Step 4 — Add the CHANGELOG entry (5 min)

Every PR adds one line to CHANGELOG.md under the ## [Unreleased ...] section, in the right category (Added/Changed/Fixed/Deprecated/Removed). precommit enforces this.

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ## [Unreleased 3.x]
 ### Fixed
+- Reject non-positive `size` in the `terms` aggregation with a clear error ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))

grep -n "## \[Unreleased" CHANGELOG.md | head    # find the section to edit

Step 5 — Format and run the local gate (15 min)

Never let CI find a formatting nit you could have caught locally.

./gradlew spotlessApply                  # auto-format (Spotless)
./gradlew spotlessJavaCheck              # verify formatting is clean
./gradlew :server:precommit              # checkstyle, forbidden APIs, license headers, CHANGELOG check
./gradlew :server:test --tests "*.TermsAggregationBuilderReproTests"

precommit is the gate that most first PRs trip on. Common failures: missing SPDX header on a new file, a line over the checkstyle limit, a forbidden API (System.out, Math.random), or a missing CHANGELOG entry. Fix them all before pushing.

Step 6 — Commit with DCO sign-off (5 min)

OpenSearch requires a DCO sign-off, not a CLA. Every commit needs a Signed-off-by: line, which -s adds:

git add server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregationBuilder.java \
        server/src/test/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregationBuilderReproTests.java \
        CHANGELOG.md
git commit -s -m "Reject non-positive size in terms aggregation

The terms aggregation accepted size <= 0 silently, surfacing as a
confusing failure later. Validate in the size() setter and throw a
clear IllegalArgumentException. Adds a unit test.

Closes #NNNNN"

Confirm the sign-off landed:

git log -1 | grep "Signed-off-by"
# Signed-off-by: Your Name <you@example.com>

Warning: If you forget -s, the DCO bot blocks the PR. Fix with git commit --amend -s (single commit) or git rebase --signoff main (multiple).
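To make the trailer requirement concrete, here is a rough approximation of a sign-off check. It is an assumption-laden sketch: `SignOffSketch` is a hypothetical name and the pattern is a simplification. The real DCO check may apply further rules, such as matching the trailer against the commit author.

```java
import java.util.regex.Pattern;

// Hypothetical approximation of a sign-off check; the real DCO bot's rules may differ.
final class SignOffSketch {
    // A whole line of the form "Signed-off-by: Name <email>".
    private static final Pattern TRAILER =
        Pattern.compile("^Signed-off-by: .+ <[^<>\\s]+@[^<>\\s]+>$", Pattern.MULTILINE);

    static boolean hasSignOff(String commitMessage) {
        return TRAILER.matcher(commitMessage).find();
    }

    public static void main(String[] args) {
        String signed = "Reject non-positive size in terms aggregation\n\n"
            + "Signed-off-by: Your Name <you@example.com>";
        String unsigned = "Reject non-positive size in terms aggregation";
        System.out.println("signed:   " + hasSignOff(signed));
        System.out.println("unsigned: " + hasSignOff(unsigned));
    }
}
```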

Step 7 — Push and open the PR (15 min)

git push -u origin fix/terms-size-validation
gh pr create --repo opensearch-project/OpenSearch \
  --title "Reject non-positive size in terms aggregation" \
  --body-file /tmp/pr-body.md

Before running gh pr create, fill in the repo's PR template (.github/pull_request_template.md) and save the result as /tmp/pr-body.md. A complete description:

### Description
The `terms` aggregation accepted a non-positive `size` (e.g. `size: 0` or `-1`)
without validation, leading to a confusing downstream failure instead of a clean
400. This adds validation in `TermsAggregationBuilder.size(int)` that throws an
`IllegalArgumentException` with the parameter, constraint, and offending value.

### Related Issues
Closes #NNNNN

### Reproduction
Fails on `main` (commit <hash>):
`./gradlew :server:test --tests "*.TermsAggregationBuilderReproTests"`

### Testing
- Added `TermsAggregationBuilderReproTests` (red before the fix, green after).
- `./gradlew :server:precommit` passes; `spotlessApply` applied.

### Check List
- [x] New functionality includes testing.
- [x] Commits are signed per the DCO using `--signoff`.
- [x] Public documentation issue/PR created (n/a — internal validation).
- [x] CHANGELOG entry added.

Step 8 — Read CI and respond to review (10 min)

GitHub Actions runs precommit, unit tests, and broader gates. When it goes red:

Open the failing job, read the first error (later errors are often cascades).
Distinguish your failure from an unrelated flaky test (search the issue tracker for the failing test name and the flaky-test label). If unrelated, say so and (politely) ask a maintainer to re-run, or reference the tracking issue.

Respond to review by pushing follow-up commits (the PR updates in place); squash only if a maintainer asks. Address every comment, even just to say "done" or to explain a disagreement civilly. See responding to feedback.

Implementation Requirements

A minimal diff: only the lines the fix needs, no drive-by reformatting.
SPDX header on any new file; precommit green.
The Lab 8.1 reproducer is the regression test and is fails-then-passes verified.
A CHANGELOG.md entry under [Unreleased] in the correct category.
DCO sign-off on every commit (git commit -s).
A PR (or PR-ready branch) with the template fully filled in.

Expected Output

# With the fix:
TermsAggregationBuilderReproTests > testNegativeSizeIsRejected PASSED
TermsAggregationBuilderReproTests > testZeroSizeIsRejected PASSED
BUILD SUCCESSFUL

# Without the fix (git stash):
TermsAggregationBuilderReproTests > testNegativeSizeIsRejected FAILED

# precommit:
> Task :server:precommit
BUILD SUCCESSFUL

A PR with green CI, a filled template, a one-line CHANGELOG, and a signed-off commit.

Troubleshooting

Symptom	Cause	Fix
`precommit` fails: missing CHANGELOG	No entry under `[Unreleased]`	Add the one-liner (Step 4)
`precommit` fails: license header	New file lacks SPDX block	Add the header; re-run
DCO check red on the PR	Missing `Signed-off-by:`	`git commit --amend -s` / `git rebase --signoff main`
`spotlessJavaCheck` red	Formatting drift	`./gradlew spotlessApply` then re-commit
Test green with and without fix	Test doesn't assert the bug	Rewrite against the failing behavior
CI red on an unrelated test	Flaky test	Check `flaky-test` label; note it, request re-run
Reviewer: "reduce the diff"	Drive-by changes crept in	Revert unrelated hunks; one logical change

Stretch Goals

Add a parallel shard_size <= 0 validation guard and a test, in the same PR (it is the same logical change). Decide whether shard_size < size should also be rejected or clamped — read how bucketCountThresholds.ensureValidity() already handles it.
Add a REST-YAML test asserting the 400 + error type, so the contract is covered end-to-end, not just the Java setter.
Trace the backport path: if the fix should land on 2.x, what label triggers the backport bot? (backport 2.x). Read compatibility.

Validation / Self-check

Show the test red without the fix and green with it. Why does that round trip matter?
Is your diff minimal? Point to any line that is not strictly required.
Where is the validation placed, and why is that the right choke point?
Does every commit have a Signed-off-by: line?
Which precommit checks did you run, and what did each catch (if anything)?
Is the CHANGELOG entry in the correct category and section?
Could a reviewer reproduce, understand, and verify your fix from the PR description alone?

Cross-references: Lab 8.1: Reproduce an Issue, Lab 8.3: Error Messages, PR quality, responding to feedback, Capstone.

Lab 8.3: Improve Error Messages and Diagnostics

Background

When OpenSearch refuses a request, the message it returns is the contract between the engine and the operator staring at a 3 a.m. page. A message like IllegalArgumentException: bad value tells the operator nothing; a message like [size] must be greater than 0. Found [-1] in aggregation [g] tells them the field, the constraint, the offending value, and where it lives. Turning the first into the second is one of the highest-value, lowest-risk contributions you can make — and it is a steady source of real OpenSearch issues.

This lab is a focused contribution type: take a vague exception, validation failure, or log line and make it actionable, then pin the new text with a unit test so it cannot silently regress. The blast radius is small (you change strings and validation, not control flow), but the message becomes part of the contract, so a test must pin its load-bearing parts (field name, constraint, value).

Why Good Diagnostics Matter to Operators

A clear validation message turns a support ticket into a self-service fix.
Operators grep logs; a message with the field name and value is searchable, a generic one is not.
The "explain" family of APIs (_cluster/allocation/explain, _validate/query, profile) has no value except the quality of its human-readable text — improving it is pure operator UX.
Reviewers love these PRs: the change is contained, the test is exact, the benefit is obvious.

Prerequisites

A built checkout; ./gradlew assemble green.
Lab 8.2 (the PR mechanics — minimal diff, SPDX, DCO, CHANGELOG, precommit).
Read Issue roadmap Stage 3: Error Messages and the REST layer deep dive.

Where Diagnostics Live

flowchart LR
  A[HTTP request] --> B[RestHandler.prepareRequest]
  B -->|bad syntax| E1[IllegalArgumentException / XContentParseException]
  B --> C[ActionRequest.validate]
  C -->|null/empty/conflicting fields| E2[ActionRequestValidationException]
  C --> D[TransportAction.doExecute]
  D -->|runtime failure| E3[OpenSearchException subclass]
  E1 --> R[RestController error path -> JSON + RestStatus]
  E2 --> R
  E3 --> R

Surface	Class / hook	What you improve
Request validation	`*Request.validate()` → `ActionRequestValidationException`	One clear entry per problem
Parse / argument errors	`IllegalArgumentException`, `XContentParseException` from builders/`Setting`	Name the offending token + valid range
REST rendering	`OpenSearchException.toXContent` + `RestStatus`	The JSON error body + HTTP status a client sees
Operator explanations	`_cluster/allocation/explain`, `_validate/query`	The human-readable "why" text

The four properties of a good message — say them as a checklist before you write any string:

What failed (the parameter/field name, in brackets: [size]).
Why (the constraint: must be greater than 0).
The offending value (Found [-1]).
Where / what to do (location, or the valid set/range).
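The checklist above can be read as a function signature. A minimal sketch, assuming a hypothetical helper (`DiagnosticMessage.of` is illustrative; OpenSearch has no such class and builds these strings inline):

```java
// Hypothetical helper illustrating the four-property checklist; not an OpenSearch API.
final class DiagnosticMessage {

    // field = what failed, constraint = why, value = offending value, where = location.
    static String of(String field, String constraint, Object value, String where) {
        return "[" + field + "] " + constraint + ". Found [" + value + "] in " + where;
    }

    public static void main(String[] args) {
        System.out.println(of("size", "must be greater than 0", -1, "aggregation [g]"));
    }
}
```

Forcing all four arguments at the call site makes a missing property visible before the string is ever written.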

Step-by-Step Tasks

Step 1 — Find a vague message (15 min)

Hunt for low-quality diagnostics. Generic exceptions with no context are the target:

# IllegalArgumentExceptions with a bare, context-free message
grep -rn 'IllegalArgumentException("' server/src/main/java/org/opensearch/ \
  | grep -viE '\[|\{|" \+ |must|expected|invalid|unknown|found' | head -20

# validate() methods that addValidationError with vague text
grep -rn "addValidationError" server/src/main/java/org/opensearch/ | head -20

# log lines that fire with no identifying context
grep -rn 'logger\.\(warn\|error\)("' server/src/main/java/org/opensearch/ \
  | grep -viE '\{|" \+ ' | head -20

Pick one where you can show that a user could hit it with a bad request, and where the current text is unhelpful. For this lab the worked example is a request validation improvement.

Step 2 — Reproduce the bad message (10 min)

Start a node and provoke it so you can quote the current (bad) output:

./gradlew run
export OS=http://localhost:9200

# Example: an _mtermvectors / bulk / reindex request missing a required field,
# or a setting update with an out-of-range value. Provoke and capture:
curl -s -i -XPUT "$OS/x/_settings" -H 'Content-Type: application/json' -d '{
  "index": { "number_of_replicas": -3 } }' | sed -n '1,15p'

Record the status and the error.reason. If the message already names the field and value, find a worse one — you want a genuine improvement.

Step 3 — Locate the message in source (15 min)

Grep the exact (or near-exact) text you saw:

grep -rn "number_of_replicas\|must be" \
  server/src/main/java/org/opensearch/cluster/metadata/ \
  server/src/main/java/org/opensearch/common/settings/ | head

Read the throwing site. Decide which surface it belongs to (validation vs parse vs runtime) using the table above. Do not invent a line number — read the code the grep points at.

Step 4 — Improve the message (15 min)

Apply the four-property checklist. A request-validation example — turning a context-free error into a precise one:

--- a/server/src/main/java/org/opensearch/action/admin/indices/open/OpenIndexRequest.java
+++ b/server/src/main/java/org/opensearch/action/admin/indices/open/OpenIndexRequest.java
@@ public ActionRequestValidationException validate() {
     ActionRequestValidationException validationException = null;
     if (CollectionUtils.isEmpty(indices)) {
-        validationException = addValidationError("index is missing", validationException);
+        validationException = addValidationError(
+            "index is missing; specify one or more index names or patterns (e.g. \"my-index\" or \"logs-*\")",
+            validationException
+        );
     }
     return validationException;
 }

An argument-error example — naming the parameter, value, and valid range:

--- a/server/src/main/java/org/opensearch/.../SomeBuilder.java
+++ b/server/src/main/java/org/opensearch/.../SomeBuilder.java
@@
-    if (windowSize < 1) {
-        throw new IllegalArgumentException("invalid window size");
-    }
+    if (windowSize < 1) {
+        throw new IllegalArgumentException(
+            "[window_size] must be at least 1. Found [" + windowSize + "]"
+        );
+    }

Note: Keep the message machine-stable enough to test but human-first. Put field names in [brackets] (the OpenSearch convention) so they are greppable. Do not leak internal class names or stack frames into operator-facing text — that is for logs, not the REST error body.

Step 5 — Pin the message with a unit test (20 min)

A message is now part of the contract; a test must assert it so a future refactor cannot silently regress it. For validate():

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */
package org.opensearch.action.admin.indices.open;

import org.opensearch.action.ActionRequestValidationException;
import org.opensearch.test.OpenSearchTestCase;

import static org.hamcrest.Matchers.containsString;

public class OpenIndexRequestValidationTests extends OpenSearchTestCase {

    public void testMissingIndexGivesActionableMessage() {
        OpenIndexRequest request = new OpenIndexRequest(); // no indices set
        ActionRequestValidationException e = request.validate();
        assertNotNull("validate() should reject a request with no indices", e);
        assertThat(e.getMessage(), containsString("index is missing"));
        // The actionable part — the part operators actually need:
        assertThat(e.getMessage(), containsString("specify one or more index names or patterns"));
    }
}

For an IllegalArgumentException from a builder, assert with expectThrows:

public void testNegativeWindowSizeMessage() {
    IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> new SomeBuilder().windowSize(-1));
    assertThat(e.getMessage(), containsString("[window_size] must be at least 1"));
    assertThat(e.getMessage(), containsString("Found [-1]"));
}

Run it, and verify it is red on main (the old message lacked the new text) and green with your diff:

./gradlew :server:test --tests "*.OpenIndexRequestValidationTests"

Warning: Assert the actionable substring, not the entire message string. Pinning the exact full string makes the test brittle to harmless rewording and invites reviewers to ask you to relax it. Pin the load-bearing words (field name, value, constraint).
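The brittleness is easy to demonstrate. In this sketch both message strings and the class name `PinningSketch` are hypothetical, and the reworded variant is invented purely to show how each pinning style reacts to a harmless edit.

```java
// Hypothetical sketch: exact-string pinning versus substring pinning.
final class PinningSketch {
    static final String ORIGINAL = "[window_size] must be at least 1. Found [-1]";
    // An invented, harmless rewording that a later refactor might introduce.
    static final String REWORDED = "[window_size] must be at least 1. Found [-1] (use a positive integer)";

    // Brittle: breaks on any change to the string, however harmless.
    static boolean exactPin(String message) {
        return ORIGINAL.equals(message);
    }

    // Robust: pins only the load-bearing parts (field name, constraint, value).
    static boolean substringPin(String message) {
        return message.contains("[window_size] must be at least 1") && message.contains("Found [-1]");
    }

    public static void main(String[] args) {
        System.out.println("exact pin on reworded:     " + exactPin(REWORDED));
        System.out.println("substring pin on reworded: " + substringPin(REWORDED));
    }
}
```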

Step 6 — (Optional) verify the REST rendering and status (10 min)

If the message surfaces over REST, confirm the status code and error body with a REST-YAML test — operators see the JSON, not the Java exception:

---
"open index with no name returns an actionable 400":
  - do:
      catch: /index is missing/
      indices.open:
        index: ""
  - match: { status: 400 }

grep -rn "RestStatus" server/src/main/java/org/opensearch/action/admin/indices/open/ | head
./gradlew :rest-api-spec:yamlRestTest -Dtests.rest.suite=indices.open

Confirm the exception maps to RestStatus.BAD_REQUEST (400), not a 500 — a validation problem is the client's fault, and the status must say so. If it renders as 500, that itself is a worthwhile fix.
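The client-fault versus server-fault distinction can be sketched as a tiny mapping. This assumes a deliberately simplified rule, and `StatusSketch` and `ValidationFailure` are hypothetical names. The real REST error path handles more exception types, so read the actual mapping before relying on this.

```java
// Hypothetical, simplified status mapping; the real REST error path covers more cases.
final class StatusSketch {

    // Stand-in for a validation exception that is a kind of IllegalArgumentException.
    static final class ValidationFailure extends IllegalArgumentException {
        ValidationFailure(String message) {
            super(message);
        }
    }

    // Assumption for this sketch: bad arguments are the client's fault (400), all else is 500.
    static int statusFor(Throwable t) {
        return (t instanceof IllegalArgumentException) ? 400 : 500;
    }

    public static void main(String[] args) {
        System.out.println(statusFor(new ValidationFailure("index is missing")));
        System.out.println(statusFor(new IllegalStateException("unexpected internal state")));
    }
}
```

Under this rule, a validation problem thrown as a generic runtime exception would surface as a 500, which is the misclassification this step asks you to look for.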

Step 7 — Ship it as a PR (10 min)

Follow Lab 8.2 exactly: minimal diff, SPDX headers, spotlessApply, CHANGELOG.md entry, DCO sign-off, precommit, and the PR template.

./gradlew spotlessApply :server:precommit
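git add server/src/main/java/org/opensearch/action/admin/indices/open/OpenIndexRequest.java \
        server/src/test/java/org/opensearch/action/admin/indices/open/OpenIndexRequestValidationTests.java \
        CHANGELOG.md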
git commit -s -m "Improve OpenIndexRequest validation message

Make the 'index is missing' validation error actionable by telling the
operator to supply index names or patterns. Pins the message with a test.

Closes #NNNNN"

CHANGELOG line:

 ### Changed
+- Make `OpenIndexRequest` validation error actionable when no index is supplied ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))

Note: Diagnostic-only changes are usually Changed (not Fixed) unless the old message was outright wrong/misleading. Match the category to the nature of the change.

Implementation Requirements

A located vague message with the current (bad) output captured.
A rewritten message satisfying the four-property checklist (what / why / value / where).
A unit test asserting the actionable substring(s) — red on main, green with the fix.
(If REST-surfaced) confirmation the status is the correct 4xx, optionally a REST-YAML test.
A merge-quality PR: minimal diff, SPDX, CHANGELOG, DCO sign-off, precommit green.

Expected Output

OpenIndexRequestValidationTests > testMissingIndexGivesActionableMessage PASSED
BUILD SUCCESSFUL

And, against a running node, an error body that now reads e.g.:

{ "error": { "type": "action_request_validation_exception",
  "reason": "Validation Failed: 1: index is missing; specify one or more index names or patterns (e.g. \"my-index\" or \"logs-*\");" },
  "status": 400 }

Troubleshooting

Symptom	Cause	Fix
Test green on `main` already	The good text already exists	Pick a genuinely vague message
Test brittle / reviewer pushback	Asserting the full exact string	Assert only the load-bearing substring
Error renders as 500 not 400	Wrong exception type / no `RestStatus` mapping	Use a validation/IAE path that maps to 400 (and consider fixing the mapping)
`precommit` fails on new test file	Missing SPDX header	Add the header
Message leaks a class name to users	Internal detail in operator text	Keep internals in logs; user text stays high-level

Stretch Goals

Improve a message in the _cluster/allocation/explain path — find where an unassigned shard's reason text is built and make it name the exact decider that blocked allocation:
```
grep -rn "AllocationDecision\|explain\|NO\b" server/src/main/java/org/opensearch/cluster/routing/allocation/ | grep -i explain | head
```
Convert a multi-problem validate() to emit one entry per problem (so an operator fixing a request sees all the issues at once, not one-at-a-time). Test that two problems produce two entries.
Find a logger.warn/error that fires without identifying context (no index/shard/node id) and add it, with a test or a documented manual grep of the log proving the new format.
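For the one-entry-per-problem stretch goal, the shape of the change is an accumulator rather than an early return. A self-contained sketch: `ValidationSketch` and its `timeoutMillis` parameter are hypothetical, and the numbered format imitates the "Validation Failed: 1: ...;" shape shown in Expected Output rather than calling any OpenSearch class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical accumulator; imitates the numbered "Validation Failed" format for illustration.
final class ValidationSketch {
    private final List<String> errors = new ArrayList<>();

    void add(String error) {
        errors.add(error);
    }

    int count() {
        return errors.size();
    }

    String message() {
        StringBuilder sb = new StringBuilder("Validation Failed: ");
        int index = 0;
        for (String error : errors) {
            sb.append(++index).append(": ").append(error).append(";");
        }
        return sb.toString();
    }

    // Checks every field and records each problem instead of stopping at the first one.
    static ValidationSketch validate(String[] indices, long timeoutMillis) {
        ValidationSketch v = new ValidationSketch();
        if (indices == null || indices.length == 0) {
            v.add("index is missing");
        }
        if (timeoutMillis < 0) {
            v.add("[timeout] must not be negative. Found [" + timeoutMillis + "]");
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(validate(new String[0], -5).message());
    }
}
```

A test for the real change would follow the same pattern: build a request with two problems and assert that both entries appear.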

Validation / Self-check

Quote the message before and after. Which of the four properties did it gain?
Does your test pin the actionable substring, not the whole string? Why does that matter?
Is the test red on main and green with your fix? Show it.
If it surfaces over REST, is the status a correct 4xx (client error), not a 500?
Did you avoid leaking internal class/stack detail into operator-facing text?
Why are diagnostics improvements valued by both operators and reviewers?

Cross-references: Issue roadmap Stage 3: Error Messages, REST layer deep dive, Lab 8.2: Implement the Fix, Action Framework.

Level 9: Advanced Maintainer / TSC-Level Contributor

This is the last level before the capstone, and it is the one that changes how you read every line of code. Up to now you have been a contributor: you found an issue, fixed it, wrote a test, and shipped a PR. A maintainer carries a different burden. A maintainer is the person who says no — who blocks a green, well-tested, genuinely useful PR because it adds a field to a transport response without a version guard, or because it allocates a HashMap per document in the search loop. The contributor optimizes for "does my change work." The maintainer optimizes for "does the project still work, on every supported version, at every scale, after my change merges and ten thousand clusters upgrade into it."

This curriculum will not hold your hand here. By Level 9 you can read the engine, trace a request from RestController to Lucene, write integration tests, and argue a design on a GitHub issue. What you build now is judgment — the two reflexes that every OpenSearch maintainer applies to every PR they review, including their own.

Learning Objectives

By the end of Level 9 you must be able to:

Read any one-line diff to a Writeable, *Request, *Response, or ClusterState component and immediately see its wire backward-compatibility implications — and prove safety with a round-trip and a qa/ BWC test.
Reason about index/Lucene format BWC (data on disk) as a distinct concern from wire BWC (nodes on the network), and know the N-1-major support window.
Treat performance as a correctness property on hot paths: profile before you change, write a JMH microbenchmark in :benchmarks, and confirm the macro impact with OpenSearch Benchmark before claiming a win.
Review another contributor's PR the way a maintainer does — checking BWC, allocation, test coverage, CHANGELOG.md, and the deprecation/backport story, not just whether the code compiles and the diff looks clean.
Own a subsystem: understand its MAINTAINERS.md, its area labels, its backport policy, and what "I am responsible for this not breaking" means.
Connect a code change to the release train — which branch it lands on, which Version constant gates it, whether it needs a backport 2.x, and whether it is a release blocker.

The Two Constant Maintainer Concerns

Strip away everything else and a maintainer's job reduces to defending two invariants on every change. Internalize these; they are the lens for the whole level.

flowchart TB
    PR[Incoming PR / your own change] --> BWC{Does it change<br/>the wire or on-disk<br/>format, or a public<br/>contract?}
    PR --> PERF{Is it on a hot<br/>path — per-document,<br/>per-request, per-shard,<br/>coordination?}
    BWC -->|yes| G1[Version-gate it, add a<br/>round-trip + qa/ BWC test,<br/>plan the deprecation cycle]
    PERF -->|yes| G2[Profile, JMH micro before/after,<br/>OSB macro, check allocation/GC]
    BWC -->|no| OK1[Standard review]
    PERF -->|no| OK2[Standard review]

Concern 1 — Backward compatibility (it stays working while it changes)

OpenSearch runs in clusters that upgrade node by node. During a rolling upgrade a cluster runs two versions at once, and every transport message between an old node and a new node must serialize correctly in both directions. Get this wrong and you do not get a red unit test — you get a corrupted cluster during a customer's upgrade, which is the single worst class of bug this project can ship. The full mechanism — Version-gated StreamInput/StreamOutput, why reordering writes silently corrupts a stream, NamedWriteableRegistry, and the index/Lucene format window — lives in the Serialization and Backward Compatibility deep dive. Re-read it before this level's first lab; the lab assumes it.
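The gating idea can be shown with plain `java.io` streams. This sketch is not the OpenSearch API: `DataOutputStream`/`DataInputStream` stand in for `StreamOutput`/`StreamInput`, an `int` stands in for `Version`, and `GatedMessage` with its `timeoutMs` field is hypothetical. It assumes both sides agree on one stream version and apply the same gate in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: java.io streams and an int version stand in for the real transport types.
final class GatedMessage {
    static final int V_OLD = 1;
    static final int V_NEW = 2; // version in which the hypothetical "timeoutMs" field first ships

    final String index;
    final long timeoutMs; // new field; -1 when the stream version is too old to carry it

    GatedMessage(String index, long timeoutMs) {
        this.index = index;
        this.timeoutMs = timeoutMs;
    }

    // Write side: append the new field only if the stream version understands it.
    byte[] writeTo(int streamVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(index);
            if (streamVersion >= V_NEW) {
                out.writeLong(timeoutMs);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read side: the same gate, in the same position, with a default for old streams.
    static GatedMessage readFrom(byte[] data, int streamVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String index = in.readUTF();
            long timeoutMs = streamVersion >= V_NEW ? in.readLong() : -1L;
            return new GatedMessage(index, timeoutMs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GatedMessage m = new GatedMessage("logs-1", 5000L);
        System.out.println("new stream: " + readFrom(m.writeTo(V_NEW), V_NEW).timeoutMs);
        System.out.println("old stream: " + readFrom(m.writeTo(V_OLD), V_OLD).timeoutMs);
    }
}
```

An asymmetric gate (writing the field unconditionally but reading it conditionally, or the reverse) leaves the reader at the wrong offset, which is the failure a round-trip test is meant to catch.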

Concern 2 — Performance (it stays fast while it changes)

OpenSearch indexes and searches at scale. A path that runs once per document, once per request, once per shard, or inside the coordination layer is hot: a 5% allocation regression there compounds into real cost and real latency across every cluster running your code. Maintainers do not accept "it should be faster" — they accept "here is the JMH microbenchmark, here is the before/after, here is the OpenSearch Benchmark macro result that proves it matters and that no other workload regressed." The discipline is taught in Stage 10: Performance Improvements and drilled in this level's second lab.
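As an illustration of what per-document allocation looks like, the sketch below computes the same result two ways. Everything in it is hypothetical (`HotPathSketch`, the distinct-term task, the stamp-array trick). It demonstrates a mechanism only: it makes no speed claim, which would still need a JMH run and a macro benchmark.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical hot-path example: same answer, different allocation behaviour.
final class HotPathSketch {

    // Allocates a boxed set per document: simple, but churns the heap inside the loop.
    static long distinctTermsAllocating(int[][] docs) {
        long total = 0;
        for (int[] doc : docs) {
            Set<Integer> seen = new HashSet<>();
            for (int term : doc) {
                seen.add(term); // autoboxes every term id
            }
            total += seen.size();
        }
        return total;
    }

    // Reuses one primitive array for all documents; a per-document stamp marks "already counted".
    // Assumes every term id is in the range [0, vocabularySize).
    static long distinctTermsReusing(int[][] docs, int vocabularySize) {
        int[] lastSeenDoc = new int[vocabularySize]; // 0 means "never seen"
        long total = 0;
        for (int d = 0; d < docs.length; d++) {
            int stamp = d + 1;
            for (int term : docs[d]) {
                if (lastSeenDoc[term] != stamp) {
                    lastSeenDoc[term] = stamp;
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] docs = { { 1, 2, 2, 3 }, { 3, 3 }, {} };
        System.out.println(distinctTermsAllocating(docs));
        System.out.println(distinctTermsReusing(docs, 4));
    }
}
```

Equal outputs establish only that the rewrite is behaviour-preserving. Whether it is faster is the question the benchmarks answer.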

These two concerns are why maintainer review feels slow. It is not gatekeeping for its own sake; it is the cost of an invariant that cannot be un-shipped once a release goes out.

How These Map to the Codebase

Concern	Where it lives	What you run
Wire BWC round-trip	`test/framework` — `AbstractWireSerializingTestCase`, `AbstractSerializingTestCase`	`./gradlew :server:test --tests "...Tests"`
Real mixed-version BWC	`qa/` — `qa/rolling-upgrade`, `qa/mixed-cluster`, `qa/full-cluster-restart`	`./gradlew :qa:rolling-upgrade:check`
BWC version constants	`libs/core/src/main/java/org/opensearch/Version.java` (under `server/` on older branches)	`grep -n "V_3_" libs/core/.../Version.java`
Index/Lucene format BWC	`IndexMetadata` (`index.version.created`), Lucene codecs	`qa:full-cluster-restart`, reindex checks
Microbenchmarks (JMH)	`:benchmarks` module (`benchmarks/`)	`./gradlew :benchmarks:jmh`
Macrobenchmarks	OpenSearch Benchmark (`opensearch-benchmark`, a fork of Rally)	`opensearch-benchmark execute-test --workload nyc_taxis`
Pooled allocation	`BigArrays`, `PageCacheRecycler`, circuit breakers	see circuit breakers & memory

Two commands to orient yourself in the repo right now:

# Where the BWC QA suites live
ls qa/
# expect: rolling-upgrade  mixed-cluster  full-cluster-restart  smoke-test-* ...

# Where the JMH microbenchmarks live
ls benchmarks/src/main/java/org/opensearch/benchmark/
./gradlew :benchmarks:jmhJar -q   # builds the runnable JMH uber-jar

Key Practices at the Maintainer Tier

Practice	What it means in OpenSearch	Why it gates merge
Version-gate every wire change	New `Writeable` fields appended behind `out.getVersion().onOrAfter(Version.V_x_y_z)`	A mixed cluster must round-trip in both directions
Round-trip test every serialized type	Extend `AbstractWireSerializingTestCase<T>`; it exercises old `Version`s	Catches a missing/asymmetric gate cheaply, before `qa/`
Profile before optimizing	async-profiler / JFR or an existing benchmark, never intuition	"Faster" claims without numbers are guesses
Micro + macro for perf PRs	JMH in `:benchmarks` proves the mechanism; OSB proves it matters	A JMH win that does not move a workload is noise
Watch allocation / GC	Prefer `BigArrays`/`PageCacheRecycler`; avoid autoboxing and per-doc objects	Hot-path churn compounds into latency and breaker trips
One change, one number	Don't bundle two optimizations or two refactors	You can't attribute the delta otherwise
CHANGELOG + backport label	Every PR adds a `CHANGELOG.md` line; release branches use `backport 2.x`	Release notes and maintenance lines depend on it
Deprecate, don't break	REST/setting changes go through a deprecation cycle, not a removal	Clients and ops tooling break otherwise
Read the whole PR, not the diff	Tests, docs, BWC, perf, the issue it closes, the design discussion	A clean diff can still corrupt a rolling upgrade

Reviewing Others' PRs

By Level 9 you should be reviewing PRs, not only opening them. A maintainer review is a checklist applied in a fixed order, fastest-to-fail first:

Does it build and pass CI? If not, stop — comment and move on.
Does it touch the wire or on-disk format? Any change to a Writeable, *Request/*Response, ClusterState component, Metadata custom, or NamedWriteable name is a BWC review. Look for out.getVersion().onOrAfter(...) and a matching qa/ or round-trip test. No gate, no merge.
Is it on a hot path? If the diff is inside QueryPhase, an Aggregator, IndexShard.applyIndexOperationOnPrimary, the coordination layer, or any per-document/per-request loop, ask for numbers.
Are the tests real? A test that asserts "no exception" is not a test. Look for round-trip, randomized, and (for BWC) cross-version coverage.
CHANGELOG, backport label, deprecation story. Is there a CHANGELOG.md entry? Does it need backport 2.x? Does it remove or rename anything a client relies on?

Disagree without being a wall. The GitHub review chapter and the responding-to-feedback mindset chapter cover the human side — Approve vs Request changes vs Comment, and how to say "this is a BWC break" without making the contributor feel attacked.

Owning a Subsystem

A maintainer is listed in a repo's MAINTAINERS.md and is accountable for an area: search, indexing, cluster coordination, a specific plugin. Ownership means:

You know the area's area labels (Search:Performance, Storage:Durability, Cluster Manager, etc.) and you triage untriaged issues into them.
You know the backport policy for that area and which Version constants gate in-flight features.
You are the BWC and performance conscience for the area — when someone changes a serialized type you own, it is your job to catch the missing gate.
You write down the non-obvious invariants so the next contributor doesn't relearn them by breaking a rolling upgrade. (This is itself a high-value contribution; see the maintainership mindset chapter.)

Deliverables

You must demonstrate all of the following before attempting the capstone:

Completed Lab 9.1: added a version-gated field to a transport request/response, wrote the round-trip test, and ran a qa/ BWC task.
Completed Lab 9.2: wrote a JMH benchmark in :benchmarks, captured before/after numbers, and interpreted an OSB workload.
A written review (in the style of a GitHub review) of one real OpenSearch PR that touches a Writeable, explicitly assessing its wire-BWC safety.
A one-paragraph statement of which subsystem you would own and why, naming its MAINTAINERS.md, its area labels, and one BWC or perf invariant it carries.
From memory: explain the difference between wire BWC, index/Lucene format BWC, and REST BWC, with one failure mode and one mitigation for each.

Common Mistakes

Mistake	Consequence	Fix
Adding a `Writeable` field with no version gate	Mixed-cluster deserialize failure or silent corruption during rolling upgrade	Append behind `out.getVersion().onOrAfter(Version.V_x_y_z)`, symmetric on read
Reordering existing stream writes "to clean it up"	Positional misread → silent wrong values, no exception	Never reorder; append only, gated
Bumping the wrong `Version` constant	Field gated to a version that never shipped, or the wrong release line	Gate on the version where the field first ships; confirm against `Version.java` and the branch
Claiming a perf win with no benchmark	Reviewer cannot verify; often the change is neutral or a regression	JMH micro + OSB macro, before/after, same seed/config
Bundling two optimizations in one PR	Cannot attribute the delta; one may regress	One change, one number
Optimizing un-profiled code	Effort spent on a cold path; real hot path untouched	Profile with async-profiler/JFR first
Allocating per document/request on a hot path	GC pressure, latency, circuit-breaker trips	Use `BigArrays`/`PageCacheRecycler`; hoist allocation out of the loop
Forgetting the `CHANGELOG.md` entry / backport label	CI red or missing from release notes / maintenance line	Add the entry; apply `backport 2.x` when the change belongs there
Treating REST changes as free	Clients and dashboards break on removed/renamed fields	Deprecation cycle, additive JSON; see compatibility
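
The reordering row in the table above can be made concrete with a small, self-contained sketch. It is not OpenSearch code: java.io.DataOutputStream stands in for StreamOutput, and the class name and the deletedDocs field are hypothetical. A writer that swapped two existing long writes is paired with a reader that still expects the original order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical demo; DataOutputStream stands in for StreamOutput. */
class ReorderDemo {

    /** "Cleaned up" writer that swapped the order of two existing long fields. */
    static byte[] writeReordered(long indexedDocs, long deletedDocs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(deletedDocs);   // used to be written second
            out.writeLong(indexedDocs);   // used to be written first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reader on the other node, still expecting indexedDocs first. */
    static long[] readOriginalOrder(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            long indexedDocs = in.readLong();
            long deletedDocs = in.readLong();
            return new long[] { indexedDocs, deletedDocs };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling ReorderDemo.readOriginalOrder(ReorderDemo.writeReordered(1000L, 7L)) completes without an exception and hands back the two values in each other's slots. Both fields have the same width, so nothing on the wire signals the mistake.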

Maintainer Profile: Level 9 Graduate

You can now	Evidence
See the BWC implication of a one-line wire change	You can point to the missing gate in a sample PR and write the fix
Prove BWC, not eyeball it	You ran a round-trip test and a `qa:rolling-upgrade` task
Quantify a performance change	You produced JMH before/after and read an OSB p99
Review a PR like a maintainer	You apply the fixed-order checklist, BWC and perf first
Reason about the release train	You know which `Version` gates a change and whether it backports
Own an area	You can name a `MAINTAINERS.md`, its labels, and its invariants

You are now ready for the capstone — an end-to-end contribution where you select a real issue, reproduce it, find the root cause, implement and test the fix (with BWC and performance discipline), open the PR, and write it up. Start at the Capstone Overview. The two reflexes you built here — will this break a rolling upgrade? and did I prove the performance claim? — are the ones the capstone evaluates hardest.

Note: Most engineers never reach this tier in an open-source project, not because they can't, but because they stop at "my PR is green." The distance from contributor to maintainer is the distance from "it works" to "it keeps working for everyone who upgrades into it." That is the entire content of Level 9.

Lab 9.1: Write Backward-Compatibility (BWC) Tests

Lab type: Build It + Test It
Estimated time: 4–6 hours

Background

OpenSearch clusters upgrade one node at a time. For the duration of a rolling upgrade — which on a large cluster can last hours — nodes of two different versions serve traffic in the same cluster and exchange transport messages. A 3.1 node must be able to send a request to a 3.0 node and read its response, and vice versa. The machinery that makes this possible is version-gated serialization: every object that crosses the wire implements Writeable, carries the peer's Version on its StreamInput/StreamOutput, and writes new fields only when the peer is new enough to understand them.

This lab makes that concrete. You will add a new field to a transport request/response pair, gate it correctly with a Version check, and then write the tests that prove an old node can still talk to a new node. You will run both the cheap round-trip test (AbstractWireSerializingTestCase) and the expensive but authoritative qa/ BWC suite (rolling-upgrade, mixed-cluster, full-cluster-restart).

Read the Serialization and Backward Compatibility deep dive first. This lab assumes you know why reordering stream writes corrupts a stream, what out.getVersion() returns during a rolling upgrade, and the three kinds of compatibility. This lab is the hands-on counterpart to that chapter.
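
As a quick refresher on the second of those points, here is a minimal sketch of the working assumption used throughout this lab: a stream between two nodes carries the older of the two node versions, so a gate evaluated against it is false whenever either end predates the field. WireVersion and streamVersion are hypothetical names for illustration, not the real org.opensearch.Version API.

```java
/** Hypothetical stand-in for a node version; not org.opensearch.Version. */
final class WireVersion implements Comparable<WireVersion> {
    final int major;
    final int minor;

    WireVersion(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    @Override
    public int compareTo(WireVersion other) {
        int byMajor = Integer.compare(major, other.major);
        return byMajor != 0 ? byMajor : Integer.compare(minor, other.minor);
    }

    /** Same shape as the onOrAfter check used in a wire gate. */
    boolean onOrAfter(WireVersion other) {
        return compareTo(other) >= 0;
    }

    /** Assumed rule: the stream between two nodes carries the older version. */
    static WireVersion streamVersion(WireVersion local, WireVersion remote) {
        return local.compareTo(remote) <= 0 ? local : remote;
    }
}
```

With a 3.2 node talking to a 3.0 node, streamVersion returns the 3.0 instance, and onOrAfter for 3.2 is false on it, so a field gated on 3.2 stays off the wire in both directions.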

Why This Lab Matters for Contributors

A missing version gate is the single most common reason a maintainer rejects an otherwise-good PR — and the most dangerous bug class to miss, because it produces no failing unit test. It surfaces only in a real mixed-version cluster, typically in production, during a customer's upgrade. Learning to write BWC tests is learning to make this invisible class of bug visible before merge. Every maintainer in the project carries this reflex; by the end of this lab, so do you.

Prerequisites

You can build OpenSearch from source and run ./gradlew :server:test.
You have read the serialization BWC deep dive and the transport layer deep dive.
You understand Writeable: a writeTo(StreamOutput) method plus a StreamInput constructor that reads fields in the same order writeTo wrote them.
You know the current version lines: 3.x/main is current, 2.x is the maintenance line, 1.x is legacy.

Step-by-Step Tasks

Step 1: Orient yourself in the version and BWC machinery

Find the Version constants — these are the levers for every wire gate — and the qa/ suites that prove BWC end to end.

# The named version constants and the ordering helpers you will gate on.
grep -n "public static final Version V_3_\|public static final Version CURRENT\|minimumCompatibilityVersion\|minimumIndexCompatibilityVersion" \
  server/src/main/java/org/opensearch/Version.java | head -40

# The methods you will use in gates.
grep -n "public boolean onOrAfter\|public boolean before\|public boolean onOrBefore\|public boolean after" \
  server/src/main/java/org/opensearch/Version.java

# The BWC QA suites.
ls qa/
# expect: rolling-upgrade  mixed-cluster  full-cluster-restart  smoke-test-* verify-version-constants ...

Note: The newest unreleased version constant on main is what you usually gate a brand-new field on. Pick it by reading Version.java on your branch; do not guess V_3_1_0 from memory. If CURRENT on your branch is 3.2.0, a field you add now ships in 3.2.0 and gates on Version.V_3_2_0. Exact constant names vary by branch.

Step 2: Find a real, simple transport request/response to extend

You want a small Writeable request/response pair so the serialization is easy to read. Good candidates are simple stats or info responses. Find one and read both its writeTo and its StreamInput constructor.

# Find transport requests/responses with explicit version gates already in them —
# these are your worked examples of the correct pattern.
grep -rln "out.getVersion().onOrAfter\|in.getVersion().onOrAfter" \
  server/src/main/java/org/opensearch/action/ | head

# A concrete, readable example to study end to end:
grep -n "writeTo\|StreamInput\|getVersion().onOrAfter" \
  server/src/main/java/org/opensearch/action/admin/cluster/health/ClusterHealthRequest.java | head -40

For this lab we will use a worked, self-contained example so the pattern is crisp. Imagine a transport response NodeFeatureStatsResponse (your real target will be an existing class) that today serializes two fields:

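import java.io.IOException;

// Package paths as on the 3.x line; confirm them on your branch.
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.transport.TransportResponse;

public class NodeFeatureStatsResponse extends TransportResponse {
    private final long indexedDocs;
    private final String engineType;

    public NodeFeatureStatsResponse(long indexedDocs, String engineType) {
        this.indexedDocs = indexedDocs;
        this.engineType = engineType;
    }

    public NodeFeatureStatsResponse(StreamInput in) throws IOException {  // READ
        super(in);
        this.indexedDocs = in.readVLong();     // must mirror the first write
        this.engineType = in.readString();     // must mirror the second write
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {            // WRITE
        out.writeVLong(indexedDocs);
        out.writeString(engineType);
    }

    // Getters, equals, and hashCode are omitted for brevity; the test in Step 4 relies on them.
}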

Step 3: Add the new field — gated, appended, symmetric

We add an optional mergedSegmentCount (a long) that ships in the next release. Determine that version from Version.java (Step 1); here we assume V_3_2_0.

Three rules, all non-negotiable:

Append, never insert in the middle. The new write goes after every existing write; the new read goes after every existing read.
Gate symmetrically. The if (out.getVersion().onOrAfter(...)) on write and the if (in.getVersion().onOrAfter(...)) on read must be the identical condition.
Give the old peer a sensible default on read, since an old node never sent the field.

 public class NodeFeatureStatsResponse extends TransportResponse {
     private final long indexedDocs;
     private final String engineType;
+    private final long mergedSegmentCount;   // new in 3.2.0

-    public NodeFeatureStatsResponse(long indexedDocs, String engineType) {
+    public NodeFeatureStatsResponse(long indexedDocs, String engineType, long mergedSegmentCount) {
         this.indexedDocs = indexedDocs;
         this.engineType = engineType;
+        this.mergedSegmentCount = mergedSegmentCount;
     }

     public NodeFeatureStatsResponse(StreamInput in) throws IOException {  // READ
         super(in);
         this.indexedDocs = in.readVLong();
         this.engineType = in.readString();
+        if (in.getVersion().onOrAfter(Version.V_3_2_0)) {
+            this.mergedSegmentCount = in.readVLong();
+        } else {
+            this.mergedSegmentCount = 0L;        // old peer didn't send it
+        }
     }

     @Override
     public void writeTo(StreamOutput out) throws IOException {            // WRITE
         out.writeVLong(indexedDocs);
         out.writeString(engineType);
+        if (out.getVersion().onOrAfter(Version.V_3_2_0)) {
+            out.writeVLong(mergedSegmentCount);  // only send to 3.2+ peers
+        }
     }
 }

Warning: The two if conditions are the entire safety argument. If the write gate says V_3_1_0 and the read gate says V_3_2_0, then on a stream negotiated at 3.1 the writer emits mergedSegmentCount and the reader never consumes it; with the gates swapped, the reader consumes a field that was never written. Either way the stream desynchronizes and every field after it is garbage. The two conditions must be character-for-character the same.
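
The symmetric-gate rule can be exercised outside OpenSearch with a small sketch. Everything here is an assumption for illustration: java.io.DataOutputStream stands in for StreamOutput, versions are plain ints, and GatedCodec and its constants are hypothetical names. The reader also checks for unread trailing bytes, which makes a desynchronized stream visible.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical codec; plain ints stand in for Version constants. */
class GatedCodec {
    static final int V_3_1_0 = 30100;
    static final int V_3_2_0 = 30200;

    /** Existing fields first, then the new field only for 3.2+ streams. */
    static byte[] write(int streamVersion, long indexedDocs, String engineType, long mergedSegmentCount) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(indexedDocs);
            out.writeUTF(engineType);
            if (streamVersion >= V_3_2_0) {
                out.writeLong(mergedSegmentCount);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Mirrors write() with the identical gate; defaults to 0 on older streams. */
    static long readMergedSegmentCount(int streamVersion, byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            in.readLong();   // indexedDocs
            in.readUTF();    // engineType
            long merged = streamVersion >= V_3_2_0 ? in.readLong() : 0L;
            if (in.available() != 0) {
                throw new IllegalStateException("unread trailing bytes: writer and reader gates disagree");
            }
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing and reading at the same stream version round-trips cleanly at both 3.1 and 3.2. Feeding bytes written for a 3.2 stream to a reader that believes the stream is 3.1 leaves eight bytes unread, which is the failure an asymmetric gate produces.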

When the field is a reference type, prefer the writeOptional*/readOptional* helpers so null round-trips cleanly:

// write side, gated:
out.writeOptionalString(maybeNullName);
// read side, gated:
this.maybeNullName = in.readOptionalString();
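
For intuition about why null round-trips cleanly, the sketch below shows the general presence-flag idea behind optional encodings: one boolean says whether a value follows. This is an illustration under stated assumptions, with java.io streams standing in for StreamOutput/StreamInput and a hypothetical OptionalCodec class; the exact bytes OpenSearch writes may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical presence-flag codec for a nullable String. */
class OptionalCodec {

    /** Writes a one-byte presence flag, then the value only if it is non-null. */
    static byte[] writeOptional(String value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(value != null);
            if (value != null) {
                out.writeUTF(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the flag first and consumes a value only when the flag says one follows. */
    static String readOptional(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            return in.readBoolean() ? in.readUTF() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The presence flag is part of the payload, so it sits inside the same version gate as the value. It does not replace the gate.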

Step 4: Write the round-trip serialization test

This is the cheapest, highest-value BWC guard. AbstractWireSerializingTestCase<T> serializes your instance, deserializes it, asserts equality — and crucially also does this at random older Versions, which catches a missing or asymmetric gate immediately.

package org.opensearch.action.admin.cluster.stats;

import org.opensearch.Version;
import org.opensearch.core.common.io.stream.Writeable;
import org.opensearch.test.AbstractWireSerializingTestCase;

public class NodeFeatureStatsResponseTests
        extends AbstractWireSerializingTestCase<NodeFeatureStatsResponse> {

    @Override
    protected Writeable.Reader<NodeFeatureStatsResponse> instanceReader() {
        return NodeFeatureStatsResponse::new;     // the StreamInput ctor
    }

    @Override
    protected NodeFeatureStatsResponse createTestInstance() {
        return new NodeFeatureStatsResponse(
            randomNonNegativeLong(),
            randomFrom("internal", "nrt_replication", "read_only"),
            randomNonNegativeLong()                // mergedSegmentCount
        );
    }

    @Override
    protected NodeFeatureStatsResponse mutateInstance(NodeFeatureStatsResponse in) {
        // Optional but good practice: ensure equals/hashCode discriminate the new field.
        return new NodeFeatureStatsResponse(
            in.getIndexedDocs(),
            in.getEngineType(),
            in.getMergedSegmentCount() + 1
        );
    }
}

For the cross-version assertion to mean anything, your type must implement equals and hashCode over all fields, including the new one — otherwise the round-trip "passes" while silently dropping the field. The base class compares the deserialized instance to the original with equals.

Run it:

./gradlew :server:test --tests "org.opensearch.action.admin.cluster.stats.NodeFeatureStatsResponseTests"

If you forgot or mis-gated the field, this test fails at a randomly chosen old Version with a mismatch — often before you ever touch the slow qa/ suite. That is the point: make the gate fail in two seconds, not in a customer's upgrade.

Note: AbstractWireSerializingTestCase uses the randomized testing framework. A failure prints a -Dtests.seed=... line; re-run with that exact seed to reproduce the specific old Version that exposed the bug.
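
The equals/hashCode requirement from this step can be seen in isolation. The sketch below is a plain value class with hypothetical names (StatsValue, equalsIgnoringNewField), not the transport response itself. It contrasts a full equals with a comparison that forgot the new field.

```java
import java.util.Objects;

/** Hypothetical value class mirroring the three fields of the worked example. */
final class StatsValue {
    final long indexedDocs;
    final String engineType;
    final long mergedSegmentCount;

    StatsValue(long indexedDocs, String engineType, long mergedSegmentCount) {
        this.indexedDocs = indexedDocs;
        this.engineType = engineType;
        this.mergedSegmentCount = mergedSegmentCount;
    }

    /** Correct: every field participates, including the new one. */
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StatsValue)) return false;
        StatsValue other = (StatsValue) o;
        return indexedDocs == other.indexedDocs
            && Objects.equals(engineType, other.engineType)
            && mergedSegmentCount == other.mergedSegmentCount;
    }

    @Override
    public int hashCode() {
        return Objects.hash(indexedDocs, engineType, mergedSegmentCount);
    }

    /** Deliberately incomplete comparison, shown only for contrast. */
    boolean equalsIgnoringNewField(StatsValue other) {
        return indexedDocs == other.indexedDocs && Objects.equals(engineType, other.engineType);
    }
}
```

If deserialization drops mergedSegmentCount and substitutes 0, the full equals reports the difference, while the incomplete comparison still says the two instances match. A round-trip assertion built on the incomplete version would stay green.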

Step 5: Understand and run the `qa/` BWC suites

The round-trip test proves your type round-trips at an older version in a single JVM. The qa/ suites prove that two real, separately-built clusters of different versions actually interoperate over a real network — the authoritative proof.

Suite	What it proves	Gradle project
`qa:rolling-upgrade`	A cluster upgraded node-by-node (old → mixed → new) keeps working at every step	`:qa:rolling-upgrade`
`qa:mixed-cluster`	A cluster running two versions simultaneously serves traffic correctly	`:qa:mixed-cluster`
`qa:full-cluster-restart`	A cluster shut down on the old version and restarted on the new one (covers on-disk/index format)	`:qa:full-cluster-restart`

These are wired to run against specific old versions via bwcVersion. The build resolves the set of versions still in the compatibility window from the version catalog.

# See how the BWC versions are declared and wired into the qa/ projects.
grep -rn "bwcVersion\|bwc_version\|BWC_VERSION\|wireCompatVersions\|indexCompatVersions" \
  qa/ build.gradle settings.gradle gradle/ 2>/dev/null | head

# Run the rolling-upgrade suite (slow — it builds and boots two-version clusters).
./gradlew :qa:rolling-upgrade:check

# Run the mixed-cluster suite.
./gradlew :qa:mixed-cluster:check

# Run the full-cluster-restart suite (also exercises index/on-disk format BWC).
./gradlew :qa:full-cluster-restart:check

Warning: These tasks download and build the old version's artifacts and start multiple JVMs. They are slow (minutes to tens of minutes) and resource-hungry. Run the round-trip test in Step 4 on every change; run the qa/ suite before you open the PR and whenever you touch serialization of something that crosses the wire.

To scope a qa/ run while iterating, target a single BWC version or a single test:

# List the version-parameterized tasks the rolling-upgrade project generates.
./gradlew :qa:rolling-upgrade:tasks --all | grep -i bwc

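# Run just one test class inside the suite. --tests is accepted only by Test-type
# tasks, not by the check lifecycle task, so substitute one of the tasks listed above.
./gradlew ":qa:rolling-upgrade:<bwc-test-task>" --tests "*Recovery*"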

Step 6: Wire the new field into a real `qa/` assertion (optional but ideal)

The strongest BWC test asserts the behavior the field enables across versions: a new node should still get a correct response from an old node (where the field is absent and takes its default), and an old node should not choke on the new node's message (where the write gate suppresses the field, so it never reaches the old node). In the rolling-upgrade suite these are typically REST-level assertions in the OLD/MIXED/UPGRADED task phases. Read an existing one:

grep -rln "MIXED\|UPGRADED\|TEST_STEP\|isRunningAgainstOldCluster" qa/rolling-upgrade/ | head
grep -rn "assertBusy\|client().performRequest" \
  qa/rolling-upgrade/src/test/java/org/opensearch/upgrades/ 2>/dev/null | head

Model your assertion on those: in the MIXED phase, hit the endpoint that returns your field and assert the cluster behaves correctly regardless of which node answers.

Index / Lucene Format BWC (know the difference)

Everything above is wire BWC — nodes on a network. There is a second, distinct kind: index/Lucene format BWC — data already written to disk. OpenSearch reads indices created by the previous major version (Lucene's N-1 codec back-compat), but not older; an index created two majors ago must be reindexed. The creation version is recorded as index.version.created in IndexMetadata, and the engine refuses to open an index whose format is too old.

grep -rn "index.version.created\|VERSION_CREATED\|minimumIndexCompatibilityVersion\|isCompatible" \
  server/src/main/java/org/opensearch/cluster/metadata/IndexMetadata.java \
  server/src/main/java/org/opensearch/Version.java | head

The qa:full-cluster-restart suite is where index-format BWC is exercised: it writes data on the old version, restarts on the new version, and asserts the index still opens and serves. If your change touches how a segment, a doc value, or index metadata is written, this is the suite that catches a format break. See the refresh/flush/merge deep dive for what actually lands on disk.
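
The N-1 rule described in this section reduces to a one-line predicate when versions are collapsed to their major number. The sketch below makes that simplification explicit. IndexCompat and canOpen are hypothetical names, and the real check goes through the minimumIndexCompatibilityVersion machinery in Version.java rather than bare integers.

```java
/** Hypothetical simplification of index-format compatibility to major numbers. */
class IndexCompat {

    /**
     * An index can be opened if it was created on the node's own major
     * or on the immediately preceding major, and not on a newer one.
     */
    static boolean canOpen(int indexCreatedMajor, int nodeMajor) {
        return indexCreatedMajor <= nodeMajor && indexCreatedMajor >= nodeMajor - 1;
    }
}
```

Under this simplification a 3.x node opens indices created on 2.x and 3.x, while an index created on 1.x has to be reindexed first.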

Pitfalls

Pitfall	What goes wrong	How to avoid
Reordering existing writes	Positional stream misread → silent wrong values, no exception	Append only; never touch the order of existing writes
Asymmetric gate (read vs write conditions differ)	One side reads a field the other never wrote → stream desync	Copy the exact same `onOrAfter(...)` condition to both sides
Forgetting the version guard entirely	Old node reads bytes it doesn't expect → deserialize failure or corruption	Always gate a new field; the round-trip test catches the omission
Gating on the wrong `Version` constant	Field gated to a version that never shipped, or the wrong line	Read `Version.java` on your branch; gate on where it first ships
Missing field in `equals`/`hashCode`	Round-trip test "passes" but silently drops the new field	Include every field in `equals`/`hashCode`
Using `readInt`/`writeInt` where the other side used `VInt`	Type/width mismatch → misread	Read and write the exact same method pair
Changing `getWriteableName()` of a `NamedWriteable`	Old nodes registered the old name → `unknown NamedWriteable [x]`	The name is a contract; never rename it
Skipping `qa/` because units are green	Wire BWC only manifests across real versions	Run `:qa:rolling-upgrade:check` before the PR
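
The width-mismatch row is easy to underestimate, so here is a self-contained sketch of why it misreads. It implements a common variable-length integer scheme (seven payload bits per byte, high bit meaning "more bytes follow") next to a fixed four-byte encoding. VarIntDemo and its methods are hypothetical helpers written for this illustration, not the StreamOutput implementation.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helpers contrasting variable-length and fixed-width int encodings. */
class VarIntDemo {

    /** Variable-length: 7 payload bits per byte, high bit set while more bytes follow. */
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    /** Fixed-width big-endian: always four bytes. */
    static byte[] encodeInt(int value) {
        return new byte[] { (byte) (value >>> 24), (byte) (value >>> 16), (byte) (value >>> 8), (byte) value };
    }

    /** Decodes one variable-length int from the start of the array. */
    static int decodeVInt(byte[] wire) {
        int result = 0;
        int shift = 0;
        for (byte b : wire) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;   // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return result;
    }
}
```

The value 300 takes two bytes in the variable-length form and four in the fixed form. A variable-length reader pointed at the fixed-width bytes stops after the first zero byte, returns 0, and leaves three bytes behind to be misread as the next field.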

Expected Output

A correct round-trip test run:

> Task :server:test
NodeFeatureStatsResponseTests > testSerialization PASSED
NodeFeatureStatsResponseTests > testEqualsAndHashcode PASSED

BUILD SUCCESSFUL

A failure from a missing/asymmetric gate looks like a mismatch at a random old version, with a reproducible seed:

NodeFeatureStatsResponseTests > testSerialization FAILED
    java.lang.AssertionError: expected ... but was ...
    REPRODUCE WITH: ./gradlew :server:test --tests "...NodeFeatureStatsResponseTests" \
      -Dtests.seed=DEADBEEFCAFE -Dtests.method="testSerialization"

A clean qa/ run ends with BUILD SUCCESSFUL after booting and tearing down the multi-version clusters.

Stretch Goals

Add an enum field instead of a primitive. Enums are subtler: an old node cannot deserialize a value it doesn't know. Gate the producer so an old peer never receives the new value, and write a test that proves it.
Make the new field a NamedWriteable and register it in NamedWriteableRegistry. Write a test that a node without the registration gets unknown NamedWriteable [x] — then add the registration and watch it pass. See the serialization BWC deep dive.
Find a real merged PR on github.com/opensearch-project/OpenSearch that added a version-gated field. Read its writeTo/StreamInput diff and its test. Confirm the gates are symmetric and the test exercises old versions.
Run qa:full-cluster-restart and trace how it asserts that an index written on the old version still opens on the new version.

Validation / Self-check

Answer all of these before marking the lab complete:

Write, from memory, a correct version-gated writeTo/StreamInput pair that adds one optional long field in the next release. State the single invariant that makes it safe in a mixed cluster.
With a byte-level sketch, explain why reordering two existing writeTo writes corrupts the stream silently rather than throwing.
During a 3.0→3.2 rolling upgrade, a 3.2 node sends your response to a 3.0 node. What does out.getVersion() return, and exactly which field "disappears" and why?
Why must your type implement equals/hashCode over the new field for AbstractWireSerializingTestCase to actually catch a missing gate?
Name the three qa/ suites and state what each one proves that the round-trip unit test cannot.
Distinguish wire BWC from index/Lucene format BWC. Which qa/ suite exercises the index-format kind, and how many major versions back can an index be read?
Your round-trip test is green but qa:rolling-upgrade is red. List two concrete causes and where you would look first.

Lab 9.2: Analyze a Performance Regression

Lab type: Research & Benchmark
Estimated time: 4–6 hours

Background

A performance PR without numbers is a guess. At OpenSearch's scale a 5% allocation regression on a path that runs once per document, once per request, or once per shard compounds into real latency and real cost across every cluster running the code. Maintainers therefore treat performance on hot paths as a correctness property: they do not accept "this should be faster," they accept "here is the JMH microbenchmark, here is the before/after, here is the OpenSearch Benchmark macro result that proves it matters and that nothing else regressed."

This lab teaches the measurement loop. You will profile to find a hot path, write a JMH microbenchmark in the :benchmarks module to isolate it, read the results (including allocation and GC pressure), and then connect the micro result to a macro result from OpenSearch Benchmark. You will walk a realistic regression — a per-request allocation or an accidental O(n²) on a search/aggregation/coordination path — and prove the fix the way a maintainer demands.

This lab is the hands-on counterpart to Stage 10: Performance Improvements; read that stage for the issue-finding workflow and the policy framing. The allocation and circuit-breaker mechanics live in the circuit breakers & memory deep dive.

Why This Lab Matters for Contributors

Performance patches are held to a higher review bar than correctness patches, for two reasons. First, "faster" is unfalsifiable without a benchmark — a maintainer cannot approve a claim they cannot verify. Second, an optimization that helps one workload often hurts another, and the cost of shipping a regression is paid by every user, not just the one whose case you optimized. Learning to produce micro + macro numbers, and to read them honestly (including saying "this JMH win does not move any real workload"), is what lets a maintainer trust your perf PRs — and what lets you, as a reviewer, gate others'.

Prerequisites

You can build OpenSearch and run ./gradlew :server:test.
You have read Stage 10 and the circuit breakers & memory deep dive.
You understand at least one hot path well — the search execution, aggregations, or engine internals deep dive.
JMH basics: @Benchmark, @BenchmarkMode, @State, Blackhole, warmup vs measurement iterations.

The Measurement Loop

flowchart LR
  P[profile: async-profiler / JFR<br/>find the hot path] --> M1[baseline: JMH micro + OSB macro]
  M1 --> C[make ONE change]
  C --> M2[re-measure same JMH + same OSB]
  M2 --> D{delta real AND<br/>no regression?}
  D -->|yes| PR[PR with before/after numbers]
  D -->|no| C

Three rules, drilled until they are reflex:

Never optimize un-profiled code. Find the hot path with a profiler or an existing benchmark, not intuition. Most "obvious" optimizations target cold code.
Micro proves the mechanism; macro proves it matters. A JMH win that does not move an OSB workload is noise — and you must say so.
One change, one number. Bundle two optimizations and you cannot attribute the delta, and one may quietly regress.

Step-by-Step Tasks

Step 1: Find a hot path (profile, don't guess)

Hot paths run per-document, per-request, per-shard, or inside the coordination layer. Start from one of these and confirm with a profiler.

Path	Representative class	Why it's hot
Query execution	`QueryPhase.execute(...)`	Once per shard per search
Aggregation collection	`Aggregator.collect(...)`, `LeafBucketCollector`	Once per matching doc
Aggregation reduce	`InternalAggregation.reduce(...)`	Once per shard result on the coordinator
Indexing	`IndexShard.applyIndexOperationOnPrimary`, `InternalEngine.index`	Once per indexed doc
Result merge	`SearchPhaseController.reducedQueryPhase(...)`	Once per search on the coordinator
DocValues read	`SortedNumericDocValues`/`fielddata` iteration	Once per doc in sort/agg

Profile a representative workload (you can run one from OpenSearch Benchmark) under async-profiler or JFR, then grep for the allocation site you suspect:

# Per-document / per-request allocations are the usual culprits. Look for new objects,
# autoboxing, and collection growth inside loops on these paths.
grep -rn "new HashMap\|new ArrayList\|new Object\[\|\.add(" \
  server/src/main/java/org/opensearch/search/aggregations/ | head

# DocValues / BigArrays usage on hot paths (the right way to allocate):
grep -rn "bigArrays\|BigArrays\|PageCacheRecycler\|newLongArray\|newDoubleArray" \
  server/src/main/java/org/opensearch/search/aggregations/ | head

Note: A new HashMap inside a per-document collect() is a classic regression: it allocates an entry array, churns the young generation, and at high doc counts can push heap usage high enough to trip a circuit breaker. The fix is usually to hoist the allocation out of the loop or use a pooled BigArrays structure. See the circuit breakers & memory deep dive.
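
The hoisting fix mentioned in the note has a recognizable shape, sketched below on a toy term-counting loop. HoistDemo and both methods are hypothetical and use plain Java collections rather than BigArrays; the point is only where the allocation sits relative to the per-document loop.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy example: counting one term per document. */
class HoistDemo {

    /** Regressed shape: a fresh scratch map is allocated for every document. */
    static long[] countPerDocAllocation(int[] termPerDoc, int termCount) {
        long[] counts = new long[termCount];
        for (int term : termPerDoc) {
            Map<Integer, Long> scratch = new HashMap<>();   // allocated once per document
            scratch.merge(term, 1L, Long::sum);             // boxes the key and the value
            for (Map.Entry<Integer, Long> e : scratch.entrySet()) {
                counts[e.getKey()] += e.getValue();
            }
        }
        return counts;
    }

    /** Fixed shape: the accumulator is allocated once, outside the loop. */
    static long[] countHoisted(int[] termPerDoc, int termCount) {
        long[] counts = new long[termCount];
        for (int term : termPerDoc) {
            counts[term]++;                                 // no allocation per document
        }
        return counts;
    }
}
```

Both methods return the same counts, which is the property to assert before benchmarking: the fix must change allocation behavior, not results.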

Step 2: Locate the `:benchmarks` module

JMH microbenchmarks live in their own Gradle module.

# Where the benchmarks live and how they're organized.
ls benchmarks/src/main/java/org/opensearch/benchmark/
# expect subpackages like: search/ routing/ time/ store/ ...

# Read an existing benchmark end to end as a template.
grep -rln "@Benchmark" benchmarks/src/main/java/org/opensearch/benchmark/ | head
grep -n "@Benchmark\|@State\|@Setup\|@BenchmarkMode\|Blackhole" \
  $(grep -rln "@Benchmark" benchmarks/src/main/java/org/opensearch/benchmark/ | head -1)

Step 3: Write a JMH microbenchmark for the hot path

Isolate the suspect method. The example below contrasts a per-call HashMap allocation (with boxing) against a single pre-sized primitive array, which is the shape of a real regression and its fix. Put it under benchmarks/src/main/java/org/opensearch/benchmark/....

package org.opensearch.benchmark.search;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

@Fork(value = 2)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class BucketAggregationBenchmark {

    @Param({"16", "256", "4096"})
    public int buckets;

    private long[] keys;
    private long[] vals;

    @Setup
    public void setup() {
        keys = new long[buckets];
        vals = new long[buckets];
        for (int i = 0; i < buckets; i++) {
            keys[i] = i;
            vals[i] = i * 31L;
        }
    }

    /** Regressed path: allocates a HashMap (with boxing) on every call. */
    @Benchmark
    public void allocatingPerCall(Blackhole bh) {
        Map<Long, Long> m = new HashMap<>();          // allocation + autoboxing
        for (int i = 0; i < buckets; i++) {
            m.merge(keys[i], vals[i], Long::sum);
        }
        bh.consume(m.size());
    }

    /** Fixed path: primitive arrays, no boxing, sized once. */
    @Benchmark
    public void primitiveArrays(Blackhole bh) {
        long[] acc = new long[buckets];               // single primitive array
        for (int i = 0; i < buckets; i++) {
            acc[(int) keys[i]] += vals[i];
        }
        bh.consume(acc[buckets - 1]);
    }
}

Warning: A JMH benchmark that the JIT can prove has no side effects gets optimized away — you measure nothing. Always bh.consume(...) the result, take work in from @State/@Param (not constants), and trust the warmup iterations. If a benchmark reports 0 ns/op, the dead-code eliminator ate it.

Step 4: Run the benchmark and read the results

# Build the runnable JMH uber-jar.
./gradlew :benchmarks:jmhJar -q

# Run the whole suite, or scope to one benchmark with a regex and turn on the
# allocation profiler (-prof gc) — allocation is usually the real story.
java -jar benchmarks/build/distributions/opensearch-benchmarks-*.jar \
  "BucketAggregationBenchmark" -prof gc

# Some checkouts expose a convenience task; either works:
./gradlew :benchmarks:jmh -Pjmh.includes="BucketAggregationBenchmark"

Read the output as a maintainer reads it — two numbers matter, not one:

Benchmark                              (buckets)  Mode  Cnt    Score    Error  Units
BucketAggregationBenchmark.allocatingPerCall  4096  avgt   20  41210.3 ± 980.1  ns/op
BucketAggregationBenchmark.primitiveArrays    4096  avgt   20   3120.7 ±  88.4  ns/op

# With -prof gc, the line that decides perf-sensitive PRs:
BucketAggregationBenchmark.allocatingPerCall:·gc.alloc.rate.norm  4096   ~196600  B/op
BucketAggregationBenchmark.primitiveArrays:·gc.alloc.rate.norm    4096    ~32792  B/op

What you extract:

Score ± Error — mean time per op and its confidence interval. The intervals must not overlap for the delta to be real.
gc.alloc.rate.norm — bytes allocated per operation. This is the number that predicts GC pressure and circuit-breaker behavior at scale. A throughput win that raises allocation is suspect.
@Param scaling — how the gap behaves as buckets grows. A regression that is flat at 16 buckets and explosive at 4096 is an algorithmic problem (e.g. O(n²)), not a constant-factor one.
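
Two of the checks above are simple arithmetic and can be sketched directly. The first treats Score ± Error as an interval and tests for overlap; the second estimates the exponent k in time ≈ n^k from two (size, time) points, which is one rough way to tell a constant-factor gap from an algorithmic one. JmhReading and its methods are hypothetical helpers, not part of JMH.

```java
/** Hypothetical helpers for reading JMH output; not part of JMH. */
class JmhReading {

    /** True when [scoreA ± errA] and [scoreB ± errB] share at least one point. */
    static boolean intervalsOverlap(double scoreA, double errA, double scoreB, double errB) {
        double lowA = scoreA - errA;
        double highA = scoreA + errA;
        double lowB = scoreB - errB;
        double highB = scoreB + errB;
        return lowA <= highB && lowB <= highA;
    }

    /** Estimated exponent k in time ~ n^k, from a small and a large input size. */
    static double scalingExponent(double nSmall, double tSmall, double nLarge, double tLarge) {
        return Math.log(tLarge / tSmall) / Math.log(nLarge / nSmall);
    }
}
```

For the sample output above, intervalsOverlap(41210.3, 980.1, 3120.7, 88.4) is false, so the delta is outside the reported error. An exponent near 1 suggests linear growth and an exponent near 2 suggests a quadratic path. Two points only give a rough estimate, so confirm with every @Param size.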

Step 5: Macro benchmark with OpenSearch Benchmark

The microbenchmark proves the mechanism in isolation. To prove it matters, run a macro benchmark against a real cluster with OpenSearch Benchmark (opensearch-benchmark, a fork of Rally). It drives standard workloads and reports end-to-end service time and throughput.

pip install opensearch-benchmark   # one-time

# List the standard workloads.
opensearch-benchmark list workloads
# nyc_taxis        — heavy aggregation + range queries over taxi-trip data
# http_logs        — log-style indexing + search, big volume
# geonames, pmc, noaa, eventdata, so (Stack Overflow), ...

# Run a search-heavy workload against a cluster you launched from your build
# (./gradlew run gives you a node on :9200). Baseline THEN your change.
opensearch-benchmark execute-test \
  --target-hosts localhost:9200 \
  --pipeline benchmark-only \
  --workload nyc_taxis \
  --kill-running-processes

Interpret the summary the way a maintainer does:

Metric	What it tells you	Maintainer reads it as
Throughput (ops/s)	Sustained request rate	Did the change raise indexing/search rate?
Service time (mean)	Server-side processing time per op	The core latency number for the operation
Service time p50/p90/p99	Tail latency	p99 is what users feel; tail regressions are unacceptable
Error rate	Failed requests (incl. breaker trips)	A "faster" change that trips breakers is a regression

Note: Run baseline and candidate on the same machine, same workload, same warmup, ideally back-to-back, and compare the distributions, not single runs. OpenSearch Benchmark can store results and compare two test executions (opensearch-benchmark compare --baseline=<id> --contender=<id>). A 3% mean change inside the run-to-run noise is not a result.
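
To see why the table asks for p99 and not only the mean, the sketch below computes both over a set of service-time samples. It uses the nearest-rank definition of a percentile, which is an assumption made for this illustration; OpenSearch Benchmark may compute its percentiles differently. Percentiles is a hypothetical helper class.

```java
import java.util.Arrays;

/** Hypothetical helpers: nearest-rank percentile and mean over raw samples. */
class Percentiles {

    /** Nearest-rank percentile; p is in (0, 100]. The input array is not modified. */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean of the samples. */
    static double mean(double[] samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }
}
```

With 98 samples at 10 ms and 2 samples at 500 ms, the mean is 19.8 ms and p50 is 10 ms, but p99 is 500 ms. A change that adds a few slow outliers barely moves the mean and is obvious in the tail.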

Step 6: Prove the fix — before / after, micro and macro

Apply the one change (and only that change). Re-run the identical JMH benchmark and the identical OSB workload. Lay the numbers side by side:

                              JMH ns/op (4096)   JMH B/op (4096)   OSB nyc_taxis p99
allocatingPerCall (before)         41210               196600           188 ms
primitiveArrays  (after)            3120                 32792           171 ms
delta                              -92%                 -83%             -9%

This is the artifact a maintainer wants in the PR description: a micro result that explains the mechanism (allocation and time both fell), a macro result that shows it moves a real workload's p99, and confirmation that no other workload regressed.
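
The delta row in the table above is plain relative change, rounded to a whole percent. A minimal sketch, with Delta and percentDelta as hypothetical names:

```java
/** Hypothetical helper for the delta row of a before/after table. */
class Delta {

    /** Relative change from before to after, as a rounded percentage (negative means lower). */
    static long percentDelta(double before, double after) {
        return Math.round((after - before) / before * 100.0);
    }
}
```

Applied to the three columns above, percentDelta(41210, 3120) gives -92, percentDelta(196600, 32792) gives -83, and percentDelta(188, 171) gives -9, matching the table.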

How Maintainers Gate Perf-Sensitive PRs

When reviewing a PR that touches a hot path, a maintainer asks, in order:

Where are the numbers? No before/after benchmark → request changes. "Should be faster" is not reviewable.
Micro and macro? JMH alone can mislead (it measures one method out of context); ask for an OSB workload that exercises the path. JMH-only is acceptable only when the change is provably isolated.
What about allocation and the tail? Throughput up but gc.alloc.rate.norm up, or mean down but p99 up, is a likely net loss. The breaker behavior matters.
Any other workload worse? The bar is "target workload meaningfully better, no workload meaningfully worse." Ask which workloads were checked.
Is it one change? Two optimizations in one diff → ask to split so each delta is attributable.
Is it on a real hot path at all? Optimizing cold code adds complexity for no gain; a clean, simple version often wins.

Apply this same checklist to your own perf PRs before you open them. That is the maintainer reflex this lab builds.

Pitfalls

Pitfall	What goes wrong	How to avoid
Optimizing un-profiled code	Effort on a cold path; hot path untouched	Profile with async-profiler/JFR first
JMH benchmark optimized away	Reports 0 ns/op or meaningless numbers	`bh.consume(...)`; feed inputs via `@State`/`@Param`
Reading only mean time	Miss an allocation or p99 regression	Always run `-prof gc`; read p99 in OSB
Single OSB run	Run-to-run noise mistaken for a result	Compare distributions; use `compare`; same machine
Bundling two changes	Cannot attribute the delta; one may regress	One change, one number
Micro win, no macro move	Optimizing something that doesn't matter	Confirm with an OSB workload; report honestly if flat
Ignoring GC / breaker behavior	"Faster" change trips circuit breakers at scale	Watch `gc.alloc.rate.norm` and OSB error rate
Hand-rolled arrays instead of `BigArrays`	Misses pooling and breaker accounting	Use `BigArrays`/`PageCacheRecycler` on real hot paths

Expected Output

A JMH run with -prof gc showing both time and allocation, with non-overlapping confidence intervals between before and after; an OSB summary table for the same workload before and after, showing the target metric (e.g. p99 service time) improved and no other tracked metric regressed; and a side-by-side before/after table suitable for pasting into a PR description.

Stretch Goals

Find a real merged OpenSearch PR labeled Search:Performance or Indexing:Performance and read the before/after numbers in its description. Reproduce its JMH benchmark.
Convert a hot-path new long[]/new HashMap to BigArrays, measure the allocation and breaker-accounting difference, and read why BigArrays integrates with the request circuit breaker in the circuit breakers & memory deep dive.
Construct a benchmark that exposes an O(n²) path: parameterize input size and show the time grows quadratically while the fixed version grows linearly.
Profile a node started with ./gradlew run while an OSB nyc_taxis workload runs against it, using async-profiler, and produce a flame graph; identify the single hottest frame and tie it to a class.

Validation / Self-check

Answer all of these before marking the lab complete:

Why is a JMH microbenchmark insufficient on its own to justify a perf PR, and what does an OpenSearch Benchmark macro run add that JMH cannot?
What is gc.alloc.rate.norm, why does a maintainer care about it as much as the time score, and how does it relate to circuit breakers?
Explain "micro proves the mechanism, macro proves it matters" with a concrete case where the micro number improves but the macro number does not move.
Your @Benchmark reports 0 ns/op. Name two causes and the fix for each.
Why is "one change, one number" a hard rule for performance PRs?
You see mean service time fall 8% but p99 rise 15% in OSB. Is this a win? Justify your answer in terms of what users experience.
Given a per-document new HashMap in an Aggregator.collect(...), describe the regression mechanism (allocation, GC, breaker) and two ways to fix it.

OpenSearch Contributor Mindset

This section is the "soft skills with hard edges" half of the curriculum. The Levels teach you the mechanics — how to build OpenSearch from source, trace a search request, write a query builder, fix an allocation decider. This section teaches you the behavior and judgment that turns a working diff into a merged pull request and, eventually, into maintainer trust.

These are not optional skills. A technically excellent PR with poor process around it sits in the queue for months and is quietly closed as stale. A modest, focused PR with clean process — DCO sign-off, a CHANGELOG entry, green CI, a tight description tied to an issue — gets a maintainer's attention and gets merged. The difference is rarely the code.

OpenSearch is a GitHub-native project under the OpenSearch Software Foundation (Linux Foundation). That single fact reshapes everything in this section relative to a classic Apache project: design lives in GitHub issues and PR threads, not JIRA; contributions require a DCO sign-off (git commit -s), not a CLA; iteration happens by force-pushing the PR branch, not by re-rolling numbered patch files; and trust is recorded in a per-repo MAINTAINERS.md, governed by a Technical Steering Committee (TSC).

Reading Order

The seven chapters are ordered to mirror the actual arc of a new contributor: learn to read, learn where decisions live, learn to talk, learn to ship, learn to iterate, learn what you may break, learn how trust accrues.

#	Chapter	What it answers	Audience	When to read
1	Reading a Large OpenSearch Codebase	How do I navigate a Lucene + server + modules monorepo without drowning?	Anyone opening the repo for the first time	Before any lab; pre-work
2	Design Through Code and GitHub	Where do design decisions live, and how do I recover the "why" behind code?	Anyone changing existing behavior	Before you propose a change
3	Community Interaction	How do I use GitHub, the forum, and Slack without burning trust?	Anyone about to file or claim an issue	Before your first issue/comment
4	PR Quality and Preparation	What does a maintainer-ready PR look like?	Anyone about to open a PR	Before you click "Create pull request"
5	Responding to Maintainer Feedback	How do I handle review comments and iterate well?	Anyone with an open PR under review	The moment a review lands
6	Compatibility, Stability, Performance	What can I change without breaking a rolling upgrade or a benchmark?	Anyone touching serialization, REST, or hot paths	Before any change to wire/index/REST/perf surfaces
7	The Path to Maintainership	How does a contributor become a maintainer?	Anyone investing in OpenSearch long-term	When you start thinking beyond one PR

Chapters 1–2 are pre-work — read them before opening any issue. Chapters 3–5 are operational — read them before submitting your first PR. Chapters 6–7 are strategic — read them when you start thinking beyond a single change.

How This Section Differs From the Levels

The distinction is deliberate and worth internalizing:

The Levels (mechanics)	This section (behavior and judgment)
How `TransportSearchAction` fans out to shards	How to find out it does, in 10 minutes, from a cold start
How `StreamInput`/`StreamOutput` serialize a `Writeable`	When a serialization change will break a mixed-version cluster
How to write an `OpenSearchIntegTestCase`	Why a PR without that test will not merge
How `ClusterApplierService` applies a state	How to ask a maintainer about it on the forum without wasting their time
How the `backport 2.x` label triggers a bot	When to apply it, and what to do when the automated backport hits a conflict

The Levels make you able to do the work. This section makes the work land. A maintainer evaluating you is not grading your understanding of InternalEngine; they are grading whether your PR is safe, scoped, tested, and respectful of their time. That is judgment, and judgment is what this section trains.

How This Section Complements the Rest of the Book

The relationship is concrete. Each mindset chapter pairs with technical material:

Mindset chapter	Pairs with
Reading the Codebase	Level 2 Lab 1: Navigate the Repo; the deep-dives index
Design via GitHub	Level 2 Lab 3: Fix a Good-First-Issue
Community Interaction	the release-governance communication chapter
PR Quality	Level 2 Lab 2: Prepare a PR; the GitHub review chapter
Responding to Feedback	Level 2 Lab 4: Review a PR
Compatibility	the serialization-BWC deep dive; Level 9 BWC/perf labs
Maintainership	the TSC governance chapter

If you are doing the Capstone, you should have read all seven chapters by the time you reach the step where you open your real PR. The Capstone is graded in part on process — the things this section teaches.

What This Section Is Not

It is not generic open-source advice. Every claim, template, and procedure here is grounded in OpenSearch-specific reality:

The GitHub repo opensearch-project/OpenSearch, its issues, and its PR threads.
The in-repo policy files: CONTRIBUTING.md, DEVELOPER_GUIDE.md, TESTING.md, CHANGELOG.md, MAINTAINERS.md, .github/pull_request_template.md, CODE_OF_CONDUCT.md.
The community forum at forum.opensearch.org, the public Slack at opensearch.org/slack, and recorded community meetings.
The TSC's published governance under the OpenSearch Software Foundation / Linux Foundation.

Where a chapter generalizes, it labels the generalization. Where it states an OpenSearch-specific rule, it cites the in-repo file or the GitHub mechanism that enforces it.

Prerequisites

Before this section is useful you must have:

A local clone of OpenSearch:

git clone https://github.com/opensearch-project/OpenSearch.git ~/os-src
cd ~/os-src

A GitHub account (your contribution identity — there is no JIRA).

Git configured so git commit -s produces a correct Signed-off-by: line:

git config --global user.name  "Your Real Name"
git config --global user.email "you@example.com"   # must match your DCO sign-off

A successful local build at least once — see Level 1 Lab 1: Build From Source. You cannot reason about PR quality until you have watched ./gradlew precommit pass or fail on your own machine.
Optional but recommended: a forum account and membership in the public Slack. You do not need any special status; committer or maintainer status is earned later and is the subject of Chapter 7.

You Have Absorbed This Section When…

Treat this as the gate before declaring the section "read." You have absorbed it when you can:

Find any feature in OpenSearch within 10 minutes by tracing from a Rest*Action or Transport*Action, using git log -S, the IDE call hierarchy, and tests as spec.
Recover the "why" behind a piece of code: run git blame, find the PR that introduced it, and read that PR's design discussion and the issue it closed.
File an issue a maintainer can act on with zero follow-up questions: version, repro, minimal example, expected vs actual.
Open a PR that passes ./gradlew precommit and spotlessJavaCheck on the first try, carries a DCO sign-off, adds a CHANGELOG.md entry under [Unreleased], includes tests, and links its issue.
Iterate on review the OpenSearch way: push fixups, keep CI green, force-push to squash, re-request review, push back with evidence rather than argue, and apply the right backport label after merge.
Predict, for any change, which compatibility surface it touches — wire/index BWC, REST API contract, or performance — and what test or benchmark proves it safe.
Explain the difference between a contributor, a maintainer (MAINTAINERS.md), and the TSC, and describe a realistic track record that leads from the first to the second.

The next chapter — Reading a Large OpenSearch Codebase — gives you the navigation strategy you will use through everything that follows.

Reading a Large OpenSearch Codebase

OpenSearch core is large. The server/ module alone is hundreds of thousands of lines of Java, and it sits on top of Apache Lucene (another large codebase you will read into), surrounded by libs/, dozens of bundled modules/ and plugins/, a test framework, and a qa/ BWC suite. No single human holds it all in their head — not the founders, not the most prolific maintainers. The skill is not memory; it is navigation.

This chapter gives you the strategies maintainers actually use to find their way through code they did not write and do not remember. Everything here assumes a clone at ~/os-src (git clone https://github.com/opensearch-project/OpenSearch.git ~/os-src).

Module Map First

Before reading any code, learn the module shape. OpenSearch is a Gradle multi-project build, so the authoritative module list comes from Gradle, not from guessing at directories:

cd ~/os-src
./gradlew projects | head -60

That prints the project tree (:server, :libs:core, :modules:transport-netty4, :plugins:repository-s3, :test:framework, :qa:..., and so on). Cross-reference it with the build files on disk:

find . -maxdepth 3 -name build.gradle | sort

The top-level directories that matter for ~90% of core work:

Dir	What lives there	When you read it
`server/`	The core engine, `org.opensearch.*`. The bulk of what you read.	Almost always
`libs/core`	`org.opensearch.core`: `StreamInput`/`StreamOutput`, `Writeable`, `Version`	Anything serialized or version-gated
`libs/x-content`	XContent (JSON/YAML/CBOR/SMILE) parsing	REST bodies, mappings, `toXContent`
`libs/common`, `libs/geo`, `libs/secure-sm`	Shared low-level primitives	Utilities, geo, security manager
`modules/`	Bundled-by-default: `transport-netty4`, `reindex`, `lang-painless`, `analysis-common`, `ingest-common`, `percolator`	Transport, scripting, analysis, ingest
`plugins/`	In-repo optional plugins: `analysis-icu`, `repository-s3`, `discovery-ec2`	Storage repos, cloud discovery
`test/framework/`	`OpenSearchTestCase`, `OpenSearchIntegTestCase`, `InternalTestCluster`	Always, when writing tests
`qa/`	BWC, rolling-upgrade, mixed-cluster, packaging QA	Any compatibility-sensitive change
`rest-api-spec/`	REST API JSON specs + shared REST-YAML tests	REST contracts
`distribution/`	Packaging: archives, docker, deb/rpm	Build/packaging issues

Pin a one-sentence-per-module note file. You will re-read it constantly:

mkdir -p ~/os-notes
$EDITOR ~/os-notes/module-map.md

Note: The big functional plugins — security, k-NN, sql, alerting, ml-commons, index-management — live in separate repos under opensearch-project/, not in this tree. If you grep core for a security filter or a k-NN field type and find nothing, that is why. Clone the relevant repo separately.

Strategy 1: Start From the Public Surface, Trace Inward

Every external interaction with OpenSearch enters through one of two surfaces: the REST layer (HTTP on :9200) for users, or the transport layer (internal RPC on :9300) for node-to-node traffic. That makes Rest*Action and Transport*Action classes the only mandatory entry points. Read inward from them.

The canonical reading order for a request:

RestXxxAction          (parses HTTP, builds a request, calls NodeClient)
   ↓
TransportXxxAction     (the registered handler; the business logic entry)
   ↓
XxxService             (e.g. SearchService, IndicesService — the subsystem)
   ↓
Lucene / Engine / ClusterState   (the actual mechanism)

Find the REST handlers and the actions that pair with them:

cd ~/os-src
# All REST handlers (the HTTP surface):
find server -name 'Rest*Action.java' | sort | head -40
# All transport actions (the internal surface):
find server -name 'Transport*Action.java' | sort | head -40

A Rest*Action is wired in ActionModule; so is the Transport*Action it dispatches to. When you do not know how a feature is reachable, grep ActionModule for the registration:

grep -n "SearchAction\|RestSearchAction" \
  server/src/main/java/org/opensearch/action/ActionModule.java

That registration line tells you the ActionType, the request/response classes, and the transport handler — the whole arc in one place. See the action-framework deep dive and the REST-layer deep dive for the machinery; here we only care that the surface is where you start reading.

Strategy 2: The Wire Surface Is the Real Contract

In Hadoop-family Apache projects, protobuf .proto files are the source of truth for much of what crosses the wire. OpenSearch core has historically had no protobuf on its hot path (protobuf is starting to appear in some newer subsystems). The equivalent contract is twofold:

Wire serialization: classes implement Writeable and read/write themselves through StreamInput/StreamOutput. Polymorphic types resolve through NamedWriteableRegistry. Compatibility is enforced by Version gating inside writeTo and the matching read path, typically a constructor that takes a StreamInput (if (out.getVersion().onOrAfter(Version.V_3_0_0)) { ... }).
REST specs: the JSON files under rest-api-spec/ define every REST endpoint's path, parameters, and body — the user-facing contract.
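The Version-gating pattern in the first bullet can be sketched without any OpenSearch classes. The code below uses plain java.io streams and an int version as stand-ins for StreamOutput/StreamInput and Version; the GatedMessage type, its fields, and the V_3 constant are hypothetical and exist only to show the shape of the gate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for a Writeable whose wire format gained a field in "version 3".
final class GatedMessage {
    static final int V_3 = 3;               // hypothetical version that added timeoutMillis
    static final long UNSET_TIMEOUT = -1L;  // default used when the peer never sent the field

    final String query;
    final long timeoutMillis;

    GatedMessage(String query, long timeoutMillis) {
        this.query = query;
        this.timeoutMillis = timeoutMillis;
    }

    // Write side: emit the new field only when the receiving peer understands it.
    void writeTo(DataOutputStream out, int peerVersion) throws IOException {
        out.writeUTF(query);
        if (peerVersion >= V_3) {
            out.writeLong(timeoutMillis);
        }
    }

    // Read side: mirror the same gate and fall back to a default for older senders.
    static GatedMessage read(DataInputStream in, int peerVersion) throws IOException {
        String query = in.readUTF();
        long timeout = peerVersion >= V_3 ? in.readLong() : UNSET_TIMEOUT;
        return new GatedMessage(query, timeout);
    }

    // Serializes and deserializes as if talking to a peer on peerVersion.
    static GatedMessage roundTrip(GatedMessage message, int peerVersion) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            message.writeTo(new DataOutputStream(bytes), peerVersion);
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return read(in, peerVersion);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GatedMessage original = new GatedMessage("match_all", 500L);
        System.out.println("peer on v3 sees timeout " + roundTrip(original, 3).timeoutMillis);
        System.out.println("peer on v2 sees timeout " + roundTrip(original, 2).timeoutMillis);
    }
}
```

The read and write gates must test the same version. If only one side is gated, the reader consumes bytes that were never written (or skips bytes that were), and every field after it is garbage. That failure is what mixed-version tests are meant to catch.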

To learn the wire shape of any serialized type, read its Writeable implementation, not a schema file:

# The serialization primitives:
find libs/core -name 'StreamInput.java' -o -name 'StreamOutput.java'
# Find a type's writeTo, its StreamInput read path, and any Version gating in it:
grep -n "writeTo\|StreamInput\|getVersion().onOrAfter" \
  server/src/main/java/org/opensearch/action/search/SearchRequest.java

To learn the REST contract:

ls rest-api-spec/src/main/resources/rest-api-spec/api/ | grep search
sed -n '1,40p' rest-api-spec/src/main/resources/rest-api-spec/api/search.json

Practical rule: if you are changing a field that crosses a node boundary or changes a REST body, you are changing compatibility. Stop and read the compatibility chapter and the serialization-BWC deep dive before you write the change.

Strategy 3: IDE Call Hierarchy + `git log -S` / `git log -G`

Two tools, used together, replace most speculative reading.

Call hierarchy (IntelliJ Ctrl-Alt-H, VS Code "Show Call Hierarchy") answers "who calls this?". Use it on an entry point like SearchService.executeQueryPhase to find every caller in production code and tests. The test call sites are often the clearest documentation.

git log -S and git log -G answer "when and why did this code appear?". -S matches commits that changed the count of a string (added or removed it); -G matches commits whose diff touches lines matching a regex (better for following an existing identifier):

cd ~/os-src
# When was this method introduced / removed?
git log -S "executeQueryPhase" --oneline -- server/

# Which commits touched lines mentioning segment replication config?
git log -G "segment.replication" --oneline -- server/ | head

Pick the oldest relevant commit and read its full message. OpenSearch commits reference the PR number ((#1234)) and usually the issue:

git show <sha> --stat | head -30
# Look for "(#NNNN)" — that is the pull request.

That PR — its description, its review thread, the issue it closed — is the design discussion. It is often more valuable than the code. The next chapter, Design via GitHub, is entirely about turning that (#NNNN) into the "why."
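The -S / -G distinction described above is easiest to see on a commit that edits a line containing the string without adding or removing an occurrence. The sketch below models the two rules in plain Java. It is a simplification of what git does (real git evaluates each changed file in a commit), and the class and method names are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

final class PickaxeModel {

    // Counts non-overlapping occurrences of needle in text.
    static int count(String text, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int n = 0;
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + needle.length())) {
            n++;
        }
        return n;
    }

    // -S rule: the commit matches when the occurrence count differs before vs. after.
    static boolean matchesS(String before, String after, String needle) {
        return count(before, needle) != count(after, needle);
    }

    // -G rule: the commit matches when any removed or added line matches the regex.
    static boolean matchesG(List<String> removedLines, List<String> addedLines, String regex) {
        Pattern p = Pattern.compile(regex);
        return removedLines.stream().anyMatch(line -> p.matcher(line).find())
            || addedLines.stream().anyMatch(line -> p.matcher(line).find());
    }

    public static void main(String[] args) {
        // A commit that only adds a parameter to an existing call.
        String before = "executeQueryPhase(request, task);";
        String after = "executeQueryPhase(request, task, listener);";
        System.out.println("-S matches: " + matchesS(before, after, "executeQueryPhase"));
        System.out.println("-G matches: " + matchesG(List.of(before), List.of(after), "executeQueryPhase"));
    }
}
```

For the signature-change commit in main, the count stays at one on both sides, so the -S rule skips it while the -G rule reports it. That is why -S is the better tool for "when was this introduced or deleted" and -G for "every commit that touched a line mentioning this".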

Strategy 4: Tests Are Executable Spec

The OpenSearch test suite is the cheapest way to learn what a class does, because it exercises the class with assertions about expected behavior. There are four tiers, and you read them in this order:

Test type	Files	What it tells you
Unit	`*Tests.java` (`OpenSearchTestCase`)	The contract of a single class, including edge cases
Serialization	`Abstract*SerializingTestCase` subclasses	The exact wire/XContent round-trip — gold for BWC
Integration	`*IT.java` (`OpenSearchIntegTestCase`, `InternalTestCluster`)	End-to-end behavior across nodes
REST	YAML under `rest-api-spec` (`yamlRestTest`)	The HTTP contract, version-skipped where needed

For any class Foo.java, look for FooTests.java; the test method names alone form a behavior spec:

# Find the unit test for SearchRequest:
find server -name 'SearchRequestTests.java'
# Read the behaviors it asserts:
grep "public void test" $(find server -name 'SearchRequestTests.java')

For end-to-end behavior, the integration tests are the truth:

find server -name '*SearchIT.java' | head

And the REST-YAML tests show exactly what an HTTP client sees, including the skip blocks that document version-specific behavior:

ls rest-api-spec/src/main/resources/rest-api-spec/test/search/
sed -n '1,40p' rest-api-spec/src/main/resources/rest-api-spec/test/search/10_source_filtering.yml

To actually run a scoped slice while you read:

./gradlew :server:test --tests "org.opensearch.action.search.SearchRequestTests"

Read tests before guessing how a feature behaves. The test author already encoded the behavior you are about to reverse-engineer.

Strategy 5: Keep a Reading Log

Maintainers have working memory of the codebase because they wrote a lot of it. You don't. Compensate with notes. Keep one file and append a dated entry every time you trace a path:

cat >> ~/os-notes/reading-log.md <<'EOF'

## 2026-06-16 — Search request path
- RestSearchAction.prepareRequest (server/.../rest/action/search/) parses query, builds SearchRequest
- → client.execute(SearchAction.INSTANCE, searchRequest, listener)
- → TransportSearchAction.doExecute (server/.../action/search/) — coordinating node
- → fans out per-shard; SearchService.executeQueryPhase (QueryPhase) then executeFetchPhase
- → SearchPhaseController.reduce merges shard results on the coordinating node
EOF

When you re-read it three months later, the log is gold. Without it, you re-trace the same path from zero and waste an afternoon. The best contributors treat the reading log as part of the work, not overhead.

Worked Exercise: `RestSearchAction` → `TransportSearchAction` → `SearchService` (90 minutes)

Goal: in 90 minutes, trace a _search request from the HTTP edge to per-shard execution, and record the path in your reading log. Use only the tools above. Do not copy the hop names from earlier in this chapter; rediscover them from the code.

Step 1 (15 min) — Find the REST entry

cd ~/os-src
find server -name 'RestSearchAction.java'
grep -n "prepareRequest\|registerHandlers\|routes()" \
  $(find server -name 'RestSearchAction.java')

Read prepareRequest. Note two things: it parses the query/body into a SearchRequest, and it ends by calling something like client.execute(SearchAction.INSTANCE, searchRequest, ...) (or a channel consumer that does). Write the file + method in your log.

Step 2 (15 min) — Find where the action is registered

grep -n "SearchAction\b" server/src/main/java/org/opensearch/action/ActionModule.java

This proves the mapping from SearchAction.INSTANCE to TransportSearchAction. Confirm the request and response classes named in the registration.

Step 3 (20 min) — Enter the transport action

find server -name 'TransportSearchAction.java'
grep -n "doExecute\|executeSearch\|class TransportSearchAction" \
  $(find server -name 'TransportSearchAction.java')

Read doExecute. This is the coordinating node logic: it resolves indices to shards, decides query-then-fetch vs DFS, and dispatches per-shard search requests. Note where it builds the shard iterator and where it hands off to the search phases.

Step 4 (20 min) — Drop into per-shard execution

find server -name 'SearchService.java'
grep -n "executeQueryPhase\|executeFetchPhase\|createContext" \
  $(find server -name 'SearchService.java')

executeQueryPhase runs on the node holding each shard. Trace how it builds a SearchContext/DefaultSearchContext and invokes QueryPhase. This is the boundary where OpenSearch hands off to Lucene. (Depth here belongs to the search-execution deep dive — do not go deeper now; just confirm the hop.)

Step 5 (10 min) — Verify with a test, then run it live

# A test that exercises the path end to end:
find server -name '*SearchIT.java' | head -1
# Optional: launch a node and hit it for real (separate terminal):
# ./gradlew run
# curl -s 'localhost:9200/_search?size=0' | head

Step 6 (10 min) — Record it

Append the five-hop trace to ~/os-notes/reading-log.md, citing the file and method for each hop. If you can reproduce the trace tomorrow without this page, you have the navigation skill.

Validation: Artifacts To Keep

After this chapter you should have produced and saved:

~/os-notes/module-map.md — one sentence per top-level dir and per major server/ package, derived from ./gradlew projects and find . -name build.gradle.
~/os-notes/reading-log.md — the _search trace from the exercise above, with file + method per hop.
One git log -S (or git log -G) command you ran and the PR number (#NNNN) it surfaced, saved to the log — your bridge into Design via GitHub.
The name of one unit test, one serialization test, and one *IT.java that exercise the search path — your "executable spec" set.

You have absorbed this chapter when you can find any feature in OpenSearch within 10 minutes, starting from a Rest*Action or Transport*Action, without re-reading these steps. The next chapter, Understanding Design Through Code and GitHub, tells you where the decisions behind that code actually live.

Understanding Design Through Code and GitHub

In an Apache project, you ask "what is the JIRA?" In OpenSearch, the question is "what is the issue and PR?" OpenSearch is a GitHub-native project: design discussion, decision records, review history, and the rationale for nearly every line of code live on github.com/opensearch-project/OpenSearch. There is no JIRA, no separate design wiki that maintainers actually use, and no email thread that the code points back to. The "why" behind the code is a (#NNNN) away.

This chapter teaches you to recover that "why" — to do code archaeology — so you change existing behavior with full knowledge of why it is the way it is. Changing code without reading its origin PR is the single most common way a well-meaning contributor reintroduces a bug that was deliberately fixed two years ago.

Where OpenSearch Design Actually Lives

Location	What it holds	How to reach it
GitHub issues	Bug reports, feature requests, RFCs, meta/tracking issues	Repo → Issues; filter by label
Issue labels `RFC` / `meta` / `proposal` / `roadmap`	Large designs and their discussion	`is:issue label:RFC`
Pull request descriptions	The "what and why" of a change, linked to its issue	The `(#NNNN)` on every commit
PR review threads	Maintainer objections, alternatives considered, the final consensus	The "Conversation" tab of a PR
In-repo docs	`DEVELOPER_GUIDE.md`, `CONTRIBUTING.md`, `TESTING.md`, design notes under subsystem dirs	The repo tree itself
Community meetings	Recorded video + notes for big decisions	`forum.opensearch.org` and the project's YouTube channel
The forum	Longer-form design and user-impact discussion	`forum.opensearch.org`

The center of gravity is the issue → PR pair. A well-run feature looks like: an issue (often labeled RFC or meta) collects the design and the consensus; one or more PRs implement it, each closing or referencing the issue with Closes #NNNN; the PR review thread records every objection and how it was resolved. That chain is the design document.

Note: Large efforts use a meta issue as an umbrella that links many child issues and PRs. When you find a meta issue for your area (search is:issue label:meta plus a keyword), you have found the project's own roadmap for that subsystem. Read it top to bottom before proposing anything.

The Difference From Apache JIRA Culture

If you come from Hadoop, Hive, or Tez, recalibrate. The differences are not cosmetic — they change where you look and how you participate.

Apache (JIRA)	OpenSearch (GitHub)
Design lives in a JIRA ticket; code commits reference `PROJECT-NNNN`	Design lives in an issue + PR thread; commits reference `(#NNNN)`
Patches attached as `.patch` files, re-rolled as `v1/v2/v3`	A branch + PR; iterate by force-pushing the branch
Contributor identity gated by a CLA	Identity gated by DCO sign-off (`git commit -s`)
`dev@` mailing list for proposals and votes	GitHub issues (`RFC`/`proposal`), forum, Slack, community meetings
"Committer" / "PMC"	"Maintainer" (`MAINTAINERS.md`) / "TSC"
Find the "why" via JIRA → linked patch	Find the "why" via `git blame` → PR → linked issue

The practical consequence: when you want the rationale for a line of code in OpenSearch, you do not search a JIRA project — you walk git blame to the introducing commit, read its PR, and read the issue that PR closed.

Archaeology: From a Line of Code to Its "Why"

This is the core skill. Given any suspicious or surprising line, recover the decision behind it in four moves.

Move 1 — `git blame` to find the introducing commit

cd ~/os-src
# Blame a specific region of a file (lines 120–140):
git blame -L 120,140 server/src/main/java/org/opensearch/cluster/ClusterState.java

git blame gives you the commit SHA and author for each line. If the line you care about was last touched by a trivial reformat (Spotless, a license header), peel back to the real change:

# Walk history of just these lines, following moves/renames:
git log -L 120,140:server/src/main/java/org/opensearch/cluster/ClusterState.java --oneline

Move 2 — `git log -S` / `git log -G` to find when a concept appeared

blame shows the last touch; -S/-G find the originating change of a string or pattern, even if later commits reshaped it:

# When did "clusterManager" terminology enter this package's history?
git log -G "clusterManager" --oneline -- server/src/main/java/org/opensearch/cluster/

# When was a specific method first added anywhere?
git log -S "applyIndexOperationOnPrimary" --oneline -- server/

Move 3 — From commit to PR

The subject line of an OpenSearch commit normally ends with the PR number in parentheses. Extract it:

git show <sha> | head -20
# Title line looks like:  "Rename master to cluster_manager in X (#3537)"

Open https://github.com/opensearch-project/OpenSearch/pull/3537. Read the description (the "what and why") and the Conversation tab (the objections and the resolution). Note the linked issue at the top (Closes #NNNN / Resolves #NNNN).
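If you do this often, pulling the number out of the subject line is a one-regex job. A minimal sketch, assuming the trailing "(#NNNN)" convention described above; the class and method names are hypothetical, and the URL prefix is the one used in this chapter.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrNumber {

    // Matches a "(#1234)" that closes the subject line, ignoring trailing whitespace.
    private static final Pattern TRAILING_PR = Pattern.compile("\\(#(\\d+)\\)\\s*$");

    // Returns the PR number, or empty when the subject does not end with one.
    static OptionalInt fromSubject(String subject) {
        Matcher m = TRAILING_PR.matcher(subject);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    static String pullRequestUrl(int number) {
        return "https://github.com/opensearch-project/OpenSearch/pull/" + number;
    }

    public static void main(String[] args) {
        String subject = "Rename master to cluster_manager in X (#3537)";
        fromSubject(subject).ifPresent(n -> System.out.println(pullRequestUrl(n)));
    }
}
```

Anchoring the pattern at the end of the line matters: a revert or backport subject can quote an earlier "(#NNNN)" in the middle, and the one you want is the last.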

Move 4 — From PR to issue (the design)

The issue the PR closed is usually where the design was argued. Open it. For a large feature it will be an RFC/meta issue with the full proposal, alternatives, and the maintainer who signed off. That issue, plus the PR thread, is your complete "why."

You can do Moves 3–4 from the CLI with the GitHub CLI if you have it:

gh pr view 3537 --repo opensearch-project/OpenSearch --comments | head -60
gh issue view 1684 --repo opensearch-project/OpenSearch | head -60

Worked Example: The "master" → "cluster_manager" Rename

OpenSearch renamed the master node role and "master node" terminology to cluster_manager / cluster manager for inclusive language, keeping the old names as deprecated aliases. This is a textbook archaeology target because it touched serialization, settings, REST, and logs — exactly the kind of cross-cutting change you must understand before extending it.

Reconstruct the design without prior knowledge:

cd ~/os-src
# 1. Find where both terms coexist (the alias seam):
grep -rn "cluster_manager" server/src/main/java | grep -i "master" | head

# 2. Find the deprecated-alias settings:
grep -rn "initial_cluster_manager_nodes\|initial_master_nodes" \
  server/src/main/java | head

# 3. Walk history of the role definition to the introducing commits
#    (git log -G takes an extended regex, so alternation is a bare |):
git log -G "CLUSTER_MANAGER_ROLE|cluster_manager" --oneline -- server/ | tail -20

What you should reconstruct from the PRs and issues those commits point to:

Why both terms exist. Removing master outright would break every user's opensearch.yml (cluster.initial_master_nodes), every node role config, and every script parsing master from _cat/nodes. So the old names became deprecated aliases, not deletions — a compatibility decision, not just a wording one.
Where the seam is Version-gated. Cross-node messages and APIs that report the role had to keep emitting master to old nodes and cluster_manager to new ones. That is exactly the Version-gated serialization pattern from the serialization-BWC deep dive. The rename's PRs are a live example of why "small wording change" was actually a compatibility project.
The deprecation path. The old master role and settings log deprecation warnings and are scheduled for eventual removal — see the compatibility chapter for how OpenSearch stages deprecations.

The lesson generalizes: a change that looks like a string rename was, in design, a multi-surface compatibility exercise. The PR thread is where the maintainers worked that out. Read it before touching anything in that area.
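The "deprecated alias, not deletion" decision has a simple shape in code: prefer the new key, fall back to the old one, and warn when the old one is used. The sketch below is a standalone model over a plain Map. It is not OpenSearch's Setting API, and the class, the warning text, and the choice to reject both keys at once are assumptions made for illustration. Only the two setting names come from the commands and text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class AliasedSetting {
    static final String NEW_KEY = "cluster.initial_cluster_manager_nodes";
    static final String OLD_KEY = "cluster.initial_master_nodes";

    // Collected instead of logged so the behavior is easy to inspect.
    final List<String> warnings = new ArrayList<>();

    // Prefer the new key; accept the old key with a warning; refuse ambiguous configs.
    String resolve(Map<String, String> config) {
        boolean hasNew = config.containsKey(NEW_KEY);
        boolean hasOld = config.containsKey(OLD_KEY);
        if (hasNew && hasOld) {
            throw new IllegalArgumentException("set only one of [" + NEW_KEY + "] and [" + OLD_KEY + "]");
        }
        if (hasOld) {
            warnings.add("[" + OLD_KEY + "] is deprecated; use [" + NEW_KEY + "]");
            return config.get(OLD_KEY);
        }
        return config.get(NEW_KEY); // null when neither key is present
    }

    public static void main(String[] args) {
        AliasedSetting setting = new AliasedSetting();
        System.out.println(setting.resolve(Map.of(OLD_KEY, "node-1")));
        System.out.println(setting.warnings);
    }
}
```

An existing opensearch.yml keeps working, the operator gets a nudge, and the removal can be scheduled later. Extending anything in this area means keeping all three properties intact.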

A Second Pattern: Tracing a Feature Like Segment Replication

Segment replication (replicas copy Lucene segments from the primary instead of replaying each operation, contrasted with the default document replication) is a large feature with a meta issue and many child PRs. Use it to practice tracing a feature, not a single line:

cd ~/os-src
# Find the entry points in code:
grep -rln "SegmentReplication" server/src/main/java | head
find server -name 'SegmentReplicationTargetService.java' \
            -o -name 'SegmentReplicationSourceService.java'

# Find when the subsystem landed and the PRs behind it:
git log -G "SegmentReplication" --oneline -- server/ | tail -20

Then, on GitHub, search is:issue label:meta segment replication. The meta issue links the RFC, the design trade-offs (read amplification vs. CPU/refresh cost, the global-checkpoint interaction in the replication deep dive), and the staged child PRs. Reading the meta issue first tells you which sub-area is stable, which is experimental, and where a new contribution would actually be welcome — information that does not exist in the code at all.

How To Find the "Why" Behind a Design — Checklist

When you are about to change existing behavior, run this before writing any code:

git blame the lines you want to change; peel past reformat/license commits with git log -L.
Extract the introducing PR number (#NNNN) from the originating commit.
Read the PR description (what/why) and the Conversation (objections, alternatives, resolution).
Open the issue the PR closed; if it is an RFC/meta issue, read the whole design.
Search for a meta issue covering your subsystem (is:issue label:meta <keyword>).
Check whether there is a community-meeting recording or forum thread referenced from the issue.
Confirm no open RFC/proposal already plans the change you are about to make — if so, join that discussion instead of duplicating it.
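
Step 2 of the checklist can be scripted. This is a minimal sketch that assumes the originating commit was squash-merged through GitHub, so its subject line carries a parenthesized PR reference such as (#5679). The function name pr_number_from_subject is hypothetical, and commits merged some other way may carry no such reference.

```shell
# Print the PR number from a squash-merge commit subject like "Fix foo (#5679)".
# Assumption: the PR reference is the last "(#NNNN)" in the subject line.
pr_number_from_subject() {
  printf '%s\n' "$1" | grep -oE '\(#[0-9]+\)' | tail -n 1 | tr -d '(#)'
}

# Example with a literal subject; in a checkout you would pass
# "$(git log -1 --format=%s <sha>)" instead.
pr_number_from_subject 'Fix incorrect doc_count after delete-by-query in terms agg (#5679)'
# prints: 5679
```

Taking the last match matters for reverts and backports, whose subjects often quote an earlier PR number before their own.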

If you skip these and your PR reverses a deliberate decision, the maintainer's first comment will be a link to the very issue you should have read. That costs you a review round and some credibility. The archaeology is faster than the rework.

Validation: Prove You Understand This

Pick any non-trivial method in server/ you have never seen. Using git blame → git log → PR → issue, state in one sentence why it exists and who signed off on it.
Find one RFC or meta issue for a subsystem you care about and summarize its design decision and current status in three bullets.
Explain, with reference to the rename example, why a "rename a string" PR can be a compatibility-sensitive change, and which deep dive governs that.
Demonstrate the difference between git log -S and git log -G on a real identifier and say which you would use to follow an existing method through renames.
Produce the GitHub URL of the PR that introduced a piece of code you traced, and the URL of the issue it closed.

When you can recover the "why" behind any line in under ten minutes, you are ready to participate in the conversation. The next chapter — Community Interaction: GitHub, Forum, and Slack — is how you join it without burning trust.

Community Interaction: GitHub, Forum, and Slack

You can write a perfect patch and still get nowhere if you talk to the OpenSearch community the wrong way. Maintainers are volunteers and paid engineers with finite attention; the way you ask, claim, report, and escalate either earns that attention or spends it. This chapter is the operational etiquette of OpenSearch's communication channels — which one to use, how to use it well, and how to escalate a stalled effort without burning the bridge you will need next month.

The Channels and What Each Is For

OpenSearch has several channels, and using the wrong one is itself a mild faux pas (asking a deep design question in Slack, or a "how do I configure X" support question on a code PR).

Channel	URL	Use it for	Don't use it for
GitHub issues	`github.com/opensearch-project/OpenSearch/issues`	Bug reports, feature requests, RFCs, claiming work	Vague "is anyone there?" pings; support questions
GitHub PRs	…/pulls	The change itself + review discussion	Unrelated design debates; chit-chat
Community forum	`forum.opensearch.org`	Longer design discussion, user-impact questions, "how should this work?"	Bug reports that belong in a tracked issue
Public Slack	`opensearch.org/slack`	Quick, ephemeral questions; meeting coordination; saying hello	Anything that needs to be findable later (Slack is not the record)
Community meetings	linked from the forum / project calendar	Watching/raising big decisions live; demos	Routine review (that's the PR)
Announce / release notes	forum + GitHub Releases	Staying current on releases and breaking changes	—

The single most important rule: the durable record is GitHub. A decision made in Slack or a meeting is not real until it is written into an issue or PR. When something is decided in an ephemeral channel, the contributor's job is to capture it in the relevant issue. Maintainers notice who does this; it is low-effort, high-trust behavior.

Note: There is no JIRA and no dev@ mailing list as the working channel. If you are used to Apache, mentally remap: "open a JIRA" → "open a GitHub issue"; "post to dev@" → "comment on the issue / forum thread"; "attach a patch" → "open a PR."

When To Use Which — A Decision Guide

flowchart TD
    A[I have a thing to say] --> B{Is it a concrete defect<br/>or a specific feature ask?}
    B -- Yes --> C[Open a GitHub issue<br/>with repro + version]
    B -- No --> D{Is it about a specific<br/>change in flight?}
    D -- Yes --> E[Comment on that PR]
    D -- No --> F{Is it open-ended design<br/>or user-impact discussion?}
    F -- Yes --> G[Start a forum thread<br/>or RFC issue]
    F -- No --> H{Is it a quick,<br/>throwaway question?}
    H -- Yes --> I[Ask in Slack]
    H -- No --> G

If you find yourself about to ask the same question in two channels, stop — pick the durable one (issue or forum) and link to it from the ephemeral one.

How To File a Good Issue

A maintainer should be able to act on your issue with zero follow-up questions. The repo ships issue templates (bug report, feature request) — fill them out fully. The non-negotiable elements of a bug report:

Version: exact OpenSearch version and line (3.0.0, 2.x, built from main at SHA).
Environment: JVM, OS, single vs multi-node, relevant plugins, opensearch.yml deltas.
A minimal reproduction: the smallest sequence of curl/REST calls that triggers it.
Expected vs actual: what you expected, what happened, and the exact error/stack trace.
Logs: the relevant log lines (not a 50 MB dump) — and the full stack trace if there is one.

A reproduction maintainers love is copy-pasteable and self-contained:

# Repro for: aggregation returns wrong doc_count after delete-by-query (OpenSearch 3.0.0)
curl -s -XPUT 'localhost:9200/demo' -H 'content-type: application/json' -d '{
  "mappings": { "properties": { "status": { "type": "keyword" } } }
}'
curl -s -XPOST 'localhost:9200/demo/_doc?refresh=true' -H 'content-type: application/json' -d '{"status":"open"}'
curl -s -XPOST 'localhost:9200/demo/_delete_by_query?refresh=true' -H 'content-type: application/json' -d '{"query":{"term":{"status":"open"}}}'
# Expected: terms agg returns doc_count 0 for "open"
# Actual:   returns doc_count 1
curl -s 'localhost:9200/demo/_search?size=0' -H 'content-type: application/json' -d '{
  "aggs": { "by_status": { "terms": { "field": "status" } } }
}'

What kills an issue's chances: no version, "it doesn't work," a screenshot of a stack trace (post text), a giant unfocused log, or a reproduction that depends on your private cluster. Search first — is:issue <keywords> — and link any duplicate or related issue you find.

How To Claim an Issue You Want To Work On

OpenSearch does not formally "assign" most issues to non-maintainers up front. The convention is to comment your intent on the issue:

"I'd like to take this. My plan is to add a Version-gated field to XRequest and a round-trip serialization test. I'll open a draft PR this week — please let me know if someone is already on it or if the approach is wrong."

This does three things: signals your intent, states an approach a maintainer can correct before you write code, and sets a loose timeline. Then:

Wait for a brief acknowledgement on good first issue / help wanted issues — they exist precisely to be claimed; a maintainer will usually thumbs-up or comment.
Do not silently start large work on an untriaged issue or an open RFC — the design may not be settled, and you risk building the wrong thing. Ask first.
If you go quiet for weeks after claiming, expect (and accept) someone else picking it up. A short "still working on this, blocked on X" comment keeps your claim alive.

Triage Labels You Must Recognize

Labels are the maintainers' workflow. Reading them tells you whether an issue is ready for you:

Label	Meaning	What it means for you
`untriaged`	Not yet reviewed by a maintainer	The design isn't endorsed; ask before building
`good first issue`	Scoped, low-context, mentor-friendly	Ideal first contribution; claim it
`help wanted`	Maintainers want external help here	Welcome to take; still comment your plan
`bug`	Confirmed defect	Repro likely already present
`enhancement`	Feature/improvement	May need an RFC if large
`flaky-test`	Intermittently failing test	Often paired with an `AwaitsFix` mute
`RFC` / `proposal` / `meta`	Design discussion / umbrella tracking	Read fully; participate in design before coding
`backport 2.x` (etc.)	Should be backported to a release branch	Applied to PRs; triggers the backport bot
`v3.0.0` (version milestones)	Targeted at a release	Tells you the timeline pressure

Warning: Do not start building on an untriaged or RFC-labeled issue as if the design were final. untriaged means no maintainer has agreed it should be done at all. The fastest way to waste a weekend is to implement an unendorsed proposal.

How To Escalate a Stalled PR — Politely

PRs go quiet. Maintainers are busy, reviewers get reassigned, CI breaks on main and masks your status. Escalation is legitimate; impatience is not. The cadence:

Wait a reasonable interval. A few business days of silence on an open PR is normal. Do not ping after 12 hours.
First nudge — on the PR, factual and specific:

"Friendly ping — this has been green for a week and addresses #1234. Is there anything you'd like me to change, or is it waiting on a reviewer? Happy to rebase if main has moved."
Find the right reviewer. Check MAINTAINERS.md and the CODEOWNERS for the touched paths; @-mention a maintainer for that area (one, not all of them).
Use the forum or Slack as a pointer, not a venue. A short "could a maintainer take a look at #5678?" in the relevant Slack channel is fine; do not re-litigate the change there.
Raise it in a community meeting only if it is genuinely stuck and important — and add a one-line note to the PR afterward capturing whatever was said.

What never works: pinging daily, @-mentioning the entire maintainer list, editing the PR title to "PLEASE REVIEW", or arguing that your PR is more important than others in the queue.

Etiquette: Dos and Don'ts

Do	Don't
Search before filing; link duplicates/related issues	Open a fresh issue for a known problem
Post text (errors, stack traces, configs)	Post screenshots of text
Give exact version + minimal repro	Say "latest" or "it's broken"
Comment your plan before large work	Drop a 3,000-line PR with no prior discussion
Keep one topic per issue/PR	Pile unrelated asks into one thread
Capture Slack/meeting decisions back into the issue	Let decisions live only in ephemeral chat
`@`-mention one relevant maintainer when stuck	`@`-mention everyone, repeatedly
Assume good faith; thank reviewers	Take review comments personally
Use the forum for open design questions	Hold design debates in Slack where they vanish
Disclose AI-assisted work where the project asks	Paste machine-generated PRs without understanding them

Code of Conduct

OpenSearch publishes a CODE_OF_CONDUCT.md in the repo (it follows a Contributor-Covenant-style standard) and the OpenSearch Software Foundation has its own conduct expectations. Read it once:

find ~/os-src -maxdepth 1 -iname 'CODE_OF_CONDUCT.md'

The substance is unremarkable and binding: be respectful, assume good faith, no harassment, no personal attacks, keep technical disagreement technical. The practical version for a contributor: disagree with the idea, never the person; when a maintainer says no, ask "what would change your mind?" rather than escalating in tone; and if you see conduct issues, use the private reporting path in the document rather than a public flame. Trust in this project is slow to build and fast to lose — see the maintainership chapter — and conduct is the foundation of it.

Validation: Prove You Understand This

Given a defect you found, write the full GitHub issue body — version, environment, a copy-pasteable curl repro, expected vs actual, and the relevant log lines.
Draft the comment you would post to claim a good first issue, including your intended approach and a timeline.
For a hypothetical stalled, green PR, write the escalation cadence: when you nudge, who you @-mention (and how you found them via MAINTAINERS.md), and what you do not do.
State which channel you'd use for each of: a NullPointerException with a repro; "should the default for setting X change?"; "what time is the community meeting?"; a decision reached in Slack that needs to become real.
Explain why "the durable record is GitHub" changes how you capture a Slack decision.

When your issues need no follow-up and your nudges land without friction, you are ready to ship. The next chapter — PR Quality and Preparation — is what a maintainer expects to see when you do.

PR Quality and Preparation

A maintainer reviewing your pull request is making a risk decision: will merging this make the codebase better without introducing a regression, a compatibility break, or a maintenance burden? Everything you do before you click "Create pull request" either lowers that risk or raises it. A high-quality PR is not the one with the cleverest code — it is the one that is easy and safe to say yes to.

This chapter is the checklist a maintainer-ready OpenSearch PR meets. It maps directly to Level 2 Lab 2: Prepare a PR and to the project's own CONTRIBUTING.md, DEVELOPER_GUIDE.md, and .github/pull_request_template.md.

What a Maintainer-Ready PR Looks Like

Seven properties, in rough priority order:

Focused scope — one logical change. Not "fix bug + reformat 12 files + rename a class."
Tests — unit tests always; integration (*IT.java) and/or REST-YAML where behavior crosses nodes or the HTTP boundary.
A CHANGELOG.md entry under ## [Unreleased], in the right category.
DCO sign-off on every commit (git commit -s).
Green local gate — ./gradlew spotlessApply then ./gradlew precommit pass before you push; CI (gradle-check) green after.
A clear description tied to an issue — Closes #NNNN, what changed, why, and how it was tested.
No collateral damage — no unrelated reformatting, no incidental dependency bumps, no stray debug logging.

The rest of this chapter is each of these in detail.

Focused Scope: One Logical Change

The cardinal rule. A reviewer can hold one change in their head; they cannot hold five. A PR that fixes a bug and reformats a package and renames a field is three reviews wearing one hat, and it will sit unmerged because reviewing it is expensive and risky.

Litmus test before you open the PR:

cd ~/os-src
git diff --stat main...HEAD

If the stat shows files you cannot explain as part of the one thing, split them out. If Spotless reformatted unrelated lines, restrict formatting to your change. If you found a second bug while fixing the first, file a separate issue and a separate PR. "Drive-by" cleanups feel generous; to a reviewer they are noise that hides the real change and expands the blast radius.
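
One way to make the litmus test more mechanical is to summarize which top-level directories a branch touches. This is a minimal sketch, assuming the file list comes from git diff --name-only main...HEAD. The helper name touched_top_dirs and the example paths are hypothetical, and a long list is only a prompt to look closer, not a rule.

```shell
# Reduce a list of changed paths (one per line on stdin) to the unique
# top-level directories (or root-level files) they belong to.
touched_top_dirs() {
  cut -d/ -f1 | sort -u
}

# Example with a literal file list; in a checkout you would run
#   git diff --name-only main...HEAD | touched_top_dirs
printf '%s\n' \
  'server/src/main/java/org/opensearch/Foo.java' \
  'server/src/test/java/org/opensearch/FooTests.java' \
  'CHANGELOG.md' |
  touched_top_dirs
# prints CHANGELOG.md and server, one per line
```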

Tests: The Non-Negotiable

A PR that changes behavior without a test that would have failed before your change is not ready, full stop. OpenSearch is heavily tested and randomized; maintainers will not merge an untested behavioral change because they cannot trust that a future refactor won't silently break it.

Choose the test tier by what you touched (see Reading the Codebase for the catalog, and the test framework under test/framework/):

You changed…	Add at least…	Run with
A single class's logic	A `*Tests.java` unit test	`./gradlew :server:test --tests "...YourTests"`
A serialized type (`Writeable`/XContent)	A round-trip test extending `Abstract*SerializingTestCase`	`./gradlew :server:test --tests "...SerializationTests"`
Behavior spanning nodes	A `*IT.java` integration test (`OpenSearchIntegTestCase`)	`./gradlew :server:internalClusterTest --tests "...IT"`
A REST endpoint contract	A REST-YAML test under `rest-api-spec`	`./gradlew :rest-api-spec:yamlRestTest`
Wire/index BWC	A case in the `qa/` BWC suite	`./gradlew :qa:...` (see compatibility)

Write the failing test first, watch it fail, then make it pass. A test that passes on both the old and new code proves nothing. If you mute a flaky test, never use @Ignore — use @AwaitsFix(bugUrl="https://github.com/opensearch-project/OpenSearch/issues/NNNN") with a real tracking issue.

The `CHANGELOG.md` Entry

Every user-visible PR adds one line to CHANGELOG.md under the ## [Unreleased] heading for its target line (for example ## [Unreleased 3.x]), in the correct category. Forgetting this is the single most common reason a CI "changelog verifier" check goes red and a maintainer leaves a one-word review.

sed -n '1,40p' ~/os-src/CHANGELOG.md

The categories follow Keep-a-Changelog:

## [Unreleased 3.x]
### Added
- Add `wait_for_active_shards` support to the foo API ([#5678](https://github.com/opensearch-project/OpenSearch/pull/5678))
### Changed
### Deprecated
### Removed
### Fixed
- Fix incorrect doc_count after delete-by-query in terms agg ([#5679](https://github.com/opensearch-project/OpenSearch/pull/5679))
### Security

Rules: put it under the right ### category heading; phrase it for a user reading release notes, not for a reviewer; link your PR number; pick the [Unreleased 3.x] or [Unreleased 2.x] section that matches where the change lands. Purely internal changes (test-only, build-only) may not need an entry; check the verifier's rules and recent merged PRs.
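
Before pushing, you can check locally that your PR number appears under an Unreleased heading. This is a rough sketch, not the project's changelog verifier. The function name changelog_has_pr is hypothetical, and it assumes entries link the PR as .../pull/NNNN) in the format shown above.

```shell
# Succeed (exit 0) if FILE links the given PR under a "## [Unreleased ...]" heading.
# Usage: changelog_has_pr FILE PR_NUMBER
changelog_has_pr() {
  awk -v needle="/pull/$2)" '
    /^## \[Unreleased/ { in_unreleased = 1; next }   # entering an Unreleased section
    /^## \[/           { in_unreleased = 0 }         # any other release heading ends it
    in_unreleased && index($0, needle) { found = 1 }
    END { exit(found ? 0 : 1) }
  ' "$1"
}

# Example:
#   changelog_has_pr CHANGELOG.md 5679 && echo "entry present"
```

Matching on the closing parenthesis keeps PR 567 from being mistaken for PR 5679.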

DCO Sign-Off (Not a CLA)

OpenSearch uses the Developer Certificate of Origin, not a CLA. Every commit must carry a Signed-off-by: line whose name and email match your Git identity. The DCO bot fails the PR if any commit is missing it.

git config user.name  "Your Real Name"
git config user.email "you@example.com"
git commit -s -m "Fix incorrect doc_count after delete-by-query in terms agg"

git commit -s appends:

Signed-off-by: Your Real Name <you@example.com>

Forgot it on commits you already made? Fix the whole branch in one shot:

git rebase --signoff main      # add Signed-off-by to every commit since main
git push --force-with-lease     # force-pushing your PR branch is normal here

Note: There is no CLA to sign and no JIRA account to create. Your GitHub identity plus a correct DCO sign-off is the entire contributor-identity requirement.
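
To catch a missing sign-off before the DCO bot does, you can test each commit message for the trailer. This is a minimal sketch: has_signoff and unsigned_commits are hypothetical helper names, and the check only looks for a well-formed Signed-off-by: line. It does not verify that the name and email match your Git identity.

```shell
# Succeed if the commit message on stdin has a "Signed-off-by: Name <email>" line.
has_signoff() {
  grep -qE '^Signed-off-by: .+ <.+@.+>$'
}

# List commits on the current branch (since BASE, default main) lacking a sign-off.
unsigned_commits() {
  local base="${1:-main}" sha
  for sha in $(git rev-list "$base"..HEAD); do
    git log -1 --format=%B "$sha" | has_signoff || echo "$sha"
  done
}

# Example on a literal message:
printf 'Fix foo\n\nSigned-off-by: Your Real Name <you@example.com>\n' | has_signoff \
  && echo "signed off"
# prints: signed off
```

In a checkout, unsigned_commits main should print nothing before you push.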

The Local Gate: `spotlessApply`, `precommit`, gradle-check

Run the gate locally before pushing. Discovering a checkstyle or license-header failure in CI after a 30-minute build, when you could have caught it in 30 seconds, wastes a review cycle.

cd ~/os-src
# 1. Auto-format (Spotless). Always run this first.
./gradlew spotlessApply

# 2. The fast static gate: checkstyle, forbidden-APIs, license/SPDX headers,
#    dependency checks, loggerUsageCheck, etc.
./gradlew precommit

# 3. Verify formatting is clean (what CI checks):
./gradlew spotlessJavaCheck

# 4. The targeted tests for what you changed:
./gradlew :server:test --tests "org.opensearch.your.ChangedTests"

./gradlew check runs the full gate (all tests + precommit + integ) and is long-running — CI runs the equivalent gradle-check. You do not need to run check end-to-end locally for every push, but you must run precommit and the tests touching your change. New source files need the SPDX Apache-2.0 header:

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

precommit's license check will fail without it.
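
A quick local check for new files can save a precommit run. In this sketch, has_spdx_header is a hypothetical helper that only looks for the identifier line near the top of a file. It is not the check precommit runs, which remains the authority.

```shell
# Succeed if FILE carries the Apache-2.0 SPDX identifier within its first 10 lines.
has_spdx_header() {
  head -n 10 "$1" | grep -qF 'SPDX-License-Identifier: Apache-2.0'
}

# Example: report newly added Java files on the branch that lack the header.
#   git diff --name-only --diff-filter=A main...HEAD -- '*.java' |
#     while IFS= read -r f; do
#       has_spdx_header "$f" || echo "missing header: $f"
#     done
```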

The PR Description and Template

The repo's .github/pull_request_template.md pre-populates the description. Fill every section; do not delete the template. A reviewer reads the description first and decides whether to invest in the diff.

cat ~/os-src/.github/pull_request_template.md

A strong description has:

Description: what the change does and why, in plain prose. Link the design discussion.
Related Issues: Closes #NNNN (this auto-closes the issue on merge) or Related to #NNNN.
Check List: tick the template's boxes truthfully — tests added, CHANGELOG updated, commits signed off, public docs considered, etc.
Testing: how you verified it (which ./gradlew tasks, which manual curl).

Tie it to an issue. A PR with no linked issue, for anything beyond a trivial typo, invites the question "was this discussed?" — and if the answer is no, you may be asked to open an issue first.

Backport Labels

OpenSearch maintains release branches (2.x, etc.) alongside main. When a fix should also ship on a maintenance line, the PR gets a backport <branch> label (e.g. backport 2.x), which triggers a backport bot to open the backport PR automatically after merge.

You usually do not self-apply the label — a maintainer decides whether a backport is warranted — but you should propose it in your description ("This is a bug fix and should be backported to 2.x") and be ready to resolve conflicts if the bot's automated backport doesn't apply cleanly. See Responding to Maintainer Feedback for the post-merge backport workflow.

PR Readiness Checklist

Run this before you open the PR. Every box unchecked is a reason it will sit.

#	Check	Command / where
1	One logical change; no unrelated reformatting	`git diff --stat main...HEAD`
2	Unit tests added that fail before, pass after	`./gradlew :server:test --tests "..."`
3	Integration / REST-YAML tests where behavior crosses a boundary	`./gradlew :server:internalClusterTest` / `:rest-api-spec:yamlRestTest`
4	BWC test for any wire/index change	`./gradlew :qa:...` (see compatibility)
5	`CHANGELOG.md` entry under `[Unreleased]`, right category, PR link	edit `CHANGELOG.md`
6	Every commit signed off (DCO)	`git log main..HEAD --invert-grep --grep='^Signed-off-by: ' --oneline` (expect no output)
7	`spotlessApply` run; `spotlessJavaCheck` clean	`./gradlew spotlessApply spotlessJavaCheck`
8	`precommit` green (checkstyle, headers, forbidden-APIs)	`./gradlew precommit`
9	SPDX header on new files	inspect new files
10	Description filled from template; issue linked (`Closes #NNNN`)	the PR form
11	Backport need stated in description	the PR form

Common Reasons PRs Sit Unmerged

Learn these so your PR avoids them:

Reason	Symptom	Fix
Scope too broad	"Can you split this?"	One PR per logical change
No / weak tests	"Please add a test that covers X"	Add a failing-then-passing test
Missing CHANGELOG entry	Changelog-verifier check red	Add the `[Unreleased]` line
DCO not signed	DCO bot red	`git rebase --signoff main` + force-push
precommit/spotless fails in CI	gradle-check red	Run the local gate first
Unrelated reformatting	"Lots of noise here"	Revert collateral changes
Compatibility risk unaddressed	"Does this break a rolling upgrade?"	`Version`-gate + a `qa/` BWC test (compatibility)
No linked issue / undiscussed design	"Was this agreed?"	Open an issue / RFC first (community)
Goes stale after review	Author silent; PR auto-stale	Iterate promptly (feedback)
Conflicts with `main`	Merge-conflict banner	`git rebase main` + force-push

Validation: Prove You Understand This

Take a branch you have and run git diff --stat main...HEAD. Justify every changed file as part of one logical change, or split it.
Add a CHANGELOG.md entry for a hypothetical fix in the correct ### category, with the PR-number link.
Show the exact commands to (a) sign off every commit on an existing branch and (b) safely force-push the result.
Run ./gradlew spotlessApply precommit on a real change and report a clean gate, or fix what it flags.
From the PR template, write a complete description for a real or hypothetical change, including Closes #NNNN, the testing section, and a backport recommendation.
Name the test tier you would add for each of: a serialization change, a REST contract change, and a cross-node behavior change — with the ./gradlew task for each.

When every box on the readiness checklist is green on the first push, you have made it easy to say yes. The next chapter — Responding to Maintainer Feedback — is what happens after the maintainer reads it.

Responding to Maintainer Feedback

You opened a clean, scoped, tested PR (see PR Quality). A maintainer reviewed it and left comments. How you respond over the next few days decides whether this PR merges in one more round or drags through five — and decides whether this maintainer wants to review your next PR. Review response is a skill, and it is one of the clearest signals maintainers use when forming an opinion of a contributor (see Maintainership).

This chapter is the mechanics and the etiquette of iterating on an OpenSearch PR. It pairs with Level 2 Lab 4: Review a PR — having reviewed someone else's PR makes you far better at receiving review on your own.

First: Read the Whole Review Before Touching Code

When a review lands, resist the reflex to start changing things. Read every comment first, then sort them:

Bucket	Example	Your response
Clear fix	"This NPEs if the list is empty"	Just fix it; reply "Done."
Style/convention	"Use `assertThat` here"	Fix it; the project's conventions win
Question	"Why fan out before resolving indices?"	Answer it in the thread; maybe add a code comment
Disagreement	"I think this should be `Version`-gated differently"	Discuss with evidence before changing
Blocking concern	"This breaks a rolling upgrade"	Stop, address it fully — see compatibility

GitHub's review states matter: "Comment" is advisory, "Approve" clears the path, and "Request changes" (often shown as "Changes requested") is a hard block wherever branch protection requires approving reviews: the PR will not merge until the reviewer who requested changes re-reviews and clears it, or a maintainer dismisses that review. Treat "Changes requested" as a gate, not an insult.

The OpenSearch Iterate Cadence: Fixups, Then Squash, Force-Push

This is where OpenSearch differs sharply from Apache patch culture. In Apache projects you re-roll a numbered patch file (v1.patch, v2.patch). In OpenSearch you push to the same PR branch, and force-pushing that branch is normal and expected.

The recommended cadence during active review:

Make a fixup commit per addressed comment, so the reviewer can see exactly what changed since their last look:
```
git commit -s --fixup=HEAD          # or --fixup=<sha> targeting a specific commit
git push                            # adds a visible commit; no force needed yet
```
Reviewers can use GitHub's "changes since last review" view against your new commits.
When the reviewer is satisfied, squash the fixups into clean logical commits and force-push:
```
git rebase -i --autosquash main     # folds fixup! commits into their targets
git push --force-with-lease          # rewrites the PR branch — normal here
```
Always use --force-with-lease (not bare --force) so you do not clobber commits you didn't know about.

Note: git push --force-with-lease to your own PR branch is routine and welcomed in OpenSearch. It keeps history clean. The thing to avoid is force-pushing in a way that loses a reviewer's place mid-review without warning — hence the fixup-then-squash rhythm: fixups while review is active, squash when it's settling.

After any push that should be re-reviewed, re-request review explicitly (the circular-arrows icon next to the reviewer, or a short comment) — maintainers do not watch every branch for pushes.

Addressing vs. Discussing: Resolve Threads Honestly

Every review comment is a conversation thread. Close the loop on each one:

If you made the change, reply "Done" (and link the commit if helpful), then let the reviewer resolve the thread — or resolve it yourself only when the project's convention allows.
If you chose not to, do not silently ignore it. Reply with your reasoning. Silence reads as either "I missed it" or "I'm ignoring you," and both cost you.
Do not resolve a thread you disagreed with by just hiding it. Resolving a thread signals "addressed"; resolving one you actually pushed back on, without agreement, looks evasive.

A reviewer scanning your PR should be able to see, thread by thread, that every comment was either fixed or answered. That completeness is what lets them approve without re-reading the whole diff.

When and How To Push Back — With Evidence

You will sometimes be right and the reviewer wrong, or the reviewer's suggestion will have a cost they didn't see. Pushing back is legitimate; arguing is not. The difference is evidence.

Good push-back is specific, evidence-backed, and offers a path forward:

"I looked at gating this on Version.V_2_18_0 as suggested, but the field is also written to the qa/ rolling-upgrade snapshot, and a 2.17 reader would then see an unknown field. The BWC test …RollingUpgradeIT fails with that gating (log attached). Gating on V_3_0_0 keeps the mixed-cluster test green. Would that work, or is there a reason to target 2.18?"

That comment cites the test, shows the failure, proposes an alternative, and ends with a question. It invites agreement instead of demanding it.

Bad push-back — even when you're correct — looks like:

"That's wrong, the version gating is fine as is."

It asserts, cites nothing, and leaves the reviewer no path but to dig in. You may win the technical point and still lose the relationship.

When the disagreement is genuine and unresolved, escalate the idea, not the tone: ask the maintainer to loop in another reviewer for the area (check MAINTAINERS.md/CODEOWNERS), or move the design question to the issue/RFC. Never let a PR thread turn into a multi-screen argument — that is where PRs go to die.

Handling "Changes Requested" and Re-Requesting Review

The workflow when a reviewer formally requests changes:

flowchart LR
    A[Review: Changes requested] --> B[Read all comments, bucket them]
    B --> C[Push fixup commits<br/>+ reply per thread]
    C --> D[Keep CI green]
    D --> E[Re-request review]
    E --> F{Reviewer satisfied?}
    F -- No --> B
    F -- Yes --> G[Squash fixups,<br/>force-push clean history]
    G --> H[Approve + merge]

The two things contributors most often forget: re-requesting review (the reviewer is not notified by a push alone) and keeping CI green between rounds.

Keep CI Green Between Rounds

Every push reruns CI (gradle-check). A reviewer will not invest in a PR with red CI, because the red might be your change. Before each push:

cd ~/os-src
./gradlew spotlessApply
./gradlew precommit
./gradlew :server:test --tests "org.opensearch.your.ChangedTests"

If CI fails on a flaky test unrelated to your change, say so explicitly with the failing test name and a link if it's a known flaky-test — do not just push again silently and hope. If main moved and you have conflicts, rebase and force-push:

git fetch origin
git rebase origin/main
git push --force-with-lease

A PR that is green, rebased, and has every thread answered is one a maintainer can approve in five minutes. That is the goal of every round.

After Merge: The Backport Workflow

Your PR merged into main. If it carries a backport <branch> label (e.g. backport 2.x), a backport bot automatically opens a backport PR against that release branch. Your job is not done:

Watch for the bot's backport PR (it @-mentions you).
If the cherry-pick applied cleanly, just make sure its CI is green.

If the bot reports a conflict, it cannot auto-backport — you cherry-pick by hand:

git fetch origin
git checkout -b backport/2.x/my-fix origin/2.x
git cherry-pick -x <merge-or-commit-sha>   # -x records the source commit; a true merge commit also needs -m 1
# resolve conflicts, keep the change minimal and version-appropriate
git push origin backport/2.x/my-fix
# open a PR against 2.x, link the original PR

The backport still needs the same quality bar: DCO sign-off, a CHANGELOG entry under the 2.x [Unreleased] section, and green CI. See PR Quality and compatibility — a backport must be safe on the older line, which can differ from main.

A Worked Review Thread: Good vs. Bad Responses

Reviewer comment on your PR:

"This new field in FooRequest.writeTo isn't Version-gated. Won't a 2.x node fail to deserialize a request from a 3.0 node during a rolling upgrade?"

Bad response (do not do this):

"It works in my tests."

It dismisses a real compatibility concern, cites nothing, and ignores that the maintainer's worry is the mixed-cluster path, which your single-version tests never exercised.

Good response:

"Good catch — you're right, the single-version FooRequestTests round-trip doesn't cover the mixed-version case. I've gated the field write/read on Version.V_3_0_0 (commit abc1234) and added a …BwcSerializationTests case asserting a 2.x stream omits it. I also added a rolling-upgrade case under qa/ (commit def5678). Both green in CI. Re-requesting review — does the gate version look right to you, or should it target the next 2.x minor instead?"

It concedes the valid point, names exactly what changed and where, points to the new tests that prove the fix, keeps CI green, and ends with a question that invites approval. This is the response pattern that turns reviewers into advocates.

Warning: Compatibility comments like the one above are the most common blocking feedback on OpenSearch PRs. If you do not yet know why a non-gated field breaks a rolling upgrade, read Compatibility, Stability, and Performance and the serialization-BWC deep dive before you respond — a confident wrong answer here is worse than asking.

Validation: Prove You Understand This

Given a review with five comments of different types, bucket each (clear fix / style / question / disagreement / blocking) and state your response to each.
Show the exact commands for the fixup-then-squash cadence, including --autosquash and --force-with-lease, and explain why force-pushing your PR branch is normal here.
Write an evidence-backed push-back to a reviewer suggestion you believe is wrong, citing a test or measurement and ending with a question.
Describe the steps after a backport 2.x bot reports a conflict, including cherry-pick -x and the 2.x CHANGELOG entry.
Rewrite a bad ("it works on my machine") response to a rolling-upgrade concern into a good one that names the fix, the proving test, and re-requests review.

When your review rounds are short, your threads are all answered, and your CI is always green, maintainers start trusting your PRs by default. The next chapter — Compatibility, Stability, and Performance — is the knowledge behind the most common blocking feedback you will receive.

Compatibility, Stability, and Performance

OpenSearch runs in long-lived production clusters that can hold petabytes of data and are upgraded in place, node by node, without downtime. That single operational fact (rolling upgrades on clusters that cannot be re-indexed from scratch) is the source of almost every "no" a maintainer will give you. This chapter covers the maintainer's three constant concerns, which are the knowledge behind the most common blocking review comments (see Responding to Feedback).

A maintainer reviewing your PR is silently asking three questions:

Wire / index backward compatibility — will this break a mixed-version cluster mid-upgrade, or make today's data unreadable by a future node (or vice versa)?
API stability — does this change a REST contract that users and clients depend on?
Performance — does this regress a hot path that benchmarks watch?

If you can answer all three before they ask, your PR is half-reviewed already.

Concern 1: Wire and Index Backward Compatibility

Why a "Small" Change Breaks a Rolling Upgrade

During a rolling upgrade, the cluster is mixed: some nodes run version N, some run N+1, and they talk to each other over the transport layer (:9300). A node serializes a request/response with StreamOutput and the peer deserializes it with StreamInput. If a newer node writes a field that an older node's readFrom does not expect, the older node throws while parsing the stream — and a node that cannot parse cluster traffic is a node that cannot stay in the cluster.

So the rule: any change to what a Writeable writes or reads is a wire-compatibility change, even adding "just one field." It is invisible in single-version tests and fatal in a mixed cluster.
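The failure can be reproduced in miniature without any OpenSearch classes. The sketch below is a toy model, not the real transport code: plain java.io data streams stand in for StreamOutput/StreamInput, the class and method names are hypothetical, and the "leftover bytes" check is an assumption standing in for whatever validation the real transport layer performs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy model: java.io data streams stand in for StreamOutput/StreamInput.
class WireMismatchDemo {

    // "Newer" node: writes name plus a new field, without checking the peer's version.
    static byte[] newerNodeWrite(String name, String newField) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            out.writeUTF(newField); // the ungated addition
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // "Older" node: knows only about name and expects the message to end there.
    static String olderNodeRead(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            String name = in.readUTF();
            if (in.available() > 0) {
                // Assumed check for this sketch: leftover bytes mean the two sides disagree on the format.
                throw new IllegalStateException("message not fully read: " + in.available() + " bytes left");
            }
            return name;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = newerNodeWrite("foo", "bar");
        try {
            olderNodeRead(wire);
            System.out.println("parsed cleanly");
        } catch (IllegalStateException e) {
            System.out.println("older node rejected the message: " + e.getMessage());
        }
    }
}
```

A test in which both sides run the same code never takes the rejecting branch, which is why the problem stays hidden until two versions meet.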

`Version`-Gated Serialization

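The mechanism OpenSearch uses is Version gating inside writeTo and its read-side counterpart (a readFrom method or, as in the example below, a constructor that takes a StreamInput). The stream carries the negotiated version of the peer, and you write the new field only if the peer understands it: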

@Override
public void writeTo(StreamOutput out) throws IOException {
    super.writeTo(out);
    out.writeString(name);
    if (out.getVersion().onOrAfter(Version.V_3_0_0)) {   // only newer peers get it
        out.writeOptionalString(newField);
    }
}

public FooRequest(StreamInput in) throws IOException {
    super(in);
    this.name = in.readString();
    if (in.getVersion().onOrAfter(Version.V_3_0_0)) {     // mirror the gate exactly
        this.newField = in.readOptionalString();
    } else {
        this.newField = null;                             // sensible default for old peers
    }
}

The read and write gates must match exactly, and you must choose a sane default for the old-peer case. Find the Version constants and study the pattern:

cd ~/os-src
find libs/core -name 'Version.java'
grep -n "getVersion().onOrAfter\|getVersion().before" \
  server/src/main/java/org/opensearch/action/search/SearchRequest.java | head

The deep mechanics (NamedWriteableRegistry, Diffable cluster state, XContent BWC) live in the serialization-BWC deep dive. This chapter is about the judgment: recognize that you touched the wire, then gate it.

Index Backward Compatibility

Index BWC is the other half. OpenSearch reads indices written by the previous major line, and a node must read segments and metadata written before it upgraded. You break index BWC by changing how something is written to disk (mapping serialization, a new codec, metadata format) without a read path for the old form. Index BWC violations are worse than wire violations because the data is already on disk and cannot be re-serialized by negotiation — there is no peer to gate against, only the old bytes.

The `qa/` BWC Test Suites

OpenSearch proves BWC with dedicated suites under qa/. These are the tests maintainers expect to see touched when your change has any compatibility surface:

ls ~/os-src/qa/
# rolling-upgrade        — node-by-node upgrade across a mixed cluster
# mixed-cluster          — old + new nodes serving traffic simultaneously
# full-cluster-restart   — write on old, restart whole cluster on new, read back
# repository-multi-version, etc.

Run the rolling-upgrade suite, and its mixed-cluster and full-cluster-restart siblings, against a previous version (the build's bwcVersion wiring selects which versions are tested):

cd ~/os-src
./gradlew :qa:rolling-upgrade:check          # may take a while; downloads BWC artifacts
./gradlew :qa:mixed-cluster:check
./gradlew :qa:full-cluster-restart:check

For the unit-level proof, serialization round-trip tests against an older Version are the cheap, fast signal a reviewer wants:

grep -rl "AbstractWireSerializingTestCase\|AbstractSerializingTestCase" server/src/test | head

Extend one of those with a case asserting that, at the old Version, your new field is not written and the round-trip still succeeds. This is exactly the test the worked review thread in Responding to Feedback demands.
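The shape of that old-version assertion can be sketched outside the OpenSearch test framework. This is a toy model under stated assumptions: the integer constants V_OLD and V_NEW are hypothetical stand-ins for Version constants, java.io streams stand in for StreamOutput/StreamInput, and the boolean-plus-string encoding only imitates an optional string. A real test would extend one of the base classes found by the grep above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class GatedRoundTripDemo {
    static final int V_OLD = 2; // hypothetical numeric stand-ins for Version constants
    static final int V_NEW = 3;

    static final class Foo {
        final String name;
        final String newField; // null when absent
        Foo(String name, String newField) {
            this.name = name;
            this.newField = newField;
        }
    }

    // Write side: the new field goes on the wire only for peers at V_NEW or later.
    static byte[] write(Foo foo, int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(foo.name);
            if (peerVersion >= V_NEW) {
                out.writeBoolean(foo.newField != null); // presence flag
                if (foo.newField != null) {
                    out.writeUTF(foo.newField);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read side: mirrors the gate exactly and defaults to null for old peers.
    static Foo read(byte[] wire, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            String name = in.readUTF();
            String newField = null;
            if (peerVersion >= V_NEW && in.readBoolean()) {
                newField = in.readUTF();
            }
            return new Foo(name, newField);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Foo original = new Foo("a", "b");
        Foo viaOld = read(write(original, V_OLD), V_OLD);
        Foo viaNew = read(write(original, V_NEW), V_NEW);
        System.out.println("old peer sees newField=" + viaOld.newField);
        System.out.println("new peer sees newField=" + viaNew.newField);
    }
}
```

The two properties worth asserting are that the old-version bytes contain nothing beyond the old format, and that the round-trip at the old version succeeds with the default in place of the new field.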

Change	BWC surface	What proves it safe
Add a field to a `Writeable`	Wire	`Version` gate + round-trip test at old version + `qa/mixed-cluster`
New transport action / response	Wire	Gate the registration/handling on `Version`; rolling-upgrade test
New mapping/metadata on disk	Index	Read path for old form + `qa/full-cluster-restart`
Change cluster-state custom	Wire + diff	`Diffable` BWC + `qa/rolling-upgrade`
Pure in-memory refactor, no serialized form	None	Standard unit tests suffice

Concern 2: API Stability

The REST API is a contract. Dashboards, the language clients (opensearch-java, opensearch-py, opensearch-js), Logstash/Beats-style ingest, and countless user scripts call it. Breaking it silently breaks tools the user never thinks about.

The REST contract is defined in rest-api-spec/ and exercised by REST-YAML tests:

ls ~/os-src/rest-api-spec/src/main/resources/rest-api-spec/api/ | head
sed -n '1,40p' ~/os-src/rest-api-spec/src/main/resources/rest-api-spec/api/search.json

Rules of thumb for a REST change:

Adding an optional query parameter or response field is generally safe (additive).
Removing or renaming a parameter, changing a response field's type, or changing a default is a breaking change — it needs deprecation first, and usually targets a major version.
Deprecate, don't delete. OpenSearch surfaces deprecations through the Warning HTTP header and deprecation logs. Mark the parameter deprecated, keep it working, log a warning, and remove it only at a major boundary. The master → cluster_manager rename (see Design via GitHub) is the canonical example: old names live on as deprecated aliases rather than being deleted.
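A minimal sketch of the "keep it working, warn, prefer the new name" rule follows. It is an illustration only: the class, the method, and the plain list used to collect warnings are hypothetical and are not the OpenSearch REST handler API. The parameter names are used as an example of an old/new pair, and rejecting requests that set both is one possible policy assumed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class DeprecatedParamDemo {
    static final String NEW_NAME = "cluster_manager_timeout";
    static final String OLD_NAME = "master_timeout"; // deprecated alias, still accepted

    // Returns the effective value, recording a warning when the deprecated name is used.
    static String resolveTimeout(Map<String, String> params, List<String> warnings) {
        boolean hasNew = params.containsKey(NEW_NAME);
        boolean hasOld = params.containsKey(OLD_NAME);
        if (hasNew && hasOld) {
            // Assumed policy for this sketch: ambiguous requests are rejected.
            throw new IllegalArgumentException("use only one of [" + OLD_NAME + ", " + NEW_NAME + "]");
        }
        if (hasOld) {
            warnings.add("[" + OLD_NAME + "] is deprecated, use [" + NEW_NAME + "] instead");
            return params.get(OLD_NAME);
        }
        return params.get(NEW_NAME); // null when neither is set; the caller applies its default
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        String value = resolveTimeout(Map.of(OLD_NAME, "30s"), warnings);
        System.out.println("value=" + value + ", warnings=" + warnings);
    }
}
```

The old name keeps returning the same value it always did; the only observable difference for an existing client is the warning, which is what gives users time to migrate before a major release removes the alias.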

REST-YAML tests document version-specific behavior with skip blocks — when behavior changes by version, the YAML test encodes both:

- skip:
    version: " - 2.99.99"
    reason: "new response field added in 3.0"

When you change a REST endpoint, update its JSON spec, update/extend its YAML tests, and confirm:

./gradlew :rest-api-spec:yamlRestTest

Concern 3: Performance — No Regressions

OpenSearch is a performance product; a search or indexing regression that ships is a regression real users feel. Maintainers guard hot paths (search query/fetch, indexing, aggregation reduce, serialization) jealously. A change that is correct but slower on a hot path can still be a "no."

The tools, smallest to largest:

Tool	Scope	When to use
JMH microbenchmarks (`:benchmarks`)	A single method/algorithm	You changed a tight inner loop
`internalClusterTest` timing	A subsystem in-JVM	Sanity-check a path end to end
OpenSearch Benchmark (`opensearch-benchmark`, aka the macrobenchmark harness)	A full cluster under realistic workloads	You touched search/indexing throughput or latency

The microbenchmarks live in the :benchmarks project and use JMH:

cd ~/os-src
ls benchmarks/src/main/java/org/opensearch/benchmark/ | head
# Run a JMH benchmark (the argument is a pattern matching @Benchmark class names;
# check benchmarks/README.md for the invocation your checkout supports):
./gradlew -p benchmarks run --args ' YourBenchmark'

For end-to-end throughput/latency, OpenSearch Benchmark (opensearch-benchmark, the separate benchmarking client) drives standardized workloads against a real cluster:

# Installed separately (pip install opensearch-benchmark); illustrative run:
opensearch-benchmark execute-test \
  --target-hosts localhost:9200 \
  --workload geonames \
  --pipeline benchmark-only

The discipline a maintainer expects when your change is on a hot path:

Measure before (baseline on main).
Measure after (your branch), same workload, same hardware.
Report the delta in the PR — numbers, not adjectives. "No measurable regression on the geonames workload (p50 indexing 41.2k → 41.0k docs/s, within noise)" is a sentence that gets a PR merged. "Should be fine, it's a small change" is not.

If you cannot run the full macrobenchmark, say so and at least provide a JMH result for the changed method; a reviewer would rather have a microbenchmark than a guess.
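The arithmetic behind "report the delta" is small enough to sketch. The helper below is hypothetical, and the 1% noise threshold is an assumption chosen for illustration; in practice the noise band should come from repeated baseline runs on the same hardware.

```java
import java.util.Locale;

class BenchDelta {

    // Relative change of the candidate against the baseline, in percent.
    static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    // True when the absolute change is no larger than the assumed noise band (in percent).
    static boolean withinNoise(double baseline, double candidate, double noisePct) {
        return Math.abs(percentChange(baseline, candidate)) <= noisePct;
    }

    // One-line summary suitable for pasting into a PR description.
    static String report(String metric, double baseline, double candidate, double noisePct) {
        String verdict = withinNoise(baseline, candidate, noisePct) ? "within noise" : "outside noise";
        return String.format(Locale.ROOT, "%s: %.1f -> %.1f (%+.2f%%, %s at +/-%.1f%%)",
            metric, baseline, candidate, percentChange(baseline, candidate), verdict, noisePct);
    }

    public static void main(String[] args) {
        // The figures from the example sentence above: 41.2k vs 41.0k docs/s.
        System.out.println(report("indexing docs/s", 41200.0, 41000.0, 1.0));
    }
}
```

For the example figures the change is roughly half a percent downward, which is why the sentence can honestly say "within noise" once a noise band of that size has been established.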

Putting It Together: The Compatibility Triage

Before opening a PR, triage which surfaces you touched:

flowchart TD
    A[My change] --> B{Touches a Writeable<br/>read/write or serialized<br/>cluster state?}
    B -- Yes --> B1[Version-gate it + round-trip test<br/>+ qa/ rolling-upgrade or mixed-cluster]
    A --> C{Changes a REST path,<br/>param, or response?}
    C -- Yes --> C1[Update rest-api-spec + YAML<br/>deprecate, don't delete]
    A --> D{Writes a new on-disk<br/>mapping/metadata/codec?}
    D -- Yes --> D1[Old-form read path<br/>+ qa/full-cluster-restart]
    A --> E{On a hot path<br/>search/index/agg/serde?}
    E -- Yes --> E1[Benchmark before/after,<br/>report the delta]
    B -- No --> Z[Standard unit/IT tests]
    C -- No --> Z
    D -- No --> Z
    E -- No --> Z

Most PRs touch zero of these and need only ordinary tests. But the moment a PR touches one, the review bar rises, and the contributor who self-identifies the surface and brings the proving test or benchmark is the one whose PR sails through.
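The same triage can be written down as a lookup from touched surfaces to the proof a reviewer will expect. This is a sketch of the flowchart's logic only; the enum, class, and method names are hypothetical, and the strings paraphrase the chart.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class CompatTriage {
    enum Surface { WIRE, REST, INDEX, HOT_PATH }

    // What to bring to the PR for one touched surface.
    static String proofFor(Surface surface) {
        switch (surface) {
            case WIRE:
                return "Version gate + round-trip test + qa/ rolling-upgrade or mixed-cluster";
            case REST:
                return "rest-api-spec update + YAML tests; deprecate, don't delete";
            case INDEX:
                return "old-form read path + qa/full-cluster-restart";
            case HOT_PATH:
                return "before/after benchmark with the delta reported";
            default:
                throw new IllegalArgumentException("unknown surface: " + surface);
        }
    }

    // Collects the proof for every touched surface; ordinary tests when none is touched.
    static List<String> requiredProof(Set<Surface> touched) {
        List<String> proof = new ArrayList<>();
        for (Surface surface : Surface.values()) {
            if (touched.contains(surface)) {
                proof.add(proofFor(surface));
            }
        }
        if (proof.isEmpty()) {
            proof.add("standard unit/integration tests");
        }
        return proof;
    }

    public static void main(String[] args) {
        for (String item : requiredProof(EnumSet.of(Surface.WIRE, Surface.HOT_PATH))) {
            System.out.println("- " + item);
        }
    }
}
```

Writing the touched surfaces and the resulting list into the PR description is one way to show a reviewer that the triage was done before they had to ask.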

Where To Go Deeper

Mechanics of Version gating, NamedWriteableRegistry, Diffable, XContent BWC, and the qa/ suite → the serialization-BWC deep dive.
The replication/global-checkpoint interactions that make some serialized changes subtle → the replication deep dive.
Hands-on BWC and performance labs → Level 9, where you build and run a mixed-cluster BWC test and a benchmark comparison end to end.
How maintainers reason about these trade-offs at the project level → the maintainer-mindset chapter.

Validation: Prove You Understand This

Explain, to a colleague, why adding a single non-gated field to a TransportRequest can take down a node during a rolling upgrade — name the exact failure point.
Write the writeTo/readFrom pair for a new optional field, Version-gated on V_3_0_0, with the correct old-peer default.
Name the three qa/ suites and what each one proves, and give the ./gradlew task to run the rolling-upgrade suite.
For a REST change that renames a query parameter, describe the deprecation path (keep working, Warning header, remove at major) and the spec/YAML files you must update.
Describe the before/after performance discipline for a hot-path change, name the JMH project, and write the one-sentence delta report you would put in the PR.
Triage a hypothetical change ("add a min_score field to a custom query, serialized over the wire and in cluster state") across all three concerns and list every artifact you'd attach.

When you can triage any change across wire/index BWC, API stability, and performance — and bring the proving test or benchmark unprompted — you have the judgment maintainers most want to see. The next chapter — The Path to Maintainership — is how that judgment, shown repeatedly, turns a contributor into a maintainer.

The Path to Maintainership

Every chapter before this one was about a single contribution: read the code, find the design, talk to the community, open a clean PR, iterate on feedback, keep it compatible. This chapter is about the arc — how a stream of good contributions accrues into trust, and how that trust is formalized when a project invites you to become a maintainer.

OpenSearch governance is deliberately concrete and GitHub-native: trust is recorded in a file (MAINTAINERS.md), authority is delegated by a body (the TSC), and the path from contributor to maintainer is a track record anyone can read in your commit and review history. There is no exam and no application form. There is sustained, visible, high-judgment work.

The Three Roles

Role	What they can do	How you get there	Recorded in
Contributor	Open issues/PRs, review and comment, claim `good first issue`/`help wanted`	Just start; DCO sign-off is the only gate	Your GitHub history
Maintainer	All of the above + approve/merge PRs, manage labels/releases, set direction for their repo	Nomination by existing maintainers after a sustained track record	`MAINTAINERS.md` (per repo)
TSC (Technical Steering Committee)	Cross-project governance: technical direction, new repos, conflict resolution, foundation matters	Selected per the project's charter	The project's governance docs

Read the actual files for the repo you work in:

cd ~/os-src
sed -n '1,60p' MAINTAINERS.md            # the people, their GitHub handles, and areas
find . -maxdepth 2 \( -iname 'GOVERNANCE.md' -o -iname 'CONTRIBUTING.md' \) | head

Each repo under opensearch-project/ has its own MAINTAINERS.md. The k-NN repo, the SQL repo, and core all maintain separate lists — maintainership is per repo, and being a maintainer of one does not make you a maintainer of another. This matters strategically: it is far easier to become a maintainer of a focused plugin repo where you have concentrated your work than of core, where the bar and the contributor pool are largest.

What Maintainers Actually Do

The title is mostly responsibility, not privilege. A maintainer's day-to-day:

Review PRs — the bulk of the job. Reading others' code, asking the compatibility and test questions from the compatibility chapter, and approving/merging.
Triage issues — apply labels (untriaged → triaged), reproduce bugs, mark good first issue/help wanted, close duplicates, shepherd RFCs.
Manage releases — cut release branches, decide what gets backported, sign off on a release's readiness. See the release-process chapter.
Set and defend direction — weigh in on RFCs, say "no" to scope creep, keep the subsystem coherent.
Mentor — help new contributors land their first PRs, which is also how the next generation of maintainers is grown.

Note: The single highest-signal maintainer activity is reviewing other people's PRs. It is also the activity a contributor can start doing today, with no special permission. More on this below — it is the most underused lever on the whole path.

How Maintainers Are Nominated

There is no fixed point-count, but the pattern is consistent across OpenSearch repos. A maintainer is nominated by an existing maintainer (usually via a GitHub issue or a maintainer discussion), and the existing maintainers reach consensus. The nomination is justified by a track record demonstrating three things:

Sustained, quality contribution — not one big PR, but a steady stream over months, in a coherent area, that merged cleanly and didn't cause regressions.
Review participation — you have reviewed others' PRs with useful, correct feedback. This proves you can evaluate code, not just write it, which is the core maintainer skill.
Judgment — you self-identify compatibility risks, you scope PRs tightly, you handle disagreement with evidence not heat, you know the area's design history. This is the quality from Design via GitHub and Compatibility made visible.

Note what is not on the list: lines of code, cleverness, or seniority elsewhere. A staff engineer who drops three large unreviewed PRs and argues in threads is less likely to be nominated than a junior engineer who shipped fifteen small clean fixes and reviewed thirty PRs thoughtfully.

The Timeline: Months to Years

Be honest with yourself about the clock. The path is measured in months to years, not weeks:

flowchart LR
    A[First PR merged] --> B[Steady contributions<br/>in one area<br/>~months]
    B --> C[Start reviewing<br/>others' PRs]
    C --> D[Recognized as the<br/>de-facto expert in a<br/>sub-area]
    D --> E[Maintainer notices,<br/>nominates]
    E --> F[Maintainer consensus<br/>+ MAINTAINERS.md update]

A realistic core-repo timeline is on the order of a year or more of consistent work; a focused plugin repo can be faster because the area is smaller and the contributor pool thinner. Trust is slow to build precisely because the cost of a bad maintainer — someone who merges unsafe changes — is high. Patience here is not passivity; it is the steady accumulation of merged work and helpful reviews.

Building a Track Record — Concretely

You do not become a maintainer by wanting to; you become one by doing the work in a way that is visible and area-focused. The strategy:

1. Pick an area and stay in it

Maintainership is per-area trust. Scattering one PR each across search, allocation, snapshots, and the build system makes you a generalist nobody can vouch for in any one place. Pick a subsystem — say aggregations, or shard allocation, or a specific plugin — read its deep dive (e.g. aggregations, shard allocation), read its design history, and stay there. Become the person whose name is on every recent aggregation fix.

2. Sustain it

A track record is a rate, not a single event. A handful of small, clean, well-tested PRs per month in your area beats one heroic 5,000-line PR. Each clean PR teaches the maintainers that your work is safe to merge; that accumulated confidence is the nomination case.

3. Review others' PRs

This is the lever most contributors ignore, and it is the strongest. You can review any open PR right now:

gh pr list --repo opensearch-project/OpenSearch --label "Search:Aggregations" --state open
gh pr view <number> --repo opensearch-project/OpenSearch --comments

Good reviews — catching a missing Version gate, asking for the integration test, spotting an unrelated reformat — do three things at once: they help the project, they teach you the codebase faster than writing code does, and they demonstrate to maintainers that you can evaluate code, which is the literal job of a maintainer. A contributor who reviews well is on the shortlist; a contributor who only writes is not.

4. Show judgment in public

Every interaction is part of the record. Self-identify compatibility risks before a reviewer asks (compatibility). Scope tightly (PR quality). Push back with evidence and concede when wrong (responding to feedback). Capture Slack/meeting decisions into issues (community). These are the behaviors maintainers cite when they nominate.

Track-record scorecard

Signal	Weak	Strong
Focus	One PR each in 6 subsystems	15 PRs in one subsystem
Cadence	One big PR, then silence	Steady small PRs over months
Tests	"Works on my machine"	Every PR has the right test tier
Reviews given	Zero	Dozens of useful reviews in your area
Compatibility	Reviewer always catches it	You flag it first, with the proving test
Conduct	Argues in threads	Evidence-based, gracious, helpful to newcomers

What Changes When You Become a Maintainer

The merge button is the smallest part. What actually changes is that the project's quality is now partly your responsibility. You inherit the obligation to review fairly and promptly, to mentor the next contributors the way someone mentored you, to say "no" to changes that aren't safe even when the author is frustrated, and to keep your area healthy. The maintainer-mindset chapter covers how that responsibility reshapes how you think about every change, and the TSC governance chapter covers the layer above — how the project as a whole is steered.

Warning: Maintainership is a commitment, not a trophy. An inactive maintainer who blocks PRs by never reviewing them is a worse outcome than no maintainer. Take the role when you can sustain the responsibility, not merely when you've earned the recognition.

Validation: Prove You Understand This

State the three roles (contributor, maintainer, TSC), what each can do, and where each is recorded — and explain why maintainership is per-repo.
List the three things a maintainer nomination demonstrates, and explain why "lines of code" is not one of them.
Find and read the MAINTAINERS.md of one OpenSearch repo you care about; name two maintainers and their areas.
Describe a realistic 12-month plan to build a track record in one subsystem: which area, what cadence of PRs, and how many reviews of others' PRs.
Explain why reviewing other people's PRs is the highest-signal activity on the path, and run a gh pr list to find one open PR in your area you could review today.
Articulate what new responsibilities (not privileges) you would take on as a maintainer, and why an inactive maintainer is worse than none.

This is the last chapter of the Contributor Mindset section. You now know how to read the codebase, recover design intent, talk to the community, ship a clean PR, iterate on feedback, keep changes compatible, and build toward maintainership. The next stop is the release-governance section, which views the same project from the maintainers' and the foundation's side of the table — and the TSC governance chapter in particular, where the path you just read about leads.

From Beginner to Advanced Issues

This roadmap is a deliberately ordered ladder of OpenSearch contributions. Each rung trains one skill, depends on the rung below it, and ends at a concrete, review-ready Pull Request on github.com/opensearch-project/OpenSearch. Skipping rungs is the most common reason contributors stall: a shard-allocation fix without cluster-state fluency turns into a six-month PR thread, and a release-blocker triage call without backward-compatibility reflexes turns into a reverted commit on a release branch.

OpenSearch is not Apache. There is no JIRA, no patch attachments, no "Patch Available" state. The source of truth is GitHub: issues and Pull Requests. Contributions require a DCO sign-off (git commit -s), not a CLA. Every PR adds a one-line CHANGELOG.md entry. CI runs through GitHub Actions and Jenkins; the local pre-flight is ./gradlew precommit. If you are arriving from the Apache Tez roadmap, unlearn the JIRA muscle memory now — the rest of this section assumes the GitHub flow.

The stages are calibrated to the OpenSearch main / 3.x codebase. The maintenance line is 2.x; 1.x is legacy. Where a stage references real modules it uses the exact top-level paths you will see in a checkout:

server/                 the core engine. org.opensearch.* — the bulk of what you read
libs/                   low-level shared libs: libs/core (StreamInput/Output, Writeable),
                        libs/x-content (XContent parsing), libs/common, libs/geo
modules/                bundled-by-default modules: transport-netty4, reindex,
                        lang-painless, analysis-common, ingest-common, percolator
plugins/                in-repo optional plugins: analysis-icu, repository-s3, store-smb
client/                 Java clients: client/rest, client/sniffer
distribution/           packaging: archives, docker, deb/rpm, distribution/tools
test/framework/         OpenSearchTestCase, OpenSearchIntegTestCase, InternalTestCluster
qa/                     cross-version BWC, rolling-upgrade, mixed-cluster, packaging QA
rest-api-spec/          REST API JSON specs + shared REST-YAML tests (yamlRestTest)
benchmarks/             JMH microbenchmarks (:benchmarks)
buildSrc/, build-tools*/  Gradle build logic and custom plugins

Note: OpenSearch renamed the master node role and many APIs to cluster manager for inclusive language. The old terms survive as deprecated aliases (master role, cluster.initial_master_nodes). This roadmap writes "cluster manager (formerly master)" on first use in each stage and prefers the new term thereafter.

The Twelve Stages

#	Stage	Target skill	Prereq level	Primary subsystem	Typical PR size
1	Docs & tests	GitHub flow, DCO, CHANGELOG, `precommit`	none	docs, `test/framework`	1–40 lines
2	Build, dependency & logging	version catalog, Spotless, forbidden-APIs, Log4j2	1	`buildSrc`, `server`	5–80 lines
3	Error messages & diagnostics	`validate()`, exception text, REST rendering	2	`server/.../action`, `rest`	20–200 lines + test
4	Cluster state & coordination	update tasks, appliers, listeners	3	`org.opensearch.cluster`	30–300 lines + test
5	Shard allocation	`AllocationDecider`, `RoutingAllocation`	4	`cluster.routing.allocation`	40–400 lines + test
6	Indexing & engine	`IndexShard`, `InternalEngine`, `Translog`, seqno	4	`index.engine`, `index.translog`	50–500 lines + test
7	Search & aggregations	`QueryPhase`, `FetchPhase`, `InternalAggregation.reduce`	4	`search`, `search.aggregations`	50–500 lines + test
8	Plugin & extension compat	extension points, SPI, `Plugin` evolution	4–7	`org.opensearch.plugins` + plugin repos	varies; often two repos
9	Flaky tests	`tests.seed`, `assertBusy`, `@AwaitsFix`, races	4	`test/framework`, `*IT`	20–150 lines
10	Performance	JMH, OpenSearch Benchmark, GC/allocation, `BigArrays`	6 or 7	hot paths in `server`	30–300 lines + bench
11	Backward compatibility	`Version`-gated streams, Lucene/index BWC, `qa/`	4	`libs/core`, `qa/`	small code, long thread
12	Release-blocking	triage, milestones, reverts, backports	maintainer	whole-project	varies

Stages 4–7 share a prerequisite level rather than a strict order: once you have cluster-state fluency (Stage 4) you can branch into allocation, engine, or search depending on the bug in front of you. Stage 8 sits across that band because the core↔plugin boundary touches whichever subsystem the extension point belongs to.

How OpenSearch labels map to stages

OpenSearch issues are labelled on GitHub. There is no JIRA component tree; instead a combination of type labels, status labels, and component/area labels does the same job. Learn these — every stage's issue search is built from them.

Label	Meaning	Which stages use it
`good first issue`	curated, small, mentor-friendly	1, 2
`help wanted`	maintainers want community to take it	all stages
`untriaged`	not yet labelled by a maintainer; do not start blind	filter these out until triaged
`bug`	incorrect behaviour	3–8, 11, 12
`enhancement`	new behaviour / improvement	2, 3, 7, 10
`flaky-test`	a test that fails nondeterministically	9
`discuss` / `RFC` / `proposal` / `meta`	design-level, needs a thread first	not bug-fix work; read, don't patch
`backport 2.x`, `backport 1.x`	triggers the backport bot when merged	11, 12
`v3.x.0` (milestone, not a label)	targeted release	12
Component/area labels	`Cluster Manager`, `Search:Aggregations`, `Storage:Durability`, `Indexing`, `Search:Performance`, `Plugins`, `distributed framework`	every stage scopes by these

The component labels drift and get renamed; do not memorise the exact strings. Instead, open https://github.com/opensearch-project/OpenSearch/labels once and skim the current set, then use the label:"…" filter that the relevant stage shows. Every stage also gives at least one fallback grep to find a candidate when labels return nothing.

A canonical issue search, reused (with different labels) in every stage:

is:issue is:open label:"good first issue" label:"help wanted" no:assignee sort:updated-desc

Paste it into the GitHub issue search box on the OpenSearch repo. no:assignee is the single most useful filter: it skips issues someone is already on.
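Since each stage reuses the search with different labels, the query can be assembled mechanically. The helper below is a hypothetical convenience, not part of any OpenSearch or GitHub tooling; it only concatenates the qualifiers shown above and quotes every label so that names containing spaces survive.

```java
import java.util.List;

class IssueSearch {

    // Builds the canonical open-issue query for the given labels (all labels must match).
    static String buildQuery(List<String> labels) {
        StringBuilder query = new StringBuilder("is:issue is:open");
        for (String label : labels) {
            query.append(" label:\"").append(label).append('"');
        }
        query.append(" no:assignee sort:updated-desc");
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(List.of("good first issue", "help wanted")));
    }
}
```

Because every label qualifier is ANDed, adding a component label narrows the result set; when a search returns nothing, dropping one label at a time is a quick way to see which one is too strict.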

How to use this roadmap

Pick a stage honestly

Find your rung by asking one question: what is the largest change you have shipped to OpenSearch?

Never landed an OpenSearch PR: start at Stage 1.
Landed a docs PR but never touched Java in server/: Stage 2.
Comfortable with server/ Java but never read a cluster-state update task: Stage 3.
Read ClusterApplierService once and were confused: Stage 4.
Read it twice and could draw the publish/apply flow: Stages 5–7.
Already a maintainer (in a MAINTAINERS.md): jump to Stages 10–12 for sharpening.

Do not jump rungs to chase a "cool" bug. A wrong Decision in DiskThresholdDecider looks self-contained and isn't — the fix lands on RoutingAllocation plumbing and a ClusterInfo you have never built in a test (Stage 4 prerequisite work).

One stage per PR

Resist the urge to fix two things in one PR. OpenSearch reviewers reject mixed-concern PRs almost reflexively, and the CHANGELOG entry forces you to name the single change in one line — if you cannot, the PR is doing too much. If you find a logging issue while fixing an error message, open a follow-up issue and move on. The roadmap rewards small surface area.

Always start with `git log` and `git blame`

Before touching a file, find who cares about it:

git log --oneline -n 5 -- server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java
git blame -L 200,260 server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java

The blame output tells you who last touched that region, often a maintainer or a regular contributor to the area (check MAINTAINERS.md to confirm). Mention them with @handle in your PR description (politely, once); that is how OpenSearch routes review, in place of CC-ing a mailing list.

Read the workflow once, then never re-explain it

Every stage from 2 onward assumes the Stage 1 mechanics: fork → branch → DCO sign-off → CHANGELOG entry → ./gradlew precommit → PR → read CI → respond to review. Stage 1 drills all of it. The later stages spend their words on code, not process.

Time investment per stage

Calibrated against a contributor who has the repo checked out, can run ./gradlew localDistro and ./gradlew :server:test, and has opened at least one PR:

Stage	First PR	Becoming fluent (≈5 PRs merged)
1	half a day	1 week
2	1 day	2 weeks
3	1–2 days	1 month
4	3–5 days	2–3 months
5	1–2 weeks	4–6 months
6	2–4 weeks	6 months
7	2–4 weeks	6 months
8	1–3 weeks per boundary	a year of cross-repo work
9	1–3 days per flake	ongoing
10	weeks (perf is bench-bound)	maintainer-level skill
11	weeks (BWC review cycle)	maintainer-level skill
12	maintainer responsibility	n/a

Success criterion per stage

Each stage is "complete" for you when:

Stage 1: one docs/javadoc PR and one test-only PR are merged.
Stage 2: two logging or build/dependency PRs merged without precommit re-asks.
Stage 3: one error-message PR merged with a unit test asserting the exact text.
Stage 4: one cluster-state fix merged with a ClusterServiceUtils-based test.
Stage 5: one allocation fix merged, reproduced via a RoutingAllocation unit test.
Stage 6: one engine/translog fix merged with an InternalEngineTests-style test.
Stage 7: one search/agg fix merged with an AggregatorTestCase/AbstractQueryTestCase test.
Stage 8: one core change merged that you verified against a real plugin build.
Stage 9: at least three flaky tests de-flaked (un-muted and fixed).
Stage 10: one perf PR merged with before/after JMH or OSB numbers.
Stage 11: one BWC-sensitive PR merged with a Version guard and a qa/ test.
Stage 12: you have helped triage at least one release-blocking issue.

When to open a `discuss` issue first

For Stages 4 and above, before writing code, open (or comment on) the GitHub issue with a three-sentence plan:

I see <symptom> at <file> (grep below). My read is <cause>. I plan to <fix>, with a
regression test in <TestClass>. Anything I'm missing before I open a PR?

The maintainers will tell you within a day or two whether you are about to collide with in-flight work or a design decision. This is the GitHub equivalent of the Apache [DISCUSS] dev@ ping — same discipline, different venue.

What to read alongside this roadmap

Roadmap stage	Companion deep dive(s)
1–3	Reading the codebase, The REST layer
4	Cluster state, Cluster state publishing, Discovery & coordination
5	Shard allocation
6	Engine internals, Translog, Refresh/flush/merge
7	Search execution, Aggregations, Query DSL & QueryBuilders
8	Plugin architecture + the plugin labs
9	Threadpools & concurrency
10	Circuit breakers & memory, DocValues & fielddata
11	Serialization & BWC, the compatibility mindset
12	Release process, the maintainer mindset

What this roadmap is not

This roadmap is not a tutorial on OpenSearch itself. The deep dives cover the architecture; the labs from Level 1 onward cover hands-on code reading. The roadmap assumes you can already build from source, run ./gradlew :server:test, and stand up a node with ./gradlew run. If you cannot, the prerequisite is Level 1.

It is also not a generic open-source guide. CONTRIBUTING.md, DEVELOPER_GUIDE.md, and TESTING.md in the repo cover account setup, the DCO mechanics, and the test commands; the roadmap assumes you have read them once.

Finally, it is not a roadmap to maintainership. Becoming a maintainer is a separate path the TSC and each repo's existing maintainers manage (see the maintainer mindset). The roadmap teaches the skills that, applied consistently, make maintainership a reasonable outcome — landing PRs is necessary, not sufficient.

How the stages interlock

Each stage builds vocabulary the next stage uses without re-explaining:

Stage 1 teaches the PR artifact: fork, DCO, CHANGELOG, precommit. Every later stage assumes it.
Stage 2 teaches Log4j2 idioms and the build gates. Stage 3 builds on them with the rule that every thrown exception carries actionable context.
Stage 3 teaches you to navigate the action/rest layer. Stage 4 follows the request into the cluster-manager service and its update tasks.
Stage 4 teaches ClusterState, update tasks, and appliers. Stages 5–7 all read cluster state: allocation routes shards, the engine reacts to index metadata, search resolves shards from the routing table.
Stage 5 teaches RoutingAllocation unit tests. Stage 6 teaches InternalEngineTests. Stage 7 teaches AggregatorTestCase. These three test harnesses are the workhorses of core contribution.
Stage 8 teaches the plugin boundary, attributing a break to core vs a plugin — the OpenSearch analog of the Tez "Hive-on-Tez" attribution skill.
Stage 9 teaches deterministic testing (tests.seed, assertBusy). Stage 10 uses that determinism as the baseline for stable benchmarks.
Stage 10 teaches measurement. Stage 11 uses measurement as evidence in BWC decisions. Stage 11 teaches Version gating, which Stage 12 weighs when deciding whether an issue blocks a release or merely a backport.

Skipping a stage means skipping a vocabulary. Reviewers will notice.

Now turn to Stage 1.

Stage 1 — Docs and Tests

What this stage teaches

Stage 1 is the on-ramp. The skills are deliberately non-technical — they are the GitHub contribution workflow that every later stage assumes and never re-explains:

Claim an issue by commenting, then fork and branch.
Sign every commit off (DCO, git commit -s) so the Signed-off-by: line is present.
Add the one-line CHANGELOG.md entry that OpenSearch requires on every PR.
Run ./gradlew precommit and read its output without panicking.
Open a Pull Request, read the CI (GitHub Actions + Jenkins gradle-check), and respond to review by pushing more commits to the same branch.

The contributions themselves are surgical: a docs typo, a stale config key in a .md, a javadoc that the build flags, a weak or missing assertion in an existing unit test. Nothing in this stage will surprise a reviewer. That is the point — you are exercising the workflow so the next stages can be about code.

Note: If you are arriving from the Apache Tez roadmap, there is no JIRA, no .patch file, and no "Submit Patch" button. The unit of work is a branch on your fork and a PR against opensearch-project/OpenSearch. Re-rolls are extra commits (or a force-push, with etiquette; see Pitfalls), not new attachments.

The GitHub workflow, end to end

Step 0 — Fork and clone

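# Option A: fork and clone in one step with the GitHub CLI (it adds the upstream remote for you)
gh repo fork opensearch-project/OpenSearch --clone
cd OpenSearch
# Option B: fork in the UI, then clone and wire up the remote by hand
git clone git@github.com:<you>/OpenSearch.git
cd OpenSearch
git remote add upstream https://github.com/opensearch-project/OpenSearch.git
# Either way, refresh upstream before branching
git fetch upstream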

Keep main clean and tracking upstream. Branch for every change:

git checkout -b docs/fix-rolling-upgrade-typo upstream/main

Step 1 — Find an issue

Paste this into the issue search box on the OpenSearch repo (https://github.com/opensearch-project/OpenSearch/issues):

is:issue is:open label:"good first issue" no:assignee sort:updated-desc

Narrow to docs/test work by adding a keyword:

is:issue is:open label:"good first issue" no:assignee docs in:title,body
is:issue is:open label:"good first issue" no:assignee "javadoc" in:title,body
is:issue is:open label:"good first issue" no:assignee "test" "assert" in:body

Open three candidates. Read each thread end to end. Choose one that has no assignee, no open PR linked, and is not labelled untriaged (untriaged means a maintainer has not confirmed it is real). Comment to claim it:

I'd like to work on this. Planning to <one sentence>. Could a maintainer assign it to me?

OpenSearch does not require an explicit assignment to open a PR, but commenting prevents two people doing the same work and signals the maintainers to triage.

If no labelled issue fits, file your own from a real defect you find. Grep the docs in the repo (most user docs live in the separate documentation-website repo, but the core repo carries README, CONTRIBUTING.md, DEVELOPER_GUIDE.md, TESTING.md, package-info javadoc, and inline javadoc):

grep -rn "cluster.initial_master_nodes" . --include=*.md --include=*.java | head
grep -rn "TODO\|FIXME\|XXX" DEVELOPER_GUIDE.md TESTING.md CONTRIBUTING.md
grep -rn "elasticsearch" docs/ 2>/dev/null | head     # stale fork references

A genuine doc or stale-reference bug found this way is fair game for your first PR. Open an issue describing it first, then the PR that fixes it (the PR links the issue with Closes #NNNN).

Walked example 1 — a javadoc / package-info documentation fix

This example is illustrative of the pattern, not a transcription of one specific PR. The grep finds the real site on your branch; class and line numbers vary by branch.

Symptom: a user reports on the forum that org.opensearch.common.settings.Setting has a public Property enum value whose javadoc does not say what it does, so plugin authors cannot tell when to use it.

Locate the code

grep -rn "enum Property" server/src/main/java/org/opensearch/common/settings/Setting.java
git log --oneline -n 5 -- server/src/main/java/org/opensearch/common/settings/Setting.java

Open the file. The relevant block looks roughly like:

public enum Property {
    Filtered,
    Dynamic,
    Final,
    Deprecated,
    NodeScope,
    IndexScope,
    // ...
}

Several values have no javadoc. That is the bug.

Diff

--- a/server/src/main/java/org/opensearch/common/settings/Setting.java
+++ b/server/src/main/java/org/opensearch/common/settings/Setting.java
@@
 public enum Property {
-    Filtered,
+    /** The setting's value is sensitive and must be redacted from APIs and logs. */
+    Filtered,
+    /** The setting can be changed at runtime via the cluster/index update settings API. */
     Dynamic,
+    /** The setting can be set once at index creation and never changed afterwards. */
     Final,

Two rules for documentation diffs:

Describe behaviour, not the identifier. "Dynamic means dynamic" is noise. Say what changes for the user when the property is present.
Verify against the code, not your memory. Trace where the enum value is read (grep -rn "Property.Dynamic" server/) and confirm your sentence matches the actual check before you write it.

Build the documentation gate

OpenSearch validates javadoc as part of precommit. For a docs-only change you do not need the full test suite — run the targeted gates:

./gradlew :server:compileJava -q          # does it still compile?
./gradlew :server:checkstyleMain -q        # style (line length, ordering)
./gradlew spotlessJavaCheck -q             # formatting
./gradlew :server:precommit                # the full per-project gate (javadoc, headers, forbidden-APIs)

If spotlessJavaCheck fails, fix it mechanically: ./gradlew spotlessApply.

CHANGELOG and commit

Every PR must add a line to CHANGELOG.md under the ## [Unreleased ...] heading, in the right category (Added / Dependencies / Changed / Deprecated / Removed / Fixed / Security):

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ## [Unreleased 3.x]
 ### Changed
+- Document the `Setting.Property` enum values for plugin authors ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))

The PR number is not known yet; open the PR, then come back and fix the link in a follow-up commit (maintainers accept the placeholder being updated after the number exists).

Commit with sign-off:

git add server/.../Setting.java CHANGELOG.md
git commit -s -m "Document Setting.Property enum values"

The -s adds the Signed-off-by: Your Name <you@example.com> line that the DCO check requires. Configure user.name/user.email first so the sign-off matches both the commit author and your GitHub identity; the DCO check compares the sign-off against the author.
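As a rough illustration of what the sign-off rule amounts to, the sketch below scans a commit message for a Signed-off-by trailer that matches the author. The class and method names are hypothetical, and this is an assumption about the shape of the rule, not the code the DCO check actually runs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the DCO rule, not the real check's code: a commit passes
// only if its message has a "Signed-off-by: Name <email>" trailer matching the author.
final class DcoCheckSketch {
    private static final Pattern TRAILER =
        Pattern.compile("^Signed-off-by: (.+) <(.+)>$", Pattern.MULTILINE);

    static boolean isSignedOff(String commitMessage, String authorName, String authorEmail) {
        Matcher trailer = TRAILER.matcher(commitMessage);
        while (trailer.find()) {
            boolean sameName = trailer.group(1).equals(authorName);
            boolean sameEmail = trailer.group(2).equalsIgnoreCase(authorEmail);
            if (sameName && sameEmail) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String subject = "Document Setting.Property enum values";
        String signed = subject + "\n\nSigned-off-by: Your Name <you@example.com>";
        System.out.println("with -s:    " + isSignedOff(signed, "Your Name", "you@example.com"));
        System.out.println("without -s: " + isSignedOff(subject, "Your Name", "you@example.com"));
        System.out.println("wrong mail: " + isSignedOff(signed, "Your Name", "other@example.com"));
    }
}
```

The third call is the case that catches people out: the trailer is present, but it was written with a different email than the one the commit is authored under.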

Open the PR

git push origin docs/document-setting-property
gh pr create --repo opensearch-project/OpenSearch --fill

Fill in the PR template: link the issue (Closes #NNNN), tick the CHANGELOG and DCO checkboxes honestly, and describe the change in two sentences. Then watch CI:

DCO — green only if every commit is signed off.
CHANGELOG verifier — green only if you touched CHANGELOG.md.
gradle-check / GitHub Actions — compile, precommit, and (for code) tests.

Read failures top to bottom. A docs PR that fails gradle-check almost always failed checkstyle or spotless, both of which the local commands above would have caught.

Walked example 2 — strengthening a weak unit-test assertion

Illustrative of the pattern. Run the grep to find a real candidate.

Symptom: an OpenSearchTestCase subclass exercises a parser but asserts only that the result is non-null — a regression that produced a wrong value would still pass. This is exactly the kind of issue tagged good first issue + test.

Locate a candidate

# Tests that assert non-null but never assert the value — a classic weak assertion smell:
grep -rn "assertNotNull" server/src/test/java/org/opensearch/ | head -40

Pick one where the parsed object has fields you can check. Read the test and the class under test together.

Diff

--- a/server/src/test/java/org/opensearch/common/unit/TimeValueTests.java
+++ b/server/src/test/java/org/opensearch/common/unit/TimeValueTests.java
@@
     public void testParseTimeValue() {
         TimeValue value = TimeValue.parseTimeValue("10s", "test");
         assertNotNull(value);
+        assertEquals(10, value.seconds());
+        assertEquals(TimeUnit.SECONDS, value.timeUnit());
+        assertEquals("10s", value.getStringRep());
     }
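To see why the non-null assertion alone is weak, here is a minimal, self-contained sketch. The parser is a deliberately buggy toy with hypothetical names and is unrelated to the real TimeValue class; the two check methods stand in for assertNotNull alone versus the strengthened assertions in the diff.

```java
import java.util.concurrent.TimeUnit;

// Toy parser with a deliberate bug, used only to show why a non-null check is weak.
// It is a hypothetical stand-in and is unrelated to the real TimeValue class.
final class WeakAssertionSketch {
    static final class Parsed {
        final long amount;
        final TimeUnit unit;

        Parsed(long amount, TimeUnit unit) {
            this.amount = amount;
            this.unit = unit;
        }
    }

    // Bug: the numeric prefix is ignored, so every input parses as 1 of its unit.
    static Parsed buggyParse(String text) {
        TimeUnit unit = text.endsWith("ms") ? TimeUnit.MILLISECONDS : TimeUnit.SECONDS;
        return new Parsed(1, unit);
    }

    // Equivalent of assertNotNull alone.
    static boolean weakCheck(Parsed parsed) {
        return parsed != null;
    }

    // Equivalent of also asserting the amount and the unit.
    static boolean strongCheck(Parsed parsed, long amount, TimeUnit unit) {
        return parsed != null && parsed.amount == amount && parsed.unit == unit;
    }

    public static void main(String[] args) {
        Parsed parsed = buggyParse("10s");
        System.out.println("weak check passes:   " + weakCheck(parsed));
        System.out.println("strong check passes: " + strongCheck(parsed, 10, TimeUnit.SECONDS));
    }
}
```

The weak check accepts the wrong result; only the value-level check notices that "10s" came back as 1 second.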

Warning: Do not change the production class in a Stage 1 test PR. If strengthening the assertion reveals a real bug (the value is genuinely wrong), that is a Stage 3 bug fix — open a separate issue and PR. Mixing a test improvement with a behaviour change is the fastest way to get a Stage 1 PR bounced.

Run just this test

./gradlew :server:test --tests "org.opensearch.common.unit.TimeValueTests" -q
# reproduce a randomized failure later with the printed seed:
./gradlew :server:test --tests "org.opensearch.common.unit.TimeValueTests" -Dtests.seed=<SEED>
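The reason a printed seed is enough to replay a randomized failure can be shown in a few lines. This sketch uses java.util.Random as a stand-in for the test framework's random source, which is an assumption made for illustration; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of seed-based replay: the same seed always produces the same sequence of
// "random" inputs, so a failing input can be regenerated on demand.
final class SeedReplaySketch {
    static int[] randomInputs(long seed, int count) {
        Random random = new Random(seed);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(1000);
        }
        return values;
    }

    // True when two independent runs with the same seed see identical inputs.
    static boolean sameRun(long seed) {
        return Arrays.equals(randomInputs(seed, 5), randomInputs(seed, 5));
    }

    public static void main(String[] args) {
        System.out.println("replay matches: " + sameRun(42L));
        System.out.println("inputs: " + Arrays.toString(randomInputs(42L, 5)));
    }
}
```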

Then add the CHANGELOG line (category Changed; OpenSearch generally still wants an entry for test-only changes, but check the PR template in case the maintainers exempt them), commit with -s, push, and open the PR.

Pitfalls

Scope creep. "While I was in the file I also fixed…" is the number-one reason a Stage 1 PR sits for months. One logical change, one CHANGELOG line. File a follow-up issue for everything else.
Missing CHANGELOG entry. The CHANGELOG verifier will fail your PR. Add the line in the correct category under ## [Unreleased ...], not under a released version heading.
Missing DCO. If you forgot -s, the DCO check fails. Fix the whole branch with git rebase --signoff upstream/main (or git commit --amend -s for a single commit), then force-push.
Force-push etiquette. Before a maintainer has reviewed, force-push freely to keep history clean. After review has started, prefer adding follow-up commits so reviewers can see what changed; squash at the end if asked. OpenSearch typically squash-merges, so you do not need a pristine history — you need a reviewable diff. Never force-push in a way that drops a maintainer's suggested change commit.
Editing user docs in the wrong repo. Most end-user documentation lives in opensearch-project/documentation-website, not the core repo. Confirm which repo owns the text before you open a PR; a doc fix in the wrong repo wastes a review cycle.
master vs cluster manager. If your docs fix touches node-role text, use cluster manager (formerly master) — do not reintroduce master-only wording.
Running the whole suite for a one-line change. ./gradlew check runs everything and takes a long time. For docs use the targeted gates above; for a single test use --tests.

Exit criteria — when you're ready for Stage 2

You can move on when:

You have one merged docs/javadoc PR and one merged test-only PR (a strengthened assertion or a missing test method for an existing class).
You have responded to at least one round of reviewer nits by pushing follow-up commits, without a maintainer having to explain git commit -s or the CHANGELOG to you.
gradle-check no longer makes you anxious: you can read a red run and tell whether the failure is yours or a pre-existing flake (Stage 9 territory).
You can recite the loop from memory: fork → branch → change → CHANGELOG → git commit -s → ./gradlew precommit → push → PR → read CI → respond with follow-up commits.

When to file a follow-up

If, while fixing a Stage 1 issue, you find a bigger problem — say the javadoc was missing because a setting was renamed without @Deprecated on the old key — do not bundle the bigger fix. File a follow-up issue:

While documenting Setting.Property (#NNNN) I noticed `index.foo` was renamed from
`index.bar` without an @Deprecated alias. Filing this to track the deprecation cleanup —
happy to take it as a Stage 2/3 change.

Narrow Stage 1 PR + follow-up issue is exactly what reviewers mean by "keep PRs focused." It is the discipline the entire roadmap depends on. Move on to Stage 2 — Build, Dependency, and Logging.

Stage 2 — Build, Dependency, and Logging

What this stage teaches

Stage 2 moves you from prose into the build system and the logging substrate — the two places where you can make a real, low-risk code change while the blast radius stays small. The skills:

Read and fix Gradle build issues: a precommit gate failing, a misconfigured task, a forbidden-APIs violation, a missing license header, a Spotless formatting break.
Do a dependency version bump correctly: edit the version catalog (gradle/libs.versions.toml), regenerate the dependency report, update the SHA/license metadata, and prove nothing else moved.
Improve a log line: fix a misleading message, correct a wrong level, convert string concatenation to Log4j2 parameterised logging, and pass loggerUsageCheck.

You are still working in surgical PRs, but now they compile, run gates, and occasionally change a number that ripples through CI. The reward for staying in this stage is that you learn to read CI failures fluently, which every later stage needs.

Prerequisite: Stage 1. This stage assumes the fork → DCO → CHANGELOG → precommit → PR loop is muscle memory.

The build gates you will meet

./gradlew precommit is an umbrella over a dozen checks. Know what each one means so you can read the failure:

Gate	Gradle task	What trips it	How you fix it
Formatting	`spotlessJavaCheck`	wrong import order, whitespace, wildcard imports	`./gradlew spotlessApply`
Style	`checkstyleMain` / `checkstyleTest`	line length, naming, missing braces	edit by hand
Forbidden APIs	`forbiddenApisMain`	`System.out`, `new Date()`, default-charset `String.getBytes()`, `Math.random()`	use the approved API
License headers	`licenseHeaders`	missing SPDX header on a new file	paste the SPDX block
Dependency licenses	`dependencyLicenses`	a jar whose `LICENSE`/`NOTICE`/SHA is absent	add files under `*/licenses/`
Third-party audit	`thirdPartyAudit`	a dependency pulls a disallowed transitive class	exclude or jarHell-fix it
Logger usage	`loggerUsageCheck`	`logger.debug("..." + x)` string concat, wrong arg count	use `{}` placeholders

Run the umbrella, but when it fails, re-run just the failing task to iterate fast:

./gradlew precommit
./gradlew forbiddenApisMain -q          # iterate on one gate
./gradlew :server:checkstyleMain -q
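As one concrete instance of "use the approved API" for the forbidden-APIs gate, the sketch below replaces a default-charset String.getBytes() call with the explicit-charset overload. The class name is hypothetical; the point is only that naming the charset removes the dependence on the JVM's platform default.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the usual fix for a default-charset violation: always name the charset.
final class ExplicitCharsetSketch {
    static byte[] encode(String text) {
        // text.getBytes() with no argument would use the platform default charset;
        // passing UTF_8 makes the bytes identical on every machine.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "na\u00efve"; // contains one non-ASCII character
        byte[] bytes = encode(word);
        System.out.println("chars=" + word.length() + " bytes=" + bytes.length);
        System.out.println("round trip ok: " + decode(bytes).equals(word));
    }
}
```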

Finding Stage 2 issues

is:issue is:open label:"good first issue" no:assignee "dependency" in:title,body
is:issue is:open label:"good first issue" no:assignee "log" in:title,body
is:issue is:open label:"help wanted" no:assignee label:"dependencies"
is:issue is:open label:"enhancement" no:assignee "logging" in:title

Dependency bumps are also raised by Dependabot PRs; a good Stage 2 task is to finish a Dependabot bump that needs the license/CHANGELOG plumbing the bot cannot do, or to do the bump Dependabot skipped because the dependency is pinned in the version catalog.

Fallback grep for a logging candidate — find string-concatenated log calls that the logger check would flag if they were added today, and misleading levels:

# String concatenation inside a log call (parameterise these):
grep -rn 'logger\.\(info\|debug\|warn\|error\)("[^"]*"\s*+' server/src/main/java/ | head
# logger.info that looks like it should be debug (fires per-shard / per-request):
grep -rn 'logger.info' server/src/main/java/org/opensearch/index/ | head

Walked example 1 — a dependency version bump

Illustrative of the pattern. The exact catalog key and version vary by branch.

Symptom: an issue asks to bump a low-risk library (say a logging or compression dep) to pick up a CVE fix. OpenSearch centralises versions in a version catalog.

Locate the version

find . -name "libs.versions.toml"
grep -n "log4j\|jackson\|netty\|commons-" gradle/libs.versions.toml

The catalog looks like:

[versions]
log4j        = "2.21.0"
jackson      = "2.17.0"
netty        = "4.1.110.Final"

[libraries]
log4japi     = { group = "org.apache.logging.log4j", name = "log4j-api", version.ref = "log4j" }

Diff

--- a/gradle/libs.versions.toml
+++ b/gradle/libs.versions.toml
@@ [versions]
-log4j = "2.21.0"
+log4j = "2.21.1"

Prove the change is isolated

A version bump's risk is the transitive fan-out. Show the before/after dependency tree:

./gradlew :server:dependencies --configuration runtimeClasspath > /tmp/before.txt
# apply the bump, then:
./gradlew :server:dependencies --configuration runtimeClasspath > /tmp/after.txt
diff /tmp/before.txt /tmp/after.txt

If the diff shows only the lines you expected (the bumped artifact and nothing else), the bump is clean. If a transitive dependency also jumped, say so in the PR — reviewers care about the second-order moves.
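If the textual diff is noisy, the same question ("which coordinates moved?") can be answered programmatically. The sketch below assumes a simplified input of flat group:name:version lines, which is not what :dependencies prints; the real output is a tree and would need flattening first. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: compare two flat "group:name:version" listings and report every coordinate
// whose version moved or that is new. Removed coordinates are not reported here.
final class DependencyDiffSketch {
    static Map<String, String> index(List<String> coordinates) {
        Map<String, String> versions = new LinkedHashMap<>();
        for (String coordinate : coordinates) {
            int split = coordinate.lastIndexOf(':');
            versions.put(coordinate.substring(0, split), coordinate.substring(split + 1));
        }
        return versions;
    }

    static List<String> changed(List<String> before, List<String> after) {
        Map<String, String> old = index(before);
        List<String> moves = new ArrayList<>();
        for (Map.Entry<String, String> entry : index(after).entrySet()) {
            String previous = old.get(entry.getKey());
            if (!entry.getValue().equals(previous)) {
                String from = previous == null ? "(absent)" : previous;
                moves.add(entry.getKey() + ": " + from + " -> " + entry.getValue());
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        List<String> before = List.of(
            "org.apache.logging.log4j:log4j-api:2.21.0",
            "com.example:transitive-lib:1.4.0"); // hypothetical transitive dependency
        List<String> after = List.of(
            "org.apache.logging.log4j:log4j-api:2.21.1",
            "com.example:transitive-lib:1.4.0");
        changed(before, after).forEach(System.out::println);
    }
}
```

A result with exactly one line, the artifact you meant to bump, is the "clean bump" case described above; any extra line is a second-order move to call out in the PR.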

Update license metadata and run the gates

When a jar's coordinates change, the dependencyLicenses gate needs the matching LICENSE.txt/NOTICE.txt and SHA file under the module's licenses/ directory:

find . -path "*licenses*log4j*"
./gradlew :server:dependencyLicenses -q     # tells you exactly which SHA is stale
./gradlew updateShas                         # regenerate the SHA-1 files if the task exists
./gradlew :server:thirdPartyAudit -q
./gradlew :server:precommit

CHANGELOG and PR

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ### Dependencies
+- Bump `org.apache.logging.log4j:log4j-api` from 2.21.0 to 2.21.1 ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))

git add gradle/libs.versions.toml server/licenses/ CHANGELOG.md
git commit -s -m "Bump log4j-api from 2.21.0 to 2.21.1"
git push origin deps/bump-log4j-api
gh pr create --repo opensearch-project/OpenSearch --fill

Walked example 2 — fixing a misleading / wrongly-levelled log line

Illustrative of the pattern. Run the grep to find the real site; class and line vary.

Symptom: a node logs at INFO on every shard relocation:

logger.info("relocating shard " + shardId + " from " + sourceNode + " to " + targetNode);

Two problems: (1) string concatenation instead of Log4j2 placeholders — fails loggerUsageCheck if anyone touches it; (2) it is INFO, so a large rebalancing floods the cluster-manager (formerly master) log with thousands of lines, drowning real events. Operators have asked for it to be DEBUG.

Locate the code

grep -rn "relocating shard" server/src/main/java/org/opensearch/ | head
git log --oneline -n 5 -- <the file the grep printed>
git blame -L <start>,<end> <that file>

git blame tells you who last touched the line; consider mentioning them in the PR if the level choice was deliberate.

Diff

--- a/server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java
+++ b/server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java
@@
-        logger.info("relocating shard " + shardId + " from " + sourceNode + " to " + targetNode);
+        logger.debug("relocating shard [{}] from [{}] to [{}]", shardId, sourceNode, targetNode);

Three improvements in one diff:

Parameterised logging. {} placeholders mean the message string is only built if DEBUG is enabled — no allocation on the hot path when the level is off. This is what loggerUsageCheck enforces and the reason Log4j2 placeholders exist.
Correct level. Per-shard, high-frequency events are DEBUG; cluster-level summaries (e.g. "rerouted N shards") stay INFO. Decide by frequency × operator value.
Bracketed identifiers. OpenSearch convention wraps IDs in [...] so they are greppable in logs and so an empty value is visible as [] rather than vanishing.
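The cost difference between concatenation and placeholders can be made visible with a counter. The sketch below is a hypothetical miniature logger, not Log4j2, and it simplifies placeholder handling; it only demonstrates that arguments are rendered when the level is on, whereas a concatenated message is built before the logger is even called.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical miniature logger (not Log4j2) showing why placeholders beat concatenation:
// arguments are rendered to text only when the level is enabled.
final class LazyLoggingSketch {
    static final AtomicInteger RENDERS = new AtomicInteger();

    // An argument that counts how often it is turned into a string.
    static final class CountingShardId {
        @Override
        public String toString() {
            RENDERS.incrementAndGet();
            return "shard-0";
        }
    }

    static String debug(boolean debugEnabled, String template, Object... args) {
        if (!debugEnabled) {
            return null; // level off: toString() is never called on the arguments
        }
        StringBuilder out = new StringBuilder();
        int from = 0;
        for (Object arg : args) {
            int at = template.indexOf("{}", from);
            if (at < 0) {
                break; // more arguments than placeholders: ignore the extras
            }
            out.append(template, from, at).append(arg);
            from = at + 2;
        }
        return out.append(template.substring(from)).toString();
    }

    // Placeholder style: the argument is handed over unrendered.
    static int rendersWithPlaceholders(boolean debugEnabled) {
        RENDERS.set(0);
        debug(debugEnabled, "relocating shard [{}]", new CountingShardId());
        return RENDERS.get();
    }

    // Concatenation style: the message is built before the level is consulted.
    static int rendersWithConcatenation(boolean debugEnabled) {
        RENDERS.set(0);
        debug(debugEnabled, "relocating shard [" + new CountingShardId() + "]");
        return RENDERS.get();
    }

    public static void main(String[] args) {
        System.out.println("placeholders, DEBUG off:  " + rendersWithPlaceholders(false));
        System.out.println("concatenation, DEBUG off: " + rendersWithConcatenation(false));
        System.out.println(debug(true, "relocating shard [{}] from [{}]", "s0", "n1"));
    }
}
```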

Warning: Do not "fix" a level you are unsure about. If the line is WARN and you think it is benign, the maintainer who wrote it may know it precedes a real failure. Argue the change in the issue first, with a sentence on how often it fires.

Validate

./gradlew :server:compileJava -q
./gradlew loggerUsageCheck -q            # the gate that specifically checks log calls
./gradlew :server:precommit

There is usually no functional test for a log-line change. The reviewer signal is the diff plus, if you want to be thorough, a manual run:

./gradlew run &                          # single-node from source, REST on :9200
# create an index and trigger a shard relocation (this needs a second node, and DEBUG enabled for the logger), then:
grep "relocating shard" logs/opensearch.log | head

Document that grep in the PR description. Then CHANGELOG (Changed), git commit -s, PR.

Pitfalls

Bumping a dependency without the license files. dependencyLicenses / thirdPartyAudit will fail. Always regenerate SHAs and add LICENSE/NOTICE for new coordinates.
A "harmless" bump that drags a transitive major version. Always diff the :dependencies output. A patch bump that pulls a new minor of a transitive jar is a different, riskier PR.
Changing a log message that a test or alert greps for. Some *IT tests and external tooling match on log text. grep -rn "the old message" server/src/test qa/ before you change it.
Lowering a level that gates a deprecation or security message. WARN/ERROR lines are often load-bearing. Leave them unless the issue explicitly asks.
Reformatting the whole file. Run spotlessApply only after your change; do not let it (or your IDE) reflow unrelated lines into your diff. Reviewers reject noise.
Forgetting the version catalog is the single source. Do not pin a version inline in a build.gradle; edit gradle/libs.versions.toml so every module stays consistent.

Exit criteria — when you're ready for Stage 3

You have at least two logging or build/dependency PRs merged with no precommit re-asks from the reviewer.
You can read a red gradle-check and name the failing gate (forbidden-APIs vs spotless vs dependencyLicenses) from the log alone.
You did one dependency bump where you diff-ed the :dependencies output and stated the transitive impact in the PR.
You can explain, in one sentence each, why Log4j2 placeholders exist and when a log line belongs at DEBUG versus INFO.

Stage 2 keeps you in build-and-logging scaffolding. Stage 3 takes the same fluency into the behaviour of the system: the exception messages and validation that users actually see when something goes wrong.

Stage 3 — Error Messages and Diagnostics

What this stage teaches

Stage 3 is the first stage where your change alters what a user sees when something goes wrong. The skill is producing actionable diagnostics: turning a vague IllegalArgumentException: bad value into a message that tells the operator exactly which field, what was wrong, and what would have been valid. The places this lives:

Request validation — *Request.validate() returning an ActionRequestValidationException with one entry per problem.
Parse and argument errors — IllegalArgumentException (and the XContent parse exceptions) thrown from QueryBuilder/AggregationBuilder/Setting parsing, with text that names the offending token.
REST rendering — how an OpenSearchException becomes the JSON error body and HTTP status a client receives, via RestStatus and toXContent.
Operator-facing explanations — the family of "explain" responses (_cluster/allocation/explain, _validate/query, profile) whose entire value is the quality of their human-readable text.

The blast radius is still small — you are changing strings and validation, not control flow — but now the message is part of the contract, and a unit test must pin its exact text so it does not silently regress.

Prerequisite: Stages 1–2. You also want a working mental model of the REST layer: REST → Transport → Action.

Where errors are born and where they surface

flowchart LR
  A[HTTP request] --> B[RestHandler.prepareRequest]
  B -->|bad syntax| E1[IllegalArgumentException / XContentParseException]
  B --> C[ActionRequest.validate]
  C -->|null/empty/conflicting fields| E2[ActionRequestValidationException]
  C --> D[TransportAction.doExecute]
  D -->|runtime failure| E3[OpenSearchException subclass]
  E1 --> R[RestController error path]
  E2 --> R
  E3 --> R
  R --> J[JSON error body + RestStatus]

The reader's whole job in this stage is to improve the text at E1, E2, or E3 and to confirm it renders correctly at R. Three exception families matter:

Exception	Thrown from	HTTP status	Carries field context?
`ActionRequestValidationException`	`*Request.validate()`	400	yes — a list of messages
`IllegalArgumentException`	parsing, setting validation, builder construction	400 (mapped)	only if you put it there
`OpenSearchException` (and subclasses like `ResourceNotFoundException`, `IndexNotFoundException`)	action execution	per subclass (`status()`)	structured, renders to XContent

The mapping from a thrown exception to an HTTP status lives in ExceptionsHelper / OpenSearchException. Find it:

grep -rn "status()" server/src/main/java/org/opensearch/OpenSearchException.java | head
grep -rn "class IndexNotFoundException" server/src/main/java/org/opensearch/
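Before reading the real classes, it can help to hold a miniature of the idea in your head. The sketch below is a hypothetical model of the three rows in the table: exceptions that carry their own status, IllegalArgumentException mapped to 400, and a fallback of 500 that is assumed here for illustration. The class names do not correspond to OpenSearch classes.

```java
// Hypothetical miniature of exception-to-status mapping; it does not reproduce the
// OpenSearch implementation, only the shape described in the table above.
final class StatusMappingSketch {
    // Stand-in for an exception type that knows its own HTTP status.
    static class StatusException extends RuntimeException {
        private final int status;

        StatusException(String message, int status) {
            super(message);
            this.status = status;
        }

        int status() {
            return status;
        }
    }

    // Stand-in for a "not found" subclass that fixes the status to 404.
    static final class NotFoundException extends StatusException {
        NotFoundException(String what) {
            super("no such resource [" + what + "]", 404);
        }
    }

    static int statusOf(Throwable failure) {
        if (failure instanceof StatusException) {
            return ((StatusException) failure).status(); // per-subclass status
        }
        if (failure instanceof IllegalArgumentException) {
            return 400; // bad input from the caller
        }
        return 500; // assumed fallback for anything unmapped
    }

    public static void main(String[] args) {
        System.out.println(statusOf(new NotFoundException("idx")));
        System.out.println(statusOf(new IllegalArgumentException("Unknown key")));
        System.out.println(statusOf(new IllegalStateException("unexpected")));
    }
}
```

The practical consequence for this stage: improving a message never changes the status, and a wrong status is a mapping problem rather than a wording problem.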

Finding Stage 3 issues

is:issue is:open label:bug no:assignee "error message" in:title,body
is:issue is:open label:bug no:assignee "misleading" in:title,body
is:issue is:open label:"good first issue" no:assignee "validation" in:title,body
is:issue is:open label:enhancement no:assignee "exception" in:title

Fallback grep — vague messages are easy to find: short, context-free strings thrown as IllegalArgumentException.

# Bare exceptions with no field/value context:
grep -rn 'throw new IllegalArgumentException("[a-z ]\{1,25\}")' server/src/main/java/ | head
# validate() methods that add a generic message:
grep -rn 'addValidationError("' server/src/main/java/org/opensearch/action/ | head

A message like "invalid value" or "field is required" (no field name, no actual value) is a candidate.

Walked example — a vague parse error becomes a precise one

Illustrative of the pattern. The real site and class come from the grep; do not trust these line numbers.

Symptom: a user posts a malformed aggregation and gets back:

{ "error": { "type": "illegal_argument_exception", "reason": "Unknown key" }, "status": 400 }

"Unknown key" — but which key, in which aggregation, and what keys were valid? The user cannot tell. We will make the message name the offending key, the aggregation, and list the accepted keys.

Locate the throw site

grep -rn '"Unknown key"\|Unknown key for' server/src/main/java/org/opensearch/search/aggregations/ | head
# Common pattern: parsers switch on the current XContent field name:
grep -rn "parseFieldMatcher\|currentFieldName\|token == XContentParser.Token.FIELD_NAME" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/ | head

Open the parser the grep points at. The block typically looks like:

} else {
    throw new IllegalArgumentException("Unknown key");
}

Diff

--- a/server/src/main/java/org/opensearch/search/aggregations/bucket/histogram/DateHistogramAggregationBuilder.java
+++ b/server/src/main/java/org/opensearch/search/aggregations/bucket/histogram/DateHistogramAggregationBuilder.java
@@
-        } else {
-            throw new IllegalArgumentException("Unknown key");
-        }
+        } else {
+            throw new IllegalArgumentException(
+                "Unknown key [" + currentFieldName + "] for ["
+                    + NAME + "] aggregation [" + aggregationName
+                    + "]. Valid keys are " + SUPPORTED_FIELDS + "."
+            );
+        }

The rules for a good diagnostic, in priority order:

Name the thing that was wrong — the offending key/value, in [...] so it is exact even when empty or whitespace.
Locate it — which aggregation/query/field, by name, so the user can find it in a large request body.
Say what would have been right — the valid set, or a hint. Do not dump an enormous list; if the valid set is huge, name the category ("a numeric field type") instead.
Do not leak internals — class names, stack traces, or absolute paths belong in logs, not in the user-facing reason.
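The four rules above can be sketched as a small message builder. The helper below is hypothetical and exists only to show the shape of the text; in the real parser the pieces come from whatever fields the throw site has in scope.

```java
import java.util.List;

// Sketch of a diagnostic that names the offending key, locates it, and lists valid keys.
// The helper and its parameters are hypothetical, not part of OpenSearch.
final class DiagnosticMessageSketch {
    static String unknownKey(String key, String aggregationType, String aggregationName, List<String> validKeys) {
        return "Unknown key [" + key + "] for [" + aggregationType + "] aggregation ["
            + aggregationName + "]. Valid keys are " + validKeys + ".";
    }

    public static void main(String[] args) {
        List<String> valid = List.of("field", "interval"); // illustrative subset only
        System.out.println(unknownKey("bogus_key", "date_histogram", "agg1", valid));
        // An empty key stays visible as [] instead of vanishing from the message.
        System.out.println(unknownKey("", "date_histogram", "agg1", valid));
    }
}
```

Nothing in the message exposes a class name or a path, which keeps it safe to return to a client verbatim.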

Pin the message with a unit test

A diagnostic with no test will regress the next time someone refactors the parser. Assert the exact wording of the fragments that carry the meaning:

public void testUnknownKeyMessageNamesTheKey() throws IOException {   // createParser can throw IOException
    String json = "{ \"date_histogram\": { \"bogus_key\": 5 } }";
    XContentParser parser = createParser(JsonXContent.jsonXContent, json);
    IllegalArgumentException e = expectThrows(
        IllegalArgumentException.class,
        () -> DateHistogramAggregationBuilder.parse("agg1", parser)
    );
    assertThat(e.getMessage(), containsString("Unknown key [bogus_key]"));
    assertThat(e.getMessage(), containsString("aggregation [agg1]"));
}

expectThrows and the Hamcrest containsString matcher are standard in OpenSearchTestCase. Prefer containsString over assertEquals on the whole message so a later, additive improvement to the text does not break the test for no reason.

Verify the REST rendering

Confirm the improved message actually reaches the client with the right status:

./gradlew run &
curl -s -XPOST 'localhost:9200/idx/_search' -H 'content-type: application/json' -d '{
  "aggs": { "agg1": { "date_histogram": { "bogus_key": 5 } } }
}' | jq '.error.reason, .status'

You should see the new reason text and 400. If the status is wrong, the exception's status() mapping (not the message) is the real bug — a different, deeper fix.

Build, CHANGELOG, PR

./gradlew :server:test --tests "*DateHistogramAggregationBuilderTests" -q
./gradlew :server:precommit

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ### Fixed
+- Improve the parse error for unknown keys in `date_histogram` to name the key and aggregation ([#NNNNN](...))

git add server/.../DateHistogramAggregationBuilder.java server/.../DateHistogramAggregationBuilderTests.java CHANGELOG.md
git commit -s -m "Improve unknown-key parse error in date_histogram aggregation"
git push && gh pr create --repo opensearch-project/OpenSearch --fill

A second pattern — request validation (`validate()`)

The *Request.validate() method is where pre-execution checks belong. Each problem is a separate entry, so a user with three mistakes sees all three at once:

@Override
public ActionRequestValidationException validate() {
    ActionRequestValidationException e = null;
    if (indices == null || indices.length == 0) {
        e = addValidationError("at least one index is required", e);
    }
    if (maxDocs < 0) {
        e = addValidationError("max_docs [" + maxDocs + "] must not be negative", e);
    }
    return e;   // null means "valid"
}

When you improve a validate(), the test uses the request object directly — no cluster needed:

public void testNegativeMaxDocsIsRejected() {
    MyRequest req = new MyRequest("idx").maxDocs(-1);
    ActionRequestValidationException e = req.validate();
    assertNotNull(e);
    assertThat(e.validationErrors(), hasItem(containsString("max_docs [-1] must not be negative")));
}

This is the cheapest, most deterministic test in the whole roadmap — no InternalTestCluster, no randomness beyond the seed. It is why validation fixes are a great Stage 3 staple.
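The accumulate-or-null idiom can be tried without any OpenSearch dependency. The sketch below is a simplified stand-in: ValidationErrors and this addValidationError are hypothetical and only mimic the shape of the real classes, which may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the accumulate-or-null pattern; not the real ActionRequestValidationException. */
final class ValidationSketch {
    static final class ValidationErrors {
        private final List<String> errors = new ArrayList<>();
        List<String> errors() { return List.copyOf(errors); }
    }

    /** Creates the holder lazily so that "no errors" stays null. */
    static ValidationErrors addValidationError(String error, ValidationErrors existing) {
        ValidationErrors e = existing == null ? new ValidationErrors() : existing;
        e.errors.add(error);
        return e;
    }

    /** Hypothetical request check: reports every problem, not just the first one. */
    static ValidationErrors validate(String[] indices, int maxDocs) {
        ValidationErrors e = null;
        if (indices == null || indices.length == 0) {
            e = addValidationError("at least one index is required", e);
        }
        if (maxDocs < 0) {
            e = addValidationError("max_docs [" + maxDocs + "] must not be negative", e);
        }
        return e; // null means "valid"
    }

    public static void main(String[] args) {
        System.out.println(validate(new String[0], -1).errors()); // both problems at once
        System.out.println(validate(new String[] { "idx" }, 10) == null);
    }
}
```

Because every check appends rather than returns early, adding a new rule never hides an existing one.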

The "explain"-style responses

Some endpoints exist only to produce good diagnostics; their text is the feature:

_cluster/allocation/explain — why a shard is unassigned, decider by decider. The text comes from each AllocationDecider.canAllocate(...) Decision and its Explanation. Improving one of these straddles Stage 3 and Stage 5.
_validate/query?explain=true — why a query is invalid or how it rewrites.
_search?explain=true and the profile API — scoring/timing breakdowns.

If your issue is "the allocation explain output is confusing," read shard allocation first: you are not changing why a shard is unassigned, only how clearly the system says so.

Pitfalls

Leaking internals into reason. Stack traces, class names, and file paths go to the log, not the user-facing message. Users see error.reason; operators see the log.
Over-specifying the test. assertEquals on the entire message string breaks on any future wording tweak. Assert the load-bearing substrings with containsString.
Changing the HTTP status by accident. Message text and status() are independent. If you only meant to improve wording, do not touch the exception type — that changes the status a client keys off.
Concatenating user input into the message unsafely. Always bracket it ("[" + value + "]") so a null/empty/whitespace value is visible and cannot be mistaken for surrounding prose.
Improving one message and missing its siblings. If the same vague string is thrown in five places, grep them all; fix them in one PR only if they are the same logical message, otherwise file follow-ups.
Forgetting validation runs before execution. A check that needs cluster state belongs in the transport action, not validate() — validate() runs on the calling node with no cluster state available.

Exit criteria — when you're ready for Stage 4

One error-message or validation PR is merged with a unit test asserting the exact diagnostic substrings.
No reviewer had to ask "which field?" or "what was the value?" — your first draft already named both.
You can trace, for any thrown exception, what HTTP status and JSON body the client gets, using OpenSearchException.status() and the REST error path.
You understand why validate() errors are cheap to test and why "explain"-style outputs are diagnostics, not control flow.

You have now followed the request from REST through validation. Stage 4 follows it one layer deeper — into the cluster-manager service, where the cluster's shared state is mutated.

Stage 4 — Cluster State and Coordination Bugs

What this stage teaches

Stage 4 is the first stage that touches the distributed core. The skill is reasoning about ClusterState — the single immutable snapshot of cluster metadata — and the machinery that mutates and applies it:

Cluster state update tasks (ClusterStateUpdateTask, ClusterStateTaskExecutor) that run on the cluster manager (formerly master) and compute a new state from the old one. Getting a no-op wrong here is a classic bug: returning a new but equal state forces an unnecessary publish; mutating in place is a correctness disaster.
Appliers and listeners (ClusterStateApplier, ClusterStateListener) that react to a committed state on every node. A listener that NPEs on a missing index or assumes a shard exists is the other classic bug.
The two services tied together by ClusterService: MasterService (computes states, and is active only on the elected cluster manager) and ClusterApplierService (applies committed states, on every node).

You will write tests against this machinery without a full cluster, using ClusterServiceUtils (a test utility that builds a ClusterService you drive by hand) and OpenSearchSingleNodeTestCase (one real in-JVM node). The bugs are subtle but the surface area per fix stays small.

Prerequisite: Stage 3, plus the cluster-state deep dives: Cluster state and Cluster state publishing. Read those first — this stage summarises, it does not re-teach them.

The model in one diagram

flowchart TD
  subgraph CM[Cluster-manager node]
    UT[ClusterStateUpdateTask.execute<br/>oldState -> newState] --> MS[MasterService<br/>batches tasks, computes ClusterState]
    MS --> PUB[Publish two-phase:<br/>send -> commit]
  end
  PUB --> ALL[Every node]
  subgraph N[Each node]
    ALL --> CAS[ClusterApplierService.applyChangedState]
    CAS --> AP[ClusterStateApplier.applyClusterState]
    CAS --> LI[ClusterStateListener.clusterChanged]
  end

Two invariants you must never break:

ClusterState is immutable. An update task receives currentState and must return either the same instance (a no-op) or a newly built instance via ClusterState.builder(currentState)....build(). Never mutate Metadata, RoutingTable, or any sub-object of the current state.
A no-op returns the identical instance. If your task computes that nothing changed, return currentState; (the same object). MasterService compares the returned reference with the current one, not equals(), to decide whether to publish. Returning a freshly built equal state triggers a needless cluster-wide publish, a real performance bug that shows up as "cluster-manager is busy."
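Both invariants fit in a toy model. The sketch below assumes a hypothetical State snapshot and a hypothetical Publisher that publishes only when a task returns a different reference; neither is an OpenSearch class, and the real machinery does far more.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Toy model of the two invariants; State and Publisher are hypothetical. */
final class NoOpIdentitySketch {
    /** Immutable snapshot: the map is copied on construction and cannot be modified afterwards. */
    static final class State {
        private final Map<String, String> indexSettings;
        State(Map<String, String> indexSettings) { this.indexSettings = Map.copyOf(indexSettings); }
        Map<String, String> indexSettings() { return indexSettings; }
    }

    /** Update task: returns the same instance when nothing changes, a newly built one otherwise. */
    static State putSetting(State current, String index, String value) {
        if (value.equals(current.indexSettings().get(index))) {
            return current; // no-op: identical reference
        }
        Map<String, String> copy = new HashMap<>(current.indexSettings());
        copy.put(index, value);
        return new State(copy);
    }

    /** Counts a publish only when the task returned a different reference. */
    static final class Publisher {
        private State state;
        private int publishes;
        Publisher(State initial) { this.state = initial; }
        void submit(UnaryOperator<State> task) {
            State next = task.apply(state);
            if (next != state) { // reference comparison, as in the invariant above
                state = next;
                publishes++;
            }
        }
        int publishes() { return publishes; }
    }

    public static void main(String[] args) {
        Publisher p = new Publisher(new State(Map.of("foo", "1")));
        p.submit(s -> putSetting(s, "foo", "1")); // no-op, nothing published
        p.submit(s -> putSetting(s, "foo", "2")); // real change, one publish
        System.out.println(p.publishes());
    }
}
```

If putSetting always rebuilt the state, the publish count in this model would be 2 instead of 1, which is the publish storm in miniature.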

Finding Stage 4 issues

is:issue is:open label:bug no:assignee label:"Cluster Manager"
is:issue is:open label:bug no:assignee "cluster state" in:title,body
is:issue is:open label:bug no:assignee "NullPointerException" "cluster" in:body
is:issue is:open label:"help wanted" no:assignee "ClusterStateUpdateTask" in:body

The component label has been spelled Cluster Manager, cluster-manager, and historically master — check the current label list. Coordination-layer bugs may carry distributed framework or a Coordination area label.

Fallback grep — listeners and appliers that dereference an index/shard without a null guard:

# Listeners that look up an index by name and may get null:
grep -rn "metadata().index(" server/src/main/java/org/opensearch/ \
  | grep -i "listener\|applier" | head
# clusterChanged implementations:
grep -rln "implements ClusterStateListener" server/src/main/java/org/opensearch/

Walked example — a listener that NPEs on a missing index

Illustrative of the pattern. The grep finds the real listener; do not trust the line numbers below.

Symptom: an issue reports a NullPointerException from a ClusterStateListener when an index is deleted between the state that scheduled some work and the state the listener runs against. The listener does event.state().metadata().index(name).getSettings() and index(name) returns null because the index is gone.

Locate the listener

grep -rn "implements ClusterStateListener" server/src/main/java/org/opensearch/ | head
# Then in the suspect file, find the unguarded lookup:
grep -n "metadata().index(" server/src/main/java/org/opensearch/<path>/SomeService.java
git log --oneline -n 5 -- server/src/main/java/org/opensearch/<path>/SomeService.java
git blame -L <start>,<end> server/src/main/java/org/opensearch/<path>/SomeService.java

The offending code:

@Override
public void clusterChanged(ClusterChangedEvent event) {
    for (String name : trackedIndices) {
        IndexMetadata meta = event.state().metadata().index(name);   // may be null after delete
        Settings s = meta.getSettings();                              // NPE here
        // ...
    }
}

Diff

--- a/server/src/main/java/org/opensearch/<path>/SomeService.java
+++ b/server/src/main/java/org/opensearch/<path>/SomeService.java
@@
     public void clusterChanged(ClusterChangedEvent event) {
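-        for (String name : trackedIndices) {
-            IndexMetadata meta = event.state().metadata().index(name);
-            Settings s = meta.getSettings();
+        for (Iterator<String> it = trackedIndices.iterator(); it.hasNext(); ) {
+            String name = it.next();
+            IndexMetadata meta = event.state().metadata().index(name);
+            if (meta == null) {
+                // Index was deleted between scheduling and this listener run; stop tracking it.
+                // Remove via the iterator: trackedIndices.remove(name) inside a for-each
+                // throws ConcurrentModificationException on a non-concurrent collection.
+                it.remove();
+                continue;
+            }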
+            Settings s = meta.getSettings();
             // ...
         }
     }

Three things to notice:

The fix is defensive, not clever. A new state can be committed between the moment work was scheduled and the moment the listener runs, so a listener must treat every lookup into the new state as possibly absent. The index you saw last round can be gone this round.
Clean up your own tracking. Removing the stale name prevents the NPE and a slow leak of dead index names in trackedIndices.
ClusterChangedEvent gives you both states. If you need to know what changed, use event.indicesDeleted(), event.indicesCreated(), event.metadataChanged() rather than diffing by hand.
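The idea of asking the event what changed, rather than probing each tracked name, can be modelled in a few lines. ChangeEventSketch below is a hypothetical stand-in that carries only two sets of index names; the real ClusterChangedEvent carries full states and its helper return types differ.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for a change event that carries both states; all names here are hypothetical. */
final class ChangeEventSketch {
    private final Set<String> previousIndices;
    private final Set<String> currentIndices;

    ChangeEventSketch(Set<String> previousIndices, Set<String> currentIndices) {
        this.previousIndices = Set.copyOf(previousIndices);
        this.currentIndices = Set.copyOf(currentIndices);
    }

    /** Names present in the previous state but absent from the current one. */
    Set<String> indicesDeleted() {
        Set<String> deleted = new TreeSet<>(previousIndices);
        deleted.removeAll(currentIndices);
        return deleted;
    }

    /** Names present in the current state but absent from the previous one. */
    Set<String> indicesCreated() {
        Set<String> created = new TreeSet<>(currentIndices);
        created.removeAll(previousIndices);
        return created;
    }

    public static void main(String[] args) {
        ChangeEventSketch event = new ChangeEventSketch(Set.of("foo", "bar"), Set.of("bar", "baz"));
        Set<String> tracked = new TreeSet<>(Set.of("foo", "bar"));
        tracked.removeAll(event.indicesDeleted()); // drop stale names in one step
        System.out.println(tracked);               // prints [bar]
    }
}
```

Pruning tracked names from the deleted set up front means the per-index loop never sees a name that is already gone.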

Test with `ClusterServiceUtils` — no real cluster

You can drive cluster-state changes by hand using a test ClusterService:

public void testListenerSurvivesIndexDeletion() {
    ThreadPool threadPool = new TestThreadPool(getTestName());
    try {
        ClusterService clusterService = ClusterServiceUtils.createClusterService(threadPool);
        SomeService service = new SomeService(clusterService /*, deps */);

        // State A: index "foo" exists and is tracked.
        ClusterState withFoo = ClusterState.builder(clusterService.state())
            .metadata(Metadata.builder().put(
                IndexMetadata.builder("foo")
                    .settings(settings(Version.CURRENT))
                    .numberOfShards(1).numberOfReplicas(0)))
            .build();
        ClusterServiceUtils.setState(clusterService, withFoo);
        service.startTracking("foo");

        // State B: "foo" is deleted. The applier must not throw.
        ClusterState withoutFoo = ClusterState.builder(withFoo)
            .metadata(Metadata.builder(withFoo.metadata()).remove("foo"))
            .build();
        ClusterServiceUtils.setState(clusterService, withoutFoo);   // fires clusterChanged

        // No NPE, and the name is no longer tracked.
        assertThat(service.trackedIndices(), not(hasItem("foo")));
    } finally {
        // terminate the thread pool
        terminate(threadPool);
    }
}

ClusterServiceUtils.setState(...) publishes the new state to the listeners synchronously, so the assertion runs after clusterChanged. This is the standard way to unit-test cluster-state reactions without spinning up InternalTestCluster.

For a test that needs a real index lifecycle (create, then delete, then assert the service recovered), step up to OpenSearchSingleNodeTestCase:

public class SomeServiceIT extends OpenSearchSingleNodeTestCase {
    public void testTrackingSurvivesDeleteOnRealNode() throws Exception {
        createIndex("foo");
        // ... trigger tracking ...
        assertAcked(client().admin().indices().prepareDelete("foo").get());
        assertBusy(() -> assertThat(serviceUnderTest().trackedIndices(), not(hasItem("foo"))));
    }
}

assertBusy (not Thread.sleep) waits for the asynchronous applier to run — a habit Stage 9 drills hard.
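To see why polling beats a fixed sleep, here is a simplified helper in the spirit of assertBusy. It is a sketch only: awaitAssertion is a hypothetical name, and the back-off values are assumptions rather than what OpenSearchTestCase does.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified polling helper in the spirit of assertBusy; not the OpenSearchTestCase implementation. */
final class BusyWaitSketch {
    /** Re-runs the check with a growing sleep until it passes or the deadline expires. */
    static void awaitAssertion(Runnable check, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // the assertion held
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // surface the last real failure, not a generic timeout
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // back off, capped at 100 ms
        }
    }

    public static void main(String[] args) {
        AtomicInteger applied = new AtomicInteger();
        Thread worker = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { return; }
            applied.set(1); // the "asynchronous" effect the test waits for
        });
        worker.start();
        awaitAssertion(() -> {
            if (applied.get() != 1) throw new AssertionError("not applied yet");
        }, 5, TimeUnit.SECONDS);
        System.out.println("applied");
    }
}
```

The test finishes as soon as the condition holds, and a genuine failure reports the last assertion message instead of passing or failing by timing luck.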

Build and PR

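./gradlew :server:test --tests "*SomeServiceTests" -q
# *IT classes normally run under the internalClusterTest task rather than test:
./gradlew :server:internalClusterTest --tests "*SomeServiceIT" -q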
./gradlew :server:precommit

CHANGELOG under ### Fixed, git commit -s, push, open the PR. Mention in the description which two states trigger the race (delete-between-publish) so the reviewer can reason about it without re-deriving it.

The no-op variant of the bug

The other half of this stage is the update task that does too much. A task that should be a no-op but rebuilds the state anyway floods publishing. The fix:

--- a/server/src/main/java/org/opensearch/<path>/SomeUpdateTask.java
+++ b/server/src/main/java/org/opensearch/<path>/SomeUpdateTask.java
@@
     public ClusterState execute(ClusterState currentState) {
-        Metadata.Builder md = Metadata.builder(currentState.metadata());
-        md.put(updatedIndexMetadata, true);
-        return ClusterState.builder(currentState).metadata(md).build();
+        IndexMetadata existing = currentState.metadata().index(indexName);
+        if (existing != null && existing.equals(updatedIndexMetadata)) {
+            return currentState;   // no change — same instance, no publish
+        }
+        Metadata.Builder md = Metadata.builder(currentState.metadata());
+        md.put(updatedIndexMetadata, true);
+        return ClusterState.builder(currentState).metadata(md).build();
     }

Test it by asserting instance identity:

public void testNoOpReturnsSameState() {
    ClusterState before = /* state already containing updatedIndexMetadata */;
    ClusterState after = new SomeUpdateTask(updatedIndexMetadata).execute(before);
    assertSame(before, after);   // not assertEquals — must be the *same* object
}

assertSame is the whole point: MasterService keys publishing off identity, so a no-op must return the same reference.

Pitfalls

Mutating the current state. Any currentState.metadata().getIndices().put(...) is a bug even if it "works" in a test — other components hold the same immutable reference. Always go through a Builder.
Returning a fresh-but-equal state for a no-op. Use assertSame in the test to catch this. It is the most common publish-storm bug.
Assuming an index/shard from the previous state still exists. Every lookup into the applied state can be null. Guard it; consume ClusterChangedEvent.indicesDeleted().
Doing slow work on the applier thread. applyClusterState runs on the cluster applier thread; blocking it stalls the whole node's view of the cluster. Hand heavy work to a thread pool — see threadpools & concurrency.
Testing a race with Thread.sleep. It is flaky by construction. Use assertBusy with a tight assertion, or drive state synchronously with ClusterServiceUtils.setState.
Forgetting the cluster-manager-only context. Update tasks run only on the elected cluster manager. Logic that must run on every node belongs in an applier/listener, not a task. Confusing the two is a design bug a reviewer will flag immediately.

Exit criteria — when you're ready for Stage 5

One cluster-state fix is merged with a ClusterServiceUtils-based unit test (and, if the behaviour needs a real index, an OpenSearchSingleNodeTestCase).
You can state the immutability and no-op-identity invariants without looking them up, and you used assertSame to defend the no-op.
You can read a ClusterChangedEvent and use its indicesCreated/Deleted/metadataChanged helpers instead of diffing two states by hand.
You know which logic belongs in an update task (cluster manager) versus an applier (every node), and why.

You now understand how state is mutated and applied. Stage 5 reads one specific part of that state — the RoutingTable — and the deciders that compute it.

Stage 5 — Shard Allocation Issues

What this stage teaches

Stage 5 drills the subsystem that decides where every shard lives: the allocation engine. The skill is reasoning about a chain of yes/no/throttle decisions over an immutable routing snapshot, and reproducing a placement bug in a fast unit test instead of a flaky cluster.

AllocationDeciders — the ordered chain of AllocationDecider implementations (SameShardAllocationDecider, DiskThresholdDecider, AwarenessAllocationDecider, FilterAllocationDecider, MaxRetryAllocationDecider, ThrottlingAllocationDecider, …). Each returns a Decision (YES / NO / THROTTLE) with a human-readable explanation.
AllocationService — orchestrates a reroute: applies deciders, moves shards, updates the RoutingTable, and produces the new ClusterState.
BalancedShardsAllocator — the default balancer that picks which eligible node a shard goes to, optimising a weight function across nodes.
RoutingAllocation / RoutingNodes — the mutable working copy the allocator operates on during a single reroute, derived from the immutable ClusterState.

The classic Stage 5 bug is a decider returning the wrong Decision for an edge case — a YES where it should THROTTLE, or a NO whose explanation misleads the operator. You reproduce it with a hand-built RoutingAllocation and confirm it through _cluster/allocation/explain.

Prerequisite: Stage 4 (you must be fluent in ClusterState and RoutingTable) plus the shard allocation deep dive.

The decision chain

flowchart LR
  R[reroute trigger] --> AS[AllocationService.reroute]
  AS --> RA[build RoutingAllocation<br/>RoutingNodes mutable copy]
  RA --> D{AllocationDeciders.canAllocate}
  D -->|each decider| d1[SameShard]
  D --> d2[DiskThreshold]
  D --> d3[Awareness]
  D --> d4[Filter]
  D --> d5[MaxRetry]
  D --> d6[Throttling]
  D -->|combined Decision| B[BalancedShardsAllocator picks node]
  B --> NRT[new RoutingTable -> new ClusterState]

Decision.Multi combines the chain: the most restrictive wins (NO beats THROTTLE beats YES), but every decider's explanation is retained so _cluster/allocation/explain can show the full reasoning. That retention is why allocation explanations are so valuable — and why a wrong explanation is itself a bug worth fixing.
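The combining rule is easy to check in isolation. The sketch below is a toy model that borrows the names Type, Single, and Multi from the prose; it is not the real Decision class hierarchy, and the decider names in main are only examples.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of combining decider results; it echoes the prose, not the real classes. */
final class DecisionChainSketch {
    // Declared from least to most restrictive so ordinal() can rank them.
    enum Type { YES, THROTTLE, NO }

    static final class Single {
        final Type type;
        final String decider;
        final String explanation;
        Single(Type type, String decider, String explanation) {
            this.type = type;
            this.decider = decider;
            this.explanation = explanation;
        }
    }

    static final class Multi {
        private final List<Single> decisions = new ArrayList<>();

        Multi add(Single decision) {
            decisions.add(decision);
            return this;
        }

        /** Most restrictive wins: NO beats THROTTLE beats YES. An empty chain is YES. */
        Type type() {
            Type result = Type.YES;
            for (Single d : decisions) {
                if (d.type.ordinal() > result.ordinal()) {
                    result = d.type;
                }
            }
            return result;
        }

        /** Every explanation is kept, including those from deciders that said YES. */
        List<Single> decisions() { return List.copyOf(decisions); }
    }

    public static void main(String[] args) {
        Multi chain = new Multi()
            .add(new Single(Type.YES, "same_shard", "no copy of this shard on the node"))
            .add(new Single(Type.THROTTLE, "throttling", "too many concurrent recoveries"))
            .add(new Single(Type.NO, "disk_threshold", "node is above the low watermark"));
        System.out.println(chain.type()); // prints NO
        for (Single d : chain.decisions()) {
            System.out.println(d.decider + ": " + d.type + " (" + d.explanation + ")");
        }
    }
}
```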

Finding Stage 5 issues

is:issue is:open label:bug no:assignee "allocation" in:title,body
is:issue is:open label:bug no:assignee "unassigned" in:title,body
is:issue is:open label:bug no:assignee "rebalance" in:title,body
is:issue is:open label:"help wanted" no:assignee "AllocationDecider" in:body

Area labels may read Cluster Manager, distributed framework, or an Allocation-flavoured component label — check the current set.

Fallback grep — find the decider whose behaviour the issue describes:

ls server/src/main/java/org/opensearch/cluster/routing/allocation/decider/
grep -rln "extends AllocationDecider" server/src/main/java/org/opensearch/cluster/routing/allocation/decider/
# The disk decider, a frequent source of edge-case bugs:
grep -n "canAllocate\|canRemain\|Decision" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/decider/DiskThresholdDecider.java | head

Walked example — a decider returning the wrong `Decision`

Illustrative of the pattern. The exact branch and class come from the grep; the disk decider stands in for "some decider whose threshold logic has an off-by-edge bug."

Symptom: an issue reports that when free disk is exactly at the low watermark, the DiskThresholdDecider returns YES (allowing a new shard) when it should return NO: an off-by-boundary in a >= that should be >. The operator only finds out when the node fills past the watermark and shards refuse to relocate.

Locate the decision

grep -n "freeBytesThresholdLow\|freeDiskThresholdLow\|canAllocate" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/decider/DiskThresholdDecider.java
git log --oneline -n 5 -- server/src/main/java/org/opensearch/cluster/routing/allocation/decider/DiskThresholdDecider.java
git blame -L <start>,<end> .../DiskThresholdDecider.java

The suspect branch (schematic):

if (freeBytes >= freeBytesThresholdLow.getBytes()) {
    return allocation.decision(Decision.YES, NAME, "enough disk on node [%s]", node.nodeId());
}
return allocation.decision(Decision.NO, NAME,
    "the node is above the low watermark ...");

At freeBytes == threshold the >= is true, so the branch returns YES where the symptom calls for NO. Trace the whole method before patching, though: there is often a second comparison elsewhere (e.g. the canRemain path) that treats the boundary differently. The lesson: read both canAllocate and canRemain; they must agree at the boundary.

Diff

--- a/server/src/main/java/org/opensearch/cluster/routing/allocation/decider/DiskThresholdDecider.java
+++ b/server/src/main/java/org/opensearch/cluster/routing/allocation/decider/DiskThresholdDecider.java
@@
-        if (freeBytes >= freeBytesThresholdLow.getBytes()) {
+        if (freeBytes > freeBytesThresholdLow.getBytes()) {
             return allocation.decision(Decision.YES, NAME,
                 "enough disk for the shard on node [%s], free: [%s], threshold: [%s]",
                 node.nodeId(), new ByteSizeValue(freeBytes), freeBytesThresholdLow);
         }
         return allocation.decision(Decision.NO, NAME,
             "the node [%s] is above the low watermark cluster setting [%s], free: [%s], threshold: [%s]",
             node.nodeId(), CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(),
             new ByteSizeValue(freeBytes), freeBytesThresholdLow);

Notice the diff also enriches the explanation (free bytes, threshold, setting key). A decider fix almost always improves its Decision text in the same PR — that is the Stage 3 diagnostic skill applied to allocation, and reviewers expect it.

Reproduce in a `RoutingAllocation` unit test

The whole point of this stage: you do not need a cluster. Build a RoutingAllocation with the deciders and a synthetic ClusterInfo (disk usage), then assert the decision.

public void testLowWatermarkBoundaryReturnsNo() {
    // Disk info: node has exactly the low-watermark amount free.
    ClusterInfo clusterInfo = new DevNullClusterInfo(
        /* leastUsages */ Map.of("node1", new DiskUsage("node1", "node1", "/data",
            /*total*/ 100, /*free*/ 15)),   // 15% free == low watermark of 85% used
        /* mostUsages */ Map.of("node1", new DiskUsage("node1", "node1", "/data", 100, 15)),
        /* shardSizes */ Map.of());

    Settings settings = Settings.builder()
        .put(CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING.getKey(), "85%")
        .build();

    DiskThresholdDecider decider = new DiskThresholdDecider(settings,
        new ClusterSettings(settings, ClusterSettings.BUILT_IN_CLUSTER_SETTINGS));

    ClusterState state = /* one index, one unassigned shard, one node — built with helpers */;
    RoutingAllocation allocation = new RoutingAllocation(
        new AllocationDeciders(Collections.singleton(decider)),
        state.getRoutingNodes(), state, clusterInfo, null, System.nanoTime());
    allocation.debugDecision(true);   // capture explanations for the assertion

    ShardRouting shard = /* the unassigned primary */;
    RoutingNode node = allocation.routingNodes().node("node1");

    Decision decision = decider.canAllocate(shard, node, allocation);
    assertThat(decision.type(), equalTo(Decision.Type.NO));
    assertThat(((Decision.Single) decision).getExplanation(),
        containsString("above the low watermark"));
}

The OpenSearch test suite ships helpers (OpenSearchAllocationTestCase, DiskThresholdDeciderTests, ClusterInfo fakes like DevNullClusterInfo) that build these fixtures — extend the existing test class; do not hand-roll the routing table from scratch.

grep -rln "extends OpenSearchAllocationTestCase" server/src/test/java/org/opensearch/cluster/routing/
./gradlew :server:test --tests "*DiskThresholdDeciderTests" -q

Confirm through `_cluster/allocation/explain`

Prove the operator-facing behaviour on a running node:

./gradlew run &
# Fill a node to the watermark (or set a tiny watermark), create an index, then:
curl -s -XGET 'localhost:9200/_cluster/allocation/explain' -H 'content-type: application/json' -d '{
  "index": "idx", "shard": 0, "primary": true
}' | jq '.can_allocate, (.node_allocation_decisions[].deciders[] | select(.decider=="disk_threshold"))'

You should see the disk_threshold decider report NO at the boundary with your improved explanation text. Capture this in the PR.

Build and PR

./gradlew :server:test --tests "*DiskThreshold*" -q
./gradlew :server:precommit

CHANGELOG ### Fixed, git commit -s, push, PR. In the description, state the exact boundary condition (free == threshold) and confirm both canAllocate and canRemain agree — reviewers in this area always ask about the symmetric path.

Reading the balancer

If the issue is not "wrong decision" but "shards land on the wrong node" or "cluster never balances," the bug is in BalancedShardsAllocator, not a decider. The balancer minimises a weight function over (shard count, index spread, disk). Start here:

grep -n "weight\|balance\|WeightFunction\|allocateUnassigned\|balanceByWeights" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java | head

Balancer bugs are harder and frequently cross into Stage 10 (the balancer is hot during large rebalances). Start with a decider fix or an explanation fix; the weight function is deep water.

Pitfalls

Fixing canAllocate but not canRemain. The two answer "can a shard start here?" and "can a shard stay here?". A boundary fix to one without the other produces a flapping shard: allowed to allocate, then immediately moved off. Always reconcile both.
Forgetting THROTTLE. Allocation has three answers, not two. A decider that should delay (recovery in progress, too many concurrent moves) must return THROTTLE, not NO — NO makes the shard look permanently unassignable in the explain output.
Building the routing table by hand. Use OpenSearchAllocationTestCase and its builders. A hand-built RoutingNodes that is subtly inconsistent will pass your test and hide the bug.
Not enabling debugDecision(true). Without it, RoutingAllocation does not retain explanations and your containsString assertion on the text will fail mysteriously.
Testing only the happy path. Allocation bugs live at boundaries: exactly-at-watermark, zero replicas, single node, all-nodes-excluded. Parameterise the test across them.
Changing a decider's order or default. The decider chain order and default watermarks are behaviour the whole ecosystem depends on; changing them is a BWC-sensitive decision (Stage 11), not a bug fix.
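Two of the pitfalls above, reconciling canAllocate with canRemain and testing at the boundaries, can be combined into one table-driven check. The sketch below is a toy: it assumes both paths compare free bytes against a single threshold and that exactly-at-threshold is rejected, which is a simplification of what the real DiskThresholdDecider does.

```java
/** Toy watermark checks used to show boundary-parameterised testing; not DiskThresholdDecider. */
final class WatermarkBoundarySketch {
    /** Assumed rule: a new shard may start only when free space is strictly above the threshold. */
    static boolean canAllocate(long freeBytes, long thresholdBytes) {
        return freeBytes > thresholdBytes;
    }

    /** Same rule for an existing shard, so the two paths agree at the boundary. */
    static boolean canRemain(long freeBytes, long thresholdBytes) {
        return freeBytes > thresholdBytes;
    }

    /** A shard that may allocate but may not remain would be moved off immediately. */
    static boolean wouldFlap(long freeBytes, long thresholdBytes) {
        return canAllocate(freeBytes, thresholdBytes) && !canRemain(freeBytes, thresholdBytes);
    }

    public static void main(String[] args) {
        long threshold = 15;
        // Boundary cases: empty disk, just below, exactly at, just above, effectively unlimited.
        long[] cases = { 0, threshold - 1, threshold, threshold + 1, Long.MAX_VALUE };
        for (long free : cases) {
            if (wouldFlap(free, threshold)) {
                throw new AssertionError("flapping at free=" + free);
            }
            System.out.println("free=" + free
                + " canAllocate=" + canAllocate(free, threshold)
                + " canRemain=" + canRemain(free, threshold));
        }
    }
}
```

Changing only one of the two comparisons to >= makes the symmetric test fail at the exactly-at case, which is the regression the reviewer will ask about.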

Exit criteria — when you're ready for Stage 6

One allocation fix is merged, reproduced with a RoutingAllocation/OpenSearchAllocationTestCase unit test, and confirmed via _cluster/allocation/explain.
Your fix reconciled both canAllocate and canRemain at the boundary.
You improved the Decision explanation as part of the same PR, naming the setting and the actual values.
You can explain how Decision.Multi combines the chain and why THROTTLE is distinct from NO.

Allocation places shards; the next stage works inside a single shard, where data is actually written and read. Stage 6 goes into IndexShard and the engine.

Stage 6 — Indexing and Engine Issues

What this stage teaches

Stage 6 goes inside a single shard, to the layer where documents are actually written, versioned, made durable, and made visible. This is the correctness-critical core: a bug here can lose data or return stale results. The skill is reasoning about the engine's concurrency and durability contracts and pinning them with InternalEngineTests-style unit tests.

IndexShard — wraps the engine; entry points applyIndexOperationOnPrimary, applyIndexOperationOnReplica, applyDeleteOperationOnPrimary. Holds the versioning and sequence-number bookkeeping for the shard.
InternalEngine — wraps Lucene's IndexWriter/DirectoryReader. Implements index(...), delete(...), refresh(...), flush(...), version/seqno conflict resolution, and the LiveVersionMap.
Translog — the write-ahead log providing durability between Lucene commits; Translog.add, sync, generations, and the durability mode (REQUEST vs ASYNC).
Sequence numbers — every operation gets a seqNo from LocalCheckpointTracker; the global checkpoint (ReplicationTracker) marks what is durably replicated. These drive replication, recovery, and dedup.

The two classic Stage 6 bugs: a versioning / seqno edge case (a concurrent update resolved wrong, a stale operation not skipped) and a translog durability bug (an operation acknowledged before it was synced, or a generation rolled at the wrong time).
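The relationship between sequence numbers and a processed checkpoint is easier to reason about with a toy tracker in hand. The class below is a simplification written for this page, with hypothetical method names; it is not LocalCheckpointTracker and ignores persistence and concurrency.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy processed-checkpoint tracker; a simplification, not LocalCheckpointTracker itself. */
final class CheckpointSketch {
    private long nextSeqNo = 0;
    private long processedCheckpoint = -1; // -1 means "nothing processed yet"
    private final Set<Long> processedAboveCheckpoint = new HashSet<>();

    long generateSeqNo() {
        return nextSeqNo++;
    }

    /** Marks a seqNo processed; returns false if it was already covered (a duplicate). */
    boolean markProcessed(long seqNo) {
        if (seqNo <= processedCheckpoint || !processedAboveCheckpoint.add(seqNo)) {
            return false;
        }
        // Advance over every contiguous processed seqNo; a gap stops the checkpoint.
        while (processedAboveCheckpoint.remove(processedCheckpoint + 1)) {
            processedCheckpoint++;
        }
        return true;
    }

    long processedCheckpoint() {
        return processedCheckpoint;
    }

    public static void main(String[] args) {
        CheckpointSketch tracker = new CheckpointSketch();
        long a = tracker.generateSeqNo(); // 0
        long b = tracker.generateSeqNo(); // 1
        long c = tracker.generateSeqNo(); // 2
        tracker.markProcessed(a);
        tracker.markProcessed(c);                          // out of order: gap at 1
        System.out.println(tracker.processedCheckpoint()); // prints 0
        tracker.markProcessed(b);                          // gap filled
        System.out.println(tracker.processedCheckpoint()); // prints 2
        System.out.println(tracker.markProcessed(b));      // prints false (duplicate)
    }
}
```

The duplicate case is the one that matters for this stage: an operation at or below the checkpoint has already been applied and must not be applied again.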

Prerequisite: Stage 4 plus the engine deep dives: Engine internals, Translog, and Refresh/flush/merge. This stage assumes you know what a refresh, flush, and merge are.

The write path in one diagram

flowchart TD
  B[TransportShardBulkAction] --> P[IndexShard.applyIndexOperationOnPrimary]
  P --> V[resolve version + assign seqNo<br/>LocalCheckpointTracker]
  V --> E[InternalEngine.index]
  E --> LVM[LiveVersionMap update]
  E --> IW[Lucene IndexWriter.addDocument / updateDocument]
  E --> TL[Translog.add]
  P --> R[replicate to replicas<br/>applyIndexOperationOnReplica]
  TL --> SY{durability == REQUEST?}
  SY -->|yes| FS[Translog.sync before ack]
  SY -->|no| ASYNC[async sync on interval]

Two contracts you must not break:

Durability before acknowledgement. With index.translog.durability: request (the default), the operation's translog entry is fsync-ed before the client is told it succeeded. A bug that acks first and syncs later is silent data loss on crash.
Versioning monotonicity / seqno correctness. An operation with a stale version or an already-seen seqNo must be skipped or rejected deterministically, the same way on primary and replica, so the two never diverge. Get this wrong and a replica's documents differ from the primary's.
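The first contract, sync before acknowledge, can be demonstrated with an in-memory toy log. Everything below is hypothetical (ToyLog, indexAndAck, the simulated crash); it models only the ordering, not the real Translog or its fsync behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory toy write-ahead log; it models the ordering contract only, not the real Translog. */
final class DurabilitySketch {
    enum Durability { REQUEST, ASYNC }

    static final class ToyLog {
        private final List<String> buffered = new ArrayList<>(); // written, not yet durable
        private final List<String> durable = new ArrayList<>();  // survives a crash

        void add(String op) {
            buffered.add(op);
        }

        void sync() {
            durable.addAll(buffered);
            buffered.clear();
        }

        /** Simulates a crash: anything not synced is lost. */
        List<String> recoverAfterCrash() {
            buffered.clear();
            return List.copyOf(durable);
        }
    }

    /** Appends, syncs first when durability is REQUEST, and only then acknowledges. */
    static boolean indexAndAck(ToyLog log, String op, Durability durability) {
        log.add(op);
        if (durability == Durability.REQUEST) {
            log.sync(); // must happen before the acknowledgement below
        }
        return true; // the acknowledgement sent to the client
    }

    public static void main(String[] args) {
        ToyLog requestLog = new ToyLog();
        indexAndAck(requestLog, "index doc-1", Durability.REQUEST);
        System.out.println(requestLog.recoverAfterCrash()); // the acked operation survives

        ToyLog asyncLog = new ToyLog();
        indexAndAck(asyncLog, "index doc-1", Durability.ASYNC);
        System.out.println(asyncLog.recoverAfterCrash());   // acked but lost: the accepted trade-off
    }
}
```

A bug that moved the sync after the return in the REQUEST branch would make the first case behave like the second, which is exactly the silent data loss described above.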

Finding Stage 6 issues

is:issue is:open label:bug no:assignee "translog" in:title,body
is:issue is:open label:bug no:assignee "version_conflict" in:title,body
is:issue is:open label:bug no:assignee "seq_no" in:title,body
is:issue is:open label:bug no:assignee label:"Storage:Durability"
is:issue is:open label:bug no:assignee label:"Indexing"

Area labels in this space include Storage:Durability, Storage:Performance, Indexing, and historically an Engine-flavoured label. Check the current list.

Fallback grep — the engine's edge-case logic clusters around version/seqno resolution:

grep -n "resolveDocVersion\|VersionConflictEngineException\|assertSameSeqNoOnReplica\|maxSeqNoOfUpdatesOrDeletes" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head
grep -n "durability\|sync\|rollGeneration\|trimUnreferenced" \
  server/src/main/java/org/opensearch/index/translog/Translog.java | head

Walked example — a versioning / seqno edge case

Illustrative of the pattern. The grep finds the real plan-resolution method; do not trust the line numbers.

Symptom: an issue reports that under concurrent updates to the same doc id, a stale operation (lower version, already-superseded) is occasionally applied on a replica instead of being skipped, so the replica ends up one version behind the primary. The bug is in how the replica's index plan decides "is this operation stale?".

Locate the plan resolution

grep -n "planIndexingAsNonPrimary\|planIndexingAsPrimary\|IndexingStrategy\|isOptimizedFromDoc\|currentVersion" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head
git log --oneline -n 5 -- server/src/main/java/org/opensearch/index/engine/InternalEngine.java

The schematic of a staleness check:

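// In planIndexingAsNonPrimary (replica/recovery path). Names follow recent versions of
// InternalEngine; confirm them with the grep above before relying on them.
// On a replica, staleness is decided by seqNo (and primary term), not by version type.
final OpVsLuceneDocStatus opVsLucene = compareOpToLuceneDocBasedOnSeqNo(index);
if (opVsLucene == OpVsLuceneDocStatus.OP_STALE_OR_EQUAL) {
    // stale: must not replace the live document, but its seqNo is still recorded as processed
    plan = IndexingStrategy.processAsStaleOp(...);
} else {
    plan = IndexingStrategy.processNormally(...);
}

The bug is typically a boundary in that comparison (or, on the primary path, in isVersionConflictForWrites), or a missing check on the maxSeqNoOfUpdatesOrDeletes optimisation: an operation that looks fresh by version but is stale by seqNo, or vice versa.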
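
To make the staleness decision concrete, here is a minimal sketch of a replica that keeps the newest operation per id, assuming staleness is decided purely by sequence number. ReplicaModel, Plan, and apply are hypothetical names for illustration; the real engine also tracks primary terms, deletes, and the version map.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy replica: keeps the newest op per id, decided by seqNo. Not OpenSearch code. */
class ReplicaModel {
    enum Plan { INDEXED, SKIPPED_AS_STALE }

    private final Map<String, Long> seqNoById = new HashMap<>();
    private final Map<String, Long> versionById = new HashMap<>();

    /** Applies an operation unless an op with an equal or higher seqNo was already applied to this id. */
    Plan apply(String id, long seqNo, long version) {
        Long current = seqNoById.get(id);
        if (current != null && seqNo <= current) {
            return Plan.SKIPPED_AS_STALE; // "<=" rather than "<": a replayed duplicate is stale too
        }
        seqNoById.put(id, seqNo);
        versionById.put(id, version);
        return Plan.INDEXED;
    }

    /** Version currently visible for the id; the id must have been indexed at least once. */
    long visibleVersion(String id) {
        return versionById.get(id);
    }
}
```

The property worth testing is convergence: applying (seqNo=1, v=1) then (seqNo=2, v=2), or the same two operations in the opposite order, must leave version 2 visible. An off-by-one in the comparison (for example "<" where "<=" is needed) is the kind of boundary the walked example is about.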

Diff

--- a/server/src/main/java/org/opensearch/index/engine/InternalEngine.java
+++ b/server/src/main/java/org/opensearch/index/engine/InternalEngine.java
@@  IndexingStrategy planIndexingAsNonPrimary(Index index) {
-        if (index.seqNo() <= localCheckpointTracker.getProcessedCheckpoint()) {
-            // already processed
-            return IndexingStrategy.processButSkipLucene(false, index.version());
-        }
+        // An operation at or below the processed checkpoint has already been applied.
+        // It must be marked processed AND skipped in Lucene, otherwise a duplicate
+        // updateDocument can resurrect a stale version on the replica.
+        if (index.seqNo() <= localCheckpointTracker.getProcessedCheckpoint()) {
+            return IndexingStrategy.processButSkipLucene(false, index.version());
+        }
+        // Below maxSeqNoOfUpdatesOrDeletes we must read the live version map; above it,
+        // the optimization that skips the lookup is only safe because no update can target
+        // this id yet. Guard the optimization explicitly.
+        if (index.seqNo() > maxSeqNoOfUpdatesOrDeletes && canOptimizeAddDocument(index)) {
+            return IndexingStrategy.optimizedAppendOnly(index.version());
+        }

Warning: This is the single most dangerous region in the engine. Real fixes here are tiny but require understanding maxSeqNoOfUpdatesOrDeletes, the append-only optimisation, and the primary/replica symmetry. Always open a discuss comment on the issue with your reading before you write the diff — a maintainer in this area will catch a wrong mental model in one reply.

Test with `InternalEngineTests`

The engine has a rich unit-test base (EngineTestCase / InternalEngineTests) that builds a real InternalEngine over a temp Lucene directory, so you can replay exact operation sequences with chosen seqNos and versions:

public void testStaleReplicaOpIsSkipped() throws IOException {
    // Apply seqNo=2,v=2 first, then replay a stale seqNo=1,v=1 for the same id.
    ParsedDocument doc = testParsedDocument("1", null, testDocument(), B_1, null);

    Engine.IndexResult fresh = replicaEngine.index(replicaIndexForDoc(doc, /*version*/ 2, /*seqNo*/ 2, false));
    assertThat(fresh.getResultType(), equalTo(Engine.Result.Type.SUCCESS));

    Engine.IndexResult stale = replicaEngine.index(replicaIndexForDoc(doc, /*version*/ 1, /*seqNo*/ 1, false));
    // Stale op is accepted as "processed" but must NOT overwrite the fresher doc in Lucene.
    assertThat(stale.getResultType(), equalTo(Engine.Result.Type.SUCCESS));

    replicaEngine.refresh("test");
    try (Engine.Searcher searcher = replicaEngine.acquireSearcher("test")) {
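        // The visible document must still be version 2.
        // assertVisibleVersion stands for a helper you may need to write (look the version up via the searcher).
        assertVisibleVersion(searcher, "1", 2L);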
    }
}

replicaIndexForDoc, testParsedDocument, and the searcher helpers come from EngineTestCase. Extend it; never construct an InternalEngine by hand in a test.

grep -rln "extends EngineTestCase" server/src/test/java/org/opensearch/index/engine/
./gradlew :server:test --tests "org.opensearch.index.engine.InternalEngineTests" -q

Reproduce a specific randomized failure with the printed seed:

./gradlew :server:test --tests "org.opensearch.index.engine.InternalEngineTests" -Dtests.seed=<SEED> -q

A second pattern — translog durability

Illustrative of the pattern.

Symptom: with index.translog.durability: request, an operation is acknowledged before its translog location is synced, so a crash between ack and sync loses the write.

Locate the sync point

grep -n "ensureTranslogSynced\|Translog.Location\|sync(\|durability" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java | head
grep -n "Durability.REQUEST\|sync(" server/src/main/java/org/opensearch/index/translog/Translog.java | head

The fix shape

--- a/server/src/main/java/org/opensearch/index/shard/IndexShard.java
+++ b/server/src/main/java/org/opensearch/index/shard/IndexShard.java
@@
-        // ack the operation
-        listener.onResponse(result);
-        maybeSyncTranslog(result.getTranslogLocation());
+        // Durability contract: with REQUEST durability the translog must be fsync-ed
+        // for this location BEFORE we acknowledge to the client.
+        if (indexSettings.getTranslogDurability() == Translog.Durability.REQUEST) {
+            sync(result.getTranslogLocation());
+        }
+        listener.onResponse(result);

Test it

public void testRequestDurabilitySyncsBeforeAck() throws Exception {
    IndexShard shard = newStartedShard(true,
        Settings.builder().put(IndexSettings.INDEX_TRANSLOG_DURABILITY_SETTING.getKey(), "request").build());
    // Index a doc, then assert that nothing is left unsynced at ack time.
    Engine.IndexResult r = indexDoc(shard, "_doc", "1");
    assertThat(r.getResultType(), equalTo(Engine.Result.Type.SUCCESS));
    // isSyncNeeded() is one observable; comparing the last synced location with
    // r.getTranslogLocation() is stricter if the current code exposes it.
    assertFalse(shard.isSyncNeeded());
    closeShards(shard);
}

newStartedShard, indexDoc, and closeShards come from IndexShardTestCase. This is the shard-level analog of EngineTestCase.

grep -rln "extends IndexShardTestCase" server/src/test/java/org/opensearch/index/shard/
./gradlew :server:test --tests "*IndexShardTests" -q

Pitfalls

Asymmetry between primary and replica plans. planIndexingAsPrimary and planIndexingAsNonPrimary must agree on what is stale. A fix to one without the other silently diverges replicas — the worst class of bug because tests on a single node pass.
Breaking the append-only optimisation. The maxSeqNoOfUpdatesOrDeletes optimisation skips the version-map lookup for ids that have never been updated. Touch it only with a maintainer's confirmation; getting it wrong loses or duplicates documents.
Acking before sync. With REQUEST durability, sync precedes ack — always. With ASYNC, do not sync synchronously (that is a performance regression, Stage 10).
Refresh ≠ flush ≠ commit. Refresh makes data visible; flush/commit makes it durable in Lucene and trims the translog. Confusing them produces both correctness and perf bugs. See refresh/flush/merge.
Hand-building an InternalEngine. The setup (store, translog config, merge policy, seqno service) is intricate. Always extend EngineTestCase/IndexShardTestCase.
Skipping the discuss step. Engine PRs that arrive without a stated mental model are the slowest to review. Three sentences on the issue first saves weeks.

Exit criteria — when you're ready for Stage 7

One engine, translog, or shard fix is merged with an InternalEngineTests/IndexShardTestCase test that replays the exact operation sequence (chosen seqNos/versions) that triggers it.
You can state the durability-before-ack contract and the primary/replica plan-symmetry contract from memory.
You opened a discuss comment with your reading before writing the diff, and a maintainer confirmed (or corrected) the model.
You can reproduce a randomized engine-test failure from its -Dtests.seed= line.

You have worked the write path. Stage 7 works the read path: queries, the query/fetch phases, and aggregation reduce.

Stage 7 — Search and Aggregation Issues

What this stage teaches

Stage 7 is the read path: turning a query into Lucene work on each shard, then merging per-shard results on the coordinating node. The skill is reasoning about a fan-out / reduce pipeline where the hard bugs live in the merge — especially aggregation reduce, which must produce the same answer whether a shard returned data, returned nothing, or did not exist at all.

QueryPhase / FetchPhase — per-shard execution. QueryPhase runs the query and collects top docs and aggregation buckets; FetchPhase hydrates the matched docs. SearchService.executeQueryPhase / executeFetchPhase are the entry points.
SearchPhaseController — the coordinating-node reduce: merges top docs, calls InternalAggregation.reduce(...), computes the final response.
Aggregations — AggregatorFactory → Aggregator (per shard) → InternalAggregation (the serializable result) → reduce(...) (the cross-shard merge). The merge is where empty-shard and partial-result bugs hide.
QueryBuilder / AbstractQueryBuilder — the DSL → Lucene Query translation via QueryShardContext, including query rewrite (rewrite(...)), where another family of bugs lives.

The two classic Stage 7 bugs: an aggregation reduce edge case (wrong result when some shards are empty or absent, or when partial reduce runs) and a query rewrite bug (a builder that rewrites to the wrong query, or that does not survive the equals/hashCode and serialization round-trip).

Prerequisite: Stage 4 plus the search deep dives: Search execution, Aggregations, and Query DSL & QueryBuilders.

Fan-out and reduce

flowchart TD
  C[TransportSearchAction<br/>coordinating node] --> F1[shard 1: QueryPhase]
  C --> F2[shard 2: QueryPhase]
  C --> F3[shard N: QueryPhase]
  F1 --> RED[SearchPhaseController reduce<br/>InternalAggregation.reduce]
  F2 --> RED
  F3 --> RED
  RED --> FET[FetchPhase on matched docs]
  FET --> RESP[final SearchResponse]

The invariant that breaks most often:

reduce must be associative and handle the empty/absent case. It runs in batches (partial reduce of M shard results at a time, then a final reduce of the partials), so reduce(reduce(a, b), c) must equal reduce(a, reduce(b, c)), and reduce of an empty or all-empty input must produce a well-formed (often empty) aggregation, not a null, not a throw, not a wrong default.

A shard can contribute no documents (empty index), and with index filtering a shard may not be queried at all. Both must reduce cleanly.
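
A minimal sketch of that invariant, using a toy average. AvgPartial is a hypothetical class, not the real InternalAvg; it carries the running sum and count, reduces a batch into a new partial, and only computes the average when read. Because the partial keeps (sum, count) rather than the finished average, grouping the inputs differently cannot change the answer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy per-shard result for an average: a running sum and a document count. */
class AvgPartial {
    /** Identity element: folding it into any reduce changes nothing. */
    static final AvgPartial EMPTY = new AvgPartial(0.0, 0);

    final double sum;
    final long count;

    AvgPartial(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    /** Reduces a batch into one partial; an empty batch yields the identity. */
    static AvgPartial reduce(List<AvgPartial> parts) {
        double sum = 0.0;
        long count = 0;
        for (AvgPartial part : parts) {
            sum += part.sum;
            count += part.count;
        }
        return new AvgPartial(sum, count);
    }

    /** The average is derived at read time; with no documents it is NaN by contract. */
    double value() {
        return count == 0 ? Double.NaN : sum / count;
    }

    /** Convenience for trying out different batchings of the same shard results. */
    static double leftGrouped(AvgPartial a, AvgPartial b, AvgPartial c) {
        return reduce(Arrays.asList(reduce(Arrays.asList(a, b)), c)).value();
    }

    static double rightGrouped(AvgPartial a, AvgPartial b, AvgPartial c) {
        return reduce(Arrays.asList(a, reduce(Arrays.asList(b, c)))).value();
    }

    static double ofNothing() {
        return reduce(Collections.<AvgPartial>emptyList()).value();
    }
}
```

With a = (sum 30, count 2), b = EMPTY, and c = (sum 30, count 1), both groupings give 20.0. Had the partial stored the finished average instead of the sum and count, the two groupings would disagree, which is the bug shape partial reduce exposes.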

Finding Stage 7 issues

is:issue is:open label:bug no:assignee label:"Search:Aggregations"
is:issue is:open label:bug no:assignee label:"Search:Query Capabilities"
is:issue is:open label:bug no:assignee "aggregation" "empty" in:body
is:issue is:open label:bug no:assignee "rewrite" "query" in:body

Area labels include Search:Aggregations, Search:Query Capabilities, Search:Relevance, Search:Performance. Check the current set.

Fallback grep — find a reduce that does not guard the empty case, or a builder whose doRewrite is suspect:

grep -rn "public InternalAggregation reduce" server/src/main/java/org/opensearch/search/aggregations/ | head
grep -rln "doRewrite\|protected QueryBuilder doRewrite" server/src/main/java/org/opensearch/index/query/

Walked example — an aggregation reduce edge case

Illustrative of the pattern. The grep finds the real aggregation; treat the class as a stand-in for "some aggregation whose reduce mishandles empty shards."

Symptom: an issue reports that an aggregation returns a wrong value (or NaN, or a null) when at least one shard has no matching documents — the empty shard's InternalAggregation is folded into the reduce incorrectly. A common shape: a metric agg that averages by summing values and dividing by count, where an empty shard contributes count == 0 and the reduce divides by a total that ignored it, or includes a sentinel that poisons the sum.

Locate the reduce

ls server/src/main/java/org/opensearch/search/aggregations/metrics/
grep -n "reduce\|getProperty\|buildEmptyAggregation\|InternalAggregation" \
  server/src/main/java/org/opensearch/search/aggregations/metrics/InternalAvg.java | head
git log --oneline -n 5 -- server/src/main/java/org/opensearch/search/aggregations/metrics/InternalAvg.java

The schematic of a metric reduce:

@Override
public InternalAvg reduce(List<InternalAggregation> aggregations, ReduceContext reduceContext) {
    CompensatedSum sum = new CompensatedSum(0, 0);
    long count = 0;
    for (InternalAggregation aggregation : aggregations) {
        InternalAvg avg = (InternalAvg) aggregation;
        count += avg.count;
        sum.add(avg.sum);     // empty shard contributes count=0, sum=0 — must be a no-op
    }
    return new InternalAvg(getName(), sum.value(), count, format, getMetadata());
}

If an empty shard contributes a non-zero sentinel sum, or if buildEmptyAggregation() returns the wrong neutral element, the average skews. Read both reduce and buildEmptyAggregation() — they must agree on the identity element.

Diff

--- a/server/src/main/java/org/opensearch/search/aggregations/metrics/InternalAvg.java
+++ b/server/src/main/java/org/opensearch/search/aggregations/metrics/InternalAvg.java
@@  public InternalAvg reduce(...) {
         for (InternalAggregation aggregation : aggregations) {
             InternalAvg avg = (InternalAvg) aggregation;
+            // An empty shard has count == 0; its sum must be the additive identity (0),
+            // never a sentinel like Double.NaN, or it poisons the merged average.
             count += avg.count;
             sum.add(avg.sum);
         }
         return new InternalAvg(getName(), sum.value(), count, format, getMetadata());
     }
@@  public double getValue() {
-        return sum / count;
+        // With zero documents across all shards the average is NaN by contract. State that
+        // here, at read time, and keep the stored sum at 0: writing NaN into the sum would
+        // poison any later partial reduce that folds this result into another batch.
+        return count == 0 ? Double.NaN : sum / count;

The reasoning that generalises beyond this one aggregation:

Find the identity element. For sum it is 0; for min it is +inf; for max it is -inf; for a sketch it is the empty sketch. buildEmptyAggregation() must return it and reduce must treat it as a no-op.
Define the all-empty result by contract, not by accident. avg of nothing is NaN, sum of nothing is 0, min/max of nothing is null/absent. Make it explicit.
Associativity. Because reduce runs in partial batches, neither the grouping of shard results into batches nor their order may change the result. Test that.
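
The identity-element rule can be checked in a few lines. This is a sketch with hypothetical helper names (IdentityDemo, reduceMin, reduceSum), using plain doubles in place of real InternalAggregation objects: an empty shard reports the identity, and a sentinel such as NaN in its place poisons the merged value.

```java
/** Toy reductions over per-shard values; an empty shard reports the identity element. */
class IdentityDemo {
    /** min starts from +infinity, so an empty shard's +infinity never wins. */
    static double reduceMin(double[] perShard) {
        double min = Double.POSITIVE_INFINITY;
        for (double v : perShard) {
            min = Math.min(min, v);
        }
        return min;
    }

    /** sum starts from 0, so an empty shard's 0 is a no-op. */
    static double reduceSum(double[] perShard) {
        double sum = 0.0;
        for (double v : perShard) {
            sum += v;
        }
        return sum;
    }
}
```

reduceMin over {+infinity, 3.0, +infinity} is 3.0 and reduceSum over {0.0, 5.0} is 5.0, but reduceSum over {NaN, 5.0} is NaN: one shard with the wrong "empty" value is enough to corrupt the whole result.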

Test with `AggregatorTestCase`

AggregatorTestCase runs an aggregation over an in-memory Lucene index, including the multi-shard reduce, with no cluster:

public void testAvgWithEmptyShardIsCorrect() throws IOException {
    // Argument order follows AggregatorTestCase.testCase in recent versions
    // (builder, query, index builder, verifier, field types); check the base class.
    MappedFieldType fieldType = new NumberFieldMapper.NumberFieldType("value", NumberFieldMapper.NumberType.LONG);
    testCase(new AvgAggregationBuilder("a").field("value"), new MatchAllDocsQuery(), iw -> {
        iw.addDocument(singleton(new NumericDocValuesField("value", 10)));
        iw.addDocument(singleton(new NumericDocValuesField("value", 20)));
        // A document with no "value" field must not contribute to sum or count.
        iw.addDocument(singleton(new NumericDocValuesField("other", 1)));
    }, (InternalAvg avg) -> {
        assertEquals(15.0, avg.getValue(), 0.0);   // the valueless doc did not skew it
    }, fieldType);
}

public void testAvgOfNothingIsNaN() throws IOException {
    MappedFieldType fieldType = new NumberFieldMapper.NumberFieldType("value", NumberFieldMapper.NumberType.LONG);
    testCase(new AvgAggregationBuilder("a").field("value"), new MatchAllDocsQuery(), iw -> { /* add no value docs */ },
        (InternalAvg avg) -> assertTrue(Double.isNaN(avg.getValue())), fieldType);
}

For the reduce specifically, InternalAggregationTestCase round-trips and reduces InternalAggregation instances directly; use it to assert associativity across an empty element. The commands below list the existing subclasses and run the avg tests:

grep -rln "extends AggregatorTestCase\|extends InternalAggregationTestCase" \
  server/src/test/java/org/opensearch/search/aggregations/
./gradlew :server:test --tests "*InternalAvgTests" --tests "*AvgAggregatorTests" -q

Confirm end to end

./gradlew run &
curl -s -XPUT 'localhost:9200/idx' -H 'content-type: application/json' -d '{"settings":{"number_of_shards":3}}'
curl -s -XPOST 'localhost:9200/idx/_doc?refresh' -H 'content-type: application/json' -d '{"value":10}'
curl -s -XPOST 'localhost:9200/idx/_doc?refresh' -H 'content-type: application/json' -d '{"value":20}'
curl -s 'localhost:9200/idx/_search' -H 'content-type: application/json' -d '{
  "size":0, "aggs":{"a":{"avg":{"field":"value"}}}
}' | jq '.aggregations.a.value'

With 3 shards and 2 docs, at least one shard is empty — the average must read 15, proving the empty-shard reduce.

A second pattern — a query rewrite bug

Illustrative of the pattern.

QueryBuilders rewrite themselves before execution (e.g. a range over a constant field rewrites to match_none or match_all). A rewrite bug returns the wrong simplified query, or breaks the equals/hashCode/serialization round-trip that the framework requires.

grep -n "doRewrite\|Rewriteable\|protected Query doToQuery" \
  server/src/main/java/org/opensearch/index/query/RangeQueryBuilder.java | head

These are tested with AbstractQueryTestCase, which auto-checks serialization round-trip, equals/hashCode, XContent parse, and toQuery:

// Extend AbstractQueryTestCase<RangeQueryBuilder>; override doAssertLuceneQuery
// to assert the rewritten/translated Lucene Query is what you expect.
@Override
protected void doAssertLuceneQuery(RangeQueryBuilder qb, Query query, QueryShardContext context) throws IOException {
    assertThat(query, instanceOf(/* expected Lucene query class */));
}

grep -rln "extends AbstractQueryTestCase" server/src/test/java/org/opensearch/index/query/
./gradlew :server:test --tests "*RangeQueryBuilderTests" -q

AbstractQueryTestCase is strict: if your change breaks Writeable round-trip or equals, the base class fails the test for you — which is exactly the safety you want.
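
What the base class enforces can be sketched outside the framework. The class below is hypothetical (ToyRangeQuery is not an OpenSearch type, and java.io streams stand in for StreamOutput/StreamInput); it shows the shape of the check: write the object, read it back, and require equals and hashCode to agree with the original.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

/** Hypothetical query object; it only mimics the shape of the round-trip contract. */
class ToyRangeQuery {
    final String field;
    final long from;
    final long to;
    final boolean includeUpper;

    ToyRangeQuery(String field, long from, long to, boolean includeUpper) {
        this.field = field;
        this.from = from;
        this.to = to;
        this.includeUpper = includeUpper;
    }

    /** Every field that takes part in equals must also be written here. */
    void writeTo(DataOutputStream out) throws IOException {
        out.writeUTF(field);
        out.writeLong(from);
        out.writeLong(to);
        out.writeBoolean(includeUpper);
    }

    /** Reads the fields in exactly the order writeTo wrote them. */
    static ToyRangeQuery readFrom(DataInputStream in) throws IOException {
        return new ToyRangeQuery(in.readUTF(), in.readLong(), in.readLong(), in.readBoolean());
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ToyRangeQuery)) return false;
        ToyRangeQuery other = (ToyRangeQuery) o;
        return field.equals(other.field) && from == other.from && to == other.to
            && includeUpper == other.includeUpper;
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, from, to, includeUpper);
    }

    /** Serializes and deserializes a copy, the way a round-trip test would. */
    static ToyRangeQuery roundTrip(ToyRangeQuery original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readFrom(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Adding a field to the constructor and equals but forgetting it in writeTo makes roundTrip return an object that is no longer equal to the original, which is the failure the strict base class reports for you.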

Pitfalls

reduce that throws/returns null on empty input. It runs on empty and all-empty inputs. The identity element and the all-empty contract must be explicit.
Disagreement between reduce and buildEmptyAggregation. They must share the same identity element. Fix both in one PR.
Ignoring partial reduce. Large searches reduce in batches; a non-associative reduce gives different answers under different batched_reduce_size. Test associativity.
Breaking the query round-trip. Any new field on a QueryBuilder must be added to equals, hashCode, doWriteTo/constructor-StreamInput, and doXContent. AbstractQueryTestCase enforces this — do not suppress its checks.
Forgetting BWC on a serialized field. A new field on an InternalAggregation or QueryBuilder that crosses the wire needs a Version guard (Stage 11).
Hand-rolling Lucene fixtures. Use AggregatorTestCase/AbstractQueryTestCase; they build the leaf readers, field types, and shard context correctly.
Confusing query phase and fetch phase. Scoring/aggregation bugs are QueryPhase; source/highlight/fields bugs are FetchPhase. Diagnose which phase before you edit.

Exit criteria — when you're ready for Stage 8

One search or aggregation fix is merged with an AggregatorTestCase / InternalAggregationTestCase / AbstractQueryTestCase test that exercises the edge case (empty shard, all-empty, rewrite boundary).
You can name the identity element for the aggregation you fixed and you reconciled reduce with buildEmptyAggregation.
You can explain why aggregation reduce must be associative and how partial reduce makes that observable.
For a QueryBuilder change, you let AbstractQueryTestCase prove the round-trip rather than disabling its checks.

You have now worked both the write and read paths of core. Stage 8 steps to the boundary where core meets the plugins that build on it.

Stage 8 — Plugin and Extension Compatibility

What this stage teaches

Stage 8 is the OpenSearch analog of the Tez "Hive-on-Tez" attribution skill: most of OpenSearch's value (security, k-NN, SQL/PPL, alerting, ml-commons, index-management, cross-cluster-replication) lives in separate repositories that build against the core engine through extension points. A change to a core extension point can break those plugins without a single test failing in the core repo. The skill is:

Reasoning about the plugin SPI: org.opensearch.plugins.Plugin and the interfaces a plugin implements (ActionPlugin, SearchPlugin, MapperPlugin, AnalysisPlugin, IngestPlugin, ClusterPlugin, EnginePlugin, NetworkPlugin, RepositoryPlugin, ExtensiblePlugin, …).
Recognising when a core change is source- or binary-incompatible for plugins (a method signature change, a removed registration helper, a new abstract method) versus a safe, additive change.
Attributing a reported break correctly: is the fault in core (it broke the contract), in the plugin (it relied on an internal it should not have), or in the build (a version mismatch)? Filing in the right repo is half the work.
Testing a core change against a real plugin by building the plugin against your local core snapshot.

Prerequisite: whichever of Stages 4–7 covers the subsystem the extension point belongs to, plus the plugin architecture deep dive and the plugin labs. This stage assumes you know how PluginsService loads a plugin and gives it an isolated classloader.

The boundary

flowchart LR
  subgraph Core[opensearch-project/OpenSearch]
    SPI[Plugin + ActionPlugin/SearchPlugin/...<br/>extension-point interfaces]
    REG[ActionModule / SearchModule<br/>registries]
  end
  subgraph Plugins[separate repos]
    SEC[security]
    KNN[k-NN]
    SQL[sql]
    ML[ml-commons]
  end
  SPI -. implemented by .-> SEC
  SPI -. implemented by .-> KNN
  SPI -. implemented by .-> SQL
  SPI -. implemented by .-> ML
  Plugins -->|build against published<br/>org.opensearch:opensearch| Core

The contract is the set of public/protected signatures on the Plugin interfaces and the registration types they return (ActionHandler, RestHandler, QuerySpec, AggregationSpec, Mapper.TypeParser, …). Anything a plugin can see is part of the contract — including a return type you "just" widened.

Extension interface	What a plugin registers	A breaking change looks like
`ActionPlugin`	`getActions()`, `getRestHandlers(...)`	changing `RestHandler` or `ActionHandler` shape, or the args to `getRestHandlers`
`SearchPlugin`	`getQueries()`, `getAggregations()`, `getFetchSubPhases(...)`	renaming `QuerySpec`/`AggregationSpec` or its constructor
`MapperPlugin`	`getMappers()`, `getMetadataMappers()`	changing `Mapper.TypeParser`
`EnginePlugin`	`getEngineFactory(...)`	changing `EngineFactory`/`EngineConfig`
`ExtensiblePlugin`	`loadExtensions(...)`	changing the extension-loading SPI

Finding Stage 8 issues

is:issue is:open label:bug no:assignee "plugin" "compatibility" in:body
is:issue is:open label:bug no:assignee "extension point" in:body
is:issue is:open label:"help wanted" no:assignee label:"Plugins"

Cross-repo breaks are also reported as issues in the plugin repos themselves ("k-NN fails to build against OpenSearch 3.x main"). Search those repos too:

repo:opensearch-project/k-NN is:issue is:open "build against" OR "incompatible"
repo:opensearch-project/security is:issue is:open "API change" OR "removed"

Fallback grep — find the extension-point interfaces and the registries that back them:

ls libs/*/src/main/java/org/opensearch/plugins/ server/src/main/java/org/opensearch/plugins/ 2>/dev/null
grep -rln "interface .*Plugin" server/src/main/java/org/opensearch/plugins/
grep -n "registerQuery\|registerAggregation\|QuerySpec\|AggregationSpec" \
  server/src/main/java/org/opensearch/search/SearchModule.java | head

Walked example — a core change that forces a plugin registration update

Illustrative of the pattern. The grep finds the real registration method; treat the details as a stand-in for "an extension-point signature evolved."

Symptom: a core PR adds a required parameter to the SearchPlugin.QuerySpec constructor (say, to thread a NamedXContentRegistry or a new reader). Core compiles and its tests pass — but every out-of-repo SearchPlugin that constructs a QuerySpec now fails to compile against the new snapshot. The k-NN plugin's nightly build against main goes red.

Locate the contract that changed

grep -rn "class QuerySpec\|QuerySpec(" server/src/main/java/org/opensearch/plugins/SearchPlugin.java
git log --oneline -n 5 -- server/src/main/java/org/opensearch/plugins/SearchPlugin.java
# Where core itself registers queries (your in-repo callers to update):
grep -rn "new QuerySpec\|getQueries" server/src/main/java/org/opensearch/ modules/ plugins/ | head

Decide: is this break necessary?

The first question is not "how do I fix the plugin" but "should core have broken the contract at all?" Two options:

Additive (preferred). Add a new constructor/overload and deprecate the old one, so existing plugins keep compiling and migrate at their own pace.
Breaking (only if justified). If the old signature cannot be kept, the change must be announced (a breaking label, a CHANGELOG ### Changed with a migration note) and the plugin repos must be updated in lockstep.

The additive diff:

--- a/server/src/main/java/org/opensearch/plugins/SearchPlugin.java
+++ b/server/src/main/java/org/opensearch/plugins/SearchPlugin.java
@@  public static class QuerySpec<T extends QueryBuilder> {
+        /**
+         * @deprecated use {@link #QuerySpec(ParseField, Writeable.Reader, ContextParser)}
+         *             which threads the parse context. Retained for plugin source compatibility.
+         */
+        @Deprecated
         public QuerySpec(String name, Writeable.Reader<T> reader, QueryParser<T> parser) {
-            this(new ParseField(name), reader, parser);
+            this(new ParseField(name), reader, (p, c) -> parser.fromXContent(p));
         }
+        public QuerySpec(ParseField name, Writeable.Reader<T> reader, ContextParser<Object, T> parser) {
+            // new canonical constructor; body elided in this sketch (it must delegate to the superclass)
+        }

Warning: "Just deprecate, don't delete" is the default posture at the core↔plugin boundary. Deleting a public method that a downstream plugin uses is a breaking change that needs maintainer sign-off and a coordinated migration. When in doubt, keep the old path.
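
The deprecate-don't-delete posture, reduced to a self-contained sketch. ToySpec is a hypothetical stand-in for a registration type such as QuerySpec, with a String standing in for the parse context: the legacy constructor stays, is marked deprecated, and adapts its argument to the new canonical one, so an old caller and a new caller both compile and behave the same.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical registration type showing an additive signature change. */
class ToySpec {
    private final String name;
    private final BiFunction<String, String, String> parser; // (input, context) -> parsed

    /** New canonical constructor: the parser also receives a context. */
    ToySpec(String name, BiFunction<String, String, String> parser) {
        this.name = name;
        this.parser = parser;
    }

    /**
     * @deprecated use the (name, BiFunction) constructor; kept so existing callers still compile.
     */
    @Deprecated
    ToySpec(String name, Function<String, String> legacyParser) {
        // Adapt the old single-argument parser by ignoring the context it never knew about.
        this(name, (input, context) -> legacyParser.apply(input));
    }

    String name() {
        return name;
    }

    String parse(String input, String context) {
        return parser.apply(input, context);
    }
}
```

A downstream caller written against the old signature, such as new ToySpec("q", s -> s.trim()), keeps compiling untouched, and migrating to the two-argument form is a one-line change made on the plugin's own schedule.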

Update in-repo callers and the CHANGELOG

grep -rn "new QuerySpec(" server/ modules/ plugins/
# update each in-repo caller to the new constructor where appropriate
./gradlew compileJava -q   # compiles every subproject that has Java sources, including modules/ and plugins/

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ### Changed
+- Add a context-aware `SearchPlugin.QuerySpec` constructor and deprecate the legacy one ([#NNNNN](...))

Test the core change against a real plugin

This is the Stage 8 skill that no in-repo test gives you. Build core locally, publish it to your local Maven, then build a plugin against it:

# 1. In the core repo: publish your branch as a local snapshot.
./gradlew publishToMavenLocal
# Builds are snapshots by default (build.snapshot defaults to true).
# Note the version it published, e.g. 3.2.0-SNAPSHOT

# 2. In a plugin repo (e.g. k-NN), point it at your local snapshot and build.
git clone https://github.com/opensearch-project/k-NN.git
cd k-NN
./gradlew build -Dopensearch.version=3.2.0-SNAPSHOT

If the plugin builds with the deprecated path intact, your additive change is safe. If you had to make it breaking, this build is where you discover every call site the plugin must change — which you then file as a coordinated issue/PR in the plugin repo, linked from the core PR.

Note: Plugins consume core via -Dopensearch.version=... and the opensearch.opensearchplugin Gradle plugin; the exact flag name varies by plugin repo — check that repo's DEVELOPER_GUIDE.md. The principle is identical: publish core locally, resolve the plugin against it.

Open the PR(s)

The core PR carries the additive change, the in-repo caller updates, the CHANGELOG note, and a line in the description: "verified against k-NN built with -Dopensearch.version=<snapshot>."
If breaking, a companion plugin PR (or a tracking issue in the plugin repo) migrates the plugin. Link them. Maintainers will not merge a knowingly-breaking core change until the downstream migration path exists.

Attribution — whose bug is it?

When someone reports "feature X broke after upgrading," the most valuable thing you can do is attribute it correctly. Walk the boundary:

Symptom	Likely fault	Where to file
Plugin fails to compile against new core	core changed a public signature	core (make additive) or plugin (adopt new API)
Plugin loads but throws `ClassNotFoundException`/`NoSuchMethodError` at runtime	binary-incompatible change; version skew	core (BWC, Stage 11)
Plugin relied on a core internal (non-`@PublicApi`) class	plugin overreached	plugin
Same request works without the plugin, fails with it	plugin logic	plugin
`plugin-descriptor.properties` version mismatch refuses to load	packaging/version	plugin build

The decisive grep when you suspect overreach is whether the symbol the plugin used is part of the public surface:

grep -rn "@PublicApi\|@InternalApi\|@DeprecatedApi" \
  libs/core/src/main/java/org/opensearch/core/common/ | head   # annotation usage
grep -rn "class TheSymbolThePluginUsed" server/src/main/java/  # is it even public?

If the plugin imported a non-public core internal, the fix is in the plugin (or a request to core to promote the symbol to public API) — not a revert of the core change.
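
The attribution rule ("is the symbol part of the public surface?") can be expressed as a predicate. This sketch uses a hypothetical runtime-retained annotation, ToyPublicApi, as a stand-in; whether the real annotations are visible to reflection is an assumption not checked here, which is why the text above relies on grep over the source instead.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Hypothetical marker for the supported surface; a stand-in, not the real annotation. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface ToyPublicApi {
    String since();
}

/** A type core promises to keep stable. */
@ToyPublicApi(since = "1.0.0")
class StableThing {}

/** A type that merely happens to be visible; no promise is attached to it. */
class InternalThing {}

class ApiSurface {
    /** True when the type carries the public-API marker. */
    static boolean isPublicApi(Class<?> type) {
        return type.isAnnotationPresent(ToyPublicApi.class);
    }

    /** Encodes the attribution rule: breaking an unmarked type is the caller's risk. */
    static String whoseBug(Class<?> brokenSymbol) {
        return isPublicApi(brokenSymbol) ? "core" : "plugin";
    }
}
```

A break on StableThing is attributed to core; a break on InternalThing is attributed to the plugin that depended on it.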

Pitfalls

Breaking a public extension point without deprecation. The default is additive + @Deprecated. Deletion needs maintainer sign-off and a coordinated downstream migration.
Assuming "core tests green" means "plugins fine." Core has no visibility into k-NN or security. The publishToMavenLocal → plugin-build loop is the only proof.
Filing the bug in the wrong repo. A plugin that overreached into a core internal is a plugin bug. A core change that broke a public method is a core bug. Attribute before you patch.
Forgetting the descriptor version pin. Plugins encode a compatible OpenSearch version in plugin-descriptor.properties; a core version bump can require a coordinated descriptor bump in every plugin repo.
Ignoring the Extensions SDK. OpenSearch 2.10+ has an experimental out-of-process Extensions model (opensearch-sdk-java). It is forward-looking; plugins remain the mainstream model, but if your change touches ExtensiblePlugin/loadExtensions, mention the extensions implications.
Mentioning a plugin maintainer without doing the build first. Bring evidence (the plugin build log against your snapshot), not a guess.

Exit criteria — when you're ready for Stage 9

One core change at the plugin boundary is merged, made additive where possible, and you verified it against a real plugin via publishToMavenLocal + a plugin build.
You can attribute a reported break to core, plugin, or build, with a grep that proves whether the disputed symbol is public API.
You defaulted to deprecate-don't-delete and only went breaking with a maintainer-approved, coordinated downstream migration.
You know how a plugin resolves core (-Dopensearch.version, the version catalog, the descriptor) well enough to reproduce its build.

You have worked the core↔plugin seam. Stage 9 turns to the tests themselves — the flaky ones that erode trust in every PR's CI.

Stage 9 — Flaky Test Fixes

What this stage teaches

A flaky test is a test that passes and fails on the same code. Flakes are corrosive: they train contributors to re-run CI until green, which means a real regression hidden behind a flake ships. Fixing flakes is some of the highest-leverage work in the project, and it is how you learn the test framework deeply. The skill:

Reproduce a randomized failure deterministically from its -Dtests.seed= line, and stress it with -Dtests.iters.
Tell apart the flake families: a real race condition, an over-tight timing assumption, a leaked port/thread, order-dependence, or genuine nondeterminism in InternalTestCluster.
Replace Thread.sleep with assertBusy, and replace polling loops with proper awaiting.
Use @AwaitsFix(bugUrl=...) correctly to mute (with a tracking issue), and — the real goal — un-mute and fix an already-muted test.
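
The Thread.sleep replacement in the list above rests on one idea: wait on the condition, not on the clock. The helper below is a minimal sketch of that idea, not the framework's assertBusy; BusyWait, awaitCondition, and demo are hypothetical names. It re-runs an assertion with growing pauses, returns as soon as it passes, and rethrows the last failure once the deadline expires.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Minimal poll-until-true assertion helper; a stand-in, not the test framework's code. */
class BusyWait {
    /** Re-runs the assertion with growing pauses until it passes or the deadline expires. */
    static void awaitCondition(Runnable assertion, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long pauseMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // condition holds: stop waiting at once
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            pauseMillis = Math.min(pauseMillis * 2, 100);
        }
    }

    /** Demo: a worker flips a flag at an unknown time; the caller waits on the flag itself. */
    static boolean demo() {
        AtomicBoolean done = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50); // simulated work of unknown duration
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            done.set(true);
        });
        worker.start();
        awaitCondition(() -> {
            if (!done.get()) {
                throw new AssertionError("worker has not finished yet");
            }
        }, 5, TimeUnit.SECONDS);
        return done.get();
    }
}
```

Compared with a fixed sleep, this finishes as soon as the condition is true on a fast machine and tolerates a slow CI box up to the deadline. It does not fix a real race: if the production code can reach a wrong final state, polling only delays the failure.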

Prerequisite: Stage 4 (most integ flakes are cluster-state timing) plus the threadpools & concurrency deep dive. OpenSearch tests use Randomized Testing (RandomizedRunner); a seed determines every random choice in a run.

The flake taxonomy

Family	Signature in the failure	Fix
Real race	fails only under load / specific interleavings, even with the seed fixed; a longer `assertBusy` timeout does not help	fix the production race or await the right signal
Timing assumption	`Thread.sleep(100)` then assert; fails on a slow CI box	replace with `assertBusy` on the actual condition
Resource leak	`BindException`, leaked-thread/leaked-searcher assertion at teardown	close the resource; fix the leak detector's complaint
Order dependence	passes alone, fails in suite; static state	remove shared mutable static; reset in `@Before`
Cluster nondeterminism	shard lands on a different node; election timing	assert on outcome, not on a specific node/timing
Seed-specific data	a random string/value triggers an edge case	the test exposed a real bug — fix the code, not the test

The last row is the important reframe: a flake is sometimes a real bug the randomizer found. Before muting, always ask "is the production code actually correct for this seed?"

Reproducing a flake

When a randomized test fails, the output prints a reproduce line:

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.cluster.SomeTests.testThing" \
  -Dtests.seed=DEADBEEF12345678 -Dtests.locale=en-US -Dtests.timezone=UTC

Run it exactly:

./gradlew ':server:test' --tests "org.opensearch.cluster.SomeTests.testThing" \
  -Dtests.seed=DEADBEEF12345678 -Dtests.locale=en-US -Dtests.timezone=UTC -q

If it reproduces every time with the fixed seed but not without, it is seed-specific — the randomizer is hitting a particular value. If it reproduces only sometimes even with the seed, the nondeterminism is outside the randomizer (threads, wall-clock, the OS) — a true race. Distinguish them by stressing:

# Run the method many times to surface intermittency:
./gradlew ':server:test' --tests "org.opensearch.cluster.SomeTests.testThing" \
  -Dtests.iters=200 -q
# Or loop a whole class with fresh seeds each iteration to estimate the failure rate:
for i in $(seq 1 50); do
  ./gradlew ':server:test' --tests "org.opensearch.cluster.SomeTests" --rerun-tasks -q || echo "FAIL run $i"
done
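
Why a fixed seed replays a failure: every "random" choice in the run is drawn from a generator initialised by that seed. The sketch below uses plain `java.util.Random` as a stand-in for the randomized-testing runner; the class and method names are hypothetical and are not the framework's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Plain-Java stand-in for seeded test randomness; not the RandomizedRunner API. */
final class SeededChoices {

    /** Parses a hex seed such as the one printed in a REPRODUCE WITH line. */
    static long parseSeed(String hex) {
        return Long.parseUnsignedLong(hex, 16);
    }

    /** Draws n "random" shard counts in [1, 10] from a generator fixed by the seed. */
    static List<Integer> shardCounts(long seed, int n) {
        Random random = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(1 + random.nextInt(10));
        }
        return out;
    }

    public static void main(String[] args) {
        long seed = parseSeed("DEADBEEF12345678");
        // Same seed, same sequence: this is what makes a seed-specific failure replayable.
        System.out.println(shardCounts(seed, 5).equals(shardCounts(seed, 5)));
        // A different seed draws an independent sequence.
        System.out.println(shardCounts(seed, 5) + " vs " + shardCounts(seed + 1, 5));
    }
}
```

Anything that does not come from the generator (thread scheduling, wall-clock time, the OS) is not pinned by the seed, which is why an intermittent failure under a fixed seed points at a race rather than at seed-specific data.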

Finding flaky-test issues

OpenSearch tracks flakes with a dedicated label:

is:issue is:open label:flaky-test no:assignee sort:updated-desc
is:issue is:open label:flaky-test "AwaitsFix" in:body

Cross-check against tests already muted in the code (those are the un-mute-and-fix candidates — the most valuable Stage 9 work):

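grep -rn "@AwaitsFix\|@LuceneTestCase.AwaitsFix" \
  server/src/test/java/ server/src/internalClusterTest/java/ qa/ modules/ | grep -i "bugUrl" | head
# Leftover debugging annotations carry no bugUrl, so search for them separately:
grep -rn "@Repeat\|@Seed" server/src/test/java/ qa/ modules/ | head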

Each @AwaitsFix(bugUrl="https://github.com/opensearch-project/OpenSearch/issues/NNNN") points at the tracking issue. Pick one whose issue is still open, reproduce it, fix the root cause, and remove the annotation.

Walked example — un-mute and fix a flaky `*IT`

Illustrative of the pattern. The grep finds a real muted test; treat the specifics as a stand-in for "a cluster integ test that asserts an asynchronous outcome too eagerly."

Symptom: an *IT (integration test on InternalTestCluster) is muted with @AwaitsFix. The test indexes a doc, then immediately asserts the doc is searchable on a replica — but search visibility is asynchronous (it waits on refresh + replication), so on a slow CI box the assertion runs before the replica caught up. The original author "fixed" it once with Thread.sleep(200), which still flakes, then muted it.

Find and reproduce

grep -rn "@AwaitsFix" server/src/internalClusterTest/java/org/opensearch/ | head
# pick one; read the test and its bugUrl issue, then reproduce:
./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.SomeSearchIT.testReplicaSeesDoc" \
  -Dtests.iters=100 -q

The offending test body:

@AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/NNNN")
public void testReplicaSeesDoc() throws Exception {
    client().prepareIndex("idx").setId("1").setSource("f", "v").get();
    Thread.sleep(200);   // hope the replica refreshed
    SearchResponse r = client().prepareSearch("idx")
        .setPreference("_replica").setQuery(matchQuery("f", "v")).get();
    assertHitCount(r, 1);   // flakes: replica not refreshed yet on slow CI
}

Diff — replace timing with an explicit await

--- a/server/src/internalClusterTest/java/org/opensearch/search/SomeSearchIT.java
+++ b/server/src/internalClusterTest/java/org/opensearch/search/SomeSearchIT.java
@@
-    @AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/NNNN")
     public void testReplicaSeesDoc() throws Exception {
-        client().prepareIndex("idx").setId("1").setSource("f", "v").get();
-        Thread.sleep(200);
-        SearchResponse r = client().prepareSearch("idx")
-            .setPreference("_replica").setQuery(matchQuery("f", "v")).get();
-        assertHitCount(r, 1);
+        client().prepareIndex("idx").setId("1").setSource("f", "v")
+            .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL).get();
+        // Wait for the outcome, not a fixed time: poll until the replica is consistent.
+        assertBusy(() -> {
+            SearchResponse r = client().prepareSearch("idx")
+                .setPreference("_replica").setQuery(matchQuery("f", "v")).get();
+            assertHitCount(r, 1);
+        }, 30, TimeUnit.SECONDS);
     }

Two distinct improvements:

WAIT_UNTIL refresh policy makes the index request return only once the doc is visible — removing the need to guess at refresh timing at the source.
assertBusy polls the actual condition with a generous timeout instead of a fixed sleep. It passes the instant the condition is true (fast on a fast box) and tolerates a slow box (no false failure). The timeout is an upper bound, not an expectation.

Removing the @AwaitsFix line is the deliverable — the test is live again.

Verify it is genuinely de-flaked

./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.SomeSearchIT.testReplicaSeesDoc" \
  -Dtests.iters=200 -q

Two hundred clean iterations is the bar before you claim a flake is fixed. If it still flakes even with assertBusy, the bug is a real race in production code, not the test — escalate to the relevant subsystem stage (4–7) and fix the code.
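
How much does a clean stress run prove? If a test fails independently with probability p per run, n consecutive clean runs occur with probability (1 - p)^n. A small sketch of that arithmetic follows; the independence assumption is a simplification, and the class name is hypothetical.

```java
final class CleanRunOdds {

    /** Probability that n independent runs all pass when each fails with probability p. */
    static double probabilityAllPass(double p, int n) {
        return Math.pow(1.0 - p, n);
    }

    /** Smallest n for which a flake of rate p shows up at least once with the given confidence. */
    static int runsNeeded(double p, double confidence) {
        // Solve 1 - (1 - p)^n >= confidence for n.
        return (int) Math.ceil(Math.log(1.0 - confidence) / Math.log(1.0 - p));
    }

    public static void main(String[] args) {
        System.out.printf("5%% flake, 200 clean runs by luck: %.6f%n", probabilityAllPass(0.05, 200));
        System.out.printf("1%% flake, 200 clean runs by luck: %.3f%n", probabilityAllPass(0.01, 200));
        System.out.println("runs to catch a 1% flake with 95% confidence: " + runsNeeded(0.01, 0.95));
    }
}
```

Under these assumptions, 200 clean iterations make a frequent flake very unlikely but still leave a rare one plausible, so for a test that failed only occasionally it is worth adding a loop of fresh-seed runs to the evidence.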

Close the loop

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ### Fixed
+- Un-mute and de-flake `SomeSearchIT.testReplicaSeesDoc` by awaiting replica visibility ([#NNNNN](...))

git add server/.../SomeSearchIT.java CHANGELOG.md
git commit -s -m "Un-mute and de-flake SomeSearchIT.testReplicaSeesDoc"
git push && gh pr create --repo opensearch-project/OpenSearch --fill

In the PR, link the original bugUrl issue with Closes #NNNN and paste your -Dtests.iters=200 clean run as evidence. Maintainers will not trust a flake fix without the stress-run evidence.

`assertBusy` vs `Thread.sleep` — the rule

// WRONG: assumes a duration. Flaky on slow machines, slow on fast ones.
Thread.sleep(500);
assertThat(thing.state(), equalTo(READY));

// RIGHT: assert the condition, poll until true or time out.
assertBusy(() -> assertThat(thing.state(), equalTo(READY)), 30, TimeUnit.SECONDS);

assertBusy retries the lambda on AssertionError until it passes or the timeout elapses. Use it for anything asynchronous: cluster-state propagation, refresh, recovery, async listeners. The only correct use of Thread.sleep in a test is when you must verify that something did not happen within a window — and even then, prefer a deterministic signal.
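
The retry behaviour described above can be modelled in a few lines. This is a minimal sketch of the idea (retry on AssertionError with a capped backoff until a deadline), written in plain Java; it is not the test framework's implementation, and the class name is hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal model of a retry-until-true assertion; not the OpenSearch test framework's code. */
final class BusyAssert {

    /** Re-runs the check on AssertionError, backing off, until it passes or maxWait elapses. */
    static void assertBusy(Runnable check, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return;                                   // condition holds: return immediately
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e;                              // timed out: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // capped exponential backoff
        }
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        // Stand-in for an asynchronous condition that becomes true on the third poll.
        assertBusy(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new AssertionError("not ready yet");
            }
        }, 5, TimeUnit.SECONDS);
        System.out.println("passed after " + attempts.get() + " attempts");
    }
}
```

The timeout only matters on the failing path: a condition that is already true costs one call, which is why a generous upper bound does not slow a healthy test down.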

Muting correctly (when you cannot fix it now)

If a flake is real but you cannot root-cause it immediately, mute it with a tracking issue so it is not lost:

@AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/NNNN")
public void testSomething() { ... }

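Never use plain @Ignore: it has no tracking link and the test rots silently. @AwaitsFix comes from Lucene's test framework (LuceneTestCase.AwaitsFix), so the qualified form @LuceneTestCase.AwaitsFix that appears in some files is the same annotation. The mute is a debt with a ticket attached, not a fix.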

Pitfalls

Muting instead of investigating. A flake might be a real bug the randomizer found. Always reproduce with the seed and ask whether the production code is correct for that seed before muting.
Thread.sleep as a fix. It moves the flake's probability, it does not remove it. assertBusy on the real condition is the fix.
Too-short assertBusy timeout. CI boxes are slow and shared. Use generous timeouts (tens of seconds); the timeout is an upper bound, not a benchmark.
Leaked threads/ports. OpenSearch's test framework fails the build on leaked threads and searchers. If teardown complains, you did not close something — fix the leak, do not suppress the detector.
Order dependence via static state. A test that passes alone but fails in the suite has shared mutable static. Reset it in @Before/@After, or remove it.
Claiming a fix without stress evidence. -Dtests.iters=200 clean (or a for loop of fresh-seed runs) is the minimum proof. Paste it in the PR.
Removing @AwaitsFix without reading its issue. The tracking issue often contains the prior investigation — read it before you re-derive everything.
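
The order-dependence pitfall is easy to reproduce outside any test framework. A minimal sketch, with hypothetical names, in which two "tests" share a mutable static and the second only passes when it runs first or after a reset:

```java
import java.util.HashSet;
import java.util.Set;

/** Toy illustration of order dependence through shared static state. */
final class StaticStateLeak {

    /** Shared mutable static: survives from one test method to the next in the same JVM. */
    static final Set<String> REGISTERED = new HashSet<>();

    /** Hypothetical code under test: returns false if the name is already registered. */
    static boolean register(String name) {
        return REGISTERED.add(name);
    }

    /** What a @Before/@After reset would do. */
    static void reset() {
        REGISTERED.clear();
    }

    static boolean testA() { return register("idx"); }   // passes alone
    static boolean testB() { return register("idx"); }   // passes alone, fails after testA

    public static void main(String[] args) {
        boolean a = testA();
        boolean bAfterA = testB();        // false: "idx" leaked from testA
        reset();
        boolean bAfterReset = testB();    // true again once the static is cleared
        System.out.println(a + " " + bAfterA + " " + bAfterReset);
    }
}
```

Resetting in setup or teardown fixes the symptom; removing the shared static fixes the cause, and is the better change when the production code allows it.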

Exit criteria — when you're ready for Stage 10

At least three flaky tests have been de-flaked (un-muted and fixed), each with a -Dtests.iters clean run pasted into the PR.
You can classify a flake into the taxonomy above from its failure output alone.
You never reach for Thread.sleep; assertBusy (or WAIT_UNTIL/a deterministic signal) is reflex.
You have at least once discovered that a "flake" was a real race and escalated it to the owning subsystem instead of muting it.

Determinism is the foundation of measurement. Stage 10 needs exactly that determinism to make a benchmark mean something.

Stage 10 — Performance Improvements

What this stage teaches

A performance PR without numbers is a guess. Stage 10 drills the discipline that separates a real improvement from a plausible-looking refactor: measure first, change, measure again, and prove the delta with a reproducible benchmark. The skill set:

Write a JMH microbenchmark in the :benchmarks module to isolate a hot method and show a before/after allocation or throughput change.
Use OpenSearch Benchmark (opensearch-benchmark, a fork of Rally) for macro numbers: end-to-end indexing/search throughput and latency against a running cluster.
Reason about allocation and GC pressure: object churn on a hot path, autoboxing, and using BigArrays (the engine's pooled, circuit-breaker-aware array allocator) instead of raw arrays.
Avoid regressions: a change that helps one workload often hurts another; the bar is "no workload meaningfully worse, target workload meaningfully better."

Prerequisite: Stage 6 or Stage 7 (you must be in a real hot path), Stage 9 (you need deterministic measurement), plus the Level 9 performance lab and the circuit breakers & memory deep dive.

The measurement loop

flowchart LR
  P[profile / identify hot path] --> M1[baseline: JMH micro + OSB macro]
  M1 --> C[change]
  C --> M2[re-measure same JMH + OSB]
  M2 --> D{delta real &<br/>no regression?}
  D -->|yes| PR[PR with before/after numbers]
  D -->|no| C

Rules:

Never optimise un-profiled code. Find the hot path with a profiler (async-profiler / JFR) or an existing benchmark, not intuition.
Micro proves the mechanism; macro proves it matters. A JMH win that does not move an OSB workload is noise — say so honestly.
One change, one number. Bundle two optimisations and you cannot attribute the delta.

Finding Stage 10 issues

is:issue is:open label:enhancement no:assignee "performance" in:title,body
is:issue is:open label:bug no:assignee "regression" "performance" in:body
is:issue is:open label:"help wanted" no:assignee label:"Search:Performance"
is:issue is:open label:"help wanted" no:assignee label:"Storage:Performance"

Area labels: Search:Performance, Storage:Performance, Indexing:Performance, Performance. Check the current set.

Fallback grep — hot paths that allocate per-document or per-request:

ls benchmarks/src/main/java/org/opensearch/benchmark/
# Existing JMH benchmarks tell you which paths are already considered hot:
grep -rln "@Benchmark" benchmarks/src/main/java/org/opensearch/
# Raw arrays on hot paths that could use BigArrays:
grep -rn "new long\[\|new double\[\|new int\[" \
  server/src/main/java/org/opensearch/search/aggregations/ | head

Walked example — a hot-path allocation reduction with a JMH benchmark

Illustrative of the pattern. The grep finds a real hot method; treat the details as a stand-in for "a per-document path that allocates a throwaway object."

Symptom: an aggregation's per-bucket collection allocates a small temporary object (a boxed Long, or a fresh array) for every document, producing heavy young-gen GC pressure on high-cardinality aggregations. The fix reuses a pooled BigArrays-backed structure or hoists the allocation out of the per-doc loop.

Profile to confirm it is hot

# Run a node, drive a high-cardinality aggregation with OSB, attach async-profiler:
./gradlew run &
# (in another shell) drive load with opensearch-benchmark, then profile allocations:
# async-profiler: ./profiler.sh -e alloc -d 30 -f /tmp/alloc.html <opensearch-pid>

The allocation flame graph should show the per-doc allocation dominating. If it does not, stop — you have not found the hot path.

Locate the allocation

grep -n "collect(\|LongValuesSource\|new Long\|new long\[" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/<SomeAggregator>.java | head

The schematic of the churn:

@Override
public void collect(int doc, long owningBucketOrd) throws IOException {
    long[] tmp = new long[1];          // allocated per document — GC pressure
    tmp[0] = values.nextValue();
    addToBucket(owningBucketOrd, tmp[0]);
}

Diff — hoist / pool the allocation

--- a/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/SomeAggregator.java
+++ b/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/SomeAggregator.java
@@
-    public void collect(int doc, long owningBucketOrd) throws IOException {
-        long[] tmp = new long[1];
-        tmp[0] = values.nextValue();
-        addToBucket(owningBucketOrd, tmp[0]);
-    }
+    public void collect(int doc, long owningBucketOrd) throws IOException {
+        // No per-doc allocation: read the value directly.
+        addToBucket(owningBucketOrd, values.nextValue());
+    }

For a structure that genuinely needs growable, large backing storage on a hot path, use BigArrays (it is pooled and accounted against the request circuit breaker — see circuit breakers & memory):

// Allocated once, grown as needed, released in close(); accounted by the breaker.
private LongArray counts = bigArrays.newLongArray(1, true);
// ... counts = bigArrays.grow(counts, owningBucketOrd + 1);
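
To make the "allocate once, grow as needed, release on close, account every byte" idea concrete, here is a toy model in plain Java. It is not the BigArrays API: the `Budget` class is a hypothetical stand-in for a circuit breaker, and the growth policy is an assumption chosen for illustration.

```java
import java.util.Arrays;

/** Toy model of a breaker-accounted, growable long array. Not the BigArrays API. */
final class AccountedLongArray implements AutoCloseable {

    /** Hypothetical stand-in for a circuit breaker: a byte budget that refuses over-allocation. */
    static final class Budget {
        private final long limitBytes;
        private long usedBytes;

        Budget(long limitBytes) { this.limitBytes = limitBytes; }

        void reserve(long bytes) {
            if (usedBytes + bytes > limitBytes) {
                throw new IllegalStateException("would exceed budget of " + limitBytes + " bytes");
            }
            usedBytes += bytes;
        }

        void release(long bytes) { usedBytes -= bytes; }

        long usedBytes() { return usedBytes; }
    }

    private final Budget budget;
    private long[] data;

    AccountedLongArray(Budget budget, int initialSize) {
        this.budget = budget;
        budget.reserve((long) initialSize * Long.BYTES);
        this.data = new long[initialSize];
    }

    /** Grows geometrically, so a hot loop triggers a handful of allocations, not one per call. */
    void increment(int index) {
        if (index >= data.length) {
            int newSize = Math.max(index + 1, data.length + (data.length >> 1));
            budget.reserve((long) (newSize - data.length) * Long.BYTES);
            data = Arrays.copyOf(data, newSize);
        }
        data[index]++;
    }

    long get(int index) { return index < data.length ? data[index] : 0; }

    @Override
    public void close() {
        budget.release((long) data.length * Long.BYTES);
        data = new long[0];
    }

    public static void main(String[] args) {
        Budget budget = new Budget(1 << 20);             // 1 MiB budget
        try (AccountedLongArray counts = new AccountedLongArray(budget, 1)) {
            for (int doc = 0; doc < 10_000; doc++) {
                counts.increment(doc % 1000);            // per-doc path: no allocation once grown
            }
            System.out.println(counts.get(7) + " hits, " + budget.usedBytes() + " bytes accounted");
        }
        System.out.println(budget.usedBytes());          // 0 after close(): memory handed back
    }
}
```

The point of routing growth through the budget is that an oversized request fails with a clear exception at allocation time instead of exhausting the heap later.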

Write a JMH microbenchmark that proves it

Add a benchmark under :benchmarks isolating the path:

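import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@Fork(2)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class SomeAggregatorBenchmark {

    private long[] values;

    @Setup
    public void setup() {
        values = new long[1_000_000];
        Random r = new Random(42);   // fixed seed: identical input on every run
        for (int i = 0; i < values.length; i++) values[i] = r.nextInt(100_000);
    }

    @Benchmark
    public long collectAll() {
        long acc = 0;
        for (long v : values) acc += collectOne(v);   // calls the path under test
        return acc;                                   // returned so the JIT cannot discard the loop
    }

    // Hypothetical stand-in: replace the body with a call into the production method under test.
    private long collectOne(long v) {
        return v;
    }
}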

Run it before and after the change:

# Run JMH through the benchmarks project (check benchmarks/README.md for the exact form on your branch):
./gradlew -p benchmarks run --args ' SomeAggregatorBenchmark' 2>&1 | tee /tmp/after.txt
# (run on the baseline commit too -> /tmp/before.txt)

Capture both throughput and, with JMH's GC profiler enabled (-prof gc), the gc.alloc.rate and gc.alloc.rate.norm metrics. The allocation-rate drop is the headline number for a GC-pressure fix.

Prove it matters with a macro benchmark

JMH shows the mechanism; OpenSearch Benchmark shows the user-visible effect. Run the same workload against a baseline node and a patched node:

# Against a running cluster on :9200, with a workload that exercises the aggregation:
opensearch-benchmark execute-test --target-hosts=localhost:9200 \
  --pipeline=benchmark-only --workload=<a high-cardinality agg workload>

Compare the Mean Throughput / p50/p99 Service Time and the young-GC time between runs. A real fix moves the macro number or at least the GC time; if it does not, the micro-improvement was not on a path the macro workload stresses — report that honestly.

Build, CHANGELOG, PR

./gradlew :server:test --tests "*SomeAggregatorTests" -q   # correctness must still hold
./gradlew :server:precommit

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ### Changed
+- Remove per-document allocation in `SomeAggregator.collect`, reducing young-gen GC on high-cardinality terms aggs ([#NNNNN](...))

In the PR description, paste: (1) the JMH before/after with Score ± error, (2) the OSB throughput/latency/GC deltas, and (3) the seeds/workload so a reviewer can reproduce. A perf PR with no reproducible numbers is unmergeable in this project.

Reading the numbers honestly

Claim	What you must show
"Faster"	JMH `Score` improved beyond the `± error` bars (overlap = no result)
"Less GC"	JMH `-prof gc` `alloc.rate` drop, or OSB young-GC-time drop
"No regression"	OSB on other common workloads unchanged within noise
"Real-world impact"	OSB macro delta, not just JMH micro

Warning: Overlapping JMH error bars mean no measured difference, no matter how good the change looks on paper. Two forks (@Fork(2)) and enough iterations are the minimum to trust a delta. A single-fork JMH run is noise.
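
The overlap rule is mechanical enough to check in code. A minimal sketch, assuming each run is summarised as a JMH-style Score ± error pair; the numbers in `main` are made up for illustration and the class name is hypothetical.

```java
final class JmhDelta {

    /** True when the intervals [score - error, score + error] of two runs overlap. */
    static boolean overlaps(double beforeScore, double beforeError, double afterScore, double afterError) {
        double beforeLow = beforeScore - beforeError;
        double beforeHigh = beforeScore + beforeError;
        double afterLow = afterScore - afterError;
        double afterHigh = afterScore + afterError;
        return beforeLow <= afterHigh && afterLow <= beforeHigh;
    }

    /** Relative change of the mean score, in percent (positive means a higher score). */
    static double percentChange(double beforeScore, double afterScore) {
        return 100.0 * (afterScore - beforeScore) / beforeScore;
    }

    public static void main(String[] args) {
        // Made-up throughput numbers (ops/ms), for illustration only.
        System.out.println(overlaps(120.0, 4.0, 123.0, 5.0));   // true: intervals overlap, no result
        System.out.println(overlaps(120.0, 4.0, 141.0, 3.0));   // false: disjoint, a measurable delta
        System.out.println(percentChange(120.0, 141.0) + "% change in mean score");
    }
}
```

Report the percentage only when the intervals are disjoint; a percentage computed from overlapping intervals describes noise.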

Pitfalls

Optimising without a profile. Intuition about hot paths is wrong more often than right. Profile, then optimise the thing the profiler points at.
Micro-only evidence. A JMH win that no OSB workload reflects is not worth the complexity. State the macro result, even when it is "no measurable change."
Trading correctness for speed. Re-run the subsystem's unit tests. A faster wrong answer is a Stage 6/7 bug.
Ignoring the circuit breaker. Pooled BigArrays allocations are accounted against the request breaker for a reason. Replacing them with un-accounted raw arrays "to go faster" reintroduces the OOM the breaker exists to prevent.
Helping one workload, hurting another. Always run a couple of other OSB workloads. The bar is target-better, others-not-worse.
Non-reproducible numbers. No seed, no workload name, no hardware note = reviewers cannot trust it. Include all three.
Bundling optimisations. Two changes, one number = unattributable. One change per PR.

Exit criteria — when you're ready for Stage 11

One performance PR is merged with a JMH before/after (non-overlapping error bars) and an OpenSearch Benchmark macro delta, both reproducible from the PR.
You can profile (async-profiler/JFR), find the hot path, and resist optimising anything else.
You reach for BigArrays (breaker-accounted) on growable hot-path storage rather than raw arrays.
You report honest negatives: when a micro win does not move the macro, you say so.

Measurement is the evidence base for the hardest judgment calls in the project: when changing the wire or storage format is worth it. Stage 11 applies that judgment to backward compatibility.

Stage 11 — Backward Compatibility

What this stage teaches

OpenSearch clusters run mixed versions during rolling upgrades, persist indices written by older versions, and stream objects between nodes that may not be on the same release. A change that ignores this breaks upgrades — the most expensive class of bug, because it surfaces in production, not CI. Stage 11 drills the reflexes that keep the project upgradeable:

Wire BWC — gating StreamInput/StreamOutput reads/writes on Version so a new field is only sent to nodes that understand it, and an old node's stream is parsed correctly.
Index / Lucene format BWC — an index written by version N must open on N+1; you cannot drop or repurpose a stored field or codec without a read path for the old format.
The qa/ tests — qa/full-cluster-restart, qa/rolling-upgrade, qa/mixed-cluster exercise old↔new behaviour for real.
Serialization round-trip tests: AbstractWireSerializingTestCase and AbstractSerializingTestCase prove a Writeable/XContent type survives write-then-read; to cover a Version boundary, serialize with an explicit older Version (for example via copyWriteable) and assert what the old peer sees.
The deprecation policy — how a setting/field is deprecated (warning header, @Deprecated, removal only on a major) rather than removed outright.

Prerequisite: Stage 4 (cluster state crosses the wire) and the serialization & BWC deep dive, plus the compatibility mindset. The current line is 3.x/main, maintenance is 2.x, legacy is 1.x.

The two BWC axes

flowchart TD
  subgraph Wire[Wire BWC: node <-> node]
    A[new node writes a field] -->|Version gate| B{peer >= V?}
    B -->|yes| C[write the field]
    B -->|no| D[omit / write a default]
  end
  subgraph Index[Index BWC: disk]
    E[index written by N] --> F[must open on N+1]
    F --> G[old codec/field read path retained]
  end

Wire BWC is per-message: every writeTo/constructor-StreamInput that adds a field must guard it with out.getVersion() / in.getVersion().
Index BWC is per-format: the read path for an older index/segment format cannot be deleted until the major that drops support for that version.

OpenSearch's Version class (in libs/core on current branches; it lived under server/ on older ones) and the Lucene version it embeds are the gates. A node refuses to join a cluster, or to open an index, that is too old; within the supported window, every cross-version path must work.

Finding Stage 11 issues

is:issue is:open label:bug no:assignee "rolling upgrade" in:body
is:issue is:open label:bug no:assignee "mixed cluster" in:body
is:issue is:open label:bug no:assignee "serialization" "version" in:body
is:issue is:open label:"backport 2.x" no:assignee

Fallback grep — find the Version-gated serialization patterns to learn the idiom and spot a field that crosses the wire without a guard:

grep -rn "out.getVersion().onOrAfter\|in.getVersion().onOrAfter\|out.getVersion().before" \
  server/src/main/java/org/opensearch/ | head
ls qa/                          # full-cluster-restart, rolling-upgrade, mixed-cluster, ...
grep -rln "extends AbstractWireSerializingTestCase" server/src/test/java/org/opensearch/ | head

Walked example — adding a new field to a `Writeable` correctly

Illustrative of the pattern. The grep finds a real Writeable; treat the field as a stand-in for "a new optional field on a response that crosses the wire."

Symptom: you are adding a new optional field (say a long timeoutMillis) to a transport response. A naive implementation writes it unconditionally, so a 3.2 node sends extra bytes (one to nine for a VLong) that a 3.1 node does not expect. The older node then misreads whatever follows in the stream and fails deserialization during a rolling upgrade.

Locate the serialization

grep -rn "class SomeResponse" server/src/main/java/org/opensearch/action/
grep -n "writeTo\|StreamInput\|readFrom\|readVInt\|writeVLong" \
  server/src/main/java/org/opensearch/action/<path>/SomeResponse.java

The constructor-from-stream and writeTo:

public SomeResponse(StreamInput in) throws IOException {
    super(in);
    this.count = in.readVInt();
}

@Override
public void writeTo(StreamOutput out) throws IOException {
    super.writeTo(out);
    out.writeVInt(count);
}

Diff — gate the new field on `Version`

First, find the version the field is introduced in. New work targets main, which carries the next Version.V_3_x_0 (or Version.CURRENT). Use the actual constant on your branch:

# Version.java lives in libs/core on current branches (under server/ on older ones):
grep -n "public static final Version V_3" libs/core/src/main/java/org/opensearch/Version.java | tail

--- a/server/src/main/java/org/opensearch/action/<path>/SomeResponse.java
+++ b/server/src/main/java/org/opensearch/action/<path>/SomeResponse.java
@@
 public SomeResponse(StreamInput in) throws IOException {
     super(in);
     this.count = in.readVInt();
+    if (in.getVersion().onOrAfter(Version.V_3_2_0)) {
+        this.timeoutMillis = in.readVLong();
+    } else {
+        this.timeoutMillis = DEFAULT_TIMEOUT_MILLIS;   // sensible default for old peers
+    }
 }

 @Override
 public void writeTo(StreamOutput out) throws IOException {
     super.writeTo(out);
     out.writeVInt(count);
+    if (out.getVersion().onOrAfter(Version.V_3_2_0)) {
+        out.writeVLong(timeoutMillis);
+    }
 }

The rules:

Read and write guards must use the same version constant. A mismatch desynchronises the stream — the cause of most BWC crashes.
Old peers get a default, not garbage. When the field is absent, choose a default that preserves old behaviour (here, the pre-existing implicit timeout).
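Use the same encodings the surrounding code uses (writeVLong/readVLong, writeOptionalString, etc.); do not switch encodings for one field. Note that writeVLong only accepts non-negative values; a field that can be negative needs writeLong or writeZLong.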
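
The rules above can be exercised without a cluster. The following is a plain-Java model of a version-gated field built on `DataOutputStream`/`DataInputStream`; it is not the StreamInput/StreamOutput API, and the version numbers, class name, and default value are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Plain-Java model of a version-gated field; not the StreamInput/StreamOutput API. */
final class GatedResponse {
    static final int V_OLD = 1;                         // hypothetical peer version without the field
    static final int V_NEW = 2;                         // hypothetical version that introduced it
    static final long DEFAULT_TIMEOUT_MILLIS = 30_000L; // hypothetical default for old peers

    final int count;
    final long timeoutMillis;

    GatedResponse(int count, long timeoutMillis) {
        this.count = count;
        this.timeoutMillis = timeoutMillis;
    }

    /** Writes for a peer at peerVersion: the new field is sent only if the peer understands it. */
    byte[] write(int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(count);
            if (peerVersion >= V_NEW) {
                out.writeLong(timeoutMillis);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads a message produced for streamVersion, using the same gate as write(). */
    static GatedResponse read(byte[] wire, int streamVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            int count = in.readInt();
            long timeout = streamVersion >= V_NEW ? in.readLong() : DEFAULT_TIMEOUT_MILLIS;
            return new GatedResponse(count, timeout);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GatedResponse original = new GatedResponse(7, 1_234L);
        GatedResponse toNew = read(original.write(V_NEW), V_NEW);
        GatedResponse toOld = read(original.write(V_OLD), V_OLD);
        System.out.println(toNew.timeoutMillis + " " + toOld.timeoutMillis + " " + toOld.count);
        // Mismatched guards: written without the field, read as if it were present.
        try {
            read(original.write(V_OLD), V_NEW);
        } catch (UncheckedIOException e) {
            System.out.println("desynchronised stream: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```

In this toy the mismatch fails loudly because the message ends early. On a real transport stream more bytes usually follow, so a mismatched guard tends to show up as a confusing failure further along, which is why the identical-constant rule matters.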

Prove the round-trip with `AbstractWireSerializingTestCase`

This base class round-trips your object through StreamOutput/StreamInput at the current version and checks equality. Add an explicit old-version test alongside it to cover the gate:

public class SomeResponseTests extends AbstractWireSerializingTestCase<SomeResponse> {
    @Override protected Writeable.Reader<SomeResponse> instanceReader() { return SomeResponse::new; }
    @Override protected SomeResponse createTestInstance() {
        // Both fields use variable-length encodings, so keep the random values non-negative.
        return new SomeResponse(randomIntBetween(0, Integer.MAX_VALUE), randomNonNegativeLong());
    }
    // Optional: assert that serializing to an OLD version drops the field.
    public void testSerializeToOldVersionDropsField() throws IOException {
        SomeResponse original = createTestInstance();
        SomeResponse roundTripped = copyWriteable(original, writableRegistry(),
            SomeResponse::new, Version.V_3_1_0);   // serialize as if to a 3.1 node
        assertEquals(SomeResponse.DEFAULT_TIMEOUT_MILLIS, roundTripped.getTimeoutMillis());  // field defaulted
        assertEquals(original.getCount(), roundTripped.getCount());                          // rest survives
    }
}

copyWriteable(..., Version.V_3_1_0) simulates serializing to an older node — the single most important BWC test you can write for a new field.

./gradlew :server:test --tests "*SomeResponseTests" -q

Run the `qa/` cross-version tests

Unit round-trips are necessary but not sufficient; the qa/ suites bring up actual old-and-new clusters:

# Rolling upgrade: a real cluster upgraded node-by-node, exercising mixed-version traffic.
./gradlew :qa:rolling-upgrade:check -Dtests.bwc.version=<previous minor> -q
# Full cluster restart: write on old, restart onto new, assert data + behaviour survive.
./gradlew :qa:full-cluster-restart:check -q
# Mixed cluster: old and new nodes coexisting.
./gradlew :qa:mixed-cluster:check -q

These are slow and depend on a downloadable BWC distribution; consult TESTING.md for the exact bwc.version invocation on your branch. They are the proof that a 3.1 node and your 3.2 node actually coexist.

Build, CHANGELOG, PR

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ### Added
+- Add an optional `timeoutMillis` field to `SomeResponse`, version-gated for rolling-upgrade safety ([#NNNNN](...))

In the PR, state explicitly: the version constant you gated on, the default for old peers, and that you ran (or why CI runs) the relevant qa/ suite. BWC reviewers key off exactly those three.

Index / Lucene format BWC

Wire BWC is per-message; index BWC is per-format and longer-lived. The rules:

An index written by a supported older version must open on the current version. The read path for the old format cannot be removed until the major that drops that version's support.
You cannot repurpose a stored field, doc-values format, or codec name. Add a new one; read the old one for old segments.
qa/full-cluster-restart is the safety net: it writes data on the old version and asserts it reads back correctly on the new one.
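
"Add a new one; read the old one for old segments" amounts to dispatching on a format version stored with the data. A toy sketch of that shape, with a one-byte version header and two hypothetical layouts; it is not Lucene's codec API.

```java
import java.nio.ByteBuffer;

/** Toy on-disk format with a version header; not Lucene's codec API. */
final class FormatReader {
    static final byte FORMAT_V1 = 1;   // hypothetical old layout: 4-byte int value
    static final byte FORMAT_V2 = 2;   // hypothetical new layout: 8-byte long value

    /** Only used here to fabricate "data written by an older release". */
    static byte[] writeV1(int value) {
        return ByteBuffer.allocate(1 + Integer.BYTES).put(FORMAT_V1).putInt(value).array();
    }

    /** New code writes only the new layout. */
    static byte[] writeV2(long value) {
        return ByteBuffer.allocate(1 + Long.BYTES).put(FORMAT_V2).putLong(value).array();
    }

    /** Reads either layout: the V1 branch is the retained read path for old data. */
    static long read(byte[] stored) {
        ByteBuffer in = ByteBuffer.wrap(stored);
        byte format = in.get();
        switch (format) {
            case FORMAT_V1:
                return in.getInt();
            case FORMAT_V2:
                return in.getLong();
            default:
                throw new IllegalArgumentException("unsupported format version " + format);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(writeV1(42)));              // data written by the old release still opens
        System.out.println(read(writeV2(5_000_000_000L)));  // new data uses the wider layout
    }
}
```

Deleting the `FORMAT_V1` branch is the code-level equivalent of dropping support for an older index version, which is why it waits for the major release that drops that version.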

grep -n "minimumCompatibilityVersion\|minimumIndexCompatibilityVersion" \
  libs/core/src/main/java/org/opensearch/Version.java | head

The deprecation policy

When a setting, field, or API must go away, you deprecate, then remove on a major:

Mark it @Deprecated in code and emit a deprecation warning (the _deprecation log / Warning response header) when it is used.
Keep an alias/default so existing configs keep working (the master → cluster_manager rename is the canonical example — both work; master warns).
Remove it only in a major release, with a migration note in the breaking-changes documentation and the CHANGELOG ### Removed/### Deprecated.
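
The alias-plus-warning pattern can be sketched in plain Java. This is a toy model over a `Map`, not the OpenSearch Settings API; the two keys are the real renamed bootstrap settings, while the class, method, and warning text are hypothetical, and preferring the new key when both are present is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of a deprecated setting alias; not the OpenSearch Settings API. */
final class DeprecatedAlias {
    static final String NEW_KEY = "cluster.initial_cluster_manager_nodes";
    static final String OLD_KEY = "cluster.initial_master_nodes";

    /**
     * Resolves the value, accepting the deprecated key but recording a warning for it.
     * A real node would log the warning and add a Warning response header.
     */
    static String resolve(Map<String, String> settings, List<String> warnings) {
        if (settings.containsKey(NEW_KEY)) {
            return settings.get(NEW_KEY);
        }
        if (settings.containsKey(OLD_KEY)) {
            warnings.add("[" + OLD_KEY + "] is deprecated, use [" + NEW_KEY + "] instead");
            return settings.get(OLD_KEY);
        }
        return null;   // neither key set
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        // An existing config that still uses the old name keeps working...
        System.out.println(resolve(Map.of(OLD_KEY, "node-1"), warnings));
        // ...but the user is told to migrate.
        System.out.println(warnings);
        // The new name resolves silently.
        List<String> none = new ArrayList<>();
        System.out.println(resolve(Map.of(NEW_KEY, "node-1"), none) + " " + none.isEmpty());
    }
}
```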

grep -rn "deprecationLogger\|@Deprecated\|addDeprecatedSetting\|deprecateAndAddAlias" \
  server/src/main/java/org/opensearch/common/settings/ | head

Pitfalls

Mismatched read/write version guards. The single most common BWC crash. Read and write must gate on the identical Version constant.
No default for old peers. An absent field must default to old behaviour, not zero or null-that-NPEs downstream.
Removing an old index/codec read path. You cannot delete the read path for a still-supported format. Add new, retain old.
Skipping the qa/ suites. Unit round-trips do not exercise a real mixed cluster. Run (or ensure CI runs) rolling-upgrade/full-cluster-restart.
Repurposing a serialized field's meaning. Even if the type is unchanged, changing what a field means across versions desynchronises semantics. Add a new field.
Removing a setting without deprecation. Deprecate with an alias and a warning; remove only on a major. Reviewers will block a hard removal on a minor.
Guessing the version constant. grep Version.java for the real next constant on your branch; do not hard-code a number you assume.

Exit criteria — when you're ready for Stage 12

One BWC-sensitive PR is merged with a Version-gated StreamInput/StreamOutput, an AbstractWireSerializingTestCase that round-trips to an older version, and a passing qa/ cross-version run.
You can explain the difference between wire BWC (per-message, Version-gated) and index BWC (per-format, retained read path), and the support window for each.
You default to deprecate-then-remove-on-major, with an alias and a warning, never a hard removal on a minor.
You always verify the version constant against Version.java rather than assuming it.

BWC judgment is the core of release triage. Stage 12 is where that judgment decides whether an issue blocks a release.

Stage 12 — Release-Blocking Issues

What this stage teaches

Stage 12 is not about a bigger patch. It is about judgment — the call a near-maintainer makes when an issue lands during a release cycle: does this block the release, or not? Get it wrong in one direction and you ship a data-loss bug; get it wrong in the other and you hold a release hostage to a cosmetic defect. The skills:

Recognise what makes an issue release-blocking: data loss, a BWC break, a security vulnerability, or a correctness regression versus the prior release.
Read the v3.x.0 milestone and understand how release managers use it to scope a release.
Know how to handle a revert — when reverting a merged change is the right call, and how to do it cleanly.
Understand the bar for a backport to a release branch (2.x, a 2.x.y patch line) and how the backport <branch> label and bot work.

This is the stage where you stop thinking like a contributor closing an issue and start thinking like a maintainer protecting a release. You do not need to be a maintainer to practise it — triaging blockers well is exactly how you earn the trust that leads there.

Prerequisite: everything above, especially Stage 11 (BWC is the most common blocker class), plus the release process and the maintainer mindset.

What makes an issue release-blocking

Not all bugs block. Use this decision table — it is the heart of the stage:

Class	Blocks the release?	Why
Data loss / corruption	Always	Irreversible. No workaround restores lost data.
BWC break (rolling upgrade fails, index won't open)	Always	Breaks the upgrade path for every user.
Security vulnerability	Always (often via embargo)	Exploitable; handled through the security disclosure process, not a public issue.
Correctness regression vs prior release	Usually	Something that worked in `N-1` now returns wrong results.
Crash / availability regression	Usually	A node or cluster that was stable now falls over on a common path.
Performance regression (significant, common workload)	Sometimes	Blocks if it is large and on a default path; otherwise note and fix in a patch.
New-feature bug (feature is new this release)	Sometimes	Often "disable the feature" is the fix, not "block the release."
Cosmetic / docs / minor UX	No	Ship and fix in a patch (`3.x.y`).

The pivotal word is regression. Outside the three always-block classes (data loss, BWC break, security vulnerability), a bug that exists equally in N-1 and N is not a release blocker for N, because it already shipped. A bug introduced in N against N-1 behaviour is a regression and a candidate blocker. Once the always-block classes are ruled out, the first question on any "is this a blocker" triage is therefore: did this work in the last release?

flowchart TD
  I[incoming issue during release cycle] --> Q1{data loss / BWC break / security?}
  Q1 -->|yes| BLOCK[release-blocker: must fix or revert before GA]
  Q1 -->|no| Q2{regression vs previous release?}
  Q2 -->|no, pre-existing| PATCH[not a blocker; fix in a future patch]
  Q2 -->|yes| Q3{common default path & severe?}
  Q3 -->|yes| BLOCK
  Q3 -->|no| Q4{can the new feature be disabled?}
  Q4 -->|yes| FLAG[ship with feature off / experimental; fix later]
  Q4 -->|no| BLOCK
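The flowchart can be read as a small pure function. The sketch below is one way to encode it in Python; the `Issue` fields and the three outcome strings are illustrative names chosen here, not anything defined by the OpenSearch project.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Facts a triager establishes before making the call (illustrative fields)."""
    always_block_class: bool       # data loss, BWC break, or security vulnerability
    regression: bool               # worked in N-1, broken in N
    severe_on_default_path: bool   # severe and on a common default path
    feature_can_be_disabled: bool  # the new feature can ship off / experimental


def triage(issue: Issue) -> str:
    """Walk the flowchart top to bottom and return 'block', 'patch', or 'flag'."""
    if issue.always_block_class:
        return "block"   # Q1: must fix or revert before GA
    if not issue.regression:
        return "patch"   # Q2: pre-existing, fix in a future patch
    if issue.severe_on_default_path:
        return "block"   # Q3
    if issue.feature_can_be_disabled:
        return "flag"    # Q4: ship with the feature off, fix later
    return "block"
```

The order of the checks matters: the always-block classes are tested before the regression question, which is why a data-loss bug blocks even when the regression status is still unknown.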

How release managers triage with the milestone

OpenSearch scopes a release with a GitHub milestone (v3.2.0, v3.3.0, …) — not a label. Release managers move issues into and out of the milestone as the date approaches.

is:issue is:open milestone:"v3.2.0" label:bug sort:updated-desc
is:issue is:open milestone:"v3.2.0" label:"backport 2.x"
is:open milestone:"v3.2.0" -label:"good first issue"   # what's actually gating the release

As code-freeze nears, the release manager's job is to drive the blocker count in the milestone to zero. Anything that is genuinely a blocker stays in the milestone; everything else is bumped to the next milestone with a comment. Practising Stage 12 means being able to look at the milestone and correctly argue, issue by issue, "this stays / this can move."
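If you run these queries often, it can help to generate the links. The following is a minimal sketch that composes a milestone query and turns it into a GitHub issue-search URL; the helper names are hypothetical, and the repository is the one used elsewhere in this stage.

```python
from urllib.parse import urlencode

REPO = "opensearch-project/OpenSearch"


def milestone_query(milestone: str, *extra: str) -> str:
    """Compose a search string for open issues in a milestone, plus extra qualifiers."""
    parts = ["is:issue", "is:open", f'milestone:"{milestone}"', *extra]
    return " ".join(parts)


def search_url(query: str) -> str:
    """Turn a search string into a link to the repository's issue search."""
    return f"https://github.com/{REPO}/issues?" + urlencode({"q": query})


if __name__ == "__main__":
    print(search_url(milestone_query("v3.2.0", "label:bug", "sort:updated-desc")))
```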

A blocker argument in an issue comment looks like:

This is a release-blocker for v3.2.0: it is a correctness regression vs 3.1 (the same
query returned 5 hits in 3.1, returns 4 in 3.2 — repro below). It is on the default search
path with no workaround. Recommend either a fix in #PR or reverting #PR-that-introduced-it
before GA.

Three elements: class (regression), evidence (repro + the N-1 comparison), and a recommendation (fix or revert). A blocker claim without all three gets discounted.
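One way to hold yourself to the three-element rule is to draft the argument as structured data before posting it. The sketch below is purely illustrative (the class and field names are invented here) and only checks that none of the three elements is empty.

```python
from dataclasses import dataclass


@dataclass
class BlockerArgument:
    issue_class: str     # e.g. "correctness regression vs 3.1"
    evidence: str        # repro steps plus the N-1 comparison
    recommendation: str  # "fix in <PR>" or "revert <PR>"

    def missing(self) -> list[str]:
        """Names of the elements that are still empty, in a fixed order."""
        fields = ("issue_class", "evidence", "recommendation")
        return [name for name in fields if not getattr(self, name).strip()]

    def is_complete(self) -> bool:
        return not self.missing()
```

A draft with `missing()` returning `["evidence"]` is the "this feels bad" comment: a class and a recommendation with nothing to back them.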

Handling a revert

Sometimes the right call for a blocker is not "fix forward" but "revert the change that caused it." Reverting is a first-class, low-drama action in a healthy project — it buys time without holding the release. When to revert rather than fix forward:

The fix is not obvious and the release date is close.
The change that introduced the regression is recent and isolated (a clean revert).
The feature can ship later without disruption.

How to do it cleanly:

git fetch upstream
git checkout -b revert/some-change upstream/main
git revert -s <commit-sha>          # creates the commit 'Revert "<original subject>"'; -s adds the DCO sign-off
# only if the revert stops on conflicts: resolve them, keep the revert minimal, then
git add <resolved-files> && git commit -s   # completes the revert; DCO still required

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ### Changed
+- Revert "<original change>" due to a search correctness regression in v3.2 ([#NNNNN](...), reverts [#MMMMM](...))

In the revert PR, link both the regression issue and the original PR, and state plainly: "we are reverting to unblock v3.2; the feature will be re-introduced with a fix in #follow-up." Reverting is not a failure — shipping a known data-loss or correctness regression is. The maintainer instinct is to protect the release first and restore the feature second.

The bar for a backport to a release branch

A change merged to main (the 3.x line) does not automatically reach the maintenance line (2.x) or a patch line. Backporting has a higher bar than merging to main:

Backport candidate	Backport to `2.x`?
Security fix	Yes — backport as far as the support window allows
Data-loss / corruption fix	Yes
Correctness/availability regression fix	Usually
Bug fix with low risk and clear value	Often
New feature	Rarely (features generally go forward-only)
Refactor / cleanup	No

The mechanics: label the merged PR backport 2.x (or the relevant branch) and the backport bot opens a cherry-pick PR against that branch. If the cherry-pick conflicts, the bot asks you to do it manually:

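git fetch upstream
git checkout -b backport/2.x-of-NNNN upstream/2.x
git cherry-pick -x -s <merge-or-squash-sha>   # -x records the source commit, -s adds the DCO sign-off
# only if the cherry-pick stops on conflicts: resolve them, keep the change minimal and faithful, then
git add <resolved-files> && git cherry-pick --continue
git push -u origin HEAD                       # assumes your fork is the "origin" remote
gh pr create --repo opensearch-project/OpenSearch --base 2.x --fill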

Warning: A backport must be faithful — it carries the same change, not a new one. If the fix needs to be different on 2.x (e.g. a BWC constraint differs), that is a separate, reviewed PR, not a backport. Quietly altering behaviour in a backport is how maintenance branches drift from main.

The -x flag annotates the cherry-pick with the original SHA so anyone can trace the maintenance commit back to its main origin — auditability the release process depends on.
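The annotation that -x adds is a plain line in the commit message of the form "(cherry picked from commit <sha>)", so the trace back to main can be automated. A minimal sketch, assuming you already have the commit message as a string (for example from git log); the function name is hypothetical.

```python
import re

# Line appended by `git cherry-pick -x`; the hash length depends on the repository's hash format.
CHERRY_PICK_RE = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,64})\)")


def source_commits(message: str) -> list[str]:
    """Return every source SHA recorded in a commit message, in order of appearance."""
    return CHERRY_PICK_RE.findall(message)


if __name__ == "__main__":
    msg = (
        "Fix off-by-one in hit counting\n\n"
        "(cherry picked from commit 0123456789abcdef0123456789abcdef01234567)\n"
        "Signed-off-by: A Contributor <a@example.org>\n"
    )
    print(source_commits(msg))
```

A backport commit for which this returns an empty list was not made with -x, which is worth flagging in review.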

Finding Stage 12 issues to triage

You practise this stage by triaging, not (only) by coding:

is:issue is:open milestone:"v3.2.0" sort:created-asc
is:issue is:open label:bug "regression" in:title,body sort:updated-desc
is:issue is:open label:"backport 2.x" no:assignee

For each, write the blocker argument (class / evidence / recommendation) in a comment, even if you are not the one to decide. Maintainers weigh well-argued triage heavily; it is the most visible near-maintainer skill.

Security note: never triage a suspected vulnerability in a public issue. OpenSearch has a security disclosure process (SECURITY.md, a private report channel). Suspected security blockers go there, under embargo, not into a milestone comment. Mis-handling a disclosure publicly is itself a serious mistake.

Pitfalls

Confusing severity with blocker status. A severe bug that existed in N-1 is not a release blocker for N — it shipped already. The blocker test is regression + class, not raw severity.
Blocking on a fixable feature. If a new feature is broken, "disable / mark experimental" is often the correct, faster path than holding the whole release.
Fix-forward when revert is right. Near a freeze, a clean revert beats a rushed fix. Protect the release; restore the feature next cycle.
Unfaithful backports. A backport carries the same change. Different code for 2.x needs its own review.
Backporting features or refactors. Maintenance branches take fixes, not features.
Triaging security in public. Use the disclosure process under embargo. Always.
Blocker claims without evidence. "This feels bad" is not triage. Class + repro + N-1 comparison + recommendation, or it is noise.

Exit criteria — this is the end of the roadmap

You have completed the roadmap when:

You have helped triage at least one release-blocking issue, writing the class / evidence / recommendation argument, and your call matched the maintainers'.
You can apply the blocker decision table from memory and explain why a pre-existing bug is not a release blocker while a regression of the same severity is.
You have either executed or correctly proposed a revert to protect a release, and you know when fix-forward is preferable.
You understand the backport bar, the backport <branch> bot, the faithful-cherry-pick rule, and that security goes through the disclosure process, never a public milestone.

Where to go from here

The roadmap ends at the judgment a maintainer applies; becoming a maintainer is a separate path the TSC and each repo's maintainers steward — see the maintainer mindset and release process. Landing PRs across these twelve stages is necessary for that trajectory, not sufficient: the rest is consistency, good review citizenship, and showing the judgment this stage describes, over time.

If you have worked even three or four rungs of this ladder for real, attempt the capstone: a single end-to-end contribution that exercises issue selection, root-cause, a fix with tests, BWC reasoning, and a release-aware writeup — the whole roadmap compressed into one PR.

Deep Dives: Reading Order

This directory contains 24 deep-dive chapters. They are the reference material behind the Level curriculum. Each chapter is self-contained, but most chapters depend on a handful of earlier ones. Read in the order below the first time through; thereafter use the index as a lookup.

OpenSearch is a distributed search and analytics engine built on Apache Lucene, forked from Elasticsearch 7.10.2. Almost everything you read lives under server/src/main/java/org/opensearch/..., with low-level primitives in libs/ and bundled extensions in modules/. The chapters below mirror the request path and the cluster lifecycle: you build up "what a cluster is" before "how requests flow" before "how a shard stores and searches data."

The chapters are grouped by subsystem. For each chapter we list:

Title — the file.
One-line summary — what you should walk away knowing.
Consumed by — which Levels/Labs depend on it.

Note: Throughout the book, "cluster manager" is the current term for what Elasticsearch (and older OpenSearch) called "master." OpenSearch renamed the role and most settings for inclusive language; the old master names survive as deprecated aliases. Each chapter notes both terms on first use.

Group 1 — Cluster and Node Model

These four chapters define "what is an OpenSearch cluster" — the nodes, the shared state object they agree on, and how that agreement is reached and distributed — before any request-handling machinery exists.

#	File	Summary	Consumed by
1	cluster-and-node-model.md	Node bootstrap, node roles, `DiscoveryNode`, index→shard→segment hierarchy, how services are wired in the `Node` constructor	Level 1 (all labs); Level 3 lab 3.2
2	discovery-coordination.md	`Coordinator`, the Zen2/Raft-like consensus, elections, voting configuration, split-brain prevention	Level 4 lab 4.1
3	cluster-state.md	The `ClusterState` object: `Metadata`, `RoutingTable`, `DiscoveryNodes`, `ClusterBlocks`; immutability, versioning, `Diffable`	Level 4 lab 4.2; Level 4 lab 4.3
4	cluster-state-publishing.md	Update-task model, `MasterService` vs `ClusterApplierService`, two-phase publish→commit, appliers vs listeners	Level 4 lab 4.2; Level 4 lab 4.3

Start here. Without the cluster/node/state model in your head, every later chapter feels like trivia.

Group 2 — Transport and Actions

How a request becomes work. These chapters explain the layered request path: HTTP at the edge, an internal RPC fabric between nodes, an action registry that ties request types to handlers, and the thread pools that run all of it.

#	File	Summary	Consumed by
5	transport-layer.md	`TransportService`, `Netty4Transport`, request handlers, `Writeable`/`StreamInput`/`StreamOutput`, `NamedWriteableRegistry`, port 9300	Level 3 lab 3.1; Level 3 lab 3.2
6	rest-layer.md	`RestController` dispatch, `BaseRestHandler`, `prepareRequest`, `NodeClient`, XContent parsing, error rendering	Level 3 lab 3.1; Level 3 lab 3.3
7	action-framework.md	`ActionType`, `TransportAction` base classes, `ActionModule` registration, `ActionFilters`, `ActionListener`, write vs read path	Level 3 lab 3.3; Level 4 lab 4.4
8	threadpools-concurrency.md	`ThreadPool` and its named pools, pool types, single-writer-per-shard, rejections, `ThreadContext`, why blocking the applier thread is fatal	Level 3 (all labs); Level 4 (all labs)

These chapters explain how a request is dispatched and executed. They must precede the storage and search chapters, which assume you know the action and thread-pool model.

Group 3 — Indexing and Storage

The write path, end to end, inside a single shard: the shard lifecycle wrapper, the Lucene-backed engine, the write-ahead log, the schema/analysis layer, and the three operations (refresh, flush, merge) that govern visibility, durability, and segment count.

#	File	Summary	Consumed by
9	index-shard-lifecycle.md	`IndicesService`→`IndexService`→`IndexShard`; shard states; `Store`/`Directory`; primary vs replica op application	Level 1 lab 1.4; Level 5
10	engine-internals.md	`InternalEngine` and friends; Lucene `IndexWriter`; versioning, `LiveVersionMap`, sequence numbers, `Engine.Searcher`	Level 5; Level 6
11	translog.md	`Translog`, generations, durability modes, fsync, the translog↔commit relationship, crash recovery	Level 5; the recovery deep dive
12	mapping-and-analysis.md	`MapperService`, `FieldMapper`/`MappedFieldType`, dynamic mapping, the analysis chain, `analysis-common`, `_analyze`	Level 5; Level 7
13	refresh-flush-merge.md	The three background operations; near-real-time visibility; Lucene commits; merge policy and scheduler	Level 5; Level 6

If you skip chapter 9, every later storage chapter will reference a shard state machine you have not seen. Read it first.

Group 4 — Search and Aggregations

The read path: how a query fans out across shards, how QueryBuilder instances become Lucene Query objects, how aggregations build and reduce, and the columnar data structures that sorting and aggregations actually read.

#	File	Summary	Consumed by
14	search-execution.md	`TransportSearchAction`, scatter/gather, query→fetch phases, `SearchContext`, `SearchPhaseController` reduce	Level 5; Level 8
15	query-dsl-querybuilders.md	`QueryBuilder`/`AbstractQueryBuilder`→Lucene `Query` via `QueryShardContext`; parsing, serialization, registration	Level 5; Level 7
16	aggregations.md	`AggregatorFactory`→`Aggregator`→`InternalAggregation.reduce`; bucket vs metric; the reduce pipeline	Level 5; Level 8
17	docvalues-fielddata.md	Columnar `DocValues`, legacy `fielddata`, `IndexFieldData`, why heap pressure shows up here	Level 6; the circuit breaker deep dive

Read 14 before 16 — an aggregation only makes sense once you understand the query/fetch phase split it runs inside.

Group 5 — Allocation, Recovery, Replication

How shards get placed on nodes, how a placed-but-empty shard is filled with data, and how primary and replica stay in sync after they are both running.

#	File	Summary	Consumed by
18	shard-allocation.md	`AllocationService`, `BalancedShardsAllocator`, `AllocationDeciders`, `RoutingAllocation`, the explain API	Level 4 lab 4.4; Level 6
19	recovery.md	Peer recovery, `RecoverySourceHandler`, phase 1/2, sequence-number-based recovery, the translog hand-off	Level 6; the translog deep dive
20	replication.md	Document replication (`TransportReplicationAction`) vs segment replication; `ReplicationTracker`, global checkpoint	Level 6; Level 8

Group 6 — Cross-cutting

The subsystems that touch every other chapter: backups, memory safety, the extension model, and wire compatibility across versions.

#	File	Summary	Consumed by
21	snapshots-repositories.md	`SnapshotsService`, `RepositoriesService`, `BlobStoreRepository`, incremental segment-level snapshots	Level 6; Level 9
22	circuit-breakers-memory.md	`HierarchyCircuitBreakerService`, parent/fielddata/request/in-flight breakers, real-memory accounting	Level 6; Level 8
23	plugin-architecture.md	`Plugin` and its extension interfaces, `PluginsService`, classloader isolation, `plugin-descriptor.properties`	the plugin labs; Level 7
24	serialization-bwc.md	`Version` gating in `StreamInput`/`StreamOutput`, `NamedWriteableRegistry`, XContent BWC, the `qa/` BWC suite	the compatibility mindset chapter; Level 9

A note on order vs index

The deep-dives are an index — they exist to be looked up later. The first read should follow the table above, top to bottom: cluster model, then request path, then storage, then search, then allocation/recovery/replication, then the cross-cutting subsystems. Each group assumes the previous ones.

But when you return to fix a bug, do not re-read linearly. Jump directly to the chapter most relevant to the failing component and use the cross-references inside it. A shard-stuck-UNASSIGNED bug starts in shard-allocation.md; a "documents not visible after index" bug starts in refresh-flush-merge.md; a "node won't join the cluster" bug starts in discovery-coordination.md.

Every chapter ends with a "Validation: prove you understand this" section. Treat that as the gate before declaring the chapter "read." If you cannot answer the validation questions from memory plus a grep, you have skimmed, not read.

Cluster and Node Model

An OpenSearch cluster is a set of nodes that share a single, agreed-upon view of the world — the ClusterState — and cooperatively store and search data. This chapter is the foundation: it defines what a node is as a Java process, what roles a node can play, how nodes identify each other (DiscoveryNode), and the data hierarchy every other chapter assumes (index → shard → Lucene segment, primary vs replica). It ends by walking the Node constructor, which is the single most important "wiring diagram" in the whole codebase — almost every service you will ever touch is instantiated there.

After this chapter you should be able to: name every node role and the setting that enables it; explain what a DiscoveryNode carries on the wire; draw the index→shard→segment hierarchy and place ShardRouting/ShardId/Index on it; and open Node.java and find where any major service (transport, cluster service, indices service, search service) is constructed.

What "node" means as a process

A node is a single JVM running the OpenSearch server. The entry point is short and worth reading top to bottom:

find server/src/main/java/org/opensearch -name "OpenSearch.java"
find server/src/main/java/org/opensearch -name "Bootstrap.java"
find server/src/main/java/org/opensearch -name "Node.java"

The three classes form a layered bring-up:

Class	Package	Responsibility
`OpenSearch`	`org.opensearch.bootstrap`	The `main(String[])` class. Parses CLI args, extends `EnvironmentAwareCommand`, hands off to `Bootstrap`.
`Bootstrap`	`org.opensearch.bootstrap`	Process-level setup: security manager, native access (mlockall, seccomp), JVM checks (the "bootstrap checks"), keystore, then constructs and starts a `Node`.
`Node`	`org.opensearch.node`	The cluster member. Its constructor builds the entire service graph; `start()` brings services online; `close()` tears them down.

Read the ordering with:

grep -n "new Node\|node.start\|INSTANCE.start\|bootstrap(" \
  server/src/main/java/org/opensearch/bootstrap/Bootstrap.java

Note: The "bootstrap checks" in BootstrapChecks are production guardrails (max file descriptors, vm.max_map_count, initial heap size equal to max heap size, etc.). They are enforced, meaning a failed check aborts startup, only when a node binds a non-loopback address, i.e. when it looks like production. On a loopback-only dev node a failed check is logged as a warning instead, which is why ./gradlew run on localhost starts regardless. Find them:
grep -n "class .*BootstrapCheck" server/src/main/java/org/opensearch/bootstrap/BootstrapChecks.java

Node roles

A node advertises a set of roles. Roles are modeled as DiscoveryNodeRole objects, each backed by a node setting. Find the catalog:

find server/src/main/java/org/opensearch -name "DiscoveryNodeRole.java"
grep -n "public static final .*Role\|roleName\|ROLES" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java

The roles you must know:

Role	Setting	What the node does
`cluster_manager` (formerly `master`)	`node.roles: [cluster_manager]`	Eligible to be elected cluster manager: owns cluster-state updates and publishing. See discovery-coordination.md.
`data`	`node.roles: [data]`	Holds shards; executes index/get/search operations against local Lucene indices.
`ingest`	`node.roles: [ingest]`	Runs ingest pipelines (processors) before indexing.
`remote_cluster_client`	`node.roles: [remote_cluster_client]`	Can connect to remote clusters for cross-cluster search/replication.
`search`	`node.roles: [search]`	Searchable-snapshot / remote-store search tier (separates search compute from indexing).
coordinating-only	`node.roles: []`	A node with no roles is purely coordinating: it routes requests and reduces results but stores no data and is not election-eligible.
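The table boils down to a few set-membership questions about node.roles. The sketch below models that reasoning in Python. It is a simplification under stated assumptions: it treats only the data role as shard-holding (the search tier is left out), it accepts the legacy master name as a synonym, and the function name is invented here rather than taken from DiscoveryNode.

```python
CLUSTER_MANAGER = "cluster_manager"
LEGACY_MASTER = "master"   # deprecated synonym, still seen in old configs
DATA = "data"


def describe(roles) -> dict:
    """Summarise what a node with the given node.roles list can do."""
    role_set = set(roles)
    return {
        "election_eligible": bool(role_set & {CLUSTER_MANAGER, LEGACY_MASTER}),
        "holds_shards": DATA in role_set,
        "coordinating_only": not role_set,
    }


if __name__ == "__main__":
    for roles in (["cluster_manager", "data"], ["ingest"], []):
        print(roles, describe(roles))
```

Note that an ingest-only node is neither election-eligible nor coordinating-only: "coordinating-only" means the role list is empty, not merely that the node lacks data.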

Warning: "master" → "cluster manager" is more than cosmetic in the code. The role is named cluster_manager; the setting cluster.initial_master_nodes is a deprecated alias of cluster.initial_cluster_manager_nodes; many APIs accept both ?master_timeout and ?cluster_manager_timeout. When you read older code you will still see MASTER_ROLE, isMasterNode(), and TransportMasterNodeAction. Treat them as synonyms and prefer the new names in anything you write. Confirm the deprecation wiring:
grep -rn "initial_cluster_manager_nodes\|initial_master_nodes" server/src/main/java
grep -rn "cluster_manager\|CLUSTER_MANAGER_ROLE" server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java

Roles are resolved at startup from node.roles (or, on old configs, from the legacy booleans node.master/node.data/node.ingest). The resolution logic:

grep -rn "NODE_ROLES_SETTING\|getRolesFromSettings\|node.roles" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java

DiscoveryNode and DiscoveryNodes

A DiscoveryNode is the wire-serializable identity of one node: its name, a persistent node ID, a per-process ephemeral ID, the transport address, its roles, attributes, and the Version it runs. It is a Writeable (see transport-layer.md) and travels inside every cluster state.

grep -n "private final\|public DiscoveryNode\|getId\|getEphemeralId\|getRoles" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java

Field	Meaning
`nodeName`	Human-readable name (`node.name`), defaults to hostname-derived.
`nodeId`	Stable per-node UUID persisted under `data/`. Survives restarts.
`ephemeralId`	Regenerated each process start. Distinguishes "same node, new process."
`address`	`TransportAddress` — host:port (transport port, default 9300).
`roles`	The `Set<DiscoveryNodeRole>` resolved from settings.
`version`	The OpenSearch `Version` the node runs (drives BWC, see serialization-bwc.md).
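The split between nodeId and ephemeralId is easiest to see by simulating a restart. The following is a toy model, not the real class: the field names mirror the table, while the ID format (a hex UUID) and the restart helper are assumptions made for the illustration.

```python
import uuid
from dataclasses import dataclass, field, replace


def _new_id() -> str:
    return uuid.uuid4().hex


@dataclass(frozen=True)
class NodeIdentity:
    node_name: str
    node_id: str                                         # persisted; survives restarts
    ephemeral_id: str = field(default_factory=_new_id)   # new for every process start


def restart(node: NodeIdentity) -> NodeIdentity:
    """Same node, new process: only the ephemeral ID changes."""
    return replace(node, ephemeral_id=_new_id())


if __name__ == "__main__":
    before = NodeIdentity("node-a", node_id=_new_id())
    after = restart(before)
    print(before.node_id == after.node_id, before.ephemeral_id == after.ephemeral_id)
```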

DiscoveryNodes is the collection — an immutable view of every node in the cluster, plus convenience accessors: the elected cluster manager, the local node, lookups by ID, and filtered views (data nodes, cluster-manager-eligible nodes). It lives inside the cluster state:

grep -n "getMasterNode\|getClusterManagerNode\|getDataNodes\|getLocalNode\|class DiscoveryNodes" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNodes.java

flowchart TD
    CS[ClusterState] --> DNs[DiscoveryNodes]
    DNs --> CM["cluster_manager node (1, elected)"]
    DNs --> DN1[data node A]
    DNs --> DN2[data node B]
    DNs --> CO[coordinating-only node]
    DNs --> Local[localNode = this process]

The data hierarchy: index → shard → segment

A logical index is a named collection of documents. Physically it is partitioned into shards, and each shard is a full, independent Lucene index composed of immutable segments.

flowchart TD
    IDX["Index 'orders' (logical)"] --> S0["Shard 0"]
    IDX --> S1["Shard 1"]
    S0 --> P0["Primary 0 (node A)"]
    S0 --> R0["Replica 0 (node B)"]
    P0 --> L0["Lucene index = Engine + IndexWriter"]
    L0 --> Seg1["segment _0"]
    L0 --> Seg2["segment _1"]
    L0 --> Seg3["segment _2 (newest)"]

The identity classes you must keep straight:

Class	Package	What it identifies
`Index`	`org.opensearch.core.index`	An index by name + UUID. The UUID matters: a deleted-and-recreated index of the same name is a different `Index`.
`ShardId`	`org.opensearch.core.index.shard`	`Index` + integer shard number. Identifies a shard slot, not a copy.
`ShardRouting`	`org.opensearch.cluster.routing`	A copy of a shard: which node, primary or replica, and its routing state.

find server libs -name "Index.java" -path "*core/index*"
find server libs -name "ShardId.java"
find server -name "ShardRouting.java"
grep -n "primary\|UNASSIGNED\|INITIALIZING\|STARTED\|RELOCATING\|currentNodeId" \
  server/src/main/java/org/opensearch/cluster/routing/ShardRouting.java
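Why the UUID matters shows up as soon as something is cached by index. The sketch below uses small Python stand-ins for Index and ShardId (not the real classes, and with made-up UUID strings) to contrast a cache keyed on name with one keyed on name plus UUID after a delete-and-recreate.

```python
from typing import NamedTuple


class Index(NamedTuple):
    name: str
    uuid: str


class ShardId(NamedTuple):
    index: Index
    shard: int


old = Index("orders", "uuid-1")   # the index that was deleted
new = Index("orders", "uuid-2")   # recreated with the same name

by_name = {old.name: "mapping of the deleted index"}
by_identity = {old: "mapping of the deleted index"}

stale = by_name.get(new.name)   # hit: serves data that belongs to a different index
fresh = by_identity.get(new)    # miss: the new index is correctly treated as unknown
```

The same reasoning carries over to ShardId: because it embeds the Index, shard 0 of the old "orders" and shard 0 of the new "orders" compare as different slots.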

Primary vs replica

Every shard has exactly one primary copy and zero or more replica copies. The primary is the source of truth for writes; replicas serve reads and provide redundancy. A write is applied on the primary first, then replicated — see replication.md and action-framework.md for TransportReplicationAction. The placement of every copy is described by a ShardRouting inside the RoutingTable (see cluster-state.md), and decided by the allocator (see shard-allocation.md).

ShardRouting carries a small state machine of its own (distinct from the IndexShardState of the actual shard object — see index-shard-lifecycle.md):

stateDiagram-v2
    [*] --> UNASSIGNED
    UNASSIGNED --> INITIALIZING: allocator assigns a node
    INITIALIZING --> STARTED: recovery completes
    STARTED --> RELOCATING: rebalance / drain
    RELOCATING --> STARTED: relocation completes
    STARTED --> UNASSIGNED: node leaves
    INITIALIZING --> UNASSIGNED: recovery fails

Note: Do not conflate the two state machines. ShardRouting.state() is the cluster manager's plan for the shard (a field in cluster state). IndexShard.state() is the data node's reality for the shard object it holds. They converge but can briefly disagree; many allocation bugs live in that gap.
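The routing state diagram above is small enough to write down as a transition table. This sketch encodes exactly the six arrows from the diagram in Python; the enum and helper names are illustrative, and the real ShardRouting enforces its invariants differently.

```python
from enum import Enum


class RoutingState(Enum):
    UNASSIGNED = "UNASSIGNED"
    INITIALIZING = "INITIALIZING"
    STARTED = "STARTED"
    RELOCATING = "RELOCATING"


S = RoutingState

# (from, to) -> the event that causes the transition, as labelled in the diagram.
TRANSITIONS = {
    (S.UNASSIGNED, S.INITIALIZING): "allocator assigns a node",
    (S.INITIALIZING, S.STARTED): "recovery completes",
    (S.STARTED, S.RELOCATING): "rebalance / drain",
    (S.RELOCATING, S.STARTED): "relocation completes",
    (S.STARTED, S.UNASSIGNED): "node leaves",
    (S.INITIALIZING, S.UNASSIGNED): "recovery fails",
}


def can_transition(current: RoutingState, target: RoutingState) -> bool:
    """True if the diagram has an arrow from current to target."""
    return (current, target) in TRANSITIONS
```

For example, there is no arrow from UNASSIGNED straight to STARTED: a copy always passes through INITIALIZING, which is the window in which the routing state and the shard's own state can disagree.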

How services are wired in the Node constructor

The Node constructor (Node(Environment, ...)) is long, deliberately sequential, and the best map of the engine you will find. It constructs every major subsystem in dependency order and registers them for lifecycle management. Read it slowly:

grep -n "new ThreadPool\|new TransportService\|new ClusterService\|new IndicesService\|\
new SearchService\|new NodeClient\|new ActionModule\|new Coordinator\|new GatewayMetaState\|\
new RepositoriesService\|new SnapshotsService\|new PluginsService" \
  server/src/main/java/org/opensearch/node/Node.java

The rough construction order (names/order vary by branch — verify with the grep above):

flowchart TD
    Env[Environment + Settings] --> PS[PluginsService loads plugins]
    PS --> TP[ThreadPool]
    TP --> NWR[NamedWriteableRegistry + NamedXContentRegistry]
    NWR --> TS["TransportService (Netty4Transport)"]
    TS --> CS[ClusterService = MasterService + ClusterApplierService]
    CS --> IS[IndicesService]
    IS --> NC[NodeClient]
    NC --> AM[ActionModule registers TransportActions + REST handlers]
    AM --> SS[SearchService]
    SS --> Coord[Coordinator wires discovery]
    Coord --> Gateway[GatewayMetaState persists state]
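A construction order like this is a topological sort of a dependency graph. The sketch below feeds the arrows from the diagram above into Python's graphlib to recover the order. The edges are copied from the diagram and mean "is built before"; they are not verified against the real constructor, and "Registries" is shorthand used here for the two registry objects.

```python
from graphlib import TopologicalSorter

# component -> the components that must already exist when it is built
BUILT_AFTER = {
    "ThreadPool": {"PluginsService"},
    "Registries": {"ThreadPool"},
    "TransportService": {"Registries"},
    "ClusterService": {"TransportService"},
    "IndicesService": {"ClusterService"},
    "NodeClient": {"IndicesService"},
    "ActionModule": {"NodeClient"},
    "SearchService": {"ActionModule"},
    "Coordinator": {"SearchService"},
    "GatewayMetaState": {"Coordinator"},
}


def construction_order() -> list[str]:
    """Return one valid build order; raises graphlib.CycleError on a circular dependency."""
    return list(TopologicalSorter(BUILT_AFTER).static_order())


if __name__ == "__main__":
    print(" -> ".join(construction_order()))
```

When you read the real constructor, the useful exercise is to replace these edges with the actual constructor arguments of each service and see which orderings remain legal.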

Subsystem built in `Node`	Deep dive
`ThreadPool`	threadpools-concurrency.md
`TransportService` / `Transport`	transport-layer.md
`ClusterService` (`MasterService` + `ClusterApplierService`)	cluster-state-publishing.md
`Coordinator`	discovery-coordination.md
`IndicesService`	index-shard-lifecycle.md
`ActionModule` / `NodeClient`	action-framework.md
`SearchService`	search-execution.md
`PluginsService`	plugin-architecture.md

The constructor also collects each plugin's contributions (extra NamedWriteables, settings, actions, REST handlers) and folds them into the registries — this is the seam every plugin lab exploits. See plugin-architecture.md.

Most constructed services implement LifecycleComponent and are added to a list that Node.start() iterates. To see the lifecycle contract:

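# the file has moved between branches, so locate it first
find server libs -name "AbstractLifecycleComponent.java"
grep -n "lifecycle\|LifecycleComponent\|doStart\|doStop\|doClose" \
  "$(find server libs -name "AbstractLifecycleComponent.java" | head -n 1)"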

Reading exercise

Bring up a node from source and inspect the live model.

# 1. Launch a single-node cluster from source.
./gradlew run
# (in another shell)

# 2. Who is in the cluster, and what roles do they hold?
curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,master,ip"

# 3. The full DiscoveryNodes + roles as the cluster manager sees them.
curl -s "localhost:9200/_nodes?filter_path=nodes.*.name,nodes.*.roles" | python3 -m json.tool

# 4. Create an index and look at its shards / ShardRouting.
curl -s -X PUT "localhost:9200/orders?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"settings":{"number_of_shards":2,"number_of_replicas":0}}'
curl -s "localhost:9200/_cat/shards/orders?v&h=index,shard,prirep,state,node,docs"

# 5. The Index UUID (proves name != identity).
curl -s "localhost:9200/orders/_settings?filter_path=*.settings.index.uuid"

Now answer:

Your single node from ./gradlew run — which roles does it hold, and why does a one-node dev cluster need to be both cluster_manager and data?
In DiscoveryNode, what is the difference between nodeId and ephemeralId, and which one would change if you restarted the same node?
Find where node.roles is parsed into a Set<DiscoveryNodeRole>. What happens if a user lists an unknown role string?
A coordinating-only node has node.roles: []. Trace, in DiscoveryNodes, how such a node is excluded from getClusterManagerNodes() and getDataNodes().
In Node.java, find the line where ClusterService is constructed. What two services does it compose, and which one runs only on the elected cluster manager?
Delete orders, recreate it with the same name, and re-read its UUID. Why does the engine treat the new index as a different Index despite the identical name?

Common bugs and symptoms

Symptom	Root cause	Where to look
Node refuses to start: "the default discovery settings are unsuitable for production"	No `cluster.initial_cluster_manager_nodes` and a non-loopback bind	`BootstrapChecks`; discovery-coordination.md
`_cat/nodes` shows a node but it holds no shards	Node has no `data` role (or is coordinating-only)	`node.roles` in `opensearch.yml`; `DiscoveryNode.getRoles()`
Cluster-level requests (create index, update settings) sent to this node are always forwarded to another node	Node lacks the `cluster_manager` role, so it can never be the elected cluster manager that executes them	`TransportClusterManagerNodeAction`; action-framework.md
Two nodes with same `node.name` both join	`node.name` is cosmetic; identity is `nodeId`	`DiscoveryNode.nodeId`; persisted node UUID
Index recreated with same name shows stale aliases/templates	Code keyed on index name not `Index` (name+UUID)	Compare `Index.getUUID()`; cluster-state.md
`master` deprecation warnings flood the log	Config still uses `node.master` / `cluster.initial_master_nodes`	Migrate to `node.roles` / `cluster.initial_cluster_manager_nodes`

Validation: prove you understand this

From memory, list the five named node roles from the roles table, plus the coordinating-only (no roles) case, and the one-line job of each. Mark which roles make a node election-eligible vs which let it hold shards.
Draw the index → shard → segment hierarchy and annotate it with Index, ShardId, ShardRouting, primary/replica, and Engine/IndexWriter.
Explain the difference between ShardRouting.state() and IndexShard.state(). Give one realistic moment when they disagree.
Open Node.java. Without scrolling past the constructor, name the order in which ThreadPool, TransportService, ClusterService, and IndicesService are built, and explain why that order is forced by their dependencies.
Explain why Index carries a UUID and not just a name. Construct a one-line failure that occurs if code keys a cache on index name alone.
A teammate proposes a node with node.roles: [ingest] only. What can and cannot this node do? Will it ever be elected cluster manager? Will it ever hold a shard?

Discovery and Coordination

Before a cluster can do anything, its nodes must (1) find each other and (2) agree on who is in charge and what the cluster state is. That is the job of the coordination layer in org.opensearch.cluster.coordination. OpenSearch inherits the "Zen2" consensus design from Elasticsearch 7: a custom, Raft-like algorithm with a single elected leader (the cluster manager), a quorum-based voting configuration, and explicit fault detectors. This chapter explains the Coordinator state machine (three modes: a node starts as a CANDIDATE and becomes either the LEADER or a FOLLOWER), the consensus core (CoordinationState), the join protocol, the election machinery, the fault detectors, and how the voting configuration mathematically prevents split brain.

After this chapter you should be able to: explain why a two-node cluster cannot safely tolerate the loss of either node; trace a node from "starting up" to "joined a leader"; name the class that runs each step of an election; and read the diagnostics OpenSearch prints when a cluster fails to form.
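The two-node claim follows from simple majority arithmetic. The sketch below works it through in Python under one simplifying assumption: every cluster-manager-eligible node is in the voting configuration and a quorum is a strict majority of it. The function names are invented for the illustration.

```python
def quorum(voting_nodes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return voting_nodes // 2 + 1


def tolerated_failures(voting_nodes: int) -> int:
    """How many voting nodes can be lost while a quorum is still reachable."""
    return voting_nodes - quorum(voting_nodes)


def both_sides_can_elect(voting_nodes: int, left_partition: int) -> bool:
    """Split brain would need both halves of a network partition to reach quorum."""
    right_partition = voting_nodes - left_partition
    q = quorum(voting_nodes)
    return left_partition >= q and right_partition >= q


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(n, "voters: quorum", quorum(n), "tolerates", tolerated_failures(n))
```

With two voters the quorum is two, so losing either one leaves the survivor unable to win an election. Three voters is the smallest configuration that tolerates a failure, and no partition of any size lets both sides reach a majority at once.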

Note: "cluster manager" is the role formerly called "master." The coordination package still uses Master/master in many class and method names (MasterService, getMasterNode()); read them as "cluster manager."

The package map

find server/src/main/java/org/opensearch/cluster/coordination -name "*.java" | sort
find server/src/main/java/org/opensearch/discovery -name "*.java" | sort

The classes that matter:

Class	Role
`Coordinator`	The top-level state machine. Holds CANDIDATE/LEADER/FOLLOWER mode; owns all the helpers below. Implements `Discovery`.
`CoordinationState`	The consensus core: tracks `currentTerm`, `lastAcceptedState`, `lastCommittedConfiguration`, vote collection. The "Raft state."
`JoinHelper`	Sends/receives join requests; runs the join validation; accumulates joins into a new cluster state.
`PreVoteCollector`	Runs a pre-voting round to avoid disruptive elections; only escalates to a real election if a quorum looks reachable.
`ElectionSchedulerFactory`	Schedules election attempts with randomized, backing-off delays to prevent dueling candidates.
`LeaderChecker`	On followers: pings the leader; if it disappears, triggers a new election.
`FollowersChecker`	On the leader: pings each follower; removes unresponsive nodes from the cluster.
`VotingConfiguration`	The set of `cluster_manager`-eligible node IDs whose votes count for quorum.
`Reconfigurator`	Adjusts the `VotingConfiguration` as nodes join/leave, keeping it odd-sized when possible.
`ClusterFormationFailureHelper`	Periodically logs why the cluster has not formed (the message you will read at 2am).
`PublicationTransportHandler` / `Publication`	The two-phase publish→commit of new cluster states (see cluster-state-publishing.md).

Discovery (finding peers) is separate from coordination (agreeing):

Class	Role
`DiscoveryModule`	Wires the configured discovery type and the seed-hosts providers.
`SeedHostsProvider`	Supplies candidate addresses to probe. Implementations: `SettingsBasedSeedHostsProvider` (`discovery.seed_hosts`), `FileBasedSeedHostsProvider`, plus cloud plugins (`discovery-ec2`, etc.).
`PeerFinder`	Actively probes seed hosts and discovered peers to build the set of known cluster-manager-eligible nodes.

grep -n "class Coordinator\|enum Mode\|CANDIDATE\|LEADER\|FOLLOWER" \
  server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java

The Coordinator state machine

A Coordinator is always in exactly one of three modes:

stateDiagram-v2
    [*] --> CANDIDATE
    CANDIDATE --> LEADER: wins election (quorum of joins)
    CANDIDATE --> FOLLOWER: receives a valid leader publication
    LEADER --> CANDIDATE: loses quorum / FollowersChecker fails
    FOLLOWER --> CANDIDATE: LeaderChecker fails (leader lost)
    LEADER --> [*]: node shuts down
    FOLLOWER --> [*]: node shuts down

Mode	Meaning	Active fault detector
`CANDIDATE`	Not attached to a leader; eligible to start/win an election	`ElectionScheduler` running
`LEADER`	This node is the elected cluster manager; publishes state	`FollowersChecker` (watches followers)
`FOLLOWER`	Attached to a leader; receives published state	`LeaderChecker` (watches the leader)

Trace mode transitions:

grep -n "becomeCandidate\|becomeLeader\|becomeFollower\|mode ==" \
  server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java

Every node starts as a CANDIDATE. It either becomes the leader by winning an election, or finds the leader and becomes a follower.
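The transitions in the diagram can be written down as a small lookup table. This is a minimal sketch, not the Coordinator code: `ModeSketch` and `canTransition` are hypothetical names, and only the edges drawn in the state diagram above are modelled (the real class may allow more).

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the three-mode machine; not the real Coordinator.
final class ModeSketch {
    enum Mode { CANDIDATE, LEADER, FOLLOWER }

    // Allowed edges, copied from the state diagram in this chapter.
    private static final Map<Mode, Set<Mode>> EDGES = new EnumMap<>(Mode.class);
    static {
        EDGES.put(Mode.CANDIDATE, EnumSet.of(Mode.LEADER, Mode.FOLLOWER));
        EDGES.put(Mode.LEADER, EnumSet.of(Mode.CANDIDATE));
        EDGES.put(Mode.FOLLOWER, EnumSet.of(Mode.CANDIDATE));
    }

    // True if the diagram has an edge from one mode to the other.
    static boolean canTransition(Mode from, Mode to) {
        return EDGES.get(from).contains(to);
    }

    public static void main(String[] args) {
        for (Mode from : Mode.values()) {
            for (Mode to : Mode.values()) {
                if (canTransition(from, to)) {
                    System.out.println(from + " -> " + to);
                }
            }
        }
    }
}
```

In this model a FOLLOWER never jumps straight to LEADER: it has to drop back to CANDIDATE and win an election first.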

Terms, votes, and CoordinationState

The consensus core is CoordinationState. Like Raft, it is built on a monotonically increasing term: each election bumps the term, and a node will only accept a leader whose term is at least its own. The key fields:

grep -n "currentTerm\|lastAcceptedState\|lastCommittedConfiguration\|\
lastPublishedConfiguration\|handleStartJoin\|handleJoin\|handlePublishRequest\|handleCommit" \
  server/src/main/java/org/opensearch/cluster/coordination/CoordinationState.java

Concept	Field/method	Invariant
Term	`currentTerm`	Strictly increases; a higher term wins.
Accepted state	`lastAcceptedState`	The most recent state this node has accepted (but maybe not committed).
Last committed config	`lastCommittedConfiguration`	The voting configuration in force for quorum math.
Last accepted config	`lastAcceptedConfiguration`	The config being transitioned to (may differ during reconfig).

An election starts when a candidate issues a StartJoin at term T and then collects Join votes. The candidate becomes leader once it holds a quorum of joins from the current VotingConfiguration. It then publishes a new cluster state (term T, version V+1), which followers accept and then commit; that is the two-phase publish covered in cluster-state-publishing.md.
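The term and vote bookkeeping can be modelled in a few lines. This is a heavily simplified sketch under stated assumptions: `TermSketch` is a hypothetical class, its method names only echo `handleStartJoin` and `handleJoin`, and it ignores accepted state, persistence, and reconfiguration entirely.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of term and vote bookkeeping.
final class TermSketch {
    private long currentTerm = 0;
    private final Set<String> joinVotes = new HashSet<>();

    // A StartJoin for a newer term bumps the term and discards old votes.
    boolean handleStartJoin(long term) {
        if (term <= currentTerm) {
            return false; // stale or duplicate term: ignore
        }
        currentTerm = term;
        joinVotes.clear();
        return true;
    }

    // Record a Join vote, but only if it was cast for the current term.
    boolean handleJoin(String nodeId, long term) {
        if (term != currentTerm) {
            return false;
        }
        joinVotes.add(nodeId);
        return true;
    }

    // The election is won once a strict majority of the voting config has joined.
    boolean electionWon(Set<String> votingConfig) {
        long votes = votingConfig.stream().filter(joinVotes::contains).count();
        return votes * 2 > votingConfig.size();
    }

    long currentTerm() {
        return currentTerm;
    }

    public static void main(String[] args) {
        TermSketch state = new TermSketch();
        Set<String> config = Set.of("a", "b", "c");
        state.handleStartJoin(1);
        state.handleJoin("a", 1);
        System.out.println("after one join: " + state.electionWon(config));
        state.handleJoin("b", 1);
        System.out.println("after two joins: " + state.electionWon(config));
    }
}
```

Clearing the votes on every term bump is what keeps a vote cast in an old term from counting toward a newer election.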

A node from boot to "joined"

sequenceDiagram
    participant N as New node (CANDIDATE)
    participant PF as PeerFinder/SeedHostsProvider
    participant PV as PreVoteCollector
    participant ES as ElectionScheduler
    participant L as Existing leader
    N->>PF: probe discovery.seed_hosts
    PF-->>N: discovered cluster-manager-eligible peers
    alt a leader already exists
        N->>L: send Join (handleJoin)
        L-->>N: publish current cluster state
        Note over N: becomeFollower, start LeaderChecker
    else no leader
        N->>PV: run pre-vote round
        PV-->>N: quorum reachable?
        N->>ES: schedule election (randomized delay)
        ES->>N: startElection at term T
        N->>N: collect Joins from voting config
        Note over N: quorum reached -> becomeLeader
        Note over N: publish new state (term T) to the discovered peers
    end

Read the join handshake:

grep -n "sendJoinRequest\|handleJoinRequest\|JoinCallback\|validateJoinRequest" \
  server/src/main/java/org/opensearch/cluster/coordination/JoinHelper.java

The pre-vote round (PreVoteCollector) is a politeness optimization: a node asks "if I started an election, could I plausibly win?" before bumping the term. This prevents a node that is merely partitioned away from the rest of the cluster from repeatedly disrupting a healthy cluster by forcing term increases.

grep -n "PreVoteRequest\|PreVoteResponse\|start(\|update(" \
  server/src/main/java/org/opensearch/cluster/coordination/PreVoteCollector.java
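The key property of pre-voting is that the term is only touched after a quorum of peers has said it would vote. A minimal sketch, assuming a hypothetical `PreVoteSketch` class in which each peer's answer is reduced to a single boolean (the real responses carry more than that):

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of "ask first, bump the term second".
final class PreVoteSketch {
    private long term = 0;

    long term() {
        return term;
    }

    // grants: peer id -> whether that peer said it would vote for us.
    boolean maybeStartElection(Set<String> votingConfig, Map<String, Boolean> grants) {
        long yes = votingConfig.stream()
            .filter(id -> grants.getOrDefault(id, false))
            .count();
        if (yes * 2 <= votingConfig.size()) {
            return false; // no plausible quorum: leave the term untouched
        }
        term++; // only now pay the cost of a real election
        return true;
    }

    public static void main(String[] args) {
        PreVoteSketch node = new PreVoteSketch();
        Set<String> config = Set.of("a", "b", "c");
        // An isolated node only hears its own answer, so it never escalates.
        System.out.println(node.maybeStartElection(config, Map.of("a", true)));
        System.out.println("term is still " + node.term());
    }
}
```

An isolated node can repeat the pre-vote round indefinitely without disturbing anyone, because no term increase ever reaches the healthy side.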

The voting configuration and split-brain prevention

The voting configuration is the set of cluster-manager-eligible node IDs whose votes count toward quorum. Quorum = strict majority of the voting configuration. This is the single rule that prevents split brain.

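# VotingConfiguration is a nested class of CoordinationMetadata, so it has no file of its own.
grep -n "class VotingConfiguration" \
  server/src/main/java/org/opensearch/cluster/coordination/CoordinationMetadata.java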
grep -n "hasQuorum\|isEmpty\|getNodeIds" \
  server/src/main/java/org/opensearch/cluster/coordination/CoordinationMetadata.java

Worked example — why a strict majority is safe:

Voting config size	Quorum needed	Max simultaneous failures tolerated
1	1	0 (any loss = no cluster manager)
2	2	0 (losing either kills quorum)
3	2	1
5	3	2
7	4	3

The reason two disjoint halves can never both elect a leader: each elected leader needs a strict majority of the same voting configuration, and two disjoint majorities of one set cannot exist. A partition that splits 5 eligible nodes into 3+2 lets only the side-of-3 form a cluster; the side-of-2 stays CANDIDATE and serves no writes. This is the formal split-brain guarantee.
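The table and the 3+2 partition argument reduce to integer arithmetic. The sketch below uses hypothetical names (`QuorumSketch`, `quorumSize`, `tolerated`); its `hasQuorum` only mirrors the idea of the real method by counting reachable voters against a strict majority.

```java
import java.util.Set;

// Hypothetical sketch of the quorum arithmetic from the table above.
final class QuorumSketch {
    // Smallest strict majority of n voters.
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    // Failures tolerated while a quorum can still be assembled.
    static int tolerated(int n) {
        return n - quorumSize(n);
    }

    // Count the reachable voters and require a strict majority of the config.
    static boolean hasQuorum(Set<String> votingConfig, Set<String> reachable) {
        long votes = votingConfig.stream().filter(reachable::contains).count();
        return votes * 2 > votingConfig.size();
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 3, 5, 7}) {
            System.out.println(n + " voters: quorum " + quorumSize(n)
                + ", tolerates " + tolerated(n));
        }
        Set<String> config = Set.of("n1", "n2", "n3", "n4", "n5");
        System.out.println("side of 3: " + hasQuorum(config, Set.of("n1", "n2", "n3")));
        System.out.println("side of 2: " + hasQuorum(config, Set.of("n4", "n5")));
    }
}
```

Because both sides are measured against the same five-node configuration, at most one side of any partition can pass the check.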

Warning: Deploy an odd number of cluster-manager-eligible nodes (3 or 5 for HA). A voting configuration of two is the worst case: quorum is 2, so the failure of either node leaves the cluster without a cluster manager. The Reconfigurator automatically shrinks/grows the voting config to keep it odd when nodes are added/removed, but it cannot make a two-eligible-node topology fault tolerant. Confirm:
grep -n "reconfigure\|ODD\|auto_shrink_voting_configuration" \
  server/src/main/java/org/opensearch/cluster/coordination/Reconfigurator.java

Bootstrapping the very first cluster

A brand-new cluster has no committed voting configuration yet, so a chicken/egg problem appears: you cannot elect a leader without a voting config, and you cannot commit a voting config without a leader. This is resolved by cluster bootstrapping: the operator lists the initial cluster-manager-eligible nodes once, and the cluster uses that list as the seed voting configuration.

grep -rn "INITIAL_CLUSTER_MANAGER_NODES\|initial_cluster_manager_nodes\|\
INITIAL_MASTER_NODES\|ClusterBootstrapService" \
  server/src/main/java/org/opensearch/cluster/coordination/

The setting:

Setting	Status
`cluster.initial_cluster_manager_nodes`	Current canonical name.
`cluster.initial_master_nodes`	Deprecated alias — same effect, logs a deprecation.

Warning: cluster.initial_*_nodes is a bootstrap-only setting. Set it on the first formation, then remove it. Leaving it set, or setting it differently on different nodes, can cause two independent clusters to form from one set of nodes — a real split brain that bypasses the quorum guarantee.
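Why mismatched bootstrap lists bypass the quorum guarantee can be shown with the same majority rule. This is a sketch under stated assumptions: `BootstrapSketch` and `canElectAlone` are hypothetical, and each node is assumed to treat the list it was started with as its seed voting configuration.

```java
import java.util.Set;

// Hypothetical sketch: each node checks quorum against ITS OWN bootstrap list.
final class BootstrapSketch {
    // Can `self` win an election with only its own vote?
    static boolean canElectAlone(String self, Set<String> initialNodes) {
        long votes = initialNodes.contains(self) ? 1 : 0;
        return votes * 2 > initialNodes.size();
    }

    public static void main(String[] args) {
        // Same list everywhere: no single node can form a cluster by itself.
        Set<String> shared = Set.of("a", "b", "c");
        System.out.println("shared list, a alone: " + canElectAlone("a", shared));

        // Different list per node: both succeed, so two clusters form.
        System.out.println("a with [a]: " + canElectAlone("a", Set.of("a")));
        System.out.println("b with [b]: " + canElectAlone("b", Set.of("b")));
    }
}
```

The majority rule only protects against split brain when every node measures it against the same set; divergent bootstrap lists give each node a different set.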

Fault detection: LeaderChecker and FollowersChecker

A healthy cluster runs two heartbeat loops in opposite directions:

Detector	Runs on	Watches	On failure
`LeaderChecker`	every follower	the leader	follower → CANDIDATE, triggers election
`FollowersChecker`	the leader	every follower	leader removes the node from the cluster state

grep -n "leaderFailed\|handleLeaderCheck\|class LeaderChecker" \
  server/src/main/java/org/opensearch/cluster/coordination/LeaderChecker.java
grep -n "FollowerChecker\|onNodeFailure\|handleDisconnectedNode" \
  server/src/main/java/org/opensearch/cluster/coordination/FollowersChecker.java

These are governed by timeout/retry settings (cluster.fault_detection.*). Tuning them too aggressively causes spurious elections under transient network blips; too loosely delays recovery from real failures.
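The tuning trade-off comes from how consecutive failures are counted. A minimal sketch, assuming a hypothetical `CheckerSketch` in which `retryCount` stands in for a retry-count style setting; it does not model timeouts or the immediate reaction to a dropped connection.

```java
// Hypothetical sketch of consecutive-failure counting in a heartbeat checker.
final class CheckerSketch {
    private final int retryCount;
    private int consecutiveFailures = 0;

    CheckerSketch(int retryCount) {
        this.retryCount = retryCount;
    }

    // Feed one check result; returns true when the peer should be declared failed.
    boolean onCheckResult(boolean success) {
        if (success) {
            consecutiveFailures = 0; // any success resets the streak
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= retryCount;
    }

    public static void main(String[] args) {
        CheckerSketch checker = new CheckerSketch(3);
        boolean[] results = {false, false, true, false, false, false};
        for (boolean ok : results) {
            System.out.println("check ok=" + ok + " -> failed=" + checker.onCheckResult(ok));
        }
    }
}
```

A low retry count turns a brief blip into a declared failure; a high one keeps a dead peer in the cluster for many check intervals.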

When the cluster will not form: ClusterFormationFailureHelper

If a node sits in CANDIDATE mode and cannot form/join a cluster, OpenSearch does not stay silent — ClusterFormationFailureHelper logs a structured diagnosis every few seconds. This message is the most valuable single artifact for debugging discovery problems.

grep -n "describeQuorum\|getDescription\|class ClusterFormationFailureHelper" \
  server/src/main/java/org/opensearch/cluster/coordination/ClusterFormationFailureHelper.java

A typical message names: the discovered peers, the nodes still needed for quorum, and whether bootstrapping has happened. Learn to read it; it tells you which of the failure modes below you are in.

Reading exercise

# 1. Read the three-mode machine.
grep -n "becomeCandidate\|becomeLeader\|becomeFollower" \
  server/src/main/java/org/opensearch/cluster/coordination/Coordinator.java

# 2. Find the consensus invariants.
grep -n "currentTerm\|handleJoin\|handlePublishRequest\|handleCommit" \
  server/src/main/java/org/opensearch/cluster/coordination/CoordinationState.java

# 3. Run the coordination unit tests and watch a simulated election.
./gradlew :server:test --tests "org.opensearch.cluster.coordination.CoordinatorTests"
./gradlew :server:test --tests "org.opensearch.cluster.coordination.CoordinationStateTests"

# 4. The "why won't it form" message.
grep -rn "cluster-manager not discovered\|master not discovered\|describeQuorum" \
  server/src/main/java/org/opensearch/cluster/coordination/

Answer:

In Coordinator, what condition causes becomeLeader to be called, and what is CoordinationState.currentTerm immediately afterward?
What does PreVoteCollector do, and what disruption does it prevent in a cluster where one node is flapping?
Given a 5-node cluster-manager-eligible voting configuration, exactly how many node failures can it tolerate while still electing a leader? Show the quorum arithmetic.
Why is cluster.initial_cluster_manager_nodes only needed on the very first formation? What goes wrong if it is left set and two nodes are isolated at startup?
A follower's LeaderChecker fails three times. Trace, in code, the resulting mode transition and what the node does next.
In Reconfigurator, find where the voting configuration is kept odd-sized. Why does evenness hurt availability?

Common bugs and symptoms

Symptom	Root cause	Where to look
Node logs "cluster-manager not discovered yet" forever	Wrong `discovery.seed_hosts`, or no quorum of eligible nodes reachable	`ClusterFormationFailureHelper`; `SeedHostsProvider`
Two separate one-node clusters form from one fleet	`cluster.initial_cluster_manager_nodes` set differently per node, or left set after first boot	`ClusterBootstrapService`; the bootstrap setting
Frequent leader re-elections under load	Fault-detection timeouts too aggressive, or GC pauses on the leader	`LeaderChecker`/`FollowersChecker` settings; GC logs
Cluster halts after losing one of two eligible nodes	Even-sized voting configuration; quorum = 2	`VotingConfiguration`; deploy 3 eligible nodes
New node joins but is immediately removed	The node can reach the leader to join, but the leader's `FollowersChecker` cannot reach it back (asymmetric network)	`FollowersChecker.onNodeFailure`; firewall/MTU
Election storm with dueling candidates	`ElectionSchedulerFactory` backoff defeated, or clock skew	`ElectionSchedulerFactory`; randomized delay logic

Validation: prove you understand this

Draw the CANDIDATE/LEADER/FOLLOWER state diagram and label every transition with the class that triggers it (ElectionScheduler, LeaderChecker, FollowersChecker, a valid publication).
Explain, with the quorum arithmetic, why a 2-eligible-node cluster is strictly worse for availability than a 1-eligible-node cluster and a 3-eligible-node cluster.
From memory, name the four classes that participate in winning an election (scheduling, pre-voting, vote collection, join accumulation) and the order they fire in.
A new node will not join an existing cluster. List the three pieces of information the ClusterFormationFailureHelper message gives you, and how each narrows the diagnosis.
Explain the role of currentTerm in preventing a stale ex-leader from continuing to publish after a partition heals.
Why is cluster.initial_cluster_manager_nodes dangerous to leave configured in steady state? Describe the concrete split-brain it can produce.

The ClusterState Object

The ClusterState is the single most important data structure in OpenSearch. It is an immutable snapshot of everything the cluster has agreed upon: which indices exist and how they are mapped and configured (Metadata), where every shard copy lives (RoutingTable), who the nodes are (DiscoveryNodes), and what operations are currently blocked (ClusterBlocks). Every node holds a copy; the elected cluster manager produces new versions and publishes them to everyone (see cluster-state-publishing.md). If you understand this object, the rest of the cluster layer is bookkeeping around it.

After this chapter you should be able to: enumerate the top-level components of a cluster state; explain why immutability + the builder pattern + Diffable are load-bearing, not stylistic; read a live cluster state from a running node and map each JSON section back to a Java class; and find where any piece of metadata (a mapping, a setting, a shard assignment) lives in the object graph.

Note: Many accessor names still say master — e.g. DiscoveryNodes.getMasterNode(). Read "master" as "cluster manager" throughout; the role was renamed for inclusive language but the field names linger.

The top-level structure

find server/src/main/java/org/opensearch/cluster -name "ClusterState.java"
grep -n "private final\|public Metadata\|public RoutingTable\|public DiscoveryNodes\|\
public ClusterBlocks\|long version\|String stateUUID" \
  server/src/main/java/org/opensearch/cluster/ClusterState.java

A ClusterState is composed of:

Component	Class	Contains
`metadata`	`Metadata`	`IndexMetadata` per index (mappings, settings, aliases), index templates, persistent cluster settings, `IndexGraveyard`, custom metadata.
`routingTable`	`RoutingTable`	The shard assignment plan: per-index `IndexRoutingTable` → per-shard `IndexShardRoutingTable` → `ShardRouting`s.
`nodes`	`DiscoveryNodes`	All nodes, the elected cluster manager, the local node.
`blocks`	`ClusterBlocks`	Global and per-index blocks (read-only, metadata-write, etc.).
`customs`	`Map<String, Custom>`	Pluggable metadata: snapshots in progress, restore in progress, persistent tasks, etc.
`version` / `term` / `stateUUID`	scalar values (`long`, `long`, `String`)	Monotonic version (per leader), coordination term, unique state UUID.
`routingNodes`	`RoutingNodes` (derived)	A node-centric view of the routing table, lazily built for the allocator.

flowchart TD
    CS[ClusterState] --> M[Metadata]
    CS --> RT[RoutingTable]
    CS --> DN[DiscoveryNodes]
    CS --> CB[ClusterBlocks]
    CS --> CU["customs: SnapshotsInProgress, ..."]
    M --> IM["IndexMetadata per index"]
    IM --> MAP[mappings]
    IM --> SET[settings]
    IM --> AL[aliases]
    M --> TPL[templates]
    M --> PCS[persistent cluster settings]
    RT --> IRT[IndexRoutingTable]
    IRT --> ISRT[IndexShardRoutingTable]
    ISRT --> SR["ShardRouting (primary/replica)"]

Metadata: the durable truth

Metadata holds everything that must survive a full cluster restart. Read its shape:

find server -name "Metadata.java" -path "*cluster/metadata*"
find server -name "IndexMetadata.java"
grep -n "ImmutableOpenMap\|indices\|templates\|persistentSettings\|customs\|indexGraveyard" \
  server/src/main/java/org/opensearch/cluster/metadata/Metadata.java

IndexMetadata is the per-index record. The fields you will touch most:

Field	Meaning	Related deep dive
`settings`	`number_of_shards`, `number_of_replicas`, refresh interval, analysis config, etc.	mapping-and-analysis.md
`mappings`	The compiled mapping (`MappingMetadata`)	mapping-and-analysis.md
`aliases`	Alias definitions and filters	—
`state`	`OPEN` or `CLOSE`	index-shard-lifecycle.md
`inSyncAllocationIds`	Per-shard set of allocation IDs known to be in-sync	replication.md, recovery.md
`primaryTerms`	Per-shard primary term (bumped on each new primary)	replication.md

Note: number_of_shards is fixed at index creation and is part of the index's identity (it determines routing). number_of_replicas is dynamic. The asymmetry is encoded directly in IndexMetadata/IndexScopedSettings.

RoutingTable: where every shard copy lives

The RoutingTable is the plan for shard placement — the cluster manager's decision about which node holds each primary and replica. It is the output of the allocator (see shard-allocation.md).

find server -name "RoutingTable.java"
find server -name "IndexShardRoutingTable.java"
grep -n "primaryShard\|replicaShards\|allShards\|class IndexShardRoutingTable" \
  server/src/main/java/org/opensearch/cluster/routing/IndexShardRoutingTable.java

The hierarchy mirrors the data hierarchy from cluster-and-node-model.md:

RoutingTable
 └─ IndexRoutingTable (per index)
     └─ IndexShardRoutingTable (per shard id)
         ├─ ShardRouting (primary)
         └─ ShardRouting (replica…)

RoutingNodes is a derived, node-keyed view of the same data — "for node X, which shards are assigned here?" — that the allocator mutates in place during a single allocation round and then snapshots back into an immutable RoutingTable. That mutate-then-freeze pattern is the one exception to the immutability rule, and it is carefully contained.

grep -n "class RoutingNodes\|mutable\|unassigned()\|node(" \
  server/src/main/java/org/opensearch/cluster/routing/RoutingNodes.java
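The relationship between the two views is a plain inversion of the same data. In the sketch below, `RoutingViewSketch` and `Copy` are hypothetical stand-ins (a `Copy` plays the role of a `ShardRouting`); the real `RoutingNodes` tracks far more, such as unassigned and relocating shards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: derive a node-keyed view from a shard-keyed list.
final class RoutingViewSketch {
    // Stand-in for ShardRouting: one shard copy and the node holding it.
    static final class Copy {
        final String index;
        final int shard;
        final boolean primary;
        final String nodeId;

        Copy(String index, int shard, boolean primary, String nodeId) {
            this.index = index;
            this.shard = shard;
            this.primary = primary;
            this.nodeId = nodeId;
        }

        @Override
        public String toString() {
            return "[" + index + "][" + shard + "]" + (primary ? "[P]" : "[R]");
        }
    }

    // Shard-centric list in, node-centric map out; the input is not modified.
    static Map<String, List<Copy>> byNode(List<Copy> routingTable) {
        Map<String, List<Copy>> view = new TreeMap<>();
        for (Copy c : routingTable) {
            view.computeIfAbsent(c.nodeId, k -> new ArrayList<>()).add(c);
        }
        return view;
    }

    public static void main(String[] args) {
        List<Copy> table = List.of(
            new Copy("orders", 0, true, "node-1"),
            new Copy("orders", 0, false, "node-2"),
            new Copy("orders", 1, true, "node-2"));
        byNode(table).forEach((node, copies) -> System.out.println(node + " holds " + copies));
    }
}
```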

ClusterBlocks: what is forbidden right now

A ClusterBlocks is a set of ClusterBlock instances that veto operations. Blocks are global or per-index, and each declares which operation levels it blocks (READ, WRITE, METADATA_READ, METADATA_WRITE).

find server -name "ClusterBlock.java" -o -name "ClusterBlocks.java"
grep -n "static final ClusterBlock\|NO_MASTER_BLOCK\|INDEX_READ_ONLY\|STATE_NOT_RECOVERED" \
  server/src/main/java/org/opensearch/cluster/block/ClusterBlocks.java \
  server/src/main/java/org/opensearch/cluster/coordination/NoClusterManagerBlockService.java 2>/dev/null

Block	When	Effect
`STATE_NOT_RECOVERED_BLOCK`	Cluster just started, state not yet recovered from disk	Blocks most ops until gateway recovery finishes
no-cluster-manager block	No elected cluster manager	Blocks writes (and optionally reads) until one is elected
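`INDEX_READ_ONLY_BLOCK`	`index.blocks.read_only` set	Blocks writes to that index
`INDEX_READ_ONLY_ALLOW_DELETE_BLOCK`	`index.blocks.read_only_allow_delete` set, typically by the flood-stage disk watermark	Blocks writes to that index but still allows deleting it
`INDEX_CLOSED_BLOCK`	Index closed	Blocks data ops on a closed index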

A request that hits a block fails fast with a ClusterBlockException. This is why "node has no cluster manager" surfaces to clients as a 503-style block rather than a hang.
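The fail-fast behaviour amounts to checking the requested level against every active block before doing any work. A minimal sketch with hypothetical names (`BlockSketch`, `Block`, `checkBlocked`); it throws a plain `IllegalStateException` where the real code raises a `ClusterBlockException`.

```java
import java.util.EnumSet;
import java.util.List;

// Hypothetical sketch of a level-based block check.
final class BlockSketch {
    enum Level { READ, WRITE, METADATA_READ, METADATA_WRITE }

    static final class Block {
        final String description;
        final EnumSet<Level> levels;

        Block(String description, EnumSet<Level> levels) {
            this.description = description;
            this.levels = EnumSet.copyOf(levels);
        }
    }

    // True if any active block covers the requested level.
    static boolean isBlocked(List<Block> active, Level level) {
        return active.stream().anyMatch(b -> b.levels.contains(level));
    }

    // Fail fast instead of letting the request hang.
    static void checkBlocked(List<Block> active, Level level) {
        for (Block b : active) {
            if (b.levels.contains(level)) {
                throw new IllegalStateException("blocked by: " + b.description);
            }
        }
    }

    public static void main(String[] args) {
        Block readOnly = new Block("index read-only",
            EnumSet.of(Level.WRITE, Level.METADATA_WRITE));
        List<Block> active = List.of(readOnly);
        System.out.println("READ blocked: " + isBlocked(active, Level.READ));
        System.out.println("WRITE blocked: " + isBlocked(active, Level.WRITE));
    }
}
```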

Immutability, the builder pattern, and versioning

ClusterState and every component are immutable. You never mutate a state; you build a new one from an old one:

ClusterState newState = ClusterState.builder(currentState)
    .metadata(Metadata.builder(currentState.metadata()).put(newIndexMetadata, true))
    .routingTable(updatedRoutingTable)
    .build();

grep -n "public static Builder builder\|public ClusterState build\|incrementVersion" \
  server/src/main/java/org/opensearch/cluster/ClusterState.java

Why immutability is not optional here:

Safe sharing across threads. Dozens of components read the current state concurrently (the search layer, the indices layer, the REST layer). If state were mutable, every reader would need locking. Instead, each reader grabs a reference to an immutable snapshot and is guaranteed it will not change under them. See threadpools-concurrency.md.
Atomic transitions. A new state appears all-at-once. There is no window where, say, the routing table reflects a new index but the metadata does not.
Cheap diffing. Because old and new states are separate immutable objects that share most of their sub-structures, you can compute a compact Diff.

The version field is a monotonically increasing counter (per leader term). Combined with the coordination term, it gives a total order: a state with a higher (term, version) supersedes a lower one. Nodes reject older states.
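The (term, version) ordering is a lexicographic comparison: term first, version as the tie-breaker. The sketch below is illustrative only; `StateOrderSketch`, `Stamp`, and `supersedes` are hypothetical names, not the check the coordination layer actually runs.

```java
import java.util.Comparator;

// Hypothetical sketch of ordering cluster states by (term, version).
final class StateOrderSketch {
    static final class Stamp {
        final long term;
        final long version;

        Stamp(long term, long version) {
            this.term = term;
            this.version = version;
        }
    }

    // Compare by term first, then by version.
    static final Comparator<Stamp> ORDER =
        Comparator.comparingLong((Stamp s) -> s.term).thenComparingLong(s -> s.version);

    // A node accepts an incoming state only if it is strictly newer.
    static boolean supersedes(Stamp incoming, Stamp current) {
        return ORDER.compare(incoming, current) > 0;
    }

    public static void main(String[] args) {
        Stamp current = new Stamp(2, 100);
        System.out.println(supersedes(new Stamp(2, 101), current));
        // A stale ex-leader from term 1 loses regardless of its version.
        System.out.println(supersedes(new Stamp(1, 150), current));
    }
}
```

Comparing the term first is what lets a node discard states from a deposed leader even when that leader kept incrementing its version.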

Diff and Diffable: efficient publishing

Publishing a full multi-megabyte cluster state to every node on every tiny change would be wasteful. Most components implement Diffable<T>, which can emit a Diff<T> — a compact description of what changed from a previous version.

find server libs -name "Diffable.java" -o -name "Diff.java" -o -name "Diffs.java"
grep -rn "implements Diffable\|Diff<.*> diff(" \
  server/src/main/java/org/opensearch/cluster/ | head

flowchart LR
    Old["ClusterState v100"] --> D["Diff = changes only"]
    New["ClusterState v101"] --> D
    D -->|"small bytes"| Follower["follower applies Diff to its v100"]
    Follower --> Result["follower now at v101"]

The publishing layer sends a full state to nodes that are behind (or new), and a Diff to nodes that already hold the immediately previous version — a large bandwidth win on big clusters. The mechanism is detailed in cluster-state-publishing.md; the serialization contract (and its version gating) is in serialization-bwc.md.

Warning: Diffable correctness is subtle. If your component's diff and readDiffFrom disagree with its full read/write, followers will silently diverge from the leader. This is exactly the class of bug that the AbstractDiffableSerializationTestCase round-trip tests exist to catch. If you add a Custom to cluster state, you owe a diff test.
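The contract a diff must honour is a round trip: applying diff(old, new) to old has to reproduce new exactly. A minimal sketch over plain string maps, with hypothetical names (`MapDiffSketch`, `Diff`, `apply`) that are unrelated to the real `Diffable` interfaces:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a map diff and its round-trip property.
final class MapDiffSketch {
    // A diff between two maps: entries to upsert and keys to delete.
    static final class Diff {
        final Map<String, String> upserts = new HashMap<>();
        final Set<String> deletes = new HashSet<>();
    }

    static Diff diff(Map<String, String> before, Map<String, String> after) {
        Diff d = new Diff();
        for (Map.Entry<String, String> e : after.entrySet()) {
            // New key or changed value: ship the entry.
            if (!e.getValue().equals(before.get(e.getKey()))) {
                d.upserts.put(e.getKey(), e.getValue());
            }
        }
        for (String key : before.keySet()) {
            if (!after.containsKey(key)) {
                d.deletes.add(key);
            }
        }
        return d;
    }

    // Builds a new map; the previous version is left untouched.
    static Map<String, String> apply(Diff d, Map<String, String> before) {
        Map<String, String> out = new HashMap<>(before);
        out.keySet().removeAll(d.deletes);
        out.putAll(d.upserts);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> v100 = Map.of("a", "1", "b", "2", "c", "3");
        Map<String, String> v101 = Map.of("a", "1", "b", "9", "d", "4");
        Diff d = diff(v100, v101);
        System.out.println("round trip ok: " + apply(d, v100).equals(v101));
    }
}
```

A diff that forgets the delete set, for example, still passes casual testing but leaves followers holding entries the leader has already dropped.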

Customs: pluggable cluster state

Subsystems that need to track cluster-wide, transient or durable data attach a Custom to either the ClusterState (transient, e.g. SnapshotsInProgress, RestoreInProgress) or the Metadata (durable, e.g. persistent tasks, index-state-management metadata from a plugin).

grep -rn "implements ClusterState.Custom\|implements Metadata.Custom" \
  server/src/main/java/org/opensearch/ | head -20

This is the extension point plugins use to put their own state into the agreed cluster state. It is covered in plugin-architecture.md.

Reading the live state

The single most useful command for understanding this object:

# Full state (large). Scope it with filter_path.
curl -s "localhost:9200/_cluster/state?pretty" | less

# Just the routing table for one index.
curl -s "localhost:9200/_cluster/state/routing_table/orders?pretty"

# Just metadata of one index (settings + mappings).
curl -s "localhost:9200/_cluster/state/metadata/orders?pretty"

# Just the nodes and the elected cluster manager.
curl -s "localhost:9200/_cluster/state/master_node,nodes?pretty"

# Just the blocks.
curl -s "localhost:9200/_cluster/state/blocks?pretty"

# The version + UUID + term (proves monotonic versioning).
curl -s "localhost:9200/_cluster/state?filter_path=version,state_uuid,cluster_uuid,metadata.cluster_coordination"

Each top-level JSON key maps directly to a Java field:

JSON section	Java component
`metadata.indices.*`	`Metadata` → `IndexMetadata`
`routing_table.indices.*`	`RoutingTable` → `IndexRoutingTable`
`nodes` / `master_node`	`DiscoveryNodes`
`blocks`	`ClusterBlocks`
`snapshots`, `restore`, ...	`ClusterState.Custom`s
`version`, `state_uuid`	`ClusterState` primitives

Reading exercise

# 1. The class itself.
grep -n "private final\|Builder\|class ClusterState" \
  server/src/main/java/org/opensearch/cluster/ClusterState.java

# 2. How IndexMetadata is built and what it carries.
grep -n "Builder\|numberOfShards\|numberOfReplicas\|putMapping\|state(" \
  server/src/main/java/org/opensearch/cluster/metadata/IndexMetadata.java

# 3. Round-trip serialization tests (the BWC safety net).
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests"
./gradlew :server:test --tests "org.opensearch.cluster.metadata.MetadataTests"

# 4. Find every ClusterState.Custom.
grep -rn "implements ClusterState.Custom" server/src/main/java

Answer:

Name the five top-level components of a ClusterState and one thing each one holds.
number_of_shards cannot be changed after index creation, but number_of_replicas can. Where in IndexMetadata/IndexScopedSettings is that distinction enforced?
Explain why ClusterState is immutable in terms of concurrency. What would the search layer have to do differently if it were mutable?
What is the relationship between RoutingTable and RoutingNodes? Why does RoutingNodes get to be mutable when nothing else is?
Open the live blocks section after starting a node before any index exists. Which blocks are present during gateway recovery, and which class adds them?
You add a Custom to track per-cluster state for a plugin. Which two serialization paths must agree (full read/write vs diff), and which test base class verifies it?

Common bugs and symptoms

Symptom	Root cause	Where to look
Followers diverge from the cluster manager over time	A `Diffable.diff()` that doesn't match the full write	`AbstractDiffableSerializationTestCase`; the component's `diff`/`readDiffFrom`
Writes rejected with `ClusterBlockException` after restart	`STATE_NOT_RECOVERED_BLOCK` still in place; gateway recovery not done	`GatewayService` (gateway recovery); `ClusterBlocks`
All writes 503 while reads work	No-cluster-manager block applied	discovery-coordination.md; `NoClusterManagerBlockService`
Stale index data after delete+recreate	Code compared index by name, not `Index` (name+UUID)	`Metadata.index(...)` overloads; `Index.getUUID()`
Cluster state grows huge, publish slow	Too many indices/shards or a bloated `Custom`	`_cluster/state` size; reduce shard count; audit customs
`setting [X] not recognized` on update	Setting not registered in `IndexScopedSettings`/`ClusterSettings`	the `*ScopedSettings` registration

Validation: prove you understand this

Draw the ClusterState object graph from memory down to ShardRouting and IndexMetadata.mappings, labeling each edge with its Java type.
Explain the three concrete benefits of immutability (safe sharing, atomic transitions, cheap diffs) and give a bug that each one prevents.
Using a running node, fetch only the routing table for one index via filter_path and identify the primary's node and the replica's state.
Explain how (term, version) gives a total order over cluster states and why a node rejects a state with a lower pair.
Describe what Diffable buys the publishing layer on a 200-node cluster, and name the failure mode of a buggy diff().
List the four block levels (READ, WRITE, METADATA_READ, METADATA_WRITE) and give one real block for each level.

Cluster State Publishing

A ClusterState is only useful if every node sees the same one. This chapter explains how a new state is produced on the elected cluster manager and distributed to the whole cluster. There are two distinct services and two distinct jobs: MasterService (the cluster-manager service) computes new states from a queue of update tasks, and ClusterApplierService applies committed states to local components. Between produce and apply sits a two-phase publish→commit protocol that guarantees a state is durably accepted by a quorum before it is allowed to take effect anywhere. Get this chapter wrong in code and you cause cluster-wide hangs; get it right and you can safely react to any cluster change.

After this chapter you should be able to: write a correct ClusterStateUpdateTask; explain why update tasks are batched and on which thread they run; distinguish ClusterStateApplier from ClusterStateListener and know which runs first; trace a state from "submitted" to "committed and applied everywhere"; and explain why blocking inside an applier is a cardinal sin.

Note: MasterService is the cluster-manager service. The class keeps the old name; treat "master" as "cluster manager." OpenSearch also exposes ClusterManagerService as an alias in places.

Two services, two jobs

find server/src/main/java/org/opensearch/cluster/service -name "*.java"
grep -n "class MasterService\|class ClusterApplierService\|class ClusterService" \
  server/src/main/java/org/opensearch/cluster/service/*.java

Service	Runs on	Job	Key thread
`MasterService`	the elected cluster manager only	Pull update tasks off a queue, batch compatible ones, run their executor to compute a new `ClusterState`, then publish it.	`cluster-manager` / `masterService#updateTask`
`ClusterApplierService`	every node	Receive a committed state, run appliers then listeners to apply it locally.	`clusterApplierService#updateTask`
`ClusterService`	every node	Façade that composes the two and exposes the current state.	—

flowchart LR
    subgraph "Cluster manager node"
      Q[update task queue] --> MS[MasterService]
      MS -->|compute new state| PUB[Publication]
    end
    PUB -->|publish phase| F1[follower 1]
    PUB -->|publish phase| F2[follower 2]
    F1 -->|accept| PUB
    F2 -->|accept| PUB
    PUB -->|"commit (quorum accepted)"| F1
    PUB -->|commit| F2
    F1 --> CAS1[ClusterApplierService applies]
    F2 --> CAS2[ClusterApplierService applies]
    MS --> CASM[cluster manager applies locally too]

The update-task model

You never mutate cluster state directly. You submit a task describing the change, and the cluster-manager service runs it. Two APIs:

find server -name "ClusterStateUpdateTask.java" -o -name "ClusterStateTaskExecutor.java"
grep -n "abstract ClusterState execute\|onFailure\|clusterStateProcessed" \
  server/src/main/java/org/opensearch/cluster/ClusterStateUpdateTask.java

API	Use when
`ClusterStateUpdateTask`	A single self-contained change; its `execute(ClusterState)` returns the new state.
`ClusterStateTaskExecutor<T>` + `submitStateUpdateTask`	Many tasks of the same kind that should be batched and computed together for one published state.

A minimal update task:

clusterService.submitStateUpdateTask("add-my-custom", new ClusterStateUpdateTask() {
    @Override
    public ClusterState execute(ClusterState current) {
        // pure function: derive a NEW immutable state from current
        return ClusterState.builder(current)
            .metadata(Metadata.builder(current.metadata()).putCustom(MyCustom.TYPE, value))
            .build();
    }
    @Override
    public void onFailure(String source, Exception e) { /* log; never throw */ }
    @Override
    public void clusterStateProcessed(String source, ClusterState before, ClusterState after) {
        // runs after the state is committed AND applied locally; safe to ack here
    }
});

Warning: execute(ClusterState) must be a pure function: derive a new immutable state, do no I/O, do not block, do not sleep. It runs on the single cluster-manager update thread; blocking it stalls all cluster state progress for the entire cluster. Do expensive work before submitting, or in clusterStateProcessed after.

Batching

The cluster-manager service batches tasks that share a ClusterStateTaskExecutor instance and can be coalesced (e.g. 50 shard-started notifications arriving at once produce one new cluster state instead of 50). Batching is the reason a busy cluster does not publish a new state per shard event.
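
To make the effect of sharing an executor instance concrete, here is a toy model of the grouping step. It is a sketch only: BatchingSketch, Executor, BumpVersion, and Pending are hypothetical names invented for illustration, and the real batching logic lives in MasterService and TaskBatcher.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of update-task batching. Every name here is hypothetical, not OpenSearch API. */
final class BatchingSketch {

    /** Stand-in for ClusterStateTaskExecutor: one run consumes a whole batch. */
    interface Executor {
        long execute(long currentVersion, List<String> tasks);
    }

    /** Bumps the state version once per run, however many tasks the batch holds. */
    static final class BumpVersion implements Executor {
        @Override
        public long execute(long currentVersion, List<String> tasks) {
            return currentVersion + 1;
        }
    }

    /** A queued task together with the executor instance it was submitted with. */
    static final class Pending {
        final String task;
        final Executor executor;

        Pending(String task, Executor executor) {
            this.task = task;
            this.executor = executor;
        }
    }

    /** Groups queued tasks by executor identity and runs each group exactly once. */
    static long drain(long startVersion, List<Pending> queue) {
        Map<Executor, List<String>> batches = new IdentityHashMap<>();
        List<Executor> firstSeen = new ArrayList<>(); // keeps executors in submission order
        for (Pending p : queue) {
            if (!batches.containsKey(p.executor)) {
                batches.put(p.executor, new ArrayList<>());
                firstSeen.add(p.executor);
            }
            batches.get(p.executor).add(p.task);
        }
        long version = startVersion;
        for (Executor e : firstSeen) {
            version = e.execute(version, batches.get(e));
        }
        return version;
    }

    /** Queues n shard-started tasks on top of state v100 and returns the final version. */
    static long run(int n, boolean shareExecutor) {
        Executor shared = new BumpVersion();
        List<Pending> queue = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            queue.add(new Pending("shard-started s" + i, shareExecutor ? shared : new BumpVersion()));
        }
        return drain(100, queue);
    }
}
```

In this model, BatchingSketch.run(50, true) ends at version 101, while BatchingSketch.run(50, false) ends at version 150: the same 50 events cost one state or fifty depending only on whether they share an executor instance.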

grep -n "executeTasks\|batch\|TaskBatcher\|class MasterService" \
  server/src/main/java/org/opensearch/cluster/service/MasterService.java

flowchart TD
    T1[shard-started s0] --> B[TaskBatcher]
    T2[shard-started s1] --> B
    T3[shard-started s2] --> B
    B -->|one executor run| Exec["ShardStartedClusterStateTaskExecutor.execute(batch)"]
    Exec --> NS["one new ClusterState (version+1)"]

Two-phase publish → commit

When MasterService has computed a new state, it does not simply broadcast it. It runs a Publication, modeled on the coordination consensus from discovery-coordination.md:

find server -name "Publication.java" -o -name "PublicationTransportHandler.java"
grep -n "publish\|sendPublishRequest\|handlePublishRequest\|sendApplyCommit\|handleApplyCommit" \
  server/src/main/java/org/opensearch/cluster/coordination/PublicationTransportHandler.java

sequenceDiagram
    participant MS as MasterService
    participant P as Publication
    participant Q as quorum of followers
    MS->>P: publish(new state v101)
    P->>Q: PHASE 1 PublishRequest (full state or Diff)
    Q-->>P: accept (persisted to lastAcceptedState)
    Note over P: wait for a QUORUM of accepts
    P->>Q: PHASE 2 ApplyCommit
    Q->>Q: ClusterApplierService applies v101
    Q-->>P: ack (after appliers run)
    P-->>MS: publication succeeded (or timed out)

Phase	Message	Effect on follower
1 — publish	`PublishRequest` (full state, or a `Diff` if the follower holds v100)	Validate term/version, persist as `lastAcceptedState`. Not yet applied.
2 — commit	`ApplyCommit` (sent only after a quorum accepted)	`ClusterApplierService` actually applies the state locally.

This two-phase shape is what makes cluster state changes safe across partitions: a state is only committed once a quorum has durably accepted it, so a failed cluster manager cannot leave the cluster half-updated. The Diff optimization from cluster-state.md rides on phase 1.
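
The quorum rule can be sketched with a toy model. This is an illustration under simplifying assumptions, not the Publication class: PublicationSketch and Follower are hypothetical names, every node is treated as a voting node, and the publishing node counts its own accept.

```java
import java.util.List;

/** Toy model of publish then commit. Names are hypothetical; every node counts as a voter. */
final class PublicationSketch {

    static final class Follower {
        final boolean reachable;
        long lastAccepted = 100; // phase 1 result: accepted, not yet applied
        long applied = 100;      // phase 2 result: the state the node is actually running

        Follower(boolean reachable) {
            this.reachable = reachable;
        }

        /** Phase 1: validate the version and record the state as accepted. */
        boolean handlePublish(long version) {
            if (!reachable || version <= lastAccepted) {
                return false;
            }
            lastAccepted = version;
            return true;
        }

        /** Phase 2: apply only a state that was accepted earlier. */
        void handleCommit(long version) {
            if (reachable && lastAccepted == version) {
                applied = version;
            }
        }
    }

    /** Returns true if the state was committed. The publishing node accepts its own state. */
    static boolean publish(long version, List<Follower> followers) {
        int quorum = (followers.size() + 1) / 2 + 1;
        int accepts = 1; // the cluster manager's own accept
        for (Follower f : followers) {
            if (f.handlePublish(version)) {
                accepts++;
            }
        }
        if (accepts < quorum) {
            return false; // no commit is sent, so no follower applies the state
        }
        for (Follower f : followers) {
            f.handleCommit(version);
        }
        return true;
    }
}
```

With one reachable follower out of four, the publish fails: that follower ends with lastAccepted at 101 but applied still at 100, which is exactly the "accepted but never applied" outcome that phase separation is meant to allow.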

Note: Publication has a timeout (cluster.publish.timeout, default 30s). If a quorum does not accept in time, publication fails and the state is not committed; the cluster manager may step down. Slow appliers on followers are a common cause of publish timeouts.

Applying a committed state: appliers vs listeners

Once a node receives the commit, ClusterApplierService applies the state by running two ordered groups of callbacks:

find server -name "ClusterStateApplier.java" -o -name "ClusterStateListener.java"
grep -n "applyClusterState\|clusterChanged\|addStateApplier\|addListener\|callClusterStateAppliers\|callClusterStateListeners" \
  server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java

Callback	Interface	Runs	Purpose
Applier	`ClusterStateApplier.applyClusterState(ClusterChangedEvent)`	first, by priority group (high, normal, low) and in registration order within a group, synchronously	Components that must act to make the new state real (create/delete shards, update routing, rebuild mappings).
Listener	`ClusterStateListener.clusterChanged(ClusterChangedEvent)`	after all appliers, synchronously	Components that merely react (refresh caches, fire metrics, schedule follow-up work).

The ordering contract is precise and load-bearing:

All appliers run, in order, on the single applier thread.
Only after every applier returns do listeners run.
Both run synchronously on the applier thread — so both block cluster state progress while they execute.
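
The contract above can be sketched in a few lines. ApplierOrderSketch is a hypothetical stand-in, not ClusterApplierService; it only shows that registration order between the two groups does not matter, because the apply loop always drains appliers first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of the applier-then-listener ordering. Names are hypothetical. */
final class ApplierOrderSketch {
    private final List<Consumer<Long>> appliers = new ArrayList<>();
    private final List<Consumer<Long>> listeners = new ArrayList<>();

    void addStateApplier(Consumer<Long> applier) {
        appliers.add(applier);
    }

    void addListener(Consumer<Long> listener) {
        listeners.add(listener);
    }

    /** Runs every applier, then every listener, on the calling thread. */
    void apply(long version) {
        for (Consumer<Long> applier : appliers) {
            applier.accept(version);
        }
        for (Consumer<Long> listener : listeners) {
            listener.accept(version);
        }
    }

    /** Registers a listener before any applier and returns the order in which callbacks ran. */
    static List<String> demo() {
        List<String> trace = new ArrayList<>();
        ApplierOrderSketch service = new ApplierOrderSketch();
        service.addListener(v -> trace.add("listener:metrics"));
        service.addStateApplier(v -> trace.add("applier:shards"));
        service.addStateApplier(v -> trace.add("applier:mappings"));
        service.apply(101);
        return trace;
    }
}
```

Even though the listener is registered first, demo() returns the two appliers followed by the listener.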

flowchart TD
    Commit[ApplyCommit received] --> A1["applier 1 (IndicesClusterStateService)"]
    A1 --> A2["applier 2 (...)"]
    A2 --> A3[applier N]
    A3 --> L1[listener 1]
    L1 --> L2[listener N]
    L2 --> Done["state applied; ack sent"]

Warning: An applier or listener that blocks (does I/O, waits on a lock, calls back into the cluster service, or sleeps) stalls the applier thread, which delays the commit ack, which can blow the publish timeout, which can cause the cluster manager to step down — a cluster-wide cascade from one bad callback. If you need to do slow work, hand it off to a thread pool from inside the callback. This is the most common newcomer mistake; see Level 4 lab 4.2 and Level 4 lab 4.3.
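
The hand-off pattern can be sketched with plain java.util.concurrent. HandOffSketch and the pool name generic-pool-1 are hypothetical; in OpenSearch the executor would come from the node's ThreadPool rather than from Executors.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Toy model of handing slow work off the callback thread. Names are hypothetical. */
final class HandOffSketch {

    /** Runs a callback on the calling thread and reports which thread did the slow work. */
    static String slowWorkThread(boolean handOff) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> new Thread(r, "generic-pool-1"));
        try {
            CompletableFuture<String> ranOn = new CompletableFuture<>();
            // Stand-in for I/O or any other slow operation.
            Runnable slowWork = () -> ranOn.complete(Thread.currentThread().getName());
            // A well-behaved callback only enqueues the work and returns immediately.
            Runnable callback = handOff ? () -> pool.execute(slowWork) : slowWork;
            callback.run();
            return ranOn.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With handOff set to true the slow work runs on generic-pool-1; with false it runs on the thread that invoked the callback, which in the real system would be the applier thread.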

The biggest applier is IndicesClusterStateService — it reconciles the local node's shards with the new routing table, creating, starting, and closing shards to match. That is the bridge from cluster state into index-shard-lifecycle.md:

grep -n "applyClusterState\|createOrUpdateShards\|removeShards\|failAndRemoveShard" \
  server/src/main/java/org/opensearch/indices/cluster/IndicesClusterStateService.java

The ackListener: knowing the change took effect

Some operations (e.g. PUT mapping with ?timeout, or any API that returns "acknowledged": true) must wait until enough nodes have applied the new state. The publication carries an ackListener that fires per node as each one acks the commit. The REST response is held until the ack condition is satisfied (or the ack timeout expires).

grep -n "AckListener\|onNodeAck\|onCommit\|ackTimeout\|AckedClusterStateUpdateTask" \
  server/src/main/java/org/opensearch/cluster/AckedClusterStateUpdateTask.java \
  server/src/main/java/org/opensearch/cluster/coordination/Publication.java 2>/dev/null

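This is why a PUT /index/_mapping can return "acknowledged": false. It means the change was committed, but not every node acknowledged applying it within the ack timeout (the request's timeout parameter); it does not mean the change failed. The separate cluster-manager timeout bounds how long the task may wait to be processed by the cluster manager in the first place.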
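
A toy model of the ack bookkeeping makes the three possible outcomes explicit. AckSketch and its method names are hypothetical and only loosely mirror the onCommit/onNodeAck callbacks named in the grep above; the deadline is simulated by calling responseAtDeadline() rather than by a real timer.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of per-node ack tracking. Names are hypothetical, not OpenSearch API. */
final class AckSketch {
    private final Set<String> pending;
    private boolean committed;

    AckSketch(Set<String> nodes) {
        this.pending = new HashSet<>(nodes);
    }

    /** The publication reached a quorum and the state was committed. */
    void onCommit() {
        committed = true;
    }

    /** One node reported that it applied the committed state. */
    void onNodeAck(String node) {
        pending.remove(node);
    }

    /** What the caller sees when all acks arrived or the ack timeout expired. */
    String responseAtDeadline() {
        if (!committed) {
            return "publication failed";
        }
        return "{\"acknowledged\": " + pending.isEmpty() + "}";
    }
}
```

The middle case is the instructive one: committed, but with a node still pending at the deadline, the response is acknowledged false even though the change is already part of the cluster state.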

End-to-end sequence

sequenceDiagram
    participant C as Client / internal caller
    participant CS as ClusterService
    participant MS as MasterService
    participant EX as ClusterStateTaskExecutor
    participant PUB as Publication
    participant F as Followers
    participant CAS as ClusterApplierService (each node)
    C->>CS: submitStateUpdateTask(source, task)
    CS->>MS: enqueue task
    Note over MS: batch compatible tasks
    MS->>EX: execute(currentState, tasks) -> newState
    MS->>PUB: publish(newState)
    PUB->>F: PublishRequest (state or Diff)
    F-->>PUB: accept (lastAcceptedState)
    Note over PUB: quorum reached
    PUB->>F: ApplyCommit
    F->>CAS: applyClusterState (appliers, then listeners)
    CAS-->>PUB: ack
    PUB-->>MS: publication complete
    MS->>EX: clusterStateProcessed callbacks
    EX-->>C: response (acknowledged if ack condition met)

Reading exercise

# 1. The update-task loop and batching.
grep -n "runTasks\|executeTasks\|publish\|TaskBatcher" \
  server/src/main/java/org/opensearch/cluster/service/MasterService.java

# 2. The applier/listener split and ordering.
grep -n "callClusterStateAppliers\|callClusterStateListeners\|addStateApplier\|addListener" \
  server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java

# 3. The publish/commit transport handlers.
grep -n "PUBLISH_STATE_ACTION_NAME\|COMMIT_STATE_ACTION_NAME\|handlePublishRequest\|handleApplyCommit" \
  server/src/main/java/org/opensearch/cluster/coordination/PublicationTransportHandler.java

# 4. Tests that exercise publication.
./gradlew :server:test --tests "org.opensearch.cluster.service.MasterServiceTests"
./gradlew :server:test --tests "org.opensearch.cluster.service.ClusterApplierServiceTests"

# 5. On a running node, block until queued cluster-state tasks down to LANGUID priority have been processed.
curl -s "localhost:9200/_cluster/health?pretty&wait_for_events=languid"

Answer:

On which node does MasterService run, and on which node does ClusterApplierService run? Why the difference?
Why must ClusterStateUpdateTask.execute() be pure and non-blocking? What exactly stalls if it blocks?
What is the difference between phase 1 (publish) and phase 2 (commit)? At which phase has a follower applied the state?
In ClusterApplierService, prove from the code that all appliers run before any listener. What is the consequence if an applier throws?
A PUT _mapping returns "acknowledged": false. Trace, via the ackListener, what that means and what timeout governs it.
Find IndicesClusterStateService.applyClusterState. How does this one applier turn a routing-table change into shard creation/removal on the local node?

Common bugs and symptoms

Symptom	Root cause	Where to look
Whole cluster appears frozen; no state changes apply	An applier/listener is blocking the applier thread	thread dump on the applier thread; the offending `ClusterStateApplier`
`failed to publish cluster state … timed out`	Slow appliers on followers, or network; publish timeout exceeded	`cluster.publish.timeout`; follower applier latency
Update-task throughput collapses under load	Tasks not batched (distinct executors) or `execute()` doing I/O	`ClusterStateTaskExecutor` batching; remove I/O from `execute`
`acknowledged: false` on metadata changes	Ack not received from all nodes within the request's ack timeout (`timeout`)	`AckedClusterStateUpdateTask`; raise timeout or fix slow node
Listener sees a state but acts on stale local data	Did work that belonged in an applier (must run before listeners)	move logic from `ClusterStateListener` to `ClusterStateApplier`
Cluster manager steps down repeatedly	Publication failures from slow followers cascade	follower GC/applier latency; discovery-coordination.md

Validation: prove you understand this

Draw the produce→publish→commit→apply pipeline and label which steps happen on the cluster manager vs every node.
Write a correct ClusterStateUpdateTask skeleton and annotate which method must be pure, which may do post-commit work, and which handles failure.
Explain batching with a concrete example (N shard-started events) and state how many cluster states result.
Distinguish phase 1 from phase 2 of publication and explain why committing only after a quorum accepts prevents a half-updated cluster.
Explain the applier-before-listener ordering and give one task that belongs in an applier and one that belongs in a listener.
Explain precisely how blocking inside an applier can make the elected cluster manager step down. Name every link in the cascade.

The Transport Layer

Every interaction between nodes in an OpenSearch cluster — a search fan-out, a replication request, a join, a cluster-state publish — travels over the transport layer. It is OpenSearch's internal RPC fabric: a binary protocol on port 9300 (distinct from the REST/HTTP port 9200), built on Netty, with a typed request/response model and a custom serialization framework (StreamInput/StreamOutput, Writeable). This chapter explains TransportService and the Transport SPI, how request handlers are registered and invoked, the serialization primitives that make payloads cross the wire, and the version-aware hooks that keep a mixed-version cluster working.

After this chapter you should be able to: register a transport request handler; explain what Writeable, StreamInput/StreamOutput, and NamedWriteableRegistry each do; trace a TransportRequest from sender to handler to TransportResponse; and find where the wire format is version-gated for backward compatibility (the hook into serialization-bwc.md).

Note: Port 9200 is HTTP/REST (clients talk to it). Port 9300 is the transport protocol (nodes talk to each other). They are different stacks with different handlers. This chapter is about 9300; rest-layer.md is about 9200.

The classes

find server/src/main/java/org/opensearch/transport -name "TransportService.java"
find . -path "*transport-netty4*" -name "Netty4Transport.java"
grep -n "class TransportService\|registerRequestHandler\|sendRequest\|interface Transport" \
  server/src/main/java/org/opensearch/transport/TransportService.java

Class	Package / module	Role
`TransportService`	`org.opensearch.transport`	The high-level API every component uses: register handlers, open connections, send requests. Wraps a `Transport`.
`Transport`	`org.opensearch.transport`	The SPI for the wire implementation (connect, bind, send bytes).
`Netty4Transport`	`modules/transport-netty4`	The default `Transport` implementation, on Netty. Bound to 9300.
`TransportRequest`	`org.opensearch.transport`	Base class for anything sent. Implements `Writeable`.
`TransportResponse`	`org.opensearch.transport`	Base class for anything returned. Implements `Writeable`.
`TransportChannel`	`org.opensearch.transport`	A handler's reply handle: `sendResponse(resp)` or `sendResponse(exc)`.
`TransportRequestHandler<T>`	`org.opensearch.transport`	The functional interface a handler implements: `messageReceived(request, channel, task)`.
`TransportResponseHandler<T>`	`org.opensearch.transport`	The callback on the sender side: `handleResponse` / `handleException`, plus the executor and the response reader.

flowchart LR
    Comp[some component] -->|sendRequest| TS1[TransportService A]
    TS1 -->|Transport SPI| N4A[Netty4Transport A]
    N4A -->|"TCP 9300, framed bytes"| N4B[Netty4Transport B]
    N4B --> TS2[TransportService B]
    TS2 -->|dispatch by action name| H[registered handler]
    H -->|channel.sendResponse| TS2
    TS2 --> N4B
    N4B --> N4A
    N4A --> TS1
    TS1 -->|handleResponse| Comp

Registering and invoking a handler

Every transport interaction is keyed by an action name string (e.g. "indices:data/read/search[phase/query]"). A node that can serve an action registers a handler:

transportService.registerRequestHandler(
    ACTION_NAME,                       // String key
    ThreadPool.Names.SEARCH,           // which thread pool runs the handler
    MyRequest::new,                    // Writeable.Reader<MyRequest> (deserializer)
    (request, channel, task) -> {
        MyResponse resp = handle(request);
        channel.sendResponse(resp);    // or channel.sendResponse(exception)
    });

grep -n "registerRequestHandler" server/src/main/java/org/opensearch/transport/TransportService.java
# Real examples across the codebase:
grep -rn "registerRequestHandler" server/src/main/java | head -20

To call a remote (or local) action:

transportService.sendRequest(targetNode, ACTION_NAME, request,
    new ActionListenerResponseHandler<>(listener, MyResponse::new, ThreadPool.Names.SAME));

Sender ingredient	Why
`targetNode` (`DiscoveryNode`)	Where to send; resolved from cluster state.
`ACTION_NAME`	Selects the remote handler.
`request` (`TransportRequest`)	The `Writeable` payload.
`TransportResponseHandler`	Reads the response and routes it (often `ActionListenerResponseHandler`, which adapts to an `ActionListener`).

ActionListenerResponseHandler is the bridge between the transport world and the action world (action-framework.md): it deserializes the response and feeds it to an ActionListener.onResponse/onFailure.
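
Stripped of networking and serialization, dispatch by action name is a map lookup. The following is a toy model only: TransportSketch and Registration are hypothetical, requests and responses are plain strings, and the call is synchronous, unlike the real TransportService.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of action-name dispatch. Names are hypothetical, not OpenSearch API. */
final class TransportSketch {

    /** A handler plus the name of the thread pool it asked to run on. */
    static final class Registration {
        final String executor;
        final Function<String, String> handler;

        Registration(String executor, Function<String, String> handler) {
            this.executor = executor;
            this.handler = handler;
        }
    }

    private final Map<String, Registration> handlers = new HashMap<>();

    void registerRequestHandler(String action, String executor, Function<String, String> handler) {
        if (handlers.putIfAbsent(action, new Registration(executor, handler)) != null) {
            throw new IllegalArgumentException("handler for [" + action + "] is already registered");
        }
    }

    /** Returns the pool name a handler was registered with. */
    String executorFor(String action) {
        return lookup(action).executor;
    }

    /** Looks up the handler by action name and invokes it inline. */
    String sendRequest(String action, String request) {
        return lookup(action).handler.apply(request);
    }

    private Registration lookup(String action) {
        Registration registration = handlers.get(action);
        if (registration == null) {
            throw new IllegalStateException("no handler registered for action [" + action + "]");
        }
        return registration;
    }
}
```

The action name is the only thing sender and receiver have to agree on ahead of time, which is why a typo in it surfaces at runtime rather than at compile time.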

Note: The thread-pool name in registerRequestHandler decides which pool runs the handler, so pick it deliberately. A search handler runs on SEARCH, a write on WRITE, and a lightweight metadata reply on SAME (inline on the transport thread that received the message, so it must never block). Choosing the wrong pool causes rejections or starves other work; see threadpools-concurrency.md.

Serialization: Writeable, StreamInput, StreamOutput

Everything on the wire is serialized by hand through a binary stream. There is no reflection-based JSON here — it is explicit, ordered, byte-for-byte. The primitives live in libs/core:

find libs -name "Writeable.java" -o -name "StreamInput.java" -o -name "StreamOutput.java"
grep -n "interface Writeable\|void writeTo\|interface Reader" \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/Writeable.java

Primitive	Contract
`Writeable`	`void writeTo(StreamOutput out)` — serialize self in a fixed field order.
`Writeable.Reader<T>`	`T read(StreamInput in)` — usually a constructor `MyType(StreamInput in)` that reads fields in the same order `writeTo` wrote them.
`StreamOutput`	Typed writers: `writeString`, `writeVInt`, `writeOptionalString`, `writeList`, `writeEnum`, `writeBoolean`, ...
`StreamInput`	Matching readers: `readString`, `readVInt`, `readOptionalString`, `readList`, `readEnum`, ...

The cardinal rule: read order must match write order, exactly. A TransportRequest typically looks like:

public MyRequest(StreamInput in) throws IOException {
    super(in);
    this.indexName = in.readString();
    this.size = in.readVInt();
    this.filter = in.readOptionalWriteable(Filter::new);
}
@Override
public void writeTo(StreamOutput out) throws IOException {
    super.writeTo(out);
    out.writeString(indexName);
    out.writeVInt(size);
    out.writeOptionalWriteable(filter);
}

Warning: A mismatch between writeTo and the reader is the single most common serialization bug. It does not fail at compile time; it manifests as a corrupt deserialization (EOFException, wrong values, or IllegalStateException) at runtime, often only against a different node version. The round-trip test base classes (AbstractWireSerializingTestCase) exist precisely to catch this — every new Writeable owes one. See serialization-bwc.md.
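
The failure mode is easy to reproduce with the JDK's own binary streams. This sketch uses java.io.DataOutputStream and DataInputStream as stand-ins for StreamOutput and StreamInput (an assumption made so the example runs without OpenSearch on the classpath); WireOrderSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Shows what happens when the read order disagrees with the write order. */
final class WireOrderSketch {

    /** Writes a string, then an int. This order is the wire contract. */
    static byte[] write(String indexName, int size) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(indexName);
            out.writeInt(size);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Correct reader: same fields, same order as write(). */
    static String readInWriteOrder(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            String indexName = in.readUTF();
            int size = in.readInt();
            return indexName + ":" + size;
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    /** Buggy reader: the two reads are swapped. It compiles fine and fails only at runtime. */
    static String readSwapped(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            int size = in.readInt();
            String indexName = in.readUTF();
            return indexName + ":" + size;
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }
}
```

For the input ("logs", 10), the correct reader returns logs:10, while the swapped reader interprets string bytes as a length and runs off the end of the buffer with an EOFException. A round-trip test is the cheapest place to discover that.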

NamedWriteableRegistry: polymorphism on the wire

StreamInput can read a concrete type if it knows the reader. But many fields are polymorphic: a QueryBuilder could be any of dozens of subclasses, and an AggregationBuilder could be any aggregation type. For those, OpenSearch writes a name and looks up the reader in a NamedWriteableRegistry.

find . -name "NamedWriteableRegistry.java" -o -name "NamedWriteable.java"
grep -n "writeNamedWriteable\|readNamedWriteable\|getNamedWriteableName\|class Entry" \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/NamedWriteableRegistry.java

Mechanism	Used for
`Writeable` + a known `Reader`	Concrete, non-polymorphic types (a specific request).
`NamedWriteable` + `NamedWriteableRegistry`	Polymorphic types selected by a registered name (queries, aggs, suggesters, cluster-state customs).

The registry is assembled at node start from core plus every plugin's getNamedWriteables() (see plugin-architecture.md). The XContent analog (NamedXContentRegistry) does the same for JSON parsing on the REST side — see rest-layer.md.
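
The name-then-lookup mechanism can be sketched with JDK streams standing in for StreamInput/StreamOutput. NamedRegistrySketch, Query, Reader, and TermQuery are hypothetical types for illustration; the error text is modeled on the message quoted in the symptoms table of this chapter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Toy model of a name-keyed reader registry. Names are hypothetical. */
final class NamedRegistrySketch {

    interface Query {
        String writeableName();
        void writeTo(DataOutput out) throws IOException;
    }

    interface Reader {
        Query read(DataInput in) throws IOException;
    }

    static final class TermQuery implements Query {
        final String term;

        TermQuery(String term) { this.term = term; }
        TermQuery(DataInput in) throws IOException { this.term = in.readUTF(); }

        @Override public String writeableName() { return "term"; }
        @Override public void writeTo(DataOutput out) throws IOException { out.writeUTF(term); }
    }

    private final Map<String, Reader> readers = new HashMap<>();

    void register(String name, Reader reader) {
        readers.put(name, reader);
    }

    /** Writes the registered name first, then the object's own fields. */
    static void writeNamed(DataOutput out, Query query) throws IOException {
        out.writeUTF(query.writeableName());
        query.writeTo(out);
    }

    /** Reads the name, finds the matching reader, and lets it consume the fields. */
    Query readNamed(DataInput in) throws IOException {
        String name = in.readUTF();
        Reader reader = readers.get(name);
        if (reader == null) {
            throw new IllegalArgumentException("Unknown NamedWriteable [" + name + "]");
        }
        return reader.read(in);
    }

    /** Serializes a term query and reads it back on a "node" that may lack the entry. */
    static String roundTrip(boolean termRegistered) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeNamed(new DataOutputStream(bytes), new TermQuery("error"));
            NamedRegistrySketch registry = new NamedRegistrySketch();
            if (termRegistered) {
                registry.register("term", TermQuery::new);
            }
            Query q = registry.readNamed(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return q.writeableName() + ":" + ((TermQuery) q).term;
        } catch (IOException | IllegalArgumentException e) {
            return e.getMessage();
        }
    }
}
```

Calling roundTrip(false) models a node where the plugin that contributes the type is missing: the bytes are valid, but the receiver has no reader for the name.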

flowchart LR
    Q["QueryBuilder field"] -->|writeNamedWriteable| W["name 'bool' + bytes"]
    W -->|wire| R[reader node]
    R -->|"readNamedWriteable(QueryBuilder.class)"| Reg[NamedWriteableRegistry]
    Reg -->|"lookup 'bool'"| BB["BoolQueryBuilder(StreamInput)"]
    BB --> Obj[reconstructed QueryBuilder]

Connections, ports, and request options

TransportService opens multiple logical channels per node connection, grouped by purpose so that, say, a flood of bulk traffic cannot starve cluster-state pings.

grep -n "ConnectionProfile\|TransportRequestOptions\|Type\.\(BULK\|PING\|REG\|STATE\|RECOVERY\)" \
  server/src/main/java/org/opensearch/transport/TransportRequestOptions.java \
  server/src/main/java/org/opensearch/transport/ConnectionProfile.java 2>/dev/null

Channel type	Carries
`PING`	Fault-detection heartbeats (`LeaderChecker`/`FollowersChecker`).
`STATE`	Cluster-state publish/commit.
`RECOVERY`	Peer-recovery byte transfers.
`BULK`	Indexing/bulk traffic.
`REG`	Everything else (the default).

TransportRequestOptions lets a caller set the channel type and a per-request timeout. The port (default 9300, transport.port) and bind addresses come from NetworkModule/transport settings.

The version hook (BWC)

OpenSearch runs mixed-version clusters during rolling upgrades. The serialization layer must therefore be version-aware: a newer node writing to an older node must omit fields the older node cannot parse, and vice versa. Every StreamInput/StreamOutput carries a Version, and serialization code branches on it:

@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeString(name);
    if (out.getVersion().onOrAfter(Version.V_3_0_0)) {
        out.writeOptionalString(newFieldAddedIn3x);  // older nodes never see this
    }
}

grep -rn "out.getVersion()\|in.getVersion()\|onOrAfter\|before(" \
  server/src/main/java/org/opensearch/action/ | head

This is the seam where transport meets backward compatibility. The full rules, the Version constants, and the qa/ BWC test suite are in serialization-bwc.md. For the purposes of this chapter: any time you add or change a field on a TransportRequest/TransportResponse, you must version-gate it and add a BWC round-trip test.
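
To see what the guard protects against, here is a toy model in which the stream version is a plain int and JDK data streams stand in for StreamInput/StreamOutput. VersionGateSketch, V_2, and V_3 are hypothetical; the last segment of the returned string is the number of unread bytes left on the wire.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Toy model of version-gated serialization. Names and versions are hypothetical. */
final class VersionGateSketch {
    static final int V_2 = 2;
    static final int V_3 = 3;

    /** Writes the new optional field only when the stream version can understand it. */
    static byte[] write(int streamVersion, String name, String newField) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(name);
            if (streamVersion >= V_3) {
                out.writeBoolean(newField != null); // presence flag, like writeOptionalString
                if (newField != null) {
                    out.writeUTF(newField);
                }
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads with the same gate and reports name, new field, and leftover byte count. */
    static String read(int streamVersion, byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            String name = in.readUTF();
            String newField = null;
            if (streamVersion >= V_3 && in.readBoolean()) {
                newField = in.readUTF();
            }
            return name + "|" + newField + "|" + in.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

When both sides use the negotiated version, nothing is left over. When a writer ignores the gate and sends V_3 bytes to a V_2 reader, four bytes remain unread; in a real message those bytes would be misread as the start of the next field.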

Where the transport layer sits in a request

flowchart TD
    REST["RestSearchAction (9200)"] --> NC[NodeClient.execute]
    NC --> TA[TransportSearchAction]
    TA -->|"per-shard sub-requests"| TS[TransportService.sendRequest]
    TS -->|9300| Remote[data node]
    Remote --> H["search query-phase handler"]
    H --> SS[SearchService.executeQueryPhase]
    SS --> Resp["channel.sendResponse(QuerySearchResult)"]

The transport layer is the connective tissue: the action framework (action-framework.md) decides what to send and where; the transport layer moves the bytes; the thread pools (threadpools-concurrency.md) decide which thread runs each handler.

Reading exercise

# 1. The send/receive API surface.
grep -n "public .*sendRequest\|registerRequestHandler\|openConnection" \
  server/src/main/java/org/opensearch/transport/TransportService.java

# 2. A real handler registration (search query phase).
grep -rn "registerRequestHandler" server/src/main/java/org/opensearch/search/ | head

# 3. The serialization primitives.
grep -n "public .* read\|public void write" \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/StreamInput.java | head -30

# 4. NamedWriteable in action for queries.
grep -rn "writeNamedWriteable\|readNamedWriteable" server/src/main/java/org/opensearch/index/query/ | head

# 5. Tests.
./gradlew :server:test --tests "org.opensearch.transport.TransportServiceTests"
./gradlew :server:test --tests "*AbstractWireSerializingTestCase*" 2>/dev/null || true

Answer:

Besides the thread-pool name, what three things must you supply to registerRequestHandler, and what does the thread-pool-name argument control?
State the cardinal rule relating writeTo to the StreamInput constructor. What runtime error appears when it is violated?
When does a field use NamedWriteableRegistry instead of a plain Writeable reader? Give a concrete OpenSearch type that needs it and why.
What is ActionListenerResponseHandler for, and how does it connect the transport layer to the ActionListener world?
Why does the transport layer use multiple channel types (PING, STATE, BULK, …)? What failure does the separation prevent?
Find one if (out.getVersion().onOrAfter(...)) in the codebase. Explain what breaks in a rolling upgrade if that guard is removed.

Common bugs and symptoms

Symptom	Root cause	Where to look
`EOFException` / garbage values deserializing a request	`writeTo` and the `StreamInput` reader disagree on field order/count	the type's `writeTo` vs its `(StreamInput)` constructor; add an `AbstractWireSerializingTestCase`
Works same-version, breaks during rolling upgrade	New field not version-gated with `getVersion().onOrAfter(...)`	serialization-bwc.md; add a BWC test
`IllegalArgumentException: Unknown NamedWriteable [...]`	A polymorphic type's name not registered (plugin missing on this node)	`getNamedWriteables()`; `NamedWriteableRegistry`
Handler runs on the wrong/overloaded pool; rejections	Wrong `ThreadPool.Names` in `registerRequestHandler`	the registration call; threadpools-concurrency.md
Cluster-state pings delayed under bulk load	Traffic not separated onto distinct channel types	`ConnectionProfile`; `TransportRequestOptions`
`NodeNotConnectedException`	No open connection to target (node left, or never connected)	`TransportService.connectToNode`; cluster state freshness

Validation: prove you understand this

Distinguish port 9200 from 9300 and name the top-level class that owns each stack.
Write, from memory, the skeleton of a TransportRequest with one string and one optional sub-object, including both writeTo and the reader constructor in the correct order.
Explain NamedWriteableRegistry versus a plain Writeable.Reader, and name two OpenSearch types that require the registry.
Trace a single sendRequest from caller through Netty4Transport to a remote handler and back to the caller's handleResponse.
Show one version-gated write and explain the exact rolling-upgrade failure that the guard prevents.
Given a registered handler that runs on ThreadPool.Names.GENERIC but does heavy CPU search work, explain the operational symptom and the fix.

The REST Layer

The REST layer is the edge of OpenSearch — the HTTP front door on port 9200 where every client request arrives. Its job is narrow and well-defined: receive an HTTP request, route it to the right handler based on method and path, parse its body and parameters into a typed action request, hand that to the NodeClient, and render the eventual response (or error) back as HTTP. It does not execute the business logic; it delegates everything to the action framework (action-framework.md) which runs over the transport layer (transport-layer.md). This chapter walks RestController dispatch, the BaseRestHandler contract, request parsing via XContent, response/error rendering, and how a handler is registered by a plugin.

After this chapter you should be able to: write a BaseRestHandler and register it; explain the prepareRequest → RestChannelConsumer two-step and why it exists; parse a JSON body with XContent; and trace a GET _search from HTTP bytes to a TransportSearchAction call.

Note: Port 9200 is HTTP. Port 9300 is the transport protocol between nodes. This chapter is about 9200; the moment a REST handler calls NodeClient.execute(...), control crosses into the action/transport world.

The classes

find server/src/main/java/org/opensearch/rest -name "RestController.java"
find server/src/main/java/org/opensearch/rest -name "BaseRestHandler.java"
grep -n "class RestController\|registerHandler\|dispatchRequest\|class MethodHandlers" \
  server/src/main/java/org/opensearch/rest/RestController.java

Class	Role
`RestController`	The router. Holds the trie of `(method, path) → handler`. Receives every HTTP request and dispatches it; renders errors.
`MethodHandlers`	The per-path bucket mapping each HTTP method to its handler.
`RestHandler`	The base interface: `handleRequest(RestRequest, RestChannel, NodeClient)`.
`BaseRestHandler`	The class you almost always extend. Splits handling into `prepareRequest(...)` (parse) and a returned `RestChannelConsumer` (execute). Tracks consumed params.
`RestRequest`	The parsed HTTP request: method, path, query params, headers, body as `BytesReference`, and an XContent parser.
`RestChannel`	The reply handle: `sendResponse(RestResponse)`.
`RestResponse` / `BytesRestResponse`	The HTTP response (status + content type + body bytes). `BytesRestResponse` is the common concrete one.
`NodeClient`	The in-process entry point to the action framework: `execute(ActionType, request, listener)`.

flowchart LR
    HTTP["HTTP request :9200"] --> HC[HttpChannel / Netty]
    HC --> RC[RestController.dispatchRequest]
    RC -->|"match (method, path)"| MH[MethodHandlers]
    MH --> H["BaseRestHandler.handleRequest"]
    H -->|prepareRequest| PR[parse params + body]
    PR --> CC[RestChannelConsumer]
    CC --> NC[NodeClient.execute]
    NC --> TA[TransportAction]
    TA -->|ActionListener| RTL[RestToXContentListener]
    RTL --> Channel[RestChannel.sendResponse]
    Channel --> HTTP

Dispatch: how RestController routes

RestController holds a path trie. Each handler declares its routes() — the (method, path) pairs it serves, including templated path segments like /{index}/_search.

grep -n "registerHandler\|routes()\|class Route\|PathTrie" \
  server/src/main/java/org/opensearch/rest/RestController.java
grep -n "public List<Route> routes" server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java

On each request, dispatchRequest:

Matches the method+path to a RestHandler (via the trie / MethodHandlers).
Applies cross-cutting concerns (headers, deprecation, content-type checks).
Calls handler.handleRequest(request, channel, client).
If nothing matches, renders a 400/405/404 (e.g. method not allowed lists the allowed methods for that path).

Note: Path templating and the trie are why GET /my-index/_search and GET /_search can resolve to the same handler with the index either bound or absent. The handler reads request.param("index").
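
Template matching can be sketched as follows. This toy model scans a list of routes linearly rather than walking a trie, and RouteSketch, Route, and the returned strings are all hypothetical; it only illustrates how a templated segment binds a parameter and how a known path with the wrong method is distinguished from an unknown path.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of (method, path) routing with templated segments. Names are hypothetical. */
final class RouteSketch {

    static final class Route {
        final String method;
        final String template;
        final String handler;

        Route(String method, String template, String handler) {
            this.method = method;
            this.template = template;
            this.handler = handler;
        }
    }

    private final List<Route> routes = new ArrayList<>();

    void register(String method, String template, String handler) {
        routes.add(new Route(method, template, handler));
    }

    /** Returns the bound path parameters if the path fits the template, else null. */
    static Map<String, String> match(String template, String path) {
        String[] t = template.split("/");
        String[] p = path.split("/");
        if (t.length != p.length) {
            return null;
        }
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i < t.length; i++) {
            if (t[i].startsWith("{") && t[i].endsWith("}")) {
                params.put(t[i].substring(1, t[i].length() - 1), p[i]); // bind {index}
            } else if (!t[i].equals(p[i])) {
                return null;
            }
        }
        return params;
    }

    /** Finds a handler, or explains whether the path or only the method was unknown. */
    String dispatch(String method, String path) {
        boolean pathKnown = false;
        for (Route route : routes) {
            Map<String, String> params = match(route.template, path);
            if (params == null) {
                continue;
            }
            pathKnown = true;
            if (route.method.equals(method)) {
                return route.handler + " index=" + params.get("index");
            }
        }
        return pathKnown ? "method not allowed" : "no handler found";
    }

    /** Two routes served by the same hypothetical handler. */
    static RouteSketch searchRoutes() {
        RouteSketch controller = new RouteSketch();
        controller.register("GET", "/_search", "RestSearchAction");
        controller.register("GET", "/{index}/_search", "RestSearchAction");
        return controller;
    }
}
```

Both GET /_search and GET /my-index/_search resolve to the same handler; only the bound index parameter differs (absent in the first case).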

The prepareRequest → RestChannelConsumer two-step

BaseRestHandler deliberately splits handling into two phases:

grep -n "protected abstract RestChannelConsumer prepareRequest\|handleRequest\|unrecognized\|consumedParams\|interface RestChannelConsumer" \
  server/src/main/java/org/opensearch/rest/BaseRestHandler.java

@Override
protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) throws IOException {
    // PHASE 1 (this thread): parse params + body into a typed action request.
    SearchRequest searchRequest = new SearchRequest();
    request.withContentOrSourceParamParserOrNull(parser -> parseSearchRequest(searchRequest, request, parser));
    // PHASE 2 (returned lambda): actually execute, async.
    return channel -> client.execute(SearchAction.INSTANCE, searchRequest,
        new RestToXContentListener<>(channel));
}

Why two phases?

Phase	Runs on	Purpose	Failure behavior
`prepareRequest` (parse)	the HTTP worker thread, synchronously	Validate and build the action request. Read every param here.	A parse error throws before any execution — clean 400.
`RestChannelConsumer` (execute)	typically off-thread via the action framework	Submit the action and render the result asynchronously.	Errors flow to the `ActionListener` and become an HTTP error response.

This split guarantees that all request parsing happens before any execution. If parsing fails, you get a clean client error and have done no work. It also lets BaseRestHandler enforce that every query parameter was consumed: an unrecognized param produces request [GET /...] contains unrecognized parameter: [...] rather than being silently ignored.
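
The parse-then-execute split can be modeled without any OpenSearch types. In this sketch a StringBuilder plays the role of the channel and a Consumer plays the role of RestChannelConsumer; TwoStepSketch and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of the prepare/execute split. Names are hypothetical. */
final class TwoStepSketch {

    /** Phase 1: parse and validate. Throws before any execution can happen. */
    static Consumer<StringBuilder> prepareRequest(Map<String, String> params) {
        int size = Integer.parseInt(params.getOrDefault("size", "10"));
        if (size < 0) {
            throw new IllegalArgumentException("[size] must be >= 0");
        }
        // Phase 2: the returned consumer is the only place that does work.
        return channel -> channel.append("searched with size=").append(size);
    }

    /** Stand-in for the base handler: run phase 1, then phase 2, or render a client error. */
    static String handle(Map<String, String> params) {
        StringBuilder channel = new StringBuilder();
        try {
            prepareRequest(params).accept(channel);
        } catch (IllegalArgumentException e) {
            channel.append("400 ").append(e.getMessage());
        }
        return channel.toString();
    }
}
```

Because validation finishes before the consumer is even created, a bad size never reaches the execution step: the response contains only the error.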

Warning: You must read (consume) every parameter you accept inside prepareRequest. BaseRestHandler records consumed params and rejects the request if any are left over. Forgetting to consume a param you support causes a spurious "unrecognized parameter" rejection for valid requests. The same check is what catches client typos: a misspelled param is never consumed, so it is rejected instead of being silently ignored. Read every supported param, even ones you then ignore.
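
The bookkeeping behind this check is small enough to sketch. ConsumedParamsSketch is a hypothetical model, not RestRequest: reading a param marks it consumed, and anything the client sent that was never read is reported.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of consumed-parameter tracking. Names are hypothetical. */
final class ConsumedParamsSketch {
    private final Map<String, String> params;
    private final Set<String> consumed = new HashSet<>();

    ConsumedParamsSketch(Map<String, String> params) {
        this.params = params;
    }

    /** Reading a param marks it as consumed, whether or not the client sent it. */
    String param(String key) {
        consumed.add(key);
        return params.get(key);
    }

    /** Params the client sent that no handler code ever read, in sorted order. */
    Set<String> unconsumed() {
        Set<String> left = new TreeSet<>(params.keySet());
        left.removeAll(consumed);
        return left;
    }

    /** A handler that supports exactly two params: index and size. */
    static String handle(Map<String, String> query) {
        ConsumedParamsSketch request = new ConsumedParamsSketch(query);
        request.param("index");
        request.param("size");
        Set<String> left = request.unconsumed();
        return left.isEmpty() ? "ok" : "unrecognized parameter: " + left;
    }
}
```

A request with size=10 passes, while a request with the typo sizee=10 is rejected because nothing ever reads a param by that name.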

Parsing the body: XContent

Request bodies are JSON (or YAML/CBOR/SMILE — XContent abstracts the format). Parsing goes through libs/x-content:

find libs/x-content -name "XContentParser.java" -o -name "NamedXContentRegistry.java"
grep -n "withContentOrSourceParamParserOrNull\|contentParser\|requiredContent" \
  server/src/main/java/org/opensearch/rest/RestRequest.java

Concept	Class	Role
Format-agnostic parser	`XContentParser`	Token-stream parser (`nextToken()`, `currentName()`, `text()`, `intValue()`).
Format detection	`XContentType` / `XContentFactory`	Detects JSON/YAML/CBOR/SMILE from content type or sniffing.
Polymorphic parsing	`NamedXContentRegistry`	The XContent analog of `NamedWriteableRegistry`: parse a named sub-object (e.g. a query named `bool`) by looking up its parser. Assembled from core + plugins.
Declarative parsing	`ObjectParser` / `ConstructingObjectParser`	Map JSON fields to setters/constructor args declaratively, with validation. Preferred for new code.

NamedXContentRegistry is what lets {"query": {"bool": {...}}} parse into a BoolQueryBuilder without the parser hardcoding every query type. Plugins add entries via getNamedXContent(); see plugin-architecture.md and query-dsl-querybuilders.md.

Rendering the response and errors

A handler does not build HTTP directly; it gives the action framework an ActionListener that knows how to render. The common ones:

find server -name "RestToXContentListener.java" -o -name "RestActionListener.java" -o -name "RestStatusToXContentListener.java"
grep -n "class RestToXContentListener\|buildResponse\|onResponse\|onFailure" \
  server/src/main/java/org/opensearch/rest/action/RestToXContentListener.java

Listener	Use
`RestActionListener`	Base: turns `onFailure` into a rendered error response on the channel.
`RestToXContentListener<Response extends ToXContentObject>`	The workhorse: serializes a `ToXContentObject` response to the negotiated XContent format with the right status.
`RestStatusToXContentListener`	Like above but derives the HTTP status from the response (e.g. `201 Created` for index).

Errors are rendered by BytesRestResponse, which knows how to turn an exception into a structured JSON error with the right HTTP status (RestStatus):

grep -n "class BytesRestResponse\|build.*Exception\|status()\|RestStatus" \
  server/src/main/java/org/opensearch/rest/BytesRestResponse.java
grep -n "enum RestStatus\|BAD_REQUEST\|NOT_FOUND\|INTERNAL_SERVER_ERROR\|TOO_MANY_REQUESTS" \
  libs/core/src/main/java/org/opensearch/core/rest/RestStatus.java

The exception-to-status mapping is why a missing index returns 404, a malformed query returns 400, and a circuit-breaker trip returns 429 (circuit-breakers-memory.md).
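A minimal sketch of that mapping, assuming a simplified exception hierarchy: StatusException and ErrorRenderingSketch are hypothetical names, and the three rules below are an illustration of the shape of the logic rather than the full set of cases the real code handles.

```java
/** Hypothetical exception type that carries its own HTTP status. */
class StatusException extends RuntimeException {
    private final int status;

    StatusException(String message, int status) {
        super(message);
        this.status = status;
    }

    int status() {
        return status;
    }
}

final class ErrorRenderingSketch {
    /** Exceptions that know their status report it; bad arguments are 400; the rest are 500. */
    static int statusOf(Exception e) {
        if (e instanceof StatusException) {
            return ((StatusException) e).status();
        }
        if (e instanceof IllegalArgumentException) {
            return 400;
        }
        return 500;
    }

    /** Renders a status line plus the reason; real rendering produces structured XContent. */
    static String render(Exception e) {
        return statusOf(e) + " " + e.getMessage();
    }
}
```

The last rule is the one behind the "500 where 400 was right" symptom: an exception that carries no status information falls through to the generic server error.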

Registering a handler (the plugin seam)

Core REST handlers are registered in ActionModule. Plugin handlers are contributed via ActionPlugin.getRestHandlers(...):

grep -n "registerHandler\|getRestHandlers" \
  server/src/main/java/org/opensearch/action/ActionModule.java
grep -n "getRestHandlers" server/src/main/java/org/opensearch/plugins/ActionPlugin.java

// In your plugin:
@Override
public List<RestHandler> getRestHandlers(Settings settings, RestController restController,
        ClusterSettings clusterSettings, IndexScopedSettings indexScopedSettings,
        SettingsFilter settingsFilter, IndexNameExpressionResolver resolver,
        Supplier<DiscoveryNodes> nodesInCluster) {
    return List.of(new RestMyAction());
}

This is exactly what Level 3 lab 3.3 builds. The handler's routes() declare the paths; ActionModule/RestController wire them into the trie at startup.

Worked example: RestSearchAction

find server -name "RestSearchAction.java"
grep -n "routes()\|prepareRequest\|parseSearchRequest\|SearchAction.INSTANCE\|RestToXContentListener\|RestStatusToXContentListener" \
  server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java

The path of a GET /orders/_search:

sequenceDiagram
    participant HTTP as HTTP :9200
    participant RC as RestController
    participant RSA as RestSearchAction
    participant NC as NodeClient
    participant TSA as TransportSearchAction
    HTTP->>RC: GET /orders/_search {query}
    RC->>RSA: handleRequest (matched route)
    RSA->>RSA: prepareRequest: parseSearchRequest (XContent + params)
    RSA-->>RC: RestChannelConsumer
    RC->>NC: consumer.accept(channel) -> client.execute(SearchAction.INSTANCE, searchRequest, RestToXContentListener)
    NC->>TSA: dispatch to TransportSearchAction
    TSA-->>NC: SearchResponse (async)
    NC->>RSA: listener.onResponse(SearchResponse)
    RSA->>HTTP: RestToXContentListener renders JSON + 200

From here the story continues in search-execution.md.

Reading exercise

# 1. Dispatch and routing.
grep -n "dispatchRequest\|tryAllHandlers\|handleBadRequest\|PathTrie" \
  server/src/main/java/org/opensearch/rest/RestController.java

# 2. The two-step contract.
grep -n "prepareRequest\|RestChannelConsumer\|unrecognized\|consumeParam" \
  server/src/main/java/org/opensearch/rest/BaseRestHandler.java

# 3. A real handler end to end.
sed -n '1,120p' server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java

# 4. Error rendering.
grep -n "RestStatus\|build\|XContentBuilder" \
  server/src/main/java/org/opensearch/rest/BytesRestResponse.java

# 5. Tests.
./gradlew :server:test --tests "org.opensearch.rest.RestControllerTests"
./gradlew :server:test --tests "org.opensearch.rest.action.search.RestSearchActionTests"

Answer:

How does RestController decide which handler serves GET /a/_search? What does it return if the path matches but the method does not?
Why is handling split into prepareRequest and the returned RestChannelConsumer? What guarantee does the split give about parse errors?
What does BaseRestHandler do with a query parameter you accept but never consume? With one the handler does not recognize at all?
What does NamedXContentRegistry enable, and how is it the XContent twin of NamedWriteableRegistry from the transport layer?
Trace how a missing-index error becomes an HTTP 404. Which class maps the exception to the status?
In RestSearchAction, find where the parsed SearchRequest is handed to the action framework. Which ActionType and which listener are used?

Common bugs and symptoms

Symptom	Root cause	Where to look
`contains unrecognized parameter: [foo]` for a param you support	Param not consumed in `prepareRequest`	`request.param("foo")` missing; `BaseRestHandler` consumed-param tracking
Body silently ignored	Parsed on the wrong content path, or `withContentOrSourceParamParserOrNull` not called	`RestRequest` content accessors; XContent parsing
`415 Unsupported Media Type`	Missing/wrong `Content-Type` header	`RestRequest`/`RestController` content-type checks
Plugin REST route returns 404	Handler not returned from `getRestHandlers`, or wrong `routes()`	`ActionPlugin.getRestHandlers`; the handler's `routes()`
Error returns 500 where 400 was right	Threw a generic exception instead of one with a `RestStatus`	use `IllegalArgumentException`/`*Exception` with proper status; `BytesRestResponse`
`{"query":{"customq":...}}` fails to parse	Query type not in `NamedXContentRegistry` (plugin not installed)	`getNamedXContent()`; query-dsl-querybuilders.md

Validation: prove you understand this

Draw the path from HTTP bytes on 9200 to a NodeClient.execute call, naming every class an OpenSearch request touches.
Explain the prepareRequest / RestChannelConsumer split and the exact guarantee it makes about when parse errors occur.
State the rule about consuming query parameters and the two failure modes of getting it wrong.
Explain NamedXContentRegistry and give a concrete request body whose parsing depends on it.
Map three errors (missing index, malformed query, breaker trip) to their HTTP statuses and name the class that performs the mapping.
From memory, write the getRestHandlers signature a plugin overrides to add a REST endpoint, and name what routes() on the handler must return.

The Action Framework

The action framework is the dispatch backbone of OpenSearch. Between the REST edge (rest-layer.md) and the transport wire (transport-layer.md) sits a uniform abstraction: every operation — index, get, search, create-index, cluster-health, your custom plugin action — is an ActionType with a request and a response, served by a TransportAction, registered in ActionModule, and invoked through the NodeClient/Client. This uniformity is what lets the same machinery route a request to the elected cluster manager, fan it out to all shards, or replicate it to a primary-then-replica without each feature reinventing routing. This chapter explains the action contracts, the family of TransportAction base classes (which encode the common routing patterns), registration, filters, and the ActionListener async model. It then contrasts the write path with the read path.

After this chapter you should be able to: pick the right TransportAction base class for a new action; register an ActionType end-to-end; explain ActionListener and ActionFilters; and trace a write through TransportReplicationAction (primary→replica) vs a read through TransportSingleShardAction.

Note: TransportClusterManagerNodeAction is the renamed TransportMasterNodeAction. Both names appear in the tree; they route an action to the elected cluster manager.

The four contracts

find server libs -name "ActionType.java"
find server libs -name "ActionRequest.java" -o -name "ActionResponse.java" -o -name "ActionListener.java"
grep -n "class ActionType\|String name()\|Writeable.Reader" \
  server/src/main/java/org/opensearch/action/ActionType.java 2>/dev/null

Contract	Role
`ActionType<Response>`	The typed identity of an action. Carries the action name (e.g. `"indices:data/read/search"`) and the response reader. The key both clients and the registry use.
`ActionRequest`	Base for request payloads. Implements `Writeable` and declares `validate()` for input checks.
`ActionResponse`	Base for response payloads. Implements `Writeable`.
`ActionListener<Response>`	The async callback: `onResponse(Response)` / `onFailure(Exception)`. Everything is async.

ActionListener deserves emphasis: OpenSearch is callback-driven, not blocking-call-driven. A request returns immediately and the result arrives later on onResponse/onFailure. There are rich combinators:

grep -n "static .*wrap\|static .*map\|delegateFailure\|runAfter\|runBefore" \
  server/src/main/java/org/opensearch/core/action/ActionListener.java 2>/dev/null \
  || grep -rn "wrap\|map\|delegateFailure\|runAfter" libs/core/src/main/java/org/opensearch/core/action/ActionListener.java

Combinator	Use
`ActionListener.wrap(onResp, onFail)`	Build a listener from two lambdas.
`map(fn)`	Transform the response before passing it on.
`delegateFailure(...)`	Reuse the same failure path while customizing success.
`runAfter(...)` / `runBefore(...)`	Run cleanup regardless of outcome.

Warning: Every code path must call the listener exactly once — never zero (a hung request) and never twice (a double-response error). This is the most common bug in new actions. Audit every branch, including exception handlers, to guarantee exactly-once completion.
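The exactly-once rule is easier to audit with a concrete shape in front of you. The sketch below uses a hypothetical SketchListener interface (a stand-in for ActionListener, not the real type) and shows two things: a guard that drops any completion after the first, and a method whose branches are arranged so that each one completes the listener once.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for an async callback with a success and a failure path. */
interface SketchListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

/** Guard: forwards the first completion and silently drops any later one. */
final class NotifyOnceSketch<T> implements SketchListener<T> {
    private final SketchListener<T> delegate;
    private final AtomicBoolean done = new AtomicBoolean();

    NotifyOnceSketch(SketchListener<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void onResponse(T response) {
        if (done.compareAndSet(false, true)) {
            delegate.onResponse(response);
        }
    }

    @Override
    public void onFailure(Exception e) {
        if (done.compareAndSet(false, true)) {
            delegate.onFailure(e);
        }
    }
}

final class ExactlyOnceSketch {
    /** Each branch completes the listener once: fail and return, or fall through to success. */
    static void parseLength(String input, SketchListener<Integer> listener) {
        final int length;
        try {
            length = input.length();      // throws NullPointerException for null input
        } catch (Exception e) {
            listener.onFailure(e);        // the failure branch completes the listener...
            return;                       // ...and returns, so success cannot also fire
        }
        listener.onResponse(length);      // outside the try: a throwing onResponse cannot reach the catch
    }
}
```

Calling onResponse outside the try block is the design point: if it sat inside, an exception thrown by the listener itself would land in the catch and complete the listener a second time.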

The TransportAction family

The base classes encode the routing pattern so individual actions don't:

find server -name "TransportAction.java" -o -name "HandledTransportAction.java" \
  -o -name "TransportSingleShardAction.java" -o -name "TransportBroadcastAction.java" \
  -o -name "TransportReplicationAction.java" -o -name "TransportClusterManagerNodeAction.java" \
  -o -name "TransportMasterNodeAction.java"

Base class	Routing pattern	Example actions
`HandledTransportAction`	Runs locally on the receiving (usually coordinating) node; no special routing.	many admin / simple actions
`TransportClusterManagerNodeAction` (was `TransportMasterNodeAction`)	Route to the elected cluster manager; it owns cluster-state changes.	create/delete index, put mapping, update settings
`TransportSingleShardAction`	Route to one shard copy (primary or any replica) that holds the doc.	`get`, `explain`
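`TransportBroadcastAction` (and `TransportBroadcastByNodeAction`)	Fan out to all relevant shards, gather, reduce.	`validate query`, `_stats` (by-node variant); `refresh` follows the same fan-out shape through the related `TransportBroadcastReplicationAction`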
`TransportReplicationAction`	Write path: route to the primary, then replicate to in-sync replicas.	index, delete, bulk-shard

flowchart TD
    Req[ActionRequest] --> Pick{routing pattern}
    Pick -->|cluster-state change| CM[TransportClusterManagerNodeAction -> elected cluster manager]
    Pick -->|single doc read| SS[TransportSingleShardAction -> one shard copy]
    Pick -->|all shards| BC[TransportBroadcastAction -> fan out + reduce]
    Pick -->|write| REP[TransportReplicationAction -> primary then replicas]
    Pick -->|local/simple| HT[HandledTransportAction -> here]

Choosing the base class correctly is most of the design work for a new action. Picking HandledTransportAction for something that mutates cluster state, for instance, would skip the required routing to the cluster manager and corrupt state on a non-manager node.

Registration: ActionModule

Actions are wired in ActionModule, which builds the map from ActionType → TransportAction and registers REST handlers. Plugins extend it via ActionPlugin.getActions().

grep -n "registerAction\|ActionHandler\|setupActions\|getActions" \
  server/src/main/java/org/opensearch/action/ActionModule.java
grep -n "getActions\|class ActionHandler" \
  server/src/main/java/org/opensearch/plugins/ActionPlugin.java

// Core: pairs the ActionType with its TransportAction implementation.
actions.register(SearchAction.INSTANCE, TransportSearchAction.class);

// Plugin:
@Override
public List<ActionHandler<?, ?>> getActions() {
    return List.of(new ActionHandler<>(MyAction.INSTANCE, TransportMyAction.class));
}

flowchart LR
    AT["ActionType (name)"] --> AM[ActionModule registry]
    TA[TransportAction impl] --> AM
    AM -->|"NodeClient.execute(AT, req)"| Lookup[lookup by AT]
    Lookup --> Run[run the TransportAction]

ActionFilters: the interception chain

Before a TransportAction runs, the request passes through an ordered chain of ActionFilters. This is the cross-cutting hook used for security authorization/authentication (the out-of-repo security plugin), request auditing, and rate limiting.

find server -name "ActionFilters.java" -o -name "ActionFilter.java"
grep -n "interface ActionFilter\|apply\|order()\|class ActionFilterChain" \
  server/src/main/java/org/opensearch/action/support/ActionFilter.java

flowchart LR
    C[client.execute] --> F1["ActionFilter 1 (e.g. security)"]
    F1 --> F2[ActionFilter 2]
    F2 --> TA[TransportAction.doExecute]
    TA --> Resp[response back up the chain]

Filters are ordered by order() and can short-circuit (deny) or modify the request/response. This is exactly where the security plugin enforces access control without core knowing anything about it.
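The ordering and short-circuit behavior can be sketched as a small chain. This is a synchronous simplification with hypothetical names (SketchFilter, SketchChain); the real chain is asynchronous and passes a listener along, which the sketch replaces with a plain return value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical filter: continue via chain.proceed(request) or return a response to stop. */
interface SketchFilter {
    int order();
    String apply(String request, SketchChain chain);
}

/** Immutable view of "the filters still to run", ending in the action itself. */
final class SketchChain {
    private final List<SketchFilter> filters;
    private final Function<String, String> action;
    private final int index;

    SketchChain(List<SketchFilter> filters, Function<String, String> action) {
        this(sorted(filters), action, 0);
    }

    private SketchChain(List<SketchFilter> sortedFilters, Function<String, String> action, int index) {
        this.filters = sortedFilters;
        this.action = action;
        this.index = index;
    }

    private static List<SketchFilter> sorted(List<SketchFilter> filters) {
        List<SketchFilter> copy = new ArrayList<>(filters);
        copy.sort(Comparator.comparingInt(SketchFilter::order)); // lowest order runs first
        return copy;
    }

    /** Runs the next filter, or the action once every filter has passed the request on. */
    String proceed(String request) {
        if (index == filters.size()) {
            return action.apply(request);
        }
        return filters.get(index).apply(request, new SketchChain(filters, action, index + 1));
    }
}
```

A filter that returns without calling proceed has denied the request: neither the later filters nor the action ever see it, which is the property an authorization filter depends on.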

The client: NodeClient and Client

find server -name "NodeClient.java" -o -name "Client.java"
grep -n "public .* execute\|doExecute\|executeLocally" \
  server/src/main/java/org/opensearch/client/node/NodeClient.java

Type	Role
`Client`	The interface callers use. Has convenience methods (`index(...)`, `search(...)`, `get(...)`) and the generic `execute(ActionType, request, listener)`.
`NodeClient`	The in-node implementation. `execute` looks up the `TransportAction` for the `ActionType` and runs it locally (which may itself send transport requests to other nodes).

REST handlers receive a NodeClient and call client.execute(ActionType, req, listener). That is the single seam between the REST layer and the action layer.

The write path: TransportReplicationAction

A write (index/delete/bulk-shard) is the canonical replication action. The flow:

grep -n "shardOperationOnPrimary\|shardOperationOnReplica\|ReplicationOperation\|PrimaryShardReference\|class TransportReplicationAction" \
  server/src/main/java/org/opensearch/action/support/replication/TransportReplicationAction.java

sequenceDiagram
    participant Co as Coordinating node
    participant P as Primary shard node
    participant R1 as Replica node 1
    participant R2 as Replica node 2
    Co->>P: route to PRIMARY (per routing table)
    P->>P: shardOperationOnPrimary -> IndexShard.applyIndexOperationOnPrimary
    Note over P: writes to Lucene + translog
    par replicate to in-sync copies
        P->>R1: replica request
        R1->>R1: shardOperationOnReplica -> applyIndexOperationOnReplica
        P->>R2: replica request
        R2->>R2: shardOperationOnReplica
    end
    R1-->>P: ok
    R2-->>P: ok
    P-->>Co: success once every in-sync copy has responded

Key facts (detailed in replication.md and index-shard-lifecycle.md):

The request is routed to the primary first; the primary is the serialization point for the document.
The primary applies the op (IndexShard.applyIndexOperationOnPrimary → InternalEngine), then fans out to the in-sync replica copies.
Sequencing uses primary terms and sequence numbers (ReplicationTracker).
wait_for_active_shards controls how many copies must be available before the op proceeds.
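The key facts above can be modelled in a few lines. This is a toy simulation with hypothetical names (ShardCopySketch, ReplicationSketch); it replicates sequentially, omits primary terms and wait_for_active_shards, and treats "drop from the in-sync set" as a local list removal rather than a cluster-state update.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical shard copy that records the operations it applied, keyed by sequence number. */
final class ShardCopySketch {
    final String name;
    final boolean healthy;
    final Map<Long, String> opsBySeqNo = new LinkedHashMap<>();

    ShardCopySketch(String name, boolean healthy) {
        this.name = name;
        this.healthy = healthy;
    }

    void apply(long seqNo, String doc) {
        if (!healthy) {
            throw new IllegalStateException(name + " failed");
        }
        opsBySeqNo.put(seqNo, doc);
    }
}

final class ReplicationSketch {
    private final ShardCopySketch primary;
    private final List<ShardCopySketch> inSyncReplicas;
    private long nextSeqNo = 0;

    ReplicationSketch(ShardCopySketch primary, List<ShardCopySketch> replicas) {
        this.primary = primary;
        this.inSyncReplicas = new ArrayList<>(replicas);
    }

    /** Returns the sequence number the primary assigned to this operation. */
    long index(String doc) {
        long seqNo = nextSeqNo++;                  // only the primary assigns sequence numbers
        primary.apply(seqNo, doc);                 // step 1: the primary applies first
        List<ShardCopySketch> failed = new ArrayList<>();
        for (ShardCopySketch replica : inSyncReplicas) { // step 2: fan out with the same seqNo
            try {
                replica.apply(seqNo, doc);
            } catch (IllegalStateException e) {
                failed.add(replica);
            }
        }
        inSyncReplicas.removeAll(failed);          // a failed copy leaves the in-sync set
        return seqNo;
    }

    List<String> inSyncNames() {
        List<String> names = new ArrayList<>();
        for (ShardCopySketch replica : inSyncReplicas) {
            names.add(replica.name);
        }
        return names;
    }
}
```

Because every replica applies the operation under the sequence number the primary chose, a healthy replica ends up with the same ordered history as the primary.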

The read path: TransportSingleShardAction

A get reads one document from one copy:

grep -n "shardOperation\|resolveRequest\|shards(\|class TransportSingleShardAction" \
  server/src/main/java/org/opensearch/action/support/single/shard/TransportSingleShardAction.java

sequenceDiagram
    participant Co as Coordinating node
    participant S as A shard copy (primary or replica)
    Co->>Co: resolve which shard id holds the doc (routing)
    Co->>S: send to one copy from the shard's iterator
    S->>S: shardOperation -> IndexShard.get
    alt copy fails
        Co->>S2: try the next copy in the iterator
    end
    S-->>Co: GetResponse
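The fall-through in the diagram is a loop over the shard's copies. In this sketch each copy is modelled as a plain function from document id to result; SingleShardReadSketch and readFromFirstAvailable are hypothetical names, and the real action iterates shard routings and sends transport requests instead.

```java
import java.util.List;
import java.util.function.Function;

final class SingleShardReadSketch {
    /** Tries each copy in iterator order; the first success wins, the last failure is reported. */
    static <T> T readFromFirstAvailable(List<Function<String, T>> copies, String docId) {
        RuntimeException last = null;
        for (Function<String, T> copy : copies) {
            try {
                return copy.apply(docId);
            } catch (RuntimeException e) {
                last = e; // fall through to the next copy in the iterator
            }
        }
        throw new IllegalStateException("no shard copy available for [" + docId + "]", last);
    }
}
```

A read needs no ordering primitive here: any copy that answers is acceptable, so failure handling is just "try the next one" until the iterator is exhausted.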

The contrast is instructive:

Aspect	Write (`TransportReplicationAction`)	Read (`TransportSingleShardAction`)
Target	primary, then all in-sync replicas	one copy (primary or any replica)
Goal	durability + replica consistency	low-latency single fetch
Failure	replica failure → mark out-of-sync; retry	copy failure → fall through to the next copy
Ordering	primary term + seq no	none needed

A multi-shard read (search) uses neither of these directly; it has its own scatter/gather machinery — see search-execution.md.

End-to-end: client.execute to shard

flowchart TD
    REST[REST handler] -->|"client.execute(ActionType, req, listener)"| NC[NodeClient]
    NC --> AF[ActionFilters chain]
    AF --> TA[TransportAction.doExecute]
    TA -->|routing pattern| Shard["shard-level operation (primary/replica/single/broadcast)"]
    Shard --> Eng["IndexShard / Engine / SearchService"]
    Eng --> L["ActionListener.onResponse"]
    L --> REST

Reading exercise

# 1. The base classes side by side.
for c in HandledTransportAction TransportSingleShardAction TransportBroadcastAction \
         TransportReplicationAction TransportClusterManagerNodeAction; do
  echo "== $c =="; find server -name "$c.java"
done

# 2. Registration.
grep -n "register\|getActions\|ActionHandler" \
  server/src/main/java/org/opensearch/action/ActionModule.java | head

# 3. ActionListener combinators.
grep -n "wrap\|map\|delegateFailure\|runAfter" \
  libs/core/src/main/java/org/opensearch/core/action/ActionListener.java

# 4. The write path.
grep -n "shardOperationOnPrimary\|shardOperationOnReplica" \
  server/src/main/java/org/opensearch/action/bulk/TransportShardBulkAction.java

# 5. Tests.
./gradlew :server:test --tests "org.opensearch.action.support.replication.TransportReplicationActionTests"

Answer:

Map each of these to a TransportAction base class and justify: PUT /idx/_mapping, GET /idx/_doc/1, POST /idx/_refresh, POST /idx/_doc.
State the exactly-once rule for ActionListener and describe the symptom of violating it in each direction (zero calls, two calls).
How does ActionModule connect an ActionType to a TransportAction, and how does a plugin add its own pair?
Where do ActionFilters run in the pipeline, and what real out-of-repo plugin uses them for authorization?
In TransportReplicationAction, identify the primary and replica operation methods. What ordering primitives keep replicas consistent?
Contrast the failure handling of a replica failing a write vs a copy failing a single-shard read.

Common bugs and symptoms

Symptom	Root cause	Where to look
Request hangs forever, no response	`ActionListener` never called on some branch	audit every path/exception handler for exactly-once completion
`Response already sent` / double completion	Listener called twice	same audit; especially in retry/fallback code
Cluster-state change applied on a non-manager node, corruption	Used `HandledTransportAction` for a metadata change instead of `TransportClusterManagerNodeAction`	base-class choice; route to cluster manager
Write succeeds on primary but replicas diverge	Replica op not idempotent / seq-no handling wrong	`shardOperationOnReplica`; replication.md
Action not found: `No handler for action [...]`	`ActionType` not registered in `ActionModule`/`getActions`	registration; name mismatch
Security plugin not enforcing on a new action	Action bypasses the filter chain (custom dispatch)	run through `NodeClient.execute`; `ActionFilters`

Validation: prove you understand this

List the five TransportAction base classes and the routing pattern each encodes. For each, name one real OpenSearch action.
Explain the ActionListener async contract and the exactly-once rule, with the failure symptom for each violation.
Write the getActions() snippet a plugin uses to register a custom ActionType/TransportAction pair.
Draw the write path through TransportReplicationAction from coordinating node to primary to replicas, and name the engine method the primary calls.
Draw the read path through TransportSingleShardAction and explain the fall-through-to-next-copy behavior on failure.
Explain where ActionFilters sit and why the security plugin relies on every action flowing through NodeClient.execute.

Thread Pools and Concurrency

OpenSearch is a heavily concurrent server, and almost all of that concurrency is mediated by a small set of named thread pools. Understanding which pool runs which work, and which threads you must never block, is the difference between code that scales and code that wedges a node. This chapter covers the ThreadPool abstraction and its named pools (SEARCH, WRITE, GET, MANAGEMENT, GENERIC, etc.), the pool types (fixed, scaling, fixed_auto_queue_size), the single-writer-per-shard invariant that simplifies the engine, how rejections and queues behave, ThreadContext propagation, and the cardinal rule that you must never block the cluster-applier thread.

After this chapter you should be able to: name the pool that runs any given operation; read _cat/thread_pool to diagnose a saturated node; explain why the write path can avoid locks inside a shard; and explain the cluster-wide cascade that follows from blocking a coordination thread.

Note: Blocking the wrong thread is the most damaging concurrency mistake in the codebase. It rarely fails a test; it manifests in production as a node that stops accepting work or a cluster that stops applying state. The rules in this chapter are not style — they are correctness.

The ThreadPool abstraction

find server -name "ThreadPool.java" -path "*threadpool*"
grep -n "public static class Names\|public static final String SEARCH\|public static final String WRITE\|\
public static final String GET\|public static final String GENERIC\|public static final String MANAGEMENT" \
  server/src/main/java/org/opensearch/threadpool/ThreadPool.java

ThreadPool owns every executor and exposes them by name via ThreadPool.Names. You obtain one with threadPool.executor(Names.SEARCH) or schedule with threadPool.schedule(...). The pools that matter most:

Name (`ThreadPool.Names.*`)	Type	Runs
`SEARCH`	fixed (auto-queue)	query and fetch phases of search
`WRITE`	fixed	index/delete/bulk-shard write operations
`GET`	fixed	single-document `get`
`SEARCH_THROTTLED`	fixed	searches against throttled (frozen-like) indices
`MANAGEMENT`	scaling	lightweight management tasks, `_cat`, stats
`GENERIC`	scaling	general-purpose, possibly long-running, may block
`SNAPSHOT`	scaling	snapshot/restore byte work
`REFRESH`	scaling	shard refreshes
`FLUSH`	scaling	shard flushes (Lucene commits)
`WARMER`	scaling	searcher warming
`LISTENER`	fixed	client listener callbacks
`FORCE_MERGE`	fixed (size 1)	force-merge requests

# See the full configured set and sizes on a running node:
curl -s "localhost:9200/_cat/thread_pool?v&h=node_name,name,type,size,queue,active,rejected,completed"

Pool types

OpenSearch sizes executors with three strategies. Find them:

find server libs -name "OpenSearchExecutors.java" -o -name "EsExecutors.java"
grep -n "newFixed\|newScaling\|newAutoQueueFixed\|class OpenSearchExecutors" \
  server/src/main/java/org/opensearch/common/util/concurrent/OpenSearchExecutors.java 2>/dev/null \
  || grep -rn "newFixed\|newScaling\|newAutoQueueFixed" server/src/main/java/org/opensearch/common/util/concurrent/

Type	Behavior	Used for
fixed	Fixed thread count + bounded queue. Excess work is rejected when the queue is full.	`SEARCH`, `WRITE`, `GET` — bounded resources you must not over-commit.
scaling	Grows up to a max under load, shrinks when idle (with a keep-alive). The thread count is bounded by the max; work that arrives while every thread is busy queues instead of being rejected.	`GENERIC`, `MANAGEMENT`, `SNAPSHOT` (bursty or long-running work).
fixed_auto_queue_size	Fixed threads with a queue size that auto-tunes based on measured latency targets (a feedback controller).	`SEARCH` by default — adapts queue depth to keep latency in check.

The choice encodes a policy: bounded pools (fixed) protect the node by rejecting rather than queueing unboundedly; scaling pools tolerate bursts of work that is expected to be occasional or slow.

Warning: Never submit blocking or long-running work to a fixed pool sized for short operations (SEARCH, WRITE, GET). A blocked thread there is lost to a small pool for as long as it blocks; a few of them stall all search or all indexing on the node. Long/blocking work goes on GENERIC.

Rejections and queues

A fixed pool with a full queue rejects new work, surfacing as OpenSearchRejectedExecutionException and, to clients, HTTP 429 Too Many Requests with a rejected_execution cause. This is a feature — back-pressure — not a crash. The relevant counters:

curl -s "localhost:9200/_nodes/stats/thread_pool?filter_path=nodes.*.thread_pool.search.rejected,nodes.*.thread_pool.write.rejected"
curl -s "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed"

Column	Meaning
`active`	threads currently running tasks
`queue`	tasks waiting (bounded for fixed pools)
`rejected`	tasks dropped because the queue was full (cumulative)
`completed`	total tasks finished

Rising rejected on write means indexing back-pressure (clients should slow down / retry); rising rejected on search means the query load exceeds capacity. The fix is usually upstream (fewer/cheaper requests, more nodes) — not "make the queue bigger," which just trades rejection for latency and heap pressure.
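Rejection as back-pressure can be reproduced with the JDK alone. The sketch below is not how OpenSearch builds its executors; it uses a plain java.util.concurrent.ThreadPoolExecutor with a bounded queue and the abort policy, and BoundedPoolSketch is a hypothetical name. Every task blocks until the end, so the pool saturates deterministically.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolSketch {
    /** Submits blocking tasks to a fixed pool with a bounded queue; returns how many were rejected. */
    static int countRejections(int threads, int queueSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize),
            new ThreadPoolExecutor.AbortPolicy());      // full queue means reject, not block
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();            // hold the thread until the test is over
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;                          // this is what surfaces as a 429 upstream
                }
            }
        } finally {
            release.countDown();                         // let the held tasks finish
            pool.shutdown();
        }
        return rejected;
    }
}
```

With 2 threads and 3 queue slots, 10 blocking tasks leave 5 rejected: capacity is threads plus queue, and everything beyond it is pushed back to the caller. Enlarging the queue lowers the rejection count only by letting more work wait.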

The single-writer-per-shard model

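A crucial simplification: each shard has exactly one engine and one Lucene IndexWriter, and every write reaches that writer through the shard's operation permits. Several indexing threads may use the writer at once (IndexWriter is thread-safe for add/update), but any operation that needs the shard to itself is serialized against them, so the engine (engine-internals.md) does not need a coarse lock around the writer for indexing.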

grep -n "indexShardOperationPermits\|acquirePrimaryOperationPermit\|acquireReplicaOperationPermit\|\
class IndexShardOperationPermits" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java \
  server/src/main/java/org/opensearch/index/shard/IndexShardOperationPermits.java 2>/dev/null

The mechanism is IndexShardOperationPermits: operations acquire a permit before touching the shard. Normal ops share permits (many can read/index concurrently at the Lucene level, which is itself thread-safe for add/update), but operations that must run exclusively (relocation, primary-term bump, resync) acquire all permits, briefly blocking new ops. This permit system — not a giant lock — is how the shard coordinates concurrent access while keeping the hot path lock-light.

flowchart TD
    Op1[index op] -->|acquire shared permit| Permits[IndexShardOperationPermits]
    Op2[index op] -->|acquire shared permit| Permits
    Relocate["relocation / primary-term bump"] -->|acquire ALL permits| Permits
    Permits -->|"exclusive: wait for in-flight to drain"| Block[new ops queue briefly]
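The shared-versus-exclusive idea maps directly onto a counting semaphore. The sketch below is a simplification with a hypothetical class name: it uses non-blocking tryAcquire so the behavior is easy to observe, whereas the real permit class delays and queues operations while an exclusive holder is active.

```java
import java.util.concurrent.Semaphore;

final class ShardPermitsSketch {
    static final int TOTAL_PERMITS = Integer.MAX_VALUE;
    private final Semaphore semaphore = new Semaphore(TOTAL_PERMITS, true);

    /** Normal operation: take one permit; fails if an exclusive holder has them all. */
    boolean tryAcquireShared() {
        return semaphore.tryAcquire(1);
    }

    void releaseShared() {
        semaphore.release(1);
    }

    /** Exclusive operation: succeeds only when no shared permit is outstanding. */
    boolean tryAcquireAll() {
        return semaphore.tryAcquire(TOTAL_PERMITS);
    }

    void releaseAll() {
        semaphore.release(TOTAL_PERMITS);
    }

    /** Number of permits currently held by in-flight operations. */
    int inFlight() {
        return TOTAL_PERMITS - semaphore.availablePermits();
    }
}
```

Ordinary operations never contend with each other because each takes a single permit out of a very large supply; only the take-everything caller has to wait for in-flight work to drain.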

ThreadContext propagation

A single logical request hops across thread pools and across nodes. Request headers, response headers (such as deprecation warnings), the request's security identity (used by the security plugin), and transient values must travel with it. That is ThreadContext.

find server libs -name "ThreadContext.java"
grep -n "stashContext\|newStoredContext\|putHeader\|putTransient\|class ThreadContext" \
  libs/common/src/main/java/org/opensearch/common/util/concurrent/ThreadContext.java 2>/dev/null \
  || grep -rn "class ThreadContext" server libs

Operation	Effect
`putHeader`	Attach a header that propagates over the transport wire.
`putTransient`	Attach a node-local value (not sent on the wire).
`stashContext()`	Clear the current context (e.g. before running internal work as the system).
`newStoredContext(...)`	Capture the current context to restore later when work resumes on another thread.

Warning: When you hand work to another thread pool (or a listener), the framework normally restores the originating ThreadContext so headers and identity follow the request. If you bypass the standard executor wrappers (raw Thread, raw Executor), you lose the context — security headers vanish, and the security plugin may run the work as the wrong principal. Always schedule through ThreadPool/threadPool.executor(...).
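Why a raw thread loses the context follows from how thread-local state works. The sketch below models the context as a ThreadLocal map; ContextSketch and its methods are hypothetical and much simpler than ThreadContext, but the capture-then-restore wrapper is the same idea the standard executor wrappers apply for you.

```java
import java.util.HashMap;
import java.util.Map;

final class ContextSketch {
    private static final ThreadLocal<Map<String, String>> HEADERS =
        ThreadLocal.withInitial(HashMap::new);

    static void putHeader(String key, String value) {
        HEADERS.get().put(key, value);
    }

    static String getHeader(String key) {
        return HEADERS.get().get(key);
    }

    /** Captures the caller's headers now and reinstates them around the task wherever it runs. */
    static Runnable preserveContext(Runnable task) {
        Map<String, String> captured = new HashMap<>(HEADERS.get());
        return () -> {
            Map<String, String> previous = HEADERS.get();
            HEADERS.set(new HashMap<>(captured));
            try {
                task.run();
            } finally {
                HEADERS.set(previous); // leave the worker thread as we found it
            }
        };
    }

    /** Runs a lookup on a fresh thread and returns what that thread saw for the key. */
    static String seenOnNewThread(String key, boolean wrap) {
        String[] seen = new String[1];
        Runnable task = () -> seen[0] = getHeader(key);
        Thread thread = new Thread(wrap ? preserveContext(task) : task);
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }
}
```

The unwrapped task sees no header at all on the new thread, which is the "security headers vanish" failure; the wrapped task sees the caller's value because the capture happened on the submitting thread.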

The cardinal rule: never block coordination threads

Two thread types must never block:

The cluster-applier thread (ClusterApplierService) — runs appliers and listeners synchronously when a new cluster state is committed.
The cluster-manager update thread (MasterService) — runs cluster-state update tasks.

(Both are introduced in cluster-state-publishing.md.)

Why blocking the applier thread is catastrophic, link by link:

flowchart TD
    Block["applier/listener blocks (I/O, lock, sleep)"] --> NoApply[committed state not finished applying]
    NoApply --> NoAck[node does not ack the commit]
    NoAck --> PubTimeout["publish times out (cluster.publish.timeout)"]
    PubTimeout --> StepDown["cluster manager may step down"]
    StepDown --> Election[new election / churn]
    Election --> ClusterWide[cluster-wide instability]

The same logic applies to the update thread: a blocked execute() stalls all cluster-state progress for the entire cluster, because there is one such thread.

The fix is always the same: do the slow part on a worker pool. Inside an applier or listener, capture what you need, hand the heavy work to threadPool.generic().execute(...) (or the appropriate pool), and return immediately.
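The capture-and-offload shape looks like this in miniature. The sketch uses a plain java.util.concurrent.Executor in place of a named pool, and ApplierOffloadSketch and its callback are hypothetical; the point is only the order of events on the calling thread.

```java
import java.util.List;
import java.util.concurrent.Executor;

final class ApplierOffloadSketch {
    /**
     * Stand-in for a cluster-state callback running on the applier thread.
     * The log is used from one thread at a time in this sketch; real code
     * would need a thread-safe sink.
     */
    static void onClusterStateChanged(long stateVersion, Executor worker, List<String> log) {
        // Capture only the values the slow work needs, then hand it off.
        worker.execute(() -> log.add("slow work for v" + stateVersion));
        // Nothing slow happens here, so the callback returns immediately.
        log.add("applier returned for v" + stateVersion);
    }
}
```

With an executor that defers execution, the "applier returned" entry is logged first and the slow work appears only when the worker actually runs it, which is exactly the decoupling the rule asks for.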

Note: This is exactly what Level 4 lab 4.2 and Level 4 lab 4.3 make you internalize by building (and mis-building) a cluster-state callback.

assertBusy and concurrency in tests

Because so much happens asynchronously, tests cannot assert "right now." They poll with assertBusy, which retries an assertion until it passes or times out:

grep -n "public static void assertBusy\|busy" \
  test/framework/src/main/java/org/opensearch/test/OpenSearchTestCase.java

assertBusy(() -> {
    SearchResponse r = client().prepareSearch("idx").get();
    assertHitCount(r, 1);   // becomes true once the refresh makes the doc visible
});

assertBusy is the idiomatic way to wait for refresh, recovery, allocation, or any other eventually-consistent state in tests — never Thread.sleep. Using it correctly (with a bounded timeout) is a reviewable requirement; see Level 2 lab 2.2.
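To see why polling beats a fixed sleep, here is a minimal retry helper in the same spirit. It is a sketch, not the test framework's implementation: AssertBusySketch is a hypothetical name and the backoff numbers are arbitrary.

```java
import java.util.concurrent.TimeUnit;

final class AssertBusySketch {
    /** Retries the assertion with growing pauses until it passes or the time budget runs out. */
    static void assertBusy(Runnable assertion, long maxWaitMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return;                                   // passed: stop waiting right away
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e;                              // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100); // back off, but keep polling regularly
        }
    }
}
```

The helper returns as soon as the condition holds and fails with the real assertion message when it never does, so the test is neither slower than necessary nor silently dependent on timing.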

Reading exercise

# 1. The named pools and their default types/sizes.
grep -n "Names\.\|new ScalingExecutorBuilder\|new FixedExecutorBuilder\|new AutoQueueAdjustingExecutorBuilder" \
  server/src/main/java/org/opensearch/threadpool/ThreadPool.java

# 2. Live saturation snapshot.
curl -s "localhost:9200/_cat/thread_pool?v&h=node_name,name,type,active,queue,rejected"

# 3. The permit system that serializes shard writers.
grep -n "acquire\|asyncBlockOperations\|allPermits\|TOTAL_PERMITS" \
  server/src/main/java/org/opensearch/index/shard/IndexShardOperationPermits.java

# 4. ThreadContext stash/restore.
find server libs -name "ThreadContext.java" \
  | xargs grep -n "stashContext\|newStoredContext\|restore"

# 5. Tests.
./gradlew :server:test --tests "org.opensearch.threadpool.ThreadPoolTests"

Answer:

Which pool runs each of: a search query phase, a bulk write, a get, a shard refresh, a snapshot copy, a _cat/health?
What is the difference between a fixed and a scaling pool, and why is WRITE fixed while GENERIC is scaling?
A client sees HTTP 429 rejected_execution on indexing. Which counter confirms it, and what is the correct remediation (and the wrong one)?
Explain the single-writer-per-shard invariant and how IndexShardOperationPermits implements both shared and exclusive access.
What does ThreadContext carry, and what breaks if you run work on a raw Thread instead of through ThreadPool?
Walk the full cascade from "an applier blocks" to "the cluster manager steps down." Name the timeout that triggers the step-down.

Common bugs and symptoms

Symptom	Root cause	Where to look
Node stops serving search; `search` pool `active` pinned, `rejected` climbing	Blocking/heavy work on the `SEARCH` fixed pool	offending code; move blocking work to `GENERIC`
HTTP `429 rejected_execution` on bulk	`WRITE` queue full — legitimate back-pressure	client retry/backoff; scale out; not "bigger queue"
Cluster intermittently loses its cluster manager under load	An applier/listener blocking the applier thread	thread dump on the applier thread; the callback
Security identity wrong / headers lost mid-request	Work scheduled off the `ThreadPool` (raw thread), losing `ThreadContext`	use `threadPool.executor(...)`; preserve `newStoredContext`
Flaky test: assertion fails then passes on rerun	Asserted before async work finished	replace `sleep` with `assertBusy`
Force-merge serializes everything	`FORCE_MERGE` pool size 1 by design	expected; schedule force-merges off-peak

Validation: prove you understand this

From memory, list eight named thread pools and what each runs. Mark which are fixed vs scaling.
Explain rejection as back-pressure: which pools reject, what the client sees, and why enlarging the queue is the wrong fix.
Describe the single-writer-per-shard model and how shared vs exclusive permits coexist in IndexShardOperationPermits.
Explain ThreadContext propagation and the concrete security failure that bypassing ThreadPool causes.
Draw the cascade from a blocked applier thread to cluster-manager step-down, naming every link and the governing timeout.
Explain why assertBusy (not Thread.sleep) is the correct way to wait for eventually-consistent state in tests.

IndexShard Lifecycle

A shard is the unit of storage and the unit of work in OpenSearch. Every document lives in exactly one shard; every search runs per-shard; every write applies to a shard's primary first. The Java object that owns a shard on a data node is IndexShard, and this chapter is about its full lifecycle: how the hierarchy IndicesService → IndexService → IndexShard is built, the shard state machine (IndexShardState), the Store/Directory that backs it, how a shard is created, recovered, started, and closed, and how index/delete operations are applied on the primary versus a replica. The shard is where the cluster layer (routing table) meets the storage layer (the engine, the translog, and refresh/flush/merge).

After this chapter you should be able to: draw the IndexShardState machine; explain who creates and starts shards in response to cluster-state changes; find applyIndexOperationOnPrimary/OnReplica and explain why they differ; and name the components an IndexShard owns.

Note: Distinguish IndexShardState (the data node's reality for the shard object) from ShardRouting.state() (the cluster manager's plan for the shard, in cluster state — see cluster-and-node-model.md). Both have states; they are not the same machine.

The hierarchy: IndicesService → IndexService → IndexShard

find server -name "IndicesService.java" -o -name "IndexService.java" -o -name "IndexShard.java"
grep -n "class IndicesService\|createIndex\|indexService(" \
  server/src/main/java/org/opensearch/indices/IndicesService.java
grep -n "class IndexService\|createShard\|removeShard\|getShard" \
  server/src/main/java/org/opensearch/index/IndexService.java

Level	Class	Scope	Owns
Node	`IndicesService`	all indices on this node	the map of `IndexService`s; the `IndexingMemoryController`; node-wide query and request caches
Index	`IndexService`	one index on this node	per-index `MapperService`, `SimilarityService`, analysis, and the map of local `IndexShard`s
Shard	`IndexShard`	one shard copy	the `Engine`, the `Store`, the `Translog` (via engine), refresh/flush logic, operation permits

flowchart TD
    Node[Node] --> IsS[IndicesService]
    IsS --> IxS1["IndexService 'orders'"]
    IsS --> IxS2["IndexService 'logs'"]
    IxS1 --> MS[MapperService]
    IxS1 --> Sh0["IndexShard 0 (primary)"]
    IxS1 --> Sh1["IndexShard 1 (replica)"]
    Sh0 --> Eng[InternalEngine]
    Eng --> IW[Lucene IndexWriter]
    Eng --> TL[Translog]
    Sh0 --> St[Store -> Directory]

The bridge from cluster state to these objects is IndicesClusterStateService, an applier (see cluster-state-publishing.md). When the routing table assigns a shard to this node, the applier creates it through IndexService.createShard(...); when the routing table no longer assigns the shard to this node, the applier removes it.

grep -n "createShard\|removeShard\|failAndRemoveShard\|createOrUpdateShards" \
  server/src/main/java/org/opensearch/indices/cluster/IndicesClusterStateService.java

The Store and Directory

The bytes of a shard live on disk under the node's data path, wrapped by Store, which holds a Lucene Directory.

find server -name "Store.java" -path "*index/store*"
grep -n "class Store\|Directory\|directory()\|verify\|MetadataSnapshot\|checkIntegrity" \
  server/src/main/java/org/opensearch/index/store/Store.java

Concept	Class	Role
Shard storage	`Store`	Reference-counted wrapper over a Lucene `Directory`; tracks files, checksums, the segment metadata snapshot used by recovery.
Filesystem	`Directory` (Lucene)	The actual file abstraction (`NIOFSDirectory`/`MMapDirectory` via `FsDirectoryFactory`).
File-level integrity	`Store.MetadataSnapshot`	Per-file checksums/lengths; lets recovery copy only differing files (see recovery.md).

Store is reference-counted: a shard can be in use by a search while it is being closed; the refcount keeps the directory alive until all users release it. This is why a shard close is not instantaneous.
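The reference-counting idea can be sketched in a few lines. This is an illustrative model only: the class RefCountedResourceSketch and its released flag are assumptions, and a boolean stands in for closing the Lucene Directory.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of reference counting in the style the text describes for Store.
// The class is illustrative, not the OpenSearch implementation.
final class RefCountedResourceSketch {
    private final AtomicInteger refCount = new AtomicInteger(1); // the owner's reference
    private volatile boolean released = false;

    // Try to take a reference; fails once the count has already reached zero.
    boolean tryIncRef() {
        while (true) {
            int current = refCount.get();
            if (current <= 0) {
                return false;
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Drop a reference; the underlying resource is released only at zero.
    void decRef() {
        int remaining = refCount.decrementAndGet();
        if (remaining == 0) {
            released = true; // stand-in for closing the Lucene Directory
        } else if (remaining < 0) {
            throw new IllegalStateException("released more references than were taken");
        }
    }

    boolean isReleased() {
        return released;
    }
}
```

In this model a search that took a reference before the shard started closing keeps the resource alive. The close drops only the owner's reference, and the release happens when the search drops the last one.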

The IndexShardState machine

find server -name "IndexShardState.java"
grep -n "CREATED\|RECOVERING\|POST_RECOVERY\|STARTED\|CLOSED\|enum IndexShardState" \
  server/src/main/java/org/opensearch/index/shard/IndexShardState.java
grep -n "changeState\|state =\|IndexShardState\." \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java | head

stateDiagram-v2
    [*] --> CREATED: IndexService.createShard
    CREATED --> RECOVERING: recovery starts (markAsRecovering)
    RECOVERING --> POST_RECOVERY: engine opened, ops replayed
    POST_RECOVERY --> STARTED: shard marked started, serves traffic
    STARTED --> CLOSED: node leaves / shard removed / index closed
    RECOVERING --> CLOSED: recovery failed
    POST_RECOVERY --> CLOSED: failure
    CREATED --> CLOSED: aborted

State	Meaning	Can serve reads/writes?
`CREATED`	Object exists; engine not open.	No
`RECOVERING`	Filling the shard from a source (empty/store/peer/snapshot).	No
`POST_RECOVERY`	Engine open and ops replayed; finalizing.	Not yet announced as started
`STARTED`	Live; primary takes writes, copies serve reads.	Yes
`CLOSED`	Engine closed, resources released.	No
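The transitions in the diagram and table above can be written down as a small transition table. The sketch below is a teaching model under stated assumptions: the enum reuses the state names from this chapter, but the class ShardStateSketch, the ALLOWED map, and the changeState check are hypothetical and are not how IndexShard enforces its states.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative model of the shard state machine described above.
// The transition table and helper methods are hypothetical.
final class ShardStateSketch {
    enum State { CREATED, RECOVERING, POST_RECOVERY, STARTED, CLOSED }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.CREATED, EnumSet.of(State.RECOVERING, State.CLOSED));
        ALLOWED.put(State.RECOVERING, EnumSet.of(State.POST_RECOVERY, State.CLOSED));
        ALLOWED.put(State.POST_RECOVERY, EnumSet.of(State.STARTED, State.CLOSED));
        ALLOWED.put(State.STARTED, EnumSet.of(State.CLOSED));
        ALLOWED.put(State.CLOSED, EnumSet.noneOf(State.class)); // terminal
    }

    private State state = State.CREATED;

    State state() {
        return state;
    }

    void changeState(State next) {
        if (!ALLOWED.get(state).contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not a legal transition");
        }
        state = next;
    }

    // Only a STARTED shard serves reads and writes, per the table above.
    boolean canServeTraffic() {
        return state == State.STARTED;
    }
}
```

Writing the table out makes two properties easy to see: CLOSED is reachable from every other state, and there is no shortcut from CREATED to STARTED that skips recovery.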

The recovery source that fills a RECOVERING shard depends on context:

Recovery type	Source	Deep dive
empty store	new primary, no data	—
existing store	local Lucene files on this node	—
peer	copy from the primary on another node	recovery.md
snapshot	restore from a repository	snapshots-repositories.md

Creating, recovering, starting, closing

sequenceDiagram
    participant ICSS as IndicesClusterStateService (applier)
    participant IxS as IndexService
    participant Sh as IndexShard
    participant Eng as Engine
    participant CM as Cluster manager
    ICSS->>IxS: createShard(shardRouting) [routing assigned here]
    IxS->>Sh: new IndexShard (CREATED)
    ICSS->>Sh: startRecovery(...)
    Sh->>Sh: markAsRecovering (RECOVERING)
    Sh->>Eng: openEngineAndRecoverFromTranslog / fillFromSource
    Eng-->>Sh: ops replayed (POST_RECOVERY)
    Sh->>CM: shard-started action
    CM->>CM: cluster-state update: ShardRouting -> STARTED
    Note over Sh: applier observes STARTED -> IndexShard STARTED

Trace the key methods:

grep -n "markAsRecovering\|recoverFromStore\|openEngineAndRecover\|postRecovery\|markShardAsStarted\|close(" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java | head -30

Note the round-trip: the data node reports the shard as started (ShardStateAction → a cluster-state update on the cluster manager), and only then does the routing table mark the ShardRouting STARTED. The two state machines converge through this handshake.

grep -n "shardStarted\|class ShardStateAction\|sendShardStarted" \
  server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java

Applying operations: primary vs replica

A write reaches a shard via TransportReplicationAction (action-framework.md). The shard has two application paths, and the difference is fundamental:

grep -n "applyIndexOperationOnPrimary\|applyIndexOperationOnReplica\|applyDeleteOperationOnPrimary\|\
applyDeleteOperationOnReplica\|markSeqNoAsNoop" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java

Path	Method	Assigns sequence number?	Version logic
Primary	`applyIndexOperationOnPrimary`	Yes — the primary is the source of truth; it allocates the seq no and resolves versioning.	Resolves the requested version type, checks/sets `_version`.
Replica	`applyIndexOperationOnReplica`	No — it receives the seq no the primary assigned and applies idempotently.	Trusts the primary's decision; applies at the given seq no.

flowchart TD
    Op[index request] --> Pr{primary or replica?}
    Pr -->|primary| AP[applyIndexOperationOnPrimary]
    AP --> Seq[assign seq no + primary term]
    Seq --> Ver[resolve version conflict]
    Ver --> Eng1["InternalEngine.index() -> IndexWriter + Translog.add"]
    Eng1 --> Repl[replicate to in-sync replicas]
    Repl --> AR[applyIndexOperationOnReplica on each]
    AR --> Eng2["InternalEngine.index() at given seq no (idempotent)"]

Why the split matters: the primary decides (seq no, version, conflict resolution); replicas obey. This makes replication deterministic and lets a recovering replica replay the primary's exact operation history. The sequence number machinery (LocalCheckpointTracker, global checkpoint via ReplicationTracker) is detailed in engine-internals.md and replication.md.

Warning: A replica must apply operations idempotently at the seq no the primary assigned. If a replica re-derived its own seq no or version, replicas would diverge from the primary — a data-correctness bug, not a performance one.
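A toy model can show why the asymmetry keeps copies identical. Everything in it is hypothetical (the classes ReplicationSketch, Op, and Copy, and the "keep the op with the higher seq no" rule); it is a sketch of the principle, not of the IndexShard or engine code.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the primary/replica asymmetry. All names are hypothetical;
// this is not the IndexShard API.
final class ReplicationSketch {

    // What the primary decides and ships to replicas.
    static final class Op {
        final String id;
        final String source;
        final long seqNo;
        final long version;

        Op(String id, String source, long seqNo, long version) {
            this.id = id;
            this.source = source;
            this.seqNo = seqNo;
            this.version = version;
        }
    }

    static final class Copy {
        private final Map<String, Op> docs = new HashMap<>();
        private long nextSeqNo = 0;

        // Primary path: allocate the seq no and resolve the version here.
        Op applyOnPrimary(String id, String source) {
            Op existing = docs.get(id);
            long version = existing == null ? 1 : existing.version + 1;
            Op op = new Op(id, source, nextSeqNo++, version);
            docs.put(id, op);
            return op;
        }

        // Replica path: obey the primary's decision. Applying the same or an older
        // op again changes nothing, so retries and out-of-order delivery are harmless.
        void applyOnReplica(Op op) {
            Op existing = docs.get(op.id);
            if (existing == null || existing.seqNo < op.seqNo) {
                docs.put(op.id, op);
            }
        }

        String source(String id) {
            Op op = docs.get(id);
            return op == null ? null : op.source;
        }

        long version(String id) {
            return docs.get(id).version;
        }
    }
}
```

If the replica called applyOnPrimary instead, the version it computed would depend on the order in which ops happened to arrive, which is the divergence the warning above describes.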

What IndexShard owns and exposes

The shard is the integration point. The hooks you will meet:

Hook / component	Purpose	Deep dive
`Engine` (`getEngine()`)	The Lucene-backed write/read engine	engine-internals.md
`refresh(...)`	Open a new searcher; make recent writes visible	refresh-flush-merge.md
`flush(...)`	Lucene commit + translog roll for durability	translog.md, refresh-flush-merge.md
`acquireSearcher(...)`	Get a point-in-time `Engine.Searcher` for a query	search-execution.md
`IndexShardOperationPermits`	Serialize/exclude operations	threadpools-concurrency.md
`IndexEventListener`	Callbacks on shard lifecycle transitions (plugin hook)	plugin-architecture.md

IndexEventListener is the extension seam: plugins (and core services) get notified on beforeIndexShardCreated, afterIndexShardStarted, beforeIndexShardClosed, etc.

find server -name "IndexEventListener.java"
grep -n "default void\|afterIndexShardStarted\|beforeIndexShardClosed\|onShardInactive" \
  server/src/main/java/org/opensearch/index/shard/IndexEventListener.java

Reading exercise

# 1. The state machine.
grep -n "IndexShardState\|changeState\|verifyActive\|verifyNotClosed" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java | head

# 2. Create/recover/start path.
grep -n "createShard\|startRecovery\|markAsRecovering\|recoverFromStore" \
  server/src/main/java/org/opensearch/index/IndexService.java \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java | head

# 3. Primary vs replica application.
grep -n "applyIndexOperationOnPrimary\|applyIndexOperationOnReplica" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java

# 4. The started handshake.
grep -n "sendShardStarted\|shardStarted" \
  server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java

# 5. Tests + live view.
./gradlew :server:test --tests "org.opensearch.index.shard.IndexShardTests"
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,docs,node"

Answer:

Name the three levels of the IndicesService → IndexService → IndexShard hierarchy and one thing each level owns.
Draw IndexShardState from CREATED to STARTED and CLOSED. In which states can the shard serve traffic?
Who calls IndexService.createShard, and what cluster-state change triggers it?
Explain the started handshake: how does a data node's STARTED shard cause the routing table's ShardRouting to become STARTED?
Contrast applyIndexOperationOnPrimary and applyIndexOperationOnReplica. Which assigns the sequence number, and why must the other be idempotent?
What is IndexEventListener for, and name two lifecycle callbacks a plugin could hook?

Common bugs and symptoms

Symptom	Root cause	Where to look
Shard stuck in `INITIALIZING` (never `STARTED`)	Recovery failing or blocked; or never assigned	`_cluster/allocation/explain`; recovery.md, shard-allocation.md
Replica diverges from primary	Replica path re-derived seq no/version instead of obeying primary	`applyIndexOperationOnReplica`; replication.md
`IndexShardClosedException` mid-request	Op ran while shard closing; refcount/permit handling missing	`Store` refcount; `IndexShardOperationPermits`
Slow shard close / leaked file handles	`Store`/`Searcher` not released; refcount never hits zero	leak detector; `acquireSearcher` callers
Data node `STARTED` but cluster shows `INITIALIZING`	started handshake message lost or delayed	`ShardStateAction.sendShardStarted`
Plugin shard hook never fires	`IndexEventListener` not registered via the index module	`onIndexModule`/listener registration

Validation: prove you understand this

Draw the full object graph from IndicesService down to the Lucene IndexWriter and Translog, naming each owning class.
Reproduce the IndexShardState diagram from memory and annotate which states serve reads/writes and which recovery source fills RECOVERING.
Explain the started handshake between a data node and the cluster manager, and why two state machines (IndexShardState, ShardRouting.state) exist.
Explain precisely why the primary assigns sequence numbers and replicas do not, and the data-correctness bug that the asymmetry prevents.
Explain why Store is reference-counted and what a missing release causes.
List two IndexEventListener callbacks and a realistic use for each (e.g. a plugin warming caches on afterIndexShardStarted).

Engine Internals

The Engine is the heart of a shard — the layer that wraps Lucene and turns "apply this index/delete operation" into concrete IndexWriter calls, a durable translog record, a versioning decision, and an eventually visible searcher. IndexShard (index-shard-lifecycle.md) delegates all real storage work to its Engine. This chapter covers the engine implementations (InternalEngine, ReadOnlyEngine, NRTReplicationEngine), the Lucene IndexWriter it drives, the Engine.Index/Engine.Delete operation model, versioning via the LiveVersionMap, sequence numbers (LocalCheckpointTracker, SeqNoStats), and how a refresh produces a new DirectoryReader through a SearcherManager. It is the connective chapter between the write path, the translog, and refresh/flush/merge.

After this chapter you should be able to: explain what InternalEngine.index() does step by step; describe how versioning and sequence numbers prevent lost or out-of-order updates; explain the LiveVersionMap's role; and find where a refresh swaps in a new reader.

Note: OpenSearch's engine is built directly on Apache Lucene. Everything here ultimately reduces to Lucene IndexWriter/DirectoryReader/IndexSearcher calls; the engine adds versioning, sequence numbers, the translog, and the OpenSearch operation model on top.

The engine implementations

find server -name "Engine.java" -path "*index/engine*"
find server -name "InternalEngine.java" -o -name "ReadOnlyEngine.java" -o -name "NRTReplicationEngine.java"
grep -n "abstract class Engine\|public abstract" \
  server/src/main/java/org/opensearch/index/engine/Engine.java | head

Implementation	Used for	Writes?
`InternalEngine`	The normal read/write engine on an active primary or document-replication replica.	Yes — owns a Lucene `IndexWriter`.
`ReadOnlyEngine`	Closed indices, frozen/searchable-snapshot shards — searchable but not writable.	No
`NRTReplicationEngine`	A segment-replication replica: it does not index documents itself; it receives segment files copied from the primary and exposes them.	No local indexing

The default is InternalEngine. The engine is created via an EngineFactory, which a plugin can override (EnginePlugin) — see plugin-architecture.md.

grep -n "EngineFactory\|newReadWriteEngine\|getEngineFactory" \
  server/src/main/java/org/opensearch/index/engine/EngineFactory.java \
  server/src/main/java/org/opensearch/plugins/EnginePlugin.java 2>/dev/null

The operation model: Engine.Index and Engine.Delete

Every write is an Engine.Operation. The two you care about:

grep -n "class Index\|class Delete\|class NoOp\|abstract class Operation\|VersionType\|seqNo\|primaryTerm\|Origin" \
  server/src/main/java/org/opensearch/index/engine/Engine.java | head -40

Field on an op	Meaning
`uid` / `id`	The document `_id` (a Lucene term used for updates).
`version` + `versionType`	The expected/assigned `_version` and the conflict policy (`INTERNAL`, `EXTERNAL`, etc.).
`seqNo`	The sequence number (assigned on the primary, obeyed on replicas).
`primaryTerm`	Which primary generation issued the op.
`origin`	`PRIMARY`, `REPLICA`, `PEER_RECOVERY`, `LOCAL_TRANSLOG_RECOVERY` — where the op came from, which changes the versioning rules.

The origin is crucial: the same index() method behaves differently for a fresh primary write (assign seq no, check version) versus a replayed translog op during recovery (trust the recorded seq no). This is the engine-level expression of the primary/replica asymmetry from index-shard-lifecycle.md.

InternalEngine.index(): step by step

grep -n "public IndexResult index\|planIndexingAsPrimary\|planIndexingAsNonPrimary\|\
indexIntoLucene\|addStaleDocs\|updateDocument\|addDocument\|Translog.add\|versionMap" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head -40

The essential sequence for a primary index op:

flowchart TD
    Idx["InternalEngine.index(Engine.Index)"] --> Lock[acquire per-uid lock]
    Lock --> Plan{plan: new doc, update, or conflict?}
    Plan -->|version conflict| Conflict[return VersionConflictEngineException]
    Plan -->|ok| SeqNo[assign seqNo + primaryTerm if primary]
    SeqNo --> Lucene["IndexWriter.addDocument / updateDocument"]
    Lucene --> TL["Translog.add(operation) -> Location"]
    TL --> VMap["LiveVersionMap.put(uid, version+seqNo+location)"]
    VMap --> CP[LocalCheckpointTracker.markSeqNoAsProcessed]
    CP --> Result[IndexResult]

The order is deliberate: Lucene write, then translog append, then version map update (the map entry records the op's translog location), then checkpoint advance. The document is durable once it is in the translog and fsynced per the durability policy (see translog.md); it becomes visible only after a refresh opens a new reader.

Note: A new or updated document is durable before it is visible. This is the near-real-time (NRT) model: you can lose a node right after an acknowledged write and recover the op from the translog, even though no search would have returned it yet. Durability (translog) and visibility (refresh) are independent — see refresh-flush-merge.md.

Versioning and the LiveVersionMap

To decide whether an incoming write is newer than what is already stored — and to do it before a refresh makes the prior write searchable — the engine keeps an in-memory LiveVersionMap: _id → (version, seqNo, location) for recently written, not-yet-refreshed documents.

find server -name "LiveVersionMap.java" -o -name "VersionValue.java"
grep -n "class LiveVersionMap\|getUnderLock\|putIndexUnderLock\|putDeleteUnderLock\|maps\|archive" \
  server/src/main/java/org/opensearch/index/engine/LiveVersionMap.java

Why it exists: Lucene only knows what is in committed/refreshed segments. Between a write and the next refresh, the authoritative current version of a document is only in memory. The LiveVersionMap is that memory. On a version check the engine consults the map first, then falls back to reading the version from Lucene.

flowchart LR
    W["write _id=42, v=5"] --> Check[version check]
    Check -->|"in LiveVersionMap?"| Map["LiveVersionMap: 42 -> v5,seqNo"]
    Check -->|"not in map"| Lucene[read version from segments]
    Map --> Decide{incoming v > stored v?}
    Lucene --> Decide
    Decide -->|yes| Apply[apply]
    Decide -->|no| Conflict[VersionConflictEngineException]

After a refresh, the document is in a segment, so its entry can be dropped from the live map (it moves to an "archive" briefly to handle in-flight reads). A correctness bug in this map shows up as lost updates or spurious version conflicts under concurrency.
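The "map first, then Lucene" lookup can be sketched with two plain maps. This is a simplified model under stated assumptions: VersionLookupSketch and its methods are hypothetical, a HashMap stands in for refreshed segments, the check is the strict "incoming greater than stored" rule from the diagram, and the per-uid locking and archive are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of "consult the live map first, then fall back to Lucene".
// Names are hypothetical; a plain map stands in for refreshed segments.
final class VersionLookupSketch {
    private final Map<String, Long> liveVersions = new HashMap<>();    // written, not yet refreshed
    private final Map<String, Long> segmentVersions = new HashMap<>(); // visible after refresh

    // Returns the current version of a doc, or -1 if it does not exist.
    long currentVersion(String id) {
        Long live = liveVersions.get(id);
        if (live != null) {
            return live;
        }
        return segmentVersions.getOrDefault(id, -1L);
    }

    // Accept only a strictly newer version, as in the diagram above.
    boolean index(String id, long incomingVersion) {
        if (incomingVersion <= currentVersion(id)) {
            return false; // the real engine reports a version conflict here
        }
        liveVersions.put(id, incomingVersion);
        return true;
    }

    // Refresh makes the writes searchable, so the live entries can be dropped.
    void refresh() {
        segmentVersions.putAll(liveVersions);
        liveVersions.clear();
    }

    int liveEntries() {
        return liveVersions.size();
    }
}
```

Without the live map, a second write of version 5 arriving before the refresh would find nothing in the segments and be accepted, which is the lost-update case the paragraph above warns about.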

Sequence numbers and checkpoints

Every operation gets a sequence number (seq no), unique and increasing per shard within a primary term. Seq nos give a total order to a shard's history and are the foundation of fast recovery and replication.

find server -name "LocalCheckpointTracker.java" -o -name "SeqNoStats.java" -o -name "SequenceNumbers.java"
grep -n "generateSeqNo\|markSeqNoAsProcessed\|getProcessedCheckpoint\|getPersistedCheckpoint\|class LocalCheckpointTracker" \
  server/src/main/java/org/opensearch/index/seqno/LocalCheckpointTracker.java

Concept	Class/field	Meaning
Sequence number	per op	Position of the op in the shard's history.
Local checkpoint	`LocalCheckpointTracker`	The highest seq no such that all seq nos ≤ it are present locally (no gaps).
Global checkpoint	`ReplicationTracker`	The highest seq no known to be present on all in-sync copies. Operations ≤ global checkpoint are safe everywhere.
`SeqNoStats`	snapshot	`maxSeqNo`, `localCheckpoint`, `globalCheckpoint` — what `_stats` reports.

The local checkpoint advances only when there are no gaps, which is why out-of-order arrival (common with concurrent replication) does not falsely advance it. The global checkpoint, derived across copies, is the boundary for trimming the translog and for sequence-number-based recovery (recovery.md, replication.md).

flowchart LR
    Ops["seq nos: 0 1 2 _ 4 5"] --> LCP["local checkpoint = 2 (gap at 3)"]
    LCP --> Fill["op 3 arrives"]
    Fill --> LCP2["local checkpoint = 5"]
    LCP2 --> GCP["global checkpoint = min(local cp across in-sync copies)"]
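The gap rule in the diagram above is short enough to write out. The sketch below is a minimal illustration, not the LocalCheckpointTracker implementation: the class CheckpointSketch, its HashSet bookkeeping, and the static globalCheckpoint helper are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal gap-aware checkpoint tracker. Hypothetical names and data structure.
final class CheckpointSketch {
    private final Set<Long> pending = new HashSet<>(); // processed seq nos above the checkpoint
    private long checkpoint = -1;                      // -1 means nothing processed yet

    void markProcessed(long seqNo) {
        if (seqNo <= checkpoint) {
            return; // already covered by the checkpoint
        }
        pending.add(seqNo);
        // Advance only while the next seq no is present, so a gap stops the checkpoint.
        while (pending.remove(checkpoint + 1)) {
            checkpoint++;
        }
    }

    long checkpoint() {
        return checkpoint;
    }

    // Global checkpoint as described above: the minimum local checkpoint
    // across the in-sync copies that are passed in.
    static long globalCheckpoint(long... localCheckpoints) {
        long min = Long.MAX_VALUE;
        for (long cp : localCheckpoints) {
            min = Math.min(min, cp);
        }
        return min;
    }
}
```

Marking 0, 1, 2, 4, 5 leaves the checkpoint at 2. Marking 3 then lets it jump straight to 5, because 4 and 5 were already waiting in the pending set.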

Searchers and refresh

Reads do not go through the IndexWriter; they go through an Engine.Searcher, a point-in-time view backed by a Lucene DirectoryReader. New writes are invisible until a refresh opens a fresh reader via the SearcherManager (Lucene's ReferenceManager for readers).

grep -n "acquireSearcher\|class Searcher\|ReferenceManager\|SearcherManager\|refresh(\|externalReaderManager\|getReferenceManager" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head

flowchart TD
    Write["index op -> IndexWriter (in-memory buffer)"] --> Buffer[uncommitted, unsearchable]
    Refresh["refresh()"] --> NewReader["SearcherManager opens new DirectoryReader"]
    Buffer --> Refresh
    NewReader --> Visible[recent docs now searchable]
    Search["acquireSearcher()"] --> NewReader

acquireSearcher hands out a reference-counted searcher; the caller (the search layer) must release it. The relationship to refresh/flush/merge — what each does to readers, segments, and the commit point — is the subject of refresh-flush-merge.md.
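The point-in-time behaviour of a searcher can be modelled without Lucene. In this sketch NrtVisibilitySketch and its methods are hypothetical, an immutable list stands in for a DirectoryReader, and reference counting is omitted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of near-real-time visibility. Hypothetical names; an immutable
// list plays the role of a point-in-time reader.
final class NrtVisibilitySketch {
    private final List<String> buffer = new ArrayList<>();          // indexed, not yet searchable
    private volatile List<String> reader = Collections.emptyList(); // current point-in-time view

    void index(String doc) {
        buffer.add(doc);
    }

    // A searcher keeps seeing the view that was current when it was acquired.
    List<String> acquireSearcher() {
        return reader;
    }

    // Refresh publishes a new immutable view containing everything indexed so far.
    void refresh() {
        List<String> next = new ArrayList<>(reader);
        next.addAll(buffer);
        buffer.clear();
        reader = Collections.unmodifiableList(next);
    }
}
```

A searcher acquired before the refresh still sees the old view afterwards, and only a newly acquired searcher sees the new document. That is why a long-running query returns consistent results while indexing continues.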

The engine's three companions

The engine does not work alone. Its relationships:

Companion	Relationship	Deep dive
Translog	Every op is appended to the translog so that it survives a crash before the next Lucene commit; the engine recovers unflushed ops from it on restart.	translog.md
Refresh/flush/merge	Refresh opens readers (visibility); flush commits Lucene + rolls translog (durability/checkpoint); merge compacts segments.	refresh-flush-merge.md
Replication/recovery	Seq nos and the global checkpoint let a recovering or replicating copy replay exactly the missing operations.	replication.md, recovery.md

Reading exercise

# 1. The op model.
grep -n "class Index\|class Delete\|class NoOp\|enum Origin\|VersionType" \
  server/src/main/java/org/opensearch/index/engine/Engine.java | head

# 2. The index() implementation.
grep -n "planIndexing\|indexIntoLucene\|addDocs\|updateDocs\|versionMap\|Translog.add\|markSeqNoAsProcessed" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java

# 3. The version map.
grep -n "getUnderLock\|putIndexUnderLock\|beforeRefresh\|afterRefresh\|class LiveVersionMap" \
  server/src/main/java/org/opensearch/index/engine/LiveVersionMap.java

# 4. Checkpoints.
grep -n "generateSeqNo\|getProcessedCheckpoint\|markSeqNoAsProcessed" \
  server/src/main/java/org/opensearch/index/seqno/LocalCheckpointTracker.java

# 5. Tests + live stats.
./gradlew :server:test --tests "org.opensearch.index.engine.InternalEngineTests"
curl -s "localhost:9200/orders/_stats/docs,segments?pretty"
curl -s "localhost:9200/orders/_stats?level=shards&filter_path=**.seq_no"

Answer:

Name the three engine implementations and exactly when each is used. Which one does not index documents locally, and why?
Walk InternalEngine.index() for a primary op: list the steps in order and say at which step the document becomes (a) durable and (b) visible.
What problem does LiveVersionMap solve that reading the version from Lucene cannot? When can an entry leave the map?
Explain the difference between the local checkpoint and the global checkpoint. Why does a gap stop the local checkpoint from advancing?
Why do reads use an Engine.Searcher rather than the IndexWriter, and what makes a just-written document appear in a searcher?
What does origin (PRIMARY vs PEER_RECOVERY) change about how index() handles versioning?

Common bugs and symptoms

Symptom	Root cause	Where to look
Lost updates under concurrent writes to same `_id`	`LiveVersionMap` consulted incorrectly; version check race	`LiveVersionMap.getUnderLock`; per-uid locking in `index()`
Spurious `VersionConflictEngineException`	Stale entry in the version map / archive not cleared after refresh	version map archive handling around refresh
Local checkpoint stuck below max seq no	A gap (missing seq no) that never fills	`LocalCheckpointTracker`; replication/recovery delivering the gap op
Docs durable but never searchable	Refresh disabled/too slow; no new reader opened	`refresh_interval`; refresh-flush-merge.md
Replica applies op with different version than primary	Replica path re-resolved versioning instead of obeying primary's `origin=REPLICA`	`index()` origin handling; index-shard-lifecycle.md
Heap pressure from a huge version map	Very high write rate with long refresh interval keeps map large	shorten refresh interval; monitor; circuit-breakers-memory.md

Validation: prove you understand this

Draw the InternalEngine.index() pipeline and mark the durability point and the visibility point.
Explain LiveVersionMap in one paragraph: what it stores, why Lucene alone is insufficient, and what a bug in it causes.
Explain local vs global checkpoint with a worked seq-no example containing a gap.
Explain how origin makes the same index() method behave correctly for a primary write and for a translog replay during recovery.
Describe how a refresh turns an in-memory write into a searchable document, naming the Lucene constructs (IndexWriter, DirectoryReader, SearcherManager).
Name the three companions of the engine (translog, refresh/flush/merge, replication/recovery) and one way each depends on the engine's seq nos.

The Translog

Lucene only makes data durable when it commits — an expensive fsync of all segment files — and commits are far too costly to perform per write. So how does OpenSearch acknowledge a write as durable without committing Lucene every time? The answer is the translog: a per-shard, append-only write-ahead log. Every indexing operation is appended (and, by default, fsynced) to the translog before the write is acknowledged. If the node crashes before the next Lucene commit, the operations are replayed from the translog on restart. This chapter covers the Translog class and its readers/writers, generations, durability modes, the relationship between the translog and the Lucene commit (flush), recovery from translog, and how the global checkpoint governs trimming.

After this chapter you should be able to: explain exactly what "durable" means for an acknowledged write; describe the difference between request and async durability; explain how a flush relates a translog generation to a Lucene commit; and trace how a crash recovers unflushed operations.

Note: "Durable" (survives a crash) and "visible" (searchable) are different. The translog gives durability immediately; a refresh gives visibility later. Do not conflate them.

The classes

find server -name "Translog.java" -o -name "TranslogWriter.java" -o -name "TranslogReader.java" \
  -o -name "TranslogConfig.java"
grep -n "class Translog\|class TranslogWriter\|class TranslogReader\|add(\|newSnapshot\|rollGeneration\|trimUnreferencedReaders" \
  server/src/main/java/org/opensearch/index/translog/Translog.java | head

Class	Role
`Translog`	The per-shard log. Owns the current writer and the set of older readers; appends ops, fsyncs, rolls generations, trims.
`TranslogWriter`	The single current generation being written to (`translog-N.tlog`). Appends and fsyncs.
`TranslogReader`	A sealed, older generation, opened for reading during recovery.
`Translog.Operation`	One logged op: `Index`, `Delete`, or `NoOp`, each `Writeable`, carrying seq no and primary term.
`Translog.Location`	A pointer (generation + offset) to an op in the log — returned to the engine so it can later confirm fsync.
`TranslogConfig`	Configuration: path, durability, sync interval.

flowchart TD
    Eng[InternalEngine.index] -->|"Translog.add(op)"| W["TranslogWriter (gen N, current)"]
    W --> File1["translog-N.tlog (being written)"]
    Trans[Translog] --> R1["TranslogReader gen N-1 (sealed)"]
    Trans --> R2["TranslogReader gen N-2 (sealed)"]
    R1 --> File2[translog-N-1.tlog]
    R2 --> File3[translog-N-2.tlog]
    Trans --> Ckp[translog.ckp - checkpoint file]

The files live under the shard's translog directory:

# On a running node, find the data path, then:
find . -path "*nodes*translog*" -name "*.tlog" 2>/dev/null | head
# Or read the config defaults:
grep -n "translog\|durability\|sync_interval\|flush_threshold_size" \
  server/src/main/java/org/opensearch/index/IndexSettings.java | head

The Operation model

grep -n "class Index\|class Delete\|class NoOp\|abstract class Operation\|enum Type\|writeTo\|readOperation" \
  server/src/main/java/org/opensearch/index/translog/Translog.java

Op type	When
`Index`	A document was indexed (add or update). Carries source, id, version, seq no, primary term.
`Delete`	A document was deleted.
`NoOp`	A "gap filler" — a seq no that produced no document (e.g. a failed op, or a deliberately reserved seq no during recovery). Keeps the seq-no history dense.

Every op carries its sequence number and primary term (see engine-internals.md). That is what lets translog replay be deterministic and idempotent: a replayed op reuses its recorded seq no rather than getting a new one.

Durability: request vs async

The key tunable is index.translog.durability:

grep -n "durability\|REQUEST\|ASYNC\|sync_interval\|Durability" \
  server/src/main/java/org/opensearch/index/translog/Translog.java \
  server/src/main/java/org/opensearch/index/IndexSettings.java | head

`index.translog.durability`	When the translog is fsynced	Trade-off
`request` (default)	Before each write/bulk is acknowledged	Strongest: an acked write survives a crash. Costs an fsync per request (batched per bulk).
`async`	Every `index.translog.sync_interval` (default 5s)	Faster: fsync amortized. Risk: up to `sync_interval` of acked writes can be lost on a crash.

sequenceDiagram
    participant C as Client
    participant E as Engine
    participant T as Translog
    C->>E: index doc
    E->>E: IndexWriter.addDocument (in-memory)
    E->>T: Translog.add(op) -> Location
    alt durability = request
        T->>T: fsync NOW
        T-->>E: synced
        E-->>C: 200 (durable)
    else durability = async
        T-->>E: appended (not yet fsynced)
        E-->>C: 200 (durable only after next interval sync)
        Note over T: fsync every sync_interval
    end

Warning: async durability is a deliberate durability-for-throughput trade. With async, a 200 OK does not guarantee the write survives an immediate power loss — up to sync_interval of operations are at risk. Default request durability is what makes OpenSearch's write acknowledgements meaningful. Change it only with eyes open.
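The data-loss window of each mode can be made concrete with a small simulation. This is a sketch under simplifying assumptions: `DurabilitySketch` is a hypothetical class, `sync()` stands in for the fsync, and a "crash" simply discards whatever was appended after the last sync.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of translog durability modes; not OpenSearch code.
final class DurabilitySketch {
    enum Durability { REQUEST, ASYNC }

    private final Durability durability;
    private final List<String> appended = new ArrayList<>(); // written, possibly only in page cache
    private int syncedCount = 0;                              // ops covered by the last fsync
    private int ackedCount = 0;                               // ops acknowledged to the client

    DurabilitySketch(Durability durability) {
        this.durability = durability;
    }

    // Append an op and acknowledge it. REQUEST syncs before the ack; ASYNC does not.
    void write(String op) {
        appended.add(op);
        if (durability == Durability.REQUEST) {
            sync();
        }
        ackedCount++;
    }

    // Stands in for the fsync (per request, or every sync_interval under ASYNC).
    void sync() {
        syncedCount = appended.size();
    }

    // Acknowledged ops that a crash at this instant would lose.
    int ackedOpsLostOnCrash() {
        return ackedCount - syncedCount;
    }

    public static void main(String[] args) {
        DurabilitySketch request = new DurabilitySketch(Durability.REQUEST);
        DurabilitySketch async = new DurabilitySketch(Durability.ASYNC);
        for (int i = 0; i < 3; i++) {
            request.write("op-" + i);
            async.write("op-" + i);
        }
        System.out.println(request.ackedOpsLostOnCrash()); // prints "0"
        System.out.println(async.ackedOpsLostOnCrash());   // prints "3"
        async.sync();                                       // the interval timer fires
        System.out.println(async.ackedOpsLostOnCrash());   // prints "0"
    }
}
```

Under `ASYNC` the loss window is everything acknowledged since the last interval sync; under `REQUEST` it is always empty.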

Generations, flush, and the Lucene commit

The translog only needs to retain operations not yet captured in a Lucene commit. A flush is the operation that bridges them:

grep -n "flush\|rollGeneration\|commitIndexWriter\|getMinTranslogGenerationForRecovery\|trimUnreferencedReaders" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head

A flush does three things atomically with respect to recovery:

Lucene commit — IndexWriter.commit() fsyncs all segments; the data is now durable in Lucene itself.
Roll the translog generation — start a new TranslogWriter (gen N+1); the commit records which translog generation it corresponds to.
Trim old generations — older translog files whose ops are now safely in the Lucene commit (and below the global checkpoint) are deleted.

flowchart LR
    subgraph "before flush"
      Seg1[committed segments @ gen 5] --> TL5["translog gens 5..7 retained"]
    end
    Flush["flush()"] --> Commit[IndexWriter.commit]
    Commit --> Roll[roll to translog gen 8]
    Roll --> Trim[trim gens now in commit + below global checkpoint]
    subgraph "after flush"
      Seg2[committed segments @ gen 8] --> TL8["translog gen 8 only"]
    end

Flushes happen automatically when the translog grows past index.translog.flush_threshold_size (default 512mb), or on an explicit POST /index/_flush. This is why a write-heavy index periodically flushes without you asking.
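The three flush steps and the size-based trigger can be sketched as a small state machine. This is a deliberately simplified model: `FlushSketch` and its methods are hypothetical names, the Lucene commit is only a comment, and the global checkpoint constraint on trimming is ignored here.

```java
import java.util.TreeMap;

// Hypothetical model of translog generations and the flush threshold; not OpenSearch code.
final class FlushSketch {
    private final TreeMap<Long, Long> bytesByGeneration = new TreeMap<>();
    private final long flushThresholdBytes;
    private long currentGeneration = 1;

    FlushSketch(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
        bytesByGeneration.put(currentGeneration, 0L);
    }

    // Append an op of the given size; returns true if the append triggered a flush.
    boolean add(long opBytes) {
        bytesByGeneration.merge(currentGeneration, opBytes, Long::sum);
        if (uncommittedBytes() > flushThresholdBytes) {
            flush();
            return true;
        }
        return false;
    }

    void flush() {
        // Step 1: Lucene commit (IndexWriter.commit() in the real engine); not modeled.
        // Step 2: roll to a new, empty translog generation.
        currentGeneration++;
        bytesByGeneration.put(currentGeneration, 0L);
        // Step 3: trim every generation older than the new one.
        bytesByGeneration.headMap(currentGeneration).clear();
    }

    long uncommittedBytes() {
        long total = 0;
        for (long bytes : bytesByGeneration.values()) {
            total += bytes;
        }
        return total;
    }

    long currentGeneration() {
        return currentGeneration;
    }

    int retainedGenerations() {
        return bytesByGeneration.size();
    }

    public static void main(String[] args) {
        FlushSketch translog = new FlushSketch(100);
        System.out.println(translog.add(60));               // prints "false"
        System.out.println(translog.add(60));               // prints "true" (120 > 100)
        System.out.println(translog.currentGeneration());   // prints "2"
        System.out.println(translog.retainedGenerations()); // prints "1"
    }
}
```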

Note: Refresh, flush, and merge are three different operations and are easy to confuse. Refresh = visibility (new searcher). Flush = durability milestone (Lucene commit + translog roll). Merge = compaction (fewer segments). The whole comparison is in refresh-flush-merge.md.

Recovery from translog: how a crash is survived

On restart (or on local recovery), the engine opens the last Lucene commit, then replays the translog operations that came after that commit:

grep -n "recoverFromTranslog\|openEngineAndRecoverFromTranslog\|recoverFromTranslogInternal\|Snapshot\|applyTranslogOperation" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java | head

sequenceDiagram
    participant N as Node restart
    participant E as Engine
    participant L as Lucene commit
    participant T as Translog
    N->>E: open shard
    E->>L: load last commit (durable segments)
    E->>T: open translog, snapshot ops AFTER commit's generation
    loop each op > commit
        T-->>E: op (seqNo, primaryTerm)
        E->>E: re-apply (origin=LOCAL_TRANSLOG_RECOVERY, reuse seqNo)
    end
    E->>E: shard now reflects all acked, durable writes

The ops replayed are the ones recorded in the translog after the last commit. Under request durability that covers every write acknowledged since the commit; under async it covers only what had been fsynced before the crash. Because each op carries its original seq no and is applied idempotently, replaying is safe even if some ops were partially applied before the crash. This is the mechanism that makes request durability meaningful: the write was in the fsynced translog, so it is replayed.
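Why replay is idempotent can be shown with a toy engine that remembers the highest seq no applied to each document id. This is only a sketch: `ReplaySketch` is a hypothetical class, and the real engine makes this decision with its own seq-no bookkeeping rather than a plain map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of seq-no based idempotent replay; not OpenSearch code.
final class ReplaySketch {
    static final class Op {
        final long seqNo;
        final String id;
        final String source; // null means "delete"

        Op(long seqNo, String id, String source) {
            this.seqNo = seqNo;
            this.id = id;
            this.source = source;
        }
    }

    private final Map<String, String> docs = new HashMap<>();
    private final Map<String, Long> appliedSeqNo = new HashMap<>();

    // Apply an op unless an op with an equal or higher seq no already touched the same id.
    boolean apply(Op op) {
        Long applied = appliedSeqNo.get(op.id);
        if (applied != null && applied >= op.seqNo) {
            return false; // duplicate or stale: skipping it keeps replay idempotent
        }
        if (op.source == null) {
            docs.remove(op.id);
        } else {
            docs.put(op.id, op.source);
        }
        appliedSeqNo.put(op.id, op.seqNo);
        return true;
    }

    // Replays a list of ops and returns how many actually changed state.
    int replay(List<Op> ops) {
        int changed = 0;
        for (Op op : ops) {
            if (apply(op)) {
                changed++;
            }
        }
        return changed;
    }

    Map<String, String> docs() {
        return docs;
    }

    public static void main(String[] args) {
        List<Op> translog = Arrays.asList(
            new Op(0, "a", "v1"), new Op(1, "b", "v1"),
            new Op(2, "a", "v2"), new Op(3, "b", null));
        ReplaySketch engine = new ReplaySketch();
        System.out.println(engine.replay(translog)); // prints "4"
        System.out.println(engine.replay(translog)); // prints "0" (second replay is a no-op)
        System.out.println(engine.docs());           // prints "{a=v2}"
    }
}
```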

Global checkpoint and trimming

The translog cannot be trimmed solely on the basis of "it's in a Lucene commit"; it must also respect the global checkpoint — the highest seq no known to be on all in-sync copies (engine-internals.md, replication.md). Operations at or below the global checkpoint are safe everywhere and recoverable from peers, so they can be trimmed locally; operations above it may still be needed for peer recovery and must be retained.

grep -n "trimUnreferencedReaders\|getMinGenerationForSeqNo\|globalCheckpoint\|minSeqNoToKeep\|getMinTranslogGeneration" \
  server/src/main/java/org/opensearch/index/translog/Translog.java | head

Retention driver	Effect
Last Lucene commit	Keep ops not yet in any committed segment.
Global checkpoint	Keep ops above the global checkpoint for peer recovery.
Translog retention settings	Historical: extra retention for ops-based recovery (largely superseded by soft-deletes/seq-no retention leases).

The interplay with recovery is the subject of recovery.md: a peer recovery copies segment files (phase 1, which can be skipped when the target only needs to catch up by sequence number) and then replays the operations the target is still missing (phase 2).
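The two retention drivers combine into a simple rule: an operation may leave the local translog only when both allow it. The sketch below states that rule as code. `TrimSketch` and its parameter names are assumptions for illustration; in particular, `lastCommittedSeqNo` is taken to mean "every op at or below this seq no is in the last Lucene commit".

```java
// Hypothetical statement of the trimming rule; not the real Translog API.
final class TrimSketch {
    // An op is trimmable only if it is in a Lucene commit AND at or below the global checkpoint.
    static boolean canTrim(long seqNo, long lastCommittedSeqNo, long globalCheckpoint) {
        return seqNo <= lastCommittedSeqNo && seqNo <= globalCheckpoint;
    }

    // Lowest seq no that must stay in the translog under the rule above.
    static long minSeqNoToKeep(long lastCommittedSeqNo, long globalCheckpoint) {
        return Math.min(lastCommittedSeqNo, globalCheckpoint) + 1;
    }

    public static void main(String[] args) {
        // Commit covers seq nos up to 10, but only seq nos up to 7 are on all in-sync copies.
        System.out.println(canTrim(5, 10, 7));     // prints "true"
        System.out.println(canTrim(8, 10, 7));     // prints "false" (a peer may still need it)
        System.out.println(minSeqNoToKeep(10, 7)); // prints "8"
    }
}
```

Ops 8 to 10 are already safe in the local commit, yet they are kept because a lagging copy may still need them replayed.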

Reading exercise

# 1. Append, snapshot, roll, trim.
grep -n "public Location add\|newSnapshot\|rollGeneration\|trimUnreferencedReaders\|sync(" \
  server/src/main/java/org/opensearch/index/translog/Translog.java

# 2. Durability modes.
grep -n "Durability\|REQUEST\|ASYNC\|ensureSynced\|sync_interval" \
  server/src/main/java/org/opensearch/index/translog/Translog.java \
  server/src/main/java/org/opensearch/index/IndexSettings.java

# 3. Recovery replay.
grep -n "recoverFromTranslog\|applyTranslogOperation\|LOCAL_TRANSLOG_RECOVERY" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java

# 4. Tests + live stats.
./gradlew :server:test --tests "org.opensearch.index.translog.TranslogTests"
curl -s "localhost:9200/orders/_stats/translog?pretty"

# 5. Inspect on disk.
find . -path "*translog*" -name "*.tlog" 2>/dev/null | head

Answer:

What exactly does it mean for an acknowledged write to be "durable" under the default durability setting? Which fsync provides that guarantee?
Contrast request and async durability. With async and a 5s interval, how much acknowledged data can a power loss destroy?
What three things does a flush do, and how does it let the translog be trimmed? Why does a busy index flush on its own?
Trace a crash recovery: which Lucene artifact is loaded first, and which translog operations are replayed?
Why must translog trimming respect the global checkpoint and not just the last Lucene commit?
Distinguish refresh, flush, and merge in one sentence each.

Common bugs and symptoms

Symptom	Root cause	Where to look
Acked writes lost after power loss	`index.translog.durability: async` with a non-trivial `sync_interval`	the durability setting; expected behavior, document it
Disk fills with `.tlog` files	Translog not trimming: global checkpoint stuck, or a lagging replica retaining ops	`globalCheckpoint`; `trimUnreferencedReaders`; lagging copy
Slow indexing, high fsync time	`request` durability + slow disk; many small bulks	batch larger bulks; faster disk; do not switch to `async` blindly
Long recovery / startup	Huge unflushed translog to replay after a crash	tune `flush_threshold_size`; expect replay time
`TranslogCorruptedException` on start	Partial write / disk corruption in the current generation	`opensearch-shard remove-corrupted-data` as a last resort (it discards the corrupted data); investigate the disk
Replica recovery copies too much	Translog/seq-no retention insufficient to do ops-based recovery	retention leases; recovery.md

Validation: prove you understand this

Explain, using the translog, why an acked write under default settings survives an immediate crash even though Lucene has not committed.
Draw the request vs async durability sequence and state the precise data-loss window of each.
Explain what a flush does to (a) Lucene, (b) the translog generation, and (c) old translog files, and how the global checkpoint constrains step (c).
Walk crash recovery end to end: last commit loaded, ops replayed, why replay is idempotent.
Explain why translog trimming respects the global checkpoint, with a scenario where ignoring it would break peer recovery.
In one sentence each, contrast refresh, flush, and merge so you never confuse them again.

Mapping and Analysis

Before a document can be indexed or a field can be searched, OpenSearch must know its schema — what fields exist and how each is typed and stored — and it must know how to analyze text into the terms that go into the inverted index. Those two jobs are mapping and analysis. Mapping (MapperService, DocumentMapper, FieldMapper/MappedFieldType) defines field types and how they are indexed; analysis (Analyzer, Tokenizer, TokenFilter, CharFilter, AnalysisRegistry) defines how text fields are broken into tokens. This chapter covers both, plus dynamic mapping, the _source/_doc model, the built-in and analysis-common analyzers, how an AnalysisPlugin extends the chain, and the _analyze API for inspecting it directly.

After this chapter you should be able to: explain the difference between a FieldMapper and a MappedFieldType; describe how a text field's value is turned into index terms; explain dynamic mapping and why it is both convenient and dangerous; and use _analyze to debug why a query does not match.

Note: Mapping and analysis are index-time and query-time concerns living inside a single index. They feed engine-internals.md (what gets written to Lucene) and query-dsl-querybuilders.md (how a query term is analyzed to match). Get analysis wrong and "matching" documents silently fail to match.

The mapping classes

find server -name "MapperService.java" -o -name "DocumentMapper.java" -o -name "Mapping.java" \
  -o -name "FieldMapper.java" -o -name "MappedFieldType.java"
grep -n "class MapperService\|merge\|documentMapper\|fieldType\|class DocumentMapper" \
  server/src/main/java/org/opensearch/index/mapper/MapperService.java | head

Class	Role
`MapperService`	Per-index façade. Owns the current `DocumentMapper`, merges new mappings, resolves a field name to its `MappedFieldType`.
`DocumentMapper`	The compiled mapping for the index — parses a source document into a Lucene `Document`.
`Mapping`	The immutable mapping tree (root + field mappers + metadata mappers).
`FieldMapper`	The index-time behavior of one field: how to parse a JSON value and emit Lucene fields.
`MappedFieldType`	The query-time behavior of one field: how to build queries, whether it has doc values, its search analyzer.
`ObjectMapper` / `RootObjectMapper`	Mappers for object/nested structure and the document root.

The FieldMapper vs MappedFieldType split is the single most important idea here:

	`FieldMapper`	`MappedFieldType`
When used	indexing a document	building a query, sorting, aggregating
Knows	how to turn a JSON value into Lucene fields	how to turn a query value into a Lucene `Query`, whether doc values exist
Lives	one per field in the `DocumentMapper`	one per field, retrieved via `MapperService.fieldType(name)`

flowchart LR
    Doc["JSON document"] -->|index time| DM[DocumentMapper.parse]
    DM --> FM["FieldMapper per field"]
    FM --> Lucene["Lucene Document (indexed fields, doc values, stored _source)"]
    Query["query value"] -->|query time| MFT["MappedFieldType.termQuery/rangeQuery"]
    MFT --> LQ["Lucene Query"]
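The split can be illustrated with a toy keyword field whose index-time half and query-time half must agree on what a term looks like. Everything here is hypothetical (`ToyKeywordFieldMapper`, `ToyKeywordFieldType`, the string encoding of terms); the real classes emit Lucene fields and Lucene `Query` objects.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the index-time / query-time split; not the real mapper classes.
final class MapperSplitSketch {
    // Query-time half: turns a query value into the term to look up.
    static final class ToyKeywordFieldType {
        final String name;

        ToyKeywordFieldType(String name) {
            this.name = name;
        }

        String termQuery(String value) {
            return name + ":" + value;
        }

        boolean hasDocValues() {
            return true;
        }
    }

    // Index-time half: turns one parsed JSON value into the entries written to the index.
    static final class ToyKeywordFieldMapper {
        final ToyKeywordFieldType fieldType;

        ToyKeywordFieldMapper(String name) {
            this.fieldType = new ToyKeywordFieldType(name);
        }

        List<String> parse(String value) {
            return Arrays.asList(
                "term " + fieldType.name + ":" + value,      // inverted-index entry
                "docvalue " + fieldType.name + "=" + value); // column-store entry
        }
    }

    // A query matches when its term was emitted at index time.
    static boolean matches(List<String> indexedFields, String termQuery) {
        return indexedFields.contains("term " + termQuery);
    }

    public static void main(String[] args) {
        ToyKeywordFieldMapper mapper = new ToyKeywordFieldMapper("color");
        List<String> indexed = mapper.parse("red");
        System.out.println(matches(indexed, mapper.fieldType.termQuery("red"))); // prints "true"
        System.out.println(matches(indexed, mapper.fieldType.termQuery("Red"))); // prints "false"
    }
}
```

The second lookup fails because a keyword field does no analysis: the query-time term must equal the index-time term exactly.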

Field types

find server modules -name "*FieldMapper.java" | head -40
grep -rn "contentType()\|CONTENT_TYPE = " server/src/main/java/org/opensearch/index/mapper/ | head

The core field types you must know:

Type	Mapper	Analyzed?	Notes
`text`	`TextFieldMapper`	Yes — full analysis chain	For full-text search; not aggregatable by default (no doc values).
`keyword`	`KeywordFieldMapper`	No (whole string is one term)	Exact-match, sorting, aggregations; has doc values.
numeric (`long`, `integer`, `double`, …)	`NumberFieldMapper`	No	Range queries, sorting, aggregations via doc values.
`date`	`DateFieldMapper`	No (parsed by format)	Stored as epoch millis internally.
`boolean`	`BooleanFieldMapper`	No	—
`object`	`ObjectMapper`	n/a	Nested JSON object, flattened by dotted path.
`nested`	`ObjectMapper` with its `Nested` flag enabled	n/a	Object indexed as separate hidden Lucene docs (preserves array element boundaries).

Note: The classic gotcha: a text field has no doc values, so you cannot sort or aggregate on it directly — you get an error pointing you at fielddata (docvalues-fielddata.md). The idiomatic fix is a keyword multi-field (text with a .keyword sub-field). This is the most common mapping mistake new users hit.

_source and the document model

find server -name "SourceFieldMapper.java" -o -name "DocumentParser.java"
grep -n "_source\|_id\|_routing\|_field_names\|class .*FieldMapper" \
  server/src/main/java/org/opensearch/index/mapper/SourceFieldMapper.java | head

Metadata field	Role
`_source`	The original JSON document, stored verbatim. Powers `get`, returned hits, reindex, update, highlighting.
`_id`	The document id (a Lucene term used for updates/gets).
`_routing`	Custom routing value (which shard).
`_field_names`	Tracks which fields exist (for `exists` queries).

_source is what makes OpenSearch feel document-oriented: Lucene stores inverted-index terms, but _source keeps the original document so you can return it, reindex it, and run partial updates. Disabling _source saves space but breaks update, reindex, and some highlighting — a frequently regretted optimization.

Dynamic mapping

If a document contains a field with no mapping, OpenSearch can infer one:

grep -n "dynamic\|DynamicTemplate\|parseDynamicValue\|createDynamicUpdate\|Dynamic\b" \
  server/src/main/java/org/opensearch/index/mapper/DocumentParser.java | head

`dynamic` setting	Behavior on an unmapped field
`true` (default)	Infer a type and add a mapping (a string becomes `text`+`keyword`, an integer becomes `long`, a floating-point number becomes `float`, etc.).
`false`	Ignore the field for indexing (kept in `_source`, not searchable).
`strict`	Reject the document with an error.

flowchart TD
    Doc["doc has field 'price' (unmapped)"] --> Dyn{"dynamic (mapping parameter)"}
    Dyn -->|true| Infer["infer type -> add mapping via cluster-state update"]
    Dyn -->|false| Ignore[store in _source only]
    Dyn -->|strict| Reject[reject document]

Warning: Dynamic mapping adds fields by issuing a cluster-state update (the new mapping must be published — see cluster-state-publishing.md). Unbounded dynamic fields cause mapping explosion: thousands of fields bloat the cluster state, slow publishing, and can exhaust the field limit (index.mapping.total_fields.limit). Untrusted/high-cardinality keys (e.g. user-supplied JSON object keys) are the usual culprit. Prefer strict or dynamic: false for such data, or use flattened-style modeling.
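The interaction between the `dynamic` mode and the field limit can be sketched as follows. This is a toy model: `DynamicMappingSketch` is a hypothetical class, it covers only a subset of modes and JSON value types, and the `mapping.put` call stands in for what is a cluster-state update in the real system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of dynamic mapping with a field limit; not OpenSearch code.
final class DynamicMappingSketch {
    enum Dynamic { TRUE, FALSE, STRICT }

    private final Map<String, String> mapping = new LinkedHashMap<>();
    private final Dynamic dynamic;
    private final int totalFieldsLimit;

    DynamicMappingSketch(Dynamic dynamic, int totalFieldsLimit) {
        this.dynamic = dynamic;
        this.totalFieldsLimit = totalFieldsLimit;
    }

    // Returns the field's type, or null when an unmapped field is ignored (dynamic = FALSE).
    String onField(String name, Object value) {
        String existing = mapping.get(name);
        if (existing != null) {
            return existing;
        }
        switch (dynamic) {
            case STRICT:
                throw new IllegalArgumentException("strict mapping: unmapped field [" + name + "]");
            case FALSE:
                return null;
            default:
                if (mapping.size() >= totalFieldsLimit) {
                    throw new IllegalStateException("total fields limit [" + totalFieldsLimit + "] exceeded");
                }
                String type = infer(value);
                mapping.put(name, type); // stands in for a cluster-state update
                return type;
        }
    }

    // Only the value types this sketch models; anything else is rejected.
    static String infer(Object value) {
        if (value instanceof Boolean) {
            return "boolean";
        }
        if (value instanceof Integer || value instanceof Long) {
            return "long";
        }
        if (value instanceof String) {
            return "text+keyword";
        }
        throw new IllegalArgumentException("value type not modeled in this sketch");
    }

    int fieldCount() {
        return mapping.size();
    }

    public static void main(String[] args) {
        DynamicMappingSketch index = new DynamicMappingSketch(Dynamic.TRUE, 2);
        System.out.println(index.onField("price", 10));     // prints "long"
        System.out.println(index.onField("name", "Fox"));   // prints "text+keyword"
        try {
            index.onField("user_key_123", true);             // third new field: over the limit
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With user-supplied keys, every new key is another `mapping.put`, which is why the limit exists and why `strict` or `false` is the safer default for such data.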

The analysis chain

Analysis converts a text value into a stream of index terms. The chain has three stages, applied in order:

find server -name "AnalysisRegistry.java"
find modules/analysis-common -name "*.java" | head
grep -n "class AnalysisRegistry\|getAnalyzer\|buildAnalyzer\|tokenizers\|tokenFilters\|charFilters" \
  server/src/main/java/org/opensearch/index/analysis/AnalysisRegistry.java | head

flowchart LR
    Raw["raw text: 'The Quick-Brown Fox!'"] --> CF[CharFilter: strip/replace chars]
    CF --> TOK[Tokenizer: split into tokens]
    TOK --> TF[TokenFilter chain: lowercase, stop, stem, ...]
    TF --> Terms["index terms: quick, brown, fox"]

Stage	Interface	Job	Examples
Character filter	`CharFilter`	Transform the raw character stream before tokenization	`html_strip`, `mapping`, `pattern_replace`
Tokenizer	`Tokenizer`	Split the stream into tokens	`standard`, `whitespace`, `keyword`, `pattern`, `ngram`
Token filter	`TokenFilter`	Transform/add/remove tokens	`lowercase`, `stop`, `stemmer`, `synonym`, `asciifolding`
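The three stages compose like ordinary functions. The sketch below imitates them in plain Java with no Lucene dependency: the regex-based `charFilter` is a crude stand-in for `html_strip`, the `tokenize` split only approximates the `standard` tokenizer, and all names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the analysis chain; not the Lucene or OpenSearch implementation.
final class AnalysisChainSketch {
    // Stage 1, character filter: drop anything that looks like an HTML tag.
    static String charFilter(String text) {
        return text.replaceAll("<[^>]*>", "");
    }

    // Stage 2, tokenizer: split on runs of characters that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Stage 3, token filters: lowercase, then remove stop words.
    static List<String> tokenFilters(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    static List<String> analyze(String text, Set<String> stopWords) {
        return tokenFilters(tokenize(charFilter(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stopWords = Collections.singleton("the");
        System.out.println(analyze("<b>The</b> Quick-Brown Fox!", stopWords)); // prints "[quick, brown, fox]"
    }
}
```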

An Analyzer is the assembled pipeline (CharFilters → Tokenizer → TokenFilters). Built-in analyzers combine these:

Analyzer	Pipeline
`standard`	`standard` tokenizer + `lowercase` filter (Unicode-aware word splitting). The default.
`keyword`	`keyword` tokenizer (the whole input is a single token — no splitting).
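`simple`, `whitespace`, `stop`	Small fixed pipelines registered in `server` alongside `standard` and `keyword`.
`english` and the other language analyzers	Language-specific pipelines provided by the `analysis-common` module.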

Most non-trivial analyzers (language analyzers, many tokenizers/filters) live in the bundled modules/analysis-common module, not in server:

find modules/analysis-common/src/main/java -name "*TokenFilterFactory.java" | head
find modules/analysis-common/src/main/java -name "*TokenizerFactory.java" | head

Note: Index-time and query-time analysis can differ. A field has an analyzer (index time) and a search_analyzer (query time). If they disagree, a query term can be analyzed into something that never matches the indexed terms — the classic "my search returns nothing" bug. Use _analyze (below) to see both.

Extending analysis with AnalysisPlugin

A plugin contributes new tokenizers/filters/analyzers via AnalysisPlugin:

find server -name "AnalysisPlugin.java"
grep -n "getTokenizers\|getTokenFilters\|getCharFilters\|getAnalyzers" \
  server/src/main/java/org/opensearch/plugins/AnalysisPlugin.java

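import java.util.Map;
import org.opensearch.index.analysis.TokenFilterFactory;
import org.opensearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.opensearch.plugins.AnalysisPlugin;
import org.opensearch.plugins.Plugin;
public class MyAnalysisPlugin extends Plugin implements AnalysisPlugin {
    // MyTokenFilterFactory is a placeholder; its constructor must take (IndexSettings, Environment, String, Settings).
    @Override
    public Map<String, AnalysisProvider<TokenFilterFactory>> getTokenFilters() {
        return Map.of("my_filter", MyTokenFilterFactory::new);
    }
}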

The plugins/analysis-icu, analysis-kuromoji, etc. plugins are exactly this: factories registered into the AnalysisRegistry. The mechanism (factories, AnalysisProvider, classloader isolation) is covered in plugin-architecture.md. Building such a plugin is the spirit of the analysis plugin labs and Level 7.

The _analyze API: your debugging tool

The single best tool for understanding and debugging analysis:

# Run an analyzer and see the tokens it produces.
curl -s -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d '{
  "analyzer": "standard",
  "text": "The Quick-Brown Fox!"
}'

# Build an ad-hoc chain (char filter + tokenizer + filters).
curl -s -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d '{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<b>The</b> Quick Fox"
}'

# Analyze with the actual field mapping (uses the field's configured analyzer).
curl -s -X POST "localhost:9200/orders/_analyze?pretty" -H 'Content-Type: application/json' -d '{
  "field": "description",
  "text": "Wireless-Headphones"
}'

The output lists each token with its start_offset/end_offset/position. When a search "should match but doesn't," run the document text through the field's index analyzer and the query text through the search analyzer and compare the token sets — they must overlap.
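The "token sets must overlap" rule is easy to check mechanically. The sketch below uses two simplified stand-ins, `standardLike` and `keywordLike`, which only approximate the `standard` and `keyword` analyzers; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Demonstrates an index/search analyzer mismatch with simplified analyzers; not OpenSearch code.
final class AnalyzerMismatchSketch {
    // Approximates the standard analyzer: split on non-alphanumerics, then lowercase.
    static Set<String> standardLike(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Approximates the keyword analyzer: the whole input is one token.
    static Set<String> keywordLike(String text) {
        return Collections.singleton(text);
    }

    // A term-based match is possible only if the two token sets share at least one token.
    static boolean canMatch(Set<String> indexedTokens, Set<String> queryTokens) {
        return !Collections.disjoint(indexedTokens, queryTokens);
    }

    public static void main(String[] args) {
        Set<String> indexed = standardLike("Wireless-Headphones");
        System.out.println(indexed);                                              // prints "[wireless, headphones]"
        System.out.println(canMatch(indexed, keywordLike("Wireless-Headphones"))); // prints "false"
        System.out.println(canMatch(indexed, standardLike("wireless")));           // prints "true"
    }
}
```

The first query fails even though it repeats the document text verbatim, because the search side produced one unsplit, mixed-case token that was never indexed.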

Reading exercise

# 1. The mapper split.
grep -n "class FieldMapper\|parseCreateField\|class MappedFieldType\|termQuery\|rangeQuery\|hasDocValues" \
  server/src/main/java/org/opensearch/index/mapper/FieldMapper.java \
  server/src/main/java/org/opensearch/index/mapper/MappedFieldType.java | head

# 2. Dynamic mapping inference.
grep -n "createDynamicUpdate\|parseDynamicValue\|Dynamic\." \
  server/src/main/java/org/opensearch/index/mapper/DocumentParser.java | head

# 3. The analysis registry and built-ins.
grep -n "getAnalyzer\|buildMapping\|PreBuiltAnalyzers\|standard\|keyword" \
  server/src/main/java/org/opensearch/index/analysis/AnalysisRegistry.java | head

# 4. analysis-common contents.
ls modules/analysis-common/src/main/java/org/opensearch/analysis/common/ | head -40

# 5. Tests + the API.
./gradlew :server:test --tests "org.opensearch.index.mapper.MapperServiceTests"
./gradlew :modules:analysis-common:test --tests "*Standard*" 2>/dev/null || true
curl -s -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' \
  -d '{"analyzer":"standard","text":"Hello World"}'

Answer:

Explain the division of labor between FieldMapper and MappedFieldType. Which one is used when you run a query, and which when you index a document?
Why can you not sort or aggregate on a text field, and what is the idiomatic fix? Where is the doc-values distinction encoded?
Walk the three stages of the analysis chain for the input "<b>The</b> Quick-Brown Fox!" through html_strip + standard + lowercase. List the resulting tokens.
What does dynamic mapping do under the hood that makes mapping explosion a cluster-wide (not just index-local) problem?
What is the difference between an index analyzer and a search_analyzer, and what bug appears when they disagree?
Use _analyze to show that keyword and standard analyzers produce different token sets for the same input.

Common bugs and symptoms

Symptom	Root cause	Where to look
`Fielddata is disabled on text fields by default` on sort/agg	Sorting/aggregating a `text` field with no doc values	use a `.keyword` multi-field; docvalues-fielddata.md
Search returns nothing for an obviously-present term	Index analyzer ≠ search analyzer; tokens don't overlap	`_analyze` both sides; compare token sets
Cluster state bloats; publishing slows	Mapping explosion from unbounded dynamic fields	`index.mapping.total_fields.limit`; set `dynamic: strict`/`false`
`mapper_parsing_exception` on index	Document field conflicts with the existing mapping type	the field's mapping; reindex with corrected mapping
Nested array queries match across elements incorrectly	Used `object` where `nested` was needed (element boundaries lost)	`nested` type; `ObjectMapper.Nested`
Custom analyzer "not found"	`AnalysisPlugin` not installed on the node, or factory name typo	`getTokenFilters`/`getTokenizers`; plugin install

Validation: prove you understand this

Draw the index-time path (DocumentMapper → FieldMapper → Lucene Document) and the query-time path (MappedFieldType → Lucene Query), and explain why the split exists.
Explain the text vs keyword distinction in terms of analysis and doc values, and give the standard multi-field pattern.
From memory, list the three analysis stages and one example component of each, and trace one input string through them.
Explain how dynamic mapping interacts with cluster state and why mapping explosion is dangerous; name the setting that bounds it and the dynamic modes that prevent it.
Explain index vs search analyzer and demonstrate, with _analyze, a mismatch that causes a non-matching search.
Write the AnalysisPlugin method signature that registers a custom token filter, and name two in-repo plugins that work this way.

Refresh, Flush, and Merge

Three background mechanisms turn a stream of indexing operations into a queryable, durable, and space-efficient Lucene index. Refresh makes recently-indexed documents visible to search. Flush makes them durable (a Lucene commit) and lets the translog be trimmed. Merge keeps the segment count bounded so search stays fast. No two of them are the same thing, and conflating them is the single most common conceptual error new OpenSearch contributors make.

This chapter pins each mechanism to its classes, its REST surface, its settings, and the failure modes it produces. It builds directly on Engine Internals (which owns the IndexWriter) and The Translog (durability before commit), and it feeds Search Execution (which consumes the searcher that refresh produces).

After this chapter you can:

Explain, without hand-waving, why a document you just indexed isn't found by the next search, and what ?refresh=wait_for actually does.
Trace _refresh, _flush, and _forcemerge from REST handler to Engine.
Read _cat/segments and _stats and reason about whether a shard is over-segmented, refresh-bound, or merge-starved.
Tune index.refresh_interval, merge throttling, and translog flush thresholds and predict the trade-off you just made.

The three operations at a glance

Operation	Lucene primitive	Makes docs…	Touches translog?	REST	Engine method (grep target)
Refresh	open new `DirectoryReader` (NRT)	visible to search	no (translog untouched)	`POST /idx/_refresh`	`InternalEngine.refresh(...)`
Flush	`IndexWriter.commit()`	durable (survive restart without translog replay)	trims translog up to commit	`POST /idx/_flush`	`InternalEngine.flush(...)`
Merge	`IndexWriter.maybeMerge` / forced	faster to search; fewer segments	indirectly (commit after)	`POST /idx/_forcemerge`	`InternalEngine` via `MergePolicy`/`MergeScheduler`

Note: Visibility and durability are orthogonal. A refreshed-but-unflushed document is searchable, but its segment is not yet part of a Lucene commit, so its durability still rests on the translog; after a crash it is recovered by replaying the translog. A flushed document sits in a committed, fsync'd segment and survives a restart without replay. A flush also refreshes as a side effect, but you should not rely on flush as your visibility mechanism.

Find the engine that owns all three:

find server/src/main/java/org/opensearch/index/engine -name "*.java"
grep -n "public void refresh\|public void flush\|void forceMerge\|maybeMerge" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java

Refresh — the visibility boundary

Indexed documents land in Lucene's in-memory buffer via IndexWriter.addDocument/updateDocument (see Engine Internals). They are not searchable until a new near-real-time (NRT) DirectoryReader is opened over the writer's current state. That reader-open is a refresh.

The reader is managed through Lucene's ReferenceManager abstraction: OpenSearchReaderManager is OpenSearch's ReferenceManager subclass over OpenSearchDirectoryReader, filling the role that SearcherManager plays in a plain Lucene application. InternalEngine holds one reader manager for internal use and one for the external, search-visible readers. The searcher you acquire for a query (IndexShard.acquireSearcher) is a reference into the most recently refreshed reader.

grep -rn "ReaderManager\|SearcherManager\|acquireSearcher" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java
grep -rn "class.*ReaderManager" server/src/main/java/org/opensearch/index/engine/

Who triggers a refresh

Trigger	Mechanism	Notes
Periodic	`index.refresh_interval` (default 1s)	A scheduled task on the `REFRESH` thread pool. Set to `-1` to disable.
Explicit	`POST /idx/_refresh` → `TransportRefreshAction`	Broadcast to all shards.
Per-request	`?refresh=true` / `?refresh=wait_for` on index/bulk/update/delete	`true` forces an immediate refresh; `wait_for` blocks the response until the next scheduled refresh covers the op.
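Search-idle optimization	a shard with no searches for `index.search.idle.after` (default 30s) stops periodic refresh until searched again; applies only when `index.refresh_interval` is not set explicitly	Reduces refresh cost on write-heavy, search-cold shards. Surprises people: new segments stop appearing in `_cat/segments` on the usual cadence, and the first search after the idle period has to wait for the pending refresh.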

grep -rn "refresh_interval\|REFRESH_INTERVAL_SETTING" \
  server/src/main/java/org/opensearch/index/IndexSettings.java
grep -rn "search.idle\|searchIdle\|scheduledRefresh" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java

`wait_for` is not a forced refresh

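?refresh=wait_for registers a listener (RefreshListeners) that completes the request when the operation's location becomes visible via the next refresh; it does not force one. This is the cheap way to get read-your-writes without hammering refresh. The one exception is back-pressure: once index.max_refresh_listeners (default 1000) requests are already waiting on a shard, the next wait_for request forces a refresh instead of queueing.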
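The listener mechanism can be sketched as a queue of callbacks keyed by write location. This is a simplified model with hypothetical names (`RefreshListenersSketch`, `index`, `refresh`); it keeps only the "park until the next refresh" behavior and models no listener cap.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of wait_for refresh listeners; not the real RefreshListeners class.
final class RefreshListenersSketch {
    private static final class Pending {
        final long location;
        final Runnable onVisible;

        Pending(long location, Runnable onVisible) {
            this.location = location;
            this.onVisible = onVisible;
        }
    }

    private final List<Pending> pending = new ArrayList<>();
    private long lastWrittenLocation = -1;
    private long lastRefreshedLocation = -1;
    private int refreshCount = 0;

    // Index one op and return its location in the write stream.
    long index() {
        return ++lastWrittenLocation;
    }

    // wait_for: fire immediately if already visible, otherwise park until a refresh covers it.
    void addOrNotify(long location, Runnable onVisible) {
        if (location <= lastRefreshedLocation) {
            onVisible.run();
        } else {
            pending.add(new Pending(location, onVisible));
        }
    }

    // A refresh (scheduled or explicit) makes everything written so far visible.
    void refresh() {
        refreshCount++;
        lastRefreshedLocation = lastWrittenLocation;
        for (Iterator<Pending> it = pending.iterator(); it.hasNext(); ) {
            Pending p = it.next();
            if (p.location <= lastRefreshedLocation) {
                it.remove();
                p.onVisible.run();
            }
        }
    }

    int refreshCount() {
        return refreshCount;
    }

    public static void main(String[] args) {
        RefreshListenersSketch shard = new RefreshListenersSketch();
        long location = shard.index();
        shard.addOrNotify(location, () -> System.out.println("response sent"));
        System.out.println(shard.refreshCount()); // prints "0": registering did not refresh
        shard.refresh();                           // prints "response sent"
    }
}
```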

grep -rn "RefreshListeners\|addOrNotify\|wait_for" \
  server/src/main/java/org/opensearch/index/shard/

Increasing throughput: the classic tuning move

For bulk-load workloads, set index.refresh_interval to 30s (or -1 during a backfill, restored afterward). Fewer refreshes mean fewer tiny segments, less merge pressure, and more CPU for indexing. The cost is staleness: newly indexed documents take up to the full interval to become searchable.

curl -s -XPUT localhost:9200/idx/_settings -H 'content-type: application/json' \
  -d '{"index":{"refresh_interval":"30s"}}'

Flush — durability and translog trim

A flush issues a Lucene IndexWriter.commit(): in-memory segments are written and fsync'd, a new commit point (segments_N) is created, and — crucially — the translog can be trimmed because every operation up to that commit is now recoverable from the Lucene index itself without replay.

sequenceDiagram
    participant W as Write op
    participant TL as Translog
    participant IW as Lucene IndexWriter
    participant CP as Commit point
    W->>TL: Translog.add(op)  (durable on fsync per index.translog.durability)
    W->>IW: addDocument / updateDocument (in-memory)
    Note over IW: refresh -> visible, still uncommitted
    Note over IW,CP: flush -> IndexWriter.commit() -> segments_N
    CP-->>TL: translog generation rolled & trimmed up to commit

When a flush happens

Trigger	Setting / cause
Translog size	`index.translog.flush_threshold_size` (default 512mb) — engine flushes when the uncommitted translog exceeds it.
Periodic / idle	engine may flush idle shards to bound recovery time.
Explicit	`POST /idx/_flush` → `TransportFlushAction` → `IndexShard.flush`.
Before some ops	shard close, snapshot, certain recovery transitions.

grep -rn "flush_threshold_size\|FLUSH_THRESHOLD\|shouldPeriodicallyFlush" \
  server/src/main/java/org/opensearch/index/

Warning: Do not script periodic _flush calls as a "safety" measure. The engine decides flush timing far better than a cron job; manual flushing usually just creates extra small commits and fights the translog heuristics. Forced flush has legitimate uses (before a clean shutdown, for tests), but "flush every minute to be safe" is an anti-pattern.

The durability of writes before flush comes from the translog and its index.translog.durability setting (request vs async) — that is the Translog deep dive's territory; flush is where the translog gets to forget.

Merge — keeping segment count bounded

Every refresh can create a new segment. Every segment is an independent mini-index that search must visit. Left unchecked, segment count explodes and search degrades (more files, more terms dictionaries, more per-segment overhead, deleted docs never reclaimed). Merge combines smaller segments into larger ones and is how deletes/updates actually free space (Lucene deletes are tombstones until the segment is merged away).
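The "tombstones until merged" point can be shown with a toy segment model. This sketch assumes a hypothetical `Segment` that tracks only live and deleted doc counts; a real merge rewrites postings, doc values, and stored fields, but the accounting is the same.

```java
import java.util.Arrays;
import java.util.List;

// Toy accounting model of a merge reclaiming deleted docs; not Lucene code.
final class MergeSketch {
    static final class Segment {
        final int liveDocs;
        final int deletedDocs; // tombstoned: still on disk, skipped at search time

        Segment(int liveDocs, int deletedDocs) {
            this.liveDocs = liveDocs;
            this.deletedDocs = deletedDocs;
        }

        int docsOnDisk() {
            return liveDocs + deletedDocs;
        }
    }

    // Merging copies only live docs into the new segment, so deletes are finally dropped.
    static Segment merge(List<Segment> inputs) {
        int live = 0;
        for (Segment s : inputs) {
            live += s.liveDocs;
        }
        return new Segment(live, 0);
    }

    static int docsOnDisk(List<Segment> segments) {
        int total = 0;
        for (Segment s : segments) {
            total += s.docsOnDisk();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Segment> before = Arrays.asList(new Segment(10, 5), new Segment(20, 0), new Segment(3, 7));
        Segment merged = merge(before);
        System.out.println(docsOnDisk(before));  // prints "45"
        System.out.println(merged.docsOnDisk()); // prints "33"
    }
}
```

Three segments became one, and the 12 tombstoned docs stopped occupying space only because the merge happened.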

The policy and the scheduler

Concern	Class	What it does
Which segments to merge	`TieredMergePolicy` (Lucene)	Groups segments into tiers by size; selects merge candidates. The OpenSearch default.
When/how many merges run	`ConcurrentMergeScheduler` (Lucene)	Runs merges on background threads, throttles I/O.
OpenSearch wiring	`OpenSearchConcurrentMergeScheduler`, `MergePolicyProvider`/`MergePolicyConfig`	Applies `index.merge.*` settings, exposes merge stats, hooks logging.

grep -rn "TieredMergePolicy\|MergeScheduler\|ConcurrentMergeScheduler" \
  server/src/main/java/org/opensearch/index/
find server/src/main/java/org/opensearch/index -name "*Merge*"

Key merge settings

Setting	Default-ish	Effect
`index.merge.policy.max_merged_segment`	5gb	Cap on a single merged segment's size.
`index.merge.policy.segments_per_tier`	10	Segments allowed per tier. A higher value means fewer merges but more segments to search; a lower value means the opposite.
`index.merge.policy.floor_segment`	2mb	Segments below the floor are treated as floor-sized for tiering.
`index.merge.scheduler.max_thread_count`	f(CPU/SSD)	Concurrent merges. Too high on spinning disks thrashes I/O.

grep -rn "max_merged_segment\|segments_per_tier\|floor_segment\|max_thread_count" \
  server/src/main/java/org/opensearch/index/MergePolicyConfig.java \
  server/src/main/java/org/opensearch/index/MergeSchedulerConfig.java

force_merge — the loaded gun

POST /idx/_forcemerge?max_num_segments=1 forces merging down to at most max_num_segments segments (here, one). It is the right tool for read-only / cold indices (e.g., time-based indices no longer being written) to collapse them to one big segment and purge deletes. It is the wrong tool for actively-written indices:

Warning: force_merge to 1 segment on a live index produces enormous segments that the normal TieredMergePolicy will never merge again (they exceed max_merged_segment), permanently distorting the tier structure and stranding deleted docs. Only force-merge indices you have stopped writing to. Forcing is also a blocking, I/O-heavy operation — never do it on a hot cluster during peak load.

grep -rn "forcemerge\|forceMerge\|max_num_segments\|only_expunge_deletes" \
  server/src/main/java/org/opensearch/action/admin/indices/forcemerge/

How they interact

flowchart TD
    I[Index op] --> B[Lucene in-memory buffer + Translog.add]
    B -->|refresh interval / explicit| R[Open new DirectoryReader]
    R --> V[Docs visible to search]
    B -->|buffer/segment created| S[New small segment]
    S -->|TieredMergePolicy selects| M[Background merge -> bigger segment]
    M -->|reclaims deletes| S
    B -->|translog > flush_threshold_size| F[IndexWriter.commit = flush]
    F --> TT[Translog trimmed]
    F --> CP[New commit point segments_N]
    M -.merged segments committed on.-> F

The feedback loops worth internalizing:

Refresh ↑ → segment count ↑ → merge pressure ↑. A 200ms refresh_interval on a write-heavy shard buries the merge scheduler.
Flush trims translog, so a stuck flush (e.g., I/O saturation from merges) makes the translog grow unbounded → flush_threshold_size exceeded → forced flush competes with the merges that caused it.
Segment replication (see Replication) changes who runs merges: replicas copy segments from the primary instead of indexing independently, so merge cost is paid once on the primary. Document replication pays it on every copy.
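The first loop is easy to put numbers on. A rough sketch, assuming writes are continuous so that every scheduled refresh finds buffered documents and therefore writes at least one new segment; the class and method names are hypothetical.

```java
// Hypothetical back-of-envelope helper: how many refreshes (and so, at minimum,
// how many new small segments) a shard produces per minute under constant writes.
final class RefreshPressure {
    static long refreshesPerMinute(long refreshIntervalMillis) {
        if (refreshIntervalMillis <= 0) {
            throw new IllegalArgumentException("periodic refresh is disabled or the interval is invalid");
        }
        return 60_000L / refreshIntervalMillis;
    }

    public static void main(String[] args) {
        for (long interval : new long[] {200, 1_000, 30_000}) {
            System.out.println(interval + " ms interval -> at least "
                + refreshesPerMinute(interval) + " new segments per minute");
        }
    }
}
```

At 200 ms the shard creates five times as many segments as at the 1 s default, and every one of them has to be merged away later.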

Observability: read the shard like an X-ray

# Per-segment view: size, doc count, deletes, committed/searchable flags
curl -s 'localhost:9200/_cat/segments/idx?v&h=index,shard,prirep,segment,docs.count,docs.deleted,size,committed,searchable'

# Aggregate refresh/flush/merge/segment stats for an index
curl -s 'localhost:9200/idx/_stats/refresh,flush,merge,segments?pretty'

Map columns to meaning:

Signal	Where	What it suggests
`segments.count` climbing, never falling	`_stats/segments`	merges starved or refresh too aggressive
`merges.current` pinned > 0 for long	`_stats/merge`	merge can't keep up; check `max_thread_count`, disk
`docs.deleted` high fraction of `docs.count`	`_cat/segments`	needs merge / expunge-deletes
`flush.total` spiking with translog growth	`_stats/flush` + `_stats/translog`	flush threshold thrash
`committed=false` for most segments	`_cat/segments`	normal between flushes; persistent = no recent flush
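The deleted-docs signal can be turned into a number. A small sketch, assuming `docs.count` in `_cat/segments` counts live documents only, so the deleted fraction is `deleted / (live + deleted)`; the class name and the threshold you pass in are hypothetical, not OpenSearch defaults.

```java
// Hypothetical helper for reading _cat/segments rows; not part of OpenSearch.
final class SegmentHealth {
    /** Fraction of a segment's documents that are deleted but still occupying space. */
    static double deletedFraction(long liveDocs, long deletedDocs) {
        long total = liveDocs + deletedDocs;
        return total == 0 ? 0.0 : (double) deletedDocs / total;
    }

    /** True when the deleted fraction exceeds a threshold chosen by the operator. */
    static boolean worthExpunging(long liveDocs, long deletedDocs, double threshold) {
        return deletedFraction(liveDocs, deletedDocs) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(deletedFraction(700, 300));
        System.out.println(worthExpunging(700, 300, 0.2));
    }
}
```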

Reading exercise

# 1. The searcher/reader plumbing the engine uses for refresh
grep -rn "ReferenceManager\|ReaderManager\|maybeRefresh\|maybeRefreshBlocking" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java

# 2. Periodic refresh scheduling and search-idle
grep -rn "scheduledRefresh\|searchIdle\|markSearcherAccessed" \
  server/src/main/java/org/opensearch/index/shard/IndexShard.java

# 3. The decision to flush
grep -rn "shouldPeriodicallyFlush\|flush_threshold_size\|translog.*size" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java

# 4. Merge wiring
grep -rn "mergePolicy\|mergeScheduler\|maybeMerge\|forceMerge" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java

Answer:

When you call IndexShard.acquireSearcher, into which reader version do you get a reference — the latest refreshed one or the latest committed one? Where in the engine is that reference obtained?
What exactly does ?refresh=wait_for register, and what completes it? Why is it cheaper than ?refresh=true under high write load?
Trace POST /idx/_flush: name the transport action, the shard method, and the Lucene call it ultimately makes.
A shard has 80 segments and docs.deleted ≈ 30% of docs.count. The index is read-only now. What one command fixes it, and why is it safe here but dangerous on a live index?
With segment replication, which node runs the TieredMergePolicy merges, and how do the merged segments reach the replicas? (Cross-check Replication.)
Set refresh_interval to -1. Index 1000 docs. Search returns 0 hits. Name two distinct ways to make them visible and the durability difference between them.

Common bugs and symptoms

Symptom	Likely cause	Where to look
"I indexed a doc and the next `_search` doesn't find it"	Not refreshed yet (default 1s interval, or `refresh_interval: -1`, or search-idle paused refresh)	`IndexShard.scheduledRefresh`, `searchIdle`
Segment count grows without bound; search slows	`refresh_interval` too low and/or merge scheduler throttled by disk	`_cat/segments`, `MergeSchedulerConfig.max_thread_count`
Translog on disk huge; recovery slow after restart	Flush not happening (I/O saturated, or threshold mis-set)	`_stats/translog`, `shouldPeriodicallyFlush`, Translog
One giant 50GB segment never merges, deletes never reclaimed	Earlier `force_merge` on a still-live index exceeded `max_merged_segment`	`_cat/segments` size column; review ops history
Write throughput tanks under bulk load	Refresh per op (`?refresh=true` in a tight loop) creating tiny segments	bulk request params; switch to `wait_for` or interval
`CircuitBreakingException` during force_merge	Force merge built huge in-memory structures / fielddata pressure	Circuit Breakers and Memory
Replica shows stale results vs primary (segrep)	Segment replication lag; replica hasn't pulled latest checkpoint	Replication, `_cat/segment_replication`

Validation: prove you understand this

Draw the timeline of a single document from index call to "durable on disk and visible to search," labeling exactly where refresh, translog fsync, and flush each occur. State which events are required for visibility vs durability.
Explain in two sentences why a flush implies the durability of a refresh's contents but a refresh does not imply durability.
Given _cat/segments showing 120 segments on a 2GB shard with 1% deletes on a write-heavy index, decide: is the problem refresh, merge, or neither? Justify from the numbers and name the setting you'd change.
Write the curl to disable periodic refresh during a backfill and the one to restore it to 1s afterward. Explain what wait_for would do during the backfill if a client needs read-your-writes on one specific doc.
From memory, name the Lucene class for merge policy, the Lucene class for merge scheduling, and the OpenSearch config class that maps index.merge.* onto them.
Explain why force_merge?max_num_segments=1 is correct for a closed time-series index but corrupts the tiering of a live index — cite max_merged_segment.

Search Execution

A _search request is not "run the query on a shard." It is a scatter-gather across many shards on many nodes, a two-phase (query-then-fetch) protocol, and a reduce on the coordinating node that stitches partial results into a final answer. Understanding this fan-out — who holds what state, what crosses the wire, and where things can go wrong — is the difference between debugging a slow search and guessing.

This chapter follows a search from TransportSearchAction on the coordinating node down to per-shard SearchService phases and back up through SearchPhaseController. It assumes you know the request path (The Action Framework, The Transport Layer) and feeds the Query DSL and Aggregations deep dives, which detail what runs inside a phase.

After this chapter you can:

Diagram the query-then-fetch fan-out and say what each round trip carries.
Explain why deep from/size pagination is expensive and what search_after, scroll, and point-in-time (PIT) fix.
Locate the coordinating-node reduce and the per-shard phase entry points in source.
Reason about can_match pre-filtering, shard routing, and partial failures.

The cast

Concern	Class	Lives on	grep target
Coordinating-node entry	`TransportSearchAction`	coordinating node	`server/src/main/java/org/opensearch/action/search/TransportSearchAction.java`
Per-phase fan-out drivers	`SearchQueryThenFetchAsyncAction`, `SearchDfsQueryThenFetchAsyncAction`, `AbstractSearchAsyncAction`, `CanMatchPreFilterSearchPhase`	coordinating node	`server/src/main/java/org/opensearch/action/search/`
Per-shard execution	`SearchService` (`executeQueryPhase`, `executeFetchPhase`, `executeDfsPhase`, `canMatch`)	data nodes	`server/src/main/java/org/opensearch/search/SearchService.java`
The phases themselves	`QueryPhase`, `FetchPhase`, `DfsPhase`, `RescorePhase`	data nodes	`server/src/main/java/org/opensearch/search/query/`, `.../fetch/`, `.../dfs/`, `.../rescore/`
Per-shard state	`SearchContext` / `DefaultSearchContext`	data nodes	`server/src/main/java/org/opensearch/search/internal/`
Coordinator reduce	`SearchPhaseController`	coordinating node	`server/src/main/java/org/opensearch/action/search/SearchPhaseController.java`

grep -n "executeQueryPhase\|executeFetchPhase\|executeDfsPhase\|public.*canMatch" \
  server/src/main/java/org/opensearch/search/SearchService.java
ls server/src/main/java/org/opensearch/action/search/ | grep -i "asyncaction\|phase"

Query-then-fetch: the two-phase model

OpenSearch does not ship full documents from every shard for every candidate hit. That would be wasteful: to return the top 10 of millions, you only need the document IDs and sort values from each shard first, pick the global top 10, then fetch the bodies of exactly those 10. That is query-then-fetch.

sequenceDiagram
    participant C as Client
    participant Co as Coordinating node (TransportSearchAction)
    participant S1 as Shard A (data node)
    participant S2 as Shard B (data node)
    C->>Co: POST /idx/_search {query, from, size, aggs}
    Note over Co: resolve indices, routing -> target shards
    par Query phase (scatter)
        Co->>S1: ShardSearchRequest (QueryPhase)
        Co->>S2: ShardSearchRequest (QueryPhase)
    end
    S1-->>Co: top docIds + sort values + agg slices + maxScore
    S2-->>Co: top docIds + sort values + agg slices + maxScore
    Note over Co: SearchPhaseController.reducedQueryPhase -> global top-K + which shards own them
    par Fetch phase (scatter, only to shards holding winners)
        Co->>S1: ShardFetchSearchRequest (docIds to materialize)
        Co->>S2: ShardFetchSearchRequest
    end
    S1-->>Co: _source + fields + highlights for its winners
    S2-->>Co: _source + fields ...
    Note over Co: merge hits, reduce aggs, build SearchResponse
    Co-->>C: SearchResponse (hits, aggregations, took, _shards)

Phase 1 — Query

On each target shard, SearchService.executeQueryPhase builds a DefaultSearchContext, runs QueryPhase.execute, and returns a QuerySearchResult: the top from + size document IDs for that shard, their sort/score values, maxScore, total hit count (subject to track_total_hits), and any aggregation partials. It does not contain _source.

grep -n "class QuerySearchResult\|topDocs\|aggregations\|totalHits" \
  server/src/main/java/org/opensearch/search/query/QuerySearchResult.java
grep -n "public.*execute" server/src/main/java/org/opensearch/search/query/QueryPhase.java

Reduce — pick the global winners

The coordinating node collects every QuerySearchResult and calls SearchPhaseController.reducedQueryPhase(...): it merges the per-shard top-K into a global top-K (a sorted merge over sort values), reduces aggregations (InternalAggregation.reduce, see Aggregations), and produces a ScoreDoc[] plus the map of which shard owns each winning doc.

grep -n "reducedQueryPhase\|mergeTopDocs\|reduceAggs\|sortDocs" \
  server/src/main/java/org/opensearch/action/search/SearchPhaseController.java
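The reduce is easier to reason about with a toy version in hand. The sketch below is not `SearchPhaseController`: it flattens the per-shard lists and sorts them, which yields the same page as a heap-based sorted merge but is simpler to read. `ShardDoc` and `reduce` are hypothetical names, and score-descending order with shard index and doc ID as tiebreakers is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the coordinator-side top-K reduce; not OpenSearch code.
final class TopKReduceSketch {
    /** One candidate hit as returned by a shard's query phase (no _source yet). */
    static final class ShardDoc {
        final int shardIndex;
        final int docId;
        final float score;

        ShardDoc(int shardIndex, int docId, float score) {
            this.shardIndex = shardIndex;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Merges per-shard top docs into the global page [from, from + size). */
    static List<ShardDoc> reduce(List<List<ShardDoc>> perShardTopDocs, int from, int size) {
        List<ShardDoc> all = new ArrayList<>();
        for (List<ShardDoc> shardTopDocs : perShardTopDocs) {
            all.addAll(shardTopDocs);
        }
        // Highest score first; ties broken by shard index, then doc id, for a deterministic order.
        all.sort(Comparator.comparingDouble((ShardDoc d) -> -d.score)
            .thenComparingInt(d -> d.shardIndex)
            .thenComparingInt(d -> d.docId));
        int start = Math.min(from, all.size());
        int end = Math.min(start + size, all.size());
        return new ArrayList<>(all.subList(start, end));
    }

    public static void main(String[] args) {
        List<ShardDoc> shardA = List.of(new ShardDoc(0, 5, 3.0f), new ShardDoc(0, 9, 1.0f));
        List<ShardDoc> shardB = List.of(new ShardDoc(1, 2, 2.5f), new ShardDoc(1, 7, 2.0f));
        for (ShardDoc d : reduce(List.of(shardA, shardB), 0, 3)) {
            System.out.println("shard " + d.shardIndex + " doc " + d.docId + " score " + d.score);
        }
    }
}
```

Each winner keeps its `shardIndex`, which is what lets the next phase know where to send the fetch.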

Phase 2 — Fetch

For each shard that owns at least one global winner, the coordinator sends a ShardFetchSearchRequest listing the exact doc IDs. SearchService.executeFetchPhase runs FetchPhase, which materializes _source, stored/doc-value fields, highlights, inner hits, etc., into a FetchSearchResult. The coordinator merges these into the final SearchHits.

grep -n "executeFetchPhase\|FetchPhase\|FetchSearchResult\|fetchSubPhase" \
  server/src/main/java/org/opensearch/search/fetch/FetchPhase.java \
  server/src/main/java/org/opensearch/search/SearchService.java
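Building the fetch fan-out amounts to grouping the global winners by owning shard. A minimal sketch, assuming each winner is represented as a `{shardIndex, docId}` pair; the class, the method, and that representation are hypothetical and are not the structure OpenSearch uses internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: decide which shards receive a fetch request and for which doc ids.
final class FetchPlanSketch {
    /** Groups winners, given as {shardIndex, docId} pairs, by the shard that owns them. */
    static Map<Integer, List<Integer>> docIdsByShard(int[][] winners) {
        Map<Integer, List<Integer>> plan = new TreeMap<>();
        for (int[] winner : winners) {
            plan.computeIfAbsent(winner[0], shard -> new ArrayList<>()).add(winner[1]);
        }
        return plan;
    }

    public static void main(String[] args) {
        int[][] winners = { {0, 5}, {1, 2}, {1, 7} };
        // A shard with no winners does not appear in the plan, so it gets no fetch request.
        System.out.println(docIdsByShard(winners));
    }
}
```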

DFS — when local term stats lie

BM25 scoring (see Query DSL and QueryBuilders) depends on term frequencies and document frequencies. Each shard only knows its own docFreq. With few shards and uniform data this is fine; with skewed data or rare terms across many shards, per-shard scoring can rank inconsistently.

The optional DFS phase fixes this. With ?search_type=dfs_query_then_fetch, the coordinator first runs DfsPhase on each shard to collect global term statistics, aggregates them, and feeds the global stats into the query phase so every shard scores against the same numbers.

flowchart LR
    A[DfsPhase per shard: collect docFreq/termStats] --> B[Coordinator aggregates global term stats]
    B --> C[QueryPhase per shard scores with global stats]
    C --> D[reduce -> FetchPhase]

grep -n "DfsPhase\|AggregatedDfs\|dfs_query_then_fetch\|SearchType.DFS" \
  server/src/main/java/org/opensearch/search/dfs/DfsPhase.java \
  server/src/main/java/org/opensearch/action/search/SearchType.java

Note: DFS adds a network round trip. Use it when scoring consistency matters (e.g., few docs, relevance-sensitive ranking). Default is query_then_fetch.
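To see how much local statistics can skew a score, compare the idf a rare term gets on two shards with the idf computed from their summed statistics. The sketch assumes the BM25 idf form used by Lucene's `BM25Similarity`, `ln(1 + (N - n + 0.5) / (n + 0.5))`; the class, the method names, and the shard numbers are made up for illustration.

```java
// Hypothetical sketch of per-shard vs global idf; the shard statistics are invented.
final class DfsIdfSketch {
    /** BM25 idf for a term appearing in docFreq of docCount documents. */
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** What a DFS-style aggregation achieves: sum the statistics first, then compute idf once. */
    static double globalIdf(long[] docFreqs, long[] docCounts) {
        long docFreq = 0;
        long docCount = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            docFreq += docFreqs[i];
            docCount += docCounts[i];
        }
        return idf(docFreq, docCount);
    }

    public static void main(String[] args) {
        // Shard A holds 1 matching doc out of 1000, shard B holds 50 out of 1000.
        System.out.println("shard A idf: " + idf(1, 1000));
        System.out.println("shard B idf: " + idf(50, 1000));
        System.out.println("global idf:  " + globalIdf(new long[] {1, 50}, new long[] {1000, 1000}));
    }
}
```

Without DFS, the single match on shard A is scored with a much higher idf than the matches on shard B, so it can outrank them for reasons unrelated to relevance.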

can_match: don't even ask a shard that can't match

CanMatchPreFilterSearchPhase runs a cheap pre-filter (SearchService.canMatch) against each shard before the real query phase. It uses min/max ranges from segment metadata (e.g., a @timestamp range filter against a time-based index) to skip shards that provably hold no matching documents. This is huge for time-series clusters with hundreds of shards where a query only touches one day.

It also drives field-sort shard ordering: shards are sorted so the most promising are queried first, enabling early termination.

grep -n "canMatch\|CanMatchPreFilter\|MinAndMax\|pre_filter_shard_size" \
  server/src/main/java/org/opensearch/search/SearchService.java \
  server/src/main/java/org/opensearch/action/search/CanMatchPreFilterSearchPhase.java

The pre-filter only kicks in when the number of target shards exceeds pre_filter_shard_size (default 128), or for certain frozen/searchable-snapshot indices.
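The core of the pre-filter is an interval-overlap test. A minimal sketch, assuming the shard exposes an inclusive min/max for the filtered field and the query is an inclusive range; the class and method names are hypothetical, and the real `canMatch` works through query rewriting rather than a bare comparison.

```java
// Hypothetical sketch of the can_match idea: skip a shard whose value range cannot overlap the query.
final class CanMatchSketch {
    /** True if [shardMin, shardMax] overlaps [queryFrom, queryTo]; all bounds inclusive. */
    static boolean canMatch(long shardMin, long shardMax, long queryFrom, long queryTo) {
        return shardMax >= queryFrom && shardMin <= queryTo;
    }

    /** The gate described above: pre-filter only when enough shards are targeted. */
    static boolean shouldPreFilter(int targetShards, int preFilterShardSize) {
        return targetShards > preFilterShardSize;
    }

    public static void main(String[] args) {
        // Shard covers timestamps 100..200; the query asks for 250..300, so the shard is skipped.
        System.out.println(canMatch(100, 200, 250, 300));
        // Same shard, query 150..300: the ranges overlap, so the shard must be queried.
        System.out.println(canMatch(100, 200, 150, 300));
    }
}
```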

Pagination: from/size vs search_after vs scroll vs PIT

Mechanism	How	Cost / limit	Use when
`from` / `size`	each shard returns top `from+size`; coordinator drops `from`	deep pages multiply memory across shards; capped by `index.max_result_window` (default 10000)	shallow paging in a UI
`search_after`	resume after the last sort tuple of the previous page	stateless, no deep-page blowup; must have a deterministic full sort (include a tiebreaker like `_id`/`_shard_doc`)	deep, forward-only pagination
Scroll	freezes a point-in-time view per shard via a `SearchContext` kept open	holds segment readers open (resource cost); legacy	large exports, not user-facing paging
Point-in-time (PIT)	a named, shareable frozen view (`POST /idx/_search/point_in_time`) used with `search_after`	replaces scroll for consistent deep paging without per-request context churn	modern consistent deep paging

grep -n "max_result_window\|from()\|size()\|searchAfter\|search_after" \
  server/src/main/java/org/opensearch/search/internal/DefaultSearchContext.java \
  server/src/main/java/org/opensearch/index/IndexSettings.java
grep -rn "point_in_time\|PitService\|CreatePit\|SearchContextId" \
  server/src/main/java/org/opensearch/action/search/ | head

Warning: Deep from/size (e.g., from=100000) forces every shard to build a 100k+ priority queue and ship it to the coordinator, which merges all of them. This is why max_result_window exists. The fix is almost always search_after over a PIT, not raising the window.
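The cost difference is plain arithmetic. A sketch under the model described above (each shard returns its top `from + size` entries, and the coordinator receives all of them); the class and method names are hypothetical.

```java
// Hypothetical cost model for deep paging; counts candidate entries, not bytes.
final class DeepPagingCost {
    /** Entries each shard must collect and ship for a from/size page. */
    static long entriesPerShard(int from, int size) {
        return (long) from + size;
    }

    /** Entries the coordinator must merge for a from/size page. */
    static long entriesAtCoordinator(int shards, int from, int size) {
        return (long) shards * entriesPerShard(from, size);
    }

    /** With search_after each shard only returns the next `size` entries. */
    static long searchAfterEntriesAtCoordinator(int shards, int size) {
        return (long) shards * size;
    }

    public static void main(String[] args) {
        System.out.println(entriesAtCoordinator(20, 100_000, 10));   // from/size on 20 shards
        System.out.println(searchAfterEntriesAtCoordinator(20, 10)); // search_after on 20 shards
    }
}
```

For `from=100000`, `size=10` on 20 shards that is 2,000,200 entries merged to return 10 hits, against 200 with `search_after`.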

SearchContext — the per-shard state holder

SearchContext (concrete: DefaultSearchContext) is the per-shard, per-request state: the parsed query, the searcher (a reference into the latest refreshed reader — see Refresh, Flush, and Merge), aggregation collectors, from/size, sort, the QueryShardContext, timeouts, and the fetch sub-phases. For query-then-fetch it is typically created and torn down within a single phase; for scroll/PIT it is kept alive across requests (keyed by a SearchContextId).

grep -n "class DefaultSearchContext\|QueryShardContext\|ContextIndexSearcher\|aggregations(" \
  server/src/main/java/org/opensearch/search/internal/DefaultSearchContext.java

Leaking these (scroll contexts never cleared) is a real production incident: open contexts pin segments and prevent merge reclamation. Watch open_contexts in _nodes/stats/indices/search.

Partial failures and `_shards`

A search counts as "successful" even if some shards fail (with the default allow_partial_search_results=true): the response's _shards block reports total, successful, skipped, and failed. The coordinator (AbstractSearchAsyncAction) tolerates per-shard failures up to the point that a phase can still produce a result; failed shards' contributions are simply absent. This is why you can get fewer results than expected without an HTTP error.

grep -n "successfulShards\|skippedShards\|shardFailures\|onShardFailure" \
  server/src/main/java/org/opensearch/action/search/AbstractSearchAsyncAction.java

Note: "Why is my count off?" is frequently a partial shard failure hiding in _shards.failed, not a query bug. Always read _shards before blaming the query.
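A client-side guard makes this check routine. A minimal sketch, assuming you have already parsed the four counters out of the response's `_shards` block; the class is hypothetical and not part of any OpenSearch client.

```java
// Hypothetical value class mirroring the _shards block of a search response.
final class ShardsHeaderSketch {
    final int total;
    final int successful;
    final int skipped;
    final int failed;

    ShardsHeaderSketch(int total, int successful, int skipped, int failed) {
        this.total = total;
        this.successful = successful;
        this.skipped = skipped;
        this.failed = failed;
    }

    /** True when at least one shard did not contribute, so hits and counts may be incomplete. */
    boolean isPartial() {
        return failed > 0 || successful < total;
    }

    public static void main(String[] args) {
        System.out.println(new ShardsHeaderSketch(5, 4, 0, 1).isPartial());
        System.out.println(new ShardsHeaderSketch(5, 5, 2, 0).isPartial());
    }
}
```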

Reading exercise

# 1. Coordinating-node entry and how it picks the async action (qtf vs dfs)
grep -n "executeSearch\|searchAsyncAction\|SearchQueryThenFetch\|SearchDfsQuery" \
  server/src/main/java/org/opensearch/action/search/TransportSearchAction.java

# 2. The reduce
grep -n "reducedQueryPhase\|reduce\|merge" \
  server/src/main/java/org/opensearch/action/search/SearchPhaseController.java

# 3. Per-shard phases
grep -n "execute" server/src/main/java/org/opensearch/search/query/QueryPhase.java
grep -n "execute\|hitsExecute" server/src/main/java/org/opensearch/search/fetch/FetchPhase.java

# 4. can_match
grep -n "canMatch\|MinAndMax\|FieldSortBuilder" \
  server/src/main/java/org/opensearch/search/SearchService.java

Answer:

In the query phase result (QuerySearchResult), is _source present? Where is _source actually loaded, and on which round trip?
After reducedQueryPhase, how does the coordinator know which shard to send each fetch request to? What data structure carries that mapping?
Trace how TransportSearchAction decides between query_then_fetch and dfs_query_then_fetch. What does DFS buy you, and what does it cost?
A user paginates with from=50000&size=20. Quantify (in words) the work each shard does and why search_after over a PIT is strictly cheaper.
On a 200-shard time-series index, a query filters @timestamp >= now-1h. Which class skips the irrelevant shards, what segment metadata does it use, and what setting gates whether it runs at all?
Find where a per-shard timeout or failure is recorded and explain how the final _shards.failed count is produced.

Common bugs and symptoms

Symptom	Likely cause	Where to look
`Result window is too large` error	`from + size > index.max_result_window`	switch to `search_after`/PIT; `IndexSettings.MAX_RESULT_WINDOW_SETTING`
Inconsistent relevance ranking with rare terms across many shards	per-shard term stats; not using DFS	`?search_type=dfs_query_then_fetch`, `DfsPhase`
Search slower than expected on time-series cluster	`can_match` pre-filter not engaging (shards below `pre_filter_shard_size`, or non-range query)	`CanMatchPreFilterSearchPhase`, `pre_filter_shard_size`
Node heap creeping, disk space from merged-away segments not reclaimed	leaked scroll/PIT contexts pinning readers	`_nodes/stats` `open_contexts`; ensure `DELETE _search/scroll` / `DELETE _search/point_in_time`
`search_after` returns duplicates/gaps	non-deterministic sort (no tiebreaker)	add `_id`/`_shard_doc` to sort
Fewer hits than expected, no error	partial shard failure	response `_shards.failed`, `AbstractSearchAsyncAction.onShardFailure`
`took` huge but each shard fast	coordinator reduce dominated by deep paging / many aggs	`SearchPhaseController` reduce; reduce `size`/agg cardinality

Validation: prove you understand this

Draw the full query-then-fetch sequence for a 3-shard index returning the top 10 hits. Label exactly what data crosses on each of the (up to) four network legs and which carries _source.
Explain why query-then-fetch sends two rounds instead of one, in terms of bytes-on-the-wire and memory on the coordinator.
Given a relevance complaint on a 50-shard index with a rare search term, state the one-flag change you'd make and the new phase that runs, with its source file.
Contrast from/size, search_after, scroll, and PIT on three axes: statefulness, deep-page cost, and consistency. Name the class that holds the long-lived state for scroll/PIT.
Locate SearchPhaseController.reducedQueryPhase and describe, in two sentences, the merge it performs over per-shard top-docs and the agg reduce it triggers.
A user reports a search "missing" documents intermittently. Outline the 3-step diagnosis starting from _shards, then routing, then shard health.

Query DSL and QueryBuilders

The JSON you put in a _search request body is not executed directly. It is parsed into a tree of QueryBuilder objects, optionally rewritten into a simpler equivalent tree, and finally converted to a Lucene Query that actually runs against the index. Three distinct representations, three distinct classes of bug. This chapter walks that pipeline end to end and shows where a SearchPlugin injects a custom query type.

It sits under Search Execution (which calls into the query phase) and above the Lucene scoring machinery. It pairs with Mapping and Analysis (field types decide what a query can do) and DocValues and Fielddata (what sorting and some queries read).

After this chapter you can:

Trace a match query from JSON to a Lucene BooleanQuery of TermQuerys.
Explain the difference between parse-time, rewrite-time, and toQuery-time, and which context each runs in.
Register a custom QueryBuilder through SearchPlugin.getQueries().
Use _validate/query?explain and _explain to debug parsing and scoring.

The class hierarchy

Role	Class	Where
Base for every query builder	`AbstractQueryBuilder<T>` implements `QueryBuilder`	`server/src/main/java/org/opensearch/index/query/AbstractQueryBuilder.java`
The interface	`QueryBuilder` (extends `NamedWriteable`, `ToXContentObject`, `Rewriteable<QueryBuilder>`)	`server/src/main/java/org/opensearch/index/query/QueryBuilder.java`
Leaf queries	`TermQueryBuilder`, `MatchQueryBuilder`, `RangeQueryBuilder`, `MatchAllQueryBuilder`, `ExistsQueryBuilder`, `PrefixQueryBuilder`, `WildcardQueryBuilder`, `FuzzyQueryBuilder`	`server/src/main/java/org/opensearch/index/query/`
Compound queries	`BoolQueryBuilder`, `DisMaxQueryBuilder`, `ConstantScoreQueryBuilder`, `FunctionScoreQueryBuilder`, `NestedQueryBuilder`	same dir
Parse + rewrite context	`QueryRewriteContext` (and subclass `QueryShardContext`)	`server/src/main/java/org/opensearch/index/query/`

ls server/src/main/java/org/opensearch/index/query/ | grep QueryBuilder | head -40
grep -n "interface QueryBuilder\|toQuery\|doRewrite\|getWriteableName" \
  server/src/main/java/org/opensearch/index/query/QueryBuilder.java

Stage 1 — Parse JSON into a QueryBuilder tree

The query JSON object has exactly one key naming the query type (match, bool, term, …). AbstractQueryBuilder.parseInnerQueryBuilder reads that key and dispatches through the NamedXContentRegistry to the registered parser for that name (each builder exposes a static fromXContent(XContentParser)).

flowchart LR
    J["{ bool: { must: [ { match: { title: hi } } ] } }"] --> P[parseInnerQueryBuilder reads first key 'bool']
    P --> R[NamedXContentRegistry lookup: bool -> BoolQueryBuilder.fromXContent]
    R --> B[BoolQueryBuilder]
    B --> P2[recurse into 'match' -> MatchQueryBuilder]
    P2 --> T[QueryBuilder tree built, NOT yet a Lucene Query]

grep -n "parseInnerQueryBuilder\|fromXContent\|NamedXContentRegistry" \
  server/src/main/java/org/opensearch/index/query/AbstractQueryBuilder.java
grep -n "public static.*fromXContent" \
  server/src/main/java/org/opensearch/index/query/BoolQueryBuilder.java

Parsing failures (unknown query name, malformed body) surface here as ParsingException / XContentParseException — before any index is touched. This is why a typo'd query type fails fast with a clear message.

The registry that maps "bool" → BoolQueryBuilder is built in SearchModule from every core query plus everything SearchPlugin.getQueries() contributes.

grep -n "registerQuery\|QuerySpec\|getQueries" \
  server/src/main/java/org/opensearch/search/SearchModule.java
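The dispatch itself is a name-to-parser lookup. The toy registry below illustrates only that idea: it is not `NamedXContentRegistry`, the parsers take and return plain strings instead of an `XContentParser` and a `QueryBuilder`, and every name in it is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical toy registry: query name -> parser. Stands in for the real named-object lookup.
final class QueryRegistrySketch {
    private final Map<String, Function<String, String>> parsers = new HashMap<>();

    /** Registers a parser under a query name; duplicate names are rejected. */
    void register(String name, Function<String, String> parser) {
        if (parsers.putIfAbsent(name, parser) != null) {
            throw new IllegalArgumentException("query [" + name + "] already registered");
        }
    }

    /** Dispatches on the single key of the query object, failing fast on unknown names. */
    String parse(String name, String body) {
        Function<String, String> parser = parsers.get(name);
        if (parser == null) {
            throw new IllegalArgumentException("unknown query [" + name + "]");
        }
        return parser.apply(body);
    }

    public static void main(String[] args) {
        QueryRegistrySketch registry = new QueryRegistrySketch();
        registry.register("bool", body -> "BoolQueryBuilder(" + body + ")");
        System.out.println(registry.parse("bool", "{...}"));
    }
}
```

The failure path is the useful part: an unregistered name is rejected before any shard is contacted, which matches the fail-fast behavior described above.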

Stage 2 — Rewrite

Before conversion, the tree is rewritten (Rewriteable.rewrite → QueryBuilder.rewrite(QueryRewriteContext) → each builder's doRewrite). Rewrite simplifies and resolves:

MatchNoneQueryBuilder shortcuts (e.g., a range that can't match anything on this shard rewrites to match-none — feeds the can_match optimization in Search Execution).
Asynchronous resolution: terms lookup, wrapper, percolator, and some other queries fetch or resolve data during rewrite (QueryRewriteContext can do async fetches), so rewrite can be a multi-round process.
Constant folding: bool with a single clause may collapse.

grep -n "doRewrite\|rewrite(\|MatchNoneQueryBuilder" \
  server/src/main/java/org/opensearch/index/query/AbstractQueryBuilder.java \
  server/src/main/java/org/opensearch/index/query/RangeQueryBuilder.java

Note: Rewrite runs on the coordinating node (with a QueryRewriteContext) and per shard (with a QueryShardContext, which is a QueryRewriteContext subclass that also has mapper/searcher access). A query may rewrite differently per shard — that is by design.
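Rewrite is a fixpoint loop: keep rewriting until a builder returns itself. A toy sketch of that loop, assuming a simplified `Node` interface in place of `QueryBuilder` and a single rewrite rule (a one-clause bool collapses to its clause); the types and the round cap of 16 are illustrative assumptions, not the OpenSearch implementation.

```java
import java.util.List;

// Hypothetical toy query tree; illustrates "rewrite until nothing changes".
final class RewriteLoopSketch {
    interface Node {
        /** Returns a simpler equivalent node, or this if nothing can be simplified. */
        Node rewrite();
    }

    static final class Term implements Node {
        final String field;
        final String value;

        Term(String field, String value) {
            this.field = field;
            this.value = value;
        }

        @Override
        public Node rewrite() {
            return this; // leaf: already as simple as it gets
        }
    }

    static final class Bool implements Node {
        final List<Node> must;

        Bool(List<Node> must) {
            this.must = must;
        }

        @Override
        public Node rewrite() {
            // Toy rule: a bool with exactly one must clause collapses to that clause.
            return must.size() == 1 ? must.get(0) : this;
        }
    }

    static final int MAX_ROUNDS = 16; // assumed safety cap against rewrite cycles

    static Node rewriteToFixpoint(Node original) {
        Node current = original;
        for (int round = 0; round < MAX_ROUNDS; round++) {
            Node next = current.rewrite();
            if (next == current) {
                return current; // fixpoint reached: identity signals "no change"
            }
            current = next;
        }
        throw new IllegalStateException("too many rewrite rounds");
    }

    public static void main(String[] args) {
        Node term = new Term("title", "hi");
        Node nested = new Bool(List.of(new Bool(List.of(term))));
        System.out.println(rewriteToFixpoint(nested) == term);
    }
}
```

The identity comparison matters: a builder signals "nothing changed" by returning the same instance, not an equal copy.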

Stage 3 — toQuery: become a Lucene Query

The final step is AbstractQueryBuilder.toQuery(QueryShardContext) → doToQuery(...), which produces an actual Lucene org.apache.lucene.search.Query. This is where field types from the mapping (Mapping and Analysis) matter: MatchQueryBuilder consults the field's analyzer to tokenize the input, then emits one TermQuery per token wrapped in a BooleanQuery. A term query does not analyze — it matches the exact token, which is why term on an analyzed text field is a classic mistake.

DSL	QueryBuilder	Lucene `Query` produced	Analyzed?
`match`	`MatchQueryBuilder`	`BooleanQuery` of `TermQuery` (or `PhraseQuery`/`SynonymQuery`)	yes (field analyzer)
`term`	`TermQueryBuilder`	`TermQuery` (exact)	no
`range`	`RangeQueryBuilder`	`PointRangeQuery` / `TermRangeQuery` (type-dependent)	n/a
`bool`	`BoolQueryBuilder`	`BooleanQuery` (must/should/filter/must_not clauses)	per child
`prefix`	`PrefixQueryBuilder`	`PrefixQuery` (multi-term)	no

grep -n "doToQuery\|MappedFieldType\|getMapperService\|analyzer" \
  server/src/main/java/org/opensearch/index/query/MatchQueryBuilder.java \
  server/src/main/java/org/opensearch/index/query/TermQueryBuilder.java
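The match/term difference comes down to whether the input passes through an analyzer. The sketch below fakes the analyzer with lowercase-and-split-on-whitespace, a loose stand-in for the standard analyzer rather than the real thing, and renders queries as `field:token` strings; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: why match finds "Open Source" on a text field and term does not.
final class MatchVsTermSketch {
    /** Stand-in analyzer: lowercase, then split on whitespace. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** match: analyze the input, emit one term clause per token. */
    static String matchQueryString(String field, String text) {
        List<String> clauses = new ArrayList<>();
        for (String token : analyze(text)) {
            clauses.add(field + ":" + token);
        }
        return String.join(" ", clauses);
    }

    /** term: no analysis, the input is looked up as one exact token. */
    static String termQueryString(String field, String value) {
        return field + ":" + value;
    }

    /** A term clause matches only if the exact token exists in the indexed token set. */
    static boolean termMatches(Set<String> indexedTokens, String value) {
        return indexedTokens.contains(value);
    }

    public static void main(String[] args) {
        System.out.println(matchQueryString("title", "Open Source"));
        System.out.println(termQueryString("title", "Open Source"));
        System.out.println(termMatches(Set.copyOf(analyze("Open Source")), "Open Source"));
    }
}
```

The indexed tokens are `open` and `source`; the un-analyzed term `Open Source` equals neither, so the term query finds nothing.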

The full pipeline:

flowchart TD
    A[JSON DSL] -->|parseInnerQueryBuilder + NamedXContentRegistry| B[QueryBuilder tree]
    B -->|rewrite QueryRewriteContext| C[Simplified QueryBuilder tree]
    C -->|toQuery QueryShardContext| D[Lucene Query]
    D -->|IndexSearcher.search with Similarity| E[scored hits in QueryPhase]

Scoring: BM25 and Similarity

Once you have a Lucene Query, Lucene's IndexSearcher runs it against each segment and scores matches using a Similarity. The default is BM25 (BM25Similarity), driven by term frequency (tf), inverse document frequency (idf), and field-length normalization. filter/must_not clauses contribute nothing to the score (they affect matching, not ranking), and constant_score gives every match the same fixed score (its boost) instead of a relevance score.

Per-field similarity can be customized via index.similarity.* and the field's similarity mapping. Global term statistics for idf can be made cluster-consistent via the DFS phase (see Search Execution).

grep -rn "BM25Similarity\|SimilarityProvider\|index.similarity" \
  server/src/main/java/org/opensearch/index/similarity/

Note: filter context (in bool.filter, constant_score) is both faster and cacheable because it skips scoring entirely and Lucene can cache the DocIdSet. Move non-relevance predicates (timestamps, status flags) into filter, not must.

Registering a custom query in a SearchPlugin

A plugin adds a query type by implementing SearchPlugin.getQueries() and returning QuerySpecs, each binding three things: the query name, a Writeable.Reader (stream deserialization; see Serialization and BWC), and a fromXContent parser.

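// Hypothetical plugin class; MyQueryBuilder is your own AbstractQueryBuilder subclass.
import java.util.List;
import org.opensearch.plugins.Plugin;
import org.opensearch.plugins.SearchPlugin;

public class MyPlugin extends Plugin implements SearchPlugin {
    @Override
    public List<QuerySpec<?>> getQueries() {
        return List.of(new QuerySpec<>(
            MyQueryBuilder.NAME,                 // "my_query"
            MyQueryBuilder::new,                 // StreamInput ctor (wire)
            MyQueryBuilder::fromXContent));      // JSON parser
    }
}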

SearchModule folds these into both the NamedXContentRegistry (for JSON parsing) and the NamedWriteableRegistry (for cross-node transport). Your MyQueryBuilder, which extends AbstractQueryBuilder<MyQueryBuilder>, must implement doWriteTo, doXContent, doRewrite, doToQuery, doEquals, doHashCode, and getWriteableName.

grep -n "interface SearchPlugin\|QuerySpec\|getQueries" \
  server/src/main/java/org/opensearch/plugins/SearchPlugin.java

See Plugin Architecture for how SearchPlugin is discovered and wired, and the Level-7 plugin lab for a full custom-query build.

Debugging tools: _validate/query and _explain

# Did my query parse, and what Lucene query does it become?
curl -s 'localhost:9200/idx/_validate/query?explain=true&pretty' \
  -H 'content-type: application/json' \
  -d '{"query":{"match":{"title":"open source"}}}'

# Why did (or didn't) THIS doc match, and how was it scored?
curl -s 'localhost:9200/idx/_explain/DOC_ID?pretty' \
  -H 'content-type: application/json' \
  -d '{"query":{"match":{"title":"open source"}}}'

_validate/query?explain returns the Lucene string form of the rewritten query per index — the fastest way to confirm that match analyzed your text the way you expected (e.g., you'll see title:open title:source, proving the analyzer split and lowercased it). _explain returns the BM25 score breakdown for one document: tf, idf, boost, field-length norm.

Warning: The Lucene string in _validate/query reflects the query after analysis and rewrite. If you see title:Open (capital O) you have a keyword field or a non-lowercasing analyzer — a mapping problem, not a query problem.

Reading exercise

# 1. The parse dispatch
grep -n "parseInnerQueryBuilder\|fromXContent" \
  server/src/main/java/org/opensearch/index/query/AbstractQueryBuilder.java

# 2. A leaf builder end to end
sed -n '1,60p' server/src/main/java/org/opensearch/index/query/TermQueryBuilder.java
grep -n "doToQuery\|doRewrite\|doWriteTo\|fromXContent\|NAME" \
  server/src/main/java/org/opensearch/index/query/MatchQueryBuilder.java

# 3. Compound query
grep -n "must\|should\|filter\|mustNot\|doToQuery" \
  server/src/main/java/org/opensearch/index/query/BoolQueryBuilder.java

# 4. Registry wiring
grep -n "registerQuery\|QuerySpec\|getQueries\|namedWriteables" \
  server/src/main/java/org/opensearch/search/SearchModule.java

Answer:

Where, exactly, does the registry decide that the JSON key "bool" maps to BoolQueryBuilder? Trace from parseInnerQueryBuilder to the registry lookup.
Name three things that can happen during rewrite that change the query tree, and give one query type for each.
Run _validate/query?explain for {"term":{"title":"Open Source"}} against a text field. Predict the Lucene query string and explain why it likely matches nothing.
In toQuery, where does MatchQueryBuilder obtain the analyzer, and what Lucene query does a two-token input produce?
List the six do* methods a custom AbstractQueryBuilder subclass must implement, and say which two are about wire/JSON serialization vs query semantics.
Why is a predicate in bool.filter cheaper than the same predicate in bool.must? What can Lucene cache for the filter case?

Common bugs and symptoms

Symptom	Likely cause	Where to look
`unknown query [my_query]`	query name not registered (plugin not installed, or `getQueries()` typo)	`SearchPlugin.getQueries`, `SearchModule.registerQuery`
`term` query on `text` field never matches	`term` is exact/un-analyzed; the indexed tokens were analyzed/lowercased	`_validate/query?explain`; query a `keyword` field instead
Custom query works in one node, fails across cluster	missing/mismatched `Writeable.Reader` or wire BWC break	Serialization and BWC, `QuerySpec` ctor
Scores look "wrong" / inconsistent across shards	per-shard idf; not using DFS	Search Execution, `dfs_query_then_fetch`
Slow query that should be fast	yes/no predicates placed in `must` instead of `filter` (scored, and not cacheable as filters)	move to `bool.filter`/`constant_score`
`range` on a date returns nothing	date format mismatch resolved at rewrite/parse	`RangeQueryBuilder`, field `format` in mapping
Query parses but `_explain` shows 0.0 score for a matching doc	clause in `filter`/`must_not` context (no score by design)	expected; check clause placement

Validation: prove you understand this

From memory, list the three pipeline stages (parse, rewrite, toQuery), the context object active in each, and the kind of exception each stage throws on failure.
Given {"match":{"title":"Open Source"}} against a standard-analyzed text field, write the Lucene query you expect _validate/query?explain to return and justify each token.
Explain why {"term":{"title":"Open Source"}} on that same field almost certainly returns no hits, and rewrite it correctly two different ways.
Sketch the QuerySpec you'd pass from SearchPlugin.getQueries() for a new my_query, naming the three arguments and what each is for.
Explain the role of NamedXContentRegistry vs NamedWriteableRegistry for a query builder — which is used for JSON, which for transport, and why a custom query needs both.
Take a slow relevance query of the form bool.must:[range, match] and rewrite it for performance. Justify the change in terms of scoring and query caching.

Aggregations

Aggregations are OpenSearch's analytics engine: terms, date_histogram, avg, cardinality, nested sub-aggregations — the things that power dashboards. Like search, they run per shard and then reduce on the coordinating node. But unlike top-K search, aggregations can be approximate by design, and understanding why (and where the error comes from) is essential to using them correctly and to contributing a new aggregation.

This chapter follows the aggregation framework from AggregatorFactory through per-shard collection into InternalAggregation.reduce on the coordinator. It extends Search Execution (aggs ride the query phase) and depends heavily on DocValues and Fielddata (aggs read columnar doc values, and global ordinals make terms fast).

After this chapter you can:

Explain the factory → aggregator → internal-agg → reduce lifecycle.
Distinguish bucket aggregators from metric aggregators and what each emits.
Reason precisely about terms approximation error and the size / shard_size trade-off.
Register a custom aggregation via SearchPlugin.getAggregations().

The four-stage lifecycle

Stage	Class	Runs on	Output
1. Build	`AggregationBuilder` (e.g. `TermsAggregationBuilder`) parsed from JSON	coordinator (parse)	the request spec
2. Create	`AggregatorFactory` / `AggregatorFactories` → `Aggregator`	each shard	per-segment collectors
3. Collect	`Aggregator` (a `BucketCollector` / `LeafBucketCollector`)	each shard	partial `InternalAggregation`
4. Reduce	`InternalAggregation.reduce(List, ReduceContext)`	coordinator	final `InternalAggregation`

ls server/src/main/java/org/opensearch/search/aggregations/
grep -n "class AggregatorFactories\|class AggregatorFactory\|abstract class Aggregator" \
  server/src/main/java/org/opensearch/search/aggregations/AggregatorFactories.java \
  server/src/main/java/org/opensearch/search/aggregations/AggregatorFactory.java \
  server/src/main/java/org/opensearch/search/aggregations/Aggregator.java

Bucket vs metric aggregators

Kind	Examples	What it does	Internal type
Bucket	`terms`, `date_histogram`, `histogram`, `range`, `filters`, `nested`	partition documents into buckets; can hold sub-aggregations	`InternalTerms`, `InternalDateHistogram`, … (multi-bucket results implement `MultiBucketsAggregation`; `nested` yields a single bucket)
Metric	`avg`, `sum`, `min`, `max`, `stats`, `cardinality`, `percentiles`	compute a single number (or a few) over the docs in scope	`InternalAvg`, `InternalCardinality`, …
Pipeline	`derivative`, `bucket_script`, `moving_fn`	operate on the output of other aggs at reduce time	`InternalSimpleValue`, etc.

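A bucket aggregator owns child AggregatorFactories; metric aggregators are leaves. This nesting is what makes terms → avg ("average price per category") work: the terms aggregator assigns each bucket an ordinal and passes that ordinal along with every document it routes to the avg sub-aggregator, which keeps its per-bucket state (sum and count) indexed by bucket ordinal. The effect is one logical avg per bucket without allocating a separate aggregator object for each.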
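The routing can be pictured with a toy model in plain Java. This is a sketch under stated assumptions, not OpenSearch code: `BucketOrdinalSketch`, `AvgState`, and `averagePerCategory` are hypothetical names, and simple lists stand in for the paged arrays a real aggregator would use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of terms -> avg: one sub-aggregator, per-bucket state indexed by bucket ordinal. */
final class BucketOrdinalSketch {

    /** Stand-in for an avg sub-aggregator: parallel sums/counts, one slot per bucket ordinal. */
    static final class AvgState {
        private final List<Double> sums = new ArrayList<>();
        private final List<Long> counts = new ArrayList<>();

        void collect(double value, int bucketOrd) {
            while (sums.size() <= bucketOrd) { // grow lazily as new buckets appear
                sums.add(0.0);
                counts.add(0L);
            }
            sums.set(bucketOrd, sums.get(bucketOrd) + value);
            counts.set(bucketOrd, counts.get(bucketOrd) + 1);
        }

        double avg(int bucketOrd) {
            return sums.get(bucketOrd) / counts.get(bucketOrd);
        }
    }

    /** Stand-in for the parent terms aggregator: assigns ordinals and routes each doc. */
    static Map<String, Double> averagePerCategory(String[] categories, double[] prices) {
        Map<String, Integer> bucketOrds = new LinkedHashMap<>();
        AvgState avg = new AvgState();
        for (int doc = 0; doc < categories.length; doc++) {
            Integer bucketOrd = bucketOrds.get(categories[doc]);
            if (bucketOrd == null) { // first doc for this term: allocate the next ordinal
                bucketOrd = bucketOrds.size();
                bucketOrds.put(categories[doc], bucketOrd);
            }
            avg.collect(prices[doc], bucketOrd);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        bucketOrds.forEach((term, ord) -> result.put(term, avg.avg(ord)));
        return result;
    }

    public static void main(String[] args) {
        String[] categories = {"books", "toys", "books", "toys", "books"};
        double[] prices = {10.0, 4.0, 20.0, 6.0, 30.0};
        System.out.println(averagePerCategory(categories, prices));
    }
}
```

The parent never hands the child a term string; the child only ever sees `(value, bucketOrd)`, which is what lets the same mechanism nest to arbitrary depth.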

find server/src/main/java/org/opensearch/search/aggregations/bucket -name "TermsAggregator*.java"
find server/src/main/java/org/opensearch/search/aggregations/metrics -name "AvgAggregator*.java"

Collection: CollectorManager and BucketCollector

Aggregation collection plugs into Lucene's Collector machinery. Each Aggregator exposes a LeafBucketCollector (via getLeafCollector(LeafReaderContext)). Lucene calls its collect(int doc) once per matching doc; the top-level collector forwards that as collect(doc, 0), and the two-argument form collect(int doc, long bucketOrd) carries the owning bucket ordinal. Bucket aggregators allocate bucket ordinals and recursively drive their sub-aggregators with the child bucket ordinal.

For concurrent (segment-parallel) search, aggregators are created through a CollectorManager/AggregationCollectorManager so that per-slice partials can be merged. The values themselves come from doc values — terms over a keyword field reads SortedSetDocValues and uses global ordinals to bucket by ordinal rather than by string, which is dramatically faster (see DocValues and Fielddata).

grep -rn "LeafBucketCollector\|getLeafCollector\|collect(int\|bucketOrd" \
  server/src/main/java/org/opensearch/search/aggregations/
grep -rn "CollectorManager\|AggregationCollectorManager\|globalOrdinals\|SortedSetDocValues" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/

Reduce: where shards become one answer

Each shard returns a partial InternalAggregation. The coordinator collects them and calls InternalAggregation.reduce(List<InternalAggregation>, ReduceContext). For avg: sum the sums, sum the counts, divide once. For terms: merge bucket lists by key, summing doc counts and recursively reducing sub-aggregations, then trim to the requested size. For cardinality: merge HyperLogLog++ sketches.

flowchart TD
    subgraph Shard A
      A1[QueryPhase scan] --> A2[TermsAggregator collect by global ordinal]
      A2 --> A3[InternalTerms partial: top shard_size buckets]
    end
    subgraph Shard B
      B1[QueryPhase scan] --> B2[TermsAggregator collect]
      B2 --> B3[InternalTerms partial]
    end
    A3 --> R[Coordinator: InternalTerms.reduce]
    B3 --> R
    R --> F[Merge buckets by key, sum counts, reduce sub-aggs, trim to size]
    F --> O[Final aggregations in SearchResponse]

grep -rn "public InternalAggregation reduce\|doReduce\|ReduceContext" \
  server/src/main/java/org/opensearch/search/aggregations/InternalAggregation.java
grep -rn "reduce" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java

Note: Reduce can happen in stages. For many shards the coordinator does a partial reduce in batches (batched_reduce_size) to bound memory, then a final reduce. A pipeline aggregation's logic runs in the final reduce.
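A minimal sketch of the staged reduce described in the note, assuming terms-style partials modeled as plain `Map<String, Long>` doc counts. The class and method names are hypothetical, `batchSize` plays the role of `batched_reduce_size`, and the final trim to `size` is left out to keep the focus on batching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy coordinator: buffer shard partials and partially reduce whenever the buffer fills. */
final class BatchedReduceSketch {

    /** Merge buckets by key, summing doc counts. */
    static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((term, docCount) -> merged.merge(term, docCount, Long::sum));
        }
        return merged;
    }

    /** Never holds more than batchSize partials at once (assumes batchSize >= 2). */
    static Map<String, Long> batchedReduce(List<Map<String, Long>> shardPartials, int batchSize) {
        List<Map<String, Long>> buffer = new ArrayList<>();
        for (Map<String, Long> partial : shardPartials) {
            buffer.add(partial);
            if (buffer.size() == batchSize) {
                Map<String, Long> reduced = mergeCounts(buffer); // partial reduce
                buffer.clear();
                buffer.add(reduced); // the reduced partial stays in the buffer
            }
        }
        return mergeCounts(buffer); // final reduce
    }

    public static void main(String[] args) {
        List<Map<String, Long>> shards = List.of(
            Map.of("a", 5L, "b", 2L),
            Map.of("b", 4L, "c", 1L),
            Map.of("a", 1L),
            Map.of("c", 7L, "d", 3L),
            Map.of("b", 1L));
        System.out.println(batchedReduce(shards, 2));
        System.out.println(mergeCounts(shards));
    }
}
```

Both calls in `main` produce the same map, which is the property a real `reduce` has to preserve when the coordinator decides where the batch boundaries fall.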

The `terms` approximation: why counts can be wrong

This is the single most-misunderstood aggregation behavior, and a frequent "bug report" that is actually correct-by-design.

Each shard returns only its top shard_size terms (not all terms). The coordinator merges these. A term that belongs in the global top size but falls just outside the top shard_size on some shards is undercounted, and one that falls outside on every shard is missed entirely, because no shard ever reported it. OpenSearch surfaces this honestly:

doc_count_error_upper_bound — the maximum a returned bucket's count could be off by.
sum_other_doc_count — documents in buckets that didn't make the final list.

The mitigation is shard_size (defaults to roughly size * 1.5 + 10): each shard returns more candidates than the final size, shrinking the chance of a miss, at the cost of more memory/transfer.
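The effect is easy to reproduce in a small simulation. The sketch below is an illustration under assumptions, not the real aggregator: shards are plain maps of term to doc count, `termsAgg` and `topN` are hypothetical helpers, and `defaultShardSize` simply encodes the `size * 1.5 + 10` formula quoted above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy terms agg: each shard reports only its top shardSize terms, the coordinator merges them. */
final class TermsApproximationSketch {

    /** The default quoted in the text; the cast truncates any fractional part. */
    static int defaultShardSize(int size) {
        return (int) (size * 1.5 + 10);
    }

    /** Top n entries by count descending, ties broken by term ascending. */
    static Map<String, Long> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> ordered = new ArrayList<>(counts.entrySet());
        ordered.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : ordered.subList(0, Math.min(n, ordered.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    static Map<String, Long> termsAgg(List<Map<String, Long>> shards, int size, int shardSize) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shards) {
            // Shard-side truncation: anything below the top shardSize never leaves the shard.
            topN(shard, shardSize).forEach((term, count) -> merged.merge(term, count, Long::sum));
        }
        return topN(merged, size);
    }

    public static void main(String[] args) {
        List<Map<String, Long>> shards = List.of(
            Map.of("a", 10L, "c", 9L, "b", 1L),
            Map.of("b", 10L, "c", 9L, "a", 1L),
            Map.of("d", 10L, "c", 9L));
        // "c" has the highest true count (27) but is second on every shard.
        System.out.println(termsAgg(shards, 1, 1)); // "c" is never reported by any shard
        System.out.println(termsAgg(shards, 1, 2)); // "c" surfaces once shards report two candidates
        System.out.println(defaultShardSize(10));
    }
}
```

With `shardSize = 2` the winner is found, yet `a` would still be reported as 10 rather than its true 11 if it were returned, which is the kind of gap `doc_count_error_upper_bound` is meant to bound.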

grep -rn "shard_size\|shardSize\|doc_count_error\|sum_other_doc_count\|showTermDocCountError" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/

Knob	Effect on accuracy	Cost
`size`	how many final buckets you want	bigger result
`shard_size`	candidates each shard reports	memory + network per shard
`order` by sub-agg metric	increases error (metric order is harder to bound)	accuracy warning

Warning: Ordering a terms agg by a sub-aggregation metric (e.g., "top 5 categories by avg price") is inherently more approximate than ordering by doc_count, because the shard can't know which terms will win the metric ranking globally. Raise shard_size substantially or use composite/cardinality strategies for exactness-sensitive cases.

Sub-aggregations and the matrix-stats module

Sub-aggregations are just nested AggregatorFactories: a date_histogram with a child avg produces, per time bucket, the average of a field. The parent routes each doc to a bucket ordinal and calls the child's collect with that ordinal.

Not every aggregation lives in server/. The aggs-matrix-stats module (under modules/) is the canonical example of an aggregation shipped as a bundled module via the plugin SPI, not baked into core. It computes covariance/correlation across multiple fields (matrix_stats). Read it as the reference implementation for "how to add an aggregation through SearchPlugin."

find modules/aggs-matrix-stats -name "*.java" | grep -i "agg\|plugin" | head
grep -rn "getAggregations\|AggregationSpec\|MatrixStatsAggregationBuilder" \
  modules/aggs-matrix-stats/src/main/java/

Registering a custom aggregation

A plugin implements SearchPlugin.getAggregations() and returns AggregationSpecs. Each binds the agg name, the builder's stream reader and fromXContent parser, and registers the InternalAggregation's NamedWriteable reader so partial results can cross the wire and be reduced.

@Override
public List<AggregationSpec> getAggregations() {
    return List.of(
        new AggregationSpec(
            MyAggBuilder.NAME,            // "my_agg"
            MyAggBuilder::new,            // StreamInput ctor (wire)
            MyAggBuilder::fromXContent)   // JSON parser
        .addResultReader(InternalMyAgg::new)); // reduce-side wire reader
}

You implement: MyAggBuilder extends AbstractAggregationBuilder, a MyAggregatorFactory, the MyAggregator (its getLeafCollector/collect), and InternalMyAgg extends InternalAggregation with a correct reduce(...). The reduce implementation is the part reviewers scrutinize most — it must be associative and commutative because shard order is non-deterministic and partial reduces batch arbitrarily.
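To see why reviewers insist on this property, compare a mergeable partial with a tempting but wrong one. The sketch below is illustrative only: `Partial`, `reduce`, and `meanOfMeans` are hypothetical names rather than OpenSearch classes, and the inputs are small integers so the floating-point sums are exact.

```java
import java.util.List;

/** Contrast a reduce that tolerates any grouping/order with one that does not. */
final class ReducePropertySketch {

    /** Mergeable partial for an avg-style metric: carry sum and count, never the mean. */
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        double value() {
            return sum / count;
        }
    }

    /** Fold with merge, starting from the identity element (0, 0). */
    static Partial reduce(List<Partial> partials) {
        Partial acc = new Partial(0.0, 0L);
        for (Partial p : partials) {
            acc = acc.merge(p);
        }
        return acc;
    }

    /** Deliberately wrong: the result depends on how documents were split and batched. */
    static double meanOfMeans(List<Partial> partials) {
        double total = 0.0;
        for (Partial p : partials) {
            total += p.value();
        }
        return total / partials.size();
    }

    public static void main(String[] args) {
        Partial a = new Partial(10.0, 1);
        Partial b = new Partial(20.0, 4);
        Partial c = new Partial(30.0, 5);
        System.out.println(reduce(List.of(a, b, c)).value());
        System.out.println(reduce(List.of(reduce(List.of(c, a)), b)).value()); // regrouped, reordered
        System.out.println(meanOfMeans(List.of(a, b, c)));
        System.out.println(meanOfMeans(List.of(reduce(List.of(a, b)), c))); // same data, other batching
    }
}
```

The first two lines print the same value regardless of grouping; the last two disagree with each other, which is exactly the "works on 1 shard, wrong on many" failure mode.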

grep -n "getAggregations\|AggregationSpec\|addResultReader" \
  server/src/main/java/org/opensearch/plugins/SearchPlugin.java

See Plugin Architecture for wiring and the Level-7 plugin lab for a build-it walkthrough.

Reading exercise

# 1. Factory -> aggregator creation
grep -n "createInternal\|doCreateInternal\|class TermsAggregatorFactory" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregatorFactory.java

# 2. Collection
grep -n "getLeafCollector\|collect\|collectBucket\|collectExistingBucket" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/BucketsAggregator.java

# 3. Reduce + error accounting
grep -n "reduce\|doc_count_error\|sum_other\|shardSize" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java

# 4. The bundled module pattern
grep -rn "getAggregations" modules/aggs-matrix-stats/src/main/java/

Answer:

When a terms agg has an avg sub-aggregation, how many avg aggregator instances effectively exist, and how does a document get routed to the right one? (Hint: bucket ordinals.)
Why must InternalAggregation.reduce be associative and commutative? Tie your answer to partial reduces and non-deterministic shard ordering.
Walk through how doc_count_error_upper_bound is computed for a terms agg. Under what order does the error become unbounded/harder to compute?
Where does a terms aggregator read its values from, and what makes bucketing by global ordinal faster than bucketing by string? (Cross-check DocValues and Fielddata.)
In aggs-matrix-stats, find getAggregations() and list everything the AggregationSpec registers. Which registration is specifically for the reduce side?
What does batched_reduce_size control, and on which node does it matter?

Common bugs and symptoms

Symptom	Likely cause	Where to look
`terms` counts slightly off / a known term missing	shard-level top-`shard_size` truncation (by design)	`doc_count_error_upper_bound`, raise `shard_size`
`terms`/sort on a `text` field rejected	fielddata disabled by default on `text`	DocValues and Fielddata
`CircuitBreakingException` on a high-cardinality `terms` agg	too many buckets / fielddata pressure trips the request breaker	Circuit Breakers and Memory
Custom agg works on 1 shard, wrong on many	`InternalAgg.reduce` not associative/commutative, or missing `addResultReader`	your `reduce`, `AggregationSpec.addResultReader`
Ordering by sub-metric gives unstable/wrong top-N	metric-order approximation; `shard_size` too low	raise `shard_size`, reconsider exactness needs
OOM on coordinator with many shards	large partials reduced all at once	`batched_reduce_size`, partial reduce
Sub-aggregation values all zero/empty	parent never routed docs into the bucket (filter mismatch)	parent aggregator `collectBucket`

Validation: prove you understand this

Draw the four-stage lifecycle (build, create, collect, reduce), labeling which stage runs on the coordinator vs the data nodes, and what object crosses the wire between collect and reduce.
Explain, with a concrete 3-shard example, how a globally-popular term can be under-reported by a terms agg, and how shard_size reduces the risk.
Define doc_count_error_upper_bound and sum_other_doc_count and say what a non-zero doc_count_error_upper_bound tells a user about trusting the counts.
For a custom InternalAgg.reduce, write (in pseudocode) the reduce for a sum-style metric and argue it is associative and commutative.
Explain why bucketing a terms agg by global ordinal beats bucketing by raw string, and what data structure supplies the ordinals.
Using aggs-matrix-stats as the template, list the four classes you'd create to add a new aggregation and the one method whose correctness the reviewer will care about most.

DocValues and Fielddata

A Lucene inverted index is excellent at answering "which documents contain term X" and terrible at answering "for this document, what is the value of field Y." Sorting, aggregating, and scripting all need the second access pattern — given a doc, get its value — which is columnar. OpenSearch satisfies it two ways: doc values (on disk, columnar, the default) and fielddata (in-heap, built from the inverted index, the legacy fallback that exists mainly to make text sortable/aggregatable and is off by default for good reason).

This chapter explains both, why keyword and text behave so differently, and how global ordinals make terms aggregations fast. It underpins Aggregations and Search Execution (sorting), and connects to Circuit Breakers and Memory (the fielddata breaker) and Mapping and Analysis (which field types get doc values).

After this chapter you can:

Explain why sorting/aggregating on a text field is blocked by default and what fielddata: true actually costs.
Name the five doc-values flavors and which field types use each.
Describe global ordinals and why they make high-cardinality keyword aggs viable.
Read the fielddata circuit breaker and diagnose a heap blow-up.

Two columnar stores, very different costs

	DocValues	Fielddata
Where	on disk (columnar segment files), OS page cache	on JVM heap
Built	at index time, written with the segment	lazily at query time by uninverting the inverted index
Default for	`keyword`, numerics, `date`, `boolean`, `ip`, `geo_point`	nothing — must be enabled per `text` field
Cost	cheap, OS-cached, scales with data	expensive, heap-bound, can OOM the node
Used for	sort, aggregate, script field access, some queries	making `text` sortable/aggregatable (rarely a good idea)

grep -rn "DocValuesType\|SortedSetDocValues\|hasDocValues\|docValues" \
  server/src/main/java/org/opensearch/index/mapper/ | head
grep -rn "fielddata\|FieldDataType\|IndexFieldData" \
  server/src/main/java/org/opensearch/index/fielddata/ | head

DocValues flavors

Lucene stores doc values in one of a few column types; OpenSearch picks the type from the field's mapping.

Lucene `DocValuesType`	Holds	OpenSearch field types
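`NUMERIC`	one number per doc	strictly single-valued columns, e.g. metadata such as `_version` and `_seq_no`
`SORTED_NUMERIC`	multiple numbers per doc, sorted	`long`, `integer`, `double`, `date` (as epoch), `boolean`, `geo_point` (encoded lat/lon); used even when a doc holds one value, because any field may be multi-valued
`SORTED`	one term ordinal per doc	single-valued string columns; the `keyword` mapper writes `SORTED_SET` instead
`SORTED_SET`	a set of term ordinals per doc	`keyword`, `ip`
`BINARY`	raw bytes per doc	`binary`, range fields, custom binary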

The key idea for strings: doc values store ordinals, not the strings themselves. A per-segment sorted dictionary maps ordinal → term. To sort or bucket, you compare ordinals (cheap integers) and only resolve the actual term when you need to emit it.
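A minimal sketch of that layout for one segment, assuming a single-valued string field. `SegmentOrdinalsSketch` and its fields are hypothetical names; a real segment stores the dictionary and ordinals in compressed on-disk structures, not Java arrays.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy single-segment column: a sorted term dictionary plus one ordinal per document. */
final class SegmentOrdinalsSketch {
    final String[] dictionary; // ordinal -> term, sorted, unique
    final int[] docToOrd;      // doc id -> ordinal into dictionary

    SegmentOrdinalsSketch(String[] valueByDoc) {
        dictionary = new TreeSet<String>(Arrays.asList(valueByDoc)).toArray(new String[0]);
        docToOrd = new int[valueByDoc.length];
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            docToOrd[doc] = Arrays.binarySearch(dictionary, valueByDoc[doc]);
        }
    }

    /** Bucketing touches only integers; no string is compared or hashed per document. */
    long[] docCountByOrd() {
        long[] counts = new long[dictionary.length];
        for (int ord : docToOrd) {
            counts[ord]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        SegmentOrdinalsSketch segment =
            new SegmentOrdinalsSketch(new String[] {"pear", "apple", "pear", "plum"});
        long[] counts = segment.docCountByOrd();
        for (int ord = 0; ord < counts.length; ord++) {
            // The term is resolved only here, when a bucket is emitted.
            System.out.println(segment.dictionary[ord] + " -> " + counts[ord]);
        }
    }
}
```

Because the dictionary is sorted, comparing two ordinals gives the same answer as comparing the two terms, so sorting can also stay in integer space.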

grep -rn "SortedSetDocValues\|SortedNumericDocValues\|NumericDocValues\|BinaryDocValues" \
  server/src/main/java/org/opensearch/index/fielddata/ | head

The IndexFieldData abstraction

OpenSearch reads doc values (and fielddata) through a uniform interface so aggregations and sorting don't care which backing store is used:

Class	Role
`IndexFieldData<FD>`	per-field, per-index accessor; produces `LeafFieldData` per segment
`LeafFieldData`	per-segment view; yields `SortedNumericDocValues` / `SortedSetDocValues`-style accessors
`IndexFieldDataService`	builds/caches `IndexFieldData` instances per field, wired to the breaker
`IndexFieldData.Builder`	each `MappedFieldType` returns one via `fielddataBuilder(...)`

grep -rn "interface IndexFieldData\|class LeafFieldData\|IndexFieldDataService\|fielddataBuilder" \
  server/src/main/java/org/opensearch/index/fielddata/
grep -rn "fielddataBuilder" server/src/main/java/org/opensearch/index/mapper/ | head

A terms aggregator (see Aggregations) asks the field's IndexFieldData for SortedSetDocValues, then buckets by ordinal. A sort on a numeric field asks for SortedNumericDocValues. Same abstraction, different backing.

Global ordinals — the trick that makes terms aggs fast

Ordinals are per-segment: ordinal 5 in segment A and ordinal 5 in segment B may be different terms. To aggregate across a whole shard you need a shard-global numbering: global ordinals, a mapping from each segment's local ordinals to a single shard-wide ordinal space.

flowchart LR
    SA["Segment A local ords<br/>0=apple 1=pear"] --> GO[Global ordinal map]
    SB["Segment B local ords<br/>0=pear 1=plum"] --> GO
    GO --> G["Shard-global ords<br/>0=apple 1=pear 2=plum"]
    G --> AGG[terms agg buckets by global ord]

Global ordinals are built lazily on first aggregation and cached per shard, rebuilt when segments change (a refresh that adds segments). They cost memory and a build pass, which is why the first terms agg after a refresh can be slower than subsequent ones. eager_global_ordinals: true in the mapping pre-builds them at refresh time to move that cost off the query path.
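The mapping in the diagram can be sketched in a few lines. This is a simplified model, assuming each segment exposes its sorted dictionary as a `String[]`; the class and field names are hypothetical, and a `TreeSet` union plus binary search stands in for the streaming merge of sorted term dictionaries a real implementation would use.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy global ordinal map: per-segment local ordinals -> one shard-wide ordinal space. */
final class GlobalOrdinalsSketch {
    final String[] globalDictionary; // global ordinal -> term
    final int[][] localToGlobal;     // [segment][local ordinal] -> global ordinal

    GlobalOrdinalsSketch(String[][] segmentDictionaries) {
        TreeSet<String> union = new TreeSet<>();
        for (String[] dictionary : segmentDictionaries) {
            union.addAll(Arrays.asList(dictionary));
        }
        globalDictionary = union.toArray(new String[0]);
        localToGlobal = new int[segmentDictionaries.length][];
        for (int seg = 0; seg < segmentDictionaries.length; seg++) {
            String[] dictionary = segmentDictionaries[seg];
            localToGlobal[seg] = new int[dictionary.length];
            for (int localOrd = 0; localOrd < dictionary.length; localOrd++) {
                localToGlobal[seg][localOrd] = Arrays.binarySearch(globalDictionary, dictionary[localOrd]);
            }
        }
    }

    /** Shard-wide doc counts, bucketing by global ordinal only. */
    long[] countByGlobalOrd(int[][] localOrdPerDocBySegment) {
        long[] counts = new long[globalDictionary.length];
        for (int seg = 0; seg < localOrdPerDocBySegment.length; seg++) {
            for (int localOrd : localOrdPerDocBySegment[seg]) {
                counts[localToGlobal[seg][localOrd]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        GlobalOrdinalsSketch ords = new GlobalOrdinalsSketch(
            new String[][] {{"apple", "pear"}, {"pear", "plum"}});
        // Segment A docs: apple, pear, pear. Segment B docs: pear, plum.
        long[] counts = ords.countByGlobalOrd(new int[][] {{0, 1, 1}, {0, 1}});
        System.out.println(Arrays.toString(ords.globalDictionary) + " " + Arrays.toString(counts));
    }
}
```

The build pass walks every segment dictionary, which is the cost the text describes; adding a segment changes the union, so the map has to be rebuilt.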

grep -rn "GlobalOrdinal\|buildGlobalOrdinals\|eager_global_ordinals\|OrdinalMap" \
  server/src/main/java/org/opensearch/index/fielddata/

Note: High-cardinality keyword aggregations live or die on global ordinals. If you aggregate a field with millions of distinct values on every query, consider eager_global_ordinals, or reconsider whether you need a full terms agg vs a composite/cardinality approach.

Why `text` is special — and why sorting/aggregating it is blocked

A text field is analyzed: "Open Source Engine" becomes tokens [open, source, engine]. There is no single value to sort by, and text fields do not get doc values (it wouldn't be meaningful for a token stream). So by default:

// Sorting or aggregating a `text` field:
"Fielddata is disabled on text fields by default. Set fielddata=true on
 [your_field] in order to load fielddata in memory by uninverting the inverted
 index. Note that this can use significant memory."

That error is OpenSearch protecting your heap. Enabling fielddata: true makes the node uninvert the entire inverted index for that field into heap at query time — for a high-cardinality analyzed field across millions of docs, that is a classic node-OOM recipe.
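"Uninverting" is worth seeing concretely. The sketch below is a toy, assuming the inverted index is a sorted map from term to the doc ids that contain it; `UninvertSketch.uninvert` is a hypothetical helper, not the real fielddata loader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy uninversion: turn term -> docs into doc -> terms, fully materialized on heap. */
final class UninvertSketch {

    static List<List<String>> uninvert(SortedMap<String, int[]> invertedIndex, int maxDoc) {
        List<List<String>> column = new ArrayList<>(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            column.add(new ArrayList<>());
        }
        // Every posting of every term is visited once; nothing can be skipped.
        for (Map.Entry<String, int[]> posting : invertedIndex.entrySet()) {
            for (int doc : posting.getValue()) {
                column.get(doc).add(posting.getKey());
            }
        }
        return column;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> index = new TreeMap<>();
        index.put("open", new int[] {0, 1});
        index.put("source", new int[] {0});
        index.put("engine", new int[] {0, 2});
        System.out.println(uninvert(index, 3));
    }
}
```

The work and the memory both scale with the total number of postings for the field, which is why an analyzed field with many distinct tokens across millions of documents is so expensive to load this way.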

The right answer is the multi-field pattern: index the field as text for search and as a keyword sub-field for sort/aggregate:

{
  "title": {
    "type": "text",
    "fields": { "raw": { "type": "keyword" } }
  }
}

Then search title (analyzed) and aggregate/sort title.raw (doc values, cheap). This is the idiom; fielddata: true is the exception you should almost never reach for.

grep -rn "Fielddata is disabled\|fielddata.*default\|TextFieldMapper" \
  server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java | head

The fielddata circuit breaker

Because fielddata lives on heap and is built lazily, it is the most dangerous memory consumer in the system. The fielddata circuit breaker (indices.breaker.fielddata.limit, default ~40% of heap) caps total fielddata heap; exceeding it throws CircuitBreakingException instead of OOM-ing the node. DocValues do not count against it (they're off-heap / page cache). See Circuit Breakers and Memory for the full breaker hierarchy.
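The breaker's contract is simple enough to model. The following is a sketch of the idea only, with hypothetical names (`FielddataBreakerSketch`, `reserve`, `release`) and an `IllegalStateException` standing in for `CircuitBreakingException`; the real service also tracks parent limits and overhead factors that are omitted here.

```java
/** Toy memory breaker: account for bytes before loading, refuse when over the limit. */
final class FielddataBreakerSketch {
    private final long limitBytes;
    private long usedBytes;

    FielddataBreakerSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserve heap for a field before loading it; fail the request instead of the node. */
    void reserve(long bytes, String field) {
        if (usedBytes + bytes > limitBytes) {
            throw new IllegalStateException("[fielddata] Data too large for [" + field + "]: would be "
                + (usedBytes + bytes) + " bytes, limit is " + limitBytes);
        }
        usedBytes += bytes;
    }

    /** Called when cached fielddata is evicted or cleared. */
    void release(long bytes) {
        usedBytes = Math.max(0, usedBytes - bytes);
    }

    long used() {
        return usedBytes;
    }

    public static void main(String[] args) {
        FielddataBreakerSketch breaker = new FielddataBreakerSketch(100);
        breaker.reserve(60, "title");
        try {
            breaker.reserve(50, "body"); // 110 > 100: rejected, nothing is loaded
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        breaker.release(60); // e.g. after clearing the fielddata cache
        breaker.reserve(50, "body");
        System.out.println(breaker.used());
    }
}
```

The important design point is that the check happens before the allocation: a rejected request leaves the accounted usage unchanged, so other queries keep working.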

grep -rn "fielddata\|FIELDDATA\|CircuitBreaker.FIELDDATA" \
  server/src/main/java/org/opensearch/indices/breaker/HierarchyCircuitBreakerService.java
curl -s 'localhost:9200/_nodes/stats/breaker?pretty' | grep -A6 fielddata

Inspect actual fielddata heap usage and evict it:

curl -s 'localhost:9200/_cat/fielddata?v&h=node,field,size'
curl -s -XPOST 'localhost:9200/idx/_cache/clear?fielddata=true'

Reading exercise

# 1. Where field types declare doc values and build fielddata
grep -rn "hasDocValues\|fielddataBuilder\|fielddata(" \
  server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java \
  server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java \
  server/src/main/java/org/opensearch/index/mapper/NumberFieldMapper.java

# 2. The IndexFieldData abstraction + caching + breaker wiring
grep -rn "IndexFieldDataService\|CircuitBreakerService\|getForField\|clearField" \
  server/src/main/java/org/opensearch/index/fielddata/IndexFieldDataService.java

# 3. Global ordinals
grep -rn "GlobalOrdinalsBuilder\|buildGlobalOrdinals\|eager_global_ordinals" \
  server/src/main/java/org/opensearch/index/fielddata/

# 4. The famous error
grep -rn "Fielddata is disabled" server/src/main/java/org/opensearch/index/mapper/

Answer:

Why can't a text field have doc values in any meaningful way? Tie it to analysis producing a token stream rather than a single value.
Walk the path a terms aggregation takes to read a keyword field's values: from IndexFieldData to the ordinals it buckets on.
What is a global ordinal and why is it needed in addition to per-segment ordinals? What event invalidates the cached global ordinals?
What does eager_global_ordinals: true change about when the global ordinal build cost is paid, and why might you turn it on?
Compare where doc values and fielddata live (disk/page-cache vs heap) and explain why only one of them counts against the fielddata circuit breaker.
Write the mapping that lets you full-text search a field and aggregate on it without ever enabling fielddata.

Common bugs and symptoms

Symptom	Likely cause	Where to look
`Fielddata is disabled on text fields by default`	sorting/aggregating a `text` field	use a `keyword` multi-field; `TextFieldMapper`
Node OOM / GC death after enabling `fielddata: true`	uninverting a high-cardinality analyzed field into heap	revert; `_cat/fielddata`; multi-field pattern
`CircuitBreakingException: [fielddata] Data too large`	fielddata exceeded `indices.breaker.fielddata.limit`	Circuit Breakers and Memory, `_cache/clear?fielddata=true`
First `terms` agg after each refresh is slow, later ones fast	global ordinals rebuilt lazily after segment change	`eager_global_ordinals: true`
Sorting a numeric field returns nonsense / errors	field indexed without doc values (`doc_values: false`)	mapping; `NumberFieldMapper.hasDocValues`
High-cardinality `keyword` agg slow + heavy	global ordinal build per shard dominates	`eager_global_ordinals`, reconsider `composite`/`cardinality`
`_cat/fielddata` shows large entries you didn't expect	scripts or aggs forced fielddata on some field	audit aggs/scripts; clear cache

Validation: prove you understand this

State the one-sentence rule for when OpenSearch uses doc values vs fielddata, and the default availability of each for keyword, numeric, and text fields.
Explain to a teammate why fielddata: true on a text field is almost always the wrong fix for "I can't aggregate this field," and give the correct mapping.
Define local vs global ordinals with a two-segment example, and name the moment the global ordinal cache is invalidated.
Trace, class by class (MappedFieldType → IndexFieldData → LeafFieldData → Lucene doc-values accessor), how a terms agg reads a keyword field.
Explain why doc values do not trip the fielddata breaker but fielddata does, in terms of where each lives in memory.
Given a 50-shard index where the first dashboard query each minute is slow, diagnose it as a global-ordinals symptom and propose the mapping change that fixes it.

Shard Allocation

Every shard — primary or replica — has to live on some node, and the choice is not arbitrary. Shard allocation is the cluster-manager-side subsystem that decides, on every relevant cluster-state change, where each shard should go: which unassigned shards to assign, which assigned shards to move for balance, and which moves to forbid because they'd violate a constraint (don't put a primary and its replica on the same node; don't fill a disk; respect rack awareness). Get this wrong as a contributor and you produce yellow/red clusters, hot nodes, or data loss across a rack failure.

This chapter dissects AllocationService, the BalancedShardsAllocator, the AllocationDeciders chain, and the operator-facing _cluster/allocation/explain and _cluster/reroute APIs. It runs on the cluster manager (formerly master) and mutates the RoutingTable inside the ClusterState; the new state is then published. It feeds directly into Recovery (an assignment triggers a recovery) and Replication.

After this chapter you can:

Trace an allocation round from a triggering event to a new RoutingTable.
Name the major deciders and the real-world failure each one prevents.
Read _cluster/allocation/explain and translate a NO decision into a fix.
Force or cancel an assignment with _cluster/reroute and know when it's safe.

The cast

Concern	Class	grep target
Orchestrates a round, produces new routing	`AllocationService`	`server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java`
Decides placement + balancing	`BalancedShardsAllocator` (impl of `ShardsAllocator`)	`.../allocation/allocator/BalancedShardsAllocator.java`
Yes/No/Throttle gate per move	`AllocationDeciders` + concrete `*AllocationDecider`	`.../allocation/decider/`
Mutable working state for a round	`RoutingAllocation`, `RoutingNodes`	`.../routing/allocation/RoutingAllocation.java`, `.../routing/RoutingNodes.java`
Where do primaries of existing data come from	`GatewayAllocator` / `ExistingShardsAllocator`	`server/src/main/java/org/opensearch/gateway/GatewayAllocator.java`
Manual operator commands	`AllocationCommand` (`AllocateEmptyPrimaryAllocationCommand`, `MoveAllocationCommand`, …)	`.../allocation/command/`

ls server/src/main/java/org/opensearch/cluster/routing/allocation/decider/
grep -n "class AllocationService\|reroute\|applyStartedShards\|applyFailedShards" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java

What triggers an allocation round

Allocation runs as part of cluster-state updates. The common triggers:

Trigger	Path
Node joins/leaves	`Coordinator` join/leave → reroute
Index created / settings changed (e.g., `number_of_replicas`)	metadata update → reroute
Shard started (recovery finished)	`AllocationService.applyStartedShards`
Shard failed	`AllocationService.applyFailedShards`
Manual `_cluster/reroute`	`TransportClusterRerouteAction`
Disk usage crosses a watermark	`DiskThresholdMonitor` triggers reroute

grep -rn "reroute\|allocationService.reroute\|applyStartedShards\|applyFailedShards" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/ | head
grep -rn "DiskThresholdMonitor" server/src/main/java/org/opensearch/cluster/routing/allocation/

The allocation round

flowchart TD
    T[Trigger: node change / shard started/failed / reroute / watermark] --> AS[AllocationService.reroute]
    AS --> RN[Build RoutingNodes + RoutingAllocation from current ClusterState]
    RN --> EXIST[GatewayAllocator / ExistingShardsAllocator: assign existing primaries from on-disk data]
    EXIST --> BSA[BalancedShardsAllocator.allocate]
    BSA --> U[allocateUnassigned: place unassigned shards]
    U --> M[moveShards: relocate shards that can no longer stay]
    M --> B[balance: even out weight across nodes]
    U -. each candidate move .-> D[AllocationDeciders.canAllocate / canRemain / canRebalance]
    M -. each candidate move .-> D
    B -. each candidate move .-> D
    D -->|YES / THROTTLE / NO| BSA
    B --> NRT[New RoutingTable]
    NRT --> CS[New ClusterState -> publish -> recovery starts]

The allocator proposes moves; the deciders veto or throttle them. A move goes ahead only if no decider returns NO: a single NO blocks it outright, and a THROTTLE verdict defers it to a later round. _cluster/allocation/explain replays this decider chain for a chosen shard and prints each decider's verdict, which is why it is the single most useful allocation-debugging tool.
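The veto logic can be sketched as a small chain. This is a toy under stated assumptions: `Node`, the three sample deciders, and their thresholds (85% disk used, 2 recoveries in flight) are illustrative stand-ins, not the real `AllocationDecider` classes or their settings handling.

```java
import java.util.List;
import java.util.Set;

/** Toy decider chain: one NO vetoes, THROTTLE defers, otherwise YES. */
final class DeciderChainSketch {

    enum Decision { YES, THROTTLE, NO }

    /** Hypothetical, minimal view of a candidate target node. */
    static final class Node {
        final Set<String> shardsHeld;
        final double diskUsedFraction;
        final int recoveriesInFlight;

        Node(Set<String> shardsHeld, double diskUsedFraction, int recoveriesInFlight) {
            this.shardsHeld = shardsHeld;
            this.diskUsedFraction = diskUsedFraction;
            this.recoveriesInFlight = recoveriesInFlight;
        }
    }

    interface Decider {
        Decision canAllocate(String shardId, Node target);
    }

    static final Decider SAME_SHARD =
        (shardId, node) -> node.shardsHeld.contains(shardId) ? Decision.NO : Decision.YES;
    static final Decider DISK_THRESHOLD =
        (shardId, node) -> node.diskUsedFraction > 0.85 ? Decision.NO : Decision.YES;
    static final Decider RECOVERY_THROTTLE =
        (shardId, node) -> node.recoveriesInFlight >= 2 ? Decision.THROTTLE : Decision.YES;

    static final List<Decider> CHAIN = List.of(SAME_SHARD, DISK_THRESHOLD, RECOVERY_THROTTLE);

    static Decision decide(List<Decider> chain, String shardId, Node target) {
        Decision result = Decision.YES;
        for (Decider decider : chain) {
            Decision verdict = decider.canAllocate(shardId, target);
            if (verdict == Decision.NO) {
                return Decision.NO; // a single veto blocks the move
            }
            if (verdict == Decision.THROTTLE) {
                result = Decision.THROTTLE; // remember it, but a later NO still wins
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node idle = new Node(Set.of(), 0.10, 0);
        Node busy = new Node(Set.of(), 0.10, 2);
        Node full = new Node(Set.of(), 0.95, 2);
        System.out.println(decide(CHAIN, "idx[0]", idle));
        System.out.println(decide(CHAIN, "idx[0]", busy));
        System.out.println(decide(CHAIN, "idx[0]", full));
    }
}
```

An explain-style tool is then just the same loop with every verdict recorded instead of short-circuiting on the first NO.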

grep -n "canAllocate\|canRemain\|canRebalance\|Decision" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/decider/AllocationDeciders.java
grep -n "allocateUnassigned\|moveShards\|balance\|weight" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

The deciders — each one prevents a specific disaster

Decider	Returns NO when…	The disaster it prevents
`SameShardAllocationDecider`	a primary and its replica would land on the same node (or host)	losing both copies if that node dies
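`DiskThresholdDecider`	target node is above the `low` disk watermark (no new shards) or a shard's node is above `high` (shard may not remain); `flood_stage` is enforced separately as a read-only index block	filling a disk to 100% (read-only flood)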
`AwarenessAllocationDecider`	placement would violate `cluster.routing.allocation.awareness.attributes` (rack/zone balance)	a whole rack/AZ failure taking all copies
`FilterAllocationDecider`	shard excluded by `index.routing.allocation.{include,exclude,require}.*`	placing data on draining/decommissioned nodes
`MaxRetryAllocationDecider`	a shard has failed allocation more than `index.allocation.max_retries` (default 5)	infinite reallocation loops on a poison shard
`ThrottlingAllocationDecider`	too many concurrent recoveries already in flight	recovery storm saturating I/O/network
`EnableAllocationDecider`	allocation disabled via `cluster.routing.allocation.enable` (e.g., `none`/`primaries`)	accidental rebalancing during a rolling restart
`NodeVersionAllocationDecider`	replica would be on an older-version node than primary	BWC: can't recover newer Lucene/segments onto older node
`ShardsLimitAllocationDecider`	exceeds `index.routing.allocation.total_shards_per_node`	piling too many shards on one node

ls server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ | sed 's/AllocationDecider.java//'
grep -rn "Decision.NO\|Decision.THROTTLE\|Decision.single" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/decider/DiskThresholdDecider.java

Warning: Before a rolling restart, operators set cluster.routing.allocation.enable: primaries (or none), which the EnableAllocationDecider enforces, so the cluster doesn't try to rebuild replicas of a node that's about to come right back. Forgetting to re-enable it afterward leaves replicas permanently unassigned, which makes it a top operational footgun.

RoutingAllocation and RoutingNodes — the scratch pad

AllocationService builds a mutable RoutingNodes (a node→shards view derived from the immutable RoutingTable) and a RoutingAllocation that carries the deciders, cluster info (disk usage via ClusterInfoService), and the unassignedInfo for each shard. The allocator mutates RoutingNodes; at the end, AllocationService reads the result back into a fresh, immutable RoutingTable for the new ClusterState. The immutable-snapshot discipline of ClusterState is preserved: the scratch pad is mutable, the published artifact is not.

grep -n "class RoutingNodes\|initializeShard\|relocateShard\|startShard\|class RoutingAllocation" \
  server/src/main/java/org/opensearch/cluster/routing/RoutingNodes.java \
  server/src/main/java/org/opensearch/cluster/routing/allocation/RoutingAllocation.java

UnassignedInfo records why a shard is unassigned (INDEX_CREATED, NODE_LEFT, ALLOCATION_FAILED, REPLICA_ADDED, …) and the failure count — that's what MaxRetryAllocationDecider reads and what allocation/explain reports.

grep -n "enum Reason\|AllocationStatus\|failedAllocations" \
  server/src/main/java/org/opensearch/cluster/routing/UnassignedInfo.java

Where primaries of existing data come from

For a brand-new index, primaries can be allocated anywhere because they start empty. For an index with on-disk data (cluster restart, node rejoin), the primary must go where the best copy of the data already lives. The GatewayAllocator (an ExistingShardsAllocator) queries nodes for their on-disk shard copies and allocation IDs (see in-sync allocation IDs in Replication), then assigns the primary to the node with the most up-to-date copy. Replicas prefer nodes that can do a cheap sequence-number-based recovery from the primary (see Recovery).

grep -rn "ExistingShardsAllocator\|allocateUnassigned\|class GatewayAllocator\|TransportNodesListShardStoreMetadata" \
  server/src/main/java/org/opensearch/gateway/ | head
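
The core of that choice, matching on-disk allocation IDs against the in-sync set, can be sketched as follows. This is a simplified illustration under stated assumptions: `PrimaryCandidateSketch` and `pickPrimaryNode` are hypothetical names, the node-to-allocation-ID map stands in for the per-node store responses, and the alphabetical tie-break exists only to keep the sketch deterministic.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: not the GatewayAllocator code, only the selection idea.
final class PrimaryCandidateSketch {
    /**
     * nodeToAllocationId: assumed map of node name to the allocation ID found on its disk.
     * inSyncIds: the in-sync allocation IDs recorded in index metadata.
     * Returns a node holding an in-sync copy, or empty if no such copy was found.
     */
    static Optional<String> pickPrimaryNode(Map<String, String> nodeToAllocationId, Set<String> inSyncIds) {
        return nodeToAllocationId.entrySet().stream()
            .filter(e -> inSyncIds.contains(e.getValue()))  // stale copies are never eligible
            .map(Map.Entry::getKey)
            .sorted()                                       // tie-break for the sketch only
            .findFirst();
    }

    public static void main(String[] args) {
        Set<String> inSync = Set.of("insync-7");
        // prints Optional[nodeB]
        System.out.println(pickPrimaryNode(Map.of("nodeA", "stale-1", "nodeB", "insync-7"), inSync));
        // prints Optional.empty
        System.out.println(pickPrimaryNode(Map.of("nodeA", "stale-1"), inSync));
    }
}
```

The empty result is the situation in which the primary stays unassigned: only stale copies exist, and nothing is promoted automatically.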

Operator surface: explain and reroute

# Why is THIS shard unassigned / why won't it move? Replays the decider chain.
curl -s 'localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'content-type: application/json' \
  -d '{"index":"idx","shard":0,"primary":false}'

# Manually move / allocate / cancel (dry_run first!)
curl -s 'localhost:9200/_cluster/reroute?dry_run=true&pretty' \
  -H 'content-type: application/json' \
  -d '{"commands":[{"move":{"index":"idx","shard":0,"from_node":"nodeA","to_node":"nodeB"}}]}'

allocation/explain output for an unassigned shard lists, per node, every decider's decision and a human explanation. A red cluster's root cause is usually one decider saying NO (e.g., DiskThresholdDecider: "the node is above the high watermark") — read the explanation, fix the underlying condition, and the next reroute assigns it.

grep -rn "TransportClusterAllocationExplainAction\|ClusterAllocationExplanation" \
  server/src/main/java/org/opensearch/action/admin/cluster/allocation/
grep -rn "AllocationCommand\|MoveAllocationCommand\|AllocateEmptyPrimaryAllocationCommand" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/command/

Warning: allocate_empty_primary and allocate_stale_primary discard data — they tell the cluster to accept an empty or stale copy as the new primary. They are last-resort recovery commands for "all good copies are permanently gone," never routine. dry_run everything first.

Reading exercise

# 1. The reroute entry and how started/failed shards feed back
grep -n "public ClusterState reroute\|applyStartedShards\|applyFailedShards\|deassociateDeadNodes" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/AllocationService.java

# 2. The balancer's three jobs
grep -n "allocateUnassigned\|moveShards\|balanceByWeights\|weight(" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

# 3. One decider in detail
sed -n '1,120p' server/src/main/java/org/opensearch/cluster/routing/allocation/decider/DiskThresholdDecider.java

# 4. Why-unassigned bookkeeping
grep -n "enum Reason\|AllocationStatus\|failedAllocations" \
  server/src/main/java/org/opensearch/cluster/routing/UnassignedInfo.java

Answer:

List the three phases inside BalancedShardsAllocator.allocate and what each accomplishes.
For a move to occur, how many deciders must say YES? What does THROTTLE mean vs NO, and which decider issues THROTTLE most often?
Name the decider that prevents a primary and replica from co-locating, and explain why co-location would defeat the purpose of replication.
Why does MaxRetryAllocationDecider exist? What would happen without it for a shard whose underlying disk is corrupt?
After a rolling restart, all replicas are unassigned and the cluster is yellow. Which setting/decider is the likely cause and what's the one-line fix?
Explain what allocate_stale_primary does and why it is a data-loss operation, citing in-sync allocation IDs from Replication.

Common bugs and symptoms

Symptom	Likely cause	Where to look
Replicas stuck unassigned after rolling restart	`cluster.routing.allocation.enable` left at `primaries`/`none`	`EnableAllocationDecider`; reset to `all`
Cluster yellow, one shard won't allocate	a decider says `NO` (disk, awareness, filter)	`_cluster/allocation/explain`
Shards refuse to move onto a node	node above disk high watermark	`DiskThresholdDecider`; free disk or raise watermark
Index turns read-only unexpectedly	disk hit `flood_stage` watermark (95%)	`DiskThresholdDecider`; clear block after freeing space
Shard never reallocates after repeated failures	hit `index.allocation.max_retries`	`MaxRetryAllocationDecider`; `reroute?retry_failed=true` after fixing root cause
Both copies of data lost when one rack died	awareness not configured	`AwarenessAllocationDecider` + `awareness.attributes`
Hot node, uneven shard distribution	balancer weights / `total_shards_per_node` / filters skewing placement	`BalancedShardsAllocator` weights, `_cat/allocation`
Replica won't recover onto an upgraded node	version mismatch (older target)	`NodeVersionAllocationDecider`

Validation: prove you understand this

Draw the allocation round (trigger → RoutingAllocation/RoutingNodes → existing-shards allocator → balancer's three phases → deciders → new RoutingTable), and state where the result re-enters the immutable ClusterState.
For each of SameShard, DiskThreshold, Awareness, Filter, and MaxRetry deciders, name the production incident it prevents in one sentence.
Given a red cluster, write the exact allocation/explain request for the primary of idx shard 0 and describe how you'd read the result to find the blocking decider.
Explain the rolling-restart enable: primaries pattern: what it prevents during the restart and the failure mode if you forget to revert it.
Distinguish move, allocate_replica, allocate_empty_primary, and allocate_stale_primary reroute commands by their data-safety, and say which two can lose data.
Describe how GatewayAllocator decides which node should host the primary of an index that already has on-disk data, and how that ties to in-sync allocation IDs.

Recovery

When a shard copy is assigned to a node (see Shard Allocation), it does not magically have the data. It has to recover it: a brand-new replica must copy the primary's Lucene segments and replay any operations it missed; a node that crashed and came back must replay its own translog; a relocating shard must hand off to its new home. Peer recovery — replica catching up from primary — is the workhorse, and modern OpenSearch makes it cheap with sequence numbers and retention leases so a briefly-disconnected replica copies only the operations it missed instead of the whole shard.

This chapter dissects the peer-recovery handshake (RecoverySourceHandler's two phases), the sequence-number machinery that makes it incremental, retention leases, and the _recovery API. It is the operational consequence of Replication (the checkpoints come from there) and The Translog (phase 2 replays translog ops), and it's what an allocation decision actually does.

After this chapter you can:

Trace a peer recovery from StartRecoveryRequest through phase 1 and phase 2.
Explain how sequence numbers + retention leases turn a full copy into an ops-only catch-up.
Read RecoveryState.Stage and the _recovery API to see where a recovery is.
Reason about recovery throttling and why a recovery storm hurts a cluster.

The cast

Concern	Class	grep target
Primary side: serves a recovery to a replica	`PeerRecoverySourceService` + `RecoverySourceHandler`	`server/src/main/java/org/opensearch/indices/recovery/`
Replica side: drives its own recovery	`PeerRecoveryTargetService` + `RecoveryTarget`	same dir
The request that kicks it off	`StartRecoveryRequest`	`.../recovery/StartRecoveryRequest.java`
Progress/state machine	`RecoveryState` + `RecoveryState.Stage`	`.../recovery/RecoveryState.java`
Sequence-number tracking	`LocalCheckpointTracker`, `SeqNoStats`, global checkpoint via `ReplicationTracker`	`server/src/main/java/org/opensearch/index/seqno/`
Retention leases (keep ops around)	`RetentionLease`, `RetentionLeases`, `ReplicationTracker`	`.../index/seqno/RetentionLease.java`

ls server/src/main/java/org/opensearch/indices/recovery/
grep -n "class RecoverySourceHandler\|recoverToTarget\|phase1\|phase2\|prepareTargetForTranslog" \
  server/src/main/java/org/opensearch/indices/recovery/RecoverySourceHandler.java

Recovery types

Type	When	What it does
Existing-store / local	node restart with intact data	open the local Lucene index, replay the local translog to catch up
Peer recovery	new/empty replica, or replica that fell behind	copy segments + ops from the primary (the focus of this chapter)
Snapshot recovery	restore from repository	restore segment files from the snapshot's Lucene commit; a snapshot carries no translog, so the shard starts with a fresh one (see Snapshots)
Relocation	shard moving between nodes	a peer recovery to the new node, then hand-off

grep -rn "RecoverySource\|PeerRecoverySource\|ExistingStoreRecoverySource\|SnapshotRecoverySource" \
  server/src/main/java/org/opensearch/cluster/routing/RecoverySource.java

Peer recovery: the two-phase handshake

The replica (target) sends a StartRecoveryRequest to the primary (source) carrying the replica's current state: its SeqNoStats (local checkpoint, max seqno), its store metadata (which segment files it already has), and any RetentionLease info. The primary's RecoverySourceHandler.recoverToTarget decides between a full file copy and an ops-only catch-up, then runs:

sequenceDiagram
    participant T as Replica (RecoveryTarget)
    participant S as Primary (RecoverySourceHandler)
    T->>S: StartRecoveryRequest (seqNo stats, store metadata, retention lease)
    Note over S: decide: can we skip phase 1?<br/>(replica's seqno >= a retained safe commit)
    alt full copy needed
        S->>S: phase 1: snapshot a safe Lucene commit
        S->>T: send missing segment files (throttled)
        T->>S: filesReceived
    else ops-only catch-up
        Note over S: skip phase 1 entirely
    end
    S->>T: prepareForTranslogOperations
    Note over S: phase 2: replay translog ops from a startingSeqNo
    S->>T: send translog ops [startingSeqNo .. endingSeqNo] (batched, throttled)
    T->>S: ops applied
    S->>T: finalizeRecovery (mark replica in-sync, update global checkpoint)
    Note over T: RecoveryState.Stage = DONE; shard STARTED

Phase 1 — segments

The primary identifies a safe commit (one whose operations are all ≤ a checkpoint the replica can build on), diffs its segment files against the replica's store metadata, and ships only the missing/different files. Identical files (matched by name + checksum) are not resent. While this happens the primary holds a reference to that commit, so merges and later flushes cannot delete its files mid-copy.

If the replica already has a recent-enough copy (sequence-number-based recovery, below), phase 1 is skipped entirely — no segments are copied.
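
The phase 1 diff described above reduces to a comparison of two name-to-checksum maps. The sketch below assumes exactly that representation; `Phase1DiffSketch`, `filesToSend`, and the file names and checksums are hypothetical, and the real store metadata carries more than a checksum per file.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the phase 1 file diff; not RecoverySourceHandler itself.
final class Phase1DiffSketch {
    /** Files the source must send: absent on the target, or present with a different checksum. */
    static Set<String> filesToSend(Map<String, String> sourceChecksums, Map<String, String> targetChecksums) {
        Set<String> toSend = new TreeSet<>();
        for (Map.Entry<String, String> e : sourceChecksums.entrySet()) {
            String onTarget = targetChecksums.get(e.getKey());  // null when the target lacks the file
            if (!e.getValue().equals(onTarget)) {
                toSend.add(e.getKey());
            }
        }
        return toSend;
    }

    public static void main(String[] args) {
        Map<String, String> source = Map.of("_0.cfs", "aa", "_1.cfs", "bb", "segments_3", "cc");
        Map<String, String> target = Map.of("_0.cfs", "aa", "_1.cfs", "zz");
        // prints [_1.cfs, segments_3]
        System.out.println(filesToSend(source, target));
    }
}
```

Here `_0.cfs` matches by name and checksum and is skipped, `_1.cfs` differs and is resent, and `segments_3` is missing on the target.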

Phase 2 — translog ops

The primary replays operations from a startingSeqNo (the first op the replica is missing) through the current endingSeqNo, streaming them as Translog.Operations the replica applies via its engine. New writes arriving during recovery are also captured (the primary tracks them) so the replica ends fully caught up before finalizeRecovery marks it in-sync.

grep -n "phase1\|recoverFilesFromSource\|sendFiles\|phase2\|sendSnapshotOfOperations\|finalizeRecovery\|prepareTargetForTranslog" \
  server/src/main/java/org/opensearch/indices/recovery/RecoverySourceHandler.java

Sequence numbers: how the replica catches up cheaply

Every write gets a monotonically increasing sequence number on the primary, assigned via the LocalCheckpointTracker. Two checkpoints matter:

Checkpoint	Meaning	Owner
local checkpoint	highest seqno at or below which this shard has applied every op contiguously	per shard, `LocalCheckpointTracker`
global checkpoint	highest seqno that all in-sync copies have applied	primary, `ReplicationTracker`

The global checkpoint is the linchpin: operations at or below it are safely on every in-sync copy. A replica that disconnects and returns only needs operations after its last local checkpoint. If the primary still retains those ops (it hasn't trimmed/merged them away), phase 1 is skipped and recovery is just phase 2 over a tiny range. That retention is what retention leases guarantee.

grep -n "class LocalCheckpointTracker\|getProcessedCheckpoint\|markSeqNoAsProcessed" \
  server/src/main/java/org/opensearch/index/seqno/LocalCheckpointTracker.java
grep -n "globalCheckpoint\|getGlobalCheckpoint\|class ReplicationTracker\|markAllocationIdAsInSync" \
  server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java
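
Both checkpoints are small enough to model directly. The sketch below is an illustration, not the LocalCheckpointTracker or ReplicationTracker code: `CheckpointSketch` and its methods are hypothetical names, and the use of -1 for "no operations processed yet" is a convention assumed for the sketch.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the two checkpoints; names are illustrative.
final class CheckpointSketch {
    private long checkpoint = -1;                       // assumed "nothing processed yet" value
    private final Set<Long> pending = new HashSet<>();  // processed seqnos above a gap

    /** Record a processed seqno and advance the checkpoint across any now-contiguous run. */
    void markProcessed(long seqNo) {
        if (seqNo <= checkpoint) {
            return;                                     // already covered
        }
        pending.add(seqNo);
        while (pending.remove(checkpoint + 1)) {
            checkpoint++;                               // the gap closed, move forward
        }
    }

    long localCheckpoint() {
        return checkpoint;
    }

    /** Global checkpoint as the minimum local checkpoint across in-sync copies. */
    static long globalCheckpoint(Collection<Long> inSyncLocalCheckpoints) {
        if (inSyncLocalCheckpoints.isEmpty()) {
            throw new IllegalArgumentException("need at least one in-sync copy");
        }
        long min = Long.MAX_VALUE;
        for (long c : inSyncLocalCheckpoints) {
            min = Math.min(min, c);
        }
        return min;
    }

    public static void main(String[] args) {
        CheckpointSketch replica = new CheckpointSketch();
        replica.markProcessed(0);
        replica.markProcessed(1);
        replica.markProcessed(3);                        // seqno 2 is still missing
        System.out.println(replica.localCheckpoint());  // prints 1
        replica.markProcessed(2);
        System.out.println(replica.localCheckpoint());  // prints 3
        System.out.println(globalCheckpoint(List.of(3L, 1L, 7L)));  // prints 1
    }
}
```

The out-of-order arrival of seqno 3 shows why the local checkpoint is about contiguity rather than the highest seqno seen, and the last line shows one lagging in-sync copy holding the global checkpoint back.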

Retention leases: keep the ops the replica will need

A retention lease is a promise the primary makes to retain operations above a given sequence number for a given retention period, so a lagging replica can do an ops-only recovery instead of a full copy. The primary creates a peer recovery retention lease per replica (ReplicationTracker manages RetentionLeases). Without leases, the primary's translog/soft-deletes could be trimmed before the replica reconnects, forcing an expensive full file copy.

Setting	Effect
`index.soft_deletes.retention_lease.period`	how long an absent peer's lease (and thus its ops) is retained (default 12h)
`index.soft_deletes.enabled`	soft deletes underpin op retention for recovery (on by default)

grep -n "RetentionLease\|addPeerRecoveryRetentionLease\|retention_lease\|renewPeerRecoveryRetentionLeases" \
  server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java
grep -rn "soft_deletes.retention_lease.period\|SOFT_DELETES" \
  server/src/main/java/org/opensearch/index/IndexSettings.java

Note: "Why did a 5-minute node blip cause a full-shard recopy?" usually means the ops the replica needed were already gone, so only a full copy could rebuild it. Either the retention lease had expired (the lease period was set shorter than the outage; the default is 12h) or soft-deletes were disabled.
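
The outcome of a disconnect can be reasoned about as a small decision rule. The sketch below is a deliberate simplification: `RecoveryPlanSketch` and `plan` are hypothetical names, and it assumes a lease stops protecting ops the moment the absence exceeds the lease period, whereas the real decision compares sequence numbers against what the primary still retains.

```java
import java.time.Duration;

// Hypothetical sketch: a coarse model of "ops-only or full copy?".
final class RecoveryPlanSketch {
    enum Plan { OPS_ONLY, FULL_FILE_COPY }

    static Plan plan(boolean softDeletesEnabled, Duration absent, Duration leasePeriod) {
        // Assumption: the lease holds for exactly leasePeriod after the replica disappears.
        boolean leaseStillHeld = absent.compareTo(leasePeriod) <= 0;
        return softDeletesEnabled && leaseStillHeld ? Plan.OPS_ONLY : Plan.FULL_FILE_COPY;
    }

    public static void main(String[] args) {
        Duration period = Duration.ofHours(12);  // default of index.soft_deletes.retention_lease.period
        System.out.println(plan(true, Duration.ofSeconds(90), period));   // prints OPS_ONLY
        System.out.println(plan(true, Duration.ofHours(13), period));     // prints FULL_FILE_COPY
        System.out.println(plan(false, Duration.ofSeconds(90), period));  // prints FULL_FILE_COPY
    }
}
```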

RecoveryState.Stage — the progress machine

RecoveryState tracks a recovery through a fixed sequence of stages, surfaced by the _recovery API:

stateDiagram-v2
    [*] --> INIT
    INIT --> INDEX: copying segment files (phase 1)
    INDEX --> VERIFY_INDEX: checksum/verify
    VERIFY_INDEX --> TRANSLOG: replay ops (phase 2)
    TRANSLOG --> FINALIZE: mark in-sync, bump global checkpoint
    FINALIZE --> DONE
    DONE --> [*]

grep -n "enum Stage\|INIT\|INDEX\|VERIFY_INDEX\|TRANSLOG\|FINALIZE\|DONE" \
  server/src/main/java/org/opensearch/indices/recovery/RecoveryState.java

Read live recoveries:

# Per-shard recovery progress: stage, % bytes, % translog ops, source/target node
curl -s 'localhost:9200/idx/_recovery?active_only=true&pretty'
curl -s 'localhost:9200/_cat/recovery?v&h=index,shard,type,stage,source_node,target_node,files_percent,translog_ops_percent,time'

A recovery stuck in INDEX with slow files_percent is bandwidth/throttle bound; stuck in TRANSLOG with a huge op count means a long range to replay (the replica fell far behind while its ops were still retained, so the catch-up is ops-only but large).

Throttling: don't let recovery DoS the cluster

Copying segments and replaying ops is heavy I/O and network. When a node drops out, the surviving nodes could try to recover dozens of shards simultaneously and saturate everything. Two layers prevent this:

Control	Setting	What it limits
Per-node recovery bandwidth	`indices.recovery.max_bytes_per_sec` (default 40mb)	byte rate of segment copy
Concurrent incoming/outgoing recoveries	`cluster.routing.allocation.node_concurrent_recoveries` etc. (enforced by `ThrottlingAllocationDecider`)	how many recoveries run at once

grep -rn "max_bytes_per_sec\|RateLimiter\|recoverySettings" \
  server/src/main/java/org/opensearch/indices/recovery/RecoverySettings.java
grep -rn "node_concurrent_recoveries\|ThrottlingAllocationDecider" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ThrottlingAllocationDecider.java

The throttling decider lives on the allocation side (see Shard Allocation) — it returns THROTTLE rather than NO, meaning "not now, retry next round," which is why recoveries trickle in rather than all firing at once.
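
A quick calculation shows what the bandwidth limit means in practice. The sketch below is back-of-envelope arithmetic only: `RecoveryBudgetSketch` and `minCopySeconds` are hypothetical names, the shard sizes are made up, and it assumes all recovery traffic on a node shares one rate limit and that "40mb" means 40 MiB per second.

```java
// Hypothetical sketch: lower-bound copy time under a per-node byte-rate limit.
final class RecoveryBudgetSketch {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;

    /** Minimum whole seconds needed to push totalBytes through the limit (ceiling division). */
    static long minCopySeconds(long totalBytes, long maxBytesPerSec) {
        if (maxBytesPerSec <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return (totalBytes + maxBytesPerSec - 1) / maxBytesPerSec;
    }

    public static void main(String[] args) {
        long limit = 40 * MIB;  // indices.recovery.max_bytes_per_sec default
        System.out.println(minCopySeconds(50 * GIB, limit));       // prints 1280
        System.out.println(minCopySeconds(12 * 50 * GIB, limit));  // prints 15360
    }
}
```

One 50 GiB shard needs at least about 21 minutes of phase 1, and twelve of them need over four hours on a single node, so raising the concurrency limit alone does not shorten the total when the byte rate is the bottleneck.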

Reading exercise

# 1. The source-side orchestration
grep -n "recoverToTarget\|phase1\|phase2\|isSequenceNumberBasedRecovery\|startingSeqNo" \
  server/src/main/java/org/opensearch/indices/recovery/RecoverySourceHandler.java

# 2. The target-side driver
grep -n "doRecovery\|StartRecoveryRequest\|getStartRecoveryRequest\|cleanFiles" \
  server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java

# 3. Checkpoints
grep -n "getProcessedCheckpoint\|globalCheckpoint\|markAllocationIdAsInSync\|in-sync\|inSync" \
  server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java

# 4. Retention leases
grep -n "PeerRecoveryRetentionLease\|renewPeerRecovery\|retention_lease" \
  server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java

Answer:

What does the replica send in StartRecoveryRequest, and how does the primary use the replica's SeqNoStats and store metadata to decide whether to skip phase 1?
Distinguish the local checkpoint from the global checkpoint. Which one defines "all in-sync copies have this op," and which class tracks it?
Explain, step by step, how a replica that was disconnected for 2 minutes performs an ops-only recovery, and what guarantees the needed ops still exist on the primary.
What happens if the retention lease for an absent replica expires before it reconnects? Which stage of the next recovery will be expensive and why?
Map RecoveryState.Stage values to phase 1 vs phase 2 of RecoverySourceHandler.
ThrottlingAllocationDecider returns THROTTLE, not NO, for an over-the-limit recovery. Why does that distinction matter for cluster recovery behavior?

Common bugs and symptoms

Symptom	Likely cause	Where to look
Brief node restart triggers full-shard recopy	retention lease expired or soft-deletes disabled	`index.soft_deletes.retention_lease.period`, `ReplicationTracker`
Recovery stuck in `INDEX`, slow `files_percent`	bandwidth throttle / saturated network	`indices.recovery.max_bytes_per_sec`, `_cat/recovery`
Recovery stuck in `TRANSLOG` with huge op count	replica far behind; long op range to replay	`_recovery` translog ops; lease/period
Many shards "initializing" forever after a node loss	recovery throttle limiting concurrency (by design)	`ThrottlingAllocationDecider`, `node_concurrent_recoveries`
Replica won't start, checksum/verify failure	corrupt segment on source or transfer	`RecoveryState.VERIFY_INDEX`, store metadata
Cluster I/O saturated after big node failure	recovery storm; throttles set too high	lower `max_bytes_per_sec`; check decider
Relocation never completes	target can't finalize / decider blocking	`_cat/recovery type=relocation`, allocation explain

Validation: prove you understand this

Draw the peer-recovery sequence (StartRecoveryRequest → optional phase 1 → phase 2 → finalize), labeling which side is source/target and what flows on each arrow.
Define local checkpoint and global checkpoint precisely, name the class that owns each, and explain why the global checkpoint is what makes ops-only recovery safe.
Walk through a 90-second replica disconnect that results in an ops-only recovery: what was retained, by what mechanism, and why phase 1 was skipped.
Explain the failure path where the same disconnect, but for 13 hours, forces a full file copy. Cite the exact setting involved.
Read a _recovery response stuck at stage: INDEX, files_percent: 12 and propose two distinct causes and the setting you'd check for each.
Explain why recovery throttling uses THROTTLE (retry-later) on the allocation side instead of NO, and what would go wrong if it used NO.

Replication: Document and Segment

A shard has a primary and zero or more replicas, and the replicas have to stay consistent with the primary. OpenSearch offers two fundamentally different ways to keep them in sync. The classic model is document replication: the primary indexes a write, then forwards the same operation to each replica, which indexes it independently. The newer model is segment replication: the primary indexes alone, and replicas copy the resulting Lucene segments from the primary instead of re-doing the indexing work. The choice trades CPU against a different durability/visibility story, and as a contributor you must understand both because half the engine's correctness arguments hinge on which one is active.

This chapter contrasts the two, dissects the document-replication write path (TransportReplicationAction, ReplicationOperation, ReplicationTracker, in-sync allocation IDs, the checkpoints), and the segment-replication services. It builds on Recovery (which uses the same checkpoints) and feeds Refresh, Flush, and Merge (segrep changes who merges).

After this chapter you can:

Explain the document-replication primary→replica write path and how a write is acknowledged.
Define in-sync allocation IDs, local/global checkpoints, and what they guarantee.
Contrast document vs segment replication on CPU, durability, and freshness.
Reason about when index.replication.type: SEGMENT is the right call.

Two models at a glance

	Document replication (default)	Segment replication
What crosses to replica	the indexing operation (the doc)	the Lucene segment files
Replica work	re-indexes the op (CPU + own merges)	copies segments (I/O), no re-indexing, no own merges
Setting	`index.replication.type: DOCUMENT`	`index.replication.type: SEGMENT`
Driver classes	`TransportReplicationAction`, `ReplicationOperation`	`SegmentReplicationSourceService`, `SegmentReplicationTargetService`
Freshness on replica	as fresh as the op forwarded (then refresh)	as fresh as the last copied checkpoint (lags primary)
Durability of replica copy	replica has the op in its own translog	replica relies on copied segments + primary's translog
Best for	low write-to-search latency; replicas that must serve near-primary-fresh reads	read-heavy, many replicas, reduce duplicate indexing CPU

grep -rn "replication.type\|ReplicationType\|SEGMENT\|DOCUMENT" \
  server/src/main/java/org/opensearch/indices/replication/common/ReplicationType.java \
  server/src/main/java/org/opensearch/index/IndexSettings.java

Document replication: the write path

A write enters via TransportShardBulkAction → IndexShard.applyIndexOperationOnPrimary → InternalEngine.index(...) (Lucene + translog). The replication itself is orchestrated by TransportReplicationAction / ReplicationOperation: the primary applies the op locally first, assigns it a sequence number, then forwards it to every replica in the in-sync set. The request is acknowledged to the client once every in-sync copy has either applied the op or been failed out of the in-sync set. wait_for_active_shards is a separate pre-flight check: it gates whether the write starts at all, based on how many copies are active, not how many must confirm.

sequenceDiagram
    participant C as Client
    participant P as Primary shard
    participant R1 as Replica 1
    participant R2 as Replica 2
    C->>P: index op
    Note over P: applyIndexOperationOnPrimary -> Engine.index -> assign seqNo + translog
    par forward to in-sync replicas
        P->>R1: replica request (same op + seqNo + primary term)
        P->>R2: replica request
    end
    R1-->>P: applied (local checkpoint advances)
    R2-->>P: applied
    Note over P: ReplicationTracker advances global checkpoint = min(in-sync local checkpoints)
    P-->>C: ack

grep -n "class ReplicationOperation\|performOnPrimary\|performOnReplica\|markUnavailableShardsAsStale\|getReplicationGroup" \
  server/src/main/java/org/opensearch/action/support/replication/ReplicationOperation.java
grep -n "class TransportReplicationAction\|shardOperationOnPrimary\|shardOperationOnReplica" \
  server/src/main/java/org/opensearch/action/support/replication/TransportReplicationAction.java

In-sync allocation IDs

The primary doesn't replicate to all assigned copies — only to those in the in-sync set: copies that are known to have every op up to the global checkpoint. The set is part of IndexMetadata (inSyncAllocationIds) and is maintained through cluster-state updates. A replica that falls behind or fails is removed from the in-sync set (so the primary stops blocking acks on it) and must re-enter via recovery (markAllocationIdAsInSync after it catches up). This is the mechanism that prevents acknowledging writes that a "replica" never actually received.

grep -rn "inSyncAllocationIds\|in_sync_allocation\|markAllocationIdAsInSync\|removeAllocationId" \
  server/src/main/java/org/opensearch/cluster/metadata/IndexMetadata.java \
  server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java

The checkpoints (shared with recovery)

Checkpoint	Definition	Owner
local checkpoint	highest seqno at or below which a copy has contiguously applied every op	each copy, `LocalCheckpointTracker`
global checkpoint	`min` of in-sync copies' local checkpoints — every op ≤ it is on all in-sync copies	primary, `ReplicationTracker`
primary term	monotonically increasing generation of "who is primary"	cluster, bumped on primary failover

The primary term guards against split-brain writes: a stale ex-primary's replica requests carry an old term and are rejected. Operations at or below the global checkpoint are durable across the in-sync set and are what recovery builds on.

grep -n "primaryTerm\|getGlobalCheckpoint\|getLocalCheckpoint\|computeGlobalCheckpoint" \
  server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java
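
The term check on the replica side can be modelled as a single comparison. This is a minimal sketch, not the engine's code: `PrimaryTermSketch` and `acceptReplicaRequest` are hypothetical names, and a boolean return stands in for whatever rejection mechanism the real shard uses.

```java
// Hypothetical sketch of primary-term fencing on a replica.
final class PrimaryTermSketch {
    private long currentTerm;

    PrimaryTermSketch(long initialTerm) {
        this.currentTerm = initialTerm;
    }

    /** Accept a replica request only if its term is not older than the term already seen. */
    boolean acceptReplicaRequest(long requestTerm) {
        if (requestTerm < currentTerm) {
            return false;           // sender is a stale ex-primary
        }
        currentTerm = requestTerm;  // a newer term means a failover happened; adopt it
        return true;
    }

    public static void main(String[] args) {
        PrimaryTermSketch replica = new PrimaryTermSketch(3);
        System.out.println(replica.acceptReplicaRequest(3));  // prints true
        System.out.println(replica.acceptReplicaRequest(4));  // prints true (new primary)
        System.out.println(replica.acceptReplicaRequest(3));  // prints false (old primary)
    }
}
```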

Segment replication: copy segments, don't re-index

With index.replication.type: SEGMENT, only the primary indexes documents (runs Lucene IndexWriter, refreshes, and merges). After a refresh produces new segments, the primary publishes a new ReplicationCheckpoint; replicas fetch the new/changed segment files from the primary and open a reader over them. The replica never re-indexes and never runs its own merges — it just mirrors files.

flowchart TD
    subgraph Primary
      W[index ops] --> IW[Lucene IndexWriter]
      IW -->|refresh| SEG[new segments]
      SEG --> CP[publish ReplicationCheckpoint]
    end
    CP -->|notify| TS[SegmentReplicationTargetService on replica]
    TS -->|request diff| SRC[SegmentReplicationSourceService on primary]
    SRC -->|stream missing segment files| TS
    TS --> RR[replica opens reader over copied segments]
    RR --> V[replica search now sees primary's segments]

Concern	Class	grep target
Replica side: drives copy	`SegmentReplicationTargetService` + `SegmentReplicationTarget`	`server/src/main/java/org/opensearch/indices/replication/`
Primary side: serves segments	`SegmentReplicationSourceService`	same dir
The version marker	`ReplicationCheckpoint`	`.../replication/checkpoint/ReplicationCheckpoint.java`
Publish-checkpoint action	`PublishCheckpointAction`	`.../replication/checkpoint/`

ls server/src/main/java/org/opensearch/indices/replication/
grep -n "class SegmentReplicationTargetService\|onNewCheckpoint\|startReplication\|class ReplicationCheckpoint" \
  server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java \
  server/src/main/java/org/opensearch/indices/replication/checkpoint/ReplicationCheckpoint.java

Why segrep changes the story

CPU: indexing and merging happen once on the primary, not on every replica. For a 1-primary-5-replica read-heavy index, that's a large CPU/IO saving. (See Refresh, Flush, and Merge: replicas don't run TieredMergePolicy.)
Freshness: a replica is only as fresh as the last checkpoint it copied. There is inherent replication lag — a search hitting a replica may return slightly staler results than one hitting the primary. Document replication forwards each op, so replicas can refresh to near-primary freshness.
Durability/visibility: the replica's searchable state is the primary's segments, not ops it indexed itself. Writes are still made durable via the primary's translog; the replica doesn't independently re-derive segments. This is why the durability argument must be reasoned about per replication type.

Warning: Under segment replication, "read your write" is not guaranteed against a replica until the relevant checkpoint has been copied. If your test indexes a doc and immediately searches a replica, it can legitimately miss — this is a common source of "flaky" integration tests that are actually correct behavior. Search the primary, or wait for the checkpoint.
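
"Wait for the checkpoint" amounts to polling a condition with a deadline. The sketch below is a stand-in for that pattern, not the OpenSearch test framework: `AwaitSketch` and `awaitTrue` are hypothetical names, and the replica checkpoint is simulated with a counter rather than read from a cluster.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: poll until a condition holds or a timeout elapses.
final class AwaitSketch {
    static boolean awaitTrue(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;                        // timed out without the condition holding
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long primaryCheckpoint = 5;
        AtomicLong replicaCheckpoint = new AtomicLong(2);  // simulated: advances on each poll
        boolean caughtUp = awaitTrue(() -> replicaCheckpoint.incrementAndGet() >= primaryCheckpoint, 1_000, 1);
        System.out.println(caughtUp);                       // prints true
    }
}
```

In a real integration test the condition would compare the replica's copied checkpoint with the primary's, and the search assertion would run only after the wait succeeds.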

Comparison summary

flowchart LR
    subgraph DOC[Document replication]
      DP[Primary indexes op] --> DF[forward op to replicas]
      DF --> DR[each replica re-indexes + merges]
    end
    subgraph SEG[Segment replication]
      SP[Primary indexes + merges] --> SCP[publish checkpoint]
      SCP --> SR[replicas copy segments]
    end

Axis	Document	Segment
Indexing CPU	paid N+1 times (primary + N replicas)	paid once (primary)
Merge CPU/IO	paid on every copy	paid on primary only
Network	small (ops)	larger (segment files)
Replica freshness	near-primary after refresh	lags by checkpoint interval
Read-your-write on replica	strong after refresh	not guaranteed until checkpoint copied

Reading exercise

# 1. The replication operation: primary then in-sync replicas
grep -n "ReplicationOperation\|performOnReplicas\|getReplicationGroup\|onSuccessfulReplica" \
  server/src/main/java/org/opensearch/action/support/replication/ReplicationOperation.java

# 2. In-sync set + global checkpoint
grep -n "inSyncAllocationIds\|markAllocationIdAsInSync\|updateGlobalCheckpointOnPrimary\|computeGlobalCheckpoint" \
  server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java

# 3. Segment replication target flow
grep -n "onNewCheckpoint\|startReplication\|getCheckpoint\|forceReplication" \
  server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java

# 4. Where the replication type is chosen and what it gates
grep -rn "ReplicationType\|isSegRepEnabled\|replication.type" \
  server/src/main/java/org/opensearch/index/IndexSettings.java

Answer:

In document replication, after the primary applies an op locally, how does it decide which replicas to forward to? Name the set and the class that owns it.
Define the global checkpoint as a function of local checkpoints, and explain why an op below it is safe.
What role does the primary term play when a former primary's stale request arrives at a replica? Why is this essential for correctness?
In segment replication, what triggers a replica to start copying, and what object tells it what to copy?
Explain precisely why "index a doc, immediately search a replica" can return zero hits under segment replication but (after refresh) not under document replication.
For a read-heavy index with 1 primary and 5 replicas, argue from CPU and merge cost why segment replication is attractive, and name the cost you accept.

Common bugs and symptoms

Symptom	Likely cause	Where to look
Replica returns staler results than primary	segment replication lag (checkpoint not yet copied)	`_cat/segment_replication`, `ReplicationCheckpoint`
Flaky IT: index then search replica misses doc	segrep read-your-write not guaranteed; test assumption wrong	search primary or assertBusy on checkpoint
Write acked but lost on primary failover	replica wasn't actually in-sync; or stale-primary write	`inSyncAllocationIds`, primary term
Writes blocked / `not enough active shards`	`wait_for_active_shards` can't be satisfied	replication settings, allocation
High CPU across all replicas under heavy indexing (doc rep)	every replica re-indexes + merges	consider segment replication
Replica stuck behind, growing lag (segrep)	source can't keep up streaming segments / network	`SegmentReplicationSourceService`, throttling
Requests from an old primary rejected after its node rejoins	the requests carry a stale primary term	primary-term check in `IndexShard` (replica operation permit); `ReplicationTracker`

Validation: prove you understand this

Diagram the document-replication write path from client to ack, marking where the seqno is assigned, where the global checkpoint advances, and what condition gates the ack.
Define in-sync allocation IDs and explain what would go wrong if the primary acked writes to copies not in the in-sync set.
Give the formula for the global checkpoint in terms of in-sync local checkpoints and explain why a lagging in-sync replica holds the global checkpoint back.
Draw both replication models side by side, labeling what crosses the wire and who runs Lucene merges in each.
Explain the read-your-write difference between the two models with a concrete test scenario, and how you'd write a correct integration test for a segment-replicated index.
Recommend document vs segment replication for (a) a write-heavy log index with 1 replica needing fresh reads, and (b) a read-heavy catalog index with 6 replicas. Justify each from CPU, freshness, and network.
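As a worked check on the global-checkpoint item above, here is a minimal Java model, not OpenSearch code. It assumes the global checkpoint is the minimum local checkpoint over the in-sync copies; `GlobalCheckpointSketch`, `computeGlobalCheckpoint`, and the allocation ID strings are hypothetical names chosen for illustration. The real bookkeeping lives in `ReplicationTracker`.

```java
import java.util.Map;
import java.util.Set;

/** Toy model: global checkpoint = min(local checkpoint) over in-sync copies. */
final class GlobalCheckpointSketch {

    /**
     * @param localCheckpoints allocation ID -> highest seqno up to which that copy has processed every op
     * @param inSyncIds        allocation IDs currently in the in-sync set
     */
    static long computeGlobalCheckpoint(Map<String, Long> localCheckpoints, Set<String> inSyncIds) {
        if (inSyncIds.isEmpty()) {
            throw new IllegalStateException("in-sync set must contain at least the primary");
        }
        long min = Long.MAX_VALUE;
        for (String id : inSyncIds) {
            Long checkpoint = localCheckpoints.get(id);
            if (checkpoint == null) {
                throw new IllegalStateException("no local checkpoint known for in-sync copy " + id);
            }
            min = Math.min(min, checkpoint);
        }
        return min;
    }

    public static void main(String[] args) {
        // "recovering" is tracked but not in-sync, so it must not hold the checkpoint back.
        Map<String, Long> local = Map.of(
            "primary", 12L, "replicaA", 12L, "replicaB", 7L, "recovering", 3L);
        Set<String> inSync = Set.of("primary", "replicaA", "replicaB");
        System.out.println(computeGlobalCheckpoint(local, inSync)); // prints 7
    }
}
```

With these inputs the lagging in-sync copy `replicaB` pins the global checkpoint at 7, while the recovering copy at 3 is ignored because it is not in the in-sync set.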

Snapshots and Repositories

A snapshot is OpenSearch's backup-and-restore mechanism, but it is far more interesting than "copy the files somewhere." It is incremental (a new snapshot only stores segments that changed since the last one), repository-backed (a pluggable blob store: a shared filesystem, S3, GCS, Azure), and the foundation for forward-looking features like searchable snapshots and remote-backed storage. A contributor touching this code is one mistake away from a corrupt backup, so understanding the data model — shard-level blobs, RepositoryData, generations — is non-negotiable.

This chapter maps SnapshotsService, RepositoriesService, the Repository / BlobStoreRepository / BlobContainer abstraction, the incremental shard format, and the restore flow. It connects to Refresh, Flush, and Merge (snapshots capture committed segments), Recovery (restore is a snapshot recovery), and Plugin Architecture (repository-s3 is a RepositoryPlugin).

After this chapter you can:

Explain how a snapshot is incremental at the segment level, not the index level.
Trace the snapshot and restore flows from REST to BlobStoreRepository.
Describe RepositoryData and why repository generations matter for consistency.
Name the repository types and how a custom one is contributed.

The cast

Concern	Class	grep target
Orchestrates snapshot/restore as cluster-state ops	`SnapshotsService`, `RestoreService`	`server/src/main/java/org/opensearch/snapshots/`
Manages registered repositories	`RepositoriesService`	`server/src/main/java/org/opensearch/repositories/RepositoriesService.java`
The repository abstraction	`Repository` (interface), `BlobStoreRepository` (base impl)	`server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java`
Blob storage primitive	`BlobStore`, `BlobContainer`, `BlobPath`	`server/src/main/java/org/opensearch/common/blobstore/`
The repository's index-of-snapshots	`RepositoryData`	`server/src/main/java/org/opensearch/repositories/RepositoryData.java`
Per-shard snapshot bookkeeping	`BlobStoreIndexShardSnapshot`, `BlobStoreIndexShardSnapshots`	`.../blobstore/`

ls server/src/main/java/org/opensearch/repositories/
ls server/src/main/java/org/opensearch/snapshots/
grep -n "class BlobStoreRepository\|snapshotShard\|restoreShard\|writeIndexGen\|getRepositoryData" \
  server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java

Repository types

Type	Plugin	Backing store
`fs`	built-in	a shared filesystem path (`path.repo` in `opensearch.yml`)
`s3`	`repository-s3` (in-repo plugin under `plugins/`)	AWS S3 / S3-compatible
`gcs`	`repository-gcs`	Google Cloud Storage
`azure`	`repository-azure`	Azure Blob Storage
`hdfs`	`repository-hdfs`	HDFS
`url`	`repository-url` (bundled module)	read-only over a URL

Each non-fs type is a RepositoryPlugin that provides a Repository factory; the core engine never depends on the S3 SDK directly. See Plugin Architecture.

grep -rn "RepositoryPlugin\|getRepositories\|class S3Repository" \
  plugins/repository-s3/src/main/java/org/opensearch/repositories/s3/ | head
grep -rn "interface RepositoryPlugin\|getRepositories" \
  server/src/main/java/org/opensearch/plugins/RepositoryPlugin.java

curl -s -XPUT 'localhost:9200/_snapshot/my_repo' \
  -H 'content-type: application/json' \
  -d '{"type":"fs","settings":{"location":"/mnt/backups","compress":true}}'

The data model: blobs, generations, RepositoryData

A repository is a flat blob namespace organized by convention:

<repo root>/
  index-N                 # RepositoryData for generation N (the source of truth)
  index.latest            # pointer to the current generation N
  snap-<uuid>.dat         # SnapshotInfo per snapshot
  meta-<uuid>.dat         # global metadata per snapshot
  indices/
    <indexUUID>/
      meta-<uuid>.dat      # index metadata
      0/                   # shard 0
        __<uuid>           # segment data blobs (the actual Lucene files, renamed)
        snap-<uuid>.dat    # this snapshot's shard-level file list
        index-<gen>        # BlobStoreIndexShardSnapshots: the shard's snapshot index
      1/ ...

RepositoryData (serialized as index-N) is the repository's master index: which snapshots exist, which indices each contains, and the mapping of index → shard generations. It is read at the start of every snapshot/restore and rewritten (bumping N) whenever the repository changes. index.latest names the current N.

grep -n "class RepositoryData\|EMPTY_REPO_GEN\|getGenId\|shardGenerations\|indexMetaDataGenerations" \
  server/src/main/java/org/opensearch/repositories/RepositoryData.java

Warning: Two clusters writing to the same repository (for example, the same S3 bucket registered read-write in both) corrupt RepositoryData, because the generation bookkeeping assumes a single writer. This is why pointing two clusters at one repository is unsupported and a known foot-gun; if a second cluster needs access, register the repository there as read-only. Generation checks (the generation and pending generation tracked in the repository's metadata, plus the concurrent-modification checks in BlobStoreRepository) exist to detect this.
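The single-writer assumption can be sketched as an expected-generation check. This is a simplified in-memory model, not the actual `BlobStoreRepository` logic: `RepoGenerationSketch` and its fields are hypothetical, and only the method name `writeIndexGen` is borrowed from the grep target above. Real blob stores generally offer no atomic compare-and-set, so in practice such a check detects a second writer rather than preventing one.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy repository root: every change writes index-(N+1) and fails if N is stale. */
final class RepoGenerationSketch {
    private final Map<String, String> blobs = new HashMap<>(); // blob name -> contents
    private long latestGen = -1;                                // stand-in for index.latest; -1 = empty repo

    long latestGeneration() {
        return latestGen;
    }

    String readRepositoryData() {
        return latestGen < 0 ? "" : blobs.get("index-" + latestGen);
    }

    /** Writes the next generation only if the caller saw the current one. */
    long writeIndexGen(long expectedGen, String newRepositoryData) {
        if (expectedGen != latestGen) {
            throw new IllegalStateException(
                "concurrent modification: expected generation " + expectedGen
                    + " but repository is at " + latestGen);
        }
        long next = expectedGen + 1;
        blobs.put("index-" + next, newRepositoryData);
        latestGen = next;
        return next;
    }

    public static void main(String[] args) {
        RepoGenerationSketch repo = new RepoGenerationSketch();
        long seenByA = repo.latestGeneration(); // cluster A reads generation -1
        long seenByB = repo.latestGeneration(); // cluster B reads the same generation
        repo.writeIndexGen(seenByA, "snapshots=[snap1]"); // A wins; repo is now at generation 0
        try {
            repo.writeIndexGen(seenByB, "snapshots=[snap2]"); // B's view is stale
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Without the check, cluster B would overwrite the repository index with a version that never heard of `snap1`, which is the corruption the warning describes.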

Why snapshots are incremental — at the segment level

Lucene segments are immutable once written (see Refresh, Flush, and Merge). A snapshot of a shard records the list of committed segment files and uploads only the segment blobs the repository doesn't already have (matched by file name + length + checksum from the previous snapshot's shard index). A second snapshot of an unchanged shard uploads nothing new — it just writes a new shard-level snap file referencing the existing blobs.

flowchart LR
    S1[Snapshot 1: shard has segments A,B,C] -->|upload A,B,C| R[(Repository)]
    M[merge: A,B -> D; new write -> E] --> S2[Snapshot 2: shard has C,D,E]
    S2 -->|C already in repo, upload D,E only| R

This is why the first snapshot of a large index is slow (everything uploads) and subsequent ones are fast (only deltas). It also means deleting a snapshot must reference-count blobs: a segment blob is only deleted when no remaining snapshot references it (BlobStoreRepository computes this during deleteSnapshots).
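The upload delta and the delete-time reference check can be modelled with plain sets. This is a sketch under a simplifying assumption: each segment file is identified by a single string, whereas the paragraph above says the real match uses file name, length, and checksum. `IncrementalSnapshotSketch` and both method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of segment-level incrementality and blob reference checks. */
final class IncrementalSnapshotSketch {

    /** Files the new snapshot must upload: committed files the repository lacks. */
    static Set<String> filesToUpload(Set<String> committedFiles, Set<String> alreadyInRepo) {
        Set<String> upload = new TreeSet<>(committedFiles);
        upload.removeAll(alreadyInRepo);
        return upload;
    }

    /** Blobs referenced only by the snapshot being deleted, hence safe to remove. */
    static Set<String> blobsSafeToDelete(Map<String, Set<String>> snapshots, String deleted) {
        Set<String> candidates = new TreeSet<>(snapshots.get(deleted));
        for (Map.Entry<String, Set<String>> entry : snapshots.entrySet()) {
            if (!entry.getKey().equals(deleted)) {
                candidates.removeAll(entry.getValue()); // still referenced elsewhere
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> snap1 = Set.of("A", "B", "C"); // before the merge
        Set<String> snap2 = Set.of("C", "D", "E"); // after merge A,B -> D and new segment E
        System.out.println(filesToUpload(snap2, snap1)); // prints [D, E]
        System.out.println(
            blobsSafeToDelete(Map.of("snap1", snap1, "snap2", snap2), "snap1")); // prints [A, B]
    }
}
```

Deleting `snap1` frees only A and B; C survives because `snap2` still references it, which is why deleting a snapshot can free less space than expected.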

grep -n "snapshotShard\|incremental\|existingFiles\|FileInfo\|filesToSnapshot\|deleteSnapshots" \
  server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java

The snapshot flow

sequenceDiagram
    participant U as User
    participant CM as Cluster manager (SnapshotsService)
    participant DN as Data nodes (per shard)
    participant Repo as Repository (BlobStoreRepository)
    U->>CM: PUT _snapshot/repo/snap1 {indices}
    CM->>Repo: read RepositoryData (index-N)
    CM->>CM: cluster-state: SnapshotsInProgress entry
    Note over CM,DN: each shard snapshotted where its primary lives
    par per shard
        DN->>DN: acquire a committed Lucene index (engine snapshot)
        DN->>Repo: upload new segment blobs (incremental) + shard snap file
        DN-->>CM: shard snapshot SUCCESS/FAILED
    end
    CM->>Repo: write SnapshotInfo (snap-uuid.dat) + new RepositoryData (index-N+1)
    CM->>CM: remove SnapshotsInProgress entry
    CM-->>U: snapshot complete

Note that the snapshot runs against a committed view (a Lucene commit / engine snapshot), on the SNAPSHOT thread pool, per shard on the node holding its primary. It does not block indexing: new writes land in new segments that are not part of this snapshot.

grep -n "class SnapshotsService\|SnapshotsInProgress\|beginSnapshot\|snapshotShard\|endSnapshot" \
  server/src/main/java/org/opensearch/snapshots/SnapshotsService.java

The restore flow

Restore is the inverse, and it routes through recovery:

flowchart TD
    U[POST _snapshot/repo/snap1/_restore] --> RS[RestoreService]
    RS --> CS[cluster-state: create indices with SnapshotRecoverySource]
    CS --> ALLOC[shards created unassigned, RecoverySource = SNAPSHOT]
    ALLOC --> REC[per shard: restore segments from repo via recovery]
    REC --> ST[shard STARTED]

Each restored shard becomes an unassigned shard with a SnapshotRecoverySource; shard allocation places it, and recovery performs a snapshot recovery (download segment blobs into the shard's store, then open it). You can rename indices on restore, restore a subset of indices, and restore with adjusted settings.

grep -n "class RestoreService\|SnapshotRecoverySource\|restoreSnapshot\|renamePattern" \
  server/src/main/java/org/opensearch/snapshots/RestoreService.java
curl -s -XPOST 'localhost:9200/_snapshot/my_repo/snap1/_restore' \
  -H 'content-type: application/json' \
  -d '{"indices":"idx","rename_pattern":"idx","rename_replacement":"idx_restored"}'

Forward-looking: searchable snapshots and remote-backed storage

The same blob model underpins newer capabilities you should know exist:

Searchable snapshots — mount a snapshot as a read-only searchable index whose data lives in the repository, fetched and cached on demand rather than fully restored to local disk. Great for cold/warm tiers: keep old indices searchable without paying for hot storage.
Remote-backed storage / remote store — index data (segments + translog) is continuously durably written to a remote repository, decoupling durability from local disk and enabling cheaper replicas and faster recovery (a remote-store shard recovers from the remote store rather than a peer).

Both reuse Repository/BlobContainer. Treat them as the strategic direction; classic snapshot/restore remains the baseline you must master first.

grep -rn "searchable_snapshot\|SearchableSnapshot\|remote_store\|RemoteStore\|remoteStoreEnabled" \
  server/src/main/java/org/opensearch/index/ server/src/main/java/org/opensearch/snapshots/ | head

Reading exercise

# 1. The repository abstraction surface
grep -n "void snapshotShard\|void restoreShard\|getRepositoryData\|writeIndexGen\|RepositoryMetadata" \
  server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java

# 2. Incremental shard upload
grep -n "snapshotShard\|existingFiles\|filesToSnapshot\|BlobStoreIndexShardSnapshot" \
  server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java

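# 3. RepositoryData / generations (the blob-name constants live in BlobStoreRepository)
grep -n "getGenId\|shardGenerations\|index.latest\|INDEX_FILE_PREFIX\|resolveIndexName" \
  server/src/main/java/org/opensearch/repositories/RepositoryData.java \
  server/src/main/java/org/opensearch/repositories/blobstore/BlobStoreRepository.java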

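# 4. Restore -> recovery (the shard-level restore is driven from StoreRecovery)
grep -n "SnapshotRecoverySource\|restoreShard\|recoverFromRepository" \
  server/src/main/java/org/opensearch/snapshots/RestoreService.java \
  server/src/main/java/org/opensearch/index/shard/StoreRecovery.java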

Answer:

At what granularity is a snapshot incremental — index, shard, or segment — and what property of Lucene segments makes that possible?
What does RepositoryData (the index-N blob) contain, and why does its generation number N increase on every snapshot/delete?
Why is registering one repository (e.g., one S3 bucket) in two clusters unsupported? What corruption does it cause and what check guards against it?
Trace a restore: name the recovery source type a restored shard gets and which subsystem actually downloads the segments.
When you delete a snapshot, why can't OpenSearch just delete that snapshot's blobs? What does it have to compute first?
How does repository-s3 avoid making the core engine depend on the AWS SDK? Name the plugin interface.

Common bugs and symptoms

Symptom	Likely cause	Where to look
`[my_repo] could not read repository data`	corrupt/contended `RepositoryData` (two writers)	one-writer rule; repository generation checks
First snapshot slow, later ones fast	incremental — first uploads everything	expected; `snapshotShard`
Deleting a snapshot frees little space	other snapshots still reference the blobs (ref-counting)	`deleteSnapshots` blob GC
Restore fails: index already exists	restore target collides with a live index	use `rename_pattern`/`rename_replacement`, or close/delete first
Snapshot `PARTIAL`	some shards failed (unassigned/red at snapshot time)	`SnapshotInfo.shardFailures`, shard health
`repository verification failed` on register	`path.repo` not set / bucket creds wrong / not shared across nodes	`path.repo`, repo settings, `_snapshot/repo/_verify`
Snapshot stuck `IN_PROGRESS` after node loss	shard's primary moved mid-snapshot	`SnapshotsInProgress`, cluster-manager logs

Validation: prove you understand this

Draw the repository blob layout (index-N, index.latest, snap-*.dat, indices/<uuid>/<shard>/__*) and say which blob is the source of truth.
Explain segment-level incrementality with a two-snapshot example where a merge happened between them — say exactly which blobs the second snapshot uploads.
Define RepositoryData and the role of the generation N, and explain why a single-writer invariant is required.
Diagram the snapshot flow (cluster-manager orchestration + per-shard upload + final RepositoryData write) and the restore flow (down through recovery).
Explain what must happen before a snapshot's segment blobs can be safely deleted, and why naive deletion would corrupt other snapshots.
Describe, at a high level, how searchable snapshots and remote-backed storage reuse the same Repository/BlobContainer model and what they change about where shard data lives.

Circuit Breakers and Memory

A search engine that lets a single greedy request OOM the JVM is a search engine that takes down a node for everyone. Circuit breakers are OpenSearch's defense: a hierarchy of accountants that track estimated heap usage per category (fielddata, aggregation request memory, in-flight network bytes, …) and throw a CircuitBreakingException before the allocation that would blow the heap, rather than letting the JVM die. Understanding the breakers — and the BigArrays machinery that feeds them — is essential both for diagnosing production "Data too large" errors and for writing engine code that participates in accounting instead of silently leaking heap.

This chapter dissects HierarchyCircuitBreakerService, the individual breakers, the real-memory parent breaker, and how BigArrays/PageCacheRecycler make heap allocations measurable. It connects to Aggregations (which trip the request breaker), DocValues and Fielddata (the fielddata breaker), and The Transport Layer (in-flight requests).

After this chapter you can:

Name the breakers, what each tracks, and their default limits.
Explain the difference between the summing parent breaker and the real-memory parent breaker.
Read a CircuitBreakingException and _nodes/stats/breaker and find the culprit.
Explain how aggregations and BigArrays reserve and release bytes against a breaker.

The breaker hierarchy

Breaker	Tracks	Default limit	grep target
`parent`	sum of all child breakers (or real heap)	~70% (95% with real-memory)	`HierarchyCircuitBreakerService`
`fielddata`	heap used by fielddata (uninverted `text` fields)	~40% of heap	see DocValues and Fielddata
`request`	per-request data structures (agg buckets, BigArrays)	~60% of heap	aggregation collectors
`in_flight_requests`	bytes of in-flight transport/HTTP requests	~100% of heap	transport layer
`accounting`	things held after a request ends (e.g., Lucene segment memory)	~100% of heap	engine/segments

grep -n "class HierarchyCircuitBreakerService\|CircuitBreaker.FIELDDATA\|CircuitBreaker.REQUEST\|IN_FLIGHT\|ACCOUNTING\|PARENT" \
  server/src/main/java/org/opensearch/indices/breaker/HierarchyCircuitBreakerService.java
ls libs/core/src/main/java/org/opensearch/core/common/breaker/ 2>/dev/null || \
  find . -path '*common/breaker/CircuitBreaker.java'

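Note: The CircuitBreaker interface (with CircuitBreakingException) lives in libs/core (org.opensearch.core.common.breaker). The ChildMemoryCircuitBreaker implementation and the service that wires the hierarchy and reads settings live in server (org.opensearch.common.breaker.ChildMemoryCircuitBreaker and org.opensearch.indices.breaker.HierarchyCircuitBreakerService). These paths have moved between releases, so confirm them with find on your checkout. Know both locations.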

The two parent-breaker strategies

The parent breaker can enforce its limit two ways, controlled by indices.breaker.total.use_real_memory:

Strategy	What it measures	How it decides to trip
Summing (`use_real_memory: false`)	sum of child breaker reservations	trips when child reservations exceed `indices.breaker.total.limit`
Real-memory (`use_real_memory: true`, the default)	actual JVM heap used (`MemoryMXBean`)	trips when measured heap exceeds the limit, regardless of what the children think

Real-memory mode is strictly better at catching the heap usage children don't account for (Lucene internals, object overhead, third-party allocations). It is why a request can be rejected even when the named breakers look fine — the parent saw real heap pressure. It pairs with G1 GC heuristics so it doesn't trip on transient post-allocation-pre-GC spikes.
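The difference between the two strategies fits in a few lines. This is an illustrative sketch, not the `HierarchyCircuitBreakerService` code: `ParentBreakerSketch` and its method names are hypothetical, the byte figures are made up, and the only real API used is the JDK's `MemoryMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

/** Toy comparison of the summing and real-memory parent strategies. */
final class ParentBreakerSketch {

    /** Summing: trust only what the child breakers say they have reserved. */
    static boolean summingWouldTrip(long[] childReservations, long limitBytes) {
        long total = 0;
        for (long reserved : childReservations) {
            total += reserved;
        }
        return total > limitBytes;
    }

    /** Real-memory: compare measured heap plus the new reservation to the limit. */
    static boolean realMemoryWouldTrip(long heapUsedBytes, long newBytesReserved, long limitBytes) {
        return heapUsedBytes + newBytesReserved > limitBytes;
    }

    /** What the JVM reports as used heap right now. */
    static long currentHeapUsed() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        // Pretend heap of 1000 bytes: children account for 300, but the heap really holds 940.
        long[] children = {100, 200};
        System.out.println(summingWouldTrip(children, 700));    // prints false
        System.out.println(realMemoryWouldTrip(940, 20, 950));  // prints true
        System.out.println("heap used now: " + currentHeapUsed() + " bytes");
    }
}
```

The same request passes the summing check and fails the real-memory check, which matches the "parent tripped but children look small" symptom described later in this chapter.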

grep -n "use_real_memory\|USE_REAL_MEMORY\|realMemoryUsage\|MemoryMXBean\|getHeapMemoryUsage\|G1" \
  server/src/main/java/org/opensearch/indices/breaker/HierarchyCircuitBreakerService.java

How a breaker actually gates an allocation

A breaker is not magic — code must call it. Before reserving heap, code calls breaker.addEstimateBytesAndMaybeBreak(bytes, label), which adds to the running total and, if the new total would exceed the limit, throws CircuitBreakingException instead of letting the allocation proceed. When the work finishes, code calls breaker.addWithoutBreaking(-bytes) to release the reservation.

sequenceDiagram
    participant A as Aggregator / BigArrays
    participant CB as request CircuitBreaker
    participant P as parent breaker
    A->>CB: addEstimateBytesAndMaybeBreak(n, "<agg>")
    CB->>P: check parent (sum or real heap) + this child
    alt over limit
        P-->>A: throw CircuitBreakingException (Data too large)
    else ok
        CB-->>A: reserved
        A->>A: allocate the bytes (e.g., grow a LongArray)
        Note over A: ... do work ...
        A->>CB: addWithoutBreaking(-n)  (release on close)
    end

grep -rn "addEstimateBytesAndMaybeBreak\|addWithoutBreaking\|CircuitBreakingException" \
  server/src/main/java/org/opensearch/ | head

Warning: If engine code reserves bytes but forgets to release them on the failure path (no try/finally), the breaker's accounting leaks — it keeps rejecting requests even though the heap is actually free. This is a real class of bug; always release reservations in finally/close.
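The reserve/release pattern the warning calls for looks roughly like this. It is a single-threaded toy, not `ChildMemoryCircuitBreaker`: `BreakerSketch` and `sumWithReservation` are hypothetical, `IllegalStateException` stands in for `CircuitBreakingException`, and only the two method names mirror the real interface.

```java
/** Toy breaker showing reserve-before-allocate and release-in-finally. */
final class BreakerSketch {
    private final long limitBytes;
    private long usedBytes;

    BreakerSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserve bytes, or throw before the caller allocates anything. */
    void addEstimateBytesAndMaybeBreak(long bytes, String label) {
        long newUsed = usedBytes + bytes;
        if (newUsed > limitBytes) {
            throw new IllegalStateException(
                "[" + label + "] Data too large: would be " + newUsed + ", limit " + limitBytes);
        }
        usedBytes = newUsed;
    }

    /** Adjust the reservation without checking the limit; negative values release. */
    void addWithoutBreaking(long bytes) {
        usedBytes += bytes;
    }

    long getUsed() {
        return usedBytes;
    }

    /** Reserve, do the work, and always release, even when the work throws. */
    static long sumWithReservation(BreakerSketch breaker, int n, boolean failMidway) {
        long bytes = (long) n * Long.BYTES;
        breaker.addEstimateBytesAndMaybeBreak(bytes, "sum"); // outside try: nothing to release if this throws
        try {
            long[] scratch = new long[n];
            if (failMidway) {
                throw new RuntimeException("simulated failure");
            }
            long sum = 0;
            for (int i = 0; i < n; i++) {
                scratch[i] = i;
                sum += scratch[i];
            }
            return sum;
        } finally {
            breaker.addWithoutBreaking(-bytes); // release on success and on failure
        }
    }

    public static void main(String[] args) {
        BreakerSketch breaker = new BreakerSketch(1024);
        System.out.println(sumWithReservation(breaker, 10, false)); // prints 45
        System.out.println(breaker.getUsed());                      // prints 0
    }
}
```

Moving the release out of `finally` makes the failure path leave 80 bytes reserved on every failed call, and the breaker eventually rejects everything on an idle node.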

BigArrays and PageCacheRecycler — measurable, recyclable heap

Aggregations and other hot paths don't new long[hugeSize]. They allocate through BigArrays, which (a) reserves the bytes against the request breaker before allocating, (b) hands out array-like abstractions (LongArray, DoubleArray, ByteArray, ObjectArray) backed by reusable pages, and (c) returns those pages to a PageCacheRecycler on release to avoid GC churn. This is the mechanism that makes aggregation memory both bounded (breaker-checked) and efficient (page-recycled).

grep -n "class BigArrays\|newLongArray\|newDoubleArray\|adjustBreaker\|class PageCacheRecycler" \
  server/src/main/java/org/opensearch/common/util/BigArrays.java \
  server/src/main/java/org/opensearch/common/util/PageCacheRecycler.java

When a terms aggregation over millions of buckets trips the request breaker, it is almost always a BigArrays allocation (a growing bucket-ordinal array) calling addEstimateBytesAndMaybeBreak and hitting the limit. The fix is rarely "raise the limit" — it's "ask for fewer buckets" (see Aggregations, size/shard_size/cardinality).
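The combination of "reserve first" and "recycle pages" can be sketched as below. This is a toy, not `BigArrays` or `PageCacheRecycler`: `PageRecyclerSketch`, its methods, and the page size of 2048 longs are assumptions made for illustration, and the limit check stands in for the request breaker.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Toy page pool: every page is reserved against a limit and reused after release. */
final class PageRecyclerSketch {
    static final int PAGE_LONGS = 2048; // assumed page size for this sketch (16 KiB of longs)

    private final Deque<long[]> freePages = new ArrayDeque<>();
    private final long limitBytes;
    private long reservedBytes;
    int freshAllocations; // pages created with `new`, to make recycling visible

    PageRecyclerSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserve one page against the limit, then hand out a recycled or fresh page. */
    long[] obtainPage() {
        long pageBytes = (long) PAGE_LONGS * Long.BYTES;
        if (reservedBytes + pageBytes > limitBytes) {
            throw new IllegalStateException("[request] Data too large");
        }
        reservedBytes += pageBytes;
        long[] page = freePages.poll();
        if (page == null) {
            page = new long[PAGE_LONGS];
            freshAllocations++;
        } else {
            Arrays.fill(page, 0L); // recycled pages must not leak old contents
        }
        return page;
    }

    /** Release the reservation and return the page to the pool for reuse. */
    void releasePage(long[] page) {
        reservedBytes -= (long) PAGE_LONGS * Long.BYTES;
        freePages.push(page);
    }

    public static void main(String[] args) {
        PageRecyclerSketch pool = new PageRecyclerSketch(2L * PAGE_LONGS * Long.BYTES);
        long[] first = pool.obtainPage();
        pool.obtainPage();
        pool.releasePage(first);
        long[] again = pool.obtainPage();
        System.out.println(again == first);        // prints true
        System.out.println(pool.freshAllocations); // prints 2
    }
}
```

A bucket array that keeps growing asks for page after page until the reservation check throws, which is the trip described in the paragraph above.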

How aggregations trip the request breaker

flowchart TD
    Q[Big terms/cardinality agg] --> C[Aggregator grows BigArrays for buckets]
    C --> R[BigArrays reserves bytes on request breaker]
    R -->|under limit| OK[allocate, keep collecting]
    R -->|would exceed| X[CircuitBreakingException: Data too large for request]
    X --> RESP[search returns 429 / per-shard failure in _shards]

The exception is the intended outcome — a rejected query is infinitely better than a dead node. As a contributor, when you see a circuit-breaking test failure, the question is almost never "is the breaker wrong"; it's "is my code reserving an unbounded amount of memory."

grep -rn "REQUEST\|CircuitBreaker.REQUEST\|addEstimateBytesAndMaybeBreak" \
  server/src/main/java/org/opensearch/search/aggregations/ | head

Observability

# Per-node breaker state: limit, estimated usage, trip count, overhead
curl -s 'localhost:9200/_nodes/stats/breaker?pretty'

# Just the parent + fielddata estimates
curl -s 'localhost:9200/_nodes/stats/breaker?pretty' | grep -A6 -E 'parent|fielddata'

Field	Meaning
`limit_size_in_bytes`	the breaker's cap
`estimated_size_in_bytes`	current reservation
`tripped`	how many times this breaker has thrown
`overhead`	multiplier applied to estimates (conservatism factor)

A climbing fielddata.estimated_size with tripped > 0 points at fielddata; a high request trip count points at aggregations; in_flight_requests trips point at huge bulk/search payloads (transport).

Reading exercise

# 1. The service: breakers, limits, parent strategy
grep -n "registerBreaker\|childCircuitBreakers\|parentLimit\|checkParentLimit\|use_real_memory" \
  server/src/main/java/org/opensearch/indices/breaker/HierarchyCircuitBreakerService.java

# 2. The breaker interface (libs/core) + child impl (server)
find . -path '*common/breaker/CircuitBreaker.java'
grep -rn "addEstimateBytesAndMaybeBreak\|addWithoutBreaking\|class ChildMemoryCircuitBreaker" \
  libs/core/src/main/java/org/opensearch/core/common/breaker/ \
  server/src/main/java/org/opensearch/common/breaker/ 2>/dev/null

# 3. BigArrays reserving against the breaker
grep -n "adjustBreaker\|breaker\|newLongArray\|resize" \
  server/src/main/java/org/opensearch/common/util/BigArrays.java

# 4. Where aggs hold the request breaker
grep -rn "bigArrays\|CircuitBreaker.REQUEST\|breakerService" \
  server/src/main/java/org/opensearch/search/aggregations/AggregatorBase.java

Answer:

List the five breakers and one example of a workload that trips each.
Explain the difference between summing and real-memory parent strategies and why real-memory catches heap pressure the named breakers miss.
Trace one allocation through addEstimateBytesAndMaybeBreak: what is checked, what is thrown, and when is the reservation released?
Why does aggregation memory go through BigArrays instead of plain arrays? Name two things BigArrays provides (hint: one is the breaker, one is recycling).
A node keeps rejecting requests with "parent breaker tripped" but fielddata/request estimates look small. What is the likely explanation and which setting governs the behavior?
Describe the accounting-leak bug: how can a missing release cause persistent false CircuitBreakingExceptions, and what code pattern prevents it?

Common bugs and symptoms

Symptom	Likely cause	Where to look
`CircuitBreakingException: [request] Data too large`	aggregation buckets/BigArrays exceeded request breaker	Aggregations; reduce `size`/cardinality
`[fielddata] Data too large`	fielddata on `text` field uninverted to heap	DocValues and Fielddata; use `keyword`
`[parent] Data too large` but children look small	real-memory parent saw unaccounted heap pressure	`use_real_memory`, GC state, reduce concurrency
`[in_flight_requests] Data too large`	giant bulk/search payload	Transport; smaller bulk batches
Breaker keeps tripping even when cluster is idle	accounting leak (reservation not released)	grep for `addEstimateBytesAndMaybeBreak` without matching release; `try/finally`
Raising the limit "fixed" it, then node OOM'd	masking a real memory problem by disabling the safety net	revert; fix the query/code, not the limit
Flaky test trips breaker under random seed	test allocates near the limit; non-deterministic	bound the test's memory; don't widen the breaker

Validation: prove you understand this

Draw the breaker hierarchy with the parent on top and the four children, and annotate each child with its default limit and one triggering workload.
Explain, with the MemoryMXBean in the picture, why the default real-memory parent breaker can reject a request that the summed child reservations say is fine.
Write the two breaker calls (reserve and release) an aggregator makes around a BigArrays allocation, and explain where the release must live to avoid an accounting leak.
Given _nodes/stats/breaker showing request.tripped: 42 and small fielddata, identify the subsystem at fault and the user-facing fix.
Explain why "increase indices.breaker.request.limit" is usually the wrong response to a tripped request breaker, and what the right response is.
Describe how BigArrays + PageCacheRecycler make aggregation memory both bounded and cheap, naming the breaker interaction and the recycling behavior.

Plugin Architecture

Almost everything interesting about OpenSearch in production — security, k-NN vector search, SQL/PPL, alerting, ML — is a plugin. The core engine ships a deliberately small surface and exposes dozens of extension points; plugins implement them to add field types, queries, aggregations, REST endpoints, analyzers, repositories, scripts, and entire subsystems. As a contributor you will either write plugins or write the extension points plugins consume, and either way you must understand how PluginsService discovers, isolates, and wires them.

This chapter maps the Plugin base class and the extension interfaces, the PluginsService loading flow, the isolated-classloader model, the plugin-descriptor.properties contract, the Gradle build plugin, and the forward-looking Extensions SDK. It ties together the consumer-side deep dives: Query DSL (SearchPlugin), Aggregations (SearchPlugin), Snapshots and Repositories (RepositoryPlugin), The Action Framework (ActionPlugin), and Mapping and Analysis (MapperPlugin/AnalysisPlugin).

After this chapter you can:

Name the major plugin extension interfaces and what each contributes.
Trace plugin discovery and loading through PluginsService.
Explain the isolated-classloader model and why it both protects and constrains plugins.
Build an in-repo plugin skeleton with the opensearch.opensearchplugin Gradle plugin.

The base class and the extension interfaces

A plugin extends org.opensearch.plugins.Plugin (which provides lifecycle hooks and component creation) and implements zero or more extension interfaces, each of which the engine queries during startup for contributions.

Interface	Contributes	Consumed by
`ActionPlugin`	`TransportAction`s, REST handlers, action filters	Action Framework, REST Layer
`SearchPlugin`	queries, aggregations, suggesters, rescorers, fetch sub-phases	Query DSL, Aggregations
`MapperPlugin`	custom field types (`Mapper`/`MappedFieldType`)	Mapping and Analysis
`AnalysisPlugin`	analyzers, tokenizers, token/char filters	analysis registry
`IngestPlugin`	ingest pipeline processors	ingest service
`ClusterPlugin`	allocation deciders, cluster-state listeners	Shard Allocation
`NetworkPlugin`	transports, HTTP transports, transport interceptors	Transport Layer
`RepositoryPlugin`	snapshot repository implementations	Snapshots
`EnginePlugin`	custom `EngineFactory`	Engine Internals
`ScriptPlugin`	script engines/contexts	scripting service
`ExtensiblePlugin`	lets other plugins extend this plugin (SPI)	inter-plugin extension

ls server/src/main/java/org/opensearch/plugins/
grep -rln "interface .*Plugin" server/src/main/java/org/opensearch/plugins/
grep -n "interface SearchPlugin\|getQueries\|getAggregations\|getRescorers" \
  server/src/main/java/org/opensearch/plugins/SearchPlugin.java

The base Plugin also has general hooks: createComponents(...) (register singletons into the dependency-injection graph), getSettings() (declare custom Setting<T>), getNamedWriteables() / getNamedXContent() (register serialization — see Serialization and BWC), and getExecutorBuilders() (custom thread pools).

grep -n "createComponents\|getSettings\|getNamedWriteables\|getNamedXContent\|getExecutorBuilders" \
  server/src/main/java/org/opensearch/plugins/Plugin.java

PluginsService: discovery and loading

At node startup, PluginsService scans the plugins/ directory of the installed distribution. Each plugin is a subdirectory containing its jars and a plugin-descriptor.properties. PluginsService reads the descriptor, validates version compatibility, builds an isolated classloader per plugin, instantiates the Plugin subclass, and exposes the collection to the rest of the node.

flowchart TD
    N[Node startup] --> PS[PluginsService scans plugins/ dir]
    PS --> D[read plugin-descriptor.properties per plugin]
    D --> V[verify version + java version + dependencies]
    V --> CL[build isolated ClassLoader for the plugin jars]
    CL --> I[instantiate Plugin subclass]
    I --> R[Node queries each extension interface: getQueries/getActions/getMappers...]
    R --> W[contributions wired into registries, ActionModule, SearchModule, etc.]
    W --> C[createComponents -> DI graph; node starts]

grep -n "class PluginsService\|loadBundle\|loadPlugin\|PluginInfo\|verifyCompatibility\|getPluginsAndModules" \
  server/src/main/java/org/opensearch/plugins/PluginsService.java

Note: Modules (under modules/ in the source, e.g. analysis-common, aggs-matrix-stats, transport-netty4) use the same plugin mechanism but are bundled and loaded by default — you can't uninstall them. Optional plugins (under plugins/, or installed via bin/opensearch-plugin) are opt-in. Same machinery, different lifecycle.

Isolated classloaders — protection and constraint

Each plugin gets its own classloader, a child of the core classloader. This:

Protects the core and other plugins: a plugin can bundle its own dependency versions without colliding with the engine's.
Constrains plugins: a plugin can see core classes (parent) but not another plugin's classes — unless that other plugin is ExtensiblePlugin and deliberately exposes an SPI. This is why two unrelated plugins can't directly call each other.

flowchart TD
    SYS[System / core classloader: org.opensearch.* + Lucene] --> P1[Plugin A classloader]
    SYS --> P2[Plugin B classloader]
    SYS --> P3[Plugin C classloader]
    P1 -. cannot see .-> P2
    P3 -->|SPI via ExtensiblePlugin| P3X[Extending plugin, whose loader can also see C's classes]

grep -rn "ExtensionLoader\|loadExtensions\|reloadSPI\|ExtensiblePlugin\|URLClassLoader\|ExtendedPluginsClassLoader" \
  server/src/main/java/org/opensearch/plugins/

The SPI mechanism (ExtensiblePlugin + ExtensionLoader) is how, for example, an analysis plugin can be extended by another plugin's tokenizer — the extensible plugin's classloader is made visible to its extenders.
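The visibility rules can be demonstrated with plain JDK classloaders. This is a simplified sketch: `ClassLoaderIsolationSketch`, `pluginLoader`, and `canSee` are hypothetical, the loaders carry no jars, and a single-parent `URLClassLoader` only approximates the multi-parent loader that `PluginsService` builds for extending plugins.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Toy model: sibling plugin loaders share a parent but cannot see each other. */
final class ClassLoaderIsolationSketch {

    /** A plugin loader with no jars of its own, delegating to the given parent. */
    static URLClassLoader pluginLoader(ClassLoader parent) {
        return new URLClassLoader(new URL[0], parent);
    }

    /** True if `target` is `loader` itself or one of its ancestors. */
    static boolean canSee(ClassLoader loader, ClassLoader target) {
        for (ClassLoader cl = loader; cl != null; cl = cl.getParent()) {
            if (cl == target) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        ClassLoader core = ClassLoader.getSystemClassLoader(); // stand-in for the core loader
        URLClassLoader pluginA = pluginLoader(core);
        URLClassLoader pluginB = pluginLoader(core);
        URLClassLoader pluginC = pluginLoader(core);       // pretend C is an ExtensiblePlugin
        URLClassLoader extendsC = pluginLoader(pluginC);   // extender delegates to C's loader

        System.out.println(canSee(pluginA, core));     // prints true
        System.out.println(canSee(pluginA, pluginB));  // prints false
        System.out.println(canSee(extendsC, pluginC)); // prints true
        // Parent delegation: core classes resolve to the same Class object in every plugin.
        System.out.println(pluginA.loadClass("java.lang.String") == String.class); // prints true
    }
}
```

Because lookups only walk up the parent chain, plugin A has no path to plugin B's classes, while the extender reaches C's classes through its parent.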

plugin-descriptor.properties — the contract

Every plugin ships a plugin-descriptor.properties that the engine validates before loading. The critical fields:

name=my-plugin
description=Adds a custom query and aggregation
version=3.0.0
opensearch.version=3.0.0
java.version=21
classname=org.example.MyPlugin
# Comments need their own line: a trailing '#' would become part of the value.
# Plugins this one extends via SPI (optional)
extended.plugins=
# Optional install dir name
custom.foldername=

opensearch.version must match the running node's version (plugins are version-locked — a 2.x plugin will not load on 3.x). classname names the Plugin subclass to instantiate. The opensearch.opensearchplugin Gradle plugin generates this file from your build.gradle.

grep -rn "opensearch.version\|classname\|java.version\|extended.plugins" \
  server/src/main/java/org/opensearch/plugins/PluginInfo.java
find . -name "plugin-descriptor.properties" 2>/dev/null | head

Warning: Version mismatch is the #1 plugin install failure. Building a plugin against OpenSearch 3.1.0 and installing it on a 3.0.0 node fails at load with a clear compatibility error. Plugin authors must rebuild per OpenSearch version; this is intentional, because the internal APIs plugins use are not guaranteed stable across versions (unlike the REST API).
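A minimal sketch of the version lock, assuming exact string equality stands in for the real compatibility check. It uses only java.util.Properties; the class DescriptorCheckSketch and its methods are hypothetical and are not the PluginInfo implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Simplified descriptor validation; a stand-in for illustration, not the real PluginInfo code. */
final class DescriptorCheckSketch {

    /** Returns the entrypoint class name if the descriptor may load on nodeVersion. */
    static String validate(String descriptorText, String nodeVersion) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(descriptorText));

        String name = require(props, "name");
        String builtFor = require(props, "opensearch.version");
        String classname = require(props, "classname");

        // Assumption: exact equality models the version lock described above.
        if (!builtFor.equals(nodeVersion)) {
            throw new IllegalArgumentException(
                "plugin [" + name + "] was built for [" + builtFor + "] but the node is [" + nodeVersion + "]");
        }
        return classname;
    }

    /** Fetches a mandatory key or fails with a message naming it. */
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("descriptor is missing [" + key + "]");
        }
        return value.trim();
    }

    public static void main(String[] args) throws IOException {
        String descriptor = "name=my-plugin\nopensearch.version=3.0.0\nclassname=org.example.MyPlugin\n";
        System.out.println(validate(descriptor, "3.0.0"));
        try {
            validate(descriptor, "3.1.0");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The check runs before any plugin class is instantiated, which is why a mismatch surfaces as a load-time error rather than a later linkage failure.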

Building and installing a plugin

The opensearch.opensearchplugin Gradle plugin packages a plugin zip with the descriptor, jars, and (optional) plugin-security.policy.

// build.gradle of an in-repo plugin
apply plugin: 'opensearch.opensearchplugin'

opensearchplugin {
  name 'my-plugin'
  description 'Adds a custom query and aggregation'
  classname 'org.example.MyPlugin'
  licenseFile rootProject.file('LICENSE.txt')
  noticeFile rootProject.file('NOTICE.txt')
}

# Build the plugin zip (in-repo)
./gradlew :plugins:my-plugin:assemble
ls plugins/my-plugin/build/distributions/

# Install into a distribution
bin/opensearch-plugin install file:///abs/path/my-plugin-3.0.0.zip
bin/opensearch-plugin list
bin/opensearch-plugin remove my-plugin

bin/opensearch-plugin install verifies the descriptor, checks the version, prompts for confirmation of any security-policy grants the plugin requests (unless run with --batch), and unpacks the zip into plugins/. The plugin is loaded on the next node start.

grep -rn "class InstallPluginCommand\|class ListPluginsCommand\|verify" \
  distribution/tools/plugin-cli/src/main/java/org/opensearch/plugins/ | head

In-repo vs out-of-repo plugins

	In-repo (`plugins/` of OpenSearch)	Out-of-repo (separate repo)
Examples	`analysis-icu`, `repository-s3`, `discovery-ec2`, `mapper-murmur3`	`security`, `k-NN`, `sql`, `alerting`, `ml-commons`, `index-management`
Builds against	the local source tree	published `org.opensearch:opensearch` artifacts
Release cadence	with the engine	own cadence, version-matched to engine
Why separate	small, generic, low-churn	large subsystems with their own teams/roadmaps

The big functional plugins live in their own repos under opensearch-project/ precisely because they're large, independently-maintained subsystems. They consume the same extension interfaces you've seen here, built against the published artifacts.

Forward-looking: the Extensions SDK

Plugins run in-process with full access to internal APIs — powerful, but it means a buggy plugin can destabilize the node and plugin authors are coupled to unstable internals. OpenSearch 2.10+ ships an experimental Extensions SDK (opensearch-sdk-java) that runs extensions out-of-process, communicating over a defined protocol, for fault isolation and looser version coupling. Treat it as the strategic direction for some extension categories; in-process plugins remain the mainstream, fully-supported model you should learn first.

grep -rn "ExtensionsManager\|extension\b\|DiscoveryExtensionNode" \
  server/src/main/java/org/opensearch/extensions/ 2>/dev/null | head

Reading exercise

# 1. The extension interfaces and their methods
for f in SearchPlugin ActionPlugin MapperPlugin AnalysisPlugin RepositoryPlugin EnginePlugin ClusterPlugin; do
  echo "== $f =="; grep -n "interface $f\|default\|List<" \
    server/src/main/java/org/opensearch/plugins/$f.java | head
done

# 2. PluginsService loading
grep -n "loadBundle\|loadPlugin\|verifyCompatibility\|getPlugins\|onModule" \
  server/src/main/java/org/opensearch/plugins/PluginsService.java

# 3. Descriptor parsing + version lock
grep -n "opensearch.version\|isOfficialPlugin\|readFromProperties\|class PluginInfo" \
  server/src/main/java/org/opensearch/plugins/PluginInfo.java

# 4. A real in-repo plugin's Plugin class
find plugins -name "*Plugin.java" -path "*repository-s3*" -o -name "*Plugin.java" -path "*analysis-icu*" | head

Answer:

Name the extension interface for each of: a custom query, a custom field type, a custom REST endpoint, a snapshot repository, and a custom analyzer.
Walk PluginsService from "scan plugins dir" to "node has a live Plugin instance," naming the descriptor validation and the classloader step.
Why does each plugin get its own classloader? Give one thing it protects and one thing it prevents.
What does ExtensiblePlugin enable that a normal plugin can't, and how does the SPI/ExtensionLoader make one plugin's classes visible to another?
Why must a plugin be rebuilt for each OpenSearch version, and where is that version compatibility enforced?
Contrast in-repo and out-of-repo plugins on build dependency and release cadence, and name two examples of each.

Common bugs and symptoms

Symptom	Likely cause	Where to look
`plugin [x] is incompatible with version [y]` at startup	`opensearch.version` in descriptor ≠ node version	rebuild plugin for the node version; `PluginInfo`
`NoClassDefFoundError` for a class another plugin defines	classloader isolation; not exposed via SPI	`ExtensiblePlugin`, `extended.plugins`
Custom query/agg `unknown [name]` though plugin "installed"	plugin not actually loaded, or `getQueries`/`getAggregations` not implemented	`_cat/plugins`, `SearchPlugin` impl
`AccessControlException` / security policy denial	plugin needs a permission not granted in its `plugin-security.policy`	plugin's security policy; SecurityManager
Two dependency versions clash	bundled a jar that collides with core	rely on classloader isolation / shade conflicting deps
Plugin loads but components not wired	`createComponents` not returning them / DI miss	`Plugin.createComponents`
`opensearch-plugin install` hangs on a prompt	interactive security-policy confirmation	run with `--batch` in automation

Validation: prove you understand this

From memory, list eight plugin extension interfaces and exactly one thing each contributes to the engine.
Draw the PluginsService loading flow from directory scan through descriptor validation, classloader creation, instantiation, and extension-point querying.
Explain the isolated-classloader model with a diagram showing why Plugin A can't see Plugin B's classes and how ExtensiblePlugin makes an exception.
Write a minimal plugin-descriptor.properties and explain which field locks the plugin to an engine version and which names the entrypoint class.
Show the Gradle config and the bin/opensearch-plugin commands to build, install, list, and remove an in-repo plugin.
Explain why the big functional plugins (security, k-NN, SQL) live in separate repos and how they still consume the same extension interfaces, then state how the Extensions SDK differs from in-process plugins.

Serialization and Backward Compatibility

This is the chapter that separates a contributor from a maintainer. Almost every other deep dive describes how OpenSearch works now; this one describes how OpenSearch keeps working while it changes — across a rolling upgrade, between a 3.0 node and a 3.1 node in the same cluster, between an old index written years ago and the engine reading it today. Get serialization BWC wrong and you don't get a failing unit test — you get a corrupted cluster during a customer's upgrade, the single worst class of bug this project can ship. Reviewers reject PRs over BWC more than over almost anything else, and learning to see the BWC implications of a one-line change is the skill this chapter builds.

It depends on The Transport Layer (the wire that carries Writeables) and connects to every chapter that defines a serialized type: ClusterState, Query DSL, Aggregations, Replication, and the Level-9 BWC lab.

After this chapter you can:

Read and write version-gated Writeable serialization correctly.
Explain why reordering stream writes silently corrupts a mixed-version cluster.
Use Version, out.getVersion(), NamedWriteableRegistry/NamedXContentRegistry appropriately.
Reason about wire BWC, index/Lucene format BWC, and how qa/ BWC tests prove it.

Three kinds of compatibility — don't conflate them

Kind	Spans	Mechanism	Breaks look like
Wire (transport) BWC	nodes of different versions in one cluster (rolling upgrade)	`Version`-gated `StreamInput`/`StreamOutput`	mixed-cluster deserialize errors, silent field corruption
Index/Lucene format BWC	an index written by an older engine, read by a newer one	Lucene codec back-compat (N-1 major)	"this index was created with a version that is no longer compatible"
REST API BWC	clients across versions	additive JSON, deprecation cycle	clients break on removed/renamed fields

This chapter is mostly about the wire kind, with a section on index format. REST BWC is a policy concern covered in the compatibility mindset chapter.

The serialization primitives

Every object that crosses the transport wire implements Writeable: a writeTo(StreamOutput) method and a StreamInput constructor (the read side). The contract is brutally simple and brutally unforgiving: the read side must read fields in exactly the same order, of exactly the same type, as the write side wrote them.

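import java.io.IOException;

import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.common.io.stream.Writeable;

public class MyState implements Writeable {
    private final String name;
    private final int count;

    // Plain constructor for callers and tests that build an instance directly.
    public MyState(String name, int count) {
        this.name = name;
        this.count = count;
    }

    // READ: must consume fields in exactly the order writeTo emits them.
    public MyState(StreamInput in) throws IOException {
        this.name = in.readString();
        this.count = in.readVInt();
    }

    // WRITE: the order here defines the wire layout.
    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(name);
        out.writeVInt(count);
    }
}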

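grep -n "interface Writeable\|writeTo\|class StreamInput\|class StreamOutput" \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/Writeable.java \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/StreamInput.java \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/StreamOutput.java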
grep -n "public .*read\|public void write" \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/StreamInput.java | head -40

StreamInput/StreamOutput carry a Version: the wire version negotiated for the connection, which is the lower of the local node's version and the peer's version. That Version is the lever for all wire BWC.

The cardinal sin: reordering or retyping stream writes

A StreamOutput is a positional byte stream. There are no field names. If you write [string, vint] and read [vint, string], the reader doesn't error politely — it reads your string's length prefix as an int, then reads garbage bytes as a string, and you get corruption that may not surface until much later.

flowchart LR
    W["writeTo: writeString(name); writeVInt(count)"] --> B["bytes: [len][name...][count]"]
    B --> R1["correct read: readString(); readVInt() OK"]
    B --> R2["reordered read: readVInt(); readString() -> reads name-length as count, then garbage"]

Warning: Never reorder, insert-in-the-middle, retype, or remove a field in an existing writeTo/StreamInput pair without version-gating. This is the #1 BWC bug. New fields go at the end, behind a version check. The order in writeTo and the order in the StreamInput constructor must match each other and must match every version that's still in the mixed-cluster window.
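The silent misread is easy to reproduce with plain java.io streams. This sketch uses DataOutputStream/DataInputStream as a stand-in for StreamOutput/StreamInput, so the encodings differ (a 2-byte length prefix and a fixed 4-byte int rather than variable-length ints), but the positional failure mode is the same. The class name ReorderSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Positional-stream misread, shown with java.io streams as an analog of the transport streams. */
final class ReorderSketch {

    /** Writes [string, int] in that order. */
    static byte[] write(String name, int count) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);   // 2-byte length prefix + modified UTF-8 bytes
            out.writeInt(count);  // 4 bytes, big-endian
        }
        return bytes.toByteArray();
    }

    /** Correct reader: same order as the writer. */
    static String readCorrect(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            int count = in.readInt();
            return name + ":" + count;
        }
    }

    /** Buggy reader: order swapped. For the input in main, no exception is thrown. */
    static String readReordered(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();    // swallows the length prefix and the first two name bytes
            String name = in.readUTF();  // treats the next two bytes as a string length
            return name + ":" + count;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = write("ab", 7);             // bytes: 00 02 61 62 00 00 00 07
        System.out.println(readCorrect(data));    // prints "ab:7"
        System.out.println(readReordered(data));  // prints ":156002"
    }
}
```

The reordered reader returns an empty name and a count of 156002 (0x00026162) without any error, which is exactly the kind of value that propagates quietly until something downstream misbehaves.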

Version gating: adding a field safely

When you add a field in version X, you must serialize it only when talking to a node that is on or after X — otherwise an older node reads bytes it doesn't expect. The pattern:

public MyState(StreamInput in) throws IOException {
    this.name = in.readString();
    this.count = in.readVInt();
    if (in.getVersion().onOrAfter(Version.V_3_1_0)) {
        this.newField = in.readOptionalString();   // only present from 3.1+
    } else {
        this.newField = null;                       // older peer didn't send it
    }
}

@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeString(name);
    out.writeVInt(count);
    if (out.getVersion().onOrAfter(Version.V_3_1_0)) {
        out.writeOptionalString(newField);          // only send to 3.1+ peers
    }
}

The symmetry is the whole game: the if condition on read and write must be identical, and the field must be appended after all pre-existing fields. The Version class defines named constants and ordering.
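To see why the two conditions must agree, here is a self-contained simulation using java.io streams and integer version ids. Everything in it is an assumption for illustration: VersionGateSketch, the numeric constants, and the trailing-bytes check are not OpenSearch APIs, and the boolean presence flag only imitates what an optional-string write does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Simulated version-gated serialization; versions are plain ints, not the real Version class. */
final class VersionGateSketch {
    static final int V_3_0_0 = 30000; // hypothetical numeric ids
    static final int V_3_1_0 = 30100;

    /** Writes name and count always; writes newField only for a 3.1+ wire version. */
    static byte[] write(int wireVersion, String name, int count, String newField) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            out.writeInt(count);
            if (wireVersion >= V_3_1_0) {            // gate: must match the reader's
                out.writeBoolean(newField != null);  // presence flag for the optional value
                if (newField != null) {
                    out.writeUTF(newField);
                }
            }
        }
        return bytes.toByteArray();
    }

    /** Reads with the same gate; returns {name, count, newField-or-null}. */
    static String[] read(int wireVersion, byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            int count = in.readInt();
            String newField = null;
            if (wireVersion >= V_3_1_0 && in.readBoolean()) {
                newField = in.readUTF();
            }
            if (in.available() != 0) {
                // Leftover bytes mean writer and reader disagreed about the layout.
                throw new IOException("trailing bytes: asymmetric version gate");
            }
            return new String[] { name, Integer.toString(count), newField };
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] toOldPeer = write(V_3_0_0, "a", 1, "x");
        System.out.println(read(V_3_0_0, toOldPeer)[2]);  // null: field never hit the wire
        byte[] toNewPeer = write(V_3_1_0, "a", 1, "x");
        System.out.println(read(V_3_1_0, toNewPeer)[2]);  // x
    }
}
```

Reading the 3.1 layout with the 3.0 gate leaves bytes unconsumed, which this sketch reports as an error. On a real connection those leftover bytes can instead be consumed as the start of the next field or message.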

grep -n "public static final Version V_3\|onOrAfter\|before\|CURRENT\|minimumCompatibilityVersion" \
  server/src/main/java/org/opensearch/Version.java
grep -rn "out.getVersion().onOrAfter\|in.getVersion().onOrAfter" \
  server/src/main/java/org/opensearch/ | head

Operation	Safe?	Rule
Add a field at the end, version-gated	yes	append + symmetric `onOrAfter` check
Remove a field	no (until all old versions drop out of support)	keep reading/writing it, gate around it, retire after the window
Reorder fields	never	breaks positional decoding
Change a field's type/width	no	gate as "old type for old versions, new type for new"
Add an enum value	careful	old nodes can't deserialize a value they don't know; gate the producer

NamedWriteable: polymorphic types over the wire

A QueryBuilder, an Aggregation, an AllocationDecider decision — these are polymorphic. The stream needs to know which concrete class to instantiate. That's NamedWriteable: each type declares a stable string name (getWriteableName()), and a NamedWriteableRegistry maps name → reader. Core types and every plugin's types (see Plugin Architecture) register here.

flowchart LR
    O[writeNamedWriteable q] --> N["bytes: [name='bool'][BoolQueryBuilder fields]"]
    N --> R[readNamedWriteable: registry lookup 'bool' -> reader -> BoolQueryBuilder StreamInput ctor]

grep -n "interface NamedWriteable\|getWriteableName\|class NamedWriteableRegistry\|writeNamedWriteable\|readNamedWriteable" \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/NamedWriteable.java \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/NamedWriteableRegistry.java

The XContent analog is NamedXContentRegistry (name → fromXContent parser), used for JSON/YAML parsing of the same polymorphic types (see Query DSL). A plugin that adds a query must register in both: NamedWriteableRegistry for transport, NamedXContentRegistry for the REST body.

Note: Changing a getWriteableName() string is a wire break — old nodes registered the old name. The name is part of the contract, not an implementation detail.
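The name-to-reader lookup can be sketched in a few lines of plain Java. This is a toy registry under stated assumptions: NamedRegistrySketch, its Reader interface, and the string payloads are hypothetical and only mirror the shape of NamedWriteableRegistry, not its API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Toy name -> reader registry for polymorphic values on a byte stream. */
final class NamedRegistrySketch {

    /** Reads one concrete type's fields from the stream. */
    interface Reader {
        Object read(DataInputStream in) throws IOException;
    }

    private final Map<String, Reader> readers = new HashMap<>();

    /** Registers a reader under a stable wire name; duplicate names are rejected. */
    void register(String name, Reader reader) {
        if (readers.putIfAbsent(name, reader) != null) {
            throw new IllegalArgumentException("duplicate name [" + name + "]");
        }
    }

    /** Writer side: the name goes first, then the type's own fields. */
    static byte[] writeNamed(String name, String payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            out.writeUTF(payload);
        }
        return bytes.toByteArray();
    }

    /** Reader side: look the name up, then hand the stream to that type's reader. */
    Object readNamed(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            Reader reader = readers.get(name);
            if (reader == null) {
                throw new IOException("unknown named writeable [" + name + "]");
            }
            return reader.read(in);
        }
    }

    public static void main(String[] args) throws IOException {
        NamedRegistrySketch registry = new NamedRegistrySketch();
        registry.register("term", in -> "TermQuery(" + in.readUTF() + ")");
        System.out.println(registry.readNamed(writeNamed("term", "user:kim")));
        try {
            registry.readNamed(writeNamed("term_v2", "user:kim")); // sender renamed the type
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The second call shows the rename hazard: the bytes are well-formed, but a receiver that only knows the old name has no reader to dispatch to.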

Mixed clusters and the rolling upgrade

During a rolling upgrade, a cluster temporarily runs two versions at once. Every transport message between an old node and a new node must serialize correctly in both directions. The cluster's effective behavior is gated by the minimum node version in the cluster — the elected cluster manager (formerly master) won't enable features that older nodes can't speak.

sequenceDiagram
    participant N3 as Node v3.1 (new)
    participant N0 as Node v3.0 (old)
    Note over N3,N0: connection negotiates min(version) = 3.0
    N3->>N0: writeTo with out.getVersion()=3.0 -> SKIPS the 3.1 field
    N0->>N3: writeTo with out.getVersion()=3.0 -> N0 knows no 3.1 field; sends 3.0 layout
    N3->>N3: reads with in.getVersion()=3.0 -> does NOT read the 3.1 field

The negotiated Version on each connection is what out.getVersion() and in.getVersion() return: the minimum of the two endpoints' versions. A 3.1 node therefore downshifts its wire format automatically when talking to a 3.0 node, but only if your version gate is correct.
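As a small worked example of the negotiation rule, the sketch below computes which layout is used in each direction. It assumes versions can be compared as plain integers; NegotiationSketch and its constants are hypothetical.

```java
/** Toy model of per-connection version negotiation; versions are plain ints here. */
final class NegotiationSketch {
    static final int V_3_0_0 = 30000; // hypothetical numeric ids
    static final int V_3_1_0 = 30100;

    /** The wire version both ends use: the lower of the two node versions. */
    static int negotiated(int localVersion, int remoteVersion) {
        return Math.min(localVersion, remoteVersion);
    }

    /** True if a field introduced in 'introducedIn' is serialized on this connection. */
    static boolean fieldOnWire(int localVersion, int remoteVersion, int introducedIn) {
        return negotiated(localVersion, remoteVersion) >= introducedIn;
    }

    public static void main(String[] args) {
        System.out.println(fieldOnWire(V_3_1_0, V_3_0_0, V_3_1_0)); // new -> old: false
        System.out.println(fieldOnWire(V_3_0_0, V_3_1_0, V_3_1_0)); // old -> new: false
        System.out.println(fieldOnWire(V_3_1_0, V_3_1_0, V_3_1_0)); // new -> new: true
    }
}
```

Because the minimum is symmetric, both ends of a connection evaluate the same gate to the same result, so the field appears on the wire only once both nodes are on 3.1.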

grep -rn "minimumNodeVersion\|minimumCompatibilityVersion\|getMinNodeVersion\|smallestNonClientNodeVersion" \
  server/src/main/java/org/opensearch/cluster/node/DiscoveryNodes.java \
  server/src/main/java/org/opensearch/Version.java

Index and Lucene format BWC

Wire BWC is about nodes; index BWC is about data on disk. OpenSearch reads indices created by the previous major version (Lucene's N-1 codec back-compat), but not older than that — an index created two majors ago must be reindexed. IndexMetadata records the creation version (index.version.created), and the engine refuses to open an index whose format is too old.

grep -rn "index.version.created\|VERSION_CREATED\|minimumIndexCompatibilityVersion\|isCompatible" \
  server/src/main/java/org/opensearch/cluster/metadata/IndexMetadata.java \
  server/src/main/java/org/opensearch/Version.java

Note: This is why upgrade docs say "you can upgrade across one major, then reindex." A 1.x index can be read by 2.x; to use it on 3.x you must reindex it, typically while still on 2.x, because upgrading the node does not change index.version.created. _cat/indices and the index.version.created setting tell you what you have.
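The N-1 rule reduces to a comparison on major versions. The sketch below is a deliberate simplification: it looks only at major numbers, ignores any legacy version mapping, and uses hypothetical names; the real decision is driven by the version constants in Version.java.

```java
/** Simplified N-1 index compatibility rule on major versions only. */
final class IndexCompatSketch {

    /** A node on major N opens indices created on major N or N-1. */
    static boolean canOpen(int nodeMajor, int indexCreatedMajor) {
        return indexCreatedMajor == nodeMajor || indexCreatedMajor == nodeMajor - 1;
    }

    /** Maps the rule to the action an operator would take. */
    static String advice(int nodeMajor, int indexCreatedMajor) {
        if (indexCreatedMajor > nodeMajor) {
            return "index is newer than the node";
        }
        if (canOpen(nodeMajor, indexCreatedMajor)) {
            return "open directly";
        }
        return "reindex on an intermediate major first";
    }

    public static void main(String[] args) {
        System.out.println(advice(2, 1)); // open directly
        System.out.println(advice(3, 2)); // open directly
        System.out.println(advice(3, 1)); // reindex on an intermediate major first
    }
}
```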

Proving BWC: the tests that gate your PR

BWC is not a thing you eyeball — it's a thing you test, and OpenSearch has dedicated machinery.

Test type	What it proves	Where
`AbstractWireSerializingTestCase<T>`	a `Writeable` round-trips (write then read = equal) and survives serialization to an older version	`test/framework`
`AbstractSerializingTestCase`	XContent round-trip too	`test/framework`
`qa/` BWC / mixed-cluster / rolling-upgrade	a real two-version cluster actually works	`qa/`
`bwcVersion` Gradle wiring	runs the above against specific old versions	build logic

The serialization round-trip test is the cheapest, highest-value BWC guard. It serializes your object, deserializes it, and asserts equality — and the bwc-aware variants do this at an older Version, catching a missing or asymmetric gate immediately.

public class MyStateTests extends AbstractWireSerializingTestCase<MyState> {
    @Override protected Writeable.Reader<MyState> instanceReader() { return MyState::new; }
    @Override protected MyState createTestInstance() {
        return new MyState(randomAlphaOfLength(8), randomIntBetween(0, 100));
    }
    // The framework also exercises serialization at random older versions.
}

grep -rn "class AbstractWireSerializingTestCase\|assertSerialization\|copyInstance\|VersionUtils.randomVersion" \
  test/framework/src/main/java/org/opensearch/test/AbstractWireSerializingTestCase.java
ls qa/
grep -rn "bwcVersion\|bwc_short_version\|BWC_VERSION" qa/ build.gradle settings.gradle 2>/dev/null | head

Run them:

# Round-trip serialization tests for your type
./gradlew :server:test --tests "org.opensearch.cluster.metadata.IndexMetadataTests"

# The real mixed-version / rolling-upgrade QA (slow, the actual proof)
./gradlew :qa:rolling-upgrade:check
./gradlew :qa:mixed-cluster:check

Reading exercise

# 1. The stream primitives and how Version rides along
grep -n "getVersion\|setVersion\|class StreamOutput" \
  libs/core/src/main/java/org/opensearch/core/common/io/stream/StreamOutput.java

# 2. A real version-gated writeTo in core (find live examples)
grep -rln "out.getVersion().onOrAfter" server/src/main/java/org/opensearch/cluster/ | head
# pick one and read both its writeTo and StreamInput ctor:
grep -n "out.getVersion().onOrAfter\|in.getVersion().onOrAfter\|writeTo\|StreamInput" \
  server/src/main/java/org/opensearch/cluster/metadata/IndexMetadata.java | head -40

# 3. NamedWriteable registration
grep -rn "getNamedWriteables\|new NamedWriteableRegistry.Entry" \
  server/src/main/java/org/opensearch/search/SearchModule.java | head

# 4. The round-trip test base
grep -n "copyInstance\|assertEqualInstances\|serialize\|deserialize" \
  test/framework/src/main/java/org/opensearch/test/AbstractWireSerializingTestCase.java

Answer:

Why does reordering two writeTo writes corrupt rather than error? Explain in terms of the positional byte stream and length prefixes.
Write the symmetric read/write pair for adding an optional long field in Version.V_3_2_0. What must be true of the two if conditions?
What does out.getVersion() return during a rolling upgrade when a 3.2 node talks to a 3.0 node, and how does that make the new field "disappear"?
Why must a plugin's custom query register in both NamedWriteableRegistry and NamedXContentRegistry? What does each one serve?
Explain index/Lucene format BWC: how many major versions back can an index be read, where the creation version is recorded, and what the upgrade path is for an index two majors old.
Which single test base class catches a missing version gate fastest, and what does its bwc-aware variant do that a plain round-trip doesn't?

Common bugs and symptoms

Symptom	Likely cause	Where to look
Mixed-cluster nodes fail to deserialize a message	new field added without a version gate (or asymmetric gate)	the type's `writeTo`/`StreamInput`; add/fix `onOrAfter`
Silent wrong values after upgrade, no exception	reordered/retyped stream fields → positional misread	diff `writeTo` order vs `StreamInput` order across versions
`qa:rolling-upgrade` red but unit tests green	BWC only manifests across real versions	run the QA suite; add a version gate
`unknown NamedWriteable [x]` on one node	name not registered on that node / renamed `getWriteableName`	`NamedWriteableRegistry` entries; keep the name stable
"index created with a version that is no longer compatible"	index too old (2+ majors)	reindex / upgrade through the intermediate major
Round-trip test fails at a random old version	missing gate caught by `VersionUtils.randomVersion`	the `if (out.getVersion()...)` you forgot
Enum deserialization fails on old node	added an enum value the old node doesn't know	gate the producer so old peers never receive it

Validation: prove you understand this

From memory, write a correct version-gated writeTo/StreamInput pair that adds one new optional field in Version.V_3_1_0, and state the invariant that makes it safe in a mixed cluster.
Explain, with a byte-level sketch, why swapping the order of two writes corrupts the stream silently instead of throwing.
Describe what happens on each connection during a 3.0→3.1 rolling upgrade in terms of out.getVersion()/in.getVersion(), and why the negotiated version is the minimum of the two endpoints.
Distinguish wire BWC, index/Lucene format BWC, and REST BWC, giving one concrete failure mode and one mitigation for each.
Explain the role of NamedWriteableRegistry vs NamedXContentRegistry for a polymorphic type, and why renaming getWriteableName() is a breaking change.
Name the test base class for serialization round-trips, write a minimal subclass for a two-field type, and state what running ./gradlew :qa:rolling-upgrade:check proves that the unit test cannot.

Apache Lucene: The Engine Beneath OpenSearch

Every OpenSearch shard is a Lucene index. Not "backed by," not "similar to" — a shard is a directory of Lucene segment files, driven by a Lucene IndexWriter, read through a Lucene DirectoryReader. The OpenSearch Engine is a wrapper that adds versioning, sequence numbers, the translog, and the distributed-system machinery — but the bytes on disk, the inverted index, the BKD trees, the HNSW vector graph, the columnar doc values, the merge policy: all Lucene. If you want to be a core OpenSearch contributor rather than a plugin author, you have to be able to read Lucene.

This section is the layer the rest of the OpenSearch deep dives stand on. The engine internals chapter told you that InternalEngine drives an IndexWriter; here you learn what an IndexWriter actually does. The docvalues/fielddata chapter told you OpenSearch reads columnar values through IndexFieldData; here you learn the Lucene DocValuesFormat underneath. This is the advanced layer: the same density, one level deeper.

After this chapter you can: explain the precise relationship between an OpenSearch shard and a Lucene index; check out and build apache/lucene from source; open a real index with the Luke inspector; navigate the seven internal structures (segments, postings, points, doc values, the writer, HNSW, SIMD); and understand why "upgrade Lucene" is one of the highest-leverage recurring tasks in the OpenSearch repo.

Why a contributor must know Lucene

Three blunt reasons.

1. Many "OpenSearch" bugs are Lucene behavior. A range query that's slow on a date field is a BKD-tree question. A terms aggregation that eats heap is a doc-values / global-ordinals question. A vector search with bad recall is an HNSW-graph question. When you triage an issue, the first skill is knowing which layer owns the behavior — and very often the answer is "this is Lucene doing exactly what Lucene does, and the OpenSearch fix is to configure it differently." You cannot make that call if Lucene is a black box.

2. Vector search is Lucene's HNSW. OpenSearch's k-NN plugin has three engines (faiss, lucene, nmslib). The lucene engine is literally org.apache.lucene.codecs.*HnswVectorsFormat + KnnFloatVectorField + HnswGraph. Even the native faiss engine is wired in through a custom Lucene Codec. The vector-search future of OpenSearch is built on Lucene primitives covered in HNSW Vector Search in Lucene and SIMD and the Panama Vector API.

3. Upgrading the bundled Lucene version is a core task that never ends. OpenSearch bundles one specific Lucene version. When Lucene ships a new minor (a new codec, a faster HNSW format, an API break), someone has to do the upgrade PR: bump the dependency, fix the breaks, regenerate any pinned codec versions, adapt to renamed/removed APIs, and re-run the whole test suite. It is recurring, high-skill, high-visibility work — exactly the kind of thing that earns trust.

# Find the real upgrade work in the OpenSearch repo:
gh search prs   --repo opensearch-project/OpenSearch "Upgrade to Lucene"  --limit 20
gh search issues --repo opensearch-project/OpenSearch "Upgrade to Lucene"  --limit 20
# And the bundled version OpenSearch currently pins:
grep -n "lucene" /path/to/OpenSearch/buildSrc/version.properties \
  /path/to/OpenSearch/gradle/libs.versions.toml 2>/dev/null   # location depends on the branch

Note: OpenSearch and Lucene release on independent cadences. The bundled Lucene is whatever the upgrade PRs have landed — never assume; grep the version. Throughout this section we say "the current default LuceneNNNCodec" rather than nailing a number, because the number changes and the concept does not.

The OpenSearch ↔ Lucene mapping

A small table you should be able to reproduce from memory. The left column is OpenSearch; the right is the Lucene class it ultimately drives.

OpenSearch concept	Lucene reality
A shard	A Lucene index = a `Directory` of segment files
`InternalEngine`	Owns a Lucene `IndexWriter` + a `DirectoryReader`/`SearcherManager`
`Engine.Index` / `Engine.Delete`	`IndexWriter.updateDocument` / `deleteDocuments`
Refresh	`DirectoryReader.openIfChanged` (a new near-real-time reader)
Flush	`IndexWriter.commit()` (writes a new `segments_N`)
Merge	Lucene `MergePolicy` (`TieredMergePolicy`) + `MergeScheduler`
A `MappedFieldType`	Lucene `FieldType` + the `Field` subclasses it produces
`match` / `term` query	Lucene `TermQuery` → terms dict → postings
Numeric / `date` / `geo_point` range	Lucene `PointValues` → BKD tree
Sort / aggregate value access	Lucene `DocValues` (NUMERIC/SORTED_SET/…)
k-NN `lucene` engine	Lucene `KnnFloatVectorField` + `*HnswVectorsFormat`
`IndexSearcher` (per shard)	Lucene `IndexSearcher` over a `DirectoryReader`

flowchart TD
    OS["OpenSearch shard request"] --> Eng["InternalEngine (versioning, seqNo, translog)"]
    Eng --> IW["Lucene IndexWriter"]
    IW --> Seg["Immutable segments on disk"]
    Eng --> SM["SearcherManager / DirectoryReader"]
    SM --> IS["Lucene IndexSearcher"]
    IS --> Post["postings / points / docvalues / HNSW"]
    Seg --> Post

The OpenSearch layer is coordination and durability; the Lucene layer is storage and retrieval. Keep that line clear in your head and most "where does this belong" questions answer themselves.

Getting and reading Lucene source

Lucene is an Apache Software Foundation project, hosted on GitHub at apache/lucene. It builds with Gradle and targets JDK 21 (matching OpenSearch). Development moved from the old JIRA-only workflow to GitHub PRs, though a lot of history still lives under LUCENE-NNNN JIRA keys you'll see referenced in commit messages.

git clone https://github.com/apache/lucene.git && cd lucene
./gradlew :lucene:core:test                 # build + run core tests (slow first time)
./gradlew :lucene:core:test --tests "org.apache.lucene.index.TestIndexWriter" -Ptests.seed=DEADBEEF
./gradlew check                             # full quality gate: forbidden-apis, format, tests

The single most useful tool for this section is Luke, the Lucene index inspector. It opens any Lucene index (including an OpenSearch shard directory) and lets you browse fields, terms, postings, doc values, points, and per-segment files in a GUI.

# From the lucene checkout — launches the Luke GUI:
./gradlew :lucene:luke:run

# Point it at a real OpenSearch shard, e.g.:
#   <data.path>/nodes/0/indices/<index-uuid>/0/index

Almost everything in this section lives in lucene/core/src/java/org/apache/lucene/ (index/ codecs/ search/ store/ util/ document/). Signposts: find ... -name IndexWriter.java, BKDWriter.java, HnswGraph*.java, and codecs/ for the Codec SPI + default impls.

Note: Real Lucene package paths in this section are under org.apache.lucene.*, with source at lucene/core/src/java/org/apache/lucene/.... When we name a class, find it in your checkout — class locations are stable across versions even when codec version numbers move.

Reading order: the seven chapters and four labs

Read these roughly in order; each builds vocabulary the next assumes. The "consumed by" column shows which later OpenSearch/k-NN material depends on it.

#	Chapter	One line	Consumed by
1	Segments and Codecs	Immutable segments, the `segments_N` commit, the file zoo, the `Codec` SPI	everything below; `index.codec`
2	Inverted Index and Postings	terms → postings, the FST-backed BlockTree dictionary, `TermsEnum`/`PostingsEnum`	`match`/`term`, query DSL
3	Points and BKD Trees	multidimensional points, `BKDWriter`/`BKDReader`, numeric/`date`/`geo` ranges	mapping, range queries
4	DocValues: The Columnar Store	per-doc columnar values, ordinals, the backing for sort/agg/scripts	docvalues/fielddata, aggregations
5	IndexWriter and Merge Internals	DWPT buffers → flush → segments; `MergePolicy`/`MergeScheduler`; commit vs flush	engine, refresh/flush/merge
6	HNSW Vector Search in Lucene	`KnnFloatVectorField`, `HnswGraph`, the vector formats	the k-NN `lucene` engine
7	SIMD and the Panama Vector API	`jdk.incubator.vector`, `VectorUtil`, why HNSW scoring is fast	k-NN performance, faiss SIMD

The four hands-on labs:

Lab	What you build
Lab L1: Crack open a Lucene index	Use Luke + `find` to dissect a real shard's segment files
Lab L2: Write a custom codec	A `FilterCodec` registered via `META-INF/services`
Lab L3: HNSW from scratch	Build and query an `HnswGraph` directly
Lab L4: Contribute to Apache Lucene	A real upstream PR workflow on `apache/lucene`

How this maps back into OpenSearch classes

When you finish a Lucene chapter, re-read the matching OpenSearch deep dive — the two now interlock. The clearest example is the engine. In OpenSearch:

# The OpenSearch engine wraps the Lucene writer/reader:
grep -n "IndexWriter\|DirectoryReader\|SearcherManager\|ReferenceManager" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head

You will see InternalEngine constructing an IndexWriter, holding a SearcherManager (a Lucene ReferenceManager<DirectoryReader>), and calling commit() on flush. Chapter 5 (IndexWriter and Merge Internals) is the inside of those calls; engine-internals.md is the OpenSearch coordination around them. Same for codecs: OpenSearch's index.codec setting chooses a Lucene Codec (see Segments and Codecs), and the k-NN plugin ships its own codec to write the vector graph.

How to contribute upstream

The highest-skill move in this section is landing a change in apache/lucene itself — and OpenSearch benefits when its next Lucene upgrade pulls in your fix. The full workflow (fork, branch, ./gradlew check, the PR template, the reviewer culture, JIRA/PR conventions) is its own lab: Lab L4: Contribute to Apache Lucene.

A taste of where to start looking for tractable first issues:

gh issue list --repo apache/lucene --label "good first issue" --limit 30
gh search issues --repo apache/lucene "newdev OR beginner" --state open

Pair an upstream Lucene fix with an OpenSearch "Upgrade to Lucene" PR and you have demonstrated the full vertical: you can fix the engine and land it in the product. That is the profile of a core contributor.

Validation: prove you understand this

State, in one sentence each, the relationship between an OpenSearch shard and a Lucene index, and between InternalEngine and IndexWriter.
Give three concrete examples of an "OpenSearch" symptom whose root cause lives in a Lucene structure, and name the structure for each.
Explain why "upgrade the bundled Lucene version" is a recurring core task and what work it actually involves. Show the gh search that finds those PRs.
Clone apache/lucene, run ./gradlew :lucene:luke:run, and open any index directory. Name three things Luke shows you.
Reproduce the OpenSearch ↔ Lucene mapping table from memory for at least six rows.
Find, in InternalEngine.java, the three Lucene touchpoints OpenSearch wraps (the IndexWriter, the DirectoryReader/SearcherManager pair, and the commit call).

Segments and Codecs

A Lucene index is not one file — it is a pile of segments, each segment a small, self-contained, immutable mini-index, plus one tiny commit file that says "this set of segments, right now, is the index." Every byte Lucene reads at query time comes from a segment; every byte it writes goes into a new segment. Nothing is ever modified in place. Understanding this immutability, the commit point that ties segments together, the per-segment file zoo, and the Codec SPI that defines those files' formats is the foundation for every other chapter in this section.

This is the Lucene-level companion to OpenSearch's Refresh, Flush, and Merge: that chapter told you a flush is an IndexWriter.commit() that writes segments_N; here you learn what that file is and what the segments it points to actually contain.

After this chapter you can: list a segment's files and say what each holds; explain why immutability shapes the entire engine (deletes, merges, NRT); describe the Codec SPI and its sub-formats; explain how a Codec is discovered via META-INF/services; and set the codec on an OpenSearch index.

Immutable segments and the commit point

When IndexWriter flushes its in-RAM buffer, it writes a brand-new segment and never touches it again. This single design choice explains an enormous amount:

Consequence of immutability	Why
Deletes don't delete	You can't edit a segment, so a delete just marks the doc dead in a side `liveDocs` bitset (`.liv`). The bytes stay until a merge drops them.
Updates = delete + add	An update writes a new doc in a new segment and tombstones the old one.
Lock-free reads	A `DirectoryReader` over immutable segments needs no locking against writers — readers and writers never contend on the same bytes.
Cheap NRT	A refresh just opens a reader over the segments that exist now; no copying.
Merges are mandatory	Segment count and dead-doc debt grow forever without merging — see IndexWriter and Merge Internals.
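
The first and last rows of this table can be modelled in a few lines. The sketch below is a toy, not Lucene code: `ToySegment` is a hypothetical class that uses only the Java standard library, with a `BitSet` standing in for the `.liv` file. It shows that a delete changes the live count but not the stored docs, and that only a merge (a new segment) reclaims the space.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model: an immutable list of docs plus a side bitset of live docs. */
final class ToySegment {
    private final List<String> docs;  // written once, never modified
    private final BitSet liveDocs;    // plays the role of the .liv file

    ToySegment(List<String> docs) {
        this.docs = List.copyOf(docs);
        this.liveDocs = new BitSet(docs.size());
        this.liveDocs.set(0, docs.size()); // every doc starts alive
    }

    /** A delete only clears a bit; the doc itself stays in the segment. */
    void delete(int docId) {
        liveDocs.clear(docId);
    }

    int maxDoc() { return docs.size(); }              // counts dead docs too
    int numDocs() { return liveDocs.cardinality(); }  // live docs only

    /** A merge copies the live docs into a brand-new segment. */
    ToySegment merge() {
        List<String> survivors = new ArrayList<>();
        for (int i = liveDocs.nextSetBit(0); i >= 0; i = liveDocs.nextSetBit(i + 1)) {
            survivors.add(docs.get(i));
        }
        return new ToySegment(survivors);
    }

    public static void main(String[] args) {
        ToySegment seg = new ToySegment(List.of("doc-a", "doc-b", "doc-c"));
        seg.delete(1);
        System.out.println("before merge: maxDoc=" + seg.maxDoc() + " numDocs=" + seg.numDocs());
        ToySegment merged = seg.merge();
        System.out.println("after merge:  maxDoc=" + merged.maxDoc() + " numDocs=" + merged.numDocs());
    }
}
```

The `maxDoc`/`numDocs` names are borrowed from Lucene's reader API on purpose: the gap between the two is the dead-doc debt that merging pays down.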

The index's current state is a commit point: a file named segments_N (generation N, monotonically increasing) listing exactly which segments are live and which codec wrote each. In Lucene this is the SegmentInfos object.

# In any index directory, the commit point is the segments_N file:
ls -la /path/to/index/ | grep segments
#   segments_7      <- the current commit (generation 7)
#   write.lock      <- the IndexWriter lock

grep -rn "class SegmentInfos\|class SegmentCommitInfo\|class SegmentInfo " \
  lucene/core/src/java/org/apache/lucene/index/

flowchart TD
    SN["segments_7 (SegmentInfos / commit point)"] --> S0["_0 (segment)"]
    SN --> S1["_1 (segment)"]
    SN --> S2["_5 (segment, merged)"]
    S0 --> F0["_0.si _0.cfs _0.cfe ..."]
    S1 --> F1["_1.fnm _1.tim _1.doc ..."]
    S2 --> F2["_5.dvd _5.kdd _5.vec ..."]

Note: A commit (segments_N) is what survives a crash. Segments that were refreshed but not yet committed are visible to search, but after a crash their operations are replayed from the translog rather than read back from a commit. Visibility (refresh) and durability (commit/flush) are independent; the OpenSearch engine chapter calls this out, and it is exactly the Lucene segment/commit distinction.
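
One detail worth knowing when you read a real directory listing: the generation suffix in segments_N (like the numeric part of segment names such as _0 or _a) is written in base 36, so segments_a is generation 10, not a typo. A minimal sketch of picking the newest commit from a list of file names follows; `CommitPointNames` is a hypothetical helper written against the standard library only, not a Lucene class.

```java
import java.util.List;

/** Toy helper: find the newest commit point among a directory's file names. */
final class CommitPointNames {
    private static final String PREFIX = "segments_";

    /** Parses the base-36 generation out of a segments_N file name. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a commit file: " + fileName);
        }
        return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
    }

    /** Returns the segments_N file with the highest generation, or null if none. */
    static String latestCommit(List<String> fileNames) {
        String best = null;
        long bestGen = -1;
        for (String name : fileNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // segment files, write.lock, ...
            }
            long gen = generationOf(name);
            if (gen > bestGen) {
                bestGen = gen;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> files = List.of("_0.si", "_0.cfs", "segments_9", "segments_a", "write.lock");
        System.out.println(latestCommit(files) + " is generation " + generationOf(latestCommit(files)));
    }
}
```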

The per-segment file zoo

Each segment _N is a set of files sharing the prefix _N, one per data structure. (When compound mode is on, most of them are packed into a single .cfs with a .cfe table of contents — fewer file handles.) Memorize this table; you'll read these extensions for the rest of your career.

Extension(s)	Structure	Holds	Covered in
`.si`	`SegmentInfo`	per-segment metadata: doc count, Lucene version, compound flag, file list, diagnostics (the codec name itself is recorded in `segments_N`)	this chapter
`.cfs` / `.cfe`	Compound file + entries	all other files packed into one + its directory	this chapter
`.fnm`	Field infos	the field list: name, number, index options, doc-values type, points/vector dims	this chapter
`.fdt` / `.fdx` / `.fdm`	Stored fields data / index / meta	the original `_source`-style stored field values, compressed	this chapter
`.tim` / `.tip` / `.tmd`	Terms dictionary / index / meta	the BlockTree terms dict + its FST index	Inverted Index
`.doc` / `.pos` / `.pay`	Postings: docs / positions / payloads+offsets	the inverted lists per term	Inverted Index
`.dvd` / `.dvm`	DocValues data / meta	columnar per-doc values	DocValues
`.nvd` / `.nvm`	Norms data / meta	per-field length norms used in scoring	Inverted Index
`.kdd` / `.kdi` / `.kdm`	Points data / index / meta	the BKD tree for numeric/geo points	Points & BKD
`.vec` / `.vex` / `.vem`	Vector data / graph / meta	HNSW float/byte vectors + the graph	HNSW
`.liv`	Live docs	the deletion bitset (which docs are still alive)	IndexWriter

Go look at a real one. Index a few docs into OpenSearch, find the shard, and find its files:

# 1. Locate the data path and a shard directory:
curl -s 'localhost:9200/_nodes/settings?filter_path=**.path.data' | python3 -m json.tool
#   <data>/nodes/0/indices/<index-uuid>/<shard>/index

# 2. List the segment files by extension:
SHARD=/path/to/<data>/nodes/0/indices/<uuid>/0/index
find "$SHARD" -maxdepth 1 -type f | sed 's/.*\.//' | sort | uniq -c
ls -la "$SHARD"

# 3. Ask OpenSearch to describe the same segments:
curl -s 'localhost:9200/my-index/_segments?pretty' | head -60
curl -s 'localhost:9200/_cat/segments/my-index?v'

If you only see .cfs/.cfe/.si per segment, the segment is compound. Small segments are compound by default; large merged ones are usually non-compound (the per-file overhead amortizes away). Luke (Lab L1) shows the same picture visually.

The Codec SPI

A Codec defines the on-disk format of all those files. It is the central extension point of Lucene's storage layer, wired through Java's ServiceLoader (the SPI mechanism). A Codec is really a bundle of sub-format factories:

Sub-format	Defines the format of	Files
`PostingsFormat`	terms dictionary + postings	`.tim/.tip/.tmd/.doc/.pos/.pay`
`DocValuesFormat`	columnar doc values	`.dvd/.dvm`
`StoredFieldsFormat`	stored field values	`.fdt/.fdx/.fdm`
`PointsFormat`	BKD point trees	`.kdd/.kdi/.kdm`
`KnnVectorsFormat`	HNSW vectors + graph	`.vec/.vex/.vem`
`NormsFormat`	scoring norms	`.nvd/.nvm`
`FieldInfosFormat`	the field metadata	`.fnm`
`SegmentInfoFormat`	per-segment metadata	`.si`
`LiveDocsFormat` / `CompoundFormat`	deletes / compound packing	`.liv` / `.cfs/.cfe`

The default codec is versioned — LuceneNNNCodec, where NNN bumps when the on-disk format changes. Do not hard-code the number; grep for it, because a Lucene upgrade in OpenSearch is precisely "the default codec version moved."

# Find the current default codec class in your Lucene checkout:
ls lucene/core/src/java/org/apache/lucene/codecs/lucene*/
grep -rln "extends Codec\b" lucene/core/src/java/org/apache/lucene/codecs/

# The default is also declared as a service — see next section.
grep -rn "Codec.getDefault\|setDefault" lucene/core/src/java/org/apache/lucene/codecs/Codec.java

Several sub-formats (postings, doc values, k-NN vectors) are themselves SPI-pluggable by name and versioned independently of the codec. Because the commit point records, for every segment, the name of the codec that wrote it, a single Lucene build can read older segments next to new ones: Lucene keeps the previous major version's format readers around for back-compat. That back-compat window is the deep reason an OpenSearch major can read indices from the previous major but not from two majors ago.

SPI registration via META-INF/services

A Codec (or any sub-format) is discovered at runtime by listing its fully-qualified class name in a META-INF/services file named for the SPI interface. No central registry, no config — drop the jar on the classpath with the right service file and Lucene finds it. This is exactly how OpenSearch's k-NN plugin ships its own vector codec.

# The default codec registers itself here, in Lucene's core jar:
find lucene -path "*META-INF/services/org.apache.lucene.codecs.Codec"
cat lucene/core/src/resources/META-INF/services/org.apache.lucene.codecs.Codec

# And the sub-format services:
find lucene -path "*META-INF/services/org.apache.lucene.codecs.*Format"

# META-INF/services/org.apache.lucene.codecs.Codec
org.apache.lucene.codecs.lucene101.Lucene101Codec
# (number illustrative — grep your checkout for the real one)

To write your own, you extend Codec (usually FilterCodec, delegating most sub-formats and overriding one), then add a line to the service file. That is the whole of Lab L2: Write a custom codec.
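
The "delegate most, override one" shape is easier to see stripped of Lucene's real types. The sketch below uses hypothetical stand-ins (`ToyCodec`, `ToyFilterCodec`, and the format names are all invented for illustration, with strings in place of real format objects); it only demonstrates the delegation pattern that `FilterCodec` is built around, not Lucene's actual API.

```java
/** Toy illustration of the FilterCodec pattern with stand-in types. */
final class FilterCodecSketch {

    /** Hypothetical stand-in for a codec: each method names one sub-format. */
    interface ToyCodec {
        String name();
        String postingsFormat();
        String storedFieldsFormat();
        String docValuesFormat();
    }

    static final class ToyDefaultCodec implements ToyCodec {
        public String name() { return "ToyDefault"; }
        public String postingsFormat() { return "default-postings"; }
        public String storedFieldsFormat() { return "lz4-stored-fields"; }
        public String docValuesFormat() { return "default-docvalues"; }
    }

    /** Delegates every sub-format; subclasses override only what they change. */
    abstract static class ToyFilterCodec implements ToyCodec {
        private final String name;
        private final ToyCodec delegate;

        ToyFilterCodec(String name, ToyCodec delegate) {
            this.name = name;
            this.delegate = delegate;
        }

        public String name() { return name; }
        public String postingsFormat() { return delegate.postingsFormat(); }
        public String storedFieldsFormat() { return delegate.storedFieldsFormat(); }
        public String docValuesFormat() { return delegate.docValuesFormat(); }
    }

    /** Changes exactly one sub-format and inherits the rest. */
    static final class SmallerStoredFieldsCodec extends ToyFilterCodec {
        SmallerStoredFieldsCodec() {
            super("SmallerStoredFields", new ToyDefaultCodec());
        }

        @Override
        public String storedFieldsFormat() { return "deflate-stored-fields"; }
    }

    public static void main(String[] args) {
        ToyCodec codec = new SmallerStoredFieldsCodec();
        System.out.println(codec.name() + ": " + codec.postingsFormat()
            + ", " + codec.storedFieldsFormat() + ", " + codec.docValuesFormat());
    }
}
```

The name passed to the constructor matters in the real thing too: it is the string written into the index and the key the SPI lookup uses to find the class again at read time.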

Per-field codecs

A codec doesn't have to use one format for the entire index — it can choose a different format per field. Lucene exposes this through PerFieldPostingsFormat, PerFieldDocValuesFormat, and PerFieldKnnVectorsFormat: the codec is asked, per field name, which format to use. This is how a single index can have, say, a high-compression postings format on a rarely-queried field and the fast default on a hot one — and how the k-NN plugin attaches its HNSW format only to knn_vector fields while leaving the rest on the standard codec.

grep -rn "PerFieldPostingsFormat\|PerFieldDocValuesFormat\|PerFieldKnnVectorsFormat\|getPostingsFormatForField" \
  lucene/core/src/java/org/apache/lucene/codecs/perfield/

The mechanism: the codec returns a PerField*Format, whose getXForField(field) returns the concrete format; Lucene records the chosen format name per field in the .fnm so the reader can re-instantiate the right one. Override one method, target one field — clean.

How OpenSearch chooses the codec

OpenSearch exposes the codec choice as the index setting index.codec. It does not let you pick an arbitrary Lucene codec by class; it maps a few friendly names to codec configurations (chiefly different stored-fields compression).

`index.codec`	Effect
`default`	the standard Lucene codec, `LZ4` stored-fields compression (fast)
`best_compression`	the standard codec with `DEFLATE` (zlib) stored-fields compression (smaller, slower to fetch `_source`)
(others, version-dependent)	e.g. specialized compression — grep to see what your version registers

# Set it at index creation:
curl -s -XPUT 'localhost:9200/logs' -H 'Content-Type: application/json' -d'
{ "settings": { "index.codec": "best_compression" } }'

# Where OpenSearch maps the name to a Lucene codec:
grep -rn "index.codec\|CodecService\|best_compression\|BEST_COMPRESSION\|Lucene.*Codec" \
  server/src/main/java/org/opensearch/index/codec/CodecService.java \
  server/src/main/java/org/opensearch/index/engine/EngineConfig.java 2>/dev/null

best_compression is the standard answer when disk is the constraint and _source fetches are rare (log/metrics indices). Contrary to a common misconception, it affects only stored-fields compression, not postings, points, or doc values. The k-NN plugin layers its own codec on top of whichever index.codec you pick, adding the vector format per knn_vector field.

Reading exercise

# 1. The commit point and its segments.
grep -rn "class SegmentInfos\|class SegmentInfo \|class SegmentCommitInfo" \
  lucene/core/src/java/org/apache/lucene/index/

# 2. The default codec and its sub-formats.
ls lucene/core/src/java/org/apache/lucene/codecs/lucene*/
grep -rln "extends Codec\b\|extends FilterCodec" lucene/core/src/java/org/apache/lucene/codecs/

# 3. SPI registration files.
find lucene -path "*META-INF/services/org.apache.lucene.codecs.*"

# 4. Per-field formats.
grep -rn "getPostingsFormatForField\|getKnnVectorsFormatForField" \
  lucene/core/src/java/org/apache/lucene/codecs/perfield/

# 5. Real files in a real shard.
find /path/to/shard/index -maxdepth 1 -type f | sed 's/.*\.//' | sort | uniq -c
curl -s 'localhost:9200/my-index/_segments?pretty' | head -40

# 6. OpenSearch's codec setting.
grep -rn "best_compression\|index.codec\|CodecService" \
  server/src/main/java/org/opensearch/index/codec/

Answer:

Why are segments immutable, and name three consequences (one of them about deletes, one about reads, one about merges).
What does segments_N contain, and what is the Lucene class for it? Why does each segment record the codec that wrote it?
Given the files _3.si _3.tim _3.doc _3.kdd _3.dvd _3.vec, say what each one holds and which chapter covers it.
List the five most-used sub-formats of a Codec and the file extensions each produces.
Explain SPI registration: how does Lucene find a custom codec at runtime without any central config?
What exactly does index.codec: best_compression change, and what does it not change? Why would you choose it for a logs index?

Common bugs and symptoms

Symptom	Root cause	Where to look
Index won't open: "format version is not supported"	a segment written by a newer/older codec than this Lucene can read (back-compat window exceeded)	the per-segment codec name recorded in `segments_N`; the Lucene upgrade story
Too many open files	non-compound segments × many segments; file-descriptor limit	enable compound; merge; `_cat/segments`
`best_compression` didn't shrink the index	it only compresses stored fields; your size is postings/docvalues/points	profile per-file sizes with `find`/`du`; `_segments`
Custom codec "not found" at runtime	missing/typo'd `META-INF/services` line, or jar not on classpath	the service file; the SPI loader
k-NN vector files (`.vec`) missing on a `knn_vector` field	per-field vector format not attached; wrong codec wired	the plugin's codec; `index.knn`; per-field format
Disk grows without bound	dead docs never reclaimed; merges starved	`_cat/segments` `deleted.docs`; merge internals

Validation: prove you understand this

Draw an index as a segments_N commit pointing at three segments, and one segment as its set of extension files; label what each file holds.
Explain immutability and derive from it: how deletes work, why reads need no lock against writes, and why merges are mandatory.
Define the Codec SPI and list its sub-formats and their file extensions. Show the META-INF/services line that registers a codec.
Explain per-field formats and give the k-NN-plugin example of why you'd want one format on one field only.
State precisely what index.codec: best_compression does and does not affect, and the workload it suits.
From a find over a real shard, classify every file extension you see and say which chapter of this section owns it.

The Inverted Index and Postings

The inverted index is the structure that makes full-text search fast, and it is the thing most people mean when they say "Lucene." Flip a normal document store on its head — instead of "document → its words," store "word → the documents that contain it." That inverted mapping, term to postings list, is what turns match: "quick brown fox" from a scan of every document into three cheap lookups and a merge. This chapter is the inside of an OpenSearch match or term query: the terms dictionary, the FST that indexes it, the postings, and the iterators a TermQuery drives.

This is the Lucene layer beneath Query DSL and QueryBuilders: that chapter showed MatchQueryBuilder analyzing text into TermQuery objects; here you learn what those TermQuery objects do when they execute.

After this chapter you can: explain terms → postings and the on-disk layout; describe the BlockTree terms dictionary and the FST that indexes it; drive a TermsEnum + PostingsEnum in your head; explain positions/offsets/payloads and the IndexOptions that control them; and trace a TermQuery end to end.

Terms and postings

For each indexed field, Lucene builds a sorted dictionary of terms (the distinct tokens produced by analysis — see Mapping and Analysis). Each term points to a postings list: the sorted list of document IDs that contain it, plus optional per-occurrence data (how often, where, with what payload).

Field "body":
  "brown" -> docs [2, 5, 9]      freqs [1, 2, 1]   positions [...]
  "fox"   -> docs [2, 9, 17]     ...
  "quick" -> docs [2, 5]         ...

A query for body:fox is now: find "fox" in the dictionary, open its postings, iterate doc IDs. A phrase query for "quick brown" intersects the postings of both terms and checks that "brown" appears at position+1 after "quick". Scoring multiplies in term frequency, document length norms, and the collection-level inverse document frequency — but the retrieval is pure postings iteration.
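
To make the lookup-then-merge idea concrete, here is a toy in-memory inverted index. It is a sketch under simplifying assumptions (whitespace tokenization, no freqs or positions, a `TreeMap` as the sorted dictionary); `ToyInvertedIndex` is a hypothetical class, not a Lucene API. It reproduces the postings from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

/** Toy inverted index: sorted term dictionary -> sorted doc-ID postings. */
final class ToyInvertedIndex {
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    /** Docs must be added in increasing docId order so postings stay sorted. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Record each doc once per term, even if the term repeats in the doc.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    /** Term lookup: one dictionary seek, then the postings list. */
    List<Integer> docsFor(String term) {
        return postings.getOrDefault(term, List.of());
    }

    /** AND of two terms: a linear merge of two sorted postings lists. */
    List<Integer> and(String a, String b) {
        List<Integer> x = docsFor(a);
        List<Integer> y = docsFor(b);
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < x.size() && j < y.size()) {
            int dx = x.get(i);
            int dy = y.get(j);
            if (dx == dy) {
                out.add(dx);
                i++;
                j++;
            } else if (dx < dy) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(2, "quick brown fox");
        index.add(5, "quick brown brown");
        index.add(9, "brown fox");
        index.add(17, "fox");
        System.out.println("brown -> " + index.docsFor("brown"));
        System.out.println("quick AND fox -> " + index.and("quick", "fox"));
    }
}
```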

On disk these live in the postings sub-format's files (see Segments and Codecs):

File	Holds
`.tim` / `.tip` / `.tmd`	terms dictionary (BlockTree), its FST index, and metadata
`.doc`	the doc-ID postings (delta-encoded) + term frequencies
`.pos`	term positions (for phrase/proximity queries)
`.pay`	payloads and offsets
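
"Delta-encoded" deserves one concrete look. Because postings are sorted, storing the gap to the previous doc ID instead of the ID itself produces small numbers, and small numbers fit in few bytes. The sketch below shows only that idea, using a 7-bits-per-byte variable-length integer; real Lucene postings formats pack blocks of deltas with bit packing and differ by version, so treat `DeltaVInt` as a hypothetical illustration rather than the on-disk layout.

```java
import java.io.ByteArrayOutputStream;

/** Toy delta + variable-length encoding of a sorted doc-ID list. */
final class DeltaVInt {

    /** Encodes each doc ID as the gap from the previous one, 7 bits per byte. */
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocIds) {
            int delta = doc - prev;
            prev = doc;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // high bit set: more bytes follow
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    /** Reverses encode(); count is the number of doc IDs to read. */
    static int[] decode(byte[] bytes, int count) {
        int[] docs = new int[count];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }

    public static void main(String[] args) {
        int[] docs = {1000, 1001, 200000};
        byte[] encoded = encode(docs);
        System.out.println(docs.length * 4 + " raw bytes -> " + encoded.length + " encoded bytes");
    }
}
```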

The BlockTree terms dictionary, backed by an FST

A field can have millions of distinct terms; you cannot binary-search a flat file of them cheaply. Lucene's default PostingsFormat uses a BlockTree terms dictionary: terms are sorted and grouped into blocks sharing a common prefix, and an in-memory index over the block prefixes points into the on-disk blocks. That index is a finite-state transducer (FST).

An FST is a compressed automaton mapping a byte sequence (a term prefix) to an output (a file pointer / metadata). It is like a trie that shares both prefixes and suffixes, stored as a tiny, cache-friendly byte array. The FST holds just enough of the dictionary (prefixes of block boundaries) to seek you to the right on-disk block; the block is then scanned linearly. So a term lookup is: walk the FST to find the block, read the block, scan to the term.

flowchart TD
    Q["seek 'brown'"] --> FST["FST index (.tip, in RAM)"]
    FST -->|"file pointer for 'br...' block"| Block["terms block (.tim, on disk)"]
    Block -->|scan to 'brown'| Meta["term metadata: doc count, postings file pointers"]
    Meta --> Doc[".doc postings"]
    Meta --> Pos[".pos positions"]

# The terms dictionary + FST live here:
grep -rln "BlockTree" lucene/core/src/java/org/apache/lucene/codecs/
ls lucene/core/src/java/org/apache/lucene/codecs/lucene*/   # *PostingsFormat
find lucene/core/src/java/org/apache/lucene -path "*util/fst/FST.java"
grep -rn "class FST\b\|class FSTCompiler\|Outputs" \
  lucene/core/src/java/org/apache/lucene/util/fst/

Note: The FST is the reason Lucene's term seeks are fast and small. It's also reused far beyond the terms dict — synonym filters, the suggest/completion components, and more all build FSTs. Reading FST.java once is one of the highest-return hours you can spend in the Lucene codebase.
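
The two-step seek (small index to find the block, then a linear scan inside it) can be sketched without an FST at all. In the toy below a `TreeMap` keyed by each block's first term stands in for the FST, and blocks are cut at a fixed size rather than by shared prefix; both are simplifying assumptions, and `BlockIndexSketch` is a hypothetical class, not Lucene's BlockTree.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy terms dictionary: a small index over blocks, then a scan within one block. */
final class BlockIndexSketch {
    private final List<List<String>> blocks = new ArrayList<>();
    // First term of each block -> block number. Stand-in for the FST index.
    private final TreeMap<String, Integer> index = new TreeMap<>();

    BlockIndexSketch(Collection<String> terms, int blockSize) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(terms));
        for (int i = 0; i < sorted.size(); i += blockSize) {
            List<String> block = sorted.subList(i, Math.min(i + blockSize, sorted.size()));
            index.put(block.get(0), blocks.size());
            blocks.add(List.copyOf(block));
        }
    }

    /** Step 1: the index tells us the only block that could hold the term. */
    int blockFor(String term) {
        Map.Entry<String, Integer> entry = index.floorEntry(term);
        return entry == null ? -1 : entry.getValue();
    }

    /** Step 2: scan that one block linearly. */
    boolean seekExact(String term) {
        int block = blockFor(term);
        if (block < 0) {
            return false;
        }
        for (String candidate : blocks.get(block)) {
            int cmp = candidate.compareTo(term);
            if (cmp == 0) {
                return true;
            }
            if (cmp > 0) {
                return false; // sorted block: we passed where the term would be
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BlockIndexSketch dict = new BlockIndexSketch(
            List.of("apple", "brown", "brownie", "fox", "quick", "zebra"), 2);
        System.out.println("fox is in block " + dict.blockFor("fox")
            + ", found=" + dict.seekExact("fox"));
    }
}
```

The point of the split is that only the index needs to be cheap to consult on every seek; the blocks themselves can stay on disk and be touched one at a time.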

Terms, TermsEnum, PostingsEnum

At read time the API mirrors the structure exactly:

API	Role
`Terms`	the terms of one field in one segment (`leafReader.terms("body")`)
`TermsEnum`	an iterator/seeker over those terms; `seekExact(BytesRef)`, `next()`, `term()`
`PostingsEnum`	the postings of the term the `TermsEnum` is positioned on; `nextDoc()`, `freq()`, `nextPosition()`

The canonical access pattern — read it, then write it from memory:

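import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

LeafReader leaf = ...;                       // one segment
Terms terms = leaf.terms("body");            // null if this segment has no such indexed field
if (terms != null) {
    TermsEnum te = terms.iterator();
    if (te.seekExact(new BytesRef("fox"))) { // found the term?
        PostingsEnum pe = te.postings(null, PostingsEnum.FREQS);
        int docID;
        while ((docID = pe.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            int freq = pe.freq();            // term freq in this doc
            // with PostingsEnum.POSITIONS you could also pe.nextPosition()
        }
    }
}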

grep -rn "abstract class Terms\b\|abstract class TermsEnum\|abstract class PostingsEnum" \
  lucene/core/src/java/org/apache/lucene/index/
grep -rn "seekExact\|seekCeil\|postings(" \
  lucene/core/src/java/org/apache/lucene/index/TermsEnum.java

Note PostingsEnum is a DocIdSetIterator: nextDoc() / advance(target) / NO_MORE_DOCS. Every query in Lucene ultimately produces a DocIdSetIterator; postings are just the most fundamental source of one. That shared interface is what lets Lucene intersect a term, a range (BKD), and a filter (a bitset) in a single conjunction.

Positions, offsets, payloads — and IndexOptions

How much per-occurrence data a field stores is governed by IndexOptions, set at index time per field. More options = richer queries but bigger postings.

`IndexOptions`	Stores	Enables	Cost
`DOCS`	doc IDs only	`term`/`match` (boolean) existence	smallest
`DOCS_AND_FREQS`	+ term frequency	relevance scoring (BM25 needs freq)	small
`DOCS_AND_FREQS_AND_POSITIONS`	+ positions	phrase, proximity, span queries	`.pos` file
`DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS`	+ character offsets	fast highlighting (posting highlighter)	`.pay` offsets
`NONE`	nothing	field not inverted (e.g. doc-values-only)	—

Positions are token positions ("brown" is the 2nd token), needed to verify a phrase like "quick brown" truly appears adjacent.
Offsets are character start/end in the original text, used by highlighters to underline the matched span without re-analyzing.
Payloads are arbitrary bytes attached to a token occurrence (e.g. a part-of-speech tag, a per-term boost) — read via PostingsEnum.getPayload().

grep -rn "enum IndexOptions" lucene/core/src/java/org/apache/lucene/index/IndexOptions.java
# How OpenSearch field types pick index options:
grep -rn "IndexOptions\|setIndexOptions\|indexOptions" \
  server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java \
  server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java

A keyword field is typically DOCS (you only ask "does this doc equal this term"); a text field is DOCS_AND_FREQS_AND_POSITIONS (you score and run phrase queries). That difference is set in the OpenSearch field mappers and flows straight into the postings files Lucene writes.

Skip lists

A postings list can be enormous (a common term may appear in millions of docs). Conjunctive queries don't read all of it — a phrase query intersecting fox and quick repeatedly calls advance(target) to skip ahead to the next candidate doc. To make advance sublinear, the postings format embeds a multi-level skip list: periodic checkpoints that let the iterator jump over blocks of doc IDs without decoding them.

flowchart LR
    A["advance(1000)"] --> SkipL2["skip level 2: jump near 1000"]
    SkipL2 --> SkipL1["skip level 1: refine"]
    SkipL1 --> Block["decode the block containing 1000"]
    Block --> Doc["land on first doc >= 1000"]

grep -rn "Skip\|skipper\|class .*Skip" \
  lucene/core/src/java/org/apache/lucene/codecs/lucene*/ | head

This is why conjunctions over a rare term + a common term are fast: the rare term's postings drive the iteration, and each advance on the common term skips huge stretches of its postings via the skip list. Understanding this is the key to reasoning about query cost in search execution.
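
A small experiment makes the cost argument tangible. The sketch below keeps one level of skip data (the last doc ID of each fixed-size block) and counts how many doc IDs `advance` actually has to look at. It is a simplification: Lucene's skip structures are multi-level and version-specific, and `SkippingPostings` is a hypothetical class written for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy postings iterator with single-level skip data. */
final class SkippingPostings {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;
    private final int blockSize;
    private final int[] lastDocInBlock; // the skip data: one entry per block
    private int pos = -1;
    int decoded = 0;                    // doc IDs actually inspected

    SkippingPostings(int[] sortedDocs, int blockSize) {
        this.docs = sortedDocs.clone();
        this.blockSize = blockSize;
        int blockCount = (docs.length + blockSize - 1) / blockSize;
        this.lastDocInBlock = new int[blockCount];
        for (int b = 0; b < blockCount; b++) {
            lastDocInBlock[b] = docs[Math.min((b + 1) * blockSize, docs.length) - 1];
        }
    }

    int docID() {
        if (pos < 0) {
            return -1;
        }
        return pos >= docs.length ? NO_MORE_DOCS : docs[pos];
    }

    /** Moves to the first doc >= target; target must be beyond the current doc. */
    int advance(int target) {
        int start = pos + 1;
        int block = start / blockSize;
        // Skip phase: jump over whole blocks that end before target.
        while (block < lastDocInBlock.length && lastDocInBlock[block] < target) {
            block++;
        }
        if (block >= lastDocInBlock.length) {
            pos = docs.length;
            return NO_MORE_DOCS;
        }
        // Scan phase: inspect doc IDs only inside the one candidate block.
        pos = Math.max(start, block * blockSize);
        while (docs[pos] < target) {
            pos++;
            decoded++;
        }
        decoded++;
        return docs[pos];
    }

    /** Leapfrog intersection: the rare list leads, the common list skips. */
    static List<Integer> intersect(SkippingPostings rare, SkippingPostings common) {
        List<Integer> out = new ArrayList<>();
        int doc = rare.advance(0);
        while (doc != NO_MORE_DOCS) {
            int other = common.docID() >= doc ? common.docID() : common.advance(doc);
            if (other == NO_MORE_DOCS) {
                break;
            }
            if (other == doc) {
                out.add(doc);
                doc = rare.advance(doc + 1);
            } else {
                doc = rare.advance(other);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] everyDoc = new int[10_000];
        for (int i = 0; i < everyDoc.length; i++) {
            everyDoc[i] = i;
        }
        SkippingPostings common = new SkippingPostings(everyDoc, 128);
        SkippingPostings rare = new SkippingPostings(new int[] {5, 5000, 9999}, 128);
        System.out.println(intersect(rare, common) + " using " + common.decoded
            + " inspected docs out of " + everyDoc.length);
    }
}
```

Swap the roles (let the common list lead) and the inspected count climbs toward the full list length, which is the "Conjunction unexpectedly slow" case in miniature.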

How a TermQuery walks all of this

Put it together. An OpenSearch term/match query becomes a Lucene TermQuery (after analysis splits text into terms). Execution:

flowchart TD
    TQ["TermQuery(body:fox)"] --> W["createWeight() -> TermWeight"]
    W --> PerSeg["for each segment (LeafReaderContext)"]
    PerSeg --> Terms["leaf.terms('body')"]
    Terms --> TE["TermsEnum.seekExact('fox')  (walks FST -> block)"]
    TE -->|found| Scorer["TermScorer over PostingsEnum"]
    TE -->|not found| Empty["no scorer for this segment"]
    Scorer --> Iter["nextDoc()/advance(): iterate postings + score (BM25)"]
    Iter --> Collect["LeafCollector collects + scores top docs"]

grep -rn "class TermQuery\|class TermWeight\|class TermScorer" \
  lucene/core/src/java/org/apache/lucene/search/

TermQuery.createWeight builds a Weight (per-query, captures statistics for scoring).
For each segment, the weight's scorer(leafContext) seeks the term via TermsEnum (FST → block → metadata) and, if found, wraps the PostingsEnum in a TermScorer.
The IndexSearcher drives each segment's scorer with a Collector, iterating postings (nextDoc/advance, using skip lists for conjunctions) and computing BM25 scores from freq + norms + IDF.
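
The "freq + norms + IDF" combination in the last step is small enough to write down. The sketch below is a textbook-style BM25 for a single term, with the commonly quoted defaults k1 = 1.2 and b = 0.75 passed in by the caller. It is an illustration only: Lucene's own similarity encodes document length lossily in the norms file, and constant factors have changed between versions, so do not expect bit-identical scores. `Bm25Sketch` is a hypothetical class.

```java
/** Toy single-term BM25, to show how freq, doc length, and IDF combine. */
final class Bm25Sketch {

    /** Rarer terms (small docFreq relative to docCount) get a larger weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * @param freq      term frequency in this document
     * @param docLen    length of this document's field, in tokens
     * @param avgDocLen average field length across the collection
     * @param docCount  number of documents that have the field
     * @param docFreq   number of documents containing the term
     */
    static double score(double freq, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k1, double b) {
        // Length normalization: longer-than-average docs need more occurrences.
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        // Saturating term-frequency component, always below 1.
        double tf = freq / (freq + norm);
        return idf(docCount, docFreq) * tf;
    }

    public static void main(String[] args) {
        double once = score(1, 100, 100, 1000, 10, 1.2, 0.75);
        double twice = score(2, 100, 100, 1000, 10, 1.2, 0.75);
        double longDoc = score(1, 400, 100, 1000, 10, 1.2, 0.75);
        System.out.println("freq=1: " + once + ", freq=2: " + twice + ", long doc: " + longDoc);
    }
}
```

Reading the three inputs back onto the files: freq comes from `.doc`, the document length from the norms (`.nvd`), and docFreq from the term metadata found during the dictionary seek.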

This is the same retrieval machinery OpenSearch's search execution layer coordinates across shards; the per-segment work is exactly the above. A match with multiple terms is a BooleanQuery of TermQuerys; a phrase is a PhraseQuery that also reads positions; a prefix/wildcard is a MultiTermQuery that enumerates many terms via the TermsEnum.

Reading exercise

# 1. The read API.
grep -rn "abstract class Terms\b\|class TermsEnum\|class PostingsEnum" \
  lucene/core/src/java/org/apache/lucene/index/

# 2. The FST + BlockTree.
find lucene/core/src/java/org/apache/lucene -path "*util/fst/FST.java"
grep -rln "BlockTree" lucene/core/src/java/org/apache/lucene/codecs/

# 3. IndexOptions and where OpenSearch sets them.
grep -rn "enum IndexOptions" lucene/core/src/java/org/apache/lucene/index/IndexOptions.java
grep -rn "setIndexOptions\|IndexOptions" \
  server/src/main/java/org/opensearch/index/mapper/TextFieldMapper.java

# 4. The query.
grep -rn "class TermQuery\|class TermScorer\|class TermWeight" \
  lucene/core/src/java/org/apache/lucene/search/

# 5. Inspect real terms with Luke (or the explain API):
#    ./gradlew :lucene:luke:run    -> Documents/Terms tab
curl -s 'localhost:9200/my-index/_search?pretty' -H 'Content-Type: application/json' -d'
{ "explain": true, "query": { "match": { "body": "fox" } } }' | head -40

Answer:

Explain "inverted index" by contrasting forward vs inverted storage, and say what a postings list contains.
What is the BlockTree terms dictionary, and what role does the FST play? Why not just binary-search a flat term file?
Write the Terms → TermsEnum → PostingsEnum access pattern from memory and name the method that seeks a term and the one that iterates docs.
List the IndexOptions levels and say which OpenSearch field types (text vs keyword) use which, and why.
What problem do skip lists solve, and why do they make a rare-term + common-term conjunction fast?
Trace a TermQuery from createWeight to scored docs, naming TermWeight/TermScorer/PostingsEnum and where the FST seek happens.

Common bugs and symptoms

Symptom	Root cause	Where to look
Phrase query returns nothing though both words present	field indexed without positions (`IndexOptions` too low)	field mapper `IndexOptions`; reindex with positions
Highlighting slow / re-analyzes text	offsets not stored; highlighter falls back	`..._AND_OFFSETS`; posting highlighter
Huge `.pos`/`.pay` files	positions/offsets enabled on a field that never needs phrases	drop to `DOCS_AND_FREQS` for that field
Scoring ignores term frequency	field indexed `DOCS` only (no freqs)	`IndexOptions`; `keyword` can't BM25-score meaningfully
Slow leading-wildcard / regex query	`MultiTermQuery` enumerates a huge slice of the terms dict via `TermsEnum`	avoid leading wildcards; n-gram field
Conjunction unexpectedly slow	iteration driven by the common term, no skip benefit	check which term leads; cost model in search-execution

Validation: prove you understand this

Draw the inverted index for three short documents and show the postings (with freqs and positions) for two terms.
Explain the BlockTree + FST: what the FST stores, what the on-disk block stores, and the two-step seek between them.
Hand-trace TermsEnum.seekExact + PostingsEnum.nextDoc for a term across a two-segment reader.
Map each IndexOptions level to a query capability it unlocks and a file it adds, and assign text/keyword to the right level.
Explain a multi-level skip list and why it makes advance(target) sublinear.
Trace match: {body: "fox"} from MatchQueryBuilder (OpenSearch) through TermQuery/TermScorer (Lucene) to collected hits, naming every layer.

Points and BKD Trees

The inverted index is brilliant at "which docs contain this term" and useless at "which docs have a number between 30 and 70." Range queries, geo queries, and multi-dimensional lookups need a spatial structure, not a postings list. Lucene's answer is points: values indexed into a BKD tree (block KD-tree), a disk-resident, write-once KD-tree that answers range and nearest-region queries efficiently. Every numeric range, every date range, and every geo_point bounding-box query in OpenSearch is a BKD-tree traversal underneath.

This is the Lucene layer beneath the numeric and geo parts of Mapping and Analysis: when that chapter says a long, date, or geo_point field is "indexed," this is the index it builds.

After this chapter you can: explain why points replaced the old numeric trie terms; describe the BKD tree's structure and the BKDWriter/BKDReader split; use PointValues and the IntPoint/LongPoint/DoublePoint field types; trace a range query through a BKD traversal; and connect it to OpenSearch numeric, date, and geo_point queries.

Why points replaced numeric trie terms

Before points (Lucene < 6), numeric range queries were a hack on the inverted index: a number was indexed at multiple precisions as a set of terms (NumericRangeQuery / "trie" fields), and a range query OR'd together the terms covering it. It worked but was costly — it bloated the terms dictionary with synthetic terms, it didn't generalize past one dimension, and range queries expanded into large boolean queries.

Points (PointValues, the BKD format) replaced all of that with a purpose-built structure:

	Numeric trie terms (old)	Points / BKD (current)
Storage	synthetic terms in the inverted index	dedicated BKD tree (`.kdd/.kdi/.kdm`)
Dimensions	1 (awkwardly)	up to 8 dims × up to 16 bytes/dim
Range query	OR of many precision terms	tree traversal pruning whole subtrees
Geo	bolted on	native (2D points, bounding boxes, polygons)
Terms-dict bloat	significant	none

The lesson generalizes: when the access pattern is "values in a region," you want a spatial tree, not an inverted list. Points are how Lucene serves numeric, date, IP, and geo simultaneously with one structure.

grep -rn "class PointValues\|class IntPoint\|class LongPoint\|class DoublePoint\|class LatLonPoint" \
  lucene/core/src/java/org/apache/lucene/document/ \
  lucene/core/src/java/org/apache/lucene/index/PointValues.java

The BKD tree

A KD-tree recursively partitions k-dimensional space, splitting on one dimension at a time. Lucene's BKD variant ("block KD") is built for disk: it groups points into fixed-size leaf blocks (e.g. ~512 points), builds a balanced tree of split nodes over those blocks, and writes the whole thing once, immutably, into the segment. It is bulk-loaded — points are sorted and the tree built bottom-up — which is why it's fast to build and tight on disk, at the cost of being write-once (perfectly fine: segments are immutable anyway).

Structure:

Leaf blocks hold the actual point values plus the doc IDs that own them.
Inner nodes store split values: "left subtree has dim-d < v, right has ≥ v," plus the min/max bounding box of each subtree.
A query descends, and at each node uses the subtree's bounding box to prune: if a subtree's box is entirely inside the query range, accept all its docs without inspecting them (CELL_INSIDE_QUERY); if entirely outside, skip it (CELL_OUTSIDE_QUERY); if it straddles, recurse (CELL_CROSSES_QUERY).

flowchart TD
    Root["root: bbox of all points"] -->|"crosses query"| N1["split on dim x @ v1"]
    N1 -->|"inside query -> accept all"| L1["leaf block (512 pts) all matched"]
    N1 -->|"crosses"| N2["split on dim y @ v2"]
    N2 -->|"outside -> skip"| L2["leaf block (pruned)"]
    N2 -->|"crosses -> scan"| L3["leaf block: test each point"]

grep -rn "class BKDWriter\|class BKDReader\|enum Relation\|CELL_INSIDE_QUERY\|CELL_OUTSIDE_QUERY\|CELL_CROSSES_QUERY" \
  lucene/core/src/java/org/apache/lucene/util/bkd/ \
  lucene/core/src/java/org/apache/lucene/index/PointValues.java

Note: The three-way relation (INSIDE/OUTSIDE/CROSSES) is the whole trick. A range query that covers a dense region accepts huge subtrees wholesale and only scans the boundary blocks. That's why a numeric range over a large matching set is cheap — it doesn't enumerate matches, it accepts subtrees.
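The three-way decision can be sketched as a pure function over bounding boxes. This is a minimal illustration, not Lucene's code: the class name `CellRelationSketch` and the use of `int[]` coordinates are assumptions made for readability (Lucene compares packed byte arrays), and bounds are treated as inclusive.

```java
/** Toy model of the three-way cell/query relation. Not Lucene's implementation. */
final class CellRelationSketch {
    enum Relation { CELL_INSIDE_QUERY, CELL_OUTSIDE_QUERY, CELL_CROSSES_QUERY }

    /** Compares a cell's bounding box to a query box, one dimension at a time (inclusive bounds). */
    static Relation relate(int[] cellMin, int[] cellMax, int[] queryMin, int[] queryMax) {
        boolean crosses = false;
        for (int d = 0; d < cellMin.length; d++) {
            // Disjoint in any single dimension: no point in the cell can match.
            if (cellMax[d] < queryMin[d] || cellMin[d] > queryMax[d]) {
                return Relation.CELL_OUTSIDE_QUERY;
            }
            // The cell sticks out of the query in this dimension.
            if (cellMin[d] < queryMin[d] || cellMax[d] > queryMax[d]) {
                crosses = true;
            }
        }
        return crosses ? Relation.CELL_CROSSES_QUERY : Relation.CELL_INSIDE_QUERY;
    }

    public static void main(String[] args) {
        int[] qMin = {10, 10};
        int[] qMax = {50, 50};
        System.out.println(relate(new int[] {20, 20}, new int[] {30, 30}, qMin, qMax));
        System.out.println(relate(new int[] {60, 0}, new int[] {70, 5}, qMin, qMax));
        System.out.println(relate(new int[] {40, 40}, new int[] {60, 60}, qMin, qMax));
    }
}
```

Only the third cell forces per-point work; the first is accepted wholesale and the second is skipped without being read.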

BKDWriter and BKDReader

The format splits, as most Lucene structures do, into a writer (index time) and a reader (query time), via the PointsFormat sub-codec (see Segments and Codecs).

Class	Role	Files
`BKDWriter`	sorts points, bulk-builds the balanced tree, writes leaf blocks + index	writes `.kdd` (data), `.kdi` (index), `.kdm` (meta)
`BKDReader`	opens the tree, exposes `PointValues` for traversal	reads `.kdd/.kdi/.kdm`
`PointValues`	the query-time view: `intersect(visitor)`, `getMinPackedValue`, `getDocCount`, dims	per field, per segment
`PointValues.IntersectVisitor`	a callback: `compare(min,max)` → `Relation`, and `visit(docID, packedValue)`	you implement this for a custom point query

The query model is a visitor: you give PointValues.intersect an IntersectVisitor. Lucene walks the tree, calling your compare(minPacked, maxPacked) at each node to get a Relation (so it can prune), and visit(docID) for every point inside a fully-matched cell (or visit(docID, packedValue) for straddling leaf blocks where it must test each value). This is exactly how IntPoint.newRangeQuery and friends are implemented.

grep -rn "interface IntersectVisitor\|void intersect\|Relation compare\|void visit" \
  lucene/core/src/java/org/apache/lucene/index/PointValues.java
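To make the visitor contract concrete, here is a small self-contained model of it over a one-dimensional tree. Everything in it (`IntersectVisitorSketch`, `Node`, `RangeVisitor`, the `int` values) is hypothetical and only mirrors the shape described above: `compare` drives pruning, `visit(docID)` is called for fully matched cells, and `visit(docID, value)` for leaf blocks that straddle the query.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy 1-D model of the intersect/visitor contract. Not Lucene's API. */
final class IntersectVisitorSketch {
    enum Relation { CELL_INSIDE_QUERY, CELL_OUTSIDE_QUERY, CELL_CROSSES_QUERY }

    interface Visitor {
        Relation compare(int cellMin, int cellMax);
        void visit(int docID);            // cell fully matched: value not inspected
        void visit(int docID, int value); // cell crosses: visitor must test the value
    }

    /** Either a leaf (docIDs/values set) or an inner node (left/right set). */
    static final class Node {
        final int min, max;
        final int[] docIDs, values;
        final Node left, right;

        Node(int[] docIDs, int[] values) { // leaf block; values sorted ascending
            this.docIDs = docIDs; this.values = values;
            this.min = values[0]; this.max = values[values.length - 1];
            this.left = null; this.right = null;
        }

        Node(Node left, Node right) {
            this.left = left; this.right = right;
            this.min = Math.min(left.min, right.min);
            this.max = Math.max(left.max, right.max);
            this.docIDs = null; this.values = null;
        }
    }

    static void addAll(Node n, Visitor v) {
        if (n.docIDs != null) { for (int id : n.docIDs) v.visit(id); }
        else { addAll(n.left, v); addAll(n.right, v); }
    }

    static void intersect(Node n, Visitor v) {
        Relation r = v.compare(n.min, n.max);
        if (r == Relation.CELL_OUTSIDE_QUERY) return;                    // prune
        if (r == Relation.CELL_INSIDE_QUERY) { addAll(n, v); return; }   // accept wholesale
        if (n.docIDs != null) {                                          // straddling leaf: scan
            for (int i = 0; i < n.docIDs.length; i++) v.visit(n.docIDs[i], n.values[i]);
        } else {
            intersect(n.left, v);
            intersect(n.right, v);
        }
    }

    static final class RangeVisitor implements Visitor {
        final int lo, hi;
        final List<Integer> hits = new ArrayList<>();
        int valueTests = 0;

        RangeVisitor(int lo, int hi) { this.lo = lo; this.hi = hi; }

        public Relation compare(int cellMin, int cellMax) {
            if (cellMax < lo || cellMin > hi) return Relation.CELL_OUTSIDE_QUERY;
            if (cellMin >= lo && cellMax <= hi) return Relation.CELL_INSIDE_QUERY;
            return Relation.CELL_CROSSES_QUERY;
        }

        public void visit(int docID) { hits.add(docID); }

        public void visit(int docID, int value) {
            valueTests++;
            if (value >= lo && value <= hi) hits.add(docID);
        }
    }

    /** Runs an inclusive range query over a fixed four-leaf tree. */
    static RangeVisitor run(int lo, int hi) {
        Node a = new Node(new int[] {0, 1}, new int[] {1, 5});
        Node b = new Node(new int[] {2, 3}, new int[] {10, 20});
        Node c = new Node(new int[] {4, 5}, new int[] {30, 40});
        Node d = new Node(new int[] {6, 7}, new int[] {50, 60});
        RangeVisitor v = new RangeVisitor(lo, hi);
        intersect(new Node(new Node(a, b), new Node(c, d)), v);
        return v;
    }

    public static void main(String[] args) {
        RangeVisitor v = run(10, 35);
        System.out.println("hits=" + v.hits + " valueTests=" + v.valueTests);
    }
}
```

For the range 10 to 35 over eight points, only the two values in the single straddling leaf are tested; the other matches come from a block accepted as a whole.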

The point field types

You index a point by adding a *Point field to a document; you query it with that type's static factory methods. The values are stored as packed byte arrays in a sortable encoding (big-endian, with the sign bit flipped for signed types, so unsigned byte comparison equals numeric comparison). That uniform encoding is what lets one BKD implementation serve ints, longs, doubles, and geo.
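A short sketch of why the byte encoding matters. Plain big-endian two's complement would sort negative numbers after positive ones under unsigned byte comparison, so a sortable encoding also flips the sign bit. The class `SortableBytesSketch` and its methods are illustrative names; Lucene's own helpers for this live in `NumericUtils` and are not called here.

```java
import java.util.Arrays;

/** Sketch of a byte-comparable int encoding: big-endian with the sign bit flipped. */
final class SortableBytesSketch {
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000; // negatives now sort before positives
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16), (byte) (flipped >>> 8), (byte) flipped
        };
    }

    static int decode(byte[] b) {
        int flipped = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16) | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    /** Compares two ints using only their encoded bytes, as a byte-oriented tree would. */
    static int compareEncoded(int a, int b) {
        return Arrays.compareUnsigned(encode(a), encode(b));
    }

    public static void main(String[] args) {
        System.out.println(compareEncoded(-5, 3) < 0);   // byte order agrees with numeric order
        System.out.println(decode(encode(-42)));
    }
}
```

Because the comparison never decodes the value, the same tree code can order any type that has such an encoding.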

Field type	Dimensions	Used for	Query factories
`IntPoint`	1–8 × 4 bytes	`integer`	`newExactQuery`, `newRangeQuery`, `newSetQuery`
`LongPoint`	1–8 × 8 bytes	`long`, `date` (epoch millis), `scaled_float` (scaled to a long)	same family
`FloatPoint` / `DoublePoint`	1–8 × 4/8 bytes	`float` / `double`	same family
`LatLonPoint`	2 × encoded	`geo_point`	`newBoxQuery`, `newDistanceQuery`, `newPolygonQuery`
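`InetAddressPoint`	1 × 16 bytes	`ip` (IPv4 mapped into the IPv6 space)	`newExactQuery`, `newPrefixQuery`, `newRangeQuery`, `newSetQuery`
`BinaryPoint`	arbitrary	custom binary keys	`newExactQuery`, `newRangeQuery`, `newSetQuery`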

// Index side:
doc.add(new LongPoint("created", epochMillis));      // BKD-indexed
doc.add(new IntPoint("price", 4200));

// Query side (Lucene):
Query q = LongPoint.newRangeQuery("created", lo, hi);          // a date range
Query g = LatLonPoint.newDistanceQuery("loc", lat, lon, 5000); // 5km radius

grep -rn "newRangeQuery\|newDistanceQuery\|newBoxQuery\|newPolygonQuery" \
  lucene/core/src/java/org/apache/lucene/document/LongPoint.java \
  lucene/core/src/java/org/apache/lucene/document/LatLonPoint.java

Note: A field is often indexed both as a point (for range queries) and as doc values (for sorting/aggregations — see DocValues). They are two different structures answering two different questions about the same value. OpenSearch numeric fields enable both by default.

How OpenSearch numeric, date, and geo queries use points

The OpenSearch field mappers build the matching Lucene point field, and the query builders produce the matching point query. Walk it from the mapper:

# Numeric & date fields create LongPoint/IntPoint/etc:
grep -rn "LongPoint\|IntPoint\|DoublePoint\|newRangeQuery\|PointRangeQuery" \
  server/src/main/java/org/opensearch/index/mapper/NumberFieldMapper.java \
  server/src/main/java/org/opensearch/index/mapper/DateFieldMapper.java

# geo_point uses LatLonPoint:
grep -rn "LatLonPoint\|newBoxQuery\|newDistanceQuery\|GeoPointFieldMapper\|LatLonShape" \
  server/src/main/java/org/opensearch/index/mapper/GeoPointFieldMapper.java

OpenSearch query	Lucene point query
`range` on `long`/`integer`/`double`	`LongPoint.newRangeQuery` etc.
`range` on `date`	`date` stored as epoch `LongPoint` → `LongPoint.newRangeQuery`
`term`/`terms` on a numeric	`IntPoint.newExactQuery`/`newSetQuery`
`geo_bounding_box`	`LatLonPoint.newBoxQuery`
`geo_distance`	`LatLonPoint.newDistanceQuery`
`geo_polygon` / `geo_shape`	`LatLonPoint.newPolygonQuery` / `LatLonShape`

So a range: { created: { gte: "now-1d" }} is, four layers down, a LongPoint.newRangeQuery that BKDReader answers by pruning subtrees of the date BKD tree. Knowing this lets you reason about why a range over a high-cardinality timestamp on a huge index is still fast (subtree pruning) and why an over-broad range that matches most docs is not (it accepts nearly everything and the cost moves to whatever consumes the matches).
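The date row in the table above reduces to arithmetic on epoch millis. The sketch below uses only `java.time`; the class and method names are hypothetical and the date-math parsing that OpenSearch performs is not modeled. It shows that the same calendar day maps to different long bounds depending on the time zone applied before the value reaches the point layer.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Sketch: a calendar date becomes a long (UTC epoch millis) before it is indexed or queried. */
final class DateToLongSketch {
    /** Start of the given ISO day at the given UTC offset, as epoch millis. */
    static long startOfDayMillis(String isoDate, int utcOffsetHours) {
        return LocalDate.parse(isoDate)
            .atStartOfDay(ZoneOffset.ofHours(utcOffsetHours))
            .toInstant()
            .toEpochMilli();
    }

    public static void main(String[] args) {
        // These would be the lo bounds handed to a long range query for "1970-01-02".
        System.out.println(startOfDayMillis("1970-01-02", 0));
        System.out.println(startOfDayMillis("1970-01-02", 2));
    }
}
```

The two bounds differ by exactly the offset, which is the whole story behind time-zone surprises: the point layer only ever sees the long.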

flowchart TD
    OS["range: {created: gte..lte}"] --> QB["RangeQueryBuilder (OpenSearch)"]
    QB --> LQ["LongPoint.newRangeQuery (Lucene)"]
    LQ --> PV["PointValues.intersect(visitor) per segment"]
    PV --> BKD["BKDReader walks tree, prunes via Relation"]
    BKD --> Hits["matched docIDs -> DocIdSetIterator"]

Reading exercise

# 1. The point types and the read API.
grep -rn "class IntPoint\|class LongPoint\|class LatLonPoint\|class PointValues" \
  lucene/core/src/java/org/apache/lucene/document/ \
  lucene/core/src/java/org/apache/lucene/index/PointValues.java

# 2. The BKD writer/reader and the prune relation.
grep -rn "class BKDWriter\|class BKDReader\|enum Relation\|CELL_CROSSES_QUERY" \
  lucene/core/src/java/org/apache/lucene/util/bkd/ \
  lucene/core/src/java/org/apache/lucene/index/PointValues.java

# 3. The visitor model.
grep -rn "interface IntersectVisitor\|Relation compare\|void visit" \
  lucene/core/src/java/org/apache/lucene/index/PointValues.java

# 4. The range query implementation.
grep -rn "class PointRangeQuery" lucene/core/src/java/org/apache/lucene/search/

# 5. OpenSearch mappers that build points.
grep -rn "LongPoint\|IntPoint\|LatLonPoint\|newRangeQuery" \
  server/src/main/java/org/opensearch/index/mapper/NumberFieldMapper.java \
  server/src/main/java/org/opensearch/index/mapper/GeoPointFieldMapper.java

# 6. See the point files in a real shard.
find /path/to/shard/index -name "*.kdd" -o -name "*.kdi" -o -name "*.kdm"

Answer:

How did Lucene do numeric ranges before points? Name three concrete problems that approach had.
Describe the BKD tree: leaf blocks, split nodes, and the bounding boxes that enable pruning. Why is it bulk-loaded and write-once?
Explain the three-way Relation (INSIDE/OUTSIDE/CROSSES) and how it makes a range over a large matching set cheap.
Write (from memory) how you'd query a date range with LongPoint, and explain how date becomes a LongPoint.
Walk a geo_distance query from the OpenSearch builder to LatLonPoint to BKDReader.
Why is a numeric field often indexed as both a point and doc values? What does each answer?

Common bugs and symptoms

Symptom	Root cause	Where to look
`range` on a numeric field errors / returns nothing	field mapped without indexing (`index: false`), so no BKD tree	mapping; `NumberFieldMapper`
Sort/agg on a numeric works but `range` doesn't (or vice versa)	one of points / doc values disabled independently	`index` vs `doc_values` mapping params
`geo_distance` slow on huge index	over-broad radius matches most points; pruning can't help	tighten radius; pre-filter; `geo_shape` strategy
Dates compared wrong across time zones	epoch-millis `LongPoint` is UTC; TZ handled above the point layer	`DateFieldMapper` formatting; query TZ
`ip` range behaves oddly	IPv4/IPv6 packed as a 16-byte `InetAddressPoint` (IPv4 mapped into IPv6); byte order matters	`IpFieldMapper`; packed encoding
Range that "should be cheap" is slow	it matches a large fraction of docs; cost is downstream collection	rethink the query; search execution

Validation: prove you understand this

Contrast numeric trie terms vs points with a table, and state the one-sentence reason points are strictly better for ranges.
Draw a small BKD tree with leaf blocks and bounding boxes, and trace a range query showing one INSIDE, one OUTSIDE, and one CROSSES decision.
Explain the BKDWriter/BKDReader/PointValues/IntersectVisitor roles and which files each produces or reads.
Match four OpenSearch query types (numeric range, date range, geo bbox, geo distance) to their Lucene point factory methods.
Explain why a numeric field carries both a point index and doc values, naming the question each answers.
Given a slow geo_distance query on a large index, explain whether BKD pruning can help and what you'd change.

DocValues: The Columnar Store

The inverted index answers "which docs contain term X." The exact opposite question — "given doc D, what is the value of field F" — is what sorting, aggregating, and scripting need, and the inverted index is terrible at it. Lucene's answer is doc values: a columnar, per-document, on-disk store, written into each segment alongside the postings. This is the structure behind every OpenSearch sort, every terms/stats/cardinality aggregation, and every doc['field'].value script access.

This chapter is the Lucene-level companion to OpenSearch's DocValues and Fielddata deep dive. That chapter covered the OpenSearch view — the IndexFieldData abstraction, the fielddata circuit breaker, global ordinals, why text is special. Here we go a layer down: the Lucene DocValuesType flavors, the DocValuesFormat that encodes them, the per-doc iterator APIs, and how ordinals work at the format level. We do not re-derive the OpenSearch side — read that chapter for it; this one is the storage format underneath.

After this chapter you can: name the five DocValuesType flavors and the Lucene API for each; explain the columnar encoding and why it's disk/page-cache friendly; describe how SORTED/SORTED_SET store ordinals + a sorted term dictionary; explain global ordinals at the format level; and trace how OpenSearch reads a keyword field's values for an aggregation down to the Lucene accessor.

Row store vs column store

Stored fields (.fdt, the _source) are a row store: doc 5's whole document sits together, great for "fetch doc 5," terrible for "read field F across a million docs" (you'd touch a million scattered rows). Doc values are a column store: all values of field F sit together, contiguous, so "read F across all docs" is a sequential scan of one tight array — cache-friendly, off-heap, OS-page-cached.

	Stored fields (row)	DocValues (column)
Layout	per-doc, all fields together	per-field, all docs together
Good for	fetching a hit's `_source`	sort / aggregate / script over a field
Lives	`.fdt/.fdx/.fdm`, compressed	`.dvd/.dvm`, off-heap / page cache
Access	"give me doc 5"	"give me field F for every doc"
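The two layouts in the table can be sketched in a few lines. This is a toy in-memory picture with hypothetical names (`RowVsColumnSketch`, `Row`), not a model of the on-disk formats: the row path has to walk whole documents to read one field, while the column path scans a single primitive array indexed by docID.

```java
/** Toy contrast of a row layout and a column layout for one numeric field. */
final class RowVsColumnSketch {
    /** Row layout: one object per document, all fields together. */
    static final class Row {
        final String title;
        final long price;
        Row(String title, long price) { this.title = title; this.price = price; }
    }

    /** Reads one field by visiting every whole document. */
    static long sumFromRows(Row[] rows) {
        long sum = 0;
        for (Row r : rows) sum += r.price;
        return sum;
    }

    /** Column layout: one array per field, indexed by docID. */
    static long[] toPriceColumn(Row[] rows) {
        long[] column = new long[rows.length];
        for (int docID = 0; docID < rows.length; docID++) column[docID] = rows[docID].price;
        return column;
    }

    /** Reads the field as one sequential scan over contiguous values. */
    static long sumFromColumn(long[] price) {
        long sum = 0;
        for (long p : price) sum += p;
        return sum;
    }

    public static void main(String[] args) {
        Row[] rows = { new Row("a", 10), new Row("b", 32) };
        System.out.println(sumFromRows(rows) + " " + sumFromColumn(toPriceColumn(rows)));
    }
}
```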

grep -rn "enum DocValuesType\|class DocValuesFormat" \
  lucene/core/src/java/org/apache/lucene/index/DocValuesType.java \
  lucene/core/src/java/org/apache/lucene/codecs/DocValuesFormat.java

The five DocValuesType flavors

Lucene stores doc values as one of five column types, chosen at index time per field. Each has a matching read API (an iterator that advances doc-by-doc).

`DocValuesType`	Holds per doc	Read API	OpenSearch field types
`NUMERIC`	one `long`	`NumericDocValues`	metadata such as `_version`, `_seq_no`
`SORTED_NUMERIC`	many `long`s, sorted	`SortedNumericDocValues`	`long`, `integer`, `double` (encoded), `date`, `boolean`, `geo_point` (encoded); single- and multi-valued
`BINARY`	one `byte[]`	`BinaryDocValues`	`binary`, range fields
`SORTED`	one term ordinal	`SortedDocValues`	single-valued strings in plain Lucene; OpenSearch `keyword` uses `SORTED_SET`
`SORTED_SET`	a set of term ordinals	`SortedSetDocValues`	`keyword`, `ip`

The read APIs are all DocValuesIterators — they extend DocIdSetIterator, so you advance(docID) to a doc and then read its value. This iterator model (introduced when doc values became sparse-aware) means a field that's absent on many docs costs nothing for those docs.

// NUMERIC: read a long per doc
NumericDocValues nd = leaf.getNumericDocValues("price");
if (nd.advanceExact(docID)) {
    long v = nd.longValue();
}

// SORTED_SET: read ordinals, resolve to bytes only when needed
SortedSetDocValues ss = leaf.getSortedSetDocValues("tags.keyword");
if (ss.advanceExact(docID)) {
    long ord;
    while ((ord = ss.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
        // bucket by ord (cheap); resolve text only to emit:
        BytesRef term = ss.lookupOrd(ord);
    }
}

grep -rn "class NumericDocValues\|class SortedDocValues\|class SortedSetDocValues\|class SortedNumericDocValues\|class BinaryDocValues" \
  lucene/core/src/java/org/apache/lucene/index/
grep -rn "advanceExact\|nextOrd\|lookupOrd\|getValueCount" \
  lucene/core/src/java/org/apache/lucene/index/SortedSetDocValues.java

Ordinals: the key trick for strings

SORTED/SORTED_SET do not store the strings per doc. They store, per doc, a small integer ordinal into a per-segment sorted dictionary of the field's distinct terms. So a keyword field with a million docs but only 50 distinct values stores a million tiny ordinals + a 50-entry dictionary, not a million strings.

flowchart LR
    Dict["Per-segment sorted dict:<br/>0=apple 1=mango 2=pear"] 
    Doc0["doc0 -> ord 2"] --> Dict
    Doc1["doc1 -> ord 0"] --> Dict
    Doc2["doc2 -> ord {0,2}"] --> Dict
    Dict -->|lookupOrd| Bytes["BytesRef('pear')"]

This is why string aggregations are fast: a terms agg buckets by ordinal (integer compares, integer hash keys), and only calls lookupOrd to turn the winning ordinals back into strings at the very end. Ordinals are also already sorted (the dictionary is sorted), so a sort on a keyword field is a sort on ordinals — no string comparison until the final fetch.

Property	Why it matters
Per-doc value = ordinal (small int)	tiny storage, cache-friendly
Dictionary is sorted	ordinal order == term order, so sort is free
`lookupOrd` deferred	resolve to bytes only for emitted buckets/hits
`getValueCount()`	number of distinct ords in the segment

Note: Ordinals are per-segment. Ordinal 2 in segment A and ordinal 2 in segment B may be different terms. To aggregate across a whole shard you need a shard-wide numbering — global ordinals, below.
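A minimal sketch of the ordinal idea for a single-valued field in one segment. The class `OrdinalsSketch` and its methods are hypothetical, and Java `String` ordering stands in for Lucene's byte-wise term order, which is an assumption that holds for the ASCII terms used here.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy per-segment ordinals: sorted dictionary + one small int per doc. */
final class OrdinalsSketch {
    /** Sorted, de-duplicated dictionary of the field's terms. */
    static String[] buildDictionary(String[] perDocTerm) {
        return new TreeSet<>(Arrays.asList(perDocTerm)).toArray(new String[0]);
    }

    /** Per-doc ordinal = position of the doc's term in the sorted dictionary. */
    static int[] toOrdinals(String[] perDocTerm, String[] dict) {
        int[] ords = new int[perDocTerm.length];
        for (int doc = 0; doc < perDocTerm.length; doc++) {
            ords[doc] = Arrays.binarySearch(dict, perDocTerm[doc]);
        }
        return ords;
    }

    /** Terms-agg style counting: array indexing by ordinal, no string work. */
    static int[] countByOrdinal(int[] ords, int valueCount) {
        int[] counts = new int[valueCount];
        for (int ord : ords) counts[ord]++;
        return counts;
    }

    /** Resolves only the winning ordinal back to text (the lookupOrd step). */
    static String topTerm(String[] perDocTerm) {
        String[] dict = buildDictionary(perDocTerm);
        int[] counts = countByOrdinal(toOrdinals(perDocTerm, dict), dict.length);
        int best = 0;
        for (int ord = 1; ord < counts.length; ord++) {
            if (counts[ord] > counts[best]) best = ord;
        }
        return dict[best];
    }

    public static void main(String[] args) {
        String[] docs = {"pear", "apple", "pear", "mango"};
        String[] dict = buildDictionary(docs);
        System.out.println(Arrays.toString(dict) + " " + Arrays.toString(toOrdinals(docs, dict)));
        System.out.println(topTerm(docs));
    }
}
```

The counting loop never touches a string, and sorting docs by ordinal gives term order for free because the dictionary is sorted.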

The DocValuesFormat and the .dvd/.dvm files

The on-disk encoding is defined by the codec's DocValuesFormat sub-format (see Segments and Codecs). It writes two files:

File	Holds
`.dvd`	the doc-values data: the packed columns, the ordinal arrays, the term dictionaries
`.dvm`	metadata: per-field offsets, value counts, min/max, encoding parameters

The default format is aggressively compact. Numerics are stored with techniques chosen from the data: delta/GCD encoding (store differences or a common factor), table/ordinal encoding when few distinct values, and bit-packing to the minimum bits per value. Sorted dictionaries are prefix-compressed. None of this is loaded onto the JVM heap — it's mmap'd and served from the OS page cache, which is exactly why doc values do not count against the fielddata circuit breaker (the OpenSearch deep dive's point, now with the format reason underneath).
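The delta/GCD and bit-packing ideas can be shown with plain arithmetic. This sketch assumes a hypothetical helper class `NumericPackingSketch`; it computes how many bits per value are needed once the minimum is subtracted and a common factor divided out. It does not reproduce the real format's block structure or its set of supported bit widths, and it assumes `max - min` does not overflow a `long`.

```java
/** Sketch of min-subtraction + GCD + bit-width selection for a numeric column. */
final class NumericPackingSketch {
    static long gcd(long a, long b) {
        while (b != 0) { long t = a % b; a = b; b = t; }
        return Math.abs(a);
    }

    /** Bits needed to represent any value in [0, maxValue]. */
    static int bitsRequired(long maxValue) {
        return maxValue == 0 ? 0 : 64 - Long.numberOfLeadingZeros(maxValue);
    }

    /** Bits per value after storing (v - min) / gcd instead of v. */
    static int bitsPerValue(long[] values) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        long g = 0;
        for (long v : values) g = gcd(g, v - min);
        if (g == 0) return 0; // all values equal: nothing to store per doc
        return bitsRequired((max - min) / g);
    }

    public static void main(String[] args) {
        // Timestamps that all fall on whole minutes.
        long[] ts = {1_700_000_000_000L, 1_700_000_060_000L, 1_700_000_120_000L};
        System.out.println("raw bits: " + bitsRequired(ts[2]) + ", packed bits: " + bitsPerValue(ts));
    }
}
```

For these three minute-aligned timestamps the raw values need 41 bits each, while the encoded column needs 2, with the minimum and the factor stored once in metadata.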

ls lucene/core/src/java/org/apache/lucene/codecs/lucene*/   # *DocValuesFormat
grep -rln "DocValuesFormat" lucene/core/src/java/org/apache/lucene/codecs/
grep -rn "class .*DocValuesConsumer\|class .*DocValuesProducer" \
  lucene/core/src/java/org/apache/lucene/codecs/lucene*/ | head

The format splits into a consumer (write side, DocValuesConsumer) and a producer (read side, DocValuesProducer) — the same writer/reader split you've now seen for postings (BlockTree) and points (BKD). It's the consistent shape of every Lucene format.

Global ordinals at the format level

A terms agg needs one ordinal space for the whole shard, but each segment has its own. Lucene builds an OrdinalMap: it merges the per-segment sorted dictionaries into one shard-global sorted dictionary and produces, per segment, a mapping localOrd → globalOrd. Now an aggregator can bucket by global ordinal across all segments and resolve to text once at the end.

flowchart LR
    SA["Seg A: 0=apple 1=pear"] --> OM[OrdinalMap]
    SB["Seg B: 0=pear 1=plum"] --> OM
    OM --> G["Global: 0=apple 1=pear 2=plum"]
    OM --> MapA["A: local 0->0, 1->1"]
    OM --> MapB["B: local 0->1, 1->2"]

grep -rn "class OrdinalMap\|build(" \
  lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java
grep -rn "MultiDocValues\|getSortedSetValues\|OrdinalMap" \
  lucene/core/src/java/org/apache/lucene/index/MultiDocValues.java
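The local-to-global mapping in the diagram can be reproduced with a small sketch. `GlobalOrdinalsSketch` and its methods are hypothetical names, and the dictionaries are plain `String[]` arrays rather than Lucene term enums; the real `OrdinalMap` stores its mappings in compressed form.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy global ordinals: merge per-segment dictionaries, then map local ords to global ords. */
final class GlobalOrdinalsSketch {
    /** One sorted, de-duplicated dictionary across all segments. */
    static String[] globalDictionary(String[][] segmentDicts) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] dict : segmentDicts) all.addAll(Arrays.asList(dict));
        return all.toArray(new String[0]);
    }

    /** For one segment: result[localOrd] = globalOrd. */
    static int[] localToGlobal(String[] segmentDict, String[] globalDict) {
        int[] map = new int[segmentDict.length];
        for (int local = 0; local < segmentDict.length; local++) {
            map[local] = Arrays.binarySearch(globalDict, segmentDict[local]);
        }
        return map;
    }

    public static void main(String[] args) {
        String[] segA = {"apple", "pear"};
        String[] segB = {"pear", "plum"};
        String[] global = globalDictionary(new String[][] {segA, segB});
        System.out.println(Arrays.toString(global));
        System.out.println(Arrays.toString(localToGlobal(segA, global)));
        System.out.println(Arrays.toString(localToGlobal(segB, global)));
    }
}
```

Building the merged dictionary touches every distinct term in every segment, which is why the cost grows with cardinality and has to be paid again when the segment set changes.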

This OrdinalMap is precisely what the OpenSearch docvalues/fielddata chapter calls "global ordinals" — built lazily on first agg, cached per shard, invalidated when segments change (a refresh). eager_global_ordinals: true just moves the OrdinalMap build to refresh time. The Lucene class is the implementation; the OpenSearch chapter is the operational behavior. Read both.

How OpenSearch enables and reads doc values

By default OpenSearch sets doc_values: true on keyword, numeric, date, boolean, ip, and geo_point fields — and false on text (a token stream has no single value). The field mapper declares the DocValuesType, and at query time OpenSearch's IndexFieldData reads the matching Lucene accessor.

# Where field types declare doc values:
grep -rn "hasDocValues\|DocValuesType\|SortedSetDocValuesField\|SortedNumericDocValuesField\|NumericDocValuesField" \
  server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java \
  server/src/main/java/org/opensearch/index/mapper/NumberFieldMapper.java

# The bridge from OpenSearch fielddata to Lucene doc-values accessors:
grep -rn "getSortedSetDocValues\|getNumericDocValues\|SortedSetDocValues\|LeafFieldData" \
  server/src/main/java/org/opensearch/index/fielddata/ | head

The chain for a keyword terms agg, end to end:

flowchart TD
    Agg["terms agg on tags.keyword"] --> IFD["IndexFieldData (OpenSearch)"]
    IFD --> LFD["LeafFieldData per segment"]
    LFD --> SSDV["SortedSetDocValues (Lucene)"]
    SSDV --> Ord["nextOrd() -> local ordinals"]
    Ord --> GO["OrdinalMap: local -> global ordinal"]
    GO --> Bucket["bucket by global ord; lookupOrd at the end"]

That's the same picture the OpenSearch deep dive draws as MappedFieldType → IndexFieldData → LeafFieldData → Lucene accessor — now you can name the Lucene end of it: SortedSetDocValues + OrdinalMap, encoded in .dvd/.dvm by the DocValuesFormat.

Reading exercise

# 1. The five types and their accessors.
grep -rn "enum DocValuesType" lucene/core/src/java/org/apache/lucene/index/DocValuesType.java
grep -rn "class SortedSetDocValues\|class NumericDocValues\|class SortedNumericDocValues" \
  lucene/core/src/java/org/apache/lucene/index/

# 2. Ordinals.
grep -rn "lookupOrd\|nextOrd\|getValueCount\|ordValue" \
  lucene/core/src/java/org/apache/lucene/index/SortedSetDocValues.java \
  lucene/core/src/java/org/apache/lucene/index/SortedDocValues.java

# 3. The format: consumer/producer and files.
ls lucene/core/src/java/org/apache/lucene/codecs/lucene*/
grep -rn "DocValuesConsumer\|DocValuesProducer" lucene/core/src/java/org/apache/lucene/codecs/ | head

# 4. Global ordinals (OrdinalMap).
grep -rn "class OrdinalMap\|OrdinalMap.build" lucene/core/src/java/org/apache/lucene/index/OrdinalMap.java

# 5. OpenSearch side (cross-link to docvalues-fielddata.md).
grep -rn "SortedSetDocValuesField\|NumericDocValuesField\|hasDocValues" \
  server/src/main/java/org/opensearch/index/mapper/KeywordFieldMapper.java

# 6. The files on disk.
find /path/to/shard/index -name "*.dvd" -o -name "*.dvm"

Answer:

Contrast a row store (stored fields) and a column store (doc values) and say which access pattern each is built for.
Name the five DocValuesType flavors, the read API for each, and an OpenSearch field type that uses each.
Explain ordinals: what is stored per doc for a keyword, why aggregations bucket on ordinals, and when lookupOrd is called.
Why do doc values not count against the fielddata circuit breaker? Tie it to where the .dvd bytes live.
Explain OrdinalMap (global ordinals) at the format level and how it relates to eager_global_ordinals in OpenSearch.
Trace a keyword terms agg from the OpenSearch IndexFieldData to the Lucene SortedSetDocValues and OrdinalMap.

Common bugs and symptoms

Symptom	Root cause	Where to look
`Fielddata is disabled on text fields`	sorting/aggregating `text`, which has no doc values	use a `keyword` multi-field; docvalues-fielddata.md
Sort on a field errors / returns garbage	`doc_values: false` set on the mapping	re-enable doc values; reindex
First agg after refresh slow, later fast	`OrdinalMap` (global ordinals) rebuilt lazily after segment change	`eager_global_ordinals`; docvalues-fielddata.md
High-cardinality `keyword` agg heavy	huge global-ordinal dictionary per shard	reconsider `composite`/`cardinality`; eager ords
Disk size dominated by `.dvd`	many high-cardinality doc-values fields	drop `doc_values` on fields never sorted/aggregated
Script `doc['f'].value` throws	field has no doc values	enable doc values or use `_source` access

Validation: prove you understand this

Draw the column layout for one keyword field across three docs, showing the sorted dictionary and the per-doc ordinals.
Explain why ordinals make both sorting and terms aggregation fast, and where lookupOrd enters.
Name the five DocValuesTypes and give the Lucene accessor and an OpenSearch field type for each.
Explain the DocValuesFormat consumer/producer split and the .dvd/.dvm files, and one numeric encoding technique it uses.
Define OrdinalMap and connect it precisely to the OpenSearch notion of global ordinals and eager_global_ordinals.
Trace, naming every layer, how a terms agg on tags.keyword reads values from OpenSearch down to the Lucene SortedSetDocValues in a .dvd.

IndexWriter and Merge Internals

IndexWriter is the write path. It takes documents, buffers them in RAM, flushes them into immutable segments, tracks deletes as side bitsets, merges small segments into bigger ones, and — when told to commit() — writes the segments_N that makes everything durable. OpenSearch's InternalEngine owns one IndexWriter per shard and drives every one of these operations. This chapter is the inside of those calls: the per-thread buffers (DWPT), the flush-to-segment pipeline, the merge policy and scheduler, how deletes and updates work without mutating segments, and the precise difference between flush() and commit().

This is the Lucene engine beneath OpenSearch's Refresh, Flush, and Merge: that chapter mapped OpenSearch refresh→openIfChanged, flush→commit(), merge→MergePolicy. Here you learn what each of those Lucene operations actually does to RAM, disk, and readers.

After this chapter you can: explain DWPT and why indexing scales with threads; trace the buffer→flush→segment pipeline; describe TieredMergePolicy + ConcurrentMergeScheduler; explain deletes/updates via liveDocs; state the commit() vs flush() distinction precisely; and explain near-real-time readers.

IndexWriter and DocumentsWriterPerThread (DWPT)

A naive writer would serialize all indexing through one buffer and one lock — disastrous under concurrent writes. Lucene instead gives each indexing thread its own in-RAM buffer: a DocumentsWriterPerThread (DWPT). A thread that calls addDocument grabs a DWPT (from a pool), analyzes and buffers the doc into its private structures, and never contends with other indexing threads on the hot path. The shared DocumentsWriter coordinates the pool; IndexWriter sits above it.

flowchart TD
    T1["thread 1 addDocument"] --> D1["DWPT #1 (private RAM buffer)"]
    T2["thread 2 addDocument"] --> D2["DWPT #2 (private RAM buffer)"]
    T3["thread 3 addDocument"] --> D3["DWPT #3 (private RAM buffer)"]
    D1 -->|flush| S1["segment _a"]
    D2 -->|flush| S2["segment _b"]
    D3 -->|flush| S3["segment _c"]
    S1 & S2 & S3 --> R["DirectoryReader sees all segments"]

grep -rn "class IndexWriter\b\|class DocumentsWriter\b\|class DocumentsWriterPerThread" \
  lucene/core/src/java/org/apache/lucene/index/
grep -rn "addDocument\|updateDocument\|getAndLock\|flushControl" \
  lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java | head

Each DWPT, when it fills up (RAM budget, doc count) or on a flush request, is flushed independently into its own segment. So one logical "flush" can produce several segments — one per active DWPT. This is the source of the parallelism in Lucene indexing: more threads → more DWPTs → more concurrent buffering and flushing. The OpenSearch indexing thread pool sizing (thread pools) interacts directly with how many DWPTs are in flight.

Note: "Indexing buffer" in OpenSearch (indices.memory.index_buffer_size) is the RAM budget shared across DWPTs of all shards on the node. When it fills, Lucene flushes the largest DWPTs to reclaim RAM — which is why a node under heavy indexing produces many small segments (and then leans on merges to clean up).
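The per-thread buffering idea, reduced to its essentials. This is a sketch under simplifying assumptions: `PerThreadBufferSketch` is a hypothetical class, each thread is bound to exactly one buffer for its lifetime (Lucene hands DWPTs out from a pool instead), and a "segment" is just an unmodifiable list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model: one private buffer per indexing thread, one segment per flushed buffer. */
final class PerThreadBufferSketch {
    static List<List<String>> indexAndFlush(int threads, int docsPerThread) {
        List<List<String>> buffers = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            List<String> buffer = new ArrayList<>(); // touched by one thread only: no lock
            buffers.add(buffer);
            final int id = t;
            Thread worker = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) buffer.add("doc-" + id + "-" + i);
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread w : workers) w.join(); // join makes each buffer's contents visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // "Flush": every non-empty buffer becomes its own immutable segment.
        List<List<String>> segments = new ArrayList<>();
        for (List<String> buffer : buffers) {
            if (!buffer.isEmpty()) segments.add(Collections.unmodifiableList(new ArrayList<>(buffer)));
        }
        return segments;
    }

    public static void main(String[] args) {
        List<List<String>> segments = indexAndFlush(3, 4);
        System.out.println(segments.size() + " segments from one logical flush");
    }
}
```

Three writer threads yield three segments from a single flush, which is the small-segment pressure that merging later relieves.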

The buffer → flush → segment pipeline

What a DWPT does on flush, in order:

Invert the buffered docs: build the in-memory inverted index (terms → postings), doc values, points, stored fields, vectors — all the per-field structures.
Encode them through the active Codec's sub-formats into segment files (.tim/.doc, .dvd, .kdd, .fdt, etc. — see Segments and Codecs).
Write the .si and register the new SegmentCommitInfo with the writer's in-memory SegmentInfos (pending — not yet a commit point).
The segment is now flushed: it exists on disk and a refresh can open a reader over it. It is not yet committed (no new segments_N).

grep -rn "flush(\|doFlush\|class FlushByRamOrCountsPolicy\|getFlushingBytes" \
  lucene/core/src/java/org/apache/lucene/index/ | head

This is the crux of near-real-time: a flushed segment is visible (after a reader opens) but not durable across a crash via the index alone — Lucene itself only guarantees durability at commit(). OpenSearch closes that gap with the translog: the op is in the translog before it's acknowledged, so an unflushed-but-refreshed doc survives a crash via translog replay. Visibility (refresh) and durability (commit + translog) are independent — the recurring theme of this whole subsystem.

Deletes and updates: liveDocs, not mutation

Segments are immutable, so a delete cannot remove bytes. Instead Lucene keeps a per-segment liveDocs bitset: bit i is set if doc i is still alive. A delete clears the bit; queries skip dead docs by consulting liveDocs. The dead doc's bytes (postings, doc values, stored fields) stay on disk until a merge rewrites the segment and physically drops them.

Operation	What actually happens
`deleteDocuments(term)`	find matching docs, clear their bits in the target segments' `liveDocs`
`updateDocument(term, doc)`	delete-by-term (clear old doc's bit) + add the new doc to a DWPT → new segment
`liveDocs` storage	the `.liv` file (a `LiveDocsFormat`), regenerated each commit
reclaiming dead bytes	only a merge does this

grep -rn "liveDocs\|class .*LiveDocs\|applyDeletes\|deleteDocument" \
  lucene/core/src/java/org/apache/lucene/index/ | head

So an "update" in OpenSearch is, at the Lucene level, delete-by-_id-term + add. The old document is tombstoned in its old segment; the new one lives in a fresh segment. A high-churn index (constant updates) accumulates dead-doc debt. The percentage of deleted docs is a key health metric (_cat/segments shows docs.deleted), because that debt is only paid down by merges.
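A toy model of tombstoning and reclamation, using `java.util.BitSet` as the liveDocs bitset. The names (`LiveDocsSketch`, `Segment`, `merge`) are hypothetical and the "documents" are plain strings; the point is only that a delete changes a bit, never the stored data, and that a merge is the step that drops dead entries.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

/** Toy immutable segment with a side bitset of live docs. */
final class LiveDocsSketch {
    static final class Segment {
        final List<String> docs;  // never modified after construction
        final BitSet liveDocs;    // bit i set = doc i is alive

        Segment(List<String> docs) {
            this.docs = Collections.unmodifiableList(new ArrayList<>(docs));
            this.liveDocs = new BitSet(docs.size());
            this.liveDocs.set(0, docs.size());
        }

        void delete(int docID) { liveDocs.clear(docID); } // tombstone only

        int liveCount() { return liveDocs.cardinality(); }

        /** Returns live docs; dead docs are skipped, not absent. */
        List<String> search() {
            List<String> hits = new ArrayList<>();
            for (int i = liveDocs.nextSetBit(0); i >= 0; i = liveDocs.nextSetBit(i + 1)) {
                hits.add(docs.get(i));
            }
            return hits;
        }
    }

    /** A merge copies only live docs into a new segment, reclaiming dead space. */
    static Segment merge(Segment... inputs) {
        List<String> merged = new ArrayList<>();
        for (Segment s : inputs) merged.addAll(s.search());
        return new Segment(merged);
    }

    public static void main(String[] args) {
        List<String> initial = new ArrayList<>();
        Collections.addAll(initial, "d0", "d1", "d2");
        Segment a = new Segment(initial);
        a.delete(1);
        System.out.println(a.docs.size() + " stored, " + a.liveCount() + " live");
        System.out.println(merge(a).docs);
    }
}
```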

Merges: MergePolicy and MergeScheduler

Without merging, segment count and dead-doc debt grow forever: more segments means more per-query overhead (every query visits every segment), and tombstoned docs waste disk and slow scans. Merging combines several segments into one new segment, dropping deleted docs in the process. Two pluggable pieces decide which and when:

Piece	Default impl	Decides
`MergePolicy`	`TieredMergePolicy`	which segments to merge — groups segments into size tiers, selects a set whose merge gives the best size reduction for the I/O
`MergeScheduler`	`ConcurrentMergeScheduler`	when/how merges run — on background threads, with I/O throttling and a max-concurrent-merge cap

TieredMergePolicy (unlike the old log-based policies) does not require adjacent segments — it picks the best-value set anywhere, favoring segments with many deletes (to reclaim them) and avoiding repeatedly rewriting huge segments. Key knobs: maxMergeAtOnce, segmentsPerTier, maxMergedSegmentMB, deletesPctAllowed.

grep -rn "class TieredMergePolicy\|segmentsPerTier\|maxMergedSegment\|deletesPctAllowed\|class MergePolicy\b" \
  lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
grep -rn "class ConcurrentMergeScheduler\|maxThreadCount\|maxMergeCount\|doMerge" \
  lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java

OpenSearch wires these with OpenSearchConcurrentMergeScheduler and MergePolicyConfig/MergePolicyProvider, exposing index.merge.* settings and merge stats. A _forcemerge to one segment is the manual override (great before a shard goes read-only; dangerous on a live write-heavy index — see the warning in refresh-flush-merge.md).

grep -rn "TieredMergePolicy\|ConcurrentMergeScheduler\|MergePolicyConfig\|index.merge" \
  server/src/main/java/org/opensearch/index/ | head

flowchart TD
    Buf["DWPT buffers"] -->|flush| Small["many small segments"]
    Small --> MP["TieredMergePolicy: pick best-value set"]
    MP --> CMS["ConcurrentMergeScheduler: run on bg thread, throttle I/O"]
    CMS --> Merged["one larger segment (dead docs dropped)"]
    Merged --> MP

commit() vs flush(): the precise distinction

This trips people up; nail it. The two words mean specific, different things in Lucene — and OpenSearch overloads "flush" to mean Lucene commit(), which adds to the confusion.

Lucene call	Does	Durable across crash?	Writes `segments_N`?	Trims translog?
`IndexWriter.flush()` (internal)	push DWPT buffers to disk as segments	No (index alone)	No	n/a
`IndexWriter.commit()`	fsync all pending segment files + write a new `segments_N` commit point	Yes	Yes	(OpenSearch) yes
OpenSearch refresh	`DirectoryReader.openIfChanged` — open a reader over current segments (triggers an internal flush of buffers)	n/a (visibility, not durability)	No	No
OpenSearch flush	calls Lucene `commit()` + rolls/trims translog	Yes	Yes	Yes

The mental model:

Lucene flush = "get my RAM buffers onto disk as segments" (visibility-ish, not durability).
Lucene commit = "make the current set of segments the official, crash-safe index by writing segments_N."
OpenSearch refresh ≈ Lucene flush + open reader = visible.
OpenSearch flush = Lucene commit + translog roll = durable, translog trimmed.
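The four lines above can be turned into a small state model and crash-tested. Everything here is a deliberately crude assumption: `DurabilitySketch` is hypothetical, each "place" a document can live is a list, the translog is treated as already fsynced on every write, and a crash is modeled as keeping only what a commit point or the translog references.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of visibility vs durability for one shard. */
final class DurabilitySketch {
    final List<String> ramBuffer = new ArrayList<>(); // in RAM only
    final List<String> flushed = new ArrayList<>();   // segments on disk, no commit point yet
    final List<String> committed = new ArrayList<>(); // referenced by segments_N
    final List<String> translog = new ArrayList<>();  // operation log kept since last commit

    void index(String doc) {
        translog.add(doc);   // logged before the write is acknowledged
        ramBuffer.add(doc);
    }

    /** Lucene flush: RAM buffers become segments; nothing is made crash-safe. */
    void luceneFlush() {
        flushed.addAll(ramBuffer);
        ramBuffer.clear();
    }

    /** Refresh: flush buffers and return what a newly opened reader would see. */
    List<String> refresh() {
        luceneFlush();
        List<String> visible = new ArrayList<>(committed);
        visible.addAll(flushed);
        return visible;
    }

    /** OpenSearch flush: Lucene commit plus translog trim. */
    void opensearchFlush() {
        luceneFlush();
        committed.addAll(flushed);
        flushed.clear();
        translog.clear();
    }

    /** What the Lucene index alone holds after a crash. */
    List<String> recoverFromIndexOnly() { return new ArrayList<>(committed); }

    /** What survives once the translog is replayed on top of the last commit. */
    List<String> recoverWithTranslog() {
        List<String> recovered = new ArrayList<>(committed);
        recovered.addAll(translog);
        return recovered;
    }

    public static void main(String[] args) {
        DurabilitySketch shard = new DurabilitySketch();
        shard.index("a");
        shard.index("b");
        shard.opensearchFlush();
        shard.index("c");
        System.out.println("visible:     " + shard.refresh());
        System.out.println("index only:  " + shard.recoverFromIndexOnly());
        System.out.println("with replay: " + shard.recoverWithTranslog());
    }
}
```

Document "c" is searchable after the refresh yet missing from the index-only recovery, and it reappears once the translog is replayed.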

grep -rn "public long commit\|void flush(\|prepareCommit\|fsync\|SegmentInfos" \
  lucene/core/src/java/org/apache/lucene/index/IndexWriter.java | head
# OpenSearch side: refresh vs flush:
grep -rn "openIfChanged\|commitIndexWriter\|indexWriter.commit\|refresh(" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java | head

Warning: A commit is expensive (fsync everything + a new commit point). This is exactly why OpenSearch does not commit on every write — it refreshes for visibility (cheap, no fsync) and relies on the translog for durability between commits, committing only on flush. Do not script periodic _flush; you fight the heuristics and create commit churn.

Near-real-time readers

Reads go through a DirectoryReader, a point-in-time view of a set of segments. To see new docs without a full commit, Lucene opens a near-real-time (NRT) reader directly from the IndexWriter:

DirectoryReader.open(IndexWriter) — open an NRT reader that includes the writer's currently-flushed (uncommitted) segments.
DirectoryReader.openIfChanged(oldReader, writer) — cheaply get a new reader reflecting changes since oldReader, reusing unchanged segment readers (only new/changed segments get new readers — this is what makes refresh cheap).

grep -rn "static DirectoryReader open\|openIfChanged\|class StandardDirectoryReader\|class DirectoryReader" \
  lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java | head

OpenSearch wraps this in a reader manager built on Lucene's ReferenceManager, the same acquire/release pattern as Lucene's SearcherManager (which is a ReferenceManager<IndexSearcher>): a refresh calls openIfChanged, swaps in the new reader, and reference-counts the old one until in-flight searches release it. That is the entire mechanism by which a just-indexed doc becomes searchable; it is covered from the OpenSearch side in engine internals.
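The swap-and-reference-count behaviour can be sketched in a few lines of plain Java. `ToyReaderManager` and `Reader` are hypothetical stand-ins for illustration only; they mimic the acquire/release contract of Lucene's ReferenceManager but are not its implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy reference-counted reader swap. Hypothetical class, not Lucene's ReferenceManager. */
final class ToyReaderManager {
    static final class Reader {
        final int generation;
        // Starts at 1: the manager itself holds one reference to the current reader.
        private final AtomicInteger refCount = new AtomicInteger(1);

        Reader(int generation) { this.generation = generation; }
        void incRef() { refCount.incrementAndGet(); }
        void decRef() { refCount.decrementAndGet(); }
        boolean isClosed() { return refCount.get() == 0; }
    }

    private Reader current = new Reader(0);

    /** A search acquires the current reader and must release it when done. */
    synchronized Reader acquire() {
        current.incRef();
        return current;
    }

    void release(Reader reader) { reader.decRef(); }

    /** Refresh: publish a new reader, then drop only the manager's own reference to the old one. */
    synchronized void refresh() {
        Reader old = current;
        current = new Reader(old.generation + 1);
        old.decRef();
    }
}
```

A search that acquired generation 0 before a refresh keeps a usable reader until it calls `release`, while new searches already see generation 1.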

The segment lifecycle, end to end

stateDiagram-v2
    [*] --> Buffered: addDocument -> DWPT RAM buffer
    Buffered --> Flushed: flush (RAM full / refresh) -> new immutable segment on disk
    Flushed --> Visible: DirectoryReader.openIfChanged (refresh) opens reader
    Flushed --> Committed: IndexWriter.commit() writes segments_N (durable)
    Visible --> Merging: TieredMergePolicy selects + ConcurrentMergeScheduler runs
    Committed --> Merging
    Merging --> Merged: dead docs dropped, segments combined
    Merged --> Committed: next commit records the merged set
    Committed --> [*]: segment dropped when superseded by a merge

A doc travels: buffered in a DWPT → flushed to a small segment → made visible by a refresh → made durable by a commit → eventually merged into a larger segment (its old segment dropped). Deletes ride along as liveDocs bits that merges finally reclaim. Everything in OpenSearch's write path is this lifecycle plus versioning, seq numbers, and the translog layered on top.

Reading exercise

# 1. The writer and DWPT.
grep -rn "class IndexWriter\b\|class DocumentsWriterPerThread\|class DocumentsWriter" \
  lucene/core/src/java/org/apache/lucene/index/

# 2. Flush pipeline.
grep -rn "doFlush\|flush(\|FlushByRamOrCountsPolicy" lucene/core/src/java/org/apache/lucene/index/ | head

# 3. Deletes / liveDocs.
grep -rn "liveDocs\|applyDeletes\|deleteDocuments" lucene/core/src/java/org/apache/lucene/index/ | head

# 4. Merge policy + scheduler.
grep -rn "class TieredMergePolicy\|class ConcurrentMergeScheduler\|segmentsPerTier\|maxMergedSegment" \
  lucene/core/src/java/org/apache/lucene/index/

# 5. commit vs flush vs NRT.
grep -rn "public long commit\|openIfChanged\|static DirectoryReader open" \
  lucene/core/src/java/org/apache/lucene/index/IndexWriter.java \
  lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java

# 6. OpenSearch wiring + live merge stats.
grep -rn "indexWriter.commit\|openIfChanged\|ConcurrentMergeScheduler" \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java
curl -s 'localhost:9200/_cat/segments/my-index?v&h=segment,docs.count,docs.deleted,size'
curl -s 'localhost:9200/my-index/_stats/merge?pretty'

Answer:

What is a DWPT and why does giving each indexing thread its own buffer make indexing scale? What produces multiple segments from one flush?
List the steps a DWPT performs on flush, and say at which step the data becomes visible vs durable.
How do deletes and updates work given immutable segments? What reclaims the dead bytes, and what metric tracks the debt?
Distinguish MergePolicy from MergeScheduler, name the defaults, and give two TieredMergePolicy knobs and what they do.
State the exact difference between Lucene flush(), Lucene commit(), OpenSearch refresh, and OpenSearch flush — in one row each.
Explain how openIfChanged makes a refresh cheap and what SearcherManager adds on top.

Common bugs and symptoms

Symptom	Root cause	Where to look
Segment count keeps climbing, queries slow	merges starved (throttled / too few scheduler threads)	`ConcurrentMergeScheduler` `maxThreadCount`; `_cat/segments`
Huge `docs.deleted` fraction	high update/delete churn; `TieredMergePolicy` not reclaiming fast enough	`deletesPctAllowed`; consider `_forcemerge` (read-only shard only)
Indexing stalls under load	indexing buffer full, all threads waiting on flush	`indices.memory.index_buffer_size`; DWPT flush pressure
Data lost on crash despite "successful" indexing	relied on refresh (visibility) as durability; no commit/translog fsync	translog durability policy; commit vs refresh
`_forcemerge` to 1 segment then write-heavy → giant unmergeable segment	one oversized segment `TieredMergePolicy` won't touch again	only force-merge read-only shards; refresh-flush-merge.md
First search after refresh slow	new segment readers opened by `openIfChanged`; caches cold	expected; warmers / `eager_global_ordinals`

Validation: prove you understand this

Draw the multi-DWPT model and explain why it's the source of Lucene indexing parallelism and of "many small segments."
Walk the buffer→flush→segment pipeline and mark the visibility point and the durability point.
Explain deletes/updates via liveDocs and .liv, and what physically reclaims dead docs.
Compare TieredMergePolicy and ConcurrentMergeScheduler and name two tuning knobs on each.
Produce the four-row table distinguishing Lucene flush, Lucene commit, OpenSearch refresh, and OpenSearch flush, with durability/visibility/segments_N for each.
Draw the full segment lifecycle state machine and place a single document on it from addDocument to "dropped after merge."

HNSW Vector Search in Lucene

Everything you know about Lucene so far — the inverted index, BKD trees, columnar DocValues — answers a lexical question: "which documents contain this term, this number, this range?" Vector search answers a semantic one: "which documents are nearest to this point in a high-dimensional space?" An embedding model turns text or an image into a float[] of, say, 768 or 1024 dimensions; two vectors that are close (by cosine or Euclidean distance) mean two things that are semantically similar. The problem is that an exact nearest-neighbour scan over millions of 768-dim vectors is a brute-force O(N·d) dot-product per query — too slow for interactive search.

The answer Lucene ships is HNSW — a Hierarchical Navigable Small World graph — an approximate nearest-neighbour (ANN) index that finds the k nearest vectors in roughly O(log N) distance computations instead of O(N). This chapter covers the HNSW algorithm itself (layers, greedy search, the M / efConstruction / efSearch knobs), the concrete org.apache.lucene.* classes that implement it, the on-disk vector format and its scalar-quantized variants, how it integrates with merges, and exactly how OpenSearch's k-NN lucene engine reuses all of it.

After this chapter you should be able to: explain what an HNSW graph is and why greedy search over it is fast; name the Lucene classes that build, store, and query it; map the .vec/.vex/.vem files to a chapter concept; reason about the recall/latency/memory trade-offs the parameters control; and connect a Lucene HNSW concept to the OpenSearch k-NN setting that exposes it.

Note: This is the Lucene layer. OpenSearch's k-NN plugin has three engines — faiss (default, native C++), lucene (this chapter), and the deprecated nmslib. Only the lucene engine uses the classes below; faiss writes its own graph into a custom codec. See the k-NN engines chapter for the full comparison.

Why approximate? The cost of exact nearest-neighbour

Exact k-NN ("flat" or "brute-force" search) computes the distance from the query vector to every stored vector and keeps the best k. For N documents of dimension d, that is N·d multiply-adds per query. At N = 10M, d = 768, a single query is ~7.7 billion multiply-adds over roughly 30 GB of float32 data (10M × 768 × 4 bytes). Even with SIMD that is far beyond an interactive latency budget on a single shard, and it scales linearly with the corpus. Exact search is correct but does not scale.
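For reference, here is what flat search looks like in plain Java, together with the cost arithmetic from the paragraph above. It is a minimal sketch: `ExactKnn` is a hypothetical class, and squared Euclidean distance is chosen arbitrarily as the metric.

```java
import java.util.PriorityQueue;

/** Brute-force (flat) k-NN in plain Java. Hypothetical sketch, not Lucene code. */
final class ExactKnn {
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** N·d multiply-adds: one per component of every stored vector. */
    static long multiplyAdds(long n, int d) {
        return n * d;
    }

    /** Returns the indices of the k nearest vectors, nearest first. Touches every vector. */
    static int[] search(float[][] vectors, float[] query, int k) {
        float[] dist = new float[vectors.length];
        // Max-heap on distance: the head is the worst of the best k seen so far.
        PriorityQueue<Integer> worstFirst =
            new PriorityQueue<>((x, y) -> Float.compare(dist[y], dist[x]));
        for (int i = 0; i < vectors.length; i++) {
            dist[i] = squaredDistance(vectors[i], query);
            worstFirst.add(i);
            if (worstFirst.size() > k) {
                worstFirst.poll();                 // evict the current worst
            }
        }
        int[] result = new int[worstFirst.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = worstFirst.poll();         // worst comes out first, so fill from the back
        }
        return result;
    }
}
```

The loop body runs once per stored vector no matter how good the early results are, which is the O(N) behaviour an ANN index is built to avoid.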

ANN trades a small, bounded loss of recall (the fraction of the true top-k you actually return) for a large gain in speed. HNSW is the dominant ANN structure in production search because it has high recall at low latency, supports incremental insertion (important for a mutable index), and — critically for Lucene — can be persisted to immutable segment files and merged.

Approach	Distance computations per query	Recall	Use when
Exact / flat	`O(N)`	100%	Small corpora, or a rescoring pass over candidates
HNSW (this chapter)	`~O(log N)`	tunable, typically 0.9–0.99	The default for interactive vector search
IVF (faiss only)	`O(N/nlist · nprobe)`	tunable	Very large corpora; covered in k-NN algorithms
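Recall, as defined above, is straightforward to measure once you have an exact top-k to compare against. A minimal sketch, assuming both result lists hold document ids without duplicates (`Recall` is a hypothetical helper):

```java
import java.util.HashSet;
import java.util.Set;

/** Recall@k: fraction of the exact top-k that the approximate search also returned. */
final class Recall {
    static double recallAtK(int[] exactTopK, int[] approxTopK) {
        if (exactTopK.length == 0) {
            return 1.0;                            // nothing to find, nothing missed
        }
        Set<Integer> truth = new HashSet<>();
        for (int id : exactTopK) {
            truth.add(id);
        }
        int hits = 0;
        for (int id : approxTopK) {
            if (truth.contains(id)) {
                hits++;
            }
        }
        return (double) hits / exactTopK.length;
    }
}
```

In practice you would average this over a sample of queries, using a flat search as the ground truth.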

The HNSW algorithm

HNSW is a proximity graph: each vector is a node, and edges connect a node to a bounded number of its near neighbours. Searching means walking the graph greedily toward the query. The "hierarchical" part is the trick that makes the walk start close: the graph is built in layers, like a skip list for geometry.

Layers

Layer 0 contains every node and is densely connected — it is the fine-grained graph.
Each higher layer contains a random, exponentially smaller subset of nodes (a node's top layer is drawn from a geometric distribution). The top layer might have a handful of nodes; layer 0 has all of them.
Higher layers are the "express lanes": long-range hops that get the search into the right neighbourhood quickly before it descends to the dense bottom layer for the precise finish.
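The layer assignment can be sketched with the formula from the original HNSW paper, floor(-ln(U) · mL) with mL = 1/ln(M). Treat this as an illustration of the distribution rather than Lucene's exact code; `HnswLevels` is a hypothetical class and the seed and sizes are arbitrary.

```java
import java.util.Random;

/** Samples HNSW top layers. Hypothetical sketch following the HNSW paper's formula. */
final class HnswLevels {
    /** Draws a node's top layer: floor(-ln(U) * mL) with mL = 1 / ln(M). */
    static int randomLevel(Random random, int m) {
        double mL = 1.0 / Math.log(m);
        double u = 1.0 - random.nextDouble();      // in (0, 1], so ln(u) is finite
        return (int) Math.floor(-Math.log(u) * mL);
    }

    /** counts[l] = number of nodes whose top layer is exactly l. */
    static int[] histogram(int nodes, int m, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[32];
        for (int i = 0; i < nodes; i++) {
            int level = Math.min(randomLevel(random, m), counts.length - 1);
            counts[level]++;
        }
        return counts;
    }
}
```

With this formula the probability that a node reaches layer 1 or higher is 1/M, and each further layer is M times rarer again, which is why the top layer ends up with only a handful of nodes.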

flowchart TD
    subgraph L2["Layer 2 (sparse, long hops)"]
      A2((entry))
    end
    subgraph L1["Layer 1"]
      A1((n)) --- B1((n)) --- C1((n))
    end
    subgraph L0["Layer 0 (all nodes, dense)"]
      A0((n)) --- B0((n)) --- C0((n)) --- D0((n)) --- E0((n))
    end
    A2 -->|descend| B1
    C1 -->|descend| C0

Greedy search

A k-NN query enters at the single entry point on the top layer and runs a greedy best-first search:

At the current layer, look at the current node's neighbours, compute each one's distance to the query, and move to the closest neighbour that improves on the current best. Repeat until no neighbour is closer — a local minimum for this layer.
Drop down one layer (same node) and repeat. Higher layers are sparse, so each is a few hops.
At layer 0, run a wider search: maintain a candidate set of size efSearch (a priority queue), expanding the closest unexpanded candidate and tracking the best k found so far. A larger efSearch explores more of the graph → higher recall, more distance computations.

flowchart LR
    Q["query vector"] --> Entry["entry point, top layer"]
    Entry --> G1["greedy hop to local min (layer 2)"]
    G1 --> Down1["descend to layer 1"]
    Down1 --> G2["greedy hop to local min"]
    G2 --> Down0["descend to layer 0"]
    Down0 --> Beam["beam search, candidate set size = efSearch"]
    Beam --> Top["return best k by distance"]

The number of distance computations is dominated by the layer-0 beam, which is bounded by roughly efSearch times the per-node edge limit (M, or 2·M where layer 0 is allowed more edges) rather than by N; that is where the O(log N)-ish behaviour comes from.
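The layer-0 step can be sketched as a beam search over plain adjacency lists. This is a teaching sketch, not Lucene's HnswGraphSearcher: `ToyBeamSearch` is hypothetical, it handles a single layer only, and it uses squared Euclidean distance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Toy single-layer beam search over a proximity graph. Hypothetical sketch. */
final class ToyBeamSearch {
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * @param vectors    node id -> vector
     * @param neighbours node id -> adjacent node ids (the layer-0 edge lists)
     * @return up to k node ids, nearest first
     */
    static List<Integer> search(float[][] vectors, int[][] neighbours, float[] query,
                                int entryPoint, int efSearch, int k) {
        float[] dist = new float[vectors.length];
        Comparator<Integer> nearestFirst = (x, y) -> Float.compare(dist[x], dist[y]);
        PriorityQueue<Integer> candidates = new PriorityQueue<>(nearestFirst);      // to expand
        PriorityQueue<Integer> best = new PriorityQueue<>(nearestFirst.reversed()); // worst on top
        Set<Integer> visited = new HashSet<>();

        dist[entryPoint] = squaredDistance(vectors[entryPoint], query);
        candidates.add(entryPoint);
        best.add(entryPoint);
        visited.add(entryPoint);

        while (!candidates.isEmpty()) {
            int current = candidates.poll();
            // Stop once the closest unexpanded candidate is worse than the worst kept result.
            if (best.size() >= efSearch && dist[current] > dist[best.peek()]) {
                break;
            }
            for (int next : neighbours[current]) {
                if (!visited.add(next)) {
                    continue;                      // already scored
                }
                dist[next] = squaredDistance(vectors[next], query);
                if (best.size() < efSearch || dist[next] < dist[best.peek()]) {
                    candidates.add(next);
                    best.add(next);
                    if (best.size() > efSearch) {
                        best.poll();               // evict the current worst
                    }
                }
            }
        }
        List<Integer> result = new ArrayList<>(best);
        result.sort(nearestFirst);
        return result.subList(0, Math.min(k, result.size()));
    }
}
```

Only nodes reachable through the edge lists are ever scored, so the work depends on efSearch and the edge count per node, not on the total number of vectors.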

Construction and the parameters

Insertion mirrors search: to add a node, find its nearest neighbours at each layer (a search with beam width efConstruction), then connect it to up to M of them, pruning each neighbour's edge list back down to M (a heuristic that keeps the graph diverse, not just locally clustered). The three knobs you will tune everywhere — in Lucene and in OpenSearch — are:

Parameter	What it controls	Higher value →	Cost
`M` (a.k.a. `maxConn`)	Max edges per node on each layer (2·M on layer 0 in some impls)	Better recall, more robust graph	More memory per node, larger `.vex`
`efConstruction` (`beamWidth`)	Beam width while building the graph	Higher-quality graph → better recall	Slower indexing/merge
`efSearch` (`k` expansion at query time)	Beam width while searching	Higher recall	Slower queries

Note: M and efConstruction are baked into the graph at index time — you cannot change them without reindexing/rebuilding the graph. efSearch is a query-time knob. This asymmetry matters: pick M/efConstruction conservatively-high, and tune efSearch per query for the recall/latency you need.
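The pruning heuristic mentioned in the construction paragraph can be sketched as follows: walk the candidates nearest-first and keep one only if it is closer to the new node than to every neighbour already kept. This follows the diversity rule described in the HNSW paper; `ToyNeighbourSelection` is a hypothetical class and is not a copy of HnswGraphBuilder.

```java
import java.util.ArrayList;
import java.util.List;

/** Diversity-aware neighbour selection. Hypothetical sketch of the HNSW pruning heuristic. */
final class ToyNeighbourSelection {
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * @param candidatesNearestFirst candidate ids sorted by distance to the node, closest first
     * @param m                      maximum number of neighbours to keep
     */
    static List<Integer> selectDiverse(float[][] vectors, int node,
                                       List<Integer> candidatesNearestFirst, int m) {
        List<Integer> selected = new ArrayList<>();
        for (int candidate : candidatesNearestFirst) {
            if (selected.size() == m) {
                break;
            }
            float toNode = squaredDistance(vectors[candidate], vectors[node]);
            boolean diverse = true;
            for (int kept : selected) {
                // A candidate that sits closer to a kept neighbour than to the node adds
                // little: the node can already reach that region through the kept neighbour.
                if (squaredDistance(vectors[candidate], vectors[kept]) <= toNode) {
                    diverse = false;
                    break;
                }
            }
            if (diverse) {
                selected.add(candidate);
            }
        }
        return selected;
    }
}
```

Given two candidates that are almost on top of each other and a third in a different direction, the rule keeps one of the pair and the third, so the node's edges point in different directions instead of into one cluster.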

The Lucene classes

Lucene's vector support lives under org.apache.lucene.codecs.* (format/IO), org.apache.lucene.util.hnsw.* (the graph), org.apache.lucene.document.* (the field), and org.apache.lucene.search.* (the query). Grep the version you are on — names are stable but the codec version number moves:

# In an apache/lucene checkout:
find lucene -name "KnnFloatVectorField.java" -o -name "HnswGraph*.java" \
  -o -name "*HnswVectorsFormat.java" -o -name "VectorSimilarityFunction.java"
grep -rln "class HnswGraphBuilder" lucene/core/src/java
grep -rln "Lucene9.*HnswVectorsFormat\|Lucene10.*ScalarQuantized" lucene/core/src/java

Fields: what you put in a document

Class	Role
`KnnFloatVectorField`	A document field holding a `float[]` vector + a `VectorSimilarityFunction`. The common case.
`KnnByteVectorField`	A `byte[]` vector field — quarter the memory of `float`, for pre-quantized or byte embeddings.
`FloatVectorValues` / `ByteVectorValues`	The per-segment, per-doc reader view of stored vectors (iterate, random-access by ordinal).
`VectorSimilarityFunction`	The distance/score function: `EUCLIDEAN`, `DOT_PRODUCT`, `COSINE`, `MAXIMUM_INNER_PRODUCT`.

// Indexing a float vector into a document.
float[] embedding = embed("the quick brown fox");          // e.g. 768 dims from a model
Document doc = new Document();
doc.add(new KnnFloatVectorField("vector", embedding, VectorSimilarityFunction.COSINE));
doc.add(new StoredField("id", docId));
writer.addDocument(doc);

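VectorSimilarityFunction is the single most consequential choice after dimension. It is fixed per field at index time, and every query against that field is scored with it, so query vectors must be prepared the same way as indexed ones (for example, normalized for DOT_PRODUCT):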

Function	Distance / score	Notes
`EUCLIDEAN`	L2 squared distance, turned into a higher-is-better score	Magnitude-sensitive; OpenSearch `space_type: l2`
`DOT_PRODUCT`	Inner product (vectors should be normalized to unit length)	Fast; requires normalized inputs for correctness
`COSINE`	Cosine similarity	Norms are computed at scoring time, so inputs need not be pre-normalized; OpenSearch `space_type: cosinesimil`
`MAXIMUM_INNER_PRODUCT`	MIP — inner product without requiring normalization	For maximum-inner-product search; OpenSearch `space_type: innerproduct`
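"Turned into a higher-is-better score" deserves a concrete form. The sketch below reproduces the score transforms as I understand them from Lucene's VectorSimilarityFunction; treat the exact formulas as an assumption and confirm them against the source file in your checkout. `SimilarityScores` is a hypothetical class.

```java
/** Raw similarity to non-negative score. Hypothetical sketch; verify against your Lucene version. */
final class SimilarityScores {
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** EUCLIDEAN: smaller distance gives a score closer to 1. */
    static float euclideanScore(float[] a, float[] b) {
        return 1f / (1f + squaredDistance(a, b));
    }

    /** DOT_PRODUCT: maps [-1, 1] (unit vectors) onto [0, 1]. */
    static float dotProductScore(float[] a, float[] b) {
        return Math.max((1f + dot(a, b)) / 2f, 0f);
    }

    /** COSINE: divides by both norms, then applies the same mapping as DOT_PRODUCT. */
    static float cosineScore(float[] a, float[] b) {
        float cos = (float) (dot(a, b) / Math.sqrt((double) dot(a, a) * dot(b, b)));
        return Math.max((1f + cos) / 2f, 0f);
    }

    /** MAXIMUM_INNER_PRODUCT: unbounded input, so negatives are squashed into (0, 1). */
    static float maxInnerProductScore(float[] a, float[] b) {
        float ip = dot(a, b);
        return ip < 0f ? 1f / (1f - ip) : ip + 1f;
    }
}
```

Note how `dotProductScore` only stays inside [0, 1] when both inputs are unit length, which is the practical reason DOT_PRODUCT demands normalized vectors.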

The graph: `HnswGraph` and `HnswGraphBuilder`

HnswGraph is the abstract in-memory/on-disk graph: given a node and a level, iterate its neighbours. HnswGraphBuilder constructs it — it is the code that runs the insertion-with-pruning described above, parameterized by M and beamWidth (== efConstruction). At search time HnswGraphSearcher runs the greedy + beam traversal, collecting results into a KnnCollector/NeighborQueue bounded by efSearch.

grep -n "beamWidth\|maxConn\|M\b\|addGraphNode\|class HnswGraphBuilder" \
  lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java | head
grep -n "search\|EntryPoint\|class HnswGraphSearcher" \
  lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java | head

The query: `KnnFloatVectorQuery`

// Search for the 10 nearest neighbours of queryVector in field "vector".
Query knn = new KnnFloatVectorQuery("vector", queryVector, 10);
TopDocs hits = searcher.search(knn, 10);

KnnFloatVectorQuery (and KnnByteVectorQuery) is a two-phase query: per segment it runs the HNSW search to get k approximate neighbours, then rewrites into the equivalent of a doc-id + score list that the rest of Lucene's collector machinery consumes. It also accepts a filter Query — the lucene engine's headline feature — which restricts the candidate set so the graph walk only returns docs matching the filter (Lucene applies the filter during traversal, falling back to exact search if the filter is very selective). Compare this with the faiss engine, where pre-/post-filtering is handled differently; see the k-NN query path.

The vector format and the files on disk

Like every other Lucene data structure, vectors are written by a codec component — a KnnVectorsFormat. The format owns three file extensions:

Extension	File	Holds
`.vec`	Vector data	The raw vector values (the `float[]`/`byte[]` payloads), per ordinal.
`.vex`	Vector index (graph)	The HNSW graph itself — the neighbour lists per node per layer.
`.vem`	Vector metadata	Field metadata: dimension, similarity function, doc-count, offsets into `.vec`/`.vex`.

# After indexing vectors, find them on disk (mnemonic: vec=values, vex=graph, vem=meta):
find /path/to/index -name "*.vec" -o -name "*.vex" -o -name "*.vem"

The format classes

Lucene99HnswVectorsFormat is the baseline HNSW format (full-precision float32 vectors + the graph). Storing raw float32 is expensive: 768 dims × 4 bytes = 3 KB per vector, and the working set must be in memory (or page cache) for fast search. Scalar quantization shrinks each component to fewer bits, trading a little recall for a large memory cut:

Format class	Stores	Notes
`Lucene99HnswVectorsFormat`	`float32` vectors + HNSW graph	The full-precision baseline.
`Lucene99ScalarQuantizedVectorsFormat`	int8-quantized vectors (flat, no graph)	The quantization layer used as a building block.
`Lucene99HnswScalarQuantizedVectorsFormat`	int8-quantized vectors + HNSW graph	HNSW over int8 — ~4× smaller than float32, near-baseline recall.
`Lucene104ScalarQuantizedVectorsFormat`	configurable 1/2/4/7/8-bit SQ (flat)	Newer, finer-grained quantization (the `Lucene104` prefix denotes Lucene 10.4).
`Lucene104HnswScalarQuantizedVectorsFormat`	HNSW over 1/2/4/7/8-bit SQ vectors	The newer HNSW + custom-bits combo.
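To make "shrinks each component to fewer bits" concrete, here is a toy linear quantizer that maps floats in a fixed range onto 7-bit integers. It is a deliberately simplified sketch: `ToyScalarQuantizer` is hypothetical, the range is supplied by the caller, and the real Lucene formats choose their ranges and score corrections differently.

```java
/** Toy linear scalar quantizer (7-bit). Hypothetical sketch, not Lucene's quantizer. */
final class ToyScalarQuantizer {
    private final float min;
    private final float max;

    ToyScalarQuantizer(float min, float max) {
        this.min = min;
        this.max = max;
    }

    /** Maps each component to an integer in [0, 127], clamping values outside [min, max]. */
    byte[] quantize(float[] vector) {
        byte[] quantized = new byte[vector.length];
        float scale = 127f / (max - min);
        for (int i = 0; i < vector.length; i++) {
            float clamped = Math.max(min, Math.min(max, vector[i]));
            quantized[i] = (byte) Math.round((clamped - min) * scale);
        }
        return quantized;
    }

    /** Reverses the mapping; the result differs from the input by at most half a step. */
    float[] dequantize(byte[] quantized) {
        float[] vector = new float[quantized.length];
        float step = (max - min) / 127f;
        for (int i = 0; i < quantized.length; i++) {
            vector[i] = min + quantized[i] * step;
        }
        return vector;
    }

    static int bytesPerVector(int dims, int bytesPerComponent) {
        return dims * bytesPerComponent;
    }
}
```

For 768 dimensions this takes a vector from 3072 bytes to 768 bytes, and the round trip through `dequantize` shows the price: every component is off by up to half a quantization step.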

Scalar quantization for vectors entered Lucene with the int8 work in LUCENE-10577 / apache/lucene #11613, and the scalar-quantized codec format landed in apache/lucene #12497. The newer Lucene104* formats generalize the bit width (1/2/4/7/8 bits) so you can dial memory vs recall much more finely. To see which codec/format your Lucene version defaults to:

grep -rln "Lucene9.*HnswVectorsFormat\|Lucene10.*HnswScalarQuantized" lucene/core/src/java
grep -n "knnVectorsFormat\|getKnnVectorsFormatForField" \
  lucene/core/src/java/org/apache/lucene/codecs/lucene*/Lucene*Codec.java

You select a non-default vector format per field by overriding getKnnVectorsFormatForField in a custom codec — exactly the technique Lab L2 builds, and exactly how OpenSearch's k-NN plugin injects its own format.

Note: Quantization is lossy. The standard production pattern is: search the quantized graph to get a candidate set quickly, then rescore those candidates against the full-precision vectors for the final ranking. OpenSearch's disk-based ANN does exactly this; see quantization and disk-ANN.
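The rescoring half of that pattern is small enough to sketch. Assume the quantized search has already produced a list of candidate ids (more than k of them); the helper below re-ranks them against the full-precision vectors. `ToyRescorer` is a hypothetical class and squared Euclidean distance is an arbitrary choice of metric.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Re-ranks approximate candidates with exact distances. Hypothetical sketch. */
final class ToyRescorer {
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /**
     * @param fullPrecision doc id -> original float32 vector
     * @param candidates    ids returned by the cheaper, quantized search (oversampled)
     * @return the k candidates that are truly nearest, nearest first
     */
    static int[] rescore(float[][] fullPrecision, float[] query, int[] candidates, int k) {
        return Arrays.stream(candidates)
            .boxed()
            .sorted(Comparator.comparingDouble(
                (Integer id) -> squaredDistance(fullPrecision[id], query)))
            .limit(k)
            .mapToInt(Integer::intValue)
            .toArray();
    }
}
```

The exact pass only touches the candidate list, so its cost is proportional to the oversampling factor rather than to the corpus size.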

HNSW and merges: the graph is rebuilt

Here is the part that connects vectors to the rest of Lucene. Segments are immutable (segments and codecs), and so is the HNSW graph inside a segment. When merges combine N segments into one, the merged segment needs a single graph covering all the surviving documents — and you cannot simply concatenate the old graphs, because their node ordinals and neighbour edges are local to each old segment.

So merge rebuilds the HNSW graph for the merged segment. The naive approach re-inserts every vector from scratch (expensive — graph construction is the dominant cost of a vector merge). Lucene's optimization, IncrementalHnswGraphMerger, seeds the new graph from the largest input segment's existing graph and only inserts the vectors from the smaller segments, which cuts merge cost substantially. This is also where the SIMD story pays off: faster distance computations make both construction and merge faster, contributing to the ~25% indexing speedups Lucene reported.
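A small experiment shows why concatenation is not enough even if you remap the ordinals. The sketch below shifts the second segment's ordinals past the first segment's and then counts edges that cross between the two; `ToyGraphConcat` is a hypothetical class operating on bare edge lists.

```java
/** Naive graph concatenation with ordinal remapping. Hypothetical sketch. */
final class ToyGraphConcat {
    /** Appends graphB after graphA, shifting B's local ordinals by A's size. */
    static int[][] concatenate(int[][] graphA, int[][] graphB) {
        int base = graphA.length;
        int[][] merged = new int[graphA.length + graphB.length][];
        for (int i = 0; i < graphA.length; i++) {
            merged[i] = graphA[i].clone();
        }
        for (int i = 0; i < graphB.length; i++) {
            int[] shifted = new int[graphB[i].length];
            for (int j = 0; j < shifted.length; j++) {
                shifted[j] = graphB[i][j] + base;  // local ordinal -> merged ordinal
            }
            merged[base + i] = shifted;
        }
        return merged;
    }

    /** Counts edges that connect a node from A (ordinal < sizeA) with a node from B. */
    static int crossSegmentEdges(int[][] merged, int sizeA) {
        int count = 0;
        for (int node = 0; node < merged.length; node++) {
            for (int neighbour : merged[node]) {
                if ((node < sizeA) != (neighbour < sizeA)) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The count is always zero: the result is two disconnected islands, so a search that enters in one segment's nodes can never reach the other's. Real insertion is needed to create those cross-segment edges.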

flowchart TD
    S1["segment A: vectors + graph A"] --> M["merge"]
    S2["segment B: vectors + graph B"] --> M
    S3["segment C: vectors + graph C"] --> M
    M --> Seed["IncrementalHnswGraphMerger: seed from largest graph"]
    Seed --> Insert["insert remaining vectors (HnswGraphBuilder)"]
    Insert --> New["merged segment: ONE rebuilt graph + concatenated .vec"]

Practical consequence: vector-heavy indices have expensive merges. A big force-merge of a vector index re-builds the whole graph and can run for a long time and use a lot of CPU/heap. This is why vector indexing throughput is so sensitive to merge policy and why remote/GPU index-build RFCs exist on the OpenSearch side (see the k-NN architecture and the GPU/remote-build RFCs).

# Locate the merger and the vector writers that rebuild the graph during a merge.
# To watch it happen: index vectors, force-merge, and observe the .vex files being rewritten.
grep -rln "IncrementalHnswGraphMerger\|class .*VectorsWriter" lucene/core/src/java

How OpenSearch's k-NN lucene engine uses exactly this

OpenSearch's k-NN plugin's lucene engine is not a reimplementation; it is a thin mapping onto the classes above. A knn_vector field created with engine: lucene, such as:

PUT /products
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "space_type": "cosinesimil",
        "method": {
          "name": "hnsw",
          "engine": "lucene",
          "parameters": { "m": 16, "ef_construction": 128 }
        }
      }
    }
  }
}

maps onto Lucene as follows:

OpenSearch k-NN setting	Lucene construct
`engine: lucene`	Uses Lucene's `KnnVectorsFormat` (the `Lucene9x/10xHnsw...` formats) — no native JNI.
`dimension: 768`	The vector field's dimension.
`space_type: cosinesimil` / `l2` / `innerproduct`	`VectorSimilarityFunction.COSINE` / `EUCLIDEAN` / `MAXIMUM_INNER_PRODUCT`.
`method.name: hnsw`	`HnswGraphBuilder` / the HNSW format.
`parameters.m`	Lucene `M` (`maxConn`).
`parameters.ef_construction`	Lucene `beamWidth` (`efConstruction`).
`parameters.ef_search` (query/index setting)	Lucene `efSearch` (beam width at query time).
A `knn` query with a `filter`	`KnnFloatVectorQuery` with a filter `Query` — efficient filtered ANN.

The lucene engine's two big advantages over faiss/nmslib are that the vectors live inside the JVM/Lucene segment files (no separate native memory pool, no circuit breaker to manage) and that filtered search is first-class. Its disadvantage is that for the very largest, highest-throughput workloads the native faiss engine can be faster and offers IVF/PQ. The full trade-off table is in the k-NN engines chapter; the algorithm details (HNSW vs IVF vs PQ) are in k-NN algorithms.

Note: Because the lucene engine is Lucene HNSW, a bug or improvement in Lucene's HNSW (e.g. a new quantization format, a faster merger) flows into OpenSearch the moment OpenSearch upgrades its bundled Lucene. This is the upstream-first dynamic that Lab L4 is about: some vector work must land in Lucene before OpenSearch can expose it.

Reading exercise

# In an apache/lucene checkout:

# 1. The graph builder — find M (maxConn) and beamWidth (efConstruction).
grep -n "maxConn\|beamWidth\|addGraphNode" \
  lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java

# 2. The searcher — find the entry-point descent and the layer-0 beam.
grep -n "graphSearch\|findBestEntryPoint\|class HnswGraphSearcher" \
  lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java

# 3. The similarity functions.
grep -n "EUCLIDEAN\|DOT_PRODUCT\|COSINE\|MAXIMUM_INNER_PRODUCT" \
  lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java

# 4. The default codec's vector format.
grep -rn "knnVectorsFormat\|HnswVectorsFormat" \
  lucene/core/src/java/org/apache/lucene/codecs/lucene*/Lucene*Codec.java | head

# 5. The merger that rebuilds the graph.
grep -rln "IncrementalHnswGraphMerger" lucene/core/src/java

Answer:

What does the hierarchical part of HNSW buy you over a single flat proximity graph? Describe the role of the top layers vs layer 0.
Walk a greedy HNSW search from entry point to the final top-k. Where does efSearch enter, and what does increasing it cost and buy?
Which two parameters are fixed at index time and which is a query-time knob? Why does that distinction matter operationally?
Map each of .vec, .vex, .vem to what it stores. Which one grows with M?
Why must a merge rebuild the HNSW graph rather than concatenate the inputs' graphs? What does IncrementalHnswGraphMerger optimize?
Translate this OpenSearch field config to its Lucene equivalents: engine: lucene, space_type: innerproduct, m: 32, ef_construction: 256.

Common bugs and symptoms

Symptom	Root cause	Where to look
Low recall (misses obvious neighbours)	`efSearch`/`efConstruction` too low, or `M` too small for the data	Raise `ef_search` first (query-time); then `m`/`ef_construction` (needs reindex)
Query latency spikes on a vector field	Graph/vectors not in page cache; `efSearch` too high	Warm the index; lower `ef_search`; check `.vec`/`.vex` size vs RAM
Wrong/garbage results after switching models	`VectorSimilarityFunction` mismatch (e.g. `DOT_PRODUCT` on un-normalized vectors)	Normalize for `DOT_PRODUCT`, or use `COSINE`/`MAXIMUM_INNER_PRODUCT`
`IllegalArgumentException: vector's dimension differs`	Indexed and queried vectors have different `dimension`	The field's `dimension` is fixed; all docs and the query must match
Force-merge on a vector index runs for hours	Merge rebuilds the whole HNSW graph; expensive by design	Avoid gratuitous force-merge; tune merge policy; consider remote/GPU build
Quantized index has noticeably lower recall	Scalar quantization is lossy and no rescoring step	Add a full-precision rescore pass; raise bits (`Lucene104*`)
OOM / heap pressure when indexing vectors	Graph construction holds vectors + neighbour lists in heap	Reduce concurrent merges; smaller `ef_construction`; more heap

Validation: prove you understand this

Draw the HNSW layered graph and trace a single query through it, marking where the greedy descent ends and the layer-0 beam begins. Label where efSearch controls the search.
Explain in one paragraph why HNSW is ~O(log N) while flat search is O(N), and what recall you give up to get there.
List the three tuning parameters, say which are index-time vs query-time, and state the recall/latency/memory effect of raising each.
Name the Lucene classes for: the field you index, the per-segment reader of vectors, the graph, the graph builder, and the query. Map each .vec/.vex/.vem file to one of them.
Explain why a merge rebuilds the graph, and what makes vector merges expensive.
Given an OpenSearch knn_vector field with engine: lucene, name the exact Lucene VectorSimilarityFunction, M, and beamWidth it produces, and one capability the lucene engine has that the faiss engine handles differently.

When you can do all six, continue to SIMD and the Panama Vector API — the hot loop inside every distance computation HNSW makes — and then build a real HNSW index in Lab L3. For the plugin layer on top, read the k-NN engines chapter.

SIMD and the Panama Vector API

The HNSW chapter showed which vectors a query compares. This chapter is about the comparison itself — the single hottest loop in all of vector search. Every greedy hop in an HNSW walk, every candidate in the layer-0 beam, every node inserted during a merge, costs one distance computation: a dot product, a squared Euclidean distance, or a cosine over two float[]s of 768, 1024, or more dimensions. A single query touches hundreds or thousands of these. A merge touches millions. If you make that loop 4× faster, you make vector search ~4× faster and vector indexing dramatically faster. That is the entire "vectorization" theme.

The word vector is overloaded here, so be precise:

An embedding vector is the data — a float[768] representing a document.
A SIMD vector (the topic of this chapter) is a hardware register that holds several floats at once and operates on all of them in one instruction. SIMD = Single Instruction, Multiple Data.

This chapter covers why distance is the hot loop, the Panama Vector API (jdk.incubator.vector) that lets Java code emit SIMD, how Lucene's VectorUtil + VectorizationProvider selects a SIMD-accelerated implementation or a scalar fallback, how MemorySegment-based scoring works, how to prove SIMD is active, and how OpenSearch and the k-NN plugin inherit all of it for free.

After this chapter you should be able to: write a scalar dot product and its Panama-vectorized twin; explain how the JIT maps Panama lanes onto AVX-512 / AVX2 / ARM NEON; describe how Lucene chooses the implementation at runtime; verify SIMD is engaged on a running node; and name the JDK flags and pitfalls that silently disable it.

Note: This is the layer beneath HNSW and beneath OpenSearch's k-NN engines. The lucene engine gets SIMD through Lucene's VectorUtil; the native faiss engine gets it through its own C++ SIMD kernels (compiled with -mavx2 etc.). Both ultimately ride the same hardware instructions — see k-NN native integration.

Why distance is the hot loop

A dot product over d dimensions is d multiplies and d-1 adds. Look at the call counts:

Operation	Distance computations
One HNSW query, `efSearch=100`, `M=16`	~thousands
Building/merging a 1M-vector graph	~hundreds of millions
One exact (flat) query over 1M vectors	exactly 1M

At 768 dims that is ~1535 float operations per distance. Multiply by the call counts and the CPU spends the overwhelming majority of vector-search time in this one arithmetic kernel. Profilers of vector workloads light up on dotProduct / squareDistance / cosine. Everything else — graph traversal bookkeeping, queue management — is noise by comparison. Therefore: make the kernel fast and you make the system fast. A scalar Java loop processes one float per instruction; a SIMD loop processes 8 (AVX2, 256-bit) or 16 (AVX-512, 512-bit) floats per instruction. That is the 4–8× that matters.
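The arithmetic behind those figures, spelled out as a sketch (`DistanceCost` is a hypothetical helper; the lane counts are the ones quoted above):

```java
/** Back-of-envelope cost of the distance kernel. Hypothetical helper. */
final class DistanceCost {
    /** d multiplies plus d-1 adds. */
    static long opsPerDotProduct(int dims) {
        return 2L * dims - 1;
    }

    /** A flat query computes one distance per stored vector. */
    static long opsPerFlatQuery(long vectors, int dims) {
        return vectors * opsPerDotProduct(dims);
    }

    /** Main-loop iterations needed when each iteration handles `lanes` floats. */
    static long loopIterations(int dims, int lanes) {
        return (dims + lanes - 1) / lanes;         // ceiling division
    }
}
```

For 768 dimensions this gives 1535 operations per distance, about 1.5 billion per flat query over 1M vectors, and a main loop of 768 scalar iterations versus 96 with 8 lanes or 48 with 16 lanes.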

The Panama Vector API

Historically, Java could not emit SIMD directly — you got whatever auto-vectorization the JIT managed, which for reductions like a dot product was unreliable. Project Panama added the Vector API (jdk.incubator.vector), an incubator module that exposes SIMD as portable Java: you write lane-wise operations against a FloatVector, and the JIT compiles them to the actual vector instructions of the host CPU (AVX-512/AVX2 on x86, NEON/SVE on ARM), or falls back to scalar where unsupported.

Because it is an incubator module, it is not resolved into the module graph by default. You must opt in:

java --add-modules jdk.incubator.vector ...      # required to load the module

Lucene knows this and probes for the module at startup; if it is absent, Lucene uses its scalar implementation instead of crashing.
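You can run a similar check yourself with nothing but the JDK. The sketch below asks the boot module layer whether the incubator module was resolved; it illustrates the idea of probing and is not Lucene's actual detection code. `VectorModuleProbe` is a hypothetical class.

```java
/** Checks whether the JVM was started with the incubator Vector API. Hypothetical sketch. */
final class VectorModuleProbe {
    static final String MODULE = "jdk.incubator.vector";

    /** True only if the JVM was launched with --add-modules jdk.incubator.vector (or equivalent). */
    static boolean incubatorVectorModulePresent() {
        return ModuleLayer.boot().findModule(MODULE).isPresent();
    }

    static String describe() {
        return incubatorVectorModulePresent()
            ? MODULE + " is resolved: a Panama-based implementation can be loaded"
            : MODULE + " is absent: only a scalar implementation is available";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Run it once with and once without `--add-modules jdk.incubator.vector` to see both branches.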

# In an apache/lucene checkout, see where Lucene references the incubator module:
grep -rln "jdk.incubator.vector\|FloatVector\|VectorSpecies\|--add-modules" lucene/core/src/java

The key Panama types

Type	Role
`VectorSpecies<Float>`	Describes the lane shape for the host — e.g. `FloatVector.SPECIES_PREFERRED` picks the widest the CPU supports.
`FloatVector`	A SIMD register of floats; supports `add`, `mul`, `fma`, `reduceLanes`, etc.
`VectorOperators`	The operation enum (`ADD`, `MUL`, …) used by `reduceLanes`.
`MemorySegment`	An off-heap (or on-heap) memory region the API can load lanes from directly — used for `.vec` data.

Scalar vs vectorized: the dot product

Here is the canonical kernel both ways. First, the scalar version — clear, correct, one float at a time:

/** Scalar dot product: one multiply-add per loop iteration. */
static float dotScalar(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];            // one FMA worth of work per iteration
    }
    return sum;
}

Now the Panama-vectorized version. We process SPECIES.length() floats per iteration (8 for AVX2, 16 for AVX-512), accumulating into a SIMD register, then do a horizontal reduce at the end:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

/** Vectorized dot product using the Panama Vector API. */
static float dotVectorized(float[] a, float[] b) {
    final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED; // widest lanes the CPU has
    int i = 0;
    // A running accumulator vector: lane j holds the partial sum for positions j, j+L, j+2L, ...
    FloatVector acc = FloatVector.zero(SPECIES);
    int upper = SPECIES.loopBound(a.length);         // largest multiple of L <= length
    for (; i < upper; i += SPECIES.length()) {
        FloatVector va = FloatVector.fromArray(SPECIES, a, i);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
        acc = va.fma(vb, acc);                       // acc += va * vb, lane-wise, one instruction
    }
    float sum = acc.reduceLanes(VectorOperators.ADD); // horizontal sum across lanes
    for (; i < a.length; i++) {                      // scalar tail for the remainder
        sum += a[i] * b[i];
    }
    return sum;
}

The two structural pieces every vectorized kernel has:

The main loop strides by SPECIES.length(), doing the work on a full register per iteration. fma (fused multiply-add) does a*b+c in one instruction with one rounding — both faster and more accurate than separate multiply and add.
The tail loop handles the leftover length % laneCount elements scalar-style, because the array length is rarely an exact multiple of the lane count.

Squared Euclidean distance and cosine are the same shape — subtract-square-accumulate for L2, two extra accumulators (for |a|² and |b|²) for cosine.
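To make the "same shape" claim concrete, here is a minimal sketch of squared Euclidean distance written in the main-loop-plus-tail form. It deliberately emulates the lanes with a plain `float[]` accumulator so it runs without the incubator module; the class name `LaneBlockedDistance` and the fixed `LANES = 8` are assumptions for illustration, and a real Panama kernel would use `FloatVector` and `SPECIES.length()` instead.

```java
import java.util.Random;

/** Squared Euclidean distance in the lane-striding shape, emulated with plain arrays. */
final class LaneBlockedDistance {
    // Hypothetical lane count; a real kernel would ask SPECIES.length() instead.
    static final int LANES = 8;

    /** Reference scalar version: subtract, square, accumulate. */
    static float squareDistanceScalar(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Same kernel in main-loop-plus-tail form; acc[j] plays the role of SIMD lane j. */
    static float squareDistanceBlocked(float[] a, float[] b) {
        float[] acc = new float[LANES];
        int upper = a.length - (a.length % LANES);   // largest multiple of LANES <= length
        int i = 0;
        for (; i < upper; i += LANES) {
            for (int j = 0; j < LANES; j++) {        // one "vector instruction" worth of lanes
                float d = a[i + j] - b[i + j];
                acc[j] += d * d;
            }
        }
        float sum = 0f;
        for (float lane : acc) {                     // horizontal reduce across lanes
            sum += lane;
        }
        for (; i < a.length; i++) {                  // scalar tail for the remainder
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        float[] a = new float[100];
        float[] b = new float[100];
        for (int i = 0; i < a.length; i++) {
            a[i] = rnd.nextFloat();
            b[i] = rnd.nextFloat();
        }
        float s = squareDistanceScalar(a, b);
        float v = squareDistanceBlocked(a, b);
        System.out.println("scalar=" + s + " blocked=" + v + " |diff|=" + Math.abs(s - v));
    }
}
```

With 100 elements and 8 lanes, the main loop covers indices 0 to 95 and the tail handles the last 4. The two results may differ in the last bits because the summation order differs.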

How the JIT maps lanes to AVX/NEON

FloatVector.SPECIES_PREFERRED is resolved at runtime to the host's widest supported shape. The HotSpot JIT then intrinsifies the Panama operations into the matching machine instructions:

Host CPU	`SPECIES_PREFERRED`	`va.fma(vb, acc)` compiles to
x86 with AVX-512	512-bit, 16 float lanes	`vfmadd231ps` on `zmm` registers
x86 with AVX2 (no AVX-512)	256-bit, 8 float lanes	`vfmadd231ps` on `ymm` registers
ARM with NEON	128-bit, 4 float lanes	NEON `fmla`
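No usable vector unit (or C2 disabled)	64-bit, 2 float lanes (the API's smallest shape)	No real SIMD gain; Lucene requires at least 128 bits and otherwise uses its scalar path. If the module is absent, the Panama code cannot load at all.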

Crucially, the same Java source runs on all of them. You do not write AVX intrinsics; you write lane-wise Java and the JIT picks the instruction. This is what makes the Vector API attractive for a portable library like Lucene that must run on x86 and ARM (Graviton) servers.

Warning: Vectorization is not free correctness. Floating-point addition is not associative, so dotVectorized and dotScalar can produce slightly different sums (different summation order). This is expected and within float tolerance — never assert exact equality between a scalar and a vectorized score in a test; assert within delta.
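A tiny sketch of why regrouping changes a float sum, plus a tolerance check of the kind a test could use. The helper name `closeEnough` and its relative-tolerance rule are assumptions for illustration, not a Lucene API.

```java
/** Shows that float addition is not associative, and how to compare scores safely. */
final class FloatOrderDemo {
    /** Relative-tolerance comparison for scores computed in different summation orders. */
    static boolean closeEnough(float expected, float actual, float relTol) {
        float scale = Math.max(1f, Math.max(Math.abs(expected), Math.abs(actual)));
        return Math.abs(expected - actual) <= relTol * scale;
    }

    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1f;
        float leftToRight = (a + b) + c;   // (0) + 1
        float regrouped = a + (b + c);     // b + c rounds back to -1e8f, so the 1 is lost
        System.out.println(leftToRight + " vs " + regrouped);   // prints "1.0 vs 0.0"
        System.out.println(closeEnough(0.123456f, 0.1234561f, 1e-5f));
    }
}
```

The example is extreme on purpose: near 1e8 adjacent floats are 8 apart, so adding 1 has no effect. Real dot products over normalized vectors differ only in the last few bits, which is why a small delta is the right assertion.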

Lucene's VectorUtil and VectorizationProvider

Lucene does not call Panama from a hundred places. It funnels all vector arithmetic through one class, org.apache.lucene.util.VectorUtil, with methods like dotProduct, squareDistance, cosine, and the int8/byte variants. VectorUtil delegates to a VectorUtilSupport chosen once at class-load time by a VectorizationProvider:

# In an apache/lucene checkout:
find lucene -name "VectorUtil.java" -o -name "VectorizationProvider.java" \
  -o -name "*VectorUtilSupport.java"
grep -n "PanamaVectorUtilSupport\|DefaultVectorUtilSupport\|lookup\|isSupported\|runtimeVersion" \
  lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java

The selection logic:

flowchart TD
    Start["VectorizationProvider.lookup() (once, at class init)"] --> Mod{"jdk.incubator.vector present<br/>AND JDK version supported<br/>AND CPU has vectors?"}
    Mod -->|yes| Panama["PanamaVectorizationProvider<br/>-> PanamaVectorUtilSupport (SIMD)"]
    Mod -->|no| Scalar["DefaultVectorizationProvider<br/>-> DefaultVectorUtilSupport (scalar)"]
    Panama --> VU["VectorUtil.dotProduct / squareDistance / cosine"]
    Scalar --> VU
    VU --> HNSW["HnswGraphSearcher / VectorScorer / merge"]

The two implementations:

Provider	Backing support	When chosen
`PanamaVectorizationProvider`	`PanamaVectorUtilSupport` (Vector API, SIMD)	Module present + supported JDK + CPU has SIMD
`DefaultVectorizationProvider`	`DefaultVectorUtilSupport` (plain Java loops)	Anything else — always-correct fallback

The provider is chosen once per JVM and logged. Because the choice is at the VectorUtil boundary, every consumer — HNSW search, exact scoring, graph construction, merge — automatically gets SIMD when available and the scalar path when not. There is no per-call branch in the hot loop; the implementation is selected before the loop ever runs.
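The "choose once, no per-call branch" idea can be sketched without Lucene on the classpath. Everything named here (`KernelSelector`, `DotSupport`, `WideSupport`, `ScalarSupport`) is hypothetical, and `WideSupport` is only a placeholder that reuses the scalar loop; the one real API call is `ModuleLayer.boot().findModule(...)`, which reports whether the incubator module was resolved at startup.

```java
/** Sketch of selecting an implementation once at class initialization. */
final class KernelSelector {
    /** Hypothetical stand-in for the support interface the funnel class delegates to. */
    interface DotSupport {
        float dot(float[] a, float[] b);
        String name();
    }

    static float scalarDot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static final class ScalarSupport implements DotSupport {
        public float dot(float[] a, float[] b) { return scalarDot(a, b); }
        public String name() { return "scalar"; }
    }

    /** Placeholder for a SIMD-backed implementation; here it simply reuses the scalar loop. */
    static final class WideSupport implements DotSupport {
        public float dot(float[] a, float[] b) { return scalarDot(a, b); }
        public String name() { return "wide"; }
    }

    /** True only if the JVM was started with the incubator module resolved. */
    static boolean vectorModulePresent() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    // Resolved exactly once, when this class is initialized.
    static final DotSupport IMPL = vectorModulePresent() ? new WideSupport() : new ScalarSupport();

    /** The funnel: callers never branch on the implementation themselves. */
    static float dotProduct(float[] a, float[] b) {
        return IMPL.dot(a, b);
    }

    public static void main(String[] args) {
        System.out.println("selected: " + IMPL.name());
        System.out.println(dotProduct(new float[]{1f, 2f, 3f}, new float[]{4f, 5f, 6f}));
    }
}
```

Because `IMPL` is a `static final` field, the JIT can treat the call target as fixed, which is the property the hot loop relies on.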

MemorySegment-based scoring

Stored vectors live in the .vec file (HNSW chapter). Modern Lucene reads them through MemorySegment (the Panama Foreign Function & Memory API), which lets the Vector API load lanes directly from the mapped file region without first copying into a float[]. The scorer (RandomVectorScorer / the off-heap VectorScorer) computes a distance straight against the segment-backed data. This avoids an allocation and a copy per vector compared and is a meaningful part of why recent Lucene vector search is fast on large indices that do not fit in heap.

grep -rln "MemorySegment\|OffHeap.*Vector\|RandomVectorScorer" lucene/core/src/java | head
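The copy-avoidance idea can be sketched with a memory-mapped `FloatBuffer`, used here only as a stand-in because it runs on any recent JDK without preview flags. This is not Lucene's scorer and not the real `.vec` layout: the class `MappedVectorDot`, the back-to-back little-endian layout, and the temp file are all assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Scores a query against vectors read straight from a mapped file region. */
final class MappedVectorDot {
    /** Writes vectors back to back as little-endian floats (hypothetical layout). */
    static Path writeVectors(float[][] vectors) {
        try {
            Path file = Files.createTempFile("vectors", ".bin");
            file.toFile().deleteOnExit();
            int dim = vectors[0].length;
            ByteBuffer buf = ByteBuffer.allocate(vectors.length * dim * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
            for (float[] v : vectors) {
                for (float x : v) {
                    buf.putFloat(x);
                }
            }
            Files.write(file, buf.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps the file read-only; the mapping stays valid after the channel is closed. */
    static FloatBuffer map(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                .order(ByteOrder.LITTLE_ENDIAN)
                .asFloatBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Dot product against vector number ord, with no intermediate float[] for the stored vector. */
    static float dot(FloatBuffer mapped, int ord, float[] query) {
        int base = ord * query.length;
        float sum = 0f;
        for (int i = 0; i < query.length; i++) {
            sum += mapped.get(base + i) * query[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        FloatBuffer mapped = map(writeVectors(new float[][] {{1f, 2f}, {3f, 4f}}));
        System.out.println(dot(mapped, 1, new float[] {10f, 100f}));
    }
}
```

An array-based scorer would first read each stored vector into a `float[]` and then run the kernel; here the kernel reads the mapped region directly, so the per-comparison allocation and copy disappear.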

The payoff: ~25% indexing speedups

Faster distance has two effects. At query time, lower per-comparison cost directly lowers HNSW search latency. At index/merge time, graph construction is dominated by distance computations (see the merge-rebuilds-the-graph discussion in the HNSW chapter), so a faster kernel plus an improved HNSW graph merger produced ~25% indexing speedups in Lucene's nightly benchmarks. That number is a combination — better merging algorithm and SIMD scoring — not SIMD alone, but SIMD is a load-bearing part of it. The lesson for a contributor: a change to the distance kernel or the merge path is high-leverage because it multiplies across millions of calls.

How to verify SIMD is actually active

A vectorized library that is silently running the scalar fallback is a classic, invisible performance bug. Prove it is on:

# 1. Confirm the incubator module is on the command line of a running OpenSearch node:
jps -lvm | grep -i opensearch | tr ' ' '\n' | grep -i "add-modules"
#   expect: --add-modules=jdk.incubator.vector (or the space-separated form)
#   (OpenSearch's jvm.options enables it on JDK 21)
ps -ef | grep -i opensearch | grep -o "add-modules[= ][^ ]*"

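# 2. Ask Lucene which provider it picked. VectorizationProvider.getInstance() is internal and
#    caller-checked, so calling it from your own class is expected to throw. Instead, trigger any
#    VectorUtil call (e.g. VectorUtil.dotProduct) in a standalone program and read Lucene's
#    java.util.logging output: a message that the vector incubator API is enabled means the
#    Panama provider (good); a warning that the incubator module is not readable means scalar.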

# 3. Lucene logs the chosen provider; enable its logging or check for the panama class in a heap/thread dump.

# 4. Microbench: time dotScalar vs dotVectorized in a JMH harness; a >2x gap on long vectors
#    confirms SIMD. (No speedup => module missing or species is scalar.)

A minimal self-check program:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SimdCheck {
    public static void main(String[] args) {
        VectorSpecies<Float> s = FloatVector.SPECIES_PREFERRED;
        System.out.println("Preferred lane count: " + s.length()
            + " (vector bits: " + s.vectorBitSize() + ")");
        // 16 => AVX-512, 8 => AVX2, 4 => 128-bit (NEON/SSE), 2 => 64-bit minimum shape (no usable SIMD)
    }
}

javac --add-modules jdk.incubator.vector SimdCheck.java
java  --add-modules jdk.incubator.vector SimdCheck
#   e.g. "Preferred lane count: 8 (vector bits: 256)"  on an AVX2 box

If you omit --add-modules jdk.incubator.vector, the class won't even load — which is exactly why Lucene guards the Panama path behind a runtime check and ships the scalar fallback.

Common pitfalls

Pitfall	Effect	Fix
Missing `--add-modules jdk.incubator.vector`	Lucene silently uses the scalar fallback; ~2–8× slower vectors	Add the flag (OpenSearch's `jvm.options` already does on JDK 21)
Wrong / very new JDK the provider doesn't recognize	Provider refuses Panama (it pins to known JDK versions) and falls back	Use the supported JDK (21 for current OpenSearch); upgrade Lucene if needed
Asserting exact equality scalar == vectorized	Flaky test from float associativity	Assert within a delta
Forgetting the scalar tail loop in a hand-rolled kernel	Wrong result for non-multiple-of-lane lengths	Always process the `length % laneCount` remainder
Allocating a `FloatVector`/`float[]` per call inside the loop	GC churn erases the SIMD win	Reuse buffers; prefer `MemorySegment` loads
Assuming AVX-512 everywhere	ARM/Graviton has 128-bit NEON; gains are smaller	Use `SPECIES_PREFERRED`; benchmark per platform
Comparing throughput before JIT warmup	Misleading "SIMD didn't help"	Warm up (JMH) before measuring

How OpenSearch and k-NN benefit

OpenSearch does nothing special to get this — it inherits Lucene's VectorUtil. Two paths:

k-NN lucene engine: uses Lucene's HNSW and therefore Lucene's VectorUtil. As long as the node runs with --add-modules jdk.incubator.vector (it does, via the bundled jvm.options on JDK 21), vector scoring is SIMD-accelerated for free. An OpenSearch Lucene upgrade that improves VectorUtil improves k-NN's lucene engine with zero plugin changes.
k-NN faiss engine: does not use Java SIMD — its native C++ library is compiled with AVX2/AVX-512 (and NEON on ARM) kernels. The hardware speedup is the same idea, achieved in C++ rather than Panama. See k-NN native integration and memory.

So "is SIMD on?" is a real operational question for an OpenSearch operator running vector search: check the JVM flags (above) for the lucene engine, and check the faiss build variant for the native engine.

Reading exercise

# In an apache/lucene checkout:

# 1. The funnel: every vector arithmetic op goes through VectorUtil.
grep -n "dotProduct\|squareDistance\|cosine\|IMPL" \
  lucene/core/src/java/org/apache/lucene/util/VectorUtil.java | head

# 2. The runtime selection of SIMD vs scalar.
grep -n "PanamaVector\|DefaultVector\|lookup\|getInstance" \
  lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java

# 3. The Panama implementation itself — find the FloatVector species and fma.
grep -rn "FloatVector\|SPECIES\|fma\|reduceLanes" \
  $(grep -rln "class PanamaVectorUtilSupport" lucene/core)

# 4. MemorySegment-based scoring.
grep -rln "MemorySegment\|RandomVectorScorer" lucene/core/src/java | head

# 5. Verify on a running node.
ps -ef | grep -i opensearch | grep -o "add-modules[= ][^ ]*"

Answer:

Why is the distance kernel the right place to optimize vector search? Quantify roughly how many times it runs for one query vs one merge.
Write (from memory) the structure of a Panama dot product: the main lane-striding loop and the tail. What does fma do, and why is the tail necessary?
How does the same Panama Java source end up as AVX-512 on one machine and NEON on another?
Trace how VectorUtil.dotProduct decides between SIMD and scalar. When is the decision made, and why is there no per-call branch in the hot loop?
Give two independent ways to prove SIMD is actually active on a running OpenSearch node.
The "~25% indexing speedup" came from two things — name both, and explain why a distance-kernel change is high-leverage.

Validation: prove you understand this

Implement squareDistance both scalar and Panama-vectorized; assert they agree within a delta (and explain why exact equality would be wrong).
Explain VectorizationProvider selecting PanamaVectorUtilSupport vs DefaultVectorUtilSupport: what triggers each, and when the choice is made.
State the single JDK flag that gates the Vector API and what happens to Lucene without it.
Explain how MemorySegment-based scoring avoids a copy that an array-based scorer would incur.
Describe how a k-NN lucene-engine workload and a k-NN faiss-engine workload each obtain hardware SIMD — and why the answer is different.
List three ways SIMD can be silently disabled or negated, and how you would detect each.

When you can do all six, you understand the vectorization theme end to end — from the HNSW graph walk down to the AVX instruction. Now make it concrete: Lab L1 dissects the files these kernels read, and Lab L3 builds and benchmarks a real HNSW index whose every comparison runs this loop. For the native side, see k-NN native integration and memory.

Lab L1: Crack Open a Lucene Index

Background

You have read about segments and codecs, the inverted index, BKD trees, DocValues, and HNSW vectors. Each of those chapters named files on disk — .tim, .kdd, .dvd, .vec, and friends. This lab makes them real. You will create an actual Lucene index, find the files on disk, decode them with three different tools — CheckIndex, Luke, and a small Java program using the reader API — and map every file extension back to the chapter that explains it.

This is the foundational "I can see inside the box" lab. Once you can open an index and enumerate its segments, fields, terms, and per-field codecs by hand, every later lab (custom codecs, HNSW from scratch, k-NN debugging) becomes legible — you can always drop down and inspect what Lucene actually wrote.

Note: OpenSearch indices are Lucene indices. The same files, the same tools, work on an OpenSearch shard's data directory. We will create a standalone Lucene index for full control, then point the same techniques at an OpenSearch shard so you see they are identical.

Why This Lab Matters for Contributors

Bugs in storage, merges, and corruption show up as files — knowing how to read them is a core debugging skill. CheckIndex is the first thing a Lucene/OpenSearch committer runs on a suspected-corrupt index.
Mapping file extensions to codec components demystifies the codec SPI: you will see that a .tim is the terms dictionary and a .vec is vector data, not just read it.
Luke is the canonical index inspector; being fluent in it (and in the reader API it is built on) is table stakes for working on the storage layer.
The reader API (DirectoryReader → LeafReader → terms/fields/codec) is the same API the search layer uses. Walking it by hand teaches you how a query reaches the data.

Prerequisites

JDK 21 (the OpenSearch/Lucene bundled JDK is fine).

An apache/lucene checkout (for Luke and for the Lucene jars). Clone it:

git clone https://github.com/apache/lucene.git
cd lucene && ./gradlew assemble    # first build is slow; grab coffee

Optionally, a running OpenSearch from ./gradlew run (for the "do it on a real shard" step).
The Lucene lucene-core jar on your classpath for the standalone program (we resolve it below).

Step-by-Step Tasks

Step 1: Hold the file-extension map in your head

This is the decoder ring. Each per-segment file belongs to one codec component and one chapter:

Extension(s)	Component	Chapter
`segments_N`	Commit point (`SegmentInfos`) — the list of live segments	segments-and-codecs
`.si`	Segment info (codec, doc count, diagnostics)	segments-and-codecs
`.cfs` / `.cfe`	Compound file (bundles a small segment's files) + its entries table	segments-and-codecs
`.fnm`	Field infos (names, types, which features each field has)	segments-and-codecs
`.fdt` / `.fdx` / `.fdm`	Stored fields: data / index / metadata	segments-and-codecs
`.tim` / `.tip` / `.tmd`	Terms dictionary / terms index / metadata (BlockTree + FST)	inverted-index-and-postings
`.doc` / `.pos` / `.pay`	Postings: doc ids / positions / payloads & offsets	inverted-index-and-postings
`.nvd` / `.nvm`	Norms data / metadata	inverted-index-and-postings
`.dvd` / `.dvm`	DocValues data / metadata (columnar)	docvalues-columnar
`.kdd` / `.kdi` / `.kdm`	Points / BKD tree: data / index / metadata	points-and-bkd-trees
`.vec` / `.vex` / `.vem`	Vectors: values / HNSW graph / metadata	hnsw-vector-search

Step 2: Write a tiny indexer that uses every component

We index documents with a text field (postings + norms), a stored field, a numeric point + docvalue, and a float vector — so the segment contains all the file types above. Resolve the Lucene jars from your checkout first:

# Find the built lucene-core jar in your apache/lucene checkout:
LUCENE=$(pwd)                                   # your apache/lucene dir
CORE=$(find "$LUCENE" -name "lucene-core-*.jar" | grep -v sources | head -1)
echo "lucene-core: $CORE"
mkdir -p ~/lucene-lab && cd ~/lucene-lab

CrackOpenIndexer.java:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;            // indexes a point + docvalue
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.Random;

/** Builds an index that exercises postings, stored fields, points, docvalues, and vectors. */
public class CrackOpenIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));
        IndexWriterConfig cfg = new IndexWriterConfig();
        // Keep files NON-compound so each extension is a separate file we can find/inspect.
        cfg.setUseCompoundFile(false);
        Random rnd = new Random(42);

        try (IndexWriter w = new IndexWriter(dir, cfg)) {
            String[] bodies = {
                "the quick brown fox jumps over the lazy dog",
                "a fast dark fox leaps across a sleepy hound",
                "lucene segments are immutable and merged over time",
                "vector search finds nearest neighbours in high dimensions"
            };
            for (int i = 0; i < bodies.length; i++) {
                Document doc = new Document();
                doc.add(new StoredField("id", i));                              // .fdt/.fdx
                doc.add(new TextField("body", bodies[i], Field.Store.NO));      // .tim/.doc/.nvd
                doc.add(new IntField("year", 2020 + i, Field.Store.NO));        // .kdd (point) + .dvd
                float[] vec = new float[8];                                     // small vector for clarity
                for (int d = 0; d < vec.length; d++) vec[d] = rnd.nextFloat();
                doc.add(new KnnFloatVectorField("embedding", vec, VectorSimilarityFunction.COSINE)); // .vec/.vex/.vem
                w.addDocument(doc);
            }
            w.commit();   // writes segments_N
        }
        System.out.println("Indexed. Files written to ./index");
    }
}

Compile and run:

javac -cp "$CORE" CrackOpenIndexer.java
java  -cp "$CORE:." --add-modules jdk.incubator.vector CrackOpenIndexer

Note: --add-modules jdk.incubator.vector lets Lucene's vector code use SIMD (SIMD chapter). The program also runs without the flag, on the scalar fallback, but include it for realism.

Step 3: `find` the files and map each one

cd ~/lucene-lab
ls -la index/
find index -type f | sort

You should see the segments_N commit and a set of _0.* files (segment _0). Match every one to the table in Step 1:

# Group by extension and eyeball the map:
find index -type f | sed 's/.*\.//' | sort | uniq -c
#   expect: si fnm fdt fdx fdm tim tip tmd doc pos nvd nvm kdd kdi kdm dvd dvm vec vex vem ...

Find the .vec, .vex, .vem trio — those are your vectors. Note their relative sizes (.vec is raw values, .vex is the graph).
Find .tim/.tip — the terms dictionary for body.
Find .kdd — the BKD point for year. Find .dvd — its docvalue.
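If you would rather have the mapping printed than eyeballed, a small helper can apply the Step 1 table to a directory listing. This is a sketch using only the JDK: the class name `ExtensionMap` and the short component labels are made up for this lab, and any extension missing from the table is reported as unmapped rather than guessed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Stream;

/** Prints each file in an index directory next to the codec component from the Step 1 table. */
final class ExtensionMap {
    static final Map<String, String> COMPONENT = Map.ofEntries(
        Map.entry("si", "segment info"),
        Map.entry("cfs", "compound file"),
        Map.entry("cfe", "compound file entries"),
        Map.entry("fnm", "field infos"),
        Map.entry("fdt", "stored fields data"),
        Map.entry("fdx", "stored fields index"),
        Map.entry("fdm", "stored fields metadata"),
        Map.entry("tim", "terms dictionary"),
        Map.entry("tip", "terms index"),
        Map.entry("tmd", "terms metadata"),
        Map.entry("doc", "postings: doc ids"),
        Map.entry("pos", "postings: positions"),
        Map.entry("pay", "postings: payloads and offsets"),
        Map.entry("nvd", "norms data"),
        Map.entry("nvm", "norms metadata"),
        Map.entry("dvd", "docvalues data"),
        Map.entry("dvm", "docvalues metadata"),
        Map.entry("kdd", "points data"),
        Map.entry("kdi", "points index"),
        Map.entry("kdm", "points metadata"),
        Map.entry("vec", "vector values"),
        Map.entry("vex", "HNSW graph"),
        Map.entry("vem", "vector metadata"));

    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? fileName : fileName.substring(dot + 1);
    }

    static String describe(String fileName) {
        if (fileName.startsWith("segments_")) {
            return "commit point";
        }
        return COMPONENT.getOrDefault(extensionOf(fileName), "unmapped (check the codec's file list)");
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "index");
        try (Stream<Path> files = Files.list(dir)) {
            files.map(p -> p.getFileName().toString())
                .sorted()
                .forEach(f -> System.out.printf("%-16s %s%n", f, describe(f)));
        }
    }
}
```

Run it from ~/lucene-lab with `java ExtensionMap.java index`. Anything it prints as unmapped (for example `write.lock`, or a metadata file added by a newer codec) is a prompt to look up which format writes it.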

Step 4: Decode with `CheckIndex`

CheckIndex walks the whole index and prints, per segment, every component it verifies — a perfect human-readable inventory and the standard corruption check:

java -cp "$CORE" --add-modules jdk.incubator.vector \
  org.apache.lucene.index.CheckIndex index

Read the output: it lists the segment, the number of docs, and one test line for each part: field infos, field norms, terms and postings, stored fields, term vectors, docvalues, points, and vectors. Confirm:

No problems were detected with this index.
The vectors test passes and counts one vector field (embedding) with four vectors. The exact wording varies by Lucene version; dimension and similarity are easiest to read through the reader API in Step 6.
The points test passes and counts one point field (year) with four points.

Step 5: Decode with Luke (the GUI inspector)

Luke is Lucene's index browser. Launch it from your apache/lucene checkout:

cd /path/to/apache/lucene
./gradlew :lucene:luke:run

In the GUI:

Open ~/lucene-lab/index.
Overview tab: see the fields (body, year, embedding, id) and per-field term counts.
Click the body field → browse the terms (brown, fox, quick, …) and their doc frequencies — this is the inverted index from the chapter, visualized.
Documents tab: step through docs; see the stored id.
Commits / Segments: see segment _0, its codec, and the files it owns.

Luke is built on the same reader API you use next — it is a GUI over DirectoryReader.

Step 6: Decode with the reader API (the real skill)

A GUI is nice; the skill is doing it in code, because that is how you debug and how the search layer works. This program opens the index and enumerates segments, fields, per-field codec formats, and terms:

CrackOpenReader.java:

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SegmentReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

import java.nio.file.Paths;

public class CrackOpenReader {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            System.out.println("maxDoc=" + reader.maxDoc() + " numDocs=" + reader.numDocs()
                + " leaves(segments)=" + reader.leaves().size());

            // One LeafReader per segment.
            for (LeafReaderContext ctx : reader.leaves()) {
                LeafReader leaf = ctx.reader();
                System.out.println("\n=== segment ord " + ctx.ord + " (docBase=" + ctx.docBase + ") ===");

                // The codec that wrote this segment.
                if (leaf instanceof SegmentReader sr) {
                    Codec codec = sr.getSegmentInfo().info.getCodec();
                    System.out.println("codec: " + codec.getName() + " (" + codec.getClass().getName() + ")");
                }

                // Field infos: every field and the features it carries.
                System.out.println("fields:");
                for (FieldInfo fi : leaf.getFieldInfos()) {
                    StringBuilder feats = new StringBuilder();
                    if (fi.getIndexOptions() != org.apache.lucene.index.IndexOptions.NONE) feats.append("postings ");
                    if (fi.getPointDimensionCount() > 0) feats.append("points(" + fi.getPointDimensionCount() + "d) ");
                    if (fi.getDocValuesType() != org.apache.lucene.index.DocValuesType.NONE)
                        feats.append("docvalues(" + fi.getDocValuesType() + ") ");
                    if (fi.getVectorDimension() > 0)
                        feats.append("vectors(dim=" + fi.getVectorDimension()
                            + ", sim=" + fi.getVectorSimilarityFunction() + ") ");
                    System.out.println("  - " + fi.name + ": " + feats.toString().trim());
                }

                // Enumerate the terms of the analyzed text field.
                Terms terms = leaf.terms("body");
                if (terms != null) {
                    System.out.println("terms of 'body' (term -> docFreq):");
                    TermsEnum te = terms.iterator();
                    BytesRef t;
                    while ((t = te.next()) != null) {
                        System.out.println("  " + t.utf8ToString() + " -> " + te.docFreq());
                    }
                }
            }
        }
    }
}

javac -cp "$CORE" CrackOpenReader.java
java  -cp "$CORE:." --add-modules jdk.incubator.vector CrackOpenReader

Expected (abridged):

maxDoc=4 numDocs=4 leaves(segments)=1

=== segment ord 0 (docBase=0) ===
codec: Lucene101  (org.apache.lucene.codecs.lucene101.Lucene101Codec)   # version varies — grep to confirm
fields:
  - id:                           # stored only, so no features are listed
  - body: postings
  - year: points(1d) docvalues(SORTED_NUMERIC)
  - embedding: vectors(dim=8, sim=COSINE)
terms of 'body' (term -> docFreq):
  a -> 1
  brown -> 1
  fox -> 2
  ...

Note: The codec name (Lucene101, Lucene103, …) depends on your Lucene version. Do not hard-code it — read it from info.getCodec() as above. This is the "grep to confirm the real name" discipline from the deep dives.
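To sanity-check the term listing before you run the reader, the doc frequencies can be recomputed by hand. This sketch assumes lowercase-and-split-on-whitespace is a good enough stand-in for the default analyzer on these four sentences; `ToyDocFreq` is a made-up name and the class does not touch Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

/** Recomputes term -> docFreq for the lab's four documents without Lucene. */
final class ToyDocFreq {
    static final String[] BODIES = {
        "the quick brown fox jumps over the lazy dog",
        "a fast dark fox leaps across a sleepy hound",
        "lucene segments are immutable and merged over time",
        "vector search finds nearest neighbours in high dimensions"
    };

    /** docFreq counts documents containing the term, not occurrences; keys come back sorted. */
    static SortedMap<String, Integer> docFreqs(String[] docs) {
        SortedMap<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            Set<String> unique = new HashSet<>(
                Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+")));
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : docFreqs(BODIES).entrySet()) {
            System.out.println("  " + e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Note that "a" occurs twice in the second document but has a docFreq of 1, while "fox" and "over" each appear in two documents.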

Step 7: Do it on a real OpenSearch shard

The whole point — these are the same files. Start OpenSearch, index a doc, find the shard, and run CheckIndex on it:

# In the OpenSearch repo:
./gradlew run &          # wait until the node answers on localhost:9200 before continuing
curl -s -XPOST 'localhost:9200/lab/_doc?refresh' -H 'Content-Type: application/json' \
  -d '{"body":"opensearch indices are lucene indices"}'

# Find the shard's Lucene index directory on disk:
find . -path "*indices*0/index/segments_*" 2>/dev/null | head
SHARD=$(dirname "$(find . -path '*indices*0/index/segments_*' 2>/dev/null | head -1)")
find "$SHARD" -type f | sed 's/.*\.//' | sort | uniq -c    # same extensions you mapped in Step 3

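CheckIndex works on it identically (stop the node first or use a copy, since the files are open). Use a lucene-core jar whose version matches the Lucene that OpenSearch bundles; if $CORE is newer, also put the lucene-backward-codecs jar on the classpath, or CheckIndex may fail to load the segment's codec: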

cp -r "$SHARD" /tmp/shard-copy
java -cp "$CORE" org.apache.lucene.index.CheckIndex /tmp/shard-copy

Confirm the OpenSearch shard directory has the same .tim/.kdd/.dvd (and .vec if you added a knn_vector field) extensions. They are Lucene segments.

Implementation Requirements / Deliverables

CrackOpenIndexer builds an index exercising postings, stored fields, points, docvalues, and vectors (non-compound so files are separate).
A find listing of index/ with every extension mapped to its chapter in Step 1's table.
CheckIndex output showing no problems, with the vectors and points tests passing for the embedding and year fields.
Luke opened on the index; you browsed the body terms and saw segment _0's codec/files.
CrackOpenReader prints, per segment: the codec name, every field with its features, and the body terms with doc frequencies.
You ran find/CheckIndex against a real OpenSearch shard and confirmed identical file types.

Troubleshooting

Symptom	Likely cause	Fix
`ClassNotFoundException: ...KnnFloatVectorField`	Wrong/old `lucene-core` jar	Re-resolve `$CORE` from your built checkout
Only `.cfs`/`.cfe`, no individual extensions	Compound files on	`cfg.setUseCompoundFile(false)`
`CheckIndex` logs a vector-module warning	Module not added, so Lucene is on the scalar fallback	Add `--add-modules jdk.incubator.vector`
Luke won't launch	Old JDK / no display	Use JDK 21; run on a machine with a GUI/X forwarding
Codec name differs from the example	Different Lucene version	Expected — read it from `info.getCodec()`, don't hard-code
`find` on the shard returns nothing	Index not flushed/wrong path	Index with `?refresh`, run `POST /lab/_flush`, and search under `indices/<index-uuid>/<shard>/index`
`CheckIndex` on a live shard errors	Files held open by the node	Run on a copy or stop the node

Validation / Self-check

List every file extension your index produced and name the codec component and chapter for each.
Which three files make up the vector data, and what does each hold? Why is .vec usually the biggest of the three for a high-dimensional field?
What does CheckIndex verify, and why is it the first tool a committer runs on a suspected-corrupt index? Name three components it checks.
In CrackOpenReader, what is a LeafReader and how does it relate to a segment? How did you read the codec that wrote a segment?
How did you enumerate the terms of body, and what does docFreq() mean? Tie it back to the inverted index chapter.
Show that an OpenSearch shard is a Lucene index: which extensions did it share with your standalone index, and which tool proved it?
Why did we set setUseCompoundFile(false), and what would change in the find output if we left it on?

When you can crack open any Lucene (or OpenSearch shard) index, map its files, and enumerate its contents three ways, you are ready to modify the codec: continue to Lab L2: Write a Custom Codec.

Lab L2: Write a Custom Codec

Background

In Lab L1 you cracked open an index and saw that every file (.tim, .kdd, .dvd, .vec) is written by a codec component. The segments and codecs chapter explained the codec SPI: Codec is the top-level format plugin, and it delegates to a PostingsFormat, DocValuesFormat, KnnVectorsFormat, StoredFieldsFormat, and PointsFormat. Codec, PostingsFormat, DocValuesFormat, and KnnVectorsFormat are looked up by name through META-INF/services; the remaining formats are simply whatever the codec's accessor methods return. This lab makes you write one.

You will build a custom Codec that delegates to the current default but wraps one component to observe and (optionally) influence what Lucene writes — a FilterCodec that logs every postings format and vector format decision, plus a thin PostingsFormat wrapper that counts the terms written. You will register it via SPI, drive it from a Lucene unit test, and see exactly how OpenSearch's k-NN plugin uses this same mechanism to inject its faiss/lucene vector formats and how OpenSearch exposes codecs through the index.codec setting.

This is the canonical "extend the storage layer" exercise. The k-NN plugin's KNN990Codec (and its versioned siblings) is precisely a FilterCodec that overrides knnVectorsFormat(). After this lab, that class will read like something you could have written.

Note: A custom codec is one of the most powerful extension points in Lucene — and one of the most dangerous. A codec change affects how bytes are written to immutable segments; get it wrong and you corrupt or can't reopen the index. This lab uses the safe pattern: delegate to the default, wrap, observe. Never hand-roll a binary format for a real index without a strong reason and extensive testing.

Why This Lab Matters for Contributors

The k-NN plugin's entire native-vector storage (faiss/nmslib graphs as segment files) is delivered through a custom FilterCodec overriding knnVectorsFormat(). Understanding FilterCodec is a prerequisite to reading or contributing to k-NN's codec layer.
OpenSearch's index.codec setting (default, best_compression, zstd, …) is the codec SPI surfaced to users. Knowing how codecs are selected explains that setting.
The META-INF/services SPI is how every Lucene format is discovered. Writing one teaches you Java's ServiceLoader mechanism, which recurs throughout OpenSearch (analyzers, score functions).
Wrapping a format to log/transform is a real debugging and experimentation technique — e.g. to see which vector format a field actually gets, or to A/B a quantization setting.

Prerequisites

Lab L1 done — you can build and inspect an index.
An apache/lucene checkout (./gradlew assemble done) — for running the unit test and resolving jars.
JDK 21.

Know the current default codec name in your Lucene version (it moves: Lucene101Codec, Lucene103Codec, …). Find it:

grep -n "defaultCodec\|LOADER.lookup" \
  lucene/core/src/java/org/apache/lucene/codecs/Codec.java
grep -rln "extends FilterCodec\|new Lucene.*Codec()" lucene/core/src/java | head
# Or read it at runtime: System.out.println(Codec.getDefault().getName());

Step-by-Step Tasks

Step 1: Understand the codec hierarchy you are extending

flowchart TD
    SPI["META-INF/services/org.apache.lucene.codecs.Codec"] --> MyCodec["LoggingCodec extends FilterCodec"]
    MyCodec -->|delegate| Default["Lucene10xCodec (the current default)"]
    MyCodec -->|override postingsFormat| PF["LoggingPostingsFormat extends PostingsFormat"]
    MyCodec -->|override knnVectorsFormat| VF["(log which vector format the field gets)"]
    PF -->|delegate fieldsConsumer/Producer| DefaultPF["Lucene...PostingsFormat"]

Piece	What it does
`LoggingCodec extends FilterCodec`	Delegates every format to the default codec; overrides the ones we want to observe.
`LoggingPostingsFormat extends PostingsFormat`	Wraps the default postings format's consumer/producer to log/count.
SPI file `META-INF/services/org.apache.lucene.codecs.Codec`	Makes `ServiceLoader` find `LoggingCodec` by name.
SPI file for the `PostingsFormat`	Makes the wrapped format discoverable (postings formats are loaded by name too).

Step 2: Write the `FilterCodec`

FilterCodec is the safe base: its constructor takes a name and a delegate, and every xxxFormat() method returns the delegate's by default. You override only what you want to change:

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.PostingsFormat;
// Use the default codec class for YOUR Lucene version (see Prerequisites); Lucene101Codec is an example.
import org.apache.lucene.codecs.lucene101.Lucene101Codec;

/**
 * A codec that behaves exactly like the codec it delegates to, but:
 *  - wraps the postings format to count/log terms written, and
 *  - logs which KNN vectors format the codec hands out.
 */
public class LoggingCodec extends FilterCodec {

    private final PostingsFormat postingsFormat;

    public LoggingCodec() {
        // SPI name "Logging"; the delegate is pinned to a concrete versioned codec.
        // Do not call Codec.getDefault() here: Lucene instantiates every registered codec while it is
        // still resolving the default, so that call throws IllegalStateException during SPI loading.
        super("Logging", new Lucene101Codec());
        this.postingsFormat = new LoggingPostingsFormat(super.postingsFormat());
    }

    @Override
    public PostingsFormat postingsFormat() {
        return postingsFormat;        // our wrapper instead of the delegate's
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat() {
        KnnVectorsFormat delegate = super.knnVectorsFormat();
        // This is the exact hook k-NN uses to inject faiss/lucene vector formats.
        // getName() is the format's SPI name; the stock codec returns a per-field wrapper here.
        System.out.println("[LoggingCodec] knnVectorsFormat -> " + delegate.getName());
        return delegate;              // pass through; a real plugin would return its own format here
    }
}

Note: The delegate is a concrete versioned codec (new Lucene101Codec()), not Codec.getDefault(). Once LoggingCodec is listed in META-INF/services, Lucene constructs it while it is still initializing its own default, and a constructor that asks for the default at that moment fails with IllegalStateException. Pinning also keeps the on-disk format reproducible, which is why a production FilterCodec (like k-NN's) delegates to a specific versioned codec rather than to whatever default is active.
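One initialization detail is worth seeing in isolation. Lucene's Codec.getDefault() guards against being called while codecs are still being loaded (for example from a codec's constructor) by throwing IllegalStateException. The sketch below reproduces that shape with plain JDK classes only; HolderReentryDemo, Holder, and EagerProvider are hypothetical names, and none of this is Lucene code.

```java
/** Plain-JDK sketch (hypothetical names) of re-entering a static holder during its own initialization. */
final class HolderReentryDemo {

    /** Lazy holder: static fields are assigned top to bottom, like a loader field followed by a default. */
    static final class Holder {
        // Stands in for SPI loading: every registered provider is instantiated here.
        static final Object LOADED = new EagerProvider();
        // Assigned only after LOADED, so it is still null while providers are being constructed.
        static final Object DEFAULT = new Object();
    }

    /** Fails loudly instead of handing out a null default. */
    static Object getDefault() {
        if (Holder.DEFAULT == null) {
            throw new IllegalStateException("default requested before initialization finished");
        }
        return Holder.DEFAULT;
    }

    /** A provider whose constructor asks for the default, re-entering Holder mid-initialization. */
    static final class EagerProvider {
        EagerProvider() {
            getDefault();
        }
    }

    private static String outcome;

    /** Returns "ok", or the simple name of the exception that aborted Holder's static initializer. */
    static synchronized String tryInit() {
        if (outcome == null) {
            try {
                getDefault();
                outcome = "ok";
            } catch (ExceptionInInitializerError e) {
                outcome = e.getCause().getClass().getSimpleName();
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        System.out.println(tryInit());
    }
}
```

The first access to Holder starts its static initializer, the provider's constructor reads DEFAULT before it has been assigned, and the guard throws. A constructor that builds its own delegate never re-enters the holder, which is one more reason to pin the delegate to a versioned codec.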

Step 3: Write the wrapping `PostingsFormat`

A PostingsFormat produces a FieldsConsumer (write side) and FieldsProducer (read side). We delegate both, but wrap the consumer to count terms as they are written:

import org.apache.lucene.codecs.FieldsConsumer;
import org.apache.lucene.codecs.FieldsProducer;
import org.apache.lucene.codecs.NormsProducer;
import org.apache.lucene.codecs.PostingsFormat;
// Use the postings format class for YOUR Lucene version; Lucene101PostingsFormat is an example.
import org.apache.lucene.codecs.lucene101.Lucene101PostingsFormat;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

import java.io.IOException;

/** Delegates to a real PostingsFormat but logs how many fields/terms get written. */
public class LoggingPostingsFormat extends PostingsFormat {

    private final PostingsFormat delegate;

    /**
     * Public no-arg ctor required for SPI. It must not call PostingsFormat.forName() or
     * Codec.getDefault(): both throw IllegalStateException while formats are still being loaded.
     */
    public LoggingPostingsFormat() {
        this(new Lucene101PostingsFormat());
    }

    public LoggingPostingsFormat(PostingsFormat delegate) {
        super("Logging");                 // SPI name for this PostingsFormat
        this.delegate = delegate;
    }

    @Override
    public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
        FieldsConsumer inner = delegate.fieldsConsumer(state);
        return new FieldsConsumer() {
            @Override
            public void write(Fields fields, NormsProducer norms) throws IOException {
                long fieldCount = 0, termCount = 0;
                for (String field : fields) {
                    fieldCount++;
                    Terms terms = fields.terms(field);
                    if (terms == null) continue;
                    // Count by iterating: Terms.size() may be unsupported or return -1 on the write path.
                    TermsEnum termsEnum = terms.iterator();
                    while (termsEnum.next() != null) termCount++;
                }
                System.out.println("[LoggingPostingsFormat] segment=" + state.segmentInfo.name
                    + " fields=" + fieldCount + " terms=" + termCount);
                inner.write(fields, norms);     // do the real write
            }

            @Override
            public void close() throws IOException { inner.close(); }
        };
    }

    @Override
    public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
        return delegate.fieldsProducer(state);   // read path: pure delegation
    }
}

Warning: The wrapped write path must call inner.write(...) exactly once and propagate close(). Forgetting either produces an empty or unclosed postings file — and a corrupt segment. The "delegate, don't reimplement" rule is what keeps this safe.
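The same discipline can be practiced in miniature with JDK types only. The sketch below is an analogy, not Lucene code: CountingWriter and TrackingSink are hypothetical names, and java.io.Writer stands in for FieldsConsumer. The wrapper observes each write, forwards it exactly once, and propagates close().

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class CountingWriterDemo {

    /** Observes writes, forwards each one exactly once, and propagates lifecycle calls. */
    static final class CountingWriter extends Writer {
        private final Writer inner;
        private long chars;

        CountingWriter(Writer inner) {
            this.inner = inner;
        }

        long chars() {
            return chars;
        }

        @Override
        public void write(char[] buf, int off, int len) throws IOException {
            chars += len;                // observe
            inner.write(buf, off, len);  // delegate exactly once
        }

        @Override
        public void flush() throws IOException {
            inner.flush();
        }

        @Override
        public void close() throws IOException {
            inner.close();               // forgetting this leaves the sink open
        }
    }

    /** A sink that remembers whether close() reached it. */
    static final class TrackingSink extends StringWriter {
        boolean closed;

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Writes text through the wrapper and reports what the wrapper and the sink saw. */
    static String run(String text) {
        TrackingSink sink = new TrackingSink();
        CountingWriter w = new CountingWriter(sink);
        try {
            w.write(text);
            w.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return "chars=" + w.chars() + " sink=" + sink + " closed=" + sink.closed;
    }

    public static void main(String[] args) {
        System.out.println(run("the quick brown fox"));
    }
}
```

If the forwarding line in write() is removed, the sink stays empty; if close() is not forwarded, closed stays false. Those are the small-scale versions of the empty and unclosed postings files described in the warning.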

Step 4: Register both via `META-INF/services`

ServiceLoader discovers formats by reading text files listing implementation class names. Create two files (use your real package; this example uses the default package):

mkdir -p src/main/resources/META-INF/services
# The Codec SPI file:
printf 'LoggingCodec\n' > src/main/resources/META-INF/services/org.apache.lucene.codecs.Codec
# The PostingsFormat SPI file:
printf 'LoggingPostingsFormat\n' > src/main/resources/META-INF/services/org.apache.lucene.codecs.PostingsFormat

(If your classes are in a package, list the fully qualified names, e.g. com.example.LoggingCodec.) Both files must be visible on the runtime classpath. Confirm the layout:

src/main/resources/META-INF/services/
├── org.apache.lucene.codecs.Codec            # contains: LoggingCodec
└── org.apache.lucene.codecs.PostingsFormat   # contains: LoggingPostingsFormat

Note: Lucene's own codecs ship the same way — look at lucene-core.jar's META-INF/services/org.apache.lucene.codecs.Codec to see the built-in list. The k-NN plugin's jar has a META-INF/services/org.apache.lucene.codecs.Codec listing its KNN*Codec.

Step 5: Drive it from a test (standalone)

Wire the codec into an IndexWriterConfig and index a doc — you'll see the log lines fire:

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LoggingCodecDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig();
        cfg.setCodec(new LoggingCodec());          // <-- inject our codec
        try (IndexWriter w = new IndexWriter(dir, cfg)) {
            Document d = new Document();
            d.add(new TextField("body", "the quick brown fox", Field.Store.NO));
            w.addDocument(d);
            w.commit();                            // triggers the postings write -> our log line
        }
        System.out.println("default codec name: " + Codec.getDefault().getName());
    }
}

CORE=$(find /path/to/apache/lucene -name "lucene-core-*.jar" | grep -v sources | head -1)
javac -cp "$CORE" LoggingCodec.java LoggingPostingsFormat.java LoggingCodecDemo.java
java  -cp "$CORE:.:src/main/resources" --add-modules jdk.incubator.vector LoggingCodecDemo

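Expected (the [LoggingCodec] knnVectorsFormat -> ... line appears only once a document carries a vector field such as a KnnFloatVectorField, which this demo does not index; see Troubleshooting):

[LoggingPostingsFormat] segment=_0 fields=1 terms=4
default codec name: Lucene101    # version varies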

Step 6: Run it as a Lucene unit test

Lucene has a strong testing framework. To run a real Lucene unit test that exercises a codec, use the Gradle test task (the canonical contributor command):

cd /path/to/apache/lucene
# Run an existing codec/SPI test to see the harness:
./gradlew :lucene:core:test --tests "org.apache.lucene.codecs.TestCodecLoadingDeadlock"
# Run the named-SPI loader test (it looks codecs up by name):
./gradlew :lucene:core:test --tests "*TestNamedSPILoader*"

To make LoggingCodec itself testable, drop it into a Lucene test module (or your own Gradle module that depends on lucene-core + lucene-test-framework) and extend LuceneTestCase:

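import org.apache.lucene.codecs.Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.tests.util.LuceneTestCase;

public class LoggingCodecTest extends LuceneTestCase {
    public void testSpiLoads() {
        // Proves the META-INF/services file is on the classpath and the codec is discoverable by name.
        Codec c = Codec.forName("Logging");
        assertNotNull(c);
        assertEquals("Logging", c.getName());
    }

    public void testRoundTrip() throws Exception {
        // A real round-trip: index with LoggingCodec, reopen, read back.
        try (Directory dir = newDirectory()) {
            IndexWriterConfig cfg = newIndexWriterConfig();
            cfg.setCodec(new LoggingCodec());
            try (IndexWriter w = new IndexWriter(dir, cfg)) {
                Document d = new Document();
                d.add(new TextField("body", "the quick brown fox", Field.Store.NO));
                w.addDocument(d);
                w.commit();
            }
            try (DirectoryReader r = DirectoryReader.open(dir)) {
                assertEquals(1, r.numDocs());
            }
        }
    }
}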

./gradlew :lucene:core:test --tests "*LoggingCodecTest*"

Note: Codec.forName("Logging") succeeding is the proof your SPI registration works — it is ServiceLoader finding your META-INF/services entry. If it throws IllegalArgumentException: An SPI class of type ... with name 'Logging' does not exist, your SPI file is missing, misnamed, or not on the classpath.
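To make the lookup-by-name behavior concrete, here is a simplified stand-in written against the JDK only. It is a sketch under stated assumptions, not Lucene's loader: NamedRegistry, Named, and Provider are hypothetical names, the list of "discovered" providers is passed in by hand instead of coming from ServiceLoader, and keeping the first provider registered under a name is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NamedLookupDemo {

    /** Anything that can be looked up by a stable name. */
    interface Named {
        String getName();
    }

    /** A trivial provider that only carries its name. */
    static final class Provider implements Named {
        private final String name;

        Provider(String name) {
            this.name = name;
        }

        @Override
        public String getName() {
            return name;
        }
    }

    /** Name-keyed registry: built once from the discovered providers, then read-only. */
    static final class NamedRegistry<S extends Named> {
        private final Map<String, S> byName = new LinkedHashMap<>();

        NamedRegistry(Iterable<S> discovered) {
            for (S s : discovered) {
                byName.putIfAbsent(s.getName(), s);   // assumption: first provider for a name wins
            }
        }

        S lookup(String name) {
            S s = byName.get(name);
            if (s == null) {
                throw new IllegalArgumentException(
                    "no provider named '" + name + "'; available: " + byName.keySet());
            }
            return s;
        }
    }

    /** Two hand-registered providers standing in for entries read from META-INF/services. */
    static NamedRegistry<Provider> sample() {
        return new NamedRegistry<>(List.of(new Provider("Lucene101"), new Provider("Logging")));
    }

    public static void main(String[] args) {
        System.out.println(sample().lookup("Logging").getName());
        try {
            sample().lookup("Missing");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The failure branch is the useful part: an error that lists the names it does know tells you at a glance whether your entry was discovered at all or merely registered under a different name.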

Step 7: How this plugs into OpenSearch

Two connection points:

1. The k-NN plugin's codec. k-NN registers a FilterCodec (historically named per Lucene version, e.g. KNN990Codec / a KNNCodecVersion enum) whose knnVectorsFormat() returns either Lucene's HNSW format (lucene engine) or a native format that writes the faiss/nmslib graph as segment files. It is the Step 2 pattern, with the override actually returning a custom format instead of passing through:

# In a k-NN checkout:
grep -rln "extends FilterCodec\|knnVectorsFormat\|KNNCodecVersion\|KNN.*Codec" src/main/java
find . -path "*META-INF/services/org.apache.lucene.codecs.Codec"

2. The index.codec setting. OpenSearch lets users pick a codec per index (index.codec: default | best_compression | zstd | zstd_no_dict | ...). Each name maps to a Codec in core. To find how OpenSearch resolves it:

# In the OpenSearch repo:
grep -rln "index.codec\|CodecService\|class CodecService" server/src/main/java | head
grep -rn "best_compression\|Lucene.*Codec\|zstd" \
  server/src/main/java/org/opensearch/index/codec/CodecService.java 2>/dev/null | head

A custom plugin codec is registered through EnginePlugin/a codec service provider so an index can select it by name — the same SPI idea, surfaced as an OpenSearch setting. (Building a full OpenSearch-installable codec plugin is a stretch goal; the mechanism is the FilterCodec you wrote.)

Implementation Requirements / Deliverables

LoggingCodec extends FilterCodec delegating to the default, overriding postingsFormat() and knnVectorsFormat() (the latter logging the assigned format).
LoggingPostingsFormat extends PostingsFormat that delegates read/write and logs field/term counts on the write path, calling inner.write(...) exactly once.
Two META-INF/services files registering the codec and the postings format by name.
Codec.forName("Logging") succeeds (SPI proof), demonstrated in a test.
An index built with LoggingCodec round-trips: reopen with DirectoryReader, numDocs() matches; the log lines fired during the write.
You located the k-NN plugin's FilterCodec (and its knnVectorsFormat override) and the OpenSearch CodecService/index.codec mapping with grep.

Troubleshooting

Symptom	Likely cause	Fix
`SPI class ... with name 'Logging' does not exist`	`META-INF/services` file missing/misnamed/not on classpath	Filename must be exactly `org.apache.lucene.codecs.Codec`; contents the FQN; resources dir on classpath
Index won't reopen / `CorruptIndexException`	Wrapper didn't call `inner.write()` or didn't `close()`	Delegate once; propagate `close()`
`knnVectorsFormat` log never prints	No vector field indexed	Add a `KnnFloatVectorField` to the doc
Wrong default codec name printed	Different Lucene version	Read it at runtime via `Codec.getDefault().getName()`
Postings format SPI not found	Forgot the second SPI file (`...codecs.PostingsFormat`)	Register both Codec and PostingsFormat
Test can't find `LuceneTestCase`	Missing `lucene-test-framework` dep	Add it (or run inside the lucene gradle module)
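Your codec is silently ignored, or another class answers to its name	Name collides with an already registered codec (the first one found on the classpath wins)	Pick a unique name like `Logging`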

Validation / Self-check

What does FilterCodec give you, and why is "delegate + override only what you need" the safe pattern for a codec?
Which two META-INF/services files did you create, what are their exact filenames, and what does each contents line mean to ServiceLoader?
In LoggingPostingsFormat, why must the wrapped FieldsConsumer.write call inner.write(...) exactly once and propagate close()? What breaks otherwise?
How does Codec.forName("Logging") prove your registration worked? What exception do you get if it didn't, and what are the three usual causes?
Which Codec method does the k-NN plugin override to inject faiss/lucene vector storage, and how is that the same shape as your LoggingCodec?
What is the OpenSearch index.codec setting, and which core class resolves a codec name to a Codec?
Why should you almost never hand-roll a new binary format (vs wrapping) for a real index?

When Codec.forName("Logging") resolves, your wrapper logs during a write, and the index round-trips, you understand the codec SPI well enough to read k-NN's codec. Next, build a real vector index and benchmark it in Lab L3: Build an HNSW Graph from Scratch.

Lab L3: Build an HNSW Graph from Scratch

Background

You have read how HNSW works — layers, greedy search, the M/efConstruction/efSearch knobs — and how the SIMD kernel makes each distance fast. Now you will build it: a standalone Java program that indexes float vectors into a Lucene index using Lucene99HnswVectorsFormat, runs a KnnFloatVectorQuery, and measures recall@k against a brute-force exact scan. Then you will sweep M, efConstruction, and efSearch and tabulate how recall and latency move — turning the abstract trade-off table from the chapter into numbers you produced.

This is the lab that makes ANN concrete. "Approximate" stops being a word and becomes a measured recall of 0.94 that you can push to 0.99 by raising efSearch at a latency cost you can read off a table. At the end you map every knob to its OpenSearch k-NN equivalent so you can tune a real knn_vector field with intuition instead of guesswork.

Note: This uses the Lucene HNSW classes directly — exactly what OpenSearch's k-NN lucene engine wraps. The recall/latency behavior you measure here is the behavior of a lucene-engine knn_vector field; only the configuration surface differs. See k-NN algorithms.

Why This Lab Matters for Contributors

Recall@k vs latency vs memory is the vector-search trade-off. Measuring it yourself builds the intuition every k-NN issue, benchmark, and tuning question depends on.
You will see directly why M/efConstruction are index-time (graph-baked) and efSearch is query-time — because you change one and must rebuild, the other you change per query.
Writing the brute-force baseline teaches you what "ground truth" means for ANN evaluation — the thing every benchmark (k-NN #2595, ann-benchmarks) compares against.
The exact configuration you set on Lucene99HnswVectorsFormat is what k-NN's lucene engine sets under the hood; this lab demystifies that engine.

Prerequisites

Lab L1 (build/inspect an index) and the HNSW chapter.

JDK 21; the lucene-core jar from your apache/lucene checkout:

CORE=$(find /path/to/apache/lucene -name "lucene-core-*.jar" | grep -v sources | head -1)
echo "$CORE"

Run everything with --add-modules jdk.incubator.vector so distance scoring is SIMD-accelerated.

Step-by-Step Tasks

Step 1: The plan

flowchart LR
    Gen["generate N random vectors (dim d)"] --> Index["index into Lucene<br/>Lucene99HnswVectorsFormat(M, efConstruction)"]
    Gen --> Brute["brute-force exact top-k<br/>(ground truth)"]
    Index --> Query["KnnFloatVectorQuery(k, efSearch)"]
    Query --> Compare["recall@k = overlap / k"]
    Brute --> Compare
    Compare --> Sweep["sweep M / efConstruction / efSearch -> table"]

You need four pieces: a vector generator, an HNSW indexer with tunable M/efConstruction, a brute-force scorer for ground truth, and a recall+latency harness that sweeps parameters.

Step 2: Configure the HNSW vector format with `M` and `efConstruction`

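To set M and efConstruction you supply a codec that returns a Lucene99HnswVectorsFormat(maxConn, beamWidth) for the vector field. This codec is not registered through SPI, so it must keep a codec name Lucene can already resolve when the index is reopened: each segment records its codec name, and DirectoryReader.open resolves that name with Codec.forName. Subclassing the versioned default codec and overriding its per-field hook keeps the stock name; it is the per-field variant of the knnVectorsFormat() override from Lab L2:

import org.apache.lucene.codecs.KnnVectorsFormat;
// Use the default codec class for YOUR Lucene version (see Lab L2 Prerequisites); Lucene101Codec is an example.
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;

/** A codec that pins the HNSW build parameters M (maxConn) and efConstruction (beamWidth). */
public class HnswCodec extends Lucene101Codec {
    private final KnnVectorsFormat vectorsFormat;

    public HnswCodec(int m, int efConstruction) {
        // The codec name stays the stock one, so the index reopens without any SPI registration.
        // maxConn == M, beamWidth == efConstruction. These are baked into the graph at index time.
        this.vectorsFormat = new Lucene99HnswVectorsFormat(m, efConstruction);
    }

    @Override
    public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return vectorsFormat;   // every vector field gets the tuned HNSW format
    }
}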

Note: Lucene99HnswVectorsFormat(maxConn, beamWidth) is the full-precision (float32) HNSW format. To experiment with quantization, swap in Lucene99HnswScalarQuantizedVectorsFormat (int8) or a Lucene104HnswScalarQuantizedVectorsFormat (1/2/4/7/8-bit) and watch recall and .vec size change — a great stretch goal that mirrors OpenSearch's quantization modes.
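To build intuition for what quantization trades away before you swap formats, the sketch below quantizes a float vector to one byte per dimension and measures the round-trip error. It is a deliberately naive illustration, not Lucene's quantizer: ScalarQuantSketch is a hypothetical name, and the fixed [min, max] bounds and linear mapping are assumptions made for clarity.

```java
final class ScalarQuantSketch {

    /** Linearly maps each value in [min, max] to one of 256 byte codes; values outside are clamped. */
    static byte[] quantize(float[] v, float min, float max) {
        float scale = 255f / (max - min);
        byte[] out = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            float clamped = Math.max(min, Math.min(max, v[i]));
            out[i] = (byte) (Math.round((clamped - min) * scale) - 128);
        }
        return out;
    }

    /** Inverse mapping: recovers the center of the bucket each code stands for. */
    static float[] dequantize(byte[] q, float min, float max) {
        float step = (max - min) / 255f;
        float[] out = new float[q.length];
        for (int i = 0; i < q.length; i++) {
            out[i] = (q[i] + 128) * step + min;
        }
        return out;
    }

    /** Largest per-dimension difference between two vectors of equal length. */
    static float maxAbsError(float[] a, float[] b) {
        float worst = 0f;
        for (int i = 0; i < a.length; i++) {
            worst = Math.max(worst, Math.abs(a[i] - b[i]));
        }
        return worst;
    }

    public static void main(String[] args) {
        float[] v = { -1f, -0.5f, 0f, 0.25f, 1f };
        byte[] q = quantize(v, -1f, 1f);
        float[] back = dequantize(q, -1f, 1f);
        System.out.println("bytes per vector: " + (v.length * Float.BYTES) + " -> " + q.length);
        System.out.println("max abs round-trip error: " + maxAbsError(v, back));
    }
}
```

Storage drops from four bytes per dimension to one, and the round-trip error is bounded by half a quantization step. That small per-dimension error is what shows up as a recall change when you rerun the sweep with a quantized format.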

Step 3: Index vectors, with `M`/`efConstruction` fixed at build time

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.Random;

public class HnswLab {

    static final int N = 20_000;     // corpus size
    static final int DIM = 64;       // dimensions (small for speed; try 768 for realism)
    static final int K = 10;         // top-k

    /** Deterministic random vectors so runs are comparable. */
    static float[][] randomVectors(int n, int dim, long seed) {
        Random r = new Random(seed);
        float[][] vs = new float[n][dim];
        for (int i = 0; i < n; i++)
            for (int d = 0; d < dim; d++)
                vs[i][d] = r.nextFloat() * 2 - 1;   // [-1, 1)
        return vs;
    }

    static Directory buildIndex(float[][] vectors, int m, int efConstruction) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("hnsw-index-" + m + "-" + efConstruction));
        IndexWriterConfig cfg = new IndexWriterConfig();
        // CREATE replaces any index left by an earlier run; the default (CREATE_OR_APPEND)
        // would append a second copy of the corpus and skew recall and latency.
        cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        cfg.setCodec(new HnswCodec(m, efConstruction));      // <-- M & efConstruction here
        try (IndexWriter w = new IndexWriter(dir, cfg)) {
            for (int i = 0; i < vectors.length; i++) {
                Document doc = new Document();
                doc.add(new StoredField("id", i));
                doc.add(new KnnFloatVectorField("vec", vectors[i], VectorSimilarityFunction.EUCLIDEAN));
                w.addDocument(doc);
            }
            // One segment => one graph => clean, comparable measurements.
            w.forceMerge(1);
        }
        return dir;
    }
    // Steps 4-6 add the remaining methods and main() inside this class.
}

Step 4: Brute-force ground truth

The recall denominator is the true top-k from an exact scan. Compute it once per query:

    /** Exact top-k by squared Euclidean distance. Returns the k nearest doc ids (corpus indexes), in no particular order. */
    static int[] bruteForceTopK(float[][] vectors, float[] query, int k) {
        // Running top-k kept in parallel arrays: bestDist[j] is the distance of candidate bestId[j].
        // Squared distance ranks vectors exactly like Euclidean distance, so no sqrt is needed.
        double[] bestDist = new double[k];
        int[] bestId = new int[k];
        java.util.Arrays.fill(bestDist, Double.MAX_VALUE);
        java.util.Arrays.fill(bestId, -1);
        for (int i = 0; i < vectors.length; i++) {
            double dist = 0;
            float[] v = vectors[i];
            for (int d = 0; d < query.length; d++) {
                double diff = v[d] - query[d];
                dist += diff * diff;
            }
            // Find the current worst (largest-distance) slot and replace it if this vector is closer.
            int worst = 0;
            for (int j = 1; j < k; j++) if (bestDist[j] > bestDist[worst]) worst = j;
            if (dist < bestDist[worst]) { bestDist[worst] = dist; bestId[worst] = i; }
        }
        return bestId;
    }
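
The worst-slot scan above costs O(N·k), which is fine for k = 10. If you raise k, a bounded max-heap brings the cost to O(N log k). The following is a minimal standalone sketch of that variant; HeapTopK is a hypothetical class name and is not part of the lab's HnswLab class.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class HeapTopK {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the indexes of the k vectors nearest to query, nearest first. */
    static int[] topK(float[][] vectors, float[] query, int k) {
        if (k <= 0) {
            return new int[0];
        }
        // Max-heap on distance: the root is the worst of the current best k, so it is the one to evict.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>((a, b) -> Double.compare(b[0], a[0]));   // entries are [distance, id]
        for (int i = 0; i < vectors.length; i++) {
            double dist = squaredDistance(vectors[i], query);
            if (heap.size() < k) {
                heap.add(new double[] { dist, i });
            } else if (dist < heap.peek()[0]) {
                heap.poll();
                heap.add(new double[] { dist, i });
            }
        }
        // The heap pops farthest first, so fill the result from the back.
        int[] ids = new int[heap.size()];
        for (int j = ids.length - 1; j >= 0; j--) {
            ids[j] = (int) heap.poll()[1];
        }
        return ids;
    }

    public static void main(String[] args) {
        float[][] corpus = { { 0f }, { 10f }, { 1f }, { 5f }, { -2f } };
        System.out.println(Arrays.toString(topK(corpus, new float[] { 0f }, 3)));   // prints [0, 2, 4]
    }
}
```

Either version yields the same set of ids, which is all recall needs; the heap version additionally returns them sorted by distance.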

Step 5: ANN query with `efSearch` and recall measurement

KnnFloatVectorQuery's third arg is k. To raise the search beam (efSearch) beyond k, request a larger candidate set and keep the top k — Lucene's collector expands the beam to satisfy the requested count, so querying for efSearch candidates and trimming to k is the practical lever from the reader API:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

    /** Run an ANN query; return the doc ids of the top-k (mapped via the stored "id"). */
    static int[] annTopK(IndexSearcher searcher, DirectoryReader reader,
                         float[] query, int k, int efSearch) throws Exception {
        // Request efSearch candidates (>= k) to widen the beam, then keep the best k.
        KnnFloatVectorQuery q = new KnnFloatVectorQuery("vec", query, Math.max(k, efSearch));
        TopDocs td = searcher.search(q, Math.max(k, efSearch));
        int[] ids = new int[Math.min(k, td.scoreDocs.length)];
        org.apache.lucene.index.StoredFields sf = reader.storedFields();
        for (int i = 0; i < ids.length; i++) {
            Document d = sf.document(td.scoreDocs[i].doc);
            ids[i] = d.getField("id").numericValue().intValue();
        }
        return ids;
    }

    /** recall@k = |annTopK ∩ trueTopK| / k. */
    static double recall(int[] ann, int[] truth) {
        java.util.Set<Integer> t = new java.util.HashSet<>();
        for (int id : truth) if (id >= 0) t.add(id);
        int hit = 0;
        for (int id : ann) if (t.contains(id)) hit++;
        return (double) hit / t.size();
    }
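
A worked example makes the metric tangible. The sketch below recomputes recall@k on hand-written id lists and then averages over two queries, which is what the sweep harness reports per row. RecallExample and its sample ids are hypothetical; the formula is the one above.

```java
import java.util.HashSet;
import java.util.Set;

final class RecallExample {

    /** recall@k = |ann ∩ truth| / |truth| for one query. */
    static double recallAtK(int[] ann, int[] truth) {
        Set<Integer> t = new HashSet<>();
        for (int id : truth) {
            t.add(id);
        }
        int hit = 0;
        for (int id : ann) {
            if (t.contains(id)) {
                hit++;
            }
        }
        return (double) hit / t.size();
    }

    /** Mean recall over a batch of queries. */
    static double meanRecall(int[][] ann, int[][] truth) {
        double sum = 0;
        for (int q = 0; q < ann.length; q++) {
            sum += recallAtK(ann[q], truth[q]);
        }
        return sum / ann.length;
    }

    public static void main(String[] args) {
        // Query 0: ANN found 3 of the 5 true neighbors (ids 3, 7 and 4).
        int[][] ann   = { { 3, 7, 1, 9, 4 }, { 1, 2, 3, 4, 5 } };
        int[][] truth = { { 7, 3, 2, 4, 8 }, { 5, 4, 3, 2, 1 } };
        System.out.println("query 0 recall@5 = " + recallAtK(ann[0], truth[0]));
        System.out.println("query 1 recall@5 = " + recallAtK(ann[1], truth[1]));
        System.out.println("mean recall@5    = " + meanRecall(ann, truth));
    }
}
```

Order inside each list does not matter: query 1 returns the true neighbors in a different order and still scores 1.0, because recall compares sets.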

Step 6: The sweep harness — `main`

Tie it together: build one index per (M, efConstruction), then for a set of query vectors measure recall and average latency across efSearch values:

    public static void main(String[] args) throws Exception {
        float[][] corpus = randomVectors(N, DIM, 1L);
        float[][] queries = randomVectors(200, DIM, 999L);    // disjoint seed = held-out queries

        // Ground truth once (independent of HNSW params).
        int[][] truth = new int[queries.length][];
        for (int qi = 0; qi < queries.length; qi++) truth[qi] = bruteForceTopK(corpus, queries[qi], K);

        int[] Ms          = { 8, 16, 32 };
        int[] efConstrs   = { 64, 128 };
        int[] efSearches  = { 10, 50, 100, 200 };

        System.out.printf("%-4s %-6s %-6s %-9s %-12s%n", "M", "efC", "efS", "recall@" + K, "avg_us/query");
        for (int m : Ms) {
            for (int efc : efConstrs) {
                Directory dir = buildIndex(corpus, m, efc);                 // index-time params
                try (DirectoryReader reader = DirectoryReader.open(dir)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    for (int efs : efSearches) {                            // query-time param
                        double recallSum = 0; long nanos = 0;
                        for (int qi = 0; qi < queries.length; qi++) {
                            long t0 = System.nanoTime();
                            int[] ann = annTopK(searcher, reader, queries[qi], K, efs);
                            nanos += System.nanoTime() - t0;
                            recallSum += recall(ann, truth[qi]);
                        }
                        System.out.printf("%-4d %-6d %-6d %-9.4f %-12.1f%n",
                            m, efc, efs, recallSum / queries.length, (nanos / 1000.0) / queries.length);
                    }
                }
            }
        }
    }
}

Compile and run (all the source in one dir):

javac -cp "$CORE" HnswCodec.java HnswLab.java
java  -cp "$CORE:." --add-modules jdk.incubator.vector -Xmx2g HnswLab

Step 7: Read the table

You will get something shaped like this (your exact numbers vary by machine and seed):

M    efC    efS    recall@10 avg_us/query
8    64     10     0.8120    41.3
8    64     50     0.9460    88.7
8    64     100    0.9710    142.5
8    64     200    0.9850    240.1
16   128    10     0.8740    55.0
16   128    50     0.9760    121.2
16   128    100    0.9900    188.6
16   128    200    0.9960    310.4
32   128    100    0.9950    250.9
32   128    200    0.9985    402.7

Read it like an engineer:

efSearch ↑ → recall ↑ and latency ↑, at fixed graph. The cheapest recall lever; it is query-time, so tune it per query.
M ↑ / efConstruction ↑ → higher recall ceiling and a better-connected graph, but slower indexing (you will notice the build step taking longer) and more memory (bigger .vex). These are baked in: to change them you must rebuild the index.
There are diminishing returns: going efSearch 100→200 buys little recall for ~1.7× the latency. The art is finding the knee.
Confirm efSearch raises recall monotonically at fixed M/efConstruction.
Confirm a larger M/efConstruction raises the recall ceiling and slows the build.
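
One way to locate the knee is to compute the marginal recall gained per extra microsecond between consecutive efSearch rows. The sketch below does that for the four M=8 sample rows shown above; KneeFinder is a hypothetical name, and the 0.0003 recall-per-microsecond threshold is an arbitrary assumption you would replace with your own latency budget.

```java
final class KneeFinder {

    /** Recall gained per extra microsecond between each pair of consecutive rows. */
    static double[] marginalGain(double[] recall, double[] latencyUs) {
        double[] gain = new double[recall.length - 1];
        for (int i = 1; i < recall.length; i++) {
            gain[i - 1] = (recall[i] - recall[i - 1]) / (latencyUs[i] - latencyUs[i - 1]);
        }
        return gain;
    }

    /** Index of the last row still worth stepping up to, given a minimum gain per microsecond. */
    static int knee(double[] recall, double[] latencyUs, double minGainPerUs) {
        double[] gain = marginalGain(recall, latencyUs);
        int knee = 0;
        for (int i = 0; i < gain.length; i++) {
            if (gain[i] < minGainPerUs) {
                break;
            }
            knee = i + 1;
        }
        return knee;
    }

    public static void main(String[] args) {
        // The M=8, efConstruction=64 rows of the sample table (efSearch 10, 50, 100, 200).
        int[] efSearch     = { 10, 50, 100, 200 };
        double[] recall    = { 0.8120, 0.9460, 0.9710, 0.9850 };
        double[] latencyUs = { 41.3, 88.7, 142.5, 240.1 };
        double[] gain = marginalGain(recall, latencyUs);
        for (int i = 0; i < gain.length; i++) {
            System.out.printf("efSearch %d -> %d: %.6f recall per us%n", efSearch[i], efSearch[i + 1], gain[i]);
        }
        System.out.println("knee at efSearch = " + efSearch[knee(recall, latencyUs, 0.0003)]);
    }
}
```

On the sample rows the gain shrinks at every step, and with this threshold the knee lands at efSearch = 100. Run it on your own table; the shape matters more than the exact threshold.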

Inspect .vec/.vex sizes across M (Lab L1 technique) — .vex grows with M:

for d in hnsw-index-*; do echo "$d:"; find "$d" -name '*.vex' -exec ls -la {} \; ; done

Step 8: Map every knob to OpenSearch k-NN (lucene engine)

The whole point: these are the same parameters a real knn_vector field exposes.

This lab (Lucene)	OpenSearch k-NN (lucene engine)	When set
`Lucene99HnswVectorsFormat(m, …)` → `M`	`method.parameters.m`	Index time (mapping)
`Lucene99HnswVectorsFormat(…, beamWidth)` → `efConstruction`	`method.parameters.ef_construction`	Index time (mapping)
`KnnFloatVectorQuery` beam → `efSearch`	`method_parameters.ef_search` in the `knn` query (recent k-NN versions); otherwise the lucene engine uses `k` as the beam	Query time
`VectorSimilarityFunction.EUCLIDEAN`	`space_type: l2`	Index time
`KnnFloatVectorField`	`knn_vector` with `engine: lucene`	Index time
recall@k vs brute force	what k-NN benchmarks (#2595) measure	—

The equivalent OpenSearch mapping covers the index-time knobs. With the lucene engine the search beam is a query-time value, so ef_search is left out of the mapping:

PUT /vectors
{
  "settings": { "index.knn": true },
  "mappings": { "properties": { "vec": {
    "type": "knn_vector", "dimension": 64, "space_type": "l2",
    "method": { "name": "hnsw", "engine": "lucene",
                "parameters": { "m": 16, "ef_construction": 128 } }
  }}}
}

Your table is the recall/latency curve that index would exhibit. Deepen the algorithm comparison (HNSW vs IVF vs PQ, and why faiss adds IVF) in k-NN algorithms, and benchmark a real cluster in Lab K6.

Implementation Requirements / Deliverables

HnswCodec pinning M/efConstruction via Lucene99HnswVectorsFormat.
An indexer that builds one forceMerge(1) segment per (M, efConstruction).
A brute-force exact top-k baseline (ground truth).
An ANN query path using KnnFloatVectorQuery with a tunable efSearch.
A sweep harness producing a recall@k + latency table across M, efConstruction, efSearch.
Observations: efSearch ↑ → recall ↑/latency ↑; bigger M/efConstruction → higher ceiling, slower build, bigger .vex; diminishing returns identified.
A mapping table from each Lucene knob to its OpenSearch k-NN method.parameters equivalent.

Troubleshooting

Symptom	Likely cause	Fix
Recall is ~1.0 everywhere	Corpus too small / `efSearch` already covers it	Raise `N` (e.g. 100k) and `DIM` (e.g. 256/768)
Recall is very low even at high `efSearch`	Similarity mismatch (query vs field) or wrong ground-truth metric	Use the same metric (L2) for brute force and the field
`OutOfMemoryError` during build	Vectors + graph in heap; big `M`/`N`	`-Xmx4g`; smaller `N`/`DIM`; fewer concurrent merges
Latency numbers are noisy/huge on first rows	No JIT warmup	Add a warmup pass before timing; or use JMH
`ClassNotFoundException: Lucene99HnswVectorsFormat`	Wrong/old `lucene-core` jar	Re-resolve `$CORE` from the built checkout
`.vex` size doesn't change with `M`	You compared the same index dir	Each `(M, efC)` writes its own `hnsw-index-*` dir
Slow build at high `efConstruction`	Expected — graph construction cost	That is the index-time cost of recall; note it
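
The warmup fix from the table can be a few lines. The sketch below runs a task untimed before timing it, so the JIT has compiled the hot path by the time measurement starts. WarmupTimer is a hypothetical helper, not part of the lab harness, and for anything you intend to publish JMH remains the better tool.

```java
final class WarmupTimer {

    /** Runs task warmup times untimed, then iters times timed; returns mean microseconds per timed run. */
    static double meanMicros(Runnable task, int warmup, int iters) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            task.run();
        }
        return (System.nanoTime() - t0) / 1000.0 / iters;
    }

    public static void main(String[] args) {
        // A stand-in workload; in the lab this would be one annTopK(...) call per run.
        double[] sink = new double[1];
        Runnable work = () -> {
            double acc = 0;
            for (int i = 1; i <= 10_000; i++) {
                acc += Math.sqrt(i);
            }
            sink[0] = acc;   // keep the result live so the loop is not optimized away
        };
        System.out.printf("cold: %.1f us/run%n", meanMicros(work, 0, 50));
        System.out.printf("warm: %.1f us/run%n", meanMicros(work, 2_000, 50));
    }
}
```

In the harness, the equivalent change is to loop over the queries once per (M, efConstruction, efSearch) combination before starting the timed pass.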

Validation / Self-check

Define recall@k for ANN. What is the denominator, and how did you compute the ground truth?
Which parameters did you set at index time and which at query time? Show where each appears in your code, and explain why changing M forced a rebuild but changing efSearch did not.
From your table, describe the recall/latency effect of doubling efSearch at fixed M. Where is the knee of diminishing returns?
What does raising M/efConstruction buy and cost? Which file grows with M, and how did you verify it?
Why does the brute-force scan have to use the same distance metric as the indexed field for the recall number to mean anything?
Map each of your knobs (M, efConstruction, efSearch, similarity) to the exact OpenSearch knn_vector method.parameters/space_type field.
If a teammate reports "k-NN recall dropped after we lowered ef_search to cut p99 latency," explain the trade-off they hit using your data.

When your table shows recall climbing with efSearch and you can name the OpenSearch equivalent of every knob, you understand ANN tuning from the inside. Next, learn to give that knowledge back: Lab L4: Contribute to Apache Lucene.

Lab L4: Contribute to Apache Lucene

Background

The previous three labs cracked open a Lucene index (L1), wrote a custom codec (L2), and built an HNSW graph from scratch (L3). You now know Lucene from the inside. This lab takes the obvious next step: contributing a fix upstream, to apache/lucene itself — the Apache Software Foundation project that OpenSearch and the k-NN lucene engine are built on.

This is a different game from contributing to OpenSearch. Different repo, different governance (Apache, not a single-company-led project), a CHANGES.txt instead of a changelog bot, a developer mailing list that still matters, and a release cadence you do not control. But the engineering muscle is the same one PR quality trains: a small, well-scoped, well-tested change, explained well, that a maintainer can say yes to quickly.

Why an OpenSearch contributor should care. OpenSearch bundles a specific Lucene version. A fix you land in apache/lucene does not help OpenSearch the day it merges — it flows in on the next Lucene upgrade inside OpenSearch (a recurring, high-skill core task; you can watch it happen by searching the OpenSearch repo for Upgrade to Lucene). More striking: a number of the vector and quantization features that OpenSearch's k-NN lucene engine relies on landed in Lucene first — int8 scalar quantization, the scalar-quantized HNSW formats, faster graph merging. If you want to shape vector search at its root, the root is Lucene. This lab is how you reach it.

Note: This is a real contribution lab. You will clone Lucene, build it, run its tests with a reproducible seed, and walk the exact shape of a PR — CHANGES.txt entry, the JIRA/GitHub history, the mailing list. The walked example (a small VectorUtil/vector-format improvement) is illustrative — a realistic shape, not a claim that one specific line exists. You will grep to find the real site before you touch anything.

What you will do

Clone apache/lucene and build it with Gradle (JDK 21).
Run the full check, a single module's tests, and reproduce a test with a fixed seed.
Run tidy so your change is format-clean before you ever push.
Find a real, small improvement site in VectorUtil / a vector format by grep.
Stage the change as an illustrative diff with a test.
Write the CHANGES.txt entry and the PR the Apache way.
Trace how that fix would reach OpenSearch on the next Lucene upgrade.

Step 1 — Clone and build

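Lucene moved from Ant to Gradle. The 10.x line targets JDK 21 (the same JDK as OpenSearch and k-NN, which is not a coincidence: they move together), while the main branch can require a newer JDK, so check the README in your checkout for the exact minimum. Build it once cold so the toolchain is warm before you care about a diff.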

mkdir -p ~/src/oss-repos && cd ~/src/oss-repos
git clone https://github.com/apache/lucene.git
cd lucene
git log --oneline -1            # note the HEAD you started from
java -version                   # 21.x for the 10.x line; main may need newer. Lucene's Gradle will complain on a mismatch

# Cold build + the full validation suite (slow the first time — go get coffee).
./gradlew check

./gradlew check is the gate: it compiles, runs tests, and runs the static/format checks. It is the "is my tree even sane" command, and you run it before and after your change.

Warning: the first ./gradlew check downloads a toolchain and a lot of dependencies and runs the entire test suite — it can take a long while and a lot of RAM. For day-to-day iteration you do not run the whole thing; you run one module (next step). Run the full check before you push.

# Sanity-inspect an index visually with Luke (Lucene's index inspector) — the GUI cousin of Lab L1.
./gradlew :lucene:luke:run

Step 2 — The iteration loop: module tests, seeds, tidy

You will spend almost all your time in three commands. Learn them cold.

Run one module's tests. Vector code lives in lucene-core, so:

./gradlew :lucene:core:test

This is your inner loop — seconds-to-minutes, not the whole suite.

Run one test class or method. Far tighter:

./gradlew :lucene:core:test --tests "org.apache.lucene.util.TestVectorUtil"
./gradlew :lucene:core:test --tests "org.apache.lucene.util.TestVectorUtil.testDotProduct*"

Reproduce a failure with a fixed seed. Lucene's tests are randomized (the RandomizedTesting framework). A failure prints a seed; you reproduce it exactly by passing it back. This is the single most important Lucene-testing skill, because "it passed on my machine" means nothing when the seed differs.

# A failing run prints something like:  -Ptests.seed=DEADBEEFCAFE
./gradlew :lucene:core:test --tests "org.apache.lucene.util.TestVectorUtil" -Ptests.seed=DEADBEEFCAFE

# Hammer a flaky test across many random seeds to flush out rare failures:
./gradlew :lucene:core:test --tests "org.apache.lucene.util.TestVectorUtil" -Ptests.iters=50
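Why a seed is enough to replay a failure can be shown without Lucene at all. The sketch below is a minimal illustration, assuming plain java.util.Random as a stand-in for the RandomizedTesting framework; SeedSketch and randomVector are hypothetical names, not Lucene APIs.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: java.util.Random stands in for the RandomizedTesting
// framework; SeedSketch is not a Lucene class.
final class SeedSketch {

  // Every "random" test input is derived from the seed, so the same seed
  // always rebuilds the same input (length and contents).
  static float[] randomVector(long seed, int maxDim) {
    Random rnd = new Random(seed);
    float[] v = new float[1 + rnd.nextInt(maxDim)];
    for (int i = 0; i < v.length; i++) {
      v[i] = rnd.nextFloat();
    }
    return v;
  }

  public static void main(String[] args) {
    long seed = Long.parseUnsignedLong("DEADBEEFCAFE", 16);
    float[] first = randomVector(seed, 16);
    float[] replay = randomVector(seed, 16);
    System.out.println("same seed, same input: " + Arrays.equals(first, replay));
  }
}
```

A failure that depends on one particular random input is therefore fully described by its seed, which is why the seed belongs in every bug report.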

Make it format-clean. Lucene enforces formatting; an unformatted diff fails check. Run tidy before you push so the formatter, not the reviewer, fixes your whitespace:

./gradlew tidy
git diff            # tidy may have reformatted — review what it touched

Note: -Ptests.seed=... is also how a reviewer reproduces the failure your test catches. When you report a bug, include the seed. When you fix one, your new test should fail on the old code and pass on the new — independent of seed, ideally, or with the triggering seed pinned.

Step 3 — Find a real improvement site (grep, do not guess)

The walked example below is a small numeric/vector-format improvement — the kind of contained change that is realistic for a first upstream PR and squarely in the vector territory this curriculum cares about. Find the real site yourself; do not trust a line number printed in a lab.

# The SIMD-backed vector kernels — dot product, squared distance, cosine.
# This is the same VectorUtil / VectorizationProvider machinery from the SIMD chapter
# (../../lucene/simd-and-the-vector-api.md).
grep -rn "class VectorUtil\|dotProduct\|squareDistance\|cosine\|VectorUtilSupport" \
  lucene/core/src/java/org/apache/lucene/util/ | head -20

# The vector formats — HNSW + scalar quantization (the features that landed in Lucene first).
grep -rln "HnswVectorsFormat\|ScalarQuantizedVectorsFormat\|ScalarQuantizer" \
  lucene/core/src/java/org/apache/lucene/codecs/ | head

# The scalar quantizer itself — bounds, error handling, edge cases are fertile ground.
grep -rn "class ScalarQuantizer\|confidenceInterval\|quantize\|MINIMUM_MAGNITUDE" \
  lucene/core/src/java/org/apache/lucene/util/quantization/ 2>/dev/null | head

Good first contributions in this area tend to be one of:

A missing bounds/precondition check (a method that returns NaN or silently misbehaves on a zero-length or mismatched-dimension vector).
A test gap (an edge case the existing tests do not cover — the easiest, most welcome first PR).
A small clarity/correctness fix in an error message or an off-by-one in a fallback path.
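As an example of the first category, here is a minimal sketch of how a distance helper can quietly return NaN. It assumes a naive scalar cosine written for this illustration (ZeroVectorSketch is hypothetical and is not Lucene's VectorUtil); whether any real Lucene method behaves this way is something to confirm in your checkout.

```java
// Hypothetical scalar helper for illustration; not Lucene's VectorUtil.
final class ZeroVectorSketch {

  static float cosine(float[] a, float[] b) {
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    // With an all-zero or empty input this is 0 / 0, which is NaN in IEEE 754.
    return (float) (dot / Math.sqrt((double) normA * normB));
  }

  public static void main(String[] args) {
    float[] zero = {0f, 0f, 0f};
    float[] v = {1f, 2f, 3f};
    float result = cosine(zero, v);
    System.out.println("cosine(zero, v) is NaN: " + Float.isNaN(result));
    System.out.println("cosine(empty, empty) is NaN: "
        + Float.isNaN(cosine(new float[0], new float[0])));
    // NaN compares false against everything, so a "score > best" loop
    // silently skips this vector instead of failing.
    System.out.println("NaN > 0.5f: " + (result > 0.5f));
  }
}
```

A test that feeds such an input and asserts a defined outcome (an exception or a documented value) is the kind of small, defensible change this list describes.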

Warning: do not open your first Lucene PR as a sweeping refactor of VectorUtil or a new SIMD kernel. Those touch performance-critical, heavily-benchmarked code and invite long review. Start with a test gap or a precondition. Earn the context first — exactly the PR quality discipline, applied to a project where you have less standing than you do in OpenSearch.

Step 4 — The illustrative change

Suppose your grep found that a VectorUtil distance helper does not reject mismatched-dimension inputs early — it would read past one array or produce a meaningless result instead of failing loudly. (Confirm against your checkout; the real method name and whether the check already exists vary by version.) A defensive precondition plus a test is a clean, contained contribution.

// ILLUSTRATIVE shape — grep to find the real method/signature in YOUR checkout first.
// In org.apache.lucene.util.VectorUtil (or its support class):

public static float dotProduct(float[] a, float[] b) {
  if (a.length != b.length) {
    throw new IllegalArgumentException(
        "vector dimensions differ: " + a.length + " != " + b.length);
  }
  return IMPL.dotProduct(a, b);
}

The test is the part that actually carries the PR. It must fail on the old code and pass on the new, and it should respect Lucene's randomized-testing conventions (extend LuceneTestCase, use the framework's RNG via random()).

// ILLUSTRATIVE — in TestVectorUtil (org.apache.lucene.util):
public void testDotProductRejectsMismatchedDimensions() {
  float[] a = new float[atLeast(1)];
  float[] b = new float[a.length + 1];          // deliberately different dimension
  expectThrows(IllegalArgumentException.class, () -> VectorUtil.dotProduct(a, b));
}
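To see what the precondition buys without a Lucene checkout, the following dependency-free sketch contrasts an unchecked scalar kernel with a guarded one. DimCheckSketch, uncheckedDot, and checkedDot are hypothetical stand-ins; the real kernels sit behind VectorUtil and may already carry this check.

```java
// Hypothetical scalar stand-ins; the real kernels live behind VectorUtil.
final class DimCheckSketch {

  // No precondition: loops over a.length and trusts b to match.
  static float uncheckedDot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Same kernel behind a fail-fast guard.
  static float checkedDot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "vector dimensions differ: " + a.length + " != " + b.length);
    }
    return uncheckedDot(a, b);
  }

  public static void main(String[] args) {
    float[] a = {1f, 2f};
    float[] longer = {3f, 4f, 100f};
    // Silent wrong answer: the third component of `longer` is ignored.
    System.out.println("unchecked: " + uncheckedDot(a, longer));
    try {
      checkedDot(a, longer);
    } catch (IllegalArgumentException e) {
      System.out.println("checked: " + e.getMessage());
    }
  }
}
```

The unchecked version returns a plausible-looking number when the second array is longer, and throws an unrelated ArrayIndexOutOfBoundsException when it is shorter. The guard turns both cases into one clear error.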

# Run just your new test, then the whole VectorUtil suite, then tidy.
./gradlew :lucene:core:test --tests "*TestVectorUtil.testDotProductRejectsMismatchedDimensions"
./gradlew :lucene:core:test --tests "*TestVectorUtil"
./gradlew tidy

Note: if the precondition turns out to already exist, that is a successful investigation, not a failure — you learned the code is more careful than you guessed. Pivot to a genuine gap your grep surfaced (a format edge case, an untested space type, a quantization bound). The skill being trained is finding the real, small, defensible change, not forcing a predetermined one.

Step 5 — `CHANGES.txt` and the contribution conventions

This is where Lucene differs visibly from OpenSearch. Three artifacts matter.

1. CHANGES.txt. Lucene tracks user-facing changes in a hand-edited CHANGES.txt under the lucene/ directory of the repo (lucene/CHANGES.txt), grouped by version and by category (API Changes, New Features, Improvements, Optimizations, Bug Fixes). You add a short entry under the unreleased version, in the right category, crediting yourself.

head -60 lucene/CHANGES.txt          # see the current unreleased section + the format
grep -n "Bug Fixes\|Improvements\|Optimizations\|API Changes" lucene/CHANGES.txt | head

* GITHUB#NNNNN: VectorUtil.dotProduct now rejects vectors of mismatched
  dimension with a clear IllegalArgumentException.  (Your Name)

2. The issue/PR history: JIRA, then GitHub. Lucene's older history lives in Apache JIRA as LUCENE-NNNN issues, which is why you will see LUCENE-10577 (the origin of int8 scalar quantization) and other LUCENE-NNNN references throughout the code and CHANGES.txt. The Lucene project later moved issue tracking and review to GitHub issues and pull requests (you will see apache/lucene #11613, #12497 and similar). Today you open a GitHub PR and new CHANGES.txt entries use the GITHUB#NNNNN form; older entries reference the JIRA key. Cross-reference both when you cite prior art.

# How the codebase itself cites its history — JIRA keys and GitHub PR numbers side by side.
grep -n "LUCENE-[0-9]\+\|GITHUB#[0-9]\+" lucene/CHANGES.txt | head -20

Artifact	Where	Role
`LUCENE-NNNN`	Apache JIRA (historical)	the legacy issue tracker; many code comments + `CHANGES.txt` entries reference it
`apache/lucene #NNNNN`	GitHub	where PRs (and now issues) live; your contribution goes here
`CHANGES.txt`	`lucene/` directory of the repo	the human-curated changelog; your entry goes under the unreleased version
`dev@lucene.apache.org`	mailing list	where design discussion, release votes, and "is this welcome?" questions happen

3. The mailing list. Unlike a single-vendor project, an Apache project's center of gravity includes the dev@lucene.apache.org developer list. For anything non-trivial — a design question, "would a PR for X be welcome?", a release timing question — the list is the right venue, and a short, polite message there before a big PR is the Apache-native version of design via GitHub. Subscribe before you need it.

Step 6 — Open the PR the Apache way

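# Fork apache/lucene on GitHub. Your clone's origin still points at apache/lucene,
# which you cannot push to, so add the fork as its own remote (<your-user> is a placeholder):
git remote add fork https://github.com/<your-user>/lucene.git
git checkout -b vectorutil-dim-check
git add -p                                  # stage the precise hunks, nothing stray
git commit -m "Reject mismatched-dimension vectors in VectorUtil.dotProduct"
git push -u fork vectorutil-dim-check
gh pr create --repo apache/lucene --fill    # or open the PR in the browser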

A reviewable Lucene PR has:

A focused diff — one concern. (A dimension check; not a check plus a refactor plus a rename.)
A test that fails before and passes after.
A clean ./gradlew check (run the full one before you push) and a ./gradlew tidy-clean tree.
A CHANGES.txt entry in the right category.
A PR description that says what and why, links any LUCENE-NNNN/GitHub prior art, and notes the seed if a randomized test was involved.

This is precisely the PR quality checklist — the change is the easy part; the reviewability is the contribution. In a project where you have no standing yet, it matters even more.

Step 7 — How this reaches OpenSearch

Close the loop. Your Lucene fix is not in OpenSearch the moment it merges. The path is:

flowchart LR
    PR["your PR merges in apache/lucene"] --> REL["it ships in a Lucene release<br/>(e.g. 10.x.y)"]
    REL --> UP["OpenSearch bumps its bundled Lucene<br/>(the 'Upgrade to Lucene N' PR)"]
    UP --> OS["the fix is now in OpenSearch<br/>(and the k-NN lucene engine inherits it)"]

# See the upgrade task happen, repeatedly, in the OpenSearch repo.
gh search prs --repo opensearch-project/OpenSearch "Upgrade to Lucene" --limit 20
# Find which Lucene version a given OpenSearch checkout bundles.
grep -rn "lucene" ~/src/OpenSearch/buildSrc/version.properties 2>/dev/null \
  || grep -rn "lucene =" ~/src/OpenSearch/gradle/libs.versions.toml 2>/dev/null
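The question "does this OpenSearch checkout already carry my fix?" reduces to a version comparison. Below is a minimal sketch, assuming the version file contains a line of the form lucene = "X.Y.Z" (the shape the grep above looks for); LuceneBumpSketch, the sample catalog text, and the version numbers are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the catalog line format is an assumption.
final class LuceneBumpSketch {

  private static final Pattern LUCENE_LINE =
      Pattern.compile("^lucene\\s*=\\s*\"?(\\d+)\\.(\\d+)\\.(\\d+)", Pattern.MULTILINE);

  // Extracts {major, minor, patch} from version-file text, or null if absent.
  static int[] bundledLucene(String catalogText) {
    Matcher m = LUCENE_LINE.matcher(catalogText);
    if (!m.find()) {
      return null;
    }
    return new int[] {
      Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3))
    };
  }

  // True when the bundled version is at or past the release that carries the fix.
  static boolean includesFix(int[] bundled, int[] fixedIn) {
    for (int i = 0; i < 3; i++) {
      if (bundled[i] != fixedIn[i]) {
        return bundled[i] > fixedIn[i];
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Hypothetical file contents and versions, for illustration only.
    String catalog = "opensearch = \"3.0.0\"\nlucene = \"10.1.0\"\n";
    int[] bundled = bundledLucene(catalog);
    int[] fixedIn = {10, 2, 0};
    System.out.println("fix included: " + includesFix(bundled, fixedIn));
  }
}
```

The comparison assumes a single release line. A fix that was also backported to an older branch needs the release notes checked by hand, since a plain ordering would report it as missing there.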

This is why the work compounds. A vector or quantization improvement you land in Lucene — exactly the kind of feature that has historically landed in Lucene first (int8 SQ via LUCENE-10577 / apache/lucene #11613; the scalar-quantized vectors format via apache/lucene #12497) — eventually becomes a capability of OpenSearch's k-NN lucene engine, with zero additional OpenSearch code beyond the routine version bump. You contributed to one project and improved two.

Note: the inverse is the upgrade contributor's job: when OpenSearch bumps Lucene, something usually breaks (an API moved, a default changed, a format version bumped). Fixing those breakages is one of the highest-leverage recurring tasks in the OpenSearch repo, and it requires exactly the Lucene fluency these four labs built. The "Upgrade to Lucene" PRs are a standing source of real work.

Validation / Self-check

You are done when you can answer these without notes. Several are runnable; run them.

Build Lucene from a clean clone and run ./gradlew check. What does check actually gate (compile, tests, format)? Why do you run a single module's tests during iteration instead?
A test fails and prints a seed. Reproduce the failure exactly with one command. Why is the seed essential, and how do you hammer a flaky test across many seeds?
Run ./gradlew tidy. What did it change, and why must you run it before you push rather than letting the reviewer find the formatting?
By grep, name one real site in VectorUtil or a vector/quantization format where a small precondition, test, or clarity fix would be a defensible first PR — and say how you'd write a test that fails on the old code and passes on the new.
Where does a user-facing Lucene change get recorded, in which categories, and what is the difference between a LUCENE-NNNN reference and an apache/lucene #NNNNN reference?
What is dev@lucene.apache.org for, and when would you post there before opening a PR? Tie it to the OpenSearch design via GitHub habit.
Draw the path from "my Lucene PR merges" to "the fix is in OpenSearch." Name the intermediate release and the recurring OpenSearch PR, and run the gh search that proves the upgrade task is real. Why do some vector features reach you through Lucene first?

When you can do all seven, you have closed the loop these four Lucene labs opened: you can crack an index, write a codec, build a graph — and now push a fix into the library all of it rests on. Carry that fluency back to the HNSW chapter, the SIMD chapter, and every "Upgrade to Lucene" PR you will ever review.

Vector Search and the k-NN Plugin

A match query asks "which documents contain these words?" A vector query asks a different and stranger question: "which documents are closest to this point in a 1,024-dimensional space?" That single change — from matching tokens to measuring geometric distance — is what powers semantic search, retrieval-augmented generation (RAG), recommendations, image similarity, deduplication, and anomaly detection. The text "a small domestic feline" and the text "kitten" share almost no tokens, so BM25 scores them as unrelated. Run both through an embedding model and they land next to each other in vector space. Vector search finds that neighbor; lexical search never will.

OpenSearch does this through the k-NN plugin (opensearch-project/k-NN), a separate repository that adds a knn_vector field type, a knn query, three approximate-nearest-neighbor engines, and a native C++ layer reached over JNI. This section is the advanced, vector-and-scale layer of the curriculum. It assumes you have done the OpenSearch warm-up, understand the plugin architecture, and have read the Lucene chapter on HNSW vector search — because k-NN's Lucene engine is that HNSW format, and its native engines reimplement the same idea in C++.

Note: "k-NN" is k-nearest-neighbors. In OpenSearch the plugin name is literally k-NN (repo) / knn (the field type, query, and REST namespace). We use "k-NN" and "vector search" interchangeably.

Why vector search matters

Workload	The question	Why lexical search fails	What vectors do
Semantic search	"find docs about X"	synonyms, paraphrase, no shared tokens	embed query + docs, find nearest
RAG	"retrieve context for an LLM prompt"	the prompt rarely shares keywords with the answer	nearest-neighbor retrieval over chunk embeddings
Recommendations	"items like this item"	"like" is not a keyword	nearest neighbors in an item-embedding space
Image / audio similarity	"find similar media"	there are no tokens at all	embed media, ANN over the vectors
Deduplication / anomaly	"near-duplicate or outlier?"	edit distance misses semantic dupes	distance threshold (radial search)

The mechanism is always the same: an embedding model (often run outside OpenSearch, or inside it via the ml-commons plugin and a neural-search pipeline) turns text or media into a fixed-length float array — the vector. You index those vectors, then at query time embed the query the same way and ask k-NN for the k closest stored vectors by a distance metric (l2, cosinesimil, innerproduct, …). "Closest" is the whole game.

The hard part is scale. Computing the exact distance from your query to every one of 100 million stored vectors per search is too slow. So k-NN uses approximate nearest neighbor (ANN) algorithms — chiefly HNSW (a navigable small-world graph) and IVF (inverted-file clustering) — that trade a little recall for orders of magnitude less work. Everything downstream (engines, quantization, native memory, disk-based search) exists to make ANN fast, accurate, and affordable enough to run in production.
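For contrast with ANN, this is what the exact approach looks like: a minimal brute-force sketch, assuming squared L2 as the metric and a tiny in-memory array of vectors. ExactKnnSketch is a hypothetical name and is not k-NN plugin code.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of exact (brute-force) search; not k-NN plugin code.
final class ExactKnnSketch {

  static float squaredL2(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Returns the indices of the k stored vectors nearest to the query.
  // One distance per stored vector: O(N * dimension) work on every search.
  static int[] nearest(float[][] stored, float[] query, int k) {
    float[] dist = new float[stored.length];
    Integer[] ids = new Integer[stored.length];
    for (int i = 0; i < stored.length; i++) {
      dist[i] = squaredL2(stored[i], query);
      ids[i] = i;
    }
    Arrays.sort(ids, Comparator.comparingDouble((Integer i) -> dist[i]));
    int[] top = new int[Math.min(k, ids.length)];
    for (int i = 0; i < top.length; i++) {
      top[i] = ids[i];
    }
    return top;
  }

  public static void main(String[] args) {
    float[][] stored = {{0f, 0f}, {1f, 1f}, {5f, 5f}, {0.9f, 1.2f}};
    float[] query = {1f, 1f};
    System.out.println(Arrays.toString(nearest(stored, query, 2))); // prints [1, 3]
  }
}
```

The loop that fills dist is the part ANN avoids: HNSW and IVF visit only a small fraction of the stored vectors per query, at the cost of occasionally missing a true neighbor.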

How k-NN relates to core OpenSearch

The k-NN plugin is an ordinary OpenSearch Plugin — the same machinery from the plugin architecture deep dive, with two twists that make it unusual.

flowchart LR
    subgraph core["OpenSearch core (published artifacts)"]
        EP["extension interfaces<br/>MapperPlugin / SearchPlugin / ActionPlugin /<br/>EnginePlugin / ScriptPlugin / ..."]
    end
    subgraph knn["k-NN plugin repo (org.opensearch.knn)"]
        J["src/main/java<br/>field type, query, engines, codec"]
        N["jni/ (CMake C++)<br/>faiss + nmslib wrappers"]
    end
    J -->|implements| EP
    J <-->|JNI| N
    N -->|links| FAISS["faiss / nmslib<br/>(native ANN libraries)"]

It builds against published OpenSearch artifacts, not the core source tree. Like security, sql, and ml-commons, k-NN lives in its own repo with its own release cadence, version-locked to a specific OpenSearch version (opensearch.version in its Gradle build). Confirm the version it targets:
```
cd ~/src/k-NN
grep -n "opensearch.version" build.gradle    # e.g. opensearch_version = "3.6.0-SNAPSHOT"
```
It has a native build under jni/. Pure-Java plugins are the norm; k-NN is not pure Java. Under jni/ is a CMake project that compiles C++ glue (faiss_wrapper.cpp, nmslib_wrapper.cpp) and links the faiss and nmslib ANN libraries pulled in as git submodules under jni/external/. The Java side calls into it through org.opensearch.knn.jni.JNIService → FaissService / NmslibService. This is the only part of the OpenSearch ecosystem most contributors meet where a bug might be in C++, a memory leak might be off-heap, and a build failure might be CMake.
```
ls ~/src/k-NN/jni/                 # CMakeLists.txt, src/, external/, include/
ls ~/src/k-NN/jni/external/        # faiss  nmslib  (git submodules)
```

Everything else is recognizable plugin work: KNNVectorFieldMapper is a MapperPlugin contribution, KNNQueryBuilder is a SearchPlugin contribution, the warmup and stats endpoints are ActionPlugin contributions. If you have built an in-repo plugin, you already know 80% of the shape — the native layer is the new 20%.

The three engines at a glance

The engine mapping parameter on a knn_vector field picks which ANN implementation builds and searches the graph. There are three, and choosing among them is the single most consequential modeling decision in vector search.

Engine	Status	Algorithms	Where it runs	Notes
`faiss`	DEFAULT	HNSW, IVF	native C++ (off-heap) via JNI	quantization (PQ/SQ/BQ), disk-based ANN, byte vectors
`lucene`	supported	HNSW	pure Java inside the JVM process (Lucene's own format)	tight filtering integration, scalar quantization, no JNI
`nmslib`	DEPRECATED	HNSW only	native C++ via JNI	new nmslib indices blocked since 3.0; existing indices still read for BWC

# Confirm in the source: the engine enum, the default, and what's deprecated.
cd ~/src/k-NN
grep -nE "FAISS|LUCENE|NMSLIB|DEFAULT|DEPRECATED" \
  src/main/java/org/opensearch/knn/index/engine/KNNEngine.java | head

You will find public static final KNNEngine DEFAULT = FAISS; and a DEPRECATED_ENGINES set containing NMSLIB. The deprecation is real and version-gated: nmslib was deprecated in OpenSearch 2.19 and creating new nmslib indices is blocked in 3.0. There is an active meta-issue for what comes next — [META] Supporting New Vector Engine in OpenSearch (k-NN #2605) (https://github.com/opensearch-project/k-NN/issues/2605) — which is the strategic context for why the engine layer is being restructured. The dedicated Engines chapter compares all three in depth.

How this section is organized

Read in order. Each concept chapter has companion labs; do the concept first, then the lab that exercises it.

Concept chapters

#	Chapter	What you learn
1	Overview (this page)	why vectors, how k-NN relates to core, the three engines
2	k-NN Warm-Up: From User to Contributor	run k-NN as a user, then bridge every behavior to source
3	k-NN Plugin Architecture	`KNNPlugin`, the module layout, the custom codec, native memory
4	Engines: Faiss, Lucene, and NMSLIB	deep compare; the capability matrix; how engine choice flows to segment files
5	Algorithms: HNSW, IVF, and PQ	the ANN algorithms themselves and their parameters
6	Native Integration, JNI, and Memory	the C++ build, the JNI boundary, off-heap memory and the circuit breaker
7	The k-NN Query Path	`KNNQueryBuilder` → `KNNQuery` → `KNNWeight`, filtering, rescoring
8	Quantization and Disk-Based ANN	byte/FP16/PQ/BQ, `on_disk` mode, `compression_level`, rescoring

Labs

#	Lab	Deliverable
K1	Build the k-NN plugin from source	a working native + Java build, plugin installed in a local node
K2	Trace a k-NN query	a stack trace from REST down to the JNI call
K3	The `knn_vector` field type	read and modify the field mapper
K4	Build it — a custom k-NN feature	a real feature added to the plugin
K5	Fix it — a real k-NN issue	a PR-shaped fix against a labeled issue
K6	Benchmark recall and latency	a recall/latency report across engines and parameters

The contribution landscape

k-NN is one of the most active OpenSearch plugin repos and one of the most welcoming to new contributors, precisely because the surface is broad: Java field-mapping work, Lucene-codec work, C++/JNI work, quantization math, benchmarking, and cutting-edge GPU and remote-build research all live in one repo. A few live threads to orient you:

The next vector engine — k-NN #2605 (META): how the engine abstraction evolves after nmslib's deprecation.
GPU acceleration — k-NN #2293 (RFC, NVIDIA cuVS / CAGRA) and remote index build k-NN #2294 (RFC): offload segment graph construction to a remote GPU/CPU fleet.
Native memory circuit breaker — k-NN #1582: rearchitecting the off-heap memory guard. A great window into the off-heap memory story.

For the full catalog of issues, RFCs, and where to start contributing across OpenSearch, Lucene, and k-NN, see Real Issues, RFCs, and Where to Contribute. For everything that does not have a verified issue number here, do not invent one — search GitHub directly:

gh issue list --repo opensearch-project/k-NN --label "good first issue" --state open
gh issue list --repo opensearch-project/k-NN --search "faiss filter rescore in:title"

What this builds on

This section is the application of the Lucene chapter to a production subsystem. Before the architecture chapter, you should be comfortable with:

HNSW Vector Search in Lucene — the graph algorithm, KnnFloatVectorField, Lucene99HnswVectorsFormat, the .vec/.vex/.vem segment files. The k-NN lucene engine is a thin wrapper over exactly this; the faiss and nmslib engines are the same idea implemented in C++ with their own segment files.
SIMD and the Panama Vector API — why distance computations are fast, which is the same reason faiss is fast (it uses AVX2/AVX-512; see the PlatformUtils.isAVX512SupportedBySystem() checks in KNNPlugin).
Plugin architecture, Mapping and analysis, Search execution, and Circuit breakers and memory — the core subsystems k-NN extends.

Now run it as a user. Continue to the k-NN Warm-Up.

k-NN Warm-Up: From User to Contributor

"Space," said the k-NN plugin, "is big. Really big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to a 1,024-dimensional embedding space."

Don't panic. You are about to do something that sounds impossible — search a space with a thousand axes, find the handful of points nearest to where you're standing, and do it in a few milliseconds over a hundred million neighbors — and by the end of this chapter you'll have done it from a terminal, watched the off-heap memory light up, and traced every observable behavior back to the exact org.opensearch.knn.* class that owns it. This is the missing first mile for the k-NN plugin, the way the OpenSearch warm-up was the missing first mile for the core engine. Same idea: sit in the user's seat first, then open the source, so the class names land on something real instead of floating free.

If you haven't read the section overview yet, skim it — it tells you what vector search is for and how k-NN bolts onto core OpenSearch. This chapter assumes a local cluster with the k-NN plugin installed (the build-from-source lab walks you through it; for now, ./gradlew run in a k-NN checkout gives you a node with the plugin already on it, REST on localhost:9200).

The Guide's one piece of advice for the journey: vector search is just normal OpenSearch with a strange field type and a strange query. Everything you already know — shards, segments, the search fan-out, the coordinating-node reduce — still applies. A knn_vector is a field. A knn query is a QueryBuilder. The exotic part is what happens inside one shard, in C++, off the JVM heap. Hold that, and the whole thing stops being intimidating.

The Babel fish: embeddings turn meaning into geometry

Before any curl, the one concept that makes the rest make sense. An embedding model (BERT, a sentence-transformer, OpenAI's text-embedding-3, CLIP for images — pick your poison) is a function that eats text or an image and emits a fixed-length float[] — a vector. The magic property: things that mean similar things land near each other. "a small domestic feline" and "kitten" share zero tokens, so BM25 thinks they're strangers; their embeddings sit a whisker apart. The embedding is a Babel fish for meaning: it translates "what this is about" into "where this is in space," and once meaning is geometry, "find similar" becomes "find nearest."

"a small domestic feline"  --[embed]-->  [0.021, -0.88, 0.13, ... ]   (768 floats)
"kitten"                   --[embed]-->  [0.019, -0.85, 0.15, ... ]   <- nearly the same point
"quarterly tax filing"     --[embed]-->  [0.77,  0.04, -0.6, ... ]    <- far away

You produce these vectors before k-NN is involved (with an external model, or inside OpenSearch via the ml-commons plugin and a neural-search ingest pipeline). k-NN's job starts the moment you have a float[]: store it, index it for fast nearest-neighbor lookup, and at query time find the k stored vectors closest to a query vector by some distance metric. That's the whole contract. Everything else (engines, graphs, quantization, off-heap memory) is machinery to make "find the nearest k" fast and affordable at scale.

Note: "closest" needs a definition. The space_type mapping parameter picks the distance metric: l2 (Euclidean), cosinesimil (cosine), innerproduct, l1, linf, and hamming (for binary vectors). Pick the one your embedding model was trained for — using l2 on vectors meant for cosine quietly wrecks recall, and it is a top-five user mistake.
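A two-dimensional toy makes the metric mismatch concrete. The sketch below assumes hand-picked vectors and plain scalar implementations of squared L2 and cosine (MetricSketch is hypothetical); it only shows that the two metrics can disagree about which stored vector is nearest.

```java
// Hypothetical toy: scalar metrics on hand-picked 2-d vectors.
final class MetricSketch {

  static float squaredL2(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static float cosine(float[] a, float[] b) {
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return (float) (dot / Math.sqrt((double) normA * normB));
  }

  public static void main(String[] args) {
    float[] query = {1f, 0f};
    float[] sameDirectionFar = {10f, 0f};   // identical direction, large magnitude
    float[] nearbyOffAxis = {0.9f, 0.5f};   // close in space, different direction

    // Smaller L2 distance means nearer; larger cosine means nearer.
    boolean l2PrefersNearby =
        squaredL2(query, nearbyOffAxis) < squaredL2(query, sameDirectionFar);
    boolean cosinePrefersFar =
        cosine(query, sameDirectionFar) > cosine(query, nearbyOffAxis);
    System.out.println("l2 picks nearbyOffAxis: " + l2PrefersNearby);
    System.out.println("cosine picks sameDirectionFar: " + cosinePrefersFar);
  }
}
```

Both lines print true: the same query has a different nearest neighbor under each metric, which is how a mismatched space_type degrades recall without raising any error.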

Scenario 0: prove the plugin is alive

Two commands of due diligence before we index anything. Confirm the plugin loaded and the stats endpoint answers; if either fails, nothing below will work and you should go back to the build lab.

# Is the k-NN plugin actually installed on this node?
curl -s 'localhost:9200/_cat/plugins?v' | grep -i knn

# The k-NN stats endpoint — the dashboard for everything off-heap.
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | head -40

The _cat/plugins line names opensearch-knn; _plugins/_knn/stats returns a JSON blob of counters — graph memory used, cache hits/misses, circuit-breaker state, query counts. Most of it reads 0 right now. By the end of this chapter every interesting counter will have moved, and you'll know which line of which class moved it.

Scenario 1: basic approximate k-NN (the "hello, vectors" path)

What the user does — create a knn-enabled index with a small knn_vector field, bulk-index a handful of vectors, run a knn query for the nearest k.

# 1. Create the index. THREE things make it a vector index:
#    index.knn:true, type knn_vector, and a method (hnsw on faiss = the default engine).
curl -s -XPUT 'localhost:9200/cats' -H 'Content-Type: application/json' -d '
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 4,
        "space_type": "l2",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": { "m": 16, "ef_construction": 128 }
        }
      },
      "label": { "type": "keyword" }
    }
  }
}'

# 2. Bulk-index five 4-dimensional vectors. (Real ones are 384/768/1024-dim; 4-dim
#    fits on a page and the geometry is identical.)
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/cats/_bulk?refresh=true' --data-binary $'
{"index":{"_id":"1"}}
{"label":"tabby",   "embedding":[0.10, 0.20, 0.30, 0.40]}
{"index":{"_id":"2"}}
{"label":"siamese", "embedding":[0.11, 0.19, 0.31, 0.39]}
{"index":{"_id":"3"}}
{"label":"lion",    "embedding":[0.90, 0.80, 0.10, 0.05]}
{"index":{"_id":"4"}}
{"label":"tiger",   "embedding":[0.88, 0.82, 0.12, 0.04]}
{"index":{"_id":"5"}}
{"label":"sphynx",  "embedding":[0.12, 0.21, 0.29, 0.41]}
'

# 3. Find the 3 nearest neighbors to a query vector near the "house cat" cluster.
curl -s 'localhost:9200/cats/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 3,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.10, 0.20, 0.30, 0.40],
        "k": 3
      }
    }
  }
}'

You get back docs 1, 2, 5 in that order (the house-cat cluster), each with a _score: doc 1 matches the query exactly, doc 2 sits at squared distance 0.0004, and doc 5 at 0.0007. The lion and tiger are nowhere; they live in a different neighborhood. Note the _score is a similarity (higher = nearer), not a raw distance; k-NN converts the engine's distance into a score that rises as distance falls, so it composes with normal OpenSearch scoring.
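The distance-to-score conversion can be checked by hand on these five vectors. This sketch assumes the l2 mapping score = 1 / (1 + d), with d the squared Euclidean distance, which is my understanding of the k-NN documentation for the l2 space; confirm the formula against the version you run. L2ScoreSketch is a hypothetical name.

```java
// Hypothetical sketch; the score formula is an assumption to verify
// against the k-NN documentation for your version.
final class L2ScoreSketch {

  static final String[] LABELS = {"tabby", "siamese", "lion", "tiger", "sphynx"};
  static final float[][] VECTORS = {
    {0.10f, 0.20f, 0.30f, 0.40f},
    {0.11f, 0.19f, 0.31f, 0.39f},
    {0.90f, 0.80f, 0.10f, 0.05f},
    {0.88f, 0.82f, 0.12f, 0.04f},
    {0.12f, 0.21f, 0.29f, 0.41f}
  };
  static final float[] QUERY = {0.10f, 0.20f, 0.30f, 0.40f};

  static float squaredL2(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Assumed l2 mapping: distance 0 scores 1.0, larger distances approach 0.
  static float score(float squaredDistance) {
    return 1f / (1f + squaredDistance);
  }

  public static void main(String[] args) {
    for (int i = 0; i < VECTORS.length; i++) {
      float d = squaredL2(QUERY, VECTORS[i]);
      System.out.println(LABELS[i] + " distance=" + d + " score=" + score(d));
    }
  }
}
```

Under that assumption the exact match scores 1.0, the other two house cats score just below it, and the lion and tiger fall well under 0.5.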

What k-NN does under the hood:

At mapping time, KNNVectorFieldMapper.Builder parses dimension, space_type, and the method object, validates them against the chosen KNNEngine (faiss), and produces a KNNVectorFieldType. This is a MapperPlugin contribution — the same extension point any custom field type uses (see plugin architecture).
On index, each float[] is stored, and at flush/merge a custom Lucene codec (KNN1030Codec and its NativeEngines990KnnVectorsFormat, in the current source — grep to confirm the version) calls into faiss over JNI to build an HNSW graph and write it as segment files alongside Lucene's normal files.
The knn query parses into a KNNQueryBuilder, which builds a KNNQuery. Per shard, KNNWeight/KNNScorer ask the loaded faiss graph for the nearest k doc IDs and their distances, convert distance → score, and feed them into the normal Lucene collector. The coordinating node merges the per-shard top-k exactly like any search.

Bridge to source:

cd ~/src/oss-repos/k-NN   # your k-NN checkout

# The field mapper and field type (MapperPlugin contribution)
find src/main/java -name "KNNVectorFieldMapper.java" -o -name "KNNVectorFieldType.java"

# The query path
find src/main/java -name "KNNQueryBuilder.java" -o -name "KNNQuery.java" -o -name "KNNWeight.java"

# The custom codec that writes the graph as segment files
find src/main/java -path "*codec*" -name "KNN*Codec.java" | grep -v backward
find src/main/java -name "NativeEngines990KnnVectorsFormat.java"

Scenario 2: filtered k-NN (vectors that also obey a `term`)

What the user does — "nearest neighbors, but only among indoor cats." Pure ANN ignores metadata; real applications almost always need to combine semantic similarity with a structured filter.

# Re-create the index with a boolean we can filter on.
curl -s -XDELETE 'localhost:9200/cats' >/dev/null
curl -s -XPUT 'localhost:9200/cats' -H 'Content-Type: application/json' -d '
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": { "type": "knn_vector", "dimension": 4, "space_type": "l2",
        "method": { "name": "hnsw", "engine": "faiss" } },
      "indoor": { "type": "boolean" },
      "label":  { "type": "keyword" }
    }
  }
}'
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/cats/_bulk?refresh=true' --data-binary $'
{"index":{}}
{"label":"tabby","indoor":true, "embedding":[0.10,0.20,0.30,0.40]}
{"index":{}}
{"label":"siamese","indoor":true,"embedding":[0.11,0.19,0.31,0.39]}
{"index":{}}
{"label":"feral","indoor":false,"embedding":[0.10,0.20,0.30,0.41]}
'

# k-NN with a filter: nearest indoor cats only.
curl -s 'localhost:9200/cats/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 3,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.10, 0.20, 0.30, 0.40],
        "k": 3,
        "filter": { "term": { "indoor": true } }
      }
    }
  }
}'

Tabby matches the query exactly, and feral is the next-nearest vector (closer than siamese), but feral is indoor:false, so it is excluded and you get only the two indoor cats, tabby and siamese. Crucially, the filter is applied during the graph traversal, not naively after: faiss and lucene support efficient (pre-)filtering so the k you ask for is k matching results, not k candidates that mostly get thrown away.

Why this is subtle: naive post-filtering breaks ANN. If you fetch the top-k then filter, and your filter is selective, you can get zero results even though matching neighbors exist further down the graph. So the engine pushes the filter into the traversal. This is exactly why filtering support differs by engine — ENGINES_SUPPORTING_FILTERS in KNNEngine is {LUCENE, FAISS}, and nmslib is not in it. The engines chapter covers why.
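
The failure mode is easy to reproduce with a brute-force stand-in for the graph. The sketch below reuses the three cats from this scenario; the query vector and the helper names are hypothetical, and a sorted scan replaces the real HNSW traversal.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


docs = [
    {"label": "tabby", "indoor": True, "embedding": [0.10, 0.20, 0.30, 0.40]},
    {"label": "siamese", "indoor": True, "embedding": [0.11, 0.19, 0.31, 0.39]},
    {"label": "feral", "indoor": False, "embedding": [0.10, 0.20, 0.30, 0.41]},
]


def top_k(candidates, query, k):
    """Brute-force nearest neighbors (stand-in for the graph search)."""
    ranked = sorted(candidates, key=lambda d: squared_l2(d["embedding"], query))
    return ranked[:k]


def post_filter(query, k):
    """Naive: take the top-k first, then drop anything the filter rejects."""
    return [d["label"] for d in top_k(docs, query, k) if d["indoor"]]


def pre_filter(query, k):
    """Filter-aware: only matching docs are eligible to be neighbors."""
    allowed = [d for d in docs if d["indoor"]]
    return [d["label"] for d in top_k(allowed, query, k)]


query = [0.10, 0.20, 0.30, 0.41]   # hypothetical query sitting on top of feral
post = post_filter(query, k=1)
pre = pre_filter(query, k=1)
print(post, pre)
```

With k=1 the naive version spends its only slot on feral and then discards it, while the filter-aware version still finds tabby. That gap is what pushing the filter into the traversal closes.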

Bridge to source:

# Filter handling in the query path — how the term filter becomes a BitSet the
# engine traverses against.
grep -rn "filter\|FilteredIdsKNNIterator\|BitSet\|Weight" \
  src/main/java/org/opensearch/knn/index/query/KNNWeight.java | head

# The engine capability set that decides whether filtering is even allowed.
grep -n "ENGINES_SUPPORTING_FILTERS" src/main/java/org/opensearch/knn/index/engine/KNNEngine.java

Scenario 3: byte vectors (the memory diet)

What the user does — index byte vectors instead of float. Each dimension is a single signed byte (-128..127) instead of a 4-byte float — a 4× memory reduction before any fancier quantization. Supported on faiss and lucene since 2.17.

curl -s -XPUT 'localhost:9200/cats-byte' -H 'Content-Type: application/json' -d '
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 4,
        "data_type": "byte",
        "space_type": "l2",
        "method": { "name": "hnsw", "engine": "lucene" }
      }
    }
  }
}'
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/cats-byte/_bulk?refresh=true' --data-binary $'
{"index":{}}
{"embedding":[10, 20, 30, 40]}
{"index":{}}
{"embedding":[12, 19, 31, 39]}
'
curl -s 'localhost:9200/cats-byte/_search?pretty' -H 'Content-Type: application/json' -d '
{ "size": 2, "query": { "knn": { "embedding": { "vector": [10,20,30,40], "k": 2 } } } }'

The values are now integers in [-128, 127]. You traded precision for memory; recall drops a little, RAM drops a lot. This is the gentlest rung on a whole quantization ladder — FP16 scalar quantization, Product Quantization (PQ), Binary Quantization (BQ), and full disk-based search — covered in quantization and disk-ANN. The point of trying it now is to see data_type change the field's behavior.
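
The 4× figure is plain arithmetic on the raw vector payload. A quick sketch, with a hypothetical corpus size and dimension, counting only the vector values themselves (graph links and other per-segment overhead are ignored):

```python
def raw_vector_bytes(num_vectors, dimension, bytes_per_dimension):
    """Bytes needed for the vector values alone."""
    return num_vectors * dimension * bytes_per_dimension


num_vectors = 1_000_000   # hypothetical corpus size
dimension = 768           # hypothetical embedding width

float_bytes = raw_vector_bytes(num_vectors, dimension, 4)   # 4-byte float
byte_bytes = raw_vector_bytes(num_vectors, dimension, 1)    # 1-byte signed int
reduction = float_bytes / byte_bytes

print(f"float: {float_bytes / 2**30:.2f} GiB, byte: {byte_bytes / 2**30:.2f} GiB")
```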

Bridge to source:

# The data-type enum: BINARY, BYTE, FLOAT.
grep -nE "BINARY|BYTE|FLOAT|DEFAULT" src/main/java/org/opensearch/knn/index/VectorDataType.java | head

# Where the mapper branches on data_type to pick the storage/encoding.
grep -rn "VectorDataType\|data_type" src/main/java/org/opensearch/knn/index/mapper/ | head

Scenario 4: radial search (a threshold, not a count)

What the user does — instead of "give me the k nearest," ask "give me everyone within this radius" — every vector closer than a distance threshold (or above a score threshold). This is the right shape for deduplication ("anything within 0.05 is a near-duplicate") and anomaly detection ("nothing within 0.5 → outlier").

# max_distance: return all neighbors within this L2 distance of the query.
curl -s 'localhost:9200/cats/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 10,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.10, 0.20, 0.30, 0.40],
        "max_distance": 0.05
      }
    }
  }
}'

# Equivalently, a score floor (min_score) instead of a distance ceiling.
curl -s 'localhost:9200/cats/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 10,
  "query": {
    "knn": { "embedding": { "vector": [0.10,0.20,0.30,0.40], "min_score": 0.95 } }
  }
}'

Note there is no k here — the result set size is whatever falls inside the radius, so a tight radius can return nothing and a loose one can return everything. Radial search is supported on lucene and faiss (ENGINES_SUPPORTING_RADIAL_SEARCH in KNNEngine), not nmslib.
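
A brute-force sketch makes the "threshold, not a count" shape concrete. It rests on several assumptions: the distance is squared L2, the boundary is inclusive, and the score floor matching a distance ceiling follows score = 1 / (1 + d). The lion vector and the helper names are hypothetical.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def radial_search(vectors, query, max_distance):
    """Return (label, distance) for every vector within max_distance, nearest first."""
    hits = [(label, squared_l2(vec, query)) for label, vec in vectors.items()]
    return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])


def min_score_for(max_distance):
    """Score floor matching a distance ceiling, under the assumed l2 translation."""
    return 1.0 / (1.0 + max_distance)


vectors = {
    "tabby": [0.10, 0.20, 0.30, 0.40],
    "siamese": [0.11, 0.19, 0.31, 0.39],
    "feral": [0.10, 0.20, 0.30, 0.41],
    "lion": [0.90, 0.80, 0.70, 0.60],    # hypothetical far-away vector
}
query = [0.10, 0.20, 0.30, 0.40]

tight = radial_search(vectors, query, 0.00005)
loose = radial_search(vectors, query, 0.05)
everything = radial_search(vectors, query, 10.0)
nothing = radial_search(vectors, [5.0, 5.0, 5.0, 5.0], 0.05)
print(len(tight), len(loose), len(everything), len(nothing))
```

The same query returns one, three, four, or zero hits depending only on the radius, which is why radial requests set size generously and let the threshold do the limiting.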

Bridge to source:

# min_score / max_distance parsing and the radial query branch.
grep -rn "max_distance\|min_score\|RadialSearch\|radial\|MAX_DISTANCE\|MIN_SCORE" \
  src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java | head

grep -n "ENGINES_SUPPORTING_RADIAL_SEARCH" src/main/java/org/opensearch/knn/index/engine/KNNEngine.java

Scenario 5: exact vs approximate (the script-score escape hatch)

What the user does — sometimes you want the exact nearest neighbors (100% recall, no graph), usually as a rescoring pass over a small candidate set, or on a field that isn't graph-indexed at all. k-NN exposes this as a knn script score — brute-force distance over every matching doc, scored by a knn_score script.

# Exact k-NN: score EVERY doc by L2 distance to the query (a full scan).
# No graph, no approximation, 100% recall — and O(N), so only sane after a filter.
curl -s 'localhost:9200/cats/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 3,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "embedding",
          "query_value": [0.10, 0.20, 0.30, 0.40],
          "space_type": "l2"
        }
      }
    }
  }
}'

This returns the truly nearest docs, computed exactly — useful as ground truth when you're measuring the recall of the approximate path (the recall/latency benchmark lab leans on this). It's also the only k-NN you get on a knn_vector field that was indexed without a method (a pure stored field). The script-score path is a ScriptPlugin contribution — yet another extension interface KNNPlugin implements.
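
Using the exact path as ground truth comes down to a set comparison. A minimal sketch, with a brute-force scan standing in for the knn_score script and a hand-written "approximate" answer (hypothetical, chosen to miss one true neighbor):

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knn(vectors, query, k):
    """Brute-force scan: rank every doc by distance, keep the k nearest ids."""
    ranked = sorted(vectors, key=lambda doc_id: squared_l2(vectors[doc_id], query))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors present in the approximate result."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {
    1: [0.10, 0.20, 0.30, 0.40],
    2: [0.11, 0.19, 0.31, 0.39],
    3: [0.10, 0.20, 0.30, 0.41],
    4: [0.90, 0.80, 0.70, 0.60],
}
query = [0.10, 0.20, 0.30, 0.40]

truth = exact_knn(vectors, query, k=3)
approx = [1, 3, 4]                  # hypothetical ANN answer that missed doc 2
recall = recall_at_k(approx, truth)
print(truth, round(recall, 3))
```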

Bridge to source:

# The knn scoring script engine (ScriptPlugin contribution).
find src/main/java -path "*script*" -name "*.java" | head
grep -rn "knn_score\|KNNScoringSpace\|KNNScoringScriptEngine\|class KNNScoringUtil" \
  src/main/java/org/opensearch/knn/plugin/script/ | head

Scenario 6: a semantic-search end-to-end (embeddings → vectors → results)

The scenarios above used toy 4-dim vectors so the geometry fit on a page. Here is the real shape: text in, embeddings, a semantic query. We'll fake the embedding model with a tiny deterministic stand-in so the example runs with zero external dependencies — in production this float[] comes from a real model (via ml-commons/neural-search, or an external service).

curl -s -XPUT 'localhost:9200/docs' -H 'Content-Type: application/json' -d '
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "text":  { "type": "text" },
      "embedding": { "type": "knn_vector", "dimension": 8, "space_type": "cosinesimil",
        "method": { "name": "hnsw", "engine": "faiss" } }
    }
  }
}'

# In reality: emb = model.encode(text). Here we hand-place vectors so "cat" docs
# cluster and "tax" docs cluster, to show semantic retrieval working.
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'localhost:9200/docs/_bulk?refresh=true' --data-binary $'
{"index":{}}
{"text":"a small domestic feline","embedding":[0.9,0.1,0.1,0.0,0.1,0.0,0.0,0.1]}
{"index":{}}
{"text":"kitten care basics",      "embedding":[0.88,0.12,0.08,0.0,0.1,0.0,0.0,0.1]}
{"index":{}}
{"text":"quarterly tax filing",    "embedding":[0.0,0.0,0.1,0.9,0.0,0.8,0.1,0.0]}
'

# Query: embed "baby cat" -> a vector near the feline cluster -> nearest neighbors.
curl -s 'localhost:9200/docs/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 2,
  "_source": ["text"],
  "query": { "knn": { "embedding": { "vector": [0.9,0.1,0.1,0.0,0.1,0.0,0.0,0.1], "k": 2 } } }
}'

The two feline docs come back; the tax doc does not — even though "baby cat" shares no words with "a small domestic feline". That is the entire point of vector search: it retrieves by meaning, where match retrieves by tokens. The natural next step is hybrid search — combine a knn query with a match query in one request so you get both semantic and lexical signal — which is one of the strongest reasons to run vectors inside OpenSearch rather than in a standalone vector DB (recall the comparison in the OpenSearch warm-up).
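
To see why the two feline docs win, compute the cosine similarity by hand. The sketch below uses the exact vectors from this scenario; cosine_similarity is an illustrative helper, not a plugin function.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


docs = {
    "a small domestic feline": [0.9, 0.1, 0.1, 0.0, 0.1, 0.0, 0.0, 0.1],
    "kitten care basics": [0.88, 0.12, 0.08, 0.0, 0.1, 0.0, 0.0, 0.1],
    "quarterly tax filing": [0.0, 0.0, 0.1, 0.9, 0.0, 0.8, 0.1, 0.0],
}
# Stand-in for embed("baby cat"), as in the query above.
query_vector = [0.9, 0.1, 0.1, 0.0, 0.1, 0.0, 0.0, 0.1]

ranked = sorted(
    docs,
    key=lambda text: cosine_similarity(docs[text], query_vector),
    reverse=True,
)
top_two = ranked[:2]
for text in ranked:
    print(f"{cosine_similarity(docs[text], query_vector):.4f}  {text}")
```

The feline docs score close to 1 and the tax doc close to 0, with no shared tokens involved anywhere in the computation.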

Scenario 7: where the memory actually lives — warmup and stats

This is the scenario that separates k-NN from every other field type, and the one most worth internalizing as a contributor. The faiss/nmslib graphs do not live on the JVM heap. They live in native memory (off-heap), loaded by JNI on first query against a segment — or eagerly via the warmup API.

# 1. Baseline: how much graph memory is loaded right now?
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' \
  | grep -E "graph_memory_usage|cache_capacity_reached|hit_count|miss_count|load_success_count"

# 2. Force-load every segment's graph for an index into native memory NOW,
#    instead of paying the load latency on the first user query.
curl -s -XPOST 'localhost:9200/_plugins/_knn/warmup/cats?pretty'

# 3. Re-check stats: graph_memory_usage and load_success_count have moved.
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' \
  | grep -E "graph_memory_usage|load_success_count|hit_count|miss_count"

Before warmup, the first query against a cold segment pays a one-time cost: JNI reads the graph file off disk into native memory and caches it. The warmup API pays that cost up front so production queries are uniformly fast. The cache is managed by NativeMemoryCacheManager (a guava cache), and a native-memory circuit breaker caps the total — if loading another graph would blow the limit, the breaker trips and queries fail rather than OOM-ing the box. You can watch the breaker in stats:

curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | grep -iE "circuit_breaker|capacity"

What k-NN does under the hood:

The warmup REST call (RestKNNWarmupHandler) dispatches a transport action that, per shard, asks NativeMemoryCacheManager to load each segment's graph allocation.
NativeMemoryLoadStrategy calls JNI (JNIService → FaissService) to mmap/read the graph file and hand back a native pointer wrapped in a NativeMemoryAllocation.
Every load is metered against the circuit breaker (KNNCircuitBreaker, settings knn.memory.circuit_breaker.limit / knn.memory.circuit_breaker.enabled). The rearchitecture of this guard is an active, real design discussion — k-NN #1582 — and a great window into off-heap engineering.
The stats you read come from KNNStats suppliers reading the cache manager and breaker.
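
The interplay of cache, warmup, and breaker described above can be modeled with a toy. This is a deliberately simplified sketch, not how NativeMemoryCacheManager or KNNCircuitBreaker are implemented: there is no eviction, sizes are supplied by the caller, and the class, file names, and limit are all hypothetical.

```python
class BreakerTripped(Exception):
    """Raised when loading another graph would exceed the native-memory limit."""


class ToyGraphCache:
    """Toy stand-in for a native-memory graph cache guarded by a size limit."""

    def __init__(self, limit_kb):
        self.limit_kb = limit_kb
        self.loaded = {}   # graph file name -> size in KB
        self.hits = 0
        self.misses = 0

    def usage_kb(self):
        return sum(self.loaded.values())

    def get(self, graph_file, size_kb):
        """Return a cached graph, or load it if the limit allows."""
        if graph_file in self.loaded:
            self.hits += 1
            return graph_file
        self.misses += 1
        if self.usage_kb() + size_kb > self.limit_kb:
            raise BreakerTripped(f"{graph_file} would exceed {self.limit_kb} KB")
        self.loaded[graph_file] = size_kb
        return graph_file

    def warmup(self, graphs):
        """Eagerly load every graph so later queries are cache hits."""
        for graph_file, size_kb in graphs.items():
            self.get(graph_file, size_kb)


cache = ToyGraphCache(limit_kb=100)
cache.warmup({"seg0.graph": 40, "seg1.graph": 40})   # hypothetical file names
cache.get("seg0.graph", 40)                          # already loaded: a hit

try:
    cache.get("seg2.graph", 40)                      # 80 + 40 > 100
    tripped = False
except BreakerTripped:
    tripped = True

print(cache.usage_kb(), cache.hits, cache.misses, tripped)
```

The counters mirror the kind of numbers the stats API reports: warmup turns would-be query-time misses into hits, and the limit is checked before a load, not after.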

Bridge to source:

# Warmup + stats REST handlers (ActionPlugin contributions)
find src/main/java -name "RestKNNWarmupHandler.java" -o -name "RestKNNStatsHandler.java"

# Native memory cache + load + circuit breaker
find src/main/java -name "NativeMemoryCacheManager.java" -o -name "KNNCircuitBreaker.java"
grep -rn "knn.memory.circuit_breaker" src/main/java/org/opensearch/knn/ | head

# The JNI boundary into native faiss
find src/main/java -path "*jni*" -name "JNIService.java" -o -name "FaissService.java"

See the native integration and memory chapter for the full off-heap story: the JNI boundary, JNICommons, mmap, the circuit breaker's rearchitecture, and why a leak here is a native leak your JVM heap dump can't see.

The bridge table: user scenario → k-NN / Lucene class

Use this the way you'd use the master table in the OpenSearch warm-up: observe a behavior, find the code that owns it. Exact names drift by version — grep under src/main/java/org/opensearch/knn to confirm.

Observed behavior	Owning class / subsystem
`index.knn:true` + `knn_vector` mapping accepted/validated	`KNNVectorFieldMapper` (+ `.Builder`), `KNNVectorFieldType` (`MapperPlugin`)
`method`/`engine`/`space_type` validated against capabilities	`KNNEngine`, `KNNMethod`, `MethodComponent`, `SpaceType`
Float vs byte vs binary storage	`VectorDataType` (`FLOAT`/`BYTE`/`BINARY`)
`knn` query parsed	`KNNQueryBuilder` → `KNNQuery` (`SearchPlugin`)
Per-shard nearest-neighbor scan + score conversion	`KNNWeight`, `KNNScorer`
`filter` applied during traversal	`KNNWeight` filter handling; `ENGINES_SUPPORTING_FILTERS`
`max_distance` / `min_score` radial search	`KNNQueryBuilder` radial branch; `ENGINES_SUPPORTING_RADIAL_SEARCH`
Exact `knn_score` script	k-NN scoring script engine (`ScriptPlugin`), `KNNScoringUtil`
Graph written into segment files at flush/merge	`KNN1030Codec` → `NativeEngines990KnnVectorsFormat` (custom Lucene codec)
Lucene engine vectors (`.vec`/`.vex`/`.vem`)	Lucene's `KnnFloatVectorField` + HNSW format — see Lucene HNSW
First-query graph load into native memory	`NativeMemoryCacheManager`, `NativeMemoryLoadStrategy`, `JNIService` → `FaissService`
`POST /_plugins/_knn/warmup/<index>`	`RestKNNWarmupHandler` + warmup transport action
`GET /_plugins/_knn/stats`	`RestKNNStatsHandler`, `KNNStats` suppliers
Off-heap memory cap trips	`KNNCircuitBreaker` (`knn.memory.circuit_breaker.*`) — k-NN #1582
`_train` model build (IVF/PQ)	`TrainingModelTransportAction`, `ModelDao`, `.opensearch-knn-models` (`SystemIndexPlugin`)

Each row maps to a chapter: the field type and query path are detailed in query path; engines and their capability sets in engines; the algorithms behind method in algorithms; native memory in native integration and memory; the codec in architecture.

The contributor's loop for k-NN

The inner loop is the core loop plus one wrinkle: the native build. k-NN is not pure Java, so a clean ./gradlew run also compiles C++ under jni/.

cd ~/src/oss-repos/k-NN

# 1. Build the native libraries (faiss/nmslib wrappers via CMake) + the Java plugin,
#    and run a node with the plugin installed. The first build is slow (it compiles
#    faiss); later builds are incremental.
./gradlew run

# 2. Smoke test from another terminal.
curl -s 'localhost:9200/_cat/plugins?v' | grep -i knn
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | head

# 3. The fast inner loop: one Java test class.
./gradlew test --tests "org.opensearch.knn.index.query.KNNQueryBuilderTests"

# 4. Native-only build (when you're touching C++ under jni/).
cd jni && cmake . && make            # or follow jni/README for the exact invocation

The build-from-source lab does this end to end, including the submodule init for faiss/nmslib under jni/external/ and the platform gotchas. If ./gradlew run fails on the native step, that's a CMake/C++ problem, not a Java one — a distinction you'll learn to make fast.

Warning: A k-NN build pulls native git submodules (jni/external/faiss, jni/external/nmslib). A fresh clone without git submodule update --init --recursive fails the native build with confusing CMake errors. This is the single most common "my k-NN won't build" cause.

What to verify before the architecture chapter

Run this once. It proves your environment, and — more importantly — that you can move between user behavior and source.

# Environment
curl -s 'localhost:9200/_cat/plugins?v' | grep -i knn          # plugin present
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | head     # stats answer

# Round-trip
curl -s -XPUT 'localhost:9200/verify' -H 'Content-Type: application/json' -d \
 '{"settings":{"index.knn":true},"mappings":{"properties":{"v":{"type":"knn_vector","dimension":3,"space_type":"l2","method":{"name":"hnsw","engine":"faiss"}}}}}'
curl -s -XPOST 'localhost:9200/verify/_doc?refresh=true' -H 'Content-Type: application/json' -d '{"v":[1,2,3]}'
curl -s 'localhost:9200/verify/_search?pretty' -H 'Content-Type: application/json' -d \
 '{"query":{"knn":{"v":{"vector":[1,2,3],"k":1}}}}'
curl -s -XPOST 'localhost:9200/_plugins/_knn/warmup/verify?pretty'

You are ready when, without notes, you can:

State the three things that make an index a vector index: index.knn:true, a knn_vector field type, and a method (or model_id).
Explain what an embedding is and why "find similar" becomes "find nearest."
Name the difference between approximate knn (with k), radial (max_distance/min_score), and exact (knn_score script) search, and which engines support filtering and radial.
Say where a faiss/nmslib graph physically lives at query time (native memory, off-heap) and what the warmup API and circuit breaker do about it.
Trace one knn query from REST to source: KNNQueryBuilder → KNNQuery → KNNWeight/KNNScorer → JNI → faiss.
Map each of: the field type, the query, the codec, and the native load to the KNNPlugin extension interface that owns it (MapperPlugin, SearchPlugin, EnginePlugin, ActionPlugin).

If any box is unchecked, re-run the scenario it maps to. When they all check, you've done the hard part: the source is no longer alien.

Continue to the k-NN Plugin Architecture to see how KNNPlugin wires all of this together — the extension interfaces, the module layout, the codec, the native memory manager — then to Engines for the faiss/lucene/nmslib comparison. The labs (K1: build from source, K2: trace a query) turn this reading into muscle memory. And so long, and thanks for all the vectors.

k-NN Plugin Architecture

The warm-up showed you k-NN from the outside — a field type, a query, a warmup call, some off-heap memory. This chapter is the wiring diagram. It answers: how does one KNNPlugin class register a field type, a query, a custom Lucene codec, a script engine, a system index, a circuit breaker, and a JNI bridge to C++ — and how do those pieces hand a vector from a curl request all the way down to faiss and back?

You already know the plugin architecture in the abstract: a Plugin subclass implements extension interfaces, PluginsService loads it in an isolated classloader, the engine queries each interface at startup. k-NN is the most instructive concrete example in the ecosystem, because it touches almost every extension point at once and then adds something no pure-Java plugin has: a native build. This chapter assumes that deep dive and the section overview; it does not re-derive the plugin-loading machinery, it shows what k-NN plugs into it.

After this chapter you can:

Name every extension interface KNNPlugin implements and what each contributes.
Navigate the k-NN repo's module layout, including the jni/ native build and the custom codec.
Trace the index path (vector → graph in a segment file) and the query path (knn query → native search → score) end to end.
Explain the native-memory subsystem (NativeMemoryCacheManager, the circuit breaker) and the training/model system index, and point at the source for each.

Note: The cluster manager (formerly master) is the node that owns cluster state. It matters here twice: training is coordinated through it, and the .opensearch-knn-models system index is ordinary cluster-managed index metadata, not shard-local state.

`KNNPlugin`: one class, ten interfaces

A pure-Java plugin usually implements one or two extension interfaces. KNNPlugin implements ten. That is the whole architecture in a single declaration: every subsystem k-NN adds is one interface on this class.

cd ~/src/oss-repos/k-NN
grep -n "public class KNNPlugin" src/main/java/org/opensearch/knn/plugin/KNNPlugin.java
# public class KNNPlugin extends Plugin implements
#     MapperPlugin, SearchPlugin, ActionPlugin, EnginePlugin, ClusterPlugin,
#     ScriptPlugin, ExtensiblePlugin, SystemIndexPlugin, ReloadablePlugin,
#     SearchPipelinePlugin { ...
sed -n '179,200p' src/main/java/org/opensearch/knn/plugin/KNNPlugin.java

Warning: the exact interface list and line numbers drift between releases — k-NN targets a specific OpenSearch version (opensearch_version in build.gradle, e.g. 3.6.0-SNAPSHOT). Always grep to confirm against your checkout rather than trusting a number printed here.

Interface	k-NN's contribution	The `KNNPlugin` method	Subsystem chapter
`MapperPlugin`	the `knn_vector` field type	`getMappers()`	field type (below), query path
`SearchPlugin`	the `knn` query	`getQueries()`	query path
`ActionPlugin`	REST handlers + transport actions (warmup, stats, train, clear-cache, model CRUD)	`getRestHandlers()`, `getActions()`	actions (below)
`EnginePlugin`	a custom `EngineFactory` (so the engine knows about k-NN segments/merges)	`getEngineFactory(IndexSettings)`	codec + merge wiring
`ScriptPlugin`	the exact `knn_score` scoring script engine	`getScriptEngine(...)`	warm-up scenario 5
`SystemIndexPlugin`	the `.opensearch-knn-models` model index descriptor	`getSystemIndexDescriptors(...)`	training (below)
`ClusterPlugin`	cluster-lifecycle hooks (e.g. circuit-breaker / cache coordination)	cluster hooks	native memory (below)
`ExtensiblePlugin`	SPI so other plugins can extend k-NN	(SPI loading)	plugin architecture
`ReloadablePlugin`	secure-settings reload (e.g. remote-build credentials)	`reload(Settings)`	remote index build
`SearchPipelinePlugin`	search-pipeline processors (e.g. normalization for hybrid search)	search-pipeline processors	hybrid/neural search

The custom Lucene codec is not a separate extension interface — it is registered the Lucene way, via Java's ServiceLoader/SPI (META-INF/services), and selected per-index through the EnginePlugin's engine factory. More on that in the index-path section.

# Confirm each registration method exists.
grep -n "getMappers\|getQueries\|getRestHandlers\|getActions\|getEngineFactory\|getScriptEngine\|getSystemIndexDescriptors\|reload" \
  src/main/java/org/opensearch/knn/plugin/KNNPlugin.java

The repo layout

The k-NN repository is unusual in the OpenSearch world because it has two trees that must agree: a Java tree (src/main/java) and a native C++ tree (jni/). Not knowing where things live is where new contributors lose the most time.

k-NN/
├── build.gradle                      # opensearch_version; depends on PUBLISHED core artifacts
├── jni/                              # the native build (CMake)
│   ├── CMakeLists.txt
│   ├── src/                          # faiss_wrapper.cpp, nmslib_wrapper.cpp, org_opensearch_knn_jni_*.cpp
│   ├── include/                      # JNI headers (generated from the Java native methods)
│   └── external/                     # faiss + nmslib as GIT SUBMODULES
└── src/main/java/org/opensearch/knn/
    ├── plugin/
    │   ├── KNNPlugin.java            # the one class, ten interfaces
    │   ├── rest/                     # RestKNNWarmupHandler, RestKNNStatsHandler, RestTrainModelHandler, ...
    │   ├── transport/               # *TransportAction + *Request/*Response (warmup, stats, train, model CRUD)
    │   ├── script/                  # the knn_score script engine + KNNScoringUtil
    │   └── stats/                   # KNNStats and its suppliers
    ├── index/
    │   ├── mapper/                  # KNNVectorFieldMapper, KNNVectorFieldType
    │   ├── query/                   # KNNQueryBuilder, KNNQuery, KNNWeight, KNNScorer
    │   ├── engine/                  # KNNEngine, KNNMethod, MethodComponent; faiss/ lucene/ nmslib/
    │   ├── codec/                   # KNN1030Codec, NativeEngines990KnnVectorsFormat, backward_codecs/, nativeindex/
    │   ├── memory/                  # NativeMemoryCacheManager, NativeMemoryAllocation, load strategies
    │   ├── SpaceType.java           # l2 / cosinesimil / innerproduct / l1 / linf / hamming
    │   ├── VectorDataType.java      # FLOAT / BYTE / BINARY
    │   ├── KNNSettings.java         # knn.memory.circuit_breaker.*, ef_search, etc.
    │   └── KNNCircuitBreaker.java   # the native-memory guard
    ├── indices/                     # ModelDao, Model, ModelState, training (.opensearch-knn-models)
    └── jni/                         # JNIService -> FaissService / NmslibService; JNICommons; PlatformUtils

# Walk it yourself.
ls src/main/java/org/opensearch/knn/
ls src/main/java/org/opensearch/knn/index/
ls jni/                              # CMakeLists.txt, src/, include/, external/
ls jni/external/                     # faiss  nmslib  (submodules — empty without --recursive!)

Warning: build.gradle declares a dependency on published OpenSearch artifacts (like every out-of-repo plugin — see in-repo vs out-of-repo), and the native build pulls faiss/nmslib as submodules under jni/external/. A fresh clone without git submodule update --init --recursive builds the Java side fine and then fails the native step with cryptic CMake errors. This is the #1 "k-NN won't build" cause.

The native build (`jni/`) and the custom codec

Two structural facts make k-NN unlike any pure-Java plugin, and they're the two things worth understanding before you read any Java.

First, the JNI native build. Under jni/ is a CMake project that compiles C++ glue (faiss_wrapper.cpp, nmslib_wrapper.cpp, and the generated org_opensearch_knn_jni_* files) and links the faiss and nmslib libraries from jni/external/. The Java side calls into it through JNIService, which dispatches to FaissService or NmslibService. Those classes declare native methods whose implementations live in the compiled .so/.dylib/.dll. JNICommons handles the off-heap memory bookkeeping; PlatformUtils probes the CPU for AVX2/AVX-512 so faiss uses the fastest SIMD path available — the same vectorization story as SIMD in Lucene, just on the C++ side.

grep -rn "native " src/main/java/org/opensearch/knn/jni/FaissService.java | head
ls jni/src/                          # the C++ that backs those native methods
grep -rn "isAVX512SupportedBySystem\|isAVX2SupportedBySystem" \
  src/main/java/org/opensearch/knn/jni/PlatformUtils.java

Second, the custom Lucene codec. When you index into a faiss or nmslib field, the HNSW/IVF graph is not stored the way a normal Lucene field is. k-NN supplies a custom codec — the current one is KNN1030Codec (older versions live under codec/backward_codecs/, e.g. KNN990Codec, for reading older segments). Its KnnVectorsFormat is NativeEngines990KnnVectorsFormat, which, at flush/merge time, calls JNI to build the native graph and writes it as extra segment files beside Lucene's normal .cfs/.fdt/.tim/etc.

# The current codec and its vectors format.
ls src/main/java/org/opensearch/knn/index/codec/KNN1030Codec/
find src/main/java -name "NativeEngines990KnnVectorsFormat.java"

# Older codecs kept only for reading older segments (backward compatibility).
ls src/main/java/org/opensearch/knn/index/codec/backward_codecs/

The lucene engine is the exception: it does not use k-NN's native codec at all. It writes vectors with Lucene's own KnnFloatVectorField and HNSW format (.vec/.vex/.vem), exactly as described in HNSW in Lucene. This is the single deepest architectural fork inside k-NN, and the next chapter (engines) is built around it: faiss/nmslib write native graphs through k-NN's codec into off-heap memory; lucene writes Lucene graphs through Lucene's codec onto the JVM heap. Keep that sentence; it explains most of k-NN's behavior differences.

flowchart TD
    M["knn_vector mapping: engine = ?"] -->|faiss / nmslib| NC["KNN1030Codec<br/>NativeEngines990KnnVectorsFormat"]
    M -->|lucene| LC["Lucene's KnnVectorsFormat<br/>Lucene99HnswVectorsFormat"]
    NC -->|JNI build| NG["native graph segment files<br/>(loaded OFF-HEAP via NativeMemoryCacheManager)"]
    LC -->|Lucene write| LG[".vec / .vex / .vem<br/>(Lucene's own format, on JVM heap)"]

The index path: a vector becomes a graph in a segment

Follow one float array from curl to disk. This is the same general write path as core indexing — IndexShard → InternalEngine → Lucene IndexWriter — with k-NN inserting itself at the field-mapping and codec layers.

flowchart TD
    R["POST /index/_doc  { embedding: [..] }"] --> SHARD["IndexShard.applyIndexOperationOnPrimary"]
    SHARD --> FT["KNNVectorFieldMapper.parseCreateField<br/>validates dim/space/method, stores the float[]"]
    FT --> ENG["InternalEngine.index -> Lucene IndexWriter.addDocument"]
    ENG --> FLUSH["flush / merge"]
    FLUSH --> CODEC{"engine?"}
    CODEC -->|faiss/nmslib| NF["NativeEngines990KnnVectorsFormat<br/>-> JNIService -> FaissService (build HNSW/IVF)"]
    CODEC -->|lucene| LF["Lucene's HNSW format builds the graph"]
    NF --> SEG["native graph written as segment files"]
    LF --> SEG2[".vec/.vex/.vem written by Lucene"]

KNNVectorFieldMapper.Builder parses the mapping once: dimension, space_type, data_type, and the method (or model_id). It validates them against the chosen KNNEngine's declared capabilities (you cannot, e.g., ask the lucene engine for IVF; that check happens here, whereas query-time capabilities such as filtering are checked later, in the query path). It produces a KNNVectorFieldType.
On each document, KNNVectorFieldMapper.parseCreateField reads the float[]/byte[], checks the dimension, and stores it.
The graph isn't built per-document — it's built at flush (segment creation) and rebuilt at merge, because Lucene segments are immutable (refresh/flush/merge). The codec's KnnVectorsWriter gathers all the vectors in the flushing segment and calls JNI once to build the whole graph, which is far cheaper than incremental inserts.
The result is written as segment files. For faiss/nmslib these are k-NN's own files; for lucene they're Lucene's .vec/.vex/.vem.

# The field mapper that validates and stores.
grep -rn "parseCreateField\|class Builder\|class KNNVectorFieldMapper" \
  src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java | head

# The codec's writer that calls JNI at flush/merge to build the graph.
grep -rn "KnnVectorsWriter\|flush\|mergeOneField\|JNIService" \
  src/main/java/org/opensearch/knn/index/codec/ | head

The query path: `knn` query → native search → score

The mirror image: a query vector goes down to faiss and the nearest k come back. This is a SearchPlugin contribution that slots into the normal search execution fan-out — the coordinating node fans out to shards, each shard runs the query phase, results merge.

flowchart TD
    Q["knn query in _search body"] --> QB["KNNQueryBuilder.doToQuery"]
    QB --> KQ["KNNQuery (per-shard Lucene Query)"]
    KQ --> W["KNNWeight.scorer per segment"]
    W --> CB{"graph loaded off-heap?"}
    CB -->|miss| LOAD["NativeMemoryCacheManager loads via JNIService -> FaissService"]
    CB -->|hit| SRCH
    LOAD --> SRCH["JNI: faiss searches the graph, returns docIds + distances"]
    SRCH --> SC["KNNScorer: distance -> score (SpaceType.scoreTranslation)"]
    SC --> COLL["Lucene collector: top-k per shard"]
    COLL --> RED["coordinating node merges per-shard top-k"]

KNNQueryBuilder.doToQuery validates the query against the field's mapping (dimension must match, filter only if the engine supports it, radial only if supported) and builds a KNNQuery.
Per segment, KNNWeight obtains the loaded native graph from NativeMemoryCacheManager (loading it on a cache miss), applies any filter as a BitSet the engine traverses against, and calls JNI to retrieve the nearest k doc IDs with their distances.
KNNScorer converts each engine distance into a Lucene score via the SpaceType's score translation (so nearer = higher, composable with normal scoring).
The per-shard top-k results merge on the coordinating node exactly as in any other search. The full mechanics (filtering, exact rescoring, the k vs size distinction) are the subject of the k-NN query path chapter.

grep -rn "doToQuery\|class KNNQuery\b\|class KNNWeight\|class KNNScorer" \
  src/main/java/org/opensearch/knn/index/query/ | head
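
The last two steps of the query path (distance to score, then the top-k merge) can be sketched in isolation. This is a toy model, not the plugin's code: `QueryPathSketch`, `ScoredDoc`, `l2Score`, and `mergeTopK` are hypothetical names, `docId` stands in for a shard-qualified document id, and the `l2` translation is assumed to be `1 / (1 + d)` over the squared distance, which is the form the OpenSearch k-NN documentation lists. Confirm the real translation against `SpaceType` in your checkout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of the last two query-path steps. Names here are hypothetical, not k-NN classes.
final class QueryPathSketch {

    /** A hit with a Lucene-style score: higher means nearer. */
    record ScoredDoc(int docId, float score) {}

    /** Assumed l2 translation: squared distance 0 maps to 1.0, larger distances approach 0. */
    static float l2Score(float squaredDistance) {
        return 1.0f / (1.0f + squaredDistance);
    }

    /** Merges per-shard top-k lists into one global top-k, best score first. */
    static List<ScoredDoc> mergeTopK(List<List<ScoredDoc>> perShard, int k) {
        Comparator<ScoredDoc> byScore = Comparator.comparingDouble(ScoredDoc::score);
        // Min-heap on score: the root is always the weakest hit still kept.
        PriorityQueue<ScoredDoc> heap = new PriorityQueue<>(byScore);
        for (List<ScoredDoc> shardHits : perShard) {
            for (ScoredDoc hit : shardHits) {
                heap.offer(hit);
                if (heap.size() > k) {
                    heap.poll(); // evict the weakest so at most k remain
                }
            }
        }
        List<ScoredDoc> merged = new ArrayList<>(heap);
        merged.sort(byScore.reversed());
        return merged;
    }

    public static void main(String[] args) {
        List<ScoredDoc> shard0 = List.of(new ScoredDoc(1, l2Score(0.1f)), new ScoredDoc(2, l2Score(4.0f)));
        List<ScoredDoc> shard1 = List.of(new ScoredDoc(7, l2Score(1.0f)));
        System.out.println(mergeTopK(List.of(shard0, shard1), 2));
    }
}
```

Because the translation is monotonically decreasing in distance, ranking by score and ranking by distance agree, which is what lets a k-NN clause compose with ordinary scored clauses.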

Native memory and the circuit breaker

This is the subsystem that makes k-NN operationally different from every other plugin, and the one most likely to be the subject of an issue you pick up. faiss/nmslib graphs live outside the JVM heap, in native memory, loaded on first query (or via the warmup API).

Concept	Class	Job
The cache of loaded graphs	`NativeMemoryCacheManager`	guava-cache keyed by segment graph file; holds `NativeMemoryAllocation`s
A loaded graph allocation	`NativeMemoryAllocation`	a native pointer + its size, with eviction/close semantics
Loading a graph	`NativeMemoryLoadStrategy` → `JNIService`	reads/`mmap`s the graph file into native memory
The off-heap cap	`KNNCircuitBreaker`	trips when loading another graph would exceed the limit
The settings	`KNNSettings`	`knn.memory.circuit_breaker.enabled`, `knn.memory.circuit_breaker.limit`

find src/main/java -name "NativeMemoryCacheManager.java" -o -name "KNNCircuitBreaker.java"
grep -n "KNN_MEMORY_CIRCUIT_BREAKER_ENABLED\|KNN_MEMORY_CIRCUIT_BREAKER_CLUSTER_LIMIT" \
  src/main/java/org/opensearch/knn/index/KNNSettings.java

The circuit breaker is deliberately separate from core OpenSearch's circuit breakers, which guard heap. k-NN's memory is off-heap, invisible to those breakers and to a JVM heap dump — so k-NN needs its own accounting. The design of this guard is actively debated: Investigate rearchitecture of the native memory circuit breaker — k-NN #1582. Read that issue: it's a concentrated dose of the hard questions (per-node vs per-index limits, eviction policy, the interaction with the OS page cache) that off-heap memory management raises.

Warning: because graph memory is off-heap, a leak here does not show up in a heap dump and does not trip the core circuit breakers. You diagnose it through GET /_plugins/_knn/stats (graph_memory_usage, evictions, breaker state) and OS-level RSS, not the JVM tools you'd reach for with any other plugin. See native integration and memory.
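
To make the accounting idea concrete, here is a minimal sketch of a byte budget that trips when one more graph would not fit. It is a toy under stated assumptions: `OffHeapBudgetSketch` and its methods are hypothetical, it rejects the load rather than evicting, and it resets the breaker as soon as anything is freed. The plugin's actual eviction and reset policy lives in `NativeMemoryCacheManager` and `KNNCircuitBreaker` and is part of what #1582 discusses.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of off-heap accounting. Hypothetical class, not the plugin's cache or breaker.
final class OffHeapBudgetSketch {
    private final long limitBytes;
    private long usedBytes;
    private boolean tripped;
    // Graph file -> bytes of native memory it occupies. In the real plugin these bytes
    // live outside the JVM heap, so a counter like this is the only place they are visible.
    private final Map<String, Long> loaded = new HashMap<>();

    OffHeapBudgetSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Returns true if the graph is resident after the call; trips the breaker if it does not fit. */
    boolean load(String graphFile, long sizeBytes) {
        if (loaded.containsKey(graphFile)) {
            return true; // cache hit: nothing new to allocate
        }
        if (usedBytes + sizeBytes > limitBytes) {
            tripped = true; // loading would exceed the cap
            return false;
        }
        loaded.put(graphFile, sizeBytes);
        usedBytes += sizeBytes;
        return true;
    }

    /** Frees one graph, as an eviction or clear-cache call would. Assumed to reset the breaker. */
    void evict(String graphFile) {
        Long freed = loaded.remove(graphFile);
        if (freed != null) {
            usedBytes -= freed;
            tripped = false;
        }
    }

    long usedBytes() {
        return usedBytes;
    }

    boolean isTripped() {
        return tripped;
    }
}
```

The point of the sketch is the shape of the problem: the JVM never sees these bytes, so whatever component owns the counter is the only guard between a burst of graph loads and the OS running out of memory.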

Training and the model system index

HNSW is train-free, but IVF and PQ need training (their centroids and codebooks are learned from a data sample — see algorithms). k-NN handles this with a dedicated workflow backed by a system index.

# The training action and the model data-access layer.
find src/main/java -name "TrainingModelTransportAction.java" -o -name "TrainingModelRequest.java"
grep -n "MODEL_INDEX_NAME\|class ModelDao\|class Model\b\|enum ModelState" \
  src/main/java/org/opensearch/knn/indices/ModelDao.java \
  src/main/java/org/opensearch/knn/common/KNNConstants.java
# MODEL_INDEX_NAME = ".opensearch-knn-models"

POST /_plugins/_knn/models/<id>/_train hits RestTrainModelHandler → TrainingModelTransportAction, which runs k-means (IVF centroids / PQ codebooks) over a training sample and writes a serialized model.
The model is stored as a document in the .opensearch-knn-models system index — an ordinary replicated OpenSearch index that k-NN owns and declares via getSystemIndexDescriptors() (SystemIndexPlugin). It is cluster-managed metadata, not shard-local: the cluster manager coordinates its creation and the model's lifecycle state (training → created, or failed), tracked through ModelDao/ModelState.
A field then references the trained model by model_id instead of an inline method.

This is why KNNPlugin is a SystemIndexPlugin: the model store must be a protected index the engine knows not to let users mutate directly. The full workflow, with the sequence diagram, is in algorithms.
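
The lifecycle described above (training, then created or failed) is small enough to write down as a state machine. The sketch below is illustrative only: `ModelLifecycleSketch` and its nested `State` enum are hypothetical stand-ins, and the plugin's own type is `ModelState`.

```java
// Hypothetical sketch of the model lifecycle; the plugin's own enum is ModelState.
final class ModelLifecycleSketch {
    enum State { TRAINING, CREATED, FAILED }

    /** Allowed moves: training -> created, or training -> failed. Terminal states do not move. */
    static boolean canTransition(State from, State to) {
        return from == State.TRAINING && (to == State.CREATED || to == State.FAILED);
    }

    /** A field that references a model_id is only usable once the model has reached CREATED. */
    static boolean usableByField(State state) {
        return state == State.CREATED;
    }

    public static void main(String[] args) {
        for (State s : State.values()) {
            System.out.println(s + " usable by a field: " + usableByField(s));
        }
    }
}
```

This is the check behind a `model_id` mapping being rejected while the model is still training or has failed.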

The full wiring diagram

Putting it together — KNNPlugin's interfaces on the left, the subsystems they register on the right, and the two data paths that flow through them.

flowchart LR
    subgraph plugin["KNNPlugin (one class)"]
        MP["MapperPlugin"] --> FM["KNNVectorFieldMapper / KNNVectorFieldType"]
        SP["SearchPlugin"] --> QP["KNNQueryBuilder -> KNNQuery -> KNNWeight/KNNScorer"]
        AP["ActionPlugin"] --> RH["Rest*Handler + *TransportAction<br/>warmup, stats, train, clear-cache"]
        EP["EnginePlugin"] --> EF["EngineFactory + custom codec wiring"]
        SCP["ScriptPlugin"] --> SE["knn_score script engine"]
        SIP["SystemIndexPlugin"] --> SI[".opensearch-knn-models (ModelDao)"]
    end
    FM -->|index path| CODEC["KNN1030Codec / NativeEngines990KnnVectorsFormat"]
    QP -->|query path| MEM["NativeMemoryCacheManager + KNNCircuitBreaker"]
    CODEC <-->|JNI| JNI["JNIService -> FaissService / NmslibService"]
    MEM <-->|JNI| JNI
    JNI <-->|links| NATIVE["faiss / nmslib (C++ under jni/)"]
    EF --> CODEC
    SI -.model_id.-> FM

Every box is a grep away. The point of this chapter is that there are no hidden pieces: eleven interfaces, a field mapper, a query, a codec, a native cache, a JNI bridge, and a model index — and the two arrows (index path, query path) that thread through them.

Common bugs and symptoms

Symptom	Likely cause	Where to look
`./gradlew run` fails on a C++/CMake step	submodules not initialized	`git submodule update --init --recursive`; `jni/external/`
Native graph not built; queries do exact scan	field mapped without a `method` (pure stored field)	`KNNVectorFieldMapper` mapping; add a `method`
`circuit_breaker is triggered` on queries	off-heap limit hit loading graphs	`KNNCircuitBreaker`, `knn.memory.circuit_breaker.limit`; #1582
RSS grows but heap dump looks fine	off-heap graph memory (invisible to heap tools)	`_plugins/_knn/stats` graph memory; `NativeMemoryCacheManager`
`model_id` field rejects docs	model state `training`/`failed`, not `created`	`ModelDao`, `GET _plugins/_knn/models/<id>`, `.opensearch-knn-models`
Filter/radial/IVF mapping rejected on an engine	engine capability check (e.g. nmslib has no filter)	`KNNEngine` capability sets; `KNNQueryBuilder`/mapper validation
Old segments unreadable after upgrade	backward codec missing for that segment version	`codec/backward_codecs/`; check the codec version chain
Plugin loads but `knn` query is `unknown [knn]`	wrong plugin version for the node	`opensearch.version` in descriptor; rebuild — see plugin arch

Validation: prove you understand this

List the extension interfaces KNNPlugin implements and name exactly one thing each contributes. Which one registers the field type, which the query, which the model system index, and which the exact-scoring script?
Explain why the custom Lucene codec is not registered as an extension interface, and how it is selected for a faiss field but bypassed for a lucene field.
Trace the index path for one float[] from IndexShard to a segment file, naming the field mapper, the codec/vectors-format, and the JNI call — and say when the graph is actually built (and why not per-document).
Trace the query path for one knn query, naming KNNQueryBuilder, KNNQuery, KNNWeight/KNNScorer, the native-memory load, and the distance→score step.
Explain why k-NN needs its own circuit breaker instead of using core OpenSearch's, and how you would diagnose an off-heap memory problem (which tools, which stats). Cite the rearchitecture issue.
Describe the training/model flow: what runs k-means, where the model is stored, what makes that index "system," which node coordinates it, and how a field references the result.

When you can do all six, read Engines to see how the faiss/lucene/nmslib fork plays out in capabilities and segment files, then the query path and native integration and memory for the two data paths in full depth. For the algorithms behind method, see algorithms; for the contribution landscape and live RFCs, real issues and RFCs.

Engines: Faiss, Lucene, and NMSLIB

The architecture chapter ended on one load-bearing sentence: faiss/nmslib write native graphs through k-NN's codec into off-heap memory; lucene writes Lucene graphs through Lucene's codec onto the JVM heap. That fork is the engine mapping parameter. It is the single most consequential choice you make when you map a knn_vector field, and it decides almost everything downstream: which algorithms you can use, whether vectors live on or off the heap, whether filtering is exact or post-hoc, whether you can quantize, whether you need to train, and which segment files end up on disk.

This chapter is the comparison. It does not re-derive the plugin wiring (that is architecture) or the algorithms themselves (that is algorithms — HNSW, IVF, PQ). It answers a narrower, sharper question: given three engines that all do "approximate nearest neighbor," why are there three, what does each actually give you, and how do you choose. By the end you can fill in the capability matrix from memory, justify a default, and read an engine-related issue without guessing.

Note: "engine" here is k-NN's word, set per field as method.engine in the knn_vector mapping. It is unrelated to OpenSearch's Engine/InternalEngine (the shard write abstraction in engine internals). Same word, different layer. When this chapter says "engine" it means faiss / lucene / nmslib.

After this chapter you can:

Name the three engines, what each is, and the one-line reason each exists.
Fill in the capability matrix (algorithms, quantization, filtering, memory location, training, byte/binary vectors) without looking.
Choose an engine for a given workload and defend the choice.
Explain how the engine value flows through the custom codec into segment files.
Read the "new vector engine" meta-issue and place today's engines on its roadmap.

Why three engines at all

A reasonable question on your first day: if all three implement HNSW, why does k-NN ship three of them? The answer is history plus a genuine three-way trade.

nmslib came first. It is a mature C++ ANN library (Non-Metric Space Library) that does HNSW and nothing else here. It was k-NN's original engine and proved the whole idea: load a graph off-heap, search it via JNI, return doc IDs. It has now served its purpose and is deprecated — more below.
faiss is Meta's vector library. It does HNSW and IVF, and — crucially — it is where the memory story lives: Product Quantization, scalar/FP16 quantization, binary quantization, and the disk-based ANN modes all run through faiss. It is also where the future is being built (GPU and remote index build). It is the default.
lucene is not an external library at all. It is Lucene's own native HNSW vector format, the same one described in HNSW in Lucene. Choosing it means k-NN gets out of the native-memory business entirely for that field: vectors live on the heap, in Lucene's own segment files, searched by Lucene's own code, with exact pre-filtering that the native engines cannot match.

So the three exist because they sit at different points of one triangle:

flowchart TD
    F["faiss<br/>algorithms + quantization + future (GPU/remote)<br/>off-heap, native JNI"]
    L["lucene<br/>exact filtering + heap-managed + no JNI<br/>on-heap, Lucene format"]
    N["nmslib<br/>HNSW only, deprecated<br/>off-heap, native JNI"]
    F --- L
    L --- N
    N --- F

faiss optimizes for capability and scale (every algorithm, every quantization, GPU on the horizon) at the cost of off-heap complexity.
lucene optimizes for operational simplicity and filtering quality at the cost of features (no IVF, no PQ, on-heap memory pressure).
nmslib optimized for being first and simple — and is now a legacy read path.

faiss — the default

The faiss engine builds a faiss index (HNSW or IVF) per segment, in C++, via JNI, and loads it off the JVM heap into native memory. Everything in the native integration and memory chapter — the NativeMemoryCacheManager, the warmup API, the off-heap circuit breaker — is primarily about faiss.

What faiss gives you that the others do not:

Two algorithms. HNSW (graph, train-free) and IVF (inverted-file / cluster-based, needs training). See algorithms for the mechanics.
The whole quantization menu. Product Quantization (PQ) for IVF and HNSW, scalar/FP16 quantization, Binary Quantization (BQ), and disk-based vector search (on_disk mode with compression_level 1x/2x/4x/8x/16x/32x and rescoring on full-precision vectors). This is the entire quantization and disk-ANN chapter, and it is almost entirely a faiss story.
byte and binary vectors (data_type: byte from 2.17+; binary via the hamming space).
SIMD on the C++ side. faiss picks an AVX2/AVX-512 path at runtime (PlatformUtils probes the CPU — the same vectorization theme as SIMD and the Vector API, but in C++, not the JDK Panama API).

cd ~/src/oss-repos/k-NN
# The faiss engine definition and its declared methods/capabilities.
ls src/main/java/org/opensearch/knn/index/engine/faiss/
grep -rn "class Faiss\|HNSW\|IVF\|encoder\|pq\|sq" \
  src/main/java/org/opensearch/knn/index/engine/faiss/ | head -20

# The C++ side it bridges to.
ls jni/src/                 # faiss_wrapper.cpp ...
grep -rn "native " src/main/java/org/opensearch/knn/jni/FaissService.java | head

Where faiss is going: GPU and remote build

The two most exciting live RFCs in the whole plugin are both about moving faiss index construction off the indexing node. Building an HNSW or IVF graph is the expensive part of indexing vectors; the search is comparatively cheap. So:

GPU build. [RFC] Boosting OpenSearch Vector Engine Performance using GPUs — k-NN #2293 proposes building vector indexes on NVIDIA GPUs via cuVS and the CAGRA graph algorithm (FP32), then serving them. Graph construction is embarrassingly parallel; a GPU eats it.
Remote build. [RFC] Remote Vector Index Build — k-NN #2294 proposes offloading segment vector-index construction to a separate remote fleet (CPU or GPU) instead of doing it inline on the data node during flush/merge. The data node ships the raw vectors out, gets a built graph back, and writes it as a segment file.

Both are faiss-engine features. They are also a clean example of an architecture you have already met: separating a write-heavy phase from the node serving reads — the same instinct behind reader/writer separation in core. Read both issues; they are the current frontier of this plugin and exactly the kind of work the capstone projects point at.

lucene — the heap-managed, filter-friendly engine

The lucene engine is the architectural odd one out, and the reason is worth saying precisely: it does not use k-NN's native codec or JNI at all. When you map a field with engine: lucene, k-NN steps aside and lets Lucene store the vectors with its own KnnFloatVectorField/KnnByteVectorField and HNSW format (Lucene99HnswVectorsFormat and friends), exactly as in HNSW in Lucene. The graph lives in Lucene's .vec/.vex/.vem segment files, on the JVM heap (memory-mapped, managed by Lucene), not in k-NN's off-heap native cache.

That single fact produces every difference that matters:

Consequence	Why
No off-heap circuit breaker concern	vectors are in Lucene's files on heap-managed memory; the `knn.memory.circuit_breaker.*` limit and `NativeMemoryCacheManager` do not apply
Exact, efficient pre-filtering	the filter is applied inside Lucene's HNSW search as the graph is traversed, not as a post-hoc step — this is lucene's signature advantage
No IVF, no PQ	Lucene's format does HNSW only; quantization is limited to Lucene's scalar quantization (mirrors `Lucene99ScalarQuantizedVectorsFormat`), not faiss-style PQ/BQ
No training	HNSW is train-free; without IVF/PQ there is nothing to train
Memory pressure shows up in the heap	a vector-heavy lucene index pressures the JVM heap and the OS page cache, visible to heap tools and core circuit breakers — the opposite of faiss

# The lucene engine and how it hands off to Lucene's own format.
ls src/main/java/org/opensearch/knn/index/engine/lucene/
grep -rn "Lucene\|KnnFloatVectorField\|HnswVectorsFormat\|ScalarQuantized" \
  src/main/java/org/opensearch/knn/index/engine/lucene/ | head -20

The headline reason to pick lucene is filtering. With faiss, a filtered k-NN query has historically meant a trade-off: filter-then-search can starve the graph of candidates; search-then-filter can return fewer than k. Lucene's HNSW applies the filter during traversal, so a filtered query is both exact and efficient. If your workload is "find the nearest vectors among documents matching this filter" and your vectors fit comfortably in heap, lucene is often the right answer even though faiss is the default.
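
The "fewer than k" failure is easy to reproduce with a brute-force stand-in. The sketch below replaces graph traversal with a sorted scan, so it only models the ordering of filter and search, not HNSW itself; `FilterOrderSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

// Toy brute-force model of filter ordering; real engines walk an HNSW graph instead.
final class FilterOrderSketch {

    /** Doc ids sorted by ascending distance to the query. */
    static List<Integer> nearestFirst(float[] distances) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < distances.length; i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparingDouble((Integer id) -> distances[id]));
        return ids;
    }

    /** Search-then-filter: take the k nearest, then drop non-matching docs. May return fewer than k. */
    static List<Integer> postFilter(float[] distances, IntPredicate filter, int k) {
        List<Integer> ranked = nearestFirst(distances);
        List<Integer> out = new ArrayList<>();
        for (int id : ranked.subList(0, Math.min(k, ranked.size()))) {
            if (filter.test(id)) {
                out.add(id);
            }
        }
        return out;
    }

    /** Filter while collecting: skip non-matching docs, so k matches are found whenever they exist. */
    static List<Integer> filterDuringSearch(float[] distances, IntPredicate filter, int k) {
        List<Integer> out = new ArrayList<>();
        for (int id : nearestFirst(distances)) {
            if (out.size() == k) {
                break;
            }
            if (filter.test(id)) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[] distances = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f};
        IntPredicate oddIds = id -> id % 2 == 1;
        System.out.println("post-filter:   " + postFilter(distances, oddIds, 2));
        System.out.println("during search: " + filterDuringSearch(distances, oddIds, 2));
    }
}
```

With a filter that rejects the single nearest document, the post-filter path has only one survivor out of its top two, while the filter-during-search path keeps going until it has two matches.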

nmslib — deprecated, read-only future

nmslib is the original engine and now the cautionary tale. It does HNSW only, off-heap, via JNI — the same shape as faiss but with none of faiss's algorithms, quantization, or roadmap.

Its status is the important part:

Deprecated since 2.19. Mapping a new field with engine: nmslib emits a deprecation warning.
Blocked for new indices in 3.0. You cannot create a new nmslib index in 3.0.
Still read for backward compatibility. Existing nmslib segments from older indices are still loaded and queried, through the codec's backward_codecs/, so an upgraded cluster keeps serving them.

# The deprecation / block lives in the engine + mapper validation. Grep to find it.
grep -rn "nmslib\|NMSLIB\|deprecat\|isCreate\|3\.0\|blocked" \
  src/main/java/org/opensearch/knn/index/engine/ | grep -i nmslib | head
# Backward codecs keep reading old nmslib/faiss segments after upgrade.
ls src/main/java/org/opensearch/knn/index/codec/backward_codecs/

Warning: "deprecated" and "blocked for new indices" are not "removed." If you touch nmslib code, you are almost always working on the read/BWC path — keep existing segments readable forever (or until an explicit major-version drop). A PR that breaks reading an old nmslib segment is a backward-compatibility regression, which is the compatibility failure mode the maintainers most reliably reject. Do not "clean up" nmslib by deleting its read path.

The practical rule: never reach for nmslib on a new index. If you find yourself wanting HNSW-off-heap, that is faiss. nmslib exists today only so that yesterday's indices keep working.

The capability matrix

Memorize this table. It is the chapter. When an issue or a user asks "can engine X do Y," this is the lookup.

Capability	faiss	lucene	nmslib
Default?	yes	no	no
Algorithms	HNSW + IVF	HNSW only	HNSW only
Training needed?	only for IVF / PQ	no	no
Product Quantization (PQ)	yes (IVF & HNSW)	no	no
Binary Quantization (BQ)	yes	no	no
Scalar / FP16 quantization	yes (FP16 SQ)	yes (Lucene scalar SQ)	no
Disk-based (`on_disk`, compression 1x–32x + rescore)	yes	no	no
byte vectors (`data_type: byte`, 2.17+)	yes	yes	no
binary vectors (`hamming` space)	yes	no	no
Filtering	post/pre-filter (engine-mediated)	exact pre-filter during HNSW traversal	none
Radial search (`min_score`/`max_distance`)	yes	limited / engine-dependent	no
Memory location	off-heap (native, `NativeMemoryCacheManager`)	on-heap (Lucene files, page cache)	off-heap (native)
Uses JNI / native C++	yes (faiss)	no	yes (nmslib)
Codec	k-NN's `KNN10xxCodec` + `NativeEngines990KnnVectorsFormat`	Lucene's `Lucene9xHnswVectorsFormat`	k-NN's codec (BWC read)
GPU / remote build roadmap	yes (#2293, #2294)	no	no
Status	active, default	active	deprecated 2.19; blocked for new in 3.0; read-only BWC

Note: exact capability boundaries shift release to release (byte-vector support, which spaces a quantizer allows, radial support). The matrix above is the shape; always confirm the boundary in your checkout by grepping the engine's declared methods and the mapper's validation, rather than trusting a table — including this one. The validation lives in the field mapper and the per-engine method definitions.

# Confirm a capability boundary instead of trusting the table.
grep -rn "supportedDataTypes\|validateMethod\|isFilterSupported\|trainingRequired\|SpaceType" \
  src/main/java/org/opensearch/knn/index/engine/ | head -20
# The mapper rejects illegal combinations (e.g. IVF on the lucene engine) here.
grep -rn "validate\|UnsupportedOperation\|not supported" \
  src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java | head
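
A few rows of the matrix, written as the kind of check a mapper performs, might look like the sketch below. Everything in it is hypothetical (`EngineCapabilitySketch`, the nested enums, the `validate` signature, the message strings); it encodes only the Algorithms, PQ, and Status rows as this chapter states them, and the real validation is spread across the field mapper and the per-engine method definitions.

```java
import java.util.EnumSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical encoding of a few matrix rows. The real checks live in the mapper and engine classes.
final class EngineCapabilitySketch {
    enum Engine { FAISS, LUCENE, NMSLIB }
    enum Method { HNSW, IVF }

    /** Algorithms row: faiss has HNSW and IVF, the other two HNSW only. */
    static Set<Method> methods(Engine engine) {
        return engine == Engine.FAISS
            ? EnumSet.of(Method.HNSW, Method.IVF)
            : EnumSet.of(Method.HNSW);
    }

    /** PQ row: faiss only. */
    static boolean supportsPq(Engine engine) {
        return engine == Engine.FAISS;
    }

    /** Returns a rejection reason, or empty when the combination is allowed. */
    static Optional<String> validate(Engine engine, Method method, boolean usesPq,
                                     boolean newIndex, int majorVersion) {
        if (engine == Engine.NMSLIB && newIndex && majorVersion >= 3) {
            return Optional.of("nmslib is blocked for new indices in 3.0");
        }
        if (!methods(engine).contains(method)) {
            return Optional.of(method + " is not available on " + engine);
        }
        if (usesPq && !supportsPq(engine)) {
            return Optional.of("PQ is not available on " + engine);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(validate(Engine.LUCENE, Method.IVF, false, true, 3));
        System.out.println(validate(Engine.FAISS, Method.IVF, true, true, 3));
    }
}
```

Note the `newIndex` flag: an existing nmslib index on a 3.x cluster still validates, which is the backward-compatible read path rather than a new mapping.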

When to choose which

A decision procedure, not a religion. Walk it top to bottom; take the first match.

If your need is…	Choose	Because
Filtered k-NN ("nearest among docs matching X"), vectors fit in heap	lucene	exact pre-filter during traversal; no off-heap to manage
Billions of vectors / tight memory budget (PQ, BQ, disk-ANN)	faiss	only faiss has IVF + PQ/BQ + `on_disk` compression
Largest scale where index build is the bottleneck	faiss	GPU / remote build roadmap (#2293, #2294)
Binary embeddings (hamming distance)	faiss	binary vectors live on faiss
"I don't know, give me the safe default"	faiss	it is the default; widest capability surface
A brand-new index and you typed `nmslib`	stop	blocked in 3.0; use faiss
Reading an old index that already uses nmslib	nmslib (BWC)	leave it; the read path keeps it alive

The three-line summary to carry around:

Filtering + heap-friendly → lucene.
Everything else, especially scale and quantization → faiss.
nmslib → never new; legacy reads only.

flowchart TD
    Start["mapping a knn_vector field"] --> Q1{"need to filter, and<br/>vectors fit in heap?"}
    Q1 -->|yes| LUC["engine: lucene<br/>(exact pre-filter, on-heap)"]
    Q1 -->|no| Q2{"need IVF / PQ / BQ /<br/>disk-ANN / binary / GPU?"}
    Q2 -->|yes| FAI["engine: faiss"]
    Q2 -->|no| Q3{"brand-new index?"}
    Q3 -->|yes| FAI2["engine: faiss (the default)"]
    Q3 -->|reading old index| NMS["nmslib (BWC read only)"]
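
The flowchart above reads naturally as a first-match function. The sketch below is one hypothetical encoding of it (`EngineChoiceSketch`, `Workload`, and the field names are made up for illustration); it adds no rules beyond the three questions in the chart.

```java
// Hypothetical encoding of the decision flowchart: the first matching question wins.
final class EngineChoiceSketch {
    enum Engine { FAISS, LUCENE, NMSLIB }

    /**
     * needsFaissOnlyFeature covers IVF, PQ, BQ, disk-ANN, binary vectors, or GPU build.
     * brandNewIndex = false means an existing nmslib index that is only being read.
     */
    record Workload(boolean needsFilter, boolean fitsInHeap,
                    boolean needsFaissOnlyFeature, boolean brandNewIndex) {}

    static Engine choose(Workload w) {
        if (w.needsFilter() && w.fitsInHeap()) {
            return Engine.LUCENE; // exact pre-filter, on-heap
        }
        if (w.needsFaissOnlyFeature()) {
            return Engine.FAISS;
        }
        // Nothing special needed: faiss for new indices, nmslib only as a legacy read.
        return w.brandNewIndex() ? Engine.FAISS : Engine.NMSLIB;
    }

    public static void main(String[] args) {
        System.out.println(choose(new Workload(true, true, false, true)));
        System.out.println(choose(new Workload(false, false, true, true)));
    }
}
```

Writing it this way makes the ordering explicit: the filter question is asked before the feature question, so a workload that needs both filtering and a faiss-only feature has to be resolved by deciding which of the two inputs is really a hard requirement.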

How engine choice flows through the codec to segment files

This is where the abstraction becomes files on disk, and it is the part new contributors most often hold only fuzzily. The engine value chosen in the mapping decides, at flush and merge time, which writer runs and which files appear.

flowchart TD
    MAP["knn_vector mapping<br/>method.engine = faiss | lucene | nmslib"] --> FT["KNNVectorFieldType<br/>(carries the KNNEngine)"]
    FT --> FLUSH["flush / merge<br/>(segment becomes immutable)"]
    FLUSH --> SEL{"which KnnVectorsFormat<br/>does the codec pick?"}
    SEL -->|faiss / nmslib| NEF["NativeEngines990KnnVectorsFormat<br/>(k-NN's KNN10xxCodec)"]
    SEL -->|lucene| LF["Lucene's Lucene9xHnswVectorsFormat<br/>(Lucene's default codec path)"]
    NEF -->|JNIService -> FaissService/NmslibService| NATIVE["build native graph in C++"]
    NATIVE --> NFILES["k-NN native graph files<br/>written as EXTRA segment files,<br/>loaded OFF-HEAP at query time"]
    LF -->|Lucene writes directly| LFILES[".vec / .vex / .vem<br/>(Lucene's own format, ON-HEAP)"]

Trace it as four facts:

The mapper records the chosen KNNEngine on the KNNVectorFieldType. This is the only place the decision is made; everything downstream just reads it.
At flush/merge (segments are immutable — see refresh/flush/merge), the custom codec consults the engine:
- faiss/nmslib → NativeEngines990KnnVectorsFormat, which calls JNIService → FaissService/NmslibService to build the whole graph in C++ and write it as extra segment files beside Lucene's normal .cfs/.fdt/.tim. These load off-heap at query time.
- lucene → no JNI; Lucene's own KnnVectorsFormat writes .vec/.vex/.vem directly, on-heap.
Because the engine is recorded per field and the graph is built per segment, a single index can in principle host faiss and lucene fields side by side; the codec dispatches per field.
After a version upgrade, old faiss/nmslib segments are read by backward_codecs/; lucene-engine segments are read by Lucene's own backward codec machinery. This is why deleting a backward codec is a BWC break.

# Prove which files an engine produces. Build and run an index per engine, then:
find /path/to/data/nodes/0/indices/*/0/index -type f | sed 's/.*\.//' | sort | uniq -c
# A faiss field adds k-NN's native files; a lucene field adds .vec/.vex/.vem.
# Find the codec's per-engine dispatch in the writer:
grep -rn "NativeEngines990KnnVectorsFormat\|getKnnVectorsFormatForField\|engine" \
  src/main/java/org/opensearch/knn/index/codec/ | head
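
The per-field dispatch can be pictured as a lookup from field name to engine to format. The sketch below is a dependency-free illustration, not the codec: `FormatDispatchSketch` is hypothetical, the field names are made up, and the returned strings are just the format names as this chapter writes them (including the `Lucene9x` placeholder), not real class lookups.

```java
import java.util.Map;

// Hypothetical per-field dispatch; returns labels, not real Lucene or k-NN format objects.
final class FormatDispatchSketch {
    enum Engine { FAISS, LUCENE, NMSLIB }

    /** Which vectors format writes a field, given the engine recorded on its field type. */
    static String formatFor(Engine engine) {
        return switch (engine) {
            // Native engines: JNI builds the graph, written as extra segment files.
            case FAISS, NMSLIB -> "NativeEngines990KnnVectorsFormat";
            // Lucene engine: Lucene writes .vec/.vex/.vem itself, no JNI.
            case LUCENE -> "Lucene9xHnswVectorsFormat";
        };
    }

    /** Looks up the engine recorded for a field, then dispatches on it. */
    static String formatForField(Map<String, Engine> fieldEngines, String field) {
        Engine engine = fieldEngines.get(field);
        if (engine == null) {
            throw new IllegalArgumentException("not a knn_vector field: " + field);
        }
        return formatFor(engine);
    }

    public static void main(String[] args) {
        // One index, two vector fields on different engines (field names are illustrative).
        Map<String, Engine> fields = Map.of("title_vec", Engine.FAISS, "image_vec", Engine.LUCENE);
        fields.keySet().forEach(f -> System.out.println(f + " -> " + formatForField(fields, f)));
    }
}
```

Because the lookup is keyed by field, two fields in the same segment can end up with different formats and therefore different file sets on disk.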

The "new vector engine" meta

Three engines is not a stable equilibrium — it is a snapshot. The forward-looking view is captured in [META] Supporting New Vector Engine in OpenSearch — k-NN #2605. Read it for two reasons.

First, it frames the engines not as fixed products but as pluggable backends behind one KNNEngine abstraction — which is exactly why adding a new one (or deprecating nmslib) is a contained change rather than a rewrite. The engine param, the per-engine method/capability declarations, the codec dispatch you just traced: all of it is the seam designed to let engines come and go.

Second, it is where the roadmap lives: where GPU/remote build (#2293, #2294) fit, how quantization evolves, and what "next engine" would even mean. If you want to contribute somewhere genuinely open, the meta-issue and the RFCs it links are the map. Use a GitHub search like repo:opensearch-project/k-NN is:issue label:Enhancement engine to find the live sub-tasks; do not assume the numbers above are the whole list.

Note: Treat #2605 as the index and #2293/#2294 as the current frontier chapters. The durable skill is reading a meta-issue, following its task list, and finding the one unclaimed, well-scoped sub-task — see design via GitHub.

Common bugs and symptoms

Symptom	Likely cause	Where to look
`engine [nmslib] ... not supported` on a new index	nmslib blocked for new indices in 3.0	switch to faiss; engine validation in the mapper
Mapping with `engine: lucene` + `method: ivf` rejected	lucene engine has no IVF	capability check in the lucene engine's declared methods
`engine: lucene` field shows no off-heap usage in `_knn/stats`	correct — lucene is on-heap, not in `NativeMemoryCacheManager`	nothing to fix; expected
Heap pressure / GC spikes after adding vectors	a large lucene-engine field lives on heap	consider faiss (off-heap) or quantization; circuit breakers
Filtered faiss query returns fewer than `k`	post-filtering starves results	prefer lucene for filtered workloads, or tune; query path
PQ / `on_disk` rejected on a lucene field	quantization/disk-ANN is a faiss feature	use faiss; quantization
Old segments unreadable after upgrade	a backward codec for that engine/version was removed	`codec/backward_codecs/`; never delete the read path
`circuit_breaker is triggered` only with faiss/nmslib	off-heap graph load hit the native limit	`KNNCircuitBreaker`, `knn.memory.circuit_breaker.limit`; lucene wouldn't hit this
New index slow to build with billions of vectors	inline graph construction is the bottleneck	the GPU/remote-build RFCs (#2293, #2294) target exactly this

Validation: prove you understand this

Name the three engines and give the one-line reason each exists. Which is the default, and which is blocked for new indices in 3.0 (and since which version was it deprecated)?
Reproduce the capability matrix from memory for at least these rows: algorithms, PQ, filtering, memory location (on/off heap), training, and JNI usage.
Explain why the lucene engine has exact filtering but no PQ, in terms of whose format and code it uses. Why does a lucene-engine field never appear in the off-heap native-memory stats?
A user has 2B vectors, a tight RAM budget, and no filtering. Pick an engine and a storage strategy, and name the chapter that covers the strategy. Now they add a strict per-query filter and the vectors shrink to fit heap — does your choice change, and why?
Trace how the engine value flows from the mapping to segment files: which class records it, when the graph is built, which codec/format runs for faiss vs lucene, and which files appear on disk (and whether they load on- or off-heap).
Read k-NN #2293 and #2294. In one sentence each: what is being offloaded, where to, and which phase of indexing it targets. Why are both faiss-engine features and not lucene ones?
Explain why deleting the nmslib read path / its backward codec would be a regression, citing the kind of failure it is and the contributor-mindset chapter that names it.

When you can do all seven, go to algorithms — HNSW, IVF, PQ for the mechanics behind each method, then quantization and disk-ANN for the faiss memory story in full. For the off-heap subsystem the faiss/nmslib engines depend on, see native integration and memory. For the engine roadmap and how to find unclaimed work, read [META] #2605 and real issues and RFCs.

Algorithms: HNSW, IVF, and PQ

The k-NN plugin does not invent its own approximate-nearest-neighbour (ANN) math. It exposes a small set of well-known algorithms — HNSW (a layered proximity graph), IVF (an inverted file over coarse cluster centroids), and PQ (product quantization, a compression scheme) — and wires them to OpenSearch mappings and queries. Which algorithms you can use depends on which engine you pick: faiss offers HNSW and IVF, optionally combined with PQ; lucene offers HNSW (this is literally Lucene's HNSW); the deprecated nmslib offers HNSW only.

This chapter is the algorithm layer. It explains each structure at the depth a contributor needs to reason about a recall regression, a memory blow-up, or a mapping-validation bug — the math, the parameters, the trade-offs, and how each is surfaced through the method mapping object. It assumes you have read the k-NN architecture and engines chapters and the Lucene HNSW chapter; it does not re-derive HNSW's layered graph, it builds on it.

After this chapter you can:

Explain HNSW, IVF, and PQ each in terms of what they index, what they store, and the one trade-off each makes.
Map every method.parameters knob to its effect on recall, latency, and memory.
Say which algorithms require training (_train API + model system index) and why, and which are train-free.
Read a knn_vector mapping and predict the index structure and resource cost it produces.

Note on terminology: the cluster manager (formerly master) is the node that owns cluster state. It matters here because the _train API and the .opensearch-knn-models system index are cluster-state-coordinated, not shard-local; more on that below.

The three algorithms at a glance

Algorithm	What it is	Index structure	Needs training?	Engines	One-line trade-off
HNSW	Hierarchical Navigable Small World — a layered proximity graph	Graph of neighbour lists per node per layer	No	faiss, lucene, nmslib	Memory (stores the full graph) for `O(log N)` search
IVF	Inverted File — partition vectors into `nlist` cells around centroids	Coarse quantizer (centroids) + posting lists of vectors	Yes (centroids must be learned)	faiss	Recall (only searches `nprobes` cells) for fewer distance comps
PQ	Product Quantization — compress each vector into a short code	Sub-vector codebooks + per-vector codes	Yes (codebooks must be learned)	faiss (with HNSW or IVF)	Recall (lossy compression) for large memory savings

The mental model: HNSW and IVF are indexing structures that decide which vectors you compare against; PQ is a compression layer that decides how cheaply you store and compare each vector. They compose — faiss lets you build HNSW+PQ or IVF+PQ. The selection happens entirely through the method object in the knn_vector field mapping.

# In a k-NN checkout, the method/engine/algorithm abstraction lives here:
grep -rln "class KNNMethod\|enum KNNEngine\|class MethodComponent\|class SpaceType" \
  src/main/java/org/opensearch/knn
grep -rn "METHOD_HNSW\|METHOD_IVF\|ENCODER_PQ\|ENCODER_SQ\|ENCODER_FLAT" \
  src/main/java/org/opensearch/knn/common/KNNConstants.java
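
The trade-offs in the table reduce to simple arithmetic, sketched below. The class and method names are hypothetical, and the parameter values used in `main` (96 sub-vectors of 8 bits, 8 probes out of 1024 cells) are illustrative choices, not recommendations. The sketch counts only the per-vector payload and ignores codebooks, centroids, and graph links.

```java
// Back-of-envelope arithmetic for IVF and PQ. Hypothetical helper, illustrative values only.
final class IvfPqArithmeticSketch {

    /** Bytes for one uncompressed float32 vector. */
    static long rawBytes(int dimension) {
        return 4L * dimension;
    }

    /** Bytes for one PQ code: one code of codeSizeBits per sub-vector, rounded up to whole bytes. */
    static long pqCodeBytes(int subVectors, int codeSizeBits) {
        return (subVectors * (long) codeSizeBits + 7) / 8;
    }

    /** Fraction of IVF cells a query scans when it probes nprobes of nlist cells. */
    static double fractionProbed(int nprobes, int nlist) {
        return (double) nprobes / nlist;
    }

    public static void main(String[] args) {
        long raw = rawBytes(768);
        long code = pqCodeBytes(96, 8);
        System.out.println("raw bytes per vector:  " + raw);
        System.out.println("PQ bytes per vector:   " + code);
        System.out.println("compression ratio:     " + (double) raw / code);
        System.out.println("IVF fraction scanned:  " + fractionProbed(8, 1024));
    }
}
```

Both numbers are where the recall cost comes from: PQ compares lossy codes instead of full vectors, and IVF never looks at the cells it did not probe.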

HNSW — the layered proximity graph

HNSW is covered in depth at the Lucene layer in HNSW Vector Search in Lucene; read that first. The short version: each vector is a node; edges connect a node to up to M near neighbours; the graph is built in layers (a skip-list-for-geometry), so a greedy walk starts with long hops on sparse top layers and finishes with a wide beam search on the dense bottom layer. Search is roughly O(log N) distance computations instead of the O(N) of an exact scan.

What matters here is how k-NN exposes HNSW's three knobs across engines, because the parameter names differ between the OpenSearch mapping, faiss, and Lucene.

The HNSW parameters

k-NN mapping param	faiss name	Lucene name	Fixed at	Effect of raising it
`m`	`M`	`maxConn`	index time	Higher recall, more robust graph; more memory per node, bigger graph file
`ef_construction`	`efConstruction`	`beamWidth`	index time	Higher-quality graph → better recall; slower indexing/merge
`ef_search`	`efSearch`	no separate knob; the plugin widens the query's `k`	query time	Higher recall; slower queries; no rebuild needed

The index-time vs query-time split is the single most important operational fact about HNSW. m and ef_construction are baked into the graph — changing them means reindexing. ef_search is a per-query (or per-index-setting) dial you can move freely. The standard advice: pick m/ef_construction conservatively-high at index time, then tune ef_search to hit your recall/latency target.

PUT /products
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "space_type": "l2",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": { "m": 16, "ef_construction": 128, "ef_search": 100 }
        }
      }
    }
  }
}

flowchart TD
    Q["query vector"] --> E["entry point, top layer"]
    E --> H1["greedy hops on sparse layers (long-range)"]
    H1 --> D["descend to layer 0 (all nodes, dense)"]
    D --> B["beam search, width = ef_search"]
    B --> K["top-k by distance"]
    subgraph mem["what HNSW stores per node"]
      direction LR
      V["the vector (or its PQ code)"] --- N["up to m neighbour ids per layer"]
    end
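
To make the bottom-layer step in the flow above concrete, here is a minimal Python sketch of a single-layer beam search. It assumes a toy in-memory graph (`neighbours`, a dict of adjacency lists) and plain Euclidean distance. The function and variable names are illustrative only; this is not the faiss or Lucene implementation, and it omits the descent through the upper layers.

```python
import heapq
import math


def search_layer(vectors, neighbours, query, entry, ef):
    """Beam search over one graph layer; returns up to ef (distance, id) pairs, nearest first."""
    d0 = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]      # min-heap: closest unexpanded node first
    results = [(-d0, entry)]        # max-heap (negated distances): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                   # closest candidate is already worse than the worst kept result
        for nb in neighbours[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(query, vectors[nb])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst
    return sorted((-neg, node) for neg, node in results)


# Toy data: ten 1-D points on a line, each linked to its immediate neighbours.
vectors = {i: (float(i),) for i in range(10)}
neighbours = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

top = search_layer(vectors, neighbours, query=(7.2,), entry=0, ef=3)
top_ids = [node for _, node in top]
```

The kept set never holds more than `ef` entries, which is why raising `ef_search` buys recall (more candidates survive) at the price of more distance computations per query.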

HNSW memory: why graphs are big

HNSW stores the full graph in memory for fast traversal. The rough cost per vector is the vector itself (dimension × 4 bytes for float32) plus the neighbour lists (~m × 2 ids on layer 0, fewer above). At d=768, m=16, a million vectors is roughly 3 GB of vectors plus on the order of 130 MB of layer-0 neighbour ids (32 ids × 4 bytes each), with the sparse upper layers adding a little more. The faiss/nmslib engines load all of it into native memory outside the JVM heap (see native integration and memory). This is why HNSW's headline trade-off is memory: you pay RAM to keep the graph resident. PQ (below) is the answer when that RAM cost is unacceptable.
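
As a back-of-the-envelope check, the sketch below applies the planning formula the OpenSearch k-NN documentation gives for HNSW, `1.1 × (4 × dimension + 8 × m)` bytes per vector. The function name is made up for illustration, and the result is an estimate to size hardware with, not a measurement.

```python
def hnsw_native_bytes(num_vectors, dimension, m, overhead=1.1):
    """Estimate native memory for a float32 HNSW index.

    4 * dimension : the float32 vector
    8 * m         : ~2*m neighbour ids of 4 bytes each on layer 0
    overhead      : fudge factor for upper layers and bookkeeping
    """
    per_vector = 4 * dimension + 8 * m
    return overhead * per_vector * num_vectors


vector_bytes = 4 * 768 * 1_000_000          # raw float32 vectors
link_bytes = 8 * 16 * 1_000_000             # layer-0 neighbour ids
total_gb = hnsw_native_bytes(1_000_000, 768, 16) / 1e9
```

For d=768 and m=16 the vectors dominate: the links are about 4% of the total, so lowering `m` saves little memory compared with compressing the vectors themselves.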

Note: HNSW is train-free. You can start indexing immediately — the graph is built incrementally as documents arrive. This is a major operational advantage over IVF/PQ, which need a training pass before they can index anything.

IVF — the inverted file

IVF (Inverted File index) attacks the O(N) scan from a different angle than HNSW. Instead of a graph, it partitions the vector space into nlist cells. Each cell is defined by a centroid (a representative vector), and every indexed vector is assigned to its nearest centroid's posting list. At query time you do not search all cells — you find the nprobes centroids closest to the query and only scan the vectors in those cells.

flowchart LR
    subgraph train["training (offline, on a sample)"]
      S["sample of vectors"] --> KM["k-means -> nlist centroids"]
    end
    subgraph index["indexing"]
      V["incoming vector"] --> A["assign to nearest centroid"]
      A --> P["append to that centroid's posting list"]
    end
    subgraph query["query time"]
      Q["query vector"] --> CQ["coarse quantizer: nearest nprobes centroids"]
      CQ --> SC["scan only those nprobes posting lists"]
      SC --> TK["top-k"]
    end
    KM -.centroids.-> A
    KM -.centroids.-> CQ

The IVF math

Coarse quantizer: a function q(x) mapping a vector x to the index of its nearest centroid. The centroids {c_1, …, c_nlist} are learned by k-means over a training sample.
Cells (Voronoi regions): each centroid owns the region of space closer to it than any other centroid. Posting list L_i holds every indexed vector assigned to c_i.
Search: compute distances from the query to all nlist centroids (cheap if nlist ≪ N), pick the nprobes nearest, and do an exact scan over the ~N·nprobes/nlist vectors in those cells. Distance computations per query are nlist (centroid pass) + ~N·nprobes/nlist (cell scan), versus N for a flat scan.

The recall trade-off is structural: a true nearest neighbour can sit in a cell whose centroid is not among the nprobes you searched (it lives near a Voronoi boundary). Raising nprobes searches more cells → higher recall, more work. Raising nlist makes cells smaller (less work per cell) but increases the chance a neighbour is in an unsearched cell (you need more nprobes to compensate).
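
The boundary effect is easy to reproduce on a toy example. The sketch below is an assumption-laden illustration in 1-D (hand-picked centroids instead of k-means, absolute difference as the distance, hypothetical function names), not how faiss implements IVF; it only shows the assign/probe logic and the cost formula from the text.

```python
def ivf_distance_comps(n_vectors, nlist, nprobes):
    """Approximate distance computations per query: centroid pass + probed cells."""
    return nlist + n_vectors * nprobes / nlist


def nearest(centroids, x, count=1):
    """Indices of the `count` centroids closest to x (1-D toy distance)."""
    order = sorted(range(len(centroids)), key=lambda i: abs(centroids[i] - x))
    return order[:count]


def build_posting_lists(centroids, vectors):
    """Assign every vector to the posting list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for v in vectors:
        lists[nearest(centroids, v)[0]].append(v)
    return lists


def ivf_search(centroids, lists, query, nprobes):
    """Scan only the nprobes nearest cells and return the closest vector found."""
    scanned = [v for cell in nearest(centroids, query, nprobes) for v in lists[cell]]
    return min(scanned, key=lambda v: abs(v - query))


centroids = [0, 100]                      # pretend k-means already produced these
lists = build_posting_lists(centroids, [49, 90])
missed = ivf_search(centroids, lists, query=51, nprobes=1)   # probes only the cell around 100
found = ivf_search(centroids, lists, query=51, nprobes=2)    # probes both cells
cost = ivf_distance_comps(1_000_000, nlist=4096, nprobes=8)
```

The query 51 falls just on the far side of the boundary at 50, so with `nprobes=1` the search returns 90 even though 49 is far closer; probing the second cell recovers it. With N = 1,000,000, `nlist=4096` and `nprobes=8` the formula gives roughly 6,000 distance computations instead of a million.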

The IVF parameters

k-NN mapping param	Meaning	Higher value →	Cost
`nlist`	number of cells / centroids (k-means `k`)	Smaller cells, faster per-cell scan	More training cost; needs higher `nprobes` for recall
`nprobes`	cells searched per query	Higher recall	More distance computations per query
`encoder`	optional `flat`/`pq`/`sq` compression of cell contents	—	see PQ below

PUT /products-ivf
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "space_type": "l2",
        "method": {
          "name": "ivf",
          "engine": "faiss",
          "parameters": {
            "nlist": 4096,
            "nprobes": 8
          }
        }
      }
    }
  }
}

Warning: IVF is faiss-only and requires training — the centroids must be learned before any vector can be assigned to a cell. You cannot create a usable IVF field by mapping alone; you must train a model first (see Training, below) and reference it via model_id. The mapping above is the conceptual shape; the trained-model workflow is the real path.

HNSW vs IVF — when to reach for which

Dimension	HNSW	IVF
Training	none (incremental)	required (k-means over a sample)
Memory	high (full graph resident)	lower (no graph; just centroids + posting lists)
Recall at fixed latency	typically higher	competitive at very large N
Best for	most workloads, especially mid-size with high recall needs	very large corpora where graph memory is prohibitive
Query knob	`ef_search`	`nprobes`
Engine	faiss, lucene, nmslib	faiss only

The rough rule: default to HNSW; reach for IVF when the corpus is so large that the HNSW graph's memory is the binding constraint, and you can tolerate running a training step. IVF also composes especially naturally with PQ (IVF+PQ), which is the classic faiss recipe for billion-scale, memory-bounded search.

PQ — product quantization

PQ is not an index structure — it is a compression scheme that sits underneath HNSW or IVF. Its job is to shrink the per-vector storage from dimension × 4 bytes (float32) down to a handful of bytes, at the cost of representing each vector approximately.

The PQ math

Split each d-dimensional vector into m contiguous sub-vectors of length d/m. For each sub-vector position, run k-means over the training sample to learn a codebook of 2^nbits centroids (typically nbits=8 → 256 centroids per sub-space). Now any vector is encoded by replacing each of its m sub-vectors with the id of the nearest centroid in that sub-space's codebook:

original:   [ ----------------- d floats (d*4 bytes) ----------------- ]
split:      [ sub_1 ][ sub_2 ] ... [ sub_m ]          (each d/m floats)
encode:     [ id_1  ][ id_2  ] ... [ id_m  ]          (each nbits, e.g. 8 bits)
stored as:  m bytes  (when nbits=8)  instead of  d*4 bytes

At d=768, m=96, nbits=8: a vector goes from 768×4 = 3072 bytes to 96 bytes, a 32× reduction. Distances are then computed approximately: the query stays in full precision and is compared against each stored vector's reconstructed (codebook-centroid) representation, using per-query lookup tables of query-to-centroid distances so that each distance costs m table lookups and additions.

flowchart LR
    V["vector d=768"] --> SP["split into m=96 sub-vectors of length 8"]
    SP --> CB["per-position codebook (256 centroids each)"]
    CB --> ENC["encode: nearest centroid id per sub-vector"]
    ENC --> CODE["96-byte code (vs 3072-byte float32)"]
    CODE --> AD["approximate distance via lookup tables"]
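
The encode step in the flow above can be sketched in a few lines of Python. The codebooks here are tiny and hand-written (in practice they come from k-means over the training sample), and `pq_encode`, `pq_decode` and `pq_code_bytes` are illustrative names, not k-NN or faiss functions.

```python
import math


def pq_encode(vector, codebooks):
    """Encode a vector as one centroid id per sub-vector.

    codebooks[j] is the list of centroids (tuples) learned for sub-space j.
    """
    m = len(codebooks)
    if len(vector) % m != 0:
        raise ValueError("m must divide the vector dimension")
    sub_len = len(vector) // m
    code = []
    for j, book in enumerate(codebooks):
        sub = vector[j * sub_len:(j + 1) * sub_len]
        code.append(min(range(len(book)), key=lambda c: math.dist(sub, book[c])))
    return code


def pq_decode(code, codebooks):
    """Reconstruct the approximate vector by concatenating the chosen centroids."""
    return [x for j, c in enumerate(code) for x in codebooks[j][c]]


def pq_code_bytes(m, nbits):
    """Bytes needed to store one PQ code."""
    return math.ceil(m * nbits / 8)


# Toy codebooks: d=4, m=2, two centroids per sub-space (nbits=1).
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (10.0, 10.0)],
]
code = pq_encode([0.9, 1.2, 9.0, 8.0], codebooks)
approx = pq_decode(code, codebooks)
ratio = (768 * 4) / pq_code_bytes(96, 8)
```

The decoded vector is `[1.0, 1.0, 10.0, 10.0]` rather than the original `[0.9, 1.2, 9.0, 8.0]`: that gap is the recall cost PQ pays for the 32× smaller code.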

The PQ parameters

Parameter	Meaning	Higher value →
`m` (PQ `m`, distinct from HNSW `m`)	number of sub-vectors the vector is split into	Finer compression granularity, larger codes, better recall
`code_size` / `nbits`	bits per sub-vector code (centroids = `2^nbits`)	More centroids per sub-space → better recall, larger codebooks

Warning: PQ's m is not HNSW's m. PQ m is the sub-vector count (must divide dimension); HNSW m is the graph's max edges per node. They live in different parameters/encoder blocks but the name collision trips people up constantly. Grep KNNConstants for the exact keys your version uses.

PQ as an encoder under HNSW or IVF

In faiss, PQ is configured as an encoder on the method. The mapping nests it:

"method": {
  "name": "hnsw",
  "engine": "faiss",
  "parameters": {
    "m": 16,
    "ef_construction": 128,
    "encoder": {
      "name": "pq",
      "parameters": { "m": 96, "code_size": 8 }
    }
  }
}

This builds an HNSW graph whose stored vectors are PQ codes rather than float32. The graph still navigates as usual; only the vectors it stores and compares are compressed. IVF+PQ does the same for IVF cell contents. Both require training, because the PQ codebooks (and, for IVF, the centroids) must be learned.

The cost of PQ is recall: you are comparing approximations, so the top-k you retrieve is noisier. The production pattern is to use PQ for the fast candidate retrieval and then rescore the candidates against full-precision vectors — the disk-ANN story, covered in quantization and disk-ANN.

Note: PQ is one point on a spectrum of compression k-NN offers. Byte vectors, FP16 scalar quantization, and Binary Quantization (BQ) are alternatives with different recall/memory trade-offs. PQ is the most aggressive (and the only one requiring trained codebooks). The full compression menu lives in quantization and disk-ANN.

Training: when and why

HNSW is train-free; IVF and PQ are not. Their data structures depend on parameters learned from the data distribution:

IVF's centroids are k-means cluster centers over a sample of your vectors.
PQ's codebooks are per-sub-space k-means over a sample.

You cannot learn these from a single document, and you don't want to relearn them per segment. So k-NN has a dedicated training workflow: build a model once from a representative sample, store it, and reference it from every field that uses it.

The `_train` API and the model system index

# 1. Index a representative training set into some source index (e.g. train-index).

# 2. Train a model from it. This runs k-means (IVF centroids and/or PQ codebooks).
curl -XPOST "localhost:9200/_plugins/_knn/models/my-ivfpq-model/_train" -H 'Content-Type: application/json' -d '
{
  "training_index": "train-index",
  "training_field": "embedding",
  "dimension": 768,
  "description": "IVF4096 + PQ96x8 for product embeddings",
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {
      "nlist": 4096,
      "encoder": { "name": "pq", "parameters": { "m": 96, "code_size": 8 } }
    }
  }
}'

# 3. Poll the model until state == created.
curl -s "localhost:9200/_plugins/_knn/models/my-ivfpq-model?pretty"
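
Step 3 is usually scripted. A minimal sketch of the polling loop in Python follows; `fetch_state` is a hypothetical callable you would implement around the GET request above (returning the model's `state` string), and the timeout and interval values are arbitrary.

```python
import time


def wait_for_model(fetch_state, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Poll until the model leaves `training`.

    fetch_state: zero-argument callable returning the model's current state
                 string (hypothetical; in practice it would wrap the GET call
                 and read the `state` field of the response).
    """
    waited = 0.0
    while True:
        state = fetch_state()
        if state == "created":
            return state
        if state == "failed":
            raise RuntimeError("model training failed")
        if waited >= timeout_s:
            raise TimeoutError(f"model still '{state}' after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s


# Simulated responses so the sketch runs without a cluster.
responses = iter(["training", "training", "created"])
naps = []
final_state = wait_for_model(lambda: next(responses), sleep=naps.append)
```

Treating `failed` as an exception rather than retrying matters: a failed model never becomes `created`, so it has to be deleted and retrained.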

The trained model is serialized and stored in the .opensearch-knn-models system index (a regular, replicated OpenSearch index that k-NN manages). Training is coordinated through the cluster manager and runs as a task; the model has a lifecycle state (training → created, or failed). Once created, you point a knn_vector field at it by model_id instead of specifying a method inline:

PUT /products-trained
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": { "type": "knn_vector", "model_id": "my-ivfpq-model" }
    }
  }
}

sequenceDiagram
    participant U as User
    participant CM as Cluster manager
    participant T as Training node
    participant SI as .opensearch-knn-models
    U->>CM: POST _plugins/_knn/models/<id>/_train (training_index, method)
    CM->>SI: create model doc (state=training)
    CM->>T: dispatch training task
    T->>T: read sample, run k-means (IVF centroids / PQ codebooks)
    T->>SI: write serialized model (state=created)
    Note over U: poll GET _plugins/_knn/models/<id> until state=created
    U->>U: PUT index with knn_vector { model_id: <id> }

# Where to read the training + model code in a k-NN checkout:
grep -rln "TrainingJob\|ModelDao\|class Model\b\|MODEL_INDEX_NAME\|ModelState" \
  src/main/java/org/opensearch/knn
grep -rn "_train\|TrainingModelRequest\|TrainingModelTransportAction" \
  src/main/java/org/opensearch/knn/plugin/rest src/main/java/org/opensearch/knn/plugin/transport

Warning: Train on a representative sample. If your training set's distribution differs from production data, IVF cells and PQ codebooks will be mismatched and recall will quietly degrade. The training set should be large enough for nlist k-means (faiss wants on the order of tens of points per centroid as a minimum) — too few training points for the requested nlist is a common training failure.
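
A quick sizing check before calling `_train` catches the too-small case early. The sketch below assumes faiss's k-means defaults of 39 minimum and 256 maximum training points per centroid; treat both constants as assumptions about the bundled faiss version and confirm them in faiss's `ClusteringParameters`. The helper names are hypothetical.

```python
FAISS_MIN_POINTS_PER_CENTROID = 39    # assumed faiss default; verify in your version
FAISS_MAX_POINTS_PER_CENTROID = 256   # assumed faiss default; verify in your version


def training_set_range(nlist, pq_nbits=None):
    """Return (min_points, max_useful_points) for k-means with the given settings.

    The number of clusters is nlist for the IVF coarse quantizer and
    2**pq_nbits for each PQ sub-space codebook; the larger one dominates.
    """
    k = nlist
    if pq_nbits is not None:
        k = max(k, 2 ** pq_nbits)
    return k * FAISS_MIN_POINTS_PER_CENTROID, k * FAISS_MAX_POINTS_PER_CENTROID


def check_training_set(num_training_vectors, nlist, pq_nbits=None):
    """Return a short verdict string for a proposed training-set size."""
    low, high = training_set_range(nlist, pq_nbits)
    if num_training_vectors < low:
        return f"too small: want at least {low} vectors for {nlist} cells"
    if num_training_vectors > high:
        return f"ok, but more than {high} vectors adds little"
    return "ok"


low, high = training_set_range(nlist=4096, pq_nbits=8)
verdict = check_training_set(50_000, nlist=4096, pq_nbits=8)
```

For the `nlist=4096` model trained above, that works out to roughly 160,000 training vectors as a floor, so a 50,000-vector sample is a signal to either enlarge the sample or lower `nlist`.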

Tuning: parameters → effect

The whole point of knowing the algorithms is to tune them. Here is the consolidated table a contributor reaches for when triaging a recall or latency report.

HNSW

Param	Raising it improves	Raising it costs	When fixed
`ef_search`	recall	query latency	query time (free to change)
`ef_construction`	recall (graph quality)	index/merge time	index time (reindex)
`m`	recall, robustness	memory, graph file size, index time	index time (reindex)

IVF

Param	Raising it improves	Raising it costs	When fixed
`nprobes`	recall	query latency (more cells scanned)	query time (free to change)
`nlist`	per-cell scan speed	training cost; needs more `nprobes` for same recall	train time (retrain)

PQ

Param	Raising it improves	Raising it costs	When fixed
`m` (sub-vectors)	recall	code size (memory), training time	train time (retrain)
`code_size`/`nbits`	recall	codebook size, training time	train time (retrain)

The recurring pattern: query-time knobs (ef_search, nprobes) are your first, free lever for recall; index/train-time knobs (m, ef_construction, nlist, PQ m/nbits) require a rebuild and are second-line. Always exhaust the query-time dial before reindexing.

Common bugs and symptoms

Symptom	Likely cause	Where to look
Low recall, latency is fine	`ef_search`/`nprobes` too low	raise the query-time knob first; `KNNQueryBuilder` / index setting `index.knn.algo_param.ef_search`
Low recall even at high `ef_search`	`m`/`ef_construction` too low for the data, or PQ too aggressive	reindex with higher `m`/`ef_construction`; reduce PQ compression; add rescoring
`model_id` field rejects documents	model state not `created` (still `training`/`failed`)	`GET _plugins/_knn/models/<id>`; check training task / `.opensearch-knn-models`
Training fails or produces bad recall	training set too small or unrepresentative for `nlist`	enlarge/representative sample; lower `nlist`; check k-means min-points
`dimension` not divisible by PQ `m`	PQ requires `m` to divide `dimension`	pick `m` such that `dimension % m == 0`
Confused HNSW `m` vs PQ `m`	name collision across method vs encoder blocks	verify which block the param sits in; `grep KNNConstants`
IVF/PQ mapping rejected on `lucene`/`nmslib`	IVF and PQ are faiss-only	use `engine: faiss`, or switch algorithm
Huge native-memory use on HNSW	full graph + float32 vectors resident	switch to PQ/quantization; see native memory and quantization

Validation: prove you understand this

For each of HNSW, IVF, PQ, state in one sentence what it indexes/stores and the single trade-off it makes. Which two require training, and why can HNSW be train-free when those two cannot?
A field is engine: faiss, method.name: hnsw, m: 16, ef_construction: 128. A user reports low recall. Give the ordered list of knobs you would try and say which require a reindex.
Explain, with the Voronoi-cell picture, why IVF recall depends on nprobes and why a true neighbour can be missed even at the same latency as HNSW.
Compute the storage per vector for d=1024 under (a) float32 HNSW and (b) PQ m=128, nbits=8. State the compression ratio and the cost you paid for it.
Walk the _train workflow end to end: what runs k-means, where the model is stored, what its lifecycle states are, and how a field references it. Which node coordinates it?
Translate this requirement into a method (or model_id) mapping: "1B vectors, memory-bounded, can tolerate a training step and a rescoring pass." Name the algorithm combination and why.

When you can do all six, move to native integration and memory to see where these structures actually live at runtime (native memory, the circuit breaker, the cache), then the k-NN query path to trace a query through them. For the compression menu these algorithms plug into, read quantization and disk-ANN; for the Lucene-engine HNSW these map onto, see HNSW in Lucene.

Native Integration, JNI, and Memory

Two of the k-NN plugin's three engines — faiss and the deprecated nmslib — are not Java. They are mature C++ libraries (Facebook's faiss, the nmslib similarity-search library) that the plugin calls across the Java Native Interface (JNI). That single fact drives a large fraction of k-NN's operational behaviour: the graphs live in native memory outside the JVM heap, they are loaded lazily, they are guarded by a separate circuit breaker that the core's heap circuit breakers know nothing about, and when something goes wrong you get C++ symptoms (segfaults, the Linux OOM-killer) rather than Java stack traces.

This chapter is the boundary chapter. It explains how the C++ is built (jni/ + CMake → three shared libraries), how Java reaches it (JNIService/FaissService/NmslibService), where the indexes actually live (native memory, the NativeMemoryCacheManager Guava cache), how warmup and the native-memory circuit breaker work, and how all of that differs from the JVM-heap breaker hierarchy you already know. It assumes you have read k-NN architecture, engines, and the algorithms chapter; the Lucene engine, which has no native code and rides Lucene's own HNSW format, is the deliberate contrast throughout.

After this chapter you can:

Name the three native shared libraries the jni/ CMake build produces and what each one wraps.
Trace a call from Java (JNIService) across the JNI boundary into faiss/nmslib and back.
Explain where faiss/nmslib indexes live at runtime, when they load, and how they are evicted.
Distinguish the native-memory circuit breaker from the JVM-heap parent breaker, and say which one a given OOM symptom implicates.
Build the native libraries locally with CMake and diagnose a segfault or OOM-kill.

Note on terminology: the cluster manager (formerly master) is the node that owns cluster state. It is mostly off-stage in this chapter — native memory is a per-data-node, per-shard concern — but it reappears for the warmup/stats coordination at the end.

Why there is a native layer at all

faiss and nmslib are written in C++ for the same reason BLAS is: tight loops over contiguous float arrays with SIMD, no GC pauses, no object headers, no bounds-checks on the hot path. Re-implementing HNSW/IVF/PQ in Java to faiss's level of optimization would be a multi-year effort and would still lose to hand-tuned C++ plus AVX intrinsics. (The Lucene engine does implement HNSW in Java — and gets its SIMD through the Panama Vector API — which is exactly why the Lucene engine is the "no native code, slightly different perf envelope" alternative.)

The cost of that performance is the boundary. Everything below is bookkeeping around the fact that the actual index is a C++ object on the native heap, reachable from Java only through an opaque long pointer and a set of native methods.

Engine	Language	Where the index lives	SIMD source	Circuit breaker
`faiss`	C++ via JNI	native memory (off-heap)	faiss's own AVX kernels	native-memory breaker
`nmslib` (deprecated)	C++ via JNI	native memory (off-heap)	nmslib's own kernels	native-memory breaker
`lucene`	pure Java	JVM heap / mmap'd segment files	Lucene `VectorUtil` (Panama)	core heap breakers

Warning: nmslib is deprecated since 2.19; creating new nmslib indices is blocked in 3.0. Existing nmslib indices are still read for backward compatibility. New work targets faiss (and lucene). Everything this chapter says about JNI/native memory applies to both, but write your code and your mental model around faiss. Meta issue tracking the engine direction: k-NN #2605.

The `jni/` build: three shared libraries from CMake

The plugin's C++ glue lives in the jni/ directory of the k-NN repo. It is a CMake project, separate from the Gradle build that compiles the Java. Gradle invokes it; CMake compiles the C++ and links against bundled faiss/nmslib to produce three shared libraries.

# In a k-NN checkout, orient yourself in the native tree:
ls jni/                       # CMakeLists.txt, src/, include/, external/ (faiss, nmslib, gtest)
ls jni/src jni/include        # the C++ JNI glue
grep -n "add_library\|target_link_libraries\|opensearchknn" jni/CMakeLists.txt

Shared library	Wraps	Loaded when
`libopensearchknn_faiss`	the bundled faiss C++ library + the plugin's faiss JNI glue	a faiss field is queried/warmed/indexed
`libopensearchknn_nmslib`	the bundled nmslib C++ library + nmslib JNI glue	an nmslib field is queried/warmed/indexed
`libopensearchknn_common`	shared JNI utilities (exception translation, the `KNNQueryResult` marshalling, helpers used by both engine libs)	always, as the common dependency

The split is deliberate: faiss and nmslib are independent third-party codebases with their own headers and their own ABI. Keeping each behind its own .so/.dylib means a faiss-only deployment never has to load nmslib's code, and a build can succeed for one engine even if the other's source is unavailable. The actual file names carry the platform suffix (.so on Linux, .dylib on macOS) and live under the plugin's lib/ directory in the distribution.

# Confirm the library names the build emits (don't trust this doc — grep):
grep -rn "libopensearchknn\|add_library" jni/CMakeLists.txt
# In an installed/built distribution, find the actual artifacts:
find . -name "libopensearchknn_*.so" -o -name "libopensearchknn_*.dylib" 2>/dev/null

Building the native libraries locally

You build the native libraries before (or as part of) building the plugin. The exact flags drift between versions — read jni/README.md and DEVELOPER_GUIDE.md in the repo — but the shape is always: pull the faiss/nmslib submodules, run CMake to configure, then make.

# 1. Pull the vendored native deps (faiss, nmslib, gtest live as submodules/externals).
git submodule update --init --recursive

# 2. Configure + build the native libs with CMake (run from the repo root or jni/).
#    KNN_PLUGIN_VERSION etc. are passed through by Gradle in the normal flow.
cd jni
cmake . -DCMAKE_BUILD_TYPE=Release
make -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

# 3. The full plugin build (Gradle drives CMake for you and runs the JNI tests):
cd ..
./gradlew build              # compiles Java + invokes the jni/ CMake build + tests
./gradlew :buildJniLib       # (name varies by version) just the native step — grep build.gradle
grep -n "cmake\|jni\|buildJniLib\|CMakeBuild" build.gradle

Note: The native build needs a C++ toolchain (gcc/clang), CMake, and the platform's build tools. A common first-build failure is a missing submodule (jni/external/faiss empty) — git submodule update --init --recursive fixes it. faiss itself may pull in a BLAS/LAPACK dependency; check jni/CMakeLists.txt and the developer guide for the exact prerequisites on your platform.

There is a from-source walkthrough in the lab lab-k1-build-knn-from-source; this chapter is the conceptual map.

The Java↔native boundary: `JNIService` and the engine services

On the Java side, the boundary is a thin set of classes with native methods. The canonical entry point is JNIService, which dispatches to engine-specific services (FaissService, NmslibService) — grep to confirm the exact names and package in your version, because this layer has been refactored more than once.

# Find the native-method declarations and the service classes (names vary by version):
grep -rln "class JNIService\|class FaissService\|class NmslibService\|native " \
  src/main/java/org/opensearch/knn/jni
grep -rn "public static native\|System.loadLibrary\|loadLibraries\|JNILibraryLoader" \
  src/main/java/org/opensearch/knn/jni

A native method in Java has no body — it is implemented in C++. The signature on the Java side and the function name on the C++ side are linked by the JNI naming convention (Java_<package>_<Class>_<method>), or registered explicitly. A typical declaration looks like:

// org.opensearch.knn.jni.FaissService (illustrative — grep for the real signatures)
public final class FaissService {
    // Load a serialized faiss index from a file into native memory; return a pointer to it.
    public static native long loadIndex(String indexPath);

    // Run a k-NN query against the loaded index; return the top-k as id/score pairs.
    public static native KNNQueryResult[] queryIndex(long indexPointer, float[] query, int k);

    // Free the native index. Java MUST call this or the off-heap memory leaks.
    public static native void free(long indexPointer);
}

The C++ side (in jni/src) implements the matching function. A faiss query implementation is roughly:

// jni/src/faiss_wrapper.cpp (illustrative — grep the real file/signature)
JNIEXPORT jobjectArray JNICALL
Java_org_opensearch_knn_jni_FaissService_queryIndex(
    JNIEnv* env, jclass cls, jlong indexPointer, jfloatArray queryJ, jint k) {

    // Reinterpret the opaque Java long as the real faiss index pointer.
    auto* index = reinterpret_cast<faiss::Index*>(indexPointer);

    // Get the query out of the JVM (GetFloatArrayElements pins or copies the array).
    jfloat* query = env->GetFloatArrayElements(queryJ, nullptr);
    if (query == nullptr) return nullptr;  // JNI returns NULL if it cannot pin or copy the array

    std::vector<float>   distances(k);
    std::vector<int64_t> labels(k);
    index->search(/*n=*/1, query, k, distances.data(), labels.data());  // <-- the C++ hot path

    env->ReleaseFloatArrayElements(queryJ, query, JNI_ABORT);
    // ... marshal labels+distances back into a Java KNNQueryResult[] (jobjectArray) ...
}
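
The long C++ function name above is not arbitrary: it follows the JNI short-name rule (`Java_`, then the escaped class name, then the escaped method name). As a sketch, the rule can be written out in Python to predict the symbol to grep for in `jni/src`. The helper names are made up for illustration, and the sketch ignores overloaded methods, which append an escaped argument signature.

```python
def jni_escape(name):
    """Apply JNI short-name escaping to one identifier or a dotted class name."""
    out = []
    for ch in name:
        if ch == ".":
            out.append("_")            # package separator
        elif ch == "_":
            out.append("_1")           # a literal underscore
        elif ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append("_0%04x" % ord(ch))   # other characters: _0 + 4 lowercase hex digits
    return "".join(out)


def jni_symbol(class_name, method_name):
    """C symbol the JVM looks up for a non-overloaded native method."""
    return "Java_" + jni_escape(class_name) + "_" + jni_escape(method_name)


symbol = jni_symbol("org.opensearch.knn.jni.FaissService", "queryIndex")
escaped = jni_symbol("a.b_c.D", "load_index")
```

If a native method was renamed on the Java side but not in C++ (or the other way round), this mismatch is what produces an `UnsatisfiedLinkError` at the first call, unless the method is registered explicitly.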

The three things to internalize about this boundary:

The index is an opaque long. Java holds a 64-bit pointer to a C++ object it cannot inspect; all lifecycle (load, query, free) crosses by passing it back down.
Crossing the boundary is not free. Each call marshals arguments (copying or pinning the float[] query), enters native code, and marshals results back. k-NN amortizes this with real work per call (a whole top-k search), not chatty per-vector calls.
Java owns the lifetime, C++ owns the memory. If Java never calls free, the native index leaks — the GC has no idea it exists. This is exactly why the NativeMemoryCacheManager (below) exists: to give those pointers a managed lifecycle.

# The C++ implementations and the result marshalling:
ls jni/src
grep -rn "JNIEXPORT\|reinterpret_cast<faiss\|GetFloatArrayElements\|KNNQueryResult" jni/src | head

Where the index lives: native memory, off the JVM heap

When a faiss field is first queried (or warmed), the plugin reads the serialized faiss index out of its Lucene segment file (k-NN writes the faiss graph as a custom codec format — .faiss/.vec-style files, see architecture and the query path), hands the path to FaissService.loadIndex, and faiss allocates the in-memory graph on the native heap — the process's C/C++ allocator, completely separate from the -Xmx JVM heap.

This has sharp consequences:

-Xmx does not bound it. You can have a 32 GB heap and still OOM the machine because faiss allocated 100 GB of native memory for HNSW graphs.
JVM tools don't see it. A heap dump, jmap, GC logs — none show the faiss index. You see it in RSS (resident set size) at the OS level and in the k-NN stats API.
The core circuit breakers don't account for it. The HierarchyCircuitBreakerService tracks heap; faiss native memory is invisible to it — exactly why k-NN ships its own native-memory breaker.

# Off-heap reality check on a running node: RSS far exceeds -Xmx when faiss indexes load.
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | grep -E 'graph_memory|cache|hit_count|miss_count|eviction'
ps -o rss= -p <opensearch_pid>   # resident memory includes native faiss graphs; heap is a subset

Loading and eviction: `NativeMemoryCacheManager`

Native indexes are not loaded at startup. They load lazily on first query for a segment, or eagerly via the warmup API. Once loaded, they are held by the NativeMemoryCacheManager — a Guava Cache keyed by the index/segment file, whose values are NativeMemoryAllocation objects each wrapping one native pointer.

grep -rln "class NativeMemoryCacheManager\|NativeMemoryAllocation\|NativeMemoryEntryContext\|CacheBuilder\|RemovalListener" \
  src/main/java/org/opensearch/knn/index/memory
grep -rn "maximumWeight\|weigher\|invalidate\|getIndexAllocation\|removalListener" \
  src/main/java/org/opensearch/knn/index/memory

The cache is weight-bounded by the native-memory circuit-breaker limit (below), not by entry count. Each entry's "weight" is the index's native size in kilobytes. When loading a new index would push total weight over the limit, Guava evicts the least-recently-used entries. The cache's RemovalListener is where the crucial step happens: it calls the engine's free across JNI so the native allocation is actually released. Without that listener, eviction would drop the Java wrapper but leak the C++ allocation.

flowchart TD
    Q["k-NN query hits segment S<br/>(or warmup API)"] --> C{"NativeMemoryCacheManager:<br/>S already cached?"}
    C -->|hit| P["reuse cached native pointer"]
    C -->|miss| CB{"would loading S exceed<br/>native CB limit?"}
    CB -->|yes| EV["evict LRU entries:<br/>RemovalListener -> JNI free()"]
    EV --> L
    CB -->|no| L["FaissService.loadIndex(path)<br/>-> native malloc, return long ptr"]
    L --> W["wrap in NativeMemoryAllocation,<br/>put in Guava cache (weight = size KB)"]
    W --> P
    P --> S["JNIService.queryIndex(ptr, vec, k)<br/>-> top-k across the boundary"]
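
The load/evict/free cycle can be modelled without any native code. The following Python sketch is a simplified stand-in for the Guava cache: the class name, the `loader` callback and the string "pointers" are all hypothetical, and unlike the flow above it loads first and evicts afterwards, which is closer to how a weight-bounded cache behaves on insert.

```python
from collections import OrderedDict


class WeightedLruCache:
    """Weight-bounded LRU cache that calls on_evict for every entry it drops."""

    def __init__(self, max_weight_kb, on_evict):
        self.max_weight_kb = max_weight_kb
        self.on_evict = on_evict            # stands in for the RemovalListener -> JNI free()
        self.entries = OrderedDict()        # key -> (handle, weight_kb), oldest first
        self.total_kb = 0

    def get(self, key, loader):
        """Return the cached handle, loading (and evicting) on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key][0]
        handle, weight_kb = loader(key)     # stands in for loadIndex(path)
        self.entries[key] = (handle, weight_kb)
        self.total_kb += weight_kb
        self._evict_over_limit(keep=key)
        return handle

    def _evict_over_limit(self, keep):
        while self.total_kb > self.max_weight_kb and len(self.entries) > 1:
            key = next(iter(self.entries))  # least recently used
            if key == keep:
                break
            handle, weight_kb = self.entries.pop(key)
            self.total_kb -= weight_kb
            self.on_evict(handle)           # without this the native allocation would leak


freed = []
cache = WeightedLruCache(max_weight_kb=100, on_evict=freed.append)
load = lambda key: (f"ptr-{key}", 60)       # every fake index weighs 60 KB
cache.get("seg-a", load)
cache.get("seg-b", load)                    # 120 KB > 100 KB, so seg-a is evicted
cache.get("seg-b", load)                    # hit: no load, no eviction
```

The point to carry over is the callback: the cache only knows about weights and keys, so releasing the underlying resource has to happen in the eviction hook.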

Note: Eviction is cooperative with the OS allocator, not magic. faiss free returns the memory to the C++ allocator, which may or may not immediately return it to the OS (glibc's malloc keeps arenas). RSS can therefore lag eviction. This is a frequent source of "I evicted but RSS didn't drop" confusion. The cache accounting is correct even when RSS is sticky.

The warmup API — preload instead of paying on the first query

The first query against a cold segment pays the full loadIndex cost (read-from-disk + native build/deserialize), which can be hundreds of milliseconds to seconds for a large graph. The warmup API moves that cost out of the query path by loading every segment's native index for an index up front:

# Preload all faiss/nmslib graphs for these indices into native memory (and the cache):
curl -XPOST 'localhost:9200/_plugins/_knn/warmup/products,reviews?pretty'

# Verify they're resident: cache_capacity_reached / graph_memory_usage in stats.
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | grep -E 'graph_memory|cache_capacity|load'

Warmup is a transport action fanned out to the data nodes holding the shards (KNNWarmupTransportAction/KNNWarmupRequest, names vary — grep to confirm):

grep -rln "Warmup\|warmup" src/main/java/org/opensearch/knn/plugin/transport \
  src/main/java/org/opensearch/knn/plugin/rest

The native-memory circuit breaker — a separate accountant

This is the single most important way k-NN's memory model differs from core's. The core breakers (Circuit Breakers and Memory) guard the JVM heap: parent, fielddata, request, in_flight_requests, accounting. faiss/nmslib native memory is off-heap and invisible to all of them. So k-NN runs its own breaker over the NativeMemoryCacheManager's total weight.

Setting	Default	Meaning
`knn.memory.circuit_breaker.enabled`	`true`	turn the native breaker on/off (cluster-level, dynamic)
`knn.memory.circuit_breaker.limit`	`50%`	cap on native graph memory: a percentage of the node RAM left over after the JVM heap (per the k-NN settings documentation; confirm for your version), or an absolute size (e.g. `8gb`)
`knn.circuit_breaker.triggered`	(read-only stat)	whether the breaker is currently tripped
`knn.circuit_breaker.unset.percentage`	`75%`	hysteresis: usage must fall below this fraction of the limit to un-trip

# Inspect / set the native-memory breaker (cluster settings, not index settings):
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
  | grep -E 'knn\.memory\.circuit_breaker|knn\.circuit_breaker'

curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "persistent": { "knn.memory.circuit_breaker.limit": "60%" } }'

The behaviour differs from the heap breakers in a subtle, important way:

Aspect	Core heap breaker	k-NN native-memory breaker
What it measures	estimated/real JVM heap	total native weight in the `NativeMemoryCacheManager`
Where the limit comes from	`% of -Xmx`	`% of (node RAM minus JVM heap)` (or absolute)
What "tripping" does	throws `CircuitBreakingException`, rejecting the request	primarily drives cache eviction; while tripped it also blocks loading new graphs (so requests can be rejected) and shows up in stats
Hysteresis	none (binary)	un-trips only below `unset.percentage`
Visibility	`_nodes/stats/breaker`	`GET /_plugins/_knn/stats`
Accountant	`HierarchyCircuitBreakerService` (server)	`NativeMemoryCacheManager` + the k-NN breaker (plugin)

The mental model: the heap breaker is a hard stop that throws to protect the JVM; the native breaker is more like a cache governor — it bounds resident graph memory by evicting LRU entries, and it refuses to load more once you are over the line. Both exist to stop one greedy actor from killing the node; they just guard different memory regions with different mechanisms.
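The trip/un-trip hysteresis from the settings table can be sketched as a tiny state machine. This is a toy model, not the plugin's code: the class name, field names, and the exact comparison operators are assumptions made for illustration, and the two fields only play the roles of `knn.memory.circuit_breaker.limit` and `knn.circuit_breaker.unset.percentage`.

```python
from dataclasses import dataclass


@dataclass
class NativeBreakerSketch:
    """Toy trip/un-trip breaker with hysteresis (hypothetical, not plugin code)."""

    limit_kb: int            # stands in for knn.memory.circuit_breaker.limit
    unset_percentage: float  # stands in for knn.circuit_breaker.unset.percentage
    tripped: bool = False

    def observe(self, cache_weight_kb: int) -> bool:
        """Update the tripped flag from the current total cache weight."""
        if cache_weight_kb >= self.limit_kb:
            self.tripped = True
        elif cache_weight_kb < self.limit_kb * self.unset_percentage / 100.0:
            self.tripped = False
        # Between the two thresholds the previous state is kept: that gap is the hysteresis.
        return self.tripped


breaker = NativeBreakerSketch(limit_kb=1_000, unset_percentage=75.0)
states = [breaker.observe(weight) for weight in (600, 1_000, 900, 800, 740)]
print(states)
```

The useful thing to notice is the middle of the run: once tripped at the limit, the breaker stays tripped at 900 and 800 and only clears when usage drops under 75% of the limit. A binary breaker would have cleared at 900.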

# Where the native breaker logic lives (grep — names vary by version):
grep -rln "KNNCircuitBreaker\|circuit_breaker.limit\|CircuitBreaker\|isTripped\|KNNSettings" \
  src/main/java/org/opensearch/knn/index/memory \
  src/main/java/org/opensearch/knn/plugin/stats \
  src/main/java/org/opensearch/knn/index/KNNSettings.java 2>/dev/null

Rearchitecture and a real config bug

The native breaker has accumulated known rough edges, and there is an active discussion to rearchitect it — read it before you touch this code:

Rearchitecture discussion: k-NN #1582 — Investigate rearchitecture of the native memory circuit breaker. The crux: the breaker is entangled with the cache and with cluster-settings plumbing, the "tripped/un-tripped" hysteresis is coarse, and the relationship between eviction and rejection is not as clean as it should be.
A real config bug: k-NN #585 — a concrete defect in how the circuit-breaker limit setting was handled. Reading it is a good way to see how a settings-plumbing bug manifests in this subsystem and what a fix looks like in review.

When you contribute here, the durable way to find current work is a GitHub search rather than guessing issue numbers:

repo:opensearch-project/k-NN is:issue label:"Memory" circuit breaker
repo:opensearch-project/k-NN is:issue native memory cache eviction

How this differs from the JVM-heap circuit breakers (the full contrast)

You already know the heap breaker hierarchy from Circuit Breakers and Memory. Here is the explicit bridge, because conflating the two is the #1 conceptual error when debugging k-NN memory:

flowchart TB
    subgraph JVM["JVM heap (-Xmx) — core's territory"]
      P["parent breaker (real-memory)"] --> FD[fielddata]
      P --> RQ[request]
      P --> IF[in_flight_requests]
      P --> AC[accounting]
    end
    subgraph NATIVE["Native memory (off-heap) — k-NN's territory"]
      NB["knn.memory.circuit_breaker"] --> CM["NativeMemoryCacheManager (Guava)"]
      CM --> F1["faiss index ptr (segment 1)"]
      CM --> F2["faiss index ptr (segment 2)"]
    end
    OS["OS RSS = heap + native + page cache + ..."]
    JVM -.contributes to.-> OS
    NATIVE -.contributes to.-> OS

A CircuitBreakingException: [parent] Data too large is a heap event — aggregations, fielddata, big requests. faiss graphs did not cause it.
A k-NN query that suddenly slows down, or a node whose RSS balloons past -Xmx with no heap pressure, is a native event — look at GET /_plugins/_knn/stats and knn.memory.circuit_breaker.*, not _nodes/stats/breaker.
The Linux OOM-killer killing the OpenSearch process while the heap looked healthy is the classic native-memory failure: native graphs + heap + page cache exceeded physical RAM, and the kernel reaped the process. Raising -Xmx makes this worse (less RAM left for native). The fix is lowering the native CB limit, reducing graph memory (quantization — see quantization and disk-ANN), or adding RAM.

Warning: A frequent and dangerous misconfiguration: setting -Xmx to ~50% of RAM (the usual heap guidance) and knn.memory.circuit_breaker.limit to a high percentage. Both budgets come out of the same physical RAM: heap + native graphs + OS page cache for mmap'd segments must all fit in physical memory. Budget them together, not independently.
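A back-of-the-envelope check for the warning above, as a minimal sketch. It takes the native limit as an absolute byte count so it does not depend on how a percentage limit is resolved; the page-cache reserve is a number you pick, not something OpenSearch reports, and the function name is made up for this example.

```python
GIB = 1024 ** 3


def memory_budget(ram_bytes, heap_bytes, native_limit_bytes, page_cache_reserve_bytes):
    """Return (committed_bytes, headroom_bytes); negative headroom means overcommitted."""
    committed = heap_bytes + native_limit_bytes + page_cache_reserve_bytes
    return committed, ram_bytes - committed


# 32 GiB node, 16 GiB heap, 14 GiB allowed for native graphs, 4 GiB wanted for page cache.
committed, headroom = memory_budget(32 * GIB, 16 * GIB, 14 * GIB, 4 * GIB)
print(committed // GIB, headroom // GIB)

# Same node with a smaller heap and a smaller native limit.
committed_ok, headroom_ok = memory_budget(32 * GIB, 12 * GIB, 12 * GIB, 6 * GIB)
print(committed_ok // GIB, headroom_ok // GIB)
```

The first configuration commits 34 GiB on a 32 GiB box, which is exactly the shape of node the kernel OOM-killer ends up reaping. The second leaves 2 GiB of slack.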

Common bugs and symptoms

Symptom	Likely cause	Where to look
OpenSearch process killed by the kernel (no Java OOM, dmesg shows `Out of memory: Killed process`)	native faiss memory + heap + page cache exceeded physical RAM	`knn.memory.circuit_breaker.limit`; `GET /_plugins/_knn/stats` graph_memory; reduce with quantization
RSS far exceeds `-Xmx`, heap looks healthy	faiss/nmslib graphs resident in native memory	`_plugins/_knn/stats`; expected — not a leak unless eviction also fails
JVM crash / hs_err_pid log with a faiss/nmslib frame	native segfault across the JNI boundary (bad pointer, ABI/version mismatch, corrupt index file)	the `hs_err_pid*.log` native stack; rebuild native libs; check faiss/nmslib submodule version
First query after restart/refresh is very slow, then fast	cold segment paying `loadIndex` cost on first query	use `POST /_plugins/_knn/warmup/<index>`; check cache hit/miss in stats
Native memory grows and never drops after deletes	evicted Java entry but native `free` not returning to OS (or RemovalListener bug)	cache `eviction` count vs RSS; glibc arena stickiness; verify RemovalListener calls JNI free
k-NN queries start failing / load refused under memory pressure	native-memory circuit breaker tripped	`knn.circuit_breaker.triggered` stat; lower load or raise limit within physical RAM
`UnsatisfiedLinkError` / library fails to load at startup	native `.so`/`.dylib` not built, wrong arch, or missing toolchain dep	rebuild `jni/` with CMake; `find` the `libopensearchknn_*` artifacts; check `System.loadLibrary` path
CMake build fails: `jni/external/faiss` empty	submodules not initialized	`git submodule update --init --recursive`
Heap breaker (`[parent] Data too large`) blamed for a k-NN slowdown	wrong accountant — that is a heap event, native memory is invisible to it	distinguish `_nodes/stats/breaker` (heap) from `_plugins/_knn/stats` (native)

Validation: prove you understand this

Name the three shared libraries the jni/ CMake build produces, what each wraps, and why faiss and nmslib are kept in separate libraries instead of one.
Walk a single queryIndex call across the JNI boundary: what is the opaque long, what gets copied where, who runs the actual top-k search, and who is responsible for freeing the native memory afterward.
A node has -Xmx16g, 32 GB of RAM, and gets OOM-killed by the kernel while a heap dump shows the heap at 6 GB. Explain what happened, which budget was the problem, and why raising -Xmx would make it worse.
Contrast the core parent heap breaker with knn.memory.circuit_breaker on four axes: what each measures, where its limit comes from, what "tripping" does, and which API exposes its state.
Trace a cold k-NN query through NativeMemoryCacheManager: cache miss → CB check → possible eviction → loadIndex → cache put. What does the Guava RemovalListener do on eviction, and why does forgetting it leak native memory?
Explain what the warmup API buys you, when you would call it, and how you would verify (via stats) that it worked.
Read k-NN #1582 and k-NN #585. Summarize, in two sentences each, what is wrong and what a fix would touch.

When you can do all seven, move to the k-NN query path to see these native indexes actually serve a query end-to-end, and to quantization and disk-ANN for how to make the native graphs small enough that the breaker stops being your enemy. For the pure-Java contrast with no JNI at all, re-read HNSW in Lucene and SIMD and the Vector API.

The k-NN Query Path

A knn query is a _search request, so it rides the same query-then-fetch scatter-gather you already know from Search Execution: the coordinating node fans out to shards, each shard returns its local top-k, the coordinator reduces to a global top-k, then fetches _source for the winners. What makes k-NN different is what happens inside the query phase on each shard — a KNNQuery that, depending on the engine, either calls down into native faiss/nmslib across JNI or runs Lucene's own KnnFloatVectorQuery. And what makes it interesting is the pile of variations layered on top: approximate vs exact (script-score fallback), filtered k-NN, radial search, and rescoring after quantized retrieval.

This chapter traces one knn query from REST down to the per-segment search and back up through the coordinator reduce, with a grep target at every hop so you can find the real classes in your checkout. It assumes architecture, engines, algorithms, and native integration; it leans on Search Execution for the fan-out it does not re-derive.

After this chapter you can:

Name every class a knn query passes through, from KNNQueryBuilder to KNNScorer, and find each in source.
Explain the two-phase k-NN fan-out and why k-NN's per-shard "top-k" interacts awkwardly with size and pagination.
Distinguish approximate ANN, exact (script_score) k-NN, filtered k-NN (pre- vs post-filtering / "efficient filtering"), and radial search.
Explain rescoring: why quantized retrieval needs a full-precision second pass and where it runs.

Note on terminology: the cluster manager (formerly master) owns cluster state and the model system index; it is off the query hot path. k-NN search is a per-shard, per-segment, data-node concern coordinated by the same coordinating node as any _search.

The cast — classes on the k-NN query path

Concern	Class (grep to confirm)	Lives on	Package
Parse the `knn` query clause	`KNNQueryBuilder`	coordinating node (rewrite) → shard	`org.opensearch.knn.index.query`
The Lucene `Query`	`KNNQuery` (and `KNNQueryFactory`)	data node, per shard	`org.opensearch.knn.index.query`
Weight (per-search setup)	`KNNWeight`	data node	`org.opensearch.knn.index.query`
Per-segment scorer	`KNNScorer`	data node, per segment	`org.opensearch.knn.index.query`
Native dispatch	`JNIService` / `FaissService` / `NmslibService`	data node	`org.opensearch.knn.jni`
Lucene-engine path	Lucene `KnnFloatVectorQuery` / `KnnByteVectorQuery`	data node	`org.apache.lucene.search`
Exact fallback	`KNNScoreScript` (`knn_score` script, `lang: knn`)	data node	`org.opensearch.knn.plugin.script`
Coordinator reduce	`SearchPhaseController` (core)	coordinating node	`org.opensearch.action.search`

# Orient in the query package of a k-NN checkout:
ls src/main/java/org/opensearch/knn/index/query
grep -rln "class KNNQueryBuilder\|class KNNQuery\b\|class KNNWeight\|class KNNScorer\|KNNQueryFactory" \
  src/main/java/org/opensearch/knn/index/query

Hop 0 — the REST request

A k-NN search is an ordinary _search with a knn query clause. The minimum is a field, a query vector, and k:

curl -XPOST 'localhost:9200/products/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 10,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, 0.88, /* ... dimension floats ... */ 0.45],
        "k": 10
      }
    }
  }
}'

k is "how many neighbours each shard's ANN search should return," not the same as top-level size — a distinction that bites people (see Common bugs). The clause is registered as a query and parsed into a KNNQueryBuilder.

grep -rn "registerQuery\|knn\"\|NAME\|fromXContent\|KNNQueryBuilder" \
  src/main/java/org/opensearch/knn/plugin/KNNPlugin.java \
  src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java

Hop 1 — `KNNQueryBuilder` → `KNNQuery`

KNNQueryBuilder is the QueryBuilder for the clause: it holds the field name, the query vector, k, an optional filter, and radial params (min_score/max_distance). On the coordinating node it may rewrite (resolve the field, validate dimension/space_type against the mapping). On the shard, doToQuery(QueryShardContext) builds the Lucene Query — a KNNQuery (for faiss/nmslib) or, for the lucene engine, delegates to Lucene's KnnFloatVectorQuery. The engine decision comes from the field's KNNVectorFieldType/KNNMethodContext.

grep -rn "doToQuery\|doRewrite\|getKnnEngine\|KNNEngine\|KnnFloatVectorQuery\|RNNQuery\|RadialSearch\|maxDistance\|minScore" \
  src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java

Validation happens here, early: dimension mismatch (query vector length ≠ field dimension), space-type incompatibilities, asking for IVF/PQ params on a non-faiss field — all are rejected before any search runs. A clear validation failure here is far better than a confusing recall problem later.

Hop 2 — `KNNQuery` → `KNNWeight` → `KNNScorer` (per segment)

This is the Lucene Query/Weight/Scorer triad, the same shape any Lucene query uses (see Search Execution and the Lucene HNSW chapter):

KNNQuery — the immutable query object: field, vector, k, filter, radial params.
KNNWeight.scorer(LeafReaderContext) — called once per segment. This is where the real work is set up: it resolves the per-segment k-NN data (the faiss/nmslib graph for this segment, loaded via the native cache), applies any filter, runs the ANN search, and produces the segment's matching docIds with scores.
KNNScorer — iterates those results as a Lucene Scorer (a DocIdSetIterator with a score()), so the rest of Lucene's collection machinery treats k-NN hits like any other scored hits.

grep -rn "class KNNWeight\|public Scorer scorer\|doANNSearch\|getFilteredDocIds\|exactSearch\|JNIService\|queryIndex" \
  src/main/java/org/opensearch/knn/index/query/KNNWeight.java
grep -rn "class KNNScorer\|DocIdSetIterator\|float score\|iterator()" \
  src/main/java/org/opensearch/knn/index/query/KNNScorer.java

Inside `KNNWeight.scorer` — the per-segment search

For the faiss/nmslib engines, the per-segment search is a JNI call. KNNWeight gets the segment's native index pointer from NativeMemoryCacheManager (loading it on demand, subject to the native circuit breaker; see native integration and memory) and calls JNIService.queryIndex(pointer, queryVector, k, ...). faiss walks the HNSW graph (or probes IVF cells), returns the top-k (docId, distance) pairs for that segment, and KNNWeight converts faiss distances into OpenSearch scores via the field's SpaceType (e.g. for L2, faiss reports the squared distance d² and the score is 1/(1+d²), so a smaller distance always means a higher score).
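The distance-to-score step for the L2 space can be sketched in a few lines. This assumes the documented l2 scoring rule (score = 1 / (1 + squared distance)) and that the native library hands back squared L2 distances; the function names are made up for this example, so confirm the real conversion in SpaceType in your checkout.

```python
def squared_l2(a, b):
    """Squared Euclidean distance, the quantity assumed to come back from the native search."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(squared_distance):
    """Map a squared L2 distance to a score in (0, 1]; smaller distance gives a higher score."""
    return 1.0 / (1.0 + squared_distance)


query = [0.0, 0.0]
near, far = [0.3, 0.4], [3.0, 4.0]
print(l2_score(squared_l2(query, near)))  # squared distance 0.25
print(l2_score(squared_l2(query, far)))   # squared distance 25.0
```

Because the transform is strictly decreasing in distance, ranking by score and ranking by distance give the same order. The conversion exists so that k-NN hits fit Lucene's "higher score is better" convention.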

For the lucene engine, there is no JNI: KnnFloatVectorQuery runs Lucene's HNSW search over the segment's .vec/.vex/.vem files using VectorUtil for the distance math, and returns its own TopDocs.

flowchart TD
    KW["KNNWeight.scorer(leaf segment)"] --> ENG{engine?}
    ENG -->|faiss / nmslib| NC["NativeMemoryCacheManager.get(segment)<br/>(load via JNI if cold, CB-gated)"]
    NC --> JNI["JNIService.queryIndex(ptr, vec, k)"]
    JNI --> FA["faiss/nmslib: walk HNSW / probe IVF<br/>-> top-k (docId, distance)"]
    FA --> SC["convert distance -> score via SpaceType"]
    ENG -->|lucene| LV["KnnFloatVectorQuery over .vec/.vex/.vem<br/>VectorUtil distance (Panama SIMD)"]
    LV --> SC
    SC --> KS["KNNScorer: DocIdSetIterator + score()"]

The key structural fact: k-NN search is per-segment. Each segment has its own graph; k neighbours are retrieved from each segment, and the per-shard top-k is the merge of all segments' results. More segments mean more graphs walked per query. That is why merge policy and segment count matter for k-NN latency, and why force-merging to fewer segments is a real k-NN tuning lever (at the cost of expensive merges that rebuild graphs).

Hop 3 — per-shard reduce, then the coordinator reduce

Within a shard, Lucene's collector merges the per-segment KNNScorer results into the shard's top results: at most size hits, and never more than the ANN search produced (which k bounds). Then we are back on the standard search-execution rails: each shard ships its top docIds + scores in a QuerySearchResult, the coordinator's SearchPhaseController.reducedQueryPhase does a sorted merge into the global top-k, and a fetch phase pulls _source for the winners.

sequenceDiagram
    participant C as Client
    participant Co as Coordinating node
    participant S1 as Shard A (data node)
    participant S2 as Shard B (data node)
    C->>Co: POST /idx/_search { knn: { field, vector, k } }
    par Query phase (scatter) — k-NN per shard
        Co->>S1: ShardSearchRequest
        Note over S1: KNNWeight per segment -> JNI/Lucene ANN<br/>merge segments -> shard top-k
        Co->>S2: ShardSearchRequest
        Note over S2: same: per-segment ANN -> shard top-k
    end
    S1-->>Co: top (docId, score) [no _source]
    S2-->>Co: top (docId, score)
    Note over Co: SearchPhaseController.reducedQueryPhase<br/>-> GLOBAL top-k by score
    par Fetch phase (only winners' shards)
        Co->>S1: ShardFetchRequest (docIds)
        Co->>S2: ShardFetchRequest
    end
    S1-->>Co: _source for its winners
    S2-->>Co: _source ...
    Co-->>C: SearchResponse (hits sorted by score)

Note: Because each of N shards returns up to k candidates, the coordinator reduces over N·k candidates to produce the global top-k. Total recall is therefore a function of both per-shard k and shard count. This is also why fewer, larger shards can give better k-NN recall than many tiny shards at the same total k — each shard's graph sees more of the data.
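The reduce described in the note is a sorted merge of already-sorted per-shard lists. A minimal sketch, assuming each shard returns (score, shard, docId) tuples in descending score order; the tuple layout and function name are invented for illustration and are not how QuerySearchResult is structured.

```python
import heapq
import itertools


def reduce_top_hits(per_shard_hits, size):
    """Merge per-shard hit lists (each sorted by descending score) into the global top `size`."""
    merged = heapq.merge(*per_shard_hits, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, size))


shard_a = [(0.90, "A", 1), (0.50, "A", 7)]
shard_b = [(0.80, "B", 3), (0.70, "B", 4)]
top = reduce_top_hits([shard_a, shard_b], size=3)
print(top)
```

The merge only ever sees what the shards shipped. If a shard's true third-best neighbour was cut off by its local k, the coordinator cannot recover it, which is why per-shard k is part of the recall budget.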

Approximate vs exact: the script-score fallback

The default knn query is approximate: it walks the ANN structure and may miss true neighbours (that is the whole HNSW/IVF trade-off from the algorithms chapter). When you need exact k-NN (ground truth, recall measurement, or a tiny corpus where ANN is pointless) you do a brute-force scan via the script_score path: a knn_score script (run under the plugin's own knn script language, as the "lang": "knn" in the request below shows, and backed by KNNScoreScript) that computes the exact distance from the query to every matching document and lets the normal scorer sort them.

curl -XPOST 'localhost:9200/products/_search' -H 'Content-Type: application/json' -d '
{
  "size": 10,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": { "field": "embedding", "query_value": [/* ... */], "space_type": "l2" }
      }
    }
  }
}'

grep -rln "KNNScoreScript\|knn_score\|ScriptEngine\|class KNNScoringSpace\|exactSearch" \
  src/main/java/org/opensearch/knn/plugin/script src/main/java/org/opensearch/knn/index/query

	Approximate `knn` query	Exact (`script_score` / `knn_score`)
Cost	sub-linear (`O(log N)` HNSW / `nprobes` cells)	`O(N)` over matching docs (full scan)
Recall	< 100% (tunable via `ef_search`/`nprobes`)	100% (ground truth)
Uses the graph?	yes	no — distance to every candidate
Use when	production search	recall benchmarking, tiny corpora, exact-required
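What the exact column of the table means in code: score every candidate, keep the k best. This is a plain-Python illustration of the O(N) scan, not what KNNScoreScript does internally; the dict-of-vectors corpus and the function names are assumptions for the example.

```python
import heapq


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knn(query, docs, k):
    """Brute-force k-NN: one distance computation per document, no graph involved."""
    scored = ((doc_id, squared_l2(query, vec)) for doc_id, vec in docs.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


docs = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [5.0, 5.0]}
nearest = exact_knn([0.9, 0.0], docs, k=2)
print([doc_id for doc_id, _ in nearest])
```

Results like these are the ground truth you compare an approximate knn query against when measuring recall: run both with the same query vector and count how many of the exact top-k the ANN search returned.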

There is also an internal exact path: when a filter is so selective that there are fewer matching docs than it is worth walking the graph for, KNNWeight can fall back to an exact scan over just the filtered docs. That is the next topic.

Filtered k-NN: pre- vs post-filtering and "efficient filtering"

"Give me the 10 nearest vectors that also match this filter" is the single most requested k-NN feature, and the naive approaches both fail:

Post-filtering (search ANN for k, then drop the ones that fail the filter): if the filter is selective, you can throw away most of your k results and return fewer than k — or even zero. Recall craters.
Pre-filtering by brute force (compute the filter bitset, then exact-scan only those docs): correct, but O(filtered N) — fine for tiny filtered sets, terrible for large ones.

The good answer is efficient (in-graph) filtering: push the filter into the ANN traversal so the graph search only ever accepts docs that pass the filter, walking the graph until it has k valid neighbours. faiss supports this via an id-selector; Lucene's KnnFloatVectorQuery supports it with a filter Query (see the Lucene HNSW chapter). k-NN chooses a strategy based on the filter's selectivity:

flowchart TD
    F["knn query with filter"] --> B["compute filter bitset for the segment"]
    B --> SEL{"filtered cardinality vs threshold"}
    SEL -->|very selective<br/>(few docs pass)| EX["exact brute-force over the filtered docs<br/>(cheap, 100% recall on the subset)"]
    SEL -->|not very selective| EFF["ANN with the filter pushed into traversal<br/>(faiss id-selector / Lucene filtered HNSW)"]
    EX --> R[top-k valid neighbours]
    EFF --> R

curl -XPOST 'localhost:9200/products/_search' -H 'Content-Type: application/json' -d '
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [/* ... */],
        "k": 10,
        "filter": { "term": { "category": "shoes" } }
      }
    }
  }
}'

grep -rn "filter\|getFilteredDocIds\|cardinality\|exactSearch\|FilterWeight\|expandNearestNeighbors\|MAX_DISTANCE\|filterWeight" \
  src/main/java/org/opensearch/knn/index/query/KNNWeight.java

Warning — filter explosion: the exact-vs-efficient decision hinges on the filter's cardinality per segment. A filter that is selective globally can still be non-selective in some segments. If recall or latency surprises you under a filter, check whether you fell into the exact path on big segments or the post-filter path in an older version. Push the filter into the knn clause's filter; do not wrap the whole knn query in a bool/post_filter and expect efficient filtering.
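Two small sketches make the filtering trade-off concrete: a selectivity-based strategy choice, and the post-filter failure mode. Both are illustrative only. `exact_threshold` is a hypothetical cutoff invented for this example, and the real decision logic in KNNWeight may weigh other signals.

```python
def choose_filter_strategy(filtered_cardinality, k, exact_threshold):
    """Pick a per-segment strategy from how many docs pass the filter.

    exact_threshold is a hypothetical cutoff for this sketch, not a plugin setting.
    """
    if filtered_cardinality <= max(k, exact_threshold):
        return "exact_scan"    # few docs pass: scoring all of them is cheap and exact
    return "filtered_ann"      # many docs pass: push the filter into graph traversal


def post_filter(ann_doc_ids, allowed_doc_ids):
    """The naive alternative: run ANN first, then drop hits that fail the filter."""
    return [doc_id for doc_id in ann_doc_ids if doc_id in allowed_doc_ids]


# ANN returned k=5 neighbours, but only one of them passes the filter.
survivors = post_filter(ann_doc_ids=[1, 2, 3, 4, 5], allowed_doc_ids={4, 9, 12})
print(survivors)
print(choose_filter_strategy(3, k=5, exact_threshold=1000))
print(choose_filter_strategy(50_000, k=5, exact_threshold=1000))
```

The post-filter call asked for five neighbours and got one, even though three documents pass the filter. An exact scan over those three, or a traversal that only accepts passing docs, would have returned all of them.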

Radial search: `min_score` / `max_distance`

Sometimes you do not want "the k nearest" — you want "every vector within a distance/score threshold," however many that is. That is radial search, expressed with max_distance (a distance cutoff) or min_score (a score cutoff) instead of k:

curl -XPOST 'localhost:9200/products/_search' -H 'Content-Type: application/json' -d '
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [/* ... */],
        "max_distance": 0.35
      }
    }
  }
}'

Internally this is a different traversal: rather than stopping at k results, the search expands the neighbourhood until no candidate within the radius remains. It is supported on faiss and lucene engines (grep your version for the exact support matrix). The trap is unbounded result size: a loose threshold over a dense region can match an enormous number of docs, so radial search must still respect size and can be expensive. Treat max_distance/min_score as a semantic threshold you tune against your space, not a free "give me everything close" button.
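The semantics of a radial query, as opposed to its graph traversal, fit in a brute-force sketch: keep everything inside the radius, then cap at size. Treating the threshold as a squared L2 distance is an assumption of this example; check how your space type defines max_distance before reusing the numbers.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def radial_search(query, docs, max_distance, size):
    """Return up to `size` (doc_id, distance) pairs within max_distance, nearest first."""
    hits = []
    for doc_id, vec in docs.items():
        distance = squared_l2(query, vec)
        if distance <= max_distance:
            hits.append((doc_id, distance))
    hits.sort(key=lambda pair: pair[1])
    return hits[:size]


docs = {1: [0.0, 0.0], 2: [0.5, 0.0], 3: [3.0, 0.0]}
print(radial_search([0.0, 0.0], docs, max_distance=0.35, size=10))
print(radial_search([0.0, 0.0], docs, max_distance=100.0, size=2))
```

The second call shows the trap in miniature: a loose threshold matches the whole corpus, and only size keeps the response bounded. The work of finding all those matches is still paid.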

grep -rn "max_distance\|min_score\|RadialSearch\|RNNQuery\|radius\|MAX_DISTANCE\|MIN_SCORE" \
  src/main/java/org/opensearch/knn/index/query

Rescoring: full-precision after quantized retrieval

When the field uses a compressed representation — PQ, byte/FP16 scalar quantization, binary quantization, or on_disk mode (all in quantization and disk-ANN) — the ANN search runs over approximate vectors, so its scores are noisy and its ordering is imperfect. The production pattern is a two-pass query:

Retrieve an over-fetched candidate set (oversample × k) using the cheap, compressed representation.
Rescore those candidates by computing the full-precision distance for each (reading the original float32 vectors), then re-sort and keep the true top-k.

This recovers most of the recall lost to compression while keeping the fast first pass. It is expressed with a rescore block (or enabled automatically for on_disk mode):

curl -XPOST 'localhost:9200/products/_search' -H 'Content-Type: application/json' -d '
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [/* ... */],
        "k": 10,
        "rescore": { "oversample_factor": 2.0 }
      }
    }
  }
}'

flowchart LR
    Q[query vector] --> A["pass 1: ANN over compressed vectors<br/>retrieve oversample*k candidates"]
    A --> B["pass 2: rescore candidates with<br/>full-precision float32 distance"]
    B --> C["re-sort -> true top-k"]

grep -rn "rescore\|oversample\|RescoreContext\|fullPrecision\|reScore\|ExactSearcher" \
  src/main/java/org/opensearch/knn/index/query

The oversample factor is a recall/latency dial: more candidates → better post-rescore recall → more full-precision distance computations. This is the same idea as Lucene's quantized-vectors-then-rescore pattern (see HNSW in Lucene); k-NN exposes it as a first-class query option because disk-based and PQ indexes depend on it for acceptable recall.
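The two-pass flow can be demonstrated end to end with a deliberately crude compressed representation. The sketch below uses one sign bit per dimension and Hamming distance for pass 1, then squared L2 on the original floats for pass 2. The corpus, the function names, and the choice of sign bits are all assumptions for illustration; the plugin's quantizers and rescoring code are more sophisticated.

```python
import heapq
import math


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def to_bits(vec):
    """One sign bit per dimension: a stand-in for any lossy compressed form."""
    return tuple(1 if x > 0 else 0 for x in vec)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def two_pass_knn(query, docs, k, oversample_factor):
    """Pass 1 retrieves oversample*k candidates on compressed vectors; pass 2 rescores them."""
    n_candidates = max(k, math.ceil(oversample_factor * k))
    query_bits = to_bits(query)
    candidates = heapq.nsmallest(
        n_candidates, docs, key=lambda doc_id: hamming(query_bits, to_bits(docs[doc_id]))
    )
    return heapq.nsmallest(k, candidates, key=lambda doc_id: squared_l2(query, docs[doc_id]))


docs = {"a": [0.1, 0.1], "b": [0.9, -0.05], "c": [5.0, 5.0], "d": [-1.0, -1.0]}
query = [1.0, 1.0]
print(two_pass_knn(query, docs, k=1, oversample_factor=1.0))
print(two_pass_knn(query, docs, k=1, oversample_factor=3.0))
```

In full precision the nearest document is "b", but its sign bits differ from the query's, so with no oversampling pass 1 never surfaces it and the answer is "a". With a factor of 3 it makes the candidate set and the rescore promotes it to the top.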

Putting it together — the full fan-out

sequenceDiagram
    participant U as User
    participant Co as Coordinating node
    participant SH as Shard (data node)
    participant SEG as Segment(s)
    participant NAT as faiss (native) / Lucene HNSW
    U->>Co: knn query (field, vector, k, [filter|radial|rescore])
    Co->>Co: KNNQueryBuilder rewrite/validate (dimension, space, engine)
    Co->>SH: ShardSearchRequest (query phase)
    SH->>SEG: KNNWeight.scorer per segment
    SEG->>NAT: ANN search (JNI queryIndex / KnnFloatVectorQuery)
    NAT-->>SEG: per-segment top-k (docId, distance)
    SEG-->>SH: merge segments -> shard top-k (KNNScorer scores)
    opt rescore (compressed index)
        SH->>SEG: full-precision distance for candidates
        SEG-->>SH: re-sorted true top-k
    end
    SH-->>Co: QuerySearchResult (docIds + scores)
    Co->>Co: reducedQueryPhase -> global top-k
    Co->>SH: fetch phase (_source for winners)
    SH-->>Co: hits
    Co-->>U: SearchResponse

Common bugs and symptoms

Symptom	Likely cause	Where to look
Fewer than `size` hits returned under a filter	post-filtering threw away ANN candidates that failed the filter	use the `knn` clause's `filter` (efficient filtering), not a `bool`/`post_filter` wrapper; `KNNWeight`
Low recall, latency fine	`ef_search`/`nprobes` too low, or `k` too small per shard	raise query-time knob; remember each shard returns only `k` — see algorithms
Recall fine on 1 shard, worse on many shards	per-shard `k` + reduce over `N·k` undersamples	raise `k`, or use fewer/larger shards
`k` vs `size` confusion: asked `k:10`, `size:100`, got 10	per-shard `k` bounds candidates before `size` is applied	set `k` >= desired `size`; understand they are different dials
Quantized/disk index has poor recall	no rescoring pass over full-precision vectors	add `rescore` / raise `oversample_factor`; see quantization
Radial search returns a huge/slow result set	`max_distance`/`min_score` threshold too loose for a dense region	tighten the threshold; respect `size`; radial is not "everything close" for free
First k-NN query after restart very slow	cold native graph load on first query	`POST /_plugins/_knn/warmup/<index>`; native memory
Dimension/space_type validation error	query vector length ≠ field `dimension`, or space mismatch	`KNNQueryBuilder` validation; fix the request or mapping
Exact `script_score` k-NN times out on big index	brute-force scan is `O(N)` by design	use the approximate `knn` query; reserve `knn_score` for small/exact-required cases
Latency grows with document count even at fixed `k`	many segments → many graphs walked per query	force-merge to fewer segments; tune merge policy

Validation: prove you understand this

List, in order, every class a knn query passes through from REST to per-segment search and back to the coordinator reduce. For each, say which node it runs on and what it produces.
Explain why k-NN search is per-segment, what that implies for the relationship between segment count and query latency, and why force-merge is a k-NN tuning lever.
Distinguish k from top-level size. A user sets k:10, size:50 on a 5-shard index and gets 10 hits. Diagnose it and give the fix.
Contrast post-filtering, brute-force pre-filtering, and efficient (in-graph) filtering. What does KNNWeight decide between, and on what signal?
When would you use exact script_score k-NN instead of the approximate knn query? What is the cost, and which class implements the exact score?
Explain radial search and the failure mode of a too-loose max_distance. How does it differ structurally from a k-bounded search?
Walk the rescoring two-pass flow: what runs in pass 1, what runs in pass 2, what oversample_factor trades off, and why disk-based/PQ indexes need it.

When you can do all seven, you understand the runtime. Trace it for real in the lab lab-k2-trace-a-knn-query. For where the native graphs that serve hop 2 actually live, return to native integration and memory; for the compressed representations that make rescoring necessary, read quantization and disk-ANN; for the coordinator fan-out this builds on, re-read Search Execution.

Quantization and Disk-Based ANN

A million 768-dimensional float32 vectors is ~3 GB of raw vectors before you add the HNSW graph. A billion is ~3 TB. The native-memory circuit breaker exists precisely because this number gets out of hand, and the algorithms chapter's headline trade-off — HNSW pays memory for speed — becomes the binding constraint at scale. Quantization is the answer: store each vector in fewer bits, accept a little recall loss, and recover most of it with a full-precision rescore pass.

This chapter is the memory story. It walks the compression spectrum k-NN offers — byte vectors, FP16 scalar quantization, Product Quantization (PQ), Binary Quantization (BQ) — and then disk-based vector search (on_disk mode with compression_level), which is the productized "store the compressed graph in RAM, keep full-precision on disk, rescore from disk" pattern. It maps each to its Lucene equivalent, and it points at the cutting-edge GPU / remote-index-build RFCs that make building these compressed indexes fast. It assumes algorithms (especially PQ), the query path (especially rescoring), and native integration (the memory the compression is fighting).

After this chapter you can:

Place byte/FP16/PQ/BQ on a memory-vs-recall spectrum and pick one for a constraint.
Explain on_disk mode and compression_level (1x → 32x) and how rescoring makes it usable.
Map k-NN quantization to Lucene's Lucene99/Lucene104 scalar-quantized formats.
Reason about why a quantized index lost recall and how rescoring/oversampling recovers it.
Say what GPU and remote index build buy you and cite the RFCs.

Note on terminology: the cluster manager (formerly master) coordinates the model system index and (in the RFCs below) the remote-build component, but quantization itself is a per-segment, per-shard concern that lives in the codec and the native/Lucene vector format.

The problem, in bytes

Representation	Bytes per dim (d=768)	Per vector	1M vectors	Compression vs float32
`float32` (baseline)	4	3072 B	~3.0 GB	1×
`byte` (int8)	1	768 B	~0.77 GB	4×
FP16 (half)	2	1536 B	~1.5 GB	2×
PQ (m=96, nbits=8)	—	96 B	~0.10 GB	32×
Binary (1 bit/dim)	0.125	96 B	~0.10 GB	32×

(The HNSW graph's neighbour lists add to all of these; quantization shrinks the vectors, not the graph edges.) The whole game is buying down the per-vector cost without buying too much recall loss — and the universal recovery mechanism is rescore against full precision, which is why this chapter and the query path's rescoring section are inseparable.
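The per-vector column of the table is simple arithmetic, and it is worth being able to redo it for your own dimension. A small sketch; the function name and the representation labels are invented for this example, and graph overhead is deliberately left out.

```python
import math


def bytes_per_vector(dim, representation, pq_m=None, pq_nbits=8):
    """Storage for one vector's components only (no graph edges, no metadata)."""
    if representation == "float32":
        return dim * 4
    if representation == "fp16":
        return dim * 2
    if representation == "byte":
        return dim
    if representation == "binary":
        return math.ceil(dim / 8)            # 1 bit per dimension
    if representation == "pq":
        if pq_m is None:
            raise ValueError("pq needs pq_m (number of sub-vectors)")
        return math.ceil(pq_m * pq_nbits / 8)  # one nbits-wide code per sub-vector
    raise ValueError(f"unknown representation: {representation}")


dim = 768
baseline = bytes_per_vector(dim, "float32")
for name, kwargs in [("float32", {}), ("fp16", {}), ("byte", {}), ("pq", {"pq_m": 96}), ("binary", {})]:
    size = bytes_per_vector(dim, name, **kwargs)
    print(f"{name:8s} {size:5d} B/vector  {size * 1_000_000 / 1e9:6.3f} GB per 1M  {baseline / size:4.0f}x")
```

Changing `dim` to 1536 or `pq_m` to 48 shows quickly how the budget moves before you commit to a mapping.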

Byte vectors (int8) — the cheap, lossy-but-simple win

Since 2.17, both the faiss and lucene engines support a knn_vector whose data_type is byte: each component is a signed 8-bit integer in [-128, 127] instead of a float. That is a flat 4× reduction with no codebooks and no training — the simplest compression on the menu.

PUT /products-byte
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "data_type": "byte",
        "space_type": "l2",
        "method": { "name": "hnsw", "engine": "faiss", "parameters": { "m": 16, "ef_construction": 128 } }
      }
    }
  }
}

The catch: you must quantize the vectors to [-128, 127] before indexing (or rely on the engine's quantization where supported). Byte vectors are exact as bytes — no approximation in the storage — but the act of squeezing a float embedding into 8 bits per dim is itself lossy unless your embeddings were already in that range. The distance math runs directly on bytes (faster than float on many CPUs), so byte vectors are a latency win as well as a memory win.
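A client-side pre-quantization step might look like the sketch below. It assumes a simple symmetric linear mapping with a corpus-wide `max_abs` chosen by the caller; both the function names and that scaling scheme are illustrative assumptions, not what the plugin does internally.

```python
def quantize_to_int8(vector, max_abs):
    """Linearly map floats in [-max_abs, max_abs] to signed bytes in [-128, 127].

    max_abs is a corpus-wide constant picked by the caller (an assumption of
    this sketch); values outside that range are clipped.
    """
    if max_abs <= 0:
        raise ValueError("max_abs must be positive")
    scale = 127.0 / max_abs
    codes = []
    for x in vector:
        q = int(round(x * scale))
        codes.append(max(-128, min(127, q)))   # clip into the int8 range
    return codes


def dequantize(codes, max_abs):
    """Approximate inverse, useful for measuring the quantization error."""
    return [c * max_abs / 127.0 for c in codes]


if __name__ == "__main__":
    embedding = [0.12, -0.98, 1.7]             # 1.7 is out of range and gets clipped
    codes = quantize_to_int8(embedding, max_abs=1.0)
    print(codes)
    print(dequantize(codes, max_abs=1.0))
```

Comparing `dequantize(...)` against the original embedding gives a first estimate of how much information the 8-bit squeeze discards before any recall measurement.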

grep -rn "data_type\|VectorDataType\|BYTE\|FLOAT\|KNNVectorFieldMapper" \
  src/main/java/org/opensearch/knn/index/mapper \
  src/main/java/org/opensearch/knn/common/KNNConstants.java

FP16 scalar quantization — half the bytes, near-zero recall loss

FP16 (half-precision) scalar quantization stores each component as a 16-bit float — a 2× reduction with very little recall impact, because IEEE-754 half-precision still captures the magnitude and most of the precision of typical normalized embeddings. In faiss this is the fp16 encoder (a scalar quantizer, SQ); the engine handles the float→half conversion, so unlike raw byte vectors you do not pre-quantize.

"method": {
  "name": "hnsw",
  "engine": "faiss",
  "parameters": {
    "encoder": { "name": "sq", "parameters": { "type": "fp16" } }
  }
}

FP16 is the conservative choice: when you want to roughly halve memory and are not willing to risk recall, FP16 with no rescoring is usually safe. It is also the gentlest step on the spectrum — if FP16 already meets your memory budget, you do not need PQ/BQ or disk mode.
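To get a feel for how little FP16 loses on typical normalized embeddings, one can round-trip values through IEEE-754 half precision using Python's `struct` module (format code `e`). This is only an illustration of the number format, with hypothetical helper names; it says nothing about how faiss performs the conversion.

```python
import struct


def to_fp16_and_back(x):
    """Round-trip one float through IEEE-754 half precision (2 bytes)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_abs_error(vector):
    """Largest per-component error introduced by the FP16 round trip."""
    return max(abs(x - to_fp16_and_back(x)) for x in vector)


if __name__ == "__main__":
    embedding = [0.0123, -0.4567, 0.8910, -0.0001]
    packed = struct.pack(f"<{len(embedding)}e", *embedding)
    print(len(packed), "bytes for", len(embedding), "components")
    print("max abs error:", max_abs_error(embedding))
```

Note that half precision tops out near 65504, so embeddings with very large components would need clipping or scaling first.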

grep -rn "fp16\|ENCODER_SQ\|ScalarQuantization\|SQType\|FP16\|clip\|minScore" \
  src/main/java/org/opensearch/knn/common/KNNConstants.java \
  src/main/java/org/opensearch/knn/index/util 2>/dev/null

Product Quantization (PQ) — aggressive, trained, faiss-only

PQ is covered mechanically in the algorithms chapter: split each vector into m sub-vectors, learn a codebook per sub-space, store the nearest-centroid id per sub-vector. At d=768, m=96, nbits=8 that is 32× compression (3072 B → 96 B). The two things to remember here, in the memory story:

PQ is the most aggressive float-based compression and the only one requiring training (the codebooks are learned k-means — needs the _train API and the .opensearch-knn-models system index; see algorithms § Training).
PQ distances are computed between reconstructed (codebook-centroid) approximations, so PQ demands rescoring for production recall. Use PQ for the fast first pass, then rescore against full-precision vectors (see query path § rescoring).

PQ composes with both HNSW and IVF (the classic IVF+PQ billion-scale faiss recipe). It is faiss-only — the lucene engine has no PQ; its compression story is scalar quantization, below.
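The encode step described above can be sketched with hand-written toy codebooks. In a real deployment the codebooks come from training (the _train API); here they are hard-coded assumptions and all function names are illustrative.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Encode a vector as one centroid id per sub-vector.

    codebooks[i] is the list of centroids for sub-space i; every centroid has
    length len(vector) // len(codebooks).
    """
    m = len(codebooks)
    if len(vector) % m != 0:
        raise ValueError("dimension must be divisible by m")
    sub_dim = len(vector) // m
    codes = []
    for i, centroids in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_l2(sub, centroids[c]))
        codes.append(nearest)
    return codes


def pq_decode(codes, codebooks):
    """Reconstruct the approximation that PQ distances are computed against."""
    out = []
    for code, centroids in zip(codes, codebooks):
        out.extend(centroids[code])
    return out


if __name__ == "__main__":
    # d=4, m=2, two centroids per sub-space (nbits=1): toy values only.
    codebooks = [[[0.0, 0.0], [1.0, 1.0]],
                 [[0.0, 1.0], [1.0, 0.0]]]
    codes = pq_encode([0.9, 1.2, 0.1, 0.8], codebooks)
    print(codes, pq_decode(codes, codebooks))
```

The gap between the original vector and `pq_decode(...)` is the error that the rescore pass exists to correct.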

Binary Quantization (BQ) — one bit per dimension, the extreme

Binary Quantization reduces each component to a single bit (typically: is this dimension above or below a threshold?), and compares vectors with Hamming distance (population count of XOR) — an operation modern CPUs do in one instruction over 64 bits at a time. At 1 bit/dim that is a 32× reduction (same footprint as PQ-96×8 above, but train-free and far cheaper to compare).

BQ's recall loss is the steepest of the spectrum — you have thrown away almost all magnitude information — so BQ always pairs with rescoring, and often with oversampling factors larger than PQ's. It shines as the first stage of a funnel: binary Hamming to cheaply shortlist a large candidate set, then full-precision rescore to rank. The space_type for binary vectors is hamming.
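A minimal sketch of one-bit quantization and Hamming comparison follows. The threshold of 0.0 and the helper names are assumptions made for illustration; the plugin's actual thresholding may differ.

```python
def binarize(vector, threshold=0.0):
    """Pack one bit per dimension into an int (bit i is set if vector[i] > threshold)."""
    bits = 0
    for i, x in enumerate(vector):
        if x > threshold:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits: population count of the XOR."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = binarize([0.3, -0.1, 0.7, -0.9])
    b = binarize([0.2, 0.4, -0.5, -0.1])
    print(bin(a), bin(b), hamming(a, b))
```

Two vectors with very different magnitudes but the same sign pattern are indistinguishable here, which is why BQ leans so heavily on oversampling and rescoring.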

grep -rn "hamming\|BINARY\|BinaryQuant\|bitcount\|popcount\|VectorDataType.BINARY" \
  src/main/java/org/opensearch/knn/common/KNNConstants.java \
  src/main/java/org/opensearch/knn/index/mapper

The compression spectrum at a glance

Method	Compression	Training?	Recall loss	Rescore needed?	Engines	Best when
`byte` (int8)	4×	no	low–moderate	optional	faiss, lucene	embeddings already near int8 range; want a simple 4×
FP16 SQ	2×	no	very low	rarely	faiss (sq)	conservative halving, recall-sensitive
PQ	up to 32×+	yes	moderate–high	yes	faiss	billion-scale, memory-bound, can train + rescore
BQ (binary)	up to 32×	no	high	yes (always)	faiss, lucene	cheap massive shortlist, then rescore
`on_disk` mode	1–32× (you pick)	depends	depends on level	yes (built in)	faiss	RAM-bound; trade disk + rescore latency for memory

Disk-based ANN: `on_disk` mode and `compression_level`

Quantization shrinks the in-memory footprint. Disk-based vector search goes further: keep the compressed graph in RAM (small, fast to traverse) and keep the full-precision vectors on disk, reading them only to rescore the handful of candidates the first pass produced. This is the on_disk mode, and it is the productized version of "PQ/BQ retrieve + float32 rescore" with the storage tiering made explicit.

PUT /products-disk
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "space_type": "innerproduct",
        "mode": "on_disk",
        "compression_level": "32x"
      }
    }
  }
}

You set mode: on_disk and a compression_level — one of 1x, 2x, 4x, 8x, 16x, 32x — and k-NN picks the appropriate quantization (e.g. higher levels use binary/aggressive quantization) to hit that ratio, builds the small in-RAM graph over the compressed vectors, and automatically enables rescoring against the full-precision vectors kept on disk. 1x is effectively "in-memory full precision"; 32x is the most aggressive, smallest-RAM, most-rescore-dependent setting.

flowchart TD
    subgraph RAM["in RAM (small)"]
      G["compressed HNSW graph<br/>(quantized vectors per compression_level)"]
    end
    subgraph DISK["on disk (large)"]
      FP["full-precision float32 vectors"]
    end
    Q[query vector] --> G
    G --> CAND["pass 1: ANN over compressed graph<br/>-> oversample*k candidates"]
    CAND --> RD["read candidates' float32 from disk"]
    RD --> RS["pass 2: full-precision rescore"]
    FP --> RD
    RS --> TK[true top-k]

`compression_level`	In-RAM footprint vs float32	Recall (pre-rescore)	Rescore dependence	Typical use
`1x`	full (no compression)	highest	none	small corpora, RAM is plentiful
`2x` / `4x`	1/2 – 1/4	high	low	balanced
`8x` / `16x`	1/8 – 1/16	moderate	medium	large, RAM-constrained
`32x`	1/32	lowest	high	billion-scale, disk-tolerant latency

Warning: Disk mode trades memory for latency. The rescore pass reads full-precision vectors from disk (or OS page cache); under cache misses that is a random-read penalty per query. The win is fitting a corpus in a fraction of the RAM; the cost is tail latency on cold reads. Tune oversample_factor (more candidates → better recall, more disk reads) against your latency budget. See the query path's rescoring section.
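The two-pass flow and the effect of the oversample factor can be sketched in a few lines. This is a toy under stated assumptions: a brute-force scan over sign bits stands in for the ANN traversal of the compressed graph, a dict stands in for the full-precision vectors on disk, and every name is hypothetical.

```python
def sign_bits(vector):
    """1-bit-per-dimension code: bit i is set if vector[i] > 0."""
    return sum(1 << i for i, x in enumerate(vector) if x > 0.0)


def hamming(a, b):
    return bin(a ^ b).count("1")


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_pass_search(query, full_precision, k, oversample_factor):
    """full_precision maps doc id -> float vector (stand-in for the on-disk copy)."""
    # "In RAM": compressed codes. A real index builds these once, not per query.
    compressed = {doc: sign_bits(vec) for doc, vec in full_precision.items()}

    # Pass 1: cheap search over compressed codes, keeping oversample_factor * k.
    q_bits = sign_bits(query)
    n_candidates = max(k, int(oversample_factor * k))
    shortlist = sorted(compressed,
                       key=lambda d: hamming(q_bits, compressed[d]))[:n_candidates]

    # Pass 2: read only the shortlisted full-precision vectors and rescore them.
    rescored = sorted(shortlist, key=lambda d: squared_l2(query, full_precision[d]))
    return rescored[:k]


if __name__ == "__main__":
    corpus = {"a": [0.1, 0.1], "b": [5.0, 5.0], "c": [-0.05, 0.1]}
    query = [0.01, 0.1]
    print(two_pass_search(query, corpus, k=1, oversample_factor=1))  # misses "c"
    print(two_pass_search(query, corpus, k=1, oversample_factor=3))  # recovers "c"
```

In the toy corpus the true nearest neighbour has a different sign pattern from the query, so it only survives pass 1 when the shortlist is wide enough. Each extra candidate is also one more full-precision read, which is the latency side of the trade.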

grep -rn "on_disk\|compression_level\|CompressionLevel\|Mode\b\|MODE_ON_DISK\|ON_DISK" \
  src/main/java/org/opensearch/knn/index \
  src/main/java/org/opensearch/knn/common/KNNConstants.java

How it maps to Lucene scalar quantization

The lucene engine does not call faiss; it rides Lucene's own quantized vector formats, covered in HNSW Vector Search in Lucene. When you ask the lucene engine for quantization, you are configuring Lucene's KnnVectorsFormat:

k-NN (lucene engine)	Lucene format	What it stores
float32 HNSW	`Lucene99HnswVectorsFormat`	full-precision vectors + HNSW graph
int8 scalar quantization	`Lucene99ScalarQuantizedVectorsFormat` / `Lucene99HnswScalarQuantizedVectorsFormat`	int8-quantized vectors (flat / + graph) — ~4× smaller
configurable 1/2/4/7/8-bit SQ	`Lucene104ScalarQuantizedVectorsFormat` / `Lucene104HnswScalarQuantizedVectorsFormat`	finer-grained bit-width SQ (Lucene 10.4), dial memory vs recall

So the same idea (scalar-quantize the vectors, optionally rescore with full precision) shows up twice in k-NN: once as the faiss sq/fp16 encoder + native graph, and once as Lucene's Lucene99*/Lucene104* quantized formats for the lucene engine. The int8 scalar-quantization lineage in Lucene traces to LUCENE-10577 / apache/lucene #11613, and the scalar-quantization codec to apache/lucene #12497. Lucene also gets its SIMD distance math from the Panama Vector API, so a Lucene104* int8 distance is a hardware VPDPBUSD-style dot product when the CPU and JDK support it.

# Confirm which Lucene quantized formats your bundled Lucene ships (run in a Lucene checkout):
grep -rln "Lucene9.*ScalarQuantizedVectorsFormat\|Lucene10.*ScalarQuantizedVectorsFormat" \
  lucene/core/src/java
# And which one k-NN's lucene engine wires up:
grep -rn "ScalarQuantized\|Lucene9\|Lucene10\|KnnVectorsFormat\|confidenceInterval\|bits" \
  src/main/java/org/opensearch/knn/index/codec 2>/dev/null

Building compressed indexes fast: GPU and remote index build

Quantization shrinks the stored index; it does not make building the index cheap. Constructing an HNSW graph (even over compressed vectors) is CPU-heavy and happens during indexing and every merge — it is often the dominant cost of a large k-NN ingest. Two active, cutting-edge directions attack this, and both are real RFCs worth reading before you touch index-build code:

GPU-accelerated build: k-NN #2293, [RFC] Boosting OpenSearch Vector Engine Performance using GPUs. Uses NVIDIA cuVS and the CAGRA graph-build algorithm (a GPU-native ANN graph) to construct the vector index in FP32 on the GPU, then serve it on CPU. CAGRA's construction is designed for massively parallel hardware, so a GPU can build the graph in a fraction of the CPU wall-clock time.
Remote vector index build — k-NN #2294, [RFC] Remote Vector Index Build and the component meta k-NN #2391, [Meta] Remote Vector Index Build Component. The idea: offload the per-segment graph construction from the data node to a remote GPU/CPU fleet — the data node ships the segment's vectors out, a remote builder constructs the graph, and the result is pulled back and written into the segment. This decouples index-build CPU from the search cluster and lets you use specialized (GPU) hardware for builds only. Benchmarking of these lives in k-NN #2595.

flowchart LR
    subgraph DN["data node (search cluster)"]
      SEG["segment vectors (flush/merge)"]
    end
    subgraph REMOTE["remote build fleet (GPU)"]
      B["cuVS / CAGRA graph build (FP32)"]
    end
    SEG -->|ship vectors| B
    B -->|return built graph| SEG2["write graph into segment file"]
    SEG2 --> SRV["served on CPU like any k-NN segment"]

These are the frontier. They do not change the query path (a remotely-built graph queries identically to a locally-built one); they change who pays the build cost and on what hardware. For current status, search rather than assume:

repo:opensearch-project/k-NN is:issue GPU cuVS CAGRA remote build
repo:opensearch-project/k-NN is:issue label:"Roadmap" remote index build

Choosing: a decision guide

Constraint	Reach for	Why
"Halve memory, do not risk recall"	FP16 SQ	2×, near-zero recall loss, no training, no rescore
"Embeddings already in int8 range"	`byte` vectors	flat 4×, faster byte distance, simple
"Billion vectors, RAM is the wall, can train"	`IVF+PQ` (+ rescore)	32×+ compression; classic faiss billion-scale recipe
"RAM-bound, want a single knob, tolerate disk latency"	`on_disk` mode + `compression_level`	productized quantize-in-RAM + rescore-from-disk
"Cheap massive shortlist then rank"	BQ + heavy oversample + rescore	1-bit Hamming is the cheapest possible first pass
"Pure Java, no native, want quantization"	`lucene` engine + `Lucene104*` SQ	Lucene's own quantized format, Panama SIMD
"Index build is the bottleneck, not search"	GPU / remote build (RFCs)	offload graph construction off the search cluster

The universal rule: start gentle (FP16/byte), measure recall against an exact script_score ground truth (see query path), and only escalate to PQ/BQ/disk when memory forces you to — adding a rescore pass each time you do. Never ship an aggressively-quantized index without measuring post-rescore recall against ground truth; the compression ratio in the table is meaningless if recall is unacceptable for your task.
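Measuring recall against ground truth reduces to a set intersection per query. The sketch below assumes you have already collected, for each query, the ids returned by the quantized index and the ids returned by an exact search; the function names are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate top-k also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall(results, k):
    """results: list of (approx_ids, exact_ids) pairs, one pair per query."""
    scores = [recall_at_k(approx, exact, k) for approx, exact in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    per_query = [
        (["1", "2", "9"], ["1", "2", "3"]),   # one miss out of three
        (["4", "5", "6"], ["6", "5", "4"]),   # same set, different order
    ]
    print(mean_recall(per_query, k=3))
```

Run it twice, once on pre-rescore results and once on post-rescore results, to see how much of the loss the rescore pass actually recovers.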

Common bugs and symptoms

Symptom	Likely cause	Where to look
Quantized index recall much worse than float32	no rescore pass over full-precision vectors	enable `rescore` / raise `oversample_factor`; query path
`on_disk` query latency spikes / high tail	rescore reading full-precision vectors from cold disk	warm OS page cache; lower `compression_level`; tune oversample vs latency
PQ field rejects documents	model not `created`, or `dimension` not divisible by PQ `m`	`GET _plugins/_knn/models/<id>`; pick `m` with `dimension % m == 0`; algorithms
Byte vectors give wrong/garbage scores	float embeddings not quantized into `[-128,127]` before indexing	pre-quantize, or use FP16/SQ which the engine handles
`mode: on_disk` rejected on lucene/nmslib	disk mode and most quantization are faiss-engine features	use `engine: faiss`; lucene uses `Lucene104*` SQ instead
BQ recall unusable even after rescore	oversample too low for how lossy 1-bit is	raise `oversample_factor` substantially; BQ needs a wider funnel than PQ
Memory still high after enabling quantization	graph edges + full-precision (non-disk) vectors still resident; or warmup loaded float32	check `mode`/`compression_level`; native memory stats
Index build (ingest/merge) dominates, search is fine	HNSW graph construction CPU cost, unrelated to query	force-merge tuning; watch the GPU/remote-build RFCs (#2293/#2294/#2391)

Validation: prove you understand this

Compute per-vector storage for d=1024 under float32, byte, FP16, PQ(m=128, nbits=8), and binary. State each compression ratio and which require training.
Explain why quantized retrieval needs rescoring and what the rescore pass computes. For which methods is rescoring optional vs mandatory?
Walk an on_disk query end to end: what is in RAM, what is on disk, what pass 1 and pass 2 do, and what compression_level and oversample_factor each trade off.
A team needs 1B vectors on a memory-bounded cluster and can tolerate a training step and some tail latency. Specify a concrete mapping (engine, algorithm, compression, rescore) and justify each choice.
Map FP16/int8/PQ to their k-NN configuration and their Lucene-engine equivalent. Which compression methods have no Lucene-engine analogue, and why?
Explain what GPU build (cuVS/CAGRA, #2293) and remote index build (#2294/#2391) change about a k-NN deployment — and what they deliberately do not change about the query path.
You are handed an aggressively-quantized index with disappointing recall. Give the ordered diagnostic steps (measure against what ground truth; which knobs first; when to back off the compression level).

When you can do all seven, you have closed the loop: the algorithms decide the structure, this chapter decides how cheaply you store it, the query path decides how it serves a query, and native integration and memory decides where it all lives at runtime. For the Lucene-layer quantized formats these map onto, read HNSW in Lucene and SIMD and the Vector API. Then benchmark it for real in lab-k6-benchmark-recall-latency.

Lab K1: Build the k-NN Plugin from Source

Background

Every other lab in this section assumes you can build and run the k-NN plugin. This lab earns you that. The k-NN repo (opensearch-project/k-NN) is the most structurally unusual project in the OpenSearch world: it has two build systems that must agree — a Gradle build that compiles the Java (src/main/java/org/opensearch/knn/...) and a CMake build under jni/ that compiles C++ glue and links the bundled faiss and nmslib libraries into shared objects (libopensearchknn_faiss, libopensearchknn_nmslib, libopensearchknn_common, and more). Gradle drives CMake; if either side is misconfigured, you get cryptic failures that look nothing like a normal Java compile error.

On top of that, k-NN is an out-of-repo plugin: it does not live inside the OpenSearch core tree, so it builds against published core artifacts. Its build.gradle pins an opensearch_version (e.g. 3.6.0-SNAPSHOT), and if no matching artifact exists in your local Maven cache, you must publish core yourself first.

In this lab you clone the repo, initialize the native submodules, build the JNI libraries with CMake, build the plugin with Gradle, run its unit and integration tests, and install the result into a local OpenSearch distribution. By the end you will have a node running with a plugin you compiled — native code and all.

Note: The cluster manager (formerly master) is the node that owns cluster state. It is mostly off-stage in this lab — building and installing a plugin is a per-node concern — but it reappears the moment you start a multi-node cluster, because every node must run the same plugin version or the cluster will not form.

Read k-NN architecture and native integration and memory alongside this lab; they are the conceptual map for the two trees you are about to build. The core analogue is Lab 1.1: Build OpenSearch from Source — do that first if you have not; the JDK/toolchain/Gradle-wrapper mechanics carry over.

Why This Lab Matters for Contributors

You cannot fix a k-NN issue, run its tests, or reproduce a bug report without a working from-source build. This is the gate to every other contribution.
The dual Gradle+CMake build is the single biggest source of "I can't even get it to compile" pain for new k-NN contributors. Understanding which build owns which failure is half the battle.
The out-of-repo-plugin model (build against published core artifacts, publishToMavenLocal when there is no published snapshot) is how every OpenSearch plugin outside the core repo is built — k-NN, SQL, security, anomaly-detection. Learn it once here.
Installing your own build into a distribution and seeing it in _cat/plugins is the end-to-end loop you will run hundreds of times.

Prerequisites

A clean OpenSearch core build (you have completed Lab 1.1 and ./gradlew assemble works in the core repo).
JDK 21 (the k-NN build, like core, targets JDK 21). The repo's bundled JDK is fine.
A C++ toolchain: gcc/g++ or clang/clang++, plus make.
CMake ≥ 3.24 (the jni/CMakeLists.txt declares cmake_minimum_required(VERSION 3.24.0) — confirm with cmake --version).
git with submodule support, and enough disk (the faiss/nmslib submodules plus their build artifacts are several GB).
On macOS: libomp (OpenMP) for faiss; on Linux: a BLAS/LAPACK implementation may be pulled in by faiss. Check jni/CMakeLists.txt and the repo's DEVELOPER_GUIDE.md for the exact platform prerequisites — they drift between versions.

# Verify the toolchain before you start. Any missing tool here = a build failure later.
java -version          # want 21.x
cmake --version        # want >= 3.24
g++ --version || clang++ --version
git --version

Step-by-Step Tasks

Step 1: Clone the repo WITH its submodules

This is the number-one "k-NN won't build" cause, so get it right first. faiss and nmslib are git submodules under jni/external/. A plain git clone leaves them empty, the Java side builds fine, and then the CMake step dies with errors about missing faiss headers.

mkdir -p ~/src/oss-repos && cd ~/src/oss-repos

# Clone AND initialize submodules in one shot:
git clone --recursive https://github.com/opensearch-project/k-NN.git
cd k-NN

# If you already cloned without --recursive, fix it now:
git submodule update --init --recursive

# Confirm the externals are actually populated (not empty directories):
ls jni/external/                 # expect: faiss  nmslib  (plus possibly gtest)
ls jni/external/faiss | head     # expect faiss's own source tree, NOT an empty dir

Warning: jni/external/faiss being an empty directory is the signature of the missing-submodule failure. The CMake config step also tries to self-heal — the jni/cmake/init-faiss.cmake script runs git submodule update --init -- external/faiss if it can't find faiss — but do not rely on that; initialize submodules explicitly.

Step 2: Read the version contract

k-NN builds against a specific OpenSearch core version. Find it before anything else; a version mismatch here is the second-most-common build failure.

# The pinned core version (e.g. "3.6.0-SNAPSHOT"):
grep -n "opensearch_version" build.gradle | head
#   opensearch_version = System.getProperty("opensearch.version", "3.6.0-SNAPSHOT")

# The build pulls build-tools and core from this version's published artifacts:
grep -n "build-tools\|opensearch_group\|repositories" build.gradle | head

The rule: the plugin's opensearch_version must match a core artifact that exists in your Maven repositories. For a released version, the public Maven repo has it. For a -SNAPSHOT, you usually need to publish core locally (Step 3).
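The same check can be scripted. This sketch assumes the build.gradle line has the shape shown in the grep output above and that local artifacts follow the standard Maven repository layout under ~/.m2/repository; the helper names are hypothetical.

```python
import re
from pathlib import Path


def pinned_core_version(build_gradle_text):
    """Return the default given to System.getProperty("opensearch.version", ...), or None."""
    match = re.search(
        r'System\.getProperty\(\s*"opensearch\.version"\s*,\s*"([^"]+)"\s*\)',
        build_gradle_text,
    )
    return match.group(1) if match else None


def is_snapshot(version):
    return version.endswith("-SNAPSHOT")


def local_core_artifact_dir(version, m2_root=None):
    """Where a locally published core artifact of this version would live."""
    root = Path(m2_root) if m2_root else Path.home() / ".m2" / "repository"
    return root / "org" / "opensearch" / "opensearch" / version


if __name__ == "__main__":
    line = 'opensearch_version = System.getProperty("opensearch.version", "3.6.0-SNAPSHOT")'
    version = pinned_core_version(line)
    print(version, "snapshot" if is_snapshot(version) else "release")
    print("published locally:", local_core_artifact_dir(version).is_dir())
```

If the version is a snapshot and the directory is missing, that is the cue to try the build and, on a resolution failure, publish core to Maven local.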

Step 3: Publish core to Maven local (only if needed)

If opensearch_version is a -SNAPSHOT that isn't on the OpenSearch snapshot Maven repo, build it from the core checkout that matches and publish it to your local ~/.m2:

# In your OpenSearch CORE checkout (the one from Lab 1.1), on the matching version/branch:
cd ~/src/oss-repos/OpenSearch
./gradlew publishToMavenLocal -Dbuild.snapshot=true

# This installs org.opensearch:opensearch:<version>, build-tools, etc. into ~/.m2/repository.
ls ~/.m2/repository/org/opensearch/opensearch/ 2>/dev/null

Then point the k-NN build at the local artifacts. The build already adds mavenLocal() in most versions; if you need to override the version explicitly:

cd ~/src/oss-repos/k-NN
./gradlew assemble -Dopensearch.version=3.6.0-SNAPSHOT

Note: You only need this if the plugin can't resolve core. Try the build first (Step 5); if Gradle fails with Could not resolve org.opensearch:opensearch:<version>, come back and publish core to Maven local. Releases-against-released-core don't need this.

Step 4: Build the native JNI libraries with CMake

The native build is a CMake project under jni/. In the normal flow Gradle drives CMake for you (Step 5), but build it standalone once so you understand what Gradle is doing and can diagnose native failures in isolation.

The Gradle task cmakeJniLib runs the configure step and buildJniLib runs the compile step. Read exactly what they invoke:

grep -n "cmakeJniLib\|buildJniLib\|KNN_PLUGIN_VERSION\|jni/build\|--target" build.gradle | head

You will see the configure command is essentially:

# Configure: -S = source dir, -B = build dir. KNN_PLUGIN_VERSION must match opensearch_version.
cmake -S jni -B jni/build \
  -DKNN_PLUGIN_VERSION=3.6.0-SNAPSHOT \
  -DCMAKE_POLICY_VERSION_MINIMUM=3.5

...and the build command targets the named libraries:

# Compile only the native libs, in parallel:
cmake --build jni/build \
  --target opensearchknn_faiss opensearchknn_common opensearchknn_nmslib opensearchknn_simd \
  --parallel "$(nproc 2>/dev/null || sysctl -n hw.ncpu)"

Or, far simpler, let Gradle do it:

./gradlew buildJniLib       # runs cmakeJniLib (configure) then the native compile

When it finishes, find the artifacts. The shared-library suffix is platform-specific (.so on Linux, .dylib on macOS, .dll on Windows):

find jni/build -name "libopensearchknn_*.so"  -o -name "libopensearchknn_*.dylib" 2>/dev/null
#   .../libopensearchknn_faiss.{so,dylib}
#   .../libopensearchknn_common.{so,dylib}
#   .../libopensearchknn_nmslib.{so,dylib}
#   .../libopensearchknn_util.{so,dylib}   (a shared utility lib the others link)
#   .../libopensearchknn_simd.{so,dylib}   (SIMD compute helpers)

Shared library (CMake target)	What it wraps
`opensearchknn_faiss`	the bundled faiss C++ library + the plugin's faiss JNI glue (`faiss_wrapper.cpp`, `org_opensearch_knn_jni_FaissService.cpp`)
`opensearchknn_nmslib`	the bundled nmslib library + nmslib JNI glue (gated; nmslib is deprecated)
`opensearchknn_common`	shared JNI utilities (`org_opensearch_knn_jni_JNICommons.cpp`) used by both engines
`opensearchknn_util`	low-level JNI helpers/exception translation that the others link against
`opensearchknn_simd`	SIMD compute helpers backing `SimdVectorComputeService`

Note: The usual shorthand for this build is "three libraries" (faiss/nmslib/common). The real CMake build emits more (util, simd), and the faiss/nmslib targets are gated by AVX2_ENABLED/AVX512_ENABLED and other feature flags. Always run grep "set(TARGET_LIB" jni/CMakeLists.txt to see the current target list rather than trusting any doc, including this one.

Step 5: Build the whole plugin

Now the full build — Gradle compiles the Java, invokes the CMake native build, and packages the installable zip:

./gradlew assemble          # Java + native + the plugin zip (no tests)
# OR the fuller build that also runs checks:
./gradlew build             # assemble + precommit + tests

# Find the installable plugin zip:
find build/distributions -name "opensearch-knn-*.zip"
#   build/distributions/opensearch-knn-3.6.0-SNAPSHOT.zip

The fastest inner loop — a node with the plugin already loaded, no manual install:

./gradlew run               # builds + starts a single node with the k-NN plugin installed
# In another terminal:
curl -s 'localhost:9200/_cat/plugins?v'
#   name   component  version
#   ...    opensearch-knn  3.6.0-SNAPSHOT

Step 6: Run the unit tests

# All Java unit tests:
./gradlew test

# A focused run (much faster while iterating):
./gradlew test --tests "org.opensearch.knn.index.mapper.KNNVectorFieldMapperTests"
./gradlew test --tests "*KNNQueryBuilderTests"

# Reproduce a flaky/seeded failure exactly (OpenSearch/Lucene test framework uses seeds):
./gradlew test --tests "*KNNQueryTests" -Dtests.seed=DEADBEEF

There are also native (C++) tests built and run via a separate task — the JNI glue has its own gtest suite:

grep -n "buildJniTest\|jniTest\|gtest" build.gradle | head
./gradlew buildJniTest      # builds the native libs, then runs the C++ JNI tests

Step 7: Run the integration tests

Integration tests spin up a real testClusters node with the plugin installed and exercise it over REST. They depend on buildJniLib, so the native libraries must build first.

# The main integ suite (starts a real node):
./gradlew integTest

# k-NN also has multi-node and remote-index-build integ variants — grep to see them all:
grep -n "RestIntegTestTask\|integTest\b\|integTestMultiNode\|integTestRemoteIndexBuild" build.gradle | head

# Run just one integ test class while iterating:
./gradlew integTest --tests "*KNNRestTestCase*" 2>/dev/null || \
  ./gradlew integTest --tests "org.opensearch.knn.*IT"

Note: Integration tests are slow (they boot a JVM cluster) and need the native libs. If integTest fails immediately with an UnsatisfiedLinkError, your buildJniLib step didn't produce the libraries for your platform — fix Step 4 first.

Step 8: Install the built plugin into a local distribution

./gradlew run is the dev loop, but to mimic a real deployment, install your zip into an actual OpenSearch distribution with bin/opensearch-plugin:

# 1. Produce (or locate) an OpenSearch distribution of the MATCHING version.
#    From your core checkout:
cd ~/src/oss-repos/OpenSearch
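./gradlew localDistro
#    The output directory varies by core version; look in both likely locations:
DISTRO=$(find build/distribution/local distribution/archives -maxdepth 2 -name "opensearch-*" -type d 2>/dev/null | head -1)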
echo "Distribution at: $DISTRO"

# 2. Install the k-NN zip you built (use a file:// URL to a local zip):
ZIP=$(find ~/src/oss-repos/k-NN/build/distributions -name "opensearch-knn-*.zip" | head -1)
"$DISTRO/bin/opensearch-plugin" install "file://$ZIP"
#   -> "Installing file://.../opensearch-knn-....zip"
#   -> "Continue with installation? [y/N]"  (the plugin has native code; answer y)

# 3. Start the node and confirm the plugin loaded:
"$DISTRO/bin/opensearch" &
sleep 20
curl -s 'localhost:9200/_cat/plugins?v'
#   name   component       version
#   node-0 opensearch-knn  3.6.0-SNAPSHOT

Step 9: Smoke-test the install end to end

Prove the native path works, not just that the plugin loaded:

# Create a tiny k-NN index, index two vectors, query, and warm up.
curl -s -XPUT 'localhost:9200/k1demo?pretty' -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true },
  "mappings": { "properties": {
    "v": { "type": "knn_vector", "dimension": 2,
           "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" } }
  }}
}'

curl -s -XPOST 'localhost:9200/k1demo/_bulk?refresh=true' -H 'Content-Type: application/x-ndjson' -d '
{ "index": { "_id": "1" } }
{ "v": [1.0, 1.0] }
{ "index": { "_id": "2" } }
{ "v": [9.0, 9.0] }
'

# Load the faiss graph into native memory, then query it:
curl -s -XGET 'localhost:9200/_plugins/_knn/warmup/k1demo?pretty'
curl -s -XPOST 'localhost:9200/k1demo/_search?pretty' -H 'Content-Type: application/json' -d '{
  "size": 1,
  "query": { "knn": { "v": { "vector": [1.1, 0.9], "k": 1 } } }
}'
# Expect _id "1" as the nearest neighbor.

# Confirm native memory accounting works (faiss graph is OFF-heap):
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | grep -E 'graph_memory|hit_count|miss_count|load'
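The smoke query's expected result can be checked by hand. The sketch below assumes the documented l2 scoring convention, score = 1 / (1 + d) with d the squared Euclidean distance; treat that formula as an assumption and confirm it against the k-NN documentation for your version.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(query, vector):
    """Assumed conversion: score = 1 / (1 + squared L2 distance)."""
    return 1.0 / (1.0 + squared_l2(query, vector))


docs = {"1": [1.0, 1.0], "2": [9.0, 9.0]}
query = [1.1, 0.9]

# Rank documents by score, best first.
ranked = sorted(docs, key=lambda d: l2_score(query, docs[d]), reverse=True)

if __name__ == "__main__":
    for doc_id in ranked:
        print(doc_id, round(l2_score(query, docs[doc_id]), 4))
```

Document "1" sits at squared distance 0.02 from the query, so under that convention its score is 1 / 1.02, just above 0.98.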

Implementation Requirements / Deliverables

A k-NN clone with jni/external/faiss and jni/external/nmslib populated (submodules initialized).
The native libraries built: libopensearchknn_faiss, libopensearchknn_common, libopensearchknn_nmslib (plus util/simd) found under jni/build.
./gradlew assemble produces build/distributions/opensearch-knn-<version>.zip.
./gradlew test passes (or you can name and explain any pre-existing flaky failure).
./gradlew integTest runs a real node and passes its suite.
The plugin installed into a localDistro via bin/opensearch-plugin install, listed in _cat/plugins.
A working faiss query against a 2-D index, with _plugins/_knn/stats showing non-zero graph_memory_usage after warmup.
You can state which build (Gradle vs CMake) owns a given failure mode.

Troubleshooting

Symptom	Likely cause	Fix
CMake errors about missing faiss headers; `jni/external/faiss` empty	submodules not initialized	`git submodule update --init --recursive`
`Could not resolve org.opensearch:opensearch:<version>`	no matching core artifact in Maven repos	publish core: `./gradlew publishToMavenLocal` in the matching core checkout
`CMake 3.24 or higher is required`	toolchain CMake too old	upgrade CMake; check `cmake --version` against `cmake_minimum_required` in `jni/CMakeLists.txt`
`UnsatisfiedLinkError` at node startup or in integTest	native lib not built / wrong arch / missing toolchain	run `./gradlew buildJniLib`; `find` the `libopensearchknn_*` artifacts; confirm they match your CPU arch
faiss CMake step fails on `OpenMP::OpenMP_CXX` (macOS)	OpenMP/`libomp` not installed	install `libomp` (e.g. via Homebrew); re-run `cmakeJniLib`
faiss build fails on BLAS/LAPACK (Linux)	missing BLAS dev package	install your distro's BLAS/LAPACK dev package; see `jni/CMakeLists.txt`
`./gradlew clean` then native rebuild fails repeatedly	stale CMake cache / patch state	`clean` removes `jni/build`; if patches re-apply, check `apply_lib_patches`/`commit_lib_patches` flags in `build.gradle`
Plugin install prompts and aborts	the install confirmation for a plugin with native code	answer `y`, or use `-b`/batch flag for non-interactive installs
`plugin [...] is incompatible with version [...]`	plugin built against a different core version than the distro	rebuild the plugin against the distro's exact version, or build a matching distro
Java compiles but native step never runs	you ran only `compileJava`, not `assemble`/`buildJniLib`	the native build is a separate task chain — run `assemble` or `buildJniLib`

Rule of thumb for which build owns a failure: a stack trace with org.opensearch.knn.* Java frames and a Gradle task name → the Gradle/Java side. A message mentioning cmake, make, faiss/nmslib headers, .so/.dylib, OpenMP, BLAS, or UnsatisfiedLinkError → the CMake/native side. Diagnose them with different tools.
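The rule of thumb can be turned into a rough triage helper. This is a keyword heuristic only: the patterns are taken from the sentence above, the function name is made up, and real logs will need judgment.

```python
import re

# Signals that point at the CMake/native side (checked first).
NATIVE_PATTERNS = [
    r"cmake", r"\bmake\b", r"(faiss|nmslib)\S*\.h\b", r"\.so\b", r"\.dylib\b",
    r"openmp", r"\bblas\b", r"unsatisfiedlinkerror",
]
# Signals that point at the Gradle/Java side.
JAVA_PATTERNS = [r"org\.opensearch\.knn\.", r"> task :"]


def owning_build(message):
    """Guess which build a failure message belongs to."""
    text = message.lower()
    if any(re.search(p, text) for p in NATIVE_PATTERNS):
        return "cmake/native"
    if any(re.search(p, text) for p in JAVA_PATTERNS):
        return "gradle/java"
    return "unknown"


if __name__ == "__main__":
    samples = [
        "fatal error: faiss/IndexFlat.h: No such file or directory",
        "java.lang.UnsatisfiedLinkError: no opensearchknn_faiss in java.library.path",
        "> Task :compileJava FAILED\n at org.opensearch.knn.index.mapper.KNNVectorFieldMapper.parse",
    ]
    for s in samples:
        print(owning_build(s), "<-", s.splitlines()[0])
```

Native signals are checked first because an UnsatisfiedLinkError arrives wrapped in a Java stack trace even though the fix is on the CMake side.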

Expected Output

# After ./gradlew buildJniLib
> Task :cmakeJniLib
> Task :buildJniLib
BUILD SUCCESSFUL

# Native artifacts (macOS example)
jni/build/release/libopensearchknn_faiss.dylib
jni/build/release/libopensearchknn_common.dylib
jni/build/release/libopensearchknn_nmslib.dylib

# After ./gradlew assemble
build/distributions/opensearch-knn-3.6.0-SNAPSHOT.zip

# _cat/plugins on the node with your build installed
name   component       version
node-0 opensearch-knn  3.6.0-SNAPSHOT

# Smoke query result
"hits": [ { "_id": "1", "_score": 0.83..., "_source": { "v": [1.0, 1.0] } } ]

# _plugins/_knn/stats after warmup (off-heap memory is non-zero)
"graph_memory_usage": 1,
"hit_count": 1,
"miss_count": 1

Stretch Goals

Build a single engine. The build supports -Pknn_libs=... to limit the native targets. Build only faiss (skip nmslib) and confirm the nmslib .so is absent. grep -n "knn_libs" build.gradle to find the property.
Toggle SIMD. Re-run cmakeJniLib with -Davx2.enabled=false (or the property the build reads — grep for avx2_enabled) and observe the configure flags change. Note that PlatformUtils.isAVX2SupportedBySystem() is what selects the runtime path.
Inspect the JNI glue. ls jni/src and read faiss_wrapper.cpp's query function. Match its Java_org_opensearch_knn_jni_FaissService_* name to the native method declared in src/main/java/org/opensearch/knn/jni/FaissService.java.
Run the C++ tests directly. After buildJniTest, find the gtest binary under jni/build and run it standalone to see the native test output without Gradle in the way.
Diff the published vs local artifact. After publishToMavenLocal, compare the core jar your plugin resolved against the one in the distribution you installed into — confirm they are the same version, which is why the plugin loads.
Time the build. Run ./gradlew assemble --profile and read the generated report. Notice how much of the wall-clock is the CMake/native step versus Java compilation.

Validation / Self-check

Name the two build systems k-NN uses, what each is responsible for, and how Gradle and CMake relate (which one invokes the other, and via which tasks).
Why does a fresh git clone (without --recursive) build the Java but fail the native step? Which directory tells you submodules are missing?
What does opensearch_version in build.gradle control, and when do you need publishToMavenLocal in the core repo before the plugin will build?
List the shared libraries the CMake build emits and what each wraps. Why are faiss and nmslib kept in separate libraries rather than one?
Given a failure message, decide whether the Gradle/Java build or the CMake/native build owns it. Give one concrete example of each.
What is the difference between ./gradlew run and installing the zip with bin/opensearch-plugin install? When would you use each?
After warming up a faiss index, which API proves the native graph is resident in off-heap memory, and which field do you read?

When you can build both trees, run all the test tiers, install your own zip, and serve a faiss query, you are ready to read code that actually runs. Continue to Lab K2: Trace a k-NN Query, and keep the k-NN architecture and native integration and memory chapters open as your map.

Lab K2: Trace a k-NN Query

Background

You can build the plugin (Lab K1) and you have read the query-path chapter. This lab makes that chapter yours by forcing you to find every hop in the live source, with a stopwatch running. A knn query travels a long way: REST body → KNNQueryBuilder → a per-shard Lucene Query (KNNQuery for native engines, a Lucene vector query for the lucene engine) → KNNWeight per segment → either a native faiss search across the JNI boundary or Lucene's own HNSW search → a top-k collect per shard → a final merge on the coordinating node.

This is a timed code-reading trace. You will set a 75-minute timer, grep your way down the call chain one hop at a time, capture a one-line note per hop in a reading-log artifact, and corroborate each hop against two runtime signals: the _search?profile=true output and TRACE-level logging from the k-NN query package. The discipline is the point: you are training the muscle that lets you land in an unfamiliar subsystem and orient in an hour, which is exactly what triaging a real k-NN issue demands.

Note: The cluster manager (formerly master) owns cluster state but is not on the hot path of a search. The query fan-out is coordinated by whichever node received the request (the coordinating node), which need not be the cluster manager. Keep these two roles separate as you trace; conflating them is a common early mistake.

This lab leans on two existing chapters — keep them open:

The k-NN query path — the narrative version of what you are about to verify in source.
Search execution — the core query/fetch fan-out that k-NN slots into; the per-shard query phase and coordinator reduce are core mechanics, not k-NN ones.

Why This Lab Matters for Contributors

Most k-NN issues are reported as a query symptom ("recall dropped", "filter ignored", "slower than expected", "wrong scores"). You cannot triage any of them without being able to trace the query path to the exact hop where behavior diverges.
The native (faiss) and Lucene (lucene) engines fork on the query path. Knowing where they diverge — and that one crosses JNI while the other stays in pure Java — is the single most load-bearing fact for debugging engine-specific behavior.
profile=true and TRACE logging are the two tools that turn a guess into evidence. Pairing a static code read with both runtime signals is how senior engineers confirm a hypothesis instead of arguing about it.
A reading-log artifact is reusable: drop it into an issue comment or a PR description and you have just made a reviewer's life easier and your analysis legible.

Prerequisites

A running node built from Lab K1 (./gradlew run is fine), with the k-NN plugin loaded.
The k-NN source checkout open in your editor; you will grep it constantly.
Familiarity with the query-path chapter — read it once before timing yourself.
A timer. Seriously. The constraint is what builds the skill.

# Orient yourself in the query package before the clock starts:
cd ~/src/oss-repos/k-NN
ls src/main/java/org/opensearch/knn/index/query/
#   KNNQueryBuilder.java  KNNQuery.java  KNNWeight.java  DefaultKNNWeight.java
#   KNNScorer.java  KNNQueryFactory.java  lucene/  lucenelib/  nativelib/  exactsearch/  rescore/ ...

Step-by-Step Tasks

Step 0: Set up the index and the reading log (5 min, off the clock)

Create one faiss index and one lucene index so you can trace both forks against real data.

# Native (faiss) engine:
curl -s -XPUT 'localhost:9200/trace_faiss?pretty' -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true },
  "mappings": { "properties": {
    "v": { "type": "knn_vector", "dimension": 3,
           "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" } }
  }}
}'

# Lucene engine (pure-Java path, no JNI):
curl -s -XPUT 'localhost:9200/trace_lucene?pretty' -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true },
  "mappings": { "properties": {
    "v": { "type": "knn_vector", "dimension": 3,
           "method": { "name": "hnsw", "engine": "lucene", "space_type": "l2" } }
  }}
}'

for idx in trace_faiss trace_lucene; do
  curl -s -XPOST "localhost:9200/$idx/_bulk?refresh=true" -H 'Content-Type: application/x-ndjson' -d '
{ "index": {"_id":"1"} }
{ "v": [1,1,1] }
{ "index": {"_id":"2"} }
{ "v": [2,2,2] }
{ "index": {"_id":"3"} }
{ "v": [9,9,9] }
'
done

Create the artifact you will fill in. A plain Markdown table is enough:

cat > ~/knn-query-reading-log.md <<'EOF'
# k-NN Query Reading Log — Lab K2

| Hop | File:method | What happens here | Evidence (grep/profile/TRACE) |
|-----|-------------|-------------------|-------------------------------|
| 0 REST | | | |
| 1 Builder->Query | | | |
| 2 Weight (per seg) | | | |
| 3a native (faiss/JNI) | | | |
| 3b lucene (KnnFloatVectorQuery) | | | |
| 4 per-shard top-k | | | |
| 5 coordinator reduce | | | |
EOF

Step 1: Turn on the evidence channels — `profile` and `TRACE` (start the timer)

Start your 75-minute timer now. First, enable both runtime signals so every hop you read in source has a corroborating fact.

# (a) Profile a query: per-component timing in the response, no logging needed.
curl -s -XPOST 'localhost:9200/trace_faiss/_search?profile=true&pretty' \
  -H 'Content-Type: application/json' -d '{
  "size": 2,
  "query": { "knn": { "v": { "vector": [1.1, 0.9, 1.0], "k": 2 } } }
}' | sed -n '1,80p'
# Look in the "profile" -> "shards" -> "searches" -> "query" array for the rewritten
# query class name (this is your proof of which Query implementation ran).

# (b) Crank TRACE logging for the k-NN query package (dynamic, no restart):
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "logger.org.opensearch.knn.index.query": "TRACE" }
}'

Record in the log (Hop 0): the REST entry point. Find where the knn query name is registered and parsed.

grep -rn "public static final String NAME = \"knn\"\|fromXContent\|VECTOR_FIELD\|K_FIELD" \
  src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java | head
# NAME = "knn"  -> registered via SearchPlugin.getQueries() in KNNPlugin (grep getQueries)
grep -rn "getQueries\|new QuerySpec\|KNNQueryBuilder::new\|KNNQueryBuilder::fromXContent" \
  src/main/java/org/opensearch/knn/plugin/KNNPlugin.java

Step 2: Hop 1 — `KNNQueryBuilder` → a per-shard `Query`

The builder validates the request against the field mapping and produces the Lucene Query. The single most important branch in the whole path lives here: which engine?

# The core conversion method (grep for the line, don't trust a number):
grep -n "doToQuery\|QueryShardContext\|class KNNQueryBuilder" \
  src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java
#   protected Query doToQuery(QueryShardContext context)

# doToQuery delegates the engine fork to a factory. Find it:
grep -rn "KNNQueryFactory\|createKNNQuery\|class KNNQueryFactory" \
  src/main/java/org/opensearch/knn/index/query/ | head

Open KNNQueryFactory and read the fork. This is where native and lucene diverge:

grep -n "LuceneEngineKnnVectorQuery\|OSKnnFloatVectorQuery\|OSKnnByteVectorQuery\|new KNNQuery\|KNNEngine\|isLuceneEngine\|getEngine" \
  src/main/java/org/opensearch/knn/index/query/KNNQueryFactory.java

Engine	Query class produced	Path from here
`faiss` / `nmslib` (native)	`KNNQuery`	→ `KNNWeight`/`DefaultKNNWeight` → JNI faiss search
`lucene`	`LuceneEngineKnnVectorQuery` wrapping `OSKnnFloatVectorQuery`/`OSKnnByteVectorQuery` (extends Lucene's `KnnFloatVectorQuery`/`KnnByteVectorQuery`)	→ Lucene's own HNSW search, pure Java, no JNI

Record in the log (Hop 1): KNNQueryBuilder.doToQuery → KNNQueryFactory → which Query class for each engine. Evidence: the profile=true output names the rewritten query class — confirm faiss shows a KNNQuery-family class and lucene shows a vector-query class.

Note: The lucene engine's queries extend Lucene's KnnFloatVectorQuery / KnnByteVectorQuery (see org.opensearch.knn.index.query.lucenelib). That means the entire search for the lucene engine is executed by Lucene's HNSW code — the same code path described in HNSW in Lucene — with k-NN only wrapping it to add filtering/score semantics. There is no k-NN native code on the lucene fork.

Step 3: Hop 2 — `KNNQuery` → `KNNWeight` per segment

A Lucene Query is turned into a Weight, and the Weight produces a per-segment scorer. For the native fork this is KNNWeight (abstract) with DefaultKNNWeight as the concrete implementation.

grep -n "class KNNWeight\|createWeight\|public Scorer scorer\|searchLeaf\|abstract" \
  src/main/java/org/opensearch/knn/index/query/KNNWeight.java | head
grep -n "class DefaultKNNWeight\|doANNSearch\|protected TopDocs" \
  src/main/java/org/opensearch/knn/index/query/DefaultKNNWeight.java | head

The key method is searchLeaf (per LeafReaderContext = per segment). Read it: it decides between approximate (ANN) search and an exact brute-force fallback, applies any filter as a BitSet, and returns per-leaf TopDocs.

grep -n "searchLeaf\|doANNSearch\|exactSearch\|ExactSearcher\|filterWeight\|BitSet\|canDoExactSearch\|cardinality" \
  src/main/java/org/opensearch/knn/index/query/KNNWeight.java | head -20
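
Before reading the real branch, it can help to hold a simplified model of the ANN-versus-exact decision in your head. The sketch below is an assumption-laden toy, not the plugin's logic: the class, the method, and the threshold constant are hypothetical, and the actual conditions in KNNWeight depend on settings and on the version you have checked out. It only captures the general idea that a filter matching very few documents makes a brute-force scan cheaper and more accurate than a graph search.

```java
// Simplified, hypothetical model of the per-segment ANN-vs-exact choice.
// The real conditions live in KNNWeight and vary by version; verify with grep.
class AnnOrExactSketch {
    enum Strategy { ANN, EXACT }

    // Hypothetical ceiling: filtered sets at or below this size are scanned exactly.
    static final int HYPOTHETICAL_EXACT_THRESHOLD = 1_000;

    static Strategy choose(boolean hasFilter, int filterCardinality, int k) {
        if (!hasFilter) {
            return Strategy.ANN;          // no filter: search the graph
        }
        if (filterCardinality <= k) {
            return Strategy.EXACT;        // every filtered doc can be a result anyway
        }
        if (filterCardinality <= HYPOTHETICAL_EXACT_THRESHOLD) {
            return Strategy.EXACT;        // small candidate set: brute force is cheap
        }
        return Strategy.ANN;              // large candidate set: use the graph with the filter
    }

    public static void main(String[] args) {
        System.out.println(choose(false, 0, 2));
        System.out.println(choose(true, 2, 2));
        System.out.println(choose(true, 5_000_000, 10));
    }
}
```

When you read searchLeaf, note each place the real code disagrees with this model and record the difference in the Hop 2 row.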

Record in the log (Hop 2): KNNWeight.searchLeaf — the per-segment entry, and the ANN-vs-exact branch. Evidence: in the TRACE log you will see per-segment search messages; in profile=true the time is attributed to the query's score/build_scorer breakdown.

Step 4: Hop 3a — the native faiss search across JNI

For a faiss field, doANNSearch is where Java leaves the JVM. It obtains the loaded native graph from the cache, then calls JNI to run the top-k search.

grep -n "doANNSearch\|JNIService\|FaissService\|queryIndex\|NativeMemoryCacheManager\|getIndexAllocation\|loadGraph" \
  src/main/java/org/opensearch/knn/index/query/DefaultKNNWeight.java | head
# The JNI dispatch:
grep -rn "queryIndex\|public static native" \
  src/main/java/org/opensearch/knn/jni/JNIService.java \
  src/main/java/org/opensearch/knn/jni/FaissService.java | head

The chain is: DefaultKNNWeight.doANNSearch → NativeMemoryCacheManager (load or reuse the graph; on a cache miss, FaissService.loadIndex) → JNIService.queryIndex → FaissService.queryIndex (native) → C++ faiss::Index::search → results marshalled back as KNNQueryResult[] → KNNScorer translates each distance to a Lucene score via the SpaceType.

grep -n "class KNNScorer\|score()\|scoreTranslation\|SpaceType" \
  src/main/java/org/opensearch/knn/index/query/KNNScorer.java \
  src/main/java/org/opensearch/knn/index/SpaceType.java | head
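
To make the distance-to-score step concrete, here is a minimal sketch for the l2 space only, assuming the documented translation score = 1 / (1 + d), where d is the squared Euclidean distance returned by the engine. The class and method names are hypothetical; the real translation lives in SpaceType and differs per space, so use this only to sanity-check the scores you see for the three documents indexed in Step 0.

```java
// Illustrative only: l2 distance-to-score translation, assuming score = 1 / (1 + d)
// with d the squared Euclidean distance. Names here are hypothetical.
class L2ScoreSketch {

    // Squared Euclidean distance between two vectors of equal length.
    static float squaredL2(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " vs " + b.length);
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Smaller distance gives a higher score, bounded to (0, 1].
    static float toScore(float squaredDistance) {
        return 1.0f / (1.0f + squaredDistance);
    }

    public static void main(String[] args) {
        float[] query = {1.1f, 0.9f, 1.0f};
        float[][] docs = {{1, 1, 1}, {2, 2, 2}, {9, 9, 9}};
        for (float[] doc : docs) {
            System.out.println(toScore(squaredL2(query, doc)));
        }
    }
}
```

For the lab's query vector the nearest document, [1,1,1], comes out at roughly 0.98 and the farthest, [9,9,9], close to zero, which is the ordering the search response should show.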

Record in the log (Hop 3a): doANNSearch → cache load → JNIService.queryIndex → faiss → KNNScorer. Evidence: _plugins/_knn/stats hit_count/miss_count/graph_memory_usage change after the query — proof the native graph was loaded and searched off-heap.

curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | grep -E 'hit_count|miss_count|graph_memory|graph_query'

Step 5: Hop 3b — the Lucene-engine search (no JNI)

Run the same trace against trace_lucene and watch the path stay inside the JVM. The lucene fork never touches KNNWeight/DefaultKNNWeight's native code; it executes Lucene's KnnFloatVectorQuery.

# Profile the lucene index and compare the query class name to the faiss one:
curl -s -XPOST 'localhost:9200/trace_lucene/_search?profile=true&pretty' \
  -H 'Content-Type: application/json' -d '{
  "size": 2, "query": { "knn": { "v": { "vector": [1.1,0.9,1.0], "k": 2 } } }
}' | grep -A2 '"query"' | head -20

# The k-NN wrappers over Lucene's vector queries:
grep -rn "extends KnnFloatVectorQuery\|extends KnnByteVectorQuery\|class OSKnnFloatVectorQuery\|class LuceneEngineKnnVectorQuery" \
  src/main/java/org/opensearch/knn/index/query/lucenelib/ \
  src/main/java/org/opensearch/knn/index/query/lucene/ | head

Record in the log (Hop 3b): the lucene query class, and the fact that _plugins/_knn/stats graph_memory_usage does not grow for a lucene-engine query (its vectors are read through Lucene's own segment files, on the JVM heap or memory-mapped by Lucene, and are never loaded into the k-NN native memory cache that this stat measures).

Warning: This is the discriminating experiment. If you query the lucene index and see faiss native-memory stats move, you have mixed up your indices. The lucene engine's whole reason to exist is "no native code" — and the stats prove it.

Step 6: Hop 4 and Hop 5 — per-shard top-k and coordinator reduce

The last two hops are core mechanics, shared with every search — k-NN does not own them. Per segment you have TopDocs; Lucene's collector merges them into a per-shard top-k; the coordinating node merges per-shard results into the global top-k. This is the fan-out in search execution.

# k-NN's collector / result merge helpers (per-shard side):
grep -rn "TopDocs\|TopApproxKnnCollector\|ResultUtil\|reduce\|merge" \
  src/main/java/org/opensearch/knn/index/query/ResultUtil.java \
  src/main/java/org/opensearch/knn/index/query/TopApproxKnnCollector.java 2>/dev/null | head
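
The two merges have the same shape: keep the k best hits out of several partial lists. The sketch below shows that shape with a bounded min-heap in plain Java. It is not the code core or Lucene runs (Lucene has its own TopDocs merging), and the Hit class and mergeTopK method are hypothetical names used only to illustrate why the per-shard and coordinator steps can be reasoned about in the same way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative top-k merge; hypothetical names, not the core or Lucene implementation.
class TopKMergeSketch {

    static final class Hit {
        final String id;
        final float score;

        Hit(String id, float score) {
            this.id = id;
            this.score = score;
        }

        @Override
        public String toString() {
            return id + ":" + score;
        }
    }

    // Merges partial result lists into the k highest-scoring hits, best first.
    static List<Hit> mergeTopK(List<List<Hit>> partials, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble(h -> h.score);
        // Min-heap: the weakest retained hit sits at the head and is evicted first.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> partial : partials) {
            for (Hit hit : partial) {
                heap.offer(hit);
                if (heap.size() > k) {
                    heap.poll();
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(byScore.reversed());
        return merged;
    }

    public static void main(String[] args) {
        // Segment-level results merged into one shard's top-k ...
        List<Hit> segmentA = List.of(new Hit("1", 0.98f), new Hit("3", 0.005f));
        List<Hit> segmentB = List.of(new Hit("2", 0.25f));
        List<Hit> shard0 = mergeTopK(List.of(segmentA, segmentB), 2);
        // ... then shard-level results merged again on the coordinating node.
        List<Hit> shard1 = List.of(new Hit("7", 0.60f));
        System.out.println(mergeTopK(List.of(shard0, shard1), 2));
    }
}
```

The same function is applied twice in main, once per level, which mirrors the point of this step: only the inputs change between the per-shard collect and the coordinator reduce.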

Record in the log (Hops 4–5): per-shard collect, then coordinator reduce — and note that these are core, not k-NN, code. Evidence: in profile=true, the per-shard timing is under each shard's entry; the cross-shard reduce is not in the per-shard profile (it happens on the coordinator after shards return).

Step 7: Confirm with TRACE logs, then turn logging off

Read the logs you generated. The k-NN query package at TRACE prints the decisions you read in source — the engine chosen, ANN vs exact, filter cardinality, k.

# Wherever your node writes logs (./gradlew run -> build/testclusters or console):
grep -i "org.opensearch.knn.index.query" build/testclusters/*/logs/*.log 2>/dev/null | tail -40
# Or just watch the console where ./gradlew run is printing.

# IMPORTANT: reset logging so you don't drown the node in TRACE output:
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "logger.org.opensearch.knn.index.query": null }
}'

Stop the timer. Whatever you have in the reading log at 75 minutes is your artifact; the goal is coverage of every hop with at least one piece of evidence, not perfection.

Implementation Requirements / Deliverables

A completed reading-log artifact (~/knn-query-reading-log.md) with all seven rows filled: file:method, one-line "what happens", and an evidence cell per hop.
For Hop 1, the exact KNNQueryFactory branch that selects native vs lucene, named.
For Hop 3a, the JNI call chain named: doANNSearch → cache → JNIService.queryIndex → FaissService (native) → faiss → KNNScorer.
For Hop 3b, the Lucene query class the lucene engine produces, and proof (stats) that it does not consume native graph memory.
A profile=true response saved for both the faiss and lucene indices, with the differing query class names highlighted.
At least one TRACE log line cited per major hop where logging emits one.
Logging reset to default at the end (no lingering TRACE).

Troubleshooting

Symptom	Likely cause	Fix
`profile=true` shows an empty or generic query	the `knn` query rewrote to a different form (e.g. exact/script fallback, or matched no docs)	index more docs; confirm the field has a `method`; check the rewritten class name, not the original
TRACE logging produces nothing	logger name wrong or set on the wrong node	set `logger.org.opensearch.knn.index.query` exactly; on multi-node, it must reach the data node running the shard
faiss and lucene profiles look identical	you queried the same index twice	double-check the index name in the URL; they must be `trace_faiss` vs `trace_lucene`
`graph_memory_usage` is 0 after a faiss query	query hit the exact-search fallback, or you queried the lucene index	confirm engine is `faiss` and the segment has a built graph; `searchLeaf` may have chosen exact for a tiny/filtered set
Can't find `doToQuery` at a line number	line numbers drift between versions	`grep -n "doToQuery"` — never trust a hardcoded line number
`KNNQueryFactory` doesn't exist by that name	refactored in your version	`grep -rln "LuceneEngineKnnVectorQuery\|new KNNQuery"` to find the current fork site
Node drowns in log output	TRACE left on	reset the logger setting to `null` (Step 7)

Expected Output

# profile=true on trace_faiss (native) — the rewritten query is a KNNQuery-family class
"query": [ { "type": "KNNQuery", "description": "...", "time_in_nanos": ... } ]

# profile=true on trace_lucene — the rewritten query is a Lucene vector query
"query": [ { "type": "DocAndScoreQuery", ... } ]   # or a KnnFloatVectorQuery-derived class name, depending on version

# _plugins/_knn/stats after the faiss query (off-heap graph loaded & searched)
"hit_count": 1, "miss_count": 1, "graph_memory_usage": 1, "graph_query_requests": 1

# _plugins/_knn/stats after the lucene query (native graph memory unchanged)
"graph_memory_usage": <same as before>   # lucene vectors are NOT loaded into the k-NN native memory cache

# Reading log (excerpt)
| 3a native | DefaultKNNWeight.doANNSearch | loads graph via cache, JNIService.queryIndex -> faiss | stats miss_count++ |
| 3b lucene | OSKnnFloatVectorQuery (extends Lucene KnnFloatVectorQuery) | Lucene HNSW, pure Java | graph_memory_usage flat |

Stretch Goals

Trace the filter path. Add a filter clause to the knn query and follow it into KNNWeight — find where the filter Weight is evaluated to a BitSet and how cardinality drives the pre-filter-vs-post-filter / exact-vs-ANN decision (FilterIdsSelector, canDoExactSearch). Document the new sub-hops.
Trace radial search. Issue a min_score/max_distance query instead of k and find where the builder routes it (RNNQueryFactory) and how the per-segment loop differs.
Trace rescoring. Use an on_disk/quantized field and a rescore clause; find RescoreKNNVectorQuery and where full-precision rescoring re-ranks the quantized top-k.
Watch the cache miss become a hit. Run the same faiss query twice and confirm miss_count increments once, then hit_count thereafter — proving the NativeMemoryCacheManager cached the graph.
Add your own TRACE line. In DefaultKNNWeight.doANNSearch, add a temporary log.trace(...) printing k, the segment, and the result count; rebuild, re-run, and see it appear. (Revert before committing — this is a learning probe, not a PR.)
Compare to a script-score query. Run an exact knn_score script query and trace it to the script engine instead of KNNWeight; contrast the two with the query-path chapter's approximate-vs-exact section.

Validation / Self-check

Name every hop from REST to coordinator reduce, the file:method that owns it, and which hops are core (shared with all searches) versus k-NN-specific.
Where exactly does the native-vs-lucene engine fork happen, and what Query class does each engine produce? Why does only one of them cross the JNI boundary?
In KNNWeight.searchLeaf, what decides between approximate (ANN) and exact search? Name one condition that forces exact.
Walk the native search sub-chain: from doANNSearch to faiss and back. What loads the graph, what makes the JNI call, and what turns a distance into a Lucene score?
You query the lucene index and graph_memory_usage grows. What does that tell you went wrong, and why is it impossible if you really queried a lucene-engine field?
Which two runtime signals corroborate your static read, and what does each one tell you that the other can't?
The coordinator reduce does not appear in the per-shard profile output. Why — where does it happen, and on which node?

When your reading log covers all seven hops with evidence, and you can explain the native/lucene fork from memory, you understand the k-NN query path. Continue to Lab K3: The knn_vector Field Type to trace the index path that produced the graphs you just searched, and re-read the query-path chapter for the filtering, radial, and rescoring details you stretched into above.

Lab K3: The knn_vector Field Type

Background

Every k-NN index begins with one mapping decision: a field of type knn_vector. That single type declaration is where the plugin's most consequential choices are made — the dimension, the distance space, the engine (faiss / lucene / nmslib), the algorithm (hnsw / ivf), the data type (float / byte / binary), and the on-disk mode/compression story. Get the mapping right and indexing and querying "just work"; get it wrong and you get rejected documents, the wrong engine, exact scans instead of ANN, or a node that OOMs on native memory. The class that owns all of this is KNNVectorFieldMapper (with its companion KNNVectorFieldType), and reading it is the fastest way to understand how a high-level mapping JSON turns into low-level engine behavior.

In this Build-It lab you first read the mapper deeply — the ParametrizedFieldMapper Builder and its Parameter<> fields, the validation that happens at mapping time, and the wiring from the mapping into the KNNEngine and the custom codec. Then you make a small, real change: you add a validation guard to the dimension parameter (reject dimensions above a configurable ceiling with a clear error), and you write a KNNVectorFieldMapperTests-style unit test that proves it. The change is illustrative — the value is in landing a real edit in a real mapper and testing it the way the project does.

Note: The cluster manager (formerly master) publishes the index metadata that includes this mapping, but the mapper itself runs on the data nodes that index and search the field. Mapping validation happens when the mapping is parsed (on the node creating or updating the index); per-document parsing happens on the primary shard. Keep those two moments distinct — your validation guard fires at the first.

Read these alongside the lab:

k-NN architecture — where the field mapper sits in the index path and how it reaches the codec.
Engines — the faiss/lucene/nmslib capability fork your mapping selects.
Mapping and analysis — the core ParametrizedFieldMapper/Mapper/FieldMapper machinery k-NN's mapper is built on; this lab does not re-derive it.

Why This Lab Matters for Contributors

Mapping bugs and feature requests are a steady stream of k-NN issues ("dimension X rejected", "why can't I use filter with this engine", "add a new compression level"). All of them route through KNNVectorFieldMapper. Knowing it makes those issues legible.
The mapper is the single point where a user's intent (engine, space, method, data type, mode) is validated against what the chosen KNNEngine supports — exactly the kind of small, well-scoped change a new contributor can own.
Writing a ParametrizedFieldMapper Builder with Parameter<> validators is a transferable OpenSearch skill — every field type in core and in plugins uses this exact pattern. You learn it once, on the most interesting field type in the ecosystem.
The project's mapper-test idiom (TypeParser.parse(...) + expectThrows) is how you prove a mapping change without booting a cluster. It is fast, deterministic, and reviewer-friendly.

Prerequisites

A from-source k-NN build (Lab K1); ./gradlew test works.
You have traced the query path (Lab K2) and read k-NN architecture.
Comfort with the core mapping & analysis deep dive — specifically the ParametrizedFieldMapper/Parameter model.

cd ~/src/oss-repos/k-NN
ls src/main/java/org/opensearch/knn/index/mapper/
#   KNNVectorFieldMapper.java  KNNVectorFieldType.java  KNNMappingConfig.java
#   EngineFieldMapper.java  FlatVectorFieldMapper.java  ModelFieldMapper.java
#   CompressionLevel.java  Mode.java  VectorValidator.java  PerDimensionValidator.java ...

Step-by-Step Tasks

Step 1: Map the mapping parameters to the source

The knn_vector field type accepts a fixed set of mapping parameters. Find where each is declared — they are Parameter<> fields on KNNVectorFieldMapper.Builder.

grep -n "class Builder\|Parameter<\|KNNConstants\.\|restrictedStringParam\|stringParam" \
  src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java | head -40

# The constant names the parameters use:
grep -n "DIMENSION\|KNN_METHOD\|KNN_ENGINE\|VECTOR_DATA_TYPE_FIELD\|MODE_PARAMETER\|COMPRESSION_LEVEL\|MODEL_ID\|TOP_LEVEL_PARAMETER_SPACE_TYPE" \
  src/main/java/org/opensearch/knn/common/KNNConstants.java | head

You should be able to reconstruct this table from the source (the Builder.getParameters() list — there are 11, confirmed by testBuilder_getParameters):

Mapping param	Constant / `Parameter` field	What it controls	Engine behavior it drives
`dimension`	`KNNConstants.DIMENSION` → `dimension`	vector length; validated `> 0`; must match every doc and the query	graph node arity; rejects mismatched vectors at parse time
`space_type`	`TOP_LEVEL_PARAMETER_SPACE_TYPE` (or under `method`)	distance metric (`l2`/`cosinesimil`/`innerproduct`/`l1`/`linf`/`hamming`)	which `SpaceType` distance the engine computes; `hamming` only for binary
`method.name`	`KNN_METHOD` → `knnMethodContext`	`hnsw` or `ivf`	graph (HNSW) vs inverted-list (IVF, needs training)
`method.engine`	`KNN_ENGINE` (or top-level `engine`)	`faiss` / `lucene` / `nmslib`	native off-heap (faiss/nmslib) vs Lucene on-heap; capability set
`method.parameters`	`m`, `ef_construction`, `ef_search`, `nlist`/`nprobes`	algorithm tuning	recall/latency/memory trade-off
`data_type`	`VECTOR_DATA_TYPE_FIELD` → `vectorDataType`	`float` (default) / `byte` / `binary`	per-element width; binary enables `hamming`
`mode`	`MODE_PARAMETER` → `mode`	`in_memory` / `on_disk`	selects the disk-ANN path + a default compression
`compression_level`	`COMPRESSION_LEVEL_PARAMETER` → `compressionLevel`	`1x`/`2x`/`4x`/`8x`/`16x`/`32x`	quantization aggressiveness for `on_disk`
`model_id`	`MODEL_ID` → `modelId`	trained-model reference (IVF/PQ)	field inherits method from a trained model instead of inline `method`

Note: space_type and engine can appear either at the top level of the mapping or nested under method — k-NN supports both spellings (the topLevelSpaceType/topLevelEngine parameters). Grep TOP_LEVEL_PARAMETER_SPACE_TYPE; it matters when you read validation, because both spellings must be reconciled.
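
Reconciling two spellings of the same parameter usually comes down to three rules: reject a conflict, otherwise take whichever spelling was set, otherwise fall back to a default. The sketch below illustrates those rules for space_type in plain Java. It is a hypothetical stand-in, not the plugin's resolver: the class name, the error text, and the l2 fallback are assumptions, and the real code may apply different defaults (for example for binary vectors).

```java
// Hypothetical illustration of reconciling a top-level parameter with the same
// parameter nested under "method". Not the plugin's actual resolver.
class SpaceTypeReconcileSketch {

    // Assumed fallback for this sketch only; check the real default in the source.
    static final String ASSUMED_DEFAULT = "l2";

    static String resolve(String topLevel, String underMethod) {
        if (topLevel != null && underMethod != null && !topLevel.equals(underMethod)) {
            // Both spellings were given and they disagree: fail loudly, naming both values.
            throw new IllegalArgumentException(
                "Conflicting space_type values: top-level [" + topLevel + "] vs method [" + underMethod + "]");
        }
        if (topLevel != null) {
            return topLevel;
        }
        if (underMethod != null) {
            return underMethod;
        }
        return ASSUMED_DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(resolve("cosinesimil", null));
        System.out.println(resolve(null, "innerproduct"));
        System.out.println(resolve(null, null));
    }
}
```

When you read the real validation, look for the same three cases and note which one produces the user-facing error.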

Step 2: Read the validation — where bad mappings are rejected

The mapper validates the combination of parameters against the chosen engine's capabilities. This is the heart of the field type and where most of its bugs live.

# Per-element validators (e.g. byte range, fp16 range):
grep -n "class\|validate\|FloatVectorValidator\|ByteVectorValidator" \
  src/main/java/org/opensearch/knn/index/mapper/PerDimensionValidator.java \
  src/main/java/org/opensearch/knn/index/mapper/VectorValidator.java

# Whole-vector validators (e.g. space-specific constraints):
grep -n "class\|validateVector" \
  src/main/java/org/opensearch/knn/index/mapper/SpaceVectorValidator.java

# Engine-capability checks at build time (e.g. nmslib has no filter, lucene/ivf rules):
grep -rn "validateMethod\|isTrainingRequired\|validate\|UnsupportedOperationException\|addValidationError" \
  src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java | head -20

The dimension parameter is the cleanest example. Read its declaration — a Parameter<Integer> whose (name, context, value) -> { ... } lambda parses and validates:

grep -n "Parameter<Integer> dimension" src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java

// src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java (illustrative — grep for the real block)
protected final Parameter<Integer> dimension = new Parameter<>(
    KNNConstants.DIMENSION,
    false,
    () -> UNSET_MODEL_DIMENSION_IDENTIFIER,
    (n, c, o) -> {                                   // n = parameter name, c = parser context, o = raw value; `name` below is the Builder's field name
        if (o == null) {
            throw new IllegalArgumentException("Dimension cannot be null");
        }
        int value;
        try {
            value = XContentMapValues.nodeIntegerValue(o);
        } catch (Exception exception) {
            throw new IllegalArgumentException(
                String.format(Locale.ROOT, "Unable to parse [dimension] from provided value [%s] for vector [%s]", o, name));
        }
        if (value <= 0) {                            // <-- the existing lower-bound guard
            throw new IllegalArgumentException(
                String.format(Locale.ROOT, "Dimension value must be greater than 0 for vector: %s", name));
        }
        return value;
    },
    m -> toType(m).originalMappingParameters.getDimension()
);

This lower-bound guard is exactly the shape of the change you will make: add an upper-bound guard beside it.
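
As a standalone illustration of that shape, the sketch below reproduces the parse-then-bound-check flow in plain Java with no OpenSearch dependencies. Everything in it is hypothetical: the class name, the parseDimension helper, the ceiling constant and its value, and the upper-bound error text are placeholders for illustration, and the number parsing is a simplified stand-in for XContentMapValues.nodeIntegerValue.

```java
import java.util.Locale;

// Standalone illustration of a lower and upper bound guard on a parsed integer.
// All names and the ceiling value are hypothetical, not taken from the plugin.
class DimensionGuardSketch {

    // Hypothetical ceiling, kept in one constant so it is one obvious place to change.
    static final int HYPOTHETICAL_MAX_DIMENSION = 16_000;

    static int parseDimension(String fieldName, Object raw) {
        if (raw == null) {
            throw new IllegalArgumentException("Dimension cannot be null");
        }
        int value;
        try {
            // Simplified stand-in for XContentMapValues.nodeIntegerValue.
            value = (raw instanceof Number) ? ((Number) raw).intValue() : Integer.parseInt(raw.toString().trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(String.format(Locale.ROOT,
                "Unable to parse [dimension] from provided value [%s] for vector [%s]", raw, fieldName));
        }
        if (value <= 0) {
            throw new IllegalArgumentException(String.format(Locale.ROOT,
                "Dimension value must be greater than 0 for vector: %s", fieldName));
        }
        if (value > HYPOTHETICAL_MAX_DIMENSION) {
            // The message names the value, the limit, and the field.
            throw new IllegalArgumentException(String.format(Locale.ROOT,
                "Dimension value [%d] exceeds the maximum of [%d] for vector: %s",
                value, HYPOTHETICAL_MAX_DIMENSION, fieldName));
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseDimension("v", 3));
        System.out.println(parseDimension("v", "128"));
    }
}
```

The ordering matters: parse failures are reported before range failures, so a user who sends a non-numeric value gets the parse message rather than a misleading bounds error.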

Step 3: Wire the mapping to the engine and codec (read-only)

Before changing anything, confirm how the validated mapping reaches the engine and the codec — this is what your validation is protecting.

# The field type holds the resolved mapping config used at index/query time:
grep -n "class KNNVectorFieldType\|getKnnMappingConfig\|KNNMappingConfig\|getKnnEngine\|SpaceType" \
  src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldType.java \
  src/main/java/org/opensearch/knn/index/mapper/KNNMappingConfig.java | head

# The engine the mapping selects (faiss/lucene/nmslib) and its capabilities:
grep -rn "enum KNNEngine\|FAISS\|LUCENE\|NMSLIB\|getMaxDimension\|supports" \
  src/main/java/org/opensearch/knn/index/engine/KNNEngine.java | head

# How the codec gets the native-vs-lucene format from the engine (the index path):
grep -rn "NativeEngines990KnnVectorsFormat\|KnnVectorsFormat\|getVectorsFormat\|EngineFieldMapper" \
  src/main/java/org/opensearch/knn/index/codec/ \
  src/main/java/org/opensearch/knn/index/mapper/EngineFieldMapper.java | head

The picture: KNNVectorFieldMapper.Builder validates → produces a KNNVectorFieldType holding a KNNMappingConfig (dimension, SpaceType, KNNEngine, method/model) → at flush/merge the engine selects the vectors format (native NativeEngines990KnnVectorsFormat for faiss/nmslib, Lucene's own format for lucene) → the graph is written as segment files. Your validation runs before any of this, at mapping-parse time, so a bad dimension never reaches the engine.

Map params → engine behavior, the version you should be able to defend:

Mapping choice	faiss	lucene	nmslib (deprecated)
graph storage	native off-heap via k-NN codec	Lucene `.vec`/`.vex`/`.vem` on heap/mmap	native off-heap via k-NN codec
`method.name`	`hnsw`, `ivf`	`hnsw` only	`hnsw` only
`filter` support	yes (efficient filtering)	yes	no
`data_type`	float, byte, binary	float, byte	float
`on_disk` / compression	yes (PQ/BQ/SQ, rescoring)	scalar quant (Lucene SQ)	no
training (`model_id`)	yes (IVF/PQ)	no	no
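One way to internalize the table is to encode it as data that a validator could consult. The sketch below is a hypothetical, dependency-free model: `EngineSketch` and its methods are invented for illustration and are not the real `KNNEngine` API. The capability values are copied from the table above and nothing else is assumed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the capability table above; NOT the real KNNEngine. */
enum EngineSketch {
    FAISS(true, true, "hnsw", "ivf"),
    LUCENE(true, false, "hnsw"),
    NMSLIB(false, false, "hnsw");

    private final boolean supportsFilter;
    private final boolean supportsTraining;
    private final Set<String> methods;

    EngineSketch(boolean supportsFilter, boolean supportsTraining, String... methods) {
        this.supportsFilter = supportsFilter;
        this.supportsTraining = supportsTraining;
        this.methods = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(methods)));
    }

    boolean supportsFilter() { return supportsFilter; }

    boolean supportsTraining() { return supportsTraining; }

    /** Rejects a method.name that the table does not list for this engine. */
    void validateMethod(String methodName) {
        if (!methods.contains(methodName)) {
            throw new IllegalArgumentException(String.format(
                Locale.ROOT, "Method [%s] is not supported by engine [%s]",
                methodName, name().toLowerCase(Locale.ROOT)));
        }
    }
}
```

For example, `EngineSketch.LUCENE.validateMethod("ivf")` throws, while `EngineSketch.FAISS.validateMethod("ivf")` returns normally. The real plugin spreads these checks across several classes; grep for them rather than assuming this layout.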

Step 4: Make the change — add an upper-bound dimension guard

Add a configurable maximum-dimension check beside the existing value <= 0 guard. The ceiling should be read from a constant so it is one obvious place to change, and the error must name the field and the limit (good mapper errors are specific). First, find the real site:

grep -n "Dimension value must be greater than 0" \
  src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java

Add a constant (near the other KNNConstants, or as a static final on the mapper — grep to match the project's placement convention; an illustrative inline constant is shown here):

// In KNNVectorFieldMapper (or KNNConstants — match the project's convention).
// Illustrative ceiling; real engine maxima live on KNNEngine (grep getMaxDimension).
public static final int LAB_MAX_DIMENSION = 16_000;

Then extend the dimension parameter's validation lambda, immediately after the existing lower-bound check:

        if (value <= 0) {
            throw new IllegalArgumentException(
                String.format(Locale.ROOT, "Dimension value must be greater than 0 for vector: %s", name));
        }
        // --- Lab K3 addition: upper-bound guard ---
        if (value > LAB_MAX_DIMENSION) {
            throw new IllegalArgumentException(
                String.format(
                    Locale.ROOT,
                    "Dimension value [%d] exceeds the maximum supported dimension [%d] for vector: %s",
                    value, LAB_MAX_DIMENSION, name));
        }
        // --- end addition ---
        return value;

Warning: This is deliberately illustrative. In a real PR you would not hardcode a ceiling; you would derive it from the chosen KNNEngine (each engine has its own maximum; grep getMaxDimension src/main/java/org/opensearch/knn/index/engine/KNNEngine.java), because faiss, lucene, and nmslib do not share a limit. Describe that per-engine approach in your PR description as the production-correct version, and reference the per-engine maxima. The point of the lab is the mechanics of adding and testing a mapper validation, not this specific number.
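To make the per-engine idea concrete, here is a minimal sketch of a guard that takes its ceiling from the engine instead of a single constant. Everything in it is hypothetical: `EngineLimitSketch`, `DimensionGuardSketch`, and especially the numbers are placeholders chosen only so the engines differ. Read the real maxima from KNNEngine in your checkout.

```java
import java.util.Locale;

/** Hypothetical engines with PLACEHOLDER ceilings; these are not the real engine maxima. */
enum EngineLimitSketch {
    FAISS(16_000), LUCENE(12_000), NMSLIB(8_000);

    private final int maxDimension;

    EngineLimitSketch(int maxDimension) { this.maxDimension = maxDimension; }

    int maxDimension() { return maxDimension; }
}

/** Hypothetical validator: same two guards as the lab, but the ceiling comes from the engine. */
final class DimensionGuardSketch {
    private DimensionGuardSketch() {}

    static int validate(String fieldName, EngineLimitSketch engine, int value) {
        if (value <= 0) {
            throw new IllegalArgumentException(String.format(
                Locale.ROOT, "Dimension value must be greater than 0 for vector: %s", fieldName));
        }
        int max = engine.maxDimension();
        if (value > max) {
            // Name the value, the limit, the engine, and the field so the error is actionable.
            throw new IllegalArgumentException(String.format(
                Locale.ROOT,
                "Dimension value [%d] exceeds the maximum supported dimension [%d] for engine [%s] for vector: %s",
                value, max, engine.name().toLowerCase(Locale.ROOT), fieldName));
        }
        return value;
    }
}
```

The design point is that the limit has one owner (the engine), so adding an engine or changing a limit never touches the mapper's validation code.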

Step 5: Write the unit test — `KNNVectorFieldMapperTests`-style

The project tests the mapper without a cluster: build a mapping with XContentBuilder, run it through KNNVectorFieldMapper.TypeParser.parse(...), and assert. Use expectThrows for the rejection path. Add this to src/test/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperTests.java:

public void testDimension_aboveMaximum_thenThrows() throws IOException {
    ModelDao modelDao = mock(ModelDao.class);
    KNNVectorFieldMapper.TypeParser typeParser = new KNNVectorFieldMapper.TypeParser(() -> modelDao);

    // A mapping with a dimension just over the lab ceiling.
    XContentBuilder tooBig = XContentFactory.jsonBuilder()
        .startObject()
        .field(TYPE_FIELD_NAME, KNN_VECTOR_TYPE)
        .field(DIMENSION_FIELD_NAME, KNNVectorFieldMapper.LAB_MAX_DIMENSION + 1)
        .endObject();

    IllegalArgumentException ex = expectThrows(
        IllegalArgumentException.class,
        () -> typeParser.parse(
            TEST_FIELD_NAME,
            xContentBuilderToMap(tooBig),
            buildParserContext(TEST_INDEX_NAME, settings)   // settings is the test base helper
        )
    );
    assertTrue(ex.getMessage(), ex.getMessage().contains("exceeds the maximum supported dimension"));
}

public void testDimension_atMaximum_thenSucceeds() throws IOException {
    ModelDao modelDao = mock(ModelDao.class);
    KNNVectorFieldMapper.TypeParser typeParser = new KNNVectorFieldMapper.TypeParser(() -> modelDao);

    // Exactly at the ceiling must still parse (boundary case).
    XContentBuilder atMax = XContentFactory.jsonBuilder()
        .startObject()
        .field(TYPE_FIELD_NAME, KNN_VECTOR_TYPE)
        .field(DIMENSION_FIELD_NAME, KNNVectorFieldMapper.LAB_MAX_DIMENSION)
        .endObject();

    KNNVectorFieldMapper.Builder builder = (KNNVectorFieldMapper.Builder) typeParser.parse(
        TEST_FIELD_NAME,
        xContentBuilderToMap(atMax),
        buildParserContext(TEST_INDEX_NAME, settings)
    );
    assertEquals(KNNVectorFieldMapper.LAB_MAX_DIMENSION, (int) builder.getOriginalParameters().getDimension());
}

Note: Helpers like buildParserContext, the settings field, xContentBuilderToMap, and getOriginalParameters() come from the existing test class (extends KNNTestCase). Copy the current signatures from an existing testTypeParser_* test rather than trusting these; the base class is KNNTestCase, not the generic OpenSearchTestCase.
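If expectThrows is new to you, the following sketch shows the idea behind it in plain Java. It is a simplified, hypothetical model (`ExpectThrowsSketch` is not the test framework's implementation): run the code, fail if nothing is thrown or the wrong type is thrown, and otherwise hand the exception back so the test can assert on its message.

```java
/** Hypothetical, simplified model of an expectThrows-style helper; not the framework's code. */
final class ExpectThrowsSketch {
    private ExpectThrowsSketch() {}

    /** Like Runnable, but allowed to throw anything. */
    interface ThrowingRunnable {
        void run() throws Throwable;
    }

    static <T extends Throwable> T expectThrows(Class<T> expected, ThrowingRunnable code) {
        try {
            code.run();
        } catch (Throwable t) {
            if (expected.isInstance(t)) {
                return expected.cast(t); // hand the exception back for message assertions
            }
            throw new AssertionError(
                "Expected " + expected.getSimpleName() + " but got " + t, t);
        }
        // Reaching this line means the code under test accepted the bad input.
        throw new AssertionError(
            "Expected " + expected.getSimpleName() + " but nothing was thrown");
    }
}
```

Returning the exception is what lets the reject-path test above check the message text, not just the exception type.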

Step 6: Build, run the test, and verify against a live node

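# Compile + run only your new tests (fast inner loop). Gradle combines multiple
# --tests filters as a union, so name each method fully instead of also passing the class:
./gradlew test \
  --tests "org.opensearch.knn.index.mapper.KNNVectorFieldMapperTests.testDimension_aboveMaximum_thenThrows" \
  --tests "org.opensearch.knn.index.mapper.KNNVectorFieldMapperTests.testDimension_atMaximum_thenSucceeds"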

# Run the whole mapper test class to confirm you didn't break a sibling test:
./gradlew test --tests "org.opensearch.knn.index.mapper.KNNVectorFieldMapperTests"

Then prove it end to end on a node (./gradlew run with your change):

# Rejected: dimension over the ceiling -> 400 with your specific message.
curl -s -XPUT 'localhost:9200/dim_too_big?pretty' -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true },
  "mappings": { "properties": {
    "v": { "type": "knn_vector", "dimension": 16001,
           "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" } }
  }}
}'
# Expect: "exceeds the maximum supported dimension [16000]"

# Accepted: a normal dimension still works.
curl -s -XPUT 'localhost:9200/dim_ok?pretty' -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true },
  "mappings": { "properties": {
    "v": { "type": "knn_vector", "dimension": 128,
           "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" } }
  }}
}'
# Expect: "acknowledged": true

Step 7: Precommit and a clean commit

./gradlew spotlessApply
./gradlew precommit
git checkout -b lab/k3-dimension-guard
git add src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java \
        src/test/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapperTests.java
git commit -s -m "Lab K3: add upper-bound dimension validation to knn_vector mapper"

Implementation Requirements / Deliverables

You can reconstruct the knn_vector parameter table from Builder.getParameters() (all 11 params) and explain what each one drives in the engine.
The new upper-bound guard added to the dimension Parameter validation lambda, with a specific, field-naming error message.
A KNNVectorFieldMapperTests test for the reject path (expectThrows, IllegalArgumentException) and the boundary accept path.
./gradlew test --tests "*KNNVectorFieldMapperTests" passes, including your new tests and all pre-existing ones.
A live node rejects an over-ceiling mapping with your message and accepts a normal one.
precommit passes; a DCO-signed commit; SPDX headers intact on every touched file.
In your write-up: the production-correct version of this change (per-engine getMaxDimension) and why hardcoding a single ceiling is wrong.

Troubleshooting

Symptom	Likely cause	Fix
`cannot find symbol: LAB_MAX_DIMENSION` in the test	constant placed where the test can't see it	make it `public static final` on `KNNVectorFieldMapper` (or import its real location)
Test compiles but `buildParserContext`/`settings`/`xContentBuilderToMap` unresolved	those are helpers on `KNNTestCase`/the test class, signatures differ in your version	copy the exact idiom from an existing `testTypeParser_*` test in the same file
Your guard never fires	you edited the wrong validation lambda, or the value is parsed elsewhere first	confirm you edited the `dimension` `Parameter<Integer>` block; `grep "greater than 0"` to find the real site
`MapperParsingException` instead of `IllegalArgumentException`	the parser wraps your IAE	assert on the wrapped cause/message, or `expectThrows(MapperParsingException.class, ...)` and check `getCause()`
`model_id` mappings skip your check	model-based fields take dimension from the trained model (`UNSET_MODEL_DIMENSION_IDENTIFIER`)	that's expected — dimension is unset inline for model fields; note it as a known gap
`precommit` fails on license headers	new constant block lacks SPDX context (usually fine inside an existing file)	run `./gradlew spotlessApply`; ensure the file still has its header
Live node still accepts a big dimension	you ran an old build	rebuild (`./gradlew run` restarts the node with your change)
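For the wrapped-exception row in the table above, a small cause-chain walk makes the assertion robust whether or not the parser wraps your IllegalArgumentException. This is a minimal sketch with a hypothetical helper name (`CauseChainSketch.findCause`); it uses only `Throwable.getCause()` and assumes nothing about how the mapper wraps errors in your version.

```java
/** Hypothetical helper: find the first exception of a given type in a cause chain. */
final class CauseChainSketch {
    private CauseChainSketch() {}

    static <T extends Throwable> T findCause(Throwable top, Class<T> type) {
        Throwable current = top;
        int depth = 0;
        // The depth cap guards against a pathological cyclic cause chain.
        while (current != null && depth < 32) {
            if (type.isInstance(current)) {
                return type.cast(current);
            }
            current = current.getCause();
            depth++;
        }
        return null; // no exception of that type anywhere in the chain
    }
}
```

In a test you would catch the outer exception, call `findCause(outer, IllegalArgumentException.class)`, assert it is non-null, and then assert on its message.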

Expected Output

# Unit tests
> Task :test
KNNVectorFieldMapperTests > testDimension_aboveMaximum_thenThrows PASSED
KNNVectorFieldMapperTests > testDimension_atMaximum_thenSucceeds  PASSED
BUILD SUCCESSFUL

# Live reject (dimension 16001)
{
  "error": {
    "root_cause": [ { "type": "illegal_argument_exception",
      "reason": "Dimension value [16001] exceeds the maximum supported dimension [16000] for vector: v" } ],
    "type": "illegal_argument_exception"
  },
  "status": 400
}

# Live accept (dimension 128)
{ "acknowledged": true, "shards_acknowledged": true, "index": "dim_ok" }

Stretch Goals

Make it per-engine and correct. Replace LAB_MAX_DIMENSION with the chosen KNNEngine's real maximum (grep getMaxDimension on KNNEngine), so faiss, lucene, and nmslib each enforce their own ceiling. Update the error to name the engine, and add a test per engine.
Validate a parameter combination. Add a check that rejects space_type: hamming unless data_type: binary (grep SpaceType/VectorDataType to find where these meet), with a test. This teaches cross-parameter validation, the harder and more common case.
Trace the value into the codec. With your guard in place, index a vector and confirm (via Lab K2's technique) that the validated dimension flows into KNNMappingConfig and the vectors format. Where would a wrong dimension have surfaced if validation hadn't caught it?
Add a deprecation-style warning. Instead of rejecting, emit a deprecation log for a discouraged-but-allowed dimension range, using the deprecation logger (grep core for DeprecationLogger). Note when a warning beats a hard error.
Write a YAML REST test. Add a do: indices.create + catch: bad_request test under src/yamlRestTest/... asserting the over-ceiling mapping is rejected, and run the yamlRestTest task.
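As a starting point for the cross-parameter stretch goal, the rule "hamming only with binary" can be sketched in isolation before you find where it belongs in the mapper. The types below (`SpaceDataTypeSketch`, `Space`, `DataType`) are hypothetical stand-ins, not the plugin's SpaceType or VectorDataType, and the error wording is an assumption.

```java
import java.util.Locale;

/** Hypothetical cross-parameter check; enum names are stand-ins for the real types. */
final class SpaceDataTypeSketch {
    private SpaceDataTypeSketch() {}

    enum Space { L2, COSINESIMIL, HAMMING }

    enum DataType { FLOAT, BYTE, BINARY }

    /** Rejects space_type hamming unless data_type is binary; every other pair passes here. */
    static void validate(String fieldName, Space space, DataType dataType) {
        if (space == Space.HAMMING && dataType != DataType.BINARY) {
            throw new IllegalArgumentException(String.format(
                Locale.ROOT,
                "space_type [hamming] requires data_type [binary] but was [%s] for vector: %s",
                dataType.name().toLowerCase(Locale.ROOT), fieldName));
        }
    }
}
```

Unlike the dimension guard, this check needs two parsed parameters at once, so it cannot live inside a single Parameter lambda. Finding the place where both values are available is the real exercise.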

Validation / Self-check

List the knn_vector mapping parameters and, for each, name the KNNEngine behavior it drives. Which two can be written either top-level or nested under method, and why does that matter for validation?
What is the difference between mapping-time validation and per-document parsing in KNNVectorFieldMapper? On which node does each happen, and which one did your guard touch?
Walk the path from a validated mapping to a written graph: Builder → KNNVectorFieldType / KNNMappingConfig → engine → vectors format → segment files. Where does faiss diverge from lucene?
Why is hardcoding a single LAB_MAX_DIMENSION the wrong production design, and what is the correct source of the limit?
In the unit test, what does KNNVectorFieldMapper.TypeParser.parse(...) give you that a live curl mapping does not, and why is expectThrows the right assertion here?
How does a model_id field get its dimension, and why does your inline-dimension guard not apply to it?
Name one cross-parameter constraint the mapper enforces (e.g. engine vs filter, space vs data_type) and where in the source it lives.

When your guard rejects bad dimensions with a clear message, your KNNVectorFieldMapperTests tests pass, and you can defend the per-engine-correct version of the change, you understand the knn_vector field type from mapping JSON to engine behavior. Revisit k-NN architecture for how the mapper sits in the index path, and engines for the capability fork your mapping selects; the core machinery your mapper is built on is in mapping and analysis.

Lab K4: Build It — A Custom k-NN Feature

Background

You have read k-NN architecture: one KNNPlugin class implements eleven extension interfaces, and ActionPlugin is the one that registers REST handlers and transport actions (the warmup API, the stats API, the train API, model CRUD). Each of those has the same three-part shape: a RestHandler that parses the HTTP request, a TransportAction that does the work (often fanned out to data nodes), and a *Request/*Response pair that serializes across the transport layer. After you have seen that shape once, every k-NN action is a variation on it.

This is a build-it lab. You will add a brand-new REST endpoint to the k-NN plugin end to end, with real Java: a *Request/*Response, a TransportNodesAction that collects per-node native-memory cache statistics, a RestHandler, and the registration lines in KNNPlugin. You will build the plugin (as in lab-k1-build-knn-from-source), install it into a running OpenSearch, and curl your endpoint. You will write a unit test. By the end you will have done the single most transferable thing in OpenSearch plugin work: shipped a new action through the full REST → transport → node fan-out → response machinery.

Note on terminology: the cluster manager (formerly master) owns cluster state, but the metric you are exposing — native-memory cache stats — is a per-data-node, per-shard concern. Your action is therefore a nodes action (fan out to every node, each answers locally, the coordinating node aggregates), not a cluster-manager action. This is the same fan-out shape as GET /_plugins/_knn/stats and the warmup API.

Why This Matters for Contributors

The REST → transport → nodes-action triad is the most common kind of feature PR in OpenSearch and its plugins: a new stat, admin endpoint, or diagnostic almost always takes this shape. Learn it once, reuse it forever.
k-NN's native memory is invisible to every JVM tool (see native integration and memory), so adding a stat is genuinely useful, mergeable work, not a toy.
You touch KNNPlugin registration directly (getActions() / getRestHandlers()), which demystifies the whole plugin architecture, and you must get serialization right — the foundation of backward compatibility.

Pick Your Build

Pick one and implement it end to end. They share the same skeleton; they differ in where the data comes from. The walkthrough below implements option A in full; options B and C are sketched so you can choose your own and reuse the scaffolding.

Option	Endpoint	What it returns	Data source
A (walkthrough)	`GET /_plugins/_knn/cachestats`	per-node native-memory cache size, hit/miss/eviction counts, capacity	`NativeMemoryCacheManager` on each node
B	a new `SpaceType` / scoring function	a custom distance (e.g. a weighted-L2) usable in `space_type` or `knn_score` script	`org.opensearch.knn.index.SpaceType` + the script scoring path
C	`GET /_plugins/_knn/graphinfo/<index>`	per-segment graph metadata (engine, method, vector count) for an index	the k-NN codec's per-segment files / field info

Warning: Do not name your endpoint /_plugins/_knn/stats — that already exists (KNNStatsAction). Pick a distinct path (cachestats, graphinfo) so you do not collide with a registered route and get a startup failure. Grep the existing routes first (below) to be sure.
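The collision described in the warning can be pictured with a toy routing table that refuses a second handler for the same method and path. This is a hypothetical model only: `RouteTableSketch` is not OpenSearch's REST controller, and the error text is invented. It shows the reason to grep existing routes before choosing yours.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of a routing table; NOT OpenSearch's actual REST controller. */
final class RouteTableSketch {
    private final Map<String, String> handlersByRoute = new LinkedHashMap<>();

    /** Registers a handler, failing fast if the method + path pair is already taken. */
    void register(String method, String path, String handlerName) {
        String key = method + " " + path;
        String existing = handlersByRoute.putIfAbsent(key, handlerName);
        if (existing != null) {
            throw new IllegalArgumentException(
                "Cannot register [" + handlerName + "] for [" + key
                    + "]: already handled by [" + existing + "]");
        }
    }

    int size() {
        return handlersByRoute.size();
    }
}
```

Registering `GET /_plugins/_knn/cachestats` after `GET /_plugins/_knn/stats` succeeds in this model because the keys differ; registering the stats path twice fails.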

Prerequisites

lab-k1-build-knn-from-source completed — you can build the plugin (Java + the jni/ native libs via CMake) and have a checkout.
You have read k-NN architecture (especially getActions() / getRestHandlers()) and native integration and memory (what NativeMemoryCacheManager tracks).
A local OpenSearch you can install the plugin into, or the k-NN repo's run task.
JDK 21, Gradle, a C++ toolchain + CMake (for the native step).

# Orient yourself: where the existing actions and rest handlers live.
cd ~/src/oss-repos/k-NN
ls src/main/java/org/opensearch/knn/plugin/transport   # *TransportAction, *Request, *Response
ls src/main/java/org/opensearch/knn/plugin/rest        # Rest*Handler
# Confirm the registration methods and the existing routes (names drift — grep, don't trust):
grep -n "getActions\|getRestHandlers\|registerHandler\|new Route" \
  src/main/java/org/opensearch/knn/plugin/KNNPlugin.java
grep -rn "_plugins/_knn" src/main/java/org/opensearch/knn/plugin/rest

Step-by-Step (Option A: a native-memory cache-stats endpoint)

The endpoint fans a request out to every node, each node reads its own NativeMemoryCacheManager, and the coordinating node concatenates the per-node answers. This is OpenSearch's TransportNodesAction pattern. Six files (the per-node response, the top-level request and response, the transport action, the ActionType, and the REST handler), then two registration lines.
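Before writing the real classes, it can help to see the fan-out and aggregate shape in a single process. The sketch below is a hypothetical model, not TransportNodesAction: each "node" is a `Supplier`, the loop is sequential (the real action dispatches over the transport layer), and `FanOutSketch` and its fields are invented names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical single-process model of a nodes action; not TransportNodesAction itself. */
final class FanOutSketch {
    private FanOutSketch() {}

    /** Aggregated answer: per-node values plus the ids of nodes that failed. */
    static final class Result {
        final Map<String, Long> graphMemoryKbByNode = new LinkedHashMap<>();
        final List<String> failedNodes = new ArrayList<>();
    }

    /** Asks every node for its local stat; one failing node does not fail the whole request. */
    static Result collect(Map<String, Supplier<Long>> nodeOperations) {
        Result result = new Result();
        for (Map.Entry<String, Supplier<Long>> node : nodeOperations.entrySet()) {
            try {
                // The "nodeOperation" step: runs against that node's local state only.
                result.graphMemoryKbByNode.put(node.getKey(), node.getValue().get());
            } catch (RuntimeException e) {
                // Mirrors the failures list carried alongside the successful node responses.
                result.failedNodes.add(node.getKey());
            }
        }
        return result;
    }
}
```

The mapping to the real classes is direct: the supplier body corresponds to nodeOperation, and `Result` corresponds to the top-level response that holds node responses and failures.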

Step 1: The per-node `Request`/`Response`

A nodes action has a top-level request/response (the whole operation) and a per-node request/response. Start with the per-node response — the actual payload one node reports.

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

package org.opensearch.knn.plugin.transport;

import org.opensearch.action.support.nodes.BaseNodeResponse;
import org.opensearch.cluster.node.DiscoveryNode;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.xcontent.ToXContentFragment;
import org.opensearch.core.xcontent.XContentBuilder;

import java.io.IOException;

/** One node's view of the native-memory cache. */
public class KNNCacheStatsNodeResponse extends BaseNodeResponse implements ToXContentFragment {

    private final long graphMemoryKb;
    private final long hitCount;
    private final long missCount;
    private final long evictionCount;
    private final long cacheCapacityKb;

    public KNNCacheStatsNodeResponse(
        DiscoveryNode node, long graphMemoryKb, long hitCount,
        long missCount, long evictionCount, long cacheCapacityKb) {
        super(node);
        this.graphMemoryKb = graphMemoryKb;
        this.hitCount = hitCount;
        this.missCount = missCount;
        this.evictionCount = evictionCount;
        this.cacheCapacityKb = cacheCapacityKb;
    }

    /** Wire-format read. Field ORDER must match writeTo exactly. */
    public KNNCacheStatsNodeResponse(StreamInput in) throws IOException {
        super(in);
        this.graphMemoryKb = in.readLong();
        this.hitCount = in.readLong();
        this.missCount = in.readLong();
        this.evictionCount = in.readLong();
        this.cacheCapacityKb = in.readLong();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        super.writeTo(out);
        out.writeLong(graphMemoryKb);
        out.writeLong(hitCount);
        out.writeLong(missCount);
        out.writeLong(evictionCount);
        out.writeLong(cacheCapacityKb);
    }

    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject(getNode().getId());
        builder.field("node_name", getNode().getName());
        builder.field("graph_memory_usage_kb", graphMemoryKb);
        builder.field("cache_capacity_kb", cacheCapacityKb);
        builder.field("hit_count", hitCount);
        builder.field("miss_count", missCount);
        builder.field("eviction_count", evictionCount);
        return builder.endObject();
    }

    // Package-private accessors: the unit test lives in the same package.
    long getGraphMemoryKb()   { return graphMemoryKb; }
    long getHitCount()        { return hitCount; }
    long getMissCount()       { return missCount; }
    long getEvictionCount()   { return evictionCount; }
    long getCacheCapacityKb() { return cacheCapacityKb; }
}

Warning: The order of reads in the StreamInput constructor must match the order of writes in writeTo, field for field. A mismatch does not fail at compile time — it corrupts the stream at runtime and only shows up as a garbled response or a deserialization exception between nodes. This is the #1 mistake in transport code; see serialization & BWC.
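You can watch this failure mode happen without a cluster. The sketch below uses `java.io.DataOutputStream` and `DataInputStream` as a stand-in for StreamOutput and StreamInput (an assumption made so the example runs on a bare JDK; the class and method names are hypothetical). Two longs are written in one order and read back both correctly and swapped.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical demo: java.io streams standing in for StreamOutput/StreamInput. */
final class WireOrderSketch {
    private WireOrderSketch() {}

    /** Writes hitCount, then missCount, in that order. */
    static byte[] write(long hitCount, long missCount) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(hitCount);
            out.writeLong(missCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Correct reader: same order as write(). Returns {hitCount, missCount}. */
    static long[] readInOrder(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            long hitCount = in.readLong();
            long missCount = in.readLong();
            return new long[] { hitCount, missCount };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Buggy reader: order swapped. It compiles and runs, but the fields come back crossed. */
    static long[] readSwapped(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            long missCount = in.readLong(); // actually consumes the bytes written for hitCount
            long hitCount = in.readLong();
            return new long[] { hitCount, missCount };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With same-typed fields the swap is completely silent: `readSwapped(write(17, 3))` yields hit count 3 and miss count 17 with no exception. With mixed types you usually get garbage values or an end-of-stream error instead, which is why a round-trip test that asserts every field is worth writing.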

The top-level request/response wrap the per-node ones. The request is a BaseNodesRequest (it carries the node selection); the response is a BaseNodesResponse<KNNCacheStatsNodeResponse> that holds the list and renders XContent.

package org.opensearch.knn.plugin.transport;

import org.opensearch.action.support.nodes.BaseNodesRequest;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import java.io.IOException;

/** Top-level request: which nodes to ask. Empty body — node selection is in the base. */
public class KNNCacheStatsRequest extends BaseNodesRequest<KNNCacheStatsRequest> {
    public KNNCacheStatsRequest(String... nodeIds) { super(nodeIds); }
    public KNNCacheStatsRequest(StreamInput in) throws IOException { super(in); }
    @Override public void writeTo(StreamOutput out) throws IOException { super.writeTo(out); }
}

package org.opensearch.knn.plugin.transport;

import org.opensearch.action.FailedNodeException;
import org.opensearch.action.support.nodes.BaseNodesResponse;
import org.opensearch.cluster.ClusterName;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.core.xcontent.ToXContentObject;
import org.opensearch.core.xcontent.XContentBuilder;

import java.io.IOException;
import java.util.List;

/** Top-level response: the per-node answers, plus any node failures, as XContent. */
public class KNNCacheStatsResponse
        extends BaseNodesResponse<KNNCacheStatsNodeResponse> implements ToXContentObject {

    public KNNCacheStatsResponse(
        ClusterName clusterName, List<KNNCacheStatsNodeResponse> nodes,
        List<FailedNodeException> failures) {
        super(clusterName, nodes, failures);
    }

    public KNNCacheStatsResponse(StreamInput in) throws IOException { super(in); }

    @Override
    protected List<KNNCacheStatsNodeResponse> readNodesFrom(StreamInput in) throws IOException {
        return in.readList(KNNCacheStatsNodeResponse::new);
    }

    @Override
    protected void writeNodesTo(StreamOutput out, List<KNNCacheStatsNodeResponse> nodes) throws IOException {
        out.writeList(nodes);
    }

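    /**
     * Renders only the "nodes" object. NodesResponseRestListener (used by the REST handler)
     * already opens the enclosing object and writes the _nodes header and cluster_name,
     * so starting another anonymous object here would fail when the response is rendered.
     */
    @Override
    public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
        builder.startObject("nodes");
        for (KNNCacheStatsNodeResponse node : getNodes()) {
            node.toXContent(builder, params);
        }
        return builder.endObject();
    }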
}

Step 2: The `TransportNodesAction`

This is the engine of the action. TransportNodesAction handles the fan-out: it sends a per-node request to each selected node, calls nodeOperation on each, and gathers the KNNCacheStatsNodeResponses into a KNNCacheStatsResponse. You implement the four abstract methods.

package org.opensearch.knn.plugin.transport;

import org.opensearch.action.FailedNodeException;
import org.opensearch.action.support.ActionFilters;
import org.opensearch.action.support.nodes.BaseNodeRequest;
import org.opensearch.action.support.nodes.TransportNodesAction;
import org.opensearch.cluster.service.ClusterService;
import org.opensearch.common.inject.Inject;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.io.stream.StreamOutput;
import org.opensearch.knn.index.memory.NativeMemoryCacheManager;
import org.opensearch.threadpool.ThreadPool;
import org.opensearch.transport.TransportRequest;
import org.opensearch.transport.TransportService;

import java.io.IOException;
import java.util.List;

public class KNNCacheStatsTransportAction extends TransportNodesAction<
        KNNCacheStatsRequest,
        KNNCacheStatsResponse,
        KNNCacheStatsTransportAction.NodeRequest,
        KNNCacheStatsNodeResponse> {

    @Inject
    public KNNCacheStatsTransportAction(
        ThreadPool threadPool, ClusterService clusterService,
        TransportService transportService, ActionFilters actionFilters) {
        super(
            KNNCacheStatsAction.NAME, threadPool, clusterService, transportService, actionFilters,
            KNNCacheStatsRequest::new, NodeRequest::new,
            ThreadPool.Names.MANAGEMENT, KNNCacheStatsNodeResponse.class);
    }

    @Override
    protected KNNCacheStatsResponse newResponse(
        KNNCacheStatsRequest request, List<KNNCacheStatsNodeResponse> nodes,
        List<FailedNodeException> failures) {
        return new KNNCacheStatsResponse(clusterService.getClusterName(), nodes, failures);
    }

    @Override
    protected NodeRequest newNodeRequest(KNNCacheStatsRequest request) {
        return new NodeRequest();
    }

    @Override
    protected KNNCacheStatsNodeResponse newNodeResponse(StreamInput in) throws IOException {
        return new KNNCacheStatsNodeResponse(in);
    }

    /** Runs on EACH node: read this node's native-memory cache and report it. */
    @Override
    protected KNNCacheStatsNodeResponse nodeOperation(NodeRequest request) {
        NativeMemoryCacheManager cm = NativeMemoryCacheManager.getInstance();
        // Method names below are illustrative — grep NativeMemoryCacheManager for the
        // real accessors in your version (getCacheSizeInKilobytes, getCacheStats, etc.).
        long graphMemKb     = cm.getCacheSizeInKilobytes();
        long maxCacheKb     = cm.getMaxCacheSizeInKilobytes();
        long hits           = cm.getHitCount();
        long misses         = cm.getMissCount();
        long evictions      = cm.getEvictionCount();
        return new KNNCacheStatsNodeResponse(
            clusterService.localNode(), graphMemKb, hits, misses, evictions, maxCacheKb);
    }

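    /**
     * Per-node request: empty; node selection lives in the top-level request.
     * If your core version does not ship BaseNodeRequest, extend TransportRequest
     * (imported above) instead; grep KNNStatsNodeRequest for the project's current idiom.
     */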
    public static class NodeRequest extends BaseNodeRequest {
        public NodeRequest() {}
        public NodeRequest(StreamInput in) throws IOException { super(in); }
        @Override public void writeTo(StreamOutput out) throws IOException { super.writeTo(out); }
    }
}

The ActionType that names the action (used in registration and to call it):

package org.opensearch.knn.plugin.transport;

import org.opensearch.action.ActionType;

public class KNNCacheStatsAction extends ActionType<KNNCacheStatsResponse> {
    public static final KNNCacheStatsAction INSTANCE = new KNNCacheStatsAction();
    public static final String NAME = "cluster:admin/knn_cache_stats_action";
    private KNNCacheStatsAction() { super(NAME, KNNCacheStatsResponse::new); }
}

Note: The action name cluster:admin/... is the transport-level identifier, not the HTTP route. The cluster:admin/ prefix tells the security layer this is an admin action. Keep it consistent with the existing k-NN actions — grep transport/KNN*Action.java for public static final String NAME.

Step 3: The `RestHandler`

The REST handler maps GET /_plugins/_knn/cachestats to your transport action.

package org.opensearch.knn.plugin.rest;

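// On OpenSearch 3.x this class lives in org.opensearch.transport.client.node; match your core version.
import org.opensearch.client.node.NodeClient;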
import org.opensearch.knn.plugin.transport.KNNCacheStatsAction;
import org.opensearch.knn.plugin.transport.KNNCacheStatsRequest;
import org.opensearch.rest.BaseRestHandler;
import org.opensearch.rest.RestRequest;
import org.opensearch.rest.action.RestActions.NodesResponseRestListener;

import java.util.List;

import static java.util.Collections.singletonList;
import static org.opensearch.rest.RestRequest.Method.GET;

public class RestKNNCacheStatsHandler extends BaseRestHandler {

    private static final String NAME = "knn_cache_stats_action";

    @Override
    public String getName() { return NAME; }

    @Override
    public List<Route> routes() {
        return singletonList(new Route(GET, "/_plugins/_knn/cachestats"));
    }

    @Override
    protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) {
        // null nodeIds => all nodes
        KNNCacheStatsRequest knnRequest = new KNNCacheStatsRequest((String[]) null);
        return channel -> client.execute(
            KNNCacheStatsAction.INSTANCE, knnRequest, new NodesResponseRestListener<>(channel));
    }
}

Step 4: Register both in `KNNPlugin`

Two lines. getActions() binds the ActionType to its TransportAction; getRestHandlers() adds your handler to the REST routing table.

// In src/main/java/org/opensearch/knn/plugin/KNNPlugin.java — add to the existing lists.

// getActions(): bind ActionType -> TransportAction
@Override
public List<ActionHandler<? extends ActionRequest, ? extends ActionResponse>> getActions() {
    return Arrays.asList(
        // ... existing handlers (KNNStatsAction, KNNWarmupAction, TrainingModelAction, ...) ...
        new ActionHandler<>(KNNCacheStatsAction.INSTANCE, KNNCacheStatsTransportAction.class)
    );
}

// getRestHandlers(): add the REST endpoint
@Override
public List<RestHandler> getRestHandlers(/* ...existing params... */) {
    return ImmutableList.of(
        // ... existing handlers (RestKNNStatsHandler, RestKNNWarmupHandler, ...) ...
        new RestKNNCacheStatsHandler()
    );
}

# Confirm the exact existing signatures and the import surface before editing — they drift:
grep -n "getActions\|getRestHandlers\|ActionHandler\|RestHandler" \
  src/main/java/org/opensearch/knn/plugin/KNNPlugin.java

Step 5: Build, install, and curl

Build the plugin (this builds on lab-k1: the native jni/ step plus the Java compile), then run it.

# Full build (Java + native libs via CMake). See lab-k1 for the native prerequisites.
./gradlew assemble

# Fastest inner loop: run OpenSearch with the plugin already installed.
./gradlew run                  # boots a single node with k-NN + your new action

# In another shell, hit your endpoint:
curl -s 'localhost:9200/_plugins/_knn/cachestats?pretty'

To prove it reflects real native memory, create a faiss index, load it via warmup (see native integration and memory), then re-curl:

curl -XPUT 'localhost:9200/v' -H 'Content-Type: application/json' -d '
{ "settings": { "index.knn": true },
  "mappings": { "properties": { "e": { "type": "knn_vector", "dimension": 4,
    "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" } } } } }'

for i in 1 2 3 4 5; do
  curl -s -XPOST 'localhost:9200/v/_doc' -H 'Content-Type: application/json' \
    -d "{\"e\":[$i.0,$i.1,$i.2,$i.3]}" >/dev/null
done
curl -s -XPOST 'localhost:9200/v/_refresh' >/dev/null
# Warmup is a GET endpoint; a POST is rejected as an unsupported method.
curl -s -XGET 'localhost:9200/_plugins/_knn/warmup/v?pretty' >/dev/null

# graph_memory_usage_kb should now be non-zero on the node holding the shard:
curl -s 'localhost:9200/_plugins/_knn/cachestats?pretty'

Step 6: A unit test

Test the serialization round-trip of the per-node response — the part most likely to break silently. k-NN's tests subclass KNNTestCase.

/*
 * SPDX-License-Identifier: Apache-2.0
 * ...license header...
 */
package org.opensearch.knn.plugin.transport;

import org.opensearch.Version;
import org.opensearch.cluster.node.DiscoveryNode;
import org.opensearch.common.io.stream.BytesStreamOutput;
import org.opensearch.core.common.io.stream.StreamInput;
import org.opensearch.core.common.transport.TransportAddress;
import org.opensearch.knn.KNNTestCase;

import java.net.InetAddress;
import java.util.Collections;

public class KNNCacheStatsNodeResponseTests extends KNNTestCase {

    public void testSerializationRoundTrip() throws Exception {
        DiscoveryNode node = new DiscoveryNode(
            "node-1", new TransportAddress(InetAddress.getLoopbackAddress(), 9300),
            Collections.emptyMap(), Collections.emptySet(), Version.CURRENT);

        KNNCacheStatsNodeResponse original =
            new KNNCacheStatsNodeResponse(node, 2048L, 17L, 3L, 1L, 51200L);

        // Write to a buffer, read it back — the wire round-trip.
        BytesStreamOutput out = new BytesStreamOutput();
        original.writeTo(out);
        StreamInput in = out.bytes().streamInput();
        KNNCacheStatsNodeResponse restored = new KNNCacheStatsNodeResponse(in);

        assertEquals(original.getGraphMemoryKb(),  restored.getGraphMemoryKb());
        assertEquals(original.getHitCount(),       restored.getHitCount());
        assertEquals(original.getMissCount(),      restored.getMissCount());
        assertEquals(original.getEvictionCount(),  restored.getEvictionCount());
        assertEquals(original.getCacheCapacityKb(),restored.getCacheCapacityKb());
        assertEquals(node.getId(),                 restored.getNode().getId());
    }
}

./gradlew test --tests "org.opensearch.knn.plugin.transport.KNNCacheStatsNodeResponseTests"
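Why field order matters can be seen without OpenSearch at all. The sketch below is a plain-JDK stand-in: WireOrderDemo and its two fields are hypothetical, and ByteBuffer substitutes for the real StreamOutput/StreamInput pair. It writes two longs in one order and reads them back both correctly and swapped. The swapped reader compiles and runs without an exception, which is why a round-trip assertion is the practical way to catch it.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a per-node stats payload; not the k-NN class. */
final class WireOrderDemo {
    final long graphMemoryKb;
    final long hitCount;

    WireOrderDemo(long graphMemoryKb, long hitCount) {
        this.graphMemoryKb = graphMemoryKb;
        this.hitCount = hitCount;
    }

    /** Write order: graphMemoryKb first, then hitCount. */
    byte[] toBytes() {
        return ByteBuffer.allocate(2 * Long.BYTES)
            .putLong(graphMemoryKb)
            .putLong(hitCount)
            .array();
    }

    /** Correct reader: consumes the fields in the order toBytes() wrote them. */
    static WireOrderDemo read(byte[] wire) {
        ByteBuffer in = ByteBuffer.wrap(wire);
        long graphMemoryKb = in.getLong();
        long hitCount = in.getLong();
        return new WireOrderDemo(graphMemoryKb, hitCount);
    }

    /** Buggy reader: same types, opposite order. No exception, just wrong values. */
    static WireOrderDemo readSwapped(byte[] wire) {
        ByteBuffer in = ByteBuffer.wrap(wire);
        long hitCount = in.getLong();
        long graphMemoryKb = in.getLong();
        return new WireOrderDemo(graphMemoryKb, hitCount);
    }

    public static void main(String[] args) {
        byte[] wire = new WireOrderDemo(2048L, 17L).toBytes();
        WireOrderDemo ok = read(wire);
        WireOrderDemo bad = readSwapped(wire);
        System.out.println("correct: " + ok.graphMemoryKb + " KB, " + ok.hitCount + " hits");
        System.out.println("swapped: " + bad.graphMemoryKb + " KB, " + bad.hitCount + " hits");
    }
}
```

With fields of the same type the mismatch is silent, as here; with mixed types (a string followed by a long, say) the same mistake tends to surface as a garbled body or a deserialization exception at runtime.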

Options B and C, in brief

Option B: a custom SpaceType / scoring function. space_type values map to org.opensearch.knn.index.SpaceType enum constants, each carrying a distance/score function. A custom space (e.g. a weighted L2) means adding an enum constant with its getDistance/scoreTranslation, wiring it through validation, and, for the painless knn_score script, registering it on the script scoring path.

Option C: GET /_plugins/_knn/graphinfo/<index> reuses option A's nodes-action skeleton, but nodeOperation reads the k-NN codec's per-segment field info (engine, method, vector count) for the index's local shards instead of the cache.

# Option B: the space/scoring surface.
grep -rn "enum SpaceType\|getDistance\|scoreTranslation\|KNNScoringSpace" \
  src/main/java/org/opensearch/knn/index/SpaceType.java src/main/java/org/opensearch/knn/plugin/script
# Option C: reaching per-segment metadata from a transport action.
grep -rn "KNN990Codec\|KNNCodecVersion\|FieldInfo\|perFieldKnnVectors\|getEngine" \
  src/main/java/org/opensearch/knn/index/codec
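For option B, the math of a weighted L2 space can be sketched in isolation before touching the enum. The class below is hypothetical (WeightedL2 and both method names are illustrative, not the SpaceType API), and the 1 / (1 + d) translation is assumed here to match the built-in L2 space; confirm both against SpaceType.java in your checkout.

```java
/** Hypothetical sketch of a weighted L2 space; not the k-NN SpaceType API. */
final class WeightedL2 {

    /** Squared weighted L2 distance: sum over i of w[i] * (a[i] - b[i])^2. */
    static float distance(float[] a, float[] b, float[] w) {
        if (a.length != b.length || a.length != w.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += w[i] * d * d;
        }
        return sum;
    }

    /** Map a distance (smaller is closer) to a score (larger is better). */
    static float score(float distance) {
        return 1f / (1f + distance);
    }

    public static void main(String[] args) {
        float[] q = {0f, 0f};
        float[] v = {3f, 4f};
        float plain = distance(q, v, new float[] {1f, 1f});    // unweighted: 9 + 16
        float skewed = distance(q, v, new float[] {2f, 0f});   // second dimension ignored
        System.out.println("plain=" + plain + " score=" + score(plain));
        System.out.println("skewed=" + skewed + " score=" + score(skewed));
    }
}
```

Whatever translation you choose must stay monotonically decreasing in distance, or ranking by score will no longer agree with ranking by distance.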

Implementation Requirements / Deliverables

One option chosen and implemented end to end with real Java.
*Request, *Response, per-node response, TransportNodesAction, ActionType, and RestHandler all present (option A) — or the equivalent files for B/C.
Registration lines added to KNNPlugin.getActions() and getRestHandlers(), confirmed by grep against your checkout.
Plugin builds (./gradlew assemble) including the native step.
Endpoint reachable: a pasted curl of your route returning a sensible body, with a non-zero metric after warmup (option A).
A passing unit test (serialization round-trip, or the equivalent for B/C).
./gradlew spotlessApply && ./gradlew precommit clean.

Troubleshooting

Symptom	Likely cause	Fix
Node fails to start: "duplicate route" / handler registration error	your route collides with an existing one (e.g. `stats`)	pick a distinct path; grep `_plugins/_knn` routes in `plugin/rest`
`ClassNotFoundException` / `NoSuchMethodError` for your action at startup	registration line wrong, or `ActionType` `NAME` mismatched	re-grep `getActions()`; ensure `ActionType` `NAME` is unique and matches the `super(NAME, ...)` call
Response body garbled or deserialization exception between nodes	`writeTo`/`StreamInput` field order mismatch	make the read order in the `StreamInput` ctor exactly match `writeTo`; see serialization & BWC
`graph_memory_usage_kb` always 0	no faiss graph loaded yet	index docs, refresh, then `GET /_plugins/_knn/warmup/<index>` before re-curling
`cm.getInstance()` NPE in a unit test	`NativeMemoryCacheManager` is a node singleton, not available in a plain unit test	test the request/response serialization (as above), not `nodeOperation`; cover `nodeOperation` in an integ test instead
`prepareRequest` won't compile against `NodesResponseRestListener`	import/signature drift across versions	grep an existing nodes-based `Rest*Handler` (e.g. the stats handler) and copy its listener wiring
Build fails in the native step	missing CMake/toolchain/submodule	see lab-k1 and native integration — `git submodule update --init --recursive`

Expected Output

A fresh node, no vectors loaded:

{
  "nodes" : {
    "kP3mGZ...node-1" : {
      "node_name" : "runTask-0",
      "graph_memory_usage_kb" : 0,
      "cache_capacity_kb" : 16777216,
      "hit_count" : 0,
      "miss_count" : 0,
      "eviction_count" : 0
    }
  }
}

After indexing a faiss field and warming it up, graph_memory_usage_kb and miss_count become non-zero on the node holding the shard, and a second query bumps hit_count — exactly mirroring what GET /_plugins/_knn/stats reports, which is your correctness oracle.

Stretch Goals

Cross-check against the real stats API. Confirm your graph_memory_usage_kb matches the graph_memory_usage field of GET /_plugins/_knn/stats. If they differ, you are reading the cache differently than KNNStats does — reconcile.
Add an integ test. Write a *IT (subclassing the k-NN integration base) that boots a node, indexes a faiss field, warms it, calls your endpoint over REST, and asserts a non-zero metric. This covers nodeOperation, which the unit test cannot.
Make it filterable. Accept ?nodeId=... to scope the fan-out to specific nodes — the BaseNodesRequest already supports node selection; wire it through the handler.
Open a real PR-shaped change. Search https://github.com/opensearch-project/k-NN/issues?q=is%3Aissue+label%3Aenhancement+stats for an actually-wanted stat or endpoint, and propose your feature against a real need — with a CHANGELOG.md entry and DCO sign-off.

Validation / Self-check

Name the five or more classes a TransportNodesAction-based endpoint needs and the job of each. Which two carry the per-node request/response vs the top-level ones?
Where exactly in KNNPlugin does an action get registered, and where does a REST route? What does each registration call return?
Why must the read order in the StreamInput constructor match writeTo? What kind of failure does a mismatch produce, and at what time (compile/startup/runtime)?
Why is native-memory cache stats a nodes action and not a cluster-manager action? What does that say about where the data lives?
Walk a request through your endpoint: REST handler → action → fan-out → nodeOperation → aggregate → XContent. Where does each piece run (coordinating node vs every node)?
Why does graph_memory_usage_kb read 0 until you warm up or query a faiss field? (Tie this to lazy native loading in native integration.)
What does your unit test actually prove, and what does it not cover that an integ test would?

When you can answer all seven and your endpoint returns a non-zero metric after warmup, you have built a real k-NN feature through the full REST/transport machinery. Next, take the same muscles to a real defect in lab-k5: Fix It — a Real k-NN Issue, or revisit k-NN architecture to see how your action sits beside the ten other extension points KNNPlugin registers.

Lab K5: Fix It — A Real k-NN Issue

Background

The k-NN plugin's native-memory circuit breaker is one of the most-reported, least-understood subsystems in the repo. It guards memory that no JVM tool can see (faiss/nmslib graphs live off-heap — see native integration and memory), it is entangled with the NativeMemoryCacheManager Guava cache and with cluster-settings plumbing, and its "tripped / un-tripped" hysteresis has sharp edges. Two real issues frame the territory:

k-NN #1582 — Investigate rearchitecture of the native memory circuit breaker — the breaker is too entangled with the cache and the settings layer; the eviction-vs-rejection relationship is muddy.
k-NN #585 — circuit-breaker limit config bug — a concrete defect in how the knn.memory.circuit_breaker.limit setting was handled.

This Fix-It lab works in the flavor of #585: a settings-plumbing bug where the configured circuit-breaker limit is not honored. You will reproduce the symptom, locate the code with grep in org.opensearch.knn.index.memory.* and KNNSettings, apply a minimal diff, write a unit test that asserts the corrected behavior, build and test, and ship a PR with DCO sign-off and a CHANGELOG.md entry. The point is not just the patch — it is doing it the way k-NN maintainers expect, in a subsystem where a careless test leaks off-heap memory.

Note on terminology: the cluster manager (formerly master) publishes the cluster setting knn.memory.circuit_breaker.limit; each data node applies it locally to its own NativeMemoryCacheManager. The bug in this lab is on the apply side — per-node — which is exactly why it is invisible from the cluster-settings API alone.

Warning: This lab uses a representative reconstruction of a #585-flavored bug so it is reproducible on today's tree (the original was fixed years ago). The class names, settings keys, and APIs are real; grep your checkout to find where the equivalent logic lives now, because this file has been refactored more than once.

Why This Lab Matters for Contributors

Settings-plumbing bugs are a classic, well-scoped contributor on-ramp: small surface, high impact, easy to test once you find them. They are also where backward-compat rules bite — see Issue Roadmap Stage 8: Plugin Compatibility.
The native-memory breaker is the difference between "node degrades gracefully under vector load" and "the Linux OOM-killer reaps the process." A limit that is silently ignored is a production landmine.
This subsystem punishes sloppy tests harder than any other: faiss allocations are off-heap, so a test that loads a graph and forgets to free it leaks native memory across the whole test JVM. You will learn to test it without leaking.
"Fix a config bug, add a regression test, un-break the limit" is a concrete, mergeable PR with a clean before/after.

Prerequisites

native integration and memory read: you understand NativeMemoryCacheManager, the native breaker, and that a percentage limit resolves against the node RAM left outside the JVM heap (total physical memory minus max heap), not against -Xmx, while an absolute size is taken as given.
You can build and test the plugin (lab-k1): ./gradlew test, ./gradlew run.
You can read a cluster setting via the API and set it dynamically.

cd ~/src/oss-repos/k-NN
./gradlew test --tests "*KNNSettings*"   # sanity: the settings tests compile & run

Step 1: Reproduce the symptom

The reported behavior: an operator sets knn.memory.circuit_breaker.limit to a smaller value to bound native memory, but the cache keeps growing past it. Either the new limit is ignored until restart, or an absolute value like "4kb" is parsed incorrectly. Reproduce it against a running node.

./gradlew run    # single node with k-NN

# Read the current native breaker settings (this is the cluster-published view):
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
  | grep -E 'knn\.memory\.circuit_breaker|knn\.circuit_breaker'

# Set a deliberately tiny absolute limit so any graph trips it:
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "persistent": { "knn.memory.circuit_breaker.limit": "1kb" } }'

# Index a faiss field, refresh, warm it up so a native graph loads:
curl -XPUT 'localhost:9200/cb' -H 'Content-Type: application/json' -d '
{ "settings": { "index.knn": true },
  "mappings": { "properties": { "e": { "type": "knn_vector", "dimension": 8,
    "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" } } } } }'
for i in $(seq 1 200); do
  curl -s -XPOST 'localhost:9200/cb/_doc' -H 'Content-Type: application/json' \
    -d "{\"e\":[$i,1,2,3,4,5,6,7]}" >/dev/null
done
curl -s -XPOST 'localhost:9200/cb/_refresh' >/dev/null
curl -s 'localhost:9200/_plugins/_knn/warmup/cb?pretty' >/dev/null   # warmup is a GET route

# Now check whether the breaker tripped. With a 1kb limit and a real graph, it MUST.
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' \
  | grep -E 'graph_memory|circuit_breaker|cache_capacity|eviction'

The bug's signature: graph_memory_usage exceeds the configured limit but circuit_breaker_triggered in the stats output (the knn.circuit_breaker.triggered setting) reads false, or the dynamically updated limit is not reflected in the cache's maximumWeight. The configured limit was parsed and stored but never propagated to the live cache. That disconnect is the bug class #585 is about.

Warning: Do not conclude "the breaker is broken" from RSS alone. glibc's allocator keeps freed arenas, so RSS lags eviction (see native integration § eviction). Judge by the cache accounting in _plugins/_knn/stats, not by ps.

Step 2: Locate the code

Two suspects: where the setting is defined and parsed (KNNSettings), and where the limit is applied to the cache (org.opensearch.knn.index.memory.*). Grep both.

# The setting definition, parsing, and the byte/percentage resolution:
grep -rn "circuit_breaker.limit\|CIRCUIT_BREAKER_LIMIT\|getCircuitBreakerLimit\|ByteSizeValue\|parseknnMemoryCircuitBreakerValue\|KNN_MEMORY_CIRCUIT_BREAKER_LIMIT" \
  src/main/java/org/opensearch/knn/index/KNNSettings.java

# Where the cache reads the limit to size its maximumWeight, and the settings-update hook:
grep -rn "maximumWeight\|setMaximumWeight\|circuitBreakerLimit\|rebuildCache\|addSettingsUpdateConsumer\|onCacheConfigChanged" \
  src/main/java/org/opensearch/knn/index/memory/NativeMemoryCacheManager.java \
  src/main/java/org/opensearch/knn/index/KNNSettings.java

You are looking for the seam where a dynamic settings update should call back into the cache and rebuild it with the new weight bound but does not, or where parsing a limit value (an absolute size such as "4kb", or a percentage such as "60%") returns the wrong number of bytes.

# Trace the update consumer registration — the dynamic-settings callback:
grep -rn "clusterSettings.addSettingsUpdateConsumer\|settingsUpdateConsumer\|consumer" \
  src/main/java/org/opensearch/knn/index/KNNSettings.java

The two #585-flavored failure modes you might find:

Failure mode	Where	Why it manifests
Stale limit	the dynamic-settings update consumer never rebuilds the cache's `maximumWeight`	new limit stored in `KNNSettings` but the live Guava cache keeps the old bound until restart
Mis-parse	`parseknnMemoryCircuitBreakerValue` resolves a percentage against the wrong base, or an absolute `ByteSizeValue` is read in the wrong unit	`"60%"` or `"4kb"` becomes a wildly wrong byte count, so the limit is effectively ignored

This lab fixes the stale-limit mode: the update consumer doesn't propagate to the cache.
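To make the mis-parse mode concrete, here is a minimal sketch of resolving a limit string to bytes. LimitParser is hypothetical and is not the real parseknnMemoryCircuitBreakerValue; it assumes binary units (1kb = 1024 bytes) and takes the percentage base as a parameter, so the effect of choosing the wrong base is visible. The example base of total RAM minus JVM heap is an assumption to check against the native integration chapter.

```java
import java.math.BigDecimal;
import java.util.Locale;

/** Hypothetical limit parser; illustrates the two value shapes, not the k-NN code. */
final class LimitParser {
    private static final String[] SUFFIXES = {"kb", "mb", "gb"};

    /** Resolve "60%" against percentBaseBytes, or "4kb" / "2mb" / "1gb" as an absolute size. */
    static long toBytes(String limit, long percentBaseBytes) {
        String v = limit.trim().toLowerCase(Locale.ROOT);
        if (v.endsWith("%")) {
            BigDecimal pct = new BigDecimal(v.substring(0, v.length() - 1).trim());
            if (pct.signum() < 0 || pct.compareTo(BigDecimal.valueOf(100)) > 0) {
                throw new IllegalArgumentException("percentage out of range: " + limit);
            }
            // Decimal arithmetic avoids floating-point drift; dividing by 100 always terminates.
            return pct.multiply(BigDecimal.valueOf(percentBaseBytes))
                .divide(BigDecimal.valueOf(100))
                .longValue();
        }
        long multiplier = 1024L;
        for (String suffix : SUFFIXES) {
            if (v.endsWith(suffix)) {
                return Long.parseLong(v.substring(0, v.length() - 2).trim()) * multiplier;
            }
            multiplier *= 1024L;
        }
        throw new IllegalArgumentException("unsupported limit: " + limit);
    }

    public static void main(String[] args) {
        long totalRam = 100L << 30;   // 100 GiB, illustrative
        long jvmHeap = 32L << 30;     // 32 GiB, illustrative
        System.out.println("50% of RAM outside the heap: " + toBytes("50%", totalRam - jvmHeap));
        System.out.println("50% of total RAM (wrong base): " + toBytes("50%", totalRam));
        System.out.println("4kb absolute: " + toBytes("4kb", totalRam));
    }
}
```

A regression test for this mode pins the exact byte count for one percentage and one absolute value, so that a change of base or unit fails loudly.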

Step 3: The diff

The fix wires the dynamic settings update into a cache rebuild, so changing knn.memory.circuit_breaker.limit at runtime actually re-bounds the live cache. The exact file/lines drift — this is the shape; adapt the surrounding lines to your tree.

--- a/src/main/java/org/opensearch/knn/index/KNNSettings.java
+++ b/src/main/java/org/opensearch/knn/index/KNNSettings.java
@@ -1,5 +1,7 @@
 package org.opensearch.knn.index;

+import org.opensearch.knn.index.memory.NativeMemoryCacheManager;
+
 import org.opensearch.common.settings.ClusterSettings;
 import org.opensearch.common.settings.Setting;
 import org.opensearch.common.unit.ByteSizeValue;
@@ -120,9 +122,11 @@ public class KNNSettings {
     private void setClusterSettings(final ClusterSettings clusterSettings) {
         this.clusterSettings = clusterSettings;
-        // BUG: the limit setting is registered as dynamic, but nothing propagates a
-        //      changed value into the live NativeMemoryCacheManager weight bound, so a
-        //      runtime update is silently ignored until the next node restart.
-        clusterSettings.addSettingsUpdateConsumer(KNN_MEMORY_CIRCUIT_BREAKER_LIMIT, newLimit -> {
-            // value stored in the settings map, but the cache is never rebuilt
-        });
+        // FIX: on a dynamic update of the circuit-breaker limit, rebuild the cache so its
+        //      maximumWeight reflects the new bound immediately (no restart required).
+        // NOTE: confirm getCircuitBreakerLimit() already returns newLimit when this consumer
+        //       fires; if it still reads the previously applied settings, pass newLimit
+        //       into the rebuild explicitly instead.
+        clusterSettings.addSettingsUpdateConsumer(KNN_MEMORY_CIRCUIT_BREAKER_LIMIT, newLimit -> {
+            NativeMemoryCacheManager.getInstance().rebuildCache();
+        });
     }

--- a/src/main/java/org/opensearch/knn/index/memory/NativeMemoryCacheManager.java
+++ b/src/main/java/org/opensearch/knn/index/memory/NativeMemoryCacheManager.java
@@ -60,11 +60,26 @@ public class NativeMemoryCacheManager implements Closeable {

-    // BUG: maximumWeight is read once at construction from the limit; a later settings
-    //      update has no path to change the live cache's bound.
-    private void initCache() {
-        long maxWeightKb = KNNSettings.getCircuitBreakerLimit().getKb();
-        this.cache = CacheBuilder.newBuilder()
-            .maximumWeight(maxWeightKb)
-            .weigher((k, v) -> v.getSizeInKb())
-            .removalListener(this::onRemoval)
-            .build();
-    }
+    private void initCache() {
+        this.cache = buildCache(KNNSettings.getCircuitBreakerLimit().getKb());
+    }
+
+    private Cache<String, NativeMemoryAllocation> buildCache(long maxWeightKb) {
+        return CacheBuilder.newBuilder()
+            .maximumWeight(maxWeightKb)
+            .weigher((String k, NativeMemoryAllocation v) -> v.getSizeInKb())
+            .removalListener(this::onRemoval)
+            .build();
+    }
+
+    /**
+     * Rebuild the cache against the CURRENT circuit-breaker limit. Invalidating the old
+     * cache fires the RemovalListener for every entry, which frees the native memory via
+     * JNI — so a shrunk limit immediately evicts down to the new bound, and nothing leaks.
+     */
+    public synchronized void rebuildCache() {
+        long newMaxWeightKb = KNNSettings.getCircuitBreakerLimit().getKb();
+        Cache<String, NativeMemoryAllocation> old = this.cache;
+        this.cache = buildCache(newMaxWeightKb);
+        if (old != null) {
+            old.invalidateAll();   // RemovalListener -> JNI free() for every entry
+        }
+    }

The crucial detail is the last comment: invalidateAll() fires the RemovalListener, which calls the engine's free across JNI. Rebuilding the cache without invalidating the old one would leak every loaded faiss graph — a non-obvious off-heap leak. The fix re-bounds the cache and releases native memory in one atomic, synchronized step.

Note: A real PR would likely also debounce or guard the rebuild (it briefly drops all cached graphs, so the next queries pay reload cost), and reconcile with the rearchitecture direction in k-NN #1582. Say so in your PR description; maintainers will ask how your fix relates to the planned rework.
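The leak-versus-release distinction can be checked with a toy model before touching JNI. The sketch below is plain JDK: ToyCache is a hypothetical stand-in for the Guava cache (it models only a weight bound and a removal callback, not eviction), and the freed list stands in for the native free call. It contrasts a rebuild that drops the old cache silently with one that invalidates it first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model of the rebuild seam; every name here is illustrative. */
final class RebuildDemo {

    /** Weight bound plus removal callback. Eviction on put is deliberately omitted. */
    static final class ToyCache {
        final long maxWeightKb;
        private final Map<String, Long> entries = new LinkedHashMap<>();
        private final Consumer<String> onRemoval;

        ToyCache(long maxWeightKb, Consumer<String> onRemoval) {
            this.maxWeightKb = maxWeightKb;
            this.onRemoval = onRemoval;
        }

        void put(String key, long sizeKb) {
            entries.put(key, sizeKb);
        }

        /** Notify the removal callback for every entry, then drop them all. */
        void invalidateAll() {
            entries.keySet().forEach(onRemoval);
            entries.clear();
        }
    }

    /** Keys whose "native memory" was released; stands in for the JNI free call. */
    final List<String> freed = new ArrayList<>();
    private ToyCache cache;

    RebuildDemo(long limitKb) {
        this.cache = new ToyCache(limitKb, freed::add);
    }

    void load(String key, long sizeKb) {
        cache.put(key, sizeKb);
    }

    long maxWeightKb() {
        return cache.maxWeightKb;
    }

    /** Leaky variant: the old cache is dropped and its entries are never released. */
    synchronized void rebuildLeaky(long newLimitKb) {
        cache = new ToyCache(newLimitKb, freed::add);
    }

    /** Safe variant: swap in the new bound, then invalidate the old cache. */
    synchronized void rebuild(long newLimitKb) {
        ToyCache old = cache;
        cache = new ToyCache(newLimitKb, freed::add);
        old.invalidateAll();
    }

    public static void main(String[] args) {
        RebuildDemo leaky = new RebuildDemo(10);
        leaky.load("graph-1", 4);
        leaky.rebuildLeaky(4);
        System.out.println("leaky rebuild freed: " + leaky.freed);

        RebuildDemo safe = new RebuildDemo(10);
        safe.load("graph-1", 4);
        safe.rebuild(4);
        System.out.println("safe rebuild freed: " + safe.freed);
    }
}
```

Both variants report the new bound afterwards, which is why asserting only on the bound would not catch the leak; a test needs to observe the removal callback as well.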

Step 4: A unit test that asserts the corrected behavior — without leaking

The test must prove the limit propagates, without loading a real faiss graph (that allocates off-heap; a unit test that loads and forgets to free leaks native memory for the rest of the JVM run). Assert on the cache's configured bound, not on a live graph.

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */
package org.opensearch.knn.index.memory;

import org.opensearch.common.settings.ClusterSettings;
import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Settings;
import org.opensearch.knn.KNNTestCase;
import org.opensearch.knn.index.KNNSettings;

import java.util.Set;
import java.util.stream.Collectors;

import static org.opensearch.knn.index.KNNSettings.KNN_MEMORY_CIRCUIT_BREAKER_LIMIT;

public class NativeMemoryCircuitBreakerLimitTests extends KNNTestCase {

    /**
     * Updating knn.memory.circuit_breaker.limit at runtime must re-bound the live cache.
     * Before the fix, the cache kept its construction-time maximumWeight and the update
     * was silently ignored. We assert on the cache's CONFIGURED weight, never load a
     * native graph, and free nothing, so this test allocates no off-heap memory.
     */
    public void testDynamicLimitUpdateRebuildsCacheBound() {
        // Register settings with a starting limit.
        Settings initial = Settings.builder()
            .put(KNN_MEMORY_CIRCUIT_BREAKER_LIMIT.getKey(), "10kb")
            .build();
        // ClusterSettings takes a Set of node-scoped settings. getSettings() is assumed to
        // return a List that also holds index-scoped settings, so filter before registering.
        Set<Setting<?>> nodeScoped = KNNSettings.state().getSettings().stream()
            .filter(Setting::hasNodeScope)
            .collect(Collectors.toSet());
        ClusterSettings clusterSettings = new ClusterSettings(initial, nodeScoped);
        // setClusterSettings is private in the Step 3 diff; widen its visibility for tests or
        // go through the public registration helper in your tree (grep for it).
        KNNSettings.state().setClusterSettings(clusterSettings);

        NativeMemoryCacheManager cm = NativeMemoryCacheManager.getInstance();
        cm.rebuildCache();
        long before = cm.getMaxCacheSizeInKilobytes();
        assertEquals("initial limit should be honored", 10L, before);

        // Apply a dynamic update to a smaller limit.
        clusterSettings.applySettings(Settings.builder()
            .put(KNN_MEMORY_CIRCUIT_BREAKER_LIMIT.getKey(), "4kb")
            .build());

        long after = cm.getMaxCacheSizeInKilobytes();
        assertEquals("dynamic update must re-bound the live cache", 4L, after);
        assertTrue("limit must actually shrink, not stay stale", after < before);
    }

    @Override
    public void tearDown() throws Exception {
        // Hygiene: drop any cache state so this test never leaks native allocations into
        // the shared test JVM. invalidateAll() fires the RemovalListener (JNI free()).
        NativeMemoryCacheManager.getInstance().invalidateAll();
        super.tearDown();
    }
}

Warning — native memory hygiene in tests: Every k-NN test that touches the cache must leave it empty. Off-heap allocations are not GC'd; a leaked faiss pointer survives the test method and inflates the test JVM's RSS, which can OOM-kill the entire test run on CI. Always invalidateAll() (or the equivalent) in tearDown, and prefer asserting on cache configuration over loading real graphs. This is the single rule that separates a green k-NN test PR from a flaky, memory-leaking one.

./gradlew test --tests "org.opensearch.knn.index.memory.NativeMemoryCircuitBreakerLimitTests"

Step 5: Build, test, and verify end to end

# 1) Unit test (fast inner loop).
./gradlew test --tests "*NativeMemoryCircuitBreakerLimit*"

# 2) Full module test + precommit (formatting, license headers, forbidden APIs).
./gradlew spotlessApply
./gradlew precommit

# 3) Manual proof on a running node: the Step-1 reproduction now behaves.
./gradlew run
# ...repeat the Step 1 curls: set limit to "1kb", warmup the faiss field...
# Now the breaker trips / the cache evicts down to the new bound, visible in stats:
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' \
  | grep -E 'graph_memory|circuit_breaker|eviction'
# Lower the limit at runtime and confirm cache_capacity reflects it WITHOUT a restart:
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{ "persistent": { "knn.memory.circuit_breaker.limit": "2kb" } }'
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | grep -E 'cache_capacity|eviction'

A real fix changes the manual reproduction from "limit ignored until restart" to "limit honored immediately," with the eviction count climbing as the shrunk cache sheds entries.

Step 6: Ship the PR (DCO, CHANGELOG, compat)

git checkout -b fix/native-cb-limit-dynamic-update

# CHANGELOG entry — k-NN keeps a Keep-a-Changelog style CHANGELOG.md:
#   ### Fixed
#   - Honor dynamic updates to knn.memory.circuit_breaker.limit by rebuilding the
#     native-memory cache bound at runtime ([#NNNN](https://github.com/opensearch-project/k-NN/pull/NNNN))

git add src/main/java/org/opensearch/knn/index/KNNSettings.java \
        src/main/java/org/opensearch/knn/index/memory/NativeMemoryCacheManager.java \
        src/test/java/org/opensearch/knn/index/memory/NativeMemoryCircuitBreakerLimitTests.java \
        CHANGELOG.md

# DCO sign-off is mandatory in opensearch-project repos (-s adds Signed-off-by):
git commit -s -m "Honor dynamic native-memory circuit-breaker limit updates

The dynamic settings update for knn.memory.circuit_breaker.limit was
registered but never propagated to the live NativeMemoryCacheManager, so
a runtime limit change was ignored until restart. Rebuild the cache on
update (invalidateAll -> JNI free, no native leak) so the new bound takes
effect immediately. Adds a regression test asserting the cache re-bounds
without loading a native graph.

Relates to #585 and the rearchitecture discussion in #1582."

Note on compatibility (Stage 8): changing settings behavior is a compatibility surface. The fix honors the documented contract (the limit is dynamic), so it is a bug fix, not a breaking change — but in the PR, state explicitly that the setting key, type, and default are unchanged. See Issue Roadmap Stage 8: Plugin Compatibility for what reviewers check: setting renames, default changes, and BWC of serialized state are the things that get a PR blocked.

Pitfalls

Trap	Why it's wrong	Do instead
Rebuild the cache without `invalidateAll()`	leaks every loaded faiss graph (off-heap, never GC'd)	invalidate the old cache so the `RemovalListener` JNI-frees each entry
Test by loading a real faiss graph	allocates native memory; if not freed, leaks for the whole test JVM → CI OOM	assert on the cache's configured bound; `invalidateAll()` in `tearDown`
Judge the fix by RSS dropping	glibc keeps freed arenas; RSS lags eviction	judge by cache accounting in `_plugins/_knn/stats`, not `ps`
`Thread.sleep` to "let the update apply"	settings application is synchronous via the consumer; a sleep hides the real seam	call the update path directly and assert the new bound
Skip the CHANGELOG / DCO	k-NN PRs fail CI without DCO and a maintainer will bounce a missing CHANGELOG	`git commit -s`; add a `### Fixed` entry
Parse-base assumption	a percentage limit resolves against the node RAM left outside the JVM heap (total minus max heap), not `-Xmx`	confirm against native integration before touching the parse path

Expected Output

Before the fix, after setting the limit to "1kb" and warming a graph:

"graph_memory_usage" : 37,            # KB, well over the 1kb limit
"circuit_breaker_triggered" : false,  # WRONG — limit silently ignored
"cache_capacity_reached" : false

After the fix:

"graph_memory_usage" : 0,             # cache evicted down to the new bound
"circuit_breaker_triggered" : true,
"cache_capacity_reached" : true,
"eviction_count" : 1                  # the shrink fired the RemovalListener (JNI free)

And the regression test:

> Task :test
org.opensearch.knn.index.memory.NativeMemoryCircuitBreakerLimitTests >
  testDynamicLimitUpdateRebuildsCacheBound PASSED

BUILD SUCCESSFUL

Stretch Goals

Fix the other failure mode. Reconstruct the mis-parse variant: make parseknnMemoryCircuitBreakerValue resolve a percentage against the wrong base, write a test that pins the byte count for both "60%" and "4gb", and fix it. Compare to the stale-limit fix — which is easier to catch in review, and why?
Read the real #585. Open k-NN #585, find the actual merged fix PR, and diff your approach against the maintainers'. What did they do that you didn't (debounce? a dedicated config object? a test you missed)?
Engage #1582. Read k-NN #1582 and write two paragraphs on how a clean rearchitecture would make your fix unnecessary — separating the breaker policy from the cache mechanism so a limit change does not require a cache rebuild at all.
Find a live one. Search https://github.com/opensearch-project/k-NN/issues?q=is%3Aissue+is%3Aopen+label%3Abug+circuit+breaker and ... memory cache eviction for a current, real defect in this subsystem and take it.

Validation / Self-check

What is the difference between the setting being stored and the setting being applied? Which side was broken in this lab, and why is that invisible from GET /_cluster/settings?
Why must rebuildCache() call invalidateAll() on the old cache? What leaks if it doesn't, and why won't the GC catch that leak?
Why does the unit test assert on the cache's configured bound instead of loading a real faiss graph and watching it get evicted?
Why is RSS a misleading signal for whether the breaker worked? What is the reliable signal instead?
Is the native-memory circuit-breaker limit a percentage of -Xmx or of node RAM? How does that change the failure analysis for an OOM-killed node?
What two CI gates will reject your PR if you forget them, and what does each enforce?
How does your point fix relate to the rearchitecture in #1582? When you propose a point fix in a subsystem with an open redesign, what should the PR description say?

When you can reproduce the symptom, locate the seam in KNNSettings/NativeMemoryCacheManager, fix it without leaking native memory, prove it with a non-leaking regression test, and ship it with DCO + CHANGELOG, you have done real k-NN maintenance work. Revisit native integration and memory for the full breaker model, and Stage 8: Plugin Compatibility for the compat lens every settings change is read through. Then prove your fix doesn't regress performance in lab-k6: Benchmark Recall and Latency.

Lab K6: Benchmark Recall and Latency

Background

Approximate nearest-neighbor search has a knob for every metric you care about, and they all trade against each other. Crank ef_search and recall climbs but latency rises. Switch to PQ or on_disk mode and memory plummets but recall drops until you add a rescore pass. There is no single number that describes a vector index — there is a frontier: recall@k on one axis, latency (or throughput, or memory) on the other, and a curve you move along by changing engine, method, quantization, and ef_search.

A vector-search PR that claims "faster" or "smaller" without plotting that frontier is a guess, and k-NN maintainers treat it the way OpenSearch core treats an unbenchmarked performance PR (see Lab 9.2: Analyze a Performance Regression): they do not merge "should be better," they merge "here is recall@10 vs p99 latency, before and after, on a named workload, reproducible." This lab teaches you to produce that evidence: measure recall@k against a brute-force ground truth, measure latency/throughput and memory, vary the index parameters, and tabulate the frontier — using OpenSearch Benchmark's vectorsearch workload and a small custom harness. The k-NN repo's own benchmarking effort is k-NN #2595; read it to see what numbers maintainers expect.

Note on terminology: the cluster manager (formerly master) is irrelevant to a single-node recall/latency measurement — vector search performance is a per-shard, per-segment property. Benchmark on a controlled single node first; only scale out once the single-node frontier is understood, or you will conflate ANN behavior with distribution effects.

Why This Lab Matters for Contributors

Every quantization, engine, or algorithm change in k-NN is defined by its effect on the recall/latency/memory frontier. You cannot review — let alone author — such a PR without measuring it. Quantization and disk-ANN ends with exactly this instruction: never ship aggressive compression without measuring post-rescore recall against ground truth.
Recall is meaningless without a ground truth. Learning to compute brute-force exact-kNN as the oracle is the foundation; everything else is comparison against it.
Maintainers gate vector-perf PRs on reproducible, apples-to-apples numbers. A benchmark that changes two variables at once, or doesn't warm the cache, or measures a cold first query, proves nothing. This lab is about methodology discipline.
The skills transfer directly to the capstone project-06: k-NN Benchmark Harness and to the general performance-regression workflow in Lab 9.2.

The metrics, defined precisely

Metric	Definition	How measured
recall@k	of the k results the ANN index returns, the fraction that are in the true top-k	compare ANN result ids to brute-force ground-truth ids, averaged over a query set
p50 / p90 / p99 latency	per-query wall time at those percentiles	OSB `service_time`/`latency`, or timed loop
throughput	queries/sec at a target concurrency	OSB at fixed clients, or a concurrent harness
graph memory	off-heap native memory the index occupies	`GET /_plugins/_knn/stats` `graph_memory_usage` (see native memory)
build/merge time	wall time to index + force-merge to 1 segment	timed ingest; relevant to the GPU/remote-build RFCs
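
When latency comes from a timed loop rather than OSB, the percentiles have to be computed by hand. The sketch below uses the nearest-rank definition, which is one common convention and may differ slightly from the interpolation OSB applies; the sample latencies are made-up values for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(latencies_ms):
    """Report the three percentiles the metrics table names."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}

# Hypothetical per-query wall times in milliseconds; one slow outlier.
latencies_ms = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
summary = latency_summary(latencies_ms)
print(summary)
```

The single 100 ms outlier leaves p50 and p90 untouched and shows up only in p99, which is why the table asks for all three rather than a mean.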

Warning: Recall@k is computed against exact nearest neighbors, not against another approximate run. If your "ground truth" is itself approximate, every recall number is fiction. The ground truth is brute force: an exact script_score scan, or a flat index (an exact scan; IVF with nlist=1 degenerates to the same thing). You compute it once per query set.

Prerequisites

Quantization and disk-ANN read — you know the compression spectrum (byte / FP16 / PQ / BQ / on_disk) and that rescore recovers recall.
Native integration and memory read — you know graph_memory_usage and the warmup API (you must warm before measuring latency, or you measure cold-load cost, not query cost).
A running OpenSearch with k-NN, ideally on dedicated hardware (a laptop on battery is not a benchmark host).
Python 3 (for the harness and recall computation), and OpenSearch Benchmark: pip install opensearch-benchmark.

opensearch-benchmark --version
opensearch-benchmark list workloads | grep -i vector   # confirms the vectorsearch workload is available

Step 1: Establish a ground truth (brute force)

Recall needs an oracle. For a fixed query set, compute the exact top-k with a brute-force scan, independent of any ANN index. The cleanest way inside OpenSearch is an exact script_score over the raw vectors (no HNSW graph involved).

# Index the corpus once into a field you can scan exactly. (For real runs use the
# vectorsearch workload's dataset; this shows the mechanism on a tiny corpus.)
curl -XPUT 'localhost:9200/gt' -H 'Content-Type: application/json' -d '
{ "settings": { "index.knn": true },
  "mappings": { "properties": { "v": { "type": "knn_vector", "dimension": 128,
    "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" } } } } }'

# For ONE query vector, get the EXACT top-k via knn_score (brute-force, no graph):
curl -s 'localhost:9200/gt/_search' -H 'Content-Type: application/json' -d '
{ "size": 10, "query": { "script_score": {
      "query": { "match_all": {} },
      "script": { "source": "knn_score", "lang": "knn",
        "params": { "field": "v", "query_value": [/* 128 floats */], "space_type": "l2" } } } } }'

The knn_score script computes the true distance for every document — that is exact kNN, your ground truth. Run it for each query in your query set and store the result-id lists. A small Python loop is cleaner for a real query set:

# ground_truth.py — exact top-k per query, the recall oracle.
import json
from opensearchpy import OpenSearch

client = OpenSearch("http://localhost:9200")
K = 10

def exact_topk(field, qvec, k=K, space="l2"):
    body = {"size": k, "query": {"script_score": {
        "query": {"match_all": {}},
        "script": {"source": "knn_score", "lang": "knn",
                   "params": {"field": field, "query_value": qvec, "space_type": space}}}}}
    hits = client.search(index="gt", body=body)["hits"]["hits"]
    return [h["_id"] for h in hits]

with open("queries.json") as f:                       # list of query vectors
    queries = json.load(f)
ground_truth = {i: exact_topk("v", q) for i, q in enumerate(queries)}
with open("ground_truth.json", "w") as f:             # JSON stores the int keys as strings
    json.dump(ground_truth, f)

Note: Brute force is O(N) per query — fine for tens of thousands of vectors, slow for millions. For large corpora, compute ground truth once with a dedicated tool (faiss IndexFlatL2 offline, or the dataset's provided ground-truth file — ANN benchmark datasets like SIFT/GIST ship one). Never skip it; recall without an oracle is noise.
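
For an offline oracle that does not touch OpenSearch at all, the exact scan can be written directly. This is a minimal pure-Python sketch for small corpora, assuming L2 distance and an in-memory dict of id to vector; the function names and the toy corpus are hypothetical, and at real scale a vectorized tool such as the faiss flat index mentioned above is the practical choice.

```python
import heapq

def l2_squared(a, b):
    """Squared L2 distance; it ranks neighbors identically to true L2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_topk_offline(corpus, qvec, k=10):
    """corpus: dict of doc id -> vector. Returns the k nearest ids, nearest first."""
    scored = ((l2_squared(vec, qvec), doc_id) for doc_id, vec in corpus.items())
    return [doc_id for _, doc_id in heapq.nsmallest(k, scored)]

# Toy corpus and query set, for illustration only.
corpus = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 3.0], "d": [5.0, 5.0]}
queries = [[0.9, 0.1], [0.0, 2.0]]
ground_truth = {i: exact_topk_offline(corpus, q, k=2) for i, q in enumerate(queries)}
print(ground_truth)
```

Every document is scored for every query, so there is no approximation anywhere in the path. That property is what qualifies the output as an oracle.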

Step 2: The recall computation

Recall@k is set overlap between ANN results and ground truth, averaged over queries.

# recall.py — recall@k of an ANN run vs the ground truth.
import json

def recall_at_k(ann_results, ground_truth, k=10):
    total = 0.0
    for qid, gt_ids in ground_truth.items():
        gt_set = set(gt_ids[:k])
        ann_set = set(ann_results[qid][:k])
        total += len(gt_set & ann_set) / float(k)
    return total / len(ground_truth)

gt  = {int(k): v for k, v in json.load(open("ground_truth.json")).items()}
ann = {int(k): v for k, v in json.load(open("ann_results.json")).items()}
print(f"recall@10 = {recall_at_k(ann, gt, 10):.4f}")

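A knn query against the ANN index produces the results to compare. That index must hold the same documents under the same _id values as the ground-truth index, or the set overlap is meaningless: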

curl -s 'localhost:9200/test/_search' -H 'Content-Type: application/json' -d '
{ "size": 10, "query": { "knn": { "v": { "vector": [/* 128 floats */], "k": 10 } } } }'
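
To feed recall.py, the same query has to run for every vector in the query set and the ids stored in the ann_results.json shape. A minimal sketch follows, with the search call injected as a function so the logic can be checked without a cluster; knn_query_body, collect_ann_results, and fake_search are hypothetical names, and the optional method_parameters field is assumed to be supported by your k-NN version.

```python
import json

def knn_query_body(field, vector, k=10, ef_search=None):
    """Build a knn query body; ef_search is only sent when explicitly given."""
    knn_clause = {"vector": vector, "k": k}
    if ef_search is not None:
        knn_clause["method_parameters"] = {"ef_search": ef_search}
    return {"size": k, "query": {"knn": {field: knn_clause}}}

def collect_ann_results(search_fn, queries, field="v", k=10, ef_search=None):
    """Run every query through search_fn and keep the ordered result ids."""
    results = {}
    for qid, qvec in enumerate(queries):
        response = search_fn(knn_query_body(field, qvec, k, ef_search))
        results[qid] = [hit["_id"] for hit in response["hits"]["hits"]]
    return results

def fake_search(body):
    """Stand-in for a real search call; returns a fixed response shape."""
    return {"hits": {"hits": [{"_id": "7"}, {"_id": "3"}]}}

ann_results = collect_ann_results(fake_search, [[0.1, 0.2], [0.3, 0.4]], k=2)
print(json.dumps(ann_results))
```

Against a live node, search_fn could be something like `lambda body: client.search(index="test", body=body)` using the opensearch-py client from the ground-truth script.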

Step 3: Run the OpenSearch Benchmark `vectorsearch` workload

OSB automates ingest, query load, and latency/throughput collection against a real dataset. The vectorsearch workload is purpose-built for k-NN.

# Inspect the workload's parameters (engine, method, ef_*, dataset path, etc.):
opensearch-benchmark info --workload vectorsearch

# Run it against your node. workload-params override the index/query knobs you vary.
opensearch-benchmark execute-test \
  --target-hosts localhost:9200 \
  --pipeline benchmark-only \
  --workload vectorsearch \
  --workload-params '{
     "target_index_name": "test",
     "dimension": 128,
     "engine": "faiss",
     "method": "hnsw",
     "m": 16,
     "ef_construction": 128,
     "ef_search": 100,
     "k": 10
   }'

OSB reports service_time and latency percentiles, throughput, and error rate per task. Crucially, the vectorsearch workload computes recall for you when given a ground-truth file — so you can let OSB do Steps 1–2 on the standard datasets. Confirm the exact param names for your OSB version:

# The workload's params drift between OSB releases — read the real list:
opensearch-benchmark info --workload vectorsearch | grep -iE 'ef_search|recall|ground|engine|param'

Warning: warm before you measure. A faiss graph loads into native memory on the first query (see native memory). If you measure latency including that cold load, you are benchmarking disk I/O and deserialization, not query speed. Always call the warmup API, GET /_plugins/_knn/warmup/<index>, or run a warmup task before the measured phase, and force-merge to a stable segment count first so segment count isn't a hidden variable.
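
In a custom harness the same discipline means running unmeasured passes before the timed one. The sketch below is one way to structure that, assuming a run_query callable that issues a single search; time_queries and fake_query are hypothetical names. Client-side passes complement the k-NN warmup API; they do not replace it.

```python
import time

def time_queries(run_query, queries, warmup_passes=1):
    """Run unmeasured warmup passes, then time one measured pass per query."""
    for _ in range(warmup_passes):
        for q in queries:
            run_query(q)  # result and timing deliberately discarded
    latencies_ms = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms

calls = []

def fake_query(q):
    """Stand-in for a real search call; just records that it ran."""
    calls.append(q)

latencies = time_queries(fake_query, [[0.1], [0.2], [0.3]], warmup_passes=2)
print(len(calls), len(latencies))
```

Two warmup passes plus one measured pass over three queries gives nine calls but only three recorded latencies.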

Step 4: Vary one knob at a time and tabulate the frontier

The discipline that makes a benchmark evidence rather than anecdote: change exactly one variable per row, hold everything else fixed, warm up each time, and record the full vector of metrics. Sweep ef_search first (it moves recall/latency without reindexing):

# Sweep ef_search on a fixed faiss/HNSW index. ef_search is a query-time param:
for ef in 16 32 64 100 200 400; do
  echo "=== ef_search=$ef ==="
  curl -s "localhost:9200/test/_search" -H 'Content-Type: application/json' -d "
  { \"size\": 10,
    \"query\": { \"knn\": { \"v\": { \"vector\": [/* query */], \"k\": 10,
       \"method_parameters\": { \"ef_search\": $ef } } } } }" \
    | python3 -c 'import sys,json; r=json.load(sys.stdin); print("took_ms", r["took"])'
done

Then sweep the index-time dimensions (engine, method, quantization) by building a fresh index per configuration. A complete frontier table looks like this:

Engine	Method	Quantization	ef_search	recall@10	p50 (ms)	p99 (ms)	graph mem (MB)	build (s)
faiss	hnsw	none (fp32)	100	0.992	1.8	6.1	590	41
faiss	hnsw	none (fp32)	400	0.999	4.3	12.7	590	41
faiss	hnsw	fp16 SQ	100	0.989	1.7	5.9	300	44
faiss	hnsw	PQ (m=64) + rescore	100	0.971	2.4	8.8	95	63
faiss	hnsw	`on_disk` 16x + rescore	100	0.965	3.9	21.4	60	58
faiss	hnsw	BQ + rescore (oversample 5x)	100	0.948	2.1	9.5	38	49
lucene	hnsw	Lucene104 int8 SQ	100	0.987	2.6	9.2	n/a (heap)	52
nmslib	hnsw	none (read-only, deprecated)	100	0.991	2.0	7.0	600	40

(Numbers are illustrative shapes, not promises. The point is the trade pattern: quantization buys memory at a recall/latency cost, rescore claws recall back, higher ef_search buys recall with latency.) The lucene engine's graphs and vectors are managed by Lucene itself (memory-mapped segment files plus some JVM heap), not by the k-NN native memory cache, so graph_memory_usage from the native stats API does not cover them and the column reads n/a. That contrast is itself a finding (see native integration).

# Capture graph memory per config from the native stats API (faiss/nmslib only):
curl -s 'localhost:9200/_plugins/_knn/stats?pretty' | grep -E 'graph_memory_usage'

Note: Plot recall@10 (y) against p99 latency (x) for each engine/quantization as a curve, sweeping ef_search along it. A configuration is strictly better only if its curve is up-and-to-the-left of another's. A single point proves nothing; the curve is the deliverable.
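
The "up-and-to-the-left" test can be made mechanical. The sketch below filters a set of (recall, p99) points down to the non-dominated ones, using four illustrative rows from the frontier table; the function names are hypothetical, and only two axes are considered here.

```python
def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on at least one."""
    no_worse = a["recall"] >= b["recall"] and a["p99_ms"] <= b["p99_ms"]
    better = a["recall"] > b["recall"] or a["p99_ms"] < b["p99_ms"]
    return no_worse and better

def pareto_front(points):
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Illustrative values copied from the frontier table.
points = [
    {"config": "fp32 ef=100", "recall": 0.992, "p99_ms": 6.1},
    {"config": "fp32 ef=400", "recall": 0.999, "p99_ms": 12.7},
    {"config": "fp16 ef=100", "recall": 0.989, "p99_ms": 5.9},
    {"config": "PQ+rescore ef=100", "recall": 0.971, "p99_ms": 8.8},
]
front = [p["config"] for p in pareto_front(points)]
print(front)
```

PQ+rescore drops out on these two axes because fp32 at ef_search=100 beats it on both. It returns to the frontier once graph memory is added as a third axis, which is why the table records the full vector of metrics.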

Step 5: A reproducible methodology (the checklist maintainers expect)

A benchmark is only evidence if someone else can reproduce it. Pin every variable:

HARDWARE:   instance type / CPU model / RAM / disk (NVMe vs network) — vector perf is CPU+memory bound
JVM:        -Xmx, JDK version (Panama SIMD needs a recent JDK; see ../../lucene/simd-and-the-vector-api.md)
DATASET:    name, N vectors, dimension, distribution (e.g. SIFT-1M, 128-d, L2)
QUERIES:    fixed query set (count, same set across all configs), provided ground truth
INDEX:      shards=1, replicas=0 (isolate ANN from distribution), force-merge to 1 segment
WARMUP:     GET /_plugins/_knn/warmup/<index> BEFORE measured phase; discard first run
CONFIG:     engine, method, m, ef_construction, ef_search, quantization, oversample_factor, rescore
PROCEDURE:  one variable changed per run; N measured queries; report p50/p90/p99 + recall + mem
REPEAT:     ≥3 runs per config; report median; note variance
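
The REPEAT line reduces to a few lines of code. This is a small sketch of summarizing repeated runs of one configuration; summarize_runs is a hypothetical helper and the three p99 values are placeholders, not measurements.

```python
import statistics

def summarize_runs(values):
    """Median for the headline number, min/max spread as a simple variance note."""
    return {
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "spread": max(values) - min(values),
    }

# Placeholder p99 latencies (ms) from three runs of the same config.
p99_runs_ms = [8.8, 9.4, 8.6]
summary = summarize_runs(p99_runs_ms)
print(summary)
```

If the spread is comparable to the difference between two configurations, the runs cannot distinguish them and the comparison needs more repeats or a quieter host.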

Warning: replicas=0 and shards=1 for the controlled run. Replicas mean a query can hit a different copy with a differently-built graph (HNSW construction is non-deterministic across merges), and multiple shards mean per-shard top-k merging — both add variance that has nothing to do with the ANN parameter you are studying. Add shards back only when you are explicitly measuring distribution.
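
A controlled index definition, written as the request body a harness would send, might look like the following. It is a sketch assuming the faiss HNSW mapping used earlier in this lab; controlled_index_body is a hypothetical helper, and the parameter defaults mirror the OSB example rather than any recommendation.

```python
import json

def controlled_index_body(dimension, engine="faiss", m=16, ef_construction=128,
                          space_type="l2"):
    """Index body for a single-shard, zero-replica ANN benchmark index."""
    return {
        "settings": {
            "index.knn": True,
            "index.number_of_shards": 1,    # no cross-shard top-k merging
            "index.number_of_replicas": 0,  # every query hits the same graph
        },
        "mappings": {"properties": {"v": {
            "type": "knn_vector",
            "dimension": dimension,
            "method": {
                "name": "hnsw", "engine": engine, "space_type": space_type,
                "parameters": {"m": m, "ef_construction": ef_construction},
            },
        }}},
    }

body = controlled_index_body(128)
print(json.dumps(body, indent=2))
```

Generating the body from one function keeps the shard and replica settings identical across every configuration in the sweep, so only the method parameters differ between rows.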

Step 6: How maintainers gate vector-perf PRs

When you submit a change that touches the vector path, the reviewer wants the same evidence this lab produces. The bar (mirroring core perf review in Lab 9.2):

Reviewer asks	What satisfies it
"What did recall do?"	recall@k before/after on a named dataset with a real ground truth — not "should be unaffected"
"What did latency do?"	p50/p99 before/after at the same `ef_search`, warmed, on the same hardware
"What did memory do?"	`graph_memory_usage` before/after (the whole point of a quantization PR)
"Is it apples-to-apples?"	one variable changed; identical dataset, queries, hardware, merge state
"Is it reproducible?"	the methodology block above, so they can re-run it
"Does it regress the others?"	a quantization win that tanks recall, or a recall win that doubles latency, is not a win — show the whole vector

The benchmarking work tracked in k-NN #2595 exists precisely to standardize this so PRs are comparable across time. Read it before proposing a vector-perf change, and frame your numbers against its methodology.
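
The reviewer's questions can be run as a self-check before opening the PR. The sketch below compares a before and after row and lists any regression; review_gate and its thresholds are hypothetical choices for illustration, not a rule the maintainers publish, and the two rows reuse illustrative values from the frontier table.

```python
def review_gate(before, after, max_recall_drop=0.005, max_latency_ratio=1.10):
    """Return a list of regressions; an empty list means nothing obviously regressed."""
    findings = []
    if before["recall"] - after["recall"] > max_recall_drop:
        findings.append(f"recall dropped {before['recall']:.3f} -> {after['recall']:.3f}")
    if after["p99_ms"] > before["p99_ms"] * max_latency_ratio:
        findings.append(f"p99 rose {before['p99_ms']} -> {after['p99_ms']} ms")
    if after["graph_mem_mb"] > before["graph_mem_mb"]:
        findings.append(f"memory rose {before['graph_mem_mb']} -> {after['graph_mem_mb']} MB")
    return findings

# Illustrative rows: fp32 baseline versus BQ + rescore.
before = {"recall": 0.992, "p99_ms": 6.1, "graph_mem_mb": 590}
after = {"recall": 0.948, "p99_ms": 9.5, "graph_mem_mb": 38}
findings = review_gate(before, after)
for line in findings:
    print(line)
```

Here memory improves by more than an order of magnitude yet two findings remain, which is the "quantization win that tanks recall" case from the table: the change may still be worth merging, but only with the whole vector shown.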

Implementation Requirements / Deliverables

A brute-force ground truth for a fixed query set (exact top-k per query), produced independently of any ANN index.
A working recall@k computation (set overlap vs ground truth), with a pasted number.
An OSB vectorsearch run (pasted summary) and/or a custom harness producing latency percentiles.
A frontier table varying at least engine or quantization and ef_search, with recall + latency + graph memory per row.
The reproducibility methodology block filled in with your actual hardware/dataset/config.
Warmup performed before every measured phase (state how you verified it via stats).

Troubleshooting

Symptom	Likely cause	Fix
recall@10 is suspiciously 1.000 on every config	comparing ANN to itself, not to exact ground truth	recompute ground truth with `knn_score`/flat exact, not another `knn` query
First measured query slow, rest fast	cold native load on first query	`GET /_plugins/_knn/warmup/<index>` before measuring; discard run 1
Latency wildly variable run to run	multiple segments / replicas / background merges	force-merge to 1 segment; `replicas=0`; quiesce indexing before measuring
Quantized config has terrible recall	rescore not enabled or oversample too low	enable rescore; raise `oversample_factor` (quantization)
`graph_memory_usage` is 0 for a lucene-engine index	lucene vectors live on heap/mmap, not native memory	expected — report n/a; compare lucene mem differently
OSB recall is empty/absent	no ground-truth file passed to the workload	supply the dataset's ground truth, or compute it (Step 1) and point the workload at it
`on_disk` p99 spikes	rescore reading full-precision vectors from cold disk	warm OS page cache; report cold vs warm separately; tune oversample vs latency
Numbers don't reproduce on a teammate's box	unpinned hardware/JDK/dataset	fill the methodology block; both run identical config + dataset + merge state

Expected Output

A recall computation and a frontier slice, e.g.:

$ python3 recall.py
recall@10 = 0.9712     # faiss HNSW + PQ(m=64) + rescore, ef_search=100

$ # OSB summary excerpt
|   Metric |        Task | Value |  Unit |
| Min Throughput | knn-search |  812 | ops/s |
| 50th percentile latency | knn-search | 2.4 | ms |
| 99th percentile latency | knn-search | 8.8 | ms |
| Mean recall@10 | knn-search | 0.971 |   - |

The deliverable is not any single number. It is the table and the curve: a reader can see exactly what PQ+rescore costs in recall and latency to cut graph memory roughly 6× versus float32 (590 MB down to 95 MB in the illustrative table), and decide if that trade fits their constraint.

Stretch Goals

Plot the frontier. Produce a recall@10-vs-p99 scatter with one curve per engine/quantization, ef_search swept along each. Identify which configs are Pareto-dominated (another config is at least as good on both axes and strictly better on at least one) and drop them.
Measure the rescore knob. Hold quantization fixed (say PQ) and sweep oversample_factor from 1x to 10x. Show recall rising and latency rising, and find the knee where more oversample stops buying recall.
Build vs serve. Time index-build/force-merge per config and tabulate it alongside query metrics — this is the cost the GPU/remote-build RFCs (#2293 / #2294) attack. Which configs are build-bound vs serve-bound?
Compare engines fairly. Put faiss-HNSW and lucene-HNSW on the same recall and compare latency and memory. The lucene engine's heap residency vs faiss's native memory is a real deployment trade — quantify it.
Take it to the capstone. Turn this into the reusable, parameterized harness of project-06: k-NN Benchmark Harness.
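
For the rescore stretch goal, "the knee" needs an operational definition. One simple choice, sketched below, is the first oversample setting after which the next step adds less than a chosen recall gain; find_knee, the 0.01 threshold, and the sweep values are all hypothetical placeholders rather than measured results.

```python
def find_knee(sweep, min_gain=0.01):
    """sweep: list of (oversample_factor, recall) sorted by factor.
    Returns the first factor whose next step gains less than min_gain recall."""
    for (factor, recall), (_, next_recall) in zip(sweep, sweep[1:]):
        if next_recall - recall < min_gain:
            return factor
    return sweep[-1][0]  # recall was still climbing at the end of the sweep

# Placeholder sweep: (oversample_factor, recall@10).
sweep = [(1, 0.900), (2, 0.950), (3, 0.970), (5, 0.975), (10, 0.976)]
knee = find_knee(sweep)
print(knee)
```

Going from 3x to 5x gains about 0.005 recall in this made-up sweep, below the threshold, so 3x is reported as the knee. A real analysis would weigh that gain against the latency each step costs.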

Validation / Self-check

Why is recall meaningless without a ground truth, and how do you compute an exact ground truth inside OpenSearch? Why can't another knn query serve as the oracle?
Define recall@k precisely as a set operation. If ANN returns 8 of the true top-10, what is recall@10?
Why must you warm up before measuring latency? What are you actually measuring if you don't, and how do you verify warmup happened?
Name three variables you must hold fixed to make two runs comparable, and explain what variance each one introduces if you let it move.
Why shards=1, replicas=0 for a controlled ANN benchmark? What does each setting isolate you from?
Walk a quantization PR's evidence: which three metrics must move in the reviewer's favor (or be shown not to regress) for it to merge? Give a concrete example of a "win" that is actually a loss.
Why is a single (recall, latency) point insufficient, and what is the curve that replaces it? When is one configuration strictly better than another?

When you can produce a ground truth, compute recall@k, sweep a parameter, and present the recall/latency/memory frontier with a reproducible methodology, you can both author and review vector-perf changes the way maintainers require. Close the loop with quantization and disk-ANN (the trades you just measured), Lab 9.2: Analyze a Performance Regression (the same discipline for core), and the capstone project-06: k-NN Benchmark Harness.

Engineering at Scale: Real RFCs & Hard Problems

Everything before this section was about becoming a contributor: build from source, trace a request, reproduce an issue, write a fix and a test, open a quality PR. That is depth 2 from the Hitchhiker's Guide — and it is a genuine, hireable skill. This section is about the jump to depth 3: driving a hard, cross-cutting change the way maintainers do.

The difference is not "harder bugs". It is a difference in kind. A good-first-issue is bounded: one file, one behavior, one test. The work in this section is unbounded by default: concurrent segment search touches the search threadpool, the collector contract, settings, defaults, and BWC simultaneously; reader/writer separation rewrites how routing, allocation, replication, and remote storage interact. You cannot land changes like these by reading one class. You land them the way the maintainers did: with a public design document, an explicit problem statement, measured trade-offs, and a sequence of PRs that each leave the system shippable.

So this section is built around the real RFCs that produced these features. You will read the actual GitHub issues, follow the design from proposal to merged code, and learn the subsystem well enough to extend it. The facts here are verified and the issue numbers are real — cite them, read them, and treat them as the canonical record of how serious OpenSearch engineering actually happens.

How to Read an OpenSearch RFC

OpenSearch design happens in the open, on GitHub, and it has a recognizable grammar. Learn the grammar and a 200-comment issue becomes navigable.

The label vocabulary

Label	Means	What to expect inside
`RFC`	"Request for Comments" — a proposal opened for community feedback before the design is locked.	A problem statement, one or more proposed approaches, open questions, and a discussion thread. The most valuable thing to read.
`META`	An umbrella issue tracking a multi-PR initiative.	A checklist of sub-issues and PRs, a status table, links out to the RFC and design docs. Your map of the whole effort.
`proposal` / `[Proposal]`	A concrete design, often following an accepted RFC.	The chosen approach in detail: interfaces, settings, phasing, BWC plan.
`enhancement`	A scoped feature or improvement.	Smaller than an RFC; often the unit a single PR delivers.
`Roadmap`	Tracked on the project roadmap board.	High-level intent; links down to RFCs/METAs.
`good first issue` / `help wanted`	Explicitly open for new contributors.	A bounded task, sometimes with a starting pointer. Your on-ramp into a theme.

The lifecycle

flowchart LR
    Issue["issue: a real problem / pain point"] --> RFC["[RFC]: proposed approaches, community feedback"]
    RFC --> Review["community-meeting review + maintainer buy-in"]
    Review --> Design["[Proposal]/design: chosen interfaces, settings, phasing"]
    Design --> META["[META]: checklist of sub-issues + PRs"]
    META --> PRs["PRs: each leaves the system shippable"]
    PRs --> Flag["feature flag / experimental setting"]
    Flag --> Default["promoted to default in a release (e.g. 3.0)"]

Read an initiative in that order, not chronologically by comment. Start at the META to see the shape, drop into the RFC for the why and the rejected alternatives, skim the Proposal for the what, then read the PRs for the how. Big features land behind a feature flag or experimental setting first, soak, and only later flip to default — concurrent segment search becoming default-on in 3.0 is the canonical example.

Note: Many of these designs were reviewed live in an OpenSearch community meeting. When an RFC says "discussed in the community meeting", that is where the real decision was socialized. You will not see those minutes in the issue, but the outcome always lands back as a comment. Read for the resolution, not just the debate.

The Themes Covered

Each chapter takes one theme from problem statement to code, anchored on a real meta/RFC issue. Read them as a progression from "search a shard faster" up to "re-architect how reads and writes scale".

Theme	Chapter	The hard problem	Anchor issue
Vector search & vectorization	the Lucene + k-NN sections; cataloged here	ANN recall vs latency vs memory; SIMD; GPU index build.	k-NN #2605 (META: new vector engine), k-NN #2293 (RFC: GPU)
Concurrent search	Concurrent Segment Search	Search a shard's segments in parallel without breaking the collector contract.	search threadpool + `CollectorManager`; default-on in 3.0
Sharding / scaling	Sharding, Scaling & Reader/Writer Separation	Scale reads independently of writes; scale to zero.	#7258 (RFC), #15306 (META)
Durability / remote store	Remote-Backed Storage & Durability	Decouple durability from replica count using an object store.	underpins RW-separation; pairs with segment replication
Distributed load / backpressure	Backpressure & Admission Control	Reject smartly per-shard to prevent cascading failure.	#1446 (META), #478 (META)
Aggregation acceleration	Star-Tree Indexes	Precompute multi-field aggregations with bounded latency.	#12498 (RFC), #13875 (META)
Caching	Tiered Caching	Spill the request cache to disk when heap is exhausted.	#10024 (Proposal), #11464 (META: benchmark)

The full, link-dense catalog of every real issue, RFC, and PR — grouped by theme, with a "how to start" for each — is the next chapter: Real Issues, RFCs, and Where to Contribute. That is the single highest-value page in this section if your goal is to contribute.

Reading Order

Real Issues, RFCs, and Where to Contribute — the catalog. Read it first to see the whole landscape and pick a theme that excites you. Bookmark it; you will return constantly.
Concurrent Segment Search — the most self-contained theme, and a clean example of an RFC that became a default. Best first deep read.
Backpressure & Admission Control — a distributed-systems classic (smart rejection) with a tight blast radius.
Star-Tree Indexes and Tiered Caching — two acceleration/memory features that build on subsystems you already know (aggregations, circuit breakers).
Remote-Backed Storage & Durability then Sharding, Scaling & Reader/Writer Separation — the two most cross-cutting initiatives; read remote store first because RW separation is built on top of it.

Note: None of these require you to have memorized the others, but they share foundations covered earlier. If a chapter references segment replication, shard allocation, or circuit breakers and you feel shaky, the linked deep dive is the prerequisite — go read it, then come back.

How These Connect to the Capstone

This section is the bridge to the Capstone Projects. Each project is a portfolio-grade contribution rooted in a theme here:

Capstone project	Grows out of
Project 1: k-NN disk quantization	vector search / quantization
Project 2: vector-aware allocation decider	sharding/scaling + k-NN memory
Project 3: concurrent search slicing	Concurrent Segment Search
Project 4: upstream Lucene HNSW	HNSW in Lucene
Project 5: extend star-tree	Star-Tree Indexes
Project 6: k-NN benchmark harness	vector benchmarking (k-NN #2595)
Project 7: search backpressure signal	Backpressure & Admission Control
Project 8: segrep/RW-separation observability	remote store + replication

The path is deliberate: read a theme here, find a real sub-task in the catalog, and let the matching capstone project turn it into something you can put your name on.

Start with the catalog of real issues and RFCs, then read Concurrent Segment Search.

Real Issues, RFCs, and Where to Contribute

This is the catalog — the page you bookmark. Everything else in Engineering at Scale explains how a hard change is made; this page tells you which changes are open, what each one teaches you, and how to take a first bite. Every issue, RFC, and PR below is real and verified, cited with a full GitHub URL so you can open it right now and read the actual thread.

The point is not to memorize numbers. The point is that contribution is a search problem, and this gives you the seed set: a curated map of the live frontier in vector search, sharding, aggregation acceleration, caching, and backpressure — grouped so that picking a theme tells you which subsystem you are about to learn. The threads themselves are the syllabus. Read them.

How to read each table. Description tells you what the issue is about. Exercises names the subsystem and skills it builds. How to start gives you what to read first and a good-first sub-task pattern — the shape of a bounded contribution you could carve out without boiling the ocean.

Theme 1 — Vector Search & GPU (the k-NN plugin)

The vector engine is the most active frontier in OpenSearch, and the most employable thing in this curriculum. Start at the meta issue, then pick a sub-task that matches your appetite for native code.

Issue / RFC	Description	Exercises	How to start
k-NN #2605 — [META] Supporting New Vector Engine	The umbrella for evolving the k-NN engine abstraction. Your map of the whole vector-engine effort.	the `KNNEngine` abstraction, codec, faiss/lucene/nmslib boundaries	Read it top-to-bottom as a table of contents. Read the k-NN architecture and engines chapters alongside it. Pick a linked sub-issue that is `good first issue`.
k-NN #2293 — [RFC] Boosting Vector Engine Performance using GPUs	Offload graph construction to NVIDIA cuVS / the CAGRA algorithm (FP32). The cutting edge.	native build, faiss/cuVS, index-build offload, FP32 graphs	Read the RFC for the why (build is CPU-bound at scale). A bounded first task: improve the docs/benchmark methodology, or a CPU-side glue task in `jni/`. Pair with native JNI & memory.
k-NN #2294 — [RFC] Remote Vector Index Build	Build a segment's vector index on a remote fleet, not the indexing node.	remote build protocol, segment upload, the codec write path	Read with #2293 and #2391; they are one story. Start by tracing where the local build happens today (the KNN codec's `flush`/`merge`).
k-NN #2391 — [Meta] Remote Vector Index Build Component	The META that breaks #2294 into shippable pieces.	component decomposition, client/build-service split	This is your sub-task checklist for remote build. Find an unchecked box that is client-side (no GPU needed) and claim it.
k-NN #2595 — Benchmarking	Benchmark recall/latency/build-time for the new engines.	recall@k measurement, latency percentiles, datasets	The best first contribution in this theme: no production-code risk, high value. Pair with the benchmark-harness capstone and the recall/latency lab.
k-NN #1582 — Investigate rearchitecture of the native-memory circuit breaker	The faiss/nmslib graphs live in native memory off-heap; the breaker that caps them needs rework.	`NativeMemoryCacheManager`, the off-heap circuit breaker, guava cache	Read native JNI & memory first. Reproduce the current breaker behavior with `GET /_plugins/_knn/stats` under load, then propose a measurement or a small invariant fix.
k-NN #585 — a real circuit-breaker config bug	A concrete bug in the native-memory circuit-breaker configuration.	settings plumbing, the off-heap breaker, regression tests	A genuine bug-shaped task. Reproduce it, write the failing test first, then fix. This is the cleanest "real fix" entry point in the k-NN repo.

Sub-task pattern for this theme: the safest first contribution is almost always measurement — a benchmark, a stat, a reproduction — because it has no BWC or correctness blast radius but forces you to learn the path. Graduate from measurement to a config/settings fix, then to the codec write path.

Theme 2 — Lucene Vectors (apache/lucene)

OpenSearch bundles a specific Apache Lucene version; the vector formats it inherits come from upstream Lucene, governed by the Apache Software Foundation. Contributing here is the deepest vector work available and a serious credential.

Issue / PR	Description	Exercises	How to start
apache/lucene #12497 — scalar-quantization codec	Introduces a scalar-quantized vectors format (int8) into Lucene.	`KnnVectorsFormat`, scalar quantization, codec SPI	Read the PR diff as a worked example of how a new codec format lands in Lucene. Pair with HNSW in Lucene and the custom-codec lab.
apache/lucene #11613 — LUCENE-10577, int8 SQ origin	The original int8 scalar-quantization work (JIRA `LUCENE-10577`).	quantization math, recall/size trade-off, `VectorSimilarityFunction`	Read it for the origin story of byte-quantized vectors. The JIRA `LUCENE-10577` has the design discussion the PR implements.
`Lucene99ScalarQuantizedVectorsFormat` / `Lucene99HnswScalarQuantizedVectorsFormat`	The shipping SQ formats your bundled Lucene likely uses.	the on-disk `.vec` quantized layout, rescoring	`grep -rn "ScalarQuantized" $LUCENE/lucene/core/src/java/org/apache/lucene/codecs/` to read the format. The SIMD chapter covers the scoring side.
`Lucene104ScalarQuantizedVectorsFormat` / `Lucene104HnswScalarQuantizedVectorsFormat`	Newer formats (custom bits 1/2/4/7/8; Lucene 10.1.0).	configurable-bit quantization, format evolution, BWC of codecs	Diff a `Lucene99` against a `Lucene104` format to see how a codec evolves while staying back-compatible — a core skill for upgrading Lucene inside OpenSearch.

Sub-task pattern for this theme: Lucene runs on Gradle + GitHub PRs (with some history in JIRA LUCENE-NNNN). Build it (./gradlew check, ./gradlew :lucene:core:test, JDK 21), run a single test with -Ptests.seed=..., and start with a test/documentation-labeled issue. The contribute-to-Apache-Lucene lab walks the whole flow. A recurring high-skill OpenSearch task is upgrading the bundled Lucene version — search the OpenSearch repo for "Upgrade to Lucene".

Theme 3 — Sharding & Scaling (reader/writer separation)

The largest re-architecture in modern OpenSearch: scale reads independently of writes, and scale read capacity to zero when idle. Built on remote store. Read the full story in Sharding, Scaling & Reader/Writer Separation.

Issue / PR	Description	Exercises	How to start
#7258 — [RFC] Reader and Writer Separation	The founding RFC: why separate read and write scaling at all.	the motivation, the constraints remote store imposes	Read this first — it frames the whole initiative.
#14596 — [RFC] Indexing and Search Separation	The refined separation proposal.	routing/allocation changes, search replicas	Read after #7258 as the evolved design.
#15237 — [Proposal] Design	The concrete design: interfaces and phasing.	`ShardRouting` roles, allocation deciders	Skim for the what: which classes change. Cross-check against shard allocation.
#15306 — [META] Reader/Writer Separation	The META: the checklist of sub-issues + PRs.	initiative decomposition	Your map. Find an unchecked, scoped sub-issue.
#16720 — Scale to Zero	Scale search replicas to zero when there is no read traffic.	search-replica lifecycle, cold-start	Read as the payoff feature of the whole effort.
#17299 — implementation PR	A concrete implementation PR for the separation work.	the real code: routing, allocation, recovery	Read the diff to see how the design becomes code. The best way to learn the change surface.
#17763 — multi-writer detection RFC	Detect/forbid two writers for one shard under separation.	safety invariants, primary terms, fencing	A distributed-correctness gem. Read with remote store & durability.

Sub-task pattern for this theme: this is not a place to start with code. Start by being able to explain the change surface — write the summary the META is missing, or improve a test that exercises search replicas. Then find a scoped allocation/routing sub-issue. The vector-aware-allocation capstone builds on this terrain.

Theme 4 — Aggregation Acceleration (star-tree)

Precompute multi-field aggregations into a composite index for bounded-latency analytics. Full treatment in Star-Tree Indexes.

Issue	Description	Exercises	How to start
#12498 — [RFC] Pre-Compute Aggregations with Star Tree	The founding RFC for the star-tree composite index.	the star-tree data structure, dimension/metric model	Read for the core idea: trade index-time work + space for query-time speed.
#13875 — [META] Star Tree Index issue list	The META: every star-tree sub-task.	initiative decomposition, supported aggregations	Your checklist. Many sub-issues add support for one more aggregation type — an ideal bounded task.
#14871 — [Star Tree][Search][RFC] resolve aggregation via star tree	How the search path decides to use the star-tree.	the agg → star-tree resolution, query rewriting	Read with aggregations. The resolution logic is where a new contributor can add a supported case.

Sub-task pattern for this theme: the META (#13875) typically tracks "support aggregation X on the star-tree". Pick one unsupported aggregation, add the mapping/validation + the resolution case + a test. The extend-star-tree capstone is exactly this.

Theme 5 — Tiered Caching

Spill the request cache to an on-disk tier when the on-heap tier is full. Full treatment in Tiered Caching.

Issue	Description	Exercises	How to start
#10024 — [Proposal] Tiered caching	The design: on-heap → disk spillover via `TieredSpilloverCache`.	the pluggable `ICache`, the cache hierarchy, serialization	Read the proposal, then read tiered caching and circuit breakers & memory.
#11464 — [META] perf benchmark	The performance benchmark META for tiered caching.	cache hit-ratio measurement, disk-tier latency	A measurement-first entry: help validate the disk-tier trade-off. No production-code risk.

Sub-task pattern for this theme: caching is settings-heavy. A great first task is improving the observability — a stat or a clearer setting validation — for the disk tier, then a serialization edge case for a key/value type.
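
The spillover idea and the kind of stat this sub-task pattern asks for can be sketched in a few lines. This is a minimal model under stated assumptions: `SpilloverCacheSketch` is a hypothetical name, it does not use the real `TieredSpilloverCache` or `ICache` API, the "disk" tier is a plain in-memory map standing in for an on-disk store, and the choice not to promote disk hits back to the heap tier is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not the real TieredSpilloverCache or ICache API.
// The "disk" tier is a plain map standing in for an on-disk store.
final class SpilloverCacheSketch<K, V> {
    private final Map<K, V> diskTier = new HashMap<>();
    private final LinkedHashMap<K, V> heapTier;

    // Per-tier counters: the observability a first contribution might add.
    long heapHits = 0;
    long diskHits = 0;
    long misses = 0;

    SpilloverCacheSketch(int heapCapacity) {
        // An access-ordered LinkedHashMap yields least-recently-used eviction.
        this.heapTier = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > heapCapacity) {
                    // Spill the evicted entry to the lower tier instead of dropping it.
                    diskTier.put(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    void put(K key, V value) {
        diskTier.remove(key); // keep a key in at most one tier
        heapTier.put(key, value);
    }

    V get(K key) {
        V value = heapTier.get(key);
        if (value != null) {
            heapHits++;
            return value;
        }
        value = diskTier.get(key);
        if (value != null) {
            diskHits++; // assumption: disk hits are served in place, not promoted
            return value;
        }
        misses++;
        return null;
    }
}
```

With a heap capacity of 2, inserting three keys spills the least recently used one, and a later read of that key shows up as a disk hit rather than a miss. Comparing `heapHits`, `diskHits`, and `misses` is the hit-ratio measurement the benchmark META is about.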

Theme 6 — Distributed Load / Backpressure

Smart, per-shard rejection to prevent cascading failure. Full treatment in Backpressure & Admission Control.

Issue / PR	Description	Exercises	How to start
#1446 — [Meta] Indexing Backpressure	The umbrella for indexing backpressure.	outstanding-bytes tracking, rejection policy	Read as the map of the whole indexing-pressure effort.
#478 — [Meta] Shard level Indexing Back-Pressure	The shard-level smart-rejection design.	per-shard tracking, secondary-parameter rejection	Read for the core mechanism: reject the right shard, not the whole node.
#1336 — Shard Indexing Pressure (impl PR)	The implementation PR for shard indexing pressure.	the real tracking code, the rejection path	Read the diff. This is where the design above becomes `ShardIndexingPressure`.

Sub-task pattern for this theme: reproduce a rejection under synthetic load, then improve the signal — a clearer rejection reason, a new stat, a test for a secondary-parameter case. The search-backpressure capstone adds a new signal on the search side.
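
The core mechanism, "reject the right shard, not the whole node", can be sketched as per-shard accounting of outstanding bytes. This is an illustrative model only: `ShardPressureSketch` and its methods are hypothetical names, it is not the real `ShardIndexingPressure` code, and it models only a fixed per-shard byte limit, not the secondary-parameter checks the design describes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the real ShardIndexingPressure implementation.
final class ShardPressureSketch {
    private final long perShardLimitBytes;
    private final Map<String, Long> outstandingBytes = new HashMap<>();

    ShardPressureSketch(long perShardLimitBytes) {
        if (perShardLimitBytes <= 0) {
            throw new IllegalArgumentException("perShardLimitBytes must be > 0");
        }
        this.perShardLimitBytes = perShardLimitBytes;
    }

    /** Admits a request and tracks its bytes, or rejects it if this shard would exceed its limit. */
    synchronized boolean tryAcquire(String shardId, long requestBytes) {
        long current = outstandingBytes.getOrDefault(shardId, 0L);
        if (current + requestBytes > perShardLimitBytes) {
            return false; // only this shard is rejected; other shards are unaffected
        }
        outstandingBytes.put(shardId, current + requestBytes);
        return true;
    }

    /** Releases the bytes of a completed request. */
    synchronized void release(String shardId, long requestBytes) {
        long current = outstandingBytes.getOrDefault(shardId, 0L);
        outstandingBytes.put(shardId, Math.max(0L, current - requestBytes));
    }
}
```

A hot shard that has 80 of its 100 allowed bytes in flight rejects a further 30-byte request, while a request of the same size to a different shard on the same node is admitted.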

Finding CURRENT Work (the durable skill)

Issue numbers go stale; search queries do not. The most important skill in this chapter is finding live work yourself. Use GitHub's issue search (web UI or gh issue list). These queries are real and current.

OpenSearch core (`opensearch-project/OpenSearch`)

# Good first issues, newest first.
gh issue list -R opensearch-project/OpenSearch \
  --label "good first issue" --state open --limit 50

# Explicitly wants help.
gh issue list -R opensearch-project/OpenSearch --label "help wanted" --state open

# Open RFCs — the design frontier.
gh issue list -R opensearch-project/OpenSearch --label "RFC" --state open

# Roadmap-tracked and enhancements in a theme you care about.
gh issue list -R opensearch-project/OpenSearch --label "Roadmap" --state open
gh issue list -R opensearch-project/OpenSearch --label "enhancement" --search "star tree"

Web equivalents (paste into the GitHub search bar):

repo:opensearch-project/OpenSearch is:issue is:open label:"good first issue"
repo:opensearch-project/OpenSearch is:issue is:open label:"help wanted"
repo:opensearch-project/OpenSearch is:issue is:open label:"RFC"
repo:opensearch-project/OpenSearch is:issue is:open label:"enhancement" "concurrent segment search"
repo:opensearch-project/OpenSearch is:issue is:open "Upgrade to Lucene"

k-NN plugin (`opensearch-project/k-NN`)

gh issue list -R opensearch-project/k-NN --label "good first issue" --state open
gh issue list -R opensearch-project/k-NN --label "help wanted" --state open
gh issue list -R opensearch-project/k-NN --label "RFC" --state open
gh issue list -R opensearch-project/k-NN --search "GPU OR cuVS OR remote build"

Apache Lucene (`apache/lucene`)

gh issue list -R apache/lucene --label "good first issue" --state open
gh issue list -R apache/lucene --search "vector OR HNSW OR quantization is:open"
# Lucene's older design history lives in JIRA (LUCENE-NNNN); GitHub issues and PRs often link back to it.

Note: Always check whether an issue is already claimed (someone commented "I am working on this") before you start, and comment to claim it yourself. The contributor mindset section covers this etiquette; jumping a claimed issue is the fastest way to waste your own work.

How to Turn an RFC Into Your Contribution

A real workflow, the way maintainers expect it:

Pick a theme, not an issue. Choose from the six above based on what you want to learn. The theme decides the subsystem you will master.
Read the META top-down, then the RFC for the why. Use the RFC-reading grammar. Find the rejected alternatives — they tell you the real constraints.
Carve out a bounded sub-task. The whole RFC is not yours to land. Find the smallest piece that is independently shippable: a benchmark, a stat, one supported aggregation, one settings fix, one test. Match it to the "sub-task pattern" for that theme above.
Reproduce before you build. For a bug, write the failing test first (capstone step 2). For a feature, write the benchmark that proves it is needed.
Claim it in public. Comment on the issue with your plan and scope. This is the design-via-GitHub skill.
Land it the maintainer way. A focused PR that leaves the system shippable, with tests and a clear description linking the issue. See PR quality and the capstone PR step.
Defend the trade-offs. When a reviewer pushes back on latency, memory, or BWC, you should already have the numbers — because you started with a benchmark. See responding to feedback.

That sequence — theme → META → bounded sub-task → reproduce → claim → ship → defend — is the entire job at depth 3, compressed. Every chapter in this section is a worked example of it.

Next: Concurrent Segment Search — the cleanest example of an RFC that became a default.

Concurrent Segment Search

A shard is a Lucene index, and a Lucene index is a pile of immutable segments. Classically, the query phase walked those segments on one thread, one segment at a time: a single search thread iterated every LeafReaderContext sequentially, collecting hits. On a shard with many segments and idle CPU cores, that is a waste: the segments are independent and could be searched in parallel.

Concurrent segment search does exactly that: it splits a shard's segments into slices and searches the slices in parallel on a dedicated executor, then reduces the per-slice results into one shard-local answer. It is one of the cleanest examples in this curriculum of an RFC that soaked behind a setting and then became a default at the cluster level in 3.0. It is also a clean lesson in Lucene's CollectorManager contract and the latency-vs-CPU trade-off that every parallelism feature must confront.

After this chapter you should be able to: explain what a slice is and how slices map to threads; describe the CollectorManager/Collector contract and why concurrency requires it; find the settings that turn it on and bound slice count; reason about when it helps and when it hurts; and grep your way to the code that owns the slicing decision.

Note: This sits squarely on the OpenSearch/Lucene boundary. The slicing and the IndexSearcher executor are Lucene mechanisms; OpenSearch chooses the executor, the slice strategy, the settings, and the integration with the search execution path.

The problem, precisely

A search request fans out to one copy of each shard (the Hitchhiker's Guide "return trip"). On each shard, the query phase runs the Lucene query and collects the top-K. Without concurrency, that phase is single-threaded per shard, so a shard's query latency is the time to process all of its segments in series, even when the node has spare cores.

The win is highest when:

A shard has many sizable segments (the parallelism has something to chew).
The query is CPU-bound per segment (aggregations, expensive scoring).
The node has idle cores to spend.

The win evaporates — or reverses — when segments are tiny (coordination overhead dominates), when the node is already CPU-saturated (you just added contention), or when the query is cheap (latency was never the bottleneck). Concurrency is not free, and this feature's whole design is about spending CPU to buy latency only when that trade is good.

Slices: the unit of parallelism

A slice is a group of one or more segments (LeafReaderContexts) assigned to a single task. Lucene's IndexSearcher, when given an Executor, partitions its leaves into slices and submits one task per slice. OpenSearch controls how many slices there are and how leaves are grouped.

flowchart TD
    Shard["shard query phase"] --> Slice["compute leaf slices"]
    Slice --> S0["slice 0: seg _0,_1"]
    Slice --> S1["slice 1: seg _2,_3"]
    Slice --> S2["slice 2: seg _4"]
    S0 --> C0["leaf collector 0"]
    S1 --> C1["leaf collector 1"]
    S2 --> C2["leaf collector 2"]
    C0 --> R["CollectorManager.reduce(...)"]
    C1 --> R
    C2 --> R
    R --> Top["shard-local top-K"]

Slice count is the central knob. Too few slices and you leave parallelism on the table; too many and you pay task-scheduling and reduce overhead for nothing. OpenSearch bounds it with a max-slice-count setting and a default strategy that balances segment sizes across slices rather than blindly one-segment-per-slice (which would be terrible for many tiny segments).
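
To make "balances segment sizes across slices" concrete, here is one simple way such a strategy could work: a greedy largest-first assignment bounded by a maximum slice count. This is a sketch under stated assumptions, not OpenSearch's or Lucene's actual slicing code. `SliceBalancer`, `Slice`, and `balance` are hypothetical names, and segment size is approximated by document count.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: SliceBalancer and its members are hypothetical names,
// not OpenSearch or Lucene classes.
final class SliceBalancer {

    /** A slice under construction: the segment sizes assigned to it and their total. */
    static final class Slice {
        final List<Integer> segmentDocCounts = new ArrayList<>();
        long totalDocs = 0;
    }

    /**
     * Greedy balancing: visit segments largest-first and always add the next
     * segment to the slice that currently holds the fewest documents.
     */
    static List<Slice> balance(List<Integer> segmentDocCounts, int maxSliceCount) {
        if (maxSliceCount < 1) {
            throw new IllegalArgumentException("maxSliceCount must be >= 1");
        }
        // Never create more slices than there are segments.
        int sliceCount = Math.min(maxSliceCount, segmentDocCounts.size());
        List<Slice> slices = new ArrayList<>();
        PriorityQueue<Slice> lightestFirst =
            new PriorityQueue<>(Comparator.comparingLong((Slice s) -> s.totalDocs));
        for (int i = 0; i < sliceCount; i++) {
            Slice slice = new Slice();
            slices.add(slice);
            lightestFirst.add(slice);
        }
        List<Integer> largestFirst = new ArrayList<>(segmentDocCounts);
        largestFirst.sort(Comparator.reverseOrder());
        for (int docs : largestFirst) {
            Slice lightest = lightestFirst.poll(); // remove before changing its sort key
            lightest.segmentDocCounts.add(docs);
            lightest.totalDocs += docs;
            lightestFirst.add(lightest);
        }
        return slices;
    }

    public static void main(String[] args) {
        // One large segment plus several small ones, capped at two slices.
        for (Slice slice : balance(List.of(1000, 50, 40, 30, 20, 10), 2)) {
            System.out.println(slice.totalDocs + " docs from segments " + slice.segmentDocCounts);
        }
    }
}
```

The example input also shows the limit of any balancing heuristic: when one segment dominates, the slice that owns it bounds the query's latency no matter how the small segments are grouped, which is the "one slice doing all the work" symptom.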

The `CollectorManager` contract (why this needs a new abstraction)

A Lucene Collector is not thread-safe — it accumulates state (the heap of top hits, the aggregation buckets) as it visits docs. You cannot share one collector across slices running on different threads. The solution is the CollectorManager:

public interface CollectorManager<C extends Collector, T> {
    C newCollector() throws IOException;           // one fresh collector PER slice/thread
    T reduce(Collection<C> cs) throws IOException; // merge the per-slice collectors into one result
}

This is the heart of the feature. Each slice gets its own collector via newCollector(); the slices run in parallel, each collecting independently; then reduce(...) merges them into the shard-local answer. Converting an aggregation or a top-docs collector to be concurrency-safe means expressing it as a CollectorManager with a correct, associative reduce — and that is exactly the work that made many aggregations concurrent-search-compatible (and exactly where the subtle bugs live).

cd ~/src/OpenSearch
# OpenSearch's collector-manager wiring for the query phase.
grep -rln "CollectorManager" server/src/main/java/org/opensearch/search/ | head
grep -rn "newCollector\|public .* reduce" \
  server/src/main/java/org/opensearch/search/query/ | head
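
The shape of the contract is easier to see in a dependency-free analogue. The sketch below mirrors `newCollector()` and `reduce(...)` for a top-K collection in plain Java. It deliberately does not use Lucene's real `Collector` or `CollectorManager` types, and `TopKManager`, `TopKCollector`, and `search` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java analogue of the contract; hypothetical names, no Lucene types.
final class TopKManager {
    private final int k;

    TopKManager(int k) {
        this.k = k;
    }

    /** Per-slice state: a min-heap of the k best scores seen so far. Not thread-safe. */
    final class TopKCollector {
        final PriorityQueue<Float> heap = new PriorityQueue<>();

        void collect(float score) {
            heap.add(score);
            if (heap.size() > k) {
                heap.poll(); // evict the current minimum
            }
        }
    }

    /** Analogue of newCollector(): one fresh, unshared collector per slice. */
    TopKCollector newCollector() {
        return new TopKCollector();
    }

    /** Analogue of reduce(): merge the per-slice heaps into one descending top-k list. */
    List<Float> reduce(Collection<TopKCollector> collectors) {
        List<Float> merged = new ArrayList<>();
        for (TopKCollector collector : collectors) {
            merged.addAll(collector.heap);
        }
        merged.sort(Comparator.reverseOrder());
        return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
    }

    /** Runs each slice as its own task, then reduces once every task has finished. */
    static List<Float> search(List<float[]> slices, int k) {
        TopKManager manager = new TopKManager(k);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, slices.size()));
        try {
            List<Future<TopKCollector>> futures = new ArrayList<>();
            for (float[] slice : slices) {
                TopKCollector collector = manager.newCollector();
                futures.add(pool.submit(() -> {
                    for (float score : slice) {
                        collector.collect(score);
                    }
                    return collector;
                }));
            }
            List<TopKCollector> finished = new ArrayList<>();
            for (Future<TopKCollector> future : futures) {
                try {
                    finished.add(future.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return manager.reduce(finished);
        } finally {
            pool.shutdown();
        }
    }
}
```

No collector is ever touched by two threads: each task owns the collector it was handed, and `reduce` only runs after every `Future` has completed. Sorting the merged candidates makes this particular reduce independent of the order in which slices finish.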

Warning: A reduce that is not associative/commutative is a latent correctness bug. Single-threaded, the segments are visited in a fixed order and the bug hides; concurrent, the order changes and results wobble. This is the single most common class of concurrent-search defect — see the bugs table.

The executor: which threadpool runs the slices

The per-slice tasks run on a dedicated executor so they do not starve other work. OpenSearch wires a search executor into the Lucene IndexSearcher for concurrent mode. Find it and the threadpool it draws from:

# Where the concurrent searcher / executor is constructed.
grep -rn "IndexSearcher\|setExecutor\|sliceExecutor\|SliceExecutor\|index_searcher" \
  server/src/main/java/org/opensearch/search/ | head
# The threadpool registry (look for an index_searcher / search pool).
grep -rn "index_searcher\|ThreadPool.Names" \
  server/src/main/java/org/opensearch/threadpool/ThreadPool.java | head

The relationship to the broader thread model is in threadpools and concurrency. The key point: concurrent segment search consumes a finite pool, which is why turning it on globally changes the CPU economics of the whole node — the trade-off is cluster-wide, not per-query.

The reduce after slices

After all slices finish, CollectorManager.reduce produces one shard-local result. That is distinct from the coordinating-node reduce you already know from aggregations: there are now two reduce levels.

flowchart TD
    subgraph Shard A
      sa0["slice 0"] --> sared["slice reduce"]
      sa1["slice 1"] --> sared
    end
    subgraph Shard B
      sb0["slice 0"] --> sbred["slice reduce"]
      sb1["slice 1"] --> sbred
    end
    sared --> coord["coordinating-node reduce (SearchPhaseController)"]
    sbred --> coord
    coord --> answer["global answer"]

This two-level reduce is why an aggregation must be correct both at the slice level and at the coordinating-node level. An aggregation that was only ever tested with the single-level (coordinating) reduce can be subtly wrong once a second reduce level appears beneath it.
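
An average is the standard small example of why the partial state matters once a reduce can be applied more than once. The sketch below, with hypothetical names and no OpenSearch aggregator classes, keeps a mergeable (sum, count) pair and applies the same reduce at the slice level and again at the coordinating level. It also includes the broken alternative of averaging already-finalized averages. It assumes Java 16 or later for records.

```java
import java.util.List;

// Hypothetical names throughout; this models partial aggregation state,
// not an actual OpenSearch aggregator.
final class TwoLevelAvg {

    /** Mergeable partial state for an average: keep sum and count, never the ratio. */
    record Partial(double sum, long count) {
        static Partial of(double... values) {
            double total = 0;
            for (double v : values) {
                total += v;
            }
            return new Partial(total, values.length);
        }

        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        double value() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    /** The same reduce serves both levels because merge is associative and commutative. */
    static Partial reduce(List<Partial> parts) {
        Partial acc = new Partial(0, 0);
        for (Partial p : parts) {
            acc = acc.merge(p);
        }
        return acc;
    }

    /** The broken alternative: averaging averages ignores how many values each one covers. */
    static double averageOfAverages(List<Double> averages) {
        double total = 0;
        for (double a : averages) {
            total += a;
        }
        return total / averages.size();
    }

    public static void main(String[] args) {
        Partial shardA = reduce(List.of(Partial.of(1, 2, 3), Partial.of(10))); // two slices
        Partial shardB = reduce(List.of(Partial.of(4, 5)));                    // one slice
        System.out.println("two-level partial reduce: " + reduce(List.of(shardA, shardB)).value());
        System.out.println("average of averages:      " + averageOfAverages(List.of(6.0, 4.5)));
    }
}
```

The two printed numbers differ: merging partial states gives the true mean of all six values, while averaging the per-shard averages does not. A test that only ever exercised one reduce level with evenly sized inputs would not catch that.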

Settings and defaults

Setting	Scope	Meaning
`search.concurrent_segment_search.enabled`	cluster / index	Master switch for concurrent segment search. Default-on at the cluster level in 3.0. (Earlier versions: opt-in.)
`search.concurrent.max_slice_count`	cluster / index	Upper bound on slices per shard query. `0` typically means "use the default/auto strategy".
`search.concurrent_segment_search.mode`	cluster / index	In some versions, selects the strategy (`auto`/`all`/`none`) — grep to confirm the exact key in your branch.

# Find the real setting keys and defaults in your checkout (do not trust docs blindly).
grep -rn "concurrent_segment_search\|concurrent.max_slice_count\|CONCURRENT_SEGMENT_SEARCH" \
  server/src/main/java/org/opensearch/ | head -20

# Turn it on/off and bound slices at runtime.
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": {
    "search.concurrent_segment_search.enabled": true,
    "search.concurrent.max_slice_count": 4
  }
}'

Note: "Default-on in 3.0" is a cluster-level default; it can still be disabled per-index or per-cluster. When you benchmark, always record the effective setting, because a regression report that does not state the slice count is unfalsifiable.

The trade-offs, as a decision table

Situation	Concurrent search effect
Many large segments, idle cores, CPU-bound agg	Big latency win — the design case.
Few tiny segments	Loss — task + reduce overhead exceeds the work.
Node already CPU-saturated	Loss — you added contention, not parallelism; throughput drops.
Cheap term query on small shard	Neutral/slight loss — latency was never the bottleneck.
High query concurrency (many simultaneous searches)	Risk — each query now wants multiple threads; the search pool saturates faster.

The honest summary: concurrent segment search trades CPU and throughput for single-query latency. On a lightly loaded cluster with fat segments it is a clear win; on a saturated cluster it can make things worse. This is why the optimizing-slicing capstone is about the slicing heuristic, not the feature's existence.

Trace it yourself

# 1. Create an index, force several segments by indexing in batches with refresh.
curl -s -XPUT 'localhost:9200/csstest' -H 'Content-Type: application/json' -d '
{ "settings": { "number_of_shards": 1, "number_of_replicas": 0 } }'
for b in 1 2 3 4 5; do
  for i in $(seq 1 2000); do printf '{"index":{}}\n{"v":%s,"g":"k%s"}\n' "$((RANDOM))" "$((RANDOM%50))"; done \
    | curl -s -H 'Content-Type: application/x-ndjson' \
        -XPOST 'localhost:9200/csstest/_bulk?refresh=true' --data-binary @- >/dev/null
done
curl -s 'localhost:9200/_cat/segments/csstest?v'   # confirm multiple segments

# 2. Run a CPU-bound agg with concurrency OFF, then ON; compare took.
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"transient":{"search.concurrent_segment_search.enabled":false}}'
curl -s 'localhost:9200/csstest/_search?pretty' -H 'Content-Type: application/json' \
  -d '{"size":0,"aggs":{"g":{"terms":{"field":"g","size":50}}}}' | grep '"took"'

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"transient":{"search.concurrent_segment_search.enabled":true}}'
curl -s 'localhost:9200/csstest/_search?pretty' -H 'Content-Type: application/json' \
  -d '{"size":0,"aggs":{"g":{"terms":{"field":"g","size":50}}}}' | grep '"took"'

# 3. profile=true shows per-slice work in the breakdown.
curl -s 'localhost:9200/csstest/_search?pretty' -H 'Content-Type: application/json' \
  -d '{"profile":true,"size":0,"aggs":{"g":{"terms":{"field":"g"}}}}' | grep -i "slice\|collector" | head

# 4. Read the owning code.
grep -rln "ContextIndexSearcher\|slice\|CollectorManager\|reduce" \
  server/src/main/java/org/opensearch/search/internal/ | head
./gradlew :server:test --tests "*ConcurrentSegmentSearch*" 2>&1 | tail -15
./gradlew :server:internalClusterTest --tests "*ConcurrentSearch*" 2>&1 | tail -15

Common bugs and symptoms

Symptom	Root cause	Where to look
Aggregation results differ run-to-run with concurrency on	Non-associative/non-commutative `reduce` in a `CollectorManager`	the aggregation's `CollectorManager.reduce`; test single- vs multi-slice
`ConcurrentModificationException` / corrupted state under load	A `Collector` shared across slices instead of one `newCollector()` per slice	the collector-manager `newCollector()` contract
Higher latency with concurrency on a busy cluster	Search pool saturation; too many slices per query	`search.concurrent.max_slice_count`; the search threadpool sizing
No speedup despite many segments	Tiny segments grouped into too few slices, or query is I/O- not CPU-bound	the slice strategy; `_cat/segments` sizes
A previously-passing agg test fails only in concurrent mode	The agg was correct only for single-level reduce	add a two-level (slice + coordinating) reduce test
Profile shows one slice doing all the work	Skewed slice sizing (one giant segment)	the leaf-slice balancing logic

Validation: prove you understand this

Define a slice and explain how slices map to threads and to segments. Why is one-segment-per-slice a bad default for many tiny segments?
Write the CollectorManager contract from memory and explain why concurrent search requires it rather than a single shared Collector.
Draw the two reduce levels (slice reduce and coordinating-node reduce) and say which classes own each.
Name the setting that enables the feature and the one that bounds slice count, and state what changed in 3.0.
Give two concrete situations where concurrent segment search hurts, and the metric you would watch to detect each.
Explain why a non-associative reduce is invisible single-threaded but breaks under concurrency.

Next: Backpressure and Admission Control, or the optimizing-slicing capstone.

Sharding, Scaling, and Reader/Writer Separation

The classic OpenSearch scaling model couples two things that workloads rarely want coupled: a replica is simultaneously your durability (a second copy of the data) and your read capacity (another shard to serve queries). If you have a read-heavy workload, you add replicas — and pay to keep every byte indexed twice, on local disk, by nodes that are also doing the write-side work. If your write and read traffic have wildly different shapes (steady ingest, bursty search), you cannot scale them independently. You scale the whole shard or nothing.

Reader/writer separation breaks that coupling. Built on top of remote-backed storage, it introduces search replicas that serve reads by pulling segments from the remote store — without participating in indexing or replication of writes — so you can scale read capacity up and down (even to zero) independently of write capacity. This is the largest re-architecture in modern OpenSearch, and it touches routing, allocation, recovery, and the shard role model at once.

After this chapter you should be able to: size shards from first principles; explain what a search replica is and how it differs from a regular replica; trace how reader/writer separation changes routing and allocation; cite the real RFC trail; and reason about scale-to-zero and multi-writer safety.

Prerequisites. This chapter builds on three deep dives and does not re-derive them: shard allocation, replication (segment replication especially), and remote-backed storage and durability. Read at least the first if shards-across-nodes is fuzzy.

Shard sizing fundamentals (the part everyone gets wrong)

Before any of the new machinery, you must be able to reason about shard count and size, because the most common "OpenSearch is slow / unstable" report is really an oversharding report (you saw this in the Warm-Up's Dataset 5).

Principle	Why
A shard is a Lucene index with fixed overhead.	Every shard costs heap (segment metadata, field data structures), file handles, and a slot in cluster state. 1,000 tiny shards is 1,000 copies of that overhead.
Aim for "right-sized" shards (tens of GB), not many tiny ones.	Merges, recovery, and search fan-out all scale per-shard. Too many shards multiply coordination cost; the search fan-out hits every shard.
Shard count is fixed at index creation (absent reindex/split/shrink).	You cannot trivially change `number_of_shards` later. Time-series uses rollover (Warm-Up Dataset 6) to keep each index right-sized over time.
Replica count is the dial you can turn.	`number_of_replicas` is dynamic. Classically this is the only read-scaling lever, and it is the one reader/writer separation is replacing.

# The settings that bound shard counts and the allocator that places them.
grep -rn "max_shards_per_node\|cluster.max_shards" \
  server/src/main/java/org/opensearch/indices/ShardLimitValidator.java \
  server/src/main/java/org/opensearch/cluster/metadata/ 2>/dev/null | head
grep -rln "class BalancedShardsAllocator" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/
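
The sizing principles in the table reduce to simple arithmetic. The sketch below is a back-of-the-envelope helper, not an OpenSearch API: `ShardSizing` and its methods are hypothetical names, and the 30 GB target used in `main` is only an assumed value inside the "tens of GB" range the table recommends.

```java
// Hypothetical helper for back-of-the-envelope sizing; not an OpenSearch API.
final class ShardSizing {

    /** Primaries needed so that each shard holds at most targetShardGb of data. */
    static int primaryShards(double indexSizeGb, double targetShardGb) {
        if (indexSizeGb < 0 || targetShardGb <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and target must be > 0");
        }
        return Math.max(1, (int) Math.ceil(indexSizeGb / targetShardGb));
    }

    /** Every replica is a full extra copy of every primary, so fixed overhead multiplies. */
    static int totalShardCopies(int primaries, int replicas) {
        return primaries * (1 + replicas);
    }

    public static void main(String[] args) {
        int primaries = primaryShards(600, 30); // assumed 30 GB target shard size
        System.out.println("primaries: " + primaries);
        System.out.println("copies with 1 replica: " + totalShardCopies(primaries, 1));
        System.out.println("copies with 2 replicas: " + totalShardCopies(primaries, 2));
    }
}
```

The second method is the coupling in numbers: raising the replica count to gain read capacity also raises the number of full shard copies the cluster must store and keep current.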

The whole motivation for reader/writer separation is that replica count is a bad read-scaling dial because it also drives durability and write-side cost. The new model gives you a dial that scales only reads.

The coupling problem, made concrete

flowchart TD
    subgraph "Classic model"
      P1["primary (writes + reads + durability)"] --> R1["replica (reads + durability + applies writes)"]
      P1 --> R2["replica (reads + durability + applies writes)"]
    end
    Note["add a replica → +read capacity AND +durability AND +write work AND +local disk"]

Each classic replica must keep up with indexing (apply every write, via document or segment replication), stores a full copy on local disk, and counts toward durability. To get more read throughput you are forced to also buy durability and write-side work you may not need. Reader/writer separation severs read capacity from all three.

Search replicas: read capacity, decoupled

A search replica is a shard copy whose job is only to serve reads. It does not accept writes, does not participate in the in-sync replication group, and does not provide the primary's durability — instead it gets its data from the remote store, pulling the segments the writers uploaded.

flowchart TD
    W["writer (primary): indexes, uploads segments + translog"] --> RS[("remote store (object store)")]
    RS --> SR1["search replica 1 (reads only, pulls segments)"]
    RS --> SR2["search replica 2 (reads only)"]
    RS --> SR0["search replicas → can scale to 0"]
    Q["search request"] --> SR1
    Q --> SR2

Property	Classic replica	Search replica
Accepts writes	applies replicated writes	never
Source of data	from primary (peer recovery / segrep)	from remote store
Provides durability	yes (in-sync copy)	no (durability is the remote store)
Counts for write acks	yes	no
Scales independently of writers	no	yes — including to zero
Lifecycle	tied to primary's replication group	independent; can be added/removed freely

Because a search replica's data comes from the remote store rather than from a live primary, you can add and remove them freely — and crucially, you can have zero of them when there is no read traffic, which is the scale-to-zero feature.

What changes in routing and allocation

This is the part a contributor must understand to work in this area. Search replicas are a new shard role, and that role threads through routing and allocation:

ShardRouting gains a role/role-aware state. A shard copy is now "writer primary", "writer replica", or "search replica" — search routing must send read traffic to search replicas (and writers, depending on config) while write routing goes only to writers.
Allocation deciders become role-aware. The allocator must place search replicas on nodes that can pull from the remote store, must not count them as in-sync copies, and must allow their count to drop to zero without turning the index red.
Recovery for a search replica is a remote-store recovery, not a peer recovery — it hydrates from the object store (remote store & durability).

# The shard routing / role model — grep to find the real role enum in your branch.
grep -rn "SearchReplica\|search_replica\|ShardRouting.Role\|isSearchOnly\|searchOnly" \
  server/src/main/java/org/opensearch/cluster/routing/ | head -20
# Allocation deciders that must become role-aware.
ls server/src/main/java/org/opensearch/cluster/routing/allocation/decider/
# Where search routing chooses copies.
grep -rn "class OperationRouting\|activeShards\|searchShards\|preferredShard" \
  server/src/main/java/org/opensearch/cluster/routing/ | head

Note: Because this is an evolving initiative, exact class and enum names shift between releases. Trust the grep, not a memorized name. The META issue is the source of truth for the current decomposition.
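
As a mental model for what "role-aware" means in routing and health, consider the following sketch. Every name in it (`RoleAwareRouting`, `CopyRole`, `ShardCopy`) is hypothetical and chosen for illustration; it is not the role enum or the routing code in the OpenSearch tree, and the writer-fallback flag is an assumed stand-in for the "depending on config" behavior described above. It assumes Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: these are illustrative names, not the classes or enum
// in the OpenSearch source tree.
final class RoleAwareRouting {

    enum CopyRole { WRITER_PRIMARY, WRITER_REPLICA, SEARCH_REPLICA }

    record ShardCopy(String nodeId, CopyRole role) {}

    /** Writes are routed to writers only; a search replica never receives them. */
    static List<ShardCopy> writeTargets(List<ShardCopy> copies) {
        List<ShardCopy> targets = new ArrayList<>();
        for (ShardCopy copy : copies) {
            if (copy.role() != CopyRole.SEARCH_REPLICA) {
                targets.add(copy);
            }
        }
        return targets;
    }

    /**
     * Reads prefer search replicas. Falling back to writers when none exist is
     * an assumed, configurable policy in this sketch.
     */
    static List<ShardCopy> searchTargets(List<ShardCopy> copies, boolean allowWriterFallback) {
        List<ShardCopy> searchReplicas = new ArrayList<>();
        for (ShardCopy copy : copies) {
            if (copy.role() == CopyRole.SEARCH_REPLICA) {
                searchReplicas.add(copy);
            }
        }
        if (!searchReplicas.isEmpty() || !allowWriterFallback) {
            return searchReplicas;
        }
        return writeTargets(copies);
    }

    /** Health depends on the writer side only, so zero search replicas is not "red". */
    static boolean hasActivePrimary(List<ShardCopy> copies) {
        return copies.stream().anyMatch(copy -> copy.role() == CopyRole.WRITER_PRIMARY);
    }
}
```

Two of the bugs in this area map directly onto this model: routing that ignores the role sends reads only to writers, and a health check that counts search replicas as required turns the index red when their count drops to zero.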

The RFC trail (read these in order)

This is a textbook public design effort. Read it the way Engineering at Scale teaches: META first for the shape, RFC for the why, proposal for the what, PR for the how.

Issue / PR	Role in the story
#7258 — [RFC] Reader and Writer Separation	The founding "why": decouple read and write scaling.
#14596 — [RFC] Indexing and Search Separation	The refined separation proposal.
#15237 — [Proposal] Design	Concrete interfaces, phasing, the role model.
#15306 — [META] Reader/Writer Separation	The checklist of sub-issues + PRs — your map.
#16720 — Scale to Zero	The payoff: drive search replicas to zero when idle.
#17299 — implementation PR	Read the diff to see routing/allocation actually change.
#17763 — multi-writer detection RFC	The safety invariant: never two writers for one shard.

The multi-writer detection thread (#17763) is the distributed-correctness keystone: once durability lives in the remote store and writers are decoupled from readers, you must guarantee that two nodes never both believe they are the writer for a shard and both upload conflicting segments. That fencing problem is covered from the storage side in remote store & durability.
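
The invariant can be illustrated with a small fencing check keyed on the primary term. This is a sketch of the general fencing technique only, under the assumption that every upload carries the writer's primary term. `FencedRemoteStore` and `tryUpload` are hypothetical names, and the actual design discussed in #17763 may differ.

```java
// Hypothetical sketch of term-based fencing; not an OpenSearch class, and the
// real design in #17763 may differ.
final class FencedRemoteStore {
    private long highestTermSeen = 0;
    private String currentWriter = null;

    /**
     * Accepts an upload only from the writer that holds the newest primary term.
     * A stale writer (lower term) is rejected, and so is a second node that
     * claims a term already held by another node.
     */
    synchronized boolean tryUpload(String nodeId, long primaryTerm) {
        if (primaryTerm < highestTermSeen) {
            return false; // stale writer: a newer primary has already uploaded
        }
        if (primaryTerm == highestTermSeen && currentWriter != null && !currentWriter.equals(nodeId)) {
            return false; // two nodes claim the same term: refuse the second
        }
        highestTermSeen = primaryTerm;
        currentWriter = nodeId;
        return true;
    }
}
```

After node B uploads under term 2, a late upload from the old primary A under term 1 is rejected, so a writer that has been superseded without knowing it cannot overwrite newer segments.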

Scale-to-zero, concretely

The endgame: an index that is written to but rarely read keeps zero search replicas, spending nothing on read capacity, and spins search replicas up only when queries arrive (cold start), hydrating them from the remote store. The writer keeps indexing the whole time; durability is never at risk because it lives in the object store, not in any replica count.

no read traffic      →  search replicas = 0   (read cost ≈ 0; writer still indexing)
read traffic arrives →  allocate search replica(s); recover from remote store
read traffic falls   →  scale search replicas back toward 0

This only works because durability was first decoupled from replicas by remote-backed storage. That is why the reading order in the section index puts remote store before this chapter.
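
The three states above amount to a function from read traffic to a desired search-replica count. The sketch below shows one possible policy, assuming a known per-replica capacity and an upper bound. `SearchReplicaScaler`, its method, and both parameters are hypothetical; the source does not specify how OpenSearch or an external controller makes this decision.

```java
// Hypothetical scaling policy; the names and parameters are assumptions made
// for illustration, not an OpenSearch API.
final class SearchReplicaScaler {

    /**
     * Returns how many search replicas to run for the observed read traffic.
     * Zero traffic means zero replicas; otherwise enough replicas to cover the
     * load, capped at maxReplicas.
     */
    static int desiredSearchReplicas(double readQps, double qpsPerReplica, int maxReplicas) {
        if (qpsPerReplica <= 0 || maxReplicas < 0) {
            throw new IllegalArgumentException("qpsPerReplica must be > 0 and maxReplicas >= 0");
        }
        if (readQps <= 0) {
            return 0; // scale to zero: the writer and the remote store carry on alone
        }
        int needed = (int) Math.ceil(readQps / qpsPerReplica);
        return Math.min(needed, maxReplicas);
    }

    public static void main(String[] args) {
        System.out.println(desiredSearchReplicas(0, 100, 5));   // idle
        System.out.println(desiredSearchReplicas(250, 100, 5)); // traffic arrives
        System.out.println(desiredSearchReplicas(20, 100, 5));  // traffic falls
    }
}
```

The function never has to consider durability, which is the point: the replica count it returns affects read capacity and cold-start latency only.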

Try it / read it

# 1. Inspect the classic dials so you can feel the coupling.
curl -s -XPUT 'localhost:9200/scaling-demo' -H 'Content-Type: application/json' \
  -d '{"settings":{"number_of_shards":2,"number_of_replicas":1}}'
curl -s 'localhost:9200/_cat/shards/scaling-demo?v'        # primaries + replicas
curl -s -XPUT 'localhost:9200/scaling-demo/_settings' -H 'Content-Type: application/json' \
  -d '{"index":{"number_of_replicas":2}}'                  # the only classic read dial
curl -s 'localhost:9200/_cat/shards/scaling-demo?v'        # +read capacity AND +durability

# 2. Find search-replica settings/roles in your checkout (names vary by version).
grep -rn "number_of_search_replicas\|search_replicas\|SEARCH_REPLICA\|SearchReplica" \
  server/src/main/java/org/opensearch/ | head -20

# 3. Read the allocation + recovery surface that the initiative changes.
grep -rln "class AllocationService" server/src/main/java/org/opensearch/cluster/routing/allocation/
grep -rln "RemoteStoreRecovery\|remoteStore" server/src/main/java/org/opensearch/index/shard/ | head

# 4. Tests that exercise search replicas / separation.
./gradlew :server:internalClusterTest --tests "*SearchReplica*" 2>&1 | tail -15
./gradlew :server:test --tests "*SearchReplicaAllocation*" 2>&1 | tail -15

Common bugs and symptoms

Symptom	Root cause	Where to look
Search replica counted as an in-sync copy / blocks write acks	Role not threaded through the replication group	`ReplicationTracker`; the in-sync set; `ShardRouting` role
Index goes red when search replicas drop to zero	Allocator treats search replica as required for health	role-aware health/allocation deciders
Read traffic served stale data after a write	Search replica's remote-store pull lagging	remote-store refresh/upload cadence; remote store
Two writers upload conflicting segments	Missing multi-writer fencing	#17763; primary-term fencing on upload
Oversharding: cluster slow/unstable with tiny shards	Too many shards (classic sizing error)	shard count at creation; rollover; `BalancedShardsAllocator`
Search routing sends reads only to writers	`OperationRouting` not role-aware	`OperationRouting` copy selection

Validation: prove you understand this

Explain the classic coupling: name the three things a regular replica provides at once, and why that makes it a bad read-scaling dial.
Define a search replica and contrast it with a regular replica on data source, durability, write participation, and independent scaling.
Name the three places the role model threads through (routing, allocation, recovery) and what changes in each.
Walk the RFC trail from #7258 to #17299 and say what each step contributes.
Explain why scale-to-zero is only possible because of remote-backed storage.
State the multi-writer safety invariant (#17763) and why it becomes critical once durability lives in the remote store.

Next: Remote-Backed Storage and Durability — the foundation this whole chapter stands on. Then the vector-aware-allocation capstone.

Remote-Backed Storage and Durability

In the classic OpenSearch durability model, "your data is safe" means "there is more than one copy on more than one node's local disk." Durability is replica count. That coupling — examined in Sharding, Scaling & Reader/Writer Separation — is exactly what makes independent read/write scaling impossible, and it has a second cost: a node failure can still lose unflushed data if you were not careful, and recovery means copying segments node-to-node.

Remote-backed storage moves durability off the cluster entirely and onto a remote object store (S3 or equivalent). The writer uploads its Lucene segments and its translog to that remote backend; the remote copy is now the source of truth for durability. A node can die, a shard can move, a whole cluster can be rebuilt — and the data is recovered from the object store, not from a surviving replica. This single change is what makes search replicas, scale-to-zero, and reader/writer separation possible.

After this chapter you should be able to: explain how remote store decouples durability from replica count; describe the segment-and-translog upload path and how it interacts with segment replication; trace recovery from the remote store; explain how remote store underpins reader/writer separation; and reason about the multi-writer safety problem.

Prerequisites. This builds directly on the translog, refresh/flush/merge, and replication (segment replication especially). It is the storage foundation under sharding & scaling.

The idea: durability is a remote object, not a replica

flowchart TD
    subgraph Node["writer node"]
      IW["IndexWriter writes segments locally"] --> Local[("local disk (cache)")]
      TL["Translog appends ops"] --> Local
    end
    Local -->|"upload on refresh/flush"| RS[("remote store: segments")]
    TL -->|"upload on write"| RST[("remote store: translog")]
    RS --> Recover["recovery / new copy: download from remote"]
    RST --> Recover

The local disk becomes, in effect, a cache of what is durably stored in the remote object store. Two upload streams keep the remote copy current:

Segments are uploaded after they are produced (on refresh/flush, when Lucene writes new immutable segment files).
The translog is uploaded as operations arrive, so the window between the last segment upload and "now" is also durable remotely.

With both streams, and with a durability policy that uploads the translog before a write is acknowledged, the remote store holds enough to reconstruct the shard to its last acknowledged write without any local replica. That is the whole point: durability no longer requires a second node holding the data.

cd ~/src/OpenSearch
# The remote-store directory + upload machinery (names vary by version — grep).
grep -rln "RemoteSegmentStoreDirectory\|RemoteStoreRefreshListener\|RemoteFsTranslog\|RemoteStoreEnums" \
  server/src/main/java/org/opensearch/index/ | head
ls server/src/main/java/org/opensearch/index/store/ 2>/dev/null | grep -i remote
ls server/src/main/java/org/opensearch/index/translog/ 2>/dev/null | grep -i remote

The upload path: segments and translog

Two upload paths do the work: a refresh listener for segments and a remote translog for operations. Understanding their cadence is the key to reasoning about the durability window and the staleness a reader might see.

Stream	Trigger	Class (grep to confirm)	Durability meaning
Segment upload	after refresh/flush produces new segments	a `RemoteStoreRefreshListener` driving a `RemoteSegmentStoreDirectory`	makes the committed/refreshed state durable remotely and available to search replicas
Translog upload	per write (or per durability policy)	a remote translog (`RemoteFsTranslog`-style)	makes the uncommitted tail durable remotely, closing the gap between segment uploads

flowchart LR
    Op["index op"] --> TLloc["local translog"]
    TLloc --> TLrem["upload to remote translog"]
    Op --> Buf["IndexWriter buffer"]
    Buf -->|refresh| Seg["new segment files"]
    Seg --> Segrem["upload to remote segment store"]
    TLrem --> Durable["acked write is durable remotely"]
    Segrem --> Searchable["state available to search replicas"]

Note: Just like the local NRT model (a doc is durable before it is visible, see engine internals), remote store has its own ordering: the translog upload provides durability of the tail, while the segment upload provides the searchable, replica-consumable state. A reader pulling from the remote store sees data as of the last segment upload, not the last write — which is the staleness knob for search replicas.
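The ordering described in the note can be made concrete with a toy in-memory model. This is a sketch under loud assumptions, not OpenSearch code: `ToyRemoteStore`, `ToyWriter`, and `reader_view` are hypothetical names, and the "remote store" is just two Python lists, one per upload stream.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRemoteStore:
    """Hypothetical stand-in for the object store: two independent streams."""
    segments: list = field(default_factory=list)   # each entry: list of doc ids
    translog: list = field(default_factory=list)   # every acknowledged operation


@dataclass
class ToyWriter:
    remote: ToyRemoteStore
    buffer: list = field(default_factory=list)     # indexed but not yet refreshed

    def index(self, doc_id: str) -> None:
        # Translog stream: upload the operation before acknowledging the write.
        self.remote.translog.append(doc_id)
        self.buffer.append(doc_id)

    def refresh(self) -> None:
        # Segment stream: a refresh turns the buffer into an immutable segment,
        # which is uploaded and only then becomes visible to readers.
        if self.buffer:
            self.remote.segments.append(list(self.buffer))
            self.buffer.clear()


def reader_view(remote: ToyRemoteStore) -> list:
    """What a search replica can serve: only documents in uploaded segments."""
    return [doc for segment in remote.segments for doc in segment]


remote = ToyRemoteStore()
writer = ToyWriter(remote)
writer.index("a")
writer.index("b")
writer.refresh()
writer.index("c")            # acknowledged and durable, but in no segment yet

durable = list(remote.translog)
visible = reader_view(remote)
```

After the last write, `durable` holds all three documents while `visible` holds only the two that were refreshed: the gap between them is the staleness a search replica sees.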

Interplay with segment replication

Remote store and segment replication are designed together, and confusing them is a common mistake. Segment replication (replication.md) means replicas copy segments from the primary instead of re-indexing each document. Remote store is where those segments can be copied from.

Without remote store (peer segrep)	With remote store
Replica pulls segments from the primary node over the transport layer.	Replica (or search replica) pulls segments from the remote object store.
Durability is the set of in-sync local copies.	Durability is the remote object store.
Replica recovery is a peer recovery from a live primary.	Recovery is a download from the remote store; no live primary required.

Pairing them is what lets a search replica exist at all: it is a segment-replication consumer whose source is the remote store rather than a live primary. Do not re-derive segment replication here — read replication.md; just hold that remote store is the storage substrate it can read from.

Recovery from the remote store

When a shard is (re)created — node restart, relocation, a new search replica, a rebuilt cluster — recovery becomes a download rather than a node-to-node copy:

flowchart TD
    Start["shard needs to recover"] --> Meta["read remote segment metadata (latest commit)"]
    Meta --> DL["download missing segment files from remote store"]
    DL --> TLrep["replay remote translog tail past the last commit"]
    TLrep --> Up["shard is up to last acked write"]

This can be faster than peer recovery, and it is more flexible, because it does not need a healthy source node holding the data: the object store is the source. It is also what makes scale-to-zero safe: a search replica can be created from nothing, hydrate from remote, serve reads, and be torn down again, all without touching the writer.
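The recovery flow above (read metadata, download missing files, replay the translog tail) can be sketched as a single function. Everything here is an illustrative assumption: `recover_shard`, the `commit` dictionary shape, and the file names are hypothetical, and real segment files do not map to document lists this directly.

```python
def recover_shard(remote_segments, commit, remote_translog, local_files):
    """Toy remote-store recovery (hypothetical names and data shapes).

    remote_segments: dict of file name -> list of (seq_no, doc_id) in that file
    commit:          {"files": [...], "max_seq_no": int}, the latest remote commit
    remote_translog: list of (seq_no, doc_id), uploaded as writes were acknowledged
    local_files:     dict of file name -> contents already on local disk
    """
    # Steps 1 and 2: read the commit metadata, download only what is missing.
    downloaded = [name for name in commit["files"] if name not in local_files]
    for name in downloaded:
        local_files[name] = remote_segments[name]

    # Rebuild the committed state from the files the commit names.
    docs = {}
    for name in commit["files"]:
        for seq_no, doc_id in local_files[name]:
            docs[doc_id] = seq_no

    # Step 3: replay the translog tail, i.e. operations newer than the commit.
    replayed = 0
    for seq_no, doc_id in remote_translog:
        if seq_no > commit["max_seq_no"]:
            docs[doc_id] = seq_no
            replayed += 1
    return docs, downloaded, replayed


remote_segments = {"seg_0": [(0, "a"), (1, "b")], "seg_1": [(2, "c")]}
commit = {"files": ["seg_0", "seg_1"], "max_seq_no": 2}
remote_translog = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]
local_cache = {"seg_0": [(0, "a"), (1, "b")]}   # survived from a previous run

docs, downloaded, replayed = recover_shard(
    remote_segments, commit, remote_translog, local_cache)
```

Note that no other node appears anywhere in the function's inputs. The only incremental optimization modeled is skipping files already in the local cache, which is the behavior the "Recovery downloads everything" symptom is about.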

# The remote-store recovery path (grep to find the real class names).
grep -rln "RemoteStoreRecovery\|recoverFromRemoteStore\|RemoteStoreReplicationSource" \
  server/src/main/java/org/opensearch/index/ \
  server/src/main/java/org/opensearch/indices/recovery/ 2>/dev/null | head

How this underpins reader/writer separation

Stack the pieces and the larger initiative falls out:

remote-backed storage   →  durability lives in the object store, not in replicas
        ↓
search replicas         →  read-only copies that hydrate from the object store
        ↓
reader/writer separation→  scale reads (search replicas) independently of writes
        ↓
scale-to-zero           →  zero search replicas when idle; durability never at risk

Each layer requires the one above it. You cannot have search replicas that pull from remote without remote store; you cannot separate read/write scaling without search replicas; you cannot scale reads to zero without durability being independent of replica count. This is precisely why sharding & scaling points back here as the foundation it stands on.

Multi-writer detection (#17763)

Once durability is a shared remote object that any node could upload to, a new and dangerous failure mode appears: two nodes both believing they are the writer for a shard (a network partition, a botched failover) and both uploading segments. Now the remote store has conflicting histories and the shard is corrupt.

The safety invariant — tracked in #17763 (multi-writer detection RFC) — is that at most one writer may upload for a shard at any time, enforced with fencing. The mechanism is the primary term: every upload is tagged with the primary term of the writer that produced it, and a stale writer (one whose primary term has been superseded) must be detected and prevented from corrupting the remote state. This is the storage-side complement to the routing-side safety in sharding & scaling.
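A minimal sketch of the fencing idea, assuming the remote side can compare an upload's primary term against the highest term it has seen. `FencedRemoteShard` and `StaleWriterError` are hypothetical names, and the check-then-write here is atomic only because the model is single-threaded; how the real detection is implemented is what the RFC and the grep below are for.

```python
class StaleWriterError(Exception):
    """Raised when an upload carries a superseded primary term."""


class FencedRemoteShard:
    """Toy remote shard store that accepts uploads only from the current term."""

    def __init__(self):
        self.highest_term = 0
        self.uploads = []          # (primary_term, segment_name)

    def upload(self, primary_term: int, segment_name: str) -> None:
        if primary_term < self.highest_term:
            # A newer primary has already written; this writer must stop.
            raise StaleWriterError(
                f"term {primary_term} < current term {self.highest_term}")
        self.highest_term = primary_term
        self.uploads.append((primary_term, segment_name))


shard = FencedRemoteShard()
shard.upload(1, "seg_0")           # original primary, term 1
shard.upload(2, "seg_1")           # failover: new primary, term 2

try:
    shard.upload(1, "seg_1_conflicting")   # old primary returns after a partition
    stale_rejected = False
except StaleWriterError:
    stale_rejected = True
```

The stale upload is refused and the remote history stays linear. Dropping the term comparison in `upload` would let the conflicting segment through without any error, which is the silent corruption the invariant exists to prevent.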

# Primary-term / fencing on the upload path.
grep -rn "primaryTerm\|MultiWriter\|fenc\|stale.*upload\|metadata lock" \
  server/src/main/java/org/opensearch/index/store/ 2>/dev/null | head

Warning: Multi-writer corruption is silent until a read returns garbage or a recovery fails. Any change to the upload path must preserve the primary-term fencing invariant; a "small optimization" that drops the term check is a data-loss bug. Treat this as load-bearing.

Try it / read it

# 1. Read the remote-store directory + listeners.
grep -rln "RemoteSegmentStoreDirectory\|RemoteStoreRefreshListener" \
  server/src/main/java/org/opensearch/index/store/ | head
grep -rln "RemoteFsTranslog\|RemoteTranslog" \
  server/src/main/java/org/opensearch/index/translog/ | head

# 2. The settings that turn remote store on (names vary by version).
grep -rn "remote_store\|RemoteStoreSettings\|cluster.remote_store" \
  server/src/main/java/org/opensearch/ | head -20

# 3. Tests that exercise remote store + recovery.
./gradlew :server:internalClusterTest --tests "*RemoteStore*" 2>&1 | tail -15
./gradlew :server:test --tests "*RemoteSegmentStoreDirectory*" 2>&1 | tail -15

# 4. Inspect a remote-enabled shard's stats (on a configured cluster).
curl -s 'localhost:9200/_remotestore/stats/scaling-demo?pretty' 2>/dev/null | head -40

Common bugs and symptoms

Symptom	Root cause	Where to look
Acked write lost after node failure	Translog not uploaded before ack (durability policy)	remote translog upload cadence; durability setting
Search replica serves stale data	Segment upload lagging behind writes	`RemoteStoreRefreshListener` cadence; refresh interval
Recovery downloads everything (slow)	Local segment cache invalidated; no incremental reuse	remote segment metadata diffing; local cache
Corrupt shard / conflicting segments after partition	Two writers uploaded (multi-writer)	primary-term fencing; #17763
Remote upload backpressure stalls indexing	Upload can't keep up with write rate	upload threadpool; interaction with backpressure
Translog grows unbounded	Remote upload failing silently; commit not advancing	upload error handling; global checkpoint / commit

Validation: prove you understand this

Explain how remote store decouples durability from replica count, and what that enables downstream.
Name the two upload streams (segments, translog), their triggers, and what each one makes durable.
Contrast peer segment replication with remote-store-backed replication on: where the replica pulls from, what provides durability, and what recovery looks like.
Walk remote-store recovery end to end and say why it does not need a live source node.
Draw the four-layer stack (remote store → search replicas → reader/writer separation → scale-to-zero) and explain why each layer requires the one above.
State the multi-writer invariant (#17763) and the mechanism (primary-term fencing) that enforces it.

Next: Backpressure and Admission Control, or back to Sharding, Scaling & Reader/Writer Separation to see what this foundation enables. Related capstone: segrep/RW-separation observability.

Backpressure and Admission Control

Every distributed data system eventually meets the same wall: more work arrives than it can do. The naive response, accepting everything and letting queues grow, is how clusters die. Queues consume heap, heap pressure trips circuit breakers, garbage collection stalls, nodes drop out of the cluster, shards relocate onto already-overloaded nodes, and a localized hotspot becomes a cascading failure that takes down the whole cluster. A single overloaded shard ends up killing everything.

Backpressure and admission control is the disciplined alternative: when a node or a shard is overloaded, reject some work early and cheaply — with a clear, retryable signal — so the system sheds load instead of collapsing. OpenSearch does this on both sides of the workload: shard indexing pressure (smart, per-shard rejection of writes) and search backpressure (resource tracking and cancellation of expensive queries). The art is rejecting the right work: the request actually responsible for the pressure, not an innocent bystander.

After this chapter you should be able to: explain why naive queueing causes cascading failure; describe how shard indexing pressure tracks outstanding bytes and rejects per-shard; explain secondary-parameter rejection; describe search backpressure's task-resource-tracking and cancellation model; and cite the real RFC trail.

Prerequisites. This is the companion to circuit breakers and memory (the last line of defense) — backpressure is the earlier, smarter line. It also touches threadpools and concurrency.

Why naive queueing kills clusters

flowchart TD
    Load["write load > capacity on shard X"] --> Q["accept everything → queues grow"]
    Q --> Heap["heap fills with in-flight requests"]
    Heap --> CB["circuit breaker trips / GC stalls"]
    CB --> Drop["node drops out of cluster"]
    Drop --> Reroute["shards relocate onto other overloaded nodes"]
    Reroute --> Spread["overload spreads → cascading failure"]

The failure is positive feedback: overload causes a node to leave, which causes its shards to pile onto survivors, which overloads them, which causes them to leave. Backpressure breaks the loop at the first step by refusing to accept work it cannot do, returning a 429 TOO_MANY_REQUESTS that a well-behaved client retries with backoff. A rejected-and-retried request is recoverable; a dead cluster is not.
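What "a well-behaved client retries with backoff" means can be sketched as capped exponential backoff with jitter. This is a generic client-side pattern, not an OpenSearch client API: `send_with_backoff` and its parameters are hypothetical, and `send` is any callable that returns an HTTP status code.

```python
import random
import time


def send_with_backoff(send, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `send()` on HTTP 429 with capped exponential backoff and jitter.

    Returns (final_status, delays_waited). `sleep` and `rng` are injectable
    so the behavior can be tested without real waiting or randomness.
    """
    delays = []
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status, delays
        if attempt == max_attempts - 1:
            break                                   # out of attempts: give up
        # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)].
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        delay = rng() * ceiling
        delays.append(delay)
        sleep(delay)
    return 429, delays


# Simulated server: rejects twice, then accepts.
responses = iter([429, 429, 200])
status, delays = send_with_backoff(
    lambda: next(responses),
    sleep=lambda seconds: None,    # do not actually wait in this demo
    rng=lambda: 1.0,               # deterministic: always the full ceiling
)
```

The jitter matters as much as the exponent: if every rejected client retries after exactly the same delay, the retries arrive together and recreate the overload that caused the rejection.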

Note: Circuit breakers (memory accounting that trips when a single operation would exceed a heap limit) are the last line of defense — they prevent OOM but are blunt. Backpressure is earlier and smarter: it watches sustained pressure and rejects the specific shard/request causing it, before the breaker ever needs to fire. Read circuit-breakers-memory.md for the relationship.

Shard indexing pressure: smart, per-shard rejection

The key OpenSearch insight (from the shard-level backpressure RFC) is that node-level rejection is too coarse. A node hosts many shards; one hot shard being slammed should not cause the node to reject writes destined for other, healthy shards on the same node. So OpenSearch tracks indexing pressure per shard and rejects only the writes hitting the shard that is actually in trouble.

The mechanism tracks outstanding bytes of in-flight indexing work, at multiple stages of the write path:

Tracked quantity	What it measures
Coordinating bytes	bytes of write requests being coordinated on this node
Primary bytes	bytes being applied on primaries hosted here
Replica bytes	bytes being applied as replica operations here

When a shard's outstanding bytes exceed its limit, new writes to that shard are rejected — leaving other shards on the node unaffected.

flowchart TD
    W["write op for shard X"] --> Acct["account bytes against shard X (coordinating/primary/replica)"]
    Acct --> Check{shard X over limit?}
    Check -->|no| Accept["proceed; release bytes on completion"]
    Check -->|yes| Sec{secondary parameters also bad?}
    Sec -->|yes| Reject["reject: 429, this shard only"]
    Sec -->|no| Accept

Secondary-parameter rejection

A pure byte threshold over-rejects: a shard can briefly hold a lot of bytes during a healthy burst without being in trouble. So shard indexing pressure does not reject on the byte limit alone — it consults secondary parameters (signals that the shard is genuinely not keeping up, such as the throughput degrading or the request queue/latency for that shard rising) before rejecting. The intent is to reject only shards that are both over their byte budget and showing signs of real distress — minimizing false rejections of healthy bursts.
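The two-stage decision (byte budget first, secondary parameters second) can be sketched as a small tracker. This is a toy model: `ToyShardPressure`, the throughput-based distress signal, and the degradation factor are illustrative assumptions, not OpenSearch classes or defaults.

```python
class ToyShardPressure:
    """Per-shard outstanding-byte accounting with a secondary check."""

    def __init__(self, byte_limit, degradation_factor=2.0):
        self.byte_limit = byte_limit
        self.degradation_factor = degradation_factor
        self.outstanding = {}        # shard -> in-flight bytes
        self.baseline_tput = {}      # shard -> historical bytes/sec
        self.recent_tput = {}        # shard -> recent bytes/sec

    def observe_throughput(self, shard, baseline, recent):
        self.baseline_tput[shard] = baseline
        self.recent_tput[shard] = recent

    def _in_distress(self, shard):
        # Secondary parameter: recent throughput has fallen well below baseline.
        baseline = self.baseline_tput.get(shard)
        recent = self.recent_tput.get(shard)
        if baseline is None or recent is None:
            return False             # no history yet: do not reject on bytes alone
        return recent * self.degradation_factor <= baseline

    def try_acquire(self, shard, nbytes):
        """Account the bytes and return True, or return False (caller sends 429)."""
        projected = self.outstanding.get(shard, 0) + nbytes
        if projected > self.byte_limit and self._in_distress(shard):
            return False
        self.outstanding[shard] = projected
        return True

    def release(self, shard, nbytes):
        # Called when the write completes, successfully or not.
        self.outstanding[shard] -= nbytes


pressure = ToyShardPressure(byte_limit=1000)
pressure.observe_throughput("hot", baseline=100.0, recent=20.0)      # degraded 5x
pressure.observe_throughput("bursty", baseline=100.0, recent=95.0)   # keeping up

hot_first = pressure.try_acquire("hot", 800)         # under the limit
hot_second = pressure.try_acquire("hot", 400)        # over the limit and degraded
bursty = pressure.try_acquire("bursty", 1500)        # over the limit but healthy
neighbor = pressure.try_acquire("neighbor", 100)     # unrelated shard, same node
```

Only the second write to the degraded shard is rejected. The healthy burst and the neighboring shard both proceed, which is the behavior a pure byte threshold or a node-level limit could not give you. A real implementation also needs a hard ceiling so a healthy-looking shard cannot grow without bound; this sketch leaves that to the node-level limit.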

cd ~/src/OpenSearch
# The shard indexing pressure tracker + the rejection logic.
grep -rln "ShardIndexingPressure\|IndexingPressure\|ShardIndexingPressureStore" \
  server/src/main/java/org/opensearch/index/ | head
grep -rn "coordinatingBytes\|primaryBytes\|replicaBytes\|secondary\|rejection\|markCoordinating" \
  server/src/main/java/org/opensearch/index/stats/ \
  server/src/main/java/org/opensearch/index/ 2>/dev/null | head -20
# The settings.
grep -rn "shard_indexing_pressure\|indexing_pressure" \
  server/src/main/java/org/opensearch/ | head

Search backpressure: track resources, cancel the offender

Reads need protection too, but the model is different. You cannot "reject bytes": an expensive search has already been accepted before you know it is expensive (a deep aggregation, a giant terms aggregation, a runaway script). So search backpressure tracks the resource consumption of in-flight search tasks and cancels the ones consuming disproportionately when the node is under strain.

flowchart TD
    Node["node under resource strain (heap/CPU)"] --> Track["task resource tracking: per-task CPU + heap"]
    Track --> Rank["rank in-flight search tasks by consumption"]
    Rank --> Pick["pick the worst offenders over thresholds"]
    Pick --> Cancel["cancel those tasks (TaskCancellation)"]
    Cancel --> Relief["node sheds load; healthy queries continue"]

The building blocks: the task framework assigns every search a cancellable task; task resource tracking records its CPU time and heap allocations; a backpressure service periodically evaluates node strain and, when over threshold, cancels the tasks consuming the most — so one runaway query dies instead of the node.
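One pass of that evaluate-and-cancel loop can be sketched as follows. The names (`ToySearchTask`, `run_backpressure_pass`), the ranking rule, and the thresholds are all assumptions made for illustration; the real service has its own trackers and limits, which the grep below locates.

```python
from dataclasses import dataclass


@dataclass
class ToySearchTask:
    task_id: str
    cpu_millis: int
    heap_bytes: int
    cancelled: bool = False


def run_backpressure_pass(tasks, node_under_duress, cpu_threshold_millis,
                          heap_threshold_bytes, max_cancellations):
    """Cancel the heaviest tasks that exceed a threshold; return their ids."""
    if not node_under_duress:
        return []                       # healthy node: never cancel anything

    offenders = [
        t for t in tasks
        if not t.cancelled and (t.cpu_millis > cpu_threshold_millis
                                or t.heap_bytes > heap_threshold_bytes)
    ]
    # Worst first: rank by how far each task is over its thresholds.
    offenders.sort(
        key=lambda t: max(t.cpu_millis / cpu_threshold_millis,
                          t.heap_bytes / heap_threshold_bytes),
        reverse=True,
    )
    cancelled = []
    for task in offenders[:max_cancellations]:
        task.cancelled = True
        cancelled.append(task.task_id)
    return cancelled


tasks = [
    ToySearchTask("dashboard", cpu_millis=120, heap_bytes=2_000_000),
    ToySearchTask("deep-agg", cpu_millis=45_000, heap_bytes=900_000_000),
    ToySearchTask("script", cpu_millis=31_000, heap_bytes=10_000_000),
]

when_idle = run_backpressure_pass(tasks, False, 30_000, 500_000_000, 1)
when_strained = run_backpressure_pass(tasks, True, 30_000, 500_000_000, 1)
```

Two design points carry over to the real system: nothing is cancelled unless the node itself is under strain, and the number of cancellations per pass is capped so that one evaluation cannot wipe out every in-flight query.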

# Search backpressure service + task resource tracking.
ls server/src/main/java/org/opensearch/search/backpressure/ 2>/dev/null
grep -rln "SearchBackpressureService\|SearchBackpressureSettings\|TaskResourceTracking\|CancellableTask" \
  server/src/main/java/org/opensearch/search/backpressure/ \
  server/src/main/java/org/opensearch/tasks/ 2>/dev/null | head
grep -rn "search_backpressure\|cancellation" \
  server/src/main/java/org/opensearch/search/backpressure/ 2>/dev/null | head

The search-backpressure capstone is about adding a new signal to this evaluation — which forces you to understand the whole tracking-and-cancellation loop.

The RFC trail

Issue / PR	Role in the story
#1446 — [Meta] Indexing Backpressure	The umbrella for indexing backpressure.
#478 — [Meta] Shard level Indexing Back-Pressure	The key design: reject per-shard, not per-node.
#1336 — Shard Indexing Pressure (impl PR)	The implementation; read the diff to see `ShardIndexingPressure`.

Read #478 for why rejection is per-shard, then #1336 for how it is implemented; together they are a clean example of an RFC's design becoming concrete tracking code.

Settings (grep for the real keys)

Setting (indexing)	Meaning
`shard_indexing_pressure.enabled`	master switch for shard indexing pressure
`shard_indexing_pressure.enforced`	enforce (reject) vs shadow-mode (track only)
`indexing_pressure.memory.limit`	node-level in-flight indexing byte budget

Setting (search)	Meaning
`search_backpressure.mode`	`disabled` / `monitor_only` / `enforced`
`search_backpressure.*.cpu_time_millis_threshold`	per-task CPU threshold
`search_backpressure.*.heap_*`	per-task / node heap thresholds

# Confirm exact keys/defaults in your checkout.
grep -rn "shard_indexing_pressure\|search_backpressure" \
  server/src/main/java/org/opensearch/ | grep -i "Setting" | head -20

# Observe rejections / backpressure stats live.
curl -s 'localhost:9200/_nodes/stats/indexing_pressure?pretty' 2>/dev/null | head -40
curl -s 'localhost:9200/_nodes/stats/search_backpressure?pretty' 2>/dev/null | head -40

Try it / read it

# 1. Read the trackers.
grep -rln "ShardIndexingPressure" server/src/main/java/org/opensearch/index/ | head
ls server/src/main/java/org/opensearch/search/backpressure/

# 2. Enable enforcement, then hammer one shard and watch for 429s.
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": {
    "shard_indexing_pressure.enabled": true,
    "shard_indexing_pressure.enforced": true
  }
}'
# generate concurrent bulk load to one index, then check stats:
curl -s 'localhost:9200/_nodes/stats/indexing_pressure?pretty' | grep -i "rejection\|current\|limit" | head

# 3. Enable search backpressure and run a deliberately expensive query.
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"persistent":{"search_backpressure.mode":"enforced"}}'
curl -s 'localhost:9200/_nodes/stats/search_backpressure?pretty' | grep -i "cancel\|limit" | head

# 4. Tests.
./gradlew :server:test --tests "*ShardIndexingPressure*" 2>&1 | tail -15
./gradlew :server:test --tests "*SearchBackpressure*" 2>&1 | tail -15

Common symptoms (and what they mean)

Symptom	Likely cause	Where to look
Clients get `429 TOO_MANY_REQUESTS` on writes	Shard indexing pressure rejecting an overloaded shard (working as designed)	`_nodes/stats/indexing_pressure`; client retry/backoff
Writes to a healthy shard rejected	Node-level limit too low, or secondary params over-rejecting	`indexing_pressure.memory.limit`; secondary-parameter logic
A single deep query stalls the node, never cancelled	Search backpressure in `monitor_only`, or threshold too high	`search_backpressure.mode`; CPU/heap thresholds
Cascading node drops under load	Backpressure disabled; only circuit breakers (too late) catching it	enable shard indexing pressure; circuit breakers
Rejections with no apparent overload	Limits mis-sized for the hardware/workload	tune limits; verify against real outstanding-bytes stats
Cancelled queries the user did not expect	Search backpressure thresholds too aggressive	per-task CPU/heap thresholds; `enforced` vs `monitor_only`

Validation: prove you understand this

Draw the cascading-failure loop and mark the step where backpressure breaks it.
Explain why per-shard rejection is better than per-node, with a concrete two-shard example.
Name the three byte quantities shard indexing pressure tracks and why secondary-parameter rejection exists (what failure does it prevent?).
Contrast indexing backpressure (reject bytes early) with search backpressure (track + cancel) and explain why reads need a different model.
Describe the search backpressure loop: task → resource tracking → evaluation → cancellation, naming the framework pieces.
Explain the relationship between backpressure and circuit breakers: which fires first and why backpressure is "smarter".

Next: Star-Tree Indexes and Aggregation Acceleration, or the search-backpressure capstone.

Star-Tree Indexes and Aggregation Acceleration

Most aggregations in OpenSearch are computed at query time by scanning doc values, bucketing, and reducing (see Aggregations). That is flexible (any field, any filter, any nesting), but the cost scales with the number of matching documents. A terms aggregation over a billion-doc index that touches every document does a billion doc-value reads, every time, even if the answer is the same hourly dashboard tile you have rendered a thousand times.

The star-tree index attacks this from the other direction: it pre-aggregates during indexing. You declare a set of dimensions (the fields you group/filter on) and metrics (the fields you sum/count/min/max/avg), and OpenSearch builds a compact tree, inside each Lucene segment, whose nodes already hold the rolled-up metric values for every dimension combination. At search time, an eligible aggregation is answered by walking the tree instead of scanning documents — turning an O(matching-docs) operation into something close to O(tree-depth). The result is bounded, predictable latency for the aggregation shapes you planned for.

This is a composite index — it does not change your stored documents or your normal query path. It is an opt-in acceleration structure that the search layer auto-detects and uses when the request matches. It builds on DocValues and Fielddata (the tree's metric values are columnar, doc-value-backed) and accelerates the framework in Aggregations.

The design is tracked across three real RFCs/metas. Read them before you touch the code, because they carry the rationale and the scope boundaries.

After this chapter you can: explain the star-tree data structure and the role of the star node; write a composite/star_tree mapping; describe how the tree is built as part of a segment; trace how the search path detects and resolves an eligible aggregation against the tree; and reason about the storage-vs-latency trade-off and the (deliberately bounded) set of supported aggregations.

Note: The star-tree comes from the Apache Pinot lineage of the same name. The OpenSearch implementation is its own thing, but if you have read the Pinot star-tree paper the vocabulary (dimensions, the special "star" record) will be familiar.

The data structure

A star-tree is a prefix tree over an ordered list of dimensions. Pick an order for your dimensions — say [status, region]. Every path from the root down to a leaf fixes a prefix of dimension values; the node at the end of that path stores the aggregated metrics for all documents matching that prefix.

Concretely, for dimensions status then region and a metric sum(bytes):

                                    root
               ┌──────────────────────┼────────────────┐
          status=200             status=404         status=*      (the STAR node)
    ┌──────────┼──────────┐           │           ┌────┴─────┐
region=us  region=eu  region=*       ...      region=us  region=*
    │          │          │                       │          │
sum=1.2e9  sum=8.0e8  sum=2.0e9               sum=3.0e9  sum=5.0e9

Two ideas do the heavy lifting:

Sorted, ordered dimensions. Documents are sorted by the dimension order, so building the tree is a grouped roll-up, and querying is a descent that narrows the candidate set one dimension at a time.
The star node (*). At each level, in addition to the children for each concrete value, the tree holds a special star child that aggregates across all values of that dimension. This is what makes a query that does not filter on a dimension cheap: instead of summing over every region=… child, you jump to region=*, which already holds the cross-region roll-up. The star node is the whole trick — it turns "no filter on this dimension" into a single edge.

flowchart TD
    Root["root"] -->|status=200| S200["node: status=200"]
    Root -->|status=404| S404["node: status=404"]
    Root -->|status=*| SSTAR["STAR node: status=* (all statuses)"]
    S200 -->|region=us| L1["leaf: sum,count for 200/us"]
    S200 -->|region=eu| L2["leaf: 200/eu"]
    S200 -->|region=*| L3["STAR leaf: 200/all-regions"]
    SSTAR -->|region=us| L4["leaf: all-status/us"]
    SSTAR -->|region=*| L5["STAR leaf: GRAND TOTAL"]

The bottom-right region=* under status=* is the grand total — the answer to "sum(bytes) over the whole segment with no filters" in one lookup.
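The star idea can be shown with a flat dictionary before worrying about tree layout. This sketch materializes every concrete-or-star combination, which is the naive form that does not scale, so it illustrates the roll-up only. `build_rollups` and `lookup` are hypothetical helpers, and the `status=404` rows are assumed values chosen to agree with the totals in the figure.

```python
from collections import defaultdict
from itertools import product

STAR = "*"


def build_rollups(rows, num_dims):
    """Pre-aggregate a sum for every concrete/star combination of dimensions.

    rows: iterable of (dim_values_tuple, metric_value).
    Returns a dict mapping a key such as (200, "*") to the rolled-up sum.
    """
    rollups = defaultdict(int)
    for dims, value in rows:
        # Each row contributes to 2^num_dims keys: every dimension is either
        # kept concrete or replaced by the star.
        for mask in product((False, True), repeat=num_dims):
            key = tuple(STAR if starred else d for d, starred in zip(dims, mask))
            rollups[key] += value
    return dict(rollups)


def lookup(rollups, status=STAR, region=STAR):
    """An unfiltered dimension follows the star edge: one dictionary hit."""
    return rollups.get((status, region), 0)


rows = [
    ((200, "us"), 1_200_000_000),
    ((200, "eu"), 800_000_000),
    ((404, "us"), 1_800_000_000),   # assumed: not shown in the figure
    ((404, "eu"), 1_200_000_000),   # assumed: not shown in the figure
]
rollups = build_rollups(rows, num_dims=2)

all_200 = lookup(rollups, status=200)        # status=200, region=*
all_us = lookup(rollups, region="us")        # status=*,   region=us
grand_total = lookup(rollups)                # status=*,   region=*
```

Every query here is a single lookup regardless of how many rows fed the roll-up, which is the property the star node buys. The cost is the number of keys, which is what the size bound in the next section addresses.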

`max-leaf-docs` and the size bound

A naive tree would create a star-node path for every dimension combination, which explodes. The build is bounded by max_leaf_docs (the spec calls it the max-leaf-docs threshold): a node is only split further while the number of documents under it exceeds the threshold. Below it, the node keeps the documents as a small ordered run and aggregates them on the fly during the (tiny) descent. This caps the number of nodes, trading a bounded amount of per-query work for a bounded index size. Tuning it is the central storage-vs-latency knob (see the trade-off table).

The mapping: a `composite` index field of type `star_tree`

You declare the star-tree in the index mapping as a composite field. It is not a field you index values into — it is a derived structure over other fields. Exact key names evolve across versions, so create one and read it back, but the shape is:

PUT /logs-startree
{
  "settings": {
    "index.number_of_shards": 1,
    "index.composite_index": true,
    "index.append_only.enabled": true
  },
  "mappings": {
    "composite": {
      "request_stats": {
        "type": "star_tree",
        "config": {
          "max_leaf_docs": 10000,
          "ordered_dimensions": [
            { "name": "status" },
            { "name": "region" }
          ],
          "metrics": [
            { "name": "bytes", "stats": ["sum", "max", "min", "value_count"] },
            { "name": "latency_ms", "stats": ["avg"] }
          ]
        }
      }
    },
    "properties": {
      "status":     { "type": "integer" },
      "region":     { "type": "keyword" },
      "bytes":      { "type": "long" },
      "latency_ms": { "type": "float" }
    }
  }
}

A few things that bite people and are worth internalizing:

Append-only. Star-tree indices are designed for immutable, append-only data (logs, metrics). Updates/deletes do not fit the precomputed-roll-up model cleanly, which is why the feature is gated behind append-only settings. Check the current gating in your branch — grep below.
Ordered dimensions. The order matters for which queries are cheap; a filter on the first dimension is the cheapest descent. Order dimensions by how often you filter on them.
Bounded stats. avg is stored as sum + value_count and divided at read time (the same associative-decomposition trick the normal avg reduce uses). You cannot ask the tree for a stat you did not declare.
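The reason `avg` is kept as `sum` plus `value_count` is that averages do not merge but sums and counts do. A small arithmetic sketch, with a hypothetical helper name and made-up numbers:

```python
def merge_avg_partials(partials):
    """Combine per-segment (sum, value_count) pairs into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Two segments with very different document counts.
segment_a = (300.0, 3)      # three requests averaging 100 ms
segment_b = (10.0, 1)       # one request at 10 ms

correct = merge_avg_partials([segment_a, segment_b])            # 310 / 4
naive = (segment_a[0] / segment_a[1] + segment_b[0] / segment_b[1]) / 2
```

Here `correct` is 77.5 while the average of the two per-segment averages is 55.0. Storing the decomposed pair is what lets star-tree nodes, segments, and shards all be combined without losing accuracy.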

cd ~/src/OpenSearch
# The mapper + config parsing for the composite/star_tree field.
grep -rln "star_tree\|StarTree\|composite" \
  server/src/main/java/org/opensearch/index/mapper/ | head
grep -rn "max_leaf_docs\|ordered_dimensions\|StarTreeMapper\|CompositeMappedFieldType" \
  server/src/main/java/org/opensearch/index/mapper/ | head -20
# The feature flag / settings that gate it.
grep -rn "composite_index\|STAR_TREE\|isCompositeIndexEnabled\|append_only" \
  server/src/main/java/org/opensearch/index/ | head -20

Built during indexing, as part of the segment

The star-tree is not a separate file you build with an admin API. It is written inside the Lucene segment, by a composite-index DocValuesConsumer that runs at flush and at merge, exactly when the rest of the segment's doc values are written (see Refresh, Flush, Merge). That is why it is "free" at query time and why it is immutable: it shares the segment's lifecycle.

The build, per segment:

Read the declared dimension and metric doc values for every document in the segment.
Sort the documents by the ordered-dimension tuple.
Aggregate to produce the leaf records, then build star-node records bottom-up, splitting only while a node exceeds max_leaf_docs.
Serialize the tree (node array + the aggregated metric doc values) into the segment as additional files written through the composite codec/format.

flowchart LR
    Flush["segment flush / merge"] --> Read["read dimension+metric docvalues"]
    Read --> Sort["sort docs by ordered dims"]
    Sort --> Rollup["aggregate -> leaf records"]
    Rollup --> Star["build star nodes bottom-up (bounded by max_leaf_docs)"]
    Star --> Write["write tree + metric docvalues into the SEGMENT"]

Because it is built at flush and merge, the tree is always consistent with the segment it lives in; merging two segments rebuilds the tree over the merged docs. That also means the cost shows up as indexing/merge CPU and disk, not query CPU — you pay once at write time for every read after.
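The sort, roll-up, and bounded split can be sketched in a few lines. This is a simplified top-down recursion over in-memory tuples, not the bottom-up on-heap or off-heap builders in the codebase; `build_star_tree`, `build_node`, and the node dictionary layout are hypothetical.

```python
STAR = "*"


def build_node(docs, dim_index, num_dims, max_leaf_docs):
    """docs: list of (dims_tuple, metric_sum) that share the same prefix."""
    node = {"sum": sum(value for _, value in docs), "children": {}, "docs": None}
    if len(docs) <= max_leaf_docs or dim_index == num_dims:
        node["docs"] = docs          # small run: aggregated on the fly at query time
        return node

    # One child per concrete value of the current dimension.
    groups = {}
    for dims, value in docs:
        groups.setdefault(dims[dim_index], []).append((dims, value))
    for dim_value, group in groups.items():
        node["children"][dim_value] = build_node(
            group, dim_index + 1, num_dims, max_leaf_docs)

    # Star child: wildcard this dimension and merge docs whose keys now coincide.
    merged = {}
    for dims, value in docs:
        key = dims[:dim_index] + (STAR,) + dims[dim_index + 1:]
        merged[key] = merged.get(key, 0) + value
    node["children"][STAR] = build_node(
        list(merged.items()), dim_index + 1, num_dims, max_leaf_docs)
    return node


def build_star_tree(raw_docs, num_dims, max_leaf_docs):
    docs = sorted(raw_docs, key=lambda d: d[0])   # sort by the ordered dimensions
    return build_node(docs, 0, num_dims, max_leaf_docs)


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["children"].values())


raw = [((200, "us"), 10), ((200, "eu"), 20), ((404, "us"), 30), ((404, "eu"), 40)]
fine = build_star_tree(raw, num_dims=2, max_leaf_docs=1)
coarse = build_star_tree(raw, num_dims=2, max_leaf_docs=2)
```

With `max_leaf_docs=1` the tree splits all the way down and has 13 nodes; with `max_leaf_docs=2` it stops after the first dimension and has 4. The roll-ups at the nodes that do exist are identical, so the threshold trades index size against the amount of leaf scanning left for query time.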

cd ~/src/OpenSearch
ls server/src/main/java/org/opensearch/index/compositeindex/
# The builder + the per-segment writer.
grep -rln "StarTreeBuilder\|StarTreeDocValuesConsumer\|OnHeapStarTreeBuilder\|OffHeapStarTreeBuilder" \
  server/src/main/java/org/opensearch/index/compositeindex/ | head
grep -rn "build(\|appendStarTreeDocument\|max_leaf_docs\|StarTreeField" \
  server/src/main/java/org/opensearch/index/compositeindex/datacube/startree/ | head -20
# The codec/format that actually writes it into the segment files.
grep -rln "Composite99\|StarTree.*Format\|Composite.*Codec\|DocValuesFormat" \
  server/src/main/java/org/opensearch/index/codec/composite/ | head

Note: Names like Composite99Codec / OnHeapStarTreeBuilder are version-stamped and will drift — the grep -rln above finds the real site in your checkout. Do not hard-code a class name from this page into a PR description.

The search path: detection and resolution

This is the part that makes the feature transparent. You do not add a "use the star-tree" flag to your query. The search layer inspects the aggregation request and decides whether the star-tree can answer it; if so, it resolves the answer from the tree; if not, it falls back to the normal doc-value scan. RFC #14871 is the canonical description of this resolution logic.

An aggregation is star-tree-eligible only if all of these hold:

Condition	Why
Every filter/query field is a declared dimension (or the query is `match_all`)	The descent narrows by dimension; an undeclared filter field cannot be applied to the tree.
Every grouping field is a declared dimension	Buckets correspond to tree descents per dimension value.
Every metric field + stat was declared in the config	The tree only holds the metrics you precomputed.
The supported-agg shape matches (see below)	Only certain agg types decompose into the tree's roll-ups.
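
The four conditions can be read as one boolean check. The sketch below is a minimal model under stated assumptions: `StarTreeEligibility`, its fields, and the list of supported stats are hypothetical names for illustration, not OpenSearch classes, and the real resolution logic inspects the parsed query and aggregation tree rather than field-name lists.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of a declared star-tree config. These are not OpenSearch classes.
final class StarTreeEligibility {
    private final Set<String> dimensions;    // declared dimension fields
    private final Set<String> metrics;       // declared "stat(field)" pairs
    private final Set<String> supportedAggs; // stats assumed to decompose into roll-ups

    StarTreeEligibility(Set<String> dimensions, Set<String> metrics, Set<String> supportedAggs) {
        this.dimensions = dimensions;
        this.metrics = metrics;
        this.supportedAggs = supportedAggs;
    }

    // True only if every condition holds; on false the caller takes the doc-value scan.
    boolean isEligible(List<String> filterFields, List<String> groupFields, String stat, String metricField) {
        return dimensions.containsAll(filterFields)      // an empty list models match_all
            && dimensions.containsAll(groupFields)
            && supportedAggs.contains(stat)
            && metrics.contains(stat + "(" + metricField + ")");
    }

    // Assumed config: dimensions [status, region], metric sum(bytes).
    static StarTreeEligibility sample() {
        return new StarTreeEligibility(
            Set.of("status", "region"),
            Set.of("sum(bytes)"),
            Set.of("sum", "min", "max", "value_count", "avg"));
    }

    public static void main(String[] args) {
        StarTreeEligibility config = sample();
        System.out.println(config.isEligible(List.of(), List.of("status"), "sum", "bytes"));
        System.out.println(config.isEligible(List.of("user_agent"), List.of(), "sum", "bytes"));
        System.out.println(config.isEligible(List.of(), List.of(), "cardinality", "region"));
    }
}
```

The first call is eligible; the other two fail on an undeclared filter field and an unsupported stat respectively. Note that the failing cases return `false` without raising anything, which mirrors the silent fallback.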

The resolution, conceptually:

flowchart TD
    Req["aggregation request"] --> Check{"all dims/metrics declared?<br/>supported agg shape?"}
    Check -- no --> Scan["normal docvalue scan (default path)"]
    Check -- yes --> Pick["pick star-tree values source"]
    Pick --> Desc["descend tree: concrete child for filtered dim,<br/>STAR child for unfiltered dim"]
    Desc --> Emit["read precomputed metric(s) at the resolved node(s)"]
    Emit --> Reduce["feed into the normal InternalAggregation reduce"]

The elegant part: the resolved metric values are fed into the same InternalAggregation reduce you already know from Aggregations. Per-segment, the star-tree produces a partial; the shard- and coordinator-level reduce are unchanged. The star-tree only replaces the collection step, and only when eligible. That is also why it composes with concurrent segment search — each segment slice either uses its star-tree or scans, then the normal slice/coordinator reduce runs.
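
To see why the reduce does not care how a partial was collected, consider `avg`. The sketch below assumes a hypothetical `AvgPartial` type that carries a sum and a count; it is an illustration of the idea, not the `InternalAggregation` code. One partial is read "from the tree", the other is computed by scanning, and the same merge combines them.

```java
// Hypothetical per-segment partial for avg. It carries sum and count, never the ratio.
final class AvgPartial {
    final double sum;
    final long count;

    AvgPartial(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Collection by scanning: what a segment without an eligible star-tree does.
    static AvgPartial scan(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return new AvgPartial(total, values.length);
    }

    // Reduce step: associative, so it does not matter how each partial was produced.
    AvgPartial merge(AvgPartial other) {
        return new AvgPartial(sum + other.sum, count + other.count);
    }

    double value() {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        AvgPartial fromTree = new AvgPartial(300.0, 3);      // read from a precomputed node
        AvgPartial fromScan = scan(new double[] { 500.0 });  // computed by visiting docs
        System.out.println(fromTree.merge(fromScan).value());          // 800 / 4
        System.out.println((fromTree.value() + fromScan.value()) / 2); // averaging averages differs
    }
}
```

Merging sums and counts gives 200.0 here, while averaging the two per-segment averages gives 300.0. That is why the tree has to store the decomposed parts rather than a finished average.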

cd ~/src/OpenSearch
# The values source / resolution that decides to use the tree.
grep -rln "StarTree\|CompositeIndex\|getStarTreeValues\|supportsStarTree" \
  server/src/main/java/org/opensearch/search/aggregations/ \
  server/src/main/java/org/opensearch/search/startree/ 2>/dev/null | head
grep -rn "StarTreeQueryHelper\|StarTreeFilter\|canUseStarTree\|StarTreeValuesSource" \
  server/src/main/java/org/opensearch/search/ | head -20

Worked example: which queries hit the tree

Given the mapping above (ordered_dimensions = [status, region], metric sum(bytes)):

Request	Eligible?	How it resolves
`sum(bytes)` with no filter	✅	one lookup at `status=*, region=*` (grand total)
`terms(status)` → `sum(bytes)`	✅	one descent per `status` child, `region=*` underneath
filter `region=us`, then `sum(bytes)`	✅	`status=*` then concrete `region=us`
`terms(region)` → `avg(latency_ms)`	✅ (only if `avg` on `latency_ms` is declared)	`status=*`, then one node per `region`; divide the stored `sum` by `value_count`
filter on `user_agent` (not a dimension)	❌	falls back to doc-value scan
`cardinality(region)`	❌	not a precomputable roll-up; falls back
`percentiles(latency_ms)`	❌	sketch-based; not a tree roll-up; falls back
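
The eligible rows above can be replayed against a toy model. This is a deliberately flattened stand-in, assuming a hypothetical `ToyStarCube` that stores one precomputed `sum(bytes)` per `(status, region)` combination with `*` as the star value. A real star-tree shares prefixes and splits nodes by `max_leaf_docs`; only the lookup pattern is the same.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a star-tree: every (status, region) combination, including the
// STAR wildcard, maps to a precomputed sum(bytes). Hypothetical, for illustration only.
final class ToyStarCube {
    static final String STAR = "*";
    private final Map<String, Long> sumBytes = new HashMap<>();

    // Build time: roll each document into its concrete and STAR combinations.
    void add(String status, String region, long bytes) {
        for (String s : new String[] { status, STAR }) {
            for (String r : new String[] { region, STAR }) {
                sumBytes.merge(s + "|" + r, bytes, Long::sum);
            }
        }
    }

    // Query time: a null filter means "unfiltered", so take the STAR child.
    long sum(String status, String region) {
        String s = status == null ? STAR : status;
        String r = region == null ? STAR : region;
        return sumBytes.getOrDefault(s + "|" + r, 0L);
    }

    static ToyStarCube sample() {
        ToyStarCube cube = new ToyStarCube();
        cube.add("200", "us", 10);
        cube.add("200", "eu", 20);
        cube.add("404", "us", 5);
        return cube;
    }

    public static void main(String[] args) {
        ToyStarCube cube = sample();
        System.out.println(cube.sum(null, null));  // no filter: status=*, region=*
        System.out.println(cube.sum("200", null)); // one terms(status) bucket: region=*
        System.out.println(cube.sum(null, "us"));  // filter region=us: status=*
    }
}
```

Every query is a single map lookup regardless of how many documents were added, which is the property the real structure buys at the cost of extra storage per segment.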

Trade-offs

Axis	Star-tree	Normal aggregation
Query latency	bounded, ~tree-depth; near-constant regardless of doc count	scales with matching docs
Index size	larger — extra tree + precomputed metrics per segment	none
Index/merge CPU	higher — build the tree at flush and every merge	none
Flexibility	only declared dims/metrics, only supported aggs	any field, any agg
Mutability	append-only; updates/deletes don't fit	full CRUD
Cardinality sensitivity	many high-cardinality dims explode the tree (bounded by `max_leaf_docs`, but storage grows)	indifferent

Supported aggregations are deliberately bounded to those that decompose into associative roll-ups: sum, min, max, value_count, avg (as sum/count), and the bucketing (terms, range, date_histogram on a declared dimension) that maps to tree descents. Sketch-based aggs (cardinality via HyperLogLog++, percentiles via t-digest) and anything reading an undeclared field are not supported and silently fall back — which is the correct behavior, but means a mis-declared mapping yields no speedup with no error. Check the current supported set in your branch rather than trusting this list:

cd ~/src/OpenSearch
grep -rn "SUPPORTED\|MetricStat\|StarTreeQueryHelper\|isSupported" \
  server/src/main/java/org/opensearch/search/startree/ \
  server/src/main/java/org/opensearch/index/compositeindex/datacube/ 2>/dev/null | head -20

Warning: Because falling back is silent, "the star-tree isn't speeding anything up" is a real and common report. Always confirm the tree was actually used (profile / a star-tree stat) before concluding it doesn't help — see Validation.

Trace it yourself

# 0. Build/run a node with the composite-index feature enabled (check your version's flag).
cd ~/src/OpenSearch
grep -rn "composite_index\|STAR_TREE_INDEX\|FeatureFlags" \
  server/src/main/java/org/opensearch/common/util/FeatureFlags.java | head

# 1. Create the index (mapping above), bulk-load enough docs to force several segments.
curl -s -XPUT 'localhost:9200/logs-startree' -H 'Content-Type: application/json' \
  --data-binary @startree-mapping.json
for b in 1 2 3 4 5; do
  for i in $(seq 1 5000); do
    printf '{"index":{}}\n{"status":%s,"region":"r%s","bytes":%s,"latency_ms":%s}\n' \
      "$((200 + (RANDOM%3)*100))" "$((RANDOM%5))" "$((RANDOM%100000))" "$((RANDOM%500))"
  done | curl -s -H 'Content-Type: application/x-ndjson' \
      -XPOST 'localhost:9200/logs-startree/_bulk?refresh=true' --data-binary @- >/dev/null
done
curl -s 'localhost:9200/_cat/segments/logs-startree?v'   # confirm multiple segments

# 2. Run an ELIGIBLE agg and an INELIGIBLE agg; compare took.
curl -s 'localhost:9200/logs-startree/_search?pretty' -H 'Content-Type: application/json' -d '
{ "size":0, "aggs": { "by_status": { "terms": { "field":"status" },
    "aggs": { "b": { "sum": { "field":"bytes" } } } } } }' | grep '"took"'

curl -s 'localhost:9200/logs-startree/_search?pretty' -H 'Content-Type: application/json' -d '
{ "size":0, "aggs": { "uniq": { "cardinality": { "field":"region" } } } }' | grep '"took"'  # falls back

# 3. profile=true — look for a star-tree collector/values source in the breakdown.
curl -s 'localhost:9200/logs-startree/_search?pretty' -H 'Content-Type: application/json' -d '
{ "profile":true, "size":0, "aggs": { "b": { "sum": { "field":"bytes" } } } }' \
  | grep -i "star\|StarTree\|collector" | head

# 4. Read the owning code.
grep -rln "StarTreeBuilder" server/src/main/java/org/opensearch/index/compositeindex/ | head
grep -rln "StarTreeQueryHelper\|canUseStarTree" server/src/main/java/org/opensearch/search/ | head
./gradlew :server:test --tests "*StarTree*" 2>&1 | tail -15

Common bugs and symptoms

Symptom	Likely cause	Where to look
Star-tree built but agg is no faster	request filters/groups on an undeclared field → silent fallback	the resolution check; confirm with `profile`
Mapping rejected at index creation	composite-index feature flag / append-only setting not enabled	`FeatureFlags`, `index.composite_index`, `index.append_only.enabled`
Index size ballooned	high-cardinality dimension(s) and/or `max_leaf_docs` too low	the `config`; raise `max_leaf_docs`, drop/reorder dims
`avg` from tree disagrees with scan `avg`	rounding when dividing stored `sum`/`value_count`, or float-metric accumulation order	the metric aggregator / `avg` decomposition; compare to scan path
Results wrong after updates/deletes	star-tree assumes append-only; mutations break the precomputed roll-ups	enforce append-only; verify the gating setting
Tree gone / different after force-merge	tree is rebuilt per segment at merge; that's expected, but a merge-path bug can drop it	the merge-time `DocValuesConsumer` / composite codec
`terms` on a declared dim returns fewer buckets than scan	dimension encoding / ordinal handling in the tree vs the field	`StarTreeMapper` dimension type vs the field's docvalues

Validation: prove you understand this

Draw a star-tree for dimensions [a, b] and metric sum(m), and point to the single node that answers "sum(m) over everything with no filter." What is that node called and why does it make a no-filter query O(1)?
Explain what max_leaf_docs bounds, and the storage-vs-latency consequence of making it very small vs very large.
List the exact eligibility conditions for an aggregation to be resolved by the star-tree, and give one agg that is not supported and why it falls back.
Where in the segment lifecycle is the tree built, and why does that make it immutable and "free" at query time? Tie it to flush and merge.
A teammate says "I added a star-tree but my dashboard is the same speed." List the three things you would check (in order) and the one command that proves whether the tree was used.
Explain why a star-tree-resolved aggregation can still feed the unchanged InternalAggregation.reduce, and how it composes with concurrent segment search.

Next: Tiered Caching, or the star-tree aggregation capstone.

Tiered Caching

OpenSearch has always had a request cache: a per-shard cache of the results of expensive, repeatable reads — most importantly aggregation results on indices that do not change between requests (think time-series dashboards hitting yesterday's data over and over). It lives on the JVM heap, and that is its ceiling. Heap is the scarcest resource on a data node; a request cache big enough to hold a busy dashboard's whole working set competes directly with everything else that wants heap, and the circuit breakers will not let it win.

Tiered caching breaks that ceiling. Instead of one on-heap cache that evicts to nowhere, it makes the cache a hierarchy: a small, fast on-heap tier backed by a much larger on-disk tier. An entry evicted from heap does not vanish — it spills over to disk, where a cache hit is still vastly cheaper than recomputing the aggregation. The data structure that does this is TieredSpilloverCache, and the whole thing is built on a deliberately pluggable cache SPI so the disk tier (today, ehcache) can be swapped.

This chapter explains the hierarchy, the SPI that makes it pluggable, the disk tier, the serialization that crossing the heap/disk boundary forces, and how it all relates to the request cache and the circuit breakers you already know. It does not re-derive the request cache or the search path — those are search execution and the existing cache code; this is the layer added beneath the request cache.

The design and benchmark issues to read alongside this chapter: [Proposal] Tiered caching — OpenSearch #10024 and [META] performance benchmark for tiered caching — #11464.

After this chapter you can:

Describe the on-heap → disk hierarchy and what spillover means precisely.
Name TieredSpilloverCache, the ICache/CacheService SPI, and the cache-ehcache disk tier, and find each in the tree.
Explain why crossing to disk forces key/value serialization and what that costs.
Reason about eviction, invalidation, the relevant settings, and the interaction with the request cache and circuit breakers.

Note: This is the infrastructure under the request cache, exposed as a general caching SPI. The same machinery is intended to back other on-heap caches over time, not just the request cache. Read it as "OpenSearch grew a pluggable, tiered cache framework, and the request cache was the first consumer."

The problem, precisely

The request cache is heap-bound, and heap is precious. That produces a sharp, unsatisfying trade-off:

Make the request cache small → high miss rate on dashboards with a large working set → expensive aggregations recomputed constantly.
Make it large → it steals heap from the rest of the node, raising GC pressure and tripping the parent circuit breaker.

There is no good on-heap-only answer for a working set that is too large to keep on heap comfortably but far cheaper to read back from disk than to recompute. That is exactly the gap a second, on-disk tier fills. Disk is slower than heap but, for an expensive aggregation, typically far faster than re-running it, and it does not consume the resource (heap) that you are trying to protect.

flowchart LR
    Q["repeatable read<br/>(e.g. dashboard agg)"] --> H{"on-heap tier hit?"}
    H -->|hit| FAST["return (fastest)"]
    H -->|miss| D{"on-disk tier hit?"}
    D -->|hit| MED["deserialize + return<br/>(slower than heap,<br/>far faster than recompute)"]
    D -->|miss| COMP["recompute the result<br/>(slowest), then cache"]

The bet tiered caching makes: most "misses" in a heap-only cache are recently evicted entries that are still useful. Catch them on disk and you convert recomputes into cheap disk reads.

The cache hierarchy and `TieredSpilloverCache`

The core class is TieredSpilloverCache. It implements the same cache interface as any single-tier cache, but internally it holds two caches — an on-heap tier and an on-disk tier — and routes between them.

The behavior, as a contract:

get(key) — check the on-heap tier first. On a hit, return. On a miss, check the on-disk tier. A disk hit is returned (and, depending on policy, may be promoted back to heap). A miss in both is a real cache miss.
put(key, value) — write to the on-heap tier.
eviction from heap — when the on-heap tier evicts an entry (it is full), the entry spills over into the on-disk tier rather than being discarded. This is the word in the class name and the whole point of the design.
eviction from disk — the disk tier has its own bound; when it is full, its evictions are the ones that truly leave the cache.

flowchart TD
    PUT["put(key, value)"] --> HEAP["on-heap tier<br/>(fast, small, bounded by heap budget)"]
    HEAP -->|"evicted (heap full)"| SPILL["spill over"]
    SPILL --> DISK["on-disk tier<br/>(ehcache, larger, serialized)"]
    DISK -->|"evicted (disk full)"| GONE["leaves the cache (true eviction)"]
    GET["get(key)"] --> HEAP
    HEAP -->|miss| DISK
    DISK -->|hit| BACK["deserialize → return<br/>(optionally promote to heap)"]
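
The contract above fits in a few dozen lines if you accept heavy simplification. The sketch below assumes a hypothetical `ToySpilloverCache` built from two access-ordered `LinkedHashMap`s with LRU eviction; it is not the `TieredSpilloverCache` implementation, it does no promotion on a disk hit, and the "disk" tier is just a map of byte arrays standing in for a serialized store.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy two-tier LRU cache. Hypothetical names; this is not the TieredSpilloverCache code.
final class ToySpilloverCache {
    private final Map<String, byte[]> disk; // stands in for the serialized on-disk tier
    private final Map<String, String> heap; // stands in for the on-heap tier
    int trueEvictions = 0;                  // entries that left the cache entirely

    ToySpilloverCache(final int heapMax, final int diskMax) {
        this.disk = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                boolean evict = size() > diskMax;
                if (evict) {
                    trueEvictions++; // disk eviction is the real eviction
                }
                return evict;
            }
        };
        this.heap = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() <= heapMax) {
                    return false;
                }
                // Spill: serialize the evicted entry into the disk tier instead of dropping it.
                disk.put(eldest.getKey(), eldest.getValue().getBytes(StandardCharsets.UTF_8));
                return true;
            }
        };
    }

    void put(String key, String value) {
        disk.remove(key);     // avoid a stale copy of the same key on disk
        heap.put(key, value); // new entries always land on heap first
    }

    String get(String key) {
        String onHeap = heap.get(key);
        if (onHeap != null) {
            return onHeap;
        }
        byte[] bytes = disk.get(key); // a disk hit pays for deserialization
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    boolean onHeap(String key) { return heap.containsKey(key); }
    boolean onDisk(String key) { return disk.containsKey(key); }

    public static void main(String[] args) {
        ToySpilloverCache cache = new ToySpilloverCache(2, 2);
        cache.put("a", "A");
        cache.put("b", "B");
        cache.put("c", "C");                   // heap is full: "a" spills to disk
        System.out.println(cache.onDisk("a"));
        System.out.println(cache.get("a"));    // served from the disk tier
    }
}
```

With both tiers sized at two entries, inserting a fifth key forces the first true eviction: the oldest spilled entry leaves the disk tier and a later `get` for it is a real miss.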

cd ~/src/OpenSearch
# Find the spillover cache and its tier wiring.
grep -rln "TieredSpilloverCache" server/src/main/java modules plugins | head
grep -rn "spillover\|onHeap\|getOnHeapCache\|getDiskCache\|evict" \
  $(grep -rl "class TieredSpilloverCache" server modules plugins) | head -20

Warning: "spill over to disk" is not free. Every entry that crosses the heap→disk boundary must be serialized to bytes (and deserialized on a disk hit). A cache whose values are huge or whose serializer is slow can spend more on serialization than it saves on recompute. The benchmark issue #11464 exists precisely to measure where that trade pays off and where it does not.

The pluggable cache SPI: `ICache` and `CacheService`

Tiered caching did not just add a class; it added a caching framework. The two tiers and the spillover logic sit behind a deliberately pluggable Service Provider Interface so that the disk implementation is not hard-wired into core.

The pieces, conceptually (grep to confirm exact names/packages in your checkout — they live under the cache common package):

Role	What it is	Why it exists
`ICache<K, V>`	the cache interface every tier implements (`get`/`put`/`invalidate`/`refresh`/`count`/`close`)	one contract for on-heap, on-disk, and the tiered wrapper alike — so `TieredSpilloverCache` can hold two `ICache`s and be one
`ICache.Factory`	builds an `ICache` from settings/config	lets a tier be constructed generically by the service
`CacheService`	the registry/factory that resolves which cache implementation to use for a given store name	the seam where a plugin contributes a disk tier
`CachePlugin`	the extension interface a plugin implements to register cache factories	how `cache-ehcache` plugs in (same pattern as any plugin)
`CacheType`	enumerates the consumers (e.g. the request cache) so each can be configured independently	so the request cache can be tiered without forcing every cache to be

# The SPI lives under the common cache package. Walk it.
ls server/src/main/java/org/opensearch/common/cache/
grep -rn "interface ICache\|interface Factory\|class CacheService\|interface CachePlugin\|enum CacheType\|enum CacheStoreType" \
  server/src/main/java/org/opensearch/common/cache/ | head -20

The shape to internalize: TieredSpilloverCache is itself an ICache. It is a cache built out of two caches. That recursion is what makes the framework clean — the request cache asks CacheService for "the cache for INDICES_REQUEST_CACHE," and what it gets back might be a plain on-heap cache or a tiered spillover cache, depending on settings, and it does not care which.

flowchart TD
    RC["IndicesRequestCache<br/>(consumer)"] --> CS["CacheService.createCache(CacheType.INDICES_REQUEST_CACHE)"]
    CS -->|tiered enabled?| TSC["TieredSpilloverCache : ICache"]
    CS -->|tiered disabled| OHC["OpenSearchOnHeapCache : ICache"]
    TSC --> T1["on-heap ICache"]
    TSC --> T2["disk ICache (ehcache)"]
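
The "cache built out of two caches" shape, and the registry that hides it from the consumer, can be sketched as follows. Every name here (`TinyCache`, `TinyMapCache`, `TinyTieredCache`, `TinyCacheService`, the store names) is hypothetical and much smaller than the real SPI; the real `CacheService` takes a cache config and a `CacheType`, and its fallback behavior for an unknown store may differ from the exception used here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical miniature of the SPI: one contract for every tier and for the wrapper.
interface TinyCache {
    String get(String key);
    void put(String key, String value);
    String storeName();
}

// A single-tier cache; the name is only a label for the demo.
final class TinyMapCache implements TinyCache {
    private final Map<String, String> map = new HashMap<>();
    private final String name;

    TinyMapCache(String name) { this.name = name; }
    public String get(String key) { return map.get(key); }
    public void put(String key, String value) { map.put(key, value); }
    public String storeName() { return name; }
}

// The wrapper is itself a TinyCache, so callers cannot tell it from a single tier.
final class TinyTieredCache implements TinyCache {
    private final TinyCache upper;
    private final TinyCache lower;

    TinyTieredCache(TinyCache upper, TinyCache lower) {
        this.upper = upper;
        this.lower = lower;
    }
    public String get(String key) {
        String value = upper.get(key);
        return value != null ? value : lower.get(key);
    }
    public void put(String key, String value) { upper.put(key, value); }
    public String storeName() { return "tiered(" + upper.storeName() + "," + lower.storeName() + ")"; }
}

// Registry keyed by store name; a plugin would contribute entries to this map.
final class TinyCacheService {
    private final Map<String, Supplier<TinyCache>> factories = new HashMap<>();

    void register(String storeName, Supplier<TinyCache> factory) {
        factories.put(storeName, factory);
    }

    TinyCache createCache(String configuredStoreName) {
        Supplier<TinyCache> factory = factories.get(configuredStoreName);
        if (factory == null) {
            throw new IllegalArgumentException("no cache factory registered for store: " + configuredStoreName);
        }
        return factory.get();
    }

    static TinyCacheService sample() {
        TinyCacheService service = new TinyCacheService();
        service.register("onheap", () -> new TinyMapCache("onheap"));
        service.register("tiered", () -> new TinyTieredCache(new TinyMapCache("onheap"), new TinyMapCache("disk")));
        return service;
    }

    public static void main(String[] args) {
        TinyCache cache = sample().createCache("tiered"); // the consumer only sees TinyCache
        cache.put("k", "v");
        System.out.println(cache.storeName() + " -> " + cache.get("k"));
    }
}
```

The consumer's code is identical for `"onheap"` and `"tiered"`; only the configured store name changes. Asking for a store nobody registered is the toy analog of a missing disk-tier plugin.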

The disk tier: `cache-ehcache`

The on-disk tier ships as a plugin, not core code: cache-ehcache. It wraps Ehcache's disk store and exposes it as an ICache implementation that CacheService can hand back as the lower tier.

Why a plugin and not core:

It pulls in an external dependency (Ehcache, with its own licensing surface; see licensing). Keeping it as a plugin keeps that dependency out of the core distribution for clusters that do not want a disk tier.
It proves the SPI works. If the disk tier is just "another CachePlugin that registers an ICache.Factory," then the framework is genuinely pluggable and a future tier (a different disk store, a remote cache) is the same shape.

# The ehcache disk-tier plugin.
ls plugins/cache-ehcache/ 2>/dev/null || find . -type d -name "*ehcache*"
grep -rn "Ehcache\|EhcacheDiskCache\|CachePlugin\|getCacheFactoryMap\|ICache.Factory" \
  plugins/cache-ehcache/src/main/java 2>/dev/null | head -20

The disk tier is responsible for the things heap caches never had to think about: file management, on-disk size accounting, durability semantics (a cache is not a source of truth — see invalidation below), and the serialization boundary.

Key/value serialization across the boundary

An on-heap cache stores Java object references. A disk cache cannot — it stores bytes. So the SPI carries a serialization contract: to put an entry in the disk tier you must turn its key and value into bytes, and a disk hit must turn them back.

This is not a footnote; it shapes the whole design:

Keys (for the request cache, a composite of shard + the request's cache key) must serialize deterministically — the same logical request must produce the same bytes, or you get false misses.
Values (the cached result bytes) are serialized aggregation/query results. They are already close to a serialized form, which is part of why the request cache was a natural first consumer.
The disk tier deals in BytesReference/byte arrays; a Serializer<T, byte[]>-style abstraction converts between the consumer's objects and bytes. Grep for the serializer interface to see the exact contract.

grep -rn "Serializer\|serialize\|BytesReference\|byte\[\]" \
  server/src/main/java/org/opensearch/common/cache/ | grep -i serial | head
grep -rn "Serializer\|serialize\|deserialize" \
  plugins/cache-ehcache/src/main/java 2>/dev/null | head
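
What "serialize deterministically" means in practice: fixed field order, an explicit charset, and an unambiguous encoding of field boundaries. The sketch below assumes a hypothetical `ToyCacheKey` made of a shard identifier and the request's key bytes; the real key class and serializer interface in your checkout will differ, so treat this as an illustration of the property rather than the contract.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical composite cache key: shard identity plus the request's cache-key bytes.
final class ToyCacheKey {
    private final String shardId;
    private final byte[] requestKey;

    ToyCacheKey(String shardId, byte[] requestKey) {
        this.shardId = shardId;
        this.requestKey = requestKey.clone();
    }

    // Deterministic: fixed field order, explicit charset, length prefix before each field.
    byte[] serialize() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            byte[] shard = shardId.getBytes(StandardCharsets.UTF_8);
            out.writeInt(shard.length);
            out.write(shard);
            out.writeInt(requestKey.length);
            out.write(requestKey);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static ToyCacheKey of(String shardId, String request) {
        return new ToyCacheKey(shardId, request.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] first = of("[logs][0]", "{\"size\":0}").serialize();
        byte[] second = of("[logs][0]", "{\"size\":0}").serialize();
        System.out.println(Arrays.equals(first, second)); // same logical key, same bytes
        // Length prefixes keep ("ab","c") and ("a","bc") from producing identical bytes.
        System.out.println(Arrays.equals(of("ab", "c").serialize(), of("a", "bc").serialize()));
    }
}
```

Two independently constructed keys for the same logical request produce identical bytes, so the disk tier finds the entry. Anything that varies between runs or instances (an identity hash, a timestamp, the iteration order of an unordered map) would break that equality and show up as false misses.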

Note: serialization cost is the variable that decides whether tiered caching wins. A heap hit is a pointer follow; a disk hit is a disk read plus deserialization. The design pays off when (recompute cost) ≫ (disk read + deserialize). For cheap-to-recompute results, the disk tier can be net negative — which is the entire reason #11464 is a benchmark meta-issue, not a feature issue.

Eviction and invalidation

Two different mechanisms, often confused:

Eviction is capacity management — the cache is full, something has to go. In the tiered cache, on-heap eviction spills to disk; disk eviction is true removal. Each tier has its own size policy (entry count and/or byte size).
Invalidation is correctness management — the cached entry is no longer valid and must be dropped regardless of capacity. For the request cache, the classic trigger is a refresh that changes the shard's contents: a cached aggregation result is only valid as long as the underlying segments are unchanged. When a shard refreshes (see refresh/flush/merge), the entries keyed to the old reader are invalidated.

In a tiered cache, invalidation must reach both tiers — invalidating only the heap tier while a stale copy lingers on disk is a correctness bug. The cleanup that removes invalidated entries (and the keys that map to them) has to walk the hierarchy.

grep -rn "invalidate\|invalidateAll\|cleanCache\|CleanupKey\|keysToClean\|refreshKey" \
  server/src/main/java/org/opensearch/common/cache server/src/main/java/org/opensearch/indices | head -20

Warning: the most dangerous tiered-cache bug class is a stale entry surviving on disk after the heap tier was invalidated. Single-tier, invalidation was "drop it from the map." Tiered, you must invalidate through to disk. When you review or write invalidation code, prove the disk tier is reached — a unit test that only checks the heap tier is a test that passes while the bug ships.
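
The failure is easy to reproduce in miniature. The sketch below assumes a hypothetical `ToyTwoTierStore` with two plain maps; `invalidateHeapOnly` is the buggy path and `invalidate` is the correct one. It is a model of the bug class, not the cleanup code in `IndicesRequestCache`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-tier store used only to demonstrate the stale-on-disk bug class.
final class ToyTwoTierStore {
    final Map<String, String> heap = new HashMap<>();
    final Map<String, String> disk = new HashMap<>();

    // Reads fall through from heap to disk, as in the tiered cache.
    String get(String key) {
        String value = heap.get(key);
        return value != null ? value : disk.get(key);
    }

    // BUG: drops the heap copy but leaves any spilled copy readable.
    void invalidateHeapOnly(String key) {
        heap.remove(key);
    }

    // Correct: invalidation walks the whole hierarchy.
    void invalidate(String key) {
        heap.remove(key);
        disk.remove(key);
    }

    public static void main(String[] args) {
        ToyTwoTierStore store = new ToyTwoTierStore();
        store.disk.put("agg@reader-1", "stale result"); // entry already spilled to disk

        store.invalidateHeapOnly("agg@reader-1");
        System.out.println(store.heap.containsKey("agg@reader-1")); // heap-only check looks clean
        System.out.println(store.get("agg@reader-1"));              // but the stale value is served

        store.invalidate("agg@reader-1");
        System.out.println(store.get("agg@reader-1"));              // gone from both tiers
    }
}
```

A test that asserts only on `heap.containsKey` passes for both methods. The assertion that distinguishes them is the one that reads through `get` (or inspects the disk tier directly) after invalidation.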

Settings

The exact keys evolve; grep your checkout. The shape is: a master switch per cache type to enable tiering, a choice of disk store, and per-tier size bounds.

Setting (shape — confirm by grep)	Scope	Meaning
`indices.requests.cache.store.name`	node	which cache store backs the request cache (`opensearch_onheap` vs the tiered store)
`*.tiered_spillover.onheap.store.size`	node	size bound for the on-heap tier
`*.tiered_spillover.disk.store.size`	node	size bound for the on-disk (ehcache) tier
`indices.requests.cache.size`	node	the existing request-cache heap budget (the on-heap tier inherits this concern)

# Find the real setting keys and defaults in your checkout.
grep -rn "tiered_spillover\|TIERED\|store.name\|disk.store\|onheap.store" \
  server/src/main/java/org/opensearch/common/cache \
  plugins/cache-ehcache/src/main/java 2>/dev/null | head -20

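# Enable the tiered store for the request cache (key names vary; confirm above first).
# If your version registers this key as a static node setting, this call is rejected;
# in that case set the key in opensearch.yml and restart the node instead.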
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "persistent": {
    "indices.requests.cache.store.name": "tiered_spillover"
  }
}'
# Then exercise a cacheable agg twice and watch request_cache stats.
curl -s 'localhost:9200/myindex/_stats/request_cache?pretty' | grep -A6 request_cache

Relationship to the request cache and circuit breakers

Two existing subsystems you must hold in your head while working here.

The request cache is the first and primary consumer. Tiered caching does not replace it; it gives it a bigger, cheaper backing store. The request cache still decides what is cacheable (only size: 0 / non-scoring repeatable requests on unchanged shards, by default) and still keys entries the same way; tiering changes only where evicted entries go. If you are debugging "my aggregation isn't being cached," that is request-cache eligibility logic, not tiered caching.

The circuit breakers (see circuit breakers and memory) guard the JVM heap. The on-heap tier is inside that guarded space: it counts against heap and the breakers can still trip on it. The on-disk tier is outside heap, so the heap circuit breaker does not see it. That is the upside (you get cache capacity that does not pressure heap), but it is not free: disk space and serialization CPU are real, just guarded by the cache's own size bounds rather than a breaker. This is structurally the same lesson as k-NN's off-heap memory: memory that lives outside the heap needs its own accounting, because the heap-oriented tools cannot see it.

flowchart TD
    subgraph heap["JVM heap (guarded by circuit breakers)"]
        OH["on-heap cache tier"]
    end
    subgraph disk["disk (guarded by the cache's own size bounds)"]
        DK["ehcache disk tier"]
    end
    OH -->|spill| DK
    CB["parent / request circuit breaker"] -. sees .-> OH
    CB -. blind to .-> DK

Common bugs and symptoms

Symptom	Root cause	Where to look
Stale aggregation result returned after an index refresh	invalidation reached the heap tier but not the disk tier	the invalidation/cleanup path; prove it walks both tiers
Disk tier never used; everything is heap or recompute	tiered store not selected for the cache type	`indices.requests.cache.store.name`; `CacheService` resolution
Latency worse with tiering on	serialize/deserialize cost exceeds recompute cost for these values	benchmark per #11464; reconsider for cheap aggs
False cache misses (same request recomputed)	non-deterministic key serialization	the key `Serializer`; ensure stable byte output
`cache-ehcache` plugin not found / disk tier missing	the plugin isn't installed in the distribution	install/enable `cache-ehcache`; it's a plugin, not core
Disk usage grows unbounded	disk-tier size bound unset or too high; true evictions not happening	`*.disk.store.size`; the disk tier's eviction policy
Heap pressure unchanged after enabling tiering	on-heap tier still sized large; tiering moves evicted entries, not the hot set	on-heap tier size vs request-cache budget
`ClassCastException` / deserialize failure on disk hit	value serializer mismatch across versions	the `Serializer`; BWC of the serialized form (see Serialization and BWC)

Validation: prove you understand this

Draw the on-heap → disk hierarchy and define spillover precisely. What happens to an entry evicted from the heap tier vs evicted from the disk tier?
Explain why TieredSpilloverCache is itself an ICache, and what that buys the request cache (which does not know whether its cache is tiered).
Name the SPI pieces — ICache, ICache.Factory, CacheService, CachePlugin, CacheType — and say which one a plugin like cache-ehcache implements to contribute a disk tier.
Why does crossing to the disk tier force serialization of keys and values, and what is the failure mode of a non-deterministic key serializer?
Distinguish eviction from invalidation. Why is "invalidate the heap tier only" a correctness bug, and what triggers request-cache invalidation in the first place?
Explain the relationship to the circuit breakers: which tier they guard, which they are blind to, and why that is the same lesson as off-heap native memory in k-NN.
Read #10024 and #11464. State the one variable that decides whether tiered caching is a win, and explain why the benchmark issue is separate from the proposal.

When you can do all seven, revisit circuit breakers and memory with fresh eyes (the heap/off-heap accounting story is the same shape) and search execution to place the request cache in the read path it short-circuits.

Next: Star-tree aggregations — another "precompute to buy latency" feature — or back to real issues and RFCs.

Plugins, Extensions & Cross-Repo Labs

OpenSearch is not one repository. The engine in opensearch-project/OpenSearch is a deliberately small core; the product most people run is that core plus a constellation of large plugins (security, k-NN, SQL/PPL, alerting, anomaly-detection, ml-commons, index-management, observability) plus the OpenSearch Dashboards UI plus the language clients plus Apache Lucene underneath. Almost every interesting production bug, and a large fraction of the open issues you will triage, lives at a boundary between these repos — not inside any one of them.

This section builds the single most valuable cross-repo skill: given a symptom, attribute it to the right repository and produce a minimal reproducer there. It is the OpenSearch analog of the Hive-on-Tez labs in the sibling Tez curriculum: there, the boundary was Hive's TezTask handing a DAG to Tez; here, the canonical boundary is a Dashboards visualization handing a _search request to core, and the second boundary is a plugin implementing a core extension point. Filing a vector-search bug on core when it belongs in k-NN, or a rendering bug on core when Dashboards built the wrong DSL, wastes the maintainer's time and yours. These six labs make the attribution mechanical.

This builds directly on the Plugin Architecture deep dive (read it first — these labs assume you know how PluginsService loads a plugin and gives it an isolated classloader) and connects to Search Execution, Serialization and BWC, and Stage 8 — Plugin and Extension Compatibility.

Why cross-repo skill matters

A single-repo contributor can read server/ end to end and still be useless on the issues users actually file, because those issues span repos:

A Dashboards dashboard renders the wrong number. Is the bug in the TypeScript that built the aggregation, in core's reduce, or in Lucene's doc-values?
A knn query returns bad neighbors. Is the bug in the k-NN plugin's query builder, in core's search fan-out, or in the underlying Lucene/faiss/nmslib engine?
A request 403s. Is that the security plugin's action filter doing its job, a misconfigured role, or a core action that the filter mis-classifies?

None of these can be answered from one repository. The contributor who can say "this is a k-NN bug, here is the 12-line curl repro, and here is why it is not core" is worth ten who can only say "search is broken."

The repositories involved

Repo	Language	Role	What it owns
`opensearch-project/OpenSearch`	Java	Core engine	REST/transport/action layer, search & indexing, cluster state, the plugin SPI
`opensearch-project/security`	Java	Plugin	AuthN/AuthZ, TLS, action filters, field/document-level security
`opensearch-project/k-NN`	Java + C++ (JNI)	Plugin	`knn_vector` field type, `knn` query, faiss/nmslib/Lucene engines
`opensearch-project/sql`	Java + Kotlin	Plugin	SQL and PPL query languages over `_search`
`opensearch-project/alerting`	Kotlin	Plugin	Monitors, triggers, notifications
`opensearch-project/anomaly-detection`	Java	Plugin	Real-time and historical anomaly detectors
`opensearch-project/ml-commons`	Java	Plugin	Model serving, connectors, agents
`opensearch-project/index-management`	Kotlin	Plugin	ISM policies, rollups, transforms
`opensearch-project/observability`	Kotlin	Plugin	Saved objects for traces/metrics/logs
`opensearch-project/OpenSearch-Dashboards`	TypeScript	UI	Visualizations, the search source builder, server-side proxy to core
`opensearch-project/opensearch-java`	Java	Client	Typed Java client
`opensearch-project/opensearch-js`	TypeScript	Client	Node client; what Dashboards' server uses
`apache/lucene`	Java	Library	Inverted index, codecs, doc-values, `Query`/`Scorer`, HNSW vectors

Note: The big functional plugins live in separate repos precisely because they are large subsystems with their own teams and release cadences. They build against the published org.opensearch:opensearch artifacts, not against a checkout of core. That decoupling is exactly what makes cross-repo attribution a distinct skill — a core change can break a plugin with zero failing tests in the core repo. See Stage 8.

The labs

Lab	Focus	The boundary it teaches
P1: Dashboards Query → OpenSearch Search	The canonical cross-repo trace	Dashboards (TS) builds DSL → `opensearch-js` → core `RestSearchAction`/`TransportSearchAction`
P2: Inspecting a Plugin's Extension Points	Mapping plugin code to the core SPI	`k-NN`/`security` ↔ `MapperPlugin`/`SearchPlugin`/`ActionPlugin`/`EnginePlugin`
P3: Debugging a Failed Cross-Plugin Request	Reading a failure across two stacks	core ↔ plugin transport/action boundary; TRACE logging on both sides
P4: Bug Attribution	Deciding which repo owns a symptom	core vs. plugin vs. Dashboards vs. Lucene — the decision flowchart
P5: Reproducing Integration Bugs	Minimal deterministic cross-repo repro	`curl`/REST-YAML on a distro, or `OpenSearchIntegTestCase` with `nodePlugins()`
P6: Writing Diagnostics	The highest-leverage cross-repo contribution	Actionable boundary errors, `_cat/plugins`, Profile/`_tasks`, version-mismatch messages

The extension model in one screen

There are two ways code extends OpenSearch. You will work almost entirely with the first.

Plugins (the mainstream model). A plugin extends org.opensearch.plugins.Plugin and implements one or more extension interfaces, each of which the engine queries at startup. The relevant ones for these labs:

Interface	Method you'll grep for	Real plugin that uses it
`MapperPlugin`	`getMappers()`	`k-NN` registers `knn_vector`
`SearchPlugin`	`getQueries()`, `getAggregations()`	`k-NN` registers the `knn` query
`EnginePlugin`	`getEngineFactory()`	`k-NN` for the native vector engine path
`ActionPlugin`	`getActions()`, `getRestHandlers()`, `getActionFilters()`	`security` injects an action filter
`NetworkPlugin`	`getTransports()`, `getTransportInterceptors()`	`security` wraps transport with TLS/auth

Plugins run in-process with full access to internal APIs — powerful but version-coupled. This is why a plugin built against opensearch 3.1.0 will not load on a 3.0.0 node (see P5 and P6).

Extensions (forward-looking). OpenSearch 2.10+ ships an experimental Extensions SDK (opensearch-sdk-java) that runs extensions out of process over a defined protocol, for fault isolation and looser version coupling. Treat it as the strategic direction for some extension categories; in-process plugins remain the fully-supported mainstream model, and these labs are entirely about them.

How a plugin builds against core

This mechanic underlies every lab that touches an out-of-repo plugin (P2, P3, P5). A plugin's build.gradle applies the opensearch.opensearchplugin Gradle plugin and depends on the core artifacts at a pinned version:

// build.gradle of an out-of-repo plugin (e.g. k-NN), simplified
buildscript {
  ext { opensearch_version = "3.1.0-SNAPSHOT" }
  dependencies {
    classpath "org.opensearch.gradle:build-tools:${opensearch_version}"
  }
}
apply plugin: 'opensearch.opensearchplugin'
opensearchplugin {
  name 'opensearch-knn'
  classname 'org.opensearch.knn.plugin.KNNPlugin'
}
dependencies {
  compileOnly "org.opensearch:opensearch:${opensearch_version}"
}

To test a local, unreleased core change against a plugin, you publish core to your local Maven cache and point the plugin at the same version:

# In your core checkout — publish the snapshot artifacts locally
cd ~/OpenSearch
./gradlew publishToMavenLocal      # writes to ~/.m2/repository/org/opensearch/...

# In the plugin checkout — build against that exact snapshot
cd ~/k-NN
./gradlew assemble -Dopensearch.version=3.1.0-SNAPSHOT
ls build/distributions/            # opensearch-knn-3.1.0-SNAPSHOT.zip

The plugin produces a zip carrying its jars and a plugin-descriptor.properties. You install it into a runnable distro with bin/opensearch-plugin install, and the opensearch.version field in the descriptor must match the node version exactly. That version lock is the source of a whole class of cross-repo failures you'll learn to recognize.

# In a localDistro built from the same core version
bin/opensearch-plugin install file:///abs/path/opensearch-knn-3.1.0-SNAPSHOT.zip
bin/opensearch-plugin list
curl -s 'localhost:9200/_cat/plugins?v'    # quoted so the shell does not treat ? as a glob
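The exact-match rule can be pictured as a one-field comparison. The sketch below is a minimal illustration, not the installer's real logic: the helper names, the sample descriptor contents, and the message wording are assumptions made for this example. Only the opensearch.version field name comes from the text above.

```python
def parse_descriptor(text):
    """Parse the key=value lines of a plugin-descriptor.properties file."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_version_lock(descriptor_text, node_version):
    """Return (ok, message) for the exact-match rule described above."""
    built_for = parse_descriptor(descriptor_text).get("opensearch.version")
    if built_for is None:
        return False, "descriptor has no opensearch.version field"
    if built_for != node_version:
        return False, f"plugin built for {built_for}, node is {node_version}"
    return True, "versions match"


# Hypothetical descriptor contents, for illustration only.
SAMPLE = """
name=opensearch-knn
classname=org.opensearch.knn.plugin.KNNPlugin
opensearch.version=3.1.0
"""

print(check_version_lock(SAMPLE, "3.1.0"))
print(check_version_lock(SAMPLE, "3.0.0"))
```

The comparison is deliberately a plain string equality: there is no "compatible range" to reason about, which is why a one-patch-version difference is enough to block a load.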

Reading order

P1 and P2 are foundational — do them in order. P1 gives you the end-to-end Dashboards→core trace; P2 gives you the plugin↔SPI mapping. P3 and P4 are the debugging-and-attribution pair: P3 teaches you to read a cross-stack failure, P4 teaches you to attribute it. P5 and P6 are the contributor-facing skills — turning an attributed bug into a minimal repro (P5) and into a diagnostics patch that helps the next person (P6).

If you arrived here from the capstone or from Stage 8, P4 and P5 are the most directly relevant.

Validation for the section

You have absorbed this section when, given a freshly-failing query in a production OpenSearch + Dashboards + plugin deployment, you can:

Within 10 minutes, name which repo owns the failure (core / a named plugin / Dashboards / Lucene) and say why.
Within 30 minutes, locate the relevant code on both sides of the boundary.
Within 1 hour, capture the actual _search request (via browser devtools or the slow log, adding "profile": true to the body when you also need per-shard timings) and the failing component's log.
Within a day, produce a minimal curl/REST-YAML or OpenSearchIntegTestCase reproducer.
File the issue in the right repo with a repro that does not drag in the other repos unnecessarily.

That routine, executed crisply, is what gets cross-repo OpenSearch issues resolved. The labs build the muscle.

Lab P1: From a Dashboards Query to an OpenSearch Search

Background

This is the canonical cross-repo trace, the OpenSearch analog of "SQL to DAG." A user drags fields onto a Dashboards visualization. Dashboards (TypeScript) turns that into an aggregation _search (or _msearch) request, ships it over HTTP to the OpenSearch core engine (Java), where RestSearchAction parses it and TransportSearchAction fans it out to shards. The result comes back up the same path and Dashboards renders it.

You cannot debug a "the dashboard shows the wrong number" report without being able to walk this path in both directions and say, at each hop, who owns what. This lab makes you trace it concretely, capture the real request on the wire, and draw the boundary between Dashboards' job and core's job.

It builds on the Search Execution deep dive (what happens inside core once the request arrives), the REST Layer, and the Action Framework. It is the foundation for P4: Bug Attribution, where "is this Dashboards or core?" becomes a decision you make under time pressure.

Why This Lab Matters for Contributors

Most Dashboards bug reports arrive as screenshots ("this panel is wrong"). The maintainer's first job is to extract the actual request and response and decide whether the fault was in how Dashboards built the DSL or in how core answered it. If the captured request is the right DSL for what the user asked, and replaying it with curl makes core return the same "wrong" answer, you have proven the bug is in core (or Lucene) and can file a minimal repro there with zero Dashboards involvement. If curl returns the right answer for the request Dashboards sent, the bug is in Dashboards' rendering, not core. If the captured request itself is not what the user asked for, the bug is in how Dashboards built the DSL. That bisection is the whole skill.

Prerequisites

Requirement	Why
A running core node (`./gradlew run` or a `localDistro`) on `:9200`	The target of every request
OpenSearch Dashboards checkout/running instance on `:5601` (optional but recommended)	To originate the visualization request
`curl` and a browser with devtools	To capture and replay the request
Read Search Execution	You must know what core does after the request lands

Index a tiny dataset so a visualization has something to aggregate:

curl -s -XPUT localhost:9200/sales -H 'Content-Type: application/json' -d '{
  "mappings": { "properties": {
    "ts":     { "type": "date" },
    "region": { "type": "keyword" },
    "amount": { "type": "double" }
  }}}'

for r in us eu apac; do for i in 1 2 3 4 5; do
  curl -s -XPOST localhost:9200/sales/_doc -H 'Content-Type: application/json' \
    -d "{\"ts\":\"2026-06-1${i}\",\"region\":\"$r\",\"amount\":$((RANDOM % 100))}" >/dev/null
done; done
curl -s -XPOST localhost:9200/sales/_refresh >/dev/null

The two-sided picture

flowchart TD
  subgraph Browser["Browser (Dashboards client, TypeScript)"]
    V[Visualization editor] --> SS[SearchSource / agg config builder]
    SS --> DSL[Builds the _search DSL JSON]
    DSL --> HTTP1[POST /api/console or data plugin search route]
  end
  subgraph DashServer["Dashboards server (Node, TypeScript)"]
    HTTP1 --> PROXY[search strategy / OpenSearchClient]
    PROXY --> JS[opensearch-js client]
  end
  subgraph Core["OpenSearch core (Java)"]
    JS -->|HTTP POST /sales/_search| RC[RestController]
    RC --> RSA[RestSearchAction.parse]
    RSA --> NC[NodeClient.execute SearchAction]
    NC --> TSA[TransportSearchAction]
    TSA --> FAN[fan-out to shards: QueryPhase then FetchPhase]
    FAN --> RED[SearchPhaseController reduce]
  end
  RED -->|JSON response| JS
  JS --> PROXY
  PROXY --> HTTP1
  HTTP1 --> RENDER[Dashboards renders chart from agg buckets]

The boundary you must keep in your head: everything above RestController in the diagram is Dashboards' job; everything from RestController down is core's job. Lucene sits one level below QueryPhase/FetchPhase and owns the actual inverted-index and doc-values reads.

Step-by-Step Tasks

Step 1 — Build the request the way a visualization would

A Dashboards bar chart "sum of amount by region" is, on the wire, a terms aggregation with a sum sub-aggregation and size: 0 (Dashboards asks for no hits, only buckets). This is the JSON the TypeScript layer ultimately emits:

curl -s 'localhost:9200/sales/_search' -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "by_region": {
      "terms": { "field": "region", "size": 10 },
      "aggs": { "total": { "sum": { "field": "amount" } } }
    }
  }
}' | python3 -m json.tool

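You should get one bucket per region with a total.value. That bucket array is exactly what Dashboards turns into bars. The request is the contract. If this curl is correct but the chart is wrong, the bug is on the Dashboards side: either it sent a different request than this one (Step 3 shows how to check) or it rendered a correct response incorrectly.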

Step 2 — Find where Dashboards constructs the DSL (TS side, high level)

You do not need to be a TypeScript expert; you need to know where the DSL is born so you can reason about Dashboards-side bugs. In the Dashboards repo, the search-source and aggregation-builder machinery lives under the data plugin:

# In an OpenSearch-Dashboards checkout
grep -rln "SearchSource\|buildOpenSearchQuery\|toDsl\|AggConfigs" \
  src/plugins/data/common src/plugins/data/public | head

# The agg builder that turns an AggConfig into request JSON
find src/plugins/data -name "agg_configs.ts" -o -name "*search_source*.ts" | head

Conceptually the chain is: the visualization's AggConfigs → agg_configs.toDsl() builds the aggs block → SearchSource assembles the full body (query filters from the search bar, time range, size: 0) → the request is handed to the data plugin's search strategy → the server-side OpenSearchClient (backed by opensearch-js) sends it to core. The Dashboards server never invents query semantics; it forwards. That matters for attribution: a malformed aggregation almost always traces to the client-side agg builder, not the server proxy.
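To make the chain concrete, here is a minimal model of the two building steps: an agg builder that produces the aggs block, and a search-source step that wraps it with size: 0 and an optional time-range filter. It is a sketch under stated assumptions, not Dashboards' actual TypeScript: the function names, the config dictionaries, and the shape of the time filter are all hypothetical.

```python
import json


def build_aggs(bucket, metrics):
    """Model of the agg-builder step: one bucket agg with metric sub-aggs."""
    body = {bucket["type"]: {"field": bucket["field"], "size": bucket["size"]}}
    sub_aggs = {m["name"]: {m["type"]: {"field": m["field"]}} for m in metrics}
    if sub_aggs:
        body["aggs"] = sub_aggs
    return {bucket["name"]: body}


def build_search_body(bucket, metrics, time_field=None, gte=None, lte=None):
    """Model of the search-source step: wrap the aggs with size 0 and filters."""
    body = {"size": 0, "aggs": build_aggs(bucket, metrics)}
    if time_field is not None:
        # Hypothetical rendering of the time picker as a range filter.
        time_range = {"range": {time_field: {"gte": gte, "lte": lte}}}
        body["query"] = {"bool": {"filter": [time_range]}}
    return body


# "Sum of amount by region", as in Step 1.
BUCKET = {"name": "by_region", "type": "terms", "field": "region", "size": 10}
METRICS = [{"name": "total", "type": "sum", "field": "amount"}]

print(json.dumps(build_search_body(BUCKET, METRICS), indent=2))
print(json.dumps(build_search_body(BUCKET, METRICS, "ts", "2026-06-11", "2026-06-15")))
```

Without a time range, the output is the same body that Step 1 sends by hand. Everything that can go wrong in this model happens before any HTTP call, which is the attribution point the paragraph above makes about the client-side builder.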

Step 3 — Capture the actual request on the wire

Do not trust your reconstruction — capture the real bytes. Three ways, in order of preference:

(a) Browser devtools. Open the visualization, open devtools → Network, filter for _search or the data plugin's /internal/search route, and copy the request payload. This is the ground truth of what Dashboards built.

(b) Core slow log. Make core log every search by dropping the slow-log threshold to zero on the index, then read the request from the log:

curl -s -XPUT 'localhost:9200/sales/_settings' -H 'Content-Type: application/json' -d '{
  "index.search.slowlog.threshold.query.trace": "0ms",
  "index.search.slowlog.threshold.fetch.trace": "0ms"
}'
# now run the visualization; then tail the slow log
grep -A2 "took_millis" logs/*_index_search_slowlog.log 2>/dev/null | tail -40

The slow log records the shard-level source — the exact aggregation core received.

(c) profile: true. Add "profile": true to the body and core returns a profile section breaking down query/agg/fetch time per shard — invaluable when the question becomes "is it slow, and where?"

curl -s 'localhost:9200/sales/_search?typed_keys=true' -H 'Content-Type: application/json' -d '{
  "size": 0, "profile": true,
  "aggs": { "by_region": { "terms": { "field": "region" } } }
}' | python3 -c "import sys,json;d=json.load(sys.stdin);print(json.dumps(d['profile']['shards'][0]['aggregations'],indent=2)[:800])"
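For option (b), the request body has to be pulled back out of a slow-log line. The sketch below assumes the line contains a source[...] segment holding the request JSON; the sample line is hypothetical and abbreviated, so check the layout against your own log before relying on it. It uses json.JSONDecoder.raw_decode so that brackets inside the JSON cannot confuse the extraction.

```python
import json


def extract_source(slowlog_line):
    """Return the request body embedded in a slow-log line, or None."""
    marker = "source["
    start = slowlog_line.find(marker)
    if start == -1:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it on the line.
    body, _end = json.JSONDecoder().raw_decode(slowlog_line, start + len(marker))
    return body


# Hypothetical, abbreviated slow-log line (layout is an assumption).
LINE = (
    '[sales][0] took[1.2ms], took_millis[1], '
    'source[{"size":0,"aggregations":{"by_region":'
    '{"terms":{"field":"region","size":10}}}}], id[]'
)

print(json.dumps(extract_source(LINE), indent=2))
```

The printed body can be pasted straight into a curl replay, which is the whole point of capturing it.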

Note: typed_keys=true is what Dashboards sends so it can tell a terms agg from a date_histogram in the response by the key prefix (sterms#by_region). If you replay a captured request and the response shape differs from what Dashboards expects, check whether typed_keys was set — a classic Dashboards/core mismatch.
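A client that relies on typed_keys has to split each response key on the # separator. The following is a minimal sketch of that step, with hypothetical helper names and a made-up response fragment; it only illustrates the type#name convention described in the note.

```python
def split_typed_key(key):
    """'sterms#by_region' -> ('sterms', 'by_region'); untyped keys get type None."""
    agg_type, separator, name = key.partition("#")
    if not separator:
        return None, key
    return agg_type, name


def index_by_name(aggregations):
    """Map plain agg name -> (agg type, agg body) for one level of aggregations."""
    indexed = {}
    for key, body in aggregations.items():
        agg_type, name = split_typed_key(key)
        indexed[name] = (agg_type, body)
    return indexed


# Hypothetical response fragments, with and without typed_keys.
TYPED = {"sterms#by_region": {"buckets": []}}
UNTYPED = {"by_region": {"buckets": []}}

print(index_by_name(TYPED)["by_region"][0])
print(index_by_name(UNTYPED)["by_region"][0])
```

When the type comes back as None, the client has to infer the aggregation type from the shape of the body, which is the mismatch the note warns about.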

Step 4 — Watch the request land in core

Replay the captured request and confirm where it enters the Java side. The entry point is RestSearchAction:

grep -n "class RestSearchAction\|prepareRequest\|parseSearchRequest\|routes()" \
  server/src/main/java/org/opensearch/rest/action/search/RestSearchAction.java
grep -n "class TransportSearchAction\|protected void doExecute" \
  server/src/main/java/org/opensearch/action/search/TransportSearchAction.java

If you launched core with ./gradlew run --debug-jvm, connect your IDE's debugger (JDWP on port 5005 by default; a plain ./gradlew run does not enable it) and set a breakpoint in RestSearchAction.prepareRequest. Run the visualization; the breakpoint fires with the parsed SearchRequest. You are now standing exactly on the Dashboards↔core boundary, looking at the request Dashboards built, parsed into core's object model. From here, the Search Execution deep dive takes over: TransportSearchAction fans out, QueryPhase/FetchPhase run per shard, SearchPhaseController reduces, and the JSON goes back.

Step 5 — Walk the response back up and find the rendering boundary

The reduced response carries the aggregations.by_region.buckets array. On the way back:

Hop	Owner	What it does to the response
`SearchPhaseController.reduce`	core	merges per-shard `InternalTerms` into final buckets
HTTP response	core	serializes to JSON; `typed_keys` prefixes agg names
`opensearch-js` → `OpenSearchClient`	Dashboards server	forwards body unchanged
data plugin search response handler	Dashboards client	maps buckets to a tabular form
visualization renderer	Dashboards client	draws bars from the table

The rendering boundary is the last hop. If the buckets in the captured JSON are correct but the bars are wrong (wrong order, wrong label, off-by-one time-zone bucketing), the bug is in Dashboards' renderer or the agg-config that chose the wrong interval/time zone — not core.
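The "maps buckets to a tabular form" hop is small enough to model directly. This sketch flattens a terms-plus-sum response into rows; the function name and the sample numbers are hypothetical, and real Dashboards code does considerably more (formatting, typed keys, nested buckets).

```python
def buckets_to_rows(response, agg_name, metric_name):
    """Flatten one level of buckets into (key, doc_count, metric value) rows."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"], b[metric_name]["value"]) for b in buckets]


# Hypothetical response for the "sum of amount by region" request.
RESPONSE = {
    "aggregations": {
        "by_region": {
            "buckets": [
                {"key": "apac", "doc_count": 5, "total": {"value": 210.0}},
                {"key": "eu", "doc_count": 5, "total": {"value": 187.0}},
                {"key": "us", "doc_count": 5, "total": {"value": 305.0}},
            ]
        }
    }
}

for row in buckets_to_rows(RESPONSE, "by_region", "total"):
    print(row)
```

If the rows printed from the captured JSON are right and the chart is not, the fault sits at or after this hop, on the Dashboards client.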

The boundary: Dashboards' job vs core's job

Concern	Owner	Evidence to check
Which fields/aggs to request	Dashboards (`AggConfigs`)	captured request body
Time range / filters from the search bar	Dashboards (`SearchSource`, `buildOpenSearchQuery`)	the `query` + `range` in the body
`size: 0`, `typed_keys`, `track_total_hits`	Dashboards conventions	request params
Parsing the DSL into objects	core (`RestSearchAction`)	breakpoint / parse errors
Shard fan-out, query, fetch, reduce	core (`TransportSearchAction`)	`"profile": true` in the body, slow log
Bucket values, doc counts, scoring	core + Lucene	replay with `curl`; compare to expected
Turning buckets into pixels	Dashboards renderer	compare correct JSON to wrong chart

The one-line test you will use constantly: replay the captured request with curl. Right answer from curl + wrong chart ⇒ Dashboards. Wrong answer from curl ⇒ core (or Lucene) — and now you have a Dashboards-free repro.
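Written out as a decision function, the bisection looks like the sketch below. The function name, argument names, and owner labels are hypothetical; the first check (does the captured request match what the user asked for?) reflects the "which fields/aggs to request" row of the table above.

```python
def attribute(request_matches_intent, curl_answer_correct):
    """Decide which side owns a wrong chart, given two observations."""
    if not request_matches_intent:
        # The DSL on the wire is not what the visualization should have asked for.
        return "Dashboards (request construction)"
    if curl_answer_correct:
        # Right request, right answer, wrong pixels.
        return "Dashboards (rendering)"
    # Right request, wrong answer: reproducible without Dashboards.
    return "core or Lucene"


print(attribute(request_matches_intent=True, curl_answer_correct=True))
print(attribute(request_matches_intent=True, curl_answer_correct=False))
print(attribute(request_matches_intent=False, curl_answer_correct=True))
```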

Expected Output

A curl that reproduces the visualization's aggregation and returns one bucket per region with a total.value.
The captured real request from devtools or the slow log, matching your hand-built one.
A breakpoint hit in RestSearchAction.prepareRequest (or a slow-log line) proving the request entered core as you expected.
A one-sentence statement, for your dataset, of what Dashboards owned vs what core owned in producing the chart.

Stretch Goals

Trigger an _msearch: a Dashboards dashboard with several panels batches requests. Capture it and find RestMultiSearchAction / TransportMultiSearchAction in core.
Introduce a deliberate Dashboards-vs-core mismatch: send the request without typed_keys and observe how a generic client must disambiguate agg types from the response.
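Set track_total_hits: false and watch hits.total disappear from the response; then set it to a small integer (for example 5) and watch hits.total.relation change to gte once more documents match than the threshold. Explain why Dashboards usually wants an exact total for the hit count display.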

Validation / Self-check

Draw the full path from a visualization to a shard and back, naming the class/route at each core-side hop.
Where exactly does Dashboards stop building the request and core start parsing it? Name the first core class.
Given "the bar chart shows 0 for eu but data exists," describe the two-curl bisection that decides Dashboards vs core in under five minutes.
What is typed_keys for, and what breaks if a client omits it?
Name three independent ways to capture the actual request Dashboards sent, and say which one you'd reach for first when you have no browser access.
Why does the Dashboards server (the Node proxy) almost never own a query-semantics bug, while the Dashboards client (the agg builder) often does?

Lab P2: Inspecting a Plugin's Extension Points

Background

The Plugin Architecture deep dive names the extension interfaces in the abstract. This lab makes them concrete: you will open a real, large out-of-repo plugin, find every place it plugs into core, and map each plugin method to the core SPI interface it implements. When you can read a plugin and immediately list "this is a MapperPlugin because of getMappers(), a SearchPlugin because of getQueries(), and an EnginePlugin because of getEngineFactory()," you can reason about what a core change to any of those interfaces would do to the plugin — which is the Stage 8 skill.

The primary subject is k-NN, because it is the most instructive: it implements the four core extension interfaces this lab maps (plus a few others, see Step 1) and exercises the full stack (field type, query, engine, REST). The secondary subject is security, which is interesting precisely because it does not add a field type or a query; instead it implements ActionPlugin and NetworkPlugin to wrap every request in an authorization filter.

Why This Lab Matters for Contributors

Every cross-repo bug attribution starts with "what does this plugin actually extend?" If a knn query returns bad results, you need to know that k-NN owns the query builder (SearchPlugin.getQueries), the field type (MapperPlugin.getMappers), and the engine that stores the vectors (EnginePlugin.getEngineFactory) — so the bug could be in any of three plugin subsystems, or below them in Lucene. If you don't know the plugin↔SPI map, you cannot bisect. This lab builds that map by hand, the only way it sticks.

Prerequisites

Requirement	Why
A core checkout (`~/OpenSearch`)	The SPI interfaces live here
A `k-NN` checkout (`~/k-NN`)	The subject plugin
Optionally a `security` checkout	The contrast case
`./gradlew publishToMavenLocal` working in core	To build the plugin against local core
Read Plugin Architecture	Classloader + `PluginsService` model

git clone https://github.com/opensearch-project/k-NN.git ~/k-NN
git clone https://github.com/opensearch-project/security.git ~/security   # optional

The extension interfaces you will map

# In the core checkout — the SPI surface a plugin implements
ls ~/OpenSearch/server/src/main/java/org/opensearch/plugins/
grep -rln "interface .*Plugin" ~/OpenSearch/server/src/main/java/org/opensearch/plugins/

Core SPI interface	Key method(s)	What the plugin contributes
`MapperPlugin`	`getMappers()`	a `Map<String, Mapper.TypeParser>` registering a field type
`SearchPlugin`	`getQueries()`, `getAggregations()`	`QuerySpec`/`AggregationSpec` registrations
`EnginePlugin`	`getEngineFactory(IndexSettings)`	a custom `EngineFactory` for matching indices
`ActionPlugin`	`getActions()`, `getRestHandlers()`, `getActionFilters()`	transport actions, REST routes, action filters
`NetworkPlugin`	`getTransports()`, `getTransportInterceptors()`	transport/HTTP wrappers

# Read the method signatures you will look for in the plugin
grep -n "getMappers"        ~/OpenSearch/server/src/main/java/org/opensearch/plugins/MapperPlugin.java
grep -n "getQueries\|getAggregations\|QuerySpec" \
                            ~/OpenSearch/server/src/main/java/org/opensearch/plugins/SearchPlugin.java
grep -n "getEngineFactory"  ~/OpenSearch/server/src/main/java/org/opensearch/plugins/EnginePlugin.java
grep -n "getActionFilters\|getRestHandlers\|getActions" \
                            ~/OpenSearch/server/src/main/java/org/opensearch/plugins/ActionPlugin.java

Step-by-Step Tasks

Step 1 — Find the k-NN plugin entrypoint and its `implements` clause

The classname in the plugin's descriptor names the Plugin subclass. Find it and read what interfaces it declares — that list is the extension-point map.

# The entrypoint class
find ~/k-NN -name "KNNPlugin.java"
grep -n "class KNNPlugin\|implements\|extends Plugin" \
  $(find ~/k-NN -name "KNNPlugin.java")

You should see it declare (names may vary slightly by branch) something like extends Plugin implements MapperPlugin, SearchPlugin, ActionPlugin, EnginePlugin, ScriptPlugin, SystemIndexPlugin. Each interface there is a contract with core.

Step 2 — Map `MapperPlugin.getMappers` → the `knn_vector` field type

grep -n "getMappers" $(find ~/k-NN -name "KNNPlugin.java")
# the field type implementation it registers
find ~/k-NN -name "KNNVectorFieldMapper.java"
grep -n "class KNNVectorFieldMapper\|CONTENT_TYPE\|TypeParser\|MappedFieldType" \
  $(find ~/k-NN -name "KNNVectorFieldMapper.java")

getMappers() returns a map whose key is the content type string ("knn_vector") and whose value is the TypeParser that builds the mapper from mapping JSON. This is exactly the SPI: core's MapperRegistry merges plugin mappers with built-in ones, which is why you can write "type": "knn_vector" in a mapping.

# Core side: where plugin mappers are collected
grep -n "getMappers\|MapperRegistry\|registerMappers" \
  ~/OpenSearch/server/src/main/java/org/opensearch/indices/IndicesModule.java

Step 3 — Map `SearchPlugin.getQueries` → the `knn` query

grep -n "getQueries" $(find ~/k-NN -name "KNNPlugin.java")
find ~/k-NN -name "KNNQueryBuilder.java"
grep -n "class KNNQueryBuilder\|NAME\|doToQuery\|fromXContent" \
  $(find ~/k-NN -name "KNNQueryBuilder.java")

getQueries() returns a list of QuerySpec entries; each binds a query name ("knn") to two functions that normally live on the QueryBuilder class: its Writeable reader (for transport serialization) and its fromXContent parser (for the REST DSL). This is why {"query": {"knn": {...}}} is understood. Note the three registrations per query (name, wire reader, XContent parser): a query has to cross both the wire between nodes and the JSON boundary at the REST layer.
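A toy model makes the three registrations easier to see. The classes below are a Python illustration, not core's Java API: QueryRegistry, knn_from_json, and knn_from_wire are hypothetical names, and JSON bytes stand in for the real binary wire format.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class QuerySpec:
    name: str                          # the key used in the REST DSL
    reader: Callable[[bytes], dict]    # wire form -> query object
    parser: Callable[[dict], dict]     # DSL JSON -> query object


class QueryRegistry:
    def __init__(self):
        self._specs = {}

    def register(self, spec):
        if spec.name in self._specs:
            raise ValueError(f"query [{spec.name}] already registered")
        self._specs[spec.name] = spec

    def parse_dsl(self, dsl):
        (name, params), = dsl.items()
        if name not in self._specs:
            raise KeyError(f"unknown query [{name}]")
        return self._specs[name].parser(params)

    def read_wire(self, name, payload):
        if name not in self._specs:
            raise KeyError(f"unknown query [{name}]")
        return self._specs[name].reader(payload)


def knn_from_json(params):
    (field, body), = params.items()
    return {"query": "knn", "field": field, "vector": body["vector"], "k": body["k"]}


def knn_from_wire(payload):
    return json.loads(payload.decode("utf-8"))


registry = QueryRegistry()
registry.register(QuerySpec("knn", knn_from_wire, knn_from_json))
parsed = registry.parse_dsl({"knn": {"v": {"vector": [1, 2, 3], "k": 2}}})
round_trip = registry.read_wire("knn", json.dumps(parsed).encode("utf-8"))
print(parsed == round_trip)
```

In this model, dropping the parser breaks the REST path and dropping the reader breaks the node-to-node path, while an unregistered name fails before either is consulted.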

# Core side: where plugin queries are merged into the registry
grep -n "getQueries\|registerQuery\|QuerySpec" \
  ~/OpenSearch/server/src/main/java/org/opensearch/search/SearchModule.java

Step 4 — Map `EnginePlugin.getEngineFactory` → the vector store path

This is the subtle one. For native engines (faiss/nmslib), k-NN does not just add a query and field — it can supply a custom EngineFactory so that matching indices store vectors in a way the query can search.

grep -n "getEngineFactory" $(find ~/k-NN -name "KNNPlugin.java")
# the codec / engine glue
grep -rln "EngineFactory\|KNNCodec\|KNN80\|NativeMemory" ~/k-NN/src/main/java | head

getEngineFactory(IndexSettings) returns Optional<EngineFactory>; core calls it per index and, if a factory is present, uses it instead of the default InternalEngineFactory that builds InternalEngine (see Engine Internals). This is why a k-NN bug can live below the query layer entirely, in how vectors are written and read.

Step 5 — Map `ActionPlugin` → REST routes and transport actions

grep -n "getRestHandlers\|getActions" $(find ~/k-NN -name "KNNPlugin.java")
# the stats/warmup endpoints k-NN adds
grep -rln "RestKNNStatsHandler\|RestKNNWarmupHandler\|BaseRestHandler" ~/k-NN/src/main/java | head

getActions() registers ActionType → TransportAction pairs (e.g. k-NN stats); getRestHandlers() registers the REST routes that call them. This is the same machinery P1 traced for core's RestSearchAction, just contributed by a plugin.

Step 6 — The contrast: security as `ActionPlugin` + `NetworkPlugin`

security adds no field type and no query. It implements ActionPlugin to inject an action filter that runs before every transport action and can deny it, and NetworkPlugin to wrap transport/HTTP with TLS and authentication.

find ~/security -name "OpenSearchSecurityPlugin.java"
grep -n "getActionFilters\|getTransportInterceptors\|getRestHandlers\|implements" \
  $(find ~/security -name "OpenSearchSecurityPlugin.java")
# the filter that authorizes/denies actions
grep -rln "implements ActionFilter\|class SecurityFilter\|apply(.*Task" ~/security/src/main/java | head

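# Core side: where action filters are collected (ActionModule) and held for the chain (ActionFilters)
grep -n "getActionFilters\|ActionFilters\|ActionFilterChain" \
  ~/OpenSearch/server/src/main/java/org/opensearch/action/ActionModule.java \
  ~/OpenSearch/server/src/main/java/org/opensearch/action/support/ActionFilters.java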

ActionFilter.apply(...) sits in a chain in front of every TransportAction. A 403 you debug in P3 and P4 almost always originates exactly here — in security's filter deciding the principal lacks permission, before core's action logic even runs.
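The "filter in front of every action" shape can be sketched in a few lines. This is a toy model of a filter chain, not core's ActionFilterChain: the function names and the request dictionary (with its allowed set) are hypothetical, chosen only to show that a denial happens before the action body runs.

```python
class Denied(Exception):
    """Raised by a filter to stop the request (a stand-in for a 403)."""


def run_chain(filters, action, request):
    """Run each filter in order; the action runs only if every filter proceeds."""
    def step(index, req):
        if index == len(filters):
            return action(req)
        return filters[index](req, lambda next_req: step(index + 1, next_req))
    return step(0, request)


def security_filter(request, proceed):
    # Hypothetical permission model: the request carries the caller's allowed actions.
    if request["action"] not in request["allowed"]:
        raise Denied(f"403: no permission for {request['action']}")
    return proceed(request)


calls = []


def search_action(request):
    calls.append(request["action"])  # records that core's action logic ran
    return "200: search ran"


SEARCH = "indices:data/read/search"
ok = run_chain([security_filter], search_action, {"action": SEARCH, "allowed": {SEARCH}})
try:
    run_chain([security_filter], search_action, {"action": SEARCH, "allowed": set()})
    denied = None
except Denied as exc:
    denied = str(exc)

print(ok)
print(denied)
print(calls)
```

The calls list ends up with a single entry: the denied request never reached search_action, which mirrors why a 403 stack trace contains no frames from the action's own logic.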

The plugin↔SPI map you produce

Fill this in from your own greps — it is the deliverable:

Plugin	SPI interface	Plugin method	Concrete class registered
k-NN	`MapperPlugin`	`getMappers()`	`KNNVectorFieldMapper` (`knn_vector`)
k-NN	`SearchPlugin`	`getQueries()`	`KNNQueryBuilder` (`knn`)
k-NN	`EnginePlugin`	`getEngineFactory()`	native vector `EngineFactory`
k-NN	`ActionPlugin`	`getRestHandlers()` / `getActions()`	stats/warmup handlers + transport actions
security	`ActionPlugin`	`getActionFilters()`	`SecurityFilter` (authorizes every action)
security	`NetworkPlugin`	`getTransportInterceptors()`	TLS/auth transport wrapper

Step-by-Step: build the plugin against local core

This is the mechanic from the section index, exercised end to end. It proves you can take an unreleased core and test a plugin against it — the core of Stage 8.

# 1. Publish your local core snapshot to ~/.m2
cd ~/OpenSearch
./gradlew publishToMavenLocal
ls ~/.m2/repository/org/opensearch/opensearch/   # see the snapshot version dir

# 2. Build k-NN against that exact version (read it back from the local Maven cache)
cd ~/k-NN
OS_VERSION=$(ls ~/.m2/repository/org/opensearch/opensearch/ | grep -- '-SNAPSHOT$' | sort -V | tail -1)
./gradlew assemble -Dopensearch.version="${OS_VERSION:-3.1.0-SNAPSHOT}"
ls build/distributions/   # opensearch-knn-*.zip

# 3. Build a runnable distro from the SAME core and install the plugin
cd ~/OpenSearch
./gradlew localDistro
DISTRO=$(ls -d distribution/archives/*/build/install/opensearch-* | head -1)
"$DISTRO/bin/opensearch-plugin" install "file://$(ls ~/k-NN/build/distributions/*.zip | head -1)"
"$DISTRO/bin/opensearch-plugin" list

Warning: The plugin's opensearch.version must equal the distro's version exactly. A mismatch fails at load with a compatibility error — this is the single most common cross-repo build failure, covered in P5 and P6.

Then confirm the extension points are live:

curl -s 'localhost:9200/_cat/plugins?v'      # k-NN listed
# the field type from MapperPlugin:
curl -s -XPUT localhost:9200/vec -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true },
  "mappings": { "properties": { "v": { "type": "knn_vector", "dimension": 3 } } } }'
# the query from SearchPlugin (after indexing some vectors):
curl -s localhost:9200/vec/_search -H 'Content-Type: application/json' -d '{
  "query": { "knn": { "v": { "vector": [1,2,3], "k": 2 } } } }'

If knn_vector is accepted as a type and knn is accepted as a query, you have proven both getMappers() and getQueries() registered correctly against your local core.

grep exercises across the SPI

Run these and answer in one line each:

# Which extension interfaces does k-NN implement? (count them)
grep -o "implements .*" $(find ~/k-NN -name "KNNPlugin.java")

# How many queries does it register, and what are their NAMEs?
grep -rn "public static final.*NAME\s*=" ~/k-NN/src/main/java | grep -i query

# Does security register any query or mapper? (should be none)
grep -rn "getQueries\|getMappers" ~/security/src/main/java | head

# For each SPI method, where does CORE call it during startup?
grep -rn "\.getMappers()\|\.getQueries()\|\.getEngineFactory(\|\.getActionFilters()" \
  ~/OpenSearch/server/src/main/java | head

Expected Output

The k-NN implements clause, with each interface annotated by which method and which concrete class implements it.
The filled-in plugin↔SPI map table.
A running distro with k-NN installed, knn_vector accepted as a type, and a knn query returning hits.
A one-paragraph contrast: why security is an ActionPlugin/NetworkPlugin and not a SearchPlugin/MapperPlugin.

Stretch Goals

Find one more SPI interface k-NN implements that this lab didn't cover (e.g. ScriptPlugin or SystemIndexPlugin) and explain what it contributes.
In the SQL plugin, find which extension interfaces it uses (hint: ActionPlugin for its _sql/_ppl REST endpoints) and contrast with k-NN.
Trace EnginePlugin.getEngineFactory from the plugin into core's IndexModule/IndexService to see exactly where core decides whether to use the plugin's engine.

Validation / Self-check

From memory, name the SPI interface for each of: a custom field type, a custom query, a custom engine, a custom REST endpoint, and an authorization filter.
For the knn query, why are there three registrations (name, wire reader, XContent parser) and what would break if one were omitted?
Why is security an ActionPlugin but not a SearchPlugin? What does that tell you about where a 403 comes from?
Walk publishToMavenLocal → plugin assemble → opensearch-plugin install and say why the version must match at each step.
Given "a knn query returns wrong neighbors," list the three plugin subsystems (field type, query, engine) that could own it and how you'd narrow down.
Where in core is each SPI method called during node startup? Name one file.

Lab P3: Debugging a Failed Cross-Plugin Request

Background

A request that fails inside one repo is easy: read the stack trace, fix the code. A request that fails at the boundary between core and a plugin is the hard case, because the stack trace spans two repositories that were built separately, the relevant log lines are split across two logging configurations, and the "obvious" frame often belongs to the wrong owner. This lab teaches you to read a failure that crosses the core↔plugin seam: enable TRACE logging on both sides, read a stack trace whose frames alternate between org.opensearch.<core> and the plugin's package, and decide which side owns the bug.

It is the prerequisite for P4: Bug Attribution, which formalizes the decision into a flowchart. Here you build the raw skill of reading the cross-stack failure; there you make it mechanical. It assumes the plugin↔SPI map from P2 and the request path from P1 and the Action Framework deep dive.

Why This Lab Matters for Contributors

The most common mistake on cross-repo issues is anchoring on the top frame of a stack trace. If the top frame is org.opensearch.action.search.TransportSearchAction, the reflex is "core bug." But the trace continues into the plugin's query builder, and the Caused by is in the plugin's native engine. The top frame is just core calling the plugin. The skill is to find the frame in actionable code and read the Caused by chain, exactly as the Tez curriculum's H4 does for Hive/Tez. Get this wrong and you file on core; the core maintainer bounces it to the plugin repo a week later; everyone loses.

Prerequisites

Requirement	Why
A distro with a plugin installed (k-NN or security) from P2	The failing system
Ability to edit `log4j2.properties`	TRACE logging on both stacks
Read P1 and P2	Request path + plugin↔SPI map

Two boundary failure shapes

There are two archetypal cross-plugin failures. You will reproduce one and read the other.

flowchart TD
  R[Incoming request]
  R --> AF{ActionFilter chain<br/>security plugin}
  AF -->|denies| F403[403 — security owns this]
  AF -->|allows| TA[core TransportAction]
  TA --> Q{query / field type<br/>from a SearchPlugin/MapperPlugin}
  Q -->|mapping/engine mismatch| F500[search_phase_execution_exception<br/>core surfaces, plugin or Lucene owns]
  Q -->|ok| OK[normal response]

Shape A — authorization (security): the request never reaches core's action logic; the security action filter denies it and returns a 403. Owner: the security plugin (or the role config), almost never core.
Shape B — mapping/engine mismatch (k-NN): the request reaches core's search fan-out, which calls the plugin's query against a field whose mapping or engine doesn't match — core surfaces a search_phase_execution_exception but the bug is in the plugin layer (or Lucene below it).

Step-by-Step Tasks

Step 1 — Turn on TRACE on both sides

Core and the plugin both log through log4j2, but to different loggers. You must raise the level on both package roots. Set it dynamically (no restart) for the relevant loggers:

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "logger.org.opensearch.action.search": "TRACE",
    "logger.org.opensearch.search": "TRACE",
    "logger.org.opensearch.knn": "TRACE",
    "logger.org.opensearch.security": "TRACE"
  }
}'

Note: Dynamic logger.<package> cluster settings work for any package, including plugin packages — that's the whole trick. If a plugin's logger isn't taking effect, confirm the package root with grep -rn "LogManager.getLogger" <plugin>/src/main/java | head and set that exact prefix. For startup-time failures that happen before settings apply, edit config/log4j2.properties and restart instead.
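When you toggle TRACE often, generating the settings body from a package list avoids typos in the logger prefixes. The helper below is a small sketch with a hypothetical name; passing level=None emits JSON null, which is how a cluster setting is reset to its default once you are done.

```python
import json


def logger_settings(packages, level="TRACE"):
    """Build a transient cluster-settings body; level=None resets each logger."""
    return {"transient": {f"logger.{package}": level for package in packages}}


PACKAGES = [
    "org.opensearch.action.search",
    "org.opensearch.search",
    "org.opensearch.knn",
    "org.opensearch.security",
]

print(json.dumps(logger_settings(PACKAGES), indent=2))       # turn TRACE on
print(json.dumps(logger_settings(PACKAGES, level=None)))     # reset afterwards
```

Either output can be passed to the same PUT _cluster/settings call shown above with curl -d.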

Step 2 — Reproduce Shape B (k-NN mapping/engine mismatch)

Create the mismatch deliberately: index a field as a plain float (not knn_vector) and then run a knn query against it. Core's search will invoke the plugin's query builder against a field type it can't handle.

curl -s -XPUT localhost:9200/badvec -H 'Content-Type: application/json' -d '{
  "mappings": { "properties": { "v": { "type": "float" } } } }'    # NOT knn_vector
curl -s -XPOST localhost:9200/badvec/_doc -H 'Content-Type: application/json' \
  -d '{"v":[1,2,3]}'
curl -s -XPOST localhost:9200/badvec/_refresh >/dev/null

curl -s localhost:9200/badvec/_search -H 'Content-Type: application/json' -d '{
  "query": { "knn": { "v": { "vector": [1,2,3], "k": 2 } } } }' | python3 -m json.tool

You get a search_phase_execution_exception. The top-level type is core's; the reason and root_cause name the k-NN class that rejected the field. That split — core's exception type wrapping a plugin's root cause — is the signature of Shape B.
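To make the split concrete, here is a small sketch that pulls the top-level type and the root causes out of an error body. The `SAMPLE` payload is hand-written in the general shape of a search error response; its values are illustrative and were not captured from a cluster, and the helper name `surfaced_and_root` is hypothetical.

```python
import json

# Hand-written sample; values are illustrative, not real cluster output.
SAMPLE = """
{
  "error": {
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "root_cause": [
      {"type": "illegal_argument_exception",
       "reason": "hypothetical plugin message about field [v]"}
    ]
  },
  "status": 400
}
"""


def surfaced_and_root(body):
    """Return (top-level error type, [(type, reason), ...] for each root cause)."""
    err = json.loads(body)["error"]
    roots = [(rc.get("type"), rc.get("reason")) for rc in err.get("root_cause", [])]
    return err.get("type"), roots


top, roots = surfaced_and_root(SAMPLE)
print("surfaced as:", top)
for rc_type, rc_reason in roots:
    print("root cause :", rc_type, "|", rc_reason)
```

Saving the real response from Step 2 to a file and passing its contents to the same function gives you the two halves side by side for the lab write-up.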

Step 3 — Read the stack trace across the seam

A representative trace for the mismatch (class names vary by branch; the structure is the point):

org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(...)   <- core: surfaces
        at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(...) <- core
        ...
Caused by: org.opensearch.OpenSearchException: [knn] requires field [v] to be of type [knn_vector]
        at org.opensearch.knn.index.query.KNNQueryBuilder.doToQuery(KNNQueryBuilder.java:...) <- PLUGIN: owns
        at org.opensearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:...) <- core: calls plugin
        at org.opensearch.search.SearchService.parseSource(SearchService.java:...)              <- core
        ...

Read it with the P4 rule, applied here:

Step	Observation	Conclusion
Top frame	`org.opensearch.action.search.*` (core)	core surfaced the failure — not necessarily the owner
`Caused by` root	`org.opensearch.knn.index.query.KNNQueryBuilder.doToQuery`	the plugin's query builder threw
Frame just above root	`AbstractQueryBuilder.toQuery` (core)	core called the plugin via the SPI — boundary confirmed
Message	"requires field to be of type knn_vector"	actionable: the field mapping is wrong

Attribution: the immediate fault is user error (wrong mapping). But note the shape: core's toQuery called the plugin's doToQuery, which threw. That toQuery → doToQuery transition is the exact core↔plugin SPI seam from P2. If the k-NN message had been unclear (say a raw ClassCastException instead of "requires knn_vector"), that would be a k-NN bug worth filing — the diagnostics lab P6 is about making exactly this message actionable.

Step 4 — Confirm with the two-side logs

With TRACE on, the boundary is visible in the logs as adjacent lines from different package roots:

# The stock log pattern abbreviates logger names (o.o.s.SearchService), so match both forms
grep -E "o\.o\.(s|k)\.|org\.opensearch\.(search|knn)" logs/*.log | tail -30

You will see core's SearchService log entering the query phase, immediately followed by k-NN's logger reporting the field-type check. The handoff is right there in the timeline — core, then plugin, then the exception. That adjacency is how you see the seam, not just infer it from the stack trace.

Step 5 — Read Shape A (a security 403)

You may not have security configured locally, so read this from a real trace shape. A denied action returns:

{ "status": 403,
  "error": { "type": "security_exception",
    "reason": "no permissions for [indices:data/read/search] and User [name=guest, ...]" } }

There is usually no core stack trace at all — because core's action never ran. The security action filter (SecurityFilter.apply, the ActionFilter from P2) short-circuited the chain before TransportSearchAction.doExecute. The diagnostic is:

# With security TRACE on, find the filter decision
grep -E "org.opensearch.security|privilege|SecurityFilter|no permissions" logs/*.log | tail -20

Attribution: Shape A is owned by the security plugin's policy, not its code and not core. The "bug," if any, is a missing role mapping. It is a core bug only if core mis-declared an action's required privilege name — a rare, specific case you'd confirm by checking which ActionType string the filter evaluated.

Reading rules for cross-plugin failures

Symptom in the response	First hypothesis	Where to look first
`403 / security_exception`, no core stack	security policy denied it	security TRACE log, role mappings
`search_phase_execution_exception` with plugin class in `root_cause`	plugin query/field-type issue	the plugin class in the `Caused by`
Core type, `Caused by` is a Lucene class	Lucene-level issue surfaced by core	the Lucene frame (see P4)
`NoClassDefFoundError` / `NoSuchMethodError` for a core class	plugin/core version skew	descriptor `opensearch.version`; rebuild (P5)
Raw `ClassCastException` / `NullPointerException` in plugin code	a real plugin bug	the plugin frame; this is fileable

The discipline: the top frame tells you who surfaced the error; the Caused by root and the frame that called across the SPI tell you who owns it.
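The rule can be applied mechanically to a trace. Below is a minimal sketch that classifies frames by package root and picks out the thrower and the first frame on the other side of the seam. The package roots come from this lab; the function names and the returned dictionary layout are assumptions made for illustration, and real traces with several nested `Caused by` sections may need more care.

```python
import re

# Package roots from this lab; extend the map for other plugins.
PLUGIN_ROOTS = {"org.opensearch.knn": "k-NN", "org.opensearch.security": "security"}
FRAME = re.compile(r"^\s*at\s+([\w.$<>]+)\(")


def owner_of(qualified_name):
    """Map a fully qualified method name to a plugin, lucene, core, or other."""
    for root, name in PLUGIN_ROOTS.items():
        if qualified_name.startswith(root + "."):
            return name
    if qualified_name.startswith("org.apache.lucene."):
        return "lucene"
    if qualified_name.startswith("org.opensearch."):
        return "core"
    return "other"


def frames_in(lines):
    """Return the method names of all 'at ...' frames in the given lines."""
    return [m.group(1) for m in (FRAME.match(line) for line in lines) if m]


def attribute(trace):
    """Report who surfaced the error, who threw the root cause, and the seam caller."""
    lines = trace.splitlines()
    caused = [i for i, line in enumerate(lines) if line.lstrip().startswith("Caused by:")]
    all_frames = frames_in(lines)
    root_frames = frames_in(lines[caused[-1]:] if caused else lines)
    if not all_frames or not root_frames:
        raise ValueError("no stack frames found")
    thrower = root_frames[0]
    # First frame after the thrower that belongs to a different owner
    caller = next((f for f in root_frames[1:] if owner_of(f) != owner_of(thrower)), None)
    return {"surfaced_by": owner_of(all_frames[0]), "owned_by": owner_of(thrower),
            "thrower": thrower, "seam_caller": caller}


TRACE = """\
org.opensearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(...)
        at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(...)
        ...
Caused by: org.opensearch.OpenSearchException: [knn] requires field [v] to be of type [knn_vector]
        at org.opensearch.knn.index.query.KNNQueryBuilder.doToQuery(KNNQueryBuilder.java:...)
        at org.opensearch.index.query.AbstractQueryBuilder.toQuery(AbstractQueryBuilder.java:...)
        at org.opensearch.search.SearchService.parseSource(SearchService.java:...)
"""

result = attribute(TRACE)
print(result)
```

On the Step 3 trace this reports core as the surfacer, k-NN as the owner, and `AbstractQueryBuilder.toQuery` as the caller across the seam, which matches the manual reading in the table.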

Expected Output

A reproduced search_phase_execution_exception whose root_cause is a k-NN class, captured as JSON.
The stack trace annotated frame-by-frame as core-surface vs plugin-owner, with the toQuery → doToQuery SPI seam identified.
TRACE log excerpts showing core's logger and the plugin's logger on adjacent lines around the failure.
A one-paragraph attribution for each of Shape A and Shape B.

Troubleshooting

Problem	Fix
Plugin logger won't go to TRACE	Wrong package root; grep the plugin for `getLogger` and use that prefix
No stack trace in the response	Add `?error_trace=true` to the request URL
Response truncates the `Caused by`	Read the full trace in `logs/<cluster>.log` instead of the HTTP body
The failure is at startup, before settings apply	Edit `config/log4j2.properties`, add a logger with both `logger.<id>.name = <pkg>` and `logger.<id>.level = trace`, restart

Stretch Goals

Reproduce a real plugin bug shape: configure the k-NN mapping with one engine (faiss), send a query that relies on behavior specific to a different engine, and read where the engine mismatch surfaces.
Turn on error_trace=true and compare the response body's stack to the server log's — note what the HTTP body omits.
Write the same failure twice: once where the plugin message is actionable and once where it's a raw ClassCastException. Argue which one you'd file.

Validation / Self-check

Given a search_phase_execution_exception with a k-NN frame in root_cause, which repo owns it, and what single line in the trace told you?
Why does a security 403 usually have no core stack trace?
What is the toQuery → doToQuery transition, and why is it the most important frame pair in a cross-plugin search failure?
How do you enable TRACE on a plugin's package at runtime without a restart, and when must you fall back to editing log4j2.properties?
Distinguish "core surfaced the error" from "core owns the error" using the frames of a trace you reproduced.
When is a security 403 actually a core bug rather than a policy issue?

Lab P4: Bug Attribution — Core vs. Plugin vs. Dashboards vs. Lucene

Background

This is the keystone cross-repo skill, the direct analog of the Tez curriculum's "Hive-on-Tez attribution." A user reports a symptom — wrong results, an error, slowness. Before a single line of code is written, someone must decide which repository owns it: OpenSearch core, a named plugin (k-NN, security, sql, …), OpenSearch Dashboards, or Apache Lucene. File it on the wrong repo and the maintainer spends a week bouncing it before anyone touches the actual bug.

This lab gives you a decision flowchart, an attribution table, the "reproduce-without-the-suspect" bisection technique, and four worked examples — one per repo. It depends on the reading skill from P3, the request trace from P1, and the plugin↔SPI map from P2. It is the operational core of Stage 8.

Why This Lab Matters for Contributors

Attribution is where cross-repo contributors earn their reputation. A contributor who consistently attributes correctly and attaches a minimal repro in the right repo is trusted; their issues get picked up fast. The bisection technique here ("reproduce the symptom with the suspect removed") is the most powerful tool you have, because it converts an opinion ("I think it's core") into a proof ("here's the same bug with no plugin and no Dashboards, on a stock distro").

The decision flowchart

flowchart TD
  S[Symptom reported]
  S --> Q1{Reproducible with raw curl<br/>against core, no Dashboards?}
  Q1 -->|No, only via the UI| DASH[Dashboards owns it<br/>render/agg-config/time-zone]
  Q1 -->|Yes| Q2{Reproducible with the<br/>plugin uninstalled?}
  Q2 -->|Yes, no plugin needed| Q3{Does it involve a query type,<br/>field type, or scoring?}
  Q2 -->|No, needs the plugin| PLUG[A plugin owns it<br/>find which from the field/query]
  Q3 -->|Yes, and result is numerically wrong| Q4{Reproducible on a single shard<br/>with a trivial Lucene-level query?}
  Q3 -->|No — error, allocation, cluster state| CORE[Core owns it]
  Q4 -->|Yes, even a match_all/term is wrong| LUCENE[Lucene owns it<br/>codec/docvalues/scorer]
  Q4 -->|No, only the composite/agg is wrong| CORE
  PLUG --> Q5{Is the plugin's error/message<br/>itself the problem?}
  Q5 -->|Message unclear / wrong| PLUGB[Plugin bug — file on plugin]
  Q5 -->|Message clear, config is wrong| USER[User/config — no bug]

The flowchart encodes one idea: remove suspects one at a time and see if the symptom survives. Each "reproducible with X removed?" answer eliminates a repo.
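The same decisions can be written as a short function, which is handy as a triage checklist. This is a sketch of the flowchart only; the function and parameter names are invented for illustration, and each argument is the yes/no answer you obtained by actually running the corresponding test.

```python
def attribute_owner(*, via_curl, without_plugin, query_or_scoring=False,
                    numerically_wrong=False, trivial_query_wrong=False,
                    plugin_message_clear=True):
    """Walk the flowchart top to bottom and return the owner it ends on."""
    if not via_curl:
        # Only reproducible through the UI
        return "Dashboards"
    if not without_plugin:
        # Needs the plugin: either the plugin is wrong or the configuration is
        return "user/config" if plugin_message_clear else "plugin"
    if not (query_or_scoring and numerically_wrong):
        # Errors, allocation, cluster state
        return "core"
    # Numerically wrong result: is even a trivial single-shard query wrong?
    return "Lucene" if trivial_query_wrong else "core"


print(attribute_owner(via_curl=False, without_plugin=True))
print(attribute_owner(via_curl=True, without_plugin=False, plugin_message_clear=False))
print(attribute_owner(via_curl=True, without_plugin=True,
                      query_or_scoring=True, numerically_wrong=True,
                      trivial_query_wrong=True))
```

Keyword-only arguments force every answer to be named at the call site, so a triage note such as `via_curl=True, without_plugin=False` doubles as a record of which suspects were removed.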

The attribution table

Symptom	Likely owner	Decisive test	Where to file
Chart renders wrong but captured request's JSON is correct	Dashboards	replay request with `curl`; right answer ⇒ Dashboards	`OpenSearch-Dashboards`
Time bucketing off by an hour	Dashboards	check `time_zone` in the request body	`OpenSearch-Dashboards`
Aggregation result numerically wrong via `curl` too	core (reduce) or Lucene (docvalues)	does a single-shard index reproduce it?	`OpenSearch` / `apache/lucene`
`knn` query returns bad neighbors	k-NN (then core, then Lucene)	run a brute-force `script_score` vs `knn`; differ ⇒ k-NN/engine	`k-NN`
`403 / security_exception`	security policy (rarely core)	does the action's privilege name match?	`security` (or role config)
`NoSuchMethodError` on a core class from plugin code	build/version skew	check descriptor `opensearch.version` vs node	rebuild; not a code bug
Slow search, only with the plugin	plugin	`"profile": true` in the request body; compare with/without plugin	the plugin
Slow search, no plugin involved	core or Lucene	`"profile": true` in the request body; which phase dominates?	`OpenSearch` / `apache/lucene`
Wrong scoring on a plain `match` query	Lucene (then core)	reproduce with a Lucene `IndexSearcher` unit test	`apache/lucene`
Shard won't allocate / cluster state stuck	core	`_cluster/allocation/explain`	`OpenSearch`

The bisection technique: reproduce without the suspect

The single most valuable move. For each suspect repo, there's a way to take it out of the picture and see whether the bug survives.

Suspect	How to remove it	If the bug survives...	If it disappears...
Dashboards	replay the captured request with `curl`	the bug is not in Dashboards	Dashboards owns it (rendering/agg-config)
A plugin	uninstall it (`opensearch-plugin remove <name>`), restart the node, and rebuild the query without it	the bug is not in that plugin	the plugin owns it
Lucene	can't be uninstalled, so strip everything above it instead: a single shard and a single trivial query (`term`, `match_all`)	Lucene owns the primitive	core's higher layers add the bug
A specific agg	replace it with a simpler agg of the same family	the fault is in code the aggs share (for example core's reduce)	the specific agg owns it

Note: "Remove the suspect" sometimes means re-create the symptom with stock tools. You can't uninstall Lucene, but you can reproduce a scoring bug with a 5-line Lucene IndexSearcher test that uses no OpenSearch at all — if that reproduces it, you've proven Lucene owns it and can file apache/lucene with a Lucene-only repro.

Worked Example 1 — Wrong agg result (core vs Lucene)

Symptom: a sum aggregation over a double field returns a value that's off by a tiny amount on a multi-shard index.

Bisect:

# Reproduce via curl (removes Dashboards)
curl -s 'localhost:9200/sales/_search' -H 'Content-Type: application/json' -d '{
  "size":0, "aggs":{"t":{"sum":{"field":"amount"}}}}'

# Single-shard index (removes the multi-shard reduce)
curl -s -XPUT localhost:9200/sales1 -H 'Content-Type: application/json' \
  -d '{"settings":{"number_of_shards":1}}'
# reindex and re-run the sum

If single-shard is correct but multi-shard is wrong, the fault is in core's cross-shard reduce (SearchPhaseController / InternalSum.reduce) — file on OpenSearch. Floating-point sum order across shards is a classic core concern; see Aggregations.
If single-shard is also wrong, the per-shard aggregation read the wrong doc-values — drop to a Lucene SortedNumericDocValues read. If a Lucene-only test reproduces it, file apache/lucene. (Usually it's core's agg, not Lucene; doc-values reads are extremely well-tested.)

Attribution most often: core, the reduce. The decisive evidence is "single-shard correct, multi-shard wrong."
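Before filing, it helps to know how large a discrepancy plain floating-point reordering can produce, since a multi-shard sum adds the same values in a different grouping than a single-shard sum. The sketch below illustrates the arithmetic only; it does not model how core actually accumulates or reduces sums, and the shard layout is an invented example.

```python
import math

values = [0.1, 0.2, 0.3]

# One shard: values are added left to right
single_shard = (values[0] + values[1]) + values[2]

# Two shards: shard A holds 0.1, shard B holds 0.2 and 0.3;
# the coordinator then adds the two partial sums
two_shards = values[0] + (values[1] + values[2])

print(single_shard)        # 0.6000000000000001
print(two_shards)          # 0.6
print(math.fsum(values))   # 0.6 (correctly rounded reference)

# The two groupings differ, but only at the level of rounding error
print(single_shard == two_shards)                              # False
print(math.isclose(single_shard, two_shards, rel_tol=1e-12))   # True
```

A difference of this size is rounding noise rather than a reduce bug. A discrepancy well beyond it, reproduced with fixed data, is the evidence worth attaching to a report.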

Worked Example 2 — A visualization renders wrong (Dashboards vs core)

Symptom: a date histogram bar chart shows counts in the wrong day buckets.

Bisect: capture the request (P1 Step 3) and replay it.

curl -s 'localhost:9200/sales/_search?typed_keys=true' -H 'Content-Type: application/json' -d '{
  "size":0,
  "aggs":{"h":{"date_histogram":{"field":"ts","calendar_interval":"day","time_zone":"+00:00"}}}}'

If the buckets in the JSON are correct but the chart is wrong, the bug is in Dashboards' rendering — file OpenSearch-Dashboards.
If the JSON buckets are wrong, look at the time_zone Dashboards put in the request. Dashboards sends the browser's time zone; if it sends the wrong one, that's still a Dashboards bug (wrong request), even though core honored it correctly. Core only owns it if core mis-bucketed for a correct time_zone — reproduce with an explicit time_zone via curl and check against a manual calculation.

Attribution most often: Dashboards — either rendering or the time_zone it chose. Core's date_histogram honoring an explicit time zone is rarely wrong.
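The manual calculation for a daily bucket is short enough to script. This sketch computes which calendar day a UTC timestamp falls into for a fixed offset such as `+02:00`; the helper name `day_bucket` is made up, and it deliberately handles fixed offsets only (named zones with daylight-saving rules would need `zoneinfo`).

```python
from datetime import datetime, timedelta, timezone


def day_bucket(ts_iso_utc, offset):
    """Return the calendar day (YYYY-MM-DD) of a UTC timestamp under a fixed offset."""
    sign = -1 if offset.startswith("-") else 1
    hours, minutes = offset.lstrip("+-").split(":")
    tz = timezone(sign * timedelta(hours=int(hours), minutes=int(minutes)))
    # The input has no zone suffix and is interpreted as UTC
    ts = datetime.fromisoformat(ts_iso_utc).replace(tzinfo=timezone.utc)
    return ts.astimezone(tz).date().isoformat()


# A document stamped late in the UTC day moves to the next day east of UTC
print(day_bucket("2024-03-10T23:30:00", "+00:00"))  # 2024-03-10
print(day_bucket("2024-03-10T23:30:00", "+02:00"))  # 2024-03-11
print(day_bucket("2024-03-10T23:30:00", "-05:00"))  # 2024-03-10
```

If the bucket keys in the JSON response agree with this calculation for the `time_zone` that was actually sent, core bucketed correctly and the remaining suspects are the time zone Dashboards chose or its rendering.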

Worked Example 3 — Vector search returns bad neighbors (k-NN vs core vs Lucene)

Symptom: a knn query returns neighbors that are obviously not the closest.

Bisect against a brute-force ground truth that uses no approximate engine:

# Approximate (k-NN ANN engine)
curl -s localhost:9200/vec/_search -H 'Content-Type: application/json' -d '{
  "query":{"knn":{"v":{"vector":[1,2,3],"k":3}}}}'

# Exact, scripted distance (removes the ANN engine; still uses the plugin's script)
curl -s localhost:9200/vec/_search -H 'Content-Type: application/json' -d '{
  "query":{"script_score":{"query":{"match_all":{}},
    "script":{"source":"knn_score","lang":"knn",
      "params":{"field":"v","query_value":[1,2,3],"space_type":"l2"}}}}}'

If exact returns the right neighbors but approximate does not, the approximation is the issue: the k-NN engine (faiss/nmslib/Lucene HNSW) config — ef_search, m, index params — or a genuine engine bug. File on k-NN (it owns EnginePlugin.getEngineFactory from P2). Often it's a recall-vs-config tradeoff, not a bug.
If both are wrong, the vectors were stored or read wrong — k-NN's codec, or Lucene's HNSW codec if the engine is lucene. Narrow by switching the mapping's engine and seeing which engine reproduces it.
If the field type or query is rejected, that's the P3 Shape-B mismatch, owned by the mapping/config.

Attribution: k-NN (engine/config) is most common; Lucene only if the lucene HNSW engine is in use and even exact reads are wrong.
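For a small index you can also compute the ground truth entirely outside the cluster, which removes the plugin's scoring script from the comparison as well. The following is a sketch with made-up data and helper names; it uses plain Euclidean distance and a recall figure, and says nothing about how any engine scores internally.

```python
import math


def exact_neighbors(vectors, query, k):
    """Return the ids of the k vectors closest to query by Euclidean (L2) distance."""
    ranked = sorted(vectors, key=lambda doc_id: (math.dist(vectors[doc_id], query), doc_id))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Illustrative data: ids mapped to 2-dimensional vectors
vectors = {"1": [1, 1], "2": [2, 2], "3": [3, 3], "4": [9, 9]}
truth = exact_neighbors(vectors, [1, 1], k=2)
print(truth)                            # ['1', '2']

# Ids as they might come back from the approximate knn query
print(recall_at_k(["1", "2"], truth))   # 1.0
print(recall_at_k(["1", "4"], truth))   # 0.5
```

A recall that stays low on a tiny, unambiguous data set points at the engine or its configuration; a recall that only drops on large or high-dimensional data is more likely the expected recall-versus-config tradeoff.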

Worked Example 4 — A 403 (security)

Symptom: a request returns 403 security_exception.

Bisect: this needs the security plugin to even occur, so it's a security/policy matter unless core mis-declared the action.

# What privilege did the action require?  (the message names it)
# "no permissions for [indices:data/read/search] and User [name=guest]"
# Check the role mapping grants that action:
curl -sk -u admin:admin 'https://localhost:9200/_plugins/_security/api/roles/<role>' | python3 -m json.tool

If the role should grant indices:data/read/search but doesn't, it's a config issue — fix the role, no bug.
If the privilege string in the error doesn't match the action the user actually invoked, core may have registered the wrong ActionType name — a rare core bug; confirm with grep in core's ActionModule.
If security denies an action it should allow given a correct role, that's a security plugin bug.

Attribution: usually config; occasionally security; rarely core.
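The first two checks can be sketched in a few lines: extract the privilege string from the denial message, then test it against a role's `allowed_actions`. The regular expression is written against the message shape quoted in this lab, and the matching is a deliberate simplification: it treats entries as literal names or `*` wildcards and does not expand action groups, so a miss here means "look closer", not "the role is wrong".

```python
import re
from fnmatch import fnmatchcase

# Pattern based on the message shape shown in this lab
REASON = re.compile(
    r"no permissions for \[(?P<action>[^\]]+)\] and User \[name=(?P<user>[^,\]]+)"
)


def parse_denial(reason):
    """Return (action, user) from a security_exception reason, or None if it does not match."""
    m = REASON.search(reason)
    return (m.group("action"), m.group("user")) if m else None


def granted_literally(action, allowed_actions):
    """True if an entry matches the action as a literal or * wildcard (no action-group expansion)."""
    return any(fnmatchcase(action, pattern) for pattern in allowed_actions)


reason = "no permissions for [indices:data/read/search] and User [name=guest, ...]"
action, user = parse_denial(reason)
print(action, user)

print(granted_literally(action, ["indices:data/read/*"]))    # True
print(granted_literally(action, ["indices:data/write/*"]))   # False
print(granted_literally(action, ["read"]))                   # False: a group name, not expanded here
```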

Writing the bug report in the right repo

Once attributed, the report goes in the owning repo with a repro that excludes the other repos:

Owner	Repo	Minimal repro must NOT include
core	`opensearch-project/OpenSearch`	Dashboards, any out-of-repo plugin
k-NN	`opensearch-project/k-NN`	Dashboards; only the plugin + a `curl`
security	`opensearch-project/security`	Dashboards; a `curl` + role config
Dashboards	`opensearch-project/OpenSearch-Dashboards`	(include the captured request that proves core was right)
Lucene	`apache/lucene`	OpenSearch entirely — a Lucene `IndexSearcher` test

The discipline of excluding the other repos from the repro is what proves your attribution. A core bug report that requires installing k-NN and Dashboards to reproduce hasn't been attributed — it's been punted.

Expected Output

The decision flowchart applied to one symptom of your choosing, written out.
For two of the four worked examples, the actual curl bisection run on your local cluster with the result.
One bug-report skeleton for the repo you attributed to, with a repro that does not drag in the other repos.

Stretch Goals

Take a real open issue from OpenSearch-Dashboards and decide, from its description alone, whether it's truly a Dashboards bug or a mis-filed core bug.
Build a Lucene-only IndexSearcher repro for a scoring question and confirm it reproduces (or doesn't) outside OpenSearch.
Write the "reproduce without the suspect" steps as a checklist you could hand to a triager.

Validation / Self-check

State the one-sentence rule for the entire flowchart.
For each of the four repos, give the single decisive test that removes it from suspicion.
A sum is wrong on a multi-shard index but right on one shard. Owner? Why?
A date histogram chart is wrong but the captured request's buckets are correct. Owner? Why?
Why does the requirement "the repro must exclude the other repos" actually prove the attribution rather than just being tidy?
When is a 403 a core bug, and how would you confirm it?

Lab P5: Reproducing Plugin/Core Integration Bugs

Background

P4 ends with "file the bug with a minimal repro in the right repo." This lab is about building that repro when the bug spans core and a plugin. A cross-repo repro is harder than a single-repo one for two reasons: it needs both pieces present at compatible versions, and it must be minimal and deterministic despite involving two independently-built artifacts. You will build both forms of repro: a black-box curl/REST-YAML script against a running distro with the plugin installed, and a white-box integration test in the plugin repo that boots an in-JVM cluster with the plugin loaded via nodePlugins().

You will also confront the thing that makes these bugs slippery: version sensitivity. A plugin must match the core version exactly, and an integration bug often only reproduces on a specific core↔plugin version pair. Getting the pinning right is half the work.

This depends on the build mechanic from the section index, the Plugin Architecture deep dive, and the testing framework you met in the testing levels (Level 7, Level 8). It connects to Serialization and BWC — the reason version skew bites.

Why This Lab Matters for Contributors

A bug report without a repro is a wish. A cross-repo bug report with a repro that only works on your laptop, with five plugins installed and a specific Dashboards build, is barely better — the maintainer can't run it. The contributor who ships a 12-line curl script or a self-contained OpenSearchIntegTestCase that the plugin's CI can run is the one whose bug gets fixed this release instead of next. Determinism (fixed seeds, fixed data, single shard) is what turns "sometimes wrong" into "always wrong here," which is what a maintainer needs to bisect.

Prerequisites

Requirement	Why
Core checkout + `publishToMavenLocal` working	To pin core for the plugin build
A plugin checkout (k-NN used here)	The integration partner
Read the section index build section	The version-pinning mechanic
Familiarity with `OpenSearchIntegTestCase` (Level 8)	The white-box repro

Two repro forms

flowchart LR
  BUG[Suspected core+plugin bug] --> A[Black-box repro]
  BUG --> B[White-box repro]
  A --> A1[Build a localDistro at version V]
  A1 --> A2[Install plugin built against V]
  A2 --> A3[curl / REST-YAML script]
  B --> B1[Plugin repo integ test]
  B1 --> B2[OpenSearchIntegTestCase + nodePlugins]
  B2 --> B3[Deterministic: fixed seed, 1 shard, fixed data]

Form	Lives in	Runs via	Best when
Black-box `curl` / REST-YAML	the bug report / `rest-api-spec`-style YAML	a running distro	the bug is observable over REST; easiest to share
White-box integ test	the plugin repo's test source	`./gradlew integTest`	the bug needs internal state or must run in the plugin's CI

Step-by-Step Tasks

Step 1 — Pin the versions across both repos

The cardinal rule: the plugin's opensearch.version must equal the distro's version. Decide one version V and drive everything from it.

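# Discover the core version you're on (the base version, e.g. 3.1.0)
cd ~/OpenSearch
V=$(grep -m1 "opensearch[ =]" buildSrc/version.properties 2>/dev/null | sed 's/.*= *//')
# An empty V means this branch declares its version somewhere else; find it before continuing
echo "Building against core ${V:?could not read the core version}"

# Publish core artifacts at V to ~/.m2
./gradlew publishToMavenLocal
# Local builds are snapshots unless built with -Dbuild.snapshot=false,
# so the published directory carries the -SNAPSHOT qualifier
ls ~/.m2/repository/org/opensearch/opensearch/"$V"-SNAPSHOT/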

Warning: This is the #1 source of false integration bugs. A "plugin won't load" report is usually a version mismatch, not a code bug: the descriptor says 3.1.0 and the node is 3.0.0. Always confirm that the plugin versions from _cat/plugins and the node version from _nodes agree before believing it's a real integration bug. See P6 for making that failure self-explanatory.

Step 2 — Build the black-box repro environment

# Build a runnable distro at V
cd ~/OpenSearch
./gradlew localDistro
# Absolute path: the next commands run from a different directory
DISTRO="$PWD/$(ls -d distribution/archives/*/build/install/opensearch-* | head -1)"

# Build the plugin against the SAME V and install it
cd ~/k-NN
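# Core was published as a snapshot (the default for local builds), so pass the qualified version
./gradlew assemble -Dopensearch.version="$V-SNAPSHOT"
"$DISTRO/bin/opensearch-plugin" install --batch "file://$(ls ~/k-NN/build/distributions/*"$V"*.zip | head -1)"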
"$DISTRO/bin/opensearch-plugin" list

# Start it
"$DISTRO/bin/opensearch" -d -p /tmp/os.pid
until curl -s localhost:9200 >/dev/null; do sleep 1; done
curl -s 'localhost:9200/_cat/plugins?v'

Step 3 — Write the minimal, deterministic `curl` repro

Minimize ruthlessly: one shard, no replicas, a handful of documents, the smallest query that triggers the bug. A worked example — a knn query that should return a known nearest neighbor but doesn't on a specific param:

#!/usr/bin/env bash
# repro.sh — minimal cross-repo (core + k-NN) reproducer. Deterministic.
set -euo pipefail
H=localhost:9200
curl -s -XDELETE $H/r >/dev/null 2>&1 || true

# 1 shard, 0 replicas => no cross-shard nondeterminism
curl -s -XPUT $H/r -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true, "number_of_shards": 1, "number_of_replicas": 0 },
  "mappings": { "properties": {
    "v": { "type": "knn_vector", "dimension": 2,
           "method": { "name": "hnsw", "engine": "faiss", "space_type": "l2" } } } }'

# Fixed data — neighbors are unambiguous
for id in 1 2 3; do
  curl -s -XPUT "$H/r/_doc/$id" -H 'Content-Type: application/json' \
    -d "{\"v\":[$id,$id]}" >/dev/null
done
curl -s -XPOST $H/r/_refresh >/dev/null

# The query under test: nearest to [1,1] should be doc 1
ACTUAL=$(curl -s "$H/r/_search" -H 'Content-Type: application/json' -d '{
  "size": 1, "query": { "knn": { "v": { "vector": [1,1], "k": 1 } } } }' \
  | python3 -c "import sys,json;d=json.load(sys.stdin);print(d['hits']['hits'][0]['_id'])")
echo "actual=$ACTUAL expected=1"
[ "$ACTUAL" = "1" ]   # non-zero exit status on mismatch

The properties that make this a good repro: a fixed index name, single shard (eliminates the reduce as a variable), fixed integer vectors (the nearest neighbor is unarguable), and an explicit expected value printed next to the actual. Anyone can run it and see pass/fail in one line.

Step 4 — Promote it to a shareable REST-YAML test

For a bug you want the plugin's CI to guard against, write it as a REST-YAML test (the format under rest-api-spec in core; plugins have an equivalent yamlRestTest source set). It's the same repro, declarative and assertable:

# plugin-side yamlRestTest: knn nearest neighbor is deterministic
"knn returns the true nearest neighbor on a single shard":
  - do:
      indices.create:
        index: r
        body:
          settings: { index.knn: true, number_of_shards: 1, number_of_replicas: 0 }
          mappings:
            properties:
              v: { type: knn_vector, dimension: 2,
                   method: { name: hnsw, engine: faiss, space_type: l2 } }
  - do: { index: { index: r, id: "1", body: { v: [1, 1] }, refresh: true } }
  - do: { index: { index: r, id: "2", body: { v: [9, 9] }, refresh: true } }
  - do:
      search:
        index: r
        body: { size: 1, query: { knn: { v: { vector: [1, 1], k: 1 } } } }
  - match: { hits.hits.0._id: "1" }

This is the artifact that prevents regression: it lives in the plugin repo and runs against a real cluster on every PR.

Step 5 — The white-box repro: an integ test with `nodePlugins()`

When the bug needs internal state, or to run in the plugin's ./gradlew integTest without a separate distro, write an OpenSearchIntegTestCase that loads the plugin into the in-JVM InternalTestCluster:

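// In the k-NN repo's integration test source set
// Imports assume the package layout seen in the P3 trace; adjust for your branch
import java.util.Collection;
import java.util.List;

import org.opensearch.action.search.SearchResponse;
import org.opensearch.common.settings.Settings;
import org.opensearch.knn.index.query.KNNQueryBuilder;
import org.opensearch.knn.plugin.KNNPlugin;
import org.opensearch.plugins.Plugin;
import org.opensearch.test.OpenSearchIntegTestCase;
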
public class KNNNearestNeighborIT extends OpenSearchIntegTestCase {

    @Override
    protected Collection<Class<? extends Plugin>> nodePlugins() {
        // This is the boundary: the in-JVM node loads core + this plugin
        return List.of(KNNPlugin.class);
    }

    public void testNearestNeighborIsDeterministic() throws Exception {
        String index = "r";
        createIndex(index, Settings.builder()
            .put("index.knn", true)
            .put("number_of_shards", 1)
            .put("number_of_replicas", 0)
            .build());
        // ... put the knn_vector mapping, index [1,1] and [9,9], refresh ...

        SearchResponse resp = client().prepareSearch(index)
            .setQuery(new KNNQueryBuilder("v", new float[]{1f, 1f}, 1))
            .setSize(1)
            .get();

        assertEquals("1", resp.getHits().getAt(0).getId());
    }
}

The load-bearing line is nodePlugins() returning the plugin class — that is what makes InternalTestCluster boot a node with the plugin's extension points wired in, so the test exercises the real core↔plugin integration, not a mock.

cd ~/k-NN
./gradlew integTest --tests "*KNNNearestNeighborIT*" -Dtests.seed=DEADBEEF

The -Dtests.seed=... pins the randomized-testing seed so the run is reproducible — paste that seed into the bug report so the maintainer gets your exact run. (See the testing levels for how randomized testing and seeds work.)

Why integration bugs are version-sensitive

A core↔plugin integration bug often reproduces only on a specific version pair, because the plugin compiles against core's internal APIs, which are not stable across versions (unlike the REST API). Four concrete ways version matters:

Version factor	Effect on the repro
Descriptor `opensearch.version` ≠ node version	plugin won't load at all — not a code bug, a build mismatch (P6)
A core SPI method changed signature between `V1`→`V2`	plugin built on `V1` throws `NoSuchMethodError` on `V2`
Wire serialization changed (`Writeable`) between versions	mixed-version cluster fails; see Serialization and BWC
A behavior changed in a core minor	the bug exists only on `V`, gone on `V±1`

So a complete cross-repo repro states the exact versions:

core:   3.1.0-SNAPSHOT (commit abc1234)
plugin: opensearch-knn 3.1.0-SNAPSHOT (commit def5678)
java:   21

Without the version pair, a maintainer who can't reproduce will close it "cannot reproduce" — correctly, because on their version pair it may not happen.

Expected Output

A localDistro at version V with a V-built plugin installed, verified via _cat/plugins.
A repro.sh that is deterministic (single shard, fixed data) and prints actual-vs-expected in one line.
A REST-YAML version of the same repro.
An OpenSearchIntegTestCase skeleton using nodePlugins(), runnable with a pinned -Dtests.seed.
A version block (core commit, plugin commit, JDK) ready to paste into a report.

Troubleshooting

Problem	Fix
`plugin [x] is incompatible with version [y]`	rebuild plugin with the node's exact `V`; check descriptor
`NoSuchMethodError` at query time	core internal API changed; rebuild plugin against the same core commit
Repro passes for you, fails in CI (or vice-versa)	nondeterminism — pin shards to 1, fix the seed, fix the data
`integTest` can't find the plugin class	confirm `nodePlugins()` returns it and the test is in the right source set
Distro won't start after install	check `logs/` for a descriptor or security-policy error

Stretch Goals

Make the black-box repro reproduce only on one core version by checking out a different core commit, rebuilding, and confirming it disappears — then document the version window in the report.
Convert the OpenSearchIntegTestCase to also assert an internal counter (e.g. a k-NN stat) to show the bug's internal footprint, not just the REST result.
Add a second node to the integ test (@ClusterScope(numDataNodes = 2)) and confirm whether the bug is shard-count sensitive.

Validation / Self-check

Why must the plugin's opensearch.version equal the node version, and what's the failure mode when it doesn't?
What three things make a curl repro deterministic, and why does each matter?
What does nodePlugins() do, and why is it the load-bearing line in a white-box integration test?
Why are core↔plugin integration bugs version-sensitive in a way that pure REST API behavior is not?
What exactly do you paste into the bug report so a maintainer can reproduce your randomized integ-test run?
Given "plugin won't load," what do you check before concluding it's a code bug?

Lab P6: Writing Diagnostics for Integration Bugs

Background

The previous five labs were about consuming diagnostics: reading stack traces (P3), attributing symptoms (P4), reproducing bugs (P5). This lab is about producing them. The highest-leverage cross-repo contribution you can make is rarely the fix itself — it is making the next person's attribution trivial. A boundary error that names the owning component, a log line that says which plugin and which version, a clearer message when a plugin and core disagree on versions: each of these saves dozens of future debugging hours across the ecosystem.

You will inventory the diagnostics OpenSearch already exposes at the core↔plugin boundary, then write a worked example: a clearer error when a plugin and core version mismatch. The change is small; the leverage is enormous, because version mismatch is the single most common cross-repo failure (P5), and a confusing message there sends people to the wrong repo.

This builds on the Plugin Architecture deep dive (where PluginsService validates versions), the error-messages roadmap stage, and Stage 8.

Why This Lab Matters for Contributors

Diagnostics are the contribution maintainers love and reviewers approve fastest, because they are low-risk (no behavior change to the happy path) and high-value. A one-line improvement to a boundary exception message can deflect a stream of mis-filed issues. And the skill compounds: once you've added a good diagnostic at one boundary, you see every other boundary's missing diagnostics. This is how contributors become maintainers — by making the system explain itself.

Prerequisites

Requirement	Why
A running distro with a plugin installed (from P5)	The system to diagnose
Core checkout for the worked patch	To edit `PluginsService`/`PluginInfo`
Read P3 and P5	Boundary failures + version sensitivity

The diagnostics OpenSearch already gives you

Before adding new diagnostics, master the existing ones. These are your cross-cutting visibility tools at the boundary.

Tool	What it answers	Command
`_cat/plugins`	which plugins are loaded, on which nodes, at what version	`curl -s 'localhost:9200/_cat/plugins?v'`
`_nodes/plugins`	full per-node plugin info incl. classname, `opensearch_version`	`curl -s 'localhost:9200/_nodes/plugins?pretty'`
`_nodes` (nodes info)	the node's own version (to compare against the plugin's)	`curl -s 'localhost:9200/_nodes?filter_path=nodes.*.version'`
Profile API	which phase/agg/query spends time — core vs plugin code	add `"profile": true` to a `_search`
`_tasks`	what's running right now, across nodes, with parent/child links	`curl -s 'localhost:9200/_tasks?detailed&group_by=parents'`
`_cluster/allocation/explain`	why a shard won't allocate (a core-side cross-cutting view)	`curl -s 'localhost:9200/_cluster/allocation/explain'`
`error_trace=true`	full stack trace in the HTTP response, not just the message	append `?error_trace=true`

# The version-mismatch triage triad — run these first on any "plugin broken" report
curl -s 'localhost:9200/_cat/plugins?v'   # quoted so the shell does not treat ? as a glob
curl -s 'localhost:9200/_nodes?filter_path=nodes.*.version'
curl -s 'localhost:9200/_nodes/plugins?filter_path=nodes.*.plugins.name,nodes.*.plugins.opensearch_version'

If the plugin's opensearch_version and the node version differ, you've diagnosed the bug in three curls — and that's exactly the failure whose error message you're about to improve.
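The comparison those three curls make by eye can also be sketched in code. The following is a minimal Java sketch, not OpenSearch code: `VersionTriage` and `findMismatches` are hypothetical names, the inputs are plain strings copied from the curl output, and it assumes a plugin is compatible only when its `opensearch_version` equals the node version exactly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenSearch: compares each plugin's
// opensearch_version (from _nodes/plugins) with the node version (from _nodes).
final class VersionTriage {

    // Returns plugin name -> plugin's opensearch_version for every plugin
    // whose version string differs from the node's.
    static Map<String, String> findMismatches(String nodeVersion, Map<String, String> pluginVersions) {
        Map<String, String> mismatches = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : pluginVersions.entrySet()) {
            if (!entry.getValue().equals(nodeVersion)) {
                mismatches.put(entry.getKey(), entry.getValue());
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        Map<String, String> plugins = new LinkedHashMap<>();
        plugins.put("opensearch-knn", "3.1.0");      // values are made up for illustration
        plugins.put("opensearch-security", "3.0.0");
        System.out.println(findMismatches("3.0.0", plugins));
    }
}
```

Run against the example values, this prints `{opensearch-knn=3.1.0}`: the one plugin whose build version disagrees with the node.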

Anatomy of a good boundary diagnostic

A diagnostic at the core↔plugin seam should answer, in the message itself, three questions a confused user has:

Which component owns this? Name the plugin and core, so the reader knows it's a boundary issue, not a pure-core one.
What are the conflicting facts? Print both versions/values, not just "they don't match."
What's the fix? State the action ("rebuild the plugin for version X").

A message that does all three turns a P4 attribution from minutes into seconds, and stops the issue being filed on the wrong repo.
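One way to hold yourself to the three questions is to make the message constructor demand an answer to each. A minimal Java sketch under that assumption follows; `BoundaryDiagnostic` and `build` are hypothetical names, not an OpenSearch API, and the fix text in the example call is illustrative only.

```java
import java.util.Objects;

// Hypothetical helper, not part of OpenSearch: every argument maps to one of
// the three questions, so a message cannot be built with an answer missing.
final class BoundaryDiagnostic {

    static String build(String owner, String requirement, String expected, String actual, String fix) {
        Objects.requireNonNull(owner, "owner");             // Q1: which component owns this?
        Objects.requireNonNull(requirement, "requirement");
        Objects.requireNonNull(expected, "expected");       // Q2: the two conflicting facts
        Objects.requireNonNull(actual, "actual");
        Objects.requireNonNull(fix, "fix");                 // Q3: what should the reader do?
        return owner + " " + requirement + " [" + expected + "]; it is [" + actual + "]. " + fix;
    }

    public static void main(String[] args) {
        System.out.println(build(
            "[knn] query",
            "requires field [v] to be type",
            "knn_vector",
            "float",
            "Map the field as knn_vector or query a different field."));
    }
}
```

The design choice is that a missing answer fails at construction time with a `NullPointerException` naming the missing argument, instead of producing a vague message that ships.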

Worked Example — a clearer plugin/core version-mismatch error

Step 1 — Find where core enforces version compatibility

PluginsService validates each plugin's descriptor at load. The version check compares the descriptor's opensearch.version to the running node's Version.

grep -n "verifyCompatibility\|isIncompatible\|opensearch.version\|Version.CURRENT" \
  ~/OpenSearch/server/src/main/java/org/opensearch/plugins/PluginsService.java \
  ~/OpenSearch/server/src/main/java/org/opensearch/plugins/PluginInfo.java

You'll find the place that throws when the versions don't match. The current message is typically along the lines of:

plugin [opensearch-knn] is incompatible with version [3.0.0]; was designed for version [3.1.0]

That's okay — it names both versions — but it doesn't say what to do, doesn't distinguish "rebuild the plugin" from "use a different node," and doesn't point at the plugin's repo. We can make it actionable.

Step 2 — Improve the message

Edit the exception construction to add the action and the boundary framing. A representative diff (exact field/method names vary by branch — confirm with the grep above):

--- a/server/src/main/java/org/opensearch/plugins/PluginsService.java
+++ b/server/src/main/java/org/opensearch/plugins/PluginsService.java
@@ public final class PluginsService {
-            throw new IllegalArgumentException(
-                "plugin ["
-                    + info.getName()
-                    + "] is incompatible with version ["
-                    + Version.CURRENT
-                    + "]; was designed for version ["
-                    + info.getOpenSearchVersion()
-                    + "]");
+            throw new IllegalArgumentException(
+                "plugin ["
+                    + info.getName()
+                    + "] was built for OpenSearch ["
+                    + info.getOpenSearchVersion()
+                    + "] but this node is OpenSearch ["
+                    + Version.CURRENT
+                    + "]. Plugins are version-locked to the node because they "
+                    + "compile against internal APIs that are not stable across "
+                    + "versions. Rebuild the plugin against ["
+                    + Version.CURRENT
+                    + "] (set opensearch.version=" + Version.CURRENT + " when "
+                    + "building the plugin) and reinstall it, or run the plugin "
+                    + "on an OpenSearch [" + info.getOpenSearchVersion() + "] node.");

What changed, against the three-question rule:

Question	Before	After
Which component?	"plugin [p]"	"plugin [p] ... this node is OpenSearch [node]" (boundary framed)
Conflicting facts?	both versions shown	both versions shown, labeled "built for" vs "this node"
The fix?	absent	"Rebuild the plugin against [node] ... or run the plugin on an OpenSearch [built] node"

Step 3 — Add a CHANGELOG entry (every core PR needs one)

--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ ## [Unreleased 3.x]
 ### Changed
+- Make the plugin/core version-mismatch error actionable by naming both versions
+  and the rebuild step ([#NNNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNNN))

Step 4 — Test the new message

The plugin version-compatibility check already has a unit test; extend it rather than eyeballing the string. Find it and add an assertion on the new wording:

grep -rln "verifyCompatibility\|incompatible with version\|PluginsServiceTests" \
  ~/OpenSearch/server/src/test/java/org/opensearch/plugins/

// In PluginsServiceTests (illustrative; confirm the method name and visibility on your branch)
// infoBuiltForDifferentVersion: a PluginInfo whose opensearch.version differs from Version.CURRENT
IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
    () -> PluginsService.verifyCompatibility(infoBuiltForDifferentVersion));
assertThat(e.getMessage(), containsString("was built for OpenSearch"));
assertThat(e.getMessage(), containsString("Rebuild the plugin"));

cd ~/OpenSearch
./gradlew :server:test --tests "*PluginsServiceTests*"
./gradlew precommit   # checkstyle, headers, forbidden-APIs — required before a PR

Step 5 — Prove it end to end

Reproduce the original confusing failure and confirm the new message:

# Build the plugin against a DIFFERENT version than the distro, then try to install
cd ~/k-NN && ./gradlew assemble -Dopensearch.version=3.1.0   # mismatched on purpose
# A file:// URL needs an absolute path, so prefix the relative zip path with $PWD
"$DISTRO/bin/opensearch-plugin" install --batch "file://$PWD/$(ls build/distributions/*3.1.0*.zip)" || true
# Start the node (or check the install-time validation) and read the message

The reader of that message now knows it's a version lock, sees both versions, and knows to rebuild — without opening a single source file or filing an issue on the wrong repo.

Other high-leverage boundary diagnostics to consider

The version-mismatch message is one example. The same three-question discipline applies all over the seam:

Boundary	Weak diagnostic	Stronger diagnostic
`knn` query on a non-`knn_vector` field (P3)	`ClassCastException`	"[knn] query requires field [v] to be type [knn_vector]; it is [float]"
security denies an action	`403 no permissions`	also log the action name + role evaluated, at DEBUG, naming security as the decider
plugin's `createComponents` returns nothing	silent missing feature	a startup log naming the plugin and what it failed to register
`NamedWriteable` not registered	cryptic deserialize error	name the type + the plugin expected to register it (Serialization and BWC)
engine factory mismatch	obscure search failure	name the index, the expected engine, and the plugin that owns it

Each is a small PR; each deflects a category of mis-filed issues.

Why diagnostics are the highest-leverage cross-repo contribution

Property	Why it matters across repos
Low risk	no happy-path behavior change; reviewers approve quickly
High deflection	one good message stops many issues being filed on the wrong repo
Compounding	every future P4 attribution gets faster
Teaches the system	the message itself documents the boundary contract
Cross-repo by nature	a boundary diagnostic helps core, plugin, and Dashboards teams at once

A contributor who fixes one bug helps one user. A contributor who makes the boundary explain itself helps everyone who hits that boundary, forever. That is why diagnostics, not features, are the fastest route from contributor to trusted maintainer in a multi-repo project.

Expected Output

The output of the version-mismatch triage triad (_cat/plugins, node version, _nodes/plugins) on your cluster.
A diff improving the plugin/core version-mismatch message against the three-question rule.
A CHANGELOG entry and a unit-test assertion on the new wording, with :server:test and precommit passing.
An end-to-end demonstration: a deliberately mismatched plugin install showing the new message.

Stretch Goals

Pick one row from the "other high-leverage diagnostics" table, implement it as a real patch in the relevant repo (core or the plugin), and write the test.
Add a DEBUG log line in the security action-filter path (in a security checkout) that names the action and the role evaluated on denial — then verify it appears with security TRACE on (P3).
Propose (as a GitHub issue) adding the plugin's git commit, not just version, to _nodes/plugins, and argue why it helps cross-repo repro (P5).

Validation / Self-check

State the three questions a good boundary diagnostic must answer in its own message text.
Run the version-mismatch triage triad and explain what each of the three calls tells you.
Show the before/after of the version-mismatch message and map each change to one of the three questions.
Why is a diagnostics patch lower-risk and faster to merge than a behavior fix?
Why is improving a boundary message a cross-repo contribution even though you only edited core?
Name two boundary diagnostics, other than the version message, that would deflect mis-filed issues, and which repo each lives in.

Release & Governance Reality

This section takes you inside the maintainer and Technical Steering Committee (TSC) view of OpenSearch. The Contributor Mindset section answered the question "how do I behave so my work gets accepted?" This section answers the mirror-image question: what work are the people who accept your pull request actually doing? It covers what a maintainer reads in your diff, why a "trivial" change gets heavy scrutiny, how a fix actually ships in a release, and who decides any of it.

It is written for two audiences:

Contributors who want to understand maintainers. If you know what a maintainer is worried about — backward compatibility across a mixed cluster, the blast radius of a change, the cost of a revert — you write PRs that get merged instead of stalling.
Aspiring maintainers. The path from contributor to maintainer to TSC is not a mystery, but nobody hands you the operational playbook. This section is that playbook.

The chapters are deliberately not aspirational. They are the mechanics — which CI check gates a merge, what the backport 2.x label does, where the SPDX header and licenses/ rules are bright lines, and what a "release readiness" issue tracks.

Note: OpenSearch is not an Apache project, and its governance is not the Apache model. There is no PMC, no IPMC, no incubator, no [VOTE] email thread, and no JIRA. If you came here from an Apache background (or from the sibling Apache Tez curriculum), unlearn those reflexes. The differences are spelled out below and recur throughout the section.

How OpenSearch Governance Differs From the Apache Model

The Apache Software Foundation governs each project through a Project Management Committee (PMC) that holds binding votes, cuts releases by a formal [VOTE] email on a developer mailing list, and tracks work in JIRA. OpenSearch works nothing like that.

Concern	Apache model	OpenSearch model
Legal/foundation home	Apache Software Foundation	OpenSearch Software Foundation, under the Linux Foundation (moved there Sept 2024; previously AWS-stewarded)
Top governance body	Per-project PMC + the ASF Board	Technical Steering Committee (TSC) for technical direction; LF for legal/IP
Per-component authority	Committers + PMC members	Maintainers listed in each repo's `MAINTAINERS.md`
Issue tracker	Apache JIRA	GitHub Issues (one repo, one tracker)
Code review unit	A patch attached to a JIRA, or a GitHub PR mirrored back	A GitHub Pull Request — the native and only unit
Contributor agreement	ICLA / CCLA on file	DCO sign-off per commit (`git commit -s`); no CLA
Design discussion	`dev@` mailing list thread	A GitHub issue labeled `RFC` / `meta` / `proposal`, plus `forum.opensearch.org`
Release decision	A formal `[VOTE]` on `dev@`, 72h, 3 binding +1s	Open release process in `opensearch-build` — a release manager, a "release readiness" tracking issue, no binding email vote
Release artifact provenance	Signed source tarball is the release; binaries are convenience	Built and validated openly via the `opensearch-build` pipeline; release notes generated from `CHANGELOG.md` + PRs
License	Apache License 2.0	Apache License 2.0 (this part is the same — and is the reason OpenSearch exists)

The single most important reframing: in OpenSearch, the design discussion, the review, the merge decision, the backport, and the release tracking all happen in public on GitHub. There is no private mailing list where the "real" decision is made. The artifact you watch is an issue thread and a PR, not an email vote. If you can read GitHub, you can read the governance.

Reading Order

The section is one index plus seven chapters. Read them in order the first time; afterwards they stand alone.

#	Chapter	Audience
1	Communication Channels and Building Consensus	Everyone
2	GitHub Issues and PR Review	Contributors and maintainers
3	How Maintainers Think About Compatibility	Aspiring maintainers; contributors touching wire/index/REST surfaces
4	The Release Process and Release Trains	Anyone who needs a fix in a specific release; release managers
5	The TSC and Project Governance	Aspiring maintainers; anyone proposing a new plugin/repo
6	Licensing, SPDX Headers, and the Apache 2.0 Story	Everyone touching dependencies; release managers
7	Code Style, Test Quality, and Building Trust	All contributors

Chapters 1–2 and 6–7 are immediately useful to any contributor. Chapters 3–5 are maintainer- and release-facing, but read them early: understanding why a maintainer blocks a "small" change on compatibility grounds, or why the backport deadline is firm, will save you a stalled PR.

How This Differs From the Contributor Mindset Section

The two sections are two sides of the same coin and are meant to be read together:

The Contributor Mindset section is first person: reading the codebase, designing via GitHub, writing a high-quality PR, responding to feedback, thinking about compatibility, and aiming for maintainership. It is what you do.
This section is the inverse view: it is what the maintainer reviewing your PR, the release manager shipping your change, and the TSC governing the project do. It is the other side of the table.

Specifically, several mindset chapters have a governance mirror here:

Contributor Mindset chapter	Governance mirror in this section
`pr-quality.md`	`github-review.md` and `code-style-trust.md`
`compatibility.md`	`maintainer-mindset.md`
`responding-to-feedback.md`	`communication-channels.md`
`maintainership.md`	`tsc-governance.md` (the ladder)
`design-via-github.md`	`communication-channels.md` (RFCs, consensus)

Read your own behavior in the mindset section, then read what the receiving end expects here, and your PRs converge on what gets merged.

Prerequisites

Before this section is fully useful:

You have read the Contributor Mindset section, especially compatibility.md and pr-quality.md.
You have a GitHub account and have configured DCO sign-off (git config commit.gpgsign is optional; the Signed-off-by: trailer from git commit -s is mandatory).

You have a local clone of the core engine at ~/OpenSearch:

git clone https://github.com/opensearch-project/OpenSearch.git ~/OpenSearch
cd ~/OpenSearch
ls CONTRIBUTING.md DEVELOPER_GUIDE.md MAINTAINERS.md CHANGELOG.md TESTING.md

You have skimmed CONTRIBUTING.md, DEVELOPER_GUIDE.md, and MAINTAINERS.md in that clone. These are the authoritative process documents; this section explains the why behind them and is not a substitute for them.
You have an account on forum.opensearch.org and have joined the public Slack at opensearch.org/slack (covered in Communication Channels).

If you intend to follow the capstone, you will exercise nearly everything in this section: getting a real PR reviewed (github-review.md), backported, and into a release (release-process.md).

You've Absorbed This Section When…

You have internalised this material when you can, without looking it up:

Name the CI checks that gate a merge on the core repo (gradle-check, assemble, precommit, DCO) and say what each one protects.
Explain why OpenSearch uses review-then-merge with maintainer approval, not commit-then-review.
Add the backport 2.x label to a merged PR and describe exactly what the backport bot then does — and what to do when it conflicts.
Predict, before opening an issue, whether it belongs as a plain bug, an enhancement, or an RFC/proposal, and which channel drives consensus for each.
Read a diff that touches StreamInput/StreamOutput or a REST response shape and predict whether a maintainer will block it on backward-compatibility grounds. (See the serialization-BWC deep dive.)
Trace how a one-line fix merged to main ends up in a 2.x patch release: the CHANGELOG.md entry, the backport, the release-readiness issue, the release manager, and the generated release notes.
Add a new dependency and correctly predict the license review it triggers: SPDX header, the licenses/ SHA + LICENSE/NOTICE files, and the allowed-license check.
Describe the contributor → maintainer → TSC ladder and what changes at each rung.

The next chapter — Communication Channels and Building Consensus — covers the operational mechanics of the GitHub/forum/Slack/community-meeting system that this entire section relies on.

Communication Channels and Building Consensus

OpenSearch has no private decision-making channel. There is no dev@ list where the "real" discussion happens, no JIRA, no committee email thread you need to be cc'd on. The project's nervous system is a small set of public channels, and consensus is something you can watch being built in the open. This chapter is the operational map: which channel does what, when to use which, and how to drive a proposal from idea to merged change without burning goodwill.

If you have read Design via GitHub from the contributor side, this is the maintainer- and process-side view of the same machinery.

The Channels at a Glance

Channel	URL / location	Purpose	Latency	Authoritative?
GitHub Issues	`github.com/opensearch-project/OpenSearch/issues`	Bugs, enhancements, RFCs, meta/tracking, release readiness	hours–days	Yes — the system of record
GitHub Pull Requests	same repo, `/pulls`	Code review and the merge decision	hours–days	Yes
Community forum	`forum.opensearch.org`	Usage questions, broad discussion, announcements, "is this a bug?" triage	hours–days	No (but searchable, durable)
Public Slack	`opensearch.org/slack`	Real-time chat, quick questions, coordinating, finding a maintainer	seconds–minutes	No — ephemeral; decisions must land on GitHub
Community meetings	recurring, public, recorded	TSC updates, project demos, roadmap, open discussion	biweekly/monthly	No — but minutes/recordings are durable
Announce / mailing presence	project announce list & blog	Release announcements, security advisories, governance changes	per-event	One-way

There is exactly one bright line to memorize: a decision is not real until it is written down in a GitHub issue or PR. Slack threads and hallway conversations at a community meeting are where alignment forms; the issue is where it is recorded. If you negotiate a design in Slack, your last act is to post the summary back on the issue. Maintainers will ask you to do this if you don't.

Warning: Never treat a Slack +1 or a verbal nod in a community meeting as approval to merge. The approval that counts is a maintainer's review on the PR. Anything else is just momentum.

What Each Channel Is For — and Not For

GitHub Issues

The issue is the unit of work. Every bug, every feature, every design lives here first. Issues carry labels that route and stage them:

Label	Meaning
`untriaged`	Not yet looked at by a maintainer; the default for new issues
`bug` / `enhancement`	Triaged category
`RFC` / `proposal` / `meta`	A design or cross-cutting discussion, not a single change
`good first issue` / `help wanted`	Open for new contributors
`flaky-test`	A non-deterministic test failure
`v3.1.0`, `v2.18.0` (version labels)	Targeted release
`backport 2.x`, `backport 1.x`	Drives the backport bot on the associated PR

A maintainer triaging an issue moves it from untriaged to a real category, may ask for a reproduction, and may attach it to a release version. You can accelerate this: a clean, reproducible bug report with a curl repro against localhost:9200 and the exact version gets triaged faster than a vague one. See GitHub Issues and PR Review for the triage-to-merge flow.

The Forum (`forum.opensearch.org`)

The forum is for discussion that isn't yet a unit of work: "Is this expected behavior?", "How do others solve X?", "We're thinking about a feature in this area — has anyone tried?" It is durable and searchable, which makes it better than Slack for anything you want to reference later. If a forum thread converges on "this is a bug" or "this is worth building," the outcome is to open a GitHub issue and link the thread.

Slack (`opensearch.org/slack`)

Slack is for speed and for humans: unblocking a quick question, finding which maintainer owns an area, coordinating who takes which piece, sanity-checking before you write a long issue. It is ephemeral — old messages scroll away and aren't a record. Use it to move fast, then write the outcome down.

Community Meetings

Recorded, public, recurring meetings cover TSC updates, roadmap, project demos, and open discussion. As a contributor you mostly consume these (watch a recording to understand direction). As an aspiring maintainer you eventually present in them — demoing a feature or walking an RFC. Because they are recorded and minuted, they are a legitimate way to surface a proposal to a wide audience, but the binding artifact is still the linked issue.

How Consensus Is Actually Built

OpenSearch consensus is lazy consensus plus maintainer approval, scaled by blast radius. The bigger the change, the wider the circle that must align.

flowchart TD
    A[Idea / problem] --> B{Scope?}
    B -->|Bug or small change| C[Open issue: bug/enhancement]
    B -->|New feature / design| D[Open RFC issue: label RFC/proposal]
    B -->|Cross-cutting / multi-repo| E[RFC + community meeting + TSC]
    C --> F[Discussion on the issue]
    D --> F
    E --> F
    F --> G{Maintainer alignment?}
    G -->|No objection / +1s| H[Open PR implementing it]
    G -->|Objection| I[Iterate on issue: revise scope/design]
    I --> F
    H --> J[Review-then-merge: maintainer approvals + green CI]
    J --> K[Merged + CHANGELOG entry]
    K --> L{Needs older line?}
    L -->|Yes| M[backport label -> bot opens backport PR]
    L -->|No| N[Ships in next release of main]

The three circles, by blast radius:

Discussion on the issue / RFC. For a contained change, this is the whole game. You describe the problem, a maintainer agrees on the approach, you implement it. "Lazy consensus" means silence after a clear proposal, given reasonable time, is assent — but only a maintainer's actual review approves the PR.
Maintainer alignment. For anything touching a public surface — REST API, wire format, index format, a setting's default — the maintainers who own that area must agree before you write much code. Getting this alignment on the issue is the single highest-leverage thing you can do; it is far cheaper to change a paragraph than a 2,000-line PR. See How Maintainers Think About Compatibility for what they weigh.
The TSC for cross-cutting concerns. Changes that span repos, set project-wide policy, add a whole subsystem, or affect the experience across the bundle escalate to the Technical Steering Committee. You usually don't go straight to the TSC; a maintainer or the discussion routes it there when it's warranted.

When to Use Which Channel

Situation	Start here	Then
"I think I found a bug"	Forum or issue search (avoid duplicates)	Open a `bug` issue with a repro
"I want to build feature X"	`RFC`/`proposal` issue	Align with maintainers, then PR
"Quick question, am I blocked?"	Slack	If it's a real decision, record on the issue
"Is this the expected behavior?"	Forum	Open an issue if it's a bug
"Who owns the allocation code?"	`MAINTAINERS.md` + Slack	Tag them on the issue/PR
"This is a security problem"	Do not post publicly	Use the security disclosure process (`SECURITY.md`)
"Cross-repo / project-wide change"	RFC issue + community meeting	TSC if it needs project policy

Warning: Security vulnerabilities are the one thing that does not go in a public channel. Follow the coordinated disclosure process documented in SECURITY.md (a private report path), never a public issue, forum post, or Slack message.

Driving a Proposal — The Playbook

To take an idea to a merged change without stalling:

Search first. Check existing issues, the forum, and MAINTAINERS.md. Many "new" proposals already have a thread; join it instead of forking the discussion.
Write the problem, not the solution, first. A good RFC issue leads with the problem, the constraints, and why now, then proposes an approach and lists alternatives you rejected. Maintainers evaluate problems; they distrust solutions that arrive without one.
Label it correctly. RFC/proposal for design, enhancement for a concrete change, meta for a multi-PR tracking effort. Mislabeling buries your issue.
Tag the right owners. Use MAINTAINERS.md to find who owns the area and @-mention them — but tag, don't spam. One thoughtful ping beats five.
Get alignment before you build. Do not open a large PR before the design has a maintainer's blessing on the issue. A PR is a commitment; an issue is a conversation. Pay for the conversation first.
Implement in a focused PR that links the issue (Closes #NNNN), adds a CHANGELOG.md entry, and is small enough to review. See GitHub Issues and PR Review.
Summarize decisions back onto the issue. If alignment happened in Slack or a meeting, post the conclusion on the issue so the record is complete.

Etiquette

The norms that make you easy to work with — and that maintainers remember:

Be concise and reproducible. A maintainer's scarcest resource is attention. A three-line repro saves them twenty minutes and earns you review bandwidth.
Don't cross-post the same question to three channels. Pick the right one; escalate only if it goes stale, and link the prior thread when you do.
Assume good faith and respond to the strongest version of the objection. Maintainers block changes for reasons (usually compatibility or maintenance cost), not to gatekeep.
Follow the Code of Conduct. It is not decorative.
Be patient and persistent, not loud. Reviewers are volunteers and employees with finite time. A polite "friendly ping" after several days is fine; daily nagging is not.
Leave the channel better than you found it. Update the issue when you learn something; close your own stale threads; link duplicates.

These manners are not soft skills — they are how you accumulate the trust that turns into review bandwidth and eventually maintainership.

Prove You Understand This

Where is the authoritative record of a design decision, and where is it merely formed?
You discussed and agreed an approach with a maintainer in Slack. What is your last step before writing code?
You believe you found a remotely exploitable bug. Which channel do you use, and which do you explicitly avoid?
A new feature you want spans the core repo and two plugin repos. Sketch the path from idea to merged change, naming the channels and the point at which the TSC gets involved.
What does the untriaged label mean, and what would make a maintainer remove it from your issue quickly?

The next chapter — GitHub Issues and PR Review — turns from forming consensus to executing it: how a PR actually gets reviewed and merged.

GitHub Issues and PR Review

This is the chapter you came for if you have ever watched a pull request sit green for a week and wondered what the maintainer was waiting on. OpenSearch review happens entirely on GitHub — no JIRA patches, no mailing-list code review. This chapter is the mechanics of that process from the maintainer's side: the PR lifecycle, the CI gate, who can approve and merge, the MAINTAINERS.md/CODEOWNERS machinery, how the backport bot works, and — most usefully — how a maintainer actually reads your diff.

If PR Quality told you how to write the PR, this tells you how it's received.

The PR Lifecycle

stateDiagram-v2
    [*] --> Opened
    Opened --> CI_running: checks triggered
    CI_running --> CI_failed: a check is red
    CI_failed --> CI_running: push a fix
    CI_running --> In_review: all checks green
    In_review --> Changes_requested: maintainer review
    Changes_requested --> CI_running: address feedback, push
    In_review --> Approved: required approvals met
    Approved --> Merged: maintainer squash-merges
    Merged --> Backport: backport label present
    Backport --> Backport_PR: bot opens PR on 2.x / 1.x
    Backport_PR --> [*]
    Merged --> [*]

From open to merge, every PR passes three gates, in this rough order: DCO + CI green → maintainer review → required approvals → merge. A PR that is missing any one of these does not merge, no matter how good the code is.

Required CI Checks

The core repo gates merges on a set of automated checks. These run on every push to the PR branch. The exact set evolves, but the load-bearing ones are:

Check	What it runs (locally)	What it protects
DCO	every commit has `Signed-off-by:` (from `git commit -s`)	Legal: contributions are licensed under Apache-2.0 (no CLA, so DCO is the mechanism)
gradle-check	the heavy gate: `./gradlew check` — unit + integration tests, precommit	Correctness and regressions across the engine
assemble	`./gradlew assemble`	The build still produces artifacts
precommit	`./gradlew precommit` — checkstyle, forbidden-APIs, license header check, dependency license check, `loggerUsageCheck`, Spotless check	Style, banned APIs, licensing hygiene

Run them yourself before you push, in this order, because they get progressively more expensive:

cd ~/OpenSearch
./gradlew spotlessApply          # auto-fix formatting first
./gradlew precommit              # cheap gate: style, headers, forbidden APIs
./gradlew assemble               # build
./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests"  # scoped tests

Note: gradle-check is the big one and it is randomized — it runs tests with a random seed. A green local run does not guarantee a green CI run, and a red CI run may be a flaky test, not your change. If a failure looks unrelated, find the printed -Dtests.seed=... line, reproduce it, and check whether the same test is already labeled flaky-test. Re-running CI is legitimate when you've confirmed the failure is pre-existing; silently re-running until green to mask a real regression is not.
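Copying the seed reproduces the failure because every random choice in the run is derived from it. The Java sketch below illustrates the principle with plain `java.util.Random`; it is a stand-in, not the test framework's actual mechanism, and `SeedRepro` and `drawInputs` are hypothetical names. The seed value is made up.

```java
import java.util.Arrays;
import java.util.Random;

final class SeedRepro {

    // Stand-in for a randomized test's setup: all "random" inputs come from one seed.
    static int[] drawInputs(long seed, int count) {
        Random random = new Random(seed);
        int[] inputs = new int[count];
        for (int i = 0; i < count; i++) {
            inputs[i] = random.nextInt(100);
        }
        return inputs;
    }

    public static void main(String[] args) {
        long seed = Long.parseUnsignedLong("DEADBEEF", 16); // illustrative seed, parsed from hex
        int[] ciRun = drawInputs(seed, 5);
        int[] localRun = drawInputs(seed, 5);
        System.out.println(Arrays.equals(ciRun, localRun)); // same seed, same inputs
    }
}
```

Both runs draw identical inputs, which is why a failure seen in CI can be replayed locally once you pass the same seed.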

A maintainer will not even start a serious review until CI is green. Red CI is a self-service problem; fix it before you ask for eyes.
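Of the checks in the table above, DCO is the simplest to reason about: it is a text check on each commit message. A minimal Java sketch of that idea follows. `DcoCheck` and `hasSignOff` are hypothetical names, and the sketch only looks for a well-formed `Signed-off-by:` trailer; the real check may also compare the trailer against the commit author, which this sketch does not attempt.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the project's DCO check: looks for the trailer
// line that `git commit -s` appends, "Signed-off-by: Name <email>".
final class DcoCheck {

    private static final Pattern SIGN_OFF =
        Pattern.compile("^Signed-off-by: .+ <[^<>@\\s]+@[^<>\\s]+>$", Pattern.MULTILINE);

    static boolean hasSignOff(String commitMessage) {
        return SIGN_OFF.matcher(commitMessage).find();
    }

    public static void main(String[] args) {
        String signed = "Fix off-by-one\n\nSigned-off-by: Jane Doe <jane@example.com>\n"; // made-up author
        String unsigned = "Fix off-by-one\n";
        System.out.println(hasSignOff(signed));   // true
        System.out.println(hasSignOff(unsigned)); // false
    }
}
```

If a commit is missing the trailer, `git commit --amend -s` adds it to the latest commit; older commits in the branch need a rebase.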

Who Reviews, Who Approves, Who Merges

Three documents define authority in any OpenSearch repo:

File	Role
`MAINTAINERS.md`	The human-readable list of maintainers — who has merge rights and owns the repo
`.github/CODEOWNERS`	Maps paths/areas to reviewers; GitHub auto-requests them on matching PRs
`CONTRIBUTING.md` / `DEVELOPER_GUIDE.md`	The process and standards a PR must meet

The model is review-then-merge: a change is reviewed before it lands, not after. Contrast this with commit-then-review (push first, review later) — OpenSearch does not do that on the core engine, because the cost of a bad change in a distributed storage system is too high. A PR needs:

Green required CI (above), and
Approval from a maintainer (the required count depends on the repo; the spirit is "an area owner has signed off"). CODEOWNERS ensures the right maintainer is asked.

Only maintainers (those in MAINTAINERS.md) can merge. A non-maintainer's "approve" review is valuable signal and helps the maintainer, but it does not unlock the merge button.

Inspect the authority for an area yourself:

cd ~/OpenSearch
sed -n '1,60p' MAINTAINERS.md          # who the maintainers are
cat .github/CODEOWNERS 2>/dev/null     # path -> owner mapping (if present)
git shortlog -sne --since="12 months ago" HEAD -- server/src/main/java/org/opensearch/cluster | head
# ^ who has actually been changing this area lately = de facto reviewer

That last command is the trick: the people who recently and repeatedly touched a directory are the ones who will review changes to it, whether or not CODEOWNERS lists them.

Squash-merge norms

OpenSearch repos squash-merge by default: your PR becomes one commit on main, with the PR title as the commit subject. Practical consequences:

Write the PR title as the commit message you want in history — imperative, scoped, referencing the area (Fix DiskThresholdDecider off-by-one on relocating shards (#NNNN)).
Don't obsess over a clean per-commit history inside the PR; it collapses on merge.
Keep one PR to one logical change. A squashed commit that does three things is a worse history entry and a worse git bisect target later.

How a Maintainer Reads a Diff

When a maintainer opens your PR, they are not reading top to bottom. They are scanning for risk in a fixed priority order. Internalize this order and you write PRs that pass on the first read.

#	What they look at	The question in their head
1	Scope — files touched, diff size	"Is this one change, or three smuggled together?"
2	Tests — new/changed tests, do they actually assert the behavior	"If I revert the production change, does a test go red?"
3	Backward compatibility — wire (`StreamInput`/`StreamOutput`), index format, REST shape, setting defaults	"Does this break a mixed-version cluster or a rolling upgrade?"
4	Blast radius — who else calls this, is it on a hot path	"What breaks if this is subtly wrong in production?"
5	Correctness — the actual logic	"Is the change right?"
6	Style — formatting, naming, logging	"Is it consistent? (Spotless/checkstyle already checked most of this.)"

The order is deliberate. Scope and tests come before correctness. A correct change with no test, or a correct change bundled with two unrelated ones, gets sent back before the maintainer evaluates whether the logic is right — because they can't review what they can't isolate, and they won't merge what they can't protect against future regression.

The compatibility lens (row 3) is where "small" changes die. A one-line change to what a Writeable reads off the wire, or to a field in a REST response, can break every node in a cluster mid-upgrade. This is so central it has its own chapter — How Maintainers Think About Compatibility — and a deep dive, serialization-BWC.

Labels and Triage

Issues and PRs move through a label-driven state machine. The two transitions you'll see most:

untriaged → triaged. A new issue starts untriaged. A maintainer reviews it, assigns a real category (bug, enhancement, RFC), possibly a version label, and removes untriaged. A clean reproduction speeds this up enormously.
PR labels signal release targeting (v3.1.0) and backporting (backport 2.x).

# Find issues ready for a new contributor:
# github.com/opensearch-project/OpenSearch/issues?q=is:open+label:"good first issue"
# Find still-untriaged issues a maintainer hasn't looked at:
# github.com/opensearch-project/OpenSearch/issues?q=is:open+label:untriaged

The `backport` Label and the Bot

OpenSearch develops on main (the 3.x line) and maintains older lines on branches: 2.x (maintenance) and 1.x (legacy). A fix that belongs in a released line must be backported to that branch. This is automated:

You (or a maintainer) add the backport 2.x label to the PR before or at merge.
After the PR squash-merges to main, the backport bot cherry-picks the squashed commit onto a new branch off 2.x and opens a backport PR.
That backport PR runs CI and needs its own approval/merge — backporting is not a free pass around review.
If the cherry-pick conflicts, the bot fails and comments. You then create the backport manually:

cd ~/OpenSearch
git fetch origin
git checkout -b backport/2.x/my-fix origin/2.x
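git cherry-pick -x <squashed-commit-sha>   # stops here if there are conflicts
# ... fix conflicts, keep the CHANGELOG entry under the right release ...
git add <resolved-files>
git cherry-pick --continue                 # finish the cherry-pick once conflicts are staged
git push origin backport/2.x/my-fix        # use your fork's remote if origin is the upstream repo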
# open a PR targeting the 2.x branch

Note: The CHANGELOG entry moves with the backport. On main it sits under the [Unreleased 3.x] section; on the 2.x backport it must sit under the [Unreleased 2.x] section. Getting this wrong is a common backport-PR review comment. See The Release Process for how those sections become release notes.
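
The section rule in the note above is easy to check mechanically. Here is a minimal sketch, assuming the `## [Unreleased N.x]` heading style described in this chapter. `ChangelogSection` and `sectionOf` are hypothetical helpers, not part of any OpenSearch tooling.

```java
// Illustration only: finds which "## [...]" heading a CHANGELOG entry sits under.
final class ChangelogSection {

    // Returns the text of the "## " heading above the first line containing `needle`,
    // or null if no line contains it.
    static String sectionOf(String changelog, String needle) {
        String current = null;
        for (String line : changelog.split("\n")) {
            if (line.startsWith("## ")) {
                current = line.substring(3).trim(); // "### Category" lines do not match "## "
            } else if (line.contains(needle)) {
                return current;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String backportChangelog =
            "## [Unreleased 2.x]\n### Fixed\n- Fix decider off-by-one (#1234)\n";
        // On a 2.x backport branch, the entry should report the 2.x section.
        System.out.println(sectionOf(backportChangelog, "#1234"));
    }
}
```

Running the same lookup on the backport branch before you push catches the mis-sectioned entry before a reviewer does.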

The labels and bot are why a backport must be intentional: shipping a fix to 2.x is a decision, made by adding a label, recorded on the PR, and re-reviewed on the backport.

What Gets a PR Merged Fast vs. Slow

Merged fast	Merged slow (or stalled)
One focused logical change	Bundles refactor + feature + formatting
Green CI on the first push	Red CI left for the reviewer to interpret
A test that fails without the fix	"Tested locally," no test in the diff
No public-surface change, or a deprecation-policy-respecting one	Silent wire/REST/setting-default change
`CHANGELOG.md` entry in the right section	Missing or mis-sectioned CHANGELOG
Links the issue, explains why	No context; reviewer must reconstruct intent
Author responsive to review within days	Author disappears for weeks mid-review
DCO sign-off on every commit	Missing `Signed-off-by:`, DCO check red

None of the "fast" column is about being a better programmer. It is about respecting the reviewer's time and the engine's compatibility guarantees. That respect, sustained, is what builds the trust that earns you faster reviews and, eventually, a line in MAINTAINERS.md.

Prove You Understand This

Name the four load-bearing CI checks and the local command that satisfies each (not all of them are ./gradlew tasks). Which is randomized, and how do you reproduce one of its failures?
A non-maintainer approves your PR and CI is green. Can it merge? Why or why not?
In what order does a maintainer scan a diff, and why do scope and tests come before correctness?
Your PR squash-merges to main. You need the fix in the next 2.x patch release. Walk through the label, the bot, and what you do if the cherry-pick conflicts — including the CHANGELOG.
Given a directory server/src/main/java/org/opensearch/index/shard/, what single command tells you who is most likely to review a change there?
List three things you can do, none of which improve the code itself, that will get your PR merged faster.

How Maintainers Think About Compatibility

The single most common surprise for a new OpenSearch contributor is watching a maintainer block a five-line change. The code is correct. The tests pass. And the maintainer still says no, or "not like this," or "this needs a feature flag." This chapter explains the lens behind that reflex. It is the risk lens: a maintainer is not primarily evaluating whether your change is right — CI and the diff handle most of that — they are evaluating what happens when your change meets a fleet of running clusters that they cannot upgrade atomically.

This is the maintainer-side mirror of the contributor compatibility chapter, and it leans on the serialization-BWC deep dive for the mechanics. Read those for the how; this chapter covers why compatibility dominates everything else.

The Core Fact: Clusters Are Never Uniform During an Upgrade

OpenSearch clusters are upgraded node by node — a rolling upgrade. For the duration of the upgrade, the cluster is mixed-version: some nodes run 2.17.0, some run 2.18.0, and they must speak to each other correctly the entire time. The cluster manager (formerly master) might be the old version while a data node is new, or vice versa. Replication, coordination, and search all cross that version boundary live.

This single fact generates almost every compatibility rule a maintainer enforces:

If a change alters…	…then in a mixed cluster…	Maintainer's concern
Wire format (`StreamInput`/`StreamOutput`, `Writeable`)	An old node may read a stream a new node wrote, or vice versa	Corrupt/misparsed messages, node crashes, split cluster state
Index/segment format	A shard written by a new node may be read by an old node after failover	Unreadable shards, data loss
REST request/response shape	Clients and Dashboards see different responses depending on which node answered	Broken integrations, silent contract breaks
Setting name/default	The two versions disagree on a default	Behavior flips mid-upgrade
Cluster state structure	The published state is parsed by all nodes, old and new	Cluster can't form or apply state

A maintainer reads your diff and asks, for each hunk: which of these surfaces does this touch, and is it safe across a version boundary? If the answer is "wire" or "index" or "REST" and the change isn't carefully versioned, the change is blocked until it is.

Wire BWC: The Version Gate

Anything that crosses the transport layer is serialized through StreamOutput.writeX(...) and StreamInput.readX(...). When you add a field to a Writeable, you cannot just write it — old nodes don't know it's there and will misread the stream. The mechanism is a version check against the stream's version:

// Writing: only emit the new field to peers new enough to understand it.
@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeString(name);
    if (out.getVersion().onOrAfter(Version.V_3_1_0)) {
        out.writeOptionalString(newField);
    }
}

// Reading: only consume it if the sender was new enough to have written it.
public MyMessage(StreamInput in) throws IOException {
    this.name = in.readString();
    if (in.getVersion().onOrAfter(Version.V_3_1_0)) {
        this.newField = in.readOptionalString();
    }
}

Note: Class and version-constant names vary by branch. Find the pattern in the tree with grep -rn "getVersion().onOrAfter" server/src/main/java/org/opensearch | head and read a few real examples before you write your own. The deep dive serialization-bwc walks the full mechanism, including NamedWriteableRegistry for polymorphic types.

A maintainer reviewing a writeTo/reader change checks three things instantly: (1) is the new field guarded by a version check on both sides; (2) is the version constant correct (the version the field actually ships in, not a guess); (3) is there a round-trip serialization test — typically extending AbstractWireSerializingTestCase — that exercises both the new and an older bwcVersion. No version guard, or no BWC test, is an automatic block.
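
To make the round-trip idea concrete without the OpenSearch test framework, here is a self-contained sketch. It is a stand-in, not the real API: `java.io` data streams replace `StreamInput`/`StreamOutput`, a plain `int` replaces the version object, and `VersionGatedMessage`, `V_OLD`, and `V_NEW_FIELD` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustration only: a version-gated message using java.io streams.
final class VersionGatedMessage {
    static final int V_OLD = 1;       // peers that predate newField
    static final int V_NEW_FIELD = 2; // first version that ships newField

    final String name;
    final String newField; // null when absent or when the peer is too old to carry it

    VersionGatedMessage(String name, String newField) {
        this.name = name;
        this.newField = newField;
    }

    // Write side: emit newField only to peers new enough to read it.
    byte[] writeTo(int peerVersion) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            if (peerVersion >= V_NEW_FIELD) {
                out.writeBoolean(newField != null); // presence marker for the optional field
                if (newField != null) {
                    out.writeUTF(newField);
                }
            }
        }
        return bytes.toByteArray();
    }

    // Read side: consume newField only if the sender could have written it.
    static VersionGatedMessage readFrom(byte[] wire, int peerVersion) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            String name = in.readUTF();
            String newField = null;
            if (peerVersion >= V_NEW_FIELD && in.readBoolean()) {
                newField = in.readUTF();
            }
            return new VersionGatedMessage(name, newField);
        }
    }

    // Serialize and deserialize at one stream version, as a BWC round-trip test would.
    static VersionGatedMessage roundTrip(VersionGatedMessage message, int version) {
        try {
            return readFrom(message.writeTo(version), version);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        VersionGatedMessage original = new VersionGatedMessage("node-1", "extra");
        // New peer: the field survives the trip.
        System.out.println(roundTrip(original, V_NEW_FIELD).newField);
        // Old peer: the field is dropped, but the stream still parses cleanly.
        System.out.println(roundTrip(original, V_OLD).newField);
    }
}
```

The second case is the one reviewers care about: against an older version the new field is lost, but `name` still reads correctly and nothing is misparsed. Remove either guard and the old-version round trip is where it breaks.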

REST API Stability and the Deprecation Policy

The REST API is a public contract. Clients, OpenSearch Dashboards and its plugins, and third-party tools depend on response shapes and parameter names. The rules a maintainer enforces:

You may add new optional fields and new optional parameters. Additive is usually safe.
You may not silently remove or rename an existing field, parameter, or endpoint. That breaks clients without warning.
To change or remove, you go through the deprecation policy: the old form keeps working but emits a deprecation warning (the Warning HTTP header, via the deprecation logger), is documented as deprecated, and is removed only in a later major version.

# See how deprecation warnings are emitted in the codebase:
grep -rn "deprecationLogger\|DeprecationLogger" server/src/main/java/org/opensearch/rest | head

This is why a maintainer will block "just rename this response field to something clearer." Clearer is not worth breaking every consumer. The renamed field has to be added alongside the old one, the old one deprecated, and the removal deferred to a major release.
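
The same add-alongside-then-deprecate shape applies to request parameters. A minimal sketch, assuming a hypothetical rename from `timeout_ms` to `timeout_millis`: the class, the parameter names, and the plain `List<String>` standing in for the deprecation logger and `Warning` header are all illustrative, not OpenSearch APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: accept the old parameter name, but warn when it is used.
final class DeprecatedParam {
    static final String NEW_NAME = "timeout_millis"; // hypothetical new parameter
    static final String OLD_NAME = "timeout_ms";     // hypothetical deprecated parameter
    static final int DEFAULT_TIMEOUT = 30000;

    // Resolves the timeout, preferring the new name. Using the old name still works
    // but records a deprecation warning for the caller to surface.
    static int resolveTimeout(Map<String, String> params, List<String> warnings) {
        if (params.containsKey(NEW_NAME)) {
            return Integer.parseInt(params.get(NEW_NAME));
        }
        if (params.containsKey(OLD_NAME)) {
            warnings.add("[" + OLD_NAME + "] is deprecated, use [" + NEW_NAME + "] instead");
            return Integer.parseInt(params.get(OLD_NAME));
        }
        return DEFAULT_TIMEOUT;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        int timeout = resolveTimeout(Map.of(OLD_NAME, "500"), warnings);
        System.out.println(timeout + " " + warnings);
    }
}
```

Existing clients keep working unchanged, new clients get the clearer name, and the warning gives everyone a full major version of notice before the old form is removed.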

Blast Radius: Why "Small" Gets Heavy Scrutiny

The size of a diff is a terrible predictor of its risk. A maintainer judges blast radius: how many code paths, nodes, and clusters a change can affect if it is subtly wrong.

"Small" change	Why it's actually high blast radius
One-line tweak to an `AllocationDecider`	Changes shard placement for every index on every cluster; can trigger mass relocation or unassigned shards
Adjusting a default in `ThreadPool` or a circuit breaker	Affects memory/throughput on every node under load
Editing a `StreamInput` read	Can corrupt every inter-node message in a mixed cluster
Changing a cluster-state field	Parsed by every node; a bad change can stop the cluster from forming
Tweaking a Lucene merge/refresh parameter default	Changes I/O and visibility characteristics fleet-wide

Estimate it yourself before you propose the change:

cd ~/OpenSearch
# Who calls the method you're about to change?
grep -rn "applyIndexOperationOnPrimary" server/src/main/java | wc -l
# Is this on a hot path (indexing/search/coordination)?
grep -rln "ThreadPool.Names.WRITE\|ThreadPool.Names.SEARCH" server/src/main/java | head

A change with wide blast radius gets scrutiny proportional to the radius, not the line count. This is not bureaucracy; it is the maintainer pricing in the cost of being wrong in a system that stores other people's data.

Feature Flags and the `experimental` Path

When a change is valuable but risky — a new replication mode, a new storage backend, a behavior the team isn't sure is right yet — the maintainer's tool is to ship it off by default:

A feature flag / setting gates the new behavior so the default path is unchanged. Find the pattern with grep -rn "FeatureFlags" server/src/main/java/org/opensearch | head.
An experimental label/annotation tells users the surface may change and is not yet covered by the usual BWC guarantees.

This lets the project gather real-world signal without betting the default behavior of every cluster on it. As a contributor proposing something ambitious, offering a feature flag proactively is a strong signal to maintainers that you understand the risk — and it is often the difference between "let's discuss for three months" and "merged behind a flag, iterate in the open."
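
The essential property of such a gate is that the absence of the flag means the old behavior. A minimal sketch of that default-off shape follows, with a plain `Map` standing in for node settings. The flag key and class are hypothetical; the real `FeatureFlags` API and key names differ and vary by branch.

```java
import java.util.Map;

// Illustration only: a default-off gate around a risky code path.
final class FlagGatedBehavior {
    // Hypothetical flag key, not a real OpenSearch setting.
    static final String FLAG = "example.experimental.feature.new_replication.enabled";

    // Off unless the operator explicitly turns it on.
    static boolean isEnabled(Map<String, String> settings) {
        return Boolean.parseBoolean(settings.getOrDefault(FLAG, "false"));
    }

    // The default path is untouched; only opted-in clusters reach the new code.
    static String replicate(Map<String, String> settings) {
        return isEnabled(settings) ? "new-path" : "default-path";
    }

    public static void main(String[] args) {
        System.out.println(replicate(Map.of()));             // no flag set
        System.out.println(replicate(Map.of(FLAG, "true"))); // explicit opt-in
    }
}
```

Because an unset flag takes the default path, merging the change alters nothing for existing clusters, which is what makes it reviewable on a shorter timeline.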

The Cost of a Revert

The last item in a maintainer's risk lens is the exit cost: if this is wrong, how hard is it to undo? A revert is never free:

flowchart LR
    A[Bad change merged to main] --> B[Backported to 2.x]
    B --> C[Shipped in 2.18.0]
    C --> D[Users upgraded, data written in new format]
    D --> E{Revert now?}
    E -->|Wire/REST only| F[Revert + version-guard cleanup, painful]
    E -->|Index/data format| G[Cannot cleanly revert: on-disk data exists]

A revert of a behavioral change is annoying. A revert of a change that altered an on-disk format or a wire contract that nodes now depend on can be effectively impossible — because data has been written, or clusters have upgraded, in a way that assumes the change. This asymmetry is why maintainers front-load scrutiny: it is far cheaper to argue on the issue for an extra week than to live with an un-revertable mistake in a released line. It is also why format and contract changes are gated behind version checks and feature flags in the first place — those mechanisms preserve a revert path.

What This Means for You as a Contributor

Read your own change the way a maintainer will:

Classify the surface. Does it touch wire, index, REST, settings, or cluster state? If yes to any, you are in BWC territory; budget accordingly.
Version-guard and test it. Add the version check on both read and write; add a round-trip BWC test. Do this before you ask for review.
Estimate blast radius, not line count. A one-liner in an allocation decider deserves more justification than a 300-line self-contained new module.
Offer the safety valve. For risky-but-valuable work, propose a feature flag or experimental gate yourself.
Respect the deprecation policy. Add-then-deprecate-then-remove-in-a-major, never silent rename.

Do this and the "why is a maintainer blocking my five lines" mystery evaporates — because you will have already answered the questions they were going to ask.

Prove You Understand This

Why is every OpenSearch cluster mixed-version for a window, and what does that imply for a change to StreamOutput.writeTo?
Show the read and write halves of a version-guarded new field. Which test base class proves it round-trips, and against what versions?
A reviewer asks you to keep an old REST field while adding a clearer new one, instead of renaming. Cite the policy that requires this and explain when the old field can finally go.
Rank these by blast radius and justify: a new self-contained ingest processor; a one-line change to DiskThresholdDecider; a change to a cluster-state serialized field.
When is a feature flag the right answer, and why does offering one make a risky PR easier to merge?
Explain why a change to an on-disk index format is harder to revert than a change to a REST response, and how maintainers preserve a revert path in advance.

Next: The Release Process and Release Trains — how a compatibility-safe, reviewed change actually ships to users.

The Release Process and Release Trains

A reviewed, compatibility-safe change merged to main is not yet released. Between your merged PR and a user running the fix lies the release machinery: a schedule, a release manager, a tracking issue, a build-and-bundle pipeline, and generated release notes. Unlike the Apache model, there is no [VOTE] email and no formal release vote. Release decisions happen openly in a dedicated repo, opensearch-build, coordinated through public GitHub issues. This chapter is how OpenSearch ships, and — the part you actually need — exactly what you must do to get a fix into a given release.

The `opensearch-build` Repo Is the Release Control Plane

The core engine lives in opensearch-project/OpenSearch, but a release is more than the engine. OpenSearch ships as a bundle: the core distribution plus the standard plugins (security, k-NN, SQL, alerting, index-management, ml-commons, observability, and the rest), each from its own repo, assembled into a single coherent product at one version.

That assembly, the schedule, and the per-release coordination live in opensearch-project/opensearch-build:

Lives in `opensearch-build`	Purpose
Release schedule	The published cadence and dates for upcoming releases
Manifests	Which repo/ref of each component goes into a given release version
Release-readiness issues	One tracking issue per release, aggregating every component's status
Build/assemble/test pipeline	Builds each component, bundles them, runs validation
Release notes tooling	Aggregates per-component `CHANGELOG.md` / PR data into release notes

# The control plane (browse, don't clone unless you're release-managing):
# github.com/opensearch-project/opensearch-build
# Look for: the release schedule, the "Release version X.Y.Z" tracking issues,
# and the per-version manifests.

The key mental shift from Apache: the "release decision" is not a moment (an email vote passing). It is a process that becomes visibly done when the release-readiness issue's checklist is complete and the pipeline has produced validated artifacts.

Release Trains: Major, Minor, Patch

OpenSearch ships on trains — scheduled departures you either make or wait for.

Train	Branch	Carries	Compatibility
Major (`3.0.0`)	new line off `main`	Breaking changes, removed deprecations, new index format era	May break BWC vs previous major (within policy)
Minor (`3.1.0`, `3.2.0`)	`main` (the `3.x` line)	New features, additive API, deprecations	Backward compatible within the major
Patch (`3.1.1`, `2.18.1`)	maintenance branch (`2.x`, `1.x`)	Bug fixes, security fixes only	No new features; strict BWC

main is the 3.x development line. 2.x is the active maintenance line (gets backported fixes and minors). 1.x is legacy (security/critical fixes only). The train you can catch depends on what your change is: a feature rides a minor off main; a bug fix can ride a patch on a maintenance branch via backport.

Note: "Train" is not just a metaphor for cadence; it is a metaphor for the cutoff. A train leaves on a schedule whether or not your change is aboard. Miss the merge-and-backport deadline for a release and your fix waits for the next departure.
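
The train a version number belongs to can be read straight off its `X.Y.Z` shape, following the table above. A minimal sketch follows; `ReleaseTrain` and `trainFor` are hypothetical names, and pre-release qualifiers such as `-rc1` are deliberately not handled.

```java
// Illustration only: classify a version string by the release train it rides.
final class ReleaseTrain {

    // "3.0.0" -> major, "3.1.0" -> minor, "3.1.1" or "2.18.1" -> patch.
    static String trainFor(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected X.Y.Z, got: " + version);
        }
        int minor = Integer.parseInt(parts[1]);
        int patch = Integer.parseInt(parts[2]);
        if (patch > 0) {
            return "patch"; // bug and security fixes only
        }
        if (minor > 0) {
            return "minor"; // new features, additive API, deprecations
        }
        return "major";     // breaking changes allowed within policy
    }

    public static void main(String[] args) {
        for (String v : new String[] { "3.0.0", "3.1.0", "3.1.1", "2.18.1" }) {
            System.out.println(v + " rides the " + trainFor(v) + " train");
        }
    }
}
```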

The Release Manager and the Readiness Issue

Each release has a release manager — a person (rotating, from the community/maintainers) who owns shepherding that version out. They are coordinator, not dictator: the work is done by each component's maintainers; the release manager tracks it.

Their primary instrument is the release-readiness tracking issue in opensearch-build, which aggregates, per component:

Is the component's release branch cut and code-frozen?
Are all targeted PRs merged and backported by the cutoff?
Are the CHANGELOG.md entries in place (they become the release notes)?
Do the integration tests / BWC tests / packaging tests pass in the bundle?
Is the documentation (the separate documentation-website repo) updated?
Are there any release-blocking issues open? (See issue-roadmap stage 12.)

The release ships when this checklist is green across components. That is the "vote" — conducted in checkboxes, in public, on a GitHub issue.
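
"Green across components" is a strict conjunction: one red checkbox in one component holds the whole release. A minimal sketch of that rule follows, with nested maps standing in for the issue's checklist. The class, component names, and check names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: the readiness checklist as component -> (check -> passed?).
final class ReleaseReadiness {

    // Ready only when every check of every component is green.
    static boolean isReady(Map<String, Map<String, Boolean>> components) {
        return components.values().stream()
            .allMatch(checks -> checks.values().stream().allMatch(passed -> passed));
    }

    // Lists "component: check" for every item that is still red.
    static List<String> blockers(Map<String, Map<String, Boolean>> components) {
        List<String> open = new ArrayList<>();
        components.forEach((component, checks) ->
            checks.forEach((check, passed) -> {
                if (!passed) {
                    open.add(component + ": " + check);
                }
            }));
        return open;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Boolean>> status = Map.of(
            "core", Map.of("branch_cut", true, "changelog", true),
            "security", Map.of("bwc_tests", false));
        System.out.println("ready: " + isReady(status));
        System.out.println("blockers: " + blockers(status));
    }
}
```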

The Release Pipeline

flowchart TD
    A[Changes merged to main and backport branches] --> B[Release branch cut / code freeze]
    B --> C[Per-component build in opensearch-build]
    C --> D[Assemble the bundle: core + standard plugins at one version]
    D --> E[Version bumps applied: 3.x.0]
    E --> F[RC build produced]
    F --> G[Validation: integ tests, BWC, packaging, smoke]
    G -->|Issues found| H[Fix, backport, new RC]
    H --> F
    G -->|Clean| I[Release notes generated from CHANGELOG + PRs]
    I --> J[Documentation-website sync]
    J --> K[GA release: artifacts published, announcement]
    K --> L[Backport branches and main keep moving]

Reading the pipeline:

Code freeze / release branch. At the cutoff, a release branch is cut. After this, only approved fixes land for that version — this is the firm edge of the train.
Build and bundle. opensearch-build builds each component at the manifest-pinned ref and assembles them into one distribution.
Version bumps. Versions are incremented (v3.x.0), including the internal Version constants the BWC machinery keys off (see maintainer-mindset).
RC and validation. Release candidate builds run the full validation: integration, backward-compatibility and rolling-upgrade tests (from qa/), packaging (deb/rpm/docker/archives), and smoke tests. A failure spawns a fix → backport → new RC loop.
Release notes. Generated from the accumulated CHANGELOG.md entries and PR metadata — which is precisely why every PR must add a CHANGELOG line in the right section.
Docs sync. The documentation-website repo is updated so opensearch.org/docs matches the release.
GA. Artifacts are published and the release is announced.

CHANGELOG → Release Notes

This is the most direct connection between your PR and the release. Every PR to the core repo adds a one-line entry to CHANGELOG.md under the unreleased section, categorized:

## [Unreleased 3.x]
### Added
- Add `include_unloaded_segments` to the nodes stats API ([#NNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNN))
### Changed
### Fixed
- Fix DiskThresholdDecider off-by-one for relocating shards ([#MMMM](https://github.com/opensearch-project/OpenSearch/pull/MMMM))
### Deprecated
### Removed
### Security

At release time, the entries under the unreleased section for the version being cut are rolled up into the human-readable release notes. Consequences for you:

A missing CHANGELOG entry means your change is invisible in the release notes — and the precommit/PR checks will usually catch the omission first.
The section matters. On main your entry sits under [Unreleased 3.x]. When you backport to 2.x, the entry must move to [Unreleased 2.x] so it lands in the right release's notes. Mis-sectioning is a common backport-PR review comment (see GitHub review).
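
The roll-up described above amounts to grouping the entries of one unreleased section by category. The sketch below assumes the `## [section]` / `### Category` / `- entry` layout shown earlier. It is not the actual release notes tooling in opensearch-build, and `ChangelogRollup` and `rollUp` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: group one section's CHANGELOG entries by category.
final class ChangelogRollup {

    // Collects "- entry" lines under "## [section]", keyed by their "### Category".
    // Categories with no entries are omitted; document order is preserved.
    static Map<String, List<String>> rollUp(String changelog, String section) {
        Map<String, List<String>> notes = new LinkedHashMap<>();
        boolean inSection = false;
        String category = null;
        for (String line : changelog.split("\n")) {
            if (line.startsWith("## ")) {
                inSection = line.substring(3).trim().equals("[" + section + "]");
                category = null;
            } else if (inSection && line.startsWith("### ")) {
                category = line.substring(4).trim();
            } else if (inSection && category != null && line.startsWith("- ")) {
                notes.computeIfAbsent(category, k -> new ArrayList<>()).add(line.substring(2).trim());
            }
        }
        return notes;
    }

    public static void main(String[] args) {
        String changelog = String.join("\n",
            "## [Unreleased 3.x]",
            "### Added",
            "- Add a stats field (#1)",
            "### Changed",
            "### Fixed",
            "- Fix a decider bug (#2)",
            "## [Unreleased 2.x]",
            "### Fixed",
            "- Fix a backported bug (#3)");
        System.out.println(rollUp(changelog, "Unreleased 3.x"));
        System.out.println(rollUp(changelog, "Unreleased 2.x"));
    }
}
```

An entry filed under the wrong `## [...]` heading ends up in the other section's map, which is exactly how a mis-sectioned backport lands in the wrong release's notes.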

What You Must Do to Get a Fix Into a Given Release

This is the operational checklist. To land a fix in release X.Y.Z:

Merge to main before the cutoff. The fix has to be in main (reviewed, green CI, CHANGELOG entry) before the release branch is cut for the train you want.
Add the backport label for every maintenance line that needs it (e.g. backport 2.x, and backport 1.x if applicable) — at or before merge, so the backport bot opens the backport PR.
Land the backport PR before the cutoff for that line's release. A backport that misses the cutoff misses the train, even if main made it.
Put the CHANGELOG entry in the correct section on both main and the backport.
Watch the release-readiness issue in opensearch-build for the target version; if your change is release-blocking, make sure it's tracked there.

# Concretely, for a fix you want in the next 2.x patch:
# 1) Open the PR against main, with a CHANGELOG entry under [Unreleased 3.x].
# 2) Add the label `backport 2.x` to the PR.
# 3) After it squash-merges to main, the bot opens a backport PR against 2.x.
# 4) Move the CHANGELOG entry to [Unreleased 2.x] on the backport, get it reviewed and merged.
# 5) Confirm it's listed on the release-readiness issue for the target 2.x.Z version.

Warning: The two failure modes that cost contributors a release: (1) merging to main but forgetting the backport label, so the fix never reaches the maintenance line; and (2) the backport PR conflicting and sitting unresolved past the cutoff. Add the label early and resolve backport conflicts promptly. See release-blocking issues for the highest-stakes version of this.

Prove You Understand This

Which repo is the release control plane, and what does a release-readiness issue track?
There is no [VOTE] email — so what is the artifact that signals a release is ready to ship?
Distinguish major / minor / patch trains by branch, contents, and BWC guarantee. Which train can carry a new feature?
Trace your CHANGELOG entry from PR to release notes, including what changes when you backport to 2.x.
You have a one-line bug fix you need in the next 2.x patch release. List, in order, every step from opening the PR to confirming it's in the release.
Name the two most common ways a contributor's fix misses the train despite being merged to main.

Next: The TSC and Project Governance — who sets the direction the trains run on.

The TSC and Project Governance

Every prior chapter has shown you a piece of self-organizing machinery: maintainers approve PRs, a release manager runs a train, consensus forms on issues. This chapter is about the layer above all of that: who decides the direction, who admits a new project into the org, and who owns the trademark. OpenSearch is governed by the OpenSearch Software Foundation under the Linux Foundation, with technical direction set by a Technical Steering Committee (TSC). None of this is the Apache model; there is no PMC and no ASF-style board of directors.

This is the governance mirror of the contributor maintainership chapter: that chapter was your path up; this one is the structure you are climbing into.

A Short History: AWS-Stewarded → Linux Foundation

Period	Steward	What it meant
2021	AWS	OpenSearch is forked from Elasticsearch 7.10.2 / Kibana 7.10.2 after Elastic's SSPL/Elastic-License relicense; AWS creates and stewards the project under Apache-2.0 (see licensing)
2021–2024	AWS-stewarded	Open development on GitHub, public governance docs, but with AWS as the steward and trademark holder
Sept 2024	OpenSearch Software Foundation, under the Linux Foundation	Governance, trademark, and assets move to a vendor-neutral foundation; the TSC formalizes technical direction across the project

The move to the Linux Foundation in September 2024 matters because it makes the project vendor-neutral: the trademark, the assets, and the governance no longer sit with a single company. For a contributor, the practical effect is that direction is set by a chartered committee in the open, not by any one employer.

Note: The Linux Foundation provides the legal and IP home (trademark, the foundation entity, neutral governance). The TSC provides technical direction. These are two different bodies with two different jobs — don't conflate them.

The Technical Steering Committee

The TSC is the project's technical governing body. Think of it as the layer that handles what no single repo's maintainers can decide on their own.

Composition

The TSC is a chartered group of senior contributors/maintainers drawn from across the project (multi-company since the LF move). Membership and the charter are documented in the project's governance repository and .github/community docs. It is small enough to decide and broad enough to represent the major subsystems.

What the TSC Decides

TSC decides	Examples
Technical direction	The roadmap's shape; which large initiatives the project pursues
Cross-project policy	BWC and deprecation policy; security disclosure process; release cadence principles; coding/quality standards that span repos
Project admission	Whether a new repo/plugin joins the `opensearch-project` org and ships in the bundle
Escalations	Disputes a single repo's maintainers can't resolve; cross-repo conflicts
Maintainer-level governance	Principles for how repos appoint/remove maintainers

What the TSC does not do: review your individual PR, run the day-to-day of any one repo, or merge code. That is maintainer work.

Maintainers vs. the TSC

The two layers have clean, complementary jobs. Confusing them is the most common governance misunderstanding.

	Per-repo maintainers (`MAINTAINERS.md`)	The TSC
Scope	One repo / subsystem	The whole project
Authority	Approve and merge PRs; own the repo's roadmap and quality	Set cross-project direction and policy; admit projects
Where listed	each repo's `MAINTAINERS.md`	the project governance docs
Day-to-day	Constant: review, triage, release shepherding	Periodic: meetings, policy, admissions
You interact via	PRs, issues, `CODEOWNERS`	RFCs that escalate; community meetings

Most of what you experience as a contributor is maintainer authority. The TSC only enters your world when your work is cross-cutting enough to need project-level alignment — a new subsystem, a policy change, a new plugin joining the org.

How Decisions Escalate

flowchart TD
    A[Contributor opens issue/PR] --> B{Contained in one repo?}
    B -->|Yes| C[Maintainers decide on the issue/PR]
    B -->|No: spans repos or sets policy| D[RFC issue + community meeting]
    D --> E{Resolved by aligned maintainers?}
    E -->|Yes| F[Recorded on the issue, implemented]
    E -->|No / project-level| G[TSC]
    G --> H[TSC decision recorded in governance/issue]
    H --> F

The escalation principle is as local as possible: decisions are made at the lowest level that can make them. A bug fix is decided by the area maintainer. A new aggregation type is an RFC the search maintainers align on. A change to the project's BWC policy, or admitting a new plugin, reaches the TSC. You rarely send something to the TSC directly — a maintainer or the discussion routes it there when it genuinely needs project-level authority. The output is always written down (an issue, the governance docs), because, as everywhere in OpenSearch, a decision isn't real until it's recorded (see communication channels).

How a New Project or Plugin Joins the Org

OpenSearch is an organization of many repos, and it grows. Adding a new plugin or repo to opensearch-project — especially one that will ship in the bundle — is a TSC-level admission, because it commits the project to maintaining it, releasing it on the train, and standing behind its compatibility.

The shape of that process:

Proposal as an RFC describing the project, its scope, why it belongs in the org, maintainership, and licensing (must be Apache-2.0 with clean dependency licensing — see licensing).
Community and maintainer discussion on the RFC and in community meetings.
TSC review and decision on admission, including bundle inclusion and release commitments.
Onboarding: the repo joins the org with its own MAINTAINERS.md, CONTRIBUTING.md, SPDX headers, CHANGELOG discipline, and a slot in the opensearch-build manifests if it ships in the release (release process).

This is the formal counterpart to the casual reality that most features just need maintainer alignment. Whole projects need the TSC.

Trademarks and Branding (High Level)

Since the move to the Linux Foundation, the OpenSearch trademark is held by the foundation, a vendor-neutral owner, rather than by any single company. You do not need to be a trademark lawyer, but two practical facts matter for a contributor:

Use the name correctly. "OpenSearch" refers to this project; downstream products that embed it must follow the trademark/branding guidelines and not imply official endorsement.
The trademark's neutral home is part of why the LF move happened. Holding the mark in a foundation rather than a company is what makes "OpenSearch" a community asset rather than a single vendor's product name.

For anything beyond "use the name as the project intends," consult the project's trademark and branding guidelines — this chapter is orientation, not legal advice.

The Ladder: Contributor → Maintainer → TSC

The path is earned, visible, and the same one this whole section has been preparing you for.

flowchart LR
    A[Contributor] -->|sustained quality PRs, reviews, trust| B[Maintainer]
    B -->|cross-project leadership, stewardship| C[TSC member]

Rung	What you do	How you get there
Contributor	Open issues, send PRs, review others' PRs, help triage	Anyone can start today
Maintainer	Approve and merge in a repo, own an area, shepherd releases	Sustained, trusted contribution to that repo; nominated and added to `MAINTAINERS.md` by existing maintainers
TSC member	Set technical direction, decide policy, admit projects	Demonstrated cross-project leadership and stewardship; per the TSC charter

Each rung is built on the one below it. You become a maintainer by behaving, over months, like the kind of contributor maintainers trust: focused PRs, good tests, compatibility awareness, helpful reviews, responsiveness. That is exactly the trust-building covered in the Code Style, Test Quality, and Building Trust chapter, and the maintainership mindset covered in the contributor section. The TSC rung, in turn, is built on having been a maintainer who reliably did the right thing for the project, not just one repo.

Note: Nobody is hired into maintainership for OpenSearch by writing one brilliant PR. The rung is granted for a track record — many PRs, many reviews, demonstrated judgment. Optimize for the track record, not the single heroic change.

Prove You Understand This

What changed about OpenSearch governance in September 2024, and why does vendor-neutrality matter to a contributor?
Distinguish the Linux Foundation's role from the TSC's role. Which one owns the trademark; which one sets the BWC policy?
Give two decisions a per-repo maintainer makes and two that escalate to the TSC.
Describe the escalation path for a change that spans three repos and proposes a new project-wide policy.
Outline how a new plugin gets into the opensearch-project org and the bundle, including the licensing precondition.
Lay out the contributor → maintainer → TSC ladder and what is earned (not granted) at each step.

Next: Licensing, SPDX Headers, and the Apache 2.0 Story — the legal bedrock the whole project, and every admission, stands on.

Licensing, SPDX Headers, and the Apache 2.0 Story

OpenSearch exists because of a license. Not a feature, not a performance number — a license. The entire project is a bet that a distributed search engine should be available under a permissive, OSI-approved license, and that bet is enforced mechanically in the build on every single PR. This chapter explains why OpenSearch is Apache License 2.0, the SPDX header that must sit atop every source file, the build-time license checks that gate merges and releases, and what would actually block a release on licensing grounds.

Licensing is not a side concern in OpenSearch. It is the founding concern.

Why OpenSearch Is Apache 2.0: The 2021 Fork

In January 2021, Elastic relicensed Elasticsearch and Kibana away from the Apache License 2.0 to a dual SSPL / Elastic License model. The SSPL is not an OSI-approved open-source license; the Elastic License is a source-available proprietary license. For users and vendors who depended on a genuinely open-source search engine, that left 7.10.2 as the last ALv2-licensed release.

OpenSearch is the response: a fork of Elasticsearch 7.10.2 and Kibana 7.10.2, kept under the Apache License 2.0, created so that an open-source search engine continues to exist.

	Elasticsearch (post-Jan 2021)	OpenSearch
License	SSPL / Elastic License (source-available)	Apache License 2.0 (OSI-approved)
Forked from	—	Elasticsearch / Kibana 7.10.2
Package namespace	`org.elasticsearch.*`	`org.opensearch.*`
Governance	Elastic (company)	OpenSearch Software Foundation / Linux Foundation

This is why the ALv2 commitment is non-negotiable and why the license checks below are strict: the reason the project exists is to stay Apache-2.0. A dependency or a file that compromises that is an existential problem, not a nitpick.

The SPDX Header on Every Source File

Every source file in the core repo carries an SPDX header declaring its license. This is the exact block the project's standard requires:

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

Notes that trip people up:

The header is machine-checked. The precommit license-header check fails the build (and the PR CI) if a source file is missing it or has the wrong text.
Files inherited from the Elasticsearch fork may additionally carry the original Elasticsearch Apache-2.0 license notice plus a "Modifications Copyright OpenSearch Contributors" line, because the fork preserved upstream attribution where required. The fork point (7.10.2) predates the SSPL / Elastic License relicensing, so those licenses never appear in inherited headers. New files get the clean header above.
Comment syntax varies by language (/* */ for Java, # for shell/YAML, etc.), but the SPDX identifier and intent are constant.

Check your file before you push:

cd ~/OpenSearch
# The precommit task runs the header check among others:
./gradlew precommit
# Spot-check a header by hand:
grep -n "SPDX-License-Identifier" server/src/main/java/org/opensearch/cluster/ClusterState.java

If you add a new .java file without the header, precommit will tell you, and CI will be red until you fix it.
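To make the idea of a machine-checked header concrete, here is a minimal Java sketch of what such a check does. It is not the project's actual precommit implementation, which is stricter and validates the full header text; the class name `SpdxHeaderSketch` and the 10-line scan window are assumptions made for illustration.

```java
import java.util.List;

// Illustrative only: a simplified stand-in for a license-header check.
// The real check lives in the OpenSearch build tooling and is stricter.
final class SpdxHeaderSketch {
    static final String SPDX_LINE = "SPDX-License-Identifier: Apache-2.0";

    /** Returns true if the SPDX identifier appears within the first maxLines lines. */
    static boolean hasSpdxHeader(List<String> lines, int maxLines) {
        return lines.stream()
            .limit(maxLines)
            .anyMatch(line -> line.contains(SPDX_LINE));
    }

    public static void main(String[] args) {
        List<String> withHeader = List.of(
            "/*",
            " * SPDX-License-Identifier: Apache-2.0",
            " *",
            " */",
            "package org.example;"
        );
        List<String> withoutHeader = List.of("package org.example;", "class Foo {}");

        System.out.println(hasSpdxHeader(withHeader, 10));    // true
        System.out.println(hasSpdxHeader(withoutHeader, 10)); // false
    }
}
```

Scanning only the top of the file reflects the rule that the header must sit atop the source, not merely appear somewhere in it.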

Build-Time License Checks

The build enforces licensing on three fronts. Together they make it impossible to merge a licensing violation without a maintainer override.

Check	What it verifies	Runs under
Source header check	Every source file has the correct SPDX/Apache-2.0 header	`./gradlew precommit`
Dependency license check	Every third-party jar has a matching `licenses/<dep>-LICENSE.txt` and a `licenses/<dep>.sha1` (or `.jar.sha1`) recording the artifact's checksum	`./gradlew precommit` / `dependencyLicenses`
Allowed-licenses check	Every dependency's license is on the permitted (ALv2-compatible) list; copyleft/incompatible licenses are rejected	the dependency license tooling in precommit

The `licenses/` Directory

Each module that has third-party dependencies keeps a licenses/ directory. For every dependency it must contain:

licenses/<dependency>-LICENSE.txt — the dependency's license text.
licenses/<dependency>-NOTICE.txt — its NOTICE, where the upstream provides one.
licenses/<dependency>.sha1 (or <dependency>-<version>.jar.sha1) — the SHA-1 checksum pinning the exact artifact, so the jar can't be swapped without the check noticing.

cd ~/OpenSearch
# Find the license bookkeeping for a module:
find . -type d -name licenses | head
ls modules/lang-painless/licenses/ 2>/dev/null
# A mismatch between the declared sha1 and the resolved jar fails the build.

The SHA pin is the security-relevant part: it means a dependency's bytes are fixed and audited, not just its name and version.
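The pinning idea can be sketched in a few lines of Java: hash the artifact's bytes and compare against the recorded value. This is an illustration of the concept using the standard `MessageDigest` API, not the build's own task; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: shows why a recorded SHA-1 pins the exact bytes of an artifact.
final class Sha1PinSketch {
    /** Hex-encoded SHA-1 digest of the given bytes. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** True if the artifact bytes hash to the pinned value (as read from a .sha1 file). */
    static boolean matchesPin(byte[] artifactBytes, String pinnedSha1) {
        return sha1Hex(artifactBytes).equalsIgnoreCase(pinnedSha1.trim());
    }

    public static void main(String[] args) {
        byte[] original = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] swapped = "abd".getBytes(StandardCharsets.UTF_8);
        String pin = sha1Hex(original);

        System.out.println(pin);                      // a9993e364706816aba3e25717850c26c9cd0d89d
        System.out.println(matchesPin(original, pin)); // true
        System.out.println(matchesPin(swapped, pin));  // false: a one-byte change breaks the pin
    }
}
```

A single changed byte yields a different digest, which is why a swapped jar with the same name and version would still fail the check.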

Allowed Dependency Licenses

A dependency may be added only if its license is ALv2-compatible and permitted by the project. The general shape:

License class	Examples	Allowed?
Permissive	Apache-2.0, MIT, BSD-2/3-Clause, ISC	Yes — the bread and butter
Weak copyleft (situational)	EPL, MPL, CDDL	Case-by-case; often restricted by how it's used/linked
Strong copyleft	GPL, LGPL, AGPL	No for bundled runtime dependencies — incompatible with an ALv2 distribution
Source-available / non-OSI	SSPL, Elastic License, "Commons Clause"	No — these are exactly what OpenSearch forked away from

Warning: The project did not fork from Elasticsearch only to re-introduce SSPL-, GPL-, or Elastic-License-encumbered code through a dependency. A strong-copyleft or source-available runtime dependency is rejected on principle, not just by a linter.

The precise allowed-license set lives in the build configuration; consult it rather than guessing:

cd ~/OpenSearch
grep -rin "allowedLicenses\|licenseMapping\|SSPL\|GPL" buildSrc build-tools* 2>/dev/null | head
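The license classes in the table above can be sketched as a simple lookup. The sets below are copied from the table, not from the build configuration, and the default-deny treatment of unknown licenses is an assumption of this sketch; the authoritative list is whatever the build tooling actually contains.

```java
import java.util.Set;

// Illustrative only: license names come from the table in this chapter,
// not from the real allowed-license configuration in the build.
final class LicensePolicySketch {
    enum Verdict { ALLOWED, CASE_BY_CASE, REJECTED }

    static final Set<String> PERMISSIVE =
        Set.of("Apache-2.0", "MIT", "BSD-2-Clause", "BSD-3-Clause", "ISC");
    static final Set<String> WEAK_COPYLEFT = Set.of("EPL", "MPL", "CDDL");

    /** Classifies a license for a bundled runtime dependency. */
    static Verdict classify(String license) {
        if (PERMISSIVE.contains(license)) {
            return Verdict.ALLOWED;
        }
        if (WEAK_COPYLEFT.contains(license)) {
            return Verdict.CASE_BY_CASE;
        }
        // Assumption: anything not explicitly listed (GPL, AGPL, SSPL, unknown) is denied.
        return Verdict.REJECTED;
    }

    public static void main(String[] args) {
        for (String license : new String[] { "MIT", "MPL", "AGPL", "SSPL", "Apache-2.0" }) {
            System.out.println(license + " -> " + classify(license));
        }
    }
}
```

Defaulting to rejection mirrors the chapter's stance: a dependency has to earn its way onto the allowed list, rather than slipping through because nobody listed its license as forbidden.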

How Adding a Dependency Triggers License Review

Adding a new third-party library is one of the most license-sensitive things you can do in a PR. The flow:

flowchart TD
    A[Add dependency to a module build.gradle] --> B[Run ./gradlew precommit]
    B --> C{licenses/ entry exists?}
    C -->|No| D[Build fails: missing LICENSE/NOTICE/sha1]
    C -->|Yes| E{License on allowed list?}
    E -->|No| F[Build fails: disallowed license]
    E -->|Yes| G{sha1 matches resolved jar?}
    G -->|No| H[Build fails: checksum mismatch]
    G -->|Yes| I[Update NOTICE if required]
    I --> J[Maintainer license review on the PR]
    J --> K[Merge]

Concretely, to add a dependency you must:

Add it to the module's build.gradle.
Create licenses/<dep>-LICENSE.txt, licenses/<dep>-NOTICE.txt (if upstream has one), and the licenses/<dep>.sha1 checksum file.
Ensure its license is ALv2-compatible and on the allowed list.
Update the top-level NOTICE.txt if the dependency's license requires attribution there.
Run ./gradlew precommit until all license checks pass.

Then a maintainer reviews the dependency itself: is it maintained, is it the right library, does it pull transitive dependencies with bad licenses, is it worth the supply-chain weight? A new dependency is never just a build.gradle line; it is a license and supply-chain decision that a maintainer signs off on. This is also why maintainers scrutinize dependency additions more than equivalent-sized code changes.
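The automated part of the flow above is an ordered series of gates, where the first failing gate is the one you see. A minimal Java sketch of that ordering follows; the method name and boolean inputs are hypothetical stand-ins for what the build actually inspects, and the human review step is deliberately outside the function.

```java
import java.util.Optional;

// Illustrative only: mirrors the order of the automated gates in the flowchart.
final class DependencyGateSketch {
    /** Returns the first build failure in flowchart order, or empty if all gates pass. */
    static Optional<String> firstBuildFailure(boolean licensesEntryExists,
                                              boolean licenseAllowed,
                                              boolean sha1Matches) {
        if (!licensesEntryExists) {
            return Optional.of("missing LICENSE/NOTICE/sha1");
        }
        if (!licenseAllowed) {
            return Optional.of("disallowed license");
        }
        if (!sha1Matches) {
            return Optional.of("checksum mismatch");
        }
        return Optional.empty(); // automated gates pass; maintainer review still follows
    }

    public static void main(String[] args) {
        System.out.println(firstBuildFailure(false, true, true)); // Optional[missing LICENSE/NOTICE/sha1]
        System.out.println(firstBuildFailure(true, true, false)); // Optional[checksum mismatch]
        System.out.println(firstBuildFailure(true, true, true));  // Optional.empty
    }
}
```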

NOTICE Handling

The Apache License 2.0 requires that attribution notices be preserved. OpenSearch keeps a top-level NOTICE.txt and per-module/per-dependency NOTICE files:

When a dependency ships a NOTICE, its content must be carried in the appropriate NOTICE file so downstream redistribution keeps the required attribution.
The fork preserved upstream Elasticsearch/Apache attribution where the original license required it.
Dropping or mangling a required NOTICE is a license compliance defect, not a cosmetic one.

cd ~/OpenSearch
sed -n '1,20p' NOTICE.txt

What Would Block a Release on Licensing Grounds

Licensing is a hard gate at release time (it feeds the release-readiness checklist). Any of these blocks a release:

Blocker	Why it's release-blocking
A source file missing the SPDX/Apache-2.0 header	Distribution would ship un-headered source; precommit/CI red
A bundled dependency with a disallowed license (GPL/AGPL/SSPL/Elastic, etc.)	Distributing it would violate the ALv2-only promise — existential
A `licenses/` entry missing or with a mismatched SHA	The exact artifact isn't pinned/audited; supply-chain integrity is unverifiable
A required NOTICE dropped or incomplete	ALv2 attribution non-compliance
A new transitive dependency that sneaks in a bad license	Same as a direct bad dependency — checked across the graph

The throughline: OpenSearch will not ship a release it cannot defend as cleanly Apache-2.0. Every licensing check exists to make that defensible automatically, on every PR, so the release itself is never a surprise.

Prove You Understand This

Why is OpenSearch Apache-2.0, and what specifically happened in 2021 that made the fork necessary? What version was it forked from?
Reproduce the exact SPDX header block for a new source file, and name the check that enforces it.
For one third-party dependency, what three kinds of files must exist under licenses/, and what does the .sha1 file protect against?
Classify these for a bundled runtime dependency: MIT, AGPL, SSPL, BSD-3-Clause, Apache-2.0. Which are rejected and why?
Walk through everything that happens, from build.gradle edit to merge, when you add a new dependency.
List three distinct licensing conditions that would block a release, and explain why each is non-negotiable given why the project exists.

Next: Code Style, Test Quality, and Building Trust — the everyday mechanics, including the precommit gate you just used, that earn you review bandwidth.

Code Style, Test Quality, and Building Trust

Trust in OpenSearch is not a feeling. It is an accumulated record of PRs that didn't waste a maintainer's time, didn't break a cluster, and didn't have to be reverted. This chapter is the mechanics of building that record: the formatting and static-analysis gates that you should never make a reviewer enforce by hand, the test discipline that proves your change works and keeps working, the CHANGELOG and scope habits that make you predictable, and the simple truth that sustained quality is what converts into review bandwidth and, eventually, a line in MAINTAINERS.md.

This is the everyday counterpart to PR Quality and the ground floor of the contributor → maintainer ladder.

The Automated Style Gates: Never Make a Human Enforce Them

A reviewer who has to comment "run the formatter" has already lost time, and you have already signaled that you didn't run the gate. The core repo enforces style mechanically; your job is to make all of it green before anyone looks.

Gate	Command	What it enforces
Spotless	`./gradlew spotlessApply` (fix), `./gradlew spotlessJavaCheck` (verify)	Code formatting: imports, whitespace, layout. Auto-fixable.
Checkstyle	runs in `./gradlew precommit`	Structural rules: naming, complexity, banned patterns
forbidden-APIs	runs in `./gradlew precommit`	Banned APIs: e.g. default-locale/charset calls, `System.out`, unsafe time APIs
License/header check	runs in `./gradlew precommit`	SPDX headers + dependency licensing (see licensing)
`loggerUsageCheck`	runs in `./gradlew precommit`	Logging discipline (placeholder usage, no string concat in hot logs)

The discipline is a fixed sequence — run it every time, in this order:

cd ~/OpenSearch
./gradlew spotlessApply        # 1. auto-format; commit the result
./gradlew spotlessJavaCheck    # 2. verify formatting is clean
./gradlew precommit            # 3. checkstyle, forbidden-APIs, headers, loggerUsageCheck, deps

Note: spotlessApply modifies your files. Run it, then git add and commit the result, so the formatted code is what you push. A common newbie failure is running spotlessJavaCheck (which only verifies) and being confused that nothing changed — use spotlessApply to actually fix it.

The precommit gate is the single check that bundles most of these. If ./gradlew precommit is green locally, the corresponding CI check (see github review) will almost always be green too. There is no excuse for a red precommit check on a PR; it is entirely self-service.

Test Quality: The Part Maintainers Actually Trust

Style gates are table stakes. Test quality is where trust is built or destroyed, because a maintainer's deepest question about your PR is "if this breaks in six months, will a test catch it?" A change with a weak or absent test is a change the maintainer now has to worry about forever.

The properties of a test a maintainer trusts:

Property	Good	Bad (will get review comments or block)
Deterministic	Passes/fails on the logic, every run	Depends on timing, ordering, or external state
No `Thread.sleep`	Waits on a condition with `assertBusy(...)`	`Thread.sleep(5000)` hoping the thing happened
Real assertions	Asserts the actual behavior/values	Asserts "no exception thrown" and nothing else
Fails without the fix	Revert the production change → test goes red	Passes even with the bug present (tests nothing)
Right level	Unit for logic, integration for cross-node behavior	A heavy `IT` for what a unit test could prove
BWC-aware	Round-trips serialization across versions where relevant	Ignores wire/index compatibility entirely

Use `assertBusy`, not `sleep`

OpenSearch is concurrent and asynchronous; "the thing happened" is rarely instantaneous. The deterministic way to wait is assertBusy, which retries an assertion until it passes or times out:

// GOOD: deterministic, fast when ready, bounded when not.
assertBusy(() -> {
    var resp = client().admin().cluster().prepareHealth().get();
    assertEquals(ClusterHealthStatus.GREEN, resp.getStatus());
});

// BAD: flaky on slow CI, slow on fast CI, proves nothing about *why* it waited.
Thread.sleep(5000);
assertEquals(ClusterHealthStatus.GREEN, /* ... */);

cd ~/OpenSearch
# See how the codebase waits on conditions:
grep -rn "assertBusy(" test/framework/src/main/java | head
grep -rln "Thread.sleep" server/src/test | head   # the anti-pattern to avoid in new tests
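To see why retrying an assertion is deterministic where sleeping is not, here is a simplified, self-contained sketch of the retry-until-pass idea. It is not the test framework's implementation: the backoff values, the `CheckedRunnable` interface, and the class name are assumptions of this sketch, and the real helper's timeout and signature should be read from the framework source.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a simplified retry loop in the spirit of assertBusy.
final class AssertBusySketch {
    @FunctionalInterface
    interface CheckedRunnable {
        void run() throws Exception;
    }

    /** Re-runs the assertion until it passes; rethrows the last failure once maxWait elapses. */
    static void assertBusy(CheckedRunnable assertion, long maxWait, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // condition reached: return immediately, no wasted waiting
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // bounded: surface the real assertion failure
                }
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 100); // back off, capped at 100 ms
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        assertBusy(() -> {
            // Simulates a condition that becomes true on the third check.
            if (attempts.incrementAndGet() < 3) {
                throw new AssertionError("not ready yet");
            }
        }, 5, TimeUnit.SECONDS);
        System.out.println("passed after " + attempts.get() + " attempts"); // passed after 3 attempts
    }
}
```

The sleep inside the loop is only pacing between checks; the test's outcome depends on the assertion, and a timeout reports the assertion's own message instead of a bare failure after an arbitrary delay.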

Prove your test actually tests

The cheapest way to earn a maintainer's trust is to demonstrate the test fails without the fix:

cd ~/OpenSearch
# 1) Stash your production change, keep the test:
git stash push -- server/src/main/java/...      # the fix only
# 2) Run the new test; it MUST fail:
./gradlew :server:test --tests "org.opensearch.your.NewTest"
# 3) Restore the fix; it MUST pass:
git stash pop
./gradlew :server:test --tests "org.opensearch.your.NewTest"

BWC tests for compatibility-sensitive changes

If your change touches the wire, index, or REST surface (maintainer-mindset), the trusted test is a round-trip serialization test — typically extending AbstractWireSerializingTestCase — exercising both the current and an older bwcVersion. See the serialization-BWC deep dive.
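The shape of a version-aware round trip can be shown without the OpenSearch test framework. The sketch below uses plain `java.io` streams as a stand-in for the real stream classes; the `Payload` type, its fields, and the version constants are all hypothetical. A real test would extend AbstractWireSerializingTestCase as described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative only: plain-Java stand-in for a version-gated wire format.
final class BwcRoundTripSketch {
    static final int V_OLD = 1; // hypothetical older wire version
    static final int V_NEW = 2; // hypothetical version that added the "retries" field
    static final int DEFAULT_RETRIES = 0;

    static final class Payload {
        final String name;
        final int retries; // added in V_NEW; absent from the V_OLD wire format

        Payload(String name, int retries) {
            this.name = name;
            this.retries = retries;
        }

        void writeTo(DataOutputStream out, int wireVersion) throws IOException {
            out.writeUTF(name);
            if (wireVersion >= V_NEW) {
                out.writeInt(retries); // only send the new field to peers that understand it
            }
        }

        static Payload readFrom(DataInputStream in, int wireVersion) throws IOException {
            String name = in.readUTF();
            int retries = wireVersion >= V_NEW ? in.readInt() : DEFAULT_RETRIES;
            return new Payload(name, retries);
        }
    }

    /** Serializes and deserializes the payload as if talking to a peer on wireVersion. */
    static Payload roundTrip(Payload original, int wireVersion) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            original.writeTo(out, wireVersion);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return Payload.readFrom(in, wireVersion);
        }
    }

    public static void main(String[] args) throws IOException {
        Payload original = new Payload("shard-0", 7);
        System.out.println(roundTrip(original, V_NEW).retries); // 7: new field survives
        System.out.println(roundTrip(original, V_OLD).retries); // 0: old peers get the default
    }
}
```

The point of testing both versions is that the write and read sides must agree on the gate: if only one side checks the version, the stream desynchronizes against an older node.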

Flaky tests

If you hit a non-deterministic failure that isn't yours, don't silently re-run CI to bury it. Reproduce it from the printed -Dtests.seed=..., check for an existing flaky-test issue, and either fix it or mute it correctly:

@AwaitsFix(bugUrl = "https://github.com/opensearch-project/OpenSearch/issues/NNNN")

Never @Ignore a flaky test — that drops it silently with no tracking. The flaky-test roadmap stage covers this in depth.

CHANGELOG Discipline

Every PR to the core repo adds one line to CHANGELOG.md, in the correct category, under the unreleased section. This is not bureaucracy — it is the raw material for the release notes:

## [Unreleased 3.x]
### Fixed
- Fix DiskThresholdDecider off-by-one for relocating shards ([#NNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNN))

The habits that matter: put it under the right heading (Added/Changed/Fixed/Deprecated/Removed/Security); write it for a user reading release notes, not for yourself; and remember to move it to [Unreleased 2.x] when you backport. A PR missing its CHANGELOG entry is incomplete, and the checks will usually say so before a human does.
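If you want a quick local sanity check on the entry's shape, a small sketch like the following can catch the most common slip: a link text and URL that point at different PR numbers. The pattern is inferred from the example entry above and is an assumption of this sketch, not the rule the project's own CHANGELOG check applies.

```java
import java.util.regex.Pattern;

// Illustrative only: a hypothetical self-check for the entry format shown above.
final class ChangelogEntrySketch {
    // "- <description> ([#N](https://github.com/opensearch-project/OpenSearch/pull/N))"
    // The backreference \1 requires the link text and the URL to use the same number.
    private static final Pattern ENTRY = Pattern.compile(
        "^- .+ \\(\\[#(\\d+)\\]\\(https://github\\.com/opensearch-project/OpenSearch/pull/\\1\\)\\)$"
    );

    static boolean looksLikeEntry(String line) {
        return ENTRY.matcher(line).matches();
    }

    public static void main(String[] args) {
        String good = "- Fix example bug ([#1234](https://github.com/opensearch-project/OpenSearch/pull/1234))";
        String mismatched = "- Fix example bug ([#1234](https://github.com/opensearch-project/OpenSearch/pull/1235))";
        System.out.println(looksLikeEntry(good));       // true
        System.out.println(looksLikeEntry(mismatched)); // false
    }
}
```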

Focused PRs and Responsiveness

Two behavioral habits do more for your reputation than any individual piece of code:

Focused PRs. One PR, one logical change. Do not bundle a refactor, a feature, and a reformat. A focused PR is faster to review, safer to merge (it squashes to one clean commit — see github review), and a better git bisect target later. If you catch yourself writing "and also" in the PR description, split it.
Responsiveness. Address review comments within days, not weeks. Push fixes as new commits during review (they squash on merge); reply to each thread; if you disagree, say so with reasoning rather than going silent. A reviewer who knows you'll respond promptly invests more of their scarce attention in you.

These are precisely the behaviors in the "merged fast" column of the review chapter. None of them is about cleverness; all of them are about being a low-friction collaborator.

How Trust Compounds Into Maintainership

The mechanism is straightforward and entirely earned:

flowchart LR
    A[Green gates + strong tests + focused, responsive PRs] --> B[Reviewer spends less effort per PR]
    B --> C[You earn more review bandwidth, faster merges]
    C --> D[You start reviewing others, triaging, helping]
    D --> E[Maintainers trust your judgment across many changes]
    E --> F[Nominated and added to MAINTAINERS.md]

Each merged, clean PR is a deposit. Each revert, each "please run the formatter," each abandoned PR mid-review is a withdrawal. Maintainers grant the things you want — fast reviews, the benefit of the doubt on a design call, eventually merge rights — on the balance, not on any single transaction. The TSC-governance chapter describes the ladder; this chapter is how you actually climb the first rung of it.

Trust-Building Checklist

Run this on every PR before you ask for review:

./gradlew spotlessApply run and the result committed.
./gradlew precommit green (checkstyle, forbidden-APIs, headers, license, logger).
./gradlew assemble succeeds.
Relevant tests added and run (:server:test --tests "..." / internalClusterTest).
The new test fails without the production change (you verified this).
No Thread.sleep in new tests; conditions waited on with assertBusy.
Real assertions on real values — not "no exception thrown."
BWC round-trip test added if the change touches wire/index/REST surfaces.
CHANGELOG.md entry added under the correct section (and moved on backport).
DCO sign-off on every commit (git commit -s → Signed-off-by:).
PR is one focused logical change; title reads as the squash commit message.
Issue linked; the why is explained for the reviewer.
backport <branch> label added if the fix belongs in a maintenance line.

If every box is checked, you have removed every reason a maintainer could send the PR back before reading the logic — which is exactly how you earn the bandwidth to have the logic taken seriously.

Prove You Understand This

Which single ./gradlew task bundles most of the style/static gates, and what does spotlessApply do that spotlessJavaCheck does not?
Give three properties of a test a maintainer trusts, and the anti-pattern each one rules out.
Show, with commands, how you prove a new test actually fails without your production change.
Why is assertBusy preferred over Thread.sleep, and what's the correct way to handle a flaky test you didn't cause?
Where does your CHANGELOG entry end up, and what must you do to it when you backport?
Explain, in terms of deposits and withdrawals, how a track record of clean PRs becomes a line in MAINTAINERS.md.

This closes the Release & Governance Reality section. From here, take what you've learned into the capstone: a real issue, a reviewed PR, a backport, and a fix in a release — the whole machine, end to end.

Capstone Project

The Capstone is the bridge from "I have read the OpenSearch codebase" to "I have shipped a non-trivial fix that an OpenSearch maintainer merged into main." Everything in Levels 1–9 was preparation. This is the work.

You will pick one real, open issue from github.com/opensearch-project/OpenSearch/issues, reproduce it against a current build, trace the failure through the codebase, identify the root cause, write a minimum-diff fix with deterministic tests, get it through ./gradlew precommit and the gradle-check CI, open a Pull Request, sign your commits off under the DCO (git commit -s — OpenSearch has no CLA), respond to review rounds, land the change, and write it up so the next person can learn from your investigation.

OpenSearch contribution is GitHub-native. There is no JIRA, no patch file emailed to a list, no Apache ID. You fork, branch, push, open a PR, sign off, add a CHANGELOG.md entry, and iterate in the PR conversation until a maintainer merges and the backport bot cherry-picks to the release branches. Keep that model in your head — every step below assumes it.

This chapter is the table of contents. The ten step-chapters that follow are the work itself.

Prerequisites

Do not start the Capstone until you can answer "yes" to every one of these:

Levels 1–9 complete. You can read RestSearchAction, TransportSearchAction, IndexShard, InternalEngine, the coordination layer (Coordinator, MasterService / cluster-manager service, ClusterApplierService), and at least one allocation decider, without a guide open. If those names are not familiar, go back to Level 4 and the deep dives.
You can build from source. ./gradlew assemble succeeds on your machine, and ./gradlew :server:test --tests "org.opensearch.cluster.ClusterStateTests" finishes green. Some flakes are normal — see the flaky-test discussion.
You have run a cluster from source. ./gradlew run brings up a single node with REST on localhost:9200, and you have hit it with curl.
You have run an InternalTestCluster test. ./gradlew :server:internalClusterTest --tests "*ClusterHealthIT" (or any *IT) goes green, so you know the multi-node in-JVM harness works on your box.
You have a GitHub account with DCO configured. You can git commit -s and the Signed-off-by: line carries your real name and the email tied to your GitHub account. (The DCO check on the PR matches the sign-off email to the commit author; get this right once, locally, before you ever push.)
You have read the contribution hygiene files in the repo: CONTRIBUTING.md, DEVELOPER_GUIDE.md, and TESTING.md.

If any of these is "no," stop. Go back. The Capstone is unforgiving of partial preparation — you will spend three weeks confused instead of three weeks shipping.
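The DCO prerequisite above comes down to one comparison: does a Signed-off-by: trailer in the commit message carry the commit author's email? A minimal sketch of that comparison follows. It illustrates the idea only; the actual DCO check on GitHub has its own rules, and the class and method names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the real DCO check runs on the PR and may apply additional rules.
final class DcoSketch {
    // Matches trailer lines of the form "Signed-off-by: Full Name <email>".
    private static final Pattern SIGN_OFF =
        Pattern.compile("^Signed-off-by: (.+) <([^<>]+)>$", Pattern.MULTILINE);

    /** True if any sign-off trailer in the message carries the author's email. */
    static boolean signedOffBy(String commitMessage, String authorEmail) {
        Matcher m = SIGN_OFF.matcher(commitMessage);
        while (m.find()) {
            if (m.group(2).equalsIgnoreCase(authorEmail)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String message = "Fix example bug\n\nSigned-off-by: Jane Doe <jane@example.com>\n";
        System.out.println(signedOffBy(message, "jane@example.com"));  // true
        System.out.println(signedOffBy(message, "other@example.com")); // false: email mismatch
    }
}
```

The mismatch case is the one that bites in practice: a sign-off generated from one git identity on a commit authored under another.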

Note: OpenSearch renamed the master node role and many APIs to cluster manager for inclusive language (the old master terms survive as deprecated aliases). Throughout the Capstone, write cluster manager (formerly master) the first time you reference it in any artifact, then use "cluster manager." Reviewers notice.

The 10-Step Flow

flowchart TD
    A[Step 1: Issue Selection] --> B[Step 2: Reproduction]
    B --> C[Step 3: Execution Path Analysis]
    C --> D[Step 4: Root Cause Identification]
    D --> E[Step 5: Implementation]
    E --> F[Step 6: Testing]
    F --> G[Step 7: Validation]
    G --> H[Step 8: Pull Request Preparation]
    H --> I[Step 9: GitHub Documentation]
    I --> J[Step 10: Engineering Write-Up]
    G -.precommit fail.-> E
    F -.test fail.-> E
    D -.hypothesis wrong.-> C
    H -.review round.-> E
    I -.gradle-check red.-> F

The dotted arrows are the loops you will actually run. Nobody gets root cause right on the first hypothesis. Nobody passes precommit on the first push. Nobody clears review in one round. Plan for two or three iterations through Steps 4–9 before the merge button turns green.

Deliverables

By the time you mark the Capstone done, every one of these artifacts exists:

#	Artifact	Lives in
1	Failing reproducer test (a JUnit test that fails on `main` without your fix and passes with it)	`server/src/test/...` or `server/src/internalClusterTest/...`
2	Root-cause document (200–500 words, with file-path citations)	`capstone-work/root-cause.md` in your fork
3	Minimum-diff fix branch	A branch on your fork of `opensearch-project/OpenSearch`
4	Unit tests (`OpenSearchTestCase`; `AbstractWireSerializingTestCase` if serialization changed)	The relevant `src/test/java`
5	Integration tests (`OpenSearchIntegTestCase` / `InternalTestCluster`) if end-to-end behavior changed	`server/src/internalClusterTest/java/...`
6	Validation report (`./gradlew precommit`, `spotlessJavaCheck`, affected-module tests)	`capstone-work/validation.md`
7	GitHub Pull Request against `opensearch-project/OpenSearch:main`, DCO-signed	`https://github.com/opensearch-project/OpenSearch/pulls`
8	`CHANGELOG.md` entry under `## [Unreleased ...]` (Added / Changed / Fixed)	`CHANGELOG.md` in your branch
9	Issue updated and linked (`Fixes #NNNN`), labels and `backport` label as appropriate	`https://github.com/opensearch-project/OpenSearch/issues/NNNN`
10	Engineering write-up (500–1000 words: problem, investigation, design, alternatives, lessons)	Personal blog, the forum, or the PR itself

Every one. No exceptions. The write-up is not optional — it is how the community (and your future self) learns from your investigation. See Step 10.

Use a - [ ] checklist to track them:

Failing reproducer test committed and confirmed red on main
capstone-work/root-cause.md written
Minimum-diff fix on a feature branch
Unit tests cover every trigger condition
Integration / REST-YAML tests if behavior is end-to-end
capstone-work/validation.md with a green precommit
DCO-signed PR opened against main
CHANGELOG.md entry added
Issue linked with Fixes #NNNN, labels set
Write-up published

100-Point Rubric Summary

The full rubric lives in evaluation-rubric.md. Headline:

Area	Weight
Problem articulation (symptom vs. root cause, trigger conditions)	20
Execution-path mastery (file-path citations, accurate diagram)	20
Implementation quality (minimum diff, conventions, BWC, no scope creep)	20
Testing (unit + integration, deterministic, trigger coverage)	15
Review responsiveness (addresses comments, iteration cadence)	10
Documentation (issue/PR hygiene, CHANGELOG, write-up)	10
Community interaction (forum/Slack/PR etiquette, handoff hygiene)	5

Tier thresholds:

80+ — credible OpenSearch contributor. You can sustain a steady PR flow.
90+ — maintainer-ready. You are doing work a MAINTAINERS.md reviewer would do without hand-holding.
95+ — TSC-track. You are leading work others want to follow, across repos.

You will self-grade in Step 10. Be honest. Inflated self-grades are visible from orbit the moment a maintainer reads your PR.
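The arithmetic of the self-grade is simple enough to sketch. The weights and thresholds below are copied from the table and tier list above; the class name, the sample scores, and the label for scores under 80 are assumptions of this sketch, and the detailed criteria remain in evaluation-rubric.md.

```java
// Illustrative only: sums per-area scores against the rubric weights and maps to a tier.
final class RubricSketch {
    // Weights in table order: problem articulation, execution-path mastery, implementation
    // quality, testing, review responsiveness, documentation, community interaction.
    static final int[] WEIGHTS = { 20, 20, 20, 15, 10, 10, 5 };

    /** Sums the earned points, rejecting any score outside its area's weight. */
    static int total(int[] earned) {
        if (earned.length != WEIGHTS.length) {
            throw new IllegalArgumentException("expected " + WEIGHTS.length + " areas");
        }
        int sum = 0;
        for (int i = 0; i < earned.length; i++) {
            if (earned[i] < 0 || earned[i] > WEIGHTS[i]) {
                throw new IllegalArgumentException("area " + i + " must be between 0 and " + WEIGHTS[i]);
            }
            sum += earned[i];
        }
        return sum;
    }

    /** Maps a total score to the tier thresholds listed above. */
    static String tier(int score) {
        if (score >= 95) return "TSC-track";
        if (score >= 90) return "maintainer-ready";
        if (score >= 80) return "credible contributor";
        return "below the first tier"; // label assumed; the rubric names no tier under 80
    }

    public static void main(String[] args) {
        int[] earned = { 18, 17, 16, 13, 9, 8, 4 }; // hypothetical self-grade
        int score = total(earned);
        System.out.println(score + " -> " + tier(score)); // 85 -> credible contributor
    }
}
```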

Timeline

The Capstone is a 4–6 week effort if you have one focused evening per weekday plus weekend mornings. Less than that and you risk losing context between sessions — which is far more expensive than people expect for cluster-state and coordination code.

Week	Steps	Hours
1	1–2: Pick an issue, build a deterministic reproducer	10–15
2	3–4: Trace the execution path, identify root cause	12–18
3	5–6: Implement the fix, write unit + integration tests	12–18
4	7–8: Validate locally, open the DCO-signed PR	8–12
5	8–9: Review iteration — two or three rounds is normal	6–10
6	10: Write-up, issue cleanup, retrospective	4–6

If you blow past six weeks, that is a signal — not a failure. Either the issue is larger than it looked (pause and renegotiate scope in the issue thread), or you are stuck on a specific step (ask on the forum or in #community on Slack). See Communication Channels.

Success Indicators

You will know it is working when:

A maintainer comments "LGTM" / approves the PR, and gradle-check is green.
Your fix appears in git log origin/main with your Signed-off-by: line, and the backport bot opens a follow-up PR onto 2.x.
The issue you claimed flips to closed via Fixes #NNNN, with your name on the merge.
Your write-up gets traffic — forum replies, a question from another contributor, a maintainer pointing the next person at it.
The next time you pick an issue, you reach root cause in days, not weeks.

You will know it is failing when:

You are still editing files in Step 5 with no failing test in hand from Step 2.
Your PR description says "I think this might fix it."
You have not run ./gradlew precommit in over a week.
You are arguing in PR comments instead of changing code or asking questions.
CI (gradle-check) has been red for days and you are pushing "fix CI" commits blind instead of reproducing the failure locally.

If you spot a failure signal, do not push through. Stop, reread the relevant step chapter, and reset.

How to Use This Chapter

Read all ten step-chapters once, end-to-end, before you start Step 1. You need the shape of the whole journey in your head — Step 4 (root cause) makes choices that Step 6 (testing) depends on; Step 8 (PR) assumes you have artifacts from Steps 2 and 7; Step 9 assumes your tests cleared CI. Skim now, deep-read each as you arrive at it.

Then go to Step 1: Issue Selection. Pick the issue. The clock starts when you comment "I'd like to work on this" on the GitHub issue.

Validation / Self-check

Before starting Step 1, confirm:

You can produce, from memory, the source path of RestSearchAction, TransportSearchAction, IndexShard, InternalEngine, and Coordinator.
./gradlew assemble completes against your local clone.
./gradlew run brings up a node and you have curled localhost:9200/_cluster/health.
./gradlew :server:internalClusterTest --tests "*ClusterHealthIT" passes.
git commit -s produces a correct Signed-off-by: line with your GitHub email.
You have a capstone-work/ directory in your fork ready for root-cause.md and validation.md.
You have skimmed every step-chapter once.
You have set aside 4–6 calendar weeks with a realistic time budget.
You have read CONTRIBUTING.md, DEVELOPER_GUIDE.md, and TESTING.md.

Step 1: Issue Selection

The single biggest predictor of whether your Capstone ships is the issue you pick in week one. Pick something too large and you will still be tracing code in week six. Pick something already half-fixed by someone else and you will waste a week before a maintainer points it out. Pick something you cannot reproduce and you have nothing to build on.

A good Capstone issue is real (an open, maintainer-acknowledged problem), tractable (fixable in a focused diff, not a redesign), and reproducible (you can make it fail on demand). This step is how you find one and how you claim it without stepping on anyone.

Where the Issues Live

Everything is on GitHub. There is no JIRA. Start here:

https://github.com/opensearch-project/OpenSearch/issues

Filter with the search bar. The label vocabulary is the same one you read about in the issue roadmap overview; here is the subset that matters for a Capstone:

Label	What it means for you
`good first issue`	Scoped by a maintainer to be approachable. Start here.
`help wanted`	Maintainers want a contributor; no one is on it (usually).
`bug`	A defect with observable wrong behavior — the best Capstone fuel.
`flaky-test`	A test that fails intermittently; often a real concurrency bug. Excellent if you can reproduce it.
`untriaged`	Not yet reviewed by a maintainer. Avoid — the problem may not be real, in scope, or even agreed upon.
`enhancement`	A feature request. Smaller ones are fine; large ones are not.
`RFC` / `meta` / `proposal`	Design discussions and large features. Avoid for a Capstone.
`discuss`	Needs consensus before code. Avoid unless you want to drive the discussion.
`backport 2.x` / `v3.0.0`	Versioning/backport metadata, not a "type" of work.

Useful saved searches

Paste these into the GitHub issue search box (adjust the labels as the project evolves):

is:issue is:open label:"good first issue" no:assignee
is:issue is:open label:bug label:"help wanted" no:assignee
is:issue is:open label:"flaky-test" no:assignee
is:issue is:open label:bug -label:"RFC" -label:"meta" sort:comments-desc

no:assignee is doing real work there: it filters out issues that already have an assignee. sort:comments-desc puts the most-commented issues first, and a long thread is usually where you learn whether a fix direction is already agreed.

Note: The biggest functional plugins — security, k-NN, sql, alerting, ml-commons, index-management — live in separate repos under opensearch-project/. For your first Capstone, stay in the core engine (opensearch-project/OpenSearch). The build, test harness, and reviewer pool are what Levels 1–9 trained you on. Cross-repo work is a fine second Capstone.

What Makes an Issue Tractable

A tractable Capstone issue has these properties. Score a candidate against them before you claim it:

Signal	Good	Bad
Blast radius	Touches 1–3 files in one module (`server/.../cluster`, `.../search`, `.../index`)	Spans `server`, `libs`, `modules`, and a plugin repo
Surface	A wrong value, a missing validation, an off-by-one, a race in one component	"Redesign allocation," "add a new node role"
Reproducibility	A `curl` sequence or a JUnit test makes it fail every time	"Sometimes on a 200-node cluster under load"
Agreement	A maintainer has commented confirming it is a bug and roughly how to fix	Open question whether it is even a bug
BWC exposure	No serialization or REST contract change, or a clearly guarded one	Changes a wire format with no `Version` story
Test reachability	You can imagine the `OpenSearchTestCase` / `OpenSearchIntegTestCase` that asserts the fix	"Only reproduces in production telemetry"

The sweet spot for a first Capstone: a bug + good first issue where a maintainer has already written "this is wrong because X; the fix is probably around SomeClass.someMethod." That comment is gold — it is a pre-validated root-cause hypothesis you get to confirm and implement.

Categories that age well as Capstones

A validation gap. A REST parameter or setting accepts a value it should reject, and the failure surfaces deep in the engine instead of at the edge. Fix: validate early in the RestHandler or Setting definition. Small, testable, user-visible.
A flaky test rooted in a real race. A *IT that fails ~1 in 50 because of a missing assertBusy or a genuine cluster-state applier ordering bug. The former is a test fix; the latter is a real engine fix. Both are good — just be honest in your root-cause doc about which one it is.
An aggregation/reduce edge. An aggregation returns a wrong value for an empty bucket, a single-shard index, or a specific missing/min_doc_count combination. Contained in one InternalAggregation.reduce(...) family. See the aggregations deep dive.
A seqno / versioning edge. A document version or sequence-number assertion trips under a specific retry/refresh ordering. Deeper, but extremely instructive. See replication and recovery.
An error-message / diagnostics defect. The engine throws the wrong exception type, or a message omits the field that would let an operator diagnose it. Low blast radius, genuinely appreciated, great first PR.

Categories to avoid for a first Capstone

Anything labeled RFC, proposal, meta, or discuss.
New node roles, new transport actions, new settings frameworks.
Pure performance issues with no correctness component ("make X faster") — benchmarking and proving no regression is its own multi-week skill.
Anything where the issue thread has an unresolved disagreement between two maintainers about whether to fix it. You do not want to land in the middle of that as your first contribution.

Check It Is Not Already Taken

Before you invest a day reproducing, spend five minutes confirming nobody is already on it:

Read the whole thread. Look for "I'll take this," "working on it," or a maintainer assigning someone (Assignees in the sidebar).
Search for an existing PR. GitHub links PRs to issues automatically when the PR body says Fixes #NNNN. Also search the PR list directly:
```
https://github.com/opensearch-project/OpenSearch/pulls?q=is:pr+NNNN
```
and
```
is:pr is:open "Fixes #NNNN"
```
Check linked timeline events. The issue's timeline shows "X mentioned this in PR #M" — follow it. If there is an open PR, the work is taken. If the PR is stale (months without activity, author gone), you can comment offering to pick it up — but say so explicitly and wait for a maintainer's nod.
Check the date. A good first issue opened two days ago may already have three people circling it silently. An issue open for eight months with a maintainer comment "PRs welcome" is much safer.

Claim It

OpenSearch does not require formal assignment to start, but claiming politely prevents duplicate work and signals seriousness:

I'd like to work on this. My plan is to reproduce it with a failing OpenSearchIntegTestCase first, then trace through ClusterApplierService. I'll open a draft PR once I have the reproducer. Anything I should know about the intended fix direction before I start?

That comment does three things: it claims the issue, it shows you already have a plan, and it invites a maintainer to redirect you before you spend a week. Most good first issue threads get a quick "Sounds good, go for it" — and now you have a maintainer lightly on the hook to review.

Warning: Do not comment "Can I work on this?" and then vanish for a week. That is the single most common way to annoy maintainers — it locks an issue in social limbo. If you claim it, start within a day or two, or un-claim it.

Scope Estimation

Write a one-paragraph scope estimate for yourself before you commit. Answer:

How many files do I expect to touch? (Count the files that mention the symbols from the issue: grep -rl "theMethodName" server/src/main/java | wc -l.) If the answer is "dozens," reconsider.
Is there a serialization or REST contract change? If yes, there is a BWC story (a Version guard) and the fix is at least one tier harder. Fine for a later Capstone; not ideal for your first one.
Can I name the test that will prove the fix? If you cannot imagine the assertion, you do not understand the bug yet — keep reading the thread and the code before claiming.
Is there a maintainer hypothesis I can confirm? If yes, your Step 4 is mostly validation. If no, budget extra time for root cause.

If the honest answer to "can I finish this in 4–6 weeks of evenings" is "no," pick a smaller issue. There is no shame in a small, clean, merged PR — it is worth infinitely more than an ambitious branch that never lands.

Reproducibility Pre-Check

You will do the full reproduction in Step 2, but do a 30-minute smoke test now, before you fully commit:

Build current main: ./gradlew assemble -q.
If the issue is REST-shaped, run ./gradlew run and try the exact curl from the issue. Does it misbehave?
If it is internals-shaped, find the relevant test class (find . -name "*<Component>*Tests.java") and see whether you can already write an assertion that fails.

If you cannot make it misbehave in 30 minutes, that is not necessarily disqualifying — but note it. A bug you cannot reproduce in week one is a bug you may not reproduce at all, and an irreproducible Capstone is a dead Capstone. If the issue says "only on version 2.7 with these 4 settings," try exactly that combination on the matching tag (git checkout 2.7.0) before claiming.

Selection Rubric

Score your candidate issue out of 10 before committing. Pick one that scores 7 or higher.

Criterion	0	1	2
Maintainer-confirmed it is a real bug/task	Untriaged, unclear	Some discussion	Maintainer confirmed + fix hint
Blast radius	Many modules / cross-repo	3–4 files, one module	1–2 files, one component
Reproducibility	Can't make it fail	Fails sometimes	Deterministic repro in reach
BWC / contract risk	Wire/REST change, no plan	Guarded contract change	No serialization/REST change
Test reachability	Can't imagine the test	Integration test only	Clear unit + integ assertion

A 9–10 is a near-ideal first Capstone. A 7–8 is a solid, slightly stretchy choice. Below 7, keep looking — the issue list is long and a better candidate is usually one search away.
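
The arithmetic behind the rubric is simple enough to sketch. The class below is a hypothetical helper, not part of any OpenSearch tooling: `SelectionRubric`, `total`, and `verdict` are illustrative names, and the criterion labels in `main` are just shorthand for the table rows above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: totals the five Step 1 criteria, each scored 0, 1, or 2. */
final class SelectionRubric {

    /** Sums the per-criterion scores, rejecting anything outside 0..2. */
    static int total(Map<String, Integer> scores) {
        int sum = 0;
        for (Map.Entry<String, Integer> entry : scores.entrySet()) {
            int score = entry.getValue();
            if (score < 0 || score > 2) {
                throw new IllegalArgumentException(
                    entry.getKey() + " must be 0, 1, or 2, got " + score);
            }
            sum += score;
        }
        return sum;
    }

    /** Maps a total to the bands described above: 9-10, 7-8, below 7. */
    static String verdict(int total) {
        if (total >= 9) {
            return "near-ideal";
        }
        if (total >= 7) {
            return "solid, slightly stretchy";
        }
        return "keep looking";
    }

    public static void main(String[] args) {
        Map<String, Integer> candidate = new LinkedHashMap<>();
        candidate.put("maintainer-confirmed", 2);
        candidate.put("blast radius", 2);
        candidate.put("reproducibility", 1);
        candidate.put("BWC / contract risk", 2);
        candidate.put("test reachability", 1);

        int total = total(candidate);
        System.out.println(total + "/10 -> " + verdict(total));
        // prints "8/10 -> solid, slightly stretchy"
    }
}
```

Writing the five numbers down, even informally, makes it harder to talk yourself into an issue that scores a 5.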

Deliverable for Step 1

A chosen issue number, #NNNN, scoring ≥ 7 on the rubric above.
A claim comment posted on the issue with a brief plan.
A one-paragraph scope estimate saved in capstone-work/scope.md.
A 30-minute reproducibility smoke test result noted (did it misbehave?).
A local feature branch off main: git checkout -b fix/issue-NNNN-short-description.

Validation / Self-check

Before advancing to Step 2:

You can state, in one sentence, the wrong behavior the issue describes (not the desired behavior — the wrong one).
You have confirmed no open or recent PR already fixes it.
You have posted a claim comment and (ideally) gotten a maintainer acknowledgement.
You have grepped the codebase for the symbols named in the issue and the count is small.
You can name the component the bug lives in: REST handler, transport action, cluster-state path, engine, allocation, aggregation, or coordination.
Your scope estimate honestly fits a 4–6 week budget.
You have a feature branch checked out and capstone-work/ ready.

Then go to Step 2: Reproduction.

Step 2: Reproduction

A bug you cannot reproduce on demand is a bug you cannot fix with confidence. The reproducer is the most important artifact in the entire Capstone — it is what makes Step 4 (root cause) provable, what makes Step 5 (implementation) verifiable, and what a maintainer will look for first when they review your PR.

The goal of this step is a deterministic reproducer: something that fails on main without your change, every single run, and that you can re-run in seconds. Ideally it is a JUnit test you will ship in the PR. At minimum it is a recorded curl/REST-YAML sequence that misbehaves the same way every time.

The rule that governs this step: the reproducer must fail on main and pass after the fix. If it does not fail on main, you are testing the wrong thing. If it does not pass after the fix, you have not fixed the bug.
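
One way to keep that rule honest is to record both outcomes and classify them. The sketch below is illustrative only: `ReproducerCheck`, its `Verdict` names, and the two boolean inputs are hypothetical stand-ins for "was the test red on main?" and "was it green with the change applied?".

```java
/** Hypothetical helper encoding the Step 2 rule; all names are illustrative. */
final class ReproducerCheck {

    enum Verdict { VALID_REPRODUCER, TESTING_WRONG_THING, BUG_NOT_FIXED }

    /**
     * failsOnMain: the test is red on main without your change.
     * passesAfterFix: the same test is green with your change applied.
     */
    static Verdict classify(boolean failsOnMain, boolean passesAfterFix) {
        if (!failsOnMain) {
            return Verdict.TESTING_WRONG_THING; // green on main: it never captured the bug
        }
        if (!passesAfterFix) {
            return Verdict.BUG_NOT_FIXED;       // still red: the change does not fix it
        }
        return Verdict.VALID_REPRODUCER;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, true));   // VALID_REPRODUCER
        System.out.println(classify(false, true));  // TESTING_WRONG_THING
        System.out.println(classify(true, false));  // BUG_NOT_FIXED
    }
}
```

During Step 2 only the first input is known. The second is filled in at Step 5, which is why the reproducer has to exist before any fix does.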

Two Kinds of Reproducer

Pick the lowest-cost level that reliably reproduces the bug.

Level	Harness	When to use	Speed
Unit	`OpenSearchTestCase` / `OpenSearchSingleNodeTestCase`	Logic in one class: an aggregation reduce, a setting validator, a serialization round-trip, a request parser	seconds
Integration	`OpenSearchIntegTestCase` (`InternalTestCluster`)	Behavior that needs multiple nodes / shards / a real cluster state: allocation, recovery, replication, cluster-manager election	tens of seconds
REST-YAML	`OpenSearchRestTestCase` + a `.yml` under `rest-api-spec`	A REST contract: status code, response shape, error message	tens of seconds
Manual `curl`	`./gradlew run` + `curl`	First, to see the bug; then promote to one of the above	interactive

Always start manual to see the failure, then promote it to code. A manual curl repro is fine for Step 2's exploration, but the PR needs an automated test (Step 6). Write the automated version as early as you can — it is your regression guard for the rest of the Capstone.

Manual Reproduction with `./gradlew run`

Bring up a single node from source and reproduce by hand first. This is where you confirm the bug and capture the exact request/response.

# Launch a debuggable single-node cluster from source. REST on :9200.
./gradlew run

# In another shell:
curl -s 'localhost:9200/_cluster/health?pretty'   # quoted so the shell does not treat ? as a glob

Now run the exact sequence from the issue. For example, a hypothetical aggregation-edge bug:

curl -s -XPUT localhost:9200/repro -H 'Content-Type: application/json' -d '
{ "settings": { "number_of_shards": 1, "number_of_replicas": 0 } }'

curl -s -XPOST 'localhost:9200/repro/_doc?refresh=true' \
  -H 'Content-Type: application/json' -d '{ "value": 10 }'

curl -s 'localhost:9200/repro/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "by_missing": {
      "terms": { "field": "absent_field", "missing": "N/A", "min_doc_count": 0 }
    }
  }
}'

Compare the response to what the issue (and the docs) say it should be. Capture both. If the response is wrong, you have a manual reproducer. Save the exact commands and outputs into capstone-work/repro.md — you will paste them into the PR.

Tip: ./gradlew run defaults to a fresh data directory each run, so your repro starts clean. To keep data across restarts, use ./gradlew run --preserve-data. To attach a debugger (you will, in Step 3), use ./gradlew run --debug-jvm and connect on port 5005.

Symptom vs. trigger conditions

While reproducing, separate two things you will need explicitly in your root-cause doc:

Symptom — what the user observes. "The by_missing bucket count is 0 when it should be 1," or "the request returns HTTP 500 with a NullPointerException instead of HTTP 400."
Trigger conditions — the precise circumstances under which the symptom appears. "Only with min_doc_count: 0 and a field that exists in the mapping but in zero documents," or "only on a single-shard index," or "only after a refresh but before a flush."

Bisect the trigger conditions experimentally. Remove missing. Add a second shard. Add a second document. Each variation that makes the symptom disappear teaches you a trigger condition, and trigger conditions are what your test must encode and what your fix must address. A reproducer that only works under your exact lucky setup is a fragile reproducer.
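
The remove-one-condition-at-a-time procedure can be sketched as a small loop. Everything here is an assumption made for illustration: `TriggerConditions` and `necessary` are hypothetical names, the condition labels are free-form strings, and the `symptomAppears` predicate stands in for "run the reproducer with exactly these conditions and see whether it misbehaves". The bug model in `main` is made up.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of one-at-a-time trigger-condition elimination. */
final class TriggerConditions {

    /** Returns the conditions whose removal, one at a time, makes the symptom disappear. */
    static Set<String> necessary(Set<String> all, Predicate<Set<String>> symptomAppears) {
        if (!symptomAppears.test(all)) {
            throw new IllegalStateException("the full setup does not reproduce the symptom");
        }
        Set<String> result = new LinkedHashSet<>();
        for (String condition : all) {
            Set<String> without = new LinkedHashSet<>(all);
            without.remove(condition);
            if (!symptomAppears.test(without)) {
                result.add(condition); // symptom vanished: this condition is a trigger
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> setup = new LinkedHashSet<>();
        setup.add("min_doc_count=0");
        setup.add("missing set");
        setup.add("single shard");
        setup.add("one document");

        // Made-up bug model: the symptom needs only the first two conditions.
        Predicate<Set<String>> symptom =
            s -> s.contains("min_doc_count=0") && s.contains("missing set");

        System.out.println(necessary(setup, symptom));
        // prints "[min_doc_count=0, missing set]"
    }
}
```

The conditions that drop out ("single shard", "one document" in this made-up model) are the ones your test should not depend on. Note that this loop only finds conditions that are individually necessary; a bug that needs either of two conditions would require testing combinations as well.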

Promote to a JUnit Reproducer

Now write the failing test. Name it after the issue so the intent is obvious and so anyone can find it later: SomeComponentTests with a method testIssueNNNNRepro (or, for an aggregation, add a method to the existing ...AggregatorTests).

A unit-level reproducer (`OpenSearchTestCase`)

Find the existing test class for the component first — do not invent a new one if a home exists:

# Where do tests for this aggregation live?
find server/src/test -name "*Terms*AggregatorTests.java"
# Where is the production class?
grep -rn "class TermsAggregator" server/src/main/java

A minimal failing unit test reads like this (illustrative — match the real base class and helpers in the file you found):

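/*
 * SPDX-License-Identifier: Apache-2.0
 * ... (keep the existing header in the file you are editing)
 */
package org.opensearch.search.aggregations.bucket.terms;

// Imports this method relies on; the real file already declares most of them.
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.tests.index.RandomIndexWriter; // Lucene 9+ test-framework package
import org.opensearch.index.mapper.KeywordFieldMapper;
import org.opensearch.index.mapper.MappedFieldType;
import org.opensearch.search.aggregations.AggregatorTestCase;

public class TermsAggregatorTests extends AggregatorTestCase {

    public void testIssueNNNNMissingWithMinDocCountZero() throws Exception {
        // Arrange: a field present in the mapping but in zero matching docs,
        // with `missing` set and min_doc_count = 0. These are the trigger conditions.
        MappedFieldType fieldType = new KeywordFieldMapper.KeywordFieldType("absent_field");
        TermsAggregationBuilder agg = new TermsAggregationBuilder("by_missing")
            .field("absent_field")
            .missing("N/A")
            .minDocCount(0);

        try (Directory directory = newDirectory();
             RandomIndexWriter w = new RandomIndexWriter(random(), directory)) {
            w.addDocument(List.of()); // a doc with no value for the field
            try (IndexReader reader = w.getReader()) {
                IndexSearcher searcher = newIndexSearcher(reader);
                StringTerms result =
                    searchAndReduce(searcher, new MatchAllDocsQuery(), agg, fieldType);

                // Assert the CORRECT behavior. This FAILS on main today.
                assertEquals(1, result.getBuckets().size());
                assertEquals("N/A", result.getBuckets().get(0).getKeyAsString());
            }
        }
    }
}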

Run only that test:

./gradlew :server:test \
  --tests "org.opensearch.search.aggregations.bucket.terms.TermsAggregatorTests.testIssueNNNNMissingWithMinDocCountZero"

It should fail. That red bar is your Step 2 success criterion. If it passes, your assertion encodes the current (wrong) behavior, not the correct one — fix the assertion, not the code (yet).

An integration-level reproducer (`OpenSearchIntegTestCase`)

When the bug needs a real cluster — multiple shards, allocation, recovery, cluster-state propagation — use the in-JVM multi-node harness. These live under server/src/internalClusterTest/java/... and run via the :server:internalClusterTest task.

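import static org.hamcrest.Matchers.equalTo;

import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
import org.opensearch.cluster.health.ClusterHealthStatus;
import org.opensearch.cluster.metadata.IndexMetadata;
import org.opensearch.common.settings.Settings;
import org.opensearch.test.OpenSearchIntegTestCase;

// numDataNodes = 0: this test starts every node itself, so the harness should not pre-start any.
@OpenSearchIntegTestCase.ClusterScope(scope = OpenSearchIntegTestCase.Scope.TEST, numDataNodes = 0)
public class SomeBehaviorIT extends OpenSearchIntegTestCase {

    public void testIssueNNNNRepro() throws Exception {
        // One dedicated cluster manager plus two data nodes.
        internalCluster().startClusterManagerOnlyNode();
        internalCluster().startDataNodes(2);

        createIndex("repro", Settings.builder()
            .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1)
            .build());
        ensureGreen("repro");

        // ... drive the exact trigger conditions, then:
        assertBusy(() -> {
            ClusterHealthResponse health = client().admin().cluster()
                .prepareHealth("repro").get();
            assertThat(health.getStatus(), equalTo(ClusterHealthStatus.GREEN));
        });
    }
}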

./gradlew :server:internalClusterTest --tests "org.opensearch.*.SomeBehaviorIT.testIssueNNNNRepro"

Warning: Never use Thread.sleep to wait for cluster state to settle. Use assertBusy(...), ensureGreen(...), ensureStableCluster(...), or a ClusterStateListener. A sleep-based test is non-deterministic by construction, and reviewers will (correctly) block it. See Step 6.
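
Why polling beats a fixed sleep can be seen in a small standalone sketch. This is not the OpenSearch assertBusy implementation; `PollingAssert` and `awaitAssertion` are hypothetical names, and the pause lengths and the 5-second budget are arbitrary choices for the example. The short pause between attempts only spaces out the checks. It is the check itself, not the pause, that decides when the test moves on.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical poll-until-it-passes helper; not the OpenSearch implementation. */
final class PollingAssert {

    /** Re-runs the check with growing pauses until it passes or the time budget is spent. */
    static void awaitAssertion(Runnable check, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long pauseMillis = 1;
        while (true) {
            try {
                check.run();
                return; // condition holds: continue immediately, no fixed wait
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // budget spent: surface the last real failure
                }
            }
            try {
                Thread.sleep(pauseMillis); // bounded pause between attempts
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            pauseMillis = Math.min(pauseMillis * 2, 100);
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Simulated condition: becomes true on the third check.
        awaitAssertion(() -> {
            if (polls.incrementAndGet() < 3) {
                throw new AssertionError("cluster not green yet");
            }
        }, 5, TimeUnit.SECONDS);
        System.out.println("passed after " + polls.get() + " checks");
        // prints "passed after 3 checks"
    }
}
```

A fixed sleep has two failure modes this shape avoids: it wastes time when the condition is already true, and it fails spuriously when a slow CI machine needs longer than the number someone guessed.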

A REST-YAML reproducer

If the bug is in the REST contract (status code, response shape, error message), the most faithful reproducer is a REST-YAML test under rest-api-spec (run via :rest-api-spec:yamlRestTest):

---
"terms agg with missing and min_doc_count zero":
  - do:
      indices.create:
        index: repro
        body:
          settings:
            number_of_shards: 1
            number_of_replicas: 0
  - do:
      index:
        index: repro
        refresh: true
        body: { value: 10 }
  - do:
      search:
        index: repro
        body:
          size: 0
          aggs:
            by_missing:
              terms: { field: absent_field, missing: "N/A", min_doc_count: 0 }
  - match: { aggregations.by_missing.buckets.0.key: "N/A" }
  - match: { aggregations.by_missing.buckets.0.doc_count: 1 }

Pin Versions and the Random Seed

OpenSearch tests are randomized (RandomizedRunner). A test can pass on one seed and fail on another. For a reproducer, pin the seed so the failure is exact:

# Reproduce a specific failure deterministically.
./gradlew :server:test --tests "...TermsAggregatorTests.testIssueNNNNMissingWithMinDocCountZero" \
  -Dtests.seed=DEADBEEFDEADBEEF
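
Why a pinned seed makes the run exact can be shown with plain `java.util.Random`. This is a simplified stand-in, assuming nothing about how the randomized test framework derives its values internally; `SeededRun` and `randomizedInputs` are hypothetical names, and the "shard count" and "doc count" are invented inputs.

```java
import java.util.Arrays;
import java.util.Random;

/** Simplified stand-in for seed-driven test randomization; not the real framework. */
final class SeededRun {

    /** Derives two pseudo-random test inputs (shard count, doc count) from one seed. */
    static int[] randomizedInputs(long seed) {
        Random random = new Random(seed);
        int shards = 1 + random.nextInt(5); // 1..5
        int docs = random.nextInt(100);     // 0..99
        return new int[] { shards, docs };
    }

    public static void main(String[] args) {
        long seed = 0xDEADBEEFL;
        int[] first = randomizedInputs(seed);
        int[] second = randomizedInputs(seed);
        // Same seed, same inputs: the failing run can be replayed exactly.
        System.out.println(Arrays.equals(first, second)); // prints "true"
    }
}
```

Every randomized choice in a run flows from the seed, so recording the seed is enough to recreate the inputs that produced the failure.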

When a randomized existing test already exposes the bug intermittently (a flaky-test), the failure log prints the exact reproduction line — copy it verbatim:

REPRODUCE WITH: ./gradlew ':server:test' --tests "...SomeTests.testThing" \
  -Dtests.seed=ABC123 -Dtests.locale=fr-FR -Dtests.timezone=America/Sao_Paulo

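Pin the branch/tag too. State which ref you reproduced on. For the Capstone, reproduce on main. If the issue claims a regression from a released version, also reproduce on the relevant tag to confirm it was correct there. The new test has to exist on whichever ref you run it against, so keep it as an uncommitted change that git checkout carries across (or cherry-pick it onto a scratch branch at the tag):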

git fetch origin
git checkout main && ./gradlew :server:test --tests "...testIssueNNNNRepro"   # fails
git checkout 2.11.0 && ./gradlew :server:test --tests "...testIssueNNNNRepro"  # passes? -> regression
git checkout main   # back to work

That last experiment is the seed of Step 4's git bisect. If main fails and 2.11.0 passes, you have a regression and a known-good/known-bad pair to bisect between.

Build the Reproduction Report

Save capstone-work/repro.md with everything a reviewer (or future you) needs to reproduce in 60 seconds:

# Reproduction: #NNNN

## Symptom
<one sentence — the observable wrong behavior>

## Trigger conditions
- <condition 1>
- <condition 2>

## Refs
- Fails on: `main` (commit <sha>)
- Passes on: `2.11.0` (regression) | n/a (never worked)
- Seed (if randomized): <seed>

## Manual repro (curl)
<the exact curl sequence + observed vs. expected output>

## Automated repro (the failing test)
- File: server/src/test/.../TermsAggregatorTests.java
- Method: testIssueNNNNMissingWithMinDocCountZero
- Command: ./gradlew :server:test --tests "...testIssueNNNNMissingWithMinDocCountZero"
- Result on main: FAIL (expected 1 bucket, got 0)

Deliverable for Step 2

A failing automated test (unit, integration, or REST-YAML) that is red on main.
The exact command to run just that test, recorded.
Symptom and trigger conditions written down separately.
The ref(s) you reproduced on, pinned (and the seed, if randomized).
capstone-work/repro.md complete.

Validation / Self-check

Before advancing to Step 3:

Your reproducer fails on main, deterministically, every run.
You can run it in seconds with one ./gradlew ... --tests command.
You have listed the trigger conditions and verified at least one of them by making the symptom disappear when you remove it.
You chose the lowest-cost harness that reliably reproduces (unit over integration over manual).
If randomized, you have a pinned seed; if a regression, you have a known-good ref.
capstone-work/repro.md is written and someone else could follow it.
You did not use Thread.sleep anywhere in the reproducer.

Then go to Step 3: Execution Path Analysis.

Step 3: Execution Path Analysis

You have a reproducer. Now you have to understand why it fails, and that means tracing the exact path a request takes from the wire to the line where the wrong thing happens. Not a sketch. Not "it goes through the search service somewhere." The actual chain of methods, with file and (approximate) line citations at every layer, that you could read aloud and a maintainer would nod along to.

This step produces three artifacts:

An annotated execution path — a list of the methods the request passes through, top to bottom, each with a path/to/File.java citation.
A path diagram (mermaid) showing the same path visually, with the bug site annotated.
A confirmed observation point — the exact line where you can set a breakpoint or log statement and watch the wrong value appear.

The rule that governs this step: you must be able to point at one line and say "the value is correct above this line and wrong below it." Until you can, you are still guessing.

The Four Layers

Every OpenSearch request that reaches a bug crosses some subset of these. Know which ones your bug lives in before you start grepping.

Layer	Entry class	What it does	Where it lives
REST	`RestController` → a `RestHandler` (`BaseRestHandler`)	Parses HTTP, builds a typed request, calls `NodeClient.execute(...)`	`server/src/main/java/org/opensearch/rest/`
Transport / Action	`TransportAction` (`HandledTransportAction`, `TransportClusterManagerNodeAction`, `TransportReplicationAction`, …)	Routes to the right node, enforces blocks, coordinates fan-out	`server/src/main/java/org/opensearch/action/`
Shard / Engine	`IndexShard`, `InternalEngine`, `SearchService`	The actual indexing/search/replication work against Lucene	`server/src/main/java/org/opensearch/index/`, `.../search/`
Coordination / Cluster	`Coordinator`, `MasterService`, `ClusterApplierService`, `AllocationService`	Cluster-state computation, publication, application, shard allocation	`server/src/main/java/org/opensearch/cluster/`

Note: The "cluster manager" node (formerly called "master") owns the coordination layer. OpenSearch renamed master → cluster_manager for inclusive language, but you will still see Master-prefixed class names in the code (on 2.x, TransportMasterNodeAction is kept as a deprecated subclass of TransportClusterManagerNodeAction) and old setting aliases like cluster.initial_master_nodes. When you read coordination code, both vocabularies are live.

A search-aggregation bug lives in REST + Action + Shard (search side). A listener-race lives in Coordination. A seqno/version edge lives in Shard/Engine. Pin your layer first; it tells you where to point grep.

Find the Entry Point with `grep`

Start at the top: which RestHandler services the request in your reproducer? The handler advertises its route in routes(). Grep for the path segment.

cd ~/OpenSearch    # your clone of opensearch-project/OpenSearch

# Which REST handler owns _search?
grep -rn "_search" server/src/main/java/org/opensearch/rest/action/search/
# -> RestSearchAction.java registers GET/POST {index}/_search

# Generic: find the handler class for any action
grep -rln "extends BaseRestHandler" server/src/main/java/org/opensearch/rest/action \
  | xargs grep -l "routes()"

RestSearchAction.prepareRequest(...) is the method that parses the request and hands off. Read it. It ends in a call like client.execute(SearchAction.INSTANCE, searchRequest, ...). That constant, SearchAction.INSTANCE, is the bridge to the transport layer.

Now find the transport action wired to that ActionType:

# The ActionType -> TransportAction wiring happens in ActionModule.
grep -rn "SearchAction.INSTANCE" server/src/main/java/org/opensearch/action/ActionModule.java
grep -rn "class TransportSearchAction" server/src/main/java/org/opensearch/action/search/

Repeat this descent for your specific bug. For a write-path bug:

grep -rn "class RestBulkAction"          server/src/main/java/org/opensearch/rest/action/document/
grep -rn "class TransportShardBulkAction" server/src/main/java/org/opensearch/action/bulk/
grep -rn "applyIndexOperationOnPrimary"   server/src/main/java/org/opensearch/index/shard/IndexShard.java
grep -rn "private IndexResult indexIntoLucene" server/src/main/java/org/opensearch/index/engine/InternalEngine.java

For a coordination bug:

grep -rn "class ClusterApplierService" server/src/main/java/org/opensearch/cluster/service/
grep -rn "callClusterStateListeners\|callClusterStateAppliers" \
  server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java

Tip: Class names are stable across branches but line numbers are not. Everywhere this curriculum cites a line, treat it as "run the grep on your checkout." Pin your work to a commit (git rev-parse HEAD) and record it, so your line citations are reproducible for the reviewer reading your Step 3 doc.

Use the IDE Call Hierarchy

grep gets you the players; the IDE call hierarchy gets you the order. Import the Gradle project into IntelliJ IDEA (the DEVELOPER_GUIDE.md documents this; ./gradlew idea is the legacy task but modern IntelliJ imports the Gradle build directly).

Once imported, two moves do most of the work:

Call hierarchy (callers): put the cursor on the buggy method and press Ctrl+Alt+H (Ctrl+Option+H on macOS with the default keymap). This shows who calls the method. Walk up until you reach a TransportAction or RestHandler. That is your path, bottom-up.
Call hierarchy (callees) / Go to implementation: Ctrl+Alt+B (Cmd+Option+B on macOS) on an interface method (e.g. ClusterStateListener.clusterChanged) shows every implementation. Polymorphism hides the real target; this reveals it. For serialization, StreamInput.readNamedWriteable resolves through NamedWriteableRegistry at runtime, so the call hierarchy will not show the concrete Writeable; you find that by grepping the registry wiring instead.
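
The reason a static call hierarchy cannot see through a name-keyed registry is easier to appreciate in miniature. The sketch below is a deliberately simplified, hypothetical model: `NamedReaderRegistry`, `Payload`, `TermsPayload`, and `RangePayload` are invented for illustration and do not mirror the real NamedWriteableRegistry API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Simplified, hypothetical model of name-keyed deserialization. */
final class NamedReaderRegistry {

    interface Payload {
        String describe();
    }

    static final class TermsPayload implements Payload {
        private final String body;
        TermsPayload(String body) { this.body = body; }
        public String describe() { return "terms:" + body; }
    }

    static final class RangePayload implements Payload {
        private final String body;
        RangePayload(String body) { this.body = body; }
        public String describe() { return "range:" + body; }
    }

    private final Map<String, Function<String, Payload>> readers = new HashMap<>();

    /** Wiring step: this is the part you find by grepping, not by call hierarchy. */
    void register(String name, Function<String, Payload> reader) {
        if (readers.putIfAbsent(name, reader) != null) {
            throw new IllegalArgumentException("duplicate reader: " + name);
        }
    }

    /** The concrete class is chosen by a name that arrives with the data, at runtime. */
    Payload read(String name, String body) {
        Function<String, Payload> reader = readers.get(name);
        if (reader == null) {
            throw new IllegalArgumentException("unknown name: " + name);
        }
        return reader.apply(body);
    }

    public static void main(String[] args) {
        NamedReaderRegistry registry = new NamedReaderRegistry();
        registry.register("terms", TermsPayload::new);
        registry.register("range", RangePayload::new);
        System.out.println(registry.read("terms", "abc").describe()); // prints "terms:abc"
    }
}
```

Nothing in `read` mentions `TermsPayload`, so an IDE asked "what does read call?" has no static edge to follow. The link lives only in the `register` calls, which is why grepping the registration site is the reliable way to find the concrete class.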

Write the path down as you walk it. The deliverable is a numbered list:

1. RestSearchAction.prepareRequest          rest/action/search/RestSearchAction.java
2. NodeClient.execute                        client/node/NodeClient.java
3. TransportSearchAction.doExecute           action/search/TransportSearchAction.java
4. AbstractSearchAsyncAction.executePhase    action/search/AbstractSearchAsyncAction.java
5. SearchService.executeQueryPhase           search/SearchService.java
6. QueryPhase.execute                        search/query/QueryPhase.java
7. <bug fires here>                          ...

Confirm the Path with a Debugger

A path you reasoned about is a hypothesis. A path you stepped through is a fact. Two ways to attach a debugger to OpenSearch running from source.

Option A — `./gradlew run --debug-jvm`

The single-node run task can launch suspended, waiting for a debugger:

./gradlew run --debug-jvm
# Gradle prints: "Listening for transport dt_socket at address: 5005"
# The JVM is SUSPENDED until you attach.

In IntelliJ: Run → Edit Configurations → + → Remote JVM Debug, host localhost, port 5005, then Debug. The node resumes. Set a breakpoint on the line your path analysis identified, then fire the reproducer curl from another shell. Execution stops on your breakpoint.

Option B — remote-debug a running test

If your reproducer is a JUnit test (it should be, from Step 2), debug that instead — it is faster and more isolated than a full node:

./gradlew :server:test \
  --tests "org.opensearch.search.aggregations.bucket.terms.TermsAggregatorTests.testIssueNNNNRepro" \
  --debug-jvm
# Gradle suspends the test JVM on port 5005; attach the same Remote JVM Debug config.

Either way, once stopped on the breakpoint, inspect the frame:

Evaluate the value the issue is about. For the aggregation bug: result.getBuckets() — is it empty here? Step out one frame at a time (Shift+F8) and re-evaluate. The frame where the value flips from correct to wrong is your defect site.
Watch the call stack. The "Frames" panel is your execution path, authoritatively. Screenshot it; paste it into your Step 3 doc.
Conditional breakpoints keep you out of the hundreds of unrelated calls. Right-click the breakpoint → Condition, e.g. minDocCount == 0 or event.source().equals("..."). Only your trigger condition stops execution.

Warning: OpenSearch tests run under the RandomizedRunner with a security manager and randomized seeds. When debugging a test, pin the seed (-Dtests.seed=... from Step 2) so the run you debug is the run that failed.

TRACE Logging Without a Debugger

Sometimes a breakpoint is the wrong tool — the bug is timing-dependent (a race), or it only manifests in a multi-node integration test where stepping freezes the cluster and changes the behavior. Then you log.

OpenSearch logging is per-logger, configured by package, and changeable at runtime through cluster settings — no restart:

# Turn the relevant package up to TRACE on a running cluster.
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "logger.org.opensearch.cluster.service": "TRACE",
    "logger.org.opensearch.cluster.coordination": "DEBUG",
    "logger.org.opensearch.index.engine": "TRACE"
  }
}'

# ... fire the reproducer ...

# Reset when done (null clears the override).
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "persistent": { "logger.org.opensearch.cluster.service": null,
                  "logger.org.opensearch.cluster.coordination": null,
                  "logger.org.opensearch.index.engine": null } }'

For an integration test, set the same loggers with an annotation on the test class so the output lands in the test logs:

@TestLogging(value = "org.opensearch.cluster.service:TRACE", reason = "trace listener ordering")
public class SomeBehaviorIT extends OpenSearchIntegTestCase { ... }

The pattern: turn a layer to TRACE, run the repro, read the log top to bottom, and find the line where the logged state stops matching what you expect. That line's logger tells you the class; grep the message string to find the exact source line:

grep -rn "applying cluster state version" server/src/main/java/org/opensearch/cluster/service/

Now you have an observation point without a debugger, which is often the only approach that works for a race, because a log line perturbs timing far less than a suspended thread does.
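When the expectation is simple, the "read top to bottom and find the first line that stops matching" step can be mechanized. The following is a minimal sketch, assuming a hypothetical log format in which each relevant line contains `version [N]` and a hypothetical invariant that versions increase by exactly one. The class name, the message shape, and the invariant are illustrative assumptions, not OpenSearch's actual log output or contract.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstDivergence {
    // Hypothetical message shape; adapt the pattern to the real log line you grepped for.
    private static final Pattern VERSION = Pattern.compile("version \\[(\\d+)\\]");

    /** Returns the 0-based index of the first line whose version is not previous + 1, or -1. */
    static int firstGap(List<String> lines) {
        long previous = -1; // sentinel: no version seen yet
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = VERSION.matcher(lines.get(i));
            if (!m.find()) {
                continue; // unrelated log line, skip it
            }
            long current = Long.parseLong(m.group(1));
            if (previous != -1 && current != previous + 1) {
                return i; // first line where the assumed invariant breaks
            }
            previous = current;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "[TRACE] applying version [41]",
            "[DEBUG] unrelated line",
            "[TRACE] applying version [42]",
            "[TRACE] applying version [44]"
        );
        System.out.println("first suspicious line index: " + firstGap(log));
    }
}
```

The useful part is the habit it encodes, not the tool: state the expectation as a predicate before you read the log, so the first violating line is unambiguous.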

The Annotated Path Diagram

Pull it together into a diagram. This goes straight into your Step 4 root-cause doc and your eventual PR description. Annotate the node where the bug fires.

flowchart TD
    A["HTTP POST {index}/_search<br/>RestController"] --> B["RestSearchAction.prepareRequest<br/>rest/action/search/RestSearchAction.java"]
    B --> C["NodeClient.execute(SearchAction.INSTANCE, ...)"]
    C --> D["TransportSearchAction.doExecute<br/>action/search/TransportSearchAction.java"]
    D --> E["coordinating node fans out to shards<br/>AbstractSearchAsyncAction"]
    E --> F["per-shard: SearchService.executeQueryPhase<br/>search/SearchService.java"]
    F --> G["QueryPhase.execute<br/>+ AggregatorFactory -> Aggregator"]
    G --> H["coordinating node reduce<br/>SearchPhaseController + InternalAggregation.reduce"]
    H --> X["BUG: InternalAggregation.reduce drops the<br/>min_doc_count=0 'missing' bucket when<br/>no shard produced it"]

    style X fill:#ffd6d6,stroke:#c0392b,stroke-width:2px

For a write-path / engine bug, your diagram looks like:

flowchart TD
    A["POST {index}/_bulk<br/>RestBulkAction"] --> B["TransportBulkAction"]
    B --> C["TransportShardBulkAction<br/>action/bulk/TransportShardBulkAction.java"]
    C --> D["IndexShard.applyIndexOperationOnPrimary<br/>index/shard/IndexShard.java"]
    D --> E["InternalEngine.index<br/>index/engine/InternalEngine.java"]
    E --> F["versionMap / LocalCheckpointTracker<br/>seqno assignment"]
    F --> G["Lucene IndexWriter.addDocument/updateDocument<br/>+ Translog.add"]
    G --> R["TransportReplicationAction replicates to replica"]
    F --> X["BUG: seqno/version edge under<br/>concurrent update at the same _id"]

    style X fill:#ffd6d6,stroke:#c0392b,stroke-width:2px

Build the diagram that matches your bug. The non-negotiable parts: every box names a real class, method, or file that a maintainer can open, and exactly one box is styled as the bug site.

Build the Execution-Path Report

Save capstone-work/execution-path.md. A reviewer should be able to open every file at every line and follow the logic without asking you a question.

# Execution path: #NNNN

## Checkout
- Branch/commit traced on: `main` @ <sha from `git rev-parse HEAD`>

## Layers crossed
REST -> Action -> Shard (search/reduce side)

## Annotated path (top to bottom)
1. RestSearchAction.prepareRequest        rest/action/search/RestSearchAction.java:~110
2. NodeClient.execute                      client/node/NodeClient.java:~80
3. TransportSearchAction.doExecute         action/search/TransportSearchAction.java:~280
4. SearchService.executeQueryPhase         search/SearchService.java:~520
5. SearchPhaseController reduce            action/search/SearchPhaseController.java:~430
6. InternalAggregation.reduce  <- BUG      search/aggregations/InternalAggregation.java:~NN

## Observation point
- File/line: search/aggregations/.../InternalTerms.java reduce(...)
- Method to break on: doReduce(...)
- What is correct above this line: per-shard buckets include the 'missing' key
- What is wrong below this line: the merged bucket list omits it when min_doc_count=0

## How confirmed
- [x] Breakpoint hit on the cited line (call-stack screenshot attached)
- [x] TRACE on org.opensearch.search.aggregations showed the bucket present per-shard

Deliverable for Step 3

A numbered, file-cited execution path from REST handler to bug site.
A mermaid (or text-arrow) diagram with exactly one annotated bug node.
A confirmed observation point: the one line where the value flips from correct to wrong, verified by a breakpoint or by TRACE logging.
The traced commit SHA recorded so line citations are reproducible.
capstone-work/execution-path.md complete.

Validation / Self-check

Before advancing to Step 4:

You can name every layer the request crosses (REST / Action / Shard / Coord) and justify which ones it does not cross.
Every method in your path has a real org.opensearch.* file citation; none are hand-waved as "then it goes through the engine somewhere."
You confirmed the path with a debugger (--debug-jvm) or TRACE logging, not by reading alone.
You can point at exactly one line and say "correct above, wrong below."
Your line numbers are pinned to a recorded commit SHA.
The diagram has exactly one styled bug node and would survive a maintainer opening each cited file.
You did not change any production code in this step — observation only.

Then go to Step 4: Root Cause Identification.

Step 4: Root Cause Identification

This step is mostly thinking. The tools are five-whys, git log -S/-G, git blame, and git bisect. The output is a 200–500 word root-cause document and a tested hypothesis. Unlike in a JIRA-based project, your archaeology ends at a GitHub pull request discussion rather than a JIRA comment thread; read it the same way.

Five Whys, Applied to an OpenSearch Bug

Pick the flavor that matches your Capstone bug. Three are worked below; do the one that fits.

Worked example A: a `ClusterApplierService` listener race

Symptom from Step 2: after a fast index-create-then-delete, a component that holds a per-index resource (a cache, a directory handle) occasionally leaks it, because its ClusterStateListener observed the "created" state but not the "deleted" state.

Why 1: Why did the listener miss the delete?

Because ClusterApplierService applies committed cluster states in order on the clusterApplierService#updateTask thread, and the listener allocates the resource in response to the create delta. The TRACE log from Step 3 shows the listener ran for the create state (version N) but never ran for the delete state (N+1), which was batched away.

Why 2: Why was the delete delta not delivered to this listener?

Because the listener computes its delta from ClusterChangedEvent.indicesDeleted(), and that delta only describes the transition between states that were actually applied locally. When the create (N) and the delete (N+1) are coalesced, the transition this listener sees can skip an intermediate state, so whether the delete ever appears in indicesDeleted() depends on which states were applied and which were skipped.

Why 3: Why is the resource allocated on clusterChanged rather than torn down deterministically?

Because the component couples resource lifecycle to observed state deltas instead of to a reconciliation against the current Metadata. Delta-driven listeners assume they see every intermediate state; they do not — the applier is allowed to skip states it never had to apply locally.

Why 4: Why does the code assume every intermediate state is observed?

Because when the listener was written, ClusterApplierService applied states one-by-one with no coalescing for this path, so "see every delta" was true. A later change began allowing the local applier to advance past states it did not need, breaking the invariant the listener silently relied on.

Why 5: Why was the invariant not asserted or documented?

Because the contract of ClusterStateListener — "you may not see every intermediate state; reconcile, don't accumulate" — lives in tribal knowledge and a comment in ClusterApplier, not in a type or assertion. The listener author reasonably read the local behavior, not the contract.

Root cause statement: The component drives a resource lifecycle off ClusterChangedEvent deltas, which is unsound because ClusterApplierService may advance the local applied state past intermediate cluster states without delivering each delta; under a fast create→delete the teardown delta is never observed and the resource leaks. The fix is to reconcile against the current Metadata in clusterChanged (compute what should exist now and close anything that should not), rather than to react to indicesDeleted().
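The difference between accumulating from deltas and reconciling against current state can be shown with a toy model. This is a sketch under stated assumptions: `Event` is a hypothetical, simplified stand-in for a change notification (it is not `ClusterChangedEvent`), index names are plain strings, and the scenario simply assumes, as the root cause statement does, that the delete delta is never delivered to the listener.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ReconcileVsDelta {
    /** Hypothetical, simplified change notification: the delta plus the resulting state. */
    static final class Event {
        final Set<String> created;
        final Set<String> deleted;
        final Set<String> current;

        Event(Set<String> created, Set<String> deleted, Set<String> current) {
            this.created = created;
            this.deleted = deleted;
            this.current = current;
        }
    }

    /** Accumulates from deltas: correct only if every delta is delivered. */
    static Set<String> deltaDriven(List<Event> delivered) {
        Set<String> open = new HashSet<>();
        for (Event e : delivered) {
            open.addAll(e.created);
            open.removeAll(e.deleted);
        }
        return open;
    }

    /** Reconciles against the current state on every event: tolerates skipped deltas. */
    static Set<String> reconciling(List<Event> delivered) {
        Set<String> open = new HashSet<>();
        for (Event e : delivered) {
            open.retainAll(e.current); // close anything that should no longer exist
            open.addAll(e.current);    // open anything that should exist now
        }
        return open;
    }

    /** The create of "logs" is delivered, its delete is not, then an unrelated event arrives. */
    static List<Event> scenario() {
        Event create = new Event(Set.of("logs"), Set.of(), Set.of("logs"));
        Event later = new Event(Set.of("metrics"), Set.of(), Set.of("metrics"));
        return List.of(create, later);
    }

    public static void main(String[] args) {
        System.out.println("delta-driven holds: " + deltaDriven(scenario()));
        System.out.println("reconciling holds:  " + reconciling(scenario()));
    }
}
```

In this model the delta-driven tracker still holds "logs" after the index is gone, while the reconciling tracker holds only what the current state says should exist. Reconciliation is also idempotent, which makes it safe to run on every event.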

Worked example B: a seqno/version edge in `InternalEngine`

Symptom: under two concurrent updates to the same _id, a stale version occasionally wins, or a VersionConflictEngineException is thrown when it should not be.

Why 1: Two index operations for the same _id interleave in InternalEngine.index(...).
Why 2: Both read the live version from the LiveVersionMap before either writes its own, so they resolve against the same base version.
Why 3: The acquireLock(uid) window does not cover both the version resolution and the seqno assignment for one of the paths (e.g. an optimization that resolves version outside the per-uid lock).
Why 4: That optimization was added to reduce lock contention on the hot indexing path, under the assumption that the version map read was idempotent.
Why 5: It is idempotent in isolation but not under the specific ordering where the LocalCheckpointTracker advances between the two reads.

Root cause shape: a lock window narrower than the read-modify-write it must protect, introduced as a contention optimization, valid until a second writer arrives at exactly the wrong checkpoint boundary.

Worked example C: an aggregation reduce error merging empty shards

Symptom (from the Step 2 / Step 3 examples): a terms aggregation with min_doc_count: 0 and a missing value drops the synthetic bucket when one or more shards contribute zero matching documents.

Why 1: The merged bucket list from InternalTerms.reduce(...) omits the missing bucket.
Why 2: reduce only emits buckets it sees from at least one shard, but a shard with zero docs returns no buckets at all for that field.
Why 3: The min_doc_count: 0 "produce empty buckets" responsibility lives on the shard-local aggregator, which short-circuits when the field has no values in its segment.
Why 4: The short-circuit was a performance optimization for the common case (min_doc_count >= 1), where empty buckets are discarded anyway.
Why 5: The optimization did not special-case min_doc_count == 0 combined with missing, so the synthetic bucket that should exist for every shard is produced by no shard, and reduce has nothing to merge.

Root cause shape: a shard-local fast path that is correct for the dominant parameterization but wrong for one parameter combination the reduce step then cannot recover from.
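The five whys for example C can be replayed in a toy model. This is a sketch, not aggregator code: `collectShard`, `reduce`, and `MISSING_KEY` are hypothetical names, a shard is reduced to a list of field values, and buckets are a plain map from key to doc count. The `guarded` flag toggles an extra `minDocCount != 0` clause on the fast path so both behaviors can be compared.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class EmptyShardReduce {
    static final String MISSING_KEY = "N/A"; // hypothetical `missing` value

    /** Shard-local step. `guarded` adds the min_doc_count clause to the empty fast path. */
    static Map<String, Long> collectShard(List<String> values, long minDocCount, boolean guarded) {
        boolean fastPath = guarded ? values.isEmpty() && minDocCount != 0 : values.isEmpty();
        Map<String, Long> buckets = new TreeMap<>();
        if (fastPath) {
            return buckets; // nothing emitted for this shard
        }
        for (String v : values) {
            buckets.merge(v, 1L, Long::sum);
        }
        if (minDocCount == 0) {
            buckets.putIfAbsent(MISSING_KEY, 0L); // synthetic empty bucket
        }
        return buckets;
    }

    /** Coordinator step: it can only merge buckets that some shard emitted. */
    static Map<String, Long> reduce(List<Map<String, Long>> shards, long minDocCount) {
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> shard : shards) {
            shard.forEach((key, count) -> merged.merge(key, count, Long::sum));
        }
        merged.values().removeIf(count -> count < minDocCount);
        return merged;
    }

    /** Two shards with no values for the field, queried with min_doc_count = 0. */
    static Map<String, Long> run(boolean guarded) {
        List<String> empty = List.of();
        return reduce(List.of(collectShard(empty, 0, guarded), collectShard(empty, 0, guarded)), 0);
    }

    public static void main(String[] args) {
        System.out.println("unguarded fast path: " + run(false));
        System.out.println("guarded fast path:   " + run(true));
    }
}
```

With the unguarded fast path the merged result is empty; with the guard it contains the single zero-count bucket. The model also shows why the reduce step cannot repair this on its own: it never receives a bucket to merge.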

Any of these three is "a root cause." The fix direction is now obvious-ish. You can argue between options — but you know what each one changes.

Git Archaeology

Once you have a candidate cause, ask: when did this break? And why did the person who wrote it think it was correct?

`git log -S` / `-G` (the pickaxe)

Find every commit that added or removed a specific string, or whose diff matches a regex:

cd ~/OpenSearch

# Every commit that changed how many times this expression appears (-S = pickaxe).
git log -p -S "indicesDeleted" \
  -- server/src/main/java/org/opensearch/cluster/ClusterChangedEvent.java

# -G matches commits whose diff text matches a regex (broader than -S).
git log -p -G "min_doc_count|minDocCount" \
  -- server/src/main/java/org/opensearch/search/aggregations/bucket/terms/

# Follow a file through renames (the fork renamed many files).
git log --follow --oneline \
  -- server/src/main/java/org/opensearch/index/engine/InternalEngine.java

-S matches commits where the count of that string changed — added or removed. It is the single most powerful git command in this chapter. Learn it.
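The counting rule is easy to misread, so here is a small model of it. This sketch only models the decision on a single file's before and after text; `PickaxeModel`, `count`, and `pickaxeMatches` are hypothetical names, and none of this is git's implementation.

```java
final class PickaxeModel {
    /** Counts non-overlapping occurrences of needle in text. */
    static int count(String text, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int n = 0;
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + needle.length())) {
            n++;
        }
        return n;
    }

    /** Models the -S rule: a change matches when the occurrence count differs. */
    static boolean pickaxeMatches(String before, String after, String needle) {
        return count(before, needle) != count(after, needle);
    }

    public static void main(String[] args) {
        // The line was edited, but the string appears once before and once after.
        System.out.println(pickaxeMatches("a.indicesDeleted();", "b.indicesDeleted();", "indicesDeleted"));
        // The string was introduced by this change.
        System.out.println(pickaxeMatches("", "event.indicesDeleted();", "indicesDeleted"));
    }
}
```

The first case is the one that surprises people: a commit that edits a line containing the string without changing how often it occurs is invisible to -S, which is when the broader -G is the right tool.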

`git blame -L <start>,<end>`

Once you know the file and lines (from Step 3), find the commit and author:

git blame -L 2740,2770 \
  server/src/main/java/org/opensearch/index/engine/InternalEngine.java

Output looks like:

a1b2c3d4 (Dev A 2022-08-14 09:34:18 +0000 2745)     try (Releasable ignored = versionMap.acquireLock(op.uid().bytes())) {
a1b2c3d4 (Dev A 2022-08-14 09:34:18 +0000 2746)         ...

Then read the commit and find its PR:

git show a1b2c3d4
git log -1 --format="%B" a1b2c3d4    # full message; OpenSearch commits embed "(#NNNN)"

OpenSearch squash-merges PRs, so the commit subject usually ends with the PR number: ... (#4567). That number is your portal:

# Open the introducing PR's discussion in the browser, or via gh:
gh pr view 4567 --repo opensearch-project/OpenSearch --comments
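If you are processing many commits (for example, the output of git log --format=%s), the trailing "(#NNNN)" convention can be extracted mechanically. A minimal sketch, assuming the PR number is the last parenthesized token of the subject; `PrNumber` and `fromSubject` are hypothetical helper names, and subjects that do not follow the convention simply yield an empty result.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrNumber {
    // Matches "(#1234)" only at the end of the subject line.
    private static final Pattern TRAILING_PR = Pattern.compile("\\(#(\\d+)\\)\\s*$");

    /** Returns the PR number embedded at the end of a squash-merge subject, if any. */
    static OptionalInt fromSubject(String subject) {
        Matcher m = TRAILING_PR.matcher(subject);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromSubject("Optimize version map locking on the indexing hot path (#4567)"));
        System.out.println(fromSubject("A subject without a PR reference"));
    }
}
```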

Read every comment on that PR. Often you will discover:

The change was made to fix a different bug (or as a perf optimization) and introduced your bug as collateral.
A reviewer flagged the exact concern you are now hitting ("does this still hold when the applier skips a state?") and it was deferred.
The fix you are considering was discussed and rejected for a reason you must now address.

`git bisect` Against Tags

If the bug is a regression — passed on 2.11.0, fails on main (you proved this in Step 2) — bisect finds the exact introducing commit. This is the highest-confidence signal in root-cause work.

cd ~/OpenSearch
git fetch origin --tags
git bisect start
git bisect bad main
git bisect good 2.11.0

# git checks out a midpoint. Build only what you need and run the repro:
./gradlew :server:test \
  --tests "org.opensearch.*.TestThing.testIssueNNNNRepro" -Dtests.seed=DEADBEEF -q

# If the test FAILS here, the bug exists at this commit:
git bisect bad
# If it PASSES here, the bug came later:
git bisect good

# Repeat. git narrows to one commit in log2(N) steps.

Once converged:

a1b2c3d4 is the first bad commit
commit a1b2c3d4
Author: Dev A <a@example.org>
Date:   Sun Aug 14 09:34:18 2022 +0000
    Optimize version map locking on the indexing hot path (#4567)

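Automate it once your repro returns a clean exit code. Two caveats. First, the reproducer test must exist at every commit bisect checks out: an untracked test file survives the checkouts, while a test committed only on your branch does not. Second, git bisect run treats any exit code from 1 to 127 other than 125 as bad, so a midpoint that merely fails to compile is misclassified unless your script exits with 125 to skip it:

git bisect run ./gradlew :server:test --tests "org.opensearch.*.testIssueNNNNRepro" -Dtests.seed=DEADBEEF -q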

Now you know the introducing PR (#4567), the author (a natural reviewer to request — @-mention them), and the exact diff to study.
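The log2(N) claim follows from bisect being a binary search over history. A minimal model, assuming commits are numbered in history order and that there is a single good-to-bad transition; `BisectModel`, `firstBad`, and the predicate are hypothetical names, and real history with merges is not linear like this.

```java
import java.util.function.IntPredicate;

final class BisectModel {
    /**
     * Commits are indexed in history order. Index `good` is known good, index `bad` is
     * known bad, and good < bad. Returns the index of the first bad commit.
     */
    static int firstBad(int good, int bad, IntPredicate isBad) {
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(mid)) {
                bad = mid;   // the regression is at mid or earlier
            } else {
                good = mid;  // the regression came after mid
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        int[] runs = { 0 }; // counts how many times the "test" had to be run
        int found = firstBad(0, 1024, commit -> {
            runs[0]++;
            return commit >= 700; // pretend commit 700 introduced the bug
        });
        System.out.println("first bad commit index " + found + " after " + runs[0] + " test runs");
    }
}
```

In this model 1024 candidate commits need 10 test runs, which is why a reproducer that runs in a minute or two makes bisecting between two release tags practical.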

Note: Bisecting across the Elasticsearch→OpenSearch fork boundary (pre-1.0) is rarely useful: the package rename (org.elasticsearch → org.opensearch) means the codebase won't build the same way. Bisect within OpenSearch's own history (1.x and 2.x release tags, and main). If the bug predates 1.0, say so in the doc and lean on git log -S and blame instead.

Writing the Root-Cause Statement

This document goes into your PR description and your write-up. 200–500 words. Use this template:

## Root cause: #NNNN

### Symptom
<one sentence — what the user sees>

### Trigger conditions
- <condition 1, e.g. fast create-then-delete of an index>
- <condition 2, e.g. the applier coalesces the two states locally>
- <condition 3 if any>

### Affected code
- server/src/main/java/org/opensearch/.../SomeComponent.java (the listener)
- server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java
  (where state application may advance past intermediate states)

### Mechanism
<three to five sentences explaining the actual defect. Use "because",
"as a result", "however". This is the part most people get wrong — they
describe the symptom again instead of the mechanism. The mechanism answers:
of the many ways this code could have been written, why does the current way
produce this wrong answer?>

### Introducing change
- PR #4567 (commit a1b2c3d4) added <X> under the assumption <Y>, which no
  longer holds because <Z>.
- A reviewer raised this on the PR (link to comment); resolution was deferred.

### Fix direction
Three options considered:

1. **<smallest change>.** Risk: <...>.
2. **<narrower-blast-radius change>.** Recommended.
3. **<cheapest / push-burden-elsewhere change>.** Rejected because <...>.

Recommended: option 2. See Step 5 for the diff.

Save as capstone-work/root-cause.md.

A note on the Mechanism section, because it is where most contributors fail the rubric: for the listener-race example, the symptom is "resource leaks." The mechanism is "the component accumulates lifecycle state from per-delta events, but the applier's contract permits skipping deltas, so a coalesced create→delete delivers no teardown delta and the accumulated state never gets corrected." Notice the mechanism explains the design assumption that broke, not the user-visible effect.

Validating the Hypothesis

A root cause is not validated until you have demonstrated it. Two ways.

1. Revert the introducing commit and re-run the repro

git checkout main
git revert --no-commit a1b2c3d4    # introducing commit from bisect
./gradlew :server:test --tests "org.opensearch.*.testIssueNNNNRepro" -q

If the test now PASSES (because you reverted the change that introduced the bug), your root cause is at least partially correct. If it still FAILS, the introducing commit is not the root cause — there is a deeper issue. Reset before you go further:

git reset --hard HEAD    # drops the uncommitted revert; local commits are untouched

2. A minimal one-line "patch" that confirms the mechanism

You are not writing the real fix yet. You are confirming the mechanism. For the aggregation example:

--- a/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregator.java
+++ b/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregator.java
@@
-        if (segmentHasNoValuesForField(field)) {
-            return;   // fast path: nothing to do
+        if (segmentHasNoValuesForField(field) && bucketCountThresholds.getMinDocCount() != 0) {
+            return;   // fast path only when empty buckets are discarded anyway
         }

Apply it, re-run the repro. If it passes, the mechanism is confirmed — even if this exact diff is not the final, conventionally-correct fix.

If the test still fails: your mechanism is wrong. Go back to the five-whys.

If the test now passes but breaks 14 other tests: your fix direction is too broad. Go back to "fix direction" and pick a narrower option. That signal is valuable — it is the cheap version of the feedback you would otherwise get in Step 7 validation or a reviewer's comment.

Validation / Self-check

Before advancing to Step 5:

capstone-work/root-cause.md exists, follows the template, is 200–500 words, and its Mechanism section explains the broken design assumption — not the symptom restated.
You can name the introducing commit (full SHA) and PR number.
You ran git bisect to convergence against real tags (or documented why bisect doesn't apply — bug predates 1.0, or existed since the file's first commit).
You ran a "revert introducing commit" experiment and saw the test go green (or documented why the revert doesn't apply).
You wrote a one-line throwaway "mechanism confirmation" patch and saw the test pass on it.
You read every comment on the introducing PR (gh pr view <N> --comments).
You can articulate three fix directions and explain in one sentence each why you rejected two.

Then go to Step 5: Implementation.

Step 5: Implementation

You have a proven root cause and a confirmed mechanism. Now write the real fix. Not the throwaway one-liner from Step 4 — the version a maintainer will merge. The difference between those two is almost entirely discipline: minimum diff, right conventions, thread-safety, backward compatibility, no scope creep.

The rule that governs this step: every line in your diff must be justifiable in one sentence. If you cannot say why a line changed without saying "while I was in there," delete it.

Weigh the Fix Directions

You named three fix directions in Step 4. Now choose one, on the record, with reasons. Use the aggregation min_doc_count: 0 bug as the worked example.

Direction	What it changes	Blast radius	Verdict
A — fix at reduce	Have `InternalTerms.reduce(...)` synthesize the missing bucket when `min_doc_count == 0` and no shard produced it	Touches the merge path that runs for every terms agg; must reconstruct a bucket reduce never saw	Rejected: reduce lacks the per-shard context to build a correct synthetic bucket; high risk of double-counting
B — fix at the shard-local aggregator	Don't take the empty-segment fast path when `min_doc_count == 0`; let the aggregator emit the synthetic bucket per shard, so reduce has something to merge	Narrow: one guard on one fast path, only when `min_doc_count == 0`	Chosen: corrects the bug at its source; reduce is unchanged; behavior for `min_doc_count >= 1` is byte-for-byte identical
C — document the limitation	Add a docs note that `min_doc_count: 0` + `missing` is unsupported on empty shards	Zero code	Rejected: it's a real bug with a clear correct answer; documenting around it pushes the cost onto every user

Direction B wins because it has the smallest blast radius that actually fixes the bug, and it leaves the hot reduce path untouched. Write this table into your PR description — reviewers want to see you considered and rejected the obvious alternatives, not that you found the first thing that worked.

Find the Real Fix Site

The throwaway patch told you roughly where. The real fix goes where the abstraction says it belongs, which may be a layer up or down from where you proved the mechanism. Re-grep with the chosen direction in mind:

# Where does the empty-segment fast path actually live for terms?
grep -rn "getMinDocCount\|minDocCount" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/

# Confirm the field that carries min_doc_count through the aggregator.
grep -rn "bucketCountThresholds" \
  server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregator.java

Read the surrounding method end to end before you touch it. Understand the fast path's original purpose so your guard preserves it for every case except the one you are fixing.

The SPDX Header and Spotless

Every OpenSearch source file carries this header. If you create a new file (usually you won't — prefer editing the existing class), it must start with:

/*
 * SPDX-License-Identifier: Apache-2.0
 *
 * The OpenSearch Contributors require contributions made to
 * this file be licensed under the Apache-2.0 license or a
 * compatible open source license.
 */

When editing an existing file, leave its header untouched — do not reformat it, do not update years. precommit enforces header presence; reformatting it is a needless diff line.

Formatting is mechanical and non-negotiable. Run Spotless before you ever look at the diff:

./gradlew spotlessApply        # auto-format the files you changed
./gradlew spotlessJavaCheck    # verify (this is what CI runs)

Warning: Never hand-format to "match the surrounding style." Spotless owns formatting. If spotlessApply changes lines you didn't touch, that means the file was already non-conformant; do not sweep those into your PR — they are scope creep. Run git add -p and stage only your intentional lines, or rebase the unrelated reformat away.

Thread-Safety

OpenSearch is concurrent everywhere. Before you add or move a line, ask: what thread runs this, and what else runs concurrently?

If you're changing…	The concurrency concern is…
An aggregator's `collect`/reduce	Shard-local `collect` runs on a search thread per segment; `reduce` runs once on the coordinating node. Don't share mutable state across them.
`IndexShard` / `InternalEngine` indexing	Many `WRITE`-pool threads index concurrently; per-`_id` correctness relies on `versionMap.acquireLock(uid)`. Any read-modify-write of version/seqno must be inside that lock.
A `ClusterStateListener` / `ClusterStateApplier`	Runs single-threaded on the applier thread, and the cluster state it sees is shared and immutable; never block this thread (no I/O, no `get()` on a future).
A `Setting` consumer	Dynamic settings update from a different thread via `ClusterSettings`; make the held field `volatile` and update it atomically in the consumer.

For the engine seqno example, the fix is expanding a lock window, which is a thread-safety change by definition:

// WRONG: version resolved outside the lock, seqno assigned inside.
long version = resolveVersion(op);          // racy read
try (Releasable ignored = versionMap.acquireLock(op.uid().bytes())) {
    ...
}

// RIGHT: the whole read-modify-write is inside one lock acquisition.
try (Releasable ignored = versionMap.acquireLock(op.uid().bytes())) {
    long version = resolveVersion(op);
    ...
}

State the threading argument explicitly in your PR description. Reviewers of engine/coordination code will ask "what thread, what lock" first.
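The RIGHT shape can be exercised outside OpenSearch to convince yourself of the argument. A minimal, self-contained sketch, assuming one plain ReentrantLock per id as a stand-in for versionMap.acquireLock; `PerKeyVersioning`, `bumpVersion`, and `runConcurrently` are hypothetical names, not engine APIs.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class PerKeyVersioning {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Long> versions = new ConcurrentHashMap<>();

    /** Reads, computes, and writes the next version for one id, all under that id's lock. */
    long bumpVersion(String id) {
        ReentrantLock lock = locks.computeIfAbsent(id, key -> new ReentrantLock());
        lock.lock();
        try {
            long current = versions.getOrDefault(id, 0L); // read
            long next = current + 1;                       // modify
            versions.put(id, next);                        // write
            return next;
        } finally {
            lock.unlock();
        }
    }

    long versionOf(String id) {
        return versions.getOrDefault(id, 0L);
    }

    /** Hammers a single id from several threads and returns the final version. */
    static long runConcurrently(int threads, int updatesPerThread) {
        PerKeyVersioning store = new PerKeyVersioning();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    store.bumpVersion("doc-1");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return store.versionOf("doc-1");
    }

    public static void main(String[] args) {
        System.out.println("final version: " + runConcurrently(8, 10_000));
    }
}
```

Because the read and the write sit inside one lock acquisition, no update can be lost and the final version equals the total number of updates. Moving the read above lock.lock() recreates the WRONG shape, and the total then becomes timing-dependent.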

Backward Compatibility

OpenSearch nodes of adjacent versions run in the same cluster during a rolling upgrade, and segments/translog written by an old version are read by a new one. Two BWC tripwires:

Serialization changes. If you add or reorder a field on anything that implements Writeable (a request, response, cluster-state piece, or an InternalAggregation), the wire format changes, and an old node must still parse a new node's bytes (and vice versa). Guard the new field with a Version check on both sides:
```
// writeTo
if (out.getVersion().onOrAfter(Version.V_3_1_0)) {
    out.writeOptionalString(newField);
}
// the StreamInput constructor
if (in.getVersion().onOrAfter(Version.V_3_1_0)) {
    this.newField = in.readOptionalString();
}
```
Use the next unreleased version constant (check server/src/main/java/org/opensearch/Version.java), never a hardcoded literal. Read the serialization & BWC deep dive before you touch any writeTo/readFrom. Most bug fixes — including all three Capstone examples — should avoid changing serialization. The aggregation fix changes only which buckets are produced, not the bucket wire format, so it has no serialization impact. Say that explicitly: "No serialization change; no Version guard needed."
Behavioral compatibility. Even with no wire change, you are changing an observable response (the aggregation now returns a bucket it didn't before). That is the intended fix, but call it out — it's a Fixed CHANGELOG entry, and if anyone depended on the buggy behavior, the PR discussion is where that surfaces. See the compatibility mindset chapter.
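The reason the guard must appear on both sides is easiest to see in a model with no OpenSearch types. This sketch assumes plain java.io data streams in place of StreamOutput/StreamInput and integer versions in place of Version constants; `VersionGuardedMessage`, `V_OLD`, `V_NEW`, and the `note` field are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class VersionGuardedMessage {
    static final int V_OLD = 1;
    static final int V_NEW = 2; // hypothetical version that introduced `note`

    final String name;
    final String note; // new optional field; null when the peer predates V_NEW

    VersionGuardedMessage(String name, String note) {
        this.name = name;
        this.note = note;
    }

    /** Serializes for a peer speaking `wireVersion`; the new field is written only if it understands it. */
    byte[] write(int wireVersion) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(name);
            if (wireVersion >= V_NEW) {
                out.writeBoolean(note != null);
                if (note != null) {
                    out.writeUTF(note);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deserializes bytes written for `wireVersion`, mirroring the guard used in write. */
    static VersionGuardedMessage read(byte[] data, int wireVersion) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String name = in.readUTF();
            String note = null;
            if (wireVersion >= V_NEW && in.readBoolean()) {
                note = in.readUTF();
            }
            if (in.available() != 0) {
                throw new IllegalStateException("unread trailing bytes: the guards disagree");
            }
            return new VersionGuardedMessage(name, note);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        VersionGuardedMessage message = new VersionGuardedMessage("req", "hello");
        System.out.println("new to new: " + read(message.write(V_NEW), V_NEW).note);
        System.out.println("new to old: " + read(message.write(V_OLD), V_OLD).note);
    }
}
```

If only the writer is guarded, or only the reader, the two sides disagree about how many bytes the message contains; in this model that surfaces as the trailing-bytes error, and in a real mixed-version cluster it surfaces as a deserialization failure during a rolling upgrade.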

Settings Conventions

If your fix introduces a setting (resist this — a bug fix rarely needs a knob), follow the conventions exactly:

public static final Setting<Boolean> SOME_FLAG_SETTING = Setting.boolSetting(
    "search.aggregations.some_flag",
    true,                                   // default preserves prior behavior unless the bug demands otherwise
    Setting.Property.NodeScope,             // or Dynamic + IndexScope as appropriate
    Setting.Property.Dynamic
);

Register it in the relevant settings list (ClusterSettings/IndexScopedSettings wiring lives in SettingsModule / IndexScopedSettings.BUILT_IN_INDEX_SETTINGS), or precommit will fail with "unregistered setting." Default the setting so that existing clusters behave identically after upgrade unless the whole point of the fix is to change the default. For the Capstone bugs, no new setting is needed — the correct behavior is unconditional. Adding a flag to make a bug fix optional is itself a smell a reviewer will challenge.

The Diff

Here is what direction B looks like as a realistic, minimum-surface patch:

--- a/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregator.java
+++ b/server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregator.java
@@ class TermsAggregator extends DeferableBucketAggregator {
     @Override
     protected void doPostCollection() throws IOException {
-        // Skip building buckets when this segment produced no ordinals for the field.
-        if (!hasValuesForField()) {
-            return;
-        }
+        // Skip building buckets when this segment produced no ordinals for the field,
+        // *unless* min_doc_count == 0, in which case the synthetic (and `missing`)
+        // bucket must still be emitted per shard so the reduce step can merge it. #NNNN
+        if (!hasValuesForField() && bucketCountThresholds.getMinDocCount() != 0) {
+            return;
+        }
         ...
     }
 }

Properties of a good fix diff, all visible above:

Tens of lines, not hundreds. This is one guard condition.
A comment citing the issue (#NNNN) at the non-obvious line — the future reader needs to know why the guard has that extra clause.
No drive-by changes. The ... regions are untouched. No reformatting, no rename of hasValuesForField, no "I noticed this other thing."
Behavior for the common case is provably unchanged — when getMinDocCount() != 0, the condition is identical to before.
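The "provably unchanged" claim is small enough to check exhaustively. A sketch that models only the two boolean conditions from the diff; `oldSkip`, `newSkip`, and `unchangedFor` are hypothetical names, and the real aggregator state is reduced to a single hasValues flag.

```java
final class GuardEquivalence {
    /** The fast-path condition before the change. */
    static boolean oldSkip(boolean hasValues) {
        return !hasValues;
    }

    /** The fast-path condition after the change. */
    static boolean newSkip(boolean hasValues, long minDocCount) {
        return !hasValues && minDocCount != 0;
    }

    /** True when old and new conditions agree for both values of hasValues. */
    static boolean unchangedFor(long minDocCount) {
        for (boolean hasValues : new boolean[] { false, true }) {
            if (oldSkip(hasValues) != newSkip(hasValues, minDocCount)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("min_doc_count=1:  unchanged=" + unchangedFor(1));
        System.out.println("min_doc_count=10: unchanged=" + unchangedFor(10));
        System.out.println("min_doc_count=0:  unchanged=" + unchangedFor(0));
    }
}
```

The conditions agree for every non-zero min_doc_count and differ only at zero, which is exactly the case the fix targets. A one-paragraph version of this argument belongs in the PR description.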

Generate and read your own diff before you trust it:

git diff origin/main --stat        # which files? should be ONLY the bug's files + tests
git diff origin/main               # read every line; justify each in one sentence

If --stat lists a file unrelated to the bug, that is scope creep. Remove it.

Re-run the Reproducer

The same test that was red in Step 2 must now be green — and only because of your change:

./gradlew :server:test \
  --tests "org.opensearch.search.aggregations.bucket.terms.TermsAggregatorTests.testIssueNNNNRepro"

Green. If it's still red, the fix is wrong or in the wrong place — back to the mechanism. If it's green but you suspect the change is broader than needed, run the surrounding test class to catch collateral breakage early (full validation is Step 7):

./gradlew :server:test --tests "org.opensearch.search.aggregations.bucket.terms.*"

Deliverable for Step 5

Three fix directions weighed in a table; one chosen with a one-line reason per rejection.
A minimum-diff production change — tens of lines, every line justifiable.
./gradlew spotlessApply run; spotlessJavaCheck green; SPDX headers intact; no unrelated reformatting staged.
Thread-safety argument stated (what thread, what lock/visibility).
BWC assessed: either a Version-guarded serialization change or an explicit "no serialization change" note.
The Step-2 reproducer is now green, and git diff --stat shows only the bug's files.

Validation / Self-check

Before advancing to Step 6:

Every line in git diff origin/main is justifiable in one sentence.
git diff origin/main --stat lists only files related to the bug (plus tests).
./gradlew spotlessJavaCheck passes with no manual formatting.
You did not add a setting, a public-API change, or a serialization change unless the bug genuinely required it — and if you did, it is Version-guarded and registered.
You can state which thread runs the changed code and what concurrent access it must tolerate.
The Step-2 reproducer passes; behavior for the non-triggering case is provably identical.
You wrote the "three directions, one chosen" table for the PR description.

Then go to Step 6: Testing.

Step 6: Testing

Your Step 2 reproducer proved the bug exists and your Step 5 fix made it green. That single test is necessary but not sufficient. The tests you ship in the PR have a different job: they are permanent regression protection that must encode every trigger condition, survive thousands of randomized runs by other people on other machines, and convince a maintainer that the fix is correct and that the surrounding behavior is unbroken.

The rule that governs this step: a test that would have passed before your fix is not a test of your fix. Every regression test you add must be red on main and green with your change; the one exception is the negative control, which must be green on both.

The Test Pyramid in OpenSearch

Match the test level to what the bug actually needs. Cheaper is better; reach for the in-cluster harness only when single-class testing genuinely cannot reproduce.

Level	Base class	Task	Use when
Unit	`OpenSearchTestCase` (and aggregator/`...TestCase` subclasses)	`:server:test`	Logic in one class: aggregation reduce, setting validator, request parser
Serialization	`AbstractWireSerializingTestCase<T>` / `AbstractSerializingTestCase<T>`	`:server:test`	Anything `Writeable` or XContent — round-trip + BWC
Integration	`OpenSearchIntegTestCase` (`InternalTestCluster`)	`:server:internalClusterTest`	Multi-node / multi-shard: allocation, recovery, replication, cluster-state propagation, listener races
REST contract	`OpenSearchRestTestCase` + YAML	`:rest-api-spec:yamlRestTest`	Status code, response shape, error message
BWC	tests under `qa/` with `bwcVersion`	`:qa:...`	Rolling-upgrade / mixed-cluster behavior, serialization across versions

Unit Tests

The aggregation fix lives in one class, so its primary test is a unit test in the aggregator's existing test class (here TermsAggregatorTests, an AggregatorTestCase subclass). Cover the trigger conditions explicitly and add a negative control: the same shape where the bug must not fire.

/*
 * SPDX-License-Identifier: Apache-2.0
 * ... keep the existing header ...
 */
public class TermsAggregatorTests extends AggregatorTestCase {

    // THE FIX: synthetic + `missing` bucket survives when a shard has no values for the field.
    public void testMissingWithMinDocCountZeroOnEmptyShard() throws Exception {
        MappedFieldType fieldType = new KeywordFieldMapper.KeywordFieldType("absent_field");
        TermsAggregationBuilder agg = new TermsAggregationBuilder("by_missing")
            .field("absent_field").missing("N/A").minDocCount(0);

        try (Directory dir = newDirectory();
             RandomIndexWriter w = new RandomIndexWriter(random(), dir)) {
            w.addDocument(List.of());                 // a doc with no value -> empty segment for field
            try (IndexReader reader = w.getReader()) {
                StringTerms result = searchAndReduce(
                    newIndexSearcher(reader), new MatchAllDocsQuery(), agg, fieldType);
                assertEquals(1, result.getBuckets().size());            // FAILS on main
                assertEquals("N/A", result.getBuckets().get(0).getKeyAsString());
                assertEquals(1, result.getBuckets().get(0).getDocCount());
            }
        }
    }

    // NEGATIVE CONTROL: with min_doc_count >= 1, the empty bucket must still be dropped.
    public void testMissingWithMinDocCountOneStillDropsEmptyBucket() throws Exception {
        MappedFieldType fieldType = new KeywordFieldMapper.KeywordFieldType("absent_field");
        TermsAggregationBuilder agg = new TermsAggregationBuilder("by_missing")
            .field("absent_field").missing("N/A").minDocCount(1);
        try (Directory dir = newDirectory();
             RandomIndexWriter w = new RandomIndexWriter(random(), dir)) {
            w.addDocument(List.of());
            try (IndexReader reader = w.getReader()) {
                StringTerms result = searchAndReduce(
                    newIndexSearcher(reader), new MatchAllDocsQuery(), agg, fieldType);
                assertEquals(0, result.getBuckets().size());            // unchanged by the fix
            }
        }
    }
}

The negative control is what separates rubric band 14–15 from 11–13: it proves your fix is scoped, not a blanket behavior change. Run them:

./gradlew :server:test \
  --tests "org.opensearch.search.aggregations.bucket.terms.TermsAggregatorTests.testMissingWithMinDocCountZeroOnEmptyShard" \
  --tests "org.opensearch.search.aggregations.bucket.terms.TermsAggregatorTests.testMissingWithMinDocCountOneStillDropsEmptyBucket"

Serialization Tests (when the wire format is involved)

If your fix touched a Writeable (it shouldn't have for the aggregation bug, but the seqno/engine response or a new cluster-state piece might), add or extend an AbstractWireSerializingTestCase. It round-trips the object through StreamOutput/StreamInput on random instances and, critically, across versions:

public class SomeResponseTests extends AbstractWireSerializingTestCase<SomeResponse> {
    @Override protected Writeable.Reader<SomeResponse> instanceReader() { return SomeResponse::new; }
    @Override protected SomeResponse createTestInstance() { /* random fields incl. the new one */ }

    // BWC: serialize at an old version, ensure the new field is absent/defaulted.
    public void testSerializationBwc() throws IOException {
        SomeResponse original = createTestInstance();
        SomeResponse roundTripped = copyInstance(original, Version.V_3_0_0);  // old version
        assertNull(roundTripped.getNewField());                              // guarded out pre-3.1
    }
}

copyInstance(instance, version) writes/reads at a pinned Version — that is the mechanical proof your Version guards from Step 5 are symmetric. See the serialization & BWC deep dive for the full treatment.
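To make "symmetric" concrete, here is a minimal, framework-free sketch of a version-guarded field. It uses `java.io` data streams rather than OpenSearch's `StreamOutput`/`StreamInput`, and `VersionedRoundTrip`, `V_OLD`, `V_NEW`, and `newField` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class VersionedRoundTrip {
    static final int V_OLD = 1;   // hypothetical wire version that predates newField
    static final int V_NEW = 2;   // hypothetical wire version that introduced newField

    final String name;
    final String newField;        // null when absent

    VersionedRoundTrip(String name, String newField) {
        this.name = name;
        this.newField = newField;
    }

    void writeTo(DataOutputStream out, int wireVersion) throws IOException {
        out.writeUTF(name);
        if (wireVersion >= V_NEW) {                       // guard on the write side...
            out.writeBoolean(newField != null);
            if (newField != null) {
                out.writeUTF(newField);
            }
        }
    }

    static VersionedRoundTrip readFrom(DataInputStream in, int wireVersion) throws IOException {
        String name = in.readUTF();
        String newField = null;
        if (wireVersion >= V_NEW && in.readBoolean()) {   // ...mirrored on the read side
            newField = in.readUTF();
        }
        return new VersionedRoundTrip(name, newField);
    }

    /** Serializes and deserializes at one pinned wire version. */
    static VersionedRoundTrip copyAt(VersionedRoundTrip original, int wireVersion) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.writeTo(new DataOutputStream(bytes), wireVersion);
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            VersionedRoundTrip copy = readFrom(in, wireVersion);
            if (in.available() != 0) {
                throw new IllegalStateException("unread bytes: read and write guards disagree");
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If only one side carried the guard, the round trip at the old version would either leave unread bytes or fail while reading, which is the kind of asymmetry a pinned-version copy is meant to expose.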

Integration Tests (when one class can't reproduce)

The ClusterApplierService listener-race and any replication/recovery bug need a real cluster. These live under server/src/internalClusterTest/java/... and run via :server:internalClusterTest, backed by InternalTestCluster.

@OpenSearchIntegTestCase.ClusterScope(scope = OpenSearchIntegTestCase.Scope.TEST, numDataNodes = 0)
public class IndexResourceLeakIT extends OpenSearchIntegTestCase {

    public void testFastCreateDeleteDoesNotLeakResource() throws Exception {
        internalCluster().startClusterManagerOnlyNode();   // cluster manager (formerly master)
        internalCluster().startDataNodes(2);

        // Drive the trigger: rapid create-then-delete so the applier may coalesce states.
        for (int i = 0; i < 50; i++) {
            createIndex("repro-" + i, Settings.builder()
                .put(IndexMetadata.SETTING_NUMBER_OF_SHARDS, 1)
                .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 1).build());
            client().admin().indices().prepareDelete("repro-" + i).get();
        }

        // The fix reconciles against current Metadata; assert no leaked resources remain.
        assertBusy(() -> {
            long open = someComponent().openResourceCount();   // expose via a test hook
            assertEquals("resources leaked after fast create/delete", 0L, open);
        });
    }
}

./gradlew :server:internalClusterTest \
  --tests "org.opensearch.cluster.*.IndexResourceLeakIT.testFastCreateDeleteDoesNotLeakResource"

Warning: Never wait with Thread.sleep. Cluster state settles asynchronously; a sleep is non-deterministic by construction and a reviewer will (correctly) block it. Use assertBusy(...) (retries until the assertion holds or times out), ensureGreen(...), ensureStableCluster(n), or a ClusterStateListener/ClusterStateObserver latch.
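The idea behind a retrying assertion can be sketched without the test framework. The helper below is an illustration, not OpenSearch's `assertBusy` implementation; `BusyWait` and `awaitAssertion` are hypothetical names. Unlike a bare sleep, it returns as soon as the assertion holds and rethrows the last assertion failure when the deadline passes.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class BusyWait {
    /** Re-runs the assertion with growing pauses until it passes or the timeout elapses. */
    static void awaitAssertion(Runnable assertion, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        long pauseMillis = 1;
        while (true) {
            try {
                assertion.run();
                return;                                   // condition holds: stop waiting immediately
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e;                              // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(pauseMillis);                // bounded pause between retries
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", ie);
            }
            pauseMillis = Math.min(pauseMillis * 2, 100);
        }
    }

    public static void main(String[] args) {
        AtomicBoolean settled = new AtomicBoolean(false);
        new Thread(() -> settled.set(true)).start();      // stands in for asynchronous state application
        awaitAssertion(() -> {
            if (!settled.get()) throw new AssertionError("state has not settled yet");
        }, 10, TimeUnit.SECONDS);
        System.out.println("settled");
    }
}
```

The pause inside the loop is only a retry interval; the test's outcome depends on the assertion, never on the pause having been long enough.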

To force the coalescing/race deterministically rather than hoping the loop hits it, the test framework's disruption helpers (test/framework/.../disruption/, e.g. BlockClusterStateProcessing, SlowClusterStateProcessing) let you pause state application on a node and release it after queuing both updates.
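The pause-queue-release pattern can be shown in miniature with plain `java.util.concurrent`. This is a toy model under stated assumptions: a single-threaded executor stands in for the state applier, and `PausedApplier` and `queueThenRelease` are hypothetical names, not the disruption helpers themselves.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PausedApplier {
    /** Queues every update behind a blocked single-threaded applier, then releases it. */
    static List<String> queueThenRelease(List<String> updates) {
        ExecutorService applier = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        List<String> applied = new CopyOnWriteArrayList<>();
        try {
            applier.execute(() -> awaitQuietly(release));     // pause: the applier thread blocks here
            for (String update : updates) {
                applier.execute(() -> applied.add(update));   // queued, cannot run yet
            }
            if (!applied.isEmpty()) {
                throw new IllegalStateException("an update ran before the release");
            }
            release.countDown();                              // release: updates run in queue order
            applier.shutdown();
            if (!applier.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("applier did not drain");
            }
            return applied;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            applier.shutdownNow();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because both updates are known to be queued before anything runs, the interleaving is fixed by construction rather than by timing.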

REST-YAML Tests (the response contract)

If the bug is user-visible at the REST layer (and the aggregation bug is), a REST-YAML test is the most faithful regression guard for the contract. Add it under the relevant project's YAML test resources (for core search, the shared suite under rest-api-spec/src/main/resources/rest-api-spec/test/search.aggregation/):

---
"terms agg with missing and min_doc_count zero on empty shard":
  - do:
      indices.create:
        index: repro
        body:
          settings: { number_of_shards: 1, number_of_replicas: 0 }
  - do:
      index: { index: repro, refresh: true, body: { value: 10 } }
  - do:
      search:
        index: repro
        body:
          size: 0
          aggs:
            by_missing:
              terms: { field: absent_field, missing: "N/A", min_doc_count: 0 }
  - match: { aggregations.by_missing.buckets.0.key: "N/A" }
  - match: { aggregations.by_missing.buckets.0.doc_count: 1 }

./gradlew :rest-api-spec:yamlRestTest \
  --tests "*ClientYamlTestSuiteIT" -Dtests.rest.suite=search.aggregation

BWC Tests (only if behavior crosses versions)

Most bug fixes do not need a dedicated BWC test — the serialization round-trip test above covers wire compatibility. You add a qa/ BWC test only when the fix changes behavior that a mixed-version cluster or a rolling upgrade must tolerate (e.g. a coordination change where an old cluster manager and new data node must interoperate). These run with a bwcVersion:

./gradlew :qa:rolling-upgrade:check
./gradlew :qa:mixed-cluster:check

If your fix has no cross-version behavior change, state that explicitly in the PR — "no BWC test needed; serialization unchanged, behavior change is local to a single node's response." Reviewers would rather see that judgment than an irrelevant BWC test.

Determinism: the Non-Negotiable

OpenSearch tests run under RandomizedRunner with a random seed, random locale, and random timezone. A test that passes on your seed and fails on someone else's is worse than no test. Enforce determinism:

No Thread.sleep. Use assertBusy, ensureGreen, latches, or ClusterStateObserver.
No wall-clock assertions unless you inject a deterministic clock.
No order-dependent assertions over a HashMap/Set — sort, or assert on a set, not a list order.
Respect the seed. Build random inputs from random(), randomAlphaOfLength(...), randomIntBetween(...) — not Math.random() or new Random(). This makes failures reproducible from the printed -Dtests.seed=... line.
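Three of these rules can be illustrated with the JDK alone. In the sketch below, `new Random(seed)` is only a stand-in for the framework's seeded `random()` (in a real OpenSearch test you would call the framework helpers instead), and `DeterministicInputs` and its methods are hypothetical names.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class DeterministicInputs {
    /** Same seed, same sequence: a failure can be replayed from the seed alone. */
    static int[] randomInts(long seed, int count) {
        Random random = new Random(seed);   // stand-in for the framework's seeded random()
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(1000);
        }
        return values;
    }

    /** Time-dependent logic takes a Clock so a test can pin "now". */
    static boolean isExpired(Instant deadline, Clock clock) {
        return clock.instant().isAfter(deadline);
    }

    /** Compares as sets (duplicates ignored) when iteration order is not part of the contract. */
    static boolean sameElements(Collection<String> actual, Set<String> expected) {
        return new HashSet<>(actual).equals(expected);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.equals(randomInts(42L, 5), randomInts(42L, 5)));
        Clock fixed = Clock.fixed(Instant.parse("2024-01-01T00:00:00Z"), ZoneOffset.UTC);
        System.out.println(isExpired(Instant.parse("2023-12-31T23:59:59Z"), fixed));
        System.out.println(sameElements(List.of("b", "a"), Set.of("a", "b")));
    }
}
```

Each line printed by `main` is `true` regardless of machine, locale, or wall-clock time, which is the property the checklist above asks of every test.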

Prove your test is deterministic by running it many times with different seeds:

# Hammer it. iters reruns within one JVM; loop across fresh seeds.
./gradlew :server:test --tests "...testMissingWithMinDocCountZeroOnEmptyShard" \
  -Dtests.iters=50
for i in $(seq 1 10); do
  ./gradlew :server:test --tests "...testMissingWithMinDocCountZeroOnEmptyShard" \
    -Dtests.seed=$(openssl rand -hex 8) -q || echo "FLAKE on iteration $i"
done

If any iteration flakes, the test is not done. A flaky test you ship becomes a flaky-test-labeled issue and an @AwaitsFix — exactly what you do not want attached to your name. (And never mute a real flake with @Ignore; if you must mute, use @AwaitsFix(bugUrl="https://github.com/opensearch-project/OpenSearch/issues/NNNN").)

Deliverable for Step 6

A test at the lowest level that reliably covers the bug (unit > integ > manual), red on main, green with the fix.
A negative control proving the fix is scoped (the near-miss case still behaves as before).
Every trigger condition from Step 2/4 encoded in a test assertion.
A serialization round-trip + BWC test iff the wire format changed (otherwise an explicit "no serialization change" note).
A REST-YAML test iff the bug is a REST contract.
Determinism proven: 50 iters + 10 fresh seeds, zero flakes, no sleep.

Validation / Self-check

Before advancing to Step 7:

Every new regression test fails on main and passes with your fix (the negative control passes on both); you verified by stashing the fix and re-running.
You included a negative control that would catch an over-broad fix.
You chose the lowest-cost harness that reliably reproduces; you did not write an integration test for a single-class bug.
No Thread.sleep, no wall-clock, no order-dependent assertions anywhere.
You ran the test ≥ 50 iterations and ≥ 10 seeds with no flake.
Serialization/BWC coverage matches your Step 5 BWC decision (test present iff wire changed; otherwise documented as unchanged).
Test names describe the scenario (testMissingWithMinDocCountZeroOnEmptyShard), not the method under test.

Then go to Step 7: Validation.

Step 7: Validation

Your fix works and your tests are green on your machine. That is not the bar. The bar is: the full local gate is green before you push, so CI on the PR is green on the first run. A red CI run on your first push is not fatal, but it is a tell — it says you pushed before validating, and it costs a reviewer a round-trip.

This step runs the same checks CI will run, locally, first, in increasing order of cost, and produces a validation report you attach to the PR. The rule: you do not push until you have personally seen every gate go green.

What "Green" Means

OpenSearch CI ("gradle-check") is, in essence, ./gradlew check sharded across runners. "Green" is not "my one test passed." It means all of:

Gate	Local command	What it catches
Formatting	`./gradlew spotlessJavaCheck`	Whitespace, import order, line length
Precommit	`./gradlew precommit`	Checkstyle, forbidden-APIs, license/SPDX headers, `loggerUsageCheck`, dependency checks, missing-javadoc on public API
Unit tests	`./gradlew :server:test` (+ affected modules)	Logic regressions
Integration tests	`./gradlew :server:internalClusterTest`	Multi-node regressions
REST-YAML	`./gradlew :rest-api-spec:yamlRestTest`	Contract regressions
Full check	`./gradlew check`	Everything above + more, across all modules

You run these from cheapest to most expensive so a fast failure stops you before you spend 40 minutes on check.
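The cost-ordered, fail-fast idea is simple enough to sketch. The runner here is injected as a predicate so the example stays self-contained; `GateRunner`, `firstFailure`, and the fake runner are hypothetical, and a real version would launch each Gradle command and check its exit code.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class GateRunner {
    /** Runs gates in list order and stops at the first failure, so costlier gates never start. */
    static Optional<String> firstFailure(List<String> gatesCheapestFirst, Predicate<String> runGate) {
        for (String gate : gatesCheapestFirst) {
            if (!runGate.test(gate)) {
                return Optional.of(gate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> gates = List.of(
            "./gradlew spotlessJavaCheck",
            "./gradlew precommit",
            "./gradlew :server:test",
            "./gradlew check");
        // Hypothetical runner: pretend precommit fails. A real one would execute each command.
        Predicate<String> fakeRunner = gate -> !gate.endsWith("precommit");
        System.out.println(firstFailure(gates, fakeRunner).orElse("all green"));
    }
}
```

With the fake runner above, the unit tests and the full `check` are never started, which is the time the ordering is meant to save.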

The Gate, in Order

1. Format (seconds)

./gradlew spotlessApply        # fix formatting
./gradlew spotlessJavaCheck    # verify; this is exactly what CI runs

If spotlessApply touched files you didn't mean to change, you have unrelated reformatting staged — strip it (see Step 5).

2. Precommit (a few minutes)

./gradlew precommit

precommit is where most first-time PRs fail CI, because it enforces rules you don't think about: an unregistered Setting, a forbidden API (System.currentTimeMillis() instead of the injected clock; raw java.util.Random), a missing SPDX header, a malformed logger call (placeholders that don't match the arguments). Read each failure literally: the task name tells you which rule (forbiddenApisMain, checkstyleMain, licenseHeaders, loggerUsageCheck).

3. Affected-module tests (minutes)

Don't run the whole world yet. Run the modules your diff touches. Find them from the diff:

git diff origin/main --stat        # which dirs changed?

Then scope to those Gradle projects:

# core search/agg change:
./gradlew :server:test --tests "org.opensearch.search.aggregations.*"

# if you touched a modules/ or plugins/ project, run that project too, e.g.:
./gradlew :modules:reindex:test

Always include the class you changed and its whole package — your fix can break a sibling test that exercises the same code path.

4. Integration / REST tests (tens of minutes)

If your bug or fix touches cluster behavior or the REST contract:

./gradlew :server:internalClusterTest --tests "org.opensearch.cluster.*"
./gradlew :rest-api-spec:yamlRestTest

5. The full `check` scope (long — run once before push)

./gradlew check

This is the closest local mirror of CI. It is long (tens of minutes to over an hour depending on hardware) — run it once, after the cheaper gates are green, right before you push. Tips:

Use --build-cache (and Gradle's local cache) so unchanged modules aren't rebuilt.
If check fails in a module you never touched, suspect a flaky test before suspecting your change: re-run just that test with its printed -Dtests.seed=.... If it also fails on main with that seed (stash your fix and check), the failure is pre-existing. Note it with the issue link; it is not yours to fix here.
You can scope check to the heavy parts if hardware is limited: ./gradlew precommit :server:test :server:internalClusterTest is a strong proxy for most server-only fixes.

Note: OpenSearch CI on a PR is triggered by maintainers/automation (gradle-check) and reported as a status check on the PR. You cannot make CI green by wishing; you make it green by having run ./gradlew check locally first. The single best predictor of first-pass green CI is "I ran check and read every line of output."

Reading CI on the PR

Once the PR is up (Step 8), the gradle-check status appears on the PR's Checks tab. When it's red:

Open the failing check's logs (click "Details" next to gradle-check).

Find the failing task and the REPRODUCE WITH: line the test framework prints — it includes the seed, locale, and timezone:

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.x.YTests.testZ" \
  -Dtests.seed=ABC123 -Dtests.locale=fr-FR -Dtests.timezone=America/Sao_Paulo

Run that exact line locally. If it reproduces, it's yours: fix it. If it doesn't reproduce and the same test is flaky on main, it's a pre-existing flaky test; comment on the PR linking the flaky-test issue and ask for gradle-check to be re-run (a maintainer can re-trigger it).

Do not push speculative "maybe this fixes CI" commits. Reproduce locally with the seed first, then push a fix you've verified.
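When you copy a failure out of a long CI log, it helps to pull the reproduction parameters out explicitly so none is dropped. The sketch below extracts every `-Dtests.*` property from a REPRODUCE WITH line; `ReproduceLine` is a hypothetical helper, and the pattern assumes values contain no whitespace, as in the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReproduceLine {
    // Matches -Dtests.<name>=<value>; values are assumed to contain no whitespace.
    private static final Pattern TESTS_PROPERTY = Pattern.compile("-D(tests\\.[\\w.]+)=(\\S+)");

    /** Returns the tests.* system properties in the order they appear on the line. */
    static Map<String, String> testProperties(String line) {
        Map<String, String> properties = new LinkedHashMap<>();
        Matcher matcher = TESTS_PROPERTY.matcher(line);
        while (matcher.find()) {
            properties.put(matcher.group(1), matcher.group(2));
        }
        return properties;
    }

    public static void main(String[] args) {
        String line = "REPRODUCE WITH: ./gradlew ':server:test' --tests \"org.opensearch.x.YTests.testZ\" "
            + "-Dtests.seed=ABC123 -Dtests.locale=fr-FR -Dtests.timezone=America/Sao_Paulo";
        System.out.println(testProperties(line));
    }
}
```

For the example line this prints `{tests.seed=ABC123, tests.locale=fr-FR, tests.timezone=America/Sao_Paulo}`; a local run that omits the locale or timezone is not the same run.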

The Validation Report Artifact

Produce capstone-work/validation.md — a record of exactly what you ran and what you saw. This goes into (or is summarized in) the PR description; it is the single most reassuring thing a reviewer can read.

# Validation report: #NNNN

## Checkout
- Branch: fix/NNNN-terms-missing-min-doc-count
- Based on: origin/main @ <sha>

## Gates run (all green)
| Gate | Command | Result |
|---|---|---|
| Spotless | `./gradlew spotlessJavaCheck` | PASS |
| Precommit | `./gradlew precommit` | PASS |
| Unit (changed pkg) | `./gradlew :server:test --tests "...aggregations.bucket.terms.*"` | PASS (412 tests) |
| REST-YAML | `./gradlew :rest-api-spec:yamlRestTest` | PASS |
| Full check | `./gradlew check` | PASS (1h03m, --build-cache) |

## Reproducer status
- testMissingWithMinDocCountZeroOnEmptyShard: FAIL on main, PASS with fix
- Negative control testMissingWithMinDocCountOne...: PASS both (unchanged)

## Determinism
- 50 iters + 10 fresh seeds on the new tests: 0 flakes

## Diff scope
- `git diff origin/main --stat`: 2 files (TermsAggregator.java, TermsAggregatorTests.java) + 1 YAML
- No unrelated files touched.

## Notes
- One unrelated flake observed in qa:rolling-upgrade (issue #MMMM); not introduced by this PR.

Deliverable for Step 7

spotlessJavaCheck, precommit, affected-module tests, and a full ./gradlew check all run locally and green.
git diff origin/main --stat confirms only the bug's files changed.
Any failure understood and either fixed or attributed (pre-existing flake, with the issue link).
capstone-work/validation.md written with the exact commands and results.
You have not pushed until every gate is green.

Validation / Self-check

Before advancing to Step 8:

You ran the gates in cost order and saw each one green — you did not skip precommit or check.
git diff origin/main --stat lists only files you intended to change.
You can explain every failing test you saw and whether it is yours or a pre-existing flake (with a link).
The new tests are deterministic across iters and seeds.
capstone-work/validation.md records the literal commands and results, not a vague "all tests passed."
You understand that local check green is what produces first-pass green CI; CI is not something you debug after pushing.
You did not push speculative fixes to chase CI without a local repro.

Then go to Step 8: Pull Request Preparation.

Step 8: Pull Request Preparation

The fix is correct, tested, and validated. Now you package it as a Pull Request that a maintainer can review without friction. OpenSearch has no JIRA and no CLA — contribution flows entirely through a GitHub PR with a DCO sign-off, a CHANGELOG entry, and the PR template filled out honestly. Getting these mechanics right is the difference between "reviewed today" and "sat in the queue because the DCO check is red."

The rule that governs this step: the PR should answer every question a reviewer would ask before they have to ask it — problem, root cause, fix, tests, BWC.

Fork and Branch

You don't push to opensearch-project/OpenSearch; you push to your fork and open a PR from it.

# One-time: from inside your OpenSearch clone, fork on GitHub and add the fork as a remote named "fork".
gh repo fork --remote --remote-name fork
# (or manually:) git remote add fork git@github.com:<you>/OpenSearch.git

# Branch off an up-to-date main with a descriptive name.
git fetch origin
git checkout -b fix/NNNN-terms-missing-min-doc-count origin/main

Branch naming: fix/<issue>-<slug> or feature/<slug>. The issue number in the branch name is a small kindness — it threads the work together.

Commit with DCO Sign-off (`git commit -s`)

Every commit must carry a Signed-off-by: trailer. This is the Developer Certificate of Origin: your attestation that you wrote the code, or otherwise have the right to submit it, under the project's Apache-2.0 license. The -s flag adds it:

git add server/src/main/java/org/opensearch/.../TermsAggregator.java \
        server/src/test/java/org/opensearch/.../TermsAggregatorTests.java \
        rest-api-spec/.../terms_missing_min_doc_count.yml \
        CHANGELOG.md
git commit -s -m "Fix terms agg dropping missing bucket with min_doc_count=0 on empty shards"

The commit message ends up with:

Fix terms agg dropping missing bucket with min_doc_count=0 on empty shards

Signed-off-by: Your Name <you@example.com>

Warning: The DCO check is a required status check. If any commit lacks Signed-off-by:, the check goes red and the PR cannot merge. The Signed-off-by: name/email must match your git config user.name/user.email. If you forgot -s, fix it without rewriting unrelated history:
git commit --amend -s --no-edit            # last commit
git rebase --signoff origin/main           # all commits on the branch
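What the check looks for can be approximated in a few lines. This is a rough sketch of the rule as described above, not the DCO app's actual logic; `SignOffCheck` and `isSignedOffBy` are hypothetical names.

```java
final class SignOffCheck {
    /** True if some line of the message is a Signed-off-by trailer for exactly this identity. */
    static boolean isSignedOffBy(String commitMessage, String name, String email) {
        String expected = "Signed-off-by: " + name + " <" + email + ">";
        return commitMessage.lines().map(String::strip).anyMatch(expected::equals);
    }

    public static void main(String[] args) {
        String message = "Fix terms agg dropping missing bucket with min_doc_count=0 on empty shards\n"
            + "\n"
            + "Signed-off-by: Your Name <you@example.com>\n";
        System.out.println(isSignedOffBy(message, "Your Name", "you@example.com"));
        System.out.println(isSignedOffBy(message, "Your Name", "other@example.com"));
    }
}
```

The second call shows the common failure mode: the trailer exists, but it names a different email than the identity being checked.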

Keep the commit message subject in the imperative, under ~72 chars, and let the body (if any) explain why, not what (the diff shows what).

The CHANGELOG Entry

OpenSearch tracks a human-readable CHANGELOG.md. Every PR adds exactly one line under the right heading in the [Unreleased ...] section. Skipping it fails the changelog check.

grep -n "## \[Unreleased" CHANGELOG.md      # find the unreleased section + subheadings

The subheadings follow Keep-a-Changelog: Added, Changed, Deprecated, Removed, Fixed, Security. A bug fix goes under Fixed:

 ### Fixed
+- Fix `terms` aggregation dropping the `missing` bucket when `min_doc_count` is 0 and a shard has no matching documents ([#NNNN](https://github.com/opensearch-project/OpenSearch/pull/NNNN))

The entry is one line and user-facing (describe the behavior, not the code); it backticks the API surface and links the PR number. You'll update the placeholder to the real PR number after opening the PR (or use the issue number and let the bot help). A behavior change that isn't strictly a bug goes under Changed; a new capability goes under Added. See Step 9 for choosing between Changed and Fixed.
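A quick shape check can catch the usual slips, such as a link number that does not match the label. The pattern below is inferred from the example entry only; the repository's own changelog check may apply different rules, and `ChangelogEntry` is a hypothetical helper.

```java
import java.util.regex.Pattern;

final class ChangelogEntry {
    // Shape inferred from the example entry: "- <text> ([#N](.../pull/N))" with matching numbers.
    private static final Pattern ENTRY = Pattern.compile(
        "^- \\S.* \\(\\[#(\\d+)\\]\\(https://github\\.com/opensearch-project/OpenSearch/pull/\\1\\)\\)$");

    static boolean looksValid(String line) {
        return ENTRY.matcher(line).matches();
    }

    public static void main(String[] args) {
        String good = "- Fix `terms` aggregation dropping the `missing` bucket "
            + "([#123](https://github.com/opensearch-project/OpenSearch/pull/123))";
        String mismatched = "- Fix `terms` aggregation dropping the `missing` bucket "
            + "([#123](https://github.com/opensearch-project/OpenSearch/pull/124))";
        System.out.println(looksValid(good));
        System.out.println(looksValid(mismatched));
    }
}
```

The PR number 123 is made up for the demonstration; substitute the real number once the PR exists.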

The PR Template

.github/pull_request_template.md auto-populates the PR body. Fill every section — empty checkboxes and unanswered prompts read as "didn't bother." The template asks for:

Description — what and why.
Related Issues — Resolves #NNNN (or Fixes #NNNN), which auto-closes the issue on merge.
Check List — DCO sign-off, CHANGELOG updated, tests added, commit messages follow guidelines, public-API/BWC considered. Tick them honestly; if one doesn't apply, say why rather than leaving it blank.

A Model PR Description

This is what reviewers love to open: it front-loads the problem, the proven root cause, the scoped fix, the tests, and the BWC stance — exactly the artifacts you built in Steps 2–7.

### Description

`terms` aggregations with `min_doc_count: 0` and a `missing` value drop the
synthetic bucket when at least one shard contributes zero matching documents.
A single-shard index whose only document lacks the field returns
`aggregations.by_missing.buckets: []` instead of the expected
`[{ "key": "N/A", "doc_count": 1 }]`.

### Root cause

The shard-local aggregator takes an empty-segment fast path that returns before
building buckets. That fast path is correct for `min_doc_count >= 1` (empty
buckets are discarded anyway) but wrong for `min_doc_count == 0`: the synthetic
(and `missing`) bucket that should exist for every shard is produced by no shard,
so the coordinating-node reduce in `InternalTerms.reduce(...)` has nothing to
merge. Mechanism and `git blame` (introduced in #4567) are in the linked issue.

### Fix

Gate the empty-segment fast path on `min_doc_count != 0` in
`TermsAggregator.doPostCollection()` (one guard condition). Behavior for
`min_doc_count >= 1` is byte-for-byte unchanged.

### Testing

- Unit: `TermsAggregatorTests.testMissingWithMinDocCountZeroOnEmptyShard`
  (red on main, green here) + a negative control with `min_doc_count: 1`.
- REST-YAML: `search.aggregation/terms_missing_min_doc_count.yml`.
- 50 iters + 10 seeds, no flakes. `./gradlew precommit check` green locally.

### Backward compatibility

No serialization change; no `Version` guard needed. The only observable change is
the intended one: the previously-missing bucket is now returned. No new settings.

### Related Issues

Resolves #NNNN

Notice it is scannable: a reviewer skims the headings and knows whether to trust the change before reading a line of code. A strong description often saves a review round.

Open the PR, Labels, and Reviewers

git push -u fork fix/NNNN-terms-missing-min-doc-count

gh pr create --repo opensearch-project/OpenSearch \
  --base main \
  --title "Fix terms agg dropping missing bucket with min_doc_count=0 on empty shards" \
  --body-file capstone-work/pr-body.md

Then, on the PR:

Labels. Apply bug (and a component label if the repo uses them, e.g. Search:Aggregations). If the fix should ship in the maintenance line too, add the backport 2.x label — this triggers the backport bot to open a backport PR automatically after merge. Do not hand-cherry-pick unless the bot fails. (Some labels you cannot self-apply; ask in the PR or on Slack and a maintainer will add them.)
Reviewers. Request the maintainers who own the area. Find them in MAINTAINERS.md and via git log/git blame on the files you touched (the author of the introducing PR #4567 is a natural reviewer — @-mention them).
Link the issue. Resolves #NNNN in the body does the auto-close; also make sure the issue is referenced so the cross-links resolve both ways.

Keep CI Green

After you push, gradle-check runs (triggered by automation/maintainers). Your job:

It should be green on the first run because you ran ./gradlew check locally in Step 7.
If it's red, follow the Step 7 "Reading CI" procedure: open the logs, grab the REPRODUCE WITH: line, reproduce locally, fix, push. Don't push speculative fixes.
The DCO and CHANGELOG checks are independent of gradle-check: a red DCO/CHANGELOG check is a mechanics problem (missing sign-off, missing entry), not a test problem. Fix the mechanics.
Re-validate after every push. "I fixed the test" is a claim until CI agrees.

Deliverable for Step 8

A fork + a descriptively-named branch off current origin/main.
All commits DCO-signed (git commit -s); Signed-off-by: matches your git identity; DCO check green.
A one-line CHANGELOG.md entry under the correct heading.
The PR template filled completely and honestly.
A scannable PR description: problem → root cause → fix → tests → BWC → Resolves #NNNN.
Labels applied (bug, backport 2.x if relevant), reviewers requested, issue linked.
gradle-check, DCO, and CHANGELOG checks green.

Validation / Self-check

Before advancing to Step 9:

Review git log origin/main..HEAD: every commit has a Signed-off-by: trailer matching your identity, and the DCO check is green.
CHANGELOG.md has exactly one new line, under the right heading, user-facing, linking the PR.
The PR template has no unanswered prompt or unexplained empty checkbox.
A stranger could read only the PR description and correctly state the bug, the cause, the fix, and the BWC impact.
You requested the right reviewers (area maintainers + the introducing-PR author) and applied the right labels.
Resolves #NNNN is present so the issue auto-closes on merge.
CI (gradle-check) is green, or you have an in-progress, locally-reproduced fix for any red — not a speculative push.

Then go to Step 9: GitHub Documentation.

Step 9: GitHub Documentation

A PR is not a fire-and-forget artifact. Between "opened" and "merged" sits the review cycle, and how you run that is most of what a maintainer remembers about you. This step covers the documentation and conversation hygiene that surrounds the code: keeping the issue and PR honest as the change evolves, getting the CHANGELOG heading right, handling the separate documentation-website repo, responding to review rounds cleanly, and the merge + backport-bot endgame.

The rule that governs this step: the PR conversation and the issue should, at any moment, accurately describe the current state of the change — never a stale earlier version of it.

Update the Issue

The issue is where triagers, future searchers, and the maintainer who assigns the review look first. Keep it current:

Comment when you start. "Working on this — have a reproducer and a candidate fix, PR incoming." This claims the work informally (OpenSearch doesn't hard-assign most issues, but a comment prevents duplicate effort) and signals you're active.
Link the PR the moment it's open (the PR's Resolves #NNNN cross-links automatically, but a one-line comment helps).
Drop the untriaged confusion. If the issue carried untriaged, your reproduction and root-cause comment is exactly what a maintainer needs to triage it — paste the symptom, trigger conditions, and affected versions you nailed down in Steps 2 and 4.
Don't close it yourself. Resolves #NNNN in the merged PR auto-closes it. Closing it manually before merge just confuses the cross-link.

PR Conversation Hygiene

The PR description is a living document, not a first-draft you abandon. After any material change to the approach (you switched fix directions, you narrowed the scope), edit the top-of-PR description to match. A reviewer who returns after a week should not have to reconstruct the current design from 30 comments.

Keep the description's "Fix" and "Testing" sections accurate to the latest diff.
When you push changes in response to feedback, leave a short comment summarizing what changed ("Addressed @reviewer: moved the guard into doPostCollection, added the negative-control test").
Resolve review threads only when you've actually pushed the corresponding change — and resolve them yourself with a substantive reply, not silently.

CHANGELOG: `Changed` vs `Fixed`

You added a CHANGELOG line in Step 8. Make sure it's under the right heading, because reviewers and release managers read these for the release notes, and the distinction matters:

Heading	Use for	Example
Fixed	A bug: behavior that was wrong by the documented/intended contract	"Fix `terms` agg dropping the `missing` bucket with `min_doc_count: 0`"
Changed	A deliberate behavior change that wasn't strictly a bug, or a change users may notice and must be told about	"Change default `search.max_buckets` enforcement to apply during reduce"
Added	New capability / setting / API	"Add `search.concurrent_segment_search.enabled` cluster setting"
Deprecated / Removed / Security	As named	—

The aggregation fix is unambiguously Fixed. But ask yourself the honest question: does anyone depend on the current (buggy) behavior? If realistically yes — if scripts parse the empty-buckets response — then even a bug fix has a Changed-flavored blast radius, and you should flag that in the PR so a maintainer can decide whether it needs a migration note. When in doubt, raise it in the PR conversation; don't decide unilaterally.
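The heading choice and the blast-radius question can be written down as one small rule, which makes the second question hard to skip. A minimal sketch, assuming a three-way classification of the change; `ChangelogHeadingChooser`, `ChangeKind`, and `Decision` are hypothetical names for illustration and are not part of OpenSearch or its tooling.

```java
// Hypothetical helper that encodes the heading table as a decision rule.
// None of these types exist in OpenSearch; they only illustrate the reasoning.
final class ChangelogHeadingChooser {

    enum ChangeKind { BUG_FIX, DELIBERATE_BEHAVIOR_CHANGE, NEW_CAPABILITY }

    static final class Decision {
        final String heading;   // CHANGELOG heading to file the entry under
        final boolean flagInPr; // true if the PR should call out user-visible impact

        Decision(String heading, boolean flagInPr) {
            this.heading = heading;
            this.flagInPr = flagInPr;
        }
    }

    static Decision choose(ChangeKind kind, boolean usersMayDependOnOldBehavior) {
        switch (kind) {
            case BUG_FIX:
                // Still "Fixed", but a depended-upon bug gets raised in the PR
                // so a maintainer can decide on a migration note.
                return new Decision("Fixed", usersMayDependOnOldBehavior);
            case DELIBERATE_BEHAVIOR_CHANGE:
                // Assumption: a deliberate change is always worth calling out.
                return new Decision("Changed", true);
            case NEW_CAPABILITY:
                return new Decision("Added", false);
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    public static void main(String[] args) {
        // The aggregation fix, assuming some scripts parse the empty-buckets response.
        Decision d = choose(ChangeKind.BUG_FIX, true);
        System.out.println(d.heading + ", flag in PR: " + d.flagInPr);
    }
}
```

The heading never changes because of the second argument; only the obligation to raise it in the PR does, which mirrors "don't decide unilaterally."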

Documentation-Website Implications

OpenSearch user docs do not live in the engine repo. They live in a separate repo, opensearch-project/documentation-website (the source of opensearch.org/docs). The engine PR and the docs PR are independent.

Decide whether your change needs a docs update:

Your change…	Docs action
Fixes a bug so the API now matches its documented behavior	Usually no docs change — the docs already describe the correct behavior; you just made the code match. Note "no docs change needed; behavior now matches existing docs" in the PR.
Changes documented behavior, a default, or an error message users see	Open a docs PR in `documentation-website`, and link it from the engine PR ("Docs: opensearch-project/documentation-website#MMM").
Adds a setting, API, or parameter	Open a docs PR documenting it. New surface with no docs is incomplete.

For the aggregation fix, the docs already state that min_doc_count: 0 produces empty buckets; the code was wrong, not the docs. So: no docs PR, but say so explicitly. If you do open a docs PR, it follows the same DCO + template flow as the engine PR, in the docs repo.

Responding to Review Rounds

Reviews come in rounds. The mechanics of responding cleanly — without losing review context or making reviewers re-read the whole diff — matter as much as the content of your replies. (The content — tone, when to push back, how to handle disagreement — is the subject of the responding-to-feedback chapter; read it. This section is the GitHub-mechanics half.)

Fixup commits during review, squash at the end

During active review, push separate fixup commits rather than amending and force-pushing — this preserves the per-comment diff so a reviewer can see exactly what changed since their last look. GitHub shows "changes since you last reviewed," which only works if you didn't rewrite history.

# Make the change a reviewer asked for, as its own signed commit.
git commit -s -m "Address review: extract guard into helper for readability"
git push fork fix/NNNN-terms-missing-min-doc-count

When the PR is approved and ready to merge, the maintainer typically squash-merges (OpenSearch squashes PRs into a single commit), so your fixup commits collapse automatically and you usually don't need to squash by hand. If the project asks you to clean history before merge:

git rebase -i origin/main      # mark fixups as 'fixup'/'squash'
git push --force-with-lease fork fix/NNNN-terms-missing-min-doc-count

Warning: Use --force-with-lease, never bare --force. The lease variant refuses to overwrite the remote branch if someone else (a maintainer with edit access to your PR branch, for example) pushed to it since your last fetch. After any force-push, comment to tell reviewers you rewrote history and summarize what changed, or you'll cost them a full re-review.
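One way to picture the difference: a bare force-push is an unconditional write, while a lease is a compare-then-write against the commit you last fetched. The toy model below is only an illustration of that idea, not a description of how Git implements it; `RemoteBranch` and its methods are hypothetical names.

```java
// Toy model of a remote branch ref. This is not how Git is implemented;
// it only illustrates the compare-before-overwrite idea behind --force-with-lease.
final class RemoteBranch {
    private String tipSha;

    RemoteBranch(String initialSha) {
        this.tipSha = initialSha;
    }

    synchronized String tip() {
        return tipSha;
    }

    // Models bare --force: overwrite unconditionally.
    synchronized void forcePush(String newSha) {
        tipSha = newSha;
    }

    // Models --force-with-lease: overwrite only if the remote tip is still the
    // commit we last fetched. Returns false (push rejected) otherwise.
    synchronized boolean forcePushWithLease(String lastFetchedSha, String newSha) {
        if (!tipSha.equals(lastFetchedSha)) {
            return false;
        }
        tipSha = newSha;
        return true;
    }

    public static void main(String[] args) {
        RemoteBranch branch = new RemoteBranch("aaa111");
        String lastFetched = branch.tip();

        // Someone else pushes to the branch after our fetch.
        branch.forcePush("bbb222");

        boolean accepted = branch.forcePushWithLease(lastFetched, "ccc333");
        System.out.println("lease push accepted: " + accepted); // false
        System.out.println("remote tip: " + branch.tip());      // bbb222
    }
}
```

The rejected push is the useful outcome: it tells you to fetch, look at what landed, and rebase before trying again.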

Each comment gets a resolution

For every review comment, do one of two things, visibly:

Push a change that addresses it, then reply "done in <short-sha>" and resolve the thread.
Make the technical case for why the current code is correct — calmly, with reasoning, not "I think mine is fine." If the reviewer agrees, they resolve; if not, you've at least surfaced a real design discussion.

Don't let stylistic comments slide unacknowledged — even a "good catch, fixed" keeps the cadence healthy.

Keep CI Green Through Review

Every push re-runs gradle-check. A change that addresses one reviewer's comment can break a test in a corner you didn't touch. Re-validate locally (Step 7) after every review-driven change before you push. A reviewer who approves a green PR, only to see it go red on your next push, loses confidence fast.

Merge and Backport

When you have the required approval(s) and green CI, a maintainer merges (squash). Then:

The issue auto-closes via Resolves #NNNN. Confirm it did.
The backport bot runs if you applied a backport <branch> label (e.g. backport 2.x). It opens a backport PR against the maintenance branch. Watch it:
- If it applies cleanly, it opens a backport PR automatically — review it, make sure its CI is green, and it merges with maintainer approval.
- If it conflicts, the bot comments that it couldn't cherry-pick. Now you backport by hand: git checkout 2.x && git cherry-pick -x <merge-sha>, resolve conflicts, push a branch, open the backport PR yourself, and link it.
Add the merged commit and backport PRs to the issue/PR cross-links so the whole story is navigable later.

Then say thanks — to the reviewer, in the PR. It costs nothing and it is how communities stay places people want to contribute to. (More on the social layer in community interaction and the release process, which explains which release your fix actually ships in.)

Deliverable for Step 9

The issue is updated (start comment, repro/root-cause pasted, PR linked) and left to auto-close on merge.
The PR description is current to the latest diff; material changes are summarized in comments.
The CHANGELOG entry is under the correct heading (Fixed vs Changed), with the blast-radius question explicitly considered.
Docs implication decided: a documentation-website PR opened and linked, or an explicit "no docs change needed" note.
Review rounds handled with fixup commits (not silent force-pushes), each thread resolved with a substantive reply, CI re-validated after every push.
Post-merge: issue auto-closed, backport PR (bot or manual) green and linked.

Validation / Self-check

Before advancing to Step 10:

At any moment during review, the PR description accurately described the current change, not an earlier draft.
Your CHANGELOG entry is under the heading you can defend, and you considered whether anyone depends on the old behavior.
You made the correct docs call (PR in documentation-website, or a documented "no docs needed") — not silence.
You pushed fixup commits during review and only force-pushed (with --force-with-lease) when asked, always with a summarizing comment.
Every review thread is resolved with a change or a reasoned reply — none left dangling.
CI was green after every push, not just the first.
Post-merge, the issue closed, and any backport label produced a green, linked backport PR (bot or hand-cherry-picked on conflict).

Then go to Step 10: Engineering Write-Up.

Step 10: Engineering Write-Up

The PR is merged. The issue is closed. The backport is green. Most contributors stop here. The ones who become maintainers write the post. The write-up is the artifact that travels with you when you change jobs, request a maintainer nomination, or get cited by another contributor working a similar bug.

Five hundred to a thousand words. Most of it written in the few hours right after merge, while the dead ends are still fresh in your memory.

Why It Matters

Three audiences:

Future you. Six months from now you'll touch TermsAggregator again and want to remember what you tried and why the fix is shaped the way it is.
The next contributor working a similar bug. They'll find your post via search ("OpenSearch terms agg missing bucket min_doc_count") and shortcut a week of work.
The maintainers / TSC evaluating you. A maintainer nomination is a judgment about whether you can communicate engineering reasoning, not just produce diffs. The write-up is the evidence.

A good write-up is not a press release. It is a postmortem: honest about what you tried, including the approaches that failed.

The Template

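Sections in order, with suggested word counts. Treat the counts as ceilings to trim from rather than quotas: the lower bounds alone add up to 750 words, so a write-up near the short end of the 500–1000 target will come in under several of them. A worked excerpt follows the aggregation Capstone bug throughout, so you can see the shape.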

Title (one line)

Fixing #NNNN: <one-line technical summary>

Examples:

"Fixing #NNNN: terms aggregation dropped the missing bucket on empty shards"
"Fixing #NNNN: a ClusterApplierService listener that leaked on coalesced states"
"Fixing #NNNN: a seqno race in InternalEngine under concurrent same-_id updates"

Technical and specific. Not "My first OpenSearch contribution" — write that post separately. The engineering post stands on its own and gets cited; the journey post gets one-time clicks.

Problem (100–150 words)

What broke, for whom, under what conditions. Plain English, but precise. Symptom, trigger condition, affected version range. No code yet.

A terms aggregation requested with min_doc_count: 0 and a missing value would return an empty buckets array instead of the synthetic bucket, whenever at least one queried shard held zero matching documents. The simplest trigger is a single-shard index where the aggregated field exists in the mapping but in no document: the response is "buckets": [] rather than the documented [{ "key": "N/A", "doc_count": 1 }]. Dashboards visualizations that rely on "show empty buckets" silently lost their zero rows. Reproducible on main and on the 2.x line; the behavior regressed in #4567, which optimized the shard-local empty-segment path.

Investigation Log (200–300 words)

The most valuable section. Walk through what you tried, including the hypotheses that were wrong. Three to five hypotheses, in order, each with one sentence on what suggested it and one on what disproved it.

My first hypothesis was that the reduce step (InternalTerms.reduce) was discarding the bucket while merging shards. I added a unit test driving reduce directly with two hand-built shard results, one containing the missing bucket — reduce kept it. That ruled out the coordinating node and pointed back at the shards.

Second hypothesis: the bucket was never produced on the empty shard. TRACE logging on org.opensearch.search.aggregations for a one-doc, one-shard repro confirmed it: the shard-local aggregator emitted zero buckets for the empty-segment case, so reduce had nothing to merge. The bug was upstream of reduce, in shard-local collection.

Reading TermsAggregator.doPostCollection() top to bottom surfaced an empty-segment fast path: if (!hasValuesForField()) return;. A git blame placed it in #4567 — a performance optimization that skipped bucket-building when a segment had no ordinals for the field. The assumption ("no values ⇒ no buckets to emit") is true for min_doc_count >= 1, where empty buckets are discarded anyway, but false for min_doc_count == 0, where the synthetic bucket must be emitted by every shard. git bisect between 2.11.0 (good) and main (bad) confirmed #4567 as the introducing change in four steps.
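The bisect in the excerpt is a binary search for the first bad commit, so the number of builds you pay for grows with the logarithm of the range. A minimal sketch over a linear history, assuming every commit after the first bad one is also bad; `BisectSketch` and the commit names are hypothetical. Four steps correspond to roughly 16 candidate commits, so expect more steps when the range between a release tag and main is wider.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the search git bisect performs over a linear history.
// The first commit is known good, the last is known bad, and every commit
// after the first bad one is assumed bad too (the usual bisect assumption).
final class BisectSketch {
    static int steps; // number of commits tested in the most recent run

    static String firstBad(List<String> commits, Predicate<String> isBad) {
        if (commits.size() < 2) {
            throw new IllegalArgumentException("need a good and a bad endpoint");
        }
        int good = 0;                 // index of a commit known to be good
        int bad = commits.size() - 1; // index of a commit known to be bad
        steps = 0;
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            steps++;
            if (isBad.test(commits.get(mid))) {
                bad = mid;
            } else {
                good = mid;
            }
        }
        return commits.get(bad);
    }

    public static void main(String[] args) {
        // Hypothetical history c0..c16; the regression lands in c11.
        List<String> commits = new ArrayList<>();
        for (int i = 0; i <= 16; i++) {
            commits.add("c" + i);
        }
        Predicate<String> isBad = c -> Integer.parseInt(c.substring(1)) >= 11;

        System.out.println(firstBad(commits, isBad) + " found in " + steps + " steps");
        // prints "c11 found in 4 steps"
    }
}
```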

Root Cause (50–100 words)

One paragraph, the truth as you now understand it. Cite the introducing PR and the bisect commit.

The shard-local terms aggregator short-circuits bucket construction when a segment has no values for the field. That fast path (introduced in #4567, commit a1b2c3d4) is correct for min_doc_count >= 1 but skips the synthetic and missing buckets that min_doc_count == 0 requires every shard to emit. With no shard producing the bucket, the coordinating-node reduce has nothing to merge and the response omits it.

Final Design (150–200 words)

What you changed and why this design over the alternatives. Show the diff size and the principle.

The fix gates the fast path on min_doc_count:
if (!hasValuesForField() && bucketCountThresholds.getMinDocCount() != 0) {
    return;
}
One guard condition in doPostCollection(). Behavior for min_doc_count >= 1 is byte-for-byte identical (the extra clause is true in that case, so the branch is unchanged). The synthetic bucket is now emitted per shard exactly when min_doc_count == 0, restoring the input reduce expects. No serialization change, so no Version guard and no BWC test — only the intended observable change (the bucket reappears). Public API surface unchanged; no new setting. The production diff is two lines plus a comment citing the issue.
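The "identical for min_doc_count >= 1" claim is small enough to check exhaustively. The sketch below models the two guards as plain functions and prints where they disagree; `FastPathGuardCheck` is a hypothetical standalone class, with `hasValues` and `minDocCount` standing in for `hasValuesForField()` and `bucketCountThresholds.getMinDocCount()` from the excerpt.

```java
// Standalone model of the two guards. It does not touch OpenSearch code;
// the parameters stand in for the calls shown in the excerpt.
final class FastPathGuardCheck {

    // Guard before the fix: skip bucket building whenever the segment has no values.
    static boolean oldSkips(boolean hasValues) {
        return !hasValues;
    }

    // Guard after the fix: skip only if, additionally, min_doc_count is not 0.
    static boolean newSkips(boolean hasValues, long minDocCount) {
        return !hasValues && minDocCount != 0;
    }

    public static void main(String[] args) {
        boolean[] hasValuesCases = { true, false };
        long[] minDocCounts = { 0, 1, 2, 10 };
        for (boolean hasValues : hasValuesCases) {
            for (long minDocCount : minDocCounts) {
                boolean before = oldSkips(hasValues);
                boolean after = newSkips(hasValues, minDocCount);
                System.out.printf("hasValues=%b minDocCount=%d old=%b new=%b%s%n",
                    hasValues, minDocCount, before, after,
                    before == after ? "" : "  <-- behavior changes");
            }
        }
    }
}
```

Only the `hasValues=false, minDocCount=0` row is flagged, which is the same scoping argument a negative-control test makes against the real aggregator.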

Alternatives Considered (100–150 words)

Two or three alternatives you rejected, with the reason. This is the section that separates contributor-quality from maintainer-quality write-ups.

Synthesize the bucket at reduce time when min_doc_count == 0 and no shard produced it. Rejected: reduce lacks the per-shard missing-value context to build a correct bucket without risking double-counting; it would push correctness into the merge path that runs for every terms agg.

Revert #4567's fast path entirely. Rejected: it's a real and measured optimization for the common min_doc_count >= 1 case; reverting it regresses performance to fix a narrow parameterization.

Document the combination as unsupported. Rejected: there is a clear correct answer; documenting around a fixable bug taxes every user.

Performance / Behavior Impact (50–100 words)

If perf-relevant, numbers. Otherwise, one honest sentence.

No measurable performance impact. The added clause is a single integer comparison on a path already gated by hasValuesForField(), and it only changes behavior in the min_doc_count == 0 case, which previously did less work incorrectly. The common path is unchanged. Validated by re-running the aggregation microbenchmark suite: no statistically significant delta across 10 runs.
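If you want to back "no statistically significant delta" with a number, one common approach is Welch's t statistic over the before and after runs. This is a sketch of that calculation under the assumption that each run yields one comparable timing; it is not how the project's benchmark tooling reports results, `BenchmarkDelta` is a hypothetical name, and the timings in `main` are made-up placeholders rather than measurements.

```java
// Welch's t statistic for two independent samples (unequal variances allowed).
// Samples need at least two values each and must not both be constant.
final class BenchmarkDelta {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] xs) {
        double m = mean(xs);
        double sumSquares = 0;
        for (double x : xs) {
            sumSquares += (x - m) * (x - m);
        }
        return sumSquares / (xs.length - 1);
    }

    static double welchT(double[] a, double[] b) {
        double standardError = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    public static void main(String[] args) {
        // Placeholder timings for 10 runs each; NOT real benchmark data.
        double[] before = { 101, 99, 100, 102, 98, 100, 101, 99, 100, 100 };
        double[] after  = { 100, 101, 99, 100, 102, 98, 100, 101, 99, 101 };
        System.out.printf("t = %.3f%n", welchT(before, after));
    }
}
```

As a rough guide for two samples of 10, an absolute t below about 2.1 is not significant at the 5% level. Report the statistic alongside the raw means so a reviewer can judge it themselves.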

Lessons Learned (100–150 words)

Three to five bullets, each concrete enough to be reusable by a peer.

Fast-path optimizations are where correctness goes to die. Any if (cheapCheck) return; added "for performance" is a candidate for a missed parameterization. When you find one in blame, ask which inputs it didn't consider.

Bisect against release tags is cheaper than archaeology. Four git bisect steps between 2.11.0 and main found #4567 faster than reading the file's history would have.

Prove the layer before you fix the layer. Driving reduce directly in a unit test ruled out the coordinating node in 20 minutes and saved me from "fixing" the wrong place.

A negative-control test is non-negotiable for behavior fixes — it's the only thing proving your change is scoped, not a blanket flip.

Links

- Issue:               https://github.com/opensearch-project/OpenSearch/issues/NNNN
- PR:                  https://github.com/opensearch-project/OpenSearch/pull/NNNN
- Merged commit:       <SHA>
- Backport (2.x):      https://github.com/opensearch-project/OpenSearch/pull/MMMM
- Introducing PR #4567: <SHA>

Where to Publish

Three venues, in roughly decreasing order of effort and impact.

1. Personal or company engineering blog

The full ~1000-word write-up. SEO-friendly title with the symptom phrase a user would search ("OpenSearch terms aggregation missing bucket"). Link prominently to the issue and PR. This is the version that follows you across jobs and into a maintainer nomination.

2. The OpenSearch community forum

Post a shorter version (300–500 words) on forum.opensearch.org, focused on the lesson and the user-facing behavior, under the relevant category. This reaches operators who hit the symptom and search the forum before the code. Link the blog post for the deep version.

3. The PR itself + a Slack/community-meeting mention

The PR description already carries the engineering reasoning (you wrote it in Step 8). For a non-obvious or widely-felt fix, a one-line mention in the OpenSearch public Slack (opensearch.org/slack) or a community meeting — "fixed a terms-agg empty-bucket bug that affected Dashboards zero-rows, details in #NNNN" — lets maintainers and watchers see the reasoning without reading the whole diff. Optional, earns goodwill, and is how people start to know your name.

Anti-Patterns

What separates write-ups that help from ones that don't:

"I learned so much!" — We know. Cut it. The artifact is the engineering.
Personal narrative dominating the engineering. Save the "my journey into open source" angle for a separate post. Engineering posts get reread and cited.
The sanitized "I knew the answer all along" version. Nobody believes it, and it misleads new contributors into thinking their messy investigation is abnormal. Be honest about the dead ends — they are the work.
No code and no log line. A write-up that never shows the diff or the symptomatic response is unfalsifiable.
No links. Issue, PR, merged commit — three minimum. Without the PR link the write-up is unreviewable.
Padding to look thorough. A tight 600 words that respects the reader beats a 2000-word slog.

Validation / Self-check

Before declaring the Capstone complete:

The write-up is published at a shareable URL (blog, forum, or a public capstone-work/writeup.md), 500–1000 words — not 200 (thin), not 3000 (padding).
The Investigation Log contains at least two hypotheses you ruled out, not only the winning one.
Alternatives Considered names at least two rejected designs with reasons.
Lessons Learned has three to five bullets, each reusable by another contributor.
Issue, PR, and merged-commit SHA are all linked (plus the introducing PR).
The post reads as something a peer engineer would respect: a postmortem, not a victory lap.
You picked a publish venue and actually posted (or have it queued), not "I'll write it later."

Then close the loop with the Evaluation Rubric and grade yourself honestly.

Evaluation Rubric

A 100-point self-grading rubric for the Capstone. Score yourself honestly after you finish Step 10. The scoring is calibrated against what OpenSearch maintainers actually look for on a pull request, not what feels good to read.

The point of the rubric is not the score. It is the diagnostic: a low score on one dimension tells you exactly where to invest the next contribution.

Scoring Dimensions

Seven dimensions, weighted by how much they matter for review outcomes.

#	Dimension	Points
1	Problem articulation	20
2	Execution-path mastery	20
3	Implementation quality	20
4	Testing	15
5	Review responsiveness	10
6	Documentation	10
7	Community interaction	5
	Total	100

1. Problem Articulation (20 pts)

Can you state, in one paragraph, what was broken, for whom, under what conditions?

Score	What it looks like
18-20	Crisp one-paragraph statement covering symptom, trigger conditions, affected version range (`2.x`/`main`), and operational impact (e.g. Dashboards zero-rows). Distinguishes "what the user sees" from "the underlying mechanism." Could be read aloud at a standup and a peer would grasp the bug.
14-17	Clear symptom but trigger conditions vague ("happens sometimes under load"). OR trigger clear but conflates symptom with root cause.
10-13	Reader must ask follow-ups to understand what broke. Uses jargon (`reduce`, `seqno`) without grounding it in user-visible behavior.
5-9	Mostly restates the GitHub issue title. No conditions, no version impact.
0-4	"It was broken and I fixed it."

Look for: the word "intermittent" without a documented trigger; conflation of symptom (empty buckets) with cause (shard-local fast path).

2. Execution-Path Mastery (20 pts)

Did you actually trace the code, or did you guess?

Score	What it looks like
18-20	Step-3 doc maps the full path from REST handler → transport action → shard/engine/coordination with `org.opensearch.*` file:line citations at every layer, pinned to a commit SHA. Includes a mermaid (or arrow) diagram with the bug node annotated. Identifies the exact line where the value flips from correct to wrong, confirmed by a breakpoint or TRACE log. A reviewer could open each file at each line and follow it without asking.
14-17	Most layers cited but one or two skipped ("then it reaches the aggregator"). Diagram present but missing a critical hop.
10-13	Bug location cited correctly but no trace of how execution reached it. No diagram.
5-9	Vague references ("the search service handles it") without file:line.
0-4	No execution-path document, or just a paragraph of prose.

Look for: server/src/main/java/org/opensearch/...-style paths with line numbers pinned to a recorded commit, plus the confirmed observation point.

3. Implementation Quality (20 pts)

Diff hygiene, scope discipline, convention compliance.

Score	What it looks like
18-20	Minimum-diff fix — production change in tens of lines, not hundreds. Every changed line justifiable in one sentence. No drive-by refactors, no opportunistic renames, no reformatting. Public API and wire format unchanged unless required; any serialization change is `Version`-guarded and registered. SPDX headers intact. Thread-safety reasoned (what thread, what lock). `spotlessJavaCheck` and `precommit` green without manual overrides.
14-17	Mostly minimum-diff but one or two stray changes that don't belong. Conventions mostly followed; minor style nits a reviewer would flag.
10-13	Fix works but broader than necessary. Scope creep ("while I was here..."). Conventions inconsistently applied.
5-9	Significant scope creep. Public API or wire format changed unnecessarily/unguarded. Style violations would block `precommit`.
0-4	Diff so large reviewers would ask it be broken up. OR breaks BWC/public API silently.

Look for: scope-creep tells via git diff origin/main --stat showing files unrelated to the bug; an unguarded writeTo/StreamInput change.

4. Testing (15 pts)

Coverage, determinism, regression value.

Score	What it looks like
14-15	New test reproduces the bug deterministically (red on `main`, green with fix) at the lowest viable level. A negative control proves the fix is scoped. Every trigger condition encoded. Serialization/BWC test present iff the wire changed. No `Thread.sleep`, no wall-clock, no order-dependent assertions; uses `assertBusy`/`ensureGreen`. Ran ≥ 50 iters and multiple seeds without a flake.
11-13	Test present and deterministic but no negative control. OR an integration test where a unit test would do, with a weak unit test.
7-10	Test uses `Thread.sleep` or is otherwise non-deterministic. Coverage of the fix path incomplete.
3-6	Test only checks the happy path; would have passed before the fix.
0-2	No new tests, or tests that fail on both `main` and the fix.

Look for: assertBusy(...) rather than Thread.sleep; a scenario-named test (testMissingWithMinDocCountZeroOnEmptyShard) not a method-named one (testReduce).
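The reason the rubric prefers `assertBusy` over `Thread.sleep` is that polling passes as soon as the condition holds and fails only at a deadline, while a fixed sleep is a guess that is either too short (flaky) or too long (slow). The sketch below is a simplified stand-in for that idea, not the OpenSearch test-framework implementation; `PollingAssert.awaitTrue` is a hypothetical helper.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Simplified illustration of poll-until-true with a deadline.
// This is NOT the assertBusy implementation from the OpenSearch test framework.
final class PollingAssert {

    static void awaitTrue(BooleanSupplier condition, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        long backoffMillis = 1;
        while (true) {
            if (condition.getAsBoolean()) {
                return; // passes as soon as the condition holds
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError("condition not met within " + timeout + " " + unit);
            }
            try {
                // A short pause between polls, not a guess at how long the work takes.
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", e);
            }
            backoffMillis = Math.min(backoffMillis * 2, 100);
        }
    }

    public static void main(String[] args) {
        AtomicInteger applied = new AtomicInteger();
        Thread worker = new Thread(() -> applied.set(1)); // hypothetical async work
        worker.start();

        awaitTrue(() -> applied.get() == 1, 10, TimeUnit.SECONDS);
        System.out.println("condition observed");
    }
}
```

In a real OpenSearch test, use the framework's own helper rather than a copy like this one; the sketch only shows why the pattern is deterministic where a bare sleep is not.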

5. Review Responsiveness (10 pts)

How well you ran the review cycle.

Score	What it looks like
9-10	Every reviewer comment addressed in code or with a substantive reply. Iteration cadence < 48h on most comments. Disagreements made the technical case without defensiveness. Pushed fixup commits during review (no silent force-push); when a force-push was needed, used `--force-with-lease` and summarized the change. Updated the PR description after material changes so the top stays accurate. CI re-validated and green after every push.
7-8	Addresses comments correctly but slowly (multi-day gaps). OR lets a few stylistic comments slide without acknowledgement.
5-6	Defensive on at least one comment ("but my way is fine"). OR force-pushed without summarizing for reviewers.
2-4	Required multiple reminders. Comments not addressed cleanly. Broke CI on a later push and didn't notice.
0-1	PR went silent > 2 weeks without explanation, or argued every comment.

Look for: review threads resolved by the contributor with a substantive pushed commit, not just a reply; green CI on every revision, not only the first.

6. Documentation (10 pts)

Issue/PR hygiene, CHANGELOG, code comments, write-up.

Score	What it looks like
9-10	CHANGELOG entry under the correct heading (`Fixed` vs `Changed`), user-facing, PR-linked. PR uses `Resolves #NNNN`. Issue updated with repro + root cause. Docs implication resolved (a `documentation-website` PR linked, or an explicit "no docs change needed"). In-code comment cites `#NNNN` at the non-obvious line. Write-up exists at a public URL.
7-8	CHANGELOG present but wrong heading, or docs implication unaddressed on a user-visible change. Code comments present but don't cite the issue.
5-6	Mechanics followed but fields incomplete. No write-up beyond the PR description.
2-4	CHANGELOG missing or DCO/CHANGELOG checks red. Comments absent at surprising lines.
0-1	No CHANGELOG, no issue link, no write-up.

Look for: the CHANGELOG line under a heading you can defend, and an explicit docs decision rather than silence.

7. Community Interaction (5 pts)

Forum/Slack/issue etiquette, claiming and handoff hygiene.

Score	What it looks like
5	Commented on the issue before starting ("working on this, PR incoming"). Requested the right reviewers (area maintainers from `MAINTAINERS.md` + the introducing-PR author). Posted to the forum/Slack only when meaningful (a design question, a summary after merge). Thanked reviewers explicitly. If stuck, posted clearly: "stuck on X, considering A/B/C, leaning A because Y."
3-4	Mostly good etiquette; one minor slip (claimed late, or one low-signal forum post).
1-2	Did not signal intent before working. OR sent low-signal channel traffic ("does anyone know...?").
0	Worked silently for weeks, then dropped a PR with no context and no reviewers requested.

Look for: an issue comment by the contributor before the first push, and an explicit thank-you to reviewers in the PR.

Tier Thresholds

Where you land tells you what to do next.

Score	Tier	Interpretation
95-100	TSC-track	The quality that, sustained across many contributions over months, puts you on the path the Technical Steering Committee notices. You operate at the level the project would trust to steer an area.
90-94	Maintainer-ready	You write PRs at maintainer quality. With several such contributions across areas over 6–12 months plus demonstrated review participation on others' PRs, a `MAINTAINERS.md` nomination is plausible.
80-89	Credible contributor	A reliable contributor whose PRs need minimal review iteration. Keep building the track record; this is where maintainers actively look forward to reviewing your work.
65-79	Contributor	Solid bug-fix-grade work. PRs land with normal review iteration. Most contributions to most projects live here, and it is honorable work.
50-64	Learning	PRs eventually land but with significant reviewer guidance. Use the next contribution to focus on your lowest dimension.
< 50	Foundational gap	The PR may have merged, but the process skipped enough corners that a maintainer paid a tax. Restart with a smaller bug and apply the rubric end to end.

The tier is not a personality assessment. It is calibrated to the artifact you produced for this one Capstone. The same person can score 65 on one contribution and 95 on the next.
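The arithmetic is simple enough to do by hand, but writing it down removes any temptation to round up at a tier boundary. A minimal sketch using the maximums from the Scoring Dimensions table and the cut-offs from the tier table; `CapstoneRubric` is a hypothetical class and the scores in `main` are made-up examples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Totals per-dimension scores and maps the total to a tier.
// Maximums and thresholds are copied from the rubric tables.
final class CapstoneRubric {
    static final Map<String, Integer> MAX_POINTS = new LinkedHashMap<>();
    static {
        MAX_POINTS.put("Problem articulation", 20);
        MAX_POINTS.put("Execution-path mastery", 20);
        MAX_POINTS.put("Implementation quality", 20);
        MAX_POINTS.put("Testing", 15);
        MAX_POINTS.put("Review responsiveness", 10);
        MAX_POINTS.put("Documentation", 10);
        MAX_POINTS.put("Community interaction", 5);
    }

    static int total(Map<String, Integer> scores) {
        int sum = 0;
        for (Map.Entry<String, Integer> dimension : MAX_POINTS.entrySet()) {
            Integer score = scores.get(dimension.getKey());
            if (score == null || score < 0 || score > dimension.getValue()) {
                throw new IllegalArgumentException("missing or out-of-range score: " + dimension.getKey());
            }
            sum += score;
        }
        return sum;
    }

    static String tier(int total) {
        if (total >= 95) return "TSC-track";
        if (total >= 90) return "Maintainer-ready";
        if (total >= 80) return "Credible contributor";
        if (total >= 65) return "Contributor";
        if (total >= 50) return "Learning";
        return "Foundational gap";
    }

    public static void main(String[] args) {
        // Made-up example scores for one Capstone.
        Map<String, Integer> scores = new LinkedHashMap<>();
        scores.put("Problem articulation", 16);
        scores.put("Execution-path mastery", 14);
        scores.put("Implementation quality", 18);
        scores.put("Testing", 11);
        scores.put("Review responsiveness", 8);
        scores.put("Documentation", 7);
        scores.put("Community interaction", 4);

        int total = total(scores);
        System.out.println(total + " -> " + tier(total)); // prints "78 -> Contributor"
    }
}
```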

How to Self-Grade

Block 30 minutes. Open this rubric beside your own artifacts: the issue, the PR, the diff, capstone-work/root-cause.md, capstone-work/execution-path.md, capstone-work/validation.md, and the write-up. Score each dimension by reading the band descriptions and picking the one that most honestly matches what you produced.

Two rules:

No interpolation upward. Between 14 and 17 and unsure? Score 14. The optimist's tax.
One independent reviewer. Ask a peer (ideally another contributor) to score independently on this rubric. If your scores differ by more than 10 points on any dimension, talk about it — the difference is where the calibration lives.

Record both scores in capstone-work/self-grade.md, with one sentence per dimension on what would have moved the score up one band. This becomes the input for the next contribution's plan.
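Comparing the two score sheets is mechanical, so a few lines can pick out the dimensions to talk through. A minimal sketch, assuming both sheets use the same dimension names; `PeerCalibration` is a hypothetical class and the scores are made-up. The threshold is a parameter because a gap above 10 is only reachable on the 15- and 20-point dimensions, so you may want a tighter one for the smaller dimensions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Lists the dimensions where self and peer scores differ by more than a threshold.
final class PeerCalibration {

    static List<String> dimensionsToDiscuss(Map<String, Integer> self,
                                            Map<String, Integer> peer,
                                            int threshold) {
        List<String> toDiscuss = new ArrayList<>();
        for (Map.Entry<String, Integer> mine : self.entrySet()) {
            Integer theirs = peer.get(mine.getKey());
            if (theirs == null) {
                throw new IllegalArgumentException("peer did not score: " + mine.getKey());
            }
            if (Math.abs(mine.getValue() - theirs) > threshold) {
                toDiscuss.add(mine.getKey());
            }
        }
        return toDiscuss;
    }

    public static void main(String[] args) {
        // Made-up example scores for three of the seven dimensions.
        Map<String, Integer> self = new LinkedHashMap<>();
        self.put("Problem articulation", 18);
        self.put("Execution-path mastery", 15);
        self.put("Testing", 12);

        Map<String, Integer> peer = new LinkedHashMap<>();
        peer.put("Problem articulation", 7);
        peer.put("Execution-path mastery", 14);
        peer.put("Testing", 11);

        System.out.println(dimensionsToDiscuss(self, peer, 10)); // prints "[Problem articulation]"
    }
}
```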

What to Do With a Low Score

Lowest dimension	Next contribution focus
Problem articulation	Pick a smaller, sharper bug. Write the one-paragraph statement before you comment on the issue, and post it for feedback.
Execution-path mastery	Pick a bug in a layer you've never traced (you did search/agg; now do coordination or the engine). Write the path doc before reading the existing tests.
Implementation quality	Pick a bug whose minimum fix is < 10 lines. Practice leaving surrounding code untouched.
Testing	Pick a `flaky-test`-labeled issue. The whole bug is testing discipline.
Review responsiveness	Pick a bug in a high-traffic area where you'll get more reviewers. Set a 24h SLA on every comment.
Documentation	Pick a fix that needs a `documentation-website` PR. Write the docs change before the code is done.
Community interaction	Review three other contributors' PRs substantively before opening your next one.

Validation / Self-check

Before declaring the Capstone done:

capstone-work/self-grade.md exists with a score per dimension and a total.
The total is honest, not aspirational — you can defend each score with a citation to your own artifact.
At least one independent reviewer also scored, and disagreements > 10 points on any dimension were discussed.
The lowest dimension is identified and the next contribution's focus is written down.
The score is recorded where you'll see it again in three months.
You understand that the tier label ("Contributor", "Maintainer-ready") describes this one piece of work, not you.
You have a candidate next bug picked, with the focus dimension in mind.

Capstone Project Portfolio

The guided Capstone walks you through one contribution end-to-end: pick a real open issue, reproduce it, trace it, fix it, land it, write it up. That is the minimum viable contributor experience — one PR, scaffolded by ten step-chapters that hold your hand through the workflow.

This section is the next thing. It is a portfolio of eight larger, open-ended project briefs, each grounded in a real OpenSearch / k-NN / Lucene issue or RFC. Where the guided Capstone gives you a process with a small bug poured into it, these briefs give you a meaningful slice of an open engineering problem and ask you to scope, design, build, and (for several) land it upstream. They are deliberately harder, deliberately less scaffolded, and deliberately aimed at the advanced layer this expansion is about: vectors, scale, Lucene internals, and the search engine under real load.

You do not do all eight. You do one or two, well. A single finished brief — a designed, tested, benchmarked, upstreamed change — is worth more on a contributor track than a dozen drive-by typo fixes. The portfolio exists so you can choose a problem that matches the muscle you want to build next.

Note: Several of these are explicitly "land it upstream" projects. Project 1 (k-NN), Project 4 (Apache Lucene), Project 5 (star-tree), and Project 8 (segment-replication observability) each have a credible path to a real merged PR in opensearch-project/k-NN, opensearch-project/OpenSearch, or apache/lucene. Projects 3 and 6 can also land upstream, as a benchmark-backed change and as tooling respectively. Projects 2 and 7 are "build the capability locally to maintainer quality, then decide with the area maintainers (RFC first) whether to upstream the scoped slice." Both kinds are real work. Neither is busywork.

What this is (and what it is not)

It is	It is not
Eight real-issue-grounded briefs, each a multi-week engineering project.	A second set of guided tutorials. There is no step-by-step.
A menu — pick by difficulty, subsystem, and the skill you want to grow.	A checklist to complete all of. Do one or two, deeply.
Calibrated to the same evaluation rubric.	Graded differently. The same 100 points apply.
A bridge from "I read the codebase" to "I shipped a non-trivial feature."	A guarantee of a merge. Upstreaming is a negotiation, not a deliverable you control alone.

Each brief follows a consistent shape so you can compare and plan:

Problem & motivation — what is broken or missing, and why it matters.
Real-world grounding — the actual issue/RFC, linked by full URL.
Subsystems you'll touch — the org.opensearch.* / org.apache.lucene.* / org.opensearch.knn.* classes and the deep dives that cover them.
Phased plan — Phase 1..N, from a small scoped slice you can finish in a weekend to the full feature, with concrete tasks, grep targets, and code sketches.
Deliverables — a - [ ] checklist.
Difficulty & time.
Stretch goals.
Evaluation — tied back to the rubric.
How to turn this into a real contribution — the upstreaming path.

The eight projects

#	Project	Repo	Subsystem	Difficulty	Skills it builds	Land upstream?
1	Disk-based quantization mode for k-NN	k-NN	field mapper, engine/method registration, codec, native memory, rescoring	Hard	Vector storage, plugin internals, codec SPI	Yes (scoped slice)
2	Vector-aware allocation decider	OpenSearch	`AllocationService`, `AllocationDeciders`, `RoutingAllocation`, k-NN native stats	Hard	Cluster coordination, allocation, cross-plugin stats	Maybe (RFC first)
3	Optimizing concurrent segment search slicing	OpenSearch	slice strategy, `CollectorManager`, search threadpool, JMH/OSB	Hard	Search execution, parallelism, benchmarking	Yes (with numbers)
4	Upstream Apache Lucene HNSW contribution	apache/lucene	`hnsw`, `codecs/lucene99`, `util/VectorUtil`	Hard	Lucene internals, SIMD, ASF process	Yes (the whole point)
5	Star-tree aggregation resolution slice	OpenSearch	composite index, aggregation rewrite, query path	Very hard	Aggregations, precomputation, query rewrite	Yes (scoped slice)
6	k-NN recall/latency benchmark harness	k-NN	benchmarking, recall metrics, OSB	Medium	Measurement discipline, recall@k, reproducibility	Yes (tooling)
7	Search backpressure signal	OpenSearch	search backpressure, task cancellation, admission control	Hard	Resiliency, resource accounting, task framework	Maybe (RFC first)
8	Segment-replication observability	OpenSearch	segrep stats, transport, REST `_cat`/stats	Medium-Hard	Remote store, replication, observability	Yes (stats/metrics)

Projects 1–4 are written out in full in this section. Projects 5–8 are sketched in the table and have their own briefs (project-05-… through project-08-…); they share the same nine-part shape. If you want one and the brief is not yet expanded, the grounding RFCs are in Real Issues and RFCs.

Warning: "Difficulty" here is engineering difficulty, not how mergeable the result is. Project 6 (benchmark harness) is "Medium" to build but extremely mergeable; Project 5 (star-tree) is "Very hard" and a scoped slice of it is the only realistic upstream target. Read the "How to turn this into a real contribution" section of any brief before you start, so you scope to what can actually land.

How to pick one

Pick along three axes, in this order:

The subsystem you want to own. If you finished the k-NN chapters and want vectors, take 1 or 6. If you came through shard allocation and coordination, take 2. If search execution and parallelism is your thing, take 3. If you want to touch Apache Lucene directly, take 4.
The skill the rubric says is your weakest. The rubric's "What to Do With a Low Score" table maps a weak dimension to a kind of next contribution. Weak on testing? Project 6 is almost entirely measurement discipline. Weak on execution-path mastery? Project 3 forces you to trace the slicing decision through Lucene and back. Weak on implementation quality? Project 1's Phase 1 is a deliberately tiny, minimum-diff slice.
Whether you want a merge or a capability. Want a real merged PR on your GitHub profile in 4–6 weeks? Take 4 (Lucene), 6 (k-NN tooling), or the scoped Phase-1 slice of 1. Want to build a hard capability locally and possibly RFC it? Take 2 or 7.

Do not pick by "which sounds most impressive." The impressive thing is a finished brief with numbers and a clean diff, whatever the topic.

The shared deliverable shape

Every brief, regardless of repo, produces the same core artifacts — the same ones the guided Capstone demands, adapted for a feature rather than a bug fix:

A design note (capstone-work/design.md in your fork): problem, constraints, the approach you chose, the approaches you rejected and why. For anything touching wire format, public API, or a new setting, this is what you would post to an RFC or the issue thread before writing code.
A scoped, minimum-diff implementation on a feature branch. Even a feature is a sequence of small, reviewable commits — not one 2,000-line drop.
Tests at the lowest viable level: unit tests (OpenSearchTestCase / Lucene's LuceneTestCase), and integration tests (OpenSearchIntegTestCase / InternalTestCluster) only where behavior is genuinely end-to-end. Add serialization round-trip tests (AbstractWireSerializingTestCase) only if you changed the wire format.
Numbers, where the project is about performance (3, 6, parts of 1): a JMH microbenchmark and/or an OpenSearch Benchmark macro run, before/after, with the recall metric stated for any ANN change.
A validation report (capstone-work/validation.md): the ./gradlew check / precommit output, the test commands, the seeds you ran.
An upstreaming decision: either a real PR (DCO-signed for OpenSearch/k-NN; ASF-style for Lucene) or a written "here is the scoped slice I would propose, and here is the issue/RFC comment I would open it under."
A write-up (500–1000 words): the same engineering write-up the guided Capstone asks for, because the investigation is the durable artifact.

Note: The DCO / git commit -s / CHANGELOG.md / backport-bot model from the guided Capstone applies unchanged to OpenSearch and k-NN PRs. Apache Lucene is different — it uses a CHANGES.txt entry and the ASF's PR conventions, no DCO sign-off line. Project 4 covers that difference in detail.

How these map to the rubric

You self-grade every project against the same 100-point evaluation rubric you used for the guided Capstone. The seven dimensions translate cleanly from "bug fix" to "feature":

Rubric dimension	For a portfolio project, this means
Problem articulation (20)	The design note states the gap, the affected users, and the constraint that makes it hard — not just "vectors use a lot of memory."
Execution-path mastery (20)	You traced the existing path (the codec write, the allocation round, the slicing decision) with file:line citations before you changed it.
Implementation quality (20)	The feature is a sequence of minimum-diff commits; public API / wire format changes are `Version`-guarded and justified; no scope creep beyond the scoped slice.
Testing (15)	Lowest-viable-level tests, deterministic, with a negative control; benchmarks where the claim is "faster"; recall stated for ANN.
Review responsiveness (10)	If you upstreamed: the review cadence on the real PR. If not: a self-review and a peer review against this rubric.
Documentation (10)	Design note, `CHANGELOG.md` / `CHANGES.txt`, the write-up, and an explicit docs decision.
Community interaction (5)	You posted the design to the issue/RFC thread before building, and requested the right reviewers from `MAINTAINERS.md`.

The tier thresholds are identical. A finished portfolio project at 90+ is maintainer-grade work in an advanced area — exactly the track record that, sustained, gets noticed. Score yourself honestly; the rubric's two rules (no interpolation upward, one independent reviewer) still bind.

Before you start any brief

You finished the guided Capstone, or you are confident you can run its workflow (reproduce → trace → fix → test → PR) without the step-chapters open.
You can build the relevant repo from source:
- OpenSearch: ./gradlew assemble (covered throughout Levels 1–9).
- k-NN: the native JNI build via CMake — see k-NN native integration and the lab-k1-build-knn-from-source lab.
- Apache Lucene: ./gradlew check on apache/lucene — see the Lucene contribution lab.
You read the brief's "How to turn this into a real contribution" section first, so you scope Phase 1 to something that can actually land or be RFC'd.

Then pick one. Project 1 is the natural start if you came through the vector chapters; Project 4 is the natural start if you want a real upstream merge with the least cross-plugin surface area.

Project 1: A Disk-Based Quantization Mode for k-NN

Vectors are expensive. A single 768-dimensional float[] is 3 KB; a hundred million of them is ~300 GB of raw data, and the faiss engine loads its HNSW graph outside the JVM heap into native memory before it can answer a query. The whole "quantization and disk-ANN" story in the k-NN plugin exists to make that affordable: store vectors in fewer bytes, keep less in RAM, and pay back the lost precision with a rescoring pass. This project asks you to implement — or meaningfully extend — a quantization / disk-ANN capability in the k-NN plugin, starting from a slice you can finish in a weekend and ending at something a maintainer would review.

This is the natural first portfolio project if you came through the vector chapters. It is also the one with the cleanest "land a scoped slice upstream" path, because the validation and stats surface is small, self-contained, and exactly the kind of thing maintainers merge.

Note: Read Quantization and Disk-ANN and k-NN Native JNI and Memory before you start. This brief assumes you know what on_disk mode, compression_level, the FP16/PQ/BQ knobs, and the native-memory circuit breaker are. It will not re-derive them.

Problem & motivation

The k-NN plugin already ships a rich storage menu: byte vectors (2.17+), FP16 scalar quantization, Product Quantization (PQ), Binary Quantization (BQ), and a disk-based mode (mode: on_disk) with compression_level ∈ {1x, 2x, 4x, 8x, 16x, 32x} and rescoring on the full-precision vectors. The memory story is the whole reason vectors are usable at scale.

But "rich menu" hides sharp edges:

Validation is thin and scattered. Which (engine, space_type, mode, compression_level, data_type) combinations are actually legal? Try a few illegal ones (on_disk with an engine that does not support it, a compression_level the engine cannot honour, a binary data_type with a quantization that assumes floats) and you will get inconsistent errors — sometimes a clean 400, sometimes a confusing native-side failure on first query, sometimes silent acceptance of a combination that quietly degrades recall.
The cost of a mode is invisible. A knn_vector field in on_disk mode at 8x compression has a wildly different native-memory footprint than the same field at 1x. The GET /_plugins/_knn/stats surface does not cleanly tell an operator "this field is quantized this way, costing roughly this much."
Adding a new mode is a deep, multi-file change that most contributors never attempt, because the field mapper, the engine/method registration, the codec, and the native build are all involved and there is no single "here is how a mode flows end to end" map.

This project makes one of those better. The phased plan starts at validation + a stats field (genuinely mergeable, low blast radius) and builds toward a real feature.

Real-world grounding

The grounding issue is the meta-issue for the vector engine's evolution and the quantization roadmap:

[META] Supporting a New Vector Engine in OpenSearch — k-NN #2605: https://github.com/opensearch-project/k-NN/issues/2605

That meta-issue frames how new storage/engine capabilities get added — exactly the surface you are touching. Around it sits the broader memory and disk-ANN work:

The native-memory circuit-breaker rearchitecture discussion — k-NN #1582: https://github.com/opensearch-project/k-NN/issues/1582
A real, concrete circuit-breaker config bug — k-NN #585: https://github.com/opensearch-project/k-NN/issues/585

Citation discipline: the disk-ANN / compression-level feature set evolves fast. Do not cite an issue number you have not opened. Instead, run a live search before you scope: is:issue is:open label:"vector indexing" disk OR quantization OR compression and is:issue is:open on_disk in the opensearch-project/k-NN repo. Link the current issue in your design note. The meta-issue above is the durable anchor.

Subsystems you'll touch

Subsystem	Class / area (grep to confirm names per version)	What it owns
Field mapper	`org.opensearch.knn.index.mapper.KNNVectorFieldMapper`, `KNNVectorFieldType`, the `*FieldMapper` builder	Parses the `knn_vector` mapping; validates `dimension`, `space_type`, `method`, `mode`, `compression_level`, `data_type`
Engine / method registration	`org.opensearch.knn.index.engine.KNNEngine` (FAISS/LUCENE/NMSLIB), `KNNMethod`, `MethodComponent`, `KNNMethodConfigContext`, `Encoder`	Declares which methods/encoders/parameters each engine supports, and validates a resolved config
Mode / compression resolution	`org.opensearch.knn.index.Mode`, `CompressionLevel` (grep — names vary), the resolver that maps `(mode, compression_level)` → encoder params	Turns the user-facing knobs into concrete encoder/quantization parameters
Codec	`org.opensearch.knn.index.codec.*` (`KNN990Codec` or whichever versioned `KNN<NNN>Codec` is current, `KNNCodecVersion`), the per-field `KnnVectorsFormat` wiring	Writes the graph + quantized vectors into segment files
Native memory & stats	`org.opensearch.knn.index.memory.NativeMemoryCacheManager`, `NativeMemoryAllocation`, the `KNNStats` / `StatNames` registry behind `GET /_plugins/_knn/stats`	Loads native indexes, accounts memory, exposes stats
Rescoring	the query-path rescore step (grep `rescore` under `org.opensearch.knn.index.query`)	Re-ranks `on_disk` candidates against full-precision vectors

Deep dives that cover the surrounding ground: k-NN engines · algorithms (HNSW/IVF/PQ) · native JNI and memory · quantization and disk-ANN · k-NN query path · the Lucene HNSW format the lucene engine reuses.

Phased plan

The discipline of this project is that Phase 1 is independently mergeable. You do not need to finish Phases 3–4 to have shipped something real.

Phase 0 — Build it and map the mode path (½ day)

Build k-NN from source (CMake JNI build + Gradle) per lab-k1-build-knn-from-source, then trace how a mode/compression_level mapping turns into encoder parameters.

# In your k-NN clone:
./gradlew build -x test          # full build incl. native (or buildKNNLib first; see lab-k1)
./gradlew run                    # single node with the plugin installed, REST on :9200

# Create a disk-mode field and watch it work:
curl -s -X PUT localhost:9200/disk-test -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true },
  "mappings": { "properties": { "v": {
    "type": "knn_vector", "dimension": 8,
    "space_type": "l2",
    "mode": "on_disk", "compression_level": "8x"
  }}}
}' | python3 -m json.tool

Grep the path from mapping to encoder:

grep -rn "on_disk\|CompressionLevel\|compression_level\|\bMode\b" src/main/java/org/opensearch/knn/index | head -40
grep -rn "class KNNVectorFieldMapper\|parseCreateField\|TypeParser" src/main/java/org/opensearch/knn/index/mapper
grep -rn "supportedMethods\|MethodComponent\|Encoder\b" src/main/java/org/opensearch/knn/index/engine | head -40

Write a one-page capstone-work/mode-path.md: mapping JSON → KNNVectorFieldMapper.Builder → method/encoder resolution → codec field config → native build. With file:line citations. This is your execution-path-mastery artifact and you cannot skip it.

Phase 1 — Tighten validation (the scoped, mergeable slice)

Pick one illegal (engine, mode, compression_level, data_type, space_type) combination that currently fails late, silently, or confusingly, and make it fail early, clearly, and deterministically at mapping-creation time with a 400 and a precise message.

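// In the resolver/validation path (grep for where the resolved config is checked).
// supportsMode, getSupportedModes, and supportsCompression are illustrative helper
// names; use whatever the engine/method registration exposes in your version.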
if (mode == Mode.ON_DISK && !knnEngine.supportsMode(Mode.ON_DISK)) {
    throw new MapperParsingException(
        "Engine [" + knnEngine.getName() + "] does not support mode [on_disk]; "
        + "supported modes: " + knnEngine.getSupportedModes());
}
if (compressionLevel != CompressionLevel.x1
        && !knnEngine.supportsCompression(compressionLevel, dataType)) {
    throw new MapperParsingException(
        "Engine [" + knnEngine.getName() + "] does not support compression_level ["
        + compressionLevel.getName() + "] for data_type [" + dataType + "]");
}

Then a unit test that asserts the bad mapping is rejected and a good one is accepted:

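// KNNVectorFieldMapperTests (OpenSearchTestCase-based). mapping(...) and createDocumentMapper(...)
// follow core's MapperServiceTestCase helpers; adapt to the helpers your test base class provides.
public void testOnDiskUnsupportedEngineRejected() throws IOException {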
    XContentBuilder mapping = mapping(b -> b.startObject("v")
        .field("type", "knn_vector").field("dimension", 8)
        .field("mode", "on_disk")
        .startObject("method").field("name","hnsw").field("engine","nmslib").endObject()
        .endObject());
    MapperParsingException e = expectThrows(MapperParsingException.class,
        () -> createDocumentMapper(mapping));
    assertThat(e.getMessage(), containsString("does not support mode [on_disk]"));
}

Run the gates:

./gradlew spotlessApply
./gradlew :test --tests "org.opensearch.knn.index.mapper.KNNVectorFieldMapperTests"
./gradlew check -x integTest    # the broad gate; integTest needs the native lib

Why this is the right Phase 1: it is a minimum diff, it is exactly the kind of hardening maintainers merge without an RFC, and it forces you to read the resolver for real. A clean validation PR with a deterministic test is a credible first k-NN contribution.
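One way to keep that check deterministic is to drive it from a single support table rather than scattered conditionals. The following is a minimal, self-contained sketch of the idea, not the plugin's code: the `Engine` and `Mode` enums, the `SUPPORTED` table, and its entries are hypothetical placeholders (only the nmslib/on_disk rejection mirrors the test above), and `IllegalArgumentException` stands in for `MapperParsingException`.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: these enums and the table below are placeholders,
// not the k-NN plugin's real types or its real support matrix.
final class ModeValidationSketch {
    enum Engine { FAISS, NMSLIB }
    enum Mode { IN_MEMORY, ON_DISK }

    // Single source of truth: which modes each engine accepts (illustrative values).
    static final Map<Engine, Set<Mode>> SUPPORTED = Map.of(
        Engine.FAISS, EnumSet.of(Mode.IN_MEMORY, Mode.ON_DISK),
        Engine.NMSLIB, EnumSet.of(Mode.IN_MEMORY));

    // Rejects an illegal pair up front, naming the engine, the mode, and the legal set.
    static void validate(Engine engine, Mode mode) {
        Set<Mode> legal = SUPPORTED.get(engine);
        if (!legal.contains(mode)) {
            throw new IllegalArgumentException(
                "Engine [" + engine + "] does not support mode [" + mode
                    + "]; supported modes: " + legal);
        }
    }

    public static void main(String[] args) {
        validate(Engine.FAISS, Mode.ON_DISK); // accepted silently
        try {
            validate(Engine.NMSLIB, Mode.ON_DISK);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the table is reviewability: a maintainer can read the legal combinations in one place, and the unit test can iterate over every pair instead of hand-picking cases.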

Phase 2 — Surface the cost in stats

Add a per-field (or per-index) view to GET /_plugins/_knn/stats (or a _cat-style helper) that reports the quantization mode and an estimated native-memory footprint for each knn_vector field, so an operator can see what a field costs.

grep -rn "class KNNStats\|StatNames\|enum.*Stat\|register" src/main/java/org/opensearch/knn/plugin/stats
grep -rn "graphMemoryUsage\|NativeMemoryAllocation\|getSizeInKB\|cacheStats" src/main/java/org/opensearch/knn/index/memory

Add a stat name, populate it from NativeMemoryCacheManager, and round-trip it through the existing stats transport action. Test with KNNStatsTests (or the stats response serialization test) and a small integration test that creates two fields at different compression levels and asserts the reported footprints differ in the expected direction.
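For the "estimated footprint" half of the stat, a nominal figure is easy to compute from the mapping alone. The sketch below assumes float32 source vectors and treats `compression_level` as a plain divisor (8 for `8x`); it ignores graph links and per-segment overhead, so read it as a lower bound rather than what `NativeMemoryCacheManager` will actually report. The class and method names are hypothetical.

```java
// Hypothetical helper: nominal vector storage by compression level.
final class FootprintEstimateSketch {
    // Raw float32 storage: 4 bytes per dimension per vector.
    static long rawVectorBytes(long numVectors, int dimension) {
        return numVectors * dimension * Float.BYTES;
    }

    // Nominal quantized size: raw bytes divided by the compression factor.
    // Excludes HNSW links and other overhead, so it underestimates real usage.
    static long estimatedBytes(long numVectors, int dimension, int compressionFactor) {
        if (compressionFactor < 1) {
            throw new IllegalArgumentException("compression factor must be >= 1");
        }
        return rawVectorBytes(numVectors, dimension) / compressionFactor;
    }

    public static void main(String[] args) {
        long n = 100_000_000L;
        int dim = 768;
        for (int factor : new int[] {1, 2, 4, 8, 16, 32}) {
            System.out.printf("%2dx: %,d bytes%n", factor, estimatedBytes(n, dim, factor));
        }
    }
}
```

Reporting the estimate next to the measured, loaded figure is what lets the integration test assert that two fields at different compression levels differ in the expected direction.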

Phase 3 — A real mode/encoder extension

Now the feature. Choose one, scoped tightly, and write a design note first:

Option	Touches	Why it is real
Expose a not-yet-exposed compression level / encoder param the underlying engine already supports	resolver + mapper + codec field config	Mostly wiring; recall/latency measurable
A better rescore-candidate heuristic for `on_disk` (`oversample` factor tuning)	query-path rescore	Directly moves the recall/latency curve
A new validated `(mode, data_type)` pairing with correct codec field config	mapper + engine registration + codec	The full "add a capability" loop

Whatever you pick: round-trip it through a real index, and prove recall (next phase).

Phase 4 — Prove recall and latency

A quantization change is meaningless without a recall number. Use the k-NN benchmark approach: index a known dataset, run a fixed query set, compute recall@k against an exact (1x / flat) baseline, and report before/after for your change.

config            recall@10   p50 query (ms)   native mem (MB)
1x (baseline)        1.000          18              3100
on_disk 8x (old)     0.91x          21               420
on_disk 8x (yours)   0.9xx          2x               4xx

Recall is the headline metric for any ANN change. A latency win that drops recall below the mode's documented floor is not a win — it is a bug.
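The recall@k computation itself is small. A minimal sketch, assuming you already have, for each query, the exact top-k document ids (from the 1x / flat baseline) and the approximate top-k ids (from the configuration under test); the class and method names are hypothetical and not part of any k-NN benchmark tool.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for the before/after table.
final class RecallAtKSketch {
    // recall@k for one query: the fraction of the exact top-k ids
    // that also appear in the approximate top-k.
    static double recallAtK(List<Integer> exactTopK, List<Integer> approxTopK, int k) {
        Set<Integer> truth = new HashSet<>(exactTopK.subList(0, Math.min(k, exactTopK.size())));
        if (truth.isEmpty()) {
            throw new IllegalArgumentException("exact result list is empty");
        }
        long hits = approxTopK.stream().limit(k).distinct().filter(truth::contains).count();
        return (double) hits / truth.size();
    }

    // Mean recall@k over a query set; the two lists are aligned by query index.
    static double meanRecallAtK(List<List<Integer>> exact, List<List<Integer>> approx, int k) {
        if (exact.isEmpty() || exact.size() != approx.size()) {
            throw new IllegalArgumentException("query sets must be non-empty and the same size");
        }
        double sum = 0.0;
        for (int i = 0; i < exact.size(); i++) {
            sum += recallAtK(exact.get(i), approx.get(i), k);
        }
        return sum / exact.size();
    }
}
```

Fix the query set and the dataset before you measure, and record both in validation.md; a recall number without them cannot be reproduced.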

Deliverables

capstone-work/mode-path.md — the mapping → encoder → codec → native trace, with citations
capstone-work/design.md — the gap, the chosen slice, the combinations you considered and rejected
Phase 1: a validation tightening on a feature branch, with a unit test (red without the fix)
Phase 2: a stats field exposing per-field quantization mode + footprint, with a test
(Phase 3) a scoped mode/encoder extension behind clean validation
(Phase 4) a recall@k + latency + native-memory before/after table on a real dataset
capstone-work/validation.md — ./gradlew spotlessApply check output, test commands, seeds, dataset
A CHANGELOG.md entry under the k-NN repo's ## [Unreleased]
An upstreaming decision: a DCO-signed k-NN PR for the Phase-1/2 slice, or a written scoped proposal
A 500–1000 word write-up: the gap, the path you traced, the trade-off you made

Difficulty & time


Aspect	Assessment
Engineering difficulty	Hard (Phase 1 alone is Medium)
Mergeability	High for Phases 1–2; negotiated for Phase 3
Time	Phase 1: a weekend. Phases 1–2: ~2 weeks. Through Phase 4: 4–6 weeks
Hardest part	The native build (CMake/JNI) and proving recall didn't silently regress

The native JNI build is the gate most people underestimate. Budget a full session for lab-k1-build-knn-from-source before you scope.

Stretch goals

Wire the new stat into a _cat/knn style endpoint so footprint is visible at a glance.
Add a mapping-time warning (not error) when a chosen (mode, compression_level) is legal but historically recall-risky for the given dimension, pointing at the rescore knobs.
Extend the warmup API accounting so a warmed disk-mode field reports its loaded footprint, closing the loop between Phase 2's estimate and reality.
Reproduce the spirit of k-NN #585 in a test: assert that the circuit breaker accounts for the quantized footprint, not the full-precision one.

Evaluation

Self-grade against the 100-point rubric. For this project:

Dimension	What earns the points here
Problem articulation (20)	The design note names the exact illegal combination / missing stat and why the current behaviour is wrong, not "vectors use memory"
Execution-path mastery (20)	`mode-path.md` traces mapping → resolver → codec → native with file:line citations, before you changed anything
Implementation quality (20)	Phase 1 is a minimum diff; validation lives where the resolver already validates; no new public mapping params without justification
Testing (15)	A unit test red without the fix; a recall@k number for any ANN change; a negative control
Review responsiveness (10)	The real k-NN PR cadence, or a peer review against this rubric
Documentation (10)	Design note, `CHANGELOG.md`, write-up, and an explicit docs decision (the mode knobs are user-facing)
Community interaction (5)	You commented your scope on the current disk/quantization issue and pinged the right `MAINTAINERS.md` reviewers

A finished Phase 1–2 at 90+ is a real, merged k-NN contribution in the memory subsystem.

How to turn this into a real contribution

Start from validation, not the feature. Phase 1 is the upstream target. A 400-on-bad-mapping PR with a deterministic test is mergeable on its own merits and needs no RFC.
Comment before you code. Find the current disk/quantization issue (search query above), comment that you are tightening validation / adding a footprint stat, and confirm the maintainers want it. Link the [META] #2605 thread for context.
One slice per PR. Validation, then stats, then (if negotiated) the mode extension. Do not bundle a new mode with the validation cleanup — reviewers will (correctly) ask you to split it.
Bring a recall number to any encoder/mode change. The k-NN maintainers will ask for it; arrive with the table already in the PR description.
DCO applies. k-NN is GitHub-native like core: git commit -s, CHANGELOG.md, and the backport bot. Get the sign-off email right locally before you push.

If only Phase 1 lands, you have still shipped a real fix in one of OpenSearch's hardest plugins. That is the point of scoping from the small slice up.

Project 2: A Vector-Aware Allocation Decider

OpenSearch's allocation layer decides which node each shard copy lives on. The balancer weighs shard counts (per node and per index), disk watermarks are enforced by a decider, and every move is checked against a chain of AllocationDeciders that can each veto or throttle it. None of them know that a knn_vector shard using the faiss engine carries a large native-memory footprint that lives entirely outside the JVM heap, loaded by the plugin on first query or via the warmup API and capped by a native-memory circuit breaker. So the balancer will happily pack three heavy vector shards onto one node and starve another, because from its point of view they are "just shards."

This project builds a vector-aware AllocationDecider (or an allocation-influencing awareness attribute) that accounts for the native-memory footprint of k-NN shards when placing and balancing them. It sits exactly on the seam between two subsystems most contributors treat as separate worlds: cluster coordination / allocation and k-NN native memory. Bridging them is the whole skill.

Note: This is the project to take if you came through shard allocation and want vectors, or vice versa. You already met an allocation decider in Lab 4: Fix-It Allocation Decider — this is that muscle, scaled up to a real cross-plugin design.

Problem & motivation

The allocation deciders you know enforce real constraints: SameShardAllocationDecider (don't put primary and replica together), DiskThresholdDecider (watermark-aware), AwarenessAllocationDecider (zone spreading), ThrottlingAllocationDecider (rate-limit recoveries), FilterAllocationDecider (include/exclude). They reason about shard count, on-disk bytes, and topology. None reason about native memory.

For a vector workload that is a real, operational gap:

A faiss HNSW index loads its graph into native memory, sized by the number of vectors, the dimension, and M. That footprint can dwarf the JVM heap and is capped by knn.memory.circuit_breaker.limit. When too many heavy vector shards land on one node, that node's circuit breaker trips on first query — queries to those shards fail — while a sibling node sits half-empty. The balancer caused it and cannot see it.
"Just add more shard-count balancing" does not help: ten tiny lexical shards and three huge vector shards have the same count weight but radically different native-memory cost.
The footprint is known — the plugin exposes it through GET /_plugins/_knn/stats and accounts it in NativeMemoryCacheManager. The information exists; allocation just never asked for it.
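To get a feel for the sizes involved, the k-NN documentation gives a rule-of-thumb estimate for a float32 HNSW graph of roughly 1.1 × (4 × dimension + 8 × M) bytes per vector. The sketch below applies that estimate; treat it as an approximation to sanity-check against the live stats, not as the number the plugin accounts. The class and method names are hypothetical.

```java
// Hypothetical helper applying the rule-of-thumb HNSW memory estimate.
final class HnswMemoryEstimateSketch {
    // Approximate native bytes for a float32 HNSW graph:
    // 1.1 * (4 * dimension + 8 * m) bytes per vector.
    static long estimateBytes(long numVectors, int dimension, int m) {
        if (numVectors < 0 || dimension <= 0 || m <= 0) {
            throw new IllegalArgumentException("numVectors >= 0, dimension > 0, m > 0 required");
        }
        return Math.round(1.1 * (4L * dimension + 8L * m) * numVectors);
    }

    public static void main(String[] args) {
        // One heavy vector shard versus what "shard count" sees: both count as 1.
        long heavy = estimateBytes(100_000_000L, 768, 16);
        System.out.printf("estimated native bytes for one heavy shard: %,d%n", heavy);
    }
}
```

This is why count-based balancing cannot see the problem: the estimate scales with vectors, dimension, and M, none of which the balancer's weights include.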

The motivating constraint that makes this hard (and worth a design note): allocation runs in core, the footprint lives in a plugin, and a decider must not make a remote call mid-allocation round. You need the signal to arrive in cluster state (or in the node stats the cluster manager already polls through ClusterInfoService), so the decider can read it cheaply and deterministically.

Real-world grounding

There is no single merged "vector-aware allocation decider." That is why this is a build-the-capability-then-RFC project. Ground it in two real bodies of work:

The native-memory circuit-breaker rearchitecture — k-NN #1582: https://github.com/opensearch-project/k-NN/issues/1582 — the discussion of how native memory should be accounted and capped, which is precisely the signal your decider consumes.
A concrete native-memory circuit-breaker config bug — k-NN #585: https://github.com/opensearch-project/k-NN/issues/585 — evidence that getting this accounting right is genuinely subtle.

For the allocation side, ground in the existing deciders themselves (read the source) and search live for prior art before you design:

# in opensearch-project/OpenSearch
is:issue allocation decider memory OR "native memory" OR knn
is:issue rebalance vector OR knn
# in opensearch-project/k-NN
is:issue allocation OR rebalance OR "shard placement"

Citation discipline: cite #1582 and #585 (real). For the allocation-side issue, link the current one you find — do not invent a number. If you find none, that is itself a finding for your design note: this gap is unaddressed, which is the case for an RFC.

Subsystems you'll touch

Subsystem	Class (grep to confirm)	What it owns
Allocation orchestration	`org.opensearch.cluster.routing.allocation.AllocationService`	Runs reroute; applies deciders + the balancer
Decider chain	`org.opensearch.cluster.routing.allocation.decider.AllocationDeciders`, `AllocationDecider` (base), e.g. `DiskThresholdDecider`, `AwarenessAllocationDecider`	The veto/throttle chain you extend
Allocation context	`org.opensearch.cluster.routing.allocation.RoutingAllocation`, `RoutingNodes`, `ShardRouting`	The mutable state of one allocation round
Balancer	`org.opensearch.cluster.routing.allocation.allocator.BalancedShardsAllocator` (and `WeightFunction`)	Where the shard-count and per-index balance weights live; where a native-memory weight could go
Node-level signal	`org.opensearch.cluster.ClusterInfo` / `ClusterInfoService`, or node stats / cluster state custom	How per-node footprint reaches the decider cheaply
k-NN footprint source	`org.opensearch.knn.index.memory.NativeMemoryCacheManager`, the `KNNStats` registry	The native-memory number to surface

Cross-references: shard allocation deep dive · cluster state · k-NN native JNI and memory · circuit breakers and memory · the allocation-decider lab.

Phased plan

Phase 0 — Read the chain, run the failure (1 day)

Read the decider chain and the balancer weight function. Then reproduce the imbalance.

# In an OpenSearch clone:
grep -rn "class .*AllocationDecider" server/src/main/java/org/opensearch/cluster/routing/allocation/decider
grep -rn "canAllocate\|canRemain\|canRebalance" server/src/main/java/org/opensearch/cluster/routing/allocation/decider/DiskThresholdDecider.java
grep -rn "class BalancedShardsAllocator\|WeightFunction\|weight(" server/src/main/java/org/opensearch/cluster/routing/allocation/allocator

# Reproduce: bring up a 3-node cluster with k-NN, create several heavy vector indices,
# warm them, and watch the native-memory stats skew across nodes.
curl -s "localhost:9200/_cat/shards?v"
curl -s "localhost:9200/_plugins/_knn/stats?pretty"      # per-node graph memory usage
curl -s "localhost:9200/_nodes/stats/breaker?pretty"     # JVM-side breakers, for comparison; check the k-NN stats above for the native breaker state

Write capstone-work/allocation-trace.md: how a reroute round flows through AllocationService → AllocationDeciders → BalancedShardsAllocator, with the file:line where the balancer weights are computed and where the disk watermarks are checked. This is your execution-path artifact.

Phase 1 — A read-only decider that observes (the scoped slice)

Build a custom AllocationDecider (registered through the ClusterPlugin createAllocationDeciders extension point; grep createAllocationDeciders) that, for now, decides on a synthetic per-node footprint you put in cluster state by hand. Run it first in observe-only mode, where it always returns Decision.YES but logs what it would have decided. This proves the wiring without risking allocation correctness. Once the logged decisions look right, enable the veto shown in the sketch below.

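public class VectorMemoryAllocationDecider extends AllocationDecider {
    static final String NAME = "vector_memory";

    // Hypothetical dynamic setting; the key is illustrative. The plugin must also
    // return it from getSettings() so the cluster accepts updates.
    static final Setting<ByteSizeValue> PER_NODE_LIMIT = Setting.byteSizeSetting(
        "cluster.routing.allocation.vector_memory.per_node_limit",
        new ByteSizeValue(Long.MAX_VALUE),
        Setting.Property.Dynamic, Setting.Property.NodeScope);

    private volatile long perNodeLimitBytes;   // kept current by the settings consumer below

    public VectorMemoryAllocationDecider(Settings settings, ClusterSettings clusterSettings) {
        this.perNodeLimitBytes = PER_NODE_LIMIT.get(settings).getBytes();
        clusterSettings.addSettingsUpdateConsumer(PER_NODE_LIMIT, v -> this.perNodeLimitBytes = v.getBytes());
    }

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        long projected = projectedNativeMemory(node, shardRouting, allocation);
        if (projected > perNodeLimitBytes) {
            return allocation.decision(Decision.NO, NAME,
                "node [%s] projected native vector memory [%d] would exceed limit [%d]",
                node.nodeId(), projected, perNodeLimitBytes);
        }
        return allocation.decision(Decision.YES, NAME, "within native vector memory budget");
    }

    // Placeholder: in Phase 1, read the synthetic footprint you placed in cluster state;
    // Phase 2 swaps in the real k-NN number.
    private long projectedNativeMemory(RoutingNode node, ShardRouting shard, RoutingAllocation allocation) {
        return 0L;
    }
}

// in your plugin (a ClusterPlugin):
@Override
public Collection<AllocationDecider> createAllocationDeciders(Settings settings, ClusterSettings cs) {
    return List.of(new VectorMemoryAllocationDecider(settings, cs));
}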

Test with the allocation test harness used by the built-in deciders:

// extends OpenSearchAllocationTestCase
public void testVetoesNodeOverNativeMemoryBudget() {
    AllocationService service = createAllocationService(/* with the decider */);
    ClusterState state = /* 2 nodes, set footprints so node1 is full */;
    state = service.reroute(state, "test");
    assertThat(shardsOn(state, "node1"), /* shard did not land on the full node */);
}

./gradlew :server:test --tests "*VectorMemoryAllocationDecider*"
./gradlew :server:check -x internalClusterTest

Phase 2 — Get the real footprint into allocation

Replace the synthetic number with the actual k-NN footprint. The clean path: have the k-NN plugin publish per-node native-memory usage into a place the decider can read without an RPC — either via ClusterInfo (the same mechanism DiskThresholdDecider uses for disk usage) or a small cluster-state custom updated from node stats.

grep -rn "ClusterInfo\b\|ClusterInfoService\|getNodeLeastAvailableDiskUsages" server/src/main/java/org/opensearch/cluster
grep -rn "getShardSize\|shardSizes\|NodeStats" server/src/main/java/org/opensearch/cluster/ClusterInfo.java

Decide and document the mechanism in your design note — this is the architecturally load-bearing choice and the thing an RFC would argue about (latency of the signal, staleness, who owns the field, BWC of any new cluster-state custom).
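The staleness question is easier to argue about with a concrete shape in hand. The following is a minimal, framework-free sketch, not OpenSearch code: NativeMemorySnapshot, Verdict, and the maxAgeMillis parameter are hypothetical names invented for illustration. It assumes the signal arrives as a per-node byte count plus a sample timestamp, and that a decider should abstain rather than veto when the sample is too old or a node has not reported.

```java
import java.util.Map;

/** Hypothetical model of a per-node native-memory signal; not an OpenSearch class. */
final class NativeMemorySnapshot {
    enum Verdict { YES, NO, ABSTAIN_STALE, ABSTAIN_UNKNOWN_NODE }

    private final Map<String, Long> bytesByNode; // nodeId -> native vector bytes in use
    private final long takenAtMillis;            // when the signal was sampled

    NativeMemorySnapshot(Map<String, Long> bytesByNode, long takenAtMillis) {
        this.bytesByNode = Map.copyOf(bytesByNode);
        this.takenAtMillis = takenAtMillis;
    }

    /** Would adding shardBytes to nodeId stay within limitBytes, given a fresh enough sample? */
    Verdict check(String nodeId, long shardBytes, long limitBytes, long nowMillis, long maxAgeMillis) {
        if (nowMillis - takenAtMillis > maxAgeMillis) {
            return Verdict.ABSTAIN_STALE;        // too old to base a veto on
        }
        Long used = bytesByNode.get(nodeId);
        if (used == null) {
            return Verdict.ABSTAIN_UNKNOWN_NODE; // node has not reported yet
        }
        long projected = Math.addExact(used, shardBytes);
        return projected > limitBytes ? Verdict.NO : Verdict.YES;
    }
}
```

Whether an abstention should map to Decision.YES or to something stricter is exactly the kind of choice the design note has to defend.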

Phase 3 — Influence balancing, not just placement

A canAllocate veto stops bad placements but does not rebalance an already-skewed cluster. For that, the native-memory cost must enter the BalancedShardsAllocator weight function, so the balancer actively moves heavy shards off hot nodes.

grep -rn "WeightFunction\|float weight\|theta\|indexBalance\|shardBalance" \
  server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

This is the deepest and most contentious change — touching the weight function affects every workload, not just vectors. Scope it behind a setting, off by default, and prove with a test that a skewed cluster converges to balanced native memory without destabilizing count/disk balance.
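To make "behind a setting, off by default" concrete, here is a simplified model of a balancer weight with an extra native-memory term. It is a sketch under stated assumptions, not the BalancedShardsAllocator implementation: WeightSketch, memoryBalance, and the normalization by the cluster average are all hypothetical, and the real weight function's terms must be read from the source. The point it illustrates is that a memory factor of zero leaves the existing ordering of nodes untouched.

```java
/** Hypothetical, simplified weight model; read the real one in BalancedShardsAllocator. */
final class WeightSketch {
    private final float shardTheta;   // share given to the total shard-count term
    private final float indexTheta;   // share given to the per-index shard-count term
    private final float memoryTheta;  // share given to the native-memory term; 0 disables it

    WeightSketch(float shardBalance, float indexBalance, float memoryBalance) {
        float sum = shardBalance + indexBalance + memoryBalance;
        if (sum <= 0f) {
            throw new IllegalArgumentException("balance factors must sum to > 0");
        }
        this.shardTheta = shardBalance / sum;
        this.indexTheta = indexBalance / sum;
        this.memoryTheta = memoryBalance / sum;
    }

    /** Higher weight means the node is more loaded than average and a worse target. */
    float weight(int nodeShards, float avgShards,
                 int nodeIndexShards, float avgIndexShards,
                 long nodeNativeBytes, double avgNativeBytes) {
        float shardTerm = nodeShards - avgShards;
        float indexTerm = nodeIndexShards - avgIndexShards;
        // Divide by the cluster average so the memory term is unitless like the others.
        float memoryTerm = avgNativeBytes > 0
            ? (float) ((nodeNativeBytes - avgNativeBytes) / avgNativeBytes)
            : 0f;
        return shardTheta * shardTerm + indexTheta * indexTerm + memoryTheta * memoryTerm;
    }
}
```

With memoryBalance at 0 two nodes that differ only in native memory get the same weight; with it above 0 the heavier node scores higher, which is the behaviour your convergence test has to show is safe.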

Phase 4 — Prove it converges and does no harm

Build an OpenSearchAllocationTestCase / InternalTestCluster scenario: a skewed start state, run reroute to convergence, assert native-memory variance across nodes drops below a threshold and shard-count/disk balance does not regress. The "does no harm" half is as important as the "it balances" half — a decider that fixes vectors by wrecking lexical balance won't land.
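The two halves of that assertion can be written down before any cluster is involved. A minimal sketch, assuming you have already extracted per-node native bytes and per-node shard counts from the before and after cluster states; BalanceCheck and its method names are hypothetical helpers, and max-minus-min shard count is only a crude stand-in for count balance.

```java
/** Hypothetical assertion helpers for a convergence test; not part of the OpenSearch test framework. */
final class BalanceCheck {
    /** Population variance of a per-node metric. */
    static double variance(long[] perNode) {
        if (perNode.length == 0) {
            throw new IllegalArgumentException("no nodes");
        }
        double mean = 0;
        for (long v : perNode) mean += v;
        mean /= perNode.length;
        double sumSq = 0;
        for (long v : perNode) sumSq += (v - mean) * (v - mean);
        return sumSq / perNode.length;
    }

    /** Max minus min shard count across nodes. */
    static int spread(int[] shardCounts) {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        for (int c : shardCounts) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        return shardCounts.length == 0 ? 0 : max - min;
    }

    /** "It balances" (memory variance under the threshold) and "does no harm" (count spread not worse). */
    static boolean convergedWithoutHarm(long[] memAfter, double maxVariance,
                                        int[] countsBefore, int[] countsAfter) {
        return variance(memAfter) <= maxVariance && spread(countsAfter) <= spread(countsBefore);
    }
}
```

A disk-balance check would follow the same pattern as the count check; it is left out here to keep the sketch short.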

Deliverables

capstone-work/allocation-trace.md — reroute → deciders → balancer, with file:line citations
capstone-work/design.md — the gap, the signal-delivery mechanism chosen (ClusterInfo vs custom), alternatives rejected, BWC analysis
Phase 1: a registered read-only/veto decider with an OpenSearchAllocationTestCase test
Phase 2: real per-node native-memory footprint delivered into allocation without an RPC
(Phase 3) the native-memory weight in the balancer, behind a default-off setting
(Phase 4) a convergence test that also proves count/disk balance does not regress
capstone-work/validation.md — ./gradlew :server:check output, test commands, seeds
An upstreaming decision: a written RFC draft (this almost certainly needs one) + the issue comment you would open
A 500–1000 word write-up: the cross-subsystem seam and how you bridged it

Difficulty & time


Aspect	Assessment
Engineering difficulty	Hard (Phase 3 is Very Hard)
Mergeability	Maybe — RFC first. The weight-function change is contentious by nature
Time	Phase 1: a weekend. Phases 1–2: ~2–3 weeks. Through Phase 4: 5–6 weeks
Hardest part	Getting a fresh, cheap, BWC-safe footprint signal into the allocation round

Warning: Touching BalancedShardsAllocator's weight function changes behaviour for every index in every cluster. That is exactly why Phases 1–2 (a vetoing decider behind a setting) are the realistic upstream slice and Phase 3 is RFC-gated.

Stretch goals

Make the decider respect the circuit-breaker limit per node, so it leaves headroom rather than packing to 100% — directly addressing the spirit of k-NN #585.
Add an awareness-attribute mode: spread vector shards across attributes the way AwarenessAllocationDecider spreads across zones, but weighted by footprint.
Expose a _cat/allocation-style column showing per-node native vector memory.
Account for the warmup state: an unwarmed shard's footprint is projected, a warmed one is real — model both and let the decider use the real number when available.

Evaluation

Self-grade against the 100-point rubric:

Dimension	What earns the points here
Problem articulation (20)	The design note states why count/disk balancing is insufficient for native memory, with the reproduced skew as evidence
Execution-path mastery (20)	`allocation-trace.md` follows a reroute through deciders and the balancer weight function, cited
Implementation quality (20)	The decider uses the existing extension point; the signal is BWC-safe; the weight change is setting-gated and default-off
Testing (15)	`OpenSearchAllocationTestCase` veto tests and a convergence test that proves no regression to count/disk balance
Review responsiveness (10)	The RFC thread cadence, or a peer review against this rubric
Documentation (10)	RFC draft, design note, write-up, `CHANGELOG.md` for the scoped slice
Community interaction (5)	You posted the RFC/design before the weight-function change and pinged allocation + k-NN maintainers

How to turn this into a real contribution

This is an RFC-first project. Allocation behaviour is core; you do not change the balancer by surprise. Write the design note as an RFC and open it on the OpenSearch repo.
Phase 1–2 is the mergeable slice: a registered, setting-gated, default-off vetoing decider plus a clean footprint signal. That can land as an opt-in feature.
The cross-plugin signal is the crux. Argue the mechanism (ClusterInfo vs cluster-state custom) in the RFC; both have BWC and staleness consequences. Maintainers will care more about how the number arrives than about the decider logic.
Bring the reproduction. A _cat/shards + _plugins/_knn/stats before/after that shows the skew and then the fix is worth more than any prose.
DCO applies to whatever slice lands in core (GitHub-native workflow: git commit -s, a CHANGELOG.md entry, the backport bot).

Even if nothing merges, a clean RFC + a working setting-gated decider + a convergence proof is a strong portfolio artifact in the hardest part of the engine.

Project 3: Optimizing Concurrent Segment Search Slicing

Concurrent segment search splits a shard's segments into slices and searches them in parallel on a dedicated threadpool (index_searcher in current OpenSearch; grep ThreadPool.Names to confirm), then reduces the per-slice results into one shard-local answer. It became a cluster-level default in 3.0. The mechanism is settled; the slicing policy (how many slices, and which segments go in each) is where the performance still lives. Slice too finely over many tiny segments and coordination overhead swamps the win. Slice too coarsely and you leave cores idle. This project asks you to analyze the current slicing decision, find a regime where it is suboptimal, and improve it, with a measurement gate that proves the change is real before any code lands.

This is the project for someone who wants search execution and parallelism, and who is willing to live or die by a benchmark. There is no "I think it's faster" here. You bring numbers or you bring nothing.

Note: Read the Concurrent Segment Search chapter first; this brief assumes you know what a slice is, the CollectorManager contract, and the settings that bound slice count. It builds directly on Lab 9.2: Performance Regression — the measurement discipline there is the spine of this whole project.

Problem & motivation

The slicing decision is a heuristic, and every heuristic has a regime where it is wrong:

Small-segment overhead. A shard freshly written by many indexing threads has lots of tiny segments. Splitting them into slices creates more tasks than useful work — each slice's fork/join, collector creation, and reduce cost can exceed the time to scan a small segment sequentially. The default maxDocsPerSlice / maxSegmentsPerSlice heuristic (Lucene's IndexSearcher.slices(...)) does not always group these well for OpenSearch's workloads.
Slice-count vs. threadpool. Slice tasks run on a bounded, shared pool (index_searcher in current OpenSearch; grep to confirm), so over-slicing one request takes threads from concurrent requests and the per-request latency win becomes a cluster-wide throughput loss. The cap (search.concurrent.max_slice_count) is a blunt global knob, not a per-request decision.
Cheap queries don't benefit. For a query whose per-segment cost is tiny, the reduce and coordination dominate, and concurrency is pure overhead — but the policy slices anyway.

The opportunity: a smarter slicing policy that collapses tiny segments into fewer slices, or makes the slice count sensitive to estimated per-segment cost and threadpool pressure, can win latency where the win is real and stop spending where it is not. That is a measurable, reviewable change — if you can prove it.

The constraint that makes it hard: "faster" is workload-dependent. A change that helps the many-tiny-segments case must be shown not to regress the many-large-segments case it was designed for. The benchmark is the design.

Real-world grounding

The grounding is the concurrent-segment-search feature itself, which graduated to a 3.0 default behind real settings. Read the existing chapter and then find the live work:

The slicing/maxSliceCount settings and the feature in the OpenSearch repo. Search:

# in opensearch-project/OpenSearch
is:issue concurrent segment search slice
is:issue "max_slice_count" OR "slice count" OR slicing
is:pr concurrent search slice latency OR throughput

The Lucene side of the slicing primitive: IndexSearcher.slices(...) and LeafSlice in apache/lucene. The OpenSearch slice strategy wraps or replaces Lucene's default partitioning.

Citation discipline: the settings (search.concurrent_segment_search.enabled, search.concurrent.max_slice_count) and the 3.0-default fact are verified — cite them. The current slicing-tuning issue you choose to ground in must be one you actually find; link it in your design note. Do not invent an issue number.

Subsystems you'll touch

Subsystem	Class / area (grep to confirm)	What it owns
Slice strategy	the OpenSearch slice computation (grep `slice`, `LeafSlice`, `computeSlices`, `SliceCount` under `server/src/main/java/org/opensearch/search`)	How leaves are grouped into slices
Searcher integration	`org.opensearch.search.internal.ContextIndexSearcher` (extends Lucene `IndexSearcher`)	Where the executor and slicing plug into the query phase
Collector contract	Lucene `CollectorManager` / `Collector`, the OpenSearch collector-manager wrappers	Per-slice collection + `reduce`
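Threadpool	`org.opensearch.threadpool.ThreadPool`: the `index_searcher` pool for slice tasks, alongside the `search` / `search_throttled` pools for the shard-level request (grep `ThreadPool.Names` to confirm)	Where slice tasks actually run; the contention source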
Settings	the cluster settings registry for `search.concurrent_segment_search.*` and `search.concurrent.max_slice_count`	The knobs and any new one you add
Benchmark	JMH (`benchmarks/` module) + OpenSearch Benchmark (OSB)	The gate

Cross-references: Concurrent Segment Search · search execution deep dive · threadpools and concurrency · refresh/flush/merge (why you get many small segments) · the performance-regression lab.

Phased plan

The measurement gate is Phase 1, not an afterthought. You do not write a line of slicing code until you can measure slicing latency reproducibly and show the regime where it is bad.

Phase 0 — Find the slicing decision (1 day)

# In an OpenSearch clone:
grep -rn "LeafSlice\|computeSlices\|slices(\|maxSliceCount\|SliceCount\|max_slice_count" \
  server/src/main/java/org/opensearch/search
grep -rn "class ContextIndexSearcher\|protected LeafSlice\|getSlices" \
  server/src/main/java/org/opensearch/search/internal/ContextIndexSearcher.java
grep -rn "CollectorManager\|reduce(" server/src/main/java/org/opensearch/search/query

Write capstone-work/slicing-trace.md: query phase → ContextIndexSearcher → slice computation → per-slice CollectorManager → reduce, with file:line citations and the exact heuristic (the maxDocs/maxSegments thresholds) that decides slice count today.

Phase 1 — Build the measurement gate (the real first deliverable)

You need two instruments:

(a) A JMH microbenchmark isolating the slicing+reduce cost as a function of segment count and size, independent of the rest of the query phase.

// benchmarks/src/main/java/org/opensearch/benchmark/search/SlicingBenchmark.java
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class SlicingBenchmark {
    @Param({"4", "32", "256"})      int segmentCount;
    @Param({"100", "10000"})        int docsPerSegment;
    @Param({"1", "2", "4", "8"})    int maxSlices;

    // build a Directory with segmentCount segments, run the slice+collect+reduce path,
    // measure end-to-end query-phase time at each (segmentCount, docsPerSegment, maxSlices).
    @Benchmark public long sliceAndCollect() { /* ... */ }
}

./gradlew -p benchmarks run --args 'SlicingBenchmark'   # check benchmarks/README.md for the current invocation

(b) An OSB macro run on a realistic index, with concurrent search on, sweeping max_slice_count, so you see real cluster latency and throughput:

opensearch-benchmark execute-test --workload=nyc_taxis \
  --pipeline=benchmark-only \
  --target-hosts=localhost:9200 \
  --workload-params='{"search_clients": 8}' \
  --kill-running-processes
# vary search.concurrent.max_slice_count between runs; record p50/p90/p99 and throughput

Produce capstone-work/baseline.md: a table showing the regime where current slicing is suboptimal (e.g. many tiny segments, or high concurrency where over-slicing hurts throughput). If you cannot show a bad regime, there is no project — pick a different one.

Phase 2 — A better slicing policy (scoped)

Pick one lever, justified by your baseline:

Lever	Idea	Risk
Min-work-per-slice	Collapse tiny segments so each slice has ≥ a threshold of docs/postings before adding a slice	Could under-parallelize large-segment case
Cost-aware slice count	Estimate per-segment cost; slice count scales with total estimated work, not raw segment count	Cost estimate must be cheap and not wrong
Threadpool-pressure-aware cap	Reduce effective slice count when the search pool queue is deep	Couples slicing to live pool state

Implement behind a new, default-off setting so the existing default behaviour is untouched until proven. Sketch:

// in the slice computation:
int targetSlices = costAwareSliceCount(leaves, settings, threadPoolPressure());
List<LeafSlice> slices = groupByMinWork(leaves, minDocsPerSlice, targetSlices);
return slices.toArray(new LeafSlice[0]);
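The min-work lever is easy to prototype outside OpenSearch before touching the real slice computation. The sketch below is a framework-free illustration under stated assumptions: MinWorkSlicer is a hypothetical name, segments are represented only by their doc counts, and "work" is approximated by doc count. It caps the slice count so that the average slice holds at least minDocsPerSlice docs, then assigns segments largest-first to the currently lightest slice. It does not guarantee that every individual slice reaches the minimum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical prototype of a min-work slicing policy; not the OpenSearch slice computation. */
final class MinWorkSlicer {
    /** Returns slices as lists of segment indices into docCounts. */
    static List<List<Integer>> group(int[] docCounts, long minDocsPerSlice, int maxSlices) {
        if (minDocsPerSlice < 1 || maxSlices < 1) {
            throw new IllegalArgumentException("minDocsPerSlice and maxSlices must be >= 1");
        }
        List<List<Integer>> slices = new ArrayList<>();
        if (docCounts.length == 0) {
            return slices;
        }
        long total = 0;
        for (int d : docCounts) total += d;

        // Cap the slice count so the average slice holds at least minDocsPerSlice docs,
        // and never create more slices than there are segments.
        int target = (int) Math.max(1, Math.min(maxSlices, total / minDocsPerSlice));
        target = Math.min(target, docCounts.length);

        Integer[] order = new Integer[docCounts.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Integer.compare(docCounts[b], docCounts[a])); // largest first

        long[] load = new long[target];
        for (int i = 0; i < target; i++) slices.add(new ArrayList<>());
        for (int seg : order) {
            int lightest = 0;
            for (int s = 1; s < target; s++) {
                if (load[s] < load[lightest]) lightest = s;
            }
            slices.get(lightest).add(seg);
            load[lightest] += docCounts[seg];
        }
        return slices;
    }
}
```

With 32 segments of 100 docs and a 10,000-doc minimum this collapses to a single slice, which is the many-tiny-segments behaviour the lever is meant to produce; four 10,000-doc segments still get four slices.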

Phase 3 — Prove the win and the no-harm

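Re-run the exact Phase-1 instruments with your policy on. The bar is two-sided. The table below is a template: the values show the shape of the result, and each x marks a digit you fill in from your own runs.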

workload / regime          metric    default     yours      verdict
many tiny segments (p90)   ms          42          2x        WIN
many large segments (p90)  ms          120        12x        NO REGRESSION
high-concurrency tput      qps         8x0         8x0+      WIN/EQUAL
cheap query (p50)          ms          3.1         3.x       NO REGRESSION

Every row must be explained. A win in one regime that regresses another is not done — either the policy auto-detects the regime, or it stays setting-gated and documented for the regime it helps. State the seed and the OSB workload + params so the result is reproducible.
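One way to keep the verdict column honest is to compute it from the numbers instead of typing it. A minimal sketch, assuming a relative noise band you have measured yourself; Verdicts, classify, and the tolerance parameter are hypothetical names, not part of JMH or OpenSearch Benchmark.

```java
/** Hypothetical helper that turns a before/after pair into a verdict for the results table. */
final class Verdicts {
    enum Verdict { WIN, NO_REGRESSION, REGRESSION }

    /**
     * @param lowerIsBetter true for latency, false for throughput
     * @param tolerance     relative noise band, e.g. 0.05 means changes within 5% count as equal
     */
    static Verdict classify(double baseline, double candidate, boolean lowerIsBetter, double tolerance) {
        if (baseline <= 0 || tolerance < 0) {
            throw new IllegalArgumentException("baseline must be > 0 and tolerance >= 0");
        }
        double change = (candidate - baseline) / baseline;      // signed relative change
        double improvement = lowerIsBetter ? -change : change;  // positive means better
        if (improvement > tolerance) return Verdict.WIN;
        if (improvement < -tolerance) return Verdict.REGRESSION;
        return Verdict.NO_REGRESSION;
    }
}
```

The tolerance should come from the run-to-run variation in your baseline, not from a guess, or the helper will report wins that are only noise.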

Phase 4 — Correctness, not just speed

Concurrency bugs hide in reduce. Add/extend tests that prove your slicing produces identical results to the sequential path: same hits, same scores, same aggregation values, across segment counts. A faster wrong answer is the worst possible outcome here.

./gradlew :server:test --tests "*ConcurrentSegmentSearch*" --tests "*Slice*"
./gradlew :server:internalClusterTest --tests "*ConcurrentSearch*IT"
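The invariant those tests protect can be stated in a few lines of plain Java. This is an illustration of the property, not the OpenSearch collector code: TopKReduce is a hypothetical name, hits are bare doc ids with scores looked up in an array, and ties are broken by lower doc id so that the ordering is total. Under those assumptions, taking the top k within each slice and then reducing must equal the top k over all hits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of per-slice top-k plus reduce; not the OpenSearch CollectorManager path. */
final class TopKReduce {
    /** Top k doc ids by score descending, ties broken by lower doc id for a deterministic order. */
    static List<Integer> topK(List<Integer> docs, float[] scoreByDoc, int k) {
        List<Integer> sorted = new ArrayList<>(docs);
        sorted.sort((a, b) -> {
            int byScore = Float.compare(scoreByDoc[b], scoreByDoc[a]);
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    /** Per-slice top k, then a reduce over the concatenated partial results. */
    static List<Integer> slicedTopK(List<List<Integer>> slices, float[] scoreByDoc, int k) {
        List<Integer> partial = new ArrayList<>();
        for (List<Integer> slice : slices) {
            partial.addAll(topK(slice, scoreByDoc, k));
        }
        return topK(partial, scoreByDoc, k);
    }
}
```

The tie-break is the part most likely to differ between a sequential and a sliced run, so a real test should include equal scores on purpose.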

Deliverables

capstone-work/slicing-trace.md — query phase → slice computation → reduce, with citations and the current heuristic
capstone-work/baseline.md — JMH + OSB results showing the suboptimal regime (the gate)
capstone-work/design.md — the lever chosen, alternatives rejected, why it's setting-gated
A JMH microbenchmark committed under benchmarks/
The new slicing policy behind a default-off setting
capstone-work/results.md — two-sided before/after table (win and no-harm), with seeds + OSB params
Correctness tests proving slice results == sequential results
capstone-work/validation.md — ./gradlew :server:check output, test commands, seeds
An upstreaming decision: a DCO-signed PR with the numbers in the description, or a written proposal + issue comment
A 500–1000 word write-up: the regime, the heuristic, the measurement

Difficulty & time


Aspect	Assessment
Engineering difficulty	Hard (the code is small; the measurement is the work)
Mergeability	High if the numbers are clean and two-sided
Time	Phase 0–1 (gate): ~2 weeks. Through Phase 4: 4–6 weeks
Hardest part	A reproducible benchmark that isolates slicing from noise, and proving no regression

Warning: Benchmark noise will lie to you. Pin CPU, disable turbo where you can, run enough JMH forks/iterations, and run OSB enough times to have a confidence interval — not one run. See the performance-regression lab for the discipline.
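As a rough illustration of what "a confidence interval, not one run" means in practice, the sketch below computes a mean and an approximate 95% interval from repeated runs and treats two configurations as different only when their intervals do not overlap. RunStats and its methods are hypothetical helpers. The 1.96 multiplier is the normal approximation, which is optimistic for a handful of runs; with few samples a Student's t critical value is the more careful choice.

```java
/** Hypothetical helper for summarizing repeated benchmark runs. */
final class RunStats {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Sample standard deviation (n - 1 denominator). */
    static double sampleStdDev(double[] xs) {
        if (xs.length < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        double m = mean(xs), sumSq = 0;
        for (double x : xs) sumSq += (x - m) * (x - m);
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    /** Half-width of an approximate 95% interval for the mean (normal approximation). */
    static double halfWidth95(double[] xs) {
        return 1.96 * sampleStdDev(xs) / Math.sqrt(xs.length);
    }

    /** True when the two intervals do not overlap. */
    static boolean clearlyDifferent(double[] a, double[] b) {
        double highA = mean(a) + halfWidth95(a), lowA = mean(a) - halfWidth95(a);
        double highB = mean(b) + halfWidth95(b), lowB = mean(b) - halfWidth95(b);
        return highA < lowB || highB < lowA;
    }
}
```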

Stretch goals

Make the policy adaptive: detect the regime at query time (segment-size histogram + threadpool queue depth) and choose slice count automatically, so it can be on by default.
Add a per-request override (a search-request parameter) so an operator can tune slicing for a specific expensive query without changing the cluster default.
Extend the win analysis to aggregations, where per-slice reduce is heavier than for plain top-K — the regime where concurrency pays best.
Feed the slicing decision a profile signal so search profiling shows the slice count chosen and why.

Evaluation

Self-grade against the 100-point rubric. This project is weighted, in spirit, toward measurement:

Dimension	What earns the points here
Problem articulation (20)	The design note names the exact regime where slicing is suboptimal, backed by the baseline table
Execution-path mastery (20)	`slicing-trace.md` follows the slice decision through `ContextIndexSearcher` and Lucene's `slices(...)`, cited
Implementation quality (20)	The change is small, setting-gated, default-off; the slice computation stays in the existing seam
Testing (15)	Correctness (slice == sequential) tests and a reproducible benchmark with seeds; this is the dimension you must max
Review responsiveness (10)	The PR cadence, or a peer review against this rubric
Documentation (10)	The two-sided results table, design note, `CHANGELOG.md`, write-up
Community interaction (5)	You posted the regime + baseline on the issue before proposing a policy change

A clean two-sided benchmark with a small, gated slicing improvement is exactly the shape of a mergeable search-performance PR.

How to turn this into a real contribution

Lead with the benchmark. A search-performance PR with no numbers will be asked for numbers; arrive with the two-sided table in the description and the JMH benchmark committed.
Default-off, then propose default-on with data. Land the policy behind a setting first. Changing the default slicing behaviour is a separate, data-backed conversation.
Prove no-harm explicitly. Reviewers will worry about the many-large-segments case the feature was built for. Show that row green.
Correctness first in the PR narrative. State up front that slice results are identical to sequential, with the test that proves it — concurrency reviewers look for that before speed.
DCO applies (GitHub-native core: git commit -s, CHANGELOG.md, backport bot).

The benchmark harness you build in Phase 1 is itself a reusable contribution — even if the policy change needs more iteration, a committed JMH slicing benchmark is mergeable on its own.

Project 4: An Upstream Apache Lucene HNSW Contribution

This is the one project whose deliverable does not land in an OpenSearch repository at all. It lands in apache/lucene — and then flows into OpenSearch on the next Lucene upgrade, because OpenSearch bundles a specific Lucene version and upgrading it is a recurring core task. A small, well-scoped HNSW / vector-format / VectorUtil improvement or bug fix in Lucene is the cleanest path in this entire portfolio to a real merged PR in a tier-1 open-source project, with the least cross-plugin surface area to learn.

It is also the project where you contribute under a different process: the Apache Software Foundation's. No DCO sign-off line, a CHANGES.txt entry instead of CHANGELOG.md, ASF PR conventions, and a community that lives partly in GitHub PRs and partly in JIRA history (LUCENE-NNNN). Project 4 is as much about learning that process as about the code.

Note: Read HNSW Vector Search in Lucene and SIMD and the Vector API first, and do Lab L4: Contribute to Apache Lucene — this brief assumes you can already build Lucene with ./gradlew check and run a single test with a seed. The lab is the on-ramp; this brief is the real thing.

Problem & motivation

Lucene's vector search is the foundation under OpenSearch's k-NN lucene engine and, indirectly, the bar everything else is measured against. It is also actively evolving — the scalar quantization formats, the SIMD-accelerated distance kernels, and the HNSW graph builder/merger are all areas with open work and reachable bugs. That makes it fertile ground for a scoped contribution:

The scalar-quantized vector formats (int8 and the newer custom-bit variants) are a thin, well-tested, well-bounded subsystem. A correctness bug, an edge case in dequantization, or a missing validation is the kind of thing a careful newcomer can find and fix.
VectorUtil and the Panama Vector API acceleration (dotProduct, squareDistance, cosine) are pure functions with exact scalar fallbacks — ideal for a small correctness or performance improvement that is easy to test against the fallback.
The HNSW graph builder/merger is where the big wins live (faster merging gave ~25% indexing speedups in Lucene's nightly benchmarks) — too large to rewrite, but full of scoped, reachable improvements.

The motivation is leverage: a 50-line fix in Lucene's vector code is exercised by every Lucene user on earth and arrives in OpenSearch automatically at the next upgrade. That is the highest blast-radius-per-line in the whole portfolio — which is exactly why the bar for correctness and test rigor is highest here too.

Real-world grounding

Two real, verified Lucene vector-search issues anchor this — read both to understand the shape of the subsystem and the kind of work that lands:

Scalar quantization codec — apache/lucene #12497: https://github.com/apache/lucene/issues/12497 — the introduction of a scalar-quantized vectors codec format. Reading this issue and its PR teaches you exactly how a vector format is structured, tested, and reviewed in Lucene.
int8 scalar quantization origin — apache/lucene #11613 / LUCENE-10577: https://github.com/apache/lucene/issues/11613 — the original int8 scalar quantization work (tracked historically as LUCENE-10577). This is the design lineage of the Lucene99ScalarQuantizedVectorsFormat family and the later custom-bit Lucene104ScalarQuantizedVectorsFormat.

To find a current fixable issue, search live and filter for newcomer-sized work:

# in apache/lucene
is:issue is:open label:"module:vector" OR label:vector hnsw OR quantiz OR VectorUtil
is:issue is:open label:"good first issue" vector
is:issue is:open scalar quantiz dequantize OR edge case OR NaN OR overflow

Citation discipline: #12497 and #11613 / LUCENE-10577 are real — cite them. The specific issue you fix must be one you actually find open (or a bug you reproduce in a test); link it in your design note. Do not invent a LUCENE number.

Subsystems you'll touch

Subsystem	Package / class (grep to confirm per Lucene version)	What it owns
HNSW graph	`org.apache.lucene.util.hnsw.HnswGraph`, `HnswGraphBuilder`, `HnswGraphSearcher`, `NeighborQueue`	Graph build, merge, and greedy search
Vector codec	`org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat`, `Lucene99ScalarQuantizedVectorsFormat`, `Lucene99HnswScalarQuantizedVectorsFormat`, and the newer `Lucene104*` scalar-quantized formats	The `.vec`/`.vex`/`.vem` on-disk format and quantization
Vector values	`org.apache.lucene.index.FloatVectorValues`, `ByteVectorValues`, `VectorSimilarityFunction` (EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT)	The per-doc vector access + distance
Distance kernels	`org.apache.lucene.util.VectorUtil`, the `VectorizationProvider` + `PanamaVectorUtilSupport` (SIMD) vs. the scalar fallback	`dotProduct`/`squareDistance`/`cosine`
Query	`org.apache.lucene.search.KnnFloatVectorQuery`, `KnnByteVectorQuery`	Top-k ANN query execution
Tests	`org.apache.lucene.tests.util.LuceneTestCase`, `BaseKnnVectorsFormatTestCase`, `TestVectorUtil`, `randomized` test seeds	Where you reproduce and prove

Cross-references: HNSW vector search in Lucene · SIMD and the Vector API · segments and codecs · the Lucene contribution lab · and the OpenSearch k-NN engines chapter where the lucene engine reuses all of this.

Phased plan

The ASF rule of thumb: reproduce the problem inside Lucene's own test framework first. Lucene reviewers expect a failing LuceneTestCase with a fixed seed before they look at a fix.

Phase 0 — Build Lucene and run the vector tests (½ day)

git clone https://github.com/apache/lucene.git && cd lucene
./gradlew :lucene:core:compileJava                     # JDK 21
./gradlew :lucene:core:test --tests "org.apache.lucene.util.hnsw.*"
./gradlew :lucene:core:test --tests "org.apache.lucene.util.TestVectorUtil"
./gradlew :lucene:luke:run                             # inspect a real index, optional

Reproduce a known-good run with a seed so you understand the randomized harness:

./gradlew :lucene:core:test --tests "*TestVectorUtil*" -Ptests.seed=DEADBEEF

Write capstone-work/lucene-vector-map.md: how a KnnFloatVectorField flows from IndexWriter → the KnnVectorsFormat → .vec/.vex/.vem files, and how a KnnFloatVectorQuery walks HnswGraphSearcher and scores with VectorUtil. Cite files.

Phase 1 — Pick and reproduce a scoped target (the real first deliverable)

Choose one target type. The whole project's mergeability depends on scoping this small:

Target	Example	Why it's reachable
A correctness bug	A dequantization edge case, a similarity function mishandling a degenerate vector (all-zero, NaN guard), an off-by-one in a format reader	Bounded, testable against the scalar reference
A `VectorUtil` improvement	A scalar-fallback path that disagrees subtly with the Panama path; a missing fast path	Pure function; the fallback is the oracle
A validation/clarity fix	A format that accepts a parameter it cannot honour; a confusing exception	Low risk, real value, easy review
A small perf win	A reduced allocation in the graph builder/merger hot path, proven with a benchmark	High value if measured cleanly

Then reproduce it as a failing test:

// extends LuceneTestCase
public void testScalarQuantizedDequantizeEdgeCase() {
    float[] v = new float[] { 0f, 0f, 0f, 0f };   // degenerate input that triggers the bug
    // quantize -> dequantize -> assert the round-trip / similarity invariant the code claims
    assertEquals(expected, actual, DELTA);        // fails on main, passes with your fix
}

./gradlew :lucene:core:test --tests "*YourNewTest*"   # confirm it FAILS on main first

This failing test, with a seed, is the artifact that turns "I think there's a bug" into a Lucene PR reviewers will engage with.
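To see why a degenerate vector is a natural place to look, consider a toy scalar quantizer. The sketch below is not Lucene's quantizer: ScalarQuantSketch, the 0..127 code range, and the min/max parameters are assumptions made for illustration. When every component is equal the range is zero, so any code that divides by the range needs an explicit guard, and the round-trip invariant worth asserting is that dequantized values stay within half a quantization step of the input.

```java
/** Hypothetical toy quantizer used to illustrate the edge case; not Lucene's implementation. */
final class ScalarQuantSketch {
    /** Maps each component from [min, max] onto an integer code in [0, 127]. */
    static byte[] quantize(float[] v, float min, float max) {
        byte[] out = new byte[v.length];
        float range = max - min;
        if (range == 0f) {
            return out; // degenerate input: all components equal, so every code is 0
        }
        float scale = 127f / range;
        for (int i = 0; i < v.length; i++) {
            float clamped = Math.max(min, Math.min(max, v[i]));
            out[i] = (byte) Math.round((clamped - min) * scale);
        }
        return out;
    }

    /** Inverse mapping: code * step + min. With a zero range every value decodes to min. */
    static float[] dequantize(byte[] q, float min, float max) {
        float step = (max - min) / 127f;
        float[] out = new float[q.length];
        for (int i = 0; i < q.length; i++) {
            out[i] = min + q[i] * step;
        }
        return out;
    }
}
```

A real reproducer would call the Lucene classes directly; the value of the toy is that it shows which invariant to write down before you go looking.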

Phase 2 — The fix (minimum diff)

Fix it where the bug lives, nothing more. Lucene reviewers are strict about scope and about not touching the on-disk format casually (format changes have versioning and BWC implications).

grep -rn "dequantize\|quantize\|scalar" lucene/core/src/java/org/apache/lucene/codecs/lucene99
grep -rn "squareDistance\|dotProduct\|cosine" lucene/core/src/java/org/apache/lucene/util/VectorUtil.java

If your change touches a kernel that has both a Panama and a scalar path, fix both and assert they agree — that parity test is often the most valuable thing in the PR.

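public void testPanamaAndScalarAgree() {
    // VECTOR_UTIL_SCALAR / VECTOR_UTIL_PANAMA are placeholders for the two providers;
    // grep the vectorization tests (for example BaseVectorizationTestCase) to see how
    // Lucene's own tests obtain them.
    float[] a = randomVector(768), b = randomVector(768);
    assertEquals(VECTOR_UTIL_SCALAR.dotProduct(a, b),
                 VECTOR_UTIL_PANAMA.dotProduct(a, b), 1e-4f);
}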
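The reason a parity test needs a tolerance at all is that the two paths add the same products in a different order. The sketch below imitates that with plain Java and no Vector API: DotParity is a hypothetical name, dotUnrolled uses four independent accumulators as a stand-in for SIMD lanes, and the tolerance is scaled by the sum of absolute products rather than fixed. It is a model of the comparison, not Lucene's kernels.

```java
import java.util.Random;

/** Hypothetical scalar-only model of a Panama/scalar parity check. */
final class DotParity {
    static float dotScalar(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    /** Four independent accumulators, combined at the end, then a scalar tail loop. */
    static float dotUnrolled(float[] a, float[] b) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        int bound = a.length & ~3; // largest multiple of 4 not exceeding the length
        for (; i < bound; i += 4) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        float sum = (s0 + s1) + (s2 + s3);
        for (; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    static float[] randomVector(int dim, Random r) {
        float[] v = new float[dim];
        for (int i = 0; i < dim; i++) v[i] = r.nextFloat() * 2f - 1f;
        return v;
    }

    /** Agreement within relTol times the sum of absolute products. */
    static boolean agree(float[] a, float[] b, float relTol) {
        float magnitude = 0f;
        for (int i = 0; i < a.length; i++) magnitude += Math.abs(a[i] * b[i]);
        float diff = Math.abs(dotScalar(a, b) - dotUnrolled(a, b));
        return diff <= relTol * Math.max(magnitude, Float.MIN_NORMAL);
    }
}
```

Scaling the tolerance by magnitude is a design choice worth stating in the PR: a fixed absolute delta that passes for unit-length vectors can fail for larger ones without any real bug.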

Phase 3 — Run the full gate

Lucene's check is broad and strict (forbidden APIs, formatting, the full test sweep). Run it, and run your area's tests with several seeds to flush randomized flakiness.

./gradlew check                                  # the full gate; expect it to be heavy
./gradlew :lucene:core:test --tests "org.apache.lucene.codecs.lucene99.*"
# repeat your test under a few seeds:
for s in CAFE D00D F00D; do \
  ./gradlew :lucene:core:test --tests "*YourNewTest*" -Ptests.seed=$s; done

If you touched a SIMD path, run with the Vector API module enabled to exercise the Panama provider, not just the fallback:

./gradlew :lucene:core:test --tests "*VectorUtil*" \
  -Dtests.jvmargs="--add-modules jdk.incubator.vector"

Phase 4 — The PR, the ASF way

This is where the process differs from OpenSearch. Do all of it:

Add a CHANGES.txt entry under the right section (Bug Fixes / Improvements / Optimizations) for the next release — not a CHANGELOG.md, and no DCO sign-off line.
Open a GitHub PR against apache/lucene:main following the PR template; reference the issue number (and the LUCENE-NNNN if one exists).
Expect review from Lucene committers. Respond by changing code or explaining with a test, not by arguing. Randomized-test reviewers will ask "does this hold under any seed?" — have the answer.

  # CHANGES.txt (excerpt — section names/format vary by version, check the file)
  ## Bug Fixes
+ * GITHUB#NNNNN: Fix scalar-quantized dequantization of all-zero vectors so the
+   round-trip similarity matches the documented invariant. (Your Name)

Deliverables

capstone-work/lucene-vector-map.md — field → format → files, query → graph → VectorUtil, cited
capstone-work/design.md — the chosen target, why it's in scope, alternatives rejected
A failing LuceneTestCase (with a seed) that is red on main
The minimum-diff fix, green under multiple seeds
A Panama/scalar parity test if a SIMD kernel was touched
capstone-work/validation.md — ./gradlew check output, the seeds run, the JVM args used
A CHANGES.txt entry (ASF-style, no DCO line)
A GitHub PR against apache/lucene:main referencing the issue
A 500–1000 word write-up: the bug/improvement, the reproduction, the ASF process you learned

Difficulty & time


Aspect	Assessment
Engineering difficulty	Hard (the code can be small; the rigor is high)
Mergeability	High — this is the whole point of the project; a clean scoped Lucene fix lands
Time	Phase 0–1 (reproduce): ~1–2 weeks. Through merged PR: 4–6 weeks incl. review
Hardest part	Scoping small enough, and the randomized-test rigor Lucene demands

Warning: Do not change the on-disk vector format for a first contribution. Format changes carry BWC and versioning weight and will not be reviewed as a newcomer PR. Stay in correctness, validation, VectorUtil, or an allocation/perf tweak that doesn't alter bytes on disk.

Stretch goals

After a correctness fix lands, propose a small optimization in the same area (a reduced allocation in the graph merger hot path) with a luceneutil/JMH benchmark — this is the path toward the ~25%-indexing-speedup class of wins.
Trace your merged change into OpenSearch: find the "Upgrade to Lucene X.Y" issue/PR in opensearch-project/OpenSearch and note where your fix arrives. Search: is:pr "Upgrade to Lucene" in the OpenSearch repo.
Add a parity test for a kernel that currently lacks one, even without a bug — strengthening the test suite is a legitimate, welcome Lucene contribution.

Evaluation

Self-grade against the 100-point rubric:

Dimension	What earns the points here
Problem articulation (20)	The design note states the exact invariant violated / improvement, not "make vectors faster"
Execution-path mastery (20)	`lucene-vector-map.md` traces field → format → files and query → graph → `VectorUtil`, cited
Implementation quality (20)	Minimum diff, no on-disk format change, fix where the bug lives; ASF conventions followed
Testing (15)	A failing `LuceneTestCase` red on `main`, green under multiple seeds; a Panama/scalar parity test if relevant — max this
Review responsiveness (10)	The real Lucene PR cadence with committers; you answer seed questions with tests
Documentation (10)	`CHANGES.txt`, design note, the write-up, an explicit note on BWC/format impact
Community interaction (5)	You followed ASF PR etiquette, referenced the issue, and requested review correctly

A merged Lucene vector PR at 90+ is the single most transferable credential in this portfolio.

How to turn this into a real contribution

This is the real contribution — unlike the other projects, the upstream PR is the primary deliverable, not an optional stretch. Scope Phase 1 to land.
Reproduce in Lucene's tests first. A failing LuceneTestCase with a seed is the entry ticket. No reproducer, no review.
Learn the ASF process deliberately. CHANGES.txt (not CHANGELOG.md), no DCO line, ASF PR template, committer review. Lab L4 walks through the mechanics.
Respect the format boundary. Stay out of on-disk format changes for a first PR. Pick correctness, VectorUtil, validation, or a byte-stable optimization.
Close the loop into OpenSearch. When your fix ships in a Lucene release and OpenSearch upgrades, your code is running in OpenSearch — note that in your write-up. It is the cleanest demonstration in the whole curriculum of how the layers connect.

This is the project to choose when you want a real upstream merge with the least cross-plugin surface area and the clearest process — at the cost of the highest test-rigor bar in the portfolio.

Project 5: Extending Star-Tree Aggregations

Most aggregations in OpenSearch are computed at query time by walking DocValues across every matching document — general, correct, and, for a high-cardinality dashboard query over a billion documents, slow. The star-tree index is the precomputation answer: during indexing, OpenSearch builds a composite index — a tree keyed on a set of dimension fields, with precomputed metric aggregations at each node — so that a matching (filter dims, metric) query is answered by a bounded tree traversal instead of a full DocValues scan. It is one of the highest-leverage pieces of search-performance engineering in the codebase, and it is still young: the set of supported dimension types, metric aggregations, and query shapes that can be resolved against the tree is deliberately incomplete.

This project asks you to extend the star-tree — add support for one aggregation, dimension type, or query shape it does not yet handle — starting from a slice you can scope in a weekend (a resolution gap with a clear "falls back to the default path" symptom) and ending at a real, tested, upstreamable feature.

Note: Read Star-Tree Aggregations and the Aggregations deep dive before you start. This brief assumes you know what a composite index is, what the difference between an ordinal dimension and a numeric dimension is, what star_tree mapping config looks like, and how AggregatorFactories build the query-time aggregation tree. It will not re-derive them.

This is the hardest brief in the portfolio. The star-tree spans the mapping layer, a custom composite-index builder that runs at flush/merge time, a custom Lucene DocValuesFormat, and the aggregation resolution path that decides — per query — whether the tree can answer it. A full new aggregation is a multi-week change. The discipline here, even more than elsewhere, is to scope to a resolution slice: a case the tree could answer but currently does not, where your change is "teach the resolver to recognize it" rather than "build new tree structure."

Problem & motivation

The star-tree was introduced to bound aggregation latency for the dashboard-style workload: many terms + date_histogram filters, a handful of sum/avg/min/max/value_count metrics, run repeatedly over an append-mostly index. When the query's filter dimensions and requested metrics are a subset of the dimensions and metrics the tree was built on, OpenSearch can resolve the aggregation against the tree — reading a precomputed metric at a node — instead of running the normal collector over every matching document.
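The idea of answering from precomputed cells instead of documents can be sketched with a toy in Python. This is not OpenSearch code: a flat rollup keyed on dimension tuples stands in for the tree, and every name here (`DOCS`, `build_rollup`, `sum_from_rollup`, `sum_from_scan`) is hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Toy corpus; the field names mirror the example mapping used later in this brief.
DOCS = [
    {"status": "200", "region": "us", "bytes": 10},
    {"status": "200", "region": "eu", "bytes": 30},
    {"status": "500", "region": "us", "bytes": 5},
    {"status": "200", "region": "us", "bytes": 20},
]
TREE_DIMS = ("status", "region")  # ordered dimensions the rollup is keyed on


def build_rollup(docs):
    """Precompute sum/max/value_count of bytes per (status, region) combination."""
    cells = defaultdict(lambda: {"sum": 0, "max": None, "value_count": 0})
    for doc in docs:
        cell = cells[tuple(doc[dim] for dim in TREE_DIMS)]
        cell["sum"] += doc["bytes"]
        cell["max"] = doc["bytes"] if cell["max"] is None else max(cell["max"], doc["bytes"])
        cell["value_count"] += 1
    return dict(cells)


def sum_from_rollup(rollup, filters):
    """Answer sum(bytes) by visiting precomputed cells, never the documents."""
    return sum(
        cell["sum"]
        for key, cell in rollup.items()
        if all(key[TREE_DIMS.index(dim)] == value for dim, value in filters.items())
    )


def sum_from_scan(docs, filters):
    """The default path: walk every document and keep the ones that match."""
    return sum(d["bytes"] for d in docs if all(d[k] == v for k, v in filters.items()))


ROLLUP = build_rollup(DOCS)
```

The rollup has one cell per distinct dimension combination, so the work per query is bounded by the number of cells rather than the number of documents. The real tree adds star nodes and leaf-size limits that this sketch leaves out.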

The gaps that make this a project:

Not every metric aggregation is supported. The initial set is the "additive" metrics that compose cleanly up a tree (sum, min, max, value_count, and avg as sum/count). Aggregations whose partial results do not trivially merge — or that need a different stored representation — are not yet resolvable. (avg already shows the pattern: you store sum and count and divide at read time. The same "store a decomposable intermediate" trick extends to more aggregations than are currently wired.)
Not every dimension type is supported. Keyword/ordinal and numeric dimensions work; other field types, ranges as dimensions, or specific date-rounding granularities may not resolve.
The resolver is conservative. Even for a supported metric, a query can fail to resolve — and silently fall back to the slow default path — because the resolver does not recognize the query shape (an extra clause, a sub-aggregation arrangement, a date_histogram interval the tree was not built for). The symptom is not an error; it is the star-tree silently not being used, which is invisible unless you are looking for it.

That last point is the key to scoping. The cheapest real contribution is to find a query that should resolve against an existing tree but does not, and either make it resolve or make the not-resolving observable so an operator can see the fallback. Both are mergeable; the first is a feature, the second is the diagnostic that the feature work depends on.

Real-world grounding

The star-tree is RFC-driven and tracked by a meta-issue with a live sub-issue list. These are real:

[RFC] Pre-compute aggregations with a star-tree index — OpenSearch #12498: https://github.com/opensearch-project/OpenSearch/issues/12498
[META] Star-tree index — issue list — OpenSearch #13875: https://github.com/opensearch-project/OpenSearch/issues/13875
[Star Tree][Search][RFC] Resolve aggregation via star-tree — OpenSearch #14871: https://github.com/opensearch-project/OpenSearch/issues/14871

Issue #14871 is the one you live in: it is specifically about the resolution path — which aggregations/queries get matched to the tree — and it is exactly the surface where a scoped slice is realistic. The meta-issue #13875 is the durable anchor for "what is built, what is in flight, what is open."

Citation discipline: the star-tree's supported-aggregation list moves fast. Do not cite a sub-issue number you have not opened and read. Run a live search before you scope: is:issue is:open label:"Search:Aggregations" star tree and is:issue is:open "star tree" resolve OR support OR aggregation in opensearch-project/OpenSearch, and read the current checklist in #13875. Link the current sub-issue in your design note; the three issues above are the anchors.

Subsystems you'll touch

Subsystem	Class / area (grep to confirm names per version)	What it owns
Mapping	`org.opensearch.index.mapper.StarTreeMapper` (under `server/src/main/java/org/opensearch/index/mapper`), the `composite` mapping parser	Parses the `star_tree` config: ordered dimensions, metrics, `max_leaf_docs`
Composite index core	`org.opensearch.index.compositeindex.*` (under `server/.../index/compositeindex`)	The composite-index abstractions shared by star-tree (and future composite types)
Star-tree builder	`org.opensearch.index.compositeindex.datacube.startree.builder.*` (e.g. `BaseStarTreeBuilder`, `OnHeapStarTreeBuilder`, `OffHeapStarTreeBuilder`)	Builds the tree at flush/merge: sorts docs by dimension, aggregates metrics up the tree
Aggregators (metric compose)	`org.opensearch.index.compositeindex.datacube.startree.aggregators.*` (`ValueAggregator`, `MetricAggregatorInfo`, `StarTreeAggregatorFactory`)	The per-metric "how do partial values combine up the tree" logic
Codec / DocValues format	the star-tree `DocValuesFormat` / `Composite99DocValuesFormat` and reader (grep `Composite*DocValues` under `server/.../codec`)	Writes/reads the tree as segment files
Aggregation resolution	`org.opensearch.search.aggregations.startree.*` and the rewrite hook in `AggregatorFactory` / `SearchContext` (grep `StarTree` under `server/.../search/aggregations`)	Decides per-query whether the tree can answer, and builds the star-tree-backed collector
Query path	`org.opensearch.search.startree.*` (the `StarTreeQueryContext` / `StarTreeFilter`, grep to confirm)	Translates filter dimensions into a tree traversal

Deep dives that cover the surrounding ground: Star-tree engineering chapter · Aggregations deep dive · DocValues and fielddata (the columnar layer the tree precomputes over) · Search execution (where the resolution decision lives) · Refresh/flush/merge (when the builder runs).

Phased plan

The discipline of this project is that Phase 1 produces a diagnostic that is mergeable on its own and that you then need in order to do Phases 3–4 honestly. You cannot tune resolution you cannot observe.

Phase 0 — Build it, build a tree, and watch it resolve (1 day)

Build OpenSearch from source and stand up an index with a star_tree composite mapping.

# In your OpenSearch clone:
./gradlew assemble
./gradlew run        # single node, REST on :9200

# Create an index with a star-tree composite index.
curl -s -X PUT localhost:9200/logs -H 'Content-Type: application/json' -d '{
  "settings": {
    "index.number_of_shards": 1,
    "index.composite_index": true,
    "index.append_only.enabled": true
  },
  "mappings": {
    "composite": {
      "startree": {
        "type": "star_tree",
        "config": {
          "ordered_dimensions": [ { "name": "status" }, { "name": "region" } ],
          "metrics": [ { "name": "bytes", "stats": [ "sum", "max", "value_count" ] } ]
        }
      }
    },
    "properties": {
      "status": { "type": "keyword" },
      "region": { "type": "keyword" },
      "bytes":  { "type": "integer" }
    }
  }
}' | python3 -m json.tool

Note: the exact settings names (index.composite_index, the composite/startree mapping shape, whether append_only is required) change across versions. Grep the mapper test for the current canonical mapping before you trust the snippet above: grep -rn "star_tree\|ordered_dimensions\|StarTreeMapper" server/src/*/java/org/opensearch/index/mapper.

Index some documents, then run an aggregation that should resolve and one that should not, and find out how the codebase tells you which path ran. This is the crux of Phase 0.

# A query that should resolve against the tree (filter on a dimension, sum a metric):
curl -s localhost:9200/logs/_search -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": { "term": { "status": "200" } },
  "aggs": { "total": { "sum": { "field": "bytes" } } }
}'

Now find the resolution decision in code:

grep -rn "StarTree" server/src/main/java/org/opensearch/search/aggregations | head -40
grep -rn "canUseStarTree\|getStarTreeQueryContext\|StarTreeQueryContext\|supportedStarTree" \
  server/src/main/java/org/opensearch/search | head -40
grep -rn "class StarTreeQueryContext\|StarTreeFilter\|StarTreeValuesIterator" \
  server/src/main/java/org/opensearch/search/startree

Write a one-page capstone-work/resolution-path.md: query → AggregatorFactory → the resolution check → StarTreeQueryContext (resolved) or the normal collector (fallback). With file:line citations. This is your execution-path-mastery artifact and you cannot skip it. You must be able to point at the exact branch where a query is decided to be tree-resolvable or not.

Phase 1 — Make the fallback observable (the scoped, mergeable slice)

Right now the painful truth is that "the star-tree silently did not get used" is nearly invisible. Make it observable. Add a profile/stats signal that says, per aggregation, whether it resolved against the star-tree or fell back — and why it fell back.

The cleanest hook is the profiler. The aggregation profiler (GET /_search?profile=true) already reports per-aggregator timing; add a debug field that records the star-tree decision.

grep -rn "class.*ProfilingAggregator\|AggregationProfiler\|collectDebugInfo\|debug" \
  server/src/main/java/org/opensearch/search/profile/aggregation
grep -rn "collectDebugInfo" server/src/main/java/org/opensearch/search/aggregations/metrics

// In the metric aggregator that can be star-tree-backed (e.g. the sum aggregator), when it has a
// StarTreeQueryContext it records that; when it does not, it records the reason.
@Override
public void collectDebugInfo(BiConsumer<String, Object> add) {
    super.collectDebugInfo(add);
    add.accept("star_tree_used", starTreeQueryContext != null);
    if (starTreeQueryContext == null && starTreeFallbackReason != null) {
        add.accept("star_tree_fallback_reason", starTreeFallbackReason); // e.g. "metric not supported by tree"
    }
}

Set starTreeFallbackReason at the exact resolution branches you mapped in Phase 0 (unsupported metric, dimension not in tree, query shape not recognized, no composite index on the field).
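Before touching the Java, it can help to write the branch list down as a language-neutral table of decisions. The sketch below is a toy in Python, not the OpenSearch resolver: `decide`, `Decision`, and the shape of `TREE` are hypothetical, and the reason strings simply mirror the four branches named above.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    star_tree_used: bool
    fallback_reason: Optional[str]


def decide(tree, filter_dims, metric, query_shape):
    """Return the resolution decision and the first reason the tree cannot answer."""
    if tree is None:
        return Decision(False, "no composite index on the field")
    if metric not in tree["metrics"]:
        return Decision(False, "metric not supported by tree")
    missing = [dim for dim in filter_dims if dim not in tree["dimensions"]]
    if missing:
        return Decision(False, "dimension not in tree: " + ", ".join(missing))
    if query_shape not in tree["shapes"]:
        return Decision(False, "query shape not recognized: " + query_shape)
    return Decision(True, None)


# Hypothetical description of what one tree was built on.
TREE = {
    "dimensions": {"status", "region"},
    "metrics": {("bytes", "sum"), ("bytes", "max"), ("bytes", "value_count")},
    "shapes": {"term", "terms", "match_all"},
}
```

One reason per branch is the useful property: if two branches in the real resolver would emit the same string, an operator cannot tell them apart in the profile output.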

Then add a unit/integration test that asserts the profiler reports star_tree_used: true for a resolvable query and a specific star_tree_fallback_reason for one that is not (the positive case is shown; the fallback case mirrors it):

// StarTreeAggregationProfileIT (OpenSearchIntegTestCase-based)
public void testProfileReportsStarTreeUsage() throws Exception {
    // ... create the star_tree index above, index docs, refresh ...
    SearchResponse r = client().prepareSearch("logs")
        .setSize(0)
        .setQuery(QueryBuilders.termQuery("status", "200"))
        .addAggregation(AggregationBuilders.sum("total").field("bytes"))
        .setProfile(true)
        .get();
    Map<String, Object> debug = firstAggProfileDebug(r);     // helper that drills into the profile tree
    assertEquals(Boolean.TRUE, debug.get("star_tree_used"));
}
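The same check is handy from outside the test suite, against a live node. The sketch below walks a `profile=true` search response that has already been parsed into a Python dict and collects the proposed debug fields. It assumes the aggregation profile nodes carry `description`, `debug`, and `children` keys; `star_tree_used` and `star_tree_fallback_reason` are the fields this phase proposes, not fields that exist today, and the function names are hypothetical.

```python
def iter_agg_profiles(response):
    """Yield every aggregation profile node in a search response, depth first."""
    for shard in response.get("profile", {}).get("shards", []):
        stack = list(shard.get("aggregations", []))
        while stack:
            node = stack.pop()
            yield node
            stack.extend(node.get("children", []))


def star_tree_decisions(response):
    """Map aggregation name -> (used, fallback_reason) for nodes carrying the signal.

    With several shards the last shard visited wins; a real harness would
    keep one entry per shard instead.
    """
    decisions = {}
    for node in iter_agg_profiles(response):
        debug = node.get("debug", {})
        if "star_tree_used" in debug:
            decisions[node.get("description")] = (
                debug["star_tree_used"],
                debug.get("star_tree_fallback_reason"),
            )
    return decisions
```

Feeding it the JSON body of the Phase 0 query run with `profile` enabled gives a one-line answer to "which path ran, and why".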

Run the gates:

./gradlew spotlessApply
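./gradlew :server:test --tests "*StarTree*"
./gradlew :server:internalClusterTest --tests "*StarTree*"   # OpenSearchIntegTestCase-based ITs run under this task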
./gradlew precommit

Why this is the right Phase 1: it is a minimum diff, it is exactly the kind of observability maintainers merge without an RFC, and it forces you to enumerate every resolution branch, which is precisely the map you need to make a branch resolve in Phase 3. A clean "star-tree usage in the profiler" PR is a credible first star-tree contribution; read the meta-issue #13875 checklist to see whether observability is already tracked there before you open it.

Phase 2 — Reproduce a silent fallback and characterize it

Using your Phase-1 profiler signal, find a query that you believe the tree should answer but that falls back. Write it down as a failing-expectation test (the profiler reports a fallback today; after Phase 3 it should report star_tree_used: true).

# Compare results AND path for the same logical aggregation, with and without a forced fallback.
# Forcing fallback (so you have a correctness oracle): point the same query at a non-composite index.

Characterize the gap precisely in capstone-work/design.md: is it an unsupported metric, an unsupported dimension type, or a query shape the resolver does not recognize? Each has a different implementation surface and a different difficulty. Pick the one with the smallest surface.

Phase 3 — Make it resolve (the feature)

Now the real work. Choose one, scoped tightly:

Option	Touches	Difficulty	Why it is real
Resolve a query shape the tree can answer but the resolver rejects (e.g. a metric already stored, blocked by an over-conservative check)	resolution path only	Hard, but the smallest surface	Pure resolver win; no new tree structure and no format change, so it is the most mergeable feature
Add a decomposable metric (store an intermediate that composes up the tree, divide/finalize at read)	aggregators + resolution	Hard	Follows the `avg = sum/count` pattern; the result is exact, so correctness is provable against the default path
Support an additional dimension type / date-rounding granularity	mapper + builder + resolution	Very hard	Touches the builder and format — only attempt if Phases 1–2 went smoothly
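The "decomposable metric" option in the table is easier to reason about with a toy. The sketch below, in Python rather than the project's Java, shows why avg is stored as a (sum, count) partial: partials merge exactly up a tree, while averaging already-averaged children does not. `AvgPartial`, `merged_avg`, and `avg_of_avgs` are hypothetical names, not OpenSearch classes.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class AvgPartial:
    """Intermediate stored per node: composes by addition, finalized by division."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        return AvgPartial(self.total + value, self.count + 1)

    def merge(self, other):
        return AvgPartial(self.total + other.total, self.count + other.count)

    def finalize(self):
        return self.total / self.count if self.count else None


def merged_avg(leaves):
    """Build one partial per leaf, merge them at the parent, divide once at read time."""
    partials = [reduce(AvgPartial.add, values, AvgPartial()) for values in leaves]
    return reduce(AvgPartial.merge, partials, AvgPartial()).finalize()


def avg_of_avgs(leaves):
    """The tempting but wrong rollup: it ignores how many values each leaf holds."""
    per_leaf = [sum(values) / len(values) for values in leaves]
    return sum(per_leaf) / len(per_leaf)
```

For the leaves `[10, 20, 30]` and `[100]`, the merged partial finalizes to 40.0 (160 over 4 values), while the average of the two leaf averages is 60.0. Any metric you propose needs an intermediate with the same merge property.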

Whatever you pick, the non-negotiable is correctness: the star-tree result must equal the default path's result for the same query, byte for byte (these are exact aggregations, not approximations — unlike ANN there is no recall tolerance). Your test asserts equality against the non-composite path.

// The correctness oracle: same data, same query, two indices — one with the star-tree, one without.
assertEquals(
    aggResult("logs-no-startree", query, agg),   // default DocValues path
    aggResult("logs-startree",    query, agg));   // resolved against the tree
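When the oracle fails, "not equal" is not enough to debug with; you want the path of the first differing value. The sketch below is a small Python helper for comparing the `aggregations` section of two search responses exactly, with no tolerance. `diff_aggs` is a hypothetical name; the only assumption about the response is that aggregation results are nested dicts and lists.

```python
def diff_aggs(expected, actual, path="aggregations"):
    """Return the paths at which two aggregation result trees differ (exact comparison)."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs = []
        for key in sorted(set(expected) | set(actual)):
            if key not in expected or key not in actual:
                diffs.append(f"{path}.{key}: present in only one response")
            else:
                diffs.extend(diff_aggs(expected[key], actual[key], f"{path}.{key}"))
        return diffs
    if isinstance(expected, list) and isinstance(actual, list) and len(expected) == len(actual):
        diffs = []
        for i, (exp_item, act_item) in enumerate(zip(expected, actual)):
            diffs.extend(diff_aggs(exp_item, act_item, f"{path}[{i}]"))
        return diffs
    return [] if expected == actual else [f"{path}: {expected!r} != {actual!r}"]
```

Call it as `diff_aggs(default_resp["aggregations"], star_tree_resp["aggregations"])` and fail the run on any non-empty result. Comparing only `aggregations` keeps run-specific fields such as `took` out of the oracle.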

Phase 4 — Prove the latency win

A star-tree change that does not move latency is not worth the maintenance surface. Use an OpenSearch Benchmark macro run (or the JMH approach from the perf-regression lab): a fixed aggregation query set, run against the star-tree index vs the same data without a star-tree, on a realistic document count.

query                         path        p50 (ms)   p99 (ms)   docs scanned
sum(bytes) filter status      default        420       1180      120,000,000
sum(bytes) filter status      star-tree        3          9          (tree)
<your newly-resolved query>   default         ...        ...        ...
<your newly-resolved query>   star-tree        ...        ...       (tree)

The headline number is "the newly-resolved query now takes the tree path," shown by the star_tree_used: true from Phase 1 and the latency drop. Both together are the proof.

Deliverables

capstone-work/resolution-path.md — query → factory → resolution branch → context/fallback, with citations
capstone-work/design.md — the gap (metric / dimension / query shape), the slice chosen, what you rejected
Phase 1: a profiler/stats signal for star-tree usage + fallback reason, with a test
Phase 2: a characterized silent-fallback case, captured as a failing-expectation test
(Phase 3) the resolver/metric change that makes that case resolve, behind a correctness oracle test
(Phase 4) an OSB/JMH latency before/after table proving the tree path is taken and faster
capstone-work/validation.md — ./gradlew spotlessApply precommit output, test commands, seeds, dataset
A CHANGELOG.md entry under ## [Unreleased]
An upstreaming decision: a DCO-signed OpenSearch PR for the Phase-1 observability slice, or a written scoped proposal under #14871
A 500–1000 word write-up: the resolution gap, the path you traced, the correctness argument

Difficulty & time


Engineering difficulty	Very hard (Phase 1 alone is Medium-Hard)
Mergeability	High for Phase 1 (observability); negotiated and RFC-anchored for Phase 3
Time	Phase 1: ~1 week. Phases 1–2: ~2–3 weeks. Through Phase 4: 5–8 weeks
Hardest part	The builder runs at flush/merge and the format is custom — debugging a wrong tree value means reading the `OffHeapStarTreeBuilder` and the composite DocValues reader together

The star-tree is the deepest subsystem in this portfolio because the data structure, its on-disk format, and the per-query resolution decision are all in scope. Scope ruthlessly to the resolution layer for your first contribution; leave the builder and format for a stretch.

Stretch goals

Wire the star-tree usage stat into the index/shard stats API (not just the per-query profiler) so an operator can see, fleet-wide, what fraction of aggregations are resolving against trees.
Add a _validate/query-style explain for star-tree resolution: "this aggregation would/would not resolve against star-tree X, because Y."
For a decomposable metric you added in Phase 3, add the off-heap builder path so it works under the OffHeapStarTreeBuilder, not just on-heap.
Reproduce the spirit of #14871 end to end: take one concrete query from that thread that does not resolve today and make it resolve, with the profiler proof and the latency table in the PR.

Evaluation

Self-grade against the 100-point rubric. For this project:

Dimension	What earns the points here
Problem articulation (20)	The design note names the exact resolution gap (which metric/dimension/query shape) and why the current resolver is conservative there — not "aggregations are slow"
Execution-path mastery (20)	`resolution-path.md` traces the per-query resolve/fallback branch with file:line citations, before you changed anything
Implementation quality (20)	Phase 1 is a minimum diff in the profiler; Phase 3 changes the resolver where it already decides, with no new mapping params unless justified; correctness is guarded
Testing (15)	A correctness oracle (star-tree result == default-path result, exact); a test red without the fix; a negative control (a query that should still fall back)
Review responsiveness (10)	The real OpenSearch PR cadence, or a peer review against this rubric
Documentation (10)	Design note, `CHANGELOG.md`, write-up, and an explicit docs decision (star-tree config is user-facing)
Community interaction (5)	You commented your scope on #14871 / the current sub-issue and pinged the `MAINTAINERS.md` Search/Aggregations reviewers

A finished Phase 1 at 90+ is a real, merged observability contribution in OpenSearch's highest-leverage aggregation subsystem — and the foundation for the resolution feature in Phase 3.

How to turn this into a real contribution

Start from observability, not the feature. The Phase-1 profiler signal is the upstream target. "The star-tree silently does not get used" is a known operator pain, the change is low-blast-radius, and it needs no new RFC — only a comment under #13875 / #14871.
Comment before you code. The star-tree is RFC-governed. Read #12498 and the #13875 checklist, then comment on #14871 with the specific resolution gap you intend to close. Confirm a maintainer agrees it should resolve before you change the resolver — the conservatism may be deliberate.
One slice per PR. Observability, then (separately) the resolution/metric change. Never bundle a new resolvable metric with the profiler signal — reviewers will split it.
Bring a correctness oracle, not a recall number. Star-tree aggregations are exact. The reviewer's first question is "does it match the default path on every query?" Arrive with that test already in the PR, plus the latency table.
DCO applies. This is core OpenSearch: git commit -s, CHANGELOG.md, the backport bot. Get the sign-off email right locally before you push.

If only Phase 1 lands, you have shipped a genuinely useful diagnostic in the most sophisticated aggregation engine OpenSearch has — and you have the map that makes the resolution feature tractable. That is the point of scoping from the observability slice up.

Project 6: A k-NN Recall/Latency Benchmark Harness

Every claim about vector search is a lie until it has a recall number attached. "Faster" is meaningless if recall dropped. "Smaller" is meaningless if latency tripled. "Better quantization" is three numbers — recall@k, latency, and memory — measured together, on a fixed dataset, against a known ground truth, reproducibly. The single most common reason a k-NN performance PR stalls is that the author brought one number (latency) and the maintainer asked for the other two.

This project builds the thing that makes every other k-NN project credible: a reproducible benchmark harness that, given an engine/quantization configuration, reports recall@k, p50/p99 query latency, and native-memory footprint, against a ground truth it computes itself. It is rated "Medium" to build and is one of the most mergeable projects in the portfolio, because a clean, documented harness is exactly the tooling the k-NN team and every future contributor wants.

Note: Read Quantization and Disk-ANN and do the benchmark lab before you start. This brief assumes you know what recall@k means, what a space_type is, what on_disk mode and compression_level do, and how to read GET /_plugins/_knn/stats. It builds the lab's one-off measurement into a harness — repeatable, configurable, and trustworthy.

This project is also the prerequisite tool for the recall claims in Project 1. If you are doing both, build this first.

Problem & motivation

Benchmarking ANN is uniquely easy to get wrong, and the wrong way looks fine:

Recall needs a ground truth, and the ground truth is the expensive part. Recall@k is "of the true k nearest neighbours, how many did the approximate search return?" The true k nearest neighbours come from an exact (brute-force / flat) search over the same vectors with the same space_type. People skip this and report latency alone — or, worse, compute "recall" against an approximate baseline, which measures nothing.
The three numbers trade off against each other and must be reported together. A configuration is a point on a recall/latency/memory surface. Reporting one axis hides the trade. Raising ef_search alone raises recall and latency together (better recall, slower queries); quantization lowers memory and recall together. A harness that cannot report all three at one configuration is not a benchmark.
"Reproducible" is the whole game and is rarely achieved. Different dataset, different query set, different ef_construction, a warmup that was or wasn't run, a circuit breaker that tripped mid-run, a single-shard vs multi-shard index, JIT not warmed — any of these silently changes the numbers. A result without the full configuration captured is folklore, not data.
The k-NN team explicitly wants better benchmarking. Comparing engines, quantization modes, and the in-flight GPU/remote-index-build work requires a trustworthy harness. This is open tooling the project asks for.

This project builds the harness so that "I made k-NN faster" becomes a table anyone can regenerate.

Real-world grounding

The grounding issue is the k-NN benchmarking work itself — the project's own statement that it needs reproducible recall/latency/memory comparison across engines and quantization modes:

Benchmarking the vector engine — k-NN #2595: https://github.com/opensearch-project/k-NN/issues/2595

This sits next to the GPU / remote-index-build RFCs whose claims a harness like this exists to validate — those features are entirely about recall/latency/memory trade-offs and cannot be evaluated without exactly this tool:

[RFC] Boosting OpenSearch vector-engine performance using GPUs — k-NN #2293: https://github.com/opensearch-project/k-NN/issues/2293
[RFC] Remote vector index build — k-NN #2294: https://github.com/opensearch-project/k-NN/issues/2294

Citation discipline: the benchmarking effort spawns sub-issues fast. Before you scope, search is:issue is:open label:Benchmarks and is:issue is:open recall OR latency benchmark in opensearch-project/k-NN, and check whether OpenSearch Benchmark already has a vector workload you should extend rather than reinvent (search its repo for vectorsearch / bigann). Link the current issue in your design note; #2595 is the anchor.

Subsystems you'll touch

This project is more integration and tooling than core-Java surgery — which is exactly why it is mergeable and a great first "I shipped infrastructure" contribution. You will touch:

Subsystem	Where it lives	What you use it for
OpenSearch Benchmark (OSB) vector workloads	`opensearch-benchmark` + `opensearch-benchmark-workloads` (the `vectorsearch` workload)	The macro driver: bulk-index a vector dataset, run a query set, record latency percentiles
The k-NN bulk/query REST surface	`PUT <index>` (`knn_vector` mapping), `_bulk`, the `knn` query	Index the vectors and run approximate search with `k`
The k-NN stats API	`GET /_plugins/_knn/stats`, the warmup API `GET /_plugins/_knn/warmup/<index>`	Read native-memory footprint, graph counts, cache state; force-load before timing
Ground-truth / recall computation	your harness code (Python, or an OSB custom param source / `recall` metric if OSB already exposes one)	Compute exact neighbours and recall@k from the approximate results
Native memory accounting	`org.opensearch.knn.index.memory.NativeMemoryCacheManager` (read-only — you report it via stats, not modify it)	The MB number; understand what it counts (faiss/nmslib native, not JVM heap)

Deep dives that cover the surrounding ground: k-NN query path · native JNI and memory · quantization and disk-ANN · the benchmark lab (the one-off this project generalizes) · the perf-regression lab (the discipline of before/after measurement, applied here to recall as well as latency).

Phased plan

The discipline of this project is that Phase 1 is a single-configuration harness that already produces a correct, reproducible three-number result. Everything after is sweeping that over more configurations and hardening it for others to run.

Phase 0 — Establish ground truth and the recall definition (1 day)

Before any benchmark, you must be able to compute the exact k nearest neighbours for a dataset, in the same space_type the index uses. Pick a standard, license-clean dataset with a fixed query split (a SIFT/GIST-style ANN dataset, or the dataset the OSB vectorsearch workload already uses — prefer the latter so your numbers are comparable to the project's).

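# An HNSW field is the wrong oracle: it is approximate. Ground truth must come from brute force.
# Either index the SAME vectors with an exact (flat) method, or (recommended) compute exact
# neighbours offline in NumPy and store them as the ground-truth file.
python3 - <<'PY'
import numpy as np

base = np.load("base.npy").astype(np.float32)      # (N, dim) corpus
queries = np.load("query.npy").astype(np.float32)  # (Q, dim) query set
K, BLOCK = 100, 256

# L2 ground truth (mirror your space_type EXACTLY; l2 here).
# ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed one query block at a time so the
# full (Q, N) distance matrix never has to fit in memory.
base_sq = (base ** 2).sum(axis=1)                   # (N,)
gt = np.empty((len(queries), K), dtype=np.int64)
for start in range(0, len(queries), BLOCK):
    q = queries[start:start + BLOCK]
    d = (q ** 2).sum(axis=1)[:, None] - 2.0 * (q @ base.T) + base_sq[None, :]
    gt[start:start + BLOCK] = np.argsort(d, axis=1)[:, :K]   # top-100 true neighbours per query
np.save("groundtruth.npy", gt)
PY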

Warning: the ground truth MUST use the same distance as the index space_type. Compute L2 ground truth for an l2 index, cosine for cosinesimil, inner-product for innerproduct. A mismatched oracle silently makes every recall number wrong. This is the single most common benchmark-harness bug — assert the space match in your harness config.
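A tiny example shows how easily a mismatched oracle changes the answer. The sketch below is standard-library Python meant for unit-testing the harness on a handful of vectors, not for a real corpus. The three `space_type` names come from the warning above; the function names and the tie-break on vector id are assumptions, and the cosine helper assumes non-zero vectors.

```python
import math


def l2_squared(q, v):
    # Squared L2 gives the same ranking as L2 and avoids the square root.
    return sum((a - b) ** 2 for a, b in zip(q, v))


def cosine_distance(q, v):
    dot = sum(a * b for a, b in zip(q, v))
    norms = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def negative_inner_product(q, v):
    # Larger inner product means closer, so negate it to sort ascending.
    return -sum(a * b for a, b in zip(q, v))


DISTANCES = {
    "l2": l2_squared,
    "cosinesimil": cosine_distance,
    "innerproduct": negative_inner_product,
}


def exact_neighbours(base, query, k, space_type):
    """Brute-force top-k ids for one query; ties are broken by the lower id."""
    dist = DISTANCES[space_type]
    return sorted(range(len(base)), key=lambda i: (dist(query, base[i]), i))[:k]


def check_space_match(index_space_type, ground_truth_space_type):
    """Refuse to score recall against an oracle built with a different distance."""
    if index_space_type != ground_truth_space_type:
        raise ValueError(
            f"index uses {index_space_type!r} but ground truth was built with "
            f"{ground_truth_space_type!r}"
        )
```

With `base = [[1, 0], [10, 0], [0, 1]]` and the query `[1, 0.1]`, the nearest neighbour under `l2` is id 0 and under `innerproduct` it is id 1, so scoring an inner-product index against an L2 oracle would mark a correct result as a miss.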

Write capstone-work/recall-definition.md: the exact recall@k formula you use, the dataset, the query split, the space_type, and how you computed ground truth. This is the document that makes your numbers auditable.
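One way to pin the formula down is to put it next to the prose as code. The sketch below is a minimal Python version of recall@k averaged over the query set, assuming the approximate results and the ground truth are lists of id lists in the same query order; `recall_at_k` is a hypothetical name, not an OSB or k-NN API.

```python
def recall_at_k(approx_ids, truth_ids, k):
    """Mean over queries of |approx[:k] & truth[:k]| / k."""
    if not truth_ids or len(approx_ids) != len(truth_ids):
        raise ValueError("need one non-empty result list per ground-truth query")
    hits = 0
    for approx, truth in zip(approx_ids, truth_ids):
        if len(truth) < k:
            raise ValueError("ground truth holds fewer than k neighbours for a query")
        # Set intersection: order within the top k does not matter for recall.
        hits += len(set(approx[:k]) & set(truth[:k]))
    return hits / (k * len(truth_ids))
```

Two properties are worth stating in the definition document: the denominator is always k (a search that returns fewer than k hits is penalized), and the ground truth scored against itself is exactly 1.0.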

Phase 1 — A single-configuration, three-number harness (the scoped, mergeable core)

Build a harness that takes ONE configuration and emits recall@k, p50/p99 latency, and native-memory MB — reproducibly. Prefer extending the OSB vectorsearch workload over a bespoke script, so it plugs into tooling the project already runs.

# Driver outline (OSB or a thin Python wrapper around the REST API):
# 1. PUT the index with the configured knn_vector mapping
curl -s -X PUT localhost:9200/bench -H 'Content-Type: application/json' -d '{
  "settings": { "index.knn": true, "index.number_of_shards": 1 },
  "mappings": { "properties": { "v": {
    "type": "knn_vector", "dimension": 128, "space_type": "l2",
    "method": { "name": "hnsw", "engine": "faiss",
                "parameters": { "m": 16, "ef_construction": 256 } }
  }}}
}'
# 2. _bulk index the corpus (batched), then force-merge to a stable segment count
curl -s -X POST 'localhost:9200/bench/_forcemerge?max_num_segments=1'
# 3. WARM UP native memory so the first query isn't penalized (the warmup API is a GET)
curl -s -X GET 'localhost:9200/_plugins/_knn/warmup/bench'
# 4. record native memory BEFORE timing
curl -s 'localhost:9200/_plugins/_knn/stats?pretty'
# 5. run the fixed query set, capturing per-query ids and latency
# 6. compute recall@k against groundtruth.npy; report p50/p99 latency; report native MB delta

The harness must capture the full configuration in its output so the run is reproducible:

{
  "config": { "engine": "faiss", "method": "hnsw", "m": 16, "ef_construction": 256,
              "ef_search": 100, "space_type": "l2", "mode": "in_memory",
              "compression_level": "1x", "dimension": 128, "shards": 1,
              "dataset": "sift-1m", "k": 10, "queries": 10000, "warmup": true },
  "result": { "recall_at_10": 0.987, "p50_ms": 1.8, "p99_ms": 4.1,
              "native_mem_mb": 612, "index_build_s": 73 }
}
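A sketch of how the output record might be assembled, in standard-library Python. The percentile uses the nearest-rank method, which is one reasonable choice among several and should be named in your write-up; `percentile` and `build_record` are hypothetical names, and the record layout simply mirrors the JSON above.

```python
import json
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples recorded")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def build_record(config, latencies_ms, recall, native_mem_mb, index_build_s):
    """Embed the full configuration next to the three numbers so the row is reproducible."""
    k = config["k"]
    return {
        "config": dict(config),
        "result": {
            f"recall_at_{k}": round(recall, 3),
            "p50_ms": percentile(latencies_ms, 50),
            "p99_ms": percentile(latencies_ms, 99),
            "native_mem_mb": native_mem_mb,
            "index_build_s": index_build_s,
        },
    }


def to_json_line(record):
    # One record per line keeps sweep output easy to append and diff.
    return json.dumps(record, sort_keys=True)
```

Writing one JSON line per run means a sweep is just a file of records, and any row can be regenerated from its own `config` block.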

Why this is the right Phase 1: a single-configuration harness that captures the full config and emits all three numbers correctly is already the thing the project needs. It is the unit the sweep in Phase 2 repeats. Get one row provably right — including ground-truth correctness — before you generate a hundred.

Validate Phase 1 by proving recall is correct at the extremes:

sanity check          expected             why
exact / flat method   recall@10 == 1.000   brute force must score perfect against its own ground truth
ef_search very high   recall@10 -> 1.000   HNSW with huge ef_search approaches exact
ef_search very low    recall@10 low        latency is low too, so the trade-off is visible

If your "exact" configuration does not score recall 1.000 against your ground truth, your ground truth or your recall computation is wrong. Fix that before anything else. This self-check is the single most important gate in the whole project.
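
The gate above can be sketched in a few lines. This assumes the common definition of recall@k (the fraction of the true top-k that appears in the returned top-k, averaged over queries) and L2 distance; the function names are hypothetical, and a real harness would use vectorized code rather than pure Python.

```python
def l2_squared(a, b):
    """Squared Euclidean distance; ranking by it matches ranking by L2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(corpus, query, k):
    """Brute-force ground truth: ids of the k nearest corpus vectors, ties broken by id."""
    ranked = sorted(range(len(corpus)), key=lambda i: (l2_squared(corpus[i], query), i))
    return ranked[:k]


def recall_at_k(returned_ids, truth_ids, k):
    """Fraction of the true top-k that appears in the returned top-k."""
    truth = set(truth_ids[:k])
    if not truth:
        raise ValueError("empty ground truth")
    return len(truth & set(returned_ids[:k])) / len(truth)


def mean_recall(all_returned, all_truth, k):
    """Average recall@k over the query set."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_returned, all_truth)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
    queries = [[0.0, 0.0], [4.0, 4.0]]
    truth = [exact_top_k(corpus, q, 2) for q in queries]
    # The gate: an exact search scored against its own ground truth must be perfect.
    assert mean_recall(truth, truth, 2) == 1.0
```

If the same check fails when the "returned" ids come from the cluster's exact configuration, the mismatch is in the id mapping, the distance function, or the ground-truth file, not in the ANN engine.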

Phase 2 — Sweep the configuration space

Now make the harness parametric: run the matrix and produce one row per configuration, so trade-offs are visible.

Axis	Values to sweep
Engine	`faiss` (hnsw, ivf), `lucene` (hnsw)
`ef_search`	a ladder, e.g. 16 / 32 / 64 / 128 / 256
Quantization / mode	`1x` (baseline), FP16, `on_disk` at `4x` / `8x` / `16x`
`space_type`	the one your dataset is labelled for (do not mix)

engine  method  quant   ef_search  recall@10  p50_ms  p99_ms  native_mb  build_s
faiss   hnsw    1x        64         0.962      1.4     3.0      612        73
faiss   hnsw    1x        128        0.987      1.9     4.1      612        73
faiss   hnsw    on_disk8x 128        0.94x      2.x     x.x       8x         7x
lucene  hnsw    1x        128        0.98x      x.x     x.x       (heap)     x

The deliverable is not the numbers (those are dataset-specific) — it is the recall/latency Pareto curve the harness can draw for any engine/quant combination, reproducibly, from a config file.
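
Extracting the Pareto front from the sweep rows is a small step. The sketch below assumes each row is a dict with `recall` and `p99_ms` keys (hypothetical names): a row is kept only if no faster-or-equal row already matches or beats its recall.

```python
def pareto_front(rows, recall_key="recall", latency_key="p99_ms"):
    """Return the rows that are not dominated on (higher recall, lower latency)."""
    front = []
    best_recall = float("-inf")
    # Walk from fastest to slowest; within equal latency, the highest recall comes first.
    for row in sorted(rows, key=lambda r: (r[latency_key], -r[recall_key])):
        if row[recall_key] > best_recall:
            front.append(row)
            best_recall = row[recall_key]
    return front


if __name__ == "__main__":
    rows = [
        {"ef_search": 64, "recall": 0.962, "p99_ms": 3.0},
        {"ef_search": 128, "recall": 0.987, "p99_ms": 4.1},
        # Hypothetical dominated row: slower than ef_search=64 and lower recall.
        {"ef_search": 96, "recall": 0.950, "p99_ms": 3.5},
    ]
    for row in pareto_front(rows):
        print(row)
```

Run it once per engine/quantization pair so each curve only compares configurations that share a memory footprint.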

Phase 3 — Make it reproducible by someone who is not you

This is what separates a script from a harness. Pin everything:

A single config file drives a full run; the output embeds the config (as above).
The dataset is fetched/verified by checksum; the ground truth is regenerated or checksum-verified.
The harness records the k-NN/OpenSearch version, JDK, engine native-lib version, and shard count.
A --dry-run validates the config (space match, dataset present, ground-truth present) before spending an hour indexing.
A README a stranger can follow to regenerate any row of your table.
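
A dry-run check covering the checksum and space-match items above might look like the following. The config layout (`dataset.path`, `dataset.sha256`, `ground_truth.space_type`) is an assumed schema for illustration, not one defined by OSB or k-NN.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so multi-gigabyte datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dry_run(config):
    """Return a list of problems; an empty list means the run may start."""
    problems = []
    for label in ("dataset", "ground_truth"):
        entry = config.get(label, {})
        path = entry.get("path")
        if not path or not os.path.isfile(path):
            problems.append("%s file missing: %r" % (label, path))
        elif sha256_of(path) != entry.get("sha256"):
            problems.append("%s checksum mismatch" % label)
    if config.get("space_type") != config.get("ground_truth", {}).get("space_type"):
        problems.append("space_type does not match the ground truth")
    return problems
```

Returning every problem at once, instead of stopping at the first, saves a second failed attempt after a long download.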

Test the reproducibility claim literally: run the same config twice and assert recall is identical and latency is within a stated tolerance band. If two runs disagree on recall, something is nondeterministic that should not be (different query order? a circuit breaker tripping?) — find it.
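
The same-config-twice assertion can be written directly against two result records. In this sketch the 15% relative latency band is an assumed tolerance (pick and document your own), and the record layout follows the example output shown earlier.

```python
def check_reproducible(run_a, run_b, latency_rel_tol=0.15):
    """Return a list of failures; recall must match exactly, latency within a relative band."""
    failures = []
    if run_a["config"] != run_b["config"]:
        failures.append("configs differ: not the same experiment")
    res_a, res_b = run_a["result"], run_b["result"]
    if res_a["recall_at_10"] != res_b["recall_at_10"]:
        failures.append(
            "recall differs: %s vs %s" % (res_a["recall_at_10"], res_b["recall_at_10"])
        )
    for key in ("p50_ms", "p99_ms"):
        if abs(res_b[key] - res_a[key]) > latency_rel_tol * res_a[key]:
            failures.append("%s outside tolerance: %s vs %s" % (key, res_a[key], res_b[key]))
    return failures
```

Exact equality on recall is deliberate: with a fixed index, query set, and query order, the returned ids should not change between runs.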

Phase 4 — Wire it into CI-shaped, regression-catching form

The highest-value version of this harness catches a regression: a code change that drops recall or raises latency. Add a "compare two runs" mode that diffs a current run against a stored baseline and flags any recall drop beyond tolerance or latency rise beyond tolerance — the ANN analogue of the perf-regression lab.

$ harness compare --baseline baseline.json --candidate run.json
config: faiss/hnsw/on_disk8x/ef=128
  recall@10:  0.943 -> 0.911   REGRESSION (drop 0.032 > tolerance 0.01)   FAIL
  p50_ms:     2.1  -> 2.0      ok

That is the artifact a maintainer dreams about: a harness that turns "did this PR quietly hurt recall?" from a guess into a gate.
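
The core of such a compare mode is a few lines. This is a sketch under stated assumptions: the 0.01 recall tolerance comes from the example above, the 10% relative latency tolerance is invented for illustration, and `compare_runs` is a hypothetical name.

```python
def compare_runs(baseline, candidate, recall_tol=0.01, latency_rel_tol=0.10):
    """Diff two result dicts; higher recall is better, lower latency is better."""
    report = {}
    for key, old in baseline.items():
        new = candidate[key]
        if key.startswith("recall"):
            # Round away float noise so 0.943 - 0.911 is compared as 0.032.
            regressed = round(old - new, 6) > recall_tol
        else:
            regressed = (new - old) > latency_rel_tol * old
        report[key] = "REGRESSION" if regressed else "ok"
    return report


if __name__ == "__main__":
    report = compare_runs(
        {"recall@10": 0.943, "p50_ms": 2.1},
        {"recall@10": 0.911, "p50_ms": 2.0},
    )
    print(report)
    # A CI wrapper would exit non-zero when any metric regressed.
    raise SystemExit(1 if "REGRESSION" in report.values() else 0)
```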

Deliverables

capstone-work/recall-definition.md — the recall@k formula, dataset, query split, space_type, ground-truth method
capstone-work/design.md — extend-OSB vs bespoke decision, the config schema, what you measure and why
Phase 1: a single-config harness emitting recall@k + p50/p99 + native MB, with the exact/ef_search sanity checks passing
Phase 2: a parametric sweep producing a recall/latency Pareto table across engines and quantization
Phase 3: a fully reproducible run (config-driven, checksummed dataset+ground truth, version-stamped output) + README
(Phase 4) a compare mode that flags recall/latency regressions beyond tolerance
capstone-work/validation.md — the commands, dataset checksums, versions, and a same-config-twice reproducibility check
An upstreaming decision: a PR to the OSB vectorsearch workload / k-NN benchmarking, or a written proposal under #2595
A 500–1000 word write-up: the trade-off surface, the ground-truth-correctness story, the reproducibility argument

Difficulty & time


Engineering difficulty	Medium to build (the trap is correctness of the ground truth, not code volume)
Mergeability	High — clean, documented benchmarking tooling is exactly what #2595 asks for
Time	Phase 1: a weekend. Phases 1–2: ~2 weeks. Through Phase 4: 3–4 weeks
Hardest part	Proving the ground truth is correct (the exact-config recall-1.000 gate) and keeping runs reproducible

The reason this is "Medium" and not "Easy" is that a benchmark that looks fine but has a wrong ground truth is worse than no benchmark — it produces confident, wrong numbers. The engineering is modest; the measurement discipline is the whole grade.

Stretch goals

Add a filtered k-NN axis: recall/latency with a restrictive filter (the lucene engine's filtered path vs faiss), since filtering changes the recall story materially.
Add a memory-pressure scenario: index past the native-memory circuit-breaker limit and report how recall/latency degrade when graphs are evicted and reloaded.
Plot the Pareto curves automatically (a tiny matplotlib step) so a PR can paste a figure, not just a table.
Validate one claim from the GPU RFC (#2293) or remote-index-build RFC (#2294) if you have access to the relevant build — the harness exists precisely to check those claims.
Contribute the harness output format as a shared schema so multiple contributors' numbers are directly comparable.

Evaluation

Self-grade against the 100-point rubric. Testing is worth 15 points here, but for this project it is the dimension that matters most, because this is a testing/measurement project:

Dimension	What earns the points here
Problem articulation (20)	The design note states why recall-without-ground-truth and one-number benchmarks are wrong, with the trade-off surface named
Execution-path mastery (20)	You can explain what `native_mem_mb` from the stats API actually counts, what warmup does, and why force-merge matters for stable numbers
Implementation quality (20)	A config-driven harness, not a pile of one-off scripts; the config is embedded in every result; `--dry-run` validation
Testing (15)	The exact-config recall-1.000 gate passes; same-config-twice reproducibility holds; ground truth is checksum-verified
Review responsiveness (10)	The real OSB/k-NN PR cadence, or a peer review against this rubric
Documentation (10)	`recall-definition.md`, the README a stranger can follow, the write-up
Community interaction (5)	You commented your approach on #2595 and confirmed whether to extend the OSB workload or stand alone

A finished, reproducible harness at 90+ is mergeable tooling and the instrument that makes every other vector PR — including Project 1 — credible.

How to turn this into a real contribution

Extend, don't reinvent. Check the OSB vectorsearch workload first. If recall@k or a custom param source already exists there, your contribution is "add the missing axis / fix the ground-truth handling," which is far more mergeable than a parallel harness.
Comment before you build. On #2595, state which datasets, which axes, and which output schema you propose. The team has opinions about what "the" benchmark should measure — get them first.
Lead with the correctness gate. The reviewer's first question is "how do I trust your recall number?" Your README's first section is the exact-config recall-1.000 proof and the ground-truth method. Earn trust before showing tables.
Pin everything. A benchmark PR that cannot be regenerated is not mergeable. Dataset checksum, versions, config-in-output, same-config-twice reproducibility — these are the contribution.
DCO applies. k-NN and OSB are GitHub-native: git commit -s, follow each repo's contributor guide. OSB workloads have their own structure — read an existing workload's layout before adding.

A trustworthy harness is the rare contribution that makes everyone else's work better. That is why "Medium difficulty, high mergeability" is not a contradiction here — it is the point.

Project 7: A Search Backpressure Signal

A search cluster under load fails in one of two ways. The good way: it rejects the marginal request early, with a clear 429, and keeps serving the rest. The bad way: it accepts everything, every search threadpool queue fills, heap climbs, GC stalls, the cluster-manager loses contact with nodes, and the whole cluster cascades into a coordinated outage that takes an hour to recover from. The difference between those two failure modes is backpressure: the machinery that watches resource consumption per task and cancels or rejects work before the node tips over.

OpenSearch has a search-backpressure framework that tracks per-task resource usage (heap, CPU, elapsed time), identifies the search tasks responsible for strain, and cancels them under a node-level duress signal. It is real, it is tunable, and — like all admission-control systems — it is full of heuristics that are almost right and signals that are almost sensitive enough. This project asks you to add or tune a backpressure signal: a new resource tracker, a better cancellation-eligibility heuristic, or a shard-indexing-pressure analogue — starting from an observability slice you can scope in a weekend and ending at a tuned, tested, RFC-anchored change.

Note: Read Backpressure and Admission Control and Threadpools and Concurrency before you start. This brief assumes you know what the search threadpool is, what a CancellableTask is, how the task framework tracks resource consumption, and the difference between node duress and a task being the cause of it. It will not re-derive them.

This is a "Maybe — RFC first" project. Admission control changes cluster failure behaviour, so the bar for landing a new signal upstream is high and starts with a design discussion. The mergeable slice here is observability and a well-justified, measured heuristic tweak — not a brand-new rejection policy dropped on the maintainers.

Problem & motivation

Backpressure is hard because it is a control system, and control systems fail by being wrong in either direction:

Too insensitive and the cluster cascades. If the trackers underestimate the cost of an expensive search (a deep aggregation, an unbounded scroll, a fan-out over thousands of shards), the node accepts it, the search queue backs up, heap pressure mounts, and instead of one slow query you get a node — then a cluster — falling over. The famous tail-latency-to-outage path.
Too sensitive and you reject healthy traffic. If the cancellation heuristic is trigger-happy, you cancel queries that would have completed fine, turning a transient blip into a wave of 429s and user-visible failures. A backpressure system that cancels too much is its own incident.
The signal that decides is often coarse. "Node is under duress" is a blunt instrument. Which resource (heap vs CPU vs queue depth)? Which tasks are actually responsible? The cancellation-eligibility ranking (cancel the most expensive offenders, not random victims) is a heuristic that can be measurably improved.
It is hard to observe. When a query gets cancelled by backpressure, why? Which tracker fired? What was the node-duress reason? Operators frequently cannot tell a backpressure cancellation from a client timeout, which makes the system impossible to tune in production.

This project makes one of those better. The phased plan starts at making cancellations explainable (genuinely mergeable, low blast radius) and builds toward a tuned or new signal — the kind of change that needs an RFC and a measured before/after.

Real-world grounding

Backpressure and admission control are RFC-driven in OpenSearch, on both the indexing and search sides. These are real:

[Meta] Indexing backpressure — OpenSearch #1446: https://github.com/opensearch-project/OpenSearch/issues/1446
[Meta] Shard-level indexing back-pressure — OpenSearch #478: https://github.com/opensearch-project/OpenSearch/issues/478
Shard Indexing Pressure implementation PR — OpenSearch #1336: https://github.com/opensearch-project/OpenSearch/pull/1336

PR #1336 is the canonical example of how a pressure/backpressure system is built and merged in this codebase — read it end to end before you design anything. The indexing side is the mature template; the search side (task-cancellation under duress) is where there is more room to contribute.

Citation discipline: search backpressure evolves and its settings get renamed. Do not cite a setting or issue you have not verified. Before you scope, run two issue searches in opensearch-project/OpenSearch, first is:issue label:"distributed framework" backpressure OR admission and then is:issue search backpressure cancel, and grep the live settings in the source (below). Link the current issue in your design note; #1446 / #478 / #1336 are the anchors.

Subsystems you'll touch

Subsystem	Class / area (grep to confirm names per version)	What it owns
Search backpressure service	`org.opensearch.search.backpressure.SearchBackpressureService` (under `server/.../search/backpressure`)	The control loop: reads node state, runs trackers, decides cancellations
Resource trackers	`org.opensearch.search.backpressure.trackers.*` (`CpuUsageTracker`, `HeapUsageTracker`, `ElapsedTimeTracker`, the `TaskResourceUsageTracker` interface)	Per-task "is this task an offender, and by how much" logic
Node duress	`org.opensearch.search.backpressure.NodeDuressTracker` / the node-duress signal (grep `NodeDuress`)	Decides whether the node is under strain (heap %, CPU %) before any task is cancelled
Task framework	`org.opensearch.tasks.Task` / `CancellableTask`, `TaskResourceTrackingService`, `SearchShardTask` / `SearchTask`	The cancellable units and their accumulated resource stats
Settings	the `search_backpressure.*` cluster settings (grep `search_backpressure` / `SearchBackpressureSettings`)	The tunables: mode (`monitor_only`/`enforced`), thresholds, cancellation ratios
Search threadpool	the `search` / `search_throttled` threadpools (grep `ThreadPool.Names.SEARCH`)	Where the queued/running search work lives that pressure protects
Stats / cancellation reason	the search-backpressure stats (`SearchBackpressureStats`) and node stats wiring	Exposes counts of cancellations, in-flight cancellations, the per-tracker breakdown

Shard indexing pressure (the alternative target) lives under org.opensearch.index.ShardIndexingPressure / ShardIndexingPressureSettings / the ShardIndexingPressureTracker. Grep ShardIndexingPressure — it is the indexing-side analogue and PR #1336 is its origin.

Deep dives that cover the surrounding ground: Backpressure and admission control · Threadpools and concurrency · Search execution (the tasks pressure protects) · Circuit breakers and memory (the related "stop before OOM" mechanism — backpressure is cancellation, circuit breakers are rejection; know the difference).

Phased plan

The discipline of this project is that Phase 1 makes backpressure cancellations explainable — a mergeable observability change — and that explanation is what lets you tune a signal honestly in Phases 3–4. You cannot tune a control loop you cannot see fire.

Phase 0 — Build it, trip it, and trace the cancellation (1 day)

Build OpenSearch and deliberately trigger search backpressure so you can watch the control loop run.

# In your OpenSearch clone:
./gradlew assemble
./gradlew run    # runs a local node in the foreground; issue the curl commands below from a second terminal

# Put backpressure in enforced mode and lower thresholds so it trips easily (test cluster only):
curl -s -X PUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
  "persistent": { "search_backpressure.mode": "enforced" }
}'
# (grep SearchBackpressureSettings for the exact threshold setting names for your version)

# Now run an abusive query (deep agg / huge size / heavy script) repeatedly under load and watch:
curl -s 'localhost:9200/_nodes/stats/search_backpressure?pretty'

Find the cancellation decision in code:

grep -rn "class SearchBackpressureService\|cancel\|NodeDuress\|TaskResourceUsageTracker" \
  server/src/main/java/org/opensearch/search/backpressure | head -40
grep -rn "search_backpressure\|SearchBackpressureSettings\|getCancellationThreshold" \
  server/src/main/java/org/opensearch/search/backpressure
grep -rn "SearchBackpressureStats\|cancellationCount\|currentTracker" \
  server/src/main/java/org/opensearch/search/backpressure/stats

Write a one-page capstone-work/cancellation-path.md: node-duress check → tracker evaluation → cancellation-eligibility ranking → CancellableTask.cancel(reason). With file:line citations. This is your execution-path-mastery artifact and you cannot skip it. You must be able to point at the exact branch where node duress is declared and the exact branch where a task is selected to die.
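
Before reading the real code, it can help to hold a toy model of the four stages in mind. The sketch below is not OpenSearch's implementation: the thresholds, the `ToyTask` type, and the "rank by heap" rule are all invented for illustration, and only the tracker names mirror the ones used in this brief.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    task_id: str
    heap_bytes: int
    cpu_ms: int
    elapsed_ms: int


def node_duress_reasons(heap_pct, cpu_pct, heap_limit=70.0, cpu_limit=90.0):
    """Stage 1: is the node itself under strain, and on which resource?"""
    reasons = []
    if heap_pct >= heap_limit:
        reasons.append("heap")
    if cpu_pct >= cpu_limit:
        reasons.append("cpu")
    return reasons


def offending_trackers(task, thresholds):
    """Stage 2: which per-task trackers consider this task an offender?"""
    usage = {
        "heap_usage_tracker": task.heap_bytes,
        "cpu_usage_tracker": task.cpu_ms,
        "elapsed_time_tracker": task.elapsed_ms,
    }
    return [name for name, limit in thresholds.items() if usage[name] > limit]


def select_cancellations(tasks, heap_pct, cpu_pct, thresholds, max_cancel=1):
    """Stages 3 and 4: rank offenders by cost and emit (task_id, reason) pairs."""
    duress = node_duress_reasons(heap_pct, cpu_pct)
    if not duress:
        return []  # a healthy node cancels nothing, however costly a task is
    offenders = [(t, offending_trackers(t, thresholds)) for t in tasks]
    offenders = [(t, names) for t, names in offenders if names]
    offenders.sort(key=lambda pair: pair[0].heap_bytes, reverse=True)
    return [
        (t.task_id, "node_duress=[%s] offending_tracker=[%s]" % (",".join(duress), ",".join(names)))
        for t, names in offenders[:max_cancel]
    ]
```

Your cancellation-path.md should replace each toy stage with the file:line where the real branch lives, and note every place the real logic differs from this picture.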

Phase 1 — Make the cancellation reason observable (the scoped, mergeable slice)

When backpressure cancels a search, the reason is often opaque to the operator and the client. Make it precise. Two complementary, small changes:

Enrich the cancellation reason string on the task so it names the tracker and the node-duress reason that fired:

grep -rn "cancellationReason\|getCancellationReason\|reasonString\|cancel(" \
  server/src/main/java/org/opensearch/search/backpressure

// When SearchBackpressureService decides to cancel, build a precise reason.
// duressReason, topTracker, humanReadableBytes and the task getters are illustrative names;
// map them onto what the service and task actually expose in your version.
String reason = "search backpressure cancellation; node_duress=[" + duressReason + "]"
    + " offending_tracker=[" + topTracker.name() + "]"
    + " task_heap=[" + humanReadableBytes(task.getHeapUsage()) + "]"
    + " task_elapsed=[" + task.getElapsedTime() + "ms]";
task.cancel(reason);

Expose a per-tracker cancellation breakdown in the search-backpressure stats so an operator can see which tracker is doing the cancelling over time (heap-driven vs cpu-driven vs elapsed-driven):

grep -rn "class SearchBackpressureStats\|writeTo\|toXContent" \
  server/src/main/java/org/opensearch/search/backpressure/stats

Then tests: a unit test that the reason string contains the tracker and duress reason, and an integration test (OpenSearchIntegTestCase) that trips backpressure in enforced mode and asserts the stats report the expected tracker as the cancellation cause.

// SearchBackpressureIT-style test
public void testCancellationReasonNamesTracker() throws Exception {
    setBackpressureEnforcedWithLowHeapThreshold();
    Exception e = expectThrows(Exception.class, () -> runHeapHeavyAggUnderLoad());
    assertThat(rootCauseMessage(e), containsString("offending_tracker=[heap_usage_tracker]"));
}

Run the gates:

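./gradlew spotlessApply
./gradlew :server:test --tests "*SearchBackpressure*"
# integration tests (OpenSearchIntegTestCase) live in the internalClusterTest source set
./gradlew :server:internalClusterTest --tests "*SearchBackpressure*"
./gradlew precommit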

Why this is the right Phase 1: it is a minimum diff, it does not change when anything is cancelled (so it cannot make the control loop worse), and it is exactly the operability improvement maintainers merge. A "name the tracker in the cancellation reason + stats" PR is a credible first backpressure contribution — and you cannot tune the heuristic in Phase 3 without it.

Phase 2 — Reproduce a wrong decision and characterize it

Using your Phase-1 observability, construct a scenario where the current signal is measurably wrong in one direction:

False negative: an expensive query the trackers underweight, so the node accepts it past the point it should have shed load (cascade risk).
False positive: a query the heuristic cancels that would have completed fine (over-rejection).

Capture it as a reproducible test scenario in capstone-work/design.md: the workload, the settings, the observed wrong decision, and the metric that proves it is wrong (queue depth and heap over time, or completion-vs-cancellation outcome).

Phase 3 — Tune or add a signal (the change that needs an RFC)

Now the real work. Choose one, scoped tightly, and write a design note / RFC comment first:

Option	Touches	Why it is real
Improve the cancellation-eligibility ranking (cancel the costliest offender, not the first/oldest)	the eligibility comparator in the service	Pure heuristic improvement; measurable as "fewer healthy queries cancelled per unit of relief"
Add/tune a resource tracker (e.g. a queue-depth or shard-fanout tracker) behind a default-off setting	`trackers/` + settings + stats	A new signal — needs RFC, but contained if default-off and `monitor_only`-able
Tighten the node-duress signal (multi-resource, hysteresis to avoid flapping)	`NodeDuressTracker`	Directly reduces both false positives and flapping
A shard-indexing-pressure tracker improvement (the #1336 lineage)	`ShardIndexingPressure*`	Mature template; smaller "is this novel?" risk than search side

The non-negotiable: any new or tuned signal ships monitor_only first. It logs/stats what it would have done without actually cancelling, so it can be validated in production before it is allowed to reject traffic. This is how admission-control changes are landed safely, and proposing it any other way will (correctly) get the PR blocked.

// Every new signal must respect the mode gate.
// recordWouldHaveCancelled is an illustrative name for the stats hook you add.
if (settings.getMode() == SearchBackpressureMode.MONITOR_ONLY) {
    stats.recordWouldHaveCancelled(task, reason);   // observe, do not cancel
} else if (settings.getMode() == SearchBackpressureMode.ENFORCED) {
    task.cancel(reason);
}
// any other mode (disabled): neither record nor cancel

Phase 4 — Prove the control loop got better, not just different

A backpressure change is a control-system change; "it cancels differently" is not a result. You must show, on a reproducible load scenario, that your change improves the trade-off:

scenario: 200 concurrent heavy aggs + steady healthy traffic, heap-pressure regime

metric                        baseline    yours      better?
healthy queries cancelled        38          11       yes (fewer false positives)
node max heap %                  94          88       yes (still sheds enough load)
p99 of healthy queries (ms)     1900        1200       yes
cluster stayed up?               yes         yes       (negative control: never trade safety for it)

The headline is "fewer healthy queries cancelled while still preventing the cascade." A tweak that cancels less but lets the node cascade is not an improvement — it is a regression in the dimension that matters most. Your negative control is always "the cluster still survives the overload."
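
That ordering of concerns (safety first, then the trade-off) can be encoded in the harness so no run is ever reported as a win when the negative control failed. A minimal sketch, with hypothetical field names and one assumed rule: a rise in peak heap is treated as inconclusive rather than as an improvement.

```python
def judge(baseline, candidate):
    """Return (verdict, reasons). The survival check runs first and cannot be traded away."""
    if not candidate["cluster_survived"]:
        return "regression", ["cluster did not survive the overload"]
    if candidate["max_heap_pct"] > baseline["max_heap_pct"]:
        return "inconclusive", ["peak heap rose: load shedding may have weakened"]
    reasons = []
    if candidate["healthy_cancelled"] < baseline["healthy_cancelled"]:
        reasons.append("fewer healthy queries cancelled")
    if candidate["healthy_p99_ms"] < baseline["healthy_p99_ms"]:
        reasons.append("lower p99 for healthy queries")
    return ("improvement" if reasons else "different, not better"), reasons


if __name__ == "__main__":
    # Numbers taken from the example table above.
    baseline = {"healthy_cancelled": 38, "max_heap_pct": 94, "healthy_p99_ms": 1900, "cluster_survived": True}
    yours = {"healthy_cancelled": 11, "max_heap_pct": 88, "healthy_p99_ms": 1200, "cluster_survived": True}
    print(judge(baseline, yours))
```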

Deliverables

capstone-work/cancellation-path.md — duress → trackers → eligibility → cancel, with citations
capstone-work/design.md — the wrong-decision scenario, the signal you chose, the false-pos/false-neg framing
Phase 1: precise cancellation reason + per-tracker stats breakdown, with unit + integration tests
Phase 2: a reproducible scenario demonstrating a measurably wrong current decision
(Phase 3) a tuned/new signal, shipped monitor_only-first behind a setting, with tests
(Phase 4) a load-scenario before/after table showing a better trade-off, with the "cluster survives" negative control
capstone-work/validation.md — ./gradlew spotlessApply precommit output, test commands, seeds, load harness
A CHANGELOG.md entry under ## [Unreleased]
An upstreaming decision: a DCO-signed PR for the Phase-1 observability slice, or an RFC/issue comment proposing the signal change
A 500–1000 word write-up: the control-system framing, the trade-off you improved, the safety argument

Difficulty & time


Engineering difficulty	Hard (Phase 1 alone is Medium)
Mergeability	High for Phase 1 (observability); RFC-gated for any new/tuned signal
Time	Phase 1: a weekend. Phases 1–2: ~2 weeks. Through Phase 4: 5–7 weeks
Hardest part	Building a reproducible overload scenario, and proving you reduced false positives without reducing safety

The trap is treating this as a coding project. It is a control-systems project: most of the work is constructing a load scenario you can re-run and a measurement that distinguishes "better trade-off" from "different behaviour." Budget heavily for the load harness.

Stretch goals

Add an _explain-style endpoint or a per-request response header that tells a client why its query was cancelled by backpressure (closing the "was it a timeout or a rejection?" gap).
Implement hysteresis on the node-duress signal so it does not flap on and off at the threshold boundary — a classic control-loop improvement with a clean before/after (count the state flips).
Cross-link the search-backpressure stats with the search threadpool queue stats so an operator sees pressure building before cancellations start.
Reproduce the spirit of #1336 on the search side: a per-shard search-pressure tracker that attributes duress to the specific shards/queries causing it, not just the node.
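
For the hysteresis stretch goal, the before/after metric (state flips) is easy to prototype offline before touching the tracker. The sketch below is a toy model with made-up thresholds: with a single threshold the signal flips every time the reading crosses it, while a lower exit threshold holds the previous state inside the band.

```python
def count_flips(samples, enter, exit_below=None):
    """Count duress state changes over a series of resource readings.

    Duress starts when a reading is >= enter and ends when a reading is < exit_below.
    With exit_below == enter (the default) there is no hysteresis.
    """
    if exit_below is None:
        exit_below = enter
    in_duress = False
    flips = 0
    for value in samples:
        if not in_duress and value >= enter:
            in_duress = True
            flips += 1
        elif in_duress and value < exit_below:
            in_duress = False
            flips += 1
    return flips


if __name__ == "__main__":
    heap_pct = [68, 71, 69, 72, 69, 71, 60]  # readings hovering around a 70% threshold
    print("single threshold:", count_flips(heap_pct, enter=70))
    print("with hysteresis: ", count_flips(heap_pct, enter=70, exit_below=65))
```

Replaying a recorded heap trace through both variants gives the flip counts for the design note without needing a live overload.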

Evaluation

Self-grade against the 100-point rubric. For this project:

Dimension	What earns the points here
Problem articulation (20)	The design note frames it as a control system with a false-positive/false-negative trade-off, names the specific wrong decision — not "the cluster gets slow"
Execution-path mastery (20)	`cancellation-path.md` traces duress → tracker → eligibility → cancel with file:line citations, before you changed anything
Implementation quality (20)	Phase 1 changes only the reason/stats (cannot worsen the loop); any new signal is `monitor_only`-first behind a setting; no scope creep into rejection policy
Testing (15)	A reproducible overload scenario; a test red without the fix; the "cluster survives" negative control on every Phase-4 run
Review responsiveness (10)	The real OpenSearch PR/RFC cadence, or a peer review against this rubric
Documentation (10)	Design note, `CHANGELOG.md`, write-up, and an explicit docs decision (the settings are operator-facing)
Community interaction (5)	You posted the signal design to the issue/RFC thread before building and pinged the right `MAINTAINERS.md` reviewers

A finished Phase 1 at 90+ is a real, mergeable operability contribution to OpenSearch's resiliency layer, and the instrument that makes the heuristic tuning in Phase 3 defensible.

How to turn this into a real contribution

Start from observability, not the policy. The Phase-1 "name the tracker + stats breakdown" PR is the upstream target. It is low-blast-radius (it changes no cancellation decisions) and is exactly the operability gap operators hit. No RFC needed.
RFC any new or tuned decision. Changing when work is cancelled changes cluster failure behaviour. Read PR #1336 to see how the indexing side did it, then propose your search-side change as an issue/RFC comment with the monitor_only-first plan before writing the decision code.
One slice per PR. Observability, then (separately, after RFC) the heuristic. Never bundle a new tracker with the reason-string change.
Ship monitor_only first, always. The maintainers will require it. A signal that observes before it enforces is the only mergeable shape for new admission control — arrive with that design.
Bring a load scenario, not an anecdote. "It felt better" loses. The before/after table with the "cluster survives" negative control is what makes a control-loop change trustworthy.
DCO applies. Core OpenSearch: git commit -s, CHANGELOG.md, the backport bot.

If only Phase 1 lands, you have made one of OpenSearch's hardest-to-operate subsystems explainable — which is the prerequisite for anyone, including future-you, to tune it safely. That is the point of scoping from the observability slice up.

Project 8: Segment-Replication / RW-Separation Observability

In a document-replication cluster, a replica that has indexed up to operation N is, by definition, not behind on anything up to N: it did the indexing work itself. In a segment-replication cluster, a replica does not re-index; it copies finished segments from the primary after the primary refreshes. That is cheaper and faster, but it introduces a new, first-class concept that document replication never had: replication lag. The replica is serving searches against an older set of segments than the primary holds, and the gap between them (measured in bytes to copy, in checkpoints behind, in seconds stale) is something an operator now must be able to see. With reader/writer separation and search replicas layered on top (read nodes that scale independently and can scale to zero), this lag becomes the central health signal of the whole architecture: it is the thing that tells you whether your search tier is keeping up with your write tier.

This project asks you to improve the observability of segment replication / reader-writer separation: add or sharpen the metrics, _cat columns, stats-API fields, or logs that let an operator see replication lag and health — starting from a slice you can scope in a weekend and ending at a tested, upstreamable set of stats. It is the "Medium-Hard" project in the portfolio and one of the cleanest to land, because new, well-tested observability fields are exactly what maintainers merge.

Note: Read the Replication deep dive before you start — this brief does not re-derive how segment replication works. Also read Remote store and durability and Sharding and scaling for how remote-backed storage, segment replication, and reader/writer separation fit together. This brief assumes you know what a ReplicationCheckpoint is, what the primary→replica segment-copy round looks like, and what a "search replica" is.

Problem & motivation

Segment replication trades a familiar failure mode (replicas falling behind by re-indexing too slowly) for an unfamiliar one (replicas falling behind by copying segments too slowly), and the observability did not arrive fully formed alongside the feature:

Lag is multi-dimensional and partially hidden. "How far behind is this replica?" has several answers: how many bytes of new segments remain to copy, how many refresh checkpoints the replica's last-applied checkpoint trails the primary's by, and how long (wall-clock) the replica has been stale. Some of these are exposed via _cat/segment_replication; some are awkward to get at, missing at the level an operator wants (per-shard vs aggregated), or absent from the programmatic stats API entirely.
Reader/writer separation makes lag the headline metric, and it is newer. Search replicas are a recent capability. The whole value proposition — scale reads independently, even to zero — depends on operators trusting that a search replica is fresh enough. If you cannot cleanly see a search replica's lag and whether it is converging or diverging, you cannot safely run the architecture, and you certainly cannot autoscale it.
Scale-to-zero adds a new state to observe. A search-replica tier that scaled to zero and is scaling back up has a cold period where it is catching up from far behind. The transition from "cold, catching up" to "warm, caught up" is exactly the moment an operator needs visibility into, and it is a state document replication never had.
Failed/stalled replication is easy to miss. A replica whose segment copy is failing and retrying looks, from a distance, a lot like one that is merely slow. The stats should distinguish "behind and catching up" from "behind and stuck."
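
The distinctions drawn above ("caught up", "behind and catching up", "behind and stuck") can be prototyped as a classifier over successive lag samples before deciding which fields the stats need. The inputs here (`bytes_behind_samples`, `failed_attempts`) are hypothetical names, not fields of the current stats API.

```python
def classify_lag(bytes_behind_samples, failed_attempts=0, caught_up_bytes=0):
    """Classify a replica from successive bytes-behind readings, oldest first."""
    if not bytes_behind_samples:
        raise ValueError("need at least one sample")
    first, latest = bytes_behind_samples[0], bytes_behind_samples[-1]
    if latest <= caught_up_bytes:
        return "caught_up"
    if failed_attempts > 0 and latest >= first:
        return "behind_and_stuck"  # copies are failing and the gap is not closing
    if latest < first:
        return "behind_and_catching_up"
    return "behind_not_converging"  # no failures, but the gap is flat or growing


if __name__ == "__main__":
    print(classify_lag([500, 300, 100]))                     # a cold replica warming up
    print(classify_lag([400, 400, 400], failed_attempts=3))  # retries that make no progress
```

Whatever this prototype needs as input is a candidate for the design note: if the classifier cannot tell "stuck" from "slow" without a failure count, that count is the field worth exposing.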

This project closes one of those gaps. The phased plan starts at exposing an existing-but-hard-to-get lag signal cleanly in the stats API and _cat (genuinely mergeable, low blast radius) and builds toward richer health fields for the search-replica path.

Real-world grounding

Reader/writer separation, search replicas, and scale-to-zero are RFC-driven and tracked by a meta-issue. These are real:

[META] Reader and writer separation — OpenSearch #15306: https://github.com/opensearch-project/OpenSearch/issues/15306
Scale to zero (search replicas) — OpenSearch #16720: https://github.com/opensearch-project/OpenSearch/issues/16720

The meta-issue #15306 is the durable anchor for the whole reader/writer-separation effort — including the search-replica observability work that this project lives inside. Scale-to-zero (#16720) is the feature that makes lag/freshness visibility non-optional, because a tier that scaled down and is scaling back up must be observably catching up.

Citation discipline: the search-replica stats surface is new and moving. Do not cite a field name or sub-issue you have not verified. Before you scope, run two issue searches in opensearch-project/OpenSearch, first is:issue is:open label:"distributed framework" search replica OR segment replication stats and then is:issue is:open "_cat/segment_replication" OR "search replica" lag, and grep the live _cat/stats code (below). Link the current sub-issue in your design note; #15306 and #16720 are the anchors.

Subsystems you'll touch

Subsystem	Class / area (grep to confirm names per version)	What it owns
Segrep target service	`org.opensearch.indices.replication.SegmentReplicationTargetService` (under `server/.../indices/replication`)	Drives the replica-side copy round; knows the current/target checkpoints
Replication checkpoints	`org.opensearch.indices.replication.checkpoint.ReplicationCheckpoint`, `SegmentReplicationShardStats`	The checkpoint compared primary↔replica; per-shard replication stats object
Segrep stats / `_cat`	`org.opensearch.rest.action.cat.RestSegmentReplicationAction` (the `_cat/segment_replication` handler), `SegmentReplicationStatsResponse` / the stats transport action	The operator-facing surface: columns, fields, the stats request/response
Stats API wiring	the `SegmentReplicationPerGroupStats` / `SegmentReplicationState`, `IndicesStatsResponse` integration (grep `SegmentReplication` under `server/.../action/admin/indices/stats`)	Where segrep stats attach to the programmatic stats API
Search-replica path	the search-replica routing / `ShardRouting` search-replica role, the read-path that targets search replicas (grep `searchReplica` / `SEARCH_ONLY` / `SearchReplica`)	The reader side of reader/writer separation that you are reporting on
Remote store linkage	`org.opensearch.index.store.RemoteSegmentStoreDirectory` / the remote-store-backed segrep path (grep `RemoteStore`)	Where remote-backed segrep copies segments from (read-only for your purposes — you report on it)

Deep dives that cover the surrounding ground: Replication deep dive (the mechanism — do not re-derive it) · Remote store and durability · Sharding and scaling (reader/writer separation, search replicas, scale-to-zero) · Recovery deep dive (the cold-catch-up case is recovery-adjacent) · REST layer (how _cat and stats endpoints are wired).

Phased plan

The discipline of this project is that Phase 1 exposes a real lag signal cleanly and is mergeable on its own. Observability is the contribution; you do not need to change replication behaviour at all.

Phase 0 — Build it, stand up segrep, and read the existing lag surface (1 day)

Build OpenSearch and stand up an index that uses segment replication, then make a replica fall behind and watch the existing observability report it.

# In your OpenSearch clone:
./gradlew assemble
./gradlew run        # for a multi-node segrep setup you may need a small InternalTestCluster
                     # test or a local 2-node config; segrep needs a replica to observe

# Create an index that uses segment replication:
curl -s -X PUT localhost:9200/segrep-test -H 'Content-Type: application/json' -d '{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.replication.type": "SEGMENT"
  }
}' | python3 -m json.tool

# Index a burst so the replica has segments to copy, then read the existing lag surface:
curl -s 'localhost:9200/_cat/segment_replication?v&detailed'
curl -s 'localhost:9200/_cat/segment_replication/segrep-test?v'
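
To compare lag readings across Phase 0 runs, it can help to read the ?v output by header name rather than by column position, since the column set may differ between versions. The following is a minimal sketch, not part of OpenSearch: the class name CatTableParser and the sample rows are hypothetical, and it assumes every cell is non-empty and contains no whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CatTableParser {
    /** Parses whitespace-delimited "_cat ...?v" output into one map per row, keyed by header name. */
    static List<Map<String, String>> parse(String catOutput) {
        List<Map<String, String>> rows = new ArrayList<>();
        String[] lines = catOutput.strip().split("\\R");
        if (lines.length == 0 || lines[0].isBlank()) {
            return rows; // no header line, so nothing to key the cells by
        }
        String[] headers = lines[0].trim().split("\\s+");
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isBlank()) {
                continue;
            }
            String[] cells = lines[i].trim().split("\\s+");
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < headers.length && c < cells.length; c++) {
                row.put(headers[c], cells[c]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample; real column names and values depend on your version.
        String sample = String.join("\n",
            "shardId          target_node checkpoints_behind bytes_behind",
            "[segrep-test][0] node-2      3                  12.4mb");
        for (Map<String, String> row : parse(sample)) {
            System.out.println(row.get("shardId") + " checkpoints_behind=" + row.get("checkpoints_behind"));
        }
    }
}
```

Keying by header keeps your Phase 0 notes valid if a column is added or reordered. For anything beyond quick inspection, the JSON form of the endpoint is the safer input, because empty cells would break whitespace splitting.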

Note: reader/writer separation / search replicas add index.number_of_search_replicas (or a similar setting) and a remote-store-backed cluster. The exact settings change across versions — grep for the current names before trusting any snippet: grep -rn "number_of_search_replicas\|SEARCH_ONLY\|searchReplica\|replication.type" server/src/main/java/org/opensearch/cluster/metadata.

Now find what _cat/segment_replication and the stats API actually report, and where the lag numbers come from:

grep -rn "class RestSegmentReplicationAction\|segment_replication\|Table.*addCell" \
  server/src/main/java/org/opensearch/rest/action/cat
grep -rn "class SegmentReplicationShardStats\|bytesBehind\|checkpointsBehind\|currentReplicationLag\|lastCompletedReplicationLag" \
  server/src/main/java/org/opensearch/indices/replication
grep -rn "SegmentReplicationStatsResponse\|SegmentReplicationPerGroupStats" \
  server/src/main/java/org/opensearch/action/admin/indices

Write a one-page capstone-work/lag-surface.md: what lag dimensions exist (bytes_behind / checkpoints_behind / current vs last-completed lag time), which are exposed in _cat vs the stats API vs neither, and the class that computes each. With file:line citations. This is your execution-path-mastery artifact and you cannot skip it. You must know exactly which lag number lives where before you add or move one.

Phase 1 — Expose a missing lag dimension cleanly (the scoped, mergeable slice)

Pick one lag/health dimension that exists in the internal SegmentReplicationShardStats (or is cheaply computable there) but is not cleanly available where an operator wants it — most commonly, something that is in _cat but missing from the programmatic stats API (so dashboards/automation cannot read it), or vice versa. Expose it.

# Find a field present in the shard stats object but absent from the stats-API response.
# Match on the file name instead of a hard-coded package path, so the grep still works
# if the class lives in a different package in your checkout:
grep -rn --include='SegmentReplicationShardStats.java' "bytesBehind\|checkpointsBehind\|ReplicationLag" \
  server/src/main/java/org/opensearch
grep -rn --include='SegmentReplicationPerGroupStats.java' "toXContent\|writeTo\|field(" \
  server/src/main/java/org/opensearch

// Add the field to the stats response's XContent output. The Version guard for BWC belongs
// on the matching writeTo / StreamInput change, which this fragment does not show.
// Getter names are illustrative; confirm the real ones with the grep above.
builder.field("checkpoints_behind", shardStats.getCheckpointsBehind());
builder.field("bytes_behind",       new ByteSizeValue(shardStats.getBytesBehind()).toString());
builder.field("current_replication_lag_millis", shardStats.getCurrentReplicationLagMillis());

Because this touches the wire, it needs a Version guard and a serialization round-trip test:

grep -rn --include='SegmentReplicationPerGroupStats.java' "out.getVersion()\|in.getVersion()\|onOrAfter\|writeOptional" \
  server/src/main/java/org/opensearch

// AbstractWireSerializingTestCase<SegmentReplicationPerGroupStats>: round-trip the new field,
// AND assert it is dropped/defaulted when serializing to an older Version (BWC).
// The getter is illustrative; assert on whichever accessor actually carries your new field.
public void testSerializationRoundTripIncludesCheckpointsBehind() throws IOException {
    SegmentReplicationPerGroupStats orig = createTestInstance();
    SegmentReplicationPerGroupStats copy = copyInstance(orig); // copyInstance declares IOException
    assertEquals(orig.getCheckpointsBehind(), copy.getCheckpointsBehind());
}
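
The round-trip test above depends on the OpenSearch test harness. The guard it protects can also be seen in isolation. The following is a minimal, dependency-free sketch of the same pattern using plain java.io streams; LagStatsWire, FIELD_ADDED_IN, and the integer wire version are hypothetical stand-ins for the real stats class, StreamOutput / StreamInput, and Version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class LagStatsWire {
    // Hypothetical wire version in which checkpointsBehind was introduced.
    static final int FIELD_ADDED_IN = 2;

    final long bytesBehind;
    final long checkpointsBehind; // new field; defaults to 0 when the peer is too old to carry it

    LagStatsWire(long bytesBehind, long checkpointsBehind) {
        this.bytesBehind = bytesBehind;
        this.checkpointsBehind = checkpointsBehind;
    }

    /** Writes the new field only when the receiving peer understands it. */
    void writeTo(DataOutputStream out, int peerVersion) throws IOException {
        out.writeLong(bytesBehind);
        if (peerVersion >= FIELD_ADDED_IN) {
            out.writeLong(checkpointsBehind);
        }
    }

    /** Reads the new field only when the sending peer wrote it. */
    static LagStatsWire readFrom(DataInputStream in, int peerVersion) throws IOException {
        long bytesBehind = in.readLong();
        long checkpointsBehind = peerVersion >= FIELD_ADDED_IN ? in.readLong() : 0L;
        return new LagStatsWire(bytesBehind, checkpointsBehind);
    }

    /** Serializes and deserializes at one wire version, failing if any bytes are left unread. */
    static LagStatsWire roundTrip(LagStatsWire orig, int version) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            orig.writeTo(out, version);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            LagStatsWire copy = readFrom(in, version);
            if (in.available() != 0) {
                throw new IOException("reader and writer disagree on the wire format");
            }
            return copy;
        }
    }

    public static void main(String[] args) throws IOException {
        LagStatsWire orig = new LagStatsWire(1024, 7);
        System.out.println(roundTrip(orig, 2).checkpointsBehind); // prints 7: current version keeps the field
        System.out.println(roundTrip(orig, 1).checkpointsBehind); // prints 0: older version defaults it
    }
}
```

The important property is symmetry: the reader and writer branch on the same version condition, so a mixed-version cluster never reads bytes the other side did not write. The old-version assertion in your real test checks exactly that.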

If _cat/segment_replication is the surface that was missing the field, add the matching column there. Either way, add an integration test that makes a replica fall behind, asserts the field/column reports a non-zero lag, and then asserts it converges to zero after the copy completes.

Run the gates:

./gradlew spotlessApply
./gradlew :server:test --tests "*SegmentReplication*Stats*"
./gradlew precommit

Why this is the right Phase 1: it is a minimum diff, it changes no replication behaviour (pure observability), and "expose this lag field in the stats API / _cat" is exactly what maintainers merge — especially under the active reader/writer-separation effort. The one subtlety is BWC on the wire, which is why the round-trip + old-Version test is mandatory. This is a credible first segrep contribution.

Phase 2 — Distinguish "catching up" from "stuck"

A single lag number cannot tell an operator whether a behind replica is converging or failing. Add the signal that distinguishes them: expose the replication state (e.g. last failure, retry count, or a derived "lag trend" — is bytes_behind decreasing?) so a stalled copy is visibly different from a slow one.

grep -rn "SegmentReplicationState\|failure\|retry\|getStage\|lastFailure" \
  server/src/main/java/org/opensearch/indices/replication

Test it by injecting a copy failure in an integration test and asserting the stats reflect "stuck" (non-zero retries / a failure reason), not merely "behind."
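
One way to reason about the derived signal before touching OpenSearch code is to write the classification as a pure function. The following is a minimal sketch under stated assumptions: the STUCK state, the failure threshold, and the inputs (two consecutive bytes_behind samples plus a consecutive-failure count) are hypothetical, and none of the names are real OpenSearch fields.

```java
class ReplicaProgress {
    enum State { STEADY, CATCHING_UP, STUCK }

    // Hypothetical threshold: consecutive failed copy attempts after which a replica counts as stuck.
    static final int MAX_FAILURES_BEFORE_STUCK = 3;

    /** Classifies one replica from two consecutive lag samples and its current failure streak. */
    static State classify(long previousBytesBehind, long currentBytesBehind, int consecutiveFailures) {
        if (currentBytesBehind == 0) {
            return State.STEADY; // nothing left to copy
        }
        if (consecutiveFailures >= MAX_FAILURES_BEFORE_STUCK) {
            return State.STUCK; // retrying repeatedly without success
        }
        boolean shrinking = currentBytesBehind < previousBytesBehind;
        if (!shrinking && consecutiveFailures > 0) {
            return State.STUCK; // failing and making no progress
        }
        return State.CATCHING_UP; // behind, but either progressing or merely slow
    }

    public static void main(String[] args) {
        System.out.println(classify(1_000, 400, 0));   // shrinking lag, no failures
        System.out.println(classify(1_000, 1_000, 1)); // flat lag with a failure
        System.out.println(classify(500, 0, 0));       // caught up
    }
}
```

Note the deliberate choice that a growing lag with zero failures is still reported as catching up: a primary that is indexing faster than the replica can copy is slow, not broken. Only a failure signal moves the replica to stuck, which matches the failure-injection test described above.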

Phase 3 — Search-replica freshness, for reader/writer separation

Now the reader/writer-separation-specific work. A search replica's whole job is to serve fresh-enough reads. Expose its freshness as a first-class, queryable signal: per-search-replica lag, and ideally a roll-up at the index level ("worst search-replica lag in this index") that an autoscaler or operator can alert on.

grep -rn "searchReplica\|SEARCH_ONLY\|isSearchOnly\|SearchReplicaAllocation" \
  server/src/main/java/org/opensearch/cluster/routing

The deliverable is that an operator can answer, from the stats API alone: "is every search replica in this index within N checkpoints / N seconds of the primary?" — the exact question that makes scale-to-zero (#16720) safe to run, because the catch-up after scaling back up is now observable.
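
The operator question above reduces to a worst-case roll-up over the index's search replicas. The following is a minimal sketch, assuming you already have one lag sample per search replica; SearchReplicaFreshness, ReplicaLag, and its fields are hypothetical names, and the sketch requires a replica to be inside both the checkpoint bound and the time bound.

```java
import java.util.List;

class SearchReplicaFreshness {
    /** Hypothetical per-search-replica sample; the field names are illustrative. */
    static final class ReplicaLag {
        final String replicaId;
        final long checkpointsBehind;
        final long lagMillis;

        ReplicaLag(String replicaId, long checkpointsBehind, long lagMillis) {
            this.replicaId = replicaId;
            this.checkpointsBehind = checkpointsBehind;
            this.lagMillis = lagMillis;
        }
    }

    /** Index-level roll-up: the largest checkpoint lag across all search replicas (0 if there are none). */
    static long worstCheckpointsBehind(List<ReplicaLag> replicas) {
        return replicas.stream().mapToLong(r -> r.checkpointsBehind).max().orElse(0L);
    }

    /** Index-level roll-up: the largest time lag across all search replicas (0 if there are none). */
    static long worstLagMillis(List<ReplicaLag> replicas) {
        return replicas.stream().mapToLong(r -> r.lagMillis).max().orElse(0L);
    }

    /** True only when at least one search replica exists and every one is inside both bounds. */
    static boolean allFresh(List<ReplicaLag> replicas, long maxCheckpointsBehind, long maxLagMillis) {
        return !replicas.isEmpty()
            && worstCheckpointsBehind(replicas) <= maxCheckpointsBehind
            && worstLagMillis(replicas) <= maxLagMillis;
    }

    public static void main(String[] args) {
        List<ReplicaLag> replicas = List.of(
            new ReplicaLag("search-replica-0", 0, 0),
            new ReplicaLag("search-replica-1", 37, 4_500));
        System.out.println("worst checkpoints behind: " + worstCheckpointsBehind(replicas));
        System.out.println("all within 5 checkpoints / 10s: " + allFresh(replicas, 5, 10_000));
    }
}
```

An empty replica list is reported as not fresh on purpose: a tier that has scaled to zero cannot serve reads, so an autoscaler gating on this predicate should not treat "no replicas" as "all replicas caught up".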

Phase 4 — Validate the freshness signal under the scale-to-zero transition

The freshness signal earns its keep at the moment it is hardest: a search-replica tier scaling back up from zero, catching up from far behind. Reproduce that transition in an integration test and assert the stats tell the right story end to end.

phase                       checkpoints_behind   bytes_behind   state          search-replica fresh?
search replicas at zero            n/a               n/a         (none)         n/a
just scaled up (cold)              412              1.8 GB       CATCHING_UP    NO
mid catch-up                        37              210 MB       CATCHING_UP    NO
caught up                            0                0 B        STEADY         YES

The headline is that the freshness signal transitions cleanly from "cold, not fresh" → "catching up" → "fresh," so an operator (or an autoscaler) can gate read traffic / scale decisions on it. That is the observability that makes reader/writer separation operable.
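
The assertion at the heart of that integration test can be expressed as a check over the sequence of lag samples collected during catch-up. The following is a minimal sketch using the illustrative checkpoints_behind values from the table above; CatchUpTransition and its methods are hypothetical helpers, and it assumes indexing is paused during catch-up so the lag should never grow.

```java
class CatchUpTransition {
    /** True when the samples never increase and the final sample is zero. */
    static boolean convergesToZero(long[] checkpointsBehindSamples) {
        if (checkpointsBehindSamples.length == 0) {
            return false; // no observations means nothing was verified
        }
        for (int i = 1; i < checkpointsBehindSamples.length; i++) {
            if (checkpointsBehindSamples[i] > checkpointsBehindSamples[i - 1]) {
                return false; // lag grew between two samples
            }
        }
        return checkpointsBehindSamples[checkpointsBehindSamples.length - 1] == 0;
    }

    /** Index of the first sample at or below the freshness bound, or -1 if the bound is never met. */
    static int firstFreshIndex(long[] checkpointsBehindSamples, long maxCheckpointsBehind) {
        for (int i = 0; i < checkpointsBehindSamples.length; i++) {
            if (checkpointsBehindSamples[i] <= maxCheckpointsBehind) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] observed = {412, 37, 0}; // cold, mid catch-up, caught up
        System.out.println("converges: " + convergesToZero(observed));
        System.out.println("first fresh sample index: " + firstFreshIndex(observed, 0));
    }
}
```

If your test indexes while the tier catches up, relax the never-increases check and assert only on the final sample, since lag can legitimately grow while the primary keeps writing.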

Deliverables

capstone-work/lag-surface.md — which lag dimension lives in _cat vs stats API vs nowhere, with citations
capstone-work/design.md — the gap (which field, which surface), the BWC/wire-format consideration, what you rejected
Phase 1: a missing lag field exposed in the stats API and/or _cat, Version-guarded, with a serialization round-trip test
Phase 2: a "catching up vs stuck" signal (retries / failure / lag trend), with a failure-injection test
(Phase 3) per-search-replica freshness + an index-level roll-up for reader/writer separation
(Phase 4) an integration test of the scale-to-zero catch-up transition asserting the freshness signal is correct
capstone-work/validation.md — ./gradlew spotlessApply precommit output, test commands, seeds, segrep cluster setup
A CHANGELOG.md entry under ## [Unreleased]
An upstreaming decision: a DCO-signed PR for the Phase-1 stats field, or a written scoped proposal under #15306
A 500–1000 word write-up: the lag-dimension framing, the BWC story, why search-replica freshness matters for scale-to-zero

Difficulty & time


Aspect	Assessment
Engineering difficulty	Medium-Hard (Phase 1 alone is Medium)
Mergeability	High — observability fields under an active effort are exactly what maintainers merge
Time	Phase 1: a weekend. Phases 1–2: ~2 weeks. Through Phase 4: 4–6 weeks
Hardest part	Getting a multi-node segrep / search-replica cluster reproducible in a test, and the wire-format BWC on any stats field you add

The trap most people hit is the test environment: observing segment replication needs a real primary/replica (and for Phase 3+, a remote-store-backed search-replica) topology, which is fiddlier than a single node. Get the InternalTestCluster segrep setup working in Phase 0 before you scope.

Stretch goals

Add a _cat/segment_replication column for the "catching up vs stuck" state from Phase 2 so it is visible at a glance, not just in the JSON stats.
Emit a structured log line (at INFO/WARN) when a replica's lag crosses a threshold or a copy fails repeatedly, so the signal reaches log-based alerting, not only metrics scrapers.
Add a cluster-health-style roll-up: a per-index "replication health" (green/yellow/red) derived from worst-replica lag, mirroring how shard allocation surfaces health.
Wire the search-replica freshness signal into whatever autoscaling hook scale-to-zero (#16720) uses, so the catch-up state can actually gate scale decisions — the end-to-end payoff.
Reproduce the spirit of a real #15306 sub-issue: take one concrete "operators cannot see X" complaint from that thread and close it with the field + test in the PR.
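
For the cluster-health-style roll-up among these stretch goals, the derivation can be sketched as a threshold mapping over the worst replica lag. The following is a minimal illustration only: ReplicationHealth, the GREEN / YELLOW / RED names, and both thresholds are hypothetical, and it makes no claim about how OpenSearch computes cluster health.

```java
import java.util.List;

class ReplicationHealth {
    enum Health { GREEN, YELLOW, RED }

    /** Maps one lag value to a health colour using caller-supplied (hypothetical) thresholds. */
    static Health fromLagMillis(long lagMillis, long yellowAtMillis, long redAtMillis) {
        if (lagMillis >= redAtMillis) {
            return Health.RED;
        }
        if (lagMillis >= yellowAtMillis) {
            return Health.YELLOW;
        }
        return Health.GREEN;
    }

    /** Per-index health is the health of the worst replica; an index with no replicas reports GREEN. */
    static Health forIndex(List<Long> replicaLagsMillis, long yellowAtMillis, long redAtMillis) {
        long worst = replicaLagsMillis.stream().mapToLong(Long::longValue).max().orElse(0L);
        return fromLagMillis(worst, yellowAtMillis, redAtMillis);
    }

    public static void main(String[] args) {
        List<Long> lags = List.of(200L, 12_000L, 900L);
        System.out.println(forIndex(lags, 5_000, 30_000)); // worst replica is 12s behind
    }
}
```

Rolling up by the worst replica rather than the average keeps a single stalled replica from being hidden by healthy peers, which is the failure mode this project is trying to make visible.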

Evaluation

Self-grade against the 100-point rubric. For this project:

Dimension	What earns the points here
Problem articulation (20)	The design note names the exact lag dimension and the exact surface it is missing from, and why an operator needs it — not "replication should be observable"
Execution-path mastery (20)	`lag-surface.md` maps every lag number to its computing class and its exposed surface with file:line citations, before you changed anything
Implementation quality (20)	Phase 1 is a minimum diff in pure observability; the wire change is `Version`-guarded; no replication-behaviour change sneaks in
Testing (15)	A serialization round-trip + old-`Version` BWC test; an integration test where a replica falls behind then converges; a failure-injection test for "stuck"
Review responsiveness (10)	The real OpenSearch PR cadence, or a peer review against this rubric
Documentation (10)	Design note, `CHANGELOG.md`, write-up, and an explicit docs decision (the stats fields and `_cat` columns are operator-facing and documented)
Community interaction (5)	You commented your scope on #15306 / the current sub-issue and pinged the right `MAINTAINERS.md` (distributed / replication) reviewers

A finished Phase 1 at 90+ is a real, merged observability contribution in the reader/writer-separation effort — directly useful to every operator running segment replication.

How to turn this into a real contribution

Start from one missing field, not a dashboard. The Phase-1 "expose this lag dimension in the stats API / _cat" PR is the upstream target. It changes no replication behaviour, it is exactly the operability gap under #15306, and it needs no RFC — only a comment on the thread.
Comment before you code. Find the current search-replica observability sub-issue under #15306, confirm the field you want to add is wanted and not already in flight, and link #16720 (scale-to-zero) for why freshness visibility matters.
Respect the wire. Any stats field you add crosses the transport layer between mixed-version nodes during an upgrade. Version-guard it and ship the round-trip + old-Version serialization test in the same PR — see Serialization and BWC. Reviewers will block a stats change that is not BWC-safe.
One slice per PR. The lag field, then (separately) the "stuck vs catching up" state, then (after confirming scope) the search-replica freshness roll-up. Never bundle them.
Bring the convergence test. The reviewer's question is "does the number go to zero when the replica catches up, and stay non-zero when it is stuck?" Arrive with that integration test in the PR.
DCO applies. For core OpenSearch, sign off every commit (git commit -s), include the CHANGELOG.md entry in the same PR, and let the backport bot handle backports.

If only Phase 1 lands, you have given every operator running segment replication a lag number they could not cleanly get before — which is the difference between trusting and fearing the reader/writer separation architecture. That is the point of scoping from the single-field slice up.