Lab H5: Reproducing Bugs
Background
A JIRA without a reproducer drifts. A JIRA with a clean reproducer gets attention.
"Clean" means: minimal schema, minimal data, minimal query, runnable in under a minute
on a local MiniTezCluster or MiniHS2. This lab is the procedure.
The Hive integration test framework (hive-itests) is the source of every pattern you
need. Reading its existing tests is the cheapest education.
The Three Reduction Axes
To minimise a reproducer, reduce along three independent axes:
| Axis | Reduce | Stop reducing when |
|---|---|---|
| Schema | Drop unused columns; simplify types | Removing a column makes the bug disappear |
| Data | Reduce row count; generate synthetic data | Reducing rows makes the bug disappear |
| Query | Drop joins, predicates, projections | Dropping a clause makes the bug disappear |
The goal is the smallest schema × smallest data × smallest query that still reproduces.
Setup — Local MiniHS2 + MiniTezCluster
MiniHS2 is a single-JVM HiveServer2 that can run against a MiniTezCluster (an in-process
YARN mini cluster set up for Tez). Together they let you reproduce a Hive-on-Tez bug in
seconds without an external cluster.
Existing reference in your tree:
find ~/hive-src/itests -name "MiniHS2.java" | head
find ~/hive-src/itests -name "TestMiniLlapVectorArrowWithLlapIODisabled.java" | head
find ~/tez-src/tez-tests -name "MiniTezCluster.java"
A reproducer test class skeleton (Hive 3/4 style):
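import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.jdbc.miniHS2.MiniHS2;
import org.apache.hive.jdbc.miniHS2.MiniHS2.MiniClusterType;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestMyBugRepro {
    private MiniHS2 miniHS2;

    @Before
    public void setUp() throws Exception {
        HiveConf conf = new HiveConf();
        conf.set("hive.execution.engine", "tez");
        // tez.lib.dir is a system property you pass to the test JVM.
        conf.set("tez.lib.uris", "file://" + System.getProperty("tez.lib.dir"));
        miniHS2 = new MiniHS2.Builder()
            .withConf(conf)
            // withMiniMR() starts a MapReduce mini cluster, so ask for Tez explicitly.
            // Check the Builder in your tree; its methods differ between releases.
            .withClusterType(MiniClusterType.TEZ)
            .build();
        miniHS2.start(new HashMap<>());
    }

    @After
    public void tearDown() throws Exception {
        if (miniHS2 != null) {
            miniHS2.stop();
        }
    }

    @Test
    public void reproBug() throws Exception {
        try (Connection c = DriverManager.getConnection(miniHS2.getJdbcURL());
             Statement s = c.createStatement()) {
            s.execute("CREATE TABLE t (...) STORED AS ORC");
            s.execute("INSERT INTO t VALUES (...)");
            try (ResultSet rs = s.executeQuery("SELECT ...")) {
                // assert behaviour or expect exception
            }
        }
    }
}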
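Place the class in the itests module that holds similar tests (itests/hive-unit contains most MiniHS2-based tests) and run it from that module with mvn test -Dtest=TestMyBugRepro. The itests tree is built separately from the main Hive modules, so install the main build first (mvn install -DskipTests at the source root).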
Reducing the Schema
Starting from a real production table with 200 columns, reduce iteratively:
- Identify referenced columns. Read the failing query; note which columns the SELECT, WHERE, GROUP BY, JOIN, ORDER BY actually reference.
- Drop everything else. Make a new test schema with only the referenced columns.
- Re-run. Does the bug still reproduce? If yes, you've reduced. If no, you've found a column that's load-bearing; add it back and look for why.
- Simplify remaining types. Replace DECIMAL(38,10) with DECIMAL(10,2) if the bug doesn't depend on precision. Replace STRUCT<...> with STRING if you can. Replace partition columns with non-partitioned tables unless the partition is load-bearing.
- Stop when reduction breaks the repro.
For our running example query:
SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;
Only column a is referenced. Schema reduces to:
CREATE TABLE t (a INT) STORED AS ORC;
If the bug turns out to need a second, unreferenced column (for example, because it changes the ORC stripe layout), keep that column and note why.
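The drop-and-re-run loop above can be mechanised. The following is a minimal sketch, not part of Hive: ColumnReducer and the stillReproduces hook are hypothetical names. In a real reproducer the hook would recreate the table with the candidate columns, re-run the failing query on MiniHS2, and report whether the bug is still visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Greedy one-at-a-time column reduction (illustrative sketch, not a Hive class). */
final class ColumnReducer {

    private ColumnReducer() {}

    /**
     * Drops columns one at a time and keeps a drop only if the bug still reproduces.
     *
     * @param columns         starting column list, e.g. the columns the query references
     * @param stillReproduces hypothetical hook: rebuild the table with these columns,
     *                        re-run the query, return true if the bug is still visible
     * @return the columns that could not be removed without losing the repro
     */
    static List<String> reduce(List<String> columns, Predicate<List<String>> stillReproduces) {
        List<String> kept = new ArrayList<>(columns);
        boolean changed = true;
        while (changed) {                      // repeat until a full pass removes nothing
            changed = false;
            for (int i = 0; i < kept.size(); i++) {
                List<String> candidate = new ArrayList<>(kept);
                candidate.remove(i);           // remove by index: try without column i
                if (!candidate.isEmpty() && stillReproduces.test(candidate)) {
                    kept = candidate;          // column i was not load-bearing
                    changed = true;
                    i--;                       // re-check the column that moved into slot i
                }
            }
        }
        return kept;
    }
}
```

Each call to the hook costs one test run, so start from the referenced columns rather than from all 200. The result is a local minimum: two columns that are only jointly removable will both be kept.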
Reducing the Data — JoinDataGen Pattern
Hive's itests includes data generators for systematic minimisation. The most common
pattern is JoinDataGen for generating join inputs at controlled cardinalities:
find ~/hive-src -name "JoinDataGen*.java" -o -name "*DataGen*.java" | head
The pattern (adapt for your bug):
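import java.sql.SQLException;
import java.sql.Statement;
import java.util.Random;

public final class TestDataGen {
    /** Inserts rowCount single-INT rows with keys drawn uniformly from [0, distinctKeys). */
    public static void writeIntRows(String tableName, int rowCount, int distinctKeys,
                                    Statement s) throws SQLException {
        if (rowCount <= 0) {
            return; // an INSERT ... VALUES with no rows is a syntax error
        }
        Random r = new Random(42); // fixed seed: every run generates the same data
        StringBuilder values = new StringBuilder();
        for (int i = 0; i < rowCount; i++) {
            if (i > 0) values.append(",");
            values.append("(").append(r.nextInt(distinctKeys)).append(")");
        }
        // One multi-row INSERT is fine for small repros; batch it for very large row counts.
        s.execute("INSERT INTO " + tableName + " VALUES " + values);
    }
}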
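The generator above draws keys uniformly. When a bug depends on skew, a variant along the following lines may help. This is a sketch under assumptions: SkewedKeys, hotFraction, and the choice of key 0 as the hot key are all illustrative, not taken from Hive's itests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Generates INT keys where one "hot" key receives a chosen share of the rows (sketch). */
final class SkewedKeys {

    private SkewedKeys() {}

    /**
     * @param rowCount     number of keys to generate
     * @param distinctKeys keys are drawn from [0, distinctKeys)
     * @param hotFraction  probability that a row gets the hot key 0, e.g. 0.9
     * @param seed         fixed seed so the data set is identical on every run
     */
    static List<Integer> generate(int rowCount, int distinctKeys, double hotFraction, long seed) {
        if (rowCount < 0 || distinctKeys < 1) {
            throw new IllegalArgumentException("rowCount >= 0 and distinctKeys >= 1 required");
        }
        Random r = new Random(seed);
        List<Integer> keys = new ArrayList<>(rowCount);
        for (int i = 0; i < rowCount; i++) {
            if (distinctKeys == 1 || r.nextDouble() < hotFraction) {
                keys.add(0);                                  // hot key
            } else {
                keys.add(1 + r.nextInt(distinctKeys - 1));    // cold keys 1..distinctKeys-1
            }
        }
        return keys;
    }

    /** Renders keys as the body of a multi-row VALUES clause: (k1),(k2),... */
    static String toValuesClause(List<Integer> keys) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('(').append(keys.get(i)).append(')');
        }
        return sb.toString();
    }
}
```

The output of toValuesClause can be appended to "INSERT INTO t VALUES " in the same way writeIntRows builds its statement. Keeping the seed in the bug report lets a maintainer regenerate the exact same rows.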
Reduce data:
- Start with original data size. 1 billion rows? Reduce to 1 million.
- Halve until bug disappears. Binary-search the row count: 1M → 500K → 250K → ...
- At the smallest row count that still repros, vary distinct-key count. Bug may need 5 distinct keys (skew) or 500K (cardinality). Find which.
- Vary value distribution. If the bug needs a skewed distribution (one key gets 90% of rows), generate that explicitly.
- Document the minimum. "Bug reproduces at >= 1024 rows with >= 8 distinct keys."
For our running example, with no actual bug, the minimum is whatever you need to exercise the GROUP BY + ORDER BY path — single-digit rows are enough.
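The halving step in the list above is a binary search. A minimal sketch follows; RowCountSearch and the reproducesAt hook are hypothetical names, and the search assumes the bug is monotone in row count (if it reproduces at n rows, it reproduces at every larger count up to the starting point).

```java
import java.util.function.IntPredicate;

/** Binary search for the smallest row count that still reproduces a bug (sketch). */
final class RowCountSearch {

    private RowCountSearch() {}

    /**
     * @param knownBad     a row count at which the bug is known to reproduce
     * @param reproducesAt hypothetical hook: load that many rows, run the query,
     *                     return true if the bug appears
     * @return the smallest count in [1, knownBad] that reproduces, under the
     *         monotonicity assumption described above
     */
    static int smallestReproducing(int knownBad, IntPredicate reproducesAt) {
        if (knownBad < 1 || !reproducesAt.test(knownBad)) {
            throw new IllegalArgumentException("bug must reproduce at the starting row count");
        }
        int lo = 1;          // smallest candidate not yet ruled out
        int hi = knownBad;   // always a count known to reproduce
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (reproducesAt.test(mid)) {
                hi = mid;        // still reproduces: the minimum is mid or below
            } else {
                lo = mid + 1;    // bug gone: the minimum is above mid
            }
        }
        return hi;
    }
}
```

Starting from 1M rows this needs about 20 test runs. If the bug is not monotone (for example, it appears only at one exact row count), the search can converge on a misleading value, so probe a few counts on either side of the result by hand before documenting the minimum.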
Reducing the Query
Remove clauses one at a time and re-test:
- Remove ORDER BY → does the bug still happen? (Probably not, if the bug is in the total-order reducer.)
- Remove the aggregate → does the bug still happen?
- Remove WHERE predicates one at a time.
- Remove JOINs; if the join is the cause, simplify to a 2-table join, then to a tiny-on-tiny join.
- Replace MAP JOIN with SHUFFLE JOIN by disabling map joins (hive.auto.convert.join=false) and re-test.
A reproducer query of 3 lines beats a reproducer query of 30 lines, even for the same bug.
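To make "one clause at a time" systematic, it can help to generate the candidate queries up front. A small sketch, assuming the query can be split into a core SELECT and a list of optional trailing clauses; QueryVariants is a hypothetical helper, not a Hive utility.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates query variants that each omit exactly one optional clause (sketch). */
final class QueryVariants {

    private QueryVariants() {}

    /**
     * @param core    the part of the query that is always kept, e.g. "SELECT a FROM t"
     * @param clauses optional clauses in the order they appear in the query
     * @return one query per clause, each with that single clause left out
     */
    static List<String> dropOneClause(String core, List<String> clauses) {
        List<String> variants = new ArrayList<>();
        for (int skip = 0; skip < clauses.size(); skip++) {
            StringBuilder sql = new StringBuilder(core);
            for (int i = 0; i < clauses.size(); i++) {
                if (i != skip) {
                    sql.append(' ').append(clauses.get(i));
                }
            }
            variants.add(sql.toString());
        }
        return variants;
    }
}
```

Run each variant; any variant that still shows the bug becomes the new starting query, and the process repeats. Dropping a clause can leave an invalid statement (removing GROUP BY a while the projection still mixes a with COUNT(*), for instance), so some variants need a manual fix to the SELECT list.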
Capturing the Artifacts
A complete bug-report artifact set:
| Artifact | Why |
|---|---|
| CREATE TABLE DDL for every table involved | Reproducer setup |
| Data generation code or inline INSERT values | Reproducer setup |
| The minimal query | The test |
| SET hive.* lines that were necessary | Configuration |
| The expected behavior (correct result) | Oracle |
| The actual behavior (incorrect result or exception) | Symptom |
| EXPLAIN FORMATTED output | Plan |
| AM log fragment showing failure | Diagnostic |
| Container log fragment showing exception | Diagnostic |
| Tez and Hive version | Version |
Bundle into a single artifact:
cd ~/tez-notes
mkdir hive-h5-repro
cp ddl.sql hive-h5-repro/
cp gen.sql hive-h5-repro/
cp query.sql hive-h5-repro/
cp explain.txt hive-h5-repro/
cp amlog-fragment.txt hive-h5-repro/
cp container-log-fragment.txt hive-h5-repro/
cat > hive-h5-repro/README.md <<EOF
# Repro for HIVE-XXXXX / TEZ-XXXX
Tez version: 0.10.X
Hive version: 4.0.X
Hadoop version: 3.3.X
JDK: 11
Setup: hive -f ddl.sql && hive -f gen.sql
Repro: hive -f query.sql
Expected: rows = N, max value = M.
Actual: exception in container log (see container-log-fragment.txt).
EOF
tar czf hive-h5-repro.tar.gz hive-h5-repro/
Attach hive-h5-repro.tar.gz to the JIRA. A reproducer in this shape gets opened by
maintainers; one without these elements doesn't.
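A quick completeness check before running tar can catch a forgotten file. The sketch below assumes the file names used in the cp commands above plus README.md; BundleCheck is a hypothetical helper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Reports which expected files are absent or empty in a reproducer bundle (sketch). */
final class BundleCheck {

    /** File names taken from the bundling commands in this lab. */
    static final List<String> REQUIRED = List.of(
        "ddl.sql", "gen.sql", "query.sql", "explain.txt",
        "amlog-fragment.txt", "container-log-fragment.txt", "README.md");

    private BundleCheck() {}

    /** Returns the required file names that are missing or zero-length in bundleDir. */
    static List<String> missing(Path bundleDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!isNonEmptyFile(bundleDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    private static boolean isNonEmptyFile(Path p) {
        try {
            return Files.isRegularFile(p) && Files.size(p) > 0;
        } catch (IOException e) {
            return false;   // an unreadable file is treated as missing
        }
    }
}
```

Calling BundleCheck.missing on the hive-h5-repro directory should return an empty list before the bundle is attached. An empty explain.txt or log fragment is flagged as well, since a zero-length file usually means a redirect failed.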
When MiniTezCluster Doesn't Reproduce
A bug that reproduces on a production cluster but not on MiniTezCluster is the worst
shape. Common causes:
| Cause | Diagnostic |
|---|---|
| Multi-node shuffle behavior; mini cluster is single-node | Force multiple containers per node; can't fully simulate |
| Container OOM at production memory; mini cluster doesn't have memory pressure | Configure mini cluster with tight memory limits |
| Concurrent DAG submissions; mini cluster has none | Run multiple parallel tests |
| ORC stripe layout; needs production-size files | Generate larger ORC files |
| Production data distribution; mini cluster data is uniform | Generate data with the production distribution (for example, skewed keys) using a fixed random seed |
| Speculative execution; not enabled in mini by default | Enable with tez.am.speculation.enabled=true |
If none of these makes the bug reproduce locally, it may live in cluster-only code paths (RM scheduling edge cases, for example). Document that the reproducer requires N nodes and attach what evidence you have.
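For the concurrent-submission row in the table, the sketch below releases several sessions at the same instant. ConcurrentRepro is a hypothetical helper; in a real test each task would open its own JDBC connection to MiniHS2 and run the reproducer query, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Runs the same task from several threads, all released together (sketch). */
final class ConcurrentRepro {

    private ConcurrentRepro() {}

    /**
     * @param sessions number of concurrent submissions
     * @param query    the work each session performs (hypothetical stand-in for
     *                 "open a connection and run the reproducer query")
     * @return one result per session; any task failure surfaces as ExecutionException
     */
    static <T> List<T> runConcurrently(int sessions, Callable<T> query) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(sessions);
        CountDownLatch start = new CountDownLatch(1);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i < sessions; i++) {
                futures.add(pool.submit(() -> {
                    start.await();        // hold every worker until all are queued
                    return query.call();
                }));
            }
            start.countDown();            // release all sessions at once
            List<T> results = new ArrayList<>();
            for (Future<T> f : futures) {
                results.add(f.get());     // blocks until that session finishes
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The latch matters: without it the first submissions may finish before the last ones start, and the DAGs never actually overlap.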
A Worked Reproducer — Hypothetical Bug
Suppose a bug: COUNT(*) returns 0 when input table has exactly 1024 rows and
vectorization is enabled. (Imaginary; for the pattern.)
Schema
CREATE TABLE t (a INT) STORED AS ORC;
Data
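-- Generate exactly 1024 rows with values 1..1024.
-- space(1023) is a string of 1023 spaces; splitting it on ' ' yields 1024
-- empty strings, and posexplode numbers them 0..1023.
INSERT INTO t
SELECT pe.pos + 1 AS a
FROM (SELECT 1 AS one) d
LATERAL VIEW posexplode(split(space(1023), ' ')) pe AS pos, val;
(Hive has no DUAL table, and a sequence() function may not exist in your version. The posexplode idiom above should work on Hive 0.13 and later; if your version offers a different row generator, use that. Confirm the row count with vectorization disabled before testing.)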
Query
SET hive.vectorized.execution.enabled=true;
SELECT COUNT(*) FROM t;
Expected vs Actual
Expected: 1024
Actual: 0
EXPLAIN
EXPLAIN VECTORIZATION DETAIL SELECT COUNT(*) FROM t;
Save the output. Look for Execution mode: vectorized and any odd Vectorized: false
on a key operator.
Trial Reductions
- 1023 rows: bug? No.
- 1024 rows: bug? Yes.
- 2048 rows: bug? No (this rules out "any multiple of 1024").
- 1024 rows with vectorization off: bug? No.
Document the conditions:
Bug reproduces at:
- row count exactly 1024
- hive.vectorized.execution.enabled=true
Bug does NOT reproduce at:
- row count != 1024
- hive.vectorized.execution.enabled=false
That's a sharp, actionable bug. Attribution (per Lab H4): most likely Hive's vectorized aggregation code path. File it in the HIVE JIRA project.
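The trial grid can be driven from code so that no combination is skipped. A minimal sketch for this imaginary bug: ConditionMatrix is a hypothetical helper, and the reproduces hook stands in for "load that many rows, set hive.vectorized.execution.enabled, run SELECT COUNT(*), and compare against the expected count".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Probes every (row count, vectorization) combination and records the outcome (sketch). */
final class ConditionMatrix {

    private ConditionMatrix() {}

    /**
     * @param rowCounts  row counts to try
     * @param reproduces hypothetical hook: (rows, vectorized) -> true if the bug appears
     * @return one line per combination, suitable for pasting into the bug report
     */
    static List<String> probe(int[] rowCounts, BiPredicate<Integer, Boolean> reproduces) {
        List<String> lines = new ArrayList<>();
        for (int rows : rowCounts) {
            for (boolean vectorized : new boolean[] {true, false}) {
                boolean bug = reproduces.test(rows, vectorized);
                lines.add("rows=" + rows + " vectorized=" + vectorized
                    + " -> " + (bug ? "BUG" : "ok"));
            }
        }
        return lines;
    }
}
```

Probing 1023, 1024, and 2048 rows yields six lines, which map directly onto the "reproduces at" and "does NOT reproduce at" lists above.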
Production-to-Test Translation
When a real production bug is reported to you with no reproducer:
- Get the query. From the user, from HiveServer2 logs (hive.log, or the per-operation logs if hive.server2.logging.operation.enabled is set), or from HiveServer2 audit logs.
- Get the schema. Run SHOW CREATE TABLE on each involved table; copy.
- Get a sample of data. A few hundred to a few thousand rows. Anonymise PII if needed.
- Get the version triplet. Tez / Hive / Hadoop.
- Reproduce. Stand up MiniHS2, load the schema, load the sample data, run the query.
- If it reproduces, reduce. Apply the three axes.
- If it doesn't reproduce, expand. More data, more nodes, more concurrency.
A one-day cycle for a complex production bug is fast. A one-week cycle is realistic for something subtle.
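For the anonymisation step, one approach that keeps join and GROUP BY behaviour intact is to replace each sensitive value with a salted hash: equal inputs stay equal, so key cardinality and skew survive. This is a sketch, not a vetted privacy tool; Pseudonymizer and the 16-hex-character output length are illustrative choices.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Replaces a sensitive value with a stable, salted pseudonym (sketch). */
final class Pseudonymizer {

    private Pseudonymizer() {}

    /**
     * @param value the sensitive value, e.g. an email address
     * @param salt  a secret string used for the whole sample and discarded afterwards
     * @return the first 8 bytes of SHA-256(salt + value) as 16 lowercase hex characters
     */
    static String pseudonym(String value, String salt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest((salt + value).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Hashing changes string length and sort order, so re-check that the bug still reproduces on the anonymised sample; if it depends on those properties, a different substitution is needed.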
Validation Artifacts
After this lab:
- A complete reproducer artifact (a hive-h5-repro.tar.gz-style bundle) for a real or imagined Hive-on-Tez bug.
- A TestMyBugRepro.java skeleton you can adapt.
- The three-axes reduction discipline applied at least once.
- The reflex to capture the version triplet (Tez/Hive/Hadoop) on every reproducer.
The next lab — Lab H6: Diagnostics — covers what to do when you can't reproduce locally and need to ask the production reporter to capture more data.