Lab H5: Reproducing Bugs

Background

A JIRA without a reproducer drifts. A JIRA with a clean reproducer gets attention. "Clean" means: minimal schema, minimal data, minimal query, runnable in under a minute on a local MiniTezCluster or MiniHS2. This lab is the procedure.

The Hive integration test framework (the itests tree in the Hive source) is the source of every pattern you need. Reading its existing tests is the cheapest education.

The Three Reduction Axes

To minimise a reproducer, reduce along three independent axes:

Axis	Reduce	Stop reducing when
Schema	Drop unused columns; simplify types	Removing a column makes the bug disappear
Data	Reduce row count; generate synthetic data	Reducing rows makes the bug disappear
Query	Drop joins, predicates, projections	Dropping a clause makes the bug disappear

The goal is the smallest schema × smallest data × smallest query that still reproduces.

Setup — Local `MiniHS2` + `MiniTezCluster`

MiniHS2 is a single-JVM HiveServer2. MiniTezCluster is a single-JVM YARN cluster (it extends Hadoop's MiniYARNCluster) configured to run Tez. Together they let you reproduce a Hive-on-Tez bug in seconds without an external cluster.

Existing reference in your tree:

find ~/hive-src/itests -name "MiniHS2.java" | head
find ~/hive-src/itests -name "TestMiniLlapVectorArrowWithLlapIODisabled.java" | head
find ~/tez-src/tez-tests -name "MiniTezCluster.java"

A reproducer test class skeleton (Hive 3/4 style):

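import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.jdbc.miniHS2.MiniHS2;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestMyBugRepro {
  private MiniHS2 miniHS2;

  @Before
  public void setUp() throws Exception {
    HiveConf conf = new HiveConf();
    conf.set("hive.execution.engine", "tez");
    // tez.lib.dir is a system property you supply yourself (-Dtez.lib.dir=...).
    conf.set("tez.lib.uris",
        "file://" + System.getProperty("tez.lib.dir"));
    // Builder.withMiniMR() selects the MapReduce mini cluster, not Tez.
    // Ask for the Tez cluster type explicitly; confirm this constructor
    // against MiniHS2.java in your tree, since it varies by Hive version.
    miniHS2 = new MiniHS2(conf, MiniHS2.MiniClusterType.TEZ);
    miniHS2.start(new HashMap<>());
  }

  @After
  public void tearDown() throws Exception {
    if (miniHS2 != null) {   // setUp may have failed before assignment
      miniHS2.stop();
    }
  }

  @Test
  public void reproBug() throws Exception {
    try (Connection c = DriverManager.getConnection(miniHS2.getJdbcURL());
         Statement s = c.createStatement()) {
      s.execute("CREATE TABLE t (...) STORED AS ORC");
      s.execute("INSERT INTO t VALUES (...)");
      try (ResultSet rs = s.executeQuery("SELECT ...")) {
        // assert behaviour or expect exception
      }
    }
  }
}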

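Run it from the itests module that holds the class, for example cd itests/hive-unit && mvn test -Dtest=TestMyBugRepro. The itests tree is not part of the default root build (newer trees enable it with the itests profile), so build and install the main modules first.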

Reducing the Schema

Starting from a real production table with 200 columns, reduce iteratively:

Identify referenced columns. Read the failing query; note which columns the SELECT, WHERE, GROUP BY, JOIN, ORDER BY actually reference.
Drop everything else. Make a new test schema with only the referenced columns.
Re-run. Does the bug still reproduce? If yes, you've reduced. If no, you've found a column that's load-bearing; add it back and look for why.
Simplify remaining types. Replace DECIMAL(38,10) with DECIMAL(10,2) if the bug doesn't depend on precision. Replace STRUCT<...> with STRING if you can. Replace partitioned tables with non-partitioned ones unless partitioning is load-bearing.
Stop when reduction breaks the repro.

For our running example query:

SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a;

Only column a is referenced. Schema reduces to:

CREATE TABLE t (a INT) STORED AS ORC;

If the bug needs a second column for some reason (e.g. ORC stripe layout), keep it.
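The "identify referenced columns, drop everything else" steps can be roughed out mechanically. The sketch below is a hypothetical helper (SchemaReducer is not part of Hive) that keeps a column only if its name appears as a whole word in the query text. It assumes plain column names and no SELECT *; a real SQL parser would be more precise, so treat its output as a first candidate to re-run, not as proof that a column is unused.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: keeps only the columns a query mentions by name. */
final class SchemaReducer {

  /**
   * Returns ORC DDL for `table` holding only the columns whose name occurs
   * as a whole word in `query`. `columns` maps column name to Hive type.
   */
  static String reducedDdl(String table, Map<String, String> columns, String query) {
    List<String> kept = new ArrayList<>();
    for (Map.Entry<String, String> col : columns.entrySet()) {
      // Whole-word, case-insensitive match on the raw query text.
      Pattern word = Pattern.compile(
          "\\b" + Pattern.quote(col.getKey()) + "\\b", Pattern.CASE_INSENSITIVE);
      if (word.matcher(query).find()) {
        kept.add(col.getKey() + " " + col.getValue());
      }
    }
    return "CREATE TABLE " + table + " (" + String.join(", ", kept) + ") STORED AS ORC";
  }

  public static void main(String[] args) {
    Map<String, String> columns = new LinkedHashMap<>();
    columns.put("a", "INT");
    columns.put("b", "STRING");
    columns.put("c", "DECIMAL(38,10)");
    String query = "SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a";
    System.out.println(reducedDdl("t", columns, query));
  }
}
```

A column whose name collides with a keyword or a table alias is kept unnecessarily; that errs on the safe side, because the re-run step still decides whether the reduction holds.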

Reducing the Data — `JoinDataGen` Pattern

Hive's test code includes data generators that can be reused for systematic minimisation. This lab calls the pattern JoinDataGen: generating join inputs at controlled cardinalities. Generator class names vary by version, so search your tree:

find ~/hive-src -name "JoinDataGen*.java" -o -name "*DataGen*.java" | head

The pattern (adapt for your bug):

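import java.sql.SQLException;
import java.sql.Statement;
import java.util.Random;

public final class TestDataGen {
  /** Inserts rowCount rows whose key is drawn uniformly from [0, distinctKeys). */
  public static void writeIntRows(String tableName, int rowCount, int distinctKeys,
                                   Statement s) throws SQLException {
    if (rowCount < 1 || distinctKeys < 1) {
      // An empty VALUES list is invalid SQL, and Random.nextInt(0) throws.
      throw new IllegalArgumentException("rowCount and distinctKeys must be >= 1");
    }
    Random r = new Random(42);  // fixed seed: identical data on every run
    StringBuilder values = new StringBuilder();
    for (int i = 0; i < rowCount; i++) {
      if (i > 0) values.append(",");
      values.append("(").append(r.nextInt(distinctKeys)).append(")");
    }
    // One multi-row INSERT; for very large rowCount, insert in batches instead.
    s.execute("INSERT INTO " + tableName + " VALUES " + values);
  }
}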

Reduce data:

Start with original data size. 1 billion rows? Reduce to 1 million.
Halve until the bug disappears: 1M → 500K → 250K → ... Then binary-search between the last row count that reproduced and the first that did not.
At the smallest row count that still repros, vary distinct-key count. Bug may need 5 distinct keys (skew) or 500K (cardinality). Find which.
Vary value distribution. If the bug needs a skewed distribution (one key gets 90% of rows), generate that explicitly.
Document the minimum. "Bug reproduces at >= 1024 rows with >= 8 distinct keys."
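The halving and bisection steps above can be sketched as an ordinary binary search. RowCountBisect and its reproduces callback are hypothetical names; in practice the callback would reload the table at the given row count and re-run the query. The sketch assumes the bug is monotonic in row count (if it reproduces at n rows, it reproduces at every larger count). A bug that fires only at one exact count breaks that assumption, and bisection can step over it.

```java
import java.util.function.IntPredicate;

/** Hypothetical helper: finds the smallest row count that still reproduces. */
final class RowCountBisect {

  /**
   * @param knownPassing a row count at which the bug does NOT reproduce
   * @param knownFailing a larger row count at which the bug DOES reproduce
   * @param reproduces   runs the reproducer at a given row count
   * @return the smallest failing row count, assuming monotonic behaviour
   */
  static int smallestFailing(int knownPassing, int knownFailing, IntPredicate reproduces) {
    int lo = knownPassing;  // invariant: bug absent at lo
    int hi = knownFailing;  // invariant: bug present at hi
    while (hi - lo > 1) {
      int mid = lo + (hi - lo) / 2;
      if (reproduces.test(mid)) {
        hi = mid;
      } else {
        lo = mid;
      }
    }
    return hi;
  }

  public static void main(String[] args) {
    // Stand-in for a real run: pretend the bug needs at least 1024 rows.
    IntPredicate fakeBug = rows -> rows >= 1024;
    System.out.println(smallestFailing(0, 1_000_000, fakeBug));  // prints 1024
  }
}
```

Going from one million rows to the threshold takes about 20 runs, which is why each run needs to finish in seconds.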

For our running example, with no actual bug, the minimum is whatever you need to exercise the GROUP BY + ORDER BY path — single-digit rows are enough.
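When a bug does depend on distribution, as in the skew step of the list above, the uniform generator is not enough. A minimal sketch of a skewed variant follows; SkewedKeyGen and its parameters are hypothetical, and the "one hot key plus a uniform tail" shape is only one of many possible skews.

```java
import java.util.Random;

/** Hypothetical generator: one hot key (0) plus a uniform tail of other keys. */
final class SkewedKeyGen {

  /**
   * Builds the body of a multi-row VALUES clause. Roughly hotFraction of the
   * rows get key 0; the rest are uniform over 1..distinctKeys-1.
   */
  static String skewedValues(int rowCount, int distinctKeys, double hotFraction, long seed) {
    if (rowCount < 1 || distinctKeys < 2) {
      throw new IllegalArgumentException("need rowCount >= 1 and distinctKeys >= 2");
    }
    Random r = new Random(seed);  // fixed seed keeps the reproducer deterministic
    StringBuilder values = new StringBuilder();
    for (int i = 0; i < rowCount; i++) {
      int key = r.nextDouble() < hotFraction ? 0 : 1 + r.nextInt(distinctKeys - 1);
      if (i > 0) values.append(",");
      values.append("(").append(key).append(")");
    }
    return values.toString();
  }

  public static void main(String[] args) {
    // About 90% of 20 rows land on key 0; the rest spread over keys 1..7.
    System.out.println("INSERT INTO t VALUES " + skewedValues(20, 8, 0.9, 42L));
  }
}
```

Record the seed and the hot fraction in the bug report alongside the row count, since all three define the data.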

Reducing the Query

Remove clauses one at a time and re-test:

Remove ORDER BY → does the bug still happen? (Probably not, if the bug is in the total-order reducer.)
Remove the aggregate → does the bug still happen?
Remove WHERE predicates one at a time.
Remove JOINs; if the join is the cause, simplify to a 2-table join, then to a tiny-on-tiny join.
Replace MAP JOIN with SHUFFLE JOIN by disabling map joins (hive.auto.convert.join=false) and re-test.

A reproducer query of 3 lines beats a reproducer query of 30 lines, even for the same bug.
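Removing clauses one at a time is a greedy loop, sketched below. ClauseReducer is a hypothetical helper: it takes the optional clauses of a query and a callback that rebuilds and runs the query with a given subset. The callback is assumed to return false when the reduced query fails to compile (for example after dropping a GROUP BY that the SELECT list still needs), so invalid reductions are rejected rather than kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper: drops query clauses while the bug still reproduces. */
final class ClauseReducer {

  /**
   * Tries removing each clause in turn; keeps a removal only if `reproduces`
   * still returns true, and repeats until no single removal survives.
   */
  static List<String> reduce(List<String> clauses, Predicate<List<String>> reproduces) {
    List<String> current = new ArrayList<>(clauses);
    boolean changed = true;
    while (changed) {
      changed = false;
      for (int i = 0; i < current.size(); i++) {
        List<String> candidate = new ArrayList<>(current);
        candidate.remove(i);  // remove by index, not by value
        if (reproduces.test(candidate)) {
          current = candidate;
          changed = true;
          break;              // restart the scan on the smaller query
        }
      }
    }
    return current;
  }

  public static void main(String[] args) {
    List<String> clauses = Arrays.asList("WHERE a > 0", "GROUP BY a", "ORDER BY a");
    // Stand-in for a real run: pretend the bug only needs the ORDER BY.
    Predicate<List<String>> fakeBug = c -> c.contains("ORDER BY a");
    System.out.println(reduce(clauses, fakeBug));  // prints [ORDER BY a]
  }
}
```

The result is minimal only with respect to single removals; two clauses that must be dropped together will both stay, which is usually acceptable for a bug report.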

Capturing the Artifacts

A complete bug-report artifact set:

Artifact	Why
`CREATE TABLE` DDL for every table involved	Reproducer setup
Data generation code or inline `INSERT` values	Reproducer setup
The minimal query	The test
`SET hive.*` lines that were necessary	Configuration
The expected behavior (correct result)	Oracle
The actual behavior (incorrect result or exception)	Symptom
`EXPLAIN FORMATTED` output	Plan
AM log fragment showing failure	Diagnostic
Container log fragment showing exception	Diagnostic
Tez and Hive version	Version

Bundle into a single artifact:

cd ~/tez-notes
mkdir -p hive-h5-repro
cp ddl.sql gen.sql query.sql explain.txt \
   amlog-fragment.txt container-log-fragment.txt hive-h5-repro/
cat > hive-h5-repro/README.md <<'EOF'
# Repro for HIVE-XXXXX / TEZ-XXXX

Tez version: 0.10.X
Hive version: 4.0.X
Hadoop version: 3.3.X
JDK: 11

Setup:  hive -f ddl.sql && hive -f gen.sql
Repro:  hive -f query.sql
Expected: rows = N, max value = M.
Actual:   exception in container log (see container-log-fragment.txt).
EOF
tar czf hive-h5-repro.tar.gz hive-h5-repro/

Attach hive-h5-repro.tar.gz to the JIRA. A reproducer in this shape is one a maintainer can run without asking follow-up questions; one missing these elements tends to stall.
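Before creating the tarball it is cheap to check that nothing is missing. The sketch below is a hypothetical checker (BundleCheck is not an existing tool); the file names are the ones used in the bundling commands above, and an empty file is treated the same as a missing one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for the reproducer bundle directory. */
final class BundleCheck {

  static final List<String> REQUIRED = List.of(
      "ddl.sql", "gen.sql", "query.sql", "explain.txt",
      "amlog-fragment.txt", "container-log-fragment.txt", "README.md");

  /** Returns the required files that are absent or empty in `dir`. */
  static List<String> missing(Path dir) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      Path file = dir.resolve(name);
      try {
        if (!Files.isRegularFile(file) || Files.size(file) == 0) {
          missing.add(name);
        }
      } catch (IOException e) {
        missing.add(name);  // unreadable counts as missing
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    Path dir = Paths.get(args.length > 0 ? args[0] : "hive-h5-repro");
    List<String> missing = missing(dir);
    System.out.println(missing.isEmpty() ? "bundle complete" : "missing: " + missing);
  }
}
```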

When `MiniTezCluster` Doesn't Reproduce

A bug that reproduces on a production cluster but not on MiniTezCluster is the worst shape. Common causes:

Cause	What to try
Multi-node shuffle behavior; mini cluster is single-node	Force multiple containers per node; this cannot fully simulate a network shuffle
Container OOM at production memory; mini cluster doesn't have memory pressure	Configure mini cluster with tight memory limits
Concurrent DAG submissions; mini cluster has none	Run multiple parallel tests
ORC stripe layout; needs production-size files	Generate larger ORC files
Production data distribution; mini cluster data is uniform	Generate data with the production distribution (skew, null rate) from a fixed random seed
Speculative execution; not enabled in mini by default	Enable with `tez.am.speculation.enabled=true`

If none of these help, the bug may be in cluster-only code paths (for example RM scheduling edge cases). Document that the reproducer requires N nodes and attach what evidence you have.
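For the concurrency row of the table, "run multiple parallel tests" can be as simple as submitting the same attempt from several threads at once. The sketch below uses only java.util.concurrent; ConcurrentRepro is a hypothetical name, and the attempt callback is assumed to open its own JDBC connection, run the query, and return true when it observes the bug.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs one reproducer attempt on many threads at once. */
final class ConcurrentRepro {

  /**
   * Runs `attempt` on `parallelism` threads and returns how many attempts
   * reported the bug. An exception thrown by an attempt is rethrown,
   * wrapped in an ExecutionException.
   */
  static int countFailures(int parallelism, Callable<Boolean> attempt) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Callable<Boolean>> tasks = Collections.nCopies(parallelism, attempt);
      int failures = 0;
      for (Future<Boolean> result : pool.invokeAll(tasks)) {
        if (Boolean.TRUE.equals(result.get())) {
          failures++;
        }
      }
      return failures;
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in attempt; a real one would query MiniHS2 over JDBC.
    Callable<Boolean> fakeAttempt = () -> Thread.currentThread().getId() % 2 == 0;
    System.out.println("failures: " + countFailures(8, fakeAttempt));
  }
}
```

Each attempt should use its own Connection and Statement; sharing one statement across threads tests the JDBC driver rather than concurrent DAG submission.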

A Worked Reproducer — Hypothetical Bug

Suppose a bug: COUNT(*) returns 0 when input table has exactly 1024 rows and vectorization is enabled. (Imaginary; for the pattern.)

Schema

CREATE TABLE t (a INT) STORED AS ORC;

Data

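-- Generate exactly 1024 rows: space(1023) is 1023 spaces, so splitting on ' '
-- yields 1024 empty strings, which posexplode numbers 0..1023.
INSERT INTO t
SELECT pe.pos + 1 AS a
FROM (SELECT 1 AS one) d
LATERAL VIEW posexplode(split(space(1023), ' ')) pe AS pos, val;

(sequence() is a Spark SQL function rather than a stock Hive one, and Hive has no built-in dual table. The posexplode idiom above needs Hive 0.13 or later; use the equivalent for your version.)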

Query

SET hive.vectorized.execution.enabled=true;
SELECT COUNT(*) FROM t;

Expected vs Actual

Expected: 1024
Actual:   0

EXPLAIN

EXPLAIN VECTORIZATION DETAIL SELECT COUNT(*) FROM t;

Save the output. Look for Execution mode: vectorized and any odd Vectorized: false on a key operator.

Trial Reductions

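1023 rows, vectorization on: bug? No.
1024 rows, vectorization on: bug? Yes.
1025 and 2048 rows, vectorization on: bug? Test both, to tell "exactly 1024" apart from "at least 1024" and "any multiple of 1024".
1024 rows, vectorization off: bug? Test, then set vectorization back on.

Document the conditions:

Bug reproduces at:
  - row count exactly 1024
  - hive.vectorized.execution.enabled=true
Bug does NOT reproduce at:
  - row count != 1024 (tested: 1023, 1025, 2048)
  - hive.vectorized.execution.enabled=false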

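That's a sharp, actionable bug. The boundary is itself a clue: 1024 is the default size of Hive's VectorizedRowBatch, so a failure at exactly one full batch points at batch-boundary handling. Attribution (by Lab H4): likely Hive's vectorized aggregation code path. File on HIVE.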
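The trial reductions above form a small grid of row counts against the vectorization setting, and running the whole grid in one pass avoids forgetting a cell. The sketch below is a hypothetical runner (ConditionMatrix and Trial are illustrative names); the Trial callback stands in for loading the table at a given size, setting hive.vectorized.execution.enabled, and checking COUNT(*).

```java
/** Hypothetical runner: tries every (row count, vectorization) combination. */
final class ConditionMatrix {

  /** One reproducer run under the given conditions; true means the bug showed. */
  interface Trial {
    boolean reproduces(int rows, boolean vectorized);
  }

  /** Returns a tab-separated table with one line per combination tried. */
  static String run(int[] rowCounts, Trial trial) {
    StringBuilder table = new StringBuilder("rows\tvectorized\tbug\n");
    for (int rows : rowCounts) {
      for (boolean vectorized : new boolean[] {true, false}) {
        table.append(rows).append('\t')
            .append(vectorized).append('\t')
            .append(trial.reproduces(rows, vectorized) ? "yes" : "no")
            .append('\n');
      }
    }
    return table.toString();
  }

  public static void main(String[] args) {
    // Stand-in for real runs: the imaginary bug from this section.
    Trial imaginaryBug = (rows, vectorized) -> vectorized && rows == 1024;
    System.out.print(run(new int[] {1023, 1024, 1025, 2048}, imaginaryBug));
  }
}
```

The printed table can be pasted into the JIRA as the "reproduces at / does not reproduce at" evidence.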

Production-to-Test Translation

When a real production bug is reported to you with no reproducer:

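Get the query. From the user, from the HiveServer2 log (hive.log records each statement on its "Compiling command(queryId=...)" line), from the per-operation logs if hive.server2.logging.operation.enabled is on, or from HiveServer2 audit logs.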
Get the schema. Run SHOW CREATE TABLE on each involved table; copy.
Get a sample of data. A few hundred to a few thousand rows. Anonymise PII if needed.
Get the version triplet. Tez / Hive / Hadoop.
Reproduce. Stand up MiniHS2, load the schema, load the sample data, run the query.
If it reproduces, reduce. Apply the three axes.
If it doesn't reproduce, expand. More data, more nodes, more concurrency.

A one-day cycle for a complex production bug is fast. A one-week cycle is realistic for something subtle.
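For the "anonymise PII" step, one option is to replace each sensitive value with a salted hash. Equal inputs map to equal outputs, so join keys and GROUP BY cardinality survive. The sketch below is a hypothetical helper built on the JDK's MessageDigest. It does not preserve sort order or string length, so a bug that depends on either may stop reproducing, and the salt has to stay out of the bundle.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: equality-preserving pseudonyms for sample data. */
final class Pseudonymizer {

  /** Returns the first 8 bytes of SHA-256(salt + value) as 16 hex characters. */
  static String pseudonym(String value, String salt) {
    try {
      MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
      byte[] digest = sha256.digest((salt + value).getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (int i = 0; i < 8; i++) {
        hex.append(String.format("%02x", digest[i] & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is mandatory on every Java platform, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String salt = "keep-this-out-of-the-jira";  // illustrative value only
    System.out.println(pseudonym("alice@example.com", salt));
    System.out.println(pseudonym("alice@example.com", salt));  // same as the line above
    System.out.println(pseudonym("bob@example.com", salt));
  }
}
```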

Validation Artifacts

After this lab:

A complete reproducer artifact (a hive-h5-repro.tar.gz-style bundle) for a real or imagined Hive-on-Tez bug.
A TestMyBugRepro.java skeleton you can adapt.
The three-axes reduction discipline applied at least once.
The reflex to capture the version triplet (Tez/Hive/Hadoop) on every reproducer.

The next lab — Lab H6: Diagnostics — covers what to do when you can't reproduce locally and need to ask the production reporter to capture more data.
