Skip to content

[Bug]: query66 (TPC-DS) drops a GROUP BY row AND returns non-deterministic numeric values at scale 1G #24216

@Ariznawlll

Description

@Ariznawlll

Summary

query66 from TPC-DS exhibits two distinct bugs at scale=1 (1G):

  1. Row drop: MO returns 4 rows; MySQL 8.4.2 and PostgreSQL 13.5 both return 5 rows. The missing row corresponds to the warehouse with w_warehouse_name='' and w_warehouse_sq_ft IS NULL.
  2. Non-deterministic numeric values: Across 10 consecutive runs, MO returns 10 different result hashes (same 4 rows, but the aggregated numeric columns differ each time).

Environment

  • MatrixOne: 8.0.30 (standalone)
  • Dataset: TPC-DS scale = 1 (1G), standard 25 tables
  • Database: tpcds
  • Reference engines: MySQL 8.4.2, PostgreSQL 13.5 (both stable, both return 5 rows)

Reproduction

Bug 1: Row drop

# MO returns 4 rows, MySQL/PG both return 5
mysql -h<MO_HOST> -P6001 -u<USER> -p<PWD> tpcds -N -B \
  < query66.sql 2>/dev/null | wc -l

Bug 2: Non-determinism

# Row count is stable (always 4), but content hash differs every run
for i in {1..10}; do
  mysql -h<MO_HOST> -P6001 -u<USER> -p<PWD> tpcds -N -B \
    < query66.sql 2>/dev/null | sort | md5sum | awk '{print $1}'
done | sort | uniq -c

Observed:

1 0846396bcb46f6fc864baa157e338cfe
1 28998c0dcbfa1879b54902acf7483af5
1 2da4214ec33249476896e4c7c5479203
1 2ef6403e139f3ff13cc5095d73c521f5
1 390e8676810f9bdee7cb487733306950
1 71f02348d9af0c900032870dc56dcea9
1 75822de0f3df12655ff6dd45732840e7
1 b0403117e234176cba90309be7ec3a60
1 b9e212efcdc4e3e40dafa503fba0e267
1 f8b2735210c2d3963a085f75a0a7b9de

10 runs → 10 distinct content hashes.

The missing row

The row that MO drops, as returned by both MySQL and PG:

Engine First column (w_warehouse_name) w_warehouse_sq_ft Rest
MySQL '' (empty) 0 retained
PG '' (empty) NULL retained
MO missing

The corresponding underlying data in MO:

mysql> select w_warehouse_sk, w_warehouse_name, w_warehouse_sq_ft, w_city from warehouse;
+----------------+----------------------+-------------------+----------+
| w_warehouse_sk | w_warehouse_name     | w_warehouse_sq_ft | w_city   |
+----------------+----------------------+-------------------+----------+
|              1 | Conventional childr  |            977787 | Fairview |
|              2 | Important issues liv |            138504 | Fairview |
|              3 | Doors canno          |            294242 | Fairview |
|              4 | Bad cards must make. |            621234 | Fairview |
|              5 | (empty)              |              NULL | Fairview |   <-- this row's group is dropped
+----------------+----------------------+-------------------+----------+

Important: PG also stores w_warehouse_sq_ft = NULL for this row, yet PG retains the GROUP BY group. So the bug is not caused by the NULL value itself — it is in MO's GROUP BY / aggregation pipeline.

Why this is critical

  1. Data completeness: A whole result row silently disappears. End users have no signal that anything is wrong.
  2. Non-determinism: Even the rows that are returned have different aggregated values on every run, so the result of this query is unreliable as input for any downstream system.
  3. Cross-engine consistency: Both MySQL and PG agree on the row count (5) and on stable numeric output across runs. MO is the outlier on both axes.
  4. Self-contained minimal repro could not trigger the bug: I built a stripped-down 5-table test with hand-written rows mirroring the same shape (w_warehouse_name='', w_warehouse_sq_ft=NULL, full outer/inner aggregation, || carriers, etc.) and MO returned the correct 2 rows. This strongly indicates the bug is scale- and parallel-execution related (only manifests at TPC-DS 1G volume across web_sales ≈ 720K rows joined with warehouse/date_dim/time_dim/ship_mode).

Suggested investigation direction

  • Parallel Hash GROUP BY behavior on composite keys that include (empty_string, NULL)
  • Whether one worker's local NULL bucket is being discarded during the merge step
  • Whether SUM(x / NULL) in the projection list interacts with the GROUP BY pipeline in a way that affects which groups survive
  • Whether the inner subquery's GROUP BY emits the row but the outer GROUP BY drops it

SQL (TPC-DS query66, unchanged)

The full SQL is the standard TPC-DS query66; available in the mo-auto-test repo at query_1g_mo/query66.sql.

Key shape:

SELECT w_warehouse_name, w_warehouse_sq_ft, w_city, ..., ship_carriers, year,
       SUM(jan_sales), ..., SUM(jan_sales / w_warehouse_sq_ft), ...,
       SUM(jan_net), ...
FROM (
    SELECT w_warehouse_name, w_warehouse_sq_ft, w_city, ...,
           'ORIENTAL' || ',' || 'BOXBUNDLES' AS ship_carriers,
           d_year AS year,
           SUM(CASE WHEN d_moy = 1 THEN ws_ext_sales_price * ws_quantity ELSE 0 END) AS jan_sales,
           ...
    FROM web_sales, warehouse, date_dim, time_dim, ship_mode
    WHERE ws_warehouse_sk = w_warehouse_sk
      AND ws_sold_date_sk = d_date_sk
      AND ws_sold_time_sk = t_time_sk
      AND ws_ship_mode_sk = sm_ship_mode_sk
      AND d_year = 2001
      AND t_time BETWEEN 42970 AND 42970 + 28800
      AND sm_carrier IN ('ORIENTAL', 'BOXBUNDLES')
    GROUP BY w_warehouse_name, w_warehouse_sq_ft, w_city, ..., d_year
    UNION ALL
    -- second arm: catalog_sales, same shape
    ...
) x
GROUP BY w_warehouse_name, w_warehouse_sq_ft, w_city, ..., ship_carriers, year
ORDER BY w_warehouse_name, w_warehouse_sq_ft, w_city, ...
LIMIT 100;

Data preparation reference

Standard TPC-DS 1G dataset; load scripts available in https://github.com/matrixorigin/mo-auto-test (load_tpcds_1g.sh).

Metadata

Metadata

Assignees

No one assigned

    Labels

    kind/bugSomething isn't workingseverity/s0Extreme impact: Cause the application to break down and seriously affect the use

    Type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions