Bug Description
When running SELECT DISTINCT col FROM table WHERE col IN (...) on a column that is the leading key of a composite secondary index, the query scans the entire index table instead of pruning via prefix_in block filter. The equivalent SELECT COUNT(*) ... WHERE col IN (...) correctly prunes and is ~90x faster.
Environment
- MatrixOne version: latest main
- Table: ~45M rows, composite secondary index on
(b28_s_kgd353d, b28_s_kgdtvnx, b28_s_kgdp984, b28_s_kgd4kbn, b28_s_kgd4b76)
Steps to Reproduce
-- Table DDL (simplified)
CREATE TABLE bpc_consolidated_report (
b28_s_kgd353d varchar(255) DEFAULT NULL,
-- ... other columns ...
KEY idx_bpc_consolidated_report_q1 (b28_s_kgd353d, b28_s_kgdtvnx, b28_s_kgdp984, b28_s_kgd4kbn, b28_s_kgd4b76)
);
-- Fast query (0.4s): block filter works, scans 886 blocks / 7.2M rows
SELECT COUNT(*) FROM bpc_consolidated_report
WHERE b28_s_kgd353d IN ('2022.12', '2023.12', '2024.12', '2025.12');
-- Slow query (38s): block filter ineffective, scans 5503 blocks / 45M rows
SELECT DISTINCT b28_s_kgd353d FROM bpc_consolidated_report
WHERE b28_s_kgd353d IN ('2022.12', '2023.12', '2024.12', '2025.12');
Explain Analyze Output
COUNT(*) — 0.433 sec ✅
Aggregate
inputRows=7200773 outputRows=1
-> Index Table Scan on idx_bpc_consolidated_report_q1
inputBlocks=886 inputRows=7200773 ReadSize=156mb
Filter Cond: prefix_in(__mo_index_idx_col)
Block Filter Cond: prefix_in(__mo_index_idx_col)
DISTINCT — 38.008 sec ❌
Aggregate
inputRows=45006351 outputRows=4 timeConsumed=35708ms
Group Key: serial_extract(__mo_index_idx_col, 0, VARCHAR))
-> Index Table Scan on idx_bpc_consolidated_report_q1
inputBlocks=5503 inputRows=45006351 ReadSize=932mb
Filter Cond: prefix_in(__mo_index_idx_col)
Block Filter Cond: prefix_in(__mo_index_idx_col)
Analysis
Both queries use prefix_in as the Block Filter Cond, but:
| Metric |
COUNT(*) |
DISTINCT |
| inputBlocks |
886 |
5,503 |
| inputRows |
7,200,773 |
45,006,351 |
| ReadSize |
156 MB |
932 MB |
| Total time |
0.4s |
38s |
The matching rows are ~7.2M out of ~45M total (~16%). The COUNT query correctly prunes to only those rows, but the DISTINCT query reads the entire index table (45M rows = all rows in the index). The prefix_in block filter appears in the plan but has no actual pruning effect for the DISTINCT path.
Additionally, the DISTINCT aggregation uses serial_extract(__mo_index_idx_col, 0, VARCHAR) to extract the leading column from the serialized composite index key, which adds ~35.7s of CPU overhead on the unnecessarily large 45M row input.
Expected Behavior
The DISTINCT query should achieve similar block filter pruning as the COUNT query, scanning ~886 blocks / 7.2M rows instead of the full 5,503 blocks / 45M rows.
Workaround
Rewrite as UNION ALL of per-value queries to force the efficient prefix_in path:
SELECT '2022.12' AS b28_s_kgd353d FROM bpc_consolidated_report WHERE b28_s_kgd353d = '2022.12' LIMIT 1
UNION ALL
SELECT '2023.12' FROM bpc_consolidated_report WHERE b28_s_kgd353d = '2023.12' LIMIT 1
UNION ALL
SELECT '2024.12' FROM bpc_consolidated_report WHERE b28_s_kgd353d = '2024.12' LIMIT 1
UNION ALL
SELECT '2025.12' FROM bpc_consolidated_report WHERE b28_s_kgd353d = '2025.12' LIMIT 1;
Bug Description
When running
SELECT DISTINCT col FROM table WHERE col IN (...)on a column that is the leading key of a composite secondary index, the query scans the entire index table instead of pruning viaprefix_inblock filter. The equivalentSELECT COUNT(*) ... WHERE col IN (...)correctly prunes and is ~90x faster.Environment
(b28_s_kgd353d, b28_s_kgdtvnx, b28_s_kgdp984, b28_s_kgd4kbn, b28_s_kgd4b76)Steps to Reproduce
Explain Analyze Output
COUNT(*) — 0.433 sec ✅
DISTINCT — 38.008 sec ❌
Analysis
Both queries use
prefix_inas the Block Filter Cond, but:The matching rows are ~7.2M out of ~45M total (~16%). The COUNT query correctly prunes to only those rows, but the DISTINCT query reads the entire index table (45M rows = all rows in the index). The
prefix_inblock filter appears in the plan but has no actual pruning effect for the DISTINCT path.Additionally, the DISTINCT aggregation uses
serial_extract(__mo_index_idx_col, 0, VARCHAR)to extract the leading column from the serialized composite index key, which adds ~35.7s of CPU overhead on the unnecessarily large 45M row input.Expected Behavior
The DISTINCT query should achieve similar block filter pruning as the COUNT query, scanning ~886 blocks / 7.2M rows instead of the full 5,503 blocks / 45M rows.
Workaround
Rewrite as UNION ALL of per-value queries to force the efficient prefix_in path: