fix(top): spill to disk and streaming eval to prevent OOM on large LIMIT (Fixes #24243) #24244

jiangxinmeng1 wants to merge 5 commits into matrixorigin:main
Conversation
When LIMIT exceeds 16384 rows, the Top operator now keeps only sort-key columns in the heap and spills full rows to a temp file. During eval, needed rows are read back from disk and assembled into the output batch. This reduces heap memory from O(limit * row_width) to O(limit * key_width). Also fixes a memory leak in mergetop where defer bat.Clean inside a loop kept all intermediate batches alive until function return. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous spill fix only moved memory from build to eval phase. Now evalSpill streams output in 8192-row chunks instead of materializing all limit rows at once. Peak memory during eval drops from O(limit * row_width) to O(chunk_size * row_width), e.g. ~10 MB per chunk instead of ~7 GiB. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge Queue Status

This pull request spent 8 seconds in the queue, with no time running CI.

Reason: The pull request can't be updated.
Hint: You should update or rebase your pull request manually. If you do, this pull request will automatically be requeued once the queue conditions match again.
What type of PR is this?
Which issue(s) this PR fixes:
fixes #24243
What this PR does / why we need it:
`INSERT INTO ... SELECT ... ORDER BY col LIMIT 5000000` on a 100M-row table causes OOM in the CI nightly regression. The Top operator holds all LIMIT rows with all columns in the heap, consuming O(limit × row_width) memory; for 5M rows of wide data this reaches tens of GiB.

This PR makes three targeted changes:
1. Top operator: spill to disk for large LIMIT (`top/top.go`, `top/types.go`)

When LIMIT > 16384, the Top operator now:

- keeps only the sort-key columns in the heap
- spills full rows to a temp file via `batch.MarshalBinary`
- stores a `rowRef{batchIdx, rowIdx}` per heap entry to locate spilled rows during eval

Heap memory drops from O(limit × row_width) to O(limit × key_width).
2. Top operator: streaming eval in spill mode (`top/top.go`)

Instead of materializing all LIMIT rows into one giant batch during eval, spill mode now:

- walks the spilled rows in output order via `orderedRefs`
- streams one 8192-row chunk per `Call()` invocation

Eval peak memory drops from O(limit × row_width) to O(chunk_size × row_width) (~10 MiB per chunk).
3. MergeTop: fix memory leak from `defer` in loop (`mergetop/top.go`)

`defer bat.Clean(proc.Mp())` was placed inside a `for` loop in `build()`. Since `defer` only fires on function return, every duplicated batch from each iteration accumulated in memory. Replaced with an explicit `bat.Clean()` after each `processBatch` call and on error paths.

How this PR impacts memory (conceptual):