feat: add batch_size parameter to read_parquet #494
kevinjacobs-delfi wants to merge 1 commit into kylebarron:main
Conversation
The default ParquetRecordBatchReaderBuilder batch_size of 1024 rows creates thousands of tiny batches for typical files. This adds a batch_size keyword argument to read_parquet, passed through to ParquetRecordBatchReaderBuilder::with_batch_size. On a 2M-row single-column file, read_all() drops from 4.3ms to 0.24ms with batch_size=65536. On a 5-column file, it drops from 46ms to 29ms. Signed-off-by: Kevin Jacobs <kevin.jacobs@delfidiagnostics.com>
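A minimal usage sketch of the new argument (the import path and file name below are placeholders, not part of this PR; adjust to the package's actual entry point, assumed here to be arro3.io):

```python
# Placeholder import path -- adjust to the module that actually exposes
# read_parquet in this package (assumed here to be arro3.io).
from arro3.io import read_parquet

# batch_size is forwarded to ParquetRecordBatchReaderBuilder::with_batch_size
# on the Rust side, so the reader yields fewer, larger RecordBatches.
reader = read_parquet("data.parquet", batch_size=65536)
table = reader.read_all()  # read_all() is where the speedup above was measured
```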
I can't find the conversation right now, but upstream I think it makes sense to add a
I've had the same question -- matching the row group size seems like the most natural default, since row groups should be scaled to the memory constraints of the application.
I'm pretty sure there's no way to set that upstream.
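For callers who want row-group-sized batches today, that can be approximated from the file metadata before calling read_parquet. A sketch, assuming pyarrow is available for metadata inspection and using the same placeholder import path as above:

```python
import pyarrow.parquet as pq
from arro3.io import read_parquet  # placeholder import path

meta = pq.ParquetFile("data.parquet").metadata
# Use the first row group's row count as the batch size; fall back to a
# large fixed value if the file somehow has no row groups.
batch_size = meta.row_group(0).num_rows if meta.num_row_groups > 0 else 65536

reader = read_parquet("data.parquet", batch_size=batch_size)
```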
Please let me know what else you need before merging this PR. I'm finding
    Keyword Args:
        batch_size: The number of rows per batch in the returned reader.
            Defaults to 1024 if not specified. Larger values reduce per-batch
I wonder if we should change this default. Python consumers are materially different enough from a Rust process that it's worth defaulting to, say, 65536 rows.
I do feel like 1024 rows is extremely small and a bad default.
I've been busy with a bunch of projects; feel free to ping me when I forget about a PR like this.
Adds an optional `batch_size` keyword argument to `read_parquet`, passed through to `ParquetRecordBatchReaderBuilder::with_batch_size`. The default of 1024 rows creates thousands of tiny batches for typical files, which dominates read time due to per-batch overhead.
Benchmarks:
- 2M-row single-column file: `read_all()` drops from 4.3ms to 0.24ms with `batch_size=65536`
- 5-column file: 46ms to 29ms
Includes type stub update and test.
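For context, the stub change amounts to one new optional keyword argument; a hypothetical excerpt (the surrounding parameters and any return annotation are assumptions, not copied from the repo):

```python
# Hypothetical .pyi excerpt -- the real signature around batch_size may differ.
from typing import Optional

def read_parquet(
    file,
    *,
    batch_size: Optional[int] = None,  # rows per batch; None keeps the upstream 1024 default
): ...
```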