
feat: add batch_size parameter to read_parquet #494

Open

kevinjacobs-delfi wants to merge 1 commit into kylebarron:main from kevinjacobs-delfi:read-parquet-batch-size

Conversation

@kevinjacobs-delfi

Adds an optional batch_size keyword argument to read_parquet, passed through to ParquetRecordBatchReaderBuilder::with_batch_size.

The default of 1024 rows creates thousands of tiny batches for typical files, which dominates read time due to per-batch overhead.

Benchmarks (2M-row file):

| batch_size      | 1 column (ms) | 5 columns (ms) |
|-----------------|---------------|----------------|
| 1,024 (default) | 4.3           | 46             |
| 65,536          | 0.24          | 29             |

Includes type stub update and test.

The default ParquetRecordBatchReaderBuilder batch_size of 1024 rows
creates thousands of tiny batches for typical files. This adds a
batch_size keyword argument to read_parquet, passed through to
ParquetRecordBatchReaderBuilder::with_batch_size.

On a 2M-row single-column file, read_all() drops from 4.3ms to 0.24ms
with batch_size=65536. On a 5-column file, 46ms to 29ms.

Signed-off-by: Kevin Jacobs <kevin.jacobs@delfidiagnostics.com>
@kevinjacobs-delfi changed the title from "Add batch_size parameter to read_parquet" to "feat: add batch_size parameter to read_parquet" on Mar 27, 2026
@github-actions bot added the feat label on Mar 27, 2026
@kylebarron
Owner

kylebarron commented Mar 27, 2026

I can't find the conversation right now but upstream parquet maintainers were quite adamant about their choice of row group size for memory reasons.

I think it makes sense to add a batch_size parameter. I wish there were a way to say "same number of rows as per parquet batch"

@kevinjacobs-delfi
Author

I've had the same question: matching the row group size seems like the most natural default, since row groups should be scaled to the memory constraints of the application.

@kylebarron
Owner

I'm pretty sure there's no way to set that upstream.

@kevinjacobs-delfi
Author

Please let me know what else you need before merging this PR. I'm finding arro3 extremely useful and have more small fixes and improvements in the works. Thanks!


Keyword Args:
batch_size: The number of rows per batch in the returned reader.
Defaults to 1024 if not specified. Larger values reduce per-batch
Owner

I wonder if we should change this default. Python consumers may be materially different enough from a Rust process that it's worth defaulting to, say, 65536 rows.

I do feel like 1024 rows is extremely small and a bad default.

@kylebarron
Owner

I've been busy with a bunch of projects; feel free to ping me when I forget about a PR like this

