feat: add batch_size parameter to read_parquet #494
kevinjacobs-delfi wants to merge 1 commit into kylebarron:main
Conversation
The default ParquetRecordBatchReaderBuilder batch_size of 1024 rows creates thousands of tiny batches for typical files. This adds a batch_size keyword argument to read_parquet, passed through to ParquetRecordBatchReaderBuilder::with_batch_size. On a 2M-row single-column file, read_all() drops from 4.3ms to 0.24ms with batch_size=65536. On a 5-column file, it drops from 46ms to 29ms. Signed-off-by: Kevin Jacobs <kevin.jacobs@delfidiagnostics.com>
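A minimal usage sketch of the new argument (the import path and file name below are placeholders, not part of this PR; adjust to the package's actual entry point, assumed here to be arro3.io):

```python
# Placeholder import path -- adjust to the module that actually exposes
# read_parquet in this package (assumed here to be arro3.io).
from arro3.io import read_parquet

# batch_size is forwarded to ParquetRecordBatchReaderBuilder::with_batch_size
# on the Rust side, so the reader yields fewer, larger RecordBatches.
reader = read_parquet("data.parquet", batch_size=65536)
table = reader.read_all()  # read_all() is where the speedup above was measured
```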
I can't find the conversation right now, but upstream I think it makes sense to add a
I've had the same question -- matching the row group size seems like the most natural default, since row groups should be scaled to the memory constraints of the application.
I'm pretty sure there's no way to set that upstream.
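For callers who want row-group-sized batches today, that can be approximated from the file metadata before calling read_parquet. A sketch, assuming pyarrow is available for metadata inspection and using the same placeholder import path as above:

```python
import pyarrow.parquet as pq
from arro3.io import read_parquet  # placeholder import path

meta = pq.ParquetFile("data.parquet").metadata
# Use the first row group's row count as the batch size; fall back to a
# large fixed value if the file somehow has no row groups.
batch_size = meta.row_group(0).num_rows if meta.num_row_groups > 0 else 65536

reader = read_parquet("data.parquet", batch_size=batch_size)
```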
Please let me know what else you need before merging this PR. I'm finding
    Keyword Args:
        batch_size: The number of rows per batch in the returned reader.
            Defaults to 1024 if not specified. Larger values reduce per-batch
I wonder if we should change this default. Python consumers are materially different enough from a Rust process that it's worth defaulting to, say, 65536 rows.
I do feel like 1024 rows is extremely small and a bad default.
I've been busy with a bunch of projects; feel free to ping me when I forget about a PR like this.
Adds an optional `batch_size` keyword argument to `read_parquet`, passed through to `ParquetRecordBatchReaderBuilder::with_batch_size`. The default of 1024 rows creates thousands of tiny batches for typical files, which dominates read time due to per-batch overhead.
Benchmarks:
- 2M-row single-column file: `read_all()` drops from 4.3ms to 0.24ms with `batch_size=65536`
- 5-column file: 46ms to 29ms
Includes type stub update and test.
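For context, the stub change amounts to one new optional keyword argument; a hypothetical excerpt (the surrounding parameters and any return annotation are assumptions, not copied from the repo):

```python
# Hypothetical .pyi excerpt -- the real signature around batch_size may differ.
from typing import Optional

def read_parquet(
    file,
    *,
    batch_size: Optional[int] = None,  # rows per batch; None keeps the upstream 1024 default
): ...
```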