perf(reader): Improve the performance of parquet reader#230
perf(reader): Improve the performance of parquet reader#230discivigour wants to merge 17 commits intoapache:mainfrom
Conversation
| let fetch_bytes = &fetched[idx]; | ||
| let start = (range.start - fetch_range.start) as usize; | ||
| let end = (range.end - fetch_range.start) as usize; | ||
| fetch_bytes.slice(start..end.min(fetch_bytes.len())) |
There was a problem hiding this comment.
If the data returned by fetch is not long enough, truncated data will be silently returned here, and downstream may obtain incomplete column chunks, leading to parsing errors or data corruption. Suggest changing it to assert or returning an error, do not silently swallow it.
There was a problem hiding this comment.
Good point, I will change it.
| Ok(ranges | ||
| .iter() | ||
| .map(|range| { | ||
| let idx = fetch_ranges.partition_point(|v| v.start <= range.start) - 1; |
There was a problem hiding this comment.
If partition_point returns 0, 0-1 overflow panic will occur here. Although logically fetch_range must cover all original ranges, this is an implicit assumption. It is recommended to add a debug_assert! Or use checked_stub+more explicit error messages
| /// Default coalesce threshold: 1 MiB. | ||
| const DEFAULT_RANGE_COALESCE_BYTES: u64 = 1024 * 1024; | ||
| /// Default concurrent range fetches. | ||
| const DEFAULT_RANGE_FETCH_CONCURRENCY: usize = 8; |
There was a problem hiding this comment.
Same to iceberg, maybe 10?
| for range in &merged { | ||
| let length = range.end - range.start; | ||
| let expected_size = MIN_SPLIT_SIZE.max(length / target_count as u64 + 1); | ||
| let min_remain = expected_size.max(MIN_SPLIT_SIZE * 2); |
| /// with the last chunk taking whatever remains. | ||
| /// Ranges smaller than `2 * MIN_SPLIT_SIZE` are kept as-is to | ||
| /// avoid excessive small IO requests. | ||
| fn split_ranges_for_concurrency(merged: Vec<Range<u64>>, target_count: usize) -> Vec<Range<u64>> { |
There was a problem hiding this comment.
target_count => concurrency
|
|
||
| for range in &merged { | ||
| let length = range.end - range.start; | ||
| let expected_size = MIN_SPLIT_SIZE.max(length / target_count as u64 + 1); |
There was a problem hiding this comment.
MIN_SPLIT_SIZE.max(length.div_ceil(target_count as u64)) ?
Purpose
Brief change log
Tests
API and Format
Documentation