Optimize memory footprint of view arrays from ScalarValue::to_array_of_size#19441
Optimize memory footprint of view arrays from ScalarValue::to_array_of_size#19441Jefffrey merged 1 commit intoapache:mainfrom
ScalarValue::to_array_of_size#19441Conversation
| let mut builder = | ||
| StringViewBuilder::with_capacity(size).with_deduplicate_strings(); | ||
| for _ in 0..size { | ||
| builder.append_value(value); | ||
| } | ||
| let array = builder.finish(); | ||
| Arc::new(array) |
There was a problem hiding this comment.
Technically we could further optimize this by manually calling StringViewArray::try_new with correct data and avoid need to hash as part of the builder; felt that might get too into the weeds of arrow-rs code, so stuck with this simpler approach
There was a problem hiding this comment.
Actually if we have a benchmark maybe it could be worth testing out if there is a performance gain by doing this? 🤔
There was a problem hiding this comment.
Update - here is the tracking ticket (thanks @Jefffrey )
| let buffers = array.data_buffers(); | ||
| assert_eq!(1, buffers.len()); | ||
| // Ensure we only have a single copy of the value string | ||
| assert_eq!(value.len(), buffers[0].len()); |
There was a problem hiding this comment.
On main this assert fails as the data buffer would be 10 * value.len(); with this fix we only have a single copy of the full string in the child buffer, minimizing memory footprint of the output array
| Arc::new(StringViewArray::from_iter_values(repeat_n(value, size))) | ||
| let mut builder = | ||
| StringViewBuilder::with_capacity(size).with_deduplicate_strings(); | ||
| for _ in 0..size { |
There was a problem hiding this comment.
🤔 we should prob have some kind of append_n to remove this boilerplate, in some future for arrow-rs
2010YOUY01
left a comment
There was a problem hiding this comment.
It's a good idea to keep it simple now. I agree if we can add a new API in arrow-rs to let it directly construct StringView arrays with repeating element of size k, it's likely to be much faster. (should we open an issue?)
|
An |
## Which issue does this PR close? - Follow on to #19441 ## Rationale for this change In #19441 @Jefffrey filed a follow on ticket for arrow-rs apache/arrow-rs#9034 I wanted to leave the context of where it could be used in DataFusion so we remember to use it when available ## What changes are included in this PR? Add a comment with a reference to apache/arrow-rs#9034 ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? No, only comments
Which issue does this PR close?
ScalarValue::to_array_of_size#19440Rationale for this change
When we have view scalars (utf8/binary) and we call
to_array_of_size, the data buffers the resultant arrays have contains duplicate data. This is because the APIs we use don't deduplicate the data, instead appending it each time even though the data is exactly duplicated.What changes are included in this PR?
Manually use a builder with deduplication enabled.
Are these changes tested?
Added test.
Are there any user-facing changes?
No.