
Optimize memory footprint of view arrays from ScalarValue::to_array_of_size#19441

Merged
Jefffrey merged 1 commit into apache:main from Jefffrey:optimize-view-to-arr
Dec 23, 2025

Conversation

@Jefffrey
Contributor

Which issue does this PR close?

Rationale for this change

When we have view scalars (utf8/binary) and call to_array_of_size, the data buffers of the resulting arrays contain duplicate data. This is because the APIs we use don't deduplicate: they append the value each time, even though it is identical on every call.

What changes are included in this PR?

Manually use a builder with deduplication enabled.
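
To illustrate the difference, here is a minimal std-only sketch. It is not the arrow-rs implementation: `ToyViewBuilder` and `build_repeated` are hypothetical names, views are modeled as plain (offset, len) pairs, and the inlining of short values is ignored.

```rust
use std::collections::HashMap;

/// Toy stand-in for a view array builder (an assumption for illustration,
/// not arrow-rs code): each element is an (offset, len) pair pointing
/// into one shared data buffer.
struct ToyViewBuilder {
    data: Vec<u8>,
    views: Vec<(usize, usize)>,
    // Some(map) when deduplication is enabled: value bytes -> offset in `data`.
    seen: Option<HashMap<Vec<u8>, usize>>,
}

impl ToyViewBuilder {
    fn new(deduplicate: bool) -> Self {
        Self {
            data: Vec::new(),
            views: Vec::new(),
            seen: deduplicate.then(HashMap::new),
        }
    }

    fn append_value(&mut self, value: &str) {
        let bytes = value.as_bytes();
        // With deduplication on, reuse the offset of an identical value.
        if let Some(seen) = &self.seen {
            if let Some(&offset) = seen.get(bytes) {
                self.views.push((offset, bytes.len()));
                return;
            }
        }
        // Otherwise copy the bytes into the data buffer again.
        let offset = self.data.len();
        self.data.extend_from_slice(bytes);
        if let Some(seen) = &mut self.seen {
            seen.insert(bytes.to_vec(), offset);
        }
        self.views.push((offset, bytes.len()));
    }
}

/// Returns (number of views, bytes held in the data buffer).
fn build_repeated(value: &str, size: usize, deduplicate: bool) -> (usize, usize) {
    let mut builder = ToyViewBuilder::new(deduplicate);
    for _ in 0..size {
        builder.append_value(value);
    }
    (builder.views.len(), builder.data.len())
}
```

In this model, appending the same value `size` times without deduplication stores `size * value.len()` bytes, while the deduplicating path stores `value.len()` bytes and still produces `size` views.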

Are these changes tested?

Added test.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added the common (Related to common crate) label Dec 21, 2025
Comment on lines +3031 to +3037
let mut builder =
    StringViewBuilder::with_capacity(size).with_deduplicate_strings();
for _ in 0..size {
    builder.append_value(value);
}
let array = builder.finish();
Arc::new(array)
Contributor Author

Technically, we could optimize this further by calling StringViewArray::try_new manually with the correct data, avoiding the hashing the builder performs. I felt that might get too far into the weeds of arrow-rs code, so I stuck with this simpler approach.
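
A rough sketch of what the "correct data" for that manual route could look like, using only std. `make_view` and `repeated_views` are hypothetical helpers, not arrow-rs functions. The 16-byte layout follows the Arrow columnar format's variable-size binary view (length, then either inline bytes or prefix, buffer index, offset). Wrapping the result into buffers and passing it to StringViewArray::try_new is left out here.

```rust
/// Encode one 16-byte view, little-endian, per the Arrow view layout.
/// Hypothetical helper for illustration; assumes `value.len()` fits in u32.
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let len = value.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if value.len() <= 12 {
        // Short values are stored inline in the view, zero-padded.
        bytes[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long values store a 4-byte prefix plus a pointer into a data buffer.
        bytes[4..8].copy_from_slice(&value[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Hypothetical helper: views and data buffers for `size` copies of `value`.
/// The view is computed once and repeated, so no hashing is needed.
fn repeated_views(value: &str, size: usize) -> (Vec<u128>, Vec<Vec<u8>>) {
    let bytes = value.as_bytes();
    let view = make_view(bytes, 0, 0);
    // Inline values need no data buffer at all.
    let buffers = if bytes.len() > 12 {
        vec![bytes.to_vec()]
    } else {
        vec![]
    };
    (vec![view; size], buffers)
}
```

Every element shares the same view value, and a long string is copied into the data buffer exactly once.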

Contributor Author

Actually if we have a benchmark maybe it could be worth testing out if there is a performance gain by doing this? 🤔

Contributor

Update - here is the tracking ticket (thanks @Jefffrey )

let buffers = array.data_buffers();
assert_eq!(1, buffers.len());
// Ensure we only have a single copy of the value string
assert_eq!(value.len(), buffers[0].len());
Contributor Author

On main this assert fails, as the data buffer would be 10 * value.len() bytes long; with this fix, the child buffer holds only a single copy of the full string, minimizing the memory footprint of the output array.

- Arc::new(StringViewArray::from_iter_values(repeat_n(value, size)))
+ let mut builder =
+     StringViewBuilder::with_capacity(size).with_deduplicate_strings();
+ for _ in 0..size {
Contributor

🤔 we should probably have some kind of append_n in arrow-rs in the future to remove this boilerplate

Contributor

@comphead comphead left a comment

Thanks @Jefffrey, this LGTM

Contributor

@2010YOUY01 2010YOUY01 left a comment

It's a good idea to keep it simple for now. I agree that if we can add a new API in arrow-rs that directly constructs a StringView array of size k from one repeated element, it's likely to be much faster. (Should we open an issue?)

@Jefffrey
Contributor Author

An append_n API sounds good, raised apache/arrow-rs#9034
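
For context, a rough sketch of what an append_n-style call could do, on a toy builder using only std. `ToyAppendNBuilder` and its `append_n` signature are assumptions for illustration; the thread only proposes such an API in apache/arrow-rs#9034 and does not describe its final shape.

```rust
/// Toy builder (hypothetical, not arrow-rs): views are (offset, len) pairs
/// pointing into a single data buffer.
#[derive(Default)]
struct ToyAppendNBuilder {
    data: Vec<u8>,
    views: Vec<(usize, usize)>,
}

impl ToyAppendNBuilder {
    /// Hypothetical `append_n`: copy the value's bytes once, then push
    /// `n` identical views, with no per-element hashing or loop at the call site.
    fn append_n(&mut self, value: &str, n: usize) {
        if n == 0 {
            return;
        }
        let offset = self.data.len();
        self.data.extend_from_slice(value.as_bytes());
        self.views
            .extend(std::iter::repeat((offset, value.len())).take(n));
    }
}
```

A caller would then replace the `for _ in 0..size` loop with a single `append_n(value, size)` call.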

@Jefffrey Jefffrey added this pull request to the merge queue Dec 23, 2025
Merged via the queue into apache:main with commit 677c543 Dec 23, 2025
27 checks passed
@Jefffrey Jefffrey deleted the optimize-view-to-arr branch December 23, 2025 08:40
github-merge-queue Bot pushed a commit that referenced this pull request Dec 24, 2025
## Which issue does this PR close?

- Follow on to #19441

## Rationale for this change

In #19441 @Jefffrey filed a
follow on ticket for arrow-rs
apache/arrow-rs#9034

I wanted to leave the context of where it could be used in DataFusion so
we remember to use it when available

## What changes are included in this PR?

Add a comment with a reference to
apache/arrow-rs#9034

## Are these changes tested?


## Are there any user-facing changes?
No, only comments