Add append_value_n to GenericByteBuilder#9426
Conversation
I noticed that this method is available on PrimitiveTypeBuilder but missing on GenericByteBuilder, which makes sense since the gain is smaller. After benchmarking, however, it shows a solid 10%:

```
┌───────────────────┬────────────────┬───────────────────┬─────────┐
│ Benchmark         │ append_value_n │ append_value loop │ Speedup │
├───────────────────┼────────────────┼───────────────────┼─────────┤
│ n=100/len=5       │ 371 ns         │ 408 ns            │ 10%     │
│ n=100/len=30      │ 456 ns         │ 507 ns            │ 10%     │
│ n=100/len=1024    │ 1.81 µs        │ 1.95 µs           │ 8%      │
│ n=1000/len=5      │ 2.39 µs        │ 2.87 µs           │ 17%     │
│ n=1000/len=30     │ 3.41 µs        │ 3.89 µs           │ 12%     │
│ n=1000/len=1024   │ 12.3 µs        │ 14.4 µs           │ 15%     │
│ n=10000/len=5     │ 23.8 µs        │ 29.3 µs           │ 19%     │
│ n=10000/len=30    │ 33.7 µs        │ 39.0 µs           │ 14%     │
│ n=10000/len=1024  │ 115.9 µs       │ 135.0 µs          │ 14%     │
│ n=100000/len=5    │ 227.5 µs       │ 278.6 µs          │ 18%     │
│ n=100000/len=30   │ 328.1 µs       │ 377.9 µs          │ 13%     │
│ n=100000/len=1024 │ 1.16 ms        │ 1.34 ms           │ 14%     │
└───────────────────┴────────────────┴───────────────────┴─────────┘
```

I think this is still worthwhile to add. Let me know what the community thinks!
```rust
pub fn append_value_n(&mut self, value: impl AsRef<T::Native>, n: usize) {
    let bytes = value.as_ref().as_ref();
    for _ in 0..n {
        self.value_builder.extend_from_slice(bytes);
```
Are there methods we can use to reserve the capacity for value_builder and offsets_builder ahead of time?
(reserve is potentially dangerous performance-wise if it doesn't do amortized allocations; otherwise we would need to do the amortization here.)
Is that directly related to this PR? Or a more general observation?
AFAIK, Rust container allocations are amortized; e.g., the Vec docs state:

> Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. The current strategy is basic and it may prove desirable to use a non-constant growth factor. Whatever strategy is used will of course guarantee O(1) amortized push.
So hopefully that's enough to avoid quadratic silliness even if we don't reserve, leaving only constant factors on the table?
I didn't look closely at Rust, but in C++ there could be a potential footgun (I copied the reference below). So calling append_n with n=1 in a loop was my concern if we ended up using reserve. It sounds like Rust might make stronger guarantees (but there is wiggle room):

> Correctly using reserve() can prevent unnecessary reallocations, but inappropriate uses of reserve() (for instance, calling it before every push_back() call) may actually increase the number of reallocations (by causing the capacity to grow linearly rather than exponentially) and result in increased computational complexity and decreased performance. For example, a function that receives an arbitrary vector by reference and appends elements to it should usually not call reserve() on the vector, since it does not know of the vector's usage characteristics.
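The Rust side of this can be checked empirically. The std-only sketch below counts reallocations when `reserve(1)` is called before every single push, the exact pattern the C++ quote warns about. With the standard library's current growth strategy (an implementation detail, not a documented guarantee), capacity still grows geometrically, so the reallocation count stays roughly logarithmic rather than linear:

```rust
// Count capacity changes when `reserve(1)` precedes every push.
fn reallocations_with_reserve_per_push(pushes: usize) -> usize {
    let mut v: Vec<u64> = Vec::new();
    let mut reallocs = 0;
    let mut last_cap = v.capacity();
    for i in 0..pushes {
        v.reserve(1); // worst-case usage: reserve before every single push
        v.push(i as u64);
        if v.capacity() != last_cap {
            reallocs += 1;
            last_cap = v.capacity();
        }
    }
    reallocs
}

fn main() {
    let r = reallocations_with_reserve_per_push(1_000_000);
    // Geometric growth keeps this near log2(1_000_000) ≈ 20; a linear
    // strategy (the C++ footgun) would give ~1_000_000 reallocations.
    assert!(r < 64);
    println!("reallocations: {r}");
}
```

Under a C++-style footgun this assertion would fail by four orders of magnitude, which is the distinction being discussed here.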
Or put another way, callers that care about it can probably call reserve themselves externally
You can reserve capacity by calling with_capacity:
arrow-rs/arrow-array/src/builder/generic_bytes_builder.rs
Lines 43 to 57 in 2f40f78
We already do this in our code.
OK, so let's avoid adding it here. It seems better placed in caller code.
As this is append_value_n, I believe it makes sense to add a reserve.
While the builder provides with_capacity, the primary use case for a builder is when you don't know the capacity upfront (in most other cases you can build the array from an iterator or slice directly, skipping the builder overhead).
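As a sketch of what reserving upfront inside append_value_n could look like, here is a std-only stand-in: ToyBytesBuilder and its fields are hypothetical names mirroring the real value_builder/offsets_builder pair, not the actual arrow-rs implementation.

```rust
// Toy stand-in for GenericByteBuilder's internals: a flat byte buffer
// plus i32 offsets (Arrow's variable-length binary layout).
struct ToyBytesBuilder {
    values: Vec<u8>,   // stands in for value_builder
    offsets: Vec<i32>, // stands in for offsets_builder
}

impl ToyBytesBuilder {
    fn new() -> Self {
        Self { values: Vec::new(), offsets: vec![0] }
    }

    fn append_value_n(&mut self, value: &[u8], n: usize) {
        // Reserve everything once: n copies of the bytes and n new offsets,
        // instead of relying on amortized growth inside the loop.
        self.values.reserve(value.len() * n);
        self.offsets.reserve(n);
        for _ in 0..n {
            self.values.extend_from_slice(value);
            self.offsets.push(self.values.len() as i32);
        }
    }
}

fn main() {
    let mut b = ToyBytesBuilder::new();
    b.append_value_n(b"abc", 3);
    assert_eq!(b.values, b"abcabcabc");
    assert_eq!(b.offsets, vec![0, 3, 6, 9]);
}
```

Since append_value_n knows both n and the value length up front, both reserve sizes are exact, which sidesteps the reserve-in-a-loop concern discussed above.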
My friend Claude created a benchmark:
```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

use arrow_array::builder::StringBuilder;
use criterion::*;
use std::hint;

fn bench_append_value(c: &mut Criterion) {
    let mut group = c.benchmark_group("append_value");
    for &str_len in &[5, 30, 1024] {
        let value = "x".repeat(str_len);
        for &n in &[100, 1000, 10000, 100000] {
            group.throughput(Throughput::Elements(n as u64));
            group.bench_with_input(
                BenchmarkId::new(format!("n={n}"), format!("len={str_len}")),
                &(&value, n),
                |b, &(value, n)| {
                    b.iter(|| {
                        let mut builder = StringBuilder::new();
                        for _ in 0..n {
                            builder.append_value(value);
                        }
                        hint::black_box(builder.finish());
                    })
                },
            );
        }
    }
    group.finish();
}

fn bench_append_value_n(c: &mut Criterion) {
    let mut group = c.benchmark_group("append_value_n");
    for &str_len in &[5, 30, 1024] {
        let value = "x".repeat(str_len);
        for &n in &[100, 1000, 10000, 100000] {
            group.throughput(Throughput::Elements(n as u64));
            group.bench_with_input(
                BenchmarkId::new(format!("n={n}"), format!("len={str_len}")),
                &(&value, n),
                |b, &(value, n)| {
                    b.iter(|| {
                        let mut builder = StringBuilder::new();
                        builder.append_value_n(value, n);
                        hint::black_box(builder.finish());
                    })
                },
            );
        }
    }
    group.finish();
}

criterion_group!(benches, bench_append_value, bench_append_value_n);
criterion_main!(benches);
```

Let me know if you find this valuable to add to the repository.
This resulted in:
append_value_n improvements with reserve():
┌────────────────────┬─────────┬─────────┬─────────────┐
│ Config │ Before │ After │ Improvement │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=1000, len=5 │ 2.96 µs │ 2.68 µs │ ~9% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=10000, len=5 │ 29.3 µs │ 26.4 µs │ ~10% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=100000, len=5 │ 268 µs │ 249 µs │ ~7% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=100, len=30 │ 628 ns │ 548 ns │ ~13% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=1000, len=30 │ 4.22 µs │ 3.68 µs │ ~13% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=10000, len=30 │ 43.9 µs │ 36.3 µs │ ~17% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=100000, len=30 │ 404 µs │ 350 µs │ ~13% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=100, len=1024 │ 3.16 µs │ 1.93 µs │ ~39% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=1000, len=1024 │ 13.7 µs │ 11.1 µs │ ~19% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=10000, len=1024 │ 729 µs │ 103 µs │ ~86% │
├────────────────────┼─────────┼─────────┼─────────────┤
│ n=100000, len=1024 │ 8.68 ms │ 1.02 ms │ ~88% │
└────────────────────┴─────────┴─────────┴─────────────┘
Keep in mind that the benchmark uses ::new rather than ::with_capacity. We can see that .reserve offers some benefits if you don't pre-allocate.
Out of curiosity, are there any benchmarks in the repo that show these improvements?
I've asked my friend Claude to generate a microbenchmark; the result can be found in the PR description. I can commit the benchmark as well, but I figured that would add a lot of very specific benchmarks which won't be used that much, I guess.
```rust
        self.value_builder.extend_from_slice(bytes);
        self.offsets_builder.push(self.next_offset());
    }
    self.null_buffer_builder.append_n_non_nulls(n);
```
Pulling this out of the loop gives the 10-20% speedup.
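To illustrate why one bulk append of validity bits beats a per-iteration push, here is a std-only stand-in for the null buffer. ToyNullBuffer is hypothetical; the real NullBufferBuilder is more elaborate, but the idea is the same: the bulk path writes whole 0xFF bytes instead of doing per-bit shift/mask work n times.

```rust
// Toy packed validity bitmap (LSB-first, like Arrow's).
struct ToyNullBuffer {
    bits: Vec<u8>,
    len: usize,
}

impl ToyNullBuffer {
    fn new() -> Self {
        Self { bits: Vec::new(), len: 0 }
    }

    // Per-value path: one bit at a time, branch + shift per call.
    fn append_non_null(&mut self) {
        if self.len % 8 == 0 {
            self.bits.push(0);
        }
        *self.bits.last_mut().unwrap() |= 1 << (self.len % 8);
        self.len += 1;
    }

    // Bulk path: set the leading partial byte bit-by-bit, then fill
    // whole bytes with 0xFF, then finish the trailing partial byte.
    fn append_n_non_nulls(&mut self, n: usize) {
        let new_len = self.len + n;
        self.bits.resize((new_len + 7) / 8, 0);
        let mut i = self.len;
        while i < new_len && i % 8 != 0 {
            self.bits[i / 8] |= 1 << (i % 8);
            i += 1;
        }
        while i + 8 <= new_len {
            self.bits[i / 8] = 0xFF; // one store covers eight values
            i += 8;
        }
        while i < new_len {
            self.bits[i / 8] |= 1 << (i % 8);
            i += 1;
        }
        self.len = new_len;
    }
}

fn main() {
    let mut a = ToyNullBuffer::new();
    for _ in 0..13 {
        a.append_non_null();
    }
    let mut b = ToyNullBuffer::new();
    b.append_n_non_nulls(13);
    // Both paths produce the same bitmap: 8 set bits, then 5.
    assert_eq!(a.bits, b.bits);
    assert_eq!(b.bits, vec![0xFF, 0b0001_1111]);
}
```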
```rust
    let bytes = value.as_ref().as_ref();
    for _ in 0..n {
        self.value_builder.extend_from_slice(bytes);
        self.offsets_builder.push(self.next_offset());
```
If you want to make it faster:
self.offsets_builder.append_trusted_len_iter( could be used outside of the loop with an offset iterator.
IMO a reserve should also be added here, as this is a builder API: it rarely makes sense to use a builder when you know the capacity upfront (you can build the array directly, which will be faster anyway).
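The offsets for n repeats of one value form a fixed-stride sequence, so they can be produced by an iterator in one shot rather than pushed inside the byte-copy loop. A std-only sketch (append_value_n_offsets is a hypothetical free function; `Vec::extend` stands in for the builder's append_trusted_len_iter):

```rust
// Append `value` n times, generating all n new offsets from an iterator.
fn append_value_n_offsets(
    values: &mut Vec<u8>,
    offsets: &mut Vec<i32>,
    value: &[u8],
    n: usize,
) {
    // Exact upfront reservation, as suggested in the review.
    values.reserve(value.len() * n);
    offsets.reserve(n);
    for _ in 0..n {
        values.extend_from_slice(value);
    }
    // Offsets advance by value.len() per element: start+len, start+2*len, ...
    let start = *offsets.last().unwrap() as usize;
    let len = value.len();
    offsets.extend((1..=n).map(|i| (start + i * len) as i32));
}

fn main() {
    let mut values = Vec::new();
    let mut offsets = vec![0i32];
    append_value_n_offsets(&mut values, &mut offsets, b"ab", 4);
    assert_eq!(values, b"abababab");
    assert_eq!(offsets, vec![0, 2, 4, 6, 8]);
}
```

Because the iterator's length is known exactly, a trusted-length append can skip per-push capacity checks, which is where the extra speedup would come from.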
@Dandandan -- any reason not to merge this now that the
Which issue does this PR close?
Closes Add append_value_n to GenericByteBuilder #9425.
Rationale for this change
I noticed that this method is available on PrimitiveTypeBuilder but missing on GenericByteBuilder, which makes sense since the gain is smaller. After benchmarking, however, it shows a solid 10%, mostly because of the more efficient allocation of the null mask.
I think this is still worthwhile to add. Let me know what the community thinks!
What changes are included in this PR?
A new public API.
Are these changes tested?
Yes!
Are there any user-facing changes?
A new public API.