[Variant] test: add variant object tests with different sizes#7896
[Variant] test: add variant object tests with different sizes#7896alamb merged 4 commits intoapache:mainfrom
Conversation
843eb67 to
34bb605
Compare
alamb
left a comment
There was a problem hiding this comment.
Thank you very much for this PR @odysa -- it is a great start. I left some suggestions, let me know what you think.
FYI @codephage2020
There was a problem hiding this comment.
Inserting 2²⁴ fields took 66 seconds on my laptop. I've marked it as ignored for now while I explore a faster approach.
There was a problem hiding this comment.
Instead of extracting the field value directly, you could also use the API in VariantHeader, like @codephage2020 did in #7876
Something like
let expected_offset_size = OffsetSizeBytes::Three;
assert_eq!(obj.header.field_offset_size, expected_offset_size)You would also have to put the tests into object.rs so the header field was accessable
There was a problem hiding this comment.
Hi Alamb. The person you may want to mention is him @klion26 .
There was a problem hiding this comment.
Yes, I apologize, I tagged the wrong collaborator. Please feel free to comment and review too @codephage2020
ad2b013 to
544ff48
Compare
| } | ||
|
|
||
| #[test] | ||
| fn test_variant_object_16777217_elements() { |
There was a problem hiding this comment.
This test takes about 45 seconds to run on my laptop. THe next longest test takes less than a second
Is there any way to make this faster?
PASS [ 45.136s] parquet-variant variant::object::tests::test_variant_object_16777217_elements
PASS [ 0.126s] parquet-variant variant::object::tests::test_variant_object_65535_bytes_child_data_2_byte_offsets
PASS [ 0.144s] parquet-variant variant::object::tests::test_variant_object_65537_elements
PASS [ 0.697s] parquet-variant variant::list::tests::test_large_variant_list_with_total_child_length_between_2_pow_24_and_2_pow_32
There was a problem hiding this comment.
I was thinking about something like batch write.
There was a problem hiding this comment.
my brief profiling suggests that a large amount of the time was inserting the field names in (basically doing the hash table lookup). I am not sure how we would fix that -- you could try pre-populating the field names in the builder I suppose
There was a problem hiding this comment.
I tested pre-populating.
let field_names: Vec<String> = (0..count).map(|val| val.to_string()).collect();
let s = field_names.iter().map(|s| s.as_str());
let start = std::time::Instant::now();
let mut builder = VariantBuilder::new().with_field_names(s);
println!("Pre-populating field names: {:?}", start.elapsed());Pre-populating field names: 20.150840416s
There was a problem hiding this comment.
Sadly it still seems to take 45 seconds on my machine
alamb
left a comment
There was a problem hiding this comment.
Thank you @odysa -- this looks really nice -- now the only thing left is to figure out how to avoid spending 45 seconds on a test
cc @friendlymatthew in case you have some other thoughts
There was a problem hiding this comment.
Hi, I took a look and was able to get the test case from ~45s to ~33s on my machine. (To be clear, half of this time is spent formatting 2**24 field names.)
The main wins were:
- reserve capacity upfront #7922
- avoid full validation when reconstructing the variant
|
|
||
| obj.finish().unwrap(); | ||
| let (metadata, value) = builder.finish(); | ||
| let variant = Variant::try_new(&metadata, &value).unwrap(); |
There was a problem hiding this comment.
Using the builder already does validation, so we can save a lot of time by doing unchecked validation when reconstructing the variant.
Something like:
let variant = Variant::new(&metadata, &value);There was a problem hiding this comment.
Thanks. change applied.
| Variant::Int32(count - 1) | ||
| ); | ||
|
|
||
| let header_byte = first_byte_from_slice(&value).unwrap(); |
There was a problem hiding this comment.
- Do we need to compute the header here? Seems we can get the header from
obj.header - Do we need to verify the
field_offset_sizeinVariantObjectHeader
There was a problem hiding this comment.
- done
- It's tested in
test_variant_object_with_large_data
|
|
||
| use super::*; | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
The comment in line 411 seems incorrect -- So header byte = 00_0_1_0001 = 0x10. Maybe we can improve it also
|
Thanks gaain @odysa |
…7922) # Which issue does this PR close? - Part of #7896 # Rationale for this change In #7896, we saw that inserting a large amount of field names takes a long time -- in this case ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the `IndexSet` that we're inserting into. `with_field_names` is an optimization to declare the field names upfront which avoids having to reallocate and rehash the entire `IndexSet` during field name insertion. Using this method requires at least 2 string allocations for each field name -- 1 to declare field names upfront and 1 to insert the actual field name during object building. This PR adds a new method `with_field_name_capacity` which allows you to reserve space to the metadata builder, without needing to allocate the field names themselves upfront. In this case, we see a modest performance improvement when inserting the field names during object building Before: <img width="1512" height="829" alt="Screenshot 2025-07-13 at 12 08 43 PM" src="https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c" /> After: <img width="1512" height="805" alt="Screenshot 2025-07-13 at 12 08 55 PM" src="https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3" />

Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
VariantObjectss #7821 .Rationale for this change
VariantObject with between 2^8 and 2^16 elements ( field_id_size_minus_1 = 1, 2 byte field ids)
VariantObject with between 2^16 and 2^24 elements ( field_id_size_minus_1 = 2, 3 byte field ids)
VariantObject with between 2^24 and 2^32 elements ( field_id_size_minus_1 = 3, 4 byte field ids)
VariantObject with total child data length between 2^8 and 2^16 elements ( field_offset_size_minus_1 = 1, 2 byte field offsets)
VariantObject with total child data length between 2^16 and 2^24 elements ( field_offset_size_minus_1 = 2, 3 byte field offsets)
VariantObject with total child data length between 2^24 and 2^32 elements ( field_offset_size_minus_1 = 3, 4 byte field offsets)