[VARIANT] Support both fallible and infallible access to variants by scovich · Pull Request #7807 · apache/arrow-rs

scovich · 2025-06-27T16:36:36Z

Which issue does this PR close?

Closes [Variant] Revisit validation cost of infallible iterators #7711

Rationale for this change

Full validation is nice, but expensive when not needed.

What changes are included in this PR?

Allow both validated+infallible and unvalidated+fallible access combinations. This generally means splitting out "shallow" (constant-time) validations to a try_xxx_impl method, along with a validate method that performs complete (recursive) validation. The corresponding try_xxx method then calls validate on the result of try_xxx_impl, while the xxx method just unwraps the result of try_xxx_impl.
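
The split described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the `Dict` type, its toy encoding (one count byte followed by single-byte ASCII entries), and the `String` error type are all invented for the example. Only the method naming pattern (`try_new_impl`, `validate`, `try_new`, `new`) follows the description.

```rust
/// Hypothetical stand-in for a variant type: byte 0 holds the entry count,
/// followed by that many single-byte entries that must all be ASCII.
#[derive(Debug, Clone, Copy)]
pub struct Dict<'a> {
    bytes: &'a [u8],
    len: usize,
}

impl<'a> Dict<'a> {
    /// Shallow, constant-time checks only: the header byte exists and the
    /// declared entry count fits in the buffer.
    fn try_new_impl(bytes: &'a [u8]) -> Result<Self, String> {
        let len = *bytes.first().ok_or("empty buffer")? as usize;
        if bytes.len() < 1 + len {
            return Err(format!("declared {} entries, buffer too short", len));
        }
        Ok(Self { bytes, len })
    }

    /// Complete (linear-time) validation of every entry.
    pub fn validate(self) -> Result<Self, String> {
        for (i, b) in self.bytes[1..1 + self.len].iter().enumerate() {
            if !b.is_ascii() {
                return Err(format!("entry {} is not ASCII", i));
            }
        }
        Ok(self)
    }

    /// Validated construction: shallow checks followed by full validation.
    pub fn try_new(bytes: &'a [u8]) -> Result<Self, String> {
        Self::try_new_impl(bytes)?.validate()
    }

    /// Unvalidated construction: panics only if the shallow checks fail.
    pub fn new(bytes: &'a [u8]) -> Self {
        Self::try_new_impl(bytes).expect("invalid header")
    }

    /// Fallible access that is safe to call on an unvalidated instance.
    pub fn try_get(&self, i: usize) -> Result<u8, String> {
        if i >= self.len {
            return Err(format!("index {} out of bounds", i));
        }
        let b = self.bytes[1 + i];
        if b.is_ascii() {
            Ok(b)
        } else {
            Err(format!("entry {} is not ASCII", i))
        }
    }
}
```

In this sketch, `Dict::try_new` pays the linear cost once so that later access can assume valid data, while `Dict::new` defers that cost and leaves callers to use `try_get`.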

Some annoying shortcomings that I don't think are possible to avoid:

It would be nice to allow "unvalidated" [Short]String variant values, since strings could potentially be quite large; but there is no safe way to construct an unvalidated utf-8 string. So only metadata, object, and array can be in an unvalidated state.
The Index trait requires its implementation to return references. This works ok for VariantMetadata, which returns &'m str, but VariantList and VariantObject need to return wrapper objects by value and so cannot impl Index. Instead, their infallible get type methods return Option instead of Result, which isn't really an improvement to user experience.
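
The `Index` limitation can be illustrated with a small sketch. The `Names`, `Element`, and `List` types below are hypothetical stand-ins, not the crate's types; the only fact relied on is that `std::ops::Index::index` must return `&Self::Output`.

```rust
use std::ops::Index;

/// Hypothetical metadata-like type whose entries are borrowed strings.
pub struct Names<'m> {
    pub entries: Vec<&'m str>,
}

impl<'m> Index<usize> for Names<'m> {
    type Output = str;

    // `Index::index` must return a reference. A `&'m str` that already
    // points into the underlying buffer satisfies that requirement.
    fn index(&self, i: usize) -> &str {
        self.entries[i]
    }
}

/// Hypothetical wrapper object that is built on demand for each element.
#[derive(Debug, PartialEq)]
pub struct Element<'m> {
    pub name: &'m str,
    pub position: usize,
}

pub struct List<'m> {
    pub names: Names<'m>,
}

impl<'m> List<'m> {
    // The wrapper is a fresh value that is not stored anywhere inside
    // `self`, so there is nothing to return a reference to. `Index` cannot
    // be implemented here, and a by-value getter is used instead.
    pub fn get(&self, i: usize) -> Option<Element<'m>> {
        self.names
            .entries
            .get(i)
            .map(|&name| Element { name, position: i })
    }
}
```

With these definitions, `&names[0]` works for the metadata-like type, while the list can only offer `list.get(i)`, which returns `Option<Element>` by value.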

Are these changes tested?

TODO (help would be appreciated, this has turned into a much larger effort than I guessed)

We typically require tests for all PRs in order to:

Prevent the code from being accidentally broken by subsequent changes
Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

New try_xxx methods to pair with existing xxx methods, e.g. try_new and new.

scovich · 2025-06-27T16:37:30Z

Attn @alamb @friendlymatthew -- this is an early pathfinding that only changes VariantMetadata. If we like what we see here I can try to expand it to VariantList and VariantObject as well.

friendlymatthew · 2025-06-28T02:55:42Z

+/// A _validated_ instance guarantees that:
+///
+/// - header byte is valid
+/// - dictionary size is in bounds
+/// - offset array content is in-bounds
+/// - first offset is zero
+/// - last offset is in-bounds
+/// - all other offsets are in-bounds (*)
+/// - all offsets are monotonically increasing (*)
+/// - all values are valid utf-8 (*)


Hi, it looks good to me. It makes sense we're reserving validation that requires a linear scan across offsets/values for try_new

alamb

I think this is looking quite good -- thank you @scovich

alamb · 2025-06-29T10:55:29Z

        // Fields
-        assert_eq!(md.get(0).unwrap(), "cat");
-        assert_eq!(md.get(1).unwrap(), "dog");
+        assert_eq!(&md[0], "cat");


that looks much better

Unfortunately, we can only do this for VariantMetadata... VariantObject and VariantList need to return a value, not a reference, which Index trait doesn't allow.

alamb · 2025-06-29T10:55:39Z

    }
+}
+
+/// Retrieves the ith dictionary entry, panicking if the index is out of bounds. Accessing


alamb · 2025-06-29T10:57:11Z

+/// A _validated_ instance guarantees that:
+///
+/// - header byte is valid
+/// - dictionary size is in bounds
+/// - offset array content is in-bounds
+/// - first offset is zero
+/// - last offset is in-bounds
+/// - all other offsets are in-bounds (*)
+/// - all offsets are monotonically increasing (*)
+/// - all values are valid utf-8 (*)


I agree -- the API sketched out in this PR looks great to me -- ergonomic to use as well as offering optimized path access when needed

scovich

Self-review (found a few issues that will be addressed shortly)

scovich · 2025-06-30T23:01:12Z

+    pub(crate) fn unpack_usize(&self, bytes: &[u8], index: usize) -> Result<usize, ArrowError> {
+        self.unpack_usize_at_offset(bytes, 0, index)


It turns out that a bunch of slice accesses were not correctly bounded; once I fixed that, most unpack_usize calls were passing offset 0. So I renamed the original method as unpack_usize_at_offset, and created a new 2-arg unpack_usize that doesn't take an offset.
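
A rough sketch of the resulting pair of methods, assuming a simplified `Width` enum in place of the real `OffsetSizeBytes`, little-endian entries, and `String` errors in place of `ArrowError` (all of these are assumptions made for illustration):

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the offset-width enum; the discriminant is the
/// number of bytes per entry.
#[derive(Clone, Copy)]
pub enum Width {
    One = 1,
    Two = 2,
    Four = 4,
}

impl Width {
    /// Reads the `index`-th little-endian entry, where entry 0 starts at
    /// `byte_offset`. Out-of-range reads are errors rather than panics.
    pub fn unpack_usize_at_offset(
        self,
        bytes: &[u8],
        byte_offset: usize,
        index: usize,
    ) -> Result<usize, String> {
        let width = self as usize;
        let start = index
            .checked_mul(width)
            .and_then(|n| n.checked_add(byte_offset))
            .ok_or("offset overflow")?;
        let end = start.checked_add(width).ok_or("offset overflow")?;
        // `get` returns `None` instead of panicking when the range is out of bounds.
        let entry = bytes.get(start..end).ok_or("entry out of bounds")?;
        let mut buf = [0u8; 4];
        buf[..width].copy_from_slice(entry);
        usize::try_from(u32::from_le_bytes(buf)).map_err(|e| e.to_string())
    }

    /// Two-argument convenience form for the common case of offset 0.
    pub fn unpack_usize(self, bytes: &[u8], index: usize) -> Result<usize, String> {
        self.unpack_usize_at_offset(bytes, 0, index)
    }
}
```

Callers that have already sliced the buffer down to the offset array can use the short form, and only the few call sites that still need a starting byte offset pass one explicitly.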

scovich · 2025-06-30T23:01:30Z

+        let offset = offset_index
+            .checked_mul(*self as usize)
+            .and_then(|n| n.checked_add(byte_offset))


This was unchecked arithmetic that risked overflow panic on 32-bit arch.
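
To make the overflow concern concrete, here is a minimal sketch of the checked computation. The function names are invented for the example, and the `u32` variant merely simulates what `usize` arithmetic looks like on a 32-bit target.

```rust
/// Computes `byte_offset + index * width`, returning `None` on overflow.
/// With plain `*` and `+`, overflow panics in debug builds and wraps
/// silently in release builds.
pub fn entry_start(index: usize, width: usize, byte_offset: usize) -> Option<usize> {
    index
        .checked_mul(width)
        .and_then(|n| n.checked_add(byte_offset))
}

/// The same computation with `u32` standing in for a 32-bit `usize`.
pub fn entry_start_32(index: u32, width: u32, byte_offset: u32) -> Option<u32> {
    index
        .checked_mul(width)
        .and_then(|n| n.checked_add(byte_offset))
}
```

For instance, `entry_start_32(0x4000_0000, 4, 0)` returns `None`, because the product is exactly 2^32 and does not fit in 32 bits, even though every individual input does.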

scovich · 2025-06-30T23:02:03Z

+        result
+            .try_into()
+            .map_err(|e: TryFromIntError| ArrowError::InvalidArgumentError(e.to_string()))


Factored out the u32 -> usize conversion to reduce redundancy.

scovich · 2025-06-30T23:15:16Z

+            VariantBasicType::Object => {
+                Variant::Object(VariantObject::try_new_impl(metadata, value)?)
+            }
+            VariantBasicType::Array => Variant::List(VariantList::try_new_impl(metadata, value)?),


By directly invoking try_new_impl methods, we avoid recursive validation of array elements and object fields. Unfortunately this requires making them pub(crate), when I would have preferred to keep them private.

scovich · 2025-06-30T23:16:10Z

 /// A parsed version of the variant array value header byte.
 #[derive(Clone, Debug, PartialEq)]
 pub(crate) struct VariantListHeader {
+    num_elements_size: OffsetSizeBytes,


It turns out is_large isn't nearly as useful as the actual offset unpacker. That, combined with the first_offset_byte helper below, reduces the size of the corresponding variant struct.

See also [Variant] VariantMetadata, VariantList and VariantObject are too big for Copy #7831

scovich · 2025-06-30T23:33:35Z

            .ok_or_else(|| overflow_error("offset of variant object field offsets"))?;

-        let values_start_byte = num_elements
+        let first_value_byte = num_elements


renamed to match similar patterns elsewhere

scovich · 2025-06-30T23:35:03Z

+        let value_bytes = slice_from_slice(self.value, self.first_value_byte..)?;
+        let value_bytes = slice_from_slice(value_bytes, self.get_offset(i)?..)?;


again, don't let the method access bytes from an obviously wrong part of the underlying buffer
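
The two-step narrowing in the hunk above can be sketched like this. The `tail` helper is a hypothetical simplification of `slice_from_slice` (whose real signature is not shown in this thread), and errors are plain `String`s.

```rust
/// Hypothetical helper: returns `bytes[start..]`, or an error instead of a
/// panic when `start` is past the end of the buffer.
pub fn tail(bytes: &[u8], start: usize) -> Result<&[u8], String> {
    bytes
        .get(start..)
        .ok_or_else(|| format!("start {} exceeds buffer length {}", start, bytes.len()))
}

/// Narrows to the value region first, then applies the element offset
/// within that region.
pub fn element_bytes(
    value: &[u8],
    first_value_byte: usize,
    offset: usize,
) -> Result<&[u8], String> {
    let values = tail(value, first_value_byte)?;
    tail(values, offset)
}
```

Each step is bounds-checked on its own and the sum `first_value_byte + offset` is never computed, so a corrupt offset yields an error rather than an arithmetic overflow or an out-of-bounds panic.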

scovich · 2025-06-30T23:35:30Z

    /// This should never happen since the constructor validates all data upfront.
    pub fn field_name(&self, i: usize) -> Option<&'m str> {
-        Some(
+        (i < self.len()).then(|| {


Another missing bounds check

scovich · 2025-06-30T23:36:35Z

+        self.iter_try()
+            .map(|result| result.expect("Invalid variant list entry"))


We don't try to validate because that would trigger the recursion the caller was trying to avoid.

scovich · 2025-06-30T23:37:23Z

-            self.try_field(i)
-                .expect("validation error after construction"),
-        )
+        (i < self.len()).then(|| self.try_field_impl(i).expect("Invalid object field value"))


missing bounds check!

(the original code was incapable of returning None)
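
A small sketch of the difference, using a hypothetical `Obj` type with integer fields (the real methods operate on variant bytes and return `Variant` values):

```rust
/// Hypothetical object with a flat list of field values.
pub struct Obj {
    pub values: Vec<i32>,
}

impl Obj {
    pub fn len(&self) -> usize {
        self.values.len()
    }

    /// Fallible lookup, standing in for `try_field_impl`.
    fn try_field(&self, i: usize) -> Result<i32, String> {
        self.values
            .get(i)
            .copied()
            .ok_or_else(|| format!("field index {} out of bounds", i))
    }

    /// Shape of the old code: the result is always wrapped in `Some`, so an
    /// out-of-bounds index panics inside `expect` instead of returning `None`.
    pub fn field_before(&self, i: usize) -> Option<i32> {
        Some(self.try_field(i).expect("validation error after construction"))
    }

    /// Shape of the fix: the bounds check decides between `Some` and `None`,
    /// and `expect` only fires if an in-bounds field is itself invalid.
    pub fn field(&self, i: usize) -> Option<i32> {
        (i < self.len()).then(|| self.try_field(i).expect("Invalid object field value"))
    }
}
```

`bool::then` runs the closure only when the condition holds, which is what keeps the `expect` from being reached for an out-of-bounds index.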

alamb

Thank you @scovich -- this looks (really) great to me 🏆

The one thing I was thinking about was should we start making tests to verify checking of invalid variants 🤔

I do think trying to cook up examples of all possible malformed variants is probably not a great use of time, but fuzzing might be

Maybe we could create a fuzz tester that created a random variant using the builder, and then randomly changed a few bytes around. -- Then if validation succeeded we would walk over the entire variant and verify that the infallible APIs didn't panic

I am not sure how important this is
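
As a very rough illustration of that idea, the sketch below mutates a known-valid buffer and, whenever validation still succeeds, walks it with an infallible accessor. Everything here is an assumption made for the example: the `Toy` format (`[n, off_0, ..., off_n, payload...]`) is not the variant encoding, and the hand-rolled xorshift generator stands in for a real fuzzing tool.

```rust
/// Toy format (hypothetical): one count byte `n`, then `n + 1` one-byte
/// offsets, then the payload the offsets point into.
pub struct Toy<'a> {
    n: usize,
    offsets: &'a [u8],
    payload: &'a [u8],
}

impl<'a> Toy<'a> {
    /// Full validation: offsets exist, start at zero, never decrease, and
    /// the last one stays inside the payload.
    pub fn try_new(bytes: &'a [u8]) -> Result<Self, String> {
        let n = *bytes.first().ok_or("empty buffer")? as usize;
        let offsets = bytes.get(1..n + 2).ok_or("offsets out of bounds")?;
        let payload = &bytes[n + 2..];
        if offsets[0] != 0 {
            return Err("first offset must be zero".into());
        }
        if offsets.windows(2).any(|w| w[0] > w[1]) {
            return Err("offsets must not decrease".into());
        }
        if offsets[n] as usize > payload.len() {
            return Err("last offset out of bounds".into());
        }
        Ok(Self { n, offsets, payload })
    }

    /// Infallible accessor: relies on validation and panics otherwise.
    pub fn get(&self, i: usize) -> &'a [u8] {
        &self.payload[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}

/// Overwrites one byte of a valid buffer per round and returns how many
/// mutants still passed validation. A panic here would indicate a gap
/// between what `try_new` accepts and what `get` assumes.
pub fn fuzz(seed: u64, rounds: usize) -> usize {
    let valid = [2u8, 0, 1, 3, b'a', b'b', b'c'];
    let mut state = seed | 1; // xorshift state must be non-zero
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    let mut accepted = 0;
    for _ in 0..rounds {
        let mut bytes = valid;
        let pos = (next() % bytes.len() as u64) as usize;
        bytes[pos] = next() as u8;
        if let Ok(toy) = Toy::try_new(&bytes) {
            for i in 0..toy.n {
                let _ = toy.get(i);
            }
            accepted += 1;
        }
    }
    accepted
}
```

A real harness would build the input with the variant builder and use a coverage-guided fuzzer, but the acceptance criterion is the same: if validation succeeds, no infallible accessor may panic.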

alamb · 2025-07-01T11:39:56Z

        let buf_one = [0x01u8, 0xAB, 0xCD];
        assert_eq!(
-            OffsetSizeBytes::One.unpack_usize(&buf_one, 0, 0).unwrap(),
+            OffsetSizeBytes::One.unpack_usize(&buf_one, 0).unwrap(),


I think these tests are much more readable now

alamb · 2025-07-01T11:44:42Z

 /// }
 /// ```
+///
+/// # Validation


👨‍🍳 👌 -- very nice

alamb · 2025-07-01T11:46:47Z

+            VariantBasicType::Object => {
+                Variant::Object(VariantObject::try_new_impl(metadata, value)?)
+            }
+            VariantBasicType::Array => Variant::List(VariantList::try_new_impl(metadata, value)?),


It makes sense to me and I don't really have any better solution

alamb · 2025-07-01T11:50:00Z

+/// Every instance of variant list is either _valid_ or _invalid_. depending on whether the
+/// underlying bytes are a valid encoding of a variant array (see below).
+///
+/// Instances produced by [`Self::try_new`] or [`Self::validate`] are fully _validated_. They always


This is somewhat duplicative of the comments in Variant and we could refer people back to those docs for a definition of valid vs invalid 🤔 However I think repeating the content is also fine

alamb · 2025-07-01T11:51:41Z

-        // Iterate over all values of this array in order to validate the field_offset array and
-        // prove that the field values are all in bounds. Otherwise, `iter` might panic on `unwrap`.
-        validate_fallible_iterator(new_self.iter_checked())?;
+        // Validate just the first and last offset, ignoring the other offsets and all value bytes.


I think validating the first and last offsets seems very reasonable to me

alamb · 2025-07-01T12:06:14Z

-    field_ids_start_byte: usize,
-    field_offsets_start_byte: usize,
-    values_start_byte: usize,
+    first_field_offset_byte: usize,


these names are much easier to understand

alamb · 2025-07-01T12:12:13Z

@friendlymatthew let me know if you would like time to review this PR as well

cc @PinkCrow007 @mkarbo @superserious-dev @Weijun-H

friendlymatthew · 2025-07-01T12:28:28Z

> Maybe we could create a fuzz tester that created a random variant using the builder, and then randomly changed a few bytes around. -- Then if validation succeeded we would walk over the entire variant and verify that the infallible APIs didn't panic

I think fuzzing would help a lot, especially with AFL

scovich · 2025-07-01T13:42:18Z

> The Index trait requires its implementation to return references. This works ok for VariantMetadata, which returns &'m str, but VariantList and VariantObject need to return wrapper objects by value and so cannot impl Index. Instead, their infallible get type methods return Option instead of Result, which isn't really an improvement to user experience.

@alamb -- what are your thoughts here? I'm not convinced a get that can panic and yet still returns Option is ever better than a try_get that cannot panic and returns Result? Either way you have to handle the "error" case, so there's no usability or readability improvement to offset the panic risk?

alamb · 2025-07-01T17:55:28Z

> @alamb -- what are your thoughts here? I'm not convinced a get that can panic and yet still returns Option is ever better than a try_get that cannot panic and returns Result? Either way you have to handle the "error" case, so there's no usability or readability improvement to offset the panic risk?

I personally think Option is slightly easier to reason about (and is more performant as it doesn't allocate a String for the error).

I also think there is a subtle difference between the two APIs:

A get() that returns None (the key wasn't found) is a condition that you expect will happen with valid input and normal operation
A try_get() that returns an Err is something you expect won't happen in normal operation, and thus having all call sites have to deal with the "error that should not happen" case is wonky

Personally, I suggest we change try_get() to return Result<Option<..>> to reflect that the None is expected in normal operation while the error is not
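
A minimal sketch of what such a signature could look like, assuming a hypothetical `Object` type that stores raw field bytes and uses `String` errors (this illustrates the suggestion only and is not code from the PR):

```rust
/// Hypothetical object holding (field name, raw value bytes) pairs.
pub struct Object {
    pub fields: Vec<(String, Vec<u8>)>,
}

impl Object {
    /// `Ok(None)`: the field is absent, which is expected in normal operation.
    /// `Err(_)`: the field exists but its bytes are malformed.
    pub fn try_get(&self, name: &str) -> Result<Option<&str>, String> {
        match self.fields.iter().find(|(n, _)| n.as_str() == name) {
            None => Ok(None),
            Some((_, raw)) => std::str::from_utf8(raw)
                .map(Some)
                .map_err(|e| e.to_string()),
        }
    }
}
```

A caller can then propagate the unexpected case with `?` and handle only the expected one, for example `if let Some(v) = obj.try_get("a")? { ... }`.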

alamb · 2025-07-01T18:00:39Z

> > Maybe we could create a fuzz tester that created a random variant using the builder, and then randomly changed a few bytes around. -- Then if validation succeeded we would walk over the entire variant and verify that the infallible APIs didn't panic
>
> I think fuzzing would help a lot, especially with AFL

Thanks @friendlymatthew -- I filed a ticket to track the idea

[Variant] Add testing for invalid variants (fuzz testing??) #7842

alamb · 2025-07-01T18:01:45Z

I think this PR is a significant step forward -- thank you @scovich 🙏

While we may want to iterate on the APIs some more let's merge this one in as is so we can keep pushing forward and keep the conflicts to a minimum

support both fallible and infallible access to variant metadata

19db4ee

github-actions bot added the parquet label (Changes to the parquet crate) Jun 27, 2025

friendlymatthew reviewed Jun 28, 2025

alamb reviewed Jun 29, 2025

Tweak object and array validation as well

b132d5c

scovich commented Jun 30, 2025

address self-review comments

780a7f4

scovich marked this pull request as ready for review June 30, 2025 23:40

scovich requested review from alamb and friendlymatthew June 30, 2025 23:40

alamb approved these changes Jul 1, 2025

friendlymatthew approved these changes Jul 1, 2025

alamb mentioned this pull request Jul 1, 2025

[Variant] Add testing for invalid variants (fuzz testing??) #7842

Closed

alamb merged commit 248ee73 into apache:main Jul 1, 2025
13 checks passed

alamb mentioned this pull request Jul 1, 2025

[Variant] Field lookup with out of bounds index causes unwanted behavior #7784

Closed
