[VARIANT] Support both fallible and infallible access to variants #7807

Merged
alamb merged 3 commits into apache:main from scovich:fallible-variant
Jul 1, 2025

@scovich scovich commented Jun 27, 2025

Which issue does this PR close?

Rationale for this change

Full validation is nice, but expensive when not needed.

What changes are included in this PR?

Allow both validated+infallible and unvalidated+fallible access combinations. This generally means splitting out "shallow" (constant-time) validations to a try_xxx_impl method, along with a validate method that performs complete (recursive) validation. The corresponding try_xxx method then calls validate on the result of try_xxx_impl, while the xxx method just unwraps the result.
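A minimal sketch of that split, using a made-up toy format rather than the variant encoding. `TinyList`, its byte layout, and the `String` error type are all assumptions for illustration; the real code uses `ArrowError` and the `Variant*` types.

```rust
/// Hypothetical toy type: byte 0 is the element count `n`, followed by `n`
/// bytes that must each be ASCII. This is NOT the real variant format.
#[derive(Debug, Clone, PartialEq)]
pub struct TinyList<'a> {
    bytes: &'a [u8],
    validated: bool,
}

impl<'a> TinyList<'a> {
    /// Shallow, constant-time checks: a header exists and the declared length fits.
    fn try_new_impl(bytes: &'a [u8]) -> Result<Self, String> {
        let n = *bytes.first().ok_or("missing header byte")? as usize;
        if bytes.len() < 1 + n {
            return Err(format!("declared {} elements, found {}", n, bytes.len() - 1));
        }
        Ok(Self { bytes: &bytes[..1 + n], validated: false })
    }

    /// Complete (linear-time) validation of every element.
    pub fn validate(mut self) -> Result<Self, String> {
        if !self.validated {
            if let Some(pos) = self.bytes[1..].iter().position(|b| !b.is_ascii()) {
                return Err(format!("element {} is not ASCII", pos));
            }
            self.validated = true;
        }
        Ok(self)
    }

    /// Fallible constructor: shallow checks, then full validation.
    pub fn try_new(bytes: &'a [u8]) -> Result<Self, String> {
        Self::try_new_impl(bytes)?.validate()
    }

    /// Infallible constructor: shallow checks only, panicking if they fail.
    pub fn new(bytes: &'a [u8]) -> Self {
        Self::try_new_impl(bytes).expect("invalid TinyList header")
    }

    pub fn is_validated(&self) -> bool {
        self.validated
    }
}
```

In this sketch `new` stays cheap because it never scans the elements, and a caller who later wants the stronger guarantee can still call `validate` on the result.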

Some annoying shortcomings that I don't think are possible to avoid:

  • It would be nice to allow "unvalidated" [Short]String variant values, since strings could potentially be quite large; but there is no safe way to construct an unvalidated utf-8 string. So only metadata, object, and array can be in an unvalidated state.
  • The Index trait requires its implementation to return references. This works ok for VariantMetadata, which returns &'m str, but VariantList and VariantObject need to return wrapper objects by value and so cannot impl Index. Instead, their infallible get type methods return Option instead of Result, which isn't really an improvement to user experience.
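The `Index` limitation can be illustrated with a small sketch. `Dict`, `List`, and `Element` are hypothetical stand-ins, not the PR's types: the dictionary already stores `&'m str` values it can hand out by reference, while the list has to build its wrapper on demand and therefore has nothing to return a reference to.

```rust
use std::ops::Index;

/// Hypothetical dictionary whose entries are already stored as `&'m str`.
pub struct Dict<'m> {
    pub names: Vec<&'m str>,
}

impl<'m> Index<usize> for Dict<'m> {
    type Output = str;

    /// `Index::index` must return a reference; that works here because the
    /// string data already exists and outlives the call. Panics if out of bounds.
    fn index(&self, i: usize) -> &str {
        self.names[i]
    }
}

/// Hypothetical wrapper that is constructed on demand and not stored anywhere.
#[derive(Debug, PartialEq)]
pub struct Element<'v> {
    pub bytes: &'v [u8],
}

pub struct List<'v> {
    pub value: &'v [u8],
}

impl<'v> List<'v> {
    /// Returns the wrapper by value. `Index` cannot express this, since there
    /// is no stored `Element` that a returned reference could point to.
    pub fn get(&self, i: usize) -> Option<Element<'v>> {
        let value: &'v [u8] = self.value;
        value.get(i..=i).map(|bytes| Element { bytes: bytes })
    }
}
```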

Are these changes tested?

TODO (help would be appreciated, this has turned into a much larger effort than I guessed)

We typically require tests for all PRs in order to:

  1. Prevent the code from being accidentally broken by subsequent changes
  2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

New try_xxx methods to pair with existing xxx methods, e.g. try_new and new.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 27, 2025
scovich commented Jun 27, 2025

Attn @alamb @friendlymatthew -- this is an early pathfinding that only changes VariantMetadata. If we like what we see here I can try to expand it to VariantList and VariantObject as well.

Comment thread parquet-variant/src/variant/metadata.rs Outdated
Comment on lines +95 to +104
/// A _validated_ instance guarantees that:
///
/// - header byte is valid
/// - dictionary size is in bounds
/// - offset array content is in-bounds
/// - first offset is zero
/// - last offset is in-bounds
/// - all other offsets are in-bounds (*)
/// - all offsets are monotonically increasing (*)
/// - all values are valid utf-8 (*)
Contributor

Hi, it looks good to me. It makes sense we're reserving validation that requires a linear scan across offsets/values for try_new

Contributor

I agree -- the API sketched out in this PR looks great to me -- ergonomic to use as well as offering optimized path access when needed

Contributor

@alamb alamb left a comment

I think this is looking quite good -- thank you @scovich

// Fields
assert_eq!(md.get(0).unwrap(), "cat");
assert_eq!(md.get(1).unwrap(), "dog");
assert_eq!(&md[0], "cat");
Contributor

that looks much better

Contributor Author

Unfortunately, we can only do this for VariantMetadata... VariantObject and VariantList need to return a value, not a reference, which Index trait doesn't allow.

}
}

/// Retrieves the ith dictionary entry, panicking if the index is out of bounds. Accessing
Contributor

👍

Contributor Author

@scovich scovich left a comment

Self-review (found a few issues that will be addressed shortly)

Comment on lines +143 to +144
pub(crate) fn unpack_usize(&self, bytes: &[u8], index: usize) -> Result<usize, ArrowError> {
self.unpack_usize_at_offset(bytes, 0, index)
Contributor Author

It turns out that a bunch of slice accesses were not correctly bounded; once I fixed that, most unpack_usize calls were passing offset 0. So I renamed the original method as unpack_usize_at_offset, and created a new 2-arg unpack_usize that doesn't take an offset.

Comment on lines +168 to +170
let offset = offset_index
.checked_mul(*self as usize)
.and_then(|n| n.checked_add(byte_offset))
Contributor Author

This was unchecked arithmetic that risked overflow panic on 32-bit arch.
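The two self-review comments above (the two-argument `unpack_usize` wrapper and the checked offset arithmetic) can be sketched together. `Width` is a hypothetical stand-in for the PR's `OffsetSizeBytes`, errors are plain `String`s rather than `ArrowError`, and little-endian byte order is assumed; this is an illustration, not the crate's implementation.

```rust
/// Hypothetical stand-in for `OffsetSizeBytes`: the width in bytes of one
/// little-endian unsigned integer.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Width {
    One = 1,
    Two = 2,
    Four = 4,
}

impl Width {
    /// Reads the `index`-th integer of an array that starts at byte 0.
    pub fn unpack_usize(self, bytes: &[u8], index: usize) -> Result<usize, String> {
        self.unpack_usize_at_offset(bytes, 0, index)
    }

    /// Reads the `index`-th integer of an array that starts `byte_offset`
    /// bytes into `bytes`.
    pub fn unpack_usize_at_offset(
        self,
        bytes: &[u8],
        byte_offset: usize,
        index: usize,
    ) -> Result<usize, String> {
        let width = self as usize;
        // Checked arithmetic: an overflow becomes an error instead of a panic
        // (or a silent wraparound in release builds).
        let start = index
            .checked_mul(width)
            .and_then(|n| n.checked_add(byte_offset))
            .ok_or_else(|| "offset overflow".to_string())?;
        let end = start
            .checked_add(width)
            .ok_or_else(|| "offset overflow".to_string())?;
        // Bounded slice access: out-of-range reads become errors as well.
        let chunk = bytes
            .get(start..end)
            .ok_or_else(|| format!("bytes {}..{} are out of bounds", start, end))?;
        // Assemble the little-endian value; at most four bytes, so it fits in a u32.
        let value = chunk
            .iter()
            .rev()
            .fold(0u32, |acc, &b| (acc << 8) | u32::from(b));
        // Lossless on 32-bit and 64-bit targets.
        Ok(value as usize)
    }
}
```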

Comment on lines +187 to +189
result
.try_into()
.map_err(|e: TryFromIntError| ArrowError::InvalidArgumentError(e.to_string()))
Contributor Author

Factored out the u32 -> usize conversion to reduce redundancy.

VariantBasicType::Object => {
Variant::Object(VariantObject::try_new_impl(metadata, value)?)
}
VariantBasicType::Array => Variant::List(VariantList::try_new_impl(metadata, value)?),
Contributor Author

By directly invoking try_new_impl methods, we avoid recursive validation of array elements and object fields. Unfortunately this requires making them pub(crate), when I would have preferred to keep them private.

Contributor

It makes sense to me and I don't really have any better solution

/// A parsed version of the variant array value header byte.
#[derive(Clone, Debug, PartialEq)]
pub(crate) struct VariantListHeader {
num_elements_size: OffsetSizeBytes,
Contributor Author

It turns out is_large isn't nearly as useful as the actual offset unpacker. That, combined with the first_offset_byte helper below, reduces the size of the corresponding variant struct.

.ok_or_else(|| overflow_error("offset of variant object field offsets"))?;

- let values_start_byte = num_elements
+ let first_value_byte = num_elements
Contributor Author

renamed to match similar patterns elsewhere

Comment on lines +251 to +252
let value_bytes = slice_from_slice(self.value, self.first_value_byte..)?;
let value_bytes = slice_from_slice(value_bytes, self.get_offset(i)?..)?;
Contributor Author

again, don't let the method access bytes from an obviously wrong part of the underlying buffer
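A rough sketch of why slicing in two steps helps. `tail` is a hypothetical helper written in the spirit of `slice_from_slice` (whose real signature is not shown in this thread), and errors are plain `String`s. Because the second slice is taken from the value region rather than from the whole buffer, an offset can never reach bytes before `first_value_byte`, and no addition of the two offsets is needed.

```rust
use std::ops::RangeFrom;

/// Hypothetical bounded-slice helper: returns an error instead of panicking
/// when the range starts past the end of `bytes`.
pub fn tail(bytes: &[u8], range: RangeFrom<usize>) -> Result<&[u8], String> {
    let start = range.start;
    bytes
        .get(range)
        .ok_or_else(|| format!("start {} exceeds length {}", start, bytes.len()))
}

/// Locates the bytes of one value: first narrow to the value region, then
/// apply the per-value offset relative to that region.
pub fn value_bytes(
    value: &[u8],
    first_value_byte: usize,
    offset: usize,
) -> Result<&[u8], String> {
    let values = tail(value, first_value_byte..)?;
    tail(values, offset..)
}
```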

/// This should never happen since the constructor validates all data upfront.
pub fn field_name(&self, i: usize) -> Option<&'m str> {
Some(
(i < self.len()).then(|| {
Contributor Author

Another missing bounds check

Comment thread parquet-variant/src/variant/list.rs Outdated
Comment on lines +255 to +256
self.iter_try()
.map(|result| result.expect("Invalid variant list entry"))
Contributor Author

We don't try to validate because that would trigger the recursion the caller was trying to avoid.

- self.try_field(i)
- .expect("validation error after construction"),
- )
+ (i < self.len()).then(|| self.try_field_impl(i).expect("Invalid object field value"))
Contributor Author

missing bounds check!

(the original code was incapable of returning None)
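The bounds-check pattern can be sketched in isolation. `Names` is a hypothetical type, not the PR's `VariantObject`: its fallible lookup can fail either because the index is out of range or because the entry is corrupt, and the infallible accessor uses `bool::then` so that only the first case becomes `None`.

```rust
use std::str::from_utf8;

/// Hypothetical container of raw, possibly invalid, UTF-8 entries.
pub struct Names<'m> {
    pub entries: Vec<&'m [u8]>,
}

impl<'m> Names<'m> {
    /// Fallible lookup: reports both out-of-bounds indexes and corrupt entries.
    pub fn try_name(&self, i: usize) -> Result<&'m str, String> {
        let raw = self
            .entries
            .get(i)
            .copied()
            .ok_or_else(|| format!("index {} out of bounds", i))?;
        from_utf8(raw).map_err(|e| e.to_string())
    }

    /// Infallible lookup: `None` means "no such index", while corrupt data
    /// panics. Without the explicit bounds check, an out-of-range index would
    /// hit the `expect` and panic instead of returning `None`.
    pub fn name(&self, i: usize) -> Option<&'m str> {
        (i < self.entries.len()).then(|| self.try_name(i).expect("Invalid entry"))
    }
}
```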

@scovich scovich marked this pull request as ready for review June 30, 2025 23:40
@scovich scovich requested review from alamb and friendlymatthew June 30, 2025 23:40
Contributor

@alamb alamb left a comment

Thank you @scovich -- this looks (really) great to me 🏆

The one thing I was thinking about was should we start making tests to verify checking of invalid variants 🤔

I do think trying to cook up examples of all possible malformed variants is probably not a great use of time, but fuzzing might be

Maybe we could create a fuzz tester that created a random variant using the builder, and then randomly changed a few bytes around. -- Then if validation succeeded we would walk over the entire variant and verify that the infallible APIs didn't panic

I am not sure how important this is
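A very rough sketch of that fuzzing loop, under heavy assumptions: it uses a made-up toy format (a count byte followed by ASCII bytes) instead of the variant builder, a hand-rolled xorshift generator instead of a fuzzing framework such as AFL, and hypothetical function names throughout. It only shows the shape of the idea: mutate, validate, and walk the infallible accessors whenever validation succeeds.

```rust
/// Toy validator (not the variant encoding): byte 0 is a count `n`,
/// followed by at least `n` bytes that must all be ASCII.
pub fn validate(bytes: &[u8]) -> Result<(), String> {
    let n = *bytes.first().ok_or("empty buffer")? as usize;
    let body = bytes.get(1..1 + n).ok_or("declared length out of bounds")?;
    if body.iter().all(|b| b.is_ascii()) {
        Ok(())
    } else {
        Err("non-ASCII element".to_string())
    }
}

/// Infallible accessor; may panic unless `validate` succeeded first.
pub fn element(bytes: &[u8], i: usize) -> char {
    char::from(bytes[1 + i])
}

/// Minimal xorshift64 generator so the sketch needs no external crates.
/// The seed must be non-zero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Runs `rounds` mutations and returns how many mutated buffers still validated.
pub fn fuzz(seed: u64, rounds: usize) -> usize {
    let mut rng = XorShift(seed);
    let mut survivors = 0;
    for _ in 0..rounds {
        // Start from a known-good buffer (a real fuzzer would use the builder).
        let mut buf = vec![4u8, b'a', b'b', b'c', b'd'];
        // Corrupt two randomly chosen bytes.
        for _ in 0..2 {
            let pos = (rng.next_u64() % buf.len() as u64) as usize;
            buf[pos] = rng.next_u64() as u8;
        }
        if validate(&buf).is_ok() {
            survivors += 1;
            // Walk every element; a panic here would mean validation missed something.
            for i in 0..buf[0] as usize {
                let _ = element(&buf, i);
            }
        }
    }
    survivors
}
```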

let buf_one = [0x01u8, 0xAB, 0xCD];
assert_eq!(
- OffsetSizeBytes::One.unpack_usize(&buf_one, 0, 0).unwrap(),
+ OffsetSizeBytes::One.unpack_usize(&buf_one, 0).unwrap(),
Contributor

I think these tests are much more readable now

/// }
/// ```
///
/// # Validation
Contributor

👨‍🍳 👌 -- very nice

/// Every instance of variant list is either _valid_ or _invalid_. depending on whether the
/// underlying bytes are a valid encoding of a variant array (see below).
///
/// Instances produced by [`Self::try_new`] or [`Self::validate`] are fully _validated_. They always
Contributor

This is somewhat duplicative of the comments in Variant and we could refer people back to those docs for a definition of valid vs invalid 🤔 However I think repeating the content is also fine

// Iterate over all values of this array in order to validate the field_offset array and
// prove that the field values are all in bounds. Otherwise, `iter` might panic on `unwrap`.
validate_fallible_iterator(new_self.iter_checked())?;
// Validate just the first and last offset, ignoring the other offsets and all value bytes.
Contributor

I think validating the first and last offsets seems very reasonable to me

field_ids_start_byte: usize,
field_offsets_start_byte: usize,
values_start_byte: usize,
first_field_offset_byte: usize,
Contributor

these names are much easier to understand

alamb commented Jul 1, 2025

@friendlymatthew let me know if you would like time to review this PR as well

cc @PinkCrow007 @mkarbo @superserious-dev @Weijun-H

@friendlymatthew

> Maybe we could create a fuzz tester that created a random variant using the builder, and then randomly changed a few bytes around. -- Then if validation succeeded we would walk over the entire variant and verify that the infallible APIs didn't panic

I think fuzzing would help a lot, especially with AFL

scovich commented Jul 1, 2025

> • The Index trait requires its implementation to return references. This works ok for VariantMetadata, which returns &'m str, but VariantList and VariantObject need to return wrapper objects by value and so cannot impl Index. Instead, their infallible get type methods return Option instead of Result, which isn't really an improvement to user experience.

@alamb -- what are your thoughts here? I'm not convinced a get that can panic and yet still returns Option is ever better than a try_get that cannot panic and returns Result? Either way you have to handle the "error" case, so there's no usability or readability improvement to offset the panic risk?

alamb commented Jul 1, 2025

> @alamb -- what are your thoughts here? I'm not convinced a get that can panic and yet still returns Option is ever better than a try_get that cannot panic and returns Result? Either way you have to handle the "error" case, so there's no usability or readability improvement to offset the panic risk?

I personally think Option is slightly easier to reason about (and is more performant as it doesn't allocate a String for the error).

I also think there is a subtle difference between the two APIs:

  1. A get() that returns None (the key wasn't found) is a condition that you expect will happen with valid input and normal operation
  2. A try_get() that returns an Err is something you expect won't happen in normal operation, and thus having all call sites have to deal with the "error that should not happen" case is wonky

I suggest personally we change try_get() to return Result<Option<..>> to reflect that the None is expected in normal operation while the error is not
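One way the suggested `Result<Option<..>>` shape could look, as a sketch only. `Object` and its field layout are hypothetical, and errors are plain `String`s: `Ok(None)` covers the expected "key not found" case, while `Err` is reserved for malformed data.

```rust
use std::str::from_utf8;

/// Hypothetical object: field names paired with raw value bytes that are
/// supposed to be UTF-8 but have not been validated.
pub struct Object<'v> {
    pub fields: Vec<(&'v str, &'v [u8])>,
}

impl<'v> Object<'v> {
    /// `Ok(None)`: the key is absent, which is normal operation.
    /// `Err(_)`: the key exists but its bytes are malformed.
    pub fn try_get(&self, key: &str) -> Result<Option<&'v str>, String> {
        match self.fields.iter().find(|(name, _)| *name == key) {
            None => Ok(None),
            Some(&(_, raw)) => from_utf8(raw).map(Some).map_err(|e| e.to_string()),
        }
    }

    /// Infallible counterpart: still returns `None` for a missing key, but
    /// panics on malformed data, so it is meant for validated instances.
    pub fn get(&self, key: &str) -> Option<&'v str> {
        self.try_get(key).expect("Invalid object field value")
    }
}
```

With this shape, callers of `get` only deal with the missing-key case, and callers of `try_get` can use `?` for the error and then match on the `Option`.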

alamb commented Jul 1, 2025

> Maybe we could create a fuzz tester that created a random variant using the builder, and then randomly changed a few bytes around. -- Then if validation succeeded we would walk over the entire variant and verify that the infallible APIs didn't panic
>
> I think fuzzing would help a lot, especially with AFL

Thanks @friendlymatthew -- I filed a ticket to track the idea

alamb commented Jul 1, 2025

I think this PR is a significant step forward -- thank you @scovich 🙏

While we may want to iterate on the APIs some more let's merge this one in as is so we can keep pushing forward and keep the conflicts to a minimum

@alamb alamb merged commit 248ee73 into apache:main Jul 1, 2025
13 checks passed