[VARIANT] Support both fallible and infallible access to variants#7807
[VARIANT] Support both fallible and infallible access to variants#7807alamb merged 3 commits intoapache:mainfrom
Conversation
|
Attn @alamb @friendlymatthew -- this is an early pathfinding that only changes |
| /// A _validated_ instance guarantees that: | ||
| /// | ||
| /// - header byte is valid | ||
| /// - dictionary size is in bounds | ||
| /// - offset array content is in-bounds | ||
| /// - first offset is zero | ||
| /// - last offset is in-bounds | ||
| /// - all other offsets are in-bounds (*) | ||
| /// - all offsets are monotonically increasing (*) | ||
| /// - all values are valid utf-8 (*) |
There was a problem hiding this comment.
Hi, it looks good to me. It makes sense we're reserving validation that requires a linear scan across offsets/values for try_new
There was a problem hiding this comment.
I agree -- the API sketched out in this PR looks great to me -- ergonomic to use as well as offering optimized path access when needed
| // Fields | ||
| assert_eq!(md.get(0).unwrap(), "cat"); | ||
| assert_eq!(md.get(1).unwrap(), "dog"); | ||
| assert_eq!(&md[0], "cat"); |
There was a problem hiding this comment.
Unfortunately, we can only do this for VariantMetadata... VariantObject and VariantList need to return a value, not a reference, which Index trait doesn't allow.
| } | ||
| } | ||
|
|
||
| /// Retrieves the ith dictionary entry, panicking if the index is out of bounds. Accessing |
| /// A _validated_ instance guarantees that: | ||
| /// | ||
| /// - header byte is valid | ||
| /// - dictionary size is in bounds | ||
| /// - offset array content is in-bounds | ||
| /// - first offset is zero | ||
| /// - last offset is in-bounds | ||
| /// - all other offsets are in-bounds (*) | ||
| /// - all offsets are monotonically increasing (*) | ||
| /// - all values are valid utf-8 (*) |
There was a problem hiding this comment.
I agree -- the API sketched out in this PR looks great to me -- ergonomic to use as well as offering optimized path access when needed
scovich
left a comment
There was a problem hiding this comment.
Self-review (found a few issues that will be addressed shortly)
| pub(crate) fn unpack_usize(&self, bytes: &[u8], index: usize) -> Result<usize, ArrowError> { | ||
| self.unpack_usize_at_offset(bytes, 0, index) |
There was a problem hiding this comment.
It turns out that a bunch of slice accesses were not correctly bounded; once I fixed that, most unpack_usize calls were passing offset 0. So I renamed the original method as unpack_usize_at_offset, and created a new 2-arg unpack_usize that doesn't take an offset.
| let offset = offset_index | ||
| .checked_mul(*self as usize) | ||
| .and_then(|n| n.checked_add(byte_offset)) |
There was a problem hiding this comment.
This was unchecked arithmetic that risked overflow panic on 32-bit arch.
| result | ||
| .try_into() | ||
| .map_err(|e: TryFromIntError| ArrowError::InvalidArgumentError(e.to_string())) |
There was a problem hiding this comment.
Factored out the u32 -> usize conversion to reduce redundancy.
| VariantBasicType::Object => { | ||
| Variant::Object(VariantObject::try_new_impl(metadata, value)?) | ||
| } | ||
| VariantBasicType::Array => Variant::List(VariantList::try_new_impl(metadata, value)?), |
There was a problem hiding this comment.
By directly invoking try_new_impl methods, we avoid recursive validation of array elements and object fields. Unfortunately this requires making them pub(crate), when I would have preferred to keep them private.
There was a problem hiding this comment.
It makes sense to me and I don't really have any better solution
| /// A parsed version of the variant array value header byte. | ||
| #[derive(Clone, Debug, PartialEq)] | ||
| pub(crate) struct VariantListHeader { | ||
| num_elements_size: OffsetSizeBytes, |
There was a problem hiding this comment.
It turns out is_large isn't nearly as useful as the actual offset unpacker. That, combined with the first_offset_byte helper below, reduces the size of the corresponding variant struct.
| .ok_or_else(|| overflow_error("offset of variant object field offsets"))?; | ||
|
|
||
| let values_start_byte = num_elements | ||
| let first_value_byte = num_elements |
There was a problem hiding this comment.
renamed to match similar patterns elsewhere
| let value_bytes = slice_from_slice(self.value, self.first_value_byte..)?; | ||
| let value_bytes = slice_from_slice(value_bytes, self.get_offset(i)?..)?; |
There was a problem hiding this comment.
again, don't let the method access bytes from an obviously wrong part of the underlying buffer
| /// This should never happen since the constructor validates all data upfront. | ||
| pub fn field_name(&self, i: usize) -> Option<&'m str> { | ||
| Some( | ||
| (i < self.len()).then(|| { |
There was a problem hiding this comment.
Another missing bounds check
| self.iter_try() | ||
| .map(|result| result.expect("Invalid variant list entry")) |
There was a problem hiding this comment.
We don't try to validate because that would trigger the recursion the caller was trying to avoid.
| self.try_field(i) | ||
| .expect("validation error after construction"), | ||
| ) | ||
| (i < self.len()).then(|| self.try_field_impl(i).expect("Invalid object field value")) |
There was a problem hiding this comment.
missing bounds check!
(the original code was incapable of returning None)
alamb
left a comment
There was a problem hiding this comment.
Thank you @scovich -- this looks (really) great to me 🏆
The one thing I was thinking about was should we start making tests to verify checking of invalid variants 🤔
I do think trying to cook up examples of all possible malformed variants is probably not a great use of time, but fuzzing might be
Maybe we could create a fuzz tester that created a random variant using the builder, and then randomly changed a few bytes around. -- Then if validation succeeded we would walk over the entire variant and verify that the infallable APIs didn't panic
I am not sure how important this is
| let buf_one = [0x01u8, 0xAB, 0xCD]; | ||
| assert_eq!( | ||
| OffsetSizeBytes::One.unpack_usize(&buf_one, 0, 0).unwrap(), | ||
| OffsetSizeBytes::One.unpack_usize(&buf_one, 0).unwrap(), |
There was a problem hiding this comment.
I think these tests are much more readable now
| /// } | ||
| /// ``` | ||
| /// | ||
| /// # Validation |
| VariantBasicType::Object => { | ||
| Variant::Object(VariantObject::try_new_impl(metadata, value)?) | ||
| } | ||
| VariantBasicType::Array => Variant::List(VariantList::try_new_impl(metadata, value)?), |
There was a problem hiding this comment.
It makes sense to me and I don't really have any better solution
| /// Every instance of variant list is either _valid_ or _invalid_. depending on whether the | ||
| /// underlying bytes are a valid encoding of a variant array (see below). | ||
| /// | ||
| /// Instances produced by [`Self::try_new`] or [`Self::validate`] are fully _validated_. They always |
There was a problem hiding this comment.
This is somewhat duplicative of the comments in Variant and we could refer people back to those docs for a definition of valid vs invalid 🤔 However I think repeating the content is also fine
| // Iterate over all values of this array in order to validate the field_offset array and | ||
| // prove that the field values are all in bounds. Otherwise, `iter` might panic on `unwrap`. | ||
| validate_fallible_iterator(new_self.iter_checked())?; | ||
| // Validate just the first and last offset, ignoring the other offsets and all value bytes. |
There was a problem hiding this comment.
I think validating the first and last offsets seems very reasonable to me
| field_ids_start_byte: usize, | ||
| field_offsets_start_byte: usize, | ||
| values_start_byte: usize, | ||
| first_field_offset_byte: usize, |
There was a problem hiding this comment.
these names are much easier to understand
|
@friendlymatthew let me know if you would like time to review this PR as well |
I think fuzzing would help a lot, especially with AFL |
@alamb -- what are your thoughts here? I'm not convinced a |
I personally think Option is slightly easier to reason about (and is more performance as it doesn't allocate a String for the error). I also think there is a subtle difference between the two APIs:
I suggest personally we change |
Thanks @friendlymatthew -- I filed a ticke to track the idea |
|
I think this PR is a significant step forward -- thank you @scovich 🙏 While we may want to iterate on the APIs some more let's merge this one in as is so we can keep pushing forward and keep the conflicts to a minimum |
Which issue does this PR close?
Rationale for this change
Full validation is nice, but expensive when not needed.
What changes are included in this PR?
Allow both validated+infallible and unvalidated+fallible access combinations. This generally means splitting out "shallow" (constant-time) validations to a
try_xxx_implmethod, along with avalidatemethod that performs complete (recursive) validation. The correspondingtry_xxxmethod then callsvalidateon the result oftry_xxx_impl, whilexxxmethod just unwraps the result.Some annoying shortcomings that I don't think are possible to avoid:
Indextrait requires its implementation to return references. This works ok forVariantMetadata, which returns&'m str, butVariantListandVariantObjectneed to return wrapper objects by value and so cannotimpl Index. Instead, their infalliblegettype methods returnOptioninstead ofResult, which isn't really an improvement to user experience.Are these changes tested?
TODO (help would be appreciated, this has turned into a much larger effort than I guessed)
We typically require tests for all PRs in order to:
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
Are there any user-facing changes?
New
try_xxxmethods to pair with existingxxxmethods, e.g.try_newandnew.