Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We have a RecordBatch in our application of the following form, where values can take on many types. We have a type column that identifies which column to read the values from, and a values column for each type. The values columns are dictionary encoded.
type | str_val | int_val | ...
-----|---------|---------| ...
str | "a" | null | ...
str | "b" | null | ...
int | null | 1 | ...
I was writing some code to construct the values columns from typed segments, by creating either all-null segments, or segments containing values, and concatenating them together. Something like:
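let str_val_col = concat(&[
    // segment that holds actual string values
    &DictionaryArray::new(str_val_keys, values.clone()),
    // all-null segment covering the rows of other types
    &DictionaryArray::new(UInt8Array::new_null(non_str_len), values.clone()),
    // ...
])?;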
In my profiling, I noticed that DictionaryArray::new was slower than I expected because it was validating all the keys.
Describe the solution you'd like
In the case where the dictionary keys are all null, I think we can maybe skip this validation here?
https://github.com/apache/arrow-rs/blob/main/arrow-array/src/array/dictionary_array.rs#L289-L314
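To make the idea concrete, here is a rough, standard-library-only sketch of the check I have in mind. It is not the arrow-rs code: `Keys` and `validate_keys` are hypothetical stand-ins for the key array and the validation step, and the sketch assumes the null count is already cached so that the all-null test is O(1).

```rust
/// Simplified stand-in for a dictionary key array (hypothetical, not the arrow-rs type).
struct Keys {
    /// Raw key slots; the slot value is meaningless where the key is null.
    raw: Vec<u8>,
    /// `validity[i] == false` marks slot `i` as null.
    validity: Vec<bool>,
    /// Cached number of nulls, so the all-null check does not scan the keys.
    null_count: usize,
}

impl Keys {
    fn new(raw: Vec<u8>, validity: Vec<bool>) -> Self {
        assert_eq!(raw.len(), validity.len());
        let null_count = validity.iter().filter(|v| !**v).count();
        Self { raw, validity, null_count }
    }

    /// Builds `len` null keys, mirroring the all-null segments described above.
    fn new_null(len: usize) -> Self {
        Self { raw: vec![0; len], validity: vec![false; len], null_count: len }
    }

    fn len(&self) -> usize {
        self.raw.len()
    }
}

/// Checks that every non-null key indexes into a dictionary with `values_len` entries.
fn validate_keys(keys: &Keys, values_len: usize) -> Result<(), String> {
    // Proposed fast path: with no non-null key, nothing can be out of bounds.
    if keys.null_count == keys.len() {
        return Ok(());
    }
    // Existing behaviour in this model: scan every key.
    for (i, (&key, &valid)) in keys.raw.iter().zip(&keys.validity).enumerate() {
        if valid && key as usize >= values_len {
            return Err(format!(
                "key {key} at index {i} is out of bounds for {values_len} values"
            ));
        }
    }
    Ok(())
}
```

With this shape, `validate_keys(&Keys::new_null(n), values_len)` returns without touching the keys, while arrays with at least one non-null key go through the same per-key scan as before.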
Describe alternatives you've considered
I could use new_unchecked, but this has a few downsides:
- some code bases have wariness about unsafe code
- if using the force_validate feature, we still validate the keys
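// SAFETY: every key is null, so no key can index out of bounds of `values`
#[allow(unsafe_code)]
let null_segment =
    unsafe { DictionaryArray::new_unchecked(UInt8Array::new_null(len), values.clone()) };
concat(&[
    &null_segment,
    &DictionaryArray::new(non_null_keys, values.clone()),
    // ...
])?;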
Additional context