Performance of creating all null dictionary array can be improved #9321

@albertlockett

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We have a RecordBatch in our application with the form shown below, where values can take on many types. A type column identifies which column to read each value from, and there is one values column per type. The values columns are dictionary encoded.

type | str_val | int_val | ...
-----|---------|---------| ...
str  |   "a"   |   null  | ...
str  |   "b"   |   null  | ...
int  |   null  |   1     | ...

I was writing some code to construct the values columns from typed segments, by creating either all-null segments, or segments containing values, and concatenating them together. Something like:

let str_val_col = concat(&[
  &DictionaryArray::new(str_val_keys, values.clone()),
  &DictionaryArray::new(UInt8Array::new_null(non_str_len), values.clone()),
  // ...
])

In my profiling, I noticed that DictionaryArray::new was slower than I expected because it was validating all the keys.
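
The segment layout described above can be modeled without any Arrow types. The following is a minimal sketch, not the application's code: `Segment` and `concat_keys` are hypothetical names, and a plain `Vec<Option<u8>>` stands in for the key array, with `None` as a null key.

```rust
/// One contiguous run of rows in a dictionary-encoded values column.
/// Hypothetical helper type, for illustration only.
enum Segment {
    /// Rows whose `type` selects a different column: every key is null.
    Nulls(usize),
    /// Rows whose `type` selects this column: one dictionary key per row.
    Keys(Vec<u8>),
}

/// Concatenate segments into a single key column (`None` = null key).
fn concat_keys(segments: &[Segment]) -> Vec<Option<u8>> {
    // Pre-compute the total row count so the output is allocated once.
    let total: usize = segments
        .iter()
        .map(|s| match s {
            Segment::Nulls(n) => *n,
            Segment::Keys(k) => k.len(),
        })
        .sum();

    let mut out = Vec::with_capacity(total);
    for segment in segments {
        match segment {
            Segment::Nulls(n) => out.extend(std::iter::repeat(None).take(*n)),
            Segment::Keys(keys) => out.extend(keys.iter().copied().map(Some)),
        }
    }
    out
}
```

For the `str_val` column in the table above, the segments would be `Keys(vec![0, 1])` followed by `Nulls(1)`, assuming "a" and "b" are dictionary entries 0 and 1.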

Describe the solution you'd like

In the case where the dictionary keys are all null, I think we can maybe skip this validation here?
https://github.com/apache/arrow-rs/blob/main/arrow-array/src/array/dictionary_array.rs#L289-L314
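
To make the idea concrete, here is a rough sketch of the proposed fast path on a simplified model. It is not the arrow-rs implementation: `Keys`, `KeyError`, and `validate_keys` are hypothetical names, and the sketch assumes the null count is already known without scanning the keys.

```rust
/// Error for a non-null key that does not index into the dictionary values.
#[derive(Debug, PartialEq)]
enum KeyError {
    OutOfBounds { index: usize, key: u8, dict_len: usize },
}

/// Simplified stand-in for a nullable key array (not the arrow-rs type).
struct Keys {
    values: Vec<u8>,
    validity: Vec<bool>, // true = the key at this position is non-null
    null_count: usize,   // cached, so checking it does not scan the keys
}

impl Keys {
    fn new(keys: Vec<Option<u8>>) -> Self {
        let null_count = keys.iter().filter(|k| k.is_none()).count();
        let validity = keys.iter().map(|k| k.is_some()).collect();
        let values = keys.iter().map(|k| k.unwrap_or(0)).collect();
        Keys { values, validity, null_count }
    }

    fn new_null(len: usize) -> Self {
        Keys { values: vec![0; len], validity: vec![false; len], null_count: len }
    }

    fn len(&self) -> usize {
        self.values.len()
    }
}

/// Check that every non-null key is a valid index into a dictionary
/// holding `dict_len` values.
fn validate_keys(keys: &Keys, dict_len: usize) -> Result<(), KeyError> {
    // Proposed fast path: if every key is null, no key can be out of
    // bounds, so the per-key scan can be skipped entirely.
    if keys.null_count == keys.len() {
        return Ok(());
    }
    for (index, (&key, &valid)) in keys.values.iter().zip(&keys.validity).enumerate() {
        if valid && usize::from(key) >= dict_len {
            return Err(KeyError::OutOfBounds { index, key, dict_len });
        }
    }
    Ok(())
}
```

With this check in place, validating an all-null segment costs a single comparison regardless of its length, while mixed or fully valid key arrays are validated as before.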

Describe alternatives you've considered

I could use new_unchecked, but this has a few downsides:

  • some codebases are wary of unsafe code
  • if force_validate is enabled, the keys are still validated

concat(&[
  // SAFETY: every key is null, so no key can be out of bounds for `values`
  #[allow(unsafe_code)]
  &unsafe { DictionaryArray::new_unchecked(UInt8Array::new_null(len), values.clone()) },
  &DictionaryArray::new(non_null_keys, values.clone()),
  // ...
])
