The current dictionary encoder implementation (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance overhead, which prevents it from being useful in practice:
- There are repeated conversions between Java objects and bytes (e.g. vector.getObject(i)).
- There is an unnecessary memory copy (the vector data must be copied to the hash table).
- The hash table cannot be reused for encoding multiple vectors (other data structures and results cannot be reused either).
- The output vector should not be created or managed by the encoder (just like in the out-of-place sorter).
- The hash table requires that the hashCode and equals methods be implemented appropriately, but this is not guaranteed.
We plan to implement a new encoder in the algorithm module and gradually deprecate the current one.
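To make the points above concrete, here is a minimal sketch of the kind of design they suggest. It is not the Arrow API and not the planned implementation: it uses plain byte arrays in place of Arrow vectors, assumes fixed-width values, and the class name ByteDictionaryEncoder and all of its methods are hypothetical. It hashes and compares raw bytes in place, stores only dictionary indices in the hash table, can be reused across inputs, and writes into an output array owned by the caller.

```java
import java.util.Arrays;

/** Hypothetical sketch over plain byte arrays; not part of Arrow. */
final class ByteDictionaryEncoder {
    private final byte[] dict;   // dictionary values, fixed width, stored back to back
    private final int width;     // bytes per value
    private final int[] slots;   // open-addressing table of dictionary indices, -1 = empty
    private final int mask;

    ByteDictionaryEncoder(byte[] dict, int width) {
        this.dict = dict;
        this.width = width;
        int entries = dict.length / width;
        int capacity = 2;
        while (capacity < entries * 2) {
            capacity <<= 1;      // power of two, load factor at most 0.5
        }
        this.slots = new int[capacity];
        this.mask = capacity - 1;
        Arrays.fill(slots, -1);
        for (int i = 0; i < entries; i++) {
            insert(i);
        }
    }

    /** Hashes the raw bytes directly, so no hashCode() of a value object is needed. */
    private int hash(byte[] buf, int offset) {
        int h = 1;
        for (int k = 0; k < width; k++) {
            h = 31 * h + buf[offset + k];
        }
        return (h ^ (h >>> 16)) & mask;
    }

    /** Compares raw bytes directly, so no equals() of a value object is needed. */
    private boolean sameBytes(int dictIndex, byte[] buf, int offset) {
        int base = dictIndex * width;
        for (int k = 0; k < width; k++) {
            if (dict[base + k] != buf[offset + k]) {
                return false;
            }
        }
        return true;
    }

    private void insert(int dictIndex) {
        int slot = hash(dict, dictIndex * width);
        while (slots[slot] != -1) {
            if (sameBytes(slots[slot], dict, dictIndex * width)) {
                return;          // duplicate dictionary value: keep the first index
            }
            slot = (slot + 1) & mask;
        }
        slots[slot] = dictIndex; // only the index is stored; the bytes are not copied
    }

    /** Returns the dictionary index of the value at the given offset, or -1 if absent. */
    int find(byte[] buf, int offset) {
        int slot = hash(buf, offset);
        while (slots[slot] != -1) {
            if (sameBytes(slots[slot], buf, offset)) {
                return slots[slot];
            }
            slot = (slot + 1) & mask;
        }
        return -1;
    }

    /** Encodes count values from input into an output array supplied by the caller. */
    void encode(byte[] input, int count, int[] output) {
        for (int i = 0; i < count; i++) {
            int index = find(input, i * width);
            if (index < 0) {
                throw new IllegalArgumentException("value at position " + i + " is not in the dictionary");
            }
            output[i] = index;
        }
    }
}
```

Because the table is built once in the constructor, the same encoder instance can encode any number of inputs, and the caller decides how the output is allocated and released.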
Reporter: Liya Fan / @liyafan82
Assignee: Liya Fan / @liyafan82
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-5917. Please see the migration documentation for further details.