[Java] Redesign the dictionary encoder

The current dictionary encoder implementation (`org.apache.arrow.vector.dictionary.DictionaryEncoder`) has heavy performance overhead, which prevents it from being useful in practice:
1. It repeatedly converts between Java objects and bytes (e.g. `vector.getObject(i)`).
1. It copies memory unnecessarily (the vector data must be copied into the hash table).
1. The hash table cannot be reused for encoding multiple vectors (other data structures and results cannot be reused either).
1. The encoder creates and manages the output vector; it should not, just as the out-of-place sorter does not.
1. The hash table requires that the `hashCode` and `equals` methods be implemented appropriately, but this is not guaranteed.

We plan to implement a new encoder in the algorithm module and gradually deprecate the current one.
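
To make the five points above concrete, here is a minimal sketch of the intended shape, not the proposed implementation. It uses plain `int[]` arrays as stand-ins for Arrow vectors, and the class name `ReusableIntDictionaryEncoder` and its methods are hypothetical. It shows an encoder that compares primitive values directly (no boxing, no reliance on `hashCode`/`equals`), stores only dictionary indices in its hash table (no copy of the data), can be reused across inputs, and writes into an output supplied by the caller.

```java
/**
 * Hypothetical illustration only; this is not part of the Arrow API.
 * Plain int[] arrays stand in for Arrow vectors.
 */
final class ReusableIntDictionaryEncoder {
  private static final int EMPTY = -1;

  private final int[] dictionary; // referenced, not copied
  private final int[] slots;      // open-addressing table of dictionary indices
  private final int mask;

  ReusableIntDictionaryEncoder(int[] dictionary) {
    this.dictionary = dictionary;
    // Power-of-two capacity of at least twice the dictionary size, so the
    // table always has empty slots and linear probing terminates.
    int capacity = 2;
    while (capacity < dictionary.length * 2) {
      capacity <<= 1;
    }
    this.mask = capacity - 1;
    this.slots = new int[capacity];
    java.util.Arrays.fill(slots, EMPTY);
    for (int i = 0; i < dictionary.length; i++) {
      int slot = hash(dictionary[i]) & mask;
      // Probe until an empty slot or an equal value (first occurrence wins).
      while (slots[slot] != EMPTY && dictionary[slots[slot]] != dictionary[i]) {
        slot = (slot + 1) & mask;
      }
      if (slots[slot] == EMPTY) {
        slots[slot] = i;
      }
    }
  }

  private static int hash(int value) {
    int h = value * 0x9E3779B9;
    return h ^ (h >>> 16);
  }

  /** Returns the dictionary index of the value, or -1 if it is absent. */
  int indexOf(int value) {
    int slot = hash(value) & mask;
    while (slots[slot] != EMPTY) {
      if (dictionary[slots[slot]] == value) { // primitive comparison, no equals()
        return slots[slot];
      }
      slot = (slot + 1) & mask;
    }
    return -1;
  }

  /** Encodes input into the caller-owned output; the encoder allocates nothing. */
  void encode(int[] input, int[] output) {
    if (output.length < input.length) {
      throw new IllegalArgumentException("output is smaller than input");
    }
    for (int i = 0; i < input.length; i++) {
      int index = indexOf(input[i]);
      if (index < 0) {
        throw new IllegalArgumentException("value not in dictionary: " + input[i]);
      }
      output[i] = index;
    }
  }
}
```

With a dictionary of `{10, 20, 30}`, encoding the input `{20, 10, 30, 20}` fills the caller's output array with `{1, 0, 2, 1}`, and the same encoder instance can then encode further inputs without rebuilding the hash table.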

**Reporter**: [Liya Fan](https://issues.apache.org/jira/browse/ARROW-5917) / @liyafan82
**Assignee**: [Liya Fan](https://issues.apache.org/jira/browse/ARROW-5917) / @liyafan82
#### Related issues:
- [[Java] Provide hash table based dictionary encoder](https://github.com/apache/arrow/issues/22577) (relates to)
#### PRs and other links:
- [GitHub Pull Request #4994](https://github.com/apache/arrow/pull/4994)

<sub>**Note**: *This issue was originally created as [ARROW-5917](https://issues.apache.org/jira/browse/ARROW-5917). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>
