Support CSV files encoded with charsets other than UTF-8 #9465

@Rafferty97

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It seems that CSV files encoded with non-UTF-8 charsets, such as Windows-1252, are annoyingly common in the wild. It would be useful to be able to consume them directly via an additional configuration option.

Describe the solution you'd like
Add a configuration option to the CSV reader to specify a character encoding, defaulting to UTF-8. The implementation can make use of encoding_rs, and could be feature-gated so as not to affect users who don't need this functionality.
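
To make the idea concrete, here is a rough sketch of what such an option could look like. It is not a proposed API: `CsvEncoding` and `DecodingReader` are hypothetical names, and the only non-UTF-8 variant is Latin-1 (ISO-8859-1) so that the example needs nothing beyond the standard library. A real implementation would presumably delegate to encoding_rs, which also covers Windows-1252 and multi-byte encodings.

```rust
use std::io::{self, Read};

/// Hypothetical option; not part of the arrow crate's API.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
enum CsvEncoding {
    #[default]
    Utf8,
    Latin1,
}

/// Hypothetical adapter: wraps a byte source and yields UTF-8,
/// transcoding one chunk at a time instead of the whole file.
struct DecodingReader<R> {
    inner: R,
    encoding: CsvEncoding,
    buf: Vec<u8>, // transcoded bytes not yet handed to the caller
    pos: usize,   // read position inside `buf`
}

impl<R: Read> DecodingReader<R> {
    fn new(inner: R, encoding: CsvEncoding) -> Self {
        Self { inner, encoding, buf: Vec::new(), pos: 0 }
    }
}

impl<R: Read> Read for DecodingReader<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if self.encoding == CsvEncoding::Utf8 {
            // Default case: pass the bytes through untouched.
            return self.inner.read(out);
        }
        if self.pos == self.buf.len() {
            // Buffer drained: pull the next raw chunk and transcode it.
            let mut raw = [0u8; 4096];
            let n = self.inner.read(&mut raw)?;
            if n == 0 {
                return Ok(0); // end of input
            }
            self.buf.clear();
            self.pos = 0;
            let mut utf8 = [0u8; 4];
            for &b in &raw[..n] {
                // Latin-1 bytes are numerically equal to their code points.
                self.buf
                    .extend_from_slice(char::from(b).encode_utf8(&mut utf8).as_bytes());
            }
        }
        let n = out.len().min(self.buf.len() - self.pos);
        out[..n].copy_from_slice(&self.buf[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

Because the adapter itself implements `std::io::Read`, it could sit between the file and any CSV reader that accepts a `Read`, and the default `CsvEncoding::Utf8` leaves existing behaviour unchanged.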

Describe alternatives you've considered
The only alternative I can think of is to decode the entire CSV file up front before reading it via Apache Arrow, but this is suboptimal for a lot of use cases.
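
For reference, the up-front workaround might look roughly like the following. This is only an illustration: `decode_windows_1252` is a hypothetical helper, and the lookup table is written out by hand (following the WHATWG mapping, where the five unassigned bytes map to the matching C1 control characters) to keep the example free of dependencies. In practice encoding_rs would do this job.

```rust
/// Code points for bytes 0x80..=0x9F in Windows-1252 (WHATWG mapping).
/// All other bytes are numerically equal to their Unicode code points.
const CP1252_HIGH: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

/// Hypothetical helper: decodes a whole Windows-1252 buffer into a UTF-8 `String`.
fn decode_windows_1252(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            0x80..=0x9F => CP1252_HIGH[(b - 0x80) as usize],
            _ => char::from(b),
        })
        .collect()
}
```

The decoded string can then be wrapped in a `std::io::Cursor` and passed to the CSV reader, but the raw bytes and the decoded copy both have to be in memory at once, which is the main reason this does not scale to large files.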

Additional context
I originally opened a similar issue in the DataFusion project, but after further reflection, I figured it might be better implemented in Arrow itself.

Metadata

Labels: arrow (changes to the arrow crate), enhancement (any new improvement worthy of an entry in the changelog)
