Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It seems that CSV files encoded with non-UTF-8 charsets, such as Windows-1252, are annoyingly common in the wild. It would be useful to be able to consume them directly via an additional configuration option.
Describe the solution you'd like
Add a configuration option to the CSV reader to specify a character encoding, defaulting to UTF-8. The implementation can make use of encoding_rs, and could be feature-gated so as not to affect users who don't need this functionality.
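To make the shape of the option concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: `CsvEncoding`, its variants, and `decode` are names I made up for illustration and are not part of the existing Arrow CSV API. It uses only the standard library, with a hand-written table for the Windows-1252 bytes 0x80 to 0x9F (unassigned bytes are mapped to the matching C1 control code points, following the WHATWG convention). A real implementation would presumably delegate to encoding_rs rather than carry its own tables.

```rust
use std::borrow::Cow;

/// Hypothetical reader option (not an existing Arrow type).
/// UTF-8 stays the default, so current users see no change.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CsvEncoding {
    #[default]
    Utf8,
    Windows1252,
}

/// Code points for Windows-1252 bytes 0x80..=0x9F, the only range where it
/// differs from ISO-8859-1. Unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
/// map to the C1 control with the same value.
const CP1252_HIGH: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

impl CsvEncoding {
    /// Decode raw bytes into UTF-8 text.
    /// UTF-8 input is validated and borrowed; Windows-1252 input is transcoded
    /// into a new String (every byte is valid, so that branch cannot fail).
    pub fn decode<'a>(self, bytes: &'a [u8]) -> Result<Cow<'a, str>, std::str::Utf8Error> {
        match self {
            CsvEncoding::Utf8 => std::str::from_utf8(bytes).map(Cow::Borrowed),
            CsvEncoding::Windows1252 => Ok(Cow::Owned(
                bytes
                    .iter()
                    .map(|&b| match b {
                        0x80..=0x9F => CP1252_HIGH[(b - 0x80) as usize],
                        // ASCII and 0xA0..=0xFF share their value with Unicode.
                        _ => b as char,
                    })
                    .collect(),
            )),
        }
    }
}
```

With something like this, a reader configured with `CsvEncoding::Windows1252` would transcode each chunk of input before parsing it, while the default `CsvEncoding::Utf8` path would only validate and borrow the bytes.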
Describe alternatives you've considered
The only alternative I can think of is to decode the entire CSV file up front before reading it via Apache Arrow, but this is suboptimal for a lot of use cases.
Additional context
I originally opened a similar issue in the DataFusion project, but after further reflection, figured it would probably be better implemented in Arrow itself.