Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We now have two different APIs for writing row groups in parallel, depending on encryption, and I would like to simplify the code to use just one.
The current example for writing row groups in parallel uses get_column_writers and does not support encryption.
@rok and @adamreeve added a new API based on ArrowRowGroupWriterFactory for encoding parquet columns and row groups in parallel, with encryption in
This API is also somewhat strange in that it makes users create an ArrowWriter only to immediately destructure it into a SerializedWriter / the underlying writer.
The reason we need to expose ArrowRowGroupWriterFactory is that ArrowRowGroupWriterFactory::create_column_writers has access to the appropriate encryption properties, whereas get_column_writers does not.
Describe the solution you'd like
I would like a single, easy-to-use API for writing in parallel that:
- Is the same with or without encryption
- Has clear examples
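
To make the first requirement concrete, here is a minimal std-only sketch of the shape such an API could take. It is a toy model, not the parquet crate: every `Toy*` type and `write_row_group_parallel` are hypothetical names invented for illustration, and the XOR step merely stands in for real encryption. The point it illustrates is that when the factory owns the (optional) encryption settings, the parallel writing code is identical in both modes.

```rust
use std::thread;

/// Hypothetical stand-in for encryption settings (NOT the parquet crate's type).
#[derive(Clone)]
struct ToyEncryption {
    key: u8,
}

/// Hypothetical factory: it owns the optional encryption settings,
/// so callers never branch on whether encryption is enabled.
struct ToyRowGroupWriterFactory {
    num_columns: usize,
    encryption: Option<ToyEncryption>,
}

/// Hypothetical per-column writer that buffers encoded bytes.
struct ToyColumnWriter {
    encryption: Option<ToyEncryption>,
    buf: Vec<u8>,
}

impl ToyRowGroupWriterFactory {
    fn new(num_columns: usize, encryption: Option<ToyEncryption>) -> Self {
        Self { num_columns, encryption }
    }

    /// Create one writer per column, each carrying the factory's settings.
    fn create_column_writers(&self) -> Vec<ToyColumnWriter> {
        (0..self.num_columns)
            .map(|_| ToyColumnWriter {
                encryption: self.encryption.clone(),
                buf: Vec::new(),
            })
            .collect()
    }
}

impl ToyColumnWriter {
    /// Append values, applying the toy XOR "encryption" when configured.
    fn write(&mut self, values: &[u8]) {
        match &self.encryption {
            Some(enc) => self.buf.extend(values.iter().map(|v| v ^ enc.key)),
            None => self.buf.extend_from_slice(values),
        }
    }

    /// Finish the column and hand back its encoded bytes.
    fn close(self) -> Vec<u8> {
        self.buf
    }
}

/// Encode one row group using one thread per column.
/// The same code path serves encrypted and unencrypted output.
fn write_row_group_parallel(
    factory: &ToyRowGroupWriterFactory,
    columns: Vec<Vec<u8>>,
) -> Vec<Vec<u8>> {
    let handles: Vec<_> = factory
        .create_column_writers()
        .into_iter()
        .zip(columns)
        .map(|(mut writer, data)| {
            thread::spawn(move || {
                writer.write(&data);
                writer.close()
            })
        })
        .collect();

    // Join in column order so the chunks come back in schema order.
    handles
        .into_iter()
        .map(|h| h.join().expect("column worker panicked"))
        .collect()
}

fn main() {
    let columns = vec![vec![1, 2, 3], vec![4, 5, 6]];
    let plain = ToyRowGroupWriterFactory::new(2, None);
    let encrypted = ToyRowGroupWriterFactory::new(2, Some(ToyEncryption { key: 0xFF }));

    // Identical call in both cases; only the factory's configuration differs.
    println!("{:?}", write_row_group_parallel(&plain, columns.clone()));
    println!("{:?}", write_row_group_parallel(&encrypted, columns));
}
```

In this sketch the caller chooses encryption once, when constructing the factory, and nothing downstream needs to know about it. A real design would also have to decide how row group indices and file-level metadata reach the column writers, which this toy model leaves out.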
Describe alternatives you've considered
I suggest:
- Make the constructors for ArrowRowGroupWriterFactory public
- Update the example to use ArrowRowGroupWriterFactory and its ArrowRowGroupWriterFactory::create_column_writers function
- Deprecate the existing get_column_writers function, directing people to ArrowRowGroupWriterFactory
- Deprecate ArrowWriter::into_serialized_writer, directing people to ArrowRowGroupWriterFactory
Additional context