Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are currently 3 places statistics can be written in Parquet files:
1. The ColumnChunk metadata
2. The data page headers
3. The ColumnIndex (part of the PageIndex)
The level of statistics is controlled by the `EnabledStatistics` enum:
* `EnabledStatistics::None`: No statistics
* `EnabledStatistics::Chunk`: Stores the statistics for each ColumnChunk (1 above)
* `EnabledStatistics::Page`: Stores the statistics for each ColumnChunk, AND the data page headers AND the ColumnIndex
Problem
`EnabledStatistics::Page` is wasteful because:
* See "`max_statistics_truncate_length` is ignored when writing statistics to data page headers" (#7579) for an example
* The copy in the data page header cannot even be accessed via the Rust parquet reader and I don't think it is widely used (it was effectively replaced by the PageIndex)
In fact, as @etseidl points out here: #7490 (comment)
> Makes me wonder if we should rethink `EnabledStatistics`. The Parquet spec actually recommends not writing page level statistics if the page indexes are written. Perhaps we could add something like `EnabledStatistics::ChunkAndIndex` to write chunk level and offset/column indexes but no statistics in the page header.
Specifically the documentation on PageIndex says:
> Readers that support ColumnIndex should not also use page statistics. The only reason to write page-level statistics when writing ColumnIndex structs is to support older readers (not recommended).
Describe the solution you'd like
I would like a way to avoid writing data page header statistics (as they are likely to not be useful to other systems and thus wasteful)
Describe alternatives you've considered
Option 1: Redefine `EnabledStatistics::Page`
I personally suggest the following change, which requires no code changes from users who have set `EnabledStatistics::Page` and makes their Parquet files smaller.
* Redefine `EnabledStatistics::Page` to mean: store statistics for the ColumnChunk and PageIndex (not the data page headers)
* Add a new option `WriterProperties::write_data_page_statistics` that explicitly also writes statistics to the data page headers. We would add a note saying the option is not recommended for the reasons listed above
Option 2: `EnabledStatistics::ChunkAndIndex`
@etseidl suggests adding another variant:
> Perhaps we could add something like `EnabledStatistics::ChunkAndIndex` to write chunk level and offset/column indexes but no statistics in the page header.
One challenge with this is that it would require all existing users to know to update their code to stop writing data page header statistics
Option 3: `EnabledStatistics` more specific
Another alternative is to make `EnabledStatistics` more specific, something like
* `EnabledStatistics::None`: No statistics
* `EnabledStatistics::Chunk`: Stores the statistics for each ColumnChunk (1 above)
* `EnabledStatistics::ColumnIndex`: Stores the statistics in the ColumnChunk and ColumnIndex
* `EnabledStatistics::ColumnIndexAndPage`: Stores the statistics in the data page headers **AND** the ColumnChunk and the ColumnIndex
This would be a breaking API change that would be somewhat annoying to downstream users, as they would have to change their code.
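To make the difference between the options concrete, here is a rough sketch that models which of the three locations receive statistics under each proposal. It is a standalone illustration, not the parquet crate's API: `StatLocations`, `Level`, `ProposedLevel`, `current_behavior`, `option_1` and `option_3` are all hypothetical names, and it assumes the proposed `write_data_page_statistics` flag would only take effect together with `Page`.

```rust
// Hypothetical model for discussion only; none of these types exist in the
// parquet crate.

/// The three places statistics can be written.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StatLocations {
    pub column_chunk: bool,
    pub column_index: bool,
    pub data_page_header: bool,
}

/// Stand-in for the three levels described in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    None,
    Chunk,
    Page,
}

/// Stand-in for the more specific levels proposed in Option 3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProposedLevel {
    None,
    Chunk,
    ColumnIndex,
    ColumnIndexAndPage,
}

/// Option 1: `Page` writes the ColumnChunk and ColumnIndex statistics, and a
/// separate opt-in flag adds the data page header copy.
/// Assumption: the flag has no effect unless the level is `Page`.
pub fn option_1(level: Level, write_data_page_statistics: bool) -> StatLocations {
    StatLocations {
        column_chunk: level != Level::None,
        column_index: level == Level::Page,
        data_page_header: level == Level::Page && write_data_page_statistics,
    }
}

/// Behavior as described in this issue today: `Page` always writes the data
/// page header copy as well.
pub fn current_behavior(level: Level) -> StatLocations {
    option_1(level, level == Level::Page)
}

/// Option 3: every combination gets its own variant.
pub fn option_3(level: ProposedLevel) -> StatLocations {
    StatLocations {
        column_chunk: level != ProposedLevel::None,
        column_index: matches!(
            level,
            ProposedLevel::ColumnIndex | ProposedLevel::ColumnIndexAndPage
        ),
        data_page_header: level == ProposedLevel::ColumnIndexAndPage,
    }
}
```

In this model, `option_1(Level::Page, false)` and `option_3(ProposedLevel::ColumnIndex)` produce the same output. The two proposals differ only in whether existing code that sets `Page` keeps compiling unchanged.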
Additional context
* `max_statistics_truncate_length` #7490
* `max_statistics_truncate_length` is ignored when writing statistics to data page headers #7579