Fix bug in handling of empty Parquet page index structures #8817
alamb merged 6 commits into apache:main
    // Missing indexes may also have the placeholder ColumnIndexMetaData::NONE
    if matches!(column_index, ColumnIndexMetaData::NONE) {
        continue;
    }
This is the fix for the problem detected in #8811
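As a rough illustration of the check above, here is a minimal, self-contained sketch. The `ColumnIndexMetaData` enum below is a simplified stand-in defined only for this example (the real type has more variants and carries statistics), and `writable_columns` is a hypothetical helper, not a function in the crate.

```rust
// Simplified stand-in for the crate's column index type (assumption:
// only the NONE placeholder matters for this sketch).
#[allow(dead_code)]
#[derive(Debug)]
enum ColumnIndexMetaData {
    NONE,
    INT32(Vec<i32>),
}

/// Hypothetical helper: returns the positions of the column indexes that
/// would actually be written, skipping both `None` entries and the
/// `NONE` placeholder.
fn writable_columns(indexes: &[Option<ColumnIndexMetaData>]) -> Vec<usize> {
    let mut writable = Vec::new();
    for (column_idx, entry) in indexes.iter().enumerate() {
        if let Some(column_index) = entry {
            // Same guard as in the diff: a placeholder counts as missing.
            if matches!(column_index, ColumnIndexMetaData::NONE) {
                continue;
            }
            writable.push(column_idx);
        }
    }
    writable
}
```

With this guard, a column holding the placeholder is treated the same way as a column with no index at all, so nothing is serialized for it.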
    // test to see if all indexes for this file are empty
    let all_none = column_indexes
        .as_ref()
        .is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none())));
I found this problem while writing the test for the original issue #8815
    for (column_idx, column_metadata) in row_group.columns.iter_mut().enumerate() {
        if let Some(column_index) = &column_indexes[row_group_idx][column_idx] {
            // Missing indexes may also have the placeholder ColumnIndexMetaData::NONE
            if matches!(column_index, ColumnIndexMetaData::NONE) {
A minor style thing is that the logic about column metadata writing is now split in two places -- here and write_column_index -- so someone reading write_column_index may not realize that it can't be called with NONE
I wonder if it would be clearer if you moved this matches into write_column_index?
You would then have to test out here whether a column index was actually written by checking the bytes written, which might be slower I suppose 🤔
There's also encryption to deal with in write_column_index. We could modify write_thrift for ColumnIndexMetaData to be a no-op for NONE indices. But as you say we'd want to check bytes written before and after, and then behave differently if no bytes were actually written.
This too can be part of a solution to #8818. The NONE index is a kludge anyway. If we properly support None in the page index I think this too goes away.
I guess write_column_index could return a bool. We'd only modify the chunk metadata if the write returns true.
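A minimal sketch of that idea, under simplifying assumptions: `ColumnIndex`, `ChunkMeta`, and `write_and_record` are hypothetical stand-ins invented for this example, the sink is a plain `Vec<u8>`, and the error handling and encryption that the real `write_column_index` deals with are left out.

```rust
// Hypothetical stand-in for a column index that may be a placeholder.
#[allow(dead_code)]
enum ColumnIndex {
    None,
    Present(Vec<u8>), // pretend these are the already-serialized bytes
}

// Hypothetical stand-in for the per-chunk metadata fields being updated.
#[derive(Default)]
struct ChunkMeta {
    column_index_offset: Option<i64>,
    column_index_length: Option<i32>,
}

/// Writes the index and reports whether anything was written.
/// A placeholder index is a no-op that returns `false`.
fn write_column_index(index: &ColumnIndex, sink: &mut Vec<u8>) -> bool {
    match index {
        ColumnIndex::None => false,
        ColumnIndex::Present(bytes) => {
            sink.extend_from_slice(bytes);
            true
        }
    }
}

/// Only touches the chunk metadata when the write actually happened.
fn write_and_record(index: &ColumnIndex, sink: &mut Vec<u8>, meta: &mut ChunkMeta) {
    let start = sink.len();
    if write_column_index(index, sink) {
        meta.column_index_offset = Some(start as i64);
        meta.column_index_length = Some((sink.len() - start) as i32);
    }
}
```

Returning a flag keeps the "was anything written" decision inside the write function, so the caller does not need to compare byte positions before and after the call to find out.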
            .collect()
    });
    // test to see if all indexes for this file are empty
    let all_none = column_indexes
A minor nit: I would find this easier to read if it were put in a function like finalize_column_indexes, which would keep this already large function smaller.
The same comment applies to the offset_indexes.
True, it's a wall of code. I'll see if I can simplify this some.
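One possible shape for such a helper, as a sketch only: `finalize_indexes` is a hypothetical name, and it is written generically so the same function could serve both the column and the offset indexes. It reuses the all-none test from the diff.

```rust
/// Hypothetical helper: collapses the whole page index to `None` when no
/// row group holds a single usable entry; otherwise returns it unchanged.
fn finalize_indexes<T>(
    indexes: Option<Vec<Vec<Option<T>>>>,
) -> Option<Vec<Vec<Option<T>>>> {
    // Same test as in the diff: every entry of every row group is None.
    let all_none = indexes
        .as_ref()
        .is_some_and(|row_groups| {
            row_groups
                .iter()
                .all(|row_group| row_group.iter().all(|idx| idx.is_none()))
        });
    if all_none {
        None
    } else {
        indexes
    }
}
```

Note that `Iterator::all` returns `true` for an empty iterator, so an index with no row groups at all is also collapsed to `None` by this sketch.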
    let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
        None
    } else {
        // FIXME(ets): this will panic if there's a missing index.
Is this comment still relevant? If so, maybe we should track it with a ticket.
I only added a test for all none, so it's conceivable there could be rogue None in there someplace.
I think this ties in with #8818. Allowing None in the final index would fix this.
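For illustration, a rogue `None` can be detected without panicking by collecting into an `Option` instead of unwrapping each entry. This is a generic sketch, not the approach taken in the PR; `strip_options` is a hypothetical helper name.

```rust
/// Hypothetical helper: converts a page index with optional entries into
/// one with mandatory entries, returning `None` if any entry is missing
/// rather than panicking on `unwrap()`.
fn strip_options<T>(indexes: Vec<Vec<Option<T>>>) -> Option<Vec<Vec<T>>> {
    indexes
        .into_iter()
        // Collecting an iterator of Option<T> into Option<Vec<T>> yields
        // None as soon as one element is None.
        .map(|row_group| row_group.into_iter().collect::<Option<Vec<T>>>())
        .collect()
}
```

A caller would still have to decide what a `None` result means, for example dropping the page index entirely or returning an error.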
    }

    #[test]
    fn test_rewrite_missing_column_index() {
FWIW I also verified that this test covers the change by running it without this PR, and it fails as expected:
---- file::writer::tests::test_rewrite_no_page_indexes stdout ----
thread 'file::writer::tests::test_rewrite_no_page_indexes' (25670053) panicked at parquet/src/file/metadata/writer.rs:243:54:
called `Option::unwrap()` on a `None` value
---- file::writer::tests::test_rewrite_missing_column_index stdout ----
thread 'file::writer::tests::test_rewrite_missing_column_index' (25670052) panicked at parquet/src/file/writer.rs:2514:24:
called `Result::unwrap()` on an `Err` value: General("Cannot serialize NONE index")
failures:
file::writer::tests::test_rewrite_missing_column_index
file::writer::tests::test_rewrite_no_page_indexes
test result: FAILED. 815 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 3.49s
Thank you @etseidl
Which issue does this PR close?
ThriftMetadataWriter::write_column_indexes cannot handle a ColumnIndexMetaData::NONE #8815.
Rationale for this change
When writing Parquet metadata, sometimes the column and offset indexes contain missing values (this is usually a side effect of the ParquetMetaData not allowing for None in the page index structures). This can lead to errors or panics.
What changes are included in this PR?
Adds some checking in ThriftMetadataWriter to detect missing bits and work around them.
Are these changes tested?
Yes, new tests added.
Are there any user-facing changes?
No