Skip to content

Fix bug in handling of empty Parquet page index structures#8817

Merged
alamb merged 6 commits intoapache:mainfrom
etseidl:missing_col_index
Nov 13, 2025
Merged

Fix bug in handling of empty Parquet page index structures#8817
alamb merged 6 commits intoapache:mainfrom
etseidl:missing_col_index

Conversation

@etseidl
Copy link
Copy Markdown
Contributor

@etseidl etseidl commented Nov 10, 2025

Which issue does this PR close?

Rationale for this change

When writing Parquet metadata, sometimes the column and offset indexes contain missing values (this is usually a side effect of the ParquetMetaData not allowing for None in the page index structures). This can lead to errors or panics.

What changes are included in this PR?

Adds some checking in ThriftMetaDataWriter to detect missing bits and work around them.

Are these changes tested?

Yes, new tests added.

Are there any user-facing changes?

No

Comment thread parquet/src/file/metadata/writer.rs Outdated
Comment on lines +116 to +119
// Missing indexes may also have the placeholder ColumnIndexMetaData::NONE
if matches!(column_index, ColumnIndexMetaData::NONE) {
continue;
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix for the problem detected in #8811

Comment thread parquet/src/file/metadata/writer.rs Outdated
Comment on lines +234 to +237
// test to see if all indexes for this file are empty
let all_none = column_indexes
.as_ref()
.is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none())));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this problem while writing the test for the original issue #8815

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Nov 10, 2025
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl --- this makes sense to me

Comment thread parquet/src/file/metadata/writer.rs Outdated
for (column_idx, column_metadata) in row_group.columns.iter_mut().enumerate() {
if let Some(column_index) = &column_indexes[row_group_idx][column_idx] {
// Missing indexes may also have the placeholder ColumnIndexMetaData::NONE
if matches!(column_index, ColumnIndexMetaData::NONE) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor style thing is that the logic about column metadata writing is now split in two places -- here and write_column_index -- so someone reading write_column_index may not realize that it can't be called with NONE

I wonder if it would be clearer if you moved this matches into write_column_index?

You would then have to test out here if a column index was actually written by checking bytes writtten, which might be slower I suppose 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also encryption to deal with in write_column_index. We could modify write_thrift for ColumnIndexMetaData to be a no-op for NONE indices. But as you say we'd want to check bytes written before and after, and then behave differently if no bytes were actually written.

This too can be part of a solution to #8818. The NONE index is a kludge anyway. If we properly support None in the page index I think this too goes away.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess write_column_index could return a bool. We'd only modify the chunk metadata if the write returns true.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change in 1654a14

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

Comment thread parquet/src/file/metadata/writer.rs Outdated
.collect()
});
// test to see if all indexes for this file are empty
let all_none = column_indexes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit would be that i would find this easier to read if it were put in a function like finalize_column_indexes that could keep this already large function smaller

The same comment applies to the offset_indexes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, it's a wall of code. I'll see if I can simplify this some.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ab9c18c

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix fixed in fae197f

Comment thread parquet/src/file/metadata/writer.rs Outdated
let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
// FIXME(ets): this will panic if there's a missing index.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment relevant anymore? If so maybe we should track it with a ticket

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added a test for all none, so it's conceivable there could be rogue None in there someplace.

I think this ties in with #8818. Allowing None in the final index would fix this.

}

#[test]
fn test_rewrite_missing_column_index() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I also verified that this test covers the change by runing it without this PR and it fails as expected:

---- file::writer::tests::test_rewrite_no_page_indexes stdout ----

thread 'file::writer::tests::test_rewrite_no_page_indexes' (25670053) panicked at parquet/src/file/metadata/writer.rs:243:54:
called `Option::unwrap()` on a `None` value

---- file::writer::tests::test_rewrite_missing_column_index stdout ----

thread 'file::writer::tests::test_rewrite_missing_column_index' (25670052) panicked at parquet/src/file/writer.rs:2514:24:
called `Result::unwrap()` on an `Err` value: General("Cannot serialize NONE index")


failures:
    file::writer::tests::test_rewrite_missing_column_index
    file::writer::tests::test_rewrite_no_page_indexes

test result: FAILED. 815 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 3.49s

@alamb alamb merged commit 77a4c43 into apache:main Nov 13, 2025
16 checks passed
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Nov 13, 2025

Thank you @etseidl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ThriftMetadataWriter::write_column_indexes cannot handle a ColumnIndexMetaData::NONE

2 participants