[thrift-remodel] Use new writer to write Parquet file metadata #8445

Merged
etseidl merged 154 commits into apache:gh5854_thrift_remodel from etseidl:write_file_meta
Sep 26, 2025

Conversation

Contributor

@etseidl etseidl commented Sep 25, 2025

Which issue does this PR close?

Note: this targets a feature branch, not main

Rationale for this change

This PR closes the loop: Parquet metadata is now handled entirely by the new code.

What changes are included in this PR?

Changes the metadata builders to use the new structs rather than those from format. As a consequence, the close methods no longer return a format::FileMetaData but instead return a ParquetMetaData.

Are these changes tested?

Covered by existing tests, but many tests were modified to deal with the switch to ParquetMetaData mentioned above.

Are there any user-facing changes?

Yes

@etseidl etseidl added the parquet (Changes to the parquet crate) and api-change (Changes to the arrow API) labels Sep 25, 2025
Contributor Author

etseidl commented Sep 25, 2025

It's getting very close to October. I'm not sure we'll be able to get this into 57.0.0, but the thought of keeping this up to date with the changes that are queued up makes me queasy.

}
}

/// Write an encrypted Thrift serializable object
Contributor Author

@etseidl etseidl Sep 25, 2025

There are no parquet::format structs left to encrypt 😄

  let base_expected_size = 2280;
  #[cfg(feature = "encryption")]
- let base_expected_size = 2616;
+ let base_expected_size = 2712;
Contributor Author

I still need to track down why this jumped

Contributor Author

encrypted_column_metadata adds 24 bytes per column chunk.

Contributor

As a follow on, what do you think about Box'ing that -- Option<Box<ColumnCryptoMetaData>> 🤔 I have also thought recently the #cfgs for encryption make the code harder to work with (though they have the benefit there is no overhead if the feature is not enabled) 🤔

Contributor Author

That will help some...I'll admit to not really having a feel for the sizes of Rust structures. I'd imagine an Option<Box<>> would be 8-16 bytes?
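
The size question in this thread can be checked with `std::mem::size_of`. The following is a minimal sketch, assuming a 64-bit target. `CryptoMeta` is a hypothetical stand-in with made-up fields, not the crate's `ColumnCryptoMetaData`, and treating the 24-byte field as an `Option<Vec<u8>>` is an assumption based only on the number quoted above.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for `ColumnCryptoMetaData`; the fields are invented
/// for illustration and do not mirror the real struct.
#[allow(dead_code)]
pub struct CryptoMeta {
    pub path_in_schema: Vec<String>,
    pub key_metadata: Option<Vec<u8>>,
}

/// Inline size of an optional byte buffer (assumed shape of the new field).
/// A `Vec` is pointer + capacity + length, and `Option` reuses a niche in it,
/// so on current 64-bit targets this is 24 bytes.
pub fn inline_optional_bytes() -> usize {
    size_of::<Option<Vec<u8>>>()
}

/// Inline size of a boxed optional struct. Rust guarantees that
/// `Option<Box<T>>` has the same size as a pointer, because `None` is
/// represented by the null pointer.
pub fn boxed_optional_meta() -> usize {
    size_of::<Option<Box<CryptoMeta>>>()
}
```

Under these assumptions, boxing costs one pointer (8 bytes on a 64-bit target) per column chunk when the value is absent, at the price of a heap allocation when it is present.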

assert_eq!(metadata.file_metadata().num_rows(), 50);
// TODO(ets): what was this meant to test? The read and written schemas differ because an
// archaic form for a list was used in the source file.
// assert_eq!(metadata.schema, metadata.schema);
Contributor Author

Could one of @rok or @adamreeve opine here? 🙏

Contributor

I guess this was supposed to verify that after we write the encrypted file the schema matches the original input, as your comment suggests. But it's clearly not doing that!

I don't think this is necessary, the schemas shouldn't need to match exactly, and verify_encryption_test_data already tests all the columns expected are there and will verify the arrow types.

Contributor Author

Thanks @adamreeve. I'll just remove the asserts then.

Contributor Author

etseidl commented Sep 25, 2025

@alamb I've merged in your recent changes, could you take a look please? 🙏

@mbrobbel mbrobbel added this to the 57.0.0 milestone Sep 26, 2025
Contributor

@alamb alamb left a comment

Thank you @etseidl -- I reviewed this PR as carefully as I could. I have some small suggestions, but nothing that I think would prevent this PR from merging.

In general I found the new parquet metadata writing code easy to follow, and the patterns make lots of sense to me

});

#[cfg(feature = "arrow")]
c.bench_function("page headers (no stats)", |b| {
Contributor

when reviewing these benchmarks, it seems like maybe it is time to remove benchmarks for page header statistics (as they aren't really useful / widely used)

Contributor Author

Agreed...the only reason I added them was to see the speedup from not decoding the Statistics. I'll make a note to remove them later. Same for the private file metadata decoding...we should only be benchmarking the public API.

///
/// Attempting to write after calling finish will result in an error
- pub async fn finish(&mut self) -> Result<crate::format::FileMetaData> {
+ pub async fn finish(&mut self) -> Result<ParquetMetaData> {
Contributor

I think that is a much more reasonable API, FWIW, as ParquetMetaData is what is used in the rest of the APIs

Contributor Author

Good. I was most worried about this change.

Comment thread parquet/src/file/metadata/memory.rs Outdated
+ self.unencoded_byte_array_data_bytes.heap_size()
+ self.repetition_level_histogram.heap_size()
+ self.definition_level_histogram.heap_size()
+ self.column_crypto_metadata.heap_size()
Contributor

I was at first annoyed at the replication here, but the alternative is to #cfg out a different function or something, which is not obviously simpler

Though it would make it easier to keep these functions in sync 🤔

Contributor Author

I could inline the #cfg, but I remember in the past we were trying to steer away from too many cfgs sprinkled about in the code. But maybe for the column chunk fields it's a better approach.

Contributor

Yeah, I agree there is no great solution

  7: optional list<ColumnOrder> column_orders;
- 8: optional EncryptionAlgorithm<'a> encryption_algorithm
- 9: optional binary<'a> footer_signing_key_metadata
+ 8: optional EncryptionAlgorithm encryption_algorithm
Contributor

I haven't been following along, but what is the significance of not using references (aka removing <'a>)? 🤔

Contributor Author

The lifetime annotations signal the macros to generate slices rather than vectors. While implementing write I found I couldn't keep the references alive long enough to use slices, so for now encryption requires more allocations than strictly necessary. I'd hate to duplicate all these structs to have one for reading and one for writing.

Perhaps if we figure out a way to encapsulate the encryption code more, we can revisit this.
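
To illustrate the trade-off described above, here is a minimal sketch of a borrowed read-side struct next to an owned write-side one. All names (`BorrowedKeyMeta`, `OwnedKeyMeta`, `read_key_meta`, `build_key_meta`) are hypothetical and are not what the crate's macros generate; the sketch only shows why a slice works when decoding from a long-lived buffer but not when the bytes are produced locally.

```rust
/// Hypothetical read-side struct: borrows from the footer buffer, so
/// decoding allocates nothing for this field.
#[derive(Debug, PartialEq)]
pub struct BorrowedKeyMeta<'a> {
    pub footer_signing_key_metadata: Option<&'a [u8]>,
}

/// Hypothetical write-side struct: owns its bytes, at the cost of an
/// allocation.
#[derive(Debug, PartialEq)]
pub struct OwnedKeyMeta {
    pub footer_signing_key_metadata: Option<Vec<u8>>,
}

/// Reading: the result is tied to the lifetime of `buf`, which the caller
/// keeps alive for as long as the metadata is used.
pub fn read_key_meta(buf: &[u8]) -> BorrowedKeyMeta<'_> {
    BorrowedKeyMeta {
        footer_signing_key_metadata: if buf.is_empty() { None } else { Some(buf) },
    }
}

/// Writing: the bytes are produced inside this function, so a struct holding
/// `&bytes[..]` could not be returned; the borrow would outlive `bytes`.
/// Copying into a `Vec` sidesteps the lifetime problem.
pub fn build_key_meta(key_id: u32) -> OwnedKeyMeta {
    let bytes = key_id.to_be_bytes(); // local temporary, dropped on return
    OwnedKeyMeta {
        footer_signing_key_metadata: Some(bytes.to_vec()),
    }
}
```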

Comment thread parquet/src/file/metadata/thrift_gen.rs Outdated
impl<'a> WriteThrift for FileMeta<'a> {
const ELEMENT_TYPE: ElementType = ElementType::Struct;

#[allow(unused_assignments)]
Contributor

Is this still needed?

Contributor Author

Ah, that's probably a cut-and-paste leftover. I'll see if I can remove it. Nice catch!

None
},
repetition_type: Some(basic_info.repetition()),
name: basic_info.name(),
Contributor

I double checked and this does not allocate a string (uses &str) 👍

Comment thread parquet/src/file/metadata/thrift_gen.rs Outdated
type_length: None,
repetition_type: repetition,
name: basic_info.name(),
num_children: Some(fields.len() as i32),
Contributor

maybe as a follow on we should validate this limit (aka that there are not more than 2M fields 🤔 )

Contributor Author

That would be a BIG schema, but I could switch to a try.

Contributor

@alamb alamb Sep 26, 2025

> That would be a BIG schema, but I could switch to a try.

yeah, I am not really imagining that someone would need it for real, more like either did it by accident or is trying to cause denial of service
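
The "switch to a try" mentioned in this thread could look roughly like the sketch below, which replaces the `as i32` cast with `i32::try_from`. The helper name `checked_num_children` and the `String` error type are assumptions for illustration; the crate would presumably use its own error type.

```rust
/// Converts a child count into the `i32` that the Thrift `num_children`
/// field requires, returning an error instead of silently wrapping.
/// Hypothetical helper; name and error type are illustrative only.
pub fn checked_num_children(len: usize) -> Result<i32, String> {
    i32::try_from(len).map_err(|_| format!("schema node has too many children: {len}"))
}
```

With a plain `as i32` cast, a count one past `i32::MAX` (2,147,483,647) would wrap to a negative number; the checked conversion turns that case into an error the writer can report.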

}
}

// struct RowGroup {
Contributor

I double checked how to find this, and see that this maps straightforwardly to the original thrift 👍

https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1001

Contributor Author

Yes, I wanted the comment there to explain the magic numbers.
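
For readers unfamiliar with why the field ids matter: in Thrift's compact protocol each field header encodes the field id relative to the previous one, which is why the writer threads a `last_field_id` through successive calls. The sketch below shows that header encoding in isolation. The function names are hypothetical and this is not the crate's `write_thrift_field`; only the wire format follows the Thrift compact protocol.

```rust
/// Writes a compact-protocol field header and returns `field_id`, to be
/// passed as `last_field_id` for the next field. Hypothetical helper.
pub fn write_field_header(
    out: &mut Vec<u8>,
    type_id: u8,
    field_id: i16,
    last_field_id: i16,
) -> i16 {
    let delta = field_id as i32 - last_field_id as i32;
    if delta > 0 && delta <= 15 {
        // Short form: high nibble is the id delta, low nibble is the type.
        out.push(((delta as u8) << 4) | type_id);
    } else {
        // Long form: type byte, then the full field id as a zigzag varint.
        out.push(type_id);
        write_zigzag_varint(out, field_id);
    }
    field_id
}

/// Zigzag-encodes a signed value and writes it as a base-128 varint.
fn write_zigzag_varint(out: &mut Vec<u8>, value: i16) {
    let n = value as i32;
    let mut v = ((n << 1) ^ (n >> 31)) as u32;
    while v >= 0x80 {
        out.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}
```

Writing field 1 (type 5, `i32`) after id 0 and then field 7 (type 9, list) after id 1 each take a single header byte, since both deltas fit in four bits.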

if let Some(column_orders) = self.file_metadata.column_orders() {
last_field_id = column_orders.write_thrift_field(writer, 7, last_field_id)?;
}
if let Some(algo) = self.encryption_algorithm.as_ref() {
Contributor

I think we could avoid a lot of repetition if you just put the #[cfg(not(feature = "encryption"))] on these two fields, like

Suggested change
- if let Some(algo) = self.encryption_algorithm.as_ref() {
+ #[cfg(feature = "encryption")]
+ if let Some(algo) = self.encryption_algorithm.as_ref() {

Contributor Author

Will do...same comment as above about the encryption cfgs.

Contributor Author

etseidl commented Sep 26, 2025

Thanks @alamb! I'll clean this up and then move on to the last major PR, which I hope will be the last of the breaking changes. The rest should be fine tuning and testing. Field skipping should not be breaking changes either.

FWIW the next PR will also add the beginnings of thrift documentation.

Contributor

alamb commented Sep 26, 2025

BTW I am thinking we should document all this great work in a blog post or something

Nothing actionable yet, just FYI


impl HeapSize for ColumnChunkMetaData {
fn heap_size(&self) -> usize {
#[cfg(feature = "encryption")]
Contributor Author

@alamb is this more palatable? I wish I could use if cfg!() ... else ... but can't because the fields don't exist if encryption isn't enabled.

Contributor

I think this is reasonable to me
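
The constraint discussed in this thread can be reproduced in a small sketch: when a field is itself gated by `#[cfg]`, only an attribute on the statement that touches it will compile with the feature off. `ChunkMeta` and its fields are hypothetical stand-ins for `ColumnChunkMetaData`, and the sketch assumes an `encryption` feature is declared for the crate; it is not the code that was merged.

```rust
/// Hypothetical stand-in for `ColumnChunkMetaData` with one feature-gated
/// field. Assumes the crate declares an `encryption` feature.
pub struct ChunkMeta {
    pub level_histogram: Vec<i64>,
    #[cfg(feature = "encryption")]
    pub crypto_metadata: Option<Vec<u8>>,
}

impl ChunkMeta {
    pub fn new(level_histogram: Vec<i64>) -> Self {
        Self {
            level_histogram,
            #[cfg(feature = "encryption")]
            crypto_metadata: None,
        }
    }

    /// Heap bytes owned by this value, written as a single function.
    pub fn heap_size(&self) -> usize {
        let size = self.level_histogram.capacity() * std::mem::size_of::<i64>();
        // `if cfg!(feature = "encryption") { ... }` would not work here: both
        // branches are still type-checked, and with the feature off the
        // field does not exist. The attribute removes the whole statement.
        #[cfg(feature = "encryption")]
        let size = size + self.crypto_metadata.as_ref().map_or(0, |v| v.capacity());
        size
    }
}
```

Shadowing `size` under the attribute keeps one function body for both configurations, so the ungated fields are summed in only one place.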

@etseidl etseidl merged commit aa26c0c into apache:gh5854_thrift_remodel Sep 26, 2025
16 checks passed
@etseidl etseidl deleted the write_file_meta branch October 10, 2025 14:36
@alamb alamb mentioned this pull request Nov 19, 2025
4 participants