[thrift-remodel] Use new writer to write Parquet file metadata #8445
etseidl merged 154 commits into apache:gh5854_thrift_remodel
Conversation
It's getting very close to October. I'm not sure we'll be able to get this into
```rust
}
}

/// Write an encrypted Thrift serializable object
```
There are no parquet::format structs left to encrypt 😄
```rust
let base_expected_size = 2280;
#[cfg(feature = "encryption")]
let base_expected_size = 2616;
let base_expected_size = 2712;
```
I still need to track down why this jumped
encrypted_column_metadata adds 24 bytes per column chunk.
As a follow on, what do you think about Box'ing that, i.e. Option<Box<ColumnCryptoMetaData>>? 🤔 I have also thought recently that the #cfgs for encryption make the code harder to work with (though they have the benefit that there is no overhead if the feature is not enabled) 🤔
That will help some...I'll admit to not really having a feel for the sizes of Rust structures. I'd imagine an Option<Box<>> would be 8-16 bytes?
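For reference, a quick check (using a hypothetical 24-byte stand-in for ColumnCryptoMetaData, not the real struct) shows that Option<Box<T>> is a single pointer wide on 64-bit targets, thanks to the null-pointer niche:

```rust
use std::mem::size_of;

// Hypothetical stand-in for ColumnCryptoMetaData; the real struct is larger.
struct CryptoStandIn {
    _data: [u8; 24],
}

fn main() {
    // Box<T> is a non-null pointer, so Option<Box<T>> needs no extra tag:
    // it is exactly one pointer wide (8 bytes on 64-bit targets).
    assert_eq!(size_of::<Option<Box<CryptoStandIn>>>(), size_of::<usize>());
    // Stored inline, the stand-in itself takes its full 24 bytes.
    assert_eq!(size_of::<CryptoStandIn>(), 24);
    println!("Option<Box<_>>: {} bytes", size_of::<Option<Box<CryptoStandIn>>>());
}
```

So Box'ing would shrink the always-present field to pointer size, at the cost of one heap allocation when the crypto metadata is actually present.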
```rust
assert_eq!(metadata.file_metadata().num_rows(), 50);
// TODO(ets): what was this meant to test? The read and written schemas differ because an
// archaic form for a list was used in the source file.
// assert_eq!(metadata.schema, metadata.schema);
```
Could one of @rok or @adamreeve opine here? 🙏
I guess this was supposed to verify that after we write the encrypted file the schema matches the original input, as your comment suggests. But it's clearly not doing that!
I don't think this is necessary; the schemas shouldn't need to match exactly, and verify_encryption_test_data already tests that all the expected columns are there and will verify the Arrow types.
Thanks @adamreeve. I'll just remove the asserts then.
@alamb I've merged in your recent changes, could you take a look please? 🙏
alamb left a comment
Thank you @etseidl -- I reviewed this PR as carefully as I could. I have some small suggestions, but nothing that I think would prevent this PR from merging.
In general I found the new parquet metadata writing code easy to follow, and the patterns make lots of sense to me.
```rust
});

#[cfg(feature = "arrow")]
c.bench_function("page headers (no stats)", |b| {
```
when reviewing these benchmarks, it seems like maybe it is time to remove benchmarks for page header statistics (as they aren't really useful / widely used)
Agreed...the only reason I added them was to see the speedup from not decoding the Statistics. I'll make a note to remove them later. Same for the private file metadata decoding...we should only be benchmarking the public API.
```rust
///
/// Attempting to write after calling finish will result in an error
pub async fn finish(&mut self) -> Result<crate::format::FileMetaData> {
pub async fn finish(&mut self) -> Result<ParquetMetaData> {
```
I think that is a much more reasonable API, FWIW, as ParquetMetaData is what is used in the rest of the APIs.
Good. I was most worried about this change.
```rust
+ self.unencoded_byte_array_data_bytes.heap_size()
+ self.repetition_level_histogram.heap_size()
+ self.definition_level_histogram.heap_size()
+ self.column_crypto_metadata.heap_size()
```
I was at first annoyed at the replication here, but the alternative is to #cfg out a different function or something, which is not obviously simpler.
Though it would make it easier to keep these functions in sync 🤔
I could inline the #cfg, but I remember in the past we were trying to steer away from too many cfgs sprinkled about in the code. But maybe for the column chunk fields it's a better approach.
Yeah, I agree there is no great solution
```thrift
7: optional list<ColumnOrder> column_orders;
8: optional EncryptionAlgorithm<'a> encryption_algorithm
9: optional binary<'a> footer_signing_key_metadata
8: optional EncryptionAlgorithm encryption_algorithm
```
I haven't been following along, but what is the significance of not using references (aka removing <'a>)? 🤔
The lifetime annotations signal the macros to generate slices rather than vectors. While implementing write I found I couldn't keep the references alive long enough to use slices, so for now encryption requires more allocations than strictly necessary. I'd hate to duplicate all these structs to have one for reading and one for writing.
Perhaps if we figure out a way to encapsulate the encryption code more, we can revisit this.
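A minimal sketch of that trade-off (the struct and field names here are hypothetical stand-ins, not the generated types): a lifetime parameter lets the read path borrow slices straight out of the footer buffer, while the owned form needed for writing must allocate a Vec per field:

```rust
// Read path: borrows from the serialized footer bytes, zero-copy.
struct FooterBorrowed<'a> {
    footer_signing_key_metadata: Option<&'a [u8]>,
}

// Write path: owns its bytes, so each field costs an allocation,
// but there is no lifetime tying it to a source buffer.
struct FooterOwned {
    footer_signing_key_metadata: Option<Vec<u8>>,
}

fn main() {
    let buf: Vec<u8> = vec![0xde, 0xad, 0xbe, 0xef];
    let borrowed = FooterBorrowed {
        footer_signing_key_metadata: Some(&buf),
    };
    // `borrowed` cannot outlive `buf`, which is exactly the problem the
    // write path ran into; converting to the owned form copies the bytes.
    let owned = FooterOwned {
        footer_signing_key_metadata: borrowed.footer_signing_key_metadata.map(|s| s.to_vec()),
    };
    assert_eq!(owned.footer_signing_key_metadata.as_deref(), Some(&buf[..]));
}
```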
```rust
impl<'a> WriteThrift for FileMeta<'a> {
    const ELEMENT_TYPE: ElementType = ElementType::Struct;

    #[allow(unused_assignments)]
```
Ah, that's probably a cut-and-paste leftover. I'll see if I can remove it. Nice catch!
```rust
    None
},
repetition_type: Some(basic_info.repetition()),
name: basic_info.name(),
```
I double checked and this does not allocate a string (uses &str) 👍
```rust
type_length: None,
repetition_type: repetition,
name: basic_info.name(),
num_children: Some(fields.len() as i32),
```
maybe as a follow on we should validate this limit (aka that there are not more than 2M fields 🤔 )
That would be a BIG schema, but I could switch to a try.
> That would be a BIG schema, but I could switch to a try.

yeah, I am not really imagining that someone would need it for real, more like either did it by accident or is trying to cause a denial of service
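The checked version could look something like this (the helper name is hypothetical): i32::try_from turns an out-of-range field count into an error instead of a silent wrap:

```rust
// Hypothetical helper: convert a schema's field count to the i32 that the
// thrift num_children field requires, failing rather than wrapping.
fn num_children(field_count: usize) -> Result<i32, String> {
    i32::try_from(field_count)
        .map_err(|_| format!("schema has too many fields: {field_count}"))
}

fn main() {
    assert_eq!(num_children(3), Ok(3));
    // More than i32::MAX fields would wrap to a negative number with
    // `as i32`; the checked conversion reports it instead.
    assert!(num_children(i32::MAX as usize + 1).is_err());
    println!("checked conversion ok");
}
```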
```rust
}
}

// struct RowGroup {
```
I double checked how to find this, and see that this maps straightforwardly to the original thrift 👍
Yes, I wanted the comment there to explain the magic numbers.
```rust
if let Some(column_orders) = self.file_metadata.column_orders() {
    last_field_id = column_orders.write_thrift_field(writer, 7, last_field_id)?;
}
if let Some(algo) = self.encryption_algorithm.as_ref() {
```
I think we could avoid a lot of repetition if you just put the #[cfg(not(feature = "encryption"))] on these two fields, like
```diff
-if let Some(algo) = self.encryption_algorithm.as_ref() {
+#[cfg(feature = "encryption")]
+if let Some(algo) = self.encryption_algorithm.as_ref() {
```
Will do...same comment as above about the encryption cfgs.
Thanks @alamb! I'll clean this up and then move on to the last major PR, which I hope will be the last of the breaking changes. The rest should be fine tuning and testing. Field skipping should not be a breaking change either. FWIW the next PR will also add the beginnings of thrift documentation.
BTW I am thinking we should document all this great work in a blog post or something. Nothing actionable yet, just FYI.
```rust
impl HeapSize for ColumnChunkMetaData {
    fn heap_size(&self) -> usize {
        #[cfg(feature = "encryption")]
```
@alamb is this more palatable? I wish I could use if cfg!() ... else ... but can't because the fields don't exist if encryption isn't enabled.
I think this is reasonable to me
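For illustration, a self-contained sketch of the pattern (struct and field names are stand-ins, not the real ColumnChunkMetaData): an `if cfg!(feature = "encryption")` would not compile because the field simply does not exist without the feature, so the attribute has to gate the whole block:

```rust
// Stand-in struct: only carries the crypto field when the feature is on.
struct ChunkMetaSketch {
    statistics_bytes: usize,
    #[cfg(feature = "encryption")]
    column_crypto_metadata: Vec<u8>,
}

impl ChunkMetaSketch {
    fn new(statistics_bytes: usize) -> Self {
        Self {
            statistics_bytes,
            #[cfg(feature = "encryption")]
            column_crypto_metadata: Vec::new(),
        }
    }

    fn heap_size(&self) -> usize {
        #[allow(unused_mut)] // only mutated when the feature is enabled
        let mut size = self.statistics_bytes;
        // An `if cfg!(...)` here would still type-check the field access
        // with the feature off and fail to compile; the attribute removes
        // the block from compilation entirely instead.
        #[cfg(feature = "encryption")]
        {
            size += self.column_crypto_metadata.capacity();
        }
        size
    }
}

fn main() {
    let meta = ChunkMetaSketch::new(64);
    assert_eq!(meta.heap_size(), 64);
}
```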
Which issue does this PR close?
Note: this targets a feature branch, not main
Rationale for this change
This PR closes the loop: Parquet metadata is now completely handled by the new code.
What changes are included in this PR?
Changes the metadata builders to use the new structs rather than those from `format`. As a consequence, the `close` methods no longer return a `format::FileMetaData` but instead return a `ParquetMetaData`.

Are these changes tested?

Covered by existing tests, but many tests were modified to deal with the switch to `ParquetMetaData` mentioned above.

Are there any user-facing changes?
Yes