[thrift-remodel] Use new writer to write Parquet file metadata #8445
etseidl merged 154 commits into apache:gh5854_thrift_remodel
Conversation
It's getting very close to October. I'm not sure we'll be able to get this into
```rust
}
}

/// Write an encrypted Thrift serializable object
```
There are no parquet::format structs left to encrypt 😄
```rust
let base_expected_size = 2280;
#[cfg(feature = "encryption")]
let base_expected_size = 2616;
let base_expected_size = 2712;
```
I still need to track down why this jumped
encrypted_column_metadata adds 24 bytes per column chunk.
As a follow on, what do you think about Box'ing that, i.e. Option<Box<ColumnCryptoMetaData>>? 🤔 I have also thought recently that the #cfgs for encryption make the code harder to work with (though they have the benefit that there is no overhead if the feature is not enabled) 🤔
That will help some...I'll admit to not really having a feel for the sizes of Rust structures. I'd imagine an Option<Box<>> would be 8-16 bytes?
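For reference, a quick check (using a hypothetical 24-byte stand-in for ColumnCryptoMetaData, not the real struct) shows that Option<Box<T>> is a single pointer wide on 64-bit targets, thanks to the null-pointer niche:

```rust
use std::mem::size_of;

// Hypothetical stand-in for ColumnCryptoMetaData; the real struct is larger.
struct CryptoStandIn {
    _data: [u8; 24],
}

fn main() {
    // Box<T> is a non-null pointer, so Option<Box<T>> needs no extra tag:
    // it is exactly one pointer wide (8 bytes on 64-bit targets).
    assert_eq!(size_of::<Option<Box<CryptoStandIn>>>(), size_of::<usize>());
    // Stored inline, the stand-in itself takes its full 24 bytes.
    assert_eq!(size_of::<CryptoStandIn>(), 24);
    println!("Option<Box<_>>: {} bytes", size_of::<Option<Box<CryptoStandIn>>>());
}
```

So Box'ing would shrink the always-present field to pointer size, at the cost of one heap allocation when the crypto metadata is actually present.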
```rust
assert_eq!(metadata.file_metadata().num_rows(), 50);
// TODO(ets): what was this meant to test? The read and written schemas differ because an
// archaic form for a list was used in the source file.
// assert_eq!(metadata.schema, metadata.schema);
```
Could one of @rok or @adamreeve opine here? 🙏
I guess this was supposed to verify that after we write the encrypted file the schema matches the original input, as your comment suggests. But it's clearly not doing that!
I don't think this is necessary; the schemas shouldn't need to match exactly, and verify_encryption_test_data already tests that all the expected columns are there and will verify the Arrow types.
Thanks @adamreeve. I'll just remove the asserts then.
@alamb I've merged in your recent changes, could you take a look please? 🙏
alamb left a comment
Thank you @etseidl -- I reviewed this PR as carefully as I could. I have some small suggestions, but nothing that I think would prevent this PR from merging.
In general I found the new parquet metadata writing code easy to follow, and the patterns make lots of sense to me.
```rust
});

#[cfg(feature = "arrow")]
c.bench_function("page headers (no stats)", |b| {
```
when reviewing these benchmarks, it seems like maybe it is time to remove benchmarks for page header statistics (as they aren't really useful / widely used)
Agreed...the only reason I added them was to see the speedup from not decoding the Statistics. I'll make a note to remove them later. Same for the private file metadata decoding...we should only be benchmarking the public API.
```rust
///
/// Attempting to write after calling finish will result in an error
pub async fn finish(&mut self) -> Result<crate::format::FileMetaData> {
pub async fn finish(&mut self) -> Result<ParquetMetaData> {
```
I think that is a much more reasonable API, FWIW, as ParquetMetaData is what is used in the rest of the APIs.
Good. I was most worried about this change.
```rust
+ self.unencoded_byte_array_data_bytes.heap_size()
+ self.repetition_level_histogram.heap_size()
+ self.definition_level_histogram.heap_size()
+ self.column_crypto_metadata.heap_size()
```
I was at first annoyed at the replication here, but the alternative is to #cfg out a different function or something, which is not obviously simpler.
Though it would make it easier to keep these functions in sync 🤔
I could inline the #cfg, but I remember in the past we were trying to steer away from too many cfgs sprinkled about in the code. But maybe for the column chunk fields it's a better approach.
Yeah, I agree there is no great solution
```thrift
7: optional list<ColumnOrder> column_orders;
8: optional EncryptionAlgorithm<'a> encryption_algorithm
9: optional binary<'a> footer_signing_key_metadata
8: optional EncryptionAlgorithm encryption_algorithm
```
I haven't been following along, but what is the significance of not using references (aka removing <'a>)? 🤔
The lifetime annotations signal the macros to generate slices rather than vectors. While implementing write I found I couldn't keep the references alive long enough to use slices, so for now encryption requires more allocations than strictly necessary. I'd hate to duplicate all these structs to have one for reading and one for writing.
Perhaps if we figure out a way to encapsulate the encryption code more, we can revisit this.
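A minimal sketch of that trade-off (the struct and field names here are hypothetical stand-ins, not the generated types): a lifetime parameter lets the read path borrow slices straight out of the footer buffer, while the owned form needed for writing must allocate a Vec per field:

```rust
// Read path: borrows from the serialized footer bytes, zero-copy.
struct FooterBorrowed<'a> {
    footer_signing_key_metadata: Option<&'a [u8]>,
}

// Write path: owns its bytes, so each field costs an allocation,
// but there is no lifetime tying it to a source buffer.
struct FooterOwned {
    footer_signing_key_metadata: Option<Vec<u8>>,
}

fn main() {
    let buf: Vec<u8> = vec![0xde, 0xad, 0xbe, 0xef];
    let borrowed = FooterBorrowed {
        footer_signing_key_metadata: Some(&buf),
    };
    // `borrowed` cannot outlive `buf`, which is exactly the problem the
    // write path ran into; converting to the owned form copies the bytes.
    let owned = FooterOwned {
        footer_signing_key_metadata: borrowed.footer_signing_key_metadata.map(|s| s.to_vec()),
    };
    assert_eq!(owned.footer_signing_key_metadata.as_deref(), Some(&buf[..]));
}
```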
```rust
impl<'a> WriteThrift for FileMeta<'a> {
    const ELEMENT_TYPE: ElementType = ElementType::Struct;

    #[allow(unused_assignments)]
```
Ah, that's probably a cut-and-paste leftover. I'll see if I can remove it. Nice catch!
```rust
    None
},
repetition_type: Some(basic_info.repetition()),
name: basic_info.name(),
```
I double checked and this does not allocate a string (uses &str) 👍
```rust
type_length: None,
repetition_type: repetition,
name: basic_info.name(),
num_children: Some(fields.len() as i32),
```
maybe as a follow on we should validate this limit (aka that there are not more than 2M fields 🤔 )
That would be a BIG schema, but I could switch to a try.
> That would be a BIG schema, but I could switch to a try.

yeah, I am not really imagining that someone would need it for real, more like either did it by accident or is trying to cause a denial of service
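The checked version could look something like this (the helper name is hypothetical): i32::try_from turns an out-of-range field count into an error instead of a silent wrap:

```rust
// Hypothetical helper: convert a schema's field count to the i32 that the
// thrift num_children field requires, failing rather than wrapping.
fn num_children(field_count: usize) -> Result<i32, String> {
    i32::try_from(field_count)
        .map_err(|_| format!("schema has too many fields: {field_count}"))
}

fn main() {
    assert_eq!(num_children(3), Ok(3));
    // More than i32::MAX fields would wrap to a negative number with
    // `as i32`; the checked conversion reports it instead.
    assert!(num_children(i32::MAX as usize + 1).is_err());
    println!("checked conversion ok");
}
```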
```rust
}
}

// struct RowGroup {
```
I double checked how to find this, and see that this maps straightforwardly to the original thrift 👍
Yes, I wanted the comment there to explain the magic numbers.
```rust
if let Some(column_orders) = self.file_metadata.column_orders() {
    last_field_id = column_orders.write_thrift_field(writer, 7, last_field_id)?;
}
if let Some(algo) = self.encryption_algorithm.as_ref() {
```
I think we could avoid a lot of repetition if you just put the #[cfg(not(feature = "encryption"))] on these two fields, like
```diff
-if let Some(algo) = self.encryption_algorithm.as_ref() {
+#[cfg(feature = "encryption")]
+if let Some(algo) = self.encryption_algorithm.as_ref() {
```
Will do...same comment as above about the encryption cfgs.
Thanks @alamb! I'll clean this up and then move on to the last major PR, which I hope will be the last of the breaking changes. The rest should be fine tuning and testing. Field skipping should not be a breaking change either. FWIW the next PR will also add the beginnings of thrift documentation.
BTW I am thinking we should document all this great work in a blog post or something. Nothing actionable yet, just FYI.
```rust
impl HeapSize for ColumnChunkMetaData {
    fn heap_size(&self) -> usize {
        #[cfg(feature = "encryption")]
```
@alamb is this more palatable? I wish I could use if cfg!() ... else ... but can't because the fields don't exist if encryption isn't enabled.
I think this is reasonable to me
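For illustration, a self-contained sketch of the pattern (struct and field names are stand-ins, not the real ColumnChunkMetaData): an `if cfg!(feature = "encryption")` would not compile because the field simply does not exist without the feature, so the attribute has to gate the whole block:

```rust
// Stand-in struct: only carries the crypto field when the feature is on.
struct ChunkMetaSketch {
    statistics_bytes: usize,
    #[cfg(feature = "encryption")]
    column_crypto_metadata: Vec<u8>,
}

impl ChunkMetaSketch {
    fn new(statistics_bytes: usize) -> Self {
        Self {
            statistics_bytes,
            #[cfg(feature = "encryption")]
            column_crypto_metadata: Vec::new(),
        }
    }

    fn heap_size(&self) -> usize {
        #[allow(unused_mut)] // only mutated when the feature is enabled
        let mut size = self.statistics_bytes;
        // An `if cfg!(...)` here would still type-check the field access
        // with the feature off and fail to compile; the attribute removes
        // the block from compilation entirely instead.
        #[cfg(feature = "encryption")]
        {
            size += self.column_crypto_metadata.capacity();
        }
        size
    }
}

fn main() {
    let meta = ChunkMetaSketch::new(64);
    assert_eq!(meta.heap_size(), 64);
}
```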
Which issue does this PR close?
Note: this targets a feature branch, not main
Rationale for this change
This PR closes the loop: Parquet metadata is now completely handled by the new code.
What changes are included in this PR?
Changes the metadata builders to use the new structs rather than those from `format`. As a consequence, the `close` methods no longer return a `format::FileMetaData` but instead return a `ParquetMetaData`.

Are these changes tested?

Covered by existing tests, but many tests were modified to deal with the switch to `ParquetMetaData` mentioned above.

Are there any user-facing changes?
Yes