Describe the bug
See #246 and 6a65543. There are some notes referring to this issue in that PR.
The issue is that the different parquet implementations handle non-null structs (and possibly lists) differently.
Spark doesn't seem to provide a way to declare non-nullable struct schemas, so structs are nullable by default. If one writes a non-null struct that has nullable children, pyspark won't read the resulting file.
The C++ implementation reads this back fine, perhaps because there's a good mapping to Arrow data.
The Rust implementation will write the file, but won't read it back.
I'm also uncertain whether a non-null parent with a nullable child is logically valid, or compliant with the Arrow specification.
To Reproduce
- Create a RecordBatch that has a non-null struct with a nullable child.
- Write that to Parquet
- Read the Parquet file with Spark
Expected behavior
There should be one clear, documented behaviour.
Additional context
See the commit 6a65543, specifically the comments added around the tests.