Describe the bug
See #246 and 6a65543. There are some notes referring to this issue in that PR.
The issue is that the different parquet implementations handle non-null structs (and possibly lists) differently.
Spark doesn't seem to provide a way to declare non-nullable struct schemas, so structs are nullable by default. If one writes a non-null struct that has nullable children, pyspark won't read the resulting file.
The C++ implementation reads this back fine, perhaps because there's a good mapping to Arrow data.
The Rust implementation will write the file, but won't read it back.
I'm also uncertain whether a non-null parent with a nullable child is logically valid, or compliant with the Arrow specification.
To Reproduce
- Create a RecordBatch that has a non-null struct with a nullable child.
- Write that to Parquet
- Read the Parquet file with Spark
Expected behavior
There should be one clear, documented behaviour.
Additional context
See the commit 6a65543, specifically the comments added around the tests.