Skip to content

PARQUET-2249: Add IEEE-754 total order and nan count for floating types#3393

Open
wgtmac wants to merge 4 commits intoapache:masterfrom
wgtmac:PARQUET-2249
Open

PARQUET-2249: Add IEEE-754 total order and nan count for floating types#3393
wgtmac wants to merge 4 commits intoapache:masterfrom
wgtmac:PARQUET-2249

Conversation

@wgtmac
Copy link
Copy Markdown
Member

@wgtmac wgtmac commented Feb 12, 2026

Rationale for this change

PoC implementation for the spec change: apache/parquet-format#514

What changes are included in this PR?

  • Added a new ColumnOrder for IEEE 754 total order.
  • Added comparators for Float, Double and Float16.
  • Added nan_count to statistics and column index for floating types.
  • Added predicate pushdown support for statistics and column index filtering.

Are these changes tested?

Added various test cases for both new metadata and filtering.

Are there any user-facing changes?

Yes, users can now set the new column order but by default it is not used.

Closes #406

@wgtmac wgtmac force-pushed the PARQUET-2249 branch 2 times, most recently from c01b3f3 to 4b7e86b Compare March 6, 2026 15:27
Copy link
Copy Markdown
Contributor

@shangxinli shangxinli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Gang for picking this up and driving it forward!

+1 on the approach. This combined solution addresses both the ordering ambiguity and the NaN pollution concern pragmatically. Looking forward to seeing the arrow-cpp PoC as well.

mergeStatisticsMinMax(stats);
markAsNotEmpty();
}
if (isNanCountSet() && stats.isNanCountSet()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: should we also validate that both stats have the same column order before merging? Merging IEEE 754 stats with TYPE_DEFINED_ORDER stats could produce incorrect min/max since NaN handling differs. Though type.equals() on line 408 may already cover this if column order is part of type equality.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think nan_count is independent of the column order. Here we are conservative that if any side is missing, we should drop it.

Comment on lines +109 to +113
if (!Float.isNaN(min_value)) {
if (Float.isNaN(min) || comparator().compare(min, min_value) > 0) {
min = min_value;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!Float.isNaN(min_value)) {
if (Float.isNaN(min) || comparator().compare(min, min_value) > 0) {
min = min_value;
}
}
if (!Float.isNaN(min_value)) {
if (Float.isNaN(min) || comparator().compare(min, min_value) > 0) {
min = min_value;
}
} else if (Float.isNan(min) && comparator().compare(min, min_value) > 0) {
min = min_value;
}

IIUC the wording of the most recent proposal is that the statistics must contain the min and max NaN values for an all-NaN page/chunk. I think you need to still update the stats for an incoming NaN if the current min/max is NaN.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a comment from @rdblue to simply use a standard NaN: apache/parquet-format#514 (comment)

IMO, currently NaN values are just sentinels for min/max to indicate a all-NaN page/chunk. We should be conservative not to depend on the order of NaN values for filtering.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there's the comment, but the wording hasn't yet changed. And given the other comments in that thread I'm not sure it will be changed. But at this stage in the process I guess it doesn't matter.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the spec to follow the total order. Now this implementation has been changed to reflect that.

@etseidl
Copy link
Copy Markdown
Contributor

etseidl commented Apr 2, 2026

@wgtmac thank you for adding the interop test! 🙏 In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns.

Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0. The latter is to make sure that the old rules regarding 0 min being set to -0 and 0 max set +0 are no longer followed with the new ordering.

@wgtmac
Copy link
Copy Markdown
Member Author

wgtmac commented Apr 3, 2026

In arrow-rs we've made the decision to only write the new column order for floats, so I can't reproduce the total order columns.

Did you mean arrow-rs will no longer write floats with the legacy TypeDefinedOrder? From the perspective of interoperability test, I think this is fine if it does not fail when reading files produced by other writers.

Some things I think need to be added are negative NaNs, as well as examples where the min and/or max are 0

That's a good suggestion! I have updated the floating-point interop coverage to add explicit ZERO_MIN and ZERO_MAX cases, so we now verify that IEEE-754 total order no longer rewrites +0 min to -0 or -0 max to +0. I also expanded the NaN coverage to include both negative and positive NaN patterns.

While debugging the test, I found that the Java implementation uses Float.floatToIntBits instead of Float.floatToRawIntBits (same for double) which canonicalizes NaN bits and pollutes both values and stats. I fixed the float/double write paths to preserve raw NaN bits instead of canonicalizing them.

@etseidl
Copy link
Copy Markdown
Contributor

etseidl commented Apr 3, 2026

Thank you. I hope to have the rust tests done today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants