GH-49885: [C++][Python] Bind unresolved Substrait expressions using a supplied schema #49886
Draft
malinjawi wants to merge 1 commit into apache:main
Rationale for this change
This is follow-up work to GH-33985 / PR #34834 now that Substrait can represent unresolved / partially bound expressions (see substrait-io/substrait#515).
Arrow can currently deserialize bound Substrait ExtendedExpression messages, but it cannot yet consume unresolved expressions that contain:
- Expression.NamedExpression
- Type.Unknown
- add:unknown_unknown
To support front-end filter / projection workflows, Arrow should be able to deserialize these messages using a supplied Arrow schema, bind unresolved names and types against that schema, and then return normal Arrow compute expressions.
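The binding step described above can be illustrated with a small, self-contained sketch. It is a toy model only, not Arrow's or Substrait's implementation: every class and function name here (`NamedRef`, `FieldRef`, `Call`, `bind`, `signature`) is hypothetical, and types are plain strings. It shows a named reference being resolved to a field index and type from a supplied schema, which turns a signature like `add:unknown_unknown` into a concrete one.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical toy model: none of these names exist in Arrow or Substrait.


@dataclass(frozen=True)
class NamedRef:
    """An unresolved column reference, identified only by name."""
    name: str


@dataclass(frozen=True)
class FieldRef:
    """A bound reference: position in the schema plus the resolved type."""
    index: int
    type: str


@dataclass(frozen=True)
class Call:
    """A function call over a tuple of argument expressions."""
    function: str
    args: tuple


Schema = List[Tuple[str, str]]  # (field name, type name) pairs


def bind(expr, schema: Schema):
    """Replace every NamedRef in expr with a FieldRef looked up in schema."""
    if isinstance(expr, NamedRef):
        for index, (name, type_name) in enumerate(schema):
            if name == expr.name:
                return FieldRef(index, type_name)
        raise KeyError(f"no field named {expr.name!r} in schema")
    if isinstance(expr, Call):
        return Call(expr.function, tuple(bind(arg, schema) for arg in expr.args))
    return expr  # already bound


def signature(call: Call) -> str:
    """Build a 'name:type_type' signature; unbound arguments read as 'unknown'."""
    arg_types = [a.type if isinstance(a, FieldRef) else "unknown" for a in call.args]
    return call.function + ":" + "_".join(arg_types)


schema = [("x", "i64"), ("y", "i64")]
unresolved = Call("add", (NamedRef("x"), NamedRef("y")))
bound = bind(unresolved, schema)
```

In this sketch `signature(unresolved)` yields `add:unknown_unknown` and `signature(bound)` yields `add:i64_i64`; the actual resolution in the PR goes through Arrow's function registry rather than string signatures.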
This PR depends on the Substrait protocol change in substrait-io/substrait#1063, so it should remain draft until Arrow can pin to a Substrait release that includes those protocol changes.
What changes are included in this PR?
This PR adds schema-aware deserialization for unresolved Substrait expressions.
On the C++ side:
- DeserializeExpressions(buf, input_schema, ...) overload
- Expression.NamedExpression to Arrow FieldRef
- Type.Unknown as a bind-time placeholder instead of an executable Arrow type
- ExtendedExpression.base_schema
- extension:io.substrait:unknown to resolve through Arrow's existing function registry
On the Python side:
- schema= support to:
  - pyarrow.substrait.deserialize_expressions(...)
  - pyarrow.substrait.BoundExpressions.from_substrait(...)
  - pyarrow.compute.Expression.from_substrait(...)
- SubstraitSchema.to_pysubstrait() work with either substrait.proto or generated protobuf module layouts
Testing added:
- base_schema
Are these changes tested?
Yes.
Validated locally with:
- pyarrow.dataset flows using unresolved projection and filter expressions
The local end-to-end validation was run against an Arrow build using a Substrait archive containing the protocol changes from substrait-io/substrait#1063.
Are there any user-facing changes?
Yes.
The change is purely additive. It introduces the following entry points for schema-aware deserialization of unresolved Substrait expressions:
- DeserializeExpressions(const Buffer&, const Schema&, ...)
- pyarrow.substrait.deserialize_expressions(..., schema=...)
- pyarrow.substrait.BoundExpressions.from_substrait(..., schema=...)
- pyarrow.compute.Expression.from_substrait(..., schema=...)
These changes are intended for unresolved / partially bound Substrait expression workflows and do not change the existing bound-expression API behavior.
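As a rough illustration of how the Python entry point might be used, the sketch below filters an in-memory table with an unresolved expression. It assumes the `schema=` keyword lands as described in this PR (it is not available in released pyarrow builds) and that the serialized message carries a single filter expression. The helper name `filter_with_unresolved_expression` and its parameters are hypothetical.

```python
def filter_with_unresolved_expression(table, substrait_bytes):
    """Filter `table` with an unresolved Substrait ExtendedExpression.

    Assumption: pyarrow is built with this PR, so deserialize_expressions
    accepts the proposed `schema=` keyword. `substrait_bytes` is a serialized
    ExtendedExpression message supplied by the caller.
    """
    # Imported lazily so that defining this helper does not require pyarrow.
    import pyarrow.dataset as ds
    import pyarrow.substrait as pa_substrait

    # Proposed in this PR: bind unresolved names and types against the schema.
    bound = pa_substrait.deserialize_expressions(
        substrait_bytes, schema=table.schema
    )

    # BoundExpressions.expressions maps output names to compute expressions.
    # Assumption: the message contains exactly one (filter) expression.
    filter_expr = next(iter(bound.expressions.values()))

    # Apply the bound expression as an ordinary dataset filter.
    return ds.dataset(table).to_table(filter=filter_expr)
```

Without `schema=`, the existing call path is unchanged and still expects a fully bound message.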
Additional context
Portions of this change were developed with AI assistance and then manually reviewed, built, debugged, and validated.