
GH-49885: [C++][Python] Bind unresolved Substrait expressions using a supplied schema #49886

Draft
malinjawi wants to merge 1 commit into apache:main from malinjawi:malinjawi/unbound-expressions-arrow

malinjawi (Contributor) commented Apr 28, 2026

Rationale for this change

This is follow-up work to GH-33985 / PR #34834 now that Substrait can represent unresolved / partially bound expressions (see substrait-io/substrait#515).

Arrow can currently deserialize bound Substrait ExtendedExpression messages, but it cannot yet consume unresolved expressions that contain:

  • Expression.NamedExpression
  • Type.Unknown
  • unresolved function signatures such as add:unknown_unknown

To support front-end filter / projection workflows, Arrow should be able to deserialize these messages using a supplied Arrow schema, bind unresolved names and types against that schema, and then return normal Arrow compute expressions.
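To make the unresolved-signature case concrete, here is a minimal sketch of how a compound name such as `add:unknown_unknown` could be split and then completed once argument types are known from a schema. It is a toy model in plain Python, not Arrow's implementation: the `Signature` class and the `parse_signature` and `resolve` helpers are hypothetical names, and splitting argument types on `_` is an assumption that only holds for simple short type names.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

UNKNOWN = "unknown"  # placeholder type name used in unresolved signatures


@dataclass(frozen=True)
class Signature:
    """Hypothetical holder for a compound function name like 'add:i32_i32'."""
    name: str
    arg_types: Tuple[str, ...]

    def compound(self) -> str:
        # Rebuild the 'name:type_type' form from the parts.
        return f"{self.name}:{'_'.join(self.arg_types)}"


def parse_signature(compound: str) -> Signature:
    # Split 'add:unknown_unknown' into the function name and its argument types.
    name, _, args = compound.partition(":")
    return Signature(name, tuple(args.split("_")) if args else ())


def is_unresolved(sig: Signature) -> bool:
    # A signature is unresolved while any argument type is still the placeholder.
    return UNKNOWN in sig.arg_types


def resolve(sig: Signature, bound_types: Sequence[str]) -> Signature:
    # Replace each placeholder with the type found at bind time; keep known types.
    if len(bound_types) != len(sig.arg_types):
        raise ValueError("argument count does not match the signature")
    resolved = tuple(
        bound if declared == UNKNOWN else declared
        for declared, bound in zip(sig.arg_types, bound_types)
    )
    return Signature(sig.name, resolved)


sig = parse_signature("add:unknown_unknown")
print(is_unresolved(sig))                       # True
print(resolve(sig, ["i32", "i32"]).compound())  # add:i32_i32
```

In the PR itself the lookup goes through Arrow's existing function registry; the sketch only shows why the placeholder types have to be filled in before a concrete function can be selected.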

This PR depends on the Substrait protocol change in substrait-io/substrait#1063, so it should remain draft until Arrow can pin to a Substrait release that includes those protocol changes.

What changes are included in this PR?

This PR adds schema-aware deserialization for unresolved Substrait expressions.

On the C++ side:

  • add a DeserializeExpressions(buf, input_schema, ...) overload
  • bind Expression.NamedExpression to Arrow FieldRef
  • treat Type.Unknown as a bind-time placeholder instead of an executable Arrow type
  • validate supplied schema names against unresolved ExtendedExpression.base_schema
  • allow unresolved function ids under extension:io.substrait:unknown to resolve through Arrow's existing function registry
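The binding steps listed above can be sketched in a few lines of plain Python. This is an illustrative model only, assuming expressions are nested tuples and a schema is a list of `(name, type)` pairs; `BindError`, `check_base_schema`, and `bind` are hypothetical names and do not correspond to functions in the Arrow C++ code.

```python
from typing import List, Optional, Sequence, Tuple

Schema = Sequence[Tuple[str, str]]  # assumed shape: [(field name, type name), ...]


class BindError(ValueError):
    """Raised when an unresolved expression cannot be bound."""


def check_base_schema(base_names: Sequence[str], schema: Schema) -> None:
    # The supplied schema must use the same field names, in the same order,
    # as the unresolved base_schema carried by the message.
    supplied: List[str] = [name for name, _ in schema]
    if list(base_names) != supplied:
        raise BindError(
            f"schema names {supplied} do not match base_schema names {list(base_names)}"
        )


def bind(expr: tuple, schema: Optional[Schema]) -> tuple:
    # Toy expression forms: ("literal", value), ("named", name), ("call", fn, [args]).
    kind = expr[0]
    if kind == "literal":
        return expr
    if kind == "named":
        if schema is None:
            raise BindError("a schema is required to bind named expressions")
        for index, (name, type_name) in enumerate(schema):
            if name == expr[1]:
                # The name becomes a positional field reference with a concrete type.
                return ("field", index, type_name)
        raise BindError(f"no field named {expr[1]!r} in the supplied schema")
    if kind == "call":
        # Bind arguments recursively; function lookup is out of scope here.
        return ("call", expr[1], [bind(arg, schema) for arg in expr[2]])
    raise BindError(f"unsupported expression kind {kind!r}")


schema = [("x", "i32"), ("y", "i32")]
check_base_schema(["x", "y"], schema)
print(bind(("call", "add", [("named", "x"), ("named", "y")]), schema))
# ('call', 'add', [('field', 0, 'i32'), ('field', 1, 'i32')])
```

The two failure paths in the sketch (no schema, mismatched names) mirror the negative cases listed under testing.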

On the Python side:

  • add optional schema= support to:
    • pyarrow.substrait.deserialize_expressions(...)
    • pyarrow.substrait.BoundExpressions.from_substrait(...)
    • pyarrow.compute.Expression.from_substrait(...)
  • make SubstraitSchema.to_pysubstrait() work with either substrait.proto or generated protobuf module layouts
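One common way to tolerate more than one module layout is an import fallback. The sketch below shows that general pattern using only the standard library; it is not taken from the PR, `import_first` is a hypothetical helper, and any candidate module names other than `substrait.proto` would be assumptions about how the generated protobuf modules are laid out.

```python
import importlib
from types import ModuleType
from typing import Sequence


def import_first(candidates: Sequence[str]) -> ModuleType:
    """Return the first module in `candidates` that can be imported."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed and try the next layout.
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidate modules could be imported: " + "; ".join(errors)
    )


# Demonstration with standard-library names so the snippet runs anywhere;
# real code would list the Substrait module paths it wants to support.
module = import_first(["no_such_module_for_demo", "json"])
print(module.__name__)  # json
```

Collecting the per-candidate errors keeps the final message useful when no supported layout is installed.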

Testing added:

  • unresolved projection binding with a supplied schema
  • unresolved filter binding with a supplied schema
  • failure when no schema is supplied
  • failure when the supplied schema does not match the unresolved base_schema
  • combined unresolved filter + projection scanner flow

Are these changes tested?

Yes.

Validated locally with:

  • targeted C++ Substrait serde coverage
  • targeted Python Substrait tests
  • end-to-end pyarrow.dataset flows using unresolved projection and filter expressions
  • negative cases for missing schema and schema mismatch

The local end-to-end validation was run against an Arrow build using a Substrait archive containing the protocol changes from substrait-io/substrait#1063.

Are there any user-facing changes?

Yes.

This PR introduces new, purely additive API surface for schema-aware deserialization of unresolved Substrait expressions:

  • C++:
    • DeserializeExpressions(const Buffer&, const Schema&, ...)
  • Python:
    • pyarrow.substrait.deserialize_expressions(..., schema=...)
    • pyarrow.substrait.BoundExpressions.from_substrait(..., schema=...)
    • pyarrow.compute.Expression.from_substrait(..., schema=...)

These changes are intended for unresolved / partially bound Substrait expression workflows and do not change the existing bound-expression API behavior.

Additional context

Portions of this change were developed with AI assistance and then manually reviewed, built, debugged, and validated.

@github-actions

⚠️ GitHub issue #49885 has been automatically assigned in GitHub to PR creator.

github-actions bot added the "awaiting review" label Apr 28, 2026