
Add support for lambda column capture#21323

Open
gstvg wants to merge 3 commits intoapache:mainfrom
gstvg:lambda_capture


@gstvg
Contributor

@gstvg gstvg commented Apr 2, 2026

Which issue does this PR close?

Part of #21172

Rationale for this change

Capture support was left out of the core lambda support PR to keep that PR small, and because it requires further discussion that isn't tied to basic support.

What changes are included in this PR?

Lambda capture
list_values_row_number helper to adjust a list to the lambda scope
Make #18329 lambda-aware
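
As a rough illustration of what a helper like list_values_row_number might compute (an assumption drawn from its name, not from the PR's code): given Arrow-style list offsets, each flattened list value is mapped to the row that owns it, so a captured column can be expanded to line up with the lambda's per-element scope. The names `row_numbers_from_offsets` and `expand_captured` and the plain `Vec` types below are hypothetical.

```rust
/// Hypothetical sketch: for Arrow-style list offsets, return the owning row
/// index of every flattened list value. Assumes offsets are non-decreasing.
fn row_numbers_from_offsets(offsets: &[usize]) -> Vec<usize> {
    let mut rows = Vec::new();
    // Each adjacent pair (start, end) describes one row's slice of values.
    for (row, window) in offsets.windows(2).enumerate() {
        let len = window[1] - window[0];
        rows.extend(std::iter::repeat(row).take(len));
    }
    rows
}

/// Hypothetical sketch: expand a captured column so it has one entry per
/// flattened list value, matching the scope the lambda body is evaluated in.
fn expand_captured<T: Clone>(column: &[T], offsets: &[usize]) -> Vec<T> {
    row_numbers_from_offsets(offsets)
        .into_iter()
        .map(|row| column[row].clone())
        .collect()
}
```

With offsets `[0, 2, 2, 5]` (three rows, the second one empty), the values belong to rows 0, 0, 2, 2, 2, and a captured column is repeated accordingly.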

Are these changes tested?

sqllogictests for lambda capture and CaseWhen
unit tests for list_values_row_number

Are there any user-facing changes?

This adds breaking changes to unreleased items only.

@github-actions github-actions Bot added the sql, logical-expr, physical-expr, optimizer, core, sqllogictest, substrait, catalog, common, execution, proto, functions, datasource, ffi, and spark labels Apr 2, 2026
} else if let Some(lambda_variable) =
    expr.as_any().downcast_ref::<LambdaVariable>()
{
    used_column_indices.insert(lambda_variable.index());
Member

@rluvaton rluvaton Apr 5, 2026


I'm 98% sure this has a bug where lambda variable indices conflict with column indices. Even if you separate lambda variable indices from column indices, you can still have problems with nested lambdas that use a variable from an outer lambda inside an inner one.

Contributor Author


I added a sqllogictest that I hope covers all the cases you cited and more (4932cae). Your snippet at #21231 (comment) puts lambda variables first in the scoped schema and external columns after them. Here, lambda variables are pushed to the end of the outer schema, which still includes unreferenced columns. On a name conflict (a lambda variable shadows a field from the outer schema), we rename the shadowed field to a unique name (5c5ca19#diff-a3e127629e9516ec496d656ebb53a1e8bf730eb02d219c4ce42ee47572685844R253-R325, 5c5ca19#diff-7fb0a64e734f54d94d48e9e02c51573a3678205f9ee8e2afaf41d686187a285eR440-R489).

That way, once a field has been introduced into the schema, whether as a column in the outermost schema or as a lambda variable in an inner schema, its index never changes, regardless of how many new scopes are created from it down the tree. Because of that, the CaseWhen optimization (as well as the same optimization in lambdas) can safely collect all indices and assume that any index out of bounds for the scoped batch it is projecting refers to an inner lambda variable that is not yet available.

It still needs to rewrite all of them, since they were originally computed against the unprojected, full schema. Any projection of an outer schema affects the indices of all its derived inner schemas, so the rewrite must be propagated down the tree for every projection (an inner projection cannot know how to rewrite the indices of an outer projection).
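
To make the index-stability argument concrete, here is a minimal sketch of the scoping scheme described above, using plain field names in place of Arrow fields. The function names, the `__shadowed_` prefix, and the string-based schema are all hypothetical and are not taken from the PR.

```rust
/// Hypothetical sketch of building a lambda-scoped schema from an outer one.
/// Field names stand in for full Arrow fields.
fn scoped_schema(outer: &[String], lambda_params: &[&str]) -> Vec<String> {
    let mut scoped: Vec<String> = outer.to_vec();
    for (i, field) in scoped.iter_mut().enumerate() {
        // A lambda parameter shadows this outer field: rename the outer field
        // to a unique placeholder, but keep it at the same position.
        if lambda_params.contains(&field.as_str()) {
            let renamed = format!("__shadowed_{}_{}", i, field);
            *field = renamed;
        }
    }
    // Lambda variables are appended at the end, so existing indices never move.
    scoped.extend(lambda_params.iter().map(|p| p.to_string()));
    scoped
}

/// Resolve a name to its index in a scoped schema.
fn resolve(schema: &[String], name: &str) -> Option<usize> {
    schema.iter().rposition(|f| f == name)
}
```

Starting from `[text, list, number]`, introducing a lambda parameter `list` yields a four-field schema in which `number` is still at index 2 and the new `list` is at index 3. Nesting another `list` parameter appends index 4 and leaves every earlier index unchanged.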

@rluvaton
Member

@gstvg do you want to bring this PR up to date with the latest main and continue working on it, so it can ship in the same release as the lambda support and we avoid two rounds of breaking changes?

@gstvg
Contributor Author

gstvg commented Apr 30, 2026

@rluvaton Sure, I will work on this later today. I will ping you when I finish.

@github-actions github-actions Bot removed the sql, optimizer, core, substrait, catalog, execution, proto, and datasource labels Apr 30, 2026
@github-actions github-actions Bot removed the ffi and spark labels Apr 30, 2026
@github-actions

github-actions Bot commented Apr 30, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning origin/main
    Building datafusion-common v53.1.0 (current)
error: running cargo-doc on crate 'datafusion-common' failed with output:
-----
   Compiling proc-macro2 v1.0.106
   Compiling quote v1.0.45
   Compiling unicode-ident v1.0.24
   Compiling libc v0.2.186
   Compiling autocfg v1.5.0
   Compiling libm v0.2.16
    Checking cfg-if v1.0.4
   Compiling num-traits v0.2.19
    Checking memchr v2.8.0
   Compiling syn v2.0.117
   Compiling shlex v1.3.0
   Compiling find-msvc-tools v0.1.9
   Compiling serde_core v1.0.228
    Checking itoa v1.0.18
   Compiling zerocopy v0.8.48
    Checking bytes v1.11.1
   Compiling jobserver v0.1.34
   Compiling zmij v1.0.21
   Compiling cc v1.2.61
   Compiling serde_json v1.0.149
    Checking num-integer v0.1.46
    Checking stable_deref_trait v1.2.1
   Compiling getrandom v0.3.4
   Compiling version_check v0.9.5
    Checking siphasher v1.0.2
    Checking iana-time-zone v0.1.65
    Checking chrono v0.4.44
    Checking phf_shared v0.12.1
   Compiling ahash v0.8.12
    Checking num-bigint v0.4.6
   Compiling chrono-tz v0.10.4
    Checking phf v0.12.1
    Checking once_cell v1.21.4
   Compiling synstructure v0.13.2
    Checking arrow-schema v58.1.0
    Checking num-complex v0.4.6
    Checking hashbrown v0.16.1
    Checking writeable v0.6.3
    Checking lexical-util v1.0.7
   Compiling pkg-config v0.3.33
    Checking litemap v0.8.2
   Compiling zerocopy-derive v0.8.48
   Compiling zerofrom-derive v0.1.7
   Compiling yoke-derive v0.8.2
   Compiling zerovec-derive v0.11.3
    Checking zerofrom v0.1.7
    Checking yoke v0.8.2
   Compiling displaydoc v0.2.5
    Checking zerotrie v0.2.4
   Compiling zstd-sys v2.0.16+zstd.1.5.7
    Checking zerovec v0.11.6
    Checking smallvec v1.15.1
   Compiling icu_normalizer_data v2.2.0
   Compiling object v0.37.3
    Checking tinystr v0.8.3
    Checking potential_utf v0.1.5
    Checking utf8_iter v1.0.4
    Checking icu_locale_core v2.2.0
   Compiling icu_properties_data v2.2.0
    Checking icu_collections v2.2.0
    Checking icu_provider v2.2.0
   Compiling semver v1.0.28
   Compiling rustc_version v0.4.1
    Checking lexical-parse-integer v1.0.6
    Checking lexical-write-integer v1.0.6
   Compiling zstd-safe v7.2.4
    Checking lexical-write-float v1.0.6
    Checking lexical-parse-float v1.0.6
    Checking icu_properties v2.2.0
    Checking half v2.7.1
    Checking arrow-buffer v58.1.0
    Checking icu_normalizer v2.2.0
    Checking arrow-data v58.1.0
   Compiling flatbuffers v25.12.19
    Checking aho-corasick v1.1.4
    Checking arrow-array v58.1.0
    Checking futures-sink v0.3.32
    Checking regex-syntax v0.8.10
   Compiling ar_archive_writer v0.5.1
    Checking futures-core v0.3.32
    Checking base64 v0.22.1
    Checking arrow-select v58.1.0
    Checking ryu v1.0.23
    Checking unicode-width v0.2.2
    Checking pin-project-lite v0.2.17
    Checking unicode-segmentation v1.13.2
   Compiling parking_lot_core v0.9.12
   Compiling psm v0.1.31
    Checking comfy-table v7.2.2
    Checking futures-channel v0.3.32
    Checking regex-automata v0.4.14
    Checking arrow-ord v58.1.0
    Checking idna_adapter v1.2.2
    Checking lexical-core v1.0.6
   Compiling futures-macro v0.3.32
    Checking atoi v2.0.0
    Checking foldhash v0.2.0
    Checking allocator-api2 v0.2.21
    Checking alloc-no-stdlib v2.0.4
    Checking slab v0.4.12
    Checking futures-task v0.3.32
    Checking scopeguard v1.2.0
    Checking twox-hash v2.1.2
    Checking equivalent v1.0.2
   Compiling thiserror v2.0.18
    Checking percent-encoding v2.3.2
    Checking futures-io v0.3.32
    Checking bitflags v2.11.1
    Checking futures-util v0.3.32
    Checking form_urlencoded v1.2.2
    Checking lz4_flex v0.13.0
    Checking regex v1.12.3
    Checking hashbrown v0.17.0
    Checking lock_api v0.4.14
    Checking alloc-stdlib v0.2.2
    Checking arrow-cast v58.1.0
    Checking idna v1.1.0
   Compiling thiserror-impl v2.0.18
   Compiling ring v0.17.14
   Compiling stacker v0.1.24
    Checking csv-core v0.1.13
    Checking either v1.15.0
   Compiling getrandom v0.4.2
   Compiling snap v1.1.1
   Compiling paste v1.0.15
    Checking simdutf8 v0.1.5
    Checking itertools v0.14.0
    Checking csv v1.4.0
    Checking parking_lot v0.12.5
    Checking url v2.5.8
    Checking indexmap v2.14.0
    Checking brotli-decompressor v5.0.0
   Compiling async-trait v0.1.89
   Compiling tokio-macros v2.7.0
    Checking http v1.4.0
    Checking ordered-float v2.10.1
    Checking getrandom v0.2.17
    Checking zlib-rs v0.6.3
    Checking zstd v0.13.3
    Checking untrusted v0.9.0
    Checking integer-encoding v3.0.4
    Checking arrow-ipc v58.1.0
    Checking humantime v2.3.0
    Checking byteorder v1.5.0
    Checking thrift v0.17.0
    Checking object_store v0.13.2
    Checking flate2 v1.1.9
    Checking tokio v1.52.1
    Checking brotli v8.0.2
    Checking arrow-json v58.1.0
    Checking arrow-csv v58.1.0
    Checking futures v0.3.32
    Checking arrow-string v58.1.0
    Checking arrow-row v58.1.0
    Checking arrow-arith v58.1.0
   Compiling recursive-proc-macro-impl v0.1.1
   Compiling sqlparser_derive v0.5.0
    Checking log v0.4.29
   Compiling seq-macro v0.3.6
    Checking recursive v0.1.1
    Checking uuid v1.23.1
    Checking arrow v58.1.0
    Checking parquet v58.1.0
    Checking hex v0.4.3
    Checking sqlparser v0.61.0
error[E0432]: unresolved import `object_store::buffered`
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:25:19
    |
 25 | use object_store::buffered::BufWriter;
    |                   ^^^^^^^^ could not find `buffered` in `object_store`
    |
note: found an item that was configured out
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/object_store-0.13.2/src/lib.rs:545:9
    |
544 | #[cfg(feature = "tokio")]
    |       ----------------- the item is gated behind the `tokio` feature
545 | pub mod buffered;
    |         ^^^^^^^^

error[E0282]: type annotations needed
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:98:13
    |
 98 | /             self.w
 99 | |                 .put(bs)
100 | |                 .await
    | |______________________^ cannot infer type

error[E0282]: type annotations needed
   --> /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-58.1.0/src/arrow/async_writer/store.rs:107:13
    |
107 | /             self.w
108 | |                 .shutdown()
109 | |                 .await
    | |______________________^ cannot infer type

Some errors have detailed explanations: E0282, E0432.
For more information about an error, try `rustc --explain E0282`.
error: could not compile `parquet` (lib) due to 3 previous errors
warning: build failed, waiting for other jobs to finish...

-----

error: failed to build rustdoc for crate datafusion-common v53.1.0
note: this is usually due to a compilation error in the crate,
      and is unlikely to be a bug in cargo-semver-checks
note: the following command can be used to reproduce the error:
      cargo new --lib example &&
          cd example &&
          echo '[workspace]' >> Cargo.toml &&
          cargo add --path /home/runner/work/datafusion/datafusion/datafusion/common --features backtrace,force_hash_collisions,object_store,parquet,parquet_encryption,recursive_protection,sql,sqlparser &&
          cargo check &&
          cargo doc

    Building datafusion-expr v53.1.0 (current)
       Built [  26.577s] (current)
     Parsing datafusion-expr v53.1.0 (current)
      Parsed [   0.072s] (current)
    Building datafusion-expr v53.1.0 (baseline)
       Built [  26.397s] (baseline)
     Parsing datafusion-expr v53.1.0 (baseline)
      Parsed [   0.072s] (baseline)
    Checking datafusion-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.335s] 222 checks: 220 pass, 2 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ExecutionProps.lambda_variable_qualifier in /home/runner/work/datafusion/datafusion/datafusion/expr/src/execution_props.rs:76

--- failure method_parameter_count_changed: pub method parameter count changed ---

Description:
A publicly-visible method now takes a different number of parameters, not counting the receiver (self) parameter.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#fn-change-arity
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/method_parameter_count_changed.ron

Failed in:
  datafusion_expr::LambdaArgument::new takes 2 parameters in /home/runner/work/datafusion/datafusion/target/semver-checks/git-origin_main/cb537bd32a347e60e7247edfd0488b7e7280fd04/datafusion/expr/src/higher_order_function.rs:230, but now takes 3 parameters in /home/runner/work/datafusion/datafusion/datafusion/expr/src/higher_order_function.rs:235
  datafusion_expr::LambdaArgument::evaluate takes 1 parameters in /home/runner/work/datafusion/datafusion/target/semver-checks/git-origin_main/cb537bd32a347e60e7247edfd0488b7e7280fd04/datafusion/expr/src/higher_order_function.rs:242, but now takes 2 parameters in /home/runner/work/datafusion/datafusion/datafusion/expr/src/higher_order_function.rs:267

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  55.627s] datafusion-expr
    Building datafusion-functions-nested v53.1.0 (current)
       Built [  34.941s] (current)
     Parsing datafusion-functions-nested v53.1.0 (current)
      Parsed [   0.035s] (current)
    Building datafusion-functions-nested v53.1.0 (baseline)
       Built [  35.468s] (baseline)
     Parsing datafusion-functions-nested v53.1.0 (baseline)
      Parsed [   0.035s] (baseline)
    Checking datafusion-functions-nested v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.186s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  71.879s] datafusion-functions-nested
    Building datafusion-physical-expr v53.1.0 (current)
       Built [  24.987s] (current)
     Parsing datafusion-physical-expr v53.1.0 (current)
      Parsed [   0.043s] (current)
    Building datafusion-physical-expr v53.1.0 (baseline)
       Built [  24.754s] (baseline)
     Parsing datafusion-physical-expr v53.1.0 (baseline)
      Parsed [   0.044s] (baseline)
    Checking datafusion-physical-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.297s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  51.077s] datafusion-physical-expr
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 138.875s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.022s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 142.263s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.023s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.088s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 284.235s] datafusion-sqllogictest
error: aborting due to failure to build rustdoc for crate datafusion-common v53.1.0

@gstvg
Contributor Author

gstvg commented Apr 30, 2026

@rluvaton This is ready for review. Note that I kept the same approach because it fully implements capture and doesn't copy uncaptured columns. The whole tree is exposed via TreeNode, and every Column exposed to regular tree traversals has an index referring to the outer schema, without requiring new tree node methods. It does so by implementing the same optimization as CaseWhen: collect the used indices, rewrite the body into a projected_body while still exposing the unprojected body to TreeNode, and then project the batch before providing it to the function implementer via LambdaArgument.captures. But if using ProjectionExpr is better, as you suggested before, I'm open to it.
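
The collect, rewrite, and project steps can be sketched on a toy expression tree. This is only an illustration under stated assumptions: the `Expr` enum, `project_body`, and the row-of-integers batch are hypothetical stand-ins, not DataFusion's `PhysicalExpr`, `TreeNode`, or `RecordBatch` APIs.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy expression tree; `Column(i)` refers to index `i` of the full schema.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(usize),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Collect every column index the expression reads.
fn collect_indices(expr: &Expr, used: &mut BTreeSet<usize>) {
    match expr {
        Expr::Column(i) => {
            used.insert(*i);
        }
        Expr::Literal(_) => {}
        Expr::Add(l, r) => {
            collect_indices(l, used);
            collect_indices(r, used);
        }
    }
}

/// Rewrite column indices through `mapping` (old index -> projected index).
fn rewrite(expr: &Expr, mapping: &HashMap<usize, usize>) -> Expr {
    match expr {
        Expr::Column(i) => Expr::Column(mapping[i]),
        Expr::Literal(v) => Expr::Literal(*v),
        Expr::Add(l, r) => Expr::Add(
            Box::new(rewrite(l, mapping)),
            Box::new(rewrite(r, mapping)),
        ),
    }
}

/// Return the projected body plus the outer indices to keep in the batch.
/// The original `body` is left untouched, so it can still be exposed as-is.
fn project_body(body: &Expr) -> (Expr, Vec<usize>) {
    let mut used = BTreeSet::new();
    collect_indices(body, &mut used);
    let kept: Vec<usize> = used.into_iter().collect();
    let mapping: HashMap<usize, usize> =
        kept.iter().enumerate().map(|(new, old)| (*old, new)).collect();
    (rewrite(body, &mapping), kept)
}

/// Evaluate an expression against one row of column values.
fn eval(expr: &Expr, row: &[i64]) -> i64 {
    match expr {
        Expr::Column(i) => row[*i],
        Expr::Literal(v) => *v,
        Expr::Add(l, r) => eval(l, row) + eval(r, row),
    }
}
```

Evaluating the projected body against only the kept columns gives the same result as evaluating the original body against the full row, which is the property the optimization relies on.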

Note that I'm using the same approach as #18329 to keep the codebase consistent. My first version, written before #18329 was merged, did not rewrite and project; it simply swapped uncaptured columns for NullArrays, which are cheap to create. While not as elegant as #18329, that is simpler and easier to reason about, especially in the context of deeply nested lambdas and CASE expressions.
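
For comparison, the placeholder alternative can be sketched as follows. The `Column` enum and `null_out_uncaptured` are hypothetical; `Column::Null` only plays the role that an Arrow NullArray would play, recording a length without holding data.

```rust
use std::collections::BTreeSet;

/// Toy column: real values, or a placeholder that only records its length.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Values(Vec<i64>),
    Null(usize),
}

impl Column {
    /// Number of rows in the column.
    fn len(&self) -> usize {
        match self {
            Column::Values(v) => v.len(),
            Column::Null(n) => *n,
        }
    }
}

/// Keep captured columns and swap the rest for cheap null placeholders.
/// Positions are preserved, so no index in the lambda body needs rewriting.
fn null_out_uncaptured(batch: &[Column], captured: &BTreeSet<usize>) -> Vec<Column> {
    batch
        .iter()
        .enumerate()
        .map(|(i, col)| {
            if captured.contains(&i) {
                col.clone()
            } else {
                Column::Null(col.len())
            }
        })
        .collect()
}
```

The trade-off is visible in the sketch: the batch keeps its full width and every index stays valid, at the cost of carrying placeholder columns instead of a narrower projected batch.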

cc @LiaCastaneda @comphead @pepijnve

@gstvg gstvg changed the title [DRAFT] Add support for lambda column capture Add support for lambda column capture Apr 30, 2026
@gstvg gstvg marked this pull request as ready for review April 30, 2026 22:11
Contributor

@comphead comphead left a comment


Thanks @gstvg, I'll check it out this week. By the way, does lambda support nested lambdas?

SELECT transform(
         arr,
         inner -> transform(inner, x -> x * 2)
       ) AS doubled
FROM values (array(array(1,2), array(3,4))) AS t(arr);

@gstvg
Contributor Author

gstvg commented Apr 30, 2026

... does lambda support nested lambdas?

@comphead Yes, there's a test with nested lambdas: column t.list is captured by the outermost lambdas, column number is captured by the innermost lambdas, and the list variable is captured from the outer lambda. There's also some shadowing going on:

# case with inner nested higher order function
query T?I?
select
    t.text,
    t.list,
    t.number,
    case
        when t.number > 30 then array_transform(
            [[t.list]],
            list -> array_transform(
                list,
                list -> array_transform(
                    list,
                    v -> number + v + list[1]
                )
            )
        )
        else array_transform(
            [[t.list]],
            list -> array_transform(
                list,
                list -> array_transform(
                    list,
                    v -> number + list[1]
                )
            )
        )
    end
from t
order by t.number;
----
a [1, 50] 10 [[[11, 11]]]
b [4, 50] 40 [[[48, 94]]]
c [7, 50] 60 [[[74, 117]]]
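
The expected rows can be checked by hand by mirroring the nested transforms with plain Rust closures. This is only a sanity-check sketch and not how DataFusion evaluates lambdas; the function names are hypothetical, and `list[0]` stands in for the 1-indexed SQL `list[1]`.

```rust
/// Mirrors the `then` branch: v -> number + v + list[1], where `list` is the
/// innermost (shadowing) lambda variable.
fn then_branch(number: i64, t_list: &[i64]) -> Vec<Vec<Vec<i64>>> {
    let nested = vec![vec![t_list.to_vec()]]; // [[t.list]]
    nested
        .iter()
        .map(|list| {
            list.iter()
                .map(|list| {
                    list.iter()
                        .map(|v| number + v + list[0])
                        .collect::<Vec<i64>>()
                })
                .collect::<Vec<Vec<i64>>>()
        })
        .collect()
}

/// Mirrors the `else` branch: v -> number + list[1] (the element `v` is unused).
fn else_branch(number: i64, t_list: &[i64]) -> Vec<Vec<Vec<i64>>> {
    let nested = vec![vec![t_list.to_vec()]]; // [[t.list]]
    nested
        .iter()
        .map(|list| {
            list.iter()
                .map(|list| {
                    list.iter()
                        .map(|_v| number + list[0])
                        .collect::<Vec<i64>>()
                })
                .collect::<Vec<Vec<i64>>>()
        })
        .collect()
}

/// Mirrors the CASE expression that selects between the two branches.
fn case_expr(number: i64, t_list: &[i64]) -> Vec<Vec<Vec<i64>>> {
    if number > 30 {
        then_branch(number, t_list)
    } else {
        else_branch(number, t_list)
    }
}
```

For row a (number 10) the else branch applies, so both elements become 10 + 1. For rows b and c the then branch adds the element itself as well, which reproduces the three result rows above.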


Labels

common, functions, logical-expr, physical-expr, sqllogictest


4 participants