Skip to content

BatchCoalescer::push_batch panics on schema mismatch instead of returning error #9389

@bvolpato-dd

Description

@bvolpato-dd

Describe the bug

BatchCoalescer::push_batch uses a hard assert_eq! at coalesce.rs:462 to check the incoming batch column count against the coalescer's schema:

assert_eq!(arrays.len(), self.in_progress_arrays.len());

When a batch with a mismatched number of columns is pushed, this panics the process instead of returning an ArrowError. The function signature is already Result<(), ArrowError>, so it could just return an error here.

This is inconsistent with the rest of the arrow API — RecordBatch::try_new returns Err(ArrowError::InvalidArgumentError) for the same kind of column count mismatch, and other checks in the same struct (lines 475, 533, 534) use debug_assert! rather than a hard assert.

We hit this in production in a DataFusion-based query service. A connector returned batches whose schema didn't match the plan's output schema, and instead of getting a query error we got a process-wide panic through the repartition operator:

assertion `left == right` failed
  left: 2
 right: 0

To Reproduce

use std::sync::Arc;
use arrow_array::{RecordBatch, UInt32Array};
use arrow_select::coalesce::BatchCoalescer;
use arrow_schema::{DataType, Field, Schema};

#[test]
#[should_panic(expected = "assertion `left == right` failed")]
fn push_batch_panics_on_schema_mismatch() {
    let empty_schema = Arc::new(Schema::empty());
    let mut coalescer = BatchCoalescer::new(empty_schema, 100);

    let schema = Arc::new(Schema::new(vec![
        Field::new("c0", DataType::UInt32, false),
        Field::new("c1", DataType::UInt32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(UInt32Array::from(vec![1, 2, 3])),
            Arc::new(UInt32Array::from(vec![4, 5, 6])),
        ],
    )
    .unwrap();

    // panics instead of returning Err
    let _ = coalescer.push_batch(batch);
}

Also reproduces in the reverse direction (coalescer has 2 fields, batch has 1).

Expected behavior

push_batch should return Err(ArrowError::InvalidArgumentError(...)) so the caller can handle it gracefully, rather than bringing down the process.

Something like:

if arrays.len() != self.in_progress_arrays.len() {
    return Err(ArrowError::InvalidArgumentError(format!(
        "Batch has {} columns but BatchCoalescer expects {}",
        arrays.len(),
        self.in_progress_arrays.len()
    )));
}

Additional context

Affects arrow-select 57.2.0 and 57.3.0 (same code in both). Happy to open a PR for this if it makes sense.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions