Skip to content

[Feature Request]: Enhance data branch diff with fallback, delta output, and summary statistics #23751

@XuPeng-SH

Description

@XuPeng-SH

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

Currently data branch diff has limitations that make it less flexible than Git's diff:

  1. Only works with branched tables: Requires LCA (Lowest Common Ancestor) relationship, cannot compare arbitrary tables
  2. Limited output modes: Only supports full output and output count, lacks incremental change tracking
  3. No detailed statistics: output count only returns total rows, not breakdown of changes

Describe the feature you'd like

1. Fallback to EXCEPT for non-branched tables

Current behavior:

-- Only works if t2 was created from t1
data branch create table t2 from t1{snapshot="sp1"};
data branch diff t2 against t1;  -- ✅ Works (has LCA)

-- Fails if no branch relationship
data branch diff t3 against t4;  -- ❌ Error: no LCA

Proposed behavior:

-- Automatically fallback to EXCEPT-like behavior
data branch diff t3 against t4;  -- ✅ Works (uses set difference)

2. Delta output mode

Proposed syntax:

data branch diff t1 against t2 output delta;

Expected result:

operation | a | b
----------|---|---
+         | 3 | 3  -- Added in t1
-         | 4 | 4  -- Deleted from t1 (exists in t2)

Similar to Git's +/- notation for added/removed lines.

3. Summary statistics output

Proposed syntax:

data branch diff t1 against t2 output summary;

Expected result:

metric    | count
----------|------
added     | 10
deleted   | 5
modified  | 3
unchanged | 100

Provides quick overview without scanning all rows.

Describe implementation you've considered

Fallback mechanism

  • Check if tables have LCA relationship
  • If no LCA: use EXCEPT-based comparison
  • Maintain same output format for consistency

Delta output

  • Add operation column: + (added), - (deleted), ~ (modified)
  • Use primary key to match rows for modification detection
  • If no PK: treat as add/delete only

Summary statistics

  • Leverage existing diff computation
  • Aggregate results into statistics
  • Minimal performance overhead (single pass)

Documentation, Adoption, Use Case, Migration Strategy

Use Cases

  1. Data validation: Compare staging vs production tables without branch relationship
  2. Change tracking: Quickly see what changed between snapshots
  3. Monitoring: Track data drift with summary statistics
  4. Debugging: Identify specific rows added/deleted with delta output

Adoption

  • Backward compatible: existing queries work unchanged
  • New output modes are opt-in via output clause
  • Fallback behavior is transparent to users

Metadata

Metadata

Assignees

Labels

kind/featurepriority/p0Critical feature that should be implemented in this version

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions