Skip to content

fix: protect data branch history with internal pitr#24256

Draft
gouhongshen wants to merge 3 commits intomatrixorigin:mainfrom
gouhongshen:investigate/data-branch-update-insert
Draft

fix: protect data branch history with internal pitr#24256
gouhongshen wants to merge 3 commits intomatrixorigin:mainfrom
gouhongshen:investigate/data-branch-update-insert

Conversation

@gouhongshen
Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #23751

What this PR does / why we need it:

  • create one-year internal PITR records for data branch source and target tables/databases so the branch point remains recoverable after object merge/GC
  • reserve and hide data branch internal PITR names from user PITR create/drop/alter/show/restore paths
  • clean data branch internal PITR records after normal DROP TABLE, normal DROP DATABASE, and data branch delete when no active branch metadata still references the protected object
  • make current-object snapshot scans collect committed in-memory tombstones at the requested snapshot timestamp instead of the latest transaction snapshot

Root cause:

After the data branch LCA point becomes unrecoverable by historical object state, diff may need to rely on current objects plus commit timestamps to reconstruct the branch-point view. Without pinning the branch point, later object merge/GC can remove the required historical state and cause update rows to be classified as inserts.

Validation:

  • git diff --check
  • CGO_CFLAGS="-I$(pwd)/thirdparties/install/include" CGO_LDFLAGS="-L$(pwd)/thirdparties/install/lib" go test ./pkg/frontend -run 'Test.*Pitr|Test.*DataBranch'
  • CGO_CFLAGS="-I$(pwd)/thirdparties/install/include" CGO_LDFLAGS="-L$(pwd)/thirdparties/install/lib" go test ./pkg/sql/plan -run 'TestBuildCreatePitr|TestBuildDropPitr|TestBuildShow'
  • CGO_CFLAGS="-I$(pwd)/thirdparties/install/include" CGO_LDFLAGS="-L$(pwd)/thirdparties/install/lib" go test ./pkg/sql/compile -run 'Test.*Pitr|Test.*CDC|TestCheckPitr|TestDrop'
  • make build

@gouhongshen
Copy link
Copy Markdown
Contributor Author

Code Review Report

Overview

Metric Value
Files reviewed 11
Findings 5
Must Fix 2
Should Fix 3
Destructive test verdict Insufficient

Summary

The direction is reasonable: internal PITR is the right mechanism to keep data-branch history recoverable after object merge/GC. The PR is not ready as-is. The main correctness issue is that user-facing PITR filters use LIKE '__mo_data_branch_pitr_%' without escaping _, so ordinary user PITR names can be treated as internal. The test delta also does not cover the actual regression this PR is meant to prevent. Cleanup logic has significant synchronous catalog-scan amplification in normal DROP and DATA BRANCH DELETE paths.

Destructive Test Verdict

Conclusion: insufficient.

Covered: reserved-name unit checks and expected SQL string updates for some PITR paths.

Missing: an end-to-end UPDATE-vs-INSERT regression after flush/checkpoint/GC, internal PITR create lifecycle assertions, internal PITR cleanup lifecycle assertions, and a direct test that snapshot fallback collects tombstones at snapshotTS rather than the current transaction snapshot.

Must Fix

1. Unescaped LIKE treats normal user PITR names as internal — pkg/frontend/pitr.go:65

Category: Bug / user-visible correctness

Several user-facing filters use patterns like pitr_name not like '__mo_data_branch_pitr_%'. In SQL LIKE, _ is a single-character wildcard, not a literal underscore. That means a user PITR name that does not start with the reserved prefix can still match the pattern and be hidden or ignored by public PITR flows. Examples of affected sites include pkg/frontend/pitr.go:65, pkg/frontend/show_recovery_window.go:201, pkg/sql/plan/build_show.go:1006, and pkg/sql/compile/ddl.go:4701.

Impact: such a user PITR can be filtered out from SHOW PITR, duplicate checks, recovery-window logic, CDC/PITR granularity checks, or DDL cleanup paths even though isReservedPitrName() would allow the name.

Root fix: do not use an unescaped LIKE pattern for the internal prefix. Prefer filtering public paths by kind = 'user' only. Where legacy prefix matching is still needed, use an escaped pattern with ESCAPE or a shared exact-prefix helper whose SQL semantics match the Go strings.HasPrefix rule.

2. The PR lacks a regression test for the UPDATE becoming INSERT failure — pkg/frontend/data_branch.go:381

Category: Test / regression coverage

This is a high-risk storage/history fix, but the changed tests mostly update SQL strings and reserved-name checks. There is no new test that fails before this PR and passes after it for the original symptom: data branch diff classifying an UPDATE as INSERT after merge/GC makes the LCA path unrecoverable.

Required coverage should include at least:

create table base(a int primary key, b int);
insert into base values (1, 10), (2, 20);
data branch create table br from base;
update br set b = 11 where a = 1;
-- force flush/checkpoint/GC or the closest BVT-supported equivalent
-- assert internal PITR exists
-- assert data branch diff reports UPDATE for a=1, not INSERT

Also add a focused test for ScanSnapshotWithCurrentRanges / collectTombstonesAtSnapshot so future changes cannot accidentally go back to collecting committed in-memory tombstones at the current transaction snapshot.

Should Fix

1. Normal DROP paths trigger global internal PITR cleanup — pkg/sql/compile/ddl.go:3408

Category: Performance / scalability

DROP TABLE calls cleanupDataBranchInternalPitrsAfterDDL(), which scans all internal table PITRs and all internal database PITRs. Each table PITR probes mo_branch_metadata; each database PITR may also read historical mo_tables. DROP DATABASE has the same issue at pkg/sql/compile/ddl.go:291. The cost scales with global internal PITR history, not with the object being dropped.

Root fix: make synchronous DDL cleanup scoped to the affected account/database/table IDs. If global reconciliation is needed for legacy rows, run it as a paginated, rate-limited background cleanup job, not inside every user DDL.

2. Database PITR cleanup materializes historical table IDs into Go — pkg/sql/compile/ddl.go:2020

Category: Performance / memory

dataBranchDatabaseTableIDsAt() loads all rel_id rows from mo_tables {MO_TS = create_time} into a Go slice, then probes mo_branch_metadata in batches of 512. A large database turns one cleanup decision into large catalog result materialization plus many follow-up queries. The same pattern exists in the frontend cleanup helper around pkg/frontend/data_branch.go:1161.

Root fix: push the existence check into SQL with an EXISTS / semi-join and LIMIT 1, or maintain a direct active-reference index/counter for database-level branch metadata. The code only needs to know whether any active reference remains; it does not need the full table ID list in memory.

3. DATA BRANCH DELETE does scoped cleanup and then global cleanup again — pkg/frontend/data_branch.go:1368

Category: Performance / design

dataBranchDeleteTable first calls cleanupDataBranchTablePitrIfUnused() for the current table, then immediately runs cleanupAllDataBranchTablePitrsIfUnused() and cleanupAllDataBranchDatabasePitrsIfUnused(). dataBranchDeleteDatabase has the same pattern around pkg/frontend/data_branch.go:1464-1478. This turns every delete into global PITR reconciliation.

Root fix: keep request-path cleanup limited to the current table/database and move full reconciliation to a dedicated background path. That keeps latency proportional to the user operation and avoids repeated global scans in batch deletes.

4. Internal PITR rules are duplicated across frontend, compile, and plan packages — pkg/frontend/data_branch.go:54

Category: Maintainability / contract drift

The reserved prefix, internal kind, sys_mo_catalog_pitr, and filtering rules are repeated in pkg/frontend/data_branch.go, pkg/frontend/pitr.go, pkg/sql/compile/ddl.go, pkg/sql/plan/build_ddl.go, and pkg/sql/plan/build_show.go. The unescaped LIKE issue above is a concrete example of contract drift between Go prefix checks and SQL prefix checks.

Root fix: centralize the internal PITR contract: name construction, reserved-name check, public visibility predicate, and cleanup predicate should come from one shared helper or catalog-level contract. The public paths should not hand-roll their own string predicates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working kind/test-ci size/XL Denotes a PR that changes [1000, 1999] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants