Bug: data branch merge fails with ErrTxnNeedRetryWithDefChanged in multi-CN PESSIMISTIC mode #24099

@XuPeng-SH

Summary

data branch merge fails intermittently in multi-CN PESSIMISTIC mode with:

txn need retry in rc mode, def changed

Error code: ErrTxnNeedRetryWithDefChanged (20631)

Reproduction

Observed in the CI job: multi cn e2e bvt test docker compose(PESSIMISTIC)
Test file: test/distributed/cases/git4data/branch/merge/merge_1.sql, line 85:

data branch merge t2 into t1

Root Cause

data branch merge executes internal SQL (CREATE TABLE, INSERT, DELETE) via runSql() in data_branch_helpers.go:168:

opts := executor.Options{}.
    WithDisableIncrStatement().  // disables compile retry
    WithTxn(backSes.GetTxnHandler().GetTxn()).
    WithKeepTxnAlive()

WithDisableIncrStatement() sets disableRetry = true in the compile path (sql_executor.go:418), which prevents compile2.go from retrying on ErrTxnNeedRetryWithDefChanged.

When any internal INSERT/DELETE hits a lock conflict on mo_tables catalog rows (common in multi-CN PESSIMISTIC mode), lock_meta.go:198-203 converts it to ErrTxnNeedRetryWithDefChanged. Since retry is disabled, the error propagates directly to the client.

Additionally, handleBranchMerge in self_handle.go has no retry wrapper of its own.

Call chain

handleBranchMerge (self_handle.go)
  → diffMergeAgency (data_branch.go)
    → mergeDiffs / diffOnBase (parallel)
      → flushSqlValues → execSQLStatements → runSql (data_branch_helpers.go)
        → SQL executor with disableRetry=true
          → compile2 (retry disabled)
            → lock_meta.go:198 → ErrTxnNeedRetryWithDefChanged
              → error surfaces to client

Why it is flaky

In multi-CN PESSIMISTIC mode, distributed lock contention on catalog metadata rows is non-deterministic. The merge creates temp tables via CTAS and then applies INSERT/DELETE to the base table, creating multiple lock points where conflicts can occur. Single-CN tests rarely trigger this.

Suggested Fix Directions

  1. Add retry logic at the handleBranchMerge level: wrap the merge operation in a retry loop for ErrTxnNeedRetryWithDefChanged
  2. Remove WithDisableIncrStatement() for these internal SQL executions (after evaluating whether that is safe)
  3. Add a dedicated retry wrapper in diffMergeAgency

Related Files

  • pkg/frontend/data_branch_helpers.go:168 - WithDisableIncrStatement() usage
  • pkg/frontend/data_branch.go - diffMergeAgency
  • pkg/frontend/self_handle.go - handleBranchMerge entry point
  • pkg/sql/compile/compile2.go:245-303 - retry loop that is bypassed
  • pkg/sql/compile/lock_meta.go:195-210 - error generation
  • pkg/sql/compile/sql_executor.go:418 - disableRetry mapping

Metadata

Labels

kind/bug (Something isn't working)
severity/s0 (Extreme impact: Cause the application to break down and seriously affect the use)
