Summary
data branch merge fails intermittently in multi-CN PESSIMISTIC mode with:
txn need retry in rc mode, def changed
Error code: ErrTxnNeedRetryWithDefChanged (20631)
Reproduction
Observed in CI: multi cn e2e bvt test docker compose(PESSIMISTIC)
Test file: test/distributed/cases/git4data/branch/merge/merge_1.sql line 85
data branch merge t2 into t1
Root Cause
data branch merge executes internal SQL (CREATE TABLE, INSERT, DELETE) via runSql() in data_branch_helpers.go:168:
opts := executor.Options{}.
WithDisableIncrStatement(). // disables compile retry
WithTxn(backSes.GetTxnHandler().GetTxn()).
WithKeepTxnAlive()
WithDisableIncrStatement() sets disableRetry = true in the compile path (sql_executor.go:418), which prevents compile2.go from retrying on ErrTxnNeedRetryWithDefChanged.
When any internal INSERT/DELETE hits a lock conflict on mo_tables catalog rows (common in multi-CN PESSIMISTIC mode), lock_meta.go:198-203 converts it to ErrTxnNeedRetryWithDefChanged. Since retry is disabled, the error propagates directly to the client.
Additionally, handleBranchMerge in self_handle.go has no retry wrapper of its own.
Call chain
handleBranchMerge (self_handle.go)
→ diffMergeAgency (data_branch.go)
→ mergeDiffs / diffOnBase (parallel)
→ flushSqlValues → execSQLStatements → runSql (data_branch_helpers.go)
→ SQL executor with disableRetry=true
→ compile2 (retry disabled)
→ lock_meta.go:198 → ErrTxnNeedRetryWithDefChanged
→ error surfaces to client
Why it is flaky
In multi-CN PESSIMISTIC mode, distributed lock contention on catalog metadata rows is non-deterministic. The merge creates temp tables via CTAS and then applies INSERT/DELETE to the base table, creating multiple lock points where conflicts can occur. Single-CN tests rarely trigger this.
Suggested Fix Directions
- Add retry logic at the handleBranchMerge level: wrap the merge operation in a retry loop for ErrTxnNeedRetryWithDefChanged
- Remove WithDisableIncrStatement() for these internal SQL executions (evaluate if safe)
- Add a dedicated retry wrapper in diffMergeAgency
Related Files
pkg/frontend/data_branch_helpers.go:168 — WithDisableIncrStatement() usage
pkg/frontend/data_branch.go — diffMergeAgency
pkg/frontend/self_handle.go — handleBranchMerge entry point
pkg/sql/compile/compile2.go:245-303 — retry loop that is bypassed
pkg/sql/compile/lock_meta.go:195-210 — error generation
pkg/sql/compile/sql_executor.go:418 — disableRetry mapping