
feat: persist 2pc transaction state to disk#934

Draft
willothy wants to merge 15 commits into main from willothy/2pc-durability

willothy commented Apr 25, 2026

Adds a write-ahead log (WAL), with recovery, for 2PC transaction state so that it survives crashes.

Still needs integration tests.

Closes #911

willothy added 15 commits April 22, 2026 15:50

Defines the on-disk record format for the two-phase commit write-ahead
log: length + crc32c + tag + rmp-serde framing.

Adds WAL config options to General and crc32c as a dependency.
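
A minimal sketch of that framing, assuming a hypothetical layout: a little-endian `u32` payload length, a little-endian CRC-32C over the tag and payload, a one-byte tag, then the payload. The payload is treated as opaque bytes standing in for the rmp-serde output, and the CRC is computed bitwise here instead of through the crc32c crate so the example has no dependencies. `encode_record`, `decode_record`, and `Decoded` are illustrative names, not the PR's API.

```rust
/// Hypothetical header: 4-byte length + 4-byte CRC + 1-byte tag.
const HEADER_LEN: usize = 4 + 4 + 1;

/// CRC-32C (Castagnoli), bitwise, reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

fn read_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Append one framed record to `out`. The CRC covers tag + payload.
fn encode_record(tag: u8, payload: &[u8], out: &mut Vec<u8>) {
    let mut checked = Vec::with_capacity(1 + payload.len());
    checked.push(tag);
    checked.extend_from_slice(payload);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32c(&checked).to_le_bytes());
    out.extend_from_slice(&checked);
}

#[derive(Debug, PartialEq)]
enum Decoded<'a> {
    /// A complete, checksum-valid record; `next` is the offset after it.
    Record { tag: u8, payload: &'a [u8], next: usize },
    /// Not enough bytes for a whole frame.
    Incomplete,
    /// A whole frame whose checksum does not match.
    Corrupt,
}

/// Decode the frame at the start of `buf`.
fn decode_record(buf: &[u8]) -> Decoded<'_> {
    if buf.len() < HEADER_LEN {
        return Decoded::Incomplete;
    }
    let len = read_u32(&buf[0..4]) as usize;
    let crc = read_u32(&buf[4..8]);
    let end = HEADER_LEN + len;
    if buf.len() < end {
        return Decoded::Incomplete;
    }
    if crc32c(&buf[8..end]) != crc {
        return Decoded::Corrupt;
    }
    Decoded::Record { tag: buf[8], payload: &buf[HEADER_LEN..end], next: end }
}
```

Checksumming the tag together with the payload means a flipped tag byte is caught the same way as a damaged payload.
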

Adds SegmentReader for iterating records out of an existing segment
and Segment for appending. SegmentReader::into_writable converts the
reader into a writable segment, truncating any torn tail.
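
The reader side can be sketched as a scan that stops at the first incomplete frame and reports the offset to truncate to. This assumes a hypothetical length, checksum, tag layout; the checksum function is passed in so any CRC works, and `scan_segment`, `Scan`, and `CorruptAt` are illustrative names, not the PR's API. It also assumes that a checksum mismatch on a complete frame counts as corruption rather than as a torn tail.

```rust
/// Result of scanning one segment buffer.
#[derive(Debug, PartialEq)]
struct Scan {
    records: Vec<Vec<u8>>, // tag + payload of each intact record
    valid_len: usize,      // offset to truncate to before appending
}

/// A complete frame at this offset failed its checksum.
#[derive(Debug, PartialEq)]
struct CorruptAt(usize);

/// Walk frames laid out as [len: u32 LE][crc: u32 LE][tag: u8][payload].
fn scan_segment(buf: &[u8], checksum: impl Fn(&[u8]) -> u32) -> Result<Scan, CorruptAt> {
    let read_u32 = |b: &[u8]| u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    let mut records = Vec::new();
    let mut pos = 0;
    loop {
        let rest = &buf[pos..];
        if rest.len() < 9 {
            break; // clean EOF, or a header cut off mid-write (torn tail)
        }
        let end = 9 + read_u32(&rest[0..4]) as usize;
        if rest.len() < end {
            break; // body cut off mid-write (torn tail)
        }
        if checksum(&rest[8..end]) != read_u32(&rest[4..8]) {
            return Err(CorruptAt(pos));
        }
        records.push(rest[8..end].to_vec());
        pos += end;
    }
    Ok(Scan { records, valid_len: pos })
}
```

Converting the reader into a writable segment would then amount to something like `File::set_len(scan.valid_len)` and appending from that offset, so the torn bytes are overwritten rather than misread on the next recovery.
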

Adds Wal: a cloneable handle around an mpsc channel into a single writer
task that owns the active Segment, batches concurrent appends behind one
fsync per batch, and rotates segments at the configured size limit.
Shutdown via AtomicBool + Notify; the writer body is wrapped in
catch_unwind so a panic doesn't hang shutdown.
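
The group-commit idea behind the writer task, sketched with `std::sync::mpsc` and a plain thread rather than the async channel, `AtomicBool`, and `Notify` the PR uses. The writer blocks for one request, drains whatever else is already queued, writes the batch once, counts one (simulated) fsync, and only then acknowledges every caller. `Append`, `WriterStats`, and `spawn_writer` are hypothetical; shutdown here is simply every sender being dropped, and segment rotation is left out.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// One append request: pre-encoded record bytes plus a channel to ack durability.
struct Append {
    bytes: Vec<u8>,
    done: Sender<()>,
}

/// What the writer did, returned when it shuts down.
struct WriterStats {
    log: Vec<u8>,   // stands in for the active segment file
    syncs: usize,   // number of (simulated) fsync calls
    records: usize, // number of appends written
}

fn spawn_writer(rx: Receiver<Append>) -> JoinHandle<WriterStats> {
    thread::spawn(move || {
        let mut stats = WriterStats { log: Vec::new(), syncs: 0, records: 0 };
        // Block for the first request of each batch; exits once every sender is gone.
        while let Ok(first) = rx.recv() {
            let mut batch = vec![first];
            // Greedily pick up whatever else is already queued.
            while let Ok(more) = rx.try_recv() {
                batch.push(more);
            }
            let mut buf = Vec::new();
            for req in &batch {
                buf.extend_from_slice(&req.bytes);
            }
            stats.log.extend_from_slice(&buf); // one write for the whole batch
            stats.syncs += 1; // one fsync for the whole batch
            stats.records += batch.len();
            for req in batch {
                // Ack only after the sync; a caller that gave up is ignored.
                let _ = req.done.send(());
            }
        }
        stats
    })
}
```

Rotation would slot in after the sync: if the active segment has passed the configured size limit, open the next one before taking the next batch.
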

Switches Segment to a batch-only append API (append_batch takes
pre-encoded bytes plus record count) so the writer can encode the whole
batch into one buffer and issue a single write_all.
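
A sketch of what a batch-only append API might look like, assuming a segment generic over `std::io::Write` so it can be exercised with a `Vec<u8>`. The field names and the `is_full` helper are hypothetical; with a real `File`, the writer would follow `append_batch` with `sync_data()` before acknowledging the batch.

```rust
use std::io::{self, Write};

/// Hypothetical segment wrapper over any writer (a File in practice).
struct Segment<W: Write> {
    out: W,
    bytes: u64,
    records: u64,
}

impl<W: Write> Segment<W> {
    fn new(out: W) -> Self {
        Segment { out, bytes: 0, records: 0 }
    }

    /// Append a batch that the caller already encoded into one buffer.
    /// `count` is how many records `encoded` holds; the segment does not
    /// re-parse the bytes to find out.
    fn append_batch(&mut self, encoded: &[u8], count: u64) -> io::Result<()> {
        self.out.write_all(encoded)?; // single write for the whole batch
        self.bytes += encoded.len() as u64;
        self.records += count;
        Ok(())
    }

    /// Whether the writer should rotate to a fresh segment.
    fn is_full(&self, limit: u64) -> bool {
        self.bytes >= limit
    }
}
```

Passing the record count alongside the bytes keeps the segment's bookkeeping cheap while the writer does all encoding up front.
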

Adds recovery::recover_transactions, which scans every segment and hands
each in-flight transaction to Manager::restore_transaction. Wraps probe
+ recovery + writer spawn behind Wal::open, with distinct error variants
for directory access failures.
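
Recovery can be sketched as a fold over every record in segment order: Begin opens a transaction, Committing records the commit decision, End forgets it, and whatever remains is in flight. The `Record` and `Phase` types are assumptions, not the PR's record definitions; in the PR the survivors are handed to `Manager::restore_transaction` rather than returned.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Phase {
    Begun,
    Committing,
}

/// Hypothetical decoded WAL records, keyed by a transaction id.
enum Record {
    Begin(u64),
    Committing(u64),
    End(u64),
}

/// Fold records from every segment, oldest first, into the set of
/// transactions that never reached End.
fn recover_transactions<'a>(
    segments: impl IntoIterator<Item = &'a [Record]>,
) -> BTreeMap<u64, Phase> {
    let mut in_flight = BTreeMap::new();
    for segment in segments {
        for record in segment {
            match *record {
                Record::Begin(id) => {
                    in_flight.insert(id, Phase::Begun);
                }
                Record::Committing(id) => {
                    in_flight.insert(id, Phase::Committing);
                }
                Record::End(id) => {
                    in_flight.remove(&id);
                }
            }
        }
    }
    in_flight
}
```

Under the usual 2PC rule, a transaction recovered as `Committing` has to be driven to commit, while one that only reached `Begun` can be rolled back.
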

WAL initialization replays in-flight 2PC transactions back into the
manager so they can be driven to a terminal state. If the WAL can't
be opened, 2PC continues in-memory only without durability rather
than failing pgdog startup.
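
The degrade-instead-of-fail choice amounts to mapping an open error to `None`. A minimal sketch, where `open_or_degrade` is a hypothetical helper and the closure stands in for probe, recovery, and writer spawn:

```rust
use std::fmt::Display;

/// Open the WAL if possible; otherwise log and continue without durability.
/// `open` is a stand-in for the real open-and-recover step.
fn open_or_degrade<W, E: Display>(open: impl FnOnce() -> Result<W, E>) -> Option<W> {
    match open() {
        Ok(wal) => Some(wal),
        Err(err) => {
            eprintln!("2pc WAL unavailable, continuing without durability: {}", err);
            None
        }
    }
}
```

Callers then treat `None` as in-memory-only 2PC instead of aborting startup.
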

Recovery distinguishes corruption from IO failures: corrupt segments
are renamed to .broken and skipped, with restore skipped entirely on
any corruption so a missing Committing record can't silently invert a
committed transaction. IO failures abort recovery and disable the WAL.
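
A sketch of that decision, assuming each segment has already been read into a hypothetical `SegmentResult`. Any IO error wins and disables the WAL; otherwise any corruption means corrupt segments are renamed aside and nothing is restored. Appending `.broken` to the full file name is an assumption about the naming, and `Plan` and `plan_recovery` are illustrative.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical outcome of reading one segment during recovery.
enum SegmentResult {
    Clean,
    Corrupt,
    Io(String),
}

#[derive(Debug, PartialEq)]
enum Plan {
    /// Every segment read cleanly: restore in-flight transactions.
    Restore,
    /// Some segments were corrupt: rename them aside and restore nothing.
    SkipRestore { rename: Vec<(PathBuf, PathBuf)> },
    /// An IO error: abort recovery and run without the WAL.
    DisableWal(String),
}

fn plan_recovery(segments: &[(PathBuf, SegmentResult)]) -> Plan {
    let mut rename = Vec::new();
    for (path, result) in segments {
        match result {
            SegmentResult::Clean => {}
            SegmentResult::Corrupt => rename.push((path.clone(), broken_path(path))),
            SegmentResult::Io(err) => return Plan::DisableWal(err.clone()),
        }
    }
    if rename.is_empty() {
        Plan::Restore
    } else {
        Plan::SkipRestore { rename }
    }
}

/// `0002.wal` -> `0002.wal.broken`
fn broken_path(path: &Path) -> PathBuf {
    let mut name = path.as_os_str().to_owned();
    name.push(".broken");
    PathBuf::from(name)
}
```

The renames themselves would be `std::fs::rename(from, to)` for each pair.
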

Manager logs Begin and Committing on phase transitions and End on
cleanup, before mutating in-memory state. WAL write failures are
logged and the transaction proceeds without durability rather than
blocking 2PC. Shutdown drains the cleanup queue and then the WAL so
final End records make it to disk.
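
The ordering rule (log first, then mutate, and never block on a failed log write) can be sketched as follows. `Manager`, `State`, and the `wal` closure are hypothetical stand-ins; the real manager appends through the `Wal` handle and tracks more state per transaction.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Record {
    Begin(u64),
    Committing(u64),
    End(u64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Preparing,
    Committing,
}

/// Hypothetical manager: `wal` is any append function that may fail.
struct Manager<F> {
    wal: F,
    txns: HashMap<u64, State>,
}

impl<F: FnMut(Record) -> Result<(), String>> Manager<F> {
    /// Try to log; on failure, report it and let the caller proceed.
    fn log(&mut self, record: Record) {
        if let Err(err) = (self.wal)(record) {
            eprintln!("2pc WAL write failed, continuing without durability: {}", err);
        }
    }

    fn begin(&mut self, id: u64) {
        self.log(Record::Begin(id)); // log before touching memory
        self.txns.insert(id, State::Preparing);
    }

    fn commit(&mut self, id: u64) {
        self.log(Record::Committing(id));
        self.txns.insert(id, State::Committing);
    }

    fn cleanup(&mut self, id: u64) {
        self.log(Record::End(id));
        self.txns.remove(&id);
    }
}
```
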

Drops the unused participant-shards field from Begin and Checkpoint
records since cleanup fans to all current shards via the existing
42704-tolerant path.
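
A sketch of a 42704-tolerant fan-out. PostgreSQL raises SQLSTATE 42704 (`undefined_object`) for `COMMIT PREPARED` or `ROLLBACK PREPARED` when no prepared transaction with that identifier exists, so a shard that never prepared can be treated as already clean. `cleanup_all_shards` and the `run` closure are hypothetical, and stopping at the first other error is an assumption.

```rust
/// SQLSTATE undefined_object.
const UNDEFINED_OBJECT: &str = "42704";

/// Run cleanup on every shard. `run` stands in for sending
/// COMMIT PREPARED / ROLLBACK PREPARED to one shard and returning
/// the SQLSTATE on failure.
fn cleanup_all_shards(
    shards: usize,
    mut run: impl FnMut(usize) -> Result<(), String>,
) -> Result<(), (usize, String)> {
    for shard in 0..shards {
        match run(shard) {
            Ok(()) => {}
            Err(code) if code == UNDEFINED_OBJECT => {} // nothing prepared here
            Err(code) => return Err((shard, code)),
        }
    }
    Ok(())
}
```

Tolerating 42704 is what makes tracking participant shards unnecessary: sending cleanup to a shard that was never involved is harmless.
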

A second pgdog pointed at the same WAL dir now fails fast with the
prior holder's PID and start time instead of silently corrupting the
log.
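
One way to fail fast, sketched with an exclusively created `LOCK` file that records the holder's PID and start time (seconds since the Unix epoch). The file name, its contents, and `acquire_wal_lock` are assumptions. A lock file created this way outlives a crashed process, so a real implementation would need an OS advisory lock or a liveness check on the recorded PID; this only shows the reporting side.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

/// Claim `dir` for this process by creating `dir/LOCK` exclusively.
/// If it already exists, fail with the recorded holder's PID and start time.
fn acquire_wal_lock(dir: &Path) -> io::Result<()> {
    let path = dir.join("LOCK");
    match OpenOptions::new().write(true).create_new(true).open(&path) {
        Ok(mut file) => {
            let started = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .map(|d| d.as_secs())
                .unwrap_or(0);
            writeln!(file, "pid={} started={}", std::process::id(), started)?;
            file.sync_all()
        }
        Err(err) if err.kind() == io::ErrorKind::AlreadyExists => {
            let holder = fs::read_to_string(&path).unwrap_or_default();
            Err(io::Error::new(
                io::ErrorKind::AlreadyExists,
                format!("WAL dir {} already in use ({})", dir.display(), holder.trim()),
            ))
        }
        Err(err) => Err(err),
    }
}
```
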

Development

Successfully merging this pull request may close these issues.

[Sharding] Persist 2pc transaction state to disk
