
Undo #21 (Draft)

gburd wants to merge 44 commits into master from undo

Conversation


gburd (Owner) commented Mar 26, 2026

No description provided.

github-actions bot force-pushed the master branch 30 times, most recently from 9355586 to 9cbf7e6 on March 30, 2026 at 18:18
gburd added 30 commits on April 28, 2026 at 14:49
Implement WAL-logged file truncation. The operation executes immediately,
with XLogFlush() called before the irreversible truncation (following the
SMGR_TRUNCATE pattern). Uses ftruncate() on POSIX, SetEndOfFile() on Windows.

API: FileOpsTruncate(path, length) -> void
WAL: XLOG_FILEOPS_TRUNCATE with redo that replays the truncation.
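
A minimal sketch of the log-before-act ordering this commit describes.
RM_FILEOPS_ID, XLOG_FILEOPS_TRUNCATE, and FileOpsTruncate() are the patch's
own names reused here; the record layout and body are illustrative, and the
Windows SetEndOfFile() branch is omitted.

    #include "postgres.h"
    #include "access/xlog.h"        /* XLogFlush */
    #include "access/xloginsert.h"  /* XLogBeginInsert, XLogInsert */
    #include "storage/fd.h"         /* OpenTransientFile */
    #include <fcntl.h>
    #include <unistd.h>

    typedef struct xl_fileops_truncate
    {
        int64   length;     /* new file length; path follows, NUL-terminated */
    } xl_fileops_truncate;

    void
    FileOpsTruncate(const char *path, int64 length)
    {
        xl_fileops_truncate xlrec = { .length = length };
        XLogRecPtr  lsn;
        int         fd;

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
        XLogRegisterData((char *) path, strlen(path) + 1);
        lsn = XLogInsert(RM_FILEOPS_ID, XLOG_FILEOPS_TRUNCATE);

        /*
         * Flush before acting: the truncation is irreversible, so the WAL
         * record must be durable first (same ordering as SMGR_TRUNCATE).
         */
        XLogFlush(lsn);

        fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
        if (fd < 0 || ftruncate(fd, (off_t) length) < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not truncate file \"%s\": %m", path)));
        CloseTransientFile(fd);
    }
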
Implement WAL-logged file metadata operations.

CHMOD: chmod() on POSIX, _chmod() on Windows with limited mode bits
(only _S_IREAD/_S_IWRITE; no group/other support).

CHOWN: chown() on POSIX, no-op with WARNING on Windows (Windows uses
ACLs for ownership, not uid/gid).

Both execute immediately and are WAL-logged for crash recovery.
MKDIR: Immediate execution using MakePGDirectory(). Registers
rmdir-on-abort for automatic cleanup on rollback. On Windows: _mkdir()
(no mode parameter, permissions inherited from parent).

RMDIR: Deferred to commit time (like DELETE). Uses rmdir() on POSIX,
_rmdir() on Windows.
SYMLINK: Immediate execution. Uses symlink() on POSIX, pgsymlink()
(NTFS junction points) on Windows. Registers delete-on-abort.

LINK: Immediate execution. Uses link() on POSIX, CreateHardLinkA()
on Windows (NTFS only). Registers delete-on-abort.

Both create links idempotently during redo (unlink first if exists).
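
A sketch of the idempotent redo pattern the last line describes (unlink
first, then recreate); the routine name is illustrative, and the real redo
code also covers the hard-link and Windows variants.

    #include "postgres.h"
    #include <errno.h>
    #include <unistd.h>

    /*
     * Redo must be replayable: remove any link left by an earlier replay
     * attempt so symlink() cannot fail with EEXIST.
     */
    static void
    fileops_redo_symlink(const char *target, const char *linkpath)
    {
        if (unlink(linkpath) < 0 && errno != ENOENT)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not remove file \"%s\": %m", linkpath)));

        if (symlink(target, linkpath) < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not create symbolic link \"%s\": %m",
                            linkpath)));
    }
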
Add extended attribute operations to the transactional file operations
framework, completing the Berkeley DB fileops.src operation set.

FileOpsSetXattr() and FileOpsRemoveXattr() provide immediate execution
with WAL logging for crash recovery replay. A new cross-platform
portability layer (src/port/pg_xattr.c) abstracts platform differences:

  - Linux: <sys/xattr.h> setxattr/removexattr
  - macOS: <sys/xattr.h> with extra options parameter
  - FreeBSD: <sys/extattr.h> extattr_set_file/extattr_delete_file
  - Windows: NTFS Alternate Data Streams via CreateFileA("path:name")
  - Fallback: returns ENOTSUP (the operation is still WAL-logged but is a
    no-op on unsupported platforms, keeping the WAL stream portable)

Platform detection uses compiler-defined macros (__linux__, __APPLE__,
__FreeBSD__, WIN32) rather than configure-time checks, avoiding
meson.build/configure.ac complexity.
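
The dispatch in pg_xattr.c presumably looks something like this sketch.
pg_setxattr() and the flag choices are assumptions; the platform calls and
macros match the list above, and the Windows ADS branch is omitted for
brevity.

    #if defined(__linux__) || defined(__APPLE__)
    #include <sys/xattr.h>
    #elif defined(__FreeBSD__)
    #include <sys/extattr.h>
    #endif
    #include <errno.h>

    int
    pg_setxattr(const char *path, const char *name,
                const void *value, size_t size)
    {
    #if defined(__linux__)
        return setxattr(path, name, value, size, 0);
    #elif defined(__APPLE__)
        /* macOS adds position and options parameters */
        return setxattr(path, name, value, size, 0, 0);
    #elif defined(__FreeBSD__)
        return extattr_set_file(path, EXTATTR_NAMESPACE_USER,
                                name, value, size) < 0 ? -1 : 0;
    #else
        /*
         * Fallback: fail softly so a WAL stream recorded on a platform
         * with xattr support still replays cleanly here.
         */
        errno = ENOTSUP;
        return -1;
    #endif
    }
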
Add regression tests for all FILEOPS operations (CREATE, DELETE,
RENAME, WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK,
SETXATTR, REMOVEXATTR) and a crash recovery test for WAL replay.

Update the transactional fileops example script with the expanded
operation set following the Berkeley DB fileops.src model.
Introduce the IndexPrune framework that allows index access methods to
register callbacks for proactively pruning dead index entries when UNDO
records are discarded. This avoids accumulating dead tuples that would
otherwise require VACUUM to clean up.

Key components:
- index_prune.h: IndexPruneCallbacks structure and registration API
- index_prune.c: Registry management and IndexPruneNotifyDiscard() dispatcher
- relundo_discard.c: Hook to call IndexPruneNotifyDiscard on UNDO discard

Individual index AM implementations follow in subsequent commits.
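
The registration API plausibly has this shape; apart from
IndexPruneNotifyDiscard() and the file names above, every identifier below
is a guess at the structure this commit describes, not its actual
definition.

    /* index_prune.h (sketch): one callback set per index AM */
    typedef struct IndexPruneCallbacks
    {
        Oid     amoid;              /* index AM these callbacks serve */
        void  (*prune_on_discard) (Relation indexRel,
                                   UndoRecPtr oldestRetained);
    } IndexPruneCallbacks;

    extern void IndexPruneRegister(const IndexPruneCallbacks *cbs);

    /*
     * index_prune.c (sketch): fan a discard event out to the registered
     * callbacks so dead entries are pruned without waiting for VACUUM.
     */
    void
    IndexPruneNotifyDiscard(Relation indexRel, UndoRecPtr oldestRetained)
    {
        const IndexPruneCallbacks *cbs =
            index_prune_lookup(indexRel->rd_rel->relam);   /* hypothetical */

        if (cbs && cbs->prune_on_discard)
            cbs->prune_on_discard(indexRel, oldestRetained);
    }
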
Placeholder for index pruning design documentation.
To be populated when design notes are split by subsystem.
Register IndexPrune callbacks in the B-tree access method handler.
nbtprune.c implements dead-entry detection and removal using UNDO
discard notifications, allowing proactive cleanup without full VACUUM.
Register IndexPrune callbacks in the hash access method handler.
hashprune.c implements dead-entry detection and removal using UNDO
discard notifications for hash indexes.
Register IndexPrune callbacks in the GIN access method handler.
ginprune.c implements dead-entry detection and removal using UNDO
discard notifications for GIN indexes.
Register IndexPrune callbacks in the GiST access method handler.
gistprune.c implements dead-entry detection and removal using UNDO
discard notifications for GiST indexes.
Register IndexPrune callbacks in the SP-GiST access method handler.
spgprune.c implements dead-entry detection and removal using UNDO
discard notifications for SP-GiST indexes.
Add VACUUM statistics tracking for UNDO-pruned index entries and verbose
output. Include comprehensive test suite exercising index pruning across
all supported index access methods via test_undo_tam.
Add opt-in UNDO support to the standard heap table access method.
When enabled, heap operations write UNDO records to enable physical
rollback without scanning the heap, and support UNDO-based MVCC
visibility determination.

How heap uses UNDO (see the sketch after this list):

INSERT operations:
  - Before inserting tuple, call PrepareXactUndoData() to reserve UNDO space
  - Write UNDO record with: transaction ID, tuple TID, old tuple data (null for INSERT)
  - On abort: UndoReplay() marks tuple as LP_UNUSED without heap scan

UPDATE operations:
  - Write UNDO record with complete old tuple version before update
  - On abort: UndoReplay() restores old tuple version from UNDO

DELETE operations:
  - Write UNDO record with complete deleted tuple data
  - On abort: UndoReplay() resurrects tuple from UNDO record

MVCC visibility:
  - Tuples reference UNDO chain via xmin/xmax
  - HeapTupleSatisfiesSnapshot() can walk UNDO chain for older versions
  - Enables reconstructing tuple state as of any snapshot
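
A hedged sketch of the INSERT path above. PrepareXactUndoData(),
InsertXactUndoData(), and UndoReplay() are named in this commit; the
context type, the UNDO_HEAP_INSERT tag, and the argument lists are
illustrative assumptions.

    #include "postgres.h"
    #include "access/xact.h"        /* GetCurrentTransactionId */
    /* heap/undo types come from the patch */

    static void
    heap_insert_record_undo(Relation rel, HeapTuple tup)
    {
        XactUndoContext ctx;        /* hypothetical handle type */

        /*
         * Reserve UNDO space before the critical section, so an
         * allocation failure cannot leave a half-logged insert.
         */
        PrepareXactUndoData(&ctx, rel, UNDO_HEAP_INSERT);

        /*
         * INSERT records carry no old tuple image: on abort, UndoReplay()
         * just marks the TID's line pointer LP_UNUSED, with no heap scan.
         */
        InsertXactUndoData(&ctx, GetCurrentTransactionId(),
                           &tup->t_self, NULL /* no old version */);
    }
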

Configuration:
  CREATE TABLE t (...) WITH (enable_undo=on);

The enable_undo storage parameter is per-table and defaults to off for
backward compatibility. When disabled, heap behaves exactly as before.

Value proposition:

1. Faster rollback: No heap scan required, UNDO chains are sequential
   - Traditional abort: Full heap scan to mark tuples invalid (O(n) random I/O)
   - UNDO abort: Sequential UNDO log scan (O(n) sequential I/O, better cache locality)

2. Cleaner abort handling: UNDO records are self-contained
   - No need to track which heap pages were modified
   - Works across crashes (UNDO is WAL-logged)

3. Foundation for future features:
   - Multi-version concurrency control without bloat
   - Faster VACUUM (can discard entire UNDO segments)
   - Point-in-time recovery improvements

Trade-offs:

Costs:
  - Additional writes: Every DML writes both heap + UNDO (roughly 2x write amplification)
  - UNDO log space: Requires space for UNDO records until no longer visible
  - Complexity: New GUCs (undo_retention, max_undo_workers), monitoring needed

Benefits:
  - Primarily valuable for workloads with:
    - Frequent aborts (e.g., speculative execution, deadlocks)
    - Long-running transactions needing old snapshots
    - Hot UPDATE workloads benefiting from cleaner rollback

Not recommended for:
  - Bulk load workloads (COPY: 2x write amplification without abort benefit)
  - Append-only tables (rare aborts mean cost without benefit)
  - Space-constrained systems (UNDO retention increases storage)

When beneficial:
  - OLTP with high abort rates (>5%)
  - Systems with aggressive pruning needs (frequent VACUUM)
  - Workloads requiring historical visibility (audit, time-travel queries)

Integration points:
  - heap_insert/update/delete call PrepareXactUndoData/InsertXactUndoData
  - Heap pruning respects undo_retention to avoid discarding needed UNDO
  - pg_upgrade compatibility: UNDO disabled for upgraded tables

Background workers:
  - Cluster-wide UNDO has async workers for cleanup/discard of old UNDO records
  - Rollback itself is synchronous (via UndoReplay() during transaction abort)
  - Workers periodically trim UNDO logs based on undo_retention and snapshot visibility

This demonstrates cluster-wide UNDO in production use. Note that this
differs from per-relation logical UNDO (added in subsequent patches),
which uses per-table UNDO forks and async rollback via background
workers.
Implement an UNDO resource manager for B-tree indexes, with a regression
test. When a transaction aborts, provisionally inserted index entries are
marked LP_DEAD. Includes a zero_vacuum test verifying that aborted inserts
leave no dead tuples, and checking index consistency via bt_index_check().
Document the UNDO architecture including UNDO log design, record
format, transaction integration, and heap AM integration details.
Add a self-contained benchmark suite comparing three scenarios:
baseline (master), undo-compiled-but-off, and undo-enabled.
Covers insert/update/delete throughput, rollback cost, VACUUM
overhead, read stability under writes, storage footprint, and
pgbench TPS with a mixed OLTP workload including 10% rollbacks.
Segment rotation and bulk UNDO hints were moved earlier in the
commit series. Update README, HEAP_UNDO_DESIGN.md, and
DESIGN_NOTES.md to reflect that these features are implemented
rather than planned future work.
UNDO_PRUNE records are informational only — they are never applied
during transaction rollback and exist solely for forensic recovery
via pg_undorecover.  Writing them synchronously during VACUUM added
2-3x overhead (measured via B6 benchmarks at scale=100k: vacuum_time
went from 1.00x to 3.36x vs baseline).

Exclude PRUNE_VACUUM_SCAN and PRUNE_VACUUM_CLEANUP from UNDO record
generation, matching the existing PRUNE_ON_ACCESS exclusion.  After
this change, vacuum_time drops to 1.05x vs baseline.
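
The exclusion amounts to a reason check along these lines; the helper name
is invented, but PRUNE_ON_ACCESS, PRUNE_VACUUM_SCAN, and
PRUNE_VACUUM_CLEANUP are the reason codes the commit names.

    /* Only pruning reasons outside this set generate UNDO_PRUNE records. */
    static inline bool
    prune_reason_wants_undo(PruneReason reason)
    {
        switch (reason)
        {
            case PRUNE_ON_ACCESS:           /* excluded previously */
            case PRUNE_VACUUM_SCAN:         /* excluded by this commit */
            case PRUNE_VACUUM_CLEANUP:      /* excluded by this commit */
                return false;
            default:
                return true;
        }
    }
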
For DELETE and UPDATE on UNDO-enabled tables, the old code copied
the tuple three times per row:
  1. heap_copytuple() - palloc + full tuple memcpy
  2. HeapUndoBuildPayload() - palloc + header + tuple memcpy
  3. UndoRecordSerialize() - header + payload memcpy into uset buffer

Since the buffer is still exclusively locked when the UNDO record is
constructed (before START_CRIT_SECTION), we can read tuple data
directly from the page without copying.

Add UndoRecordAddPayloadParts() and HeapBulkUndoAddRecordParts() to
accept scatter-gather payloads (fixed header + tuple data), eliminating
the intermediate palloc.  This reduces the copy chain from 3 to 1:
  1. UndoRecordAddPayloadParts() - header + tuple data directly into
     uset buffer

Measured improvement on B6 delete_5pct at scale=100k:
  Before: 5.44x vs baseline (median 27.85ms)
  After:  3.69x vs baseline (median 19.18ms)
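
The scatter-gather API plausibly takes an iovec-style part list; this
sketch assumes the part struct, the HeapUndoHeader type, and the function
signatures, which the commit does not spell out.

    typedef struct UndoPayloadPart
    {
        const void *data;
        Size        len;
    } UndoPayloadPart;

    extern void UndoRecordAddPayloadParts(UndoRecordSet *uset,
                                          const UndoPayloadPart *parts,
                                          int nparts);

    /*
     * DELETE path: the caller still holds the buffer's exclusive lock, so
     * the tuple part can point straight at the page -- no heap_copytuple()
     * and no intermediate payload palloc.
     */
    static void
    record_delete_undo(UndoRecordSet *uset, HeapUndoHeader *hdr,
                       const char *tupdata, Size tuplen)
    {
        UndoPayloadPart parts[2] = {
            { hdr, sizeof(*hdr) },      /* fixed undo header */
            { tupdata, tuplen },        /* tuple bytes read in place */
        };

        UndoRecordAddPayloadParts(uset, parts, 2);
    }
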
…ssembly

Three changes to reduce per-row UNDO cost:

1. Lower bulk UNDO activation threshold from 1000 to 100 estimated rows.
   Operations on 100-1000 rows (e.g., batch deletes, small updates) now
   use batched UNDO instead of per-row UndoLogAllocate + WAL + pwrite.

2. Eliminate MemoryContextSwitchTo overhead in UndoRecordAddPayload and
   UndoRecordAddPayloadParts.  Build the UNDO record header directly
   in the uset buffer instead of using a stack variable + memcpy.
   Use MemoryContextAlloc/MemoryContextAllocZero instead of switching
   contexts in UndoRecordSetCreate.  Reduce initial buffer from 8KB
   to 512 bytes (sufficient for a single record; grows dynamically).

3. Piggyback nbtree UNDO records onto the heap bulk UNDO batch when
   bulk mode is active.  Previously every index tuple insertion created
   a separate UndoRecordSet with its own UndoLogAllocate + XLogInsert
   + pwrite cycle — doubling the per-row UNDO overhead for indexed
   tables.  Now nbtree records are added to the heap's batch and
   flushed together.
Two optimizations targeting the per-row UNDO write path overhead:

1. File extent caching: Add cached_size to UndoLogFdCacheEntry so
   ExtendUndoLogFile() skips fstat() when the cached file size is
   already sufficient.  The UNDO file only grows, so the cached size
   is a reliable lower bound.  This eliminates ~90% of fstat syscalls
   (sketched after this list).

2. WAL record batching: Instead of emitting one XLOG_UNDO_ALLOCATE
   WAL record per UndoRecordSetInsert(), coalesce allocations in the
   same log into a deferred batch.  The batch is flushed as a single
   WAL record at transaction commit/abort, before log rotation, or
   when 64 entries accumulate.  This reduces XLogInsert overhead from
   per-row to per-batch.

   The redo handler is updated to only advance the insert pointer
   forward (never regress), which is necessary for correctness when
   concurrent backends produce overlapping coalesced ranges.
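
A sketch of the extent-cache check in point 1. cached_size and
UndoLogFdCacheEntry come from the commit text; the extension itself is
reduced here to a bare ftruncate(), where the real code may zero-fill.

    #include "postgres.h"
    #include <sys/stat.h>
    #include <unistd.h>

    static void
    ExtendUndoLogFile(UndoLogFdCacheEntry *entry, off_t needed)
    {
        struct stat st;

        /*
         * UNDO files only grow, so a cached size covering 'needed' is a
         * reliable lower bound: skip the fstat() entirely (~90% of the
         * calls on this path).
         */
        if (entry->cached_size >= needed)
            return;

        if (fstat(entry->fd, &st) == 0)
            entry->cached_size = st.st_size;

        if (entry->cached_size < needed)
        {
            if (ftruncate(entry->fd, needed) < 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not extend undo file: %m")));
            entry->cached_size = needed;
        }
    }
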

Benchmark impact on per-row operations (undo_on/baseline ratio):
  single_row_delete: 3.29x -> 1.88x (-43% overhead)
  single_row_update: 1.99x -> 1.40x (-30% overhead)
  individual_insert: 2.35x -> 1.95x (-17% overhead)
  vacuum_time:       1.19x -> 1.04x (-13% overhead)

All regression tests pass (251 subtests + recovery/CLR/crash/standby).
Replace two of the three shared-memory hash tables in sLog (SLogTxnHash
and SLogXidHash) with a shared-memory skip-list and a compressed sparse
bitmap (sparsemap), keeping SLogTupleHash unchanged.

Transaction sLog changes:
- Skip-list keyed by (xid, reloid) replaces SLogTxnHash.  Entries
  are ordered by xid then reloid, enabling O(log n) lookups and
  efficient xid-range operations (all entries for an xid are
  contiguous).
- Sparsemap replaces SLogXidHash for O(1) SLogXidIsPresent() checks.
  The hot-path presence check uses only a SpinLock-protected bitmap,
  avoiding the heavier LWLock entirely (sketched below).
- Single LWLock replaces 4 partition locks.  The skip-list is
  lock-free by design but used in SKIPLIST_SINGLE_THREADED mode
  (no C11 stdatomic dependency) because the pool allocator and
  sparsemap require external synchronization.  A single lock suffices
  since sLog is only modified on transaction abort.
- Pool allocator in shared memory: contiguous slab of 260 slots
  (256 entries + 2 sentinels + 2 margin) with index-based free-list.
  EBR retire callback redirects node deallocation to pool free-list
  instead of pfree() (which would crash on shared memory).
- SKIPLIST_MAX_HEIGHT reduced to 16 (supports 65K entries; pool
  holds at most 256).

Public API signatures are preserved; callers are unaffected.
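
The hot-path check the sparsemap bullet describes would look roughly like
this; slogShared, the lock field, and sparsemap_is_set() are assumed names.

    #include "postgres.h"
    #include "storage/spin.h"

    bool
    SLogXidIsPresent(TransactionId xid)
    {
        bool    present;

        /*
         * Only the spinlock-protected bitmap is consulted; the skip-list
         * and its LWLock are never touched on this path.
         */
        SpinLockAcquire(&slogShared->xid_map_lock);
        present = sparsemap_is_set(slogShared->xid_map, (uint64) xid);
        SpinLockRelease(&slogShared->xid_map_lock);

        return present;
    }
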
The multi-threaded atomic path now uses PostgreSQL's pg_atomic_uint64
type (port/atomics.h) instead of C11 <stdatomic.h>.  The conversion
uses an anonymous union { pg_atomic_uint64 _pg; T _value; } for
_SKIP_ATOMIC(T), preserving type safety via __typeof__ at load/store
sites while routing all atomic operations through pg_atomic_*.

Key mappings:
- _skip_atomic_load → pg_atomic_read_u64 / pg_atomic_read_membarrier_u64
- _skip_atomic_store → pg_atomic_write_u64 / pg_atomic_write_membarrier_u64
- _skip_atomic_cas_strong/weak → pg_atomic_compare_exchange_u64 (always strong)
- _skip_atomic_fetch_add/sub → pg_atomic_fetch_add/sub_u64
- _skip_atomic_exchange → pg_atomic_exchange_u64
- _skip_atomic_thread_fence → pg_memory_barrier()

Pointer types are stored via uintptr_t type-punning through uint64.
The SKIPLIST_SINGLE_THREADED path (used by slog.c) is unchanged.
Sparsemap has zero atomics—no conversion needed.
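
The union trick, roughly: the macro names come from the commit, but the
exact expansion is a guess.

    #include "port/atomics.h"

    /* storage: a pg_atomic_uint64 overlaid with the logical type T */
    #define _SKIP_ATOMIC(T)  union { pg_atomic_uint64 _pg; T _value; }

    /* load: go through pg_atomic_read_u64, then recover T; __typeof__
     * keeps assignments type-checked at the call site */
    #define _skip_atomic_load(a) \
        ((__typeof__((a)._value)) (uintptr_t) pg_atomic_read_u64(&(a)._pg))

    /* store: pointers are punned through uintptr_t into uint64 */
    #define _skip_atomic_store(a, v) \
        pg_atomic_write_u64(&(a)._pg, (uint64) (uintptr_t) (v))
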
Add features from the lrlck benchmark framework:

- pg_prewarm: warm shared_buffers before in-cache benchmarks for
  deterministic results (controlled via scale threshold)

- Wait event sampling: background pg_stat_activity sampler captures
  lock contention patterns during pgbench runs at 2-second intervals

- Zipfian hot/cold workload (B10): random_zipfian(1.2) distribution
  creates realistic skewed access patterns mixing cache hits and
  misses, with 5% rollback rate for UNDO stress testing

- Multi-role concurrent workload (B11): 4 simultaneous pgbench
  instances with different behaviors (hot readers, cold readers,
  updaters with 20% rollback, range scanners) hitting the same
  database — exercises UNDO under realistic mixed-workload contention

- CV% (Coefficient of Variation) in reports: stability indicator
  for undo_on scenario, helping identify noisy measurements

- Statistics helpers: cv(), stdev(), percentile() functions in
  common.sh for richer analysis

- Cache-pressure support: PGBENCH_SCALE_LARGE (default 500) for
  workloads that need working set >> shared_buffers
UndoRecordSetCreate() parented the uset's memory context to
CurrentMemoryContext.  When called from heap_insert() during SPI
execution (PL/pgSQL DO blocks), CurrentMemoryContext is the
executor's per-query context (es_query_cxt), which is destroyed
in FreeExecutorState() after each SPI_execute call.  Since
xactundo.c stores the uset pointer in the static
XactUndo.record_set[] and reuses it across multiple statements
within a transaction, the stale pointer caused a use-after-free
crash on the second INSERT iteration.

Fix by using TopTransactionContext as the parent, which survives
across SPI statement boundaries but is cleaned up at transaction
commit/abort.
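
The fix reduces to changing the parent passed to the standard allocator;
the context name string is illustrative.

    #include "utils/memutils.h"

    /*
     * In UndoRecordSetCreate() (sketch).  Parenting on
     * TopTransactionContext means the context outlives each SPI
     * statement's es_query_cxt but is still reset at commit/abort,
     * matching the lifetime of XactUndo.record_set[].
     */
    uset->context = AllocSetContextCreate(TopTransactionContext,
                                          "UndoRecordSet",
                                          ALLOCSET_DEFAULT_SIZES);
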
Port the compressed sparse bitmap (sparsemap) as a shared PostgreSQL
library in src/backend/lib/sparsemap.c with headers in src/include/lib/.
The sparsemap provides O(1) presence checks for sequential integer keys
using a compressed bitmap format that operates entirely in-place on a
pre-allocated buffer -- ideal for shared memory use.

Also add test modules (test_skiplist, test_sparsemap) exercising both
data structures under the PostgreSQL TAP test framework.
…rhead

Enhance the UNDO subsystem with several performance improvements:

- Route UNDO I/O through shared_buffers via the undo buffer manager,
  eliminating direct file I/O for UNDO page reads/writes
- Batch WAL records for UNDO log allocation to reduce WAL volume
- Add UNDO-specific WAL redo support for new record types
- Reduce per-row UNDO overhead in heap INSERT/DELETE/UPDATE paths
  by avoiding unnecessary tuple copies and streamlining record assembly
- Update UndoRecordSet API with scatter-gather payload support
…-back

UNDO buffers use virtual RelFileLocators with dbOid=9 (UNDO_DB_OID)
to route I/O through shared_buffers.  When the checkpointer flushes
dirty UNDO buffers, md.c resolves this to file path base/9/<lognum>.
However, neither the base/9/ directory nor the smgr files were being
created -- only the legacy base/undo/ files were managed.  This caused
checkpoint to fail with "could not open file base/9/1: No such file
or directory".

Fix by:
1. Creating base/9/ directory during UndoLogShmemInit() at startup,
   before any WAL replay or checkpoint can reference UNDO buffers.
2. Adding ExtendUndoLogSmgrFile() which creates and extends the
   smgr-managed file via smgrcreate()/smgrextend(), called from
   UndoLogAllocate() alongside the legacy ExtendUndoLogFile()
   (sketched below).
3. Calling ExtendUndoLogSmgrFile() from the XLOG_UNDO_EXTEND redo
   handler so recovery also prepares the smgr files.
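
A sketch of step 2, assuming a current (PG 17-era) smgr API.
ExtendUndoLogSmgrFile() and UNDO_DB_OID (dbOid=9) are the patch's names;
the body, including the log-number-as-relNumber mapping, is illustrative.

    #include "postgres.h"
    #include "storage/smgr.h"
    #include "catalog/pg_tablespace_d.h"   /* DEFAULTTABLESPACE_OID */

    static void
    ExtendUndoLogSmgrFile(Oid lognum, BlockNumber nblocks)
    {
        RelFileLocator rloc = { DEFAULTTABLESPACE_OID, UNDO_DB_OID, lognum };
        SMgrRelation reln = smgropen(rloc, INVALID_PROC_NUMBER);
        PGIOAlignedBlock zero = {0};

        /*
         * isRedo=true: creating an already-existing file is OK, which
         * keeps this callable from both UndoLogAllocate() and the
         * XLOG_UNDO_EXTEND redo handler.
         */
        smgrcreate(reln, MAIN_FORKNUM, true);

        while (smgrnblocks(reln, MAIN_FORKNUM) < nblocks)
            smgrextend(reln, MAIN_FORKNUM,
                       smgrnblocks(reln, MAIN_FORKNUM),
                       zero.data, false);
    }
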