Implement WAL-logged file truncation. Executed immediately, with XLogFlush before the irreversible operation (following the SMGR_TRUNCATE pattern). Uses ftruncate() on POSIX, SetEndOfFile() on Windows.
API: FileOpsTruncate(path, length) -> void
WAL: XLOG_FILEOPS_TRUNCATE, with a redo handler that replays the truncation.
Implement WAL-logged file metadata operations. CHMOD: chmod() on POSIX, _chmod() on Windows with limited mode bits (only _S_IREAD/_S_IWRITE; no group/other support). CHOWN: chown() on POSIX, no-op with WARNING on Windows (Windows uses ACLs for ownership, not uid/gid). Both execute immediately and are WAL-logged for crash recovery.
MKDIR: Immediate execution using MakePGDirectory(). Registers rmdir-on-abort for automatic cleanup on rollback. On Windows: _mkdir() (no mode parameter; permissions are inherited from the parent). RMDIR: Deferred to commit time (like DELETE). Uses rmdir() on POSIX, _rmdir() on Windows.
SYMLINK: Immediate execution. Uses symlink() on POSIX, pgsymlink() (NTFS junction points) on Windows. Registers delete-on-abort. LINK: Immediate execution. Uses link() on POSIX, CreateHardLinkA() on Windows (NTFS only). Registers delete-on-abort. Both create links idempotently during redo (unlink first if exists).
Add extended attribute operations to the transactional file operations
framework, completing the Berkeley DB fileops.src operation set.
FileOpsSetXattr() and FileOpsRemoveXattr() provide immediate execution
with WAL logging for crash recovery replay. A new cross-platform
portability layer (src/port/pg_xattr.c) abstracts platform differences:
- Linux: <sys/xattr.h> setxattr/removexattr
- macOS: <sys/xattr.h> with extra options parameter
- FreeBSD: <sys/extattr.h> extattr_set_file/extattr_delete_file
- Windows: NTFS Alternate Data Streams via CreateFileA("path:name")
- Fallback: returns ENOTSUP (the operation is still recorded in WAL but
replays as a no-op on unsupported platforms, keeping the WAL stream portable)
Platform detection uses compiler-defined macros (__linux__, __APPLE__,
__FreeBSD__, WIN32) rather than configure-time checks, avoiding
meson.build/configure.ac complexity.
Add regression tests for all FILEOPS operations (CREATE, DELETE, RENAME, WRITE, TRUNCATE, CHMOD, CHOWN, MKDIR, RMDIR, SYMLINK, LINK, SETXATTR, REMOVEXATTR) and a crash recovery test for WAL replay. Update the transactional fileops example script with the expanded operation set following the Berkeley DB fileops.src model.
Introduce the IndexPrune framework, which allows index access methods to register callbacks for proactively pruning dead index entries when UNDO records are discarded. This avoids accumulating dead tuples that would otherwise require VACUUM to clean up. Key components:
- index_prune.h: IndexPruneCallbacks structure and registration API
- index_prune.c: registry management and the IndexPruneNotifyDiscard() dispatcher
- relundo_discard.c: hook to call IndexPruneNotifyDiscard() on UNDO discard
Individual index AM implementations follow in subsequent commits.
Placeholder for index pruning design documentation. To be populated when design notes are split by subsystem.
Register IndexPrune callbacks in the B-tree access method handler. nbtprune.c implements dead-entry detection and removal using UNDO discard notifications, allowing proactive cleanup without full VACUUM.
Register IndexPrune callbacks in the hash access method handler. hashprune.c implements dead-entry detection and removal using UNDO discard notifications for hash indexes.
Register IndexPrune callbacks in the GIN access method handler. ginprune.c implements dead-entry detection and removal using UNDO discard notifications for GIN indexes.
Register IndexPrune callbacks in the GiST access method handler. gistprune.c implements dead-entry detection and removal using UNDO discard notifications for GiST indexes.
Register IndexPrune callbacks in the SP-GiST access method handler. spgprune.c implements dead-entry detection and removal using UNDO discard notifications for SP-GiST indexes.
Add VACUUM statistics tracking for UNDO-pruned index entries and verbose output. Include comprehensive test suite exercising index pruning across all supported index access methods via test_undo_tam.
Adds opt-in UNDO support to the standard heap table access method.
When enabled, heap operations write UNDO records to enable physical
rollback without scanning the heap, and support UNDO-based MVCC
visibility determination.
How heap uses UNDO:
INSERT operations:
- Before inserting tuple, call PrepareXactUndoData() to reserve UNDO space
- Write UNDO record with: transaction ID, tuple TID, old tuple data (null for INSERT)
- On abort: UndoReplay() marks tuple as LP_UNUSED without heap scan
UPDATE operations:
- Write UNDO record with complete old tuple version before update
- On abort: UndoReplay() restores old tuple version from UNDO
DELETE operations:
- Write UNDO record with complete deleted tuple data
- On abort: UndoReplay() resurrects tuple from UNDO record
MVCC visibility:
- Tuples reference UNDO chain via xmin/xmax
- HeapTupleSatisfiesSnapshot() can walk UNDO chain for older versions
- Enables reconstructing tuple state as of any snapshot
Configuration:
CREATE TABLE t (...) WITH (enable_undo=on);
The enable_undo storage parameter is per-table and defaults to off for
backward compatibility. When disabled, heap behaves exactly as before.
Value proposition:
1. Faster rollback: No heap scan required, UNDO chains are sequential
- Traditional abort: Full heap scan to mark tuples invalid (O(n) random I/O)
- UNDO abort: Sequential UNDO log scan (O(n) sequential I/O, better cache locality)
2. Cleaner abort handling: UNDO records are self-contained
- No need to track which heap pages were modified
- Works across crashes (UNDO is WAL-logged)
3. Foundation for future features:
- Multi-version concurrency control without bloat
- Faster VACUUM (can discard entire UNDO segments)
- Point-in-time recovery improvements
Trade-offs:
Costs:
- Additional writes: Every DML writes both heap + UNDO (roughly 2x write amplification)
- UNDO log space: Requires space for UNDO records until no longer visible
- Complexity: New GUCs (undo_retention, max_undo_workers), monitoring needed
Benefits:
- Primarily valuable for workloads with:
- Frequent aborts (e.g., speculative execution, deadlocks)
- Long-running transactions needing old snapshots
- Hot UPDATE workloads benefiting from cleaner rollback
Not recommended for:
- Bulk load workloads (COPY: 2x write amplification without abort benefit)
- Append-only tables (rare aborts mean cost without benefit)
- Space-constrained systems (UNDO retention increases storage)
When beneficial:
- OLTP with high abort rates (>5%)
- Systems with aggressive pruning needs (frequent VACUUM)
- Workloads requiring historical visibility (audit, time-travel queries)
Integration points:
- heap_insert/update/delete call PrepareXactUndoData/InsertXactUndoData
- Heap pruning respects undo_retention to avoid discarding needed UNDO
- pg_upgrade compatibility: UNDO disabled for upgraded tables
Background workers:
- Cluster-wide UNDO has async workers for cleanup/discard of old UNDO records
- Rollback itself is synchronous (via UndoReplay() during transaction abort)
- Workers periodically trim UNDO logs based on undo_retention and snapshot visibility
This demonstrates cluster-wide UNDO in production use. Note that this
differs from per-relation logical UNDO (added in subsequent patches),
which uses per-table UNDO forks and async rollback via background
workers.
Implement UNDO resource manager for B-tree indexes and regression test. When a transaction aborts, provisionally inserted index entries are marked LP_DEAD. Includes zero_vacuum test verifying aborted inserts leave no dead tuples and index consistency via bt_index_check().
Document the UNDO architecture including UNDO log design, record format, transaction integration, and heap AM integration details.
Add a self-contained benchmark suite comparing three scenarios: baseline (master), undo-compiled-but-off, and undo-enabled. Covers insert/update/delete throughput, rollback cost, VACUUM overhead, read stability under writes, storage footprint, and pgbench TPS with a mixed OLTP workload including 10% rollbacks.
Segment rotation and bulk UNDO hints were moved earlier in the commit series. Update README, HEAP_UNDO_DESIGN.md, and DESIGN_NOTES.md to reflect that these features are implemented rather than planned future work.
UNDO_PRUNE records are informational only — they are never applied during transaction rollback and exist solely for forensic recovery via pg_undorecover. Writing them synchronously during VACUUM added 2-3x overhead (measured via B6 benchmarks at scale=100k: vacuum_time went from 1.00x to 3.36x vs baseline). Exclude PRUNE_VACUUM_SCAN and PRUNE_VACUUM_CLEANUP from UNDO record generation, matching the existing PRUNE_ON_ACCESS exclusion. After this change, vacuum_time drops to 1.05x vs baseline.
For DELETE and UPDATE on UNDO-enabled tables, the old code copied
the tuple three times per row:
1. heap_copytuple() - palloc + full tuple memcpy
2. HeapUndoBuildPayload() - palloc + header + tuple memcpy
3. UndoRecordSerialize() - header + payload memcpy into uset buffer
Since the buffer is still exclusively locked when the UNDO record is
constructed (before START_CRIT_SECTION), we can read tuple data
directly from the page without copying.
Add UndoRecordAddPayloadParts() and HeapBulkUndoAddRecordParts() to
accept scatter-gather payloads (fixed header + tuple data), eliminating
the intermediate palloc. This reduces the copy chain from 3 to 1:
1. UndoRecordAddPayloadParts() - header + tuple data directly into
uset buffer
Measured improvement on B6 delete_5pct at scale=100k:
Before: 5.44x vs baseline (median 27.85ms)
After: 3.69x vs baseline (median 19.18ms)
…ssembly
Three changes to reduce per-row UNDO cost:
1. Lower the bulk UNDO activation threshold from 1000 to 100 estimated rows. Operations on 100-1000 rows (e.g., batch deletes, small updates) now use batched UNDO instead of a per-row UndoLogAllocate + WAL + pwrite cycle.
2. Eliminate MemoryContextSwitchTo() overhead in UndoRecordAddPayload() and UndoRecordAddPayloadParts(). Build the UNDO record header directly in the uset buffer instead of using a stack variable plus memcpy. Use MemoryContextAlloc/MemoryContextAllocZero instead of switching contexts in UndoRecordSetCreate(). Reduce the initial buffer from 8KB to 512 bytes (sufficient for a single record; it grows dynamically).
3. Piggyback nbtree UNDO records onto the heap bulk UNDO batch when bulk mode is active. Previously, every index tuple insertion created a separate UndoRecordSet with its own UndoLogAllocate + XLogInsert + pwrite cycle, doubling the per-row UNDO overhead for indexed tables. Now nbtree records are added to the heap's batch and flushed together.
Two optimizations targeting the per-row UNDO write path overhead:
1. File extent caching: add cached_size to UndoLogFdCacheEntry so that ExtendUndoLogFile() skips fstat() when the cached file size is already sufficient. The UNDO file only grows, so the cached size is a reliable lower bound. This eliminates roughly 90% of fstat() syscalls.
2. WAL record batching: instead of emitting one XLOG_UNDO_ALLOCATE WAL record per UndoRecordSetInsert(), coalesce allocations in the same log into a deferred batch. The batch is flushed as a single WAL record at transaction commit/abort, before log rotation, or when 64 entries accumulate. This reduces XLogInsert() overhead from per-row to per-batch. The redo handler is updated to only advance the insert pointer forward (never regress), which is necessary for correctness when concurrent backends produce overlapping coalesced ranges.
Benchmark impact on per-row operations (undo_on/baseline ratio):
- single_row_delete: 3.29x -> 1.88x (-43% overhead)
- single_row_update: 1.99x -> 1.40x (-30% overhead)
- individual_insert: 2.35x -> 1.95x (-17% overhead)
- vacuum_time: 1.19x -> 1.04x (-13% overhead)
All regression tests pass (251 subtests + recovery/CLR/crash/standby).
Replace two of the three shared-memory hash tables in sLog (SLogTxnHash and SLogXidHash) with a shared-memory skip-list and a compressed sparse bitmap (sparsemap), keeping SLogTupleHash unchanged. Transaction sLog changes:
- Skip-list keyed by (xid, reloid) replaces SLogTxnHash. Entries are ordered by xid then reloid, enabling O(log n) lookups and efficient xid-range operations (all entries for an xid are contiguous).
- Sparsemap replaces SLogXidHash for O(1) SLogXidIsPresent() checks. The hot-path presence check uses only a SpinLock-protected bitmap, avoiding the heavier LWLock entirely.
- A single LWLock replaces 4 partition locks. The skip-list is lock-free by design but is used in SKIPLIST_SINGLE_THREADED mode (no C11 stdatomic dependency) because the pool allocator and sparsemap require external synchronization. A single lock suffices since sLog is only modified on transaction abort.
- Pool allocator in shared memory: a contiguous slab of 260 slots (256 entries + 2 sentinels + 2 margin) with an index-based free-list. The EBR retire callback redirects node deallocation to the pool free-list instead of pfree() (which would crash on shared memory).
- SKIPLIST_MAX_HEIGHT reduced to 16 (supports 65K entries; the pool holds at most 256).
Public API signatures are preserved; callers are unaffected.
The multi-threaded atomic path now uses PostgreSQL's pg_atomic_uint64
type (port/atomics.h) instead of C11 <stdatomic.h>. The conversion
uses an anonymous union { pg_atomic_uint64 _pg; T _value; } for
_SKIP_ATOMIC(T), preserving type safety via __typeof__ at load/store
sites while routing all atomic operations through pg_atomic_*.
Key mappings:
- _skip_atomic_load → pg_atomic_read_u64 / pg_atomic_read_membarrier_u64
- _skip_atomic_store → pg_atomic_write_u64 / pg_atomic_write_membarrier_u64
- _skip_atomic_cas_strong/weak → pg_atomic_compare_exchange_u64 (always strong)
- _skip_atomic_fetch_add/sub → pg_atomic_fetch_add/sub_u64
- _skip_atomic_exchange → pg_atomic_exchange_u64
- _skip_atomic_thread_fence → pg_memory_barrier()
Pointer types are stored via uintptr_t type-punning through uint64.
The SKIPLIST_SINGLE_THREADED path (used by slog.c) is unchanged.
Sparsemap uses no atomics, so no conversion is needed there.
Add features from the lrlck benchmark framework:
- pg_prewarm: warm shared_buffers before in-cache benchmarks for deterministic results (controlled via a scale threshold)
- Wait event sampling: a background pg_stat_activity sampler captures lock contention patterns during pgbench runs at 2-second intervals
- Zipfian hot/cold workload (B10): a random_zipfian(1.2) distribution creates realistic skewed access patterns mixing cache hits and misses, with a 5% rollback rate for UNDO stress testing
- Multi-role concurrent workload (B11): 4 simultaneous pgbench instances with different behaviors (hot readers, cold readers, updaters with 20% rollback, range scanners) hitting the same database, exercising UNDO under realistic mixed-workload contention
- CV% (coefficient of variation) in reports: a stability indicator for the undo_on scenario, helping identify noisy measurements
- Statistics helpers: cv(), stdev(), and percentile() functions in common.sh for richer analysis
- Cache-pressure support: PGBENCH_SCALE_LARGE (default 500) for workloads that need a working set >> shared_buffers
UndoRecordSetCreate() parented the uset's memory context to CurrentMemoryContext. When called from heap_insert() during SPI execution (PL/pgSQL DO blocks), CurrentMemoryContext is the executor's per-query context (es_query_cxt), which is destroyed in FreeExecutorState() after each SPI_execute call. Since xactundo.c stores the uset pointer in the static XactUndo.record_set[] and reuses it across multiple statements within a transaction, the stale pointer caused a use-after-free crash on the second INSERT iteration. Fix by using TopTransactionContext as the parent, which survives across SPI statement boundaries but is cleaned up at transaction commit/abort.
Port the compressed sparse bitmap (sparsemap) as a shared PostgreSQL library in src/backend/lib/sparsemap.c with headers in src/include/lib/. The sparsemap provides O(1) presence checks for sequential integer keys using a compressed bitmap format that operates entirely in-place on a pre-allocated buffer -- ideal for shared memory use. Also add test modules (test_skiplist, test_sparsemap) exercising both data structures under the PostgreSQL TAP test framework.
…rhead
Enhance the UNDO subsystem with several performance improvements:
- Route UNDO I/O through shared_buffers via the undo buffer manager, eliminating direct file I/O for UNDO page reads/writes
- Batch WAL records for UNDO log allocation to reduce WAL volume
- Add UNDO-specific WAL redo support for new record types
- Reduce per-row UNDO overhead in heap INSERT/DELETE/UPDATE paths by avoiding unnecessary tuple copies and streamlining record assembly
- Update the UndoRecordSet API with scatter-gather payload support
…-back
UNDO buffers use virtual RelFileLocators with dbOid=9 (UNDO_DB_OID) to route I/O through shared_buffers. When the checkpointer flushes dirty UNDO buffers, md.c resolves this to the file path base/9/<lognum>. However, neither the base/9/ directory nor the smgr files were being created -- only the legacy base/undo/ files were managed. This caused checkpoints to fail with "could not open file base/9/1: No such file or directory". Fix by:
1. Creating the base/9/ directory during UndoLogShmemInit() at startup, before any WAL replay or checkpoint can reference UNDO buffers.
2. Adding ExtendUndoLogSmgrFile(), which creates and extends the smgr-managed file via smgrcreate()/smgrextend(), called from UndoLogAllocate() alongside the legacy ExtendUndoLogFile().
3. Calling ExtendUndoLogSmgrFile() from the XLOG_UNDO_EXTEND redo handler so recovery also prepares the smgr files.