Skip to content

MDEV-37949: Implement innodb_log_archive, innodb_log_recovery_start, innodb_log_recovery_target, …#4405

Open
dr-m wants to merge 1 commit into11.4from
MDEV-37949
Open

MDEV-37949: Implement innodb_log_archive, innodb_log_recovery_start, innodb_log_recovery_target, …#4405
dr-m wants to merge 1 commit into11.4from
MDEV-37949

Conversation

@dr-m
Copy link
Copy Markdown
Contributor

@dr-m dr-m commented Oct 28, 2025

  • The Jira issue number for this PR is: MDEV-37949

Description

MDEV-37949: Implement innodb_log_archive

The InnoDB write-ahead log file in the old innodb_log_archive=OFF format is named ib_logfile0, pre-allocated to innodb_log_file_size and written as a ring buffer. This is good for write performance and space management, but unsuitable for arbitrary point-in-time recovery or for facilitating efficient incremental backup.

innodb_log_archive=ON: A new format where InnoDB will create and preallocate files ib_%016x.log, instead of writing a circular file ib_logfile0. Each file will be pre-allocated to innodb_log_file_size. Once a log fills up, we will create and pre-allocate another log file, to which log records will be written. Upon the completion of the first log checkpoint in a recently created log file, the old log file will be marked read-only, signaling that there will be no further writes to that file, and that the file may safely be moved to long-term storage.

The file name includes the log sequence number (LSN) at file offset 12288 (log_t::START_OFFSET). We identify each checkpoint by storing a 64-bit big-endian offset into the optional FILE_MODIFY and the mandatory FILE_CHECKPOINT records, between 12288 and the end of the file.

The innodb_encrypt_log format is identified by storing the encryption information at the start of the log file. The first 64-bit value will be 1, which is an invalid checkpoint offset. Each innodb_log_archive=ON log must use the same encryption parameters. Changing innodb_encrypt_log or related parameters is only possible by setting innodb_log_archive=OFF and restarting the server, which will permanently lose the history of the archived log.

The maximum number of log checkpoints that the innodb_log_archive=ON file header can represent is limited to 12288/8=1536 when using innodb_encrypt_log=OFF. If we run out of slots in a log file, each subsequently completed checkpoint in that log file will overwrite the last slot in the checkpoint header, until we switch to the next log.

innodb_log_recovery_start: The checkpoint LSN to start recovery from. This will be useful when recovering from an archived log. This is useful for restoring an incremental backup (applying InnoDB log files that were copied since the previous restore).

innodb_log_recovery_target: The requested LSN to end recovery at. When this is set, all persistent InnoDB tables will be read-only, and no writes to the log are allowed. The intended purpose of this setting is to prepare an incremental backup, as well as to allow data retrieval as of a particular logical point of time.

Setting innodb_log_recovery_target>0 is much like setting innodb_read_only=ON, with the exception that the data files may be written to by crash recovery, and locking reads will conflict with any incomplete transactions as necessary, and all transaction isolation levels will work normally (not hard-wired to READ UNCOMMITTED).

The status variable innodb_lsn_archived will reflect the LSN since when a complete InnoDB log archive is available. Its initial value will be that of the new parameter innodb_log_archive_start. If that variable is 0 (the default), the innodb_lsn_archived will be recovered from the available log files. If innodb_log_archive=OFF, innodb_lsn_archived will be adjusted to the latest checkpoint every time a log checkpoint is executed. If innodb_log_archive=ON, the value
should not change.

Statements like

SET GLOBAL innodb_log_archive=!@@GLOBAL.innodb_log_archive;

will take effect as soon as possible, possibly after a log checkpoint has been completed. The log file will be renamed between ib_logfile0 and ib_%016x.log as appropriate.

When innodb_log_archive=ON, the setting SET GLOBAL innodb_log_file_size will affect subsequently created log files when the file that is being currently written is running out. If we are switching log files exactly at the same time, then a somewhat misleading error message "innodb_log_file_size change is already in progress" will be issued.

When innodb_read_only=ON, innodb_log_recovery_target will be set to the current LSN. This ensures that it suffices to check only one of these variables when blocking writes to persistent tables.

See the commit message for a more detailed description of the changed data structures and functions.

Release Notes

See the previous section.

How can this PR be tested?

mysql-test/mtr --parallel=auto --force --big-test --mysqld=--loose-innodb-log-archive --skip-test=mariabackup
mysql-test/mtr --parallel=auto --force --big-test --mysqld=--loose-innodb-log-archive --mysqld=--loose-innodb-log-recovery-start=12288 --skip-test=mariabackup
mysql-test/mtr --parallel=auto --force --big-test --mysqld=--loose-innodb-log-archive --mysqld=--loose-innodb-log-recovery-start=12288 --mysqld=--loose-inodb-log-file-mmap=OFF --skip-test=mariabackup
mysql-test/mtr --parallel=auto --force --big-test --mysqld=--loose-innodb-log-archive --mysqld=--loose-innodb-log-recovery-start=12288 --mysqld=--loose-inodb-log-file-mmap=ON --skip-test=mariabackup

The mariabackup suite must be skipped when setting innodb_log_archive=ON, because the mariadb-backup tool will only support the old innodb_log_archive=OFF format (copying from ib_logfile0).

Unfortunately, all --suite=encryption tests that use innodb_encrypt_log must be skipped when using innodb_log_archive. This is because the server would have to be reinitialized; we do not allow changing the format of an archived log on startup (such as adding or removing encryption). This combination is covered by the test innodb.log_file_size_online,encrypted.

no_checkpoint_prepare.inc: A new file, to prepare for subsequent inclusion of no_checkpoint_end.inc. We will invoke the server to parse the log and to determine the latest checkpoint.

A number of tests that would fail when the parameter innodb_log_recovery_start=12288 is present, which is forcing
recovery to start from the beginning of the history (the database creation), have been adjusted with
explicit --innodb-log-recovery-start=0 to override that:

  1. Some injected corruption may be "healed" by replaying the log from the beginning. Some tests expect an empty buffer pool after a restart, with no page I/O due to crash recovery.
  2. Any test that sets innodb_read_only=ON would fail with an error message that the setting prevents crash recovery, unless innodb_log_recovery_start=0.
  3. Any test that changes innodb_undo_tablespaces would fail in crash recovery, because crash recovery assumes that the undo tablespace ID that is available from the undo* files corresponds with the start of the log. This is an unforunate design bug which we cannot fix easily.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

This is a new feature, but for now based on the 11.4 branch so that any unrelated errors that may be found during testing can be fixed rather quickly. Merges to the main branch may be blocked for weeks at a time.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@dr-m dr-m self-assigned this Oct 28, 2025
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@dr-m dr-m force-pushed the MDEV-37949 branch 4 times, most recently from 4a9a384 to 9d14e2c Compare November 12, 2025 14:35
@dr-m dr-m force-pushed the MDEV-37949 branch 2 times, most recently from 1ab3880 to 254bfaa Compare January 8, 2026 15:13
@dr-m dr-m force-pushed the MDEV-37949 branch 2 times, most recently from b9728c2 to 091c2b3 Compare February 5, 2026 05:28
@dr-m dr-m changed the title MDEV-37949: Implement innodb_log_archive_file_size, innodb_log_archive_path, … MDEV-37949: Implement innodb_log_archive, innodb_log_recovery_start, innodb_log_recovery_target, … Feb 12, 2026
@dr-m dr-m force-pushed the MDEV-37949 branch 3 times, most recently from 1112487 to 998c03a Compare February 13, 2026 13:54
@dr-m dr-m force-pushed the MDEV-37949 branch 7 times, most recently from 4d2e8ad to 99959b5 Compare February 19, 2026 13:47
@dr-m dr-m marked this pull request as ready for review February 19, 2026 13:48
Comment thread storage/innobase/buf/buf0flu.cc
Comment thread storage/innobase/buf/buf0flu.cc Outdated
Comment thread storage/innobase/buf/buf0flu.cc Outdated
Comment thread storage/innobase/log/log0log.cc
Comment thread extra/mariabackup/backup_copy.cc
Comment thread storage/innobase/buf/buf0flu.cc Outdated
Comment thread storage/innobase/log/log0log.cc
Comment thread storage/innobase/log/log0log.cc
Comment thread storage/innobase/log/log0log.cc
Comment thread storage/innobase/log/log0log.cc
Comment thread storage/innobase/buf/buf0flu.cc
Comment thread storage/innobase/log/log0log.cc Outdated
Comment thread storage/innobase/mtr/mtr0mtr.cc Outdated
Comment on lines +953 to +961
static_assert(ARCHIVE_FILE_SIZE_MAX - START_OFFSET < 1ULL << 32,
"replace buf_size with capacity() below");

uint64_t l, lsn;
static_assert(WRITE_TO_BUF == WRITE_BACKOFF << 1, "");
while (UNIV_UNLIKELY((l= write_lsn_offset.fetch_add(size + WRITE_TO_BUF) &
(WRITE_TO_BUF - 1)) + size +
(lsn= base_lsn.load(std::memory_order_relaxed)) >=
first_lsn + buf_size) &&
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In b38bc40 we should have either preserved the assignments to buf_size and assertions buf_size == capacity(), or replaced the buf_size in the while expression with capacity().

This inconsistency is what caused hangs during SET GLOBAL innodb_log_archive=0 in the test innodb.log_file_size_online.

In the end, I ended up fixing this up in 4018e39. Curiously, the expression had to be ordered in this way so that GCC 16 -O2 would produce optimal code, with no redundant loads of the link-time-resolved address of log_sys.

@dr-m dr-m force-pushed the MDEV-37949 branch 2 times, most recently from d18934b to 257ec94 Compare April 23, 2026 12:50
@dr-m dr-m requested a review from Thirunarayanan April 23, 2026 13:09
@dr-m
Copy link
Copy Markdown
Contributor Author

dr-m commented Apr 23, 2026

To address MDEV-39162 I changed the checkpoint offsets from 32-bit to 64-bit, so that we can lift the innodb_log_file_size=4G limit for innodb_log_archive=ON files. We’re also trying to preallocate the next log file, so that there is a chance that when we actually switch files, there is a good chance that the file creation and preallocation that must be executed at that point (while holding exclusive log_sys.latch) will be no-op.

if (size == 1 && *srv_log_group_home_dir == '.')
size= 0;
std::string path;
path.reserve(size + sizeof "/ib_0000000000000000.log");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check this crash if its related
Found here: origin/MDEV-37949 257ec94

#0  __GI___libc_malloc (bytes=31) at ./malloc/malloc.c:3318
#1  operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  std::__new_allocator<char>::allocate (this=this@entry=, __n=<optimized out>) at /usr/include/c++/13/bits/new_allocator.h:151
#3  std::allocator_traits<std::allocator<char> >::allocate (__n=<optimized out>, __a=...) at /usr/include/c++/13/bits/alloc_traits.h:481
#4  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_allocate (__n=<optimized out>, __a=...) at /usr/include/c++/13/bits/basic_string.h:126
#5  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create (this=this@entry=, __capacity=30, __old_capacity=<optimized out>) at /usr/include/c++/13/bits/basic_string.tcc:159
#6  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::reserve (this=this@entry=, __res=<optimized out>, __res@entry=25) at /usr/include/c++/13/bits/basic_string.tcc:315
#7  log_t::get_archive_path[abi:cxx11](unsigned long) const (this=this@entry=<log_sys>, lsn=230649856) at /data/Server/MDEV-37949E/storage/innobase/log/log0log.cc:1188
#8  log_t::get_next_archive_path[abi:cxx11]() const (this=this@entry=<log_sys>) at /data/Server/MDEV-37949E/storage/innobase/log/log0log.cc:1196
#9  log_t::archive_create (this=this@entry=<log_sys>, ex=ex@entry=false) at /data/Server/MDEV-37949E/storage/innobase/log/log0log.cc:1246
#10 log_t::write_checkpoint (this=<optimized out>, checkpoint=checkpoint@entry=178245068, end_lsn=end_lsn@entry=178245550) at /data/Server/MDEV-37949E/storage/innobase/buf/buf0flu.cc:2124
#11 log_checkpoint_low (oldest_lsn=oldest_lsn@entry=178245068, end_lsn=end_lsn@entry=178245550) at /data/Server/MDEV-37949E/storage/innobase/buf/buf0flu.cc:2196
#12 buf_flush_sync_for_checkpoint (lsn=lsn@entry=178244888) at /data/Server/MDEV-37949E/storage/innobase/buf/buf0flu.cc:2403
#13 buf_flush_page_cleaner () at /data/Server/MDEV-37949E/storage/innobase/buf/buf0flu.cc:2679
#14 std::__invoke_impl<void, void (*)()> (__f=<optimized out>) at /usr/include/c++/13/bits/invoke.h:60
#15 std::__invoke<void (*)()> (__fn=<optimized out>) at /usr/include/c++/13/bits/invoke.h:96
#16 std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul> (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:292
#17 std::thread::_Invoker<std::tuple<void (*)()> >::operator() (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:299
#18 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:244
#19 ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#20 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#21 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(rr) frame 7
#7  log_t::get_archive_path[abi:cxx11](unsigned long) const (this=this@entry=<log_sys>, lsn=230649856) at /data/Server/MDEV-37949E/storage/innobase/log/log0log.cc:1188
1188      path.reserve(size + sizeof "/ib_0000000000000000.log");
(rr) p size
$1 = 0
(rr) p sizeof "/ib_0000000000000000.log"
$2 = 25
(rr) info args
this = <log_sys>
lsn = 230649856
(rr) info locals
size = 0
path = ""

Startup options used:-

# $RQG_HOME/rqg.pl \
# --gendata=conf/mariadb/oltp.zz \
# --max_gd_duration=900 \
# --grammar=conf/mariadb/oltp.yy \
# --redefine=conf/mariadb/redefine_innodb_log_archive.yy \
# --mysqld=--loose-innodb_log_archive=ON \
# --mysqld=--loose-innodb_log_recovery_start=12288 \
# --mysqld=--loose-innodb_log_file_size=4M \
# --reporters=LogArchiveRecovery \
# --vardir_type=fast \
# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--connect_timeout=60 \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--wait_timeout=28800 \
# --no_mask \
# --queries=10000000 \
# --seed=random \
# --reporters=None \
# --reporters=ErrorLog \
# --reporters=Deadlock \
# --validators=None \
# --mysqld=--log_output=none \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --engine=InnoDB \
# --restart_timeout=240 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --mysqld=--plugin-load-add=provider_lzo.so \
# --mysqld=--plugin-load-add=provider_bzip2.so \
# --mysqld=--plugin-load-add=provider_lzma.so \
# --mysqld=--plugin-load-add=provider_snappy.so \
# --mysqld=--plugin-load-add=provider_lz4.so \
# --mysqld=--loose-innodb_compression_level=1 \
# --duration=300 \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --redefine=conf/mariadb/redefine_innodb_log_size_dynamic.yy \
# --mysqld=--lock-wait-timeout=86400 \
# --mysqld=--innodb-lock-wait-timeout=50 \
# --mysqld=--loose-innodb_fast_shutdown=1 \
# --mysqld=--sql_mode=traditional \
# --mysqld=--loose-innodb_read_only_compressed=OFF \
# --mysqld=--innodb_stats_persistent=on \
# --mysqld=--innodb_adaptive_hash_index=off \
# --mysqld=--innodb_sort_buffer_size=65536 \
# --mysqld=--innodb_random_read_ahead=ON \
# --mysqld=--innodb_read_ahead_threshold=0 \
# --mysqld=--log-bin \
# --mysqld=--sync-binlog=1 \
# --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
# --mysqld=--loose-max-statement-time=30 \
# --threads=2 \
# --rr=rr record --wait \
# --mysqld=--innodb_undo_log_truncate=ON \
# --mysqld=--innodb_undo_tablespaces=3 \
# --mysqld=--innodb_rollback_on_timeout=OFF \
# --mysqld=--innodb_page_size=16K \
# --mysqld=--innodb-buffer-pool-size=256M \
# --vardir_type=fast \
# --no_mask \
# --batch \
# --basedir1=/Server_bin/MDEV-37949E_debug_Og \
# --basedir2=/Server_bin/MDEV-37949E_debug_Og \
# --script_debug=_nix_ \
# --major_runid=1776949969 \
# --minor_runid=136
#--------

RR trace is present on SDP :-
/data/results/1776949969/TBR-2400-MDEV-37949

@return byte offset within ib_logfile0 */
lsn_t calc_lsn_offset(lsn_t lsn) const noexcept
{
ut_ad(lsn >= first_lsn);
Copy link
Copy Markdown

@saahilalam saahilalam Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check this Assertion :-
# 2026-04-23T13:28:53 [1701911] | mariadbd: /data/Server/MDEV-37949F/storage/innobase/include/log0log.h:717: lsn_t log_t::calc_lsn_offset(lsn_t) const: Assertion lsn >= first_lsn' failed.`

RQG startup options :-
`# $RQG_HOME/rqg.pl \
# --gendata=conf/engines/innodb/full_text_search.zz \
# --max_gd_duration=1200 \
# --short_column_names \
# --grammar=conf/engines/innodb/full_text_search.yy \
# --redefine=conf/mariadb/redefine_innodb_log_archive.yy \
# --mysqld=--loose-innodb_log_archive=ON \
# --mysqld=--loose-innodb_log_file_size=32M \
# --reporters=LogArchiveRecovery \
# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--connect_timeout=60 \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--wait_timeout=28800 \
# --no_mask \
# --queries=10000000 \
# --seed=random \
# --reporters=None \
# --reporters=ErrorLog \
# --reporters=Deadlock \
# --validators=None \
# --mysqld=--log_output=none \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --engine=InnoDB \
# --restart_timeout=240 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --mysqld=--plugin-load-add=provider_lzo.so \
# --mysqld=--plugin-load-add=provider_bzip2.so \
# --mysqld=--plugin-load-add=provider_lzma.so \
# --mysqld=--plugin-load-add=provider_snappy.so \
# --mysqld=--plugin-load-add=provider_lz4.so \
# --mysqld=--loose-innodb_compression_level=1 \
# --duration=300 \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --mysqld=--lock-wait-timeout=15 \
# --mysqld=--innodb-lock-wait-timeout=10 \
# --mysqld=--loose-innodb_fast_shutdown=1 \
# --mysqld=--sql_mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION \
# --mysqld=--loose-innodb_read_only_compressed=OFF \
# --mysqld=--innodb_stats_persistent=on \
# --mysqld=--innodb_adaptive_hash_index=on \
# --mysqld=--innodb_sort_buffer_size=65536 \
# --mysqld=--innodb_random_read_ahead=OFF \
# --mysqld=--log-bin \
# --mysqld=--sync-binlog=1 \
# --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \
# --threads=3 \
# --rr=rr record --chaos --wait \
# --mysqld=--innodb_undo_log_truncate=OFF \
# --mysqld=--innodb_rollback_on_timeout=OFF \
# --mysqld=--innodb_page_size=16K \
# --mysqld=--innodb-buffer-pool-size=256M \
# --vardir_type=fast \
# --no_mask \
# --batch \
# --basedir1=/Server_bin/MDEV-37949F_debug_Og \
# --basedir2=/Server_bin/MDEV-37949F_debug_Og \
# --script_debug=_nix_ \
# --major_runid=1776971149 \
# --minor_runid=111
`

RR trace is present on SDP :
/data/results/1776971149/TBR-2402-MDEV-37949

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log is corrupted, because there is some overlap between the ib_*.log file names. The last valid file is ib_0000000002ffd000.log, which point to a checkpoint that starts before this file. There is a preceding log file ib_0000000002000000.log of 33554432 bytes. For that length, the next log file name should have been ib_0000000003ffd000.log.

443032c fixes this crash but not the corruption. I am yet to check the rr replay of each preceding mariadbd invocation to find out how we ended up with such log file names and how to prevent this kind of corruption.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We created the file with the incorrect name ib_0000000002ffd000.log because the log_sys.file_size had been assigned in the following:

#0  log_t::archived_mmap_switch_complete (this=this@entry=0x555917adc5c0 <log_sys>) at /data/Server/MDEV-37949F/storage/innobase/mtr/mtr0mtr.cc:359
#1  0x000055591645c031 in mtr_t::commit_log_release<true> (this=this@entry=0x70af1fff86b0) at /data/Server/MDEV-37949F/storage/innobase/mtr/mtr0mtr.cc:368
#2  0x0000555916466c90 in mtr_t::commit_log<true> (mtr=0x70af1fff86b0, lsns=<optimized out>) at /data/Server/MDEV-37949F/include/mysql/psi/mysql_thread.h:795
#3  0x000055591645fccf in mtr_t::commit (this=this@entry=0x70af1fff86b0) at /data/Server/MDEV-37949F/storage/innobase/mtr/mtr0mtr.cc:544
#4  0x00005559164c88a8 in row_ins_sec_index_entry_low (flags=flags@entry=0x0, mode=mode@entry=BTR_INSERT_TREE, index=index@entry=0x555918b12f60, offsets_heap=<optimized out>, offsets_heap@entry=0x54e314582510, 
    heap=heap@entry=0x54e3145829a0, entry=entry@entry=0x54e314582100, trx_id=0x0, thr=0x54e31457eee0) at /data/Server/MDEV-37949F/storage/innobase/row/row0ins.cc:3261
  file_size= resize_target;
  return lsn;

A subsequent checkpoint would successfully create and allocate this file but not use it for anything yet:

#0  log_t::archive_create (this=this@entry=0x555917adc5c0 <log_sys>, ex=ex@entry=0x0) at /data/Server/MDEV-37949F/storage/innobase/log/log0log.cc:1309
#1  0x00005559165bddbd in log_t::write_checkpoint (this=<optimized out>, checkpoint=checkpoint@entry=0x1dd5442, end_lsn=end_lsn@entry=0x2000284) at /data/Server/MDEV-37949F/storage/innobase/buf/buf0flu.cc:2124
#2  0x00005559165b877d in log_checkpoint_low (oldest_lsn=oldest_lsn@entry=0x1dd5442, end_lsn=end_lsn@entry=0x2000284) at /data/Server/MDEV-37949F/storage/innobase/buf/buf0flu.cc:2196
#3  0x00005559165ba940 in buf_flush_sync_for_checkpoint (lsn=lsn@entry=0x2000284) at /data/Server/MDEV-37949F/storage/innobase/buf/buf0flu.cc:2403

A long time later, we open this file for writing. Everything else is fine, but half of the file ib_0000000002000000.log is being wasted because we had wrongly assigned the log_sys.file_size:

od -Ax -t x1 data/ib_0000000002000000.log |tail
ffff90 5f 33 2e 69 62 64 b0 2d 99 30 00 2e 2f 74 65 73
ffffa0 74 2f 46 54 53 5f 30 30 30 30 30 30 30 30 30 30
ffffb0 30 30 31 38 39 31 5f 30 30 30 30 30 30 30 30 30
ffffc0 30 30 30 31 65 30 31 5f 49 4e 44 45 58 5f 34 2e
ffffd0 69 62 64 b0 2d 99 31 00 2e 2f 74 65 73 74 2f 46
ffffe0 54 53 5f 30 30 30 30 30 30 30 30 30 30 30 30 31
fffff0 38 39 31 5f 30 30 30 30 30 30 30 30 30 30 30 30
1000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
2000000

We must only assign log_sys.file_size after successfully creating the next archive log file, and nowhere else. This incorrect assignment should be the explanation of any recovery problem where the archive log file names and sizes are not consistent with each other.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can recover the data directory after the following fixup:

chmod u+w data/ib_0000000002000000.log
truncate -s 16m data/ib_0000000002000000.log

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bug should be specific to the WITH_INNODB_PMEM code path, and it could be fixed by c819703. Before I spend more time on trying to fix this, I’d like to know if this is still reproducible with the current version. For the record, I was considering the following kind of changes (completely untested):

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index ba3bcf58a63..8852f5d5714 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1844,7 +1844,6 @@ inline bool log_t::write_checkpoint(lsn_t checkpoint, lsn_t end_lsn) noexcept
         buf= resize_buf;
         resize_buf= nullptr;
         first_lsn+= capacity();
-        file_size= resize_target;
         goto unmap_old_checkpoint;
       }
       else if (c && is_mmap())
diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
index 61ca0e39286..101f62397d5 100644
--- a/storage/innobase/log/log0log.cc
+++ b/storage/innobase/log/log0log.cc
@@ -1268,6 +1268,7 @@ ATTRIBUTE_COLD void log_t::archive_create(bool ex) noexcept
     success= os_file_set_size(path.c_str(), file, resize_target);
     if (success && ex)
     {
+      file_size= resize_target;
 #ifdef HAVE_PMEM
       if (is_mmap())
       {
diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc
index 0483f3723e9..22cd7a4f84e 100644
--- a/storage/innobase/mtr/mtr0mtr.cc
+++ b/storage/innobase/mtr/mtr0mtr.cc
@@ -355,7 +355,6 @@ ATTRIBUTE_COLD lsn_t log_t::archived_mmap_switch_complete() noexcept
   buf= resize_buf;
   resize_buf= nullptr;
   first_lsn= last_lsn;
-  file_size= resize_target;
   return lsn;
 }
 

The PMEM write path is rather different from the pwrite path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I did not get into this yesterday. The above patch would crash for me when I am running the following:

mysql-test/mtr --mysqld=--loose-innodb-log-{archive,file-size=4m,recovery-start=12288} --parallel=auto --big-test --force --max-test-fail=0 --skip-test=mariabackup

I believe that the trouble could have started by running sys_vars.innodb_log_file_size_basic on the same worker instance:

SET @@GLOBAL.innodb_log_file_size=10485760;

Corresponding to this, in the core dump I observed log_sys.file_size=0xa00000. But the actual size of log_sys.buf should have been only 0x400000, corresponding to the ib_0000000000003000.log (of 4 MiB), which was followed by ib_0000000000400000.log (of 10MiB). Here is some test harness log:

sys_vars.innodb_log_buffer_size_basic    w13 [ pass ]      3
sys_vars.innodb_log_checkpoint_now_basic w13 [ pass ]     11
sys_vars.innodb_log_file_size_basic      w13 [ pass ]      4
sys_vars.innodb_log_group_home_dir_basic w13 [ pass ]      3
…
sys_vars.log_slow_admin_statements_func  w13 [ fail ]
        Test ended at 2026-04-28 14:03:23

CURRENT_TEST: sys_vars.log_slow_admin_statements_func
mysqltest: At line 41: query 'CALL add_rows()' failed: <Unknown> (2013): Lost connection to server during query

The crash would occur in a log write that spans two files:

#6  0x0000555b31b8a88d in log_t::append (d=@0x7f12685ef5c8: 0x7f126a3fbffe "'\b", s=0x7f12685efc08, size=0x134) at /mariadb/11.4/storage/innobase/mtr/mtr0mtr.cc:1322
#7  0x0000555b31b8b08f in mtr_t::finish_writer<(log_t::write)2> (mtr=0x7f12685efa90, len=0x139) at /mariadb/11.4/storage/innobase/mtr/mtr0mtr.cc:1348
#8  0x0000555b31b85da9 in mtr_t::finish_write (this=0x7f12685efa90, len=0x139) at /mariadb/11.4/storage/innobase/include/mtr0mtr.h:728
#9  mtr_t::do_write (this=this@entry=0x7f12685efa90) at /mariadb/11.4/storage/innobase/mtr/mtr0mtr.cc:1214
#10 0x0000555b31b891a3 in mtr_t::commit (this=this@entry=0x7f12685efa90) at /mariadb/11.4/storage/innobase/mtr/mtr0mtr.cc:540

The problem is that instead of passing size=0x134 we should only have passed size=2 to this buffer. The rest should have been written to log_sys.resize_buf.

The only option that I can see is to introduce a PMEM specific member of log_sys to keep track of the current size of log_sys.buf, independently of the log_sys.file_size that was chosen for the new log file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can repurpose log_sys.write_to_log for storing this extra information, because it is not being used in the PMEM write path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a test case for exercising this scenario:

--source include/have_innodb.inc

SET GLOBAL innodb_log_archive=ON;
CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB) ENGINE=InnoDB;

DELIMITER //;
CREATE PROCEDURE fill_t1(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    START TRANSACTION;
    INSERT INTO t1(v) VALUES (REPEAT('x', 2048));
    ROLLBACK;
    SET i = i + 1;
  END WHILE;
END//

CREATE PROCEDURE back_and_forth(IN n INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE EXIT HANDLER FOR 1221 BEGIN END;
  WHILE i < n DO
  SET GLOBAL innodb_log_file_size=4194304, innodb_log_file_size=4198400;
  SET i = i + 1;
  END WHILE;
END//

DELIMITER ;//

--connect con1,localhost,root
send call fill_t1(1000000);
--connection default
call back_and_forth(100000000);

This is easily crashing with the above patch:

2026-04-29  8:49:11 4 [Note] InnoDB: setting innodb_log_archive=1 at innodb_log_recovery_start=54088
mariadbd: /mariadb/11.4/storage/innobase/log/log0log.cc:1501: void log_t::archived_mmap_switch_prepare(bool, bool): Assertion `lsn - get_flushed_lsn(std::memory_order_relaxed) < capacity()' failed.
260429  8:49:12 [ERROR] /dev/shm/11.4/sql/mariadbd got signal 6 ;

c819703 without the patch does not crash. The following assertions would fail, in both the PMEM and pwrite code paths:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index ba3bcf58a63..696a33e6049 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1851,6 +1851,8 @@ inline bool log_t::write_checkpoint(lsn_t checkpoint, lsn_t end_lsn) noexcept
       {
       unmap_old_checkpoint:
         checkpoint_buf= nullptr;
+        ut_ad(os_file_get_size(log.m_file) == file_size);
+        ut_ad(os_file_get_size(resize_log.m_file) == resize_target);
         my_munmap(c, lseek(resize_log.m_file, 0, SEEK_END));
         goto first_checkpoint_in_new_archive;
       }
@@ -1863,6 +1865,8 @@ inline bool log_t::write_checkpoint(lsn_t checkpoint, lsn_t end_lsn) noexcept
 #endif
         first_lsn+= capacity();
 
+        ut_ad(os_file_get_size(log.m_file) == file_size);
+        ut_ad(os_file_get_size(resize_log.m_file) == resize_target);
         file_size= resize_target;
 
 #ifdef HAVE_PMEM

In the first hunk this failure is very obvious.

Now that I have a reduced test case for repeating the problem, I can work on a fix.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion fails for both the mmap and pwrite write paths. Previously I thought this would only affect the mmap path:

diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
index 61ca0e39286..3bcc37a1117 100644
--- a/storage/innobase/log/log0log.cc
+++ b/storage/innobase/log/log0log.cc
@@ -1243,6 +1243,7 @@ ATTRIBUTE_COLD void log_t::archive_create(bool ex) noexcept
     return;
   }
 
+  ut_ad(file_size == os_file_get_size(log.m_file));
   std::string path{get_next_archive_path()};
   bool success;
   os_file_t file=

I revised my test so that it will run for a longer time:

--source include/have_innodb.inc
--source include/have_sequence.inc

CREATE TABLE t1 (id INT AUTO_INCREMENT PRIMARY KEY, v LONGBLOB) ENGINE=InnoDB;
--disable_query_log
DELIMITER //;
CREATE PROCEDURE fill_t1(IN n INT)
BEGIN
  WHILE n > 0 DO
    START TRANSACTION;
    INSERT INTO t1(v) SELECT REPEAT('x', 2048) FROM seq_1_to_10000;
    ROLLBACK;
    SET n = n - 1;
  END WHILE;
END//
DELIMITER ;//

--connect con1,localhost,root
send call fill_t1(100000);
--connection default
let $N=1000000;
while ($N) {
--sleep 0.1
SET GLOBAL innodb_log_file_size=4194304;
--sleep 0.1
SET GLOBAL innodb_log_file_size=4198400;
dec $N;
}
--enable_query_log

DROP PROCEDURE fill_t1;
DROP TABLE t1;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that os_file_set_size() would never shrink files on POSIX, only enlarge them. I finally caught this by adding some os_file_get_size() assertions. Fixed in bc87ab8.

ENGINE SUPPORT COMMENT TRANSACTIONS XA SAVEPOINTS
FOUND 1 /InnoDB: Upgrade after a crash is not supported. This redo log was created before MariaDB 10\.2\.2, and we did not find a valid checkpoint/ in mysqld.1.err
# empty redo log from before MariaDB 10.2.2
# restart: --innodb-data-home-dir=MYSQLTEST_VARDIR/tmp/log_corruption --innodb-log-group-home-dir=MYSQLTEST_VARDIR/tmp/log_corruption --innodb-log-recovery-target=12345
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getting this error :-

`# 2026-04-23T12:31:05 [3898794] | 2026-04-23 12:31:05 0 [ERROR] InnoDB: ib_000000001e7f1000.log is in unrecognized format
# 2026-04-23T12:31:05 [3898794] | 2026-04-23 12:31:05 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1425] with error Generic error`

Log files present at SDP:
/data/results/1776971149/000931

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recovery will only consider the all-zero file ib_000000001e7f1000 because there is a discontinuity: the 268435456-byte file ib_000000001a7f4000.log would be expected to be followed by ib_000000002a7f1000.log. If I rename the dummy file to the expected name, then the recovery will proceed past this point:

2026-04-24 13:11:38 0 [Note] InnoDB: innodb_buffer_pool_size_max=8388614m, innodb_buffer_pool_size=10240m
2026-04-24 13:11:38 0 [Note] InnoDB: Initialized memory pressure event listener
2026-04-24 13:11:38 0 [Note] InnoDB: Completed initialization of buffer pool
2026-04-24 13:11:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=489767822
2026-04-24 13:11:38 0 [Note] InnoDB: End of log at LSN=490002980
2026-04-24 13:11:38 0 [Note] InnoDB: To recover: 94 pages
2026-04-24 13:11:38 0 [Note] InnoDB: Last binlog file './mysql-bin.000001', position 5953096
2026-04-24 13:11:38 0 [Warning] InnoDB: Cannot change innodb_undo_tablespaces=0 because previous shutdown was not with innodb_fast_shutdown=0
2026-04-24 13:11:38 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2026-04-24 13:11:38 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2026-04-24 13:11:38 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2026-04-24 13:11:38 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2026-04-24 13:11:38 0 [Note] InnoDB: log sequence number 490002980 (memory-mapped); transaction id 28255

To analyze the root cause of this failure, I would need a log that shows the sequence of events that led to this file being created during a previous execution of mariadbd in the same data directory. rr replay would be the preferred format. It is possible that this error will be fixed by the revision 4c3af6a, which I did not push to this branch yet.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The revision to the spare archive log creation is now included in c819703.

{
first_checkpoint_in_new_archive:
c= buf;
ut_ad(!memcmp_aligned<512>(c, field_ref_zero, START_OFFSET));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hitting this crash

#0  memcmp@plt ()
#1  memcmp_aligned<512ul> (s1=s1@entry=, s2=<optimized out>, n=n@entry=12288) at /data/Server/MDEV-37949D/include/assume_aligned.h:60
#2  log_t::write_checkpoint (this=<log_sys>, checkpoint=checkpoint@entry=58358328, end_lsn=end_lsn@entry=58700523) at /data/Server/MDEV-37949D/storage/innobase/buf/buf0flu.cc:1873
#3  log_checkpoint_low (oldest_lsn=oldest_lsn@entry=58358328, end_lsn=end_lsn@entry=58700523) at /data/Server/MDEV-37949D/storage/innobase/buf/buf0flu.cc:2191
#4  buf_flush_sync_for_checkpoint (lsn=lsn@entry=58700400) at /data/Server/MDEV-37949D/storage/innobase/buf/buf0flu.cc:2401
#5  buf_flush_page_cleaner () at /data/Server/MDEV-37949D/storage/innobase/buf/buf0flu.cc:2696
#6  std::__invoke_impl<void, void (*)()> (__f=<optimized out>) at /usr/include/c++/13/bits/invoke.h:60
#7  std::__invoke<void (*)()> (__fn=<optimized out>) at /usr/include/c++/13/bits/invoke.h:96
#8  std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul> (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:292
#9  std::thread::_Invoker<std::tuple<void (*)()> >::operator() (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:299
#10 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:244
#11 ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#12 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#13 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

RR trace is present on SDP :-

/data/results/1777036660/TBR-2403-MDEv-37949

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rr replay trace is invalid, because the size of the mariadbd executable has changed. This does not look like an actual crash to me. Was the process forcibly killed because a shutdown was taking a longer time under rr record?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the executable was replaced with a matching one, I got the rr replay to work. This looks like a bogus SIGSEGV. For some reason, I can’t disassemble the memcmp@plt in the rr replay trace. What I can see is that we are comparing c=log_sys.buf (the checkpoint buffer) to field_ref_zero. Both buffers are accessible in the rr replay trace.

Why was the SIGSEGV sent by the test harness? I know it may send such signals when a server shutdown appears to take longer than expected. Here, I can’t see any sign of a hang or even an attempt to shut down the server. At least one thread is waiting for a checkpoint to be completed. There is one pending read in read_slots, but it was not initiated a long time ago, and nothing seems to be waiting for it to complete. Everything seems normal to me, except that the test harness sent a SIGSEGV for some reason.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the rr source code, and the only places where it invokes raise(3) or kill(2) with SIGSEGV are in unit tests or in the check coredumping_signal_takes_down_entire_vm() which I does not seem to be sending anything to a ptrace(2) tracee.

sql_print_error("InnoDB: Your database may be corrupt or"
" you may have copied the InnoDB"
" tablespace but not the ib_logfile0. %s",
" tablespaces but not the log. %s",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please chek if this crash is related :-

# 2026-04-24T08:43:49 [1602557] | #0  0x000059ca779d4ec3 in my_thread_var_dbug () at /data/Server/MDEV-37949D/mysys/my_thr_init.c:440
# 2026-04-24T08:43:49 [1602557] | #1  0x000059ca779ed9aa in code_state () at /data/Server/MDEV-37949D/dbug/dbug.c:382
# 2026-04-24T08:43:49 [1602557] | #2  0x000059ca779ee7e6 in _db_keyword_ (cs=0x0, keyword=0x59ca77bd2c28 "page_intermittent_checksum_mismatch", strict=1) at /data/Server/MDEV-37949D/dbug/dbug.c:1769
# 2026-04-24T08:43:49 [1602557] | #3  0x000059ca778aad65 in buf_page_is_corrupted (check_lsn=check_lsn@entry=true, read_buf=read_buf@entry=0x7537c8320000 "", fsp_flags=<optimized out>) at /data/Server/MDEV-37949D/storage/innobase/buf/buf0buf.cc:649
# 2026-04-24T08:43:49 [1602557] | #4  0x000059ca778abdf1 in buf_page_check_corrupt (bpage=bpage@entry=0x7537c8023990, node=...) at /data/Server/MDEV-37949D/storage/innobase/buf/buf0buf.cc:3518
# 2026-04-24T08:43:49 [1602557] | #5  0x000059ca778acc0c in buf_page_t::read_complete (this=0x7537c8023990, node=..., recovery=recovery@entry=false) at /data/Server/MDEV-37949D/storage/innobase/buf/buf0buf.cc:3631
# 2026-04-24T08:43:49 [1602557] | #6  0x000059ca7790ca94 in IORequest::read_complete (this=this@entry=0x59ca7a2469f8, io_error=0) at /data/Server/MDEV-37949D/storage/innobase/fil/fil0fil.cc:2976
# 2026-04-24T08:43:49 [1602557] | #7  0x000059ca777737cd in read_io_callback (c=0x59ca7a246950) at /data/Server/MDEV-37949D/storage/innobase/os/os0file.cc:3027
# 2026-04-24T08:43:49 [1602557] | #8  0x000059ca77976ad1 in tpool::task_group::execute (this=0x59ca7a371fe0, t=t@entry=0x59ca7a2469d8) at /data/Server/MDEV-37949D/tpool/task_group.cc:70
# 2026-04-24T08:43:49 [1602557] | #9  0x000059ca77976b77 in tpool::task::execute (this=0x59ca7a2469d8) at /data/Server/MDEV-37949D/tpool/task.cc:32
# 2026-04-24T08:43:49 [1602557] | #10 0x000059ca7797428f in tpool::thread_pool_generic::worker_main (this=0x59ca7a279df0, thread_var=0x59ca7a32a470) at /data/Server/MDEV-37949D/tpool/tpool_generic.cc:568
# 2026-04-24T08:43:49 [1602557] | #11 0x000059ca779767fd in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__f=<optimized out>, __t=<optimized out>) at /usr/include/c++/13/bits/invoke.h:74
# 2026-04-24T08:43:49 [1602557] | #12 0x000059ca7797681b in std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/13/bits/invoke.h:90
# 2026-04-24T08:43:49 [1602557] | #13 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:292
# 2026-04-24T08:43:49 [1602557] | #14 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:299
# 2026-04-24T08:43:49 [1602557] | #15 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:244
# 2026-04-24T08:43:49 [1602557] | #16 0x00007d37e46eabb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
# 2026-04-24T08:43:49 [1602557] | #17 0x00007d37e429ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
# 2026-04-24T08:43:49 [1602557] | #18 0x00007d37e4329a34 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

Rqg startup options:-

# $RQG_HOME/rqg.pl \
# --gendata=conf/mariadb/oltp.zz \
# --max_gd_duration=900 \
# --grammar=conf/mariadb/oltp.yy \
# --redefine=conf/mariadb/redefine_innodb_log_archive.yy \
# --mysqld=--loose-innodb_log_archive=ON \
# --mysqld=--loose-innodb_log_recovery_start=12288 \
# --mysqld=--loose-innodb_log_file_size=4M \
# --mysqld=--loose-innodb_log_file_mmap=ON \
# --reporters=LogArchiveRecovery \
# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--connect_timeout=60 \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--wait_timeout=28800 \
# --no_mask \
# --queries=10000000 \
# --seed=random \
# --reporters=None \
# --reporters=ErrorLog \
# --reporters=Deadlock \
# --validators=None \
# --mysqld=--log_output=none \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --engine=InnoDB \
# --restart_timeout=240 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --mysqld=--plugin-load-add=provider_lzo.so \
# --mysqld=--plugin-load-add=provider_bzip2.so \
# --mysqld=--plugin-load-add=provider_lzma.so \
# --mysqld=--plugin-load-add=provider_snappy.so \
# --mysqld=--plugin-load-add=provider_lz4.so \
# --mysqld=--loose-innodb_compression_level=1 \
# --duration=300 \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --redefine=conf/mariadb/redefine_innodb_log_size_dynamic.yy \
# --mysqld=--lock-wait-timeout=86400 \
# --mysqld=--innodb-lock-wait-timeout=50 \
# --mysqld=--loose-innodb_fast_shutdown=0 \
# --mysqld=--sql_mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION \
# --mysqld=--loose-innodb_read_only_compressed=OFF \
# --mysqld=--innodb_stats_persistent=on \
# --mysqld=--innodb_adaptive_hash_index=off \
# --mysqld=--innodb_random_read_ahead=ON \
# --mysqld=--innodb_read_ahead_threshold=0 \
# --mysqld=--log-bin \
# --mysqld=--sync-binlog=1 \
# --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \
# --threads=2 \
# --rr=rr record --wait \
# --mysqld=--innodb_undo_log_truncate=OFF \
# --mysqld=--innodb_undo_tablespaces=3 \
# --mysqld=--innodb_rollback_on_timeout=OFF \
# --mysqld=--innodb_page_size=4K \
# --mysqld=--innodb-buffer-pool-size=5M \
# --vardir_type=fast \
# --no_mask \
# --batch \
# --basedir1=/Server_bin/MDEV-37949D_debug_Og \
# --basedir2=/Server_bin/MDEV-37949D_debug_Og \
# --script_debug=_nix_ \
# --major_runid=1777036660 \
# --minor_runid=48
#--------
# rqg_rr_add ->"--innodb-use-native-aio=0" "--loose-innodb-write-io-threads=2" "--loose-innodb-read-io-threads=1" "--loose-gdb" "--loose-debug-gdb"<-
# vardir ->/dev/shm/rqg/1777036660/48<- fs_type=tmpfs

RR trace is present on SDP :-
/data/results/1777036660/TBR-2404-MDEV-37949

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trace rr replay /data/results/1777036660/TBR-2404-MDEV-37949/1/rr/latest-trace/ is invalid, because the size of the executable has been changed:

[ERROR src/TraceStream.cc:1207:read_mapped_region()] Metadata of /data/Server_bin/MDEV-37949D_debug_Og/bin/mariadbd changed: replay divergence likely, but continuing anyway. inode: 68621966/68622876; mode: 33277/33277; uid: 1005/1005; gid: 1002/1002; size: 226516408/226321408; mtime: 1774850556/1777036278
…
0x00007d37e5254540 in _start () from /lib64/ld-linux-x86-64.so.2
(rr) c
Continuing.
[FATAL src/ReplaySession.cc:1265:check_ticks_consistency()] 
 (task 3246217 (rec:1604921) at time 87)
 -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: mmap'; expected 9341, got 9339

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile, the executable had been replaced with a matching one. This one looks like a genuine SIGSEGV, but outside InnoDB code, in the DBUG facility, when trying to read tmp->init:

Thread 3 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1604921.1606330]
0x000059ca779d4ec3 in my_thread_var_dbug () at /data/Server/MDEV-37949D/mysys/my_thr_init.c:440
440	  return tmp && tmp->init ? &tmp->dbug : 0;
(rr) bt
#0  0x000059ca779d4ec3 in my_thread_var_dbug () at /data/Server/MDEV-37949D/mysys/my_thr_init.c:440
#1  0x000059ca779ed9aa in code_state () at /data/Server/MDEV-37949D/dbug/dbug.c:382
#2  0x000059ca779ee7e6 in _db_keyword_ (cs=0x0, keyword=0x59ca77bd2c28 "page_intermittent_checksum_mismatch", strict=0x1) at /data/Server/MDEV-37949D/dbug/dbug.c:1769
#3  0x000059ca778aad65 in buf_page_is_corrupted (check_lsn=check_lsn@entry=0x1, read_buf=read_buf@entry=0x7537c8320000 "", fsp_flags=<optimized out>)
    at /data/Server/MDEV-37949D/storage/innobase/buf/buf0buf.cc:649
…
(rr) disas
Dump of assembler code for function my_thread_var_dbug:
   0x000059ca779d4ea8 <+0>:	endbr64
   0x000059ca779d4eac <+4>:	cmpb   $0x0,0x144e345(%rip)        # 0x59ca78e231f8 <my_thread_global_init_done>
   0x000059ca779d4eb3 <+11>:	je     0x59ca779d4ed7 <my_thread_var_dbug+47>
   0x000059ca779d4eb5 <+13>:	sub    $0x8,%rsp
   0x000059ca779d4eb9 <+17>:	call   0x59ca779d49d9 <_my_thread_var>
   0x000059ca779d4ebe <+22>:	test   %rax,%rax
   0x000059ca779d4ec1 <+25>:	je     0x59ca779d4ed2 <my_thread_var_dbug+42>
=> 0x000059ca779d4ec3 <+27>:	cmpb   $0x0,0x150(%rax)
   0x000059ca779d4eca <+34>:	je     0x59ca779d4edd <my_thread_var_dbug+53>
   0x000059ca779d4ecc <+36>:	add    $0x158,%rax
   0x000059ca779d4ed2 <+42>:	add    $0x8,%rsp
   0x000059ca779d4ed6 <+46>:	ret
   0x000059ca779d4ed7 <+47>:	mov    $0x0,%eax
   0x000059ca779d4edc <+52>:	ret
   0x000059ca779d4edd <+53>:	mov    $0x0,%eax
   0x000059ca779d4ee2 <+58>:	jmp    0x59ca779d4ed2 <my_thread_var_dbug+42>
End of assembler dump.
(rr) i reg rax
rax            0x7537b000ad70      0x7537b000ad70
(rr) list
435	{
436	  struct st_my_thread_var *tmp;
437	  if (!my_thread_global_init_done)
438	    return NULL;
439	  tmp= my_thread_var;
440	  return tmp && tmp->init ? &tmp->dbug : 0;
441	}
442	#endif /* DBUG_OFF */
443	
444	/* Return pointer to mutex_in_use */
(rr) p *my_thread_var
Cannot access memory at address 0x7537b000ad70

I am not familiar with this subsystem. InnoDB is invoking DBUG_EXECUTE_IF to try to pretend that a page that was read is corrupted. In any case, this crash is completely unrelated to these changes; we are not changing any data page I/O or buffer pool access in this pull request. I did not find any open bugs that would match this crash.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you file a draft bug for this?

Comment thread storage/innobase/mtr/mtr0mtr.cc
Comment thread storage/innobase/log/log0log.cc Outdated
ut_ad(resize_target >= FILE_SIZE_MIN);
ut_ad(length);
ut_ad(!is_mmap());
ut_a(offset + length + START_OFFSET <= file_size + resize_target);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion easily fails when I run Sysbench oltp_update_index with the minimum innodb_log_file_size=4m. That is, the log checkpoints are not being completed frequently enough. I can reproduce this both with CMAKE_BUILD_TYPE=RelWithDebInfo and CMAKE_BUILD_TYPE=Debug. I will try to fix this. This is repeatable also with the parent of 7599bed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I produced an rr replay trace for this ut_a() failure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a source code comment before this assertion that explains what the failure means. Yes, the assertion may fail when you configure an unreasonably small innodb_log_file_size. With innodb_log_archive=OFF we always allowed crash recovery to be broken if the checkpoints cannot keep up with the write load. With innodb_log_archive=ON there is no such option. My test was stable when I used innodb_log_file_size=512m. Such a tiny size did not perform well; for about 75% of the time, InnoDB was blocked, waiting for the log checkpoint to advance far enough.

Possibly we could extend the last log file, or we could add a condition to log_t::write_checkpoint() that it will not release the log_sys.latch during the checkpoint header write if we are near this situation. In the memory-mapped (PMEM) code path we hold the exclusive log_sys.latch there.

Comment thread storage/innobase/log/log0log.cc Outdated
The InnoDB write-ahead log file in the old innodb_log_archive=OFF
format is named ib_logfile0, pre-allocated to innodb_log_file_size and
written as a ring buffer. This is good for write performance and space
management, but unsuitable for arbitrary point-in-time recovery or for
facilitating efficient incremental backup.

innodb_log_archive=ON: A new format where InnoDB will create and
preallocate files ib_%016x.log, instead of writing a circular file
ib_logfile0. Each file will be pre-allocated to innodb_log_file_size.
Once a log fills up, we will create and pre-allocate another log file,
to which log records will be written.  Upon the completion of the first
log checkpoint in a recently created log file, the old log file will
be marked read-only, signaling that there will be no further writes to
that file, and that the file may safely be moved to long-term storage.

The file name includes the log sequence number (LSN) at file offset
12288 (log_t::START_OFFSET). Each checkpoint is identified by storing
a 64-bit big-endian offset into an optional sequence of FILE_MODIFY
records followed by a FILE_CHECKPOINT record, between 12288 and the
end of the file.

The innodb_encrypt_log format is identified by storing the encryption
information at the start of the log file. The first 64-bit value will
be 1, which is an invalid checkpoint offset. Each
innodb_log_archive=ON log must use the same encryption parameters.
Changing innodb_encrypt_log or related parameters is only possible by
setting innodb_log_archive=OFF and restarting the server, which will
permanently lose the history of the archived log.

The maximum number of log checkpoints that the innodb_log_archive=ON
file header can represent is limited to 12288/8=1536 when using
innodb_encrypt_log=OFF. If we run out of slots in a log file, each
subsequently completed checkpoint in that log file will overwrite the
last slot in the checkpoint header, until we switch to the next log.

innodb_log_recovery_start: The checkpoint LSN to start recovery from.
This will be useful when recovering from an archived log. This is useful
for restoring an incremental backup (applying InnoDB log files that were
copied since the previous restore).

innodb_log_recovery_target: The requested LSN to end recovery at.
When this is set, all persistent InnoDB tables will be read-only, and
no writes to the log are allowed. The intended purpose of this setting
is to prepare an incremental backup, as well as to allow data
retrieval as of a particular logical point of time.

Setting innodb_log_recovery_target>0 is much like setting
innodb_read_only=ON, with the exception that the data files may be
written to by crash recovery, and locking reads will conflict with any
incomplete transactions as necessary, and all transaction isolation
levels will work normally (not hard-wired to READ UNCOMMITTED).

srv_read_only_mode: When this is set (innodb_read_only=ON), also
recv_sys.rpo (innodb_log_recovery_target) will be set to the current LSN.
This ensures that it will suffice to check only one of these variables
when blocking writes to persistent tables.

The status variable innodb_lsn_archived will reflect the LSN
since when a complete InnoDB log archive is available. Its initial
value will be that of the new parameter innodb_log_archive_start.
If that variable is 0 (the default), the innodb_lsn_archived will
be recovered from the available log files. If innodb_log_archive=OFF,
innodb_lsn_archived will be adjusted to the latest checkpoint every
time a log checkpoint is executed. If innodb_log_archive=ON, the value
should not change.

SET GLOBAL innodb_log_archive=!@@GLOBAL.innodb_log_archive will take
effect as soon as possible, possibly after a log checkpoint has been
completed. The log file will be renamed between ib_logfile0 and
ib_%016x.log as appropriate.

When innodb_log_archive=ON, the setting SET GLOBAL innodb_log_file_size
will affect subsequently created log files when the file that is being
currently written is running out. If we are switching log files exactly
at the same time, then a somewhat misleading error message
"innodb_log_file_size change is already in progress" will be issued.

no_checkpoint_prepare.inc: A new file, to prepare for subsequent
inclusion of no_checkpoint_end.inc. We will invoke the server to
parse the log and to determine the latest checkpoint.

All --suite=encryption tests that use innodb_encrypt_log
will be skipped for innodb_log_archive=ON, because enabling
or disabling encryption on the log is not possible without
temporarily setting innodb_log_archive=OFF and restarting
the server. The idea is to add the following arguments to an
invocation of mysql-test/mtr:

--mysqld=--loose-innodb-log-archive \
--mysqld=--loose-innodb-log-recovery-start=12288 \
--mysqld=--loose-innodb-log-file-mmap=OFF \
--skip-test=mariabackup

Alternatively, specify --mysqld=--loose-innodb-log-file-mmap=ON
to cover both code paths.

The mariabackup test suite must be skipped when using the
innodb_log_archive=ON format, because mariadb-backup will only
support the old ib_logfile0 format (innodb_log_archive=OFF).

A number of tests would fail when the parameter
innodb_log_recovery_start=12288 is present, which is forcing
recovery to start from the beginning of the history
(the database creation). The affected tests have been adjusted with
explicit --innodb-log-recovery-start=0 to override that:

(0) Some injected corruption may be "healed" by replaying the log
from the beginning. Some tests expect an empty buffer pool after
a restart, with no page I/O due to crash recovery.
(1) Any test that sets innodb_read_only=ON would fail with an error
message that the setting prevents crash recovery, unless
innodb_log_recovery_start=0.
(2) Any test that changes innodb_undo_tablespaces would fail in crash
recovery, because crash recovery assumes that the undo tablespace ID
that is available from the undo* files corresponds with the start of
the log. This is an unforunate design bug which we cannot fix easily.

log_sys.first_lsn: The start of the current log file, to be consulted
in log_t::write_checkpoint() when renaming files.

log_sys.archived_lsn: New field: The value of innodb_lsn_archived.

log_sys.end_lsn: New field: The log_sys.get_lsn() when the latest
checkpoint was initiated. That is, the start LSN of a possibly empty
sequence of FILE_MODIFY records followed by FILE_CHECKPOINT.

log_sys.resize_target: The value of innodb_log_file_size that will be
used for creating the next archive log file once the current file (of
log_sys.file_size) fills up.

log_sys.archive: New field: The value of innodb_log_archive.

log_sys.next_checkpoint_no: Widen to uint16_t. There may be up to
12288/8=1536 checkpoints in the header.

log_sys.log: If innodb_log_archive=ON, this file handle will be kept
open also in the PMEM code path.

log_sys.resize_log: If innodb_log_archive=ON, we may have two log
files open both during normal operation and when parsing the log. This
will store the other handle (old or new file).

log_sys.resize_buf: In the memory-mapped code path, this will point
to the file resize_log when innodb_log_archive=ON.

recv_sys.log_archive: All innodb_log_archive=ON files that will be
considered in recovery.

recv_sys.was_archive: A flag indicating that an innodb_log_archive=ON
file is in innodb_log_archive=OFF format.

log_sys.is_pmem, log_t::is_mmap_writeable(): A new predicate.
If is_mmap_writeable(), we assert and guarantee buf_size == capacity().

log_t::archive_new_write(): Create and allocate a new log file, and
write the outstanding data to both the current and the new file, or
only to the new file, until write_checkpoint() completes the first
checkpoint in the new file.

log_t::archived_mmap_switch_prepare(): Create and memory-map a new log
file, and update file_size to resize_target. Remember the file handle
of the current log in resize_log, so that write_checkpoint() will be
able to make it read-only.

log_t::archived_mmap_switch_complete(): Switch to the buffer that was
created in archived_mmap_switch_prepare().

log_t::archive_create(bool): Create and allocate an archive log file.
On POSIX, ensure that we get the exact correct file size by invoking
ftruncate() on POSIX, because os_file_set_size() would only
truncate files on Microsoft Windows.

buf_flush_archive_create(): Create a spare archive log file if needed.

buf_flush_page_cleaner(): Invoke buf_flush_archive_create() about
one second after a checkpoint where we switched archived log files.
This allows any writer threads to resume activity after we completed
the previous checkpoint.

log_t::write_checkpoint(): Allow an old checkpoint to be completed in
the old log file even after a new one has been created. If we are
writing the first checkpoint in a new log file, we will mark the old
log file read-only. We will also update log_sys.first_lsn unless it
was already updated in ARCHIVED_MMAP code path. In that code path,
there is the special case where log_sys.resize_buf == nullptr and
log_sys.checkpoint_buf points to log_sys.resize_log (the old log file
that is about to be made read-only). In this case, log_sys.first_lsn
will already point to the start of the current log_sys.log, even
though the switch has not been fully completed yet.
Try to preallocate the next archive file if needed, with the goal that
when we need the file it will already be ready for use.

log_t::write_checkpoint(), log_checkpoint_low(), log_checkpoint(),
buf_flush_sync_for_checkpoint(): Return the LSN on which to invoke
archive_create(), or 0 if no spare archive log file needs to be created.

log_t::header_rewrite(my_bool): Rewrite the log file header before or
after renaming the log file, and write a message about the change,
so that there will be a chance to recover in case the server is being
killed during this operation.  The recovery of the last ib_%016%.log
does tolerate also the ib_logfile0 format.

log_t::set_archive(my_bool,THD): Implement SET GLOBAL innodb_log_archive.
An error will be returned if non-archived SET GLOBAL innodb_log_file_size
(log file resizing) is in progress. Wait for checkpoint if necessary.
The current log file will be renamed to either ib_logfile0 or
ib_%016x.log, as appropriate. In SET GLOBAL innodb_log_archive=OFF,
we trigger a write-ahead of the log if necessary, to prevent overrun.

log_t::archive_rename(): Rename an archived log to ib_logfile0 on recovery
in case there had been a crash during set_archive().

log_t::archive_set_size(): A new function, to ensure that
log_sys.resize_target is set on startup.

log_checkpoint_low(): Do not prevent a checkpoint at the start of a file.
We want the first innodb_log_archive=ON file to start with a checkpoint.

log_t::create(lsn_t): Initialize last_checkpoint_lsn. Initialize the
log header as specified by log_sys.archive (innodb_log_archive).

log_write_buf(): Add the parameter max_length, the file wrap limit.

log_write_up_to(), mtr_t::commit_log_release<bool mmap=true>():
If we are switching log files, invoke buf_flush_ahead(lsn, false)
to ensure that a log checkpoint will be completed in the new file.

mtr_t::finish_writer(): Specialize for innodb_log_archive=ON.

mtr_t::commit_file(): Ensure that log archive rotation will complete.

log_t::append_prepare<log_t::ARCHIVED_MMAP>(): Special case.

log_t::append_prepare(): Remove the call to set_check_for_checkpoint()
that was duplicating logic in log_close().

log_t::get_path(): Get the name of the current log file.

log_t::get_circular_path(size_t): Get the path name of a circular file.
Replaces get_log_file_path().

log_t::get_archive_path(lsn_t): Return a name of an archived log file.

log_t::append_archive_name(): Append the archive log file name
to a path string.

log_t::checkpoint_margin(): Replaces log_checkpoint_margin().
If a new archived log file has been created, wait for the
first checkpoint in that file. Revise the formula when
log_sys.need_checkpoint needs to hold. When innodb_log_archive=ON,
we can allow the checkpoint to be older by innodb_log_file_size.

log_close(): Relax the log_sys.need_checkpoint condition
for innodb_log_archive=ON.

srv_log_rebuild_if_needed(): Never rebuild if innodb_log_archive=ON.
The setting innodb_log_file_size will affect the creation of
subsequent log files. The parameter innodb_encrypt_log cannot be
changed while the log is in the innodb_log_archive=ON format.

log_t::attach(), log_mmap(): Add the parameter log_access,
to distinguish memory-mapped or read-only access.

log_t::attach(): When disabling innodb_log_file_mmap, read
checkpoint_buf from the last innodb_log_archive=ON file.

log_t::clear_mmap(): Clear the tail of the checkpoint buffer
if is_mmap_writeable().

log_t::set_recovered(): Invoke clear_mmap(), and restore the
log buffer to the correct position.

recv_sys_t::apply(): Let log_t::clear_mmap() enable log writes.

log_file_is_zero(): Check if a log file starts with NUL bytes
(is a preallocated file).

recv_sys_t::find_checkpoint(): Find and remember the checkpoint position
in the last file when innodb_log_recovery_start points to an older file.
When innodb_log_file_mmap=OFF, restore log_sys.checkpoint_buf from
the latest log file. If the last archive log file is actually
in innodb_log_archive=OFF format despite being named ib_%016.log,
try to recover it in that format. If the circular ib_logfile0 is missing,
determine the oldest archived log file with contiguous LSN.
If innodb_log_archive=ON, refuse to start if ib_logfile0 exists.
Open non-last non-preallocated archived log file in read/write mode.

recv_sys_t::find_checkpoint_archived(): Validate each checkpoint in
the current file header, and by default aim to recover from the last
valid one. Terminate the search if the last validated checkpoint
spanned two files. If innodb_log_recovery_start has been specified,
attempt to validate it even if there is no such information stored
in the checkpoint header.

log_parse_file(): Do not invoke fil_name_process() during
recv_sys_t::find_checkpoint_archived(), when we tolerate FILE_MODIFY
records while looking for a FILE_CHECKPOINT record.

recv_scan_log(): Invoke log_t::archived_switch_recovery() upon
reaching the end of the current archived log file.

log_t::archived_switch_recovery_prepare(): Make use of
recv_sys.log_archive and open all but the last file read-only.

log_t::archived_switch_recovery(): Switch files in the pread() code path.

log_t::archived_mmap_switch_recovery_complete(): Switch files in the
memory-mapped code path.

recv_warp: A pointer wrapper for memory-mapped parsing that spans two
archive log files.

recv_sys_t::parse_mmap(): Use recv_warp for innodb_log_archive=ON.

recv_sys_t::parse(): Tweak some logic for innodb_log_archive=ON.

log_t::set_recovered_checkpoint(): Set the checkpoint on recovery.
Updates also the end_lsn.

log_t::set_recovered_lsn(): Also update flush_lock and write_lock,
to ensure that log_write_up_to() will be a no-op. Invoke
unstash_archive_file() in case there was a garbage (pre-allocated)
file at the end which was not parsed at all.

log_t::persist(): Even if the flushed_to_disk_lsn does not change,
we may want to reset the write_lsn_offset.

Avoid setting log_sys.need_checkpoint unnecessarily (MDEV-39162).
Copy link
Copy Markdown
Member

@Thirunarayanan Thirunarayanan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did change in log_t::append_prepare(). It calls capacity() in archive mode. I hope there is no performance regression because of this change.

Archive file creation could slow down page cleaner thread flushing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants