postgresql/src/backend
Alvaro Herrera cfedb279a6
Fix WAL replay in presence of an incomplete record
Physical replication always ships WAL segment files to replicas once
they are complete.  This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they do not rewind.
This causes standbys to stop following the primary after the latter
crashes:
  LOG:  invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be.  A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary.  But that's pretty labor-intensive, and I bet many users
would just give up and re-clone the standby instead.

A fix for this problem was already attempted in commit 515e3d84a0, but
it only addressed the WAL archiving scenario, so streaming replication
would still be affected (as would other cases, such as taking a
filesystem-level backup while the server is down after having crashed).
It also had performance scalability problems, so it had to be reverted.

This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts.  With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.
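
As a rough illustration of the replay-side idea described above, here is a
minimal toy model in C.  It is not PostgreSQL code: the record kinds, the
ToyRecord struct and toy_replay() are hypothetical names invented for this
sketch, and real WAL records carry far more state.  It only shows the
control flow: a record whose continuation was lost can be skipped if, and
only if, the next thing in the stream is a marker naming that record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy record kinds; illustrative names, not PostgreSQL's. */
typedef enum {
    REC_COMPLETE,       /* ordinary, fully written record */
    REC_PARTIAL_HEAD,   /* first part of a record whose continuation was lost */
    REC_OVERWRITE_MARK  /* marker written where the continuation should be */
} ToyRecKind;

typedef struct {
    ToyRecKind kind;
    uint64_t   lsn;         /* start position of this record */
    uint64_t   broken_lsn;  /* REC_OVERWRITE_MARK only: start of the lost record */
} ToyRecord;

/*
 * Replay a toy WAL stream.  Returns the number of complete records applied,
 * or -1 if replay cannot proceed (the situation the standby was stuck in
 * before the fix).
 */
int toy_replay(const ToyRecord *recs, size_t n)
{
    int      applied = 0;
    int      have_pending = 0;  /* saw a partial head, waiting for its marker */
    uint64_t pending_lsn = 0;

    assert(recs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const ToyRecord *r = &recs[i];

        if (have_pending) {
            /* Only a marker naming the broken record lets us skip it. */
            if (r->kind != REC_OVERWRITE_MARK || r->broken_lsn != pending_lsn)
                return -1;
            have_pending = 0;
            continue;
        }

        switch (r->kind) {
        case REC_COMPLETE:
            applied++;
            break;
        case REC_PARTIAL_HEAD:
            have_pending = 1;
            pending_lsn = r->lsn;
            break;
        case REC_OVERWRITE_MARK:
            return -1;          /* marker with no broken record before it */
        }
    }
    return have_pending ? -1 : applied;
}
```

In this model, a stream of complete record, partial head at 200, marker for
200, complete record replays two records, while the same stream without the
marker returns -1.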

Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.

A new TAP test that exercises this is added, but its portability
remains to be seen.

This has been wrong since the introduction of physical replication, so
backpatch all the way back.  In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.
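
To see why appending members is the ABI-safe choice, consider the small C
sketch below.  The structs are hypothetical stand-ins (they are not the
real XLogReaderState layout); the point is only that code compiled against
the old definition keeps finding the old members at the same offsets when
new ones are appended, but not when one is inserted earlier.  This assumes
such code never allocates or copies the struct itself, since its size does
change.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "old" layout that already-compiled extensions know about. */
typedef struct {
    uint64_t read_pos;
    int32_t  flags;
} ReaderOld;

/* New members appended at the end: old members keep their offsets. */
typedef struct {
    uint64_t read_pos;
    int32_t  flags;
    uint64_t new_a;
    uint64_t new_b;
} ReaderAppended;

/* New member inserted at the front: every old member is shifted. */
typedef struct {
    uint64_t new_a;
    uint64_t read_pos;
    int32_t  flags;
    uint64_t new_b;
} ReaderInserted;

/* Returns 1 if the appended layout preserves the old member offsets. */
int appended_keeps_offsets(void)
{
    return offsetof(ReaderAppended, read_pos) == offsetof(ReaderOld, read_pos)
        && offsetof(ReaderAppended, flags) == offsetof(ReaderOld, flags);
}

/* Returns 1 if the inserted layout preserves the old member offsets. */
int inserted_keeps_offsets(void)
{
    return offsetof(ReaderInserted, read_pos) == offsetof(ReaderOld, read_pos)
        && offsetof(ReaderInserted, flags) == offsetof(ReaderOld, flags);
}
```

Here appended_keeps_offsets() returns 1 and inserted_keeps_offsets()
returns 0, which is the difference between a compatible minor release and
one that silently breaks binaries built against the previous headers.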

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Nathan Bossart <bossartn@amazon.com>
Discussion: https://postgr.es/m/202108232252.dh7uxf6oxwcy@alvherre.pgsql
2021-09-29 11:21:51 -03:00
access Fix WAL replay in presence of an incomplete record 2021-09-29 11:21:51 -03:00
bootstrap Revert "Skip WAL for new relfilenodes, under wal_level=minimal." 2020-03-22 09:24:13 -07:00
catalog Fix lookup error in extended stats ownership check 2021-08-31 18:42:11 +02:00
commands Fix misevaluation of STABLE parameters in CALL within plpgsql. 2021-09-21 19:06:33 -04:00
executor Report tuple address in data-corruption error message 2021-08-30 16:29:12 -04:00
foreign Remove bogus "extern" annotations on function definitions. 2018-02-19 12:07:44 -05:00
jit jit: Do not try to shut down LLVM state in case of LLVM triggered errors. 2021-09-13 18:26:18 -07:00
lib Fix typo in comment 2021-04-20 14:36:47 +02:00
libpq Set type identifier on BIO 2021-08-17 14:30:39 +02:00
main Update copyright for 2018 2018-01-02 23:30:12 -05:00
nodes Ensure that expandTableLikeClause() re-examines the same table. 2020-12-01 14:02:28 -05:00
optimizer Fix mis-planning of repeated application of a projection. 2021-05-31 12:03:00 -04:00
parser Don't elide casting to typmod -1. 2021-09-20 11:48:52 -04:00
partitioning Avoid using ambiguous word "non-negative" in error messages. 2021-07-28 01:24:51 +09:00
po Translation updates 2021-08-09 12:57:38 +02:00
port Make EXEC_BACKEND more convenient on macOS. 2021-08-13 11:10:49 +12:00
postmaster Revert "Avoid creating archive status ".ready" files too early" 2021-09-04 12:14:30 -04:00
regex Make pg_regexec() robust against out-of-range search_start. 2021-09-11 15:19:58 -04:00
replication Fix issue with WAL archiving in standby. 2021-09-09 23:59:19 +09:00
rewrite Fix rewriter to set hasModifyingCTE correctly on rewritten queries. 2021-09-08 12:05:43 -04:00
snowball Avoid unnecessary use of pg_strcasecmp for already-downcased identifiers. 2018-01-26 18:25:14 -05:00
statistics Release memory allocated by dependency_degree 2021-09-23 18:48:58 +02:00
storage Fix variable shadowing in procarray.c. 2021-09-16 13:08:06 +09:00
tcop Fix some anomalies with NO SCROLL cursors. 2021-09-10 13:18:32 -04:00
tsearch Don't leak compiled regex(es) when an ispell cache entry is dropped. 2021-03-18 21:44:43 -04:00
utils Fix bogus timetz_zone() results for DYNTZ abbreviations. 2021-09-06 11:29:52 -04:00
.gitignore Add .gitignore entries for AIX-specific intermediate build artifacts. 2015-07-08 20:44:22 -04:00
common.mk Remove PARTIAL_LINKING build mode. 2018-03-30 17:33:04 -07:00
Makefile Rearrange makefile rules for running Gen_fmgrtab.pl. 2018-05-03 17:54:18 -04:00
nls.mk Translation updates 2018-09-17 08:40:36 +02:00