2010 commits

Author SHA1 Message Date
Nathan Bossart
771fe0948c Avoid including vacuum.h in tableam.h and heapam.h.
Commit 2252fcd427 modified some function prototypes in tableam.h
and heapam.h to take a VacuumParams argument instead of a pointer,
which required including vacuum.h in those headers.  vacuum.h has a
reasonably large dependency tree, and headers like tableam.h are
widely included, so this is not ideal.  To fix, change the
functions in question to accept a "const VacuumParams *" argument
instead.  That allows us to use a forward declaration for
VacuumParams and avoid including vacuum.h.  Since vacuum_rel()
needs to scribble on the params argument, we still pass it by value
to that function so that the original struct is not modified.
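
A minimal standalone sketch of the forward-declaration approach described
above. It is not the PostgreSQL source: DemoVacuumParams,
demo_table_vacuum(), and demo_vacuum_rel() are hypothetical stand-ins for
the real struct and functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* --- What a widely included header would contain --- */
/* A forward declaration suffices to declare a function taking a pointer. */
struct DemoVacuumParams;
int demo_table_vacuum(const struct DemoVacuumParams *params);

/* --- What the "heavy" header would contain (hypothetical fields) --- */
struct DemoVacuumParams
{
    int  freeze_min_age;
    bool truncate;
};

/* Takes the params by value, so it may scribble on its private copy. */
static int
demo_vacuum_rel(struct DemoVacuumParams params)
{
    params.freeze_min_age = 0;      /* modifies only the local copy */
    params.truncate = false;
    return params.freeze_min_age;
}

int
demo_table_vacuum(const struct DemoVacuumParams *params)
{
    assert(params != NULL);
    return demo_vacuum_rel(*params);    /* caller's struct stays untouched */
}
```

Only the file that defines or dereferences the struct needs the full
definition; every other includer of the light header pays nothing for it.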

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/rzxpxod4c4la62yvutyrvgoyilrl2fx55djaf2suidy7np5m6c%403l2ln476eadh
2026-03-31 12:43:52 -05:00
Daniel Gustafsson
097ab69d17 Formalize WAL record for XLOG_CHECKPOINT_REDO
XLOG_CHECKPOINT_REDO only contains the wal_level copied straight in
without an encapsulating record structure. While it works, it makes
future uses of XLOG_CHECKPOINT_REDO hard as there is nowhere to put
new data items.  This fix was inspired by the online checksums
patch, which adds data to this record, but this change has value on
its own.
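
As an illustration of why an encapsulating struct helps, here is a small
sketch. DemoCheckpointRedo and the encode/decode helpers are hypothetical
names chosen for this example; the struct actually added by the commit may
differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record layout: a named struct instead of a bare int. */
typedef struct DemoCheckpointRedo
{
    int wal_level;
    /* future data items can be appended here without changing callers */
} DemoCheckpointRedo;

/* Copy the record into a caller-supplied buffer; returns bytes written. */
static size_t
demo_redo_encode(const DemoCheckpointRedo *rec, char *buf, size_t buflen)
{
    assert(buflen >= sizeof(*rec));
    memcpy(buf, rec, sizeof(*rec));
    return sizeof(*rec);
}

/* Read the record back; memcpy avoids any unaligned access. */
static DemoCheckpointRedo
demo_redo_decode(const char *buf)
{
    DemoCheckpointRedo rec;

    memcpy(&rec, buf, sizeof(rec));
    return rec;
}
```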

Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/c92b5d8b-bc03-47bc-b209-2e4a719eee32@iki.fi
2026-03-31 09:38:01 +02:00
Nathan Bossart
bab2f27eaa Remove bits* typedefs.
In addition to removing the bits8, bits16, and bits32 typedefs,
this commit replaces all uses with uint8, uint16, or uint32.  bits*
provided little benefit beyond establishing the intent of the
variable, and they were inconsistently used for that purpose.
Third-party code should instead use the corresponding uint*
typedef.

Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/absbX33E4eaA0Ity%40nathan
2026-03-30 16:12:08 -05:00
Melanie Plageman
b46e1e54d0 Allow on-access pruning to set pages all-visible
Many queries do not modify the underlying relation. For such queries, if
on-access pruning occurs during the scan, we can check whether the page
has become all-visible and update the visibility map accordingly.
Previously, only vacuum and COPY FREEZE marked pages as all-visible or
all-frozen.

This commit implements on-access VM setting for sequential scans, tid
range scans, sample scans, bitmap heap scans, and the underlying heap
relation in index scans.

Setting the visibility map on-access can avoid write amplification
caused by vacuum later needing to set the page all-visible, which could
trigger a write and potentially an FPI. It also allows more frequent
index-only scans, since they require pages to be marked all-visible in
the VM.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-30 15:47:07 -04:00
Melanie Plageman
50eb5faea2 Pass down information on table modification to scan nodes
Pass down information to sequential scan, index [only] scan, bitmap
table scan, sample scan, and TID range scan nodes on whether or not the
query modifies the relation being scanned. A later commit will use this
information to update the VM during on-access pruning only if the
relation is not modified by the query.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4379FDA3-9446-4E2C-9C15-32EFE8D4F31B%40yandex-team.ru
2026-03-30 13:27:34 -04:00
Álvaro Herrera
349bd88202 Don't use bits32 in table AM interface
Seems there's near-universal dislike for the bitsXX typedefs.
Revert that part of commit 1bd6f22f43 in favor of using plain uint32.
2026-03-30 19:06:33 +02:00
Melanie Plageman
dcd8cc1c85 Thread flags through begin-scan APIs
Add an AM user-settable flags parameter to several of the table scan
functions, one table AM callback, and index_beginscan(). This allows
users to pass additional context to be used when building the scan
descriptors.

For index scans, a new flags field is added to IndexFetchTableData, and
the heap AM saves the caller-provided flags there.

This introduces an extension point for follow-up work to pass per-scan
information (such as whether the relation is read-only for the current
query) from the executor to the AM layer.
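
A rough sketch of the idea, assuming a single hypothetical flag bit.
DEMO_SCAN_REL_READ_ONLY, DemoScanDesc, and both functions are invented for
illustration and are not the names used in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the real flag names are not given in this log. */
#define DEMO_SCAN_REL_READ_ONLY  ((uint32_t) 1 << 0)

typedef struct DemoScanDesc
{
    uint32_t flags;     /* caller-provided, saved at begin-scan time */
} DemoScanDesc;

/* The begin-scan call simply records the flags in the descriptor. */
static DemoScanDesc
demo_beginscan(uint32_t flags)
{
    DemoScanDesc scan = { .flags = flags };

    return scan;
}

/* Later, the AM can consult the saved flags to decide what is allowed. */
static bool
demo_scan_may_set_vm(const DemoScanDesc *scan)
{
    assert(scan != NULL);
    return (scan->flags & DEMO_SCAN_REL_READ_ONLY) != 0;
}
```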

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2be31f17-5405-4de9-8d73-90ebc322f7d8%40vondra.me
2026-03-30 12:27:24 -04:00
Álvaro Herrera
1bd6f22f43 Have table_insert and siblings use an unsigned type for options
Using signed types can lead to bugs, such as the one fixed by commit
2a2e1b470b.

Discussion: https://postgr.es/m/44e6ze3kuunhky63wmfjxrmn72pds2whwf5ok6hpz7c4my7k2h@l65zhpcuasnf
2026-03-30 13:58:16 +02:00
Melanie Plageman
a881cc9c7e Remove XLOG_HEAP2_VISIBLE entirely
There are no remaining users that emit XLOG_HEAP2_VISIBLE records, so it
can be removed. This includes deleting the xl_heap_visible struct and
all functions responsible for emitting or replaying XLOG_HEAP2_VISIBLE
records.

Bumps XLOG_PAGE_MAGIC because we removed a WAL record type.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-24 17:58:12 -04:00
Melanie Plageman
1252a4ee28 WAL log VM setting during vacuum phase I in XLOG_HEAP2_PRUNE_VACUUM_SCAN
Vacuum no longer emits a separate WAL record for each page set
all-visible or all-frozen during phase I. Instead, visibility map
updates are now included in the XLOG_HEAP2_PRUNE_VACUUM_SCAN record that
is already emitted for pruning and freezing.

Previously, heap_page_prune_and_freeze() determined whether a page was
all-visible, but the corresponding VM bits were only set later in
lazy_scan_prune(). Now the VM is updated immediately in
heap_page_prune_and_freeze(), at the same time as the heap
modifications. This reduces WAL volume produced by vacuum.

For now, vacuum is still the only user of heap_page_prune_and_freeze()
allowed to set the VM. On-access pruning is not yet able to set the VM.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Earlier version Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-24 16:49:46 -04:00
Álvaro Herrera
2102ebb195 Don't include storage/lock.h in so many headers
Since storage/locktags.h was added by commit 322bab7974, many headers
can be made leaner by depending on that instead of on storage/lock.h,
which has many other dependencies.

(In fact, some of these changes were possible even before that.)

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/abvrRZo52Yx9ZzWQ@ip-10-97-1-34.eu-west-3.compute.internal
2026-03-24 17:11:12 +01:00
Fujii Masao
1c162c965a Report detailed errors from XLogFindNextRecord() failures.
Previously, XLogFindNextRecord() did not return detailed error information
when it failed to find a valid WAL record. As a result, callers such as
the WAL summarizer, pg_waldump, and pg_walinspect could only report generic
errors (e.g., "could not find a valid record after ..."), making
troubleshooting difficult.

This commit fixes the issue by extending XLogFindNextRecord() to return
detailed error information on failure, and updating its callers to include
those details in their error messages.

For example, when pg_waldump is run on a WAL file with an invalid magic number,
it now reports not only the generic error but also the specific cause
(e.g., "invalid magic number").
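
The pattern of returning a specific cause through an out-parameter can be
sketched as below. This is a toy model only: demo_find_next_record(),
DEMO_PAGE_MAGIC, and the "page too short" message are assumptions made for
the example and do not reflect the real XLogFindNextRecord() signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_MAGIC 0xD11Eu     /* arbitrary value for this sketch */

/*
 * Returns 0 on success and -1 on failure.  On failure, *errormsg points at
 * a static string naming the specific cause, so that the caller can append
 * it to its own generic error message.
 */
static int
demo_find_next_record(const uint8_t *page, size_t len, const char **errormsg)
{
    uint16_t magic;

    assert(errormsg != NULL);
    *errormsg = NULL;

    if (len < sizeof(magic))
    {
        *errormsg = "page too short";
        return -1;
    }
    memcpy(&magic, page, sizeof(magic));
    if (magic != DEMO_PAGE_MAGIC)
    {
        *errormsg = "invalid magic number";
        return -1;
    }
    return 0;
}
```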

Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Mircea Cadariu <cadariu.mircea@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAO6_XqoxJXddcT4wkd9Xd+cD6Sz-fyspRGuV4Bq-wbXG4pVNzA@mail.gmail.com
2026-03-24 22:33:09 +09:00
Robert Haas
c98ad086ad Bounds-check access to TupleDescAttr with an Assert.
The second argument to TupleDescAttr should always be at least zero
and less than natts; otherwise, we index outside of the attribute
array. Assert that this is the case.

Various violations, or possible violations, of this rule that are
currently in the tree are actually harmless, because while
we do call TupleDescAttr() before verifying that the argument is
within range, we don't actually dereference it unless the argument
was within range all along. Nonetheless, the Assert means we
should be more careful, so tidy up accordingly.
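
A small sketch of the tidy-up pattern, using hypothetical stand-in types
(DemoTupleDesc, DemoAttr) rather than the real TupleDesc: verify the index
first, and only then call the asserting accessor.

```c
#include <assert.h>

typedef struct DemoAttr
{
    int typid;
} DemoAttr;

typedef struct DemoTupleDesc
{
    int      natts;
    DemoAttr attrs[4];
} DemoTupleDesc;

/* Accessor that asserts the index is within [0, natts). */
static inline DemoAttr *
demo_tupdesc_attr(DemoTupleDesc *desc, int i)
{
    assert(i >= 0 && i < desc->natts);
    return &desc->attrs[i];
}

/* Range-check before calling the accessor, not after. */
static int
demo_attr_typid_or_zero(DemoTupleDesc *desc, int i)
{
    if (i < 0 || i >= desc->natts)
        return 0;
    return demo_tupdesc_attr(desc, i)->typid;
}
```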

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoacixUZVvi00hOjk_d9B4iYKswWP1gNqQ8Vfray-AcOCA@mail.gmail.com
2026-03-24 08:58:50 -04:00
Melanie Plageman
01b7e4a46d Add pruning fast path for all-visible and all-frozen pages
Because of the SKIP_PAGES_THRESHOLD optimization or a stale prune XID,
heap_page_prune_and_freeze() can be invoked for pages with no pruning or
freezing work to do. To avoid this, if a page is already all-frozen or
it is all-visible and no freezing will be attempted, exit early. We
can't exit early if vacuum passed DISABLE_PAGE_SKIPPING, though.
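
The early-exit condition described above can be written as a small
predicate. The function and parameter names are hypothetical; only the
logic is taken from the commit message.

```c
#include <stdbool.h>

/*
 * Decide whether pruning can be skipped for a page, given the page's
 * current visibility map state and the caller's options.
 */
static bool
demo_prune_can_exit_early(bool all_visible, bool all_frozen,
                          bool will_attempt_freeze,
                          bool disable_page_skipping)
{
    /* DISABLE_PAGE_SKIPPING forces a full look at every page. */
    if (disable_page_skipping)
        return false;

    /* An all-frozen page has no pruning or freezing work left. */
    if (all_frozen)
        return true;

    /* An all-visible page only needs work if freezing will be attempted. */
    return all_visible && !will_attempt_freeze;
}
```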

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-22 15:46:50 -04:00
Melanie Plageman
4f7ecca84d Detect and fix visibility map corruption in more cases
Move VM corruption detection and repair into heap page pruning. This
allows VM repair during on-access pruning, not only during vacuum.

Also, expand corruption detection to cover pages marked all-visible that
contain dead tuples and tuples inserted or deleted by in-progress
transactions, rather than only all-visible pages with LP_DEAD items.

Pinning the correct VM page before on-access pruning is cheap when
compared to the cost of actually pruning. The vmbuffer is saved in the
scan descriptor, so a query should only need to pin each VM page once,
and a single VM page covers a large number of heap pages.
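
To put a number on "a large number of heap pages", here is a worked sketch.
It assumes the default 8 kB block size, a 24-byte page header, and two VM
bits (all-visible and all-frozen) per heap page; the DEMO_ names are
invented for the example.

```c
#include <stdint.h>

#define DEMO_BLCKSZ             8192    /* assumed default block size */
#define DEMO_PAGE_HEADER_SIZE   24      /* assumed page header size */
#define DEMO_BITS_PER_HEAPBLK   2       /* all-visible + all-frozen */

#define DEMO_MAPSIZE            (DEMO_BLCKSZ - DEMO_PAGE_HEADER_SIZE)
#define DEMO_HEAPBLKS_PER_BYTE  (8 / DEMO_BITS_PER_HEAPBLK)
#define DEMO_HEAPBLKS_PER_PAGE  (DEMO_MAPSIZE * DEMO_HEAPBLKS_PER_BYTE)

/* Which VM page holds the bits for a given heap block? */
static uint32_t
demo_heapblk_to_mapblock(uint32_t heapblk)
{
    return heapblk / DEMO_HEAPBLKS_PER_PAGE;
}
```

Under these assumptions one VM page covers 32672 heap pages, roughly 255 MB
of heap, so a scan that keeps its VM pin only has to switch VM pages once
per 32672 heap blocks.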

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-22 11:52:40 -04:00
David Rowley
d8a859d22b Reduce size of CompactAttribute struct to 8 bytes
Previously, this was 16 bytes.  With the use of some bitflags and by
reducing the attcacheoff field size to a 16-bit type, we can halve the
size of the struct.

It's unlikely that caching offsets larger than what will
fit in a 16-bit int will help much as the tuple is very likely to have
some non-fixed-width types anyway, the offsets of which we cannot cache.

Shrinking this down to 8 bytes helps by accessing fewer cachelines when
performing tuple deformation.  The fields used there are all fully
fledged fields, which don't require any bitmasking to extract the value
of.  It also helps to more efficiently calculate the address of a
compact_attrs[] element in TupleDesc as the x86 LEA instruction can work
with 8 byte offsets, which allows the element address to be calculated
from the TupleDesc's address in a single instruction using LEA's
concurrent shift and add.
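
An 8-byte layout along these lines can be sketched as follows. The field
names, flag bits, and ordering of DemoCompactAttr are assumptions for
illustration and are not the actual CompactAttribute definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for the rarely consulted properties. */
#define DEMO_ATTR_HASMISSING   (1u << 0)
#define DEMO_ATTR_ISDROPPED    (1u << 1)

typedef struct DemoCompactAttr
{
    int16_t cacheoff;       /* cached offset, now a 16-bit field */
    int16_t len;            /* fixed length, or negative for varlena */
    uint8_t byval;          /* full field: no masking needed when deforming */
    uint8_t alignby;        /* full field: alignment in bytes */
    uint8_t flags;          /* DEMO_ATTR_* bits */
    uint8_t nullability;
} DemoCompactAttr;

/* With 8-byte elements, the element offset is just the index shifted by 3. */
static size_t
demo_attr_offset(size_t i)
{
    return i * sizeof(DemoCompactAttr);
}
```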

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com
2026-03-17 15:06:31 +13:00
Álvaro Herrera
fba4233c83 Reduce header inclusions via execnodes.h
Remove a bunch of #include lines from execnodes.h.  Most of these
require suitable typedefs to be added, so that it still compiles
standalone.  In one case, the fix is to move a struct definition to the
one .c file where it is needed.

Also some light clean up in plannodes.h and genam.h, though not as
extensive as in execnodes.h.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/202603131240.ihwqdxnj7w2o@alvherre.pgsql
2026-03-16 14:34:57 +01:00
David Rowley
c456e39113 Optimize tuple deformation
This commit includes various optimizations to improve the performance of
tuple deformation.

We now precalculate CompactAttribute's attcacheoff, which allows us to
remove the code from the deform routines which was setting the
attcacheoff.  Setting the attcacheoff is now handled by
TupleDescFinalize(), which must be called before the TupleDesc is used for
anything.  Having TupleDescFinalize() means we can store the first
attribute in the TupleDesc which does not have an offset cached.  That
allows us to add a dedicated deforming loop to deform all attributes up
to the final one with an attcacheoff set, or up to the first NULL
attribute, whichever comes first.

Here we also improve tuple deformation performance of tuples with NULLs.
Previously, if the HEAP_HASNULL bit was set in the tuple's t_infomask,
deforming would, one-by-one, check each and every bit in the NULL bitmap
to see if it was zero.  Now, we process the NULL bitmap 1 byte at a time
rather than 1 bit at a time to find the attnum with the first NULL.  We
can now deform the tuple without checking for NULLs up to just before that
attribute.
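
The byte-at-a-time search can be sketched as below. This is a simplified
illustration, not the committed code; demo_first_null_attnum() is a
hypothetical name. It relies on the convention that a set bit in the bitmap
means "not NULL", so any byte other than 0xFF contains at least one NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the zero-based number of the first NULL attribute, or natts if
 * none of the first natts attributes is NULL.
 */
static int
demo_first_null_attnum(const uint8_t *bits, int natts)
{
    int nbytes = (natts + 7) / 8;

    assert(bits != NULL && natts >= 0);

    for (int b = 0; b < nbytes; b++)
    {
        /* Fast path: all eight attributes in this byte are non-NULL. */
        if (bits[b] == 0xFF)
            continue;

        /* Slow path, taken at most once: locate the lowest zero bit. */
        uint8_t inv = (uint8_t) ~bits[b];
        int     bit = 0;

        while ((inv & 1) == 0)
        {
            inv >>= 1;
            bit++;
        }

        /* Padding bits past natts in the last byte do not count. */
        int attnum = b * 8 + bit;

        return attnum < natts ? attnum : natts;
    }
    return natts;
}
```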

We also record the maximum attribute number which is guaranteed to exist
in the tuple, that is, has a NOT NULL constraint and isn't an
atthasmissing attribute.  When deforming only attributes prior to the
guaranteed attnum, we've no need to access the tuple's natt count.  As an
additional optimization, we only count fixed-width columns when
calculating the maximum guaranteed column, as this eliminates the need to
emit code to fetch byref types in the deformation loop for guaranteed
attributes.

Some locations in the code deform tuples that have yet to go through NOT
NULL constraint validation.  We're unable to perform the guaranteed
attribute optimization when that's the case.  This optimization is opt-in
via the TupleTableSlot using the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS
flag.

This commit also adds a more efficient way of populating the isnull
array by using a bit-wise SWAR trick which performs multiplication on the
inverse of the tuple's bitmap byte and masking out all but the lower bit
of each of the boolean's byte.  This results in much more optimal code
when compared to determining the NULLness via att_isnull().  8 isnull
elements are processed at once using this method, which means we need to
round the tts_isnull array size up to the next 8 bytes.  The palloc code
does this anyway, but the round-up needed to be formalized so as not to
overwrite the sentinel byte in MEMORY_CONTEXT_CHECKING builds.  Doing
this also allows the NULL-checking deforming loop to more efficiently
check the isnull array, rather than doing the bit-wise processing for each
attribute that att_isnull() does.
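
One way to realize such a SWAR trick is sketched below. It is an
illustrative variant, not necessarily the code that was committed: it
spreads each 4-bit half of the inverted bitmap byte with a multiplication,
masks all but the lowest bit of each byte, and then extracts the bytes with
shifts so that the result does not depend on endianness. All DEMO_ names
are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* 1 + 2^7 + 2^14 + 2^21: moves bit k of a nibble to bit 8*k. */
#define DEMO_SPREAD_MULT  0x00204081u
/* Keeps only the lowest bit of each of the four bytes. */
#define DEMO_LOW_BITS     0x01010101u

/* Fill eight isnull flags from one NULL-bitmap byte (set bit = not NULL). */
static void
demo_fill_isnull8(uint8_t bitmap_byte, bool isnull[8])
{
    uint8_t  inv = (uint8_t) ~bitmap_byte;      /* now a set bit = NULL */
    uint32_t lo = ((uint32_t) (inv & 0x0F) * DEMO_SPREAD_MULT) & DEMO_LOW_BITS;
    uint32_t hi = ((uint32_t) (inv >> 4) * DEMO_SPREAD_MULT) & DEMO_LOW_BITS;
    uint64_t spread = (uint64_t) lo | ((uint64_t) hi << 32);

    /* Byte k of 'spread' is 1 if attribute k is NULL, else 0. */
    for (int k = 0; k < 8; k++)
        isnull[k] = (bool) ((spread >> (8 * k)) & 1);
}
```

Because eight flags are produced per bitmap byte, the destination array
must have room for a multiple of eight entries, which is why the commit
rounds the tts_isnull allocation up.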

The level of performance improvement from these changes seems to vary
depending on the CPU architecture.  Apple's M chips seem particularly
fond of the changes, with some of the tested deform-heavy queries going
over twice as fast as before.  With x86-64, the speedups aren't quite as
large.  With tables containing only a small number of columns, the
speedups will be less.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com
2026-03-16 11:46:00 +13:00
David Rowley
503620311e Add all required calls to TupleDescFinalize()
As of this commit all TupleDescs must have TupleDescFinalize() called on
them once the TupleDesc is set up and before BlessTupleDesc() is called.

In this commit, TupleDescFinalize() does nothing. This change has only
been separated out from the commit that properly implements this function
to make the change more obvious.  Any extension which makes its own
TupleDesc will need to be modified to call the new function.

The follow-up commit which properly implements TupleDescFinalize() will
cause any code which forgets to do this to fail in assert-enabled builds in
BlessTupleDesc().  It may still be worth mentioning this change in the
release notes so that extension authors update their code.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com
2026-03-16 11:45:49 +13:00
Melanie Plageman
99bf1f8aa6 Save vmbuffer in heap-specific scan descriptors for on-access pruning
Future commits will use the visibility map in on-access pruning to fix
VM corruption and set the VM if the page is all-visible.

Saving the vmbuffer in the scan descriptor reduces the number of times
it would need to be pinned and unpinned, making the overhead of doing so
negligible.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/C3AB3F5B-626E-4AAA-9529-23E9A20C727F%40gmail.com
2026-03-15 11:09:10 -04:00
Peter Geoghegan
d774072f00 Move fake LSN infrastructure out of GiST.
Move utility functions used by GiST to generate fake LSNs into xlog.c
and xloginsert.c, so that other index AMs can also generate fake LSNs.

Preparation for an upcoming commit that will add support for fake LSNs
to nbtree, allowing its dropPin optimization to be used during scans of
unlogged relations.  That commit is itself preparation for another
upcoming commit that will add a new amgetbatch/btgetbatch interface to
enable I/O prefetching.

Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming
XLOG_ASSIGN_LSN.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
2026-03-13 19:38:17 -04:00
Tomas Vondra
b1f14c9672 Use GetXLogInsertEndRecPtr in gistGetFakeLSN
The function used GetXLogInsertRecPtr() to generate the fake LSN. Most
of the time this is the same as what XLogInsert() would return, and so
it works fine with the XLogFlush() call. But if the last record ends at
a page boundary, GetXLogInsertRecPtr() returns LSN pointing after the
page header. In such case XLogFlush() fails with errors like this:

  ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000

Such failures are very hard to trigger, particularly outside aggressive
test scenarios.

Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN
without skipping the header. This is the same as GetXLogInsertRecPtr(),
except that it calls XLogBytePosToEndRecPtr().
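
The difference between the two conversions can be sketched with a
simplified model. It assumes 8 kB WAL pages that each begin with a 24-byte
header and ignores the longer header at the start of each segment; the
demo_ functions are hypothetical stand-ins, not the real
XLogBytePosToRecPtr() and XLogBytePosToEndRecPtr().

```c
#include <stdint.h>

#define DEMO_WAL_BLCKSZ       8192u     /* assumed WAL page size */
#define DEMO_WAL_PAGE_HDR     24u       /* assumed page header size */
#define DEMO_USABLE_PER_PAGE  (DEMO_WAL_BLCKSZ - DEMO_WAL_PAGE_HDR)

/* Position where the next record would START: always past the header. */
static uint64_t
demo_bytepos_to_recptr(uint64_t bytepos)
{
    uint64_t fullpages = bytepos / DEMO_USABLE_PER_PAGE;
    uint64_t rem = bytepos % DEMO_USABLE_PER_PAGE;

    return fullpages * DEMO_WAL_BLCKSZ + DEMO_WAL_PAGE_HDR + rem;
}

/* Position where the last record ENDED: stays at the page boundary. */
static uint64_t
demo_bytepos_to_endrecptr(uint64_t bytepos)
{
    uint64_t fullpages = bytepos / DEMO_USABLE_PER_PAGE;
    uint64_t rem = bytepos % DEMO_USABLE_PER_PAGE;

    if (rem == 0)
        return fullpages * DEMO_WAL_BLCKSZ;
    return fullpages * DEMO_WAL_BLCKSZ + DEMO_WAL_PAGE_HDR + rem;
}
```

In this model the two results differ only at an exact page boundary, and
then by 0x18 bytes, which matches the gap between the two LSNs in the error
message quoted above.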

Initial investigation by me, root cause identified by Andres Freund.

This is a long-standing bug in gistGetFakeLSN(), probably introduced by
c6b92041d3 in PG13. Backpatch to all supported versions.

Reported-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo
Backpatch-through: 14
2026-03-13 23:25:24 +01:00
Heikki Linnakangas
f9de9bf302 Add callback for I/O error messages in SLRUs
Historically, all SLRUs were addressed by transaction IDs, but that
hasn't been true for a long time. However, the error message on I/O
error still always talked about accessing a transaction ID.

This commit adds a callback that allows subsystems to construct their
own error messages, which can then correctly refer to a transaction
ID, multixid or whatever else is used to address the particular SLRU.
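
A sketch of such a callback arrangement follows. The type names, function
names, and message wording are all hypothetical and chosen only to show
the shape of the mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Each subsystem supplies a function that describes the failing key. */
typedef int (*DemoSlruErrMsgCallback) (char *buf, size_t buflen, uint64_t key);

typedef struct DemoSlruCtl
{
    const char            *name;
    DemoSlruErrMsgCallback errmsg_cb;
} DemoSlruCtl;

static int
demo_xact_errmsg(char *buf, size_t buflen, uint64_t key)
{
    return snprintf(buf, buflen, "could not access status of transaction %llu",
                    (unsigned long long) key);
}

static int
demo_multixact_errmsg(char *buf, size_t buflen, uint64_t key)
{
    return snprintf(buf, buflen, "could not access status of multixact %llu",
                    (unsigned long long) key);
}

/* Generic I/O error path: the wording is delegated to the subsystem. */
static void
demo_slru_report_error(const DemoSlruCtl *ctl, uint64_t key,
                       char *buf, size_t buflen)
{
    assert(ctl != NULL && ctl->errmsg_cb != NULL);
    ctl->errmsg_cb(buf, buflen, key);
}
```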

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/CACG=ezZZfurhYV+66ceubxQAyWqv9vaUi0yoO4-t48OE5xc0DQ@mail.gmail.com
2026-03-13 16:21:06 +02:00
Peter Geoghegan
d071e1cfec nbtree: Avoid allocating _bt_search stack.
Avoid allocating memory for an nbtree descent stack during index scans.
We only require a descent stack during inserts, when it is used to
determine where to insert a new pivot tuple/downlink into the target
leaf page's parent page in the event of a page split.  (Page deletion's
first phase also performs a _bt_search that requires a descent stack.)

This optimization improves performance by minimizing palloc churn.  It
speeds up index scans that call _bt_search frequently/descend the index
many times, especially when the cost of scanning the index dominates
(e.g., with index-only skip scans).  Testing has shown that the
underlying issue causes performance problems for an upcoming patch that
will replace btgettuple with a new btgetbatch interface to enable I/O
prefetching.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com
2026-03-12 13:22:36 -04:00
Tomas Vondra
943e881733 Do not lock in BufferGetLSNAtomic() on archs with 8 byte atomic reads
On platforms where we can read or write the whole LSN atomically, we do
not need to lock the buffer header to prevent torn LSNs. We can do this
only on platforms with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, and when the
pd_lsn field is properly aligned.

For historical reasons the PageXLogRecPtr was defined as a struct with
two uint32 fields. This replaces it with a single uint64 value, to make
the intent clearer. To prevent issues with weak typedefs the value is
still wrapped in a struct.
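
The two representations can be contrasted in a short sketch. The Demo
types and helpers are hypothetical, and the sketch says nothing about the
on-disk byte order, which the real change has to preserve.

```c
#include <stdbool.h>
#include <stdint.h>

/* Historical shape: two 32-bit halves, combined on every read. */
typedef struct DemoOldPageLSN
{
    uint32_t xlogid;        /* high bits */
    uint32_t xrecoff;       /* low bits */
} DemoOldPageLSN;

/*
 * New shape: one 64-bit value.  Wrapping it in a struct keeps it from being
 * mixed up with other 64-bit integers by accident.
 */
typedef struct DemoPageLSN
{
    uint64_t lsn;
} DemoPageLSN;

static uint64_t
demo_old_lsn_get(DemoOldPageLSN v)
{
    return ((uint64_t) v.xlogid << 32) | v.xrecoff;
}

/* A single 8-byte load; it cannot tear where such loads are atomic. */
static uint64_t
demo_lsn_get(const DemoPageLSN *v)
{
    return v->lsn;
}

/* The lock-free path is only valid when the field is 8-byte aligned. */
static bool
demo_is_8byte_aligned(const void *p)
{
    return ((uintptr_t) p % 8) == 0;
}
```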

This also adjusts heapfuncs() in pageinspect, to ensure proper alignment
when reading the LSN from a page on alignment-sensitive hardware.

Idea by Andres Freund. Initial patch by Andreas Karlsson, improved by
Peter Geoghegan. Minor tweaks by me.

Author: Andreas Karlsson <andreas@proxel.se>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se
2026-03-11 19:46:08 +01:00
Melanie Plageman
c2a23dcf9e Use the newest to-be-frozen xid as the conflict horizon for freezing
Previously WAL records that froze tuples used OldestXmin as the snapshot
conflict horizon, or the visibility cutoff if the page would become
all-frozen. Both are newer than (or equal to) the newest XID actually
frozen on the page.

Track the newest XID that will be frozen and use that as the snapshot
conflict horizon instead. This yields an older horizon resulting in
fewer query cancellations on standbys.
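
Tracking the newest frozen XID can be sketched as follows. The names are
hypothetical, and the sketch ignores the special-case XIDs that the real
comparison functions handle; it only shows the wraparound-aware "follows"
test and the running maximum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DemoXid;

/* Wraparound-aware comparison: is a logically newer than b? */
static bool
demo_xid_follows(DemoXid a, DemoXid b)
{
    return (int32_t) (a - b) > 0;
}

/*
 * Return the newest XID among those about to be frozen on a page, or
 * 'fallback' if nothing will be frozen.
 */
static DemoXid
demo_newest_frozen_xid(const DemoXid *xids, size_t n, DemoXid fallback)
{
    DemoXid newest;

    if (n == 0)
        return fallback;

    newest = xids[0];
    for (size_t i = 1; i < n; i++)
    {
        if (demo_xid_follows(xids[i], newest))
            newest = xids[i];
    }
    return newest;
}
```

If OldestXmin were 200 and the XIDs frozen on a page were 100, 105, and
102, the horizon would become 105 instead of 200, so standby snapshots that
can still see XIDs between 105 and 200 would no longer conflict.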

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com
2026-03-10 15:24:39 -04:00
Robert Haas
ffc226ab64 Prevent restore of incremental backup from bloating VM fork.
When I (rhaas) wrote the WAL summarizer code, I incorrectly believed
that XLOG_SMGR_TRUNCATE truncates all forks to the same length.  In
fact, what other parts of the code do is compute the truncation length
for the FSM and VM forks from the truncation length used for the main
fork. But, because I was confused, I coded the WAL summarizer to set the
limit block for the VM fork to the same value as for the main fork.
(Incremental backup always copies FSM forks in full, so there is no
similar issue in that case.)

Doing that doesn't directly cause any data corruption, as far as I can
see. However, it does create a serious risk of consuming a large amount
of extra disk space, because pg_combinebackup's reconstruct.c believes
that the reconstructed file should always be at least as long as the
limit block value. We might want to be smarter about that at some point
in the future, because it's always safe to omit all-zeroes blocks at the
end of the last segment of a relation, and doing so could save disk
space, but the current algorithm will rarely waste enough disk space to
worry about unless we believe that a relation has been truncated to a
length much longer than its actual length on disk, which is exactly what
happens as a result of the problem mentioned in the previous paragraph.

To fix, create a new visibilitymap helper function and use it to include
the right limit block in the summary files. Incremental backups taken
with existing summary files will still have this issue, but this should
improve the situation going forward.
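
The relationship between the two truncation lengths can be illustrated
with a sketch. It assumes the default 8 kB block size, a 24-byte page
header, and two VM bits per heap page; demo_vm_blocks_for_heap_blocks() is
a hypothetical name, not the helper added by this commit.

```c
#include <stdint.h>

/* (8192 - 24) bytes per VM page, 4 heap pages per byte (assumed values). */
#define DEMO_HEAPBLKS_PER_VM_PAGE  ((8192u - 24u) * 4u)

/* Number of VM blocks needed to cover a main fork of the given length. */
static uint32_t
demo_vm_blocks_for_heap_blocks(uint32_t heap_blocks)
{
    /* Written as quotient plus remainder check so it cannot overflow. */
    return heap_blocks / DEMO_HEAPBLKS_PER_VM_PAGE +
        (heap_blocks % DEMO_HEAPBLKS_PER_VM_PAGE != 0 ? 1u : 0u);
}
```

Under these assumptions, truncating the main fork to 1,000,000 blocks
leaves a VM fork of 31 blocks. Using 1,000,000 as the VM fork's limit
block instead is what made the reconstructed file so much longer than
needed.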

Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com>
Diagnosed-by: Amul Sul <sulamul@gmail.com>
Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com
Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com
Backpatch-through: 17
2026-03-09 06:45:32 -04:00
Melanie Plageman
34cb4254bd Prefix PruneState->all_{visible,frozen} with set_
The PruneState had members called "all_visible" and "all_frozen" which
reflect not the current state of the page but the state it could be in
once pruning and freezing have been executed. These are then saved in
the PruneFreezeResult so the caller can set the VM accordingly.

Prefix the PruneState members as well as the corresponding
PruneFreezeResult members with "set_" to clarify that they represent the
proposed state of the all-visible and all-frozen bits for a heap page in
the visibility map, not the current state.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-05 16:55:00 -05:00
Michael Paquier
5f8124a0cf Move definition of XLogRecoveryCtlData to xlogrecovery.h
XLogRecoveryCtlData is the structure that stores the shared-memory state
of WAL recovery, including information such as promotion requests, the
timeline ID (TLI), and the LSNs of replayed records.

This refactoring is independently useful because it allows code outside
of core to access the live recovery state.  It will be used by an
upcoming patch that introduces a SQL function for querying this
information, that can be accessed on a standby once a consistent state
has been reached.  This only moves code around, changing nothing
functionally.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
2026-03-05 12:17:47 +09:00
Michael Paquier
34dfca2934 Change default value of default_toast_compression to "lz4", take two
The default value for default_toast_compression was "pglz".  The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres.  However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average.  As
of this commit, the default value of default_toast_compression becomes
"lz4", if available.  By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.

Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago.  This should be long enough to consider this feature as
stable.

While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample.  Quotes are not required in this case.  The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.

Note that this is a lighter version of 7c1849311e, which included a
replacement of --with-lz4 by --without-lz4 in configure builds, forcing
a requirement for LZ4 in all environments.  The buildfarm did not like
it, at all.  This commit switches default_toast_compression to lz4 as
default only when --with-lz4 is defined, which should keep the buildfarm
at bay while still allowing users to benefit from LZ4 compression in
TOAST as long as the code is compiled with it.
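
The build-dependent default can be pictured with a tiny sketch. USE_LZ4
here stands for the symbol defined when the build has LZ4 support; the
DEMO_ macro and the function are hypothetical.

```c
/* Pick the default at compile time, depending on LZ4 availability. */
#ifdef USE_LZ4
#define DEMO_DEFAULT_TOAST_COMPRESSION "lz4"
#else
#define DEMO_DEFAULT_TOAST_COMPRESSION "pglz"
#endif

static const char *
demo_default_toast_compression(void)
{
    return DEMO_DEFAULT_TOAST_COMPRESSION;
}
```

Builds without LZ4 keep "pglz" as the default, so nothing changes for them.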

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org
2026-03-05 09:24:35 +09:00
Michael Paquier
4f0b3afab4 Revert "Change default value of default_toast_compression to "lz4""
This reverts commit 7c1849311e, due to the fact that more than 60% of
the buildfarm members do not have lz4 installed.  As we are in the last
commit fest of the development cycle, and as it could take a couple
of weeks to stabilize things, this change is reverted for now.

This commit will be reworked in a lighter version, as
default_toast_compression's default can be changed to "lz4" without the
switch from --with-lz4 to --without-lz4.  This approach will keep the
buildfarm at bay, and still allow builds to take advantage of LZ4 in
TOAST by default, as long as the code is compiled with LZ4 support.

A harder requirement based on LZ4 should be achievable at some point,
but it is going to require some work from the buildfarm owners first.
Perhaps this part could be revisited at the beginning of the next
development cycle.

Discussion: https://postgr.es/m/CAOYmi+meTT0NbLbnVqOJD5OKwCtHL86PQ+RZZTrn6umfmHyWaw@mail.gmail.com
2026-03-05 08:25:35 +09:00
Michael Paquier
7c1849311e Change default value of default_toast_compression to "lz4", when available
The default value for default_toast_compression was "pglz".  The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres.  However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average.  As
of this commit, the default value of default_toast_compression becomes
"lz4", if available.  By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.

Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago.  This should be long enough to consider this feature as
stable.

--with-lz4 is removed, replaced by a --without-lz4 to disable LZ4 in the
builds on an option-basis, following a practice similar to readline or
ICU.  References to --with-lz4 are removed from the documentation.

While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample.  Quotes are not required in this case.  The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.

For reference, a similar switch has been done with ICU in
fcb21b3acd.  Some of the changes done in this commit are consistent
with that.

Note: this is going to create some disturbance in the buildfarm, in
environments where lz4 is not installed.

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://postgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org
2026-03-04 13:05:31 +09:00
Michael Paquier
f68d7e7483 Remove WAL page header flag XLP_BKP_REMOVABLE
There are no known users of this flag.  The last supposed user was
pglesslog, which is the reason why this flag has been introduced in
core, based on an historical search pointing at a8d539f124.

I have mentioned that we may want to remove this flag back in 2018, due
to zero users of it in core.  More recently, Noah has pointed out that
this flag is not safe to use: XLP_BKP_REMOVABLE can be set by the WAL
writer in a lock-free fashion with runningBackups > 0, meaning that some
full-page images could be required but not logged, ultimately corrupting
backups.

Bump XLOG_PAGE_MAGIC.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/20250705001628.c3.nmisch@google.com
Discussion: https://postgr.es/m/CAEze2WhiwKSoAvfUggjDeoeY0-rz9cTpfrHcqvBMmJxv-K_5DA@mail.gmail.com
2026-03-02 14:13:05 +09:00
Peter Eisentraut
3f98862980 Fix some -Wcast-qual warnings
This fixes some warnings from -Wcast-qual that are easy to fix,
without using unconstify or the like.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/990c9117-b013-4026-aaf5-261fe2832c3d%40eisentraut.org
2026-02-27 21:57:33 +01:00
Álvaro Herrera
a2c89835f5 Don't include proc.h in shm_mq.h
This prevents proliferation of proc.h to tons of other places; shm_mq.h
is widely included.

Discussion: https://postgr.es/m/202602261733.s2rkxezwuif6@alvherre.pgsql
2026-02-27 10:53:47 +01:00
Melanie Plageman
284925508a Remove table_scan_analyze_next_tuple unneeded parameter OldestXmin
heapam_scan_analyze_next_tuple() doesn't distinguish between dead and
recently dead tuples when counting them, so it doesn't need OldestXmin.
GetOldestNonRemovableTransactionId() isn't free, so removing it is a
win.

Looking at other table AMs implementing table_scan_analyze_next_tuple(),
we couldn't find one using OldestXmin either, so remove it from the
callback.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CALdSSPjvhGXihT_9f-GJabYU%3D_PjrFDUxYaURuTbfLyQM6TErg%40mail.gmail.com
2026-02-26 15:41:53 -05:00
Álvaro Herrera
3894f08abe Update obsolete comment
table_tuple_update's update_indexes argument hasn't been a boolean since
commit 19d8e2308b.

Backpatch-through: 16
2026-02-18 18:09:54 +01:00
Michael Paquier
9181c870ba Improve type handling of varlena structures
This commit changes the definition of varlena to a typedef, so that it
becomes possible to remove "struct" markers from various declarations in
the code base.  Historically, "struct" markers are not the project style
for variable declarations, so this update simplifies the code and makes
it more consistent across the board.

This change has an impact on the following structures, simplifying
declarations using them:
- varlena
- varatt_indirect
- varatt_external
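
A small sketch of what the typedef buys, using a hypothetical
sketch_blob type shaped like a length header plus payload.  The field
and function names are illustrative assumptions, not the real varlena
definition.

```c
#include <stddef.h>

/* Before: only a struct tag, so every declaration needs "struct". */
struct sketch_blob_old
{
	char		len_hdr[4];
	char		data[];
};

static size_t
sketch_old_header_size(const struct sketch_blob_old *b)
{
	(void) b;
	return offsetof(struct sketch_blob_old, data);
}

/* After: a typedef, so declarations can drop the "struct" marker. */
typedef struct sketch_blob
{
	char		len_hdr[4];
	char		data[];
} sketch_blob;

static size_t
sketch_header_size(const sketch_blob *b)
{
	(void) b;
	return offsetof(sketch_blob, data);
}
```

Both layouts are identical; only the spelling of declarations changes.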

This cleanup came up in a different patch set that played with
TOAST and varatt.h, but it is worth doing on its own.

Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aW8xvVbovdhyI4yo@paquier.xyz
2026-02-11 07:33:24 +09:00
Álvaro Herrera
cbef472558 Remove HeapTupleHeaderSetXminCommitted/Invalid functions
They are not and never have been used by any known code -- apparently we
just cargo-culted them in commit 37484ad2aa (or their ancestor macros
anyway, which begat these functions in commit 34694ec888).  Allegedly
they're also potentially dangerous; users are better off going through
HeapTupleSetHintBits instead.

Author: Andy Fan <zhihuifan1213@163.com>
Discussion: https://postgr.es/m/87sejogt4g.fsf@163.com
2026-02-09 19:15:20 +01:00
Peter Eisentraut
137d05df2f Rename AssertVariableIsOfType to StaticAssertVariableIsOfType
This keeps run-time assertions and static assertions clearly separate.
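
To illustrate why a "static" prefix fits, here is a rough sketch of a
compile-time variable type check.  It assumes GCC or Clang (it relies on
the __typeof__ and __builtin_types_compatible_p extensions), and the
macro name is hypothetical rather than the project's definition.

```c
/*
 * Fails the build, not the running program, if var is not of the given
 * type.  Assumes GCC/Clang extensions; hypothetical macro name.
 */
#define SKETCH_STATIC_ASSERT_VAR_TYPE(var, type) \
	_Static_assert(__builtin_types_compatible_p(__typeof__(var), type), \
				   #var " does not have type " #type)

static int	sketch_counter = 0;

/* Checked entirely at compile time; generates no run-time code. */
SKETCH_STATIC_ASSERT_VAR_TYPE(sketch_counter, int);

static int
sketch_bump(void)
{
	return ++sketch_counter;
}
```

Changing the declared type of sketch_counter to long would stop the
compile, whereas a run-time assertion would only fire when executed.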

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/2273bc2a-045d-4a75-8584-7cd9396e5534%40eisentraut.org
2026-02-03 08:45:24 +01:00
Tom Lane
da7a1dc0d6 Refactor att_align_nominal() to improve performance.
Separate att_align_nominal() into two macros, similarly to what
was already done with att_align_datum() and att_align_pointer().
The inner macro att_nominal_alignby() is really just TYPEALIGN(),
while att_align_nominal() retains its previous API by mapping
TYPALIGN_xxx values to numbers of bytes to align to and then
calling att_nominal_alignby().  In support of this, split out
tupdesc.c's logic to do that mapping into a publicly visible
function typalign_to_alignby().

Having done that, we can replace performance-critical uses of
att_align_nominal() with att_nominal_alignby(), where the
typalign_to_alignby() mapping is done just once outside the loop.
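
The hoisting described above can be sketched roughly as follows.  All
names are hypothetical stand-ins, and the byte counts per alignment code
are assumed for illustration (real values are platform-dependent).

```c
#include <stddef.h>
#include <stdint.h>

/* Round len up to the next multiple of alignby (a power of two). */
#define SKETCH_ALIGNBY(alignby, len) \
	(((uintptr_t) (len) + ((alignby) - 1)) & ~((uintptr_t) ((alignby) - 1)))

/* Map an alignment code to a byte count; values assumed for the sketch. */
static uint8_t
sketch_typalign_to_alignby(char typalign)
{
	switch (typalign)
	{
		case 'c':
			return 1;
		case 's':
			return 2;
		case 'i':
			return 4;
		case 'd':
			return 8;
		default:
			return 1;			/* sketch only: no error handling */
	}
}

/* Lay out nelems fixed-width elements; the mapping is done once. */
static size_t
sketch_total_size(char typalign, size_t elemlen, int nelems)
{
	uint8_t		alignby = sketch_typalign_to_alignby(typalign);
	size_t		off = 0;

	for (int i = 0; i < nelems; i++)
	{
		off = SKETCH_ALIGNBY(alignby, off);	/* no switch inside the loop */
		off += elemlen;
	}
	return off;
}
```

The loop body is left with a single add-and-mask per element instead of
a branch on the alignment code.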

In most places I settled for doing typalign_to_alignby() once
per function.  We could in many places pass the alignby value
in from the caller if we wanted to change function APIs for this
purpose; but I'm a bit loath to do that, especially for exported
APIs that extensions might call.  Replacing a char typalign
argument by a uint8 typalignby argument would be an API change
that compilers would fail to warn about, thus silently breaking
code in hard-to-debug ways.  I did revise the APIs of array_iter_setup
and array_iter_next, moving the element type attribute arguments to
the former; if any external code uses those, the argument-count
change will cause visible compile failures.

Performance testing shows that ExecEvalScalarArrayOp is sped up by
about 10% by this change, when using a simple per-element function
such as int8eq.  I did not check any of the other loops optimized
here, but it's reasonable to expect similar gains.

Although the motivation for creating this patch was to avoid a
performance loss if we add some more typalign values, it evidently
is worth doing whether that patch lands or not.

Discussion: https://postgr.es/m/1127261.1769649624@sss.pgh.pa.us
2026-02-02 14:39:50 -05:00
Andres Freund
87f7b824f2 tableam: Perform CheckXidAlive check once per scan
Previously, the CheckXidAlive check was performed within the table_scan*next*
functions. This caused the check to be executed for every fetched tuple, an
unnecessary overhead.

To fix, move the check to table_beginscan* so it is performed once per scan
rather than once per row.
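
As a toy model of that move, the sketch below counts how often a
stand-in check runs.  The scan type and functions are hypothetical and
only mimic the begin-scan/next-tuple shape; they are not the tableam
API.

```c
#include <stdbool.h>

static int	sketch_checks_done = 0;

/* Hypothetical stand-in for the CheckXidAlive validation. */
static void
sketch_check_xid_alive(void)
{
	sketch_checks_done++;
}

typedef struct SketchScan
{
	int			next;
	int			ntuples;
} SketchScan;

static SketchScan
sketch_beginscan(int ntuples)
{
	SketchScan	scan = {0, ntuples};

	sketch_check_xid_alive();	/* performed once per scan */
	return scan;
}

static bool
sketch_getnext(SketchScan *scan)
{
	/* no per-row check here */
	if (scan->next >= scan->ntuples)
		return false;
	scan->next++;
	return true;
}

/* Scan everything and return the number of rows fetched. */
static int
sketch_scan_all(int ntuples)
{
	SketchScan	scan = sketch_beginscan(ntuples);
	int			rows = 0;

	while (sketch_getnext(&scan))
		rows++;
	return rows;
}
```

Scanning five rows increments the counter once rather than five times.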

Note: table_tuple_fetch_row_version() does not use a scan descriptor;
therefore, the CheckXidAlive check is retained in that function. The overhead
is unlikely to be relevant for the existing callers.

Reported-by: Andres Freund <andres@anarazel.de>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Suggested-by: Amit Kapila <akapila@postgresql.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/tlpltqm5jjwj7mp66dtebwwhppe4ri36vdypux2zoczrc2i3mp%40dhv4v4nikyfg
2026-01-29 17:52:07 -05:00
Masahiko Sawada
1fdbca159e Standardize replication origin naming to use "ReplOrigin".
The replication origin code was using inconsistent naming
conventions. Functions were typically prefixed with 'replorigin',
while typedefs and constants used "RepOrigin".

This commit unifies the naming convention by renaming RepOriginId to
ReplOriginId.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAD21AoBDgm3hDqUZ+nqu=ViHmkCnJBuJyaxG_yvv27BAi2zBmQ@mail.gmail.com
2026-01-28 11:03:29 -08:00
Amit Kapila
851f6649cc Prevent invalidation of newly synced replication slots.
A race condition could cause a newly synced replication slot to become
invalidated between its initial sync and the checkpoint.

When syncing a replication slot to a standby, the slot's initial
restart_lsn is taken from the publisher's remote_restart_lsn. Because slot
sync happens asynchronously, this value can lag behind the standby's
current redo pointer. Without any interlocking between WAL reservation and
checkpoints, a checkpoint may remove WAL required by the newly synced
slot, causing the slot to be invalidated.

To fix this, we acquire ReplicationSlotAllocationLock before reserving WAL
for a newly synced slot, similar to commit 006dd4b2e5. This ensures that
if WAL reservation happens first, the checkpoint process must wait for
slotsync to update the slot's restart_lsn before it computes the minimum
required LSN.

However, unlike in ReplicationSlotReserveWal(), this lock alone cannot
protect a newly synced slot if a checkpoint has already run
CheckPointReplicationSlots() before slotsync updates the slot. In such
cases, the remote restart_lsn may be stale and earlier than the current
redo pointer. To prevent relying on an outdated LSN, we use the oldest
WAL location available if it is greater than the remote restart_lsn.
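
The rule in the last sentence amounts to taking the later of two
positions.  A minimal sketch, assuming LSNs are modeled as plain 64-bit
integers and using hypothetical names:

```c
#include <stdint.h>

typedef uint64_t SketchLsn;		/* hypothetical stand-in for an LSN */

/*
 * Pick the starting restart_lsn for a newly synced slot: never start
 * earlier than the oldest WAL still available locally.
 */
static SketchLsn
sketch_initial_restart_lsn(SketchLsn remote_restart_lsn,
						   SketchLsn oldest_available_wal)
{
	if (oldest_available_wal > remote_restart_lsn)
		return oldest_available_wal;
	return remote_restart_lsn;
}
```

With a stale remote value of 100 and local WAL starting at 250, the slot
would start at 250; a remote value of 300 would be used unchanged.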

This ensures that newly synced slots always start with a safe, non-stale
restart_lsn and are not invalidated by concurrent checkpoints.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Backpatch-through: 17
Discussion: https://postgr.es/m/TY4PR01MB16907E744589B1AB2EE89A31F94D7A%40TY4PR01MB16907.jpnprd01.prod.outlook.com
2026-01-27 05:06:29 +00:00
Melanie Plageman
648a7e28d7 Eliminate use of cached VM value in lazy_scan_prune()
lazy_scan_prune() takes a parameter from lazy_scan_heap() indicating
whether the page was marked all-visible in the VM at the time it was
last checked in find_next_unskippable_block(). This behavior is
historical, dating back to commit 608195a3a3, when we did not pin the
VM page until deciding we must read it. Now that the VM page is already
pinned, there is no meaningful benefit to relying on a cached VM status.

Removing this cached value simplifies the logic in both lazy_scan_heap()
and lazy_scan_prune(). It also clarifies future work that will set the
visibility map on-access: such paths will not have a cached value
available, which would make the logic harder to reason about. And
eliminating it enables us to detect and repair VM corruption on-access.

Along with removing the cached value and unconditionally checking the
visibility status of the heap page, this commit also moves the VM
corruption handling to occur first. This reordering should have no
performance impact, since the checks are inexpensive and performed only
once per page. It does, however, make the control flow easier to
understand. The new restructuring also makes it possible to set the VM
after fixing corruption (if pruning found the page all-visible).

Now that no callers of visibilitymap_set() use its return value, change
its (and visibilitymap_set_vmbits()) return type to void.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/5CEAA162-67B1-44DA-B60D-8B65717E8B05%40gmail.com
2026-01-26 17:00:13 -05:00
Peter Eisentraut
5ca5f12c2c Fix accidentally cast away qualifiers
This fixes cases where a qualifier (const, in all cases here) was
dropped by a cast, but the cast was otherwise necessary or desirable,
so the straightforward fix is to add the qualifier into the cast.
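
A typical instance of this pattern, sketched with a hypothetical
comparator (not taken from the patch):

```c
typedef struct SketchKey
{
	int			id;
} SketchKey;

/* qsort-style comparator: the arguments arrive as const void *. */
static int
sketch_key_cmp(const void *a, const void *b)
{
	/*
	 * Casting to (SketchKey *) here would silently drop const, which
	 * -Wcast-qual reports.  Keeping the qualifier in the cast fixes it.
	 */
	const SketchKey *ka = (const SketchKey *) a;
	const SketchKey *kb = (const SketchKey *) b;

	return (ka->id > kb->id) - (ka->id < kb->id);
}
```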

Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b04f4d3a-5e70-4e73-9ef2-87f777ca4aac%40eisentraut.org
2026-01-26 16:02:31 +01:00
Álvaro Herrera
69f98fce5b Make some use of anonymous unions [reloptions]
In the spirit of commit 4b7e6c73b0 and following, which see for more
details; it appears to have been quite an uncontroversial C11 feature to
use and it makes the code nicer to read.

This commit changes the relopt_value struct.
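
For readers unfamiliar with the feature, a minimal sketch of a C11
anonymous union in a tagged value struct.  The type and member names are
hypothetical and do not mirror relopt_value.

```c
#include <stdbool.h>

typedef enum SketchOptKind
{
	SKETCH_OPT_BOOL,
	SKETCH_OPT_INT
} SketchOptKind;

typedef struct SketchOptValue
{
	SketchOptKind kind;
	union
	{
		bool		bool_val;
		int			int_val;
	};							/* anonymous: no member name to spell out */
} SketchOptValue;

/* Members of the union are reached directly, e.g. v->int_val. */
static int
sketch_opt_as_int(const SketchOptValue *v)
{
	return (v->kind == SKETCH_OPT_BOOL) ? (int) v->bool_val : v->int_val;
}
```

With a named union the access would read v->values.int_val (for some
member name such as "values"); the anonymous form drops that extra hop.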

Author: Peter Eisentraut <peter@eisentraut.org>
Author: Álvaro Herrera <alvherre@kurilemu.de>
Note: Yes, this was written twice independently.
Discussion: https://postgr.es/m/202601192106.zcdi3yu2gzti@alvherre.pgsql
2026-01-22 17:04:59 +01:00
Álvaro Herrera
4d6a66f675 Allow Boolean reloptions to have ternary values
From the user's point of view these are just Boolean values; from the
implementation side we can now distinguish an option that hasn't been
set.  Reimplement the vacuum_truncate reloption using this type.
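
One way to picture the idea is a three-state enum with a fallback when
the option is unset.  This is only a sketch with hypothetical names; the
actual type and its values may differ.

```c
#include <stdbool.h>

/* Hypothetical three-state value for a Boolean option. */
typedef enum SketchTernary
{
	SKETCH_UNSET = -1,
	SKETCH_FALSE = 0,
	SKETCH_TRUE = 1
} SketchTernary;

/* Use the per-table option if it was set, else the fallback default. */
static bool
sketch_resolve(SketchTernary reloption, bool fallback)
{
	if (reloption == SKETCH_UNSET)
		return fallback;
	return reloption == SKETCH_TRUE;
}
```

A plain bool cannot tell "explicitly false" apart from "never set",
which is the distinction the fallback logic needs.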

This could also be used for reloptions vacuum_index_cleanup and
buffering, but those additionally need a per-option "alias" for the
state where the variable is unset (currently the value "auto").

Author: Nikolay Shaplov <dhyan@nataraj.su>
Reviewed-by: Timur Magomedov <t.magomedov@postgrespro.ru>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/3474141.usfYGdeWWP@thinkpad-pgpro
2026-01-21 20:06:01 +01:00
Heikki Linnakangas
c4b71e6f60 Remove some unnecessary code from multixact truncation
With 64-bit multixact offsets, PerformMembersTruncation() doesn't need
the starting offset anymore. The 'oldestOffset' value that
TruncateMultiXact() calculates is no longer used for anything. Remove
it, and the code to calculate it.

'oldestOffset' was included in the WAL record as 'startTruncMemb',
which sounds nice if you e.g. look at the WAL with pg_waldump, but it
was also confusing because we didn't actually use the value for
determining what to truncate. Replaying the WAL would remove all
segments older than 'endTruncMemb', regardless of
'startTruncMemb'. The 'startTruncOff' stored in the WAL record was
similarly unnecessary even before 64-bit multixid offsets; it was
stored just for the sake of symmetry with 'startTruncMemb'. Remove
both from the WAL record, and rename the remaining 'endTruncOff' to
'oldestMulti' and 'endTruncMemb' to 'oldestOffset', for consistency
with the variable names used for them in other places.

Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://www.postgresql.org/message-id/000301b2-5b81-4938-bdac-90f6eb660843@iki.fi
2026-01-15 13:34:50 +02:00
Andres Freund
0b96e734c5 heapam: Add batch mode mvcc check and use it in page mode
There are two reasons for doing so:

1) It is generally faster to perform checks in a batched fashion and making
   sequential scans faster is nice.

2) We would like to stop setting hint bits while pages are being written
   out. The necessary locking becomes visible for page mode scans, if done for
   every tuple. With batching, the overhead can be amortized to only happen
   once per page.

There are substantial further optimization opportunities along these
lines:

- Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple
  HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could
  instead write an explicitly optimized version that avoids repeated xid
  tests.

- Introduce batched version of the serializability test

- Introduce batched version of HeapTupleSatisfiesVacuum
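
The first bullet's "batch wrapper around the single-tuple test" shape
can be sketched as below.  The visibility rule is a deliberately
simplified assumption (it is not the real MVCC logic), and every name is
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct SketchTuple
{
	unsigned	xmin;			/* creating transaction */
	unsigned	xmax;			/* deleting transaction, 0 if none */
} SketchTuple;

/* Toy single-tuple test; NOT the real visibility rules. */
static inline bool
sketch_tuple_visible(const SketchTuple *t, unsigned snapshot_xid)
{
	return t->xmin < snapshot_xid &&
		(t->xmax == 0 || t->xmax >= snapshot_xid);
}

/*
 * Batch form: one call per page's worth of tuples.  Fills visible[]
 * and returns how many tuples passed, relying on inlining of the
 * single-tuple test.
 */
static size_t
sketch_tuples_visible_batch(const SketchTuple *tuples, size_t n,
							unsigned snapshot_xid, bool *visible)
{
	size_t		nvisible = 0;

	for (size_t i = 0; i < n; i++)
	{
		visible[i] = sketch_tuple_visible(&tuples[i], snapshot_xid);
		nvisible += visible[i];
	}
	return nvisible;
}
```

Any per-page setup, such as locking, can then sit around the single
batch call instead of being repeated for every tuple.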

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
2026-01-12 13:22:04 -05:00