Commit graph

28616 commits

Author SHA1 Message Date
Andres Freund
294520c444 instrumentation: Use Time-Stamp Counter on x86-64 to lower overhead
This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using RDTSC/RDTSCP instructions, instead of APIs like
clock_gettime() on POSIX systems.

This reduces the overhead of EXPLAIN with ANALYZE and TIMING ON. Tests showed
that the overhead on top of actual runtime when instrumenting queries moving
lots of rows through the plan can be reduced from 2x as slow to 1.2x as slow
compared to the actual runtime. More complex workloads such as TPCH queries
have also shown ~20% gains when instrumented compared to before.

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when reliable, for example
when running on modern Intel CPUs, or when running on Linux and the system
clocksource is reported as "tsc". The use of the operating system clock source
can be enforced by setting "system", or on x86-64 architectures the use of TSC
can be enforced by explicitly setting "tsc".

In order to use the TSC the frequency is first determined by use of CPUID, and
if not available, by running a short calibration loop at program start,
falling back to the system clock source if TSC values are not stable.

Note, that we split TSC usage into the RDTSC CPU instruction which does not
wait for out-of-order execution (faster, less precise) and the RDTSCP
instruction, which waits for outstanding instructions to retire. RDTSCP is
deemed to have little benefit in the typical InstrStartNode() /
InstrStopNode() use case of EXPLAIN, and can be up to twice as slow. To
separate these use cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is
introduced, which uses RDTSC.

The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed to be
used when precision is more important than performance. When the system timing
clock source is used both of these macros instead utilize the system
APIs (clock_gettime / QueryPerformanceCounter) like before.

Additional users of interval timing, such as track_io_timing and
track_wal_io_timing could also benefit from being converted to use
INSTR_TIME_SET_CURRENT_FAST() but are left for future changes.

Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an earlier version)
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com> (in an earlier version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (in an earlier version)
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version)
Discussion: https://postgr.es/m/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de
2026-04-07 13:00:24 -04:00
Andres Freund
0022622c93 instrumentation: Standardize ticks to nanosecond conversion method
The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad-hoc on Windows, e.g. when calling
INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
invocation, despite the frequency being stable after program start,
incurring unnecessary overhead. It also causes a fractured implementation
where macros are defined differently between platforms.

To ease code readability, and prepare for a future change that intends
to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
new pg_ticks_to_ns() / pg_ns_to_ticks() functions that get called from
INSTR_* macros on all platforms.

These functions rely on a separately initialized ticks_per_ns_scaled
value, that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
POSIX systems, which results in the ticks being treated as nanoseconds.
Other architectures always directly return the original ticks.

To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.

In passing, fix variable names in comment documenting INSTR_TIME_ADD_NANOSEC().

Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
2026-04-07 13:00:24 -04:00
Jacob Champion
b977bd308a oauth: Allow validators to register custom HBA options
OAuth validators can already use custom GUCs to configure behavior
globally. But we currently provide no ability to adjust settings for
individual HBA entries, because the original design focused on a world
where a provider covered a "single audience" of users for one database
cluster. This assumption does not apply to multitenant use cases, where
a single validator may be controlling access for wildly different user
groups.

To improve this use case, add two new API calls for use by validator
callbacks: RegisterOAuthHBAOptions() and GetOAuthHBAOption().
Registering options "foo" and "bar" allows a user to set "validator.foo"
and "validator.bar" in an oauth HBA entry. These options are stringly
typed (syntax validation is solely the responsibility of the defining
module), and names are restricted to a subset of ASCII to avoid tying
our hands with future HBA syntax improvements.

Unfortunately, we can't check the custom option names during a reload of
the configuration, like we do with standard HBA options, without
requiring all validators to be loaded via shared_preload_libraries.
(I consider this to be a nonstarter: most validators should probably use
session_preload_libraries at most, since requiring a full restart just
to update authentication behavior will be unacceptable to many users.)
Instead, the new validator.* options are checked against the registered
list at connection time.

Multiple alternatives were proposed and/or prototyped, including
extending the GUC system to allow per-HBA overrides, joining forces with
recent refactoring work on the reloptions subsystem, and giving the
ability to customize HBA options to all PostgreSQL extensions. I
personally believe per-HBA GUC overrides are the best option, because
several existing GUCs like authentication_timeout and pre_auth_delay
would fit there usefully. But the recent addition of SNI per-host
settings in 4f433025f indicates that a more general solution is needed,
and I expect that to take multiple releases' worth of discussion.

This compromise patch, then, is intentionally designed to be an
architectural dead end: simple to describe, cheap to maintain, and
providing just enough functionality to let validators move forward for
PG19. The hope is that it will be replaced in the future by a solution
that can handle per-host, per-HBA, and other per-context configuration
with the same functionality that GUCs provide today. In the meantime,
the bulk of the code in this patch consists of strict guardrails on the
simple API, to try to ensure that we don't have any reason to regret its
existence during its unknown lifespan.

I owe particular thanks here to Zsolt Parragi, who prototyped several
approaches that guided the final design.

Suggested-by: Zsolt Parragi <zsolt.parragi@percona.com>
Suggested-by: VASUKI M <vasukianand0119@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAN4CZFM3b8u5uNNNsY6XCya257u%2BDofms3su9f11iMCxvCacag%40mail.gmail.com
2026-04-07 08:15:19 -07:00
Álvaro Herrera
e76d8c749c
Reserve replication slots specifically for REPACK
Add a new GUC max_repack_replication_slots, which lets the user reserve
some additional replication slots for concurrent repack (and only
concurrent repack).  With this, the user doesn't have to worry about
changing the max_replication_slots in order to cater for use of
concurrent repack.

(We still use the same pool of bgworkers though, but that's less
commonly a problem than slots.)

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Discussion: https://postgr.es/m/202604012148.nnnmyxxrr6nh@alvherre.pgsql
2026-04-07 16:55:29 +02:00
Heikki Linnakangas
979387f188 Fix harmless leftover in _hash_kill_items()
Checking for 'havePin' is sufficient here. An earlier version of the
patch didn't have the 'havePin' variable and used
'so->hashso_bucket_buf == so->currPos.buf' as the condition when both
locking and unlocking the page. The havePin variable was added later
during development, but the unlocking condition wasn't fully
updated. Tidy it up.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/b9de8d05-3b02-4a27-9b0b-03972fa4bfd3@iki.fi
2026-04-07 17:38:11 +03:00
Andrew Dunstan
55890a9194 Add errdetail() with PID and UID about source of termination signal.
When a backend is terminated via pg_terminate_backend() or an external
SIGTERM, the error message now includes the sender's PID and UID as
errdetail, making it easier to identify the source of unexpected
terminations in multi-user environments.

On platforms that support SA_SIGINFO (Linux, FreeBSD, and most modern
Unix systems), the signal handler captures si_pid and si_uid from the
siginfo_t structure.  On platforms without SA_SIGINFO, the detail is
simply omitted.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Chao Li <1356863904@qq.com>
Discussion: https://postgr.es/m/CAKZiRmyrOWovZSdixpLd3PGMQXuQL_zw2Ght5XhHCkQ1uDsxjw@mail.gmail.com
2026-04-07 10:22:33 -04:00
Andres Freund
29e7dbf5e4 Minimal fix for WAIT FOR ... MODE 'standby_flush'
The investigation into the negative test performance impact of 7e8aeb9e48
lead to discovering that there are a few issues with WAIT FOR.

This commit is just a minimal fix to prevent hangs in standby_flush mode, due
to WAIT FOR ... 'standby_flush' seeing a 0 LSN if a newly started walreceiver
does not receive any writes, because the stanby is already caught up.

There are several other issues and this is isn't necessarily the best fix. But
this way we get the hangs out of the way.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/zqbppucpmkeqecfy4s5kscnru4tbk6khp3ozqz6ad2zijz354k@w4bdf4z3wqoz
2026-04-07 09:48:09 -04:00
Heikki Linnakangas
9480c585df Tidy up #ifdef USE_INJECTION_POINTS guards
Remove unnecessary #ifdef guard around the function prototypes; they
are already inside a larger #ifdef block. Move #include "subsystems.h"
inside the USE_INJECTION_POINTS guard; it's needed for
InjectionPointShmemCallbacks, which is a also inside the guard.

Reported-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://www.postgresql.org/message-id/87y0iz2c1v.fsf@wibble.ilmari.org
2026-04-07 16:18:31 +03:00
Tomas Vondra
884f9b3c76 Use add_size/mul_size for index instrumentation size calculations
Use overflow-safe size arithmetic in the Index[Only]Scan and parallel
instrumentation functions, consistent with other executor nodes (Hash,
Sort, Agg, Memoize). This was an oversight in dd78e69cfc.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
2026-04-07 12:47:28 +02:00
Tomas Vondra
9c18b47e61 Fix BitmapHeapScan non-parallel-aware EXPLAIN ANALYZE
Allocates shared bitmap table scan instrumentation for all parallel
scans. Previously, the instrumentation was only allocated for
parallel-aware scans, other bitmap heap scans in the parallel query had
no shared instrumentation and EXPLAIN didn't report exact/lossy pages.
This affected cases like scans on the outside of a parallel join or
queries run with debug_parallel_query=regress.

Fixed by allocating a separate DSM chunk for shared instrumentation and
doing so regardless of parallel-awareness. The instrumentation is
allocated in its own DSM chunk, separate from ParallelBitmapHeapState.

Report an initial patch by me. The approach with a separate DSM was
proposed and implemented by Melanie.

Not backpatched. The issue affects Postgres 18 (since 5a1e6df3b8), but
having multiple DSM chunks is possible only since dd78e69cfc. If we
decide to fix this in backbranches too, it will need to be done in a
less invasive way.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
2026-04-07 12:47:13 +02:00
Álvaro Herrera
0d3dba38c7
Allow logical replication snapshots to be database-specific
By default, the logical decoding assumes access to shared catalogs, so
the snapshot builder needs to consider cluster-wide XIDs during startup.
That in turn means that, if any transaction is already running (and has
XID assigned), the snapshot builder needs to wait for its completion, as
it does not know if that transaction performed catalog changes earlier.

A possible problem with this concept is that if REPACK (CONCURRENTLY) is
running in some database, backends running the same command in other
databases get stuck until the first one has committed. Thus only a
single backend in the cluster can run REPACK (CONCURRENTLY) at any time.
Likewise, REPACK (CONCURRENTLY) can block walsenders starting on behalf
of subscriptions throughout the cluster.

This patch adds a new option to logical replication output plugin, to
declare that it does not use shared catalogs (i.e. catalogs that can be
changed by transactions running in other databases in the cluster). In
that case, no snapshot the backend will use during the decoding needs to
contain information about transactions running in other databases. Thus
the snapshot builder only needs to wait for completion of transactions
in the current database.

Currently we only use this option in the REPACK background worker. It
could possibly be used in the plugin for logical replication too,
however that would need thorough analysis of that plugin.

Bump WAL version number, due to a new field in xl_running_xacts.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/90475.1775218118@localhost
2026-04-07 12:31:18 +02:00
Álvaro Herrera
a3b069ef90
Avoid different-size pointer-to-integer cast
Buildfarm member mamba is unhappy that I wrote "(Datum) NULL" in commit
28d534e2ae:
  https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-07%2005%3A08%3A08
Use "(Datum) 0" which is what we do everywhere else.

Discussion: https://postgr.es/m/CANWCAZaOs_+WPH13ow33Q==+FwBwVZkqzm4vND=WEB4_NBmv1Q@mail.gmail.com
2026-04-07 12:28:05 +02:00
Heikki Linnakangas
6f5ad00ab7 Optimize sort and deduplication in ginExtractEntries()
Remove NULLs from the array first, and use qsort to deduplicate only
the non-NULL items. This simplifies the comparison function. Also
replace qsort_arg() with a templated version so that the comparison
function can be inlined. These changes make ginExtractEntries() a
little faster especially for simple datatypes like integers.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/6d16b6bd-a1ff-4469-aefb-a1c8274e561a@iki.fi
2026-04-07 13:26:39 +03:00
Peter Eisentraut
b6ccd30d8f Add isolation tests for UPDATE/DELETE FOR PORTION OF
Add documentation about concurrency issues related to UPDATE/DELETE
FOR PORTION OF as well as supporting isolation tests.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com
2026-04-07 11:22:11 +02:00
Álvaro Herrera
5bcc3fbd19
Fix valgrind failure
Buildfarm member skink reports that the new REPACK code is trying to
write uninitialized bytes to disk, which correspond to padding space in
the SerializedSnapshotData struct.  Silence that by initializing the
memory in SerializeSnapshot() to all zeroes.

Co-authored-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/1976915.1775537087@sss.pgh.pa.us
2026-04-07 11:13:50 +02:00
John Naylor
8c3e22a8f8 Use .h for the file containing the page checksum code fragment
Commit 5e13b0f24 used a .c file for a file containing a code fragment,
to avoid adding an exception to headerscheck. That turned out to be
too clever, since it meant installation didn't happen by the usual
mechanism. Make it look like a normal header and add the requisite
exception.

Bug: #19450
Reported-by: RekGRpth <rekgrpth@gmail.com>
Discussion: https://postgr.es/m/19450-bb0612c50c6786e5@postgresql.org
2026-04-07 15:52:55 +07:00
John Naylor
30229be755 Simplify SortSupport for the macaddr data type
As of commit 6aebedc38 Datums are 64-bit values. Since MAC addresses
have only 6 bytes, the abbreviated key always contains the entire
MAC address and is thus authoritative (for practical purposes -- the
tuple sort machinery has no way of knowing that). Abbreviating this
datatype is cheap, and aborting abbreviation prevents optimizations
like radix sort, so remove cardinality estimation.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Suggested-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TMk10rF_LiMz6j9rRy1rqk-5s+wBPuBefLix4cY+-4s1w@mail.gmail.com
2026-04-07 13:29:27 +07:00
Michael Paquier
49cc0d4148 Mark JumbleState as a const in the post_parse_analyze hook
This commit changes the post_parse_analyze_hook_type() hook to take a
const JumbleState, to tell external modules that they are not allowed to
touch the JumbleState that has been compiled by the core code.  This
fixes a pretty old problem with pg_stat_statements, that had always the
idea of modifying the lengths of the constants stored in the
JumbleState.  The previous state could confuse extensions that need to
look at a JumbleState depending on the loading order, if
pg_stat_statements is part of the stack loaded.

Another piece included in this commit is the move of the routine
fill_in_constant_lengths() to queryjumblefuncs.c, to give an option to
extensions to compile the lengths of the constants, if necessary.  I was
surprised by the number of external code that carries a copy of this
routine (see the thread for details).  Previously, this routine modified
JumbleState.  It now copies the set of LocationLens from JumbleState,
and fills the constant lengths for separate use.

pg_stat_statements is updated to use the new ComputeConstantLengths().
JumbleState is now marked with a const in the module, where relevant.

Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
2026-04-07 15:22:49 +09:00
John Naylor
51098839cf Split CREATE STATISTICS error reasons out into errdetails
Some errmsgs in statscmds.c were phrased as "...cannot be used
because...". Put the reasons into errdetails. While at it, switch
from passive voice to "cannot create..." for the errmsg.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Suggested-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CANWCAZaZeX0omWNh_ZbD_JVujzYQdRUW8UZOQ4dWh9Sg7OcAow@mail.gmail.com
2026-04-07 11:37:48 +07:00
Michael Paquier
17132f55c5 Fix shmem allocation of fixed-sized custom stats kind
StatsShmemSize(), that computes the shmem size needed for pgstats,
includes the amount of shared memory wanted by all the custom stats
kinds registered.  However, the shared memory allocation was done by
ShmemAlloc() in StatsShmemInit(), meaning that the space reserved was
not used, wasting some memory.

These extra allocations would show up under "<anonymous>" in
pg_shmem_allocations, as the allocations done by ShmemAlloc() are not
tracked by ShmemIndexEnt.

Issue introduced by 7949d95945.

Author: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/04b04387-92f5-476c-90b0-4064e71c5f37@iki.fi
Backpatch-through: 18
2026-04-07 11:59:49 +09:00
Amit Langote
5c54c3ed1b Fix deferred FK check batching introduced by commit b7b27eb41a
That commit introduced AfterTriggerIsActive() to detect whether
we are inside the after-trigger firing machinery, so that RI trigger
functions can take the batched fast path.  It was implemented using
query_depth >= 0, which correctly identified immediate trigger firing
but missed the deferred case where query_depth is -1 at COMMIT via
AfterTriggerFireDeferred().  This caused deferred FK checks to fall
back to the per-row fast path instead of the batched path.

The correct check is whether we are inside an after-trigger firing
loop specifically.  Introduce afterTriggerFiringDepth, a counter
incremented around the trigger-firing loops in AfterTriggerEndQuery,
AfterTriggerFireDeferred, and AfterTriggerSetState, and decremented
after FireAfterTriggerBatchCallbacks() returns.  AfterTriggerIsActive()
now returns afterTriggerFiringDepth > 0.

Reported-by: Chao Li <li.evan.chao@gmail.com>
Author: Chao Li <li.evan.chao@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/C2133B47-79CD-40FF-B088-02D20D654806@gmail.com
2026-04-07 10:45:59 +09:00
Melanie Plageman
dd78e69cfc Allocate separate DSM chunk for parallel Index[Only]Scan instrumentation
Previously, parallel index and index-only scans packed the parallel scan
descriptor and shared instrumentation (for EXPLAIN ANALYZE) into a
single DSM allocation. Since scans may be instrumented without being
parallel-aware, and vice versa, using separate DSM chunks -- each with
its own TOC key -- is cleaner. A future commit will extend this pattern
to other scan node types.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
2026-04-06 19:10:19 -04:00
Melanie Plageman
43222b8e53 Assert no duplicate keys in shm_toc_insert()
shm_toc_insert() silently accepts duplicate keys. Since shm_toc_lookup()
returns the first matching entry, any later entry with the same key
would be unreachable. Add an assertion to catch this.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
2026-04-06 18:41:47 -04:00
Nathan Bossart
87f61f0c82 Add pg_stat_autovacuum_scores system view.
This view contains one row for each table in the current database,
showing the current autovacuum scores for that specific table.  It
also shows whether autovacuum would vacuum or analyze the table.

Bumps catversion.

Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
2026-04-06 16:56:33 -05:00
Daniel Gustafsson
b3a37ffbc5 Use PG_DATA_CHECKSUM_OFF instead of hardcoded value
For a long time, the online checksums patchset kept the "off" state as
literal zero without a label to be consistent with the previous coding
which only had a label for the "on" state.  Later, when an "off" label
was made not all uses in the code got the memo.  Fix by setting these
to PG_DATA_CHECKSUM_OFF.

While there, fix a duplicate word in a comment introduced by the same
commit.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAJ7c6TPRTnQFXXX1CRcYoTLXw2swtDH==uSz1MYoMKdLrKZHjA@mail.gmail.com
2026-04-06 22:11:53 +02:00
Álvaro Herrera
28d534e2ae
Add CONCURRENTLY option to REPACK
When this flag is specified, REPACK no longer acquires access-exclusive
lock while the new copy of the table is being created; instead, it
creates the initial copy under share-update-exclusive lock only (same as
vacuum, etc), and it follows an MVCC snapshot; it sets up a replication
slot starting at that snapshot, and uses a concurrent background worker
to do logical decoding starting at the snapshot to populate a stash of
concurrent data changes.  Those changes can then be re-applied to the
new copy of the table just before swapping the relfilenodes.
Applications can continue to access the original copy of the table
normally until just before the swap, which is the only point at which
the access-exclusive lock is needed.

There are some loose ends in this commit:
1. concurrent repack needs its own replication slot in order to apply
   logical decoding, which are a scarce resource and easy to run out of.
2. due to the way the historic snapshot is initially set up, only one
   REPACK process can be running at any one time on the whole system.
3. there's a danger of deadlocking (and thus abort) due to the lock
   upgrade required at the final phase.

These issues will be addressed in upcoming commits.

The design and most of the code are by Antonin Houska, heavily based on
his own pg_squeeze third-party implementation.

Author: Antonin Houska <ah@cybertec.at>
Co-authored-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Discussion: https://postgr.es/m/5186.1706694913@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
2026-04-06 21:55:08 +02:00
Alexander Korotkov
834038c1f8 Avoid syscache lookup while building a WAIT FOR tuple descriptor
Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when building
the result tuple descriptor for the WAIT FOR command. This avoids a syscache
access that could re-establish a catalog snapshot after we've explicitly
released all snapshots before the wait.

Discussion: https://postgr.es/m/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
2026-04-06 22:47:26 +03:00
Nathan Bossart
775fe51daa Remove recheck_relation_needs_vacanalyze().
This function is a thin wrapper around relation_needs_vacanalyze()
that handles fetching and freeing the pgstat entry for the table.
Since all callers of relation_needs_vacanalyze() do that anyway, we
can teach that function to fetch/free the pgstat entry and use it
instead.

Suggested-by: Álvaro Herrera <alvherre@kurilemu.de>
Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
2026-04-06 14:30:52 -05:00
Tom Lane
d516974840 Support more object types within CREATE SCHEMA.
Having rejected the principle that we should know how to re-order
the sub-commands of CREATE SCHEMA, there is not really anything
except a little coding to stop us from supporting more object types.
This patch adds support for creating functions (including procedures
and aggregates), operators, types (including domains), collations,
and text search objects.

SQL:2021 specifies that we should allow functions, procedures,
types, domains, and collations, so this moves us a great deal
closer to full SQL compatibility of CREATE SCHEMA.  What remains
missing from their list are casts, transforms, roles, and some
object types we don't support yet (e.g. CREATE CHARACTER SET).
Supporting casts or transforms would be problematic because
they don't have names at all, let alone schema-qualified names,
so it'd be quite a stretch to say that they belong to a schema.
Roles likewise are not schema-qualified, plus they are global
to a cluster, making it even less reasonable to consider them
as belonging to a schema.  So I don't see us trying to complete
the list.

User-defined aggregates and operators are outside the spec's ken,
as are text search objects, so adding them does not do anything for
spec compatibility.  But they go along with these other object types,
plus it takes no additional code to support them since they are
represented as DefineStmts like some variants of CREATE TYPE.
It would indeed take some effort to reject them.

Author: Kirill Reshke <reshkekirill@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALdSSPh4jUSDsWu3K58hjO60wnTRR0DuO4CKRcwa8EVuOSfXxg@mail.gmail.com
2026-04-06 15:16:25 -04:00
Tom Lane
404db8f9ed Execute foreign key constraints in CREATE SCHEMA at the end.
The previous patch simplified CREATE SCHEMA's behavior to "execute all
subcommands in the order they are written".  However, that's a bit too
simple, as the spec clearly requires forward references in foreign key
constraint clauses to work, see feature F311-01.  (Most other SQL
implementations seem to read more into the spec than that, but it's
not clear that there's justification for more in the text, and this is
the only case that doesn't introduce unresolvable issues.)  We never
implemented that before, but let's do so now.

To fix it, transform FOREIGN KEY clauses into ALTER TABLE ... ADD
FOREIGN KEY commands and append them to the end of the CREATE SCHEMA's
subcommand list.  This works because the foreign key constraints are
independent and don't affect any other DDL that might be in CREATE
SCHEMA.  For simplicity, we do this for all FOREIGN KEY clauses even
if they would have worked where they were.

Author: Jian He <jian.universality@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
2026-04-06 15:16:25 -04:00
Tom Lane
a9c350d9ee Don't try to re-order the subcommands of CREATE SCHEMA.
transformCreateSchemaStmtElements has always believed that it is
supposed to re-order the subcommands of CREATE SCHEMA into a safe
execution order.  However, it is nowhere near being capable of doing
that correctly.  Nor is there reason to think that it ever will be,
or that that is a well-defined requirement.  (The SQL standard does
say that it should be possible to do foreign-key forward references
within CREATE SCHEMA, but it's not clear that the text requires
anything more than that.)  Moreover, the problem will get worse as
we add more subcommand types.  Let's just drop the whole idea and
execute the commands in the order given, which seems like a much
less astonishment-prone definition anyway.  The foreign-key issue
will be handled in a follow-up patch.

This will result in a release-note-worthy incompatibility,
which is that forward references like
	CREATE SCHEMA myschema
	    CREATE VIEW myview AS SELECT * FROM mytable
	    CREATE TABLE mytable (...);
used to work and no longer will.  Considering how many closely
related variants never worked, this isn't much of a loss.

Along the way, pass down a ParseState so that we can provide an
error cursor for "wrong schema name" and related errors, and fix
transformCreateSchemaStmtElements so that it doesn't scribble
on the parsetree passed to it.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
2026-04-06 15:16:25 -04:00
Masahiko Sawada
1ff3180ca0 Allow autovacuum to use parallel vacuum workers.
Previously, autovacuum always disabled parallel vacuum regardless of
the table's index count or configuration. This commit enables
autovacuum workers to use parallel index vacuuming and index cleanup,
using the same parallel vacuum infrastructure as manual VACUUM.

Two new configuration options control the feature. The GUC
autovacuum_max_parallel_workers sets the maximum number of parallel
workers a single autovacuum worker may launch; it defaults to 0,
preserving existing behavior unless explicitly enabled. The per-table
storage parameter autovacuum_parallel_workers provides per-table
limits. A value of 0 disables parallel vacuum for the table, a
positive value caps the worker count (still bounded by the GUC), and
-1 (the default) defers to the GUC.

To handle cases where autovacuum workers receive a SIGHUP and update
their cost-based vacuum delay parameters mid-operation, a new
propagation mechanism is added to vacuumparallel.c. The leader stores
its effective cost parameters in a DSM segment. Parallel vacuum
workers poll for changes in vacuum_delay_point(); if an update is
detected, they apply the new values locally via VacuumUpdateCosts().

A new test module, src/test/modules/test_autovacuum, is added to
verify that parallel autovacuum workers are correctly launched and
that cost-parameter updates are propagated as expected.

The patch was originally proposed by Maxim Orlov, but the
implementation has undergone significant architectural changes
since then during the review process.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: zengman <zengman@halodbtech.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
2026-04-06 11:48:29 -07:00
Álvaro Herrera
c0b53ec063
Rename cluster.c to repack.c (and corresponding .h)
CLUSTER is no longer the favored way to invoke this functionality, and
the code is about to shift its focus to the REPACK more ambitiously.
Rename the file to avoid leaving an unnecessary historical artifact
around.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202603271635.owyhm7btgoic@alvherre.pgsql
2026-04-06 20:11:01 +02:00
Tom Lane
21c69dc73f Disallow system columns in COPY FROM WHERE conditions.
These columns haven't been computed yet when the filtering happens
(since we've not written the candidate tuple into the table); so
any check on them is wrong or useless.  Worse, since aa606b931 such a
reference results in an access off the end of a TupleDesc, potentially
causing a phony "generated columns are not supported in COPY FROM
WHERE conditions" error; and since c98ad086a it throws an Assert
instead.

Actually we could allow tableoid, which has been set to the OID of the
table named as the COPY target.  However, plausible uses for tests of
tableoid would involve a partitioned target table, and the user would
wish it to read as the OID of the destination partition.  There has
been some discussion of changing things to make it work like that,
but pending that happening we should just disallow tableoid along
with other system columns.

It seems best though to install this prohibition only in HEAD.
In the back branches we'll just guard the unsafe TupleDesc access,
and people will keep getting whatever semantics they got before.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6f435023-8ab6-47c2-ba07-035d0c4212f9@gmail.com
2026-04-06 14:05:01 -04:00
Tom Lane
6582010c80 Fix null-bitmap combining in array_agg_array_combine().
This code missed the need to update the combined state's
nullbitmap if state1 already had a bitmap but state2 didn't.
We need to extend the existing bitmap with 1's but didn't.
This could result in wrong output from a parallelized
array_agg(anyarray) calculation, if the input has a mix of
null and non-null elements.  The errors depended on timing
of the parallel workers, and therefore would vary from one
run to another.

Also install guards against integer overflow when calculating
the combined object's sizes, and make some trivial cosmetic
improvements.

Author: Dmytro Astapov <dastapov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com
Backpatch-through: 16
2026-04-06 13:14:53 -04:00
Robert Haas
0442f1c9ef Add a guc_check_handler to the EXPLAIN extension mechanism.
It would be useful to be able to tell auto_explain to set a custom
EXPLAIN option, but it would be bad if it tried to do so and the
option name or value wasn't valid, because then every query would fail
with a complaint about the EXPLAIN option. So add a guc_check_handler
that auto_explain will be able to use to only try to set option
name/value/type combinations that have been determined to be legal,
and to emit useful messages about ones that aren't.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
2026-04-06 12:31:47 -04:00
Nathan Bossart
e3481edfd1 Remove autoanalyze corner case.
The restructuring in commit 53b8ca6881 revealed an interesting
corner case: if a table needs vacuuming for wraparound prevention
and autovacuum is disabled for it, we might still choose to analyze
it.  Research seems to indicate this was an accidental addition by
commit 48188e1621, and further discussion indicates there is
consensus that it is unnecessary and can be removed.

Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/adB9nSsm_S0D9708%40nathan
2026-04-06 11:28:46 -05:00
Robert Haas
e0e819cc08 Expose helper functions scan_quoted_identifier and scan_identifier.
Previously, this logic was embedded within SplitIdentifierString,
SplitDirectoriesString, and SplitGUCList. Factoring it out saves
a bit of duplicated code, and also makes it available to extensions
that might want to do similar things without necessarily wanting to
do exactly the same thing.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
2026-04-06 11:13:25 -04:00
Fujii Masao
93dc1ace20 Release postmaster working memory context in slotsync worker
Child processes do not need the postmaster's working memory context and
normally release it at the start of their main entry point. However,
the slotsync worker forgot to do so.

This commit makes the slotsync worker release the postmaster's working
memory context at startup, preventing unintended use.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHO05JaUpgKF8FBDmPdBUJsK22axRRcgmAUc2Jyi8OK8g@mail.gmail.com
2026-04-06 23:04:18 +09:00
Heikki Linnakangas
ed71d7356e Fix memory leaks introduced by commit 283e823f9d
When freeing pending_shmem_requests we should also free the ->options.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://www.postgresql.org/message-id/CAJ7c6TN9tp8MTc0WXM0zfSWqjfBqU8gpe+o5KqHB1-cQ7409Kw@mail.gmail.com
2026-04-06 15:46:03 +03:00
Heikki Linnakangas
2670a0fcc6 Fix compilation without injection points with some compilers
Some compilers didn't like the empty initializer when compiled without
USE_INJECTION_POINTS. Per buildfarm member 'drongo', using Visual
Studio 2019.

Author: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/adNHcBVJO5gIOp1l@paquier.xyz
2026-04-06 15:46:00 +03:00
Michael Paquier
404a17c155 Use single LWLock for lock statistics in pgstats
Previously, one LWLock was used for each lock type, adding complexity
without an observable performance benefit as data is gathered only for
paths involving lock waits, at least currently.  This commit replaces
the per-type set of LWLocks with a single LWLock protecting the stats
data of all the lock types, like the stats kinds for SLRU or WAL.  A
good chunk of the callbacks get simpler thanks to this change.

The previous approach also had one bug in the flush callback when nowait
was called with "true": a backend iterating over all entries could
successfully flush some entries while skipping others due to contention,
then unconditionally reset the pending data.  This would cause some
stats data loss.

Oversight in 4019f725f5.

Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/1af63e6d-16d5-4d5b-9b03-11472ef1adf9@vondra.me
2026-04-06 14:01:04 +09:00
Richard Guo
3a08a2a8b4 Fix volatile function evaluation in eager aggregation
Pushing aggregates containing volatile functions below a join can
violate volatility semantics by changing the number of times the
function is executed.

Here we check the Aggref nodes in the targetlist and havingQual for
volatile functions and disable eager aggregation when such functions
are present.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
2026-04-06 11:54:08 +09:00
Richard Guo
bd94845e8c Fix collation handling for grouping keys in eager aggregation
When determining if it is safe to use an expression as a grouping key
for partial aggregation, eager aggregation relies on the B-tree
equalimage support function to ensure that equality implies image
equality.

Previously, the code incorrectly passed the default collation of the
expression's data type to the equalimage procedure, rather than the
expression's actual collation.  As a result, if a column used a
non-deterministic collation but the base type's default collation was
deterministic, eager aggregation would incorrectly assume that the
column was safe for byte-level grouping.  This could cause rows to be
prematurely grouped and subsequently discarded by strict join
conditions, resulting in incorrect query results.

This patch fixes the issue by passing the expression's actual
collation to the equalimage procedure.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
2026-04-06 11:52:33 +09:00
Fujii Masao
a8f45dee91 Add wal_sender_shutdown_timeout GUC to limit shutdown wait for replication
Previously, during shutdown, walsenders always waited until all pending data
was replicated to receivers. This ensures sender and receiver stay in sync
after shutdown, which is important for physical replication switchovers,
but it can significantly delay shutdown. For example, in logical replication,
if apply workers are blocked on locks, walsenders may wait until those locks
are released, preventing shutdown from completing for a long time.

This commit introduces a new GUC, wal_sender_shutdown_timeout,
which specifies the maximum time a walsender waits during shutdown for all
pending data to be replicated. When set, shutdown completes once all data is
replicated or the timeout expires. A value of -1 (the default) disables
the timeout.

This can reduce shutdown time when replication is slow or stalled. However,
if the timeout is reached, the sender and receiver may be left out of sync,
which can be problematic for physical replication switchovers.

Author: Andrey Silitskiy <a.silitskiy@postgrespro.ru>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Ronan Dunklau <ronan@dunklau.fr>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com
2026-04-06 11:35:03 +09:00
Daniel Gustafsson
07009121c2 Test stabilization for online checksums
Postcommit review and buildfarm/CI failures revealed a few issues in
the test code which this commit attempts to resolve.  These failures
are verified using synthetic means.

  * Wait for launcher exit in enable/disable checksum tests

    When enabling or disabling data checksums in a test with waiting
    for an end state (on or off), the test typically want to perform
    more test against the cluster immediately. Make sure to wait for
    the launcher to exit in these cases before returning in order to
    know it can immediately be acted on.  This is a more generic way
    of implementating 0036232ba8.

  * Refactor injection point tests to use the injection_points test
    extension. Two injection points added for online checksums were
    better expressed using the injection_points extension with the
    test code embedded in datachecksum_state.c.

  * Make tests less timing dependent and allow transitions to "on"
    and not just "inprogress-on" in case a test manages to finish
    before it's checked for state.

  * When waiting on a blocking background psql keeping a temporary
    table open, the test first closed the background session abd
    then the server.  This could cause data checksums to manage to
    get enabled in the brief window between dropping the temporary
    table and closing the server.  Fix by closing the server first
    before the background session.

  * Remove a few superfluous duplicate checks and general cleanup
    of comments as well as making LSN logging consistent.

These issues were reported by Andres as well as spotted in the
buildfarm and on CI.

Author: Daniel Gustafsson <daniel@yesql.se>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/92F25C14-801E-4198-994D-D83E31FEB0D8@yesql.se
2026-04-06 02:03:10 +02:00
Daniel Gustafsson
d771b0a907 Handle checksumworker startup wait race
If the background worker for processing databases manages to finish
before the launcher starts waiting for it, the launcher would treat
it erroneously as an error.  Fix by ensureing to check result state
in this case.  Identified on CI and synthetically reproduced during
local testing.

Also while, make sure to properly lock the shared memory structure
before updating tje result state.

Author: Daniel Gustafsson <daniel@yesql.seA
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/4fxw37ge47v5baeozla5phymi233hxbcjbwwsfwv3mpg3kyl2z@6jk4nkf6jp4
2026-04-06 01:55:06 +02:00
Michael Paquier
557a9f1e3e Add tests for lock statistics, take two
Commit 7c64d56fd9 has removed the isolation test providing coverage
for lock statistics due to some instability in the CI, where the
deadlock timeout may not have enough time to process, preventing the
stats data to be updated.  These also relied on a set of hardcoded
sleeps.

This commit switches the test suite to TAP, instead, that uses an
injection point with a wait to avoid the sleeps.  The injection point is
added in ProcSleep(), once we know that the deadlock timeout has fired
and that the stats have been updated.

Multiple lock patterns are checked, all rely on the same workflow, with
two sessions:
- session 1 holds a given lock type.
- session 2 attaches to the new injection point with the wait action.
- session 2 attempts to acquire a lock conflicting with the lock of
session 1, waiting for the injection point to be reached.
- session 1 releases its lock, session 2 commits.
- pg_stat_lock is polled until the counters are updated for the lock
type.

Bertrand's version of the patch introduced a new routine to
BackgroundPsql() to detect the blocked background sessions.  I have
tweaked the test so as we use the same method as some of the other tests
instead, based on some \echo commands.  This test has been run multiple
times in the CI, all passing, so I'd like to think that this is more
stable than the first version attempted.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/acNTR1lLHwQJ0o+P@ip-10-97-1-34.eu-west-3.compute.internal
2026-04-06 08:51:30 +09:00
Heikki Linnakangas
9b5acad3f4 Convert all remaining subsystems to use the new shmem allocation API
This removes all remaining uses of ShmemInitStruct() and
ShmemInitHash() from built-in code.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:13:10 +03:00
Heikki Linnakangas
a4b6139dcc Convert buffer manager to use the new shmem allocation functions
This rectifies the initialization functions a little, making the
"buffer strategy" stuff in freelist.c and buffer mapping hash table in
buf_init.c top-level "subsystems" of their own, registered directly in
subsystemlist.h. Previously they were called indirectly from
BufferManagerShmemInit() and BufferManagerShmemSize()

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:13:08 +03:00
Heikki Linnakangas
dacfe81a0d Add alignment option to ShmemRequestStruct()
The buffer blocks, converted to use ShmemRequestStruct() in the next
commit, are IO-aligned. This might come handy in other places too, so
make it an explicit feature of ShmemRequestStruct().

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:13:06 +03:00
Heikki Linnakangas
58a1573385 Convert AIO to use the new shmem allocation functions
This replaces the "shmem_size" and "shmem_init" callbacks in the IO
methods table with the same ShmemCallback struct that we now use in
other subsystems

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:13:04 +03:00
Heikki Linnakangas
2e0943a859 Convert SLRUs to use the new shmem allocation functions
I replaced the old SimpleLruInit() function without a backwards
compatibility wrapper, because few extensions define their own SLRUs.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:13:02 +03:00
Heikki Linnakangas
4c9eca5afe Refactor shmem initialization code in predicate.c
This is in preparation to convert it to use the new shmem allocation
functions, making the next commit that does that smaller. This inlines
SerialInit() to the caller, and moves all the initialization steps
within PredicateLockShmemInit() to happen after all the
ShmemInit{Struct|Hash}() calls.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:13:01 +03:00
Heikki Linnakangas
c6d55714ba Use the new shmem allocation functions in a few core subsystems
These subsystems have some complicating properties, making them
slightly harder to convert than most:

- The initialization callbacks of some of these subsystems have
  dependencies, i.e. they need to be initialized in the right order.

- The ProcGlobal pointer still needs to be inherited by the
  BackendParameters mechanism on EXEC_BACKEND builds, because
  ProcGlobal is required by InitProcess() to get a PGPROC entry, and
  the PGPROC entry is required to use LWLocks, and usually attaching
  to shared memory areas requires the use of LWLocks.

- Similarly, ProcSignal pointer still needs to be handled by
  BackendParameters, because query cancellation connections access it
  without calling InitProcess

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:12:59 +03:00
Heikki Linnakangas
a006bc7b16 Convert lwlock.c to use the new shmem allocation functions
It seems like a good candidate to convert first because it needs to
initialized before any other subsystem, but other than that it's
nothing special.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:12:57 +03:00
Heikki Linnakangas
1fc2e9fbc0 Introduce a registry of built-in shmem subsystems
To add a new built-in subsystem, add it to subsystemslist.h. That
hooks up its shmem callbacks so that they get called at the right
times during postmaster startup. For now this is unused, but will
replace the current SubsystemShmemSize() and SubsystemShmemInit()
calls in the next commits.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:12:55 +03:00
Heikki Linnakangas
283e823f9d Introduce a new mechanism for registering shared memory areas
This replaces the [Subsystem]ShmemSize() and [Subsystem]ShmemInit()
functions called at postmaster startup with a new set of callbacks.
The new mechanism is designed to be more ergonomic. Notably, the size
of each shmem area is specified in the same ShmemRequestStruct() call,
together with its name. The same mechanism is used in extensions,
replacing the shmem_{request/startup}_hooks.

ShmemInitStruct() and ShmemInitHash() become backwards-compatibility
wrappers around the new functions. In future commits, I will replace
all ShmemInitStruct() and ShmemInitHash() calls with the new
functions, although we'll still need to keep them around for
extensions.

Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:12:50 +03:00
Heikki Linnakangas
6ef9bee293 Move some code from shmem.c and shmem.h
A little refactoring in preparation for the next commit, to make the
material changes in that commit more clear.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-04-06 02:12:48 +03:00
Andres Freund
5a79e78501 instrumentation: Separate per-node logic from other uses
Previously, different places (e.g. query "total time") were repurposing the
Instrumentation struct initially introduced for capturing per-node statistics
during execution. This overuse of the same struct is confusing, e.g. by
cluttering calls of InstrStartNode/InstrStopNode in unrelated code paths, and
prevents future refactorings.

Instead, simplify the Instrumentation struct to only track time and WAL/buffer
usage. Similarly, drop the use of InstrEndLoop outside of per-node
instrumentation - these calls were added without any apparent benefit since
the relevant fields were never read.

Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.

In passing, clarify that InstrAggNode is expected to only run after
InstrEndLoop (as it does in practice), and drop unused code.

This also fixes a consequence-less bug: Previously ->async_mode was only set
when a non-zero instrument_option was passed. That turns out to be harmless
right now, as ->async_mode only affects a timing related field.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
2026-04-05 19:04:24 -04:00
Andres Freund
7d9b74df53 instrumentation: Separate trigger logic from other uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.

In passing, drop the "n" argument to InstrAlloc, as all remaining callers need
exactly one Instrumentation struct.  The duplication between InstrAlloc() and
InstrInit(), as well as the conditional initialization of async_mode will be
addressed in a subsequent commit.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
2026-04-05 16:56:50 -04:00
Andres Freund
6c7bce28c8 Fixups for a4f774cf1c
The database name was warned about when building with
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS, leading to BF and CI failures.

It is somewhat confusing that the required prefix is different for databases
than other object types.

Also fix a pgindent violation that caused koel to start to fail.

Discussion: https://postgr.es/m/ptyiexyhmtxf4lm524s7o7w64r26ra237uusv4tjav4yhpmeoo@vfwwllz7tivb
2026-04-05 15:36:34 -04:00
Andres Freund
df6949ccf7 Add tid_block() and tid_offset() accessor functions
The two new functions allow to extract the block number and offset from a tid.

There are existing ways to do so (e.g. by doing (ctid::text::point)[0]), but
they are hard to remember and not pretty.

tid_block() returns int8 (bigint) because BlockNumber is uint32, which exceeds
the range of int4. tid_offset() returns int4 (integer) because OffsetNumber is
uint16, which fits safely in int4.

Bumps catversion.

Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Discussion: https://postgr.es/m/CAJTYsWUzok2+mvSYkbVUwq_SWWg-GdHqCuYumN82AU97SjwjCA@mail.gmail.com
2026-04-05 15:17:05 -04:00
Heikki Linnakangas
f10b6be258 Check that the tranche name is unique in RequestNamedLWLockTranche
You could request two tranches with same name, but things would get
confusing when you called GetNamedLWLockTranche() to get the LWLocks
allocated for them; it would always return the first tranche with the
name. That doesn't make sense, so forbid duplicates.

We still allow duplicates with LWLockNewTrancheId(). That works better
as you don't use the name to look up the tranche ID later. It's still
confusing in wait events, for example, but it's not dangerous in the
same way.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/463a28db-0c0b-4af6-bac6-3891828bbbfe@iki.fi
2026-04-05 21:05:20 +03:00
Andrew Dunstan
a4f774cf1c Add pg_get_database_ddl() function
Add a new SQL-callable function that returns the DDL statements needed
to recreate a database. It takes a regdatabase argument and an optional
VARIADIC text argument for options that are specified as alternating
name/value pairs. The following options are supported: pretty (boolean)
for formatted output, owner (boolean) to include OWNER and tablespace
(boolean) to include TABLESPACE. The return is one or multiple rows
where the first row is a CREATE DATABASE statement and subsequent rows are
ALTER DATABASE statements to set some database properties.

The caller must have CONNECT privilege on the target database.

Author: Akshay Joshi <akshay.joshi@enterprisedb.com>
Co-authored-by: Andrew Dunstan <andrew@dunslane.net>
Co-authored-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Quan Zongliang <quanzongliang@yeah.net>
Discussion: https://postgr.es/m/CANxoLDc6FHBYJvcgOnZyS+jF0NUo3Lq_83-rttBuJgs9id_UDg@mail.gmail.com
Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
2026-04-05 10:54:54 -04:00
Andrew Dunstan
b99fd9fd7f Add pg_get_tablespace_ddl() function
Add a new SQL-callable function that returns the DDL statements needed
to recreate a tablespace. It takes a tablespace name or OID and an
optional VARIADIC text argument for options that are specified as
alternating name/value pairs. The following options are supported: pretty
(boolean) for formatted output and owner (boolean) to include OWNER.
(It includes two variants because there is no regtablespace pseudotype.)
The return is one or multiple rows where the first row is a CREATE
TABLESPACE statement and subsequent rows are ALTER TABLESPACE statements
to set some tablespace properties.

The caller must have SELECT privilege on pg_tablespace.

get_reloptions() in ruleutils.c is made non-static so it can be called
from the new ddlutils.c file.

Author: Nishant Sharma <nishant.sharma@enterprisedb.com>
Author: Manni Wood <manni.wood@enterprisedb.com>
Co-authored-by: Andrew Dunstan <andrew@dunslane.net>
Co-authored-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com
Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
2026-04-05 10:54:54 -04:00
Andrew Dunstan
76e514ebb4 Add pg_get_role_ddl() function
Add a new SQL-callable function that returns the DDL statements needed
to recreate a role. It takes a regrole argument and an optional VARIADIC
text argument for options that are specified as alternating name/value
pairs. The following options are supported: pretty (boolean) for
formatted output and memberships (boolean) to include GRANT statements
for role memberships and membership options. The return is one or
multiple rows where the first row is a CREATE ROLE statement and
subsequent rows are ALTER ROLE statements to set some role properties.
Password information is never included in the output.

The caller must have SELECT privilege on pg_authid.

Author: Mario Gonzalez <gonzalemario@gmail.com>
Author: Bryan Green <dbryan.green@gmail.com>
Co-authored-by: Andrew Dunstan <andrew@dunslane.net>
Co-authored-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Quan Zongliang <quanzongliang@yeah.net>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/4c5f895e-3281-48f8-b943-9228b7da6471@gmail.com
Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
2026-04-05 10:54:54 -04:00
Andrew Dunstan
4881981f92 Add infrastructure for pg_get_*_ddl functions
Add parse_ddl_options(), append_ddl_option(), and append_guc_value()
helper functions in a new ddlutils.c file that provide common option
parsing and output formatting for the pg_get_*_ddl family of functions
which will follow in later patches.  These accept VARIADIC text
arguments as alternating name/value pairs.

Callers declare an array of DdlOption descriptors specifying the
accepted option names and their types (boolean, text, or integer).
parse_ddl_options() matches each supplied pair against the array,
validates the value, and fills in the result fields.  This
descriptor-based scheme is based on an idea from Euler Taveira.

This is placed in a new ddlutils.c file which will contain the
pg_get_*_ddl functions.

Author: Akshay Joshi <akshay.joshi@enterprisedb.com>
Co-authored-by: Andrew Dunstan <andrew@dunslane.net>
Co-authored-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com
Discussion: https://postgr.es/m/4c5f895e-3281-48f8-b943-9228b7da6471@gmail.com
Discussion: https://postgr.es/m/CANxoLDc6FHBYJvcgOnZyS+jF0NUo3Lq_83-rttBuJgs9id_UDg@mail.gmail.com
Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
2026-04-05 10:54:54 -04:00
Álvaro Herrera
caec9d9fad
Allow index_create to suppress index_build progress reporting
A future REPACK patch wants a way to suppress index_build doing its
progress reports when building an index, because that would interfere
with repack's own reporting; so add an INDEX_CREATE_SUPPRESS_PROGRESS
bit that enables this.

Furthermore, change the index_create_copy() API so that it takes flag
bits for index_create() and passes them unchanged.  This gives its
callers more direct control, which eases the interface -- now its
callers can pass the INDEX_CREATE_SUPPRESS_PROGRESS bit directly.  We
use it for the current caller in REINDEX CONCURRENTLY, since it's also
not interested in progress reporting, since it doesn't want
index_build() to be called at all in the first place.

One thing to keep in mind, pointed out by Mihail, is that we're not
suppressing the index-AM-specific progress report updates which happen
during ambuild().  At present this is not a problem, because the values
updated by those don't overlap with those used by commands other than
CREATE INDEX; but maybe in the future we'll want the ability to suppress
them also.  (Alternatively we might want to display how each
index-build-subcommand progresses during REPACK and others.)

Author: Antonin Houska <ah@cybertec.at>
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Discussion: https://postgr.es/m/102906.1773668762@localhost
2026-04-05 13:34:08 +02:00
Etsuro Fujita
de28140ded postgres_fdw: Inherit the local transaction's access/deferrable modes.
READ ONLY transactions should prevent modifications to foreign data as
well as local data, but postgres_fdw transactions declared as READ ONLY
that reference foreign tables mapped to a remote view executing volatile
functions would modify data on remote servers, as it would open remote
transactions in READ WRITE mode.

Similarly, DEFERRABLE transactions should not abort due to a
serialization failure even when accessing foreign data, but postgres_fdw
transactions declared as DEFERRABLE would abort due to that failure in a
remote server, as it would open remote transactions in NOT DEFERRABLE
mode.

To fix, modify postgres_fdw to open remote transactions in the same
access/deferrable modes as the local transaction.  This commit also
modifies it to open remote subtransactions in the same access mode as
the local subtransaction.

This commit changes the behavior of READ ONLY/DEFERRABLE transactions
using postgres_fdw; in particular, it doesn't allow the READ ONLY
transactions to modify data on remote servers anymore, so such
transactions should be redeclared as READ WRITE or rewritten using other
tools like dblink.  The release notes should note this as an
incompatibility.

These issues exist since the introduction of postgres_fdw, but to avoid
the incompatibility in the back branches, fix them in master only.

Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com
Discussion: https://postgr.es/m/E1uLe9X-000zsY-2g%40gemulon.postgresql.org
2026-04-05 18:55:00 +09:00
Thomas Munro
fc44f10665 aio: Simplify pgaio_worker_submit().
Merge pgaio_worker_submit_internal() and pgaio_worker_submit().  The
separation didn't serve any purpose.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
2026-04-05 18:07:21 +12:00
Andres Freund
f63ca33790 read_stream: Only increase read-ahead distance when waiting for IO
This avoids increasing the distance to the maximum in cases where the I/O
subsystem is already keeping up. This turns out to be important for
performance for two reasons:

- Pinning a lot of buffers is not cheap. If additional pins allow us to avoid
  IO waits, it's definitely worth it, but if we can already do all the
  necessary readahead at a distance of 16, reading ahead 512 buffers can
  increase the CPU overhead substantially.  This is particularly noticeable
  when the to-be-read blocks are already in the kernel page cache.

- If the read stream is read to completion, reading in data earlier than
  needed is of limited consequences, leaving aside the CPU costs mentioned
  above. But if the read stream will not be fully consumed, e.g. because it is
  on the inner side of a nested loop join, the additional IO can be a serious
  performance issue. This is not that commonly a problem for current read
  stream users, but the upcoming work, to use a read stream to fetch table
  pages as part of an index scan, frequently encounters this.

Note that this commit would have substantial performance downsides without
earlier commits:

- Commit 6e36930f9a, which avoids decreasing the readahead distance when
  there was recent IO, is crucial, as otherwise we very often would end up not
  reading ahead aggressively enough anymore with this commit, due to
  increasing the distance less often.

- "read stream: Split decision about look ahead for AIO and combining" is
  important as we would otherwise not perform IO combining when the IO
  subsystem can keep up.

- "aio: io_uring: Trigger async processing for large IOs" is important to
  continue to benefit from memory copy parallelism when using fewer IOs.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
2026-04-05 00:43:54 -04:00
Andres Freund
8ca147d582 read stream: Split decision about look ahead for AIO and combining
In a subsequent commit the read-ahead distance will only be increased when
waiting for IO. Without further work that would cause a regression: As IO
combining and read-ahead are currently controlled by the same mechanism, we
would end up not allowing IO combining when never needing to wait for IO (as
the distance ends up too small to allow for full sized IOs), which can
increase CPU overhead. A typical reason to not have to wait for IO completion
at a low look-ahead distance is use of io_uring with the to-be-read data in
the page cache. But even with worker the IO submission rate may be low enough
for the worker to keep up.

One might think that we could just always perform IO combining, but doing so
at the start of a scan can cause performance regressions:

1) Performing a large IO commonly has a higher latency than smaller IOs. That
   is not a problem once reading ahead far enough, but at the start of a stream
   it can lead to longer waits for IO completion.

2) Sometimes read streams will not be read to completion. Immediately starting
   with full sized IOs leads to more wasted effort. This is not commonly an
   issue with existing read stream users, but the upcoming use of read streams
   to fetch table pages as part of an index scan frequently encounters this.

Solve this issue by splitting ReadStream->distance into ->combine_distance and
->readahead_distance. Right now they are increased/decreased at the same time,
but that will change in the next commit.

One of the comments in read_stream_should_look_ahead() refers to a motivation
that only really exists as of the next commit, but without it the code doesn't
make sense on its own.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
2026-04-05 00:43:54 -04:00
Andres Freund
434dab76ba read_stream: Move logic about IO combining & issuing to helpers
The long if statements were hard to read and hard to document. Splitting them
into inline helpers makes it much easier to explain each part separately.

This is done in preparation for making the logic more complicated...

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
2026-04-05 00:43:54 -04:00
Andres Freund
a9ee668817 aio: io_uring: Trigger async processing for large IOs
io_method=io_uring has a heuristic to trigger asynchronous processing of IOs
once the IO depth is a bit larger. That heuristic is important when doing
buffered IO from the kernel page cache, to allow parallelizing of the memory
copy, as otherwise io_method=io_uring would be a lot slower than
io_method=worker in that case.

An upcoming commit will make read_stream.c only increase the read-ahead
distance if we needed to wait for IO to complete. If to-be-read data is in the
kernel page cache, io_uring will synchronously execute IO, unless the IO is
flagged as async.  Therefore the aforementioned change in read_stream.c
heuristic would lead to a substantial performance regression with io_uring
when data is in the page cache, as we would never reach a deep enough queue to
actually trigger the existing heuristic.

Parallelizing the copy from the page cache is mainly important when doing a
lot of IO, which commonly is only possible when doing largely sequential IO.

The reason we don't just mark all io_uring IOs as asynchronous is that the
dispatch to a kernel thread has overhead. This overhead is mostly noticeable
with small random IOs with a low queue depth, as in that case the gain from
parallelizing the memory copy is small and the latency cost high.

The facts from the two prior paragraphs show a way out: Use the size of the IO
in addition to the depth of the queue to trigger asynchronous processing.

One might think that just using the IO size might be enough, but
experimentation has shown that not to be the case - with deep look-ahead
distances being able to parallelize the memory copy is important even with
smaller IOs.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
2026-04-05 00:43:54 -04:00
Álvaro Herrera
33bf7318f9
Make index_concurrently_create_copy more general
Also rename it to index_create_copy.  Add a 'boolean concurrent' option,
and make it work for both cases: in concurrent mode, just create the
catalog entries; caller is responsible for the actual building later.
In non-concurrent mode, the index is built right away.

This allows it to be reused for other purposes -- specifically, for
concurrent REPACK.

(With the CONCURRENTLY option, REPACK cannot simply swap the heap file and
rebuild its indexes.  Instead, it needs to build a separate set of
indexes, including their system catalog entries, *before* the actual
swap, to reduce the time AccessExclusiveLock needs to be held for.  This
approach is different from what CREATE INDEX CONCURRENTLY does.)

Per a suggestion from Mihail Nikalayeu.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/41104.1754922120@localhost
2026-04-04 20:38:26 +02:00
Peter Geoghegan
2d3490dd99 heapam: Keep buffer pins across index scan resets.
Avoid dropping the heap page pin (xs_cbuf) and visibility map pin
(xs_vmbuffer) within heapam_index_fetch_reset.  Retaining these pins
saves cycles during certain nested loop joins and merge joins that
frequently restore a saved mark: cases where the next tuple fetched
after a reset often falls on the same heap page will now avoid the cost
of repeated pinning and unpinning.

Avoiding dropping the scan's heap page buffer pin is preparation for an
upcoming patch that will add I/O prefetching to index scans.  Testing of
that patch (which makes heapam tend to pin more buffers concurrently
than was typical before now) shows that the aforementioned cases get a
small but clearly measurable benefit from this optimization.

Upcoming work to add a slot-based table AM interface for index scans
(which is further preparation for prefetching) will move VM checks for
index-only scans out of the executor and into heapam.  That will expand
the role of xs_vmbuffer to include VM lookups for index-only scans (the
field won't just be used for setting pages all-visible during on-access
pruning via the enhancement recently introduced by commit b46e1e54).
Avoiding dropping the xs_vmbuffer pin will preserve the historical
behavior of nodeIndexonlyscan.c, which always kept this pin on a rescan;
that aspect of this commit isn't really new.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
2026-04-04 13:49:37 -04:00
Peter Geoghegan
c7d09595e4 heapam: Track heap block in IndexFetchHeapData.
Add an explicit BlockNumber field (xs_blk) to IndexFetchHeapData that
tracks which heap block is currently pinned in xs_cbuf.

heapam_index_fetch_tuple now uses xs_blk to determine when buffer
switching is needed, replacing the previous approach that compared
buffer identities via ReleaseAndReadBuffer on every non-HOT-chain call.

This is preparatory work for an upcoming commit that will add index
prefetching using a read stream.  Delegating the release of a currently
pinned buffer to ReleaseAndReadBuffer won't work anymore -- at least not
when the next buffer that the scan needs to pin is one returned by
read_stream_next_buffer (not a buffer returned by ReadBuffer).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
2026-04-04 11:45:33 -04:00
Peter Geoghegan
a29fdd6c8d Move heapam_handler.c index scan code to new file.
Move the heapam index fetch callbacks (index_fetch_begin,
index_fetch_reset, index_fetch_end, and index_fetch_tuple) into a new
dedicated file.  Also move heap_hot_search_buffer over.  This is a
purely mechanical move with no functional impact.

Upcoming work to add a slot-based table AM interface for index scans
will substantially expand this code.  Keeping it in heapam_handler.c
would clutter a file whose primary role is to wire up the TableAmRoutine
callbacks.  Bitmap heap scans and sequential scans would benefit from
similar separation in the future.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh
2026-04-04 11:30:41 -04:00
Peter Geoghegan
1adff1a0c5 Rename heapam_index_fetch_tuple argument for clarity.
Rename heapam_index_fetch_tuple's call_again argument to heap_continue,
for consistency with the pointed-to variable name (IndexScanDescData's
xs_heap_continue field).

Preparation for an upcoming commit that will move index scan related
heapam functions into their own file.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/bmbrkiyjxoal6o5xadzv5bveoynrt3x37wqch7w3jnwumkq2yo@b4zmtnrfs4mh
2026-04-04 11:30:05 -04:00
John Naylor
5e13b0f240 Use AVX2 for calculating page checksums where available
We already rely on autovectorization for computing page checksums,
but on x86 we can get a further several-fold performance increase by
annotating pg_checksum_block() with a function target attribute for
the AVX2 instruction set extension. Not only does that use 256-bit
registers, it can also use vector multiplication rather than the
vector shifts and adds used in SSE2.

Similar to other hardware-specific paths, we set a function pointer
on first use. We don't bother to avoid this on platforms without AVX2
since the overhead of indirect calls doesn't matter for multi-kilobyte
inputs. However, we do arrange so that only core has the function
pointer mechanism. External programs will continue to build a normal
static function and don't need to be aware of this.

This matters most when using io_uring since in that case the checksum
computation is not done in parallel by IO workers.

Co-authored-by: Matthew Sterrett <matthewsterrett2@gmail.com>
Co-authored-by: Andrew Kim <andrew.kim@intel.com>
Reviewed-by: Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru>
Tested-by: Ants Aasma <ants.aasma@cybertec.at>
Tested-by: Stepan Neretin <slpmcf@gmail.com> (earlier version)
Discussion: https://postgr.es/m/CA+vA85_5GTu+HHniSbvvP+8k3=xZO=WE84NPwiKyxztqvpfZ3Q@mail.gmail.com
Discussion: https://postgr.es/m/20250911054220.3784-1-root%40ip-172-31-36-228.ec2.internal
2026-04-04 18:07:15 +07:00
Heikki Linnakangas
c06443063f Add missing shmem size estimate for fast-path locking struct
It's been missing ever since fast-path locking was introduced. It's a
small discrepancy, about 4 kB, but let's be tidy. This doesn't seem
worth backpatching, however; in stable branches we were less precise
about the estimates and e.g. added a 10% margin to the hash table
estimates, which is usually much bigger than this discrepancy.
2026-04-04 11:46:11 +03:00
Heikki Linnakangas
4953a25b7f Remove HASH_DIRSIZE, always use the default algorithm to select it
It's not very useful to specify a non-standard directory size. The
HASH_DIRSIZE option was only used for shared memory hash tables, and
those always used hash_select_dirsize() to choose the size, which in
turn just uses the default algorithm anyway. That assumption was
ingrained in hash_estimate_size(), too.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-04 02:40:28 +03:00
Heikki Linnakangas
9fe9ecd516 Allocate all parts of shmem hash table from a single contiguous area
Previously, the shared header (HASHHDR) and the directory were
allocated by the caller, and passed to hash_create(), while the actual
elements were allocated separately with ShmemAlloc(). After this
commit, all the memory needed by the header, the directory, and all
the elements is allocated using a single ShmemInitStruct() call, and
the different parts are carved out of that allocation. This way the
ShmemIndex entries (and thus pg_shmem_allocations) reflect the size of
the whole hash table, rather than just the directories.

Commit f5930f9a98 attempted this earlier, but it had to be reverted.
The new strategy is to let dynahash.c perform all the allocations with
the alloc function, but have the alloc function carve out the parts
from the one larger allocation. The shared header and the directory
are now also allocated with alloc calls, instead of passing the area
for those directly from the caller.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-04 02:40:25 +03:00
Heikki Linnakangas
999e9ebb51 Prevent shared memory hash tables from growing beyond initial size
Set HASH_FIXED_SIZE on all shared memory hash tables, to prevent them
from growing after the initial allocation. It was always weirdly
indeterministic that if one hash table used up all the unused shared
memory, you could not use that space for other things anymore until
restart. We just got rid of that behavior for the LOCK and PROCLOCK
tables, but it's similarly weird for all other hash tables.

Increase SHMEM_INDEX_SIZE because we were already above the max size,
on that one, and it's now a hard limit.

Some callers of ShmemInitHash() still pass HASH_FIXED_SIZE, but that's
now unnecessary. They should perhaps now be removed, but it doesn't do
any harm either to pass it.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-04 02:40:24 +03:00
Heikki Linnakangas
9ebe1c4f2c Merge init and max size options on shmem hash tables
Replace the separate init and max size options with a single size
option. We didn't make much use of the feature, all callers except the
ones in wait_event.c already used the same size for both, and the hash
tables in wait_event.c are small so there's little harm in just
allocating them to the max size.

The only reason why you might want to not reserve the max size upfront
is to make the memory available for other hash tables to grow beyond
their max size. Letting hash tables grow much beyond their max size is
bad for performance, however, because we cannot resize the directory,
and we never had very much "wiggle room" to grow to anyway so you
couldn't really rely on it. We recently marked the LOCK and PROCLOCK
tables with HAS_FIXED_SIZE, so there's nothing left in core that would
benefit from more unallocated shared memory.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-04 02:40:20 +03:00
Jacob Champion
d438a36591 oauth: Let validators provide failure DETAILs
At the moment, the only way for a validator module to report error
details on failure is to log them separately before returning from
validate_cb. Independently of that problem, the ereport() calls that we
make during validation failure partially duplicate some of the work of
auth_failed().

The end result is overly verbose and confusing for readers of the logs:

    [768233] LOG:  [my_validator] bad signature in bearer token
    [768233] LOG:  OAuth bearer authentication failed for user "jacob"
    [768233] DETAIL:  Validator failed to authorize the provided token.
    [768233] FATAL:  OAuth bearer authentication failed for user "jacob"
    [768233] DETAIL:  Connection matched file ".../pg_hba.conf" line ...

Solve both problems by making use of the existing logdetail pointer
that's provided by ClientAuthentication. Validator modules may set
ValidatorModuleResult->error_detail to override our default generic
message.

The end result looks something like

    [242284] FATAL:  OAuth bearer authentication failed for user "jacob"
    [242284] DETAIL:  [my_validator] bad signature in bearer token
        Connection matched file ".../pg_hba.conf" line ...

Reported-by: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/202601241015.y5uvxd7oxnfs%40alvherre.pgsql
2026-04-03 16:05:33 -07:00
Nathan Bossart
01876ace13 Add elevel parameter to relation_needs_vacanalyze().
This will be used in a follow-up commit to avoid emitting debug
logs from this function.

Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
2026-04-03 17:04:28 -05:00
Nathan Bossart
53b8ca6881 Teach relation_needs_vacanalyze() to always compute scores.
Presently, this function only computes component scores when the
corresponding threshold is reached.  A follow-up commit will add a
view that shows tables' autovacuum scores, and we anticipate that
users will want to use this view to discover tables that are
nearing autovacuum eligibility.  This commit teaches this function
to always compute autovacuum scores, even when a threshold has not
been reached or autovacuum is disabled.

The restructuring in this commit revealed an interesting edge case.
If the table needs vacuuming for wraparound prevention and
autovacuum is disabled for it, we might still choose to analyze it.
It's not clear if this is intentional, but it has been this way for
nearly 20 years, so it seems best to avoid changing it without
further discussion.

Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
2026-04-03 16:44:41 -05:00
Daniel Gustafsson
f19c0eccae Online enabling and disabling of data checksums
This allows data checksums to be enabled, or disabled, in a running
cluster without restricting access to the cluster during processing.

Data checksums could prior to this only be enabled during initdb or
when the cluster is offline using the pg_checksums app. This commit
introduce functionality to enable, or disable, data checksums while
the cluster is running regardless of how it was initialized.

A background worker launcher process is responsible for launching a
dynamic per-database background worker which will mark all buffers
dirty for all relation with storage in order for them to have data
checksums calculated on write.  Once all relations in all databases
have been processed, the data_checksums state will be set to on and
the cluster will at that point be identical to one which had data
checksums enabled during initialization or via offline processing.

When data checksums are being enabled, concurrent I/O operations
from backends other than the data checksums worker will write the
checksums but not verify them on reading.  Only when all backends
have absorbed the procsignalbarrier for setting data_checksums to
on will they also start verifying checksums on reading.  The same
process is repeated during disabling; all backends write checksums
but do not verify them until the barrier for setting the state to
off has been absorbed by all.  This in-progress state is used to
ensure there are no false negatives (or positives) due to reading
a checksum which is not in sync with the page.

A new testmodule, test_checksums, is introduced with an extensive
set of tests covering both online and offline data checksum mode
changes.  The tests which run concurrent pgbdench during online
processing are gated behind the PG_TEST_EXTRA flag due to being
very expensive to run.  Two levels of PG_TEST_EXTRA flags exist
to turn on a subset of the expensive tests, or the full suite of
multiple runs.

This work is based on an earlier version of this patch which was
reviewed by among others Heikki Linnakangas, Robert Haas, Andres
Freund, Tomas Vondra, Michael Banck and Andrey Borodin.  During
the work on this new version, Tomas Vondra has given invaluable
assistance with not only coding and reviewing but very in-depth
testing.

Author: Daniel Gustafsson <daniel@yesql.se>
Author: Magnus Hagander <magnus@hagander.net>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CABUevExz9hUUOLnJVr2kpw9Cx=o4MCr1SVKwbupzuxP7ckNutA@mail.gmail.com
Discussion: https://postgr.es/m/20181030051643.elbxjww5jjgnjaxg@alap3.anarazel.de
Discussion: https://postgr.es/m/CABUevEwE3urLtwxxqdgd5O2oQz9J717ZzMbh+ziCSa5YLLU_BA@mail.gmail.com
2026-04-03 22:58:51 +02:00
Nathan Bossart
8261ee24fe Refactor relation_needs_vacanalyze().
This commit adds an early return to this function, allowing us to
remove a level of indentation on a decent chunk of code.  This is
preparatory work for follow-up commits that will add a new system
view to show tables' autovacuum scores.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
2026-04-03 14:03:12 -05:00
Heikki Linnakangas
79534f9065 Change default of max_locks_per_transactions to 128
The previous commits reduced the amount of memory available for locks
by eliminating the "safety margins" and by settling the split between
LOCK and PROCLOCK tables at startup. The allocation is now more
deterministic, but it also means that you often hit one of the limits
sooner than before. To compensate for that, bump up
max_locks_per_transactions from 64 to 128. With that there is a little
more space in the both hash tables than what was the effective maximum
size for either table before the previous commits.

This only changes the default, so if you had changed
max_locks_per_transactions in postgresql.conf, you will still have
fewer locks available than before for the same setting value. This
should be noted in the release notes. A good rule of thumb is that if
you double max_locks_per_transactions, you should be able to get as
many locks as before.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
2026-04-03 20:27:46 +03:00
Heikki Linnakangas
e1ad034809 Make the lock hash tables fixed-sized
This prevents the LOCK table from "stealing" space that was originally
calculated for the PROLOCK table, and vice versa. That was weirdly
indeterministic so that if you e.g. took a lot of locks consuming all
the available shared memory for the LOCK table, subsequent
transactions that needed the more space for the PROCLOCK table would
fail, but if you restarted the system then the space would be
available for PROCLOCK again. Better to be strict and predictable,
even though that means that in many cases you can acquire far fewer
locks than before.

This also prevents the lock hash tables from using up the
general-purpose 100 kB reserve we set aside for "stuff that's too
small to bother estimating" in CalculateShmemSize(). We are pretty
good at accounting for everything nowadays, so we could probably make
that reservation smaller, but I'll leave that for another commit.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
2026-04-03 20:27:16 +03:00
Heikki Linnakangas
3e854d2ff1 Remove 10% safety margin from lock manager hash table estimates
As the comment says, the hash table sizes are just estimates, but that
doesn't mean we need a "safety margin" here. hash_estimate_size()
estimates the needed size in bytes pretty accurately for the given
number of elements, so if we wanted room for more elements in the
table, we should just use larger max_table_size in the
hash_estimate_size() call.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
2026-04-03 20:26:18 +03:00
Heikki Linnakangas
feb03dfecd Remove bogus "safety margin" from predicate.c shmem estimates
The 10% safety margin was copy-pasted from lock.c when the predicate
locking code was originally added. However, we later (commit
7c797e7194) added the HASH_FIXED_SIZE flag to the hash tables, which
means that they cannot actually use the safety margin that we're
calculating for them.

The extra memory was mainly used by the main lock manager, which is
the only shmem hash table of non-trivial size that does not use the
HASH_FIXED_SIZE flag. If we wanted to have more space for the lock
manager, we should reserve it directly in lock.c. After this commit,
the lock manager will just have less memory available than before.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi
2026-04-03 20:25:57 +03:00
Amit Langote
b7b27eb41a Optimize fast-path FK checks with batched index probes
Instead of probing the PK index on each trigger invocation, buffer
FK rows in a new per-constraint cache entry (RI_FastPathEntry) and
flush them as a batch.

On each trigger invocation, the new ri_FastPathBatchAdd() buffers
the FK row in RI_FastPathEntry.  When the buffer fills (64 rows)
or the trigger-firing cycle ends, the new ri_FastPathBatchFlush()
probes the index for all buffered rows, sharing a single
CommandCounterIncrement, snapshot, permission check, and security
context switch across the batch, rather than repeating each per row
as the SPI path does.  Per-flush CCI is safe because all AFTER
triggers for the buffered rows have already fired by flush time.

For single-column foreign keys, the new ri_FastPathFlushArray()
builds an ArrayType from the buffered FK values (casting to the
PK-side type if needed) and constructs a scan key with the
SK_SEARCHARRAY flag.  The index AM sorts and deduplicates the array
internally, then walks matching leaf pages in one ordered traversal
instead of descending from the root once per row.  A matched[] bitmap
tracks which batch items were satisfied; the first unmatched item is
reported as a violation.  Multi-column foreign keys fall back to
per-row probing via the new ri_FastPathFlushLoop().

The fast path introduced in the previous commit (2da86c1ef9) yields
~1.8x speedup.  This commit adds ~1.6x on top of that, for a combined
~2.9x speedup over the unpatched code (int PK / int FK, 1M rows, PK
table and index cached in memory).

FK tuples are materialized via ExecCopySlotHeapTuple() into a new
purpose-specific memory context (flush_cxt), child of
TopTransactionContext, which is also used for per-flush transient
work: cast results, the search array, and index scan allocations.
It is reset after each flush and deleted in teardown.

The PK relation, index, tuple slots, and fast-path metadata are
cached in RI_FastPathEntry across trigger invocations within a
trigger-firing batch, avoiding repeated open/close overhead.  The
snapshot and IndexScanDesc are taken fresh per flush.  The entry is
not subject to cache invalidation: cached relations are held with
locks for the transaction duration, and the entry's lifetime is
bounded by the trigger-firing cycle.

Lifecycle management for RI_FastPathEntry relies on three new
mechanisms:

  - AfterTriggerBatchCallback: A new general-purpose callback
    mechanism in trigger.c.  Callbacks registered via
    RegisterAfterTriggerBatchCallback() fire at the end of each
    trigger-firing batch (AfterTriggerEndQuery for immediate
    constraints, AfterTriggerFireDeferred at COMMIT, and
    AfterTriggerSetState for SET CONSTRAINTS IMMEDIATE).  The RI
    code registers ri_FastPathEndBatch as a batch callback.

  - Batch callbacks only fire at the outermost query level
    (checked inside FireAfterTriggerBatchCallbacks), so nested
    queries from SPI inside other AFTER triggers do not tear down
    the cache mid-batch.

  - XactCallback: ri_FastPathXactCallback NULLs the static cache
    pointer at transaction end, handling the abort path where the
    batch callback never fired.

  - SubXactCallback: ri_FastPathSubXactCallback NULLs the static
    cache pointer on subtransaction abort, preventing the batch
    callback from accessing already-released resources.

  - AfterTriggerBatchIsActive(): A new exported accessor that
    returns true when afterTriggers.query_depth >= 0.  During
    ALTER TABLE ... ADD FOREIGN KEY validation, RI triggers are
    called directly outside the after-trigger framework, so batch
    callbacks would never fire.  The fast-path code uses this to
    fall back to the non-cached per-invocation path in that
    context.

ri_FastPathEndBatch() flushes any partial batch before tearing
down cached resources.  Since the FK relation may already be
closed by flush time (e.g. for deferred constraints at COMMIT),
it reopens the relation using entry->fk_relid if needed.

The existing ALTER TABLE validation path bypasses batching and
continues to call ri_FastPathCheck() directly per row, because
RI triggers are called outside the after-trigger framework there
and batch callbacks would never fire to flush the buffer.

Suggested-by: David Rowley <dgrowleyml@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Co-authored-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com
2026-04-03 14:33:53 +09:00
Thomas Munro
be21341e13 jit: No backport::SectionMemoryManager for LLVM 22.
LLVM 22 has the fix that we copied into our tree in commit 9044fc1d and
a new function to reach it[1][2], so we only need to use our copy for
Aarch64 + LLVM < 22.  The only change to the final version that our copy
didn't get is a new LLVM_ABI macro, but that isn't appropriate for us.
Our copy is hopefully now frozen and would only need maintenance if bugs
are found in the upstream code.

Non-Aarch64 systems now also use the new API with LLVM 22.  It allocates
all sections with one contiguous mmap() instead of one per
section.  We could have done that earlier, but commit 9044fc1d wanted to
limit the blast radius to the affected systems.  We might as well
benefit from that small improvement everywhere now that it is available
out of the box.

We can't delete our copy until LLVM 22 is our minimum supported version,
or we switch to the newer JITLink API for at least Aarch64.

[1] https://github.com/llvm/llvm-project/pull/71968
[2] https://github.com/llvm/llvm-project/pull/174307

Backpatch-through: 14
Discussion: https://postgr.es/m/CA%2BhUKGJTumad75o8Zao-LFseEbt%3DenbUFCM7LZVV%3Dc8yg2i7dg%40mail.gmail.com
2026-04-03 14:55:11 +13:00
Andrew Dunstan
bd4f879a9c Add additional jsonpath string methods
Add the following jsonpath methods:

*   l/r/btrim()
*   lower(), upper()
*   initcap()
*   replace()
*   split_part()

Each simply dispatches to the standard string processing functions.
These depend on the locale, but since it's set at `initdb`, they can be
considered immutable and therefore allowed in any jsonpath expression.

Author: Florents Tselai <florents.tselai@gmail.com>
Co-authored-by: David E. Wheeler <david@justatheory.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CA+v5N40sJF39m0v7h=QN86zGp0CUf9F1WKasnZy9nNVj_VhCZQ@mail.gmail.com
2026-04-02 15:19:49 -04:00
Andrew Dunstan
a35c9d524e Rename jsonpath method arg tokens
This is just cleanup in the jsonpath grammar.

Rename the `csv_` tokens to `int_`, because they represent signed or
unsigned integers, as follows:

*   `csv_elem` => `int_elem`
*   `csv_list` => `int_list`
*   `opt_csv_list` => `opt_int_list`

Rename the `datetime_precision` tokens to `uint_arg`, as they represent
unsigned integers and will be useful for other methods in the future, as
follows:

*   `datetime_precision` => `uint_elem`
*   `opt_datetime_precision` => `opt_uint_arg`

Rename the `datetime_template` tokens to `str_arg`, as they represent
strings and will be useful for other methods in the future, as follows:

*   `datetime_template` => `str_elem`
*   `opt_datetime_template` => `opt_str_arg`

Author: David E. Wheeler <david@justatheory.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CA+v5N40sJF39m0v7h=QN86zGp0CUf9F1WKasnZy9nNVj_VhCZQ@mail.gmail.com
2026-04-02 15:19:49 -04:00
Masahiko Sawada
fd7a25af11 Add target_relid parameter to pg_get_publication_tables().
When a tablesync worker checks whether a specific table is published,
it previously issued a query to the publisher calling
pg_get_publication_tables() and filtering the result by relid via a
WHERE clause. Because the function itself was fully evaluated before
the filter was applied, this forced the publisher to enumerate all
tables in the publication. For publications covering a large number of
tables, this resulted in expensive catalog scans and unnecessary CPU
overhead on the publisher.

This commit adds a new overloaded form of pg_get_publication_tables()
that accepts an array of publication names and a target table
OID. Instead of enumerating all published tables, it evaluates
membership for the specified relation via syscache lookups, using the
new is_table_publishable_in_publication() helper. This helper
correctly accounts for publish_via_partition_root, ALL TABLES with
EXCEPT clauses, schema publications, and partition inheritance, while
avoiding the overhead of building the complete published table list.

The existing VARIADIC array form of pg_get_publication_tables() is
preserved for backward compatibility. Tablesync workers use the new
two-argument form when connected to a publisher running PostgreSQL 19
or later.

Bump catalog version.

Reported-by: Marcos Pegoraro <marcos@f10.com.br>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Haoyan Wang <wanghaoyan20@163.com>
Discussion: https://postgr.es/m/CAB-JLwbBFNuASyEnZWP0Tck9uNkthBZqi6WoXNevUT6+mV8XmA@mail.gmail.com
2026-04-02 11:34:50 -07:00
Fujii Masao
5770679918 Remove redundant SetLatch() calls in interrupt handling functions
Interrupt handling functions (e.g., HandleCatchupInterrupt(),
HandleParallelApplyMessageInterrupt()) are called only by
procsignal_sigusr1_handler(), which already calls SetLatch()
for the current process at the end of its processing.
Therefore, these interrupt handling functions do not need to
call SetLatch() themselves.

However, previously, some of these functions redundantly
called SetLatch(). This commit removes those unnecessary
calls.

While duplicate SetLatch() calls are redundant, they are
harmless, so this change is not backpatched.

Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/CALj2ACWd5apddj6Cd885WwJ6LquYu_G81C4GoR4xSoDV1x-FEA@mail.gmail.com
2026-04-02 23:55:30 +09:00
Tomas Vondra
7f8c88c2b8 jit: Change the default to off.
While JIT can speed up large analytical queries, it can also cause
serious performance issues on otherwise very fast queries. Compiling
and optimizing the expressions may be so expensive, it completely
outweighs the JIT benefits for shorter queries.

Ideally, we'd address this in the cost model, but the part deciding
whether to enable JIT for a query is rather simple, partially because we
don't have any reliable estimates of how expensive the LLVM compilation
and optimization is.

Sometimes seemingly unrelated changes (for example a couple additional
INSERTs into a table) increase the cost just enough to enable JIT,
resulting in a performance cliff.

Because of these risks, most large-scale deployments already disable JIT
by default. Notably, this includes all hyperscalers.

This commit changes our default to align with that established practice.
If we improve the JIT (be it better costing or cheaper execution), we
can consider enabling it by default again.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/DG1VZJEX1AQH.2EH4OKGRUDB71@jeltef.nl
2026-04-02 13:40:29 +02:00
Thomas Munro
de6b80e5ff jit: Stop emitting lifetime.end for LLVM 22.
The lifetime.end intrinsic can now only be used for stack memory
allocated with alloca[1][2][3].  We use it to tell LLVM about the
lifetime of function arguments/isnull values that we keep in palloc'd
memory, so that it can avoid spilling registers to memory.

We might need to rearrange things and put them on the stack, but that'll
take some research.  In the meantime, unbreak the build on LLVM 22.

[1] https://github.com/llvm/llvm-project/pull/149310
[2] https://llvm.org/docs/LangRef.html#llvm-lifetime-end-intrinsic
[3] https://llvm.org/docs/LangRef.html#i-alloca

Backpatch-through: 14
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> (earlier attempt)
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> (earlier attempt)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier attempt)
Discussion: https://postgr.es/m/CA%2BhUKGJTumad75o8Zao-LFseEbt%3DenbUFCM7LZVV%3Dc8yg2i7dg%40mail.gmail.com
2026-04-02 15:52:48 +13:00
David Rowley
331d829e62 Fix nocachegetattr() so it again supports deforming cstrings
c456e3911 added various optimizations to the tuple deformation routines.
One optimization assumed that heap tuples would never contain cstrings.
That optimization also made its way into nocachegetattr(), which isn't
correct as ROW() types get formed into HeapTuples by ExecEvalRow() and
those can contain cstring Datums.  nocachegetattr() gets used to extract
Datums from those tuples.

Here we remove the pg_assume(), which was there to instruct the compiler
to omit the attlen == -2 related code in att_addlength_pointer().

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/80aeac57-8f50-4732-a5b4-c2373c3f8149@gmail.com
2026-04-02 14:11:17 +13:00
Andres Freund
6e36930f9a read_stream: Prevent distance from decaying too quickly
Until now we reduced the look-ahead distance by 1 on every hit, and doubled it
on every miss. That is problematic because there are very common IO patterns
where this prevents us from ever reaching a sufficiently high distance (e.g. a
miss followed by a hit will never have the distance grow beyond 2). In many
such cases, if we had ever reached a sufficient look-ahead distance, things
would have been fine, because we grow the distance faster than we decrease it.

One might think that the most obvious answer to this problem would be to never
reduce the distance. However, that would not work well, as (particularly with
upcoming users of read streams), it is reasonably common to at first have a
lot of misses and then to transition to a fully cached workload, e.g. because
the same blocks are needed repeatedly within one stream. Doing unnecessarily
deep readahead can be costly, due to having to pin a lot more buffers, which
increases CPU overhead.

Because the cost of a synchronously handled miss can be very high (multiple
milliseconds for every IO with commonly used storage) compared to the CPU
overhead of keeping the distance too high, we want to err on the side of not
reducing the distance too early.

The insight that a decrease of the distance by 1 at ever hit may be ok at
large distances, but not at low distances, shows a way out: If we only allow
decreasing the distance once there were no misses for our maximum look-ahead
distance, we will keep the distance high as long as readahead has a chance to
do IO asynchronously, but not commonly when not.

Several folks have written variants of this patch, including at least Thomas
Munro, Melanie Plageman and I.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 19:50:03 -04:00
Andres Freund
cceb1bf45e read_stream: Issue IO synchronously while in fast path
While in fast-path, execute any IO that we might encounter synchronously.
Because we are, in that moment, not reading ahead, dispatching any occasional
IO to workers has the dispatch overhead, without any realistic chance of the
IO completing before we need it.

This helps io_method=worker performance for workloads that have only
occasional cache misses, but where those occasional misses still take long
enough to matter.  It is likely this is only measurable with fast local
storage or workloads with the data in the kernel page cache, as with remote
storage the IO latency, not the dispatch-to-worker latency, is the determining
factor.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 19:22:44 -04:00
Heikki Linnakangas
1bdbb211bb Make ShmemIndex visible in the pg_shmem_allocations view
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-01 23:56:51 +03:00
Álvaro Herrera
db89a47115
Give an 'options' parameter to tuple_delete/_update
The tuple_insert() method already has an equivalent argument, so this
makes sense just on consistency grounds, for future growth.

table_delete() can immediately use it to carry the 'changingPart'
boolean; for table_update we don't have any options at present.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> (older version)
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/202603171606.kf6pmhscqbqz@alvherre.pgsql
2026-04-01 20:26:57 +02:00
Peter Eisentraut
8e72d914c5 Add UPDATE/DELETE FOR PORTION OF
This is an extension of the UPDATE and DELETE commands to do a
"temporal update/delete" based on a range or multirange column.  The
user can say UPDATE t FOR PORTION OF valid_at FROM '2001-01-01' TO
'2002-01-01' SET ... (or likewise with DELETE) where valid_at is a
range or multirange column.

The command is automatically limited to rows overlapping the targeted
portion, and only history within those bounds is changed.  If a row
represents history partly inside and partly outside the bounds, then
the command truncates the row's application time to fit within the
targeted portion, then it inserts one or more "temporal leftovers":
new rows containing all the original values, except with the
application-time column changed to only represent the untouched part
of history.

To compute the temporal leftovers that are required, we use the *_minus_multi
set-returning functions defined in 5eed8ce50c.

- Added bison support for FOR PORTION OF syntax.  The bounds must be
  constant, so we forbid column references, subqueries, etc. We do
  accept functions like NOW().
- Added logic to executor to insert new rows for the "temporal
  leftover" part of a record touched by a FOR PORTION OF query.
- Documented FOR PORTION OF.
- Added tests.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com
2026-04-01 19:06:03 +02:00
Álvaro Herrera
ec2f81766a
Fix vicinity of tuple_insert to use uint32, not int, for options
Oversight in commit 1bd6f22f43: I was way too optimistic about the
compiler letting me know what variables needed to be updated, and missed
a few of them.  Clean it up.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/40E570EE-5A60-49D8-B8F7-2F8F2B7C8DFA@gmail.com
2026-04-01 18:14:51 +02:00
Dean Rasheed
f7f4052a4e Add support for extended statistics on virtual generated columns.
This allows both univariate and multivariate statistics to be built on
virtual generated columns and expressions that refer to virtual
generated columns. The restriction disallowing extended statistics on
a single column is lifted in the case of a single virtual generated
column, since it is treated as a single expression.

In the catalogs, references to virtual generated columns are stored
as-is. They are expanded at ANALYZE time to build the statistics, and
at planning time to allow the optimizer to make use of the statistics.
This allows the statistics to be correctly rebuilt using ANALYZE, if a
column's generation expression is altered (which causes any existing
statistics data to be deleted).

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/20250422181006.dd6f9d1d81299f5b2ad55e1a@sraoss.co.jp
2026-04-01 17:02:24 +01:00
Andres Freund
513374a47a bufmgr: Return whether WaitReadBuffers() needed to wait
Thanks to the previous commit, pgaio_wref_check_done() will now detect whether
IO has completed even if userspace has not yet consumed the kernel completion.
This knowledge can be useful for callers of WaitReadBuffers() to know whether
it needed to wait or not, e.g. for adjusting read-ahead aggressiveness or for
instrumentation.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
Discussion: https://postgr.es/m/a177a6dd-240b-455a-8f25-aca0b1c08c6e@vondra.me
2026-04-01 09:26:43 -04:00
Andres Freund
6e648e353f aio: io_uring: Allow IO methods to check if IO completed in the background
Until now pgaio_wref_check_done() with io_method=io_uring would not detect if
IOs are known to have completed to the kernel, but the completion has not yet
been consumed by userspace.  This can lead to inferior performance and also
makes it harder to use smarter feedback logic in read_stream, because we
cannot use knowledge about whether an IO completed to control the readahead
distance.

This commit just adds the io_uring specific infrastructure. Later commits will
return whether a wait was needed from WaitReadBuffers() and then use that
knowledge.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 09:26:43 -04:00
Amit Langote
edee563456 Make FastPathMeta self-contained by copying FmgrInfo structs
FastPathMeta stored pointers into ri_compare_cache entries via
compare_entries[], creating a dependency on that cache remaining
stable.  If ri_compare_cache entries were invalidated after fpmeta
was populated, the pointers would dangle.

Replace compare_entries[] with inline copies of the two FmgrInfo
fields actually needed (cast_func_finfo and eq_opr_finfo), copied
at populate time via fmgr_info_copy().  fpmeta now depends only on
riinfo remaining valid, which is already handled by the invalidation
callback.

Introduced by commit 2da86c1ef9 ("Add fast path for foreign key
constraint checks"), noticed while reviewing code for robustness
under CLOBBER_CACHE_ALWAYS.

Discussion: https://postgr.es/m/CA+HiwqFQ+ZA7hSOygv4uv_t75B3r0_gosjadetCsAEoaZwTu6g@mail.gmail.com
2026-04-01 18:43:40 +09:00
Amit Langote
e484b0eea6 Fix two issues in fast-path FK check introduced by commit 2da86c1ef9
First, under CLOBBER_CACHE_ALWAYS, the RI_ConstraintInfo entry can
be invalidated by relcache callbacks triggered inside table_open()
or index_open(), leaving ri_FastPathCheck() calling
ri_populate_fastpath_metadata() with a stale entry whose valid flag
is false.  Fix by moving the fpmeta initialization to after
ri_CheckPermissions(), reloading riinfo first to ensure it is
valid, then calling ri_ExtractValues() and build_index_scankeys()
immediately after before any further operations that could trigger
invalidation.

Second, fpmeta allocated in TopMemoryContext was not freed when the
entry was invalidated in InvalidateConstraintCacheCallBack(),
leaking memory each time the constraint cache entry was recycled.
Fix by freeing and NULLing fpmeta at invalidation time.

Noticed locally when testing with CLOBBER_CACHE_ALWAYS.

Discussion: https://postgr.es/m/CA+HiwqGBU__7-VZZhQWQ3EQuwLYNPd9==ngnzduhGWKHMj9mvw@mail.gmail.com
2026-04-01 17:30:33 +09:00
John Naylor
f6bd9f0fe2 Skip common prefixes during radix sort
During the counting step, keep track of the bits that are the same
for the entire input.  If we counted only a single distinct byte,
the next recursion will start at the next byte position that has
more than one distinct byte in the input. This allows us to skip over
multiple passes where the byte is the same for the entire input.

This provides a significant speedup for integers that have some upper
bytes with all-zeros or all-ones, which is common.

Reviewed-by: Chengpeng Yan <chengpeng_yan@outlook.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/CANWCAZYpGMDSSwAa18fOxJGXaPzVdyPsWpOkfCX32DWh3Qznzw@mail.gmail.com
2026-04-01 14:18:57 +07:00
Fujii Masao
21b018e7ea Reduce log level of some logical decoding messages from LOG to DEBUG1
Previously some logical decoding messages (e.g., "logical decoding found
consistent point") were logged at level LOG, even though they provided
low-level, developer-oriented information that DBAs were typically not
interested in.

Since these messages can occur routinely (for example, when keeping calling
pg_logical_slot_get_changes() to obtain the changes from logical decoding),
logging them at LOG can be overly verbose.

This commit reduces their log level to DEBUG1 to avoid unnecessary log noise.

This change applies to a small set of messages for now. Additional messages
may be adjusted similarly in the future.

Even with this change, if these messages from walsender still need to be
observed, enabling DEBUG1 logging selectively for walsender (e.g.,
log_min_messages = 'warning,walsender:debug1') would be helpful to avoid
increasing overall log volume.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwGTyHgtD9tyN664x6vQ8Q1G53H7ZUCgBU9_X=nLt3f1QA@mail.gmail.com
2026-04-01 15:43:02 +09:00
Peter Eisentraut
76f4b92bac Use standard C23 and C++ attributes if available
Use the standard C23 and C++ attributes [[nodiscard]], [[noreturn]],
and [[maybe_unused]], if available.

This makes pg_nodiscard and pg_attribute_unused() available in
not-GCC-compatible compilers that support C23 as well as in C++.

For pg_noreturn, we can now drop the GCC-specific and MSVC-specific
fallbacks, because the C11 and the C++ implementation will now cover
all required cases.

Note, in a few places, we need to change the position of the attribute
because it's not valid in that place in C23.

Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw
2026-04-01 08:15:02 +02:00
Amit Kapila
6b0550c45d Fix miscellaneous issues in EXCEPT publication clause.
Improve documentation regarding multiple publications and partition
hierarchies. Refine error reporting for excluded relations. Consolidate
docs by using table_object instead of expanded table syntax in publication
commands. Also includes minor test cleanup and naming fixes.

Reported-by: Peter Smith <smithpb2250@gmail.com>
Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CALDaNm1CiBYcteE_jjPA4BPHfX30dg9eTTTkJgkjY5tgE7t=bQ@mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
2026-04-01 09:13:43 +05:30
Andres Freund
c0af4eb4e7 bufmgr: Fix ordering of checks in PinBuffer()
The check for skip_if_not_valid added in 819dc118c0 was put at the start of
the loop. A CAS loop in theory does allow to make that check in a race free
manner. However, just after the check, there's a
    old_buf_state = WaitBufHdrUnlocked(buf);
which introduces a race, because it would allow BM_VALID to be cleared, after
the skip_if_not_valid check.

Fix by restarting the loop after WaitBufHdrUnlocked().

Reported-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Discussion: https://postgr.es/m/5bf667f3-5270-4b19-a08f-0facbecdff68@postgrespro.ru
2026-03-31 19:24:58 -04:00
Jacob Champion
e020a897ef oauth: Don't log discovery connections by default
Currently, when the client sends a parameter discovery request within
OAUTHBEARER, the server logs the attempt with

    FATAL:  OAuth bearer authentication failed for user

These log entries are difficult to distinguish from true authentication
failures, and by default, libpq sends a discovery request as part of
every OAuth connection, making them annoyingly noisy. Use the new
PG_SASL_EXCHANGE_ABANDONED status to suppress them.

Patch by Zsolt Parragi, with some additional comments added by me.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFPim7hUiyb7daNKQPSZ8CvQRBGkVhbvED7yZi8VktSn4Q%40mail.gmail.com
2026-03-31 11:47:33 -07:00
Jacob Champion
c4ff16339f sasl: Allow backend mechanisms to "abandon" exchanges
Introduce PG_SASL_EXCHANGE_ABANDONED, which allows CheckSASLAuth to
suppress the failing log entry for any SASL exchange that isn't actually
an authentication attempt. This is desirable for OAUTHBEARER's discovery
exchanges (and a subsequent commit will make use of it there).

This might have some overlap in the future with in-band aborts for SASL
exchanges, but it's intentionally not named _ABORTED to avoid confusion.
(We don't currently support clientside aborts in our SASL profile.)

Adapted from a patch by Zsolt Parragi.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFPim7hUiyb7daNKQPSZ8CvQRBGkVhbvED7yZi8VktSn4Q%40mail.gmail.com
2026-03-31 11:47:31 -07:00
Jacob Champion
c2bca7cc96 Add FATAL_CLIENT_ONLY to ereport/elog
SASL exchanges must end with either an AuthenticationOk or an
ErrorResponse from the server, and the standard way to produce an
ErrorResponse packet is for auth_failed() to call ereport(FATAL). This
means that there's no way for a SASL mechanism to suppress the server
log entry if the "authentication attempt" was really just a query for
authentication metadata, as is done with OAUTHBEARER.

Following the example of 1f9158ba4, add a FATAL_CLIENT_ONLY elevel. This
will allow ClientAuthentication() to choose not to log a particular
failure, while still correctly ending the authentication exchange before
process exit.

(The provenance of this patch is convoluted: since it's a mechanical
copy-paste of 1f9158ba4, both Zsolt Parragi and I produced nearly
identical versions independently, and Andrey Borodin reviewed Zsolt's
version. Tom Lane is the author of 1f9158ba4, but I don't want to imply
that he's signed off on this adaptation. See Discussion.)

Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CAN4CZFPim7hUiyb7daNKQPSZ8CvQRBGkVhbvED7yZi8VktSn4Q%40mail.gmail.com
2026-03-31 11:47:29 -07:00
Nathan Bossart
771fe0948c Avoid including vacuum.h in tableam.h and heapam.h.
Commit 2252fcd427 modified some function prototypes in tableam.h
and heapam.h to take a VacuumParams argument instead of a pointer,
which required including vacuum.h in those headers.  vacuum.h has a
reasonably large dependency tree, and headers like tableam.h are
widely included, so this is not ideal.  To fix, change the
functions in question to accept a "const VacuumParams *" argument
instead.  That allows us to use a forward declaration for
VacuumParams and avoid including vacuum.h.  Since vacuum_rel()
needs to scribble on the params argument, we still pass it by value
to that function so that the original struct is not modified.

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/rzxpxod4c4la62yvutyrvgoyilrl2fx55djaf2suidy7np5m6c%403l2ln476eadh
2026-03-31 12:43:52 -05:00
Tom Lane
fb7a9050d5 Doc: improve explanation of GiST compress/decompress methods.
The docs previously didn't explain that leaf and non-leaf keys
could be treated differently, even though many of our opclasses
do exactly that.  It also wasn't explained how that relates to
the STORAGE option, particularly since only one storage type
can be specified for both leaf and non-leaf keys.

While here, reorganize the text slightly, rather than sticking
additional detail into what's supposed to be a brief summary
paragraph.

Author: Paul A Jungwirth <pj@illuminatedcomputing.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA+renyWs5Np+FLSYfL+eu20S4U671A3fQGb-+7e22HLrD1NbYw@mail.gmail.com
2026-03-31 11:23:26 -04:00
Heikki Linnakangas
7b424e3108 Change the signature of dynahash's alloc function
Instead of passing the current memory context to the alloc function
via a shared variable, pass it directly as an argument.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-03-31 16:55:03 +03:00
Heikki Linnakangas
dde69621c3 Remove HASH_SEGMENT option
It's been unused forever. There's no urgency in removing it now, but
it was just something that caught my eye.

Aleksander Alekseev proposed this a long time ago [0], but Tom Lane
was worried about third-party extensions using it. I believe that's a
non-issue: I tried grepping through all extensions found on github and
didn't find any references to HASH_SEGMENT.

[0] https://www.postgresql.org/message-id/20160418180711.55ac82c0@fujitsu

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-03-31 16:45:28 +03:00
Peter Eisentraut
a0dd0702e4 Fix cross variable references in graph pattern causing segfault
When converting the WHERE clause in an element pattern,
generate_query_for_graph_path() calls replace_property_refs() to
replace the property references in it.  Only the current graph element
pattern is passed as the context for replacement.  If there are
references to variables from other element patterns, it causes a
segmentation fault (an assertion failure in an Assert enabled build)
since it does not find path_element object corresponding to those
variables.

We do not support forward and backward variable references within a
graph table clause.  Hence prohibit all the cross references.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reported-by: Man Zeng <zengman@halodbtech.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5u6AoDfNg4%3DR5eVJn_bJn%3DC%3DwVPrto02P_06fxy39fniA%40mail.gmail.com
2026-03-31 11:47:19 +02:00
Peter Eisentraut
c5b3253b8a Property references are preferred over regular column references
When a ColumnRef can be resolved as a graph table property reference
and a lateral table column reference prefer the graph table property
reference since element pattern variables in the GRAPH_TABLE clause
form the innermost namespace.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5u6AoDfNg4%3DR5eVJn_bJn%3DC%3DwVPrto02P_06fxy39fniA%40mail.gmail.com
2026-03-31 11:47:19 +02:00
Amit Langote
68a8601ee9 Fix use-after-free in ri_LoadConstraintInfo
conindid was read from conForm after ReleaseSysCache(tup).  Move
the read to before the release.

Introduced by commit 2da86c1ef9.

Per buildfarm member prion.

Discussion: https://postgr.es/m/CA+HiwqGGYjN6F2oL7yAk=hvSs-sj3TPqZ9JC9iyLkCqJadECrw@mail.gmail.com
2026-03-31 17:04:44 +09:00
Daniel Gustafsson
097ab69d17 Formalize WAL record for XLOG_CHECKPOINT_REDO
XLOG_CHECKPOINT_REDO only contains the wal_level copied straight in
without an encapsulating record structure. While it works, it makes
future uses of XLOG_CHECKPOINT_REDO hard as there is nowhere to put
new data items.  This fix this was inspired by the online checksums
patch which adds data to this record,  but this change has value on
its own.

Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/c92b5d8b-bc03-47bc-b209-2e4a719eee32@iki.fi
2026-03-31 09:38:01 +02:00
Amit Langote
2da86c1ef9 Add fast path for foreign key constraint checks
Add a fast-path optimization for foreign key checks that bypasses SPI
by directly probing the unique index on the referenced table.
Benchmarking shows ~1.8x speedup for bulk FK inserts (int PK/int FK,
1M rows, where PK table and index are cached).

The fast path applies when the referenced table is not partitioned and
the constraint does not involve temporal semantics.  Otherwise, the
existing SPI path is used.

This optimization covers only the referential check trigger
(RI_FKey_check).  The action triggers (CASCADE, SET NULL, SET DEFAULT,
RESTRICT, NO ACTION) must find rows on the FK side to modify, which
requires a table scan with no guaranteed index available, and then
execute DML against those rows through the full executor path including
any triggered actions.  Replicating that without substantial code
duplication is not feasible, so those triggers remain on the SPI path.
Extending the fast path to action triggers remains possible as future
work if the necessary infrastructure is built.

The new ri_FastPathCheck() function extracts the FK values, builds scan
keys, performs an index scan, and locks the matching tuple with
LockTupleKeyShare via ri_LockPKTuple(), which handles the RI-specific
subset of table_tuple_lock() results.

If the locked tuple was reached by chasing an update chain
(tmfd.traversed), recheck_matched_pk_tuple() verifies that the key
is still the same, emulating EvalPlanQual.

The scan uses GetTransactionSnapshot(), matching what the SPI path
uses (via _SPI_execute_plan pushing GetTransactionSnapshot() as the
active snapshot).  Under READ COMMITTED this is a fresh snapshot;
under REPEATABLE READ / SERIALIZABLE it is the frozen transaction-
start snapshot, so PK rows committed after the transaction started
are not visible.

The ri_CheckPermissions() function performs schema USAGE and table
SELECT checks, matching what the SPI path gets implicitly through
the executor's permission checks.  The fast path also switches to
the PK table owner's security context (with SECURITY_NOFORCE_RLS)
before the index probe, matching the SPI path where the query runs
as the table owner.

ri_HashCompareOp() is adjusted to handle cross-type equality operators
(e.g. int48eq for int4 PK / int8 FK) which can appear in conpfeqop.
The existing code asserted same-type operators only, which was correct
for its existing callers (ri_KeysEqual compares same-type FK column
values via ff_eq_oprs), but the fast path is the first caller to pass
pf_eq_oprs, which can be cross-type.

Per-key metadata (compare entries, operator procedures, strategy
numbers) is cached in RI_ConstraintInfo via
ri_populate_fastpath_metadata() on first use, eliminating repeated
calls to ri_HashCompareOp() and get_op_opfamily_properties().
conindid and pk_is_partitioned are also cached at constraint load
time, avoiding per-invocation syscache lookups and the need to open
pk_rel before deciding whether the fast path applies.

New regression tests cover RLS bypass and ACL enforcement for the
fast-path permission checks.  New isolation tests exercise concurrent
PK updates under both READ COMMITTED and REPEATABLE READ.

Author: Junwang Zhao <zhjwpku@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com
2026-03-31 13:49:21 +09:00
Amit Kapila
5984ea868e Change syntax of EXCEPT TABLE clause in publication commands.
Adjust the syntax of the EXCEPT clause in CREATE/ALTER PUBLICATION
added in commits fd366065e0 and 493f8c6439 to move the TABLE keyword
inside the relation list.

Old syntax:
CREATE PUBLICATION ... FOR ALL TABLES EXCEPT TABLE (t1, ...);
ALTER PUBLICATION  ... SET ALL TABLES EXCEPT TABLE (t1, ...);

New syntax:
CREATE PUBLICATION ... FOR ALL TABLES EXCEPT (TABLE t1, ...);
ALTER PUBLICATION  ... SET ALL TABLES EXCEPT (TABLE t1, ...);

This is to ensure that inclusion and exclusion list can be specified in
a same way. Previously, the exclusion table list can be specified as
TABLE (t1, t2, t3) and inclusion list can be specified as TABLE t1, t2,
t3, or TABLE t1, TABLE t2, TABLE t3.

This change is purely syntactic and does not alter behavior.

Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Author: vignesh C <vignesh21@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCC8XuwfX62qKBSfHUAoww_XB3_84HjswgL9jxQy696yw@mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
2026-03-31 09:40:51 +05:30
Nathan Bossart
bab2f27eaa Remove bits* typedefs.
In addition to removing the bits8, bits16, and bits32 typedefs,
this commit replaces all uses with uint8, uint16, or uint32.  bits*
provided little benefit beyond establishing the intent of the
variable, and they were inconsistently used for that purpose.
Third-party code should instead use the corresponding uint*
typedef.

Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/absbX33E4eaA0Ity%40nathan
2026-03-30 16:12:08 -05:00
Heikki Linnakangas
40c41dc773 Use ShmemInitStruct to allocate shmem for semaphores
This makes them visible in pg_shmem_allocations

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-03-30 23:39:35 +03:00
Melanie Plageman
378a216187 Set pd_prune_xid on insert
Now that on-access pruning can update the visibility map (VM) during
read-only queries, set the page’s pd_prune_xid hint during INSERT and on
the new page during UPDATE.

This allows heap_page_prune_and_freeze() to set the VM the first time a
page is read after being filled with tuples. This may avoid I/O
amplification by setting the page all-visible when it is still in shared
buffers and allowing later vacuums to skip scanning the page. It also
enables index-only scans of newly inserted data much sooner.

As a side benefit, this addresses a long-standing note in heap_insert()
and heap_multi_insert(): aborted inserts can now be pruned on-access
rather than lingering until the next VACUUM.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-30 16:07:11 -04:00
Melanie Plageman
b46e1e54d0 Allow on-access pruning to set pages all-visible
Many queries do not modify the underlying relation. For such queries, if
on-access pruning occurs during the scan, we can check whether the page
has become all-visible and update the visibility map accordingly.
Previously, only vacuum and COPY FREEZE marked pages as all-visible or
all-frozen.

This commit implements on-access VM setting for sequential scans, tid
range scans, sample scans, bitmap heap scans, and the underlying heap
relation in index scans.

Setting the visibility map on-access can avoid write amplification
caused by vacuum later needing to set the page all-visible, which could
trigger a write and potentially an FPI. It also allows more frequent
index-only scans, since they require pages to be marked all-visible in
the VM.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-30 15:47:07 -04:00
Tom Lane
7394773450 Be more careful to preserve consistency of a tuplestore.
Several places in tuplestore.c would leave the tuplestore data
structure effectively corrupt if some subroutine were to throw
an error.  Notably, if WRITETUP() failed after some number of
successful calls within dumptuples(), the tuplestore would
contain some memtuples pointers that were apparently live
entries but in fact pointed to pfree'd chunks.

In most cases this sort of thing is fine because transaction
abort cleanup is not too picky about the contents of memory that
it's going to throw away anyway.  There's at least one exception
though: if a Portal has a holdStore, we're going to call
tuplestore_end() on that, even during transaction abort.
So it's not cool if that tuplestore is corrupt, and that means
tuplestore.c has to be more careful.

This oversight demonstrably leads to crashes in v15 and before,
if a holdable cursor fails to persist its data due to an undersized
temp_file_limit setting.  Very possibly the same thing can happen in
v16 and v17 as well, though the specific test case submitted failed
to fail there (cf. 095555daf).  The failure is accidentally dodged
as of v18 because 590b045c3 got rid of tuplestore_end's retail tuple
deletion loop.  Still, it seems unwise to permit tuplestores to become
internally inconsistent in any branch, so I've applied the same fix
across the board.

Since the known test case for this is rather expensive and doesn't
fail in recent branches, I've omitted it.

Bug: #19438
Reported-by: Dmitriy Kuzmin <kuzmin.db4@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/19438-9d37b179c56d43aa@postgresql.org
Backpatch-through: 14
2026-03-30 13:59:58 -04:00
Heikki Linnakangas
681774315d Replace getopt() with our re-entrant variant in the backend
Some of these probably could continue using non-re-entrant getopt()
even if we start using threads in the future, but it seems better to
make them all anyway, so that we have a clear-cut rule of "no plain
getopt() in the postgres binary".

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/d1da5f0e-0d68-47c9-a882-eb22f462752f@iki.fi
2026-03-30 20:47:16 +03:00
Heikki Linnakangas
c5f7820e57 Fix latent bug in get_stats_option_name()
The function is supposed to look at the passed in 'arg' argument, but
peeks at the 'optarg' global variable that's part of getopt()
instead. It happened to work anyway, because all callers passed
'optarg' as the argument.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/d1da5f0e-0d68-47c9-a882-eb22f462752f@iki.fi
2026-03-30 20:34:48 +03:00
Melanie Plageman
50eb5faea2 Pass down information on table modification to scan nodes
Pass down information to sequential scan, index [only] scan, bitmap
table scan, sample scan, and TID range scan nodes on whether or not the
query modifies the relation being scanned. A later commit will use this
information to update the VM during on-access pruning only if the
relation is not modified by the query.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4379FDA3-9446-4E2C-9C15-32EFE8D4F31B%40yandex-team.ru
2026-03-30 13:27:34 -04:00
Álvaro Herrera
349bd88202
Don't use bits32 in table AM interface
Seems there's near-universal dislike for the bitsXX typedefs.
Revert that part of commit 1bd6f22f43 in favor of using plain uint32.
2026-03-30 19:06:33 +02:00
Melanie Plageman
dcd8cc1c85 Thread flags through begin-scan APIs
Add an AM user-settable flags parameter to several of the table scan
functions, one table AM callback, and index_beginscan(). This allows
users to pass additional context to be used when building the scan
descriptors.

For index scans, a new flags field is added to IndexFetchTableData, and
the heap AM saves the caller-provided flags there.

This introduces an extension point for follow-up work to pass per-scan
information (such as whether the relation is read-only for the current
query) from the executor to the AM layer.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2be31f17-5405-4de9-8d73-90ebc322f7d8%40vondra.me
2026-03-30 12:27:24 -04:00
Tom Lane
095555daf1 Detect pfree or repalloc of a previously-freed memory chunk.
Before the major rewrite in commit c6e0fe1f2, AllocSetFree() would
typically crash when asked to free an already-free chunk.  That was
an ugly but serviceable way of detecting coding errors that led to
double pfrees.  But since that rewrite, double pfrees went through
just fine, because the "hdrmask" of a freed chunk isn't changed at all
when putting it on the freelist.  We'd end with a corrupt freelist
that circularly links back to the doubly-freed chunk, which would
usually result in trouble later, far removed from the actual bug.

This situation is no good at all for debugging purposes.  Fortunately,
we can fix it at low cost in MEMORY_CONTEXT_CHECKING builds by making
AllocSetFree() check for chunk->requested_size == InvalidAllocSize,
relying on the pre-existing code that sets it that way just below.

I investigated the alternative of changing a freed chunk's methodid
field, which would allow detection in non-MEMORY_CONTEXT_CHECKING
builds too.  But that adds measurable overhead.  Seeing that we didn't
notice this oversight for more than three years, it's hard to argue
that detecting this type of bug is worth any extra overhead in
production builds.

Likewise fix AllocSetRealloc() to detect repalloc() on a freed chunk,
and apply similar changes in generation.c and slab.c.  (generation.c
would hit an Assert failure anyway, but it seems best to make it act
like aset.c.)  bump.c doesn't need changes since it doesn't support
pfree in the first place.  Ideally alignedalloc.c would receive
similar changes, but in debugging builds it's impossible to reach
AlignedAllocFree() or AlignedAllocRealloc() on a pfreed chunk, because
the underlying context's pfree would have wiped the chunk header of
the aligned chunk.  But that means we should get an error of some
sort, so let's be content with that.

Per investigation of why the test case for bug #19438 didn't appear to
fail in v16 and up, even though the underlying bug was still present.
(This doesn't fix the underlying double-free bug, just cause it to
get detected.)

Bug: #19438
Reported-by: Dmitriy Kuzmin <kuzmin.db4@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/19438-9d37b179c56d43aa@postgresql.org
Backpatch-through: 16
2026-03-30 12:02:08 -04:00
Heikki Linnakangas
bd365b1ae5 Fix outdated comment on MainLWLockArray
It's no longer passed to child processes down via BackendParameters in
EXEC_BACKEND mode.

Reported-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAA5RZ0vPWNMvTBqyH7nqDRrHd6Y4Et5iNqXFuwpbsPOk3cL4rQ@mail.gmail.com
2026-03-30 17:13:11 +03:00
Melanie Plageman
39dcd10a2c Remove PlannedStmt->resultRelations in favor of resultRelationRelids
PlannedStmt->resultRelations was an integer list of range table indexes
because at the time it was added (to Query), the Bitmapset data type did
not yet exist in Postgres.

0f4c170cf3 added a Bitmapset of result relations, so remove the integer
list of RTIs and use the more compact resultRelationRelids.

Discussion: https://postgr.es/m/CAApHDvqAOeOwCKh9g0gfxWa040%3DHyc7_oA%3DC59rjod8kXJDWyw%40mail.gmail.com
2026-03-30 09:51:28 -04:00
Melanie Plageman
0f4c170cf3 Make it cheap to check if a relation is modified by a query
Save the range table indexes of result relations and row mark relations
in separate bitmapsets in the PlannedStmt. Precomputing them allows
cheap membership checks during execution. Together, these two groups
approximate all relations that will be modified by a query. This
includes relations targeted by INSERT, UPDATE, DELETE, and MERGE as well
as relations with any row mark (like SELECT FOR UPDATE).

Future work will use information on whether or not a relation is
modified by a query in a heuristic.

PlannedStmt->resultRelations is only used in a membership check, so it
will be removed in a separate commit.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
2026-03-30 09:38:03 -04:00
Álvaro Herrera
1bd6f22f43
Have table_insert and siblings use an unsigned type for options
Using signed types can lead to bugs, such as the one fixed by commit
2a2e1b470b.

Discussion: https://postgr.es/m/44e6ze3kuunhky63wmfjxrmn72pds2whwf5ok6hpz7c4my7k2h@l65zhpcuasnf
2026-03-30 13:58:16 +02:00
Peter Eisentraut
b36b956404 Make cast functions to type money error safe
This converts the cast functions from types integer, bigint, and
numeric to type money to support soft errors.

Note: Casting from type money to type numeric (the other way, function
cash_numeric) is not yet error safe.

Author: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-30 10:10:56 +02:00
Peter Eisentraut
26f9012bee Make cast function from circle to polygon error safe
Previously, the function casting type circle to type polygon could not
be made error safe, because it is an SQL language function.

This refactors it as a C/internal function, by sharing code with the
C/internal function that the SQL function previously wrapped, and soft
error support is added.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-30 09:11:08 +02:00
Fujii Masao
2497dac556 Fix FK triggers losing DEFERRABLE/INITIALLY DEFERRED when marked ENFORCED again
Previously, a foreign key defined as DEFERRABLE INITIALLY DEFERRED could
behave as NOT DEFERRABLE after being set to NOT ENFORCED and then back
to ENFORCED.

This happened because recreating the FK triggers on re-enabling the constraint
forgot to restore the tgdeferrable and tginitdeferred fields in pg_trigger.

Fix this bug by properly setting those fields when the foreign key constraint
is marked ENFORCED again and its triggers are recreated, so the original
DEFERRABLE and INITIALLY DEFERRED properties are preserved.

Backpatch to v18, where NOT ENFORCED foreign keys were introduced.

Author: Yasuo Honda <yasuo.honda@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAKmOUTms2nkxEZDdcrsjq5P3b2L_PR266Hv8kW5pANwmVaRJJQ@mail.gmail.com
Backpatch-through: 18
2026-03-30 14:37:33 +09:00
David Rowley
0d866282b8 Fix datum_image_*()'s inability to detect sign-extension variations
Functions such as hash_numeric() are not careful to use the correct
PG_RETURN_*() macro according to the return type of that function as
defined in pg_proc.  Because that function is meant to return int32,
when the hashed value exceeds 2^31, the 64-bit Datum value won't wrap to
a negative number, which means the Datum won't have the same value as it
would have had it been cast to int32 on a two's complement machine.  This
isn't harmless as both datum_image_eq() and datum_image_hash() may receive
a Datum that's been formed and deformed from a tuple in some cases, and
not in other cases.  When formed into a tuple, the Datum value will be
coerced into an integer according to the attlen as specified by the
TupleDesc.  This can result in two Datums that should be equal being
classed as not equal, which could result in (but not limited to) an error
such as:

ERROR:  could not find memoization table entry

Here we fix this by ensuring we cast the Datum value to a signed integer
according to the typLen specified in the datum_image_eq/datum_image_hash
function call before comparing or hashing.

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Tender Wang <tndrwang@gmail.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/CAHewXNmcXVFdB9_WwA8Ez0P+m_TQy_KzYk5Ri5dvg+fuwjD_yw@mail.gmail.com
2026-03-30 16:14:34 +13:00
Amit Langote
1ad7191f7e Add comment explaining fire_triggers=false in ri_PerformCheck()
The reason for passing fire_triggers=false to SPI_execute_snapshot()
in ri_PerformCheck() was not documented, making it unclear why it was
done that way.  Add a comment explaining that it ensures AFTER triggers
on rows modified by the RI action are queued in the outer query's
after-trigger context and fire only after all RI updates on the same
row are complete.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Surya Poondla <suryapoondla4@gmail.com>
Discussion: https://postgr.es/m/20250331212648.ad4ab804559001d7f0788741@sraoss.co.jp
2026-03-30 10:10:17 +09:00
Peter Eisentraut
45cdaf3665 Make geometry cast functions error safe
This adjusts cast functions of the geometry types to support soft
errors.  This requires refactoring of various helper functions to
support error contexts.  Also make the float8 to float4 cast error
safe.  It requires some of the same helper functions.

This is in preparation for a future feature where conversion errors in
casts can be caught.

(The function casting type circle to type polygon is not yet made error
safe, because it is an SQL language function.)

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-29 20:40:50 +02:00
Álvaro Herrera
0841b219bf
Sort InternalBGWorkers list alphabetically
This simplifies deciding where to add a new one.
2026-03-29 14:15:00 +02:00
Peter Eisentraut
10e4d8aaf4 Make cast functions from jsonb error safe
This adjusts cast functions from jsonb to other types to support soft
errors.  This just involves some refactoring of the underlying helper
functions to use ereturn.

This is in preparation for a future feature where conversion errors in
casts can be caught.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-28 15:44:13 +01:00
Andres Freund
999dec9ec6 aio: Don't wait for already in-progress IO
When a backend attempts to start a read IO and finds the first buffer already
has I/O in progress, previously it waited for that I/O to complete before
initiating reads for any of the subsequent buffers.

Although it must wait for the I/O to finish when acquiring the buffer, there's
no reason for it to wait when setting up the read operation. Waiting at this
point prevents starting I/O on subsequent buffers and can significantly reduce
concurrency.

This matters in two workloads:
1) When multiple backends scan the same relation concurrently.
2) When a single backend requests the same block multiple times within the
   readahead distance.

Waiting each time an in-progress read is encountered effectively degenerates
the access pattern into synchronous I/O.

To fix this, when encountering an already in-progress IO for the head buffer,
the wait reference is now recorded and waiting is deferred until
WaitReadBuffers(), when the buffer actually needs to be acquired.

In rare cases, a backend may still need to wait synchronously at IO
start time: If another backend has set BM_IO_IN_PROGRESS on the buffer
but has not yet set the wait reference. Such windows should be brief and
uncommon.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
2026-03-27 19:53:32 -04:00
Andres Freund
74eafeab1a bufmgr: Improve StartBufferIO interface
Until now StartBufferIO() had a few weaknesses:

- As it did not submit staged IOs, it was not safe to call StartBufferIO()
  where there was a potential for unsubmitted IO, which required
  AsyncReadBuffers() to use a wrapper (ReadBuffersCanStartIO()) around
  StartBufferIO().

- With nowait = true, the boolean return value did not allow to distinguish
  between no IO being necessary and having to wait, which would lead
  ReadBuffersCanStartIO() to unnecessarily submit staged IO.

- Several callers needed to handle both local and shared buffers, requiring
  the caller to differentiate between StartBufferIO() and StartLocalBufferIO()

- In a future commit some callers of StartBufferIO() want the BufferDesc's
  io_wref to be returned, to asynchronously wait for in-progress IO

- Indicating whether to wait with the nowait parameter was somewhat confusing
  compared to a wait parameter

Address these issues as follows:

- StartBufferIO() is renamed to StartSharedBufferIO()

- A new StartBufferIO() is introduced that supports both shared and local
  buffers

- The boolean return value has been replaced with an enum, indicating whether
  the IO is already done, already in progress or that the buffer has been
  readied for IO

- A new PgAioWaitRef * argument allows the caller to get the wait reference is
  desired.  All current callers pass NULL, a user of this will be introduced
  subsequently

- Instead of the nowait argument there now is wait

  This probably would not have been worthwhile on its own, but since all these
  lines needed to be touched anyway...

Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-27 19:08:12 -04:00
Heikki Linnakangas
2407c8db15 Fix RequestNamedLWLockTranche in single-user mode
PostmasterContext is not available in single-user mode, use
TopMemoryContext instead. Also make sure that we use the correct
memory context in the lappend().

Author: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/acb_Eo1XtmCO_9z7@nathan
2026-03-28 01:02:11 +02:00
Andres Freund
f39cb8c011 bufmgr: Make UnlockReleaseBuffer() more efficient
Now that the buffer content lock is implemented as part of BufferDesc.state,
releasing the lock and unpinning the buffer can be implemented as a single
atomic operation.

This improves workloads that have heavy contention on a small number of
buffers substantially, I e.g., see a ~20% improvement for pipelined readonly
pgbench on an older two socket machine.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-27 15:56:29 -04:00
Andres Freund
8df3c48e46 Use UnlockReleaseBuffer() in more places
An upcoming commit will make UnlockReleaseBuffer() considerably faster and
more scalable than doing LockBuffer(BUFFER_LOCK_UNLOCK); ReleaseBuffer();. But
it's a small performance benefit even as-is.

Most of the callsites changed in this patch are not performance sensitive,
however some, like the nbtree ones, are in critical paths.

This patch changes all the easily convertible places over to
UnlockReleaseBuffer() mainly because I needed to check all of them anyway, and
reducing cases where the operations are done separately makes the checking
easier.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-27 15:56:29 -04:00
Andres Freund
41d3d64e87 bufmgr: Don't copy pages while writing out
After the series of preceding commits introducing and using
BufferBeginSetHintBits()/BufferSetHintBits16(), hint bits are not set anymore
while IO is going on. Therefore we do not need to copy pages while they are
being written out anymore.

For the same reason XLogSaveBufferForHint() now does not need to operate on a
copy of the page anymore, but can instead use the normal XLogRegisterBuffer()
mechanism. For that the assertions and comments to XLogRegisterBuffer() had to
be updated to allow share-exclusive locked buffers to be registered.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-27 15:56:29 -04:00
Nathan Bossart
d7965d65fc Add rudimentary table prioritization to autovacuum.
Autovacuum workers scan pg_class twice to collect the set of tables
to process.  The first pass is for plain relations and materialized
views, and the second is for TOAST tables.  When the worker finds a
table to process, it adds it to the end of a list.  Later on, it
processes the tables in the same order as the list.  This simple
strategy has worked surprisingly well for a long time, but there
have been many discussions over the years about trying to improve
it.

This commit introduces a scoring system that is used to sort the
aforementioned list of tables to process.  The idea is to have
autovacuum workers prioritize tables that are furthest beyond their
thresholds (e.g., a table nearing transaction ID wraparound should
be vacuumed first).  This prioritization scheme is certainly far
from perfect; there are simply too many possibilities for any
scoring technique to work across all workloads, and the situation
might change significantly between the time we calculate the score
and the time that autovacuum processes it.  However, we have
attemped to develop something that is expected to work for a large
portion of workloads with reasonable parameter settings.

The score is calculated as the maximum of the ratios of each of the
table's relevant values to its threshold.  For example, if the
number of inserted tuples is 100, and the insert threshold for the
table is 80, the insert score is 1.25.  If all other scores are
below that value, the table's score will be 1.25.  The other
criteria considered for the score are the table ages (both
relfrozenxid and relminmxid) compared to the corresponding
freeze-max-age setting, the number of update/deleted tuples
compared to the vacuum threshold, and the number of
inserted/updated/deleted tuples compared to the analyze threshold.

Once exception to the previous paragraph is for tables nearing
wraparound, i.e., those that have surpassed the effective failsafe
ages.  In that case, the relfrozenxid/relminmxid-based score is
scaled aggressively so that the table has a decent chance of
sorting to the front of the list.

To adjust how strongly each component contributes to the score, the
following parameters can be adjusted from their default of 1.0 to
anywhere between 0.0 and 10.0 (inclusive).  Setting all of these to
0.0 restores pre-v19 prioritization behavior:

	autovacuum_freeze_score_weight
	autovacuum_multixact_freeze_score_weight
	autovacuum_vacuum_score_weight
	autovacuum_vacuum_insert_score_weight
	autovacuum_analyze_score_weight

This is intended to be a baby step towards smarter autovacuum
workers.  Possible future improvements include, but are not limited
to, periodic reprioritization, automatic cost limit adjustments,
and better observability (e.g., a system view that shows current
scores).  While we do not expect this commit to produce any
earth-shattering improvements, it is arguably a prerequisite for
the aforementioned follow-up changes.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/aOaAuXREwnPZVISO%40nathan
2026-03-27 10:17:05 -05:00
Heikki Linnakangas
3fd0577728 Refactor PredicateLockShmemInit to not reuse var for different things
The PredicateLockShmemInit function is pretty complicated, and one
source of confusion is that it reuses the same local variable for
sizes of things. Replace the different uses with separate variables
for clarity.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/113724ab-0028-493f-9605-6e8570f0939f@iki.fi
2026-03-27 13:24:34 +02:00
Peter Eisentraut
288ae96872 Add a graph pattern variable only once
An element pattern variable may be repeated in the path pattern.
GraphTableParseState maintains a list of all variable names used in
the graph pattern.  Add a new variable name to that list only when it
is not present already.  This isn't a problem right now, but it could
be in the future.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5tR4O0vjeqTCPr2VB5pYjNYbJgbCBEQf63NtU5Pz1MiOQ%40mail.gmail.com
2026-03-27 10:55:17 +01:00
Heikki Linnakangas
98993150c0 Minor comment fixes to yesterday's LWLock tranche refactoring
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAA5RZ0sLENRM+BicUjQFs_rP38oPx3gm0SsGrD0-jMhhM+HZ_w@mail.gmail.com
2026-03-27 11:44:10 +02:00
Peter Eisentraut
720f0f89d6 Reject consecutive element patterns of same kind
Adding an implicit empty vertex pattern when a path pattern starts or
ends with an edge pattern or when two consecutive edge patterns appear
in the pattern is not supported right now.  Prohibit such path
patterns.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/72a23702-6d96-4103-a54b-057c2352e885%2540eisentraut.org
2026-03-27 10:31:53 +01:00
Heikki Linnakangas
30d432502b Use ShmemInitStruct to allocate lwlock.c's shared memory
It's nice to have them show up in pg_shmem_allocations like all other
shmem areas. ShmemInitStruct() depends on ShmemIndexLock, but only
after postmaster startup.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:51:41 +02:00
Heikki Linnakangas
06d859aaf4 Move ShmemIndexLock into ShmemAllocator
This makes shmem.c independent of the main LWLock array. That makes it
possible to stop passing MainLWLockArray through BackendParameters in
the next commit.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:51:41 +02:00
Heikki Linnakangas
12e3e0f2c8 Use a separate spinlock to protect LWLockTranches
Previously we reused the shmem allocator's ShmemLock to also protect
lwlock.c's shared memory structures. Introduce a separate spinlock for
lwlock.c for the sake of modularity. Now that lwlock.c has its own
shared memory struct (LWLockTranches), this is easy to do.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:50:59 +02:00
Heikki Linnakangas
d6eba30a24 Refactor how user-defined LWLock tranches are stored in shmem
Merge the LWLockTranches and NamedLWLockTrancheRequest data structures
in shared memory into one array of user-defined tranches. The
NamedLWLockTrancheRequest list is now only used in postmaster, to hold
the requests until shared memory is initialized.

Introduce a C struct, LWLockTranches, to hold all the different fields
kept in shared memory. This gives an easier overview of what are all
the things kept in shared memory. Previously, we had separate pointers
for LWLockTrancheNames, LWLockCounter and the (shared memory copy of)
NamedLWLockTrancheRequestArray.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:47:22 +02:00
Heikki Linnakangas
cc88481aeb Rename MAX_NAMED_TRANCHES to MAX_USER_DEFINED_TRANCHES
The "named tranches" term is a little confusing. In most places it
refers to tranches requested with RequestNamedLWLockTranche(), even
though all built-in tranches and tranches allocated with
LWLockNewTrancheId() also have a name. But in MAX_NAMED_TRANCHES, it
refers to tranches requested with either RequestNamedLWLockTranche()
or LWLockNewTrancheId(), as it's the maximum of all of those in total.

The "user defined" term is already used in
LWTRANCHE_FIRST_USER_DEFINED, so let's standardize on that to mean
tranches allocated with either RequestNamedLWLockTranche() or
LWLockNewTrancheId().

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:46:04 +02:00
Robert Haas
26255a3207 Add an alternative_plan_name field to PlannerInfo.
Typically, we have only one PlannerInfo for any given subquery, but
when we are considering a MinMaxAggPath or a hashed subplan, we end
up creating a second PlannerInfo for the same portion of the query,
with a clone of the original range table. In fact, in the MinMaxAggPath
case, we might end up creating several clones, one per aggregate.

At present, there's no easy way for a plugin, such as pg_plan_advice,
to understand the relationships between the original range table and
the copies of it that are created in these cases.  To fix, add an
alternative_plan_name field to PlannerInfo. For a hashed subplan, this
is the plan name for the non-hashed alternative; for minmax aggregates,
this is the plan_name from the parent PlannerInfo; otherwise, it's the
same as plan_name.

Discussion: http://postgr.es/m/CA+TgmoYuWmN-00Ec5pY7zAcpSFQUQLbgAdVWGR9kOR-HM-fHrA@mail.gmail.com
Reviewed-by: Lukas Fittl <lukas@fittl.com>
2026-03-26 16:45:17 -04:00
Andres Freund
8a1a1d6ab8 bufmgr: Restructure AsyncReadBuffers()
Restructure AsyncReadBuffers() to use early return when the head buffer is
already valid, instead of using a did_start_io flag and if/else branches. Also
move around a bit of the code to be located closer to where it is used. This
is a refactor only.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-26 12:07:05 -04:00
Andres Freund
df09452c32 bufmgr: Make buffer hit helper
Already two places count buffer hits, requiring quite a few lines of
code since we do accounting in so many places. Future commits will add
more locations, so refactor into a helper.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-26 12:07:05 -04:00
Andres Freund
c2a68e08b1 bufmgr: Pass io_object and io_context through to PinBufferForBlock()
PinBufferForBlock() is always_inline and called in a loop in
StartReadBuffersImpl(). Previously it computed io_context and io_object
internally, which required calling IOContextForStrategy() -- a non-inline
function the compiler cannot prove is side-effect-free. This could potential
cause unneeded redundant function calls.

Compute io_context and io_object in the callers instead, allowing
StartReadBuffersImpl() to do so once before entering the loop.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-26 12:07:05 -04:00
Andres Freund
cf66978d79 Fix off-by-one error in read IO tracing
AsyncReadBuffer()'s no-IO needed path passed
TRACE_POSTGRESQL_BUFFER_READ_DONE the wrong block number because it had
already incremented operation->nblocks_done. Fix by folding the
nblocks_done offset into the blocknum local variable at initialization.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/u73un3xeljr4fiidzwi4ikcr6vm7oqugn4fo5vqpstjio6anl2%40hph6fvdiiria
Backpatch-through: 18
2026-03-26 10:38:56 -04:00
Robert Haas
47c110f77e Respect disabled_nodes in fix_alternative_subplan.
When my commit e222534679 added the
concept of disabled_nodes, it failed to add a disabled_nodes field
to SubPlan. This is a regression: before that commit, when
fix_alternative_subplan compared the costs of two plans, the number
of disabled nodes affected the result, because it was just a
component of the total cost. After that commit, it no longer did,
making it possible for a disabled path to win on cost over one that
is not disabled. Fix that.

As usual for planner fixes that might destabilize plan choices,
no back-patch.

Discussion: https://postgr.es/m/CA+TgmoaK=4w7-qknUo3QhUJ53pXZq=c=KgZmRyD+k7ytqfmgSg@mail.gmail.com
Reviewed-by: Lukas Fittl <lukas@fittl.com>
2026-03-26 10:25:04 -04:00
Peter Eisentraut
119e791e9c Fix -Wcast-qual warning
This dials back a couple of the qualifiers added by commit
7724cb9935.  Specifically, in match_boolean_partition_clause() the
call to negate_clause() casts away the const, so we shouldn't make the
input argument const.
2026-03-26 15:00:24 +01:00
Fujii Masao
400a790a48 Avoid sending duplicate WAL locations in standby status replies
Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.

As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.

This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
2026-03-26 20:54:32 +09:00
Fujii Masao
eef1ba704d Fix premature NULL lag reporting in pg_stat_replication
pg_stat_replication is documented to keep the last measured lag values for
a short time after the standby catches up, and then set them to NULL when
there is no WAL activity. However, previously lag values could become NULL
prematurely even while WAL activity was ongoing, especially in logical
replication.

This happened because the code cleared lag when two consecutive reply messages
indicated that the apply location had caught up with the send location.
It did not verify that the reported positions were unchanged, so lag could be
cleared even when positions had advanced between messages. In logical
replication, where the apply location often quickly catches up, this issue was
more likely to occur.

This commit fixes the issue by clearing lag only when the standby reports that
it has fully replayed WAL (i.e., both flush and apply locations have caught up
with the send location) and the write/flush/apply positions remain unchanged
across two consecutive reply messages.

The second message with unchanged positions typically results from
wal_receiver_status_interval, so lag values are cleared after that interval
when there is no activity. This avoids showing stale lag data while preventing
premature NULL values.

Even with this fix, lag may rarely become NULL during activity if identical
position reports are sent repeatedly. Eliminating such duplicate messages
would address this fully, but that change is considered too invasive for stable
branches and will be handled in master only later.

Backpatch to all supported branches.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
Backpatch-through: 14
2026-03-26 20:49:31 +09:00
Heikki Linnakangas
6b8238cb6a Refactor ShmemIndex initialization
Initialize the ShmemIndex hash table in InitShmemAllocator() already,
removing the need for the separate InitShmemIndex() step.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-03-26 11:35:55 +02:00
Amit Kapila
735e8fe685 Refactor replorigin_session_setup() for better readability.
Reorder the validation checks in replorigin_session_setup() to provide a
more logical flow. This makes the function easier to follow and ensures
that basic state checks are performed consistently.

Additionally, update an error message to align its phrasing with similar
diagnostics in the replication origin subsystem, improving overall
consistency.

Author: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/e0508305-bc6a-417c-b969-36564d632f9e@iki.fi
2026-03-26 09:15:25 +05:30
Michael Paquier
4287c50fc2 Improve timeout handling of pg_promote()
Previously, pg_promote() looped a fixed number of times, calculated from
the specified timeout, and waited 100ms on a latch, once per iteration,
for the promotion of a standby to complete.  However, unrelated signals
to the backend could set the latch and wake up the backend early,
resulting in a faster consumption of the loops and an execution time of
the function that does not match with the timeout input given in input.
This could be confusing for the function caller, especially if some
backend-side timeout is aggressive, because the function would return
much earlier than expected and report that the promote request has not
completed within the time requested.

This commit refines the logic to track the time actually elapsed, by
looping until the requested duration has truly passed.  The code
calculates the end time we expect, then uses it when looping.

Author: Robert Pang <robertpang@google.com>
Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com>
Discussion: https://postgr.es/m/CAJhEC07OK8J7tLUbyiccnuOXRE7UKxBNqD2-pLfeFXa=tBoWtw@mail.gmail.com
2026-03-26 10:39:40 +09:00
Masahiko Sawada
497c1170cb Add base32hex support to encode() and decode() functions.
This adds support for base32hex encoding and decoding, as defined in
RFC 4648 Section 7. Unlike standard base32, base32hex uses the
extended hex alphabet (0-9, A-V) which preserves the lexicographical
order of the encoded data.

This is particularly useful for representing UUIDv7 values in a
compact string format while maintaining their time-ordered sort
property.

The encode() function produces output padded with '=', while decode()
accepts both padded and unpadded input. Following the behavior of
other encoding types, decoding is case-insensitive.

Suggested-by: Sergey Prokhorenko <sergeyprokhorenko@yahoo.com.au>
Author: Andrey Borodin <x4mmm@yandex-team.ru>
Co-authored-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Илья Чердаков <i.cherdakov.pg@gmail.com>
Reviewed-by: Chengxi Sun <chengxisun92@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com
2026-03-25 11:35:19 -07:00
Álvaro Herrera
c8b4a3ec08
Remove unused autovac_table.at_sharedrel
The last use was removed by commit 38f7831d70.  After that, we compute
MyWorkerInfo->wi_sharedrel directly from the pg_class tuple of the table
being vacuumed rather than passing it around.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/20260325165734.7ab8e4e55fe4c2f1e55031d9@sraoss.co.jp
2026-03-25 18:24:34 +01:00
Peter Eisentraut
bccfc73acd Disable warnings in system headers in MSVC
This is similar to the standard behavior in GCC.  For MSVC, we set all
headers in angle brackets to be considered system headers.  (GCC goes
by path, not include style.)

The required option is available since VS 2017.  (Before VS 2019
version 16.10, the additional option /experimental:external is
required, but per discussion in [0], we effectively require 16.11, so
this shouldn't be a problem.)

[0]: https://www.postgresql.org/message-id/04ab76a3-186c-4a37-8076-e6882ebf9d43%40eisentraut.org

Then, we can remove one workaround for avoiding a warning from a
system header.  (And some warnings to be enabled in the future could
benefit from this.)

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aa73q1aT0A3/vke/%40ip-10-97-1-34.eu-west-3.compute.internal
2026-03-25 15:03:52 +01:00
Peter Eisentraut
5282bf535e Fix some typos and make small stylistic improvements
for commit 2f094e7ac6

Author: zengman <zengman@halodbtech.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-25 09:17:40 +01:00
Peter Eisentraut
c79e414127 Fix typo
Mistake in commit e2f289e5b9: SOFT_ERROR_OCCURRED was called with the
wrong fcinfo field.

Reported-by: Jianghua Yang <yjhjstz@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAZLFmSGti716gWeY%3DDCZ9TTVOixnHZ4_4V4tDzoeE86D64vOA%40mail.gmail.com
2026-03-25 07:09:44 +01:00
Jeff Davis
11f8018ee6 Refactor to remove ForeignServerName().
Callers either have a ForeignServer object or can readily construct
one.

Discussion: https://postgr.es/m/CAExHW5vV5znEvecX=ra2-v7UBj9-M6qvdDzuB78M-TxbYD1PEA@mail.gmail.com
Suggested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
2026-03-24 15:20:28 -07:00
Jeff Davis
f16f5d608c GetSubscription(): use per-object memory context.
Constructing a Subcription object uses a number of small or temporary
allocations. Use a per-object memory context for easy cleanup.

Get rid of FreeSubscription() which did not free all the allocations
anyway. Also get rid of the PG_TRY()/PG_CATCH() logic in
ForeignServerConnectionString() which were used to avoid leaks during
GetSubscription().

Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/xvdjrdqnpap3uq7owbaox3r7p5gf7sv62aaqf2ju3vb6yglatr%40kvvwhoudrlxq
Discussion: https://postgr.es/m/CAA4eK1K=WjZ1maBCmj=5ZdO66AwPORK5ZBxVKedS0xdCcb621A@mail.gmail.com
2026-03-24 15:11:45 -07:00
Melanie Plageman
a881cc9c7e Remove XLOG_HEAP2_VISIBLE entirely
There are no remaining users that emit XLOG_HEAP2_VISIBLE records, so it
can be removed. This includes deleting the xl_heap_visible struct and
all functions responsible for emitting or replaying XLOG_HEAP2_VISIBLE
records.

Bumps XLOG_PAGE_MAGIC because we removed a WAL record type.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-24 17:58:12 -04:00
Melanie Plageman
a759ced2f1 WAL log VM setting for empty pages in XLOG_HEAP2_PRUNE_VACUUM_SCAN
As part of removing XLOG_HEAP2_VISIBLE records, phase I of VACUUM now
marks empty pages all-visible and all-frozen in a
XLOG_HEAP2_PRUNE_VACUUM_SCAN record.

This has no real independent benefit, but empty pages were the last user
of XLOG_HEAP2_VISIBLE, so by making this change we can next remove all
of the XLOG_HEAP2_VISIBLE code.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Earlier version Reviewed-by: Robert Haas <robertmhaas@gmail.com>
2026-03-24 17:30:54 -04:00
Melanie Plageman
1252a4ee28 WAL log VM setting during vacuum phase I in XLOG_HEAP2_PRUNE_VACUUM_SCAN
Vacuum no longer emits a separate WAL record for each page set
all-visible or all-frozen during phase I. Instead, visibility map
updates are now included in the XLOG_HEAP2_PRUNE_VACUUM_SCAN record that
is already emitted for pruning and freezing.

Previously, heap_page_prune_and_freeze() determined whether a page was
all-visible, but the corresponding VM bits were only set later in
lazy_scan_prune(). Now the VM is updated immediately in
heap_page_prune_and_freeze(), at the same time as the heap
modifications. This reduces WAL volume produced by vacuum.

For now, vacuum is still the only user of heap_page_prune_and_freeze()
allowed to set the VM. On-access pruning is not yet able to set the VM.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Earlier version Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-24 16:49:46 -04:00
Robert Haas
dc47beacaa get_memoize_path: Don't exit quickly when PGS_NESTLOOP_PLAIN is unset.
This function exits early in the case where the number of inner rows
is estimated to be less than 2, on the theory that in that case a
Nested Loop with inner Memoize must lose to a plain Nested Loop.
But since commit 4020b370f2 it's
possible for a plain Nested Loop to be disabled, while a Nested Loop
with inner Memoize is still enabled. In that case, this reasoning
is not valid, so adjust the code not to exit early in that case.

This issue was revealed by a test_plan_advice failure on buildfarm
member skink, where NESTED_LOOP_MEMOIZE() couldn't be enforced on
replanning due to this early exit.

Discussion: http://postgr.es/m/CA+TgmoZUN8FT1Ah=m6Uis5bHa4FUa+_hMDWtcABG17toEfpiUg@mail.gmail.com
2026-03-24 16:17:26 -04:00
Melanie Plageman
9ba3ec076a Keep newest live XID up-to-date even if page not all-visible
During pruning, we keep track of the newest xmin of live tuples on the
page visible to all running and future transactions so that we can use
it later as the snapshot conflict horizon when setting the VM if the
page turns out to be all-visible.

Previously, we stopped updating this value once we determined the page
was not all-visible. However, maintaining it even when the page is not
all-visible is inexpensive and makes the snapshot conflict horizon
calculation clearer. This guarantees it won't contain a stale value.

Since we'll keep it up to date all the time now anyway, there's no
reason not to maintain set_all_visible for on-access pruning. This will
allow us to set the VM on-access in the future.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-24 15:37:18 -04:00
Melanie Plageman
dd5716f3c7 Use GlobalVisState in vacuum to determine page level visibility
During vacuum's first and third phases, we examine tuples' visibility to
determine if we can set the page all-visible in the visibility map.

Previously, this check compared tuple xmins against a single XID chosen
at the start of vacuum (OldestXmin). We now use GlobalVisState, which
enables future work to set the VM during on-access pruning, since
ordinary queries have access to GlobalVisState but not OldestXmin.

This also benefits vacuum: in some cases, GlobalVisState may advance
during a vacuum, allowing more pages to become considered all-visible.
And, in the future, we could easily add a heuristic to update
GlobalVisState more frequently during vacuums of large tables.

OldestXmin is still used for freezing and as a backstop to ensure we
don't freeze a dead tuple that wasn't yet prunable according to
GlobalVisState in the rare occurrences where GlobalVisState moves
backwards.

Because comparing a transaction ID against GlobalVisState is more
expensive than comparing against a single XID, we defer this check until
after scanning all tuples on the page. Therefore, we perform the
GlobalVisState check only once per page. This is safe because
visibility_cutoff_xid records the newest live xmin on the page; if it is
globally visible, then the entire page is all-visible.

Using GlobalVisState means on-access pruning can also maintain
visibility_cutoff_xid, which is required to set the visibility map
on-access in the future.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk#c755ef151507aba58471ffaca607e493
2026-03-24 14:50:59 -04:00
Álvaro Herrera
f227b7b20c
Avoid including clog.h in proc.h
The number of .c files that must include access/clog.h can currently be
counted on one's fingers and miss only one (assuming one has the usual
number of hands).  However, due to indirect inclusion via proc.h,
there's a lot of files that are pointlessly including it.  This is easy
to avoid with the easy trick implemented by this commit.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202603221856.iwlhitt6dxxx@alvherre.pgsql
2026-03-24 17:31:16 +01:00
Álvaro Herrera
2102ebb195
Don't include storage/lock.h in so many headers
Since storage/locktags.h was added by commit 322bab7974, many headers
can be made leaner by depending on that instead of on storage/lock.h,
which has many other dependencies.

(In fact, some of these changes were possible even before that.)

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/abvrRZo52Yx9ZzWQ@ip-10-97-1-34.eu-west-3.compute.internal
2026-03-24 17:11:12 +01:00
Álvaro Herrera
5f2350a043
Fix dereference in a couple of GUC check hooks
check_backtrace_functions() and check_archive_directory() were doing an
empty-string check this way:
    *newval[0] == '\0'
which, because of operator precedence, is interpreted as *(newval[0])
instead of (*newval)[0] -- but these variables are pointers to C-strings
and we want to check the first character therein, rather than check the
first pointer of the array, so that interpretation is wrong.  This would
be wrong for any index element other than 0, as evidenced by every other
dereference of the same variable in check_backtrace_functions, which use
parentheses.

Add parentheses to make the intended dereference explicit.

This is just cosmetic at this stage, so no backpatch, although it's been
"wrong" for a long time.

Author: Zhang Hu <kongbaik228@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/CAB5m2QssN6UO+ckr6ZCcV0A71mKUB6WdiTw1nHo43v4DTW1Dfg@mail.gmail.com
2026-03-24 16:45:39 +01:00
Fujii Masao
1c162c965a Report detailed errors from XLogFindNextRecord() failures.
Previously, XLogFindNextRecord() did not return detailed error information
when it failed to find a valid WAL record. As a result, callers such as
the WAL summarizer, pg_waldump, and pg_walinspect could only report generic
errors (e.g., "could not find a valid record after ..."), making
troubleshooting difficult.

This commit fix the issue by extending XLogFindNextRecord() to return
detailed error information on failure, and updating its callers to include
those details in their error messages.

For example, when pg_waldump is run on a WAL file with an invalid magic number,
it now reports not only the generic error but also the specific cause
(e.g., "invalid magic number").

Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Mircea Cadariu <cadariu.mircea@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAO6_XqoxJXddcT4wkd9Xd+cD6Sz-fyspRGuV4Bq-wbXG4pVNzA@mail.gmail.com
2026-03-24 22:33:09 +09:00
Robert Haas
c98ad086ad Bounds-check access to TupleDescAttr with an Assert.
The second argument to TupleDescAttr should always be at least zero
and less than natts; otherwise, we index outside of the attribute
array. Assert that this is the case.

Various violations, or possible violations, of this rule that are
currently in the tree are actually harmless, because while
we do call TupleDescAttr() before verifying that the argument is
within range, we don't actually dereference it unless the argument
was within range all along. Nonetheless, the Assert means we
should be more careful, so tidy up accordingly.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoacixUZVvi00hOjk_d9B4iYKswWP1gNqQ8Vfray-AcOCA@mail.gmail.com
2026-03-24 08:58:50 -04:00
Peter Eisentraut
e2f289e5b9 Make many cast functions error safe
This adjusts many C functions underlying casts to support soft errors.
This is in preparation for a future feature where conversion errors in
casts can be caught.

This patch covers cast functions that can be adjusted easily by
changing ereport to ereturn or making other light changes.  The
underlying helper functions were already changed to support soft
errors some time ago as part of soft error support in type input
functions.

Other casts and types will require some more work and are being kept
as separate patches.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-24 12:08:22 +01:00
Robert Haas
570e2fcc04 Prevent spurious "indexes on virtual generated columns are not supported".
Both of the checks in DefineIndex() that can produce this error
message have a guard against negative attribute numbers, but lack a
guard to ensure that attno is non-zero. As a result, we can index
off the beginning of the TupleDesc and read a garbage byte for
attgenerated. If that byte happens to be 'v', we'll incorrectly
produce the error mentioned above.

The first call site is easy to hit: any attempt to create an
expression index does so. The second one is not currently hit in
the regression tests, but can be hit by something like
CREATE INDEX ON some_table ((some_function(some_table))).

Found by study of a test_plan_advice failure on buildfarm member
skink, though this issue has nothing to do with test_plan_advice
and seems to have only been revealed by happenstance.

Backpatch-through: 18
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoacixUZVvi00hOjk_d9B4iYKswWP1gNqQ8Vfray-AcOCA@mail.gmail.com
2026-03-24 06:28:33 -04:00
Alexander Korotkov
6888658516 Further improve commentary about ChangeVarNodesWalkExpression()
The updated comment explains why we use ChangeVarNodes_walker() instead of
expression_tree_walker(), and provides a bit more detail about the differences
in processing top-level Query and subqueries.

Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAPpHfdvbjq342WTQ705Wmqhe8794pcp7wospz%2BWUJ2qB7vuOqA%40mail.gmail.com
Backpatch-through: 18
2026-03-24 09:54:00 +02:00
Michael Paquier
4019f725f5 Add support for lock statistics in pgstats
This commit adds a new stats kind, called PGSTAT_KIND_LOCK, implementing
statistics for lock tags, as reported by pg_locks.  The implementation
is fixed-sized, as the data is caped based on the number of lock tags in
LockTagType.

The new statistics kind records the following fields, providing insight
regarding lock behavior, while avoiding impact on performance-critical
code paths (such as fast-path lock acquisition):
- waits and wait_time: respectively track the number of times a lock
required waiting and the total time spent acquiring it.  These metrics
are only collected once a lock is successfully acquired and after
deadlock_timeout has been exceeded.
fastpath_exceeded: counts how often a lock could not be acquired via
the fast path due to the max_locks_per_transaction slot limits.

A new view called pg_stat_lock can be used to access this data, coupled
with a SQL function called pg_stat_get_lock().

Bump stat file format PGSTAT_FILE_FORMAT_ID.
Bump catalog version.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aIyNxBWFCybgBZBS%40ip-10-97-1-34.eu-west-3.compute.internal
2026-03-24 15:32:09 +09:00
Michael Paquier
a90d865182 Move some code blocks in lock.c and proc.c
This change will simplify an upcoming change that will introduce lock
statistics, reducting code churn.

This commit means that we begin to calculate the time it took to acquire
a lock after the deadlock check interrupt has run should log_lock_waits
be off, when taken in isolation.  This is not a performance-critical
code path, and note that log_lock_waits is enabled by default since
2aac62be8c.

Extracted from a larger patch by the same author.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aIyNxBWFCybgBZBS@ip-10-97-1-34.eu-west-3.compute.internal
2026-03-24 13:34:54 +09:00
Tom Lane
2e123e3c2b Silence compiler warning from older compilers.
Our RHEL7-vintage buildfarm animals are complaining about
"the comparison will always evaluate as true" for a usage of
SOFT_ERROR_OCCURRED() on a local variable.  This is the same
issue addressed in 7bc88c3d6 and some earlier commits, so solve
it the same way: write "escontext.error_occurred" instead.

Problem dates to recent commit a0b6ef29a, no need for back-patch.
2026-03-23 17:25:12 -04:00
Tom Lane
360dd6f7b4 Improve commentary about ChangeVarNodesWalkExpression().
IMO the proximate cause of the bug fixed in commit 07b7a964d
was sloppy thinking about what ChangeVarNodesWalkExpression()
is to be used for.  Flesh out its header comment to try to
improve that situation.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1607553.1774017006@sss.pgh.pa.us
Backpatch-through: 18
2026-03-23 11:14:24 -04:00
Michael Paquier
93b76db0ac Fix invalid value of pg_aios.pid, function pg_get_aios()
When the value of pg_aios.pid is found to be 0, the function had the
idea to set "nulls" to "false" instead of "true", without setting the
value stored in the tuplestore.  This could lead to the display of buggy
data.  The intention of the code is clearly to display NULL when a PID
of 0 is found, and this commit adjusts the logic to do so.

Issue introduced by 60f566b4f2.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by:  Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/tencent_7D61A85D6143AD57CA8D8C00DEC541869D06@qq.com
Backpatch-through: 18
2026-03-23 18:13:56 +09:00
Michael Paquier
ded9754804 Add missing deflateEnd() for server-side gzip base backups
The gzip basebackup sink called deflateInit2() in begin_archive() but
never called deflateEnd(), leaking zlib's internal compression state
(~256KB per archive) until the memory context of the base backup is
destroyed.

The code tree has already a matching deflateEnd() call for each
deflateInit[2]() call (pgrypto, etc.), except for the file touched in
this commit, so this brings more consistency for all the compression
methods.  The server-side LZ4 and zstd implementations require a
dedicated cleanup callback as they allocate their state outside the
context of a palloc().

As currently used, deflateInit2() is called once per tablespace in a
single backup.  Memory would slightly bloat only when dealing with many
tablespaces at once, not across multiple base backups so this is not
worth a backpatch.  This change could matter for future uses of this
code.

zlib allows the definition of memory allocation and free callbacks in
the z_stream object given to a deflateInit[2]().  The base backup
backend code relies on palloc() for the allocations and deflateEnd()
internally only cleans up memory (no fd allocation for example).

Author: Jianghua Yang <yjhjstz@gmail.com>
Discussion: https://postgr.es/m/CAAZLFmQNJ0QNArpWEOZXwv=vbumcWKEHz-b1me5gBqRqG67EwQ@mail.gmail.com
2026-03-23 09:04:44 +09:00
Peter Geoghegan
e5836f7b7d Add fake LSN support to hash index AM.
Use fake LSNs in all hash AM critical sections that write a WAL record.
This gives us a reliable way (a way that works during scans of both
logged and unlogged relations) to detect when an index page was
concurrently modified during the window between when the page is
initially read (by _hash_readpage) and when the page has any known-dead
items LP_DEAD-marked (by _hash_kill_items).

Preparation for an upcoming patch that makes the hash index AM use the
amgetbatch interface, enabling I/O prefetching during hash index scans.

The amgetbatch design imposes certain rules on index AMs with respect to
how they hold on to index page buffer pins (at least in the case of pins
held as an interlock against unsafe concurrent TID recycling by VACUUM).
These rules have consequences for routines that set LP_DEAD bits on
index tuples from an amgetbatch index AM: such routines have an inherent
need to reason about concurrent TID recycling by VACUUM, but can no
longer rely on their amgettuple routine holding on to a buffer pin
(during the aforementioned window) as an interlock against such
recycling.  Instead, they have to follow a new, standardized approach.

The new approach taken by amgetbatch index AMs when setting LP_DEAD bits
is heavily based on the current nbtree dropPin design, which was added
by commit 2ed5b87f.  It also works by checking if the page's LSN
advanced during the window where unsafe concurrent TID recycling might
have taken place.

This commit is similar to commit 8a879119, which taught nbtree to use
fake LSNs to improve its dropPin behavior.  However, unlike that commit,
this is not an independently useful enhancement, since hash doesn't
implement anything like nbtree's dropPin behavior (not yet).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
2026-03-22 17:31:43 -04:00
Melanie Plageman
01b7e4a46d Add pruning fast path for all-visible and all-frozen pages
Because of the SKIP_PAGES_THRESHOLD optimization or a stale prune XID,
heap_page_prune_and_freeze() can be invoked for pages with no pruning or
freezing work to do. To avoid this, if a page is already all-frozen or
it is all-visible and no freezing will be attempted, exit early. We
can't exit early if vacuum passed DISABLE_PAGE_SKIPPING, though.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-22 15:46:50 -04:00
Peter Geoghegan
f026fbf059 Make IndexScanInstrumentation a pointer in executor scan nodes.
Change the IndexScanInstrumentation fields in IndexScanState,
IndexOnlyScanState, and BitmapIndexScanState from inline structs to
pointers.  This avoids additional space overhead whenever new fields are
added to IndexScanInstrumentation in the future, at least in the common
case where the instrumentation isn't used (i.e. when the executor node
isn't being run through an EXPLAIN ANALYZE).

Preparation for an upcoming patch series that will add index
prefetching.  The new slot-based interface that will enable index
prefetching necessitates that we add at least one more field to
IndexScanInstrumentation (to count heap fetches during index-only
scans).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
2026-03-22 13:20:29 -04:00
Melanie Plageman
4f7ecca84d Detect and fix visibility map corruption in more cases
Move VM corruption detection and repair into heap page pruning. This
allows VM repair during on-access pruning, not only during vacuum.

Also, expand corruption detection to cover pages marked all-visible that
contain dead tuples and tuples inserted or deleted by in-progress
transactions, rather than only all-visible pages with LP_DEAD items.

Pinning the correct VM page before on-access pruning is cheap when
compared to the cost of actually pruning. The vmbuffer is saved in the
scan descriptor, so a query should only need to pin each VM page once,
and a single VM page covers a large number of heap pages.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-22 11:52:40 -04:00
Heikki Linnakangas
516310ed4d Don't reset 'latest_page_number' when replaying multixid truncation
'latest_page_number' is set to the correct value, according to
nextOffset, early at system startup. Contrary to the comment, it hence
should be set up correctly by the time we get to WAL replay.

This was committed to back-branches earlier already (commit
817f74600d), to fix a bug in a backwards-compatibility codepath. We
don't have that bug on 'master', but the change nevertheless makes
sense on 'master' too.

Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/20260214090150.GC2297@p46.dedyn.io;lightning.p46.dedyn.io
Discussion: https://www.postgresql.org/message-id/e1787b17-dc93-4621-a5a1-c713d1ac6a1b@iki.fi
2026-03-22 14:23:54 +02:00
Jeff Davis
4a0b46b6e1 Fix dependency on FDW's connection function.
Missed in commit 8185bb5347.

Catalog version bump.

Discussion: https://postgr.es/m/fd49b44dc65da8e71ab20c1cf1ec7e65921c20f5.camel@j-davis.com
2026-03-20 12:42:59 -07:00
Nathan Bossart
48f11bfa06 Bump transaction/multixact ID warning limits to 100M.
These warning limits were last changed to 40M by commit cd5e82256d.
For the benefit of workloads that rapidly consume transactions or
multixacts, this commit bumps the limits to 100M.  This will
hopefully give users enough time to react.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://postgr.es/m/aRdhSSFb9zZH_0zc%40nathan
2026-03-20 14:15:33 -05:00
Nathan Bossart
e646450e60 Add percentage of available IDs to wraparound warnings.
This commit adds DETAIL messages to the existing wraparound
WARNINGs that include the percentage of transaction/multixact IDs
that remain available for use.  The hope is that this more clearly
expresses the urgency of the situation.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://postgr.es/m/aRdhSSFb9zZH_0zc%40nathan
2026-03-20 14:15:33 -05:00
Tom Lane
733f20df53 Discount the metapage when estimating number of index pages visited.
genericcostestimate() estimates the number of index leaf pages to
be visited as a pro-rata fraction of the total number of leaf pages.
Or at least that was the intention.  What it actually used in the
calculation was the total number of index pages, so that non-leaf
pages were also counted.  In a decent-sized index the error is
probably small, since we expect upper page fanout to be high.
But in a small index that's not true; in the worst case with one
data-bearing page plus a metapage, we had 100% relative error.
This led to surprising planning choices such as not using a small
partial index.

To fix, ask genericcostestimate's caller to supply an estimate of
the number of non-leaf pages, and subtract that.  For the built-in
index AMs, it seems sufficient to count the index metapage (if the
AM uses one) as non-leaf.  Per the above argument, counting upper
index pages shouldn't change the estimate much, and in most cases
we don't have any easy way of estimating the number of upper pages.
This might be an area for further research in future.

Any external genericcostestimate callers that do not set the new field
GenericCosts.numNonLeafPages will see the same behavior as before,
assuming they followed the advice to zero out that whole struct.

Unsurprisingly, this change affects a number of plans seen in the
core regression tests.  I hacked up the existing tests to keep the
tests' plans the same, since in each case it appeared that the
test's intent was to test exactly that plan.  Also add one new
test case demonstrating that a better index choice is now made.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://postgr.es/m/870521.1745860752@sss.pgh.pa.us
2026-03-20 14:50:53 -04:00
Alexander Korotkov
07b7a964d3 Fix self-join removal to update bare Var references in join clauses
Self-join removal failed to update Var nodes when the join clause was a
bare Var (e.g., ON t1.bool_col) rather than an expression containing
Vars.  ChangeVarNodesWalkExpression() used expression_tree_walker(),
which descends into child nodes but does not process the top-level node
itself.  When a bare Var referencing the removed relation appeared as
the clause, its varno was left unchanged, leading to "no relation entry
for relid N" errors.

Fix by calling ChangeVarNodes_walker() directly instead of
expression_tree_walker(), so the top-level node is also processed.

Bug: #19435
Reported-by: Hang Ammmkilo <ammmkilo@163.com>
Author: Andrei Lepikhov <lepihov@gmail.com>
Co-authored-by: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/19435-3cc1a87f291129f1%40postgresql.org
Backpatch-through: 18
2026-03-20 15:46:30 +02:00
Álvaro Herrera
e7975f1c06
SET NOT NULL: Call object-alter hook only after the catalog change
... otherwise, the function invoked by the hook might consult the
catalog and not see that the new constraint exists.

This relies on set_attnotnull doing CommandCounterIncrement()
after successfully modifying the catalog.

Oversight in commit 14e87ffa5c.

Author: Artur Zakirov <zaartur@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CAKNkYnxUPCJk-3Xe0A3rmCC8B8V8kqVJbYMVN6ySGpjs_qd7dQ@mail.gmail.com
2026-03-20 14:38:50 +01:00
Andrew Dunstan
4c0390ac53 Add option force_array for COPY JSON FORMAT
This adds the force_array option, which is available exclusively
when using COPY TO with the JSON format.

When enabled, this option wraps the output in a top-level JSON array
(enclosed in square brackets with comma-separated elements), making the
entire result a valid single JSON value.  Without this option, the
default behavior is to output a stream of independent JSON objects.

Attempting to use this option with COPY FROM or with formats other than
JSON will raise an error.

Author: Joe Conway <mail@joeconway.com>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
2026-03-20 08:40:17 -04:00
Andrew Dunstan
7dadd38cda json format for COPY TO
This introduces the JSON format option for the COPY TO command, allowing
users to export query results or table data directly as a stream of JSON
objects (one per line, NDJSON style).

The JSON format is currently supported only for COPY TO operations; it
is not available for COPY FROM.

JSON format is incompatible with some standard text/CSV formatting
options, including HEADER, DEFAULT, NULL, DELIMITER, FORCE QUOTE,
FORCE NOT NULL, and FORCE NULL.

Column list support is included: when a column list is specified, only
the named columns are emitted in each JSON object.

Regression tests covering valid JSON exports and error handling for
incompatible options have been added to src/test/regress/sql/copy.sql.

Author: Joe Conway <mail@joeconway.com>
Author: jian he <jian.universality@gmail.com>
Co-Authored-By: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Davin Shearer <davin@apache.org>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
2026-03-20 08:40:04 -04:00
Andrew Dunstan
a2145605ee introduce CopyFormat, refactor CopyFormatOptions
Currently, the COPY command format is determined by two boolean fields
(binary, csv_mode) in CopyFormatOptions.  This approach, while
functional, isn't ideal for implementing other formats in the future.

To simplify adding new formats, introduce a CopyFormat enum.  This makes
the code cleaner and more maintainable, allowing for easier integration
of additional formats down the line.

Author: Joel Jacobson <joel@compiler.org>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
2026-03-20 08:21:57 -04:00
Amit Kapila
493f8c6439 Add support for EXCEPT TABLE in ALTER PUBLICATION.
Following commit fd366065e0, which added EXCEPT TABLE support to
CREATE PUBLICATION, this commit extends ALTER PUBLICATION to allow
modifying the exclusion list.

New Syntax:
ALTER PUBLICATION name SET  publication_all_object [, ... ]

where publication_all_object is one of:
ALL TABLES [ EXCEPT TABLE ( except_table_object [, ... ] ) ]
ALL SEQUENCES

If the EXCEPT clause is provided, the existing exclusion list in
pg_publication_rel is replaced with the specified relations. If the
EXCEPT clause is omitted, any existing exclusions for the publication
are cleared. Similarly, SET ALL SEQUENCES updates

Note that because this is a SET command, specifying only one object
type (e.g., SET ALL SEQUENCES) will reset the other unspecified flags
(e.g., setting puballtables to false).

Consistent with CREATE PUBLICATION, only root partitioned tables or
standard tables can be specified in the EXCEPT list. Specifying a
partition child will result in an error.

Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
2026-03-20 11:36:09 +05:30
David Rowley
07d5bffe75 Fix new tuple deforming code so it can support cstrings again
In c456e3911, I mistakenly thought that the deformer code would never
see cstrings and that I could use pg_assume() to have the compiler omit
producing code for attlen == -2 attributes.  That saves bloating the
deforming code a bit with the extra check and strlen() call.  While this
is ok to do for tuples from the heap, it's not ok to do for
MinimalTuples as those *can* contain cstrings and
tts_minimal_getsomeattrs() implements deforming by inlining the
(slightly misleadingly named) slot_deform_heap_tuple() code.

To fix, add a new parameter to the slot_deform_heap_tuple() and have the
callers define which code to inline.  Because this new parameter is
passed as a const, the compiler can choose to emit or not emit the
cstring-related code based on the parameter's value.

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNmSK+gKziAt_WvQoMVWt3_LRVMmRYY9dAbMPMcpPV0QmA@mail.gmail.com
2026-03-20 14:16:06 +13:00
Jeff Davis
703fee3b25 Fix dependency on FDW handler.
ALTER FOREIGN DATA WRAPPER could drop the dependency on the handler
function if it wasn't explicitly specified.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/35c44a4b7fb76d35418c4d66b775a88f4ce60c86.camel@j-davis.com
Backpatch-through: 14
2026-03-19 15:07:43 -07:00
Masahiko Sawada
adcdbe9386 Add parallel vacuum worker usage to VACUUM (VERBOSE) and autovacuum logs.
This commit adds both the number of parallel workers planned and the
number of parallel workers actually launched to the output of
VACUUM (VERBOSE) and autovacuum logs.

Previously, this information was only reported as an INFO message
during VACUUM (VERBOSE), which meant it was not included in autovacuum
logs in practice. Although autovacuum does not yet support parallel
vacuum, a subsequent patch will enable it and utilize these logs in
its regression tests. This change also improves observability by
making it easier to verify if parallel vacuum is utilizing the
expected number of workers.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
2026-03-19 15:01:47 -07:00
Masahiko Sawada
ba21f5bf8a Allow explicit casting between bytea and uuid.
This enables the use of functions such as encode() and decode() with
UUID values, allowing them to be converted to and from alternative
formats like base64 or hex.

The cast maps the 16-byte internal representation of a UUID directly
to a bytea datum. This is more efficient than going through a text
forepresentation.

Bump catalog version.

Author:	Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Co-authored-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com
2026-03-19 13:51:50 -07:00
Tom Lane
1811f1af98 Improve hash join's handling of tuples with null join keys.
In a plain join, we can just summarily discard an input tuple
with null join key(s), since it cannot match anything from
the other side of the join (assuming a strict join operator).
However, if the tuple comes from the outer side of an outer join
then we have to emit it with null-extension of the other side.

Up to now, hash joins did that by inserting the tuple into the hash
table as though it were a normal tuple.  This is unnecessarily
inefficient though, since the required processing is far simpler than
for a potentially-matchable tuple.  Worse, if there are a lot of such
tuples they will bloat the hash bucket they go into, possibly causing
useless repeated attempts to split that bucket or increase the number
of batches.  We have a report of a large join vainly creating many
thousands of batches when faced with such input.

This patch improves the situation by keeping such tuples out of the
hash table altogether, instead pushing them into a separate tuplestore
from which we return them later.  (One might consider trying to return
them immediately; but that would require substantial refactoring, and
it doesn't work anyway for cases where we rescan an unmodified hash
table.)  This works even in parallel hash joins, because whichever
worker reads a null-keyed tuple can just return it; there's no need
for consultation with other workers.  Thus the tuplestores are local
storage even in a parallel join.

A pre-existing buglet that I noticed while analyzing the code's
behavior is that ExecHashRemoveNextSkewBucket fails to decrement
hashtable->skewTuples for tuples moved into the main hash table
from the skew hash table.  This invalidates ExecHashTableInsert's
calculation of the number of main-hash-table tuples, though probably
not by a lot since we expect the skew table to be small relative
to the main one.  Nonetheless, let's fix that too while we're here.

Bug: #18909
Reported-by: Sergey Koposov <Sergey.Koposov@ed.ac.uk>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/3061845.1746486714@sss.pgh.pa.us
2026-03-19 15:21:36 -04:00
Nathan Bossart
dd1398f137 Allow choosing specific grantors via GRANT/REVOKE ... GRANTED BY.
Except for GRANT and REVOKE on roles, the GRANTED BY clause
currently only accepts the current role to match the SQL standard.
And even if an acceptable grantor (i.e., the current role) is
specified, Postgres ignores it and chooses the "best" grantor for
the command.  Allowing the user to select a specific grantor would
allow better control over the precise behavior of GRANT/REVOKE
statements.  This commit adds that ability.  For consistency with
select_best_grantor(), we only permit choosing grantor roles for
which the current role inherits privileges.

Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/aRYLkTpazxKhnS_w%40nathan
2026-03-19 11:41:39 -05:00
Robert Haas
6f0738ddec dshash: Make it possible to suppress out of memory errors
Introduce dshash_find_or_insert_extended, which is just like
dshash_find_or_insert except that it takes a flags argument.
Currently, the only supported flag is DSHASH_INSERT_NO_OOM, but
I have chosen to use an integer rather than a boolean in case we
end up with more flags in the future.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoaJwUukUZGu7_yL74oMTQQz2=zqucMhF9+9xBmSC5us1w@mail.gmail.com
2026-03-19 11:51:17 -04:00
Tom Lane
5a2043bf71 Fix transient memory leakage in jsonpath evaluation.
This patch reimplements JsonValueList to be more space-efficient
and arranges for temporary JsonValueLists created during jsonpath
evaluation to be freed when no longer needed, rather than being
leaked till the end of the function evaluation cycle as before.

The motivation is to prevent indefinite memory bloat while
evaluating jsonpath expressions that traverse a lot of data.
As an example, this query
  SELECT
    jsonb_path_query((SELECT jsonb_agg(i) FROM generate_series(1,10000) i),
                     '$[*] ? (@ < $)');
formerly required about 6GB to execute, with the space required
growing quadratically with the length of the input array.
With this patch the memory consumption stays static.  (The time
required is still quadratic, but we can't do much about that: this
path expression asks to compare each array element to each other one.)

The bloat happens because we construct a JsonValueList containing all
the array elements to represent the second occurrence of "$", and then
just leak it after evaluating the filter expression for any one value
generated from "$[*]".  If I were implementing this functionality from
scratch I'd probably try to avoid materializing that representation at
all, but changing that now looks like more trouble than it's worth.
This patch takes the more conservative approach of just making sure
we free the list after we're done with it.

The existing representation of JsonValueList is neither especially
compact nor especially easy to free: it's a List containing pointers
to separately-palloc'd JsonbValue structs.  We could theoretically
use list_free_deep, but it's not 100% clear that all the JsonbValues
are always safe for us to free.  In any case we are talking about a
lot of palloc/pfree traffic if we keep it like this.  This patch
replaces that with what's essentially an expansible array of
JsonbValues, so that even a long list requires relatively few
palloc requests.  Also, for the very common case that only one or
two elements appear in the list, this representation uses *zero*
pallocs: the elements can be kept in the on-the-stack base struct.

Note that we are only interested in freeing the JsonbValue structs
themselves.  While many types of JsonbValue include pointers to
external data such as strings or numerics, we expect that that data
is part of the original jsonb input Datum(s) and need not (indeed
cannot) be freed here.

In this reimplementation, JsonValueListAppend() always copies the
supplied JsonbValue struct into the JsonValueList data.  This allows
simplifying and regularizing many call sites that sometimes palloc'd
JsonbValues and sometimes passed a local-variable JsonbValue.  Always
doing the latter is simpler, faster, and less bug-prone.

I also removed JsonValueListLength() in favor of constant-time tests
for whether the list has zero, one, or more than one member, which is
what the callers really need to know.  JsonValueListLength() was not
a hot code path, so this aspect of the patch won't move the needle in
the least performance-wise.  But it seems neater.

I've not done any wide-ranging performance testing, but this should
be faster than the old code thanks to reduction of palloc overhead.
On the specific example shown above, it's about twice as fast as
before on not-very-large inputs; and of course it wins big if you
consider an input large enough to drive the old code into swapping.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/569394.1773783211@sss.pgh.pa.us
2026-03-19 11:37:14 -04:00
Peter Eisentraut
7724cb9935 Add some const qualifiers enabled by typeof_unqual change on copyObject
The recent commit to change copyObject() to use typeof_unqual allows
cleaning up some APIs to take advantage of this improved qualifier
handling.  EventTriggerCollectSimpleCommand() is a good example: It
takes a node tree and makes a copy that it keeps around for its
internal purposes, but it can't communicate via its function signature
that it promises not scribble on the passed node tree.  That is now
fixed.

Reviewed-by: David Geier <geidav.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
2026-03-19 06:35:54 +01:00
David Rowley
c95cd2991f Short-circuit row estimation in NOT IN containing NULL consts
ScalarArrayOpExpr used for either NOT IN or <>/= ALL, when the array
contains a NULL constant, will never evaluate to true.  Here we add an
explicit short-circuit in scalararraysel() to account for this and return
0.0 rows when we see that a NULL exists.  When the array is a constant,
we can very quickly see if there are any NULL values and return early
before going to much effort in scalararraysel().  For non-const arrays,
we short-circuit after finding the first NULL and forego selectivity
estimations of any remaining elements.

In the future, it might be better to do something for this case in
constant folding.  We would need to be careful to only do this for
strict operators on expressions located in places that don't care about
distinguishing false from NULL returns. i.e. EXPRKIND_QUAL expressions.
Doing that requires a bit more thought and effort, so here we just fix
some needlessly slow selectivity estimations for ScalarArrayOpExpr
containing many array elements and at least one NULL.

Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/eaa2598c-5356-4e1e-9ec3-5fd6eb1cd704@tantorlabs.com
2026-03-19 17:16:36 +13:00
Michael Paquier
79a5911fe6 Add more debugging information for bgworker termination tests of worker_spi
widowbird has failed again after af8837a10b, with the same symptoms of
a backend still lying around when attempting a database rename with a
bgworker connected to the database being renamed.

We are still not sure yet how the failure can be reached, if this is a
timing issue in the test or an actual bug in the logic used for
interruptible bgworkers.  This commit adds more debugging information in
the backend to help with the analysis as a temporary measure.

Another thing I have noticed is that the queries launching the dynamic
bgworkers or checking pg_stat_activity would connect to the database
renamed.  These are switched to use 'postgres'.  That will hopefully
remove some of the friction of the test, but I doubt that this is the
end of the story.

Discussion: https://postgr.es/m/abtJLEAsf1HZXWdR@paquier.xyz
2026-03-19 11:39:31 +09:00
Daniel Gustafsson
4f433025f6 ssl: Serverside SNI support for libpq
Support for SNI was added to clientside libpq in 5c55dc8b47 with the
sslsni parameter, but there was no support for utilizing it serverside.
This adds support for serverside SNI such that certificate/key handling
is available per host.  A new config file, $datadir/pg_hosts.conf, is
used for configuring which certificate and key should be used for which
hostname.  In order to use SNI the ssl_sni GUC must be set to on, when
it is off the ssl configuration works just like before.  If ssl_sni is
enabled and pg_hosts.conf is non-empty it will take precedence over
the regular SSL GUCs, if it is empty or missing the regular GUCs will
be used just as before this commit with no hostname specific handling.
The TLS init hook is not compatible with ssl_sni since it operates on
a single TLS configuration and SNI break that assumption.  If the init
hook and ssl_sni are both enabled, a WARNING will be issued.

Host configuration can either be for a literal hostname to match, non-
SNI connections using the no_sni keyword or a default fallback matching
all connections.  By omitting no_sni and the fallback a strict mode
can be achieved where only connections using sslsni=1 and a specified
hostname are allowed.

CRL file(s) are applied from postgresql.conf to all configured hostnames.

Serverside SNI requires OpenSSL, currently LibreSSL does not support
the required infrastructure to update the SSL context during the TLS
handshake.

Author: Daniel Gustafsson <daniel@yesql.se>
Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Dewei Dai <daidewei1970@163.com>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/1C81CD0D-407E-44F9-833A-DD0331C202E5@yesql.se
2026-03-18 12:37:11 +01:00
Peter Eisentraut
905e44152a Allow setting the collation strength in ICU tailoring rules
There was a bug that if you created an ICU collation with tailoring
rules, any strength specification inside the rules was ignored.  This
was because we called ucol_openRules() with UCOL_DEFAULT_STRENGTH for
the strength argument, which overrides the strength.  This was because
of faulty guidance in the ICU documentation, which has since been
fixed.  The correct invocation is to use UCOL_DEFAULT for the strength
argument.

This fixes bug #18771 and bug #19425.

Author: Daniel Verite <daniel@manitou-mail.org>
Reported-by: Ruben Ruiz <ruben.ruizcuadrado@gmail.com>
Reported-by: dorian.752@live.fr
Reported-by: Todd Lang <Todd.Lang@D2L.com>
Discussion: https://www.postgresql.org/message-id/flat/YT2PPF959236618377A072745A280E278F4BE1DA@YT2PPF959236618.CANPRD01.PROD.OUTLOOK.COM
Discussion: https://www.postgresql.org/message-id/flat/18771-98bb23e455b0f367@postgresql.org
Discussion: https://www.postgresql.org/message-id/flat/19425-58915e19dacd4f40%40postgresql.org
2026-03-18 08:58:47 +01:00
Andrew Dunstan
3b4c2b9db2 Allow IS JSON predicate to work with domain types
The IS JSON predicate only accepted the base types text, json, jsonb, and
bytea.  Extend it to also accept domain types over those base types by
resolving through getBaseType() during parse analysis.

The base type OID is stored in the JsonIsPredicate node (as exprBaseType)
so the executor can dispatch to the correct validation path without
repeating the domain lookup at runtime.

When a non-supported type (or domain over a non-supported type) is used,
the error message displays the original type name as written by the user,
rather than the resolved base type.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CACJufxEk34DnJFG72CRsPPT4tsJL9arobX0tNPsn7yH28J=zQg@mail.gmail.com
2026-03-17 15:20:22 -04:00
Andres Freund
f5eb854ab6 Fix use of wrong variable in _hash_kill_items()
In 82467f627b I somehow ended up using 'so->currPos.buf' instead of the 'buf'
variable, which is incorrect when the buffer is not already pinned. At the
very least this can lead to assertion failures

Unfortunately this shows that this code path was not covered. Expand
src/test/modules/index/specs/killtuples.spec to test it.  Until now the
'result' step always reported either a 0 or 1 buffer accesses, but when
exercising hash overflows, more buffers are accessed.  To avoid depending on
the precise number of accesses, change the result step to return whether there
were any heap accesses. That makes the change a lot more verbose, but still
seems worth it.

Reported-by: Alexander Kuzmenkov <akuzmenkov@tigerdata.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/vjtmvwvbxt7w5uyacxpzibpj65ewcb7uqaqbhd4arvnjbp5jqz%405ksdh6fsyqve
Discussion: https://postgr.es/m/b9de8d05-3b02-4a27-9b0b-03972fa4bfd3@iki.fi
2026-03-17 14:54:41 -04:00
Andrew Dunstan
ecd9288624 make immutability tests in to_json and to_jsonb complete
Complete the TODOs in to_json_is_immutable() and to_jsonb_is_immutable()
by recursing into container types (arrays, composites, ranges, multiranges,
domains) to check element/sub-type mutability, rather than conservatively
returning "mutable" for all arrays and composites.

The shared logic is factored into a single json_check_mutability() function
in jsonfuncs.c, with the existing exported functions as thin wrappers.
Composite type inspection uses lookup_rowtype_tupdesc() (typcache) instead
of relation_open() to avoid unnecessary lock acquisition in the optimizer.

Range and multirange types are now also checked recursively: if the
subtype's conversion is immutable, the range is considered immutable
for JSON purposes, even though range_out is generically marked STABLE.
This is a behavioral change: range types with immutable subtypes (e.g.,
int4range) can now appear in expression indexes via JSON_ARRAY/JSON_OBJECT,
whereas previously they were conservatively rejected.

Add regression tests for JSON_ARRAY and JSON_OBJECT mutability with
expression indexes and generated columns, covering arrays, composites,
domains, ranges, multiranges and combinations thereof.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CACJufxFz=OsXQdsMJ-cqoqspD9aJrwntsQP-U2A-UaV_M+-S9g@mail.gmail.com
Commitfest: https://commitfest.postgresql.org/patch/5759
2026-03-17 11:28:33 -04:00
Nathan Bossart
3b88e50d6c Add more columns to pg_stats, pg_stats_ext, and pg_stats_ext_exprs.
This commit adds table OID and attribute number columns to
pg_stats, and it adds table OID and statistics object OID columns
to pg_stats_ext and pg_stats_ext_exprs.  A proposed follow-up
commit would use pg_stats.tableid to simplify a query in pg_dump.
The others have no immediate purpose but may be useful later.

Bumps catversion.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM%3DcoCVy92QkVUUTLdo5eO2bMDtwMrzRn_8miAhX%2BuPaqXg%40mail.gmail.com
2026-03-17 09:26:27 -05:00
Peter Eisentraut
c9babbc881 Dump labels in reproducible order
In pg_get_propgraphdef(), sort the labels before writing out, for a
consistent dump order.  Also, since we now have a list, we can get rid
of the separate table scan to get the count.

Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Co-authored-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-17 14:07:29 +01:00
Michael Paquier
233e6ae953 gen_guc_tables.pl: Improve detection of inconsistent data
This commit adds two improvements to gen_guc_tables.pl:
1) When finding two entries with the same name, the script complained
about these being not in alphabetical order, which was confusing.
Duplicated entries are now reported as their own error.
2) While the presence of the required fields is checked for all the
parameters, the script did not perform any checks on the non-required
fields.  A check is added to check that any field defined matches with
what can be accepted.  Previously, a typo in the name of a required
field would cause the field to be reported as missing.  Non-mandatory
fields would be silently ignored, which was problematic as we could lose
some information.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAN4CZFP=3xUoXb9jpn5OWwicg+rbyrca8-tVmgJsQAa4+OExkw@mail.gmail.com
2026-03-17 17:38:55 +09:00
Michael Paquier
1a7ccd2b33 Refactor some code around ALTER TABLE [NO] INHERIT
[NO] INHERIT is not supported for partitioned tables, but this portion
of tablecmds.c did not apply the same rules as the other sub-commands,
checking the relkind in the execution phase, not the preparation phase.

This commit refactors the code to centralize the relkind and other
checks in the preparation phase for both command patterns, getting rid
of one translatable string on the way.  ATT_PARTITIONED_TABLE is
removed from ATSimplePermissions(), and the child relation is checked
the same way for both sub-commands.  The ALTER TABLE patterns that now
fail at preparation failed already at execution, hence there should be
no changes from the user perspective except more consistent error
messages generated.

Some comments at the top of ATPrepAddInherit() were incorrect,
CreateInheritance() being the routine checking the columns and
constraints between the parent and its to-be-child.

Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAEoWx2kggo1N2kDH6OSfXHL_5gKg3DqQ0PdNuL4LH4XSTKJ3-g@mail.gmail.com
2026-03-17 14:34:29 +09:00
David Rowley
d8a859d22b Reduce size of CompactAttribute struct to 8 bytes
Previously, this was 16 bytes.  With the use of some bitflags and by
reducing the attcacheoff field size to a 16-bit type, we can halve the
size of the struct.

It's unlikely that caching the offsets for offsets larger than what will
fit in a 16-bit int will help much as the tuple is very likely to have
some non-fixed-width types anyway, the offsets of which we cannot cache.

Shrinking this down to 8 bytes helps by accessing fewer cachelines when
performing tuple deformation.  The fields used there are all fully
fledged fields, which don't require any bitmasking to extract the value
of.  It also helps to more efficiently calculate the address of a
compact_attrs[] element in TupleDesc as the x86 LEA instruction can work
with 8 byte offsets, which allows the element address to be calculated
from the TupleDesc's address in a single instruction using LEA's
concurrent shift and add.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com
2026-03-17 15:06:31 +13:00
Fujii Masao
d927b4bd97 Fix WAL flush LSN used by logical walsender during shutdown
Commit 6eedb2a5fd made the logical walsender call
XLogFlush(GetXLogInsertRecPtr()) to ensure that all pending WAL is flushed,
fixing a publisher shutdown hang. However, if the last WAL record ends at
a page boundary, GetXLogInsertRecPtr() can return an LSN pointing past
the page header, which can cause XLogFlush() to report an error.

A similar issue previously existed in the GiST code. Commit b1f14c9672
introduced GetXLogInsertEndRecPtr(), which returns a safe WAL insertion end
location (returning the start of the page when the last record ends at a page
boundary), and updated the GiST code to use it with XLogFlush().

This commit fixes the issue by making the logical walsender use
XLogFlush(GetXLogInsertEndRecPtr()) when flushing pending WAL during shutdown.

Backpatch to all supported versions.

Reported-by: Andres Freund <andres@anarazel.de>
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/vzguaguldbcyfbyuq76qj7hx5qdr5kmh67gqkncyb2yhsygrdt@dfhcpteqifux
Backpatch-through: 14
2026-03-17 08:10:20 +09:00
David Rowley
7a2ab122a1 Fix thinko in nocachegetattr() and nocache_index_getattr()
This code was recently adjusted by c456e3911, but that commit didn't get
the logic correct when finding the attnum to start walking the tuple in.
If there is a NULL, we need to start walking the tuple before it.

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNnb-s_=VdVUZ9h7dPA0u3hxV8x2aU3obZytnqQZ_MiROA@mail.gmail.com
2026-03-17 09:00:39 +13:00
Álvaro Herrera
fba4233c83
Reduce header inclusions via execnodes.h
Remove a bunch of #include lines from execnodes.h.  Most of these
requier suitable typedefs to be added, so that it still compiles
standalone.  In one case, the fix is to move a struct definition to the
one .c file where it is needed.

Also some light clean up in plannodes.h and genam.h, though not as
extensive as in execnodes.h.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/202603131240.ihwqdxnj7w2o@alvherre.pgsql
2026-03-16 14:34:57 +01:00