28616 commits

Author SHA1 Message Date
Fujii Masao
5770679918 Remove redundant SetLatch() calls in interrupt handling functions
Interrupt handling functions (e.g., HandleCatchupInterrupt(),
HandleParallelApplyMessageInterrupt()) are called only by
procsignal_sigusr1_handler(), which already calls SetLatch()
for the current process at the end of its processing.
Therefore, these interrupt handling functions do not need to
call SetLatch() themselves.

However, previously, some of these functions redundantly
called SetLatch(). This commit removes those unnecessary
calls.

While duplicate SetLatch() calls are redundant, they are
harmless, so this change is not backpatched.

Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/CALj2ACWd5apddj6Cd885WwJ6LquYu_G81C4GoR4xSoDV1x-FEA@mail.gmail.com
2026-04-02 23:55:30 +09:00
Tomas Vondra
7f8c88c2b8 jit: Change the default to off.
While JIT can speed up large analytical queries, it can also cause
serious performance issues on otherwise very fast queries. Compiling
and optimizing the expressions may be so expensive that it completely
outweighs the JIT benefits for shorter queries.

Ideally, we'd address this in the cost model, but the part deciding
whether to enable JIT for a query is rather simple, partially because we
don't have any reliable estimates of how expensive the LLVM compilation
and optimization is.

Sometimes seemingly unrelated changes (for example a couple additional
INSERTs into a table) increase the cost just enough to enable JIT,
resulting in a performance cliff.

Because of these risks, most large-scale deployments already disable JIT
by default. Notably, this includes all hyperscalers.

This commit changes our default to align with that established practice.
If we improve the JIT (be it better costing or cheaper execution), we
can consider enabling it by default again.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/DG1VZJEX1AQH.2EH4OKGRUDB71@jeltef.nl
2026-04-02 13:40:29 +02:00
Thomas Munro
de6b80e5ff jit: Stop emitting lifetime.end for LLVM 22.
The lifetime.end intrinsic can now only be used for stack memory
allocated with alloca[1][2][3].  We use it to tell LLVM about the
lifetime of function arguments/isnull values that we keep in palloc'd
memory, so that it can avoid spilling registers to memory.

We might need to rearrange things and put them on the stack, but that'll
take some research.  In the meantime, unbreak the build on LLVM 22.

[1] https://github.com/llvm/llvm-project/pull/149310
[2] https://llvm.org/docs/LangRef.html#llvm-lifetime-end-intrinsic
[3] https://llvm.org/docs/LangRef.html#i-alloca

Backpatch-through: 14
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> (earlier attempt)
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> (earlier attempt)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier attempt)
Discussion: https://postgr.es/m/CA%2BhUKGJTumad75o8Zao-LFseEbt%3DenbUFCM7LZVV%3Dc8yg2i7dg%40mail.gmail.com
2026-04-02 15:52:48 +13:00
David Rowley
331d829e62 Fix nocachegetattr() so it again supports deforming cstrings
c456e3911 added various optimizations to the tuple deformation routines.
One optimization assumed that heap tuples would never contain cstrings.
That optimization also made its way into nocachegetattr(), which isn't
correct as ROW() types get formed into HeapTuples by ExecEvalRow() and
those can contain cstring Datums.  nocachegetattr() gets used to extract
Datums from those tuples.

Here we remove the pg_assume(), which was there to instruct the compiler
to omit the attlen == -2 related code in att_addlength_pointer().

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/80aeac57-8f50-4732-a5b4-c2373c3f8149@gmail.com
2026-04-02 14:11:17 +13:00
Andres Freund
6e36930f9a read_stream: Prevent distance from decaying too quickly
Until now we reduced the look-ahead distance by 1 on every hit, and doubled it
on every miss. That is problematic because there are very common IO patterns
where this prevents us from ever reaching a sufficiently high distance (e.g. a
miss followed by a hit will never have the distance grow beyond 2). In many
such cases, if we had ever reached a sufficient look-ahead distance, things
would have been fine, because we grow the distance faster than we decrease it.

One might think that the most obvious answer to this problem would be to never
reduce the distance. However, that would not work well, as (particularly with
upcoming users of read streams), it is reasonably common to at first have a
lot of misses and then to transition to a fully cached workload, e.g. because
the same blocks are needed repeatedly within one stream. Doing unnecessarily
deep readahead can be costly, due to having to pin a lot more buffers, which
increases CPU overhead.

Because the cost of a synchronously handled miss can be very high (multiple
milliseconds for every IO with commonly used storage) compared to the CPU
overhead of keeping the distance too high, we want to err on the side of not
reducing the distance too early.

The insight that a decrease of the distance by 1 at every hit may be ok at
large distances, but not at low distances, shows a way out: If we only allow
decreasing the distance once there were no misses for our maximum look-ahead
distance, we will keep the distance high as long as readahead has a chance to
do IO asynchronously, but not commonly when not.
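
Illustration (not part of the commit): the two policies described above can be compared with a toy model. This is a minimal sketch under stated assumptions, not the read_stream.c code: the struct, the function names, and the reading of "no misses for our maximum look-ahead distance" as "max_distance consecutive hits" are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: names and structure are illustrative, not read_stream.c. */
typedef struct DistanceModel
{
    int distance;        /* current look-ahead distance, always >= 1 */
    int max_distance;    /* upper bound for the distance */
    int hits_since_miss; /* consecutive hits since the last miss */
} DistanceModel;

static void
model_init(DistanceModel *m, int max_distance)
{
    assert(max_distance >= 1);
    m->distance = 1;
    m->max_distance = max_distance;
    m->hits_since_miss = 0;
}

/* Double the distance on a miss, clamped to the maximum. */
static void
grow(DistanceModel *m)
{
    m->distance *= 2;
    if (m->distance > m->max_distance)
        m->distance = m->max_distance;
}

/* Old policy: double on every miss, decrease by one on every hit. */
static void
old_policy(DistanceModel *m, bool hit)
{
    if (!hit)
        grow(m);
    else if (m->distance > 1)
        m->distance--;
}

/* New policy (assumed reading): decay only after max_distance hits in a row. */
static void
new_policy(DistanceModel *m, bool hit)
{
    if (!hit)
    {
        m->hits_since_miss = 0;
        grow(m);
        return;
    }
    if (m->hits_since_miss < m->max_distance)
        m->hits_since_miss++;
    if (m->hits_since_miss >= m->max_distance && m->distance > 1)
        m->distance--;
}

/* Feed 'rounds' repetitions of miss-then-hit; return the peak distance seen. */
static int
peak_after_alternating(void (*policy) (DistanceModel *, bool),
                       int max_distance, int rounds)
{
    DistanceModel m;
    int peak;

    model_init(&m, max_distance);
    peak = m.distance;
    for (int i = 0; i < rounds; i++)
    {
        policy(&m, false);      /* miss */
        if (m.distance > peak)
            peak = m.distance;
        policy(&m, true);       /* hit */
    }
    return peak;
}
```

With an alternating miss/hit pattern and a maximum of 16, the old policy in this model never exceeds a distance of 2, while the new one climbs to 16 and stays there until a long run of hits occurs.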

Several folks have written variants of this patch, including at least Thomas
Munro, Melanie Plageman and me.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CA+hUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 19:50:03 -04:00
Andres Freund
cceb1bf45e read_stream: Issue IO synchronously while in fast path
While in fast-path, execute any IO that we might encounter synchronously.
Because we are, in that moment, not reading ahead, dispatching any occasional
IO to workers has the dispatch overhead, without any realistic chance of the
IO completing before we need it.

This helps io_method=worker performance for workloads that have only
occasional cache misses, but where those occasional misses still take long
enough to matter.  It is likely this is only measurable with fast local
storage or workloads with the data in the kernel page cache, as with remote
storage the IO latency, not the dispatch-to-worker latency, is the determining
factor.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 19:22:44 -04:00
Heikki Linnakangas
1bdbb211bb Make ShmemIndex visible in the pg_shmem_allocations view
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-04-01 23:56:51 +03:00
Álvaro Herrera
db89a47115 Give an 'options' parameter to tuple_delete/_update
The tuple_insert() method already has an equivalent argument, so this
makes sense just on consistency grounds, for future growth.

table_delete() can immediately use it to carry the 'changingPart'
boolean; for table_update we don't have any options at present.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> (older version)
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/202603171606.kf6pmhscqbqz@alvherre.pgsql
2026-04-01 20:26:57 +02:00
Peter Eisentraut
8e72d914c5 Add UPDATE/DELETE FOR PORTION OF
This is an extension of the UPDATE and DELETE commands to do a
"temporal update/delete" based on a range or multirange column.  The
user can say UPDATE t FOR PORTION OF valid_at FROM '2001-01-01' TO
'2002-01-01' SET ... (or likewise with DELETE) where valid_at is a
range or multirange column.

The command is automatically limited to rows overlapping the targeted
portion, and only history within those bounds is changed.  If a row
represents history partly inside and partly outside the bounds, then
the command truncates the row's application time to fit within the
targeted portion, then it inserts one or more "temporal leftovers":
new rows containing all the original values, except with the
application-time column changed to only represent the untouched part
of history.

To compute the temporal leftovers that are required, we use the *_minus_multi
set-returning functions defined in 5eed8ce50c.
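
Illustration (not part of the commit): for a plain range, the split into a touched part and "temporal leftovers" can be sketched as below. This is a simplified model, assuming half-open integer intervals; the Span and PortionResult types and split_for_portion() are hypothetical, and multirange columns (which may need more than two leftovers) are not covered.

```c
#include <assert.h>
#include <stdbool.h>

/* Half-open interval [lo, hi) over plain ints; a stand-in for a range value. */
typedef struct Span
{
    int lo;
    int hi;
} Span;

typedef struct PortionResult
{
    bool overlaps;      /* false: the row is not affected by the command */
    Span touched;       /* part of the row inside the targeted portion */
    int  nleftovers;    /* 0, 1 or 2 */
    Span leftovers[2];  /* untouched parts of the row, in ascending order */
} PortionResult;

/* Split a row's application time against the targeted portion. */
static PortionResult
split_for_portion(Span row, Span target)
{
    PortionResult r = {0};

    assert(row.lo < row.hi && target.lo < target.hi);

    /* Rows that do not overlap the targeted portion are left alone. */
    if (row.hi <= target.lo || target.hi <= row.lo)
        return r;

    r.overlaps = true;
    r.touched.lo = row.lo > target.lo ? row.lo : target.lo;
    r.touched.hi = row.hi < target.hi ? row.hi : target.hi;

    /* History before the targeted portion survives as a leftover row. */
    if (row.lo < r.touched.lo)
        r.leftovers[r.nleftovers++] = (Span) {row.lo, r.touched.lo};

    /* Likewise for history after the targeted portion. */
    if (r.touched.hi < row.hi)
        r.leftovers[r.nleftovers++] = (Span) {r.touched.hi, row.hi};

    return r;
}
```

For example, a row valid over [2000, 2010) updated FOR PORTION OF [2001, 2002) is truncated to [2001, 2002), and two leftover rows covering [2000, 2001) and [2002, 2010) keep the original values.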

- Added bison support for FOR PORTION OF syntax.  The bounds must be
  constant, so we forbid column references, subqueries, etc. We do
  accept functions like NOW().
- Added logic to executor to insert new rows for the "temporal
  leftover" part of a record touched by a FOR PORTION OF query.
- Documented FOR PORTION OF.
- Added tests.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com
2026-04-01 19:06:03 +02:00
Álvaro Herrera
ec2f81766a Fix vicinity of tuple_insert to use uint32, not int, for options
Oversight in commit 1bd6f22f43: I was way too optimistic about the
compiler letting me know what variables needed to be updated, and missed
a few of them.  Clean it up.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/40E570EE-5A60-49D8-B8F7-2F8F2B7C8DFA@gmail.com
2026-04-01 18:14:51 +02:00
Dean Rasheed
f7f4052a4e Add support for extended statistics on virtual generated columns.
This allows both univariate and multivariate statistics to be built on
virtual generated columns and expressions that refer to virtual
generated columns. The restriction disallowing extended statistics on
a single column is lifted in the case of a single virtual generated
column, since it is treated as a single expression.

In the catalogs, references to virtual generated columns are stored
as-is. They are expanded at ANALYZE time to build the statistics, and
at planning time to allow the optimizer to make use of the statistics.
This allows the statistics to be correctly rebuilt using ANALYZE, if a
column's generation expression is altered (which causes any existing
statistics data to be deleted).

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/20250422181006.dd6f9d1d81299f5b2ad55e1a@sraoss.co.jp
2026-04-01 17:02:24 +01:00
Andres Freund
513374a47a bufmgr: Return whether WaitReadBuffers() needed to wait
Thanks to the previous commit, pgaio_wref_check_done() will now detect whether
IO has completed even if userspace has not yet consumed the kernel completion.
This knowledge can be useful for callers of WaitReadBuffers() to know whether
it needed to wait or not, e.g. for adjusting read-ahead aggressiveness or for
instrumentation.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
Discussion: https://postgr.es/m/a177a6dd-240b-455a-8f25-aca0b1c08c6e@vondra.me
2026-04-01 09:26:43 -04:00
Andres Freund
6e648e353f aio: io_uring: Allow IO methods to check if IO completed in the background
Until now pgaio_wref_check_done() with io_method=io_uring would not detect if
IOs are known to have completed to the kernel, but the completion has not yet
been consumed by userspace.  This can lead to inferior performance and also
makes it harder to use smarter feedback logic in read_stream, because we
cannot use knowledge about whether an IO completed to control the readahead
distance.

This commit just adds the io_uring specific infrastructure. Later commits will
return whether a wait was needed from WaitReadBuffers() and then use that
knowledge.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/f3xxfrkafjxpyqxywcxricxgyizjirfceychyxsgn7bwjp5eda@kwbduhy7tfmu
Discussion: https://postgr.es/m/CAH2-Wz%3DkMg3PNay96cHMT0LFwtxP-cQSRZTZzh1Cixxf8G%3Dzrw%40mail.gmail.com
2026-04-01 09:26:43 -04:00
Amit Langote
edee563456 Make FastPathMeta self-contained by copying FmgrInfo structs
FastPathMeta stored pointers into ri_compare_cache entries via
compare_entries[], creating a dependency on that cache remaining
stable.  If ri_compare_cache entries were invalidated after fpmeta
was populated, the pointers would dangle.

Replace compare_entries[] with inline copies of the two FmgrInfo
fields actually needed (cast_func_finfo and eq_opr_finfo), copied
at populate time via fmgr_info_copy().  fpmeta now depends only on
riinfo remaining valid, which is already handled by the invalidation
callback.

Introduced by commit 2da86c1ef9 ("Add fast path for foreign key
constraint checks"), noticed while reviewing code for robustness
under CLOBBER_CACHE_ALWAYS.

Discussion: https://postgr.es/m/CA+HiwqFQ+ZA7hSOygv4uv_t75B3r0_gosjadetCsAEoaZwTu6g@mail.gmail.com
2026-04-01 18:43:40 +09:00
Amit Langote
e484b0eea6 Fix two issues in fast-path FK check introduced by commit 2da86c1ef9
First, under CLOBBER_CACHE_ALWAYS, the RI_ConstraintInfo entry can
be invalidated by relcache callbacks triggered inside table_open()
or index_open(), leaving ri_FastPathCheck() calling
ri_populate_fastpath_metadata() with a stale entry whose valid flag
is false.  Fix by moving the fpmeta initialization to after
ri_CheckPermissions(), reloading riinfo first to ensure it is
valid, then calling ri_ExtractValues() and build_index_scankeys()
immediately after before any further operations that could trigger
invalidation.

Second, fpmeta allocated in TopMemoryContext was not freed when the
entry was invalidated in InvalidateConstraintCacheCallBack(),
leaking memory each time the constraint cache entry was recycled.
Fix by freeing and NULLing fpmeta at invalidation time.

Noticed locally when testing with CLOBBER_CACHE_ALWAYS.

Discussion: https://postgr.es/m/CA+HiwqGBU__7-VZZhQWQ3EQuwLYNPd9==ngnzduhGWKHMj9mvw@mail.gmail.com
2026-04-01 17:30:33 +09:00
John Naylor
f6bd9f0fe2 Skip common prefixes during radix sort
During the counting step, keep track of the bits that are the same
for the entire input.  If we counted only a single distinct byte,
the next recursion will start at the next byte position that has
more than one distinct byte in the input. This allows us to skip over
multiple passes where the byte is the same for the entire input.

This provides a significant speedup for integers that have some upper
bytes with all-zeros or all-ones, which is common.
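
Illustration (not part of the commit): the prefix-skipping idea can be sketched with a small MSD radix sort over unsigned 64-bit keys. This is not PostgreSQL's sort code; the function names are hypothetical, and using an OR/AND accumulator to find the bits that differ is one assumed way of tracking "the bits that are the same".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sort a[0..n) by bytes level..0 (most significant first), using tmp[0..n). */
static void
radix_sort_msd(uint64_t *a, uint64_t *tmp, size_t n, int level)
{
    while (n > 1 && level >= 0)
    {
        size_t   count[256] = {0};
        size_t   start[256];
        size_t   next[256];
        size_t   pos = 0;
        uint64_t all_or = 0;
        uint64_t all_and = UINT64_MAX;
        uint64_t diff;

        /* Counting step: also accumulate which bits vary across the input. */
        for (size_t i = 0; i < n; i++)
        {
            count[(a[i] >> (level * 8)) & 0xFF]++;
            all_or |= a[i];
            all_and &= a[i];
        }
        diff = all_or ^ all_and;    /* 1-bits mark positions that differ */

        if (((diff >> (level * 8)) & 0xFF) == 0)
        {
            /* Only one distinct byte here: skip the common prefix. */
            if (diff == 0)
                return;             /* all keys are identical */
            do
                level--;
            while (((diff >> (level * 8)) & 0xFF) == 0);
            continue;               /* recount at the first varying byte */
        }

        /* Distribution step: stable scatter into buckets, then copy back. */
        for (int b = 0; b < 256; b++)
        {
            start[b] = pos;
            pos += count[b];
        }
        memcpy(next, start, sizeof(start));
        for (size_t i = 0; i < n; i++)
            tmp[next[(a[i] >> (level * 8)) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof(uint64_t));

        /* Recurse into each bucket on the next less significant byte. */
        for (int b = 0; b < 256; b++)
            if (count[b] > 1)
                radix_sort_msd(a + start[b], tmp + start[b], count[b], level - 1);
        return;
    }
}

/* Convenience wrapper that allocates the scratch buffer. */
static void
sort_u64(uint64_t *a, size_t n)
{
    uint64_t *tmp;

    if (n < 2)
        return;
    tmp = malloc(n * sizeof(uint64_t));
    assert(tmp != NULL);
    radix_sort_msd(a, tmp, n, 7);
    free(tmp);
}
```

For small non-negative integers, the upper bytes are all zero, so this sketch jumps directly from byte 7 to the first byte that varies instead of running one pass per identical byte.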

Reviewed-by: Chengpeng Yan <chengpeng_yan@outlook.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/CANWCAZYpGMDSSwAa18fOxJGXaPzVdyPsWpOkfCX32DWh3Qznzw@mail.gmail.com
2026-04-01 14:18:57 +07:00
Fujii Masao
21b018e7ea Reduce log level of some logical decoding messages from LOG to DEBUG1
Previously some logical decoding messages (e.g., "logical decoding found
consistent point") were logged at level LOG, even though they provided
low-level, developer-oriented information that DBAs were typically not
interested in.

Since these messages can occur routinely (for example, when repeatedly calling
pg_logical_slot_get_changes() to obtain the changes from logical decoding),
logging them at LOG can be overly verbose.

This commit reduces their log level to DEBUG1 to avoid unnecessary log noise.

This change applies to a small set of messages for now. Additional messages
may be adjusted similarly in the future.

Even with this change, if these messages from walsender still need to be
observed, enabling DEBUG1 logging selectively for walsender (e.g.,
log_min_messages = 'warning,walsender:debug1') would be helpful to avoid
increasing overall log volume.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwGTyHgtD9tyN664x6vQ8Q1G53H7ZUCgBU9_X=nLt3f1QA@mail.gmail.com
2026-04-01 15:43:02 +09:00
Peter Eisentraut
76f4b92bac Use standard C23 and C++ attributes if available
Use the standard C23 and C++ attributes [[nodiscard]], [[noreturn]],
and [[maybe_unused]], if available.

This makes pg_nodiscard and pg_attribute_unused() available in
not-GCC-compatible compilers that support C23 as well as in C++.

For pg_noreturn, we can now drop the GCC-specific and MSVC-specific
fallbacks, because the C11 and the C++ implementation will now cover
all required cases.

Note, in a few places, we need to change the position of the attribute
because it's not valid in that place in C23.
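
Illustration (not part of the commit): a feature-test pattern of the kind described might look like the sketch below. The demo_* macro names are hypothetical and are not PostgreSQL's definitions of pg_nodiscard or pg_attribute_unused(); the fallback spellings are assumptions for GCC-compatible compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Prefer the standard spelling when the compiler reports support for it. */
#if defined(__has_c_attribute)
#if __has_c_attribute(nodiscard)
#define demo_nodiscard [[nodiscard]]
#endif
#if __has_c_attribute(maybe_unused)
#define demo_maybe_unused [[maybe_unused]]
#endif
#endif

/* Otherwise fall back to the GCC-compatible spelling, or to nothing. */
#ifndef demo_nodiscard
#if defined(__GNUC__)
#define demo_nodiscard __attribute__((warn_unused_result))
#else
#define demo_nodiscard
#endif
#endif

#ifndef demo_maybe_unused
#if defined(__GNUC__)
#define demo_maybe_unused __attribute__((unused))
#else
#define demo_maybe_unused
#endif
#endif

/*
 * The attribute goes at the very start of the declaration, a position that
 * is valid for both the standard and the GCC-style spelling.
 * Returns 0 on success and -1 if 'c' is not a decimal digit.
 */
demo_nodiscard static int
demo_parse_digit(char c, demo_maybe_unused int flags, int *out)
{
    assert(out != NULL);
    if (c < '0' || c > '9')
        return -1;
    *out = c - '0';
    return 0;
}
```

A caller that ignores the return value of demo_parse_digit() should get a compiler warning under either spelling; the 'flags' parameter is deliberately unused to show the second macro.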

Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw
2026-04-01 08:15:02 +02:00
Amit Kapila
6b0550c45d Fix miscellaneous issues in EXCEPT publication clause.
Improve documentation regarding multiple publications and partition
hierarchies. Refine error reporting for excluded relations. Consolidate
docs by using table_object instead of expanded table syntax in publication
commands. Also includes minor test cleanup and naming fixes.

Reported-by: Peter Smith <smithpb2250@gmail.com>
Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CALDaNm1CiBYcteE_jjPA4BPHfX30dg9eTTTkJgkjY5tgE7t=bQ@mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
2026-04-01 09:13:43 +05:30
Andres Freund
c0af4eb4e7 bufmgr: Fix ordering of checks in PinBuffer()
The check for skip_if_not_valid added in 819dc118c0 was put at the start of
the loop. In theory, a CAS loop does allow making that check in a race-free
manner. However, just after the check, there's a
    old_buf_state = WaitBufHdrUnlocked(buf);
which introduces a race, because it would allow BM_VALID to be cleared, after
the skip_if_not_valid check.

Fix by restarting the loop after WaitBufHdrUnlocked().

Reported-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Discussion: https://postgr.es/m/5bf667f3-5270-4b19-a08f-0facbecdff68@postgrespro.ru
2026-03-31 19:24:58 -04:00
Jacob Champion
e020a897ef oauth: Don't log discovery connections by default
Currently, when the client sends a parameter discovery request within
OAUTHBEARER, the server logs the attempt with

    FATAL:  OAuth bearer authentication failed for user

These log entries are difficult to distinguish from true authentication
failures, and by default, libpq sends a discovery request as part of
every OAuth connection, making them annoyingly noisy. Use the new
PG_SASL_EXCHANGE_ABANDONED status to suppress them.

Patch by Zsolt Parragi, with some additional comments added by me.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFPim7hUiyb7daNKQPSZ8CvQRBGkVhbvED7yZi8VktSn4Q%40mail.gmail.com
2026-03-31 11:47:33 -07:00
Jacob Champion
c4ff16339f sasl: Allow backend mechanisms to "abandon" exchanges
Introduce PG_SASL_EXCHANGE_ABANDONED, which allows CheckSASLAuth to
suppress the failing log entry for any SASL exchange that isn't actually
an authentication attempt. This is desirable for OAUTHBEARER's discovery
exchanges (and a subsequent commit will make use of it there).

This might have some overlap in the future with in-band aborts for SASL
exchanges, but it's intentionally not named _ABORTED to avoid confusion.
(We don't currently support clientside aborts in our SASL profile.)

Adapted from a patch by Zsolt Parragi.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFPim7hUiyb7daNKQPSZ8CvQRBGkVhbvED7yZi8VktSn4Q%40mail.gmail.com
2026-03-31 11:47:31 -07:00
Jacob Champion
c2bca7cc96 Add FATAL_CLIENT_ONLY to ereport/elog
SASL exchanges must end with either an AuthenticationOk or an
ErrorResponse from the server, and the standard way to produce an
ErrorResponse packet is for auth_failed() to call ereport(FATAL). This
means that there's no way for a SASL mechanism to suppress the server
log entry if the "authentication attempt" was really just a query for
authentication metadata, as is done with OAUTHBEARER.

Following the example of 1f9158ba4, add a FATAL_CLIENT_ONLY elevel. This
will allow ClientAuthentication() to choose not to log a particular
failure, while still correctly ending the authentication exchange before
process exit.

(The provenance of this patch is convoluted: since it's a mechanical
copy-paste of 1f9158ba4, both Zsolt Parragi and I produced nearly
identical versions independently, and Andrey Borodin reviewed Zsolt's
version. Tom Lane is the author of 1f9158ba4, but I don't want to imply
that he's signed off on this adaptation. See Discussion.)

Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CAN4CZFPim7hUiyb7daNKQPSZ8CvQRBGkVhbvED7yZi8VktSn4Q%40mail.gmail.com
2026-03-31 11:47:29 -07:00
Nathan Bossart
771fe0948c Avoid including vacuum.h in tableam.h and heapam.h.
Commit 2252fcd427 modified some function prototypes in tableam.h
and heapam.h to take a VacuumParams argument instead of a pointer,
which required including vacuum.h in those headers.  vacuum.h has a
reasonably large dependency tree, and headers like tableam.h are
widely included, so this is not ideal.  To fix, change the
functions in question to accept a "const VacuumParams *" argument
instead.  That allows us to use a forward declaration for
VacuumParams and avoid including vacuum.h.  Since vacuum_rel()
needs to scribble on the params argument, we still pass it by value
to that function so that the original struct is not modified.
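
Illustration (not part of the commit): the forward-declaration technique can be sketched in a single file. The Demo* names and struct fields are hypothetical; the comments mark which parts would live in the widely included header and which in the heavier one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * What a widely included header needs: a forward declaration is enough to
 * declare functions that take a pointer to the struct.
 */
typedef struct DemoVacuumParams DemoVacuumParams;
extern int demo_table_vacuum(const DemoVacuumParams *params);

/*
 * What the heavier header provides: the full definition, needed only by code
 * that looks inside the struct or passes it by value.
 */
struct DemoVacuumParams
{
    int  freeze_min_age;
    bool truncate;
};

/* Takes the struct by value, so changes here never reach the caller's copy. */
static int
demo_vacuum_rel(DemoVacuumParams params)
{
    params.freeze_min_age = 0;  /* scribble on the local copy only */
    return demo_table_vacuum(&params);
}

/* Read-only access through the const pointer. */
int
demo_table_vacuum(const DemoVacuumParams *params)
{
    assert(params != NULL);
    return params->freeze_min_age;
}
```

Passing by value would require the complete type at the point of the prototype, which is what forces the extra include; a const pointer avoids that while still promising the callee will not modify the struct.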

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/rzxpxod4c4la62yvutyrvgoyilrl2fx55djaf2suidy7np5m6c%403l2ln476eadh
2026-03-31 12:43:52 -05:00
Tom Lane
fb7a9050d5 Doc: improve explanation of GiST compress/decompress methods.
The docs previously didn't explain that leaf and non-leaf keys
could be treated differently, even though many of our opclasses
do exactly that.  It also wasn't explained how that relates to
the STORAGE option, particularly since only one storage type
can be specified for both leaf and non-leaf keys.

While here, reorganize the text slightly, rather than sticking
additional detail into what's supposed to be a brief summary
paragraph.

Author: Paul A Jungwirth <pj@illuminatedcomputing.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA+renyWs5Np+FLSYfL+eu20S4U671A3fQGb-+7e22HLrD1NbYw@mail.gmail.com
2026-03-31 11:23:26 -04:00
Heikki Linnakangas
7b424e3108 Change the signature of dynahash's alloc function
Instead of passing the current memory context to the alloc function
via a shared variable, pass it directly as an argument.
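
Illustration (not part of the commit): the general pattern of handing allocator state to a callback as an argument, rather than through a shared variable, might look like this. The typedef, struct and function names are hypothetical and do not reflect dynahash's actual signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for a memory context: it only tracks how much was requested. */
typedef struct DemoContext
{
    size_t bytes_allocated;
} DemoContext;

/* The callback receives its state explicitly instead of reading a global. */
typedef void *(*demo_alloc_fn) (size_t size, void *alloc_arg);

static void *
demo_context_alloc(size_t size, void *alloc_arg)
{
    DemoContext *cxt = alloc_arg;

    assert(cxt != NULL);
    cxt->bytes_allocated += size;
    return malloc(size);
}

/* The table remembers both the callback and the argument to pass to it. */
typedef struct DemoHash
{
    demo_alloc_fn alloc;
    void         *alloc_arg;
} DemoHash;

static void *
demo_hash_alloc_entry(DemoHash *h, size_t entry_size)
{
    assert(h != NULL && h->alloc != NULL);
    return h->alloc(entry_size, h->alloc_arg);
}
```

Because nothing has to be set up in a shared variable before the call, the callback cannot accidentally pick up state left behind by an unrelated caller.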

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-03-31 16:55:03 +03:00
Heikki Linnakangas
dde69621c3 Remove HASH_SEGMENT option
It's been unused forever. There's no urgency in removing it now, but
it was just something that caught my eye.

Aleksander Alekseev proposed this a long time ago [0], but Tom Lane
was worried about third-party extensions using it. I believe that's a
non-issue: I tried grepping through all extensions found on github and
didn't find any references to HASH_SEGMENT.

[0] https://www.postgresql.org/message-id/20160418180711.55ac82c0@fujitsu

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-03-31 16:45:28 +03:00
Peter Eisentraut
a0dd0702e4 Fix cross variable references in graph pattern causing segfault
When converting the WHERE clause in an element pattern,
generate_query_for_graph_path() calls replace_property_refs() to
replace the property references in it.  Only the current graph element
pattern is passed as the context for replacement.  If there are
references to variables from other element patterns, it causes a
segmentation fault (an assertion failure in an assert-enabled build)
since it does not find the path_element object corresponding to those
variables.

We do not support forward and backward variable references within a
graph table clause.  Hence prohibit all the cross references.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reported-by: Man Zeng <zengman@halodbtech.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5u6AoDfNg4%3DR5eVJn_bJn%3DC%3DwVPrto02P_06fxy39fniA%40mail.gmail.com
2026-03-31 11:47:19 +02:00
Peter Eisentraut
c5b3253b8a Property references are preferred over regular column references
When a ColumnRef can be resolved as both a graph table property
reference and a lateral table column reference, prefer the graph table
property reference, since element pattern variables in the GRAPH_TABLE
clause form the innermost namespace.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5u6AoDfNg4%3DR5eVJn_bJn%3DC%3DwVPrto02P_06fxy39fniA%40mail.gmail.com
2026-03-31 11:47:19 +02:00
Amit Langote
68a8601ee9 Fix use-after-free in ri_LoadConstraintInfo
conindid was read from conForm after ReleaseSysCache(tup).  Move
the read to before the release.

Introduced by commit 2da86c1ef9.

Per buildfarm member prion.

Discussion: https://postgr.es/m/CA+HiwqGGYjN6F2oL7yAk=hvSs-sj3TPqZ9JC9iyLkCqJadECrw@mail.gmail.com
2026-03-31 17:04:44 +09:00
Daniel Gustafsson
097ab69d17 Formalize WAL record for XLOG_CHECKPOINT_REDO
XLOG_CHECKPOINT_REDO only contains the wal_level copied straight in
without an encapsulating record structure. While it works, it makes
future uses of XLOG_CHECKPOINT_REDO hard as there is nowhere to put
new data items.  This fix was inspired by the online checksums
patch, which adds data to this record, but this change has value on
its own.

Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/c92b5d8b-bc03-47bc-b209-2e4a719eee32@iki.fi
2026-03-31 09:38:01 +02:00
Amit Langote
2da86c1ef9 Add fast path for foreign key constraint checks
Add a fast-path optimization for foreign key checks that bypasses SPI
by directly probing the unique index on the referenced table.
Benchmarking shows ~1.8x speedup for bulk FK inserts (int PK/int FK,
1M rows, where PK table and index are cached).

The fast path applies when the referenced table is not partitioned and
the constraint does not involve temporal semantics.  Otherwise, the
existing SPI path is used.

This optimization covers only the referential check trigger
(RI_FKey_check).  The action triggers (CASCADE, SET NULL, SET DEFAULT,
RESTRICT, NO ACTION) must find rows on the FK side to modify, which
requires a table scan with no guaranteed index available, and then
execute DML against those rows through the full executor path including
any triggered actions.  Replicating that without substantial code
duplication is not feasible, so those triggers remain on the SPI path.
Extending the fast path to action triggers remains possible as future
work if the necessary infrastructure is built.

The new ri_FastPathCheck() function extracts the FK values, builds scan
keys, performs an index scan, and locks the matching tuple with
LockTupleKeyShare via ri_LockPKTuple(), which handles the RI-specific
subset of table_tuple_lock() results.

If the locked tuple was reached by chasing an update chain
(tmfd.traversed), recheck_matched_pk_tuple() verifies that the key
is still the same, emulating EvalPlanQual.

The scan uses GetTransactionSnapshot(), matching what the SPI path
uses (via _SPI_execute_plan pushing GetTransactionSnapshot() as the
active snapshot).  Under READ COMMITTED this is a fresh snapshot;
under REPEATABLE READ / SERIALIZABLE it is the frozen transaction-
start snapshot, so PK rows committed after the transaction started
are not visible.

The ri_CheckPermissions() function performs schema USAGE and table
SELECT checks, matching what the SPI path gets implicitly through
the executor's permission checks.  The fast path also switches to
the PK table owner's security context (with SECURITY_NOFORCE_RLS)
before the index probe, matching the SPI path where the query runs
as the table owner.

ri_HashCompareOp() is adjusted to handle cross-type equality operators
(e.g. int48eq for int4 PK / int8 FK) which can appear in conpfeqop.
The existing code asserted same-type operators only, which was correct
for its existing callers (ri_KeysEqual compares same-type FK column
values via ff_eq_oprs), but the fast path is the first caller to pass
pf_eq_oprs, which can be cross-type.

Per-key metadata (compare entries, operator procedures, strategy
numbers) is cached in RI_ConstraintInfo via
ri_populate_fastpath_metadata() on first use, eliminating repeated
calls to ri_HashCompareOp() and get_op_opfamily_properties().
conindid and pk_is_partitioned are also cached at constraint load
time, avoiding per-invocation syscache lookups and the need to open
pk_rel before deciding whether the fast path applies.

New regression tests cover RLS bypass and ACL enforcement for the
fast-path permission checks.  New isolation tests exercise concurrent
PK updates under both READ COMMITTED and REPEATABLE READ.

Author: Junwang Zhao <zhjwpku@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com
2026-03-31 13:49:21 +09:00
Amit Kapila
5984ea868e Change syntax of EXCEPT TABLE clause in publication commands.
Adjust the syntax of the EXCEPT clause in CREATE/ALTER PUBLICATION
added in commits fd366065e0 and 493f8c6439 to move the TABLE keyword
inside the relation list.

Old syntax:
CREATE PUBLICATION ... FOR ALL TABLES EXCEPT TABLE (t1, ...);
ALTER PUBLICATION  ... SET ALL TABLES EXCEPT TABLE (t1, ...);

New syntax:
CREATE PUBLICATION ... FOR ALL TABLES EXCEPT (TABLE t1, ...);
ALTER PUBLICATION  ... SET ALL TABLES EXCEPT (TABLE t1, ...);

This is to ensure that inclusion and exclusion lists can be specified in
the same way. Previously, the exclusion table list had to be specified as
TABLE (t1, t2, t3), while the inclusion list could be specified as
TABLE t1, t2, t3, or TABLE t1, TABLE t2, TABLE t3.

This change is purely syntactic and does not alter behavior.

Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Author: vignesh C <vignesh21@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCC8XuwfX62qKBSfHUAoww_XB3_84HjswgL9jxQy696yw@mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
2026-03-31 09:40:51 +05:30
Nathan Bossart
bab2f27eaa Remove bits* typedefs.
In addition to removing the bits8, bits16, and bits32 typedefs,
this commit replaces all uses with uint8, uint16, or uint32.  bits*
provided little benefit beyond establishing the intent of the
variable, and they were inconsistently used for that purpose.
Third-party code should instead use the corresponding uint*
typedef.

Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/absbX33E4eaA0Ity%40nathan
2026-03-30 16:12:08 -05:00
Heikki Linnakangas
40c41dc773 Use ShmemInitStruct to allocate shmem for semaphores
This makes them visible in pg_shmem_allocations.

Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://www.postgresql.org/message-id/01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi
2026-03-30 23:39:35 +03:00
Melanie Plageman
378a216187 Set pd_prune_xid on insert
Now that on-access pruning can update the visibility map (VM) during
read-only queries, set the page’s pd_prune_xid hint during INSERT and on
the new page during UPDATE.

This allows heap_page_prune_and_freeze() to set the VM the first time a
page is read after being filled with tuples. This may avoid I/O
amplification by setting the page all-visible when it is still in shared
buffers and allowing later vacuums to skip scanning the page. It also
enables index-only scans of newly inserted data much sooner.

As a side benefit, this addresses a long-standing note in heap_insert()
and heap_multi_insert(): aborted inserts can now be pruned on-access
rather than lingering until the next VACUUM.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-30 16:07:11 -04:00
Melanie Plageman
b46e1e54d0 Allow on-access pruning to set pages all-visible
Many queries do not modify the underlying relation. For such queries, if
on-access pruning occurs during the scan, we can check whether the page
has become all-visible and update the visibility map accordingly.
Previously, only vacuum and COPY FREEZE marked pages as all-visible or
all-frozen.

This commit implements on-access VM setting for sequential scans, tid
range scans, sample scans, bitmap heap scans, and the underlying heap
relation in index scans.

Setting the visibility map on-access can avoid write amplification
caused by vacuum later needing to set the page all-visible, which could
trigger a write and potentially an FPI. It also allows more frequent
index-only scans, since they require pages to be marked all-visible in
the VM.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-30 15:47:07 -04:00
Tom Lane
7394773450 Be more careful to preserve consistency of a tuplestore.
Several places in tuplestore.c would leave the tuplestore data
structure effectively corrupt if some subroutine were to throw
an error.  Notably, if WRITETUP() failed after some number of
successful calls within dumptuples(), the tuplestore would
contain some memtuples pointers that were apparently live
entries but in fact pointed to pfree'd chunks.

In most cases this sort of thing is fine because transaction
abort cleanup is not too picky about the contents of memory that
it's going to throw away anyway.  There's at least one exception
though: if a Portal has a holdStore, we're going to call
tuplestore_end() on that, even during transaction abort.
So it's not cool if that tuplestore is corrupt, and that means
tuplestore.c has to be more careful.

This oversight demonstrably leads to crashes in v15 and before,
if a holdable cursor fails to persist its data due to an undersized
temp_file_limit setting.  Very possibly the same thing can happen in
v16 and v17 as well, though the specific test case submitted failed
to fail there (cf. 095555daf).  The failure is accidentally dodged
as of v18 because 590b045c3 got rid of tuplestore_end's retail tuple
deletion loop.  Still, it seems unwise to permit tuplestores to become
internally inconsistent in any branch, so I've applied the same fix
across the board.

Since the known test case for this is rather expensive and doesn't
fail in recent branches, I've omitted it.

Bug: #19438
Reported-by: Dmitriy Kuzmin <kuzmin.db4@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/19438-9d37b179c56d43aa@postgresql.org
Backpatch-through: 14
2026-03-30 13:59:58 -04:00
Heikki Linnakangas
681774315d Replace getopt() with our re-entrant variant in the backend
Some of these probably could continue using non-re-entrant getopt()
even if we start using threads in the future, but it seems better to
make them all anyway, so that we have a clear-cut rule of "no plain
getopt() in the postgres binary".

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/d1da5f0e-0d68-47c9-a882-eb22f462752f@iki.fi
2026-03-30 20:47:16 +03:00
Heikki Linnakangas
c5f7820e57 Fix latent bug in get_stats_option_name()
The function is supposed to look at the passed in 'arg' argument, but
peeks at the 'optarg' global variable that's part of getopt()
instead. It happened to work anyway, because all callers passed
'optarg' as the argument.
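
The class of bug described above can be illustrated with a small
standalone sketch. The function name toy_stats_option_name() is
hypothetical and the option-to-setting mapping is only illustrative;
the point is that every character inspected comes from the 'arg'
parameter, never from getopt()'s global 'optarg'.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the fixed function: it looks only at its
 * own argument, so it no longer depends on callers passing 'optarg'.
 */
const char *
toy_stats_option_name(const char *arg)
{
	switch (arg[0])
	{
		case 'p':
			if (arg[1] == 'a')		/* "parser" */
				return "log_parser_stats";
			else if (arg[1] == 'l')	/* "planner" */
				return "log_planner_stats";
			break;
		case 'e':					/* "executor" */
			return "log_executor_stats";
	}

	return NULL;				/* unrecognized option */
}
```

With this shape, a caller that passes a string other than 'optarg'
gets an answer based on the string it actually passed.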

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/d1da5f0e-0d68-47c9-a882-eb22f462752f@iki.fi
2026-03-30 20:34:48 +03:00
Melanie Plageman
50eb5faea2 Pass down information on table modification to scan nodes
Pass down information to sequential scan, index [only] scan, bitmap
table scan, sample scan, and TID range scan nodes on whether or not the
query modifies the relation being scanned. A later commit will use this
information to update the VM during on-access pruning only if the
relation is not modified by the query.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4379FDA3-9446-4E2C-9C15-32EFE8D4F31B%40yandex-team.ru
2026-03-30 13:27:34 -04:00
Álvaro Herrera
349bd88202 Don't use bits32 in table AM interface
Seems there's near-universal dislike for the bitsXX typedefs.
Revert that part of commit 1bd6f22f43 in favor of using plain uint32.
2026-03-30 19:06:33 +02:00
Melanie Plageman
dcd8cc1c85 Thread flags through begin-scan APIs
Add an AM user-settable flags parameter to several of the table scan
functions, one table AM callback, and index_beginscan(). This allows
users to pass additional context to be used when building the scan
descriptors.

For index scans, a new flags field is added to IndexFetchTableData, and
the heap AM saves the caller-provided flags there.

This introduces an extension point for follow-up work to pass per-scan
information (such as whether the relation is read-only for the current
query) from the executor to the AM layer.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2be31f17-5405-4de9-8d73-90ebc322f7d8%40vondra.me
2026-03-30 12:27:24 -04:00
Tom Lane
095555daf1 Detect pfree or repalloc of a previously-freed memory chunk.
Before the major rewrite in commit c6e0fe1f2, AllocSetFree() would
typically crash when asked to free an already-free chunk.  That was
an ugly but serviceable way of detecting coding errors that led to
double pfrees.  But since that rewrite, double pfrees went through
just fine, because the "hdrmask" of a freed chunk isn't changed at all
when putting it on the freelist.  We'd end up with a corrupt freelist
that circularly links back to the doubly-freed chunk, which would
usually result in trouble later, far removed from the actual bug.

This situation is no good at all for debugging purposes.  Fortunately,
we can fix it at low cost in MEMORY_CONTEXT_CHECKING builds by making
AllocSetFree() check for chunk->requested_size == InvalidAllocSize,
relying on the pre-existing code that sets it that way just below.
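
The detection idea can be sketched with a toy freelist. This is not
the aset.c code: ToyChunk, ToyFreeList, toy_free() and the
TOY_INVALID_SIZE sentinel are hypothetical names, and the sketch
returns false where a real allocator would presumably report an error.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel playing the role of InvalidAllocSize in this sketch. */
#define TOY_INVALID_SIZE SIZE_MAX

typedef struct ToyChunk
{
	size_t		requested_size;	/* bytes requested, or the sentinel */
	struct ToyChunk *next_free;	/* freelist link, valid only while free */
} ToyChunk;

typedef struct ToyFreeList
{
	ToyChunk   *head;
} ToyFreeList;

/*
 * Put a chunk on the freelist.  Returns false, leaving the list
 * untouched, if the chunk is already marked as freed.
 */
bool
toy_free(ToyFreeList *list, ToyChunk *chunk)
{
	if (chunk->requested_size == TOY_INVALID_SIZE)
		return false;			/* double free detected */

	/* Mark the chunk as freed, then push it onto the freelist. */
	chunk->requested_size = TOY_INVALID_SIZE;
	chunk->next_free = list->head;
	list->head = chunk;
	return true;
}
```

Without the check, the second toy_free() on the same chunk would set
next_free to the chunk itself, which is the circular freelist
described above.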

I investigated the alternative of changing a freed chunk's methodid
field, which would allow detection in non-MEMORY_CONTEXT_CHECKING
builds too.  But that adds measurable overhead.  Seeing that we didn't
notice this oversight for more than three years, it's hard to argue
that detecting this type of bug is worth any extra overhead in
production builds.

Likewise fix AllocSetRealloc() to detect repalloc() on a freed chunk,
and apply similar changes in generation.c and slab.c.  (generation.c
would hit an Assert failure anyway, but it seems best to make it act
like aset.c.)  bump.c doesn't need changes since it doesn't support
pfree in the first place.  Ideally alignedalloc.c would receive
similar changes, but in debugging builds it's impossible to reach
AlignedAllocFree() or AlignedAllocRealloc() on a pfreed chunk, because
the underlying context's pfree would have wiped the chunk header of
the aligned chunk.  But that means we should get an error of some
sort, so let's be content with that.

Per investigation of why the test case for bug #19438 didn't appear to
fail in v16 and up, even though the underlying bug was still present.
(This doesn't fix the underlying double-free bug; it just causes it to
be detected.)

Bug: #19438
Reported-by: Dmitriy Kuzmin <kuzmin.db4@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/19438-9d37b179c56d43aa@postgresql.org
Backpatch-through: 16
2026-03-30 12:02:08 -04:00
Heikki Linnakangas
bd365b1ae5 Fix outdated comment on MainLWLockArray
It's no longer passed down to child processes via BackendParameters in
EXEC_BACKEND mode.

Reported-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAA5RZ0vPWNMvTBqyH7nqDRrHd6Y4Et5iNqXFuwpbsPOk3cL4rQ@mail.gmail.com
2026-03-30 17:13:11 +03:00
Melanie Plageman
39dcd10a2c Remove PlannedStmt->resultRelations in favor of resultRelationRelids
PlannedStmt->resultRelations was an integer list of range table indexes
because at the time it was added (to Query), the Bitmapset data type did
not yet exist in Postgres.

0f4c170cf3 added a Bitmapset of result relations, so remove the integer
list of RTIs and use the more compact resultRelationRelids.

Discussion: https://postgr.es/m/CAApHDvqAOeOwCKh9g0gfxWa040%3DHyc7_oA%3DC59rjod8kXJDWyw%40mail.gmail.com
2026-03-30 09:51:28 -04:00
Melanie Plageman
0f4c170cf3 Make it cheap to check if a relation is modified by a query
Save the range table indexes of result relations and row mark relations
in separate bitmapsets in the PlannedStmt. Precomputing them allows
cheap membership checks during execution. Together, these two groups
approximate all relations that will be modified by a query. This
includes relations targeted by INSERT, UPDATE, DELETE, and MERGE as well
as relations with any row mark (like SELECT FOR UPDATE).
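
As a rough illustration of why precomputed sets make the check cheap,
here is a toy fixed-width bitset. It is not the Bitmapset API: the
ToyRelids type and the toy_* functions are hypothetical, and the
sketch only handles range table indexes 1..63.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy set of range table indexes; supports RTIs 1..63 only. */
typedef struct ToyRelids
{
	uint64_t	bits;
} ToyRelids;

/* Add a range table index to the set; out-of-range values are ignored. */
void
toy_add_member(ToyRelids *set, int rti)
{
	if (rti >= 1 && rti <= 63)
		set->bits |= UINT64_C(1) << rti;
}

/* Constant-time membership test. */
bool
toy_is_member(const ToyRelids *set, int rti)
{
	if (rti < 1 || rti > 63)
		return false;
	return (set->bits & (UINT64_C(1) << rti)) != 0;
}

/*
 * A relation counts as modified if it is a result relation or carries
 * a row mark, mirroring the two groups described above.
 */
bool
toy_rel_is_modified(const ToyRelids *result_rels,
					const ToyRelids *rowmark_rels, int rti)
{
	return toy_is_member(result_rels, rti) ||
		toy_is_member(rowmark_rels, rti);
}
```

Each check is a couple of bit operations, whereas scanning an integer
list costs time proportional to its length.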

Future work will use information on whether or not a relation is
modified by a query in a heuristic.

PlannedStmt->resultRelations is only used in a membership check, so it
will be removed in a separate commit.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
2026-03-30 09:38:03 -04:00
Álvaro Herrera
1bd6f22f43 Have table_insert and siblings use an unsigned type for options
Using signed types can lead to bugs, such as the one fixed by commit
2a2e1b470b.

Discussion: https://postgr.es/m/44e6ze3kuunhky63wmfjxrmn72pds2whwf5ok6hpz7c4my7k2h@l65zhpcuasnf
2026-03-30 13:58:16 +02:00
Peter Eisentraut
b36b956404 Make cast functions to type money error safe
This converts the cast functions from types integer, bigint, and
numeric to type money to support soft errors.

Note: Casting from type money to type numeric (the other way, function
cash_numeric) is not yet error safe.

Author: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-30 10:10:56 +02:00
Peter Eisentraut
26f9012bee Make cast function from circle to polygon error safe
Previously, the function casting type circle to type polygon could not
be made error safe, because it is an SQL language function.

This refactors it as a C/internal function, by sharing code with the
C/internal function that the SQL function previously wrapped, and soft
error support is added.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-30 09:11:08 +02:00
Fujii Masao
2497dac556 Fix FK triggers losing DEFERRABLE/INITIALLY DEFERRED when marked ENFORCED again
Previously, a foreign key defined as DEFERRABLE INITIALLY DEFERRED could
behave as NOT DEFERRABLE after being set to NOT ENFORCED and then back
to ENFORCED.

This happened because recreating the FK triggers on re-enabling the constraint
forgot to restore the tgdeferrable and tginitdeferred fields in pg_trigger.

Fix this bug by properly setting those fields when the foreign key constraint
is marked ENFORCED again and its triggers are recreated, so the original
DEFERRABLE and INITIALLY DEFERRED properties are preserved.

Backpatch to v18, where NOT ENFORCED foreign keys were introduced.

Author: Yasuo Honda <yasuo.honda@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAKmOUTms2nkxEZDdcrsjq5P3b2L_PR266Hv8kW5pANwmVaRJJQ@mail.gmail.com
Backpatch-through: 18
2026-03-30 14:37:33 +09:00
David Rowley
0d866282b8 Fix datum_image_*()'s inability to detect sign-extension variations
Functions such as hash_numeric() are not careful to use the correct
PG_RETURN_*() macro according to the return type of that function as
defined in pg_proc.  Because that function is meant to return int32,
when the hashed value exceeds 2^31, the 64-bit Datum value won't wrap to
a negative number, which means the Datum won't have the same value as it
would have had it been cast to int32 on a two's complement machine.  This
isn't harmless as both datum_image_eq() and datum_image_hash() may receive
a Datum that's been formed and deformed from a tuple in some cases, and
not in other cases.  When formed into a tuple, the Datum value will be
coerced into an integer according to the attlen as specified by the
TupleDesc.  This can result in two Datums that should be equal being
classed as not equal, which could result in (but is not limited to) an
error such as:

ERROR:  could not find memoization table entry

Here we fix this by ensuring we cast the Datum value to a signed integer
according to the typLen specified in the datum_image_eq/datum_image_hash
function call before comparing or hashing.
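
The mismatch and the normalization can be shown with a standalone
sketch. ToyDatum, toy_normalize() and toy_image_eq() are hypothetical
names, not the datum.c functions, and the narrowing casts assume a
two's complement machine, as the text above does.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyDatum;		/* stand-in for a 64-bit Datum */

/*
 * Coerce a pass-by-value datum to the sign-extended form for its type
 * length, so zero-extended and sign-extended inputs compare equal.
 */
ToyDatum
toy_normalize(ToyDatum d, int typlen)
{
	switch (typlen)
	{
		case 1:
			return (ToyDatum) (int64_t) (int8_t) d;
		case 2:
			return (ToyDatum) (int64_t) (int16_t) d;
		case 4:
			return (ToyDatum) (int64_t) (int32_t) d;
		default:
			return d;			/* full-width values need no adjustment */
	}
}

/* Compare two datums only after normalizing both to the same form. */
bool
toy_image_eq(ToyDatum a, ToyDatum b, int typlen)
{
	return toy_normalize(a, typlen) == toy_normalize(b, typlen);
}
```

For example, 0x80000000 stored zero-extended and the same int32 value
stored sign-extended (0xFFFFFFFF80000000) differ as raw 64-bit values
but compare equal once both are normalized with a type length of 4.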

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Tender Wang <tndrwang@gmail.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/CAHewXNmcXVFdB9_WwA8Ez0P+m_TQy_KzYk5Ri5dvg+fuwjD_yw@mail.gmail.com
2026-03-30 16:14:34 +13:00
Amit Langote
1ad7191f7e Add comment explaining fire_triggers=false in ri_PerformCheck()
The reason for passing fire_triggers=false to SPI_execute_snapshot()
in ri_PerformCheck() was not documented, making it unclear why it was
done that way.  Add a comment explaining that it ensures AFTER triggers
on rows modified by the RI action are queued in the outer query's
after-trigger context and fire only after all RI updates on the same
row are complete.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Surya Poondla <suryapoondla4@gmail.com>
Discussion: https://postgr.es/m/20250331212648.ad4ab804559001d7f0788741@sraoss.co.jp
2026-03-30 10:10:17 +09:00
Peter Eisentraut
45cdaf3665 Make geometry cast functions error safe
This adjusts cast functions of the geometry types to support soft
errors.  This requires refactoring of various helper functions to
support error contexts.  Also make the float8 to float4 cast error
safe.  It requires some of the same helper functions.

This is in preparation for a future feature where conversion errors in
casts can be caught.

(The function casting type circle to type polygon is not yet made error
safe, because it is an SQL language function.)

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-29 20:40:50 +02:00
Álvaro Herrera
0841b219bf Sort InternalBGWorkers list alphabetically
This simplifies deciding where to add a new one.
2026-03-29 14:15:00 +02:00
Peter Eisentraut
10e4d8aaf4 Make cast functions from jsonb error safe
This adjusts cast functions from jsonb to other types to support soft
errors.  This just involves some refactoring of the underlying helper
functions to use ereturn.

This is in preparation for a future feature where conversion errors in
casts can be caught.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-28 15:44:13 +01:00
Andres Freund
999dec9ec6 aio: Don't wait for already in-progress IO
When a backend attempts to start a read IO and finds the first buffer already
has I/O in progress, previously it waited for that I/O to complete before
initiating reads for any of the subsequent buffers.

Although it must wait for the I/O to finish when acquiring the buffer, there's
no reason for it to wait when setting up the read operation. Waiting at this
point prevents starting I/O on subsequent buffers and can significantly reduce
concurrency.

This matters in two workloads:
1) When multiple backends scan the same relation concurrently.
2) When a single backend requests the same block multiple times within the
   readahead distance.

Waiting each time an in-progress read is encountered effectively degenerates
the access pattern into synchronous I/O.

To fix this, when encountering an already in-progress IO for the head buffer,
the wait reference is now recorded and waiting is deferred until
WaitReadBuffers(), when the buffer actually needs to be acquired.

In rare cases, a backend may still need to wait synchronously at IO
start time: If another backend has set BM_IO_IN_PROGRESS on the buffer
but has not yet set the wait reference. Such windows should be brief and
uncommon.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
2026-03-27 19:53:32 -04:00
Andres Freund
74eafeab1a bufmgr: Improve StartBufferIO interface
Until now StartBufferIO() had a few weaknesses:

- As it did not submit staged IOs, it was not safe to call StartBufferIO()
  where there was a potential for unsubmitted IO, which required
  AsyncReadBuffers() to use a wrapper (ReadBuffersCanStartIO()) around
  StartBufferIO().

- With nowait = true, the boolean return value did not make it possible to
  distinguish between no IO being necessary and having to wait, which would
  lead ReadBuffersCanStartIO() to unnecessarily submit staged IO.

- Several callers needed to handle both local and shared buffers, requiring
  the caller to differentiate between StartBufferIO() and StartLocalBufferIO()

- In a future commit some callers of StartBufferIO() want the BufferDesc's
  io_wref to be returned, to asynchronously wait for in-progress IO

- Indicating whether to wait with the nowait parameter was somewhat confusing
  compared to a wait parameter

Address these issues as follows:

- StartBufferIO() is renamed to StartSharedBufferIO()

- A new StartBufferIO() is introduced that supports both shared and local
  buffers

- The boolean return value has been replaced with an enum, indicating whether
  the IO is already done, already in progress or that the buffer has been
  readied for IO

- A new PgAioWaitRef * argument allows the caller to get the wait reference if
  desired.  All current callers pass NULL; a user of this will be introduced
  subsequently

- Instead of the nowait argument there now is wait

  This probably would not have been worthwhile on its own, but since all these
  lines needed to be touched anyway...

Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-27 19:08:12 -04:00
Heikki Linnakangas
2407c8db15 Fix RequestNamedLWLockTranche in single-user mode
PostmasterContext is not available in single-user mode, use
TopMemoryContext instead. Also make sure that we use the correct
memory context in the lappend().

Author: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/acb_Eo1XtmCO_9z7@nathan
2026-03-28 01:02:11 +02:00
Andres Freund
f39cb8c011 bufmgr: Make UnlockReleaseBuffer() more efficient
Now that the buffer content lock is implemented as part of BufferDesc.state,
releasing the lock and unpinning the buffer can be implemented as a single
atomic operation.

This substantially improves workloads that have heavy contention on a small
number of buffers; e.g., I see a ~20% improvement for pipelined read-only
pgbench on an older two-socket machine.
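
A minimal C11 sketch of the idea follows. The bit layout is an
assumption made for illustration (pin count in the low 16 bits, one
exclusive-lock bit above it) and does not reflect the actual
BufferDesc.state layout; toy_unlock_release() is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: low 16 bits hold the pin count. */
#define TOY_REFCOUNT_ONE	UINT32_C(1)
#define TOY_REFCOUNT_MASK	UINT32_C(0x0000FFFF)
#define TOY_LOCK_EXCLUSIVE	UINT32_C(0x00010000)

/*
 * Release the exclusive lock and drop one pin with a single atomic
 * subtraction.  The caller must hold the lock and at least one pin,
 * so the subtraction cannot borrow across fields.  Returns the new
 * state word.
 */
uint32_t
toy_unlock_release(_Atomic uint32_t *state)
{
	const uint32_t delta = TOY_LOCK_EXCLUSIVE + TOY_REFCOUNT_ONE;

	return atomic_fetch_sub(state, delta) - delta;
}
```

Because the lock bit and the pin count share one word, one
read-modify-write replaces the separate unlock and unpin steps, which
is where the reduced contention would come from.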

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-27 15:56:29 -04:00
Andres Freund
8df3c48e46 Use UnlockReleaseBuffer() in more places
An upcoming commit will make UnlockReleaseBuffer() considerably faster and
more scalable than doing LockBuffer(BUFFER_LOCK_UNLOCK); ReleaseBuffer();. But
it's a small performance benefit even as-is.

Most of the callsites changed in this patch are not performance sensitive,
however some, like the nbtree ones, are in critical paths.

This patch changes all the easily convertible places over to
UnlockReleaseBuffer() mainly because I needed to check all of them anyway, and
reducing cases where the operations are done separately makes the checking
easier.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-27 15:56:29 -04:00
Andres Freund
41d3d64e87 bufmgr: Don't copy pages while writing out
After the series of preceding commits introducing and using
BufferBeginSetHintBits()/BufferSetHintBits16(), hint bits are not set anymore
while IO is going on. Therefore we do not need to copy pages while they are
being written out anymore.

For the same reason XLogSaveBufferForHint() now does not need to operate on a
copy of the page anymore, but can instead use the normal XLogRegisterBuffer()
mechanism. For that the assertions and comments to XLogRegisterBuffer() had to
be updated to allow share-exclusive locked buffers to be registered.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-27 15:56:29 -04:00
Nathan Bossart
d7965d65fc Add rudimentary table prioritization to autovacuum.
Autovacuum workers scan pg_class twice to collect the set of tables
to process.  The first pass is for plain relations and materialized
views, and the second is for TOAST tables.  When the worker finds a
table to process, it adds it to the end of a list.  Later on, it
processes the tables in the same order as the list.  This simple
strategy has worked surprisingly well for a long time, but there
have been many discussions over the years about trying to improve
it.

This commit introduces a scoring system that is used to sort the
aforementioned list of tables to process.  The idea is to have
autovacuum workers prioritize tables that are furthest beyond their
thresholds (e.g., a table nearing transaction ID wraparound should
be vacuumed first).  This prioritization scheme is certainly far
from perfect; there are simply too many possibilities for any
scoring technique to work across all workloads, and the situation
might change significantly between the time we calculate the score
and the time that autovacuum processes it.  However, we have
attempted to develop something that is expected to work for a large
portion of workloads with reasonable parameter settings.

The score is calculated as the maximum of the ratios of each of the
table's relevant values to its threshold.  For example, if the
number of inserted tuples is 100, and the insert threshold for the
table is 80, the insert score is 1.25.  If all other scores are
below that value, the table's score will be 1.25.  The other
criteria considered for the score are the table ages (both
relfrozenxid and relminmxid) compared to the corresponding
freeze-max-age setting, the number of updated/deleted tuples
compared to the vacuum threshold, and the number of
inserted/updated/deleted tuples compared to the analyze threshold.
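
The maximum-of-ratios rule can be sketched as follows. ScoreComponent
and table_score() are hypothetical names, not taken from autovacuum.c,
and treating each weight as a plain multiplier on its ratio is an
assumption made for illustration.

```c
#include <stddef.h>

/* One scoring criterion: a current value and the threshold it is measured against. */
typedef struct ScoreComponent
{
	double		value;			/* e.g. number of inserted tuples */
	double		threshold;		/* e.g. insert threshold for the table */
	double		weight;			/* assumed multiplier, 1.0 by default */
} ScoreComponent;

/* Return the largest weighted value/threshold ratio across all components. */
double
table_score(const ScoreComponent *components, size_t ncomponents)
{
	double		score = 0.0;

	for (size_t i = 0; i < ncomponents; i++)
	{
		double		ratio;

		if (components[i].threshold <= 0.0)
			continue;			/* no usable threshold for this component */

		ratio = components[i].weight *
			components[i].value / components[i].threshold;
		if (ratio > score)
			score = ratio;
	}

	return score;
}
```

With 100 inserted tuples against a threshold of 80 and every other
ratio lower, this returns 1.25, matching the example in the text.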

One exception to the previous paragraph is for tables nearing
wraparound, i.e., those that have surpassed the effective failsafe
ages.  In that case, the relfrozenxid/relminmxid-based score is
scaled aggressively so that the table has a decent chance of
sorting to the front of the list.

To adjust how strongly each component contributes to the score, the
following parameters can be adjusted from their default of 1.0 to
anywhere between 0.0 and 10.0 (inclusive).  Setting all of these to
0.0 restores pre-v19 prioritization behavior:

	autovacuum_freeze_score_weight
	autovacuum_multixact_freeze_score_weight
	autovacuum_vacuum_score_weight
	autovacuum_vacuum_insert_score_weight
	autovacuum_analyze_score_weight

This is intended to be a baby step towards smarter autovacuum
workers.  Possible future improvements include, but are not limited
to, periodic reprioritization, automatic cost limit adjustments,
and better observability (e.g., a system view that shows current
scores).  While we do not expect this commit to produce any
earth-shattering improvements, it is arguably a prerequisite for
the aforementioned follow-up changes.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/aOaAuXREwnPZVISO%40nathan
2026-03-27 10:17:05 -05:00
Heikki Linnakangas
3fd0577728 Refactor PredicateLockShmemInit to not reuse var for different things
The PredicateLockShmemInit function is pretty complicated, and one
source of confusion is that it reuses the same local variable for
sizes of things. Replace the different uses with separate variables
for clarity.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/113724ab-0028-493f-9605-6e8570f0939f@iki.fi
2026-03-27 13:24:34 +02:00
Peter Eisentraut
288ae96872 Add a graph pattern variable only once
An element pattern variable may be repeated in the path pattern.
GraphTableParseState maintains a list of all variable names used in
the graph pattern.  Add a new variable name to that list only when it
is not present already.  This isn't a problem right now, but it could
be in the future.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5tR4O0vjeqTCPr2VB5pYjNYbJgbCBEQf63NtU5Pz1MiOQ%40mail.gmail.com
2026-03-27 10:55:17 +01:00
Heikki Linnakangas
98993150c0 Minor comment fixes to yesterday's LWLock tranche refactoring
Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAA5RZ0sLENRM+BicUjQFs_rP38oPx3gm0SsGrD0-jMhhM+HZ_w@mail.gmail.com
2026-03-27 11:44:10 +02:00
Peter Eisentraut
720f0f89d6 Reject consecutive element patterns of same kind
Adding an implicit empty vertex pattern when a path pattern starts or
ends with an edge pattern or when two consecutive edge patterns appear
in the pattern is not supported right now.  Prohibit such path
patterns.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/72a23702-6d96-4103-a54b-057c2352e885%2540eisentraut.org
2026-03-27 10:31:53 +01:00
Heikki Linnakangas
30d432502b Use ShmemInitStruct to allocate lwlock.c's shared memory
It's nice to have them show up in pg_shmem_allocations like all other
shmem areas. ShmemInitStruct() depends on ShmemIndexLock, but only
after postmaster startup.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:51:41 +02:00
Heikki Linnakangas
06d859aaf4 Move ShmemIndexLock into ShmemAllocator
This makes shmem.c independent of the main LWLock array. That makes it
possible to stop passing MainLWLockArray through BackendParameters in
the next commit.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:51:41 +02:00
Heikki Linnakangas
12e3e0f2c8 Use a separate spinlock to protect LWLockTranches
Previously we reused the shmem allocator's ShmemLock to also protect
lwlock.c's shared memory structures. Introduce a separate spinlock for
lwlock.c for the sake of modularity. Now that lwlock.c has its own
shared memory struct (LWLockTranches), this is easy to do.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:50:59 +02:00
Heikki Linnakangas
d6eba30a24 Refactor how user-defined LWLock tranches are stored in shmem
Merge the LWLockTranches and NamedLWLockTrancheRequest data structures
in shared memory into one array of user-defined tranches. The
NamedLWLockTrancheRequest list is now only used in postmaster, to hold
the requests until shared memory is initialized.

Introduce a C struct, LWLockTranches, to hold all the different fields
kept in shared memory. This gives an easier overview of everything
kept in shared memory. Previously, we had separate pointers
for LWLockTrancheNames, LWLockCounter and the (shared memory copy of)
NamedLWLockTrancheRequestArray.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:47:22 +02:00
Heikki Linnakangas
cc88481aeb Rename MAX_NAMED_TRANCHES to MAX_USER_DEFINED_TRANCHES
The "named tranches" term is a little confusing. In most places it
refers to tranches requested with RequestNamedLWLockTranche(), even
though all built-in tranches and tranches allocated with
LWLockNewTrancheId() also have a name. But in MAX_NAMED_TRANCHES, it
refers to tranches requested with either RequestNamedLWLockTranche()
or LWLockNewTrancheId(), as it's the maximum of all of those in total.

The "user defined" term is already used in
LWTRANCHE_FIRST_USER_DEFINED, so let's standardize on that to mean
tranches allocated with either RequestNamedLWLockTranche() or
LWLockNewTrancheId().

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:46:04 +02:00
Robert Haas
26255a3207 Add an alternative_plan_name field to PlannerInfo.
Typically, we have only one PlannerInfo for any given subquery, but
when we are considering a MinMaxAggPath or a hashed subplan, we end
up creating a second PlannerInfo for the same portion of the query,
with a clone of the original range table. In fact, in the MinMaxAggPath
case, we might end up creating several clones, one per aggregate.

At present, there's no easy way for a plugin, such as pg_plan_advice,
to understand the relationships between the original range table and
the copies of it that are created in these cases.  To fix, add an
alternative_plan_name field to PlannerInfo. For a hashed subplan, this
is the plan name for the non-hashed alternative; for minmax aggregates,
this is the plan_name from the parent PlannerInfo; otherwise, it's the
same as plan_name.

Discussion: http://postgr.es/m/CA+TgmoYuWmN-00Ec5pY7zAcpSFQUQLbgAdVWGR9kOR-HM-fHrA@mail.gmail.com
Reviewed-by: Lukas Fittl <lukas@fittl.com>
2026-03-26 16:45:17 -04:00
Andres Freund
8a1a1d6ab8 bufmgr: Restructure AsyncReadBuffers()
Restructure AsyncReadBuffers() to use early return when the head buffer is
already valid, instead of using a did_start_io flag and if/else branches. Also
move around a bit of the code to be located closer to where it is used. This
is a refactor only.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-26 12:07:05 -04:00
Andres Freund
df09452c32 bufmgr: Make buffer hit helper
Already two places count buffer hits, requiring quite a few lines of
code since we do accounting in so many places. Future commits will add
more locations, so refactor into a helper.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-26 12:07:05 -04:00
Andres Freund
c2a68e08b1 bufmgr: Pass io_object and io_context through to PinBufferForBlock()
PinBufferForBlock() is always_inline and called in a loop in
StartReadBuffersImpl(). Previously it computed io_context and io_object
internally, which required calling IOContextForStrategy() -- a non-inline
function the compiler cannot prove is side-effect-free. This could potentially
cause redundant function calls.

Compute io_context and io_object in the callers instead, allowing
StartReadBuffersImpl() to do so once before entering the loop.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-26 12:07:05 -04:00
Andres Freund
cf66978d79 Fix off-by-one error in read IO tracing
AsyncReadBuffers()'s no-IO-needed path passed
TRACE_POSTGRESQL_BUFFER_READ_DONE the wrong block number because it had
already incremented operation->nblocks_done. Fix by folding the
nblocks_done offset into the blocknum local variable at initialization.
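
A minimal sketch of this kind of off-by-one, using hypothetical names
(ReadOpSketch, trace_block_buggy, trace_block_fixed) rather than the actual
bufmgr code: once the progress counter has been incremented, deriving the
traced block number from it points one block too far.

```c
#include <assert.h>

/* Hypothetical stand-in for a multi-block read operation. */
typedef struct ReadOpSketch
{
    unsigned int first_block;   /* first block of the read */
    int          nblocks;       /* total blocks requested */
    int          nblocks_done;  /* blocks already completed */
} ReadOpSketch;

/* Buggy pattern: the counter is incremented before the block is computed. */
unsigned int
trace_block_buggy(ReadOpSketch *op)
{
    assert(op->nblocks_done < op->nblocks);
    op->nblocks_done++;
    return op->first_block + op->nblocks_done;  /* one block too far */
}

/* Fixed pattern: fold the offset into a local variable at initialization. */
unsigned int
trace_block_fixed(ReadOpSketch *op)
{
    unsigned int blocknum = op->first_block + op->nblocks_done;

    assert(op->nblocks_done < op->nblocks);
    op->nblocks_done++;
    return blocknum;            /* the block that was just completed */
}
```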

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/u73un3xeljr4fiidzwi4ikcr6vm7oqugn4fo5vqpstjio6anl2%40hph6fvdiiria
Backpatch-through: 18
2026-03-26 10:38:56 -04:00
Robert Haas
47c110f77e Respect disabled_nodes in fix_alternative_subplan.
When my commit e222534679 added the
concept of disabled_nodes, it failed to add a disabled_nodes field
to SubPlan. This is a regression: before that commit, when
fix_alternative_subplan compared the costs of two plans, the number
of disabled nodes affected the result, because it was just a
component of the total cost. After that commit, it no longer did,
making it possible for a disabled path to win on cost over one that
is not disabled. Fix that.
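
The comparison rule described above can be sketched as follows. The struct
and function names are hypothetical and this is not the actual
fix_alternative_subplan code; it only illustrates, as an assumption about
the intended ordering, that the count of disabled nodes is compared before
the cost.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one alternative plan. */
typedef struct PlanCostSketch
{
    int    disabled_nodes;  /* number of disabled nodes in the plan */
    double total_cost;      /* estimated total cost */
} PlanCostSketch;

/* Returns true if plan a should be preferred over plan b. */
bool
plan_a_preferred(const PlanCostSketch *a, const PlanCostSketch *b)
{
    assert(a && b);

    /* Fewer disabled nodes always wins, regardless of cost. */
    if (a->disabled_nodes != b->disabled_nodes)
        return a->disabled_nodes < b->disabled_nodes;

    /* Only among equally disabled plans does the cost decide. */
    return a->total_cost < b->total_cost;
}
```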

As usual for planner fixes that might destabilize plan choices,
no back-patch.

Discussion: https://postgr.es/m/CA+TgmoaK=4w7-qknUo3QhUJ53pXZq=c=KgZmRyD+k7ytqfmgSg@mail.gmail.com
Reviewed-by: Lukas Fittl <lukas@fittl.com>
2026-03-26 10:25:04 -04:00
Peter Eisentraut
119e791e9c Fix -Wcast-qual warning
This dials back a couple of the qualifiers added by commit
7724cb9935.  Specifically, in match_boolean_partition_clause() the
call to negate_clause() casts away the const, so we shouldn't make the
input argument const.
2026-03-26 15:00:24 +01:00
Fujii Masao
400a790a48 Avoid sending duplicate WAL locations in standby status replies
Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.

As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.

This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
2026-03-26 20:54:32 +09:00
Fujii Masao
eef1ba704d Fix premature NULL lag reporting in pg_stat_replication
pg_stat_replication is documented to keep the last measured lag values for
a short time after the standby catches up, and then set them to NULL when
there is no WAL activity. However, previously lag values could become NULL
prematurely even while WAL activity was ongoing, especially in logical
replication.

This happened because the code cleared lag when two consecutive reply messages
indicated that the apply location had caught up with the send location.
It did not verify that the reported positions were unchanged, so lag could be
cleared even when positions had advanced between messages. In logical
replication, where the apply location often quickly catches up, this issue was
more likely to occur.

This commit fixes the issue by clearing lag only when the standby reports that
it has fully replayed WAL (i.e., both flush and apply locations have caught up
with the send location) and the write/flush/apply positions remain unchanged
across two consecutive reply messages.
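
As a rough illustration of the condition described above (not the actual
walsender code), the check can be written as a predicate over two
consecutive replies. LsnSketch, ReplySketch and should_clear_lag are
hypothetical names introduced for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LsnSketch;     /* stand-in for a WAL location */

/* Positions reported by the standby in one reply message. */
typedef struct ReplySketch
{
    LsnSketch write;
    LsnSketch flush;
    LsnSketch apply;
} ReplySketch;

/*
 * Clear lag only if the standby has fully replayed WAL and the reported
 * positions did not move between the previous and the current reply.
 */
bool
should_clear_lag(const ReplySketch *prev, const ReplySketch *cur,
                 LsnSketch sent)
{
    bool fully_replayed;
    bool unchanged;

    assert(prev && cur);

    fully_replayed = (cur->flush == sent && cur->apply == sent);
    unchanged = (prev->write == cur->write &&
                 prev->flush == cur->flush &&
                 prev->apply == cur->apply);

    return fully_replayed && unchanged;
}
```

With this shape, a reply that has caught up but whose positions advanced
since the previous reply does not clear the lag; only a second reply with
identical positions does.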

The second message with unchanged positions typically results from
wal_receiver_status_interval, so lag values are cleared after that interval
when there is no activity. This avoids showing stale lag data while preventing
premature NULL values.

Even with this fix, lag may rarely become NULL during activity if identical
position reports are sent repeatedly. Eliminating such duplicate messages
would address this fully, but that change is considered too invasive for stable
branches and will be handled in master only later.

Backpatch to all supported branches.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
Backpatch-through: 14
2026-03-26 20:49:31 +09:00
Heikki Linnakangas
6b8238cb6a Refactor ShmemIndex initialization
Initialize the ShmemIndex hash table in InitShmemAllocator() already,
removing the need for the separate InitShmemIndex() step.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
2026-03-26 11:35:55 +02:00
Amit Kapila
735e8fe685 Refactor replorigin_session_setup() for better readability.
Reorder the validation checks in replorigin_session_setup() to provide a
more logical flow. This makes the function easier to follow and ensures
that basic state checks are performed consistently.

Additionally, update an error message to align its phrasing with similar
diagnostics in the replication origin subsystem, improving overall
consistency.

Author: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/e0508305-bc6a-417c-b969-36564d632f9e@iki.fi
2026-03-26 09:15:25 +05:30
Michael Paquier
4287c50fc2 Improve timeout handling of pg_promote()
Previously, pg_promote() looped a fixed number of times, calculated from
the specified timeout, and waited 100ms on a latch, once per iteration,
for the promotion of a standby to complete.  However, unrelated signals
to the backend could set the latch and wake up the backend early,
resulting in a faster consumption of the loops and an execution time of
the function that does not match the timeout given as input.
This could be confusing for the function caller, especially if some
backend-side timeout is aggressive, because the function would return
much earlier than expected and report that the promote request has not
completed within the time requested.

This commit refines the logic to track the time actually elapsed, by
looping until the requested duration has truly passed.  The code
calculates the end time we expect, then uses it when looping.
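
The difference between the two strategies can be shown with a small
simulation. This is only a sketch under simplifying assumptions: time is a
plain counter in milliseconds, every wait is cut short after
actual_wait_ms, and the function names are hypothetical rather than taken
from pg_promote().

```c
#include <assert.h>

#define WAIT_SLICE_MS 100       /* nominal wait per loop iteration */

/*
 * Old approach: the loop count is derived from the timeout and each
 * iteration is assumed to last a full slice.  Early wakeups make the
 * function give up sooner than requested.
 */
long
elapsed_with_fixed_loops(long timeout_ms, long actual_wait_ms)
{
    long loops = timeout_ms / WAIT_SLICE_MS;
    long elapsed = 0;
    long i;

    assert(actual_wait_ms > 0 && actual_wait_ms <= WAIT_SLICE_MS);
    for (i = 0; i < loops; i++)
        elapsed += actual_wait_ms;
    return elapsed;
}

/*
 * New approach: compute the expected end time once and keep looping
 * until it has really passed, however short each wait turns out to be.
 */
long
elapsed_with_deadline(long timeout_ms, long actual_wait_ms)
{
    long now = 0;
    long end_time = now + timeout_ms;

    assert(actual_wait_ms > 0 && actual_wait_ms <= WAIT_SLICE_MS);
    while (now < end_time)
    {
        long remaining = end_time - now;
        long slice = remaining < WAIT_SLICE_MS ? remaining : WAIT_SLICE_MS;

        /* An early wakeup only shortens this wait, not the deadline. */
        now += actual_wait_ms < slice ? actual_wait_ms : slice;
    }
    return now;
}
```

For a 1000 ms timeout with every wait interrupted after 10 ms, the
fixed-loop variant gives up after 100 ms, while the deadline variant runs
for the full 1000 ms.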

Author: Robert Pang <robertpang@google.com>
Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com>
Discussion: https://postgr.es/m/CAJhEC07OK8J7tLUbyiccnuOXRE7UKxBNqD2-pLfeFXa=tBoWtw@mail.gmail.com
2026-03-26 10:39:40 +09:00
Masahiko Sawada
497c1170cb Add base32hex support to encode() and decode() functions.
This adds support for base32hex encoding and decoding, as defined in
RFC 4648 Section 7. Unlike standard base32, base32hex uses the
extended hex alphabet (0-9, A-V) which preserves the lexicographical
order of the encoded data.

This is particularly useful for representing UUIDv7 values in a
compact string format while maintaining their time-ordered sort
property.

The encode() function produces output padded with '=', while decode()
accepts both padded and unpadded input. Following the behavior of
other encoding types, decoding is case-insensitive.
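
For readers unfamiliar with the format, here is a minimal standalone
encoder following RFC 4648 Section 7. It is an illustrative sketch, not the
code added by this commit; base32hex_encode_sketch is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* RFC 4648 Section 7 "extended hex" alphabet: 0-9 followed by A-V. */
static const char base32hex_alphabet[] = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

/*
 * Encode srclen bytes from src into dst as '='-padded base32hex.
 * dst must have room for 8 * ((srclen + 4) / 5) + 1 bytes.
 * Returns the number of characters written, excluding the terminator.
 */
size_t
base32hex_encode_sketch(const unsigned char *src, size_t srclen, char *dst)
{
    size_t       out = 0;
    unsigned int bits = 0;      /* bit accumulator */
    int          nbits = 0;     /* number of valid bits in the accumulator */
    size_t       i;

    assert(src && dst);

    for (i = 0; i < srclen; i++)
    {
        bits = (bits << 8) | src[i];
        nbits += 8;

        /* Emit one character for every complete 5-bit group. */
        while (nbits >= 5)
        {
            dst[out++] = base32hex_alphabet[(bits >> (nbits - 5)) & 0x1F];
            nbits -= 5;
        }

        /* Keep only the bits that have not been emitted yet. */
        bits &= (1u << nbits) - 1;
    }

    /* Flush leftover bits, left-aligned in a final 5-bit group. */
    if (nbits > 0)
        dst[out++] = base32hex_alphabet[(bits << (5 - nbits)) & 0x1F];

    /* Pad with '=' to a multiple of 8 characters. */
    while (out % 8 != 0)
        dst[out++] = '=';

    dst[out] = '\0';
    return out;
}
```

With the RFC 4648 test vectors, "f" encodes to "CO======" and "fooba" to
"CPNMUOJ1".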

Suggested-by: Sergey Prokhorenko <sergeyprokhorenko@yahoo.com.au>
Author: Andrey Borodin <x4mmm@yandex-team.ru>
Co-authored-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Илья Чердаков <i.cherdakov.pg@gmail.com>
Reviewed-by: Chengxi Sun <chengxisun92@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com
2026-03-25 11:35:19 -07:00
Álvaro Herrera
c8b4a3ec08 Remove unused autovac_table.at_sharedrel
The last use was removed by commit 38f7831d70.  After that, we compute
MyWorkerInfo->wi_sharedrel directly from the pg_class tuple of the table
being vacuumed rather than passing it around.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/20260325165734.7ab8e4e55fe4c2f1e55031d9@sraoss.co.jp
2026-03-25 18:24:34 +01:00
Peter Eisentraut
bccfc73acd Disable warnings in system headers in MSVC
This is similar to the standard behavior in GCC.  For MSVC, we set all
headers in angle brackets to be considered system headers.  (GCC goes
by path, not include style.)

The required option is available since VS 2017.  (Before VS 2019
version 16.10, the additional option /experimental:external is
required, but per discussion in [0], we effectively require 16.11, so
this shouldn't be a problem.)

[0]: https://www.postgresql.org/message-id/04ab76a3-186c-4a37-8076-e6882ebf9d43%40eisentraut.org

Then, we can remove one workaround for avoiding a warning from a
system header.  (And some warnings to be enabled in the future could
benefit from this.)

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aa73q1aT0A3/vke/%40ip-10-97-1-34.eu-west-3.compute.internal
2026-03-25 15:03:52 +01:00
Peter Eisentraut
5282bf535e Fix some typos and make small stylistic improvements
for commit 2f094e7ac6

Author: zengman <zengman@halodbtech.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-25 09:17:40 +01:00
Peter Eisentraut
c79e414127 Fix typo
Mistake in commit e2f289e5b9: SOFT_ERROR_OCCURRED was called with the
wrong fcinfo field.

Reported-by: Jianghua Yang <yjhjstz@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAZLFmSGti716gWeY%3DDCZ9TTVOixnHZ4_4V4tDzoeE86D64vOA%40mail.gmail.com
2026-03-25 07:09:44 +01:00
Jeff Davis
11f8018ee6 Refactor to remove ForeignServerName().
Callers either have a ForeignServer object or can readily construct
one.

Discussion: https://postgr.es/m/CAExHW5vV5znEvecX=ra2-v7UBj9-M6qvdDzuB78M-TxbYD1PEA@mail.gmail.com
Suggested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
2026-03-24 15:20:28 -07:00
Jeff Davis
f16f5d608c GetSubscription(): use per-object memory context.
Constructing a Subscription object uses a number of small or temporary
allocations. Use a per-object memory context for easy cleanup.

Get rid of FreeSubscription() which did not free all the allocations
anyway. Also get rid of the PG_TRY()/PG_CATCH() logic in
ForeignServerConnectionString() which was used to avoid leaks during
GetSubscription().

Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/xvdjrdqnpap3uq7owbaox3r7p5gf7sv62aaqf2ju3vb6yglatr%40kvvwhoudrlxq
Discussion: https://postgr.es/m/CAA4eK1K=WjZ1maBCmj=5ZdO66AwPORK5ZBxVKedS0xdCcb621A@mail.gmail.com
2026-03-24 15:11:45 -07:00
Melanie Plageman
a881cc9c7e Remove XLOG_HEAP2_VISIBLE entirely
There are no remaining users that emit XLOG_HEAP2_VISIBLE records, so it
can be removed. This includes deleting the xl_heap_visible struct and
all functions responsible for emitting or replaying XLOG_HEAP2_VISIBLE
records.

Bumps XLOG_PAGE_MAGIC because we removed a WAL record type.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-24 17:58:12 -04:00
Melanie Plageman
a759ced2f1 WAL log VM setting for empty pages in XLOG_HEAP2_PRUNE_VACUUM_SCAN
As part of removing XLOG_HEAP2_VISIBLE records, phase I of VACUUM now
marks empty pages all-visible and all-frozen in a
XLOG_HEAP2_PRUNE_VACUUM_SCAN record.

This has no real independent benefit, but empty pages were the last user
of XLOG_HEAP2_VISIBLE, so by making this change we can next remove all
of the XLOG_HEAP2_VISIBLE code.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Earlier version Reviewed-by: Robert Haas <robertmhaas@gmail.com>
2026-03-24 17:30:54 -04:00
Melanie Plageman
1252a4ee28 WAL log VM setting during vacuum phase I in XLOG_HEAP2_PRUNE_VACUUM_SCAN
Vacuum no longer emits a separate WAL record for each page set
all-visible or all-frozen during phase I. Instead, visibility map
updates are now included in the XLOG_HEAP2_PRUNE_VACUUM_SCAN record that
is already emitted for pruning and freezing.

Previously, heap_page_prune_and_freeze() determined whether a page was
all-visible, but the corresponding VM bits were only set later in
lazy_scan_prune(). Now the VM is updated immediately in
heap_page_prune_and_freeze(), at the same time as the heap
modifications. This reduces WAL volume produced by vacuum.

For now, vacuum is still the only user of heap_page_prune_and_freeze()
allowed to set the VM. On-access pruning is not yet able to set the VM.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Earlier version Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-24 16:49:46 -04:00
Robert Haas
dc47beacaa get_memoize_path: Don't exit quickly when PGS_NESTLOOP_PLAIN is unset.
This function exits early in the case where the number of inner rows
is estimated to be less than 2, on the theory that in that case a
Nested Loop with inner Memoize must lose to a plain Nested Loop.
But since commit 4020b370f2 it's
possible for a plain Nested Loop to be disabled, while a Nested Loop
with inner Memoize is still enabled. In that case, this reasoning
is not valid, so adjust the code not to exit early in that case.

This issue was revealed by a test_plan_advice failure on buildfarm
member skink, where NESTED_LOOP_MEMOIZE() couldn't be enforced on
replanning due to this early exit.

Discussion: http://postgr.es/m/CA+TgmoZUN8FT1Ah=m6Uis5bHa4FUa+_hMDWtcABG17toEfpiUg@mail.gmail.com
2026-03-24 16:17:26 -04:00
Melanie Plageman
9ba3ec076a Keep newest live XID up-to-date even if page not all-visible
During pruning, we keep track of the newest xmin of live tuples on the
page visible to all running and future transactions so that we can use
it later as the snapshot conflict horizon when setting the VM if the
page turns out to be all-visible.

Previously, we stopped updating this value once we determined the page
was not all-visible. However, maintaining it even when the page is not
all-visible is inexpensive and makes the snapshot conflict horizon
calculation clearer. This guarantees it won't contain a stale value.

Since we'll keep it up to date all the time now anyway, there's no
reason not to maintain set_all_visible for on-access pruning. This will
allow us to set the VM on-access in the future.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-24 15:37:18 -04:00
Melanie Plageman
dd5716f3c7 Use GlobalVisState in vacuum to determine page level visibility
During vacuum's first and third phases, we examine tuples' visibility to
determine if we can set the page all-visible in the visibility map.

Previously, this check compared tuple xmins against a single XID chosen
at the start of vacuum (OldestXmin). We now use GlobalVisState, which
enables future work to set the VM during on-access pruning, since
ordinary queries have access to GlobalVisState but not OldestXmin.

This also benefits vacuum: in some cases, GlobalVisState may advance
during a vacuum, allowing more pages to become considered all-visible.
And, in the future, we could easily add a heuristic to update
GlobalVisState more frequently during vacuums of large tables.

OldestXmin is still used for freezing and as a backstop to ensure we
don't freeze a dead tuple that wasn't yet prunable according to
GlobalVisState in the rare occurrences where GlobalVisState moves
backwards.

Because comparing a transaction ID against GlobalVisState is more
expensive than comparing against a single XID, we defer this check until
after scanning all tuples on the page. Therefore, we perform the
GlobalVisState check only once per page. This is safe because
visibility_cutoff_xid records the newest live xmin on the page; if it is
globally visible, then the entire page is all-visible.
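
The once-per-page idea can be sketched as follows. This is a simplified
model, not the heap pruning code: VisSketch stands in for GlobalVisState
and is reduced to a single cutoff, XID wraparound is ignored, and only the
xmin test is modeled. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t XidSketch;

/* Stand-in for the (comparatively expensive) global visibility state. */
typedef struct VisSketch
{
    XidSketch visible_before;   /* XIDs below this are visible to all */
    int       checks;           /* how often the expensive test ran */
} VisSketch;

/* Models the expensive check; ignores XID wraparound. */
bool
xid_globally_visible(VisSketch *vis, XidSketch xid)
{
    vis->checks++;
    return xid < vis->visible_before;
}

/*
 * Track the newest live xmin while scanning the page, then run the
 * expensive visibility check a single time on that value.
 */
bool
page_all_visible(VisSketch *vis, const XidSketch *live_xmins, size_t n)
{
    XidSketch cutoff = 0;
    size_t    i;

    assert(vis && (live_xmins || n == 0));

    for (i = 0; i < n; i++)
    {
        if (live_xmins[i] > cutoff)
            cutoff = live_xmins[i];
    }

    if (n == 0)
        return true;            /* no live tuples to check */

    return xid_globally_visible(vis, cutoff);
}
```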

Using GlobalVisState means on-access pruning can also maintain
visibility_cutoff_xid, which is required to set the visibility map
on-access in the future.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk#c755ef151507aba58471ffaca607e493
2026-03-24 14:50:59 -04:00
Álvaro Herrera
f227b7b20c Avoid including clog.h in proc.h
The number of .c files that must include access/clog.h can currently be
counted on one's fingers and miss only one (assuming one has the usual
number of hands).  However, due to indirect inclusion via proc.h,
there are a lot of files that pointlessly include it.  This is easy
to avoid with the simple trick implemented by this commit.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202603221856.iwlhitt6dxxx@alvherre.pgsql
2026-03-24 17:31:16 +01:00
Álvaro Herrera
2102ebb195 Don't include storage/lock.h in so many headers
Since storage/locktags.h was added by commit 322bab7974, many headers
can be made leaner by depending on that instead of on storage/lock.h,
which has many other dependencies.

(In fact, some of these changes were possible even before that.)

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/abvrRZo52Yx9ZzWQ@ip-10-97-1-34.eu-west-3.compute.internal
2026-03-24 17:11:12 +01:00
Álvaro Herrera
5f2350a043 Fix dereference in a couple of GUC check hooks
check_backtrace_functions() and check_archive_directory() were doing an
empty-string check this way:
    *newval[0] == '\0'
which, because of operator precedence, is interpreted as *(newval[0])
instead of (*newval)[0] -- but these variables are pointers to C-strings
and we want to check the first character therein, rather than check the
first pointer of the array, so that interpretation is wrong.  This would
be wrong for any index element other than 0, as evidenced by every other
dereference of the same variable in check_backtrace_functions, which use
parentheses.

Add parentheses to make the intended dereference explicit.
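
The precedence issue can be reproduced in isolation. The helper names in
this sketch are hypothetical and not taken from the PostgreSQL source; only
the two expressions mirror the ones discussed above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A GUC check hook receives a pointer to the proposed string value
 * (char **newval).  These helpers are illustrations only.
 */

/* Parsed as *(newval[0]): index the pointer first, then dereference. */
bool
is_empty_unparenthesized(char **newval)
{
    assert(newval && *newval);
    return *newval[0] == '\0';
}

/* Explicit form: dereference first, then index into the string. */
bool
is_empty_parenthesized(char **newval)
{
    assert(newval && *newval);
    return (*newval)[0] == '\0';
}

/*
 * For index 0 both forms reduce to **newval, which is why the old code
 * happened to work.  For any other index only the parenthesized form is
 * valid: (*newval)[1] is the second character of the string, whereas
 * *newval[1] would read a second pointer that does not exist.
 */
char
second_char(char **newval)
{
    assert(newval && *newval && (*newval)[0] != '\0');
    return (*newval)[1];
}
```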

This is just cosmetic at this stage, so no backpatch, although it's been
"wrong" for a long time.

Author: Zhang Hu <kongbaik228@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/CAB5m2QssN6UO+ckr6ZCcV0A71mKUB6WdiTw1nHo43v4DTW1Dfg@mail.gmail.com
2026-03-24 16:45:39 +01:00
Fujii Masao
1c162c965a Report detailed errors from XLogFindNextRecord() failures.
Previously, XLogFindNextRecord() did not return detailed error information
when it failed to find a valid WAL record. As a result, callers such as
the WAL summarizer, pg_waldump, and pg_walinspect could only report generic
errors (e.g., "could not find a valid record after ..."), making
troubleshooting difficult.

This commit fixes the issue by extending XLogFindNextRecord() to return
detailed error information on failure, and updating its callers to include
those details in their error messages.

For example, when pg_waldump is run on a WAL file with an invalid magic number,
it now reports not only the generic error but also the specific cause
(e.g., "invalid magic number").

Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Mircea Cadariu <cadariu.mircea@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAO6_XqoxJXddcT4wkd9Xd+cD6Sz-fyspRGuV4Bq-wbXG4pVNzA@mail.gmail.com
2026-03-24 22:33:09 +09:00
Robert Haas
c98ad086ad Bounds-check access to TupleDescAttr with an Assert.
The second argument to TupleDescAttr should always be at least zero
and less than natts; otherwise, we index outside of the attribute
array. Assert that this is the case.

Various violations, or possible violations, of this rule that are
currently in the tree are actually harmless, because while
we do call TupleDescAttr() before verifying that the argument is
within range, we don't actually dereference it unless the argument
was within range all along. Nonetheless, the Assert means we
should be more careful, so tidy up accordingly.
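
A sketch of the pattern, using simplified hypothetical types rather than
the real TupleDesc: the accessor asserts that the index is in range, and a
tidy caller validates the index before fetching the attribute instead of
afterwards.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ATTS_SKETCH 8

/* Hypothetical, heavily simplified attribute and tuple descriptor. */
typedef struct AttrSketch
{
    int atttypid;
} AttrSketch;

typedef struct TupleDescSketch
{
    int        natts;
    AttrSketch attrs[MAX_ATTS_SKETCH];
} TupleDescSketch;

/* Accessor with the bounds check expressed as an assertion. */
AttrSketch *
tuple_desc_attr_sketch(TupleDescSketch *desc, int i)
{
    assert(i >= 0 && i < desc->natts);
    return &desc->attrs[i];
}

/* Caller that checks the range first and only then calls the accessor. */
bool
attr_has_type(TupleDescSketch *desc, int i, int typid)
{
    if (i < 0 || i >= desc->natts)
        return false;
    return tuple_desc_attr_sketch(desc, i)->atttypid == typid;
}
```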

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoacixUZVvi00hOjk_d9B4iYKswWP1gNqQ8Vfray-AcOCA@mail.gmail.com
2026-03-24 08:58:50 -04:00
Peter Eisentraut
e2f289e5b9 Make many cast functions error safe
This adjusts many C functions underlying casts to support soft errors.
This is in preparation for a future feature where conversion errors in
casts can be caught.

This patch covers cast functions that can be adjusted easily by
changing ereport to ereturn or making other light changes.  The
underlying helper functions were already changed to support soft
errors some time ago as part of soft error support in type input
functions.

Other casts and types will require some more work and are being kept
as separate patches.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
2026-03-24 12:08:22 +01:00
Robert Haas
570e2fcc04 Prevent spurious "indexes on virtual generated columns are not supported".
Both of the checks in DefineIndex() that can produce this error
message have a guard against negative attribute numbers, but lack a
guard to ensure that attno is non-zero. As a result, we can index
off the beginning of the TupleDesc and read a garbage byte for
attgenerated. If that byte happens to be 'v', we'll incorrectly
produce the error mentioned above.
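
The missing guard can be illustrated with a small sketch. The function
below is hypothetical and uses a plain array of attgenerated bytes instead
of a TupleDesc; it assumes, as in the expression-index case described in
this message, that an attno of zero does not name a table column.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * attgenerated[] holds one byte per table column, where 'v' marks a
 * virtual generated column.  Attribute numbers are 1-based.
 */
bool
column_is_virtual_generated(const char *attgenerated, int natts, int attno)
{
    assert(attgenerated && attno <= natts);

    /*
     * Guarding only against attno < 0 is not enough: attno == 0 would
     * index attgenerated[-1] and read a garbage byte.
     */
    if (attno <= 0)
        return false;

    return attgenerated[attno - 1] == 'v';
}
```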

The first call site is easy to hit: any attempt to create an
expression index does so. The second one is not currently hit in
the regression tests, but can be hit by something like
CREATE INDEX ON some_table ((some_function(some_table))).

Found by study of a test_plan_advice failure on buildfarm member
skink, though this issue has nothing to do with test_plan_advice
and seems to have only been revealed by happenstance.

Backpatch-through: 18
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoacixUZVvi00hOjk_d9B4iYKswWP1gNqQ8Vfray-AcOCA@mail.gmail.com
2026-03-24 06:28:33 -04:00
Alexander Korotkov
6888658516 Further improve commentary about ChangeVarNodesWalkExpression()
The updated comment explains why we use ChangeVarNodes_walker() instead of
expression_tree_walker(), and provides a bit more detail about the differences
in processing top-level Query and subqueries.

Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAPpHfdvbjq342WTQ705Wmqhe8794pcp7wospz%2BWUJ2qB7vuOqA%40mail.gmail.com
Backpatch-through: 18
2026-03-24 09:54:00 +02:00
Michael Paquier
4019f725f5 Add support for lock statistics in pgstats
This commit adds a new stats kind, called PGSTAT_KIND_LOCK, implementing
statistics for lock tags, as reported by pg_locks.  The implementation
is fixed-sized, as the data is capped based on the number of lock tags in
LockTagType.

The new statistics kind records the following fields, providing insight
regarding lock behavior, while avoiding impact on performance-critical
code paths (such as fast-path lock acquisition):
- waits and wait_time: respectively track the number of times a lock
required waiting and the total time spent acquiring it.  These metrics
are only collected once a lock is successfully acquired and after
deadlock_timeout has been exceeded.
- fastpath_exceeded: counts how often a lock could not be acquired via
the fast path due to the max_locks_per_transaction slot limits.

A new view called pg_stat_lock can be used to access this data, coupled
with a SQL function called pg_stat_get_lock().

Bump stat file format PGSTAT_FILE_FORMAT_ID.
Bump catalog version.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aIyNxBWFCybgBZBS%40ip-10-97-1-34.eu-west-3.compute.internal
2026-03-24 15:32:09 +09:00
Michael Paquier
a90d865182 Move some code blocks in lock.c and proc.c
This change will simplify an upcoming change that will introduce lock
statistics, reducing code churn.

This commit means that we begin to calculate the time it took to acquire
a lock after the deadlock check interrupt has run should log_lock_waits
be off, when taken in isolation.  This is not a performance-critical
code path, and note that log_lock_waits is enabled by default since
2aac62be8c.

Extracted from a larger patch by the same author.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aIyNxBWFCybgBZBS@ip-10-97-1-34.eu-west-3.compute.internal
2026-03-24 13:34:54 +09:00
Tom Lane
2e123e3c2b Silence compiler warning from older compilers.
Our RHEL7-vintage buildfarm animals are complaining about
"the comparison will always evaluate as true" for a usage of
SOFT_ERROR_OCCURRED() on a local variable.  This is the same
issue addressed in 7bc88c3d6 and some earlier commits, so solve
it the same way: write "escontext.error_occurred" instead.

Problem dates to recent commit a0b6ef29a, no need for back-patch.
2026-03-23 17:25:12 -04:00
Tom Lane
360dd6f7b4 Improve commentary about ChangeVarNodesWalkExpression().
IMO the proximate cause of the bug fixed in commit 07b7a964d
was sloppy thinking about what ChangeVarNodesWalkExpression()
is to be used for.  Flesh out its header comment to try to
improve that situation.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1607553.1774017006@sss.pgh.pa.us
Backpatch-through: 18
2026-03-23 11:14:24 -04:00
Michael Paquier
93b76db0ac Fix invalid value of pg_aios.pid, function pg_get_aios()
When the value of pg_aios.pid is found to be 0, the function had the
idea to set "nulls" to "false" instead of "true", without setting the
value stored in the tuplestore.  This could lead to the display of buggy
data.  The intention of the code is clearly to display NULL when a PID
of 0 is found, and this commit adjusts the logic to do so.
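
The intended behavior can be sketched like this. The function and its
signature are hypothetical and do not reproduce pg_get_aios(); the sketch
only shows the values/nulls pairing described above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Fill one output column with a PID, reporting NULL when the PID is 0.
 * values[] and nulls[] stand in for the per-row output arrays.
 */
void
fill_pid_column(int owner_pid, long *values, bool *nulls, int col)
{
    assert(values && nulls && col >= 0);

    if (owner_pid != 0)
    {
        values[col] = owner_pid;
        nulls[col] = false;
    }
    else
    {
        /*
         * The bug was equivalent to writing "false" here: the column was
         * marked non-NULL although values[col] had never been set.
         */
        nulls[col] = true;
    }
}
```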

Issue introduced by 60f566b4f2.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/tencent_7D61A85D6143AD57CA8D8C00DEC541869D06@qq.com
Backpatch-through: 18
2026-03-23 18:13:56 +09:00
Michael Paquier
ded9754804 Add missing deflateEnd() for server-side gzip base backups
The gzip basebackup sink called deflateInit2() in begin_archive() but
never called deflateEnd(), leaking zlib's internal compression state
(~256KB per archive) until the memory context of the base backup is
destroyed.

The code tree already has a matching deflateEnd() call for each
deflateInit[2]() call (pgcrypto, etc.), except for the file touched in
this commit, so this brings more consistency for all the compression
methods.  The server-side LZ4 and zstd implementations require a
dedicated cleanup callback as they allocate their state outside the
context of a palloc().

As currently used, deflateInit2() is called once per tablespace in a
single backup.  Memory would slightly bloat only when dealing with many
tablespaces at once, not across multiple base backups so this is not
worth a backpatch.  This change could matter for future uses of this
code.

zlib allows the definition of memory allocation and free callbacks in
the z_stream object given to a deflateInit[2]().  The base backup
backend code relies on palloc() for the allocations and deflateEnd()
internally only cleans up memory (no fd allocation for example).

Author: Jianghua Yang <yjhjstz@gmail.com>
Discussion: https://postgr.es/m/CAAZLFmQNJ0QNArpWEOZXwv=vbumcWKEHz-b1me5gBqRqG67EwQ@mail.gmail.com
2026-03-23 09:04:44 +09:00
Peter Geoghegan
e5836f7b7d Add fake LSN support to hash index AM.
Use fake LSNs in all hash AM critical sections that write a WAL record.
This gives us a reliable way (a way that works during scans of both
logged and unlogged relations) to detect when an index page was
concurrently modified during the window between when the page is
initially read (by _hash_readpage) and when the page has any known-dead
items LP_DEAD-marked (by _hash_kill_items).

Preparation for an upcoming patch that makes the hash index AM use the
amgetbatch interface, enabling I/O prefetching during hash index scans.

The amgetbatch design imposes certain rules on index AMs with respect to
how they hold on to index page buffer pins (at least in the case of pins
held as an interlock against unsafe concurrent TID recycling by VACUUM).
These rules have consequences for routines that set LP_DEAD bits on
index tuples from an amgetbatch index AM: such routines have an inherent
need to reason about concurrent TID recycling by VACUUM, but can no
longer rely on their amgettuple routine holding on to a buffer pin
(during the aforementioned window) as an interlock against such
recycling.  Instead, they have to follow a new, standardized approach.

The new approach taken by amgetbatch index AMs when setting LP_DEAD bits
is heavily based on the current nbtree dropPin design, which was added
by commit 2ed5b87f.  It also works by checking if the page's LSN
advanced during the window where unsafe concurrent TID recycling might
have taken place.
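
The LSN check described above can be sketched as a small model.  This
is Python, not the hash AM code: Page, LsnSource, modify_page() and
kill_items() are hypothetical names, and the counter merely stands in
for whatever real or fake LSN a critical section would assign.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0
    dead: set = field(default_factory=set)  # offsets marked LP_DEAD


class LsnSource:
    """Hands out monotonically increasing LSNs (real or fake)."""

    def __init__(self):
        self._next = 1

    def next_lsn(self):
        lsn = self._next
        self._next += 1
        return lsn


def modify_page(page, lsn_source):
    """Any change made in a critical section advances the page LSN."""
    page.lsn = lsn_source.next_lsn()


def kill_items(page, lsn_at_read, known_dead):
    """Mark items dead only if the page is unchanged since it was read."""
    if page.lsn != lsn_at_read:
        return False  # page changed: TIDs may have been recycled
    page.dead.update(known_dead)
    return True
```

Because the counter advances for every modification, the comparison
behaves the same whether or not the relation is WAL-logged, which is
the property the fake LSNs are meant to provide.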

This commit is similar to commit 8a879119, which taught nbtree to use
fake LSNs to improve its dropPin behavior.  However, unlike that commit,
this is not an independently useful enhancement, since hash doesn't
implement anything like nbtree's dropPin behavior (not yet).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
2026-03-22 17:31:43 -04:00
Melanie Plageman
01b7e4a46d Add pruning fast path for all-visible and all-frozen pages
Because of the SKIP_PAGES_THRESHOLD optimization or a stale prune XID,
heap_page_prune_and_freeze() can be invoked for pages with no pruning or
freezing work to do. To avoid this, if a page is already all-frozen or
it is all-visible and no freezing will be attempted, exit early. We
can't exit early if vacuum passed DISABLE_PAGE_SKIPPING, though.
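
The early-exit condition can be written out as a small predicate.  This
is a Python sketch of the rule as stated in this message only;
can_skip_pruning() and its parameter names are hypothetical and do not
correspond to identifiers in heap_page_prune_and_freeze().

```python
def can_skip_pruning(all_visible, all_frozen, will_attempt_freeze,
                     disable_page_skipping=False):
    """Return True when there is no pruning or freezing work to do."""
    if disable_page_skipping:
        # VACUUM (DISABLE_PAGE_SKIPPING) must always examine the page.
        return False
    if all_frozen:
        return True
    # All-visible pages can be skipped only if no freezing is planned.
    return all_visible and not will_attempt_freeze
```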

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-22 15:46:50 -04:00
Peter Geoghegan
f026fbf059 Make IndexScanInstrumentation a pointer in executor scan nodes.
Change the IndexScanInstrumentation fields in IndexScanState,
IndexOnlyScanState, and BitmapIndexScanState from inline structs to
pointers.  This avoids additional space overhead whenever new fields are
added to IndexScanInstrumentation in the future, at least in the common
case where the instrumentation isn't used (i.e. when the executor node
isn't being run through an EXPLAIN ANALYZE).

Preparation for an upcoming patch series that will add index
prefetching.  The new slot-based interface that will enable index
prefetching necessitates that we add at least one more field to
IndexScanInstrumentation (to count heap fetches during index-only
scans).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
2026-03-22 13:20:29 -04:00
Melanie Plageman
4f7ecca84d Detect and fix visibility map corruption in more cases
Move VM corruption detection and repair into heap page pruning. This
allows VM repair during on-access pruning, not only during vacuum.

Also, expand corruption detection to cover pages marked all-visible that
contain dead tuples and tuples inserted or deleted by in-progress
transactions, rather than only all-visible pages with LP_DEAD items.

Pinning the correct VM page before on-access pruning is cheap when
compared to the cost of actually pruning. The vmbuffer is saved in the
scan descriptor, so a query should only need to pin each VM page once,
and a single VM page covers a large number of heap pages.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-22 11:52:40 -04:00
Heikki Linnakangas
516310ed4d Don't reset 'latest_page_number' when replaying multixid truncation
'latest_page_number' is set to the correct value, according to
nextOffset, early at system startup. Contrary to the comment, it hence
should be set up correctly by the time we get to WAL replay.

This was committed to back-branches earlier already (commit
817f74600d), to fix a bug in a backwards-compatibility codepath. We
don't have that bug on 'master', but the change nevertheless makes
sense on 'master' too.

Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/20260214090150.GC2297@p46.dedyn.io;lightning.p46.dedyn.io
Discussion: https://www.postgresql.org/message-id/e1787b17-dc93-4621-a5a1-c713d1ac6a1b@iki.fi
2026-03-22 14:23:54 +02:00
Jeff Davis
4a0b46b6e1 Fix dependency on FDW's connection function.
Missed in commit 8185bb5347.

Catalog version bump.

Discussion: https://postgr.es/m/fd49b44dc65da8e71ab20c1cf1ec7e65921c20f5.camel@j-davis.com
2026-03-20 12:42:59 -07:00
Nathan Bossart
48f11bfa06 Bump transaction/multixact ID warning limits to 100M.
These warning limits were last changed to 40M by commit cd5e82256d.
For the benefit of workloads that rapidly consume transactions or
multixacts, this commit bumps the limits to 100M.  This will
hopefully give users enough time to react.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://postgr.es/m/aRdhSSFb9zZH_0zc%40nathan
2026-03-20 14:15:33 -05:00
Nathan Bossart
e646450e60 Add percentage of available IDs to wraparound warnings.
This commit adds DETAIL messages to the existing wraparound
WARNINGs that include the percentage of transaction/multixact IDs
that remain available for use.  The hope is that this more clearly
expresses the urgency of the situation.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://postgr.es/m/aRdhSSFb9zZH_0zc%40nathan
2026-03-20 14:15:33 -05:00
Tom Lane
733f20df53 Discount the metapage when estimating number of index pages visited.
genericcostestimate() estimates the number of index leaf pages to
be visited as a pro-rata fraction of the total number of leaf pages.
Or at least that was the intention.  What it actually used in the
calculation was the total number of index pages, so that non-leaf
pages were also counted.  In a decent-sized index the error is
probably small, since we expect upper page fanout to be high.
But in a small index that's not true; in the worst case with one
data-bearing page plus a metapage, we had 100% relative error.
This led to surprising planning choices such as not using a small
partial index.
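
A worked example of the error, as a simplified Python model.  It
assumes the estimate is a pro-rata share of the page count rounded up;
leaf_pages_visited() and its parameters are hypothetical and this is
not the planner's code.

```python
import math


def leaf_pages_visited(tuples_fetched, total_tuples, total_pages,
                       non_leaf_pages=0):
    """Pro-rata estimate of leaf pages touched by an index scan."""
    leaf_pages = total_pages - non_leaf_pages
    if leaf_pages > 1 and total_tuples > 1:
        return math.ceil(tuples_fetched * leaf_pages / total_tuples)
    return 1


# Tiny index: one data-bearing page plus a metapage, all tuples fetched.
old = leaf_pages_visited(100, 100, total_pages=2)
new = leaf_pages_visited(100, 100, total_pages=2, non_leaf_pages=1)

# Larger index: the same one-page correction barely matters.
old_big = leaf_pages_visited(50_000, 1_000_000, total_pages=10_001)
new_big = leaf_pages_visited(50_000, 1_000_000, total_pages=10_001,
                             non_leaf_pages=1)
```

In the tiny case the estimate drops from 2 pages to 1, which is the
100% relative error mentioned above; in the larger case it moves from
501 to 500.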

To fix, ask genericcostestimate's caller to supply an estimate of
the number of non-leaf pages, and subtract that.  For the built-in
index AMs, it seems sufficient to count the index metapage (if the
AM uses one) as non-leaf.  Per the above argument, counting upper
index pages shouldn't change the estimate much, and in most cases
we don't have any easy way of estimating the number of upper pages.
This might be an area for further research in future.

Any external genericcostestimate callers that do not set the new field
GenericCosts.numNonLeafPages will see the same behavior as before,
assuming they followed the advice to zero out that whole struct.

Unsurprisingly, this change affects a number of plans seen in the
core regression tests.  I hacked up the existing tests to keep the
tests' plans the same, since in each case it appeared that the
test's intent was to test exactly that plan.  Also add one new
test case demonstrating that a better index choice is now made.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://postgr.es/m/870521.1745860752@sss.pgh.pa.us
2026-03-20 14:50:53 -04:00
Alexander Korotkov
07b7a964d3 Fix self-join removal to update bare Var references in join clauses
Self-join removal failed to update Var nodes when the join clause was a
bare Var (e.g., ON t1.bool_col) rather than an expression containing
Vars.  ChangeVarNodesWalkExpression() used expression_tree_walker(),
which descends into child nodes but does not process the top-level node
itself.  When a bare Var referencing the removed relation appeared as
the clause, its varno was left unchanged, leading to "no relation entry
for relid N" errors.

Fix by calling ChangeVarNodes_walker() directly instead of
expression_tree_walker(), so the top-level node is also processed.
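
The difference between the two calls can be illustrated with a toy
tree.  This is a Python model with hypothetical Var and OpExpr classes,
not the planner's node types: walk_children_only() mimics a walker that
descends into children but never examines the node it was handed.

```python
from dataclasses import dataclass, field


@dataclass
class Var:
    varno: int


@dataclass
class OpExpr:
    args: list = field(default_factory=list)


def change_varno_walker(node, old, new):
    """Process this node itself, then recurse into its children."""
    if isinstance(node, Var):
        if node.varno == old:
            node.varno = new
        return
    for child in getattr(node, "args", []):
        change_varno_walker(child, old, new)


def walk_children_only(node, old, new):
    """Visit only the children of the top-level node."""
    for child in getattr(node, "args", []):
        change_varno_walker(child, old, new)
```

With a bare Var as the whole clause, walk_children_only() finds no
children and leaves varno untouched, while calling the walker directly
updates it.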

Bug: #19435
Reported-by: Hang Ammmkilo <ammmkilo@163.com>
Author: Andrei Lepikhov <lepihov@gmail.com>
Co-authored-by: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/19435-3cc1a87f291129f1%40postgresql.org
Backpatch-through: 18
2026-03-20 15:46:30 +02:00
Álvaro Herrera
e7975f1c06 SET NOT NULL: Call object-alter hook only after the catalog change
... otherwise, the function invoked by the hook might consult the
catalog and not see that the new constraint exists.

This relies on set_attnotnull doing CommandCounterIncrement()
after successfully modifying the catalog.

Oversight in commit 14e87ffa5c.

Author: Artur Zakirov <zaartur@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CAKNkYnxUPCJk-3Xe0A3rmCC8B8V8kqVJbYMVN6ySGpjs_qd7dQ@mail.gmail.com
2026-03-20 14:38:50 +01:00
Andrew Dunstan
4c0390ac53 Add option force_array for COPY JSON FORMAT
This adds the force_array option, which is available exclusively
when using COPY TO with the JSON format.

When enabled, this option wraps the output in a top-level JSON array
(enclosed in square brackets with comma-separated elements), making the
entire result a valid single JSON value.  Without this option, the
default behavior is to output a stream of independent JSON objects.
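
The practical difference between the two output shapes can be shown
with a short Python sketch.  The sample rows and helper names are made
up, and the exact whitespace that COPY emits is an assumption here;
only the "stream of objects" versus "single array" distinction is taken
from this message.

```python
import json

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


def as_object_stream(rows):
    """One independent JSON object per line (the default shape)."""
    return "".join(json.dumps(r) + "\n" for r in rows)


def as_single_array(rows):
    """All rows wrapped in one top-level JSON array."""
    return "[\n" + ",\n".join(json.dumps(r) for r in rows) + "\n]\n"


def is_single_json_value(text):
    """True if the whole text parses as exactly one JSON value."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

A consumer that parses the whole output at once accepts the array form
but rejects a multi-row object stream, which has to be read line by
line instead.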

Attempting to use this option with COPY FROM or with formats other than
JSON will raise an error.

Author: Joe Conway <mail@joeconway.com>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
2026-03-20 08:40:17 -04:00
Andrew Dunstan
7dadd38cda json format for COPY TO
This introduces the JSON format option for the COPY TO command, allowing
users to export query results or table data directly as a stream of JSON
objects (one per line, NDJSON style).

The JSON format is currently supported only for COPY TO operations; it
is not available for COPY FROM.

JSON format is incompatible with some standard text/CSV formatting
options, including HEADER, DEFAULT, NULL, DELIMITER, FORCE QUOTE,
FORCE NOT NULL, and FORCE NULL.

Column list support is included: when a column list is specified, only
the named columns are emitted in each JSON object.

Regression tests covering valid JSON exports and error handling for
incompatible options have been added to src/test/regress/sql/copy.sql.

Author: Joe Conway <mail@joeconway.com>
Author: jian he <jian.universality@gmail.com>
Co-Authored-By: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Davin Shearer <davin@apache.org>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
2026-03-20 08:40:04 -04:00
Andrew Dunstan
a2145605ee introduce CopyFormat, refactor CopyFormatOptions
Currently, the COPY command format is determined by two boolean fields
(binary, csv_mode) in CopyFormatOptions.  This approach, while
functional, isn't ideal for implementing other formats in the future.

To simplify adding new formats, introduce a CopyFormat enum.  This makes
the code cleaner and more maintainable, allowing for easier integration
of additional formats down the line.
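
The idea can be sketched in Python.  The member names and the
format_from_flags() helper are hypothetical illustrations, not the
identifiers used in the C sources.

```python
from enum import Enum, auto


class CopyFormat(Enum):
    # A further format would be one more member here, not another boolean.
    TEXT = auto()
    CSV = auto()
    BINARY = auto()


def format_from_flags(binary, csv_mode):
    """Translate the old two-boolean encoding into a single enum value."""
    if binary and csv_mode:
        # Two booleans can express a state that makes no sense.
        raise ValueError("binary and csv_mode cannot both be set")
    if binary:
        return CopyFormat.BINARY
    if csv_mode:
        return CopyFormat.CSV
    return CopyFormat.TEXT
```

Two booleans allow four combinations for three formats, so one
combination is always invalid; an enum has exactly one value per
format.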

Author: Joel Jacobson <joel@compiler.org>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
2026-03-20 08:21:57 -04:00
Amit Kapila
493f8c6439 Add support for EXCEPT TABLE in ALTER PUBLICATION.
Following commit fd366065e0, which added EXCEPT TABLE support to
CREATE PUBLICATION, this commit extends ALTER PUBLICATION to allow
modifying the exclusion list.

New Syntax:
ALTER PUBLICATION name SET  publication_all_object [, ... ]

where publication_all_object is one of:
ALL TABLES [ EXCEPT TABLE ( except_table_object [, ... ] ) ]
ALL SEQUENCES

If the EXCEPT clause is provided, the existing exclusion list in
pg_publication_rel is replaced with the specified relations. If the
EXCEPT clause is omitted, any existing exclusions for the publication
are cleared. Similarly, SET ALL SEQUENCES updates

Note that because this is a SET command, specifying only one object
type (e.g., SET ALL SEQUENCES) will reset the other unspecified flags
(e.g., setting puballtables to false).

Consistent with CREATE PUBLICATION, only root partitioned tables or
standard tables can be specified in the EXCEPT list. Specifying a
partition child will result in an error.

Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
2026-03-20 11:36:09 +05:30
David Rowley
07d5bffe75 Fix new tuple deforming code so it can support cstrings again
In c456e3911, I mistakenly thought that the deformer code would never
see cstrings and that I could use pg_assume() to have the compiler omit
producing code for attlen == -2 attributes.  That saves bloating the
deforming code a bit with the extra check and strlen() call.  While this
is ok to do for tuples from the heap, it's not ok to do for
MinimalTuples as those *can* contain cstrings and
tts_minimal_getsomeattrs() implements deforming by inlining the
(slightly misleadingly named) slot_deform_heap_tuple() code.

To fix, add a new parameter to slot_deform_heap_tuple() and have the
callers define which code to inline.  Because this new parameter is
passed as a const, the compiler can choose to emit or not emit the
cstring-related code based on the parameter's value.

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNmSK+gKziAt_WvQoMVWt3_LRVMmRYY9dAbMPMcpPV0QmA@mail.gmail.com
2026-03-20 14:16:06 +13:00
Jeff Davis
703fee3b25 Fix dependency on FDW handler.
ALTER FOREIGN DATA WRAPPER could drop the dependency on the handler
function if it wasn't explicitly specified.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/35c44a4b7fb76d35418c4d66b775a88f4ce60c86.camel@j-davis.com
Backpatch-through: 14
2026-03-19 15:07:43 -07:00
Masahiko Sawada
adcdbe9386 Add parallel vacuum worker usage to VACUUM (VERBOSE) and autovacuum logs.
This commit adds both the number of parallel workers planned and the
number of parallel workers actually launched to the output of
VACUUM (VERBOSE) and autovacuum logs.

Previously, this information was only reported as an INFO message
during VACUUM (VERBOSE), which meant it was not included in autovacuum
logs in practice. Although autovacuum does not yet support parallel
vacuum, a subsequent patch will enable it and utilize these logs in
its regression tests. This change also improves observability by
making it easier to verify if parallel vacuum is utilizing the
expected number of workers.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
2026-03-19 15:01:47 -07:00
Masahiko Sawada
ba21f5bf8a Allow explicit casting between bytea and uuid.
This enables the use of functions such as encode() and decode() with
UUID values, allowing them to be converted to and from alternative
formats like base64 or hex.

The cast maps the 16-byte internal representation of a UUID directly
to a bytea datum. This is more efficient than going through a text
representation.
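
As an analogy outside the database, the same round trip looks like this
in Python.  The UUID value is an arbitrary sample, and Python's uuid
and base64 modules stand in for the cast plus encode()/decode(); this
is not SQL and not the server code.

```python
import base64
import uuid

u = uuid.UUID("a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11")

raw = u.bytes                      # the 16-byte binary form
as_hex = raw.hex()                 # same digits as the text form, no dashes
as_base64 = base64.b64encode(raw).decode("ascii")

# Round trip: decode the base64 text back into the same UUID.
restored = uuid.UUID(bytes=base64.b64decode(as_base64))
```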

Bump catalog version.

Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Co-authored-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com
2026-03-19 13:51:50 -07:00
Tom Lane
1811f1af98 Improve hash join's handling of tuples with null join keys.
In a plain join, we can just summarily discard an input tuple
with null join key(s), since it cannot match anything from
the other side of the join (assuming a strict join operator).
However, if the tuple comes from the outer side of an outer join
then we have to emit it with null-extension of the other side.

Up to now, hash joins did that by inserting the tuple into the hash
table as though it were a normal tuple.  This is unnecessarily
inefficient though, since the required processing is far simpler than
for a potentially-matchable tuple.  Worse, if there are a lot of such
tuples they will bloat the hash bucket they go into, possibly causing
useless repeated attempts to split that bucket or increase the number
of batches.  We have a report of a large join vainly creating many
thousands of batches when faced with such input.

This patch improves the situation by keeping such tuples out of the
hash table altogether, instead pushing them into a separate tuplestore
from which we return them later.  (One might consider trying to return
them immediately; but that would require substantial refactoring, and
it doesn't work anyway for cases where we rescan an unmodified hash
table.)  This works even in parallel hash joins, because whichever
worker reads a null-keyed tuple can just return it; there's no need
for consultation with other workers.  Thus the tuplestores are local
storage even in a parallel join.
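
The approach can be modeled with a toy join in Python.  This is a
sketch under simplifying assumptions (single process, in-memory lists,
None standing in for NULL); hash_right_join() and its variable names
are hypothetical and unrelated to the executor's data structures.

```python
def hash_right_join(probe_rows, build_rows, key):
    """Join that preserves every build-side row (probe RIGHT JOIN build).

    Rows are dicts; a key value of None plays the role of SQL NULL.
    """
    table = {}        # key -> list of [row, matched_flag]
    null_keyed = []   # side store for rows that can never match

    for row in build_rows:
        if row[key] is None:
            null_keyed.append(row)  # kept out of the hash table
        else:
            table.setdefault(row[key], []).append([row, False])

    out = []
    for probe in probe_rows:
        if probe[key] is None:
            continue  # cannot match anything under a strict operator
        for entry in table.get(probe[key], []):
            entry[1] = True
            out.append((probe, entry[0]))

    # Emit unmatched build rows, then the null-keyed ones, null-extended.
    for entries in table.values():
        out.extend((None, row) for row, matched in entries if not matched)
    out.extend((None, row) for row in null_keyed)
    return out
```

However many null-keyed rows arrive, they never land in a hash bucket,
so they cannot trigger bucket splits or extra batches.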

A pre-existing buglet that I noticed while analyzing the code's
behavior is that ExecHashRemoveNextSkewBucket fails to decrement
hashtable->skewTuples for tuples moved into the main hash table
from the skew hash table.  This invalidates ExecHashTableInsert's
calculation of the number of main-hash-table tuples, though probably
not by a lot since we expect the skew table to be small relative
to the main one.  Nonetheless, let's fix that too while we're here.

Bug: #18909
Reported-by: Sergey Koposov <Sergey.Koposov@ed.ac.uk>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/3061845.1746486714@sss.pgh.pa.us
2026-03-19 15:21:36 -04:00
Nathan Bossart
dd1398f137 Allow choosing specific grantors via GRANT/REVOKE ... GRANTED BY.
Except for GRANT and REVOKE on roles, the GRANTED BY clause
currently only accepts the current role to match the SQL standard.
And even if an acceptable grantor (i.e., the current role) is
specified, Postgres ignores it and chooses the "best" grantor for
the command.  Allowing the user to select a specific grantor would
allow better control over the precise behavior of GRANT/REVOKE
statements.  This commit adds that ability.  For consistency with
select_best_grantor(), we only permit choosing grantor roles for
which the current role inherits privileges.

Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/aRYLkTpazxKhnS_w%40nathan
2026-03-19 11:41:39 -05:00
Robert Haas
6f0738ddec dshash: Make it possible to suppress out of memory errors
Introduce dshash_find_or_insert_extended, which is just like
dshash_find_or_insert except that it takes a flags argument.
Currently, the only supported flag is DSHASH_INSERT_NO_OOM, but
I have chosen to use an integer rather than a boolean in case we
end up with more flags in the future.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoaJwUukUZGu7_yL74oMTQQz2=zqucMhF9+9xBmSC5us1w@mail.gmail.com
2026-03-19 11:51:17 -04:00
Tom Lane
5a2043bf71 Fix transient memory leakage in jsonpath evaluation.
This patch reimplements JsonValueList to be more space-efficient
and arranges for temporary JsonValueLists created during jsonpath
evaluation to be freed when no longer needed, rather than being
leaked till the end of the function evaluation cycle as before.

The motivation is to prevent indefinite memory bloat while
evaluating jsonpath expressions that traverse a lot of data.
As an example, this query
  SELECT
    jsonb_path_query((SELECT jsonb_agg(i) FROM generate_series(1,10000) i),
                     '$[*] ? (@ < $)');
formerly required about 6GB to execute, with the space required
growing quadratically with the length of the input array.
With this patch the memory consumption stays static.  (The time
required is still quadratic, but we can't do much about that: this
path expression asks to compare each array element to each other one.)

The bloat happens because we construct a JsonValueList containing all
the array elements to represent the second occurrence of "$", and then
just leak it after evaluating the filter expression for any one value
generated from "$[*]".  If I were implementing this functionality from
scratch I'd probably try to avoid materializing that representation at
all, but changing that now looks like more trouble than it's worth.
This patch takes the more conservative approach of just making sure
we free the list after we're done with it.

The existing representation of JsonValueList is neither especially
compact nor especially easy to free: it's a List containing pointers
to separately-palloc'd JsonbValue structs.  We could theoretically
use list_free_deep, but it's not 100% clear that all the JsonbValues
are always safe for us to free.  In any case we are talking about a
lot of palloc/pfree traffic if we keep it like this.  This patch
replaces that with what's essentially an expansible array of
JsonbValues, so that even a long list requires relatively few
palloc requests.  Also, for the very common case that only one or
two elements appear in the list, this representation uses *zero*
pallocs: the elements can be kept in the on-the-stack base struct.

Note that we are only interested in freeing the JsonbValue structs
themselves.  While many types of JsonbValue include pointers to
external data such as strings or numerics, we expect that that data
is part of the original jsonb input Datum(s) and need not (indeed
cannot) be freed here.

In this reimplementation, JsonValueListAppend() always copies the
supplied JsonbValue struct into the JsonValueList data.  This allows
simplifying and regularizing many call sites that sometimes palloc'd
JsonbValues and sometimes passed a local-variable JsonbValue.  Always
doing the latter is simpler, faster, and less bug-prone.

I also removed JsonValueListLength() in favor of constant-time tests
for whether the list has zero, one, or more than one member, which is
what the callers really need to know.  JsonValueListLength() was not
a hot code path, so this aspect of the patch won't move the needle in
the least performance-wise.  But it seems neater.

I've not done any wide-ranging performance testing, but this should
be faster than the old code thanks to reduction of palloc overhead.
On the specific example shown above, it's about twice as fast as
before on not-very-large inputs; and of course it wins big if you
consider an input large enough to drive the old code into swapping.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/569394.1773783211@sss.pgh.pa.us
2026-03-19 11:37:14 -04:00
Peter Eisentraut
7724cb9935 Add some const qualifiers enabled by typeof_unqual change on copyObject
The recent commit to change copyObject() to use typeof_unqual allows
cleaning up some APIs to take advantage of this improved qualifier
handling.  EventTriggerCollectSimpleCommand() is a good example: It
takes a node tree and makes a copy that it keeps around for its
internal purposes, but it can't communicate via its function signature
that it promises not to scribble on the passed node tree.  That is now
fixed.

Reviewed-by: David Geier <geidav.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
2026-03-19 06:35:54 +01:00
David Rowley
c95cd2991f Short-circuit row estimation in NOT IN containing NULL consts
ScalarArrayOpExpr used for either NOT IN or <>/= ALL, when the array
contains a NULL constant, will never evaluate to true.  Here we add an
explicit short-circuit in scalararraysel() to account for this and return
0.0 rows when we see that a NULL exists.  When the array is a constant,
we can very quickly see if there are any NULL values and return early
before going to much effort in scalararraysel().  For non-const arrays,
we short-circuit after finding the first NULL and forego selectivity
estimations of any remaining elements.
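
The reasoning follows from SQL's three-valued logic, which can be
modeled in Python with None standing in for NULL.  sql_not_in() and
not_in_selectivity() are hypothetical helpers for illustration; they do
not reproduce scalararraysel().

```python
def sql_not_in(x, values):
    """Three-valued result of `x NOT IN (values)`: True, False or None."""
    if x is None:
        return None if values else True
    saw_null = False
    for v in values:
        if v is None:
            saw_null = True   # x <> NULL is unknown, never true
        elif x == v:
            return False
    return None if saw_null else True


def not_in_selectivity(values, estimate):
    """Skip the per-element estimate when a NULL makes it pointless."""
    if any(v is None for v in values):
        return 0.0
    return estimate(values)
```

Once the list contains a NULL, the result is either false or unknown
for every input, so no row can pass the qual.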

In the future, it might be better to do something for this case in
constant folding.  We would need to be careful to only do this for
strict operators on expressions located in places that don't care about
distinguishing false from NULL returns. i.e. EXPRKIND_QUAL expressions.
Doing that requires a bit more thought and effort, so here we just fix
some needlessly slow selectivity estimations for ScalarArrayOpExpr
containing many array elements and at least one NULL.

Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/eaa2598c-5356-4e1e-9ec3-5fd6eb1cd704@tantorlabs.com
2026-03-19 17:16:36 +13:00
Michael Paquier
79a5911fe6 Add more debugging information for bgworker termination tests of worker_spi
widowbird has failed again after af8837a10b, with the same symptoms of
a backend still lying around when attempting a database rename with a
bgworker connected to the database being renamed.

We are still not sure yet how the failure can be reached, if this is a
timing issue in the test or an actual bug in the logic used for
interruptible bgworkers.  This commit adds more debugging information in
the backend to help with the analysis as a temporary measure.

Another thing I have noticed is that the queries launching the dynamic
bgworkers or checking pg_stat_activity would connect to the database
renamed.  These are switched to use 'postgres'.  That will hopefully
remove some of the friction of the test, but I doubt that this is the
end of the story.

Discussion: https://postgr.es/m/abtJLEAsf1HZXWdR@paquier.xyz
2026-03-19 11:39:31 +09:00
Daniel Gustafsson
4f433025f6 ssl: Serverside SNI support for libpq
Support for SNI was added to clientside libpq in 5c55dc8b47 with the
sslsni parameter, but there was no support for utilizing it serverside.
This adds support for serverside SNI such that certificate/key handling
is available per host.  A new config file, $datadir/pg_hosts.conf, is
used for configuring which certificate and key should be used for which
hostname.  In order to use SNI the ssl_sni GUC must be set to on; when
it is off the SSL configuration works just like before.  If ssl_sni is
enabled and pg_hosts.conf is non-empty it will take precedence over
the regular SSL GUCs; if it is empty or missing the regular GUCs will
be used just as before this commit with no hostname specific handling.
The TLS init hook is not compatible with ssl_sni since it operates on
a single TLS configuration and SNI breaks that assumption.  If the init
hook and ssl_sni are both enabled, a WARNING will be issued.

Host configuration can either be for a literal hostname to match, non-
SNI connections using the no_sni keyword or a default fallback matching
all connections.  By omitting no_sni and the fallback a strict mode
can be achieved where only connections using sslsni=1 and a specified
hostname are allowed.

CRL file(s) are applied from postgresql.conf to all configured hostnames.

Serverside SNI requires OpenSSL, currently LibreSSL does not support
the required infrastructure to update the SSL context during the TLS
handshake.

Author: Daniel Gustafsson <daniel@yesql.se>
Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Dewei Dai <daidewei1970@163.com>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/1C81CD0D-407E-44F9-833A-DD0331C202E5@yesql.se
2026-03-18 12:37:11 +01:00
Peter Eisentraut
905e44152a Allow setting the collation strength in ICU tailoring rules
There was a bug that if you created an ICU collation with tailoring
rules, any strength specification inside the rules was ignored.  This
was because we called ucol_openRules() with UCOL_DEFAULT_STRENGTH for
the strength argument, which overrides the strength.  This was because
of faulty guidance in the ICU documentation, which has since been
fixed.  The correct invocation is to use UCOL_DEFAULT for the strength
argument.

This fixes bug #18771 and bug #19425.

Author: Daniel Verite <daniel@manitou-mail.org>
Reported-by: Ruben Ruiz <ruben.ruizcuadrado@gmail.com>
Reported-by: dorian.752@live.fr
Reported-by: Todd Lang <Todd.Lang@D2L.com>
Discussion: https://www.postgresql.org/message-id/flat/YT2PPF959236618377A072745A280E278F4BE1DA@YT2PPF959236618.CANPRD01.PROD.OUTLOOK.COM
Discussion: https://www.postgresql.org/message-id/flat/18771-98bb23e455b0f367@postgresql.org
Discussion: https://www.postgresql.org/message-id/flat/19425-58915e19dacd4f40%40postgresql.org
2026-03-18 08:58:47 +01:00
Andrew Dunstan
3b4c2b9db2 Allow IS JSON predicate to work with domain types
The IS JSON predicate only accepted the base types text, json, jsonb, and
bytea.  Extend it to also accept domain types over those base types by
resolving through getBaseType() during parse analysis.

The base type OID is stored in the JsonIsPredicate node (as exprBaseType)
so the executor can dispatch to the correct validation path without
repeating the domain lookup at runtime.

When a non-supported type (or domain over a non-supported type) is used,
the error message displays the original type name as written by the user,
rather than the resolved base type.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CACJufxEk34DnJFG72CRsPPT4tsJL9arobX0tNPsn7yH28J=zQg@mail.gmail.com
2026-03-17 15:20:22 -04:00
Andres Freund
f5eb854ab6 Fix use of wrong variable in _hash_kill_items()
In 82467f627b I somehow ended up using 'so->currPos.buf' instead of the 'buf'
variable, which is incorrect when the buffer is not already pinned. At the
very least this can lead to assertion failures.

Unfortunately this shows that this code path was not covered. Expand
src/test/modules/index/specs/killtuples.spec to test it.  Until now the
'result' step always reported either 0 or 1 buffer accesses, but when
exercising hash overflows, more buffers are accessed.  To avoid depending on
the precise number of accesses, change the result step to return whether there
were any heap accesses. That makes the change a lot more verbose, but still
seems worth it.

Reported-by: Alexander Kuzmenkov <akuzmenkov@tigerdata.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/vjtmvwvbxt7w5uyacxpzibpj65ewcb7uqaqbhd4arvnjbp5jqz%405ksdh6fsyqve
Discussion: https://postgr.es/m/b9de8d05-3b02-4a27-9b0b-03972fa4bfd3@iki.fi
2026-03-17 14:54:41 -04:00
Andrew Dunstan
ecd9288624 make immutability tests in to_json and to_jsonb complete
Complete the TODOs in to_json_is_immutable() and to_jsonb_is_immutable()
by recursing into container types (arrays, composites, ranges, multiranges,
domains) to check element/sub-type mutability, rather than conservatively
returning "mutable" for all arrays and composites.

The shared logic is factored into a single json_check_mutability() function
in jsonfuncs.c, with the existing exported functions as thin wrappers.
Composite type inspection uses lookup_rowtype_tupdesc() (typcache) instead
of relation_open() to avoid unnecessary lock acquisition in the optimizer.

Range and multirange types are now also checked recursively: if the
subtype's conversion is immutable, the range is considered immutable
for JSON purposes, even though range_out is generically marked STABLE.
This is a behavioral change: range types with immutable subtypes (e.g.,
int4range) can now appear in expression indexes via JSON_ARRAY/JSON_OBJECT,
whereas previously they were conservatively rejected.

Add regression tests for JSON_ARRAY and JSON_OBJECT mutability with
expression indexes and generated columns, covering arrays, composites,
domains, ranges, multiranges and combinations thereof.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CACJufxFz=OsXQdsMJ-cqoqspD9aJrwntsQP-U2A-UaV_M+-S9g@mail.gmail.com
Commitfest: https://commitfest.postgresql.org/patch/5759
2026-03-17 11:28:33 -04:00
Nathan Bossart
3b88e50d6c Add more columns to pg_stats, pg_stats_ext, and pg_stats_ext_exprs.
This commit adds table OID and attribute number columns to
pg_stats, and it adds table OID and statistics object OID columns
to pg_stats_ext and pg_stats_ext_exprs.  A proposed follow-up
commit would use pg_stats.tableid to simplify a query in pg_dump.
The others have no immediate purpose but may be useful later.

Bumps catversion.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM%3DcoCVy92QkVUUTLdo5eO2bMDtwMrzRn_8miAhX%2BuPaqXg%40mail.gmail.com
2026-03-17 09:26:27 -05:00
Peter Eisentraut
c9babbc881 Dump labels in reproducible order
In pg_get_propgraphdef(), sort the labels before writing out, for a
consistent dump order.  Also, since we now have a list, we can get rid
of the separate table scan to get the count.

Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Co-authored-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-17 14:07:29 +01:00
Michael Paquier
233e6ae953 gen_guc_tables.pl: Improve detection of inconsistent data
This commit adds two improvements to gen_guc_tables.pl:
1) When finding two entries with the same name, the script complained
about these being not in alphabetical order, which was confusing.
Duplicated entries are now reported as their own error.
2) While the presence of the required fields is checked for all the
parameters, the script did not perform any checks on the non-required
fields.  A check is added to check that any field defined matches with
what can be accepted.  Previously, a typo in the name of a required
field would cause the field to be reported as missing.  Non-mandatory
fields would be silently ignored, which was problematic as we could lose
some information.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAN4CZFP=3xUoXb9jpn5OWwicg+rbyrca8-tVmgJsQAa4+OExkw@mail.gmail.com
2026-03-17 17:38:55 +09:00
Michael Paquier
1a7ccd2b33 Refactor some code around ALTER TABLE [NO] INHERIT
[NO] INHERIT is not supported for partitioned tables, but this portion
of tablecmds.c did not apply the same rules as the other sub-commands,
checking the relkind in the execution phase, not the preparation phase.

This commit refactors the code to centralize the relkind and other
checks in the preparation phase for both command patterns, getting rid
of one translatable string on the way.  ATT_PARTITIONED_TABLE is
removed from ATSimplePermissions(), and the child relation is checked
the same way for both sub-commands.  The ALTER TABLE patterns that now
fail at preparation failed already at execution, hence there should be
no changes from the user perspective except more consistent error
messages generated.

Some comments at the top of ATPrepAddInherit() were incorrect,
CreateInheritance() being the routine checking the columns and
constraints between the parent and its to-be-child.

Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAEoWx2kggo1N2kDH6OSfXHL_5gKg3DqQ0PdNuL4LH4XSTKJ3-g@mail.gmail.com
2026-03-17 14:34:29 +09:00
David Rowley
d8a859d22b Reduce size of CompactAttribute struct to 8 bytes
Previously, this was 16 bytes.  With the use of some bitflags and by
reducing the attcacheoff field size to a 16-bit type, we can halve the
size of the struct.

It's unlikely that caching offsets larger than what will fit in a
16-bit int will help much, as the tuple is very likely to have some
non-fixed-width types anyway, the offsets of which we cannot cache.

Shrinking this down to 8 bytes helps by accessing fewer cachelines when
performing tuple deformation.  The fields used there are all fully
fledged fields, which don't require any bitmasking to extract the value
of.  It also helps to more efficiently calculate the address of a
compact_attrs[] element in TupleDesc as the x86 LEA instruction can work
with 8 byte offsets, which allows the element address to be calculated
from the TupleDesc's address in a single instruction using LEA's
concurrent shift and add.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com
2026-03-17 15:06:31 +13:00
Fujii Masao
d927b4bd97 Fix WAL flush LSN used by logical walsender during shutdown
Commit 6eedb2a5fd made the logical walsender call
XLogFlush(GetXLogInsertRecPtr()) to ensure that all pending WAL is flushed,
fixing a publisher shutdown hang. However, if the last WAL record ends at
a page boundary, GetXLogInsertRecPtr() can return an LSN pointing past
the page header, which can cause XLogFlush() to report an error.

A similar issue previously existed in the GiST code. Commit b1f14c9672
introduced GetXLogInsertEndRecPtr(), which returns a safe WAL insertion end
location (returning the start of the page when the last record ends at a page
boundary), and updated the GiST code to use it with XLogFlush().

This commit fixes the issue by making the logical walsender use
XLogFlush(GetXLogInsertEndRecPtr()) when flushing pending WAL during shutdown.

Backpatch to all supported versions.

Reported-by: Andres Freund <andres@anarazel.de>
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/vzguaguldbcyfbyuq76qj7hx5qdr5kmh67gqkncyb2yhsygrdt@dfhcpteqifux
Backpatch-through: 14
2026-03-17 08:10:20 +09:00
David Rowley
7a2ab122a1 Fix thinko in nocachegetattr() and nocache_index_getattr()
This code was recently adjusted by c456e3911, but that commit didn't get
the logic correct when finding the attnum to start walking the tuple in.
If there is a NULL, we need to start walking the tuple before it.

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNnb-s_=VdVUZ9h7dPA0u3hxV8x2aU3obZytnqQZ_MiROA@mail.gmail.com
2026-03-17 09:00:39 +13:00
Álvaro Herrera
fba4233c83 Reduce header inclusions via execnodes.h
Remove a bunch of #include lines from execnodes.h.  Most of these
require suitable typedefs to be added, so that it still compiles
standalone.  In one case, the fix is to move a struct definition to the
one .c file where it is needed.

Also some light clean up in plannodes.h and genam.h, though not as
extensive as in execnodes.h.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/202603131240.ihwqdxnj7w2o@alvherre.pgsql
2026-03-16 14:34:57 +01:00
Peter Eisentraut
5c2a8d272b Use C11 alignas in typedef definitions
They were already using pg_attribute_aligned.  This replaces that with
alignas and moves that into the required syntactic position.

Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/d7a788fa-e609-4894-a8be-2f70e135424f%40eisentraut.org
2026-03-16 11:35:51 +01:00
Peter Eisentraut
2f094e7ac6 SQL Property Graph Queries (SQL/PGQ)
Implementation of SQL property graph queries, according to SQL/PGQ
standard (ISO/IEC 9075-16:2023).

This adds:

- GRAPH_TABLE table function for graph pattern matching
- DDL commands CREATE/ALTER/DROP PROPERTY GRAPH
- several new system catalogs and information schema views
- psql \dG command
- pg_get_propgraphdef() function for pg_dump and psql

A property graph is a relation with a new relkind RELKIND_PROPGRAPH.
It acts like a view in many ways.  It is rewritten to a standard
relational query in the rewriter.  Access privileges act similar to a
security invoker view.  (The security definer variant is not currently
implemented.)

Starting documentation can be found in doc/src/sgml/ddl.sgml and
doc/src/sgml/queries.sgml.

Author: Peter Eisentraut <peter@eisentraut.org>
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Ajay Pal <ajay.pal.k@gmail.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-16 10:14:18 +01:00
Fujii Masao
fd6ecbfa75 Ensure "still waiting on lock" message is logged only once per wait.
When log_lock_waits is enabled, the "still waiting on lock" message is normally
emitted only once while a session continues waiting. However, if the wait is
interrupted, for example by wakeups from client_connection_check_interval,
SIGHUP for configuration reloads, or similar events, the message could be
emitted again each time the wait resumes.

For example, with very small client_connection_check_interval values
(e.g., 100 ms), this behavior could flood the logs with repeated messages,
making them difficult to use.

To prevent this, this commit guards the "still waiting on lock" message so
it is reported at most once during a lock wait, even if the wait is interrupted.
This preserves the intended behavior when no interrupts occur.
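
The guard described above can be sketched roughly as follows.  This is an
illustrative model only, not the code in the lock manager: the LockWait
struct, its fields and maybe_log_still_waiting() are hypothetical names,
and the function is assumed to be called every time the wait loop decides
the message is due, including after an interrupted wait resumes.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-wait state; reset when a new lock wait begins. */
typedef struct LockWait
{
    bool logged_still_waiting;  /* has the message been emitted already? */
    int  messages;              /* number of messages emitted (for checking) */
} LockWait;

/*
 * Emit the "still waiting" message at most once per lock wait, no matter
 * how many times the wait is interrupted and resumed.
 */
static inline void
maybe_log_still_waiting(LockWait *w)
{
    if (w->logged_still_waiting)
        return;                 /* already reported during this wait */

    w->logged_still_waiting = true;
    w->messages++;
    fprintf(stderr, "still waiting on lock\n");
}
```

Keeping the flag in per-wait state (rather than per-wakeup state) is what
makes repeated wakeups harmless in this sketch.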

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hüseyin Demir <huseyin.d3r@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHZUmg+r4kMcPYt_Z-txxVX+CJJhfra+qemxKXvAxYbpw@mail.gmail.com
2026-03-16 18:10:57 +09:00
Michael Paquier
c336133c65 Reject ALTER TABLE .. CLUSTER earlier for partitioned tables
ALTER TABLE .. CLUSTER ON and SET WITHOUT CLUSTER are not supported for
partitioned tables and already fail with a check happening when the
sub-command is executed, not when it is prepared.

This commit moves the relkind check for partitioned tables to happen
when the sub-command is prepared in ATSimplePermissions().  This matches
with the practice of the other sub-commands of ALTER TABLE, shaving one
translatable string.

mark_index_clustered() can be simplified a bit, switching one
elog(ERROR) to an assertion.  Note that mark_index_clustered() can also
be called through a CLUSTER command, but it cannot be reached for a
partitioned table, per the assertion based on the relkind in
cluster_rel(), and there is only one caller of rebuild_relation().

Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAEoWx2kggo1N2kDH6OSfXHL_5gKg3DqQ0PdNuL4LH4XSTKJ3-g@mail.gmail.com
2026-03-16 17:48:39 +09:00
Fujii Masao
8fe315f18d Add stats_reset column to pg_statio_all_sequences
pg_statio_all_sequences lacked a stats_reset column, unlike the other
pg_statio_* views that already expose it. This commit adds the column so
users can see when the statistics in this view were last reset.

Also this commit updates the documentation for
pg_stat_reset_single_table_counters() to clarify that it can reset statistics
for sequences and materialized views as well.

Catalog version bumped.

Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Shihao Zhong <zhong950419@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0v0OPGyDpwxkX81CtTt9xsj9-TNxhm=8JdOvEKPsVVFNg@mail.gmail.com
2026-03-16 17:24:08 +09:00
Peter Eisentraut
a41bc38439 Fix accidentally casting away const
Recently introduced in commit 8c2b30487c.
2026-03-16 07:37:03 +01:00
Amit Kapila
5f39698c90 Remove obsolete speculative insert cleanup in ReorderBuffer.
Commit 4daa140a2f introduced proper decoding for speculative aborts. As a
result, the internal state is guaranteed to be clean when a new
speculative insert is encountered. This patch removes the defensive
cleanup code that is no longer reachable.

Author: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/23256.1772702981@localhost
2026-03-16 10:14:22 +05:30
Michael Paquier
bfa3c4f106 Optimize hash index bulk-deletion with streaming read
This commit refactors hashbulkdelete() to use streaming reads, improving
the efficiency of the operation by prefetching upcoming buckets while
processing a current bucket.  There are some specific changes required
to make sure that the cleanup work happens in accordance to the data
pushed to the stream read callback.  When the cached metadata page is
refreshed to be able to process the next set of buckets, the stream is
reset and the data fed to the stream read callback has to be updated.
The reset needs to happen in two code paths, when _hash_getcachedmetap()
is called.

The author has seen better performance numbers than myself on this one
(with tweaks similar to 6c228755ad).  The numbers are good enough for
both of us that this change is worth doing, in terms of IO and runtime.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
2026-03-16 09:22:09 +09:00
Tom Lane
82ff54377e Move -ffast-math defense to float.c and remove the configure check.
We had defenses against -ffast-math in timestamp-related files,
which is a pretty obsolete place for them since we've not supported
floating-point timestamps in a long time.  Remove those and instead
put one in float.c, which is still broken by using this switch.
Add some commentary to put more color on why it's a bad idea.

Also remove the check from configure.  That was just there to fail
faster, but it doesn't really seem necessary anymore, and besides
we have no corresponding check in meson.build.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/abFXfKC8zR0Oclon%40ip-10-97-1-34.eu-west-3.compute.internal
2026-03-15 19:34:52 -04:00
David Rowley
c456e39113 Optimize tuple deformation
This commit includes various optimizations to improve the performance of
tuple deformation.

We now precalculate CompactAttribute's attcacheoff, which allows us to
remove the code from the deform routines which was setting the
attcacheoff.  Setting the attcacheoff is now handled by
TupleDescFinalize(), which must be called before the TupleDesc is used for
anything.  Having TupleDescFinalize() means we can store the first
attribute in the TupleDesc which does not have an offset cached.  That
allows us to add a dedicated deforming loop to deform all attributes up
to the final one with an attcacheoff set, or up to the first NULL
attribute, whichever comes first.

Here we also improve tuple deformation performance of tuples with NULLs.
Previously, if the HEAP_HASNULL bit was set in the tuple's t_infomask,
deforming would, one-by-one, check each and every bit in the NULL bitmap
to see if it was zero.  Now, we process the NULL bitmap 1 byte at a time
rather than 1 bit at a time to find the attnum with the first NULL.  We
can now deform the tuple without checking for NULLs up to just before that
attribute.
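
A byte-at-a-time search of this kind might look like the sketch below.  It
is not the code from this commit: first_null_attnum() is a hypothetical
name, and the sketch assumes the heap convention that a set bit in the NULL
bitmap means "not NULL" and that unused trailing bits may be zero.

```c
#include <stdint.h>

/*
 * Return the index of the first NULL attribute, or natts if there is none.
 * Assumed convention: a set bit in the bitmap means "not NULL".
 */
static inline int
first_null_attnum(const uint8_t *bitmap, int natts)
{
    int nbytes = (natts + 7) / 8;

    for (int i = 0; i < nbytes; i++)
    {
        /* 0xFF means all eight attributes in this byte are not NULL. */
        if (bitmap[i] != 0xFF)
        {
            uint8_t b = bitmap[i];
            int     attnum = i * 8;

            /* Skip the set (not NULL) bits to reach the first zero bit. */
            while (b & 1)
            {
                b >>= 1;
                attnum++;
            }

            /* Zero padding bits past natts do not count as NULLs. */
            return attnum < natts ? attnum : natts;
        }
    }

    return natts;
}
```

Only the byte containing the first zero bit is inspected bit by bit; every
earlier byte is dismissed with a single comparison.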

We also record the maximum attribute number which is guaranteed to exist
in the tuple, that is, has a NOT NULL constraint and isn't an
atthasmissing attribute.  When deforming only attributes prior to the
guaranteed attnum, we've no need to access the tuple's natt count.  As an
additional optimization, we only count fixed-width columns when
calculating the maximum guaranteed column, as this eliminates the need to
emit code to fetch byref types in the deformation loop for guaranteed
attributes.

Some locations in the code deform tuples that have yet to go through NOT
NULL constraint validation.  We're unable to perform the guaranteed
attribute optimization when that's the case.  This optimization is opt-in
via the TupleTableSlot using the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS
flag.

This commit also adds a more efficient way of populating the isnull
array by using a bit-wise SWAR trick which performs multiplication on the
inverse of the tuple's bitmap byte and masking out all but the lowest bit
of each boolean's byte.  This results in much more optimal code
when compared to determining the NULLness via att_isnull().  8 isnull
elements are processed at once using this method, which means we need to
round the tts_isnull array size up to the next 8 bytes.  The palloc code
does this anyway, but the round-up needed to be formalized so as not to
overwrite the sentinel byte in MEMORY_CONTEXT_CHECKING builds.  Doing
this also allows the NULL-checking deforming loop to more efficiently
check the isnull array, rather than doing the bit-wise processing for each
attribute that att_isnull() does.
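
One way to spread the bits of a bitmap byte into eight boolean bytes is
sketched below.  This is a generic SWAR variant written for illustration,
not necessarily the constants or steps used by this commit:
spread_isnull8() is a hypothetical name, and the sketch assumes a set
bitmap bit means "not NULL".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Expand one NULL-bitmap byte into 8 isnull flags, one per attribute.
 * The caller must provide room for 8 elements, which is why the isnull
 * array has to be rounded up to a multiple of 8.
 */
static inline void
spread_isnull8(uint8_t bitmap_byte, bool *isnull)
{
    /* Invert so that a set bit means "NULL". */
    uint64_t nulls = (uint8_t) ~bitmap_byte;
    uint64_t x;

    /* Copy the byte into all 8 byte lanes, then keep only bit i in lane i. */
    x = (nulls * UINT64_C(0x0101010101010101)) & UINT64_C(0x8040201008040201);

    /*
     * Adding 0x7F sets a lane's top bit exactly when the lane is nonzero;
     * shifting right by 7 moves that bit to the lane's lowest bit.
     */
    x = ((x + UINT64_C(0x7F7F7F7F7F7F7F7F)) >> 7) & UINT64_C(0x0101010101010101);

    /* Store lane by lane so the sketch does not depend on byte order. */
    for (int i = 0; i < 8; i++)
        isnull[i] = (bool) ((x >> (8 * i)) & 1);
}
```

A production version would typically store all eight lanes with a single
8-byte write; the per-lane loop here only keeps the example independent of
endianness.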

The level of performance improvement from these changes seems to vary
depending on the CPU architecture.  Apple's M chips seem particularly
fond of the changes, with some of the tested deform-heavy queries going
over twice as fast as before.  With x86-64, the speedups aren't quite as
large.  With tables containing only a small number of columns, the
speedups will be less.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com
2026-03-16 11:46:00 +13:00
David Rowley
503620311e Add all required calls to TupleDescFinalize()
As of this commit all TupleDescs must have TupleDescFinalize() called on
them once the TupleDesc is set up and before BlessTupleDesc() is called.

In this commit, TupleDescFinalize() does nothing. This change has only
been separated out from the commit that properly implements this function
to make the change more obvious.  Any extension which makes its own
TupleDesc will need to be modified to call the new function.

The follow-up commit which properly implements TupleDescFinalize() will
cause any code which forgets to do this to fail in assert-enabled builds in
BlessTupleDesc().  It may still be worth mentioning this change in the
release notes so that extension authors update their code.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com
2026-03-16 11:45:49 +13:00
Tom Lane
e5a77d876d Save a few bytes per CatCTup.
CatalogCacheCreateEntry() computed the space needed for a CatCTup
as sizeof(CatCTup) + MAXIMUM_ALIGNOF.  That's not our usual style,
and it wastes memory by allocating more padding than necessary.
On 64-bit machines sizeof(CatCTup) would be maxaligned already
since it contains pointer fields, therefore this code is wasting
8 bytes compared to the more usual MAXALIGN(sizeof(CatCTup)).
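
The difference between the two computations can be illustrated with a
small sketch.  The names below are hypothetical stand-ins, and the
alignment of 8 bytes is an assumption matching typical 64-bit platforms;
PostgreSQL determines MAXIMUM_ALIGNOF at configure time.

```c
#include <stddef.h>

/* Assumed maximum alignment for this sketch. */
#define MAX_ALIGN_BYTES 8

/* Round len up to the next multiple of MAX_ALIGN_BYTES. */
#define MAXALIGN_UP(len) \
    (((size_t) (len) + (MAX_ALIGN_BYTES - 1)) & \
     ~((size_t) (MAX_ALIGN_BYTES - 1)))

/* Old style: always adds a full alignment unit, even if already aligned. */
static inline size_t
padded_size_old(size_t struct_size)
{
    return struct_size + MAX_ALIGN_BYTES;
}

/* Usual style: adds only as much padding as is needed. */
static inline size_t
padded_size_new(size_t struct_size)
{
    return MAXALIGN_UP(struct_size);
}
```

For a struct size that is already a multiple of 8, the old form reserves 8
bytes that are never needed, while the rounded form reserves none.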

While at it, we don't really need to do MemoryContextSwitchTo()
when we're only allocating one block.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/tencent_A42E0544C6184FE940CD8E3B14A3F0A39605@qq.com
2026-03-15 18:05:38 -04:00
Melanie Plageman
99bf1f8aa6 Save vmbuffer in heap-specific scan descriptors for on-access pruning
Future commits will use the visibility map in on-access pruning to fix
VM corruption and set the VM if the page is all-visible.

Saving the vmbuffer in the scan descriptor reduces the number of times
it would need to be pinned and unpinned, making the overhead of doing so
negligible.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/C3AB3F5B-626E-4AAA-9529-23E9A20C727F%40gmail.com
2026-03-15 11:09:10 -04:00
Melanie Plageman
8d2c1df4f4 Avoid BufferGetPage() calls in heap_update()
BufferGetPage() isn't cheap and heap_update() calls it multiple times
when it could just save the page from a single call. Do that.
While we are at it, make separate variables for old and new page in
heap_xlog_update(). It's confusing to reuse "page" for both pages.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_a%2BhO4PCptyaPR7AMZd7FjcHfOFKKJT8ouU3KedMud0tQ%40mail.gmail.com
2026-03-15 10:42:34 -04:00
Melanie Plageman
a3511443e5 Initialize missing fields in CreateExecutorState()
d47cbf474e and cbc127917e forgot to initialize a few fields they
introduced in the EState, so do that now.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com
2026-03-15 10:13:14 -04:00
Tom Lane
2eb87345e1 Fix aclitemout() to work during early bootstrap.
"initdb -d" has been broken since commit f95d73ed4, because I changed
aclitemin to work in bootstrap mode but failed to consider aclitemout.
That routine isn't reached by default, but it is if the elog message
level is high enough, so it needs to work without catalog access too.

This patch just makes it use its existing code paths to print role
OIDs numerically.  We could alternatively invent an inverse of
boot_get_role_oid() and print them symbolically, but that would take
more code and it's not apparent that it'd be any better for debugging
purposes.

Reported-by: Greg Burd <greg@burd.me>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/4416.1773328045@sss.pgh.pa.us
2026-03-14 13:46:54 -04:00
Tomas Vondra
02eecead86 Tighten asserts on ParallelWorkerNumber
The comment about ParallelWorkerNumber in parallel.c says:

  In parallel workers, it will be set to a value >= 0 and < the number
  of workers before any user code is invoked; each parallel worker will
  get a different parallel worker number.

However, asserts in various places collecting instrumentation allowed
(ParallelWorkerNumber == num_workers). That would be a bug, as the value
is used as index into an array with num_workers entries.
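
A minimal sketch of the bound in question, using hypothetical names
(WorkerStats, worker_slot) rather than the actual instrumentation code:

```c
#include <assert.h>

/* Hypothetical per-worker instrumentation entry. */
typedef struct WorkerStats
{
    long tuples;
} WorkerStats;

/*
 * Return the slot for a worker in an array of num_workers entries.
 * Valid indexes are 0 .. num_workers - 1, so the check must use "<";
 * a "<=" check would accept an index one past the end of the array.
 */
static inline WorkerStats *
worker_slot(WorkerStats *stats, int num_workers, int worker_number)
{
    assert(worker_number >= 0 && worker_number < num_workers);
    return &stats[worker_number];
}
```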

Fixed by adjusting the asserts accordingly. Backpatch to all supported
versions.

Discussion: https://postgr.es/m/5db067a1-2cdf-4afb-a577-a04f30b69167@vondra.me
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Backpatch-through: 14
2026-03-14 15:26:39 +01:00
David Rowley
4deecb52af Allow sibling call optimization in slot_getsomeattrs_int()
This changes the TupleTableSlotOps contract to make it so the
getsomeattrs() function is in charge of calling
slot_getmissingattrs().

Since this removes all code from slot_getsomeattrs_int() aside from the
getsomeattrs() call itself, we may as well adjust slot_getsomeattrs() so
that it calls getsomeattrs() directly.  We leave slot_getsomeattrs_int()
intact as this is still called from the JIT code.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com
2026-03-14 13:52:09 +13:00
Peter Geoghegan
8a879119a1 Use fake LSNs to improve nbtree dropPin behavior.
Use fake LSNs in all nbtree critical sections that write a WAL record.
That way we can safely apply the _bt_killitems LSN trick with logged and
unlogged indexes alike.  This brings the same benefits to plain scans of
unlogged relations that commit 2ed5b87f brought to plain scans of logged
relations: scans will drop their leaf page pin eagerly (by applying the
"dropPin" optimization), which avoids blocking progress by VACUUM.  This
is particularly helpful with applications that allow a scrollable cursor
to remain idle for long periods.

Preparation for an upcoming commit that will add the amgetbatch
interface, and switch nbtree over to it (from amgettuple) to enable I/O
prefetching.  The index prefetching read stream's effective prefetch
distance is adversely affected by any buffer pins held by the index AM.
At the same time, it can be useful for prefetching to read dozens of
leaf pages ahead of the scan to maintain an adequate prefetch distance.

The index prefetching patch avoids this tension by always eagerly
dropping index page pins of the kind traditionally held as an interlock
against unsafe concurrent TID recycling by VACUUM (essentially the same
way that amgetbitmap routines have always avoided holding onto pins).
The work from this commit makes that possible during scans of nbtree
unlogged indexes -- without our having to give up on setting LP_DEAD
bits on index tuples altogether.

Follow-up to commit d774072f, which moved the fake LSN infrastructure
out of GiST so that it could be used by other index AMs.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
2026-03-13 20:37:39 -04:00
Peter Geoghegan
d774072f00 Move fake LSN infrastructure out of GiST.
Move utility functions used by GiST to generate fake LSNs into xlog.c
and xloginsert.c, so that other index AMs can also generate fake LSNs.

Preparation for an upcoming commit that will add support for fake LSNs
to nbtree, allowing its dropPin optimization to be used during scans of
unlogged relations.  That commit is itself preparation for another
upcoming commit that will add a new amgetbatch/btgetbatch interface to
enable I/O prefetching.

Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming
XLOG_ASSIGN_LSN.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
2026-03-13 19:38:17 -04:00
Jeff Davis
9b860373da Add error code to user-visible message.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
2026-03-13 16:07:54 -07:00
Tomas Vondra
b1f14c9672 Use GetXLogInsertEndRecPtr in gistGetFakeLSN
The function used GetXLogInsertRecPtr() to generate the fake LSN. Most
of the time this is the same as what XLogInsert() would return, and so
it works fine with the XLogFlush() call. But if the last record ends at
a page boundary, GetXLogInsertRecPtr() returns an LSN pointing after the
page header. In such a case XLogFlush() fails with errors like this:

  ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000

Such failures are very hard to trigger, particularly outside aggressive
test scenarios.

Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN
without skipping the header. This is the same as GetXLogInsertRecPtr(),
except that it calls XLogBytePosToEndRecPtr().
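
The difference between the two conversions can be sketched with a
simplified model. Everything below is an illustrative assumption rather
than the actual xlog.c code: the names are hypothetical, the page size is
taken to be 8192 bytes with a 24-byte header on every page (the 0x18 gap
in the error above is consistent with that), and the longer header at the
start of each segment is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define WAL_PAGE_SIZE    UINT64_C(8192) /* assumed WAL page size */
#define WAL_PAGE_HEADER  UINT64_C(24)   /* assumed per-page header size */
#define WAL_USABLE_BYTES (WAL_PAGE_SIZE - WAL_PAGE_HEADER)

/*
 * Model of "where would the next record start": a usable-byte position is
 * mapped to an LSN that always lies past the header of its page.
 */
uint64_t
model_pos_to_recptr(uint64_t bytepos)
{
    uint64_t page = bytepos / WAL_USABLE_BYTES;
    uint64_t off = bytepos % WAL_USABLE_BYTES;

    return page * WAL_PAGE_SIZE + WAL_PAGE_HEADER + off;
}

/*
 * Model of "where did the last record end": identical, except that a
 * position exactly at a page boundary maps to the boundary itself, not to
 * the first byte after the next page's header.
 */
uint64_t
model_pos_to_end_recptr(uint64_t bytepos)
{
    uint64_t page = bytepos / WAL_USABLE_BYTES;
    uint64_t off = bytepos % WAL_USABLE_BYTES;
    uint64_t result;

    if (off == 0)
        result = page * WAL_PAGE_SIZE;
    else
        result = page * WAL_PAGE_SIZE + WAL_PAGE_HEADER + off;

    /* The end pointer never lies beyond the start-of-next-record pointer. */
    assert(result <= model_pos_to_recptr(bytepos));
    return result;
}
```

In this model the two functions disagree only when the position falls
exactly on a page boundary, which is the case where a flush request for
the larger value cannot be satisfied.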

Initial investigation by me, root cause identified by Andres Freund.

This is a long-standing bug in gistGetFakeLSN(), probably introduced by
c6b92041d3 in PG13. Backpatch to all supported versions.

Reported-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo
Backpatch-through: 14
2026-03-13 23:25:24 +01:00
Heikki Linnakangas
311a851436 Free memory allocated for unrecognized_protocol_options
Since 4966bd3ed9 Valgrind started to warn about a small amount of
memory being leaked in ProcessStartupPacket(). This is not critical
but the warnings may distract from real issues. Fix it by freeing the
list after use.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://www.postgresql.org/message-id/CAJ7c6TN3Hbb5p=UHx0SPVN+h_JwPAV6rxoqOm7gHBMFKfnGK-Q@mail.gmail.com
2026-03-13 23:37:19 +02:00
Andres Freund
ce5d489166 Fix bug due to confusion about what IsMVCCSnapshot means
In 0b96e734c5 I (Andres) relied on page_collect_tuples() being called only
with an MVCC snapshot, and added assertions to that end, but did not realize
that IsMVCCSnapshot() allows both proper MVCC snapshots and historical
snapshots, which behave quite similarly to MVCC snapshots.

Unfortunately that can lead to incorrect visibility results during logical
decoding, as a historical snapshot is interpreted as a plain MVCC
snapshot. The only reason this wasn't noticed earlier is that it's hard to
reach as most of the time there are no sequential scans during logical
decoding.

To fix the bug and avoid issues like this in the future, split
IsMVCCSnapshot() into IsMVCCSnapshot() and IsMVCCLikeSnapshot(), where now
only the latter includes historic snapshots.

One effect of this is that during logical decoding no page-at-a-time snapshots
are used, as otherwise runtime branches to handle historic snapshots would be
needed in some performance critical paths. Given how uncommon sequential scans
are during logical decoding, that seems acceptable.

Author: Antonin Houska <ah@cybertec.at>
Reported-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/61812.1770637345@localhost
2026-03-13 13:53:19 -04:00
Nathan Bossart
e0a3a3fd53 Optimize COPY FROM (FORMAT {text,csv}) using SIMD.
Presently, such commands scan the input buffer one byte at a time
looking for special characters.  This commit adds a new path that
uses SIMD instructions to skip over chunks of data without any
special characters.  This can be much faster.

To avoid regressions, SIMD processing is disabled for the remainder
of the COPY FROM command as soon as we encounter a short line or a
special character (except for end-of-line characters, else we'd
always disable it after the first line).  This is perhaps too
conservative, but it could probably be made more lenient in the
future via fine-tuned heuristics.
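
The idea of skipping whole chunks that contain no special characters can
be sketched without real SIMD intrinsics. The example below is not the
code from this commit: it uses a portable 8-byte word trick (SWAR) as a
stand-in for vector instructions, and the function names and the chosen
set of special characters (backslash, newline, carriage return, tab) are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the 8 bytes in w equals c (classic zero-byte test). */
static int
word_has_byte(uint64_t w, unsigned char c)
{
    const uint64_t ones = UINT64_C(0x0101010101010101);
    const uint64_t highs = UINT64_C(0x8080808080808080);
    uint64_t x = w ^ (ones * c);    /* bytes equal to c become zero */

    return ((x - ones) & ~x & highs) != 0;
}

static int
is_special(char ch)
{
    return ch == '\\' || ch == '\n' || ch == '\r' || ch == '\t';
}

/*
 * Return the offset of the first special character in buf, or len if
 * there is none.  Whole 8-byte chunks without special characters are
 * skipped at once; the remainder is scanned one byte at a time.
 */
size_t
skip_plain_bytes(const char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i + 8 <= len)
    {
        uint64_t w;

        memcpy(&w, buf + i, 8);     /* avoids unaligned access issues */
        if (word_has_byte(w, '\\') || word_has_byte(w, '\n') ||
            word_has_byte(w, '\r') || word_has_byte(w, '\t'))
            break;                  /* let the byte loop locate it */
        i += 8;
    }
    while (i < len && !is_special(buf[i]))
        i++;
    return i;
}
```

A real SIMD path would test 16 or more bytes per step with vector
compares, but the control flow (fast chunk loop, byte-wise fallback) has
the same shape.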

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Co-authored-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Ayoub Kazar <ma_kazar@esi.dz>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Neil Conway <neil.conway@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Tested-by: Manni Wood <manni.wood@enterprisedb.com>
Tested-by: Mark Wong <markwkm@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com
2026-03-13 11:07:32 -05:00
Peter Eisentraut
8c2b30487c Factor out constructSetOpTargetlist() from transformSetOperationTree()
This will be used separately by a future patch.  It also makes
transformSetOperationTree() a little smaller.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-13 16:16:40 +01:00
Heikki Linnakangas
f9de9bf302 Add callback for I/O error messages in SLRUs
Historically, all SLRUs were addressed by transaction IDs, but that
hasn't been true for a long time. However, the error message on I/O
error still always talked about accessing a transaction ID.

This commit adds a callback that allows subsystems to construct their
own error messages, which can then correctly refer to a transaction
ID, multixid or whatever else is used to address the particular SLRU.

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/CACG=ezZZfurhYV+66ceubxQAyWqv9vaUi0yoO4-t48OE5xc0DQ@mail.gmail.com
2026-03-13 16:21:06 +02:00
Fujii Masao
723619eaa3 Add stats_reset column to pg_stat_database_conflicts.
This commit adds a stats_reset column to pg_stat_database_conflicts,
allowing users to see when the statistics in this view were last reset.
This makes the view consistent with pg_stat_database and other statistics
views.

Catalog version bumped.

Author: Shihao Zhong <zhong950419@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAGRkXqS98OebEWjax99_LVAECsxCB8i=BfsdAL34i-5QHfwyOQ@mail.gmail.com
2026-03-13 22:17:14 +09:00
Heikki Linnakangas
2e1dcf8c54 Check for interrupts during non-fast-update GIN insertion
ginExtractEntries() can produce a lot of entries for a single item.
During index build, we check for interrupts between entries, and the
fast-update codepath does it as part of vacuum_delay_point(), but the
non-fast update insertion codepath was uninterruptible. Add
CHECK_FOR_INTERRUPTS() between entries in the non-fast update codepath
too.

Author: Vinod Sridharan <vsridh90@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAFMdLD6mQvAuStiOGvBJxAEfo6wdjZhj3+JveTLxOX8MVn4zmA@mail.gmail.com
2026-03-13 15:12:32 +02:00
Alexander Korotkov
fa6f2f624c Rework ginScanToDelete() to pass Buffers instead of BlockNumbers.
Previously, ginScanToDelete() and ginDeletePage() passed BlockNumbers and
re-read pages that were already pinned and locked during the tree walk.  The
caller (ginVacuumPostingTree()) held a cleanup-locked root buffer, yet
ginScanToDelete() re-read it by block number with special-case code to skip
re-locking.

First, this commit gives both functions more appropriate names,
ginScanPostingTreeToDelete() and ginDeletePostingPage(), indicating they deal
with posting trees/pages.  This is more descriptive and similar to the way we
name other GIN functions, for instance, ginVacuumPostingTree() and
ginVacuumPostingTreeLeaves().

Then rework both functions to pass Buffers directly.  DataPageDeleteStack now
carries buffer, myoff (downlink offset in parent), and isRoot per level,
so ginScanPostingTreeToDelete() takes only GinVacuumState and
DataPageDeleteStack pointers.  Also, ginDeletePostingPage() receives the three
Buffers directly, and no longer reads or releases them itself.  The caller
reads and locks child pages before recursing, and manages buffer lifecycle
afterward.

This eliminates the confusing isRoot special cases in buffer management,
including the apparent (but unreachable) double release of the root
buffer identified by Andres Freund.

Add comments explaining the locking protocol and the DataPageDeleteStack
structure.

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/utrlxij43fbguzw4kldte2spc4btoldizutcqyrfakqnbrp3ir@ph3sphpj4asz
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Jinbinge <jinbinge@126.com>
2026-03-13 13:50:13 +02:00
Heikki Linnakangas
f30cebb954 Fix pointer type of ShmemAllocatorData->index
This went unnoticed in commit e2362eb2bd because the pointer is cast
to/from a void pointer.
2026-03-13 11:00:15 +02:00
Andrew Dunstan
a0b6ef29a5 Enable fast default for domains with non-volatile constraints
Previously, ALTER TABLE ADD COLUMN always forced a table rewrite when
the column type was a domain with constraints (CHECK or NOT NULL), even
if the default value satisfied those constraints.  This was because
contain_volatile_functions() considers CoerceToDomain immutable, so
the code conservatively assumed any constrained domain might fail.

Improve this by using soft error handling (ErrorSaveContext) to evaluate
the CoerceToDomain expression at ALTER TABLE time.  If the default value
passes the domain's constraints, the value is stored as a "missing"
attribute default and no table rewrite is needed.  If the constraint
check fails, we fall back to a table rewrite, preserving the historical
behavior that constraint violations are only raised when the table
actually contains rows.

Domains with volatile constraint expressions always require a table
rewrite since the constraint result could differ per evaluation and
cannot be cached.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io>
Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com
2026-03-12 18:05:01 -04:00
Andrew Dunstan
487cf2cbd2 Extend DomainHasConstraints() to optionally check constraint volatility
Add an optional bool *has_volatile output parameter to
DomainHasConstraints().  When non-NULL, the function checks whether any
CHECK constraint contains a volatile expression.  Callers that don't
need this information pass NULL and get the same behavior as before.

This is needed by a subsequent commit that enables the fast default
optimization for domains with non-volatile constraints: we can safely
evaluate such constraints once at ALTER TABLE time, but volatile
constraints require a full table rewrite.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io>
Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com
2026-03-12 18:04:16 -04:00
Peter Geoghegan
a367c433ad Use simplehash for backend-private buffer pin refcounts.
Replace dynahash with simplehash for the per-backend PrivateRefCountHash
overflow table.  Simplehash generates inlined, open-addressed lookup
code, avoiding the per-call overhead of dynahash that becomes noticeable
when many buffers are pinned with a CPU-bound workload.

Motivated by testing of the index prefetching patch, which pins many
more buffers concurrently than typical index scans.

Author: Peter Geoghegan <pg@bowt.ie>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
2026-03-12 13:26:16 -04:00
Peter Geoghegan
d071e1cfec nbtree: Avoid allocating _bt_search stack.
Avoid allocating memory for an nbtree descent stack during index scans.
We only require a descent stack during inserts, when it is used to
determine where to insert a new pivot tuple/downlink into the target
leaf page's parent page in the event of a page split.  (Page deletion's
first phase also performs a _bt_search that requires a descent stack.)

This optimization improves performance by minimizing palloc churn.  It
speeds up index scans that call _bt_search frequently/descend the index
many times, especially when the cost of scanning the index dominates
(e.g., with index-only skip scans).  Testing has shown that the
underlying issue causes performance problems for an upcoming patch that
will replace btgettuple with a new btgetbatch interface to enable I/O
prefetching.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com
2026-03-12 13:22:36 -04:00
Michael Paquier
6c228755ad Use streaming read for VACUUM cleanup of GIN
This commit replaces the synchronous ReadBufferExtended() loop done in
ginvacuumcleanup() with the streaming read equivalent, to improve I/O
efficiency during GIN index vacuum cleanup operations.

With dm_delay to emulate some latency and debug_io_direct=data to force
synchronous writes and force the read path to be exercised, the author
has noticed a 5x improvement in runtime, with a substantial reduction in
IO stats numbers.  I have reproduced similar numbers while running
similar tests, with improvements becoming better with more tuples and
more pages manipulated.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
2026-03-12 11:48:31 +09:00
Richard Guo
383eb21ebf Convert NOT IN sublinks to anti-joins when safe
The planner has historically been unable to convert "x NOT IN (SELECT
y ...)" sublinks into anti-joins.  This is because standard SQL
semantics for NOT IN require that if the comparison "x = y" returns
NULL, the "NOT IN" expression evaluates to NULL (effectively false),
causing the row to be discarded.  In contrast, an anti-join preserves
the row if no match is found.  Due to this semantic mismatch regarding
NULL handling, the conversion was previously considered unsafe.

However, if we can prove that neither side of the comparison can yield
NULL values, and further that the operator itself cannot return NULL
for non-null inputs, the behavior of NOT IN and anti-join becomes
identical.  Enabling this conversion allows the planner to treat the
sublink as a first-class relation rather than an opaque SubPlan
filter.  This unlocks global join ordering optimization and permits
the selection of the most efficient join algorithm based on cost,
often yielding significant performance improvements for large
datasets.
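
The semantic mismatch can be illustrated with a small model of SQL
three-valued logic. This is only a sketch: the types and function names
are hypothetical and have nothing to do with the planner code, and the
comparison is limited to integer equality.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TV_FALSE, TV_TRUE, TV_NULL } TriBool;

typedef struct
{
    bool isnull;
    int  value;
} NullableInt;

/* SQL semantics of "x NOT IN (ys)": NULL if undecided because of NULLs. */
TriBool
sql_not_in(NullableInt x, const NullableInt *ys, size_t n)
{
    bool saw_null = false;

    assert(ys != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (x.isnull || ys[i].isnull)
            saw_null = true;        /* "x = y" yields NULL */
        else if (x.value == ys[i].value)
            return TV_FALSE;        /* a definite match */
    }
    return saw_null ? TV_NULL : TV_TRUE;
}

/* Anti-join semantics: keep the row unless some "x = y" is true. */
bool
anti_join_keeps(NullableInt x, const NullableInt *ys, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (!x.isnull && !ys[i].isnull && x.value == ys[i].value)
            return false;
    }
    return true;
}
```

With x = 1 and ys = {2, NULL}, sql_not_in() yields TV_NULL (the row is
discarded by a WHERE clause) while anti_join_keeps() returns true.  Once
NULLs are ruled out on both sides, the two always agree, which is the
condition the conversion relies on.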

This patch verifies that neither side of the comparison can be NULL
and that the operator is safe regarding NULL results before performing
the conversion.

To verify operator safety, we require that the operator be a member of
a B-tree or Hash operator family.  This serves as a proxy for standard
boolean behavior, ensuring the operator does not return NULL on valid
non-null inputs, as doing so would break index integrity.

For operand non-nullability, this patch makes use of several existing
mechanisms.  It leverages the outer-join-aware-Var infrastructure to
verify that a Var does not come from the nullable side of an outer
join, and consults the NOT-NULL-attnums hash table to efficiently
verify schema-level NOT NULL constraints.  Additionally, it employs
find_nonnullable_vars to identify Vars forced non-nullable by qual
clauses, and expr_is_nonnullable to deduce non-nullability for other
expression types.

The logic for verifying the non-nullability of the subquery outputs
was adapted from prior work by David Rowley and Tom Lane.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/CAMbWs495eF=-fSa5CwJS6B-BaEi3ARp0UNb4Lt3EkgUGZJwkAQ@mail.gmail.com
2026-03-12 09:45:18 +09:00
Andres Freund
6322a028fa bufmgr: Fix use of wrong variable in GetPrivateRefCountEntrySlow()
Unfortunately, in 30df61990c, I made GetPrivateRefCountEntrySlow() set a
wrong cache hint when moving entries from the hash table to the faster array.
There are no correctness concerns due to this, just an unnecessary loss of
performance.

Noticed while testing the index prefetching patch.

Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
2026-03-11 17:52:21 -04:00
Jeff Davis
547c15f9f8 Fix use of volatile.
Commit 8185bb5347 misused volatile. Fix it. See also 6307b096e2.

Reported-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1bb21c7d-885f-4f07-a3ed-21b60d7c92c6@eisentraut.org
2026-03-11 14:27:58 -07:00
Andrew Dunstan
342051d73b Add support for altering CHECK constraint enforceability
This commit adds support for ALTER TABLE ALTER CONSTRAINT ... [NOT]
ENFORCED for CHECK constraints.  Previously, only foreign key
constraints could have their enforceability altered.

When changing from NOT ENFORCED to ENFORCED, the operation not only
updates catalog information but also performs a full table scan in
Phase 3 to validate that existing data satisfies the constraint.

For partitioned tables and inheritance hierarchies, the operation
recurses to all child tables.  When changing to NOT ENFORCED, we must
recurse even if the parent is already NOT ENFORCED, since child
constraints may still be ENFORCED.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@cybertec.at>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com
2026-03-11 16:15:35 -04:00
Andrew Dunstan
a9747153e1 rename alter constraint enforceability related functions
The functions AlterConstrEnforceabilityRecurse and
ATExecAlterConstrEnforceability are being renamed to
AlterFKConstrEnforceabilityRecurse and ATExecAlterFKConstrEnforceability,
respectively.

The current alter constraint functions only handle Foreign Key constraints.
Renaming them to be more explicit about the constraint type is necessary;
otherwise, it will cause confusion when we later introduce the ability to alter
the enforceability of other constraints.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>

Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com
2026-03-11 16:14:58 -04:00
Andres Freund
a766125efd bufmgr: Switch to standard order in MarkBufferDirtyHint()
When we were updating hint bits with just a share lock MarkBufferDirtyHint()
had to use a non-standard order of operations, i.e. WAL log the buffer before
marking the buffer dirty. This was required because the lock level used to set
hints did not conflict with the lock level that was used to flush pages, which
would have allowed flushing the page out before the WAL record. The
non-standard order in turn required preventing the checkpoint from starting
between writing the WAL record and flushing out the page.

Now that setting hints and writing out buffers use share-exclusive, we can
revert back to the normal order of operations.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-11 14:58:29 -04:00
Andres Freund
b0f4ff3c92 bufmgr: Remove the, now obsolete, BM_JUST_DIRTIED
Due to the recent changes to use a share-exclusive mode for setting hint bits
and for flushing pages - instead of using share mode as before - a buffer
cannot be dirtied while the flush is ongoing.  The reason we needed
JUST_DIRTIED was to handle the case where the buffer was dirtied while IO was
ongoing - which is not possible anymore.

Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-11 14:58:29 -04:00
Melanie Plageman
11e0824bd9 Avoid WAL flush checks for unlogged buffers in GetVictimBuffer()
GetVictimBuffer() rejects a victim buffer if it is from a bulkread
strategy ring and reusing it would require flushing WAL. Unlogged table
buffers can have fake LSNs (e.g. unlogged GiST pages) and calling
XLogNeedsFlush() on a fake LSN is meaningless.

This is a bit of future-proofing because currently the bulkread strategy
is not used for relations with fake LSNs.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Earlier version reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/fmkqmyeyy7bdpvcgkheb6yaqewemkik3ls6aaveyi5ibmvtxnd%40nu2kvy5rq3a6
2026-03-11 14:50:50 -04:00
Tomas Vondra
943e881733 Do not lock in BufferGetLSNAtomic() on archs with 8 byte atomic reads
On platforms where we can read or write the whole LSN atomically, we do
not need to lock the buffer header to prevent torn LSNs. We can do this
only on platforms with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, and when the
pd_lsn field is properly aligned.

For historical reasons the PageXLogRecPtr was defined as a struct with
two uint32 fields. This replaces it with a single uint64 value, to make
the intent clearer. To prevent issues with weak typedefs the value is
still wrapped in a struct.

This also adjusts heapfuncs() in pageinspect, to ensure proper alignment
when reading the LSN from a page on alignment-sensitive hardware.

Idea by Andres Freund. Initial patch by Andreas Karlsson, improved by
Peter Geoghegan. Minor tweaks by me.

Author: Andreas Karlsson <andreas@proxel.se>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se
2026-03-11 19:46:08 +01:00
Tomas Vondra
b6eb8dde6b Fix indentation from commit 29a0fb2157
Per buildfarm animal koel
2026-03-11 15:14:46 +01:00
Tomas Vondra
29a0fb2157 Conditional locking in pgaio_worker_submit_internal
With io_method=worker, there's a single I/O submission queue. With
enough workers, the backends and workers may end up spending a lot of
time competing for the AioWorkerSubmissionQueueLock lock. This can
happen with workloads that keep the queue full, in which case it's
impossible to add requests to the queue. Increasing the number of I/O
workers increases the pressure on the lock, worsening the issue.

This change improves the situation in two ways:

* If AioWorkerSubmissionQueueLock can't be acquired without waiting,
  the I/O is performed synchronously (as if the queue was full).

* When an entry can't be added to a full queue, stop trying to add more
  entries. All remaining entries are handled as synchronous I/O.
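
The two rules above can be sketched as follows. This is a simplified,
single-file model and not the pgaio code: the queue layout, its capacity,
and all names are hypothetical, and a C11 atomic_flag stands in for a
conditional acquisition of AioWorkerSubmissionQueueLock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_CAPACITY 4            /* hypothetical queue size */

typedef struct
{
    atomic_flag lock;               /* stand-in for the queue lock */
    size_t      used;
    int         slots[QUEUE_CAPACITY];
} SubmissionQueue;

typedef struct
{
    size_t queued;                  /* handed to the worker queue */
    size_t synchronous;             /* caller must perform these itself */
} SubmitResult;

void
queue_init(SubmissionQueue *q)
{
    atomic_flag_clear(&q->lock);
    q->used = 0;
}

SubmitResult
submit_ios(SubmissionQueue *q, const int *ios, size_t n)
{
    SubmitResult r = { 0, 0 };

    assert(q != NULL);

    /* Rule 1: never wait for the lock; if it is busy, queue nothing. */
    if (!atomic_flag_test_and_set(&q->lock))
    {
        /* Rule 2: stop at the first entry that does not fit. */
        while (r.queued < n && q->used < QUEUE_CAPACITY)
            q->slots[q->used++] = ios[r.queued++];
        atomic_flag_clear(&q->lock);
    }

    /* Everything not queued is left for synchronous execution. */
    r.synchronous = n - r.queued;
    return r;
}
```

The point of the design is that a backend never blocks on the queue: in
the worst case it simply does its own I/O, as it would have with a full
queue.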

The regression was reported by Alexandre Felipe. Investigation and
patch by me, based on an idea by Andres Freund.

Reported-by: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAE8JnxOn4+xUAnce+M7LfZWOqfrMMxasMaEmSKwiKbQtZr65uA@mail.gmail.com
2026-03-11 13:40:23 +01:00
Peter Eisentraut
d537f59fbb Sort out table_open vs. relation_open in rewriter
table_open() is a wrapper around relation_open() that checks that the
relkind is table-like and gives a user-facing error message if not.
It is best used in directly user-facing areas to check that the user
used the right kind of command for the relkind.  In internal uses
where the relkind was previously checked from the user's perspective,
table_open() is not necessary and might even be confusing if it were
to give out-of-context error messages.

In rewriteHandler.c, there were several such table_open() calls, which
this changes to relation_open().  This currently doesn't make a
difference, but there are plans to have other relkinds that could
appear in the rewriter but that shouldn't be accessible via
table-specific commands, and this clears the way for that.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-11 09:22:11 +01:00
Andres Freund
82467f627b Require share-exclusive lock to set hint bits and to flush
At the moment hint bits can be set with just a share lock on a page (and,
until 45f658dacb, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.

The need to copy the page is problematic for implementing AIO writes:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory

It also causes problems for using unbuffered/direct-IO, independent of AIO:
Some filesystems, raid implementations, ... do not tolerate the data being
written out to change during the write. E.g. they may compute internal
checksums that can be invalidated by concurrent modifications, leading e.g. to
filesystem errors (as is the case with btrfs).

It also just is plain odd to allow modifications of buffers that are just
share locked.

To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c977aa) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share-lock can be upgraded to a share-exclusive lock, if there are no
conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.

To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.
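
The resulting lock rules can be summarized as a conflict check.  The
sketch below only restates the description above; the enum and function
names are hypothetical and do not correspond to the bufmgr definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    BUF_LOCK_SHARE,                 /* read-only access */
    BUF_LOCK_SHARE_EXCLUSIVE,       /* set hint bits, or flush the page */
    BUF_LOCK_EXCLUSIVE              /* arbitrary modifications */
} BufLockMode;

/* Return true if two backends cannot hold modes a and b at once. */
bool
buf_lock_modes_conflict(BufLockMode a, BufLockMode b)
{
    assert(a <= BUF_LOCK_EXCLUSIVE && b <= BUF_LOCK_EXCLUSIVE);

    /* Exclusive conflicts with every mode, including itself. */
    if (a == BUF_LOCK_EXCLUSIVE || b == BUF_LOCK_EXCLUSIVE)
        return true;

    /* At most one share-exclusive holder; plain sharers may coexist. */
    return a == BUF_LOCK_SHARE_EXCLUSIVE && b == BUF_LOCK_SHARE_EXCLUSIVE;
}
```

Because both hint-bit setting and flushing take the share-exclusive mode,
they exclude each other while readers holding share locks are unaffected.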

The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.

The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set.  Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.

Several other places also need to be adapted, but those changes are
comparatively simpler.

After this we do not need to copy buffers to write them out anymore. That
change is done separately however.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
2026-03-10 19:32:13 -04:00
Melanie Plageman
4c7362c553 Remove unused PruneState member frz_conflict_horizon
c2a23dcf9e removed use of PruneState.frz_conflict_horizon but
neglected to actually remove the member. Do that now.
2026-03-10 18:31:00 -04:00
Heikki Linnakangas
138592d1b0 Don't clear pendingRecoveryConflicts at end of transaction
Commit 17f51ea818 introduced a new pendingRecoveryConflicts field in
PGPROC to replace the various ProcSignals. The new field was cleared
in ProcArrayEndTransaction(), which makes sense for conflicts with
e.g. locks or buffer pins which are gone at end of transaction. But it
is not appropriate for conflicts on a database, or a logical slot.

Because of this, the 035_standby_logical_decoding.pl test was
occasionally getting stuck in the buildfarm. It happens if the startup
process signals recovery conflict with the logical slot just when the
walsender process using the slot calls ProcArrayEndTransaction().

To fix, don't clear pendingRecoveryConflicts in
ProcArrayEndTransaction(). We could still clear certain conflict
flags, like conflicts on locks, but we didn't try to do that before
commit 17f51ea818 either.

In passing, fix a misspelled comment, and make
InitAuxiliaryProcess() also clear pendingRecoveryConflicts. I don't
think aux processes can have recovery conflicts, but it seems best to
initialize the field and keep InitAuxiliaryProcess() as close to
InitProcess() as possible.

Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://www.postgresql.org/message-id/3e07149d-060b-48a0-8f94-3d5e4946ae45@gmail.com
2026-03-11 00:06:09 +02:00
Melanie Plageman
c2a23dcf9e Use the newest to-be-frozen xid as the conflict horizon for freezing
Previously WAL records that froze tuples used OldestXmin as the snapshot
conflict horizon, or the visibility cutoff if the page would become
all-frozen. Both are newer than (or equal to) the newest XID actually
frozen on the page.

Track the newest XID that will be frozen and use that as the snapshot
conflict horizon instead. This yields an older horizon resulting in
fewer query cancellations on standbys.
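
Tracking the newest frozen XID amounts to a running maximum under
wraparound-aware comparison.  The sketch below is illustrative only: the
type and function names are hypothetical, special (non-normal) XIDs are
ignored, and the comparison uses the usual signed 32-bit difference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

/* True if a is logically newer than b, allowing for XID wraparound. */
static bool
xid_follows(Xid a, Xid b)
{
    return (int32_t) (a - b) > 0;
}

/* Return the newest XID among those that will be frozen on a page. */
Xid
newest_frozen_xid(const Xid *frozen, size_t n)
{
    Xid newest;

    assert(frozen != NULL && n > 0);

    newest = frozen[0];
    for (size_t i = 1; i < n; i++)
    {
        if (xid_follows(frozen[i], newest))
            newest = frozen[i];
    }
    return newest;
}
```

Since every frozen XID precedes the cutoff used to decide what may be
frozen, this value can only be older than or equal to the horizon used
previously.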

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com
2026-03-10 15:24:39 -04:00
Álvaro Herrera
ac58465e06 Introduce the REPACK command
REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command.  Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.

We retain those older commands, but de-emphasize them in the
documentation, in favor of REPACK; the difference between VACUUM FULL
and CLUSTER (namely, the fact that tuples are written in a specific
ordering) is neatly absorbed as two different modes of REPACK.

This allows us to introduce further functionality in the future that
works regardless of whether an ordering is being applied, such as (and
especially) a concurrent mode.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
2026-03-10 19:56:39 +01:00
Masahiko Sawada
a596d27d80 Fix grammar in short description of effective_wal_level.
Align with the convention of using third-person singular (e.g.,
"Shows" instead of "Show") for GUC parameter descriptions.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20260210.143752.1113524465620875233.horikyota.ntt@gmail.com
2026-03-10 11:36:38 -07:00
Andres Freund
f4a4ce52c0 heapam: Don't mimic MarkBufferDirtyHint() in inplace updates
Previously heap_inplace_update_and_unlock() used an operation order similar to
MarkBufferDirtyHint(), to reduce the number of different approaches used for
updating buffers.  However, in an upcoming patch, MarkBufferDirtyHint() will
switch to using the update protocol used by most other places (enabled by hint
bits only being set while holding a share-exclusive lock).

Luckily it's pretty easy to adjust heap_inplace_update_and_unlock(). As a
comment already foresaw, we can use the normal order, with the slight change
of updating the buffer contents after WAL logging.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-10 11:58:06 -04:00
Fujii Masao
59bae23435 Remove duplicate initialization in initialize_brin_buildstate().
Commit dae761a added initialization of some BrinBuildState fields
in initialize_brin_buildstate(). Later, commit b437571 inadvertently
added the same initialization again.

This commit removes that redundant initialization. No behavioral
change is intended.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2nmrca6-9SNChDvRYD6+r==fs9qg5J93kahS7vpoq8QVg@mail.gmail.com
2026-03-10 22:55:11 +09:00
Peter Eisentraut
8080f44f96 Rename grammar nonterminal to simplify reuse
A list of expressions with optional AS-labels is useful in a few
different places.  Right now, this is available as xml_attribute_list
because it was first used in the XMLATTRIBUTES construct, but it is
already used elsewhere, and there are other possible future uses.  To
reduce possible confusion going forward, rename it to
labeled_expr_list (like existing expr_list plus ColLabel).

Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-10 14:09:09 +01:00
Robert Haas
0fbfd37cef Allow extensions to mark an individual index as disabled.
Up until now, the only way for a loadable module to disable the use of a
particular index was to use build_simple_rel_hook (or, previous to
yesterday's commit, get_relation_info_hook) to remove it from the index
list. While that works, it has some disadvantages. First, the index
becomes invisible for all purposes, and can no longer be used for
optimizations such as self-join elimination or left join removal, which
can severely degrade the resulting plan.

Second, if the module attempts to compel the use of a certain index
by removing all other indexes from the index list and disabling
other scan types, but the planner is unable to use the chosen index
for some reason, it will fall back to a sequential scan, because that
is only disabled, whereas the other indexes are, from the planner's
point of view, completely gone. While this situation ideally shouldn't
occur, it's hard for a loadable module to be completely sure whether
the planner will view a certain index as usable for a certain query.
If it isn't, it may be better to fall back to a scan using a disabled
index rather than falling back to an also-disabled sequential scan.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYS4ZCVAF2jTce%3DbMP0Oq_db_srocR4cZyO0OBp9oUoGg%40mail.gmail.com
2026-03-10 08:33:55 -04:00
Michael Paquier
03facc1211 Switch to FATAL error for missing checkpoint record without backup_label
Crash recovery started without a backup_label previously crashed with a
PANIC if the checkpoint record could not be found.  This commit lowers
the report generated to be a FATAL instead.

With recovery methods being more imaginative these days, this should
provide more flexibility when handling PostgreSQL recovery processing in
the event of a driver error, similarly to 15f68cebdc.  An extra
benefit of this change is that it becomes possible to add a test to
check that a FATAL is hit with an expected error message pattern.  With
the recovery code becoming more complicated over the last couple of
years, I suspect that this will be beneficial to cover in the long term.

The original PANIC behavior has been introduced in the early days of
crash recovery, as of 4d14fe0048 (PANIC did not exist yet, the code
used STOP).

Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Discussion: https://postgr.es/m/CAMm1aWZbQ-Acp_xAxC7mX9uZZMH8+NpfepY9w=AOxbBVT9E=uA@mail.gmail.com
2026-03-10 12:00:05 +09:00
Michael Paquier
6307b096e2 Fix misuse of "volatile" in xml.c
What should be used is not "volatile foo *ptr" but "foo *volatile ptr".
The incorrect (former) style means that what the pointer variable points
to is volatile.  The correct (latter) style means that the pointer
variable itself needs to be treated as volatile.  The latter style is
required to ensure a consistent treatment of these variables after a
longjmp with the TRY/CATCH blocks.
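
The distinction can be illustrated outside PostgreSQL with plain
setjmp()/longjmp().  This is only a sketch: fail() and
cleanup_after_longjmp() are hypothetical names, and the TRY/CATCH
macros used in xml.c are replaced here by a direct setjmp() call.

```c
#include <setjmp.h>
#include <stdlib.h>

static jmp_buf env;

/* Hypothetical helper: reports failure by jumping back to setjmp(). */
static void
fail(void)
{
    longjmp(env, 1);
}

/*
 * Returns 1 if the buffer allocated before the jump was still reachable
 * in the error path (and was freed there), 0 otherwise.
 */
int
cleanup_after_longjmp(void)
{
    /*
     * "char *volatile buf": the pointer variable itself is volatile, so
     * the value assigned between setjmp() and longjmp() is well defined
     * after the jump.  "volatile char *buf" would instead qualify the
     * bytes pointed to and give no such guarantee for the pointer.
     */
    char *volatile buf = NULL;

    if (setjmp(env) == 0)
    {
        buf = malloc(16);       /* modified after setjmp() */
        fail();                 /* does not return normally */
    }

    /* Error path, reached only through longjmp(). */
    if (buf != NULL)
    {
        free(buf);
        return 1;
    }
    return 0;
}
```

Without the qualifier on the pointer, the C standard leaves the value
of buf indeterminate after the jump, because it was changed between
setjmp() and longjmp().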

Some casts can be removed thanks to this change.

Issue introduced by 2e94721747, so no backpatch is required.  A
similar set of issues has been fixed in 93001888d8 for contrib/xml2/.

Author: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/tencent_5BE8DAD985EE140ED62EA728C8D4E1311F0A@qq.com
2026-03-10 07:05:32 +09:00
Robert Haas
91f33a2ae9 Replace get_relation_info_hook with build_simple_rel_hook.
For a long time, PostgreSQL has had a get_relation_info_hook which
plugins can use to editorialize on the information that
get_relation_info obtains from the catalogs. However, this hook is
only called for baserels of type RTE_RELATION, and there is
potential utility in a similar call back for other types of
RTEs. This might have had utility even before commit
4020b370f2 added pgs_mask to
RelOptInfo, but it certainly has utility now.

So, move the callback up one level, deleting get_relation_info_hook and
adding build_simple_rel_hook instead. The new callback is called just
slightly later than before and with slightly different arguments, but it
should be fairly straightforward to adjust existing code that currently
uses get_relation_info_hook: the values previously available as
relationObjectId and inhparent are now available via rte->relid and
rte->inh, and calls where rte->rtekind != RTE_RELATION can be ignored if
desired.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYg8uUWyco7Pb3HYLMBRQoO6Zh9hwgm27V39Pb6Pdf%3Dug%40mail.gmail.com
2026-03-09 09:48:26 -04:00
Robert Haas
8300d3ad4a Consider startup cost as a figure of merit for partial paths.
Previously, the comments stated that there was no purpose to considering
startup cost for partial paths, but this is not the case: it's perfectly
reasonable to want a fast-start path for a plan that involves a LIMIT
(perhaps over an aggregate, so that there is enough data being processed
to justify parallel query but yet we don't want all the result rows).

Accordingly, rewrite add_partial_path and add_partial_path_precheck to
consider startup costs. This also fixes an independent bug in
add_partial_path_precheck: commit e222534679
failed to update it to do anything with the new disabled_nodes field.
That bug fix is formally separate from the rest of this patch and could
be committed separately, but I think it makes more sense to fix both
issues together, because then we can (as this commit does) just make
add_partial_path_precheck do the cost comparisons in the same way as
compare_path_costs_fuzzily, which hopefully reduces the chances of
ending up with something that's still incorrect.

This patch is based on earlier work on this topic by Tomas Vondra,
but I have rewritten a great deal of it.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
Discussion: http://postgr.es/m/CA+TgmobRufbUSksBoxytGJS1P+mQY4rWctCk-d0iAUO6-k9Wrg@mail.gmail.com
2026-03-09 08:16:30 -04:00
Robert Haas
ffc226ab64 Prevent restore of incremental backup from bloating VM fork.
When I (rhaas) wrote the WAL summarizer code, I incorrectly believed
that XLOG_SMGR_TRUNCATE truncates all forks to the same length.  In
fact, what other parts of the code do is compute the truncation length
for the FSM and VM forks from the truncation length used for the main
fork. But, because I was confused, I coded the WAL summarizer to set the
limit block for the VM fork to the same value as for the main fork.
(Incremental backup always copies FSM forks in full, so there is no
similar issue in that case.)

Doing that doesn't directly cause any data corruption, as far as I can
see. However, it does create a serious risk of consuming a large amount
of extra disk space, because pg_combinebackup's reconstruct.c believes
that the reconstructed file should always be at least as long as the
limit block value. We might want to be smarter about that at some point
in the future, because it's always safe to omit all-zeroes blocks at the
end of the last segment of a relation, and doing so could save disk
space, but the current algorithm will rarely waste enough disk space to
worry about unless we believe that a relation has been truncated to a
length much longer than its actual length on disk, which is exactly what
happens as a result of the problem mentioned in the previous paragraph.

To fix, create a new visibilitymap helper function and use it to include
the right limit block in the summary files. Incremental backups taken
with existing summary files will still have this issue, but this should
improve the situation going forward.
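
As a rough illustration of why the main-fork length is the wrong limit
for the VM fork, the sketch below derives a VM block count from a heap
block count.  The constants assume the default 8 kB block size, a
24-byte page header and two VM bits per heap block; the function name
is hypothetical and is not the helper added by this commit.

```c
#include <stdint.h>

#define SKETCH_BLOCK_SIZE        8192   /* assumed default block size */
#define SKETCH_PAGE_HEADER_SIZE  24     /* assumed aligned page header */
#define SKETCH_BITS_PER_BLOCK    2      /* all-visible + all-frozen */

/* Heap blocks whose bits fit in one VM page under the assumptions above. */
#define SKETCH_HEAP_BLOCKS_PER_VM_PAGE \
    ((uint64_t) (SKETCH_BLOCK_SIZE - SKETCH_PAGE_HEADER_SIZE) * \
     (8 / SKETCH_BITS_PER_BLOCK))

/*
 * Number of VM blocks needed to cover heap_blocks heap blocks.
 * Uses 64-bit arithmetic so the rounding cannot overflow.
 */
uint32_t
vm_blocks_for_heap_blocks(uint32_t heap_blocks)
{
    uint64_t per_page = SKETCH_HEAP_BLOCKS_PER_VM_PAGE;

    return (uint32_t) (((uint64_t) heap_blocks + per_page - 1) / per_page);
}
```

Under these assumptions a main fork truncated to 1,000,000 blocks
needs only 31 VM blocks, so recording 1,000,000 as the VM limit block
overstates the VM fork's length by several orders of magnitude.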

Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com>
Diagnosed-by: Amul Sul <sulamul@gmail.com>
Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com
Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com
Backpatch-through: 17
2026-03-09 06:45:32 -04:00
Amit Kapila
06d8302262 Remove trailing period from errmsg in subscriptioncmds.c.
Author: Sahitya Chandra <sahityajb@gmail.com>
Discussion: https://postgr.es/m/20260308142806.181309-1-sahityajb@gmail.com
2026-03-09 15:10:03 +05:30
Michael Paquier
4da2afd01f Fix size underestimation of DSA pagemap for odd-sized segments
When make_new_segment() creates an odd-sized segment, the pagemap was
only sized based on a number of usable_pages entries, forgetting that a
segment also contains metadata pages, and that the FreePageManager uses
absolute page indices that cover the entire segment.  This
miscalculation could cause accesses to pagemap entries to be out of
bounds.  During subsequent reuse of the allocated segment, allocations
landing on pages with indices higher than usable_pages could cause
out-of-bounds pagemap reads and/or writes.  On write, 'span' pointers
are stored into the data area, corrupting the allocated objects.  On
read (aka during a dsa_free), garbage is interpreted as a span pointer,
typically crashing the server in dsa_get_address().

The normal geometric path correctly sizes the pagemap for all pages in
the segment.  The odd-sized path needs to do the same, but it works
forward from usable_pages rather than backward from total_size.

This commit fixes the sizing of the odd-sized case by adding pagemap
entries for the metadata pages after the initial metadata_bytes
calculation, using an integer ceiling division to compute the exact
number of additional entries needed in one go, avoiding any iteration in
the calculation.
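
The arithmetic can be sketched as follows.  This is not the code from
dsa.c: the page size, entry size and function names are assumptions
chosen for illustration.  Each metadata page provides one page of room
but also needs one pagemap entry of its own, so it contributes only
(page size - entry size) net bytes, and a single ceiling division
gives the page count.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096     /* assumed page size */
#define SKETCH_ENTRY_SIZE  8        /* assumed size of one pagemap entry */

/*
 * base_metadata_bytes covers the headers plus one pagemap entry per
 * usable page.  Return the number of metadata pages needed once the
 * pagemap also holds one entry for each metadata page.
 */
size_t
metadata_pages_needed(size_t base_metadata_bytes)
{
    size_t net = SKETCH_PAGE_SIZE - SKETCH_ENTRY_SIZE;

    return (base_metadata_bytes + net - 1) / net;   /* ceiling division */
}

/* Reference version: grow the page count until everything fits. */
size_t
metadata_pages_needed_iterative(size_t base_metadata_bytes)
{
    size_t pages = 0;

    while (pages * SKETCH_PAGE_SIZE <
           base_metadata_bytes + pages * SKETCH_ENTRY_SIZE)
        pages++;
    return pages;
}
```

Both functions return the same result; the closed form simply avoids
the loop.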

An assertion is added in the code path for odd-sized segments, ensuring
that the pagemap includes the metadata area, and that the result is
appropriately sized.

This problem would show up depending on the size requested for the
allocation of a DSA segment.  The reporter has noticed this issue when a
parallel hash join makes a DSA allocation large enough to trigger the
odd-sized segment path, but it could happen for anything that does a DSA
allocation.

A regression test is added to test_dsa, down to v17 where the test
module has been introduced.  This adds a set of cheap tests to check the
problem, the new assertion being useful for this purpose.  Sami has
proposed a test that took a longer time than what I have done here; the
test committed is faster and good enough to check the odd-sized
allocation path.

Author: Paul Bunn <paul.bunn@icloud.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/044401dcabac$fe432490$fac96db0$@icloud.com
Backpatch-through: 14
2026-03-09 13:46:27 +09:00
Masahiko Sawada
50ea4e09b6 Use palloc_object() and palloc_array() in more areas of the logical replication.
The idea is to encourage the use of newer routines across the tree, as
these offer stronger type-safety guarantees than raw palloc().
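
To show where the type safety comes from, here is a sketch using
malloc()-based stand-ins.  alloc_object() and alloc_array() are
hypothetical names, not PostgreSQL's definitions; they only mimic the
idea of naming the type once and getting back a correctly typed
pointer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins built on malloc(); names are illustrative. */
#define alloc_object(type)        ((type *) malloc(sizeof(type)))
#define alloc_array(type, count)  ((type *) malloc(sizeof(type) * (count)))

typedef struct Accumulator
{
    long        total;
} Accumulator;

/* Returns 0^2 + 1^2 + ... + (n-1)^2, or -1 if allocation fails. */
long
sum_of_squares(size_t n)
{
    Accumulator *acc;
    long       *squares;
    long        result;

    if (n == 0)
        return 0;

    /*
     * The macro result already has the right pointer type, so assigning
     * it to a pointer of a different type draws a compiler diagnostic
     * instead of compiling silently as a void * would.
     */
    acc = alloc_object(Accumulator);
    squares = alloc_array(long, n);
    if (acc == NULL || squares == NULL)
    {
        free(acc);
        free(squares);
        return -1;
    }

    acc->total = 0;
    for (size_t i = 0; i < n; i++)
    {
        squares[i] = (long) (i * i);
        acc->total += squares[i];
    }

    result = acc->total;
    free(squares);
    free(acc);
    return result;
}
```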

Similar work has been done in commits 1b105f9472, 0c3c5c3b06,
31d3847a37, and 4f7dacc5b8. This commit extends those changes to
more locations within src/backend/replication/logical/.

Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAHut+Pv4N7Vpxo18+NAR1r9RGvR8b0BtwTkoeCE2PfFoXgmR6A@mail.gmail.com
2026-03-06 10:49:50 -08:00
Tom Lane
415100aa62 Support grouping-expression references and GROUPING() in subqueries.
Until now, substitute_grouped_columns and its predecessor
check_ungrouped_columns intentionally did not cope with references
to GROUP BY expressions (anything more complex than a Var) within
subqueries of the query having GROUP BY.  Because they didn't try to
match subexpressions of subqueries to the GROUP BY list, they'd drill
down to raw Vars of the grouping level and then fail with "subquery
uses ungrouped column from outer query".  There have been remarkably
few complaints about this deficiency, so nobody ever did anything
about it.

The reason for not wanting to deal with it is that within a subquery,
Vars will have varlevelsup different from zero and will thus not be
equal() to the expressions seen in the outer query.  We recognized
this at least as far back as 96ca8ffeb, although I think the comment
I added about it then was just documenting a pre-existing deficiency.
It looks like at the time, the solutions I considered were
(1) write a version of equal() that permits an offset in varlevelsup,
or (2) dynamically apply IncrementVarSublevelsUp at each
subexpression.  (1) would require an amount of new code that seems
rather out of proportion to the benefit, while (2) would add an
exponential amount of cost to the matching process.  But rethinking
it now, what seems attractive is (3) apply IncrementVarSublevelsUp to
the groupingClause list not the subexpressions, and do so only once
per subquery depth level.  Then we can still use plain equal() to
check for matches, and we're not incurring cost proportional to some
power of the subquery's complexity.

This patch continues to use the old logic when the GROUP BY list is
all Vars.  We could discard the special comparison logic for that and
always do it the more general way, but that would be a good deal
slower.  (Micro-benchmarking just parse analysis suggests it's about
50% slower than the Vars-only path.  But we've not heard complaints
about the speed of matching within the main query, so I doubt that
applying the same matching logic within subqueries will be a problem.)
The lack of complaints suggests strongly that this is a very minority
use-case, so I don't want to make the typical case slower to fix it.

While testing that, I was surprised to discover a nearby bug:
GROUPING() within a subquery fails to match GROUP BY Vars that are
join alias Vars.  It tries to apply flatten_join_alias_vars to make
such cases work, but that fails to work inside a subquery because
varlevelsup is wrong.  Therefore, this patch invents a new entry point
flatten_join_alias_for_parser() that allows specification of a
sublevels_up offset.  (It seems cleaner to give the parser its own
entry point rather than abuse the planner's conventions even further.)

While this is pretty clearly a bug fix, I'm hesitant to take the risk
of back-patching, seeing that the existing behavior has stood for so
long with so few complaints.  Maybe we can reconsider once this patch
has baked awhile in master.

Reported-by: PALAYRET Jacques <jacques.palayret@meteo.fr>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/531183.1772058731@sss.pgh.pa.us
2026-03-06 13:40:55 -05:00
Jeff Davis
8185bb5347 CREATE SUBSCRIPTION ... SERVER.
Allow CREATE SUBSCRIPTION to accept a foreign server using the SERVER
clause instead of a raw connection string using the CONNECTION clause.

  * Enables a user with sufficient privileges to create a subscription
    using a foreign server by name without specifying the connection
    details.

  * Integrates with user mappings (and other FDW infrastructure) using
    the subscription owner.

  * Provides a layer of indirection to manage multiple subscriptions
    to the same remote server more easily.

Also add CREATE FOREIGN DATA WRAPPER ... CONNECTION clause to specify
a connection_function. To be eligible for a subscription, the foreign
server's foreign data wrapper must specify a connection_function.

Add connection_function support to postgres_fdw, and bump postgres_fdw
version to 1.3.

Bump catversion.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/61831790a0a937038f78ce09f8dd4cef7de7456a.camel@j-davis.com
2026-03-06 08:27:56 -08:00
Álvaro Herrera
868825aaeb Don't include wait_event.h in pgstat.h
wait_event.h itself includes wait_event_types.h, which is a generated
file, so it's nice that we can avoid compiling >10% of the tree just
because that file is regenerated.

To avoid breaking too many third-party modules, we now #include
utils/wait_classes.h in storage/latch.h.  Then, the very common case
of doing
	WaitLatch(..., PG_WAIT_EXTENSION)
continues to work by including just storage/latch.h.  (I didn't try to
determine how many modules would actually break if we don't do this, but
this seems a convenient and low-impact measure.)

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/202602181214.gcmhx2vhlxzp@alvherre.pgsql
2026-03-06 16:24:58 +01:00
Fujii Masao
6eedb2a5fd Fix publisher shutdown hang caused by logical walsender busy loop.
Previously, when logical replication was running, shutting down
the publisher could cause the logical walsender to enter a busy loop
and prevent the publisher from completing shutdown.

During shutdown, the logical walsender waits for all pending WAL
to be written out. However, some WAL records could remain unflushed,
causing the walsender to wait indefinitely.

The issue occurred because the walsender used XLogBackgroundFlush() to
flush pending WAL. This function does not guarantee that all WAL is written.
For example, WAL generated by a transaction without an assigned
transaction ID that aborts might not be flushed.

This commit fixes the bug by making the logical walsender call XLogFlush()
instead, ensuring that all pending WAL is written and preventing
the busy loop during shutdown.

Backpatch to all supported versions.

Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ@mail.gmail.com
Backpatch-through: 14
2026-03-06 16:43:40 +09:00
Michael Paquier
d5ea206728 Fix inconsistency with HeapTuple freeing in extended_stats_funcs.c
heap_freetuple() is a thin wrapper doing a pfree(), and the function
import_pg_statistic(), introduced by ba97bf9cb7, called pfree()
directly rather than the "dedicated" heap tuple routine.

upsert_pg_statistic_ext_data already uses heap_freetuple().  This code
is harmless as-is, but let's be consistent across the board.

Reported-by: Yonghao Lee <yonghao_lee@qq.com>
Discussion: https://postgr.es/m/tencent_CA1315EE8FB9C62F742C71E95FAD72214205@qq.com
2026-03-06 14:49:00 +09:00
Michael Paquier
2d4ead6f4b Fix order of columns in pg_stat_recovery
recovery_last_xact_time is listed before current_chunk_start_time in the
documentation, the function definition and the view definition, but
their order was reversed in the code.

Thinko in 01d485b142.  Mea culpa.

Author: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAOzEurQQ1naKmPJhfE5WOUQjtf5tu08Kw3QCGY5UY=7Rt9fE=w@mail.gmail.com
2026-03-06 14:41:41 +09:00
Amit Kapila
f1ddaa1535 Fix inconsistent elevel in pg_sync_replication_slots() retry logic.
The commit 0d2d4a0ec3 allowed pg_sync_replication_slots() to retry sync
attempts, but missed a case, when WAL prior to a slot's
confirmed_flush_lsn is not yet flushed locally.

By changing the elevel from ERROR to LOG, we allow the sync loop to
continue. This provides the opportunity for the slot to be synchronized
once the standby catches up with the necessary WAL.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAFPTHDZAA+gWDntpa5ucqKKba41=tXmoXqN3q4rpjO9cdxgQrw@mail.gmail.com
2026-03-06 10:51:32 +05:30
Michael Paquier
01d485b142 Add system view pg_stat_recovery
This commit introduces pg_stat_recovery, which exposes at SQL level the
state of recovery as tracked by XLogRecoveryCtlData in shared memory,
maintained by the startup process.  This new view includes the following
fields, that are useful for monitoring purposes on a standby, once it
has reached a consistent state (making the execution of the SQL function
possible):
- Last-successfully replayed WAL record LSN boundaries and its timeline.
- Currently replaying WAL record end LSN and its timeline.
- Current WAL chunk start time.
- Promotion trigger state.
- Timestamp of latest processed commit/abort.
- Recovery pause state.

Some of this data can already be recovered from different system
functions, but not all of it.  See pg_get_wal_replay_pause_state or
pg_last_xact_replay_timestamp.  This new view offers a stronger
consistency guarantee, by grabbing the recovery state for all fields
through one spinlock acquisition.

The system view relies on a new function, called pg_stat_get_recovery().
Querying this data requires the pg_read_all_stats privilege.  The view
returns no rows if the node is not in recovery.

This feature originates from a suggestion I have made while discussing
the addition of a CONNECTING state to the WAL receiver's shared memory
state, because we lacked access to some of the state data.  The author
has taken the time to implement it, so thanks for that.

Bump catalog version.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
Discussion: https://postgr.es/m/aW13GJn_RfTJIFCa@paquier.xyz
2026-03-06 12:37:40 +09:00
Michael Paquier
42a12856a6 Refactor code retrieving string for RecoveryPauseState
This refactoring is going to be useful in an upcoming commit, to avoid
some code duplication with the function pg_get_wal_replay_pause_state(),
that returns a string for the recovery pause state.

Refactoring opportunity noticed while hacking on a different patch.

Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
2026-03-06 11:53:23 +09:00
Tom Lane
f95d73ed43 Simplify creation of built-in functions with non-default ACLs.
Up to now, to create such a function, one had to make a pg_proc.dat
entry and then modify it with GRANT/REVOKE commands, which we put in
system_functions.sql.  That seems a little ugly though, because it
violates the idea of having a single source of truth about the initial
contents of pg_proc, and it results in leaving dead rows in the
initial contents of pg_proc.

This patch improves matters by allowing aclitemin to work during early
bootstrap, before pg_authid has been loaded.  On the same principle
that we use for early access to pg_type details, put a table of known
built-in role names into bootstrap.c, and use that in bootstrap mode.

To create a built-in function with a non-default ACL, one should write
the desired ACL list in its pg_proc.dat entry, using a simplified
version of aclitemout's notation: omit the grantor (if it is the
bootstrap superuser, which it pretty much always should be) and spell
the bootstrap superuser's name as POSTGRES, similarly to the notation
used elsewhere in src/include/catalog.  This results in entries like

  proacl => '{POSTGRES=X,pg_monitor=X}'

which shows that we've revoked public execute permissions and instead
granted that to pg_monitor.

In addition to fixing up pg_proc.dat entries, I got rid of some
role grants that had been stuck into system_functions.sql,
and instead put them into a new file pg_auth_members.dat;
that seems like a far less random place to put the information.

The correctness of the data changes can be verified by comparing the
initial contents of pg_proc and pg_auth_members before and after.
pg_proc should match exactly, but the OID column of pg_auth_members
will probably be different because those OIDs now get assigned a
little earlier in bootstrap.  (I forced a catversion bump out of
caution, but it wasn't really necessary.)

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net
2026-03-05 17:43:09 -05:00
Melanie Plageman
34cb4254bd Prefix PruneState->all_{visible,frozen} with set_
The PruneState had members called "all_visible" and "all_frozen" which
reflect not the current state of the page but the state it could be in
once pruning and freezing have been executed. These are then saved in
the PruneFreezeResult so the caller can set the VM accordingly.

Prefix the PruneState members as well as the corresponding
PruneFreezeResult members with "set_" to clarify that they represent the
proposed state of the all-visible and all-frozen bits for a heap page in
the visibility map, not the current state.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-05 16:55:00 -05:00
Melanie Plageman
68c2dcb913 Add PageGetPruneXid() helper
This is similar to the other page accessors in bufpage.h. It improves
readability and avoids long lines.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com
2026-03-05 16:22:57 -05:00
Melanie Plageman
59663e4207 Move commonly used context into PruneState and simplify helpers
heap_page_prune_and_freeze() and many of its helpers use the heap
buffer, block number, and page. Other helpers took the heap page and
didn't use it. Initializing these values once during
prune_freeze_setup() simplifies the helpers' interfaces and avoids any
repeated calls to BufferGetBlockNumber() and BufferGetPage().

While updating PruneState, also reorganize its fields to make the layout
and member documentation more consistent.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com
2026-03-05 16:10:29 -05:00
Alexander Korotkov
177037341a Fix handling of updated tuples in the MERGE statement
This branch missed the IsolationUsesXactSnapshot() check.  That led to EPQ on
repeatable read and serializable isolation levels.  This commit fixes the
issue and provides a simple isolation check for that.  Backpatch through v15
where MERGE statement was introduced.

Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAPpHfdvzZSaNYdj5ac-tYRi6MuuZnYHiUkZ3D-AoY-ny8v%2BS%2Bw%40mail.gmail.com
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Backpatch-through: 15
2026-03-05 19:49:28 +02:00
Fujii Masao
bffd7130e9 Improve validation of recovery_target_xid GUC values.
Previously, the recovery_target_xid GUC values were not sufficiently validated.
As a result, clearly invalid inputs such as the string "bogus", a decimal value
like "1.1", or 0 (a transaction ID smaller than the minimum valid value of 3)
were unexpectedly accepted. In these cases, the value was interpreted as
transaction ID 0, which could cause recovery to behave unexpectedly.

This commit improves validation of recovery_target_xid GUC so that invalid
values are rejected with an error. This prevents recovery from proceeding
with misconfigured recovery_target_xid settings.
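
A strict parser of this kind might look like the sketch below.  It is
not the check added by this commit: parse_target_xid() is a
hypothetical name, and the 32-bit upper bound and decimal-only input
are assumptions.  Only the lower bound of 3 comes from the description
above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_FIRST_NORMAL_XID 3   /* minimum valid value, per the text */

/*
 * Parse a decimal transaction ID.  Returns true and stores the result in
 * *xid only if the whole string is a number in the accepted range.
 */
bool
parse_target_xid(const char *value, uint32_t *xid)
{
    char       *end;
    unsigned long long parsed;

    /* Reject empty input, signs and leading whitespace up front. */
    if (value == NULL || value[0] < '0' || value[0] > '9')
        return false;

    errno = 0;
    parsed = strtoull(value, &end, 10);

    /* Trailing junk ("1.1") or overflow means the input is invalid. */
    if (errno != 0 || *end != '\0')
        return false;

    /* Assumed range: normal XIDs that fit in 32 bits. */
    if (parsed < SKETCH_FIRST_NORMAL_XID || parsed > UINT32_MAX)
        return false;

    *xid = (uint32_t) parsed;
    return true;
}
```

With these rules "bogus", "1.1" and "0" are all rejected rather than
being read as transaction ID 0.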

Also this commit updates the documentation to clarify the allowed values
for recovery_target_xid GUC.

Author: David Steele <david@pgbackrest.org>
Reviewed-by: Hüseyin Demir <huseyin.d3r@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/f14463ab-990b-4ae9-a177-998d2677aae0@pgbackrest.org
2026-03-05 21:40:32 +09:00
Michael Paquier
5f8124a0cf Move definition of XLogRecoveryCtlData to xlogrecovery.h
XLogRecoveryCtlData is the structure that stores the shared-memory state
of WAL recovery, including information such as promotion requests, the
timeline ID (TLI), and the LSNs of replayed records.

This refactoring is independently useful because it allows code outside
of core to access the live recovery state.  It will be used by an
upcoming patch that introduces a SQL function for querying this
information, that can be accessed on a standby once a consistent state
has been reached.  This only moves code around, changing nothing
functionally.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
2026-03-05 12:17:47 +09:00
Michael Paquier
34dfca2934 Change default value of default_toast_compression to "lz4", take two
The default value for default_toast_compression was "pglz".  The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres.  However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average.  As
of this commit, the default value of default_toast_compression becomes
"lz4", if available.  By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.

Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago.  This should be long enough to consider this feature as
stable.

While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample.  Quotes are not required in this case.  The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.

Note that this is a lighter version of 7c1849311e, which included a
replacement of --with-lz4 by --without-lz4 in configure builds, forcing
a requirement for LZ4 in all environments.  The buildfarm did not like
it, at all.  This commit switches default_toast_compression to lz4 as
default only when --with-lz4 is defined, which should keep the buildfarm
at bay while still allowing users to benefit from LZ4 compression in
TOAST as long as the code is compiled with it.

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://postgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org
2026-03-05 09:24:35 +09:00
Michael Paquier
4f0b3afab4 Revert "Change default value of default_toast_compression to "lz4""
This reverts commit 7c1849311e because more than 60% of the buildfarm
members do not have lz4 installed.  As we are in the last commit fest of
the development cycle, and as it could take a couple of weeks to
stabilize things, this change is reverted for now.

This commit will be reworked in a lighter version, as
default_toast_compression's default can be changed to "lz4" without the
switch from --with-lz4 to --without-lz4.  This approach will keep the
buildfarm at bay, and still allow builds to take advantage of LZ4 in
TOAST by default, as long as the code is compiled with LZ4 support.

A harder requirement based on LZ4 should be achievable at some point,
but it is going to require some work from the buildfarm owners first.
Perhaps this part could be revisited at the beginning of the next
development cycle.

Discussion: https://postgr.es/m/CAOYmi+meTT0NbLbnVqOJD5OKwCtHL86PQ+RZZTrn6umfmHyWaw@mail.gmail.com
2026-03-05 08:25:35 +09:00
Tom Lane
e6a1d8f5ac Fix estimate_hash_bucket_stats's correction for skewed data.
The previous idea was "scale up the bucketsize estimate by the ratio
of the MCV's frequency to the average value's frequency".  But we
should have been suspicious of that plan, since it frequently led to
impossible (> 1) values which we had to apply an ad-hoc clamp to.
Joel Jacobson demonstrated that it sometimes leads to making the
wrong choice about which side of the hash join should be inner.

Instead, drop the whole business of estimating average frequency, and
just clamp the bucketsize estimate to be at least the MCV's frequency.
This corresponds to the bucket size we'd get if only the MCV appears
in a bucket, and the MCV's frequency is not affected by the
WHERE-clause filters.  (We were already making the latter assumption.)
This also matches the coding used since 4867d7f62 in the case where
only a default ndistinct estimate is available.
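
To make the new rule concrete, here is a minimal sketch of the clamp.  It
is not the code of estimate_hash_bucket_stats itself: the function name
and both parameter names are hypothetical, and the real function does
considerably more work around this step.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.  estfract is the
 * bucket-size fraction derived from ndistinct and the number of
 * buckets; mcv_freq is the most common value's frequency.  Both are
 * fractions of the relation in the range 0..1.
 */
static double
clamp_bucketsize_to_mcv(double estfract, double mcv_freq)
{
    assert(estfract >= 0.0 && estfract <= 1.0);
    assert(mcv_freq >= 0.0 && mcv_freq <= 1.0);

    /* The bucket holding the MCV contains at least mcv_freq of the rows. */
    return (estfract < mcv_freq) ? mcv_freq : estfract;
}
```

Because both inputs are already at most 1, the result can never exceed
1, so no additional ad-hoc clamp is needed in this sketch.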

Interestingly, this change affects no existing regression test cases.
Add one to demonstrate that it helps pick the smaller table to be
hashed when the MCV is common enough to affect the results.

This leaves estimate_hash_bucket_stats not considering the effects of
null join keys at all, which we should probably improve.  However,
I have a different patch in the queue that will change the executor's
handling of null join keys, so it seems appropriate to wait till
that's in before doing anything more here.

Reported-by: Joel Jacobson <joel@compiler.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Joel Jacobson <joel@compiler.org>
Discussion: https://postgr.es/m/341b723c-da45-4058-9446-1514dedb17c1@app.fastmail.com
2026-03-04 15:33:15 -05:00
Álvaro Herrera
ce4fbe1ac6 Don't malloc(0) in EventTriggerCollectAlterTSConfig
Author: Florin Irion <florin.irion@enterprisedb.com>
Discussion: https://postgr.es/m/c6fff161-9aee-4290-9ada-71e21e4d84de@gmail.com
2026-03-04 15:04:53 +01:00
Amit Kapila
fd366065e0 Allow table exclusions in publications via EXCEPT TABLE.
Extend CREATE PUBLICATION ... FOR ALL TABLES to support the EXCEPT TABLE
syntax. This allows one or more tables to be excluded. The publisher will
not send the data of excluded tables to the subscriber.

To support this, pg_publication_rel now includes a prexcept column to flag
excluded relations. For partitioned tables, the exclusion is applied at
the root level; specifying a root table excludes all current and future
partitions in that tree.

Follow-up work will implement ALTER PUBLICATION support for managing these
exclusions.

Author: vignesh C <vignesh21@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
2026-03-04 15:56:48 +05:30
Michael Paquier
7c1849311e Change default value of default_toast_compression to "lz4", when available
The default value for default_toast_compression was "pglz".  The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres.  However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average.  As
of this commit, the default value of default_toast_compression becomes
"lz4", if available.  By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.

Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago.  This should be long enough to consider this feature as
stable.

--with-lz4 is removed, replaced by a --without-lz4 to disable LZ4 in the
builds on an option-basis, following a practice similar to readline or
ICU.  References to --with-lz4 are removed from the documentation.

While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample.  Quotes are not required in this case.  The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.

For reference, a similar switch has been done with ICU in
fcb21b3acd.  Some of the changes done in this commit are consistent
with that.

Note: this is going to create some disturbance in the buildfarm, in
environments where lz4 is not installed.

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://postgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org
2026-03-04 13:05:31 +09:00
Richard Guo
1f4f87d794 Remove redundant restriction checks in apply_child_basequals
In apply_child_basequals, after translating a parent relation's
restriction quals for a child relation, we simplify each child qual by
calling eval_const_expressions.  Historically, the code then called
restriction_is_always_false and restriction_is_always_true to reduce
NullTest quals that are provably false or true.

However, since commit e2debb643, the planner natively performs
NullTest deduction during constant folding.  Therefore, calling
restriction_is_always_false and restriction_is_always_true immediately
afterward is redundant and wastes CPU cycles.  We can safely remove
them and simply rely on the constant folding to handle the deduction.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-vLmGXaUEZyOMacN0BVfqWCt2tM-eDVWdDfJnOQaauGg@mail.gmail.com
2026-03-04 10:57:43 +09:00
Richard Guo
ce1c17a316 Remove obsolete SAMESIGN macro
The SAMESIGN macro was historically used as a helper for manual
integer overflow checks.  However, since commit 4d6ad3125 introduced
overflow-aware integer operations, this manual sign-checking logic is
no longer necessary.
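
For readers unfamiliar with the two styles, the sketch below contrasts a
manual sign-based check with an overflow-aware operation.  The function
names are hypothetical, the SAMESIGN definition shown is an assumed
conventional form, and __builtin_add_overflow is a GCC/Clang builtin
used here as a stand-in for the overflow-aware helpers mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed form of the macro: true when both values have the same sign. */
#define SAMESIGN(a, b) (((a) < 0) == ((b) < 0))

/* Manual idiom: overflow happened iff the inputs share a sign that the
 * result does not have.  Returns true on overflow. */
static bool
add_s32_overflow_manual(int32_t a, int32_t b, int32_t *result)
{
    /*
     * Add in unsigned arithmetic so the wraparound is well defined; the
     * conversion back to int32_t is implementation-defined but wraps on
     * common compilers.
     */
    int32_t res = (int32_t) ((uint32_t) a + (uint32_t) b);

    if (SAMESIGN(a, b) && !SAMESIGN(res, a))
        return true;
    *result = res;
    return false;
}

/* Overflow-aware equivalent: the compiler reports the overflow itself. */
static bool
add_s32_overflow_builtin(int32_t a, int32_t b, int32_t *result)
{
    return __builtin_add_overflow(a, b, result);
}
```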

The macro remains defined in brin_minmax_multi.c and timestamp.c, but
is not used in either file.  This patch removes these definitions to
clean things up.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-NL3J3hQ3LzrwV-YUkQC18P+jM7ZiegQyAHzgdZev2qg@mail.gmail.com
2026-03-04 10:56:06 +09:00
Melanie Plageman
38229cb905 Add read_stream_{pause,resume}()
Read stream users can now pause lookahead when no blocks are currently
available. After resuming, subsequent read_stream_next_buffer() calls
continue lookahead with the previous lookahead distance.

This is especially useful for read stream users with self-referential
access patterns (where consuming already-read buffers can produce
additional block numbers).

Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com
2026-03-03 16:03:09 -05:00
Álvaro Herrera
cece37c984 Reduce scope of for-loop-local variables to avoid shadowing
Adjust a couple of for-loops where a local variable was shadowed by
another in the same scope, by renaming it as well as reducing its scope
to the containing for-loop.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAEoWx2kQ2x5gMaj8tHLJ3=jfC+p5YXHkJyHrDTiQw2nn2FJTmQ@mail.gmail.com
2026-03-03 11:24:11 +01:00
Peter Eisentraut
f2d7570cdd Reduce the scope of volatile qualifiers
Commit c66a7d75e6 introduced a new "cast discards ‘volatile’"
warning (-Wcast-qual) in vac_truncate_clog().

Instead of making use of unvolatize(), remove the warning by reducing the
scope of the volatile qualifier (added in commit 2d2e40e3be) to only
2 fields.

Also do the same for vac_update_datfrozenxid(), since the intent of
commit f65ab862e3 was to prevent the same kind of race condition that
commit 2d2e40e3be was fixing.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aZ3a%2BV82uSfEjDmD%40ip-10-97-1-34.eu-west-3.compute.internal
2026-03-03 10:02:28 +01:00
Peter Eisentraut
2a525cc97e Add COPY (on_error set_null) option
If ON_ERROR SET_NULL is specified during COPY FROM, any data type
conversion errors will result in the affected column being set to a
null value.  A column's not-null constraints are still enforced, and
attempting to set a null value in such columns will raise a constraint
violation error.  This applies to a column whose data type is a domain
with a NOT NULL constraint.

Author: Jian He <jian.universality@gmail.com>
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-by: Yugo NAGATA <nagata@sraoss.co.jp>
Reviewed-by: torikoshia <torikoshia@oss.nttdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CAKFQuwawy1e6YR4S%3Dj%2By7pXqg_Dw1WBVrgvf%3DBP3d1_aSfe_%2BQ%40mail.gmail.com
2026-03-03 07:37:12 +01:00
Michael Paquier
ba97bf9cb7 Add support for "exprs" in pg_restore_extended_stats()
This commit adds support for the restore of extended statistics of the
kind "exprs", which covers the statistics data computed for expressions.

The input format consists of a jsonb object which must be an array of
objects which are keyed by statistics parameter names, like this:
[{"stat_type1": "...", "stat_type2": "...", ...},
 {"stat_type1": "...", "stat_type2": "...", ...}, ...]

The outer array must have as many elements as there are expressions
defined in the statistics object, mapping with the way extended
statistics are built with one pg_statistic tuple stored for each
expression whose statistics have been computed.  The elements of the
array must be either objects or null values (the equivalent of invalid
data, a case that the stats computations also support when inserting
data into the catalogs).

The keys of the inner objects are names of the statistical columns in
pg_stats_ext_exprs (i.e. everything after "inherited").  Not all
parameter keys need to be provided, those omitted being silently
ignored.  Key values that do not match a statistical column name will
cause a warning to be issued, but do not otherwise fail the expression
or the import as a whole.

The expected value type for all parameters is jbvString, which allows
us to validate the values using the input function specific to that
parameter.  Any parameters with a null value are silently ignored, same
as if they were not provided in the first place.

This commit includes a battery of test cases:
- Sanity checks for what-should-be-all the failures in restore code
paths, including parsing errors, parameter sanity checks depending on
the extended stats object definition, etc.
- Value injection, for scalar, array, range, multi-range cases.
- Stats data cloning, with differential checks between the source
relation and its target.  The source and the target should hold the same
stats data after restore.
- While expressions are supported in extended statistics since v14,
range_length_histogram, range_empty_frac, and range_bounds_histogram
have been added to pg_stats_ext_exprs only in v19.  A test case has been
added to emulate a dump taken from v18, with expression stats restored
for a range data type where these three fields are NULL.

Support for pg_dump is included, with expressions supported since v14,
inherited since v15, and data for range types in expressions in v19.

pg_upgrade is the main use-case of this feature; it is also possible to
inject statistics, same as for the other extstat kinds.

As of this commit, ANALYZE should not be required after pg_upgrade when
the cluster being upgraded uses extended statistics, as MCV,
dependencies, expressions and ndistinct stats are all covered.  The
stats data related to range types used in expressions requires v19,
whose support has also been added.

Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=fPcci6oPyuyEZ0F4bWqAA7HzaWO+ZPptufuX5_uWt6kw@mail.gmail.com
2026-03-03 14:19:54 +09:00
Jeff Davis
11171fe1fc style: define parameterless functions as foo(void).
Change pg_icu_unicode_version() to pg_icu_unicode_version(void),
introduced by commit af2d4ca191. See commit 9b05e2ec08, which fixed
similar cases.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aaEhpwrj1FY/8/7n@ip-10-97-1-34.eu-west-3.compute.internal
2026-03-02 20:12:38 -08:00
Heikki Linnakangas
ccae90abdb Fix OldestMemberMXactId and OldestVisibleMXactId array usage
Commit ab355e3a88 changed how the OldestMemberMXactId array is
indexed. It's no longer indexed by synthetic dummyBackendId, but with
ProcNumber. The PGPROC entries for prepared xacts come after auxiliary
processes in the allProcs array, which rendered the calculation for
MaxOldestSlot and the indexes into the array incorrect.  (The
OldestVisibleMXactId array is not used for prepared xacts, and thus
never accessed with ProcNumber's greater than MaxBackends, so this
only affects the OldestMemberMXactId array.)

As a result, a prepared xact would store its value past the end of the
OldestMemberMXactId array, overflowing into the OldestVisibleMXactId
array. That could cause a transaction's row lock to appear invisible
to other backends, or other such visibility issues. With a very small
max_connections setting, the store could even go beyond the
OldestVisibleMXactId array, stomping over the first element in the
BufferDescriptor array.

To fix, calculate the array sizes more precisely, and introduce helper
functions to calculate the array indexes correctly.

Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/7acc94b0-ea82-4657-b1b0-77842cb7a60c@postgrespro.ru
Backpatch-through: 17
2026-03-02 19:19:22 +02:00
Melanie Plageman
8b9d42bf6b Save prune cycles by consistently clearing prune hints on all-visible pages
All-visible pages can't contain prunable tuples. We already clear the
prune hint (pd_prune_xid) during pruning of all-visible pages, but we
were not doing so in vacuum phase three, nor initializing it for
all-frozen pages created by COPY FREEZE, and we were not clearing it on
standbys.

Because page hints are not WAL-logged, pages on a standby carry stale
pd_prune_xid values. After promotion, that stale hint triggers
unnecessary on-access pruning.

Fix this by clearing the prune hint everywhere we currently mark a heap
page all-visible. Clearing it when setting PD_ALL_VISIBLE ensures no
extra overhead.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/CAAKRu_b-BMOyu0X-0jc_8bWNSbQ5K6JTEueayEhcQuw-OkCSKg%40mail.gmail.com
2026-03-02 11:05:59 -05:00
Michael Paquier
f68d7e7483 Remove WAL page header flag XLP_BKP_REMOVABLE
There are no known users of this flag.  The last supposed user was
pglesslog, which is the reason why this flag has been introduced in
core, based on an historical search pointing at a8d539f124.

Back in 2018, I mentioned that we may want to remove this flag, as it
has zero users in core.  More recently, Noah has pointed out that
this flag is not safe to use: XLP_BKP_REMOVABLE can be set by the WAL
writer in a lock-free fashion with runningBackups > 0, meaning that some
full-page images could be required but not logged, ultimately corrupting
backups.

Bump XLOG_PAGE_MAGIC.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/20250705001628.c3.nmisch@google.com
Discussion: https://postgr.es/m/CAEze2WhiwKSoAvfUggjDeoeY0-rz9cTpfrHcqvBMmJxv-K_5DA@mail.gmail.com
2026-03-02 14:13:05 +09:00
Michael Paquier
f7dc17aa91 Fix memory allocation size in RegisterExtensionExplainOption()
The allocations used for the static array ExplainExtensionOptionArray,
that tracks a set of ExplainExtensionOption, used "char *" instead of
ExplainExtensionOption as the memory size consumed by one element,
underestimating the memory required by half.

The initial allocation of ExplainExtensionOptionArray wants to hold 16
elements before being reallocated, and with "char *" it meant that there
was enough space only for 8 ExplainExtensionOption elements, 16 bytes
required for each element.  The backend would crash once one tries to
register a 9th EXPLAIN option.
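
The arithmetic can be checked with a small sketch.  DemoOption is a
hypothetical two-pointer stand-in for ExplainExtensionOption (its field
names are invented), and the helper functions exist only for this
illustration; they are not the backend's allocation code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: two pointers per element, 16 bytes on 64-bit. */
typedef struct DemoOption
{
    const char *option_name;
    void       *option_handler;
} DemoOption;

#define INITIAL_ELEMENTS 16

/* Buggy sizing: multiplies by the size of a pointer, not of an element. */
static size_t
buggy_alloc_bytes(void)
{
    return INITIAL_ELEMENTS * sizeof(char *);
}

/* Fixed sizing: one full element per slot. */
static size_t
fixed_alloc_bytes(void)
{
    return INITIAL_ELEMENTS * sizeof(DemoOption);
}

/* Number of whole elements that fit in a buffer of the given size. */
static size_t
elements_that_fit(size_t bytes)
{
    /* This sketch assumes the struct is exactly two pointers wide. */
    assert(sizeof(DemoOption) == 2 * sizeof(char *));
    return bytes / sizeof(DemoOption);
}
```

With the buggy formula the buffer has room for 8 elements, so writing a
9th lands past the end of the allocation; the fixed formula leaves room
for all 16.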

As far as I can see, the allocation formulas of GetExplainExtensionId()
have been copy-pasted to RegisterExtensionExplainOption(), but the
internal maths of the copy were not adjusted accordingly.

Oversight in c65bc2e1d1.

Author: Joel Jacobson <joel@compiler.org>
Discussion: https://postgr.es/m/2a4bd2f5-2a2f-409f-8ac7-110dd3fad4fc@app.fastmail.com
Backpatch-through: 18
2026-03-02 13:14:15 +09:00
Michael Paquier
3b7a6fa157 Fix set of issues with extended statistics on expressions
This commit addresses two defects regarding extended statistics on
expressions:
- When building extended statistics in lookup_var_attr_stats(), the call
to examine_attribute() did not account for the possibility of a NULL
return value.  This can happen depending on the behavior of a typanalyze
callback — for example, if the callback returns false, if no rows are
sampled, or if no statistics are computed.  In such cases, the code
attempted to build MCV, dependency, and ndistinct statistics using a
NULL pointer, incorrectly assuming valid statistics were available,
which could lead to a server crash.
- When loading extended statistics for expressions,
statext_expressions_load() did not account for NULL entries in the
pg_statistic array storing expression statistics.  Such NULL entries can
be generated when statistics collection fails for an expression, as may
occur during the final step of serialize_expr_stats().  An extended
statistics object defining N expressions requires N corresponding
elements in the pg_statistic array stored for the expressions, and some
of these elements can be NULL.  This situation is reachable when a
typanalyze callback returns true, but sets stats_valid to indicate that
no useful statistics could be computed.

While these scenarios cannot occur with in-core typanalyze callbacks, as
far as I have analyzed, they can be triggered by custom data types with
custom typanalyze implementations, at least.

No tests are added in this commit.  A follow-up commit will introduce a
test module that can be extended to cover similar edge cases if
additional issues are discovered.  This takes care of the core of the
problem.

Attribute and relation statistics already offer similar protections:
- ANALYZE detects and skips the build of invalid statistics.
- Invalid catalog data is handled defensively when loading statistics.

This issue exists since the support for extended statistics on
expressions has been added, down to v14 as of a4d75c86bf.  Backpatch
to all supported stable branches.

Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aaDrJsE1I5mrE-QF@paquier.xyz
Backpatch-through: 14
2026-03-02 09:38:37 +09:00
Tom Lane
d80b022501 Correctly calculate "MCV frequency" for a unique column.
In commit bd3e3e9e5, I over-hastily used 1 / rel->rows as the assumed
frequency of entries in a column that ANALYZE has found to be unique.
However, rel->rows is the number of table rows that are estimated to
pass the query's restriction conditions, so that we got a too-large
result if the query has selective restrictions.  What I should have
used is 1 / rel->tuples, since that is the estimated total number of
table rows.  The pre-existing code path that digs a frequency out of
the histogram produces a frequency relative to the whole table, so
surely this new alternative code path must do so as well.  Any
correction needed on the basis of selectivity must be done by the
user of the mcv_freq value.
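
A small numeric sketch shows how far apart the two formulas can be.
DemoRel is a hypothetical, heavily simplified stand-in that keeps only
the two fields discussed above, and both function names are invented
for this illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for the planner's per-relation estimates. */
typedef struct DemoRel
{
    double tuples;   /* estimated total number of rows in the table */
    double rows;     /* estimated rows passing the query's restrictions */
} DemoRel;

/* Frequency of one value in a unique column, relative to the whole table. */
static double
unique_column_freq(const DemoRel *rel)
{
    assert(rel->tuples >= 1.0);
    return 1.0 / rel->tuples;
}

/* The over-hasty variant: too large whenever restrictions are selective. */
static double
unique_column_freq_buggy(const DemoRel *rel)
{
    assert(rel->rows >= 1.0);
    return 1.0 / rel->rows;
}
```

For a table of one million rows where restrictions leave 100 rows, the
buggy variant yields 1/100 while the corrected one yields 1/1000000, a
factor of ten thousand.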

Fixing this causes all the regression test plans changed by bd3e3e9e5
to revert to what they had been, except for the first change in
join.out.  As I correctly argued in bd3e3e9e5, in that test case we
have no stats and should not risk a hash join.  Evidently I was less
correct to argue that the other changes were improvements.

Reported-by: Joel Jacobson <joel@compiler.org>
Diagnosed-by: Tender Wang <tndrwang@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/341b723c-da45-4058-9446-1514dedb17c1@app.fastmail.com
2026-03-01 12:56:55 -05:00
Peter Eisentraut
3f98862980 Fix some -Wcast-qual warnings
This fixes some warnings from -Wcast-qual that are easy to fix,
without using unconstify or the like.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/990c9117-b013-4026-aaf5-261fe2832c3d%40eisentraut.org
2026-02-27 21:57:33 +01:00
Tom Lane
65a3ff8f1b Doc: improve user docs and code comments about EXISTS(SELECT * ...).
Point out that Postgres automatically optimizes away the target list
of an EXISTS' subquery, except in weird cases such as target lists
containing set-returning functions.  Thus, both common conventions
EXISTS(SELECT * FROM ...) and EXISTS(SELECT 1 FROM ...) are
overhead-free and there's little reason to prefer one over the other.

In the code comments, mention that the SQL spec says that
EXISTS(SELECT * FROM ...) should be interpreted as EXISTS(SELECT
some-literal FROM ...), but we don't choose to do it exactly that way.

Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/9b301c70-3909-4f0f-98ca-9e3c4d142f3e@eisentraut.org
2026-02-27 15:20:16 -05:00
Tom Lane
98616ac18b Don't flatten join alias Vars that are stored within a GROUP RTE.
The RTE's groupexprs list is used for deparsing views, and for that
usage it must contain the original alias Vars; else we can get
incorrect SQL output.  But since commit 247dea89f,
parseCheckAggregates put the GROUP BY expressions through
flatten_join_alias_vars before building the RTE_GROUP RTE.
Changing the order of operations there is enough to fix it.

This patch unfortunately can do nothing for already-created views:
if they use a coding pattern that is subject to the bug, they will
deparse incorrectly and hence present a dump/reload hazard in the
future.  The only fix is to recreate the view from the original SQL.
But the trouble cases seem to be quite narrow.  AFAICT the output
was only wrong for "SELECT ... t1 LEFT JOIN t2 USING (x) GROUP BY x"
where t1.x and t2.x were not of identical data types and t1.x was
the side that required an implicit coercion.  If there was no hidden
coercion, or if the join was plain, RIGHT, or FULL, the deparsed
output was uglier than intended but not functionally wrong.

Reported-by: Swirl Smog Dowry <swirl-smog-dowry@duck.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CA+-gibjCg_vjcq3hWTM0sLs3_TUZ6Q9rkv8+pe2yJrdh4o4uoQ@mail.gmail.com
Backpatch-through: 18
2026-02-27 12:54:02 -05:00
Álvaro Herrera
a2c89835f5 Don't include proc.h in shm_mq.h
This prevents proliferation of proc.h to tons of other places; shm_mq.h
is widely included.

Discussion: https://postgr.es/m/202602261733.s2rkxezwuif6@alvherre.pgsql
2026-02-27 10:53:47 +01:00
Melanie Plageman
284925508a Remove table_scan_analyze_next_tuple unneeded parameter OldestXmin
heapam_scan_analyze_next_tuple() doesn't distinguish between dead and
recently dead tuples when counting them, so it doesn't need OldestXmin.
GetOldestNonRemovableTransactionId() isn't free, so removing it is a
win.

Looking at other table AMs implementing table_scan_analyze_next_tuple(),
we couldn't find one using OldestXmin either, so remove it from the
callback.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CALdSSPjvhGXihT_9f-GJabYU%3D_PjrFDUxYaURuTbfLyQM6TErg%40mail.gmail.com
2026-02-26 15:41:53 -05:00
Melanie Plageman
3efe58febc Simplify visibility check in heap_page_would_be_all_visible()
heap_page_would_be_all_visible() does not need to distinguish between
HEAPTUPLE_RECENTLY_DEAD and HEAPTUPLE_DEAD tuples: any tuple in a state
other than HEAPTUPLE_LIVE means the page is not all-visible and
heap_page_would_be_all_visible() returns false.

Given that, calling HeapTupleSatisfiesVacuum() is unnecessary, since it
performs extra work to distinguish between dead and recently dead tuples
using OldestXmin. Replace it with the more minimal
HeapTupleSatisfiesVacuumHorizon().

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CALdSSPjvhGXihT_9f-GJabYU%3D_PjrFDUxYaURuTbfLyQM6TErg%40mail.gmail.com
2026-02-26 15:41:45 -05:00
Jeff Davis
d942511f08 Fix memory leaks in pg_locale_icu.c.
The backport prior to 18 requires minor modification due to code
refactoring.

Discussion: https://postgr.es/m/e2b7a0a88aaadded7e2d19f42d5ab03c9e182ad8.camel@j-davis.com
Backpatch-through: 16
2026-02-26 12:15:01 -08:00
Melanie Plageman
5aea60839b Rename LVRelState VM-related logging counters
The LVRelState fields that track newly all-visible/all-frozen pages were
previously named vm_new_visible_pages, vm_new_frozen_pages, and
vm_new_visible_frozen_pages. The correct terminology is all-visible and
all-frozen; omitting “all” was open to misinterpretation, as the page
isn't visible or invisible, rather all the tuples on the page are
visible to all running and future transactions. Rename the members
accordingly.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-02-26 15:04:49 -05:00
Álvaro Herrera
7b9b620d8f Don't include latch.h in libpq/libpq.h
This reduces the inclusion footprint of latch.h a bit.

Per suggestion from Andres Freund.

Discussion: https://postgr.es/m/pap7mzhcxvuwlfdebjkh646ntyk4brtwm4dbocfpllwdccta5t@w3d7wz6mjpwv
2026-02-26 18:04:13 +01:00
Andres Freund
9d6294c09e instrumentation: Drop INSTR_TIME_SET_CURRENT_LAZY macro
This macro had exactly one user in InstrStartNode, and the caller can
instead use INSTR_TIME_IS_ZERO / INSTR_TIME_SET_CURRENT directly.

This supports a future change that intends to modify the time source being
used in the InstrStartNode case.

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53Pkx1bK1FB71_nBqYmzvSSXnp_MbE0ZDnU+baPJF6Ud2WDA@mail.gmail.com
2026-02-26 10:39:29 -05:00
Andres Freund
3218825271 instrumentation: Rename INSTR_TIME_LT macro to INSTR_TIME_GT
This was incorrectly named "LT" for "larger than" in e5a5e0a907, but
that is against existing conventions, where "LT" means "less than".
Clarify by using "GT" for "greater than" in macro name, and add a missing
comment at the top of instr_time.h to note the macro's existence.

Reported-by: Peter Smith <smithpb2250@gmail.com>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAHut%2BPut94CTpjQsqOJHdHkgJ2ZXq%2BqVSfMEcmDKLiWLW-hPfA%40mail.gmail.com
2026-02-26 10:38:59 -05:00
Fujii Masao
70f470314c Fix ProcWakeup() resetting wrong waitStart field.
Previously, when one process woke another that was waiting on a lock,
ProcWakeup() incorrectly cleared its own waitStart field (i.e.,
MyProc->waitStart) instead of that of the process being awakened.
As a result, the awakened process retained a stale lock-wait start timestamp.
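
The shape of the mistake can be sketched as follows.  DemoProc, its
wait_start field, and both functions are hypothetical simplifications;
the real ProcWakeup() operates on PGPROC and does much more than this.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a process entry with a lock-wait timestamp. */
typedef struct DemoProc
{
    int64_t wait_start;   /* 0 means "not waiting" in this sketch */
} DemoProc;

/* Buggy shape: clears the waker's own field, leaving the sleeper's stale. */
static void
wakeup_buggy(DemoProc *self, DemoProc *sleeper)
{
    (void) sleeper;
    self->wait_start = 0;
}

/* Fixed shape: clears the field of the process being awakened. */
static void
wakeup_fixed(DemoProc *self, DemoProc *sleeper)
{
    (void) self;
    sleeper->wait_start = 0;
}
```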

This did not cause user-visible issues. pg_locks.waitstart was reported as
NULL for the awakened process (i.e., when pg_locks.granted is true),
regardless of the waitStart value.

This bug was introduced by commit 46d6e5f567.

This commit fixes this by resetting the waitStart field of the process
being awakened in ProcWakeup().

Backpatch to all supported branches.

Reported-by: Chao Li <li.evan.chao@gmail.com>
Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: ji xu <thanksgreed@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/537BD852-EC61-4D25-AB55-BE8BE46D07D7@gmail.com
Backpatch-through: 14
2026-02-26 08:46:12 +09:00
Richard Guo
77c7a17a6e Fix unsafe RTE_GROUP removal in simplify_EXISTS_query
When simplify_EXISTS_query removes the GROUP BY clauses from an EXISTS
subquery, it previously deleted the RTE_GROUP RTE directly from the
subquery's range table.

This approach is dangerous because deleting an RTE from the middle of
the rtable list shifts the index of any subsequent RTE, which can
silently corrupt any Var nodes in the query tree that reference those
later relations.  (Currently, this direct removal has not caused
problems because the RTE_GROUP RTE happens to always be the last entry
in the rtable list.  However, relying on that is extremely fragile and
seems like trouble waiting to happen.)

Instead of deleting the RTE_GROUP RTE, this patch converts it in-place
to be RTE_RESULT type and clears its groupexprs list.  This preserves
the length and indexing of the rtable list, ensuring all Var
references remain intact.
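
The difference between the two approaches can be sketched with a plain
array standing in for the rtable list.  All type, field, and function
names below are hypothetical simplifications and do not correspond to
the planner's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

typedef enum DemoRteKind
{
    DEMO_RTE_RELATION,
    DEMO_RTE_GROUP,
    DEMO_RTE_RESULT
} DemoRteKind;

/* Hypothetical, heavily simplified range-table entry. */
typedef struct DemoRte
{
    DemoRteKind kind;
    int         ngroupexprs;   /* stands in for the groupexprs list */
    int         relid;         /* identifies the relation, for the demo */
} DemoRte;

/* Unsafe: deleting entry idx shifts every later entry down by one, so
 * any stored index pointing past idx now refers to the wrong entry. */
static size_t
remove_rte(DemoRte *rtable, size_t len, size_t idx)
{
    assert(idx < len);
    for (size_t i = idx; i + 1 < len; i++)
        rtable[i] = rtable[i + 1];
    return len - 1;
}

/* Safe: neutralize the entry in place so that no index changes. */
static void
neutralize_group_rte(DemoRte *rte)
{
    assert(rte->kind == DEMO_RTE_GROUP);
    rte->kind = DEMO_RTE_RESULT;
    rte->ngroupexprs = 0;
}
```

In this sketch a reference that stored index 2 still finds the same
relation after neutralize_group_rte(), whereas after remove_rte() that
relation has moved to index 1 and index 2 is out of range.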

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3472344.1771858107@sss.pgh.pa.us
Backpatch-through: 18
2026-02-25 11:13:21 +09:00
Álvaro Herrera
65707ed9af Add backtrace support for Windows using DbgHelp API
Previously, backtrace generation on Windows would return an "unsupported"
message.  With this commit, we rely on CaptureStackBackTrace() to capture
the call stack and the DbgHelp API (SymFromAddrW, SymGetLineFromAddrW64)
for symbol resolution.

Symbol handler initialization (SymInitialize) is performed once per
process and cached.  If initialization fails, the report for it is
returned as the backtrace output.  The symbol handler is cleaned up via
on_proc_exit() to release DbgHelp resources.

The implementation provides symbol names, offsets, and addresses.  When
PDB files are available, it also includes source file names and line
numbers.  Symbol names and file paths are converted from UTF-16 to the
database encoding using wchar2char(), which properly handles both UTF-8
and non-UTF-8 databases on Windows.  When symbol information is
unavailable or encoding conversion fails, it falls back to displaying raw
addresses.

The implementation uses the explicit UTF16 versions of the DbgHelp
functions (SYMBOL_INFOW, SymFromAddrW, IMAGEHLP_LINEW64,
SymGetLineFromAddrW64) rather than the generic versions.  This allows us
to rely on predictable encoding conversion, rather than using the
haphazard ANSI codepage that we'd get otherwise.

DbgHelp is apparently available on all Windows platforms we support, so
there are no version number checks.

Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Discussion: https://postgr.es/m/a692c0fe-caca-4c08-9c5d-debfd0ef2504@gmail.com
2026-02-24 17:34:56 +01:00
Peter Eisentraut
a99c6b56ff Make ALTER DOMAIN VALIDATE CONSTRAINT no-op when constraint is already validated
Currently, AlterDomainValidateConstraint will re-validate a constraint
that has already been validated, which would just waste cycles.  This
operation should be a no-op when the constraint is already validated.
This also aligns with ATExecValidateConstraint.

Author: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxG=-Dv9fPJHqkA9c-wGZ2dDOWOXSp-X-0K_G7r-DgaASw@mail.gmail.com
2026-02-24 10:58:36 +01:00
Peter Eisentraut
f80bedd52b Allow ALTER COLUMN SET EXPRESSION on virtual columns with CHECK constraints
Previously, changing the generation expression of a virtual column was
prohibited if the column was referenced by a CHECK constraint.  This
lifts that restriction.

RememberAllDependentForRebuilding within ATExecSetExpression will
rebuild all the dependent constraints; later, ATPostAlterTypeCleanup
queues the required AlterTableStmt operations for ALTER TABLE Phase 3
execution.

Overall, ALTER COLUMN SET EXPRESSION on virtual columns may require
scanning the table to re-verify any associated CHECK constraints, but
it does not require a table rewrite in ALTER TABLE Phase 3.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CACJufxH3VETr7orF5rW29GnDk3n1wWbOE3WdkHYd3iPGrQ9E_A@mail.gmail.com
2026-02-24 10:32:05 +01:00
Michael Paquier
462fe0ff62 Fix variety of typos and grammar mistakes
This commit includes a batch of fixes for various minor typos and
grammar mistakes that have been proposed to the hackers mailing list
since the beginning of January.

Similar batches are planned on a bi-monthly basis depending on the
amount received, with the next one for the end of April.
2026-02-24 13:26:37 +09:00
Nathan Bossart
d981976027 Allow pg_{read,write}_all_data to access large objects.
Since the initial goal of pg_read_all_data was to be able to run
pg_dump as a non-superuser without explicitly granting access to
every object, it follows that it should allow reading all large
objects.  For consistency, pg_write_all_data should allow writing
all large objects, too.

Author: Nitin Motiani <nitinmotiani@google.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/CAH5HC96dxAEvP78s1-JK_nDABH5c4w2MDfyx4vEWxBEfofGWsw%40mail.gmail.com
2026-02-23 14:55:21 -06:00
Tom Lane
d743545d84 Work around lgamma(NaN) bug on AIX.
lgamma(NaN) should produce NaN, but on older versions of AIX
it reports an ERANGE error.  While that's been fixed in the latest
version of libm, it'll take a while for the fix to propagate.  This
workaround is harmless even when the underlying bug does get fixed.

Discussion: https://postgr.es/m/3603369.1771877682@sss.pgh.pa.us
2026-02-23 15:30:50 -05:00
Peter Eisentraut
aca61f7e5f Use LOCKMODE in parse_relation.c/.h
There were a couple of comments in parse_relation.c

> Note: properly, lockmode should be declared LOCKMODE not int, but that
> would require importing storage/lock.h into parse_relation.h.  Since
> LOCKMODE is typedef'd as int anyway, that seems like overkill.

but actually LOCKMODE has been in storage/lockdefs.h for a while,
which is intentionally a more narrow header.  So we can include that
one in parse_relation.h and just use LOCKMODE normally.

An alternative would be to add a duplicate typedef into
parse_relation.h, but that doesn't seem necessary here.

Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/4bcd65fb-2497-484c-bb41-83cb435eb64d%40eisentraut.org
2026-02-23 21:25:55 +01:00
Tom Lane
4a1b05caa5 Restore AIX support.
The concerns that led us to remove AIX support in commit 0b16bb877
have now been alleviated:

1. IBM has stepped forward to provide support, including buildfarm
animal(s).
2. AIX 7.2 and later seem to be fine with large pg_attribute_aligned
requirements.  Since 7.1 is now EOL anyway, we can just cease to
support it.
3. Tossing xlc support overboard seems okay as well.  It's a bit
sad to drop one of the few remaining non-gcc-alike compilers, but
working around xlc's bugs and idiosyncrasies doesn't seem justified
by the theoretical portability benefits.
4. Likewise, we can stop supporting 32-bit AIX builds.  This is
not so much about whether we could build such executables as that
they're too much of a pain to manage in the field, due to limited
address space available for dynamic library loading.
5. We hit on a way to manage catalog column alignment that doesn't
require continuing developer effort (see commit ecae09725).

Hence, this commit reverts 0b16bb877 and some follow-on commits
such as e6bb491bf, except for not putting back XLC support nor
the changes related to catalog column alignment.

Some other notable changes from the way things were in v16:

Prefer unnamed POSIX semaphores on AIX, rather than the default
choice of SysV semaphores.

Include /opt/freeware/lib in -Wl,-blibpath, even when it is not
mentioned anywhere in LDFLAGS.

Remove platform-specific adjustment of MEMSET_LOOP_LIMIT; maybe
that's still the right thing, but it really ought to be re-tested.

Silence compiler warnings related to getpeereid(), wcstombs_l(),
and PAM conversation procs.

Accept "libpythonXXX.a" as an okay name for the Python shared
library (but only on AIX!).

Author: Aditya Kamath <Aditya.Kamath1@ibm.com>
Author: Srirama Kucherlapati <sriram.rk@in.ibm.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CY5PR11MB63928CC05906F27FB10D74D0FD322@CY5PR11MB6392.namprd11.prod.outlook.com
2026-02-23 13:34:22 -05:00
Nathan Bossart
bc60ee8606 Warn upon successful MD5 password authentication.
This uses the "connection warning" infrastructure introduced by
commit 1d92e0c2cc to emit a WARNING when an MD5 password is used to
authenticate.  MD5 password support was marked as deprecated in
v18 and will be removed in a future release of Postgres.  These
warnings are on by default but can be turned off via the existing
md5_password_warnings parameter.

Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Xiangyu Liang <liangxiangyu_2013@163.com>
Discussion: https://postgr.es/m/aYzeAYEbodkkg5e-%40nathan
2026-02-23 11:22:04 -06:00
Peter Eisentraut
797872f6b9 Rename validate_relation_kind()
There are three static definitions of validate_relation_kind() in the
codebase, one each in table.c, indexam.c and sequence.c, validating that
the given relation is a table, an index or a sequence respectively.
The compiler knows which definition to use where because they are static.
But this could be confusing to a reader. Rename these functions so that
their names reflect the kind of relation they are validating. While at
it, also update the comments in table.c to clarify the definition of
table-like relkinds so that we don't have to maintain the exclusion list
as the set of relkinds undergoes changes.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org
2026-02-23 17:38:06 +01:00
Peter Eisentraut
d7be57ad85 Flip logic in table validate_relation_kind
Instead of checking which relkinds it shouldn't be, explicitly list
the ones we accept.  This is used to check which relkinds are accepted
in table_open() and related functions.  Before this change, figuring
that out was always a few steps too complicated.  This also makes
changes for new relkinds more explicit instead of accidental.
Finally, this makes this more aligned with the functions of the same
name in src/backend/access/index/indexam.c and
src/backend/access/sequence/sequence.c.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org
2026-02-23 17:32:07 +01:00
Andrew Dunstan
b380a56a3f Disallow CR and LF in database, role, and tablespace names
Previously, these characters could cause problems when passed through
shell commands, and were flagged with a comment in string_utils.c
suggesting they be rejected in a future major release.

The affected commands are CREATE DATABASE, CREATE ROLE, CREATE TABLESPACE,
ALTER DATABASE RENAME, ALTER ROLE RENAME, and ALTER TABLESPACE RENAME.

Also add a pg_upgrade check to detect these invalid names in clusters
being upgraded from pre-v19 versions, producing a report file listing
any offending objects that must be renamed before upgrading.

Tests have been modified accordingly.

Author: Mahendra Singh Thalor <mahi6run@gmail.com>
Reviewed-By: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-By: Andrew Dunstan <andrew@dunslane.net>
Reviewed-By: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-By: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-By: Srinath Reddy <srinath2133@gmail.com>

Discussion: https://postgr.es/m/CAKYtNApkOi4FY0S7+3jpTqnHVyyZ6Tbzhtbah-NBbY-mGsiKAQ@mail.gmail.com
2026-02-23 11:19:13 -05:00
Nathan Bossart
f33b8793fd Make use of pg_popcount() in more places.
This replaces some loops over word-length popcount functions with
calls to pg_popcount().  Since pg_popcount() may use a function
pointer for inputs with sizes >= a Bitmapset word, this produces a
small regression for the common one-word case in bms_num_members().
To deal with that, this commit adds an inlined fast-path for that
case.  This fast-path could arguably go in pg_popcount() itself
(with an appropriate alignment check), but that is left for future
study.
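
A rough sketch of the fast-path idea, under assumptions:
popcount_bytes() stands in for a general, out-of-line routine,
popcount_word() for an inlinable one-word count, and count_members()
for a caller such as bms_num_members().  None of these names or
bodies are taken from the PostgreSQL source.

```c
#include <stddef.h>
#include <stdint.h>

/* General routine: counts set bits over an arbitrary byte range. */
static uint64_t
popcount_bytes(const unsigned char *buf, size_t nbytes)
{
    uint64_t total = 0;

    for (size_t i = 0; i < nbytes; i++)
    {
        unsigned char b = buf[i];

        while (b)
        {
            b &= (unsigned char) (b - 1);   /* clear lowest set bit */
            total++;
        }
    }
    return total;
}

/* Cheap inlinable count for a single word. */
static inline int
popcount_word(uint64_t w)
{
    int n = 0;

    while (w)
    {
        w &= w - 1;             /* clear lowest set bit */
        n++;
    }
    return n;
}

/* Take the inline path for the common one-word case. */
static int
count_members(const uint64_t *words, size_t nwords)
{
    if (nwords == 1)
        return popcount_word(words[0]);
    return (int) popcount_bytes((const unsigned char *) words,
                                nwords * sizeof(uint64_t));
}
```

The point of the branch is only to avoid the call overhead of the
general routine when the set fits in one word; both paths return the
same count.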

Suggested-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CANWCAZY7R%2Biy%2Br9YM_sySNydHzNqUirx1xk0tB3ej5HO62GdgQ%40mail.gmail.com
2026-02-23 09:26:00 -06:00
Peter Eisentraut
55f3859329 Change error message for sequence validate_relation_kind()
We can just say "... is not a sequence" instead of the more
complicated variant from before, which was probably copied from
src/backend/access/table/table.c.

Fix a typo in a comment in passing.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org
2026-02-23 10:56:54 +01:00
Peter Eisentraut
3a63b76571 Fix additional fallthrough warnings from clang
Clang warns if falling through to a case or default label that is
immediately followed by break, but GCC does
not (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91432).  (MSVC also
warns about the equivalent code in C++.)

This is in preparation for enabling fallthrough warnings on Clang.

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/76a8efcd-925a-4eaf-bdd1-d972cd1a32ff%40eisentraut.org
2026-02-23 07:40:19 +01:00
Álvaro Herrera
0eeffd31bf Avoid name collision with NOT NULL constraints
If a CREATE TABLE statement defined a constraint whose name is identical
to the name generated for a NOT NULL constraint, we'd throw an
(unnecessary) unique key violation error on
pg_constraint_conrelid_contypid_conname_index: this can easily be
avoided by choosing a different name for the NOT NULL constraint.

Fix by passing the constraint names already created by
AddRelationNewConstraints() to AddRelationNotNullConstraints(), so that
the latter can avoid name collisions with them.

Bug: #19393
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reported-by: Hüseyin Demir <huseyin.d3r@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/19393-6a82427485a744cf@postgresql.org
2026-02-21 12:22:08 +01:00
Heikki Linnakangas
36bbcd5be3 Split PGPROC 'links' field into two, for clarity
The field was mainly used for the position in a LOCK's wait queue, but
also as the position in the freelist when the PGPROC entry was not
in use. The reuse saves some memory at the expense of readability,
which seems like a bad tradeoff. If we wanted to make the struct
smaller there are other things we could do, but we're actually just
discussing adding padding to the struct for performance reasons.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/3dd6f70c-b94d-4428-8e75-74a7136396be@iki.fi
2026-02-20 22:34:42 +02:00
Nathan Bossart
dc592a4155 Speedup COPY FROM with additional function inlining.
Following the example set by commit 58a359e585, we can squeeze out
a little more performance from COPY FROM (FORMAT {text,csv}) by
inlining CopyReadLineText() and passing the is_csv parameter as a
constant.  This allows the compiler to emit specialized code with
fewer branches.
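
A small sketch of the specialization technique, not of
CopyReadLineText() itself: find_line_end() and its two wrappers are
hypothetical, and the quoting rule is deliberately simplified.
Because each wrapper passes is_csv as a constant to an inline
function, a compiler can drop the CSV-only branches from the text
variant.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the offset of the first line-ending newline, or len if none.
 * In CSV mode, newlines inside double quotes do not end the line.
 */
static inline size_t
find_line_end(const char *buf, size_t len, bool is_csv)
{
    bool in_quote = false;

    for (size_t i = 0; i < len; i++)
    {
        char c = buf[i];

        if (is_csv && c == '"')
            in_quote = !in_quote;
        else if (c == '\n' && !(is_csv && in_quote))
            return i;
    }
    return len;
}

/* Text-format entry point: is_csv is a compile-time constant. */
static size_t
find_line_end_text(const char *buf, size_t len)
{
    return find_line_end(buf, len, false);
}

/* CSV-format entry point: is_csv is a compile-time constant. */
static size_t
find_line_end_csv(const char *buf, size_t len)
{
    return find_line_end(buf, len, true);
}
```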

This is preparatory work for a proposed follow-up commit that would
further optimize this code with SIMD instructions.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Ayoub Kazar <ma_kazar@esi.dz>
Tested-by: Manni Wood <manni.wood@enterprisedb.com>
Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com
2026-02-20 12:07:27 -06:00
Richard Guo
691977d370 Fix computation of varnullingrels when translating appendrel Var
When adjust_appendrel_attrs translates a Var referencing a parent
relation into a Var referencing a child relation, it propagates
varnullingrels from the parent Var to the translated Var.  Previously,
the code simply overwrote the translated Var's varnullingrels with
those of the parent.

This was incorrect because the translated Var might already possess
nonempty varnullingrels.  This happens, for example, when a LATERAL
subquery within a UNION ALL references a Var from the nullable side of
an outer join.  In such cases, the translated Var correctly carries
the outer join's relid in its varnullingrels.  Overwriting these bits
with the parent Var's set caused the planner to lose track of the fact
that the Var could be nulled by that outer join.

In the reported case, because the underlying column had a NOT NULL
constraint, the planner incorrectly deduced that the Var could never
be NULL and discarded essential IS NOT NULL filters.  This led to
incorrect query results where NULL rows were returned instead of being
filtered out.

To fix, use bms_add_members to merge the parent Var's varnullingrels
into the translated Var's existing set, preserving both sources of
nullability.
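
The difference between overwriting and merging can be sketched with
a plain bitmask.  RelidSet and both functions are hypothetical
simplifications; the real code operates on Bitmapsets via
bms_add_members.

```c
#include <stdint.h>

/* Hypothetical set type: bit i set means relid i is a member. */
typedef uint64_t RelidSet;

/* Old behavior: the child's own nulling rels are discarded. */
static RelidSet
overwrite_nulling(RelidSet child, RelidSet parent)
{
    (void) child;
    return parent;
}

/* Fixed behavior: the union keeps both sources of nullability. */
static RelidSet
merge_nulling(RelidSet child, RelidSet parent)
{
    return child | parent;
}
```

If the child carries relid 3 and the parent carries relid 5, the
overwrite yields only {5}, while the merge yields {3, 5}.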

Back-patch to v16.  Although the reported case does not seem to cause
problems in v16, leaving incorrect varnullingrels in the tree seems
like a trap for the unwary.

Bug: #19412
Reported-by: Sergey Shinderuk <s.shinderuk@postgrespro.ru>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19412-1d0318089b86859e@postgresql.org
Backpatch-through: 16
2026-02-20 17:57:53 +09:00
Michael Paquier
0dc22fff64 Fix constant in error message for recovery_target_timeline
The intention was to use PG_UINT32_MAX, not UINT_MAX.  Let's be
consistent and use the same constant.

Thinko in fd7d7b7191.

Author: David Steele <david@pgbackrest.org>
Discussion: https://postgr.es/m/aZfXO97jSQaTTlfD@paquier.xyz
2026-02-20 16:17:57 +09:00
Amit Kapila
9842e8aca0 Avoid including worker_internal.h in pgstat.h.
pgstat.h is a widely included header. Including worker_internal.h there is
unnecessary and creates tight coupling. By refactoring
pgstat_report_subscription_error() to fetch the required
LogicalRepWorkerType internally rather than receiving it as an argument,
we can eliminate the need for the internal header.

Reported-by: Andres Freund <andres@anarazel.de>
Author: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/aY-UE-4t7FiYgH3t@alap3.anarazel.de
2026-02-20 09:26:33 +05:30
Nathan Bossart
ba401828c1 Remove SpinLockFree() and S_LOCK_FREE().
S_LOCK_FREE() is used by the test program in s_lock.c, but nobody
has voiced concerns about losing some coverage there.
SpinLockFree() appears to have been unused since it was introduced
by commit 499abb0c0f.  There was agreement to remove these in 2020,
but it never happened.  Since we still have agreement for removal
in 2026, let's do that now.

Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/aZX2oUcKf7IzHnnK%40nathan
Discussion: https://postgr.es/m/20200608225338.m5zho424w6lpwb2d%40alap3.anarazel.de
2026-02-19 16:19:41 -06:00
Robert Haas
6e466e1e83 Fix add_partial_path interaction with disabled_nodes
Commit e222534679 adjusted the logic in
add_path() to keep the path list sorted by disabled_nodes and then by
total_cost, but failed to make the corresponding adjustment to
add_partial_path. As a result, add_partial_path might sort the path list
just by total cost, which could lead to later planner misbehavior.
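
The intended ordering can be sketched as a two-key comparison.
PathSketch and compare_paths() are hypothetical; the real add_path()
and add_partial_path() logic involves more than this (for example,
fuzzy cost comparisons), which the sketch does not model.

```c
/* Hypothetical reduction of a path to the two sort keys. */
typedef struct
{
    int    disabled_nodes;
    double total_cost;
} PathSketch;

/*
 * Order by disabled_nodes first, then by total_cost.
 * Returns a negative value, zero, or a positive value, like strcmp().
 */
static int
compare_paths(const PathSketch *a, const PathSketch *b)
{
    if (a->disabled_nodes != b->disabled_nodes)
        return (a->disabled_nodes < b->disabled_nodes) ? -1 : 1;
    if (a->total_cost < b->total_cost)
        return -1;
    if (a->total_cost > b->total_cost)
        return 1;
    return 0;
}
```

Under this ordering a path with no disabled nodes sorts ahead of a
cheaper path that uses one, which a cost-only comparison would get
backwards.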

In principle, this should be back-patched to v18, but we are typically
reluctant to back-patch planner fixes for fear of destabilizing working
installations, and it is unclear to me that this has sufficiently
serious consequences to justify an exception, so for now, no back-patch.

Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: http://postgr.es/m/CAMbWs4-mO3jMK4t_LgcJ+7Eo=NmGgkxettgRaVbJzZvVZ1koMA@mail.gmail.com
2026-02-19 13:46:10 -05:00
Álvaro Herrera
fc3896c786 Add translator comment
Otherwise the message is not very clear.

Backpatch-through: 18
2026-02-19 17:11:04 +01:00
Tom Lane
2f248ad573 Remove no-longer-useful markers in pg_hba.conf.sample.
The source version of pg_hba.conf.sample contains
@remove-line-for-nolocal@ markers that indicate which lines should
be deleted for an installation that doesn't HAVE_UNIX_SOCKETS.
We no longer support that case, and since commit f55808828 all
that initdb is doing is unconditionally removing the markers.
We might as well remove the markers from the source version and
drop the removal code, which is unintelligible now anyway.

This will not of course save any noticeable number of cycles
in initdb, but it might save some confusion for future
developers looking at pg_hba.conf.sample.  It also reduces the
number of distinct cases that replace_token() has to support,
possibly allowing some tightening of that function.

Discussion: https://postgr.es/m/2287786.1771458157@sss.pgh.pa.us
2026-02-19 11:09:00 -05:00
Fujii Masao
fb80f388f4 Add per-subscription wal_receiver_timeout setting.
This commit allows setting wal_receiver_timeout per subscription
using the CREATE SUBSCRIPTION and ALTER SUBSCRIPTION commands.
The value is stored in the subwalrcvtimeout column of the pg_subscription
catalog.

When set, this value overrides the global wal_receiver_timeout for
the subscription's apply worker. The default is -1, which means the
global setting (from the server configuration, command line, role,
or database) remains in effect.

This feature is useful for configuring different timeout values for
each subscription, especially when connecting to multiple publisher
servers, to improve failure detection.

Bump catalog version.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/a1414b64-bf58-43a6-8494-9704975a41e9@oss.nttdata.com
2026-02-20 01:00:09 +09:00
Fujii Masao
8a6af3ad08 Make GUC wal_receiver_timeout user-settable.
When multiple subscribers connect to different publisher servers,
it can be useful to set different wal_receiver_timeout values for
each connection to better detect failures. However, previously
this wasn't possible, which limited flexibility in managing subscriptions.

This commit changes wal_receiver_timeout to be user-settable,
allowing different values to be assigned using ALTER ROLE SET for
each subscription owner. This effectively enables per-subscription
configuration.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/a1414b64-bf58-43a6-8494-9704975a41e9@oss.nttdata.com
2026-02-20 00:52:43 +09:00
Fujii Masao
5b93a5987b Log checkpoint request flags in checkpoint completion messages.
Checkpoint completion log messages include more detail than checkpoint
start messages, but previously omitted the checkpoint request flags,
which were only logged at checkpoint start. As a result, users had to
correlate completion messages with earlier start messages to see
the full context.

This commit includes the checkpoint request flags in the checkpoint
completion log message as well. This duplicates some information,
but makes the completion message self-contained and easier to interpret.

Author: Soumya S Murali <soumyamurali.work@gmail.com>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Reviewed-by: Yuan Li <carol.li2025@outlook.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAMtXxw9tPwV=NBv5S9GZXMSKPeKv5f9hRhSjZ8__oLsoS5jcuA@mail.gmail.com
2026-02-19 23:55:12 +09:00
Peter Eisentraut
8354b9d6b6 Use fallthrough attribute instead of comment
Instead of using comments to mark fallthrough switch cases, use the
fallthrough attribute.  This will (in the future, not here) allow
supporting other compilers besides gcc.  The commenting convention is
only supported by gcc, the attribute is supported by clang, and in the
fullness of time the C23 standard attribute would allow supporting
other compilers as well.

Right now, we package the attribute into a macro called
pg_fallthrough.  This commit defines that macro and replaces the
existing comments with that macro invocation.

We also raise the level of the gcc -Wimplicit-fallthrough= option from
3 to 5 to enforce the use of the attribute.
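
For illustration only, a macro of this kind might look like the
sketch below.  The name sketch_fallthrough and its definition are
assumptions, not the actual definition of pg_fallthrough; the
fallback to a no-op for compilers without the attribute is also an
assumption.

```c
/* Use the GNU-style attribute where the compiler reports support. */
#if defined(__has_attribute)
#if __has_attribute(fallthrough)
#define sketch_fallthrough __attribute__((fallthrough))
#endif
#endif

/* Otherwise expand to a harmless no-op statement. */
#ifndef sketch_fallthrough
#define sketch_fallthrough ((void) 0)
#endif

/* Count how many stages run when starting at the given stage. */
static int
steps_from(int stage)
{
    int steps = 0;

    switch (stage)
    {
        case 0:
            steps++;
            sketch_fallthrough;
        case 1:
            steps++;
            sketch_fallthrough;
        case 2:
            steps++;
            break;
        default:
            break;
    }
    return steps;
}
```

Unlike a comment, the macro expands to something the compiler parses,
so a missing or misplaced annotation can be diagnosed.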

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/76a8efcd-925a-4eaf-bdd1-d972cd1a32ff%40eisentraut.org
2026-02-19 08:51:12 +01:00
Peter Eisentraut
0c3fbb3fef Remove useless fallthrough annotation
A fallthrough attribute after the last case is a constraint violation
in C23, and clang warns about it (not about this comment, but if we
changed it to an attribute).  Remove it.  (There was apparently never
anything after this to fall through to, even in the first commit
da07a1e8565.)

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/76a8efcd-925a-4eaf-bdd1-d972cd1a32ff%40eisentraut.org
2026-02-19 08:50:58 +01:00
Michael Paquier
21e323e941 Sanitize some WAL-logging buffer handling in GIN and GiST code
As transam's README documents, the general order of actions recommended
when WAL-logging a buffer is to unlock and unpin buffers after leaving a
critical section.  This pattern was not being followed by some code
paths of GIN and GiST, adjusted in this commit, where buffers were
either unlocked or unpinned inside a critical section.  Based on my
analysis of each code path updated here, there is no reason to not
follow the recommended unlocking/unpin pattern done outside of a
critical section.

These inconsistencies are rather old, coming mainly from ecaa4708e5
and ff301d6e69.  The guidelines in the README predate these commits,
being introduced in 6d61cdec07.

Author: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CALdSSPgBPnpNNzxv0Y+_GNFzW6PmzRZYh+_hpf06Y1N2zLhZaQ@mail.gmail.com
2026-02-19 15:59:20 +09:00
Tom Lane
759b03b24c Simplify creation of built-in functions with default arguments.
Up to now, to create such a function, one had to make a pg_proc.dat
entry and then overwrite it with a CREATE OR REPLACE command in
system_functions.sql.  That's error-prone (cf. bug #19409) and
results in leaving dead rows in the initial contents of pg_proc.

Manual maintenance of pg_node_tree strings seems entirely impractical,
and parsing expressions during bootstrap would be extremely difficult
as well.  But Andres Freund observed that all the current use-cases
are simple constants, and building a Const node is well within the
capabilities of bootstrap mode.  So this patch invents a special case:
if bootstrap mode is asked to ingest a non-null value for
pg_proc.proargdefaults (which would otherwise fail in
pg_node_tree_in), it parses the value as an array literal and then
feeds the element strings to the input functions for the corresponding
parameter types.  Then we can build a suitable pg_node_tree string
with just a few more lines of code.

This allows removing all the system_functions.sql entries that are
just there to set up default arguments, replacing them with
proargdefaults fields in pg_proc.dat entries.  The old technique
remains available in case someone needs a non-constant default.

The initial contents of pg_proc are demonstrably the same after
this patch, except that (1) json_strip_nulls and jsonb_strip_nulls
now have the correct provolatile setting, as per bug #19409;
(2) pg_terminate_backend, make_interval, and drandom_normal
now have defaults that don't include a type coercion, which is
how they should have been all along.

In passing, remove some unused entries from bootstrap.c's TypInfo[]
array.  I had to add some new ones because we'll now need an entry for
each default-possessing system function parameter, but we shouldn't
carry more than we need there; it's just a maintenance gotcha.

Bug: #19409
Reported-by: Lucio Chiessi <lucio.chiessi@trustly.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Author: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net
Discussion: https://postgr.es/m/19409-e16cd2605e59a4af@postgresql.org
2026-02-18 14:14:44 -05:00
Heikki Linnakangas
d62dca3b29 Use standard die() handler for SIGTERM in bgworkers
The previous default bgworker_die() signal would exit with elog(FATAL)
directly from the signal handler. That could cause deadlocks or
crashes if the signal handler runs while we're e.g. holding a spinlock
or in the middle of a memory allocation.

All the built-in background workers overrode that to use the normal
die() handler and CHECK_FOR_INTERRUPTS(). Let's make that the default
for all background workers. Some extensions relying on the old
behavior might need to adapt, but the new default is much safer and is
the right thing to do for most background workers.
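
The safer pattern can be sketched in plain C: the handler only
records the request, and the main loop acts on it at a point where
doing so is safe.  All names here are hypothetical; this is an
analogy for die() plus CHECK_FOR_INTERRUPTS(), not their
implementation.

```c
#include <signal.h>

/* Set by the handler, read by the main loop. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Do nothing but record the request; no logging, no allocation. */
static void
handle_sigterm(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

static void
install_sigterm_handler(void)
{
    signal(SIGTERM, handle_sigterm);
}

/*
 * Run up to max_iterations units of work, checking for a pending
 * shutdown request before each unit.  Returns the units completed.
 */
static int
run_worker(int max_iterations)
{
    int completed = 0;

    for (int i = 0; i < max_iterations; i++)
    {
        if (shutdown_requested)
            break;              /* safe point: no locks held here */
        completed++;
    }
    return completed;
}
```

An extension worker that previously relied on exiting directly from
the handler would instead need to reach such a check regularly in
its main loop.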

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/5238fe45-e486-4c62-a7f3-c7d8d416e812@iki.fi
2026-02-18 19:59:34 +02:00
Michael Paquier
ee642cccc4 Switch SysCacheIdentifier to a typedef enum
The main purpose of this change is to allow an ABI checker to understand
when the list of SysCacheIdentifier changes, by switching all the
routine declarations that relied on a signed integer for a syscache ID
to this new type.  This is going to be useful in the long-term for
versions newer than v19 so that we will be able to check when the list of
values in SysCacheIdentifier is updated in a non-ABI compliant fashion.

Most of the changes of this commit are due to the new definition of
SyscacheCallbackFunction, where a SysCacheIdentifier is now required for
the syscache ID.  It is a mechanical change, still slightly invasive.

There are more areas in the tree that could be improved with an ABI
checker in mind; this takes care of only one area.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/289125.1770913057@sss.pgh.pa.us
2026-02-18 09:58:38 +09:00
Michael Paquier
c06b5b99bb Add concept of invalid value to SysCacheIdentifier
This commit tweaks the generation of the syscache IDs for the enum
SysCacheIdentifier to now include an invalid value, with -1 assigned as
value.  The concept of an invalid syscache ID exists when handling
lookups of an ObjectAddress, based on their set of properties in
ObjectPropertyType.  -1 is used for the case where an object type has no
option for a syscache lookup.

This has been found as independently useful while discussing a switch of
SysCacheIdentifier to a typedef, as we already have places that want to
know about the concept of an invalid value when dealing with
ObjectAddresses.

Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://postgr.es/m/aZQRnmp9nVjtxAHS@paquier.xyz
2026-02-18 09:25:52 +09:00
Michael Paquier
f7df12a66c Fix one-off issue with cache ID in objectaddress.c
get_catalog_object_by_oid_extended() has been doing a syscache lookup
when given a cache ID strictly higher than 0, which is wrong because the
first valid value of SysCacheIdentifier is 0.

This issue had no consequences, as the first value assigned in the
enum SysCacheIdentifier is AGGFNOID, which is not used in the object
type properties listed in objectaddress.c.  Even if an ID of 0 was
hypothetically given, the code would still work with a less efficient
heap-or-index scan.
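
The boundary condition can be sketched as follows.  The enum values
and both predicates are hypothetical, and the sketch assumes the fix
amounts to treating 0 as a valid ID; the message above does not spell
out the exact new test.

```c
#include <stdbool.h>

/* Hypothetical IDs: the first valid one is 0, as in the real enum. */
enum
{
    SKETCH_CACHE_NONE = -1,
    SKETCH_CACHE_FIRST = 0,
    SKETCH_CACHE_SECOND = 1
};

/* Old test: wrongly skips the cache for the first valid ID. */
static bool
use_cache_old(int cache_id)
{
    return cache_id > 0;
}

/* Assumed fixed test: every non-negative ID is valid. */
static bool
use_cache_fixed(int cache_id)
{
    return cache_id >= 0;
}
```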

Discussion: https://postgr.es/m/aZTr_R6JGmqokUBb@paquier.xyz
2026-02-18 08:47:58 +09:00
Álvaro Herrera
b7271aa1d7 Use a bitmask for ExecInsertIndexTuples options
... instead of passing a bunch of separate booleans.

Also, rearrange the argument list in a hopefully more sensible order.
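
As a generic illustration of the technique, a set of boolean
arguments can be folded into one flags word.  The flag names and the
helper below are hypothetical and are not the options actually
defined for ExecInsertIndexTuples.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical option bits; each occupies a distinct bit. */
#define SKETCH_OPT_UPDATE         (1u << 0)
#define SKETCH_OPT_NO_DUP_ERROR   (1u << 1)
#define SKETCH_OPT_ONLY_SUMMARIZE (1u << 2)

/* Test whether a given option bit is present in the mask. */
static bool
option_is_set(uint32_t options, uint32_t flag)
{
    return (options & flag) != 0;
}
```

A call site then reads as
f(..., SKETCH_OPT_UPDATE | SKETCH_OPT_NO_DUP_ERROR) rather than
f(..., true, true, false), which makes each option self-describing.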

Discussion: https://postgr.es/m/202602111846.xpvuccb3inbx@alvherre.pgsql
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> (older version)
2026-02-17 17:59:45 +01:00
Álvaro Herrera
661237056b Fix memory leak in new GUC check_hook
Commit 38e0190ced forgot to pfree() an allocation (freed in other
places of the same function) in only one of several spots in
check_log_min_messages().  Per Coverity.  Add that.

While at it, avoid open-coding guc_strdup().  The new coding does a
strlen() that wasn't there before, but I doubt it's measurable.
2026-02-17 16:38:24 +01:00
Heikki Linnakangas
a92b809f9d Ignore SIGINT in walwriter and walsummarizer
Previously, SIGINT was treated the same as SIGTERM in walwriter and
walsummarizer. That decision goes back to when the walwriter process
was introduced (commit ad4295728e), and was later copied to
walsummarizer. It was a pretty arbitrary decision back then, and we
haven't adopted that convention in all the other processes that have
been introduced later.

Summary of how other processes respond to SIGINT:
- Autovacuum launcher: Cancel the current iteration of launching
- bgworker: Ignore (unless connected to a database)
- checkpointer: Request shutdown checkpoint
- bgwriter: Ignore
- pgarch: Ignore
- startup process: Ignore
- walreceiver: Ignore
- IO worker: die()

IO workers are a notable exception in that they exit on SIGINT, and
there's a documented reason for that: IO workers ignore SIGTERM, so
SIGINT provides a way to manually kill them. (They do respond to
SIGUSR2, though, like all the other processes that we don't want to
exit immediately on SIGTERM on operating system shutdown.)

To make this a little more consistent, ignore SIGINT in walwriter and
walsummarizer. They have no "query" to cancel, and they react to
SIGTERM just fine.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/818bafaf-1e77-4c78-8037-d7120878d87c@iki.fi
2026-02-17 17:18:31 +02:00
Noah Misch
8cef93d8a5 Suppress new "may be used uninitialized" warning.
Various buildfarm members, having compilers like gcc 8.5 and 6.3, fail
to deduce that text_substring() variable "E" is initialized if
slice_size!=-1.  This suppression approach quiets gcc 8.5; I did not
reproduce the warning elsewhere.  Back-patch to v14, like commit
9f4fd119b2.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1157953.1771266105@sss.pgh.pa.us
Backpatch-through: 14
2026-02-16 18:04:58 -08:00
Peter Eisentraut
d50c86e743 Change remaining StaticAssertStmt() to StaticAssertDecl()
This completes the work started by commit 75f49221c2.

In basebackup.c, changing the StaticAssertStmt to StaticAssertDecl
results in having the same StaticAssertDecl() in 2 functions.  So, it
makes more sense to move it to file scope instead.

Also, as it depends on some computations based on 2 tar blocks, define
TAR_NUM_TERMINATION_BLOCKS.
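
As an illustration of a file-scope assertion, here is a sketch using the
C11 _Static_assert keyword directly rather than the StaticAssertDecl()
macro.  All SKETCH_* names and the buffer size are hypothetical; only the
512-byte tar block size and the count of two termination blocks are taken
as given.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TAR_BLOCK_SIZE				512
#define SKETCH_TAR_NUM_TERMINATION_BLOCKS	2
#define SKETCH_SINK_BUFFER_LENGTH			32768	/* assumed value */

/*
 * Declared once at file scope, so every function in the file relies on the
 * same compile-time check instead of repeating it.
 */
_Static_assert(SKETCH_TAR_NUM_TERMINATION_BLOCKS * SKETCH_TAR_BLOCK_SIZE
			   <= SKETCH_SINK_BUFFER_LENGTH,
			   "buffer too small for the tar termination blocks");

/* Number of zero bytes that terminate a tar archive in this sketch. */
static size_t
sketch_termination_bytes(void)
{
	return (size_t) SKETCH_TAR_NUM_TERMINATION_BLOCKS * SKETCH_TAR_BLOCK_SIZE;
}
```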

In deadlock.c, change the StaticAssertStmt to StaticAssertDecl and
keep it in the function scope.  Add new braces to avoid warning from
-Wdeclaration-after-statement.

In aset.c, change the StaticAssertStmt to StaticAssertDecl and move it
to file scope.

Finally, update the comments in c.h a bit.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/aYH6ii46AvGVCB84%40ip-10-97-1-34.eu-west-3.compute.internal
2026-02-16 09:22:43 +01:00
Fujii Masao
351265a6c7 Remove recovery.signal at recovery end when both signal files are present.
When both standby.signal and recovery.signal are present, standby.signal
takes precedence and the server runs in standby mode. Previously,
in this case, recovery.signal was not removed at the end of standby mode
(i.e., on promotion) or at the end of archive recovery, while standby.signal
was removed. As a result, a leftover recovery.signal could cause
a subsequent restart to enter archive recovery unexpectedly, potentially
preventing the server from starting. This behavior was surprising and
confusing to users.

This commit fixes the issue by updating the recovery code to remove
recovery.signal alongside standby.signal when both files are present and
recovery completes.

Because this code path is particularly sensitive and changes in recovery
behavior can be risky for stable branches, this change is applied only to
the master branch.

Reported-by: Nikolay Samokhvalov <nik@postgres.ai>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: David Steele <david@pgbackrest.org>
Discussion: https://postgr.es/m/CAM527d8PVAQFLt_ndTXE19F-XpDZui861882L0rLY3YihQB8qA@mail.gmail.com
2026-02-16 13:57:38 +09:00
Noah Misch
9f4fd119b2 Fix SUBSTRING() for toasted multibyte characters.
Commit 1e7fe06c10 changed
pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an
incomplete character.  Most callers want that.  text_substring() does
not.  It detoasts the most bytes it could possibly need to get the
requested number of characters.  For example, to extract up to 2 chars
from UTF8, it needs to detoast 8 bytes.  In a string of 3-byte UTF8
chars, 8 bytes spans 2 complete chars and 1 partial char.

Fix this by replacing this pg_mbstrlen_with_len() call with a string
traversal that differs by stopping upon finding as many chars as the
substring could need.  This also makes SUBSTRING() stop raising an
encoding error if the incomplete char is past the end of the substring.
This is consistent with the general philosophy of the above commit,
which was to raise errors on a just-in-time basis.  Before the above
commit, SUBSTRING() never raised an encoding error.
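
The bounded traversal can be sketched as below.  This is a simplified
stand-in, not the text_substring() code: the function names are
hypothetical, only UTF8 is handled, lead bytes are not validated, and an
incomplete trailing character is silently ignored rather than reported.

```c
#include <assert.h>

/* Byte length of a UTF8 sequence, judged from its first byte only. */
static int
sketch_utf8_len(unsigned char first)
{
	if (first < 0x80)
		return 1;
	if ((first & 0xE0) == 0xC0)
		return 2;
	if ((first & 0xF0) == 0xE0)
		return 3;
	if ((first & 0xF8) == 0xF0)
		return 4;
	return 1;					/* invalid lead byte: step over one byte */
}

/* Most bytes that "nchars" UTF8 characters could occupy (4 bytes each). */
static int
sketch_max_detoast_bytes(int nchars)
{
	return nchars * 4;
}

/*
 * Count complete characters in buf[0..len), stopping once "limit" have been
 * found.  A partial character at the end of the slice is never examined as
 * long as the limit is reached first.  *bytes_used gets the bytes consumed.
 */
static int
sketch_count_chars(const unsigned char *buf, int len, int limit,
				   int *bytes_used)
{
	int			nchars = 0;
	int			pos = 0;

	assert(len >= 0 && limit >= 0);
	while (pos < len && nchars < limit)
	{
		int			l = sketch_utf8_len(buf[pos]);

		if (pos + l > len)
			break;				/* incomplete trailing character */
		pos += l;
		nchars++;
	}
	*bytes_used = pos;
	return nchars;
}
```

For the example in the message, an 8-byte slice of 3-byte characters yields
two characters in six bytes, and the two trailing bytes are left alone.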

SUBSTRING() has long been detoasting enough for one more char than
needed, because it did not distinguish exclusive and inclusive end
position.  For avoidance of doubt, stop detoasting extra.

Back-patch to v14, like the above commit.  For applications using
SUBSTRING() on non-ASCII column values, consider applying this to your
copy of any of the February 12, 2026 releases.

Reported-by: SATŌ Kentarō <ranvis@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Bug: #19406
Discussion: https://postgr.es/m/19406-9867fddddd724fca@postgresql.org
Backpatch-through: 14
2026-02-14 12:16:16 -08:00
Noah Misch
4644f8b23b pg_mblen_range, pg_mblen_with_len: Valgrind after encoding ereport.
The prior order caused spurious Valgrind errors.  They're spurious
because the ereport(ERROR) non-local exit discards the pointer in
question.  pg_mblen_cstr() ordered the checks correctly, but these other
two did not.  Back-patch to v14, like commit
1e7fe06c10.

Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20260214053821.fa.noahmisch@microsoft.com
Backpatch-through: 14
2026-02-14 12:16:16 -08:00
John Naylor
ef3c3cf6d0 Perform radix sort on SortTuples with pass-by-value Datums
Radix sort can be much faster than quicksort, but for our purposes it
is limited to sequences of unsigned bytes. To make tuples with other
types amenable to this technique, several features of tuple comparison
must be accounted for, i.e. the sort key must be "normalized":

1. Signedness -- It's possible to modify a signed integer such that
it can be compared as unsigned. For example, a signed char has range
-128 to 127. If we cast that to unsigned char and add 128, the range
of values becomes 0 to 255 while preserving order.

2. Direction -- SQL allows specification of ASC or DESC. The
descending case is easily handled by taking the complement of the
unsigned representation.

3. NULL values -- NULLS FIRST and NULLS LAST must work correctly.

This commit only handles the case where datum1 is a pass-by-value
Datum (possibly abbreviated) that compares like an ordinary
integer. (Abbreviations of values of type "numeric" are a convenient
counterexample.) First, tuples are partitioned by nullness in the
correct NULL ordering. Then the NOT NULL tuples are sorted with radix
sort on datum1. For tiebreaks on subsequent sortkeys (including the
first sort key if abbreviated), we divert to the usual qsort.
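
The signedness and direction steps can be sketched as follows.  This is an
illustration of the normalization idea only, with hypothetical function
names; it is not the code added to the sort implementation, and NULL
handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The signed char example: cast to unsigned and add 128 (modulo 256). */
static uint8_t
sketch_normalize_char(signed char value)
{
	return (uint8_t) ((unsigned char) value + 128);
}

/*
 * Map a signed 64-bit key to an unsigned key that sorts the same way when
 * compared as unsigned.  Flipping the sign bit is the same as adding 2^63
 * modulo 2^64.  For DESC, take the complement so the order is reversed.
 */
static uint64_t
sketch_normalize_key(int64_t value, bool descending)
{
	uint64_t	key = (uint64_t) value ^ (UINT64_C(1) << 63);

	if (descending)
		key = ~key;
	return key;
}
```

Once keys are normalized this way, a byte-wise radix sort can process them
without knowing the original type or sort direction.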

ORDER BY queries on pre-warmed buffers are up to 2x faster on high
cardinality inputs with radix sort than the sort specializations added
by commit 697492434, so get rid of them. It's sufficient to fall back
to qsort_tuple() for small arrays. Moderately low cardinality inputs
show more modest improvements. Our qsort is strongly optimized for very
low cardinality inputs, but radix sort is usually equal or very close
in those cases.

The changes to the regression tests are caused by under-specified sort
orders, e.g. "SELECT a, b from mytable order by a;". For unstable
sorts, such as our qsort and this in-place radix sort, there is no
guarantee of the order of "b" within each group of "a".

The implementation is taken from ska_byte_sort() (Boost licensed),
which is similar to American flag sort (an in-place radix sort) with
modifications to make it better suited for modern pipelined CPUs.

The technique of normalization described above can also be extended
to the case of multiple keys. That is left for future work (Thanks
to Peter Geoghegan for the suggestion to look into this area).

Reviewed-by: Chengpeng Yan <chengpeng_yan@outlook.com>
Reviewed-by: zengman <zengman@halodbtech.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com> (earlier version)
Discussion: https://postgr.es/m/CANWCAZYzx7a7E9AY16Jt_U3+GVKDADfgApZ-42SYNiig8dTnFA@mail.gmail.com
2026-02-14 13:50:06 +07:00
Michael Paquier
775fc01415 Improve error message for checksum failures in pgstat_database.c
This log message was referring to conflicts, but it is about checksum
failures.  The log message improved in this commit should never show up,
due to the fact that pgstat_prepare_report_checksum_failure() should
always be called before pgstat_report_checksum_failures_in_db(), with a
stats entry already created in the pgstats shared hash table.  The three
code paths able to report database-level checksum failures already
follow this requirement.

Oversight in b96d3c3897.

Author: Wang Peng <215722532@qq.com>
Discussion: https://postgr.es/m/tencent_9B6CD6D9D34AE28CDEADEC6188DB3BA1FE07@qq.com
Backpatch-through: 18
2026-02-13 12:17:08 +09:00
Dean Rasheed
88327092ff Add support for INSERT ... ON CONFLICT DO SELECT.
This adds a new ON CONFLICT action DO SELECT [FOR UPDATE/SHARE], which
returns the pre-existing rows when conflicts are detected. The INSERT
statement must have a RETURNING clause when DO SELECT is specified.

The optional FOR UPDATE/SHARE clause allows the rows to be locked
before they are returned. As with a DO UPDATE conflict action, an
optional WHERE clause may be used to prevent rows from being selected
for return (but as with a DO UPDATE action, rows filtered out by the
WHERE clause are still locked).

Bumps catversion as stored rules change.

Author: Andreas Karlsson <andreas@proxel.se>
Author: Marko Tiikkaja <marko@joh.to>
Author: Viktor Holmberg <v@viktorh.net>
Reviewed-by: Joel Jacobson <joel@compiler.org>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/d631b406-13b7-433e-8c0b-c6040c4b4663@Spark
Discussion: https://postgr.es/m/5fca222d-62ae-4a2f-9fcb-0eca56277094@Spark
Discussion: https://postgr.es/m/2b5db2e6-8ece-44d0-9890-f256fdca9f7e@proxel.se
Discussion: https://postgr.es/m/CAL9smLCdV-v3KgOJX3mU19FYK82N7yzqJj2HAwWX70E=P98kgQ@mail.gmail.com
2026-02-12 09:57:04 +00:00
Amit Kapila
788ec96d59 Refactor slot synchronization logic in slotsync.c.
Following e68b6adad9, the reason for skipping slot synchronization is
stored as a slot property. This commit removes redundant function
parameters that previously tracked this state, instead relying directly on
the slot property.

Additionally, this change centralizes the logic for skipping
synchronization when required WAL has not yet been received or flushed. By
consolidating this check, we reduce code duplication and the risk of
inconsistent state updates across different code paths.

In passing, add an assertion to ensure a slot is marked as temporary if a
consistent point has not been reached during synchronization.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/TY4PR01MB16907DD16098BE3B20486D4569463A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/CAFPTHDZAA+gWDntpa5ucqKKba41=tXmoXqN3q4rpjO9cdxgQrw@mail.gmail.com
2026-02-12 14:38:31 +05:30
Dean Rasheed
706cadde32 Remove p_is_insert from struct ParseState.
The only place that used p_is_insert was transformAssignedExpr(),
which used it to distinguish INSERT from UPDATE when handling
indirection on assignment target columns -- see commit c1ca3a19df.
However, this information is already available to
transformAssignedExpr() via its exprKind parameter, which is always
either EXPR_KIND_INSERT_TARGET or EXPR_KIND_UPDATE_TARGET.

As noted in the commit message for c1ca3a19df, this use of
p_is_insert isn't particularly pretty, so have transformAssignedExpr()
use the exprKind parameter instead. This then allows p_is_insert to be
removed entirely, which simplifies state management in a few other
places across the parser.

Author: Viktor Holmberg <v@viktorh.net>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/badc3b4c-da73-4000-b8d3-638a6f53a769@Spark
2026-02-12 09:01:42 +00:00
Richard Guo
cf74558feb Reduce LEFT JOIN to ANTI JOIN using NOT NULL constraints
For a LEFT JOIN, if any var from the right-hand side (RHS) is forced
to null by upper-level quals but is known to be non-null for any
matching row, the only way the upper quals can be satisfied is if the
join fails to match, producing a null-extended row.  Thus, we can
treat this left join as an anti-join.

Previously, this transformation was limited to cases where the join's
own quals were strict for the var forced to null by upper qual levels.
This patch extends the logic to check table constraints, leveraging
the NOT NULL attribute information already available thanks to the
infrastructure introduced by e2debb643.  If a forced-null var belongs
to the RHS and is defined as NOT NULL in the schema (and is not
nullable due to lower-level outer joins), we know that the left join
can be reduced to an anti-join.

Note that to ensure the var is not nullable by any lower-level outer
joins within the current subtree, we collect the relids of base rels
that are nullable within each subtree during the first pass of the
reduce-outer-joins process.  This allows us to verify in the second
pass that a NOT NULL var is indeed safe to treat as non-nullable.

Based on a proposal by Nicolas Adenis-Lamarre, but this is not the
original patch.

Suggested-by: Nicolas Adenis-Lamarre <nicolas.adenis.lamarre@gmail.com>
Author: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CACPGbctKMDP50PpRH09in+oWbHtZdahWSroRstLPOoSDKwoFsw@mail.gmail.com
2026-02-12 15:30:13 +09:00
Heikki Linnakangas
78a5e3074b Fix pg_stat_get_backend_wait_event() for aux processes
The pg_stat_activity view shows information for aux processes, but the
pg_stat_get_backend_wait_event() and
pg_stat_get_backend_wait_event_type() functions did not. To fix, call
AuxiliaryPidGetProc(pid) if BackendPidGetProc(pid) returns NULL, like
we do in pg_stat_get_activity().

In version 17 and above, it's a little silly to use those functions
when we already have the ProcNumber at hand, but it was necessary
before v17 because the backend ID was different from ProcNumber. I
have other plans for wait_event_info on master, so it doesn't seem
worth applying a different fix on different versions now.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/c0320e04-6e85-4c49-80c5-27cfb3a58108@iki.fi
Backpatch-through: 14
2026-02-11 18:50:57 +02:00
Nathan Bossart
1d92e0c2cc Add password expiration warnings.
This commit adds a new parameter called
password_expiration_warning_threshold that controls when the server
begins emitting imminent-password-expiration warnings upon
successful password authentication.  By default, this parameter is
set to 7 days, but this functionality can be disabled by setting it
to 0.  This patch also introduces a new "connection warning"
infrastructure that can be reused elsewhere.  For example, we may
want to warn about the use of MD5 passwords for a couple of
releases before removing MD5 password support.

Author: Gilles Darold <gilles@darold.net>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: songjinzhou <tsinghualucky912@foxmail.com>
Reviewed-by: liu xiaohui <liuxh.zj.cn@gmail.com>
Reviewed-by: Yuefei Shi <shiyuefei1004@gmail.com>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Reviewed-by: Soumya S Murali <soumyamurali.work@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/129bcfbf-47a6-e58a-190a-62fc21a17d03%40migops.com
2026-02-11 10:36:15 -06:00
Álvaro Herrera
1efdd7cc63 Cleanup for log_min_messages changes in 38e0190ced
* Remove an unused variable
* Use "default log level" consistently (instead of "generic")
* Keep the process types in alphabetical order (missed one place in the
  SGML docs)
* Since log_min_messages type was changed from enum to string, it
  is a good idea to add single quotes when printing it out.  Otherwise
  it fails if the user copies and pastes from the SHOW output to SET,
  except in the simplest case.  Using single quotes reduces confusion.
* Use lowercase string for the burned-in default value, to keep the same
  output as previous versions.

Author: Euler Taveira <euler@eulerto.com>
Author: Man Zeng <zengman@halodbtech.com>
Author: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/202602091250.genyflm2d5dw@alvherre.pgsql
2026-02-11 16:38:18 +01:00
Heikki Linnakangas
7984ce7a1d Move ProcStructLock to the ProcGlobal struct
It protects the freeProcs and some other fields in ProcGlobal, so
let's move it there. It's good for cache locality to have it next to
the thing it protects, and just makes more sense anyway. I believe it
was allocated as a separate shared memory area just for historical
reasons.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/b78719db-0c54-409f-b185-b0d59261143f@iki.fi
2026-02-11 16:48:45 +02:00
Heikki Linnakangas
ab32a9e21d Remove useless store to local variable
It was a leftover from commit 5764f611e1, which converted the loop to
use dclist_foreach.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/3dd6f70c-b94d-4428-8e75-74a7136396be@iki.fi
2026-02-11 11:49:18 +02:00
Robert Haas
7358abcc60 Store information about Append node consolidation in the final plan.
An extension (or core code) might want to reconstruct the planner's
decisions about whether and where to perform partitionwise joins from
the final plan. To do so, it must be possible to find all of the RTIs
of partitioned tables appearing in the plan. But when an AppendPath
or MergeAppendPath pulls up child paths from a subordinate AppendPath
or MergeAppendPath, the RTIs of the subordinate path do not appear
in the final plan, making this kind of reconstruction impossible.

To avoid this, propagate the RTI sets that would have been present
in the 'apprelids' field of the subordinate Append or MergeAppend
nodes that would have been created into the surviving Append or
MergeAppend node, using a new 'child_append_relid_sets' field for
that purpose. The value of this field is a list of Bitmapsets,
because each relation whose append-list was pulled up had its own
set of RTIs: just one, if it was a partitionwise scan, or more than
one, if it was a partitionwise join. Since our goal is to see where
partitionwise joins were done, it is essential to avoid losing the
information about how the RTIs were grouped in the pulled-up
relations.

This commit also updates pg_overexplain so that EXPLAIN (RANGE_TABLE)
will display the saved RTI sets.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
2026-02-10 17:55:59 -05:00
Michael Paquier
9181c870ba Improve type handling of varlena structures
This commit changes the definition of varlena to a typedef, so that it
becomes possible to remove "struct" markers from various declarations in
the code base.  Historically, "struct" markers are not the project style
for variable declarations, so this update simplifies the code and makes
it more consistent across the board.

This change has an impact on the following structures, simplifying
declarations using them:
- varlena
- varatt_indirect
- varatt_external

This cleanup came up in a different patch set that played with
TOAST and varatt.h, and is worth doing on its own.

Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aW8xvVbovdhyI4yo@paquier.xyz
2026-02-11 07:33:24 +09:00
Robert Haas
0d4391b265 Store information about elided nodes in the final plan.
An extension (or core code) might want to reconstruct the planner's
choice of join order from the final plan. To do so, it must be possible
to find all of the RTIs that were part of the join problem in that plan.
Commit adbad833f3, together with the
earlier work in 8c49a484e8, is enough to
let us match up RTIs we see in the final plan with RTIs that we see
during the planning cycle, but we still have a problem if the planner
decides to drop some RTIs out of the final plan altogether.

To fix that, when setrefs.c removes a SubqueryScan, single-child Append,
or single-child MergeAppend from the final Plan tree, record the type of
the removed node and the RTIs that the removed node would have scanned
in the final plan tree. It would be natural to record this information
on the child of the removed plan node, but that would require adding an
additional pointer field to type Plan, which seems undesirable.  So,
instead, store the information in a separate list that the executor need
never consult, and use the plan_node_id to identify the plan node with
which the removed node is logically associated.

Also, update pg_overexplain to display these details.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
2026-02-10 16:46:05 -05:00
Robert Haas
adbad833f3 Store information about range-table flattening in the final plan.
Suppose that we're currently planning a query and, when that same
query was previously planned and executed, we learned something about
how a certain table within that query should be planned. We want to
take note when that same table is being planned during the current
planning cycle, but this is difficult to do, because the RTI of the
table from the previous plan won't necessarily be equal to the RTI
that we see during the current planning cycle. This is because each
subquery has a separate range table during planning, but these are
flattened into one range table when constructing the final plan,
changing RTIs.

Commit 8c49a484e8 allows us to match up
subqueries seen in the previous planning cycles with the subqueries
currently being planned just by comparing textual names, but that's
not quite enough to let us deduce anything about individual tables,
because we don't know where each subquery's range table appears in
the final, flattened range table.

To fix that, store a list of SubPlanRTInfo objects in the final
planned statement, each including the name of the subplan, the offset
at which it begins in the flattened range table, and whether or not
it was a dummy subplan -- if it was, some RTIs may have been dropped
from the final range table, but also there's no need to control how
a dummy subquery gets planned. The toplevel subquery has no name and
always begins at rtoffset 0, so we make no entry for it.
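
One way such offsets could be used is sketched below.  The struct and its
field names are hypothetical stand-ins, not the real SubPlanRTInfo
definition.  The sketch assumes the entries are ordered by ascending
rtoffset, that each subplan's range table occupies a contiguous run of the
flattened table, and that dummy subplans need no special treatment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry in the per-statement list. */
typedef struct SketchSubPlanRTInfo
{
	const char *plan_name;		/* name of the subplan */
	int			rtoffset;		/* entries preceding it in the flat table */
	bool		dummy;			/* was it a dummy subplan? */
} SketchSubPlanRTInfo;

/*
 * Find the subplan that owns a 1-based RTI of the flattened range table.
 * Returns NULL for the toplevel query, which has no entry.
 */
static const SketchSubPlanRTInfo *
sketch_find_subplan(const SketchSubPlanRTInfo *infos, int ninfos,
					int flat_rti)
{
	const SketchSubPlanRTInfo *found = NULL;

	assert(flat_rti >= 1);
	for (int i = 0; i < ninfos; i++)
	{
		if (infos[i].rtoffset < flat_rti)
			found = &infos[i];
		else
			break;
	}
	return found;
}

/* Convert a flattened RTI back to the RTI local to its subquery. */
static int
sketch_local_rti(const SketchSubPlanRTInfo *info, int flat_rti)
{
	return flat_rti - (info != NULL ? info->rtoffset : 0);
}
```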

This commit teaches pg_overexplain's RANGE_TABLE option to make use
of this new data to display the subquery name for each range table
entry.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
2026-02-10 15:33:39 -05:00
Robert Haas
0f4c8d33d4 Pass cursorOptions to planner_setup_hook.
Commit 94f3ad3961 failed to do this
because I couldn't think of a use for the information, but this has
proven to be short-sighted. Best to fix it before this code is
officially released.

Now, the only argument to standard_planner that isn't passed to
planner_setup_hook is boundParams, but that is accessible via
glob->boundParams, and so doesn't need to be passed separately.

Discussion: https://www.postgresql.org/message-id/CA+TgmoYS4ZCVAF2jTce=bMP0Oq_db_srocR4cZyO0OBp9oUoGg@mail.gmail.com
2026-02-10 11:50:28 -05:00
Robert Haas
cbdf93d471 Fix PGS_CONSIDER_NONPARTIAL interaction with Materialize nodes.
Commit 4020b370f2 had the idea that it
would be a good idea to handle testing PGS_CONSIDER_NONPARTIAL within
cost_material to save callers the trouble, but that turns out not to be
a very good idea. One concern is that it makes cost_material() dependent
on the caller having initialized certain fields in the MaterialPath,
which is a bit awkward for materialize_finished_plan, which wants to use
a dummy path.

Another problem is that it can result in generated materialized nested
loops where the Materialize node is disabled, contrary to the intention
of joinpath.c's logic in match_unsorted_outer() and
consider_parallel_nestloop(), which aims to consider such paths only
when they would not need to be disabled. In the previous coding, it was
possible for the pgs_mask on the joinrel to have PGS_CONSIDER_NONPARTIAL
set, while the inner rel had the same bit clear. In that case, we'd
generate and then disable a Materialize path.

That seems wrong, so instead, pull up the logic to test the
PGS_CONSIDER_NONPARTIAL bit into joinpath.c, restoring the historical
behavior that either we don't generate a given materialized nested loop
in the first place, or we don't disable it.

Discussion: http://postgr.es/m/CA+TgmoawzvCoZAwFS85tE5+c8vBkqgcS8ZstQ_ohjXQ9wGT9sw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYS4ZCVAF2jTce=bMP0Oq_db_srocR4cZyO0OBp9oUoGg@mail.gmail.com
2026-02-10 11:49:07 -05:00
Heikki Linnakangas
be5257725d Refactor ProcessRecoveryConflictInterrupt for readability
Two changes here:

1. Introduce a separate RECOVERY_CONFLICT_BUFFERPIN_DEADLOCK flag to
indicate a suspected deadlock that involves a buffer pin. Previously
the startup process used the same flag for a deadlock involving just
regular locks, and to check for deadlocks involving the buffer
pin. The cases are handled separately in the startup process, but the
receiving backend had to deduce which one it was based on
HoldingBufferPinThatDelaysRecovery(). With a separate flag, the
receiver doesn't need to guess.

2. Rewrite the ProcessRecoveryConflictInterrupt() function to not rely
on fallthrough through the switch-statement. That was difficult to
read.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi
2026-02-10 16:23:10 +02:00
Heikki Linnakangas
17f51ea818 Separate RecoveryConflictReasons from procsignals
Share the same PROCSIG_RECOVERY_CONFLICT flag for all recovery
conflict reasons. To distinguish, have a bitmask in PGPROC to indicate
the reason(s).

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi
2026-02-10 16:23:08 +02:00
Heikki Linnakangas
ddc3250208 Use ProcNumber rather than pid in ReplicationSlot
This helps the next commit.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi
2026-02-10 16:23:05 +02:00
Michael Paquier
f33c585774 Simplify some log messages in extended_stats_funcs.c
The log messages used in this file applied too much quoting logic:
- No need for quote_identifier(), which is fine to not use in the
context of a log entry.
- The usual project style is to group the namespace and object together
in a quoted string, when mentioned in a log message.  This code quoted
the namespace name and the extended statistics object name separately,
which was confusing.

Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20260210.143752.1113524465620875233.horikyota.ntt@gmail.com
2026-02-10 16:59:19 +09:00
Michael Paquier
307447e6db Add information about range type stats to pg_stats_ext_exprs
This commit adds three attributes to the system view pg_stats_ext_exprs,
whose data can exist when involving a range type in an expression:
range_length_histogram
range_empty_frac
range_bounds_histogram

These statistics fields exist since 918eee0c49, and have become
viewable in pg_stats later in bc3c8db8ae.  This puts the definition of
pg_stats_ext_exprs on par with pg_stats.

This issue has shown up during the discussion about the restore of
extended statistics for expressions, so that it becomes possible to query
the stats data to restore from the catalogs.  Having access to this data
is useful on its own, without the restore part.

Some documentation and some tests are added, written by me.  Corey has
authored the part in system_views.sql.

Bump catalog version.

Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aYmCUx9VvrKiZQLL@paquier.xyz
2026-02-10 12:36:57 +09:00
Richard Guo
f41ab51573 Teach planner to transform "x IS [NOT] DISTINCT FROM NULL" to a NullTest
In the spirit of 8d19d0e13, this patch teaches the planner about the
principle that NullTest with !argisrow is fully equivalent to SQL's IS
[NOT] DISTINCT FROM NULL.

The parser already performs this transformation for literal NULLs.
However, a DistinctExpr expression with one input evaluating to NULL
during planning (e.g., via const-folding of "1 + NULL" or parameter
substitution in custom plans) currently remains as a DistinctExpr
node.

This patch closes the gap for const-folded NULLs.  It specifically
targets the case where one input is a constant NULL and the other is a
nullable non-constant expression.  (If the other input were otherwise,
the DistinctExpr node would have already been simplified to a constant
TRUE or FALSE.)

This transformation can be beneficial because NullTest is much more
amenable to optimization than DistinctExpr, since the planner knows a
good deal about the former and next to nothing about the latter.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com
2026-02-10 10:19:25 +09:00
Richard Guo
0aaf0de7fe Optimize BooleanTest with non-nullable input
The BooleanTest construct (IS [NOT] TRUE/FALSE/UNKNOWN) treats a NULL
input as the logical value "unknown".  However, when the input is
proven to be non-nullable, this special handling becomes redundant.
In such cases, the construct can be simplified directly to a boolean
expression or a constant.
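
As a rough illustration of why the special handling is redundant, the
sketch below models SQL booleans in Python, with None standing for NULL.
The function names are made up for this example and are not planner
code.

```python
KINDS = ("IS TRUE", "IS NOT TRUE", "IS FALSE",
         "IS NOT FALSE", "IS UNKNOWN", "IS NOT UNKNOWN")


def boolean_test(x, kind):
    """Evaluate "x <kind>" where x is True, False, or None (NULL)."""
    return {
        "IS TRUE": x is True,
        "IS NOT TRUE": x is not True,
        "IS FALSE": x is False,
        "IS NOT FALSE": x is not False,
        "IS UNKNOWN": x is None,
        "IS NOT UNKNOWN": x is not None,
    }[kind]


def simplified(x, kind):
    """Simplified form, valid only when x is known to be non-NULL."""
    return {
        "IS TRUE": x,
        "IS NOT FALSE": x,
        "IS FALSE": not x,
        "IS NOT TRUE": not x,
        "IS UNKNOWN": False,       # folds to a constant
        "IS NOT UNKNOWN": True,    # folds to a constant
    }[kind]


# For non-NULL inputs, both forms agree on every kind of test.
for x in (True, False):
    for kind in KINDS:
        assert boolean_test(x, kind) == simplified(x, kind)
```

With a NULL input the two forms differ (for instance "NULL IS NOT TRUE"
is true), which is why the simplification needs the non-nullability
proof.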

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com
2026-02-10 10:18:47 +09:00
Richard Guo
0a37961254 Optimize IS DISTINCT FROM with non-nullable inputs
The IS DISTINCT FROM construct compares values acting as though NULL
were a normal data value, rather than "unknown".  Semantically, "x IS
DISTINCT FROM y" yields true if the values differ or if exactly one is
NULL, and false if they are equal or both NULL.  Unlike ordinary
comparison operators, it never returns NULL.

Previously, the planner only simplified this construct if all inputs
were constants, folding it to a constant boolean result.  This patch
extends the optimization to cases where inputs are non-constant but
proven to be non-nullable.  Specifically, "x IS DISTINCT FROM NULL"
folds to constant TRUE if "x" is known to be non-nullable.  For cases
where both inputs are guaranteed not to be NULL, the expression
becomes semantically equivalent to "x <> y", and the DistinctExpr is
converted into an inequality OpExpr.
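
The semantics and the two simplifications can be checked with a small
model, using Python's None for NULL.  This is a sketch of the logic
only; is_distinct_from() is a name chosen for this example.

```python
from itertools import product


def is_distinct_from(x, y):
    """x IS DISTINCT FROM y: NULL is compared like an ordinary value."""
    if x is None and y is None:
        return False          # both NULL: not distinct
    if x is None or y is None:
        return True           # exactly one NULL: distinct
    return x != y             # no NULLs: plain inequality


values = (None, 1, 2)
non_null = (1, 2)

# Unlike ordinary comparisons, the result is never NULL.
assert all(is_distinct_from(x, y) is not None
           for x, y in product(values, repeat=2))

# Non-nullable x against NULL folds to constant TRUE.
assert all(is_distinct_from(x, None) for x in non_null)

# Two non-nullable inputs: equivalent to "x <> y".
assert all(is_distinct_from(x, y) == (x != y)
           for x, y in product(non_null, repeat=2))
```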

This transformation provides several benefits.  It converts the
comparison into a standard operator, allowing the use of partial
indexes and constraint exclusion.  Furthermore, if the clause is
negated (i.e., "IS NOT DISTINCT FROM"), it simplifies to an equality
operator.  This enables the planner to generate better plans using
index scans, merge joins, hash joins, and EC-based qual deduction.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com
2026-02-10 10:17:45 +09:00
Nathan Bossart
158408fef8 pg_upgrade: Fix handling of pg_largeobject_metadata.
For binary upgrades from v16 or newer, pg_upgrade transfers the
files for pg_largeobject_metadata from the old cluster, as opposed
to using COPY or ordinary SQL commands to reconstruct its contents.
While this approach adds complexity, it can greatly reduce
pg_upgrade's runtime when there are many large objects.

Large objects with comments or security labels are one source of
complexity for this approach.  During pg_upgrade, schema
restoration happens before files are transferred.  Comments and
security labels are transferred in the former step, but the COMMENT
and SECURITY LABEL commands will fail if their corresponding large
objects do not exist.  To deal with this, pg_upgrade first copies
only the rows of pg_largeobject_metadata that are needed to avoid
failures.  Later, pg_upgrade overwrites those rows by replacing
pg_largeobject_metadata's files with its files in the old cluster.

Unfortunately, there's a subtle problem here.  Simply put, there's
no guarantee that pg_upgrade will overwrite all of
pg_largeobject_metadata's files on the new cluster.  For example,
the new cluster's version might more aggressively extend relations
or create visibility maps, and pg_upgrade's file transfer code is
not sophisticated enough to remove files that lack counterparts in
the old cluster.  These extra files could cause problems
post-upgrade.

More fortunately, we can simultaneously fix the aforementioned
problem and further optimize binary upgrades for clusters with many
large objects.  If we teach the COMMENT and SECURITY LABEL commands
to allow nonexistent large objects during binary upgrades,
pg_upgrade no longer needs to transfer pg_largeobject_metadata's
contents beforehand.  This approach also allows us to remove the
associated dependency tracking from pg_dump, even for upgrades from
v12-v15 that use COPY to transfer pg_largeobject_metadata's
contents.

In addition to what is described in the previous paragraph, this
commit modifies the query in getLOs() to only retrieve LOs with
comments or security labels for upgrades from v12 or newer.  We
have long assumed that such usage is rare, so this should reduce
pg_upgrade's memory usage and runtime in many cases.  We might also
be able to remove the "upgrades from v12 or newer" restriction on
the recent batch of optimizations by adding special handling for
pg_largeobject_metadata's hidden OID column on older versions
(since this catalog previously used the now-removed WITH OIDS
feature), but that is left as a future exercise.

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/3yd2ss6n7xywo6pmhd7jjh3bqwgvx35bflzgv3ag4cnzfkik7m%40hiyadppqxx6w
2026-02-09 14:58:02 -06:00
Heikki Linnakangas
73d60ac385 cleanup: Deadlock checker is no longer called from signal handler
Clean up a few leftovers from when the deadlock checker was called
from a signal handler. We stopped doing that in commit 6753333f55,
back in 2015.

- CheckDeadLock can return a return value directly to the caller,
  there's no need to use a global variable for that.

- Remove outdated comments that claimed that CheckDeadLock "signals
  ProcSleep".

- It should be OK to ereport() from DeadLockCheck now. I considered
  getting rid of InitDeadLockChecking() and moving the workspace
  allocations into DeadLockCheck, but it's still good to avoid doing
  the allocations while we're holding all the partition locks. So just
  update the comment to give that as the reason we do the allocations
  up front.
2026-02-09 20:26:23 +02:00
Heikki Linnakangas
18f0afb2a6 Fix incorrect iteration type in extension_file_exists()
Commit f3c9e341cd changed the type of objects in the List that
get_extension_control_directories() returns, from "char *" to
"ExtensionLocation *", but missed adjusting this one caller.

Author: Chao Li <lic@highgo.com>
Discussion: https://www.postgresql.org/message-id/362EA9B3-589B-475A-A16E-F10C30426E28@gmail.com
2026-02-09 19:15:44 +02:00
Tom Lane
8ebdf41c26 Harden _int_matchsel() against being attached to the wrong operator.
While the preceding commit prevented such attachments from occurring
in future, this one aims to prevent further abuse of any already-
created operator that exposes _int_matchsel to the wrong data types.
(No other contrib module has a vulnerable selectivity estimator.)

We need only check that the Const we've found in the query is indeed
of the type we expect (query_int), but there's a difficulty: as an
extension type, query_int doesn't have a fixed OID that we could
hard-code into the estimator.

Therefore, the bulk of this patch consists of infrastructure to let
an extension function securely look up the OID of a datatype
belonging to the same extension.  (Extension authors have requested
such functionality before, so we anticipate that this code will
have additional non-security uses, and may soon be extended to allow
looking up other kinds of SQL objects.)

This is done by first finding the extension that owns the calling
function (there can be only one), and then thumbing through the
objects owned by that extension to find a type that has the desired
name.  This is relatively expensive, especially for large extensions,
so a simple cache is put in front of these lookups.
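
The shape of that lookup, with a cache in front of it, might look
roughly like the sketch below.  Everything here is hypothetical: the
dictionaries stand in for catalog contents, "demo_ext" is an invented
extension name, and the OIDs are placeholders, not real values.

```python
class ExtensionTypeLookup:
    """Find a type's OID among the objects of the caller's extension."""

    def __init__(self, function_owner, extension_members):
        self.function_owner = function_owner        # function -> extension
        self.extension_members = extension_members  # extension -> [(name, oid)]
        self.cache = {}
        self.slow_lookups = 0  # counts uncached searches, for illustration

    def _search(self, func_name, type_name):
        self.slow_lookups += 1
        ext = self.function_owner[func_name]  # a function has one owner
        # Thumb through everything the extension owns (the slow part).
        for name, oid in self.extension_members[ext]:
            if name == type_name:
                return oid
        raise LookupError(f"type {type_name!r} not found in {ext!r}")

    def type_oid(self, func_name, type_name):
        key = (func_name, type_name)
        if key not in self.cache:
            self.cache[key] = self._search(func_name, type_name)
        return self.cache[key]


lookup = ExtensionTypeLookup(
    {"_int_matchsel": "demo_ext"},                      # hypothetical owner
    {"demo_ext": [("other_type", 90001), ("query_int", 90002)]},
)
first = lookup.type_oid("_int_matchsel", "query_int")
second = lookup.type_oid("_int_matchsel", "query_int")  # served from cache
```

The estimator could then compare the Const's type against the looked-up
OID before trusting its contents.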

Reported-by: Daniel Firer as part of zeroday.cloud
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
Security: CVE-2026-2004
Backpatch-through: 14
2026-02-09 10:14:22 -05:00
Tom Lane
841d42cc4e Require superuser to install a non-built-in selectivity estimator.
Selectivity estimators come in two flavors: those that make specific
assumptions about the data types they are working with, and those
that don't.  Most of the built-in estimators are of the latter kind
and are meant to be safely attachable to any operator.  If the
operator does not behave as the estimator expects, you might get a
poor estimate, but it won't crash.

However, estimators that do make datatype assumptions can malfunction
if they are attached to the wrong operator, since then the data they
get from pg_statistic may not be of the type they expect.  This can
rise to the level of a security problem, even permitting arbitrary
code execution by a user who has the ability to create SQL objects.

To close this hole, establish a rule that built-in estimators are
required to protect themselves against being called on the wrong type
of data.  It does not seem practical however to expect estimators in
extensions to reach a similar level of security, at least not in the
near term.  Therefore, also establish a rule that superuser privilege
is required to attach a non-built-in estimator to an operator.
We expect that this restriction will have little negative impact on
extensions, since estimators generally have to be written in C and
thus superuser privilege is required to create them in the first
place.

This commit changes the privilege checks in CREATE/ALTER OPERATOR
to enforce the rule about superuser privilege, and fixes a couple
of built-in estimators that were making datatype assumptions without
sufficiently checking that they're valid.

Reported-by: Daniel Firer as part of zeroday.cloud
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
Security: CVE-2026-2004
Backpatch-through: 14
2026-02-09 10:07:31 -05:00
Tom Lane
60e7ae41a6 Guard against unexpected dimensions of oidvector/int2vector.
These data types are represented like full-fledged arrays, but
functions that deal specifically with these types assume that the
array is 1-dimensional and contains no nulls.  However, there are
cast pathways that allow general oid[] or int2[] arrays to be cast
to these types, allowing these expectations to be violated.  This
can be exploited to cause server memory disclosure or SIGSEGV.
Fix by installing explicit checks in functions that accept these
types.
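
The kind of check involved can be sketched as follows.  ArrayValue is a
hypothetical stand-in for an array value, and the error wording is
invented for this example; the real checks may differ in detail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArrayValue:
    ndim: int                       # number of dimensions
    elements: List[Optional[int]]   # None models a NULL element


def check_valid_vector(arr, type_name="oidvector"):
    """Reject values that break the 1-D, no-nulls assumption."""
    if arr.ndim != 1:
        raise ValueError(f"{type_name} must be one-dimensional")
    if any(e is None for e in arr.elements):
        raise ValueError(f"{type_name} must not contain nulls")
    return arr.elements  # now safe to treat as a flat list of values
```

A function accepting one of these types would call such a check before
indexing into the value.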

Reported-by: Altan Birler <altan.birler@tum.de>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
Security: CVE-2026-2003
Backpatch-through: 14
2026-02-09 09:57:43 -05:00
Álvaro Herrera
38e0190ced Allow log_min_messages to be set per process type
Change log_min_messages from being a single element to a comma-separated
list of type:level elements, with 'type' representing a process type,
and 'level' being a log level to use for that type of process.  The list
must also have a freestanding level specification which is used for
process types not listed, which convenientely makes the whole thing
backwards-compatible.
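
To make the list format concrete, here is a minimal parser sketch.  It
is not the server's GUC parsing code: the function name, the error
messages, and the whitespace handling are assumptions made for this
example; only the type:level shape and the required freestanding level
come from the description above.

```python
LEVELS = {"debug5", "debug4", "debug3", "debug2", "debug1", "info",
          "notice", "warning", "error", "log", "fatal", "panic"}


def parse_log_min_messages(value):
    """Return (default_level, {process_type: level})."""
    default = None
    per_type = {}
    for item in value.split(","):
        item = item.strip()
        if ":" in item:
            ptype, level = (part.strip() for part in item.split(":", 1))
            if ptype in per_type:
                raise ValueError(f"process type {ptype!r} listed twice")
        else:
            ptype, level = None, item
        if level not in LEVELS:
            raise ValueError(f"unknown log level {level!r}")
        if ptype is None:
            if default is not None:
                raise ValueError("more than one freestanding level")
            default = level
        else:
            per_type[ptype] = level
    if default is None:
        raise ValueError("a freestanding level is required")
    return default, per_type


def level_for(process_type, parsed):
    """Level for a process type, falling back to the freestanding one."""
    default, per_type = parsed
    return per_type.get(process_type, default)
```

A plain value such as "warning" still parses, with no per-type entries,
which is the backwards-compatible case.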

Some choices made here could be contested; for instance, we use the
process type `backend` to affect regular backends as well as dead-end
backends and the standalone backend, and `autovacuum` means both the
launcher and the workers.  I think it's largely sensible though, and it
can easily be tweaked if desired.

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Tan Yang <332696245@qq.com>
Discussion: https://postgr.es/m/e85c6671-1600-4112-8887-f97a8a5d07b2@app.fastmail.com
2026-02-09 13:23:10 +01:00
Thomas Munro
c67bef3f32 Code coverage for most pg_mblen* calls.
A security patch changed them today, so close the coverage gap now.
Test that buffer overrun is avoided when pg_mblen*() requires more
than the number of bytes remaining.

This does not cover the calls in dict_thesaurus.c or in dict_synonym.c.
That code is straightforward.  To change that code's input, one must
have access to modify installed OS files, so low-privilege users are not
a threat.  Testing this would likewise require changing installed
share/postgresql/tsearch_data, which was enough of an obstacle to not
bother.

Security: CVE-2026-2006
Backpatch-through: 14
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
2026-02-09 12:44:12 +13:00
Thomas Munro
1e7fe06c10 Replace pg_mblen() with bounds-checked versions.
A corrupted string could cause code that iterates with pg_mblen() to
overrun its buffer.  Fix by converting all callers to one of the
following:

1. Callers with a null-terminated string now use pg_mblen_cstr(), which
raises an "illegal byte sequence" error if it finds a terminator in the
middle of the sequence.

2. Callers with a length or end pointer now use either
pg_mblen_with_len() or pg_mblen_range(), for the same effect, depending
on which of the two seems more convenient at each site.

3. A small number of cases pre-validate a string, and can use
pg_mblen_unbounded().
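
The idea behind the range-checked variant can be sketched as below.
This is not the PostgreSQL implementation: it assumes UTF-8 only, and
mblen_range() merely echoes the name used above while raising a Python
ValueError in place of the server's error.

```python
def utf8_char_len(lead):
    """Byte length announced by a UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if lead & 0xE0 == 0xC0:
        return 2
    if lead & 0xF0 == 0xE0:
        return 3
    if lead & 0xF8 == 0xF0:
        return 4
    return 1  # invalid lead byte: step over one byte in this sketch


def mblen_range(buf, pos, end):
    """Length of the character at buf[pos], never reaching past end."""
    if pos >= end:
        raise ValueError("position is outside the buffer")
    n = utf8_char_len(buf[pos])
    if pos + n > end:
        # The lead byte promises more bytes than remain in the buffer.
        raise ValueError("illegal byte sequence: truncated character")
    return n


def char_lengths(buf):
    """Iterate over a buffer character by character, bounds-checked."""
    pos, lengths = 0, []
    while pos < len(buf):
        n = mblen_range(buf, pos, len(buf))
        lengths.append(n)
        pos += n
    return lengths
```

An unchecked loop would advance pos past len(buf) on a truncated final
character; the checked version raises instead.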

The traditional pg_mblen() function and COPYCHAR macro still exist for
backward compatibility, but are no longer used by core code and are
hereby deprecated.  The same applies to the t_isXXX() functions.

Security: CVE-2026-2006
Backpatch-through: 14
Co-authored-by: Thomas Munro <thomas.munro@gmail.com>
Co-authored-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reported-by: Paul Gerste (as part of zeroday.cloud)
Reported-by: Moritz Sanft (as part of zeroday.cloud)
2026-02-09 12:44:04 +13:00
Tom Lane
73dd7163c5 Replace some hard-wired OID constants with corresponding macros.
Looking again at commit 7cdb633c8, I wondered why we have hard-wired
"1034" for the OID of type aclitem[].  Some other entries in the same
array have numeric type OIDs as well.  This seems to be a hangover
from years ago when not every built-in pg_type entry had an OID macro.
But since we made genbki.pl responsible for generating these macros,
there are macros available for all these array types, so there's no
reason not to follow the project policy of never writing numeric OID
constants in C code.
2026-02-07 23:15:20 -05:00
Tom Lane
7cdb633c89 Make some minor cleanups in typalign-related code.
Commit 7b378237a widened AclMode to 64 bits, which implies that
the alignment of AclItem is now determined by an int64 field.
That commit correctly set the typalign for SQL type aclitem to
'd', but it missed the hard-wired knowledge about _aclitem in
bootstrap.c.  This doesn't seem to have caused any ill effects,
probably because we never try to fill a non-null value into
an aclitem[] column during bootstrap.  Nonetheless, it's clearly
a gotcha waiting to happen, so fix it up.

In passing, also fix a couple of typanalyze functions that were
using hard-coded typalign constants when they could just as
easily use greppable TYPALIGN_xxx macros.

Noticed these while working on a patch to expand the set of
typalign values.  I doubt we are going to pursue that path,
but these fixes still seem worth a quick commit.

Discussion: https://postgr.es/m/1127261.1769649624@sss.pgh.pa.us
2026-02-06 20:46:03 -05:00
Nathan Bossart
ba1e14134a Adjust style of some debugging macros.
This commit adjusts a few debugging macros to match the style of
those in pg_config_manual.h.  Like commits 123661427b and
b4cbc106a6, these were discovered while reviewing Aleksander
Alekseev's proposed changes to pgindent.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aP-H6kSsGOxaB21k%40nathan
2026-02-06 16:24:21 -06:00
Michael Paquier
072c842135 Fix use of proc number in pgstat_create_backend()
This routine's internals directly used MyProcNumber to choose which
object ID to assign for the hash key of a backend's stats entry, while
the value to use is given as input argument of the function.

The original intention was to pass MyProcNumber as an argument of
pgstat_create_backend() when it is called in pgstat_bestart_final()
(pgstat_beinit() ensures that MyProcNumber has been set), rather than
to use it directly in the function.  This commit addresses this
inconsistency by using the procnum given by the caller of
pgstat_create_backend(), not MyProcNumber.

This issue is not a cause of bugs currently.  However, let's keep the
code in sync across all the branches where this code exists, as it could
matter in a future backpatch.

Oversight in 4feba03d8b.

Reported-by: Ryo Matsumura <matsumura.ryo@fujitsu.com>
Discussion: https://postgr.es/m/TYCPR01MB11316AD8150C8F470319ACCAEE866A@TYCPR01MB11316.jpnprd01.prod.outlook.com
Backpatch-through: 18
2026-02-06 19:57:22 +09:00
Thomas Munro
f94e9141a0 Add file_extend_method=posix_fallocate,write_zeros.
Provide a way to disable the use of posix_fallocate() for relation
files.  It was introduced by commit 4d330a61bb.  The new setting
file_extend_method=write_zeros can be used as a workaround for problems
reported from the field:

 * BTRFS compression is disabled by the use of posix_fallocate()
 * XFS could produce spurious ENOSPC errors in some Linux kernel
   versions, though that problem is reported to have been fixed

The default is file_extend_method=posix_fallocate if available, as
before.  The write_zeros option is similar to PostgreSQL < 16, except
that now it's multi-block.
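
The difference between the two methods can be sketched at the system
call level.  This is an illustration only, assuming a Unix-like
platform: extend_file() and its arguments are invented for this
example, and the server's own code is written in C and differs in
detail.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # PostgreSQL's default block size


def extend_file(fd, old_size, nblocks, method="posix_fallocate"):
    """Grow a file by nblocks blocks and return its new size."""
    nbytes = nblocks * BLOCK_SIZE
    if method == "posix_fallocate" and hasattr(os, "posix_fallocate"):
        # Ask the filesystem to reserve the space without writing data.
        os.posix_fallocate(fd, old_size, nbytes)
    else:
        # write_zeros: write all the new blocks as zeros in one go.
        zeros = bytes(nbytes)
        written = 0
        while written < nbytes:
            written += os.pwrite(fd, zeros[written:], old_size + written)
    return old_size + nbytes


with tempfile.TemporaryFile() as f:
    new_size = extend_file(f.fileno(), 0, 3, method="write_zeros")
    contents = f.read()
```

Either way the file ends up three blocks longer and reads back as
zeros; what differs is whether the filesystem sees an allocation
request or ordinary writes.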

Backpatch-through: 16
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Discussion: https://postgr.es/m/b1843124-fd22-e279-a31f-252dffb6fbf2%40gmx.net
2026-02-06 17:38:49 +13:00
Michael Paquier
9476ef206c Fix comment in extended_stats_funcs.c
The attribute storing the statistics data for a set of expressions in
pg_statistic_ext_data is stxdexpr.  stxdexprs does not exist.

Extracted from a larger patch by the same author.  Incorrect as of
efbebb4e85.

Author: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/CADkLM=fPcci6oPyuyEZ0F4bWqAA7HzaWO+ZPptufuX5_uWt6kw@mail.gmail.com
2026-02-05 15:14:53 +09:00
Masahiko Sawada
7a1f0f8747 pg_upgrade: Optimize logical replication slot caught-up check.
Commit 29d0a77fa6 improved pg_upgrade to allow migrating logical slots
provided that all logical slots have caught up (i.e., they have no
pending decodable WAL records). Previously, this verification was done
by checking each slot individually, which could be time-consuming if
there were many logical slots to migrate.

This commit optimizes the check to avoid reading the same WAL stream
multiple times. It performs the check only for the slot with the
minimum confirmed_flush_lsn and applies the result to all other slots
in the same database. This limits the check to at most one logical
slot per database.

During the check, we identify the last decodable WAL record's LSN to
report any slots with unconsumed records, consistent with the existing
error reporting behavior. Additionally, the maximum
confirmed_flush_lsn among all logical slots on the database is used as
an early scan cutoff; finding a decodable WAL record beyond this point
implies that no slot has caught up.
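
A simplified model of the single-scan check is sketched below.  It
assumes that a slot still has pending work when some decodable record
sits at or after its confirmed_flush_lsn, represents LSNs as plain
integers, and uses invented names; the real code reads WAL rather than
a Python list.

```python
def slots_not_caught_up(slots, decodable_lsns):
    """slots: {name: confirmed_flush_lsn}; decodable_lsns: ascending LSNs.

    Returns the names of slots that still have unconsumed records.
    """
    if not slots:
        return []
    scan_start = min(slots.values())  # scan once, from the oldest slot
    cutoff = max(slots.values())      # early scan cutoff
    last_decodable = None
    for lsn in decodable_lsns:
        if lsn < scan_start:
            continue                  # already consumed by every slot
        last_decodable = lsn
        if lsn >= cutoff:
            break                     # no slot can have caught up
    if last_decodable is None:
        return []                     # every slot has caught up
    return sorted(name for name, flushed in slots.items()
                  if flushed <= last_decodable)
```

For instance, with slots at LSNs 100, 200 and 300 and one decodable
record at 150, only the slot at 100 is reported.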

Performance testing demonstrated that the execution time remains
stable regardless of the number of slots in the database.

Note that we do not distinguish slots based on their output plugins. A
hypothetical plugin might use a replication origin filter that filters
out changes from a specific origin. In such cases, we might get a
false positive (erroneously considering a slot caught up). However,
this is safe from a data integrity standpoint, such scenarios are
rare, and the impact of a false positive is minimal.

This optimization is applied only when the old cluster is version 19
or later.

Bump catalog version.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAD21AoBZ0LAcw1OHGEKdW7S5TRJaURdhEk3CLAW69_siqfqyAg@mail.gmail.com
2026-02-04 17:11:27 -08:00