Commit graph

28616 commits

Author SHA1 Message Date
Andres Freund
45f658dacb freespace: Don't modify page without any lock
Before this commit fsm_vacuum_page() modified the page without any lock on the
page. Historically that was kind of ok, as we didn't rely on the freespace to
really stay consistent and we did not have checksums. But these days pages are
checksummed and there are ways for FSM pages to be included in WAL records,
even if the FSM itself is still not WAL logged. If a FSM page ever were
modified while a WAL record referenced that page, we'd be in trouble, as the
WAL CRC could end up getting corrupted.

The reason to address this right now is a series of patches with the goal to
only allow modifications of pages with an appropriate lock level. Obviously
not having any lock is not appropriate :)

Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/4wggb7purufpto6x35fd2kwhasehnzfdy3zdcu47qryubs2hdz@fa5kannykekr
Discussion: https://postgr.es/m/e6a8f734-2198-4958-a028-aba863d4a204@iki.fi
2026-01-12 12:40:00 -05:00
Álvaro Herrera
225d1df1d2
Stop including {brin,gin}_tuple.h in tuplesort.h
Doing this meant that those two headers, which are supposed to be
internal to their corresponding index AMs, were being included pretty
much universally, because tuplesort.h is included by execnodes.h which
is very widely used.  Stop that, and fix fallout.

We also change indexing.h to no longer include execnodes.h (tuptable.h
is sufficient), and relscan.h to no longer include buf.h (pointless
since c2fe139c20).

Author: Mario González <gonzalemario@gmail.com>
Discussion: https://postgr.es/m/CAFsReFUcBFup=Ohv_xd7SNQ=e73TXi8YNEkTsFEE2BW7jS1noQ@mail.gmail.com
2026-01-12 18:09:49 +01:00
Álvaro Herrera
2defd00062
Move instrumentation-related structs to instrument_node.h
Some structs and enums related to parallel query instrumentation had
organically grown scattered across various files, and were causing
header pollution especially through execnodes.h.  Create a single file
where they can live together.

This only moves the structs to the new file; cleaning up the pollution
by removing no-longer-necessary cross-header inclusion will be done in
future commits.

Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Co-authored-by: Mario González <gonzalemario@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/202510051642.wwmn4mj77wch@alvherre.pgsql
Discussion: https://postgr.es/m/CAFsReFUr4KrQ60z+ck9cRM4WuUw1TCghN7EFwvV0KvuncTRc2w@mail.gmail.com
2026-01-12 16:59:28 +01:00
Peter Eisentraut
c3c240537f Avoid casting void * function arguments
In many cases, the cast would silently drop a const qualifier.  To
fix, drop the unnecessary cast and let the compiler check the types
and qualifiers.  Add const to read-only local variables, preserving
the const qualifiers from the function signatures.

Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/aUQHy/MmWq7c97wK%40ip-10-97-1-34.eu-west-3.compute.internal
2026-01-12 16:12:56 +01:00
Andres Freund
e5a5e0a907 instrumentation: Keep time fields as instrtime, convert in callers
Previously the instrumentation logic always converted to seconds, only for
many of the callers to do unnecessary division to get to milliseconds. As an
upcoming refactoring will split the Instrumentation struct, utilize instrtime
always to keep things simpler. It's also a bit faster to not have to first
convert to a double in functions like InstrEndLoop(), InstrAggNode().

Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAP53PkzZ3UotnRrrnXWAv=F4avRq9MQ8zU+bxoN9tpovEu6fGQ@mail.gmail.com
2026-01-09 13:38:00 -05:00
Heikki Linnakangas
bba81f9d3d Inline ginCompareAttEntries for speed
It is called in tight loops during GIN index build.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/5d366878-2007-4d31-861e-19294b7a583b@gmail.com
2026-01-09 20:31:43 +02:00
Tom Lane
7a1d422e39 Improve "constraint must include all partitioning columns" message.
This formerly said "unique constraint must ...", which was accurate
enough when it only applied to UNIQUE and PRIMARY KEY constraints.
However, now we use it for exclusion constraints too, and in that
case it's a tad confusing.  Do what we already did in the errdetail
message: print the constraint_type, so that it looks like "UNIQUE
constraint ...", "EXCLUDE constraint ...", etc.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxH6VhAf65Vghg4T2q315gY=Rt4BUfMyunkfRj0n2S9n-g@mail.gmail.com
2026-01-09 12:59:35 -05:00
Nathan Bossart
7a485bd641 pg_dump: Fix gathering of sequence information.
Since commit bd15b7db48, pg_dump uses pg_get_sequence_data() (née
pg_sequence_read_tuple()) to gather all sequence data in a single
query as opposed to a query per sequence.  Two related bugs have
been identified:

* If the user lacks appropriate privileges on the sequence, pg_dump
generates a setval() command with garbage values instead of
failing as expected.

* pg_dump can fail due to a concurrently dropped sequence, even if
the dropped sequence's data isn't part of the dump.

This commit fixes the above issues by 1) teaching
pg_get_sequence_data() to return nulls instead of erroring for a
missing sequence and 2) teaching pg_dump to fail if it tries to
dump the data of a sequence for which pg_get_sequence_data()
returned nulls.  Note that pg_dump may still fail due to a
concurrently dropped sequence, but it should now only do so when
the sequence data is part of the dump.  This matches the behavior
before commit bd15b7db48.

Bug: #19365
Reported-by: Paveł Tyślacki <pavel.tyslacki@gmail.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19365-6245240d8b926327%40postgresql.org
Discussion: https://postgr.es/m/2885944.1767029161%40sss.pgh.pa.us
Backpatch-through: 18
2026-01-09 10:12:54 -06:00
Fujii Masao
24cb3a08a4 Use IsA() macro in define.c, for sake of consistency.
Commit 63d1b1cf7f replaced a direct nodeTag() comparison with the IsA()
macro in copy.c, but a similar direct comparison remained in define.c.

This commit replaces that comparison with IsA() for consistency.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwGjWGS89_sTx=sbPm0FQemyQQrfTKm=taUhAJFV5k-9cw@mail.gmail.com
2026-01-09 20:24:00 +09:00
Peter Eisentraut
1f08d687c3 meson: Rename cpp variable to cxx
Since CPP is also used to mean C PreProcessor in our meson.build
files, it's confusing to use it to also mean C PlusPlus.  This uses
the non-ambiguous cxx wherever possible.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/D98JHQF7H2A8.VSE3I4CJBTAB%40partin.io
2026-01-09 08:58:23 +01:00
David Rowley
349107537d Fix possible incorrect column reference in ERROR message
When creating a partition for a RANGE partitioned table, the reporting
of errors relating to converting the specified range values into
constant values for the partition key's type could display the name of a
previous partition key column when an earlier range was specified as
MINVALUE or MAXVALUE.

This was caused by the code not correctly incrementing the index that
tracks which partition key the foreach loop was working on after
processing MINVALUE/MAXVALUE ranges.

Fix by using foreach_current_index() to ensure the index variable is
always set to the List element being worked on.

Author: myzhen <zhenmingyang@yeah.net>
Reviewed-by: zhibin wang <killerwzb@gmail.com>
Discussion: https://postgr.es/m/273cab52.978.19b96fc75e7.Coremail.zhenmingyang@yeah.net
Backpatch-through: 14
2026-01-09 11:01:36 +13:00
Tom Lane
b352d3d80b Mark GiST inet_ops as opcdefault, and deal with ensuing fallout.
This patch completes the transition to making inet_ops be default for
inet/cidr columns, rather than btree_gist's opclasses.  Once we do
that, though, pg_upgrade has a big problem.  A dump from an older
version will see btree_gist's opclasses as being default, so it will
not mention the opclass explicitly in CREATE INDEX commands, which
would cause the restore to create the indexes using inet_ops.  Since
that's not compatible with what's actually in the files, havoc would
ensue.

This isn't readily fixable, because the CREATE INDEX command strings
are built by the older server's pg_get_indexdef() function; pg_dump
hasn't nearly enough knowledge to modify those strings successfully.
Even if we cared to put in the work to make that happen in pg_dump,
it would be counterproductive because the end goal here is to get
people off of these opclasses.  Allowing such indexes to persist
through pg_upgrade wouldn't advance that goal.

Therefore, this patch just adds code to pg_upgrade to detect indexes
that would be problematic and refuse to upgrade.

There's another issue too: even without any indexes to worry about,
pg_dump in binary-upgrade mode will reproduce the "CREATE OPERATOR
CLASS ... DEFAULT" commands for btree_gist's opclasses, and those
will fail because now we have a built-in opclass that provides a
conflicting default.  We could ask users to drop the btree_gist
extension altogether before upgrading, but that would carry very
severe penalties.  It would affect perfectly-valid indexes for other
data types, and it would drop operators that might be relied on in
views or other database objects.  Instead, put a hack in DefineOpClass
to ignore the DEFAULT clauses for these opclasses when in
binary-upgrade mode.  This will result in installing a version of
btree_gist that isn't quite the version it claims to be, but that can
be fixed by issuing ALTER EXTENSION UPDATE afterwards.

Since we don't apply that hack when not in binary-upgrade mode,
it is now impossible to install any version of btree_gist
less than 1.9 via CREATE EXTENSION.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us
2026-01-08 14:03:56 -05:00
Heikki Linnakangas
63d1b1cf7f Use IsA macro, for sake of consistency
Reported-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAOzEurS=PzRzGba3mpNXgEhbnQFA0dxXaU0ujCJ0aa9yMSH6Pw@mail.gmail.com
2026-01-08 18:58:28 +02:00
Heikki Linnakangas
ad853bb877 Fix misc typos, mostly in comments
The only user-visible change is the fix in the "malformed
pg_dependencies" error detail. That one is new in commit e1405aa5e3,
so no backpatching required.
2026-01-08 18:10:08 +02:00
Heikki Linnakangas
d3304921db Improve comments around _bt_checkkeys
Discussion: https://www.postgresql.org/message-id/f5388839-99da-465a-8744-23cdfa8ce4db@iki.fi
2026-01-08 18:10:05 +02:00
Amit Kapila
31ddbb38ee Fix typos in the code.
Author: "Dewei Dai" <daidewei1970@163.com>
Author: zengman <zengman@halodbtech.com>
Author: Zhiyuan Su <suzhiyuan_pg@126.com>
Discussion: https://postgr.es/m/2026010719201902382410@163.com
Discussion: https://postgr.es/m/tencent_4DC563C83443A4B1082D2BFF@qq.com
Discussion: https://postgr.es/m/44656d72.2a63.19b9b92b0a3.Coremail.suzhiyuan_pg@126.com
2026-01-08 09:43:50 +00:00
Peter Eisentraut
6ade3cd459 Remove use of rindex() function
rindex() has been removed from POSIX 2008.  Replace the one remaining
use with the equivalent and more standard strrchr().

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/98ce805c-6103-421b-adc3-fcf8f3dddbe3%40eisentraut.org
2026-01-08 08:53:49 +01:00
Peter Geoghegan
8aed8e168f Fix nbtree skip array transformation comments.
Fix comments that incorrectly described transformations performed by the
"Avoid extra index searches through preprocessing" mechanism introduced
by commit b3f1a13f.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/20251230190145.c3c88c5eb0f88b136adda92f@sraoss.co.jp
Backpatch-through: 18
2026-01-07 12:53:07 -05:00
Michael Paquier
68119480a7 Fix unexpected reversal of lists during catcache rehash
During catcache searches, the most-recently searched entries are kept at
the head of the list to speed up subsequent searches, keeping the
"freshest" entries at its beginning.  A rehash of the catcache was doing
the opposite: fresh entries were moved to the tail of the newly-created
buckets, causing a rehash to slow down a bit.

When a rehash is done, this commit switches the code to use
dlist_push_tail() instead of dlist_push_head(), so as fresh entries are
kept at the head of the lists, not their tail.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/tencent_9EA10D8512B5FE29E7323F780A0749768708@qq.com
2026-01-07 17:52:54 +09:00
Michael Paquier
b139bd3b6e Add data type oid8, 64-bit unsigned identifier
This new identifier type provides support for 64-bit unsigned values,
to be used in catalogs, like OIDs.  An advantage of a new data type is
that it becomes easier to grep for it in the code when assigning this
type to a catalog attribute, linking it to dedicated APIs and internal
structures.

The following operators are added in this commit, with dedicated tests:
- Casts with integer types and OID.
- btree and hash operators
- min/max functions.
- C type with related macros and defines, named around "Oid8".

This has been mentioned as useful on its own on the thread to add
support for 64-bit TOAST values, so as it becomes possible to attach
this data type to the TOAST code and catalog definitions.  However, as
this concept can apply to many more areas, it is implemented as its own
independent change.  This is based on a discussion with Andres Freund
and Tom Lane.

Bump catalog version.

Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com>
Discussion: https://postgr.es/m/1891064.1754681536@sss.pgh.pa.us
2026-01-07 11:37:00 +09:00
Jeff Davis
af2d4ca191 Clean up ICU includes.
Remove ICU includes from pg_locale.h, and instead include them in the
few C files that need ICU. Clean up a few other includes in passing.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/48911db71d953edec66df0d2ce303563d631fbe0.camel@j-davis.com
2026-01-06 17:19:51 -08:00
Andres Freund
75609fded3 Fix buggy interaction between array subscripts and subplan params
In a7f107df2 I changed subplan param evaluation to happen within the
containing expression. As part of that, ExecInitSubPlanExpr() was changed to
evaluate parameters via a new EEOP_PARAM_SET expression step. These parameters
were temporarily stored into ExprState->resvalue/resnull, with some reasoning
why that would be fine. Unfortunately, that analysis was wrong -
ExecInitSubscriptionRef() evaluates the input array into "resv"/"resnull",
which will often point to ExprState->resvalue/resnull. This means that the
EEOP_PARAM_SET, if inside an array subscript, would overwrite the input array
to array subscript.

The fix is fairly simple - instead of evaluating into
ExprState->resvalue/resnull, store the temporary result of the subplan in the
subplan's return value.

Bug: #19370
Reported-by: Zepeng Zhang <redraiment@gmail.com>
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Diagnosed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/19370-7fb7a5854b7618f1@postgresql.org
Backpatch-through: 18
2026-01-06 19:51:10 -05:00
Jeff Davis
c4ff35f104 ICU: use UTF8-optimized case conversion API
Initializes a UCaseMap object once for use across calls, and uses
UTF8-optimized APIs.

Author: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: zengman <zengman@halodbtech.com>
Discussion: https://postgr.es/m/5a010b27-8ed9-4739-86fe-1562b07ba564@proxel.se
2026-01-06 14:09:07 -08:00
Alexander Korotkov
bf308639bf Fix variable usage in wakeupWaiters()
Mistakenly, `i` was used both as an integer representation of lsnType and
an index in wakeUpProcs array.

Discussion: https://postgr.es/m/CA%2BhUKG%2BL0OQR8dQtsNBSaA3FNNyQeFabdeRVzurm0b7Xtp592g%40mail.gmail.com
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Ruikai Peng <ruikai@pwno.io>
2026-01-06 10:04:28 +02:00
Michael Paquier
f1e251be80 Allow bgworkers to be terminated for database-related commands
Background workers gain a new flag, called BGWORKER_INTERRUPTIBLE, that
offers the possibility to terminate the workers when these are connected
to a database that is involved in one of the following commands:
ALTER DATABASE RENAME TO
ALTER DATABASE SET TABLESPACE
CREATE DATABASE
DROP DATABASE

This is useful to give background workers the same behavior as backends
and autovacuum workers, which are stopped when these commands are
executed.  The default behavior, that exists since 9.3, is still to
never terminate bgworkers connected to the database involved in any of
these commands.  The new flag has to be set to terminate the workers.

A couple of tests are added to worker_spi to track the commands that
impact the termination of the workers.  There is a test case for a
non-interruptible worker, additionally, that relies on an injection
point to make the wait time in CountOtherDBBackends() reduced from 5s to
0.3s for faster test runs.  The tests rely on the contents of the server
logs to check if a worker has been started or terminated:
- LOG generated by worker_spi_main() at startup, once connection to
database is done.
- FATAL in bgworker_die() when terminated.
A couple of tests run in the CI have showed that this method is stable
enough.  The safe_psql() calls that scan pg_stat_activity could be
replaced with some poll_query_until() for more stability, if the current
method proves to be an issue in the buildfarm.

Author: Aya Iwata <iwata.aya@fujitsu.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Discussion: https://postgr.es/m/OS7PR01MB11964335F36BE41021B62EAE8EAE4A@OS7PR01MB11964.jpnprd01.prod.outlook.com
2026-01-06 14:24:29 +09:00
Amit Kapila
c970bdc037 Update comments atop ReplicationSlotCreate.
Since commit 1462aad2e4, which introduced the ability to modify the
two_phase property of a slot, the comments above ReplicationSlotCreate
have become outdated. We have now added a cautionary note in the comments
above ReplicationSlotAlter explaining when it is safe to modify the
two_phase property of a slot.

Author: Daniil Davydov <3danissimo@gmail.com>
Author: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CAJDiXggZXQZ7bD0QcTizDt6us9aX6ZKK4dWxzgb5x3+TsVHjqQ@mail.gmail.com
2026-01-06 05:02:25 +00:00
David Rowley
6ca8506ea5 Fix issue with EVENT TRIGGERS and ALTER PUBLICATION
When processing the "publish" options of an ALTER PUBLICATION command,
we call SplitIdentifierString() to split the options into a List of
strings.  Since SplitIdentifierString() modifies the delimiter
character and puts NULs in their place, this would overwrite the memory
of the AlterPublicationStmt.  Later in AlterPublicationOptions(), the
modified AlterPublicationStmt is copied for event triggers, which would
result in the event trigger only seeing the first "publish" option
rather than all options that were specified in the command.

To fix this, make a copy of the string before passing to
SplitIdentifierString().

Here we also adjust a similar case in the pgoutput plugin.  There's no
known issues caused by SplitIdentifierString() here, so this is being
done out of paranoia.

Thanks to Henson Choi for putting together an example case showing the
ALTER PUBLICATION issue.

Author: sunil s <sunilfeb26@gmail.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Reviewed-by: zengman <zengman@halodbtech.com>
Backpatch-through: 14
2026-01-06 17:29:12 +13:00
Amit Kapila
63ed3bc7f9 Fix typo in slot.c.
Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/AC9B87F1-ED04-4547-B85C-9443B4253A08@gmail.com
Discussion: https://postgr.es/m/CAJDiXggZXQZ7bD0QcTizDt6us9aX6ZKK4dWxzgb5x3+TsVHjqQ@mail.gmail.com
2026-01-06 04:13:40 +00:00
Michael Paquier
ae28373602 Fix typo in planner.c
b8cfcb9e00 did not get this change right.

Author: Alexander Law <exclusion@gmail.com>
Discussion: https://postgr.es/m/CAJ0YPFFWhJXs-e-=7iJz-FLp=b1dXfJA_qtrVAgto=bZmzD9zQ@mail.gmail.com
2026-01-06 12:30:01 +09:00
Fujii Masao
d926462d81 Honor GUC settings specified in CREATE SUBSCRIPTION CONNECTION.
Prior to v15, GUC settings supplied in the CONNECTION clause of
CREATE SUBSCRIPTION were correctly passed through to
the publisher's walsender. For example:

        CREATE SUBSCRIPTION mysub
            CONNECTION 'options=''-c wal_sender_timeout=1000'''
            PUBLICATION ...

would cause wal_sender_timeout to take effect on the publisher's walsender.

However, commit f3d4019da5 changed the way logical replication
connections are established, forcing the publisher's relevant
GUC settings (datestyle, intervalstyle, extra_float_digits) to
override those provided in the CONNECTION string. As a result,
from v15 through v18, GUC settings in the CONNECTION string were
always ignored.

This regression prevented per-connection tuning of logical replication.
For example, using a shorter timeout for walsender connecting
to a nearby subscriber and a longer one for walsender connecting
to a remote subscriber.

This commit restores the intended behavior by ensuring that
GUC settings in the CONNECTION string are again passed through
and applied by the walsender, allowing per-connection configuration.

Backpatch to v15, where the regression was introduced.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/CAHGQGwGYV+-abbKwdrM2UHUe-JYOFWmsrs6=QicyJO-j+-Widw@mail.gmail.com
Backpatch-through: 15
2026-01-06 11:52:22 +09:00
David Rowley
7f9acc9bcc Simplify GetOperatorFromCompareType() code
The old code would set *opid = InvalidOid to determine if the
get_opclass_opfamily_and_input_type() worked or not.  This means more
moving parts that what's really needed here.  Let's just fail
immediately if the get_opclass_opfamily_and_input_type() lookup fails.

Author: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com>
Discussion: https://postgr.es/m/CA+renyXOrjLacP_nhqEQUf2W+ZCoY2q5kpQCfG05vQVYzr8b9w@mail.gmail.com
2026-01-06 15:25:13 +13:00
David Rowley
e8d4e94a47 Fix misleading comment for GetOperatorFromCompareType
The comment claimed *strat got set to InvalidStrategy when the function
lookup fails.  This isn't true; an ERROR is raised when that happens.

Author: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com>
Discussion: https://postgr.es/m/CA+renyXOrjLacP_nhqEQUf2W+ZCoY2q5kpQCfG05vQVYzr8b9w@mail.gmail.com
Backpatch-through: 18
2026-01-06 15:16:14 +13:00
Tom Lane
b85d5dc0e7 Fix meson build of snowball code.
include/snowball/libstemmer has to be in the -I search path,
as it is in the autoconf build.  It's not apparent to me how
this ever worked before, nor why my recent commit made it
stop working.

Discussion: https://postgr.es/m/ld2iurl7kzexwydxmdfhdgarpa7xxsfrgvggqhbblt4rvt3h6t@bxsk6oz5x7cc
2026-01-05 16:51:36 -05:00
Tom Lane
7dc95cc3b9 Update to latest Snowball sources.
It's been almost a year since we last did this, and upstream has
been busy.  They've added stemmers for Polish and Esperanto,
and also deprecated their old Dutch stemmer in favor of the
Kraaij-Pohlmann algorithm.  (The "dutch" stemmer is now the
latter, and "dutch_porter" is the old algorithm.)

Upstream also decided to rename their internal header "header.h"
to something less generic: "snowball_runtime.h".  Seems like a good
thing, but it complicates this patch a bit because we were relying on
interposing our own version of "header.h" to control system header
inclusion order.  (We're partially failing at that now, because now the
generated stemmer files include <stddef.h> before snowball_runtime.h.
I think that'll be okay, but if the buildfarm complains then we'll
have to do more-extensive editing of the generated files.)

I realized that we weren't documenting the available stemmers in
any user-visible place, except indirectly through sample \dFd output.
That's incomplete because we only provide built-in dictionaries for
the recommended stemmers for each language, not alternative stemmers
such as dutch_porter.  So I added a list to the documentation.

I did not do anything with the stopword lists.  If those are still
available from snowballstem.org, they are mighty well hidden.

Discussion: https://postgr.es/m/1185975.1767569534@sss.pgh.pa.us
2026-01-05 15:22:37 -05:00
Masahiko Sawada
d351063e49 Fix typo in parallel.c.
Author: kelan <ke_lan1@qq.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/tencent_38B5875E2D440C8DA8C0C022ABD999F9C207@qq.com
2026-01-05 10:16:28 -08:00
Alexander Korotkov
49a181b5d6 Add the MODE option to the WAIT FOR LSN command
This commit extends the WAIT FOR LSN command with an optional MODE option in
the WITH clause that specifies which LSN type to wait for:

  WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)]

where mode can be:
 - 'standby_replay' (default): Wait for WAL to be replayed to the specified
   LSN,
 - 'standby_write': Wait for WAL to be written (received) to the specified
   LSN,
 - 'standby_flush': Wait for WAL to be flushed to disk at the specified LSN,
 - 'primary_flush': Wait for WAL to be flushed to disk on the primary server.

The default mode is 'standby_replay', matching the original behavior when MODE
is not specified. This follows the pattern used by COPY and EXPLAIN
commands, where options are specified as string values in the WITH clause.

Modes are explicitly named to distinguish between primary and standby
operations:
- Standby modes ('standby_replay', 'standby_write', 'standby_flush') can only
  be used during recovery (on a standby server),
- Primary mode ('primary_flush') can only be used on a primary server.

The 'standby_write' and 'standby_flush' modes are useful for scenarios where
applications need to ensure WAL has been received or persisted on the standby
without necessarily waiting for replay to complete. The 'primary_flush' mode
allows waiting for WAL to be flushed on the primary server.

This commit also includes includes:
- Documentation updates for the new syntax and mode descriptions,
- Test coverage for all four modes, including error cases and concurrent
  waiters,
- Wakeup logic in walreceiver for standby write/flush waiters,
- Wakeup logic in WAL writer for primary flush waiters.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
2026-01-05 19:56:19 +02:00
Alexander Korotkov
7a39f43d88 Extend xlogwait infrastructure with write and flush wait types
Add support for waiting on WAL write and flush LSNs in addition to the
existing replay LSN wait type. This provides the foundation for
extending the WAIT FOR command with MODE parameter.

Key changes are following.
- Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to
  WaitLSNType.
- Add GetCurrentLSNForWaitType() to retrieve the current LSN for each wait
  type.
- Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and
  WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility.
- Update WaitForLSN() to use GetCurrentLSNForWaitType() internally.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
2026-01-05 19:56:19 +02:00
Alexander Korotkov
d51a5d8e56 Adjust errcode in checkPartition()
Replace ERRCODE_UNDEFINED_TABLE with ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE
for the case where we don't find a parent-child relationship between the
partitioned table and its partition.  In this case, tables are present, but
they are not in a prerequisite state (no relationship).

Discussion: https://postgr.es/m/CAHewXNmBM%2B5qbrJMu60NxPn%2B0y-%3D2wXM-QVVs3xRp8NxFvDb9A%40mail.gmail.com
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
2026-01-05 19:56:19 +02:00
Michael Paquier
877ae5db89 Fix comment in tableam.c
Author: shiyu qin <qinshy510@gmail.com>
Discussion: https://postgr.es/m/CAJUCM3uJjoLR1zfKoZD4J71T-hdeFdFw1kTQoMkywKZP0hZsvw@mail.gmail.com
2026-01-05 19:15:55 +09:00
Heikki Linnakangas
461b8cc952 Tighten up assertion on a local variable
'lineindex' is 0-based, as mentioned in the comments.

Backpatch to v18 where the assertion was added.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/tencent_A84F3C810365BB9BD08442955AE494141907@qq.com
Backpatch-through: 18
2026-01-05 11:33:35 +02:00
David Rowley
4c144e0452 Use the GetPGProcByNumber() macro when possible
A few places were accessing &ProcGlobal->allProcs directly, so adjust
them to use the accessor macro instead.

Author: Maksim Melnikov <m.melnikov@postgrespro.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/80621c00-aba6-483c-88b1-a845461d1165@postgrespro.ru
2026-01-05 21:19:03 +13:00
Amit Kapila
3f906d3af9 Improve the comments atop build_replindex_scan_key().
Author: zourenli <398740848@qq.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/tencent_C2DC8157CC05C8F5C36E12678A7864554809@qq.com
2026-01-05 03:06:55 +00:00
Michael Paquier
b8cfcb9e00 Fix typos and inconsistencies in code and comments
This change is a cocktail of harmonization of function argument names,
grammar typos, renames for better consistency and unused code (see
ltree).  All of these have been spotted by the author.

Author: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/b2c0d0b7-3944-487d-a03d-d155851958ff@gmail.com
2026-01-05 09:19:15 +09:00
Tom Lane
ba75f71752 Include error location in errors from ComputeIndexAttrs().
Make use of IndexElem's new location field to localize these
errors better.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxH3OgXF1hrzGAaWyNtye2jHEmk9JbtrtGv-KJK6tsGo5w@mail.gmail.com
2026-01-04 14:16:20 -05:00
Tom Lane
62299bbd90 Add parse location to IndexElem.
This patch mostly just fills in the field, although a few error
reports in resolve_unique_index_expr() are adjusted to use it.
The next commit will add more uses.

catversion bump out of an abundance of caution: I'm not sure
IndexElem can appear in stored rules, but I'm not sure it can't
either.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Co-authored-by: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxH3OgXF1hrzGAaWyNtye2jHEmk9JbtrtGv-KJK6tsGo5w@mail.gmail.com
Discussion: https://postgr.es/m/202512121327.f2zimsr6guso@alvherre.pgsql
2026-01-04 14:16:20 -05:00
Tom Lane
54315fde73 Improve a couple of error messages.
Change "function" to "function or procedure" in
PreventInTransactionBlock, and improve grammar of ExecWaitStmt's
complaint about having an active snapshot.

Author: Pavel Stehule <pavel.stehule@gmail.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Marcos Pegoraro <marcos@f10.com.br>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFj8pRCveWPR06bbad9GnMb0Kcr6jnXPttv9XOaOB+oFCD1Tsg@mail.gmail.com
2026-01-03 17:18:39 -05:00
Bruce Momjian
451c43974f Update copyright for 2026
Backpatch-through: 14
2026-01-01 13:24:10 -05:00
Andrew Dunstan
f3c9e341cd Add paths of extensions to pg_available_extensions
Add a new "location" column to the pg_available_extensions and
pg_available_extension_versions views, exposing the directory where
the extension is located.

The default system location is shown as '$system', the same value
that can be used to configure the extension_control_path GUC.

User-defined locations are only visible for super users, otherwise
'<insufficient privilege>' is returned as a column value, the same
behaviour that we already use in pg_stat_activity.

I failed to resist the temptation to do a little extra editorializing of
the TAP test script.

Catalog version bumped.

Author: Matheus Alcantara <mths.dev@pm.me>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Reviewed-By: Rohit Prasad <rohit.prasad@arm.com>
Reviewed-By: Michael Banck <mbanck@gmx.net>
Reviewed-By: Manni Wood <manni.wood@enterprisedb.com>
Reviewed-By: Euler Taveira <euler@eulerto.com>
Reviewed-By: Quan Zongliang <quanzongliang@yeah.net>
2026-01-01 12:13:59 -05:00
Masahiko Sawada
85d5bd308b Fix macro name for io_uring_queue_init_mem check.
Commit f54af9f267 added a check for
io_uring_queue_init_mem(). However, it used the macro name
HAVE_LIBURING_QUEUE_INIT_MEM in both meson.build and the C code, while
the Autotools build script defined HAVE_IO_URING_QUEUE_INIT_MEM. As a
result, the optimization was never enabled in builds configured with
Autotools, as the C code checked for the wrong macro name.

This commit changes the macro name to HAVE_IO_URING_QUEUE_INIT_MEM in
meson.build and the C code. This matches the actual function
name (io_uring_queue_init_mem), following the standard HAVE_<FUNCTION>
convention.

Backpatch to 18, where the macro was introduced.

Bug: #19368
Reported-by: Evan Si <evsi@amazon.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19368-016d79a7f3a1c599@postgresql.org
Backpatch-through: 18
2025-12-31 11:18:14 -08:00
Thomas Munro
915711c8a4 jit: Fix jit_profiling_support when unavailable.
jit_profiling_support=true captures profile data for Linux perf.  On
other platforms, LLVMCreatePerfJITEventListener() returns NULL and the
attempt to register the listener would crash.

Fix by ignoring the setting in that case.  The documentation already
says that it only has an effect if perf support is present, and we
already did the same for older LLVM versions that lacked support.

No field reports, unsurprisingly for an obscure developer-oriented
setting.  Noticed in passing while working on commit 1a28b4b4.

Backpatch-through: 14
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJgB6gvrdDohgwLfCwzVQm%3DVMtb9m0vzQn%3DCwWn-kwG9w%40mail.gmail.com
2025-12-31 14:50:23 +13:00
Tom Lane
bc6374cd76 Change IndexAmRoutines to be statically-allocated structs.
Up to now, index amhandlers were expected to produce a new, palloc'd
struct on each call.  That requires palloc/pfree overhead, and creates
a risk of memory leaks if the caller fails to pfree, and the time
taken to fill such a large structure isn't nil.  Moreover, we were
storing these things in the relcache, eating several hundred bytes for
each cached index.  There is not anything in these structs that needs
to vary at runtime, so let's change the definition so that an
amhandler can return a pointer to a "static const" struct of which
there's only one copy per index AM.  Mark all the core code's
IndexAmRoutine pointers const so that we catch anyplace that might
still try to change or pfree one.

(This is similar to the way we were already handling TableAmRoutine
structs.  This commit does fix one comment that was infelicitously
copied-and-pasted into tableamapi.c.)

This commit needs to be called out in the v19 release notes as an API
change for extension index AMs.  An un-updated AM will still work
(as of now, anyway) but it risks memory leaks and will be slower than
necessary.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAEoWx2=vApYk2LRu8R0DdahsPNEhWUxGBZ=rbZo1EXE=uA+opQ@mail.gmail.com
2025-12-30 18:26:23 -05:00
Masahiko Sawada
736f754eed Add dead items memory usage to VACUUM (VERBOSE) and autovacuum logs.
This commit adds the total memory allocated during vacuum, the number
of times the dead items storage was reset, and the configured memory
limit. This helps users understand how much memory VACUUM required,
and such information can be used to avoid multiple index scans.

Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAHza6qcPitBCkyiKJosDTt3bmxMvzZOTONoebwCkBZrr3rk65Q%40mail.gmail.com
2025-12-30 13:12:10 -08:00
Masahiko Sawada
2a5225b99d Fix a race condition in updating procArray->replication_slot_xmin.
Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots without holding ProcArrayLock (when
already_locked is false), acquiring the lock just before updating the
replication slot xmin.

This could lead to a race condition: if a backend created a new slot
and updates the global replication slot xmin, another backend
concurrently running ReplicationSlotsComputeRequiredXmin() could
overwrite that update with an invalid or stale value. This happens
because the concurrent backend might have computed the aggregate xmin
before the new slot was accounted for, but applied the update after
the new slot had already updated the global value.

In the reported failure, a walsender for an apply worker computed
InvalidTransactionId as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker. Consequently, the tablesync worker computed a transaction ID
via GetOldestSafeDecodingTransactionId() effectively without
considering the replication slot xmin. This led to the error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to commit
240e0dbacd.

To fix this, we acquire ReplicationSlotControlLock in exclusive mode
during slot creation to perform the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we hold
ReplicationSlotControlLock in shared mode until the global slot xmin
is updated in ProcArraySetReplicationSlotXmin(). This prevents
concurrent computations and updates of the global xmin by other
backends during the initial slot xmin update process, while still
permitting concurrent calls to ReplicationSlotsComputeRequiredXmin().

Backpatch to all supported versions.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com>
Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 14
2025-12-30 10:56:30 -08:00
Michael Paquier
ffdcc9c638 Fix comment in lsyscache.c
Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2miv0KGcM9j29ANRN45-Vz-2qAqrM0cv9OtaLx8e_WCMQ@mail.gmail.com
2025-12-30 16:42:21 +09:00
Thomas Munro
1a28b4b455 jit: Drop redundant LLVM configure probes.
We currently require LLVM 14, so these probes for LLVM 9 functions
always succeeded.  Even when the features aren't enabled in an LLVM
build, dummy functions are defined (a problem for a later commit).

The whole PGAC_CHECK_LLVM_FUNCTIONS macro and Meson equivalent are
removed, because we switched to testing LLVM_VERSION_MAJOR at compile
time in subsequent work and these were the last holdouts.  That suits
the nature of LLVM API evolution better, and also allows for strictly
mechanical pruning in future commits like 820b5af7 and 972c2cd2.  They
advanced the minimum LLVM version but failed to spot these.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJgB6gvrdDohgwLfCwzVQm%3DVMtb9m0vzQn%3DCwWn-kwG9w%40mail.gmail.com
2025-12-30 20:24:42 +13:00
Michael Paquier
97b101776c Add pg_get_multixact_stats()
This new function exposes at SQL level some information related to
multixacts, not available until now.  This data is useful for monitoring
purposes, especially for workloads that make a heavy use of multixacts:
- num_mxids, number of MultiXact IDs in use.
- num_members, number of member entries in use.
- members_size, bytes used by num_members in pg_multixact/members/.
- oldest_multixact: oldest MultiXact still needed.

This patch has been originally proposed when MultiXactOffset was still
32 bits, to monitor wraparound.  This part is not relevant anymore since
bd8d9c9bdf that has widen MultiXactOffset to 64 bits.  The monitoring
of disk space usage for the members is still relevant.

Some tests are added to check this function, in the shape of one
isolation test with concurrent transactions that take a ROW SHARE lock,
and some SQL tests for pg_read_all_stats.  Some documentation is added
to explain some patterns that can come from the information provided by
the function.

Bump catalog version.

Author: Naga Appani <nagnrik@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com
2025-12-30 15:38:50 +09:00
Michael Paquier
9cf746a453 Change GetMultiXactInfo() to return the next multixact offset
This routine returned a number of members as a MultiXactOffset,
calculated based on the difference between the next-to-be-assigned
offset and the oldest offset.  However, this number is not actually an
offset but a number.

This type confusion comes from the original implementation of
MultiXactMemberFreezeThreshold(), in 53bb309d2d.  The number of
members is now defined as a uint64, large enough for MultiXactOffset.
This change will be used in a follow-up patch.

Reviewed-by: Naga Appani <nagnrik@gmail.com>
Discussion: https://postgr.es/m/aUyTvZMq2CLgNEB4@paquier.xyz
2025-12-30 14:03:49 +09:00
Thomas Munro
7da9d8f2db jit: Remove -Wno-deprecated-declarations in 18+.
REL_18_STABLE and master have commit ee485912, so they always use the
newer LLVM opaque pointer functions.  Drop -Wno-deprecated-declarations
(commit a56e7b660) for code under jit/llvm in those branches, to catch
any new deprecation warnings that arrive in future version of LLVM.

Older branches continued to use functions marked deprecated in LLVM 14
and 15 (ie switched to the newer functions only for LLVM 16+), as a
precaution against unforeseen compatibility problems with bitcode
already shipped.  In those branches, the comment about warning
suppression is updated to explain that situation better.  In theory we
could suppress warnings only for LLVM 14 and 15 specifically, but that
isn't done here.

Backpatch-through: 14
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1407185.1766682319%40sss.pgh.pa.us
2025-12-30 14:34:32 +13:00
Tom Lane
bd3e3e9e56 Ensure sanity of hash-join costing when there are no MCV statistics.
estimate_hash_bucket_stats is defined to return zero to *mcv_freq if
it cannot obtain a value for the frequency of the most common value.
Its sole caller final_cost_hashjoin ignored this provision and would
blindly believe the zero value, resulting in computing zero for the
largest bucket size.  In consequence, the safety check that intended
to prevent the largest bucket from exceeding get_hash_memory_limit()
was ineffective, allowing very silly plans to be chosen if statistics
were missing.

After fixing final_cost_hashjoin to disregard zero results for
mcv_freq, a second problem appeared: some cases that should use hash
joins failed to.  This is because estimate_hash_bucket_stats was
unaware of the fact that ANALYZE won't store MCV statistics if it
doesn't find any multiply-occurring values.  Thus the lack of an MCV
stats entry doesn't necessarily mean that we know nothing; we may
well know that the column is unique.  The former coding returned zero
for *mcv_freq in this case, which was pretty close to correct, but now
final_cost_hashjoin doesn't believe it and disables the hash join.
So check to see if there is a HISTOGRAM stats entry; if so, ANALYZE
has in fact run for this column and must have found it to be unique.
In that case report the MCV frequency as 1 / rows, instead of claiming
ignorance.

Reporting a more accurate *mcv_freq in this case can also affect the
bucket-size skew adjustment further down in estimate_hash_bucket_stats,
causing hash-join cost estimates to change slightly.  This affects
some plan choices in the core regression tests.  The first diff in
join.out corresponds to a case where we have no stats and should not
risk a hash join, but the remaining changes are caused by producing
a better bucket-size estimate for unique join columns.  Those are all
harmless changes so far as I can tell.

The existing behavior was introduced in commit 4867d7f62 in v11.
It appears from the commit log that disabling the bucket-size safety
check in the absence of statistics was intentional; but we've now seen
a case where the ensuing behavior is bad enough to make that seem like
a poor decision.  In any case the lack of other problems with that
safety check after several years helps to justify enforcing it more
strictly.  However, we won't risk back-patching this, in case any
applications are depending on the existing behavior.

Bug: #19363
Reported-by: Jinhui Lai <jinhui.lai@qq.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2380165.1766871097@sss.pgh.pa.us
Discussion: https://postgr.es/m/19363-8dd32fc7600a1153@postgresql.org
2025-12-29 13:01:27 -05:00
Richard Guo
559f9e90db Ignore PlaceHolderVars when looking up statistics
When looking up statistical data about an expression, we failed to
look through PlaceHolderVar nodes, treating them as opaque.  This
could prevent us from matching an expression to base columns, index
expressions, or extended statistics, as examine_variable() relies on
strict structural matching.

As a result, queries involving PlaceHolderVar nodes often fell back to
default selectivity estimates, potentially leading to poor plan
choices.

This patch updates examine_variable() to strip PlaceHolderVars before
analysis.  This is safe during estimation because PlaceHolderVars are
transparent for the purpose of statistics lookup: they do not alter
the value distribution of the underlying expression.

To minimize performance overhead on this hot path, a lightweight
walker first checks for the presence of PlaceHolderVars.  The more
expensive mutator is invoked only when necessary.

There is one ensuing plan change in the regression tests, which is
expected and demonstrates the fix: the rowcount estimate becomes much
more accurate with this patch.

Back-patch to v18.  Although this issue exists before that, changes in
this version made it common enough to notice.  Given the lack of field
reports for older versions, I am not back-patching further.

Reported-by: Haowu Ge <gehaowu@bitmoe.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com
Backpatch-through: 18
2025-12-29 11:40:45 +09:00
Richard Guo
ad66f705fa Strip PlaceHolderVars from index operands
When pulling up a subquery, we may need to wrap its targetlist items
in PlaceHolderVars to enforce separate identity or as a result of
outer joins.  However, this causes any upper-level WHERE clauses
referencing these outputs to contain PlaceHolderVars, which prevents
indxpath.c from recognizing that they could be matched to index
columns or index expressions, potentially affecting the planner's
ability to use indexes.

To fix, explicitly strip PlaceHolderVars from index operands.  A
PlaceHolderVar appearing in a relation-scan-level expression is
effectively a no-op.  Nevertheless, to play it safe, we strip only
PlaceHolderVars that are not marked nullable.

The stripping is performed recursively to handle cases where
PlaceHolderVars are nested or interleaved with other node types.  To
minimize performance impact, we first use a lightweight walker to
check for the presence of strippable PlaceHolderVars.  The expensive
mutator is invoked only if a candidate is found, avoiding unnecessary
memory allocation and tree copying in the common case where no
PlaceHolderVars are present.

Back-patch to v18.  Although this issue exists before that, changes in
this version made it common enough to notice.  Given the lack of field
reports for older versions, I am not back-patching further.

Reported-by: Haowu Ge <gehaowu@bitmoe.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com
Backpatch-through: 18
2025-12-29 11:38:49 +09:00
Peter Eisentraut
b7057e4346 Change some Datum to void * for opaque pass-through pointer
Here, Datum was used to pass around an opaque pointer between a group
of functions.  But one might as well use void * for that; the use of
Datum doesn't achieve anything here and is just distracting.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/1c5d23cb-288b-4154-b1cd-191fe2301707%40eisentraut.org
2025-12-28 14:34:12 +01:00
Michael Paquier
9adf32da6b Split some long Makefile lists
This change makes more readable code diffs when adding new items or
removing old items, while ensuring that lines do not get excessively
long.  Some SUBDIRS, PROGRAMS and REGRESS lists are split.

Note that there are a few more REGRESS lists that could be split,
particularly in contrib/.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Co-Authored-By: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/DF6HDGB559U5.3MPRFCWPONEAE@jeltef.nl
2025-12-28 09:17:42 +09:00
Michael Paquier
36b8f4974a Fix pg_stat_get_backend_activity() to use multi-byte truncated result
pg_stat_get_backend_activity() calls pgstat_clip_activity() to ensure
that the reported query string is correctly truncated when it finishes
with an incomplete multi-byte sequence.  However, the result returned by
the function was not what pgstat_clip_activity() generated, but the
non-truncated, original, contents from PgBackendStatus.st_activity_raw.

Oversight in 54b6cd589a, so backpatch all the way down.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2mDzwc48q2EK9tSXS6iJMJ35wvxNQnHX+rXjy5VgLvJQw@mail.gmail.com
Backpatch-through: 14
2025-12-27 17:23:30 +09:00
Michael Paquier
bde3a46160 Upgrade BufFile to use int64 for byte positions
This change has the advantage of removing some weird type casts, caused
by offset calculations based on pgoff_t but saved as int (on older
branches we use off_t, which could be 4 or 8 bytes depending on the
environment).  These are safe currently because capped by
MAX_PHYSICAL_FILESIZE, but we would run into problems when to make
MAX_PHYSICAL_FILESIZE larger or allow callers of these routines to use a
larger physical max size on demand.

While on it, this improves BufFileDumpBuffer() so as we do not use an
offset for "availbytes".  It is not a file offset per-set, but a number
of available bytes.

This change should lead to no functional changes.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz
2025-12-26 08:41:56 +09:00
Michael Paquier
eee19a30d6 Fix typo in stat_utils.c
Introduced by 213a1b8952.

Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNku-jz-FPKeJVk25fZ1pV2buYh5vpeqGDOB=bFQhKxXhw@mail.gmail.com
2025-12-26 07:53:46 +09:00
Michael Paquier
213a1b8952 Move attribute statistics functions to stat_utils.c
Many of the operations done for attribute stats in attribute_stats.c
share the same logic as extended stats, as done by a patch under
discussion to add support for extended stats import and export.  All the
pieces necessary for extended statistics are moved to stats_utils.c,
which is the file where common facilities are shared for stats files.

The following renames are done:
* get_attr_stat_type() -> statatt_get_type()
* init_empty_stats_tuple() -> statatt_init_empty_tuple()
* set_stats_slot() -> statatt_set_slot()
* get_elem_stat_type() -> statatt_get_elem_type()

While on it, this commit adds more documentation for all these
functions, describing more their internals and the dependencies that
have been implied for attribute statistics.  The same concepts apply to
extended statistics, at some degree.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Yu Wang <wangyu_runtime@163.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
2025-12-25 15:13:39 +09:00
Richard Guo
325808cac9 Fix planner error with SRFs and grouping sets
If there are any SRFs in a PathTarget, we must separate it into
SRF-computing and SRF-free targets.  This is because the executor can
only handle SRFs that appear at the top level of the targetlist of a
ProjectSet plan node.

If we find a subexpression that matches an expression already computed
in the previous plan level, we should treat it like a Var and should
not split it again.  setrefs.c will later replace the expression with
a Var referencing the subplan output.

However, when processing the grouping target for grouping sets, the
planner can fail to recognize that an expression is already computed
in the scan/join phase.  The root cause is a mismatch in the
nullingrels bits.  Expressions in the grouping target carry the
grouping nulling bit in their nullingrels to indicate that they can be
nulled by the grouping step.  However, the corresponding expressions
in the scan/join target do not have these bits.

As a result, the exact match check in list_member() fails, leading the
planner to incorrectly believe that the expression needs to be
re-evaluated from its arguments, which are often not available in the
subplan.  This can lead to planner errors such as "variable not found
in subplan target list".

To fix, ignore the grouping nulling bit when checking whether an
expression from the grouping target is available in the pre-grouping
input target.  This aligns with the matching logic in setrefs.c.

Backpatch to v18, where this issue was introduced.

Bug: #19353
Reported-by: Marian MULLER REBEYROL <marian.muller@serli.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/19353-aaa179bba986a19b@postgresql.org
Backpatch-through: 18
2025-12-25 12:12:52 +09:00
Fujii Masao
d1b35952da Fix CREATE SUBSCRIPTION failure when the publisher runs on pre-PG19.
CREATE SUBSCRIPTION with copy_data=true and origin='none' previously
failed when the publisher was running a version earlier than PostgreSQL 19,
even though this combination should be supported.

The failure occurred because the command issued a query calling
pg_get_publication_sequences function on the publisher. That function
does not exist before PG19 and the query is only needed for logical
replication sequence synchronization, which is supported starting in PG19.

This commit fixes this issue by skipping that query when the
publisher runs a version earlier than PG19.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwEx4twHtJdiPWTyAXJhcBPLaH467SH2ajGSe-41m65giA@mail.gmail.com
2025-12-24 23:45:19 +09:00
Fujii Masao
5e813edb55 Fix version check for retain_dead_tuples subscription option.
The retain_dead_tuples subscription option is supported only when
the publisher runs PostgreSQL 19 or later. However, it could previously
be enabled even when the publisher was running an earlier version.

This was caused by check_pub_dead_tuple_retention() comparing
the publisher server version against 19000 instead of 190000.

Fix this typo so that the version check correctly enforces the PG19+
requirement.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwEx4twHtJdiPWTyAXJhcBPLaH467SH2ajGSe-41m65giA@mail.gmail.com
2025-12-24 23:25:00 +09:00
Richard Guo
c8d2f68cc8 Teach expr_is_nonnullable() to handle more expression types
Currently, the function expr_is_nonnullable() checks only Const and
Var expressions to determine if an expression is non-nullable.  This
patch extends the detection logic to handle more expression types.

This can enable several downstream optimizations, such as reducing
NullTest quals to constant truth values (e.g., "COALESCE(var, 1) IS
NULL" becomes FALSE) and converting "COUNT(expr)" to the more
efficient "COUNT(*)" when the expression is proven non-nullable.

This breaks a test case in test_predtest.sql, since we now simplify
"ARRAY[] IS NULL" to constant FALSE, preventing it from weakly
refuting a strict ScalarArrayOpExpr ("x = any(ARRAY[])").  To ensure
the refutation logic is still exercised as intended, wrap the array
argument in opaque_array().

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
2025-12-24 18:00:44 +09:00
Richard Guo
cb7b7ec7a1 Optimize ROW(...) IS [NOT] NULL using non-nullable fields
We break ROW(...) IS [NOT] NULL into separate tests on its component
fields.  During this breakdown, we can improve efficiency by utilizing
expr_is_nonnullable() to detect fields that are provably non-nullable.

If a component field is proven non-nullable, it affects the outcome
based on the test type.  For an IS NULL test, a single non-nullable
field refutes the whole NullTest, reducing it to constant FALSE.  For
an IS NOT NULL test, the check for that specific field is guaranteed
to succeed, so we can discard it from the list of component tests.

This extends the existing optimization logic, which previously only
handled Const fields, to support any expression that can be proven
non-nullable.

In passing, update the existing constant folding of NullTests to use
expr_is_nonnullable() instead of var_is_nonnullable(), enabling it to
benefit from future improvements to that function.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
2025-12-24 18:00:02 +09:00
Richard Guo
10c4fe074a Simplify COALESCE expressions using non-nullable arguments
The COALESCE function returns the first of its arguments that is not
null.  When an argument is proven non-null, if it is the first
non-null-constant argument, the entire COALESCE expression can be
replaced by that argument.  If it is a subsequent argument, all
following arguments can be dropped, since they will never be reached.

Currently, we perform this simplification only for Const arguments.
This patch extends the simplification to support any expression that
can be proven non-nullable.

This can help avoid the overhead of evaluating unreachable arguments.
It can also lead to better plans when the first argument is proven
non-nullable and replaces the expression, as the planner no longer has
to treat the expression as non-strict, and can also leverage index
scans on the resulting expression.

There is an ensuing plan change in generated_virtual.out, and we have
to modify the test to ensure that it continues to test what it is
intended to.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com
2025-12-24 17:58:49 +09:00
Michael Paquier
b947dd5c75 Improve comment in pgstatfuncs.c
Author: Zizhen Qiao <zizhen_qiao@163.com>
Discussion: https://postgr.es/m/5ee635f9.49f7.19b4ed9e803.Coremail.zizhen_qiao@163.com
2025-12-24 17:09:13 +09:00
Amit Kapila
1528b0d899 Don't advance origin during apply failure.
The logical replication parallel apply worker could incorrectly advance
the origin progress during an error or failed apply. This behavior risks
transaction loss because such transactions will not be resent by the
server.

Commit 3f28b2fcac addressed a similar issue for both the apply worker and
the table sync worker by registering a before_shmem_exit callback to reset
origin information. This prevents the worker from advancing the origin
during transaction abortion on shutdown. This patch applies the same fix
to the parallel apply worker, ensuring consistent behavior across all
worker types.

As with 3f28b2fcac, we are backpatching through version 16, since parallel
apply mode was introduced there and the issue only occurs when changes are
applied before the transaction end record (COMMIT or ABORT) is received.

Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16
Discussion: https://postgr.es/m/TY4PR01MB169078771FB31B395AB496A6B94B4A@TY4PR01MB16907.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92@TYAPR01MB5692.jpnprd01.prod.outlook.com
2025-12-24 04:36:39 +00:00
Masahiko Sawada
67c20979ce Toggle logical decoding dynamically based on logical slot presence.
Previously logical decoding required wal_level to be set to 'logical'
at server start. This meant that users had to incur the overhead of
logical-level WAL logging even when no logical replication slots were
in use.

This commit adds functionality to automatically control logical
decoding availability based on logical replication slot presence. The
newly introduced module logicalctl.c allows logical decoding to be
dynamically activated when needed when wal_level is set to
'replica'.

When the first logical replication slot is created, the system
automatically increases the effective WAL level to maintain
logical-level WAL records. Conversely, after the last logical slot is
dropped or invalidated, it decreases back to 'replica' WAL level.

While activation occurs synchronously right after creating the first
logical slot, deactivation happens asynchronously through the
checkpointer process. This design avoids a race condition at the end
of recovery; a concurrent deactivation could happen while the startup
process enables logical decoding at the end of recovery, but WAL
writes are still not permitted until recovery fully completes. The
checkpointer will handle it after recovery is done. Asynchronous
deactivation also avoids excessive toggling of the logical decoding
status in workloads that repeatedly create and drop a single logical
slot. On the other hand, this lazy approach can delay changes to
effective_wal_level and the disabling logical decoding, especially
when the checkpointer is busy with other tasks. We chose this lazy
approach in all deactivation paths to keep the implementation simple,
even though laziness is strictly required only for end-of-recovery
cases. Future work might address this limitation either by using a
dedicated worker instead of the checkpointer, or by implementing
synchronous waiting during slot drops if workloads are significantly
affected by the lazy deactivation of logical decoding.

The effective WAL level, determined internally by XLogLogicalInfo, is
allowed to change within a transaction until an XID is assigned. Once
an XID is assigned, the value becomes fixed for the remainder of the
transaction. This behavior ensures that the logging mode remains
consistent within a writing transaction, similar to the behavior of
GUC parameters.

A new read-only GUC parameter effective_wal_level is introduced to
monitor the actual WAL level in effect. This parameter reflects the
current operational WAL level, which may differ from the configured
wal_level setting.

Bump PG_CONTROL_VERSION as it adds a new field to CheckPoint struct.

Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAD21AoCVLeLYq09pQPaWs+Jwdni5FuJ8v2jgq-u9_uFbcp6UbA@mail.gmail.com
2025-12-23 10:13:16 -08:00
Heikki Linnakangas
955f550686 Fix bug in following update chain when locking a heap tuple
After waiting for a concurrent updater to finish, heap_lock_tuple()
followed the update chain to lock all tuple versions. However, when
stepping from the initial tuple to the next one, it failed to check
that the next tuple's XMIN matches the initial tuple's XMAX. That's an
important check whenever following an update chain, and the recursive
part that follows the chain did it, but the initial step missed it.
Without the check, if the updating transaction aborts, the updated
tuple is vacuumed away and replaced by an unrelated tuple, the
unrelated tuple might get incorrectly locked.

Author: Jasper Smit <jasper.smit@servicenow.com>
Discussion: https://www.postgresql.org/message-id/CAOG+RQ74x0q=kgBBQ=mezuvOeZBfSxM1qu_o0V28bwDz3dHxLw@mail.gmail.com
Backpatch-through: 14
2025-12-23 13:37:16 +02:00
Michael Paquier
8e0d32a4a1 Fix orphaned origin in shared memory after DROP SUBSCRIPTION
Since ce0fdbfe97, a replication slot and an origin are created by each
tablesync worker, whose information is stored in both a catalog and
shared memory (once the origin is set up in the latter case).  The
transaction where the origin is created is the same as the one that runs
the initial COPY, with the catalog state of the origin becoming visible
for other sessions only once the COPY transaction has committed.  The
catalog state is coupled with a state in shared memory, initialized at
the same time as the origin created in the catalogs.  Note that the
transaction doing the initial data sync can take a long time, time that
depends on the amount of data to transfer from a publication node to its
subscriber node.

Now, when a DROP SUBSCRIPTION is executed, all its workers are stopped
with the origins removed.  The removal of each origin relies on a
catalog lookup.  A worker still running the initial COPY would fail its
transaction, with the catalog state of the origin rolled back while the
shared memory state remains around.  The session running the DROP
SUBSCRIPTION should be in charge of cleaning up the catalog and the
shared memory state, but as there is no data in the catalogs the shared
memory state is not removed.  This issue would leave orphaned origin
data in shared memory, leading to a confusing state as it would still
show up in pg_replication_origin_status.  Note that this shared memory
data is sticky, being flushed on disk in replorigin_checkpoint at
checkpoint.  This prevents other origins from reusing a slot position
in the shared memory data.

To address this problem, the commit moves the creation of the origin at
the end of the transaction that precedes the one executing the initial
COPY, making the origin immediately visible in the catalogs for other
sessions, giving DROP SUBSCRIPTION a way to know about it.  A different
solution would have been to clean up the shared memory state using an
abort callback within the tablesync worker.  The solution of this commit
is more consistent with the apply worker that creates an origin in a
short transaction.

A test is added in the subscription test 004_sync.pl, which was able to
display the problem.  The test fails when this commit is reverted.

Reported-by: Tenglong Gu <brucegu@amazon.com>
Reported-by: Daisuke Higuchi <higudai@amazon.com>
Analyzed-by: Michael Paquier <michael@paquier.xyz>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/aUTekQTg4OYnw-Co@paquier.xyz
Backpatch-through: 14
2025-12-23 14:32:14 +09:00
Michael Paquier
e5f3839af6 Switch buffile.c/h to use pgoff_t instead of off_t
off_t was previously used for offsets, which is 4 bytes on Windows,
hence limiting the backend code to a hard limit for files longer than
2GB.  This leads to some simplification in these files, removing some
casts based on long, also 4 bytes on Windows.

This commit removes one comment introduced in db3c4c3a2d, not relevant
anymore as pgoff_t is a safe 8-byte alternative on Windows.

This change is surprisingly not invasive, as the callers of
BufFileTell(), BufFileSeek() and BufFileTruncateFileSet() (worker.c,
tuplestore.c, etc.) track offsets in local structures that just to
switch from off_t to pgoff_t for the most part.

The file is still relying on a maximum file size of
MAX_PHYSICAL_FILESIZE (1GB).  This change allows the code to make this
maximum potentially larger in the future, or larger on a per-demand
basis.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz
2025-12-23 07:41:34 +09:00
Michael Paquier
d31276f0e2 Fix another typo in gininsert.c
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNkRJ9DMFZMQKWQ32U+OTBR78KeGh2=9Wy5jEeWDxMVFcQ@mail.gmail.com
2025-12-22 12:38:40 +09:00
Peter Geoghegan
fab5cd3dd1 Remove obsolete name_ops index-only scan comments.
nbtree index-only scans of an index that uses btree/name_ops as one of
its index column's input opclasses are no longer at any risk of reading
past the end of currTuples.  We're no longer reliant on such scans being
able to at least read from the start of markTuples storage (which uses
space from the same allocation as currTuples) to avoid a segfault:
StoreIndexTuple (from nodeIndexonlyscan.c) won't actually read past the
end of a cstring datum from a name_ops index.  In other words, we
already have the "special-case treatment for name_ops" that the removed
comment supposed we could avoid by relying on markTuples in this way.

Oversight in commit a63224be49, which added special case handling of
name_ops cstrings to StoreIndexTuple, but missed these comments.
2025-12-21 12:27:38 -05:00
Andres Freund
548de59d93 heapam: Move logic to handle HEAP_MOVED into a helper function
Before we dealt with this in 6 near identical and one very similar copy.

The helper function errors out when encountering a
HEAP_MOVED_IN/HEAP_MOVED_OUT tuple with xvac considered current or
in-progress. It'd be preferrable to do that change separately, but otherwise
it'd not be possible to deduplicate the handling in
HeapTupleSatisfiesVacuum().

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6
Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
2025-12-19 13:28:34 -05:00
Andres Freund
09ae2c8bac bufmgr: Optimize & harmonize LockBufHdr(), LWLockWaitListLock()
The main optimization is for LockBufHdr() to delay initializing
SpinDelayStatus, similar to what LWLockWaitListLock already did. The
initialization is sufficiently expensive & buffer header lock acquisitions are
sufficiently frequent, to make it worthwhile to instead have a fastpath (via a
likely() branch) that does not initialize the SpinDelayStatus.

While LWLockWaitListLock() already the aforementioned optimization, it did not
use likely(), and inspection of the assembly shows that this indeed leads to
worse code generation (also observed in a microbenchmark). Fix that by adding
the likely().

While the LockBufHdr() improvement is a small gain on its own, it mainly is
aimed at preventing a regression after a future commit, which requires
additional locking to set hint bits.

While touching both, also make the comments more similar to each other.

Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-19 13:23:33 -05:00
Heikki Linnakangas
47a9f61fca Use proper type for RestoreTransactionSnapshot's PGPROC arg
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/08cbaeb5-aaaf-47b6-9ed8-4f7455b0bc4b@iki.fi
2025-12-19 13:40:02 +02:00
Michael Paquier
5cdbec5aa9 Fix typos in gininsert.c
Introduced by 8492feb98f.

Author: Xingbin She <xingbin.she@qq.com>
Discussion: https://postgr.es/m/tencent_C254AE962588605F132DB4A6F87205D6A30A@qq.com
2025-12-19 14:33:38 +09:00
Fujii Masao
b3ccb0a2cb Add guard to prevent recursive memory context logging.
Previously, if memory context logging was triggered repeatedly and
rapidly while a previous request was still being processed, it could
result in recursive calls to ProcessLogMemoryContextInterrupt().
This could lead to infinite recursion and potentially crash the process.

This commit adds a guard to prevent such recursion.
If ProcessLogMemoryContextInterrupt() is already in progress and
logging memory contexts, subsequent calls will exit immediately,
avoiding unintended recursive calls.

While this scenario is unlikely in practice, it's not impossible.
This change adds a safety check to prevent such failures.

Back-patch to v14, where memory context logging was introduced.

Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Artem Gavrilov <artem.gavrilov@percona.com>
Discussion: https://postgr.es/m/CA+TgmoZMrv32tbNRrFTvF9iWLnTGqbhYSLVcrHGuwZvCtph0NA@mail.gmail.com
Backpatch-through: 14
2025-12-19 12:05:37 +09:00
Michael Paquier
9d0f7996e5 Use table/index_close() more consistently
All the code paths updated here have been using relation_close() to
close a relation that has already been opened with table_open() or
index_open(), where a relkind check is enforced.

table_close() and index_open() do the same thing as relation_close(), so
there was no harm, but being inconsistent could lead to issues if the
internals of these close() functions begin to introduce some logic
specific to each relkind in the future.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aUKamYGiDKO6byp5@ip-10-97-1-34.eu-west-3.compute.internal
2025-12-19 07:55:58 +09:00
Heikki Linnakangas
951b60f7ab Do not emit WAL for unlogged BRIN indexes
Operations on unlogged relations should not be WAL-logged. The
brin_initialize_empty_new_buffer() function didn't get the memo.

The function is only called when a concurrent update to a brin page
uses up space that we're just about to insert to, which makes it
pretty hard to hit. If you do manage to hit it, a full-page WAL record
is erroneously emitted for the unlogged index. If you then crash,
crash recovery will fail on that record with an error like this:

    FATAL:  could not create file "base/5/32819": File exists

Author: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/CALdSSPhpZXVFnWjwEBNcySx_vXtXHwB2g99gE6rK0uRJm-3GgQ@mail.gmail.com
Backpatch-through: 14
2025-12-18 15:08:48 +02:00
Michael Paquier
f4e797171e Change pgstat_report_vacuum() to use Relation
This change makes pgstat_report_vacuum() more consistent with
pgstat_report_analyze(), that also uses a Relation.  This enforces a
policy that callers of this routine should open and lock the relation
whose statistics are updated before calling this routine.  We will
unlikely have a lot of callers of this routine in the tree, but it seems
like a good idea to imply this requirement in the long run.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUEA6UZZkDCQFgSA@ip-10-97-1-34.eu-west-3.compute.internal
2025-12-17 11:26:17 +09:00
Michael Paquier
1d325ad99c Remove useless code in InjectionPointAttach()
strlcpy() ensures that a target string is zero-terminated, so there is
no need to enforce it a second time in this code.  This simplification
could have been done in 0eb23285a2.

Author: Feilong Meng <feelingmeng@foxmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/tencent_771178777C5BC17FCB7F7A1771CD1FFD5708@qq.com
2025-12-17 08:58:58 +09:00
Jeff Davis
0a90df58cf Avoid global LC_CTYPE dependency in pg_locale_icu.c.
ICU still depends on libc for compatibility with certain historical
behavior for single-byte encodings. Make the dependency explicit by
holding a locale_t object when required.

We should consider a better solution in the future, such as decoding
the text to UTF-32 and using u_tolower(). That would be a behavior
change and require additional infrastructure though; so for now, just
avoid the global LC_CTYPE dependency.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-16 15:32:57 -08:00
Jeff Davis
87b2968df0 downcase_identifier(): use method table from locale provider.
Previously, libc's tolower() was always used for lowercasing
identifiers, regardless of the database locale (though only characters
beyond 127 in single-byte encodings were affected). Refactor to allow
each provider to supply its own implementation of identifier
downcasing.

For historical compatibility, when using a single-byte encoding, ICU
still relies on tolower().

One minor behavior change is that, before the database default locale
is initialized, it uses ASCII semantics to downcase the
identifiers. Previously, it would use the postmaster's LC_CTYPE
setting from the environment. While that could have some effect during
GUC processing, for example, it would have been fragile to rely on the
environment setting anyway. (Also, it only matters when the encoding
is single-byte.)

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-16 15:32:41 -08:00
Robert Haas
f1a6e622bd Switch memory contexts in ReinitializeParallelDSM.
We already do this in CreateParallelContext, InitializeParallelDSM, and
LaunchParallelWorkers. I suspect the reason why the matching logic was
omitted from ReinitializeParallelDSM is that I failed to realize that
any memory allocation was happening here -- but shm_mq_attach does
allocate, which could result in a shm_mq_handle being allocated in a
shorter-lived context than the ParallelContext which points to it.

That could result in a crash if the shorter-lived context is freed
before the parallel context is destroyed. As far as I am currently
aware, there is no way to reach a crash using only code that is
present in core PostgreSQL, but extensions could potentially trip
over this. Fixing this in the back-branches appears low-risk, so
back-patch to all supported versions.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Co-authored-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com>
Backpatch-through: 14
Discussion: http://postgr.es/m/CAKZiRmwfVripa3FGo06=5D1EddpsLu9JY2iJOTgbsxUQ339ogQ@mail.gmail.com
2025-12-16 12:24:55 -05:00
Melanie Plageman
bfe5c4bec7 Add explanatory comment to prune_freeze_setup()
heap_page_prune_and_freeze() fills in PruneState->deadoffsets, the array
of OffsetNumbers of dead tuples. It is returned to the caller in the
PruneFreezeResult. To avoid having two copies of the array, the
PruneState saves only a pointer to the array. This was a bit unusual and
confusing, so add a clarifying comment.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2=jiD1nqch4JQN+odAxZSD7mRvdoHUGJYN2r6tQG_66yQ@mail.gmail.com
2025-12-16 11:04:07 -05:00
Melanie Plageman
4877391ce8 Fix const qualification in prune_freeze_setup()
The const qualification of the presult argument to prune_freeze_setup()
is later cast away, so it was not correct. Remove it and add a comment
explaining that presult should not be modified.

Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fb97d0ae-a0bc-411d-8a87-f84e7e146488%40eisentraut.org
2025-12-16 11:00:05 -05:00
John Naylor
9303d62c6d Separate out bytea sort support from varlena.c
In the wake of commit b45242fd3, bytea_sortsupport() still called out
to varstr_sortsupport(). Treating bytea as a kind of text/varchar
required varstr_sortsupport() to allow for the possibility of
NUL bytes, but only for C collation. This was confusing. For
better separation of concerns, create an independent sortsupport
implementation in bytea.c.

The heuristics for bytea_abbrev_abort() remain the same as for
varstr_abbrev_abort(). It's possible that the bytea case warrants
different treatment, but that is left for future investigation.

In passing, adjust some strange looking comparisons in
varstr_abbrev_abort().

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TP1bAbEhUJa6+rgceN6QJWMSsxhg1=mqfSN=Nb-n6DAKg@mail.gmail.com
2025-12-16 15:19:16 +07:00
Michael Paquier
15f68cebdc Add TAP test to check recovery when redo LSN is missing
This commit provides test coverage for dc7c77f825, where the redo
record and the checkpoint record finish on different WAL segments with
the start of recovery able to detect that the redo record is missing.

This test uses a wait injection point done in the critical section of a
checkpoint, method that requires not one but actually two wait injection
points to avoid any memory allocations within the critical section of
the checkpoint:
- Checkpoint run with a background psql.
- One first wait point is run by the checkpointer before the critical
section, allocating the shared memory required by the DSM registry for
the wait machinery in the library injection_points.
- First point is woken up.
- Second wait point is loaded before the critical section, allocating
the memory to build the path to the library loaded, then run in the
critical section once the checkpoint redo record has been logged.
- WAL segment is switched while waiting on the second point.
- Checkpoint completes.
- Stop cluster with immediate mode.
- The segment that includes the redo record is removed.
- Start, recovery fails as the redo record cannot be found.

The error message introduced in dc7c77f825 is now reduced to a FATAL,
meaning that the information is still provided while being able to use a
test for it.  Nitin has provided a basic version of the test, that I
have enhanced to make it portable with two points.  Without
dc7c77f825, the cluster crashes in this test, not on a PANIC but due
to the pointer dereference at the beginning of recovery, failure
mentioned in the other commit.

Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
2025-12-16 14:28:05 +09:00
Michael Paquier
dc7c77f825 Fail recovery when missing redo checkpoint record without backup_label
This commit adds an extra check at the beginning of recovery to ensure
that the redo record of a checkpoint exists before attempting WAL
replay, logging a PANIC if the redo record referenced by the checkpoint
record could not be found.  This is the same level of failure as when a
checkpoint record is missing.  This check is added when a cluster is
started without a backup_label, after retrieving its checkpoint record.
The redo LSN used for the check is retrieved from the checkpoint record
successfully read.

In the case where a backup_label exists, the startup process already
fails if the redo record cannot be found after reading a checkpoint
record at the beginning of recovery.

Previously, the presence of the redo record was not checked.  If the
redo and checkpoint records were located on different WAL segments, it
would be possible to miss a entire range of WAL records that should have
been replayed but were just ignored.  The consequences of missing the
redo record depend on the version dealt with, these becoming worse the
older the version used:
- On HEAD, v18 and v17, recovery fails with a pointer dereference at the
beginning of the redo loop, as the redo record is expected but cannot be
found.  These versions are good students, because we detect a failure
before doing anything, even if the failure is misleading in the shape of
a segmentation fault, giving no information that the redo record is
missing.
- In v16 and v15, problems show at the end of recovery within
FinishWalRecovery(), the startup process using a buggy LSN to decide
from where to start writing WAL.  The cluster gets corrupted, still it
is noisy about it.
- v14 and older versions are worse: a cluster gets corrupted but it is
entirely silent about the matter.  The redo record missing causes the
startup process to skip entirely recovery, because a missing record is
the same as not redo being required at all.  This leads to data loss, as
everything is missed between the redo record and the checkpoint record.

Note that I have tested that down to 9.4, reproducing the issue with a
version of the author's reproducer slightly modified.  The code is wrong
since at least 9.2, but I did not look at the exact point of origin.

This problem has been found by debugging a cluster where the WAL segment
including the redo segment was missing due to an operator error, leading
to a crash, based on an investigation in v15.

Requesting archive recovery with the creation of a recovery.signal or
a standby.signal even without a backup_label would mitigate the issue:
if the record cannot be found in pg_wal/, the missing segment can be
retrieved with a restore_command when checking that the redo record
exists.  This was already the case without this commit, where recovery
would re-fetch the WAL segment that includes the redo record.  The check
introduced by this commit makes the segment to be retrieved earlier to
make sure that the redo record can be found.

On HEAD, the code will be slightly changed in a follow-up commit to not
rely on a PANIC, to include a test able to emulate the original problem.
This is a minimal backpatchable fix, kept separated for clarity.

Reported-by: Andres Freund <andres@anarazel.de>
Analyzed-by: Andres Freund <andres@anarazel.de>
Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Discussion: https://postgr.es/m/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de
Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com
Backpatch-through: 14
2025-12-16 13:29:28 +09:00
Nathan Bossart
48d4a1423d Allow passing a pointer to GetNamedDSMSegment()'s init callback.
This commit adds a new "void *arg" parameter to
GetNamedDSMSegment() that is passed to the initialization callback
function.  This is useful for reusing an initialization callback
function for multiple DSM segments.

Author: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAN4CZFMjh8TrT9ZhWgjVTzBDkYZi2a84BnZ8bM%2BfLPuq7Cirzg%40mail.gmail.com
2025-12-15 14:27:16 -06:00
Noah Misch
64bf53dd61 Revisit cosmetics of "For inplace update, send nontransactional invalidations."
This removes a never-used CacheInvalidateHeapTupleInplace() parameter.
It adds README content about inplace update visibility in logical
decoding.  It rewrites other comments.

Back-patch to v18, where commit 243e9b40f1
first appeared.  Since this removes a CacheInvalidateHeapTupleInplace()
parameter, expect a v18 ".abi-compliance-history" edit to follow.  PGXN
contains no calls to that function.

Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reported-by: Ilyasov Ian <ianilyasov@outlook.com>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Surya Poondla <s_poondla@apple.com>
Discussion: https://postgr.es/m/CA+renyU+LGLvCqS0=fHit-N1J-2=2_mPK97AQxvcfKm+F-DxJA@mail.gmail.com
Backpatch-through: 18
2025-12-15 12:19:49 -08:00
Noah Misch
0839fbe400 Correct comments of "Fix data loss at inplace update after heap_update()".
This corrects commit a07e03fd8f.

Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reported-by: Surya Poondla <s_poondla@apple.com>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://postgr.es/m/CA+renyWCW+_2QvXERBQ+mna6ANwAVXXmHKCA-WzL04bZRsjoBA@mail.gmail.com
2025-12-15 12:19:49 -08:00
Jeff Davis
54c41a6deb Remove unused single-byte char_is_cased() API.
https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-15 10:24:57 -08:00
Jeff Davis
9c8de15969 Use multibyte-aware extraction of pattern prefixes.
Previously, like_fixed_prefix() used char-at-a-time logic, which
forced it to be too conservative for case-insensitive matching.

Introduce like_fixed_prefix_ci(), and use that for case-insensitive
pattern prefixes. It uses multibyte and locale-aware logic, along with
the new pg_iswcased() API introduced in 630706ced0.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-15 10:24:47 -08:00
Tom Lane
8191937082 Add offnum range checks to suppress compile warnings with UBSAN.
Late-model gcc with -fsanitize=undefined enabled issues warnings
about uses of PageGetItemId() when it can't prove that the
offsetNumber is > 0.  The call sites where this happens are
checking that the offnum is <= PageGetMaxOffsetNumber(page), so
it seems reasonable to add an explicit check that offnum >= 1 too.

While at it, rearrange the code to be less contorted and avoid
duplicate checks on PageGetMaxOffsetNumber.  Maybe the compiler
would optimize away the duplicate logic or maybe not, but the
existing coding has little to recommend it anyway.

There are multiple instances of this identical coding pattern in
heapam.c and heapam_xlog.c.  Current gcc only complains about two
of them, but I fixed them all in the name of consistency.

Potentially this could be back-patched in the name of silencing
warnings; but I think enabling UBSAN is mainly something people
would do on HEAD, so for now it seems not worth the trouble.

Discussion: https://postgr.es/m/1699806.1765746897@sss.pgh.pa.us
2025-12-15 12:40:09 -05:00
Heikki Linnakangas
ecb553ae82 Improve sanity checks on multixid members length
In the server, check explicitly for multixids with zero members. We
used to have an assertion for it, but commit d4b7bde418 replaced it
with more extensive runtime checks, but it missed the original case of
zero members.

In the upgrade code, a negative length never makes sense, so better
check for it explicitly. Commit d4b7bde418 added a similar sanity
check to the corresponding server code on master, and in backbranches,
the 'length' is passed to palloc which would fail with "invalid memory
alloc request size" error. Clarify the comments on what kind of
invalid entries are tolerated by the upgrade code and which ones are
reported as fatal errors.

Coverity complained about 'length' in the upgrade code being
tainted. That's bogus because we trust the data on disk at least to
some extent, but hopefully this will silence the complaint. If not,
I'll dismiss it manually.

Discussion: https://www.postgresql.org/message-id/7b505284-c6e9-4c80-a7ee-816493170abc@iki.fi
2025-12-15 13:30:17 +02:00
David Rowley
cd83ed9a91 Fix typo in tablecmds.c
Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2%3DAib%2BcatZn6wHKmz0BWe8-q10NAhpxu8wUDT19SddKNA%40mail.gmail.com
2025-12-15 17:17:04 +13:00
Amit Kapila
0d2d4a0ec3 Add retry logic to pg_sync_replication_slots().
Previously, pg_sync_replication_slots() would finish without synchronizing
slots that didn't meet requirements, rather than failing outright. This
could leave some failover slots unsynchronized if required catalog rows or
WAL segments were missing or at risk of removal, while the standby
continued removing needed data.

To address this, the function now waits for the primary slot to advance to
a position where all required data is available on the standby before
completing synchronization. It retries cyclically until all failover slots
that existed on the primary at the start of the call are synchronized.
Slots created after the function begins are not included. If the standby
is promoted during this wait, the function exits gracefully and the
temporary slots will be removed.

Author: Ajin Cherian <itsajin@gmail.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Yilin Zhang <jiezhilove@126.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAFPTHDZAA%2BgWDntpa5ucqKKba41%3DtXmoXqN3q4rpjO9cdxgQrw%40mail.gmail.com
2025-12-15 02:50:21 +00:00
Michael Paquier
4ba012a8ed Allow cumulative statistics to read/write auxiliary data from/to disk
Cumulative stats kinds gain the capability to write additional per-entry
data when flushing the stats at shutdown, and read this data when
loading back the stats at startup.  This can be fit for example in the
case of variable-length data (like normalized query strings), so as it
becomes possible to link the shared memory stats entries to data that is
stored in a different area, like a DSA segment.

Three new optional callbacks are added to PgStat_KindInfo, available to
variable-numbered stats kinds:
* to_serialized_data: writes auxiliary data for an entry.
* from_serialized_data: reads auxiliary data for an entry.
* finish: performs actions after read/write/discard operations.  This is
invoked after processing all the entries of a kind, allowing extensions
to close file handles and clean up resources.

Stats kinds have the option to store this data in the existing pgstats
file, but can as well store it in one or more additional files whose
names can be built upon the entry keys.  The new serialized callbacks
are called once an entry key is read or written from the main stats
file.  A file descriptor to the main pgstats file is available in the
arguments of the callbacks.

Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s9SDOu+Z6veoJCHWk+kDeTktAtC-KY9fQ9Z6BJdDUirQ@mail.gmail.com
2025-12-15 09:40:56 +09:00
Tom Lane
58dad7f349 Update typedefs.list to match what the buildfarm currently reports.
The current list from the buildfarm includes quite a few typedef
names that it used to miss.  The reason is a bit obscure, but it
seems likely to have something to do with our recent increased
use of palloc_object and palloc_array.  In any case, this makes
the relevant struct declarations be much more nicely formatted,
so I'll take it.  Install the current list and re-run pgindent
to update affected code.

Syncing with the current list also removes some obsolete
typedef names and fixes some alphabetization errors.

Discussion: https://postgr.es/m/1681301.1765742268@sss.pgh.pa.us
2025-12-14 17:03:53 -05:00
Andres Freund
30df61990c bufmgr: Add one-entry cache for private refcount
The private refcount entry for a buffer is often looked up repeatedly for the
same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's
worthwhile to have a one-entry cache for that case. With that cache in place,
it's worth splitting GetPrivateRefCountEntry() into a small inline
portion (for the cache hit case) and an out-of-line helper for the rest.

This is helpful for some workloads today, but becomes more important in an
upcoming patch that will utilize the private refcount infrastructure to also
store whether the buffer is currently locked, as that increases the rate of
lookups substantially.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar
2025-12-14 13:09:43 -05:00
Andres Freund
edbaaea0a9 bufmgr: Separate keys for private refcount infrastructure
This makes lookups faster, due to allowing auto-vectorized lookups. It is also
beneficial for an upcoming patch, independent of auto-vectorization, as the
upcoming patch wants to track more information for each pinned buffer, making
the existing loop, iterating over an array of PrivateRefCountEntry, more
expensive due to increasing its size.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-14 13:09:43 -05:00
Alexander Korotkov
b27e48213f Refactor WaitLSNType enum to use a macro for type count
Change WAIT_LSN_TYPE_COUNT from an enum sentinel to a macro definition,
in a similar way to IOObject, IOContext, and BackendType enums.  Remove
explicit enum value assignments well.

Author: Xuneng Zhou <xunengzhou@gmail.com>
2025-12-14 17:18:32 +02:00
Alexander Korotkov
c5ae07a90a Fix usage of palloc() in MERGE/SPLIT PARTITION(s) code
f2e4cc4279 and 4b3d173629 implement ALTER TABLE ... MERGE/SPLIT
PARTITION(s) commands.  In several places, these commits use palloc(),
where we should use palloc_object() and palloc_array().  This commit
provides appropriate usage of palloc_object() and palloc_array().

Reported-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/tencent_3661BB522D5466B33EA33666%40qq.com
2025-12-14 16:10:25 +02:00
Alexander Korotkov
4b3d173629 Implement ALTER TABLE ... SPLIT PARTITION ... command
This new DDL command splits a single partition into several partitions.  Just
like the ALTER TABLE ... MERGE PARTITIONS ... command, new partitions are
created using the createPartitionTable() function with the parent partition
as the template.

This commit comprises a quite naive implementation which works in a single
process and holds the ACCESS EXCLUSIVE LOCK on the parent table during all
the operations, including the tuple routing.  This is why the new DDL command
can't be recommended for large, partitioned tables under high load.  However,
this implementation comes in handy in certain cases, even as it is.  Also, it
could serve as a foundation for future implementations with less locking and
possibly parallelism.

Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval <d.koval@postgrespro.ru>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Richard Guo <guofenglinux@gmail.com>
Co-authored-by: Dagfinn Ilmari Mannsaker <ilmari@ilmari.org>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Co-authored-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Stephane Tachoires <stephane.tachoires@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Daniel Gustafsson <dgustafsson@postgresql.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
2025-12-14 13:29:38 +02:00
Alexander Korotkov
f2e4cc4279 Implement ALTER TABLE ... MERGE PARTITIONS ... command
This new DDL command merges several partitions into a single partition of the
target table.  The target partition is created using the new
createPartitionTable() function with the parent partition as the template.

This commit comprises a quite naive implementation which works in a single
process and holds the ACCESS EXCLUSIVE LOCK on the parent table during all
the operations, including the tuple routing.  This is why this new DDL
command can't be recommended for large partitioned tables under a high load.
However, this implementation comes in handy in certain cases, even as it is.
Also, it could serve as a foundation for future implementations with less
locking and possibly parallelism.

Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval <d.koval@postgrespro.ru>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
Co-authored-by: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Richard Guo <guofenglinux@gmail.com>
Co-authored-by: Dagfinn Ilmari Mannsaker <ilmari@ilmari.org>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Co-authored-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Zhihong Yu <zyu@yugabyte.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Stephane Tachoires <stephane.tachoires@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Daniel Gustafsson <dgustafsson@postgresql.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
2025-12-14 13:29:17 +02:00
Tom Lane
ef5f559b95 Fix jsonb_object_agg crash after eliminating null-valued pairs.
In commit b61aa76e4 I added an assumption in jsonb_object_agg_finalfn
that it'd be okay to apply uniqueifyJsonbObject repeatedly to a
JsonbValue.  I should have studied that code more closely first,
because in skip_nulls mode it removed leading nulls by changing the
"pairs" array start pointer.  This broke the data structure's
invariants in two ways: pairs no longer references a repalloc-able
chunk, and the distance from pairs to the end of its array is less
than parseState->size.  So any subsequent addition of more pairs is
at high risk of clobbering memory and/or causing repalloc to crash.
Unfortunately, adding more pairs is exactly what will happen when the
aggregate is being used as a window function.

Fix by rewriting uniqueifyJsonbObject to not do that.  The prior
coding had little to recommend it anyway.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/ec5e96fb-ee49-4e5f-8a09-3f72b4780538@gmail.com
2025-12-13 16:18:29 -05:00
Peter Eisentraut
abb331da0a Fix out-of-date comment on makeRangeConstructors
We did define 4 functions in 4429f6a9e3, but in df73584431 we got
rid of the 0- and 1-arg versions.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BrenyVQti3iC7LE4UxtQb4ROLYMs6%2Bu-d4LrN5U4idH1Ghx6Q%40mail.gmail.com
2025-12-13 16:56:07 +01:00
Peter Eisentraut
ff30bad7f6 Clarify comment about temporal foreign keys
In RI_ConstraintInfo, period_contained_by_oper and
period_intersect_oper can take either anyrange or anymultirange.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://www.postgresql.org/message-id/CA%2BrenyWzDth%2BjqLZA2L2Cezs3wE%2BWX-5P8W2EOVx_zfFD%3Daicg%40mail.gmail.com
2025-12-13 16:44:33 +01:00
Álvaro Herrera
630a93799d
Reject opclass options in ON CONFLICT clause
It's as pointless as ASC/DESC and NULLS FIRST/LAST are, so reject all of
them in the same way.  While at it, normalize the others' error messages
to have less translatable strings.  Add tests for these errors.

Noticed while reviewing recent INSERT ON CONFLICT patches.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/202511271516.oiefpvn3z27m@alvherre.pgsql
2025-12-12 14:26:42 +01:00
Peter Eisentraut
493eb0da31 Replace most StaticAssertStmt() with StaticAssertDecl()
Similar to commit 75f49221c2, it is preferable to use
StaticAssertDecl() instead of StaticAssertStmt() when possible.

Discussion: https://www.postgresql.org/message-id/flat/CA%2BhUKGKvr0x_oGmQTUkx%3DODgSksT2EtgCA6LmGx_jQFG%3DsDUpg%40mail.gmail.com
2025-12-12 10:06:40 +01:00
Heikki Linnakangas
87a350e1f2 Never store 0 as the nextMXact
Before this commit, when multixid wraparound happens,
MultiXactState->nextMXact goes to 0, which is invalid. All the readers
need to deal with that possibility and skip over the 0. That's
error-prone and we've missed it a few times in the past. This commit
changes the responsibility so that all the writers of
MultiXactState->nextMXact skip over the zero already, and readers can
trust that it's never 0.

We were already doing that for MultiXactState->oldestMultiXactId; none
of its writers would set it to 0. ReadMultiXactIdRange() was
nevertheless checking for that possibility. For clarity, remove that
check.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi
2025-12-12 10:47:34 +02:00
Nathan Bossart
b4cbc106a6 Fix some comments.
Like commit 123661427b, these were discovered while reviewing
Aleksander Alekseev's proposed changes to pgindent.
2025-12-11 15:13:04 -06:00
Álvaro Herrera
81f72115cf
Fix infer_arbiter_index for partitioned tables
The fix for concurrent index operations in bc32a12e0d started
considering indexes that are not yet marked indisvalid as arbiters for
INSERT ON CONFLICT.  For partitioned tables, this leads to including
indexes that may not exist in partitions, causing a trivially
reproducible "invalid arbiter index list" error to be thrown because of
failure to match the index.  To fix, it suffices to ignore !indisvalid
indexes on partitioned tables.  There should be no risk that the set of
indexes will change for concurrent transactions, because in order for
such an index to be marked valid, an ALTER INDEX ATTACH PARTITION must
run which requires AccessExclusiveLock.

Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/17622f79-117a-4a44-aa8e-0374e53faaf0%40gmail.com
2025-12-11 20:56:37 +01:00
Heikki Linnakangas
b65f1ad9b1 Fix comment on how temp files and subtransactions are handled
The comment was accurate a long time ago, but not any more. I failed
to update the comment in commit ab3148b712.
2025-12-11 15:57:11 +02:00
Heikki Linnakangas
d4b7bde418 Add runtime checks for bogus multixact offsets
It's not far-fetched that we'd try to read a multixid with an invalid
offset in case of bugs or corruption. Or if you call
pg_get_multixact_members() after a crash that left behind invalid but
unused multixids. Better to get a somewhat descriptive error message
if that happens.

Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi
2025-12-11 11:18:14 +02:00
Michael Paquier
4f7dacc5b8 Use palloc_object() and palloc_array(), the last change
This is the last batch of changes that have been suggested by the
author, this part covering the non-trivial changes.  Some of the changes
suggested have been discarded as they seem to lead to more instructions
generated, leaving the parts that can be qualified as in-place
replacements.

Similar work has been done in 1b105f9472, 0c3c5c3b06 and
31d3847a37.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
2025-12-11 14:29:12 +09:00
Amit Kapila
1362bc33e0 Enhance slot synchronization API to respect promotion signal.
Previously, during a promotion, only the slot synchronization worker was
signaled to shut down. The backend executing slot synchronization via the
pg_sync_replication_slots() SQL function was not signaled, allowing it to
complete its synchronization cycle before exiting.

An upcoming patch improves pg_sync_replication_slots() to wait until
replication slots are fully persisted before finishing. This behaviour
requires the backend to exit promptly if a promotion occurs.

This patch ensures that, during promotion, a signal is also sent to the
backend running pg_sync_replication_slots(), allowing it to be interrupted
and exit immediately.

Author: Ajin Cherian <itsajin@gmail.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAFPTHDZAA%2BgWDntpa5ucqKKba41%3DtXmoXqN3q4rpjO9cdxgQrw%40mail.gmail.com
2025-12-11 03:49:28 +00:00
Peter Geoghegan
e16c6f0247 Clarify why _bt_killitems sorts its items array.
Make it clear why _bt_killitems sorts the scan's so->killedItems[]
array.  Also add an assertion to the _bt_killitems loop (that iterates
through this array) to verify it accesses tuples in leaf page order.

Follow-up to commit bfb335df58.

Author: Peter Geoghegan <pg@bowt.ie>
Suggested-by: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAGnEboirgArezZDNeFrR8FOGvKF-Xok333s2iVwWi65gZf8MEA@mail.gmail.com
2025-12-10 20:50:47 -05:00
Michael Paquier
06761b6096 Fix allocation formula in llvmjit_expr.c
An array of LLVMBasicBlockRef is allocated with the size used for an
element being "LLVMBasicBlockRef *" rather than "LLVMBasicBlockRef".
LLVMBasicBlockRef is a type that refers to a pointer, so this did not
directly cause a problem because both should have the same size, still
it is incorrect.

This issue has been spotted while reviewing a different patch, and
exists since 2a0faed9d7, so backpatch all the way down.

Discussion: https://postgr.es/m/CA+hUKGLngd9cKHtTUuUdEo2eWEgUcZ_EQRbP55MigV2t_zTReg@mail.gmail.com
Backpatch-through: 14
2025-12-11 10:25:21 +09:00
Peter Geoghegan
473cb1b951 Fix MULTIXACT_DEBUG builds.
Oversight in commit bd8d9c9b.

Discussion: https://postgr.es/m/CAH2-WzmvwVKZ+0Z=RL_+g_aOku8QxWddDCXmtyLj02y+nYaD0g@mail.gmail.com
2025-12-10 19:31:13 -05:00
Peter Geoghegan
bfb335df58 Return TIDs in desc order during backwards scans.
Always return TIDs in descending order when returning groups of TIDs
from an nbtree posting list tuple during nbtree backwards scans.  This
makes backwards scans tend to require fewer buffer hits, since the scan
is less likely to repeatedly pin and unpin the same heap page/buffer
(we'll get exactly as many buffer hits as we get with a similar forwards
scan case).

Commit 0d861bbb, which added nbtree deduplication, originally did things
this way to avoid interfering with _bt_killitems's approach to setting
LP_DEAD bits on posting list tuples.  _bt_killitems makes a soft
assumption that it can always iterate through posting lists in ascending
TID order, finding corresponding killItems[]/so->currPos.items[] entries
in that same order.  This worked out because of the prior _bt_readpage
backwards scan behavior.  If we just changed the backwards scan posting
list logic in _bt_readpage, without altering _bt_killitems itself, it
would break its soft assumption.

Avoid that problem by sorting the so->killedItems[] array at the start
of _bt_killitems.  That way the order that dead items are saved in from
btgettuple can't matter; so->killedItems[] will always be in the same
order as so->currPos.items[] in the end.  Since so->currPos.items[] is
now always in leaf page order, regardless of the scan direction used
within _bt_readpage, and since so->killedItems[] is always in that same
order, the _bt_killitems loop can continue to make a uniform assumption
about everything being in page order.  In fact, sorting like this makes
the previous soft assumption about item order into a hard invariant.

Also deduplicate the so->killedItems[] array after it is sorted.  That
way there's no risk of the _bt_killitems loop becoming confused by a
duplicate dead item/TID.  This was possible in cases that involved a
scrollable cursor that encountered the same dead TID more than once
(within the same leaf page/so->currPos context).  This doesn't come up
very much in practice, but it seems best to be as consistent as possible
about how and when _bt_killitems will LP_DEAD-mark index tuples.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Mircea Cadariu <cadariu.mircea@gmail.com>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=Wut2pKvbW-u3hJ_LXwsYeiXHiW8oN1GfbKPavcGo8Ow@mail.gmail.com
2025-12-10 15:35:30 -05:00
Jeff Davis
630706ced0 Add pg_iswcased().
True if character has multiple case forms. Will be a useful
multibyte-aware replacement for char_is_cased().

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-10 11:56:11 -08:00
Jeff Davis
1e493158d3 Remove char_tolower() API.
It's only useful for an ILIKE optimization for the libc provider using
a single-byte encoding and a non-C locale, but it creates significant
internal complexity.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-10 11:55:59 -08:00
Melanie Plageman
eebec3ca4b Add comment about keeping PD_ALL_VISIBLE and VM in sync
The comment above heap_xlog_visible() about the critical integrity
requirement for PD_ALL_VISIBLE and the visibility map should also be in
heap_xlog_prune_freeze() where we set PD_ALL_VISIBLE.

Oversight in add323da40

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2025-12-10 11:10:13 -05:00
Melanie Plageman
bd298f54a0 Simplify vacuum visibility assertion
Phase I vacuum gives the page a once-over after pruning and freezing to
check that the values of all_visible and all_frozen agree with the
result of heap_page_is_all_visible(). This is meant to keep the logic in
phase I for determining visibility in sync with the logic in phase III.

Rewrite the assertion to avoid an Assert(false).

Suggested by Andres Freund.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/mhf4vkmh3j57zx7vuxp4jagtdzwhu3573pgfpmnjwqa6i6yj5y%40sy4ymcdtdklo
2025-12-10 11:10:01 -05:00
Heikki Linnakangas
70b4d90439 Fix comment in GetPublicationRelations
This function gets the list of relations associated with the
publication but the comment said the opposite.

Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CANhcyEV3C_CGBeDtjvKjALDJDMH-Uuc9BWfSd=eck8SCXnE=fQ@mail.gmail.com
2025-12-10 15:33:29 +02:00
Heikki Linnakangas
fa44b8b7fb Fix some near-bugs related to ResourceOwner function arguments
These functions took a ResourceOwner argument, but only checked if it
was NULL, and then used CurrentResourceOwner for the actual work.
Surely the intention was to use the passed-in resource owner. All
current callers passed CurrentResourceOwner or NULL, so this has no
consequences at the moment, but it's an accident waiting to happen for
future caller and extensions.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEze2Whnfv8VuRZaohE-Af+GxBA1SNfD_rXfm84Jv-958UCcJA@mail.gmail.com
Backpatch-through: 17
2025-12-10 11:43:16 +02:00
Heikki Linnakangas
bae9d2f892 Fix typo in comment
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CABPTF7V8CbOXGePqrad6EH3Om7DRhNiO3C0rQ-62UuT7RdU-GQ@mail.gmail.com
2025-12-10 01:06:03 +02:00
David Rowley
f275afc997 Fix misleading comment in tuplesort.c
A comment in tuplesort.c was claiming that the code was defining
INITIAL_MEMTUPSIZE so that it *does not* exceed
ALLOCSET_SEPARATE_THRESHOLD, but the code actually ensures that we
purposefully *do* exceed ALLOCSET_SEPARATE_THRESHOLD for the initial
allocation of the tuples array, as per reasons detailed in the
commentary of grow_memtuples().

Also, there's not much need to repeat the mention about
ALLOCSET_SEPARATE_THRESHOLD in each location where INITIAL_MEMTUPSIZE is
used, so remove those comments.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Discussion: https://postgr.es/m/tencent_6FA14F85D6B5B5291532D6789E07F4765C08%40qq.com
2025-12-10 12:01:14 +13:00
Michael Paquier
1b105f9472 Use palloc_object() and palloc_array() in backend code
The idea is to encourage more the use of these new routines across the
tree, as these offer stronger type safety guarantees than palloc().
This batch of changes includes most of the trivial changes suggested by
the author for src/backend/.

A total of 334 files are updated here.  Among these files, 48 of them
have their build change slightly; these are caused by line number
changes as the new allocation formulas are simpler, shaving around 100
lines of code in total.

Similar work has been done in 0c3c5c3b06 and 31d3847a37.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com
2025-12-10 07:36:46 +09:00
Masahiko Sawada
ab40db3852 Add started_by column to pg_stat_progress_analyze view.
The new column, started_by, indicates the initiator of the
analyze ('manual' or 'autovacuum'), helping users and monitoring tools
to better understand ANALYZE behavior.

Bump catalog version.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Yu Wang <wangyu_runtime@163.com>
Discussion: https://postgr.es/m/CAA5RZ0suoicwxFeK_eDkUrzF7s0BVTaE7M%2BehCpYcCk5wiECpw%40mail.gmail.com
2025-12-09 11:23:45 -08:00
Masahiko Sawada
0d78952061 Add mode and started_by columns to pg_stat_progress_vacuum view.
The new columns, mode and started_by, indicate the vacuum
mode ('normal', 'aggressive', or 'failsafe') and the initiator of the
vacuum ('manual', 'autovacuum', or 'autovacuum_wraparound'),
respectively. This allows users and monitoring tools to better
understand VACUUM behavior.

Bump catalog version.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yu Wang <wangyu_runtime@163.com>
Discussion: https://postgr.es/m/CAOzEurQcOY-OBL_ouEVfEaFqe_md3vB5pXjR_m6L71Dcp1JKCQ@mail.gmail.com
2025-12-09 10:51:14 -08:00
Heikki Linnakangas
3cb5808bd1 Add wait event for the group commit delay before WAL flush
Author: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BFpmFf-hWXtrC0Q3Cr_Xo78zuP_M_VC5xgWPOYOkwqOD0T8eg@mail.gmail.com
2025-12-09 17:06:40 +02:00
Heikki Linnakangas
bd8d9c9bdf Widen MultiXactOffset to 64 bits
This eliminates MultiXactOffset wraparound and the 2^32 limit on the
total number of multixid members. Multixids are still limited to 2^31,
but this is a nice improvement because 'members' can grow much faster
than the number of multixids. On such systems, you can now run longer
before hitting hard limits or triggering anti-wraparound vacuums.

Not having to deal with MultiXactOffset wraparound also simplifies the
code and removes some gnarly corner cases.

We no longer need to perform emergency anti-wraparound freezing
because of running out of 'members' space, so the offset stop limit is
gone. But you might still not want 'members' to consume huge amounts
of disk space. For that reason, I kept the logic for lowering vacuum's
multixid freezing cutoff if a large amount of 'members' space is
used. The thresholds for that are roughly the same as the "safe" and
"danger" thresholds used before, 2 billion transactions and 4 billion
transactions. This keeps the behavior for the freeze cutoff roughly
the same as before. It might make sense to make this smarter or
configurable, now that the threshold is only needed to manage disk
usage, but that's left for the future.

Add code to pg_upgrade to convert multitransactions from the old to
the new format, rewriting the pg_multixact SLRU files. Because
pg_upgrade now rewrites the files, we can get rid of some hacks we had
put in place to deal with old bugs and upgraded clusters. Bump catalog
version for the pg_multixact/offsets format change.

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com
2025-12-09 13:53:03 +02:00
Heikki Linnakangas
bb3b1c4f64 Move pg_multixact SLRU page format definitions to a separate header
This makes them accessible from pg_upgrade, needed by the next commit.
I'm doing this mechanical move as a separate commit to make the next
commit's changes to these definitions more obvious.

Author: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACG%3DezbZo_3_fnx%3DS5BfepwRftzrpJ%2B7WET4EkTU6wnjDTsnjg@mail.gmail.com
2025-12-09 13:45:01 +02:00
Richard Guo
f00484c170 Fix distinctness check for queries with grouping sets
query_is_distinct_for() is intended to determine whether a query never
returns duplicates of the specified columns.  For queries using
grouping sets, if there are no grouping expressions, the query may
contain one or more empty grouping sets.  The goal is to detect
whether there is exactly one empty grouping set, in which case the
query would return a single row and thus be distinct.

The previous logic in query_is_distinct_for() was incomplete because
the check was insufficiently thorough and could return false when it
could have returned true.  It failed to consider cases where the
DISTINCT clause is used on the GROUP BY, in which case duplicate empty
grouping sets are removed, leaving only one.  It also did not
correctly handle all possible structures of GroupingSet nodes that
represent a single empty grouping set.

To fix, add a check for the groupDistinct flag, and expand the query's
groupingSets tree into a flat list, then verify that the expanded list
contains only one element.

No backpatch as this could result in plan changes.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMbWs480Z04NtP8-O55uROq2Zego309+h3hhaZhz6ztmgWLEBw@mail.gmail.com
2025-12-09 17:09:27 +09:00
Richard Guo
c925ad30b0 Fix const-simplification for index expressions and predicate
Similar to the issue with constraint and statistics expressions fixed
in 317c117d6, index expressions and predicate can also suffer from
incorrect reduction of NullTest clauses during const-simplification,
due to unfixed varnos and the use of a NULL root.  It has been
reported that this issue can cause the planner to fail to pick up a
partial index that it previously matched successfully.

Because we need to cache the const-simplified index expressions and
predicate in the relcache entry, we cannot fix the Vars before
applying eval_const_expressions.  To ensure proper reduction of
NullTest clauses, this patch runs eval_const_expressions a second time
-- after the Vars have been fixed and with a valid root.

It could be argued that the additional call to eval_const_expressions
might increase planning time, but I don't think that's a concern.  It
only runs when index expressions and predicate are present; it is
relatively cheap when run on small expression trees (which is
typically the case for index expressions and predicate), and it runs
on expressions that have already been const-simplified once, making
the second pass even cheaper.  In return, in cases like the one
reported, it allows the planner to match and use partial indexes,
which can lead to significant execution-time improvements.

Bug: #19007
Reported-by: Bryan Fox <bryfox@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/19007-4cc6e252ed8aa54a@postgresql.org
2025-12-09 16:56:26 +09:00
Amit Kapila
04396eacd3 Fix LOCK_TIMEOUT handling in slotsync worker.
Previously, the slotsync worker relied on SIGINT for graceful shutdown
during promotion. However, SIGINT is also used by the LOCK_TIMEOUT handler
to cancel queries. Since the slotsync worker can lock catalog tables while
parsing libpq tuples, this overlap caused it to ignore LOCK_TIMEOUT
signals and potentially wait indefinitely on locks.

This patch replaces the slotsync worker's SIGINT handler with
StatementCancelHandler to correctly process query-cancel interrupts.
Additionally, the startup process now uses SIGUSR1 to signal the slotsync
worker to stop during promotion. The worker exits after detecting that the
shared memory flag stopSignaled is set.

Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 17, here it was introduced
Discussion: https://postgr.es/m/TY4PR01MB169078F33846E9568412D878C94A2A@TY4PR01MB16907.jpnprd01.prod.outlook.com
2025-12-09 07:25:20 +00:00
Peter Eisentraut
2268f2b91b Remove useless casts in format arguments
There were a number of useless casts in format arguments, either
where the input to the cast was already in the right type, or
seemingly uselessly casting between types instead of just using the
right format placeholder to begin with.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org
2025-12-09 07:33:08 +01:00
Peter Eisentraut
907caf5c39 Clean up int64-related format strings
Remove some gratuitous uses of INT64_FORMAT.  Make use of
PRIu64/PRId64 were appropriate, remove unnecessary casts.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org
2025-12-09 07:33:08 +01:00
Peter Eisentraut
2b117bb014 Remove unnecessary casts in printf format arguments (%zu/%zd)
Many of these are probably left over from before use of %zu/%zd was
portable.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org
2025-12-09 07:33:08 +01:00
David Rowley
52382feb78 Doc: fix typo in hash index documentation
Plus a similar fix to the README.

Backpatch as far back as the sgml issue exists.  The README issue does
exist in v14, but that seems unlikely to harm anyone.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ed3db7ea-55b4-4809-86af-81ad3bb2c7d3@gmail.com
Backpatch-through: 15
2025-12-09 14:41:30 +13:00
Peter Geoghegan
83a26ba59b Avoid pointer chasing in _bt_readpage inner loop.
Make _bt_readpage pass down the current scan direction to various
utility functions within its pstate variable.  Also have _bt_readpage
work off of a local copy of scan->ignore_killed_tuples within its
per-tuple loop (rather than using scan->ignore_killed_tuples directly).

Testing has shown that this significantly benefits large range scans,
which are naturally able to take full advantage of the pstate.startikey
optimization added by commit 8a510275.  Running a pgbench script with a
"SELECT abalance FROM pgbench_accounts WHERE aid BETWEEN ..." query
shows an increase in transaction throughput of over 5%.  There also
appears to be a small performance benefit when running pgbench's
built-in select-only script.

Follow-up to commit 65d6acbc.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzmwMwcwKFgaf+mYPwiz3iL4AqpXnwtW_O0vqpWPXRom9Q@mail.gmail.com
2025-12-08 13:48:09 -05:00
Álvaro Herrera
d0d0ba6cf6
Unify some more messages
No backpatch here because of message wording changes.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/202512081537.ahw5gwoencou@alvherre.pgsql
2025-12-08 19:25:36 +01:00
Peter Geoghegan
65d6acbc56 Relocate _bt_readpage and related functions.
Quite a bit of code within nbtutils.c is only called by _bt_readpage.
Move _bt_readpage and all of the nbtutils.c functions it depends on into
a new .c file, nbtreadpage.c.  Also reorder some of the functions within
the new file for clarity.

This commit has no functional impact.  It is strictly mechanical.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzmwMwcwKFgaf+mYPwiz3iL4AqpXnwtW_O0vqpWPXRom9Q@mail.gmail.com
2025-12-08 13:15:00 -05:00
Álvaro Herrera
502e256f22
Unify error messages
No visible changes, just refactor how messages are constructed.
2025-12-08 16:30:52 +01:00
Peter Eisentraut
804046b39a Use PGAlignedXLogBlock for some code simplification
The code in BootStrapXLOG() and in pg_test_fsync.c tried to align WAL
buffers in complicated ways.  Also, they still used XLOG_BLCKSZ for
the alignment, even though that should now be PG_IO_ALIGN_SIZE.  This
can now be simplified and made more consistent by using
PGAlignedXLogBlock, either directly in BootStrapXLOG() and using
alignas in pg_test_fsync.c.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/f462a175-b608-44a1-b428-bdf351e914f4%40eisentraut.org
2025-12-08 14:54:32 +01:00
Amit Kapila
006dd4b2e5 Prevent invalidation of newly created replication slots.
A race condition could cause a newly created replication slot to become
invalidated between WAL reservation and a checkpoint.

Previously, if the required WAL was removed, we retried the reservation
process. However, the slot could still be invalidated before the retry if
the WAL was not yet removed but the checkpoint advanced the redo pointer
beyond the slot's intended restart LSN and computed the minimum LSN that
needs to be preserved for the slots.

The fix is to acquire an exclusive lock on ReplicationSlotAllocationLock
during WAL reservation to serialize WAL reservation and checkpoint's
minimum restart_lsn computation. This ensures that, if WAL reservation
occurs first, the checkpoint waits until restart_lsn is updated before
removing WAL. If the checkpoint runs first, subsequent WAL reservations
pick a position at or after the latest checkpoint's redo pointer.

We can't use the same fix for branch 17 and prior because commit
2090edc6f3 changed to compute to the minimum restart_LSN among slot's at
the beginning of checkpoint (or restart point). The fix for 17 and prior
branches is under discussion and will be committed separately.

Reported-by: suyu.cmj <mengjuan.cmj@alibaba-inc.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/5e045179-236f-4f8f-84f1-0f2566ba784c.mengjuan.cmj@alibaba-inc.com
2025-12-08 05:21:22 +00:00
Michael Paquier
f68597ee77 Improve error messages of input functions for pg_dependencies and pg_ndistinct
The error details updated in this commit can be reached in the
regression tests.  They did not follow the project style, and they
should be written them as full sentences.

Some of the errors are switched to use an elog(), for cases that involve
paths that cannot be reached based on the previous state of the parser
processing the input data (array start, object end, etc.).  The error
messages for these cases use now a more consistent style across the
board, with the state of the parser reported for debugging.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Michael Paquier <michael@paquier.xyz>
Co-authored-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/1353179.1764901790@sss.pgh.pa.us
2025-12-08 10:23:48 +09:00
Tom Lane
005a2907dc Micro-optimize datatype conversions in datum_to_jsonb_internal.
The general case for converting to a JSONB numeric value is to run the
source datatype's output function and then numeric_in, but we can do
substantially better than that for integer and numeric source values.
This patch improves the speed of jsonb_agg by 30% for integer input,
and nearly 2X for numeric input.

Sadly, the obvious idea of using float4_numeric and float8_numeric
to speed up those cases doesn't work: they are actually slower than
the generic coerce-via-I/O method, and not by a small amount.
They might round off differently than this code has historically done,
too.  Leave that alone pending possible changes in those functions.

We can also do better than the existing code for text/varchar/bpchar
source data; this optimization is similar to one that already exists
in the json_agg() code.  That saves 20% or so for such inputs.

Also make a couple of other minor improvements, such as not giving
JSONTYPE_CAST its own special case outside the switch when it could
perfectly well be handled inside, and not using dubious string hacking
to detect infinity and NaN results.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us
2025-12-07 11:54:33 -05:00
Tom Lane
b61aa76e45 Remove fundamentally-redundant processing in jsonb_agg() et al.
The various variants of jsonb_agg() operate as follows,
for each aggregate input value:

1. Build a JsonbValue tree representation of the input value.
2. Flatten the JsonbValue tree into a Jsonb in on-disk format.
3. Iterate through the Jsonb, building a JsonbValue that is part
of the aggregate's state stored in aggcontext, but is otherwise
identical to what phase 1 built.

This is very slightly less silly than it sounds, because phase 1
involves calling non-JSONB code such as datatype output functions,
which are likely to leak memory, and we don't want to leak into the
aggcontext.  Nonetheless, phases 2 and 3 are accomplishing exactly
nothing that is useful if we can make phase 1 put the JsonbValue
tree where we need it.  We could probably do that with a bunch of
MemoryContextSwitchTo's, but what seems more robust is to give
pushJsonbValue the responsibility of building the JsonbValue tree
in a specified non-current memory context.  The previous patch
created the infrastructure for that, and this patch simply makes
the aggregate functions use it and then rips out phases 2 and 3.

For me, this makes jsonb_agg() with a text column as input run
about 2X faster than before.  It's not yet on par with json_agg(),
but this removes a whole lot of the difference.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us
2025-12-07 11:52:22 -05:00
Tom Lane
0986e95161 Revise APIs for pushJsonbValue() and associated routines.
Instead of passing "JsonbParseState **" to pushJsonbValue(),
pass a pointer to a JsonbInState, which will contain the
parseState stack pointer as well as other useful fields.
Also, instead of returning a JsonbValue pointer that is often
meaningless/ignored, return the top-level JsonbValue pointer
in the "result" field of the JsonbInState.

This involves a lot of (mostly mechanical) edits, but I think
the results are notationally cleaner and easier to understand.
Certainly the business with sometimes capturing the result of
pushJsonbValue() and sometimes not was bug-prone and incapable of
mechanical verification.  In the new arrangement, JsonbInState.result
remains null until we've completed a valid sequence of pushes, so
that an incorrect sequence will result in a null-pointer dereference,
not mistaken use of a partial result.

However, this isn't simply an exercise in prettier notation.
The real reason for doing it is to provide a mechanism whereby
pushJsonbValue() can be told to construct the JsonbValue tree
in a context that is not CurrentMemoryContext.  That happens
when a non-null "outcontext" is specified in the JsonbInState.
No callers exercise that option in this patch, but the next
patch in the series will make use of it.

I tried to improve the comments in this area too.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us
2025-12-07 11:51:33 -05:00
Tom Lane
6498287696 Handle constant inputs to corr() and related aggregates more precisely.
The SQL standard says that corr() and friends should return NULL in
the mathematically-undefined case where all the inputs in one of
the columns have the same value.  We were checking that by seeing
if the sums Sxx and Syy were zero, but that approach is very
vulnerable to roundoff error: if a sum is close to zero but not
exactly that, we'd come out with a pretty silly non-NULL result.

Instead, directly track whether the inputs are all equal by
remembering the common value in each column.  Once we detect
that a new input is different from before, represent that by
storing NaN for the common value.  (An objection to this scheme
is that if the inputs are all NaN, we will consider that they
were not all equal.  But under IEEE float arithmetic rules,
one NaN is never equal to another, so this behavior is arguably
correct.  Moreover it matches what we did before in such cases.)
Then, leave the sums at their exact value of zero for as long
as we haven't detected different input values.

This solution requires the aggregate transition state to contain
8 float values not 6, which is not problematic, and it seems to add
less than 1% to the aggregates' runtime, which seems acceptable.

While we're here, improve corr()'s final function to cope with
overflow/underflow in the final calculation, and to clamp its
result to [-1, 1] in case of roundoff error.

Although this is arguably a bug fix, it requires a catversion bump
due to the change in aggregates' initial states, so it can't be
back-patched.

Patch written by me, but many of the ideas are due to Dean Rasheed,
who also did a deal of testing.

Bug: #19340
Reported-by: Oleg Ivanov <o15611@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Co-authored-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/19340-6fb9f6637f562092@postgresql.org
2025-12-06 18:31:26 -05:00
Tom Lane
6dfce8420e Fix text substring search for non-deterministic collations.
Due to an off-by-one error, the code failed to find matches at the
end of the haystack.  Fix by rewriting the loop.

While at it, fix a comment that claimed that the function could find
a zero-length match.  Such a match could send a caller into an endless
loop.  However, zero-length matches only make sense with an empty
search string, and that case is explicitly excluded by all callers.
To make sure it stays that way, add an Assert and a comment.

Bug: #19341
Reported-by: Adam Warland <adam.warland@infor.com>
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19341-1d9a22915edfec58@postgresql.org
Backpatch-through: 18
2025-12-05 20:10:33 -05:00
Robert Haas
014f9a831a Don't reset the pathlist of partitioned joinrels.
apply_scanjoin_target_to_paths wants to avoid useless work and
platform-specific dependencies by throwing away the path list created
prior to applying the final scan/join target and constructing a whole
new one using the final scan/join target. However, this is only valid
when we'll consider all the same strategies after the pathlist reset
as before.

After resetting the path list, we reconsider Append and MergeAppend
paths with the modified target list; therefore, it's only valid for a
partitioned relation. However, what the previous coding missed is that
it cannot be a partitioned join relation, because that also has paths
that are not Append or MergeAppend paths and will not be reconsidered.
Thus, before this patch, we'd sometimes choose a partitionwise strategy
with a higher total cost than cheapest non-partitionwise strategy,
which is not good.

We had a surprising number of tests cases that were relying on this
bug to work as they did. A big part of the reason for this is that row
counts in regression test cases tend to be low, which brings the cost
of partitionwise and non-partitionwise strategies very close together,
especially for merge joins, where the real and perceived advantages of
a partitionwise approach are minimal. In addition, one test case
included a row-count-inflating join. In such cases, a partitionwise
join can easily be a loser on cost, because the total number of tuples
passing through an Append node is much higher than it is with a
non-partitionwise strategy. That test case is adjusted by adding
additional join clauses to avoid the row count inflation.

Although the failure of the planner to choose the lowest-cost path is a
bug, we generally do not back-patch fixes of this type, because planning
is not an exact science and there is always a possibility that some user
will end up with a plan that has a lower estimated cost but actually
runs more slowly. Hence, no backpatch here, either.

The code change here is exactly what was originally proposed by
Ashutosh, but the changes to the comments and test cases have been
very heavily rewritten by me, helped along by some very useful advice
from Richard Guo.

Reported-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Author: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Arne Roland <arne.roland@malkut.net>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: http://postgr.es/m/CAExHW5toze58+jL-454J3ty11sqJyU13Sz5rJPQZDmASwZgWiA@mail.gmail.com
2025-12-05 12:00:18 -05:00
Tom Lane
8f1791c618 Fix some cases of indirectly casting away const.
Newest versions of gcc are able to detect cases where code implicitly
casts away const by assigning the result of strchr() or a similar
function applied to a "const char *" value to a target variable
that's just "char *".  This of course creates a hazard of not getting
a compiler warning about scribbling on a string one was not supposed
to, so fixing up such cases is good.

This patch fixes a dozen or so places where we were doing that.
Most are trivial additions of "const" to the target variable,
since no actually-hazardous change was occurring.  There is one
place in ecpg.trailer where we were indeed violating the intention
of not modifying a string passed in as "const char *".  I believe
that's harmless not a live bug, but let's fix it by copying the
string before modifying it.

There is a remaining trouble spot in ecpg/preproc/variable.c,
which requires more complex surgery.  I've left that out of this
commit because I want to study that code a bit more first.

We probably will want to back-patch this once compilers that detect
this pattern get into wider circulation, but for now I'm just
going to apply it to master to see what the buildfarm says.

Thanks to Bertrand Drouvot for finding a couple more spots than
I had.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/1324889.1764886170@sss.pgh.pa.us
2025-12-05 11:17:23 -05:00
Heikki Linnakangas
4d936c3fff Fix setting next multixid's offset at offset wraparound
In commit 789d65364c, we started updating the next multixid's offset
too when recording a multixid, so that it can always be used to
calculate the number of members. I got it wrong at offset wraparound:
we need to skip over offset 0. Fix that.

Discussion: https://www.postgresql.org/message-id/d9996478-389a-4340-8735-bfad456b313c@iki.fi
Backpatch-through: 14
2025-12-05 11:32:38 +02:00
Amit Kapila
5db6a344ab Rename column slotsync_skip_at to slotsync_last_skip.
Commit 76b78721ca introduced two new columns in pg_stat_replication_slots
to improve monitoring of slot synchronization. One of these columns was
named slotsync_skip_at, which is inconsistent with the naming convention
used for similar columns in other system views.

Columns that store timestamps of the most recent event typically use the
'last_' in the column name (e.g., last_autovacuum, checksum_last_failure).
Renaming slotsync_skip_at to slotsync_last_skip aligns with this pattern,
making the purpose of the column clearer and improving overall consistency
across the views.

Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Michael Banck <mbanck@gmx.net>
Discussion: https://postgr.es/m/20251128091552.GB13635@p46.dedyn.io;lightning.p46.dedyn.io
Discussion: https://postgr.es/m/CAE9k0PkhfKrTEAsGz4DjOhEj1nQ+hbQVfvWUxNacD38ibW3a1g@mail.gmail.com
2025-12-05 04:12:55 +00:00
Michael Paquier
7bc88c3d6f Fix some compiler warnings
Some of the buildfarm members with some old gcc versions have been
complaining about an always-true test for a NULL pointer caused by a
combination of SOFT_ERROR_OCCURRED() and a local ErrorSaveContext
variable.

These warnings are taken care of by removing SOFT_ERROR_OCCURRED(),
switching to a direct variable check, like 56b1e88c80.

Oversights in e1405aa5e3 and 44eba8f06e.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1341064.1764895052@sss.pgh.pa.us
2025-12-05 12:30:43 +09:00
Melanie Plageman
904f9f5ea0 Suppress spurious Coverity warning in prune freeze logic
Adjust the prune_freeze_setup() parameter types of new_relfrozen_xid and
new_relmin_mxid to prevent misleading Coverity analysis.
heap_page_prune_and_freeze() compared these values against NULL when
passing them to prune_freeze_setup(), causing Coverity to assume they
could be NULL and flag a possible null-pointer dereference later, even
though it occurs inside a directly related conditional.

Reported-by: Coverity
Author: Melanie Plageman <melanieplageman@gmail.com>
2025-12-04 18:55:02 -05:00
Nathan Bossart
80f6e2fb4a Fix key size of PrivateRefCountHash.
The key is the first member of PrivateRefCountEntry, which has type
Buffer.  This commit changes the key size from sizeof(int32) to
sizeof(Buffer).  This appears to be an oversight in commit
4b4b680c3d, but it's of no consequence because Buffer has been a
signed 32-bit integer for a long time.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aS77DTpl0fOkIKSZ%40ip-10-97-1-34.eu-west-3.compute.internal
2025-12-04 15:42:18 -06:00
Peter Eisentraut
e158fd4d68 Remove no longer needed casts from Pointer
These casts used to be required when Pointer was char *, but now it's
void * (commit 1b2bb5077e), so they are not needed anymore.

Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
2025-12-04 20:44:52 +01:00
Peter Eisentraut
c6be3daa05 Remove no longer needed casts to Pointer
These casts used to be required when Pointer was char *, but now it's
void * (commit 1b2bb5077e), so they are not needed anymore.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
2025-12-04 19:40:08 +01:00
Alexander Korotkov
d6ef8ee3ee Fix incorrect assertion bound in WaitForLSN()
The assertion checking MyProcNumber used MaxBackends as the upper
bound, but the procInfos array is allocated with size
MaxBackends + NUM_AUXILIARY_PROCS. This inconsistency would cause
a false assertion failure if an auxiliary process calls WaitForLSN().

Author: Xuneng Zhou <xunengzhou@gmail.com>
2025-12-04 10:38:12 +02:00
Andres Freund
6c5c393b74 Rename BUFFERPIN wait event class to BUFFER
In an upcoming patch more wait events will be added to the wait event
class (for buffer locking), making the current name too
specific. Alternatively we could introduce a dedicated wait event class for
those, but it seems somewhat confusing to have a BUFFERPIN and a BUFFER wait
event class.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-03 18:38:20 -05:00
Andres Freund
156680055d bufmgr: Turn BUFFER_LOCK_* into an enum
It seems cleaner to use an enum to tie the different values together. It also
helps to have a more descriptive type in the argument to various functions.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-12-03 18:38:20 -05:00
Heikki Linnakangas
789d65364c Set next multixid's offset when creating a new multixid
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.

The waiting mechanism was broken in a few scenarios:

- When nextMulti was advanced without WAL-logging the next
  multixid. For example, if a later multixid was already assigned and
  WAL-logged before the previous one was WAL-logged, and then the
  server crashed. In that case the next offset would never be set in
  the offsets SLRU, and a query trying to read it would get stuck
  waiting for it. Same thing could happen if pg_resetwal was used to
  forcibly advance nextMulti.

- In hot standby mode, a deadlock could happen where one backend waits
  for the next multixid assignment record, but WAL replay is not
  advancing because of a recovery conflict with the waiting backend.

The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.

Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.

Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
2025-12-03 19:15:08 +02:00
Nathan Bossart
9b05e2ec08 Use "foo(void)" for definitions of functions with no parameters.
Standard practice in PostgreSQL is to use "foo(void)" instead of
"foo()", as the latter looks like an "old-style" function
declaration.  Similar changes were made in commits cdf4b9aff2,
0e72b9d440, 7069dbcc31, f1283ed6cc, 7b66e2c086, e95126cf04, and
9f7c527af3.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/aTBObQPg%2Bps5I7vl%40ip-10-97-1-34.eu-west-3.compute.internal
2025-12-03 10:54:37 -06:00
Peter Eisentraut
9790affcce Fix stray references to SubscriptRef
This type never existed.  SubscriptingRef was meant instead.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/2eaa45e3-efc5-4d75-b082-f8159f51445f%40eisentraut.org
2025-12-03 14:44:14 +01:00
Peter Eisentraut
756a436893 Don't rely on pointer arithmetic with Pointer type
The comment for the Pointer type says 'XXX Pointer arithmetic is done
with this, so it can't be void * under "true" ANSI compilers.'.  This
fixes that.  Change from Pointer to use char * explicitly where
pointer arithmetic is needed.  This makes the meaning of the code
clearer locally and removes a dependency on the actual definition of
the Pointer type.  (The definition of the Pointer type is not changed
in this commit.)

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
2025-12-03 09:54:15 +01:00
Peter Eisentraut
623801b3bd Remove useless casts to Pointer
in arguments of memcpy() and memmove() calls

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org
2025-12-03 08:40:33 +01:00
Amit Kapila
c252d37d8c Fix shadow variable warning in subscriptioncmds.c.
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Discussion: https://postgr.es/m/CAHut+PsF8R0Bt4J3c92+T2F0mun0rRfK=-GH+iBv2s-O8ahJJw@mail.gmail.com
2025-12-03 03:31:31 +00:00
Nathan Bossart
a6d05c8193 Use LW_SHARED in dsa.c where possible.
Both dsa_get_total_size() and dsa_get_total_size_from_handle() take
an exclusive lock just to read a variable.  This commit reduces the
lock level to LW_SHARED in those functions.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aS8fMzWs9e8iHxk2%40nathan
2025-12-02 16:40:23 -06:00
Heikki Linnakangas
c085aab278 Add a test for half-dead pages in B-tree indexes
To increase our test coverage in general, and because I will use this
in the next commit to test a bug we currently have in amcheck.

Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/33e39552-6a2a-46f3-8b34-3f9f8004451f@garret.ru
2025-12-02 21:11:05 +02:00
Heikki Linnakangas
1e4e5783e7 Add a test for incomplete splits in B-tree indexes
To increase our test coverage in general, and because I will add onto
this in the next commit to also test amcheck with incomplete splits.

This is copied from the similar test we had for GIN indexes. B-tree's
incomplete splits work similarly to GIN's, so with small changes, the
same test works for B-tree too.

Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/abd65090-5336-42cc-b768-2bdd66738404@iki.fi
2025-12-02 21:10:47 +02:00
Nathan Bossart
f894acb24a Show size of DSAs and dshashes in pg_dsm_registry_allocations.
Presently, this view reports NULL for the size of DSAs and dshash
tables because 1) the current backend might not be attached to them
and 2) the registry doesn't save the pointers to the dsa_area or
dshash_table in local memory.  Also, the view doesn't show
partially-initialized entries to avoid ambiguity, since those
entries would report a NULL size as well.

This commit introduces a function that looks up the size of a DSA
given its handle (transiently attaching to the control segment if
needed) and teaches pg_dsm_registry_allocations to use it to show
the size of successfully-initialized DSA and dshash entries.
Furthermore, the view now reports partially-initialized entries
with a NULL size.

Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aSeEDeznAsHR1_YF%40nathan
2025-12-02 10:29:45 -06:00
Álvaro Herrera
758479213d
Remove doc and code comments about ON CONFLICT deficiencies
They have been fixed, so we don't need this text anymore.  This reverts
commit 8b18ed6dfb.

Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Discussion: https://postgr.es/m/CADzfLwWo+FV9WSeOah9F1r=4haa6eay1hNvYYy_WfziJeK+aLQ@mail.gmail.com
2025-12-02 16:47:18 +01:00
Álvaro Herrera
5dee7a603f
Avoid use of NOTICE to wait for snapshot invalidation
This idea (implemented in commits and bc32a12e0d and 9e8fa05d34) of
using notices to detect that a session is sleeping was unreliable, so
simplify the concurrency controller session to just look at
pg_stat_activity for a process sleeping on the injection point we want
it to hit.  This change allows us to remove a secondary injection point
and the alternative expected output files.

Reproduced by Alexander Lakhin following a report in buildfarm member
skink (which runs the server under valgrind).

Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/3e302c96-cdd2-45ec-af84-03dbcdccde4a@gmail.com
2025-12-02 16:43:27 +01:00
Álvaro Herrera
90eae926ab
Fix ON CONFLICT with REINDEX CONCURRENTLY and partitions
When planning queries with ON CONFLICT on partitioned tables, the
indexes to consider as arbiters for each partition are determined based
on those found in the parent table.  However, it's possible for an index
on a partition to be reindexed, and in that case, the auxiliary indexes
created on the partition must be considered as arbiters as well; failing
to do that may result in spurious "duplicate key" errors given
sufficient bad luck.

We fix that in this commit by matching every index that doesn't have a
parent to each initially-determined arbiter index.  Every unparented
matching index is considered an additional arbiter index.

Closely related to the fixes in bc32a12e0d and 2bc7e886fc, and for
identical reasons, not backpatched (for now) even though it's a
longstanding issue.

Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com
2025-12-02 13:51:53 +01:00
Peter Eisentraut
4f941d432b Remove useless casting to same type
This removes some casts where the input already has the same type as
the type specified by the cast.  Their presence could cause risks of
hiding actual type mismatches in the future or silently discarding
qualifiers.  It also improves readability.  Same kind of idea as
7f798aca1d and ef8fe69360.  (This does not change all such
instances, but only those hand-picked by the author.)

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal
2025-12-02 10:09:32 +01:00
Peter Eisentraut
35988b31db Simplify hash_xlog_split_allocate_page()
Instead of complicated pointer arithmetic, overlay a uint32 array and
just access the array members.  That's safe thanks to
XLogRecGetBlockData() returning a MAXALIGNed buffer.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal
2025-12-02 09:18:54 +01:00
Peter Eisentraut
ec782f56b0 Replace pointer comparisons and assignments to literal zero with NULL
While 0 is technically correct, NULL is the semantically appropriate
choice for pointers.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/aS1AYnZmuRZ8g%2B5G%40ip-10-97-1-34.eu-west-3.compute.internal
2025-12-02 08:39:24 +01:00
Michael Paquier
713d9a847e Update some timestamp[tz] functions to use soft-error reporting
This commit updates two functions that convert "timestamptz" to
"timestamp", and vice-versa, to use the soft error reporting rather than
a their own logic to do the same.  These are now named as follows:
- timestamp2timestamptz_safe()
- timestamptz2timestamp_safe()

These functions were suffixed with "_opt_overflow", previously.

This shaves some code, as it is possible to detect how a timestamp[tz]
overflowed based on the returned value rather than a custom state.  It
is optionally possible for the callers of these functions to rely on the
error generated internally by these functions, depending on the error
context.

Similar work has been done in d03668ea05 and 4246a977ba.

Reviewed-by: Amul Sul <sulamul@gmail.com>
Discussion: https://postgr.es/m/aS09YF2GmVXjAxbJ@paquier.xyz
2025-12-02 09:30:23 +09:00
Jeff Davis
19b966243c Make regex "max_chr" depend on encoding, not provider.
The regex mechanism scans through the first "max_chr" character values
to cache character property ranges (isalpha, etc.). For single-byte
encodings, there's no sense in scanning beyond UCHAR_MAX; but for
UTF-8 it makes sense to cache higher code point values (though not all
of them; only up to MAX_SIMPLE_CHR).

Prior to 5a38104b36, the logic about how many character values to scan
was based on the pg_regex_strategy, which was dependent on the
provider. Commit 5a38104b36 preserved that logic exactly, allowing
different providers to define the "max_chr".

Now, change it to depend only on the encoding and whether
ctype_is_c. For this specific calculation, distinguishing between
providers creates more complexity than it's worth.

Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
2025-12-01 11:06:17 -08:00
Jeff Davis
99cd8890be Change some callers to use pg_ascii_toupper().
The input is ASCII anyway, so it's better to be clear that it's not
locale-dependent.

Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
2025-12-01 09:24:03 -08:00
Álvaro Herrera
2bc7e886fc
Fix ON CONFLICT ON CONSTRAINT during REINDEX CONCURRENTLY
When REINDEX CONCURRENTLY is processing the index that supports a
constraint, there are periods during which multiple indexes match the
constraint index's definition.  Those must all be included in the set of
inferred index for INSERT ON CONFLICT, in order to avoid spurious
"duplicate key" errors.

To fix, we set things up to match all indexes against attributes,
expressions and predicates of the constraint index, then return all
indexes that match those, rather than just the one constraint index.
This is more onerous than before, where we would just test the named
constraint for validity, but it's not more onerous than processing
"conventional" inference (where a list of attribute names etc is given).

This is closely related to the misbehaviors fixed by bc32a12e0d, for a
different situation.  We're not backpatching this one for now either,
for the same reasons.

Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com
2025-12-01 17:34:13 +01:00
Peter Eisentraut
2fcc5a7151 Fix a strict aliasing violation
This one is almost a textbook example of an aliasing violation, and it
is straightforward to fix, so clean it up.  (The warning only shows up
if you remove the -fno-strict-aliasing option.)  Also, move the code
after the error checking.  Doesn't make a difference technically, but
it seems strange to do actions before errors are checked.

Reported-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/20240724.155525.366150353176322967.ishii%40postgresql.org
2025-12-01 16:41:08 +01:00
Michael Paquier
a87987cafc Move WAL sequence code into its own file
This split exists for most of the other RMGRs, and makes cleaner the
separation between the WAL code, the redo code and the record
description code (already in its own file) when it comes to the sequence
RMGR.  The redo and masking routines are moved to a new file,
sequence_xlog.c.  All the RMGR routines are now located in a new header,
sequence_xlog.h.

This separation is useful for a different patch related to sequences
that I have been working on, where it makes a refactoring of sequence.c
easier if its RMGR routines and its core routines are split.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/aSfTxIWjiXkTKh1E@paquier.xyz
2025-12-01 16:21:41 +09:00
Michael Paquier
d03668ea05 Switch some date/timestamp functions to use the soft error reporting
This commit changes some functions related to the data types date and
timestamp to use the soft error reporting rather than a custom boolean
flag called "overflow", used to let the callers of these functions know
if an overflow happens.

This results in the removal of some boilerplate code, as it is possible
to rely on an error context rather than a custom state, with the
possibility to use the error generated inside the functions updated
here, if necessary.

These functions were suffixed with "_opt_overflow".  They are now
renamed to use "_safe" as suffix.

This work is similar to 4246a977ba.

Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAAJ_b95HEmFyzHZfsdPquSHeswcopk8MCG1Q_vn4tVkZ+xxofw@mail.gmail.com
2025-12-01 15:22:20 +09:00
David Rowley
5424f4da90 Don't call simplify_aggref with a NULL PlannerInfo
42473b3b3 added prosupport infrastructure to allow simplification of
Aggrefs during constant-folding.  In some cases the context->root that's
given to eval_const_expressions_mutator() can be NULL.  42473b3b3 failed
to take that into account, which could result in a crash.

To fix, add a check and only call simplify_aggref() when the PlannerInfo
is set.

Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Birler, Altan <altan.birler@tum.de>
Discussion: https://postgr.es/m/132d4da23b844d5ab9e352d34096eab5@tum.de
2025-11-30 12:55:34 +13:00
Peter Geoghegan
c902bd57af Update obsolete row compare preprocessing comments.
We have some limited ability to detect redundant and contradictory
conditions involving an nbtree row comparison key following commits
f09816a0 and bd3f59fd: we can do so in simple cases involving IS NULL
and IS NOT NULL keys on a row compare key's first column.  We can
likewise determine that a scan's qual is unsatisfiable given a row
compare whose first subkey's arg is NULL.  Update obsolete comments that
claimed that we merely copied row compares into the output key array
"without any editorialization".

Also update another _bt_preprocess_keys header comment paragraph: add a
parenthetical remark that points out that preprocessing will generate a
skip array for the preceding example qual.  That will ultimate lead to
preprocessing marking the example's lower-order y key required -- which
is exactly what the example supposes cannot happen.  Keep the original
comment, though, since it accurately describes the mechanical rules that
determine which keys get marked required in the absence of skip arrays
(which can occasionally still matter).  This fixes an oversight in
commit 92fe23d9, which added the nbtree skip scan optimization.

Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
2025-11-29 16:41:51 -05:00
Dean Rasheed
3881561d77 Avoid rewriting data-modifying CTEs more than once.
Formerly, when updating an auto-updatable view, or a relation with
rules, if the original query had any data-modifying CTEs, the rewriter
would rewrite those CTEs multiple times as RewriteQuery() recursed
into the product queries. In most cases that was harmless, because
RewriteQuery() is mostly idempotent. However, if the CTE involved
updating an always-generated column, it would trigger an error because
any subsequent rewrite would appear to be attempting to assign a
non-default value to the always-generated column.

This could perhaps be fixed by attempting to make RewriteQuery() fully
idempotent, but that looks quite tricky to achieve, and would probably
be quite fragile, given that more generated-column-type features might
be added in the future.

Instead, fix by arranging for RewriteQuery() to rewrite each CTE
exactly once (by tracking the number of CTEs already rewritten as it
recurses). This has the advantage of being simpler and more efficient,
but it does make RewriteQuery() dependent on the order in which
rewriteRuleAction() joins the CTE lists from the original query and
the rule action, so care must be taken if that is ever changed.

Reported-by: Bernice Southey <bernice.southey@gmail.com>
Author: Bernice Southey <bernice.southey@gmail.com>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CAEDh4nyD6MSH9bROhsOsuTqGAv_QceU_GDvN9WcHLtZTCYM1kA@mail.gmail.com
Backpatch-through: 14
2025-11-29 12:28:59 +00:00
Peter Eisentraut
87c6f8b047 Generate translator comments for GUC parameter descriptions
Automatically generate comments like

    /* translator: GUC parameter "client_min_messages" short description */

in the generated guc_tables.inc.c.

This provides translators more context.

Reviewed-by: Pavlo Golub <pavlo.golub@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Stéphane Schildknecht <sas.postgresql@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/1a89b3f0-e588-41ef-b712-aba766143cad%40eisentraut.org
2025-11-28 16:01:59 +01:00
Peter Eisentraut
8b3e2c622a Fix pg_isblank()
There was a pg_isblank() function that claimed to be a replacement for
the standard isblank() function, which was thought to be "not very
portable yet".  We can now assume that it's portable (it's in C99).

But pg_isblank() actually diverged from the standard isblank() by also
accepting '\r', while the standard one only accepts space and tab.
This was added to support parsing pg_hba.conf under Windows.  But the
hba parsing code now works completely differently and already handles
line endings before we get to pg_isblank().  The other user of
pg_isblank() is for ident protocol message parsing, which also handles
'\r' separately.  So this behavior is now obsolete and confusing.

To improve clarity, I separated those concerns.  The ident parsing now
gets its own function that hardcodes the whitespace characters
mentioned by the relevant RFC.  pg_isblank() is now static in hba.c
and is a wrapper around the standard isblank(), with some extra logic
to ensure robust treatment of non-ASCII characters.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/170308e6-a7a3-4484-87b2-f960bb564afa%40eisentraut.org
2025-11-28 08:33:07 +01:00
Amit Kapila
e68b6adad9 Add slotsync_skip_reason column to pg_replication_slots view.
Introduce a new column, slotsync_skip_reason, in the pg_replication_slots
view. This column records the reason why the last slot synchronization was
skipped. It is primarily relevant for logical replication slots on standby
servers where the 'synced' field is true. The value is NULL when
synchronization succeeds.

Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAE9k0PkhfKrTEAsGz4DjOhEj1nQ+hbQVfvWUxNacD38ibW3a1g@mail.gmail.com
2025-11-28 05:21:35 +00:00
David Rowley
d167c19295 Fix possibly uninitialized HeapScanDesc.rs_startblock
The solution used in 0ca3b1697 to determine the Parallel TID Range
Scan's start location was to modify the signature of
table_block_parallelscan_startblock_init() to allow the startblock
to be passed in as a parameter.  This allows the scan limits to be
adjusted before that function is called so that the limits are picked up
when the parallel scan starts.  The commit made it so the call to
table_block_parallelscan_startblock_init uses the HeapScanDesc's
rs_startblock to pass the startblock to the parallel scan.  That all
works ok for Parallel TID Range scans as the HeapScanDesc rs_startblock
gets set by heap_setscanlimits(), but for Parallel Seq Scans, initscan()
does not initialize rs_startblock, and that results in passing an
uninitialized value to table_block_parallelscan_startblock_init() as
noted by the buildfarm member skink, running Valgrind.

To fix this issue, make it so initscan() sets the rs_startblock for
parallel scans unless we're doing a rescan.  This makes it so
table_block_parallelscan_startblock_init() will be called with the
startblock set to InvalidBlockNumber, and that'll allow the syncscan
code to find the correct start location (when enabled).  For Parallel
TID Range Scans, this InvalidBlockNumber value will be overwritten in
the call to heap_setscanlimits().

initscan() is a bit light on documentation on what's meant to get
initialized where for parallel scans.  From what I can tell, it looks like
it just didn't matter prior to 0ca3b1697 that rs_startblock was left
uninitialized for parallel scans.  To address the light documentation,
I've also added some comments to mention that the syncscan location for
parallel scans is figured out in table_block_parallelscan_startblock_init.
I've also taken the liberty to adjust the if/else if/else code in
initscan() to make it clearer which parts apply to parallel scans and
which parts are for the serial scans.

Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvqALm+k7FyfdQdCw1yF_8HojvR61YRrNhwRQPE=zSmnQA@mail.gmail.com
2025-11-28 12:40:50 +13:00
Michael Paquier
9660906dbd Add routines for marking buffers dirty efficiently
This commit introduces new internal bufmgr routines for marking shared
buffers as dirty:
* MarkDirtyUnpinnedBuffer()
* MarkDirtyRelUnpinnedBuffers()
* MarkDirtyAllUnpinnedBuffers()

These functions provide an efficient mechanism to respectively mark one
buffer, all the buffers of a relation, or the entire shared buffer pool
as dirty, something that can be useful to force patterns for the
checkpointer.  MarkDirtyUnpinnedBufferInternal(), an extra routine, is
used by these three, to mark as dirty an unpinned buffer.

They are intended as developer tools to manipulate buffer dirtiness in
bulk, and will be used in a follow-up commit.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Joseph Koshakow <koshy44@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw@mail.gmail.com
2025-11-28 07:39:33 +09:00
Tom Lane
5528e8d104 Allow indexscans on partial hash indexes with implied quals.
Normally, if a WHERE clause is implied by the predicate of a partial
index, we drop that clause from the set of quals used with the index,
since it's redundant to test it if we're scanning that index.
However, if it's a hash index (or any !amoptionalkey index), this
could result in dropping all available quals for the index's first
key, preventing us from generating an indexscan.

It's fair to question the practical usefulness of this case.  Since
hash only supports equality quals, the situation could only arise
if the index's predicate is "WHERE indexkey = constant", implying
that the index contains only one hash value, which would make hash
a really poor choice of index type.  However, perhaps there are
other !amoptionalkey index AMs out there with which such cases are
more plausible.

To fix, just don't filter the candidate indexquals this way if
the index is !amoptionalkey.  That's a bit hokey because it may
result in testing quals we didn't need to test, but to do it
more accurately we'd have to redundantly identify which candidate
quals are actually usable with the index, something we don't know
at this early stage of planning.  Doesn't seem worth the effort.

Reported-by: Sergei Glukhov <s.glukhov@postgrespro.ru>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/e200bf38-6b45-446a-83fd-48617211feff@postgrespro.ru
Backpatch-through: 14
2025-11-27 13:09:59 -05:00
Amit Langote
519fa0433b Fix error reporting for SQL/JSON path type mismatches
transformJsonFuncExpr() used exprType()/exprLocation() on the
possibly coerced path expression, which could be NULL when
coercion to jsonpath failed, leading to "cache lookup failed
for type 0" errors.

Preserve the original expression node so that type and location
in the "must be of type jsonpath" error are reported correctly.
Add regression tests to cover these cases.

Reported-by: Jian He <jian.universality@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CACJufxHunVg81JMuNo8Yvv_hJD0DicgaVN2Wteu8aJbVJPBjZA@mail.gmail.com
Backpatch-through: 17
2025-11-27 12:07:01 +09:00
David Rowley
0ca3b16973 Add parallelism support for TID Range Scans
In v14, bb437f995 added support for scanning for ranges of TIDs using a
dedicated executor node for the purpose.  Here, we allow these scans to
be parallelized.  The range of blocks to scan is divvied up similarly to
how a Parallel Seq Scans does that, where 'chunks' of blocks are
allocated to each worker and the size of those chunks is slowly reduced
down to 1 block per worker by the time we're nearing the end of the
scan.  Doing that means workers finish at roughly the same time.

Allowing TID Range Scans to be parallelized removes the dilemma from the
planner as to whether a Parallel Seq Scan will cost less than a
non-parallel TID Range Scan due to the CPU concurrency of the Seq Scan
(disk costs are not divided by the number of workers).  It was possible
the planner could choose the Parallel Seq Scan which would result in
reading additional blocks during execution than the TID Scan would have.
Allowing Parallel TID Range Scans removes the trade-off the planner
makes when choosing between reduced CPU costs due to parallelism vs
additional I/O from the Parallel Seq Scan due to it scanning blocks from
outside of the required TID range.  There is also, of course, the
traditional parallelism performance benefits to be gained as well, which
likely doesn't need to be explained here.

Author: Cary Huang <cary.huang@highgo.ca>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Discussion: https://postgr.es/m/18f2c002a24.11bc2ab825151706.3749144144619388582@highgo.ca
2025-11-27 14:05:04 +13:00
David Rowley
42473b3b31 Have the planner replace COUNT(ANY) with COUNT(*), when possible
This adds SupportRequestSimplifyAggref to allow pg_proc.prosupport
functions to receive an Aggref and allow them to determine if there is a
way that the Aggref call can be optimized.

Also added is a support function to allow transformation of COUNT(ANY)
into COUNT(*).  This is possible to do when the given "ANY" cannot be
NULL and also that there are no ORDER BY / DISTINCT clauses within the
Aggref.  This is a useful transformation to do as it is common that
people write COUNT(1), which until now has added unneeded overhead.
When counting a NOT NULL column.  The overheads can be worse as that
might mean deforming more of the tuple, which for large fact tables may
be many columns in.

It may be possible to add prosupport functions for other aggregates.  We
could consider if ORDER BY could be dropped for some calls, e.g. the
ORDER BY is quite useless in MAX(c ORDER BY c).

There is a little bit of passing fallout from adjusting
expr_is_nonnullable() to handle Const which results in a plan change in
the aggregates.out regression test.  Previously, nothing was able to
determine that "One-Time Filter: (100 IS NOT NULL)" was always true,
therefore useless to include in the plan.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAApHDvqGcPTagXpKfH=CrmHBqALpziThJEDs_MrPqjKVeDF9wA@mail.gmail.com
2025-11-27 10:43:28 +13:00
Nathan Bossart
dbdc717ac6 Teach DSM registry to retry entry initialization if needed.
If DSM registry entry initialization fails, backends could try to
use an uninitialized DSM segment, DSA, or dshash table (since the
entry is still added to the registry).  To fix, restructure the
code so that the registry retries initialization as needed.  This
commit also modifies pg_get_dsm_registry_allocations() to leave out
partially-initialized entries, as they shouldn't have any allocated
memory.

DSM registry entry initialization shouldn't fail often in practice,
but retrying was deemed better than leaving entries in a
permanently failed state (as was done by commit 1165a933aa, which
has since been reverted).

Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/E1vJHUk-006I7r-37%40gemulon.postgresql.org
Backpatch-through: 17
2025-11-26 15:12:25 -06:00
Jeff Davis
1476028225 Allow pg_locale_t APIs to work when ctype_is_c.
Previously, the caller needed to check ctype_is_c first for some
routines and not others. Now, the APIs consistently work, and the
caller can just check ctype_is_c for optimization purposes.

Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
2025-11-26 12:54:37 -08:00
Nathan Bossart
2dd506b859 Revert "Teach DSM registry to ERROR if attaching to an uninitialized entry."
This reverts commit 1165a933aa (and the corresponding commits on
the back-branches).  In a follow-up commit, we'll teach the
registry to retry entry initialization instead of leaving it in a
permanently failed state.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/E1vJHUk-006I7r-37%40gemulon.postgresql.org
Backpatch-through: 17
2025-11-26 11:37:21 -06:00
Melanie Plageman
e135e04457 Split heap_page_prune_and_freeze() into helpers
Refactor the setup and planning phases of pruning and freezing into
helpers. This streamlines heap_page_prune_and_freeze() and makes it more
clear when the examination of tuples ends and page modifications begin.

No code change beyond what was required to extract the code into helper
functions.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/mhf4vkmh3j57zx7vuxp4jagtdzwhu3573pgfpmnjwqa6i6yj5y%40sy4ymcdtdklo
2025-11-26 11:00:34 -05:00
Nathan Bossart
9446f918ac Remove a few unused struct members.
Oversights in commits ab9e0e718a, f3049a603a, and 247ce06b88.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aScUuBSawPWogUxs%40ip-10-97-1-34.eu-west-3.compute.internal
2025-11-26 09:50:00 -06:00
Daniel Gustafsson
b3fe098d33 Add GUC to show EXEC_BACKEND state
There is no straightforward way to determine if a cluster is running
in EXEC_BACKEND mode or not, which is useful for tests to know. This
adds a GUC debug_exec_backend similar to debug_assertions which will
be true when the server is running in EXEC_BACKEND mode.

Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/5F301096-921A-427D-8EC1-EBAEC2A35082@yesql.se
2025-11-26 14:24:27 +01:00
Peter Eisentraut
8fe4aef829 Replace internal C function pg_hypot() by standard hypot()
The code comment said, "It is expected that this routine will
eventually be replaced with the C99 hypot() function.", so let's do
that now.

This function is tested via the geometry regression test, so if it is
faulty on any platform, it will show up there.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/170308e6-a7a3-4484-87b2-f960bb564afa%40eisentraut.org
2025-11-26 07:48:29 +01:00
Michael Paquier
e1405aa5e3 Add input function for data type pg_dependencies
pg_dependencies is used as data type for the contents of dependencies
extended statistics.  This new input function consumes the format that
has been established by e76defbcf0 for the output function of
pg_dependencies, enforcing some sanity checks for:
- Checks for the input object, which should be a one-dimension array
with correct attributes and values.
- The key names: "attributes", "dependency", "degree".  All are
required, other key names are blocked.
- Value types for each key: "attributes" requires an array of integers,
"dependency" an attribute number, "degree" a float.
- List of attributes.  In this case, it is possible that some
dependencies are not listed in the statistics data, as items with a
degree of 0 are discarded when building the statistics.  This commit
includes checks for simple scenarios, like duplicated attributes, or
overlapping values between the list of "attributes" and the "dependency"
value.  Even if the input function considers the input as valid, a value
still needs to be cross-checked with the attributes defined in a
statistics object at import.
- Based on the discussion, the checks on the values are loose, as there
is also an argument for potentially stats injection.  For example,
"degree" should be defined in [0.0,1.0], but a check is not enforced.

This is required for a follow-up patch that aims to implement the import
of extended statistics.  Some tests are added to check the code paths of
the JSON parser checking the shape of the pg_dependencies inputs, with
91% of code coverage reached.  The tests are located in their own new
test file, for clarity.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yuefei Shi <shiyuefei1004@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
2025-11-26 10:53:16 +09:00
Michael Paquier
44eba8f06e Add input function for data type pg_ndistinct
pg_ndistinct is used as data type for the contents of ndistinct extended
statistics.  This new input function consumes the format that has been
established by 1f927cce44 for the output function of pg_ndistinct,
enforcing some sanity checks for:
- Checks for the input object, which should be a one-dimension array
with correct attributes and values.
- The key names: "attributes", "ndistinct".  Both are required, other
key names are blocked.
- Value types for each key: "attributes" requires an array of integers,
and "ndistinct" an integer.
- List of attributes.  Note that this enforces a check so as an
attribute list has to be a subset of the longest attribute list found.
This does not enforce that a full group of attribute sets exist, based
on how the groups are generated when the ndistinct objects are
generated, making the list of ndistinct items a bit loose.  Note a check
would still be required at import to see if the attributes listed match
with the attribute numbers set in the definition of a statistics object.
- Based on the discussion, the checks on the values are loose, as there
is also an argument for potentially stats injection.  The relation and
attribute level stats follow the same line of argument for the values.

This is required for a follow-up patch that aims to implement the import
of extended statistics.  Some tests are added to check the code paths of
the JSON parser checking the shape of the pg_ndistinct inputs, with 90%
of code coverage reached.  The tests are located in their own new test
file, for clarity.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Yuefei Shi <shiyuefei1004@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
2025-11-26 10:13:18 +09:00
Melanie Plageman
cd38b7e773 Assert that cutoffs are provided if freezing will be attempted
heap_page_prune_and_freeze() requires the caller to initialize
PruneFreezeParams->cutoffs so that the function can correctly evaluate
whether tuples should be frozen. This requirement previously existed
only in comments and was easy to miss, especially after “cutoffs” was
converted from a direct function parameter to a field of the newly
introduced PruneFreezeParams struct (added in 1937ed7062). Adding an
assert makes this requirement explicit and harder to violate.

Also, fix a minor typo while we're at it.

Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/0AC177F5-5E26-45EE-B273-357C51212AC5%40gmail.com
2025-11-25 16:41:29 -05:00
Jeff Davis
3b9c118920 Remove a useless length check.
Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/CAEoWx2mW0P8CByavV58zm3=eb2MQHaKOcDEF5B2UJYRyC2c3ig@mail.gmail.com
2025-11-25 11:39:41 -08:00
Álvaro Herrera
417ac9c1ee
Improve test case stability
Given unlucky timing, some of the new tests added by commit bc32a12e0d
can fail spuriously.  We haven't seen such failures yet in buildfarm,
but allegedly we can prevent them with this tweak.

While at it, remove an unused injection point I (Álvaro) added.

Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Discussion: https://postgr.es/m/CADzfLwUc=jtSUEaQCtyt8zTeOJ-gHZ8=w_KJsVjDOYSLqaY9Lg@mail.gmail.com
Discussion: https://postgr.es/m/CADzfLwV5oQq-Vg_VmG_o4SdL6yHjDoNO4T4pMtgJLzYGmYf74g@mail.gmail.com
2025-11-25 18:20:06 +01:00
Peter Eisentraut
7169c0b96b gen_guc_tables.pl: Validate required GUC fields before code generation
Previously, gen_guc_tables.pl would emit "Use of uninitialized value"
warnings if required fields were missing in guc_parameters.dat (for
example, when an integer or real GUC omitted the 'max' value).  The
resulting error messages were unclear and did not identify which GUC
entry was problematic.

Add explicit validation of required fields depending on the parameter
type, and fail with a clear and specific message such as:

    guc_parameters.dat:1909: error: entry "max_index_keys" of type "int" is missing required field "max"

No changes to generated guc_tables.c.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2%3DoP4LgHi771_OKhPPUS7B-CTqCs%3D%3DuQcNXWrwBoAm5Vg%40mail.gmail.com
2025-11-25 16:50:34 +01:00
Peter Eisentraut
2256af4ba2 backend/nodes cleanup: Move loop variables definitions into for statement
Author: Chao Li (Evan) <lic@highgo.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2nP12qwAaiJutbn1Kw50atN6FbMJNQ4bh4%2BfP_Ay_u7Eg%40mail.gmail.com
2025-11-25 15:38:23 +01:00
Amit Kapila
3df4df53b0 Fix a BF failure caused by commit 76b78721ca.
The issue occurred because the replication slot was not released in the
slotsync worker when a slot synchronization cycle was skipped. This skip
happened because the required WAL was not received and flushed on the
standby server. As a result, in the next cycle, when attempting to acquire
the slot, an assertion failure was triggered.

Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Discussion: https://postgr.es/m/CAA4eK1KMwYUYy=oAVHu9mam+vX50ixxfhO4_C=kgQC8VCQHEfw@mail.gmail.com
2025-11-25 08:49:46 +00:00
Amit Kapila
76b78721ca Add slotsync skip statistics.
This patch adds two new columns to the pg_stat_replication_slots view:
slotsync_skip_count - the total number of times a slotsync operation was
skipped.
slotsync_skip_at - the timestamp of the most recent skip.

These additions provide better visibility into replication slot
synchronization behavior.

A future patch will introduce the slotsync_skip_reason column in
pg_replication_slots to capture the reason for skip.

Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAE9k0PkhfKrTEAsGz4DjOhEj1nQ+hbQVfvWUxNacD38ibW3a1g@mail.gmail.com
2025-11-25 07:06:02 +00:00
Peter Eisentraut
c581c9a7ac Remove obsolete comment
This comment should probably have been moved to pg_locale_libc.c in
commit 66ac94cdc7 (2024), but upon closer examination it was already
completely obsolete then.

The first part of the comment has been obsolete since commit
85feb77aa0 (2017), which required that the system provides the
required wide-character functions.

The second part has been obsolete since commit e9931bfb75 (2024),
which eliminated code paths depending on the global LC_CTYPE setting.

Discussion: https://www.postgresql.org/message-id/flat/170308e6-a7a3-4484-87b2-f960bb564afa%40eisentraut.org
2025-11-25 06:26:49 +01:00
Michael Paquier
ed823da128 Rename routines for write/read of pgstats file
This commit renames write_chunk and read_chunk to respectively
pgstat_write_chunk() and pgstat_read_chunk(), along with the *_s
convenience macros.

These are made available for plug-ins, so as any code that decides to
write and/or read stats data can rely on a single code path for this
work.

Extracted from a larger patch by the same author.

Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s9SDOu+Z6veoJCHWk+kDeTktAtC-KY9fQ9Z6BJdDUirQ@mail.gmail.com
2025-11-25 10:55:40 +09:00
Andres Freund
81f7738953 lwlock: Fix, currently harmless, bug in LWLockWakeup()
Accidentally the code in LWLockWakeup() checked the list of to-be-woken up
processes to see if LW_FLAG_HAS_WAITERS should be unset. That means that
HAS_WAITERS would not get unset immediately, but only during the next,
unnecessary, call to LWLockWakeup().

Luckily, as the code stands, this is just a small efficiency issue.

However, if there were (as in a patch of mine) a case in which LWLockWakeup()
would not find any backend to wake, despite the wait list not being empty,
we'd wrongly unset LW_FLAG_HAS_WAITERS, leading to potentially hanging.

While the consequences in the backbranches are limited, the code as-is
confusing, and it is possible that there are workloads where the additional
wait list lock acquisitions hurt, therefore backpatch.

Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Backpatch-through: 14
2025-11-24 18:10:48 -05:00
Jeff Davis
f81bf78ce1 Avoid global LC_CTYPE dependency in pg_locale_libc.c.
Call tolower_l() directly instead of through pg_tolower(), because the
latter depends on the global LC_CTYPE.

Discussion: https://postgr.es/m/8186b28a1a39e61a0d833a4c25a8909ebbbabd48.camel@j-davis.com
2025-11-24 14:55:09 -08:00
Tom Lane
698fa924b1 Improve detection of implicitly-temporary views.
We've long had a practice of making views temporary by default if they
reference any temporary tables.  However the implementation was pretty
incomplete, in that it only searched for RangeTblEntry references to
temp relations.  Uses of temporary types, regclass constants, etc
were not detected even though the dependency mechanism considers them
grounds for dropping the view.  Thus a view not believed to be temp
could silently go away at session exit anyhow.

To improve matters, replace the ad-hoc isQueryUsingTempRelation()
logic with use of the dependency-based infrastructure introduced by
commit 572c40ba9.  This is complete by definition, and it's less code
overall.

While we're at it, we can also extend the warning NOTICE (or ERROR
in the case of a materialized view) to mention one of the temp
objects motivating the classification of the view as temp, as was
done for functions in 572c40ba9.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/19cf6ae1-04cd-422c-a760-d7e75fe6cba9@uni-muenster.de
2025-11-24 17:00:16 -05:00
Jacob Champion
e2ceff13d8 postgres: Use pg_{add,mul}_size_overflow()
The backend implementations of add_size() and mul_size() can now make
use of the APIs provided in common/int.h.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOYmi%2B%3D%2BpqUd2MUitvgW1pAJuXgG_TKCVc3_Ek7pe8z9nkf%2BAg%40mail.gmail.com
2025-11-24 09:59:54 -08:00
Álvaro Herrera
bc32a12e0d
Fix infer_arbiter_index during concurrent index operations
Previously, we would only consider indexes marked indisvalid as usable
for INSERT ON CONFLICT.  But that's problematic during CREATE INDEX
CONCURRENTLY and REINDEX CONCURRENTLY, because concurrent transactions
would end up with inconsistents lists of inferred indexes, leading to
deadlocks and spurious errors about unique key violations (because two
transactions are operating on different indexes for the speculative
insertion tokens).  Change this function to return indexes even if
invalid.  This fixes the spurious errors and deadlocks.

Because such indexes might not be complete, we still need uniqueness to
be verified in a different way.  We do that by requiring that at least
one index marked valid is part of the set of indexes returned.  It is
that index that is going to help ensure that the inserted tuple is
indeed unique.

This does not fix similar problems occurring with partitioned tables or
with named constraints.  These problems will be fixed in follow-up
commits.

We have no user report of this problem, even though it exists in all
branches.  Because of that and given that the fix is somewhat tricky, I
decided not to backpatch for now.

Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CANtu0ogv+6wqRzPK241jik4U95s1pW3MCZ3rX5ZqbFdUysz7Qw@mail.gmail.com
2025-11-24 17:17:29 +01:00
Tom Lane
572c40ba94 Issue a NOTICE if a created function depends on any temp objects.
We don't have an official concept of temporary functions.  (You can
make one explicitly in pg_temp, but then you have to explicitly
schema-qualify it on every call.)  However, until now we were quite
laissez-faire about whether a non-temporary function could depend on
a temporary object, such as a temp table or view.  If one does,
it will silently go away at end of session, due to the automatic
DROP ... CASCADE on the session's temporary objects.  People have
complained that that's surprising; however, we can't really forbid
it because other people (including our own regression tests) rely
on being able to do it.  Let's compromise by emitting a NOTICE
at CREATE FUNCTION time.  This is somewhat comparable to our
ancient practice of emitting a NOTICE when forcing a view to
become temp because it depends on temp tables.

Along the way, refactor recordDependencyOnExpr() so that the
dependencies of an expression can be combined with other
dependencies, instead of being emitted separately and perhaps
duplicatively.

We should probably make the implementation of temp-by-default
views use the same infrastructure used here, but that's for
another patch.  It's unclear whether there are any other object
classes that deserve similar treatment.

Author: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19cf6ae1-04cd-422c-a760-d7e75fe6cba9@uni-muenster.de
2025-11-23 15:02:55 -05:00
Tom Lane
b140c8d7a3 Add SupportRequestInlineInFrom planner support request.
This request allows a support function to replace a function call
appearing in FROM (typically a set-returning function) with an
equivalent SELECT subquery.  The subquery will then be subject
to the planner's usual optimizations, potentially allowing a much
better plan to be generated.  While the planner has long done this
automatically for simple SQL-language functions, it's now possible
for extensions to do it for functions outside that group.
Notably, this could be useful for functions that are presently
implemented in PL/pgSQL and work by generating and then EXECUTE'ing
a SQL query.

Author: Paul A Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/09de6afa-c33d-4d94-a5cb-afc6cea0d2bb@illuminatedcomputing.com
2025-11-22 19:33:34 -05:00
Peter Eisentraut
5eed8ce50c Add range_minus_multi and multirange_minus_multi functions
The existing range_minus function raises an exception when the range is
"split", because then the result can't be represented by a single range.
For example '[0,10)'::int4range - '[4,5)' would be '[0,4)' and '[5,10)'.

This commit adds new set-returning functions so that callers can get
results even in the case of splits. There is no risk of an exception for
multiranges, but a set-returning function lets us handle them the same
way we handle ranges.

Both functions return zero results if the subtraction would give an
empty range/multirange.

The main use-case for these functions is to implement UPDATE/DELETE FOR
PORTION OF, which must compute the application-time of "temporal
leftovers": the part of history in an updated/deleted row that was not
changed. To preserve the untouched history, we will implicitly insert
one record for each result returned by range/multirange_minus_multi.
Using a set-returning function will also let us support user-defined
types for application-time update/delete in the future.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com
2025-11-22 09:42:03 +01:00
Thomas Munro
0dceba21d7 jit: Adjust AArch64-only code for LLVM 21.
LLVM 21 changed the arguments of RTDyldObjectLinkingLayer's
constructor, breaking compilation with the backported
SectionMemoryManager from commit 9044fc1d.

cd585864c0

Backpatch-through: 14
Author: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/d25e6e4a-d1b4-84d3-2f8a-6c45b975f53d%40applied-asynchrony.com
2025-11-22 21:21:11 +13:00
Peter Eisentraut
ef8fe69360 Remove useless casts to (void *)
Their presence causes (small) risks of hiding actual type mismatches
or silently discarding qualifiers.  Some have been missed in
7f798aca1d and some are new ones along the same lines.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aR8Yv%2BuATLKbJCgI%40ip-10-97-1-34.eu-west-3.compute.internal
2025-11-21 16:49:40 +01:00
Peter Eisentraut
97e04c74be C11 alignas instead of unions
This changes a few union members that only existed to ensure
alignments and replaces them with the C11 alignas specifier.

This change only uses fundamental alignments (meaning approximately
alignments of basic types), which all C11 compilers must support.
There are opportunities for similar changes using extended alignments,
for example in PGIOAlignedBlock, but these are not necessarily
supported by all compilers, so they are kept as a separate change.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/46f05236-d4d4-4b4e-84d4-faa500f14691%40eisentraut.org
2025-11-21 10:08:24 +01:00
Masahiko Sawada
266543a620 Use "COPY table TO" for partitioned tables in initial table synchronization.
Commit 4bea91f added support for "COPY table TO" with partitioned
tables. This commit enhances initial table synchronization in logical
replication to use "COPY table TO" for partitioned tables if possible,
instead of "COPY (SELECT ...) TO" variant, improving performance.

Author: Ajin Cherian <itsajin@gmail.com>
Discussion: https://postgr.es/m/CAFPTHDY=w+xmEof=yyjhbDzaLxhBkoBzKcksEofXcT6EcjMbtQ@mail.gmail.com
2025-11-20 14:50:27 -08:00
Melanie Plageman
1e14edcea5 Split PruneFreezeParams initializers to one field per line
This conforms more closely with the style of other struct initializers
in the code base. Initializing multiple fields on a single line is
unpopular in part because pgindent won't permit a space after the comma
before the next field's period.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/87see87fnq.fsf%40wibble.ilmari.org
2025-11-20 17:36:40 -05:00
Melanie Plageman
65ec565b19 Update PruneState.all_[visible|frozen] earlier in pruning
During pruning and freezing in phase I of vacuum, we delay clearing
all_visible and all_frozen in the presence of dead items. This allows
opportunistic freezing if the page would otherwise be fully frozen,
since those dead items are later removed in vacuum phase III.

To move the VM update into the same WAL record that
prunes and freezes tuples, we must know whether the page will
be marked all-visible/all-frozen before emitting WAL. Previously we
waited until after emitting WAL to update all_visible/all_frozen to
their correct values.

The only barrier to updating these flags immediately after deciding
whether to opportunistically freeze was that while emitting WAL for a
record freezing tuples, we use the pre-corrected value of all_frozen to
compute the snapshot conflict horizon. By determining the conflict
horizon earlier, we can update the flags immediately after making the
opportunistic freeze decision.

This is required to set the VM in the XLOG_HEAP2_PRUNE_VACUUM_SCAN
record emitted by pruning and freezing.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2025-11-20 11:43:18 -05:00
Melanie Plageman
351d7e2441 Keep all_frozen updated in heap_page_prune_and_freeze
Previously, we relied on all_visible and all_frozen being used together
to ensure that all_frozen was correct, but it is better to keep both
fields updated.

Future changes will separate their usage, so we should not depend on
all_visible for the validity of all_frozen.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2025-11-20 10:59:24 -05:00
Melanie Plageman
1937ed7062 Refactor heap_page_prune_and_freeze() parameters into a struct
heap_page_prune_and_freeze() had accumulated an unwieldy number of input
parameters and upcoming work to handle VM updates in this function will
add even more.

Introduce a new PruneFreezeParams struct to group the function’s input
parameters, improving readability and maintainability.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/yn4zp35kkdsjx6wf47zcfmxgexxt4h2og47pvnw2x5ifyrs3qc%407uw6jyyxuyf7
2025-11-20 10:32:14 -05:00
Fujii Masao
99780da720 Add HINT listing valid encodings to encode() and decode() errors.
This commit updates encode() and decode() so that when an invalid encoding
is specified, their error message includes a HINT listing all valid encodings.
This helps users quickly see which encodings are supported without needing
to consult the documentation.

Author: Shinya Sugamoto <shinya34892@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAAe3y+99sfPv8UDF1VM-rC1i5HBdqxUh=2HrbJJFm2+i=1OwOw@mail.gmail.com
2025-11-20 09:14:02 +09:00
Tom Lane
057012b205 Speed up eqjoinsel() with lots of MCV entries.
If both sides of the operator have most-common-value statistics,
eqjoinsel wants to check which MCVs have matches on the other side.
Formerly it did this with a dumb compare-all-the-entries loop,
which had O(N^2) behavior for long MCV lists.  When that code was
written, twenty-plus years ago, that seemed tolerable; but nowadays
people frequently use much larger statistics targets, so that the
O(N^2) behavior can hurt quite a bit.

To add insult to injury, when asked for semijoin semantics, the
entire comparison loop was done over, even though we frequently
know that it will yield exactly the same results.

To improve matters, switch to using a hash table to perform the
matching.  Testing suggests that depending on the data type, we may
need up to about 100 MCVs on each side to amortize the extra costs
of setting up the hash table and performing hash-value computations;
so continue to use the old looping method when there are fewer MCVs
than that.

Also, refactor so that we don't repeat the matching work unless
we really need to, which occurs only in the uncommon case where
eqjoinsel_semi decides to truncate the set of inner MCVs it
considers.  The refactoring also got rid of the need to use the
presented operator's commutator.  Real-world operators that are
using eqjoinsel should pretty much always have commutators, but
at the very least this saves a few syscache lookups.

Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Co-authored-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20ea8bf5-3569-4e46-92ef-ebb2666debf6@tantorlabs.com
2025-11-19 13:22:12 -05:00
Peter Eisentraut
86b276a4a9 Fix indentation
for commit 0fc33b0053
2025-11-19 10:41:28 +01:00
Peter Eisentraut
0fc33b0053 Fix NLS for incorrect GUC enum value hint message
The translation markers were applied at the wrong place, so no string
was extracted for translation.

Also add translator comments here and in a similar place.

Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/2c961fa1-14f6-44a2-985c-e30b95654e8d%40eisentraut.org
2025-11-19 08:48:21 +01:00
Richard Guo
4d3f623ea8 Fix typo in nodeHash.c
Replace "overlow" with "overflow".

Author: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNnzFjAjYLTkP78HE2PQ17MjBqFdQQg+0X6Wo7YMUb68xA@mail.gmail.com
2025-11-19 11:04:03 +09:00
Álvaro Herrera
77fb3959a4
Fix typo 2025-11-18 19:31:23 +01:00
Tom Lane
35b5c62c3a Don't allow CTEs to determine semantic levels of aggregates.
The fix for bug #19055 (commit b0cc0a71e) allowed CTE references in
sub-selects within aggregate functions to affect the semantic levels
assigned to such aggregates.  It turns out this broke some related
cases, leading to assertion failures or strange planner errors such
as "unexpected outer reference in CTE query".  After experimenting
with some alternative rules for assigning the semantic level in
such cases, we've come to the conclusion that changing the level
is more likely to break things than be helpful.

Therefore, this patch undoes what b0cc0a71e changed, and instead
installs logic to throw an error if there is any reference to a
CTE that's below the semantic level that standard SQL rules would
assign to the aggregate based on its contained Var and Aggref nodes.
(The SQL standard disallows sub-selects within aggregate functions,
so it can't reach the troublesome case and hence has no rule for
what to do.)

Perhaps someone will come along with a legitimate query that this
logic rejects, and if so probably the example will help us craft
a level-adjustment rule that works better than what b0cc0a71e did.
I'm not holding my breath for that though, because the previous
logic had been there for a very long time before bug #19055 without
complaints, and that bug report sure looks to have originated from
fuzzing not from real usage.

Like b0cc0a71e, back-patch to all supported branches, though
sadly that no longer includes v13.

Bug: #19106
Reported-by: Kamil Monicz <kamil@monicz.dev>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19106-9dd3668a0734cd72@postgresql.org
Backpatch-through: 14
2025-11-18 12:56:55 -05:00
Nathan Bossart
f63ae72bbc Switch from tabs to spaces in postgresql.conf.sample.
This file is written for 8-space tabs, since we expect that most
users who edit their configuration files use 8-space tabs.
However, most of PostgreSQL is written for 4-space tabs, and at
least one popular web interface defaults to 4-space tabs.  Rather
than trying to standardize on a particular tab width for this file,
let's just switch to spaces.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aReNUKdMgKxLqmq7%40nathan
2025-11-18 10:28:36 -06:00
Alexander Korotkov
75e82b2f5a Optimize shared memory usage for WaitLSNProcInfo
We need separate pairing heaps for different WaitLSNType's, because there
might be waiters for different LSN's at the same time.  However, one process
can wait only for one type of LSN at a time.  So, no need for inHeap
and heapNode fields to be arrays.

Discussion: https://postgr.es/m/CAPpHfdsBR-7sDtXFJ1qpJtKiohfGoj%3DvqzKVjWxtWsWidx7G_A%40mail.gmail.com
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
2025-11-18 09:50:12 +02:00
Amit Kapila
3edaf29fa5 Rename two columns in pg_stat_subscription_stats.
This patch renames the sync_error_count column to sync_table_error_count
in the pg_stat_subscription_stats view. The new name makes the purpose
explicit now that a separate column exists to track sequence
synchronization errors.

Additionally, the column seq_sync_error_count is renamed to
sync_seq_error_count to maintain a consistent naming pattern, making it
easier for users to group, and query synchronization related counters.

Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3WwJmz=-4ybTkhniB-Nf3qmFG9Zx1uKjyLLoPF5NYYXA@mail.gmail.com
2025-11-18 03:58:55 +00:00
Masahiko Sawada
a6eac2273e Use streaming read I/O in BRIN vacuum scan.
This commit implements streaming read I/O for BRIN vacuum
scans. Although BRIN indexes tend to be relatively small by design,
performance tests have shown performance improvements.

Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Discussion: https://postgr.es/m/CAE7r3ML01aiq9Th_1OSz7U7Aq2pWbhMLoz5T%2BPXcg8J9ZAPFFA%40mail.gmail.com
2025-11-17 13:22:20 -08:00
Tom Lane
ed931377ab Clean up match_orclause_to_indexcol().
Remove bogus stripping of RelabelTypes: that can result in building
an output SAOP tree with incorrect exposed exprType for the operands,
which might confuse polymorphic operators.  Moreover it demonstrably
prevents folding some OR-trees to SAOPs when the RHS expressions
have different base types that were coerced to the same type by
RelabelTypes.

Reduce prohibition on type_is_rowtype to just disallow type RECORD.
We need that because otherwise we would happily fold multiple RECORD
Consts into a RECORDARRAY Const even if they aren't the same record
type.  (We could allow that perhaps, if we checked that they all have
the same typmod, but the case doesn't seem worth that much effort.)
However, there is no reason at all to disallow the transformation
for named composite types, nor domains over them: as long as we can
find a suitable array type we're good.

Remove some assertions that seem rather out of place (it's not
this code's duty to verify that the RestrictInfo structure is
sane).  Rewrite some comments.

The issues with RelabelType stripping seem severe enough to
back-patch this into v18 where the code was introduced.

Author: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAHewXN=aH7GQBk4fXU-WaEeVmQWUmBAeNyBfJ3VKzPphyPKUkQ@mail.gmail.com
Backpatch-through: 18
2025-11-17 13:54:52 -05:00
Daniel Gustafsson
ab805989b2 Fix typos in logical replication code comments
Author: Chao Li <lic@highgo.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAEoWx2kt8m7wV39_zOBds5SNXx9EAkDqb5cPshk7Bxw6Js4Zpg@mail.gmail.com
2025-11-17 13:37:25 +01:00
Daniel Gustafsson
721bf9ce18 Mention md5 deprecation in postgresql.conf.sample
PostgreSQL 18 deprecated password_encryption='md5', but the
comments for this GUC in the sample configuration file did
not mention the deprecation.  Update comments with a notice
to make as many users as possible aware of it.  Also add a
comment to the related md5_password_warnings GUC while there.

Author: Michael Banck <mbanck@gmx.net>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Backpatch-through: 18
2025-11-17 12:18:18 +01:00
Michael Paquier
e76defbcf0 Rework output format of pg_dependencies
The existing format of pg_dependencies uses a single-object JSON
structure, with each key value embedding all the knowledge about the
set attributes tracked, like:
{"1 => 5": 1.000000, "5 => 1": 0.423130}

While this is a very compact format, it is confusing to read and it is
difficult to manipulate the values within the object, particularly when
tracking multiple attributes.

The new output format introduced in this commit is a JSON array of
objects, with:
- A key named "degree", with a float value.
- A key named "attributes", with an array of attribute numbers.
- A key named "dependency", with an attribute number.

The values use the same underlying type as previously when printed, with
a new output format that shows now as follows:
[{"degree": 1.000000, "attributes": [1], "dependency": 5},
 {"degree": 0.423130, "attributes": [5], "dependency": 1}]

This new format will become handy for a follow-up set of changes, so as
it becomes possible to inject extended statistics rather than require an
ANALYZE, like in a dump/restore sequence or after pg_upgrade on a new
cluster.

This format has been suggested by Tomas Vondra.  The key names are
defined in the header introduced by 1f927cce44, to ease the
integration of frontend-specific changes that are still under
discussion.  (Again a personal note: if anybody comes up with better
name for the keys, of course feel free.)

The bulk of the changes come from the regression tests, where
jsonb_pretty() is now used to make the outputs generated easier to
parse.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
2025-11-17 10:44:26 +09:00
Michael Paquier
1f927cce44 Rework output format of pg_ndistinct
The existing format of pg_ndistinct uses a single-object JSON structure
where each key is itself a comma-separated list of attnums, like:
{"3, 4": 11, "3, 6": 11, "4, 6": 11, "3, 4, 6": 11}

While this is a very compact format, it is confusing to read and it is
difficult to manipulate the values within the object.

The new output format introduced in this commit is an array of objects,
with:
- A key named "attributes", that contains an array of attribute numbers.
- A key named "ndistinct", represented as an integer.

The values use the same underlying type as previously when printed, with
a new output format that shows now as follows:
[{"ndistinct": 11, "attributes": [3,4]},
 {"ndistinct": 11, "attributes": [3,6]},
 {"ndistinct": 11, "attributes": [4,6]},
 {"ndistinct": 11, "attributes": [3,4,6]}]

This new format will become handy for a follow-up set of changes, so as
it becomes possible to inject extended statistics rather than require an
ANALYZE, like in a dump/restore sequence or after pg_upgrade on a new
cluster.

This format has been suggested by Tomas Vondra.  The key names are
defined in a new header, to ease with the integration of
frontend-specific changes that are still under discussion.  (Personal
note: I am not specifically wedded to these key names, but if there are
better name suggestions for this release, feel free.)

The bulk of the changes come from the regression tests, where
jsonb_pretty() is now used to make the outputs generated easier to
parse.

Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
2025-11-17 09:52:20 +09:00
Thomas Munro
32b236644d Define PS_USE_CLOBBER_ARGV on GNU/Hurd.
Until d2ea2d310d, the PS_USE_PS_STRINGS
option was used on the GNU/Hurd. As this option got removed and
PS_USE_CLOBBER_ARGV appears to work fine nowadays on the Hurd, define
this one to re-enable process title changes on this platform.

In the 14 and 15 branches, the existing test for __hurd__ (added 25
years ago by commit 209aa77d, removed in 16 by the above commit) is left
unchanged for now as it was activating slightly different code paths and
would need investigation by a Hurd user.

Author: Michael Banck <mbanck@debian.org>
Discussion: https://postgr.es/m/CA%2BhUKGJMNGUAqf27WbckYFrM-Mavy0RKJvocfJU%3DJ2XcAZyv%2Bw%40mail.gmail.com
Backpatch-through: 16
2025-11-17 12:48:55 +13:00
David Rowley
9c047da51f Get rid of long datatype in CATCACHE_STATS enabled builds
"long" is 32 bits on Windows 64-bit.  Switch to a datatype that's 64-bit
on all platforms.  While we're there, use an unsigned type as these
fields count things that have occurred, of which it's not possible to
have negative numbers of.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/CAApHDvoGFjSA3aNyVQ3ivbyc4ST=CC5L-_VjEUQ92HbE2Cxovg@mail.gmail.com
2025-11-17 12:26:41 +13:00
Dean Rasheed
1b92fe7bb9 Fix Assert failure in EXPLAIN ANALYZE MERGE with a concurrent update.
When instrumenting a MERGE command containing both WHEN NOT MATCHED BY
SOURCE and WHEN NOT MATCHED BY TARGET actions using EXPLAIN ANALYZE, a
concurrent update of the target relation could lead to an Assert
failure in show_modifytable_info(). In a non-assert build, this would
lead to an incorrect value for "skipped" tuples in the EXPLAIN output,
rather than a crash.

This could happen if the concurrent update caused a matched row to no
longer match, in which case ExecMerge() treats the single originally
matched row as a pair of not matched rows, and potentially executes 2
not-matched actions for the single source row. This could then lead to
a state where the number of rows processed by the ModifyTable node
exceeds the number of rows produced by its source node, causing
"skipped_path" in show_modifytable_info() to be negative, triggering
the Assert.

Fix this in ExecMergeMatched() by incrementing the instrumentation
tuple count on the source node whenever a concurrent update of this
kind is detected, if both kinds of merge actions exist, so that the
number of source rows matches the number of actions potentially
executed, and the "skipped" tuple count is correct.

Back-patch to v17, where support for WHEN NOT MATCHED BY SOURCE
actions was introduced.

Bug: #19111
Reported-by: Dilip Kumar <dilipbalaut@gmail.com>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/19111-5b06624513d301b3@postgresql.org
Backpatch-through: 17
2025-11-16 22:14:06 +00:00
Alexander Korotkov
23792d7381 Fix incorrect function name in comments
Update comments to reference WaitForLSN() instead of the outdated
WaitForLSNReplay() function name.

Discussion: https://postgr.es/m/CABPTF7UieOYbOgH3EnQCasaqcT1T4N6V2wammwrWCohQTnD_Lw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
2025-11-15 12:27:42 +02:00
Alexander Korotkov
ede6acef49 Fix WaitLSNWakeup() fast-path check for InvalidXLogRecPtr
WaitLSNWakeup() incorrectly returned early when called with
InvalidXLogRecPtr (meaning "wake all waiters"), because the fast-path
check compared minWaitedLSN > 0 without validating currentLSN first.
This caused WAIT FOR LSN commands to wait indefinitely during standby
promotion until random signals woke them.

Add an XLogRecPtrIsValid() check before the comparison so that
InvalidXLogRecPtr bypasses the fast-path and wakes all waiters immediately.

Discussion: https://postgr.es/m/CABPTF7UieOYbOgH3EnQCasaqcT1T4N6V2wammwrWCohQTnD_Lw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
2025-11-15 12:27:42 +02:00
Nathan Bossart
478c4814a0 Comment out autovacuum_worker_slots in postgresql.conf.sample.
All settings in this file should be commented out.  In addition to
fixing that, also fix the indentation for this line.

Oversight in commit c758119e5b.

Reported-by: Daniel Gustafsson <daniel@yesql.se>
Author: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/19727040-3EE4-4719-AF4F-2548544113D7%40yesql.se
Backpatch-through: 18
2025-11-14 13:45:04 -06:00
Nathan Bossart
7506bdbbf4 Add note about CreateStatistics()'s selective use of check_rights.
Commit 5e4fcbe531 added a check_rights parameter to this function
for use by ALTER TABLE commands that re-create statistics objects.
However, we intentionally ignore check_rights when verifying
relation ownership because this function's lookup could return a
different answer than the caller's.  This commit adds a note to
this effect so that we remember it down the road.

Reviewed-by: Noah Misch <noah@leadboat.com>
Backpatch-through: 14
2025-11-14 13:20:09 -06:00
Bruce Momjian
43e6929bb2 doc: double-quote use of %f, %p, and %r in literal commands.
Path expansion might expose characters like spaces which would cause
command failure, so double-quote the examples.  While %f doesn't need
quoting since it uses a fixed character set, it is best to be
consistent.

Discussion: https://postgr.es/m/aROPCQCfvKp9Htk4@momjian.us

Backpatch-through: master
2025-11-14 09:08:53 -05:00
Michael Paquier
910690415b Revert "Drop unnamed portal immediately after execution to completion"
This reverts commit 1fd981f053, based on concerns that the logging
improvements do not justify the protocol breakage of dropping an unnamed
portal once its execution has completed.

It seems unlikely that one would try to send an execute or describe
message after the portal has been used, but if they do such
post-completion messages would not be able to process as the previous
versions.  Let's revert this change for now so as we keep compatibility
and consider a different solution.

The tests added by 76bba03312 track the pre-1fd981f05369 behavior, and
are still valid.

Discussion: https://postgr.es/m/CA+TgmoYFJyJNQw3RT7veO3M2BWRE9Aw4hprC5rOcawHZti-f8g@mail.gmail.com
2025-11-14 14:37:10 +09:00
Thomas Munro
017249b828 Add some missing #include <limits.h>.
These files relied on transitive inclusion via port/atomics.h for
constants CHAR_BIT and INT_MAX.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/536409d2-c9df-4ef3-808d-1ffc3182868c@iki.fi
2025-11-13 22:56:08 +13:00
Michael Paquier
84fb27511d Replace off_t by pgoff_t in I/O routines
PostgreSQL's Windows port has never been able to handle files larger
than 2GB due to the use of off_t for file offsets, only 32-bit on
Windows.  This causes signed integer overflow at exactly 2^31 bytes when
trying to handle files larger than 2GB, for the routines touched by this
commit.

Note that large files are forbidden by ./configure (3c6248a828) and
meson (recent change, see 79cd66f28c).  This restriction also exists
in v16 and older versions for the now-dead MSVC scripts.

The code base already defines pgoff_t as __int64 (64-bit) on Windows for
this purpose, and some function declarations in headers use it, but many
internals still rely on off_t.  This commit switches more routines to
use pgoff_t, offering more portability, for areas mainly related to file
extensions and storage.

These are not critical for WAL segments yet, which have currently a
maximum size allowed of 1GB (well, this opens the door at allowing a
larger size for them).  This matters more for segment files if we want
to lift the large file restriction in ./configure and meson in the
future, which would make sense to remove once/if all traces of off_t are
gone from the tree.  This can additionally matter for out-of-core code
that may want files larger than 2GB in places where off_t is four bytes
in size.

Note that off_t is still used in other parts of the tree like
buffile.c, WAL sender/receiver, base backup, pg_combinebackup, etc.
These other code paths can be addressed separately, and their update
will be required if we want to remove the large file restriction in the
future.  This commit is a good first cut in itself towards more
portability, hopefully.

On Unix-like systems, pgoff_t is defined as off_t, so this change only
affects Windows behavior.

Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/0f238ff4-c442-42f5-adb8-01b762c94ca1@gmail.com
2025-11-13 12:41:40 +09:00
Fujii Masao
705601c5ae Fix incorrect assignment of InvalidXLogRecPtr to a non-LSN variable.
pg_logical_slot_get_changes_guts() previously assigned InvalidXLogRecPtr to
the local variable upto_nchanges, which is of type int32, not XLogRecPtr.
While this caused no functional issue since InvalidXLogRecPtr is defined as 0,
it was semantically incorrect.

This commit fixes the issue by updating pg_logical_slot_get_changes_guts()
to set upto_nchanges to 0 instead of InvalidXLogRecPtr.

No backpatch is needed, as the previous behavior was harmless.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Steven Niu <niushiji@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHKHuR5NGnGxU3+ebz7cbC1ZAR=AgG4Bueq==Lj6iX8Sw@mail.gmail.com
2025-11-13 08:44:33 +09:00
Nathan Bossart
180e7abe68 Remove obsolete autovacuum comment.
This comment seems to refer to some stuff that was removed during
development in 2005.

Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/aRJFDxKJLFE_1Iai%40nathan
2025-11-12 15:13:08 -06:00
Nathan Bossart
1165a933aa Teach DSM registry to ERROR if attaching to an uninitialized entry.
If DSM entry initialization fails, backends could try to use an
uninitialized DSM segment, DSA, or dshash table (since the entry is
still added to the registry).  To fix, keep track of whether
initialization completed, and ERROR if a backend tries to attach to
an uninitialized entry.  We could instead retry initialization as
needed, but that seemed complicated, error prone, and unlikely to
help most cases.  Furthermore, such problems probably indicate a
coding error.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/dd36d384-55df-4fc2-825c-5bc56c950fa9%40gmail.com
Backpatch-through: 17
2025-11-12 14:30:11 -06:00
Heikki Linnakangas
0bdc777e80 Clear 'xid' in dummy async notify entries written to fill up pages
Before we started to freeze async notify entries (commit 8eeb4a0f7c),
no one looked at the 'xid' on an entry with invalid 'dboid'. But now
we might actually need to freeze it later. Initialize them with
InvalidTransactionId to begin with, to avoid that work later.

Álvaro pointed this out in review of commit 8eeb4a0f7c, but I forgot
to include this change there.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/202511071410.52ll56eyixx7@alvherre.pgsql
Backpatch-through: 14
2025-11-12 21:19:03 +02:00
Heikki Linnakangas
797e9ea6e5 Fix remaining race condition with CLOG truncation and LISTEN/NOTIFY
Previous commit fixed a bug where VACUUM would truncate the CLOG
that's still needed to check the commit status of XIDs in the async
notify queue, but as mentioned in the commit message, it wasn't a full
fix. If a backend is executing asyncQueueReadAllNotifications() and
has just made a local copy of an async SLRU page which contains old
XIDs, vacuum can concurrently truncate the CLOG covering those XIDs,
and the backend still gets an error when it calls
TransactionIdDidCommit() on those XIDs in the local copy. This commit
fixes that race condition.

To fix, hold the SLRU bank lock across the TransactionIdDidCommit()
calls in NOTIFY processing.

Per Tom Lane's idea. Backpatch to all supported versions.

Reviewed-by: Joel Jacobson <joel@compiler.org>
Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Discussion: https://www.postgresql.org/message-id/2759499.1761756503@sss.pgh.pa.us
Backpatch-through: 14
2025-11-12 20:59:44 +02:00
Heikki Linnakangas
8eeb4a0f7c Fix bug where we truncated CLOG that was still needed by LISTEN/NOTIFY
The async notification queue contains the XID of the sender, and when
processing notifications we call TransactionIdDidCommit() on the
XID. But we had no safeguards to prevent the CLOG segments containing
those XIDs from being truncated away. As a result, if a backend didn't
for some reason process its notifications for a long time, or when a
new backend issued LISTEN, you could get an error like:

test=# listen c21;
ERROR:  58P01: could not access status of transaction 14279685
DETAIL:  Could not open file "pg_xact/000D": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:1087

To fix, make VACUUM "freeze" the XIDs in the async notification queue
before truncating the CLOG. Old XIDs are replaced with
FrozenTransactionId or InvalidTransactionId.

Note: This commit is not a full fix. A race condition remains, where a
backend is executing asyncQueueReadAllNotifications() and has just
made a local copy of an async SLRU page which contains old XIDs, while
vacuum concurrently truncates the CLOG covering those XIDs. When the
backend then calls TransactionIdDidCommit() on those XIDs from the
local copy, you still get the error. The next commit will fix that
remaining race condition.

This was first reported by Sergey Zhuravlev in 2021, with many other
people hitting the same issue later. Thanks to:
- Alexandra Wang, Daniil Davydov, Andrei Varashen and Jacques Combrink
  for investigating and providing reproducable test cases,
- Matheus Alcantara and Arseniy Mukhin for review and earlier proposed
  patches to fix this,
- Álvaro Herrera and Masahiko Sawada for reviews,
- Yura Sokolov aka funny-falcon for the idea of marking transactions
  as committed in the notification queue, and
- Joel Jacobson for the final patch version. I hope I didn't forget
  anyone.

Backpatch to all supported versions. I believe the bug goes back all
the way to commit d1e027221d, which introduced the SLRU-based async
notification queue.

Discussion: https://www.postgresql.org/message-id/16961-25f29f95b3604a8a@postgresql.org
Discussion: https://www.postgresql.org/message-id/18804-bccbbde5e77a68c2@postgresql.org
Discussion: https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw@mail.gmail.com
Backpatch-through: 14
2025-11-12 20:59:36 +02:00
Heikki Linnakangas
1b4699090e Escalate ERRORs during async notify processing to FATAL
Previously, if async notify processing encountered an error, we would
report the error to the client and advance our read position past the
offending entry to prevent trying to process it over and over
again. Trying to continue after an error has a few problems however:

- We have no way of telling the client that a notification was
  lost. They get an ERROR, but that doesn't tell you much. As such,
  it's not clear if keeping the connection alive after losing a
  notification is a good thing. Depending on the application logic,
  missing a notification could cause the application to get stuck
  waiting, for example.

- If the connection is idle, PqCommReadingMsg is set and any ERROR is
  turned into FATAL anyway.

- We bailed out of the notification processing loop on first error
  without processing any subsequent notifications. The subsequent
  notifications would not be processed until another notify interrupt
  arrives. For example, if there were two notifications pending, and
  processing the first one caused an ERROR, the second notification
  would not be processed until someone sent a new NOTIFY.

This commit changes the behavior so that any ERROR while processing
async notifications is turned into FATAL, causing the client
connection to be terminated. That makes the behavior more consistent
as that's what happened in idle state already, and terminating the
connection is a clear signal to the application that it might've
missed some notifications.

The reason to do this now is that the next commits will change the
notification processing code in a way that would make it harder to
skip over just the offending notification entry on error.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Discussion: https://www.postgresql.org/message-id/fedbd908-4571-4bbe-b48e-63bfdcc38f64@iki.fi
Backpatch-through: 14
2025-11-12 20:59:28 +02:00
Álvaro Herrera
877a024902
Split out innards of pg_tablespace_location()
This creates a src/backend/catalog/pg_tablespace.c supporting file
containing a new function get_tablespace_location(), which lets the code
underlying pg_tablespace_location() be reused for other purposes.

Author: Manni Wood <manni.wood@enterprisedb.com>
Author: Nishant Sharma <nishant.sharma@enterprisedb.com>
Reviewed-by: Vaibhav Dalvi <vaibhav.dalvi@enterprisedb.com>
Reviewed-by: Ian Lawrence Barwick <barwick@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com
2025-11-12 16:39:55 +01:00
Daniel Gustafsson
b4e32a076c Fix range for commit_siblings in sample conf
The range for commit_siblings was incorrectly listed as starting on 1
instead of 0 in the sample configuration file.  Backpatch down to all
supported branches.

Author: Man Zeng <zengman@halodbtech.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/tencent_53B70BA72303AE9C6889E78E@qq.com
Backpatch-through: 14
2025-11-12 13:51:53 +01:00
Daniel Gustafsson
9122ff65a1 libpq: threadsafety for SSL certificate callback
In order to make the errorhandling code in backend libpq be thread-
safe the global variable used by the certificate verification call-
back need to be replaced with passing private data.

This moves the threadsafety needle a little but forwards, the call
to strerror_r also needs to be replaced with the error buffer made
thread local.  This is left as future work for when add the thread
primitives required for this to the tree.

Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/353226C7-97A1-4507-A380-36AA92983AE6@yesql.se
2025-11-12 12:37:40 +01:00
Michael Paquier
040a39ed25 Fix comments of output routines for pg_ndistinct and pg_dependencies
Oversights in 7b504eb282 (for pg_ndistinct) and 2686ee1b7c (for
pg_dependencies).

Reported-by: Man Zeng <zengman@halodbtech.com>
Discussion: https://postgr.es/m/176293711658.2081918.12019224686811870203.pgcf@coridan.postgresql.org
2025-11-12 20:24:10 +09:00
Michael Paquier
2ddc8d9e9b Move code specific to pg_dependencies to new file
This new file is named pg_dependencies.c and includes all the code
directly related to the data type pg_dependencies, extracted from the
extended statistics code.

Some patches are under discussion to change its input and output
functions, and this separation makes the follow-up changes cleaner by
separating the logic related to the data type and the functional
dependencies statistics core logic in dependencies.c.

Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aQ2k8--a0FfwSwX9@paquier.xyz
2025-11-12 16:53:19 +09:00
Michael Paquier
a552312343 Move code specific to pg_ndistinct to new file
This new file is named pg_ndistinct.c and includes all the code directly
related to the data type pg_ndistinct, extracted from the extended
statistics code.

Some patches are under discussion to change its input and output
functions, and this separation makes the follow-up changes cleaner by
separating the logic related to the data type and the multivariate
ndistinct coefficient core logic in mvdistinct.c.

Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aQ2k8--a0FfwSwX9@paquier.xyz
2025-11-12 16:34:52 +09:00
Amit Kapila
bfb7419b0b Remove unused assignment in CREATE PUBLICATION grammar.
Commit 96b3784973 extended the grammar for CREATE PUBLICATION to support
the ALL SEQUENCES variant. However, it unnecessarily prepared publication
objects for this variant, which is not required. This was a copy-paste
oversight in that commit.

Additionally, rename pub_obj_type_list to pub_all_obj_type_list to better
reflect its purpose.

Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CANhcyEWbjkFvk3mSy5LFs9+0z4K1gDwQeFj7GUjOe+L4vxs4AA@mail.gmail.com
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-11-12 03:28:17 +00:00
Thomas Munro
2421ade663 Prefer spelling "cacheable" over "cachable".
Previously we had both in code and comments.  Keep the more common and
accepted variant.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/5EBF1771-0566-4D08-9F9B-CDCDEF4BDC98@gmail.com
2025-11-12 14:35:16 +13:00
Michael Paquier
6e1535308c Report better object limits in error messages for injection points
Previously, error messages for oversized injection point names, libraries,
and functions showed buffer sizes (64, 128, 128) instead of the usable
character limits (63, 127, 127) as it did not count for the
zero-terminated byte, which was confusing.  These messages are adjusted
to show better the reality.

The limit enforced for the private area was also too strict by one byte,
as specifying a zone worth exactly INJ_PRIVATE_MAXLEN should be able to
work because three is no zero-terminated byte in this case.

This is a stylistic change (well, mostly, a private_area size of exactly
1024 bytes can be defined with this change, something that nobody seem
to care about based on the lack of complaints).  However, this is a
testing facility let's keep the logic consistent across all the branches
where this code exists, as there is an argument in favor of out-of-core
extensions that use injection points.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CABPTF7VxYp4Hny1h+7ejURY-P4O5-K8WZg79Q3GUx13cQ6B2kg@mail.gmail.com
Backpatch-through: 17
2025-11-12 10:18:50 +09:00
Peter Eisentraut
d2f24df19b Clean up qsort comparison function for GUC entries
guc_var_compare() is invoked from qsort() on an array of struct
config_generic, but the function accesses these directly as
strings (char *).  This relies on the name being the first field, so
this works.  But we can write this more clearly by using the struct
and then accessing the field through the struct.  Before the
reorganization of the GUC structs (commit a13833c35f), the old code
was probably more convenient, but now we can write this more clearly
and correctly.

After this change, it is no longer required that the name is the first
field in struct config_generic, so remove that comment.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/2c961fa1-14f6-44a2-985c-e30b95654e8d%40eisentraut.org
2025-11-11 07:55:10 +01:00
Nathan Bossart
5e4fcbe531 Check for CREATE privilege on the schema in CREATE STATISTICS.
This omission allowed table owners to create statistics in any
schema, potentially leading to unexpected naming conflicts.  For
ALTER TABLE commands that require re-creating statistics objects,
skip this check in case the user has since lost CREATE on the
schema.  The addition of a second parameter to CreateStatistics()
breaks ABI compatibility, but we are unaware of any impacted
third-party code.

Reported-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Security: CVE-2025-12817
Backpatch-through: 13
2025-11-10 09:00:00 -06:00
Heikki Linnakangas
3e0ae46d90 Move SLRU_PAGES_PER_SEGMENT to pg_config_manual.h
It seems plausible that someone might want to experiment with
different values. The pressing reason though is that I'm reviewing a
patch that requires pg_upgrade to manipulate SLRU files. That patch
needs to access SLRU_PAGES_PER_SEGMENT from pg_upgrade code, and
slru.h, where SLRU_PAGES_PER_SEGMENT is currently defined, cannot be
included from frontend code. Moving it to pg_config_manual.h makes it
accessible.

Now that it's a little more likely that someone might change
SLRU_PAGES_PER_SEGMENT, add a cluster compatibility check for it.

Bump catalog version because of the new field in the control file.

Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/c7a4ea90-9f7b-4953-81be-b3fcb47db057@iki.fi
2025-11-10 16:11:41 +02:00
Daniel Gustafsson
3a872ddd64 Fix typos in nodeWindowAgg comments
One of them submitted by the author, with another one other
spotted during review so this fixes both.

Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAHewXN=eNx2oJ_hzxJrkSvy-1A5Qf45SM8pxERWXE+6RoZyFrw@mail.gmail.com
2025-11-10 12:51:47 +01:00
Michael Paquier
9d7e851a21 Fix comment in copyto.c
Author: Tatsuya Kawata <kawatatatsuya0913@gmail.com>
Discussion: https://postgr.es/m/CAHza6qeNbqgMfgDi15Dv6E6GWx+8maRAqe97OwzYz3qpEFouJQ@mail.gmail.com
2025-11-09 08:17:31 +09:00
Bruce Momjian
6204d07ad6 Remove blank line in C code.
Was added in commit 5e89985928.

Reported-by: Ashutosh Bapat

Author: Ashutosh Bapat

Discussion: https://postgr.es/m/CAExHW5tba_biyuMrd_iPVzq-+XvsMdPcEnjQ+d+__V=cjYj8Pg@mail.gmail.com

Backpatch-through: master
2025-11-07 21:54:25 -05:00
Alexander Korotkov
7742f99a02 Fix checking for recovery state in WaitForLSN()
We only need to do it for WAIT_LSN_TYPE_REPLAY.  WAIT_LSN_TYPE_FLUSH can work
for both primary and follower.
2025-11-07 23:34:50 +02:00
Peter Eisentraut
a3ea5330fc Fix "inconsistent DLL linkage" warning on Windows MSVC
This warning was disabled in meson.build (warning 4273).  If you
enable it, it looks like this:

../src/backend/utils/misc/ps_status.c(27): warning C4273: '__p__environ': inconsistent dll linkage
C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h(1158): note: see previous definition of '__p__environ'

The declaration in ps_status.c was:

    #if !defined(WIN32) || defined(_MSC_VER)
    extern char **environ;
    #endif

The declaration in the OS header file is:

    _DCRTIMP char***    __cdecl __p__environ (void);
    #define _environ  (*__p__environ())

So it is evident that this could be problematic.

The old declaration was required by the old MSVCRT library, but we
don't support that anymore with MSVC.

To fix, disable the re-declaration in ps_status.c, and also in some
other places that use the same code pattern but didn't trigger the
warning.

Then we can also re-enable the warning (delete the disablement in
meson.build).

Reviewed-by: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/bf060644-47ff-441b-97cf-c685d0827757@eisentraut.org
2025-11-07 10:14:25 +01:00
Amit Kapila
f6a4c498dc Add seq_sync_error_count to subscription statistics.
This commit adds a new column, seq_sync_error_count, to the
pg_stat_subscription_stats view. This counter tracks the number of errors
encountered by the sequence synchronization worker during operation.

Since a single worker handles the synchronization of all sequences, this
value may reflect errors from multiple sequences. This addition improves
observability of sequence synchronization behavior and helps monitor
potential issues during replication.

Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-11-07 08:05:08 +00:00
Andres Freund
5310fac6e0 bufmgr: Use atomic sub for unpinning buffers
The prior commit made it legal to modify BufferDesc.state while the buffer
header spinlock is held. This allows us to replace the CAS loop
inUnpinBufferNoOwner() with an atomic sub. This improves scalability
significantly. See the prior commits for more background.

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-11-06 16:43:16 -05:00
Andres Freund
c75ebc657f bufmgr: Allow some buffer state modifications while holding header lock
Until now BufferDesc.state was not allowed to be modified while the buffer
header spinlock was held. This meant that operations like unpinning buffers
needed to use a CAS loop, waiting for the buffer header spinlock to be
released before updating.

The benefit of that restriction is that it allowed us to unlock the buffer
header spinlock with just a write barrier and an unlocked write (instead of a
full atomic operation). That was important to avoid regressions in
48354581a4. However, since then the hottest buffer header spinlock uses have
been replaced with atomic operations (in particular, the most common use of
PinBuffer_Locked(), in GetVictimBuffer() (formerly in BufferAlloc()), has been
removed in 5e89985928).

This change will allow, in a subsequent commit, to release buffer pins with a
single atomic-sub operation. This previously was not possible while such
operations were not allowed while the buffer header spinlock was held, as an
atomic-sub would not have allowed a race-free check for the buffer header lock
being held.

Using atomic-sub to unpin buffers is a nice scalability win, however it is not
the primary motivation for this change (although it would be sufficient). The
primary motivation is that we would like to merge the buffer content lock into
BufferDesc.state, which will result in more frequent changes of the state
variable, which in some situations can cause a performance regression, due to
an increased CAS failure rate when unpinning buffers.  The regression entirely
vanishes when using atomic-sub.

Naively implementing this would require putting CAS loops in every place
modifying the buffer state while holding the buffer header lock. To avoid
that, introduce UnlockBufHdrExt(), which can set/add flags as well as the
refcount, together with releasing the lock.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-11-06 16:42:10 -05:00
David Rowley
448b6a4173 Tidyup WARNING ereports in subscriptioncmds.c
A couple of ereports were making use of StringInfos as temporary storage
for the portions of the WARNING message.  One was doing this to avoid
having 2 separate ereports.  This was all fairly unnecessary and
resulted in more code rather than less code.

Refactor out the additional StringInfos and make
check_publications_origin_tables() use 2 ereports.

In passing, adjust pubnames to become a stack-allocated StringInfoData to
avoid having to palloc the temporary StringInfoData.  This follows on
from the efforts made in 6d0eba662.

Author: Mats Kindahl <mats.kindahl@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/0b381b02-cab9-41f9-a900-ad6c8d26c1fc%40gmail.com
2025-11-07 09:50:02 +13:00
Álvaro Herrera
a2b02293bc
Use XLogRecPtrIsValid() in various places
Now that commit 06edbed478 has introduced XLogRecPtrIsValid(), we can
use that instead of:

- XLogRecPtrIsInvalid()
- direct comparisons with InvalidXLogRecPtr
- direct comparisons with literal 0

This makes the code more consistent.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aQB7EvGqrbZXrMlg@ip-10-97-1-34.eu-west-3.compute.internal
2025-11-06 20:33:57 +01:00
Peter Eisentraut
aa606b9316 Disallow generated columns in COPY WHERE clause
Stored generated columns are not yet computed when the filtering
happens, so we need to prohibit them to avoid incorrect behavior.

Virtual generated columns currently error out ("unexpected virtual
generated column reference").  They could probably work if we expand
them in the right place, but for now let's keep them consistent with
the stored variant.  This doesn't change the behavior, it only gives a
nicer error message.

Co-authored-by: jian he <jian.universality@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxHb8YPQ095R_pYDr77W9XKNaXg5Rzy-WP525mkq+hRM3g@mail.gmail.com
2025-11-06 13:54:42 +01:00
Heikki Linnakangas
aa9c5fd3e3 Refactor shared memory allocation for semaphores
Before commit e25626677f, spinlocks were implemented using semaphores
on some platforms (--disable-spinlocks). That made it necessary to
initialize semaphores early, before any spinlocks could be used. Now
that we don't support --disable-spinlocks anymore, we can allocate the
shared memory needed for semaphores the same way as other shared
memory structures. Since the semaphores are used only in the PGPROC
array, move the semaphore shmem size estimation and initialization
calls to ProcGlobalShmemSize() and InitProcGlobal().

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com
2025-11-06 14:45:00 +02:00
Heikki Linnakangas
daf3d99d2b Add comment to explain why PGReserveSemaphores() is called early
Before commit e25626677f, PGReserveSemaphores() had to be called
before SpinlockSemaInit() because spinlocks were implemented using
semaphores on some platforms (--disable-spinlocks). Add a comment
explaining that.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5seSZpPx-znjidVZNzdagGHOk06F+Ds88MpPUbxd1kTaA@mail.gmail.com
Backpatch-to: 18
2025-11-06 14:20:48 +02:00
John Naylor
07b3df5d00 Cosmetic fixes in GiST README
Fix a typo, add some missing conjunctions, and make a sentence flow
more smoothly.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://postgr.es/m/CA%2BrenyXZgwzegmO5t%3DUSU%3D9Wo5bc-YqNf-6E7Nv7e577DCmYXA%40mail.gmail.com
2025-11-06 16:35:40 +07:00
Amit Kapila
5a4eba558a Fix few issues in commit 5509055d69.
Test failure on buildfarm member prion:
The test failed due to an unexpected LOCATION: line appearing between the
WARNING and ERROR messages. This occurred because the prion machine uses
log_error_verbosity = verbose, which includes additional context in error
messages. The test was originally checking for both WARNING and ERROR
messages in sequence sync, but the extra LOCATION: line disrupted this
pattern. To make the test robust across different verbosity settings, it
now only checks for the presence of the WARNING message after the test,
which is sufficient to validate the intended behavior.

Failure to sync sequences with quoted names:
The previous implementation did not correctly quote sequence names when
querying remote information, leading to failures when quoted sequence
names were used. This fix ensures that sequence names are properly quoted
during remote queries, allowing sequences with quoted identifiers to be
synced correctly.

Author: Vignesh C <vignesh21@gmail.com>
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CALDaNm0WcdSCoNPiE-5ek4J2dMJ5o111GPTzKCYj9G5i=ONYtQ@mail.gmail.com
Discussion: https://postgr.es/m/CAOzEurQOSN=Zcp9uVnatNbAy=2WgMTJn_DYszYjv0KUeQX_e_A@mail.gmail.com
2025-11-06 08:52:31 +00:00
Michael Paquier
d6c132d83b Document some structures in attribute_stats.c
Like relation_stats.c, these structures are used to track the argument
number, names and types of pg_restore_attribute_stats() and
pg_clear_attribute_stats().

Extracted from a larger patch by the same author, reworded by me for
consistency with relation_stats.c.

Author: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com
2025-11-06 16:22:12 +09:00
Peter Eisentraut
05b9edcb71 Update code comment
Should have been part of commit a13833c35f.
2025-11-06 07:16:30 +01:00
David Rowley
eaa159632d Fix UNION planner estimate_num_groups with varno==0
03d40e4b5 added code to provide better row estimates for when a UNION
query ended up only with a single child due to other children being
found to be dummy rels.  In that case, ordinarily it would be ok to call
estimate_num_groups() on the targetlist of the only child path, however
that's not safe to do if the UNION child is the result of some other set
operation as we generate targetlists containing Vars with varno==0 for
those, which estimate_num_groups() can't handle.  This could lead to:

ERROR:  XX000: no relation entry for relid 0

Fix this by avoiding doing this when the only child is the result of
another set operation.  In that case we'll fall back on the
assume-all-rows-are-unique method.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/cfbc99e5-9d44-4806-ba3c-f36b57a85e21@gmail.com
2025-11-06 16:34:55 +13:00
Etsuro Fujita
a3ebec4e4c Update obsolete comment in ExecScanReScan().
Commit 27cc7cd2b removed the epqScanDone flag from the EState struct,
and instead added an equivalent flag named relsubs_done to the EPQState
struct; but it failed to update this comment.

Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Discussion: https://postgr.es/m/CAPmGK152zJ3fU5avDT5udfL0namrDeVfMTL3dxdOXw28SOrycg%40mail.gmail.com
Backpatch-through: 13
2025-11-06 12:25:00 +09:00
David Rowley
6d0eba6627 Use stack allocated StringInfoDatas, where possible
Various places that were using StringInfo but didn't need that
StringInfo to exist beyond the scope of the function were using
makeStringInfo(), which allocates both a StringInfoData and the buffer it
uses as two separate allocations.  It's more efficient for these cases to
use a StringInfoData on the stack and initialize it with initStringInfo(),
which only allocates the string buffer.  This also simplifies the cleanup,
in a few cases.

Author: Mats Kindahl <mats.kindahl@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4379aac8-26f1-42f2-a356-ff0e886228d3@gmail.com
2025-11-06 14:59:48 +13:00
Tom Lane
d4baa327a1 Avoid possible crash within libsanitizer.
We've successfully used libsanitizer for awhile with the undefined
and alignment sanitizers, but with some other sanitizers (at least
thread and hwaddress) it crashes due to internal recursion before
it's fully initialized itself.  It turns out that that's due to the
"__ubsan_default_options" hack installed by commit f686ae82f, and we
can fix it by ensuring that __ubsan_default_options is built without
any sanitizer instrumentation hooks.

Reported-by: Emmanuel Sibi <emmanuelsibi.mec@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Diagnosed-by: Emmanuel Sibi <emmanuelsibi.mec@gmail.com>
Fix-suggested-by: Jacob Champion <jacob.champion@enterprisedb.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/F7543B04-E56C-4D68-A040-B14CCBAD38F1@gmail.com
Discussion: https://postgr.es/m/dbf77bf7-6e54-ed8a-c4ae-d196eeb664ce@gmail.com
Backpatch-through: 16
2025-11-05 11:09:45 -05:00
Alexander Korotkov
447aae13b0 Implement WAIT FOR command
WAIT FOR is to be used on standby and specifies waiting for
the specific WAL location to be replayed.  This option is useful when
the user makes some data changes on primary and needs a guarantee to see
these changes are on standby.

WAIT FOR needs to wait without any snapshot held.  Otherwise, the snapshot
could prevent the replay of WAL records, implying a kind of self-deadlock.
This is why separate utility command seems appears to be the most robust
way to implement this functionality.  It's not possible to implement this as
a function.  Previous experience shows that stored procedures also have
limitation in this aspect.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
2025-11-05 11:44:13 +02:00
Alexander Korotkov
3b4e53a075 Add infrastructure for efficient LSN waiting
Implement a new facility that allows processes to wait for WAL to reach
specific LSNs, both on primary (waiting for flush) and standby (waiting
for replay) servers.

The implementation uses shared memory with per-backend information
organized into pairing heaps, allowing O(1) access to the minimum
waited LSN. This enables fast-path checks: after replaying or flushing
WAL, the startup process or WAL writer can quickly determine if any
waiters need to be awakened.

Key components:
- New xlogwait.c/h module with WaitForLSNReplay() and WaitForLSNFlush()
- Separate pairing heaps for replay and flush waiters
- WaitLSN lightweight lock for coordinating shared state
- Wait events WAIT_FOR_WAL_REPLAY and WAIT_FOR_WAL_FLUSH for monitoring

This infrastructure can be used by features that need to wait for WAL
operations to complete.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
2025-11-05 11:44:13 +02:00
Alexander Korotkov
8af3ae0d4b Add pairingheap_initialize() for shared memory usage
The existing pairingheap_allocate() uses palloc(), which allocates
from process-local memory. For shared memory use cases, the pairingheap
structure must be allocated via ShmemAlloc() or embedded in a shared
memory struct. Add pairingheap_initialize() to initialize an already-
allocated pairingheap structure in-place, enabling shared memory usage.

Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO+BBjcirozJ6nYbOW8Q@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/flat/CABPTF7UNft368x-RgOXkfj475OwEbp%2BVVO-wEXz7StgjD_%3D6sw%40mail.gmail.com
Author: Kartyshov Ivan <i.kartyshov@postgrespro.ru>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
2025-11-05 11:44:13 +02:00
Richard Guo
0ea5eee376 Avoid creating duplicate ordered append paths
In generate_orderedappend_paths(), the function does not handle the
case where the paths in total_subpaths and fractional_subpaths are
identical.  This situation is not uncommon, and as a result, it may
generate two exactly identical ordered append paths.

Fix by checking whether total_subpaths and fractional_subpaths contain
the same paths, and skipping creation of the ordered append path for
the fractional case when they are identical.

Given the lack of field complaints about this, I'm a bit hesitant to
back-patch, but let's clean it up in HEAD.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-OYsgA75tGGiBARt87G0y_z_GBTSLrzudcJxAzndYkYw@mail.gmail.com
2025-11-05 18:10:54 +09:00
Richard Guo
c1777f2d6d Fix assertion failure in generate_orderedappend_paths()
In generate_orderedappend_paths(), there is an assumption that a child
relation's row estimate is always greater than zero.  There is an
Assert verifying this assumption, and the estimate is also used to
convert an absolute tuple count into a fraction.

However, this assumption is not always valid -- for example, upper
relations can have their row estimates unset, resulting in a value of
zero.  This can cause an assertion failure in debug builds or lead to
the tuple fraction being computed as infinity in production builds.

To fix, use the row estimate from the cheapest_total path to compute
the tuple fraction.  The row estimate in this path should already have
been forced to a valid value.

In passing, update the comment for generate_orderedappend_paths() to
note that the function also considers the cheapest-fractional case
when not all tuples need to be retrieved.  That is, it collects all
the cheapest fractional paths and builds an ordered append path for
each interesting ordering.

Backpatch to v18, where this issue was introduced.

Bug: #19102
Reported-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/19102-93480667e1200169@postgresql.org
Backpatch-through: 18
2025-11-05 18:09:21 +09:00
Amit Kapila
5509055d69 Add sequence synchronization for logical replication.
This patch introduces sequence synchronization. Sequences that are synced
will have 2 states:
   - INIT (needs [re]synchronizing)
   - READY (is already synchronized)

A new sequencesync worker is launched as needed to synchronize sequences.
A single sequencesync worker is responsible for synchronizing all
sequences. It begins by retrieving the list of sequences that are flagged
for synchronization, i.e., those in the INIT state. These sequences are
then processed in batches, allowing multiple entries to be synchronized
within a single transaction. The worker fetches the current sequence
values and page LSNs from the remote publisher, updates the corresponding
sequences on the local subscriber, and finally marks each sequence as
READY upon successful synchronization.

Sequence synchronization occurs in 3 places:
1) CREATE SUBSCRIPTION
    - The command syntax remains unchanged.
    - The subscriber retrieves sequences associated with publications.
    - Published sequences are added to pg_subscription_rel with INIT
      state.
    - Initiate the sequencesync worker to synchronize all sequences.

2) ALTER SUBSCRIPTION ... REFRESH PUBLICATION
    - The command syntax remains unchanged.
    - Dropped published sequences are removed from pg_subscription_rel.
    - Newly published sequences are added to pg_subscription_rel with INIT
      state.
    - Initiate the sequencesync worker to synchronize only newly added
      sequences.

3) ALTER SUBSCRIPTION ... REFRESH SEQUENCES
    - A new command introduced for PG19 by f0b3573c3a.
    - All sequences in pg_subscription_rel are reset to INIT state.
    - Initiate the sequencesync worker to synchronize all sequences.
    - Unlike "ALTER SUBSCRIPTION ... REFRESH PUBLICATION" command,
      addition and removal of missing sequences will not be done in this
      case.

Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-11-05 05:59:58 +00:00
Michael Paquier
1fd981f053 Drop unnamed portal immediately after execution to completion
Previously, unnamed portals were kept until the next Bind message or the
end of the transaction.  This could cause temporary files to persist
longer than expected and make logging not reflect the actual SQL
responsible for the temporary file.

This patch changes exec_execute_message() to drop unnamed portals
immediately after execution to completion at the end of an Execute
message, making their removal more aggressive.  This forces temporary
file cleanups to happen at the same time as the completion of the portal
execution, with statement logging correctly reflecting to which
statements these temporary files were attached to (see the diffs in the
TAP test updated by this commit for an idea).

The documentation is updated to describe the lifetime of unnamed
portals, and test cases are updated to verify temporary file removal and
proper statement logging after unnamed portal execution.  This changes
how unnamed portals are handled in the protocol, hence no backpatch is
done.

Author: Frédéric Yhuel <frederic.yhuel@dalibo.com>
Co-Authored-by: Sami Imseih <samimseih@gmail.com>
Co-Authored-by: Mircea Cadariu <cadariu.mircea@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0tTrTUoEr3kDXCuKsvqYGq8OOHiBwoD-dyJocq95uEOTQ%40mail.gmail.com
2025-11-05 14:35:16 +09:00
Richard Guo
59dec6c0b0 Fix comments for ChangeVarNodes() and related functions
The comment for ChangeVarNodes() refers to a parameter named
change_RangeTblRef, which does not exist in the code.

The comment for ChangeVarNodesExtended() contains an extra space,
while the comment for replace_relid_callback() has an awkward line
break and a typo.

This patch fixes these issues and revises some sentences for smoother
wording.

Oversights in commits ab42d643c and fc069a3a6.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs480j16HC1JtjKCgj5WshivT8ZJYkOfTyZAM0POjFomJkg@mail.gmail.com
Backpatch-through: 18
2025-11-05 12:29:31 +09:00
Michael Paquier
2fc3107962 Add assertions checking for the startup process in WAL replay routines
These assertions may prove to become useful to make sure that no process
other than the startup process calls the routines where these checks are
added, as we expect that these do not interfere with a WAL receiver
switched to a "stopping" state by a startup process.

The assumption that only the startup process can use this code has
existed for many years, without a check enforcing it.

Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/aQmGeVLYl51y1m_0@paquier.xyz
2025-11-05 10:41:50 +09:00
Andres Freund
dae00f333b aio: Improve assertions related to io_method
First, the assertions in assign_io_method() were the wrong way round. Second,
the lengthof() assertion checked the length of io_method_options, which is the
wrong array to check and is always longer than pgaio_method_ops_table.

While add it, add a static assert to ensure pgaio_method_ops_table and
io_method_options stay in sync.

Per coverity and Tom Lane.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 18
2025-11-04 20:03:53 -05:00
Andres Freund
2d83d729d5 jit: Fix accidentally-harmless type confusion
In 2a0faed9d7, which added JIT compilation support for expressions, I
accidentally used sizeof(LLVMBasicBlockRef *) instead of
sizeof(LLVMBasicBlockRef) as part of computing the size of an allocation. That
turns out to have no real negative consequences due to LLVMBasicBlockRef being
a pointer itself (and thus having the same size). It still is wrong and
confusing, so fix it.

Reported by coverity.

Backpatch-through: 13
2025-11-04 20:03:53 -05:00
Jeff Davis
d115de9d89 Special case C_COLLATION_OID in pg_newlocale_from_collation().
Allow pg_newlocale_from_collation(C_COLLATION_OID) to work even if
there's no catalog access, which some extensions expect.

Not known to be a bug without extensions involved, but backport to 18.

Also corrects an issue in master with dummy_c_locale (introduced in
commit 5a38104b36) where deterministic was not set. That wasn't a bug,
but could have been if that structure was used more widely.

Reported-by: Alexander Kukushkin <cyberdemn@gmail.com>
Reviewed-by: Alexander Kukushkin <cyberdemn@gmail.com>
Discussion: https://postgr.es/m/CAFh8B=nj966ECv5vi_u3RYij12v0j-7NPZCXLYzNwOQp9AcPWQ@mail.gmail.com
Backpatch-through: 18
2025-11-04 16:48:16 -08:00
Masahiko Sawada
8ae0f6a0c3 Add CHECK_FOR_INTERRUPTS in Evict{Rel,All}UnpinnedBuffers.
This commit adds CHECK_FOR_INTERRUPTS to the shared buffer iteration
loops in EvictRelUnpinnedBuffers and EvictAllUnpinnedBuffers. These
functions, used by pg_buffercache's pg_buffercache_evict_relation and
pg_buffercache_evict_all, can now be interrupted during long-running
operations.

Backpatch to version 18, where these functions and their corresponding
pg_buffercache functions were introduced.

Author: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/8DC280D4-94A2-4E7B-BAB9-C345891D0B78%40gmail.com
Backpatch-through: 18
2025-11-04 15:47:25 -08:00
David Rowley
fdda78e361 Fix possible usage of incorrect UPPERREL_SETOP RelOptInfo
03d40e4b5 allowed dummy UNION [ALL] children to be removed from the plan
by checking for is_dummy_rel().  That commit neglected to still account
for the relids from the dummy rel so that the correct UPPERREL_SETOP
RelOptInfo could be found and used for adding the Paths to.

Not doing this could result in processing of subsequent UNIONs using the
same RelOptInfo as a previously processed UNION, which could result in
add_path() freeing old Paths that are needed by the previous UNION.

The same fix was independently submitted (2 mins later) by Richard Guo.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/bee34aec-659c-46f1-9ab7-7bbae0b7616c@gmail.com
2025-11-05 11:48:09 +13:00
Álvaro Herrera
0a3d27bfe0
Fix snapshot handling bug in recent BRIN fix
Commit a95e3d84c0 added ActiveSnapshot push+pop when processing
work-items (BRIN autosummarization), but forgot to handle the case of
a transaction failing during the run, which drops the snapshot untimely.
Fix by making the pop conditional on an element being actually there.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Backpatch-through: 13
Discussion: https://postgr.es/m/202511041648.nofajnuddmwk@alvherre.pgsql
2025-11-04 20:31:43 +01:00
Tomas Vondra
1213cb4753 Trim TIDs during parallel GIN builds more eagerly
The parallel GIN builds perform "freezing" of TID lists when merging
chunks built earlier. This means determining what part of the list can
no longer change, depending on the last received chunk. The frozen part
can be evicted from memory and written out.

The code attempted to freeze items right before merging the old and new
TID list, after already attempting to trim the current buffer. That
means part of the data may get frozen based on the new TID list, but
will be trimmed later (on next loop). This increases memory usage.

This inverts the order, so that we freeze data first (before trimming).
The benefits are likely relatively small, but it's also virtually free
with no other downsides.

Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
2025-11-04 20:06:01 +01:00
Tomas Vondra
c98dffcb7c Limit the size of TID lists during parallel GIN build
When building intermediate TID lists during parallel GIN builds, split
the sorted lists into smaller chunks, to limit the amount of memory
needed when merging the chunks later.

The leader may need to keep in memory up to one chunk per worker, and
possibly one extra chunk (before evicting some of the data). The code
processing item pointers uses regular palloc/repalloc calls, which means
it's subject to the MaxAllocSize (1GB) limit.

We could fix this by allowing huge allocations, but that'd require
changes in many places without much benefit. Larger chunks do not
actually improve performance, so the memory usage would be wasted.

Fixed by limiting the chunk size to not hit MaxAllocSize. Each worker
gets a fair share.

This requires remembering the number of participating workers, in a
place that can be accessed from the callback. Luckily, the bs_worker_id
field in GinBuildState was unused, so repurpose that.

Report by Greg Smith, investigation and fix by me. Batchpatched to 18,
where parallel GIN builds were introduced.

Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
Backpatch-through: 18
2025-11-04 18:51:17 +01:00
Jeff Davis
4bfaea11d2 Remove redundant memset() introduced by a0942f4.
Reported-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2kAkNaDa01O0nKsQmkfEmxsDvm09SU=f1T0CV8ew3qJEA@mail.gmail.com
2025-11-04 09:46:00 -08:00
Tom Lane
ff4597acd4 Allow "SET list_guc TO NULL" to specify setting the GUC to empty.
We have never had a SET syntax that allows setting a GUC_LIST_INPUT
parameter to be an empty list.  A locution such as
	SET search_path = '';
doesn't mean that; it means setting the GUC to contain a single item
that is an empty string.  (For search_path the net effect is much the
same, because search_path ignores invalid schema names and '' must be
invalid.)  This is confusing, not least because configuration-file
entries and the set_config() function can easily produce empty-list
values.

We considered making the empty-string syntax do this, but that would
foreclose ever allowing empty-string items to be valid in list GUCs.
While there isn't any obvious use-case for that today, it feels like
the kind of restriction that might hurt someday.  Instead, let's
accept the forbidden-up-to-now value NULL and treat that as meaning an
empty list.  (An objection to this could be "what if we someday want
to allow NULL as a GUC value?".  That seems unlikely though, and even
if we did allow it for scalar GUCs, we could continue to treat it as
meaning an empty list for list GUCs.)

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrei Klychkov <andrew.a.klychkov@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/CA+mfrmwsBmYsJayWjc8bJmicxc3phZcHHY=yW5aYe=P-1d_4bg@mail.gmail.com
2025-11-04 12:37:40 -05:00
Peter Eisentraut
040cc5f3c7 Tighten check for generated column in partition key expression
A generated column may end up being part of the partition key
expression, if it's specified as an expression e.g. "(<generated
column name>)" or if the partition key expression contains a whole-row
reference, even though we do not allow a generated column to be part
of partition key expression.  Fix this hole.

Co-authored-by: jian he <jian.universality@gmail.com>
Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxF%3DWDGthXSAQr9thYUsfx_1_t9E6N8tE3B8EqXcVoVfQw%40mail.gmail.com
2025-11-04 14:46:58 +01:00
Álvaro Herrera
a95e3d84c0
BRIN autosummarization may need a snapshot
It's possible to define BRIN indexes on functions that require a
snapshot to run, but the autosummarization feature introduced by commit
7526e10224 fails to provide one.  This causes autovacuum to leave a
BRIN placeholder tuple behind after a failed work-item execution, making
such indexes less efficient.  Repair by obtaining a snapshot prior to
running the task, and add a test to verify this behavior.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Giovanni Fabris <giovanni.fabris@icon.it>
Reported-by: Arthur Nascimento <tureba@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/202511031106.h4fwyuyui6fz@alvherre.pgsql
2025-11-04 13:23:26 +01:00
Peter Eisentraut
c09a06918d Error message stylistic correction
Fixup for commit ef5e60a9d3: The inconsistent use of articles was a
bit awkward.
2025-11-04 12:25:04 +01:00
Michael Paquier
65f4976189 Add assertion check for WAL receiver state during stream-archive transition
When the startup process switches from streaming to archive as WAL
source, we avoid calling ShutdownWalRcv() if the WAL receiver is not
streaming, based on WalRcvStreaming().  WALRCV_STOPPING is a state set
by ShutdownWalRcv(), called only by the startup process, meaning that it
should not be possible to reach this state while in
WaitForWALToBecomeAvailable().

This commit adds an assertion to make sure that a WAL receiver is never
in a WALRCV_STOPPING state should the startup process attempt to reset
InstallXLogFileSegmentActive.

Idea suggested by Noah Misch.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
2025-11-04 13:14:46 +09:00
Michael Paquier
e0ca61e7c4 Add WalRcvGetState() to retrieve the state of a WAL receiver
This has come up as useful as an alternative of WalRcvStreaming(), to be
able to do sanity checks based on the state of a WAL receiver.  This
will be used in a follow-up commit.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
2025-11-04 12:57:36 +09:00
Michael Paquier
17b2d5ec75 Fix unconditional WAL receiver shutdown during stream-archive transition
Commit b4f584f9d2 (affecting v15~, later backpatched down to 13 as of
3635a0a35a) introduced an unconditional WAL receiver shutdown when
switching from streaming to archive WAL sources.  This causes problems
during a timeline switch, when a WAL receiver enters WALRCV_WAITING
state but remains alive, waiting for instructions.

The unconditional shutdown can break some monitoring scenarios as the
WAL receiver gets repeatedly terminated and re-spawned, causing
pg_stat_wal_receiver.status to show a "streaming" instead of "waiting"
status, masking the fact that the WAL receiver is waiting for a new TLI
and a new LSN to be able to continue streaming.

This commit changes the WAL receiver behavior so as the shutdown becomes
conditional, with InstallXLogFileSegmentActive being always reset to
prevent the regression fixed by b4f584f9d2: only terminate the WAL
receiver when it is actively streaming (WALRCV_STREAMING,
WALRCV_STARTING, or WALRCV_RESTARTING).  When in WALRCV_WAITING state,
just reset InstallXLogFileSegmentActive flag to allow archive
restoration without killing the process.  WALRCV_STOPPED and
WALRCV_STOPPING are not reachable states in this code path.  For the
latter, the startup process is the one in charge of setting
WALRCV_STOPPING via ShutdownWalRcv(), waiting for the WAL receiver to
reach a WALRCV_STOPPED state after switching walRcvState, so
WaitForWALToBecomeAvailable() cannot be reached while a WAL receiver is
in a WALRCV_STOPPING state.

A regression test is added to check that a WAL receiver is not stopped
on timeline jump, that fails when the fix of this commit is reverted.

Reported-by: Ryan Bird <ryanzxg@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
Backpatch-through: 13
2025-11-04 10:47:38 +09:00
Noah Misch
8b18ed6dfb Doc: cover index CONCURRENTLY causing errors in INSERT ... ON CONFLICT.
Author: Mikhail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com
Backpatch-through: 13
2025-11-03 12:57:09 -08:00
Masahiko Sawada
e7ccb247b3 Fix outdated comment of COPY in gram.y.
Author: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/tencent_392C0E92EC52432D0A336B9D52E66426F009@qq.com
2025-11-03 10:34:49 -08:00
Álvaro Herrera
cf8be02253
Prevent setting a column as identity if its not-null constraint is invalid
We don't allow null values to appear in identity-generated columns in
other ways, so we shouldn't let unvalidated not-null constraints do it
either.  Oversight in commit a379061a22.

Author: jian he <jian.universality@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CACJufxGQM_+vZoYJMaRoZfNyV=L2jxosjv_0TLAScbuLJXWRfQ@mail.gmail.com
2025-11-03 15:58:19 +01:00
Michael Paquier
ad25744f43 Add wal_fpi_bytes to VACUUM and ANALYZE logs
The new wal_fpi_bytes counter calculates the total amount of full page
images inserted in WAL records, in bytes.  This commit adds this
information to VACUUM and ANALYZE logs alongside the existing counters,
building upon f9a09aa295.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aQMMSSlFXy4Evxn3@paquier.xyz
2025-11-03 19:42:03 +09:00
Peter Eisentraut
fce7c73fba Sort guc_parameters.dat alphabetically by name
The order in this list was previously pretty random and had grown
organically over time.  This made it unnecessarily cumbersome to
maintain these lists, as there was no clear guidelines about where to
put new entries.  Also, after the merger of the type-specific GUC
structs, the list still reflected the previous type-specific
super-order.

By using alphabetical order, the place for new entries becomes clear,
and often related entries will be listed close together.

This patch reorders the existing entries in guc_parameters.dat, and it
also augments the generation script to error if an entry is found at
the wrong place.

Note: The order is actually checked after lower-casing, to handle the
likes of "DateStyle".

Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
2025-11-03 10:04:14 +01:00
Tom Lane
8f29467c57 Change "long" numGroups fields to be Cardinality (i.e., double).
We've been nibbling away at removing uses of "long" for a long time,
since its width is platform-dependent.  Here's one more: change the
remaining "long" fields in Plan nodes to Cardinality, since the three
surviving examples all represent group-count estimates.  The upstream
planner code was converted to Cardinality some time ago; for example
the corresponding fields in Path nodes are type Cardinality, as are
the arguments of the make_foo_path functions.  Downstream in the
executor, it turns out that these all feed to the table-size argument
of BuildTupleHashTable.  Change that to "double" as well, and fix it
so that it safely clamps out-of-range values to the uint32 limit of
simplehash.h, as was not being done before.

Essentially, this is removing all the artificial datatype-dependent
limitations on these values from upstream processing, and applying
just one clamp at the moment where we're forced to do so by the
datatype choices of simplehash.h.

Also, remove BuildTupleHashTable's misguided attempt to enforce
work_mem/hash_mem_limit.  It doesn't have enough information
(particularly not the expected tuple width) to do that accurately,
and it has no real business second-guessing the caller's choice.
For all these plan types, it's really the planner's responsibility
to not choose a hashed implementation if the hashtable is expected
to exceed hash_mem_limit.  The previous patch improved the
accuracy of those estimates, and even if BuildTupleHashTable had
more information it should arrive at the same conclusions.

Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
2025-11-02 16:57:43 -05:00
Tom Lane
1ea5bdb00b Improve planner's estimates of tuple hash table sizes.
For several types of plan nodes that use TupleHashTables, the
planner estimated the expected size of the table as basically
numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)).
This is pretty far off, especially for small data widths, because
it doesn't account for the overhead of the simplehash.h hash table
nor for any per-tuple "additional space" the plan node may request.
Jeff Janes noted a case where the estimate was off by about a factor
of three, even though the obvious hazards such as inaccurate estimates
of numEntries or dataWidth didn't apply.

To improve matters, create functions provided by the relevant executor
modules that can estimate the required sizes with reasonable accuracy.
(We're still not accounting for effects like allocator padding, but
this at least gets the first-order effects correct.)

I added functions that can estimate the tuple table sizes for
nodeSetOp and nodeSubplan; these rely on an estimator for
TupleHashTables in general, and that in turn relies on one for
simplehash.h hash tables.  That feels like kind of a lot of mechanism,
but if we take any short-cuts we're violating modularity boundaries.

The other places that use TupleHashTables are nodeAgg, which took
pains to get its numbers right already, and nodeRecursiveunion.
I did not try to improve the situation for nodeRecursiveunion because
there's nothing to improve: we are not making an estimate of the hash
table size, and it wouldn't help us to do so because we have no
non-hashed alternative implementation.  On top of that, our estimate
of the number of entries to be hashed in that module is so suspect
that we'd likely often choose the wrong implementation if we did have
two ways to do it.

Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
2025-11-02 16:57:26 -05:00
Peter Geoghegan
b8f1c62807 Document nbtree row comparison design.
Add comments explaining when and where it is safe for nbtree to treat
row compare keys as if they were simple scalar inequality keys on the
row's most significant column.  This is particularly important within
_bt_advance_array_keys, which deals with required inequality keys in a
general and uniform way, without any special handling for row compares.

Also spell out the implications of _bt_check_rowcompare's approach of
_conditionally_ evaluating lower-order row compare subkeys, particularly
when one of its lower-order subkeys might see NULL index tuple values
(these may or may not affect whether the qual as a whole is satisfied).
The behavior in this area isn't particularly intuitive, so these issues
seem worth going into.

In passing, add a few more defensive/documenting row comparison related
assertions to _bt_first and _bt_check_rowcompare.

Follow-up to commits bd3f59fd and ec986020.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wznwkak_K7pcAdv9uH8ZfNo8QO7+tHXOaCUddMeTfaCCFw@mail.gmail.com
Backpatch-through: 18
2025-11-02 15:27:05 -05:00
Peter Geoghegan
4f08586c7a Remove obsolete nbtree equality key comments.
_bt_first reliably uses the same equality key (on each index column) for
initial positioning purposes as the one that _bt_checkkeys can use to
end the scan following commit f09816a0.  _bt_first no longer applies its
own independent rules to determine which initial positioning key to use
on each column (for equality and inequality keys alike).  Preprocessing
is now fully in control of determining which keys start and end each
scan, ensuring that _bt_first and _bt_checkkeys have symmetric behavior.

Remove obsolete comments that described why _bt_first was expected to
use at least one of the available required equality keys for initial
positioning purposes.  The rules in this area are now maximally strict
and uniform, so there's no reason to draw attention to equality keys.
Any column with a required equality key cannot have a redundant required
inequality key (nor can it have a redundant required equality key).

Oversight in commit f09816a0, which removed similar comments from
_bt_first, but missed these comments.

Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
2025-11-02 13:34:18 -05:00
Peter Eisentraut
8a27d418f8 Mark function arguments of type "Datum *" as "const Datum *" where possible
Several functions in the codebase accept "Datum *" parameters but do
not modify the pointed-to data.  These have been updated to take
"const Datum *" instead, improving type safety and making the
interfaces clearer about their intent.  This change helps the compiler
catch accidental modifications and better documents immutability of
arguments.

Most of "Datum *" parameters have a pairing "bool *isnull" parameter,
they are constified as well.

No functional behavior is changed by this patch.

Author: Chao Li <lic@highgo.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2msfT0knvzUa72ZBwu9LR_RLY4on85w2a9YpE-o2By5HQ@mail.gmail.com
2025-10-31 10:47:25 +01:00
Peter Eisentraut
aa4535307e formatting.c cleanup: Change fill_str() return type to void
The return value is not used anywhere.

In passing, add a comment explaining the function's arguments.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-31 09:55:12 +01:00
Peter Eisentraut
da2052ab9a formatting.c cleanup: Rename DCH_S_* to DCH_SUFFIX_*
For clarity.  Also rename several related macros and turn them into
inline functions.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-31 08:06:46 +01:00
Peter Eisentraut
378212c68a formatting.c cleanup: Change several int fields to enums
This makes their purpose more self-documenting.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-31 08:06:46 +01:00
Peter Eisentraut
ce5f6817e4 formatting.c cleanup: Change TmFromChar.clock field to bool
This makes the purpose clearer and avoids having two extra symbols,
one of which (CLOCK_24_HOUR) was unused.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-31 08:06:46 +01:00
Tom Lane
c106ef0807 Use BumpContext contexts in TupleHashTables, and do some code cleanup.
For all extant uses of TupleHashTables, execGrouping.c itself does
nothing with the "tablecxt" except to allocate new hash entries in it,
and the callers do nothing with it except to reset the whole context.
So this is an ideal use-case for a BumpContext, and the hash tables
are frequently big enough for the savings to be significant.

(Commit cc721c459 already taught nodeAgg.c this idea, but neglected
the other callers of BuildTupleHashTable.)

While at it, let's clean up some ill-advised leftovers from rebasing
TupleHashTables on simplehash.h:

* Many comments and variable names were based on the idea that the
tablecxt holds the whole TupleHashTable, whereas now it only holds the
hashed tuples (plus any caller-defined "additional storage").  Rename
to names like tuplescxt and tuplesContext, and adjust the comments.
Also adjust the memory context names to be like "<Foo> hashed tuples".

* Make ResetTupleHashTable() reset the tuplescxt rather than relying
on the caller to do so; that was fairly bizarre and seems like a
recipe for leaks.  This is less efficient in the case where nodeAgg.c
uses the same tuplescxt for several different hashtables, but only
microscopically so because mcxt.c will short-circuit the extra resets
via its isReset flag.  I judge the extra safety and intellectual
cleanliness well worth those few cycles.

* Remove the long-obsolete "allow_jit" check added by ac88807f9;
instead, just Assert that metacxt and tuplescxt are different.
We need that anyway for this definition of ResetTupleHashTable() to
be safe.

There is a side issue of the extent to which this change invalidates
the planner's estimates of hashtable memory consumption.  However,
those estimates are already pretty bad, so improving them seems like
it can be a separate project.  This change is useful to do first to
establish consistent executor behavior that the planner can expect.

A loose end not addressed here is that the "entrysize" calculation
in BuildTupleHashTable seems wrong: "sizeof(TupleHashEntryData) +
additionalsize" corresponds neither to the size of the simplehash
entries nor to the total space needed per tuple.  It's questionable
why BuildTupleHashTable is second-guessing its caller's nbuckets
choice at all, since the original source of the number should have had
more information.  But that all seems wrapped up with the planner's
estimation logic, so let's leave it for the planned followup patch.

Reported-by: Jeff Janes <jeff.janes@gmail.com>
Reported-by: David Rowley <dgrowleyml@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
Discussion: https://postgr.es/m/2268409.1761512111@sss.pgh.pa.us
2025-10-30 11:21:22 -04:00
Peter Eisentraut
e1ac846f3d Mark ItemPointer arguments as const throughout
This is a follow up 991295f.  I searched over src/ and made all
ItemPointer arguments as const as much as possible.

Note: We cut out from the original patch the pieces that would have
created incompatibilities in the index or table AM APIs.  Those could
be considered separately.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw%40mail.gmail.com
2025-10-30 14:12:06 +01:00
Álvaro Herrera
a27c40bfe8
Simplify coding in ProcessQuery
The original is pretty baroque for no apparent reason; arguably, commit
2f9661311b should have done this.  Noted while reviewing related code
for bug #18984.  This is cosmetic (though I'm surprised that my compiler
generates shorter assembly this way), so no backpatch.

Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org
2025-10-30 11:26:35 +01:00
Peter Eisentraut
8ce795fcb7 Fix some confusing uses of const
There are a few places where we have

    typedef struct FooData { ... } FooData;
    typedef FooData *Foo;

and then function declarations with

    bar(const Foo x)

which isn't incorrect but probably meant

    bar(const FooData *x)

meaning that the thing x points to is immutable, not x itself.

This patch makes those changes where appropriate.  In one
case (execGrouping.c), the thing being pointed to was not immutable,
so in that case remove the const altogether, to avoid further
confusion.

Co-authored-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEoWx2m2E0xE8Kvbkv31ULh_E%2B5zph-WA_bEdv3UR9CLhw%2B3vg%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAEoWx2kTDz%3Db6T2xHX78vy_B_osDeCC5dcTCi9eG0vXHp5QpdQ%40mail.gmail.com
2025-10-30 11:20:04 +01:00
Peter Eisentraut
3479a0f823 const-qualify ItemPointer comparison functions
Add const qualifiers to ItemPointerEquals() and ItemPointerCompare().
This will allow further changes up the stack.  It also complements
commit aeb767ca0b, as we now have all of itemptr.h appropriately
const-qualified.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw@mail.gmail.com
2025-10-30 10:13:47 +01:00
Peter Eisentraut
e2cf524e4a formatting.c cleanup: Improve formatting of some struct declarations
This makes future editing easier.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-30 08:35:33 +01:00
Peter Eisentraut
9a1a5dfee8 formatting.c cleanup: Remove unnecessary zeroize macros
Replace with initializer or memset().

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-30 08:35:28 +01:00
Peter Eisentraut
38506f55fd formatting.c cleanup: Remove unnecessary extra line breaks in error message literals
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-30 08:35:18 +01:00
Michael Paquier
5ab0b6a248 Expose wal_fpi_bytes in EXPLAIN (WAL)
The new wal_fpi_bytes counter calculates the total amount of full page
images inserted in WAL records, in bytes.  This commit exposes this
information in EXPLAIN (ANALYZE, WAL) alongside the existing counters,
for both the text and JSON/YAML outputs, building upon f9a09aa295.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discusssion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com
2025-10-30 15:34:01 +09:00
Michael Paquier
d432094689 Fix regression with slot invalidation checks
This commit reverts 818fefd8fd, that has been introduced to address a
an instability in some of the TAP tests due to the presence of random
standby snapshot WAL records, when slots are invalidated by
InvalidatePossiblyObsoleteSlot().

Anyway, this commit had also the consequence of introducing a behavior
regression.  After 818fefd8fd, the code may determine that a slot needs
to be invalidated while it may not require one: the slot may have moved
from a conflicting state to a non-conflicting state between the moment
when the mutex is released and the moment when we recheck the slot, in
InvalidatePossiblyObsoleteSlot().  Hence, the invalidations may be more
aggressive than they actually have to.

105b2cb336 has tackled the test instability in a way that should be
hopefully sufficient for the buildfarm, even for slow members:
- In v18, the test relies on an injection point that bypasses the
creation of the random records generated for standby snapshots,
eliminating the random factor that impacted the test.  This option was
not available when 818fefd8fd was discussed.
- In v16 and v17, the problem was bypassed by disallowing a slot to
become active in some of the scenarios tested.

While on it, this commit adds a comment to document that it is fine for
a recheck to use xmin and LSN values stored in the slot, without storing
and reusing them across multiple checks.

Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com
Backpatch-through: 16
2025-10-30 13:13:28 +09:00
Richard Guo
257ee78341 Disable parallel plans for RIGHT_SEMI joins
RIGHT_SEMI joins rely on the HEAP_TUPLE_HAS_MATCH flag to guarantee
that only the first match for each inner tuple is considered.
However, in a parallel hash join, the inner relation is stored in a
shared global hash table that can be probed by multiple workers
concurrently.  This allows different workers to inspect and set the
match flags of the same inner tuples at the same time.

If two workers probe the same inner tuple concurrently, both may see
the match flag as unset and emit the same tuple, leading to duplicate
output rows and violating RIGHT_SEMI join semantics.

For now, we disable parallel plans for RIGHT_SEMI joins.  In the long
term, it may be possible to support parallel execution by performing
atomic operations on the match flag, for example using a CAS or
similar mechanism.

Backpatch to v18, where RIGHT_SEMI join was introduced.

Bug: #19094
Reported-by: Lori Corbani <Lori.Corbani@jax.org>
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19094-6ed410eb5b256abd@postgresql.org
Backpatch-through: 18
2025-10-30 11:58:45 +09:00
David Rowley
50eb4e1181 Fix bogus use of "long" in AllocSetCheck()
Because long is 32-bit on 64-bit Windows, it isn't a good datatype to
store the difference between 2 pointers.  The under-sized type could
overflow and lead to scary warnings in MEMORY_CONTEXT_CHECKING builds,
such as:

WARNING:  problem in alloc set ExecutorState: bad single-chunk %p in block %p

However, the problem lies only in the code running the check, not from
an actual memory accounting bug.

Fix by using "Size" instead of "long".  This means using an unsigned
type rather than the previous signed type.  If the block's freeptr was
corrupted, we'd still catch that if the unsigned type wrapped.  Unsigned
allows us to avoid further needless complexities around comparing signed
and unsigned types.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 13
Discussion: https://postgr.es/m/CAApHDvo-RmiT4s33J=aC9C_-wPZjOXQ232V-EZFgKftSsNRi4w@mail.gmail.com
2025-10-30 14:48:10 +13:00
Jeff Davis
3853a6956c Use C11 char16_t and char32_t for Unicode code points.
Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/bedcc93d06203dfd89815b10f815ca2de8626e85.camel%40j-davis.com
2025-10-29 14:17:13 -07:00
Álvaro Herrera
94f95d91b0
CheckNNConstraintFetch: Fill all of ConstrCheck in a single pass
Previously, we'd fill all fields except ccbin, and only later obtain and
detoast ccbin, with hypothetical failures being possible.  If ccbin is
null (rare catalog corruption I have never witnessed) or its a corrupted
toast entry, we leak a tiny bit of memory in CacheMemoryContext from
having strdup'd the constraint name.  Repair these by only attempting to
fill the struct once ccbin has been detoasted.

Author: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://postgr.es/m/CAEudQAr=i3_Z4GvmediX900+sSySTeMkvuytYShhQqEwoGyvhA@mail.gmail.com
2025-10-29 11:41:39 +01:00
Peter Eisentraut
a13833c35f Reorganize GUC structs
Instead of having five separate GUC structs, one for each type, with
the generic part contained in each of them, flip it around and have
one common struct, with the type-specific part has a subfield.

The very original GUC design had type-specific structs and
type-specific lists, and the membership in one of the lists defined
the type.  But now the structs themselves know the type (from the
.vartype field), and they are all loaded into a common hash table at
run time, and so this original separation no longer makes sense.  It
creates a bunch of inconsistencies in the code about whether the
type-specific or the generic struct is the primary struct, and a lot
of casting in between, which makes certain assumptions about the
struct layouts.

After the change, all these casts are gone and all the data is
accessed via normal field references.  Also, various code is
simplified because only one kind of struct needs to be processed.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
2025-10-29 09:52:29 +01:00
Peter Eisentraut
2724830929 formatting.c cleanup: Remove unnecessary extra parentheses
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-29 09:29:00 +01:00
Peter Eisentraut
6271d9922e formatting.c cleanup: Use array syntax instead of pointer arithmetic
for easier readability

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-29 09:28:55 +01:00
Peter Eisentraut
b9def57a3c formatting.c cleanup: Add some const pointer qualifiers
Co-authored-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-29 09:28:50 +01:00
Peter Eisentraut
d98b3cdbaf formatting.c cleanup: Use size_t for string length variables and arguments
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-29 09:28:43 +01:00
Michael Paquier
d3111cb753 Fix correctness issue with computation of FPI size in WAL stats
XLogRecordAssemble() may be called multiple times before inserting a
record in XLogInsertRecord(), and the amount of FPIs generated inside
a record whose insertion is attempted multiple times may vary.

The logic added in f9a09aa295 touched directly pgWalUsage in
XLogRecordAssemble(), meaning that it could be possible for pgWalUsage
to be incremented multiple times for a single record.  This commit
changes the code to use the same logic as the number of FPIs added to a
record, where XLogRecordAssemble() returns this information and feeds it
to XLogInsertRecord(), updating pgWalUsage only when a record is
inserted.

Reported-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSiSr+rusd0GzVy8Bt30QwLTK=ugVMnF6=5WhsSrukvvw@mail.gmail.com
2025-10-29 09:13:31 +09:00
Peter Eisentraut
03fbb0814c formatting.c cleanup: Move loop variables definitions into for statement
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-28 19:20:17 +01:00
Peter Eisentraut
95924672d5 formatting.c cleanup: Remove dashes in comments
This saves some vertical space and makes the comments style more
consistent with the rest of the code.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/6dd9d208-a3ed-49b5-b03d-8617261da973%40eisentraut.org
2025-10-28 19:20:02 +01:00
Álvaro Herrera
d5845aa8ad
Don't error out when dropping constraint if relchecks is already zero
I have never seen this be a problem in practice, but it came up when
purposely corrupting catalog contents to study the fix for a nearby bug:
we'd try to decrement relchecks, but since it's zero we error out and
fail to drop the constraint.  The fix is to downgrade the error to
warning, skip decrementing the counter, and otherwise proceed normally.

Given lack of field complaints, no backpatch.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202508291058.q2zscdcs64fj@alvherre.pgsql
2025-10-28 19:13:32 +01:00
Jeff Davis
4da12e9e2e Move comment about casts from pg_wchar.
Suggested-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGLXQUYK7Cq5KbLGgTWo7pORs7yhBWO1AEnZt7xTYbLRhg@mail.gmail.com
2025-10-28 10:49:20 -07:00
Peter Eisentraut
35e53b6841 Check that index can return in get_actual_variable_range()
Some recent changes were made to remove the explicit dependency on
btree indexes in some parts of the code.  One of these changes was
made in commit 9ef1851685, which allows non-btree indexes to be used
in get_actual_variable_range().  A follow-up commit ee1ae8b99f fixes
the cases where an index doesn’t have a sortopfamily as this is a
prerequisite to be used in get_actual_variable_range().

However, it was found that indexes that have amcanorder = true but do
not allow index-only-scans (amcanreturn returns false or is NULL) will
pass all of the conditions, while they should be rejected since
get_actual_variable_range() uses the index-only-scan machinery in
get_actual_variable_endpoint().  Such an index might cause errors like

    ERROR:  no data returned for index-only scan

during query planning.

The fix is to add a check in get_actual_variable_range() to reject
indexes that do not allow index-only scans.

Author: Maxime Schoemans <maxime.schoemans@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/20ED852A-C2D9-41EB-8671-8C8B9D418BE9%40enterprisedb.com
2025-10-28 10:07:29 +01:00
Michael Paquier
f9a09aa295 Add wal_fpi_bytes to pg_stat_wal and pg_stat_get_backend_wal()
This new counter, called "wal_fpi_bytes", tracks the total amount in
bytes of full page images (FPIs) generated in WAL.  This data becomes
available globally via pg_stat_wal, and for backend statistics via
pg_stat_get_backend_wal().

Previously, this information could only be retrieved with pg_waldump or
pg_walinspect, which may not be available depending on the environment,
and are expensive to execute.  It offers hints about how much FPIs
impact the WAL generated, which could be a large percentage for some
workloads, as well as the effects of wal_compression or page holes.

Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, due to the addition of wal_fpi_bytes in
PgStat_WalCounters.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com
2025-10-28 16:21:51 +09:00
Amit Kapila
3e8e05596a Add worker type argument to logical replication worker functions.
Extend logicalrep_worker_stop, logicalrep_worker_wakeup, and
logicalrep_worker_find to accept a worker type argument. This change
enables differentiation between logical replication worker types, such as
apply workers and table sync workers. While preserving existing behavior,
it lays the groundwork for upcoming patch to add sequence synchronization
workers.

Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-10-28 05:47:50 +00:00
Nathan Bossart
123661427b Fix a couple of comments.
These were discovered while reviewing Aleksander Alekseev's
proposed changes to pgindent.

Oversights in commits 393e0d2314 and 25a30bbd42.

Discussion: https://postgr.es/m/aP-H6kSsGOxaB21k%40nathan
2025-10-27 10:30:05 -05:00
Peter Eisentraut
10b5bb3bff Add some const qualifications
Add some const qualifications afforded by the previous change that
added a const qualification to PageAddItemExtended().

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org
2025-10-27 09:55:59 +01:00
Peter Eisentraut
76acf4b722 Remove Item type
This type is just char * underneath, it provides no real value, no
type safety, and just makes the code one level more mysterious.  It is
more idiomatic to refer to blobs of memory by a combination of void *
and size_t, so change it to that.

Also, since this type hides the pointerness, we can't apply qualifiers
to what is pointed to, which requires some unconstify nonsense.  This
change allows fixing that.

Extension code that uses the Item type can change its code to use
void * to be backward compatible.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org
2025-10-27 09:55:59 +01:00
Amit Kapila
e0dc4bbfb8 Fix GUC check_hook validation for synchronized_standby_slots.
Previously, the check_hook for synchronized_standby_slots attempted to
validate that each specified slot existed and was physical. However, these
checks were not performed during server startup. As a result, if users
configured non-existent slots before startup, the misconfiguration would
go undetected initially. This could later cause parallel query failures,
as newly launched workers would detect the issue and raise an ERROR.

This patch improves the check_hook by validating the syntax and format of
slot names. Validation of slot existence and type is deferred to the WAL
sender process, aligning with the behavior of the check_hook for
primary_slot_name.

Reported-by: Fabrice Chapuis <fabrice636861@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Backpatch-through: 17, where it was introduced
Discussion: https://postgr.es/m/CAA5-nLCeO4MQzWipCXH58qf0arruiw0OeUc1+Q=Z=4GM+=v1NQ@mail.gmail.com
2025-10-27 06:48:32 +00:00
Jeff Davis
371a302eec Comment typo fixes: pg_wchar_t should be pg_wchar.
Reported-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJ5Xh0KxLYXDZuPvw1_fHX=yuzb4xxtam1Cr6TPZZ1o+w@mail.gmail.com
2025-10-26 12:31:50 -07:00
David Rowley
39dcfda2d2 Fix incorrect logic for caching ResultRelInfos for triggers
When dealing with ResultRelInfos for partitions, there are cases where
there are mixed requirements for the ri_RootResultRelInfo.  There are
cases when the partition itself requires a NULL ri_RootResultRelInfo and
in the same query, the same partition may require a ResultRelInfo with
its parent set in ri_RootResultRelInfo.  This could cause the column
mapping between the partitioned table and the partition not to be done
which could result in crashes if the column attnums didn't match
exactly.

The fix is simple.  We now check that the ri_RootResultRelInfo matches
what the caller passed to ExecGetTriggerResultRel() and only return a
cached ResultRelInfo when the ri_RootResultRelInfo matches what the
caller wants, otherwise we'll make a new one.

Author: David Rowley <dgrowleyml@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Reported-by: Dmitry Fomin <fomin.list@gmail.com>
Discussion: https://postgr.es/m/7DCE78D7-0520-4207-822B-92F60AEA14B4@gmail.com
Backpatch-through: 15
2025-10-26 10:59:50 +13:00
Tom Lane
9f9a04368f Fix off-by-one Asserts in FreePageBtreeInsertInternal/Leaf.
These two functions expect there to be room to insert another item
in the FreePageBtree's array, but their assertions were too weak
to guarantee that.  This has little practical effect granting that
the callers are not buggy, but it seems to be misleading late-model
Coverity into complaining about possible array overrun.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/799984.1761150474@sss.pgh.pa.us
Backpatch-through: 13
2025-10-23 12:32:06 -04:00
Amit Kapila
f0b3573c3a Introduce "REFRESH SEQUENCES" for subscriptions.
This patch adds support for a new SQL command:
ALTER SUBSCRIPTION ... REFRESH SEQUENCES
This command updates the sequence entries present in the
pg_subscription_rel catalog table with the INIT state to trigger
resynchronization.

In addition to the new command, the following subscription commands have
been enhanced to automatically refresh sequence mappings:
ALTER SUBSCRIPTION ... REFRESH PUBLICATION
ALTER SUBSCRIPTION ... ADD PUBLICATION
ALTER SUBSCRIPTION ... DROP PUBLICATION
ALTER SUBSCRIPTION ... SET PUBLICATION

These commands will perform the following actions:
Add newly published sequences that are not yet part of the subscription.
Remove sequences that are no longer included in the publication.

This ensures that sequence replication remains aligned with the current
state of the publication on the publisher side.

Note that the actual synchronization of sequence data/values will be
handled in a subsequent patch that introduces a dedicated sequence sync
worker.

Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-10-23 08:30:27 +00:00
Fujii Masao
abc2b71383 Add comments explaining overflow entries in the replication lag tracker.
Commit 883a95646a introduced overflow entries in the replication lag tracker
to fix an issue where lag columns in pg_stat_replication could stall when
the replay LSN stopped advancing.

This commit adds comments clarifying the purpose and behavior of overflow
entries to improve code readability and understanding.

Since commit 883a95646a was recently applied and backpatched to all
supported branches, this follow-up commit is also backpatched accordingly.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VxqQA_DePxyZ7Y8V+ErYyXkmwJ1P6NC+YC+cvxMipWKw@mail.gmail.com
Backpatch-through: 13
2025-10-23 13:24:56 +09:00
Tatsuo Ishii
20628b62e4 Fix coding style with "else".
The "else" code block having single statement with comments on a
separate line should have been surrounded by braces.

Reported-by: Chao Li <lic@highgo.com>
Suggested-by: David Rowley <dgrowleyml@gmail.com>
Author: Tatsuo Ishii <ishii@postgresql.org>
Discussion: https://postgr.es/m/20251020.125847.997839131426057290.ishii%40postgresql.org
2025-10-23 10:58:41 +09:00
David Rowley
6911f80379 Fix incorrect zero extension of Datum in JIT tuple deform code
When JIT deformed tuples (controlled via the jit_tuple_deforming GUC),
types narrower than sizeof(Datum) would be zero-extended up to Datum
width.  This wasn't the same as what fetch_att() does in the standard
tuple deforming code.  Logically the values are the same when fetching
via the DatumGet*() marcos, but negative numbers are not the same in
binary form.

In the report, the problem was manifesting itself with:

ERROR: could not find memoization table entry

in a query which had a "Cache Mode: binary" Memoize node. However, it's
currently unclear what else is affected.  Anything that uses
datum_image_eq() or datum_image_hash() on a Datum from a tuple deformed by
JIT could be affected, but it may not be limited to that.

The fix for this is simple: use signed extension instead of zero
extension.

Many thanks to Emmanuel Touzery for reporting this issue and providing
steps and backup which allowed the problem to easily be recreated.

Reported-by: Emmanuel Touzery <emmanuel.touzery@plandela.si>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/DB8P194MB08532256D5BAF894F241C06393F3A@DB8P194MB0853.EURP194.PROD.OUTLOOK.COM
Backpatch-through: 13
2025-10-23 13:11:02 +13:00
Tom Lane
fe9c051fd3 Avoid assuming that time_t can fit in an int.
We had several places that used cast-to-unsigned-int as a substitute
for properly checking for overflow.  Coverity has started objecting
to that practice as likely introducing Y2038 bugs.  An extra
comparison is surely not much compared to the cost of time(NULL), nor
is this coding practice particularly readable.  Let's do it honestly,
with explicit logic covering the cases of first-time-through and
clock-went-backwards.

I don't feel a need to back-patch though: our released versions
will be out of support long before 2038, and besides which I think
the code would accidentally work anyway for another 70 years or so.
2025-10-22 17:50:11 -04:00
Nathan Bossart
4c5e1d0785 Remove make_temptable_name_n().
This small function is only used in one place, and it fails to
handle quoted table names (although the table name portion of the
input should never be quoted in current usage).  In addition to
removing make_temptable_name_n() in favor of open-coding it where
needed, this commit ensures the "diff" table name is properly
quoted in order to future-proof this area a bit.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TO3a5q2NKRsjdJ6sLf8isVe4aMaaX1-Hj2TdHdhFw8zRA%40mail.gmail.com
2025-10-22 12:31:55 -05:00
Fujii Masao
f33e60a53a Make invalid primary_slot_name follow standard GUC error reporting.
Previously, if primary_slot_name was set to an invalid slot name and
the configuration file was reloaded, both the postmaster and all other
backend processes reported a WARNING. With many processes running,
this could produce a flood of duplicate messages. The problem was that
the GUC check hook for primary_slot_name reported errors at WARNING
level via ereport().

This commit changes the check hook to use GUC_check_errdetail() and
GUC_check_errhint() for error reporting. As with other GUC parameters,
this causes non-postmaster processes to log the message at DEBUG3,
so by default, only the postmaster's message appears in the log file.

Backpatch to all supported versions.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/CAHGQGwFud-cvthCTfusBfKHBS6Jj6kdAPTdLWKvP2qjUX6L_wA@mail.gmail.com
Backpatch-through: 13
2025-10-22 20:09:43 +09:00
Tatsuo Ishii
2d7b247cb4 Fix multi WinGetFuncArgInFrame/Partition calls with IGNORE NULLS.
Previously it was mistakenly assumed that there's only one window
function argument which needs to be processed by WinGetFuncArgInFrame
or WinGetFuncArgInPartition when IGNORE NULLS option is specified. To
eliminate the limitation, WindowObject->notnull_info is modified from
"uint8 *" to "uint8 **" so that WindowObject->notnull_info could store
pointers to "uint8 *" which holds NOT NULL info corresponding to each
window function argument. Moreover, WindowObject->num_notnull_info is
changed from "int" to "int64 *" so that WindowObject->num_notnull_info
could store the number of NOT NULL info corresponding to each function
argument. Memories for these data structures will be allocated when
WinGetFuncArgInFrame or WinGetFuncArgInPartition is called. Thus no
memory except the pointers is allocated for function arguments which
do not call these functions

Also fix the set mark position logic in WinGetFuncArgInPartition to
not raise a "cannot fetch row before WindowObject's mark position"
error in IGNORE NULLS case.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Tatsuo Ishii <ishii@postgresql.org>
Discussion: https://postgr.es/m/2952409.1760023154%40sss.pgh.pa.us
2025-10-22 12:06:33 +09:00
Fujii Masao
883a95646a Fix stalled lag columns in pg_stat_replication when replay LSN stops advancing.
Previously, when the replay LSN reported in feedback messages from a standby
stopped advancing, for example, due to a recovery conflict, the write_lag and
flush_lag columns in pg_stat_replication would initially update but then stop
progressing. This prevented users from correctly monitoring replication lag.

The problem occurred because when any LSN stopped updating, the lag tracker's
cyclic buffer became full (the write head reached the slowest read head).
In that state, the lag tracker could no longer compute round-trip lag values
correctly.

This commit fixes the issue by handling the slowest read entry (the one
causing the buffer to fill) as a separate overflow entry and freeing space
so the write and other read heads can continue advancing in the buffer.
As a result, write_lag and flush_lag now continue updating even if the reported
replay LSN remains stalled.

Backpatch to all supported versions.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwGdGQ=1-X-71Caee-LREBUXSzyohkoQJd4yZZCMt24C0g@mail.gmail.com
Backpatch-through: 13
2025-10-22 11:27:15 +09:00
Michael Paquier
2b75c38b70 Add error_on_null(), checking if the input is the null value
This polymorphic function produces an error if the input value is
detected as being the null value; otherwise it returns the input value
unchanged.

This function can for example become handy in SQL function bodies, to
enforce that exactly one row was returned.

Author: Joel Jacobson <joel@compiler.org>
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/ece8c6d1-2ab1-45d5-ba12-8dec96fc8886@app.fastmail.com
Discussion: https://postgr.es/m/de94808d-ed58-4536-9e28-e79b09a534c7@app.fastmail.com
2025-10-22 09:55:17 +09:00
David Rowley
2470ca435c Use CompactAttribute more often, when possible
5983a4cff added CompactAttribute for storing commonly used fields from
FormData_pg_attribute.  5983a4cff didn't go to the trouble of adjusting
every location where we can use CompactAttribute rather than
FormData_pg_attribute, so here we change the remaining ones.

There are some locations where I've left the code using
FormData_pg_attribute.  These are mostly in the ALTER TABLE code.  Using
CompactAttribute here seems more risky as often the TupleDesc is being
changed and those changes may not have been flushed to the
CompactAttribute yet.

I've also left record_recv(), record_send(), record_cmp(), record_eq()
and record_image_eq() alone as it's not clear to me that accessing the
CompactAttribute is a win here due to the FormData_pg_attribute still
having to be accessed for most cases.  Switching the relevant parts to
use CompactAttribute would result in having to access both for common
cases.  Careful benchmarking may reveal that something can be done to
make this better, but in absence of that, the safer option is to leave
these alone.

In ReorderBufferToastReplace(), there was a check to skip attnums < 0
while looping over the TupleDesc.  Doing this is redundant since
TupleDescs don't store < 0 attnums.  Removing that code allows us to
move to using CompactAttribute.

The change in validateDomainCheckConstraint() just moves fetching the
FormData_pg_attribute into the ERROR path, which is cold due to calling
errstart_cold() and results in code being moved out of the common path.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAApHDvrMy90o1Lgkt31F82tcSuwRFHq3vyGewSRN=-QuSEEvyQ@mail.gmail.com
2025-10-22 11:36:26 +13:00
Jeff Davis
ff53907c35 Make char2wchar() static.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
2025-10-21 09:32:12 -07:00
Jeff Davis
844385d12e Remove obsolete global database_ctype_is_c.
Now that tsearch uses the database default locale, there's no need to
track the database CTYPE separately.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
2025-10-21 09:32:04 -07:00
Jeff Davis
e113f9c102 tsearch: use database default collation for parsing.
Previously, tsearch used the database's CTYPE setting, which only
matches the database default collation if the locale provider is libc.

Note that tsearch types (tsvector and tsquery) are not collatable
types. The locale affects parsing the original text, which is a lossy
process, so a COLLATE clause on the already-parsed value would not
make sense.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
2025-10-21 09:31:49 -07:00
Nathan Bossart
e94a7afe44 Re-pgindent brin.c.
Backpatch-through: 13
2025-10-21 09:56:26 -05:00
Álvaro Herrera
b7cc6474e9
Make smgr access for a BufferManagerRelation safer in relcache inval
Currently there's no bug, because we have no code path where we
invalidate relcache entries where it'd cause a problem.  But it's more
robust to do it this way in case we introduce such a path later, as some
Postgres forks reportedly already have.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Stepan Neretin <slpmcf@gmail.com>
Discussion: https://postgr.es/m/CAJDiXgj3FNzAhV+jjPqxMs3jz=OgPohsoXFj_fh-L+nS+13CKQ@mail.gmail.com
2025-10-21 10:51:55 +03:00
David Rowley
9fd29d7ff4 Fix BRIN 32-bit counter wrap issue with huge tables
A BlockNumber (32-bit) might not be large enough to add bo_pagesPerRange
to when the table contains close to 2^32 pages.  At worst, this could
result in a cancellable infinite loop during the BRIN index scan with
power-of-2 pagesPerRange, and slow (inefficient) BRIN index scans and
scanning of unneeded heap blocks for non power-of-2 pagesPerRange.

Backpatch to all supported versions.

Author: sunil s <sunilfeb26@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOG6S4-tGksTQhVzJM19NzLYAHusXsK2HmADPZzGQcfZABsvpA@mail.gmail.com
Backpatch-through: 13
2025-10-21 20:46:14 +13:00
Michael Paquier
e4e496e88c Fix comment in pg_get_shmem_allocations_numa()
The comment fixed in this commit described the function as dealing with
database blocks, but in reality it processes shared memory allocations.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aH4DDhdiG9Gi0rG7@ip-10-97-1-34.eu-west-3.compute.internal
Backpatch-through: 18
2025-10-21 16:12:30 +09:00
Richard Guo
18d2614093 Fix pushdown of degenerate HAVING clauses
67a54b9e8 taught the planner to push down HAVING clauses even when
grouping sets are present, as long as the clause does not reference
any columns that are nullable by the grouping sets.  However, there
was an oversight: if any empty grouping sets are present, the
aggregation node can produce a row that did not come from the input,
and pushing down a HAVING clause in this case may cause us to fail to
filter out that row.

Currently, non-degenerate HAVING clauses are not pushed down when
empty grouping sets are present, since the empty grouping sets would
nullify the vars they reference.  However, degenerate (variable-free)
HAVING clauses are not subject to this restriction and may be
incorrectly pushed down.

To fix, explicitly check for the presence of empty grouping sets and
retain degenerate clauses in HAVING when they are present.  This
ensures that we don't emit a bogus aggregated row.  A copy of each
such clause is also put in WHERE so that query_planner() can use it in
a gating Result node.

To facilitate this check, this patch expands the groupingSets tree of
the query to a flat list of grouping sets before applying the HAVING
pushdown optimization.  This does not add any additional planning
overhead, since we need to do this expansion anyway.

In passing, make a small tweak to preprocess_grouping_sets() by
reordering its initial operations a bit.

Backpatch to v18, where this issue was introduced.

Reported-by: Yuhang Qiu <iamqyh@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/0879D9C9-7FE2-4A20-9593-B23F7A0B5290@gmail.com
Backpatch-through: 18
2025-10-21 12:35:36 +09:00
Masahiko Sawada
4bea91f21f Support COPY TO for partitioned tables.
Previously, COPY TO command didn't support directly specifying
partitioned tables so users had to use COPY (SELECT ...) TO variant.

This commit adds direct COPY TO support for partitioned
tables, improving both usability and performance. Performance tests
show it's faster than the COPY (SELECT ...) TO variant as it avoids
the overheads of query processing and sending results to the COPY TO
command.

When used with partitioned tables, COPY TO copies the same rows as
SELECT * FROM table. Row-level security policies of the partitioned
table are applied in the same way as when executing COPY TO on a plain
table.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Melih Mutlu <m.melihmutlu@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CACJufxEZt%2BG19Ors3bQUq-42-61__C%3Dy5k2wk%3DsHEFRusu7%3DiQ%40mail.gmail.com
2025-10-20 10:38:52 -07:00
Tom Lane
92cf557ffa Add static assertion that RELSEG_SIZE fits in an int.
Our configure script intended to ensure this, but it supposed that
expr(1) would report an error for integer overflow.  Maybe that
was true when the code was written (commit 3c6248a82 of 2008-05-02),
but all the modern expr's I tried will deliver bigger-than-int32
results without complaint.  Moreover, if you use --with-segsize-blocks
then there's no check at all.

Ideally we'd add a test in configure itself to check that the value
fits in int, but to do that we'd need to suppose that test(1) handles
bigger-than-int32 numbers correctly.  Probably modern ones do, but
that's an assumption I could do without; and I'm not too trusting
about meson either.  Instead, let's install a static assertion, so
that even people who ignore all the compiler warnings you get from
such values will be forced to confront the fact that it won't work.

This has been hazardous for awhile, but given that we hadn't heard
a complaint about it till now, I don't feel a need to back-patch.

Reported-by: Casey Shobe <casey.allen.shobe@icloud.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/C5DC82D6-C76D-4E8F-BC2E-DF03EFC4FA24@icloud.com
2025-10-19 18:28:46 -04:00
Tatsuo Ishii
dd766a441d Fix Coverity issue reported in commit 2273fa32bc.
Coverity complains that the return value from gettuple_eval_partition
(stored in variable "datum") in a do..while loop in
WinGetFuncArgInPartition is overwritten when exiting the while
loop. This commit tries to fix the issue by changing the
gettuple_eval_partition call to:

(void) gettuple_eval_partition()

explicitly stating that we discard the return value. We are just
interested in whether we are inside or outside of partition, NULL or
NOT NULL here.

Also enhance some comments for easier code reading.

Reported-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aPCOabSE4VfJLaky%40paquier.xyz
2025-10-19 09:29:26 +09:00
Jeff Davis
e533524b23 Add pg_database_locale() to retrieve database default locale.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
2025-10-18 16:25:23 -07:00
Jeff Davis
67a8b49e96 Add pg_iswxdigit(), useful for tsearch.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
2025-10-18 16:25:11 -07:00
David Rowley
e3b9e44689 Tidyup truncate_useless_pathkeys() function
This removes a few static functions and replaces them with 2 functions
which aim to be more reusable.  The upper planner's pathkey requirements
can be simplified down to operations which require pathkeys in the same
order as the pathkeys for the given operation, and operations which can
make use of a Path's pathkeys in any order.

Here we also add some short-circuiting to truncate_useless_pathkeys().  At
any point we discover that all pathkeys are useful to a single operation,
we can stop checking the remaining operations as we're not going to be
able to find any further useful pathkeys - they're all possibly useful
already.  Adjusting this seems to warrant trying to put the checks
roughly in order of least-expensive-first so that the short-circuits
have the most chance of skipping the more expensive checks.

In passing clean up has_useful_pathkeys() as it seems to have grown a
redundant check for group_pathkeys.  This isn't needed as
standard_qp_callback will set query_pathkeys if there's any requirement
to have group_pathkeys.  All this code does is waste run-time effort and
take up needless space.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpbsEoTksvW5901MMoZo-hHf78E5up3uDOfkJnxDe_WAw@mail.gmail.com
2025-10-19 10:13:13 +13:00
David Rowley
5c0a20003b Fix reset of incorrect hash iterator in GROUPING SETS queries
This fixes an unlikely issue when fetching GROUPING SET results from
their internally stored hash tables.  It was possible in rare cases that
the hash iterator would be set up incorrectly which could result in a
crash.

This was introduced in 4d143509c, so backpatch to v18.

Many thanks to Yuri Zamyatin for reporting and helping to debug this
issue.

Bug: #19078
Reported-by: Yuri Zamyatin <yuri@yrz.am>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org
Backpatch-through: 18
2025-10-18 16:07:04 +13:00
Tomas Vondra
b85c4700fc Fix hashjoin memory balancing logic
Commit a1b4f289be improved the hashjoin sizing to also consider the
memory used by BufFiles for batches. The code however had multiple
issues, making it ineffective or not working as expected in some cases.

* The amount of memory needed by buffers was calculated using uint32,
  so it would overflow for nbatch >= 262144. If this happened the loop
  would exit prematurely and the memory usage would not be reduced.

  The nbatch overflow is fixed by reworking the condition to not use a
  multiplication at all, so there's no risk of overflow. An explicit
  cast was added to a similar calculation in ExecHashIncreaseBatchSize.

* The loop adjusting the nbatch value used hash_table_bytes to calculate
  the old/new size, but then updated only space_allowed. The consequence
  is the total memory usage was not reduced, but all the memory saved by
  reducing the number of batches was used for the internal hash table.

  This was fixed by using only space_allowed. This is also more correct,
  because hash_table_bytes does not account for skew buckets.

* The code was also doubling multiple parameters (e.g. the number of
  buckets for hash table), but was missing overflow protections.

  The loop now checks for overflow, and terminates if needed. It'd be
  possible to cap the value and continue the loop, but it's not worth
  the complexity. And the overflow implies the in-memory hash table is
  already very large anyway.

While at it, rework the comment explaining how the memory balancing
works, to make it more concise and easier to understand.

The initial nbatch overflow issue was reported by Vaibhav Jain. The
other issues were noticed by me and Melanie Plageman. Fix by me, with a
lot of review and feedback by Melanie.

Backpatch to 18, where the hashjoin memory balancing was introduced.

Reported-by: Vaibhav Jain <jainva@google.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Backpatch-through: 18
Discussion: https://postgr.es/m/CABa-Az174YvfFq7rLS+VNKaQyg7inA2exvPWmPWqnEn6Ditr_Q@mail.gmail.com
2025-10-17 22:21:50 +02:00
Peter Eisentraut
e1a912c86d Change config_generic.vartype to be initialized at compile time
Previously, this was initialized at run time so that it did not have
to be maintained by hand in guc_tables.c.  But since that table is now
generated anyway, we might as well generate this bit as well.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
2025-10-17 10:33:54 +02:00
Peter Eisentraut
0a7bde4610 Use designated initializers for guc_tables
This makes the generating script simpler and the output easier to
read.  In the future, it will make it easier to reorder and rearrange
the underlying C structures.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
2025-10-17 10:29:42 +02:00
Daniel Gustafsson
6aa184c80f Replace defunct URL with stable archive.org URL in rbtree.c
The URL for "Sorting and Searching Algorithms: A Cookbook"
by Thomas Niemann has started returning 404, and since we
refer to the page for license terms this replaces the now
defunct link with one to the copy on archive.org.

Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/6DED3DEF-875E-4D1D-8F8F-7353D5AF7B79@gmail.com
2025-10-17 09:38:49 +02:00
Nathan Bossart
812221b204 Remove partColsUpdated.
This information appears to have been unused since commit
c5b7ba4e67.  We could not find any references in third-party code,
either.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aO_CyFRpbVMtgJWM%40nathan
2025-10-16 11:31:38 -05:00
Amit Kapila
41c674d2e3 Refactor logical worker synchronization code into a separate file.
To support the upcoming addition of a sequence synchronization worker,
this patch extracts common synchronization logic shared by table sync
workers and the new sequence sync worker into a dedicated file. This
modularization improves code reuse, maintainability, and clarity in the
logical workers framework.

Author: vignesh C <vignesh21@gmail.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-10-16 05:10:50 +00:00
Amit Langote
905e932f09 Fix EPQ crash from missing partition directory in EState
EvalPlanQualStart() failed to propagate es_partition_directory into
the child EState used for EPQ rechecks. When execution time partition
pruning ran during the EPQ scan, executor code dereferenced a NULL
partition directory and crashed.

Previously, propagating es_partition_directory into the EPQ EState was
unnecessary because CreatePartitionPruneState(), which sets it on
demand, also initialized the exec-pruning context.  After commit
d47cbf474, CreatePartitionPruneState() now initializes only the init-
time pruning context, leaving exec-pruning context initialization to
ExecInitNode(). Since EvalPlanQualStart() runs only ExecInitNode() and
not CreatePartitionPruneState(), it can encounter a NULL
es_partition_directory.  Other executor fields initialized during
CreatePartitionPruneState() are already copied into the child EState
thanks to commit 8741e48e5d, but es_partition_directory was missed.

Fix by borrowing the parent estate's  es_partition_directory in
EvalPlanQualStart(), and by clearing that field in EvalPlanQualEnd()
so the parent remains responsible for freeing the directory.

Add an isolation test permutation that triggers EPQ with execution-
time partition pruning, the case that reproduces this crash.

Bug: #19078
Reported-by: Yuri Zamyatin <yuri@yrz.am>
Diagnosed-by: David Rowley <dgrowleyml@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/19078-dfd62f840a2c0766@postgresql.org
Backpatch-through: 18
2025-10-16 14:01:44 +09:00
Nathan Bossart
079480dc20 Fix lookup code for REINDEX INDEX.
This commit adjusts RangeVarCallbackForReindexIndex() to handle an
extremely unlikely race condition involving concurrent OID reuse.
In short, if REINDEX INDEX is executed at the same time that the
index is re-created with the same name and OID but a different
parent table OID, we might lock the wrong parent table.  To fix,
simply detect when this happens and emit an ERROR.  Unfortunately,
we can't gracefully handle this situation because we will have
already locked the index, and we must lock the parent table before
the index to avoid deadlocks.

While at it, I've replaced all but one early return in this
callback function with ERRORs that should be unreachable.  While I
haven't verified the presence of a live bug, the checks in question
appear to be unnecessary, and the early returns seem prone to
breaking the parent table locking code in subtle ways.  If nothing
else, this simplifies the code a bit.

This is a bug fix and could be back-patched, but given the presumed
rarity of the race condition and the lack of reports, I'm not going
to bother.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan
2025-10-15 16:32:40 -05:00
Jeff Davis
af164f31b9 Add pg_iswalpha() and related functions.
Per-character pg_locale_t APIs. Useful for tsearch parsing and
potentially other places.

Significant overlap with the regc_wc_isalpha() and related functions
in regc_pg_locale.c, but this change leaves those intact for
now.

Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
2025-10-15 12:54:01 -07:00
Nathan Bossart
688dc6299a Fix lookups in pg_{clear,restore}_{attribute,relation}_stats().
Presently, these functions look up the relation's OID, lock it, and
then check privileges.  Not only does this approach provide no
guarantee that the locked relation matches the arguments of the
lookup, but it also allows users to briefly lock relations for
which they do not have privileges, which might enable
denial-of-service attacks.  This commit adjusts these functions to
use RangeVarGetRelidExtended(), which is purpose-built to avoid
both of these issues.  The new RangeVarGetRelidCallback function is
somewhat complicated because it must handle both tables and
indexes, and for indexes, we must check privileges on the parent
table and lock it first.  Also, it needs to handle a couple of
extremely unlikely race conditions involving concurrent OID reuse.

A downside of this change is that the coding doesn't allow for
locking indexes in AccessShare mode anymore; everything is locked
in ShareUpdateExclusive mode.  Per discussion, the original choice
of lock levels was intended for a now defunct implementation that
used in-place updates, so we believe this change is okay.

Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/Z8zwVmGzXyDdkAXj%40nathan
Backpatch-through: 18
2025-10-15 12:47:33 -05:00
Peter Eisentraut
5f4c3b33a9 Change reset_extra into a config_generic common field
This is not specific to the GUC parameter type, so it can be part of
the generic struct rather than the type-specific struct (like the
related "extra" field).  This allows for some code simplifications.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
2025-10-15 15:20:28 +02:00
Peter Eisentraut
dd3ae37830 Add log_autoanalyze_min_duration
The log output functionality of log_autovacuum_min_duration applies to
both VACUUM and ANALYZE, so it is not possible to separate the VACUUM
and ANALYZE log output thresholds. Logs are likely to be output only for
VACUUM and not for ANALYZE.

Therefore, we decided to separate the threshold for log output of VACUUM
by autovacuum (log_autovacuum_min_duration) and the threshold for log
output of ANALYZE by autovacuum (log_autoanalyze_min_duration).

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Kasahara Tatsuhito <kasaharatt@oss.nttdata.com>
Discussion: https://www.postgresql.org/message-id/flat/CAOzEurQtfV4MxJiWT-XDnimEeZAY+rgzVSLe8YsyEKhZcajzSA@mail.gmail.com
2025-10-15 14:31:12 +02:00
Peter Eisentraut
29dc7a6687 Add some const qualifiers
in guc-related source files, in anticipation of some further
restructuring.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
2025-10-15 10:05:53 +02:00
Peter Eisentraut
1a79518888 Modernize some for loops
in guc-related source files, in anticipation of some further
restructuring.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
2025-10-15 10:00:37 +02:00
Amit Kapila
2436b8c047 Standardize use of REFRESH PUBLICATION in code and messages.
This patch replaces ALTER SUBSCRIPTION REFRESH with
ALTER SUBSCRIPTION REFRESH PUBLICATION in comments and error messages to
improve clarity and support future extensibility. The change aligns with
upcoming addition REFRESH SEQUENCES for sequence synchronization.

Author: vignesh C <vignesh21@gmail.com>
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-10-15 03:42:27 +00:00
Melanie Plageman
3e4705484e Make heap_page_is_all_visible independent of LVRelState
This function only requires a few fields from LVRelState, so pass them
in individually.

This change allows calling heap_page_is_all_visible() from code such as
pruneheap.c, which does not have access to an LVRelState.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
2025-10-14 17:43:41 -04:00
Melanie Plageman
43b05b38ea Inline TransactionIdFollows/Precedes[OrEquals]()
These functions appeared prominently in a profile of a patch that sets
the visibility map on-access. Inline them to remove call overhead and
make them cheaper to use in hot paths.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
2025-10-14 17:03:48 -04:00
Melanie Plageman
c8dd6542ba Add helper for freeze determination to heap_page_prune_and_freeze
After scanning the line pointers on a heap page during the first phase
of vacuum, we use the information collected to decide whether to use
the assembled freeze plans.

Move this decision logic into a helper function to improve readability.

While here, rename a PruneState member and disambiguate some local
variables in heap_page_prune_and_freeze().

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
2025-10-14 15:08:50 -04:00
Jeff Davis
8efe982fe2 pg_regc_locale.c: rename some static functions.
Use the more specific prefix "regc_" rather than the generic prefix
"pg_".

A subsequent commit will create generic versions of some of these
functions that can be called from other modules.

Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
2025-10-14 11:04:04 -07:00
Tatsuo Ishii
5f3808646f Use ereport rather than elog in WinCheckAndInitializeNullTreatment.
Previously WinCheckAndInitializeNullTreatment() used elog() to emit an
error message. ereport() should be used instead because it's a
user-facing error. Also use existing get_func_name() to get a
function's name, rather than own implementation.

Moreover add an assertion to validate winobj parameter, just like
other window function API.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/2952409.1760023154%40sss.pgh.pa.us
2025-10-14 19:15:24 +09:00
Richard Guo
1206df04c2 Rename apply_at to apply_agg_at for clarity
The field name "apply_at" in RelAggInfo was a bit ambiguous.  Rename
it to "apply_agg_at" to improve clarity and make its purpose clearer.

Per complaint from David Rowley, Robert Haas.

Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA+TgmoZ0KR2_XCWHy17=HHcQ3p2Mamc9c6Dnnhf1J6wPYFD9ng@mail.gmail.com
2025-10-14 16:35:22 +09:00
Melanie Plageman
add323da40 Eliminate XLOG_HEAP2_VISIBLE from vacuum phase III
Instead of emitting a separate XLOG_HEAP2_VISIBLE WAL record for each
page that becomes all-visible in vacuum's third phase, specify the VM
changes in the already emitted XLOG_HEAP2_PRUNE_VACUUM_CLEANUP record.

Visibility checks are now performed before marking dead items unused.
This is safe because the heap page is held under exclusive lock for the
entire operation.

This reduces the number of WAL records generated by VACUUM phase III by
up to 50%.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2025-10-13 18:01:06 -04:00
Tom Lane
03bf7a12c5 Fix incorrect message-printing in win32security.c.
log_error() would probably fail completely if used, and would
certainly print garbage for anything that needed to be interpolated
into the message, because it was failing to use the correct printing
subroutine for a va_list argument.

This bug likely went undetected because the error cases this code
is used for are rarely exercised - they only occur when Windows
security API calls fail catastrophically (out of memory, security
subsystem corruption, etc).

The FRONTEND variant can be fixed just by calling vfprintf()
instead of fprintf().  However, there was no va_list variant
of write_stderr(), so create one by refactoring that function.
Following the usual naming convention for such things, call
it vwrite_stderr().

Author: Bryan Green <dbryan.green@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAF+pBj8goe4fRmZ0V3Cs6eyWzYLvK+HvFLYEYWG=TzaM+tWPnw@mail.gmail.com
Backpatch-through: 13
2025-10-13 17:56:45 -04:00
Peter Geoghegan
7a662a46eb Remove unused nbtree array advancement variable.
Remove a variable that is no longer in use following commit 9a2e2a28.
It's not immediately clear why there were no compiler warnings about
this oversight.

Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
2025-10-12 14:04:08 -04:00
Álvaro Herrera
3231fd0455
Stop creating constraints during DETACH CONCURRENTLY
Commit 71f4c8c6f7 (which implemented DETACH CONCURRENTLY) added code
to create a separate table constraint when a table is detached
concurrently, identical to the partition constraint, on the theory that
such a constraint was needed in case the optimizer had constructed any
query plans that depended on the constraint being there.  However, that
theory was apparently bogus because any such plans would be invalidated.

For hash partitioning, those constraints are problematic, because their
expressions reference the OID of the parent partitioned table, to which
the detached table is no longer related; this causes all sorts of
problems (such as inability of restoring a pg_dump of that table, and
the table no longer working properly if the partitioned table is later
dropped).

We'd like to get rid of all those constraints.  In fact, for branch
master, do that -- no longer create any substitute constraints.
However, out of fear that some users might somehow depend on these
constraints for other partitioning strategies, for stable branches
(back to 14, which added DETACH CONCURRENTLY), only do it for hash
partitioning.

(If you repeatedly DETACH CONCURRENTLY and then ATTACH a partition, then
with this constraint addition you don't need to scan the table in the
ATTACH step, which presumably is good.  But if users really valued this
feature, they would have requested that it worked for non-concurrent
DETACH also.)

Author: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reported-by: Fei Changhong <feichanghong@qq.com>
Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/18371-7fef49f63de13f02@postgresql.org
Discussion: https://postgr.es/m/19070-781326347ade7c57@postgresql.org
2025-10-11 20:30:12 +02:00
Álvaro Herrera
ff47f9c16c
dbase_redo: Fix Valgrind-reported memory leak
Introduced by my (Álvaro's) commit 9e4f914b5e, which was itself
backpatched to pg10, though only pg15 and up contain the problem
because of commit 9c08aea6a3.

This isn't a particularly significant leak, but given the fix is
trivial, we might as well backpatch to all branches where it applies, so
do that.

Author: Nathan Bossart <nathandbossart@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/x4odfdlrwvsjawscnqsqjpofvauxslw7b4oyvxgt5owoyf4ysn@heafjusodrz7
2025-10-11 16:39:22 +02:00
Peter Geoghegan
843e50208a Remove overzealous _bt_killitems assertion.
An assertion in _bt_killitems expected the scan's currPos state to
contain a valid LSN, saved from when currPos's page was initially read.
The assertion failed to account for the fact that even logged relations
can have leaf pages with an invalid LSN when built with wal_level set to
"minimal".  Remove the faulty assertion.

Oversight in commit e6eed40e (though note that the assertion was
backpatched to stable branches before 18 by commit 7c319f54).

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Matthijs van der Vleuten <postgresql@zr40.nl>
Bug: #19082
Discussion: https://postgr.es/m/19082-628e62160dbbc1c1@postgresql.org
Backpatch-through: 13
2025-10-10 14:52:25 -04:00
Michael Paquier
3a36543d7d Fix two typos in xlogstats.h and xlogstats.c
Issue found while browsing this area of the code, introduced and
copy-pasted around by 2258e76f90.

Backpatch-through: 15
2025-10-10 11:51:45 +09:00
Michael Paquier
912af1c7e9 Remove state.tmp when failing to save a replication slot
An error happening while a slot data is saved on disk in
SaveSlotToPath() could cause a state.tmp file (temporary file holding
the slot state data, renamed to its permanent name at the end of the
function) to remain around after it has been created.  This temporary
file is created with O_EXCL, meaning that if an existing state.tmp is
found, its creation would fail.  This would prevent the slot data to be
saved, requiring a manual intervention to remove state.tmp before being
able to save again a slot.  Possible scenarios where this temporary file
could remain on disk is for example a ENOSPC case (no disk space) while
writing, syncing or renaming it.  The bug reports point to a write
failure as the principal cause of the problems.

Using O_TRUNC has been argued back in 2019 as a potential solution to
discard any temporary file that could exist.  This solution was rejected
as O_EXCL can also act as a safety measure when saving the slot state,
crash recovery offering cleanup guarantees post-crash.  This commit uses
the alternative approach that has been suggested by Andres Freund back
in 2019.  When the temporary state file cannot be written, synced,
closed or renamed (note: not when created!), an unlink() is used to
remove the temporary state file while holding the in-progress I/O
LWLock, so as any follow-up attempts to save a slot's data would not
choke on an existing file that remained around because of a previous
failure.

This problem has been reported a few times across the years, going back
to 2019, but for some reason I have never come back to do something
about it and it has been forgotten.  A recent report has reminded me
that this was still a problem.

Reported-by: Kevin K Biju <kevinkbiju@gmail.com>
Reported-by: Sergei Kornilov <sk@zsrv.org>
Reported-by: Grigory Smolkin <g.smolkin@postgrespro.ru>
Discussion: https://postgr.es/m/CAM45KeHa32soKL_G8Vk38CWvTBeOOXcsxAPAs7Jt7yPRf2mbVA@mail.gmail.com
Discussion: https://postgr.es/m/3559061693910326@qy4q4a6esb2lebnz.sas.yp-c.yandex.net
Discussion: https://postgr.es/m/08bbfab1-a61d-3750-fc18-4ab2c1aa7f09@postgrespro.ru
Backpatch-through: 13
2025-10-10 09:23:59 +09:00
Andres Freund
c819d1017d bufmgr: Fix valgrind checking for buffers pinned in StrategyGetBuffer()
In 5e89985928 I made StrategyGetBuffer() pin buffers with a single CAS,
instead of using PinBuffer_Locked(). Unfortunately I missed that
PinBuffer_Locked() marked the page as defined for valgrind.

Fix this oversight by centralizing the valgrind initialization into
TrackNewBufferPin(), which also allows us to reduce the number of places doing
VALGRIND_MAKE_MEM_DEFINED.

Per buildfarm animal skink and Amit Langote.

Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/CA+HiwqGKJ6nEXEPQW7EpykVsEtzxp5-up_xhtcUAkWFtATVQvQ@mail.gmail.com
2025-10-09 19:17:13 -04:00
Melanie Plageman
d96f87332b Eliminate COPY FREEZE use of XLOG_HEAP2_VISIBLE
Instead of emitting a separate WAL XLOG_HEAP2_VISIBLE record for setting
bits in the VM, specify the VM block changes in the
XLOG_HEAP2_MULTI_INSERT record.

This halves the number of WAL records emitted by COPY FREEZE.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2025-10-09 16:29:01 -04:00
David Rowley
1b073cba49 Cleanup VACUUM option processing error messages
The processing of the PARALLEL option for VACUUM was not quite
following what the DefElem code had intended.  defGetInt32() already has
code to handle missing parameters and returns a perfectly good error
message for when that happens.

Here we get rid of the ExecVacuum() error:

ERROR: parallel option requires a value between 0 and N

and leave defGetInt32() handle it, which will give:

ERROR:  parallel requires an integer value

defGetInt32() was already handling the non-integer parameter case, so it
may as well handle the missing parameter case too.

Additionally, parameterize the option name to make translator work easier,
and also use errhint_internal() rather than errhint() for the
BUFFER_USAGE_LIMIT option since there isn't any work for a translator to
do for "%s".

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAApHDvovH14tNWB+WvP6TSbfi7-=TysQ9h5tQ5AgavwyWRWKHA@mail.gmail.com
2025-10-10 09:25:23 +13:00
Tom Lane
89d57c1fb3 Clean up memory leakage that occurs in context callback functions.
An error context callback function might leak some memory into
ErrorContext, since those functions are run with ErrorContext as
current context.  In the case where the elevel is ERROR, this is
no problem since the code level that catches the error should do
FlushErrorState to clean up, and that will reset ErrorContext.
However, if the elevel is less than ERROR then no such cleanup occurs.
In principle, repeated leaks while emitting log messages or client
notices could accumulate arbitrarily much leaked data, if no ERROR
occurs in the session.

To fix, let errfinish() perform an ErrorContext reset if it is
at the outermost error nesting level.  (If it isn't, we'll delay
cleanup until the outermost nesting level is exited.)

The only actual leakage of this sort that I've been able to observe
within our regression tests was recently introduced by commit
f727b63e8.  While it seems plausible that there are other such
leaks not reached in the regression tests, the lack of field
reports suggests that they're not a big problem.  Accordingly,
I won't take the risk of back-patching this now.  We can always
back-patch later if we get field reports of leaks.

Reported-by: Andres Freund <andres@anarazel.de>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/jngsjonyfscoont4tnwi2qoikatpd5hifsg373vmmjvugwiu6g@m6opxh7uisgd
2025-10-09 15:37:42 -04:00
Masahiko Sawada
b46efe9048 Fix access-to-already-freed-memory issue in pgoutput.
While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.

With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.

Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.

While the backbranches v13 and v14 have a similar issue where
RelationSyncCache persists even after an error when pgoutput is used
via SQL API, we decided not to backport this fix. This decision was
made because v13 is approaching its final minor release, and we won't
have an chance to fix any new issues that might arise. Additionally,
since using pgoutput via SQL API is not a common use case, the risk
outwights the benefit. If we receive bug reports, we can consider
backporting the fixes then.

Author: vignesh C <vignesh21@gmail.com>
Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch-through: 15
2025-10-09 10:59:27 -07:00
Tom Lane
71540dcdcb Avoid uninitialized-variable warnings from older compilers.
Some of the buildfarm is still unhappy with WinGetFuncArgInPartition
even after 2273fa32b.  While it seems to be just very old compilers,
we can suppress the warnings and arguably make the code more readable
by not initializing these variables till closer to where they are
used.  While at it, make a couple of cosmetic comment improvements.
2025-10-09 10:33:55 -04:00
Richard Guo
f997d777ad Remove unnecessary include of "utils/fmgroids.h"
In initsplan.c, no macros for built-in function OIDs are used, so this
include is unnecessary and can be removed.  This was my oversight in
commit 8e1185910.

Discussion: https://postgr.es/m/CAMbWs4_-sag-cAKrLJ+X+5njL1=oudk=+KfLmsLZ5a2jckn=kg@mail.gmail.com
2025-10-09 17:49:20 +09:00
Amit Kapila
96b3784973 Add "ALL SEQUENCES" support to publications.
This patch adds support for the ALL SEQUENCES clause in publications,
enabling synchronization/replication of all sequences that is useful for
upgrades.

Publications can now include all sequences via FOR ALL SEQUENCES.
psql enhancements:
\d shows publications for a given sequence.
\dRp indicates if a publication includes all sequences.

ALL SEQUENCES can be combined with ALL TABLES, but not with other options
like TABLE or TABLES IN SCHEMA. We can extend support for more granular
clauses in future.

The view pg_publication_sequences provides information about the mapping
between publications and sequences.

This patch enables publishing of sequences; subscriber-side support will
be added in upcoming patches.

Author: vignesh C <vignesh21@gmail.com>
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-10-09 03:48:54 +00:00
Amit Langote
ef5e60a9d3 Fix internal error from CollateExpr in SQL/JSON DEFAULT expressions
SQL/JSON functions such as JSON_VALUE could fail with "unrecognized
node type" errors when a DEFAULT clause contained an explicit COLLATE
expression. That happened because assign_collations_walker() could
invoke exprSetCollation() on a JsonBehavior expression whose DEFAULT
still contained a CollateExpr, which exprSetCollation() does not
handle.

For example:

  SELECT JSON_VALUE('{"a":1}', '$.c' RETURNING text
                    DEFAULT 'A' COLLATE "C" ON EMPTY);

Fix by validating in transformJsonBehavior() that the DEFAULT
expression's collation matches the enclosing JSON expression’s
collation. In exprSetCollation(), replace the recursive call on the
JsonBehavior expression with an assertion that its collation already
matches the target, since the parser now enforces that condition.

Reported-by: Jian He <jian.universality@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CACJufxHVwYYSyiVQ6o+PsRX6zQ7rAFinh_fv1kCfTsT1xG4Zeg@mail.gmail.com
Backpatch-through: 17
2025-10-09 01:07:59 -04:00
David Rowley
a5a68dd6d5 Make truncate_useless_pathkeys() consider WindowFuncs
truncate_useless_pathkeys() seems to have neglected to account for
PathKeys that might be useful for WindowClause evaluation.  Modify it so
that it properly accounts for that.

Making this work required adjusting two things:

1. Change from checking query_pathkeys to check sort_pathkeys instead.
2. Add explicit check for window_pathkeys

For #1, query_pathkeys gets set in standard_qp_callback() according to the
sort order requirements for the first operation to be applied after the
join planner is finished, so this changes depending on which upper
planner operations a particular query needs.  If the query has window
functions and no GROUP BY, then query_pathkeys gets set to
window_pathkeys.  Before this change, this meant PathKeys useful for the
ORDER BY were not accounted for in queries with window functions.

Because of #1, #2 is now required so that we explicitly check to ensure
we don't truncate away PathKeys useful for window functions.

Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com
2025-10-09 12:38:33 +13:00
Andres Freund
5e89985928 bufmgr: Don't lock buffer header in StrategyGetBuffer()
Previously StrategyGetBuffer() acquired the buffer header spinlock for every
buffer, whether it was reusable or not. If reusable, it'd be returned, with
the lock held, to GetVictimBuffer(), which then would pin the buffer with
PinBuffer_Locked(). That's somewhat violating the spirit of the guidelines for
holding spinlocks (i.e. that they are only held for a few lines of consecutive
code) and necessitates using PinBuffer_Locked(), which scales worse than
PinBuffer() due to holding the spinlock.  This alone makes it worth changing
the code.

However, the main reason to change this is that a future commit will make
PinBuffer_Locked() slower (due to making UnlockBufHdr() slower), to gain
scalability for the much more common case of pinning a pre-existing buffer. By
pinning the buffer with a single atomic operation, iff the buffer is reusable,
we avoid any potential regression for miss-heavy workloads. There strictly are
fewer atomic operations for each potential buffer after this change.

The price for this improvement is that freelist.c needs two CAS loops and
needs to be able to set up the resource accounting for pinned buffers. The
latter is achieved by exposing a new function for that purpose from bufmgr.c,
that seems better than exposing the entire private refcount infrastructure.
The improvement seems worth the complexity.

Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 17:04:07 -04:00
Andres Freund
3baae90013 bufmgr: fewer calls to BufferDescriptorGetContentLock
We're planning to merge buffer content locks into BufferDesc.state. To reduce
the size of that patch, centralize calls to BufferDescriptorGetContentLock().

The biggest part of the change is in assertions, by introducing
BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This
seems like an improvement even without aforementioned plans.

Additionally replace some direct calls to LWLockAcquire() with calls to
LockBuffer().

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 16:06:19 -04:00
Andres Freund
2a2e1b470b bufmgr: Fix signedness of mask variable in BufferSync()
BM_PERMANENT is defined as 1U<<31, which is a negative number when interpreted
as a signed integer. Unfortunately the mask variable in BufferSync() was
signed. This has been wrong for a long time, but failed to fail, due to
integer conversion rules.

However, in an upcoming patch the width of the state variable will be
increased, with the wrong signedness leading to never flushing permanent
buffers - luckily caught in a test.

It seems better to fix this separately, instead of doing so as part of a
large, otherwise mechanical, patch.

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 14:34:30 -04:00
Andres Freund
3c2b97b29e bufmgr: Introduce FlushUnlockedBuffer
There were several copies of code locking a buffer, flushing its contents, and
unlocking the buffer. It seems worth centralizing that into a helper function.

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 14:34:30 -04:00
Andres Freund
819dc118c0 Improve ReadRecentBuffer() scalability
While testing a new potential use for ReadRecentBuffer(), Andres
reported that it scales badly when called concurrently for the same
buffer by many backends.  Instead of a naive (but wrong) coding with
PinBuffer(), it used the spinlock, so that it could be careful to pin
only if the buffer was valid and holding the expected block, to avoid
breaking invariants in eg GetVictimBuffer().  Unfortunately that made it
less scalable than PinBuffer(), which uses compare-exchange instead.

We can fix that by giving PinBuffer() a new skip_if_not_valid mode that
doesn't pin invalid buffers.  It might occasionally skip when it
shouldn't due to the unlocked read of the header flags, but that's
unlikely and perfectly acceptable for an opportunistic optimisation
routine, and it can only succeed when it really should due to the
compare-exchange loop.

Note that this fixes ReadRecentBuffer()'s failure to bump the usage
count. While this could be seen as a bug, there currently aren't cases
affected by this in core, so it doesn't seem worth backpatching that portion.

Author: Thomas Munro <thomas.munro@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 13:10:40 -04:00
Masahiko Sawada
d3b6183dd9 Add mem_exceeded_count column to pg_stat_replication_slots.
This commit introduces a new column mem_exceeded_count to the
pg_stat_replication_slots view. This counter tracks how often the
memory used by logical decoding exceeds the logical_decoding_work_mem
limit. The new statistic helps users determine whether exceeding the
logical_decoding_work_mem limit is a rare occurrences or a frequent
issue, information that wasn't available through existing statistics.

Bumps catversion.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/978D21E8-9D3B-40EA-A4B1-F87BABE7868C@yesql.se
2025-10-08 10:05:04 -07:00
Robert Haas
94f3ad3961 Add planner_setup_hook and planner_shutdown_hook.
These hooks allow plugins to get control at the earliest point at
which the PlannerGlobal object is fully initialized, and then just
before it gets destroyed. This is useful in combination with the
extendable plan state facilities (see extendplan.h) and perhaps for
other purposes as well.

Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com
2025-10-08 09:05:38 -04:00
Robert Haas
c83ac02ec7 Add ExplainState argument to pg_plan_query() and planner().
This allows extensions to have access to any data they've stored
in the ExplainState during planning. Unfortunately, it won't help
with EXPLAIN EXECUTE is used, but since that case is less common,
this still seems like an improvement.

Since planner() has quite a few arguments now, also add some
documentation of those arguments and the return value.

Author: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com
2025-10-08 08:33:29 -04:00
Richard Guo
8e11859102 Implement Eager Aggregation
Eager aggregation is a query optimization technique that partially
pushes aggregation past a join, and finalizes it once all the
relations are joined.  Eager aggregation may reduce the number of
input rows to the join and thus could result in a better overall plan.

In the current planner architecture, the separation between the
scan/join planning phase and the post-scan/join phase means that
aggregation steps are not visible when constructing the join tree,
limiting the planner's ability to exploit aggregation-aware
optimizations.  To implement eager aggregation, we collect information
about aggregate functions in the targetlist and HAVING clause, along
with grouping expressions from the GROUP BY clause, and store it in
the PlannerInfo node.  During the scan/join planning phase, this
information is used to evaluate each base or join relation to
determine whether eager aggregation can be applied.  If applicable, we
create a separate RelOptInfo, referred to as a grouped relation, to
represent the partially-aggregated version of the relation and
generate grouped paths for it.

Grouped relation paths can be generated in two ways.  The first method
involves adding sorted and hashed partial aggregation paths on top of
the non-grouped paths.  To limit planning time, we only consider the
cheapest or suitably-sorted non-grouped paths in this step.
Alternatively, grouped paths can be generated by joining a grouped
relation with a non-grouped relation.  Joining two grouped relations
is currently not supported.

To further limit planning time, we currently adopt a strategy where
partial aggregation is pushed only to the lowest feasible level in the
join tree where it provides a significant reduction in row count.
This strategy also helps ensure that all grouped paths for the same
grouped relation produce the same set of rows, which is important to
support a fundamental assumption of the planner.

For the partial aggregation that is pushed down to a non-aggregated
relation, we need to consider all expressions from this relation that
are involved in upper join clauses and include them in the grouping
keys, using compatible operators.  This is essential to ensure that an
aggregated row from the partial aggregation matches the other side of
the join if and only if each row in the partial group does.  This
ensures that all rows within the same partial group share the same
"destiny", which is crucial for maintaining correctness.

One restriction is that we cannot push partial aggregation down to a
relation that is in the nullable side of an outer join, because the
NULL-extended rows produced by the outer join would not be available
when we perform the partial aggregation, while with a
non-eager-aggregation plan these rows are available for the top-level
aggregation.  Pushing partial aggregation in this case may result in
the rows being grouped differently than expected, or produce incorrect
values from the aggregate functions.

If we have generated a grouped relation for the topmost join relation,
we finalize its paths at the end.  The final paths will compete in the
usual way with paths built from regular planning.

The patch was originally proposed by Antonin Houska in 2017.  This
commit reworks various important aspects and rewrites most of the
current code.  However, the original patch and reviews were very
useful.

Author: Richard Guo <guofenglinux@gmail.com>
Author: Antonin Houska <ah@cybertec.at> (in an older version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me> (in an older version)
Reviewed-by: Andy Fan <zhihuifan1213@163.com> (in an older version)
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> (in an older version)
Discussion: https://postgr.es/m/CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com
2025-10-08 17:04:23 +09:00
Michael Paquier
138da727a1 Improve description of some WAL records for GIN
The following information is added in the description of some GIN
records:
- In INSERT_LISTPAGE, the number of tuples and the right link block.
- In UPDATE_META_PAGE, the number of tuples, the previous tail block,
and the right link block.
- In SPLIT, the left and right children blocks.

Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgnAt5L=D_xGXRXLYO5FK1H31_eYEESxdU1n-r4g+6GqA@mail.gmail.com
2025-10-08 14:02:26 +09:00
Michael Paquier
b71bae41a0 Add stats_reset to pg_stat_user_functions
It is possible to call pg_stat_reset_single_function_counters() for a
single function, but the reset time was missing the system view showing
its statistics.  Like all the fields of pg_stat_user_functions, the GUC
track_functions needs to be enabled to show the statistics about
function executions.

Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, as a result of the new field added to
PgStat_StatFuncEntry.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aONjnsaJSx-nEdfU@paquier.xyz
2025-10-08 12:43:40 +09:00
Amit Kapila
035b09131d Fix typo in function header comment.
Reported-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA+TgmoZYh_nw-2j_Fi9y6ZAvrpN+W1aSOFNM7Rus2Q-zTkCsQw@mail.gmail.com
2025-10-08 03:17:05 +00:00
Tatsuo Ishii
2273fa32bc Fix Coverity issues reported in commit 25a30bbd42.
Fix several issues pointed out by Coverity (reported by Tome Lane).

- In row_is_in_frame(), return value of window_gettupleslot() was not
  checked.

- WinGetFuncArgInPartition() tried to derefference "isout" pointer
  even if it could be NULL in some places.

Besides the issues, I also fixed a compiler warning reported by Álvaro
Herrera.

Moreover, in WinGetFuncArgInPartition refactor the do...while loop so
that the codes inside the loop simpler. Also simplify the case when
abs_pos < 0.

Author: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Paul Ramsey <pramsey@cleverelephant.ca>
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/1686755.1759679957%40sss.pgh.pa.us
Discussion: https://postgr.es/m/202510051612.gw67jlc2iqpw%40alvherre.pgsql
2025-10-08 09:26:49 +09:00
Robert Haas
64095d1574 Remove PlannerInfo's join_search_private method.
Instead, use the new mechanism that allows planner extensions to store
private state inside a PlannerInfo, treating GEQO as an in-core planner
extension.  This is a useful test of the new facility, and also buys
back a few bytes of storage.

To make this work, we must remove innerrel_is_unique_ext's hack of
testing whether join_search_private is set as a proxy for whether
the join search might be retried. Add a flag that extensions can
use to explicitly signal their intentions instead.

Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com
2025-10-07 12:43:45 -04:00
Robert Haas
0132dddab3 Allow private state in certain planner data structures.
Extension that make extensive use of planner hooks may want to
coordinate their efforts, for example to avoid duplicate computation,
but that's currently difficult because there's no really good way to
pass data between different hooks.

To make that easier, allow for storage of extension-managed private
state in PlannerGlobal, PlannerInfo, and RelOptInfo, along very
similar lines to what we have permitted for ExplainState since commit
c65bc2e1d1.

Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoYWKHU2hKr62Toyzh-kTDEnMDeLw7gkOOnjL-TnOUq0kQ@mail.gmail.com
2025-10-07 12:09:30 -04:00
Robert Haas
8c49a484e8 Assign each subquery a unique name prior to planning it.
Previously, subqueries were given names only after they were planned,
which makes it difficult to use information from a previous execution of
the query to guide future planning. If, for example, you knew something
about how you want "InitPlan 2" to be planned, you won't know whether
the subquery you're currently planning will end up being "InitPlan 2"
until after you've finished planning it, by which point it's too late to
use the information that you had.

To fix this, assign each subplan a unique name before we begin planning
it. To improve consistency, use textual names for all subplans, rather
than, as we did previously, a mix of numbers (such as "InitPlan 1") and
names (such as "CTE foo"), and make sure that the same name is never
assigned more than once.

We adopt the somewhat arbitrary convention of using the type of sublink
to set the plan name; for example, a query that previously had two
expression sublinks shown as InitPlan 2 and InitPlan 1 will now end up
named expr_1 and expr_2. Because names are assigned before rather than
after planning, some of the regression test outputs show the numerical
part of the name switching positions: what was previously SubPlan 2 was
actually the first one encountered, but we finished planning it later.

We assign names even to subqueries that aren't shown as such within the
EXPLAIN output. These include subqueries that are a FROM clause item or
a branch of a set operation, rather than something that will be turned
into an InitPlan or SubPlan. The purpose of this is to make sure that,
below the topmost query level, there's always a name for each subquery
that is stable from one planning cycle to the next (assuming no changes
to the query or the database schema).

Author: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: http://postgr.es/m/3641043.1758751399@sss.pgh.pa.us
2025-10-07 09:18:54 -04:00
David Rowley
9c9d41af4d Teach planner to short-circuit EXCEPT/INTERSECT with dummy inputs
When either inputs of an INTERSECT [ALL] operator are proven not to return
any results (a dummy rel), then mark the entire INTERSECT operation as
dummy.

Likewise, if an EXCEPT [ALL] operation's left input is proven empty, then
mark the entire operation as dummy.

With EXCEPT ALL, we can easily handle the right input being dummy as
we can return the left input without any processing.  That can lead to
significant performance gains during query execution.  We can't easily
handle dummy right inputs for EXCEPT (without ALL), as that would require
deduplication of the left input.  Wiring up those Paths is likely more
complex than it's worth as the gains during execution aren't that great,
so let's leave that one to be handled by the normal Path generation code.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAApHDvri53PPF76c3M94_QNWbJfXjyCnjXuj_2=LYM-0m8WZtw@mail.gmail.com
2025-10-07 17:17:52 +13:00
David Rowley
928df067d1 Fix incorrect targetlist in dummy UNIONs
The prior code, added in 03d40e4b5 attempted to use the targetlist of the
first UNION child when all UNION children were proven as dummy rels.
That's not going to work when some operation atop of the Result node must
find target entries within the Result's targetlist.  This could have been
something as simple as trying to sort the results of the UNION operation,
which would lead to:

ERROR:  could not find pathkey item to sort

Instead, use the top-level UNION's targetlist and fix the varnos in
setrefs.c.  Because set operation targetlists always use varno==0, we
can rewrite those to become varno==1, i.e. use the Vars from the first
UNION child.  This does result in showing Vars from relations that are
not present in the final plan, but that's no different to what we see
when normal base relations are proven dummy.

Without this fix it would be possible to see the following error in
EXPLAIN VERBOSE when all UNION inputs were proven empty.

ERROR:  bogus varno: 0

Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrUASy9sfULMEsM2udvZJP6AoBRCZvHYXYxZTy2tX9FYw@mail.gmail.com
2025-10-07 14:15:04 +13:00
Masahiko Sawada
771cfe22a0 Avoid unnecessary GinFormTuple() calls for incompressible posting lists.
Previously, we attempted to form a posting list tuple even when
ginCompressPostingList() failed to compress the posting list due to
its size. While there was no functional failure, it always wasted one
GinFormTuple() call when item pointers didn't fit in a posting list
tuple.

This commit ensures that a GIN index tuple is formed only when all
item pointers in the posting list are successfully compressed.

Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAE7r3M+C=jcpTD93f_RBHrQp3C+=TAXFs+k4tTuZuuxboK8AvA@mail.gmail.com
2025-10-06 14:02:01 -07:00
Nathan Bossart
ec8719ccbf Optimize hex_encode() and hex_decode() using SIMD.
The hex_encode() and hex_decode() functions serve as the workhorses
for hexadecimal data for bytea's text format conversion functions,
and some workloads are sensitive to their performance.  This commit
adds new implementations that use routines from port/simd.h, which
testing indicates are much faster for larger inputs.  For small or
invalid inputs, we fall back on the existing scalar versions.
Since we are using port/simd.h, these optimizations apply to both
x86-64 and AArch64.

Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Co-authored-by: Susmitha Devanga <devanga.susmitha@fujitsu.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/aLhVWTRy0QPbW2tl%40nathan
2025-10-06 12:28:50 -05:00
Amit Kapila
b93172ca59 Expose sequence page LSN via pg_get_sequence_data.
This patch enhances the pg_get_sequence_data function to include the
page-level LSN (Log Sequence Number) of the sequence. This additional
metadata will be used by upcoming patches to support synchronization
of sequences during logical replication.

By exposing the LSN, we enable more accurate tracking of sequence
changes, which is essential for maintaining consistency across
replicated nodes.

Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
2025-10-06 08:30:16 +00:00
Michael Paquier
7072a8855e Remove block information from description of some WAL records for GIN
The WAL records XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE
included some information about the blocks added to the record.

This information is already provided by XLogRecGetBlockRefInfo() with
much more details about the blocks included in each record, like the
compression information, for example.  This commit removes the block
information that existed in the record descriptions specific to GIN.

Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgk=9WRoXhZy5fdk+T1hiau7qbL_vn94w_L1N=gtEdbsg@mail.gmail.com
2025-10-06 16:14:59 +09:00
Michael Paquier
a5b543258a Add stats_reset to pg_stat_all_{tables,indexes} and related views
It is possible to call pg_stat_reset_single_table_counters() on a
relation (index or table) but the reset time was missing from the system
views showing their statistics.

This commit adds the reset time as an attribute of pg_stat_all_tables,
pg_stat_all_indexes, and other relations related to them.

Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, as a result of the new field added to
PgStat_StatTabEntry.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aN8l182jKxEq1h9f@paquier.xyz
2025-10-06 15:31:21 +09:00
Michael Paquier
0c7f103028 Fix two comments in numeric.c
The comments at the top of numeric_int4_safe() and numeric_int8_safe()
mentioned respectively int4_numeric() and int8_numeric().  The intention
is to refer to numeric_int4() and numeric_int8().

Oversights in 4246a977ba.

Reported-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxFfVt7Jx9_j=juxXyP-6tznN8OcvS9E-QSgp0BrD8KUgA@mail.gmail.com
2025-10-06 11:18:30 +09:00
Álvaro Herrera
1a8b5b11e4
Don't include access/htup_details.h in executor/tuptable.h
This is not at all needed; I suspect it was a simple mistake in commit
5408e233f0.  It causes htup_details.h to bleed into a huge number of
places via execnodes.h.  Remove it and fix fallout.

Discussion: https://postgr.es/m/202510021240.ptc2zl5cvwen@alvherre.pgsql
2025-10-05 18:00:38 +02:00
Álvaro Herrera
1b6f61bd89
Don't include execnodes.h in brin.h or gin.h
These headers don't need execnodes.h for anything.  I think they never
have.

Discussion: https://postgr.es/m/202510021240.ptc2zl5cvwen@alvherre.pgsql
2025-10-05 17:35:25 +02:00
David Rowley
03d40e4b52 Teach UNION planner to remove dummy inputs
This adjusts UNION planning so that the planner produces more optimal
plans when one or more of the UNION's subqueries have been proven to be
empty (a dummy rel).

If any of the inputs are empty, then that input can be removed from the
Append / MergeAppend.  Previously, a const-false "Result" node would
appear to represent this.  Removing empty inputs has a few extra
benefits when only 1 union child remains as it means the Append or
MergeAppend can be removed in setrefs.c, making the plan slightly faster
to execute.  Also, we can provide better n_distinct estimates by looking
at the sole remaining input rel's statistics.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAApHDvri53PPF76c3M94_QNWbJfXjyCnjXuj_2=LYM-0m8WZtw@mail.gmail.com
2025-10-04 14:30:03 +13:00
David Rowley
5092aae431 Use bms_add_members() instead of bms_union() when possible
bms_union() causes a new set to be allocated.  What this caller needs is
members added to an existing set.  bms_add_members() is the tool for
that job.

This is just a matter of fixing an inefficiency due to surplus memory
allocations.  No bugs being fixed.

The only other place I found that might be valid to apply this change is
in markNullableIfNeeded(), but I opted not to do that due to the risk to
reward ratio not looking favorable.  The risk being that there *could* be
another pointer pointing to the Bitmapset.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAApHDvoCcoS-p5tZNJLTxFOKTYNjqVh7Dwf+5ikDUBwnvWftRw@mail.gmail.com
2025-10-04 12:19:31 +13:00
Nathan Bossart
74b41f5a77 Make some use of anonymous unions [DSM registry].
Make some use of anonymous unions, which are allowed as of C11, as
examples and encouragement for future code, and to test compilers.

This commit changes the DSMRegistryEntry struct.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/aNKsDg0fJwqhZdXX%40nathan
2025-10-03 10:14:33 -05:00
John Naylor
54ab748651 Fix reuse-after-free hazard in dead_items_reset
In similar vein to commit ccc8194e42, a reset instance of a shared
memory TID store happened to occupy the same private memory as the old
one for the entry point, since the chunk freed after the last round
of index vacuuming was put on the context's freelist. The failure
to update the vacrel->dead_items pointer was evident by nudging the
system to allocate memory in a different area. This was not discovered
at the time of the earlier commit since our regression tests didn't
cover multiple index passes with parallel vacuum.

Backpatch to v17, when TidStore came in.

Author: Kevin Oommen Anish <kevin.o@zohocorp.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Tested-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/199a07cbdfc.7a1c4aac25838.1675074408277594551%40zohocorp.com
Backpatch-through: 17
2025-10-03 16:05:02 +07:00
Richard Guo
605bfb7dbe Fix incorrect function reference in comment
The comment incorrectly references the defunct function
BufFileOpenShared(), which was replaced in commit dcac5e7ac.

This patch updates the comment to refer to the current function
BufFileOpenFileSet().

Author: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/1cb48b4c-54ab-40cc-b355-0b3c2af6d3f7@Spark
2025-10-03 16:34:42 +09:00
Tatsuo Ishii
25a30bbd42 Add IGNORE NULLS/RESPECT NULLS option to Window functions.
Add IGNORE NULLS/RESPECT NULLS option (null treatment clause) to lead,
lag, first_value, last_value and nth_value window functions.  If
unspecified, the default is RESPECT NULLS which includes NULL values
in any result calculation. IGNORE NULLS ignores NULL values.

Built-in window functions are modified to call new API
WinCheckAndInitializeNullTreatment() to indicate whether they accept
IGNORE NULLS/RESPECT NULLS option or not (the API can be called by
user defined window functions as well).  If WinGetFuncArgInPartition's
allowNullTreatment argument is true and IGNORE NULLS option is given,
WinGetFuncArgInPartition() or WinGetFuncArgInFrame() will return
evaluated function's argument expression on specified non NULL row (if
it exists) in the partition or the frame.

When IGNORE NULLS option is given, window functions need to visit and
evaluate same rows over and over again to look for non null rows. To
mitigate the issue, 2-bit not null information array is created while
executing window functions to remember whether the row has been
already evaluated to NULL or NOT NULL. If already evaluated, we could
skip the evaluation work, thus we could get better performance.

Author: Oliver Ford <ojford@gmail.com>
Co-authored-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Krasiyan Andreev <krasiyan@gmail.com>
Reviewed-by: Andrew Gierth <andrew@tao11.riddles.org.uk>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Fetter <david@fetter.org>
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/flat/CAGMVOdsbtRwE_4+v8zjH1d9xfovDeQAGLkP_B6k69_VoFEgX-A@mail.gmail.com
2025-10-03 09:47:36 +09:00
Michael Paquier
3f431109dc Remove useless pointer update in ginxlog.c
Oversight in 2c03216d83, when the redo code of GIN got refactored for
the new WAL format where block information has been standardized, as the
payload data got tracked for each block after the change, and not in the
whole record.  This is just a cleanup.

Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgnAt5L=D_xGXRXLYO5FK1H31_eYEESxdU1n-r4g+6GqA@mail.gmail.com
2025-10-02 17:16:20 +09:00
John Naylor
48566180ef Generate EUC_CN mappings from gb18030-2022.ucm
In the wake of cfa6cd292, EUC_CN was the only encoding that used
gb-18030-2000.xml to generate the .map files. Since EUC_CN is a subset
of GB18030, we can easily use the same UCM file. This allows deleting
the XML file from our repository.

Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/CANWCAZaNRXZ-5NuXmsaMA2mKvMZnCGHZqQusLkpE%2B8YX%2Bi5OYg%40mail.gmail.com
2025-10-02 12:36:24 +07:00
David Rowley
91df0465a6 Fix typo in pgstat_relation.c header comment
Looks like a copy and paste error from pgstat_function.c

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aNuaVMAdTGbgBgqh@ip-10-97-1-34.eu-west-3.compute.internal
2025-10-01 00:23:38 +13:00
Peter Eisentraut
57d46dff9b Make some use of anonymous unions [reorderbuffer xact_time]
Make some use of anonymous unions, which are allowed as of C11, as
examples and encouragement for future code, and to test compilers.

This commit changes the ReorderBufferTXN struct.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/f00a9968-388e-4f8c-b5ef-5102e962d997%40eisentraut.org
2025-09-30 12:35:50 +02:00
Peter Eisentraut
4b7e6c73b0 Make some use of anonymous unions [pg_locale_t]
Make some use of anonymous unions, which are allowed as of C11, as
examples and encouragement for future code, and to test compilers.

This commit changes the pg_locale_t type.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/f00a9968-388e-4f8c-b5ef-5102e962d997%40eisentraut.org
2025-09-30 12:35:50 +02:00
Álvaro Herrera
3bf31dd243
Do a tiny bit of header file maintenance
Stop including utils/relcache.h in access/genam.h, and stop including
htup_details.h in nodes/tidbitmap.h.  Both these files (genam.h and
tidbitmap.h) are widely used in other header files, so it's in our best
interest that they remain as lean as reasonable.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/202509291356.o5t6ny2hoa3q@alvherre.pgsql
2025-09-30 12:28:29 +02:00
Michael Paquier
bb68cde413 Reorder XLogNeedsFlush() checks to be more consistent
During recovery, XLogNeedsFlush() checks the minimum recovery LSN point
instead of the flush LSN point.  The same condition checks are used when
updating the minimum recovery point in UpdateMinRecoveryPoint(), but are
written in reverse order.

This commit makes the order of the checks consistent between
XLogNeedsFlush() and UpdateMinRecoveryPoint(), improving the code
clarity.  Note that the second check (as ordered by this commit) relies
on InRecovery, which is true only in the startup process.  So this makes
XLogNeedsFlush() cheaper in the startup process with the first check
acting as a shortcut while doing crash recovery, where
LocalMinRecoveryPoint is an invalid LSN.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/aMIHNRTP6Wj6vw1s%40paquier.xyz
2025-09-30 09:38:32 +09:00
Tom Lane
ef38a4d975 Add GROUP BY ALL.
GROUP BY ALL is a form of GROUP BY that adds any TargetExpr that does
not contain an aggregate or window function into the groupClause of
the query, making it exactly equivalent to specifying those same
expressions in an explicit GROUP BY list.

This feature is useful for certain kinds of data exploration.  It's
already present in some other DBMSes, and the SQL committee recently
accepted it into the standard, so we can be reasonably confident in
the syntax being stable.  We do have to invent part of the semantics,
as the standard doesn't allow for expressions in GROUP BY, so they
haven't specified what to do with window functions.  We assume that
those should be treated like aggregates, i.e., left out of the
constructed GROUP BY list.

In passing, wordsmith some existing documentation about GROUP BY,
and update some neglected synopsis entries in select_into.sgml.

Author: David Christensen <david@pgguru.net>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAHM0NXjz0kDwtzoe-fnHAqPB1qA8_VJN0XAmCgUZ+iPnvP5LbA@mail.gmail.com
2025-09-29 16:55:17 -04:00
David Rowley
b91067c899 Remove unused parameter from find_window_run_conditions()
... and check_and_push_window_quals().

Similar to 4be9024d5, but it seems there was yet another unused
parameter.

Author: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/DD5BEKORUG34.2M8492NMB9DB8@gmail.com
2025-09-30 08:37:42 +13:00
Noah Misch
a95393ecdb Fix StatisticsObjIsVisibleExt() for pg_temp.
Neighbor get_statistics_object_oid() ignores objects in pg_temp, as has
been the standard for non-relation, non-type namespace searches since
CVE-2007-2138.  Hence, most operations that name a statistics object
correctly decline to map an unqualified name to a statistics object in
pg_temp.  StatisticsObjIsVisibleExt() did not.  Consequently,
pg_statistics_obj_is_visible() wrongly returned true for such objects,
psql \dX wrongly listed them, and getObjectDescription()-based ereport()
and pg_describe_object() wrongly omitted namespace qualification.  Any
malfunction beyond that would depend on how a human or application acts
on those wrong indications.  Commit
d99d58cdc8 introduced this.  Back-patch to
v13 (all supported versions).

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/20250920162116.2e.nmisch@google.com
Backpatch-through: 13
2025-09-29 11:15:44 -07:00
David Rowley
2cb49c609b Improve planner's width estimates for set operations
For UNION, EXCEPT and INTERSECT, we were not very good at estimating the
PathTarget.width for the set operation.  Since the targetlist of the set
operation is made up of Vars with varno==0, this would result in
get_expr_width() applying a default estimate based on the Var's type
rather than taking width estimates from any relation's statistics.

Here we attempt to improve the situation by looking at the width estimates
for the set operation child paths and calculating the average width of the
relevant child paths weighted over the estimated number of rows.  For
UNION and INTERSECT, the relevant paths to look at are *all* child paths.
For EXCEPT, since we don't return rows from the right-hand child (only
possibly remove left-hand rows matching those), we use only the left-hand
child for width estimates.

This also adjusts the hashed-UNION Path's PathTarget to use the same
PathTarget as its Append subpath.  Both PathTargets will be the same and
are void of any resjunk columns, per generate_append_tlist().  Making
the AggPath use the same PathTarget saves having to adjust the "width"
of the AggPath's PathTarget too.

This was reported as a bug by sunw.fnst, but it's not something we ever
claimed to do properly.  Plus, if we were to adjust this in back
branches, plans could change as the estimated input sizes to Sorts and
Hash Aggregates could go up or down.  Plan choices aren't something we
want to destabilize in stable versions.

Reported-by: sunw.fnst <936739278@qq.com>
Author: David Rowley <drowleyml@gmail.com>
Discussion: https://postgr.es/m/tencent_34CF8017AB81944A4C08DD089D410AB6C306@qq.com
2025-09-29 14:36:39 +13:00
Michael Paquier
7bd2975fa9 Add support for tracking of entry count in pgstats
Stats kinds can set a new option called "track_entry_count" (disabled by
default, available for variable-numbered stats) that will make pgstats
track the number of entries that exist in its shared hashtable.

As there is only one code path where a new entry is added, and one code
path where entries are freed, the count tracking is straight-forward in
its implementation.  Reads of these counters are optimistic, and may
change across two calls.  The counter is incremented when an entry is
created (not when reused), and is decremented when an entry is freed
from the hashtable (marked for drop with its refcount reaching 0), which
is something that pgstats decides internally.

A first use case of this facility would be pg_stat_statements, where we
need to be able to cap the number of entries that would be stored in the
shared hashtable, based on its "max" GUC.  The module currently relies
on hash_get_num_entries(), which offers a cheap way to count how many
entries are in its hash table, but we cannot do that in pgstats for
variable-sized stats kinds as a single hashtable is used for all the
stats kinds.  Independently of PGSS, this is useful for other custom
stats kinds that want to cap, control, or track the number of entries
they have, without depending on a potentially expensive sequential scan
to know the number of entries while holding an extra exclusive lock.

Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Keisuke Kuroda <keisuke.kuroda.3862@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aMPKWR81KT5UXvEr@paquier.xyz
2025-09-29 08:57:57 +09:00
Tom Lane
b0fb2c6aa5 Refactor to avoid code duplication in transformPLAssignStmt.
transformPLAssignStmt contained many lines cribbed directly from
transformSelectStmt.  I had supposed that we could manage to keep
the two copies in sync, but the bug just fixed in 7504d2be9 shows
that that hope was foolish.  Let's refactor so there's just one copy.

The main stumbling block to doing this is that transformPLAssignStmt
has a chunk of custom code that has to run after transformTargetList
but before we potentially modify the tlist further during analysis
of ORDER BY and GROUP BY.  Rather than make transformSelectStmt fully
aware of PLAssignStmt processing, I put that code into a callback
function.  It still feels a little bit ugly, but it's not too awful,
and surely it's better than a hundred lines of duplicated code.
The steps involved in processing a PLAssignStmt remain exactly
the same as before, just in different places.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/31027.1758919078@sss.pgh.pa.us
2025-09-27 17:17:51 -04:00
Tom Lane
7504d2be9e Fix missed copying of groupDistinct in transformPLAssignStmt.
Because we failed to do this, DISTINCT in GROUP BY DISTINCT would be
ignored in PL/pgSQL assignment statements.  It's not surprising that
no one noticed, since such statements will throw an error if the query
produces more than one row.  That eliminates most scenarios where
advanced forms of GROUP BY could be useful, and indeed makes it hard
even to find a simple test case.  Nonetheless it's wrong.

This is directly the fault of be45be9c3 which added the groupDistinct
field, but I think much of the blame has to fall on c9d529848, in
which I incautiously supposed that we'd manage to keep two copies of
a big chunk of parse-analysis logic in sync.  As a follow-up, I plan
to refactor so that there's only one copy.  But that seems useful
only in master, so let's use this one-line fix for the back branches.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/31027.1758919078@sss.pgh.pa.us
Backpatch-through: 14
2025-09-27 14:29:41 -04:00
Masahiko Sawada
66cdef4425 Remove unused for_all_tables field from AlterPublicationStmt.
No backpatch as AlterPublicationStmt struct is exposed.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAD21AoC6B6AuxWOST-TkxUbDgp8FwX=BLEJZmKLG_VrH-hfxpA@mail.gmail.com
2025-09-26 09:23:00 -07:00
Álvaro Herrera
dbf8cfb4f0
Create a separate file listing backend types
Use our established coding pattern to reduce maintenance pain when
adding other per-process-type characteristics.

Like PG_KEYWORD, PG_CMDTAG, PG_RMGR.

To keep the strings translatable, the relevant makefile now also scans
src/include for this specific file.  I didn't want to have it scan all
.h files, as then gettext would have to scan all header files.  I didn't
find any way to affect the meson behavior in this respect though.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Co-authored-by: Jonathan Gonzalez V. <jonathan.abdiel@gmail.com>
Discussion: https://postgr.es/m/202507151830.dwgz5nmmqtdy@alvherre.pgsql
2025-09-26 15:21:49 +02:00
Michael Paquier
85e0ff62b6 Improve stability of btree page split on ERRORs
This improves the stability of VACUUM when processing btree indexes,
which was previously able to trigger an assertion failure in
_bt_lock_subtree_parent() when an error was previously thrown outside
the scope of _bt_split() when splitting a btree page.  VACUUM would
consider the index as in a corrupted state as the right page would not
be zeroed for the error thrown (allocation failure is one pattern).

In a non-assert build, VACUUM is able to succeed, reporting what it sees
as a corruption while attempting to fix the index.  This would manifest
as a LOG message, as of:
LOG: failed to re-find parent key in index "idx" for deletion target
page N
CONTEXT:  while vacuuming index "idx" of relation "public.tab"

This commit improves the code to rely on two PGAlignedBlocks that are
used as a temporary space for the left and right pages.  The main change
concerns the right page, whose contents are now copied into the
"temporary" PGAlignedBlock page while its original space is zeroed.  Its
contents are moved from the PGAlignedBlock page back to the page once we
enter in the critical section used for the split.  This simplifies the
split logic, as it is not necessary to zero the right page before
throwing an error anymore.  Hence errors can now be thrown outside the
split code.  For the left page, this shaves one allocation, with
PageGetTempPage() being previously used.

The previous logic originates from commit 8fa30f906b, at a point where
PGAlignedBlock did not exist yet.  This could be argued as something
that should be backpatched, but the lack of complaints indicates that it
may not be necessary.

Author: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/566dacaf-5751-47e4-abc6-73de17a5d42a@garret.ru
2025-09-26 08:41:06 +09:00
David Rowley
3760d278dc Fix misleading comment in pg_get_statisticsobjdef_string()
The comment claimed that a TABLESPACE reference was added to the
resulting string, but that's not true.  Looks like the comment was
copied from pg_get_indexdef_string() without being adjusted correctly.

Reported-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxHwVPgeu8o9D8oUeDQYEHTAZGt-J5uaJNgYMzkAW7MiCA@mail.gmail.com
2025-09-26 11:04:15 +12:00
David Rowley
4be9024d57 Remove unused parameter from check_and_push_window_quals
... and find_window_run_conditions.

This seems to have been around and unused ever since the Run Condition
feature was added in 9d9c02ccd.  Let's remove it to clean things up a
bit.

Author: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/DD26NJ0Y34ZS.2ZOJPHSY12PFI@gmail.com
2025-09-26 10:21:30 +12:00
Tom Lane
02c4bc8830 Try to avoid floating-point roundoff error in pg_sleep().
I noticed the surprising behavior that pg_sleep(0.001) will sleep
for 2ms not the expected 1ms.  Apparently the float8 calculation of
time-to-sleep is managing to produce something a hair over 1, which
ceil() rounds up to 2, and then WaitLatch() faithfully waits 2ms.
It could be that this works as-expected for some ranges of current
timestamp but not others, which would account for not having seen
it before.  In any case, let's try to avoid it by removing the
float arithmetic in the delay calculation.  We're stuck with the
declared input type being float8, but we can convert that to integer
microseconds right away, and then work strictly with integral values.
There might still be roundoff surprises for certain input values,
but at least the behavior won't be time-varying.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/3879137.1758825752@sss.pgh.pa.us
2025-09-25 17:02:15 -04:00
Robert Haas
803ef0ed49 Fix array allocation bugs in SetExplainExtensionState.
If we already have an extension_state array but see a new extension_id
much larger than the highest the extension_id we've previously seen,
the old code might have failed to expand the array to a large enough
size, leading to disaster. Also, if we don't have an extension array
at all and need to create one, we should make sure that it's big enough
that we don't have to resize it instantly.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/2949591.1758570711@sss.pgh.pa.us
Backpatch-through: 18
2025-09-25 11:43:52 -04:00
Daniel Gustafsson
0b3ce7878a Remove preprocessor guards from injection points
When defining an injection point there is no need to wrap the definition
with USE_INJECTION_POINT guards, the INJECTION_POINT macro is available
in all builds.  Remove to make the code consistent.

Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/OSCPR01MB14966C8015DEB05ABEF2CE077F51FA@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 17
2025-09-25 15:27:33 +02:00
Álvaro Herrera
7e638d7f50
Don't include execnodes.h in replication/conflict.h
... which silently propagates a lot of headers into many places
via pgstat.h, as evidenced by the variety of headers that this patch
needs to add to seemingly random places.  Add a minimum of typedefs to
conflict.h to be able to remove execnodes.h, and fix the fallout.

Backpatch to 18, where conflict.h first appeared.

Discussion: https://postgr.es/m/202509191927.uj2ijwmho7nv@alvherre.pgsql
2025-09-25 14:52:41 +02:00
Álvaro Herrera
81fc3e28e3
Update some more forward declarations to use typedef
As commit d4d1fc527b.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/202509191025.22agk3fvpilc@alvherre.pgsql
2025-09-25 14:33:19 +02:00
Melanie Plageman
ae8ea7278c Correct prune WAL record opcode name in comment
f83d709760 incorrectly refers to a XLOG_HEAP2_PRUNE_FREEZE WAL record
opcode. No such code exists. The relevant opcodes are
XLOG_HEAP2_PRUNE_ON_ACCESS, XLOG_HEAP2_PRUNE_VACUUM_SCAN, and
XLOG_HEAP2_PRUNE_VACUUM_CLEANUP. Correct it.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/yn4zp35kkdsjx6wf47zcfmxgexxt4h2og47pvnw2x5ifyrs3qc%407uw6jyyxuyf7
2025-09-24 12:29:56 -04:00
Tom Lane
aadbcc40bc Ensure guc_tables.o's dependency on guc_tables.inc.c is known.
Without this, rebuilds can malfunction unless --enable-depend is used.
Historically we've expected that you can get away without
--enable-depend as long as you manually clean after changing *.h
files; the makefiles are supposed to handle other sorts of
dependencies.  So add this one.

Follow-on to 635998965, so no need for back-patch.

Discussion: https://postgr.es/m/3121329.1758650878@sss.pgh.pa.us
2025-09-24 12:28:20 -04:00
Fujii Masao
7fcb32ad02 Fix incorrect and inconsistent comments in tableam.h and heapam.c.
This commit corrects several issues in function comments:

* The parameter "rel" was incorrectly referred to as "relation" in the comments
   for table_tuple_delete(), table_tuple_update(), and table_tuple_lock().
* In table_tuple_delete(), "changingPart" was listed as an output parameter
   in the comments but is actually input.
* In table_tuple_update(), "slot" was listed as an input parameter
   in the comments but is actually output.
* The comment for "update_indexes" in table_tuple_update() was mis-indented.
* The comments for heap_lock_tuple() incorrectly referenced a non-existent
   "tid" parameter.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2nB6Ay8g=KEn7L3qbYX_4+sLk9XOMkV0XZqHR4cTY8ZvQ@mail.gmail.com
2025-09-25 00:51:59 +09:00
Peter Eisentraut
a5b35fcedb Remove PointerIsValid()
This doesn't provide any value over the standard style of checking the
pointer directly or comparing against NULL.

Also remove related:
- AllocPointerIsValid() [unused]
- IndexScanIsValid() [had one user]
- HeapScanIsValid() [unused]
- InvalidRelation [unused]

Leaving HeapTupleIsValid(), ItemIdIsValid(), PortalIsValid(),
RelationIsValid for now, to reduce code churn.

Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/ad50ab6b-6f74-4603-b099-1cd6382fb13d%40eisentraut.org
Discussion: https://www.postgresql.org/message-id/CA+hUKG+NFKnr=K4oybwDvT35dW=VAjAAfiuLxp+5JeZSOV3nBg@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/bccf2803-5252-47c2-9ff0-340502d5bd1c@iki.fi
2025-09-24 15:17:20 +02:00
Daniel Gustafsson
0fba25eb72 Fix incorrect option name in usage screen
The usage screen incorrectly refered to the --docs option as --sgml.
Backpatch down to v17 where this script was introduced.

Author: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/20250729.135638.1148639539103758555.horikyota.ntt@gmail.com
Backpatch-through: 17
2025-09-24 14:58:18 +02:00
Daniel Gustafsson
711ccce38f Consistently handle tab delimiters for wait event names
Format validation and element extraction for intermediate line
strings were inconsistent in their handling of tab delimiters,
which resulted in an unclear error when multiple tab characters
were used as a delimiter.  This fixes it by using captures from
the validation regex instead of a separate split() to avoid the
inconsistency.  Also, it ensures that \t+ is used consistently
when inspecting the strings.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/20250729.135638.1148639539103758555.horikyota.ntt@gmail.com
2025-09-24 14:57:26 +02:00
John Naylor
5334620eef Update GB18030 encoding from version 2000 to 2022
Mappings for 18 characters have changed, affecting 36 code points. This
is a break in compatibility, but these characters are rarely used.

U+E5E5 (Private Use Area) was previously mapped to \xA3A0. This code
point now maps to \x65356535. Attempting to convert \xA3A0 will now
raise an error.

Separate from the 2022 update, the following mappings were previously
swapped, and subsequently corrected in 2000 and later versions:
 * U+E7C7 (Private Use Area) now maps to \x8135F437
 * U+1E3F (Latin Small Letter M with Acute) now maps to \xA8BC

The 2022 standard mentions the following policy changes, but they
have no effect in our implementation:

66 new ideographs are now required, but these are mapped
algorithmically so were already handled by utf8_and_gb18030.c.

Nine CJK compatibility ideographs are no longer required, but
implementations may retain them, as does the source we use from
the Unicode Consortium.

Release notes: Compatibility section

For further details, see:
https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf
https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132

Author: Chao Li <lic@highgo.com>
Author: Zheng Tao <taoz@highgo.com>
Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com
2025-09-24 13:26:05 +07:00
Amit Kapila
e41d954da6 Fix LOCK_TIMEOUT handling during parallel apply.
Previously, the parallel apply worker used SIGINT to receive a graceful
shutdown signal from the leader apply worker. However, SIGINT is also used
by the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. This
overlap caused the parallel apply worker to miss LOCK_TIMEOUT signals,
leading to incorrect behavior during lock wait/contention.

This patch resolves the conflict by switching the graceful shutdown signal
from SIGINT to SIGUSR2.

Reported-by: Zane Duffield <duffieldzane@gmail.com>
Diagnosed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 16, where it was introduced
Discussion: https://postgr.es/m/CACMiCkXyC4au74kvE2g6Y=mCEF8X6r-Ne_ty4r7qWkUjRE4+oQ@mail.gmail.com
2025-09-24 04:11:53 +00:00
Robert Haas
f2bae51dfd Keep track of what RTIs a Result node is scanning.
Result nodes now include an RTI set, which is only non-NULL when they
have no subplan, and is taken from the relid set of the RelOptInfo that
the Result is generating. ExplainPreScanNode now takes notice of these
RTIs, which means that a few things get schema-qualified in the
regression tests that previously did not. This makes the output more
consistent between cases where some part of the plan tree is replaced by
a Result node and those where this does not happen.

Likewise, pg_overexplain's EXPLAIN (RANGE_TABLE) now displays the RTIs
stored in a Result node just as it already does for other RTI-bearing
node types.

Result nodes also now include a result_reason, which tells us something
about why the Result node was inserted.  Using that information, EXPLAIN
now emits, where relevant, a "Replaces" line describing the origin of
a Result node.

The purpose of these changes is to allow code that inspects a Plan
tree to understand the origin of Result nodes that appear therein.

Discussion: http://postgr.es/m/CA+TgmoYeUZePZWLsSO+1FAN7UPePT_RMEZBKkqYBJVCF1s60=w@mail.gmail.com
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
2025-09-23 09:07:55 -04:00
David Rowley
9fc7f6ab72 Fix various incorrect filename references
Author: Chao Li <li.evan.chao@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2=hOBCPm-Z=F15twr_23XjHeoXSbifP5GdEdtWona97wQ@mail.gmail.com
2025-09-22 13:33:17 +12:00
Daniel Gustafsson
e1d917182c Add support for base64url encoding and decoding
This adds support for base64url encoding and decoding, a base64
variant which is safe to use in filenames and URLs.  base64url
replaces '+' in the base64 alphabet with '-' and '/' with '_',
thus making it safe for URL addresses and file systems.

Support for base64url was originally suggested by Przemysław Sztoch.

Author: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-by: David E. Wheeler <david@justatheory.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Chao Li (Evan) <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/70f2b6a8-486a-4fdb-a951-84cef35e22ab@sztoch.pl
2025-09-20 23:19:32 +02:00
Tom Lane
261f89a976 Track the maximum possible frequency of non-MCE array elements.
The lossy-counting algorithm that ANALYZE uses to identify most-common
array elements has a notion of cutoff frequency: elements with
frequency greater than that are guaranteed to be collected, elements
with smaller frequencies are not.  In cases where we find fewer MCEs
than the stats target would permit us to store, the cutoff frequency
provides valuable additional information, to wit that there are no
non-MCEs with frequency greater than that.  What the selectivity
estimation functions actually use the "minfreq" entry for is as a
ceiling on the possible frequency of non-MCEs, so using the cutoff
rather than the lowest stored MCE frequency provides a tighter bound
and more accurate estimates.

Therefore, instead of redundantly storing the minimum observed MCE
frequency, store the cutoff frequency when there are fewer tracked
values than we want.  (When there are more, then of course we cannot
assert that no non-stored elements are above the cutoff frequency,
since we're throwing away some that are; so we still use the
minimum stored frequency in that case.)

Notably, this works even when none of the values are common enough
to be called MCEs.  In such cases we previously stored nothing in
the STATISTIC_KIND_MCELEM pg_statistic slot, which resulted in the
selectivity functions falling back to default estimates.  So in that
case we want to construct a STATISTIC_KIND_MCELEM entry that contains
no "values" but does have "numbers", to wit the three extra numbers
that the MCELEM entry type defines.  A small obstacle is that
update_attstats() has traditionally stored a null, not an empty array,
when passed zero "values" for a slot.  That gives rise to an MCELEM
entry that get_attstatsslot() will spit up on.  The least risky
solution seems to be to adjust update_attstats() so that it will emit
a non-null (but possibly empty) array when the passed stavalues array
pointer isn't NULL, rather than conditioning that on numvalues > 0.
In other existing cases I don't believe that that changes anything.
For consistency, handle the stanumbers array the same way.

In passing, improve the comments in routines that use
STATISTIC_KIND_MCELEM data.  Particularly, explain why we use
minfreq / 2 not minfreq as the estimate for non-MCE values.

Thanks to Matt Long for the suggestion that we could apply this
idea even when there are more than zero MCEs.

Reported-by: Mark Frost <FROSTMAR@uk.ibm.com>
Reported-by: Matt Long <matt@mattlong.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/PH3PPF1C905D6E6F24A5C1A1A1D8345B593E16FA@PH3PPF1C905D6E6.namprd15.prod.outlook.com
2025-09-20 14:48:16 -04:00
Tom Lane
1eccb93150 Re-allow using statistics for bool-valued functions in WHERE.
Commit a391ff3c3, which added the ability for a function's support
function to provide a custom selectivity estimate for "WHERE f(...)",
unintentionally removed the possibility of applying expression
statistics after finding there's no applicable support function.
That happened because we no longer fell through to boolvarsel()
as before.  Refactor to do so again, putting the 0.3333333 default
back into boolvarsel() where it had been (cf. commit 39df0f150).

I surely wouldn't have made this error if 39df0f150 had included
a test case, so add one now.  At the time we did not have the
"extended statistics" infrastructure, but we do now, and it is
also unable to work in this scenario because of this error.
So make use of that for the test case.

This is very clearly a bug fix, but I'm afraid to put it into
released branches because of the likelihood of altering plan
choices, which we avoid doing in minor releases.  So, master only.

Reported-by: Frédéric Yhuel <frederic.yhuel@dalibo.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/a8b99dce-1bfb-4d97-af73-54a32b85c916@dalibo.com
2025-09-20 12:44:52 -04:00
David Rowley
ac7c8e412c Improve wording in a few comments
Initially this was to fix the "catched" typo, but I (David) wasn't quite
clear on what the previous comment meant about being "effective".  I
expect this means efficiency, so I've reworded the comment to indicate
that.

While this is only a comment fixup, for the sake of possibly minimizing
possible future backpatching pain, I've opted to backpatch to 18 since
this code is new to that version and the release isn't out the door yet.

Author: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNmSYWPud1sfBvpKbCJeRkWeZYuqatxtV9U9LvAFXBEiBw@mail.gmail.com
Backpatch-through: 18
2025-09-19 23:35:23 +12:00
Amit Kapila
5b148706c5 Add optional pid parameter to pg_replication_origin_session_setup().
Commit 216a784829 introduced parallel apply workers, allowing multiple
processes to share a replication origin. To support this,
replorigin_session_setup() was extended to accept a pid argument
identifying the process using the origin.

This commit exposes that capability through the SQL interface function
pg_replication_origin_session_setup() by adding an optional pid parameter.
This enables multiple processes to coordinate replication using the same
origin when using SQL-level replication functions.

This change allows the non-builtin logical replication solutions to
implement parallel apply for large transactions.

Additionally, an existing internal error was made user-facing, as it can
now be triggered via the exposed SQL API.

Author: Doruk Yilmaz <doruk@mixrank.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CAMPB6wfe4zLjJL8jiZV5kjjpwBM2=rTRme0UCL7Ra4L8MTVdOg@mail.gmail.com
Discussion: https://postgr.es/m/CAE2gYzyTSNvHY1+iWUwykaLETSuAZsCWyryokjP6rG46ZvRgQA@mail.gmail.com
2025-09-19 05:38:40 +00:00
Amit Kapila
8aac5923a3 Improve few errdetail messages introduced in commit 0d48d393d4.
Based on suggestions by Tom Lane

Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/20250916.114644.275726106301941878.horikyota.ntt@gmail.com
2025-09-19 04:52:59 +00:00
Michael Paquier
deb208df45 Make XLogFlush() and XLogNeedsFlush() decision-making more consistent
When deciding which code path to use depending on the state of recovery,
XLogFlush() and XLogNeedsFlush() have been relying on different
criterias:
- XLogFlush() relied on XLogInsertAllowed().
- XLogNeedsFlush() relied on RecoveryInProgress().

Currently, the checkpointer is allowed to insert WAL records while
RecoveryInProgress() returns true for an end-of-recovery checkpoint,
where XLogInsertAllowed() matters.  Using RecoveryInProgress() in
XLogNeedsFlush() did not really matter for its existing callers, as the
checkpointer only called XLogFlush().  However, a feature under
discussion, by Melanie Plageman, needs XLogNeedsFlush() to be able to
work in more contexts, the end-of-recovery checkpoint being one.

This commit changes XLogNeedsFlush() to use XLogInsertAllowed() instead
of RecoveryInProgress(), making the checks in both routines more
consistent.  While on it, an assertion based on XLogNeedsFlush() is
added at the end of XLogFlush(), triggered when flushing a physical
position (not for the normal recovery patch that checks for updates of
the minimum recovery point).  This assertion would fail for example in
the recovery test 015_promotion_pages if XLogNeedsFlush() is changed to
use RecoveryInProgress().  This should be hopefully enough to ensure
that the checks done in both routines remain consistent.

Author: Melanie Plageman <melanieplageman@gmail.com>
Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g@mail.gmail.com
2025-09-19 13:47:28 +09:00
Amit Langote
8741e48e5d Fix EPQ crash from missing partition pruning state in EState
Commit bb3ec16e14 moved partition pruning metadata into PlannedStmt.
At executor startup this metadata is used to initialize the EState
fields es_part_prune_infos, es_part_prune_states, and
es_part_prune_results.  EvalPlanQualStart() failed to copy those
fields into the child EState, causing NULL dereference when Append
ran partition pruning during a recheck. This can occur with DELETE
or UPDATE on partitioned tables that use runtime pruning, e.g. with
generic plans.

Fix by copying all partition pruning state into the EPQ estate.

Add an isolation test that reproduces the crash with concurrent
UPDATE and DELETE on a partitioned table, where the DELETE session
hits the crash during its EPQ recheck after the UPDATE commits.

Bug: #19056
Reported-by: Fei Changhong <feichanghong@qq.com>
Diagnozed-by: Fei Changhong <feichanghong@qq.com>
Author: David Rowley <dgrowleyml@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/19056-a677cef9b54d76a0%40postgresql.org
2025-09-19 11:38:29 +09:00
Michael Paquier
3cd3a039da Document and check that PgStat_HashKey has no padding
This change is a tighter rework of 7d85d87f4d, which tried to improve
the code so as it would work should PgStat_HashKey gain new fields that
create padding bytes.  However, the previous change is proving to not be
enough as some code paths of pgstats do not pass PgStat_HashKey by
reference (valgrind would warn when padding is added to the structure,
through a new field).

Per discussion, let's document and check that PgStat_HashKey has no
padding rather than try to complicate the code of pgstats so as it is
able to work around that.

This removes a couple of memset(0) calls that should not be required.
While on it, this commit adds a static assertion checking that no
padding is introduced in the structure, by checking that the size of
PgStat_HashKey matches with the sum of the size of all its fields.

The object ID part of the hash key is already 8 bytes, which should be
plenty enough already.  A comment is added to discourage the addition of
new fields.

Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0t9omat+HVSakJXwTMWvhpYFcAZb41RPWKwrKFUgmAFBQ@mail.gmail.com
2025-09-19 09:54:05 +09:00
Nathan Bossart
c3cc2ab87d Fix re-initialization of LWLock-related shared memory.
When shared memory is re-initialized after a crash, the named
LWLock tranche request array that was copied to shared memory will
no longer be accessible.  To fix, save the pointer to the original
array in postmaster's local memory, and switch to it when
re-initializing the LWLock-related shared memory.

Oversight in commit ed1aad15e0.  Per buildfarm member batta.

Reported-by: Michael Paquier <michael@paquier.xyz>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aMoejB3iTWy1SxfF%40paquier.xyz
Discussion: https://postgr.es/m/f8ca018f-3479-49f6-a92c-e31db9f849d7%40gmail.com
2025-09-18 09:55:39 -05:00
Andres Freund
0110e2ec5c Mark shared buffer lookup table HASH_FIXED_SIZE
StrategyInitialize() calls InitBufTable() with maximum number of entries that
the buffer lookup table can ever have. Thus there should not be any need to
allocate more element after initialization. Hence mark the hash table as fixed
sized.

Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAExHW5v0jh3F_wj86yC=qBfWk0uiT94qy=Z41uzAHLHh0SerRA@mail.gmail.com
2025-09-17 20:28:43 -04:00
Tom Lane
b0cc0a71e0 Calculate agglevelsup correctly when Aggref contains a CTE.
If an aggregate function call contains a sub-select that has
an RTE referencing a CTE outside the aggregate, we must treat
that reference like a Var referencing the CTE's query level
for purposes of determining the aggregate's level.  Otherwise
we might reach the nonsensical conclusion that the aggregate
should be evaluated at some query level higher than the CTE,
ending in a planner error or a broken plan tree that causes
executor failures.

Bug: #19055
Reported-by: BugForge <dllggyx@outlook.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19055-6970cfa8556a394d@postgresql.org
Backpatch-through: 13
2025-09-17 16:32:57 -04:00
Thomas Munro
0951942bba jit: Fix type used for Datum values in LLVM IR.
Commit 2a600a93 made Datum 8 bytes wide everywhere.  It was no longer
appropriate to use TypeSizeT on 32 bit systems, and JIT compilation
would fail with various type check errors.  Introduce a separate
LLVMTypeRef with the name TypeDatum.  TypeSizeT is still used in some
places for actual size_t values.

Reported-by: Dmitry Mityugov <d.mityugov@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Dmitry Mityugov <d.mityugov@postgrespro.ru>
Discussion: https://postgr.es/m/0a9f0be59171c2e8f1b3bc10f4fcf267%40postgrespro.ru
2025-09-17 13:38:35 +12:00
Michael Paquier
158c48303e Fix shared memory calculation size of PgAioCtl
The shared memory size was calculated based on an offset of io_handles,
which is itself a pointer included in the structure.  We tend to
overestimate the shared memory size overall, so this was unlikely an
issue in practice, but let's be correct and use the full size of the
structure in the calculation, so as the pointer for io_handles is
included.

Oversight in da7226993f.

Author: Madhukar Prasad <madhukarprasad@google.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAKi+wrbC2dTzh_vKJoAZXV5wqTbhY0n4wRNpCjJ=e36aoo0kFw@mail.gmail.com
Backpatch-through: 18
2025-09-17 09:33:32 +09:00
David Rowley
ac06ea8f7b Add missing EPQ recheck for TID Range Scan
The EvalPlanQual recheck for TID Range Scan wasn't rechecking the TID qual
still passed after following update chains.  This could result in tuples
being updated or deleted by plans using TID Range Scans where the ctid of
the new (updated) tuple no longer matches the clause of the scan.  This
isn't desired behavior, and isn't consistent with what would happen if the
chosen plan had used an Index or Seq Scan, and that could lead to hard to
predict behavior for scans that contain TID quals and other quals as the
planner has freedom to choose TID Range or some other non-TID scan method
for such queries, and the chosen plan could change at any moment.

Here we fix this by properly implementing the recheck function for TID
Range Scans.

Backpatch to 14, where TID Range Scans were added

Reported-by: Sophie Alpert <pg@sophiebits.com>
Author: Sophie Alpert <pg@sophiebits.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com
Backpatch-through: 14
2025-09-17 12:19:15 +12:00
David Rowley
dee21ea6d6 Add missing EPQ recheck for TID Scan
The EvalPlanQual recheck for TID Scan wasn't rechecking the TID qual
still passed after following update chains.  This could result in tuples
being updated or deleted by plans using TID Scans where the ctid of the
new (updated) tuple no longer matches the clause of the scan.  This isn't
desired behavior, and isn't consistent with what would happen if the
chosen plan had used an Index or Seq Scan, and that could lead to hard to
predict behavior for scans that contain TID quals and other quals as the
planner has freedom to choose TID or some other scan method for such
queries, and the chosen plan could change at any moment.

Here we fix this by properly implementing the recheck function for TID
Scans.

Backpatch to 13, oldest supported version

Reported-by: Sophie Alpert <pg@sophiebits.com>
Author: Sophie Alpert <pg@sophiebits.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com
Backpatch-through: 13
2025-09-17 11:48:55 +12:00
Tom Lane
8abbbbae61 Revert "Avoid race condition between "GRANT role" and "DROP ROLE"".
This reverts commit 98fc31d649.
That change allowed DROP OWNED BY to drop grants of the target
role to other roles, arguing that nobody would need those
privileges anymore.  But that's not so: if you're not superuser,
you still need admin privilege on the target role so you can
drop it.

It's not clear whether or how the dependency-based approach
to solving the original problem can be adapted to keep these
grants.  Since v18 release is fast approaching, the sanest
thing to do seems to be to revert this patch for now.  The
race-condition problem is low severity and not worth taking
risks for.

I didn't force a catversion bump in 98fc31d64, so I won't do
so here either.

Reported-by: Dipesh Dhameliya <dipeshdhameliya125@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CABgZEgczOFicCJoqtrH9gbYMe_BV3Hq8zzCBRcMgmU6LRsihUA@mail.gmail.com
Backpatch-through: 18
2025-09-16 13:05:53 -04:00
Tom Lane
83a5641945 Provide more-specific error details/hints for function lookup failures.
Up to now we've contented ourselves with a one-size-fits-all error
hint when we fail to find any match to a function or procedure call.
That was mostly okay in the beginning, but it was never great, and
since the introduction of named arguments it's really not adequate.
We at least ought to distinguish "function name doesn't exist" from
"function name exists, but not with those argument names".  And the
rules for named-argument matching are arcane enough that some more
detail seems warranted if we match the argument names but the call
still doesn't work.

This patch creates a framework for dealing with these problems:
FuncnameGetCandidates and related code will now pass back a bitmask of
flags showing how far the match succeeded.  This allows a considerable
amount of granularity in the reports.  The set-bits-in-a-bitmask
approach means that when there are multiple candidate functions, the
report will reflect the match(es) that got the furthest, which seems
correct.  Also, we can avoid mentioning "maybe add casts" unless
failure to match argument types is actually the issue.

Extend the same return-a-bitmask approach to OpernameGetCandidates.
The issues around argument names don't apply to operator syntax,
but it still seems worth distinguishing between "there is no
operator of that name" and "we couldn't match the argument types".

While at it, adjust these messages and related ones to more strictly
separate "detail" from "hint", following our message style guidelines'
distinction between those.

Reported-by: Dominique Devienne <ddevienne@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/1756041.1754616558@sss.pgh.pa.us
2025-09-16 12:17:02 -04:00
Richard Guo
b63a822452 Treat JsonConstructorExpr as non-strict
JsonConstructorExpr can produce non-NULL output with a NULL input, so
it should be treated as a non-strict construct.  Failing to do so can
lead to incorrect query behavior.

For example, in the reported case, when pulling up a subquery that is
under an outer join, if the subquery's target list contains a
JsonConstructorExpr that uses subquery variables and it is mistakenly
treated as strict, it will be pulled up without being wrapped in a
PlaceHolderVar.  As a result, the expression will be evaluated at the
wrong place and will not be forced to null when the outer join should
do so.

Back-patch to v16 where JsonConstructorExpr was introduced.

Bug: #19046
Reported-by: Runyuan He <runyuan@berkeley.edu>
Author: Tender Wang <tndrwang@gmail.com>
Co-authored-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/19046-765b6602b0a8cfdf@postgresql.org
Backpatch-through: 16
2025-09-16 18:42:20 +09:00
John Naylor
cfa6cd2927 Generate GB18030 mappings from the Unicode Consortium's UCM file
Previously we built the .map files for GB18030 (version 2000) from an
XML file. The 2022 version for this encoding is only available as a
Unicode Character Mapping (UCM) file, so as preparatory refactoring
switch to this format as the source for building version 2000.

As we do with most input files for the conversion mappings, download
the file on demand. In order to generate the same mappings we have
now, we must download from a previous upstream commit, rather than
the head since the latter contains a correction not present in our
current .map files.

The XML file is still used by EUC_CN, so we cannot delete it from our
repository. GB18030 is a superset of EUC_CN, so it may be possible to
build EUC_CN from the same UCM file, but that is left for future work.

Author: Chao Li <lic@highgo.com>
Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com
2025-09-16 16:29:08 +07:00
Peter Eisentraut
bce18ef3c6 Fix incorrect const qualifier
Commit 7202d72787 added in passing some const qualifiers, but the one
on the postmaster_child_launch() startup_data argument was incorrect,
because the function itself modifies the pointed-to data.  This is
hidden from the compiler because of casts.  The qualifiers on the
functions called by postmaster_child_launch() are still correct.
2025-09-16 07:27:32 +02:00
Peter Geoghegan
7d9cd2df5f Teach nbtree to avoid evaluating row compare keys.
Add logic to _bt_set_startikey that determines whether row compare keys
are guaranteed to be satisfied by every tuple on a page that is about to
be read by _bt_readpage.  This works in essentially the same way as the
existing scalar inequality logic.  Testing has shown that the new logic
improves performance to about the same degree as the existing scalar
inequality logic (compared to the unoptimized case).  In other words,
the new logic makes many row compare scans significantly faster.

Note that the new row compare inequality logic is only effective when
the same individual row member is the deciding subkey for all tuples on
the page (obviously, all tuples have to satisfy the row compare, too).
This is what makes the new row compare logic very similar to the
existing logic for scalar inequalities.  Note, in particular, that this
makes it safe to ignore whether all row compare members are against
either ASC or DESC index attributes (i.e. it doesn't matter if
individual subkeys don't all use the same inequality strategy).

Also stop refusing to set pstate.startikey to an offset beyond any
nonrequired key (don't add logic that'll do that for an individual row
compare subkey, either).  We can fully rely on our firstchangingattnum
tests instead.  This will do the right thing when a page has a group of
tuples with NULLs in a lower-order attribute that makes the tuples fail
to satisfy a row compare key -- we won't incorrectly conclude that all
tuples must satisfy the row compare, just because firsttup and lasttup
happen to.  Our firstchangingattnum test prevents that from happening.
(Note that the original "avoid evaluating nbtree scan keys" mechanism
added by commit e0b1ee17 couldn't support row compares due to issues
with tuples that contain NULLs in a lower-order subkey's attribute.
That original mechanism relied on requiredness markings, which the
replacement _bt_set_startikey mechanism never really needed.)

Follow up to commit 8a510275, which added the _bt_set_startikey
optimization.  _bt_set_startikey is now feature complete; there's no
remaining kind of nbtree scan key that it still doesn't support.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznL6Z3H_GTQze9d8T_Ls=cYbnd-_9f-Jo7aYgTGRUD58g@mail.gmail.com
2025-09-15 16:56:49 -04:00
Peter Eisentraut
ce71993ae4 Expand virtual generated columns in constraint expressions
Virtual generated columns in constraint expressions need to be
expanded because the optimizer matches these expressions to qual
clauses.  Failing to do so can cause us to miss opportunities for
constraint exclusion.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/204804c0-798f-4c72-bd1f-36116024fda3%40eisentraut.org
2025-09-15 16:27:50 +02:00
Peter Eisentraut
9ec0b29976 CREATE STATISTICS: improve misleading error message
The previous change (commit f225473cba) was still not on target,
because it talked about relation kinds, which are not what is being
checked here.  Provide a more accurate message.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CACJufxEZ48toGH0Em_6vdsT57Y3L8pLF=DZCQ_gCii6=C3MeXw@mail.gmail.com
2025-09-15 11:43:34 +02:00
Peter Eisentraut
4bd9191298 Change fmgr.h typedefs to use original names
fmgr.h defined some types such as fmNodePtr which is just Node *, but
it made its own types to avoid having to include various header files.
With C11, we can now instead typedef the original names without fear
of conflicts.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org
2025-09-15 11:04:10 +02:00
Peter Eisentraut
dc41d7415f Remove hbaPort type
This was just a workaround to avoid including the header file that
defines the Port type.  With C11, we can now just re-define the Port
type without the possibility of a conflict.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org
2025-09-15 11:04:10 +02:00
Amit Kapila
0d48d393d4 Resume conflict-relevant data retention automatically.
This commit resumes automatic retention of conflict-relevant data for a
subscription. Previously, retention would stop if the apply process failed
to advance its xmin (oldest_nonremovable_xid) within the configured
max_retention_duration and user needs to manually re-enable
retain_dead_tuples option. With this change, retention will resume
automatically once the apply worker catches up and begins advancing its
xmin (oldest_nonremovable_xid) within the configured threshold.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2025-09-15 08:46:55 +00:00
Peter Eisentraut
282d0bdee6 jit: fix build with LLVM-21
LLVM-21 renamed llvm::GlobalValue::getGUID() to
getGUIDAssumingExternalLinkage(), so add a version guard.

Author: Holger Hoffstätte <holger@applied-asynchrony.com>
Discussion: https://www.postgresql.org/message-id/flat/d25e6e4a-d1b4-84d3-2f8a-6c45b975f53d%40applied-asynchrony.com
2025-09-15 08:31:11 +02:00
Peter Eisentraut
748caa9dcb Some stylistic improvements in toast_save_datum()
Move some variables to a smaller scope.  Initialize chunk_data before
storing a pointer to it; this avoids compiler warnings on clang-21, or
respectively us having to work around it by initializing it to zero
before the variable is used (as was done in commit e92677e863).

Discussion: https://www.postgresql.org/message-id/flat/6604ad6e-5934-43ac-8590-15113d6ae4b1%40eisentraut.org
2025-09-15 07:43:23 +02:00
Peter Eisentraut
bf5da5d6ca Hide duplicate names from extension views
If extensions of equal names were installed in different directories
in the path, the views pg_available_extensions and
pg_available_extension_versions would show all of them, even though
only the first one was actually reachable by CREATE EXTENSION.  To
fix, have those views skip extensions found later in the path if they
have names already found earlier.

Also add a bit of documentation that only the first extension in the
path can be used.

Reported-by: Pierrick <pierrick.chovelon@dalibo.com>
Discussion: https://www.postgresql.org/message-id/flat/8f5a0517-1cb8-4085-ae89-77e7454e27ba%40dalibo.com
2025-09-15 07:30:31 +02:00
Peter Geoghegan
454c046094 nbtree: Always set skipScan flag on rescan.
The TimescaleDB extension expects to be able to change an nbtree scan's
keys across rescans.  The issue arises in the extension's implementation
of loose index scan.  This is arguably a misuse of the index AM API,
though apparently it worked until recently.  It stopped working when the
skipScan flag was added to BTScanOpaqueData by commit 8a510275, though.
The flag wouldn't reliably track whether the scan (actually, the current
rescan) has any skip arrays, leading to confusion in _bt_set_startikey.

nbtree preprocessing will now defensively initialize the scan's skipScan
flag in all cases, including the case where _bt_preprocess_array_keys
returns early due to the (re)scan not using arrays.  While nbtree isn't
obligated to support this use case (at least not according to my reading
of the index AM API), it still seems like a good idea to be consistent
here, on general robustness grounds.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Natalya Aksman <natalya@timescale.com>
Discussion: https://postgr.es/m/CAJumhcirfMojbk20+W0YimbNDkwdECvJprQGQ-XqK--ph09nQw@mail.gmail.com
Backpatch-through: 18
2025-09-13 21:01:33 -04:00
Tom Lane
cdf7feb965 Amend recent fix for SIMILAR TO regex conversion.
Commit e3ffc3e91 fixed the translation of character classes in
SIMILAR TO regular expressions.  Unfortunately the fix broke a corner
case: if there is an escape character right after the opening bracket
(for example in "[\q]"), a closing bracket right after the escape
sequence would not be seen as closing the character class.

There were two more oversights: a backslash or a nested opening bracket
right at the beginning of a character class should remove the special
meaning from any following caret or closing bracket.

This bug suggests that this code needs to be more readable, so also
rename the variables "charclass_depth" and "charclass_start" to
something more meaningful, rewrite an "if" cascade to be more
consistent, and improve the commentary.

Reported-by: Dominique Devienne <ddevienne@gmail.com>
Reported-by: Stephan Springl <springl-psql@bfw-online.de>
Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFCRh-8NwJd0jq6P=R3qhHyqU7hw0BTor3W0SvUcii24et+zAw@mail.gmail.com
Backpatch-through: 13
2025-09-13 16:55:51 -04:00
Nathan Bossart
7e9c216b52 Re-pgindent nbtpreprocesskeys.c after commit 796962922e.
Backpatch-through: 18
2025-09-13 14:50:02 -05:00
Tom Lane
9a71989a8f Reject "ALTER DATABASE/USER ... RESET foo" with invalid GUC name.
If the database or user had no entry in pg_db_role_setting,
RESET silently did nothing --- including not checking the
validity of the given GUC name.  This is quite inconsistent
and surprising, because you *would* get such an error if there
were any pg_db_role_setting entry, even though it contains
values for unrelated GUCs.

While this is clearly a bug, changing it in stable branches seems
unwise.  The effect will be that some ALTER commands that formerly
were no-ops will now be errors, and people don't like that sort of
thing in minor releases.

Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/30783e-68c28a00-9-41004480@130449754
2025-09-12 18:10:11 -04:00
Tom Lane
f14ea34d6e Fix oversights in pg_event_trigger_dropped_objects() fixes.
Commit a0b99fc12 caused pg_event_trigger_dropped_objects()
to not fill the object_name field for schemas, which it
should have; and caused it to fill the object_name field
for default values, which it should not have.

In addition, triggers and RLS policies really should behave
the same way as we're making column defaults do; that is,
they should have is_temporary = true if they belong to a
temporary table.

Fix those things, and upgrade event_trigger.sql's woefully
inadequate test coverage of these secondary output columns.

As before, back-patch only to v15.

Reported-by: Sergey Shinderuk <s.shinderuk@postgrespro.ru>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/bd7b4651-1c26-4d30-832b-f942fabcb145@postgrespro.ru
Backpatch-through: 15
2025-09-12 17:43:17 -04:00
Peter Geoghegan
796962922e Always commute strategy when preprocessing DESC keys.
A recently added nbtree preprocessing step failed to account for the
fact that DESC columns already had their B-Tree strategy number commuted
at this point in preprocessing.  As a result, preprocessing could output
a set of scan keys where one or more keys had the correct strategy
number, but used the wrong comparison routine.

To fix, make the faulty code path that looks up a more restrictive
replacement operator/comparison routine commute its requested inequality
strategy (while outputting the transformed strategy number as before).
This makes the final transformed scan key comport with the approach
preprocessing has always used to deal with DESC columns (which is
described by comments above _bt_fix_scankey_strategy).

Oversight in commit commit b3f1a13f, which made nbtree preprocessing
perform transformations on skip array inequalities that can reduce the
total number of index searches.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Natalya Aksman <natalya@timescale.com>
Discussion: https://postgr.es/m/19049-b7df801e71de41b2@postgresql.org
Backpatch-through: 18
2025-09-12 13:23:00 -04:00
Álvaro Herrera
7dcea51c2a
Avoid unexpected changes of CurrentResourceOwner and CurrentMemoryContext
Users of logical decoding can encounter an unexpected change of
CurrentResourceOwner and CurrentMemoryContext.  The problem is that,
unlike other call sites of RollbackAndReleaseCurrentSubTransaction(), in
reorderbuffer.c we fail to restore the original values of these global
variables after being clobbered by subtransaction abort.  This patch
saves the values prior to the call and restores them eventually.

In addition, logical.c and logicalfuncs.c had a hack to restore resource
owner, presumably because of lack of this restore.  Remove that.
Instead, because the test coverage here is not very consistent, add an
Assert() to ensure that the resowner is kept identical; this would make
it easy to detect other cases of bugs were we fail to restore resowner
properly.  This could be removed later.

This is arguably an old bug, but there appears to be no reason to
backpatch it and it's risky to do so, so refrain for now.

Author: Antonin Houska <ah@cybertec.at>
Reported-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/119497.1756892972@localhost
2025-09-12 18:47:25 +02:00
Peter Eisentraut
ae0e1be9f2 Allow redeclaration of typedef yyscan_t
This is allowed in C11, so we don't need the workaround guards against
it anymore.  This effectively reverts commit 382092a0cd that put
these guards in place.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/10d32190-f31b-40a5-b177-11db55597355@eisentraut.org
2025-09-12 08:16:00 +02:00
Peter Eisentraut
2aac62be8c Default to log_lock_waits=on
If someone is stuck behind a lock for more than a second, that is
almost always a problem that is worth a log entry.

Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-By: Michael Banck <mbanck@gmx.net>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Christoph Berg <myon@debian.org>
Reviewed-By: Stephen Frost <sfrost@snowman.net>
Discussion: https://postgr.es/m/b8b8502915e50f44deb111bc0b43a99e2733e117.camel%40cybertec.at
2025-09-12 07:57:06 +02:00
Peter Eisentraut
25f36066dd Remove traces of support for Sun Studio compiler
Per discussion, this compiler suite is no longer maintained, and
it has not been able to compile PostgreSQL since at least PostgreSQL
17.

This removes all the remaining support code for this compiler.

Note that the Solaris operating system continues to be supported, but
using GCC as the compiler.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/a0f817ee-fb86-483a-8a14-b6f7f5991b6e%40eisentraut.org
2025-09-12 07:39:05 +02:00
Peter Eisentraut
e92677e863 Silence compiler warnings on clang 21
Clang 21 shows some new compiler warnings, for example:

warning: variable 'dstsize' is uninitialized when passed as a const pointer argument here [-Wuninitialized-const-pointer]

The fix is to initialize the variables when they are defined.  This is
similar to, for example, the existing situation in gistKeyIsEQ().

Discussion: https://www.postgresql.org/message-id/flat/6604ad6e-5934-43ac-8590-15113d6ae4b1%40eisentraut.org
2025-09-12 07:28:32 +02:00
Richard Guo
2d756ebbe8 Fix misuse of Relids for storing attribute numbers
The typedef Relids (Bitmapset *) is intended to represent set of
relation identifiers, but was incorrectly used in several places to
store sets of attribute numbers.  This is my oversight in e2debb643.

Fix that by replacing such usages with Bitmapset * to reflect the
correct semantics.

Author: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAEG8a3LJhp_xriXf39iCz0TsK+M-2biuhDhpLC6Baxw8+ZYT3A@mail.gmail.com
2025-09-12 11:12:19 +09:00
Michael Paquier
528dadf691 Add more information for WAL records of hash index AMs
hashdesc.c was missing a couple of fields in its record descriptions, as
of:
- is_prev_bucket_same_wrt for SQUEEZE_PAGE.
- procid for INIT_META_PAGE.
- old_bucket_flag and new_bucket_flag for SPLIT_ALLOCATE_PAGE.

The author has noted the first hole, and I have spotted the others while
double-checking this area of the code.  Note that the only data missing
now are the offsets stored in VACUUM_ONE_PAGE.  We could perhaps add
them, if somebody sees value in this data, even if it makes the output
larger.  These are discarded here.

Author: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CALdSSPjc-OVwtZH0Xrkvg7n=2ZwdbMJzqrm_ed_CfjiAzuKVGg@mail.gmail.com
2025-09-12 10:29:02 +09:00
Nathan Bossart
ed1aad15e0 Move named LWLock tranche requests to shared memory.
In EXEC_BACKEND builds, GetNamedLWLockTranche() can segfault when
called outside of the postmaster process, as it might access
NamedLWLockTrancheRequestArray, which won't be initialized.  Given
the lack of reports, this is apparently unusual, presumably because
it is usually called from a shmem_startup_hook like this:

    mystruct = ShmemInitStruct(..., &found);
    if (!found)
    {
        mystruct->locks = GetNamedLWLockTranche(...);
        ...
    }

This genre of shmem_startup_hook evades the aforementioned
segfaults because the struct is initialized in the postmaster, so
all other callers skip the !found path.

We considered modifying the documentation or requiring
GetNamedLWLockTranche() to be called from the postmaster, but
ultimately we decided to simply move the request array to shared
memory (and add it to the BackendParameters struct), thereby
allowing calls outside postmaster on all platforms.  Since the main
shared memory segment is initialized after accepting LWLock tranche
requests, postmaster builds the request array in local memory first
and then copies it to shared memory later.

Given the lack of reports, back-patching seems unnecessary.

Reported-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0v1_15QPg5Sqd2Qz5rh_qcsyCeHHmRDY89xVHcy2yt5BQ%40mail.gmail.com
2025-09-11 16:13:55 -05:00
Tom Lane
a0b99fc122 Report the correct is_temporary flag for column defaults.
pg_event_trigger_dropped_objects() would report a column default
object with is_temporary = false, even if it belongs to a
temporary table.  This seems clearly wrong, so adjust it to
report the table's temp-ness.

While here, refactor EventTriggerSQLDropAddObject to make its
handling of namespace objects less messy and avoid duplication
of the schema-lookup code.  And add some explicit test coverage
of dropped-object reports for dependencies of temp tables.

Back-patch to v15.  The bug exists further back, but the
GetAttrDefaultColumnAddress function this patch depends on does not,
and it doesn't seem worth adjusting it to cope with the older code.

Author: Antoine Violin <violin.antuan@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFjUV9x3-hv0gihf+CtUc-1it0hh7Skp9iYFhMS7FJjtAeAptA@mail.gmail.com
Backpatch-through: 15
2025-09-11 17:11:57 -04:00
Peter Eisentraut
368c38dd47 Remove stray semicolon at global scope
The Sun Studio compiler complains about an empty declaration here.

Note for future historians:  This does not mean that this compiler is
still of current interest for anyone using PostgreSQL.  But we can let
this small fix be its parting gift.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/a0f817ee-fb86-483a-8a14-b6f7f5991b6e%40eisentraut.org
2025-09-11 12:03:15 +02:00
Tom Lane
09036dc71c Avoid faulty alignment of Datums in build_sorted_items().
If sizeof(Pointer) is 4 then sizeof(SortItem) will be 12, so that
if data->numrows is odd then we placed the values array at a location
that's not a multiple of 8.  That was fine when sizeof(Datum) was also
4, but in the wake of commit 2a600a93c it makes some alignment-picky
machines unhappy.  (You need a 32-bit machine that nonetheless expects
8-byte alignment of 8-byte quantities, which is an odd-seeming
combination but it does exist outside the Intel universe.)

To fix, MAXALIGN the space allocated to the SortItem array.
In passing, let's make the "len" variable be Size not int,
just for paranoia's sake.

This code was arguably not too safe even before 2a600a93c, but at
present I don't see a strong argument for back-patching.

Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/87036018-8d70-40ad-a0ac-192b07bd7b04@vondra.me
2025-09-10 17:51:24 -04:00
Tom Lane
bdc6cfcd12 Eliminate duplicative hashtempcxt in nodeSubplan.c.
Instead of building a separate memory context that's used just
for running hash functions, make the hash functions run in the
per-tuple context of the node's innerecontext.  This saves a
little space at runtime, and it avoids needing to reset two
contexts instead of one inside buildSubPlanHash's main loop.

This largely reverts commit 133924e13.  That's safe to do now
because bf6c614a2 decoupled the evaluation context used by
TupleHashTableMatch from that used for hash function evaluation,
so that there's no longer a risk of resetting the innerecontext
too soon.

Per discussion of bug #19040, although this is not directly
a fix for that.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reviewed-by: Fei Changhong <feichanghong@qq.com>
Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org
2025-09-10 16:15:08 -04:00
Tom Lane
abdeacdb09 Fix memory leakage in nodeSubplan.c.
If the hash functions used for hashing tuples leaked any memory,
we failed to clean that up, resulting in query-lifespan memory
leakage in queries using hashed subplans.  One way that could
happen is if the values being hashed require de-toasting, since
most of our hash functions don't trouble to clean up de-toasted
inputs.

Prior to commit bf6c614a2, this leakage was largely masked
because TupleHashTableMatch would reset hashtable->tempcxt
(via execTuplesMatch).  But it doesn't do that anymore, and
that's not really the right place for this anyway: doing it
there could reset the tempcxt many times per hash lookup,
or not at all.  Instead put reset calls into ExecHashSubPlan
and buildSubPlanHash.  Along the way to that, rearrange
ExecHashSubPlan so that there's just one place to call
MemoryContextReset instead of several.

This amounts to accepting the de-facto API spec that the caller
of the TupleHashTable routines is responsible for resetting the
tempcxt adequately often.  Although the other callers seem to
get this right, it was not documented anywhere, so add a comment
about it.

Bug: #19040
Reported-by: Haiyang Li <mohen.lhy@alibaba-inc.com>
Author: Haiyang Li <mohen.lhy@alibaba-inc.com>
Reviewed-by: Fei Changhong <feichanghong@qq.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19040-c9b6073ef814f48c@postgresql.org
Backpatch-through: 13
2025-09-10 16:05:03 -04:00
Nathan Bossart
9016fa7e3b meson: Build numeric.c with -ftree-vectorize.
autoconf builds have compiled this file with -ftree-vectorize since
commit 8870917623, but meson builds seem to have missed the memo.

Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/aL85CeasM51-0D1h%40nathan
Backpatch-through: 16
2025-09-10 11:21:12 -05:00
Peter Eisentraut
33eec80940 Fix CREATE TABLE LIKE with not-valid check constraint
In CREATE TABLE ... LIKE, any check constraints copied from the source
table should be set to valid if they are ENFORCED (the default).

Bug introduced in commit ca87c415e2.

Author: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACJufxH%3D%2Bod8Wy0P4L3_GpapNwLUP3oAes5UFRJ7yTxrM_M5kg%40mail.gmail.com
2025-09-10 13:25:58 +02:00
Michael Paquier
e6da68a6e1 Remove dynahash.h
All the callers of my_log2() are now limited inside dynahash.c, so let's
remove this header.  The same capability is provided by pg_bitutils.h
already.

Discussion: https://postgr.es/m/CAEZATCUJPQD_7sC-wErak2CQGNa6bj2hY-mr8wsBki=kX7f2_A@mail.gmail.com
2025-09-10 14:11:50 +09:00
Michael Paquier
b1187266e0 Replace callers of dynahash.h's my_log() by equivalent in pg_bitutils.h
All the calls replaced by this commit use 4-byte integers for their
variables used in input of my_log2().  Hence, the limit against
too-large inputs does not really apply.  Thresholds are also applied, as
of:
- In nodeAgg.c, the number of partitions is limited by
HASHAGG_MAX_PARTITIONS.
- In nodeHash.c, ExecChooseHashTableSize() caps its maximum number of
buckets based on HashJoinTuple and palloc() allocation limit.
- In worker.c, the number of subxacts tracked by ApplySubXactData uses
uint32, making pg_ceil_log2_64() safe to use directly.

Several approaches have been discussed, like an integration with
thresholds in pg_bitutils.h, but it was found confusing.  This uses
Dean's idea, which gives a simpler result than what I came up with to be
able to remove dynahash.h.  dynahash.h will be removed in a follow-up
commit, removing some duplication with the ceil log2 routines.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/CAEZATCUJPQD_7sC-wErak2CQGNa6bj2hY-mr8wsBki=kX7f2_A@mail.gmail.com
2025-09-10 11:20:46 +09:00
Michael Paquier
8c8f7b199d Fix leak with SMgrRelations in startup process
The startup process does not process shared invalidation messages, only
sending them, and never calls AtEOXact_SMgr() which clean up any
unpinned SMgrRelations.  Hence, it is never able to free SMgrRelations
on a periodic basis, bloating its hashtable over time.

Like the checkpointer and the bgwriter, this commit takes a conservative
approach by freeing periodically SMgrRelations when replaying a
checkpoint record, either online or shutdown, so as the startup process
has a way to perform a periodic cleanup.

Issue caused by 21d9c3ee4e, so backpatch down to v17.

Author: Jingtang Zhang <mrdrivingduck@gmail.com>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/28C687D4-F335-417E-B06C-6612A0BD5A10@gmail.com
Backpatch-through: 17
2025-09-10 07:23:05 +09:00
Peter Eisentraut
81a61fde84 Fix typo in comment
Author: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAK98qZ0whQ%3Dc%2BJGXbGSEBxCtLgy6sf-YGYqsKTAGsS-wt0wj%2BA%40mail.gmail.com
2025-09-09 15:33:46 +02:00
Dean Rasheed
faf071b553 Add date and timestamp variants of random(min, max).
This adds 3 new variants of the random() function:

    random(min date, max date) returns date
    random(min timestamp, max timestamp) returns timestamp
    random(min timestamptz, max timestamptz) returns timestamptz

Each returns a random value x in the range min <= x <= max.

Author: Damien Clochard <damien@dalibo.info>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/f524d8cab5914613d9e624d9ce177d3d@dalibo.info
2025-09-09 10:39:30 +01:00
Amit Kapila
5ac3c1ac22 Fix Coverity issue reported in commit a850be2fe.
Address a potential SIGSEGV that may occur when the tablesync worker
attempts to locate a deleted row while applying changes. This situation
arises during conflict detection for update-deleted scenarios.

To prevent this crash, ensure that the operation is errored out early if
the leader apply worker is unavailable. Since the leader worker maintains
the necessary conflict detection metadata, proceeding without it serves no
purpose and risks reporting incorrect conflict type.

In the passing, improve a nearby comment.

Reported by Tom Lane as per Coverity
Author: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/334468.1757280992@sss.pgh.pa.us
2025-09-09 03:18:22 +00:00
Melanie Plageman
8ec97e78a7 Add error codes when vacuum discovers VM corruption
Commit fd6ec93bf8 and other previous work established the
principle that when an error is potentially reachable in case of on-disk
corruption but is not expected to be reached otherwise,
ERRCODE_DATA_CORRUPTED should be used. This allows log monitoring
software to search for evidence of corruption by filtering on the error
code.

Enhance the existing log messages emitted when the heap page is found to
be inconsistent with the VM by adding this error code.

Suggested-by: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/87DD95AA-274F-4F4F-BAD9-7738E5B1F905%40yandex-team.ru
2025-09-08 17:13:31 -04:00
Jeff Davis
9af672bcb2 meson: build checksums with extra optimization flags.
Use -funroll-loops and -ftree-vectorize when building checksum.c to
match what autoconf does.

Discussion: https://postgr.es/m/a81f2f7ef34afc24a89c613671ea017e3651329c.camel@j-davis.com
Reviewed-by: Andres Freund <andres@anarazel.de>
2025-09-08 12:29:42 -07:00
Nathan Bossart
3bcfcd815e pg_upgrade: Transfer pg_largeobject_metadata's files when possible.
Commit 161a3e8b68 taught pg_upgrade to use COPY for large object
metadata for upgrades from v12 and newer, which is much faster to
restore than the proper large object commands.  For upgrades from
v16 and newer, we can take this a step further and transfer the
large object metadata files as if they were user tables.  We can't
transfer the files from older versions because the aclitem data
type (needed by pg_largeobject_metadata.lomacl) changed its storage
format in v16 (see commit 7b378237aa).  Note that this commit is
essentially a revert of commit 12a53c732c.

There are a couple of caveats.  First, we still need to COPY the
corresponding pg_shdepend rows for large objects.  Second, we need
to COPY anything in pg_largeobject_metadata with a comment or
security label, else restoring those will fail.  This means that an
upgrade in which every large object has a comment or security label
won't gain anything from this commit, but it should at least avoid
making those unusual use-cases any worse.

pg_upgrade must also take care to transfer the relfilenodes of
pg_largeobject_metadata and its index, as was done for
pg_largeobject in commits d498e052b4 and bbe08b8869.

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aJ3_Gih_XW1_O2HF%40nathan
2025-09-08 14:19:48 -05:00
Robert Haas
5a170e992a Don't generate fake "*TLOCRN*" or "*TROCRN*" aliases, either.
This is just like the previous two commits, except that this fix
actually doesn't change any regression test outputs.

Author: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA+TgmoYSYmDA2GvanzPMci084n+mVucv0bJ0HPbs6uhmMN6HMg@mail.gmail.com
2025-09-08 12:58:07 -04:00
Robert Haas
6f79024df3 Don't generate fake "ANY_subquery" aliases, either.
This is just like the previous commit, but for a different invented
alias name.

Author: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA+TgmoYSYmDA2GvanzPMci084n+mVucv0bJ0HPbs6uhmMN6HMg@mail.gmail.com
2025-09-08 12:24:02 -04:00
Robert Haas
585e31fcb6 Don't generate fake "*SELECT*" or "*SELECT* %d" subquery aliases.
rte->alias should point only to a user-written alias, but in these
cases that principle was violated. Fixing this causes some regression
test output changes: wherever rte->alias previously had a value and
is now NULL, rte->eref is now set to a generated name rather than to
rte->alias; and the scheme used to generate eref names differs from
what we were doing for aliases.

The upshot is that instead of "*SELECT*" or "*SELECT* %d",
EXPLAIN will now emit "unnamed_subquery" or "unnamed_subquery_%d".
But that's a reasonable descriptor, and we were already producing
that in yet other cases, so this seems not too objectionable.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Co-authored-by: Robert Haas <rhaas@postgresql.org>
Discussion: https://postgr.es/m/CA+TgmoYSYmDA2GvanzPMci084n+mVucv0bJ0HPbs6uhmMN6HMg@mail.gmail.com
2025-09-08 11:50:33 -04:00
Melanie Plageman
3399c26554 Remove unneeded VM pin from VM replay
Previously, heap_xlog_visible() called visibilitymap_pin() even after
getting a buffer from XLogReadBufferForRedoExtended() -- which returns a
pinned buffer containing the specified block of the visibility map.

This would just have resulted in visibilitymap_pin() returning early
since the specified page was already present and pinned, but it was
confusing extraneous code, so remove it. It doesn't seem worth
backporting, though.

It appears to be an oversight in 2c03216.

While we are at it, remove two VM-related redundant asserts in the COPY
FREEZE code path. visibilitymap_set() already asserts that
PD_ALL_VISIBLE is set on the heap page and checks that the vmbuffer
contains the bits corresponding to the specified heap block, so callers
do not also need to check this.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>

Discussion: https://postgr.es/m/CALdSSPhu7WZd%2BEfQDha1nz%3DDC93OtY1%3DUFEdWwSZsASka_2eRQ%40mail.gmail.com
2025-09-08 10:22:42 -04:00
Amit Kapila
6456c6e2c4 Add test to prevent premature removal of conflict-relevant data.
A test has been added to ensure that conflict-relevant data is not
prematurely removed when a concurrent prepared transaction is being
committed on the publisher.

This test introduces an injection point that simulates the presence of a
prepared transaction in the commit phase, validating that the system
correctly delays conflict slot advancement until the transaction is fully
committed.

Additionally, the test serves as a safeguard for developers, ensuring that
the acquisition of the commit timestamp does not occur before marking
DELAY_CHKPT_IN_COMMIT in RecordTransactionCommitPrepared.

Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com
2025-09-08 12:06:03 +00:00
Michael Paquier
8191e0c16a Fix corruption of pgstats shared hashtable due to OOM failures
A new pgstats entry is created as a two-step process:
- The entry is looked at in the shared hashtable of pgstats, and is
inserted if not found.
- When not found and inserted, its fields are then initialized.  This
part include a DSA chunk allocation for the stats data of the new entry.

As currently coded, if the DSA chunk allocation fails due to an
out-of-memory failure, an ERROR is generated, leaving in the pgstats
shared hashtable an inconsistent entry due to the first step, as the
entry has already been inserted in the hashtable.  These broken entries
can then be found by other backends, crashing them.

There are only two callers of pgstat_init_entry(), when loading the
pgstats file at startup and when creating a new pgstats entry.  This
commit changes pgstat_init_entry() so as we use dsa_allocate_extended()
with DSA_ALLOC_NO_OOM, making it return NULL on allocation failure
instead of failing.  This way, a backend failing an entry creation can
take appropriate cleanup actions in the shared hashtable before throwing
an error.  Currently, this means removing the entry from the shared
hashtable before throwing the error for the allocation failure.

Out-of-memory errors unlikely happen in the wild, and we do not bother
with back-patches when these are fixed, usually.  However, the problem
dealt with here is a degree worse as it breaks the shared memory state
of pgstats, impacting other processes that may look at an inconsistent
entry that a different process has failed to create.

Author: Mikhail Kot <mikhail.kot@databricks.com>
Discussion: https://postgr.es/m/CAAi9E7jELo5_-sBENftnc2E8XhW2PKZJWfTC3i2y-GMQd2bcqQ@mail.gmail.com
Backpatch-through: 15
2025-09-08 15:52:23 +09:00
Amit Kapila
1f7e9ba3ac Post-commit review fixes for 228c370868.
This commit fixes three issues:

1) When a disabled subscription is created with retain_dead_tuples set to true,
the launcher is not woken up immediately, which may lead to delays in creating
the conflict detection slot.

Creating the conflict detection slot is essential even when the subscription is
not enabled. This ensures that dead tuples are retained, which is necessary for
accurately identifying the type of conflict during replication.

2) Conflict-related data was unnecessarily retained when the subscription does
not have a table.

3) Conflict-relevant data could be prematurely removed before applying
prepared transactions on the publisher that are in the commit critical section.

This issue occurred because the backend executing COMMIT PREPARED was not
accounted for during the computation of oldestXid in the commit phase on
the publisher. As a result, the subscriber could advance the conflict
slot's xmin without waiting for such COMMIT PREPARED transactions to
complete.

We fixed this issue by identifying prepared transactions that are in the
commit critical section during computation of oldestXid in commit phase.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS9PR01MB16913DACB64E5721872AA5C02943BA@OS9PR01MB16913.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com
2025-09-08 06:10:15 +00:00
Michael Paquier
43eb2c5419 Update parser README to include parse_jsontable.c
The README was missing parse_jsontable.c which handles JSON_TABLE.
Oversight in de3600452b.

Author: Karthik S <karthikselvaam@gmail.com>
Discussion: https://postgr.es/m/CAK4gQD9gdcj+vq_FZGp=Rv-W+41v8_C7cmCUmDeu=cfrOdfXEw@mail.gmail.com
Backpatch-through: 17
2025-09-08 10:07:14 +09:00
Tatsuo Ishii
06473f5a34 Allow to log raw parse tree.
This commit allows to log the raw parse tree in the same way we
currently log the parse tree, rewritten tree, and plan tree.

To avoid unnecessary log noise for users not interested in this
detail, a new GUC option, "debug_print_raw_parse", has been added.

When starting the PostgreSQL process with "-d N", and N is 3 or higher,
debug_print_raw_parse is enabled automatically, alongside
debug_print_parse.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2mcO0Gpo4vd8kPMAFWeJLSp0MeUUnaLdE1x0tSVd-VzUw%40mail.gmail.com
2025-09-06 07:49:51 +09:00
Andres Freund
2c78940527 bufmgr: Remove freelist, always use clock-sweep
This set of changes removes the list of available buffers and instead simply
uses the clock-sweep algorithm to find and return an available buffer.  This
also removes the have_free_buffer() function and simply caps the
pg_autoprewarm process to at most NBuffers.

While on the surface this appears to be removing an optimization it is in fact
eliminating code that induces overhead in the form of synchronization that is
problematic for multi-core systems.

The main reason for removing the freelist, however, is not the moderate
improvement in scalability, but that having the freelist would require
dedicated complexity in several upcoming patches. As we have not been able to
find a case benefiting from the freelist...

Author: Greg Burd <greg@burd.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com
2025-09-05 12:25:59 -04:00
Andres Freund
50e4c6ace5 bufmgr: Use consistent naming of the clock-sweep algorithm
Minor edits to comments only.

Author: Greg Burd <greg@burd.me>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/70C6A5B5-2A20-4D0B-BC73-EB09DD62D61C@getmailspring.com
2025-09-05 12:25:59 -04:00
Melanie Plageman
e3d5ddb7ca Add assert and log message to visibilitymap_set
Add an assert to visibilitymap_set() that the provided heap buffer is
exclusively locked, which is expected.

Also, enhance the debug logging message to specify which VM flags were
set.

Based on a related suggestion by Kirill Reshke on an in-progress
patchset.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CALdSSPhAU56g1gGVT0%2BwG8RrSWE6qW8TOfNJS1HNAWX6wPgbFA%40mail.gmail.com
2025-09-05 09:33:36 -04:00
Dean Rasheed
6ede13d1b5 Fix concurrent update issue with MERGE.
When executing a MERGE UPDATE action, if there is more than one
concurrent update of the target row, the lock-and-retry code would
sometimes incorrectly identify the latest version of the target tuple,
leading to incorrect results.

This was caused by using the ctid field from the TM_FailureData
returned by table_tuple_lock() in a case where the result was TM_Ok,
which is unsafe because the TM_FailureData struct is not guaranteed to
be fully populated in that case. Instead, it should use the tupleid
passed to (and updated by) table_tuple_lock().

To reduce the chances of similar errors in the future, improve the
commentary for table_tuple_lock() and TM_FailureData to make it
clearer that table_tuple_lock() updates the tid passed to it, and most
fields of TM_FailureData should not be relied on in non-failure cases.
An exception to this is the "traversed" field, which is set in both
success and failure cases.

Reported-by: Dmitry <dsy.075@yandex.ru>
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1570d30e-2b95-4239-b9c3-f7bf2f2f8556@yandex.ru
Backpatch-through: 15
2025-09-05 08:18:18 +01:00
Michael Paquier
567d27e8e2 Fix outdated comments in slru.c
SlruRecentlyUsed() is an inline function since 53c2a97a92, not a
macro.  The description of long_segment_names was missing at the top of
SimpleLruInit(), part forgotten in 4ed8f0913b.

Author: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://postgr.es/m/aLpBLMOYwEQkaleF@jrouhaud
Backpatch-through: 17
2025-09-05 14:10:08 +09:00
Michael Paquier
4246a977ba Switch some numeric-related functions to use soft error reporting
This commit changes some functions related to the data type numeric to
use the soft error reporting rather than a custom boolean flag (called
"have_error") that callers of these functions could rely on to bypass
the generation of ERROR reports, letting the callers do their own error
handling (timestamp, jsonpath and numeric_to_char() require them).

This results in the removal of some boilerplate code that was required
to handle both the ereport() and the "have_error" code paths bypassing
ereport(), unifying everything under the soft error reporting facility.

While on it, some duplicated error messages are removed.  The function
upgraded in this commit were suffixed with "_opt_error" in their names.
They are renamed to "_safe" instead.

This change relies on d9f7f5d32f, that has introduced the soft error
reporting infrastructure.

Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/CAAJ_b96No5h5tRuR+KhcC44YcYUCw8WAHuLoqqyyop8_k3+JDQ@mail.gmail.com
2025-09-05 13:53:47 +09:00
Michael Paquier
ae45312008 Change pg_lsn_in_internal() to use soft error reporting
pg_lsn includes pg_lsn_in_internal() for the purpose of parsing a LSN
position for the GUC recovery_target_lsn (21f428ebde).  It relies on a
boolean called "have_error" that would be set when the LSN parsing
fails, then let its callers handle any errors.

d9f7f5d32f has added support for soft error reporting.  This commit
removes some boilerplate code and switches the routine to use soft error
reporting directly, giving to the callers of pg_lsn_in_internal()
the possibility to be fed the error message generated on failure.

pg_lsn_in_internal() routine is renamed to pg_lsn_in_safe(), for
consistency with other similar routines that are given an escontext.

Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/CAAJ_b96No5h5tRuR+KhcC44YcYUCw8WAHuLoqqyyop8_k3+JDQ@mail.gmail.com
2025-09-05 12:59:29 +09:00
Nathan Bossart
d814d7fc3d Revert recent change to RequestNamedLWLockTranche().
Commit 38b602b028 modified this function to allocate enough space
for MAX_NAMED_TRANCHES (256) requests, which is likely far more
than most clusters need.  This commit reverts that change so that
it first allocates enough space for only 16 requests and resizes
the array when necessary.  While at it, remove the check for too
many tranches from this function.  We can now rely on
InitializeLWLocks() to do that check via its calls to
LWLockNewTrancheId() for the named tranches.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aLmzwC2dRbqk14y6%40nathan
2025-09-04 15:34:48 -05:00
Nathan Bossart
1129d3e4c8 Adjust commentary for WaitEventLWLock in wait_event_names.txt.
In addition to changing a couple of references for clarity, this
commit combines the two similar comments.
2025-09-04 10:18:42 -05:00
Dean Rasheed
fc6600fc1c Fix replica identity check for MERGE.
When executing a MERGE, check that the target relation supports all
actions mentioned in the MERGE command. Specifically, check that it
has a REPLICA IDENTITY if it publishes updates or deletes and the
MERGE command contains update or delete actions. Failing to do this
can silently break replication.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Tested-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/OS3PR01MB57180C87E43A679A730482DF94B62@OS3PR01MB5718.jpnprd01.prod.outlook.com
Backpatch-through: 15
2025-09-04 11:45:44 +01:00
Dean Rasheed
5386bfb9c1 Fix replica identity check for INSERT ON CONFLICT DO UPDATE.
If an INSERT has an ON CONFLICT DO UPDATE clause, the executor must
check that the target relation supports UPDATE as well as INSERT. In
particular, it must check that the target relation has a REPLICA
IDENTITY if it publishes updates. Formerly, it was not doing this
check, which could lead to silently breaking replication.

Fix by adding such a check to CheckValidResultRel(), which requires
adding a new onConflictAction argument. In back-branches, preserve ABI
compatibility by introducing a wrapper function with the original
signature.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Tested-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/OS3PR01MB57180C87E43A679A730482DF94B62@OS3PR01MB5718.jpnprd01.prod.outlook.com
Backpatch-through: 13
2025-09-04 11:27:53 +01:00
Michael Paquier
09119238a1 Fix incorrect comment in pgstat_backend.c
The counters saved from pgWalUsage, used for the difference calculations
when flushing the backend WAL stats, are updated when calling
pgstat_flush_backend() under PGSTAT_BACKEND_FLUSH_WAL, and not
pgstat_report_wal().  The comment updated in this commit referenced the
latter, but it is perfectly OK to flush the backend stats independently
of the WAL stats.

Noticed while looking at this area of the code, introduced by
76def4cdd7 as a copy-pasto.

Backpatch-through: 18
2025-09-04 08:34:51 +09:00
Nathan Bossart
38b602b028 Move dynamically-allocated LWLock tranche names to shared memory.
There are two ways for shared libraries to allocate their own
LWLock tranches.  One way is to call RequestNamedLWLockTranche() in
a shmem_request_hook, which requires the library to be loaded via
shared_preload_libraries.  The other way is to call
LWLockNewTrancheId(), which is not subject to the same
restrictions.  However, LWLockNewTrancheId() does require each
backend to store the tranche's name in backend-local memory via
LWLockRegisterTranche().  This API is a little cumbersome and leads
to things like unhelpful pg_stat_activity.wait_event values in
backends that haven't loaded the library.

This commit moves these LWLock tranche names to shared memory, thus
eliminating the need for each backend to call
LWLockRegisterTranche().  Instead, the tranche name must be
provided to LWLockNewTrancheId(), which immediately makes the name
available to all backends.  Since the tranche name array is
append-only, lookups can ordinarily avoid locking as long as their
local copy of the LWLock counter is greater than the requested
tranche ID.

One downside of this approach is that we now have a hard limit on
both the length of tranche names (NAMEDATALEN-1 bytes) and the
number of dynamically-allocated tranches (256).  Besides a limit of
NAMEDATALEN-1 bytes for tranche names registered via
RequestNamedLWLockTranche(), no such limits previously existed.  We
could avoid these new limits by using dynamic shared memory, but
the complexity involved didn't seem worth it.  We briefly
considered making the tranche limit user-configurable but
ultimately decided against that, too.  Since there is still a lot
of time left in the v19 development cycle, it's possible we will
revisit this choice.

Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA5RZ0vvED3naph8My8Szv6DL4AxOVK3eTPS0qXsaKi%3DbVdW2A%40mail.gmail.com
2025-09-03 13:57:48 -05:00
Peter Eisentraut
01d6e5b2cf Fix mistake in new GUC tables source
Commit 6359989654 had it so that the parameter "debug_discard_caches"
did not exist unless DISCARD_CACHES_ENABLED was defined (typically via
enabling asserts).  This was a mistake, it did not correspond to the
prior setup.  Several tests use this parameter, so they were now
failing if you did not have asserts enabled.
2025-09-03 11:48:35 +02:00
Peter Eisentraut
6359989654 Generate GUC tables from .dat file
Store the information in guc_tables.c in a .dat file similar to the
catalog data in src/include/catalog/, and generate a part of
guc_tables.c from that.  The goal is to make it easier to edit that
information, and to be able to make changes to the downstream data
structures more easily.  (Essentially, those are the same reasons as
for the original adoption of the .dat format.)

Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: David E. Wheeler <david@justatheory.com>
Discussion: https://www.postgresql.org/message-id/flat/dae6fe89-1e0c-4c3f-8d92-19d23374fb10%40eisentraut.org
2025-09-03 09:45:17 +02:00
Richard Guo
aba8f61c30 Fix planner error when estimating SubPlan cost
SubPlan nodes are typically built very early, before any RelOptInfos
have been constructed for the parent query level.  As a result, the
simple_rel_array in the parent root has not yet been initialized.
Currently, during cost estimation of a SubPlan's testexpr, we may call
examine_variable() to look up statistical data about the expressions.
This can lead to "no relation entry for relid" errors.

To fix, pass root as NULL to cost_qual_eval() in cost_subplan(), since
the root does not yet contain enough information to safely consult
statistics.

One exception is SubPlan nodes built for the initplans of MIN/MAX
aggregates from indexes.  In this case, having a NULL root is safe
because testexpr will be NULL.  Additionally, an initplan will by
definition not consult anything from the parent plan.

Backpatch to all supported branches.  Although the reported call path
that triggers this error is not reachable prior to v17, there's no
guarantee that other code paths -- especially in extensions -- could
not encounter the same issue when cost_qual_eval() is called with a
root that lacks a valid simple_rel_array.  The test case is not
included in pre-v17 branches though.

Bug: #19037
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19037-3d1c7bb553c7ce84@postgresql.org
Backpatch-through: 13
2025-09-03 16:00:38 +09:00
Amit Kapila
f2dbc83501 Fix use-after-free issue in slot synchronization.
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 18, where it was introduced
Discussion: https://postgr.es/m/CANhcyEXMrcEdzj-RNGJam0nJHM4y+ttdWsgUCFmXciM7BNKc7A@mail.gmail.com
2025-09-03 06:31:05 +00:00
Michael Paquier
c6ea528b47 Update outdated references to the SLRU ControlLock
SLRU bank locks are referred as "bank locks" or "SLRU bank locks" in the
code comments.  The comments updated in this commit use the latter term.

Oversight in 53c2a97a92, that has replaced the single ControlLock by
the bank control locks.

Author: Julien Rouhaud <julien.rouhaud@free.fr>
Discussion: https://postgr.es/m/aLUT2UO8RjJOzZNq@jrouhaud
Backpatch-through: 17
2025-09-03 10:20:28 +09:00
Fujii Masao
229911c4bf Add HINT for COPY TO when WHERE clause is used.
COPY TO does not support a WHERE clause, and currently fails with the error:

    ERROR:  WHERE clause not allowed with COPY TO

Since the intended behavior can be achieved by using
COPY (SELECT ... WHERE ...) TO, this commit adds a HINT
to the error message:

    HINT:  Try the COPY (SELECT ... WHERE ...) TO variant.

This makes the error more informative and helps users
quickly find the alternative usage.

Author: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/3520c224c5ffac0113aef84a9179f37e@oss.nttdata.com
2025-09-03 08:35:55 +09:00
Tom Lane
1b1960c8c9 Improve error message for duplicate labels when creating an enum type.
Previously, duplicate labels in CREATE TYPE AS ENUM were caught by
the unique index on pg_enum, resulting in a generic error message.
While this was evidently intentional, it's not terribly user-friendly,
nor consistent with the ALTER TYPE cases which take more care with
such errors.  This patch adds an explicit check to produce a more
user-friendly and descriptive error message.

A potential objection to this implementation is that it adds O(N^2)
work to the creation operation.  However, quick testing finds that
that's pretty negligible below 1000 enum labels, and tolerable even
at 10000.  So it doesn't really seem worth being smarter.

Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20250704000402.37e605ab0c59c300965a17ee@sraoss.co.jp
2025-09-02 13:50:56 -04:00
Michael Paquier
eccba079c2 Generate pgstat_count_slru*() functions for slru using macros
This change replaces seven functions definitions by macros, reducing a
bit some repetitive patterns in the code.  An interesting side effect is
that this removes an inconsistency in the naming of SLRU increment
functions with the field names.

This change is similar to 850f4b4c8c, 8018ffbf58 or 83a1a1b566.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aLHA//gr4dTpDHHC@ip-10-97-1-34.eu-west-3.compute.internal
2025-09-02 16:22:03 +09:00
Amit Kapila
a850be2fe6 Add max_retention_duration option to subscriptions.
This commit introduces a new subscription parameter,
max_retention_duration, aimed at mitigating excessive accumulation of dead
tuples when retain_dead_tuples is enabled and the apply worker lags behind
the publisher.

When the time spent advancing a non-removable transaction ID exceeds the
max_retention_duration threshold, the apply worker will stop retaining
conflict detection information. In such cases, the conflict slot's xmin
will be set to InvalidTransactionId, provided that all apply workers
associated with the subscription (with retain_dead_tuples enabled) confirm
the retention duration has been exceeded.

To ensure retention status persists across server restarts, a new column
subretentionactive has been added to the pg_subscription catalog. This
prevents unnecessary reactivation of retention logic after a restart.

The conflict detection slot will not be automatically re-initialized
unless a new subscription is created with retain_dead_tuples = true, or
the user manually re-enables retain_dead_tuples.

A future patch will introduce support for automatic slot re-initialization
once at least one apply worker confirms that the retention duration is
within the configured max_retention_duration.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2025-09-02 03:20:18 +00:00
Richard Guo
317c117d6d Fix const-simplification for constraints and stats
Constraint expressions and statistics expressions loaded from the
system catalogs need to be run through const-simplification, because
the planner will be comparing them to similarly-processed qual
clauses.  Without this step, the planner may fail to detect valid
matches.

Currently, NullTest clauses in these expressions may not be reduced
correctly during const-simplification.  This happens because their Var
nodes do not yet have the correct varno when eval_const_expressions is
applied.  Since eval_const_expressions relies on varno to reduce
NullTest quals, incorrect varno can cause problems.

Additionally, for statistics expressions, eval_const_expressions is
called with root set to NULL, which also inhibits NullTest reduction.

This patch fixes the issue by ensuring that Vars are updated to have
the correct varno before const-simplification, and that a valid root
is passed to eval_const_expressions when needed.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/19007-4cc6e252ed8aa54a@postgresql.org
2025-08-31 08:59:48 +09:00
Nathan Bossart
5487058b56 Prepare DSM registry for upcoming changes to LWLock tranche names.
A proposed patch would place a limit of NAMEDATALEN-1 (i.e., 63)
bytes on the names of dynamically-allocated LWLock tranches, but
GetNamedDSA() and GetNamedDSHash() may register tranches with
longer names.  This commit lowers the maximum DSM registry entry
name length to NAMEDATALEN-1 bytes and modifies GetNamedDSHash() to
create only one tranche, thereby allowing us to keep the DSM
registry's tranche names below NAMEDATALEN bytes.

Author: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aKzIg1JryN1qhNuy%40nathan
2025-08-29 20:34:53 -05:00
Tom Lane
f727b63e81 Provide error context when an error is thrown within WaitOnLock().
Show the requested lock level and the object being waited on,
in the same format we use for deadlock reports and similar errors.
This is particularly helpful for debugging lock-timeout errors,
since otherwise the user has very little to go on about which
lock timed out.  The performance cost of setting up the callback
should be negligible compared to the other tracing support already
present in WaitOnLock.

As in the deadlock-report case, we just show numeric object OIDs,
because it seems too scary to try to perform catalog lookups
in this context.

Reported-by: Steve Baldwin <steve.baldwin@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1602369.1752167154@sss.pgh.pa.us
2025-08-29 15:43:34 -04:00
Nathan Bossart
67fcf48c3b Make LWLockCounter a global variable.
Using the LWLockCounter requires first calculating its address in
shared memory like this:

	LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));

Commit 82e861fbe1 started this trend in order to fix EXEC_BACKEND
builds, but it could also be fixed by adding it to the
BackendParameters struct.  The current approach is somewhat
difficult to follow, so this commit switches to the latter.  While
at it, swap around the code in LWLockShmemSize() to match the order
of assignments in CreateLWLocks() for added readability.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aLDLnan9gNCS9fHx%40nathan
2025-08-29 12:13:37 -05:00
Nathan Bossart
6fbd7b93c6 Remove unused parameter from ProcessSlotSyncInterrupts().
Oversight in commit 93db6cbda0.

Author: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/tencent_7B42BBE8D0A5C28DDAB91436192CBCCB8307%40qq.com
2025-08-29 10:56:10 -05:00
David Rowley
da9f9f75e5 Fix possible use after free in expand_partitioned_rtentry()
It's possible that if the only live partition is concurrently dropped
and try_table_open() fails, that the bms_del_member() will pfree the
live_parts Bitmapset.  Since the bms_del_member() call does not assign
the result back to the live_parts local variable, the while loop could
segfault as that variable would still reference the pfree'd Bitmapset.

Backpatch to 15. 52f3de874 was backpatched to 14, but there's no
bms_del_member() there due to live_parts not yet existing in RelOptInfo in
that version.  Technically there's no bug in version 15 as
bms_del_member() didn't pfree when the set became empty prior to
00b41463c (from v16).  Applied to v15 anyway to keep the code similar and
to avoid the bad coding pattern.

Author: Bernd Reiß <bd_reiss@gmx.at>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/6b88f27a-c45c-4826-8e37-d61a04d90182@gmx.at
Backpatch-through: 15
2025-08-30 00:50:50 +12:00
Álvaro Herrera
f225473cba
CREATE STATISTICS: improve misleading error message
I think the error message for a different condition was inadvertently
copied.

This problem seems to have been introduced by commit a4d75c86bf.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 14
Discussion: https://postgr.es/m/CACJufxEZ48toGH0Em_6vdsT57Y3L8pLF=DZCQ_gCii6=C3MeXw@mail.gmail.com
2025-08-29 14:43:47 +02:00
Peter Eisentraut
991295f387 Mark ItemPointer arguments as const in tuple/table lock functions
The functions LockTuple, ConditionalLockTuple, UnlockTuple, and
XactLockTableWait take an ItemPointer argument that they do not
modify, so the argument can be const-qualified to better convey intent
and allow the compiler to enforce immutability.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2m9e4rECHBwpRE4%2BGCH%2BpbYZXLh2f4rB1Du5hDfKug%2BOg%40mail.gmail.com
2025-08-29 07:39:58 +02:00
Peter Eisentraut
710e6c4301 Remove unneeded casts of BufferGetPage() result
BufferGetPage() already returns type Page, so casting it to Page
doesn't achieve anything.  A sizable number of call sites does this
casting; remove that.

This was already done inconsistently in the code in the first import
in 1996 (but didn't exist in the pre-1995 code), and it was then
apparently just copied around.

Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CALdSSPgFhc5=vLqHdk-zCcnztC0zEY3EU_Q6a9vPEaw7FkE9Vw@mail.gmail.com
2025-08-29 07:18:29 +02:00
Richard Guo
97b0f36bde Fix semijoin unique-ification for child relations
For a child relation, we should not assume that its parent's
unique-ified relation (or unique-ified path in v18) always exists.  In
cases where all RHS columns that need to be unique-ified are equated
to constants, the unique-ified relation/path for the parent table is
not built, as there are no columns left to unique-ify.  Failing to
account for this can result in a SIGSEGV crash during planning.

This patch checks whether the parent's unique-ified relation or path
exists and skips unique-ification of the child relation if it does
not.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49MOdLW2c+qbLHHBt8VBu=4ONpM91D19=AWeW93eFUF6A@mail.gmail.com
Backpatch-through: 18
2025-08-29 13:14:12 +09:00
Masahiko Sawada
fabd8b8e2a Use LW_SHARED in walsummarizer.c for WALSummarizerLock lock where possible.
Previously, we used LW_EXCLUSIVE in several places despite only reading
WalSummarizerCtl fields. This patch reduces the lock level to LW_SHARED
where we are only reading the shared fields.

Backpatch to 17, where wal summarization was introduced.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CAD21AoDdKhf_9oriEYxY-JCdF+Oe_muhca3pcdkMEdBMzyHyKw@mail.gmail.com
Backpatch-through: 17
2025-08-28 17:06:42 -07:00
Tom Lane
b8a1bdc458 Fix "variable not found in subplan target lists" in semijoin de-duplication.
One mechanism we have for implementing semi-joins is to de-duplicate
the output of the RHS and then treat the join as a plain inner join.
Initial construction of the join's SpecialJoinInfo identifies the
RHS columns that need to be de-duplicated, but later we may find that
some of those don't need to be handled explicitly, either because
they're known to be constant or because they are redundant with some
previous column.

Up to now, while sort-based de-duplication handled such cases well,
hash-based de-duplication didn't: we'd still hash on all of the
originally-identified columns.  This is probably not a very big
deal performance-wise, but in the wake of commit a3179ab69 it can
cause planner errors.  That happens when join elimination causes
recalculation of variables' attr_needed bitmapsets, and we decide
that a variable mentioned in a semijoin clause doesn't need to be
propagated up to the join level anymore.

There are a number of ways we could slice the blame for this, but the
only fix that doesn't result in pessimizing plans for loosely-related
cases is to be more careful about not hashing columns we don't
actually need to de-duplicate.  We can install that consideration
into create_unique_paths in master, or the predecessor code in
create_unique_path in v18, without much refactoring.

(As follow-up work, it might be a good idea to look at more-invasive
refactoring, in hopes of preventing other bugs in this area.  But
with v18 release so close, there's not time for that now, nor would
we be likely to want to put such refactoring into v18 anyway.)

Reported-by: Sergey Soloviev <sergey.soloviev@tantorlabs.ru>
Diagnosed-by: Richard Guo <guofenglinux@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/1fd1a421-4609-4d46-a1af-ab74d5de504a@tantorlabs.ru
Backpatch-through: 18
2025-08-28 13:49:23 -04:00
Álvaro Herrera
325fc0ab14
Avoid including commands/dbcommands.h in so many places
This has been done historically because of get_database_name (which
since commit cb98e6fb8f belongs in lsyscache.c/h, so let's move it
there) and get_database_oid (which is in the right place, but whose
declaration should appear in pg_database.h rather than dbcommands.h).
Clean this up.

Also, xlogreader.h and stringinfo.h are no longer needed by dbcommands.h
since commit f1fd515b39, so remove them.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/202508191031.5ipojyuaswzt@alvherre.pgsql
2025-08-28 12:39:04 +02:00
Peter Eisentraut
80f1106132 Message style improvements
An improvement pass over the new stats import functionality.
2025-08-28 09:09:26 +02:00
Andres Freund
5865150b6d aio: Stop using enum bitfields due to bad code generation
During an investigation into rather odd aio related errors on macos, observed
by Alexander and Konstantin, we started to wonder if bitfield access is
related to the error. At the moment it looks like it is related, we cannot
reproduce the failures when replacing the bitfields. In addition, the problem
can only be reproduced with some compiler [versions] and not everyone has been
able to reproduce the issue.

The observed problem is that, very rarely, PgAioHandle->{state,target} are in
an inconsistent state, after having been checked to be in a valid state not
long before, triggering an assertion failure. Unfortunately, this could be
caused by wrong compiler code generation or somehow of missing memory barriers
- we don't really know. In theory there should not be any concurrent write
access to the handle in the state the bug is triggered, as the handle was idle
and is just being initialized.

Separately from the bug, we observed that at least gcc and clang generate
rather terrible code for the bitfield access. Even if it's not clear if the
observed assertion failure is actually caused by the bitfield somehow, the bad
code generation alone is sufficient reason to stop using bitfields.

Therefore, replace the enum bitfields with uint8s and instead cast in each
switch statement.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/1500090.1745443021@sss.pgh.pa.us
Backpatch-through: 18
2025-08-27 19:12:11 -04:00
Peter Eisentraut
e36fa9319b Improve objectNamesToOids() comment
Commit d31bbfb659 removed the comment at objectNamesToOids() that
there is no locking, because that commit added locking.  But to fix
all the problems, we'd still need a stronger lock.  So put the comment
back with more a detailed explanation.

Co-authored-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/bf72b82c-124d-4efa-a484-bb928e9494e4@eisentraut.org
2025-08-27 17:46:26 +02:00
Peter Eisentraut
990c8db182 Fix: Don't strip $libdir from nested module_pathnames
This patch fixes a bug in how 'load_external_function' handles
'$libdir/ prefixes in module paths.

Previously, 'load_external_function' would unconditionally strip
'$libdir/' from the beginning of the 'filename' string.  This caused
an issue when the path was nested, such as "$libdir/nested/my_lib".
Stripping the prefix resulted in a path of "nested/my_lib", which
would fail to be found by the expand_dynamic_library_name function
because the original '$libdir' macro was removed.

To fix this, the code now checks for the presence of an additional
directory separator ('/' or '\') after the '$libdir/' prefix.  The
prefix is only stripped if the remaining string does not contain a
separator.  This ensures that simple filenames like '"$libdir/my_lib"'
are correctly handled, while nested paths are left intact for
'expand_dynamic_library_name' to process correctly.

Reported-by: Dilip Kumar <dilipbalaut@gmail.com>
Co-authored-by: Matheus Alcantara <matheusssilv97@gmail.com>
Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAFiTN-uKNzAro4tVwtJhF1UqcygfJ%2BR%2BRL%3Db-_ZMYE3LdHoGhA%40mail.gmail.com
2025-08-27 15:49:58 +02:00
Peter Eisentraut
e567e22290 Message style improvements
Mostly adding some quoting.
2025-08-26 22:52:11 +02:00
Tom Lane
327b7324d0 Put "excludeOnly" GIN scan keys at the end of the scankey array.
Commit 4b754d6c1 introduced the concept of an excludeOnly scan key,
which cannot select matching index entries but can reject
non-matching tuples, for example a tsquery such as '!term'.  There are
poorly-documented assumptions that such scan keys do not appear as the
first scan key.  ginNewScanKey did nothing to ensure that, however,
with the result that certain GIN index searches could go into an
infinite loop while apparently-equivalent queries with the clauses in
a different order were fine.

Fix by teaching ginNewScanKey to place all excludeOnly scan keys
after all not-excludeOnly ones.  So far as we know at present,
it might be sufficient to avoid the case where the very first
scan key is excludeOnly; but I'm not very convinced that there
aren't other dependencies on the ordering.

Bug: #19031
Reported-by: Tim Wood <washwithcare@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19031-0638148643d25548@postgresql.org
Backpatch-through: 13
2025-08-26 12:08:57 -04:00
Tom Lane
b55068236c Do CHECK_FOR_INTERRUPTS inside, not before, scanGetItem.
The CHECK_FOR_INTERRUPTS call in gingetbitmap turns out to be
inadequate to prevent a long uninterruptible loop, because
we now know a case where looping occurs within scanGetItem.
While the next patch will fix the bug that caused that, it
seems foolish to assume that no similar patterns are possible.
Let's do the CFI within scanGetItem's retry loop, instead.
This demonstrably allows canceling out of the loop exhibited
in bug #19031.

Bug: #19031
Reported-by: Tim Wood <washwithcare@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19031-0638148643d25548@postgresql.org
Backpatch-through: 13
2025-08-26 11:38:41 -04:00
Alexander Korotkov
5f6f951f88 Improve RowMark handling during Self-Join Elimination
The Self-Join Elimination SJE feature messes up keeping and removing RowMark's
in remove_self_joins_one_group().  That didn't lead to user-level error,
because the planned RowMark is only used to reference a rtable entry in later
execution stages.  An RTE entry for keeping and removing relations is
identical and refers to the same relation OID.

To reduce confusion and prevent future issues, this commit cleans up the code
and fixes the incorrect behaviour.  Furthermore, it includes sanity checks in
setrefs.c on existing non-null RTE and RelOptInfo entries for each RowMark.

Discussion: https://postgr.es/m/18c6bd6c-6d2a-419a-b0da-dfedef34b585%40gmail.com
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Backpatch-through: 18
2025-08-26 13:23:18 +03:00
Alexander Korotkov
d713cf9b65 Refactor variable names in remove_self_joins_one_group()
Rename inner and outer to rrel and krel, respectively, to highlight their
connection to r and k indexes.  For the same reason, rename imark and omark
to rmark and kmark.

Discussion: https://postgr.es/m/18c6bd6c-6d2a-419a-b0da-dfedef34b585%40gmail.com
Author: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Backpatch-through: 18
2025-08-26 13:22:43 +03:00
Peter Eisentraut
99234e9ddc Message wording improvements
Use "row" instead of "tuple" for user-facing information for
logical replication conflicts.
2025-08-25 23:15:24 +02:00
Nathan Bossart
989b2e4d5c Use PqMsg_* macros in applyparallelworker.c.
Oversight in commit f4b54e1ed9.

Author: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://postgr.es/m/CAEudQAobFsHaLMypA6C96-9YExvF4AcU1xNPoPuNYRVm3mq4dg%40mail.gmail.com
2025-08-25 14:11:01 -05:00
Peter Eisentraut
878656dbde Formatting cleanup of guc_tables.c
This cleans up a few minor formatting inconsistencies.

Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/dae6fe89-1e0c-4c3f-8d92-19d23374fb10%40eisentraut.org
2025-08-25 09:10:27 +02:00
Alexander Korotkov
c13070a27b Revert "Get rid of WALBufMappingLock"
This reverts commit bc22dc0e0d.
It appears that conditional variables are not suitable for use inside
critical sections.  If WaitLatch()/WaitEventSetWaitBlock() face postmaster
death, they exit, releasing all locks instead of PANIC.  In certain
situations, this leads to data corruption.

Reported-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Backpatch-through: 18
2025-08-22 19:26:38 +03:00
Heikki Linnakangas
661f821ef0 Use ereport() rather than elog()
Noah pointed this out before I committed 50f770c3d9, but I
accidentally pushed the old version with elog() anyway. Oops.

Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/20250820003756.31.nmisch@google.com
2025-08-22 13:35:05 +03:00
Heikki Linnakangas
50f770c3d9 Revert GetTransactionSnapshot() to return historic snapshot during LR
Commit 1585ff7387 changed GetTransactionSnapshot() to throw an error
if it's called during logical decoding, instead of returning the
historic snapshot. I made that change for extra protection, because a
historic snapshot can only be used to access catalog tables while
GetTransactionSnapshot() is usually called when you're executing
arbitrary queries. You might get very subtle visibility problems if
you tried to use the historic snapshot for arbitrary queries.

There's no built-in code in PostgreSQL that calls
GetTransactionSnapshot() during logical decoding, but it turns out
that the pglogical extension does just that, to evaluate row filter
expressions. You would get weird results if the row filter runs
arbitrary queries, but it is sane as long as you don't access any
non-catalog tables. Even though there are no checks to enforce that in
pglogical, a typical row filter expression does not access any tables
and works fine. Accessing tables marked with the user_catalog_table =
true option is also OK.

To fix pglogical with row filters, and any other extensions that might
do similar things, revert GetTransactionSnapshot() to return a
historic snapshot during logical decoding.

To try to still catch the unsafe usage of historic snapshots, add
checks in heap_beginscan() and index_beginscan() to complain if you
try to use a historic snapshot to scan a non-catalog table. We're very
close to the version 18 release however, so add those new checks only
in master.

Backpatch-through: 18
Reported-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/20250809222338.cc.nmisch@google.com
2025-08-22 13:07:46 +03:00
Peter Eisentraut
16a0039dc0 Reduce lock level for ALTER DOMAIN ... VALIDATE CONSTRAINT
Reduce from ShareLock to ShareUpdateExclusivelock.  Validation during
ALTER DOMAIN ... ADD CONSTRAINT keeps using ShareLock.

Example:

    create domain d1 as int;
    create table t (a d1);
    alter domain d1 add constraint cc10 check (value > 10) not valid;

    begin;
    alter domain d1 validate constraint cc10;

    -- another session
    insert into t values (8);

Now we should still be able to perform DML operations on table t while
the domain constraint is being validated.  The equivalent works
already on table constraints.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxHz92A88NLRTA2msgE2dpXpE-EoZ2QO61od76-6bfqurA%40mail.gmail.com
2025-08-22 08:56:11 +02:00
Michael Paquier
13b935cd52 Change dynahash.c and hsearch.h to use int64 instead of long
This code was relying on "long", which is signed 8 bytes everywhere
except on Windows where it is 4 bytes, that could potentially expose it
to overflows, even if the current uses in the code are fine as far as I
know.  This code is now able to rely on the same sizeof() variable
everywhere, with int64.  long was used for sizes, partition counts and
entry counts.

Some callers of the dynahash.c routines used long declarations, that can
be cleaned up to use int64 instead.  There was one shortcut based on
SIZEOF_LONG, that can be removed.  long is entirely removed from
dynahash.c and hsearch.h.

Similar work was done in b1e5c9fa9a.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aKQYp-bKTRtRauZ6@paquier.xyz
2025-08-22 11:59:02 +09:00
Michael Paquier
ef03ea01fe Ignore temporary relations in RelidByRelfilenumber()
Temporary relations may share the same RelFileNumber with a permanent
relation, or other temporary relations associated with other sessions.

Being able to uniquely identify a temporary relation would require
RelidByRelfilenumber() to know about the proc number of the temporary
relation it wants to identify, something it is not designed for since
its introduction in f01d1ae3a1.

There are currently three callers of RelidByRelfilenumber():
- autoprewarm.
- Logical decoding, reorder buffer.
- pg_filenode_relation(), that attempts to find a relation OID based on
a tablespace OID and a RelFileNumber.

This makes the situation problematic particularly for the first two
cases, leading to the possibility of random ERRORs due to
inconsistencies that temporary relations can create in the cache
maintained by RelidByRelfilenumber().  The third case should be less of
an issue, as I suspect that there are few direct callers of
pg_filenode_relation().

The window where the ERRORs are happen is very narrow, requiring an OID
wraparound to create a lookup conflict in RelidByRelfilenumber() with a
temporary table reusing the same OID as another relation already cached.
The problem is easier to reach in workloads with a high OID consumption
rate, especially with a higher number of temporary relations created.

We could get pg_filenode_relation() and RelidByRelfilenumber() to work
with temporary relations if provided the means to identify them with an
optional proc number given in input, but the years have also shown that
we do not have a use case for it, yet.  Note that this could not be
backpatched if pg_filenode_relation() needs changes.  It is simpler to
ignore temporary relations.

Reported-by: Shenhao Wang <wangsh.fnst@fujitsu.com>
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-By: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-By: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-By: Michael Paquier <michael@paquier.xyz>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-By: Shenhao Wang <wangsh.fnst@fujitsu.com>
Discussion: https://postgr.es/m/bbaaf9f9-ebb2-645f-54bb-34d6efc7ac42@fujitsu.com
Backpatch-through: 13
2025-08-22 09:03:59 +09:00
Peter Eisentraut
47932f3cdc Use consistent type for pgaio_io_get_id() result
The result of pgaio_io_get_id() was being assigned to a mix of int and
uint32 variables.  This fixes it to use int consistently, which seems
the most correct.  Also change the queue empty special value in
method_worker.c to -1 from UINT32_MAX.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/70c784b3-f60b-4652-b8a6-75e5f051243e%40eisentraut.org
2025-08-21 19:45:25 +02:00
Fujii Masao
12da45742c Disallow server start with sync_replication_slots = on and wal_level < logical.
Replication slot synchronization (sync_replication_slots = on)
requires wal_level to be logical. This commit prevents the server
from starting if sync_replication_slots is enabled but wal_level
is set to minimal or replica.

Failing early during startup helps users catch invalid configurations
immediately, which is important because changing wal_level requires
a server restart.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAH0PTU_pc3oHi__XESF9ZigCyzai1Mo3LsOdFyQA4aUDkm01RA@mail.gmail.com
2025-08-21 22:18:11 +09:00
Tom Lane
a67d4847a4 Fix re-execution of a failed SQLFunctionCache entry.
If we error out during execution of a SQL-language function, we will
often leave behind non-null pointers in its SQLFunctionCache's cplan
and eslist fields.  This is problematic if the SQLFunctionCache is
re-used, because those pointers will point at resources that were
released during error cleanup.  This problem escaped detection so far
because ordinarily we won't re-use an FmgrInfo+SQLFunctionCache struct
after a query error.  However, in the rather improbable case that
someone implements an opclass support function in SQL language, there
will be long-lived FmgrInfos for it in the relcache, and then the
problem is reachable after the function throws an error.

To fix, add a flag to SQLFunctionCache that tracks whether execution
escapes out of fmgr_sql, and clear out the relevant fields during
init_sql_fcache if so.  (This is going to need more thought if we ever
try to share FMgrInfos across threads; but it's very far from being
the only problem such a project will encounter, since many functions
regard fn_extra as being query-local state.)

This broke at commit 0313c5dc6; before that we did not try to re-use
SQLFunctionCache state across calls.  Hence, back-patch to v18.

Bug: #19026
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19026-90aed5e71d0c8af3@postgresql.org
Backpatch-through: 18
2025-08-20 16:09:18 -04:00
Peter Eisentraut
e9c043a11a Minor error message enhancement
In refuseDupeIndexAttach(), change from

    errdetail("Another index is already attached for partition \"%s\"."...)

to

    errdetail("Another index \"%s\" is already attached for partition \"%s\"."...)

so we can easily understand which index is already attached for
partition \"%s\".

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CACJufxGBfykJ_1ztk9T%2BL_gLmkOSOF%2BmL9Mn4ZPydz-rh%3DLccQ%40mail.gmail.com
2025-08-20 18:14:24 +02:00
Michael Paquier
1f2e51e3c7 Fix assertion failure with replication slot release in single-user mode
Some replication slot manipulations (logical decoding via SQL,
advancing) were failing an assertion when releasing a slot in
single-user mode, because active_pid was not set in a ReplicationSlot
when its slot is acquired.

ReplicationSlotAcquire() has some logic to be able to work with the
single-user mode.  This commit sets ReplicationSlot->active_pid to
MyProcPid, to let the slot-related logic fall-through, considering the
single process as the one holding the slot.

Some TAP tests are added for various replication slot functions with the
single-user mode, while on it, for slot creation, drop, advancing, copy
and logical decoding with multiple slot types (temporary, physical vs
logical).  These tests are skipped on Windows, as direct calls of
postgres --single would fail on permission failures.  There is no
platform-specific behavior that needs to be checked, so living with this
restriction should be fine.  The CI is OK with that, now let's see what
the buildfarm tells.

Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Mutaamba Maasha <maasha@gmail.com>
Discussion: https://postgr.es/m/OSCPR01MB14966ED588A0328DAEBE8CB25F5FA2@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 13
2025-08-20 15:00:04 +09:00
Nathan Bossart
3eec0e6533 Fix comment for MAX_SIMUL_LWLOCKS.
This comment mentions that pg_buffercache locks all buffer
partitions simultaneously, but it hasn't done so since v10.

Oversight in commit 6e654546fb.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/aKTuAHVEuYCUmmIy%40nathan
2025-08-19 16:48:22 -05:00
Amit Kapila
aa21e49225 Fix self-deadlock during DROP SUBSCRIPTION.
The DROP SUBSCRIPTION command performs several operations: it stops the
subscription workers, removes subscription-related entries from system
catalogs, and deletes the replication slot on the publisher server.
Previously, this command acquired an AccessExclusiveLock on
pg_subscription before initiating these steps.

However, while holding this lock, the command attempts to connect to the
publisher to remove the replication slot. In cases where the connection is
made to a newly created database on the same server as subscriber, the
cache-building process during connection tries to acquire an
AccessShareLock on pg_subscription, resulting in a self-deadlock.

To resolve this issue, we reduce the lock level on pg_subscription during
DROP SUBSCRIPTION from AccessExclusiveLock to RowExclusiveLock. Earlier,
the higher lock level was used to prevent the launcher from starting a new
worker during the drop operation, as a restarted worker could become
orphaned.

Now, instead of relying on a strict lock, we acquire an AccessShareLock on
the specific subscription being dropped and re-validate its existence
after acquiring the lock. If the subscription is no longer valid, the
worker exits gracefully. This approach avoids the deadlock while still
ensuring that orphan workers are not created.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/18988-7312c868be2d467f@postgresql.org
2025-08-19 05:33:17 +00:00
Michael Paquier
a977e419ee Refactor ReadMultiXactCounts() into GetMultiXactInfo()
This provides a single entry point to access some information about the
state of MultiXacts, able to return some data about multixacts offsets
and counts.  Originally this function was only able to return some
information about the number of multixacts and multixact members,
extended here to provide some data about the oldest multixact ID in use
and the oldest offset, if known.

This change has been proposed in a patch that aims at providing more
monitoring capabilities for multixacts, and it is useful on its own.
GetMultiXactInfo() is added to multixact.h, becoming available for
out-of-core code.

Extracted from a larger patch by the same author.

Author: Naga Appani <nagnrik@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com
2025-08-19 14:04:09 +09:00
Michael Paquier
9b7eb6f02e Remove useless pointer update in StatsShmemInit()
This pointer was not used after its last update.  This variable
assignment was most likely a vestige artifact of the earlier versions of
the patch set that have led to 5891c7a8ed.

This pointer update is useless, so let's remove it.  It removes one call
to pgstat_dsa_init_size(), making the code slightly easier to grasp.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aKLsu2sdpnyeuSSc@ip-10-97-1-34.eu-west-3.compute.internal
2025-08-19 09:54:18 +09:00
Richard Guo
bf9ee294e5 Simplify relation_has_unique_index_for()
Now that the only call to relation_has_unique_index_for() that
supplied an exprlist and oprlist has been removed, the loop handling
those lists is effectively dead code.  This patch removes that loop
and simplifies the function accordingly.

Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com
2025-08-19 09:37:04 +09:00
Richard Guo
24225ad9aa Pathify RHS unique-ification for semijoin planning
There are two implementation techniques for semijoins: one uses the
JOIN_SEMI jointype, where the executor emits at most one matching row
per left-hand side (LHS) row; the other unique-ifies the right-hand
side (RHS) and then performs a plain inner join.

The latter technique currently has some drawbacks related to the
unique-ification step.

* Only the cheapest-total path of the RHS is considered during
unique-ification.  This may cause us to miss some optimization
opportunities; for example, a path with a better sort order might be
overlooked simply because it is not the cheapest in total cost.  Such
a path could help avoid a sort at a higher level, potentially
resulting in a cheaper overall plan.

* We currently rely on heuristics to choose between hash-based and
sort-based unique-ification.  A better approach would be to generate
paths for both methods and allow add_path() to decide which one is
preferable, consistent with how path selection is handled elsewhere in
the planner.

* In the sort-based implementation, we currently pay no attention to
the pathkeys of the input subpath or the resulting output.  This can
result in redundant sort nodes being added to the final plan.

This patch improves semijoin planning by creating a new RelOptInfo for
the RHS rel to represent its unique-ified version.  It then generates
multiple paths that represent elimination of distinct rows from the
RHS, considering both a hash-based implementation using the cheapest
total path of the original RHS rel, and sort-based implementations
that either exploit presorted input paths or explicitly sort the
cheapest total path.  All resulting paths compete in add_path(), and
those deemed worthy of consideration are added to the new RelOptInfo.
Finally, the unique-ified rel is joined with the other side of the
semijoin using a plain inner join.

As a side effect, most of the code related to the JOIN_UNIQUE_OUTER
and JOIN_UNIQUE_INNER jointypes -- used to indicate that the LHS or
RHS path should be made unique -- has been removed.  Besides, the
T_Unique path now has the same meaning for both semijoins and upper
DISTINCT clauses: it represents adjacent-duplicate removal on
presorted input.  This patch unifies their handling by sharing the
same data structures and functions.

This patch also removes the UNIQUE_PATH_NOOP related code along the
way, as it is dead code -- if the RHS rel is provably unique, the
semijoin should have already been simplified to a plain inner join by
analyzejoins.c.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-EBnaRvEs7frTLbsXiweSTUXifsteF-d3rvv01FKO86w@mail.gmail.com
2025-08-19 09:35:40 +09:00
Michael Paquier
24e71d53f8 Remove unneeded header declarations in multixact.c
Two header declarations were related to SQL-callable functions, that
should have been cleaned up in df9133fa63.  Some more includes can be
removed on closer inspection, so let's clean up these as well, while on
it.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/345438.1755524834@sss.pgh.pa.us
2025-08-19 08:57:20 +09:00
David Rowley
a98ccf727e Remove HASH_DEBUG output from dynahash.c
This existed in a semi broken stated from be0a66666 until 296cba276.
Recent discussion has questioned the value of having this at all as it
only outputs static information from various of the hash table's
properties when the hash table is created.

Author: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/OSCPR01MB1496650D03FA0293AB9C21416F534A@OSCPR01MB14966.jpnprd01.prod.outlook.com
2025-08-19 11:14:21 +12:00
David Rowley
05fcb9667c Use elog(DEBUG4) for dynahash.c statistics output
Previously this was being output to stderr.  This commit adjusts things
to use elog(DEBUG4).  Here we also adjust the format of the message to
add the hash table name and also put the message on a single line.  This
should make grepping the logs for this information easier.

Also get rid of the global hash table statistics.  This seems very dated
and didn't fit very well with trying to put all the statistics for a
specific hash table on a single log line.

The main aim here is to allow it so we can have at least one buildfarm
member build with HASH_STATISTICS to help prevent future changes from
breaking things in that area.  ca3891251 recently fixed some issues
here.

In passing, switch to using uint64 data types rather than longs for the
usage counters.  The long type is 32 bits on some platforms we support.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAApHDvoccvJ9CG5zx+i-EyCzJbcL5K=CzqrnL_YN59qaL5hiaw@mail.gmail.com
2025-08-19 10:57:44 +12:00
Peter Eisentraut
c61d51d500 Detect buffer underflow in get_th()
Input with zero length can result in a buffer underflow when
accessing *(num + (len - 1)), as (len - 1) would produce a negative
index.  Add an assertion for zero-length input to prevent it.

This was found by ALT Linux Team.

Reviewing the call sites shows that get_th() currently cannot be
applied to an empty string: it is always called on a string containing
a number we've just printed.  Therefore, an assertion rather than a
user-facing error message is sufficient.

Co-authored-by: Alexander Kuznetsov <kuznetsovam@altlinux.org>
Discussion: https://www.postgresql.org/message-id/flat/e22df993-cdb4-4d0a-b629-42211ebed582@altlinux.org
2025-08-18 11:03:22 +02:00
Michael Paquier
df9133fa63 Move SQL-callable code related to multixacts into its own file
A patch is under discussion to add more SQL capabilities related to
multixacts, and this move avoids bloating the file more than necessary.
This affects pg_get_multixact_members().  A side effect of this move is
the requirement to add mxstatus_to_string() to multixact.h.

Extracted from a larger patch by the same author, tweaked by me.

Author: Naga Appani <nagnrik@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com
2025-08-18 14:57:55 +09:00
Michael Paquier
ba3d93b2e8 Refactor init_params() in sequence.c to not use FormData_pg_sequence_data
init_params() sets up "last_value" and "is_called" for a sequence
relation holdind its metadata, based on the sequence properties in
pg_sequences.  "log_cnt" is the third property that can be updated in
this routine for FormData_pg_sequence_data, tracking when WAL records
should be generated for a sequence after nextval() iterations.  This
routine is called when creating or altering a sequence.

This commit refactors init_params() to not depend anymore on
FormData_pg_sequence_data, removing traces of it in sequence.c, making
easier the manipulation of metadata related to sequences.  The knowledge
about "log_cnt" is replaced with a more general "reset_state" flag, to
let the caller know if the sequence state should be reset.  In the case
of in-core sequences, this relates to WAL logging.  We still need to
depend on FormData_pg_sequence.

Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/ZWlohtKAs0uVVpZ3@paquier.xyz
2025-08-18 11:38:44 +09:00
Masahiko Sawada
928da6ff12 Fix typos in comments.
Oversight in commit fd5a1a0c3e.

Author: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAHewXNmTT3M_w4NngG=6G3mdT3iJ6DdncTqV9YnGXBPHW8XYtA@mail.gmail.com
2025-08-16 01:11:40 -07:00
Masahiko Sawada
37265ca01f Fix constant when extracting timestamp from UUIDv7.
When extracting a timestamp from a UUIDv7, a conversion from
milliseconds to microseconds was using the incorrect constant
NS_PER_US instead of US_PER_MS. Although both constants have the same
value, this fix improves code clarity by using the semantically
correct constant.

Backpatch to v18, where UUIDv7 was introduced.

Author: Erik Nordström <erik@tigerdata.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CACAa4V+i07eaP6h4MHNydZeX47kkLPwAg0sqe67R=M5tLdxNuQ@mail.gmail.com
Backpatch-through: 18
2025-08-15 11:58:53 -07:00
David Rowley
296cba2760 Fix invalid format string in HASH_DEBUG code
This seems to have been broken back in be0a66666.

Reported-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/OSCPR01MB14966E11EEFB37D7857FCEDB7F535A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 14
2025-08-15 18:05:44 +12:00
David Rowley
ca38912512 Fix failing -D HASH_STATISTICS builds
This seems to have been broken for a few years by cc5ef90ed.

Author: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/OSCPR01MB14966E11EEFB37D7857FCEDB7F535A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 17
2025-08-15 17:23:45 +12:00
David Rowley
b4632883d4 Add Asserts to validate prevbit values in bms_prev_member
bms_prev_member() could attempt to access memory outside of the words[]
array in cases where the prevbit was a number < -1 or > a->nwords *
BITS_PER_BITMAPWORD + 1.

Here we add the Asserts to help draw attention to bogus callers so we're
more likely to catch them during development.

In passing, fix wording of bms_prev_member's header comment which talks
about how we expect the callers to ensure only valid prevbit values are
used.

Author: Greg Burd <greg@burd.me>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2000A717-1FFE-4031-827B-9330FB2E9065%40getmailspring.com
2025-08-15 16:33:07 +12:00
Álvaro Herrera
d0e7e04ede
Avoid including tableam.h and xlogreader.h in nbtree.h
Doing that seems rather random and unnecessary.  This commit removes
those and fixes fallout, which is pretty minimal.  We do need to add a
forward declaration of struct TM_IndexDeleteOp (whose full definition
appears in tableam.h) so that _bt_delitems_delete_check()'s declaration
can use it.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/202508051109.lzk3lcuzsaxo@alvherre.pgsql
2025-08-14 17:48:46 +02:00
Tom Lane
ed07361721 Don't leak memory during failure exit from SelectConfigFiles().
Make sure the memory allocated by make_absolute_path() is freed
when SelectConfigFiles() fails.  Since all the callers will exit
immediately in that case, there's no practical gain here, but
silencing Valgrind leak complaints seems useful.  In any case,
it was inconsistent that only one of the failure exits did this.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAJ7c6TMByXE8dc7zDvDWTQjk6o-XXAdRg_RAg5CBaUOgFPV3LQ%40mail.gmail.com
2025-08-14 11:39:19 -04:00
Heikki Linnakangas
4ec6e22b43 Fix LSN format in debug message
Commit 2633dae2e4 standardized all existing messages to use `%X/%08X`
for LSNs, but this one crept back in after the commit.
2025-08-14 13:31:18 +03:00
Michael Paquier
6304256e79 Fix compilation warning with SerializeClientConnectionInfo()
This function uses an argument named "maxsize" that is only used in
assertions, being set once outside the assertion area.  Recent gcc
versions with -Wunused-but-set-parameter complain about a warning when
building without assertions enabled, because of that.

In order to fix this issue, PG_USED_FOR_ASSERTS_ONLY is added to the
function argument of SerializeClientConnectionInfo(), which is the first
time we are doing so in the tree.  The CI is fine with the change, but
let's see what the buildfarm has to say on the matter.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Jacob Champion <jchampion@postgresql.org>
Discussion: https://postgr.es/m/pevajesswhxafjkivoq3yvwxga77tbncghlf3gq5fvchsvfuda@6uivg25sb3nx
Backpatch-through: 16
2025-08-14 16:21:50 +09:00
Fujii Masao
e9a31c0cc6 Revert logical snapshot filename format change in SnapBuildSnapshotExists().
Commit 2633dae2e4 standardized LSN formatting but mistakenly changed
the logical snapshot filename format in SnapBuildSnapshotExists() from
"%X-%X.snap" to "%08X-%08X.snap". Other code still used the original
"%X-%X.snap" format, causing the replication slot synchronization worker
to fail to find existing snapshot files and produce excessive log messages.

This commit restores the original "%X-%X.snap" format
in SnapBuildSnapshotExists() to resolve the issue.

Author: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHuHPB-ucAk_Tq3uSs4Fdziu1Jp_AA_RD3m5Ycky7m48w@mail.gmail.com
2025-08-14 12:33:14 +09:00
Tom Lane
ee54046601 Grab the low-hanging fruit from forcing USE_FLOAT8_BYVAL to true.
Remove conditionally-compiled code for the other case.

Replace uses of FLOAT8PASSBYVAL with constant "true", mainly because
it was quite confusing in cases where the type we were dealing with
wasn't float8.

I left the associated pg_control and Pg_magic_struct fields in place.
Perhaps we should get rid of them, but it would save little, so it
doesn't seem worth thinking hard about the compatibility implications.
I just labeled them "vestigial" in places where that seemed helpful.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
2025-08-13 17:18:22 -04:00
Tom Lane
6aebedc384 Grab the low-hanging fruit from forcing sizeof(Datum) to 8.
Remove conditionally-compiled code for smaller Datum widths,
and simplify comments that describe cases no longer of interest.

I also fixed up a few more places that were not using
DatumGetIntXX where they should, and made some cosmetic
adjustments such as using sizeof(int64) not sizeof(Datum)
in places where that fit better with the surrounding code.

One thing I remembered while preparing this part is that SP-GiST
stores pass-by-value prefix keys as Datums, so that the on-disk
representation depends on sizeof(Datum).  That's even more
unfortunate than the existing commentary makes it out to be,
because now there is a hazard that the change of sizeof(Datum)
will break SP-GiST indexes on 32-bit machines.  It appears that
there are no existing SP-GiST opclasses that are actually
affected; and if there are some that I didn't find, the number
of installations that are using them on 32-bit machines is
doubtless tiny.  So I'm proceeding on the assumption that we
can get away with this, but it's something to worry about.

(gininsert.c looks like it has a similar problem, but it's okay
because the "tuples" it's constructing are just transient data
within the tuplesort step.  That's pretty poorly documented
though, so I added some comments.)

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
2025-08-13 17:18:22 -04:00
Tom Lane
2a600a93c7 Make type Datum be 8 bytes wide everywhere.
This patch makes sizeof(Datum) be 8 on all platforms including
32-bit ones.  The objective is to allow USE_FLOAT8_BYVAL to be true
everywhere, and in consequence to remove a lot of code that is
specific to pass-by-reference handling of float8, int8, etc.  The
code for abbreviated sort keys can be simplified similarly.  In this
way we can reduce the maintenance effort involved in supporting 32-bit
platforms, without going so far as to actually desupport them.  Since
Datum is strictly an in-memory concept, this has no impact on on-disk
storage, though an initdb or pg_upgrade will be needed to fix affected
catalog entries.

We have required platforms to support [u]int64 for ages, so this
breaks no supported platform.  We can expect that this change will
make 32-bit builds a bit slower and more memory-hungry, although being
able to use pass-by-value handling of 8-byte types may buy back some
of that.  But we stopped optimizing for 32-bit cases a long time ago,
and this seems like just another step on that path.

This initial patch simply forces the correct type definition and
USE_FLOAT8_BYVAL setting, and cleans up a couple of minor compiler
complaints that ensued.  This is sufficient for testing purposes.
In the wake of a bunch of Datum-conversion cleanups by Peter
Eisentraut, this now compiles cleanly with gcc on a 32-bit platform.
(I'd only tested the previous version with clang, which it turns out
is less picky than gcc about width-changing coercions.)

There is a good deal of now-dead code that I'll remove in separate
follow-up patches.

A catversion bump is required because this affects initial catalog
contents (on 32-bit machines) in two ways: pg_type.typbyval changes
for some built-in types, and Const nodes in stored views/rules will
now have 8 bytes not 4 for pass-by-value types.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
2025-08-13 17:18:22 -04:00
Tom Lane
21fddb3d76 Don't treat EINVAL from semget() as a hard failure.
It turns out that on some platforms (at least current macOS, NetBSD,
OpenBSD) semget(2) will return EINVAL if there is a pre-existing
semaphore set with the same key and too few semaphores.  Our code
expects EEXIST in that case and treats EINVAL as a hard failure,
resulting in failure during initdb or postmaster start.

POSIX does document EINVAL for too-few-semaphores-in-set, and is
silent on its priority relative to EEXIST, so this behavior arguably
conforms to spec.  Nonetheless it's quite problematic because EINVAL
is also documented to mean that nsems is greater than the system's
limit on the number of semaphores per set (SEMMSL).  If that is
where the problem lies, retrying would just become an infinite loop.

To resolve this contradiction, retry after EINVAL, but also install a
loop limit that will make us give up regardless of the specific errno
after trying 1000 different keys.  (1000 is a pretty arbitrary number,
but it seems like it should be sufficient.)  I like this better than
the previous infinite-looping behavior, since it will also keep us out
of trouble if (say) we get EACCES due to a system-level permissions
problem rather than anything to do with a specific semaphore set.

This problem has only been observed in the field in PG 17, which uses
a higher nsems value than other branches (cf. 38da05346, 810a8b1c8).
That makes it possible to get the failure if a new v17 postmaster
has a key collision with an existing postmaster of another branch.
In principle though, we might see such a collision against a semaphore
set created by some other application, in which case all branches are
vulnerable on these platforms.  Hence, backpatch.

Reported-by: Gavin Panella <gavinpanella@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALL7chmzY3eXHA7zHnODUVGZLSvK3wYCSP0RmcDFHJY8f28Q3g@mail.gmail.com
Backpatch-through: 13
2025-08-13 12:00:03 -04:00
Andres Freund
b227b0bb4e Reduce ExecSeqScan* code size using pg_assume()
fb9f955025 optimized code generation by using specialized variants of
ExecSeqScan* for [not] having a qual, projection etc. This allowed the
compiler to optimize the code out the code for qual / projection. However, as
observed by David Rowley at the time, the compiler couldn't prove the
opposite, i.e. that the qual etc *are* present.

By using pg_assume(), introduced in d65eb5b1b8, we can tell the compiler that
the relevant variables are non-null.

This reduces the code size to a surprising degree and seems to lead to a small
but reproducible performance gain.

Reviewed-by: Amit Langote <amitlangote09@gmail.com> Discussion:
https://postgr.es/m/CA+HiwqFk-MbwhfX_kucxzL8zLmjEt9MMcHi2YF=DyhPrSjsBEA@mail.gmail.com
2025-08-11 15:41:34 -04:00
Andres Freund
01d6832c10 meson: add and use stamp files for generated headers
Without using stamp files, meson lists the generated headers as the dependency
for every .c file, bloating build.ninja by more than 2x. Processing all the
dependencies also increases the time to generate build.ninja.

The immediate benefit is that this makes re-configuring and clean builds a bit
faster. The main motivation however is that I have other patches that
introduce additional build targets that further would increase the size of
build.ninja, making re-configuring more noticeably slower.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/cgkdgvzdpinkacf4v33mky7tbmk467oda5dd4dlmucjjockxzi@xkqfvjoq4uiy
2025-08-11 15:18:23 -04:00
Dean Rasheed
22424953cd Fix security checks in selectivity estimation functions.
Commit e2d4ef8de8 (the fix for CVE-2017-7484) added security checks
to the selectivity estimation functions to prevent them from running
user-supplied operators on data obtained from pg_statistic if the user
lacks privileges to select from the underlying table. In cases
involving inheritance/partitioning, those checks were originally
performed against the child RTE (which for plain inheritance might
actually refer to the parent table). Commit 553d2ec271 then extended
that to also check the parent RTE, allowing access if the user had
permissions on either the parent or the child. It turns out, however,
that doing any checks using the child RTE is incorrect, since
securityQuals is set to NULL when creating an RTE for an inheritance
child (whether it refers to the parent table or the child table), and
therefore such checks do not correctly account for any RLS policies or
security barrier views. Therefore, do the security checks using only
the parent RTE. This is consistent with how RLS policies are applied,
and the executor's ACL checks, both of which use only the parent
table's permissions/policies. Similar checks are performed in the
extended stats code, so update that in the same way, centralizing all
the checks in a new function.

In addition, note that these checks by themselves are insufficient to
ensure that the user has access to the table's data because, in a
query that goes via a view, they only check that the view owner has
permissions on the underlying table, not that the current user has
permissions on the view itself. In the selectivity estimation
functions, there is no easy way to navigate from underlying tables to
views, so add permissions checks for all views mentioned in the query
to the planner startup code. If the user lacks permissions on a view,
a permissions error will now be reported at planner-startup, and the
selectivity estimation functions will not be run.

Checking view permissions at planner-startup in this way is a little
ugly, since the same checks will be repeated at executor-startup.
Longer-term, it might be better to move all the permissions checks
from the executor to the planner so that permissions errors can be
reported sooner, instead of creating a plan that won't ever be run.
However, such a change seems too far-reaching to be back-patched.

Back-patch to all supported versions. In v13, there is the added
complication that UPDATEs and DELETEs on inherited target tables are
planned using inheritance_planner(), which plans each inheritance
child table separately, so that the selectivity estimation functions
do not know that they are dealing with a child table accessed via its
parent. Handle that by checking access permissions on the top parent
table at planner-startup, in the same way as we do for views. Any
securityQuals on the top parent table are moved down to the child
tables by inheritance_planner(), so they continue to be checked by the
selectivity estimation functions.

Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Noah Misch <noah@leadboat.com>
Backpatch-through: 13
Security: CVE-2025-8713
2025-08-11 09:03:11 +01:00
Thomas Munro
b421223172 Fix rare bug in read_stream.c's split IO handling.
The internal queue of buffers could become corrupted in a rare edge case
that failed to invalidate an entry, causing a stale buffer to be
"forwarded" to StartReadBuffers().  This is a simple fix for the
immediate problem.

A small API change might be able to remove this and related fragility
entirely, but that will have to wait a bit.

Defect in commit ed0b87ca.

Bug: 19006
Backpatch-through: 18
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19006-80fcaaf69000377e%40postgresql.org
2025-08-09 13:04:38 +12:00
Tom Lane
665c3dbba4 Mop-up for Datum conversion cleanups.
Fix a couple more places where an explicit Datum conversion
is needed (not clear how we missed these in ff89e182d and
previous commits).

Replace the minority usage "(Datum) NULL" with "(Datum) 0".
The former depends on the assumption that Datum is the same
width as Pointer, the latter doesn't.  Anyway consistency
is a good thing.

This is, I believe, the last of the notational mop-up needed
before we can consider changing Datum to uint64 everywhere.
It's also important cleanup for more aggressive ideas such
as making Datum a struct.

Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us
Discussion: https://postgr.es/m/8246d7ff-f4b7-4363-913e-827dadfeb145@eisentraut.org
2025-08-08 18:44:57 -04:00
Peter Eisentraut
ff89e182d4 Add missing Datum conversions
Add various missing conversions from and to Datum.  The previous code
mostly relied on implicit conversions or its own explicit casts
instead of using the correct DatumGet*() or *GetDatum() functions.

We think these omissions are harmless.  Some actual bugs that were
discovered during this process have been committed
separately (80c758a2e1, fd2ab03fea).

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
2025-08-08 22:06:57 +02:00
Peter Eisentraut
dcfc0f8912 Remove useless/superfluous Datum conversions
Remove useless DatumGetFoo() and FooGetDatum() calls.  These are
places where no conversion from or to Datum was actually happening.

We think these extra calls covered here were harmless.  Some actual
bugs that were discovered during this process have been committed
separately (80c758a2e1, 2242b26ce4).

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
2025-08-08 22:06:57 +02:00
Thomas Munro
b5cd74612c Remove obsolete comment.
Remove a comment about potential for AIO in StartReadBuffersImpl(),
because that change happened.
2025-08-09 01:46:04 +12:00
Etsuro Fujita
9e63f83a7e Fix oversight in FindTriggerIncompatibleWithInheritance.
This function is called from ATExecAttachPartition/ATExecAddInherit,
which prevent tables with row-level triggers with transition tables from
becoming partitions or inheritance children, to check if there is such a
trigger on the given table, but failed to check if a found trigger is
row-level, causing the caller functions to needlessly prevent a table
with only a statement-level trigger with transition tables from becoming
a partition or inheritance child.  Repair.

Oversight in commit 501ed02cf.

Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Discussion: https://postgr.es/m/CAPmGK167mXzwzzmJ_0YZ3EZrbwiCxtM1vogH_8drqsE6PtxRYw%40mail.gmail.com
Backpatch-through: 13
2025-08-08 17:35:00 +09:00
Etsuro Fujita
62a1211d33 Disallow collecting transition tuples from child foreign tables.
Commit 9e6104c66 disallowed transition tables on foreign tables, but
failed to account for cases where a foreign table is a child table of a
partitioned/inherited table on which transition tables exist, leading to
incorrect transition tuples collected from such foreign tables for
queries on the parent table triggering transition capture.  This
occurred not only for inherited UPDATE/DELETE but for partitioned INSERT
later supported by commit 3d956d956, which should have handled it at
least for the INSERT case, but didn't.

To fix, modify ExecAR*Triggers to throw an error if the given relation
is a foreign table requesting transition capture.  Also, this commit
fixes make_modifytable so that in case of an inherited UPDATE/DELETE
triggering transition capture, FDWs choose normal operations to modify
child foreign tables, not DirectModify; which is needed because they
would otherwise skip the calls to ExecAR*Triggers at execution, causing
unexpected behavior.

Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CAPmGK14QJYikKzBDCe3jMbpGENnQ7popFmbEgm-XTNuk55oyHg%40mail.gmail.com
Backpatch-through: 13
2025-08-08 10:50:00 +09:00
Michael Paquier
84b32fd228 Add information about "generation" when dropping twice pgstats entry
Dropping twice a pgstats entry should not happen, and the error report
generated was missing the "generation" counter (tracking when an entry
is reused) that has been added in 818119afcc.

Like d92573adcb, backpatch down to v15 where this information is
useful to have, to gather more information from instances where the
problem shows up.  A report has shown that this error path has been
reached on a standby based on 17.3, for a relation stats entry and an
OID close to wraparound.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/CAN4RuQvYth942J2+FcLmJKgdpq6fE5eqyFvb_PuskxF2eL=Wzg@mail.gmail.com
Backpatch-through: 15
2025-08-08 09:07:10 +09:00
Dean Rasheed
d699687b32 Extend int128.h to support more numeric code.
This adds a few more functions to int128.h, allowing more of numeric.c
to use 128-bit integers on all platforms.

Specifically, int64_div_fast_to_numeric() and the following aggregate
functions can now use 128-bit integers for improved performance on all
platforms, rather than just platforms with native support for int128:

- SUM(int8)
- AVG(int8)
- STDDEV_POP(int2 or int4)
- STDDEV_SAMP(int2 or int4)
- VAR_POP(int2 or int4)
- VAR_SAMP(int2 or int4)

In addition to improved performance on platforms lacking native
128-bit integer support, this significantly simplifies this numeric
code by allowing a lot of conditionally compiled code to be deleted.

A couple of numeric functions (div_var_int64() and sqrt_var()) still
contain conditionally compiled 128-bit integer code that only works on
platforms with native 128-bit integer support. Making those work more
generally would require rolling our own higher precision 128-bit
division, which isn't supported for now.

Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/CAEZATCWgBMc9ZwKMYqQpaQz2X6gaamYRB+RnMsUNcdMcL2Mj_w@mail.gmail.com
2025-08-07 15:49:24 +01:00
Alexander Korotkov
466c5435fd Fix checkpointer shared memory allocation
Use Min(NBuffers, MAX_CHECKPOINT_REQUESTS) instead of NBuffers in
CheckpointerShmemSize() to match the actual array size limit set in
CheckpointerShmemInit().  This prevents wasting shared memory when
NBuffers > MAX_CHECKPOINT_REQUESTS.  Also, fix the comment.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1439188.1754506714%40sss.pgh.pa.us
Author: Xuneng Zhou <xunengzhou@gmail.com>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
2025-08-07 14:29:02 +03:00
Michael Paquier
2242b26ce4 Fix incorrect Datum conversion in timestamptz_trunc_internal()
The code used a PG_RETURN_TIMESTAMPTZ() where the return type is
TimestampTz and not a Datum.

On 64-bit systems, there is no effect since this just ends up casting
64-bit integers back and forth.  On 32-bit systems, timestamptz is
pass-by-reference.  PG_RETURN_TIMESTAMPTZ() allocates new memory and
returns the address, meaning that the caller could interpret this as a
timestamp value.

The effect is using "date_trunc(..., 'infinity'::timestamptz) will
return random values (instead of the correct return value 'infinity').

Bug introduced in commit d85ce012f9.

Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2d320b6f-b4af-4fbc-9eec-5d0fa15d187b@eisentraut.org
Discussion: https://postgr.es/m/4bf60a84-2862-4a53-acd5-8eddf134a60e@eisentraut.org
Backpatch-through: 18
2025-08-07 11:02:04 +09:00
Nathan Bossart
9ea3b6f751 Expand usage of macros for protocol characters.
This commit makes use of the existing PqMsg_* macros in more places
and adds new PqReplMsg_* and PqBackupMsg_* macros for use in
special replication and backup messages, respectively.

Author: Dave Cramer <davecramer@gmail.com>
Co-authored-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/aIECfYfevCUpenBT@nathan
Discussion: https://postgr.es/m/CAFcNs%2Br73NOUb7%2BqKrV4HHEki02CS96Z%2Bx19WaFgE087BWwEng%40mail.gmail.com
2025-08-06 13:37:00 -05:00
Nathan Bossart
35baa60cc7 Rename transformRelOptions()'s "namspace" parameter to "nameSpace".
The name "namspace" looks like a typo, but it was presumably meant
to avoid using the "namespace" C++ keyword.  This commit renames
the parameter to "nameSpace" to prevent future confusion while
still avoiding the keyword.

Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/aJJxpfsDfiQ1VbJ5%40nathan
2025-08-06 12:08:07 -05:00
Peter Eisentraut
73d33be4da Remove INT64_HEX_FORMAT and UINT64_HEX_FORMAT
These were introduced (commit efdc7d7475) at the same time as we were
moving to using the standard inttypes.h format macros (commit
a0ed19e0a9).  It doesn't seem useful to keep a new already-deprecated
interface like this with only a few users, so remove the new symbols
again and have the callers use PRIx64.

(Also, INT64_HEX_FORMAT was kind of a misnomer, since hex formats all
use unsigned types.)

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/0ac47b5d-e5ab-4cac-98a7-bdee0e2831e4%40eisentraut.org
2025-08-06 11:08:10 +02:00
Masahiko Sawada
b5c53b403c Suppress maybe-uninitialized warning.
Following commit e035863c9a, building with -O0 began triggering
warnings about potentially uninitialized 'workbuf' usage. While
theoretically the initialization isn't necessary since VARDATA()
doesn't access the contents of the pointed-to object, this commit
explicitly initializes the workbuf variable to suppress the warning.

Buildfarm members adder and flaviventris have shown the warning.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAD21AoCOZxfqnNgfM5yVKJZYnOq5m2Q96fBGy1fovEqQ9V4OZA@mail.gmail.com
2025-08-05 15:30:28 -07:00
Tom Lane
80c758a2e1 Fix incorrect return value in brin_minmax_multi_distance_numeric().
The result of "DirectFunctionCall1(numeric_float8, d)" is already in
Datum form, but the code was incorrectly applying PG_RETURN_FLOAT8()
to it.  On machines where float8 is pass-by-reference, this would
result in complete garbage, since an unpredictable pointer value
would be treated as an integer and then converted to float.  It's not
entirely clear how much of a problem would ensue on 64-bit hardware,
but certainly interpreting a float8 bitpattern as uint64 and then
converting that to float isn't the intended behavior.

As luck would have it, even the complete-garbage case doesn't break
BRIN indexes, since the results are only used to make choices about
how to merge values into ranges: at worst, we'd make poor choices
resulting in an inefficient index.  Doubtless that explains the lack
of field complaints.  However, users with BRIN indexes that use the
numeric_minmax_multi_ops opclass may wish to reindex in hopes of
making their indexes more efficient.

Author: Peter Eisentraut <peter@eisentraut.org>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2093712.1753983215@sss.pgh.pa.us
Backpatch-through: 14
2025-08-05 16:51:10 -04:00
Masahiko Sawada
deb674454c Add backup_type column to pg_stat_progress_basebackup.
This commit introduces a new column backup_type that indicates the
type of backup being performed: either 'full' or 'incremental'.

Bump catalog version.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/CAOzEurQuzbHwTj1ehk1a+eeQDidJPyrE5s6mYumkjwjZnurhkQ@mail.gmail.com
2025-08-05 10:50:45 -07:00
Jeff Davis
295a39770e Don't copy datlocale from template unless provider matches.
During CREATE DATABASE, if changing the locale provider, require that
a new locale is specified rather than trying to reinterpret the
template's locale using the new provider.

This only affects the behavior when the template uses the builtin
provider and CREATE DATABASE specifies the ICU provider without
specifying the locale. Previously, that may have succeeded due to
loose validation by ICU, whereas now that will cause an error. Because
it can cause an error, backport only to unreleased versions.

Discussion: https://postgr.es/m/5038b33a6dc639009f4b3d43fa6ae0c5ba9e04f7.camel@j-davis.com
Backpatch-through: 18
2025-08-05 09:25:23 -07:00
Tom Lane
f291751ef8 Mop-up for commit e035863c9.
Neither Peter nor I had tried this with USE_VALGRIND ...

Per buildfarm member skink.
2025-08-05 12:11:33 -04:00
Peter Eisentraut
0f5ade7a36 Fix varatt versus Datum type confusions
Macros like VARDATA() and VARSIZE() should be thought of as taking
values of type pointer to struct varlena or some other related struct.
The way they are implemented, you can pass anything to it and it will
cast it right.  But this is in principle incorrect.  To fix, add the
required DatumGetPointer() calls.  Or in a couple of cases, remove
superfluous PointerGetDatum() calls.

It is planned in a subsequent patch to change macros like VARDATA()
and VARSIZE() to inline functions, which will enforce stricter typing.
This is in preparation for that.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/928ea48f-77c6-417b-897c-621ef16685a6%40eisentraut.org
2025-08-05 12:11:36 +02:00
Peter Eisentraut
2ad6e80de9 Fix various hash function uses
These instances were using Datum-returning functions where a
lower-level function returning uint32 would be more appropriate.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
2025-08-05 11:47:23 +02:00
Amit Kapila
c9a5860f7a Throw ERROR when publish_generated_columns is specified without a value.
Previously, specifying the publication option 'publish_generated_columns'
without an explicit value would incorrectly default to 'stored', which is
not the intended behavior.

This patch fixes the issue by raising an ERROR when no value is provided
for 'publish_generated_columns', ensuring that users must explicitly
specify a valid option.

Author: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Backpatch-through: 18, where it was introduced
Discussion: https://postgr.es/m/CAHut+PsCUCWiEKmB10DxhoPfXbF6jw5RD9ib2LuaQeA_XraW7w@mail.gmail.com
2025-08-05 09:34:22 +00:00
Peter Eisentraut
1469e31297 Fix mixups of FooGetDatum() vs. DatumGetFoo()
Some of these were accidentally reversed, but there was no ill effect.

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org
2025-08-05 10:53:49 +02:00
Fujii Masao
4614d53d4e Avoid unexpected shutdown when sync_replication_slots is enabled.
Previously, enabling sync_replication_slots while wal_level was not set
to logical could cause the server to shut down. This was because
the postmaster performed a configuration check before launching
the slot synchronization worker and raised an ERROR if the settings
were incompatible. Since ERROR is treated as FATAL in the postmaster,
this resulted in the entire server shutting down unexpectedly.

This commit changes the postmaster to log that message with a LOG-level
instead of raising an ERROR, allowing the server to continue running
even with the misconfiguration.

Back-patch to v17, where slot synchronization was introduced.

Reported-by: Hugo DUBOIS <hdubois@scaleway.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Hugo DUBOIS <hdubois@scaleway.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAH0PTU_pc3oHi__XESF9ZigCyzai1Mo3LsOdFyQA4aUDkm01RA@mail.gmail.com
Backpatch-through: 17
2025-08-04 20:51:42 +09:00
David Rowley
bca9a1900c Fix incorrect comment regarding mod_since_analyze
Author: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/20250804140120.280c2d6a9d2ea687cd167743@sraoss.co.jp
2025-08-04 17:43:22 +12:00
Amit Kapila
fd5a1a0c3e Detect and report update_deleted conflicts.
This enhancement builds upon the infrastructure introduced in commit
228c370868, which enables the preservation of deleted tuples and their
origin information on the subscriber. This capability is crucial for
handling concurrent transactions replicated from remote nodes.

The update introduces support for detecting update_deleted conflicts
during the application of update operations on the subscriber. When an
update operation fails to locate the target row-typically because it has
been concurrently deleted-we perform an additional table scan. This scan
uses the SnapshotAny mechanism and we do this additional scan only when
the retain_dead_tuples option is enabled for the relevant subscription.

The goal of this scan is to locate the most recently deleted tuple-matching
the old column values from the remote update-that has not yet been removed
by VACUUM and is still visible according to our slot (i.e., its deletion
is not older than conflict-detection-slot's xmin). If such a tuple is
found, the system reports an update_deleted conflict, including the origin
and transaction details responsible for the deletion.

This provides a groundwork for more robust and accurate conflict
resolution process, preventing unexpected behavior by correctly
identifying cases where a remote update clashes with a deletion from
another origin.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2025-08-04 04:02:47 +00:00
Tom Lane
5c8eda1f72 Take a little more care in set_backtrace().
Coverity complained that the "errtrace" string is leaked if we return
early because backtrace_symbols fails.  Another criticism that could
be leveled at this is that not providing any hint of what happened is
user-unfriendly.  Fix that.

The odds of a leak here are small, and typically it wouldn't matter
anyway since the leak will be in ErrorContext which will soon get
reset.  So I'm not feeling a need to back-patch.
2025-08-03 13:01:17 -04:00
Tom Lane
4fbfdde58e Avoid leakage of zero-length arrays in partition_bounds_copy().
If ndatums is zero, the code would allocate zero-length boundKinds
and boundDatums chunks, which would have nothing pointing to them,
leading to Valgrind complaints.  Rearrange the code to avoid the
useless pallocs, and also to not bother computing byval/typlen when
they aren't used.

I'm unsure why I didn't see this in my Valgrind testing back in May.
This code hasn't changed since then, but maybe we added a regression
test that reaches this edge case.  Or possibly I just failed to
notice the reports, which do say "0 bytes lost".

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
2025-08-02 21:59:46 -04:00
Tom Lane
b102c8c473 Silence complaints about leaks in PlanCacheComputeResultDesc.
CompleteCachedPlan intentionally doesn't worry about small
leaks from PlanCacheComputeResultDesc.  However, Valgrind
knows nothing of engineering tradeoffs and complains anyway.
Silence it by doing things the hard way if USE_VALGRIND.

I don't really love this patch, because it makes the handling
of plansource->resultDesc different from the handling of query
dependencies and search_path just above, which likewise are willing
to accept small leaks into the cached plan's context.  However,
those cases aren't provoking Valgrind complaints.  (Perhaps in a
CLOBBER_CACHE_ALWAYS build, they would?)  For the moment, this
makes the src/pl/plpgsql tests leak-free according to Valgrind.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
2025-08-02 21:59:46 -04:00
Tom Lane
7f6ededa76 Suppress complaints about leaks in TS dictionary loading.
Like the situation with function cache loading, text search
dictionary loading functions tend to leak some cruft into the
dictionary's long-lived cache context.  To judge by the examples in
the core regression tests, not very many bytes are at stake.
Moreover, I don't see a way to prevent such leaks without changing the
API for TS template initialization functions: right now they do not
have to worry about making sure that their results are long-lived.

Hence, I think we should install a suppression rule rather than trying
to fix this completely.  However, I did grab some low-hanging fruit:
several places were leaking the result of get_tsearch_config_filename.
This seems worth doing mostly because they are inconsistent with other
dictionaries that were freeing it already.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
2025-08-02 21:59:46 -04:00
Tom Lane
db01c90b2f Silence Valgrind leakage complaints in more-or-less-hackish ways.
These changes don't actually fix any leaks.  They just make sure that
Valgrind will find pointers to data structures that remain allocated
at process exit, and thus not falsely complain about leaks.  In
particular, we are trying to avoid situations where there is no
pointer to the beginning of an allocated block (except possibly
within the block itself, which Valgrind won't count).

* Because dynahash.c never frees hashtable storage except by deleting
the whole hashtable context, it doesn't bother to track the individual
blocks of elements allocated by element_alloc().  This results in
"possibly lost" complaints from Valgrind except when the first element
of each block is actively in use.  (Otherwise it'll be on a freelist,
but very likely only reachable via "interior pointers" within element
blocks, which doesn't satisfy Valgrind.)

To fix, if we're building with USE_VALGRIND, expend an extra pointer's
worth of space in each element block so that we can chain them all
together from the HTAB header.  Skip this in shared hashtables though:
Valgrind doesn't track those, and we'd need additional locking to make
it safe to manipulate a shared chain.

While here, update a comment obsoleted by 9c911ec06.

* Put the dlist_node fields of catctup and catclist structs first.
This ensures that the dlist pointers point to the starts of these
palloc blocks, and thus that Valgrind won't consider them
"possibly lost".

* The postmaster's PMChild structs and the autovac launcher's
avl_dbase structs also have the dlist_node-is-not-first problem,
but putting it first still wouldn't silence the warning because we
bulk-allocate those structs in an array, so that Valgrind sees a
single allocation.  Commonly the first array element will be pointed
to only from some later element, so that the reference would be an
interior pointer even if it pointed to the array start.  (This is the
same issue as for dynahash elements.)  Since these are pretty simple
data structures, I don't feel too bad about faking out Valgrind by
just keeping a static pointer to the array start.

(This is all quite hacky, and it's not hard to imagine usages where
we'd need some other idea in order to have reasonable leak tracking of
structures that are only accessible via dlist_node lists.  But these
changes seem to be enough to silence this class of leakage complaints
for the moment.)

* Free a couple of data structures manually near the end of an
autovacuum worker's run when USE_VALGRIND, and ensure that the final
vac_update_datfrozenxid() call is done in a non-permanent context.
This doesn't have any real effect on the process's total memory
consumption, since we're going to exit as soon as that last
transaction is done.  But it does pacify Valgrind.

* Valgrind complains about the postmaster's socket-files and
lock-files lists being leaked, which we can silence by just
not nulling out the static pointers to them.

* Valgrind seems not to consider the global "environ" variable as
a valid root pointer; so when we allocate a new environment array,
it claims that data is leaked.  To fix that, keep our own
statically-allocated copy of the pointer, similarly to the previous
item.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
2025-08-02 21:59:46 -04:00
Tom Lane
e78d1d6d47 Fix assorted pretty-trivial memory leaks in the backend.
In the current system architecture, none of these are worth obsessing
over; most are once-per-process leaks.  However, Valgrind complains
about all of them, and if we get to using threads rather than
processes for backend sessions, it will become more interesting to
avoid per-session leaks.

* Fix leaks in StartupXLOG() and ShutdownWalRecovery().

* Fix leakage of pq_mq_handle in a parallel worker.
While at it, move mq_putmessage's "Assert(pq_mq_handle != NULL)"
to someplace where it's not trivially useless.

* Fix leak in logicalrep_worker_detach().

* Don't leak the startup-packet buffer in ProcessStartupPacket().

* Fix leak in evtcache.c's DecodeTextArrayToBitmapset().
If the presented array is toasted, this neglected to free the
detoasted copy, which was then leaked into EventTriggerCacheContext.

* I'm distressed by the amount of code that BuildEventTriggerCache
is willing to run while switched into a long-lived cache context.
Although the detoasted array is the only leak that Valgrind reports,
let's tighten things up while we're here.  (DecodeTextArrayToBitmapset
is still run in the cache context, so doing this doesn't remove the
need for the detoast fix.  But it reduces the surface area for other
leaks.)

* load_domaintype_info() intentionally leaked some intermediate cruft
into the long-lived DomainConstraintCache's memory context, reasoning
that the amount of leakage will typically not be much so it's not
worth doing a copyObject() of the final tree to avoid that.  But
Valgrind knows nothing of engineering tradeoffs and complains anyway.
On the whole, the copyObject doesn't cost that much and this is surely
not a performance-critical code path, so let's do it the clean way.

* MarkGUCPrefixReserved didn't bother to clean up removed placeholder
GUCs at all, which shows up as a leak in one regression test.
It seems appropriate for it to do as much cleanup as
define_custom_variable does when replacing placeholders, so factor
that code out into a helper function.  define_custom_variable's logic
was one brick shy of a load too: it forgot to free the separate
allocation for the placeholder's name.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
2025-08-02 21:59:46 -04:00
Tom Lane
9e9190154e Fix MemoryContextAllocAligned's interaction with Valgrind.
Arrange that only the "aligned chunk" part of the allocated space is
included in a Valgrind vchunk.  This suppresses complaints about that
vchunk being possibly lost because PG is retaining only pointers to
the aligned chunk.  Also make sure that trailing wasted space is
marked NOACCESS.

As a tiny performance improvement, arrange that MCXT_ALLOC_ZERO zeroes
only the returned "aligned chunk", not the wasted padding space.

In passing, fix GetLocalBufferStorage to use MemoryContextAllocAligned
instead of rolling its own implementation, which was equally broken
according to Valgrind.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
2025-08-02 21:59:46 -04:00
Tom Lane
bb049a79d3 Improve our support for Valgrind's leak tracking.
When determining whether an allocated chunk is still reachable,
Valgrind will consider only pointers within what it believes to be
allocated chunks.  Normally, all of a block obtained from malloc()
would be considered "allocated" --- but it turns out that if we use
VALGRIND_MEMPOOL_ALLOC to designate sub-section(s) of a malloc'ed
block as allocated, all the rest of that malloc'ed block is ignored.
This leads to lots of false positives of course.  In particular,
in any multi-malloc-block context, all but the primary block were
reported as leaked.  We also had a problem with context "ident"
strings, which were reported as leaked unless there was some other
pointer to them besides the one in the context header.

To fix, we need to use VALGRIND_MEMPOOL_ALLOC to designate
a context's management structs (the context struct itself and
any per-block headers) as allocated chunks.  That forces moving
the VALGRIND_CREATE_MEMPOOL/VALGRIND_DESTROY_MEMPOOL calls into
the per-context-type code, so that the pool identifier can be
made as soon as we've allocated the initial block, but otherwise
it's fairly straightforward.  Note that in Valgrind's eyes there
is no distinction between these allocations and the allocations
that the mmgr modules hand out to user code.  That's fine for
now, but perhaps someday we'll want to do better yet.

When reading this patch, it's helpful to start with the comments
added at the head of mcxt.c.

Author: Andres Freund <andres@anarazel.de>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/285483.1746756246@sss.pgh.pa.us
Discussion: https://postgr.es/m/20210317181531.7oggpqevzz6bka3g@alap3.anarazel.de
2025-08-02 21:59:46 -04:00
Michael Paquier
3b3fa94900 Fix use-after-free with INSERT ON CONFLICT changes in reorderbuffer.c
In ReorderBufferProcessTXN(), used to send the data of a transaction to
an output plugin, INSERT ON CONFLICT changes (INTERNAL_SPEC_INSERT) are
delayed until a confirmation record arrives (INTERNAL_SPEC_CONFIRM),
updating the change being processed.

8c58624df4 has added an extra step after processing a change to update
the progress of the transaction, by calling the callback
update_progress_txn() based on the LSN stored in a change after a
threshold of CHANGES_THRESHOLD (100) is reached.  This logic has missed
the fact that for an INSERT ON CONFLICT change the data is freed once
processed, hence update_progress_txn() could be called pointing to a LSN
value that's already been freed.  This could result in random crashes,
depending on the workload.

Per discussion, this issue is fixed by reusing in update_progress_txn()
the LSN from the change processed found at the beginning of the loop,
meaning that for a INTERNAL_SPEC_CONFIRM change the progress is updated
using the LSN of the INTERNAL_SPEC_CONFIRM change, and not the LSN from
its INTERNAL_SPEC_INSERT change.  This is actually more correct, as we
want to update the progress to point to the INTERNAL_SPEC_CONFIRM
change.

Masahiko Sawada has found a nice trick to reproduce the issue: hardcode
CHANGES_THRESHOLD at 1 and run test_decoding (test "ddl" being enough)
on an instance running valgrind.  The bug has been analyzed by Ethan
Mertz, who also originally suggested the solution used in this patch.

Issue introduced by 8c58624df4, so backpatch down to v16.

Author: Ethan Mertz <ethan.mertz@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/aIsQqDZ7x4LAQ6u1@paquier.xyz
Backpatch-through: 16
2025-08-02 17:08:45 +09:00
Nathan Bossart
9eb6068fb6 Allow resetting unknown custom GUCs with reserved prefixes.
Currently, ALTER DATABASE/ROLE/SYSTEM RESET [ALL] with an unknown
custom GUC with a prefix reserved by MarkGUCPrefixReserved() errors
(unless a superuser runs a RESET ALL variant).  This is problematic
for cases such as an extension library upgrade that removes a GUC.
To fix, simply make sure the relevant code paths explicitly allow
it.  Note that we require superuser or privileges on the parameter
to reset it.  This is perhaps a bit more restrictive than is
necessary, but it's not clear whether further relaxing the
requirements is safe.

Oversight in commit 88103567cb.  The ALTER SYSTEM fix is dependent
on commit 2d870b4aef, which first appeared in v17.  Unfortunately,
back-patching that commit would introduce ABI breakage, and while
that breakage seems unlikely to bother anyone, it doesn't seem
worth the risk.  Hence, the ALTER SYSTEM part of this commit is
omitted on v15 and v16.

Reported-by: Mert Alev <mert@futo.org>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Discussion: https://postgr.es/m/18964-ba09dea8c98fccd6%40postgresql.org
Backpatch-through: 15
2025-08-01 16:52:11 -05:00
Masahiko Sawada
a2c6c4ed31 Fix typo in AutoVacLauncherMain().
Author: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/20250802002027.cd35c481f6c6bae7ca2a3e26@sraoss.co.jp
2025-08-01 18:02:41 +00:00
Amit Kapila
2ab2d6f970 Fix a deadlock during ALTER SUBSCRIPTION ... DROP PUBLICATION.
A deadlock can occur when the DDL command and the apply worker acquire
catalog locks in different orders while dropping replication origins.

The issue is rare in PG16 and higher branches because, in most cases, the
tablesync worker performs the origin drop in those branches, and its
locking sequence does not conflict with DDL operations.

This patch ensures consistent lock acquisition to prevent such deadlocks.

As per buildfarm.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Ajin Cherian <itsajin@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 14, where it was introduced
Discussion: https://postgr.es/m/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com
2025-08-01 07:58:48 +00:00
Michael Paquier
e125e36002 Rename CachedPlanType to PlannedStmtOrigin for PlannedStmt
Commit 719dcf3c42 introduced a field called CachedPlanType in
PlannedStmt to allow extensions to determine whether a cached plan is
generic or custom.

After discussion, the concepts that we want to track are a bit wider
than initially anticipated, as it is closer to knowing from which
"source" or "origin" a PlannedStmt has been generated or retrieved.
Custom and generic cached plans are a subset of that.

Based on the state of HEAD, we have been able to define two more
origins:
- "standard", for the case where PlannedStmt is generated in
standard_planner(), the most common case.
- "internal", for the fake PlannedStmt generated internally by some
query patterns.

This could be tuned in the future depending on what is needed.  This
looks like a good starting point, at least.  The default value is called
"UNKNOWN", provided as fallback value.  This value is not used in the
core code, the idea is to let extensions building their own PlannedStmts
know about this new field.

Author: Michael Paquier <michael@paquier.xyz>
Co-authored-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aILaHupXbIGgF2wJ@paquier.xyz
2025-07-31 10:06:34 +09:00
Heikki Linnakangas
613f647122 Handle cancel requests with PID 0 gracefully
If the client sent a query cancel request with backend PID 0, it
tripped an assertion. With assertions disabled, you got this in the
log instead:

    LOG:  invalid cancel request with PID 0
    LOG:  wrong key in cancel request for process 0

Query cancellations don't even require authentication, so we better
tolerate bogus requests. Fix by turning the assertion into a regular
runtime check.

Spotted while testing libpq behavior with a modified server that
didn't send BackendKeyData to the client.

Backpatch-through: 18
2025-07-30 00:39:49 +03:00
Tom Lane
4300d8b6a7 Don't put library-supplied -L/-I switches before user-supplied ones.
For many optional libraries, we extract the -L and -l switches needed
to link the library from a helper program such as llvm-config.  In
some cases we put the resulting -L switches into LDFLAGS ahead of
-L switches specified via --with-libraries.  That risks breaking
the user's intention for --with-libraries.

It's not such a problem if the library's -L switch points to a
directory containing only that library, but on some platforms a
library helper may "helpfully" offer a switch such as -L/usr/lib
that points to a directory holding all standard libraries.  If the
user specified --with-libraries in hopes of overriding the standard
build of some library, the -L/usr/lib switch prevents that from
happening since it will come before the user-specified directory.

To fix, avoid inserting these switches directly into LDFLAGS during
configure, instead adding them to LIBDIRS or SHLIB_LINK.  They will
still eventually get added to LDFLAGS, but only after the switches
coming from --with-libraries.

The same problem exists for -I switches: those coming from
--with-includes should appear before any coming from helper programs
such as llvm-config.  We have not heard field complaints about this
case, but it seems certain that a user attempting to override a
standard library could have issues.

The changes for this go well beyond configure itself, however,
because many Makefiles have occasion to manipulate CPPFLAGS to
insert locally-desirable -I switches, and some of them got it wrong.
The correct ordering is any -I switches pointing at within-the-
source-tree-or-build-tree directories, then those from the tree-wide
CPPFLAGS, then those from helper programs.  There were several places
that risked pulling in a system-supplied copy of libpq headers, for
example, instead of the in-tree files.  (Commit cb36f8ec2 fixed one
instance of that a few months ago, but this exercise found more.)

The Meson build scripts may or may not have any comparable problems,
but I'll leave it to someone else to investigate that.

Reported-by: Charles Samborski <demurgos@demurgos.net>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/70f2155f-27ca-4534-b33d-7750e20633d7@demurgos.net
Backpatch-through: 13
2025-07-29 15:17:40 -04:00
Peter Eisentraut
c3019bb778 Update comment
The code being referred to was moved to a different function in commit
eb8312a22a, so update the comment accordingly.
2025-07-29 18:57:14 +02:00
Tom Lane
902f922218 Remove unnecessary complication around xmlParseBalancedChunkMemory.
When I prepared 71c0921b6 et al yesterday, I was thinking that the
logic involving explicitly freeing the node_list output was still
needed to dodge leakage bugs in libxml2.  But I was misremembering:
we introduced that only because with early 2.13.x releases we could
not trust xmlParseBalancedChunkMemory's result code, so we had to
look to see if a node list was returned or not.  There's no reason
to believe that xmlParseBalancedChunkMemory will fail to clean up
the node list when required, so simplify.  (This essentially
completes reverting all the non-cosmetic changes in 6082b3d5d.)

Reported-by: Jim Jones <jim.jones@uni-muenster.de>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/997668.1753802857@sss.pgh.pa.us
Backpatch-through: 13
2025-07-29 12:47:38 -04:00
Robert Haas
1d1612aec7 Run pgindent.
Per buildfarm member koel, Nathan Bossart, and David Rowley.
2025-07-29 09:10:41 -04:00
David Rowley
4bc62b8684 Display Memoize planner estimates in EXPLAIN
There've been a few complaints that it can be overly difficult to figure
out why the planner picked a Memoize plan.  To help address that, here we
adjust the EXPLAIN output to display the following additional details:

1) The estimated number of cache entries that can be stored at once
2) The estimated number of unique lookup keys that we expect to see
3) The number of lookups we expect
4) The estimated hit ratio

Technically #4 can be calculated using #1, #2 and #3, but it's not a
particularly obvious calculation, so we opt to display it explicitly.
The original patch by Lukas Fittl only displayed the hit ratio, but
there was a fear that might lead to more questions about how that was
calculated.  The idea with displaying all 4 is to be transparent which
may allow queries to be tuned more easily.  For example, if #2 isn't
correct then maybe extended statistics or a manual n_distinct estimate can
be used to help fix poor plan choices.

Author: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAP53Pky29GWAVVk3oBgKBDqhND0BRBN6yTPeguV_qSivFL5N_g%40mail.gmail.com
2025-07-29 15:18:01 +12:00
Tom Lane
71c0921b64 Avoid regression in the size of XML input that we will accept.
This mostly reverts commit 6082b3d5d, "Use xmlParseInNodeContext
not xmlParseBalancedChunkMemory".  It turns out that
xmlParseInNodeContext will reject text chunks exceeding 10MB, while
(in most libxml2 versions) xmlParseBalancedChunkMemory will not.
The bleeding-edge libxml2 bug that we needed to work around a year
ago is presumably no longer a factor, and the argument that
xmlParseBalancedChunkMemory is semi-deprecated is not enough to
justify a functionality regression.  Hence, go back to doing it
the old way.

Reported-by: Michael Paquier <michael@paquier.xyz>
Author: Michael Paquier <michael@paquier.xyz>
Co-authored-by: Erik Wienhold <ewie@ewie.name>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/aIGknLuc8b8ega2X@paquier.xyz
Backpatch-through: 13
2025-07-28 16:50:41 -04:00
Robert Haas
d5b9b2d402 Remove misleading hint for "unexpected data beyond EOF" error.
Commit ffae5cc5a6 added this hint in 2006,
but it's now obsolete and doesn't reflect what users should really check
in this situation. We were not able to agree on a new hint, so just delete
the existing one and update the comments to mention one possibility that
is known to cause problems of this kind: something other than PostgreSQL
is modifying files in the PostgreSQL data directory.

Author: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Robert Haas <rhaas@postgresql.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/CAKZiRmxNbcaL76x=09Sxf7aUmrRQJBf8drzDdUHo+j9_eM+VMg@mail.gmail.com
2025-07-28 11:15:47 -04:00
Robert Haas
dcc9820a35 Avoid throwing away the error message in syncrep_yyerror.
Commit 473a575e05 purported to make this
function stash the error message in *syncrep_parse_result_p, but
it didn't actually.

As a result, an attempt to set synchronous_standby_names to any value
that does not parse resulted in a generic "parser failed." message
rather than anything more specific. This fixes that.

Discussion: http://postgr.es/m/CA+TgmoYF9wPNZ-Q_EMfib_espgHycY-eX__6Tzo2GpYpVXqCdQ@mail.gmail.com
Backpatch-through: 18
2025-07-28 10:35:05 -04:00
Michael Paquier
793928c2d5 Fix performance regression with flush of pending fixed-numbered stats
The callback added in fc415edf8c used to check if there is any pending
data to flush for fixed-numbered statistics, done by looping across all
the builtin and custom stats kinds with a call to have_fixed_pending_cb,
is proving to able to show in workloads that do not report any stats
(read-only, no function calls, no WAL, no IO, etc).  The code used in
v17 was cheaper than that what HEAD has introduced, relying on three
boolean checks for WAL, SLRU and IO stats.

This commit switches the code to use a more efficient approach than
fc415edf8c, with a single boolean flag that can be switched to "true"
by any fixed-numbered stats kinds to force pgstat_report_stat() to go
through one round of reports.  The flag is reset by pgstat_report_stat()
once a full round of reports is done.  The flag being false means that
fixed-numbered stats kinds saw no activity, and that there is no pending
data to flush.

ac000fca74 took one step in improving the performance by reducing the
number of stats kinds that the backend can hold.  This commit takes a
more drastic step by bringing back the code efficiency to what it was
before v18 with a cheap check at the beginning of pgstat_report_stat()
for its fast-exit path.

The callback have_static_pending_cb is removed as an effect of all that.

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/eb224uegsga2hgq7dfq3ps5cduhpqej7ir2hjxzzozjthrekx5@dysei6buqthe
Backpatch-through: 18
2025-07-28 08:15:11 +09:00
Alexander Korotkov
258bf0a2ea Process sync requests incrementally in AbsorbSyncRequests
If the number of sync requests is big enough, the palloc() call in
AbsorbSyncRequests() will attempt to allocate more than 1 GB of memory,
resulting in failure.  This can lead to an infinite loop in the checkpointer
process, as it repeatedly fails to absorb the pending requests.

This commit introduces the following changes to cope with this problem:
 1. Turn pending checkpointer requests array in shared memory into a bounded
    ring buffer.
 2. Limit maximum ring buffer size to 10M items.
 3. Make AbsorbSyncRequests() process requests incrementally in 10K batches.

Even #2 makes the whole queue size fit the maximum palloc() size of 1 GB.
of continuous lock holding.

This commit is for master only.  Simpler fix, which just limits a request
queue size to 10M, will be backpatched.

Reported-by: Ekaterina Sokolova <e.sokolova@postgrespro.ru>
Discussion: https://postgr.es/m/db4534f83a22a29ab5ee2566ad86ca92%40postgrespro.ru
Author: Maxim Orlov <orlovmg@gmail.com>
Co-authored-by:  Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
2025-07-27 15:07:47 +03:00
Michael Paquier
6f22a82a40 Add assertions for all the required index AM callbacks
Similar checks are done for the mandatory table AM callbacks.  A portion
of the index AM callbacks are optional and can be NULL; the rest is
mandatory and is documented as such in the documentation and in amapi.h.

These checks are useful to detect quickly if all the mandatory callbacks
are defined when implementing a new index access method, as the
assertions are run when loading the AM.

Author: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/ME0P300MB0445795D31CEAB92C58B41FDB651A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
2025-07-27 17:48:47 +09:00
Tom Lane
80aa9848be Reap the benefits of not having to avoid leaking PGresults.
Remove a bunch of PG_TRY constructs, de-volatilize related
variables, remove some PQclear calls in error paths.
Aside from making the code simpler and shorter, this should
provide some marginal performance gains.

For ease of review, I did not re-indent code within the removed
PG_TRY constructs.  That'll be done in a separate patch.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/2976982.1748049023@sss.pgh.pa.us
2025-07-25 16:31:43 -04:00
Tom Lane
7d8f595779 Create infrastructure to reliably prevent leakage of PGresults.
Commit 232d8caea fixed a case where postgres_fdw could lose track
of a PGresult object, resulting in a process-lifespan memory leak.
But I have little faith that there aren't other potential PGresult
leakages, now or in future, in the backend modules that use libpq.
Therefore, this patch proposes infrastructure that makes all
PGresults returned from libpq act as though they are palloc'd
in the CurrentMemoryContext (with the option to relocate them to
another context later).  This should greatly reduce the risk of
careless leaks, and it also permits removal of a bunch of code
that attempted to prevent such leaks via PG_TRY blocks.

This patch adds infrastructure that wraps each PGresult in a
"libpqsrv_PGresult" that provides a memory context reset callback
to PQclear the PGresult.  Code using this abstraction is inherently
memory-safe to the same extent as we are accustomed to in most backend
code.  Furthermore, we add some macros that automatically redirect
calls of the libpq functions concerned with PGresults to use this
infrastructure, so that almost no source-code changes are needed to
wheel this infrastructure into place in all the backend code that
uses libpq.

Perhaps in future we could create similar infrastructure for
PGconn objects, but there seems less need for that.

This patch just creates the infrastructure and makes relevant code
use it, including reverting 232d8caea in favor of this mechanism.
A good deal of follow-on simplification is possible now that we don't
have to be so cautious about freeing PGresults, but I'll put that in
a separate patch.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/2976982.1748049023@sss.pgh.pa.us
2025-07-25 16:30:00 -04:00
Tom Lane
5457ea46d1 Fix dynahash's HASH_FIXED_SIZE ("isfixed") option.
This flag was effectively a no-op in EXEC_BACKEND (ie, Windows)
builds, because it was kept in the process-local HTAB struct,
and it could only ever become set in the postmaster's copy.

The simplest fix is to move it to the shared HASHHDR struct.
We could keep a copy in HTAB as well, as we do with keysize
and some other fields, but the "too much contention" argument
doesn't seem to apply here: we only examine isfixed during
element_alloc(), which had better not get hit very often for
a shared hashtable.

This oversight dates to 7c797e719 which invented the option.
But back-patching doesn't seem appropriate given the lack of
field complaints.  If there is anyone running an affected
workload on Windows, they might be unhappy about the behavior
changing in a minor release.

Author: Aidar Imamov <a.imamov@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/4d0cb35ff01c5c74d2b9a582ecb73823@postgrespro.ru
2025-07-25 10:56:55 -04:00
Álvaro Herrera
1dfe3ef3f9
Refactor grammar to create opt_utility_option_list
This changes the grammar for REINDEX, CHECKPOINT, CLUSTER, ANALYZE/ANALYSE;
they still accept the same options as before, but the grammar is written
differently for convenience of future development.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/202507231538.ir7pjzoow6oe@alvherre.pgsql
2025-07-25 12:03:19 +02:00
Fujii Masao
b5d084c535 Fix background worker not restarting after crash-and-restart cycle.
Previously, if a background worker crashed (e.g., due to a SIGKILL) and
the server restarted due to restart_after_crash being enabled,
the worker was not restarted as expected. Background workers without
the never-restart flag should automatically restart in this case.

This issue was introduced in commit 28a520c0b7, which failed to reset
the rw_pid field in the RegisteredBgWorker struct for the crashed worker.

This commit fixes the problem by resetting rw_pid for all eligible
background workers during the crash-and-restart cycle.

Back-patched to v18, where the bug was introduced.

Bug fix patches were proposed by Andrey Rudometov and ChangAo Chen,
but this commit uses a different approach.

Reported-by: Andrey Rudometov <unlimitedhikari@gmail.com>
Reported-by: ChangAo Chen <cca5507@qq.com>
Author: Andrey Rudometov <unlimitedhikari@gmail.com>
Author: ChangAo Chen <cca5507@qq.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAF6JsWiO=i24qYitWe6ns1sXqcL86rYxdyU+pNYk-WueKPSySg@mail.gmail.com
Discussion: https://postgr.es/m/tencent_E00A056B3953EE6440F0F40F80EC30427D09@qq.com
Backpatch-through: 18
2025-07-25 18:38:36 +09:00
Michael Paquier
641f20d4c4 Fix assertion failure with latch wait in single-user mode
LatchWaitSetPostmasterDeathPos, the latch event position for the
postmaster death event, is initialized under IsUnderPostmaster.
WaitLatch() considered it as a valid wait target in single-user mode
(!IsUnderPostmaster), which was incorrect.

One code path found to fail with an assertion failure is a database drop
in single-user mode while waiting in WaitForProcSignalBarrier() after
the drop.

Oversight in commit 84e5b2f07a.

Author: Patrick Stählin <me@packi.ch>
Co-authored-by: Ronan Dunklau <ronan.dunklau@aiven.io>
Discussion: https://postgr.es/m/18996-3a2744c8140488de@postgresql.org
Backpatch-through: 18
2025-07-25 16:17:13 +09:00
Nathan Bossart
15d33eb192 Fix return value of visibilitymap_get_status().
This function is declared as returning a uint8, but it returns a
bool in one code path.  To fix, return (uint8) 0 instead of false
there.  This should behave exactly the same as before, but it might
prevent future compiler complaints.

Oversight in commit a892234f83.

Author: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://postgr.es/m/aIHluT2isN58jqHV%40jrouhaud
2025-07-24 10:13:45 -05:00
Michael Paquier
719dcf3c42 Introduce field tracking cached plan type in PlannedStmt
PlannedStmt gains a new field, called CachedPlanType, able to track if a
given plan tree originates from the cache and if we are dealing with a
generic or custom cached plan.

This field can be used for monitoring or statistical purposes, in the
executor hooks, for example, based on the planned statement attached to
a QueryDesc.  A patch is under discussion for pg_stat_statements to
provide an equivalent of the counters in pg_prepared_statements for
custom and generic plans, to provide a more global view of such data, as
this data is now restricted to the current session.

The concept introduced in this commit is useful on its own, and has been
extracted from a larger patch by the same author.

Author: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAA5RZ0uFw8Y9GCFvafhC=OA8NnMqVZyzXPfv_EePOt+iv1T-qQ@mail.gmail.com
2025-07-24 15:41:18 +09:00
Tom Lane
e6dfd068ed Fix build breakage on Solaris-alikes with late-model GCC.
Solaris has never bothered to add "const" to the second argument
of PAM conversation procs, as all other Unixen did decades ago.
This resulted in an "incompatible pointer" compiler warning when
building --with-pam, but had no more serious effect than that,
so we never did anything about it.  However, as of GCC 14 the
case is an error not warning by default.

To complicate matters, recent OpenIndiana (and maybe illumos
in general?) *does* supply the "const" by default, so we can't
just assume that platforms using our solaris template need help.

What we can do, short of building a configure-time probe,
is to make solaris.h #define _PAM_LEGACY_NONCONST, which
causes OpenIndiana's pam_appl.h to revert to the traditional
definition, and hopefully will have no effect anywhere else.
Then we can use that same symbol to control whether we include
"const" in the declaration of pam_passwd_conv_proc().

Bug: #18995
Reported-by: Andrew Watkins <awatkins1966@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18995-82058da9ab4337a7@postgresql.org
Backpatch-through: 13
2025-07-23 15:44:29 -04:00
Nathan Bossart
2047ad0681 Cross-check lists of built-in LWLock tranches.
lwlock.c, lwlock.h, and wait_event_names.txt each contain a list of
built-in LWLock tranches.  It is easy to miss one or the other when
adding or removing tranches, and discrepancies have adverse effects
(e.g., breaking JOINs between pg_stat_activity and pg_wait_events).
This commit moves the lists of built-in tranches in lwlock.{c,h} to
lwlocklist.h and adds a cross-check to the script that generates
lwlocknames.h.  If the lists do not match exactly, building will
fail.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aHpOgwuFQfcFMZ/B%40ip-10-97-1-34.eu-west-3.compute.internal
2025-07-23 12:06:20 -05:00
Nathan Bossart
37c7a7eeb6 Use PqMsg_* macros in walsender.c
Oversights in commits f4b54e1ed9, dc21234005, and 228c370868.

Author: Dave Cramer <davecramer@gmail.com>
Discussion: https://postgr.es/m/CADK3HH%2BowWVdnbmWH4NHG8%3D%2BkXA_wjsyEVLoY719iJnb%3D%2BtT6A%40mail.gmail.com
2025-07-23 10:29:45 -05:00
Amit Kapila
228c370868 Preserve conflict-relevant data during logical replication.
Logical replication requires reliable conflict detection to maintain data
consistency across nodes. To achieve this, we must prevent premature
removal of tuples deleted by other origins and their associated commit_ts
data by VACUUM, which could otherwise lead to incorrect conflict reporting
and resolution.

This patch introduces a mechanism to retain deleted tuples on the
subscriber during the application of concurrent transactions from remote
nodes. Retaining these tuples allows us to correctly ignore concurrent
updates to the same tuple. Without this, an UPDATE might be misinterpreted
as an INSERT during resolutions due to the absence of the original tuple.

Additionally, we ensure that origin metadata is not prematurely removed by
vacuum freeze, which is essential for detecting update_origin_differs and
delete_origin_differs conflicts.

To support this, a new replication slot named pg_conflict_detection is
created and maintained by the launcher on the subscriber. Each apply
worker tracks its own non-removable transaction ID, which the launcher
aggregates to determine the appropriate xmin for the slot, thereby
retaining necessary tuples.

Conflict information retention (deleted tuples and commit_ts) can be
enabled per subscription via the retain_conflict_info option. This is
disabled by default to avoid unnecessary overhead for configurations that
do not require conflict resolution or logging.

During upgrades, if any subscription on the old cluster has
retain_conflict_info enabled, a conflict detection slot will be created to
protect relevant tuples from deletion when the new cluster starts.

This is a foundational work to correctly detect update_deleted conflict
which will be done in a follow-up patch.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2025-07-23 02:56:00 +00:00
Fujii Masao
a7ca73af66 Remove translation marker from libpq-be-fe-helpers.h.
Commit 112faf1378 introduced a translation marker in libpq-be-fe-helpers.h,
but this caused build failures on some platforms—such as the one reported
by buildfarm member indri—due to linker issues with dblink. This is the same
problem previously addressed in commit 213c959a29.

To fix the issue, this commit removes the translation marker from
libpq-be-fe-helpers.h, following the approach used in 213c959a29.
It also removes the associated gettext_noop() calls added in commit
112faf1378, as they are no longer needed.

While reviewing this, a gettext_noop() call was also found in
contrib/basic_archive. Since contrib modules don't support translation,
this call has been removed as well.

Per buildfarm member indri.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/0e6299d9-608a-4ffa-aeb1-40cb8a99000b@oss.nttdata.com
2025-07-22 22:08:36 +09:00
Andres Freund
d3f97fd1dd aio: Fix assertion, clarify README
The assertion wouldn't have triggered for a long while yet, but this won't
accidentally fail to detect the issue if/when it occurs.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAEze2Wj-43JV4YufW23gm=Uwr7Lkj+p0yKctKHxNm1rwFC+_DQ@mail.gmail.com
Backpatch-through: 18
2025-07-22 08:30:52 -04:00
Fujii Masao
112faf1378 Log remote NOTICE, WARNING, and similar messages using ereport().
Previously, NOTICE, WARNING, and similar messages received from remote
servers over replication, postgres_fdw, or dblink connections were printed
directly to stderr on the local server (e.g., the subscriber). As a result,
these messages lacked log prefixes (e.g., timestamp), making them harder
to trace and correlate with other log entries.

This commit addresses the issue by introducing a custom notice receiver
for replication, postgres_fdw, and dblink connections. These messages
are now logged via ereport(), ensuring they appear in the logs with proper
formatting and context, which improves clarity and aids in debugging.

Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CALDaNm2xsHpWRtLm-VL_HJCsaE3+1Y_n-jDEAr3-suxVqc3xoQ@mail.gmail.com
2025-07-22 14:16:45 +09:00
Richard Guo
e2debb6438 Reduce "Var IS [NOT] NULL" quals during constant folding
In commit b262ad440, we introduced an optimization that reduces an IS
[NOT] NULL qual on a NOT NULL column to constant true or constant
false, provided we can prove that the input expression of the NullTest
is not nullable by any outer joins or grouping sets.  This deduction
happens quite late in the planner, during the distribution of quals to
rels in query_planner.  However, this approach has some drawbacks: we
can't perform any further folding with the constant, and it turns out
to be prone to bugs.

Ideally, this deduction should happen during constant folding.
However, the per-relation information about which columns are defined
as NOT NULL is not available at that point.  This information is
currently collected from catalogs when building RelOptInfos for base
or "other" relations.

This patch moves the collection of NOT NULL attribute information for
relations before pull_up_sublinks, storing it in a hash table keyed by
relation OID.  It then uses this information to perform the NullTest
deduction for Vars during constant folding.  This also makes it
possible to leverage this information to pull up NOT IN subqueries.

Note that this patch does not get rid of restriction_is_always_true
and restriction_is_always_false.  Removing them would prevent us from
reducing some IS [NOT] NULL quals that we were previously able to
reduce, because (a) the self-join elimination may introduce new IS NOT
NULL quals after constant folding, and (b) if some outer joins are
converted to inner joins, previously irreducible NullTest quals may
become reducible.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com
2025-07-22 11:21:36 +09:00
Richard Guo
904f6a593a Centralize collection of catalog info needed early in the planner
There are several pieces of catalog information that need to be
retrieved for a relation during the early stage of planning.  These
include relhassubclass, which is used to clear the inh flag if the
relation has no children, as well as a column's attgenerated and
default value, which are needed to expand virtual generated columns.
More such information may be required in the future.

Currently, these pieces of catalog data are collected in multiple
places, resulting in repeated table_open/table_close calls for each
relation in the rangetable.  This patch centralizes the collection of
all required early-stage catalog information into a single loop over
the rangetable, allowing each relation to be opened and closed only
once.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com
2025-07-22 11:20:40 +09:00
Richard Guo
e0d0529526 Expand virtual generated columns before sublink pull-up
Currently, we expand virtual generated columns after we have pulled up
any SubLinks within the query's quals.  This ensures that the virtual
generated column references within SubLinks that should be transformed
into joins are correctly expanded.  This approach works well and has
posed no issues.

In an upcoming patch, we plan to centralize the collection of catalog
information needed early in the planner.  This will help avoid
repeated table_open/table_close calls for relations in the rangetable.
Since this information is required during sublink pull-up, we are
moving the expansion of virtual generated columns to occur beforehand.

To achieve this, if any EXISTS SubLinks can be pulled up, their
rangetables are processed just before pulling them up.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAMbWs4-bFJ1At4btk5wqbezdu8PLtQ3zv-aiaY3ry9Ymm=jgFQ@mail.gmail.com
2025-07-22 11:19:17 +09:00
Tom Lane
aadf7db66e Mostly-cosmetic adjustments to estimate_multivariate_bucketsize().
The only practical effect of these changes is to avoid a useless
list_copy() operation when there is a single hashclause.  That's
never going to make any noticeable performance difference, but
the code is arguably clearer this way, especially if we take the
opportunity to add some comments so that readers don't have to
reverse-engineer the usage of these local variables.  Also add
some braces for better/more consistent style.

Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAHewXNnHBOO9NEa=NBDYOrwZL4oHu2NOcTYvqyNyWEswo8f5OQ@mail.gmail.com
2025-07-19 14:23:02 -04:00
Alexander Korotkov
d3917d8f13 Fix infinite wait when reading a partially written WAL record
If a crash occurs while writing a WAL record that spans multiple pages, the
recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD
flag.  However, logical decoding currently attempts to read the full WAL
record based on its expected size before checking this flag, which can lead
to an infinite wait if the remaining data is never written (e.g., no activity
after crash).

This patch updates the logic first to read the page header and check for
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct
the full WAL record.  If the flag is set, decoding correctly identifies
the record as incomplete and avoids waiting for WAL data that will never
arrive.

Discussion: https://postgr.es/m/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Backpatch-through: 13
2025-07-19 13:45:51 +03:00
Tom Lane
3683af6170 Speed up byteain by not parsing traditional-style input twice.
Instead of laboriously computing the exact output length, use strlen
to get an upper bound cheaply.  (This is still O(N) of course, but
the constant factor is a lot less.)  This will typically result in
overallocating the output datum, but that's of little concern since
it's a short-lived allocation in just about all use-cases.

A simple microbenchmark showed about 40% speedup for long input
strings.

While here, make some cosmetic cleanups and add a test case that
covers the double-backslash code path in byteain and byteaout.

Author: Steven Niu <niushiji@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Stepan Neretin <slpmcf@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/ca315729-140b-426e-81a6-6cd5cfe7ecc5@gmail.com
2025-07-18 16:42:10 -04:00
Nathan Bossart
84409ed640 Remove unused variable in generate-lwlocknames.pl.
Oversight in commit da952b415f.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aHpOgwuFQfcFMZ/B%40ip-10-97-1-34.eu-west-3.compute.internal
2025-07-18 11:27:19 -05:00
Dean Rasheed
5022ff250e Fix concurrent update trigger issues with MERGE in a CTE.
If a MERGE inside a CTE attempts an UPDATE or DELETE on a table with
BEFORE ROW triggers, and a concurrent UPDATE or DELETE happens, the
merge code would fail (crashing in the case of an UPDATE action, and
potentially executing the wrong action for a DELETE action).

This is the same issue that 9321c79c86 attempted to fix, except now
for a MERGE inside a CTE. As noted in 9321c79c86, what needs to happen
is for the trigger code to exit early, returning the TM_Result and
TM_FailureData information to the merge code, if a concurrent
modification is detected, rather than attempting to do an EPQ
recheck. The merge code will then do its own rechecking, and rescan
the action list, potentially executing a different action in light of
the concurrent update. In particular, the trigger code must never call
ExecGetUpdateNewTuple() for MERGE, since that is bound to fail because
MERGE has its own per-action projection information.

Commit 9321c79c86 did this using estate->es_plannedstmt->commandType
in the trigger code to detect that a MERGE was being executed, which
is fine for a plain MERGE command, but does not work for a MERGE
inside a CTE. Fix by passing that information to the trigger code as
an additional parameter passed to ExecBRUpdateTriggers() and
ExecBRDeleteTriggers().

Back-patch as far as v17 only, since MERGE cannot appear inside a CTE
prior to that. Additionally, take care to preserve the trigger ABI in
v17 (though not in v18, which is still in beta).

Bug: #18986
Reported-by: Yaroslav Syrytsia <me@ys.lc>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18986-e7a8aac3d339fa47@postgresql.org
Backpatch-through: 17
2025-07-18 09:55:43 +01:00
Álvaro Herrera
b8926a5b4b
Remove assertion from PortalRunMulti
We have an assertion to ensure that a command tag has been assigned by
the time we're done executing, but if we happen to execute a command
with no queries, the assertion would fail.  Per discussion, rather than
contort things to get a tag assigned, just remove the assertion.

Oversight in 2f9661311b.  That commit also retained a comment that
explained logic that had been adjacent to it but diffused into various
places, leaving none apt to keep part of the comment.  Remove that part,
and rewrite what remains for extra clarity.

Bug: #18984
Backpatch-through: 13
Reported-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michaël Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org
2025-07-17 17:40:22 +02:00
Amit Langote
afa5c365ec Remove duplicate line
In 231b7d670b, while copy-pasting some code into
ExecEvalJsonCoercionFinish(), I (amitlan) accidentally introduced
a duplicate line.  Remove it.

Reported-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxHcf=BpmRAJcjgfjOUfV76MwKnyz1x3ErXsWL26EAFmng@mail.gmail.com
2025-07-17 14:37:06 +09:00
Michael Paquier
a493e741d3 Fix inconsistent LWLock tranche names for MultiXact*
The terms used in wait_event_names.txt and lwlock.c were inconsistent
for MultiXactOffsetSLRU and MultiXactMemberSLRU, which could cause joins
between pg_wait_events and pg_stat_activity to fail.  lwlock.c is
adjusted in this commit to what the historical name of the event has
always been, and what is documented.

Oversight in 53c2a97a92.  08b9b9e043 has fixed a similar
inconsistency some time ago.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/aHdxN0D0hKXzHFQG@ip-10-97-1-34.eu-west-3.compute.internal
Backpatch-through: 17
2025-07-17 09:30:26 +09:00
Jeff Davis
5e6e42e44f Force LC_COLLATE to C in postmaster.
Avoid dependence on setlocale().

strcoll(), etc., are not called directly; all collation-sensitive
calls should go through pg_locale.c and use the appropriate
provider. By setting LC_COLLATE to C, we avoid accidentally depending
on libc behavior when using a different provider.

No behavior change in the backend, but it's possible that some
extensions will be affected. Such extensions should be updated to use
the pg_locale_t APIs.

Discussion: https://postgr.es/m/9875f7f9-50f1-4b5d-86fc-ee8b03e8c162@eisentraut.org
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
2025-07-16 14:13:18 -07:00
Peter Geoghegan
4c8ad67a98 nbtree: Use only one notnullkey ScanKeyData.
_bt_first need only store one ScanKeyData struct on the stack for the
purposes of building an IS NOT NULL key based on an implied NOT NULL
constraint.  We don't need INDEX_MAX_KEYS-many ScanKeyData structs.

This saves us a little over 2KB in stack space.  It's possible that this
has some performance benefit.  It also seems simpler and more direct.

It isn't possible for more than a single index attribute to need its own
implied IS NOT NULL key: the first such attribute/IS NOT NULL key always
makes _bt_first stop adding additional boundary keys to startKeys[].
Using INDEX_MAX_KEYS-many ScanKeyData entries was (at best) misleading.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Mircea Cadariu <cadariu.mircea@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm=1kJMSZhhTLoM5BPbwQNWxUj-ynOEh=89ptDZAVgauw@mail.gmail.com
2025-07-16 13:05:44 -04:00
Michael Paquier
1dbe6f7667 Refactor non-supported compression error message in toast_compression.c
This code used a NO_LZ4_SUPPORT() macro to issue an error in the code
paths where LZ4 [de]compression is attempted but the build does not
support it.  This commit refactors the code to use a more flexible error
message so as it can be used for other compression methods, where the
method is given in input of macro.

Extracted from a larger patch by the same author.

Author: Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com
2025-07-16 11:59:22 +09:00
Fujii Masao
b8341ae856 pgoutput: Initialize missing default for "origin" parameter.
The pgoutput plugin initializes optional parameters like "binary" with
default values at the start of processing. However, the "origin"
parameter was previously missed and left without explicit initialization.

Although the PGOutputData struct, which holds these settings,
is zero-initialized at allocation (resulting in publish_no_origin field
for "origin" parameter being false by default), this default was not
set explicitly, unlike other parameters.

This commit adds explicit initialization of the "origin" parameter to
ensure consistency and clarity in how defaults are handled.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/d2790f10-238d-4cb5-a743-d9d2a9dd900f@oss.nttdata.com
2025-07-16 10:31:51 +09:00
Tom Lane
5fe55a0fe4 Doc: clarify description of regexp fields in pg_ident.conf.
The grammar was a little shaky and confusing here, so word-smith it
a bit.  Also, adjust the comments in pg_ident.conf.sample to use the
same terminology as the SGML docs, in particular "DATABASE-USERNAME"
not "PG-USERNAME".

Back-patch appropriate subsets.  I did not risk changing
pg_ident.conf.sample in released branches, but it still seems OK
to change it in v18.

Reported-by: Alexey Shishkin <alexey.shishkin@enterprisedb.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Discussion: https://postgr.es/m/175206279327.3157504.12519088928605422253@wrigleys.postgresql.org
Backpatch-through: 13
2025-07-15 18:53:00 -04:00
Tom Lane
2a3a396432 Clarify the ra != rb case in compareJsonbContainers().
It's impossible to reach this case with either ra or rb being
WJB_DONE, because our earlier checks that the structure and
length of the inputs match should guarantee that we reach their
ends simultaneously.  However, the comment completely fails to
explain this, and the Asserts don't cover it either.  The comment
is pretty obscure anyway, so rewrite it, and extend the Asserts
to reject WJB_DONE.

This is only cosmetic, so no need for back-patch.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/0c623e8a204187b87b4736792398eaf1@postgrespro.ru
2025-07-15 18:21:12 -04:00
Tom Lane
aad1617b76 Silence uninitialized-value warnings in compareJsonbContainers().
Because not every path through JsonbIteratorNext() sets val->type,
some compilers complain that compareJsonbContainers() is comparing
possibly-uninitialized values.  The paths that don't set it return
WJB_DONE, WJB_END_ARRAY, or WJB_END_OBJECT, so it's clear by
manual inspection that the "(ra == rb)" code path is safe, and
indeed we aren't seeing warnings about that.  But the (ra != rb)
case is much less obviously safe.  In Assert-enabled builds it
seems that the asserts rejecting WJB_END_ARRAY and WJB_END_OBJECT
persuade gcc 15.x not to warn, which makes little sense because
it's impossible to believe that the compiler can prove of its
own accord that ra/rb aren't WJB_DONE here.  (In fact they never
will be, so the code isn't wrong, but why is there no warning?)
Without Asserts, the appearance of warnings is quite unsurprising.

We discussed fixing this by converting those two Asserts into
pg_assume, but that seems not very satisfactory when it's so unclear
why the compiler is or isn't warning: the warning could easily
reappear with some other compiler version.  Let's fix it in a less
magical, more future-proof way by changing JsonbIteratorNext()
so that it always does set val->type.  The cost of that should be
pretty negligible, and it makes the function's API spec less squishy.

Reported-by: Erik Rijkers <er@xs4all.nl>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/988bf1bc-3f1f-99f3-bf98-222f1cd9dc5e@xs4all.nl
Discussion: https://postgr.es/m/0c623e8a204187b87b4736792398eaf1@postgrespro.ru
Backpatch-through: 13
2025-07-15 18:11:18 -04:00
Michael Paquier
006fc975a2 Fix comments in index.c
This comment paragraph referred to text_eq(), but the name of the
function in charge of "text" comparisons is called texteq().

Author: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxHL--XNcCCO1LgKsygzYGiVHZMfTcAxOSG8+ezxWtjddw@mail.gmail.com
2025-07-15 16:05:59 +09:00
Tom Lane
3c4e26a62c In username-map substitution, cope with more than one \1.
If the system-name field of a pg_ident.conf line is a regex
containing capturing parentheses, you can write \1 in the
user-name field to represent the captured part of the system
name.  But what happens if you write \1 more than once?
The only reasonable expectation IMO is that each \1 gets
replaced, but presently our code replaces only the first.
Fix that.

Also, improve the tests for this feature to exercise cases
where a non-empty string needs to be substituted for \1.
The previous testing didn't inspire much faith that it
was verifying correct operation of the substitution code.

Given the lack of field complaints about this, I don't
feel a need to back-patch.

Reported-by: David G. Johnston <david.g.johnston@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAKFQuwZu6kZ8ZPvJ3pWXig+6UX4nTVK-hdL_ZS3fSdps=RJQQQ@mail.gmail.com
2025-07-13 13:52:32 -04:00
Nathan Bossart
8893c3ab36 Remove XLogCtl->ckptFullXid.
A few code paths set this variable, but its value is never used.

Oversight in commit 2fc7af5e96.

Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://postgr.es/m/aHFyE1bs9YR93dQ1%40nathan
2025-07-12 14:34:57 -05:00
Tom Lane
84ce258707 Replace float8 with int in date2isoweek() and date2isoyear().
The values of the "result" variables in these functions are
always integers; using a float8 variable accomplishes nothing
except to incur useless conversions to and from float.  While
that wastes a few nanoseconds, these functions aren't all that
time-critical.  But it seems worth fixing to remove possible
reader confusion.

Also, in the case of date2isoyear(), "result" is a very poorly
chosen variable name because it is *not* the function's result.
Rename it to "week", and do the same in date2isoweek() for
consistency.

Since this is mostly cosmetic, there seems little need
for back-patch.

Author: Sergey Fukanchik <s.fukanchik@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6323a-68726500-1-7def9d00@137821581
2025-07-12 11:50:37 -04:00
Andres Freund
f2c87ac04e Remove long-unused TransactionIdIsActive()
TransactionIdIsActive() has not been used since bb38fb0d43, in 2014. There
are no known uses in extensions either and it's hard to see valid uses for
it. Therefore remove TransactionIdIsActive().

Discussion: https://postgr.es/m/odgftbtwp5oq7cxjgf4kjkmyq7ypoftmqy7eqa7w3awnouzot6@hrwnl5tdqrgu
2025-07-12 11:00:44 -04:00
Thomas Munro
b8e1f2d96b aio: Fix configuration reload in IO workers.
method_worker.c installed SignalHandlerForConfigReload, but it failed to
actually process reload requests.  That hasn't yet produced any concrete
problem reports in terms of GUC changes it should have cared about in
v18, but it was inconsistent.

It did cause problems for a couple of patches in development that need
IO workers to react to ALTER SYSTEM + pg_reload_conf().  Fix extracted
from one of those patches.

Back-patch to 18.

Reported-by: Dmitry Dolgov <9erthalion6@gmail.com>
Discussion: https://postgr.es/m/sh5uqe4a4aqo5zkkpfy5fobe2rg2zzouctdjz7kou4t74c66ql%40yzpkxb7pgoxf
2025-07-12 16:33:02 +12:00
Thomas Munro
177c1f0593 aio: Remove obsolete IO worker ID references.
In an ancient ancestor of this code, the postmaster assigned IDs to IO
workers.  Now it tracks them in an unordered array and doesn't know
their IDs, so it might be confusing to readers that it still referred to
their indexes as IDs.

No change in behavior, just variable name and error message cleanup.

Back-patch to 18.

Discussion: https://postgr.es/m/CA%2BhUKG%2BwbaZZ9Nwc_bTopm4f-7vDmCwLk80uKDHj9mq%2BUp0E%2Bg%40mail.gmail.com
2025-07-12 14:44:22 +12:00
Thomas Munro
01d618bcd7 aio: Regularize IO worker internal naming.
Adopt PgAioXXX convention for pgaio module type names.  Rename a
function that didn't use a pgaio_worker_ submodule prefix.  Rename the
internal submit function's arguments to match the indirectly relevant
function pointer declaration and nearby examples.  Rename the array of
handle IDs in PgAioSubmissionQueue to sqes, a term of art seen in the
systems it emulates, also clarifying that they're not IO handle
pointers as the old name might imply.

No change in behavior, just type, variable and function name cleanup.

Back-patch to 18.

Discussion: https://postgr.es/m/CA%2BhUKG%2BwbaZZ9Nwc_bTopm4f-7vDmCwLk80uKDHj9mq%2BUp0E%2Bg%40mail.gmail.com
2025-07-12 14:44:09 +12:00
Thomas Munro
40e105042a Fix stale idle flag when IO workers exit.
Otherwise we could choose a worker that has exited and crash while
trying to wake it up.

Back-patch to 18.

Reported-by: Tomas Vondra <tomas@vondra.me>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/t5aqjhkj6xdkido535pds7fk5z4finoxra4zypefjqnlieevbg%40357aaf6u525j
2025-07-12 13:11:47 +12:00
Tom Lane
64840e4624 Fix inconsistent quoting of role names in ACLs.
getid() and putid(), which parse and deparse role names within ACL
input/output, applied isalnum() to see if a character within a role
name requires quoting.  They did this even for non-ASCII characters,
which is problematic because the results would depend on encoding,
locale, and perhaps even platform.  So it's possible that putid()
could elect not to quote some string that, later in some other
environment, getid() will decide is not a valid identifier, causing
dump/reload or similar failures.

To fix this in a way that won't risk interoperability problems
with unpatched versions, make getid() treat any non-ASCII as a
legitimate identifier character (hence not requiring quotes),
while making putid() treat any non-ASCII as requiring quoting.
We could remove the resulting excess quoting once we feel that
no unpatched servers remain in the wild, but that'll be years.

A lesser problem is that getid() did the wrong thing with an input
consisting of just two double quotes ("").  That has to represent an
empty string, but getid() read it as a single double quote instead.
The case cannot arise in the normal course of events, since we don't
allow empty-string role names.  But let's fix it while we're here.

Although we've not heard field reports of problems with non-ASCII
role names, there's clearly a hazard there, so back-patch to all
supported versions.

Reported-by: Peter Eisentraut <peter@eisentraut.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3792884.1751492172@sss.pgh.pa.us
Backpatch-through: 13
2025-07-11 18:50:13 -04:00
Nathan Bossart
8d33fbacba Add FLUSH_UNLOGGED option to CHECKPOINT command.
This option, which is disabled by default, can be used to request
the checkpoint also flush dirty buffers of unlogged relations.  As
with the MODE option, the server may consolidate the options for
concurrently requested checkpoints.  For example, if one session
uses (FLUSH_UNLOGGED FALSE) and another uses (FLUSH_UNLOGGED TRUE),
the server may perform one checkpoint with FLUSH_UNLOGGED enabled.

Author: Christoph Berg <myon@debian.org>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
2025-07-11 11:51:25 -05:00
Nathan Bossart
2f698d7f4b Add MODE option to CHECKPOINT command.
This option may be set to FAST (the default) to request the
checkpoint be completed as fast as possible, or SPREAD to request
the checkpoint be spread over a longer interval (based on the
checkpoint-related configuration parameters).  Note that the server
may consolidate the options for concurrently requested checkpoints.
For example, if one session requests a "fast" checkpoint and
another requests a "spread" checkpoint, the server may perform one
"fast" checkpoint.

Author: Christoph Berg <myon@debian.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
2025-07-11 11:51:25 -05:00
Nathan Bossart
a4f126516e Add option list to CHECKPOINT command.
This commit adds the boilerplate code for supporting a list of
options in CHECKPOINT commands.  No actual options are supported
yet, but follow-up commits will add support for MODE and
FLUSH_UNLOGGED.  While at it, this commit refactors the code for
executing CHECKPOINT commands to its own function since it's about
to become significantly larger.

Author: Christoph Berg <myon@debian.org>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
2025-07-11 11:51:25 -05:00
Nathan Bossart
bb938e2c3c Rename CHECKPOINT_IMMEDIATE to CHECKPOINT_FAST.
The new name more accurately reflects the effects of this flag on a
requested checkpoint.  Checkpoint-related log messages (i.e., those
controlled by the log_checkpoints configuration parameter) will now
say "fast" instead of "immediate", too.  Likewise, references to
"immediate" checkpoints in the documentation have been updated to
say "fast".  This is preparatory work for a follow-up commit that
will add a MODE option to the CHECKPOINT command.

Author: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
2025-07-11 11:51:25 -05:00
Nathan Bossart
cd8324cc89 Rename CHECKPOINT_FLUSH_ALL to CHECKPOINT_FLUSH_UNLOGGED.
The new name more accurately relects the effects of this flag on a
requested checkpoint.  Checkpoint-related log messages (i.e., those
controlled by the log_checkpoints configuration parameter) will now
say "flush-unlogged" instead of "flush-all", too.  This is
preparatory work for a follow-up commit that will add a
FLUSH_UNLOGGED option to the CHECKPOINT command.

Author: Christoph Berg <myon@debian.org>
Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de
2025-07-11 11:51:25 -05:00
Amit Kapila
72e6c08fea Fix the handling of two GUCs during upgrade.
Previously, the check_hook functions for max_slot_wal_keep_size and
idle_replication_slot_timeout would incorrectly raise an ERROR for values
set in postgresql.conf during upgrade, even though those values were not
actively used in the upgrade process.

To prevent logical slot invalidation during upgrade, we used to set
special values for these GUCs. Now, instead of relying on those values, we
directly prevent WAL removal and logical slot invalidation caused by
max_slot_wal_keep_size and idle_replication_slot_timeout.

Note: PostgreSQL 17 does not include the idle_replication_slot_timeout
GUC, so related changes were not backported.

BUG #18979
Reported-by: jorsol <jorsol@gmail.com>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed by: vignesh C <vignesh21@gmail.com>
Reviewed by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Backpatch-through: 17, where it was introduced
Discussion: https://postgr.es/m/219561.1751826409@sss.pgh.pa.us
Discussion: https://postgr.es/m/18979-a1b7fdbb7cd181c6@postgresql.org
2025-07-11 10:46:43 +05:30
Fujii Masao
110e6dcaa6 doc: Clarify meaning of "idle" in idle_replication_slot_timeout.
This commit updates the documentation to clarify that "idle" in
idle_replication_slot_timeout means the replication slot is inactive,
that is, not currently used by any replication connection.

Without this clarification, "idle" could be misinterpreted to mean
that the slot is not advancing or that no data is being streamed,
even if a connection exists.

Back-patch to v18 where idle_replication_slot_timeout was added.

Author: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Gunnar Morling <gunnar.morling@googlemail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CADGJaX_0+FTguWpNSpgVWYQP_7MhoO0D8=cp4XozSQgaZ40Odw@mail.gmail.com
Backpatch-through: 18
2025-07-11 08:44:32 +09:00
Fujii Masao
05dedf43d3 Change unit of idle_replication_slot_timeout to seconds.
Previously, the idle_replication_slot_timeout parameter used minutes
as its unit, based on the assumption that values would typically exceed
one minute in production environments. However, this caused unexpected
behavior: specifying a value below 30 seconds would round down to 0,
effectively disabling the timeout. This could be surprising to users.

To allow finer-grained control and avoid such confusion, this commit changes
the unit of idle_replication_slot_timeout to seconds. Larger values can
still be specified easily using standard time suffixes, for example,
'24h' for 24 hours.

Back-patch to v18 where idle_replication_slot_timeout was added.

Reported-by: Gunnar Morling <gunnar.morling@googlemail.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CADGJaX_0+FTguWpNSpgVWYQP_7MhoO0D8=cp4XozSQgaZ40Odw@mail.gmail.com
Backpatch-through: 18
2025-07-11 08:39:24 +09:00
Jeff Davis
53cd0b71ee Change wchar2char() and char2wchar() to accept a locale_t.
These are libc-specific functions, so should require a locale_t rather
than a pg_locale_t (which could use another provider).

Discussion: https://postgr.es/m/a8666c391dfcabe79868d95f7160eac533ace718.camel%40j-davis.com
2025-07-09 08:45:34 -07:00
Nathan Bossart
167ed8082f Introduce pg_dsm_registry_allocations view.
This commit adds a new system view that provides information about
entries in the dynamic shared memory (DSM) registry.  Specifically,
it returns the name, type, and size of each entry.  Note that since
we cannot discover the size of dynamic shared memory areas (DSAs)
and hash tables backed by DSAs (dshashes) without first attaching
to them, the size column is left as NULL for those.

Bumps catversion.

Author: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Sungwoo Chang <swchangdev@gmail.com>
Discussion: https://postgr.es/m/4D445D3E-81C5-4135-95BB-D414204A0AB4%40gmail.com
2025-07-09 09:17:56 -05:00
Tom Lane
e03c952877 Fix low-probability memory leak in XMLSERIALIZE(... INDENT).
xmltotext_with_options() did not consider the possibility that
pg_xml_init() could fail --- most likely due to OOM.  If that
happened, the already-parsed xmlDoc structure would be leaked.
Oversight in commit 483bdb2af.

Bug: #18981
Author: Dmitry Kovalenko <d.kovalenko@postgrespro.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18981-9bc3c80f107ae925@postgresql.org
Backpatch-through: 16
2025-07-08 12:50:33 -04:00
Andres Freund
f54af9f267 aio: Combine io_uring memory mappings, if supported
By default io_uring creates a shared memory mapping for each io_uring
instance, leading to a large number of memory mappings. Unfortunately a large
number of memory mappings slows things down, backend exit is particularly
affected.  To address that, newer kernels (6.5) support using user-provided
memory for the memory. By putting the relevant memory into shared memory we
don't need any additional mappings.

On a system with a new enough kernel and liburing, there is no discernible
overhead when doing a pgbench -S -C anymore.

Reported-by: MARK CALLAGHAN <mdcallag@gmail.com>
Reviewed-by: "Burd, Greg" <greg@burd.me>
Reviewed-by: Jim Nasby <jnasby@upgrade.com>
Discussion: https://postgr.es/m/CAFbpF8OA44_UG+RYJcWH9WjF7E3GA6gka3gvH6nsrSnEe9H0NA@mail.gmail.com
Backpatch-through: 18
2025-07-07 22:57:07 -04:00
Richard Guo
55a780e947 Consider explicit incremental sort for Append and MergeAppend
For an ordered Append or MergeAppend, we need to inject an explicit
sort into any subpath that is not already well enough ordered.
Currently, only explicit full sorts are considered; incremental sorts
are not yet taken into account.

In this patch, for subpaths of an ordered Append or MergeAppend, we
choose to use explicit incremental sort if it is enabled and there are
presorted keys.

The rationale is based on the assumption that incremental sort is
always faster than full sort when there are presorted keys, a premise
that has been applied in various parts of the code.  In addition, the
current cost model tends to favor incremental sort as being cheaper
than full sort in the presence of presorted keys, making it reasonable
not to consider full sort in such cases.

No backpatch as this could result in plan changes.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4_V7a2enTR+T3pOY_YZ-FU8ZsFYym2swOz4jNMqmSgyuw@mail.gmail.com
2025-07-08 10:21:44 +09:00
Álvaro Herrera
c616785516
Refactor some repetitive SLRU code
Functions to bootstrap and zero pages in various SLRU callers were
fairly duplicative.  We can slash almost two hundred lines with a couple
of simple helpers:

 - SimpleLruZeroAndWritePage: Does the equivalent of SimpleLruZeroPage
   followed by flushing the page to disk
 - XLogSimpleInsertInt64: Does a XLogBeginInsert followed by XLogInsert
   of a trivial record whose data is just an int64.

Author: Evgeny Voropaev <evgeny.voropaev@tantorlabs.com>
Reviewed by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/97820ce8-a1cd-407f-a02b-47368fadb14b%40tantorlabs.com
2025-07-07 16:49:19 +02:00
Álvaro Herrera
2633dae2e4
Standardize LSN formatting by zero padding
This commit standardizes the output format for LSNs to ensure consistent
representation across various tools and messages.  Previously, LSNs were
inconsistently printed as `%X/%X` in some contexts, while others used
zero-padding.  This often led to confusion when comparing.

To address this, the LSN format is now uniformly set to `%X/%08X`,
ensuring the lower 32-bit part is always zero-padded to eight
hexadecimal digits.

Author: Japin Li <japinli@hotmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/ME0P300MB0445CA53CA0E4B8C1879AF84B641A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
2025-07-07 13:57:43 +02:00
Michael Paquier
62a17a9283 Integrate FullTransactionIds deeper into two-phase code
This refactoring is a follow-up of the work done in 5a1dfde833, that
has switched 2PC file names to use FullTransactionIds when written on
disk.  This will help with the integration of a follow-up solution
related to the handling of two-phase files during recovery, to address
older defects while reading these from disk after a crash.

This change is useful in itself as it reduces the need to build the
file names from epoch numbers and TransactionIds, because we can use
directly FullTransactionIds from which the 2PC file names are guessed.
So this avoids a lot of back-and-forth between the FullTransactionIds
retrieved from the file names and how these are passed around in the
internal 2PC logic.

Note that the core of the change is the use of a FullTransactionId
instead of a TransactionId in GlobalTransactionData, that tracks 2PC
file information in shared memory.  The change in TwoPhaseCallback makes
this commit unfit for stable branches.

Noah has contributed a good chunk of this patch.  I have spent some time
on it as well while working on the issues with two-phase state files and
recovery.

Author: Noah Misch <noah@leadboat.com>
Co-Authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/Z5sd5O9JO7NYNK-C@paquier.xyz
Discussion: https://postgr.es/m/20250116205254.65.nmisch@google.com
2025-07-07 12:50:40 +09:00
Michael Paquier
5a6c39b6df Disable commit timestamps during bootstrap
Attempting to use commit timestamps during bootstrapping leads to an
assertion failure, that can be reached for example with an initdb -c
that enables track_commit_timestamp.  It makes little sense to register
a commit timestamp for a BootstrapTransactionId, so let's disable the
activation of the module in this case.

This problem has been independently reported once by each author of this
commit.  Each author has proposed basically the same patch, relying on
IsBootstrapProcessingMode() to skip the use of commit_ts during
bootstrap.  The test addition is a suggestion by me, and is applied down
to v16.

Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Author: Andy Fan <zhihuifan1213@163.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/OSCPR01MB14966FF9E4C4145F37B937E52F5102@OSCPR01MB14966.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/87plejmnpy.fsf@163.com
Backpatch-through: 13
2025-07-04 15:09:24 +09:00
Fujii Masao
78ebda66bf Speed up truncation of temporary relations.
Previously, truncating a temporary relation required scanning the entire
local buffer pool once per relation fork to invalidate buffers. This could
be slow, especially with a large local buffers, as the scan was repeated
multiple times.

A similar issue with regular tables (shared buffers) was addressed in
commit 6d05086c0a by scanning the buffer pool only once for all forks.

This commit applies the same optimization to temporary relations,
improving truncation performance.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Discussion: https://postgr.es/m/CAJDiXggNqsJOH7C5co4jA8nDk8vw-=sokyh5s1_TENWnC6Ofcg@mail.gmail.com
2025-07-04 09:03:58 +09:00
Tom Lane
931766aaec Simplify COALESCE() with one surviving argument.
If, after removal of useless null-constant arguments, a CoalesceExpr
has exactly one remaining argument, we can just take that argument as
the result, without bothering to wrap a new CoalesceExpr around it.
This isn't likely to produce any great improvement in runtime per se,
but it can lead to better plans since the planner no longer has to
treat the expression as non-strict.

However, there were a few regression test cases that intentionally
wrote COALESCE(x) as a shorthand way of creating a non-strict
subexpression.  To avoid ruining the intent of those tests, write
COALESCE(x,x) instead.  (If anyone ever proposes de-duplicating
COALESCE arguments, we'll need another iteration of this arms race.
But it seems pretty unlikely that such an optimization would be
worthwhile.)

Author: Maksim Milyutin <maksim.milyutin@tantorlabs.ru>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/8e8573c3-1411-448d-877e-53258b7b2be0@tantorlabs.ru
2025-07-03 17:39:53 -04:00
Tom Lane
0059bbe1ec Break out xxx2yyy_opt_overflow APIs for more datetime conversions.
Previous commits invented timestamp2timestamptz_opt_overflow,
date2timestamp_opt_overflow, and date2timestamptz_opt_overflow
functions to perform non-error-throwing conversions between
datetime types.  This patch completes the set by adding
timestamp2date_opt_overflow, timestamptz2date_opt_overflow,
and timestamptz2timestamp_opt_overflow.

In addition, adjust timestamp2timestamptz_opt_overflow so that it
doesn't throw error if timestamp2tm fails, but treats that as an
overflow case.  The situation probably can't arise except with an
invalid timestamp value, and I can't think of a way that that would
happen except data corruption.  However, it's pretty silly to have a
function whose entire reason for existence is to not throw errors for
out-of-range inputs nonetheless throw an error for out-of-range input.

The new APIs are not used in this patch, but will be needed in
upcoming btree_gin changes.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Discussion: https://postgr.es/m/262624.1738460652@sss.pgh.pa.us
2025-07-03 16:17:08 -04:00
Tom Lane
a10f21e6ce Obtain required table lock during cross-table updates, redux.
Commits 8319e5cb5 et al missed the fact that ATPostAlterTypeCleanup
contains three calls to ATPostAlterTypeParse, and the other two
also need protection against passing a relid that we don't yet
have lock on.  Add similar logic to those code paths, and add
some test cases demonstrating the need for it.

In v18 and master, the test cases demonstrate that there's a
behavioral discrepancy between stored generated columns and virtual
generated columns: we disallow changing the expression of a stored
column if it's used in any composite-type columns, but not that of
a virtual column.  Since the expression isn't actually relevant to
either sort of composite-type usage, this prohibition seems
unnecessary; but changing it is a matter for separate discussion.
For now we are just documenting the existing behavior.

Reported-by: jian he <jian.universality@gmail.com>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: CACJufxGKJtGNRRSXfwMW9SqVOPEMdP17BJ7DsBf=tNsv9pWU9g@mail.gmail.com
Backpatch-through: 13
2025-07-03 13:46:07 -04:00
Álvaro Herrera
647cffd2f3
Prevent creation of duplicate not-null constraints for domains
This was previously harmless, but now that we create pg_constraint rows
for those, duplicates are not welcome anymore.

Backpatch to 18.

Co-authored-by: jian he <jian.universality@gmail.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CACJufxFSC0mcQ82bSk58sO-WJY4P-o4N6RD2M0D=DD_u_6EzdQ@mail.gmail.com
2025-07-03 11:46:12 +02:00
Álvaro Herrera
87251e1149
Fix bogus grammar for a CREATE CONSTRAINT TRIGGER error
If certain constraint characteristic clauses (NO INHERIT, NOT VALID, NOT
ENFORCED) are given to CREATE CONSTRAINT TRIGGER, the resulting error
message is
  ERROR:  TRIGGER constraints cannot be marked NO INHERIT
which is a bit silly, because these aren't "constraints of type
TRIGGER".  Hardcode a better error message to prevent it.  This is a
cosmetic fix for quite a fringe problem with no known complaints from
users, so no backpatch.

While at it, silently accept ENFORCED if given.

Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAAJ_b97hd-jMTS7AjgU6TDBCzDx_KyuKxG+K-DtYmOieg+giyQ@mail.gmail.com
Discussion: https://postgr.es/m/CACJufxHSp2puxP=q8ZtUGL1F+heapnzqFBZy5ZNGUjUgwjBqTQ@mail.gmail.com
2025-07-03 11:25:39 +02:00
Michael Paquier
8ec04c8577 Refactor subtype field of AlterDomainStmt
AlterDomainStmt.subtype used characters for its subtypes of commands,
SET|DROP DEFAULT|NOT NULL and ADD|DROP|VALIDATE CONSTRAINT, which were
hardcoded in a couple of places of the code.  The code is improved by
using an enum instead, with the same character values as the original
code.

Note that the field was documented in parsenodes.h and that it forgot to
mention 'V' (VALIDATE CONSTRAINT).

Author: Quan Zongliang <quanzongliang@yeah.net>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/41ff310b-16bd-44b9-a3ef-97e20f14b709@yeah.net
2025-07-03 16:34:28 +09:00
Fujii Masao
bc2f348e87 Support multi-line headers in COPY FROM command.
The COPY FROM command now accepts a non-negative integer for the HEADER option,
allowing multiple header lines to be skipped. This is useful when the input
contains multi-line headers that should be ignored during data import.

Author: Shinya Kato <shinya11.kato@gmail.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp>
Discussion: https://postgr.es/m/CAOzEurRPxfzbxqeOPF_AGnAUOYf=Wk0we+1LQomPNUNtyZGBZw@mail.gmail.com
2025-07-03 15:27:26 +09:00
Michael Paquier
fd7d7b7191 Improve checks for GUC recovery_target_timeline
Currently check_recovery_target_timeline() converts any value that is
not "current", "latest", or a valid integer to 0.  So, for example, the
following configuration added to postgresql.conf followed by a startup:
recovery_target_timeline = 'bogus'
recovery_target_timeline = '9999999999'

...  results in the following error patterns:
FATAL:  22023: recovery target timeline 0 does not exist
FATAL:  22023: recovery target timeline 1410065407 does not exist

This is confusing, because the server does not reflect the intention of
the user, and just reports incorrect data unrelated to the GUC.

The origin of the problem is that we do not perform a range check in the
GUC value passed-in for recovery_target_timeline.  This commit improves
the situation by using strtou64() and by providing stricter range
checks.  Some test cases are added for the cases of an incorrect, an
upper-bound and a lower-bound timeline value, checking the sanity of the
reports based on the contents of the server logs.

Author: David Steele <david@pgmasters.net>
Discussion: https://postgr.es/m/e5d472c7-e9be-4710-8dc4-ebe721b62cea@pgbackrest.org
2025-07-03 11:14:20 +09:00
Richard Guo
0da29e4cb1 Enable use of Memoize for ANTI joins
Currently, we do not support Memoize for SEMI and ANTI joins because
nested loop SEMI/ANTI joins do not scan the inner relation to
completion, which prevents Memoize from marking the cache entry as
complete.  One might argue that we could mark the cache entry as
complete after fetching the first inner tuple, but that would not be
safe: if the first inner tuple and the current outer tuple do not
satisfy the join clauses, a second inner tuple matching the parameters
would find the cache entry already marked as complete.

However, if the inner side is provably unique, this issue doesn't
arise, since there would be no second matching tuple.  That said, this
doesn't help in the case of SEMI joins, because a SEMI join with a
provably unique inner side would already have been reduced to an inner
join by reduce_unique_semijoins.

Therefore, in this patch, we check whether the inner relation is
provably unique for ANTI joins and enable the use of Memoize in such
cases.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48FdLiMNrmJL-g6mDvoQVt0yNyJAqMkv4e2Pk-5GKCZLA@mail.gmail.com
2025-07-03 10:57:26 +09:00
Michael Paquier
7b2eb72b1b Add InjectionPointList() to retrieve list of injection points
This routine has come as a useful piece to be able to know the list of
injection points currently attached in a system.  One area would be to
use it in a set-returning function, or just let out-of-core code play
with it.

This hides the internals of the shared memory array lookup holding the
information about the injection points (point name, library and function
name), allocating the result in a palloc'd List consumable by the
caller.

Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Discussion: https://postgr.es/m/Z_xYkA21KyLEHvWR@paquier.xyz
Discussion: https://postgr.es/m/aBG2rPwl3GE7m1-Q@paquier.xyz
2025-07-03 08:41:25 +09:00
Nathan Bossart
bb109382ef Make more use of RELATION_IS_OTHER_TEMP().
A few places were open-coding it instead of using this handy macro.

Author: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAEG8a3LjTGJcOcxQx-SUOGoxstG4XuCWLH0ATJKKt_aBTE5K8w%40mail.gmail.com
2025-07-02 12:32:19 -05:00
Nathan Bossart
fe07100e82 Add GetNamedDSA() and GetNamedDSHash().
Presently, the dynamic shared memory (DSM) registry only provides
GetNamedDSMSegment(), which allocates a fixed-size segment.  To use
the DSM registry for more sophisticated things like dynamic shared
memory areas (DSAs) or a hash table backed by a DSA (dshash), users
need to create a DSM segment that stores various handles and LWLock
tranche IDs and to write fairly complicated initialization code.
Furthermore, there is likely little variation in this
initialization code between libraries.

This commit introduces functions that simplify allocating a DSA or
dshash within the DSM registry.  These functions are very similar
to GetNamedDSMSegment().  Notable differences include the lack of
an initialization callback parameter and the prohibition of calling
the functions more than once for a given entry in each backend
(which should be trivially avoidable in most circumstances).  While
at it, this commit bumps the maximum DSM registry entry name length
from 63 bytes to 127 bytes.

Also note that even though one could presumably detach/destroy the
DSAs and dshashes created in the registry, such use-cases are not
yet well-supported, if for no other reason than the associated DSM
registry entries cannot be removed.  Adding such support is left as
a future exercise.

The test_dsm_registry test module contains tests for the new
functions and also serves as a complete usage example.

Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Discussion: https://postgr.es/m/aEC8HGy2tRQjZg_8%40nathan
2025-07-02 11:50:52 -05:00
Peter Geoghegan
9ca30a0b04 Update obsolete row compare preprocessing comments.
Restore nbtree preprocessing comments describing how we mark nbtree row
compare members required to how they were prior to 2016 bugfix commit
a298a1e0.

Oversight in commit bd3f59fd, which made nbtree preprocessing revert to
the original 2006 rules, but neglected to revert these comments.

Backpatch-through: 18
2025-07-02 12:36:35 -04:00
Tom Lane
7374b3a536 Allow width_bucket()'s "operand" input to be NaN.
The array-based variant of width_bucket() has always accepted NaN
inputs, treating them as equal but larger than any non-NaN,
as we do in ordinary comparisons.  But up to now, the four-argument
variants threw errors for a NaN operand.  This is inconsistent
and unnecessary, since we can perfectly well regard NaN as falling
after the last bucket.

We do still throw error for NaN or infinity histogram-bound inputs,
since there's no way to compute sensible bucket boundaries.

Arguably this is a bug fix, but given the lack of field complaints
I'm content to fix it in master.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/2822872.1750540911@sss.pgh.pa.us
2025-07-02 11:34:40 -04:00
Álvaro Herrera
c989affb52
Fix error message for ALTER CONSTRAINT ... NOT VALID
Trying to alter a constraint so that it becomes NOT VALID results in an
error that assumes the constraint is a foreign key.  This is potentially
wrong, so give a more generic error message.

While at it, give CREATE CONSTRAINT TRIGGER a better error message as
well.

Co-authored-by: jian he <jian.universality@gmail.com>
Co-authored-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Co-authored-by: Amul Sul <sulamul@gmail.com>
Discussion: https://postgr.es/m/CACJufxHSp2puxP=q8ZtUGL1F+heapnzqFBZy5ZNGUjUgwjBqTQ@mail.gmail.com
2025-07-02 17:02:27 +02:00
Peter Geoghegan
bd3f59fdb7 Make row compares robust during nbtree array scans.
Recent nbtree bugfix commit 5f4d98d4 added a special case to the code
that sets up a page-level prefix of keys that are definitely satisfied
by every tuple on the page: whenever _bt_set_startikey reached a row
compare key, we'd refuse to apply the pstate.forcenonrequired behavior
in scans where that usually happens (scans with a higher-order array
key).  That hack made the scan avoid essentially the same infinite
cycling behavior that also affected nbtree scans with redundant keys
(keys that preprocessing could not eliminate) prior to commit f09816a0.
There are now serious doubts about this row compare workaround.

Testing has shown that a scan with a row compare key and an array key
could still read the same leaf page twice (without the scan's direction
changing), which isn't supposed to be possible following the SAOP
enhancements added by Postgres 17 commit 5bf748b8.  Also, we still
allowed a required row compare key to be used with forcenonrequired mode
when its header key happened to be beyond the pstate.ikey set by
_bt_set_startikey, which was complicated and brittle.

The underlying problem was that row compares had inconsistent rules
around how scans start (which keys can be used for initial positioning
purposes) and how scans end (which keys can set continuescan=false).
Quals with redundant keys that could not be eliminated by preprocessing
also had that same quality to them prior to today's bugfix f09816a0.  It
now seems prudent to bring row compare keys in line with the new charter
for required keys, by making the start and end rules symmetric.

This commit fixes two points of disagreement between _bt_first and
_bt_check_rowcompare.  Firstly, _bt_check_rowcompare was capable of
ending the scan at the point where it needed to compare an ISNULL-marked
row compare member that came immediately after a required row compare
member.  _bt_first now has symmetric handling for NULL row compares.
Secondly, _bt_first had its own ideas about which keys were safe to use
for initial positioning purposes.  It could use fewer or more keys than
_bt_check_rowcompare.  _bt_first now uses the same requiredness markings
as _bt_check_rowcompare for this.

Now that _bt_first and _bt_check_rowcompare agree on how to start and
end scans, we can get rid of the forcenonrequired special case, without
any risk of infinite cycling.  This approach also makes row compare keys
behave more like regular scalar keys, particularly within _bt_first.

Fixing these inconsistencies necessitates dealing with a related issue
with the way that row compares were marked required by preprocessing: we
didn't mark any lower-order row members required following 2016 bugfix
commit a298a1e0.  That approach was over broad.  The bug in question was
actually an oversight in how _bt_check_rowcompare dealt with tuple NULL
values that failed to satisfy a scan key marked required in the opposite
scan direction (it was a bug in 2011 commits 6980f817 and 882368e8, not
a bug in 2006 commit 3a0a16cb).  Go back to marking row compare members
as required using the original 2006 rules, and fix the 2016 bug in a
more principled way: by limiting use of the "set continuescan=false with
a key required in the opposite scan direction upon encountering a NULL
tuple value" optimization to the first/most significant row member key.
While it isn't safe to use an implied IS NOT NULL qualifier to end the
scan when it comes from a required lower-order row compare member key,
it _is_ generally safe for such a required member key to end the scan --
provided the key is marked required in the _current_ scan direction.

This fixes what was arguably an oversight in either commit 5f4d98d4 or
commit 8a510275.  It is a direct follow-up to today's commit f09816a0.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=pcijHL_mA0_TJ5LiTB28QpQ0cGtT-ccFV=KzuunNDDQ@mail.gmail.com
Backpatch-through: 18
2025-07-02 09:48:15 -04:00