postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-05-28 04:35:45 -04:00

Author	SHA1	Message	Date
Peter Eisentraut	5c2a8d272b	Use C11 alignas in typedef definitions They were already using pg_attribute_aligned. This replaces that with alignas and moves that into the required syntactic position. Suggested-by: Peter Eisentraut <peter@eisentraut.org> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/d7a788fa-e609-4894-a8be-2f70e135424f%40eisentraut.org	2026-03-16 11:35:51 +01:00
Peter Eisentraut	2f094e7ac6	SQL Property Graph Queries (SQL/PGQ) Implementation of SQL property graph queries, according to SQL/PGQ standard (ISO/IEC 9075-16:2023). This adds: - GRAPH_TABLE table function for graph pattern matching - DDL commands CREATE/ALTER/DROP PROPERTY GRAPH - several new system catalogs and information schema views - psql \dG command - pg_get_propgraphdef() function for pg_dump and psql A property graph is a relation with a new relkind RELKIND_PROPGRAPH. It acts like a view in many ways. It is rewritten to a standard relational query in the rewriter. Access privileges act similar to a security invoker view. (The security definer variant is not currently implemented.) Starting documentation can be found in doc/src/sgml/ddl.sgml and doc/src/sgml/queries.sgml. Author: Peter Eisentraut <peter@eisentraut.org> Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Ajay Pal <ajay.pal.k@gmail.com> Reviewed-by: Henson Choi <assam258@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org	2026-03-16 10:14:18 +01:00
Fujii Masao	fd6ecbfa75	Ensure "still waiting on lock" message is logged only once per wait. When log_lock_waits is enabled, the "still waiting on lock" message is normally emitted only once while a session continues waiting. However, if the wait is interrupted, for example by wakeups from client_connection_check_interval, SIGHUP for configuration reloads, or similar events, the message could be emitted again each time the wait resumes. For example, with very small client_connection_check_interval values (e.g., 100 ms), this behavior could flood the logs with repeated messages, making them difficult to use. To prevent this, this commit guards the "still waiting on lock" message so it is reported at most once during a lock wait, even if the wait is interrupted. This preserves the intended behavior when no interrupts occur. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Hüseyin Demir <huseyin.d3r@gmail.com> Discussion: https://postgr.es/m/CAHGQGwHZUmg+r4kMcPYt_Z-txxVX+CJJhfra+qemxKXvAxYbpw@mail.gmail.com	2026-03-16 18:10:57 +09:00
Michael Paquier	c336133c65	Reject ALTER TABLE .. CLUSTER earlier for partitioned tables ALTER TABLE .. CLUSTER ON and SET WITHOUT CLUSTER are not supported for partitioned tables and already fail with a check happening when the sub-command is executed, not when it is prepared. This commit moves the relkind check for partitioned tables to happen when the sub-command is prepared in ATSimplePermissions(). This matches with the practice of the other sub-commands of ALTER TABLE, shaving one translatable string. mark_index_clustered() can be a bit simplified, switching one elog(ERROR) to an assertion. Note that mark_index_clustered() can also be called through a CLUSTER command, but it cannot be reached for a partitioned table, per the assertion based on the relkind in cluster_rel(), and there is only one caller of rebuild_relation(). Author: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAEoWx2kggo1N2kDH6OSfXHL_5gKg3DqQ0PdNuL4LH4XSTKJ3-g@mail.gmail.com	2026-03-16 17:48:39 +09:00
Fujii Masao	8fe315f18d	Add stats_reset column to pg_statio_all_sequences pg_statio_all_sequences lacked a stats_reset column, unlike the other pg_statio_* views that already expose it. This commit adds the column so users can see when the statistics in this view were last reset. Also this commit updates the documentation for pg_stat_reset_single_table_counters() to clarify that it can reset statistics for sequences and materialized views as well. Catalog version bumped. Author: Sami Imseih <samimseih@gmail.com> Co-authored-by: Shihao Zhong <zhong950419@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0v0OPGyDpwxkX81CtTt9xsj9-TNxhm=8JdOvEKPsVVFNg@mail.gmail.com	2026-03-16 17:24:08 +09:00
Peter Eisentraut	a41bc38439	Fix accidentally casting away const Recently introduced in commit `8c2b30487c`.	2026-03-16 07:37:03 +01:00
Amit Kapila	5f39698c90	Remove obsolete speculative insert cleanup in ReorderBuffer. Commit `4daa140a2f` introduced proper decoding for speculative aborts. As a result, the internal state is guaranteed to be clean when a new speculative insert is encountered. This patch removes the defensive cleanup code that is no longer reachable. Author: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/23256.1772702981@localhost	2026-03-16 10:14:22 +05:30
Michael Paquier	bfa3c4f106	Optimize hash index bulk-deletion with streaming read This commit refactors hashbulkdelete() to use streaming reads, improving the efficiency of the operation by prefetching upcoming buckets while processing a current bucket. There are some specific changes required to make sure that the cleanup work happens in accordance to the data pushed to the stream read callback. When the cached metadata page is refreshed to be able to process the next set of buckets, the stream is reset and the data fed to the stream read callback has to be updated. The reset needs to happen in two code paths, when _hash_getcachedmetap() is called. The author has seen better performance numbers than myself on this one (with tweaks similar to `6c228755ad`). The numbers are good enough for both of us that this change is worth doing, in terms of IO and runtime. Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com	2026-03-16 09:22:09 +09:00
Tom Lane	82ff54377e	Move -ffast-math defense to float.c and remove the configure check. We had defenses against -ffast-math in timestamp-related files, which is a pretty obsolete place for them since we've not supported floating-point timestamps in a long time. Remove those and instead put one in float.c, which is still broken by using this switch. Add some commentary to put more color on why it's a bad idea. Also remove the check from configure. That was just there to fail faster, but it doesn't really seem necessary anymore, and besides we have no corresponding check in meson.build. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Suggested-by: Andres Freund <andres@anarazel.de> Suggested-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/abFXfKC8zR0Oclon%40ip-10-97-1-34.eu-west-3.compute.internal	2026-03-15 19:34:52 -04:00
David Rowley	c456e39113	Optimize tuple deformation This commit includes various optimizations to improve the performance of tuple deformation. We now precalculate CompactAttribute's attcacheoff, which allows us to remove the code from the deform routines which was setting the attcacheoff. Setting the attcacheoff is now handled by TupleDescFinalize(), which must be called before the TupleDesc is used for anything. Having TupleDescFinalize() means we can store the first attribute in the TupleDesc which does not have an offset cached. That allows us to add a dedicated deforming loop to deform all attributes up to the final one with an attcacheoff set, or up to the first NULL attribute, whichever comes first. Here we also improve tuple deformation performance of tuples with NULLs. Previously, if the HEAP_HASNULL bit was set in the tuple's t_infomask, deforming would, one-by-one, check each and every bit in the NULL bitmap to see if it was zero. Now, we process the NULL bitmap 1 byte at a time rather than 1 bit at a time to find the attnum with the first NULL. We can now deform the tuple without checking for NULLs up to just before that attribute. We also record the maximum attribute number which is guaranteed to exist in the tuple, that is, has a NOT NULL constraint and isn't an atthasmissing attribute. When deforming only attributes prior to the guaranteed attnum, we've no need to access the tuple's natt count. As an additional optimization, we only count fixed-width columns when calculating the maximum guaranteed column, as this eliminates the need to emit code to fetch byref types in the deformation loop for guaranteed attributes. Some locations in the code deform tuples that have yet to go through NOT NULL constraint validation. We're unable to perform the guaranteed attribute optimization when that's the case. This optimization is opt-in via the TupleTableSlot using the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS flag. This commit also adds a more efficient way of populating the isnull array by using a bit-wise SWAR trick which performs multiplication on the inverse of the tuple's bitmap byte and masking out all but the lower bit of each of the boolean's byte. This results in much more optimal code when compared to determining the NULLness via att_isnull(). 8 isnull elements are processed at once using this method, which means we need to round the tts_isnull array size up to the next 8 bytes. The palloc code does this anyway, but the round-up needed to be formalized so as not to overwrite the sentinel byte in MEMORY_CONTEXT_CHECKING builds. Doing this also allows the NULL-checking deforming loop to more efficiently check the isnull array, rather than doing the bit-wise processing for each attribute that att_isnull() does. The level of performance improvement from these changes seems to vary depending on the CPU architecture. Apple's M chips seem particularly fond of the changes, with some of the tested deform-heavy queries going over twice as fast as before. With x86-64, the speedups aren't quite as large. With tables containing only a small number of columns, the speedups will be less. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com	2026-03-16 11:46:00 +13:00
David Rowley	503620311e	Add all required calls to TupleDescFinalize() As of this commit all TupleDescs must have TupleDescFinalize() called on them once the TupleDesc is set up and before BlessTupleDesc() is called. In this commit, TupleDescFinalize() does nothing. This change has only been separated out from the commit that properly implements this function to make the change more obvious. Any extension which makes its own TupleDesc will need to be modified to call the new function. The follow-up commit which properly implements TupleDescFinalize() will cause any code which forgets to do this to fail in assert-enabled builds in BlessTupleDesc(). It may still be worth mentioning this change in the release notes so that extension authors update their code. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com	2026-03-16 11:45:49 +13:00
Tom Lane	e5a77d876d	Save a few bytes per CatCTup. CatalogCacheCreateEntry() computed the space needed for a CatCTup as sizeof(CatCTup) + MAXIMUM_ALIGNOF. That's not our usual style, and it wastes memory by allocating more padding than necessary. On 64-bit machines sizeof(CatCTup) would be maxaligned already since it contains pointer fields, therefore this code is wasting 8 bytes compared to the more usual MAXALIGN(sizeof(CatCTup)). While at it, we don't really need to do MemoryContextSwitchTo() when we're only allocating one block. Author: ChangAo Chen <cca5507@qq.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/tencent_A42E0544C6184FE940CD8E3B14A3F0A39605@qq.com	2026-03-15 18:05:38 -04:00
Melanie Plageman	99bf1f8aa6	Save vmbuffer in heap-specific scan descriptors for on-access pruning Future commits will use the visibility map in on-access pruning to fix VM corruption and set the VM if the page is all-visible. Saving the vmbuffer in the scan descriptor reduces the number of times it would need to be pinned and unpinned, making the overhead of doing so negligible. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/C3AB3F5B-626E-4AAA-9529-23E9A20C727F%40gmail.com	2026-03-15 11:09:10 -04:00
Melanie Plageman	8d2c1df4f4	Avoid BufferGetPage() calls in heap_update() BufferGetPage() isn't cheap and heap_update() calls it multiple times when it could just save the page from a single call. Do that. While we are at it, make separate variables for old and new page in heap_xlog_update(). It's confusing to reuse "page" for both pages. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_a%2BhO4PCptyaPR7AMZd7FjcHfOFKKJT8ouU3KedMud0tQ%40mail.gmail.com	2026-03-15 10:42:34 -04:00
Melanie Plageman	a3511443e5	Initialize missing fields in CreateExecutorState() `d47cbf474e` and `cbc127917e` forgot to initialize a few fields they introduced in the EState, so do that now. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com	2026-03-15 10:13:14 -04:00
Tom Lane	2eb87345e1	Fix aclitemout() to work during early bootstrap. "initdb -d" has been broken since commit `f95d73ed4`, because I changed aclitemin to work in bootstrap mode but failed to consider aclitemout. That routine isn't reached by default, but it is if the elog message level is high enough, so it needs to work without catalog access too. This patch just makes it use its existing code paths to print role OIDs numerically. We could alternatively invent an inverse of boot_get_role_oid() and print them symbolically, but that would take more code and it's not apparent that it'd be any better for debugging purposes. Reported-by: Greg Burd <greg@burd.me> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/4416.1773328045@sss.pgh.pa.us	2026-03-14 13:46:54 -04:00
Tomas Vondra	02eecead86	Tighten asserts on ParallelWorkerNumber The comment about ParallelWorkerNumbr in parallel.c says: In parallel workers, it will be set to a value >= 0 and < the number of workers before any user code is invoked; each parallel worker will get a different parallel worker number. However asserts in various places collecting instrumentation allowed (ParallelWorkerNumber == num_workers). That would be a bug, as the value is used as index into an array with num_workers entries. Fixed by adjusting the asserts accordingly. Backpatch to all supported versions. Discussion: https://postgr.es/m/5db067a1-2cdf-4afb-a577-a04f30b69167@vondra.me Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Backpatch-through: 14	2026-03-14 15:26:39 +01:00
David Rowley	4deecb52af	Allow sibling call optimization in slot_getsomeattrs_int() This changes the TupleTableSlotOps contract to make it so the getsomeattrs() function is in charge of calling slot_getmissingattrs(). Since this removes all code from slot_getsomeattrs_int() aside from the getsomeattrs() call itself, we may as well adjust slot_getsomeattrs() so that it calls getsomeattrs() directly. We leave slot_getsomeattrs_int() intact as this is still called from the JIT code. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com	2026-03-14 13:52:09 +13:00
Peter Geoghegan	8a879119a1	Use fake LSNs to improve nbtree dropPin behavior. Use fake LSNs in all nbtree critical sections that write a WAL record. That way we can safely apply the _bt_killitems LSN trick with logged and unlogged indexes alike. This brings the same benefits to plain scans of unlogged relations that commit `2ed5b87f` brought to plain scans of logged relations: scans will drop their leaf page pin eagerly (by applying the "dropPin" optimization), which avoids blocking progress by VACUUM. This is particularly helpful with applications that allow a scrollable cursor to remain idle for long periods. Preparation for an upcoming commit that will add the amgetbatch interface, and switch nbtree over to it (from amgettuple) to enable I/O prefetching. The index prefetching read stream's effective prefetch distance is adversely affected by any buffer pins held by the index AM. At the same time, it can be useful for prefetching to read dozens of leaf pages ahead of the scan to maintain an adequate prefetch distance. The index prefetching patch avoids this tension by always eagerly dropping index page pins of the kind traditionally held as an interlock against unsafe concurrent TID recycling by VACUUM (essentially the same way that amgetbitmap routines have always avoided holding onto pins). The work from this commit makes that possible during scans of nbtree unlogged indexes -- without our having to give up on setting LP_DEAD bits on index tuples altogether. Follow-up to commit `d774072f`, which moved the fake LSN infrastructure out of GiST so that it could be used by other index AMs. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com	2026-03-13 20:37:39 -04:00
Peter Geoghegan	d774072f00	Move fake LSN infrastructure out of GiST. Move utility functions used by GiST to generate fake LSNs into xlog.c and xloginsert.c, so that other index AMs can also generate fake LSNs. Preparation for an upcoming commit that will add support for fake LSNs to nbtree, allowing its dropPin optimization to be used during scans of unlogged relations. That commit is itself preparation for another upcoming commit that will add a new amgetbatch/btgetbatch interface to enable I/O prefetching. Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming XLOG_ASSIGN_LSN. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com	2026-03-13 19:38:17 -04:00
Jeff Davis	9b860373da	Add error code to user-visible message. Reported-by: Alexander Lakhin <exclusion@gmail.com>	2026-03-13 16:07:54 -07:00
Tomas Vondra	b1f14c9672	Use GetXLogInsertEndRecPtr in gistGetFakeLSN The function used GetXLogInsertRecPtr() to generate the fake LSN. Most of the time this is the same as what XLogInsert() would return, and so it works fine with the XLogFlush() call. But if the last record ends at a page boundary, GetXLogInsertRecPtr() returns LSN pointing after the page header. In such case XLogFlush() fails with errors like this: ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000 Such failures are very hard to trigger, particularly outside aggressive test scenarios. Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN without skipping the header. This is the same as GetXLogInsertRecPtr(), except that it calls XLogBytePosToEndRecPtr(). Initial investigation by me, root cause identified by Andres Freund. This is a long-standing bug in gistGetFakeLSN(), probably introduced by `c6b92041d3` in PG13. Backpatch to all supported versions. Reported-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo Backpatch-through: 14	2026-03-13 23:25:24 +01:00
Heikki Linnakangas	311a851436	Free memory allocated for unrecognized_protocol_options Since `4966bd3ed9` Valgrind started to warn about little amount of memory being leaked in ProcessStartupPacket(). This is not critical but the warnings may distract from real issues. Fix it by freeing the list after use. Author: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://www.postgresql.org/message-id/CAJ7c6TN3Hbb5p=UHx0SPVN+h_JwPAV6rxoqOm7gHBMFKfnGK-Q@mail.gmail.com	2026-03-13 23:37:19 +02:00
Andres Freund	ce5d489166	Fix bug due to confusion about what IsMVCCSnapshot means In `0b96e734c5` I (Andres) relied on page_collect_tuples() being called only with an MVCC snapshot, and added assertions to that end, but did not realize that IsMVCCSnapshot() allows both proper MVCC snapshots and historical snapshots, which behave quite similarly to MVCC snapshots. Unfortunately that can lead to incorrect visibility results during logical decoding, as a historical snapshot is interpreted as a plain MVCC snapshot. The only reason this wasn't noticed earlier is that it's hard to reach as most of the time there are no sequential scans during logical decoding. To fix the bug and avoid issues like this in the future, split IsMVCCSnapshot() into IsMVCCSnapshot() and IsMVCCLikeSnapshot(), where now only the latter includes historic snapshots. One effect of this is that during logical decoding no page-at-a-time snapshots are used, as otherwise runtime branches to handle historic snapshots would be needed in some performance critical paths. Given how uncommon sequential scans are during logical decoding, that seems acceptable. Author: Antonin Houska <ah@cybertec.at> Reported-by: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/61812.1770637345@localhost	2026-03-13 13:53:19 -04:00
Nathan Bossart	e0a3a3fd53	Optimize COPY FROM (FORMAT {text,csv}) using SIMD. Presently, such commands scan the input buffer one byte at a time looking for special characters. This commit adds a new path that uses SIMD instructions to skip over chunks of data without any special characters. This can be much faster. To avoid regressions, SIMD processing is disabled for the remainder of the COPY FROM command as soon as we encounter a short line or a special character (except for end-of-line characters, else we'd always disable it after the first line). This is perhaps too conservative, but it could probably be made more lenient in the future via fine-tuned heuristics. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Ayoub Kazar <ma_kazar@esi.dz> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Neil Conway <neil.conway@gmail.com> Reviewed-by: Greg Burd <greg@burd.me> Tested-by: Manni Wood <manni.wood@enterprisedb.com> Tested-by: Mark Wong <markwkm@gmail.com> Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com	2026-03-13 11:07:32 -05:00
Peter Eisentraut	8c2b30487c	Factor out constructSetOpTargetlist() from transformSetOperationTree() This would be used separately by a future patch. It also makes a little smaller. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org	2026-03-13 16:16:40 +01:00
Heikki Linnakangas	f9de9bf302	Add callback for I/O error messages in SLRUs Historically, all SLRUs were addressed by transaction IDs, but that hasn't been true for a long time. However, the error message on I/O error still always talked about accessing a transaction ID. This commit adds a callback that allows subsystems to construct their own error messages, which can then correctly refer to a transaction ID, multixid or whatever else is used to address the particular SLRU. Author: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://www.postgresql.org/message-id/CACG=ezZZfurhYV+66ceubxQAyWqv9vaUi0yoO4-t48OE5xc0DQ@mail.gmail.com	2026-03-13 16:21:06 +02:00
Fujii Masao	723619eaa3	Add stats_reset column to pg_stat_database_conflicts. This commit adds a stats_reset column to pg_stat_database_conflicts, allowing users to see when the statistics in this view were last reset. This makes the view consistent with pg_stat_database and other statistics views. Catalog version bumped. Author: Shihao Zhong <zhong950419@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAGRkXqS98OebEWjax99_LVAECsxCB8i=BfsdAL34i-5QHfwyOQ@mail.gmail.com	2026-03-13 22:17:14 +09:00
Heikki Linnakangas	2e1dcf8c54	Check for interrupts during non-fast-update GIN insertion ginExtractEntries() can produce a lot of entries for a single item. During index build, we check for interrupts between entries, and the fast-update codepath does it as part of vacuum_delay_point(), but the non-fast update insertion codepath was uninterruptible. Add CHECK_FOR_INTERRUPTS() between entries in the non-fast update codepath too. Author: Vinod Sridharan <vsridh90@gmail.com> Discussion: https://www.postgresql.org/message-id/CAFMdLD6mQvAuStiOGvBJxAEfo6wdjZhj3+JveTLxOX8MVn4zmA@mail.gmail.com	2026-03-13 15:12:32 +02:00
Alexander Korotkov	fa6f2f624c	Rework ginScanToDelete() to pass Buffers instead of BlockNumbers. Previously, ginScanToDelete() and ginDeletePage() passed BlockNumbers and re-read pages that were already pinned and locked during the tree walk. The caller ginVacuumPostingTree()) held a cleanup-locked root buffer, yet ginScanToDelete() re-read it by block number with special-case code to skip re-locking. At first, this commit gives both functions more appropriate names, ginScanPostingTreeToDelete() and ginDeletePostingPage(), indicating they deal with posting trees/pages. This is more descriptive and similar to the way we name other GIN functions, for instance, ginVacuumPostingTree() and ginVacuumPostingTreeLeaves(). Then rework both functions to pass Buffers directly. DataPageDeleteStack now carries buffer, myoff (downlink offset in parent), and isRoot per level, so ginScanPostingTreeToDelete() takes only GinVacuumState and DataPageDeleteStack pointers. Also, ginDeletePostingPage() receives the three Buffers directly, and no longer reads or releases them itself. The caller reads and locks child pages before recursing, and manages buffer lifecycle afterward. This eliminates the confusing isRoot special cases in buffer management, including the apparent (but unreachable) double release of the root buffer identified by Andres Freund. Add comments explaining the locking protocol and the DataPageDeleteStack structure. Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/utrlxij43fbguzw4kldte2spc4btoldizutcqyrfakqnbrp3ir@ph3sphpj4asz Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Jinbinge <jinbinge@126.com>	2026-03-13 13:50:13 +02:00
Heikki Linnakangas	f30cebb954	Fix pointer type of ShmemAllocatorData->index This went unnoticed in commit `e2362eb2bd` because the pointer is cast to/from a void pointer.	2026-03-13 11:00:15 +02:00
Andrew Dunstan	a0b6ef29a5	Enable fast default for domains with non-volatile constraints Previously, ALTER TABLE ADD COLUMN always forced a table rewrite when the column type was a domain with constraints (CHECK or NOT NULL), even if the default value satisfied those constraints. This was because contain_volatile_functions() considers CoerceToDomain immutable, so the code conservatively assumed any constrained domain might fail. Improve this by using soft error handling (ErrorSaveContext) to evaluate the CoerceToDomain expression at ALTER TABLE time. If the default value passes the domain's constraints, the value is stored as a "missing" attribute default and no table rewrite is needed. If the constraint check fails, we fall back to a table rewrite, preserving the historical behavior that constraint violations are only raised when the table actually contains rows. Domains with volatile constraint expressions always require a table rewrite since the constraint result could differ per evaluation and cannot be cached. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io> Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com	2026-03-12 18:05:01 -04:00
Andrew Dunstan	487cf2cbd2	Extend DomainHasConstraints() to optionally check constraint volatility Add an optional bool *has_volatile output parameter to DomainHasConstraints(). When non-NULL, the function checks whether any CHECK constraint contains a volatile expression. Callers that don't need this information pass NULL and get the same behavior as before. This is needed by a subsequent commit that enables the fast default optimization for domains with non-volatile constraints: we can safely evaluate such constraints once at ALTER TABLE time, but volatile constraints require a full table rewrite. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io> Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com	2026-03-12 18:04:16 -04:00
Peter Geoghegan	a367c433ad	Use simplehash for backend-private buffer pin refcounts. Replace dynahash with simplehash for the per-backend PrivateRefCountHash overflow table. Simplehash generates inlined, open-addressed lookup code, avoiding the per-call overhead of dynahash that becomes noticeable when many buffers are pinned with a CPU-bound workload. Motivated by testing of the index prefetching patch, which pins many more buffers concurrently than typical index scans. Author: Peter Geoghegan <pg@bowt.ie> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com	2026-03-12 13:26:16 -04:00
Peter Geoghegan	d071e1cfec	nbtree: Avoid allocating _bt_search stack. Avoid allocating memory for an nbtree descent stack during index scans. We only require a descent stack during inserts, when it is used to determine where to insert a new pivot tuple/downlink into the target leaf page's parent page in the event of a page split. (Page deletion's first phase also performs a _bt_search that requires a descent stack.) This optimization improves performance by minimizing palloc churn. It speeds up index scans that call _bt_search frequently/descend the index many times, especially when the cost of scanning the index dominates (e.g., with index-only skip scans). Testing has shown that the underlying issue causes performance problems for an upcoming patch that will replace btgettuple with a new btgetbatch interface to enable I/O prefetching. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com	2026-03-12 13:22:36 -04:00
Michael Paquier	6c228755ad	Use streaming read for VACUUM cleanup of GIN This commit replace the synchronous ReadBufferExtended() loop done in ginvacuumcleanup() with the streaming read equivalent, to improve I/O efficiency during GIN index vacuum cleanup operations. With dm_delay to emulate some latency and debug_io_direct=data to force synchronous writes and force the read path to be exercised, the author has noticed a 5x improvement in runtime, with a substantial reduction in IO stats numbers. I have reproduced similar numbers while running similar tests, with improvements becoming better with more tuples and more pages manipulated. Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com	2026-03-12 11:48:31 +09:00
Richard Guo	383eb21ebf	Convert NOT IN sublinks to anti-joins when safe The planner has historically been unable to convert "x NOT IN (SELECT y ...)" sublinks into anti-joins. This is because standard SQL semantics for NOT IN require that if the comparison "x = y" returns NULL, the "NOT IN" expression evaluates to NULL (effectively false), causing the row to be discarded. In contrast, an anti-join preserves the row if no match is found. Due to this semantic mismatch regarding NULL handling, the conversion was previously considered unsafe. However, if we can prove that neither side of the comparison can yield NULL values, and further that the operator itself cannot return NULL for non-null inputs, the behavior of NOT IN and anti-join becomes identical. Enabling this conversion allows the planner to treat the sublink as a first-class relation rather than an opaque SubPlan filter. This unlocks global join ordering optimization and permits the selection of the most efficient join algorithm based on cost, often yielding significant performance improvements for large datasets. This patch verifies that neither side of the comparison can be NULL and that the operator is safe regarding NULL results before performing the conversion. To verify operator safety, we require that the operator be a member of a B-tree or Hash operator family. This serves as a proxy for standard boolean behavior, ensuring the operator does not return NULL on valid non-null inputs, as doing so would break index integrity. For operand non-nullability, this patch makes use of several existing mechanisms. It leverages the outer-join-aware-Var infrastructure to verify that a Var does not come from the nullable side of an outer join, and consults the NOT-NULL-attnums hash table to efficiently verify schema-level NOT NULL constraints. Additionally, it employs find_nonnullable_vars to identify Vars forced non-nullable by qual clauses, and expr_is_nonnullable to deduce non-nullability for other expression types. The logic for verifying the non-nullability of the subquery outputs was adapted from prior work by David Rowley and Tom Lane. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/CAMbWs495eF=-fSa5CwJS6B-BaEi3ARp0UNb4Lt3EkgUGZJwkAQ@mail.gmail.com	2026-03-12 09:45:18 +09:00
Andres Freund	6322a028fa	bufmgr: Fix use of wrong variable in GetPrivateRefCountEntrySlow() Unfortunately, in `30df61990c`, I made GetPrivateRefCountEntrySlow() set a wrong cache hint when moving entries from the hash table to the faster array. There are no correctness concerns due to this, just an unnecessary loss of performance. Noticed while testing the index prefetching patch. Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com	2026-03-11 17:52:21 -04:00
Jeff Davis	547c15f9f8	Fix use of volatile. Commit `8185bb5347` misused volatile. Fix it. See also `6307b096e2`. Reported-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/1bb21c7d-885f-4f07-a3ed-21b60d7c92c6@eisentraut.org	2026-03-11 14:27:58 -07:00
Andrew Dunstan	342051d73b	Add support for altering CHECK constraint enforceability This commit adds support for ALTER TABLE ALTER CONSTRAINT ... [NOT] ENFORCED for CHECK constraints. Previously, only foreign key constraints could have their enforceability altered. When changing from NOT ENFORCED to ENFORCED, the operation not only updates catalog information but also performs a full table scan in Phase 3 to validate that existing data satisfies the constraint. For partitioned tables and inheritance hierarchies, the operation recurses to all child tables. When changing to NOT ENFORCED, we must recurse even if the parent is already NOT ENFORCED, since child constraints may still be ENFORCED. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@cybertec.at> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com	2026-03-11 16:15:35 -04:00
Andrew Dunstan	a9747153e1	rename alter constraint enforceability related functions The functions AlterConstrEnforceabilityRecurse and ATExecAlterConstrEnforceability are being renamed to AlterFKConstrEnforceabilityRecurse and ATExecAlterFKConstrEnforceability, respectively. The current alter constraint functions only handle Foreign Key constraints. Renaming them to be more explicit about the constraint type is necessary; otherwise, it will cause confusion when we later introduce the ability to alter the enforceability of other constraints. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Discussion: https://postgr.es/m/CACJufxHCh_FU-FsEwsCvg9mN6-5tzR6H9ntn+0KUgTCaerDOmg@mail.gmail.com	2026-03-11 16:14:58 -04:00
Andres Freund	a766125efd	bufmgr: Switch to standard order in MarkBufferDirtyHint() When we were updating hint bits with just a share lock MarkBufferDirtyHint() had to use a non-standard order of operations, i.e. WAL log the buffer before marking the buffer dirty. This was required because the lock level used to set hints did not conflict with the lock level that was used to flush pages, which would have allowed flushing the page out before the WAL record. The non-standard order in turn required preventing the checkpoint from starting between writing the WAL record and flushing out the page. Now that setting hints and writing out buffers use share-exclusive, we can revert back to the normal order of operations. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d	2026-03-11 14:58:29 -04:00
Andres Freund	b0f4ff3c92	bufmgr: Remove the, now obsolete, BM_JUST_DIRTIED Due to the recent changes to use a share-exclusive mode for setting hint bits and for flushing pages - instead of using share mode as before - a buffer cannot be dirtied while the flush is ongoing. The reason we needed JUST_DIRTIED was to handle the case where the buffer was dirtied while IO was ongoing - which is not possible anymore. Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d	2026-03-11 14:58:29 -04:00
Melanie Plageman	11e0824bd9	Avoid WAL flush checks for unlogged buffers in GetVictimBuffer() GetVictimBuffer() rejects a victim buffer if it is from a bulkread strategy ring and reusing it would require flushing WAL. Unlogged table buffers can have fake LSNs (e.g. unlogged GiST pages) and calling XLogNeedsFlush() on a fake LSN is meaningless. This is a bit of future-proofing because currently the bulkread strategy is not used for relations with fake LSNs. Author: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Earlier version reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/flat/fmkqmyeyy7bdpvcgkheb6yaqewemkik3ls6aaveyi5ibmvtxnd%40nu2kvy5rq3a6	2026-03-11 14:50:50 -04:00
Tomas Vondra	943e881733	Do not lock in BufferGetLSNAtomic() on archs with 8 byte atomic reads On platforms where we can read or write the whole LSN atomically, we do not need to lock the buffer header to prevent torn LSNs. We can do this only on platforms with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, and when the pd_lsn field is properly aligned. For historical reasons the PageXLogRecPtr was defined as a struct with two uint32 fields. This replaces it with a single uint64 value, to make the intent clearer. To prevent issues with weak typedefs the value is still wrapped in a struct. This also adjusts heapfuncs() in pageinspect, to ensure proper alignment when reading the LSN from a page on alignment-sensitive hardware. Idea by Andres Freund. Initial patch by Andreas Karlsson, improved by Peter Geoghegan. Minor tweaks by me. Author: Andreas Karlsson <andreas@proxel.se> Author: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se	2026-03-11 19:46:08 +01:00
Tomas Vondra	b6eb8dde6b	Fix indentation from commit `29a0fb2157` Per buildfarm animal koel	2026-03-11 15:14:46 +01:00
Tomas Vondra	29a0fb2157	Conditional locking in pgaio_worker_submit_internal With io_method=worker, there's a single I/O submission queue. With enough workers, the backends and workers may end up spending a lot of time competing for the AioWorkerSubmissionQueueLock lock. This can happen with workloads that keep the queue full, in which case it's impossible to add requests to the queue. Increasing the number of I/O workers increases the pressure on the lock, worsening the issue. This change improves the situation in two ways: * If AioWorkerSubmissionQueueLock can't be acquired without waiting, the I/O is performed synchronously (as if the queue was full). * When an entry can't be added to a full queue, stop trying to add more entries. All remaining entries are handled as synchronous I/O. The regression was reported by Alexandre Felipe. Investigation and patch by me, based on an idea by Andres Freund. Reported-by: Alexandre Felipe <o.alexandre.felipe@gmail.com> Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAE8JnxOn4+xUAnce+M7LfZWOqfrMMxasMaEmSKwiKbQtZr65uA@mail.gmail.com	2026-03-11 13:40:23 +01:00
Peter Eisentraut	d537f59fbb	Sort out table_open vs. relation_open in rewriter table_open() is a wrapper around relation_open() that checks that the relkind is table-like and gives a user-facing error message if not. It is best used in directly user-facing areas to check that the user used the right kind of command for the relkind. In internal uses where the relkind was previously checked from the user's perspective, table_open() is not necessary and might even be confusing if it were to give out-of-context error messages. In rewriteHandler.c, there were several such table_open() calls, which this changes to relation_open(). This currently doesn't make a difference, but there are plans to have other relkinds that could appear in the rewriter but that shouldn't be accessible via table-specific commands, and this clears the way for that. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org	2026-03-11 09:22:11 +01:00
Andres Freund	82467f627b	Require share-exclusive lock to set hint bits and to flush At the moment hint bits can be set with just a share lock on a page (and, until `45f658dacb`, in one case even without any lock). Because of this we need to copy pages while writing them out, as otherwise the checksum could be corrupted. The need to copy the page is problematic to implement AIO writes: 1) Instead of just needing a single buffer for a copied page we need one for each page that's potentially undergoing I/O 2) To be able to use the "worker" AIO implementation the copied page needs to reside in shared memory It also causes problems for using unbuffered/direct-IO, independent of AIO: Some filesystems, raid implementations, ... do not tolerate the data being written out to change during the write. E.g. they may compute internal checksums that can be invalidated by concurrent modifications, leading e.g. to filesystem errors (as the case with btrfs). It also just is plain odd to allow modifications of buffers that are just share locked. To address these issues, this commit changes the rules so that modifications to pages are not allowed anymore while holding a share lock. Instead the new share-exclusive lock (introduced in `fcb9c977aa`) allows at most one backend to modify a buffer while other backends have the same page share locked. An existing share-lock can be upgraded to a share-exclusive lock, if there are no conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits() and BufferSetHintBits16() have been introduced. To prevent hint bits from being set while the buffer is being written out, writing out buffers now requires a share-exclusive lock. The use of share-exclusive to gate setting hint bits means that from now on only one backend can set hint bits at a time. To allow multiple backends to set hint bits would require more complicated locking: For setting hint bits we'd need to store the count of backends currently setting hint bits and we would need another lock-level for I/O conflicting with the lock-level to set hint bits. Given that the share-exclusive lock for setting hint bits is only held for a short time, that backends would often just set the same hint bits and that the cost of occasionally not setting hint bits in hotly accessed pages is fairly low, this seems like an acceptable tradeoff. The biggest change to adapt to this is in heapam. To avoid performance regressions for sequential scans that need to set a lot of hint bits, we need to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint bits on a page have been set. Conversely, to avoid regressions in cases where we can't set hint bits in bulk (because we're looking only at individual tuples), use BufferSetHintBits16() when setting hint bits without batching. Several other places also need to be adapted, but those changes are comparatively simpler. After this we do not need to copy buffers to write them out anymore. That change is done separately however. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m	2026-03-10 19:32:13 -04:00
Melanie Plageman	4c7362c553	Remove unused PruneState member frz_conflict_horizon `c2a23dcf9e` removed use of PruneState.frz_conflict_horizon but neglected to actually remove the member. Do that now.	2026-03-10 18:31:00 -04:00
Heikki Linnakangas	138592d1b0	Don't clear pendingRecoveryConflicts at end of transaction Commit `17f51ea818` introduced a new pendingRecoveryConflicts field in PGPROC to replace the various ProcSignals. The new field was cleared in ProcArrayEndTransaction(), which makes sense for conflicts with e.g. locks or buffer pins which are gone at end of transaction. But it is not appropriate for conflicts on a database, or a logical slot. Because of this, the 035_standby_logical_decoding.pl test was occasionally getting stuck in the buildfarm. It happens if the startup process signals recovery conflict with the logical slot just when the walsender process using the slot calls ProcArrayEndTransaction(). To fix, don't clear pendingRecoveryConflicts in ProcArrayEndTransaction(). We could still clear certain conflict flags, like conflicts on locks, but we didn't try to do that before commit `17f51ea818` either. In the passing, fix a misspelled comment, and make InitAuxiliaryProcess() to also clear pendingRecoveryConflicts. I don't think aux processes can have recovery conflicts, but it seems best to initialize the field and keep InitAuxiliaryProcess() as close to InitProcess() as possible. Analyzed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://www.postgresql.org/message-id/3e07149d-060b-48a0-8f94-3d5e4946ae45@gmail.com	2026-03-11 00:06:09 +02:00
Melanie Plageman	c2a23dcf9e	Use the newest to-be-frozen xid as the conflict horizon for freezing Previously WAL records that froze tuples used OldestXmin as the snapshot conflict horizon, or the visibility cutoff if the page would become all-frozen. Both are newer than (or equal to) the newst XID actually frozen on the page. Track the newest XID that will be frozen and use that as the snapshot conflict horizon instead. This yields an older horizon resulting in fewer query cancellations on standbys. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com	2026-03-10 15:24:39 -04:00
Álvaro Herrera	ac58465e06	Introduce the REPACK command REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single command. Because this functionality is completely different from regular VACUUM, having it separate from VACUUM makes it easier for users to understand; as for CLUSTER, the term is heavily overloaded in the IT world and even in Postgres itself, so it's good that we can avoid it. We retain those older commands, but de-emphasize them in the documentation, in favor of REPACK; the difference between VACUUM FULL and CLUSTER (namely, the fact that tuples are written in a specific ordering) is neatly absorbed as two different modes of REPACK. This allows us to introduce further functionality in the future that works regardless of whether an ordering is being applied, such as (and especially) a concurrent mode. Author: Antonin Houska <ah@cybertec.at> Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/82651.1720540558@antos Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql	2026-03-10 19:56:39 +01:00
Masahiko Sawada	a596d27d80	Fix grammar in short description of effective_wal_level. Align with the convention of using third-person singular (e.g., "Shows" instead of "Show") for GUC parameter descriptions. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20260210.143752.1113524465620875233.horikyota.ntt@gmail.com	2026-03-10 11:36:38 -07:00
Andres Freund	f4a4ce52c0	heapam: Don't mimic MarkBufferDirtyHint() in inplace updates Previously heap_inplace_update_and_unlock() used an operation order similar to MarkBufferDirty(), to reduce the number of different approaches used for updating buffers. However, in an upcoming patch, MarkBufferDirtyHint() will switch to using the update protocol used by most other places (enabled by hint bits only being set while holding a share-exclusive lock). Luckily it's pretty easy to adjust heap_inplace_update_and_unlock(). As a comment already foresaw, we can use the normal order, with the slight change of updating the buffer contents after WAL logging. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d	2026-03-10 11:58:06 -04:00
Fujii Masao	59bae23435	Remove duplicate initialization in initialize_brin_buildstate(). Commit `dae761a` added initialization of some BrinBuildState fields in initialize_brin_buildstate(). Later, commit `b437571` inadvertently added the same initialization again. This commit removes that redundant initialization. No behavioral change is intended. Author: Chao Li <lic@highgo.com> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com> Discussion: https://postgr.es/m/CAEoWx2nmrca6-9SNChDvRYD6+r==fs9qg5J93kahS7vpoq8QVg@mail.gmail.com	2026-03-10 22:55:11 +09:00
Peter Eisentraut	8080f44f96	Rename grammar nonterminal to simplify reuse A list of expressions with optional AS-labels is useful in a few different places. Right now, this is available as xml_attribute_list because it was first used in the XMLATTRIBUTES construct, but it is already used elsewhere, and there are other possible future uses. To reduce possible confusion going forward, rename it to labeled_expr_list (like existing expr_list plus ColLabel). Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org	2026-03-10 14:09:09 +01:00
Robert Haas	0fbfd37cef	Allow extensions to mark an individual index as disabled. Up until now, the only way for a loadable module to disable the use of a particular index was to use build_simple_rel_hook (or, previous to yesterday's commit, get_relation_info_hook) to remove it from the index list. While that works, it has some disadvantages. First, the index becomes invisible for all purposes, and can no longer be used for optimizations such as self-join elimination or left join removal, which can severely degrade the resulting plan. Second, if the module attempts to compel the use of a certain index by removing all other indexes from the index list and disabling other scan types, but the planner is unable to use the chosen index for some reason, it will fall back to a sequential scan, because that is only disabled, whereas the other indexes are, from the planner's point of view, completely gone. While this situation ideally shouldn't occur, it's hard for a loadable module to be completely sure whether the planner will view a certain index as usable for a certain query. If it isn't, it may be better to fall back to a scan using a disabled index rather than falling back to an also-disabled sequential scan. Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: http://postgr.es/m/CA%2BTgmoYS4ZCVAF2jTce%3DbMP0Oq_db_srocR4cZyO0OBp9oUoGg%40mail.gmail.com	2026-03-10 08:33:55 -04:00
Michael Paquier	03facc1211	Switch to FATAL error for missing checkpoint record without backup_label Crash recovery started without a backup_label previously crashed with a PANIC if the checkpoint record could not be found. This commit lowers the report generated to be a FATAL instead. With recovery methods being more imaginative these days, this should provide more flexibility when handling PostgreSQL recovery processing in the event of a driver error, similarly to `15f68cebdc`. An extra benefit of this change is that it becomes possible to add a test to check that a FATAL is hit with an expected error message pattern. With the recovery code becoming more complicated over the last couple of years, I suspect that this will be benefitial to cover in the long-term. The original PANIC behavior has been introduced in the early days of crash recovery, as of `4d14fe0048` (PANIC did not exist yet, the code used STOP). Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com> Discussion: https://postgr.es/m/CAMm1aWZbQ-Acp_xAxC7mX9uZZMH8+NpfepY9w=AOxbBVT9E=uA@mail.gmail.com	2026-03-10 12:00:05 +09:00
Michael Paquier	6307b096e2	Fix misuse of "volatile" in xml.c What should be used is not "volatile foo ptr" but "foo volatile ptr", The incorrect (former) style means that what the pointer variable points to is volatile. The correct (latter) style means that the pointer variable itself needs to be treated as volatile. The latter style is required to ensure a consistent treatment of these variables after a longjmp with the TRY/CATCH blocks. Some casts can be removed thanks to this change. Issue introduced by `2e94721747`, so no backpatch is required. A similar set of issues has been fixed in `93001888d8` for contrib/xml2/. Author: ChangAo Chen <cca5507@qq.com> Discussion: https://postgr.es/m/tencent_5BE8DAD985EE140ED62EA728C8D4E1311F0A@qq.com	2026-03-10 07:05:32 +09:00
Robert Haas	91f33a2ae9	Replace get_relation_info_hook with build_simple_rel_hook. For a long time, PostgreSQL has had a get_relation_info_hook which plugins can use to editorialize on the information that get_relation_info obtains from the catalogs. However, this hook is only called for baserels of type RTE_RELATION, and there is potential utility in a similar call back for other types of RTEs. This might have had utility even before commit `4020b370f2` added pgs_mask to RelOptInfo, but it certainly has utility now. So, move the callback up one level, deleting get_relation_info_hook and adding build_simple_rel_hook instead. The new callback is called just slightly later than before and with slightly different arguments, but it should be fairly straightforward to adjust existing code that currently uses get_relation_info_hook: the values previously available as relationObjectId and inhparent are now available via rte->relid and rte->inh, and calls where rte->rtekind != RTE_RELATION can be ignored if desired. Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: http://postgr.es/m/CA%2BTgmoYg8uUWyco7Pb3HYLMBRQoO6Zh9hwgm27V39Pb6Pdf%3Dug%40mail.gmail.com	2026-03-09 09:48:26 -04:00
Robert Haas	8300d3ad4a	Consider startup cost as a figure of merit for partial paths. Previously, the comments stated that there was no purpose to considering startup cost for partial paths, but this is not the case: it's perfectly reasonable to want a fast-start path for a plan that involves a LIMIT (perhaps over an aggregate, so that there is enough data being processed to justify parallel query but yet we don't want all the result rows). Accordingly, rewrite add_partial_path and add_partial_path_precheck to consider startup costs. This also fixes an independent bug in add_partial_path_precheck: commit `e222534679` failed to update it to do anything with the new disabled_nodes field. That bug fix is formally separate from the rest of this patch and could be committed separately, but I think it makes more sense to fix both issues together, because then we can (as this commit does) just make add_partial_path_precheck do the cost comparisons in the same way as compare_path_costs_fuzzily, which hopefully reduces the chances of ending up with something that's still incorrect. This patch is based on earlier work on this topic by Tomas Vondra, but I have rewritten a great deal of it. Co-authored-by: Robert Haas <rhaas@postgresql.org> Co-authored-by: Tomas Vondra <tomas@vondra.me> Discussion: http://postgr.es/m/CA+TgmobRufbUSksBoxytGJS1P+mQY4rWctCk-d0iAUO6-k9Wrg@mail.gmail.com	2026-03-09 08:16:30 -04:00
Robert Haas	ffc226ab64	Prevent restore of incremental backup from bloating VM fork. When I (rhaas) wrote the WAL summarizer code, I incorrectly believed that XLOG_SMGR_TRUNCATE truncates all forks to the same length. In fact, what other parts of the code do is compute the truncation length for the FSM and VM forks from the truncation length used for the main fork. But, because I was confused, I coded the WAL summarizer to set the limit block for the VM fork to the same value as for the main fork. (Incremental backup always copies FSM forks in full, so there is no similar issue in that case.) Doing that doesn't directly cause any data corruption, as far as I can see. However, it does create a serious risk of consuming a large amount of extra disk space, because pg_combinebackup's reconstruct.c believes that the reconstructed file should always be at least as long as the limit block value. We might want to be smarter about that at some point in the future, because it's always safe to omit all-zeroes blocks at the end of the last segment of a relation, and doing so could save disk space, but the current algorithm will rarely waste enough disk space to worry about unless we believe that a relation has been truncated to a length much longer than its actual length on disk, which is exactly what happens as a result of the problem mentioned in the previous paragraph. To fix, create a new visibilitymap helper function and use it to include the right limit block in the summary files. Incremental backups taken with existing summary files will still have this issue, but this should improve the situation going forward. Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com> Diagnosed-by: Amul Sul <sulamul@gmail.com> Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com Backpatch-through: 17	2026-03-09 06:45:32 -04:00
Amit Kapila	06d8302262	Remove trailing period from errmsg in subscriptioncmds.c. Author: Sahitya Chandra <sahityajb@gmail.com> Discussion: https://postgr.es/m/20260308142806.181309-1-sahityajb@gmail.com	2026-03-09 15:10:03 +05:30
Michael Paquier	4da2afd01f	Fix size underestimation of DSA pagemap for odd-sized segments When make_new_segment() creates an odd-sized segment, the pagemap was only sized based on a number of usable_pages entries, forgetting that a segment also contains metadata pages, and that the FreePageManager uses absolute page indices that cover the entire segment. This miscalculation could cause accesses to pagemap entries to be out of bounds. During subsequent reuse of the allocated segment, allocations landing on pages with indices higher than usable_pages could cause out-of-bounds pagemap reads and/or writes. On write, 'span' pointers are stored into the data area, corrupting the allocated objects. On read (aka during a dsa_free), garbage is interpreted as a span pointer, typically crashing the server in dsa_get_address(). The normal geometric path correctly sizes the pagemap for all pages in the segment. The odd-sized path needs to do the same, but it works forward from usable_pages rather than backward from total_size. This commit fixes the sizing of the odd-sized case by adding pagemap entries for the metadata pages after the initial metadata_bytes calculation, using an integer ceiling division to compute the exact number of additional entries needed in one go, avoiding any iteration in the calculation. An assertion is added in the code path for odd-sized segments, ensuring that the pagemap includes the metadata area, and that the result is appropriately sized. This problem would show up depending on the size requested for the allocation of a DSA segment. The reporter has noticed this issue when a parallel hash join makes a DSA allocation large enough to trigger the odd-sized segment path, but it could happen for anything that does a DSA allocation. A regression test is added to test_dsa, down to v17 where the test module has been introduced. This adds a set of cheap tests to check the problem, the new assertion being useful for this purpose. Sami has proposed a test that took a longer time than what I have done here; the test committed is faster and good enough to check the odd-sized allocation path. Author: Paul Bunn <paul.bunn@icloud.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/044401dcabac$fe432490$fac96db0$@icloud.com Backpatch-through: 14	2026-03-09 13:46:27 +09:00
Masahiko Sawada	50ea4e09b6	Use palloc_object() and palloc_array() in more areas of the logical replication. The idea is to encourage the use of newer routines across the tree, as these offer stronger type-safety guarantees than raw palloc(). Similar work has been done in commits `1b105f9472`, `0c3c5c3b06`, `31d3847a37`, and `4f7dacc5b8`. This commit extends those changes to more locations within src/backend/replication/logical/. Author: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAHut+Pv4N7Vpxo18+NAR1r9RGvR8b0BtwTkoeCE2PfFoXgmR6A@mail.gmail.com	2026-03-06 10:49:50 -08:00
Tom Lane	415100aa62	Support grouping-expression references and GROUPING() in subqueries. Until now, substitute_grouped_columns and its predecessor check_ungrouped_columns intentionally did not cope with references to GROUP BY expressions (anything more complex than a Var) within subqueries of the query having GROUP BY. Because they didn't try to match subexpressions of subqueries to the GROUP BY list, they'd drill down to raw Vars of the grouping level and then fail with "subquery uses ungrouped column from outer query". There have been remarkably few complaints about this deficiency, so nobody ever did anything about it. The reason for not wanting to deal with it is that within a subquery, Vars will have varlevelsup different from zero and will thus not be equal() to the expressions seen in the outer query. We recognized this at least as far back as `96ca8ffeb`, although I think the comment I added about it then was just documenting a pre-existing deficiency. It looks like at the time, the solutions I considered were (1) write a version of equal() that permits an offset in varlevelsup, or (2) dynamically apply IncrementVarSublevelsUp at each subexpression. (1) would require an amount of new code that seems rather out of proportion to the benefit, while (2) would add an exponential amount of cost to the matching process. But rethinking it now, what seems attractive is (3) apply IncrementVarSublevelsUp to the groupingClause list not the subexpressions, and do so only once per subquery depth level. Then we can still use plain equal() to check for matches, and we're not incurring cost proportional to some power of the subquery's complexity. This patch continues to use the old logic when the GROUP BY list is all Vars. We could discard the special comparison logic for that and always do it the more general way, but that would be a good deal slower. (Micro-benchmarking just parse analysis suggests it's about 50% slower than the Vars-only path. But we've not heard complaints about the speed of matching within the main query, so I doubt that applying the same matching logic within subqueries will be a problem.) The lack of complaints suggests strongly that this is a very minority use-case, so I don't want to make the typical case slower to fix it. While testing that, I was surprised to discover a nearby bug: GROUPING() within a subquery fails to match GROUP BY Vars that are join alias Vars. It tries to apply flatten_join_alias_vars to make such cases work, but that fails to work inside a subquery because varlevelsup is wrong. Therefore, this patch invents a new entry point flatten_join_alias_for_parser() that allows specification of a sublevels_up offset. (It seems cleaner to give the parser its own entry point rather than abuse the planner's conventions even further.) While this is pretty clearly a bug fix, I'm hesitant to take the risk of back-patching, seeing that the existing behavior has stood for so long with so few complaints. Maybe we can reconsider once this patch has baked awhile in master. Reported-by: PALAYRET Jacques <jacques.palayret@meteo.fr> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/531183.1772058731@sss.pgh.pa.us	2026-03-06 13:40:55 -05:00
Jeff Davis	8185bb5347	CREATE SUBSCRIPTION ... SERVER. Allow CREATE SUBSCRIPTION to accept a foreign server using the SERVER clause instead of a raw connection string using the CONNECTION clause. * Enables a user with sufficient privileges to create a subscription using a foreign server by name without specifying the connection details. * Integrates with user mappings (and other FDW infrastructure) using the subscription owner. * Provides a layer of indirection to manage multiple subscriptions to the same remote server more easily. Also add CREATE FOREIGN DATA WRAPPER ... CONNECTION clause to specify a connection_function. To be eligible for a subscription, the foreign server's foreign data wrapper must specify a connection_function. Add connection_function support to postgres_fdw, and bump postgres_fdw version to 1.3. Bump catversion. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/61831790a0a937038f78ce09f8dd4cef7de7456a.camel@j-davis.com	2026-03-06 08:27:56 -08:00
Álvaro Herrera	868825aaeb	Don't include wait_event.h in pgstat.h wait_event.h itself includes wait_event_types.h, which is a generated file, so it's nice that we can avoid compiling >10% of the tree just because that file is regenerated. To avoid breaking too many third-party modules, we now #include utils/wait_classes.h in storage/latch.h. Then, the very common case of doing WaitLatch(..., PG_WAIT_EXTENSION) continues to work by including just storage/latch.h. (I didn't try to determine how many modules would actually break if we don't do this, but this seems a convenient and low-impact measure.) Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/202602181214.gcmhx2vhlxzp@alvherre.pgsql	2026-03-06 16:24:58 +01:00
Fujii Masao	6eedb2a5fd	Fix publisher shutdown hang caused by logical walsender busy loop. Previously, when logical replication was running, shutting down the publisher could cause the logical walsender to enter a busy loop and prevent the publisher from completing shutdown. During shutdown, the logical walsender waits for all pending WAL to be written out. However, some WAL records could remain unflushed, causing the walsender to wait indefinitely. The issue occurred because the walsender used XLogBackgroundFlush() to flush pending WAL. This function does not guarantee that all WAL is written. For example, WAL generated by a transaction without an assigned transaction ID that aborts might not be flushed. This commit fixes the bug by making the logical walsender call XLogFlush() instead, ensuring that all pending WAL is written and preventing the busy loop during shutdown. Backpatch to all supported versions. Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ@mail.gmail.com Backpatch-through: 14	2026-03-06 16:43:40 +09:00
Michael Paquier	d5ea206728	Fix inconsistency with HeapTuple freeing in extended_stats_funcs.c heap_freetuple() is a thin wrapper doing a pfree(), and the function import_pg_statistic(), introduced by `ba97bf9cb7`, had the idea to call directly pfree() rather than the "dedicated" heap tuple routine. upsert_pg_statistic_ext_data already uses heap_freetuple(). This code is harmless as-is, but let's be consistent across the board. Reported-by: Yonghao Lee <yonghao_lee@qq.com> Discussion: https://postgr.es/m/tencent_CA1315EE8FB9C62F742C71E95FAD72214205@qq.com	2026-03-06 14:49:00 +09:00
Michael Paquier	2d4ead6f4b	Fix order of columns in pg_stat_recovery recovery_last_xact_time is listed before current_chunk_start_time in the documentation, the function definition and the view definition, but their order was reversed in the code. Thinko in `01d485b142`. Mea culpa. Author: Shinya Kato <shinya11.kato@gmail.com> Discussion: https://postgr.es/m/CAOzEurQQ1naKmPJhfE5WOUQjtf5tu08Kw3QCGY5UY=7Rt9fE=w@mail.gmail.com	2026-03-06 14:41:41 +09:00
Amit Kapila	f1ddaa1535	Fix inconsistent elevel in pg_sync_replication_slots() retry logic. The commit `0d2d4a0ec3` allowed pg_sync_replication_slots() to retry sync attempts, but missed a case, when WAL prior to a slot's confirmed_flush_lsn is not yet flushed locally. By changing the elevel from ERROR to LOG, we allow the sync loop to continue. This provides the opportunity for the slot to be synchronized once the standby catches up with the necessary WAL. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAFPTHDZAA+gWDntpa5ucqKKba41=tXmoXqN3q4rpjO9cdxgQrw@mail.gmail.com	2026-03-06 10:51:32 +05:30
Michael Paquier	01d485b142	Add system view pg_stat_recovery This commit introduces pg_stat_recovery, that exposes at SQL level the state of recovery as tracked by XLogRecoveryCtlData in shared memory, maintained by the startup process. This new view includes the following fields, that are useful for monitoring purposes on a standby, once it has reached a consistent state (making the execution of the SQL function possible): - Last-successfully replayed WAL record LSN boundaries and its timeline. - Currently replaying WAL record end LSN and its timeline. - Current WAL chunk start time. - Promotion trigger state. - Timestamp of latest processed commit/abort. - Recovery pause state. Some of this data can already be recovered from different system functions, but not all of it. See pg_get_wal_replay_pause_state or pg_last_xact_replay_timestamp. This new view offers a stronger consistency guarantee, by grabbing the recovery state for all fields through one spinlock acquisition. The system view relies on a new function, called pg_stat_get_recovery(). Querying this data requires the pg_read_all_stats privilege. The view returns no rows if the node is not in recovery. This feature originates from a suggestion I have made while discussion the addition of a CONNECTING state to the WAL receiver's shared memory state, because we lacked access to some of the state data. The author has taken the time to implement it, so thanks for that. Bump catalog version. Author: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com Discussion: https://postgr.es/m/aW13GJn_RfTJIFCa@paquier.xyz	2026-03-06 12:37:40 +09:00
Michael Paquier	42a12856a6	Refactor code retrieving string for RecoveryPauseState This refactoring is going to be useful in an upcoming commit, to avoid some code duplication with the function pg_get_wal_replay_pause_state(), that returns a string for the recovery pause state. Refactoring opportunity noticed while hacking on a different patch. Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com	2026-03-06 11:53:23 +09:00
Tom Lane	f95d73ed43	Simplify creation of built-in functions with non-default ACLs. Up to now, to create such a function, one had to make a pg_proc.dat entry and then modify it with GRANT/REVOKE commands, which we put in system_functions.sql. That seems a little ugly though, because it violates the idea of having a single source of truth about the initial contents of pg_proc, and it results in leaving dead rows in the initial contents of pg_proc. This patch improves matters by allowing aclitemin to work during early bootstrap, before pg_authid has been loaded. On the same principle that we use for early access to pg_type details, put a table of known built-in role names into bootstrap.c, and use that in bootstrap mode. To create a built-in function with a non-default ACL, one should write the desired ACL list in its pg_proc.dat entry, using a simplified version of aclitemout's notation: omit the grantor (if it is the bootstrap superuser, which it pretty much always should be) and spell the bootstrap superuser's name as POSTGRES, similarly to the notation used elsewhere in src/include/catalog. This results in entries like proacl => '{POSTGRES=X,pg_monitor=X}' which shows that we've revoked public execute permissions and instead granted that to pg_monitor. In addition to fixing up pg_proc.dat entries, I got rid of some role grants that had been stuck into system_functions.sql, and instead put them into a new file pg_auth_members.dat; that seems like a far less random place to put the information. The correctness of the data changes can be verified by comparing the initial contents of pg_proc and pg_auth_members before and after. pg_proc should match exactly, but the OID column of pg_auth_members will probably be different because those OIDs now get assigned a little earlier in bootstrap. (I forced a catversion bump out of caution, but it wasn't really necessary.) Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net	2026-03-05 17:43:09 -05:00
Melanie Plageman	34cb4254bd	Prefix PruneState->all_{visible,frozen} with set_ The PruneState had members called "all_visible" and "all_frozen" which reflect not the current state of the page but the state it could be in once pruning and freezing have been executed. These are then saved in the PruneFreezeResult so the caller can set the VM accordingly. Prefix the PruneState members as well as the corresponsding PruneFreezeResult members with "set_" to clarify that they represent the proposed state of the all-visible and all-frozen bits for a heap page in the visibility map, not the current state. Author: Melanie Plageman <melanieplageman@gmail.com> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk	2026-03-05 16:55:00 -05:00
Melanie Plageman	68c2dcb913	Add PageGetPruneXid() helper This is similar to the other page accessors in bufpage.h. It improves readability and avoids long lines. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com	2026-03-05 16:22:57 -05:00
Melanie Plageman	59663e4207	Move commonly used context into PruneState and simplify helpers heap_page_prune_and_freeze() and many of its helpers use the heap buffer, block number, and page. Other helpers took the heap page and didn't use it. Initializing these values once during prune_freeze_setup() simplifies the helpers' interfaces and avoids any repeated calls to BufferGetBlockNumber() and BufferGetPage(). While updating PruneState, also reorganize its fields to make the layout and member documentation more consistent. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com	2026-03-05 16:10:29 -05:00
Alexander Korotkov	177037341a	Fix handling of updated tuples in the MERGE statement This branch missed the IsolationUsesXactSnapshot() check. That led to EPQ on repeatable read and serializable isolation levels. This commit fixes the issue and provides a simple isolation check for that. Backpatch through v15 where MERGE statement was introduced. Reported-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAPpHfdvzZSaNYdj5ac-tYRi6MuuZnYHiUkZ3D-AoY-ny8v%2BS%2Bw%40mail.gmail.com Author: Tender Wang <tndrwang@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Backpatch-through: 15	2026-03-05 19:49:28 +02:00
Fujii Masao	bffd7130e9	Improve validation of recovery_target_xid GUC values. Previously, the recovery_target_xid GUC values were not sufficiently validated. As a result, clearly invalid inputs such as the string "bogus", a decimal value like "1.1", or 0 (a transaction ID smaller than the minimum valid value of 3) were unexpectedly accepted. In these cases, the value was interpreted as transaction ID 0, which could cause recovery to behave unexpectedly. This commit improves validation of recovery_target_xid GUC so that invalid values are rejected with an error. This prevents recovery from proceeding with misconfigured recovery_target_xid settings. Also this commit updates the documentation to clarify the allowed values for recovery_target_xid GUC. Author: David Steele <david@pgbackrest.org> Reviewed-by: Hüseyin Demir <huseyin.d3r@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/f14463ab-990b-4ae9-a177-998d2677aae0@pgbackrest.org	2026-03-05 21:40:32 +09:00
Michael Paquier	5f8124a0cf	Move definition of XLogRecoveryCtlData to xlogrecovery.h XLogRecoveryCtlData is the structure that stores the shared-memory state of WAL recovery, including information such as promotion requests, the timeline ID (TLI), and the LSNs of replayed records. This refactoring is independently useful because it allows code outside of core to access the recovery state in live. It will be used by an upcoming patch that introduces a SQL function for querying this information, that can be accessed on a standby once a consistent state has been reached. This only moves code around, changing nothing functionally. Author: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com	2026-03-05 12:17:47 +09:00
Michael Paquier	34dfca2934	Change default value of default_toast_compression to "lz4", take two The default value for default_toast_compression was "pglz". The main reason for this choice is that this option is always available, pglz code being embedded in Postgres. However, it is known that LZ4 is more efficient than pglz: less CPU required, more compression on average. As of this commit, the default value of default_toast_compression becomes "lz4", if available. By switching to LZ4 as the default, users should see natural speedups on TOAST data reads and/or writes. Support for LZ4 in TOAST compression was added in Postgres v14, or 5 releases ago. This should be long enough to consider this feature as stable. While at it, quotes are removed from default_toast_compression in postgresql.conf.sample. Quotes are not required in this case. The in-place value replacement done by initdb if the build supports LZ4 would not use them in the postgresql.conf file added to a freshly-initialized cluster. Note that this is a version lighter than `7c1849311e`, that included a replacement of --with-lz4 by --without-lz4 in configure builds, forcing a requirement for LZ4 in all environments. The buildfarm did not like it, at all. This commit switches default_toast_compression to lz4 as default only when --with-lz4 is defined, which should keep the buildfarm at bay while still allowing users to benefit from LZ4 compression in TOAST as long as the code is compiled with it. Author: Euler Taveira <euler@eulerto.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org	2026-03-05 09:24:35 +09:00
Michael Paquier	4f0b3afab4	Revert "Change default value of default_toast_compression to "lz4"" This reverts commit `7c1849311e`, due to the fact that more than 60% of the buildfarm members do not have lz4 installed. As we are in the last commit fest of the development cycle, and that it could take a couple of weeks to stabilize things, this change is reverted for now. This commit will be reworked in a lighter version, as default_toast_compression's default can be changed to "lz4" without the switch from --with-lz4 to --without-lz4. This approach will keep the buildfarm at bay, and still allow builds to take advantage of LZ4 in TOAST by default, as long as the code is compiled with LZ4 support. A harder requirement based on LZ4 should be achievable at some point, but it is going to require some work from the buildfarm owners first. Perhaps this part could be revisited at the beginning of the next development cycle. Discussion: https://postgr.es/m/CAOYmi+meTT0NbLbnVqOJD5OKwCtHL86PQ+RZZTrn6umfmHyWaw@mail.gmail.com	2026-03-05 08:25:35 +09:00
Tom Lane	e6a1d8f5ac	Fix estimate_hash_bucket_stats's correction for skewed data. The previous idea was "scale up the bucketsize estimate by the ratio of the MCV's frequency to the average value's frequency". But we should have been suspicious of that plan, since it frequently led to impossible (> 1) values which we had to apply an ad-hoc clamp to. Joel Jacobson demonstrated that it sometimes leads to making the wrong choice about which side of the hash join should be inner. Instead, drop the whole business of estimating average frequency, and just clamp the bucketsize estimate to be at least the MCV's frequency. This corresponds to the bucket size we'd get if only the MCV appears in a bucket, and the MCV's frequency is not affected by the WHERE-clause filters. (We were already making the latter assumption.) This also matches the coding used since `4867d7f62` in the case where only a default ndistinct estimate is available. Interestingly, this change affects no existing regression test cases. Add one to demonstrate that it helps pick the smaller table to be hashed when the MCV is common enough to affect the results. This leaves estimate_hash_bucket_stats not considering the effects of null join keys at all, which we should probably improve. However, I have a different patch in the queue that will change the executor's handling of null join keys, so it seems appropriate to wait till that's in before doing anything more here. Reported-by: Joel Jacobson <joel@compiler.org> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Joel Jacobson <joel@compiler.org> Discussion: https://postgr.es/m/341b723c-da45-4058-9446-1514dedb17c1@app.fastmail.com	2026-03-04 15:33:15 -05:00
Álvaro Herrera	ce4fbe1ac6	Don't malloc(0) in EventTriggerCollectAlterTSConfig Author: Florin Irion <florin.irion@enterprisedb.com> Discussion: https://postgr.es/m/c6fff161-9aee-4290-9ada-71e21e4d84de@gmail.com	2026-03-04 15:04:53 +01:00
Amit Kapila	fd366065e0	Allow table exclusions in publications via EXCEPT TABLE. Extend CREATE PUBLICATION ... FOR ALL TABLES to support the EXCEPT TABLE syntax. This allows one or more tables to be excluded. The publisher will not send the data of excluded tables to the subscriber. To support this, pg_publication_rel now includes a prexcept column to flag excluded relations. For partitioned tables, the exclusion is applied at the root level; specifying a root table excludes all current and future partitions in that tree. Follow-up work will implement ALTER PUBLICATION support for managing these exclusions. Author: vignesh C <vignesh21@gmail.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com	2026-03-04 15:56:48 +05:30
Michael Paquier	7c1849311e	Change default value of default_toast_compression to "lz4", when available The default value for default_toast_compression was "pglz". The main reason for this choice is that this option is always available, pglz code being embedded in Postgres. However, it is known that LZ4 is more efficient than pglz: less CPU required, more compression on average. As of this commit, the default value of default_toast_compression becomes "lz4", if available. By switching to LZ4 as the default, users should see natural speedups on TOAST data reads and/or writes. Support for LZ4 in TOAST compression was added in Postgres v14, or 5 releases ago. This should be long enough to consider this feature as stable. --with-lz4 is removed, replaced by a --without-lz4 to disable LZ4 in the builds on an option-basis, following a practice similar to readline or ICU. References to --with-lz4 are removed from the documentation. While at it, quotes are removed from default_toast_compression in postgresql.conf.sample. Quotes are not required in this case. The in-place value replacement done by initdb if the build supports LZ4 would not use them in the postgresql.conf file added to a freshly-initialized cluster. For the reference, a similar switch has been done with ICU in `fcb21b3acd`. Some of the changes done in this commit are consistent with that. Note: this is going to create some disturbance in the buildfarm, in environments where lz4 is not installed. Author: Euler Taveira <euler@eulerto.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org	2026-03-04 13:05:31 +09:00
Richard Guo	1f4f87d794	Remove redundant restriction checks in apply_child_basequals In apply_child_basequals, after translating a parent relation's restriction quals for a child relation, we simplify each child qual by calling eval_const_expressions. Historically, the code then called restriction_is_always_false and restriction_is_always_true to reduce NullTest quals that are provably false or true. However, since commit `e2debb643`, the planner natively performs NullTest deduction during constant folding. Therefore, calling restriction_is_always_false and restriction_is_always_true immediately afterward is redundant and wastes CPU cycles. We can safely remove them and simply rely on the constant folding to handle the deduction. Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-vLmGXaUEZyOMacN0BVfqWCt2tM-eDVWdDfJnOQaauGg@mail.gmail.com	2026-03-04 10:57:43 +09:00
Richard Guo	ce1c17a316	Remove obsolete SAMESIGN macro The SAMESIGN macro was historically used as a helper for manual integer overflow checks. However, since commit `4d6ad3125` introduced overflow-aware integer operations, this manual sign-checking logic is no longer necessary. The macro remains defined in brin_minmax_multi.c and timestamp.c, but is not used in either file. This patch removes these definitions to clean things up. Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CAMbWs4-NL3J3hQ3LzrwV-YUkQC18P+jM7ZiegQyAHzgdZev2qg@mail.gmail.com	2026-03-04 10:56:06 +09:00
Melanie Plageman	38229cb905	Add read_stream_{pause,resume}() Read stream users can now pause lookahead when no blocks are currently available. After resuming, subsequent read_stream_next_buffer() calls continue lookahead with the previous lookahead distance. This is especially useful for read stream users with self-referential access patterns (where consuming already-read buffers can produce additional block numbers). Author: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com	2026-03-03 16:03:09 -05:00
Álvaro Herrera	cece37c984	Reduce scope of for-loop-local variables to avoid shadowing Adjust a couple of for-loops where a local variable was shadowed by another in the same scope, by renaming it as well as reducing its scope to the containing for-loop. Author: Chao Li <lic@highgo.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CAEoWx2kQ2x5gMaj8tHLJ3=jfC+p5YXHkJyHrDTiQw2nn2FJTmQ@mail.gmail.com	2026-03-03 11:24:11 +01:00
Peter Eisentraut	f2d7570cdd	Reduce the scope of volatile qualifiers Commit `c66a7d75e6` introduced a new "cast discards ‘volatile’" warning (-Wcast-qual) in vac_truncate_clog(). Instead of making use of unvolatize(), remove the warning by reducing the scope of the volatile qualifier (added in commit `2d2e40e3be`) to only 2 fields. Also do the same for vac_update_datfrozenxid(), since the intent of commit `f65ab862e3` was to prevent the same kind of race condition that commit `2d2e40e3be` was fixing. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Suggested-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/aZ3a%2BV82uSfEjDmD%40ip-10-97-1-34.eu-west-3.compute.internal	2026-03-03 10:02:28 +01:00
Peter Eisentraut	2a525cc97e	Add COPY (on_error set_null) option If ON_ERROR SET_NULL is specified during COPY FROM, any data type conversion errors will result in the affected column being set to a null value. A column's not-null constraints are still enforced, and attempting to set a null value in such columns will raise a constraint violation error. This applies to a column whose data type is a domain with a NOT NULL constraint. Author: Jian He <jian.universality@gmail.com> Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com> Reviewed-by: Yugo NAGATA <nagata@sraoss.co.jp> Reviewed-by: torikoshia <torikoshia@oss.nttdata.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/CAKFQuwawy1e6YR4S%3Dj%2By7pXqg_Dw1WBVrgvf%3DBP3d1_aSfe_%2BQ%40mail.gmail.com	2026-03-03 07:37:12 +01:00
Michael Paquier	ba97bf9cb7	Add support for "exprs" in pg_restore_extended_stats() This commit adds support for the restore of extended statistics of the kind "exprs", counting for the statistics data computed for expressions. The input format consists of a jsonb object which must be an array of objects which are keyed by statistics parameter names, like this: [{"stat_type1": "...", "stat_type2": "...", ...}, {"stat_type1": "...", "stat_type2": "...", ...}, ...] The outer array must have as many elements as there are expressions defined in the statistics object, mapping with the way extended statistics are built with one pg_statistic tuple stored for each expression whose statistics have been computed. The elements of the array must be either objects or null values (equivalent of invalid data, case also supported by the stats computations when its data is inserted in the catalogs). The keys of the inner objects are names of the statistical columns in pg_stats_ext_exprs (i.e. everything after "inherited"). Not all parameter keys need to be provided, those omitted being silently ignored. Key values that do not match a statistical column name will cause a warning to be issued, but do not otherwise fail the expression or the import as a whole. The expected value type for all parameters is jbvString, which allows us to validate the values using the input function specific to that parameter. Any parameters with a null value are silently ignored, same as if they were not provided in the first place. This commit includes a battery of test cases: - Sanity checks for what-should-be-all the failures in restore code paths, including parsing errors, parameter sanity checks depending on the extended stats object definition, etc. - Value injection, for scalar, array, range, multi-range cases. - Stats data cloning, with differential checks between the source relation and its target. The source and the target should hold the same stats data after restore. - While expressions are supported in extended statistics since v14, range_length_histogram, range_empty_frac, and range_bounds_histogram have been added to pg_stat_ext_exprs only in v19. A test case has been added to emulate a dump taken from v18, with expression stats restored for a range data type where these three fields are NULL. Support for pg_dump is included, with expressions supported since v14, inherited since v15, and data for range types in expressions in v19. pg_upgrade is the main use-case of this feature; it is also possible to inject statistics, same as for the other extstat kinds. As of this commit, ANALYZE should not be required after pg_upgrade when the cluster upgrading from uses extended statistics, as MCV, dependencies, expressions and ndistinct stats are all covered. The stats data related to range types used in expressions requires v19, whose support has also been added. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM=fPcci6oPyuyEZ0F4bWqAA7HzaWO+ZPptufuX5_uWt6kw@mail.gmail.com	2026-03-03 14:19:54 +09:00
Jeff Davis	11171fe1fc	style: define parameterless functions as foo(void). Change pg_icu_unicode_version() to pg_icu_unicode_version(void), introduced by commit `af2d4ca191`. See commit `9b05e2ec08`, which fixed similar cases. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aaEhpwrj1FY/8/7n@ip-10-97-1-34.eu-west-3.compute.internal	2026-03-02 20:12:38 -08:00
Heikki Linnakangas	ccae90abdb	Fix OldestMemberMXactId and OldestVisibleMXactId array usage Commit `ab355e3a88` changed how the OldestMemberMXactId array is indexed. It's no longer indexed by synthetic dummyBackendId, but with ProcNumber. The PGPROC entries for prepared xacts come after auxiliary processes in the allProcs array, which rendered the calculation for MaxOldestSlot and the indexes into the array incorrect. (The OldestVisibleMXactId array is not used for prepared xacts, and thus never accessed with ProcNumber's greater than MaxBackends, so this only affects the OldestMemberMXactId array.) As a result, a prepared xact would store its value past the end of the OldestMemberMXactId array, overflowing into the OldestVisibleMXactId array. That could cause a transaction's row lock to appear invisible to other backends, or other such visibility issues. With a very small max_connections setting, the store could even go beyond the OldestVisibleMXactId array, stomping over the first element in the BufferDescriptor array. To fix, calculate the array sizes more precisely, and introduce helper functions to calculate the array indexes correctly. Author: Yura Sokolov <y.sokolov@postgrespro.ru> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/7acc94b0-ea82-4657-b1b0-77842cb7a60c@postgrespro.ru Backpatch-through: 17	2026-03-02 19:19:22 +02:00
Melanie Plageman	8b9d42bf6b	Save prune cycles by consistently clearing prune hints on all-visible pages All-visible pages can't contain prunable tuples. We already clear the prune hint (pd_prune_xid) during pruning of all-visible pages, but we were not doing so in vacuum phase three, nor initializing it for all-frozen pages created by COPY FREEZE, and we were not clearing it on standbys. Because page hints are not WAL-logged, pages on a standby carry stale pd_prune_xid values. After promotion, that stale hint triggers unnecessary on-access pruning. Fix this by clearing the prune hint everywhere we currently mark a heap page all-visible. Clearing it when setting PD_ALL_VISIBLE ensures no extra overhead. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/CAAKRu_b-BMOyu0X-0jc_8bWNSbQ5K6JTEueayEhcQuw-OkCSKg%40mail.gmail.com	2026-03-02 11:05:59 -05:00
Michael Paquier	f68d7e7483	Remove WAL page header flag XLP_BKP_REMOVABLE There are no known users of this flag. The last supposed user was pglesslog, which is the reason why this flag has been introduced in core, based on an historical search pointing at `a8d539f124`. I have mentioned that we may want to remove this flag back in 2018, due to zero users of it in core. More recently, Noah has pointed out that this flag is not safe to use: XLP_BKP_REMOVABLE can be set by the WAL writer in a lock-free fashion with runningBackups > 0, meaning that some full-page images could be required but not logged, ultimately corrupting backups. Bump XLOG_PAGE_MAGIC. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/20250705001628.c3.nmisch@google.com Discussion: https://postgr.es/m/CAEze2WhiwKSoAvfUggjDeoeY0-rz9cTpfrHcqvBMmJxv-K_5DA@mail.gmail.com	2026-03-02 14:13:05 +09:00
Michael Paquier	f7dc17aa91	Fix memory allocation size in RegisterExtensionExplainOption() The allocations used for the static array ExplainExtensionOptionArray, that tracks a set of ExplainExtensionOption, used "char " instead of ExplainExtensionOption as the memory size consumed by one element, underestimating the memory required by half. The initial allocation of ExplainExtensionNameArray wants to hold 16 elements before being reallocated, and with "char " it meant that there was enough space only for 8 ExplainExtensionOption elements, 16 bytes required for each element. The backend would crash once one tries to register a 9th EXPLAIN option. As far as I can see, the allocation formulas of GetExplainExtensionId() have been copy-pasted to RegisterExtensionExplainOption(), but the internal maths of the copy were not adjusted accordingly. Oversight in `c65bc2e1d1`. Author: Joel Jacobson <joel@compiler.org> Discussion: https://postgr.es/m/2a4bd2f5-2a2f-409f-8ac7-110dd3fad4fc@app.fastmail.com Backpatch-through: 18	2026-03-02 13:14:15 +09:00
Michael Paquier	3b7a6fa157	Fix set of issues with extended statistics on expressions This commit addresses two defects regarding extended statistics on expressions: - When building extended statistics in lookup_var_attr_stats(), the call to examine_attribute() did not account for the possibility of a NULL return value. This can happen depending on the behavior of a typanalyze callback — for example, if the callback returns false, if no rows are sampled, or if no statistics are computed. In such cases, the code attempted to build MCV, dependency, and ndistinct statistics using a NULL pointer, incorrectly assuming valid statistics were available, which could lead to a server crash. - When loading extended statistics for expressions, statext_expressions_load() did not account for NULL entries in the pg_statistic array storing expression statistics. Such NULL entries can be generated when statistics collection fails for an expression, as may occur during the final step of serialize_expr_stats(). An extended statistics object defining N expressions requires N corresponding elements in the pg_statistic array stored for the expressions, and some of these elements can be NULL. This situation is reachable when a typanalyze callback returns true, but sets stats_valid to indicate that no useful statistics could be computed. While these scenarios cannot occur with in-core typanalyze callbacks, as far as I have analyzed, they can be triggered by custom data types with custom typanalyze implementations, at least. No tests are added in this commit. A follow-up commit will introduce a test module that can be extended to cover similar edge cases if additional issues are discovered. This takes care of the core of the problem. Attribute and relation statistics already offer similar protections: - ANALYZE detects and skips the build of invalid statistics. - Invalid catalog data is handled defensively when loading statistics. This issue exists since the support for extended statistics on expressions has been added, down to v14 as of `a4d75c86bf`. Backpatch to all supported stable branches. Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aaDrJsE1I5mrE-QF@paquier.xyz Backpatch-through: 14	2026-03-02 09:38:37 +09:00
Tom Lane	d80b022501	Correctly calculate "MCV frequency" for a unique column. In commit `bd3e3e9e5`, I over-hastily used 1 / rel->rows as the assumed frequency of entries in a column that ANALYZE has found to be unique. However, rel->rows is the number of table rows that are estimated to pass the query's restriction conditions, so that we got a too-large result if the query has selective restrictions. What I should have used is 1 / rel->tuples, since that is the estimated total number of table rows. The pre-existing code path that digs a frequency out of the histogram produces a frequency relative to the whole table, so surely this new alternative code path must do so as well. Any correction needed on the basis of selectivity must be done by the user of the mcv_freq value. Fixing this causes all the regression test plans changed by `bd3e3e9e5` to revert to what they had been, except for the first change in join.out. As I correctly argued in `bd3e3e9e5`, in that test case we have no stats and should not risk a hash join. Evidently I was less correct to argue that the other changes were improvements. Reported-by: Joel Jacobson <joel@compiler.org> Diagnosed-by: Tender Wang <tndrwang@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/341b723c-da45-4058-9446-1514dedb17c1@app.fastmail.com	2026-03-01 12:56:55 -05:00
Peter Eisentraut	3f98862980	Fix some -Wcast-qual warnings This fixes some warnings from -Wcast-qual that are easy to fix, without using unconstify or the like. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/990c9117-b013-4026-aaf5-261fe2832c3d%40eisentraut.org	2026-02-27 21:57:33 +01:00
Tom Lane	65a3ff8f1b	Doc: improve user docs and code comments about EXISTS(SELECT * ...). Point out that Postgres automatically optimizes away the target list of an EXISTS' subquery, except in weird cases such as target lists containing set-returning functions. Thus, both common conventions EXISTS(SELECT * FROM ...) and EXISTS(SELECT 1 FROM ...) are overhead-free and there's little reason to prefer one over the other. In the code comments, mention that the SQL spec says that EXISTS(SELECT * FROM ...) should be interpreted as EXISTS(SELECT some-literal FROM ...), but we don't choose to do it exactly that way. Author: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/9b301c70-3909-4f0f-98ca-9e3c4d142f3e@eisentraut.org	2026-02-27 15:20:16 -05:00
Tom Lane	98616ac18b	Don't flatten join alias Vars that are stored within a GROUP RTE. The RTE's groupexprs list is used for deparsing views, and for that usage it must contain the original alias Vars; else we can get incorrect SQL output. But since commit `247dea89f`, parseCheckAggregates put the GROUP BY expressions through flatten_join_alias_vars before building the RTE_GROUP RTE. Changing the order of operations there is enough to fix it. This patch unfortunately can do nothing for already-created views: if they use a coding pattern that is subject to the bug, they will deparse incorrectly and hence present a dump/reload hazard in the future. The only fix is to recreate the view from the original SQL. But the trouble cases seem to be quite narrow. AFAICT the output was only wrong for "SELECT ... t1 LEFT JOIN t2 USING (x) GROUP BY x" where t1.x and t2.x were not of identical data types and t1.x was the side that required an implicit coercion. If there was no hidden coercion, or if the join was plain, RIGHT, or FULL, the deparsed output was uglier than intended but not functionally wrong. Reported-by: Swirl Smog Dowry <swirl-smog-dowry@duck.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CA+-gibjCg_vjcq3hWTM0sLs3_TUZ6Q9rkv8+pe2yJrdh4o4uoQ@mail.gmail.com Backpatch-through: 18	2026-02-27 12:54:02 -05:00
Álvaro Herrera	a2c89835f5	Don't include proc.h in shm_mq.h This prevents proliferation of proc.h to tons of other places; shm_mq.h is widely included. Discussion: https://postgr.es/m/202602261733.s2rkxezwuif6@alvherre.pgsql	2026-02-27 10:53:47 +01:00
Melanie Plageman	284925508a	Remove table_scan_analyze_next_tuple unneeded parameter OldestXmin heapam_scan_analyze_next_tuple() doesn't distinguish between dead and recently dead tuples when counting them, so it doesn't need OldestXmin. GetOldestNonRemovableTransactionId() isn't free, so removing it is a win. Looking at other table AMs implementing table_scan_analyze_next_tuple(), we couldn't find one using OldestXmin either, so remove it from the callback. Author: Melanie Plageman <melanieplageman@gmail.com> Suggested-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CALdSSPjvhGXihT_9f-GJabYU%3D_PjrFDUxYaURuTbfLyQM6TErg%40mail.gmail.com	2026-02-26 15:41:53 -05:00
Melanie Plageman	3efe58febc	Simplify visibility check in heap_page_would_be_all_visible() heap_page_would_be_all_visible() does not need to distinguish between HEAPTUPLE_RECENTLY_DEAD and HEAPTUPLE_DEAD tuples: any tuple in a state other than HEAPTUPLE_LIVE means the page is not all-visible and heap_page_would_be_all_visible() returns false. Given that, calling HeapTupleSatisfiesVacuum() is unnecessary, since it performs extra work to distinguish between dead and recently dead tuples using OldestXmin. Replace it with the more minimal HeapTupleSatisfiesVacuumHorizon(). Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CALdSSPjvhGXihT_9f-GJabYU%3D_PjrFDUxYaURuTbfLyQM6TErg%40mail.gmail.com	2026-02-26 15:41:45 -05:00
Jeff Davis	d942511f08	Fix memory leaks in pg_locale_icu.c. The backport prior to 18 requires minor modification due to code refactoring. Discussion: https://postgr.es/m/e2b7a0a88aaadded7e2d19f42d5ab03c9e182ad8.camel@j-davis.com Backpatch-through: 16	2026-02-26 12:15:01 -08:00
Melanie Plageman	5aea60839b	Rename LVRelState VM-related logging counters The LVRelState fields that track newly all-visible/all-frozen pages were previously named vm_new_visible_pages, vm_new_frozen_pages, and vm_new_visible_frozen_pages. The correct terminology is all-visible and all-frozen; omitting “all” was open to misinterpretation, as the page isn't visible or invisible, rather all the tuples on the page are visible to all running and future transactions. Rename the members accordingly. Author: Melanie Plageman <melanieplageman@gmail.com> Suggested-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk	2026-02-26 15:04:49 -05:00
Álvaro Herrera	7b9b620d8f	Don't include latch.h in libpq/libpq.h This reduces the inclusion footprint of latch.h a bit. Per suggestion from Andres Freund. Discussion: https://postgr.es/m/pap7mzhcxvuwlfdebjkh646ntyk4brtwm4dbocfpllwdccta5t@w3d7wz6mjpwv	2026-02-26 18:04:13 +01:00
Andres Freund	9d6294c09e	instrumentation: Drop INSTR_TIME_SET_CURRENT_LAZY macro This macro had exactly one user in InstrStartNode, and the caller can instead use INSTR_TIME_IS_ZERO / INSTR_TIME_SET_CURRENT directly. This supports a future change that intends to modify the time source being used in the InstrStartNode case. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAP53Pkx1bK1FB71_nBqYmzvSSXnp_MbE0ZDnU+baPJF6Ud2WDA@mail.gmail.com	2026-02-26 10:39:29 -05:00
Andres Freund	3218825271	instrumentation: Rename INSTR_TIME_LT macro to INSTR_TIME_GT This was incorrectly named "LT" for "larger than" in `e5a5e0a907`, but that is against existing conventions, where "LT" means "less than". Clarify by using "GT" for "greater than" in macro name, and add a missing comment at the top of instr_time.h to note the macro's existence. Reported by: Peter Smith <smithpb2250@gmail.com> Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAHut%2BPut94CTpjQsqOJHdHkgJ2ZXq%2BqVSfMEcmDKLiWLW-hPfA%40mail.gmail.com	2026-02-26 10:38:59 -05:00
Fujii Masao	70f470314c	Fix ProcWakeup() resetting wrong waitStart field. Previously, when one process woke another that was waiting on a lock, ProcWakeup() incorrectly cleared its own waitStart field (i.e., MyProc->waitStart) instead of that of the process being awakened. As a result, the awakened process retained a stale lock-wait start timestamp. This did not cause user-visible issues. pg_locks.waitstart was reported as NULL for the awakened process (i.e., when pg_locks.granted is true), regardless of the waitStart value. This bug was introduced by commit `46d6e5f567`. This commit fixes this by resetting the waitStart field of the process being awakened in ProcWakeup(). Backpatch to all supported branches. Reported-by: Chao Li <li.evan.chao@gmail.com> Author: Chao Li <li.evan.chao@gmail.com> Reviewed-by: ji xu <thanksgreed@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/537BD852-EC61-4D25-AB55-BE8BE46D07D7@gmail.com Backpatch-through: 14	2026-02-26 08:46:12 +09:00
Richard Guo	77c7a17a6e	Fix unsafe RTE_GROUP removal in simplify_EXISTS_query When simplify_EXISTS_query removes the GROUP BY clauses from an EXISTS subquery, it previously deleted the RTE_GROUP RTE directly from the subquery's range table. This approach is dangerous because deleting an RTE from the middle of the rtable list shifts the index of any subsequent RTE, which can silently corrupt any Var nodes in the query tree that reference those later relations. (Currently, this direct removal has not caused problems because the RTE_GROUP RTE happens to always be the last entry in the rtable list. However, relying on that is extremely fragile and seems like trouble waiting to happen.) Instead of deleting the RTE_GROUP RTE, this patch converts it in-place to be RTE_RESULT type and clears its groupexprs list. This preserves the length and indexing of the rtable list, ensuring all Var references remain intact. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3472344.1771858107@sss.pgh.pa.us Backpatch-through: 18	2026-02-25 11:13:21 +09:00
Álvaro Herrera	65707ed9af	Add backtrace support for Windows using DbgHelp API Previously, backtrace generation on Windows would return an "unsupported" message. With this commit, we rely on CaptureStackBackTrace() to capture the call stack and the DbgHelp API (SymFromAddrW, SymGetLineFromAddrW64) for symbol resolution. Symbol handler initialization (SymInitialize) is performed once per process and cached. If initialization fails, the report for it is returned as the backtrace output. The symbol handler is cleaned up via on_proc_exit() to release DbgHelp resources. The implementation provides symbol names, offsets, and addresses. When PDB files are available, it also includes source file names and line numbers. Symbol names and file paths are converted from UTF-16 to the database encoding using wchar2char(), which properly handles both UTF-8 and non-UTF-8 databases on Windows. When symbol information is unavailable or encoding conversion fails, it falls back to displaying raw addresses. The implementation uses the explicit UTF16 versions of the DbgHelp functions (SYMBOL_INFOW, SymFromAddrW, IMAGEHLP_LINEW64, SymGetLineFromAddrW64) rather than the generic versions. This allows us to rely on predictable encoding conversion, rather than using the haphazard ANSI codepage that we'd get otherwise. DbgHelp is apparently available on all Windows platforms we support, so there are no version number checks. Author: Bryan Green <dbryan.green@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Greg Burd <greg@burd.me> Discussion: https://postgr.es/m/a692c0fe-caca-4c08-9c5d-debfd0ef2504@gmail.com	2026-02-24 17:34:56 +01:00
Peter Eisentraut	a99c6b56ff	Make ALTER DOMAIN VALIDATE CONSTRAINT no-op when constraint is already validated Currently, AlterDomainValidateConstraint will re-validate a constraint that has already been validated, which would just waste cycles. This operation should be a no-op when the constraint is already validated. This also aligns with ATExecValidateConstraint. Author: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxG=-Dv9fPJHqkA9c-wGZ2dDOWOXSp-X-0K_G7r-DgaASw@mail.gmail.com	2026-02-24 10:58:36 +01:00
Peter Eisentraut	f80bedd52b	Allow ALTER COLUMN SET EXPRESSION on virtual columns with CHECK constraints Previously, changing the generation expression of a virtual column was prohibited if the column was referenced by a CHECK constraint. This lifts that restriction. RememberAllDependentForRebuilding within ATExecSetExpression will rebuild all the dependent constraints, later ATPostAlterTypeCleanup queues the required AlterTableStmt operations for ALTER TABLE Phase 3 execution. Overall, ALTER COLUMN SET EXPRESSION on virtual columns may require scanning the table to re-verify any associated CHECK constraints, but it does not require a table rewrite in ALTER TABLE Phase 3. Author: jian he <jian.universality@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CACJufxH3VETr7orF5rW29GnDk3n1wWbOE3WdkHYd3iPGrQ9E_A@mail.gmail.com	2026-02-24 10:32:05 +01:00
Michael Paquier	462fe0ff62	Fix variety of typos and grammar mistakes This commit includes a batch of fixes for various minor typos and grammar mistakes, that have been proposed to the hackers mailing list since the beginning of January. Similar batches are planned on a bi-monthly basis depending on the amount received, with the next one for the end of April.	2026-02-24 13:26:37 +09:00
Nathan Bossart	d981976027	Allow pg_{read,write}_all_data to access large objects. Since the initial goal of pg_read_all_data was to be able to run pg_dump as a non-superuser without explicitly granting access to every object, it follows that it should allow reading all large objects. For consistency, pg_write_all_data should allow writing all large objects, too. Author: Nitin Motiani <nitinmotiani@google.com> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Discussion: https://postgr.es/m/CAH5HC96dxAEvP78s1-JK_nDABH5c4w2MDfyx4vEWxBEfofGWsw%40mail.gmail.com	2026-02-23 14:55:21 -06:00
Tom Lane	d743545d84	Work around lgamma(NaN) bug on AIX. lgamma(NaN) should produce NaN, but on older versions of AIX it reports an ERANGE error. While that's been fixed in the latest version of libm, it'll take awhile for the fix to propagate. This workaround is harmless even when the underlying bug does get fixed. Discussion: https://postgr.es/m/3603369.1771877682@sss.pgh.pa.us	2026-02-23 15:30:50 -05:00
Peter Eisentraut	aca61f7e5f	Use LOCKMODE in parse_relation.c/.h There were a couple of comments in parse_relation.c > Note: properly, lockmode should be declared LOCKMODE not int, but that > would require importing storage/lock.h into parse_relation.h. Since > LOCKMODE is typedef'd as int anyway, that seems like overkill. but actually LOCKMODE has been in storage/lockdefs.h for a while, which is intentionally a more narrow header. So we can include that one in parse_relation.h and just use LOCKMODE normally. An alternative would be to add a duplicate typedef into parse_relation.h, but that doesn't seem necessary here. Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/4bcd65fb-2497-484c-bb41-83cb435eb64d%40eisentraut.org	2026-02-23 21:25:55 +01:00
Tom Lane	4a1b05caa5	Restore AIX support. The concerns that led us to remove AIX support in commit `0b16bb877` have now been alleviated: 1. IBM has stepped forward to provide support, including buildfarm animal(s). 2. AIX 7.2 and later seem to be fine with large pg_attribute_aligned requirements. Since 7.1 is now EOL anyway, we can just cease to support it. 3. Tossing xlc support overboard seems okay as well. It's a bit sad to drop one of the few remaining non-gcc-alike compilers, but working around xlc's bugs and idiosyncrasies doesn't seem justified by the theoretical portability benefits. 4. Likewise, we can stop supporting 32-bit AIX builds. This is not so much about whether we could build such executables as that they're too much of a pain to manage in the field, due to limited address space available for dynamic library loading. 5. We hit on a way to manage catalog column alignment that doesn't require continuing developer effort (see commit `ecae09725`). Hence, this commit reverts `0b16bb877` and some follow-on commits such as `e6bb491bf`, except for not putting back XLC support nor the changes related to catalog column alignment. Some other notable changes from the way things were in v16: Prefer unnamed POSIX semaphores on AIX, rather than the default choice of SysV semaphores. Include /opt/freeware/lib in -Wl,-blibpath, even when it is not mentioned anywhere in LDFLAGS. Remove platform-specific adjustment of MEMSET_LOOP_LIMIT; maybe that's still the right thing, but it really ought to be re-tested. Silence compiler warnings related to getpeereid(), wcstombs_l(), and PAM conversation procs. Accept "libpythonXXX.a" as an okay name for the Python shared library (but only on AIX!). Author: Aditya Kamath <Aditya.Kamath1@ibm.com> Author: Srirama Kucherlapati <sriram.rk@in.ibm.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CY5PR11MB63928CC05906F27FB10D74D0FD322@CY5PR11MB6392.namprd11.prod.outlook.com	2026-02-23 13:34:22 -05:00
Nathan Bossart	bc60ee8606	Warn upon successful MD5 password authentication. This uses the "connection warning" infrastructure introduced by commit `1d92e0c2cc` to emit a WARNING when an MD5 password is used to authenticate. MD5 password support was marked as deprecated in v18 and will be removed in a future release of Postgres. These warnings are on by default but can be turned off via the existing md5_password_warnings parameter. Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Xiangyu Liang <liangxiangyu_2013@163.com> Discussion: https://postgr.es/m/aYzeAYEbodkkg5e-%40nathan	2026-02-23 11:22:04 -06:00
Peter Eisentraut	797872f6b9	Rename validate_relation_kind() There are three static definitions of validate_relation_kind() in the codebase, one each in table.c, indexam.c and sequence.c, validating that the given relation is a table, an index or a sequence respectively. The compiler knows which definition to use where because they are static. But this could be confusing to a reader. Rename these functions so that their names reflect the kind of relation they are validating. While at it, also update the comments in table.c to clarify the definition of table-like relkinds so that we don't have to maintain the exclusion list as the set of relkinds undergoes changes. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org	2026-02-23 17:38:06 +01:00
Peter Eisentraut	d7be57ad85	Flip logic in table validate_relation_kind It instead of checking which relkinds it shouldn't be, explicitly list the ones we accept. This is used to check which relkinds are accepted in table_open() and related functions. Before this change, figuring that out was always a few steps too complicated. This also makes changes for new relkinds more explicit instead of accidental. Finally, this makes this more aligned with the functions of the same name in src/backend/access/index/indexam.c and src/backend/access/sequence/sequence.c. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org	2026-02-23 17:32:07 +01:00
Andrew Dunstan	b380a56a3f	Disallow CR and LF in database, role, and tablespace names Previously, these characters could cause problems when passed through shell commands, and were flagged with a comment in string_utils.c suggesting they be rejected in a future major release. The affected commands are CREATE DATABASE, CREATE ROLE, CREATE TABLESPACE, ALTER DATABASE RENAME, ALTER ROLE RENAME, and ALTER TABLESPACE RENAME. Also add a pg_upgrade check to detect these invalid names in clusters being upgraded from pre-v19 versions, producing a report file listing any offending objects that must be renamed before upgrading. Tests have been modified accordingly. Author: Mahendra Singh Thalor <mahi6run@gmail.com> Reviewed-By: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-By: Andrew Dunstan <andrew@dunslane.net> Reviewed-By: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-By: Nathan Bossart <nathandbossart@gmail.com> Reviewed-By: Srinath Reddy <srinath2133@gmail.com> Discussion: https://postgr.es/m/CAKYtNApkOi4FY0S7+3jpTqnHVyyZ6Tbzhtbah-NBbY-mGsiKAQ@mail.gmail.com	2026-02-23 11:19:13 -05:00
Nathan Bossart	f33b8793fd	Make use of pg_popcount() in more places. This replaces some loops over word-length popcount functions with calls to pg_popcount(). Since pg_popcount() may use a function pointer for inputs with sizes >= a Bitmapset word, this produces a small regression for the common one-word case in bms_num_members(). To deal with that, this commit adds an inlined fast-path for that case. This fast-path could arguably go in pg_popcount() itself (with an appropriate alignment check), but that is left for future study. Suggested-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://postgr.es/m/CANWCAZY7R%2Biy%2Br9YM_sySNydHzNqUirx1xk0tB3ej5HO62GdgQ%40mail.gmail.com	2026-02-23 09:26:00 -06:00
Peter Eisentraut	55f3859329	Change error message for sequence validate_relation_kind() We can just say "... is not a sequence" instead of the more complicated variant from before, which was probably copied from src/backend/access/table/table.c. Fix a typo in a comment in passing. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/6d3fef19-a420-4e11-8235-8ea534bf2080%40eisentraut.org	2026-02-23 10:56:54 +01:00
Peter Eisentraut	3a63b76571	Fix additional fallthrough warnings from clang Clang warns if falling through to a case or default label that is immediately followed by break, but GCC does not (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91432). (MSVC also warns about the equivalent code in C++.) This is in preparation for enabling fallthrough warnings on Clang. Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/76a8efcd-925a-4eaf-bdd1-d972cd1a32ff%40eisentraut.org	2026-02-23 07:40:19 +01:00
Álvaro Herrera	0eeffd31bf	Avoid name collision with NOT NULL constraints If a CREATE TABLE statement defined a constraint whose name is identical to the name generated for a NOT NULL constraint, we'd throw an (unnecessary) unique key violation error on pg_constraint_conrelid_contypid_conname_index: this can easily be avoided by choosing a different name for the NOT NULL constraint. Fix by passing the constraint names already created by AddRelationNewConstraints() to AddRelationNotNullConstraints(), so that the latter can avoid name collisions with them. Bug: #19393 Author: Laurenz Albe <laurenz.albe@cybertec.at> Reported-by: Hüseyin Demir <huseyin.d3r@gmail.com> Backpatch-through: 18 Discussion: https://postgr.es/m/19393-6a82427485a744cf@postgresql.org	2026-02-21 12:22:08 +01:00
Heikki Linnakangas	36bbcd5be3	Split PGPROC 'links' field into two, for clarity The field was mainly used for the position in a LOCK's wait queue, but also as the position in a the freelist when the PGPROC entry was not in use. The reuse saves some memory at the expense of readability, which seems like a bad tradeoff. If we wanted to make the struct smaller there's other things we could do, but we're actually just discussing adding padding to the struct for performance reasons. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/3dd6f70c-b94d-4428-8e75-74a7136396be@iki.fi	2026-02-20 22:34:42 +02:00
Nathan Bossart	dc592a4155	Speedup COPY FROM with additional function inlining. Following the example set by commit `58a359e585`, we can squeeze out a little more performance from COPY FROM (FORMAT {text,csv}) by inlining CopyReadLineText() and passing the is_csv parameter as a constant. This allows the compiler to emit specialized code with fewer branches. This is preparatory work for a proposed follow-up commit that would further optimize this code with SIMD instructions. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Ayoub Kazar <ma_kazar@esi.dz> Tested-by: Manni Wood <manni.wood@enterprisedb.com> Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com	2026-02-20 12:07:27 -06:00
Richard Guo	691977d370	Fix computation of varnullingrels when translating appendrel Var When adjust_appendrel_attrs translates a Var referencing a parent relation into a Var referencing a child relation, it propagates varnullingrels from the parent Var to the translated Var. Previously, the code simply overwrote the translated Var's varnullingrels with those of the parent. This was incorrect because the translated Var might already possess nonempty varnullingrels. This happens, for example, when a LATERAL subquery within a UNION ALL references a Var from the nullable side of an outer join. In such cases, the translated Var correctly carries the outer join's relid in its varnullingrels. Overwriting these bits with the parent Var's set caused the planner to lose track of the fact that the Var could be nulled by that outer join. In the reported case, because the underlying column had a NOT NULL constraint, the planner incorrectly deduced that the Var could never be NULL and discarded essential IS NOT NULL filters. This led to incorrect query results where NULL rows were returned instead of being filtered out. To fix, use bms_add_members to merge the parent Var's varnullingrels into the translated Var's existing set, preserving both sources of nullability. Back-patch to v16. Although the reported case does not seem to cause problems in v16, leaving incorrect varnullingrels in the tree seems like a trap for the unwary. Bug: #19412 Reported-by: Sergey Shinderuk <s.shinderuk@postgrespro.ru> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19412-1d0318089b86859e@postgresql.org Backpatch-through: 16	2026-02-20 17:57:53 +09:00
Michael Paquier	0dc22fff64	Fix constant in error message for recovery_target_timeline The intention was to use PG_UINT32_MAX, not UINT_MAX. Let's be consistent and use the same constant. Thinko in `fd7d7b7191`. Author: David Steele <david@pgbackrest.org> Discussion: https://postgr.es/m/aZfXO97jSQaTTlfD@paquier.xyz	2026-02-20 16:17:57 +09:00
Amit Kapila	9842e8aca0	Avoid including worker_internal.h in pgstat.h. pgstat.h is a widely included header. Including worker_internal.h there is unnecessary and creates tight coupling. By refactoring pgstat_report_subscription_error() to fetch the required LogicalRepWorkerType internally rather than receiving it as an argument, we can eliminate the need for the internal header. Reported-by: Andres Freund <andres@anarazel.de> Author: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/aY-UE-4t7FiYgH3t@alap3.anarazel.de	2026-02-20 09:26:33 +05:30
Nathan Bossart	ba401828c1	Remove SpinLockFree() and S_LOCK_FREE(). S_LOCK_FREE() is used by the test program in s_lock.c, but nobody has voiced concerns about losing some coverage there. SpinLockFree() appears to have been unused since it was introduced by commit `499abb0c0f`. There was agreement to remove these in 2020, but it never happened. Since we still have agreement for removal in 2026, let's do that now. Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/aZX2oUcKf7IzHnnK%40nathan Discussion: https://postgr.es/m/20200608225338.m5zho424w6lpwb2d%40alap3.anarazel.de	2026-02-19 16:19:41 -06:00
Robert Haas	6e466e1e83	Fix add_partial_path interaction with disabled_nodes Commit `e222534679` adjusted the logic in add_path() to keep the path list sorted by disabled_nodes and then by total_cost, but failed to make the corresponding adjustment to add_partial_path. As a result, add_partial_path might sort the path list just by total cost, which could lead to later planner misbehavior. In principle, this should be back-patched to v18, but we are typically reluctant to back-patch planner fixes for fear of destabilizing working installations, and it is unclear to me that this has sufficiently serious consequences to justify an exception, so for now, no back-patch. Reviewed-by: Richard Guo <guofenglinux@gmail.com> Discussion: http://postgr.es/m/CAMbWs4-mO3jMK4t_LgcJ+7Eo=NmGgkxettgRaVbJzZvVZ1koMA@mail.gmail.com	2026-02-19 13:46:10 -05:00
Álvaro Herrera	fc3896c786	Add translator comment Otherwise the message is not very clear. Backpatch-through: 18	2026-02-19 17:11:04 +01:00
Tom Lane	2f248ad573	Remove no-longer-useful markers in pg_hba.conf.sample. The source version of pg_hba.conf.sample contains @remove-line-for-nolocal@ markers that indicate which lines should be deleted for an installation that doesn't HAVE_UNIX_SOCKETS. We no longer support that case, and since commit `f55808828` all that initdb is doing is unconditionally removing the markers. We might as well remove the markers from the source version and drop the removal code, which is unintelligible now anyway. This will not of course save any noticeable number of cycles in initdb, but it might save some confusion for future developers looking at pg_hba.conf.sample. It also reduces the number of distinct cases that replace_token() has to support, possibly allowing some tightening of that function. Discussion: https://postgr.es/m/2287786.1771458157@sss.pgh.pa.us	2026-02-19 11:09:00 -05:00
Fujii Masao	fb80f388f4	Add per-subscription wal_receiver_timeout setting. This commit allows setting wal_receiver_timeout per subscription using the CREATE SUBSCRIPTION and ALTER SUBSCRIPTION commands. The value is stored in the subwalrcvtimeout column of the pg_subscription catalog. When set, this value overrides the global wal_receiver_timeout for the subscription's apply worker. The default is -1, which means the global setting (from the server configuration, command line, role, or database) remains in effect. This feature is useful for configuring different timeout values for each subscription, especially when connecting to multiple publisher servers, to improve failure detection. Bump catalog version. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/a1414b64-bf58-43a6-8494-9704975a41e9@oss.nttdata.com	2026-02-20 01:00:09 +09:00
Fujii Masao	8a6af3ad08	Make GUC wal_receiver_timeout user-settable. When multiple subscribers connect to different publisher servers, it can be useful to set different wal_receiver_timeout values for each connection to better detect failures. However, previously this wasn't possible, which limited flexibility in managing subscriptions. This commit changes wal_receiver_timeout to be user-settable, allowing different values to be assigned using ALTER ROLE SET for each subscription owner. This effectively enables per-subscription configuration. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/a1414b64-bf58-43a6-8494-9704975a41e9@oss.nttdata.com	2026-02-20 00:52:43 +09:00
Fujii Masao	5b93a5987b	Log checkpoint request flags in checkpoint completion messages. Checkpoint completion log messages include more detail than checkpoint start messages, but previously omitted the checkpoint request flags, which were only logged at checkpoint start. As a result, users had to correlate completion messages with earlier start messages to see the full context. This commit includes the checkpoint request flags in the checkpoint completion log message as well. This duplicates some information, but makes the completion message self-contained and easier to interpret. Author: Soumya S Murali <soumyamurali.work@gmail.com> Reviewed-by: Michael Banck <mbanck@gmx.net> Reviewed-by: Yuan Li <carol.li2025@outlook.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAMtXxw9tPwV=NBv5S9GZXMSKPeKv5f9hRhSjZ8__oLsoS5jcuA@mail.gmail.com	2026-02-19 23:55:12 +09:00
Peter Eisentraut	8354b9d6b6	Use fallthrough attribute instead of comment Instead of using comments to mark fallthrough switch cases, use the fallthrough attribute. This will (in the future, not here) allow supporting other compilers besides gcc. The commenting convention is only supported by gcc, the attribute is supported by clang, and in the fullness of time the C23 standard attribute would allow supporting other compilers as well. Right now, we package the attribute into a macro called pg_fallthrough. This commit defines that macro and replaces the existing comments with that macro invocation. We also raise the level of the gcc -Wimplicit-fallthrough= option from 3 to 5 to enforce the use of the attribute. Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/76a8efcd-925a-4eaf-bdd1-d972cd1a32ff%40eisentraut.org	2026-02-19 08:51:12 +01:00
Peter Eisentraut	0c3fbb3fef	Remove useless fallthrough annotation A fallthrough attribute after the last case is a constraint violation in C23, and clang warns about it (not about this comment, but if we changed it to an attribute). Remove it. (There was apparently never anything after this to fall through to, even in the first commit da07a1e8565.) Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/76a8efcd-925a-4eaf-bdd1-d972cd1a32ff%40eisentraut.org	2026-02-19 08:50:58 +01:00
Michael Paquier	21e323e941	Sanitize some WAL-logging buffer handling in GIN and GiST code As transam's README documents, the general order of actions recommended when WAL-logging a buffer is to unlock and unpin buffers after leaving a critical section. This pattern was not being followed by some code paths of GIN and GiST, adjusted in this commit, where buffers were either unlocked or unpinned inside a critical section. Based on my analysis of each code path updated here, there is no reason to not follow the recommended unlocking/unpin pattern done outside of a critical section. These inconsistencies are rather old, coming mainly from `ecaa4708e5` and `ff301d6e69`. The guidelines in the README predate these commits, being introduced in `6d61cdec07`. Author: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/CALdSSPgBPnpNNzxv0Y+_GNFzW6PmzRZYh+_hpf06Y1N2zLhZaQ@mail.gmail.com	2026-02-19 15:59:20 +09:00
Tom Lane	759b03b24c	Simplify creation of built-in functions with default arguments. Up to now, to create such a function, one had to make a pg_proc.dat entry and then overwrite it with a CREATE OR REPLACE command in system_functions.sql. That's error-prone (cf. bug #19409) and results in leaving dead rows in the initial contents of pg_proc. Manual maintenance of pg_node_tree strings seems entirely impractical, and parsing expressions during bootstrap would be extremely difficult as well. But Andres Freund observed that all the current use-cases are simple constants, and building a Const node is well within the capabilities of bootstrap mode. So this patch invents a special case: if bootstrap mode is asked to ingest a non-null value for pg_proc.proargdefaults (which would otherwise fail in pg_node_tree_in), it parses the value as an array literal and then feeds the element strings to the input functions for the corresponding parameter types. Then we can build a suitable pg_node_tree string with just a few more lines of code. This allows removing all the system_functions.sql entries that are just there to set up default arguments, replacing them with proargdefaults fields in pg_proc.dat entries. The old technique remains available in case someone needs a non-constant default. The initial contents of pg_proc are demonstrably the same after this patch, except that (1) json_strip_nulls and jsonb_strip_nulls now have the correct provolatile setting, as per bug #19409; (2) pg_terminate_backend, make_interval, and drandom_normal now have defaults that don't include a type coercion, which is how they should have been all along. In passing, remove some unused entries from bootstrap.c's TypInfo[] array. I had to add some new ones because we'll now need an entry for each default-possessing system function parameter, but we shouldn't carry more than we need there; it's just a maintenance gotcha. Bug: #19409 Reported-by: Lucio Chiessi <lucio.chiessi@trustly.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Author: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net Discussion: https://postgr.es/m/19409-e16cd2605e59a4af@postgresql.org	2026-02-18 14:14:44 -05:00
Heikki Linnakangas	d62dca3b29	Use standard die() handler for SIGTERM in bgworkers The previous default bgworker_die() signal would exit with elog(FATAL) directly from the signal handler. That could cause deadlocks or crashes if the signal handler runs while we're e.g holding a spinlock or in the middle of a memory allocation. All the built-in background workers overrode that to use the normal die() handler and CHECK_FOR_INTERRUPTS(). Let's make that the default for all background workers. Some extensions relying on the old behavior might need to adapt, but the new default is much safer and is the right thing to do for most background workers. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://www.postgresql.org/message-id/5238fe45-e486-4c62-a7f3-c7d8d416e812@iki.fi	2026-02-18 19:59:34 +02:00
Michael Paquier	ee642cccc4	Switch SysCacheIdentifier to a typedef enum The main purpose of this change is to allow an ABI checker to understand when the list of SysCacheIdentifier changes, by switching all the routine declarations that relied on a signed integer for a syscache ID to this new type. This is going to be useful in the long-term for versions newer than v19 so as we will be able to check when the list of values in SysCacheIdentifier is updated in a non-ABI compliant fashion. Most of the changes of this commit are due to the new definition of SyscacheCallbackFunction, where a SysCacheIdentifier is now required for the syscache ID. It is a mechanical change, still slightly invasive. There are more areas in the tree that could be improved with an ABI checker in mind; this takes care of only one area. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/289125.1770913057@sss.pgh.pa.us	2026-02-18 09:58:38 +09:00
Michael Paquier	c06b5b99bb	Add concept of invalid value to SysCacheIdentifier This commit tweaks the generation of the syscache IDs for the enum SysCacheIdentifier to now include an invalid value, with -1 assigned as value. The concept of an invalid syscache ID exists when handling lookups of a ObjectAddress, based on their set of properties in ObjectPropertyType. -1 is used for the case where an object type has no option for a syscache lookup. This has been found as independently useful while discussing a switch of SysCacheIdentifier to a typedef, as we already have places that want to know about the concept of an invalid value when dealing with ObjectAddresses. Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://postgr.es/m/aZQRnmp9nVjtxAHS@paquier.xyz	2026-02-18 09:25:52 +09:00
Michael Paquier	f7df12a66c	Fix one-off issue with cache ID in objectaddress.c get_catalog_object_by_oid_extended() has been doing a syscache lookup when given a cache ID strictly higher than 0, which is wrong because the first valid value of SysCacheIdentifier is 0. This issue had no consequences, as the first value assigned in the enum SysCacheIdentifier is AGGFNOID, which is not used in the object type properties listed in objectaddress.c. Even if an ID of 0 was hypotherically given, the code would still work with a less efficient heap-or-index scan. Discussion: https://postgr.es/m/aZTr_R6JGmqokUBb@paquier.xyz	2026-02-18 08:47:58 +09:00
Álvaro Herrera	b7271aa1d7	Use a bitmask for ExecInsertIndexTuples options ... instead of passing a bunch of separate booleans. Also, rearrange the argument list in a hopefully more sensible order. Discussion: https://postgr.es/m/202602111846.xpvuccb3inbx@alvherre.pgsql Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> (older version)	2026-02-17 17:59:45 +01:00
Álvaro Herrera	661237056b	Fix memory leak in new GUC check_hook Commit `38e0190ced` forgot to pfree() an allocation (freed in other places of the same function) in only one of several spots in check_log_min_messages(). Per Coverity. Add that. While at it, avoid open-coding guc_strdup(). The new coding does a strlen() that wasn't there before, but I doubt it's measurable.	2026-02-17 16:38:24 +01:00
Heikki Linnakangas	a92b809f9d	Ignore SIGINT in walwriter and walsummarizer Previously, SIGINT was treated the same as SIGTERM in walwriter and walsummarizer. That decision goes back to when the walwriter process was introduced (commit `ad4295728e`), and was later copied to walsummarizer. It was a pretty arbitrary decision back then, and we haven't adopted that convention in all the other processes that have been introduced later. Summary of how other processes respond to SIGINT: - Autovacuum launcher: Cancel the current iteration of launching - bgworker: Ignore (unless connected to a database) - checkpointer: Request shutdown checkpoint - bgwriter: Ignore - pgarch: Ignore - startup process: Ignore - walreceiver: Ignore - IO worker: die() IO workers are a notable exception in that they exit on SIGINT, and there's a documented reason for that: IO workers ignore SIGTERM, so SIGINT provides a way to manually kill them. (They do respond to SIGUSR2, though, like all the other processes that we don't want to exit immediately on SIGTERM on operating system shutdown.) To make this a little more consistent, ignore SIGINT in walwriter and walsummarizer. They have no "query" to cancel, and they react to SIGTERM just fine. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/818bafaf-1e77-4c78-8037-d7120878d87c@iki.fi	2026-02-17 17:18:31 +02:00
Noah Misch	8cef93d8a5	Suppress new "may be used uninitialized" warning. Various buildfarm members, having compilers like gcc 8.5 and 6.3, fail to deduce that text_substring() variable "E" is initialized if slice_size!=-1. This suppression approach quiets gcc 8.5; I did not reproduce the warning elsewhere. Back-patch to v14, like commit `9f4fd119b2`. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1157953.1771266105@sss.pgh.pa.us Backpatch-through: 14	2026-02-16 18:04:58 -08:00
Peter Eisentraut	d50c86e743	Change remaining StaticAssertStmt() to StaticAssertDecl() This completes the work started by commit `75f49221c2`. In basebackup.c, changing the StaticAssertStmt to StaticAssertDecl results in having the same StaticAssertDecl() in 2 functions. So, it makes more sense to move it to file scope instead. Also, as it depends on some computations based on 2 tar blocks, define TAR_NUM_TERMINATION_BLOCKS. In deadlock.c, change the StaticAssertStmt to StaticAssertDecl and keep it in the function scope. Add new braces to avoid warning from -Wdeclaration-after-statement. In aset.c, change the StaticAssertStmt to StaticAssertDecl and move it to file scope. Finally, update the comments in c.h a bit. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/aYH6ii46AvGVCB84%40ip-10-97-1-34.eu-west-3.compute.internal	2026-02-16 09:22:43 +01:00
Fujii Masao	351265a6c7	Remove recovery.signal at recovery end when both signal files are present. When both standby.signal and recovery.signal are present, standby.signal takes precedence and the server runs in standby mode. Previously, in this case, recovery.signal was not removed at the end of standby mode (i.e., on promotion) or at the end of archive recovery, while standby.signal was removed. As a result, a leftover recovery.signal could cause a subsequent restart to enter archive recovery unexpectedly, potentially preventing the server from starting. This behavior was surprising and confusing to users. This commit fixes the issue by updating the recovery code to remove recovery.signal alongside standby.signal when both files are present and recovery completes. Because this code path is particularly sensitive and changes in recovery behavior can be risky for stable branches, this change is applied only to the master branch. Reported-by: Nikolay Samokhvalov <nik@postgres.ai> Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: David Steele <david@pgbackrest.org> Discussion: https://postgr.es/m/CAM527d8PVAQFLt_ndTXE19F-XpDZui861882L0rLY3YihQB8qA@mail.gmail.com	2026-02-16 13:57:38 +09:00
Noah Misch	9f4fd119b2	Fix SUBSTRING() for toasted multibyte characters. Commit `1e7fe06c10` changed pg_mbstrlen_with_len() to ereport(ERROR) if the input ends in an incomplete character. Most callers want that. text_substring() does not. It detoasts the most bytes it could possibly need to get the requested number of characters. For example, to extract up to 2 chars from UTF8, it needs to detoast 8 bytes. In a string of 3-byte UTF8 chars, 8 bytes spans 2 complete chars and 1 partial char. Fix this by replacing this pg_mbstrlen_with_len() call with a string traversal that differs by stopping upon finding as many chars as the substring could need. This also makes SUBSTRING() stop raising an encoding error if the incomplete char is past the end of the substring. This is consistent with the general philosophy of the above commit, which was to raise errors on a just-in-time basis. Before the above commit, SUBSTRING() never raised an encoding error. SUBSTRING() has long been detoasting enough for one more char than needed, because it did not distinguish exclusive and inclusive end position. For avoidance of doubt, stop detoasting extra. Back-patch to v14, like the above commit. For applications using SUBSTRING() on non-ASCII column values, consider applying this to your copy of any of the February 12, 2026 releases. Reported-by: SATŌ Kentarō <ranvis@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Bug: #19406 Discussion: https://postgr.es/m/19406-9867fddddd724fca@postgresql.org Backpatch-through: 14	2026-02-14 12:16:16 -08:00
Noah Misch	4644f8b23b	pg_mblen_range, pg_mblen_with_len: Valgrind after encoding ereport. The prior order caused spurious Valgrind errors. They're spurious because the ereport(ERROR) non-local exit discards the pointer in question. pg_mblen_cstr() ordered the checks correctly, but these other two did not. Back-patch to v14, like commit `1e7fe06c10`. Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/20260214053821.fa.noahmisch@microsoft.com Backpatch-through: 14	2026-02-14 12:16:16 -08:00
John Naylor	ef3c3cf6d0	Perform radix sort on SortTuples with pass-by-value Datums Radix sort can be much faster than quicksort, but for our purposes it is limited to sequences of unsigned bytes. To make tuples with other types amenable to this technique, several features of tuple comparison must be accounted for, i.e. the sort key must be "normalized": 1. Signedness -- It's possible to modify a signed integer such that it can be compared as unsigned. For example, a signed char has range -128 to 127. If we cast that to unsigned char and add 128, the range of values becomes 0 to 255 while preserving order. 2. Direction -- SQL allows specification of ASC or DESC. The descending case is easily handled by taking the complement of the unsigned representation. 3. NULL values -- NULLS FIRST and NULLS LAST must work correctly. This commmit only handles the case where datum1 is pass-by-value Datum (possibly abbreviated) that compares like an ordinary integer. (Abbreviations of values of type "numeric" are a convenient counterexample.) First, tuples are partitioned by nullness in the correct NULL ordering. Then the NOT NULL tuples are sorted with radix sort on datum1. For tiebreaks on subsequent sortkeys (including the first sort key if abbreviated), we divert to the usual qsort. ORDER BY queries on pre-warmed buffers are up to 2x faster on high cardinality inputs with radix sort than the sort specializations added by commit `697492434`, so get rid of them. It's sufficient to fall back to qsort_tuple() for small arrays. Moderately low cardinality inputs show more modest improvents. Our qsort is strongly optimized for very low cardinality inputs, but radix sort is usually equal or very close in those cases. The changes to the regression tests are caused by under-specified sort orders, e.g. "SELECT a, b from mytable order by a;". For unstable sorts, such as our qsort and this in-place radix sort, there is no guarantee of the order of "b" within each group of "a". The implementation is taken from ska_byte_sort() (Boost licensed), which is similar to American flag sort (an in-place radix sort) with modifications to make it better suited for modern pipelined CPUs. The technique of normalization described above can also be extended to the case of multiple keys. That is left for future work (Thanks to Peter Geoghegan for the suggestion to look into this area). Reviewed-by: Chengpeng Yan <chengpeng_yan@outlook.com> Reviewed-by: zengman <zengman@halodbtech.com> Reviewed-by: ChangAo Chen <cca5507@qq.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> (earlier version) Discussion: https://postgr.es/m/CANWCAZYzx7a7E9AY16Jt_U3+GVKDADfgApZ-42SYNiig8dTnFA@mail.gmail.com	2026-02-14 13:50:06 +07:00
Michael Paquier	775fc01415	Improve error message for checksum failures in pgstat_database.c This log message was referring to conflicts, but it is about checksum failures. The log message improved in this commit should never show up, due to the fact that pgstat_prepare_report_checksum_failure() should always be called before pgstat_report_checksum_failures_in_db(), with a stats entry already created in the pgstats shared hash table. The three code paths able to report database-level checksum failures follow already this requirement. Oversight in `b96d3c3897`. Author: Wang Peng <215722532@qq.com> Discussion: https://postgr.es/m/tencent_9B6CD6D9D34AE28CDEADEC6188DB3BA1FE07@qq.com Backpatch-through: 18	2026-02-13 12:17:08 +09:00
Dean Rasheed	88327092ff	Add support for INSERT ... ON CONFLICT DO SELECT. This adds a new ON CONFLICT action DO SELECT [FOR UPDATE/SHARE], which returns the pre-existing rows when conflicts are detected. The INSERT statement must have a RETURNING clause, when DO SELECT is specified. The optional FOR UPDATE/SHARE clause allows the rows to be locked before they are are returned. As with a DO UPDATE conflict action, an optional WHERE clause may be used to prevent rows from being selected for return (but as with a DO UPDATE action, rows filtered out by the WHERE clause are still locked). Bumps catversion as stored rules change. Author: Andreas Karlsson <andreas@proxel.se> Author: Marko Tiikkaja <marko@joh.to> Author: Viktor Holmberg <v@viktorh.net> Reviewed-by: Joel Jacobson <joel@compiler.org> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Discussion: https://postgr.es/m/d631b406-13b7-433e-8c0b-c6040c4b4663@Spark Discussion: https://postgr.es/m/5fca222d-62ae-4a2f-9fcb-0eca56277094@Spark Discussion: https://postgr.es/m/2b5db2e6-8ece-44d0-9890-f256fdca9f7e@proxel.se Discussion: https://postgr.es/m/CAL9smLCdV-v3KgOJX3mU19FYK82N7yzqJj2HAwWX70E=P98kgQ@mail.gmail.com	2026-02-12 09:57:04 +00:00
Amit Kapila	788ec96d59	Refactor slot synchronization logic in slotsync.c. Following `e68b6adad9`, the reason for skipping slot synchronization is stored as a slot property. This commit removes redundant function parameters that previously tracked this state, instead relying directly on the slot property. Additionally, this change centralizes the logic for skipping synchronization when required WAL has not yet been received or flushed. By consolidating this check, we reduce code duplication and the risk of inconsistent state updates across different code paths. In passing, add an assertion to ensure a slot is marked as temporary if a consistent point has not been reached during synchronization. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/TY4PR01MB16907DD16098BE3B20486D4569463A@TY4PR01MB16907.jpnprd01.prod.outlook.com Discussion: https://postgr.es/m/CAFPTHDZAA+gWDntpa5ucqKKba41=tXmoXqN3q4rpjO9cdxgQrw@mail.gmail.com	2026-02-12 14:38:31 +05:30
Dean Rasheed	706cadde32	Remove p_is_insert from struct ParseState. The only place that used p_is_insert was transformAssignedExpr(), which used it to distinguish INSERT from UPDATE when handling indirection on assignment target columns -- see commit `c1ca3a19df`. However, this information is already available to transformAssignedExpr() via its exprKind parameter, which is always either EXPR_KIND_INSERT_TARGET or EXPR_KIND_UPDATE_TARGET. As noted in the commit message for `c1ca3a19df`, this use of p_is_insert isn't particularly pretty, so have transformAssignedExpr() use the exprKind parameter instead. This then allows p_is_insert to be removed entirely, which simplifies state management in a few other places across the parser. Author: Viktor Holmberg <v@viktorh.net> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/badc3b4c-da73-4000-b8d3-638a6f53a769@Spark	2026-02-12 09:01:42 +00:00
Richard Guo	cf74558feb	Reduce LEFT JOIN to ANTI JOIN using NOT NULL constraints For a LEFT JOIN, if any var from the right-hand side (RHS) is forced to null by upper-level quals but is known to be non-null for any matching row, the only way the upper quals can be satisfied is if the join fails to match, producing a null-extended row. Thus, we can treat this left join as an anti-join. Previously, this transformation was limited to cases where the join's own quals were strict for the var forced to null by upper qual levels. This patch extends the logic to check table constraints, leveraging the NOT NULL attribute information already available thanks to the infrastructure introduced by `e2debb643`. If a forced-null var belongs to the RHS and is defined as NOT NULL in the schema (and is not nullable due to lower-level outer joins), we know that the left join can be reduced to an anti-join. Note that to ensure the var is not nullable by any lower-level outer joins within the current subtree, we collect the relids of base rels that are nullable within each subtree during the first pass of the reduce-outer-joins process. This allows us to verify in the second pass that a NOT NULL var is indeed safe to treat as non-nullable. Based on a proposal by Nicolas Adenis-Lamarre, but this is not the original patch. Suggested-by: Nicolas Adenis-Lamarre <nicolas.adenis.lamarre@gmail.com> Author: Tender Wang <tndrwang@gmail.com> Co-authored-by: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/CACPGbctKMDP50PpRH09in+oWbHtZdahWSroRstLPOoSDKwoFsw@mail.gmail.com	2026-02-12 15:30:13 +09:00
Heikki Linnakangas	78a5e3074b	Fix pg_stat_get_backend_wait_event() for aux processes The pg_stat_activity view shows information for aux processes, but the pg_stat_get_backend_wait_event() and pg_stat_get_backend_wait_event_type() functions did not. To fix, call AuxiliaryPidGetProc(pid) if BackendPidGetProc(pid) returns NULL, like we do in pg_stat_get_activity(). In version 17 and above, it's a little silly to use those functions when we already have the ProcNumber at hand, but it was necessary before v17 because the backend ID was different from ProcNumber. I have other plans for wait_event_info on master, so it doesn't seem worth applying a different fix on different versions now. Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://www.postgresql.org/message-id/c0320e04-6e85-4c49-80c5-27cfb3a58108@iki.fi Backpatch-through: 14	2026-02-11 18:50:57 +02:00
Nathan Bossart	1d92e0c2cc	Add password expiration warnings. This commit adds a new parameter called password_expiration_warning_threshold that controls when the server begins emitting imminent-password-expiration warnings upon successful password authentication. By default, this parameter is set to 7 days, but this functionality can be disabled by setting it to 0. This patch also introduces a new "connection warning" infrastructure that can be reused elsewhere. For example, we may want to warn about the use of MD5 passwords for a couple of releases before removing MD5 password support. Author: Gilles Darold <gilles@darold.net> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: songjinzhou <tsinghualucky912@foxmail.com> Reviewed-by: liu xiaohui <liuxh.zj.cn@gmail.com> Reviewed-by: Yuefei Shi <shiyuefei1004@gmail.com> Reviewed-by: Steven Niu <niushiji@gmail.com> Reviewed-by: Soumya S Murali <soumyamurali.work@gmail.com> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/129bcfbf-47a6-e58a-190a-62fc21a17d03%40migops.com	2026-02-11 10:36:15 -06:00
Álvaro Herrera	1efdd7cc63	Cleanup for log_min_messages changes in `38e0190ced` * Remove an unused variable * Use "default log level" consistently (instead of "generic") * Keep the process types in alphabetical order (missed one place in the SGML docs) * Since log_min_messages type was changed from enum to string, it is a good idea to add single quotes when printing it out. Otherwise it fails if the user copies and pastes from the SHOW output to SET, except in the simplest case. Using single quotes reduces confusion. * Use lowercase string for the burned-in default value, to keep the same output as previous versions. Author: Euler Taveira <euler@eulerto.com> Author: Man Zeng <zengman@halodbtech.com> Author: Noriyoshi Shinoda <noriyoshi.shinoda@hpe.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/202602091250.genyflm2d5dw@alvherre.pgsql	2026-02-11 16:38:18 +01:00
Heikki Linnakangas	7984ce7a1d	Move ProcStructLock to the ProcGlobal struct It protects the freeProcs and some other fields in ProcGlobal, so let's move it there. It's good for cache locality to have it next to the thing it protects, and just makes more sense anyway. I believe it was allocated as a separate shared memory area just for historical reasons. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/b78719db-0c54-409f-b185-b0d59261143f@iki.fi	2026-02-11 16:48:45 +02:00
Heikki Linnakangas	ab32a9e21d	Remove useless store to local variable It was a leftover from commit `5764f611e1`, which converted the loop to use dclist_foreach. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/3dd6f70c-b94d-4428-8e75-74a7136396be@iki.fi	2026-02-11 11:49:18 +02:00
Robert Haas	7358abcc60	Store information about Append node consolidation in the final plan. An extension (or core code) might want to reconstruct the planner's decisions about whether and where to perform partitionwise joins from the final plan. To do so, it must be possible to find all of the RTIs of partitioned tables appearing in the plan. But when an AppendPath or MergeAppendPath pulls up child paths from a subordinate AppendPath or MergeAppendPath, the RTIs of the subordinate path do not appear in the final plan, making this kind of reconstruction impossible. To avoid this, propagate the RTI sets that would have been present in the 'apprelids' field of the subordinate Append or MergeAppend nodes that would have been created into the surviving Append or MergeAppend node, using a new 'child_append_relid_sets' field for that purpose. The value of this field is a list of Bitmapsets, because each relation whose append-list was pulled up had its own set of RTIs: just one, if it was a partitionwise scan, or more than one, if it was a partitionwise join. Since our goal is to see where partitionwise joins were done, it is essential to avoid losing the information about how the RTIs were grouped in the pulled-up relations. This commit also updates pg_overexplain so that EXPLAIN (RANGE_TABLE) will display the saved RTI sets. Co-authored-by: Robert Haas <rhaas@postgresql.org> Co-authored-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com	2026-02-10 17:55:59 -05:00
Michael Paquier	9181c870ba	Improve type handling of varlena structures This commit changes the definition of varlena to a typedef, so as it becomes possible to remove "struct" markers from various declarations in the code base. Historically, "struct" markers are not the project style for variable declarations, so this update simplifies the code and makes it more consistent across the board. This change has an impact on the following structures, simplifying declarations using them: - varlena - varatt_indirect - varatt_external This cleanup has come up in a different path set that played with TOAST and varatt.h, independently worth doing on its own. Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aW8xvVbovdhyI4yo@paquier.xyz	2026-02-11 07:33:24 +09:00
Robert Haas	0d4391b265	Store information about elided nodes in the final plan. An extension (or core code) might want to reconstruct the planner's choice of join order from the final plan. To do so, it must be possible to find all of the RTIs that were part of the join problem in that plan. Commit `adbad833f3`, together with the earlier work in `8c49a484e8`, is enough to let us match up RTIs we see in the final plan with RTIs that we see during the planning cycle, but we still have a problem if the planner decides to drop some RTIs out of the final plan altogether. To fix that, when setrefs.c removes a SubqueryScan, single-child Append, or single-child MergeAppend from the final Plan tree, record the type of the removed node and the RTIs that the removed node would have scanned in the final plan tree. It would be natural to record this information on the child of the removed plan node, but that would require adding an additional pointer field to type Plan, which seems undesirable. So, instead, store the information in a separate list that the executor need never consult, and use the plan_node_id to identify the plan node with which the removed node is logically associated. Also, update pg_overexplain to display these details. Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com	2026-02-10 16:46:05 -05:00
Robert Haas	adbad833f3	Store information about range-table flattening in the final plan. Suppose that we're currently planning a query and, when that same query was previously planned and executed, we learned something about how a certain table within that query should be planned. We want to take note when that same table is being planned during the current planning cycle, but this is difficult to do, because the RTI of the table from the previous plan won't necessarily be equal to the RTI that we see during the current planning cycle. This is because each subquery has a separate range table during planning, but these are flattened into one range table when constructing the final plan, changing RTIs. Commit `8c49a484e8` allows us to match up subqueries seen in the previous planning cycles with the subqueries currently being planned just by comparing textual names, but that's not quite enough to let us deduce anything about individual tables, because we don't know where each subquery's range table appears in the final, flattened range table. To fix that, store a list of SubPlanRTInfo objects in the final planned statement, each including the name of the subplan, the offset at which it begins in the flattened range table, and whether or not it was a dummy subplan -- if it was, some RTIs may have been dropped from the final range table, but also there's no need to control how a dummy subquery gets planned. The toplevel subquery has no name and always begins at rtoffset 0, so we make no entry for it. This commit teaches pg_overexplain's RANGE_TABLE option to make use of this new data to display the subquery name for each range table entry. Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com	2026-02-10 15:33:39 -05:00
Robert Haas	0f4c8d33d4	Pass cursorOptions to planner_setup_hook. Commit `94f3ad3961` failed to do this because I couldn't think of a use for the information, but this has proven to be short-sighted. Best to fix it before this code is officially released. Now, the only argument to standard_planenr that isn't passed to planner_setup_hook is boundParams, but that is accessible via glob->boundParams, and so doesn't need to be passed separately. Discussion: https://www.postgresql.org/message-id/CA+TgmoYS4ZCVAF2jTce=bMP0Oq_db_srocR4cZyO0OBp9oUoGg@mail.gmail.com	2026-02-10 11:50:28 -05:00
Robert Haas	cbdf93d471	Fix PGS_CONSIDER_NONPARTIAL interaction with Materialize nodes. Commit `4020b370f2` had the idea that it would be a good idea to handle testing PGS_CONSIDER_NONPARTIAL within cost_material to save callers the trouble, but that turns out not to be a very good idea. One concern is that it makes cost_material() dependent on the caller having initialized certain fields in the MaterialPath, which is a bit awkward for materialize_finished_plan, which wants to use a dummy path. Another problem is that it can result in generated materialized nested loops where the Materialize node is disabled, contrary to the intention of joinpath.c's logic in match_unsorted_outer() and consider_parallel_nestloop(), which aims to consider such paths only when they would not need to be disabled. In the previous coding, it was possible for the pgs_mask on the joinrel to have PGS_CONSIDER_NONPARTIAL set, while the inner rel had the same bit clear. In that case, we'd generate and then disable a Materialize path. That seems wrong, so instead, pull up the logic to test the PGS_CONSIDER_NONPARTIAL bit into joinpath.c, restoring the historical behavior that either we don't generate a given materialized nested loop in the first place, or we don't disable it. Discussion: http://postgr.es/m/CA+TgmoawzvCoZAwFS85tE5+c8vBkqgcS8ZstQ_ohjXQ9wGT9sw@mail.gmail.com Discussion: http://postgr.es/m/CA+TgmoYS4ZCVAF2jTce=bMP0Oq_db_srocR4cZyO0OBp9oUoGg@mail.gmail.com	2026-02-10 11:49:07 -05:00
Heikki Linnakangas	be5257725d	Refactor ProcessRecoveryConflictInterrupt for readability Two changes here: 1. Introduce a separate RECOVERY_CONFLICT_BUFFERPIN_DEADLOCK flag to indicate a suspected deadlock that involves a buffer pin. Previously the startup process used the same flag for a deadlock involving just regular locks, and to check for deadlocks involving the buffer pin. The cases are handled separately in the startup process, but the receiving backend had to deduce which one it was based on HoldingBufferPinThatDelaysRecovery(). With a separate flag, the receiver doesn't need to guess. 2. Rewrite the ProcessRecoveryConflictInterrupt() function to not rely on fallthrough through the switch-statement. That was difficult to read. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi	2026-02-10 16:23:10 +02:00
Heikki Linnakangas	17f51ea818	Separate RecoveryConflictReasons from procsignals Share the same PROCSIG_RECOVERY_CONFLICT flag for all recovery conflict reasons. To distinguish, have a bitmask in PGPROC to indicate the reason(s). Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi	2026-02-10 16:23:08 +02:00
Heikki Linnakangas	ddc3250208	Use ProcNumber rather than pid in ReplicationSlot This helps the next commit. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi	2026-02-10 16:23:05 +02:00
Michael Paquier	f33c585774	Simplify some log messages in extended_stats_funcs.c The log messages used in this file applied too much quoting logic: - No need for quote_identifier(), which is fine to not use in the context of a log entry. - The usual project style is to group the namespace and object together in a quoted string, when mentioned in an log message. This code quoted the namespace name and the extended statistics object name separately, which was confusing. Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20260210.143752.1113524465620875233.horikyota.ntt@gmail.com	2026-02-10 16:59:19 +09:00
Michael Paquier	307447e6db	Add information about range type stats to pg_stats_ext_exprs This commit adds three attributes to the system view pg_stats_ext_exprs, whose data can exist when involving a range type in an expression: range_length_histogram range_empty_frac range_bounds_histogram These statistics fields exist since `918eee0c49`, and have become viewable in pg_stats later in `bc3c8db8ae`. This puts the definition of pg_stats_ext_exprs on par with pg_stats. This issue has showed up during the discussion about the restore of extended statistics for expressions, so as it becomes possible to query the stats data to restore from the catalogs. Having access to this data is useful on its own, without the restore part. Some documentation and some tests are added, written by me. Corey has authored the part in system_views.sql. Bump catalog version. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aYmCUx9VvrKiZQLL@paquier.xyz	2026-02-10 12:36:57 +09:00
Richard Guo	f41ab51573	Teach planner to transform "x IS [NOT] DISTINCT FROM NULL" to a NullTest In the spirit of `8d19d0e13`, this patch teaches the planner about the principle that NullTest with !argisrow is fully equivalent to SQL's IS [NOT] DISTINCT FROM NULL. The parser already performs this transformation for literal NULLs. However, a DistinctExpr expression with one input evaluating to NULL during planning (e.g., via const-folding of "1 + NULL" or parameter substitution in custom plans) currently remains as a DistinctExpr node. This patch closes the gap for const-folded NULLs. It specifically targets the case where one input is a constant NULL and the other is a nullable non-constant expression. (If the other input were otherwise, the DistinctExpr node would have already been simplified to a constant TRUE or FALSE.) This transformation can be beneficial because NullTest is much more amenable to optimization than DistinctExpr, since the planner knows a good deal about the former and next to nothing about the latter. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com	2026-02-10 10:19:25 +09:00
Richard Guo	0aaf0de7fe	Optimize BooleanTest with non-nullable input The BooleanTest construct (IS [NOT] TRUE/FALSE/UNKNOWN) treats a NULL input as the logical value "unknown". However, when the input is proven to be non-nullable, this special handling becomes redundant. In such cases, the construct can be simplified directly to a boolean expression or a constant. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com	2026-02-10 10:18:47 +09:00
Richard Guo	0a37961254	Optimize IS DISTINCT FROM with non-nullable inputs The IS DISTINCT FROM construct compares values acting as though NULL were a normal data value, rather than "unknown". Semantically, "x IS DISTINCT FROM y" yields true if the values differ or if exactly one is NULL, and false if they are equal or both NULL. Unlike ordinary comparison operators, it never returns NULL. Previously, the planner only simplified this construct if all inputs were constants, folding it to a constant boolean result. This patch extends the optimization to cases where inputs are non-constant but proven to be non-nullable. Specifically, "x IS DISTINCT FROM NULL" folds to constant TRUE if "x" is known to be non-nullable. For cases where both inputs are guaranteed not to be NULL, the expression becomes semantically equivalent to "x <> y", and the DistinctExpr is converted into an inequality OpExpr. This transformation provides several benefits. It converts the comparison into a standard operator, allowing the use of partial indexes and constraint exclusion. Furthermore, if the clause is negated (i.e., "IS NOT DISTINCT FROM"), it simplifies to an equality operator. This enables the planner to generate better plans using index scans, merge joins, hash joins, and EC-based qual deduction. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com	2026-02-10 10:17:45 +09:00
Nathan Bossart	158408fef8	pg_upgrade: Fix handling of pg_largeobject_metadata. For binary upgrades from v16 or newer, pg_upgrade transfers the files for pg_largeobject_metadata from the old cluster, as opposed to using COPY or ordinary SQL commands to reconstruct its contents. While this approach adds complexity, it can greatly reduce pg_upgrade's runtime when there are many large objects. Large objects with comments or security labels are one source of complexity for this approach. During pg_upgrade, schema restoration happens before files are transferred. Comments and security labels are transferred in the former step, but the COMMENT and SECURITY LABEL commands will fail if their corresponding large objects do not exist. To deal with this, pg_upgrade first copies only the rows of pg_largeobject_metadata that are needed to avoid failures. Later, pg_upgrade overwrites those rows by replacing pg_largeobject_metadata's files with its files in the old cluster. Unfortunately, there's a subtle problem here. Simply put, there's no guarantee that pg_upgrade will overwrite all of pg_largeobject_metadata's files on the new cluster. For example, the new cluster's version might more aggressively extend relations or create visibility maps, and pg_upgrade's file transfer code is not sophisticated enough to remove files that lack counterparts in the old cluster. These extra files could cause problems post-upgrade. More fortunately, we can simultaneously fix the aforementioned problem and further optimize binary upgrades for clusters with many large objects. If we teach the COMMENT and SECURITY LABEL commands to allow nonexistent large objects during binary upgrades, pg_upgrade no longer needs to transfer pg_largeobject_metadata's contents beforehand. This approach also allows us to remove the associated dependency tracking from pg_dump, even for upgrades from v12-v15 that use COPY to transfer pg_largeobject_metadata's contents. In addition to what is described in the previous paragraph, this commit modifies the query in getLOs() to only retrieve LOs with comments or security labels for upgrades from v12 or newer. We have long assumed that such usage is rare, so this should reduce pg_upgrade's memory usage and runtime in many cases. We might also be able to remove the "upgrades from v12 or newer" restriction on the recent batch of optimizations by adding special handling for pg_largeobject_metadata's hidden OID column on older versions (since this catalog previously used the now-removed WITH OIDS feature), but that is left as a future exercise. Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/3yd2ss6n7xywo6pmhd7jjh3bqwgvx35bflzgv3ag4cnzfkik7m%40hiyadppqxx6w	2026-02-09 14:58:02 -06:00
Heikki Linnakangas	73d60ac385	cleanup: Deadlock checker is no longer called from signal handler Clean up a few leftovers from when the deadlock checker was called from signal handler. We stopped doing that in commit `6753333f55`, in year 2015. - CheckDeadLock can return a return value directly to the caller, there's no need to use a global variable for that. - Remove outdated comments that claimed that CheckDeadLock "signals ProcSleep". - It should be OK to ereport() from DeadLockCheck now. I considered getting rid of InitDeadLockChecking() and moving the workspace allocations into DeadLockCheck, but it's still good to avoid doing the allocations while we're holding all the partition locks. So just update the comment to give that as the reason we do the allocations up front.	2026-02-09 20:26:23 +02:00
Heikki Linnakangas	18f0afb2a6	Fix incorrect iteration type in extension_file_exists() Commit `f3c9e341cd` changed the type of objects in the List that get_extension_control_directories() returns, from "char " to "ExtensionLocation ", but missed adjusting this one caller. Author: Chao Li <lic@highgo.com> Discussion: https://www.postgresql.org/message-id/362EA9B3-589B-475A-A16E-F10C30426E28@gmail.com	2026-02-09 19:15:44 +02:00
Tom Lane	8ebdf41c26	Harden _int_matchsel() against being attached to the wrong operator. While the preceding commit prevented such attachments from occurring in future, this one aims to prevent further abuse of any already- created operator that exposes _int_matchsel to the wrong data types. (No other contrib module has a vulnerable selectivity estimator.) We need only check that the Const we've found in the query is indeed of the type we expect (query_int), but there's a difficulty: as an extension type, query_int doesn't have a fixed OID that we could hard-code into the estimator. Therefore, the bulk of this patch consists of infrastructure to let an extension function securely look up the OID of a datatype belonging to the same extension. (Extension authors have requested such functionality before, so we anticipate that this code will have additional non-security uses, and may soon be extended to allow looking up other kinds of SQL objects.) This is done by first finding the extension that owns the calling function (there can be only one), and then thumbing through the objects owned by that extension to find a type that has the desired name. This is relatively expensive, especially for large extensions, so a simple cache is put in front of these lookups. Reported-by: Daniel Firer as part of zeroday.cloud Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Noah Misch <noah@leadboat.com> Security: CVE-2026-2004 Backpatch-through: 14	2026-02-09 10:14:22 -05:00
Tom Lane	841d42cc4e	Require superuser to install a non-built-in selectivity estimator. Selectivity estimators come in two flavors: those that make specific assumptions about the data types they are working with, and those that don't. Most of the built-in estimators are of the latter kind and are meant to be safely attachable to any operator. If the operator does not behave as the estimator expects, you might get a poor estimate, but it won't crash. However, estimators that do make datatype assumptions can malfunction if they are attached to the wrong operator, since then the data they get from pg_statistic may not be of the type they expect. This can rise to the level of a security problem, even permitting arbitrary code execution by a user who has the ability to create SQL objects. To close this hole, establish a rule that built-in estimators are required to protect themselves against being called on the wrong type of data. It does not seem practical however to expect estimators in extensions to reach a similar level of security, at least not in the near term. Therefore, also establish a rule that superuser privilege is required to attach a non-built-in estimator to an operator. We expect that this restriction will have little negative impact on extensions, since estimators generally have to be written in C and thus superuser privilege is required to create them in the first place. This commit changes the privilege checks in CREATE/ALTER OPERATOR to enforce the rule about superuser privilege, and fixes a couple of built-in estimators that were making datatype assumptions without sufficiently checking that they're valid. Reported-by: Daniel Firer as part of zeroday.cloud Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Noah Misch <noah@leadboat.com> Security: CVE-2026-2004 Backpatch-through: 14	2026-02-09 10:07:31 -05:00
Tom Lane	60e7ae41a6	Guard against unexpected dimensions of oidvector/int2vector. These data types are represented like full-fledged arrays, but functions that deal specifically with these types assume that the array is 1-dimensional and contains no nulls. However, there are cast pathways that allow general oid[] or int2[] arrays to be cast to these types, allowing these expectations to be violated. This can be exploited to cause server memory disclosure or SIGSEGV. Fix by installing explicit checks in functions that accept these types. Reported-by: Altan Birler <altan.birler@tum.de> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Noah Misch <noah@leadboat.com> Security: CVE-2026-2003 Backpatch-through: 14	2026-02-09 09:57:43 -05:00
Álvaro Herrera	38e0190ced	Allow log_min_messages to be set per process type Change log_min_messages from being a single element to a comma-separated list of type:level elements, with 'type' representing a process type, and 'level' being a log level to use for that type of process. The list must also have a freestanding level specification which is used for process types not listed, which convenientely makes the whole thing backwards-compatible. Some choices made here could be contested; for instance, we use the process type `backend` to affect regular backends as well as dead-end backends and the standalone backend, and `autovacuum` means both the launcher and the workers. I think it's largely sensible though, and it can easily be tweaked if desired. Author: Euler Taveira <euler@eulerto.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Tan Yang <332696245@qq.com> Discussion: https://postgr.es/m/e85c6671-1600-4112-8887-f97a8a5d07b2@app.fastmail.com	2026-02-09 13:23:10 +01:00
Thomas Munro	c67bef3f32	Code coverage for most pg_mblen* calls. A security patch changed them today, so close the coverage gap now. Test that buffer overrun is avoided when pg_mblen*() requires more than the number of bytes remaining. This does not cover the calls in dict_thesaurus.c or in dict_synonym.c. That code is straightforward. To change that code's input, one must have access to modify installed OS files, so low-privilege users are not a threat. Testing this would likewise require changing installed share/postgresql/tsearch_data, which was enough of an obstacle to not bother. Security: CVE-2026-2006 Backpatch-through: 14 Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Noah Misch <noah@leadboat.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>	2026-02-09 12:44:12 +13:00
Thomas Munro	1e7fe06c10	Replace pg_mblen() with bounds-checked versions. A corrupted string could cause code that iterates with pg_mblen() to overrun its buffer. Fix, by converting all callers to one of the following: 1. Callers with a null-terminated string now use pg_mblen_cstr(), which raises an "illegal byte sequence" error if it finds a terminator in the middle of the sequence. 2. Callers with a length or end pointer now use either pg_mblen_with_len() or pg_mblen_range(), for the same effect, depending on which of the two seems more convenient at each site. 3. A small number of cases pre-validate a string, and can use pg_mblen_unbounded(). The traditional pg_mblen() function and COPYCHAR macro still exist for backward compatibility, but are no longer used by core code and are hereby deprecated. The same applies to the t_isXXX() functions. Security: CVE-2026-2006 Backpatch-through: 14 Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Noah Misch <noah@leadboat.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reported-by: Paul Gerste (as part of zeroday.cloud) Reported-by: Moritz Sanft (as part of zeroday.cloud)	2026-02-09 12:44:04 +13:00
Tom Lane	73dd7163c5	Replace some hard-wired OID constants with corresponding macros. Looking again at commit `7cdb633c8`, I wondered why we have hard-wired "1034" for the OID of type aclitem[]. Some other entries in the same array have numeric type OIDs as well. This seems to be a hangover from years ago when not every built-in pg_type entry had an OID macro. But since we made genbki.pl responsible for generating these macros, there are macros available for all these array types, so there's no reason not to follow the project policy of never writing numeric OID constants in C code.	2026-02-07 23:15:20 -05:00
Tom Lane	7cdb633c89	Make some minor cleanups in typalign-related code. Commit `7b378237a` widened AclMode to 64 bits, which implies that the alignment of AclItem is now determined by an int64 field. That commit correctly set the typalign for SQL type aclitem to 'd', but it missed the hard-wired knowledge about _aclitem in bootstrap.c. This doesn't seem to have caused any ill effects, probably because we never try to fill a non-null value into an aclitem[] column during bootstrap. Nonetheless, it's clearly a gotcha waiting to happen, so fix it up. In passing, also fix a couple of typanalyze functions that were using hard-coded typalign constants when they could just as easily use greppable TYPALIGN_xxx macros. Noticed these while working on a patch to expand the set of typalign values. I doubt we are going to pursue that path, but these fixes still seem worth a quick commit. Discussion: https://postgr.es/m/1127261.1769649624@sss.pgh.pa.us	2026-02-06 20:46:03 -05:00
Nathan Bossart	ba1e14134a	Adjust style of some debugging macros. This commit adjusts a few debugging macros to match the style of those in pg_config_manual.h. Like commits `123661427b` and `b4cbc106a6`, these were discovered while reviewing Aleksander Alekseev's proposed changes to pgindent. Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aP-H6kSsGOxaB21k%40nathan	2026-02-06 16:24:21 -06:00
Michael Paquier	072c842135	Fix use of proc number in pgstat_create_backend() This routine's internals directly used MyProcNumber to choose which object ID to assign for the hash key of a backend's stats entry, while the value to use is given as input argument of the function. The original intention was to pass MyProcNumber as an argument of pgstat_create_backend() when called in pgstat_bestart_final(), pgstat_beinit() ensuring that MyProcNumber has been set, not use it directly in the function. This commit addresses this inconsistency by using the procnum given by the caller of pgstat_create_backend(), not MyProcNumber. This issue is not a cause of bugs currently. However, let's keep the code in sync across all the branches where this code exists, as it could matter in a future backpatch. Oversight in `4feba03d8b`. Reported-by: Ryo Matsumura <matsumura.ryo@fujitsu.com> Discussion: https://postgr.es/m/TYCPR01MB11316AD8150C8F470319ACCAEE866A@TYCPR01MB11316.jpnprd01.prod.outlook.com Backpatch-through: 18	2026-02-06 19:57:22 +09:00
Thomas Munro	f94e9141a0	Add file_extend_method=posix_fallocate,write_zeros. Provide a way to disable the use of posix_fallocate() for relation files. It was introduced by commit `4d330a61bb`. The new setting file_extend_method=write_zeros can be used as a workaround for problems reported from the field: * BTRFS compression is disabled by the use of posix_fallocate() * XFS could produce spurious ENOSPC errors in some Linux kernel versions, though that problem is reported to have been fixed The default is file_extend_method=posix_fallocate if available, as before. The write_zeros option is similar to PostgreSQL < 16, except that now it's multi-block. Backpatch-through: 16 Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reported-by: Dimitrios Apostolou <jimis@gmx.net> Discussion: https://postgr.es/m/b1843124-fd22-e279-a31f-252dffb6fbf2%40gmx.net	2026-02-06 17:38:49 +13:00
Michael Paquier	9476ef206c	Fix comment in extended_stats_funcs.c The attribute storing the statistics data for a set of expressions in pg_statistic_ext_data is stxdexpr. stxdexprs does not exist. Extracted from a larger patch by the same author. Incorrect as of `efbebb4e85`. Author: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/CADkLM=fPcci6oPyuyEZ0F4bWqAA7HzaWO+ZPptufuX5_uWt6kw@mail.gmail.com	2026-02-05 15:14:53 +09:00
Masahiko Sawada	7a1f0f8747	pg_upgrade: Optimize logical replication slot caught-up check. Commit `29d0a77fa6` improved pg_upgrade to allow migrating logical slots provided that all logical slots have caught up (i.e., they have no pending decodable WAL records). Previously, this verification was done by checking each slot individually, which could be time-consuming if there were many logical slots to migrate. This commit optimizes the check to avoid reading the same WAL stream multiple times. It performs the check only for the slot with the minimum confirmed_flush_lsn and applies the result to all other slots in the same database. This limits the check to at most one logical slot per database. During the check, we identify the last decodable WAL record's LSN to report any slots with unconsumed records, consistent with the existing error reporting behavior. Additionally, the maximum confirmed_flush_lsn among all logical slots on the database is used as an early scan cutoff; finding a decodable WAL record beyond this point implies that no slot has caught up. Performance testing demonstrated that the execution time remains stable regardless of the number of slots in the database. Note that we do not distinguish slots based on their output plugins. A hypothetical plugin might use a replication origin filter that filters out changes from a specific origin. In such cases, we might get a false positive (erroneously considering a slot caught up). However, this is safe from a data integrity standpoint, such scenarios are rare, and the impact of a false positive is minimal. This optimization is applied only when the old cluster is version 19 or later. Bump catalog version. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAD21AoBZ0LAcw1OHGEKdW7S5TRJaURdhEk3CLAW69_siqfqyAg@mail.gmail.com	2026-02-04 17:11:27 -08:00
Álvaro Herrera	0c8e082fba	Assign "backend" type earlier during process start-up Instead of assigning the backend type in the Main function of each postmaster child, do it right after fork(), by which time it is already known by postmaster_child_launch(). This reduces the time frame during which MyBackendType is incorrect. Before this commit, ProcessStartupPacket would overwrite MyBackendType to B_BACKEND for dead-end backends, which is quite dubious. Stop that. We may now see MyBackendType == B_BG_WORKER before setting up MyBgworkerEntry. As far as I can see this is only a problem if we try to log a message and %b is in log_line_prefix, so we now have a constant string to cover that case. Previously, it would print "unrecognized", which seems strictly worse. Author: Euler Taveira <euler@eulerto.com> Discussion: https://postgr.es/m/e85c6671-1600-4112-8887-f97a8a5d07b2@app.fastmail.com	2026-02-04 16:56:57 +01:00
John Naylor	176dffdf7d	Fix various instances of undefined behavior Mostly this involves checking for NULL pointer before doing operations that add a non-zero offset. The exception is an overflow warning in heap_fetch_toast_slice(). This was caused by unneeded parentheses forcing an expression to be evaluated to a negative integer, which then got cast to size_t. Per clang 21 undefined behavior sanitizer. Backpatch to all supported versions. Co-authored-by: Alexander Lakhin <exclusion@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/777bd201-6e3a-4da0-a922-4ea9de46a3ee@gmail.com Backpatch-through: 14	2026-02-04 18:09:35 +07:00
Heikki Linnakangas	084e42bc71	Add backendType to PGPROC, replacing isRegularBackend We can immediately make use of it in pg_signal_backend(), which previously fetched the process type from the backend status array with pgstat_get_backend_type_by_proc_number(). That was correct but felt a little questionable to me: backend status should be for observability purposes only, not for permission checks. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/b77e4962-a64a-43db-81a1-580444b3e8f5@iki.fi	2026-02-04 13:06:04 +02:00
Heikki Linnakangas	57bff90160	Don't hint that you can reconnect when the database is dropped Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi	2026-02-03 15:08:16 +02:00
Heikki Linnakangas	cd375d5b6d	Remove useless errdetail_abort() I don't understand how to reach errdetail_abort() with MyProc->recoveryConflictPending set. If a recovery conflict signal is received, ProcessRecoveryConflictInterrupt() raises an ERROR or FATAL error to cancel the query or connection, and abort processing clears the flag. The error message from ProcessRecoveryConflictInterrupt() is very clear that the query or connection was terminated because of recovery conflict. The only way to reach it AFAICS is with a race condition, if the startup process sends a recovery conflict signal when the transaction has just entered aborted state for some other reason. And in that case the detail would be misleading, as the transaction was already aborted for some other reason, not because of the recovery conflict. errdetail_abort() was the only user of the recoveryConflictPending flag in PGPROC, so we can remove that and all the related code too. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi	2026-02-03 15:08:13 +02:00
Álvaro Herrera	96e2af6050	Reject ADD CONSTRAINT NOT NULL if name mismatches existing constraint When using ALTER TABLE ... ADD CONSTRAINT to add a not-null constraint with an explicit name, we have to ensure that if the column is already marked NOT NULL, the provided name matches the existing constraint name. Failing to do so could lead to confusion regarding which constraint object actually enforces the rule. This patch adds a check to throw an error if the user tries to add a named not-null constraint to a column that already has one with a different name. Reported-by: yanliang lei <msdnchina@163.com> Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Co-authored-bu: Srinath Reddy Sadipiralla <srinath2133@gmail.com> Backpatch-through: 18 Discussion: https://postgr.es/m/19351-8f1c523ead498545%40postgresql.org	2026-02-03 12:33:29 +01:00
Peter Eisentraut	137d05df2f	Rename AssertVariableIsOfType to StaticAssertVariableIsOfType This keeps run-time assertions and static assertions clearly separate. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/2273bc2a-045d-4a75-8584-7cd9396e5534%40eisentraut.org	2026-02-03 08:45:24 +01:00
Michael Paquier	e05a24c2d4	Add two IO wait events for COPY FROM/TO on a pipe/file/program Two wait events are added to the COPY FROM/TO code: * COPY_FROM_READ: reading data from a copy_file. * COPY_TO_WRITE: writing data to a copy_file. In the COPY code, copy_file can be set when processing a command through the pipe mode (for the non-DestRemote case), the program mode or the file mode, when processing fread() or fwrite() on it. Author: Nikolay Samokhvalov <nik@postgres.ai> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAM527d_iDzz0Kqyi7HOfqa-Xzuq29jkR6AGXqfXLqA5PR5qsng@mail.gmail.com	2026-02-03 12:20:41 +09:00
Michael Paquier	213fec296f	Fix incorrect errno in OpenWalSummaryFile() This routine has an option to bypass an error if a WAL summary file is opened for read but is missing (missing_ok=true). However, the code incorrectly checked for EEXIST, that matters when using O_CREAT and O_EXCL, rather than ENOENT, for this case. There are currently only two callers of OpenWalSummaryFile() in the tree, and both use missing_ok=false, meaning that the check based on the errno is currently dead code. This issue could matter for out-of-core code or future backpatches that would like to use missing_ok set to true. Issue spotted while monitoring this area of the code, after `a9afa021e9`. Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aYAf8qDHbpBZ3Rml@paquier.xyz Backpatch-through: 17	2026-02-03 11:25:10 +09:00
Fujii Masao	21c1125d66	Release synchronous replication waiters immediately on configuration changes. Previously, when synchronous_standby_names was changed (for example, by reducing the number of required synchronous standbys or modifying the standby list), backends waiting for synchronous replication were not released immediately, even if the new configuration no longer required them to wait. They could remain blocked until additional messages arrived from standbys and triggered their release. This commit improves walsender so that backends waiting for synchronous replication are released as soon as the updated configuration takes effect and the new settings no longer require them to wait, by calling SyncRepReleaseWaiters() when configuration changes are processed. As part of this change, the duplicated code that handles configuration changes in walsender has been refactored into a new helper function, which is now used at the three existing call places. Since this is an improvement rather than a bug fix, it is applied only to the master branch. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/CAOzEurSRii0tEYhu5cePmRcvS=ZrxTLEvxm3Kj0d7_uKGdM23g@mail.gmail.com	2026-02-03 11:14:00 +09:00
Tom Lane	da7a1dc0d6	Refactor att_align_nominal() to improve performance. Separate att_align_nominal() into two macros, similarly to what was already done with att_align_datum() and att_align_pointer(). The inner macro att_nominal_alignby() is really just TYPEALIGN(), while att_align_nominal() retains its previous API by mapping TYPALIGN_xxx values to numbers of bytes to align to and then calling att_nominal_alignby(). In support of this, split out tupdesc.c's logic to do that mapping into a publicly visible function typalign_to_alignby(). Having done that, we can replace performance-critical uses of att_align_nominal() with att_nominal_alignby(), where the typalign_to_alignby() mapping is done just once outside the loop. In most places I settled for doing typalign_to_alignby() once per function. We could in many places pass the alignby value in from the caller if we wanted to change function APIs for this purpose; but I'm a bit loath to do that, especially for exported APIs that extensions might call. Replacing a char typalign argument by a uint8 typalignby argument would be an API change that compilers would fail to warn about, thus silently breaking code in hard-to-debug ways. I did revise the APIs of array_iter_setup and array_iter_next, moving the element type attribute arguments to the former; if any external code uses those, the argument-count change will cause visible compile failures. Performance testing shows that ExecEvalScalarArrayOp is sped up by about 10% by this change, when using a simple per-element function such as int8eq. I did not check any of the other loops optimized here, but it's reasonable to expect similar gains. Although the motivation for creating this patch was to avoid a performance loss if we add some more typalign values, it evidently is worth doing whether that patch lands or not. Discussion: https://postgr.es/m/1127261.1769649624@sss.pgh.pa.us	2026-02-02 14:39:50 -05:00
Michael Paquier	a9afa021e9	Fix error message in RemoveWalSummaryIfOlderThan() A failing unlink() was reporting an incorrect error message, referring to stat(). Author: Man Zeng <zengman@halodbtech.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/tencent_3BBE865C5F49D452360FF190@qq.com Backpath-through: 17	2026-02-02 10:21:04 +09:00
Michael Paquier	d46aa32ea5	Fix build inconsistency due to the generation of wait-event code The build generates four files based on the wait event contents stored in wait_event_names.txt: - wait_event_types.h - pgstat_wait_event.c - wait_event_funcs_data.c - wait_event_types.sgml The SGML file is generated as part of a documentation build, with its data stored in doc/src/sgml/ for meson and configure. The three others are handled differently for meson and configure: - In configure, all the files are created in src/backend/utils/activity/. A link to wait_event_types.h is created in src/include/utils/. - In meson, all the files are created in src/include/utils/. The two C files, pgstat_wait_event.c and wait_event_funcs_data.c, are then included in respectively wait_event.c and wait_event_funcs.c, without the "utils/" path. For configure, this does not present a problem. For meson, this has to be combined with a trick in src/backend/utils/activity/meson.build, where include_directories needs to point to include/utils/ to make the inclusion of the C files work properly, causing builds to pull in PostgreSQL headers rather than system headers in some build paths, as src/include/utils/ would take priority. In order to fix this issue, this commit reworks the way the C/H files are generated, becoming consistent with guc_tables.inc.c: - For meson, basically nothing changes. The files are still generated in src/include/utils/. The trick with include_directories is removed. - For configure, the files are now generated in src/backend/utils/, with links in src/include/utils/ pointing to the ones in src/backend/. This requires extra rules in src/backend/utils/activity/Makefile so as a make command in this sub-directory is able to work. - The three files now fall under header-stamp, which is actually simpler as guc_tables.inc.c does the same. - wait_event_funcs_data.c and pgstat_wait_event.c are now included with "utils/" in their path. This problem has not been an issue in the buildfarm; it has been noted with AIX and a conflict with float.h. This issue could, however, create conflicts in the buildfarm depending on the environment with unexpected headers pulled in, so this fix is backpatched down to where the generation of the wait-event files has been introduced. While on it, this commit simplifies wait_event_names.txt regarding the paths of the files generated, to mention just the names of the files generated. The paths where the files are generated became incorrect. The path of the SGML path was wrong. This change has been tested in the CI, down to v17. Locally, I have run tests with configure (with and without VPATH), as well as meson, on the three branches. Combo oversight in `fa88928470` and `1e68e43d3f`. Reported-by: Aditya Kamath <aditya.kamath1@ibm.com> Discussion: https://postgr.es/m/LV8PR15MB64888765A43D229EA5D1CFE6D691A@LV8PR15MB6488.namprd15.prod.outlook.com Backpatch-through: 17	2026-02-02 08:02:39 +09:00
Heikki Linnakangas	e2362eb2bd	Move shmem allocator's fields from PGShmemHeader to its own struct For readability. It was a slight modularity violation to have fields in PGShmemHeader that were only used by the allocator code in shmem.c. And it was inconsistent that ShmemLock was nevertheless not stored there. Moving all the allocator-related fields to a separate struct makes it more consistent and modular, and removes the need to allocate and pass ShmemLock separately via BackendParameters. Merge InitShmemAccess() and InitShmemAllocation() into a single function that initializes the struct when called from postmaster, and when called from backends in EXEC_BACKEND mode, re-establishes the global variables. That's similar to all the *ShmemInit() functions that we have. Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/CAExHW5uNRB9oT4pdo54qAo025MXFX4MfYrD9K15OCqe-ExnNvg@mail.gmail.com	2026-01-30 18:22:56 +02:00
Álvaro Herrera	e76221bd95	Minor cosmetic tweaks These changes should have been done by `2f9661311b`, but were overlooked. I noticed while reviewing the code for commit `b8926a5b4b`. Author: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org	2026-01-30 14:26:02 +01:00
Álvaro Herrera	1eb09ed63a	Use C99 designated designators in a couple of places This makes the arrays somewhat easier to read. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://postgr.es/m/202601281204.sdxbr5qvpunk@alvherre.pgsql	2026-01-30 10:11:04 +01:00
Fujii Masao	bb26a81ee2	Remove unused argument from ApplyLogicalMappingFile(). Author: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Discussion: https://postgr.es/m/20260128120056.b2a3e8184712ab5a537879eb@sraoss.co.jp	2026-01-30 09:05:35 +09:00
Andres Freund	87f7b824f2	tableam: Perform CheckXidAlive check once per scan Previously, the CheckXidAlive check was performed within the table_scannext functions. This caused the check to be executed for every fetched tuple, an unnecessary overhead. To fix, move the check to table_beginscan* so it is performed once per scan rather than once per row. Note: table_tuple_fetch_row_version() does not use a scan descriptor; therefore, the CheckXidAlive check is retained in that function. The overhead is unlikely to be relevant for the existing callers. Reported-by: Andres Freund <andres@anarazel.de> Author: Dilip Kumar <dilipbalaut@gmail.com> Suggested-by: Andres Freund <andres@anarazel.de> Suggested-by: Amit Kapila <akapila@postgresql.org> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/tlpltqm5jjwj7mp66dtebwwhppe4ri36vdypux2zoczrc2i3mp%40dhv4v4nikyfg	2026-01-29 17:52:07 -05:00
Andres Freund	333f586372	bufmgr: Allow conditionally locking of already locked buffer In `fcb9c977aa` I included an assertion in BufferLockConditional() to detect if a conditional lock acquisition is done on a buffer that we already have locked. The assertion was added in the course of adding other assertions. Unfortunately I failed to realize that some of our code relies on such lock acquisitions to silently fail. E.g. spgist and nbtree may try to conditionally lock an already locked buffer when acquiring a empty buffer. LWLockAcquireConditional(), which was previously used to implement ConditionalLockBuffer(), does not have such an assert. Instead of just removing the assert, and relying on the lock acquisition to fail due to the buffer already locked, this commit changes the behaviour of conditional content lock acquisition to fail if the current backend has any pre-existing lock on the buffer, even if the lock modes would not conflict. The reason for that is that we currently do not have space to track multiple lock acquisitions on a single buffer. Allowing multiple locks on the same buffer by a backend also seems likely to lead to bugs. There is only one non-self-exclusive conditional content lock acquisition, in GetVictimBuffer(), but it only is used if the target buffer is not pinned and thus can't already be locked by the current backend. Reported-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/90bd2cbb-49ce-4092-9f61-5ac2ab782c94@gmail.com	2026-01-29 16:49:01 -05:00
Jeff Davis	de90bb7db1	Fix theoretical memory leaks in pg_locale_libc.c. The leaks were hard to reach in practice and the impact was low. The callers provide a buffer the same number of bytes as the source string (plus one for NUL terminator) as a starting size, and libc never increases the number of characters. But, if the byte length of one of the converted characters is larger, then it might need a larger destination buffer. Previously, in that case, the working buffers would be leaked. Even in that case, the call typically happens within a context that will soon be reset. Regardless, it's worth fixing to avoid such assumptions, and the fix is simple so it's worth backporting. Discussion: https://postgr.es/m/e2b7a0a88aaadded7e2d19f42d5ab03c9e182ad8.camel@j-davis.com Backpatch-through: 18	2026-01-29 10:14:55 -08:00
Álvaro Herrera	ec31744071	Replace literal 0 with InvalidXLogRecPtr for XLogRecPtr assignments Use the proper constant InvalidXLogRecPtr instead of literal 0 when assigning XLogRecPtr variables and struct fields. This improves code clarity by making it explicit that these are invalid LSN values rather than ambiguous zero literals. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aRtd2dw8FO1nNX7k@ip-10-97-1-34.eu-west-3.compute.internal	2026-01-29 18:37:09 +01:00
Robert Haas	71c1136989	Fix mistakes in commit `4020b370f2` cost_tidrangescan() was setting the disabled_nodes value correctly, and then immediately resetting it to zero, due to poor code editing on my part. materialized_finished_plan correctly set matpath.parent to zero, but forgot to also set matpath.parallel_workers = 0, causing an access to uninitialized memory in cost_material. (This shouldn't result in any real problem, but it makes valgrind unhappy.) reparameterize_path was dereferencing a variable before verifying that it was not NULL. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> (issue #1) Reported-by: Michael Paquier <michael@paquier.xyz> (issue #1) Diagnosed-by: Lukas Fittl <lukas@fittl.com> (issue #1) Reported-by: Zsolt Parragi <zsolt.parragi@percona.com> (issue #2) Reported-by: Richard Guo <guofenglinux@gmail.com> (issue #3) Discussion: http://postgr.es/m/CAN4CZFPvwjNJEZ_JT9Y67yR7C=KMNa=LNefOB8ZY7TKDcmAXOA@mail.gmail.com Discussion: http://postgr.es/m/aXrnPgrq6Gggb5TG@paquier.xyz	2026-01-29 08:04:47 -05:00
Alexander Korotkov	20a8f783e1	Wake LSN waiters before recovery target stop Move WaitLSNWakeup() immediately after ApplyWalRecord() so waiters are signaled even when recoveryStopsAfter() breaks out for pause/promotion targets. Discussion: https://postgr.es/m/9533608f-f289-44bd-b881-9e5a73203c5b%40iki.fi Discussion: https://postgr.es/m/CABPTF7Wdq6KbvC3EhLX3Pz%3DODCCPEX7qVQ%2BE%3DcokkB91an2E-A%40mail.gmail.com Reported-by: Heikki Linnakangas <hlinnaka@iki.fi> Author: Xuneng Zhou <xunengzhou@gmail.com>	2026-01-29 09:47:09 +02:00
Michael Paquier	740a1494f4	Fix two error messages in extended_stats_funcs.c These have been fat-fingered in `0e80f3f88d` and `302879bd68`. The error message for ndistinct had an incorrect grammar, while the one for dependencies had finished with a period (incorrect based on the project guidelines). Discussion: https://postgr.es/m/aXrsjZQbVuB6236u@paquier.xyz	2026-01-29 14:57:47 +09:00
Michael Paquier	efbebb4e85	Add support for "mcv" in pg_restore_extended_stats() This commit adds support for the restore of extended statistics of the kind "mcv", aka most-common values. This format is different from n_distinct and dependencies stat types in that it is the combination of three of the four different arrays from the pg_stats_ext view which in turn require three different input parameters on pg_restore_extended_statistics(). These are translated into three input arguments for the function: - "most_common_vals", acting as a leader of the others. It is a 2-dimension array, that includes the common values. - "most_common_freqs", 1-dimension array of float8[], with a number of elements that has to match with "most_common_vals". - "most_common_base_freqs", 1-dimension array of float8[], with a number of elements that has to match with "most_common_vals". All three arrays are required to achieve the restore of this type of extended statistics (if "most_common_vals" happens to be NULL in the catalogs, the rest is NULL by design). Note that "most_common_val_nulls" is not required in input, its data is rebuilt from the decomposition of the "most_common_vals" array based on its text[] representation. The initial versions of the patch provided this option in input, but we do not require it and it simplifies a lot the result. Support in pg_dump is added down to v13 which is where the support for this type of extended statistics has been added, when --statistics is used. This means that upgrade and dumps can restore extended statistics data transparently, like "dependencies", "ndistinct", attribute and relation statistics. For MCV, the values are directly queried from the relevant catalogs. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2026-01-29 12:14:08 +09:00
Masahiko Sawada	8f1e2dfe03	Consolidate replication origin session globals into a single struct. This commit moves the separate global variables for replication origin state into a single ReplOriginXactState struct. This groups logically related variables, which improves code readability and simplifies state management (e.g., resetting the state) by handling them as a unit. Author: Chao Li <lic@highgo.com> Suggested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAEoWx2=pYvfRthXHTzSrOsf5_FfyY4zJyK4zV2v4W=yjUij1cA@mail.gmail.com	2026-01-28 12:26:22 -08:00
Masahiko Sawada	227eb4eea2	Refactor replication origin state reset helpers. Factor out common logic for clearing replorigin_session_* variables into a dedicated helper function, replorigin_xact_clear(). This removes duplicated assignments of these variables across multiple call sites, and makes the intended scope of each reset explicit. Author: Chao Li <lic@highgo.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CAEoWx2=pYvfRthXHTzSrOsf5_FfyY4zJyK4zV2v4W=yjUij1cA@mail.gmail.com	2026-01-28 11:45:26 -08:00
Masahiko Sawada	1fdbca159e	Standardize replication origin naming to use "ReplOrigin". The replication origin code was using inconsistent naming conventions. Functions were typically prefixed with 'replorigin', while typedefs and constants used "RepOrigin". This commit unifies the naming convention by renaming RepOriginId to ReplOriginId. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAD21AoBDgm3hDqUZ+nqu=ViHmkCnJBuJyaxG_yvv27BAi2zBmQ@mail.gmail.com	2026-01-28 11:03:29 -08:00
Robert Haas	4020b370f2	Allow for plugin control over path generation strategies. Each RelOptInfo now has a pgs_mask member which is a mask of acceptable strategies. For most rels, this is populated from PlannerGlobal's default_pgs_mask, which is computed from the values of the enable_* GUCs at the start of planning. For baserels, get_relation_info_hook can be used to adjust pgs_mask for each new RelOptInfo, at least for rels of type RTE_RELATION. Adjusting pgs_mask is less useful for other types of rels, but if it proves to be necessary, we can revisit the way this hook works or add a new one. For joinrels, two new hooks are added. joinrel_setup_hook is called each time a joinrel is created, and one thing that can be done from that hook is to manipulate pgs_mask for the new joinrel. join_path_setup_hook is called each time we're about to add paths to a joinrel by considering some particular combination of an outer rel, an inner rel, and a join type. It can modify the pgs_mask propagated into JoinPathExtraData to restrict strategy choice for that particular combination of rels. To make joinrel_setup_hook work as intended, the existing calls to build_joinrel_partition_info are moved later in the calling functions; this is because that function checks whether the rel's pgs_mask includes PGS_CONSIDER_PARTITIONWISE, so we want it to only be called after plugins have had a chance to alter pgs_mask. Upper rels currently inherit pgs_mask from the input relation. It's unclear that this is the most useful behavior, but at the moment there are no hooks to allow the mask to be set in any other way. Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com	2026-01-28 11:28:34 -05:00
Álvaro Herrera	e6d6e32f42	Fix duplicate arbiter detection during REINDEX CONCURRENTLY on partitions Commit `90eae926a` fixed ON CONFLICT handling during REINDEX CONCURRENTLY on partitioned tables by treating unparented indexes as potential arbiters. However, there's a remaining race condition: when pg_inherits records are swapped between consecutive calls to get_partition_ancestors(), two different child indexes can appear to have the same parent, causing duplicate entries in the arbiter list and triggering "invalid arbiter index list" errors. Note that this is not a new problem introduced by `90eae926a`. The same error could occur before that commit in a slightly different scenario: an index is selected during planning, then index_concurrently_swap() commits, and a subsequent call to get_partition_ancestors() uses a new catalog snapshot that sees zero ancestors for that index. Fix by tracking which parent indexes have already been processed. If a subsequent call to get_partition_ancestors() returns a parent we've already seen, treat that index as unparented instead, allowing it to be matched via IsIndexCompatibleAsArbiter() like other concurrent reindex scenarios. Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/e5a8c1df-04e5-4343-85ef-5df2a7e3d90c@gmail.com	2026-01-28 14:38:53 +01:00
Michael Paquier	e3094679b9	Fix pg_restore_extended_stats() with expressions This commit fixes an issue with the restore of ndistinct and dependencies statistics, causing the operation to fail when any of these kinds included expressions. In extended statistics, expressions use strictly negative attribute numbers, decremented from -1. For example, let's imagine an object defined as follows: CREATE STATISTICS stats_obj (dependencies) ON lower(name), upper(name) FROM tab_obj; This object would generate dependencies stats using -1 and -2 as attribute numbers, like that: [{"attributes": [-1], "dependency": -2, "degree": 1.000000}, {"attributes": [-2], "dependency": -1, "degree": 1.000000}] However, pg_restore_extended_stats() forgot to account for the number of expressions defined in an extended statistics object. This would cause the validation step of ndistinct and dependencies data to fail, preventing a restore of their stats even if the input is valid. This issue has come up due to an incorrect split of the patch set. Some tests are included to cover this behavior. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/aXl4bMfSTQUxM_yy@paquier.xyz	2026-01-28 11:48:45 +09:00
Amit Kapila	851f6649cc	Prevent invalidation of newly synced replication slots. A race condition could cause a newly synced replication slot to become invalidated between its initial sync and the checkpoint. When syncing a replication slot to a standby, the slot's initial restart_lsn is taken from the publisher's remote_restart_lsn. Because slot sync happens asynchronously, this value can lag behind the standby's current redo pointer. Without any interlocking between WAL reservation and checkpoints, a checkpoint may remove WAL required by the newly synced slot, causing the slot to be invalidated. To fix this, we acquire ReplicationSlotAllocationLock before reserving WAL for a newly synced slot, similar to commit `006dd4b2e5`. This ensures that if WAL reservation happens first, the checkpoint process must wait for slotsync to update the slot's restart_lsn before it computes the minimum required LSN. However, unlike in ReplicationSlotReserveWal(), this lock alone cannot protect a newly synced slot if a checkpoint has already run CheckPointReplicationSlots() before slotsync updates the slot. In such cases, the remote restart_lsn may be stale and earlier than the current redo pointer. To prevent relying on an outdated LSN, we use the oldest WAL location available if it is greater than the remote restart_lsn. This ensures that newly synced slots always start with a safe, non-stale restart_lsn and are not invalidated by concurrent checkpoints. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Backpatch-through: 17 Discussion: https://postgr.es/m/TY4PR01MB16907E744589B1AB2EE89A31F94D7A%40TY4PR01MB16907.jpnprd01.prod.outlook.com	2026-01-27 05:06:29 +00:00
Fujii Masao	1ea44d7ddf	Remove unnecessary abort() from WalSndShutdown(). WalSndShutdown() previously called abort() after proc_exit(0) to silence compiler warnings. This is no longer needed, because both WalSndShutdown() and proc_exit() are declared pg_noreturn, allowing the compiler to recognize that the function does not return. Also there are already other functions, such as CheckpointerMain(), that call proc_exit() without an abort(), and they do not produce warnings. Therefore this abort() call in WalSndShutdown() is useless and this commit removes it. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/CAHGQGwHPX1yoixq+YB5rF4zL90TMmSEa3FpHURtqW3Jc5+=oSA@mail.gmail.com	2026-01-27 11:55:32 +09:00
Tomas Vondra	3fccbd94cb	Handle ENOENT status when querying NUMA node We've assumed that touching the memory is sufficient for a page to be located on one of the NUMA nodes. But a page may be moved to a swap after we touch it, due to memory pressure. We touch the memory before querying the status, but there is no guarantee it won't be moved to the swap in the meantime. The touching happens only on the first call, so later calls are more likely to be affected. And the batching increases the window too. It's up to the kernel if/when pages get moved to swap. We have to accept ENOENT (-2) as a valid result, and handle it without failing. This patch simply treats it as an unknown node, and returns NULL in the two affected views (pg_shmem_allocations_numa and pg_buffercache_numa). Hugepages cannot be swapped out, so this affects only regular pages. Reported by Christoph Berg, investigation and fix by me. Backpatch to 18, where the two views were introduced. Reported-by: Christoph Berg <myon@debian.org> Discussion: 18 Backpatch-through: https://postgr.es/m/aTq5Gt_n-oS_QSpL@msg.df7cb.de	2026-01-27 00:21:40 +01:00
Michael Paquier	302879bd68	Add support for "dependencies" in pg_restore_extended_stats() This commit adds support for the restore of extended statistics of the kind "dependencies", for the following input data: [{"attributes": [2], "dependency": 3, "degree": 1.000000}, {"attributes": [3], "dependency": 2, "degree": 1.000000}] This relies on the existing routines of "dependencies" to cross-check the input data with the definition of the extended statistics objects for the attribute numbers. An input argument of type "pg_dependencies" is required for this new option. Thanks to the work done in `0e80f3f88d` for the restore function and `e1405aa5e3` for the input handling of data type pg_dependencies, this addition is straight-forward. This will be used so as it is possible to transfer these statistics across dumps and upgrades, removing the need for a post-operation ANALYZE for these kinds of statistics. Author: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2026-01-27 08:20:13 +09:00
Melanie Plageman	19af794b66	Refactor lazy_scan_prune() VM clear logic into helper Encapsulating the cases that clear the visibility map after vacuum phase I, when corruption is detected, into in a helper makes the code cleaner and enables further refactoring in future commits. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/7ib3sa55sapwjlaz4sijbiq7iezna27kjvvvar4dpgkmadml6t%40gfpkkwmdnepx	2026-01-26 17:12:05 -05:00
Melanie Plageman	648a7e28d7	Eliminate use of cached VM value in lazy_scan_prune() lazy_scan_prune() takes a parameter from lazy_scan_heap() indicating whether the page was marked all-visible in the VM at the time it was last checked in find_next_unskippable_block(). This behavior is historical, dating back to commit `608195a3a3`, when we did not pin the VM page until deciding we must read it. Now that the VM page is already pinned, there is no meaningful benefit to relying on a cached VM status. Removing this cached value simplifies the logic in both lazy_scan_heap() and lazy_scan_prune(). It also clarifies future work that will set the visibility map on-access: such paths will not have a cached value available, which would make the logic harder to reason about. And eliminating it enables us to detect and repair VM corruption on-access. Along with removing the cached value and unconditionally checking the visibility status of the heap page, this commit also moves the VM corruption handling to occur first. This reordering should have no performance impact, since the checks are inexpensive and performed only once per page. It does, however, make the control flow easier to understand. The new restructuring also makes it possible to set the VM after fixing corruption (if pruning found the page all-visible). Now that no callers of visibilitymap_set() use its return value, change its (and visibilitymap_set_vmbits()) return type to void. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/5CEAA162-67B1-44DA-B60D-8B65717E8B05%40gmail.com	2026-01-26 17:00:13 -05:00
Melanie Plageman	21796c267d	Combine visibilitymap_set() cases in lazy_scan_prune() lazy_scan_prune() previously had two separate cases that called visibilitymap_set() after pruning and freezing. These branches were nearly identical except that one attempted to avoid dirtying the heap buffer. However, that situation can never occur — the heap buffer cannot be clean at that point (and we would hit an assertion if it were). In lazy_scan_prune(), when we change a previously all-visible page to all-frozen and the page was recorded as all-visible in the visibility map by find_next_unskippable_block(), the heap buffer will always be dirty. Either we have just frozen a tuple and already dirtied the buffer, or the buffer was modified between find_next_unskippable_block() and heap_page_prune_and_freeze() and then pruned in heap_page_prune_and_freeze(). Additionally, XLogRegisterBuffer() asserts that the buffer is dirty, so attempting to add a clean heap buffer to the WAL chain would assert out anyway. Since the “clean heap buffer with already set VM” case is impossible, the two visibilitymap_set() branches in lazy_scan_prune() can be merged. Doing so makes the intent clearer and emphasizes that the heap buffer must always be marked dirty before being added to the WAL chain. This commit also adds a test case for vacuuming when no heap modifications are required. Currently this ensures that the heap buffer is marked dirty before it is added to the WAL chain, but if we later remove the heap buffer from the VM-set WAL chain or pass it with the REGBUF_NO_CHANGES flag, this test would guard that behavior. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/5CEAA162-67B1-44DA-B60D-8B65717E8B05%40gmail.com Discussion: https://postgr.es/m/flat/CAAKRu_ZWx5gCbeCf7PWCv8p5%3D%3Db7EEws0VD2wksDxpXCvCyHvQ%40mail.gmail.com	2026-01-26 16:03:32 -05:00
Tomas Vondra	db14dcdec6	Lookup the correct ordering for parallel GIN builds When building a tuplesort during parallel GIN builds, the function incorrectly looked up the default B-Tree operator, not the function associated with the GIN opclass (through GIN_COMPARE_PROC). Fixed by using the same logic as initGinState(), and the other place in parallel GIN builds. This could cause two types of issues. First, a data type might not have a B-Tree opclass, in which case the PrepareSortSupportFromOrderingOp() fails with an ERROR. Second, a data type might have both B-Tree and GIN opclasses, defining order/equality in different ways. This could lead to logical corruption in the index. Backpatch to 18, where parallel GIN builds were introduced. Discussion: https://postgr.es/m/73a28b94-43d5-4f77-b26e-0d642f6de777@iki.fi Reported-by: Heikki Linnakangas <hlinnaka@iki.fi> Backpatch-through: 18	2026-01-26 20:05:17 +01:00
Peter Eisentraut	5ca5f12c2c	Fix accidentally cast away qualifiers This fixes cases where a qualifier (const, in all cases here) was dropped by a cast, but the cast was otherwise necessary or desirable, so the straightforward fix is to add the qualifier into the cast. Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/b04f4d3a-5e70-4e73-9ef2-87f777ca4aac%40eisentraut.org	2026-01-26 16:02:31 +01:00
David Rowley	7027dd499d	Remove deduplication logic from find_window_functions This code thought it was optimizing WindowAgg evaluation by getting rid of duplicate WindowFuncs, but it turns out all it does today is lead to cost-underestimations and makes it possible that optimize_window_clauses could miss some of the WindowFuncs that must receive an updated winref. The deduplication likely was useful when it was first added, but since the projection code was changed in `b8d7f053c`, the list of WindowFuncs gathered by find_window_functions isn't used during execution. Instead, the expression evaluation code will process the node's targetlist to find the WindowFuncs. The reason the deduplication could cause issues for optimize_window_clauses() is because if a WindowFunc is moved to another WindowClause, the winref is adjusted to reference the new WindowClause. If any duplicate WindowFuncs were discarded in find_window_functions() then the WindowFuncLists may not include all the WindowFuncs that need their winref adjusted. This could lead to an error message such as: ERROR: WindowFunc with winref 2 assigned to WindowAgg with winref 1 The back-branches will receive a different fix so that the WindowAgg costs are not affected. Author: Meng Zhang <mza117jc@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAErYLFAuxmW0UVdgrz7iiuNrxGQnFK_OP9hBD5CUzRgjrVrz=Q@mail.gmail.com	2026-01-26 23:27:15 +13:00
Michael Paquier	114e84c532	Fix missing initialization in pg_restore_extended_stats() The tuple data upserted into pg_statistic_ext_data was missing an initialization for the nulls flag of stxoid and stxdinherit. This would cause an incorrect handling of the stats data restored. This issue has been spotted by CatalogTupleCheckConstraints(), translating to a NOT NULL constraint inconsistency, while playing more with the follow-up portions of the patch set. Oversight in `0e80f3f88d` (mea culpa). Surprisingly, the buildfarm did not complain yet. Discussion: https://postgr.es/m/CADkLM=c7DY3Jv6ef0n_MGUJ1FyTMUoT697LbkST05nraVGNHYg@mail.gmail.com	2026-01-26 16:13:41 +09:00
Michael Paquier	0e80f3f88d	Add pg_restore_extended_stats() This function closely mirror its relation and attribute counterparts, but for extended statistics (i.e. CREATE STATISTICS) objects, being able to restore extended statistics for an extended stats object. Like the other functions, the goal of this feature is to ease the dump or upgrade of clusters so as ANALYZE would not be required anymore after these operations, stats being directly loaded into the target cluster without any post-dump/upgrade computation. The caller of this function needs the following arguments for the extended stats to restore: - The name of the relation. - The schema name of the relation. - The name of the extended stats object. - The schema name of the extended stats object. - If the stats are inherited or not. - One or more extended stats kind with its data. This commit adds only support for the restore of the extended statistics kind "n_distinct", building the basic infrastructure for the restore of more extended statistics kinds in follow-up commits, including MVC and dependencies. The support for "n_distinct" is eased in this commit thanks to the previous work done particularly in commits `1f927cce44` and `44eba8f06e`, that have added the input function for the type pg_ndistinct, used as data type in input of this new restore function. Bump catalog version. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2026-01-26 15:08:15 +09:00
Michael Paquier	c100340729	Remove PG_MMAP_FLAGS from mem.h Based on name of the macro, it was implied that it could be used for all mmap() calls on portability grounds. However, its use is limited to sysv_shmem.c, for CreateAnonymousSegment(). This commit removes the declaration, reducing the confusion around it as a portability tweak, being limited to SysV-style shared memory. This macro has been introduced in `b0fc0df936` for sysv_shmem.c, originally. It has been moved to mem.h in `0ac5e5a7e1` a bit later. Suggested by: Peter Eisentraut <peter@eisentraut.org> Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAExHW5vTWABxuM5fbQcFkGuTLwaxuZDEE2vtx2WuMUWk6JnF4g@mail.gmail.com Discussion: https://postgr.es/m/12add41a-7625-4639-a394-a5563e349322@eisentraut.org	2026-01-26 10:52:02 +09:00
David Rowley	83a53572a6	Always inline SeqNext and SeqRecheck The intention of the work done in `fb9f95502` was that these functions are inlined. I noticed my compiler isn't doing this on -O2 (gcc version 15.2.0). Also, clang version 20.1.8 isn't inlining either. Fix by marking both of these functions as pg_attribute_always_inline to avoid leaving this up to the compiler's heuristics. A quick test with a Seq Scan on a table with a single int column running a query that filters all 1 million rows in the WHERE clause yields a 3.9% speedup on my Zen4 machine. Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrL7Q41B=gv+3wc8+AJGKZugGegUbBo8FPQ+3+NGTPb+w@mail.gmail.com	2026-01-26 14:29:10 +13:00
Dean Rasheed	b4307ae2e5	Fix trigger transition table capture for MERGE in CTE queries. When executing a data-modifying CTE query containing MERGE and some other DML operation on a table with statement-level AFTER triggers, the transition tables passed to the triggers would fail to include the rows affected by the MERGE. The reason is that, when initializing a ModifyTable node for MERGE, MakeTransitionCaptureState() would create a TransitionCaptureState structure with a single "tcs_private" field pointing to an AfterTriggersTableData structure with cmdType == CMD_MERGE. Tuples captured there would then not be included in the sets of tuples captured when executing INSERT/UPDATE/DELETE ModifyTable nodes in the same query. Since there are no MERGE triggers, we should only create AfterTriggersTableData structures for INSERT/UPDATE/DELETE. Individual MERGE actions should then use those, thereby sharing the same capture tuplestores as any other DML commands executed in the same query. This requires changing the TransitionCaptureState structure, replacing "tcs_private" with 3 separate pointers to AfterTriggersTableData structures, one for each of INSERT, UPDATE, and DELETE. Nominally, this is an ABI break to a public structure in commands/trigger.h. However, since this is a private field pointing to an opaque data structure, the only way to create a valid TransitionCaptureState is by calling MakeTransitionCaptureState(), and no extensions appear to be doing that anyway, so it seems safe for back-patching. Backpatch to v15, where MERGE was introduced. Bug: #19380 Reported-by: Daniel Woelfel <dwwoelfel@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19380-4e293be2b4007248%40postgresql.org Backpatch-through: 15	2026-01-24 11:30:48 +00:00
Nathan Bossart	4ce105d9c4	Add missing #include. Oversight in commit `8eef2df189`. Per buildfarm.	2026-01-23 11:00:06 -06:00
Nathan Bossart	8eef2df189	Fix some rounding code for shared memory. InitializeShmemGUCs() always added 1 to the value calculated for shared_memory_size_in_huge_pages, which is unnecessary if the shared memory size is divisible by the huge page size. CreateAnonymousSegment() neglected to check for overflow when rounding up to a multiple of the huge page size. These are arguably bugs, but they seem extremely unlikely to be causing problems in practice, so no back-patch. Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAO6_Xqq2vZbva0R9eQSY0p2kfksX2aP4r%3D%2BZ_q1HBYNU%3Dm8bBg%40mail.gmail.com	2026-01-23 10:46:49 -06:00
Michael Paquier	a36164e746	Add WALRCV_CONNECTING state to the WAL receiver Previously, a WAL receiver freshly started would set its state to WALRCV_STREAMING immediately at startup, before actually establishing a replication connection. This commit introduces a new state called WALRCV_CONNECTING, which is the state used when the WAL receiver freshly starts, or when a restart is requested, with a switch to WALRCV_STREAMING once the connection to the upstream server has been established with COPY_BOTH, meaning that the WAL receiver is ready to stream changes. This change is useful for monitoring purposes, especially in environments with a high latency where a connection could take some time to be established, giving some room between the [re]start phase and the streaming activity. From the point of view of the startup process, that flips the shared memory state of the WAL receiver when it needs to be stopped, the existing WALRCV_STREAMING and the new WALRCV_CONNECTING states have the same semantics: the WAL receiver has started and it can be stopped. Based on an initial suggestion from Noah Misch, with some input from me about the design. Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Discussion: https://postgr.es/m/CABPTF7VQ5tGOSG5TS-Cg+Fb8gLCGFzxJ_eX4qg+WZ3ZPt=FtwQ@mail.gmail.com	2026-01-23 14:17:28 +09:00
Amit Langote	f9a468c664	Fix bogus ctid requirement for dummy-root partitioned targets ExecInitModifyTable() unconditionally required a ctid junk column even when the target was a partitioned table. This led to spurious "could not find junk ctid column" errors when all children were excluded and only the dummy root result relation remained. A partitioned table only appears in the result relations list when all leaf partitions have been pruned, leaving the dummy root as the sole entry. Assert this invariant (nrels == 1) and skip the ctid requirement. Also adjust ExecModifyTable() to tolerate invalid ri_RowIdAttNo for partitioned tables, which is safe since no rows will be processed in this case. Bug: #19099 Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19099-e05dcfa022fe553d%40postgresql.org Backpatch-through: 14	2026-01-23 10:23:30 +09:00
Tom Lane	4b760a181a	Remove faulty Assert in partitioned INSERT...ON CONFLICT DO UPDATE. Commit `f16241bef` mistakenly supposed that INSERT...ON CONFLICT DO UPDATE rejects partitioned target tables. (This may have been accurate when the patch was written, but it was already obsolete when committed.) Hence, there's an assertion that we can't see ItemPointerIndicatesMovedPartitions() in that path, but the assertion is triggerable. Some other places throw error if they see a moved-across-partitions tuple, but there seems no need for that here, because if we just retry then we get the same behavior as in the update-within-partition case, as demonstrated by the new isolation test. So fix by deleting the faulty Assert. (The fact that this is the fix doubtless explains why we've heard no field complaints: the behavior of a non-assert build is fine.) The TM_Deleted case contains a cargo-culted copy of the same Assert, which I also deleted to avoid confusion, although I believe that one is actually not triggerable. Per our code coverage report, neither the TM_Updated nor the TM_Deleted case were reached at all by existing tests, so this patch adds tests for both. Reported-by: Dmitry Koval <d.koval@postgrespro.ru> Author: Joseph Koshakow <koshy44@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/f5fffe4b-11b2-4557-a864-3587ff9b4c36@postgrespro.ru Backpatch-through: 14	2026-01-22 18:35:31 -05:00
Álvaro Herrera	69f98fce5b	Make some use of anonymous unions [reloptions] In the spirit of commit `4b7e6c73b0` and following, which see for more details; it appears to have been quite an uncontroversial C11 feature to use and it makes the code nicer to read. This commit changes the relopt_value struct. Author: Peter Eisentraut <peter@eisentraut.org> Author: Álvaro Herrera <alvherre@kurilemu.de> Note: Yes, this was written twice independently. Discussion: https://postgr.es/m/202601192106.zcdi3yu2gzti@alvherre.pgsql	2026-01-22 17:04:59 +01:00
Peter Eisentraut	c257ba8397	Record range constructor functions in pg_range When a range type is created, several construction functions are also created, two for the range type and three for the multirange type. These have an internal dependency, so they "belong" to the range type. But there was no way to identify those functions when given a range type. An upcoming patch needs access to the two- or possibly the three-argument range constructor function for a given range type. The only way to do that would be with fragile workarounds like matching names and argument types. The correct way to do that kind of thing is to record to the links in the system catalogs. This is what this patch does, it records the OIDs of these five constructor functions in the pg_range catalog. (Currently, there is no code that makes use of this.) Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://www.postgresql.org/message-id/7d63ddfa-c735-4dfe-8c7a-4f1e2a621058%40eisentraut.org	2026-01-22 15:56:29 +01:00
Peter Eisentraut	a5b40d156e	Mark commented out code as unused There were many PG_GETARG_* calls, mostly around gin, gist, spgist code, that were commented out, presumably to indicate that the argument was unused and to indicate that it wasn't forgotten or miscounted. But keeping commented-out code updated with refactorings and style changes is annoying. So this commit changes them to #ifdef NOT_USED blocks, which is a style already in use. That way, at least the indentation and syntax highlighting works correctly, making some of these blocks much easier to read. An alternative would be to just delete that code, but there is some value in making unused arguments explicit, and some of this arguably serves as example code for index AM APIs. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/328e4371-9a4c-4196-9df9-1f23afc900df%40eisentraut.org	2026-01-22 12:44:07 +01:00
Peter Eisentraut	846fb3c790	Remove incorrect commented out code These calls, if activated, are happening before null checks, so they are not correct. Also, the "in" variable is shadowed later. Remove them to avoid confusion and bad examples. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: David Geier <geidav.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/328e4371-9a4c-4196-9df9-1f23afc900df%40eisentraut.org	2026-01-22 12:44:07 +01:00
Thomas Munro	e5d99b4d9e	jit: Add missing inline pass for LLVM >= 17. With LLVM >= 17, transform passes are provided as a string to LLVMRunPasses. Only two strings were used: "default<O3>" and "default<O0>,mem2reg". With previous LLVM versions, an additional inline pass was added when JIT inlining was enabled without optimization. With LLVM >= 17, the code would go through llvm_inline, prepare the functions for inlining, but the generated bitcode would be the same due to the missing inline pass. This patch restores the previous behavior by adding an inline pass when inlining is enabled but no optimization is done. This fixes an oversight introduced by `76200e5e` when support for LLVM 17 was added. Backpatch-through: 14 Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Pierre Ducroquet <p.psql@pinaraf.info> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAO6_XqrNjJnbn15ctPv7o4yEAT9fWa-dK15RSyun6QNw9YDtKg%40mail.gmail.com	2026-01-22 16:03:47 +13:00
Fujii Masao	26cb14aea1	file_fdw: Support multi-line HEADER option. Commit `bc2f348` introduced multi-line HEADER support for COPY. This commit extends this capability to file_fdw, allowing multiple header lines to be skipped. Because CREATE/ALTER FOREIGN TABLE requires option values to be single-quoted, this commit also updates defGetCopyHeaderOption() to accept integer values specified as strings for HEADER option. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: songjinzhou <tsinghualucky912@foxmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAOzEurT+iwC47VHPMS+uJ4WSzvOLPsZ2F2_wopm8M7O+CZa3Xw@mail.gmail.com	2026-01-22 10:14:12 +09:00
Fujii Masao	f3da70a805	Improve the error message in COPY with HEADER option. The error message reported for invalid values of the HEADER option in COPY command previously used the term "non-negative integer", which is discouraged by the Error Message Style Guide because it is ambiguous about whether zero is allowed. This commit improves the error message by replacing "non-negative integer" there with "an integer value greater than or equal to zero" to make the accepted values explicit. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Steven Niu <niushiji@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAHGQGwE86PcuPZbP=aurmW7Oo=eycF10gxjErWq4NmY-5TTX4Q@mail.gmail.com	2026-01-22 10:13:07 +09:00
Tom Lane	4576208454	Force standard_conforming_strings to always be ON. Continuing to support this backwards-compatibility feature has nontrivial costs; in particular it is potentially a security hazard if an application somehow gets confused about which setting the server is using. We changed the default to ON fifteen years ago, which seems like enough time for applications to have adapted. Let's remove support for the legacy string syntax. We should not remove the GUC altogether, since client-side code will still test it, pg_dump scripts will attempt to set it to ON, etc. Instead, just prevent it from being set to OFF. There is precedent for this approach (see commit `de66987ad`). This patch does remove the related GUC escape_string_warning, however. That setting does nothing when standard_conforming_strings is on, so it's now useless. We could leave it in place as a do-nothing setting to avoid breaking clients that still set it, if there are any. But it seems likely that any such client is also trying to turn off standard_conforming_strings, so it'll need work anyway. The client-side changes in this patch are pretty minimal, because even though we are dropping the server's support, most of our clients still need to be able to talk to older server versions. We could remove dead client code only once we disclaim compatibility with pre-v19 servers, which is surely years away. One change of note is that pg_dump/pg_dumpall now set standard_conforming_strings = on in their source session, rather than accepting the source server's default. This ensures that literals in view definitions and such will be printed in a way that's acceptable to v19+. In particular, pg_upgrade will work transparently even if the source installation has standard_conforming_strings = off. (However, pg_restore will behave the same as before if given an archive file containing standard_conforming_strings = off. Such an archive will not be safely restorable into v19+, but we shouldn't break the ability to extract valid data from it for use with an older server.) Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3279216.1767072538@sss.pgh.pa.us	2026-01-21 15:08:38 -05:00
Álvaro Herrera	4d6a66f675	Allow Boolean reloptions to have ternary values From the user's point of view these are just Boolean values; from the implementation side we can now distinguish an option that hasn't been set. Reimplement the vacuum_truncate reloption using this type. This could also be used for reloptions vacuum_index_cleanup and buffering, but those additionally need a per-option "alias" for the state where the variable is unset (currently the value "auto"). Author: Nikolay Shaplov <dhyan@nataraj.su> Reviewed-by: Timur Magomedov <t.magomedov@postgrespro.ru> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/3474141.usfYGdeWWP@thinkpad-pgpro	2026-01-21 20:06:01 +01:00
Tom Lane	cec5fe0d1e	Remove useless flag PVC_INCLUDE_CONVERTROWTYPES. This was introduced in the SJE patch (`fc069a3a6`), but it doesn't do anything because pull_var_clause() never tests it. Apparently it snuck in from somebody's private fork. Remove it again, but only in HEAD -- seems best to let it be in v18. Author: Alexander Pyhalov <a.pyhalov@postgrespro.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/70008c19d22e3dd1565ca57f8436c0ba@postgrespro.ru	2026-01-21 13:26:19 -05:00
Peter Eisentraut	e6bb491bf2	Remove more leftovers of AIX support The make variables MKLDEXPORT and POSTGRES_IMP were only used for AIX, so they should have been removed with commit `0b16bb8776`. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/7a48b624-2236-4e11-9b9d-6a3c658d77a1%40eisentraut.org	2026-01-21 14:51:05 +01:00
Amit Kapila	48efefa6ca	Improve errdetail for logical replication conflict messages. This change enhances the clarity and usefulness of error detail messages generated during logical replication conflicts. The following improvements have been made: 1. Eliminate redundant output: Avoid printing duplicate remote row and replica identity values for the multiple_unique_conflicts conflict type. 2. Improve message structure: Append tuple values directly to the main error message, separated by a colon (:), for better readability. 3. Simplify local row terminology: Remove the word 'existing' when referring to the local row, as this is already implied by context. 4. General code refinements: Apply miscellaneous code cleanups to improve how conflict detail messages are constructed and formatted. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Discussion: https://postgr.es/m/CAHut+Psgkwy5-yGRJC15izecySGGysrbCszv_z93ess8XtCDOQ@mail.gmail.com	2026-01-21 04:58:03 +00:00
Álvaro Herrera	f1cd34f952	Use integer backend type when exec'ing a postmaster child This way we don't have to walk the entire process type array and strcmp() the string with the names therein. The integer value can be directly used as array index instead. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Euler Taveira <euler@eulerto.com> Discussion: https://postgr.es/m/202512090935.k3xrtr44hxkn@alvherre.pgsql	2026-01-20 16:41:04 +01:00
Alexander Korotkov	30776ca468	Remove redundant pg_unreachable() after elog(ERROR) from ExecWaitStmt() elog(ERROR) never returns. Compilers don't always understand this. So, sometimes, we have to append pg_unreachable() to keep the compiler quiet about returning from a non-void function without a value. But pg_unreachable() is redundant for ExecWaitStmt(), which is void. Reported-by: Peter Eisentraut <peter@eisentraut.org> Author: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/8d72a2b3-7423-4a15-a981-e130bf60b1a6%40eisentraut.org Discussion: https://postgr.es/m/CABPTF7UcuVD0L-X%3DjZFfeygjPaZWWkVRwtWOaJw2tcXbEN2xsA%40mail.gmail.com	2026-01-20 16:10:25 +02:00
Amit Kapila	1ba3eee89a	Fix concurrent sequence drops during sequence synchronization. A recent BF failure showed that commit `7a485bd641` did not handle the case where a sequence is dropped concurrently during sequence synchronization on the subscriber. Previously, pg_get_sequence_data() would ERROR out if the sequence was dropped concurrently. After `7a485bd641`, it instead returns NULL, which leads to an assertion failure on the subscriber. To handle this change, update sequence synchronization to skip sequences for which pg_get_sequence_data() returns NULL. Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CALDaNm0FoGdt+1mzua0t-=wYdup5_zmFrvfNf-L=MGBnj9HAcg@mail.gmail.com	2026-01-20 09:40:13 +00:00
Michael Paquier	7ebb64c557	Add routine to free MCVList This addition is in the same spirit as `32e27bd320` for MVNDistinct and MVDependencies, except that we were missing a free routine for the third type of extended statistics, MCVList. I was not sure if we needed an equivalent for MCVList, but after more review of the main patch set for the import of extended statistics, it has become clear that we do. This is introduced as its own change as this routine can be useful on its own. This one is a piece that has not been written by Corey Huinker, I have just noticed it by myself on the way. Author: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2026-01-20 13:13:47 +09:00
Michael Paquier	5d95219faa	pg_stat_statements: Fix crash in list squashing with Vars When IN/ANY clauses contain both constants and variable expressions, the optimizer transforms them into separate structures: constants become an array expression while variables become individual OR conditions. This transformation was creating an overlap with the token locations, causing pg_stat_statements query normalization to crash because it could not calculate the amount of bytes remaining to write for the normalized query. This commit disables squashing for mixed IN list expressions when constructing a scalar array op, by setting list_start and list_end to -1 when both variables and non-variables are present. Some regression tests are added to PGSS to verify these patterns. Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0ts9qiONnHjjHxPxtePs22GBo4d3jZ_s2BQC59AN7XbAA@mail.gmail.com Backpatch-through: 18	2026-01-20 08:11:12 +09:00
Robert Haas	ecd275718b	Don't set the truncation block length greater than RELSEG_SIZE. When faced with a relation containing more than 1 physical segment (i.e. >1GB, with normal settings), the previous code could compute a truncation block length greater than RELSEG_SIZE, which could lead to restore failures of this form: file "%s" has truncation block length %u in excess of segment size %u The fix is simply to clamp the maximum computed truncation_block_length to RELSEG_SiZE. I have also added some comments to clarify the logic. The test case was written by Oleg Tkachenko, but I have rewritten its comments. Reported-by: Oleg Tkachenko <oatkachenko@gmail.com> Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com> Co-authored-by: Robert Haas <rhaas@postgresql.org> Co-authored-by: Oleg Tkachenko <oatkachenko@gmail.com> Reviewed-by: Amul Sul <sulamul@gmail.com> Backpatch-through: 17 Discussion: http://postgr.es/m/00FEFC88-EA1D-4271-B38F-EB741733A84A@gmail.com	2026-01-19 12:09:32 -05:00
Richard Guo	34740b90bc	Fix unsafe pushdown of quals referencing grouping Vars When checking a subquery's output expressions to see if it's safe to push down an upper-level qual, check_output_expressions() previously treated grouping Vars as opaque Vars. This implicitly assumed they were stable and scalar. However, a grouping Var's underlying expression corresponds to the grouping clause, which may be volatile or set-returning. If an upper-level qual references such an output column, pushing it down into the subquery is unsafe. This can cause strange results due to multiple evaluation of a volatile function, or introduce SRFs into the subquery's WHERE/HAVING quals. This patch teaches check_output_expressions() to look through grouping Vars to their underlying expressions. This ensures that any volatility or set-returning properties in the grouping clause are detected, preventing the unsafe pushdown. We do not need to recursively examine the Vars contained in these underlying expressions. Even if they reference outputs from lower-level subqueries (at any depth), those references are guaranteed not to expand to volatile or set-returning functions, because subqueries containing such functions in their targetlists are never pulled up. Backpatch to v18, where this issue was introduced. Reported-by: Eric Ridge <eebbrr@gmail.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/7900964C-F99E-481E-BEE5-4338774CEB9F@gmail.com Backpatch-through: 18	2026-01-19 11:13:23 +09:00
Michael Paquier	6bca4b50d0	Fix error message related to end TLI in backup manifest The code adding the WAL information included in a backup manifest is cross-checked with the contents of the timeline history file of the end timeline. A check based on the end timeline, when it fails, reported the value of the start timeline in the error message. This error is fixed to show the correct timeline number in the report. This error report would be confusing for users if seen, because it would provide an incorrect information, so backpatch all the way down. Oversight in `0d8c9c1210`. Author: Man Zeng <zengman@halodbtech.com> Discussion: https://postgr.es/m/tencent_0F2949C4594556F672CF4658@qq.com Backpatch-through: 14	2026-01-18 17:24:25 +09:00
Michael Paquier	2a6ce34b55	Remove useless asserts in report_namespace_conflict() An assertion is used in this routine to check that a valid namespace OID is given by the caller, but it was repeated twice: once at the top of the routine and a second time multiple times in a switch/case. This commit removes the assertions within the switch/case. Thinko in commit `765cbfdc92`. Author: Man Zeng <zengman@halodbtech.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/tencent_40F8C1D82E2EE28065009AAA@qq.com	2026-01-18 16:11:46 +09:00
Andres Freund	84705b3727	bufmgr: Avoid spurious compiler warning after `fcb9c977aa` Some compilers, e.g. gcc with -Og or -O1, warn about the wait_event in BufferLockAcquire() possibly being uninitialized. That can't actually happen, as the switch() covers all legal lock mode values, but we still need to silence the warning. We could add a default:, but we'd like to get a warning if we were to get a new lock mode in the future. So just initialize wait_event to 0. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/934395.1768518154@sss.pgh.pa.us	2026-01-16 06:58:35 -05:00
Michael Paquier	395b73c045	Improve pg_clear_extended_stats() with incorrect relation/stats combination Issue fat-fingered in `d756fa1019`, noticed while doing more review of the main patch set proposed. I have missed the fact that this can be triggered by specifying an extended stats object that does not match with the relation specified and already locked. Like the cases where an object defined in input is missing, the code is changed to issue a WARNING instead of a confusing cache lookup failure. A regression test is added to cover this case. Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2026-01-16 15:24:59 +09:00
Amit Langote	889676a0d5	Fix rowmark handling for non-relation RTEs during executor init Commit `cbc127917e` introduced tracking of unpruned relids to skip processing of pruned partitions. PlannedStmt.unprunableRelids is computed as the difference between PlannerGlobal.allRelids and prunableRelids, but allRelids only contains RTE_RELATION entries. This means non-relation RTEs (VALUES, subqueries, CTEs, etc.) are never included in unprunableRelids, and consequently not in es_unpruned_relids at runtime. As a result, rowmarks attached to non-relation RTEs were incorrectly skipped during executor initialization. This affects any DML statement that has rowmarks on such RTEs, including MERGE with a VALUES or subquery source, and UPDATE/DELETE with joins against subqueries or CTEs. When a concurrent update triggers an EPQ recheck, the missing rowmark leads to incorrect results. Fix by restricting the es_unpruned_relids membership check to RTE_RELATION entries only, since partition pruning only applies to actual relations. Rowmarks for other RTE kinds are now always processed. Bug: #19355 Reported-by: Bihua Wang <wangbihua.cn@gmail.com> Diagnosed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Diagnosed-by: Tender Wang <tndrwang@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/19355-57d7d52ea4980dc6@postgresql.org Backpatch-through: 18	2026-01-16 14:53:50 +09:00
Amit Langote	9cbb1d21d6	Fix segfault from releasing locks in detached DSM segments If a FATAL error occurs while holding a lock in a DSM segment (such as a dshash lock) and the process is not in a transaction, a segmentation fault can occur during process exit. The problem sequence is: 1. Process acquires a lock in a DSM segment (e.g., via dshash) 2. FATAL error occurs outside transaction context 3. proc_exit() begins, calling before_shmem_exit callbacks 4. dsm_backend_shutdown() detaches all DSM segments 5. Later, on_shmem_exit callbacks run 6. ProcKill() calls LWLockReleaseAll() 7. Segfault: the lock being released is in unmapped memory This only manifests outside transaction contexts because AbortTransaction() calls LWLockReleaseAll() during transaction abort, releasing locks before DSM cleanup. Background workers and other non-transactional code paths are vulnerable. Fix by calling LWLockReleaseAll() unconditionally at the start of shmem_exit(), before any callbacks run. Releasing locks before callbacks prevents the segfault - locks must be released before dsm_backend_shutdown() detaches their memory. This is safe because after an error, held locks are protecting potentially inconsistent data anyway, and callbacks can acquire fresh locks if needed. Also add a comment noting that LWLockReleaseAll() must be safe to call before LWLock initialization (which it is, since num_held_lwlocks will be 0), plus an Assert for the post-condition. This fix aligns with the original design intent from commit `001a573a2`, which noted that backends must clean up shared memory state (including releasing lwlocks) before unmapping dynamic shared memory segments. Reported-by: Rahila Syed <rahilasyed90@gmail.com> Author: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CAH2L28uSvyiosL+kaic9249jRVoQiQF6JOnaCitKFq=xiFzX3g@mail.gmail.com Backpatch-through: 14	2026-01-16 13:02:42 +09:00
Michael Paquier	d756fa1019	Add pg_clear_extended_stats() This function is able to clear the data associated to an extended statistics object, making things so as the object looks as newly-created. The caller of this function needs the following arguments for the extended stats to clear: - The name of the relation. - The schema name of the relation. - The name of the extended stats object. - The schema name of the extended stats object. - If the stats are inherited or not. The first two parameters are especially important to ensure a consistent lookup and ACL checks for the relation on which is based the extended stats object that will be cleared, relying first on a RangeVar lookup where permissions are checked without locking a relation, critical to prevent denial-of-service attacks when using this kind of function (see also `688dc6299a` for a similar concern). The third to fifth arguments give a way to target the extended stats records to clear. This has been extracted from a larger patch by the same author, for a piece which is again useful on its own. I have rewritten large portions of it. The tests have been extended while discussing this piece, resulting on what this commit includes. The intention behind this feature is to add support for the import of extended statistics across dumps and upgrades, this change building one piece that we will be able to rely on for the rest of the changes. Bump catalog version. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2026-01-16 08:13:30 +09:00
Andres Freund	d40fd85187	lwlock: Remove support for disowned lwlwocks This reverts commit `f8d7f29b3e`, plus parts of subsequent commits fixing a typo in a parameter name. Support for disowned lwlocks was added for the benefit of AIO, to be able to have content locks "owned" by the AIO subsystem. But as of commit `fcb9c977aa`, content locks do not use lwlocks anymore. It does not seem particularly likely that we need this facility outside of the AIO use-case, therefore remove the now unused functions. I did choose to keep the comment added in the aforementioned commit about lock->owner intentionally being left pointing to the last owner. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/cj5mcjdpucvw4a54hehslr3ctukavrbnxltvuzzhqnimvpju5e@cy3g3mnsefwz	2026-01-15 14:57:45 -05:00
Andres Freund	55fbfb738b	lwlock: Remove ForEachLWLockHeldByMe As of commit `fcb9c977aa`, ForEachLWLockHeldByMe(), introduced in `f4ece891fc`, is not used anymore, as content locks are now implemented in bufmgr.c. It doesn't seem that likely that a new user of the functionality will appear all that soon, making removal of the function seem like the most sensible path. It can easily be added back if necessary. Discussion: https://postgr.es/m/lneuyxqxamqoayd2ntau3lqjblzdckw6tjgeu4574ezwh4tzlg%40noioxkquezdw	2026-01-15 14:57:45 -05:00
Andres Freund	335f2231a3	pgindent fix for `8077649907` Per buildfarm member koel. Backpatch-through: 18	2026-01-15 14:57:45 -05:00
Andres Freund	fcb9c977aa	bufmgr: Implement buffer content locks independently of lwlocks Until now buffer content locks were implemented using lwlocks. That has the obvious advantage of not needing a separate efficient implementation of locks. However, the time for a dedicated buffer content lock implementation has come: 1) Hint bits are currently set while holding only a share lock. This leads to having to copy pages while they are being written out if checksums are enabled, which is not cheap. We would like to add AIO writes, however once many buffers can be written out at the same time, it gets a lot more expensive to copy them, particularly because that copy needs to reside in shared buffers (for worker mode to have access to the buffer). In addition, modifying buffers while they are being written out can cause issues with unbuffered/direct-IO, as some filesystems (like btrfs) do not like that, due to filesystem internal checksums getting corrupted. The solution to this is to require a new share-exclusive lock-level to set hint bits and to write out buffers, making those operations mutually exclusive. We could introduce such a lock-level into the generic lwlock implementation, however it does not look like there would be other users, and it does add some overhead into important code paths. 2) For AIO writes we need to be able to race-freely check whether a buffer is undergoing IO and whether an exclusive lock on the page can be acquired. That is rather hard to do efficiently when the buffer state and the lock state are separate atomic variables. This is a major hindrance to allowing writes to be done asynchronously. 3) Buffer locks are by far the most frequently taken locks. Optimizing them specifically for their use case is worth the effort. E.g. by merging content locks into buffer locks we will be able to release a buffer lock and pin in one atomic operation. 4) There are more complicated optimizations, like long-lived "super pinned & locked" pages, that cannot realistically be implemented with the generic lwlock implementation. Therefore implement content locks inside bufmgr.c. The lockstate is stored as part of BufferDesc.state. The implementation of buffer content locks is fairly similar to lwlocks, with a few important differences: 1) An additional lock-level share-exclusive has been added. This lock-level conflicts with exclusive locks and itself, but not share locks. 2) Error recovery for content locks is implemented as part of the already existing private-refcount tracking mechanism in combination with resowners, instead of a bespoke mechanism as the case for lwlocks. This means we do not need to add dedicated error-recovery code paths to release all content locks (like done with LWLockReleaseAll() for lwlocks). 3) The lock state is embedded in BufferDesc.state instead of having its own struct. 4) The wakeup logic is a tad more complicated due to needing to support the additional lock-level This commit unfortunately introduces some code that is very similar to the code in lwlock.c, however the code is not equivalent enough to easily merge it. The future wins that this commit makes possible seem worth the cost. As of this commit nothing uses the new share-exclusive lock mode. It will be used in a future commit. It seemed too complicated to introduce the lock-level in a separate commit. It's worth calling out one wart in this commit: Despite content locks not being lwlocks anymore, they continue to use PGPROC->lw* - that seemed better than duplicating the relevant infrastructure. Another thing worth pointing out is that, after this change, content locks are not reported as LWLock wait events anymore, but as new wait events in the "Buffer" wait event class (see also `6c5c393b74`). The old BufferContent lwlock tranche has been removed. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2026-01-15 14:26:53 -05:00
Andres Freund	dac328c8a6	bufmgr: Change BufferDesc.state to be a 64-bit atomic This is motivated by wanting to merge buffer content locks into BufferDesc.state in a future commit, rather than having a separate lwlock (see commit `c75ebc657f` for more details). As this change is rather mechanical, it seems to make sense to split it out into a separate commit, for easier review. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2026-01-15 14:20:41 -05:00
Tom Lane	282b1cde9d	Optimize LISTEN/NOTIFY via shared channel map and direct advancement. This patch reworks LISTEN/NOTIFY to avoid waking backends that have no need to process the notification messages we just sent. The primary change is to create a shared hash table that tracks which processes are listening to which channels (where a "channel" is defined by a database OID and channel name). This allows a notifying process to accurately determine which listeners are interested, replacing the previous weak approximation that listeners in other databases couldn't be interested. Secondly, if a listener is known not to be interested and is currently stopped at the old queue head, we avoid waking it at all and just directly advance its queue pointer past the notifications we inserted. These changes permit very significant improvements (integer multiples) in NOTIFY throughput, as well as a noticeable reduction in latency, when there are many listeners but only a few are interested in any specific message. There is no improvement for the simplest case where every listener reads every message, but any loss seems below the noise level. Author: Joel Jacobson <joel@compiler.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/6899c044-4a82-49be-8117-e6f669765f7e@app.fastmail.com	2026-01-15 14:12:15 -05:00
Heikki Linnakangas	23b25586dc	Fix 'unexpected data beyond EOF' on replica restart On restart, a replica can fail with an error like 'unexpected data beyond EOF in block 200 of relation T/D/R'. These are the steps to reproduce it: - A relation has a size of 400 blocks. - Blocks 201 to 400 are empty. - Block 200 has two rows. - Blocks 100 to 199 are empty. - A restartpoint is done - Vacuum truncates the relation to 200 blocks - A FPW deletes a row in block 200 - A checkpoint is done - A FPW deletes the last row in block 200 - Vacuum truncates the relation to 100 blocks - The replica restarts When the replica restarts: - The relation on disk starts at 100 blocks, because all the truncations were applied before restart. - The first truncate to 200 blocks is replayed. It silently fails, but it will still (incorrectly!) update the cache size to 200 blocks - The first FPW on block 200 is applied. XLogReadBufferForRead relies on the cached size and incorrectly assumes that the page already exists in the file, and thus won't extend the relation. - The online checkpoint record is replayed, calling smgrdestroyall which causes the cached size to be discarded - The second FPW on block 200 is applied. This time, the detected size is 100 blocks, an extend is attempted. However, the block 200 is already present in the buffer cache due to the first FPW. This triggers the 'unexpected data beyond EOF'. To fix, update the cached size in SmgrRelation with the current size rather than the requested new size, when the requested new size is greater. Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> Discussion: https://www.postgresql.org/message-id/CAO6_Xqrv-snNJNhbj1KjQmWiWHX3nYGDgAc=vxaZP3qc4g1Siw@mail.gmail.com Backpatch-through: 14	2026-01-15 21:02:49 +02:00
Álvaro Herrera	35e3fae738	Remove #include <math.h> where not needed Liujinyang reported the one in binaryheap.c, I then found and analyzed the rest. For future patches, we require git archaelogical analysis before we accept patches of this nature. Co-authored-by: liujinyang <21043272@qq.com> Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/tencent_6B302BFCAF6F010E00AB5C2C0ECB7AA3F205@qq.com	2026-01-15 19:09:47 +01:00
Andres Freund	8077649907	aio: io_uring: Fix danger of completion getting reused before being read We called io_uring_cqe_seen(..., cqe) before reading cqe->res. That allows the completion to be reused, which in turn could lead to cqe->res being overwritten. The window for that is very narrow and the likelihood of it happening is very low, as we should never actually utilize all CQEs, but the consequences would be bad. This bug was reported to me privately. Backpatch-through: 18 Discussion: https://postgr.es/m/bwo3e5lj2dgi2wzq4yvbyzu7nmwueczvvzioqsqo6azu6lm5oy@pbx75g2ach3p	2026-01-15 11:09:07 -05:00
Heikki Linnakangas	d9c3c94365	Wake up autovacuum launcher from postmaster when a worker exits When an autovacuum worker exits, the launcher needs to be notified with SIGUSR2, so that it can rebalance and possibly launch a new worker. The launcher must be notified only after the worker has finished ProcKill(), so that the worker slot is available for a new worker. Before this commit, the autovacuum worker was responsible for that, which required a slightly complicated dance to pass the launcher's PID from FreeWorkerInfo() to ProcKill() in a global variable. Simplify that by moving the responsibility of the signaling to the postmaster. The postmaster was already doing it when it failed to fork a worker process, so it seems logical to make it responsible for notifying the launcher on worker exit too. That's also how the notification on background worker exit is done. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: li carol <carol.li2025@outlook.com> Discussion: https://www.postgresql.org/message-id/a5e27d25-c7e7-45d5-9bac-a17c8f462def@iki.fi	2026-01-15 18:02:25 +02:00
Heikki Linnakangas	102bdaa9be	Add check for invalid offset at multixid truncation If a multixid with zero offset is left behind after a crash, and that multixid later becomes the oldest multixid, truncation might try to look up its offset and read the zero value. In the worst case, we might incorrectly use the zero offset to truncate valid SLRU segments that are still needed. I'm not sure if that can happen in practice, or if there are some other lower-level safeguards or incidental reasons that prevent the caller from passing an unwritten multixid as the oldest multi. But better safe than sorry, so let's add an explicit check for it. In stable branches, we should perhaps do the same check for 'oldestOffset', i.e. the offset of the old oldest multixid (in master, 'oldestOffset' is gone). But if the old oldest multixid has an invalid offset, the damage has been done already, and we would never advance past that point. It's not clear what we should do in that case. The check that this commit adds will prevent such an multixid with invalid offset from becoming the oldest multixid in the first place, which seems enough for now. Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: Discussion: https://www.postgresql.org/message-id/000301b2-5b81-4938-bdac-90f6eb660843@iki.fi Backpatch-through: 14	2026-01-15 16:48:45 +02:00
Heikki Linnakangas	c4b71e6f60	Remove some unnecessary code from multixact truncation With 64-bit multixact offsets, PerformMembersTruncation() doesn't need the starting offset anymore. The 'oldestOffset' value that TruncateMultiXact() calculates is no longer used for anything. Remove it, and the code to calculate it. 'oldestOffset' was included in the WAL record as 'startTruncMemb', which sounds nice if you e.g. look at the WAL with pg_waldump, but it was also confusing because we didn't actually use the value for determining what to truncate. Replaying the WAL would remove all segments older than 'endTruncMemb', regardless of 'startTruncMemb'. The 'startTruncOff' stored in the WAL record was similarly unnecessary even before 64-bit multixid offsets, it was stored just for the sake of symmetry with 'startTruncMemb'. Remove both from the WAL record, and rename the remaining 'endTruncOff' to 'oldestMulti' and 'endTruncMemb' to 'oldestOffset', for consistency with the variable names used for them in other places. Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://www.postgresql.org/message-id/000301b2-5b81-4938-bdac-90f6eb660843@iki.fi	2026-01-15 13:34:50 +02:00
Michael Paquier	32e27bd320	Introduce routines to validate and free MVNDistinct and MVDependencies These routines are useful to perform some basic validation checks on each object structure, working currently on attribute numbers for non-expression and expression attnums. These checks could be extended in the future. Note that this code is not used yet in the tree, and that these functions will become handy for an upcoming patch for the import of extended statistics data. However, they are worth their own independent change as they are actually useful by themselves, with at least the extension code argument in mind (or perhaps I am just feeling more pedantic today). Extracted from a larger patch by the same author, with many adjustments and fixes by me. Author: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2026-01-15 09:36:05 +09:00
Jeff Davis	ed425b5a20	Remove redundant assignment in CreateWorkExprContext In CreateWorkExprContext(), maxBlockSize is initialized to ALLOCSET_DEFAULT_MAXSIZE, and it then immediately reassigned, thus the initialization is a redundant. Author: Andreas Karlsson <andreas@proxel.se> Reported-by: Chao Li <lic@highgo.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/83a14f3c-f347-4769-9c01-30030b31f1eb@gmail.com	2026-01-14 12:01:36 -08:00
Andres Freund	556c92a689	lwlock: Improve local variable name In `9a385f6166` I used the variable name new_release_in_progress, but new_wake_in_progress makes more sense given the flag name. Suggested-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/AC5E365D-7AD9-47AE-B2C6-25756712B188@gmail.com	2026-01-14 11:15:38 -05:00
Álvaro Herrera	4196d6178a	Reword confusing comment to avoid "typo fixes" Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://postgr.es/m/CAApHDvqPmpa53jcTmfU8arFFm7=hB5cFoXX5dcUH=1qV0tRFHA@mail.gmail.com	2026-01-14 10:07:44 +01:00
Michael Paquier	6dcfac9696	Use more consistent *GetDatum() macros for some unsigned numbers This patch switches some code paths to use GetDatum() macros more in line with the data types of the variables they manipulate. This set of changes does not fix a problem, but it is always nice to be more consistent across the board. Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Roman Khapov <rkhapov@yandex-team.ru> Reviewed-by: Yuan Li <carol.li2025@outlook.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Man Zeng <zengman@halodbtech.com> Discussion: https://postgr.es/m/CALdSSPidtC7j3MwhkqRj0K2hyp36ztnnjSt6qzGxQtiePR1dzw@mail.gmail.com	2026-01-14 17:07:49 +09:00
Amit Kapila	e385a4e2fd	Prevent unintended dropping of active replication origins. Commit `5b148706c5` exposed functionality that allows multiple processes to use the same replication origin, enabling non-builtin logical replication solutions to implement parallel apply for large transactions. With this functionality, if two backends acquire the same replication origin and one of them resets it first, the acquired_by flag is cleared without acknowledging that another backend is still actively using the origin. This can lead to the origin being unintentionally dropped. If the shared memory for that dropped origin is later reused for a newly created origin, the remaining backend that still holds a pointer to the old memory may inadvertently advance the LSN of a completely different origin, causing unpredictable behavior. Although the underlying issue predates commit `5b148706c5`, it did not surface earlier because the internal parallel apply worker mechanism correctly coordinated origin resets and drops. This commit resolves the problem by introducing a reference counter for replication origins. The reference count increases when a backend sets the origin and decreases when it resets it. Additionally, the backend that first acquires the origin will not release it until all other backends using the origin have released it as well. The patch also prevents dropping a replication origin when acquired_by is zero but the reference counter is nonzero, covering the scenario where the first session exits without properly releasing the origin. Author: Hou Zhijie <houzj.fnst@fujitsu.com> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/TY4PR01MB169077EE72ABE9E55BAF162D494B5A@TY4PR01MB16907.jpnprd01.prod.outlook.com Discussion: https://postgr.es/m/CAMPB6wfe4zLjJL8jiZV5kjjpwBM2=rTRme0UCL7Ra4L8MTVdOg@mail.gmail.com	2026-01-14 07:15:46 +00:00
Andres Freund	9a385f6166	lwlock: Invert meaning of LW_FLAG_RELEASE_OK Previously, a flag was set to indicate that a lock release should wake up waiters. Since waking waiters is the default behavior in the majority of cases, this logic has been inverted. The new LW_FLAG_WAKE_IN_PROGRESS flag is now set iff wakeups are explicitly inhibited. The motivation for this change is that in an upcoming commit, content locks will be implemented independently of lwlocks, with the lock state stored as part of BufferDesc.state. As all of a buffer's flags are cleared when the buffer is invalidated, without this change we would have to re-add the RELEASE_OK flag after clearing the flags; otherwise, the next lock release would not wake waiters. It seems good to keep the implementation of lwlocks and buffer content locks as similar as reasonably possible. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/4csodkvvfbfloxxjlkgsnl2lgfv2mtzdl7phqzd4jxjadxm4o5@usw7feyb5bzf	2026-01-13 19:38:29 -05:00
John Naylor	94a24b4ee5	Improve some comment wording and grammar in extension.c Noted while looking at reports of grammatical errors. Reported-by: albert tan <alterttan1223@gmail.com> Reported-by: Yuan Li(carol) <carol.li2025@outlook.com> Discussion: https://postgr.es/m/CAEzortnJB7aue6miGT_xU2KLb3okoKgkBe4EzJ6yJ%3DY8LMB7gw%40mail.gmail.com	2026-01-13 12:33:08 +07:00
Jeff Davis	a00a25b6ce	Fix error message typo. Reported-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAEoWx2mMmm9fTZYgE-r_T-KPTFR1rKO029QV-S-6n=7US_9EMA@mail.gmail.com	2026-01-12 19:07:00 -08:00
Andres Freund	0b96e734c5	heapam: Add batch mode mvcc check and use it in page mode There are two reasons for doing so: 1) It is generally faster to perform checks in a batched fashion and making sequential scans faster is nice. 2) We would like to stop setting hint bits while pages are being written out. The necessary locking becomes visible for page mode scans, if done for every tuple. With batching, the overhead can be amortized to only happen once per page. There are substantial further optimization opportunities along these lines: - Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could instead write an explicitly optimized version that avoids repeated xid tests. - Introduce batched version of the serializability test - Introduce batched version of HeapTupleSatisfiesVacuum Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar	2026-01-12 13:22:04 -05:00
Andres Freund	852558b9ec	heapam: Use exclusive lock on old page in CLUSTER To be able to guarantee that we can set the hint bit, acquire an exclusive lock on the old buffer. This is required as a future commit will only allow hint bits to be set with a new lock level, which is acquired as-needed in a non-blocking fashion. We need the hint bits, set in heapam_relation_copy_for_cluster() -> HeapTupleSatisfiesVacuum(), to be set, as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() will get confused. Specifically, rewrite_heap_tuple() checks for HEAP_XMAX_INVALID in the old tuple to determine whether to check the old-to-new mapping hash table. It'd be better if we somehow could avoid setting hint bits on the old page. A common reason to use VACUUM FULL is very bloated tables - rewriting most of the old table during VACUUM FULL doesn't exactly help. Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/4wggb7purufpto6x35fd2kwhasehnzfdy3zdcu47qryubs2hdz@fa5kannykekr	2026-01-12 12:40:13 -05:00
Andres Freund	45f658dacb	freespace: Don't modify page without any lock Before this commit fsm_vacuum_page() modified the page without any lock on the page. Historically that was kind of ok, as we didn't rely on the freespace to really stay consistent and we did not have checksums. But these days pages are checksummed and there are ways for FSM pages to be included in WAL records, even if the FSM itself is still not WAL logged. If a FSM page ever were modified while a WAL record referenced that page, we'd be in trouble, as the WAL CRC could end up getting corrupted. The reason to address this right now is a series of patches with the goal to only allow modifications of pages with an appropriate lock level. Obviously not having any lock is not appropriate :) Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/4wggb7purufpto6x35fd2kwhasehnzfdy3zdcu47qryubs2hdz@fa5kannykekr Discussion: https://postgr.es/m/e6a8f734-2198-4958-a028-aba863d4a204@iki.fi	2026-01-12 12:40:00 -05:00
Álvaro Herrera	225d1df1d2	Stop including {brin,gin}_tuple.h in tuplesort.h Doing this meant that those two headers, which are supposed to be internal to their corresponding index AMs, were being included pretty much universally, because tuplesort.h is included by execnodes.h which is very widely used. Stop that, and fix fallout. We also change indexing.h to no longer include execnodes.h (tuptable.h is sufficient), and relscan.h to no longer include buf.h (pointless since `c2fe139c20`). Author: Mario González <gonzalemario@gmail.com> Discussion: https://postgr.es/m/CAFsReFUcBFup=Ohv_xd7SNQ=e73TXi8YNEkTsFEE2BW7jS1noQ@mail.gmail.com	2026-01-12 18:09:49 +01:00
Álvaro Herrera	2defd00062	Move instrumentation-related structs to instrument_node.h Some structs and enums related to parallel query instrumentation had organically grown scattered across various files, and were causing header pollution especially through execnodes.h. Create a single file where they can live together. This only moves the structs to the new file; cleaning up the pollution by removing no-longer-necessary cross-header inclusion will be done in future commits. Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de> Co-authored-by: Mario González <gonzalemario@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/202510051642.wwmn4mj77wch@alvherre.pgsql Discussion: https://postgr.es/m/CAFsReFUr4KrQ60z+ck9cRM4WuUw1TCghN7EFwvV0KvuncTRc2w@mail.gmail.com	2026-01-12 16:59:28 +01:00
Peter Eisentraut	c3c240537f	Avoid casting void * function arguments In many cases, the cast would silently drop a const qualifier. To fix, drop the unnecessary cast and let the compiler check the types and qualifiers. Add const to read-only local variables, preserving the const qualifiers from the function signatures. Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/aUQHy/MmWq7c97wK%40ip-10-97-1-34.eu-west-3.compute.internal	2026-01-12 16:12:56 +01:00
Andres Freund	e5a5e0a907	instrumentation: Keep time fields as instrtime, convert in callers Previously the instrumentation logic always converted to seconds, only for many of the callers to do unnecessary division to get to milliseconds. As an upcoming refactoring will split the Instrumentation struct, utilize instrtime always to keep things simpler. It's also a bit faster to not have to first convert to a double in functions like InstrEndLoop(), InstrAggNode(). Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAP53PkzZ3UotnRrrnXWAv=F4avRq9MQ8zU+bxoN9tpovEu6fGQ@mail.gmail.com	2026-01-09 13:38:00 -05:00
Heikki Linnakangas	bba81f9d3d	Inline ginCompareAttEntries for speed It is called in tight loops during GIN index build. Author: David Geier <geidav.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/5d366878-2007-4d31-861e-19294b7a583b@gmail.com	2026-01-09 20:31:43 +02:00
Tom Lane	7a1d422e39	Improve "constraint must include all partitioning columns" message. This formerly said "unique constraint must ...", which was accurate enough when it only applied to UNIQUE and PRIMARY KEY constraints. However, now we use it for exclusion constraints too, and in that case it's a tad confusing. Do what we already did in the errdetail message: print the constraint_type, so that it looks like "UNIQUE constraint ...", "EXCLUDE constraint ...", etc. Author: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxH6VhAf65Vghg4T2q315gY=Rt4BUfMyunkfRj0n2S9n-g@mail.gmail.com	2026-01-09 12:59:35 -05:00
Nathan Bossart	7a485bd641	pg_dump: Fix gathering of sequence information. Since commit `bd15b7db48`, pg_dump uses pg_get_sequence_data() (née pg_sequence_read_tuple()) to gather all sequence data in a single query as opposed to a query per sequence. Two related bugs have been identified: * If the user lacks appropriate privileges on the sequence, pg_dump generates a setval() command with garbage values instead of failing as expected. * pg_dump can fail due to a concurrently dropped sequence, even if the dropped sequence's data isn't part of the dump. This commit fixes the above issues by 1) teaching pg_get_sequence_data() to return nulls instead of erroring for a missing sequence and 2) teaching pg_dump to fail if it tries to dump the data of a sequence for which pg_get_sequence_data() returned nulls. Note that pg_dump may still fail due to a concurrently dropped sequence, but it should now only do so when the sequence data is part of the dump. This matches the behavior before commit `bd15b7db48`. Bug: #19365 Reported-by: Paveł Tyślacki <pavel.tyslacki@gmail.com> Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19365-6245240d8b926327%40postgresql.org Discussion: https://postgr.es/m/2885944.1767029161%40sss.pgh.pa.us Backpatch-through: 18	2026-01-09 10:12:54 -06:00
Fujii Masao	24cb3a08a4	Use IsA() macro in define.c, for sake of consistency. Commit `63d1b1cf7f` replaced a direct nodeTag() comparison with the IsA() macro in copy.c, but a similar direct comparison remained in define.c. This commit replaces that comparison with IsA() for consistency. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/CAHGQGwGjWGS89_sTx=sbPm0FQemyQQrfTKm=taUhAJFV5k-9cw@mail.gmail.com	2026-01-09 20:24:00 +09:00
Peter Eisentraut	1f08d687c3	meson: Rename cpp variable to cxx Since CPP is also used to mean C PreProcessor in our meson.build files, it's confusing to use it to also mean C PlusPlus. This uses the non-ambiguous cxx wherever possible. Author: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/D98JHQF7H2A8.VSE3I4CJBTAB%40partin.io	2026-01-09 08:58:23 +01:00
David Rowley	349107537d	Fix possible incorrect column reference in ERROR message When creating a partition for a RANGE partitioned table, the reporting of errors relating to converting the specified range values into constant values for the partition key's type could display the name of a previous partition key column when an earlier range was specified as MINVALUE or MAXVALUE. This was caused by the code not correctly incrementing the index that tracks which partition key the foreach loop was working on after processing MINVALUE/MAXVALUE ranges. Fix by using foreach_current_index() to ensure the index variable is always set to the List element being worked on. Author: myzhen <zhenmingyang@yeah.net> Reviewed-by: zhibin wang <killerwzb@gmail.com> Discussion: https://postgr.es/m/273cab52.978.19b96fc75e7.Coremail.zhenmingyang@yeah.net Backpatch-through: 14	2026-01-09 11:01:36 +13:00
Tom Lane	b352d3d80b	Mark GiST inet_ops as opcdefault, and deal with ensuing fallout. This patch completes the transition to making inet_ops be default for inet/cidr columns, rather than btree_gist's opclasses. Once we do that, though, pg_upgrade has a big problem. A dump from an older version will see btree_gist's opclasses as being default, so it will not mention the opclass explicitly in CREATE INDEX commands, which would cause the restore to create the indexes using inet_ops. Since that's not compatible with what's actually in the files, havoc would ensue. This isn't readily fixable, because the CREATE INDEX command strings are built by the older server's pg_get_indexdef() function; pg_dump hasn't nearly enough knowledge to modify those strings successfully. Even if we cared to put in the work to make that happen in pg_dump, it would be counterproductive because the end goal here is to get people off of these opclasses. Allowing such indexes to persist through pg_upgrade wouldn't advance that goal. Therefore, this patch just adds code to pg_upgrade to detect indexes that would be problematic and refuse to upgrade. There's another issue too: even without any indexes to worry about, pg_dump in binary-upgrade mode will reproduce the "CREATE OPERATOR CLASS ... DEFAULT" commands for btree_gist's opclasses, and those will fail because now we have a built-in opclass that provides a conflicting default. We could ask users to drop the btree_gist extension altogether before upgrading, but that would carry very severe penalties. It would affect perfectly-valid indexes for other data types, and it would drop operators that might be relied on in views or other database objects. Instead, put a hack in DefineOpClass to ignore the DEFAULT clauses for these opclasses when in binary-upgrade mode. This will result in installing a version of btree_gist that isn't quite the version it claims to be, but that can be fixed by issuing ALTER EXTENSION UPDATE afterwards. Since we don't apply that hack when not in binary-upgrade mode, it is now impossible to install any version of btree_gist less than 1.9 via CREATE EXTENSION. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/2483812.1754072263@sss.pgh.pa.us	2026-01-08 14:03:56 -05:00
Heikki Linnakangas	63d1b1cf7f	Use IsA macro, for sake of consistency Reported-by: Shinya Kato <shinya11.kato@gmail.com> Discussion: https://www.postgresql.org/message-id/CAOzEurS=PzRzGba3mpNXgEhbnQFA0dxXaU0ujCJ0aa9yMSH6Pw@mail.gmail.com	2026-01-08 18:58:28 +02:00
Heikki Linnakangas	ad853bb877	Fix misc typos, mostly in comments The only user-visible change is the fix in the "malformed pg_dependencies" error detail. That one is new in commit `e1405aa5e3`, so no backpatching required.	2026-01-08 18:10:08 +02:00
Heikki Linnakangas	d3304921db	Improve comments around _bt_checkkeys Discussion: https://www.postgresql.org/message-id/f5388839-99da-465a-8744-23cdfa8ce4db@iki.fi	2026-01-08 18:10:05 +02:00
Amit Kapila	31ddbb38ee	Fix typos in the code. Author: "Dewei Dai" <daidewei1970@163.com> Author: zengman <zengman@halodbtech.com> Author: Zhiyuan Su <suzhiyuan_pg@126.com> Discussion: https://postgr.es/m/2026010719201902382410@163.com Discussion: https://postgr.es/m/tencent_4DC563C83443A4B1082D2BFF@qq.com Discussion: https://postgr.es/m/44656d72.2a63.19b9b92b0a3.Coremail.suzhiyuan_pg@126.com	2026-01-08 09:43:50 +00:00
Peter Eisentraut	6ade3cd459	Remove use of rindex() function rindex() has been removed from POSIX 2008. Replace the one remaining use with the equivalent and more standard strrchr(). Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/98ce805c-6103-421b-adc3-fcf8f3dddbe3%40eisentraut.org	2026-01-08 08:53:49 +01:00
Peter Geoghegan	8aed8e168f	Fix nbtree skip array transformation comments. Fix comments that incorrectly described transformations performed by the "Avoid extra index searches through preprocessing" mechanism introduced by commit `b3f1a13f`. Author: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-By: Chao Li <li.evan.chao@gmail.com> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/20251230190145.c3c88c5eb0f88b136adda92f@sraoss.co.jp Backpatch-through: 18	2026-01-07 12:53:07 -05:00
Michael Paquier	68119480a7	Fix unexpected reversal of lists during catcache rehash During catcache searches, the most-recently searched entries are kept at the head of the list to speed up subsequent searches, keeping the "freshest" entries at its beginning. A rehash of the catcache was doing the opposite: fresh entries were moved to the tail of the newly-created buckets, causing a rehash to slow down a bit. When a rehash is done, this commit switches the code to use dlist_push_tail() instead of dlist_push_head(), so as fresh entries are kept at the head of the lists, not their tail. Author: ChangAo Chen <cca5507@qq.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/tencent_9EA10D8512B5FE29E7323F780A0749768708@qq.com	2026-01-07 17:52:54 +09:00
Michael Paquier	b139bd3b6e	Add data type oid8, 64-bit unsigned identifier This new identifier type provides support for 64-bit unsigned values, to be used in catalogs, like OIDs. An advantage of a new data type is that it becomes easier to grep for it in the code when assigning this type to a catalog attribute, linking it to dedicated APIs and internal structures. The following operators are added in this commit, with dedicated tests: - Casts with integer types and OID. - btree and hash operators - min/max functions. - C type with related macros and defines, named around "Oid8". This has been mentioned as useful on its own on the thread to add support for 64-bit TOAST values, so as it becomes possible to attach this data type to the TOAST code and catalog definitions. However, as this concept can apply to many more areas, it is implemented as its own independent change. This is based on a discussion with Andres Freund and Tom Lane. Bump catalog version. Author: Michael Paquier <michael@paquier.xyz> Reviewed-by: Greg Burd <greg@burd.me> Reviewed-by: Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com> Discussion: https://postgr.es/m/1891064.1754681536@sss.pgh.pa.us	2026-01-07 11:37:00 +09:00
Jeff Davis	af2d4ca191	Clean up ICU includes. Remove ICU includes from pg_locale.h, and instead include them in the few C files that need ICU. Clean up a few other includes in passing. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/48911db71d953edec66df0d2ce303563d631fbe0.camel@j-davis.com	2026-01-06 17:19:51 -08:00
Andres Freund	75609fded3	Fix buggy interaction between array subscripts and subplan params In `a7f107df2` I changed subplan param evaluation to happen within the containing expression. As part of that, ExecInitSubPlanExpr() was changed to evaluate parameters via a new EEOP_PARAM_SET expression step. These parameters were temporarily stored into ExprState->resvalue/resnull, with some reasoning why that would be fine. Unfortunately, that analysis was wrong - ExecInitSubscriptionRef() evaluates the input array into "resv"/"resnull", which will often point to ExprState->resvalue/resnull. This means that the EEOP_PARAM_SET, if inside an array subscript, would overwrite the input array to array subscript. The fix is fairly simple - instead of evaluating into ExprState->resvalue/resnull, store the temporary result of the subplan in the subplan's return value. Bug: #19370 Reported-by: Zepeng Zhang <redraiment@gmail.com> Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us> Diagnosed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/19370-7fb7a5854b7618f1@postgresql.org Backpatch-through: 18	2026-01-06 19:51:10 -05:00
Jeff Davis	c4ff35f104	ICU: use UTF8-optimized case conversion API Initializes a UCaseMap object once for use across calls, and uses UTF8-optimized APIs. Author: Andreas Karlsson <andreas@proxel.se> Reviewed-by: zengman <zengman@halodbtech.com> Discussion: https://postgr.es/m/5a010b27-8ed9-4739-86fe-1562b07ba564@proxel.se	2026-01-06 14:09:07 -08:00
Alexander Korotkov	bf308639bf	Fix variable usage in wakeupWaiters() Mistakenly, `i` was used both as an integer representation of lsnType and an index in wakeUpProcs array. Discussion: https://postgr.es/m/CA%2BhUKG%2BL0OQR8dQtsNBSaA3FNNyQeFabdeRVzurm0b7Xtp592g%40mail.gmail.com Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Ruikai Peng <ruikai@pwno.io>	2026-01-06 10:04:28 +02:00
Michael Paquier	f1e251be80	Allow bgworkers to be terminated for database-related commands Background workers gain a new flag, called BGWORKER_INTERRUPTIBLE, that offers the possibility to terminate the workers when these are connected to a database that is involved in one of the following commands: ALTER DATABASE RENAME TO ALTER DATABASE SET TABLESPACE CREATE DATABASE DROP DATABASE This is useful to give background workers the same behavior as backends and autovacuum workers, which are stopped when these commands are executed. The default behavior, that exists since 9.3, is still to never terminate bgworkers connected to the database involved in any of these commands. The new flag has to be set to terminate the workers. A couple of tests are added to worker_spi to track the commands that impact the termination of the workers. There is a test case for a non-interruptible worker, additionally, that relies on an injection point to make the wait time in CountOtherDBBackends() reduced from 5s to 0.3s for faster test runs. The tests rely on the contents of the server logs to check if a worker has been started or terminated: - LOG generated by worker_spi_main() at startup, once connection to database is done. - FATAL in bgworker_die() when terminated. A couple of tests run in the CI have showed that this method is stable enough. The safe_psql() calls that scan pg_stat_activity could be replaced with some poll_query_until() for more stability, if the current method proves to be an issue in the buildfarm. Author: Aya Iwata <iwata.aya@fujitsu.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Ryo Matsumura <matsumura.ryo@fujitsu.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Discussion: https://postgr.es/m/OS7PR01MB11964335F36BE41021B62EAE8EAE4A@OS7PR01MB11964.jpnprd01.prod.outlook.com	2026-01-06 14:24:29 +09:00
Amit Kapila	c970bdc037	Update comments atop ReplicationSlotCreate. Since commit `1462aad2e4`, which introduced the ability to modify the two_phase property of a slot, the comments above ReplicationSlotCreate have become outdated. We have now added a cautionary note in the comments above ReplicationSlotAlter explaining when it is safe to modify the two_phase property of a slot. Author: Daniil Davydov <3danissimo@gmail.com> Author: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Backpatch-through: 18 Discussion: https://postgr.es/m/CAJDiXggZXQZ7bD0QcTizDt6us9aX6ZKK4dWxzgb5x3+TsVHjqQ@mail.gmail.com	2026-01-06 05:02:25 +00:00
David Rowley	6ca8506ea5	Fix issue with EVENT TRIGGERS and ALTER PUBLICATION When processing the "publish" options of an ALTER PUBLICATION command, we call SplitIdentifierString() to split the options into a List of strings. Since SplitIdentifierString() modifies the delimiter character and puts NULs in their place, this would overwrite the memory of the AlterPublicationStmt. Later in AlterPublicationOptions(), the modified AlterPublicationStmt is copied for event triggers, which would result in the event trigger only seeing the first "publish" option rather than all options that were specified in the command. To fix this, make a copy of the string before passing to SplitIdentifierString(). Here we also adjust a similar case in the pgoutput plugin. There's no known issues caused by SplitIdentifierString() here, so this is being done out of paranoia. Thanks to Henson Choi for putting together an example case showing the ALTER PUBLICATION issue. Author: sunil s <sunilfeb26@gmail.com> Reviewed-by: Henson Choi <assam258@gmail.com> Reviewed-by: zengman <zengman@halodbtech.com> Backpatch-through: 14	2026-01-06 17:29:12 +13:00
Amit Kapila	63ed3bc7f9	Fix typo in slot.c. Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/AC9B87F1-ED04-4547-B85C-9443B4253A08@gmail.com Discussion: https://postgr.es/m/CAJDiXggZXQZ7bD0QcTizDt6us9aX6ZKK4dWxzgb5x3+TsVHjqQ@mail.gmail.com	2026-01-06 04:13:40 +00:00
Michael Paquier	ae28373602	Fix typo in planner.c `b8cfcb9e00` did not get this change right. Author: Alexander Law <exclusion@gmail.com> Discussion: https://postgr.es/m/CAJ0YPFFWhJXs-e-=7iJz-FLp=b1dXfJA_qtrVAgto=bZmzD9zQ@mail.gmail.com	2026-01-06 12:30:01 +09:00
Fujii Masao	d926462d81	Honor GUC settings specified in CREATE SUBSCRIPTION CONNECTION. Prior to v15, GUC settings supplied in the CONNECTION clause of CREATE SUBSCRIPTION were correctly passed through to the publisher's walsender. For example: CREATE SUBSCRIPTION mysub CONNECTION 'options=''-c wal_sender_timeout=1000''' PUBLICATION ... would cause wal_sender_timeout to take effect on the publisher's walsender. However, commit `f3d4019da5` changed the way logical replication connections are established, forcing the publisher's relevant GUC settings (datestyle, intervalstyle, extra_float_digits) to override those provided in the CONNECTION string. As a result, from v15 through v18, GUC settings in the CONNECTION string were always ignored. This regression prevented per-connection tuning of logical replication. For example, using a shorter timeout for walsender connecting to a nearby subscriber and a longer one for walsender connecting to a remote subscriber. This commit restores the intended behavior by ensuring that GUC settings in the CONNECTION string are again passed through and applied by the walsender, allowing per-connection configuration. Backpatch to v15, where the regression was introduced. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Chao Li <lic@highgo.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/CAHGQGwGYV+-abbKwdrM2UHUe-JYOFWmsrs6=QicyJO-j+-Widw@mail.gmail.com Backpatch-through: 15	2026-01-06 11:52:22 +09:00
David Rowley	7f9acc9bcc	Simplify GetOperatorFromCompareType() code The old code would set *opid = InvalidOid to determine if the get_opclass_opfamily_and_input_type() worked or not. This means more moving parts that what's really needed here. Let's just fail immediately if the get_opclass_opfamily_and_input_type() lookup fails. Author: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com> Discussion: https://postgr.es/m/CA+renyXOrjLacP_nhqEQUf2W+ZCoY2q5kpQCfG05vQVYzr8b9w@mail.gmail.com	2026-01-06 15:25:13 +13:00
David Rowley	e8d4e94a47	Fix misleading comment for GetOperatorFromCompareType The comment claimed *strat got set to InvalidStrategy when the function lookup fails. This isn't true; an ERROR is raised when that happens. Author: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Neil Chen <carpenter.nail.cz@gmail.com> Discussion: https://postgr.es/m/CA+renyXOrjLacP_nhqEQUf2W+ZCoY2q5kpQCfG05vQVYzr8b9w@mail.gmail.com Backpatch-through: 18	2026-01-06 15:16:14 +13:00
Tom Lane	b85d5dc0e7	Fix meson build of snowball code. include/snowball/libstemmer has to be in the -I search path, as it is in the autoconf build. It's not apparent to me how this ever worked before, nor why my recent commit made it stop working. Discussion: https://postgr.es/m/ld2iurl7kzexwydxmdfhdgarpa7xxsfrgvggqhbblt4rvt3h6t@bxsk6oz5x7cc	2026-01-05 16:51:36 -05:00
Tom Lane	7dc95cc3b9	Update to latest Snowball sources. It's been almost a year since we last did this, and upstream has been busy. They've added stemmers for Polish and Esperanto, and also deprecated their old Dutch stemmer in favor of the Kraaij-Pohlmann algorithm. (The "dutch" stemmer is now the latter, and "dutch_porter" is the old algorithm.) Upstream also decided to rename their internal header "header.h" to something less generic: "snowball_runtime.h". Seems like a good thing, but it complicates this patch a bit because we were relying on interposing our own version of "header.h" to control system header inclusion order. (We're partially failing at that now, because now the generated stemmer files include <stddef.h> before snowball_runtime.h. I think that'll be okay, but if the buildfarm complains then we'll have to do more-extensive editing of the generated files.) I realized that we weren't documenting the available stemmers in any user-visible place, except indirectly through sample \dFd output. That's incomplete because we only provide built-in dictionaries for the recommended stemmers for each language, not alternative stemmers such as dutch_porter. So I added a list to the documentation. I did not do anything with the stopword lists. If those are still available from snowballstem.org, they are mighty well hidden. Discussion: https://postgr.es/m/1185975.1767569534@sss.pgh.pa.us	2026-01-05 15:22:37 -05:00
Masahiko Sawada	d351063e49	Fix typo in parallel.c. Author: kelan <ke_lan1@qq.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/tencent_38B5875E2D440C8DA8C0C022ABD999F9C207@qq.com	2026-01-05 10:16:28 -08:00
Alexander Korotkov	49a181b5d6	Add the MODE option to the WAIT FOR LSN command This commit extends the WAIT FOR LSN command with an optional MODE option in the WITH clause that specifies which LSN type to wait for: WAIT FOR LSN '<lsn>' [WITH (MODE '<mode>', ...)] where mode can be: - 'standby_replay' (default): Wait for WAL to be replayed to the specified LSN, - 'standby_write': Wait for WAL to be written (received) to the specified LSN, - 'standby_flush': Wait for WAL to be flushed to disk at the specified LSN, - 'primary_flush': Wait for WAL to be flushed to disk on the primary server. The default mode is 'standby_replay', matching the original behavior when MODE is not specified. This follows the pattern used by COPY and EXPLAIN commands, where options are specified as string values in the WITH clause. Modes are explicitly named to distinguish between primary and standby operations: - Standby modes ('standby_replay', 'standby_write', 'standby_flush') can only be used during recovery (on a standby server), - Primary mode ('primary_flush') can only be used on a primary server. The 'standby_write' and 'standby_flush' modes are useful for scenarios where applications need to ensure WAL has been received or persisted on the standby without necessarily waiting for replay to complete. The 'primary_flush' mode allows waiting for WAL to be flushed on the primary server. This commit also includes includes: - Documentation updates for the new syntax and mode descriptions, - Test coverage for all four modes, including error cases and concurrent waiters, - Wakeup logic in walreceiver for standby write/flush waiters, - Wakeup logic in WAL writer for primary flush waiters. Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>	2026-01-05 19:56:19 +02:00
Alexander Korotkov	7a39f43d88	Extend xlogwait infrastructure with write and flush wait types Add support for waiting on WAL write and flush LSNs in addition to the existing replay LSN wait type. This provides the foundation for extending the WAIT FOR command with MODE parameter. Key changes are following. - Add WAIT_LSN_TYPE_STANDBY_WRITE and WAIT_LSN_TYPE_STANDBY_FLUSH to WaitLSNType. - Add GetCurrentLSNForWaitType() to retrieve the current LSN for each wait type. - Add new wait events WAIT_EVENT_WAIT_FOR_WAL_WRITE and WAIT_EVENT_WAIT_FOR_WAL_FLUSH for pg_stat_activity visibility. - Update WaitForLSN() to use GetCurrentLSNForWaitType() internally. Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>	2026-01-05 19:56:19 +02:00
Alexander Korotkov	d51a5d8e56	Adjust errcode in checkPartition() Replace ERRCODE_UNDEFINED_TABLE with ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE for the case where we don't find a parent-child relationship between the partitioned table and its partition. In this case, tables are present, but they are not in a prerequisite state (no relationship). Discussion: https://postgr.es/m/CAHewXNmBM%2B5qbrJMu60NxPn%2B0y-%3D2wXM-QVVs3xRp8NxFvDb9A%40mail.gmail.com Author: Tender Wang <tndrwang@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>	2026-01-05 19:56:19 +02:00
Michael Paquier	877ae5db89	Fix comment in tableam.c Author: shiyu qin <qinshy510@gmail.com> Discussion: https://postgr.es/m/CAJUCM3uJjoLR1zfKoZD4J71T-hdeFdFw1kTQoMkywKZP0hZsvw@mail.gmail.com	2026-01-05 19:15:55 +09:00
Heikki Linnakangas	461b8cc952	Tighten up assertion on a local variable 'lineindex' is 0-based, as mentioned in the comments. Backpatch to v18 where the assertion was added. Author: ChangAo Chen <cca5507@qq.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/tencent_A84F3C810365BB9BD08442955AE494141907@qq.com Backpatch-through: 18	2026-01-05 11:33:35 +02:00
David Rowley	4c144e0452	Use the GetPGProcByNumber() macro when possible A few places were accessing &ProcGlobal->allProcs directly, so adjust them to use the accessor macro instead. Author: Maksim Melnikov <m.melnikov@postgrespro.ru> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/80621c00-aba6-483c-88b1-a845461d1165@postgrespro.ru	2026-01-05 21:19:03 +13:00
Amit Kapila	3f906d3af9	Improve the comments atop build_replindex_scan_key(). Author: zourenli <398740848@qq.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/tencent_C2DC8157CC05C8F5C36E12678A7864554809@qq.com	2026-01-05 03:06:55 +00:00
Michael Paquier	b8cfcb9e00	Fix typos and inconsistencies in code and comments This change is a cocktail of harmonization of function argument names, grammar typos, renames for better consistency and unused code (see ltree). All of these have been spotted by the author. Author: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/b2c0d0b7-3944-487d-a03d-d155851958ff@gmail.com	2026-01-05 09:19:15 +09:00
Tom Lane	ba75f71752	Include error location in errors from ComputeIndexAttrs(). Make use of IndexElem's new location field to localize these errors better. Author: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxH3OgXF1hrzGAaWyNtye2jHEmk9JbtrtGv-KJK6tsGo5w@mail.gmail.com	2026-01-04 14:16:20 -05:00
Tom Lane	62299bbd90	Add parse location to IndexElem. This patch mostly just fills in the field, although a few error reports in resolve_unique_index_expr() are adjusted to use it. The next commit will add more uses. catversion bump out of an abundance of caution: I'm not sure IndexElem can appear in stored rules, but I'm not sure it can't either. Author: Álvaro Herrera <alvherre@kurilemu.de> Co-authored-by: jian he <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxH3OgXF1hrzGAaWyNtye2jHEmk9JbtrtGv-KJK6tsGo5w@mail.gmail.com Discussion: https://postgr.es/m/202512121327.f2zimsr6guso@alvherre.pgsql	2026-01-04 14:16:20 -05:00
Tom Lane	54315fde73	Improve a couple of error messages. Change "function" to "function or procedure" in PreventInTransactionBlock, and improve grammar of ExecWaitStmt's complaint about having an active snapshot. Author: Pavel Stehule <pavel.stehule@gmail.com> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Reviewed-by: Marcos Pegoraro <marcos@f10.com.br> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAFj8pRCveWPR06bbad9GnMb0Kcr6jnXPttv9XOaOB+oFCD1Tsg@mail.gmail.com	2026-01-03 17:18:39 -05:00
Bruce Momjian	451c43974f	Update copyright for 2026 Backpatch-through: 14	2026-01-01 13:24:10 -05:00
Andrew Dunstan	f3c9e341cd	Add paths of extensions to pg_available_extensions Add a new "location" column to the pg_available_extensions and pg_available_extension_versions views, exposing the directory where the extension is located. The default system location is shown as '$system', the same value that can be used to configure the extension_control_path GUC. User-defined locations are only visible for super users, otherwise '<insufficient privilege>' is returned as a column value, the same behaviour that we already use in pg_stat_activity. I failed to resist the temptation to do a little extra editorializing of the TAP test script. Catalog version bumped. Author: Matheus Alcantara <mths.dev@pm.me> Reviewed-By: Chao Li <li.evan.chao@gmail.com> Reviewed-By: Rohit Prasad <rohit.prasad@arm.com> Reviewed-By: Michael Banck <mbanck@gmx.net> Reviewed-By: Manni Wood <manni.wood@enterprisedb.com> Reviewed-By: Euler Taveira <euler@eulerto.com> Reviewed-By: Quan Zongliang <quanzongliang@yeah.net>	2026-01-01 12:13:59 -05:00
Masahiko Sawada	85d5bd308b	Fix macro name for io_uring_queue_init_mem check. Commit `f54af9f267` added a check for io_uring_queue_init_mem(). However, it used the macro name HAVE_LIBURING_QUEUE_INIT_MEM in both meson.build and the C code, while the Autotools build script defined HAVE_IO_URING_QUEUE_INIT_MEM. As a result, the optimization was never enabled in builds configured with Autotools, as the C code checked for the wrong macro name. This commit changes the macro name to HAVE_IO_URING_QUEUE_INIT_MEM in meson.build and the C code. This matches the actual function name (io_uring_queue_init_mem), following the standard HAVE_<FUNCTION> convention. Backpatch to 18, where the macro was introduced. Bug: #19368 Reported-by: Evan Si <evsi@amazon.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19368-016d79a7f3a1c599@postgresql.org Backpatch-through: 18	2025-12-31 11:18:14 -08:00
Thomas Munro	915711c8a4	jit: Fix jit_profiling_support when unavailable. jit_profiling_support=true captures profile data for Linux perf. On other platforms, LLVMCreatePerfJITEventListener() returns NULL and the attempt to register the listener would crash. Fix by ignoring the setting in that case. The documentation already says that it only has an effect if perf support is present, and we already did the same for older LLVM versions that lacked support. No field reports, unsurprisingly for an obscure developer-oriented setting. Noticed in passing while working on commit `1a28b4b4`. Backpatch-through: 14 Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJgB6gvrdDohgwLfCwzVQm%3DVMtb9m0vzQn%3DCwWn-kwG9w%40mail.gmail.com	2025-12-31 14:50:23 +13:00
Tom Lane	bc6374cd76	Change IndexAmRoutines to be statically-allocated structs. Up to now, index amhandlers were expected to produce a new, palloc'd struct on each call. That requires palloc/pfree overhead, and creates a risk of memory leaks if the caller fails to pfree, and the time taken to fill such a large structure isn't nil. Moreover, we were storing these things in the relcache, eating several hundred bytes for each cached index. There is not anything in these structs that needs to vary at runtime, so let's change the definition so that an amhandler can return a pointer to a "static const" struct of which there's only one copy per index AM. Mark all the core code's IndexAmRoutine pointers const so that we catch anyplace that might still try to change or pfree one. (This is similar to the way we were already handling TableAmRoutine structs. This commit does fix one comment that was infelicitously copied-and-pasted into tableamapi.c.) This commit needs to be called out in the v19 release notes as an API change for extension index AMs. An un-updated AM will still work (as of now, anyway) but it risks memory leaks and will be slower than necessary. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAEoWx2=vApYk2LRu8R0DdahsPNEhWUxGBZ=rbZo1EXE=uA+opQ@mail.gmail.com	2025-12-30 18:26:23 -05:00
Masahiko Sawada	736f754eed	Add dead items memory usage to VACUUM (VERBOSE) and autovacuum logs. This commit adds the total memory allocated during vacuum, the number of times the dead items storage was reset, and the configured memory limit. This helps users understand how much memory VACUUM required, and such information can be used to avoid multiple index scans. Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAHza6qcPitBCkyiKJosDTt3bmxMvzZOTONoebwCkBZrr3rk65Q%40mail.gmail.com	2025-12-30 13:12:10 -08:00
Masahiko Sawada	2a5225b99d	Fix a race condition in updating procArray->replication_slot_xmin. Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest xmin across all slots without holding ProcArrayLock (when already_locked is false), acquiring the lock just before updating the replication slot xmin. This could lead to a race condition: if a backend created a new slot and updates the global replication slot xmin, another backend concurrently running ReplicationSlotsComputeRequiredXmin() could overwrite that update with an invalid or stale value. This happens because the concurrent backend might have computed the aggregate xmin before the new slot was accounted for, but applied the update after the new slot had already updated the global value. In the reported failure, a walsender for an apply worker computed InvalidTransactionId as the oldest xmin and overwrote a valid replication slot xmin value computed by a walsender for a tablesync worker. Consequently, the tablesync worker computed a transaction ID via GetOldestSafeDecodingTransactionId() effectively without considering the replication slot xmin. This led to the error "cannot build an initial slot snapshot as oldest safe xid %u follows snapshot's xmin %u", which was an assertion failure prior to commit `240e0dbacd`. To fix this, we acquire ReplicationSlotControlLock in exclusive mode during slot creation to perform the initial update of the slot xmin. In ReplicationSlotsComputeRequiredXmin(), we hold ReplicationSlotControlLock in shared mode until the global slot xmin is updated in ProcArraySetReplicationSlotXmin(). This prevents concurrent computations and updates of the global xmin by other backends during the initial slot xmin update process, while still permitting concurrent calls to ReplicationSlotsComputeRequiredXmin(). Backpatch to all supported versions. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com> Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com Backpatch-through: 14	2025-12-30 10:56:30 -08:00
Michael Paquier	ffdcc9c638	Fix comment in lsyscache.c Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAEoWx2miv0KGcM9j29ANRN45-Vz-2qAqrM0cv9OtaLx8e_WCMQ@mail.gmail.com	2025-12-30 16:42:21 +09:00
Thomas Munro	1a28b4b455	jit: Drop redundant LLVM configure probes. We currently require LLVM 14, so these probes for LLVM 9 functions always succeeded. Even when the features aren't enabled in an LLVM build, dummy functions are defined (a problem for a later commit). The whole PGAC_CHECK_LLVM_FUNCTIONS macro and Meson equivalent are removed, because we switched to testing LLVM_VERSION_MAJOR at compile time in subsequent work and these were the last holdouts. That suits the nature of LLVM API evolution better, and also allows for strictly mechanical pruning in future commits like `820b5af7` and `972c2cd2`. They advanced the minimum LLVM version but failed to spot these. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJgB6gvrdDohgwLfCwzVQm%3DVMtb9m0vzQn%3DCwWn-kwG9w%40mail.gmail.com	2025-12-30 20:24:42 +13:00
Michael Paquier	97b101776c	Add pg_get_multixact_stats() This new function exposes at SQL level some information related to multixacts, not available until now. This data is useful for monitoring purposes, especially for workloads that make a heavy use of multixacts: - num_mxids, number of MultiXact IDs in use. - num_members, number of member entries in use. - members_size, bytes used by num_members in pg_multixact/members/. - oldest_multixact: oldest MultiXact still needed. This patch has been originally proposed when MultiXactOffset was still 32 bits, to monitor wraparound. This part is not relevant anymore since `bd8d9c9bdf` that has widen MultiXactOffset to 64 bits. The monitoring of disk space usage for the members is still relevant. Some tests are added to check this function, in the shape of one isolation test with concurrent transactions that take a ROW SHARE lock, and some SQL tests for pg_read_all_stats. Some documentation is added to explain some patterns that can come from the information provided by the function. Bump catalog version. Author: Naga Appani <nagnrik@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com	2025-12-30 15:38:50 +09:00
Michael Paquier	9cf746a453	Change GetMultiXactInfo() to return the next multixact offset This routine returned a number of members as a MultiXactOffset, calculated based on the difference between the next-to-be-assigned offset and the oldest offset. However, this number is not actually an offset but a number. This type confusion comes from the original implementation of MultiXactMemberFreezeThreshold(), in `53bb309d2d`. The number of members is now defined as a uint64, large enough for MultiXactOffset. This change will be used in a follow-up patch. Reviewed-by: Naga Appani <nagnrik@gmail.com> Discussion: https://postgr.es/m/aUyTvZMq2CLgNEB4@paquier.xyz	2025-12-30 14:03:49 +09:00
Thomas Munro	7da9d8f2db	jit: Remove -Wno-deprecated-declarations in 18+. REL_18_STABLE and master have commit `ee485912`, so they always use the newer LLVM opaque pointer functions. Drop -Wno-deprecated-declarations (commit `a56e7b660`) for code under jit/llvm in those branches, to catch any new deprecation warnings that arrive in future version of LLVM. Older branches continued to use functions marked deprecated in LLVM 14 and 15 (ie switched to the newer functions only for LLVM 16+), as a precaution against unforeseen compatibility problems with bitcode already shipped. In those branches, the comment about warning suppression is updated to explain that situation better. In theory we could suppress warnings only for LLVM 14 and 15 specifically, but that isn't done here. Backpatch-through: 14 Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1407185.1766682319%40sss.pgh.pa.us	2025-12-30 14:34:32 +13:00
Tom Lane	bd3e3e9e56	Ensure sanity of hash-join costing when there are no MCV statistics. estimate_hash_bucket_stats is defined to return zero to mcv_freq if it cannot obtain a value for the frequency of the most common value. Its sole caller final_cost_hashjoin ignored this provision and would blindly believe the zero value, resulting in computing zero for the largest bucket size. In consequence, the safety check that intended to prevent the largest bucket from exceeding get_hash_memory_limit() was ineffective, allowing very silly plans to be chosen if statistics were missing. After fixing final_cost_hashjoin to disregard zero results for mcv_freq, a second problem appeared: some cases that should use hash joins failed to. This is because estimate_hash_bucket_stats was unaware of the fact that ANALYZE won't store MCV statistics if it doesn't find any multiply-occurring values. Thus the lack of an MCV stats entry doesn't necessarily mean that we know nothing; we may well know that the column is unique. The former coding returned zero for mcv_freq in this case, which was pretty close to correct, but now final_cost_hashjoin doesn't believe it and disables the hash join. So check to see if there is a HISTOGRAM stats entry; if so, ANALYZE has in fact run for this column and must have found it to be unique. In that case report the MCV frequency as 1 / rows, instead of claiming ignorance. Reporting a more accurate *mcv_freq in this case can also affect the bucket-size skew adjustment further down in estimate_hash_bucket_stats, causing hash-join cost estimates to change slightly. This affects some plan choices in the core regression tests. The first diff in join.out corresponds to a case where we have no stats and should not risk a hash join, but the remaining changes are caused by producing a better bucket-size estimate for unique join columns. Those are all harmless changes so far as I can tell. The existing behavior was introduced in commit `4867d7f62` in v11. It appears from the commit log that disabling the bucket-size safety check in the absence of statistics was intentional; but we've now seen a case where the ensuing behavior is bad enough to make that seem like a poor decision. In any case the lack of other problems with that safety check after several years helps to justify enforcing it more strictly. However, we won't risk back-patching this, in case any applications are depending on the existing behavior. Bug: #19363 Reported-by: Jinhui Lai <jinhui.lai@qq.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/2380165.1766871097@sss.pgh.pa.us Discussion: https://postgr.es/m/19363-8dd32fc7600a1153@postgresql.org	2025-12-29 13:01:27 -05:00
Richard Guo	559f9e90db	Ignore PlaceHolderVars when looking up statistics When looking up statistical data about an expression, we failed to look through PlaceHolderVar nodes, treating them as opaque. This could prevent us from matching an expression to base columns, index expressions, or extended statistics, as examine_variable() relies on strict structural matching. As a result, queries involving PlaceHolderVar nodes often fell back to default selectivity estimates, potentially leading to poor plan choices. This patch updates examine_variable() to strip PlaceHolderVars before analysis. This is safe during estimation because PlaceHolderVars are transparent for the purpose of statistics lookup: they do not alter the value distribution of the underlying expression. To minimize performance overhead on this hot path, a lightweight walker first checks for the presence of PlaceHolderVars. The more expensive mutator is invoked only when necessary. There is one ensuing plan change in the regression tests, which is expected and demonstrates the fix: the rowcount estimate becomes much more accurate with this patch. Back-patch to v18. Although this issue exists before that, changes in this version made it common enough to notice. Given the lack of field reports for older versions, I am not back-patching further. Reported-by: Haowu Ge <gehaowu@bitmoe.com> Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com Backpatch-through: 18	2025-12-29 11:40:45 +09:00
Richard Guo	ad66f705fa	Strip PlaceHolderVars from index operands When pulling up a subquery, we may need to wrap its targetlist items in PlaceHolderVars to enforce separate identity or as a result of outer joins. However, this causes any upper-level WHERE clauses referencing these outputs to contain PlaceHolderVars, which prevents indxpath.c from recognizing that they could be matched to index columns or index expressions, potentially affecting the planner's ability to use indexes. To fix, explicitly strip PlaceHolderVars from index operands. A PlaceHolderVar appearing in a relation-scan-level expression is effectively a no-op. Nevertheless, to play it safe, we strip only PlaceHolderVars that are not marked nullable. The stripping is performed recursively to handle cases where PlaceHolderVars are nested or interleaved with other node types. To minimize performance impact, we first use a lightweight walker to check for the presence of strippable PlaceHolderVars. The expensive mutator is invoked only if a candidate is found, avoiding unnecessary memory allocation and tree copying in the common case where no PlaceHolderVars are present. Back-patch to v18. Although this issue exists before that, changes in this version made it common enough to notice. Given the lack of field reports for older versions, I am not back-patching further. Reported-by: Haowu Ge <gehaowu@bitmoe.com> Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/62af586c-c270-44f3-9c5e-02c81d537e3d.gehaowu@bitmoe.com Backpatch-through: 18	2025-12-29 11:38:49 +09:00
Peter Eisentraut	b7057e4346	Change some Datum to void * for opaque pass-through pointer Here, Datum was used to pass around an opaque pointer between a group of functions. But one might as well use void * for that; the use of Datum doesn't achieve anything here and is just distracting. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/1c5d23cb-288b-4154-b1cd-191fe2301707%40eisentraut.org	2025-12-28 14:34:12 +01:00
Michael Paquier	9adf32da6b	Split some long Makefile lists This change makes more readable code diffs when adding new items or removing old items, while ensuring that lines do not get excessively long. Some SUBDIRS, PROGRAMS and REGRESS lists are split. Note that there are a few more REGRESS lists that could be split, particularly in contrib/. Author: Jelte Fennema-Nio <postgres@jeltef.nl> Co-Authored-By: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Man Zeng <zengman@halodbtech.com> Discussion: https://postgr.es/m/DF6HDGB559U5.3MPRFCWPONEAE@jeltef.nl	2025-12-28 09:17:42 +09:00
Michael Paquier	36b8f4974a	Fix pg_stat_get_backend_activity() to use multi-byte truncated result pg_stat_get_backend_activity() calls pgstat_clip_activity() to ensure that the reported query string is correctly truncated when it finishes with an incomplete multi-byte sequence. However, the result returned by the function was not what pgstat_clip_activity() generated, but the non-truncated, original, contents from PgBackendStatus.st_activity_raw. Oversight in `54b6cd589a`, so backpatch all the way down. Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAEoWx2mDzwc48q2EK9tSXS6iJMJ35wvxNQnHX+rXjy5VgLvJQw@mail.gmail.com Backpatch-through: 14	2025-12-27 17:23:30 +09:00
Michael Paquier	bde3a46160	Upgrade BufFile to use int64 for byte positions This change has the advantage of removing some weird type casts, caused by offset calculations based on pgoff_t but saved as int (on older branches we use off_t, which could be 4 or 8 bytes depending on the environment). These are safe currently because capped by MAX_PHYSICAL_FILESIZE, but we would run into problems when to make MAX_PHYSICAL_FILESIZE larger or allow callers of these routines to use a larger physical max size on demand. While on it, this improves BufFileDumpBuffer() so as we do not use an offset for "availbytes". It is not a file offset per-set, but a number of available bytes. This change should lead to no functional changes. Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz	2025-12-26 08:41:56 +09:00
Michael Paquier	eee19a30d6	Fix typo in stat_utils.c Introduced by `213a1b8952`. Reported-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAHewXNku-jz-FPKeJVk25fZ1pV2buYh5vpeqGDOB=bFQhKxXhw@mail.gmail.com	2025-12-26 07:53:46 +09:00
Michael Paquier	213a1b8952	Move attribute statistics functions to stat_utils.c Many of the operations done for attribute stats in attribute_stats.c share the same logic as extended stats, as done by a patch under discussion to add support for extended stats import and export. All the pieces necessary for extended statistics are moved to stats_utils.c, which is the file where common facilities are shared for stats files. The following renames are done: * get_attr_stat_type() -> statatt_get_type() * init_empty_stats_tuple() -> statatt_init_empty_tuple() * set_stats_slot() -> statatt_set_slot() * get_elem_stat_type() -> statatt_get_elem_type() While on it, this commit adds more documentation for all these functions, describing more their internals and the dependencies that have been implied for attribute statistics. The same concepts apply to extended statistics, at some degree. Author: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Yu Wang <wangyu_runtime@163.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CADkLM=dpz3KFnqP-dgJ-zvRvtjsa8UZv8wDAQdqho=qN3kX0Zg@mail.gmail.com	2025-12-25 15:13:39 +09:00
Richard Guo	325808cac9	Fix planner error with SRFs and grouping sets If there are any SRFs in a PathTarget, we must separate it into SRF-computing and SRF-free targets. This is because the executor can only handle SRFs that appear at the top level of the targetlist of a ProjectSet plan node. If we find a subexpression that matches an expression already computed in the previous plan level, we should treat it like a Var and should not split it again. setrefs.c will later replace the expression with a Var referencing the subplan output. However, when processing the grouping target for grouping sets, the planner can fail to recognize that an expression is already computed in the scan/join phase. The root cause is a mismatch in the nullingrels bits. Expressions in the grouping target carry the grouping nulling bit in their nullingrels to indicate that they can be nulled by the grouping step. However, the corresponding expressions in the scan/join target do not have these bits. As a result, the exact match check in list_member() fails, leading the planner to incorrectly believe that the expression needs to be re-evaluated from its arguments, which are often not available in the subplan. This can lead to planner errors such as "variable not found in subplan target list". To fix, ignore the grouping nulling bit when checking whether an expression from the grouping target is available in the pre-grouping input target. This aligns with the matching logic in setrefs.c. Backpatch to v18, where this issue was introduced. Bug: #19353 Reported-by: Marian MULLER REBEYROL <marian.muller@serli.com> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/19353-aaa179bba986a19b@postgresql.org Backpatch-through: 18	2025-12-25 12:12:52 +09:00
Fujii Masao	d1b35952da	Fix CREATE SUBSCRIPTION failure when the publisher runs on pre-PG19. CREATE SUBSCRIPTION with copy_data=true and origin='none' previously failed when the publisher was running a version earlier than PostgreSQL 19, even though this combination should be supported. The failure occurred because the command issued a query calling pg_get_publication_sequences function on the publisher. That function does not exist before PG19 and the query is only needed for logical replication sequence synchronization, which is supported starting in PG19. This commit fixes this issue by skipping that query when the publisher runs a version earlier than PG19. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Discussion: https://postgr.es/m/CAHGQGwEx4twHtJdiPWTyAXJhcBPLaH467SH2ajGSe-41m65giA@mail.gmail.com	2025-12-24 23:45:19 +09:00
Fujii Masao	5e813edb55	Fix version check for retain_dead_tuples subscription option. The retain_dead_tuples subscription option is supported only when the publisher runs PostgreSQL 19 or later. However, it could previously be enabled even when the publisher was running an earlier version. This was caused by check_pub_dead_tuple_retention() comparing the publisher server version against 19000 instead of 190000. Fix this typo so that the version check correctly enforces the PG19+ requirement. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Discussion: https://postgr.es/m/CAHGQGwEx4twHtJdiPWTyAXJhcBPLaH467SH2ajGSe-41m65giA@mail.gmail.com	2025-12-24 23:25:00 +09:00
Richard Guo	c8d2f68cc8	Teach expr_is_nonnullable() to handle more expression types Currently, the function expr_is_nonnullable() checks only Const and Var expressions to determine if an expression is non-nullable. This patch extends the detection logic to handle more expression types. This can enable several downstream optimizations, such as reducing NullTest quals to constant truth values (e.g., "COALESCE(var, 1) IS NULL" becomes FALSE) and converting "COUNT(expr)" to the more efficient "COUNT(*)" when the expression is proven non-nullable. This breaks a test case in test_predtest.sql, since we now simplify "ARRAY[] IS NULL" to constant FALSE, preventing it from weakly refuting a strict ScalarArrayOpExpr ("x = any(ARRAY[])"). To ensure the refutation logic is still exercised as intended, wrap the array argument in opaque_array(). Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com	2025-12-24 18:00:44 +09:00
Richard Guo	cb7b7ec7a1	Optimize ROW(...) IS [NOT] NULL using non-nullable fields We break ROW(...) IS [NOT] NULL into separate tests on its component fields. During this breakdown, we can improve efficiency by utilizing expr_is_nonnullable() to detect fields that are provably non-nullable. If a component field is proven non-nullable, it affects the outcome based on the test type. For an IS NULL test, a single non-nullable field refutes the whole NullTest, reducing it to constant FALSE. For an IS NOT NULL test, the check for that specific field is guaranteed to succeed, so we can discard it from the list of component tests. This extends the existing optimization logic, which previously only handled Const fields, to support any expression that can be proven non-nullable. In passing, update the existing constant folding of NullTests to use expr_is_nonnullable() instead of var_is_nonnullable(), enabling it to benefit from future improvements to that function. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com	2025-12-24 18:00:02 +09:00
Richard Guo	10c4fe074a	Simplify COALESCE expressions using non-nullable arguments The COALESCE function returns the first of its arguments that is not null. When an argument is proven non-null, if it is the first non-null-constant argument, the entire COALESCE expression can be replaced by that argument. If it is a subsequent argument, all following arguments can be dropped, since they will never be reached. Currently, we perform this simplification only for Const arguments. This patch extends the simplification to support any expression that can be proven non-nullable. This can help avoid the overhead of evaluating unreachable arguments. It can also lead to better plans when the first argument is proven non-nullable and replaces the expression, as the planner no longer has to treat the expression as non-strict, and can also leverage index scans on the resulting expression. There is an ensuing plan change in generated_virtual.out, and we have to modify the test to ensure that it continues to test what it is intended to. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAMbWs49UhPBjm+NRpxerjaeuFKyUZJ_AjM3NBcSYK2JgZ6VTEQ@mail.gmail.com	2025-12-24 17:58:49 +09:00
Michael Paquier	b947dd5c75	Improve comment in pgstatfuncs.c Author: Zizhen Qiao <zizhen_qiao@163.com> Discussion: https://postgr.es/m/5ee635f9.49f7.19b4ed9e803.Coremail.zizhen_qiao@163.com	2025-12-24 17:09:13 +09:00
Amit Kapila	1528b0d899	Don't advance origin during apply failure. The logical replication parallel apply worker could incorrectly advance the origin progress during an error or failed apply. This behavior risks transaction loss because such transactions will not be resent by the server. Commit `3f28b2fcac` addressed a similar issue for both the apply worker and the table sync worker by registering a before_shmem_exit callback to reset origin information. This prevents the worker from advancing the origin during transaction abortion on shutdown. This patch applies the same fix to the parallel apply worker, ensuring consistent behavior across all worker types. As with `3f28b2fcac`, we are backpatching through version 16, since parallel apply mode was introduced there and the issue only occurs when changes are applied before the transaction end record (COMMIT or ABORT) is received. Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 16 Discussion: https://postgr.es/m/TY4PR01MB169078771FB31B395AB496A6B94B4A@TY4PR01MB16907.jpnprd01.prod.outlook.com Discussion: https://postgr.es/m/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92@TYAPR01MB5692.jpnprd01.prod.outlook.com	2025-12-24 04:36:39 +00:00
Masahiko Sawada	67c20979ce	Toggle logical decoding dynamically based on logical slot presence. Previously logical decoding required wal_level to be set to 'logical' at server start. This meant that users had to incur the overhead of logical-level WAL logging even when no logical replication slots were in use. This commit adds functionality to automatically control logical decoding availability based on logical replication slot presence. The newly introduced module logicalctl.c allows logical decoding to be dynamically activated when needed when wal_level is set to 'replica'. When the first logical replication slot is created, the system automatically increases the effective WAL level to maintain logical-level WAL records. Conversely, after the last logical slot is dropped or invalidated, it decreases back to 'replica' WAL level. While activation occurs synchronously right after creating the first logical slot, deactivation happens asynchronously through the checkpointer process. This design avoids a race condition at the end of recovery; a concurrent deactivation could happen while the startup process enables logical decoding at the end of recovery, but WAL writes are still not permitted until recovery fully completes. The checkpointer will handle it after recovery is done. Asynchronous deactivation also avoids excessive toggling of the logical decoding status in workloads that repeatedly create and drop a single logical slot. On the other hand, this lazy approach can delay changes to effective_wal_level and the disabling logical decoding, especially when the checkpointer is busy with other tasks. We chose this lazy approach in all deactivation paths to keep the implementation simple, even though laziness is strictly required only for end-of-recovery cases. Future work might address this limitation either by using a dedicated worker instead of the checkpointer, or by implementing synchronous waiting during slot drops if workloads are significantly affected by the lazy deactivation of logical decoding. The effective WAL level, determined internally by XLogLogicalInfo, is allowed to change within a transaction until an XID is assigned. Once an XID is assigned, the value becomes fixed for the remainder of the transaction. This behavior ensures that the logging mode remains consistent within a writing transaction, similar to the behavior of GUC parameters. A new read-only GUC parameter effective_wal_level is introduced to monitor the actual WAL level in effect. This parameter reflects the current operational WAL level, which may differ from the configured wal_level setting. Bump PG_CONTROL_VERSION as it adds a new field to CheckPoint struct. Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAD21AoCVLeLYq09pQPaWs+Jwdni5FuJ8v2jgq-u9_uFbcp6UbA@mail.gmail.com	2025-12-23 10:13:16 -08:00
Heikki Linnakangas	955f550686	Fix bug in following update chain when locking a heap tuple After waiting for a concurrent updater to finish, heap_lock_tuple() followed the update chain to lock all tuple versions. However, when stepping from the initial tuple to the next one, it failed to check that the next tuple's XMIN matches the initial tuple's XMAX. That's an important check whenever following an update chain, and the recursive part that follows the chain did it, but the initial step missed it. Without the check, if the updating transaction aborts, the updated tuple is vacuumed away and replaced by an unrelated tuple, the unrelated tuple might get incorrectly locked. Author: Jasper Smit <jasper.smit@servicenow.com> Discussion: https://www.postgresql.org/message-id/CAOG+RQ74x0q=kgBBQ=mezuvOeZBfSxM1qu_o0V28bwDz3dHxLw@mail.gmail.com Backpatch-through: 14	2025-12-23 13:37:16 +02:00
Michael Paquier	8e0d32a4a1	Fix orphaned origin in shared memory after DROP SUBSCRIPTION Since `ce0fdbfe97`, a replication slot and an origin are created by each tablesync worker, whose information is stored in both a catalog and shared memory (once the origin is set up in the latter case). The transaction where the origin is created is the same as the one that runs the initial COPY, with the catalog state of the origin becoming visible for other sessions only once the COPY transaction has committed. The catalog state is coupled with a state in shared memory, initialized at the same time as the origin created in the catalogs. Note that the transaction doing the initial data sync can take a long time, time that depends on the amount of data to transfer from a publication node to its subscriber node. Now, when a DROP SUBSCRIPTION is executed, all its workers are stopped with the origins removed. The removal of each origin relies on a catalog lookup. A worker still running the initial COPY would fail its transaction, with the catalog state of the origin rolled back while the shared memory state remains around. The session running the DROP SUBSCRIPTION should be in charge of cleaning up the catalog and the shared memory state, but as there is no data in the catalogs the shared memory state is not removed. This issue would leave orphaned origin data in shared memory, leading to a confusing state as it would still show up in pg_replication_origin_status. Note that this shared memory data is sticky, being flushed on disk in replorigin_checkpoint at checkpoint. This prevents other origins from reusing a slot position in the shared memory data. To address this problem, the commit moves the creation of the origin at the end of the transaction that precedes the one executing the initial COPY, making the origin immediately visible in the catalogs for other sessions, giving DROP SUBSCRIPTION a way to know about it. A different solution would have been to clean up the shared memory state using an abort callback within the tablesync worker. The solution of this commit is more consistent with the apply worker that creates an origin in a short transaction. A test is added in the subscription test 004_sync.pl, which was able to display the problem. The test fails when this commit is reverted. Reported-by: Tenglong Gu <brucegu@amazon.com> Reported-by: Daisuke Higuchi <higudai@amazon.com> Analyzed-by: Michael Paquier <michael@paquier.xyz> Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/aUTekQTg4OYnw-Co@paquier.xyz Backpatch-through: 14	2025-12-23 14:32:14 +09:00
Michael Paquier	e5f3839af6	Switch buffile.c/h to use pgoff_t instead of off_t off_t was previously used for offsets, which is 4 bytes on Windows, hence limiting the backend code to a hard limit for files longer than 2GB. This leads to some simplification in these files, removing some casts based on long, also 4 bytes on Windows. This commit removes one comment introduced in `db3c4c3a2d`, not relevant anymore as pgoff_t is a safe 8-byte alternative on Windows. This change is surprisingly not invasive, as the callers of BufFileTell(), BufFileSeek() and BufFileTruncateFileSet() (worker.c, tuplestore.c, etc.) track offsets in local structures that just to switch from off_t to pgoff_t for the most part. The file is still relying on a maximum file size of MAX_PHYSICAL_FILESIZE (1GB). This change allows the code to make this maximum potentially larger in the future, or larger on a per-demand basis. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz	2025-12-23 07:41:34 +09:00
Michael Paquier	d31276f0e2	Fix another typo in gininsert.c Reported-by: Tender Wang <tndrwang@gmail.com> Discussion: https://postgr.es/m/CAHewXNkRJ9DMFZMQKWQ32U+OTBR78KeGh2=9Wy5jEeWDxMVFcQ@mail.gmail.com	2025-12-22 12:38:40 +09:00
Peter Geoghegan	fab5cd3dd1	Remove obsolete name_ops index-only scan comments. nbtree index-only scans of an index that uses btree/name_ops as one of its index column's input opclasses are no longer at any risk of reading past the end of currTuples. We're no longer reliant on such scans being able to at least read from the start of markTuples storage (which uses space from the same allocation as currTuples) to avoid a segfault: StoreIndexTuple (from nodeIndexonlyscan.c) won't actually read past the end of a cstring datum from a name_ops index. In other words, we already have the "special-case treatment for name_ops" that the removed comment supposed we could avoid by relying on markTuples in this way. Oversight in commit `a63224be49`, which added special case handling of name_ops cstrings to StoreIndexTuple, but missed these comments.	2025-12-21 12:27:38 -05:00
Andres Freund	548de59d93	heapam: Move logic to handle HEAP_MOVED into a helper function Before we dealt with this in 6 near identical and one very similar copy. The helper function errors out when encountering a HEAP_MOVED_IN/HEAP_MOVED_OUT tuple with xvac considered current or in-progress. It'd be preferrable to do that change separately, but otherwise it'd not be possible to deduplicate the handling in HeapTupleSatisfiesVacuum(). Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6 Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar	2025-12-19 13:28:34 -05:00
Andres Freund	09ae2c8bac	bufmgr: Optimize & harmonize LockBufHdr(), LWLockWaitListLock() The main optimization is for LockBufHdr() to delay initializing SpinDelayStatus, similar to what LWLockWaitListLock already did. The initialization is sufficiently expensive & buffer header lock acquisitions are sufficiently frequent, to make it worthwhile to instead have a fastpath (via a likely() branch) that does not initialize the SpinDelayStatus. While LWLockWaitListLock() already the aforementioned optimization, it did not use likely(), and inspection of the assembly shows that this indeed leads to worse code generation (also observed in a microbenchmark). Fix that by adding the likely(). While the LockBufHdr() improvement is a small gain on its own, it mainly is aimed at preventing a regression after a future commit, which requires additional locking to set hint bits. While touching both, also make the comments more similar to each other. Reviewed-by: Heikki Linnakangas <heikki.linnakangas@iki.fi> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-12-19 13:23:33 -05:00
Heikki Linnakangas	47a9f61fca	Use proper type for RestoreTransactionSnapshot's PGPROC arg Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/08cbaeb5-aaaf-47b6-9ed8-4f7455b0bc4b@iki.fi	2025-12-19 13:40:02 +02:00
Michael Paquier	5cdbec5aa9	Fix typos in gininsert.c Introduced by `8492feb98f`. Author: Xingbin She <xingbin.she@qq.com> Discussion: https://postgr.es/m/tencent_C254AE962588605F132DB4A6F87205D6A30A@qq.com	2025-12-19 14:33:38 +09:00
Fujii Masao	b3ccb0a2cb	Add guard to prevent recursive memory context logging. Previously, if memory context logging was triggered repeatedly and rapidly while a previous request was still being processed, it could result in recursive calls to ProcessLogMemoryContextInterrupt(). This could lead to infinite recursion and potentially crash the process. This commit adds a guard to prevent such recursion. If ProcessLogMemoryContextInterrupt() is already in progress and logging memory contexts, subsequent calls will exit immediately, avoiding unintended recursive calls. While this scenario is unlikely in practice, it's not impossible. This change adds a safety check to prevent such failures. Back-patch to v14, where memory context logging was introduced. Reported-by: Robert Haas <robertmhaas@gmail.com> Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Artem Gavrilov <artem.gavrilov@percona.com> Discussion: https://postgr.es/m/CA+TgmoZMrv32tbNRrFTvF9iWLnTGqbhYSLVcrHGuwZvCtph0NA@mail.gmail.com Backpatch-through: 14	2025-12-19 12:05:37 +09:00
Michael Paquier	9d0f7996e5	Use table/index_close() more consistently All the code paths updated here have been using relation_close() to close a relation that has already been opened with table_open() or index_open(), where a relkind check is enforced. table_close() and index_open() do the same thing as relation_close(), so there was no harm, but being inconsistent could lead to issues if the internals of these close() functions begin to introduce some logic specific to each relkind in the future. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aUKamYGiDKO6byp5@ip-10-97-1-34.eu-west-3.compute.internal	2025-12-19 07:55:58 +09:00
Heikki Linnakangas	951b60f7ab	Do not emit WAL for unlogged BRIN indexes Operations on unlogged relations should not be WAL-logged. The brin_initialize_empty_new_buffer() function didn't get the memo. The function is only called when a concurrent update to a brin page uses up space that we're just about to insert to, which makes it pretty hard to hit. If you do manage to hit it, a full-page WAL record is erroneously emitted for the unlogged index. If you then crash, crash recovery will fail on that record with an error like this: FATAL: could not create file "base/5/32819": File exists Author: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://www.postgresql.org/message-id/CALdSSPhpZXVFnWjwEBNcySx_vXtXHwB2g99gE6rK0uRJm-3GgQ@mail.gmail.com Backpatch-through: 14	2025-12-18 15:08:48 +02:00
Michael Paquier	f4e797171e	Change pgstat_report_vacuum() to use Relation This change makes pgstat_report_vacuum() more consistent with pgstat_report_analyze(), that also uses a Relation. This enforces a policy that callers of this routine should open and lock the relation whose statistics are updated before calling this routine. We will unlikely have a lot of callers of this routine in the tree, but it seems like a good idea to imply this requirement in the long run. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aUEA6UZZkDCQFgSA@ip-10-97-1-34.eu-west-3.compute.internal	2025-12-17 11:26:17 +09:00
Michael Paquier	1d325ad99c	Remove useless code in InjectionPointAttach() strlcpy() ensures that a target string is zero-terminated, so there is no need to enforce it a second time in this code. This simplification could have been done in `0eb23285a2`. Author: Feilong Meng <feelingmeng@foxmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/tencent_771178777C5BC17FCB7F7A1771CD1FFD5708@qq.com	2025-12-17 08:58:58 +09:00
Jeff Davis	0a90df58cf	Avoid global LC_CTYPE dependency in pg_locale_icu.c. ICU still depends on libc for compatibility with certain historical behavior for single-byte encodings. Make the dependency explicit by holding a locale_t object when required. We should consider a better solution in the future, such as decoding the text to UTF-32 and using u_tolower(). That would be a behavior change and require additional infrastructure though; so for now, just avoid the global LC_CTYPE dependency. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com	2025-12-16 15:32:57 -08:00
Jeff Davis	87b2968df0	downcase_identifier(): use method table from locale provider. Previously, libc's tolower() was always used for lowercasing identifiers, regardless of the database locale (though only characters beyond 127 in single-byte encodings were affected). Refactor to allow each provider to supply its own implementation of identifier downcasing. For historical compatibility, when using a single-byte encoding, ICU still relies on tolower(). One minor behavior change is that, before the database default locale is initialized, it uses ASCII semantics to downcase the identifiers. Previously, it would use the postmaster's LC_CTYPE setting from the environment. While that could have some effect during GUC processing, for example, it would have been fragile to rely on the environment setting anyway. (Also, it only matters when the encoding is single-byte.) Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com	2025-12-16 15:32:41 -08:00
Robert Haas	f1a6e622bd	Switch memory contexts in ReinitializeParallelDSM. We already do this in CreateParallelContext, InitializeParallelDSM, and LaunchParallelWorkers. I suspect the reason why the matching logic was omitted from ReinitializeParallelDSM is that I failed to realize that any memory allocation was happening here -- but shm_mq_attach does allocate, which could result in a shm_mq_handle being allocated in a shorter-lived context than the ParallelContext which points to it. That could result in a crash if the shorter-lived context is freed before the parallel context is destroyed. As far as I am currently aware, there is no way to reach a crash using only code that is present in core PostgreSQL, but extensions could potentially trip over this. Fixing this in the back-branches appears low-risk, so back-patch to all supported versions. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Co-authored-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com> Backpatch-through: 14 Discussion: http://postgr.es/m/CAKZiRmwfVripa3FGo06=5D1EddpsLu9JY2iJOTgbsxUQ339ogQ@mail.gmail.com	2025-12-16 12:24:55 -05:00
Melanie Plageman	bfe5c4bec7	Add explanatory comment to prune_freeze_setup() heap_page_prune_and_freeze() fills in PruneState->deadoffsets, the array of OffsetNumbers of dead tuples. It is returned to the caller in the PruneFreezeResult. To avoid having two copies of the array, the PruneState saves only a pointer to the array. This was a bit unusual and confusing, so add a clarifying comment. Author: Melanie Plageman <melanieplageman@gmail.com> Suggested-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAEoWx2=jiD1nqch4JQN+odAxZSD7mRvdoHUGJYN2r6tQG_66yQ@mail.gmail.com	2025-12-16 11:04:07 -05:00
Melanie Plageman	4877391ce8	Fix const qualification in prune_freeze_setup() The const qualification of the presult argument to prune_freeze_setup() is later cast away, so it was not correct. Remove it and add a comment explaining that presult should not be modified. Author: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/fb97d0ae-a0bc-411d-8a87-f84e7e146488%40eisentraut.org	2025-12-16 11:00:05 -05:00
John Naylor	9303d62c6d	Separate out bytea sort support from varlena.c In the wake of commit `b45242fd3`, bytea_sortsupport() still called out to varstr_sortsupport(). Treating bytea as a kind of text/varchar required varstr_sortsupport() to allow for the possibility of NUL bytes, but only for C collation. This was confusing. For better separation of concerns, create an independent sortsupport implementation in bytea.c. The heuristics for bytea_abbrev_abort() remain the same as for varstr_abbrev_abort(). It's possible that the bytea case warrants different treatment, but that is left for future investigation. In passing, adjust some strange looking comparisons in varstr_abbrev_abort(). Author: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAJ7c6TP1bAbEhUJa6+rgceN6QJWMSsxhg1=mqfSN=Nb-n6DAKg@mail.gmail.com	2025-12-16 15:19:16 +07:00
Michael Paquier	15f68cebdc	Add TAP test to check recovery when redo LSN is missing This commit provides test coverage for `dc7c77f825`, where the redo record and the checkpoint record finish on different WAL segments with the start of recovery able to detect that the redo record is missing. This test uses a wait injection point done in the critical section of a checkpoint, method that requires not one but actually two wait injection points to avoid any memory allocations within the critical section of the checkpoint: - Checkpoint run with a background psql. - One first wait point is run by the checkpointer before the critical section, allocating the shared memory required by the DSM registry for the wait machinery in the library injection_points. - First point is woken up. - Second wait point is loaded before the critical section, allocating the memory to build the path to the library loaded, then run in the critical section once the checkpoint redo record has been logged. - WAL segment is switched while waiting on the second point. - Checkpoint completes. - Stop cluster with immediate mode. - The segment that includes the redo record is removed. - Start, recovery fails as the redo record cannot be found. The error message introduced in `dc7c77f825` is now reduced to a FATAL, meaning that the information is still provided while being able to use a test for it. Nitin has provided a basic version of the test, that I have enhanced to make it portable with two points. Without `dc7c77f825`, the cluster crashes in this test, not on a PANIC but due to the pointer dereference at the beginning of recovery, failure mentioned in the other commit. Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com	2025-12-16 14:28:05 +09:00
Michael Paquier	dc7c77f825	Fail recovery when missing redo checkpoint record without backup_label This commit adds an extra check at the beginning of recovery to ensure that the redo record of a checkpoint exists before attempting WAL replay, logging a PANIC if the redo record referenced by the checkpoint record could not be found. This is the same level of failure as when a checkpoint record is missing. This check is added when a cluster is started without a backup_label, after retrieving its checkpoint record. The redo LSN used for the check is retrieved from the checkpoint record successfully read. In the case where a backup_label exists, the startup process already fails if the redo record cannot be found after reading a checkpoint record at the beginning of recovery. Previously, the presence of the redo record was not checked. If the redo and checkpoint records were located on different WAL segments, it would be possible to miss a entire range of WAL records that should have been replayed but were just ignored. The consequences of missing the redo record depend on the version dealt with, these becoming worse the older the version used: - On HEAD, v18 and v17, recovery fails with a pointer dereference at the beginning of the redo loop, as the redo record is expected but cannot be found. These versions are good students, because we detect a failure before doing anything, even if the failure is misleading in the shape of a segmentation fault, giving no information that the redo record is missing. - In v16 and v15, problems show at the end of recovery within FinishWalRecovery(), the startup process using a buggy LSN to decide from where to start writing WAL. The cluster gets corrupted, still it is noisy about it. - v14 and older versions are worse: a cluster gets corrupted but it is entirely silent about the matter. The redo record missing causes the startup process to skip entirely recovery, because a missing record is the same as not redo being required at all. This leads to data loss, as everything is missed between the redo record and the checkpoint record. Note that I have tested that down to 9.4, reproducing the issue with a version of the author's reproducer slightly modified. The code is wrong since at least 9.2, but I did not look at the exact point of origin. This problem has been found by debugging a cluster where the WAL segment including the redo segment was missing due to an operator error, leading to a crash, based on an investigation in v15. Requesting archive recovery with the creation of a recovery.signal or a standby.signal even without a backup_label would mitigate the issue: if the record cannot be found in pg_wal/, the missing segment can be retrieved with a restore_command when checking that the redo record exists. This was already the case without this commit, where recovery would re-fetch the WAL segment that includes the redo record. The check introduced by this commit makes the segment to be retrieved earlier to make sure that the redo record can be found. On HEAD, the code will be slightly changed in a follow-up commit to not rely on a PANIC, to include a test able to emulate the original problem. This is a minimal backpatchable fix, kept separated for clarity. Reported-by: Andres Freund <andres@anarazel.de> Analyzed-by: Andres Freund <andres@anarazel.de> Author: Nitin Jadhav <nitinjadhavpostgres@gmail.com> Discussion: https://postgr.es/m/20231023232145.cmqe73stvivsmlhs@awork3.anarazel.de Discussion: https://postgr.es/m/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m=daGqiOuVdizYWYaA@mail.gmail.com Backpatch-through: 14	2025-12-16 13:29:28 +09:00
Nathan Bossart	48d4a1423d	Allow passing a pointer to GetNamedDSMSegment()'s init callback. This commit adds a new "void *arg" parameter to GetNamedDSMSegment() that is passed to the initialization callback function. This is useful for reusing an initialization callback function for multiple DSM segments. Author: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/CAN4CZFMjh8TrT9ZhWgjVTzBDkYZi2a84BnZ8bM%2BfLPuq7Cirzg%40mail.gmail.com	2025-12-15 14:27:16 -06:00
Noah Misch	64bf53dd61	Revisit cosmetics of "For inplace update, send nontransactional invalidations." This removes a never-used CacheInvalidateHeapTupleInplace() parameter. It adds README content about inplace update visibility in logical decoding. It rewrites other comments. Back-patch to v18, where commit `243e9b40f1` first appeared. Since this removes a CacheInvalidateHeapTupleInplace() parameter, expect a v18 ".abi-compliance-history" edit to follow. PGXN contains no calls to that function. Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reported-by: Ilyasov Ian <ianilyasov@outlook.com> Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Surya Poondla <s_poondla@apple.com> Discussion: https://postgr.es/m/CA+renyU+LGLvCqS0=fHit-N1J-2=2_mPK97AQxvcfKm+F-DxJA@mail.gmail.com Backpatch-through: 18	2025-12-15 12:19:49 -08:00
Noah Misch	0839fbe400	Correct comments of "Fix data loss at inplace update after heap_update()". This corrects commit `a07e03fd8f`. Reported-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reported-by: Surya Poondla <s_poondla@apple.com> Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Discussion: https://postgr.es/m/CA+renyWCW+_2QvXERBQ+mna6ANwAVXXmHKCA-WzL04bZRsjoBA@mail.gmail.com	2025-12-15 12:19:49 -08:00
Jeff Davis	54c41a6deb	Remove unused single-byte char_is_cased() API. https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com	2025-12-15 10:24:57 -08:00
Jeff Davis	9c8de15969	Use multibyte-aware extraction of pattern prefixes. Previously, like_fixed_prefix() used char-at-a-time logic, which forced it to be too conservative for case-insensitive matching. Introduce like_fixed_prefix_ci(), and use that for case-insensitive pattern prefixes. It uses multibyte and locale-aware logic, along with the new pg_iswcased() API introduced in `630706ced0`. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com	2025-12-15 10:24:47 -08:00
Tom Lane	8191937082	Add offnum range checks to suppress compile warnings with UBSAN. Late-model gcc with -fsanitize=undefined enabled issues warnings about uses of PageGetItemId() when it can't prove that the offsetNumber is > 0. The call sites where this happens are checking that the offnum is <= PageGetMaxOffsetNumber(page), so it seems reasonable to add an explicit check that offnum >= 1 too. While at it, rearrange the code to be less contorted and avoid duplicate checks on PageGetMaxOffsetNumber. Maybe the compiler would optimize away the duplicate logic or maybe not, but the existing coding has little to recommend it anyway. There are multiple instances of this identical coding pattern in heapam.c and heapam_xlog.c. Current gcc only complains about two of them, but I fixed them all in the name of consistency. Potentially this could be back-patched in the name of silencing warnings; but I think enabling UBSAN is mainly something people would do on HEAD, so for now it seems not worth the trouble. Discussion: https://postgr.es/m/1699806.1765746897@sss.pgh.pa.us	2025-12-15 12:40:09 -05:00
Heikki Linnakangas	ecb553ae82	Improve sanity checks on multixid members length In the server, check explicitly for multixids with zero members. We used to have an assertion for it, but commit `d4b7bde418` replaced it with more extensive runtime checks, but it missed the original case of zero members. In the upgrade code, a negative length never makes sense, so better check for it explicitly. Commit `d4b7bde418` added a similar sanity check to the corresponding server code on master, and in backbranches, the 'length' is passed to palloc which would fail with "invalid memory alloc request size" error. Clarify the comments on what kind of invalid entries are tolerated by the upgrade code and which ones are reported as fatal errors. Coverity complained about 'length' in the upgrade code being tainted. That's bogus because we trust the data on disk at least to some extent, but hopefully this will silence the complaint. If not, I'll dismiss it manually. Discussion: https://www.postgresql.org/message-id/7b505284-c6e9-4c80-a7ee-816493170abc@iki.fi	2025-12-15 13:30:17 +02:00
David Rowley	cd83ed9a91	Fix typo in tablecmds.c Author: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAEoWx2%3DAib%2BcatZn6wHKmz0BWe8-q10NAhpxu8wUDT19SddKNA%40mail.gmail.com	2025-12-15 17:17:04 +13:00
Amit Kapila	0d2d4a0ec3	Add retry logic to pg_sync_replication_slots(). Previously, pg_sync_replication_slots() would finish without synchronizing slots that didn't meet requirements, rather than failing outright. This could leave some failover slots unsynchronized if required catalog rows or WAL segments were missing or at risk of removal, while the standby continued removing needed data. To address this, the function now waits for the primary slot to advance to a position where all required data is available on the standby before completing synchronization. It retries cyclically until all failover slots that existed on the primary at the start of the call are synchronized. Slots created after the function begins are not included. If the standby is promoted during this wait, the function exits gracefully and the temporary slots will be removed. Author: Ajin Cherian <itsajin@gmail.com> Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Yilin Zhang <jiezhilove@126.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAFPTHDZAA%2BgWDntpa5ucqKKba41%3DtXmoXqN3q4rpjO9cdxgQrw%40mail.gmail.com	2025-12-15 02:50:21 +00:00
Michael Paquier	4ba012a8ed	Allow cumulative statistics to read/write auxiliary data from/to disk Cumulative stats kinds gain the capability to write additional per-entry data when flushing the stats at shutdown, and read this data when loading back the stats at startup. This can be fit for example in the case of variable-length data (like normalized query strings), so as it becomes possible to link the shared memory stats entries to data that is stored in a different area, like a DSA segment. Three new optional callbacks are added to PgStat_KindInfo, available to variable-numbered stats kinds: * to_serialized_data: writes auxiliary data for an entry. * from_serialized_data: reads auxiliary data for an entry. * finish: performs actions after read/write/discard operations. This is invoked after processing all the entries of a kind, allowing extensions to close file handles and clean up resources. Stats kinds have the option to store this data in the existing pgstats file, but can as well store it in one or more additional files whose names can be built upon the entry keys. The new serialized callbacks are called once an entry key is read or written from the main stats file. A file descriptor to the main pgstats file is available in the arguments of the callbacks. Author: Sami Imseih <samimseih@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0s9SDOu+Z6veoJCHWk+kDeTktAtC-KY9fQ9Z6BJdDUirQ@mail.gmail.com	2025-12-15 09:40:56 +09:00
Tom Lane	58dad7f349	Update typedefs.list to match what the buildfarm currently reports. The current list from the buildfarm includes quite a few typedef names that it used to miss. The reason is a bit obscure, but it seems likely to have something to do with our recent increased use of palloc_object and palloc_array. In any case, this makes the relevant struct declarations be much more nicely formatted, so I'll take it. Install the current list and re-run pgindent to update affected code. Syncing with the current list also removes some obsolete typedef names and fixes some alphabetization errors. Discussion: https://postgr.es/m/1681301.1765742268@sss.pgh.pa.us	2025-12-14 17:03:53 -05:00
Andres Freund	30df61990c	bufmgr: Add one-entry cache for private refcount The private refcount entry for a buffer is often looked up repeatedly for the same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's worthwhile to have a one-entry cache for that case. With that cache in place, it's worth splitting GetPrivateRefCountEntry() into a small inline portion (for the cache hit case) and an out-of-line helper for the rest. This is helpful for some workloads today, but becomes more important in an upcoming patch that will utilize the private refcount infrastructure to also store whether the buffer is currently locked, as that increases the rate of lookups substantially. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar	2025-12-14 13:09:43 -05:00
Andres Freund	edbaaea0a9	bufmgr: Separate keys for private refcount infrastructure This makes lookups faster, due to allowing auto-vectorized lookups. It is also beneficial for an upcoming patch, independent of auto-vectorization, as the upcoming patch wants to track more information for each pinned buffer, making the existing loop, iterating over an array of PrivateRefCountEntry, more expensive due to increasing its size. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-12-14 13:09:43 -05:00
Alexander Korotkov	b27e48213f	Refactor WaitLSNType enum to use a macro for type count Change WAIT_LSN_TYPE_COUNT from an enum sentinel to a macro definition, in a similar way to IOObject, IOContext, and BackendType enums. Remove explicit enum value assignments well. Author: Xuneng Zhou <xunengzhou@gmail.com>	2025-12-14 17:18:32 +02:00
Alexander Korotkov	c5ae07a90a	Fix usage of palloc() in MERGE/SPLIT PARTITION(s) code `f2e4cc4279` and `4b3d173629` implement ALTER TABLE ... MERGE/SPLIT PARTITION(s) commands. In several places, these commits use palloc(), where we should use palloc_object() and palloc_array(). This commit provides appropriate usage of palloc_object() and palloc_array(). Reported-by: Man Zeng <zengman@halodbtech.com> Discussion: https://postgr.es/m/tencent_3661BB522D5466B33EA33666%40qq.com	2025-12-14 16:10:25 +02:00
Alexander Korotkov	4b3d173629	Implement ALTER TABLE ... SPLIT PARTITION ... command This new DDL command splits a single partition into several partitions. Just like the ALTER TABLE ... MERGE PARTITIONS ... command, new partitions are created using the createPartitionTable() function with the parent partition as the template. This commit comprises a quite naive implementation which works in a single process and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the operations, including the tuple routing. This is why the new DDL command can't be recommended for large, partitioned tables under high load. However, this implementation comes in handy in certain cases, even as it is. Also, it could serve as a foundation for future implementations with less locking and possibly parallelism. Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru Author: Dmitry Koval <d.koval@postgrespro.ru> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com> Co-authored-by: Tender Wang <tndrwang@gmail.com> Co-authored-by: Richard Guo <guofenglinux@gmail.com> Co-authored-by: Dagfinn Ilmari Mannsaker <ilmari@ilmari.org> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Co-authored-by: Jian He <jian.universality@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Zhihong Yu <zyu@yugabyte.com> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Robert Haas <rhaas@postgresql.org> Reviewed-by: Stephane Tachoires <stephane.tachoires@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Daniel Gustafsson <dgustafsson@postgresql.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Noah Misch <noah@leadboat.com>	2025-12-14 13:29:38 +02:00
Alexander Korotkov	f2e4cc4279	Implement ALTER TABLE ... MERGE PARTITIONS ... command This new DDL command merges several partitions into a single partition of the target table. The target partition is created using the new createPartitionTable() function with the parent partition as the template. This commit comprises a quite naive implementation which works in a single process and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the operations, including the tuple routing. This is why this new DDL command can't be recommended for large partitioned tables under a high load. However, this implementation comes in handy in certain cases, even as it is. Also, it could serve as a foundation for future implementations with less locking and possibly parallelism. Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru Author: Dmitry Koval <d.koval@postgrespro.ru> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com> Co-authored-by: Tender Wang <tndrwang@gmail.com> Co-authored-by: Richard Guo <guofenglinux@gmail.com> Co-authored-by: Dagfinn Ilmari Mannsaker <ilmari@ilmari.org> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Co-authored-by: Jian He <jian.universality@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Zhihong Yu <zyu@yugabyte.com> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Robert Haas <rhaas@postgresql.org> Reviewed-by: Stephane Tachoires <stephane.tachoires@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Daniel Gustafsson <dgustafsson@postgresql.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Noah Misch <noah@leadboat.com>	2025-12-14 13:29:17 +02:00
Tom Lane	ef5f559b95	Fix jsonb_object_agg crash after eliminating null-valued pairs. In commit `b61aa76e4` I added an assumption in jsonb_object_agg_finalfn that it'd be okay to apply uniqueifyJsonbObject repeatedly to a JsonbValue. I should have studied that code more closely first, because in skip_nulls mode it removed leading nulls by changing the "pairs" array start pointer. This broke the data structure's invariants in two ways: pairs no longer references a repalloc-able chunk, and the distance from pairs to the end of its array is less than parseState->size. So any subsequent addition of more pairs is at high risk of clobbering memory and/or causing repalloc to crash. Unfortunately, adding more pairs is exactly what will happen when the aggregate is being used as a window function. Fix by rewriting uniqueifyJsonbObject to not do that. The prior coding had little to recommend it anyway. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/ec5e96fb-ee49-4e5f-8a09-3f72b4780538@gmail.com	2025-12-13 16:18:29 -05:00
Peter Eisentraut	abb331da0a	Fix out-of-date comment on makeRangeConstructors We did define 4 functions in `4429f6a9e3`, but in `df73584431` we got rid of the 0- and 1-arg versions. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/CA%2BrenyVQti3iC7LE4UxtQb4ROLYMs6%2Bu-d4LrN5U4idH1Ghx6Q%40mail.gmail.com	2025-12-13 16:56:07 +01:00
Peter Eisentraut	ff30bad7f6	Clarify comment about temporal foreign keys In RI_ConstraintInfo, period_contained_by_oper and period_intersect_oper can take either anyrange or anymultirange. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Discussion: https://www.postgresql.org/message-id/CA%2BrenyWzDth%2BjqLZA2L2Cezs3wE%2BWX-5P8W2EOVx_zfFD%3Daicg%40mail.gmail.com	2025-12-13 16:44:33 +01:00
Álvaro Herrera	630a93799d	Reject opclass options in ON CONFLICT clause It's as pointless as ASC/DESC and NULLS FIRST/LAST are, so reject all of them in the same way. While at it, normalize the others' error messages to have less translatable strings. Add tests for these errors. Noticed while reviewing recent INSERT ON CONFLICT patches. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/202511271516.oiefpvn3z27m@alvherre.pgsql	2025-12-12 14:26:42 +01:00
Peter Eisentraut	493eb0da31	Replace most StaticAssertStmt() with StaticAssertDecl() Similar to commit `75f49221c2`, it is preferable to use StaticAssertDecl() instead of StaticAssertStmt() when possible. Discussion: https://www.postgresql.org/message-id/flat/CA%2BhUKGKvr0x_oGmQTUkx%3DODgSksT2EtgCA6LmGx_jQFG%3DsDUpg%40mail.gmail.com	2025-12-12 10:06:40 +01:00
Heikki Linnakangas	87a350e1f2	Never store 0 as the nextMXact Before this commit, when multixid wraparound happens, MultiXactState->nextMXact goes to 0, which is invalid. All the readers need to deal with that possibility and skip over the 0. That's error-prone and we've missed it a few times in the past. This commit changes the responsibility so that all the writers of MultiXactState->nextMXact skip over the zero already, and readers can trust that it's never 0. We were already doing that for MultiXactState->oldestMultiXactId; none of its writers would set it to 0. ReadMultiXactIdRange() was nevertheless checking for that possibility. For clarity, remove that check. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Maxim Orlov <orlovmg@gmail.com> Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi	2025-12-12 10:47:34 +02:00
Nathan Bossart	b4cbc106a6	Fix some comments. Like commit `123661427b`, these were discovered while reviewing Aleksander Alekseev's proposed changes to pgindent.	2025-12-11 15:13:04 -06:00
Álvaro Herrera	81f72115cf	Fix infer_arbiter_index for partitioned tables The fix for concurrent index operations in `bc32a12e0d` started considering indexes that are not yet marked indisvalid as arbiters for INSERT ON CONFLICT. For partitioned tables, this leads to including indexes that may not exist in partitions, causing a trivially reproducible "invalid arbiter index list" error to be thrown because of failure to match the index. To fix, it suffices to ignore !indisvalid indexes on partitioned tables. There should be no risk that the set of indexes will change for concurrent transactions, because in order for such an index to be marked valid, an ALTER INDEX ATTACH PARTITION must run which requires AccessExclusiveLock. Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/17622f79-117a-4a44-aa8e-0374e53faaf0%40gmail.com	2025-12-11 20:56:37 +01:00
Heikki Linnakangas	b65f1ad9b1	Fix comment on how temp files and subtransactions are handled The comment was accurate a long time ago, but not any more. I failed to update the comment in commit `ab3148b712`.	2025-12-11 15:57:11 +02:00
Heikki Linnakangas	d4b7bde418	Add runtime checks for bogus multixact offsets It's not far-fetched that we'd try to read a multixid with an invalid offset in case of bugs or corruption. Or if you call pg_get_multixact_members() after a crash that left behind invalid but unused multixids. Better to get a somewhat descriptive error message if that happens. Discussion: https://www.postgresql.org/message-id/3624730d-6dae-42bf-9458-76c4c965fb27@iki.fi	2025-12-11 11:18:14 +02:00
Michael Paquier	4f7dacc5b8	Use palloc_object() and palloc_array(), the last change This is the last batch of changes that have been suggested by the author, this part covering the non-trivial changes. Some of the changes suggested have been discarded as they seem to lead to more instructions generated, leaving the parts that can be qualified as in-place replacements. Similar work has been done in `1b105f9472`, `0c3c5c3b06` and `31d3847a37`. Author: David Geier <geidav.pg@gmail.com> Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com	2025-12-11 14:29:12 +09:00
Amit Kapila	1362bc33e0	Enhance slot synchronization API to respect promotion signal. Previously, during a promotion, only the slot synchronization worker was signaled to shut down. The backend executing slot synchronization via the pg_sync_replication_slots() SQL function was not signaled, allowing it to complete its synchronization cycle before exiting. An upcoming patch improves pg_sync_replication_slots() to wait until replication slots are fully persisted before finishing. This behaviour requires the backend to exit promptly if a promotion occurs. This patch ensures that, during promotion, a signal is also sent to the backend running pg_sync_replication_slots(), allowing it to be interrupted and exit immediately. Author: Ajin Cherian <itsajin@gmail.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CAFPTHDZAA%2BgWDntpa5ucqKKba41%3DtXmoXqN3q4rpjO9cdxgQrw%40mail.gmail.com	2025-12-11 03:49:28 +00:00
Peter Geoghegan	e16c6f0247	Clarify why _bt_killitems sorts its items array. Make it clear why _bt_killitems sorts the scan's so->killedItems[] array. Also add an assertion to the _bt_killitems loop (that iterates through this array) to verify it accesses tuples in leaf page order. Follow-up to commit `bfb335df58`. Author: Peter Geoghegan <pg@bowt.ie> Suggested-by: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAGnEboirgArezZDNeFrR8FOGvKF-Xok333s2iVwWi65gZf8MEA@mail.gmail.com	2025-12-10 20:50:47 -05:00
Michael Paquier	06761b6096	Fix allocation formula in llvmjit_expr.c An array of LLVMBasicBlockRef is allocated with the size used for an element being "LLVMBasicBlockRef *" rather than "LLVMBasicBlockRef". LLVMBasicBlockRef is a type that refers to a pointer, so this did not directly cause a problem because both should have the same size, still it is incorrect. This issue has been spotted while reviewing a different patch, and exists since `2a0faed9d7`, so backpatch all the way down. Discussion: https://postgr.es/m/CA+hUKGLngd9cKHtTUuUdEo2eWEgUcZ_EQRbP55MigV2t_zTReg@mail.gmail.com Backpatch-through: 14	2025-12-11 10:25:21 +09:00
Peter Geoghegan	473cb1b951	Fix MULTIXACT_DEBUG builds. Oversight in commit `bd8d9c9b`. Discussion: https://postgr.es/m/CAH2-WzmvwVKZ+0Z=RL_+g_aOku8QxWddDCXmtyLj02y+nYaD0g@mail.gmail.com	2025-12-10 19:31:13 -05:00
Peter Geoghegan	bfb335df58	Return TIDs in desc order during backwards scans. Always return TIDs in descending order when returning groups of TIDs from an nbtree posting list tuple during nbtree backwards scans. This makes backwards scans tend to require fewer buffer hits, since the scan is less likely to repeatedly pin and unpin the same heap page/buffer (we'll get exactly as many buffer hits as we get with a similar forwards scan case). Commit `0d861bbb`, which added nbtree deduplication, originally did things this way to avoid interfering with _bt_killitems's approach to setting LP_DEAD bits on posting list tuples. _bt_killitems makes a soft assumption that it can always iterate through posting lists in ascending TID order, finding corresponding killItems[]/so->currPos.items[] entries in that same order. This worked out because of the prior _bt_readpage backwards scan behavior. If we just changed the backwards scan posting list logic in _bt_readpage, without altering _bt_killitems itself, it would break its soft assumption. Avoid that problem by sorting the so->killedItems[] array at the start of _bt_killitems. That way the order that dead items are saved in from btgettuple can't matter; so->killedItems[] will always be in the same order as so->currPos.items[] in the end. Since so->currPos.items[] is now always in leaf page order, regardless of the scan direction used within _bt_readpage, and since so->killedItems[] is always in that same order, the _bt_killitems loop can continue to make a uniform assumption about everything being in page order. In fact, sorting like this makes the previous soft assumption about item order into a hard invariant. Also deduplicate the so->killedItems[] array after it is sorted. That way there's no risk of the _bt_killitems loop becoming confused by a duplicate dead item/TID. This was possible in cases that involved a scrollable cursor that encountered the same dead TID more than once (within the same leaf page/so->currPos context). This doesn't come up very much in practice, but it seems best to be as consistent as possible about how and when _bt_killitems will LP_DEAD-mark index tuples. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Mircea Cadariu <cadariu.mircea@gmail.com> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=Wut2pKvbW-u3hJ_LXwsYeiXHiW8oN1GfbKPavcGo8Ow@mail.gmail.com	2025-12-10 15:35:30 -05:00
Jeff Davis	630706ced0	Add pg_iswcased(). True if character has multiple case forms. Will be a useful multibyte-aware replacement for char_is_cased(). Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com	2025-12-10 11:56:11 -08:00
Jeff Davis	1e493158d3	Remove char_tolower() API. It's only useful for an ILIKE optimization for the libc provider using a single-byte encoding and a non-C locale, but it creates significant internal complexity. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com	2025-12-10 11:55:59 -08:00
Melanie Plageman	eebec3ca4b	Add comment about keeping PD_ALL_VISIBLE and VM in sync The comment above heap_xlog_visible() about the critical integrity requirement for PD_ALL_VISIBLE and the visibility map should also be in heap_xlog_prune_freeze() where we set PD_ALL_VISIBLE. Oversight in `add323da40` Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com	2025-12-10 11:10:13 -05:00
Melanie Plageman	bd298f54a0	Simplify vacuum visibility assertion Phase I vacuum gives the page a once-over after pruning and freezing to check that the values of all_visible and all_frozen agree with the result of heap_page_is_all_visible(). This is meant to keep the logic in phase I for determining visibility in sync with the logic in phase III. Rewrite the assertion to avoid an Assert(false). Suggested by Andres Freund. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/mhf4vkmh3j57zx7vuxp4jagtdzwhu3573pgfpmnjwqa6i6yj5y%40sy4ymcdtdklo	2025-12-10 11:10:01 -05:00
Heikki Linnakangas	70b4d90439	Fix comment in GetPublicationRelations This function gets the list of relations associated with the publication but the comment said the opposite. Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/CANhcyEV3C_CGBeDtjvKjALDJDMH-Uuc9BWfSd=eck8SCXnE=fQ@mail.gmail.com	2025-12-10 15:33:29 +02:00
Heikki Linnakangas	fa44b8b7fb	Fix some near-bugs related to ResourceOwner function arguments These functions took a ResourceOwner argument, but only checked if it was NULL, and then used CurrentResourceOwner for the actual work. Surely the intention was to use the passed-in resource owner. All current callers passed CurrentResourceOwner or NULL, so this has no consequences at the moment, but it's an accident waiting to happen for future caller and extensions. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://www.postgresql.org/message-id/CAEze2Whnfv8VuRZaohE-Af+GxBA1SNfD_rXfm84Jv-958UCcJA@mail.gmail.com Backpatch-through: 17	2025-12-10 11:43:16 +02:00
Heikki Linnakangas	bae9d2f892	Fix typo in comment Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/CABPTF7V8CbOXGePqrad6EH3Om7DRhNiO3C0rQ-62UuT7RdU-GQ@mail.gmail.com	2025-12-10 01:06:03 +02:00
David Rowley	f275afc997	Fix misleading comment in tuplesort.c A comment in tuplesort.c was claiming that the code was defining INITIAL_MEMTUPSIZE so that it does not exceed ALLOCSET_SEPARATE_THRESHOLD, but the code actually ensures that we purposefully do exceed ALLOCSET_SEPARATE_THRESHOLD for the initial allocation of the tuples array, as per reasons detailed in the commentary of grow_memtuples(). Also, there's not much need to repeat the mention about ALLOCSET_SEPARATE_THRESHOLD in each location where INITIAL_MEMTUPSIZE is used, so remove those comments. Author: ChangAo Chen <cca5507@qq.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Discussion: https://postgr.es/m/tencent_6FA14F85D6B5B5291532D6789E07F4765C08%40qq.com	2025-12-10 12:01:14 +13:00
Michael Paquier	1b105f9472	Use palloc_object() and palloc_array() in backend code The idea is to encourage more the use of these new routines across the tree, as these offer stronger type safety guarantees than palloc(). This batch of changes includes most of the trivial changes suggested by the author for src/backend/. A total of 334 files are updated here. Among these files, 48 of them have their build change slightly; these are caused by line number changes as the new allocation formulas are simpler, shaving around 100 lines of code in total. Similar work has been done in `0c3c5c3b06` and `31d3847a37`. Author: David Geier <geidav.pg@gmail.com> Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com	2025-12-10 07:36:46 +09:00
Masahiko Sawada	ab40db3852	Add started_by column to pg_stat_progress_analyze view. The new column, started_by, indicates the initiator of the analyze ('manual' or 'autovacuum'), helping users and monitoring tools to better understand ANALYZE behavior. Bump catalog version. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Yu Wang <wangyu_runtime@163.com> Discussion: https://postgr.es/m/CAA5RZ0suoicwxFeK_eDkUrzF7s0BVTaE7M%2BehCpYcCk5wiECpw%40mail.gmail.com	2025-12-09 11:23:45 -08:00
Masahiko Sawada	0d78952061	Add mode and started_by columns to pg_stat_progress_vacuum view. The new columns, mode and started_by, indicate the vacuum mode ('normal', 'aggressive', or 'failsafe') and the initiator of the vacuum ('manual', 'autovacuum', or 'autovacuum_wraparound'), respectively. This allows users and monitoring tools to better understand VACUUM behavior. Bump catalog version. Author: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Yu Wang <wangyu_runtime@163.com> Discussion: https://postgr.es/m/CAOzEurQcOY-OBL_ouEVfEaFqe_md3vB5pXjR_m6L71Dcp1JKCQ@mail.gmail.com	2025-12-09 10:51:14 -08:00
Heikki Linnakangas	3cb5808bd1	Add wait event for the group commit delay before WAL flush Author: Rafia Sabih <rafia.pghackers@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://www.postgresql.org/message-id/CA%2BFpmFf-hWXtrC0Q3Cr_Xo78zuP_M_VC5xgWPOYOkwqOD0T8eg@mail.gmail.com	2025-12-09 17:06:40 +02:00
Heikki Linnakangas	bd8d9c9bdf	Widen MultiXactOffset to 64 bits This eliminates MultiXactOffset wraparound and the 2^32 limit on the total number of multixid members. Multixids are still limited to 2^31, but this is a nice improvement because 'members' can grow much faster than the number of multixids. On such systems, you can now run longer before hitting hard limits or triggering anti-wraparound vacuums. Not having to deal with MultiXactOffset wraparound also simplifies the code and removes some gnarly corner cases. We no longer need to perform emergency anti-wraparound freezing because of running out of 'members' space, so the offset stop limit is gone. But you might still not want 'members' to consume huge amounts of disk space. For that reason, I kept the logic for lowering vacuum's multixid freezing cutoff if a large amount of 'members' space is used. The thresholds for that are roughly the same as the "safe" and "danger" thresholds used before, 2 billion transactions and 4 billion transactions. This keeps the behavior for the freeze cutoff roughly the same as before. It might make sense to make this smarter or configurable, now that the threshold is only needed to manage disk usage, but that's left for the future. Add code to pg_upgrade to convert multitransactions from the old to the new format, rewriting the pg_multixact SLRU files. Because pg_upgrade now rewrites the files, we can get rid of some hacks we had put in place to deal with old bugs and upgraded clusters. Bump catalog version for the pg_multixact/offsets format change. Author: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Discussion: https://www.postgresql.org/message-id/CACG%3DezaWg7_nt-8ey4aKv2w9LcuLthHknwCawmBgEeTnJrJTcw@mail.gmail.com	2025-12-09 13:53:03 +02:00
Heikki Linnakangas	bb3b1c4f64	Move pg_multixact SLRU page format definitions to a separate header This makes them accessible from pg_upgrade, needed by the next commit. I'm doing this mechanical move as a separate commit to make the next commit's changes to these definitions more obvious. Author: Maxim Orlov <orlovmg@gmail.com> Discussion: https://www.postgresql.org/message-id/CACG%3DezbZo_3_fnx%3DS5BfepwRftzrpJ%2B7WET4EkTU6wnjDTsnjg@mail.gmail.com	2025-12-09 13:45:01 +02:00
Richard Guo	f00484c170	Fix distinctness check for queries with grouping sets query_is_distinct_for() is intended to determine whether a query never returns duplicates of the specified columns. For queries using grouping sets, if there are no grouping expressions, the query may contain one or more empty grouping sets. The goal is to detect whether there is exactly one empty grouping set, in which case the query would return a single row and thus be distinct. The previous logic in query_is_distinct_for() was incomplete because the check was insufficiently thorough and could return false when it could have returned true. It failed to consider cases where the DISTINCT clause is used on the GROUP BY, in which case duplicate empty grouping sets are removed, leaving only one. It also did not correctly handle all possible structures of GroupingSet nodes that represent a single empty grouping set. To fix, add a check for the groupDistinct flag, and expand the query's groupingSets tree into a flat list, then verify that the expanded list contains only one element. No backpatch as this could result in plan changes. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAMbWs480Z04NtP8-O55uROq2Zego309+h3hhaZhz6ztmgWLEBw@mail.gmail.com	2025-12-09 17:09:27 +09:00
Richard Guo	c925ad30b0	Fix const-simplification for index expressions and predicate Similar to the issue with constraint and statistics expressions fixed in `317c117d6`, index expressions and predicate can also suffer from incorrect reduction of NullTest clauses during const-simplification, due to unfixed varnos and the use of a NULL root. It has been reported that this issue can cause the planner to fail to pick up a partial index that it previously matched successfully. Because we need to cache the const-simplified index expressions and predicate in the relcache entry, we cannot fix the Vars before applying eval_const_expressions. To ensure proper reduction of NullTest clauses, this patch runs eval_const_expressions a second time -- after the Vars have been fixed and with a valid root. It could be argued that the additional call to eval_const_expressions might increase planning time, but I don't think that's a concern. It only runs when index expressions and predicate are present; it is relatively cheap when run on small expression trees (which is typically the case for index expressions and predicate), and it runs on expressions that have already been const-simplified once, making the second pass even cheaper. In return, in cases like the one reported, it allows the planner to match and use partial indexes, which can lead to significant execution-time improvements. Bug: #19007 Reported-by: Bryan Fox <bryfox@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Discussion: https://postgr.es/m/19007-4cc6e252ed8aa54a@postgresql.org	2025-12-09 16:56:26 +09:00
Amit Kapila	04396eacd3	Fix LOCK_TIMEOUT handling in slotsync worker. Previously, the slotsync worker relied on SIGINT for graceful shutdown during promotion. However, SIGINT is also used by the LOCK_TIMEOUT handler to cancel queries. Since the slotsync worker can lock catalog tables while parsing libpq tuples, this overlap caused it to ignore LOCK_TIMEOUT signals and potentially wait indefinitely on locks. This patch replaces the slotsync worker's SIGINT handler with StatementCancelHandler to correctly process query-cancel interrupts. Additionally, the startup process now uses SIGUSR1 to signal the slotsync worker to stop during promotion. The worker exits after detecting that the shared memory flag stopSignaled is set. Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 17, here it was introduced Discussion: https://postgr.es/m/TY4PR01MB169078F33846E9568412D878C94A2A@TY4PR01MB16907.jpnprd01.prod.outlook.com	2025-12-09 07:25:20 +00:00
Peter Eisentraut	2268f2b91b	Remove useless casts in format arguments There were a number of useless casts in format arguments, either where the input to the cast was already in the right type, or seemingly uselessly casting between types instead of just using the right format placeholder to begin with. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org	2025-12-09 07:33:08 +01:00
Peter Eisentraut	907caf5c39	Clean up int64-related format strings Remove some gratuitous uses of INT64_FORMAT. Make use of PRIu64/PRId64 were appropriate, remove unnecessary casts. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org	2025-12-09 07:33:08 +01:00
Peter Eisentraut	2b117bb014	Remove unnecessary casts in printf format arguments (%zu/%zd) Many of these are probably left over from before use of %zu/%zd was portable. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/07fa29f9-42d7-4aac-8834-197918cbbab6%40eisentraut.org	2025-12-09 07:33:08 +01:00
David Rowley	52382feb78	Doc: fix typo in hash index documentation Plus a similar fix to the README. Backpatch as far back as the sgml issue exists. The README issue does exist in v14, but that seems unlikely to harm anyone. Author: David Geier <geidav.pg@gmail.com> Discussion: https://postgr.es/m/ed3db7ea-55b4-4809-86af-81ad3bb2c7d3@gmail.com Backpatch-through: 15	2025-12-09 14:41:30 +13:00
Peter Geoghegan	83a26ba59b	Avoid pointer chasing in _bt_readpage inner loop. Make _bt_readpage pass down the current scan direction to various utility functions within its pstate variable. Also have _bt_readpage work off of a local copy of scan->ignore_killed_tuples within its per-tuple loop (rather than using scan->ignore_killed_tuples directly). Testing has shown that this significantly benefits large range scans, which are naturally able to take full advantage of the pstate.startikey optimization added by commit `8a510275`. Running a pgbench script with a "SELECT abalance FROM pgbench_accounts WHERE aid BETWEEN ..." query shows an increase in transaction throughput of over 5%. There also appears to be a small performance benefit when running pgbench's built-in select-only script. Follow-up to commit `65d6acbc`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-WzmwMwcwKFgaf+mYPwiz3iL4AqpXnwtW_O0vqpWPXRom9Q@mail.gmail.com	2025-12-08 13:48:09 -05:00
Álvaro Herrera	d0d0ba6cf6	Unify some more messages No backpatch here because of message wording changes. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/202512081537.ahw5gwoencou@alvherre.pgsql	2025-12-08 19:25:36 +01:00
Peter Geoghegan	65d6acbc56	Relocate _bt_readpage and related functions. Quite a bit of code within nbtutils.c is only called by _bt_readpage. Move _bt_readpage and all of the nbtutils.c functions it depends on into a new .c file, nbtreadpage.c. Also reorder some of the functions within the new file for clarity. This commit has no functional impact. It is strictly mechanical. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Victor Yegorov <vyegorov@gmail.com> Discussion: https://postgr.es/m/CAH2-WzmwMwcwKFgaf+mYPwiz3iL4AqpXnwtW_O0vqpWPXRom9Q@mail.gmail.com	2025-12-08 13:15:00 -05:00
Álvaro Herrera	502e256f22	Unify error messages No visible changes, just refactor how messages are constructed.	2025-12-08 16:30:52 +01:00
Peter Eisentraut	804046b39a	Use PGAlignedXLogBlock for some code simplification The code in BootStrapXLOG() and in pg_test_fsync.c tried to align WAL buffers in complicated ways. Also, they still used XLOG_BLCKSZ for the alignment, even though that should now be PG_IO_ALIGN_SIZE. This can now be simplified and made more consistent by using PGAlignedXLogBlock, either directly in BootStrapXLOG() and using alignas in pg_test_fsync.c. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/f462a175-b608-44a1-b428-bdf351e914f4%40eisentraut.org	2025-12-08 14:54:32 +01:00
Amit Kapila	006dd4b2e5	Prevent invalidation of newly created replication slots. A race condition could cause a newly created replication slot to become invalidated between WAL reservation and a checkpoint. Previously, if the required WAL was removed, we retried the reservation process. However, the slot could still be invalidated before the retry if the WAL was not yet removed but the checkpoint advanced the redo pointer beyond the slot's intended restart LSN and computed the minimum LSN that needs to be preserved for the slots. The fix is to acquire an exclusive lock on ReplicationSlotAllocationLock during WAL reservation to serialize WAL reservation and checkpoint's minimum restart_lsn computation. This ensures that, if WAL reservation occurs first, the checkpoint waits until restart_lsn is updated before removing WAL. If the checkpoint runs first, subsequent WAL reservations pick a position at or after the latest checkpoint's redo pointer. We can't use the same fix for branch 17 and prior because commit `2090edc6f3` changed to compute to the minimum restart_LSN among slot's at the beginning of checkpoint (or restart point). The fix for 17 and prior branches is under discussion and will be committed separately. Reported-by: suyu.cmj <mengjuan.cmj@alibaba-inc.com> Author: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Backpatch-through: 18 Discussion: https://postgr.es/m/5e045179-236f-4f8f-84f1-0f2566ba784c.mengjuan.cmj@alibaba-inc.com	2025-12-08 05:21:22 +00:00
Michael Paquier	f68597ee77	Improve error messages of input functions for pg_dependencies and pg_ndistinct The error details updated in this commit can be reached in the regression tests. They did not follow the project style, and they should be written them as full sentences. Some of the errors are switched to use an elog(), for cases that involve paths that cannot be reached based on the previous state of the parser processing the input data (array start, object end, etc.). The error messages for these cases use now a more consistent style across the board, with the state of the parser reported for debugging. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Author: Michael Paquier <michael@paquier.xyz> Co-authored-by: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/1353179.1764901790@sss.pgh.pa.us	2025-12-08 10:23:48 +09:00
Tom Lane	005a2907dc	Micro-optimize datatype conversions in datum_to_jsonb_internal. The general case for converting to a JSONB numeric value is to run the source datatype's output function and then numeric_in, but we can do substantially better than that for integer and numeric source values. This patch improves the speed of jsonb_agg by 30% for integer input, and nearly 2X for numeric input. Sadly, the obvious idea of using float4_numeric and float8_numeric to speed up those cases doesn't work: they are actually slower than the generic coerce-via-I/O method, and not by a small amount. They might round off differently than this code has historically done, too. Leave that alone pending possible changes in those functions. We can also do better than the existing code for text/varchar/bpchar source data; this optimization is similar to one that already exists in the json_agg() code. That saves 20% or so for such inputs. Also make a couple of other minor improvements, such as not giving JSONTYPE_CAST its own special case outside the switch when it could perfectly well be handled inside, and not using dubious string hacking to detect infinity and NaN results. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: jian he <jian.universality@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us	2025-12-07 11:54:33 -05:00
Tom Lane	b61aa76e45	Remove fundamentally-redundant processing in jsonb_agg() et al. The various variants of jsonb_agg() operate as follows, for each aggregate input value: 1. Build a JsonbValue tree representation of the input value. 2. Flatten the JsonbValue tree into a Jsonb in on-disk format. 3. Iterate through the Jsonb, building a JsonbValue that is part of the aggregate's state stored in aggcontext, but is otherwise identical to what phase 1 built. This is very slightly less silly than it sounds, because phase 1 involves calling non-JSONB code such as datatype output functions, which are likely to leak memory, and we don't want to leak into the aggcontext. Nonetheless, phases 2 and 3 are accomplishing exactly nothing that is useful if we can make phase 1 put the JsonbValue tree where we need it. We could probably do that with a bunch of MemoryContextSwitchTo's, but what seems more robust is to give pushJsonbValue the responsibility of building the JsonbValue tree in a specified non-current memory context. The previous patch created the infrastructure for that, and this patch simply makes the aggregate functions use it and then rips out phases 2 and 3. For me, this makes jsonb_agg() with a text column as input run about 2X faster than before. It's not yet on par with json_agg(), but this removes a whole lot of the difference. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: jian he <jian.universality@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us	2025-12-07 11:52:22 -05:00
Tom Lane	0986e95161	Revise APIs for pushJsonbValue() and associated routines. Instead of passing "JsonbParseState **" to pushJsonbValue(), pass a pointer to a JsonbInState, which will contain the parseState stack pointer as well as other useful fields. Also, instead of returning a JsonbValue pointer that is often meaningless/ignored, return the top-level JsonbValue pointer in the "result" field of the JsonbInState. This involves a lot of (mostly mechanical) edits, but I think the results are notationally cleaner and easier to understand. Certainly the business with sometimes capturing the result of pushJsonbValue() and sometimes not was bug-prone and incapable of mechanical verification. In the new arrangement, JsonbInState.result remains null until we've completed a valid sequence of pushes, so that an incorrect sequence will result in a null-pointer dereference, not mistaken use of a partial result. However, this isn't simply an exercise in prettier notation. The real reason for doing it is to provide a mechanism whereby pushJsonbValue() can be told to construct the JsonbValue tree in a context that is not CurrentMemoryContext. That happens when a non-null "outcontext" is specified in the JsonbInState. No callers exercise that option in this patch, but the next patch in the series will make use of it. I tried to improve the comments in this area too. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: jian he <jian.universality@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/1060917.1753202222@sss.pgh.pa.us	2025-12-07 11:51:33 -05:00
Tom Lane	6498287696	Handle constant inputs to corr() and related aggregates more precisely. The SQL standard says that corr() and friends should return NULL in the mathematically-undefined case where all the inputs in one of the columns have the same value. We were checking that by seeing if the sums Sxx and Syy were zero, but that approach is very vulnerable to roundoff error: if a sum is close to zero but not exactly that, we'd come out with a pretty silly non-NULL result. Instead, directly track whether the inputs are all equal by remembering the common value in each column. Once we detect that a new input is different from before, represent that by storing NaN for the common value. (An objection to this scheme is that if the inputs are all NaN, we will consider that they were not all equal. But under IEEE float arithmetic rules, one NaN is never equal to another, so this behavior is arguably correct. Moreover it matches what we did before in such cases.) Then, leave the sums at their exact value of zero for as long as we haven't detected different input values. This solution requires the aggregate transition state to contain 8 float values not 6, which is not problematic, and it seems to add less than 1% to the aggregates' runtime, which seems acceptable. While we're here, improve corr()'s final function to cope with overflow/underflow in the final calculation, and to clamp its result to [-1, 1] in case of roundoff error. Although this is arguably a bug fix, it requires a catversion bump due to the change in aggregates' initial states, so it can't be back-patched. Patch written by me, but many of the ideas are due to Dean Rasheed, who also did a deal of testing. Bug: #19340 Reported-by: Oleg Ivanov <o15611@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Co-authored-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/19340-6fb9f6637f562092@postgresql.org	2025-12-06 18:31:26 -05:00
Tom Lane	6dfce8420e	Fix text substring search for non-deterministic collations. Due to an off-by-one error, the code failed to find matches at the end of the haystack. Fix by rewriting the loop. While at it, fix a comment that claimed that the function could find a zero-length match. Such a match could send a caller into an endless loop. However, zero-length matches only make sense with an empty search string, and that case is explicitly excluded by all callers. To make sure it stays that way, add an Assert and a comment. Bug: #19341 Reported-by: Adam Warland <adam.warland@infor.com> Author: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/19341-1d9a22915edfec58@postgresql.org Backpatch-through: 18	2025-12-05 20:10:33 -05:00
Robert Haas	014f9a831a	Don't reset the pathlist of partitioned joinrels. apply_scanjoin_target_to_paths wants to avoid useless work and platform-specific dependencies by throwing away the path list created prior to applying the final scan/join target and constructing a whole new one using the final scan/join target. However, this is only valid when we'll consider all the same strategies after the pathlist reset as before. After resetting the path list, we reconsider Append and MergeAppend paths with the modified target list; therefore, it's only valid for a partitioned relation. However, what the previous coding missed is that it cannot be a partitioned join relation, because that also has paths that are not Append or MergeAppend paths and will not be reconsidered. Thus, before this patch, we'd sometimes choose a partitionwise strategy with a higher total cost than cheapest non-partitionwise strategy, which is not good. We had a surprising number of tests cases that were relying on this bug to work as they did. A big part of the reason for this is that row counts in regression test cases tend to be low, which brings the cost of partitionwise and non-partitionwise strategies very close together, especially for merge joins, where the real and perceived advantages of a partitionwise approach are minimal. In addition, one test case included a row-count-inflating join. In such cases, a partitionwise join can easily be a loser on cost, because the total number of tuples passing through an Append node is much higher than it is with a non-partitionwise strategy. That test case is adjusted by adding additional join clauses to avoid the row count inflation. Although the failure of the planner to choose the lowest-cost path is a bug, we generally do not back-patch fixes of this type, because planning is not an exact science and there is always a possibility that some user will end up with a plan that has a lower estimated cost but actually runs more slowly. Hence, no backpatch here, either. The code change here is exactly what was originally proposed by Ashutosh, but the changes to the comments and test cases have been very heavily rewritten by me, helped along by some very useful advice from Richard Guo. Reported-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Author: Robert Haas <rhaas@postgresql.org> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Arne Roland <arne.roland@malkut.net> Reviewed-by: Richard Guo <guofenglinux@gmail.com> Discussion: http://postgr.es/m/CAExHW5toze58+jL-454J3ty11sqJyU13Sz5rJPQZDmASwZgWiA@mail.gmail.com	2025-12-05 12:00:18 -05:00
Tom Lane	8f1791c618	Fix some cases of indirectly casting away const. Newest versions of gcc are able to detect cases where code implicitly casts away const by assigning the result of strchr() or a similar function applied to a "const char " value to a target variable that's just "char ". This of course creates a hazard of not getting a compiler warning about scribbling on a string one was not supposed to, so fixing up such cases is good. This patch fixes a dozen or so places where we were doing that. Most are trivial additions of "const" to the target variable, since no actually-hazardous change was occurring. There is one place in ecpg.trailer where we were indeed violating the intention of not modifying a string passed in as "const char *". I believe that's harmless not a live bug, but let's fix it by copying the string before modifying it. There is a remaining trouble spot in ecpg/preproc/variable.c, which requires more complex surgery. I've left that out of this commit because I want to study that code a bit more first. We probably will want to back-patch this once compilers that detect this pattern get into wider circulation, but for now I'm just going to apply it to master to see what the buildfarm says. Thanks to Bertrand Drouvot for finding a couple more spots than I had. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/1324889.1764886170@sss.pgh.pa.us	2025-12-05 11:17:23 -05:00
Heikki Linnakangas	4d936c3fff	Fix setting next multixid's offset at offset wraparound In commit `789d65364c`, we started updating the next multixid's offset too when recording a multixid, so that it can always be used to calculate the number of members. I got it wrong at offset wraparound: we need to skip over offset 0. Fix that. Discussion: https://www.postgresql.org/message-id/d9996478-389a-4340-8735-bfad456b313c@iki.fi Backpatch-through: 14	2025-12-05 11:32:38 +02:00
Amit Kapila	5db6a344ab	Rename column slotsync_skip_at to slotsync_last_skip. Commit `76b78721ca` introduced two new columns in pg_stat_replication_slots to improve monitoring of slot synchronization. One of these columns was named slotsync_skip_at, which is inconsistent with the naming convention used for similar columns in other system views. Columns that store timestamps of the most recent event typically use the 'last_' in the column name (e.g., last_autovacuum, checksum_last_failure). Renaming slotsync_skip_at to slotsync_last_skip aligns with this pattern, making the purpose of the column clearer and improving overall consistency across the views. Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Michael Banck <mbanck@gmx.net> Discussion: https://postgr.es/m/20251128091552.GB13635@p46.dedyn.io;lightning.p46.dedyn.io Discussion: https://postgr.es/m/CAE9k0PkhfKrTEAsGz4DjOhEj1nQ+hbQVfvWUxNacD38ibW3a1g@mail.gmail.com	2025-12-05 04:12:55 +00:00
Michael Paquier	7bc88c3d6f	Fix some compiler warnings Some of the buildfarm members with some old gcc versions have been complaining about an always-true test for a NULL pointer caused by a combination of SOFT_ERROR_OCCURRED() and a local ErrorSaveContext variable. These warnings are taken care of by removing SOFT_ERROR_OCCURRED(), switching to a direct variable check, like `56b1e88c80`. Oversights in `e1405aa5e3` and `44eba8f06e`. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1341064.1764895052@sss.pgh.pa.us	2025-12-05 12:30:43 +09:00
Melanie Plageman	904f9f5ea0	Suppress spurious Coverity warning in prune freeze logic Adjust the prune_freeze_setup() parameter types of new_relfrozen_xid and new_relmin_mxid to prevent misleading Coverity analysis. heap_page_prune_and_freeze() compared these values against NULL when passing them to prune_freeze_setup(), causing Coverity to assume they could be NULL and flag a possible null-pointer dereference later, even though it occurs inside a directly related conditional. Reported-by: Coverity Author: Melanie Plageman <melanieplageman@gmail.com>	2025-12-04 18:55:02 -05:00
Nathan Bossart	80f6e2fb4a	Fix key size of PrivateRefCountHash. The key is the first member of PrivateRefCountEntry, which has type Buffer. This commit changes the key size from sizeof(int32) to sizeof(Buffer). This appears to be an oversight in commit `4b4b680c3d`, but it's of no consequence because Buffer has been a signed 32-bit integer for a long time. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/aS77DTpl0fOkIKSZ%40ip-10-97-1-34.eu-west-3.compute.internal	2025-12-04 15:42:18 -06:00
Peter Eisentraut	e158fd4d68	Remove no longer needed casts from Pointer These casts used to be required when Pointer was char , but now it's void (commit `1b2bb5077e`), so they are not needed anymore. Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org	2025-12-04 20:44:52 +01:00
Peter Eisentraut	c6be3daa05	Remove no longer needed casts to Pointer These casts used to be required when Pointer was char , but now it's void (commit `1b2bb5077e`), so they are not needed anymore. Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org	2025-12-04 19:40:08 +01:00
Alexander Korotkov	d6ef8ee3ee	Fix incorrect assertion bound in WaitForLSN() The assertion checking MyProcNumber used MaxBackends as the upper bound, but the procInfos array is allocated with size MaxBackends + NUM_AUXILIARY_PROCS. This inconsistency would cause a false assertion failure if an auxiliary process calls WaitForLSN(). Author: Xuneng Zhou <xunengzhou@gmail.com>	2025-12-04 10:38:12 +02:00
Andres Freund	6c5c393b74	Rename BUFFERPIN wait event class to BUFFER In an upcoming patch more wait events will be added to the wait event class (for buffer locking), making the current name too specific. Alternatively we could introduce a dedicated wait event class for those, but it seems somewhat confusing to have a BUFFERPIN and a BUFFER wait event class. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-12-03 18:38:20 -05:00
Andres Freund	156680055d	bufmgr: Turn BUFFER_LOCK_* into an enum It seems cleaner to use an enum to tie the different values together. It also helps to have a more descriptive type in the argument to various functions. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff	2025-12-03 18:38:20 -05:00
Heikki Linnakangas	789d65364c	Set next multixid's offset when creating a new multixid With this commit, the next multixid's offset will always be set on the offsets page, by the time that a backend might try to read it, so we no longer need the waiting mechanism with the condition variable. In other words, this eliminates "corner case 2" mentioned in the comments. The waiting mechanism was broken in a few scenarios: - When nextMulti was advanced without WAL-logging the next multixid. For example, if a later multixid was already assigned and WAL-logged before the previous one was WAL-logged, and then the server crashed. In that case the next offset would never be set in the offsets SLRU, and a query trying to read it would get stuck waiting for it. Same thing could happen if pg_resetwal was used to forcibly advance nextMulti. - In hot standby mode, a deadlock could happen where one backend waits for the next multixid assignment record, but WAL replay is not advancing because of a recovery conflict with the waiting backend. The old TAP test used carefully placed injection points to exercise the old waiting code, but now that the waiting code is gone, much of the old test is no longer relevant. Rewrite the test to reproduce the IPC/MultixactCreation hang after crash recovery instead, and to verify that previously recorded multixids stay readable. Backpatch to all supported versions. In back-branches, we still need to be able to read WAL that was generated before this fix, so in the back-branches this includes a hack to initialize the next offsets page when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the WAL is not compatible. Author: Andrey Borodin <amborodin@acm.org> Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru Backpatch-through: 14	2025-12-03 19:15:08 +02:00
Nathan Bossart	9b05e2ec08	Use "foo(void)" for definitions of functions with no parameters. Standard practice in PostgreSQL is to use "foo(void)" instead of "foo()", as the latter looks like an "old-style" function declaration. Similar changes were made in commits `cdf4b9aff2`, `0e72b9d440`, `7069dbcc31`, `f1283ed6cc`, `7b66e2c086`, `e95126cf04`, and `9f7c527af3`. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/aTBObQPg%2Bps5I7vl%40ip-10-97-1-34.eu-west-3.compute.internal	2025-12-03 10:54:37 -06:00
Peter Eisentraut	9790affcce	Fix stray references to SubscriptRef This type never existed. SubscriptingRef was meant instead. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/2eaa45e3-efc5-4d75-b082-f8159f51445f%40eisentraut.org	2025-12-03 14:44:14 +01:00
Peter Eisentraut	756a436893	Don't rely on pointer arithmetic with Pointer type The comment for the Pointer type says 'XXX Pointer arithmetic is done with this, so it can't be void * under "true" ANSI compilers.'. This fixes that. Change from Pointer to use char * explicitly where pointer arithmetic is needed. This makes the meaning of the code clearer locally and removes a dependency on the actual definition of the Pointer type. (The definition of the Pointer type is not changed in this commit.) Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org	2025-12-03 09:54:15 +01:00
Peter Eisentraut	623801b3bd	Remove useless casts to Pointer in arguments of memcpy() and memmove() calls Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/4154950a-47ae-4223-bd01-1235cc50e933%40eisentraut.org	2025-12-03 08:40:33 +01:00
Amit Kapila	c252d37d8c	Fix shadow variable warning in subscriptioncmds.c. Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Author: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Vignesh C <vignesh21@gmail.com> Discussion: https://postgr.es/m/CAHut+PsF8R0Bt4J3c92+T2F0mun0rRfK=-GH+iBv2s-O8ahJJw@mail.gmail.com	2025-12-03 03:31:31 +00:00
Nathan Bossart	a6d05c8193	Use LW_SHARED in dsa.c where possible. Both dsa_get_total_size() and dsa_get_total_size_from_handle() take an exclusive lock just to read a variable. This commit reduces the lock level to LW_SHARED in those functions. Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://postgr.es/m/aS8fMzWs9e8iHxk2%40nathan	2025-12-02 16:40:23 -06:00
Heikki Linnakangas	c085aab278	Add a test for half-dead pages in B-tree indexes To increase our test coverage in general, and because I will use this in the next commit to test a bug we currently have in amcheck. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://www.postgresql.org/message-id/33e39552-6a2a-46f3-8b34-3f9f8004451f@garret.ru	2025-12-02 21:11:05 +02:00
Heikki Linnakangas	1e4e5783e7	Add a test for incomplete splits in B-tree indexes To increase our test coverage in general, and because I will add onto this in the next commit to also test amcheck with incomplete splits. This is copied from the similar test we had for GIN indexes. B-tree's incomplete splits work similarly to GIN's, so with small changes, the same test works for B-tree too. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://www.postgresql.org/message-id/abd65090-5336-42cc-b768-2bdd66738404@iki.fi	2025-12-02 21:10:47 +02:00
Nathan Bossart	f894acb24a	Show size of DSAs and dshashes in pg_dsm_registry_allocations. Presently, this view reports NULL for the size of DSAs and dshash tables because 1) the current backend might not be attached to them and 2) the registry doesn't save the pointers to the dsa_area or dshash_table in local memory. Also, the view doesn't show partially-initialized entries to avoid ambiguity, since those entries would report a NULL size as well. This commit introduces a function that looks up the size of a DSA given its handle (transiently attaching to the control segment if needed) and teaches pg_dsm_registry_allocations to use it to show the size of successfully-initialized DSA and dshash entries. Furthermore, the view now reports partially-initialized entries with a NULL size. Reviewed-by: Rahila Syed <rahilasyed90@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/aSeEDeznAsHR1_YF%40nathan	2025-12-02 10:29:45 -06:00
Álvaro Herrera	758479213d	Remove doc and code comments about ON CONFLICT deficiencies They have been fixed, so we don't need this text anymore. This reverts commit `8b18ed6dfb`. Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Discussion: https://postgr.es/m/CADzfLwWo+FV9WSeOah9F1r=4haa6eay1hNvYYy_WfziJeK+aLQ@mail.gmail.com	2025-12-02 16:47:18 +01:00
Álvaro Herrera	5dee7a603f	Avoid use of NOTICE to wait for snapshot invalidation This idea (implemented in commits and `bc32a12e0d` and `9e8fa05d34`) of using notices to detect that a session is sleeping was unreliable, so simplify the concurrency controller session to just look at pg_stat_activity for a process sleeping on the injection point we want it to hit. This change allows us to remove a secondary injection point and the alternative expected output files. Reproduced by Alexander Lakhin following a report in buildfarm member skink (which runs the server under valgrind). Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/3e302c96-cdd2-45ec-af84-03dbcdccde4a@gmail.com	2025-12-02 16:43:27 +01:00
Álvaro Herrera	90eae926ab	Fix ON CONFLICT with REINDEX CONCURRENTLY and partitions When planning queries with ON CONFLICT on partitioned tables, the indexes to consider as arbiters for each partition are determined based on those found in the parent table. However, it's possible for an index on a partition to be reindexed, and in that case, the auxiliary indexes created on the partition must be considered as arbiters as well; failing to do that may result in spurious "duplicate key" errors given sufficient bad luck. We fix that in this commit by matching every index that doesn't have a parent to each initially-determined arbiter index. Every unparented matching index is considered an additional arbiter index. Closely related to the fixes in `bc32a12e0d` and `2bc7e886fc`, and for identical reasons, not backpatched (for now) even though it's a longstanding issue. Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com	2025-12-02 13:51:53 +01:00
Peter Eisentraut	4f941d432b	Remove useless casting to same type This removes some casts where the input already has the same type as the type specified by the cast. Their presence could cause risks of hiding actual type mismatches in the future or silently discarding qualifiers. It also improves readability. Same kind of idea as `7f798aca1d` and `ef8fe69360`. (This does not change all such instances, but only those hand-picked by the author.) Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal	2025-12-02 10:09:32 +01:00
Peter Eisentraut	35988b31db	Simplify hash_xlog_split_allocate_page() Instead of complicated pointer arithmetic, overlay a uint32 array and just access the array members. That's safe thanks to XLogRecGetBlockData() returning a MAXALIGNed buffer. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/aSQy2JawavlVlEB0%40ip-10-97-1-34.eu-west-3.compute.internal	2025-12-02 09:18:54 +01:00
Peter Eisentraut	ec782f56b0	Replace pointer comparisons and assignments to literal zero with NULL While 0 is technically correct, NULL is the semantically appropriate choice for pointers. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/aS1AYnZmuRZ8g%2B5G%40ip-10-97-1-34.eu-west-3.compute.internal	2025-12-02 08:39:24 +01:00
Michael Paquier	713d9a847e	Update some timestamp[tz] functions to use soft-error reporting This commit updates two functions that convert "timestamptz" to "timestamp", and vice-versa, to use the soft error reporting rather than a their own logic to do the same. These are now named as follows: - timestamp2timestamptz_safe() - timestamptz2timestamp_safe() These functions were suffixed with "_opt_overflow", previously. This shaves some code, as it is possible to detect how a timestamp[tz] overflowed based on the returned value rather than a custom state. It is optionally possible for the callers of these functions to rely on the error generated internally by these functions, depending on the error context. Similar work has been done in `d03668ea05` and `4246a977ba`. Reviewed-by: Amul Sul <sulamul@gmail.com> Discussion: https://postgr.es/m/aS09YF2GmVXjAxbJ@paquier.xyz	2025-12-02 09:30:23 +09:00
Jeff Davis	19b966243c	Make regex "max_chr" depend on encoding, not provider. The regex mechanism scans through the first "max_chr" character values to cache character property ranges (isalpha, etc.). For single-byte encodings, there's no sense in scanning beyond UCHAR_MAX; but for UTF-8 it makes sense to cache higher code point values (though not all of them; only up to MAX_SIMPLE_CHR). Prior to `5a38104b36`, the logic about how many character values to scan was based on the pg_regex_strategy, which was dependent on the provider. Commit `5a38104b36` preserved that logic exactly, allowing different providers to define the "max_chr". Now, change it to depend only on the encoding and whether ctype_is_c. For this specific calculation, distinguishing between providers creates more complexity than it's worth. Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com Reviewed-by: Chao Li <li.evan.chao@gmail.com>	2025-12-01 11:06:17 -08:00
Jeff Davis	99cd8890be	Change some callers to use pg_ascii_toupper(). The input is ASCII anyway, so it's better to be clear that it's not locale-dependent. Discussion: https://postgr.es/m/450ceb6260cad30d7afdf155d991a9caafee7c0d.camel@j-davis.com	2025-12-01 09:24:03 -08:00
Álvaro Herrera	2bc7e886fc	Fix ON CONFLICT ON CONSTRAINT during REINDEX CONCURRENTLY When REINDEX CONCURRENTLY is processing the index that supports a constraint, there are periods during which multiple indexes match the constraint index's definition. Those must all be included in the set of inferred index for INSERT ON CONFLICT, in order to avoid spurious "duplicate key" errors. To fix, we set things up to match all indexes against attributes, expressions and predicates of the constraint index, then return all indexes that match those, rather than just the one constraint index. This is more onerous than before, where we would just test the named constraint for validity, but it's not more onerous than processing "conventional" inference (where a list of attribute names etc is given). This is closely related to the misbehaviors fixed by `bc32a12e0d`, for a different situation. We're not backpatching this one for now either, for the same reasons. Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/CANtu0ojXmqjmEzp-=aJSxjsdE76iAsRgHBoK0QtYHimb_mEfsg@mail.gmail.com	2025-12-01 17:34:13 +01:00
Peter Eisentraut	2fcc5a7151	Fix a strict aliasing violation This one is almost a textbook example of an aliasing violation, and it is straightforward to fix, so clean it up. (The warning only shows up if you remove the -fno-strict-aliasing option.) Also, move the code after the error checking. Doesn't make a difference technically, but it seems strange to do actions before errors are checked. Reported-by: Tatsuo Ishii <ishii@postgresql.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/20240724.155525.366150353176322967.ishii%40postgresql.org	2025-12-01 16:41:08 +01:00
Michael Paquier	a87987cafc	Move WAL sequence code into its own file This split exists for most of the other RMGRs, and makes cleaner the separation between the WAL code, the redo code and the record description code (already in its own file) when it comes to the sequence RMGR. The redo and masking routines are moved to a new file, sequence_xlog.c. All the RMGR routines are now located in a new header, sequence_xlog.h. This separation is useful for a different patch related to sequences that I have been working on, where it makes a refactoring of sequence.c easier if its RMGR routines and its core routines are split. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Discussion: https://postgr.es/m/aSfTxIWjiXkTKh1E@paquier.xyz	2025-12-01 16:21:41 +09:00
Michael Paquier	d03668ea05	Switch some date/timestamp functions to use the soft error reporting This commit changes some functions related to the data types date and timestamp to use the soft error reporting rather than a custom boolean flag called "overflow", used to let the callers of these functions know if an overflow happens. This results in the removal of some boilerplate code, as it is possible to rely on an error context rather than a custom state, with the possibility to use the error generated inside the functions updated here, if necessary. These functions were suffixed with "_opt_overflow". They are now renamed to use "_safe" as suffix. This work is similar to `4246a977ba`. Author: Amul Sul <sulamul@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CAAJ_b95HEmFyzHZfsdPquSHeswcopk8MCG1Q_vn4tVkZ+xxofw@mail.gmail.com	2025-12-01 15:22:20 +09:00
David Rowley	5424f4da90	Don't call simplify_aggref with a NULL PlannerInfo `42473b3b3` added prosupport infrastructure to allow simplification of Aggrefs during constant-folding. In some cases the context->root that's given to eval_const_expressions_mutator() can be NULL. `42473b3b3` failed to take that into account, which could result in a crash. To fix, add a check and only call simplify_aggref() when the PlannerInfo is set. Author: David Rowley <dgrowleyml@gmail.com> Reported-by: Birler, Altan <altan.birler@tum.de> Discussion: https://postgr.es/m/132d4da23b844d5ab9e352d34096eab5@tum.de	2025-11-30 12:55:34 +13:00

... 8 9 10 11 12 ...

28616 commits